Array operation query with filtering results in InvalidOperationError: cannot reshape empty array into shape (-1, 2)
#18598
Comments
Chiming in to say it wasn't magically fixed in |
Here's a potentially helpful observation. Today I stumbled upon a stacktrace, which points to these lines in code:
File "/Users/ryzhakar/similarity-sandbox/.venv/lib/python3.11/site-packages/polars/dataframe/frame.py", line 8968, in select
return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ryzhakar/similarity-sandbox/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 2034, in collect
return wrap_df(ldf.collect(callback))
^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.InvalidOperationError: cannot reshape empty array into shape (-1, 384)
Note the |
I can reproduce this. Not sure if it is useful, but
import numpy as np
import polars as pl
NUMBER_OF_ROWS = 672
VECTOR_LENGTH = 2
df = pl.LazyFrame({
'idx': np.arange(NUMBER_OF_ROWS),
'vec': np.random.uniform(size=(NUMBER_OF_ROWS, VECTOR_LENGTH))
})
(
df.join(df, how="cross")
.filter(pl.col('idx') < pl.col('idx_right'))
.with_columns(
(pl.col('vec') * pl.col('vec_right')).arr.sum()
)
.collect(streaming=True)
)
# InvalidOperationError: cannot reshape empty array into shape (-1, 2)
I was trying to remove the filter, but noticed it still raises with just |
@cmdlineluser thank you for confirming! On my machine, your example reproduces only with 751 or more rows, the same threshold as the original example, which makes me think this is a memory-related (paging?) issue. Another important point: the issue persists without streaming.
import numpy as np
import polars as pl
NUMBER_OF_ROWS = 751
# This number ^^^ might be
# correlated with hardware.
# Increase it if you cannot reproduce the error.
VECTOR_LENGTH = 2
df = pl.LazyFrame({
'idx': np.arange(NUMBER_OF_ROWS),
'vec': np.random.uniform(size=(NUMBER_OF_ROWS, VECTOR_LENGTH))
})
(
df.join(df, how="cross")
.filter(False)
.with_columns(
(pl.col('vec') * pl.col('vec_right')).arr.sum()
)
.collect()
)
# InvalidOperationError: cannot reshape empty array into shape (-1, 2)
Perhaps I should clarify the issue title. |
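One way to pin down a hardware-dependent threshold like the 751 rows mentioned above is to bisect over the row count. The sketch below is plain Python and makes two assumptions: that failure is monotonic in the number of rows (once a size fails, every larger size fails too), and that `fails(n)` wraps a real reproducer. Both `smallest_failing_size` and `fake_reproducer` are hypothetical names used only for illustration; `fake_reproducer` merely mimics the observed threshold.

```python
from typing import Callable


def smallest_failing_size(fails: Callable[[int], bool], lo: int, hi: int) -> int:
    """Return the smallest n in (lo, hi] for which fails(n) is True.

    Assumes fails(lo) is False, fails(hi) is True, and that failure is
    monotonic in n (once it fails, it keeps failing for larger n).
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid  # the threshold is at mid or below
        else:
            lo = mid  # the threshold is above mid
    return hi


# Hypothetical stand-in for "build the frame with n rows, run the query,
# and report whether InvalidOperationError was raised".
def fake_reproducer(n: int) -> bool:
    return n >= 751


threshold = smallest_failing_size(fake_reproducer, lo=1, hi=100_000)
print(threshold)
```

In practice `fake_reproducer` would be replaced by a function that runs the query inside a `try`/`except` and returns whether it raised.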
cannot reshape empty array into shape ( -1, 1000)
cannot reshape empty array into shape ( -1, 2)
Okay, this has become silly very fast 👀 This eager single-array example fails in
import polars as pl
df = pl.DataFrame({'v': [[0., 0.]]}, schema={'v': pl.Array(pl.Float32, 2)})
print(df)
print(
df.filter(False)
.select(pl.col('v') + pl.col('v'))
)
shape: (1, 1)
┌───────────────┐
│ v │
│ --- │
│ array[f32, 2] │
╞═══════════════╡
│ [0.0, 0.0] │
└───────────────┘
Traceback (most recent call last):
File "/Users/ryzhakar/similarity-sandbox/throwaway_scripts/.issue3.py", line 16, in <module>
.select(pl.col('v') + pl.col('v'))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ryzhakar/similarity-sandbox/.venv/lib/python3.11/site-packages/polars/dataframe/frame.py", line 8968, in select
return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ryzhakar/similarity-sandbox/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 2034, in collect
return wrap_df(ldf.collect(callback))
^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.InvalidOperationError: cannot reshape empty array into shape (-1, 2)
@cmdlineluser would you try to reproduce this? |
Yep - reproduces for me on 1.7.1 also. I wonder if it is actually a different issue though, because the filter produces 0 rows. The original problem still had rows, so I may have reduced it down too much. |
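For context on why the message is surprising: the requested shape is not ill-defined for an empty input, since zero values divide evenly into zero rows of width 2. A minimal sketch of that arithmetic in plain Python follows; `infer_rows` is a hypothetical helper written for this illustration, not a Polars function, and it says nothing about how Polars implements reshaping internally.

```python
def infer_rows(n_values: int, width: int) -> int:
    """Resolve the -1 in a (-1, width) reshape of a flat buffer of n_values."""
    if width <= 0:
        raise ValueError("width must be positive")
    if n_values % width != 0:
        raise ValueError(f"cannot reshape {n_values} values into (-1, {width})")
    return n_values // width


# One row of width 2 before the filter, no rows after `filter(False)`.
rows_before_filter = infer_rows(2, 2)
rows_after_filter = infer_rows(0, 2)
print(rows_before_filter, rows_after_filter)
```

Under this reading, the zero-row case would be expected to produce an empty result with the same array width rather than an error.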
cannot reshape empty array into shape ( -1, 2)
InvalidOperationError: cannot reshape empty array into shape (-1, 2)
DISCLAIMER: I don't really know what I'm talking about. Got a hunch though. @cmdlineluser It turns out that we were both right and wrong. My theory is that the minimal case with no rows to return actually happens in the lazy streaming case. To test this out, I adapted some of our cases into this.
import numpy as np
import polars as pl
TOTAL_NUMBER_OF_ROWS = 1_500_000
IMPLIED_BATCH_LENGTH = 125_000
# This number ^^^^^^^
# may be correlated with hardware.
# I was looking for it with binary search :(
VECTOR_LENGTH = 2
df = pl.LazyFrame(
{
'i': np.arange(TOTAL_NUMBER_OF_ROWS),
'v': np.random.uniform(size=(TOTAL_NUMBER_OF_ROWS, VECTOR_LENGTH)),
},
)
# One full batch of ids will not meet the modulo expectation.
# All of these rows will result in `false` according to predicate.
predicate_w_rejected_batch = pl.col('i') % IMPLIED_BATCH_LENGTH + 1 == 0
predicate_true_once_every_batch = pl.col('i') % IMPLIED_BATCH_LENGTH == 0
allocating_array_operation = pl.col('v') * pl.col('v')
print(
df
.filter(predicate_true_once_every_batch)
.with_columns(allocating_array_operation)
.collect(streaming=True)
)
# shape: (12, 2)
# ┌─────────┬──────────────────────┐
# │ i ┆ v │
# │ --- ┆ --- │
# │ i64 ┆ array[f64, 2] │
# ╞═════════╪══════════════════════╡
# │ 0 ┆ [0.158439, 0.996373] │
# │ … ┆ … │
# │ 1375000 ┆ [0.315458, 0.859306] │
# └─────────┴──────────────────────┘
print(
df
.filter(predicate_w_rejected_batch)
.with_columns(allocating_array_operation)
.collect(streaming=True)
)
# InvalidOperationError: cannot reshape empty array into shape (-1, 2)
So if there's at least one row to return from the batch – it works. But as soon as there is a single "empty", fully filtered-out batch – it fails in the same way the "eager no rows to return" case does. Too bad the batch size is a bit tricky to pinpoint. |
TLDR; if there's a completely filtered out batch with no rows to return AND an allocating array operation – polars raises a cryptic |
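The batch reasoning can be modelled without Polars. The sketch below assumes the streaming engine processes contiguous batches of a fixed length, which is the hypothesis in the comment above and not a confirmed detail of the engine. The sizes are scaled down from 1_500_000 and 125_000 to keep it fast, and `rows_kept_per_batch` is a hypothetical helper. It counts how many rows survive each batch for the two predicates used in the reproducer.

```python
TOTAL_ROWS = 1_200   # scaled down from 1_500_000 to keep the sketch fast
BATCH_LENGTH = 100   # scaled down from 125_000; an assumed batch size


def rows_kept_per_batch(predicate, total=TOTAL_ROWS, batch=BATCH_LENGTH):
    """Count the rows that survive the predicate in each contiguous batch."""
    return [
        sum(1 for i in range(start, min(start + batch, total)) if predicate(i))
        for start in range(0, total, batch)
    ]


# Mirrors `pl.col('i') % IMPLIED_BATCH_LENGTH == 0`.
true_once_per_batch = rows_kept_per_batch(lambda i: i % BATCH_LENGTH == 0)
# Mirrors `pl.col('i') % IMPLIED_BATCH_LENGTH + 1 == 0`.
rejects_everything = rows_kept_per_batch(lambda i: i % BATCH_LENGTH + 1 == 0)

print(true_once_per_batch)
print(rejects_everything)
```

Note that `i % N + 1 == 0` is never true for non-negative `i`, so in this model the second predicate empties every batch, not just one. The first predicate keeps exactly one row per batch, which matches the 12-row result shown in the reproducer.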
This seems to be fixed by #18940. |
Checks
Reproducible example
UPD: the latest, much clearer understanding of the issue is described here: message.
You can still track how the understanding developed, starting under the spoiler below.
Log output
Issue description
Non-streaming API (streaming=False) works fine.
Streaming API with 750 rows works fine as well.
Streaming API with >750 rows breaks even with very short arrays.
Apart from this, I arrived at this weird code because other APIs like dot-product .dot() broke as well. But that's a story for another time.
Expected behavior
Installed versions