Array operation query with filtering results in InvalidOperationError: cannot reshape empty array into shape (-1, 2) #18598

Closed
2 tasks done
ryzhakar opened this issue Sep 6, 2024 · 9 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

ryzhakar commented Sep 6, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

UPD: the latest, much clearer understanding of the issue is described here: message.

You can still track how the understanding developed, starting under the spoiler below.

import numpy as np
import polars as pl

# Try changing   _____ to 750, which works fine.
#                vvv
NUMBER_OF_ROWS = 751
VECTOR_LENGTH = 1_000

def handrolled_cosine_similarity(vec_a, vec_b):
    return _dot_product(vec_a, vec_b) / _magnitude_product(vec_a, vec_b)

def _dot_product(vec_a, vec_b):
    return (vec_a * vec_b).arr.sum()

def _magnitude_product(vec_a, vec_b):
    return _magnitude(vec_a) * _magnitude(vec_b)
    
def _magnitude(vec):
    return (
        vec.cast(pl.List(inner=pl.Float32))
        .list.eval(pl.element().pow(2))
        .list.sum().sqrt()
    )

def spawn_dataframe(rows_number: int, vector_width: int):
    return pl.LazyFrame(
        {
            'idx': np.arange(rows_number),
            'vec': np.random.uniform(size=(rows_number, vector_width)),
        },
    )

def autosimilarity(df: pl.LazyFrame):
    return (
        df.join(df, how="cross")
        .filter(pl.col('idx') < pl.col('idx_right'))
        .with_columns(
            handrolled_cosine_similarity(
                pl.col('vec'),
                pl.col('vec_right')
            ).alias('similarity')
        )
        # Streaming option ____ triggers failure
        #                  vvvv
        .collect(streaming=True)
        .tail(1)
    )

print(autosimilarity(spawn_dataframe(NUMBER_OF_ROWS, VECTOR_LENGTH)))

Log output

❯❯❯ POLARS_VERBOSE=1 poetry run python .issue.py
RUN STREAMING PIPELINE
[df -> callback -> filter -> hstack -> ordered_sink, df -> cross_join_sink]
Traceback (most recent call last):
  File "/Users/ryzhakar/similarity-sandbox/.issue.py", line 49, in <module>
    print(autosimilarity(spawn_dataframe(NUMBER_OF_ROWS, VECTOR_LENGTH)))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ryzhakar/similarity-sandbox/.issue.py", line 45, in autosimilarity
    .collect(streaming=True)
     ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ryzhakar/Library/Caches/pypoetry/virtualenvs/similarity-sandbox-DdCBGatn-py3.11/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 2034, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.InvalidOperationError: cannot reshape empty array into shape (-1, 1000)

Issue description

Non-streaming API (streaming=False) works fine.
Streaming API with 750 rows works fine as well.
Streaming API with >750 rows breaks even with very short arrays.

Apart from this, I arrived at this weird code because other APIs like dot-product .dot() broke as well.
But that's a story for another time.
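
For reference, the hand-rolled expression in the example is meant to compute the usual cosine similarity, dot(a, b) / (|a| * |b|). The sketch below is a plain-Python version that can be used to sanity-check individual rows of the output. The helper name cosine_similarity_reference is made up for illustration and is not part of Polars.

```python
import math
from typing import Sequence


def cosine_similarity_reference(vec_a: Sequence[float], vec_b: Sequence[float]) -> float:
    """Plain-Python cosine similarity (hypothetical helper, not a Polars API)."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    # Dot product of the two vectors.
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    # Euclidean magnitudes; a zero vector would cause a ZeroDivisionError below.
    magnitude_a = math.sqrt(sum(a * a for a in vec_a))
    magnitude_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (magnitude_a * magnitude_b)


# Parallel vectors are maximally similar, orthogonal vectors are not similar at all.
print(cosine_similarity_reference([3.0, 4.0], [3.0, 4.0]))
print(cosine_similarity_reference([1.0, 0.0], [0.0, 1.0]))
```

Comparing one or two rows of the Polars result against this helper is only a rough check; the Polars version casts to Float32 in _magnitude, so small rounding differences are to be expected.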

Expected behavior

❯❯❯ POLARS_VERBOSE=1 poetry run python .issue.py
RUN STREAMING PIPELINE
[df -> callback -> filter -> hstack -> ordered_sink, df -> cross_join_sink]
shape: (1, 5)
┌─────┬───────────────────────┬───────────┬───────────────────────┬────────────┐
│ idx ┆ vec                   ┆ idx_right ┆ vec_right             ┆ similarity │
│ --- ┆ ---                   ┆ ---       ┆ ---                   ┆ ---        │
│ i64 ┆ array[f64, 1000]      ┆ i64       ┆ array[f64, 1000]      ┆ f64        │
╞═════╪═══════════════════════╪═══════════╪═══════════════════════╪════════════╡
│ 749 ┆ [0.919319, 0.257407,  ┆ 750       ┆ [0.554524, 0.207891,  ┆ 0.745843   │
│     ┆ … 0.16321…            ┆           ┆ … 0.43446…            ┆            │
└─────┴───────────────────────┴───────────┴───────────────────────┴────────────┘

Installed versions

Polars:              1.6.0
Index type:          UInt32
Platform:            macOS-14.6.1-arm64-arm-64bit
Python:              3.11.9 (main, Sep  2 2024, 18:32:37) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.6.1
gevent               <not installed>
great_tables         <not installed>
matplotlib           <not installed>
nest_asyncio         <not installed>
numpy                2.1.0
openpyxl             <not installed>
pandas               <not installed>
pyarrow              <not installed>
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                2.4.0
xlsx2csv             <not installed>
xlsxwriter           <not installed>

@ryzhakar ryzhakar added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Sep 6, 2024
@ryzhakar
Author

Chiming in to say it wasn't magically fixed in v1.7.0.
The issue persists.

@ryzhakar
Author

Here's a potentially helpful observation.
The issue with array dimensionality of (-1, x), which makes no sense and seems like a bug to me, can generally be reproduced whenever the dataframe was eagerly loaded at some point and THEN transformed into a lazy one.
That is, whenever the lazy frame is not read from disk incrementally.

Today I stumbled upon a stacktrace, which points to these lines in code:

  File "/Users/ryzhakar/similarity-sandbox/.venv/lib/python3.11/site-packages/polars/dataframe/frame.py", line 8968, in select
    return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ryzhakar/similarity-sandbox/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 2034, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.InvalidOperationError: cannot reshape empty array into shape (-1, 384)

Note the self.lazy().select().collect().

@cmdlineluser
Contributor

I can reproduce this.

Not sure if it is useful, but 672 is the minimum number of rows where it fails for me.

import numpy as np
import polars as pl

NUMBER_OF_ROWS = 672
VECTOR_LENGTH = 2

df = pl.LazyFrame({
    'idx': np.arange(NUMBER_OF_ROWS),
    'vec': np.random.uniform(size=(NUMBER_OF_ROWS, VECTOR_LENGTH))
})

(
    df.join(df, how="cross")
      .filter(pl.col('idx') < pl.col('idx_right'))
      .with_columns(
          (pl.col('vec') * pl.col('vec_right')).arr.sum()
      )
      .collect(streaming=True)
)
# InvalidOperationError: cannot reshape empty array into shape (-1, 2)

I was trying to remove the filter, but noticed it still raises with just .filter(False) instead.

@ryzhakar
Author

@cmdlineluser thank you for confirming!

Your example reproduces only from 751 rows upward on my machine, the same threshold as in the original example. This makes me think it is a memory-related (paging?) issue.

Now, another important thing is that the issue persists without streaming.

import numpy as np
import polars as pl

NUMBER_OF_ROWS = 751
# This number    ^^^  might be
# correlated with hardware.
# Increase it if you cannot reproduce the error.
VECTOR_LENGTH = 2

df = pl.LazyFrame({
    'idx': np.arange(NUMBER_OF_ROWS),
    'vec': np.random.uniform(size=(NUMBER_OF_ROWS, VECTOR_LENGTH))
})

(
    df.join(df, how="cross")
      .filter(False)
      .with_columns(
          (pl.col('vec') * pl.col('vec_right')).arr.sum()
      )
      .collect()
)
# InvalidOperationError: cannot reshape empty array into shape (-1, 2)

Perhaps I should clarify the issue title.

@ryzhakar ryzhakar changed the title Lazy frame casted from numpy fails on streaming, cannot reshape empty array into shape ( -1, 1000) Lazy frame casted from numpy fails on collection, cannot reshape empty array into shape ( -1, 2) Sep 12, 2024
@ryzhakar
Author

Okay, this has become silly very fast 👀

This eager single-array example fails in 1.7.1.
Apparently, a combination of any falsey filtering predicate and an allocating array operation is enough.

import polars as pl
df = pl.DataFrame({'v': [[0., 0.]]}, schema={'v': pl.Array(pl.Float32, 2)})
print(df)

print(
    df.filter(False)
    .select(pl.col('v') + pl.col('v'))
)
shape: (1, 1)
┌───────────────┐
│ v             │
│ ---           │
│ array[f32, 2] │
╞═══════════════╡
│ [0.0, 0.0]    │
└───────────────┘
Traceback (most recent call last):
  File "/Users/ryzhakar/similarity-sandbox/throwaway_scripts/.issue3.py", line 16, in <module>
    .select(pl.col('v') + pl.col('v'))
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ryzhakar/similarity-sandbox/.venv/lib/python3.11/site-packages/polars/dataframe/frame.py", line 8968, in select
    return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ryzhakar/similarity-sandbox/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 2034, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.InvalidOperationError: cannot reshape empty array into shape (-1, 2)

@cmdlineluser would you try to reproduce this?
I want to know for sure before renaming the issue again.

@cmdlineluser
Contributor

cmdlineluser commented Sep 12, 2024

Yep - reproduces for me on 1.7.1 also.

I wonder if it is actually a different issue though, because the filter produces 0 rows.

The original problem still had rows, so I may have reduced it down too much.

@ryzhakar ryzhakar changed the title Lazy frame casted from numpy fails on collection, cannot reshape empty array into shape ( -1, 2) Array operation query with filtering results in InvalidOperationError: cannot reshape empty array into shape (-1, 2) Sep 12, 2024
@ryzhakar
Author

ryzhakar commented Sep 12, 2024

DISCLAIMER: I don't really know what I'm talking about.

Got a hunch though.

@cmdlineluser It turns out that we were both right and wrong.
Yes, we did over-reduce the original problem and the minimal cases worked okay with rows to return after filtering.
But no, the issue is not different.

My theory is that the minimal case with no rows to return actually happens in the lazy streaming case.
I don't know how it works for sure, but I've read in some docstring that the streaming engine processes stuff in batches.
What if one of the batches happens to be rejected entirely because of the filter predicate? Would this be a subcase/supercase of the "no rows to return" case?

To test this out, I adapted some of our cases into this.

import numpy as np
import polars as pl

TOTAL_NUMBER_OF_ROWS = 1_500_000
IMPLIED_BATCH_LENGTH = 125_000
# This number          ^^^^^^^
# may be correlated with hardware.
# I was looking for it with binary search :(
VECTOR_LENGTH = 2

df = pl.LazyFrame(
    {
        'i': np.arange(TOTAL_NUMBER_OF_ROWS),
        'v': np.random.uniform(size=(TOTAL_NUMBER_OF_ROWS, VECTOR_LENGTH)),
    },
)
# Note on precedence: the first predicate parses as ((i % IMPLIED_BATCH_LENGTH) + 1) == 0.
# The left-hand side is always >= 1, so every row (and therefore every batch) is rejected.
predicate_w_rejected_batch = pl.col('i') % IMPLIED_BATCH_LENGTH + 1 == 0
predicate_true_once_every_batch = pl.col('i') % IMPLIED_BATCH_LENGTH == 0
allocating_array_operation = pl.col('v') * pl.col('v')

print(
    df
    .filter(predicate_true_once_every_batch)
    .with_columns(allocating_array_operation)
    .collect(streaming=True)
)
# shape: (12, 2)
# ┌─────────┬──────────────────────┐
# │ i       ┆ v                    │
# │ ---     ┆ ---                  │
# │ i64     ┆ array[f64, 2]        │
# ╞═════════╪══════════════════════╡
# │ 0       ┆ [0.158439, 0.996373] │
# │ …       ┆ …                    │
# │ 1375000 ┆ [0.315458, 0.859306] │
# └─────────┴──────────────────────┘
print(
    df
    .filter(predicate_w_rejected_batch)
    .with_columns(allocating_array_operation)
    .collect(streaming=True)
)
# InvalidOperationError: cannot reshape empty array into shape (-1, 2)

So if there's at least one row to return from the batch, it works. But as soon as there is a single "empty", completely filtered-out batch, it fails in the same way the "eager no rows to return" case does.

Too bad the batch size is a bit tricky to pinpoint.
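
The batch theory above can be modelled in plain Python without Polars. The sketch below assumes fixed-size, contiguous batches, which is a simplification: the real streaming engine's batch sizes are not known here. The helper name fully_rejected_batches and the scaled-down constants are made up for illustration. It reports which batches would come out of the filter completely empty for a given predicate.

```python
from typing import Callable, List


def fully_rejected_batches(
    total_rows: int,
    batch_length: int,
    predicate: Callable[[int], bool],
) -> List[int]:
    """Return the start index of every batch in which no row passes the predicate."""
    rejected = []
    for start in range(0, total_rows, batch_length):
        stop = min(start + batch_length, total_rows)
        if not any(predicate(i) for i in range(start, stop)):
            rejected.append(start)
    return rejected


# Scaled-down stand-ins for the constants in the thread (assumed values).
TOTAL_ROWS = 1_500
BATCH_LENGTH = 125


def true_once_every_batch(i: int) -> bool:
    return i % BATCH_LENGTH == 0


def never_true(i: int) -> bool:
    # Same shape as predicate_w_rejected_batch: parses as ((i % B) + 1) == 0.
    return i % BATCH_LENGTH + 1 == 0


def rejects_first_batch_only(i: int) -> bool:
    return i >= BATCH_LENGTH


print(fully_rejected_batches(TOTAL_ROWS, BATCH_LENGTH, true_once_every_batch))     # []
print(len(fully_rejected_batches(TOTAL_ROWS, BATCH_LENGTH, never_true)))           # 12
print(fully_rejected_batches(TOTAL_ROWS, BATCH_LENGTH, rejects_first_batch_only))  # [0]
```

Under this model the never_true predicate empties all 12 batches rather than a single one, because (i % B) + 1 can never be 0. A predicate like rejects_first_batch_only would be a closer match for the "exactly one empty batch" scenario, provided the assumed batch length matches what the engine actually uses.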

@ryzhakar
Author

TL;DR: if there's a completely filtered-out batch with no rows to return AND an allocating array operation, Polars raises a cryptic InvalidOperationError: cannot reshape empty array into shape (-1, x) error.

@coastalwhite
Collaborator

This seems to be fixed by #18940.
