pl.read_ipc is up to 10x slower when not using pyarrow #19635

Comments
This seems wrong. I think we are accidentally quadratic.
I cannot reproduce; if anything, using Polars is faster for me. I am on Linux. Can others confirm?
@ritchie46: Yup, on the latest release build (1.12) I also see us being slower here:
Looking at the "version info" output given above (…)
I think one factor is that we are single-threaded. I think pyarrow also used to be single-threaded, but recent versions have added parallelism.

```
# MacOS
nxs@mt-md-nxs polars % USE_PYARROW=0 zsh -c 'for _ in {1..5}; do time python .env/y.py 2>/dev/null >/dev/null; done'
python .env/y.py 2> /dev/null > /dev/null  0.90s user 0.15s system 92% cpu 1.123 total
python .env/y.py 2> /dev/null > /dev/null  0.93s user 0.15s system 95% cpu 1.133 total
python .env/y.py 2> /dev/null > /dev/null  0.95s user 0.13s system 94% cpu 1.140 total
python .env/y.py 2> /dev/null > /dev/null  0.95s user 0.14s system 94% cpu 1.146 total
python .env/y.py 2> /dev/null > /dev/null  0.89s user 0.13s system 95% cpu 1.072 total
nxs@mt-md-nxs polars % USE_PYARROW=1 zsh -c 'for _ in {1..5}; do time python .env/y.py 2>/dev/null >/dev/null; done'
python .env/y.py 2> /dev/null > /dev/null  0.71s user 0.16s system 211% cpu 0.408 total
python .env/y.py 2> /dev/null > /dev/null  0.71s user 0.14s system 281% cpu 0.301 total
python .env/y.py 2> /dev/null > /dev/null  0.71s user 0.14s system 283% cpu 0.298 total
python .env/y.py 2> /dev/null > /dev/null  0.71s user 0.14s system 282% cpu 0.299 total
python .env/y.py 2> /dev/null > /dev/null  0.71s user 0.14s system 286% cpu 0.296 total

# Linux
nxs@ubuntu:~/git/polars$ USE_PYARROW=0 zsh -c 'for _ in {1..5}; do time python .env/y.py 2>/dev/null >/dev/null; done'
python .env/y.py 2> /dev/null > /dev/null  0.88s user 0.08s system 100% cpu 0.956 total
python .env/y.py 2> /dev/null > /dev/null  0.88s user 0.11s system 99% cpu 0.985 total
python .env/y.py 2> /dev/null > /dev/null  0.85s user 0.07s system 100% cpu 0.922 total
python .env/y.py 2> /dev/null > /dev/null  0.86s user 0.06s system 100% cpu 0.919 total
python .env/y.py 2> /dev/null > /dev/null  0.87s user 0.07s system 100% cpu 0.935 total
nxs@ubuntu:~/git/polars$ USE_PYARROW=1 zsh -c 'for _ in {1..5}; do time python .env/y.py 2>/dev/null >/dev/null; done'
python .env/y.py 2> /dev/null > /dev/null  0.61s user 0.14s system 269% cpu 0.278 total
python .env/y.py 2> /dev/null > /dev/null  0.59s user 0.08s system 325% cpu 0.206 total
python .env/y.py 2> /dev/null > /dev/null  0.60s user 0.08s system 336% cpu 0.202 total
python .env/y.py 2> /dev/null > /dev/null  0.59s user 0.09s system 330% cpu 0.205 total
python .env/y.py 2> /dev/null > /dev/null  0.59s user 0.08s system 322% cpu 0.207 total
```

Test script:

```python
import os

import polars as pl

path = ".env/_A.feather"
use_pyarrow = os.environ["USE_PYARROW"] == "1"
print(f"{use_pyarrow = }")
pl.read_ipc(path, memory_map=False, use_pyarrow=use_pyarrow)
```

Versions
#19454 adds all the infrastructure for parallel decoding. Might be worth porting it to the in-memory engine after that is merged?
Yes, maybe we can even dispatch to the streaming engine in that node. It will just be a …
Hi, a friendly follow-up to see if we could move forward on this issue, now that #19454 is merged? Much thanks.
Checks
Reproducible example
For some feather files, I notice that `pl.read_ipc(..., use_pyarrow=True)` is 10x faster than the native reader (`use_pyarrow=False`).

I cannot share the file, but this is a repro with dummy data:
Script to create the dummy file:
Then we see that using pyarrow is at least 10× faster.
Log output
No response
Issue description
For feather files with a lot of string data and null values, `use_pyarrow=True` is significantly faster than the native reader.

Expected behavior
Native feather reading performance should be competitive with pyarrow.
Installed versions