
pl.read_ipc is up to 10x slower when not using pyarrow #19635

Open
2 tasks done
legendre6891 opened this issue Nov 5, 2024 · 7 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@legendre6891

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

For some feather files, I notice that pl.read_ipc(..., use_pyarrow=True) is 10x faster than the native reader (pl.read_ipc(..., memory_map=False)).

I cannot share the file, but this is a repro with dummy data:

Script to create the dummy file:

import polars as pl
import sys
import random

words = dir(sys)

# 1000 columns of 30000 random strings each
A = [random.choices(words, k=30000) for _ in range(1000)]

ds = pl.DataFrame(A)
ds.write_ipc("A.feather", compression="zstd")

Then we see that using pyarrow is at least 10 times faster:

>>> import polars as pl
>>> %time ds = pl.read_ipc("A.feather", memory_map=False);
CPU times: user 969 ms, sys: 154 ms, total: 1.12 s
Wall time: 1.13 s

>>> %time ds = pl.read_ipc("A.feather", use_pyarrow=True);
CPU times: user 626 ms, sys: 88.5 ms, total: 715 ms
Wall time: 97.1 ms

Log output

No response

Issue description

For feather files with a lot of string data and null values, use_pyarrow=True is significantly faster than the native reader.

Expected behavior

Native feather reading performance should be competitive with pyarrow.

Installed versions

--------Version info---------
Polars:              1.12.0
Index type:          UInt32
Platform:            macOS-14.3-arm64-arm-64bit
Python:              3.11.10 (main, Oct 16 2024, 08:56:36) [Clang 18.1.8 ]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         <not installed>
numpy                2.1.3
openpyxl             <not installed>
pandas               <not installed>
pyarrow              18.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@legendre6891 legendre6891 added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Nov 5, 2024
@legendre6891 legendre6891 changed the title pl.read_ipc is up to 10x slower when not use pyarrow pl.read_ipc is up to 10x slower when not using pyarrow Nov 5, 2024
@ritchie46
Member

This seems wrong. I think we are accidentally quadratic.

@ritchie46
Member

I cannot reproduce; if anything, Polars is faster for me. I am on Linux. Can others confirm?

@alexander-beedie
Collaborator

alexander-beedie commented Nov 5, 2024

> I cannot reproduce; if anything, Polars is faster for me. I am on Linux. Can others confirm?

@ritchie46: Yup, on the latest release build (1.12) I also see us being slower here:
(Test machine: Apple Silicon M3 Max)

%time ds = pl.read_ipc("A.feather", memory_map=False);
CPU times: user 820 ms, sys: 135 ms, total: 955 ms
Wall time: 986 ms

%time ds = pl.read_ipc("A.feather", use_pyarrow=True);
CPU times: user 504 ms, sys: 81.8 ms, total: 586 ms
Wall time: 53.1 ms

Looking at the "version info" output given above (Platform: macOS-14.3-arm64-arm-64bit) and my own results, this could potentially be Mac-specific if you're not seeing it on Linux? 🤔

@nameexhaustion
Collaborator

nameexhaustion commented Nov 5, 2024

I think one factor is that we are single-threaded. I think pyarrow also used to be single-threaded, but recent versions have added parallelism.

# MacOS
nxs@mt-md-nxs polars % USE_PYARROW=0 zsh -c 'for _ in {1..5}; do time python .env/y.py 2>/dev/null >/dev/null; done'
python .env/y.py 2> /dev/null > /dev/null  0.90s user 0.15s system 92% cpu 1.123 total
python .env/y.py 2> /dev/null > /dev/null  0.93s user 0.15s system 95% cpu 1.133 total
python .env/y.py 2> /dev/null > /dev/null  0.95s user 0.13s system 94% cpu 1.140 total
python .env/y.py 2> /dev/null > /dev/null  0.95s user 0.14s system 94% cpu 1.146 total
python .env/y.py 2> /dev/null > /dev/null  0.89s user 0.13s system 95% cpu 1.072 total
nxs@mt-md-nxs polars % USE_PYARROW=1 zsh -c 'for _ in {1..5}; do time python .env/y.py 2>/dev/null >/dev/null; done'
python .env/y.py 2> /dev/null > /dev/null  0.71s user 0.16s system 211% cpu 0.408 total
python .env/y.py 2> /dev/null > /dev/null  0.71s user 0.14s system 281% cpu 0.301 total
python .env/y.py 2> /dev/null > /dev/null  0.71s user 0.14s system 283% cpu 0.298 total
python .env/y.py 2> /dev/null > /dev/null  0.71s user 0.14s system 282% cpu 0.299 total
python .env/y.py 2> /dev/null > /dev/null  0.71s user 0.14s system 286% cpu 0.296 total
# Linux
nxs@ubuntu:~/git/polars$ USE_PYARROW=0 zsh -c 'for _ in {1..5}; do time python .env/y.py 2>/dev/null >/dev/null; done'
python .env/y.py 2> /dev/null > /dev/null  0.88s user 0.08s system 100% cpu 0.956 total
python .env/y.py 2> /dev/null > /dev/null  0.88s user 0.11s system 99% cpu 0.985 total
python .env/y.py 2> /dev/null > /dev/null  0.85s user 0.07s system 100% cpu 0.922 total
python .env/y.py 2> /dev/null > /dev/null  0.86s user 0.06s system 100% cpu 0.919 total
python .env/y.py 2> /dev/null > /dev/null  0.87s user 0.07s system 100% cpu 0.935 total
nxs@ubuntu:~/git/polars$ USE_PYARROW=1 zsh -c 'for _ in {1..5}; do time python .env/y.py 2>/dev/null >/dev/null; done'
python .env/y.py 2> /dev/null > /dev/null  0.61s user 0.14s system 269% cpu 0.278 total
python .env/y.py 2> /dev/null > /dev/null  0.59s user 0.08s system 325% cpu 0.206 total
python .env/y.py 2> /dev/null > /dev/null  0.60s user 0.08s system 336% cpu 0.202 total
python .env/y.py 2> /dev/null > /dev/null  0.59s user 0.09s system 330% cpu 0.205 total
python .env/y.py 2> /dev/null > /dev/null  0.59s user 0.08s system 322% cpu 0.207 total

Test script

import polars as pl
import os

path = ".env/_A.feather"

use_pyarrow = os.environ["USE_PYARROW"] == "1"
print(f"{use_pyarrow = }")

pl.read_ipc(path, memory_map=False, use_pyarrow=use_pyarrow)

Versions

polars-1.12.0
pyarrow-18.0.0

@coastalwhite
Collaborator

#19454 adds all the infrastructure for parallel decoding. Might be worth porting it to the in-memory engine after that is merged?

@ritchie46
Member

> #19454 adds all the infrastructure for parallel decoding. Might be worth porting it to the in-memory engine after that is merged?

Yes, maybe we can even dispatch to the streaming engine in that node. It will just be a source -> in-memory sink, so I think building the graph is negligible there.

@kevinli1993

Hi, a friendly follow-up to see if we could move forward on this issue now that #19454 is merged? Much thanks.
