
pl.read_ipc is up to 10x slower when not using pyarrow #19635

Open
2 tasks done
legendre6891 opened this issue Nov 5, 2024 · 7 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@legendre6891

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

For some feather files, I notice that pl.read_ipc(..., use_pyarrow=True) is 10x faster than the native reader (pl.read_ipc(..., memory_map=False)).

I cannot share the file, but this is a repro with dummy data:

Script to create the dummy file:

import polars as pl
import sys
import random

words = dir(sys)

# 1000 columns of 30000 random strings each
A = [random.choices(words, k=30000) for _ in range(1000)]

ds = pl.DataFrame(A)
ds.write_ipc("A.feather", compression="zstd")

Then we see that using pyarrow is at least 10 times faster:

>>> import polars as pl
>>> %time ds = pl.read_ipc("A.feather", memory_map=False);
CPU times: user 969 ms, sys: 154 ms, total: 1.12 s
Wall time: 1.13 s

>>> %time ds = pl.read_ipc("A.feather", use_pyarrow=True);
CPU times: user 626 ms, sys: 88.5 ms, total: 715 ms
Wall time: 97.1 ms

Log output

No response

Issue description

For feather files with a lot of string data and null values, use_pyarrow=True is significantly faster than the native reader.

Expected behavior

Native feather reading performance should be competitive with pyarrow.

Installed versions

--------Version info---------
Polars:              1.12.0
Index type:          UInt32
Platform:            macOS-14.3-arm64-arm-64bit
Python:              3.11.10 (main, Oct 16 2024, 08:56:36) [Clang 18.1.8 ]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         <not installed>
numpy                2.1.3
openpyxl             <not installed>
pandas               <not installed>
pyarrow              18.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@legendre6891 legendre6891 added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Nov 5, 2024
@legendre6891 legendre6891 changed the title pl.read_ipc is up to 10x slower when not use pyarrow pl.read_ipc is up to 10x slower when not using pyarrow Nov 5, 2024
@ritchie46
Member

This seems wrong. I think we are accidentally quadratic.

@ritchie46
Member

I cannot reproduce; if anything, Polars is faster for me. I am on Linux. Can others confirm?

@alexander-beedie
Collaborator

alexander-beedie commented Nov 5, 2024

> I cannot reproduce; if anything, Polars is faster for me. I am on Linux. Can others confirm?

@ritchie46: Yup, on the latest release build (1.12) I also see us being slower here:
(Test machine: Apple Silicon M3 Max)

%time ds = pl.read_ipc("A.feather", memory_map=False);
CPU times: user 820 ms, sys: 135 ms, total: 955 ms
Wall time: 986 ms

%time ds = pl.read_ipc("A.feather", use_pyarrow=True);
CPU times: user 504 ms, sys: 81.8 ms, total: 586 ms
Wall time: 53.1 ms

Looking at the "version info" output given above (Platform: macOS-14.3-arm64-arm-64bit) and my own results, this could potentially be Mac-specific if you're not seeing it on Linux? 🤔

@nameexhaustion
Collaborator

nameexhaustion commented Nov 5, 2024

I think one factor is that we are single-threaded. I think pyarrow also used to be single-threaded, but recent versions have added parallelism.

# MacOS
nxs@mt-md-nxs polars % USE_PYARROW=0 zsh -c 'for _ in {1..5}; do time python .env/y.py 2>/dev/null >/dev/null; done'
python .env/y.py 2> /dev/null > /dev/null  0.90s user 0.15s system 92% cpu 1.123 total
python .env/y.py 2> /dev/null > /dev/null  0.93s user 0.15s system 95% cpu 1.133 total
python .env/y.py 2> /dev/null > /dev/null  0.95s user 0.13s system 94% cpu 1.140 total
python .env/y.py 2> /dev/null > /dev/null  0.95s user 0.14s system 94% cpu 1.146 total
python .env/y.py 2> /dev/null > /dev/null  0.89s user 0.13s system 95% cpu 1.072 total
nxs@mt-md-nxs polars % USE_PYARROW=1 zsh -c 'for _ in {1..5}; do time python .env/y.py 2>/dev/null >/dev/null; done'
python .env/y.py 2> /dev/null > /dev/null  0.71s user 0.16s system 211% cpu 0.408 total
python .env/y.py 2> /dev/null > /dev/null  0.71s user 0.14s system 281% cpu 0.301 total
python .env/y.py 2> /dev/null > /dev/null  0.71s user 0.14s system 283% cpu 0.298 total
python .env/y.py 2> /dev/null > /dev/null  0.71s user 0.14s system 282% cpu 0.299 total
python .env/y.py 2> /dev/null > /dev/null  0.71s user 0.14s system 286% cpu 0.296 total
# Linux
nxs@ubuntu:~/git/polars$ USE_PYARROW=0 zsh -c 'for _ in {1..5}; do time python .env/y.py 2>/dev/null >/dev/null; done'
python .env/y.py 2> /dev/null > /dev/null  0.88s user 0.08s system 100% cpu 0.956 total
python .env/y.py 2> /dev/null > /dev/null  0.88s user 0.11s system 99% cpu 0.985 total
python .env/y.py 2> /dev/null > /dev/null  0.85s user 0.07s system 100% cpu 0.922 total
python .env/y.py 2> /dev/null > /dev/null  0.86s user 0.06s system 100% cpu 0.919 total
python .env/y.py 2> /dev/null > /dev/null  0.87s user 0.07s system 100% cpu 0.935 total
nxs@ubuntu:~/git/polars$ USE_PYARROW=1 zsh -c 'for _ in {1..5}; do time python .env/y.py 2>/dev/null >/dev/null; done'
python .env/y.py 2> /dev/null > /dev/null  0.61s user 0.14s system 269% cpu 0.278 total
python .env/y.py 2> /dev/null > /dev/null  0.59s user 0.08s system 325% cpu 0.206 total
python .env/y.py 2> /dev/null > /dev/null  0.60s user 0.08s system 336% cpu 0.202 total
python .env/y.py 2> /dev/null > /dev/null  0.59s user 0.09s system 330% cpu 0.205 total
python .env/y.py 2> /dev/null > /dev/null  0.59s user 0.08s system 322% cpu 0.207 total

Test script

import polars as pl
import os

path = ".env/_A.feather"

use_pyarrow = os.environ["USE_PYARROW"] == "1"
print(f"{use_pyarrow = }")

pl.read_ipc(path, memory_map=False, use_pyarrow=use_pyarrow)

Versions

polars-1.12.0
pyarrow-18.0.0

@coastalwhite
Collaborator

#19454 adds all the infrastructure for parallel decoding. Might be worth porting it to the in-memory engine after that is merged?

@ritchie46
Member

> #19454 adds all the infrastructure for parallel decoding. Might be worth porting it to the in-memory engine after that is merged?

Yes, maybe we can even dispatch to the streaming engine in that node. It will just be a source -> in-memory sink, so I think building the graph is negligible there.

@kevinli1993

Hi, a friendly follow-up to see if we could move forward on this issue now that #19454 is merged? Much thanks.
