Support for datasets in cloud object stores #9124
@ritchie46 @stinodego what are your thoughts?

Since this isn't a Python-specific feature (like pandas or pyarrow interop), I'm in favor of having that logic available in Rust. While py-polars is the most complete Polars implementation, there are quite a few other ones (R, Node.js, Ruby, ...) to consider. Adding this to only Python further increases the feature disparity between the languages.

Yes, added this to the list of cons.
Sorry if this is the wrong place to ask (please correct me if necessary). What's the current status of support for reading/writing to object stores? Judging by the number of open (and potentially duplicate?) issues, e.g. #800, #5959, #6177 and #6178, I'm guessing this is still not (fully) supported? Is there currently a workaround? I am mainly going to use Polars in Python, so I'm mostly interested in that. What seems to work for me is the following:

```python
import fsspec
import polars as pl
import pyarrow.dataset as ds


def lazy_load_dataset(file_uri: str, format: str = "parquet") -> pl.LazyFrame:
    fs = fsspec.filesystem("s3")
    dataset = ds.dataset(file_uri, filesystem=fs, format=format)
    return pl.scan_pyarrow_dataset(dataset)
```

Is this currently the best way to lazily load a Parquet dataset? Thanks a lot in advance for the help!
This issue is a general one on how the functionality should be implemented.
This is what we must do. This is 100% within the goals of Polars. In the meantime we could add what @chitralverma proposes as well. Later we can remove it after a grace period.
alright then, let me start the work on a PR for the Rust side of things
Problem description
I have an approach that allows cloud-based datasets with Polars without bloating the Rust side of things or dealing with the async incompatibilities between Polars and object_store.
Pseudo-code:

- `scan_parquet`, `scan_avro` etc. first check the path via its URL scheme.
- If the scheme is `file://` or `None`, the regular flow is used (the mmap readers). This ensures that no behaviour changes happen for local files.
- If the scheme is `s3://`, `gs://`, `az://` or `hdfs://`, then instead of going over to the Rust side of things, we internally use pyarrow datasets (which support various file systems via `fsspec`) together with `scan_pyarrow_dataset`, which allows projection and predicate pushdowns.

Pros:
Cons:
Alternatively:
We can do this completely on the Rust side by extending the current object_store implementation in Polars beyond just the Parquet path and blocking the async calls on the current thread. This would then require no changes on the Python side of things.
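A conceptual sketch of what "blocking the async calls on the current thread" means, written in Python with `asyncio` standing in for Rust's object_store futures. The function names here are hypothetical illustrations, not Polars or object_store APIs:

```python
import asyncio


async def get_object(path: str) -> bytes:
    # Hypothetical stand-in for an async object-store GET request.
    await asyncio.sleep(0)
    return b"parquet bytes for " + path.encode()


def get_object_blocking(path: str) -> bytes:
    # Drive the async call to completion on the current thread, so the
    # synchronous reader API never exposes async to its callers.
    return asyncio.run(get_object(path))


data = get_object_blocking("s3://bucket/key.parquet")
```

The Rust equivalent would block on the object_store future with a runtime handle instead of `asyncio.run`, but the shape is the same: the async boundary stays internal to the reader.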
There is some `object_store` stuff that's touching `polars-core`; maybe all this should be part of `polars-io`.
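The URL-scheme dispatch from the pseudo-code above could be sketched as follows. `resolve_scan_path` is a hypothetical helper for illustration; the scheme names mirror the ones listed in the proposal:

```python
from urllib.parse import urlparse

# Schemes the proposal routes through pyarrow datasets / fsspec.
CLOUD_SCHEMES = {"s3", "gs", "az", "hdfs"}


def resolve_scan_path(path: str) -> str:
    """Return which code path a scan_* function would take for `path`."""
    scheme = urlparse(path).scheme
    if scheme in ("", "file"):
        # Local file or no scheme: keep the regular (mmap-based) readers,
        # so behaviour for local files is unchanged.
        return "local"
    if scheme in CLOUD_SCHEMES:
        # Cloud object store: hand off to pyarrow datasets via fsspec.
        return "pyarrow_dataset"
    raise ValueError(f"unsupported URL scheme: {scheme!r}")


print(resolve_scan_path("data/file.parquet"))        # local
print(resolve_scan_path("s3://bucket/key.parquet"))  # pyarrow_dataset
```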