
Support for datasets in cloud object stores #9124

Open
chitralverma opened this issue May 30, 2023 · 7 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@chitralverma
Contributor

chitralverma commented May 30, 2023

Problem description

I have an approach to support cloud-based datasets in polars without bloating the Rust side or having to deal with the async compatibility issues between polars and object_store.

Pseudo-code:

  • in the scan_* methods like scan_parquet, scan_avro, etc., first check the path's URL scheme
  • if the scheme is file:// or absent, the regular flow applies (using the mmap readers). This ensures no behaviour changes for local files.
  • if the scheme is a supported one like s3://, gs://, az:// or hdfs://, then instead of crossing to the Rust side, we internally build a pyarrow dataset (which supports various file systems) and pass it to scan_pyarrow_dataset, which allows projection and predicate pushdown.
  • if the scheme is not supported, we can either error out or fall back to fsspec.
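The dispatch described above could be sketched as follows. This is a minimal illustration only: resolve_scan_backend, CLOUD_SCHEMES, and the returned labels are hypothetical names, not polars API.

```python
from urllib.parse import urlparse

# Schemes the proposal would route through pyarrow datasets (hypothetical set).
CLOUD_SCHEMES = {"s3", "gs", "az", "hdfs"}

def resolve_scan_backend(path: str) -> str:
    """Decide which scan path a URI should take (hypothetical helper)."""
    scheme = urlparse(path).scheme
    if scheme in ("", "file"):
        # Local files keep the existing mmap-based readers: no behaviour change.
        return "local"
    if scheme in CLOUD_SCHEMES:
        # Cloud URIs go through pyarrow datasets -> scan_pyarrow_dataset,
        # which supports projection and predicate pushdown.
        return "pyarrow"
    # Unknown schemes: either raise here or fall back to fsspec.
    return "fsspec"
```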

Pros:

  • no behaviour changes for local files,
  • directory (and partitioned) datasets can be supported,
  • cloud URLs can be supported,
  • no changes are required on the Rust side,
  • the object_store code can be removed from the Rust side entirely.

Cons:

  • this leads to an API inconsistency: Rust users cannot work with datasets in object stores, which increases the feature disparity between languages,
  • other possible gaps, such as no support for cloud URLs in SQL.

Alternatively:
We can do this completely on the Rust side by extending the current object_store implementation in polars beyond just the Parquet path and blocking on the async calls on the current thread. This would then require no changes on the Python side.

There is some object_store code that currently touches polars-core; perhaps all of it should live in polars-io.

@chitralverma chitralverma added the enhancement New feature or an improvement of an existing feature label May 30, 2023
@chitralverma
Contributor Author

@ritchie46 @stinodego what are your thoughts?

@universalmind303
Collaborator

Since this isn't a Python-specific feature (like pandas or pyarrow interop), I'm in favor of having that logic available in Rust.

While py-polars is the most complete polars implementation, there are quite a few others (R, Node.js, Ruby, ...) to consider. Adding this only to Python further increases the feature disparity between the languages.

@chitralverma
Contributor Author

Adding this to only python further increases the feature disparity between the languages.

Yes, added this to the list of cons.

@MatthiasRoels

Sorry if this is the wrong place to ask (please correct me if necessary).

What's the current status of support for reading from and writing to object stores? Judging by the number of open (and potentially duplicate?) issues, e.g. #800, #5959, #6177 and #6178, I'm guessing this is still not (fully) supported?

Is there currently a workaround that could work? I mainly use Polars in Python, so I'm mostly interested in that. The following seems to work for me:

import fsspec
import polars as pl
import pyarrow.dataset as ds

def lazy_load_dataset(file_uri: str, format: str = "parquet") -> pl.LazyFrame:
    # Build a pyarrow dataset on top of an fsspec S3 filesystem, then hand it
    # to polars for lazy scanning with projection/predicate pushdown.
    fs = fsspec.filesystem("s3")
    dataset = ds.dataset(file_uri, filesystem=fs, format=format)
    return pl.scan_pyarrow_dataset(dataset)

Is this currently the best way to lazy load a parquet dataset? Thanks a lot in advance for the help!

@chitralverma
Contributor Author

Is this currently the best way to lazy load a parquet dataset? Thanks a lot in advance for the help!
Yes, this can be done as long as you are using polars with Python.

This issue is a general one on how the functionality should be implemented.

@ritchie46
Member

We can do this completely on the rust side by extending the current object_store implementation in polars beyond just the parquet path and block the asyncs on the current thread. This will then require no change on the python side of things.

This is what we must do. This is 100% within the goals of polars.

grace period

In the meantime we could add what @chitralverma proposes as well. Later we can add a cloud_engine argument that supports "polars", "pyarrow" and maybe more.
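A hypothetical sketch of what validating such a cloud_engine argument could look like; neither the parameter nor these names exist in polars, they are assumptions for illustration only.

```python
# Hypothetical sketch only: `cloud_engine` is a proposed argument,
# not an existing polars parameter.
SUPPORTED_CLOUD_ENGINES = ("polars", "pyarrow")

def validate_cloud_engine(cloud_engine: str = "polars") -> str:
    """Validate the engine choice before dispatching a cloud scan."""
    if cloud_engine not in SUPPORTED_CLOUD_ENGINES:
        raise ValueError(
            f"unknown cloud_engine {cloud_engine!r}; "
            f"expected one of {SUPPORTED_CLOUD_ENGINES}"
        )
    return cloud_engine
```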

@chitralverma
Contributor Author

Alright then, let me start work on a PR for the Rust side of things.
