
Support for datasets in cloud object stores #9124

Open
chitralverma opened this issue May 30, 2023 · 7 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@chitralverma
Contributor

chitralverma commented May 30, 2023

Problem description

I have an approach to support cloud-based datasets in polars without bloating the Rust side or having to deal with the async compatibility issues between polars and object_store.

Pseudo-code:

  • in the scan_* methods like scan_parquet, scan_avro, etc., first check the path's URL scheme
  • if the scheme is file:// or absent, the regular flow applies (using the mmap readers). This ensures no behaviour changes for local files.
  • if the scheme is a supported one like s3://, gs://, az:// or hdfs://, then instead of crossing to the Rust side, we internally build a pyarrow dataset (which supports various file systems) and pass it to scan_pyarrow_dataset, which allows projection and predicate pushdown.
  • if the scheme is not supported, we can either error out or fall back to fsspec.
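The dispatch described above could be sketched as follows. This is a minimal illustration only: resolve_scan_backend, CLOUD_SCHEMES, and the returned labels are hypothetical names, not polars API.

```python
from urllib.parse import urlparse

# Schemes the proposal would route through pyarrow datasets (hypothetical set).
CLOUD_SCHEMES = {"s3", "gs", "az", "hdfs"}

def resolve_scan_backend(path: str) -> str:
    """Decide which scan path a URI should take (hypothetical helper)."""
    scheme = urlparse(path).scheme
    if scheme in ("", "file"):
        # Local files keep the existing mmap-based readers: no behaviour change.
        return "local"
    if scheme in CLOUD_SCHEMES:
        # Cloud URIs go through pyarrow datasets -> scan_pyarrow_dataset,
        # which supports projection and predicate pushdown.
        return "pyarrow"
    # Unknown schemes: either raise here or fall back to fsspec.
    return "fsspec"
```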

Pros:

  • no behaviour changes for local files,
  • directory (and partitioned) datasets can be supported,
  • cloud URLs can be supported,
  • no changes are required on the Rust side,
  • the object_store code can be removed from the Rust side entirely.

Cons:

  • this leads to an API inconsistency: Rust users cannot work with datasets in object stores, which increases the feature disparity between languages,
  • other possible gaps, such as no support for cloud URLs in SQL.

Alternatively:
We can do this completely on the Rust side by extending the current object_store implementation in polars beyond just the Parquet path and blocking on the async calls on the current thread. This would then require no changes on the Python side.

There is some object_store code that currently touches polars-core; perhaps all of it should live in polars-io.

@chitralverma chitralverma added the enhancement New feature or an improvement of an existing feature label May 30, 2023
@chitralverma
Contributor Author

@ritchie46 @stinodego what are your thoughts?

@universalmind303
Collaborator

Since this isn't a Python-specific feature (like pandas or pyarrow interop), I'm in favor of having that logic available in Rust.

While py-polars is the most complete polars implementation, there are quite a few others (R, Node.js, Ruby, ...) to consider. Adding this only to Python further increases the feature disparity between the languages.

@chitralverma
Contributor Author

Adding this to only python further increases the feature disparity between the languages.

Yes, added this to the list of cons.

@MatthiasRoels

Sorry if this is the wrong place to ask (please correct me if necessary).

What's the current status of support for reading from and writing to object stores? Judging by the number of open (and potentially duplicate?) issues, e.g. #800, #5959, #6177 and #6178, I'm guessing this is still not (fully) supported?

Is there currently a workaround that could work? I mainly use Polars in Python, so I'm mostly interested in that. The following seems to work for me:

import fsspec
import polars as pl
import pyarrow.dataset as ds

def lazy_load_dataset(file_uri: str, format: str = "parquet") -> pl.LazyFrame:
    # Build a pyarrow dataset on top of an fsspec S3 filesystem, then hand it
    # to polars for lazy scanning with projection/predicate pushdown.
    fs = fsspec.filesystem("s3")
    dataset = ds.dataset(file_uri, filesystem=fs, format=format)
    return pl.scan_pyarrow_dataset(dataset)

Is this currently the best way to lazy load a parquet dataset? Thanks a lot in advance for the help!

@chitralverma
Contributor Author

Is this currently the best way to lazy load a parquet dataset? Thanks a lot in advance for the help!
Yes, this can be done as long as you are using polars with Python.

This issue is a general one on how the functionality should be implemented.

@ritchie46
Member

We can do this completely on the rust side by extending the current object_store implementation in polars beyond just the parquet path and block the asyncs on the current thread. This will then require no change on the python side of things.

This is what we must do. This is 100% within the goals of polars.

grace period

In the meantime we could add what @chitralverma proposes as well. Later we can add a cloud_engine argument that supports "polars", "pyarrow" and maybe more.
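A hypothetical sketch of what validating such a cloud_engine argument could look like; neither the parameter nor these names exist in polars, they are assumptions for illustration only.

```python
# Hypothetical sketch only: `cloud_engine` is a proposed argument,
# not an existing polars parameter.
SUPPORTED_CLOUD_ENGINES = ("polars", "pyarrow")

def validate_cloud_engine(cloud_engine: str = "polars") -> str:
    """Validate the engine choice before dispatching a cloud scan."""
    if cloud_engine not in SUPPORTED_CLOUD_ENGINES:
        raise ValueError(
            f"unknown cloud_engine {cloud_engine!r}; "
            f"expected one of {SUPPORTED_CLOUD_ENGINES}"
        )
    return cloud_engine
```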

@chitralverma
Contributor Author

Alright then, let me start work on a PR for the Rust side of things.
