
discuss: Introduce datafusion-storage as datafusion's own storage interface #14854

Open · Xuanwo opened this issue Feb 24, 2025 · 9 comments

Xuanwo commented Feb 24, 2025

Hello everyone, I'm jumping here from [Discussion] Object Store Composition.

Background

DataFusion currently uses ObjectStore as its public storage interface. We have public APIs like register_object_store:

use std::sync::Arc;
use datafusion::execution::object_store::ObjectStoreUrl;
use datafusion::prelude::SessionContext;
use object_store::local::LocalFileSystem;

let object_store_url = ObjectStoreUrl::parse("file://").unwrap();
let object_store = LocalFileSystem::new();
let ctx = SessionContext::new();
// All files with the file:// URL prefix will be read from the local file system
ctx.register_object_store(object_store_url.as_ref(), Arc::new(object_store));

With the growth of DF, we have to continuously add more features to object_store, making it increasingly difficult to compose, as described in [Discussion] Object Store Composition.

The latest example is adding Extensions to object store GetOptions to allow passing tracing spans within the object store, as requested in Improve use of tracing spans in query path.

It's easy to predict that ObjectStore will move further and further away from its initial position:

Initially the ObjectStore API was relatively simple, consisting of a few methods to interact with object stores. As such many systems took this abstraction and used it as a generic IO abstraction, this is good and what the crate was designed for.

Proposal

So I propose to build datafusion-storage, primarily focused on DataFusion's own needs, while maintaining datafusion-storage-object-store and datafusion-storage-opendal separately. The benefit is that users can implement innovative features like datafusion-storage-cudf or datafusion-storage-io_uring without being constrained by the current I/O abstraction of object-store or OpenDAL.

If this becomes a reality, DataFusion can design the abstraction based on its own requirements without having to push everything upstream to object_store. This would allow DataFusion to maintain useful features such as context management and add additional requirements to the trait, while letting datafusion-storage-object-store and datafusion-storage-opendal handle the extra work.

Implementation

We can start by aliasing the ObjectStore trait within datafusion-storage first. Given sufficient migration time, we can then fine-tune the trait to better align with DF's specific needs.
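
A minimal sketch of what that first aliasing step could look like; the crate layout and the DataFusionStorage name are hypothetical, not an agreed-upon API:

// Hypothetical contents of a new datafusion-storage crate. Start by
// re-exporting the existing trait so downstream code keeps compiling,
// then let the DataFusion-specific trait diverge over time.
pub use object_store::{GetOptions, GetResult, ObjectStore};

// Alias trait that DataFusion code can depend on today; later it can grow
// DataFusion-specific requirements (e.g. context propagation) while adapter
// crates such as datafusion-storage-object-store bridge back to object_store.
pub trait DataFusionStorage: ObjectStore {}

// Blanket impl: every existing ObjectStore automatically satisfies the new trait.
impl<T: ObjectStore + ?Sized> DataFusionStorage for T {}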

alamb commented Feb 25, 2025

I think this sounds like a great idea ❤

At least for some transition time, we probably would need to offer datafusion-storage-object-store built in to maintain the existing backwards compatibility. Given the good things I have heard about OpenDAL, maybe it would be worth considering bringing datafusion-storage-opendal into the repo too 🤔

A separate IO API might also make it easier to keep CPU and IO work separate.

Anyhow, the first step towards this project is probably a POC PR sketching out what such an API might look like

Xuanwo commented Feb 25, 2025

I'm willing to kick off a PoC first so the community can get a sense of this.

tustvold commented Feb 26, 2025

I think this proposal makes a lot of sense. One thing I would suggest is thinking about whether it makes sense for datafusion-storage to be a higher-level abstraction, rather than one concerned only with shuffling around ranges of bytes or lists of files. Just implementing a trait that is a superset of ObjectStore or OpenDAL feels like it would merely move the challenges around instead of introducing a more meaningful abstraction layer.

For example, I wonder whether a higher-level abstraction concerned with, say, reading Parquet files might be more flexible. It may be that this already exists in the form of the new DataSource abstraction, although I have not followed this closely.

Edit: I think @crepererum's point on apache/arrow-rs#7171 (comment) is a good one also:

integrating that into a larger code base that isn't all just DataFusion is going to be pain.

I don't really have a good answer here, but I guess one downside of a DF-specific or even operator-specific abstraction is that it might make integration into broader codebases more complex.

westonpace commented Mar 3, 2025

Out of curiosity, which components of datafusion require object store?

We've been using datafusion for a while without ever calling register_object_store. Even spilling works (I assume it is defaulting to some kind of local object store? Or does spilling not use object store?)

AdamGS commented Mar 3, 2025

I think anything that assumes "some files somewhere" uses ObjectStore, much of it behind ListingTable and all related interfaces. It's also available everywhere you have a Session or access to a RuntimeEnv.
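
For reference, code that has a SessionContext (and therefore a RuntimeEnv) can fetch a registered store roughly like this; module paths are taken from recent DataFusion releases and may differ between versions:

use datafusion::execution::object_store::ObjectStoreUrl;
use datafusion::prelude::SessionContext;

let ctx = SessionContext::new();
let url = ObjectStoreUrl::parse("file://").unwrap();
// RuntimeEnv resolves the URL against its ObjectStoreRegistry.
let store = ctx.runtime_env().object_store(&url).unwrap();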

tustvold commented Mar 3, 2025

which components of datafusion require object store

Most components of DataFusion that perform IO will do so via ObjectStore. The only notable exception I am aware of is ParquetExec, which can be constructed with a custom AsyncFileReader that dispatches IO to an abstraction of its choosing.

Even spilling works (I assume it is defaulting to some kind of local object store? Or does spilling not use object store?)

By default DF does register a LocalFileSystem ObjectStore, although I am not familiar enough with the spilling logic to know if it makes use of this.

westonpace commented Mar 3, 2025

much of it behind ListingTable and all related interfaces. It's also available everywhere you have a Session or access to a RuntimeEnv.

Ah, this makes sense. ListingTable and friends would definitely need an object store.

It's also available everywhere you have a Session or access to a RuntimeEnv

Hmm, this one is a little more troubling in my mind. What should this object store be (local, for spilling, or remote, for some kind of catalog listing)? Why should it be needed? Or, to focus on the purposes of this discussion, what APIs will it need?

The only notable exception I am aware of is ParquetExec, which can be constructed with a custom AsyncFileReader that dispatches IO to an abstraction of its choosing.

I would expect most file readers to have their own I/O abstractions. Lance has its own, for example, which requires the equivalent of something like preadv. I think attempting to unify an object store trait across the I/O needs of all possible file readers could be troublesome.

tustvold commented Mar 3, 2025

What should this object store be

It depends on the ObjectStoreUrl on the FileScanConfig, which in turn is used to fetch an appropriate ObjectStore from the ObjectStoreRegistry. The default registry registers LocalFileSystem and allows registering further stores manually; datafusion-cli has a more sophisticated implementation that can automatically register S3, GCS, etc. buckets.

This is the mechanism that allows doing things like SELECT * from s3://foo/bar.parquet, etc... It is also important for distributed systems like Ballista where being able to serialize FileScanConfig and send it over the wire is important.
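
A rough sketch of that lookup path using DefaultObjectStoreRegistry; the exact module paths and the s3://my-bucket URL are illustrative and may differ between DataFusion versions:

use std::sync::Arc;
use datafusion::execution::object_store::{
    DefaultObjectStoreRegistry, ObjectStoreRegistry, ObjectStoreUrl,
};
use object_store::local::LocalFileSystem;

// The default registry already maps file:// to LocalFileSystem.
let registry = DefaultObjectStoreRegistry::new();

// Register a store for another scheme/authority (a real deployment would
// register an S3 store here; LocalFileSystem is just a stand-in).
let url = ObjectStoreUrl::parse("s3://my-bucket").unwrap();
registry.register_store(url.as_ref(), Arc::new(LocalFileSystem::new()));

// FileScanConfig carries an ObjectStoreUrl, and execution resolves it
// through the registry like this.
let store = registry.get_store(url.as_ref()).unwrap();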

I think attempting to unify an object store trait across the I/O needs of all possible file readers could be troublesome.

I don't disagree; that is what I was recommending in the initial issue that triggered this one: apache/arrow-rs#7171

The challenge is that there are advantages to having a single unified IO interface, particularly when it comes to integrating DF into a shared codebase that may perform IO elsewhere, or that might want to access multiple different types of file.

I'd personally recommend an approach that keeps ObjectStore as the default, but adds similar interfaces like AsyncFileReader to allow overriding the defaults for particular operators on a case-by-case basis if people wish to do so.

Edit:

what APIs will it need

Currently the ObjectStore API... This is somewhat unfortunate, given many use-cases only need read or write or listing, not all 3 simultaneously, but on the flip side it avoids having to maintain 3 separate IO traits, with accompanying registries, etc...
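
To make that trade-off concrete, a DataFusion-owned interface could in principle be split into capability traits like the following; these names are purely hypothetical and nothing like this exists today:

use std::ops::Range;

use async_trait::async_trait;
use bytes::Bytes;
use futures::stream::BoxStream;

type StorageResult<T> = Result<T, Box<dyn std::error::Error + Send + Sync>>;

// Read capability: enough for scanning ranges of a file.
#[async_trait]
pub trait StorageRead: Send + Sync {
    async fn read_range(&self, path: &str, range: Range<u64>) -> StorageResult<Bytes>;
}

// Write capability: enough for spilling or writing query output.
#[async_trait]
pub trait StorageWrite: Send + Sync {
    async fn write(&self, path: &str, data: Bytes) -> StorageResult<()>;
}

// Listing capability: enough for ListingTable-style discovery.
pub trait StorageList: Send + Sync {
    fn list(&self, prefix: &str) -> BoxStream<'static, StorageResult<String>>;
}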

Xuanwo commented Mar 5, 2025

I'm willing to kick off a PoC first so the community can get a sense of this.

Kicked off at #15018
