discuss: Introduce datafusion-storage as datafusion's own storage interface #14854
Comments
I think this sounds like a great idea ❤ At least for some transition time, we probably would need to offer A separate IO API might also make keeping CPU and IO work separate easier, such as
Anyhow, the first step towards this project is probably a PoC PR sketching out what such an API might look like.
I'm willing to kick off a PoC first so the community can get a sense of this.
I think this proposal makes a lot of sense. One thing I would suggest is thinking about whether it makes sense for datafusion-storage to be a higher-level abstraction than one just concerned with shuffling around ranges of bytes or lists of files. Just implementing a trait that is a superset of ObjectStore or OpenDAL feels like it would just move the challenges around instead of introducing a more meaningful abstraction layer. For example, I wonder if a higher-level abstraction concerned with, say, reading parquet files might be more flexible. It may be that this already exists in the form of the new DataSource abstraction, although I have not followed this closely. Edit: @crepererum's point on apache/arrow-rs#7171 (comment) is, I think, also a good one.
I don't really have a good answer here, but I guess one downside of a DF-specific or even operator-specific abstraction is that it might make integrating DataFusion into broader codebases more complex.
Out of curiosity, which components of datafusion require object store? We've been using datafusion for a while without ever calling
I think anything that assumes "some files somewhere" uses
Most components of DataFusion that perform IO will do so via ObjectStore. The only notable exception I am aware of is ParquetExec, which can be constructed with a custom AsyncFileReader that dispatches IO to an abstraction of its choosing.
By default DF does register a LocalFilesystem ObjectStore, although I am not familiar enough with the spilling logic to know whether it makes use of this.
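The per-operator escape hatch described above can be pictured with a small sketch. Everything here is an assumption made for illustration: `FileReader`, `DefaultReader`, `CountingReader`, and `read_magic` are hypothetical names, and the trait is synchronous and far simpler than the real AsyncFileReader in the parquet crate. It only shows the shape of the idea: an operator that asks for byte ranges through a trait, so the caller can swap in its own IO layer.

```rust
use std::ops::Range;

/// Hypothetical per-operator reader hook (NOT the real AsyncFileReader;
/// synchronous and simplified for illustration).
pub trait FileReader {
    fn get_bytes(&mut self, range: Range<usize>) -> Result<Vec<u8>, String>;
}

/// Stand-in for the default path, where bytes come from a shared store.
/// Here the "store" is just an in-memory buffer.
pub struct DefaultReader {
    pub data: Vec<u8>,
}

impl FileReader for DefaultReader {
    fn get_bytes(&mut self, range: Range<usize>) -> Result<Vec<u8>, String> {
        self.data
            .get(range.clone())
            .map(|bytes| bytes.to_vec())
            .ok_or_else(|| format!("range {range:?} out of bounds"))
    }
}

/// A custom reader that wraps another one and counts requests, the way an
/// application with its own IO layer might add metrics or caching.
pub struct CountingReader<R> {
    pub inner: R,
    pub requests: usize,
}

impl<R: FileReader> FileReader for CountingReader<R> {
    fn get_bytes(&mut self, range: Range<usize>) -> Result<Vec<u8>, String> {
        self.requests += 1;
        self.inner.get_bytes(range)
    }
}

/// Hypothetical operator code: it only knows about the trait, so it works
/// with the default reader and with any override.
pub fn read_magic(reader: &mut dyn FileReader) -> Result<Vec<u8>, String> {
    reader.get_bytes(0..4)
}
```

The operator never names a concrete store, which is what lets Lance-style readers bring their own IO while everything else keeps using the default.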
Ah, this makes sense.
Hmm, this one is a little more troubling in my mind. What should this object store be (local, for spilling, or remote, for some kind of catalog listing)? Why should it be needed? Or, to focus on the purposes of this discussion, what APIs will it need?
I would expect most file readers to have their own I/O abstractions. Lance has its own, for example, which requires the equivalent of something like
It depends on the This is the mechanism that allows doing things like
I don't disagree, and this was what I was recommending in the initial issue that triggered the creation of this one - apache/arrow-rs#7171. The challenge is that there are advantages to having a single unified IO interface, particularly when it comes to integrating DF into a shared codebase that may perform IO elsewhere, or that might want to access multiple different types of file. I'd personally recommend an approach that keeps ObjectStore as the default, but adds similar interfaces like AsyncFileReader to allow overriding the defaults for particular operators on a case-by-case basis if people wish to do so. Edit:
Currently the ObjectStore API... This is somewhat unfortunate, given that many use-cases only need reading, writing, or listing, not all three simultaneously, but on the flip side it avoids having to maintain three separate IO traits with accompanying registries, etc.
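To make the trade-off above concrete, here is a minimal sketch of what splitting IO into three narrow traits could look like. The trait names `ReadStore`, `WriteStore`, and `ListStore`, the `InMemory` type, and `read_footer` are all hypothetical and exist only for this illustration; none of them are part of object_store or DataFusion, and the real APIs are asynchronous.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

// Hypothetical narrow capability traits (not part of object_store).
pub trait ReadStore {
    fn get_range(&self, path: &str, range: Range<usize>) -> Option<Vec<u8>>;
}

pub trait WriteStore {
    fn put(&mut self, path: &str, data: Vec<u8>);
}

pub trait ListStore {
    fn list(&self, prefix: &str) -> Vec<String>;
}

/// Toy backend that happens to support all three capabilities.
#[derive(Default)]
pub struct InMemory {
    objects: BTreeMap<String, Vec<u8>>,
}

impl ReadStore for InMemory {
    fn get_range(&self, path: &str, range: Range<usize>) -> Option<Vec<u8>> {
        // Returns None for a missing object or an out-of-bounds range.
        self.objects.get(path)?.get(range).map(|bytes| bytes.to_vec())
    }
}

impl WriteStore for InMemory {
    fn put(&mut self, path: &str, data: Vec<u8>) {
        self.objects.insert(path.to_string(), data);
    }
}

impl ListStore for InMemory {
    fn list(&self, prefix: &str) -> Vec<String> {
        self.objects
            .keys()
            .filter(|key| key.starts_with(prefix))
            .cloned()
            .collect()
    }
}

/// A read-only consumer asks for just the capability it needs:
/// the last `footer` bytes of an object of known length `len`.
pub fn read_footer(
    store: &dyn ReadStore,
    path: &str,
    len: usize,
    footer: usize,
) -> Option<Vec<u8>> {
    store.get_range(path, len.checked_sub(footer)?..len)
}
```

The cost mentioned in the comment shows up quickly: each trait would need its own registration and lookup path, whereas a single combined trait needs only one.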
Kicked off at #15018.
Hello everyone, I'm jumping here from [Discussion] Object Store Composition.
Background
DataFusion is using ObjectStore as its public storage interface for now. We have public APIs like register_object_store:
With the growth of DF, we have to continuously add more features to object_store, making it increasingly difficult to compose, as described in [Discussion] Object Store Composition.
The latest example is adding Extensions to object store GetOptions to allow passing tracing spans within the object store, as requested in Improve use of tracing spans in query path.
It's easy to predict that ObjectStore will move further and further away from its initial position:
Proposal
So I propose to build datafusion-storage, primarily focused on DataFusion's own needs, while maintaining datafusion-storage-object-store and datafusion-storage-opendal separately. The benefit is that users can implement innovative features like datafusion-storage-cudf or datafusion-storage-io_uring without being constrained by the current I/O abstraction of object-store or OpenDAL.
If this becomes a reality, DataFusion can design the abstraction based on its own requirements without having to push everything upstream to object_store. This would allow DataFusion to maintain useful features such as context management and add additional requirements to the trait while letting datafusion-storage-object-store and datafusion-storage-opendal handle the extra work.
Implementation
We can start by aliasing the ObjectStore trait within datafusion-storage first. Given sufficient migration time, we can then fine-tune the trait to better align with DF's specific needs.
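As a rough illustration of the layering proposed above, the sketch below uses only the standard library. All names are assumptions invented for this example: `Storage` stands in for a DataFusion-owned trait, `Context` for per-request data such as a tracing span, `Backend` for an upstream API like object_store or OpenDAL (it is not their real API), and `BackendAdapter` for what a crate like datafusion-storage-object-store might provide. The real interfaces would be asynchronous and much richer.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical per-request context, e.g. carrying a tracing span name.
#[derive(Default)]
pub struct Context {
    pub span: Option<String>,
}

/// Hypothetical DataFusion-owned storage interface.
pub trait Storage: Send + Sync {
    fn read(&self, path: &str, ctx: &Context) -> Result<Vec<u8>, String>;
}

/// Stand-in for an upstream backend API (not the real object_store or OpenDAL API).
pub trait Backend: Send + Sync {
    fn get(&self, key: &str) -> Option<Vec<u8>>;
}

/// Adapter layer: the extra requirement (context) is handled here
/// instead of being pushed into the upstream backend trait.
pub struct BackendAdapter<B>(pub B);

impl<B: Backend> Storage for BackendAdapter<B> {
    fn read(&self, path: &str, ctx: &Context) -> Result<Vec<u8>, String> {
        let span = ctx.span.as_deref().unwrap_or("none");
        self.0
            .get(path)
            .ok_or_else(|| format!("{path} not found (span: {span})"))
    }
}

/// Toy backend backed by a map.
pub struct MapBackend(pub HashMap<String, Vec<u8>>);

impl Backend for MapBackend {
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.0.get(key).cloned()
    }
}

/// Registry keyed by URL scheme, loosely mirroring the role of register_object_store.
#[derive(Default)]
pub struct StorageRegistry {
    stores: HashMap<String, Arc<dyn Storage>>,
}

impl StorageRegistry {
    /// Registers a store and returns the one previously registered for the scheme, if any.
    pub fn register(&mut self, scheme: &str, store: Arc<dyn Storage>) -> Option<Arc<dyn Storage>> {
        self.stores.insert(scheme.to_string(), store)
    }

    pub fn get(&self, scheme: &str) -> Option<Arc<dyn Storage>> {
        self.stores.get(scheme).cloned()
    }
}
```

In this shape, a new backend only needs an adapter implementing the DataFusion-side trait, and changes to that trait do not require changes upstream.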