Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Epic] Split datasources out from datafusion crate (datafusion/core) #14444

Open
alamb opened this issue Feb 3, 2025 · 29 comments
Open

[Epic] Split datasources out from datafusion crate (datafusion/core) #14444

alamb opened this issue Feb 3, 2025 · 29 comments
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Feb 3, 2025

Is your feature request related to a problem or challenge?

Historically DataFusion was one (very) large crate datafusion, and as it grew bigger we extracted various functionality into separate crates. This leads to both faster compile times (as the crates can be compiled in parallel) as well easier to navigate code (as the crates force a cleaner dependency separation)

As described by @waynexia the build time of DataFusion has been growing,

Some of this is due to the fact there is more code / more features to test. However a non trivial part of the long compile time is the time taken to compile the datafusion / core crate in https://github.com/apache/datafusion/tree/main/datafusion/core

Image

While we are pursuing additional ways to reduce compile time, I think we should also move more code out of datafusion/core into their own crates.

We have successfully done this in the past with other projects such as

Describe the solution you'd like

I would like to split out the https://github.com/apache/datafusion/tree/main/datafusion/core/src/datasource from DataFusion core

Describe alternatives you've considered

I think we will end up with several new crates

  • datafusion-catalog-listing: ListingTable and associated types like PartitionedFile
  • datafusion-datasource-parquet: ParquetExec and file firmat
  • datafusion-datasource-avro AvroExec and file formats
  • datafusion-datasource-arrow
  • datafusion-datasource-json
  • datafusion-datasource-csv

I think we could start by creating datafusion-catalog-listing and trying to pull some of the listing table implementation into there and then trying to move one of the simpler datasources out (datafusion-datasource-arrow perhaps)

Additional context

No response

@alamb alamb added the enhancement New feature or request label Feb 3, 2025
@alamb
Copy link
Contributor Author

alamb commented Feb 3, 2025

FYI @logan-keede this is the idea I was mentioning to you regarding more refactoring fun

@logan-keede
Copy link
Contributor

take

@davisp
Copy link
Member

davisp commented Feb 5, 2025

Whoa, this is awesome!

I'm still ramping up on learning DataFusion internals as I add my own extensions and one thing that's been nagging at me is that it almost feels like the built-in data source providers are "cheating" because they get to live in core. Moving them all to separate crates that have to use the same interfaces as external datasources is something I've been contemplating suggesting for awhile so its great to see this already happening.

I'm definitely going to be keeping up and helping with this work.

@davisp
Copy link
Member

davisp commented Feb 5, 2025

@alamb Can you point me at whatever tool you used to generate those compile timing graphs? Those look like something I'd absolutely adopt in a bunch of projects.

@logan-keede
Copy link
Contributor

@alamb Can you point me at whatever tool you used to generate those compile timing graphs? Those look like something I'd absolutely adopt in a bunch of projects.

@davisp try this

rm -rf target 
cargo build --release --timings --lib --quiet

@davisp
Copy link
Member

davisp commented Feb 6, 2025

@logan-keede Thanks for the tip!

@alamb
Copy link
Contributor Author

alamb commented Feb 6, 2025

@alamb Can you point me at whatever tool you used to generate those compile timing graphs? Those look like something I'd absolutely adopt in a bunch of projects.

(and to be clear, that was @waynexia who made that chart for #14256 I just copy pasted it here)

I'm still ramping up on learning DataFusion internals as I add my own extensions and one thing that's been nagging at me is that it almost feels like the built-in data source providers are "cheating" because they get to live in core. Moving them all to separate crates that have to use the same interfaces as external datasources is something I've been contemplating suggesting for awhile so its great to see this already happening.

Awesome -- indeed this is the case (and a similar thing used to happen with functions before we pulled them out)

BTW there is a major refactor for datasources done by @mertak-synnada @ozankabak and others that (just) merged:

Among other things it makes it easier to re-use common datasource features like pushdown, limit, etc

Now that that is in, we will be able to push this project along like a 🚀

@alamb
Copy link
Contributor Author

alamb commented Feb 12, 2025

Update here is that @logan-keede is cranking right along:

After some discussion with @jayzhan211 I think we have a good plan going forward

I feel like we are on the cusp of finally getting these things split from the core 🙏

@logan-keede
Copy link
Contributor

logan-keede commented Feb 12, 2025

This has been an update in plan, specifically addition of datafusion-datasource crate

  • datafusion-catalog-listing: ListingTable and original datasource::listing
  • datafusion-datasource: NEW that holds PartitonedFile,and DataSource and other File related components
  • datafusion-datasource-parquet: ParquetExec and file formats
  • datafusion-datasource-avro AvroExec and file formats
  • datafusion-datasource-arrow
  • datafusion-datasource-json
  • datafusion-datasource-csv

Originally posted by @alamb and suggested by @jayzhan211 in #14616 (comment)

Please refer to the above mentioned issue for more context.

@AdamGS
Copy link
Contributor

AdamGS commented Feb 22, 2025

@logan-keede as part of moving avro functionality into datafusion-datasource-avro, WDYT about putting all of it behind a feature flag, the same way parquet is now? Seems like its the only format/source that is visible with default-features = false but panics at runtime (for example in AvroSource::create_file_opener)

@logan-keede
Copy link
Contributor

@logan-keede as part of moving avro functionality into datafusion-datasource-avro, WDYT about putting all of it behind a feature flag, the same way parquet is now? Seems like its the only format/source that is visible with default-features = false but panics at runtime (for example in AvroSource::create_file_opener)

I do not have much context on why it has been kept this way, but here is what I think

It seems like you can currently make a new AvroSource but not read an existing file. There might be some limited use case where user only needed to simulate AvroSource for some reason. (even that might crash as some other functionalities also panic).
I think this use case is very limited and unintuitive and it is better to have it behind a feature flag(as you suggested), even if it is an API-Change.

TL;DR: I think we can do that, but it seems like an API change so deprecation first.

@alamb WDYT?

@alamb
Copy link
Contributor Author

alamb commented Feb 22, 2025

@logan-keede as part of moving avro functionality into datafusion-datasource-avro, WDYT about putting all of it behind a feature flag, the same way parquet is now? Seems like its the only format/source that is visible with default-features = false but panics at runtime (for example in AvroSource::create_file_opener)

I do not have much context on why it has been kept this way, but here is what I think

It seems like you can currently make a new AvroSource but not read an existing file. There might be some limited use case where user only needed to simulate AvroSource for some reason. (even that might crash as some other functionalities also panic). I think this use case is very limited and unintuitive and it is better to have it behind a feature flag(as you suggested), even if it is an API-Change.

TL;DR: I think we can do that, but it seems like an API change so deprecation first.

@alamb WDYT?

I think putting the entire AvroExec behind a feature flag makes sense even if it is an API change. I don't think we have to go through deprecation first as the change will be "add the flag"

@logan-keede
Copy link
Contributor

logan-keede commented Feb 23, 2025

I might not be able to contribute much till 7th March (due to institute commitments).
If anybody wants to continue on this issue, please feel free to do so. Here are possible next steps that I had in my mind:-

  1. Move remaining part of FileScanConfig and FileStream to datasource .
  2. Move FileSource to datasource.
  3. Move FileFormat to datasource.
  4. Move listing and file format specific implementation to their own crates.

I would be happy to continue working on this as soon as I find some time, though that may only be after March 7.

@AdamGS
Copy link
Contributor

AdamGS commented Feb 23, 2025

I'll try and take a stab at it, @alamb do you have a preference as to how many PRs I should break it into? There are no logical changes but I expect a very large numbers of small changes and would I love to do it in a way that you and others will be happy to review.

edit: Started moving things around and seems like there a bunch of dependencies between them. I'll keep going until I get something that makes sense, hopefully the resulting PR won't be too big.

@alamb
Copy link
Contributor Author

alamb commented Feb 23, 2025

I'll try and take a stab at it, @alamb do you have a preference as to how many PRs I should break it into? There are no logical changes but I expect a very large numbers of small changes and would I love to do it in a way that you and others will be happy to review.

edit: Started moving things around and seems like there a bunch of dependencies between them. I'll keep going until I get something that makes sense, hopefully the resulting PR won't be too big.

Thank you @AdamGS -- that is super helpful

In general, fewer smaller PRs is far easier to review. Also PRs that are mostly mechanical are also easy / fast to review

PRs that make changes that might have larger downstream implications are harder / take longer (as they require finding more focused time to review them)

@AdamGS
Copy link
Contributor

AdamGS commented Feb 23, 2025

Ok I got a first draft that just moves FileStream and FileScanConfig and everything that comes with them. There are still some small issues to solve, mostly around test functionality that they pull from the core crate but expect that to be solvable.

@AdamGS
Copy link
Contributor

AdamGS commented Feb 24, 2025

#14838 turned out to be easier than I thought, mostly because @logan-keede did much of the work to move test infrastructure over.

@AdamGS
Copy link
Contributor

AdamGS commented Feb 25, 2025

I'll keep going later this week, my current plan is to split the rest of the work into three PRs:

  1. FileFormat - Move FileFormat and related pieces to datafusion-datasource #14873
  2. ListingTable and friends.
  3. Split out each individual format implementation into its own crate (csv, json, parquet, avro).

@AdamGS
Copy link
Contributor

AdamGS commented Feb 27, 2025

Just updating that seems like steps 2 and 3 need to happen in one PR, otherwise a lot of code will move around multiple times which seems like unnecessary noise.

Another issue I've run into is that ListingTable (as well as csv, JSON and parquet) all downcast the session into a SessionState (see earlier discussions in #11600 and @logan-keede 's #14543), which is currently core.
The only functionality they want is SessionState::default_table_options and SessionState::default_table_options, so the easiest thing to do will be to just move those functions to the trait. Given the amount of conversations/issues I've found about this point, I think its best to get a second opinion here before I start moving things around.

@logan-keede
Copy link
Contributor

Another issue I've run into is that ListingTable (as well as csv, JSON and parquet) all downcast the session into a SessionState (see earlier discussions in #11600 and @logan-keede 's #14543), which is currently core. The only functionality they want is SessionState::default_table_options and SessionState::default_table_options, so the easiest thing to do will be to just move those functions to the trait. Given the amount of conversations/issues I've found about this point, I think its best to get a second opinion here before I start moving things around.

That's what I had in my mind too, I think the idea is to use Session trait, wherever SessionState is needed especially outside of core (and move those functions, if it is not already present).

@alamb
Copy link
Contributor Author

alamb commented Feb 27, 2025

That's what I had in my mind too, I think the idea is to use Session trait, wherever SessionState is needed especially outside of core (and move those functions, if it is not already present).

Yes this is exactly right

@AdamGS
Copy link
Contributor

AdamGS commented Feb 27, 2025

Seems like that creates a dependency problem I'm not sure how to untangle. catalog::Session will need datasource::FiltFormatFactory, which needs FileFormat, which needs FileSource, which needs FileScanConfig.

I initially considered even returning Any instead of an Ard<dyn FileFormatFactoy> but many usages don't have a statically known format, and returning a FileFormat won't make a difference.
I can merge datafusion-datasource into datafusion-catalog, but also open to other ideas, and I'll also keep at it later today. Hopefully something will come up that won't require a huge breaking change.

@logan-keede
Copy link
Contributor

logan-keede commented Feb 27, 2025

Seems like that creates a dependency problem I'm not sure how to untangle. catalog::Session will need datasource::FiltFormatFactory, which needs FileFormat, which needs FileSource, which needs FileScanConfig

I remember contemplating over that, I think I had a loose sketch but I don't recall anything directly useful at the moment.

I can merge datafusion-datasource into datafusion-catalog, but also open to other ideas, and I'll also keep at it later today. Hopefully something will come up that won't require a huge breaking change.

  1. I think we should try to avoid that mostly because it seems like a regression in term of modularity and code quality. (maybe we can only move some part of either or both crates up or down the dependency tree instead, not sure about this.)

  2. It is my understanding that we do eventually want to move SessionState and everything else out of core, right?

Following on that,
I think the recent refactor has allowed us to move more stuff out of core (like CteWorktable and catalog-common), which may also be direct or indirect dependency of SessionState`.
I suggest that we take an Incremental approach and move those out first, that should help us identify what are the actual blockers* and then we can target them separately.

does that make sense?

*Tightly coupled code that is hindering refactor.

@AdamGS
Copy link
Contributor

AdamGS commented Feb 27, 2025

Makes perfect sense, obviously smashing two crates together just to solve a dependency issue seems like an overreaction, especially as there are so many other dependencies that can be untangled first.

In the context of this work - I'll try and split the file formats tonight, they should be less tied to SessionState than ListingTable. After that's done we can circle back and try and focus more effort specifically on SessionState.

@AdamGS
Copy link
Contributor

AdamGS commented Feb 28, 2025

I have a first version that seems to pass all tests/clippy, but I'm left with one issue before I publish it. In order to keep the tests in the format-specific crates, I pulled core as a dev-dependency using a path, so it can still be published.
Fixing depcheck to accept that is trivial, but I would love to know if that's an acceptable solution here or if I should move all tests up back to core where it has access to everything they need.

@logan-keede
Copy link
Contributor

I have a first version that seems to pass all tests/clippy, but I'm left with one issue before I publish it. In order to keep the tests in the format-specific crates, I pulled core as a dev-dependency using a path, so it can still be published. Fixing depcheck to accept that is trivial, but I would love to know if that's an acceptable solution here or if I should move all tests up back to core where it has access to everything they need.

referring to #14671 (comment) and the fact that no crate has a dev-dependency on a crate above them in dependency tree (atleast none that I know of).

I think it is acceptable but not encouraged.
you may want to wait for a reply from @alamb or somebody else knowledgeable if you want to be sure.

@AdamGS
Copy link
Contributor

AdamGS commented Feb 28, 2025

Yeah also checked and noticed that no other crate does that currently. I'll move the tests back to their original locations for now, some can be spared but most do seem to require SessionState.

EDIT: Before I do that, I'll try and a more minimal Session implementation, I think at least some tests don't need much functionality to run.

@AdamGS
Copy link
Contributor

AdamGS commented Mar 1, 2025

The PR is out, but I suspect it's too big. I tried to get it in the 46 release, but now that I probably missed that, @alamb (or any other reviewers) would you prefer I split it up a bit?

@alamb
Copy link
Contributor Author

alamb commented Mar 3, 2025

The PR is out, but I suspect it's too big. I tried to get it in the 46 release, but now that I probably missed that, @alamb (or any other reviewers) would you prefer I split it up a bit?

I'll try and check it out today.

Seems like that creates a dependency problem I'm not sure how to untangle. catalog::Session will need datasource::FiltFormatFactory, which needs FileFormat, which needs FileSource, which needs FileScanConfig.

I was thinking about this one.

It seems like FileFormat and FileSource may now have some non trivial overlap in functionalty. Maybe we need to rethink some of the APIs / abstractions 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants