[Epic] Split datasources out from `datafusion` crate (`datafusion/core`) #14444

alamb · 2025-02-03T15:10:35Z

Is your feature request related to a problem or challenge?

Historically DataFusion was one (very) large crate datafusion, and as it grew bigger we extracted various functionality into separate crates. This leads to both faster compile times (as the crates can be compiled in parallel) as well easier to navigate code (as the crates force a cleaner dependency separation)

As described by @waynexia the build time of DataFusion has been growing,

Build time regression #14256

Some of this is due to the fact there is more code / more features to test. However a non trivial part of the long compile time is the time taken to compile the datafusion / core crate in https://github.com/apache/datafusion/tree/main/datafusion/core

While we are pursuing additional ways to reduce compile time, I think we should also move more code out of datafusion/core into their own crates.

We have successfully done this in the past with other projects such as

Describe the solution you'd like

I would like to split out the https://github.com/apache/datafusion/tree/main/datafusion/core/src/datasource from DataFusion core

Describe alternatives you've considered

I think we will end up with several new crates

datafusion-catalog-listing: ListingTable and associated types like PartitionedFile
datafusion-datasource-parquet: ParquetExec and file firmat
datafusion-datasource-avro AvroExec and file formats
datafusion-datasource-arrow
datafusion-datasource-json
datafusion-datasource-csv

I think we could start by creating datafusion-catalog-listing and trying to pull some of the listing table implementation into there and then trying to move one of the simpler datasources out (datafusion-datasource-arrow perhaps)

Additional context

No response

The text was updated successfully, but these errors were encountered:

alamb · 2025-02-03T15:10:58Z

FYI @logan-keede this is the idea I was mentioning to you regarding more refactoring fun

logan-keede · 2025-02-03T17:48:06Z

take

davisp · 2025-02-05T23:17:07Z

Whoa, this is awesome!

I'm still ramping up on learning DataFusion internals as I add my own extensions and one thing that's been nagging at me is that it almost feels like the built-in data source providers are "cheating" because they get to live in core. Moving them all to separate crates that have to use the same interfaces as external datasources is something I've been contemplating suggesting for awhile so its great to see this already happening.

I'm definitely going to be keeping up and helping with this work.

davisp · 2025-02-05T23:28:59Z

@alamb Can you point me at whatever tool you used to generate those compile timing graphs? Those look like something I'd absolutely adopt in a bunch of projects.

logan-keede · 2025-02-06T12:48:57Z

@alamb Can you point me at whatever tool you used to generate those compile timing graphs? Those look like something I'd absolutely adopt in a bunch of projects.

@davisp try this

rm -rf target 
cargo build --release --timings --lib --quiet

davisp · 2025-02-06T18:01:01Z

@logan-keede Thanks for the tip!

alamb · 2025-02-06T18:53:19Z

@alamb Can you point me at whatever tool you used to generate those compile timing graphs? Those look like something I'd absolutely adopt in a bunch of projects.

(and to be clear, that was @waynexia who made that chart for #14256 I just copy pasted it here)

I'm still ramping up on learning DataFusion internals as I add my own extensions and one thing that's been nagging at me is that it almost feels like the built-in data source providers are "cheating" because they get to live in core. Moving them all to separate crates that have to use the same interfaces as external datasources is something I've been contemplating suggesting for awhile so its great to see this already happening.

Awesome -- indeed this is the case (and a similar thing used to happen with functions before we pulled them out)

BTW there is a major refactor for datasources done by @mertak-synnada @ozankabak and others that (just) merged:

Introduce unified DataSourceExec for provided datasources, remove ParquetExec, CsvExec, etc #14224

Among other things it makes it easier to re-use common datasource features like pushdown, limit, etc

Now that that is in, we will be able to push this project along like a 🚀

alamb · 2025-02-12T15:11:03Z

Update here is that @logan-keede is cranking right along:

refactor: Move various parts of datasource out of core #14616

After some discussion with @jayzhan211 I think we have a good plan going forward

I feel like we are on the cusp of finally getting these things split from the core 🙏

logan-keede · 2025-02-12T15:13:47Z

This has been an update in plan, specifically addition of datafusion-datasource crate

datafusion-catalog-listing: ListingTable and original datasource::listing
datafusion-datasource: NEW that holds PartitonedFile,and DataSource and other File related components
datafusion-datasource-parquet: ParquetExec and file formats
datafusion-datasource-avro AvroExec and file formats
datafusion-datasource-arrow
datafusion-datasource-json
datafusion-datasource-csv

Originally posted by @alamb and suggested by @jayzhan211 in #14616 (comment)

Please refer to the above mentioned issue for more context.

AdamGS · 2025-02-22T11:33:53Z

@logan-keede as part of moving avro functionality into datafusion-datasource-avro, WDYT about putting all of it behind a feature flag, the same way parquet is now? Seems like its the only format/source that is visible with default-features = false but panics at runtime (for example in AvroSource::create_file_opener)

logan-keede · 2025-02-22T12:51:04Z

@logan-keede as part of moving avro functionality into datafusion-datasource-avro, WDYT about putting all of it behind a feature flag, the same way parquet is now? Seems like its the only format/source that is visible with default-features = false but panics at runtime (for example in AvroSource::create_file_opener)

I do not have much context on why it has been kept this way, but here is what I think

It seems like you can currently make a new AvroSource but not read an existing file. There might be some limited use case where user only needed to simulate AvroSource for some reason. (even that might crash as some other functionalities also panic).
I think this use case is very limited and unintuitive and it is better to have it behind a feature flag(as you suggested), even if it is an API-Change.

TL;DR: I think we can do that, but it seems like an API change so deprecation first.

@alamb WDYT?

alamb · 2025-02-22T12:59:37Z

@logan-keede as part of moving avro functionality into datafusion-datasource-avro, WDYT about putting all of it behind a feature flag, the same way parquet is now? Seems like its the only format/source that is visible with default-features = false but panics at runtime (for example in AvroSource::create_file_opener)

I do not have much context on why it has been kept this way, but here is what I think

It seems like you can currently make a new AvroSource but not read an existing file. There might be some limited use case where user only needed to simulate AvroSource for some reason. (even that might crash as some other functionalities also panic). I think this use case is very limited and unintuitive and it is better to have it behind a feature flag(as you suggested), even if it is an API-Change.

TL;DR: I think we can do that, but it seems like an API change so deprecation first.

@alamb WDYT?

I think putting the entire AvroExec behind a feature flag makes sense even if it is an API change. I don't think we have to go through deprecation first as the change will be "add the flag"

logan-keede · 2025-02-23T09:11:08Z

I might not be able to contribute much till 7th March (due to institute commitments).
If anybody wants to continue on this issue, please feel free to do so. Here are possible next steps that I had in my mind:-

Move remaining part of FileScanConfig and FileStream to datasource .
Move FileSource to datasource.
Move FileFormat to datasource.
Move listing and file format specific implementation to their own crates.

I would be happy to continue working on this as soon as I find some time, though that may only be after March 7.

AdamGS · 2025-02-23T13:43:51Z

I'll try and take a stab at it, @alamb do you have a preference as to how many PRs I should break it into? There are no logical changes but I expect a very large numbers of small changes and would I love to do it in a way that you and others will be happy to review.

edit: Started moving things around and seems like there a bunch of dependencies between them. I'll keep going until I get something that makes sense, hopefully the resulting PR won't be too big.

alamb · 2025-02-23T13:59:04Z

I'll try and take a stab at it, @alamb do you have a preference as to how many PRs I should break it into? There are no logical changes but I expect a very large numbers of small changes and would I love to do it in a way that you and others will be happy to review.

edit: Started moving things around and seems like there a bunch of dependencies between them. I'll keep going until I get something that makes sense, hopefully the resulting PR won't be too big.

Thank you @AdamGS -- that is super helpful

In general, fewer smaller PRs is far easier to review. Also PRs that are mostly mechanical are also easy / fast to review

PRs that make changes that might have larger downstream implications are harder / take longer (as they require finding more focused time to review them)

AdamGS · 2025-02-23T15:31:56Z

Ok I got a first draft that just moves FileStream and FileScanConfig and everything that comes with them. There are still some small issues to solve, mostly around test functionality that they pull from the core crate but expect that to be solvable.

AdamGS · 2025-02-24T20:58:13Z

#14838 turned out to be easier than I thought, mostly because @logan-keede did much of the work to move test infrastructure over.

AdamGS · 2025-02-25T14:09:11Z

I'll keep going later this week, my current plan is to split the rest of the work into three PRs:

FileFormat - Move FileFormat and related pieces to datafusion-datasource #14873
ListingTable and friends.
Split out each individual format implementation into its own crate (csv, json, parquet, avro).

AdamGS · 2025-02-27T09:51:11Z

Just updating that seems like steps 2 and 3 need to happen in one PR, otherwise a lot of code will move around multiple times which seems like unnecessary noise.

Another issue I've run into is that ListingTable (as well as csv, JSON and parquet) all downcast the session into a SessionState (see earlier discussions in #11600 and @logan-keede 's #14543), which is currently core.
The only functionality they want is SessionState::default_table_options and SessionState::default_table_options, so the easiest thing to do will be to just move those functions to the trait. Given the amount of conversations/issues I've found about this point, I think its best to get a second opinion here before I start moving things around.

logan-keede · 2025-02-27T10:44:40Z

Another issue I've run into is that ListingTable (as well as csv, JSON and parquet) all downcast the session into a SessionState (see earlier discussions in #11600 and @logan-keede 's #14543), which is currently core. The only functionality they want is SessionState::default_table_options and SessionState::default_table_options, so the easiest thing to do will be to just move those functions to the trait. Given the amount of conversations/issues I've found about this point, I think its best to get a second opinion here before I start moving things around.

That's what I had in my mind too, I think the idea is to use Session trait, wherever SessionState is needed especially outside of core (and move those functions, if it is not already present).

alamb · 2025-02-27T12:26:41Z

That's what I had in my mind too, I think the idea is to use Session trait, wherever SessionState is needed especially outside of core (and move those functions, if it is not already present).

Yes this is exactly right

AdamGS · 2025-02-27T17:08:07Z

Seems like that creates a dependency problem I'm not sure how to untangle. catalog::Session will need datasource::FiltFormatFactory, which needs FileFormat, which needs FileSource, which needs FileScanConfig.

I initially considered even returning Any instead of an Ard<dyn FileFormatFactoy> but many usages don't have a statically known format, and returning a FileFormat won't make a difference.
I can merge datafusion-datasource into datafusion-catalog, but also open to other ideas, and I'll also keep at it later today. Hopefully something will come up that won't require a huge breaking change.

logan-keede · 2025-02-27T18:35:52Z

Seems like that creates a dependency problem I'm not sure how to untangle. catalog::Session will need datasource::FiltFormatFactory, which needs FileFormat, which needs FileSource, which needs FileScanConfig

I remember contemplating over that, I think I had a loose sketch but I don't recall anything directly useful at the moment.

I can merge datafusion-datasource into datafusion-catalog, but also open to other ideas, and I'll also keep at it later today. Hopefully something will come up that won't require a huge breaking change.

I think we should try to avoid that mostly because it seems like a regression in term of modularity and code quality. (maybe we can only move some part of either or both crates up or down the dependency tree instead, not sure about this.)
It is my understanding that we do eventually want to move SessionState and everything else out of core, right?

Following on that,
I think the recent refactor has allowed us to move more stuff out of core (like CteWorktable and catalog-common), which may also be direct or indirect dependency of SessionState`.
I suggest that we take an Incremental approach and move those out first, that should help us identify what are the actual blockers* and then we can target them separately.

does that make sense?

*Tightly coupled code that is hindering refactor.

AdamGS · 2025-02-27T18:52:04Z

Makes perfect sense, obviously smashing two crates together just to solve a dependency issue seems like an overreaction, especially as there are so many other dependencies that can be untangled first.

In the context of this work - I'll try and split the file formats tonight, they should be less tied to SessionState than ListingTable. After that's done we can circle back and try and focus more effort specifically on SessionState.

AdamGS · 2025-02-28T11:49:36Z

I have a first version that seems to pass all tests/clippy, but I'm left with one issue before I publish it. In order to keep the tests in the format-specific crates, I pulled core as a dev-dependency using a path, so it can still be published.
Fixing depcheck to accept that is trivial, but I would love to know if that's an acceptable solution here or if I should move all tests up back to core where it has access to everything they need.

logan-keede · 2025-02-28T14:34:19Z

I have a first version that seems to pass all tests/clippy, but I'm left with one issue before I publish it. In order to keep the tests in the format-specific crates, I pulled core as a dev-dependency using a path, so it can still be published. Fixing depcheck to accept that is trivial, but I would love to know if that's an acceptable solution here or if I should move all tests up back to core where it has access to everything they need.

referring to #14671 (comment) and the fact that no crate has a dev-dependency on a crate above them in dependency tree (atleast none that I know of).

I think it is acceptable but not encouraged.
you may want to wait for a reply from @alamb or somebody else knowledgeable if you want to be sure.

AdamGS · 2025-02-28T14:49:05Z

Yeah also checked and noticed that no other crate does that currently. I'll move the tests back to their original locations for now, some can be spared but most do seem to require SessionState.

EDIT: Before I do that, I'll try and a more minimal Session implementation, I think at least some tests don't need much functionality to run.

AdamGS · 2025-03-01T13:18:20Z

The PR is out, but I suspect it's too big. I tried to get it in the 46 release, but now that I probably missed that, @alamb (or any other reviewers) would you prefer I split it up a bit?

alamb · 2025-03-03T13:33:33Z

The PR is out, but I suspect it's too big. I tried to get it in the 46 release, but now that I probably missed that, @alamb (or any other reviewers) would you prefer I split it up a bit?

I'll try and check it out today.

Seems like that creates a dependency problem I'm not sure how to untangle. catalog::Session will need datasource::FiltFormatFactory, which needs FileFormat, which needs FileSource, which needs FileScanConfig.

I was thinking about this one.

It seems like FileFormat and FileSource may now have some non trivial overlap in functionalty. Maybe we need to rethink some of the APIs / abstractions 🤔

alamb added the enhancement New feature or request label Feb 3, 2025

alamb mentioned this issue Feb 3, 2025

move resolve_table_references out of datafusion-catalog` #14441

Merged

github-actions bot assigned logan-keede Feb 3, 2025

logan-keede mentioned this issue Feb 13, 2025

refactor: Make catalog datasource #14643

Merged

getChan mentioned this issue Feb 13, 2025

WIP : create datafusion-datasource-avro crate #14651

Closed

logan-keede mentioned this issue Feb 14, 2025

refactor: move DataSource to datafusion-datasource #14671

Merged

AdamGS mentioned this issue Feb 23, 2025

Move FileSourceConfig and FileStream to the new datafusion-datasource #14838

Merged

alamb mentioned this issue Feb 24, 2025

Weekly Plan (Andrew Lamb) Feb 24, 2025 #14850

Closed

10 tasks

AdamGS mentioned this issue Feb 25, 2025

Move FileFormat and related pieces to datafusion-datasource #14873

Merged

AdamGS mentioned this issue Feb 28, 2025

Split out avro, parquet, json and csv into individual crates #14951

Merged

alamb mentioned this issue Mar 3, 2025

Weekly Plan (Andrew Lamb) March 3, 2025 #14978

Closed

12 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Epic] Split datasources out from `datafusion` crate (`datafusion/core`) #14444

[Epic] Split datasources out from `datafusion` crate (`datafusion/core`) #14444

alamb commented Feb 3, 2025

alamb commented Feb 3, 2025

logan-keede commented Feb 3, 2025

davisp commented Feb 5, 2025

davisp commented Feb 5, 2025

logan-keede commented Feb 6, 2025

davisp commented Feb 6, 2025

alamb commented Feb 6, 2025 •

edited

Loading

alamb commented Feb 12, 2025

logan-keede commented Feb 12, 2025 •

edited

Loading

AdamGS commented Feb 22, 2025

logan-keede commented Feb 22, 2025

alamb commented Feb 22, 2025

logan-keede commented Feb 23, 2025 •

edited

Loading

AdamGS commented Feb 23, 2025 •

edited

Loading

alamb commented Feb 23, 2025

AdamGS commented Feb 23, 2025 •

edited

Loading

AdamGS commented Feb 24, 2025

AdamGS commented Feb 25, 2025 •

edited

Loading

AdamGS commented Feb 27, 2025 •

edited

Loading

logan-keede commented Feb 27, 2025

alamb commented Feb 27, 2025

AdamGS commented Feb 27, 2025 •

edited

Loading

logan-keede commented Feb 27, 2025 •

edited

Loading

AdamGS commented Feb 27, 2025

AdamGS commented Feb 28, 2025

logan-keede commented Feb 28, 2025

AdamGS commented Feb 28, 2025 •

edited

Loading

AdamGS commented Mar 1, 2025

alamb commented Mar 3, 2025

[Epic] Split datasources out from datafusion crate (datafusion/core) #14444

[Epic] Split datasources out from datafusion crate (datafusion/core) #14444

Comments

alamb commented Feb 3, 2025

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

alamb commented Feb 3, 2025

logan-keede commented Feb 3, 2025

davisp commented Feb 5, 2025

davisp commented Feb 5, 2025

logan-keede commented Feb 6, 2025

davisp commented Feb 6, 2025

alamb commented Feb 6, 2025 • edited Loading

alamb commented Feb 12, 2025

logan-keede commented Feb 12, 2025 • edited Loading

AdamGS commented Feb 22, 2025

logan-keede commented Feb 22, 2025

alamb commented Feb 22, 2025

logan-keede commented Feb 23, 2025 • edited Loading

AdamGS commented Feb 23, 2025 • edited Loading

alamb commented Feb 23, 2025

AdamGS commented Feb 23, 2025 • edited Loading

AdamGS commented Feb 24, 2025

AdamGS commented Feb 25, 2025 • edited Loading

AdamGS commented Feb 27, 2025 • edited Loading

logan-keede commented Feb 27, 2025

alamb commented Feb 27, 2025

AdamGS commented Feb 27, 2025 • edited Loading

logan-keede commented Feb 27, 2025 • edited Loading

AdamGS commented Feb 27, 2025

AdamGS commented Feb 28, 2025

logan-keede commented Feb 28, 2025

AdamGS commented Feb 28, 2025 • edited Loading

AdamGS commented Mar 1, 2025

alamb commented Mar 3, 2025

[Epic] Split datasources out from `datafusion` crate (`datafusion/core`) #14444

[Epic] Split datasources out from `datafusion` crate (`datafusion/core`) #14444

alamb commented Feb 6, 2025 •

edited

Loading

logan-keede commented Feb 12, 2025 •

edited

Loading

logan-keede commented Feb 23, 2025 •

edited

Loading

AdamGS commented Feb 23, 2025 •

edited

Loading

AdamGS commented Feb 23, 2025 •

edited

Loading

AdamGS commented Feb 25, 2025 •

edited

Loading

AdamGS commented Feb 27, 2025 •

edited

Loading

AdamGS commented Feb 27, 2025 •

edited

Loading

logan-keede commented Feb 27, 2025 •

edited

Loading

AdamGS commented Feb 28, 2025 •

edited

Loading