Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split out avro, parquet, json and csv into individual crates #14951

Merged
merged 10 commits into from
Mar 4, 2025

Conversation

AdamGS
Copy link
Contributor

@AdamGS AdamGS commented Feb 28, 2025

Which issue does this PR close?

Part of #14444.

The PR is obviously huge, but because the file are split (keeping tests in core), github doesn't understand some of the changes as moves, even though that's the majority of the change.
I've read through it a couple of times and kept finding small things that can be improved and I'll probably do that again during the weekend.

What changes are included in this PR?

Moving all format-specific functionality into dedicated crates, introducing:

  • datafusion-datasource-csv
  • datafusion-datasource-json
  • datafusion-datasource-avro
  • datafusion-datasource-parquet

Along the way I managed to move some stuff into datasource and tried multiple times to move tests infrastructure, but I think I ended pretty close to where I started.

Are these changes tested?

Are there any user-facing changes?

I tried to maintain the current exposed modules and types, and keep them sane in general.
I'm not sure I managed to keep all of them but I think none got significantly more complex to pull in.

There is more public API surface now, as some of the shared code used its location to stay pub(crate) or just private (for example - the Decoder trait and related pieces).

As discussed, avro is now fully behind a feature flag.

I also moved a couple of functions from SessionState to session, one tiny step towards that goal.

@github-actions github-actions bot added core Core DataFusion crate catalog Related to the catalog crate labels Feb 28, 2025
Copy link
Contributor

@logan-keede logan-keede left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for continuing on this @AdamGS.

I have some minor comments/ suggestion, but I think it might be okay to address them in follow up PR.
+1(non-binding)

cc @alamb

pub fn default_table_options(&self) -> TableOptions {
Session::default_table_options(self)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think default_table_options and table_options only need to be part of impl Session rather than
SessionState

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only concern is that might be a breaking change for some users, but its probably very minor one in the scope of this PR and other changes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reread the code around it, and seems like the Session implementation mostly delegates to the actual implementation, which saves bringing it into scope in a ton of locations across the codebase.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that since Session trait is implemented for SessionState, you should be able to call some_SessionState_struct.default_table_options(), It seems that you are essentially overriding that function with itself(hence the original suggestion of removing it) unless I am missing something. Please let me know if that's the case!!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I change the function here to:

/// return the TableOptions options with its extensions
pub fn default_table_options(&self) -> TableOptions {
    self.default_table_options()
}

clippy tells me its a recursion and the stack blows up when I run tests.
I think the default implementation here messes with with just having the trait implementation delegate to the struct. Of course I can copy it here and make the implementation delegate to it but that seems like unnecessary code duplication.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, this seems like the proper way of doing things. Thanks for clarification!!

pub fn default_table_options(&self) -> TableOptions {
Session::default_table_options(self)
}

/// Return mutable table options
pub fn table_options_mut(&mut self) -> &mut TableOptions {
&mut self.table_options
Copy link
Contributor

@logan-keede logan-keede Mar 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps table_options_mut should also be a part of Session

@@ -501,7 +501,7 @@ impl FileScanConfig {
(schema, constraints, stats, output_ordering)
}

#[cfg_attr(not(feature = "avro"), allow(unused))] // Only used by avro
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is my understanding that this is being removed to avoid dependency on avro.
I am not sure what is better between this and empty feature.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its a pub fn -- I think we should simply deprecate it if it isn't needed

Copy link
Contributor

@logan-keede logan-keede Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is still in use (in avro-source with_projection function), just not detected

@github-actions github-actions bot added the proto Related to proto crate label Mar 1, 2025
@alamb
Copy link
Contributor

alamb commented Mar 2, 2025

Thank you -- I will look at this shortly


#[cfg(test)]
#[cfg(feature = "avro")]
mod tests {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be nice (eventually) to move these tests to core_integration somewhere

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I didn't realize that was a thing. Glad to move everything there (in this PR or a different one).

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @AdamGS -- given the scope of this PR I can't review it line by line, but I think that is ok. We can get this new structure in and then do some additional cleanup PRs.

I think it makes the code much easier to work with

I'll plan to merge it tomorrow early to try and avoid accumulating conflicts

@@ -501,7 +501,7 @@ impl FileScanConfig {
(schema, constraints, stats, output_ordering)
}

#[cfg_attr(not(feature = "avro"), allow(unused))] // Only used by avro
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its a pub fn -- I think we should simply deprecate it if it isn't needed

@alamb
Copy link
Contributor

alamb commented Mar 3, 2025

Thank you @logan-keede for your thoughts and suggestions

@AdamGS
Copy link
Contributor Author

AdamGS commented Mar 3, 2025

projected_file_column_names only has one call site in Avro. I'm glad to add a deprecation notice from 47 and beyond and just put put the code there directly in the relevant place in AvroSource, just give me the +1.

@AdamGS
Copy link
Contributor Author

AdamGS commented Mar 3, 2025

actually - given that the avro feature is completly gone from datafusion-datasource I think we can just delete it here, the logican can be a private function in datasource-avro/src/source.rs.

@logan-keede
Copy link
Contributor

logan-keede commented Mar 3, 2025

projected_file_column_names only has one call site in Avro. I'm glad to add a deprecation notice from 47 and beyond and just put put the code there directly in the relevant place in AvroSource, just give me the +1.

That would be nice.

actually - given that the avro feature is completly gone from datafusion-datasource I think we can just delete it here, the logican can be a private function in datasource-avro/src/source.rs.

I am not too sure about this. It is only meant to be used by avro but maybe, some user might also be using this function as it was not behind a feature gate. (again, not sure if it is possible to be used like this).

@AdamGS
Copy link
Contributor Author

AdamGS commented Mar 3, 2025

After some more thinking - while I couldn't find any use of this function in an active codebase on github, I can see why its useful (or at least nice). I just removed the unused on it, I don't see a real reason to deprecate it unless there's some bigger re-thinking of the API here, it seems pretty consistent with functions like projected_schema or projected_stats.

@github-actions github-actions bot added the datasource Changes to the datasource crate label Mar 3, 2025
@alamb
Copy link
Contributor

alamb commented Mar 4, 2025

Thanks again @AdamGS and @logan-keede -- this is amazing progress. Something I have thought was important for the last t

@alamb alamb merged commit 6d5e00a into apache:main Mar 4, 2025
26 checks passed
@alamb
Copy link
Contributor

alamb commented Mar 4, 2025

I am merging this one in so it doesn't accumulate conflicts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
catalog Related to the catalog crate core Core DataFusion crate datasource Changes to the datasource crate proto Related to proto crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants