Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move FileSourceConfig and FileStream to the new datafusion-datasource #14838

Merged
merged 7 commits into from
Feb 25, 2025

Conversation

AdamGS
Copy link
Contributor

@AdamGS AdamGS commented Feb 23, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate proto Related to proto crate labels Feb 23, 2025
@AdamGS AdamGS changed the title [WIP] [WIP] Move FileSourceConfig and FileStream to the new datafusion-datasource Feb 23, 2025
@github-actions github-actions bot added the common Related to common crate label Feb 23, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking very cool!

Thank you @AdamGS

@@ -40,7 +40,7 @@ nested_expressions = ["datafusion-functions-nested"]
# This feature is deprecated. Use the `nested_expressions` feature instead.
array_expressions = ["nested_expressions"]
# Used to enable the avro format
avro = ["apache-avro", "num-traits", "datafusion-common/avro"]
avro = ["apache-avro", "num-traits", "datafusion-common/avro", "datafusion-datasource/avro"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can remove the avro stuff from datafusion-common 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the next PR will try and do that, but some will always stay even if just for stuff like AvroError/ParquetError

@@ -69,6 +71,7 @@ xz2 = { version = "0.1", optional = true, features = ["static"] }
zstd = { version = "0.13", optional = true, default-features = false }

[dev-dependencies]
datafusion = { workspace = true, default-features = true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically to avoid this type of dependnecy there are two options:

  1. Rewrite the tests to not use stuff (like SessionContext) that are in the core
  2. Move the relevant tests somewhere into the core tests -- like datafusion/core/tests/sql_integration.rs

Copy link
Contributor Author

@AdamGS AdamGS Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like some were unnecessary, and some were already moved into memory.rs, so I just moved it up under #[cfg(test)].

@@ -28,7 +28,7 @@ use std::{error::Error, path::PathBuf};
///
/// Expects to be called about like this:
///
/// `assert_batch_eq!(expected_lines: &[&str], batches: &[RecordBatch])`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was just a typo

Comment on lines 69 to 75
/// ```ignore
/// # use std::sync::Arc;
/// # use arrow::datatypes::{Field, Fields, DataType, Schema};
/// # use datafusion_datasource::PartitionedFile;
/// # use datafusion_datasource::file_scan_config::FileScanConfig;
/// # use datafusion_execution::object_store::ObjectStoreUrl;
/// # use datafusion::datasource::physical_plan::ArrowSource;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to ignore this very good rustdoc test because we don't have ArrowSource here. My plan is to only have that as a temporary thing between PRs, with the fix being either:

  1. Have some InMemoryFileSource here that just gets Vec<RecordBatch> and knows how to return it through FileSource (or maybe even the MockSource I added for tests somewhere in this PR).
  2. Potentially put ArrowSource in this crate as the "canonical" source, especially given how small it is at less than 200 lines of actual code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a weird example anyways because it actually refers to ParquetFiles

I took the liberty of pushing a commit to this branch to restore the doc test by mocking out a ParquetSource in 35e6ab7

@AdamGS AdamGS marked this pull request as ready for review February 24, 2025 20:57
@alamb alamb changed the title [WIP] Move FileSourceConfig and FileStream to the new datafusion-datasource Move FileSourceConfig and FileStream to the new datafusion-datasource Feb 24, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @AdamGS -- this is great

Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

@xudong963 xudong963 merged commit 9285b84 into apache:main Feb 25, 2025
26 checks passed
@alamb
Copy link
Contributor

alamb commented Feb 25, 2025

woohoo!

@AdamGS
Copy link
Contributor Author

AdamGS commented Feb 25, 2025

🥳

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate proto Related to proto crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants