Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor test suite in EnforceDistribution, to use standard test config. #15010

Merged
merged 2 commits into from
Mar 7, 2025

Conversation

wiedld
Copy link
Contributor

@wiedld wiedld commented Mar 5, 2025

Which issue does this PR close?

Part of #15003

Rationale for this change

Make it easier to determine the differences in test configuration, for each test case.
Also clearly defines what are the defaults used for the test suite.

What changes are included in this PR?

Define TestConfig, which requires one-of DoFirst optimizer run configs. Most test cases uses this test config with the default settings.

Provide a set of interfaces to define which configurations get tweaked.

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label Mar 5, 2025
Comment on lines +377 to +388
// By default, will not repartition / resort data if it is already sorted.
config.optimizer.prefer_existing_sort = false;

// By default, will attempt to convert Union to Interleave.
config.optimizer.prefer_existing_union = false;

// By default, will not repartition file scans.
config.optimizer.repartition_file_scans = false;
config.optimizer.repartition_file_min_size = 1024;

// By default, set query execution concurrency to 10.
config.execution.target_partitions = 10;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These match the defaults on main:

assert_optimized!($EXPECTED_LINES, $PLAN, $FIRST_ENFORCE_DIST, false, 10, false, 1024, false);

Comment on lines +390 to +391
// Use a small batch size, to trigger RoundRobin in tests
config.execution.batch_size = 1;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This matches main:

// Use a small batch size, to trigger RoundRobin in tests
config.execution.batch_size = 1;

@wiedld wiedld force-pushed the 15003/switch-to-builder branch from 9af23e0 to 82b0342 Compare March 5, 2025 00:28
Comment on lines -378 to -386
/// * `FIRST_ENFORCE_DIST` -
/// true: (EnforceDistribution, EnforceDistribution, EnforceSorting)
/// false: else runs (EnforceSorting, EnforceDistribution, EnforceDistribution)
/// * `PREFER_EXISTING_SORT` (optional) - if true, will not repartition / resort data if it is already sorted
/// * `TARGET_PARTITIONS` (optional) - number of partitions to repartition to
/// * `REPARTITION_FILE_SCANS` (optional) - if true, will repartition file scans
/// * `REPARTITION_FILE_MIN_SIZE` (optional) - minimum file size to repartition
/// * `PREFER_EXISTING_UNION` (optional) - if true, will not attempt to convert Union to Interleave
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are now replaced by the required DoFirst (for one-of 2 possible optimizer run patterns),
then all the optional settings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I very much like how it is much more explicit

@wiedld wiedld marked this pull request as ready for review March 5, 2025 00:55
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @wiedld -- I think this is a major step forward in test readability as all the relevant options are now explicit in the tests, rather than being true / false parameters where interpreting them requires looking at a macro definition.

cc @berkaysynnada and @mustafasrepo / @akurmustafa

Comment on lines -378 to -386
/// * `FIRST_ENFORCE_DIST` -
/// true: (EnforceDistribution, EnforceDistribution, EnforceSorting)
/// false: else runs (EnforceSorting, EnforceDistribution, EnforceDistribution)
/// * `PREFER_EXISTING_SORT` (optional) - if true, will not repartition / resort data if it is already sorted
/// * `TARGET_PARTITIONS` (optional) - number of partitions to repartition to
/// * `REPARTITION_FILE_SCANS` (optional) - if true, will repartition file scans
/// * `REPARTITION_FILE_MIN_SIZE` (optional) - minimum file size to repartition
/// * `PREFER_EXISTING_UNION` (optional) - if true, will not attempt to convert Union to Interleave
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I very much like how it is much more explicit

Comment on lines +717 to +721
assert_optimized!(
expected,
top_join.clone(),
&TestConfig::new(DoFirst::Distribution)
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking this a step more, what do you think of using a function on TestConfig so this looks like:

Suggested change
assert_optimized!(
expected,
top_join.clone(),
&TestConfig::new(DoFirst::Distribution)
);
TestConfig::new(DoFirst::Distribution)
.assert_optimized)
&expected,
&top_join)
);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea. I was already planning for the next followup PR to change the macro to a functional call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the draft of the followup PR: influxdata#61

Just to get an idea of where this is heading.

@berkaysynnada
Copy link
Contributor

I'll review this tomorrow as well

@alamb
Copy link
Contributor

alamb commented Mar 6, 2025

BTW @blaginin is proposing to add insta in this PR:

If we merge that some of these tests may become much eaiser to maintain / change 🤔

@alamb
Copy link
Contributor

alamb commented Mar 6, 2025

Let's plan to merge this PR tomorrow / after @berkaysynnada has had a chance to review

Copy link
Contributor

@berkaysynnada berkaysynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the idea. Interpreting these tests becomes much clearer. Thank you @wiedld. I just have two questions, but perhaps I'm missing something as I'm a bit tired today 😅

true,
repartition_size,
false
&test_config.clone().with_query_execution_partitions(3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we missing prefer_existing_sort and repartition_file_scans here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put the common config a few lines up here: https://github.com/apache/datafusion/pull/15010/files/5de55b3a8bf1858206dcecc0d3b1907228c970a4#r1984443646. Then only modified the unique config (with_query_execution_partitions) per test case.

true,
repartition_size,
false
&test_config.with_query_execution_partitions(8)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and here?

Comment on lines +2541 to +2543
let test_config = TestConfig::new(DoFirst::Distribution)
.with_prefer_existing_sort()
.with_prefer_repartition_file_scans(1);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the common config.

@wiedld
Copy link
Contributor Author

wiedld commented Mar 7, 2025

I just have two questions, but perhaps I'm missing something as I'm a bit tired today 😅

That's my everyday. 😆
Thanks for taking the time to review.

@alamb
Copy link
Contributor

alamb commented Mar 7, 2025

Thank you @berkaysynnada and @wiedld

@alamb alamb merged commit 297af95 into apache:main Mar 7, 2025
24 checks passed
danila-b pushed a commit to danila-b/datafusion that referenced this pull request Mar 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants