Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reuse last projection layer when renaming columns #14684

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

blaginin
Copy link
Contributor

Which issue does this PR close?

Related to #14563

Rationale for this change

Currently, when with_column_renamed is called, a new Projection layer is always created. For example, in the dataframe benchmark, logical plan looks like this:

Zen Browser 2025-02-15 18 38 03

These layers do not affect the query itself (as they are removed by optimize_projections), but they make renaming new columns (and optimization itself) quite slow.

What changes are included in this PR?

I added an optimization for one edge case - when there's already a projection on top and we're adding a new one. This is a common scenario, especially when renaming many columns (as in the dataframe benchmark). Instead of adding a new projection layer on top, we replace the existing one if possible.

Are these changes tested?

Extended test case

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label Feb 15, 2025
@blaginin
Copy link
Contributor Author

group                                                main                                    one-layer
with_column_10                                       1.39   764.9±10.37µs        ? ?/sec     1.00    550.0±4.71µs        ? ?/sec
with_column_100                                      3.88  1490.1±22.27ms        ? ?/sec     1.00    384.5±3.24ms        ? ?/sec
with_column_200                                      5.63      36.6±0.43s        ? ?/sec     1.00       6.5±0.63s        ? ?/sec

@Omega359
Copy link
Contributor

Quite a speed improvement! I'll look at the code in a bit

.with_column_renamed("c4", "boom")?;

assert_eq!("\
Projection: aggregate_test_100.c1 AS one, aggregate_test_100.c2 AS two, aggregate_test_100.c3, aggregate_test_100.c2 + aggregate_test_100.c3 AS sum AS total\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AS sum AS total is valid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the thorough reading 🙏 AS sum AS total happens because there's an alias over an alias - I believe it's harmless since it just wraps the original expression.

There's a function unalias_nested that can resolve this, but I don't want to call it because it's recursive and expensive. A potentially better option IMO is to reuse the existing alias in alias_qualified and alias if it's already there (similar to this PR). This feels like a separate improvement from this PR though

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just worry that another column referenced 'sum' and then it is aliased to 'total' and what the behaviour of that would be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I don't think that's possible? "total" cannot be referenced within the same projection, because for that projection it is not in its input. But to display it in a nicer way, I'll do #14781

SchemaError::FieldNotFound { .. },
_,
)) => {
self.plan = LogicalPlan::Projection(
Copy link
Contributor

@Omega359 Omega359 Feb 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the field is not found why are we not just returning Ok(self)? I am unsure why we would need to add a projection for the plan in this case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to do that because self was destructed and so I need to put plan back. try_new_with_schema is quite cheap and this shouldn't be happening often anyway

})
.collect();
LogicalPlan::Projection(Projection::try_new(expr, input)?)
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be a lot of duplication here. I wonder if there is a way we could reduce that? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's very fair! I also dislike repeating qualifier_rename, field_rename twice, but it's happening because they reference plan and must come after its destruction (hence needed twice)

@blaginin blaginin marked this pull request as ready for review February 17, 2025 19:08
@blaginin blaginin requested a review from Omega359 February 19, 2025 14:59
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the status of this PR?

I took a brief look at this PR and it seems like it may have been superceded by this one from @Omega359

@blaginin
Copy link
Contributor Author

blaginin commented Mar 5, 2025

Hey, I still want to merge this one. #14653 has indeed made the benches much faster, but I think it's still good to make the logical plan smaller if we can

@alamb
Copy link
Contributor

alamb commented Mar 5, 2025

Looks to me like there are a few outstanding items:

  1. Reduce the code repetition if possible
  2. Merge up to resolve conflicts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants