Support WITHIN GROUP syntax to standardize certain existing aggregate functions #13511
FYI @Dandandan
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
The work is done in my local repository. I will commit the changes and write comments this week after a final self-review.
* Ensure compatibility with new `within_group` and `order_by` handling.
* Adjust tests and examples to align with the new logic.
* Add test cases for changed signature
* Update signature in docs
aa23b24 to cf4faad
Thanks for updating this @Garamda, it's definitely much simpler now ✨
I've left some additional comments.
"| 10 |",
"+---------------------------------------------+",
"+----------------------------------------------------------------------------------+",
"| approx_percentile_cont(test.b,Float64(0.5)) WITHIN GROUP [test.b ASC NULLS LAST] |",
This name looks wrong to me. Shouldn't the `test.b` argument be only in the WITHIN GROUP section, like `approx_percentile_cont(Float64(0.5)) WITHIN GROUP [test.b ASC NULLS LAST]`?
Thank you for identifying that. I have refactored the code.
datafusion/expr/src/udaf.rs
Outdated
/// If this function is ordered-set aggregate function, return true
/// If the function is not, return false
fn is_ordered_set_aggregate(&self) -> Option<bool> {
    None
Would it make sense for this to just return `true` if it is an ordered-set aggregate function and `false` otherwise, and avoid the `Option` entirely?
I agree. I have applied it to the code.
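A minimal sketch of the simplified shape discussed in this thread. It uses a hypothetical stand-in trait (`AggregateLike`) and illustrative structs rather than DataFusion's real trait, so the names here are assumptions. The default method returns `false`, and only an ordered-set aggregate overrides it.

```rust
/// Hypothetical stand-in for an aggregate UDF trait; not DataFusion's real trait.
trait AggregateLike {
    fn name(&self) -> &str;

    /// Returns true for ordered-set aggregates (those called with WITHIN GROUP).
    /// Defaults to false, so ordinary aggregates need no override.
    fn is_ordered_set_aggregate(&self) -> bool {
        false
    }
}

/// Illustrative ordinary aggregate: relies on the default.
struct Sum;

/// Illustrative ordered-set aggregate: overrides the default.
struct ApproxPercentileCont;

impl AggregateLike for Sum {
    fn name(&self) -> &str {
        "sum"
    }
}

impl AggregateLike for ApproxPercentileCont {
    fn name(&self) -> &str {
        "approx_percentile_cont"
    }

    fn is_ordered_set_aggregate(&self) -> bool {
        true
    }
}

/// Callers can branch on a plain bool with no `None` case to interpret.
fn describe(f: &dyn AggregateLike) -> String {
    if f.is_ordered_set_aggregate() {
        format!("{} requires WITHIN GROUP", f.name())
    } else {
        format!("{} is an ordinary aggregate", f.name())
    }
}
```

With a plain `bool`, a call such as `describe(&ApproxPercentileCont)` has exactly two outcomes to handle, which is the simplification the review asks for.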
/// Otherwise return None (the default)
fn supports_null_handling_clause(&self) -> Option<bool> {
    None
}
Is this something we need? From what I know, there aren't any aggregate functions that have options for null handling. At the moment, the 2 overrides you have of this both return `Some(false)`, which is what I would consider the default value anyway.
Speaking of which, if we do need this, do we need to return an `Option<bool>` or could we just return `bool` directly?
There are some aggregate functions using null handling in current DataFusion.
(cf. If this is something we need to discuss or fix, I can open a separate issue, or refactor it in this PR. I left this comment because I am not 100% sure about the SQL standard.)
I have also refactored the function to just return `bool`.
This was smelling odd, so I dug a bit deeper. I think you've inadvertently stumbled into something even weirder than you anticipated.
The example you've linked is
SELECT FIRST_VALUE(column1) RESPECT NULLS FROM t;
which I don't think is a valid query because `first_value` should not be an aggregate function, or at the very least the above query is not valid in most SQL dialects. `first_value` is actually a window function in other engines (e.g. Trino, Postgres, MySQL).
If you try running something like
SELECT first_value(column1) FROM t;
against Postgres you get an error like
Query Error: window function first_value requires an OVER clause
The `RESPECT NULLS | IGNORE NULLS` options are only a property of certain window functions, hence we shouldn't need to track them for aggregate functions.
I'm going to file a ticket for the above.
Filed #15006
@@ -51,29 +52,43 @@ create_func!(ApproxPercentileCont, approx_percentile_cont_udaf);

/// Computes the approximate percentile continuous of a set of numbers
pub fn approx_percentile_cont(
    expression: Expr,
    within_group: Option<Vec<Sort>>,
Good call going from `Expr` to `Sort` here, as this lets us capture ASC/DESC, which does appear to be valid in some engines (e.g. SQL Server). It would probably be good to add a test for this if there isn't already one.
Separately, why is this `Option<Vec<Sort>>`? As I understand it, `approx_percentile_cont` must always have a WITHIN GROUP clause with a single ordering expression, so I would expect this to just be `Sort` to reflect that.
It would probably be good to add a test for this if there isn't already
Do you mean a test case like this one: #13511 (comment)?
If I misunderstood, please let me know.
Update: I have also added test cases for the DataFrame function.
Separately, why is this Option<Vec<Sort>>? As I understand it approx_percentile_cont must always have a WITHIN GROUP clause with a single ordering expression, so I would expect this to just be Sort to reflect that.
I removed `Option`, because a `Sort` is required for an ordered-set aggregate function, as you mentioned. I also removed `Vec`.
(cf. I initially defined it as `Vec<Sort>` because some aggregate functions support multiple ordering expressions in the `WITHIN GROUP` clause. However, I have now confirmed that even those DB engines (e.g. Oracle, SQL Server) use only a single ordering expression in `percentile_cont` and `approx_percentile`.)
Thank you for your guidance.
Overall, this looks good to me ✨
I did leave a couple of minor comments, as well as some bigger ones, but I think this is ready for review by someone else. Thanks for bearing with me as I reviewed this, it was a good chance for me to look at new parts of DataFusion.
One big question we might need to answer before merging is if we need a migration strategy for this. Because we now require `WITHIN GROUP` for these functions, any users who have queries stored outside of DataFusion will experience breakages that they can't work around. If we want to provide a migration path, we may need to support having both forms of calling these functions, as in
SELECT approx_percentile_cont(column_name, 0.75, 100) FROM table_name;
SELECT approx_percentile_cont(0.75, 100) WITHIN GROUP (ORDER BY column_name) FROM table_name;
for at least 1 release so folks can migrate their queries.
datafusion/expr/src/expr.rs
Outdated
@@ -295,6 +295,8 @@ pub enum Expr {
    /// See also [`ExprFunctionExt`] to set these fields.
    ///
    /// [`ExprFunctionExt`]: crate::expr_fn::ExprFunctionExt
    ///
    /// cf. `WITHIN GROUP` is converted to `ORDER BY` internally in `datafusion/sql/src/expr/function.rs`
minor/opinionated: I'm not sure if it's worth mentioning this at all here. `WITHIN GROUP` is effectively an `ORDER BY` specified differently. This only matters at the SQL layer, and you handle and explain it there already.
@@ -51,29 +52,39 @@ create_func!(ApproxPercentileCont, approx_percentile_cont_udaf);

/// Computes the approximate percentile continuous of a set of numbers
pub fn approx_percentile_cont(
    expression: Expr,
    within_group: Sort,
minor/opinionated: I think `order_by` would be a clearer name for this, as the `WITHIN GROUP` is really just a wrapper around the `ORDER BY` clause.
let percentile = if is_descending {
    1.0 - percentile
} else {
    percentile
};
This seems reasonable to me, but I don't have that much experience on the execution side of things.
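The flip in the snippet above can be checked against an exact (non-approximate) percentile: with linear interpolation over ordered values, percentile `p` of the data in descending order equals percentile `1 - p` of the same data in ascending order. The sketch below uses hypothetical stand-alone helpers (`percentile_cont`, `percentile_with_direction`) and is not DataFusion's approximate implementation.

```rust
/// Exact continuous percentile via linear interpolation over values that
/// are already ordered. Hypothetical helper for illustration only.
fn percentile_cont(ordered: &[f64], p: f64) -> f64 {
    assert!(!ordered.is_empty() && (0.0..=1.0).contains(&p));
    // Fractional rank in [0, len - 1].
    let rank = p * (ordered.len() - 1) as f64;
    let lo = rank.floor() as usize;
    let hi = rank.ceil() as usize;
    let frac = rank - lo as f64;
    ordered[lo] + (ordered[hi] - ordered[lo]) * frac
}

/// Mirrors the reviewed snippet: a descending sort is handled by
/// evaluating the ascending data at `1.0 - percentile`.
fn percentile_with_direction(ascending: &[f64], percentile: f64, is_descending: bool) -> f64 {
    let percentile = if is_descending {
        1.0 - percentile
    } else {
        percentile
    };
    percentile_cont(ascending, percentile)
}
```

For example, with the ascending values `[1.0, 2.0, 3.0, 4.0, 5.0]`, asking for percentile 0.25 in descending order is the same as asking for percentile 0.75 in ascending order, so no second sorted copy of the data is needed.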
datafusion/sql/src/expr/function.rs
Outdated
if within_group.len() > 1 {
    return not_impl_err!(
        "Multiple column ordering in WITHIN GROUP clause is not supported"
Minor wording suggestion:
Only a single ordering expression is permitted in a WITHIN GROUP clause
which explicitly points users to what they should do, instead of telling them what they can't.
datafusion/sql/src/expr/function.rs
Outdated
if !within_group.is_empty() {
    return not_impl_err!("WITHIN GROUP is not supported yet: {within_group:?}");
if !within_group.is_empty() && order_by.is_some() {
    return plan_err!("ORDER BY clause is only permitted in WITHIN GROUP clause when a WITHIN GROUP is used");
minor: I just noticed that in the block above there is a check for duplicate ORDER BYs. I think it would be good to fold this into that check
FunctionArgumentClause::OrderBy(oby) => {
    if order_by.is_some() { // can check for within group here
        return not_impl_err!("Calling {name}: Duplicated ORDER BY clause in function arguments");
    }
    order_by = Some(oby);
}
to consolidate the handling into one place.
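A self-contained sketch of the consolidation suggested above, assuming simplified stand-in types: `ArgClause`, `collect_order_by`, and the second error message are hypothetical and do not come from DataFusion or sqlparser. Both the duplicate `ORDER BY` check and the `WITHIN GROUP` conflict are handled in the same match arm.

```rust
/// Hypothetical, simplified stand-in for the parser's argument clause type.
#[derive(Debug)]
enum ArgClause {
    OrderBy(Vec<String>),
    /// Any other clause; ignored in this sketch.
    Other,
}

/// Collects the single ORDER BY clause from the argument list, rejecting
/// duplicates and any mix with WITHIN GROUP in one place.
fn collect_order_by(
    name: &str,
    clauses: Vec<ArgClause>,
    within_group: &[String],
) -> Result<Option<Vec<String>>, String> {
    let mut order_by = None;
    for clause in clauses {
        match clause {
            ArgClause::OrderBy(oby) => {
                if order_by.is_some() {
                    return Err(format!(
                        "Calling {name}: Duplicated ORDER BY clause in function arguments"
                    ));
                }
                // Hypothetical message: the conflict check lives next to the
                // duplicate check instead of in a separate block.
                if !within_group.is_empty() {
                    return Err(format!(
                        "Calling {name}: ORDER BY cannot be combined with WITHIN GROUP"
                    ));
                }
                order_by = Some(oby);
            }
            ArgClause::Other => {}
        }
    }
    Ok(order_by)
}
```

Keeping both checks in one arm means a reader looking for the rules on `ORDER BY` inside function arguments finds them together.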
"[IGNORE | RESPECT] NULLS are not permitted for {}", | ||
fm.name() | ||
); | ||
} |
Per my point about `[IGNORE | RESPECT] NULLS` being a property of window functions, I don't think we need this check here.
Thanks for bearing with me as I reviewed this, it was a good chance for me to look at new parts of DataFusion.
Thank you again for the thorough review. 👍
This PR is now much simpler, clearer, and better.
One big question we might need to answer before merging is if we need a migration strategy for this. Because we now require WITHIN GROUP for these functions, any users who have queries stored outside of DataFusion will experience breakages that they can't work around. If we want to provide a migration path, we may need to support having both forms of calling these functions, as in
SELECT approx_percentile_cont(column_name, 0.75, 100) FROM table_name;
SELECT approx_percentile_cont(0.75, 100) WITHIN GROUP (ORDER BY column_name) FROM table_name;
for at least 1 release so folks can migrate their queries.
This was one of my biggest concerns when I started working on this feature.
If the community decides on a migration strategy like that, I will support both syntaxes.
I will also file an issue to track the plan so that the current syntax can be removed on schedule (if I am authorized to do so).
This was smelling odd, so I dug a bit deeper. I think you've inadvertently stumbled into something even weirder than you anticipated
...
The RESPECT NULLS | IGNORE NULLS options is only a property of certain window functions, hence we shouldn't need to track it for aggregate functions. I'm going to file a ticket for the above.
...
Per my point about [IGNORE | RESPECT] NULLS being a property of window functions, I don't think we need this check here.
I understand and agree with your guidance.
I will follow what is decided in the issue you filed, and will remove the related code once a decision is made.
cf. I have applied all the review comments you tagged 'minor', since I agreed with them.
Pinging @Dandandan for committer review as they filed the ticket this fix is for.
* Uses order by consistently after done with sql
* Remove redundant comment
* Serve more clear error msg
* Handle error cases in the same code block
Which issue does this PR close?
Closes #11732. (cc. #12824)
Rationale for this change
As described in #11732, certain aggregate functions need to be standardized as ordered-set aggregate functions.
What changes are included in this PR?
* Add and handle within_group field
* handle within_group field with existing function arguments
Session state
* add ordered set aggregate function information in session (since this needs to be handled specifically in certain cases)
Substrait
* add within_group field in proto
* handle within_group in producer & consumer
Are these changes tested?
Are there any user-facing changes?
approx_percentile_cont(expression, percentile, centroids)
approx_percentile_cont(percentile, centroids) WITHIN GROUP (ORDER BY expression)
approx_percentile_cont_with_weight(expression, weight, percentile)
approx_percentile_cont_with_weight(weight, percentile) WITHIN GROUP (ORDER BY expression)
An `api change` label may be required, which I am not authorized to add.