Support WITHIN GROUP syntax to standardize certain existing aggregate functions #13511

Garamda · 2024-11-21T03:36:45Z

Which issue does this PR close?

Closes #11732. (cc. #12824)

Rationale for this change

As described in #11732, certain aggregate functions need to be standardized as ordered-set aggregate functions.

What changes are included in this PR?

- SQL
  - Use the WITHIN GROUP clause
- Logical plan
  - ~~Add and~~ handle the within_group field
- Physical plan
  - ~~handle within_group field with existing function arguments~~
  - Support descending order (DESC) in the accumulator
- Dataframe
  - Change the function signature to take within_group as a parameter
- ~~Session state~~
  - ~~add ordered set aggregate function information in session (since this needs to be handled specifically in certain cases)~~
- ~~Substrait~~
  - ~~add within_group field in proto~~
  - ~~handle within_group in producer & consumer~~
- Test
  - Reorganize existing test cases for the modified syntax
  - Add new cases
- Docs

Are these changes tested?

Yes. (with existing / modified / new test cases)

Are there any user-facing changes?

Yes
- approx_percentile_cont
  - AS-IS : approx_percentile_cont(expression, percentile, centroids)
  - TO-BE : approx_percentile_cont(percentile, centroids) WITHIN GROUP (ORDER BY expression)
- approx_percentile_cont_with_weight
  - AS-IS : approx_percentile_cont_with_weight(expression, weight, percentile)
  - TO-BE : approx_percentile_cont_with_weight(weight, percentile) WITHIN GROUP (ORDER BY expression)
The documentation has been updated to reflect these changes.
An `api change` label may be required, which I am not authorized to add.

alamb · 2024-11-21T21:38:09Z

FYI @Dandandan

github-actions · 2025-01-21T01:57:29Z

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

Garamda · 2025-01-21T02:12:03Z

> Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

The work is done in my local repository. I will commit the changes and write comments this week after a final self-review.

vbarua

Thanks for updating this @Garamda, it's definitely much simpler now ✨
I've left some additional comments.

vbarua · 2025-03-01T00:02:35Z

datafusion/core/tests/dataframe/dataframe_functions.rs

-        "| 10                                          |",
-        "+---------------------------------------------+",
+        "+----------------------------------------------------------------------------------+",
+        "| approx_percentile_cont(test.b,Float64(0.5)) WITHIN GROUP [test.b ASC NULLS LAST] |",


This name looks wrong to me. Shouldn't the test.b argument be only in the WITHIN GROUP section like

approx_percentile_cont(Float64(0.5)) WITHIN GROUP [test.b ASC NULLS LAST]

Thank you for identifying that. I have refactored the code.

vbarua · 2025-03-01T00:04:46Z

datafusion/expr/src/udaf.rs

+    /// If this function is ordered-set aggregate function, return true
+    /// If the function is not, return false
+    fn is_ordered_set_aggregate(&self) -> Option<bool> {
+        None


Would it make sense for this to just return true if it is an ordered-set aggregate function and false otherwise, and avoid the Option entirely?

I agree. I have applied it to the code.
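
To make the suggestion concrete, here is a minimal, self-contained sketch of a default method that returns a plain `bool`. `AggregateLike`, `Sum`, `ApproxPercentileCont`, and `requires_within_group` are hypothetical stand-ins, not DataFusion's actual trait or implementations; only the method name `is_ordered_set_aggregate` is taken from the diff above.

```rust
/// Hypothetical stand-in for an aggregate UDF trait (not DataFusion's real trait).
trait AggregateLike {
    fn name(&self) -> &str;

    /// Returns true for ordered-set aggregates; plain aggregates keep the default.
    fn is_ordered_set_aggregate(&self) -> bool {
        false
    }
}

struct Sum;
struct ApproxPercentileCont;

impl AggregateLike for Sum {
    fn name(&self) -> &str {
        "sum"
    }
}

impl AggregateLike for ApproxPercentileCont {
    fn name(&self) -> &str {
        "approx_percentile_cont"
    }

    // Only ordered-set aggregates need to override the default.
    fn is_ordered_set_aggregate(&self) -> bool {
        true
    }
}

/// Callers branch on a plain bool; there is no `None` case to interpret.
fn requires_within_group(func: &dyn AggregateLike) -> bool {
    func.is_ordered_set_aggregate()
}
```

With a `bool` return type, the default of `false` plays the role that `None` played before, so call sites no longer need `unwrap_or(false)` or similar handling.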

vbarua · 2025-03-01T00:13:34Z

datafusion/expr/src/udaf.rs

+    /// Otherwise return None (the default)
+    fn supports_null_handling_clause(&self) -> Option<bool> {
+        None
+    }


Is this something we need? From what I know, there aren't any aggregate functions that have options for null handling. At the moment, the 2 overrides you have of this both return Some(false), which is what I would consider the default value anyways.

Speaking of which, if we do need this, do we need to return an Option<bool> or could we just return bool directly?

Some aggregate functions in DataFusion currently use null handling.
(If this is something we need to discuss or fix, I can open a separate issue, or I can refactor it in this PR as well. I left this comment because I am not 100% sure what the SQL standard says.)

I also refactored the function to return bool directly.

This was smelling odd, so I dug a bit deeper. I think you've inadvertently stumbled into something even weirder than you anticipated.

The example you've linked is

SELECT FIRST_VALUE(column1) RESPECT NULLS FROM t;

which I don't think is a valid query because first_value should not be an aggregate function, or at the very least the above query is not valid in most SQL dialects. first_value is actually a window function in other engines (e.g. Trino, Postgres, MySQL).

If you try running something like

SELECT first_value(column1) FROM t;

against Postgres you get an error like

Query Error: window function first_value requires an OVER clause

dbfiddle

The RESPECT NULLS | IGNORE NULLS option is only a property of certain window functions, hence we shouldn't need to track it for aggregate functions.

I'm going to file a ticket for the above.

Filed #15006

vbarua · 2025-03-01T00:38:07Z

datafusion/functions-aggregate/src/approx_percentile_cont.rs

@@ -51,29 +52,43 @@ create_func!(ApproxPercentileCont, approx_percentile_cont_udaf);

 /// Computes the approximate percentile continuous of a set of numbers
 pub fn approx_percentile_cont(
-    expression: Expr,
+    within_group: Option<Vec<Sort>>,


Good call going from Expr to Sort here, as this lets us capture ASC/DESC, which does appear to be valid in some engines (e.g. SQL Server). It would probably be good to add a test for this if there isn't already one.

Separately, why is this Option<Vec<Sort>>? As I understand it approx_percentile_cont must always have a WITHIN GROUP clause with a single ordering expression, so I would expect this to just be Sort to reflect that.

> It would probably be good to add a test for this if there isn't already

Do you mean a test case like the one in #13511 (comment)?
If I misunderstood, please let me know.

Update: I also added test cases for the dataframe function.

> Separately, why is this Option<Vec<Sort>>? As I understand it approx_percentile_cont must always have a WITHIN GROUP clause with a single ordering expression, so I would expect this to just be Sort to reflect that.

I removed Option, because a Sort is required for an ordered-set aggregate function, as you mentioned.
I also removed Vec.

(I initially defined it as Vec<Sort> because some aggregate functions support multiple ordering expressions in the WITHIN GROUP clause. However, I have since confirmed that even those database engines (e.g. Oracle, SQL Server) accept only a single ordering expression in percentile_cont and approx_percentile.)

Thank you for your guidance.

vbarua

Overall, this looks good to me ✨

I did leave a couple of minor comments, as well as some bigger ones, but I think this is ready for review by someone else. Thanks for bearing with me as I reviewed this, it was a good chance for me to look at new parts of DataFusion.

One big question we might need to answer before merging is if we need a migration strategy for this. Because we now require WITHIN GROUP for these functions, any users who have queries stored outside of DataFusion will experience breakages that they can't work around. If we want to provide a migration path, we may need to support having both forms of calling these functions, as in

SELECT approx_percentile_cont(column_name, 0.75, 100) FROM table_name;
SELECT approx_percentile_cont(0.75, 100) WITHIN GROUP (ORDER BY column_name) FROM table_name;

for at least 1 release so folks can migrate their queries.
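
One possible shape for such a migration path is to normalize both call forms into the same internal representation before planning. The sketch below is purely illustrative and is not what this PR implements: `NormalizedCall` and `normalize_call` are hypothetical names, and expressions are modelled as plain strings rather than DataFusion `Expr` values.

```rust
/// Hypothetical normalized call: the ordering expression plus the remaining arguments.
#[derive(Debug, Clone, PartialEq)]
struct NormalizedCall {
    order_by: String,
    args: Vec<String>,
}

/// Accepts either call form and produces the same normalized result.
/// `within_group` is `Some(expr)` for the new syntax and `None` for the legacy one.
fn normalize_call(
    args: Vec<String>,
    within_group: Option<String>,
) -> Result<NormalizedCall, String> {
    match within_group {
        // New form: f(percentile, centroids) WITHIN GROUP (ORDER BY expr)
        Some(order_by) => Ok(NormalizedCall { order_by, args }),
        // Legacy form: f(expr, percentile, centroids); the first argument
        // is the expression that the new syntax moves into WITHIN GROUP.
        None => {
            let mut rest = args.into_iter();
            match rest.next() {
                Some(first) => Ok(NormalizedCall {
                    order_by: first,
                    args: rest.collect(),
                }),
                None => Err("expected at least one argument".to_string()),
            }
        }
    }
}
```

Under this assumption, both queries above would reach the rest of the planner in the same form, and the legacy branch could be removed once the deprecation window ends.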

vbarua · 2025-03-04T18:18:22Z

datafusion/expr/src/expr.rs

@@ -295,6 +295,8 @@ pub enum Expr {
    /// See also [`ExprFunctionExt`] to set these fields.
    ///
    /// [`ExprFunctionExt`]: crate::expr_fn::ExprFunctionExt
+    ///
+    /// cf. `WITHIN GROUP` is converted to `ORDER BY` internally in `datafusion/sql/src/expr/function.rs`


minor/opinionated: I'm not sure if it's worth mentioning this at all here. WITHIN GROUP is effectively an ORDER BY specified differently. This only matters at the SQL layer, and you handle and explain it there already.

vbarua · 2025-03-04T19:16:56Z

datafusion/functions-aggregate/src/approx_percentile_cont.rs

@@ -51,29 +52,39 @@ create_func!(ApproxPercentileCont, approx_percentile_cont_udaf);

 /// Computes the approximate percentile continuous of a set of numbers
 pub fn approx_percentile_cont(
-    expression: Expr,
+    within_group: Sort,


minor/opinionated: I think order_by would be a clearer name for this, as the WITHIN GROUP is really just a wrapper around the ORDER BY clause.

vbarua · 2025-03-04T19:19:18Z

datafusion/functions-aggregate/src/approx_percentile_cont.rs

+        let percentile = if is_descending {
+            1.0 - percentile
+        } else {
+            percentile
+        };


This seems reasonable to me, but I don't have that much experience on the execution side of things.
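
One way to sanity-check the `1.0 - percentile` flip is with an exact percentile over a small ordered slice. The sketch below is not the accumulator from this PR; `percentile_cont` and `percentile_cont_desc` are hypothetical helpers that use linear interpolation, shown only to illustrate that the p-th percentile under descending order matches the (1 - p)-th percentile under ascending order.

```rust
/// Exact continuous percentile over already ordered values,
/// using linear interpolation between neighbouring positions.
/// Assumes `ordered` is non-empty and 0.0 <= p <= 1.0.
fn percentile_cont(ordered: &[f64], p: f64) -> f64 {
    let rank = p * (ordered.len() - 1) as f64;
    let lo = rank.floor() as usize;
    let hi = rank.ceil() as usize;
    let frac = rank - lo as f64;
    ordered[lo] + (ordered[hi] - ordered[lo]) * frac
}

/// Percentile for `ORDER BY expr DESC`, computed directly on the reversed data
/// so it can be compared against the `1.0 - p` shortcut.
fn percentile_cont_desc(sorted_asc: &[f64], p: f64) -> f64 {
    let desc: Vec<f64> = sorted_asc.iter().rev().copied().collect();
    percentile_cont(&desc, p)
}
```

For the values 1.0 through 5.0, the 0.75 percentile in descending order and the 0.25 percentile in ascending order both land on 2.0, which is the equivalence the accumulator relies on.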

vbarua · 2025-03-04T19:25:06Z

datafusion/sql/src/expr/function.rs

+
+        if within_group.len() > 1 {
+            return not_impl_err!(
+                "Multiple column ordering in WITHIN GROUP clause is not supported"


Minor wording suggestion

Only a single ordering expression is permitted in a WITHIN GROUP clause

which explicitly points users to what they should do, instead of telling them what they can't.

vbarua · 2025-03-04T19:35:31Z

datafusion/sql/src/expr/function.rs

-        if !within_group.is_empty() {
-            return not_impl_err!("WITHIN GROUP is not supported yet: {within_group:?}");
+        if !within_group.is_empty() && order_by.is_some() {
+            return plan_err!("ORDER BY clause is only permitted in WITHIN GROUP clause when a WITHIN GROUP is used");


minor: I just noticed that in the block above there is a check for duplicate order bys. I think it would be good to fold this into that check

FunctionArgumentClause::OrderBy(oby) => {
    if order_by.is_some() {
        // can check for within group here
        return not_impl_err!("Calling {name}: Duplicated ORDER BY clause in function arguments");
    }
    order_by = Some(oby);
}

to consolidate the handling into one place.
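
As a rough, standalone illustration of handling all ordering-related errors in one place, consider the sketch below. `resolve_ordering` is a hypothetical helper, ordering expressions are modelled as plain strings, and errors are plain `String` values; the real planner works with sqlparser's AST types and DataFusion's error macros, which are not reproduced here. The second error message is also made up for the example.

```rust
/// Hypothetical helper: ordering expressions are plain strings here.
/// `order_by_clauses` holds each ORDER BY clause found in the argument list;
/// `within_group` holds the expressions of the WITHIN GROUP clause, if any.
fn resolve_ordering(
    name: &str,
    order_by_clauses: &[Vec<String>],
    within_group: &[String],
) -> Result<Option<Vec<String>>, String> {
    let mut order_by: Option<Vec<String>> = None;
    for clause in order_by_clauses {
        // Both ORDER BY conflicts are rejected in the same block.
        if order_by.is_some() {
            return Err(format!(
                "Calling {name}: Duplicated ORDER BY clause in function arguments"
            ));
        }
        if !within_group.is_empty() {
            return Err(format!(
                "Calling {name}: ORDER BY must appear inside WITHIN GROUP when WITHIN GROUP is used"
            ));
        }
        order_by = Some(clause.clone());
    }
    if within_group.len() > 1 {
        return Err(
            "Only a single ordering expression is permitted in a WITHIN GROUP clause".to_string(),
        );
    }
    if within_group.is_empty() {
        Ok(order_by)
    } else {
        // WITHIN GROUP is treated as the function's ordering from here on.
        Ok(Some(within_group.to_vec()))
    }
}
```

Keeping the checks together means a reader sees every rejected combination in one function, rather than having to find a second check further down.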

vbarua · 2025-03-04T19:37:59Z

datafusion/sql/src/expr/function.rs

+                        "[IGNORE | RESPECT] NULLS are not permitted for {}",
+                        fm.name()
+                    );
+                }


Per my point about [IGNORE | RESPECT] NULLS being a property of window functions, I don't think we need this check here.

> Thanks for bearing with me as I reviewed this, it was a good chance for me to look at new parts of DataFusion.

I appreciate your detailed review again. 👍
This PR is now much simpler, clearer, and better.

> One big question we might need to answer before merging is if we need a migration strategy for this. Because we now require WITHIN GROUP for these functions, any users who have queries stored outside of DataFusion will experience breakages that they can't work around. If we want to provide a migration path, we may need to support having both forms of calling these functions, as in
>
> SELECT approx_percentile_cont(column_name, 0.75, 100) FROM table_name;
> SELECT approx_percentile_cont(0.75, 100) WITHIN GROUP (ORDER BY column_name) FROM table_name;
>
> for at least 1 release so folks can migrate their queries.

This was one of my biggest concerns when I started working on this feature.
If the community decides on a migration strategy like that, I will support both syntaxes.
I will also file an issue to track the plan so that the current syntax can be removed on schedule (if I am authorized to do so).

> This was smelling odd, so I dug a bit deeper. I think you've inadvertently stumbled into something even weirder than you anticipated
> ...
> The RESPECT NULLS | IGNORE NULLS option is only a property of certain window functions, hence we shouldn't need to track it for aggregate functions.
>
> I'm going to file a ticket for the above.
> ...
> Per my point about [IGNORE | RESPECT] NULLS being a property of window functions, I don't think we need this check here.

I understand and agree with your guidance.
I will follow what is decided in the issue you filed and remove the related code once a decision is made.

I have applied all of the review comments you tagged 'minor', since I found them convincing as well.

vbarua · 2025-03-05T00:26:51Z

Pinging @Dandandan for committer review as they filed the ticket this fix is for.

Add within group variable to aggregate function and arguments

a9b901a

github-actions bot added sql SQL Planner logical-expr Logical plan and expressions labels Nov 21, 2024

Merge branch 'main' into support_within_group_for_existing_aggregate_…

0918000

…functions

github-actions bot added the Stale PR has not had any activity for some time label Jan 21, 2025

Support within group and disable null handling for ordered set aggreg…

070a96b

…ate functions (apache#13511)

github-actions bot added core Core DataFusion crate functions Changes to functions implementation labels Jan 21, 2025

Refactored function to match updated signature

3fd92fd

github-actions bot added the proto Related to proto crate label Jan 21, 2025

Modify proto to support within group clause

4082a78

github-actions bot added the substrait Changes to the substrait crate label Jan 21, 2025

Modify physical planner and accumulator to support ordered set aggreg…

c3be3c6

…ate function

github-actions bot removed the Stale PR has not had any activity for some time label Jan 22, 2025

Support session management for ordered set aggregate functions

9fd05a3

github-actions bot added catalog Related to the catalog crate execution Related to the execution crate labels Jan 23, 2025

Garamda added 2 commits January 25, 2025 21:54

Align code, tests, and examples with changes to aggregate function logic

8518a59

* Ensure compatibility with new `within_group` and `order_by` handling. * Adjust tests and examples to align with the new logic.

Fix typo in existing comments

79669d9

github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Jan 25, 2025

Garamda commented Jan 25, 2025

datafusion/sqllogictest/test_files/aggregate.slt (outdated, resolved)

Enhance test

597f4d7

* Add test cases for changed signature * Update signature in docs

github-actions bot added the documentation Improvements or additions to documentation label Jan 27, 2025

Garamda added 3 commits January 28, 2025 21:46

Merge branch 'main' into support_within_group_for_existing_aggregate_…

d3b483c

…functions

Fix bug : handle missing within_group when applying children tree node

a827c9d

Change the signature of approx_percentile_cont for consistency

23bdf70

github-actions bot added physical-expr Changes to the physical-expr crates optimizer Optimizer rules catalog Related to the catalog crate common Related to common crate execution Related to the execution crate labels Feb 28, 2025

Garamda force-pushed the support_within_group_for_existing_aggregate_functions branch from aa23b24 to cf4faad Compare February 28, 2025 14:55

github-actions bot removed development-process Related to development process of DataFusion physical-expr Changes to the physical-expr crates optimizer Optimizer rules catalog Related to the catalog crate common Related to common crate execution Related to the execution crate labels Feb 28, 2025

Garamda added 4 commits March 1, 2025 00:04

Merge branch 'main' into support_within_group_for_existing_aggregate_…

fc7d2bc

…functions

Convert order by to within group

5469e39

Apply cargo fmt

d96b667

Remove plain line breaks

293d33e

github-actions bot removed the substrait Changes to the substrait crate label Feb 28, 2025

vbarua reviewed Mar 1, 2025

Garamda added 8 commits March 1, 2025 12:27

Remove duplicated column arg in schema name

ecdb21b

Refactor boolean functions to just return primitive type

d65420e

Make within group necessary in the signature of existing ordered set …

b6d426a

…aggr funcs

Apply cargo fmt

4b0c52f

Support a single ordering expression in the signature

36a732d

Apply cargo fmt

8d6db85

Add dataframe function test cases to verify descending ordering

db0355a

Apply cargo fmt

37b783e

vbarua approved these changes Mar 4, 2025

Garamda added 2 commits March 5, 2025 14:56

Apply code reviews

124d8c5

* Uses order by consistently after done with sql * Remove redundant comment * Serve more clear error msg * Handle error cases in the same code block

Update error msg in test as corresponding code changed

3259c95
