Fix regression in CASE expression #14283

Closed
wants to merge 5 commits into from

Conversation

andygrove
Member

@andygrove andygrove commented Jan 24, 2025

Which issue does this PR close?

Closes #14277

Rationale for this change

Add a unit test that demonstrates the regression reported in #14277.

What changes are included in this PR?

  • Added a failing test. It fails with the error "arguments need to have the same data type"
  • Added the fix

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the physical-expr Changes to the physical-expr crates label Jan 24, 2025
@andygrove andygrove changed the title from "wip: Repro for regression in CASE expression" to "Fix regression in CASE expression" Jan 24, 2025
@andygrove andygrove marked this pull request as ready for review January 24, 2025 19:36
@alamb alamb requested a review from jayzhan211 January 24, 2025 20:29
@jayzhan211
Contributor

jayzhan211 commented Jan 25, 2025

In the unit test issue_14277, we can see that the then is ScalarValue::Null. What is the reason it is not ScalarValue::Int32(None)?

let then = Arc::new(Literal::new(ScalarValue::Null));
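For reference, the two scalars report different data types. A minimal illustration (not from the PR; it assumes datafusion_common::ScalarValue and arrow's DataType):

use arrow::datatypes::DataType;
use datafusion_common::ScalarValue;

fn main() {
    // An untyped NULL literal reports DataType::Null ...
    assert_eq!(ScalarValue::Null.data_type(), DataType::Null);
    // ... while a typed NULL reports its concrete type, here Int32
    assert_eq!(ScalarValue::Int32(None).data_type(), DataType::Int32);
}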

I suggest finding a test that reproduces this in SLT, since SLT runs end-to-end and ensures the arguments passed through the function are as expected. For example, in this case, I think we should convert the then from ScalarValue::Null to ScalarValue::Int32(None) rather than handling ScalarValue::Null, unless we find a test case in SLT that confirms we need to handle this scenario.

For example, with the query below, the then_value we receive is Scalar(Int32(NULL)), not Scalar(NULL). This suggests that the main issue might lie in type handling upstream and needs to be addressed there:

statement ok
create table foo (a int, b int) as values (1, 2), (3, 4), (5, 6), (null, null), (6, null), (null, 7);

query I
SELECT CASE WHEN a is null THEN null ELSE b END FROM foo
----
2
4
6
NULL
NULL
NULL

If there is any reason that we really need to handle the ScalarValue::Null case, I think this is all we need:

let then_value = self.when_then_expr[0]
    .1
    .evaluate_selection(batch, &when_value)?;

let then_value = if let ColumnarValue::Scalar(ScalarValue::Null) = then_value {
    new_null_array(&return_type, batch.num_rows())
} else {
    then_value.into_array(batch.num_rows())?
};

@aweltsch
Contributor

I also looked into this a little bit. Here are my observations:
The data_type of a CaseExpr is the first non-null data_type. In the expr_or_expr specialization, this is only the same type as the else expression if the then_expr is null. My impression is that @jayzhan211's suggestion to handle only the null value would thus make sense.

I personally would have applied this transformation:

let e = &self.when_then_expr[0].1;
let then_expr = try_cast(Arc::clone(e), &batch.schema(), return_type.clone())?;

let then_value = then_expr
    .evaluate_selection(batch, &when_value)?
    .into_array(batch.num_rows())?;

since to me this makes the intended goal more obvious, i.e. "make sure that the then_expr returns the correct data type for the case expression", and it seems a little more generic.

If the suggested fix here is to be accepted I would rather remove the expr_or_expr specialization again, because from what I have seen the current suggestion regresses on the case_when: expr_or_expr benchmark.

Also: I was surprised to see that I can't reproduce this issue when using datafusion-cli, so from an end-to-end perspective the type coercion seems to work somehow 🤔
This is what I did:

> CREATE TABLE batch2(a int[]) AS VALUES ([1, 2, 3]), (null);
> SELECT CASE WHEN a IS NULL THEN NULL ELSE a END FROM batch2;
+--------------------------------------------------------+
| CASE WHEN batch2.a IS NULL THEN NULL ELSE batch2.a END |
+--------------------------------------------------------+
| [1, 2, 3]                                              |
| NULL                                                   |
+--------------------------------------------------------+

@aweltsch
Contributor

If the suggested fix here is to be accepted I would rather remove the expr_or_expr specialization again, because from what I have seen the current suggestion regresses on the case_when: expr_or_expr benchmark.

Sorry, I just realized that the wording might be a little ambiguous: "suggested fix" refers to this commit.

@jayzhan211
Contributor

jayzhan211 commented Jan 25, 2025

let then_expr = try_cast(Arc::clone(e), &batch.schema(), return_type.clone())?;

I think type handling should be done correctly in the logical layer before physical expression creation, so this doesn't make sense to me. Even checking types in try_new is not ideal (although I think we didn't block the check for safety). We have already done such type handling and coercion in the logical layer, so we should try our best to avoid doing it twice in the physical layer.

@andygrove
Member Author

let then_expr = try_cast(Arc::clone(e), &batch.schema(), return_type.clone())?;

I think type handling should be done correctly in the logical layer before physical expression creation, so this doesn't make sense to me. Even checking types in try_new is not ideal (although I think we didn't block the check for safety). We have already done such type handling and coercion in the logical layer, so we should try our best to avoid doing it twice in the physical layer.

@jayzhan211 in the Comet use case we do not have a DataFusion logical plan but are instead translating Spark physical plans to DataFusion physical plans.

@jayzhan211
Contributor

let then_expr = try_cast(Arc::clone(e), &batch.schema(), return_type.clone())?;

I think type handling should be done correctly in the logical layer before physical expression creation, so this doesn't make sense to me. Even checking types in try_new is not ideal (although I think we didn't block the check for safety). We have already done such type handling and coercion in the logical layer, so we should try our best to avoid doing it twice in the physical layer.

@jayzhan211 in the Comet use case we do not have a DataFusion logical plan but are instead translating Spark physical plans to DataFusion physical plans.

How do you handle the type mismatch issue? Does Comet have other type handling logic to find the correct types for the DataFusion physical plan?

@andygrove
Member Author

How do you handle the type mismatch issue? Does Comet have other type handling logic to find the correct types for the DataFusion physical plan?

We map Spark types to Arrow types. For UDFs, we call DataFusion's coerce_types function. There are some other places where we add casts when needed.

This is the DataFusion plan that was failing.

ProjectionExec: expr=[CASE WHEN Cast [data_type: Int32, timezone: America/Los_Angeles, child: col_0@0, eval_mode: Legacy] IS NULL THEN NULL ELSE array_remove(make_array(Cast [data_type: Int32, timezone: America/Los_Angeles, child: col_0@0, eval_mode: Legacy], Cast [data_type: Int32, timezone: America/Los_Angeles, child: col_1@1, eval_mode: Legacy], col_2@2), Cast [data_type: Int32, timezone: America/Los_Angeles, child: col_0@0, eval_mode: Legacy]) END as col_0]
  CometFilterExec: col_0@0 IS NULL
    ScanExec: source=[CometScan parquet  (unknown)], schema=[col_0: Int8, col_1: Int16, col_2: Int32]

@jayzhan211
Contributor

The function coerce_types is used exclusively within function handling. For case expressions, coerce_types is not utilized. Instead, the function get_coerce_type_for_case_expression is used, which internally relies on comparison_coercion as the actual coercion function:

pub fn get_coerce_type_for_case_expression(
    when_or_then_types: &[DataType],
    case_or_else_type: Option<&DataType>,
) -> Option<DataType> {
    let case_or_else_type = match case_or_else_type {
        None => when_or_then_types[0].clone(),
        Some(data_type) => data_type.clone(),
    };
    when_or_then_types
        .iter()
        .try_fold(case_or_else_type, |left_type, right_type| {
            // TODO: now just use the `equal` coercion rule for case when. If find the issue, and
            // refactor again.
            comparison_coercion(&left_type, right_type)
        })
}

Therefore, I think Comet also needs to apply such coercion before calling CaseExpr::try_new.
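A minimal sketch of what that could look like on the caller's side (this is not Comet's actual code; it assumes get_coerce_type_for_case_expression from datafusion_expr::type_coercion::other and the cast helper from datafusion_physical_expr::expressions):

use std::sync::Arc;
use arrow::datatypes::Schema;
use datafusion_common::{DataFusionError, Result};
use datafusion_expr::type_coercion::other::get_coerce_type_for_case_expression;
use datafusion_physical_expr::expressions::{cast, CaseExpr};
use datafusion_physical_expr::PhysicalExpr;

/// Coerce all THEN branches and the ELSE branch to one common type, mirroring the
/// logical planner's type_coercion rule, and only then build the CaseExpr.
fn build_coerced_case(
    when_then: Vec<(Arc<dyn PhysicalExpr>, Arc<dyn PhysicalExpr>)>,
    else_expr: Option<Arc<dyn PhysicalExpr>>,
    schema: &Schema,
) -> Result<Arc<dyn PhysicalExpr>> {
    let then_types = when_then
        .iter()
        .map(|(_, then)| then.data_type(schema))
        .collect::<Result<Vec<_>>>()?;
    let else_type = else_expr
        .as_ref()
        .map(|e| e.data_type(schema))
        .transpose()?;

    // Same coercion rule the logical layer uses for CASE expressions
    let target = get_coerce_type_for_case_expression(&then_types, else_type.as_ref())
        .ok_or_else(|| DataFusionError::Plan("incompatible CASE branch types".into()))?;

    // Cast every THEN branch and the ELSE branch to the coerced type
    let when_then = when_then
        .into_iter()
        .map(|(when, then)| Ok((when, cast(then, schema, target.clone())?)))
        .collect::<Result<Vec<_>>>()?;
    let else_expr = else_expr
        .map(|e| cast(e, schema, target.clone()))
        .transpose()?;

    Ok(Arc::new(CaseExpr::try_new(None, when_then, else_expr)?))
}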

In DataFusion, Null is coerced to Int32(NULL) in type_coercion:

[2025-01-26T00:36:20Z DEBUG datafusion_optimizer::utils] resolve_grouping_function:
    Projection: CASE WHEN foo.a IS NULL THEN NULL ELSE foo.b END
      TableScan: foo
    
[2025-01-26T00:36:20Z DEBUG datafusion_optimizer::utils] type_coercion:
    Projection: CASE WHEN foo.a IS NULL THEN CAST(NULL AS Int32) ELSE foo.b END
      TableScan: foo
    
[2025-01-26T00:36:20Z DEBUG datafusion_optimizer::utils] count_wildcard_rule:
    Projection: CASE WHEN foo.a IS NULL THEN CAST(NULL AS Int32) ELSE foo.b END
      TableScan: foo

@aweltsch
Contributor

aweltsch commented Jan 26, 2025

I have two questions:

  1. If we assume that the expressions in the CaseExpr (both when_then_expr and else_expr) are of the correct type, can we then not remove all of the try_cast calls for the else_expr that are sprinkled in the code as well?
  2. Would it make sense to return an error already during try_new if the types are incorrect? We could add a helpful message like "[...] you can use 'get_coerce_type_for_case_expression' to get the appropriate types.". My feeling is that this would enforce the precondition 'when-then and else expr types are consistent' and catch errors during expression construction rather than execution (fail early). A sketch of such a check follows below.
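A rough sketch of such a check, as a hypothetical standalone helper (not part of this PR). Note that CaseExpr::try_new does not currently take a schema, so the check would need the input schema from the caller:

use std::sync::Arc;
use arrow::datatypes::{DataType, Schema};
use datafusion_common::{DataFusionError, Result};
use datafusion_expr::type_coercion::binary::comparison_coercion;
use datafusion_physical_expr::PhysicalExpr;

/// Fail early if the THEN/ELSE branches do not share a common type.
fn check_case_branch_types(
    when_then: &[(Arc<dyn PhysicalExpr>, Arc<dyn PhysicalExpr>)],
    else_expr: Option<&Arc<dyn PhysicalExpr>>,
    schema: &Schema,
) -> Result<DataType> {
    let mut result: Option<DataType> = else_expr.map(|e| e.data_type(schema)).transpose()?;
    for (_, then) in when_then {
        let then_type = then.data_type(schema)?;
        result = Some(match result {
            None => then_type,
            // The same `equal` coercion rule used by get_coerce_type_for_case_expression
            Some(prev) => comparison_coercion(&prev, &then_type).ok_or_else(|| {
                DataFusionError::Plan(format!(
                    "CASE branches have incompatible types {prev:?} and {then_type:?}; \
                     you can use get_coerce_type_for_case_expression to coerce the inputs"
                ))
            })?,
        });
    }
    result.ok_or_else(|| DataFusionError::Plan("CASE requires at least one WHEN/THEN".into()))
}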

@jayzhan211
Contributor

  1. I think so; the try_cast in else_expr seems not ideal to me.
  2. I agree that type checking is a choice if we prefer to return a Result instead of panicking, so I'm OK with that. As for casting, we should be able to avoid it altogether.

@andygrove
Member Author

@jayzhan211 Thanks, I will look into using get_coerce_type_for_case_expression.

@aweltsch Adding type checks in try_new makes sense to me.

Tomorrow, I will try updating the Comet PR to use get_coerce_type_for_case_expression and see if that fixes things or not.

@andygrove
Member Author

Ideally, invoking a physical optimization rule to apply coercion as needed would be nice.
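A rough sketch of the shape such a rule could take (assuming the PhysicalOptimizerRule trait from the datafusion-physical-optimizer crate; the actual rewrite of CASE expressions is left as a placeholder):

use std::sync::Arc;
use datafusion_common::config::ConfigOptions;
use datafusion_common::Result;
use datafusion_physical_optimizer::PhysicalOptimizerRule;
use datafusion_physical_plan::ExecutionPlan;

/// Placeholder rule: a real implementation would walk each node's expressions
/// and wrap mismatched CASE branches in casts.
#[derive(Debug, Default)]
struct CoerceCaseExpressions;

impl PhysicalOptimizerRule for CoerceCaseExpressions {
    fn optimize(
        &self,
        plan: Arc<dyn ExecutionPlan>,
        _config: &ConfigOptions,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        // TODO: rewrite CaseExpr instances so THEN/ELSE branches share one type
        Ok(plan)
    }

    fn name(&self) -> &str {
        "coerce_case_expressions"
    }

    fn schema_check(&self) -> bool {
        true
    }
}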

@alamb
Contributor

alamb commented Jan 27, 2025

Ideally, invoking a physical optimization rule to apply coercion as needed would be nice.

I agree -- this would be helpful for any system that only uses the Physical layer.

I reviewed this PR and it looks like it isn't waiting on review (Andy is going to try get_coerce_type_for_case_expression), so I am marking it as draft for now to help manage the "to review" queue.

@andygrove
Member Author

Using get_coerce_type_for_case_expression resolved the issue in Comet, so I am closing this PR.

Thanks @aweltsch @jayzhan211 @alamb

@andygrove andygrove closed this Jan 27, 2025