Fix regression in CASE expression #14283

Closed
wants to merge 5 commits into from

Conversation

andygrove
Member

@andygrove andygrove commented Jan 24, 2025

Which issue does this PR close?

Closes #14277

Rationale for this change

Add a unit test that demonstrates the regression reported in #14277.

What changes are included in this PR?

  • Added a failing test. It fails with the error "arguments need to have the same data type"
  • Added the fix

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the physical-expr Changes to the physical-expr crates label Jan 24, 2025
@andygrove andygrove changed the title from "wip: Repro for regression in CASE expression" to "Fix regression in CASE expression" Jan 24, 2025
@andygrove andygrove marked this pull request as ready for review January 24, 2025 19:36
@alamb alamb requested a review from jayzhan211 January 24, 2025 20:29
@jayzhan211
Contributor

jayzhan211 commented Jan 25, 2025

In the unit test issue_14277, we can see that the then is ScalarValue::Null. What is the reason it is not ScalarValue::Int32(None)?

let then = Arc::new(Literal::new(ScalarValue::Null));
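For reference, the two scalars report different data types. A minimal illustration (not from the PR; it assumes datafusion_common::ScalarValue and arrow's DataType):

use arrow::datatypes::DataType;
use datafusion_common::ScalarValue;

fn main() {
    // An untyped NULL literal reports DataType::Null ...
    assert_eq!(ScalarValue::Null.data_type(), DataType::Null);
    // ... while a typed NULL reports its concrete type, here Int32
    assert_eq!(ScalarValue::Int32(None).data_type(), DataType::Int32);
}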

I suggest finding a test that reproduces this in SLT, since SLT runs end-to-end and ensures the arguments passed through the function are as expected. For example, in this case, I think we should convert the then from ScalarValue::Null to ScalarValue::Int32(None) rather than handling ScalarValue::Null, unless we find a test case in SLT that confirms we need to handle this scenario.

For example, with the query below, the then_value we receive is Scalar(Int32(NULL)), not Scalar(NULL). This suggests that the main issue might lie in type handling upstream and needs to be addressed there:

statement ok
create table foo (a int, b int) as values (1, 2), (3, 4), (5, 6), (null, null), (6, null), (null, 7);

query I
SELECT CASE WHEN a is null THEN null ELSE b END FROM foo
----
2
4
6
NULL
NULL
NULL

If there is any reason that we really need to handle the ScalarValue::Null case, I think this is all we need:

let then_value = self.when_then_expr[0]
    .1
    .evaluate_selection(batch, &when_value)?;

let then_value = if let ColumnarValue::Scalar(ScalarValue::Null) = then_value {
    new_null_array(&return_type, batch.num_rows())
} else {
    then_value.into_array(batch.num_rows())?
};

@aweltsch
Contributor

I also looked into this a little bit. Here are my observations:
The data_type of a CaseExpr is the first non-null data_type. In the expr_or_expr specialization, this is only the same type as the else expression if the then_expr is null. My impression is that @jayzhan211's suggestion to handle only the null value would thus make sense.

I personally would have applied this transformation:

let e = &self.when_then_expr[0].1;
let then_expr = try_cast(Arc::clone(e), &batch.schema(), return_type.clone())?;

let then_value = then_expr
    .evaluate_selection(batch, &when_value)?
    .into_array(batch.num_rows())?;

since to me this makes the intended goal more obvious, i.e. "make sure that the then_expr returns the correct data type for the case expression", and it seems a little more generic.

If the suggested fix here is to be accepted I would rather remove the expr_or_expr specialization again, because from what I have seen the current suggestion regresses on the case_when: expr_or_expr benchmark.

Also: I was surprised to see that I can't reproduce this issue when using datafusion-cli, so from an end-to-end perspective the type coercion seems to work somehow 🤔
This is what I did:

> CREATE TABLE batch2(a int[]) AS VALUES ([1, 2, 3]), (null);
> SELECT CASE WHEN a IS NULL THEN NULL ELSE a END FROM batch2;
+--------------------------------------------------------+
| CASE WHEN batch2.a IS NULL THEN NULL ELSE batch2.a END |
+--------------------------------------------------------+
| [1, 2, 3]                                              |
| NULL                                                   |
+--------------------------------------------------------+

@aweltsch
Contributor

If the suggested fix here is to be accepted I would rather remove the expr_or_expr specialization again, because from what I have seen the current suggestion regresses on the case_when: expr_or_expr benchmark.

Sorry, I just realized that the wording might be a little ambiguous: "suggested fix" refers to this commit.

@jayzhan211
Contributor

jayzhan211 commented Jan 25, 2025

let then_expr = try_cast(Arc::clone(e), &batch.schema(), return_type.clone())?;

I think type handling should be done correctly in the logical layer before physical expression creation, so this doesn't make sense to me. Even checking types in try_new is not ideal (although I think we didn't block the check for safety). We have already done such type handling and coercion in the logical layer, so we should try our best to avoid doing it twice in the physical layer.

@andygrove
Member Author

let then_expr = try_cast(Arc::clone(e), &batch.schema(), return_type.clone())?;

I think type handling should be done correctly in the logical layer before physical expression creation, so this doesn't make sense to me. Even checking types in try_new is not ideal (although I think we didn't block the check for safety). We have already done such type handling and coercion in the logical layer, so we should try our best to avoid doing it twice in the physical layer.

@jayzhan211 in the Comet use case we do not have a DataFusion logical plan but are instead translating Spark physical plans to DataFusion physical plans.

@jayzhan211
Contributor

let then_expr = try_cast(Arc::clone(e), &batch.schema(), return_type.clone())?;

I think type handling should be done correctly in the logical layer before physical expression creation, so this doesn't make sense to me. Even checking types in try_new is not ideal (although I think we didn't block the check for safety). We have already done such type handling and coercion in the logical layer, so we should try our best to avoid doing it twice in the physical layer.

@jayzhan211 in the Comet use case we do not have a DataFusion logical plan but are instead translating Spark physical plans to DataFusion physical plans.

How do you handle the type mismatch issue? Does Comet have other type handling logic to find the correct types for the DataFusion physical plan?

@andygrove
Member Author

How do you handle the type mismatch issue? Does Comet have other type handling logic to find the correct types for the DataFusion physical plan?

We map Spark types to Arrow types. For UDFs, we call DataFusion's coerce_types function. There are some other places where we add casts when needed.

This is the DataFusion plan that was failing.

ProjectionExec: expr=[CASE WHEN Cast [data_type: Int32, timezone: America/Los_Angeles, child: col_0@0, eval_mode: Legacy] IS NULL THEN NULL ELSE array_remove(make_array(Cast [data_type: Int32, timezone: America/Los_Angeles, child: col_0@0, eval_mode: Legacy], Cast [data_type: Int32, timezone: America/Los_Angeles, child: col_1@1, eval_mode: Legacy], col_2@2), Cast [data_type: Int32, timezone: America/Los_Angeles, child: col_0@0, eval_mode: Legacy]) END as col_0]
  CometFilterExec: col_0@0 IS NULL
    ScanExec: source=[CometScan parquet  (unknown)], schema=[col_0: Int8, col_1: Int16, col_2: Int32]

@jayzhan211
Contributor

The function coerce_types is used exclusively within function handling. For case expressions, coerce_types is not utilized. Instead, the function get_coerce_type_for_case_expression is used, which internally relies on comparison_coercion as the actual coercion function:

pub fn get_coerce_type_for_case_expression(
    when_or_then_types: &[DataType],
    case_or_else_type: Option<&DataType>,
) -> Option<DataType> {
    let case_or_else_type = match case_or_else_type {
        None => when_or_then_types[0].clone(),
        Some(data_type) => data_type.clone(),
    };
    when_or_then_types
        .iter()
        .try_fold(case_or_else_type, |left_type, right_type| {
            // TODO: now just use the `equal` coercion rule for case when. If find the issue, and
            // refactor again.
            comparison_coercion(&left_type, right_type)
        })
}

Therefore, I think Comet also needs to apply such coercion before calling CaseExpr::try_new.
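A minimal sketch of what that could look like on the caller's side (this is not Comet's actual code; it assumes get_coerce_type_for_case_expression from datafusion_expr::type_coercion::other and the cast helper from datafusion_physical_expr::expressions):

use std::sync::Arc;
use arrow::datatypes::Schema;
use datafusion_common::{DataFusionError, Result};
use datafusion_expr::type_coercion::other::get_coerce_type_for_case_expression;
use datafusion_physical_expr::expressions::{cast, CaseExpr};
use datafusion_physical_expr::PhysicalExpr;

/// Coerce all THEN branches and the ELSE branch to one common type, mirroring the
/// logical planner's type_coercion rule, and only then build the CaseExpr.
fn build_coerced_case(
    when_then: Vec<(Arc<dyn PhysicalExpr>, Arc<dyn PhysicalExpr>)>,
    else_expr: Option<Arc<dyn PhysicalExpr>>,
    schema: &Schema,
) -> Result<Arc<dyn PhysicalExpr>> {
    let then_types = when_then
        .iter()
        .map(|(_, then)| then.data_type(schema))
        .collect::<Result<Vec<_>>>()?;
    let else_type = else_expr
        .as_ref()
        .map(|e| e.data_type(schema))
        .transpose()?;

    // Same coercion rule the logical layer uses for CASE expressions
    let target = get_coerce_type_for_case_expression(&then_types, else_type.as_ref())
        .ok_or_else(|| DataFusionError::Plan("incompatible CASE branch types".into()))?;

    // Cast every THEN branch and the ELSE branch to the coerced type
    let when_then = when_then
        .into_iter()
        .map(|(when, then)| Ok((when, cast(then, schema, target.clone())?)))
        .collect::<Result<Vec<_>>>()?;
    let else_expr = else_expr
        .map(|e| cast(e, schema, target.clone()))
        .transpose()?;

    Ok(Arc::new(CaseExpr::try_new(None, when_then, else_expr)?))
}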

In DataFusion, Null is coerced to Int32(NULL) in type_coercion:

[2025-01-26T00:36:20Z DEBUG datafusion_optimizer::utils] resolve_grouping_function:
    Projection: CASE WHEN foo.a IS NULL THEN NULL ELSE foo.b END
      TableScan: foo
    
[2025-01-26T00:36:20Z DEBUG datafusion_optimizer::utils] type_coercion:
    Projection: CASE WHEN foo.a IS NULL THEN CAST(NULL AS Int32) ELSE foo.b END
      TableScan: foo
    
[2025-01-26T00:36:20Z DEBUG datafusion_optimizer::utils] count_wildcard_rule:
    Projection: CASE WHEN foo.a IS NULL THEN CAST(NULL AS Int32) ELSE foo.b END
      TableScan: foo

@aweltsch
Contributor

aweltsch commented Jan 26, 2025

I have two questions:

  1. If we assume that the expressions in the CaseExpr (both when_then_expr and else_expr) are of the correct type, can we then not remove all of the try_cast calls for the else_expr that are sprinkled in the code as well?
  2. Would it make sense to return an error already during try_new if the types are incorrect? We could add a helpful message like "[...] you can use 'get_coerce_type_for_case_expression' to get the appropriate types.". My feeling is that this would enforce the precondition 'when-then and else expr types are consistent' and catch errors during expression construction rather than execution (fail early). A sketch of such a check follows below.
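A rough sketch of such a check, as a hypothetical standalone helper (not part of this PR). Note that CaseExpr::try_new does not currently take a schema, so the check would need the input schema from the caller:

use std::sync::Arc;
use arrow::datatypes::{DataType, Schema};
use datafusion_common::{DataFusionError, Result};
use datafusion_expr::type_coercion::binary::comparison_coercion;
use datafusion_physical_expr::PhysicalExpr;

/// Fail early if the THEN/ELSE branches do not share a common type.
fn check_case_branch_types(
    when_then: &[(Arc<dyn PhysicalExpr>, Arc<dyn PhysicalExpr>)],
    else_expr: Option<&Arc<dyn PhysicalExpr>>,
    schema: &Schema,
) -> Result<DataType> {
    let mut result: Option<DataType> = else_expr.map(|e| e.data_type(schema)).transpose()?;
    for (_, then) in when_then {
        let then_type = then.data_type(schema)?;
        result = Some(match result {
            None => then_type,
            // The same `equal` coercion rule used by get_coerce_type_for_case_expression
            Some(prev) => comparison_coercion(&prev, &then_type).ok_or_else(|| {
                DataFusionError::Plan(format!(
                    "CASE branches have incompatible types {prev:?} and {then_type:?}; \
                     you can use get_coerce_type_for_case_expression to coerce the inputs"
                ))
            })?,
        });
    }
    result.ok_or_else(|| DataFusionError::Plan("CASE requires at least one WHEN/THEN".into()))
}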

@jayzhan211
Contributor

  1. I think so; the try_cast in else_expr seems not ideal to me.
  2. I agree that type checking is a choice if we prefer to return a Result instead of panicking, so I'm OK with that. As for casting, we should be able to avoid it altogether.

@andygrove
Member Author

@jayzhan211 Thanks, I will look into using get_coerce_type_for_case_expression.

@aweltsch Adding type checks in try_new makes sense to me.

Tomorrow, I will try updating the Comet PR to use get_coerce_type_for_case_expression and see if that fixes things or not.

@andygrove
Member Author

Ideally, invoking a physical optimization rule to apply coercion as needed would be nice.
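A rough sketch of the shape such a rule could take (assuming the PhysicalOptimizerRule trait from the datafusion-physical-optimizer crate; the actual rewrite of CASE expressions is left as a placeholder):

use std::sync::Arc;
use datafusion_common::config::ConfigOptions;
use datafusion_common::Result;
use datafusion_physical_optimizer::PhysicalOptimizerRule;
use datafusion_physical_plan::ExecutionPlan;

/// Placeholder rule: a real implementation would walk each node's expressions
/// and wrap mismatched CASE branches in casts.
#[derive(Debug, Default)]
struct CoerceCaseExpressions;

impl PhysicalOptimizerRule for CoerceCaseExpressions {
    fn optimize(
        &self,
        plan: Arc<dyn ExecutionPlan>,
        _config: &ConfigOptions,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        // TODO: rewrite CaseExpr instances so THEN/ELSE branches share one type
        Ok(plan)
    }

    fn name(&self) -> &str {
        "coerce_case_expressions"
    }

    fn schema_check(&self) -> bool {
        true
    }
}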

@alamb
Contributor

alamb commented Jan 27, 2025

Ideally, invoking a physical optimization rule to apply coercion as needed would be nice.

I agree -- this would be helpful for any system that only uses the Physical layer.

I reviewed this PR and it looks like it isn't waiting on review (Andy is going to try get_coerce_type_for_case_expression), so I am marking it as draft for now to help manage the "to review" queue.

@andygrove
Member Author

Using get_coerce_type_for_case_expression resolved the issue in Comet, so I am closing this PR.

Thanks @aweltsch @jayzhan211 @alamb

@andygrove andygrove closed this Jan 27, 2025