Support bounds evaluation for temporal data types #14523

ch-sc · 2025-02-06T11:36:31Z

Which issue does this PR close?

As discussed in 14237 temporal data types should be supported in bounds evaluation.

Closes Add support for filter selectivity for more expression types #14237.

Rationale for this change

We want to extend the number of expressions that can evaluate bounds to calculate statistics and enable better/more query plan optimisations.

What changes are included in this PR?

Additional data types are supported: Timestamp, Date, Time, Duration, Interval
Additional binary operators are supported: ~~NotEq~~, IsDistinctFrom, IsNotDistinctFrom
Whether an expression supports bounds evaluation or not is now part of the PhysicalExpr trait and therefore decided by every expression type itself. This will also enable UDFs to implement bounds evaluation.

Are these changes tested?

yes

Are there any user-facing changes?

PhysicalExpr and ScalarUDFImpl change, but both provide default implementations for additional functions.

berkaysynnada · 2025-02-07T07:50:05Z

Thank you @ch-sc for working on this. When you need a review, I can do that if you ping me.

ch-sc · 2025-02-07T12:54:46Z

Thanks @berkaysynnada! Yeah I'd like to get a review.

There is still a bug, though, that I don't understand yet. Sometimes the sort operator gets removed as can be seen in the repartition_scan.slt test.

berkaysynnada · 2025-02-07T12:59:12Z

Sometimes the sort operator gets removed as can be seen in the repartition_scan.slt test.

Seems like you're losing the order requirement somehow

…t-temporal-types-in-interval-arithmetics

ch-sc · 2025-02-12T13:53:55Z

@berkaysynnada do you have time to take another look? :)

NotEq leads to the removal of the sort operator.

I debugged into this and noticed that the EnforceSorting optimiser removes SortExec. enforce_sorting/mod.rs:415
However, prior to that the order requirement somehow is lost and I don't understand why that happens.

I removed NotEq again from the supported binary operators to move this forward. We might want to look into this in a follow-up PR. WDYT?

berkaysynnada · 2025-02-12T14:35:11Z

@berkaysynnada do you have time to take another look? :)

Thanks again for driving this forward :) I'm a bit busy these days, but I'd love to go over it. If it's not urgent for you, would you mind waiting until the weekend?

I removed NotEq again from the supported binary operators to move this forward. We might want to look into this in a follow-up PR. WDYT?

Thanks for the heads-up. I try to add support for NotEq. If I find a solution, I can fix it within this PR itself—if that's okay with you.

I took a quick look, and you can use the union logic there:
https://github.com/Fly-Style/datafusion/blob/fee3023a63227f0a22ac2da1d040d373cc028c34/datafusion/expr-common/src/interval_arithmetic.rs#L667
It seems to follow a better style, IMO.

ch-sc · 2025-02-13T12:00:34Z

I try to add support for NotEq. If I find a solution, I can fix it within this PR itself—if that's okay with you.

Sure, feel free to make any adjustment as you see fit.

ch-sc · 2025-02-14T19:48:31Z

I took a quick look, and you can use the union logic there

I thought a bit more about the union logic and came up with another solution.

Sometimes intervals are not overlapping and sometimes have no bound in one direction (NULL). Typically the union of non-overlapping intervals would result in a set of intervals, but there is no support of this.

Here are some examples of what I think should be correct, without interval sets support:

(1, 2)    ∪ (3,4)        = (1,4)
(1, NULL) ∪ (NULL, NULL) = (1, NULL)
(NULL, 1) ∪ (NULL, NULL) = (NULL, 1)
(3, NULL) ∪ (NULL, 1)    = (NULL, NULL)
(5, NULL) ∪ (1, 2)       = (1, NULL)

I adapted the union logic accordingly and added tests. Happy for your feedback!

ch-sc · 2025-02-18T14:10:05Z

Looks like I confused the meaning of nulls in intervals. I switched to the union logic @berkaysynnada suggested.

…t-temporal-types-in-interval-arithmetics

berkaysynnada

Hi @ch-sc. Both your implementations and clean-ups on the existing code seem very good to me. I suggested some further improvements, and have one point to discuss: Rather than introducing new supports_() API's, should we force the users to infer the support by the evaluate API? IMO that will make things simpler, WDYT?

and sorry for the late response 😞

berkaysynnada · 2025-02-23T18:25:06Z

datafusion/common/src/scalar/mod.rs

@@ -1583,6 +1583,17 @@ impl ScalarValue {
        }
    }

+    /// Returns negation for a boolean scalar value
+    pub fn boolean_negate(&self) -> Result<Self> {


Maybe we can extend arithmetic_negate() with booleans, wdyt?

We can do that, but I think we should then rename arithmetic_negate to just negate to cause no ambiguity. WDYT?

datafusion/expr-common/src/interval_arithmetic.rs

berkaysynnada · 2025-02-23T18:41:39Z

datafusion/expr-common/src/interval_arithmetic.rs

+        Operator::IsDistinctFrom | Operator::IsNotDistinctFrom => {
+            NullableInterval::from(lhs)
+                .apply_operator(op, &rhs.into())
+                .and_then(|x| {


We can avoid cloning here actually introducing a new taker API to NullableInterval

datafusion/expr-common/src/operator.rs

berkaysynnada · 2025-02-23T18:51:27Z

datafusion/expr/src/udf.rs

-    fn evaluate_bounds(&self, _input: &[&Interval]) -> Result<Interval> {
-        // We cannot assume the input datatype is the same of output type.
-        Interval::make_unbounded(&DataType::Null)
+    fn evaluate_bounds(&self, input: &[&Interval]) -> Result<Interval> {


I really like getting rid of Null bounds and looking up the result type, but I have a question:
You declared supports_bounds_evaluation() as false. So, do the "evaluate" invokers even call this function after seeing supports_bounds_evaluation() as false? I think supports_bounds_evaluation() shouldn't say false

This should work similar to what we do in PhysicalExpr. By default, this is not supported, but every UDF can implement its own version of bounds evaluation.

evaluate_bounds in ScalarUDFImpl returns an unbounded Interval which is not helpful for statistics. This is rather a leftover of my attempt to make use of known bounds as much as possible and ignore null bounds. We should probably throw a not implemented error here as well.

berkaysynnada · 2025-02-23T19:04:33Z

datafusion/physical-expr/src/expressions/in_list.rs

+        if self.negated {
+            expr_bounds.contains(list_bounds)?.boolean_negate()
+        } else {
+            expr_bounds.contains(list_bounds)


Let's say expr_bounds is [3,5], and list bounds are [[0,2], [6,9]]. As the union logic cannot yet support interval sets, the union result would be [0, 9], and this contains check will give certainly_true; however, the actual result is certainly_false. I think that's not what you prefer

Even if we can support interval sets, let's consider this example
expr_bounds is [0,10] & list_bounds: [[2,6], [3,5], [6,9]]. Do we want to return certainly_true? expr could be 5, and the list might include [3,4,7]. So, in_list() is false, even if we say its bounds are certainly_true. Am I missing something?

contains for [3,5] and [0,9] should return UNCERTAIN. I added a test with your example in in_list.rs.

CERTAINLY_TRUE should only be returned if expr_bounds is a superset of the union of the list bounds.

…t-temporal-types-in-interval-arithmetics

ch-sc · 2025-02-25T13:49:26Z

@berkaysynnada thank you for your review 🙂

Rather than introducing new supports_() API's, should we force the users to infer the support by the evaluate API?

I agree, something like that would be ideal. I'll look into it today

Support temporal data types in interval arithmetics

a5f5729

github-actions bot added logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate labels Feb 6, 2025

ch-sc added 3 commits February 6, 2025 13:42

fix test code

68784cc

cargo fmt

bc9eb4d

clean up

5960170

github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Feb 6, 2025

fix interval type coercion

fb1d681

ch-sc added 4 commits February 11, 2025 18:17

remove NotEq from supported operators

4ea39eb

Merge branch 'main' of github.com:apache/arrow-datafusion into suppor…

b873f50

…t-temporal-types-in-interval-arithmetics

fix imports

b1b8aec

cargo fmt

cb11459

revisit union interval logic

b4bd851

revisit union interval logic

aaf116b

ch-sc mentioned this pull request Feb 17, 2025

StatisticsV2: initial statistics framework redesign #14699

Merged

ch-sc added 2 commits February 18, 2025 13:38

treat null as unbounded

5d65ba1

clean up

4642bfa

csv source yields too many column stats

12f7b3c

ch-sc added 3 commits February 19, 2025 15:20

clippy

cbeb26c

csv source yields too many column stats

ec9f8be

Merge branch 'main' of github.com:apache/arrow-datafusion into suppor…

621809f

…t-temporal-types-in-interval-arithmetics

berkaysynnada reviewed Feb 23, 2025

View reviewed changes

ch-sc added 5 commits February 24, 2025 18:11

addressing comments

d14747d

omit clone

d3810de

Merge branch 'main' of github.com:apache/arrow-datafusion into suppor…

0147f66

…t-temporal-types-in-interval-arithmetics

remove check

8813adb

UDF evaluate bounds default impl

f521a6a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support bounds evaluation for temporal data types #14523

Support bounds evaluation for temporal data types #14523

ch-sc commented Feb 6, 2025 •

edited

Loading

berkaysynnada commented Feb 7, 2025

ch-sc commented Feb 7, 2025

berkaysynnada commented Feb 7, 2025

ch-sc commented Feb 12, 2025

berkaysynnada commented Feb 12, 2025

ch-sc commented Feb 13, 2025

ch-sc commented Feb 14, 2025 •

edited

Loading

ch-sc commented Feb 18, 2025

berkaysynnada left a comment

berkaysynnada Feb 23, 2025

ch-sc Feb 24, 2025

berkaysynnada Feb 23, 2025

berkaysynnada Feb 23, 2025

ch-sc Feb 24, 2025 •

edited

Loading

berkaysynnada Feb 23, 2025

ch-sc Feb 24, 2025 •

edited

Loading

ch-sc commented Feb 25, 2025

Support bounds evaluation for temporal data types #14523

Are you sure you want to change the base?

Support bounds evaluation for temporal data types #14523

Conversation

ch-sc commented Feb 6, 2025 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

berkaysynnada commented Feb 7, 2025

ch-sc commented Feb 7, 2025

berkaysynnada commented Feb 7, 2025

ch-sc commented Feb 12, 2025

berkaysynnada commented Feb 12, 2025

ch-sc commented Feb 13, 2025

ch-sc commented Feb 14, 2025 • edited Loading

ch-sc commented Feb 18, 2025

berkaysynnada left a comment

Choose a reason for hiding this comment

berkaysynnada Feb 23, 2025

Choose a reason for hiding this comment

ch-sc Feb 24, 2025

Choose a reason for hiding this comment

berkaysynnada Feb 23, 2025

Choose a reason for hiding this comment

berkaysynnada Feb 23, 2025

Choose a reason for hiding this comment

ch-sc Feb 24, 2025 • edited Loading

Choose a reason for hiding this comment

berkaysynnada Feb 23, 2025

Choose a reason for hiding this comment

ch-sc Feb 24, 2025 • edited Loading

Choose a reason for hiding this comment

ch-sc commented Feb 25, 2025

ch-sc commented Feb 6, 2025 •

edited

Loading

ch-sc commented Feb 14, 2025 •

edited

Loading

ch-sc Feb 24, 2025 •

edited

Loading

ch-sc Feb 24, 2025 •

edited

Loading