Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support bounds evaluation for temporal data types #14523

Open
wants to merge 22 commits into
base: main
Choose a base branch
from

Conversation

ch-sc
Copy link
Contributor

@ch-sc ch-sc commented Feb 6, 2025

Which issue does this PR close?

As discussed in 14237 temporal data types should be supported in bounds evaluation.

Rationale for this change

We want to extend the number of expressions that can evaluate bounds to calculate statistics and enable better/more query plan optimisations.

What changes are included in this PR?

  • Additional data types are supported: Timestamp, Date, Time, Duration, Interval
  • Additional binary operators are supported: NotEq, IsDistinctFrom, IsNotDistinctFrom
  • Whether an expression supports bounds evaluation or not is now part of the PhysicalExpr trait and therefore decided by every expression type itself. This will also enable UDFs to implement bounds evaluation.

Are these changes tested?

yes

Are there any user-facing changes?

PhysicalExpr and ScalarUDFImpl change, but both provide default implementations for additional functions.

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate labels Feb 6, 2025
@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Feb 6, 2025
@berkaysynnada
Copy link
Contributor

Thank you @ch-sc for working on this. When you need a review, I can do that if you ping me.

@ch-sc
Copy link
Contributor Author

ch-sc commented Feb 7, 2025

Thanks @berkaysynnada! Yeah I'd like to get a review.

There is still a bug, though, that I don't understand yet. Sometimes the sort operator gets removed as can be seen in the repartition_scan.slt test.

@berkaysynnada
Copy link
Contributor

Sometimes the sort operator gets removed as can be seen in the repartition_scan.slt test.

Seems like you're losing the order requirement somehow

@ch-sc
Copy link
Contributor Author

ch-sc commented Feb 12, 2025

@berkaysynnada do you have time to take another look? :)

NotEq leads to the removal of the sort operator.

I debugged into this and noticed that the EnforceSorting optimiser removes SortExec. enforce_sorting/mod.rs:415
However, prior to that the order requirement somehow is lost and I don't understand why that happens.

I removed NotEq again from the supported binary operators to move this forward. We might want to look into this in a follow-up PR. WDYT?

@berkaysynnada
Copy link
Contributor

@berkaysynnada do you have time to take another look? :)

Thanks again for driving this forward :) I'm a bit busy these days, but I'd love to go over it. If it's not urgent for you, would you mind waiting until the weekend?

I removed NotEq again from the supported binary operators to move this forward. We might want to look into this in a follow-up PR. WDYT?

Thanks for the heads-up. I try to add support for NotEq. If I find a solution, I can fix it within this PR itself—if that's okay with you.

I took a quick look, and you can use the union logic there:
https://github.com/Fly-Style/datafusion/blob/fee3023a63227f0a22ac2da1d040d373cc028c34/datafusion/expr-common/src/interval_arithmetic.rs#L667
It seems to follow a better style, IMO.

@ch-sc
Copy link
Contributor Author

ch-sc commented Feb 13, 2025

I try to add support for NotEq. If I find a solution, I can fix it within this PR itself—if that's okay with you.

Sure, feel free to make any adjustment as you see fit.

@ch-sc
Copy link
Contributor Author

ch-sc commented Feb 14, 2025

I took a quick look, and you can use the union logic there

I thought a bit more about the union logic and came up with another solution.

Sometimes intervals are not overlapping and sometimes have no bound in one direction (NULL). Typically the union of non-overlapping intervals would result in a set of intervals, but there is no support of this.

Here are some examples of what I think should be correct, without interval sets support:

(1, 2)    ∪ (3,4)        = (1,4)
(1, NULL) ∪ (NULL, NULL) = (1, NULL)
(NULL, 1) ∪ (NULL, NULL) = (NULL, 1)
(3, NULL) ∪ (NULL, 1)    = (NULL, NULL)
(5, NULL) ∪ (1, 2)       = (1, NULL)

I adapted the union logic accordingly and added tests. Happy for your feedback!

@ch-sc
Copy link
Contributor Author

ch-sc commented Feb 18, 2025

Looks like I confused the meaning of nulls in intervals. I switched to the union logic @berkaysynnada suggested.

Copy link
Contributor

@berkaysynnada berkaysynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ch-sc. Both your implementations and clean-ups on the existing code seem very good to me. I suggested some further improvements, and have one point to discuss: Rather than introducing new supports_() API's, should we force the users to infer the support by the evaluate API? IMO that will make things simpler, WDYT?

and sorry for the late response 😞

@@ -1583,6 +1583,17 @@ impl ScalarValue {
}
}

/// Returns negation for a boolean scalar value
pub fn boolean_negate(&self) -> Result<Self> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can extend arithmetic_negate() with booleans, wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do that, but I think we should then rename arithmetic_negate to just negate to cause no ambiguity. WDYT?

Operator::IsDistinctFrom | Operator::IsNotDistinctFrom => {
NullableInterval::from(lhs)
.apply_operator(op, &rhs.into())
.and_then(|x| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can avoid cloning here actually introducing a new taker API to NullableInterval

fn evaluate_bounds(&self, _input: &[&Interval]) -> Result<Interval> {
// We cannot assume the input datatype is the same of output type.
Interval::make_unbounded(&DataType::Null)
fn evaluate_bounds(&self, input: &[&Interval]) -> Result<Interval> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like getting rid of Null bounds and looking up the result type, but I have a question:
You declared supports_bounds_evaluation() as false. So, do the "evaluate" invokers even call this function after seeing supports_bounds_evaluation() as false? I think supports_bounds_evaluation() shouldn't say false

Copy link
Contributor Author

@ch-sc ch-sc Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should work similar to what we do in PhysicalExpr. By default, this is not supported, but every UDF can implement its own version of bounds evaluation.

evaluate_bounds in ScalarUDFImpl returns an unbounded Interval which is not helpful for statistics. This is rather a leftover of my attempt to make use of known bounds as much as possible and ignore null bounds. We should probably throw a not implemented error here as well.

if self.negated {
expr_bounds.contains(list_bounds)?.boolean_negate()
} else {
expr_bounds.contains(list_bounds)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's say expr_bounds is [3,5], and list bounds are [[0,2], [6,9]]. As the union logic cannot yet support interval sets, the union result would be [0, 9], and this contains check will give certainly_true; however, the actual result is certainly_false. I think that's not what you prefer

Even if we can support interval sets, let's consider this example
expr_bounds is [0,10] & list_bounds: [[2,6], [3,5], [6,9]]. Do we want to return certainly_true? expr could be 5, and the list might include [3,4,7]. So, in_list() is false, even if we say its bounds are certainly_true. Am I missing something?

Copy link
Contributor Author

@ch-sc ch-sc Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

contains for [3,5] and [0,9] should return UNCERTAIN. I added a test with your example in in_list.rs.

CERTAINLY_TRUE should only be returned if expr_bounds is a superset of the union of the list bounds.

@ch-sc
Copy link
Contributor Author

ch-sc commented Feb 25, 2025

@berkaysynnada thank you for your review 🙂

Rather than introducing new supports_() API's, should we force the users to infer the support by the evaluate API?

I agree, something like that would be ideal. I'll look into it today

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Changes to the physical-expr crates
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for filter selectivity for more expression types
2 participants