-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support bounds evaluation for temporal data types #14523
base: main
Are you sure you want to change the base?
Support bounds evaluation for temporal data types #14523
Conversation
Thank you @ch-sc for working on this. When you need a review, I can do that if you ping me. |
Thanks @berkaysynnada! Yeah I'd like to get a review. There is still a bug, though, that I don't understand yet. Sometimes the sort operator gets removed as can be seen in the |
Seems like you're losing the order requirement somehow |
…t-temporal-types-in-interval-arithmetics
@berkaysynnada do you have time to take another look? :)
I debugged into this and noticed that the I removed |
Thanks again for driving this forward :) I'm a bit busy these days, but I'd love to go over it. If it's not urgent for you, would you mind waiting until the weekend?
Thanks for the heads-up. I try to add support for NotEq. If I find a solution, I can fix it within this PR itself—if that's okay with you. I took a quick look, and you can use the union logic there: |
Sure, feel free to make any adjustment as you see fit. |
I thought a bit more about the union logic and came up with another solution. Sometimes intervals are not overlapping and sometimes have no bound in one direction ( Here are some examples of what I think should be correct, without interval sets support:
I adapted the union logic accordingly and added tests. Happy for your feedback! |
Looks like I confused the meaning of nulls in intervals. I switched to the union logic @berkaysynnada suggested. |
…t-temporal-types-in-interval-arithmetics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ch-sc. Both your implementations and clean-ups on the existing code seem very good to me. I suggested some further improvements, and have one point to discuss: Rather than introducing new supports_() API's, should we force the users to infer the support by the evaluate API? IMO that will make things simpler, WDYT?
and sorry for the late response 😞
@@ -1583,6 +1583,17 @@ impl ScalarValue { | |||
} | |||
} | |||
|
|||
/// Returns negation for a boolean scalar value | |||
pub fn boolean_negate(&self) -> Result<Self> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can extend arithmetic_negate() with booleans, wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do that, but I think we should then rename arithmetic_negate
to just negate
to cause no ambiguity. WDYT?
Operator::IsDistinctFrom | Operator::IsNotDistinctFrom => { | ||
NullableInterval::from(lhs) | ||
.apply_operator(op, &rhs.into()) | ||
.and_then(|x| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can avoid cloning here actually introducing a new taker API to NullableInterval
fn evaluate_bounds(&self, _input: &[&Interval]) -> Result<Interval> { | ||
// We cannot assume the input datatype is the same of output type. | ||
Interval::make_unbounded(&DataType::Null) | ||
fn evaluate_bounds(&self, input: &[&Interval]) -> Result<Interval> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like getting rid of Null bounds and looking up the result type, but I have a question:
You declared supports_bounds_evaluation() as false. So, do the "evaluate" invokers even call this function after seeing supports_bounds_evaluation() as false? I think supports_bounds_evaluation() shouldn't say false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should work similar to what we do in PhysicalExpr. By default, this is not supported, but every UDF can implement its own version of bounds evaluation.
evaluate_bounds
in ScalarUDFImpl returns an unbounded Interval which is not helpful for statistics. This is rather a leftover of my attempt to make use of known bounds as much as possible and ignore null bounds. We should probably throw a not implemented error here as well.
if self.negated { | ||
expr_bounds.contains(list_bounds)?.boolean_negate() | ||
} else { | ||
expr_bounds.contains(list_bounds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's say expr_bounds is [3,5], and list bounds are [[0,2], [6,9]]. As the union logic cannot yet support interval sets, the union result would be [0, 9], and this contains check will give certainly_true; however, the actual result is certainly_false. I think that's not what you prefer
Even if we can support interval sets, let's consider this example
expr_bounds is [0,10] & list_bounds: [[2,6], [3,5], [6,9]]. Do we want to return certainly_true? expr could be 5, and the list might include [3,4,7]. So, in_list() is false, even if we say its bounds are certainly_true. Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
contains
for [3,5]
and [0,9]
should return UNCERTAIN
. I added a test with your example in in_list.rs.
CERTAINLY_TRUE
should only be returned if expr_bounds is a superset of the union of the list bounds.
…t-temporal-types-in-interval-arithmetics
@berkaysynnada thank you for your review 🙂
I agree, something like that would be ideal. I'll look into it today |
Which issue does this PR close?
As discussed in 14237 temporal data types should be supported in bounds evaluation.
Rationale for this change
We want to extend the number of expressions that can evaluate bounds to calculate statistics and enable better/more query plan optimisations.
What changes are included in this PR?
Timestamp
,Date
,Time
,Duration
,Interval
,NotEq
IsDistinctFrom
,IsNotDistinctFrom
PhysicalExpr
trait and therefore decided by every expression type itself. This will also enable UDFs to implement bounds evaluation.Are these changes tested?
yes
Are there any user-facing changes?
PhysicalExpr
andScalarUDFImpl
change, but both provide default implementations for additional functions.