Support Push down expression evaluation in TableProviders #14993

Open
adriangb opened this issue Mar 3, 2025 · 17 comments

Comments

adriangb (Contributor) commented Mar 3, 2025

For the scenario of select expensive_thing(col1) from data I would like to pre-process (speed up) expensive_thing(col1).
The easiest way I can think of doing this is by pre-computing the expression and saving it as a column with a specific name like _expensive_thing__col1.
But then I have to hardcode this column into my schema, and the hardcoded expressions need to be the same for every file, etc.

To me an ideal solution would be to push down the expression to the file reading level so that I can then check "does _some_other_expensive_expr__col38 exist in the file? if so read that, otherwise read col38 and compute the expression".

The tricky thing is I'd want to do this on a per-file level: depending on the data different expression/column combinations would be pre-computed; it's prohibitive to put them all in the schema that is shared amongst all files.

adriangb (Contributor, Author) commented Mar 3, 2025

Argh I've asked this before and it's been answered: #7845 (comment)

alamb (Contributor) commented Mar 5, 2025

I am reopening this ticket as I think it covers several important use cases (all of which are subsets of @adriangb's expensive_thing(col1) example above):

  • EXTRACT (minute from "EventDate"). For example, @gatesn mentions that the Vortex format may be able to evaluate this more quickly on the compressed representation than by decompressing the full column and evaluating the expression afterwards
  • struct_column["field_name"]: For example, extracting one field from a struct column -- in this case we could potentially update the JSON or Parquet decoders to avoid materializing other fields (we would likely need more arrow-rs support too)

So a query might be

select EXTRACT (minute from "EventDate"),  SUM(something) 
FROM hits 
GROUP BY EXTRACT (minute from "EventDate");

Being able to evaluate the EXTRACT (minute from "EventDate") expression during the scan would be super helpful

One possibility here might be to add an API to TableProvider similar to TableProvider::supports_filters_pushdown

maybe something like

/// For each `Expr` in `exprs`, returns whether the TableProvider can
/// evaluate it directly during the scan
fn supports_expr_pushdown(
    &self,
    exprs: &[&Expr],
) -> Result<Vec<bool>, DataFusionError>

This information would have to be threaded through to TableProvider::scan as well

(maybe it would be time to make TableProvider::scan_with_args 🤔 )
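To make that concrete, here is a rough sketch of what a scan_with_args-style call might carry (a hypothetical struct for illustration only, not something DataFusion defines in this form): the expressions the provider accepted via supports_expr_pushdown travel to the scan alongside the usual projection, filters, and limit.

use datafusion::logical_expr::Expr;

/// Hypothetical argument bundle for a `TableProvider::scan_with_args` method.
struct ScanArgs {
    /// Column indices to read, as `TableProvider::scan` receives today.
    projection: Option<Vec<usize>>,
    /// Filters the provider agreed to evaluate (via supports_filters_pushdown).
    filters: Vec<Expr>,
    /// Expressions the provider agreed to evaluate during the scan,
    /// e.g. EXTRACT(minute FROM "EventDate"), reported via supports_expr_pushdown above.
    pushed_down_exprs: Vec<Expr>,
    /// Optional row limit.
    limit: Option<usize>,
}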

alamb changed the title from "Push down expression to files" to "Support Push down expression evaluation in TableProviders" on Mar 5, 2025
alamb (Contributor) commented Mar 5, 2025

Related blog from @gatesn

https://blog.spiraldb.com/what-if-we-just-didnt-decompress-it/

adriangb (Contributor, Author) commented Mar 5, 2025

Being able to evaluate the EXTRACT (minute from "EventDate") expression during the scan would be super helpful

Maybe this is what you meant, but to my mind it's possible to do even better: instead of evaluating it during the scan, the file might even contain a pre-evaluated version of it. You can imagine something like a column called _computed_<hash of expr's SQL> so that you can read directly from (and use statistics of) a column called _computed_8has07fas98a (made up hash) instead of reading EventDate and computing EXTRACT (minute from "EventDate") on it. This becomes even more useful if you can use the statistics of the pre-computed expressions with a filter (e.g. where extract(minute from ts) = 1).

One note about this use case though: you can mostly achieve it today with some rewrite rules (you rewrite select EXTRACT (minute from "EventDate") from hits to select _computed_8has07fas98a as "EXTRACT (minute from \"EventDate\")" from hits). It's error prone and annoying though.

On top of these issues, for a Variant type there is the problem that these columns can't vary file by file.

While this is crucial for a Variant type, I think it can still be very helpful for other scenarios: if you are adding pre-computed columns as an optimization it might make sense to only compute them during an optimization / compaction pass. So you end up with some files that have the pre-computed column and some that don't. Blindly rewriting the expression to select _computed_8has07fas98a would by default produce incorrect results because SchemaAdapter would fill in nulls instead of computing the expression in real time. We worked around this by essentially forking SchemaAdapter and giving it the ability to generate columns from other columns instead of always filling them in with nulls, but that's very error prone and wonky.
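A rough sketch of that idea (hypothetical types, not the actual SchemaAdapter API): when a requested column is missing from a file, the adapter either fills it with nulls, as it does today, or derives it from columns the file does have.

use std::collections::HashMap;

/// What to do when a requested column is absent from a particular file.
enum MissingColumn {
    /// Today's SchemaAdapter behavior: produce an all-null column.
    FillNull,
    /// Compute the column on the fly from a source column and an expression,
    /// e.g. EXTRACT(minute FROM "EventDate") when `_computed_<hash>` is absent.
    Compute { source_column: String, expr_sql: String },
}

/// `derivations` maps a pre-computed column name to (source column, expression SQL).
fn handle_missing(column: &str, derivations: &HashMap<String, (String, String)>) -> MissingColumn {
    match derivations.get(column) {
        Some((source, expr_sql)) => MissingColumn::Compute {
            source_column: source.clone(),
            expr_sql: expr_sql.clone(),
        },
        // No derivation registered; fall back to nulls as the stock adapter does.
        None => MissingColumn::FillNull,
    }
}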

One possibility here might be to add an API to TableProvider similar to TableProvider::supports_filters_pushdown
maybe it would be time to make TableProvider::scan_with_args

Something like that sounds great to me!

A couple of things to think about, taking the example of select variant_get(json_col, 'key')::int from data:

Who is responsible for evaluating these expressions if they can vary on a per-file basis?
If the TableProvider says "yes, I can evaluate that expression" it is then responsible for doing the compute to evaluate it for every single file. If that data comes from a shredded column that makes sense: it's cheap. But if it has to start deserializing the Variant value column it's going to get expensive. Maybe that's not an issue, but I did want to point out that it blurs the lines of where IO happens and where compute happens. If this is a problem I think it would complicate the API substantially.

What expression does the TableProvider get passed? In this example it could be variant_get(json_col, 'key') or variant_get(json_col, 'key')::int. I'm guessing it's the former, and the same rules as TableProvider::supports_filters_pushdown apply.

That said I think the best path forward is likely to prototype something in a PR and go from there 😄

alamb (Contributor) commented Mar 5, 2025

Maybe this is what you meant, but to my mind it's possible to do even better: instead of evaluating it during the scan, the file might even contain a pre-evaluated version of it.

Yes indeed! Great point.

Who is responsible for evaluating these expressions if they can vary on a per-file basis?
If the TableProvider says "yes, I can evaluate that expression" it is then responsible for doing the compute to evaluate it for every single file.

Yes, this is how I would expect it to work.

The table provider would have to figure out the best way to evaluate the projection depending on its actual layout.

For variants this would mean:

  1. Files that had a field extracted to a shredded column would use that
  2. Files that didn't have the field extracted would need to read the entire variant and pull out the field of interest

Maybe that's not an issue but I did want to point out that it blurs the lines of where IO happens and where compute happens. If this is a problem I think it would complicate the API substantially.

I agree -- implementing this optimally in a TableProvider will be complex

I note that the IO/CPU work is already intertwined when implementing something like filter pushdown in Parquet, so I am not sure also pushing down expressions makes the problem worse (or better)

adriangb (Contributor, Author) commented Mar 6, 2025

I note that the IO/CPU work is already intertwined when implementing something like filter pushdown in Parquet, so I am not sure also pushing down expressions makes the problem worse (or better)

I agree here - it's likely to not be a problem in practice.

So it sounds like the main complexity is going to be the TableProvider having to take ownership of applying the expression. It would be interesting to see if there's a way for it to dynamically decide to fall back to DataFusion; I can imagine situations where it wants to handle certain branches / conditions but would rather not re-implement others.

adriangb (Contributor, Author) commented Mar 6, 2025

@cetra3 suggested that maybe this can be done with a rewrite of the PhysicalPlan? I guess the main issue would be that you don't know anything about the file except the file path at this point. You'd need to do at least some IO to get the Parquet metadata to know the file's actual schema.

cetra3 (Contributor) commented Mar 6, 2025

I guess what I was getting at is that maybe we could use a PhysicalOptimizerRule to do this sort of thing. However, currently all the OptimizerRule traits are non-async, which makes this a bit hard to do. I don't see why they can't be async.

gatesn (Contributor) commented Mar 6, 2025

I can imagine situations where it wants to handle certain branches / conditions but would rather not re-implement others.

Given these are DataFusion scalar expressions (rather than any relational algebra), can the implementation not just invoke the expression as a fallback?

--

With Vortex, we've gone one step further and the scan accepts projection: Expr, filter: Option<Expr> where projection can arbitrarily select columns and apply scalar expressions to them. While this may be a little too general for DataFusion, it works well provided the system has good support for struct types and expressions for manipulating them. We have select (to select/exclude fields from a struct), pack (to assemble fields into a new struct), getitem (to extract a field from a struct), and merge (to union multiple structs, although this can be implemented with getitem+pack).

However this does have the downside of pushing disproportionate complexity onto the TableProvider for the simple case of projecting out a few columns.
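For readers unfamiliar with that style of API, a very reduced model of what such a projection expression could look like (this is an illustration of the general shape only, not the actual Vortex API; merge is omitted since it can be built from getitem plus pack):

/// A reduced model of a struct-aware projection expression.
enum ProjExpr {
    /// Refer to a top-level column.
    Column(String),
    /// Keep only the named fields of a struct (select).
    Select(Box<ProjExpr>, Vec<String>),
    /// Extract a single field from a struct (getitem).
    GetItem(Box<ProjExpr>, String),
    /// Assemble named expressions into a new struct (pack).
    Pack(Vec<(String, ProjExpr)>),
}

/// Shape of a scan that takes a projection expression and an optional filter.
fn scan(_projection: ProjExpr, _filter: Option<ProjExpr>) {
    // An advanced provider would interpret these against its own layout;
    // a simple one can fall back to reading the columns and evaluating them itself.
}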

alamb (Contributor) commented Mar 6, 2025

So it sounds like the main complexity is going to be the TableProvider having to take ownership of applying the expression.

Given these are DataFusion scalar expressions (rather than any relational algebra), can the implementation not just invoke the expression as a fallback?

I agree with @gatesn -- I don't think adding new columns based on expressions has to be all that complex (you can already do it via a SchemaAdapter / SchemaMapper)

(there is a similar use case for filling in new columns with default values rather than NULL)

With Vortex, we've gone one step further and the scan accepts projection: Expr, filter: Option<Expr> where projection can arbitrarily select columns and apply scalar expressions to them. While this may be a little too general for DataFusion, it works well provided the system has good support for struct types and expressions for manipulating them.

The normal DataFusion filter pushdown API allows table providers to report which expressions they can handle, which means most providers can ignore filters unless they have code to handle it.
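For reference, the filter pushdown hook being referred to has roughly this shape (paraphrased from memory here as a free function; see the TableProvider docs for the exact trait signature), and an expression pushdown hook could mirror it:

use datafusion::error::Result;
use datafusion::logical_expr::{Expr, TableProviderFilterPushDown};

// For each filter, the provider says whether it can evaluate it exactly,
// inexactly, or not at all. Providers that don't care simply report
// Unsupported for everything and DataFusion applies the filters itself.
fn supports_filters_pushdown(filters: &[&Expr]) -> Result<Vec<TableProviderFilterPushDown>> {
    Ok(filters
        .iter()
        .map(|_| TableProviderFilterPushDown::Unsupported)
        .collect())
}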

However this does have the downside of pushing disproportionate complexity onto the TableProvider for the simple case of projecting out a few columns.

Again, I think simple TableProviders can use something like SchemaAdapter for this use case.

adriangb (Contributor, Author) commented Mar 6, 2025

Would SchemaAdapter be a good place to implement this functionality? It already has knowledge of the required columns and file schema. We'd need piping around it (changes to TableProvider, Execs?) but at least implementing this as a user of DataFusion could be relatively self-contained.

gatesn (Contributor) commented Mar 6, 2025

Is plumbing this through the schema overly specific to the pre-computed expression use-case? Or are you suggesting this is the mechanism by which all expression push-down occurs, by faking additional schema columns?

alamb (Contributor) commented Mar 6, 2025

Or are you suggesting this is the mechanism by which all expression push-down occurs, by faking additional schema columns?

I think I was suggesting this (though I haven't thought about the API too carefully)

Basically the TableSource would somehow have to say "I can evaluate this expression" and then tell DataFusion what column corresponds to that expression.

Maybe a good first step would be to try and code up an example showing how to "push down" a field access

like input is a column user with documents like

{ id: 124, name: 'foo' },
{ id: 567, name: 'bar' }

And the table provider also has a shredded column like

foo
bar

And the idea is to show how a query like

select user['name'] from table

Could be evaluated by the table provider using the separate shredded column

alamb (Contributor) commented Mar 6, 2025

BTW there is a related issue for parquet itself (where we don't support pushdown for sub fields):

gatesn (Contributor) commented Mar 6, 2025

I think I was suggesting this (though I haven't thought about the API it too carefully)

That feels like quite a roundabout way to do it, assuming I'm understanding correctly. DataFusion would ask which expressions can be pushed down, the provider would reply with some (typically) randomly generated column name string corresponding to the expression, augment its schema to include these expressions, and then DataFusion would ask for that column as part of the projection?

Is there a specific objection to the general case? Similarly, I haven't thought through this fully in the context of DataFusion.

But roughly, DataFusion asks the table provider which expressions it can push-down, and the node is configured with both a projection expression and a filter expression. Exact same mechanism as filter expressions.

In your example, the expression would be $.getitem("user").getitem("name")

An advanced table provider could do with that as it pleases.

A simple one can construct the old (existing) projection mask with expr.accessed_fields() -> ["user"] or similar, project out the fields, and then invoke the projection expression:

expr.evaluate(self.project(expr.accessed_fields()))
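Spelled out as a self-contained sketch (all types below are placeholders, not DataFusion or Vortex APIs), the fallback is simply: project the columns the expression touches, then evaluate the expression over them.

/// Placeholder for a pushed-down scalar expression.
struct ScalarExpr;
/// Placeholder for a batch of rows.
struct Batch;

impl ScalarExpr {
    /// Column names the expression reads, e.g. ["user"] for user['name'].
    fn accessed_fields(&self) -> Vec<String> {
        vec!["user".to_string()]
    }
    /// Evaluate the expression over a batch containing those columns.
    fn evaluate(&self, batch: Batch) -> Batch {
        batch
    }
}

struct SimpleProvider;

impl SimpleProvider {
    /// Read only the requested columns from storage (the old projection mask).
    fn project(&self, _columns: Vec<String>) -> Batch {
        Batch
    }

    /// The fallback path: project the accessed columns, then evaluate the expression.
    fn scan_expr(&self, expr: &ScalarExpr) -> Batch {
        expr.evaluate(self.project(expr.accessed_fields()))
    }
}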

alamb (Contributor) commented Mar 6, 2025

Is there a specific objection to general case? Similarly, I haven't thought through this fully in the context of DataFusion.

I am not sure I understand what the general case is you are referring to.

The only thing DataFusion needs is to somehow know what output column corresponds to the expression that was pushed down (so it can match it up with the rest of the plan).

DataFusion would ask which expressions can be pushed down, the provider would reply with some (typically) randomly generated column name string corresponding to the expression, augment its schema to include these expressions, and then DataFusion would ask for that column as part of the projection?

I agree this sounds complicated and not a great idea.

It sounds like we are basically saying the same thing (which I view as a good thing)

adriangb (Contributor, Author) commented Mar 6, 2025

But roughly, DataFusion asks the table provider which expressions it can push-down, and the node is configured with both a projection expression and a filter expression. Exact same mechanism as filter expressions.

@gatesn I'm in agreement with you: what you are proposing makes total sense, especially the bit about sharing code paths with filter expressions.

An advanced table provider could do with that as it pleases.

👍🏻 agreed, but we should have at least some reasonable examples of how to use this that don't require tremendous complexity. E.g. hopefully we can use SchemaAdapter or something similar so that I can write <100 LOC and inject a custom implementation of struct unpacking / shredding into an existing table provider like ListingTableProvider.
