Run / test Datafusion with JSON Bench from ClickHouse #14874
Comments
I didn't do that :)
Sorry -- I got my github <-> real life handles mixed up. Maybe @goldmedal knows the right person
I think you mean @douenergy (Alex)
I tried to run the query directly against the JSON file but got the error:
JSONBench rows can have different schemas, and it looks like DataFusion (arrow-json) can't support this? E.g. type of
Thanks for checking it out @ZENOTME
I think this may be related to what @TheBuilderJR is seeing / discussing here: Allowing different (yet compatible) schemas
It doesn't seem to be a case of different (yet compatible) schemas. E.g.
The same key (column) can have incompatible types across JSON rows, so I think this is more about #7845. Judging by the following query, what they basically do is treat the fields of the JSON as columns and run aggregates over them. For this case I think choice 1 of #7845 (comment) (add a Json/Jsonb type to Arrow) may be the appropriate one. I guess this may also be the reason behind ClickHouse's new JSON type design (https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse). It could also be a good reference if we want to choose this solution. 🤔
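To make the "same key, incompatible types" problem concrete, here is a minimal sketch in plain Python (standard library only, not DataFusion or arrow-json). The sample records and the helper names `infer_key_types` and `conflicting_keys` are made up for illustration and are not taken from JSONBench; the point is only that a single inferred column type cannot cover every row when one key holds an object in one row and a string in another.

```python
import json
from collections import defaultdict


def infer_key_types(lines):
    """Collect the set of Python type names seen for each top-level key."""
    types = defaultdict(set)
    for line in lines:
        record = json.loads(line)
        for key, value in record.items():
            types[key].add(type(value).__name__)
    return dict(types)


def conflicting_keys(types):
    """Return the keys that were seen with more than one type."""
    return sorted(key for key, seen in types.items() if len(seen) > 1)


# Hypothetical newline-delimited JSON rows: "payload" is an object in the
# first row and a string in the second, so no single column type fits both.
sample = [
    '{"kind": "commit", "payload": {"seq": 1}}',
    '{"kind": "identity", "payload": "did:example:123"}',
]

types = infer_key_types(sample)
conflicts = conflicting_keys(types)
print(conflicts)
```

A reader that infers one fixed type per column has to either reject such a file or fall back to something like a JSON/variant type for the conflicting key.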
Maybe this is the use case for variant 🤔
I'll share that one thing we've found from using JSON quite extensively is that query times are often dominated by downloading the large JSON column, not by parsing it or extracting data from it. I don't see any way to avoid this unless we split the data up into multiple columns or teach DataFusion how to read only parts of a column. I opened #14993 today, which I realized is a duplicate of a question I asked before in #7845 (comment). My understanding is that ClickHouse handles JSON by creating specialized "hidden" columns for each key (linked above, but see https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse). I think if DataFusion supported something like what I'm proposing in those comments (pushing down an expression into a file) we could:
FWIW the idea to store some fields as separate columns is referred to as "shredding" in the Parquet doc / format they are adding:
I think adding expression pushdown into table providers would be valuable and has come up a number of times. This use case is a good one. I'll work on a writeup
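As a rough illustration of the shredding idea, here is a small sketch in plain Python (standard library only). It is not how Parquet, ClickHouse, or DataFusion implement it; the `shred` helper, the chosen keys, and the sample rows are all assumptions made up for this example. Selected keys are pulled out into their own columns, and whatever is left stays behind as a residual JSON string.

```python
import json
from collections import Counter


def shred(lines, keys):
    """Split each JSON record into per-key columns plus a residual JSON string."""
    columns = {key: [] for key in keys}
    residual = []
    for line in lines:
        record = json.loads(line)
        for key in keys:
            # Missing keys become None so every column has one entry per row.
            columns[key].append(record.pop(key, None))
        residual.append(json.dumps(record, sort_keys=True))
    return columns, residual


# Hypothetical rows; "extra" has inconsistent types and is left unshredded.
rows = [
    '{"kind": "commit", "time_us": 10, "extra": {"a": 1}}',
    '{"kind": "identity", "time_us": 20}',
    '{"kind": "commit", "time_us": 30, "extra": "text"}',
]

columns, residual = shred(rows, ["kind", "time_us"])

# An aggregate over a shredded key only touches that column; the residual
# JSON is never read or parsed.
counts = Counter(columns["kind"])
print(counts)
```

In a columnar file the same layout would let a query on `kind` skip downloading the large residual column entirely, which is the cost described above.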
Follow-on thoughts:
Is your feature request related to a problem or challenge?
@dentiny (and maybe @onlyjackfrost?) pointed me at a new benchmark from ClickHouse, related to processing JSON files: https://github.com/ClickHouse/JSONBench
It would be great to figure out how to get DataFusion represented in that benchmark / show its performance when processing JSON files
Describe the solution you'd like
Figure out how to run DataFusion on the JSONBench test
bench.sh
Describe alternatives you've considered
No response
Additional context
No response