
Run / test Datafusion with JSON Bench from ClickHouse #14874

Open
alamb opened this issue Feb 25, 2025 · 10 comments
Labels
enhancement (New feature or request) · performance (Make DataFusion faster)

Comments

@alamb
Contributor

alamb commented Feb 25, 2025

Is your feature request related to a problem or challenge?

@dentiny (and maybe @onlyjackfrost?) pointed me at a new benchmark from ClickHouse, related to processing JSON files: https://github.com/ClickHouse/JSONBench

It would be great to figure out how to get DataFusion represented in that benchmark and show how it performs when processing JSON files.

Describe the solution you'd like

Figure out how to run DataFusion on the JSONBench test:

  1. Integrate it locally into bench.sh
  2. Figure out how to submit results to the JSON bench repo

Describe alternatives you've considered

No response

Additional context

No response

alamb added the enhancement and performance labels on Feb 25, 2025
@dentiny
Contributor

dentiny commented Feb 25, 2025

I didn't do that :)

@alamb
Contributor Author

alamb commented Feb 25, 2025

> I didn't do that :)

Sorry -- I got my GitHub <-> real-life handles mixed up. Maybe @goldmedal knows the right person.

@goldmedal
Contributor

I think you mean @douenergy (Alex)

@ZENOTME
Contributor

ZENOTME commented Feb 26, 2025

I tried to run the query directly against the JSON file but got this error:

Error: ArrowError(JsonError("Expected scalar or scalar array JSON type, found: Object({\"cid\": Scalar({Utf8}), \"uri\": Scalar({Utf8})})"), None)

JSONBench data can have a different schema for each row, and it looks like DataFusion (arrow-json) can't support this. E.g. the type of subject differs between these two rows:

{"did":"did:plc:3i4xf2v4wcnyktgv6satke64","time_us":1732206349000644,"kind":"commit","commit":{"rev":"3lbhuvzds6d2a","operation":"create","collection":"app.bsky.feed.like","rkey":"3lbhuvzdked2a","record":{"$type":"app.bsky.feed.like","createdAt":"2024-11-21T16:25:46.221Z","subject":{"cid":"bafyreidjvrcmckkm765mct5fph36x7kupkfo35rjklbf2k76xkzwyiauge","uri":"at://did:plc:azrv4rcbws6kmcga4fsbphg2/app.bsky.feed.post/3lbgjdpbiec2l"}},"cid":"bafyreia5l5vrkh5oj4cjyhcqby2dprhyvcyofo2q5562tijlae2pzih23m"}}
{"did":"did:plc:gccfnqqizz4urhchsaie6jft","time_us":1732206349001108,"kind":"commit","commit":{"rev":"3lbhuvze3gi2u","operation":"create","collection":"app.bsky.graph.follow","rkey":"3lbhuvzdtmi2u","record":{"$type":"app.bsky.graph.follow","createdAt":"2024-11-21T16:27:40.923Z","subject":"did:plc:r7cdh4sgzqbfdc6wcdxxti7c"},"cid":"bafyreiew2p6cgirfaj45qoenm4fgumib7xoloclrap3jgkz5es7g7kby3i"}}

@alamb
Contributor Author

alamb commented Feb 26, 2025

Thanks for checking it out @ZENOTME

> JSONBench data can have a different schema for each row, and it looks like DataFusion (arrow-json) can't support this. E.g. the type of subject differs between these two rows.

I think this may be related to what @TheBuilderJR is seeing / discussing here:

Allowing different (yet compatible) schemas

@ZENOTME
Contributor

ZENOTME commented Feb 26, 2025

> Thanks for checking it out @ZENOTME
>
> > JSONBench data can have a different schema for each row, and it looks like DataFusion (arrow-json) can't support this. E.g. the type of subject differs between these two rows.
>
> I think this may be related to what @TheBuilderJR is seeing / discussing here:
>
> Allowing different (yet compatible) schemas

It doesn't seem to be a case of different (yet compatible) schemas. E.g.:

first row: "subject":{"cid":"bafyreidjvrcmckkm765mct5fph36x7kupkfo35rjklbf2k76xkzwyiauge","uri":"at://did:plc:azrv4rcbws6kmcga4fsbphg2/app.bsky.feed.post/3lbgjdpbiec2l"}
second row: "subject":"did:plc:r7cdh4sgzqbfdc6wcdxxti7c"

In JSON, the same key (column) can have incompatible types across rows, so I think this is more about #7845. Judging from the following queries, JSONBench basically treats each JSON field as a column and runs aggregates over it. For this case, choice 1 of #7845 (comment) (adding a Json/Jsonb type to Arrow) may be the appropriate one. I suspect this is also the motivation behind ClickHouse's new JSON type design (https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse), which could be a good reference if we choose that solution. 🤔

SELECT data.commit.collection AS event, count() AS count FROM bluesky GROUP BY event ORDER BY count DESC;
SELECT data.commit.collection AS event, count() AS count, uniqExact(data.did) AS users FROM bluesky WHERE data.kind = 'commit' AND data.commit.operation = 'create' GROUP BY event ORDER BY count DESC;
SELECT data.commit.collection AS event, toHour(fromUnixTimestamp64Micro(data.time_us)) as hour_of_day, count() AS count FROM bluesky WHERE data.kind = 'commit' AND data.commit.operation = 'create' AND data.commit.collection in ['app.bsky.feed.post', 'app.bsky.feed.repost', 'app.bsky.feed.like'] GROUP BY event, hour_of_day ORDER BY hour_of_day, event;
SELECT data.did::String as user_id, min(fromUnixTimestamp64Micro(data.time_us)) as first_post_ts FROM bluesky WHERE data.kind = 'commit' AND data.commit.operation = 'create' AND data.commit.collection = 'app.bsky.feed.post' GROUP BY user_id ORDER BY first_post_ts ASC LIMIT 3;
SELECT data.did::String as user_id, date_diff( 'milliseconds', min(fromUnixTimestamp64Micro(data.time_us)), max(fromUnixTimestamp64Micro(data.time_us))) AS activity_span FROM bluesky WHERE data.kind = 'commit' AND data.commit.operation = 'create' AND data.commit.collection = 'app.bsky.feed.post' GROUP BY user_id ORDER BY activity_span DESC LIMIT 3;

@alamb
Contributor Author

alamb commented Feb 26, 2025

Maybe this is the use case for Variant 🤔

@adriangb
Contributor

adriangb commented Mar 3, 2025

One thing we've found from using JSON quite extensively is that query times are often dominated by downloading the large JSON column, not by parsing it or extracting data from it. I don't see any way to avoid this unless we split the data into multiple columns or teach DataFusion how to read only parts of a column.

I opened #14993 today, which I realized is a duplicate of a question I asked before in #7845 (comment). My understanding is that ClickHouse handles JSON by creating specialized "hidden" columns for each key (linked above, but see https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse). I think if DataFusion supported something like what I'm proposing in those comments (pushing down an expression into a file) we could:

  • At write time, for each file being written out, take the first (or most frequent) X JSON keys encountered and write them out as a Union type in their own columns (the Union type serving a similar purpose to ClickHouse's variant type). We still retain the original column. Any keys that didn't make the cut (often zero if X is large enough) get put into a "remainder" column, which, even if it holds some data, holds a lot less than the original column and is hence faster to query.
  • At query time, push the selection / filter down into each file and check if there is a specialized column for that key; otherwise fall back to reading from the remainder column.
  • The Union type can be a single-level union, i.e. Union(int, float, bool, string, array (stored as utf8 + metadata), object (stored as utf8 + metadata)); see the sketch after this list. I'm not sure if it should be dense or sparse.
  • If the data is truly heterogeneous in its keys, it's important that splitting keys out into pre-computed columns happen on a per-file basis (you can't go more granular without completely abandoning Parquet or something); otherwise you have to pick the right set of keys to optimize for upfront, which is a much harder problem.
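As a rough illustration of the single-level union described above, here is how such a type could be declared with arrow-rs; the type ids, field names, and the choice to spill arrays and objects to serialized UTF-8 are assumptions for the sketch, not a worked-out design:

```rust
use arrow_schema::{DataType, Field, UnionFields, UnionMode};

/// One shredded JSON key (e.g. `commit.collection`) stored as a
/// single-level union. Arrays and objects are kept as serialized JSON
/// text; the concrete encoding here is illustrative only.
fn shredded_key_type() -> DataType {
    let fields = UnionFields::new(
        vec![0i8, 1, 2, 3, 4, 5],
        vec![
            Field::new("int", DataType::Int64, true),
            Field::new("float", DataType::Float64, true),
            Field::new("bool", DataType::Boolean, true),
            Field::new("string", DataType::Utf8, true),
            Field::new("array", DataType::Utf8, true),  // serialized JSON array
            Field::new("object", DataType::Utf8, true), // serialized JSON object
        ],
    );
    // Sparse vs. dense is the open question raised above: sparse is simpler
    // to build, dense is smaller when the type distribution is skewed.
    DataType::Union(fields, UnionMode::Sparse)
}
```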

@alamb
Contributor Author

alamb commented Mar 5, 2025

> I opened #14993 today, which I realized is a duplicate of a question I asked before in #7845 (comment). My understanding is that ClickHouse handles JSON by creating specialized "hidden" columns for each key (linked above, but see https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse). I think if DataFusion supported something like what I'm proposing in those comments (pushing down an expression into a file) we could:

FWIW, the idea of storing some fields as separate columns is referred to as "shredding" in the Parquet doc / format they are adding:

> I think if DataFusion supported something like what I'm proposing in those comments (pushing down an expression into a file) we could:

I think adding expression pushdown into table providers would be valuable; it has come up a number of times, and this use case is a good one. I'll work on a write-up.
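As a concrete strawman for that write-up, a provider-side hook might look like the following; this trait does not exist in DataFusion today, and the names are invented purely for illustration:

```rust
use datafusion::error::Result;
use datafusion::logical_expr::Expr;

/// Hypothetical trait, sketched for discussion only. The idea: a table
/// provider reports which expressions it can evaluate natively (e.g. a
/// JSON key access for which a pre-shredded column exists in the file),
/// so the scan can read that narrow column instead of the full JSON blob.
trait ExprPushdown {
    /// True if this provider can evaluate `expr` itself.
    fn supports_expr(&self, expr: &Expr) -> bool;

    /// Rewrite `expr` into a reference to the specialized column, falling
    /// back to evaluation against the "remainder" column otherwise.
    fn pushdown_expr(&self, expr: Expr) -> Result<Expr>;
}
```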

@alamb
Contributor Author

alamb commented Mar 5, 2025
