
Run / test Datafusion with JSON Bench from ClickHouse #14874

Open
alamb opened this issue Feb 25, 2025 · 10 comments
Labels
enhancement (New feature or request) · performance (Make DataFusion faster)

Comments

@alamb
Contributor

alamb commented Feb 25, 2025

Is your feature request related to a problem or challenge?

@dentiny (and maybe @onlyjackfrost?) pointed me at a new benchmark from ClickHouse, related to processing JSON files: https://github.com/ClickHouse/JSONBench

It would be great to figure out how to get DataFusion represented in that benchmark and show how it performs when processing JSON files.

Describe the solution you'd like

Figure out how to run DataFusion on the JSONBench test:

  1. Integrate it locally into bench.sh
  2. Figure out how to submit results to the JSON bench repo

Describe alternatives you've considered

No response

Additional context

No response

alamb added the enhancement and performance labels on Feb 25, 2025
@dentiny
Contributor

dentiny commented Feb 25, 2025

I didn't do that :)

@alamb
Contributor Author

alamb commented Feb 25, 2025

> I didn't do that :)

Sorry -- I got my GitHub <-> real-life handles mixed up. Maybe @goldmedal knows the right person.

@goldmedal
Contributor

I think you mean @douenergy (Alex)

@ZENOTME
Contributor

ZENOTME commented Feb 26, 2025

I tried to run the query directly against the JSON file but got this error:

Error: ArrowError(JsonError("Expected scalar or scalar array JSON type, found: Object({\"cid\": Scalar({Utf8}), \"uri\": Scalar({Utf8})})"), None)

JSONBench data can have a different schema for each row, and it looks like DataFusion (arrow-json) can't support this. E.g. the type of subject differs between these two rows:

{"did":"did:plc:3i4xf2v4wcnyktgv6satke64","time_us":1732206349000644,"kind":"commit","commit":{"rev":"3lbhuvzds6d2a","operation":"create","collection":"app.bsky.feed.like","rkey":"3lbhuvzdked2a","record":{"$type":"app.bsky.feed.like","createdAt":"2024-11-21T16:25:46.221Z","subject":{"cid":"bafyreidjvrcmckkm765mct5fph36x7kupkfo35rjklbf2k76xkzwyiauge","uri":"at://did:plc:azrv4rcbws6kmcga4fsbphg2/app.bsky.feed.post/3lbgjdpbiec2l"}},"cid":"bafyreia5l5vrkh5oj4cjyhcqby2dprhyvcyofo2q5562tijlae2pzih23m"}}
{"did":"did:plc:gccfnqqizz4urhchsaie6jft","time_us":1732206349001108,"kind":"commit","commit":{"rev":"3lbhuvze3gi2u","operation":"create","collection":"app.bsky.graph.follow","rkey":"3lbhuvzdtmi2u","record":{"$type":"app.bsky.graph.follow","createdAt":"2024-11-21T16:27:40.923Z","subject":"did:plc:r7cdh4sgzqbfdc6wcdxxti7c"},"cid":"bafyreiew2p6cgirfaj45qoenm4fgumib7xoloclrap3jgkz5es7g7kby3i"}}

@alamb
Contributor Author

alamb commented Feb 26, 2025

Thanks for checking it out @ZENOTME

> JSONBench data can have a different schema for each row, and it looks like DataFusion (arrow-json) can't support this. E.g. the type of subject differs between these two rows.

I think this may be related to what @TheBuilderJR is seeing / discussing here:

Allowing different (yet compatible) schemas

@ZENOTME
Contributor

ZENOTME commented Feb 26, 2025

> Thanks for checking it out @ZENOTME
>
> > JSONBench data can have a different schema for each row, and it looks like DataFusion (arrow-json) can't support this. E.g. the type of subject differs between these two rows.
>
> I think this may be related to what @TheBuilderJR is seeing / discussing here:
>
> Allowing different (yet compatible) schemas

It doesn't seem to be a case of different (yet compatible) schemas. E.g.:

first row: "subject":{"cid":"bafyreidjvrcmckkm765mct5fph36x7kupkfo35rjklbf2k76xkzwyiauge","uri":"at://did:plc:azrv4rcbws6kmcga4fsbphg2/app.bsky.feed.post/3lbgjdpbiec2l"}
second row: "subject":"did:plc:r7cdh4sgzqbfdc6wcdxxti7c"

In JSON, the same key (column) can have incompatible types across rows, so I think this is more about #7845. Judging from the following queries, JSONBench basically treats each JSON field as a column and runs aggregates over it. For this case, choice 1 of #7845 (comment) (adding a Json/Jsonb type to Arrow) may be the appropriate one. I suspect this is also the motivation behind ClickHouse's new JSON type design (https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse), which could be a good reference if we choose that solution. 🤔

SELECT data.commit.collection AS event, count() AS count FROM bluesky GROUP BY event ORDER BY count DESC;
SELECT data.commit.collection AS event, count() AS count, uniqExact(data.did) AS users FROM bluesky WHERE data.kind = 'commit' AND data.commit.operation = 'create' GROUP BY event ORDER BY count DESC;
SELECT data.commit.collection AS event, toHour(fromUnixTimestamp64Micro(data.time_us)) as hour_of_day, count() AS count FROM bluesky WHERE data.kind = 'commit' AND data.commit.operation = 'create' AND data.commit.collection in ['app.bsky.feed.post', 'app.bsky.feed.repost', 'app.bsky.feed.like'] GROUP BY event, hour_of_day ORDER BY hour_of_day, event;
SELECT data.did::String as user_id, min(fromUnixTimestamp64Micro(data.time_us)) as first_post_ts FROM bluesky WHERE data.kind = 'commit' AND data.commit.operation = 'create' AND data.commit.collection = 'app.bsky.feed.post' GROUP BY user_id ORDER BY first_post_ts ASC LIMIT 3;
SELECT data.did::String as user_id, date_diff( 'milliseconds', min(fromUnixTimestamp64Micro(data.time_us)), max(fromUnixTimestamp64Micro(data.time_us))) AS activity_span FROM bluesky WHERE data.kind = 'commit' AND data.commit.operation = 'create' AND data.commit.collection = 'app.bsky.feed.post' GROUP BY user_id ORDER BY activity_span DESC LIMIT 3;

@alamb
Contributor Author

alamb commented Feb 26, 2025

Maybe this is the use case for Variant 🤔

@adriangb
Contributor

adriangb commented Mar 3, 2025

One thing we've found from using JSON quite extensively is that query times are often dominated by downloading the large JSON column, not by parsing it or extracting data from it. I don't see any way to avoid this unless we split the data into multiple columns or teach DataFusion how to read only parts of a column.

I opened #14993 today, which I realized is a duplicate of a question I asked before in #7845 (comment). My understanding is that ClickHouse handles JSON by creating specialized "hidden" columns for each key (linked above, but see https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse). I think if DataFusion supported something like what I'm proposing in those comments (pushing down an expression into a file) we could:

  • At write time, for each file being written out, take the first (or most frequent) X JSON keys encountered and write them out as a Union type in their own columns (the Union type serving a similar purpose to ClickHouse's variant type). We still retain the original column. Any keys that didn't make the cut (often zero if X is large enough) get put into a "remainder" column, which, even if it holds some data, holds a lot less than the original column and is hence faster to query.
  • At query time, push the selection / filter down into each file and check if there is a specialized column for that key; otherwise fall back to reading from the remainder column.
  • The Union type can be a single-level union, i.e. Union(int, float, bool, string, array (stored as utf8 + metadata), object (stored as utf8 + metadata)); see the sketch after this list. I'm not sure if it should be dense or sparse.
  • If the data is truly heterogeneous in its keys, it's important that splitting keys out into pre-computed columns happen on a per-file basis (you can't go more granular without completely abandoning Parquet or something); otherwise you have to pick the right set of keys to optimize for upfront, which is a much harder problem.
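As a rough illustration of the single-level union described above, here is how such a type could be declared with arrow-rs; the type ids, field names, and the choice to spill arrays and objects to serialized UTF-8 are assumptions for the sketch, not a worked-out design:

```rust
use arrow_schema::{DataType, Field, UnionFields, UnionMode};

/// One shredded JSON key (e.g. `commit.collection`) stored as a
/// single-level union. Arrays and objects are kept as serialized JSON
/// text; the concrete encoding here is illustrative only.
fn shredded_key_type() -> DataType {
    let fields = UnionFields::new(
        vec![0i8, 1, 2, 3, 4, 5],
        vec![
            Field::new("int", DataType::Int64, true),
            Field::new("float", DataType::Float64, true),
            Field::new("bool", DataType::Boolean, true),
            Field::new("string", DataType::Utf8, true),
            Field::new("array", DataType::Utf8, true),  // serialized JSON array
            Field::new("object", DataType::Utf8, true), // serialized JSON object
        ],
    );
    // Sparse vs. dense is the open question raised above: sparse is simpler
    // to build, dense is smaller when the type distribution is skewed.
    DataType::Union(fields, UnionMode::Sparse)
}
```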

@alamb
Contributor Author

alamb commented Mar 5, 2025

> I opened #14993 today, which I realized is a duplicate of a question I asked before in #7845 (comment). My understanding is that ClickHouse handles JSON by creating specialized "hidden" columns for each key (linked above, but see https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse). I think if DataFusion supported something like what I'm proposing in those comments (pushing down an expression into a file) we could:

FWIW, the idea of storing some fields as separate columns is referred to as "shredding" in the Parquet doc / format they are adding:

> I think if DataFusion supported something like what I'm proposing in those comments (pushing down an expression into a file) we could:

I think adding expression pushdown into table providers would be valuable; it has come up a number of times, and this use case is a good one. I'll work on a write-up.
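As a concrete strawman for that write-up, a provider-side hook might look like the following; this trait does not exist in DataFusion today, and the names are invented purely for illustration:

```rust
use datafusion::error::Result;
use datafusion::logical_expr::Expr;

/// Hypothetical trait, sketched for discussion only. The idea: a table
/// provider reports which expressions it can evaluate natively (e.g. a
/// JSON key access for which a pre-shredded column exists in the file),
/// so the scan can read that narrow column instead of the full JSON blob.
trait ExprPushdown {
    /// True if this provider can evaluate `expr` itself.
    fn supports_expr(&self, expr: &Expr) -> bool;

    /// Rewrite `expr` into a reference to the specialized column, falling
    /// back to evaluation against the "remainder" column otherwise.
    fn pushdown_expr(&self, expr: Expr) -> Result<Expr>;
}
```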

@alamb
Contributor Author

alamb commented Mar 5, 2025
