Datablock indexing with bloom filter #3130

junli1026 · 2021-11-27T02:41:35Z

junli1026
Nov 27, 2021

Hi all,
This discussion is about adding a bloom filter index for data blocks. It is very useful for high-cardinality columns such as email or account name. Different products take different design approaches to this.

Snowflake
I went through some Snowflake tech talks and docs and only found this one: https://nattaylor.com/blog/2019/snowflake-internals/
It says: Per file min/max values, #distinct values, #nulls, bloom filters etc. That is to say, by default Snowflake creates a bloom filter for every data file.
Databricks
Databricks takes the same approach as Snowflake, see this doc: https://docs.microsoft.com/en-us/azure/databricks/delta/optimizations/bloom-filters

A Bloom filter index is an uncompressed Parquet file that contains a single row. 
Indexes are stored in the _delta_index subdirectory relative to the data file and 
use the same name as the data file with the suffix index.v1.parquet.

ClickHouse
There are many ClickHouse experts in the community, so correct me if I am wrong: ClickHouse doesn't create a bloom filter by default; the user has to define it.

In summary, Snowflake and Databricks create a bloom filter index file for every data block file. The index keeps per-column information, which benefits predicates like 't.number=1' by pruning unrelated data blocks. ClickHouse, in contrast, requires the user to create the index.
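
To make the pruning idea concrete, here is a minimal sketch of a per-block, per-column bloom filter. It is not how Snowflake, Databricks, or ClickHouse implement it. The names `BloomFilter`, `build_block_index`, and `prune_blocks` are hypothetical, and the choices of SHA-256 with double hashing and a 1% false-positive target are assumptions made only for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, expected_items: int, fp_rate: float = 0.01):
        n = max(1, expected_items)
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-n * math.log(fp_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, value):
        # Derive two 64-bit hashes from one digest, then combine them (double hashing).
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1  # keep the step odd, never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))


def build_block_index(blocks, column):
    """Build one filter per data block for the given column."""
    index = []
    for rows in blocks:
        bf = BloomFilter(expected_items=len(rows))
        for row in rows:
            bf.add(row[column])
        index.append(bf)
    return index


def prune_blocks(index, value):
    """Return ids of blocks that may contain `value`; all other blocks can be skipped."""
    return [block_id for block_id, bf in enumerate(index) if bf.might_contain(value)]


if __name__ == "__main__":
    # Three toy "data blocks", each a list of rows with a `number` column.
    blocks = [[{"number": n} for n in range(start, start + 100)] for start in (0, 100, 200)]
    index = build_block_index(blocks, "number")
    # Equivalent of the predicate t.number = 150: only candidate blocks are read.
    print(prune_blocks(index, 150))
```

A filter can report a false positive, so a block it keeps still has to be scanned, but it never reports a false negative, so skipping a block is always safe. The trade-off between the two designs above is then mostly about who pays the build and storage cost: every block by default, or only the columns the user asks for.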

Which one do you think is better?

PsiACE · 2021-11-27T03:01:38Z

PsiACE
Nov 27, 2021

What do you think of xorfilter? Might also be a good option if we don't need to rebuild the index frequently/dynamically.

Thomas Mueller Graf, Daniel Lemire, Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters, Journal of Experimental Algorithmics 25 (1), 2020. DOI: 10.1145/3376122
Use XorFilter instead of Bloomfilter dgraph-io/badger#1264 , worth a look, contains some comparisons
add Xorfilter oceanbase/oceanbase#324 , just list it

Ribbon filter looks great too.

2 replies

junli1026 Nov 27, 2021
Author

Very interesting! Thanks for letting me know. Is there any ongoing internal discussion about data skipping with Bloom/Xor filters?

ZhiHanZ Nov 27, 2021
Collaborator

another interesting paper: https://arxiv.org/pdf/2009.08150.pdf
