Bloom Filters An Overview

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter); the more elements that are added to the set, the larger the probability of false positives. (Wikipedia)

In effect the Bloom Filter tells you where things are not. Why do you want to know where they are not? Consider the case where you have lots of items that you place in lots of bins. Now you want to find and item that was previously placed in a bin. The bloom filter can tell you which bins do not contain the item, thus reducing the search space significantly.

Bloom filters are defined by 4 interdependent values:

n - Number of items in the filter
p - Probability of false positives, float between 0 and 1 or a number indicating 1-in-p
m - Number of bits in the filter
k - Number of hash functions

The values are interdependent as shown in the following calculations:

m = ceil((n * log(p)) / log(1.0 / (pow(2.0, log(2.0)))));
k = round(log(2.0) * m / n);

There are a number of bloom filter calculators available on the web.

I find that when constructing a filter I know the approximate number of items in the filter and an acceptable level of false positives, from this I calculate the number of bits in the filter and the number of hash functions.

Constructing a Bloom Filter

As noted above bloom filters are based on a number of hash functions (k). The most common mechanism for creating the hash functions is to pick a hash function and then seed it the the values from 1 to k (or 0 to k-1) to generate k distinct hash functions. This works well but is computationally expensive. The Cassandra code had a comment in it (which I can no longer locate) that indicated that their experimental data showed that for hash functions that return 128 bits this could be split into 2 64-bit integer (long in Java) values. Lets call them h1 and h2. Now the calculation is:

long hash = h1;
for (int i=0;i<k;i++) {
    hash += (k*h2);
    // use the hash here
}

This process yields the necessary number of hash functions without the overhead of calling the hash calculation multiple times.

To build the bloom filter we need a bit vector of size m (as noted above). We generate k hashes and use each one to turn on one bit in the vector. This can be achieved using the modulus function.

BitSet bloomFilter = new BitSet(m);
// calculate the 128-bit hash and split into h1 and h2 as noted above
long hash = h1;
for (int i=o;i<k;i++) {
    hash += (k*h2);
    // take the modulus
    int bitToSet = hash % m
    bloomFilter.set( bitToSet );
}

The bloom filter now contains the value for the item that was hashed.

Accumulating Bloom Filters

In most use cases we want to put the hashed items into some sort of container and be able to quickly determine if an object is not in the container. To do this simply perform a logical-or for the two filters and use the result for the container filter.

// assume bucket filter is set above
BitSet bucketFilter;
// assume bucket is set above
List<Object> bucket;
// assume itemFilter is generated as noted above
BitSet itemFilter;
// assume item is the object used to  generate itemFilter
Object item;
// add the object to the bucket
bucket.add( item );
bucketFilter.or( itemFilter );

The result is that the bucket Bloom Filter now contains the hash for the item.

Determining if a Bloom Filter Contains Another Bloom Filter

To determin if a bucket contains an item we simply perform a logical-and between the bucket Bloom Filter and the item Bloom Filter, if the result is the item bloom filter we have a match. Using the Java BitSet we have to go a bit roundabout since the BitSet logical methods modify the BitSet itself.

// assume configured above
BitSet bucketFilter
// assume configured above
List<Object> bucket;
// assume to be the filter we want to find.
BitSet itemFilter;
// clone the bucket filter so we can perform the check
BitSet cloneFilter = new BitSet(m);
cloneFilter.or( bucketFilter );
// do the test
cloneFilter.and( itemFilter );
if (itemFilter.equals(cloneFilter)) {
    // the item might be in the bucket 
    // scan the bucket for the object that created the filter or perform
    // operations that are valid if the item is not in the bucket.
} else {
    // the item is not in the bucket
}

Removing Items From a Bloom Filter

The standard Bloom Filter does not have a simple mechanism for removing items. The only solution is to recalculate the filter from the items in the bucket. There are extensions to the standard Bloom Filter, like the Counting Bloom Filter, that do have such mechanisms but we are not going to address them here.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bloom Filters An Overview

Constructing a Bloom Filter

Accumulating Bloom Filters

Determining if a Bloom Filter Contains Another Bloom Filter

Removing Items From a Bloom Filter

Clone this wiki locally