Unusual Datastructures: The Bloom Filter

As a programmer you are familiar with the basic data structures: arrays, hashtables and linked lists. These easily cover more than 99% of use-cases (with arrays and hashtables alone covering something like 98.5%), but knowing which use-cases each is suited for is important; otherwise you end up writing code like RubyGems did, allocating about 100,000 times more objects than were necessary because it was written to use an array rather than a linked list.

However, there are some cases where you don't have the resources to use a hashtable, for example because you need to store a very large number of small items, or a smaller number of large items, or because you are writing software for an embedded CPU and can spare a grand total of 4 KB of RAM for the entire program.

That is when you might need a Bloom filter. Bloom filters belong to the class of probabilistic data-structures[1]. Unlike hashtables and arrays, a Bloom filter may give false positives (i.e. claim that an item has already been added to the collection when it really hasn't), although it will never give a false negative (i.e. claim, incorrectly, that something has not been added to the collection when it really has). In addition, it is not possible to remove something from a Bloom filter; if you need to, you will have to throw the filter away and start over from scratch.

These limitations may make it seem like a Bloom filter isn’t very useful in practice. Not so!

Imagine a large on-disk cache of, well, anything that may contain what you are looking for. If the item is not in the cache you have to fetch it over the internet, which is very, very slow compared to memory or even disk access. If you put a Bloom filter in front of lookups in this cache, the filter will correctly tell you when an item is not in the cache, so you can skip the disk access and start fetching the item from the internet immediately. If, on the other hand, the Bloom filter tells you the item is in the collection, you have to fetch it from the cache anyway, so a false positive is no big deal: you will still get the right answer when you query the cache. In fact, since the combination of Bloom filter and on-disk cache, used this way, maintains the original contract of the cache, you can insert the Bloom filter right into the code that handles cache access and get a free speed-up everywhere it is used.
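
As a rough sketch of what that might look like in Java (DiskCache, Item and fetchFromNetwork are hypothetical stand-ins, not real APIs; the BloomFilter interface is the one defined further down):

// Sketch only: DiskCache, Item and fetchFromNetwork are hypothetical stand-ins.
class CachingFetcher {
  private final BloomFilter<String> filter; // tracks keys that were written to the cache
  private final DiskCache cache;            // the slow on-disk cache

  CachingFetcher(BloomFilter<String> filter, DiskCache cache) {
    this.filter = filter;
    this.cache = cache;
  }

  Item lookup(String key) {
    if (!filter.hasBeenAdded(key)) {
      // No false negatives: the item is definitely not on disk,
      // so skip the disk read and go straight to the network.
      return fetchFromNetwork(key);
    }
    // Probably cached; a false positive only costs one wasted disk read.
    Item cached = cache.get(key);
    return cached != null ? cached : fetchFromNetwork(key);
  }

  private Item fetchFromNetwork(String key) {
    return null; // the slow path, elided in this sketch
  }
}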

You could even extend this to a network service such as Amazon, which stocks a large number of books, but not all books. Rather than sync all the data on every book in its catalogue, Amazon could share a Bloom filter containing the ISBN of each book it carries. Whenever the user browses to a page that mentions one or more ISBNs, the app can quickly show whether the book is available for purchase from Amazon, with a high degree of probability.

So how does a Bloom filter work?

They are actually relatively simple and consist of only two things:

  1. An array of booleans (in practice a bit array), all initially false.
  2. A number of hash functions (the filtering functions referred to below), each of which maps an element to positions in the array.

There are two steps to adding to the collection:

  1. Calculate the output of running each of the filtering functions over the element.
  2. For every boolean that is true in that output, mark the corresponding entry in the array as true as well.

There are also two steps to querying the collection:

  1. Calculate the output of running each of the filtering functions over the element.
  2. If any of those booleans is not set in the array, the element cannot have been added, so return false. Otherwise return true.
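
Put together, a minimal sketch of those two operations might look like this (the class and names are mine, purely for illustration; storage is a plain boolean array and each filtering function maps an element to an index in that array):

import java.util.List;
import java.util.function.ToIntFunction;

// Minimal sketch, not production code.
class SimpleBloomFilter<T> {
  private final boolean[] bits;
  private final List<ToIntFunction<T>> hashFunctions;

  SimpleBloomFilter(int size, List<ToIntFunction<T>> hashFunctions) {
    this.bits = new boolean[size];
    this.hashFunctions = hashFunctions;
  }

  public void add(T item) {
    // Run every filtering function and mark the resulting positions as true.
    for (ToIntFunction<T> h : hashFunctions)
      bits[Math.floorMod(h.applyAsInt(item), bits.length)] = true;
  }

  public boolean hasBeenAdded(T item) {
    // If any position is still false the item cannot have been added.
    for (ToIntFunction<T> h : hashFunctions)
      if (!bits[Math.floorMod(h.applyAsInt(item), bits.length)])
        return false;
    return true; // possibly a false positive
  }
}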

The only way the filter can return a false positive is if other items have been added so that, by chance, all the bits the hash functions would have set were already marked as true. Later we shall see how to calculate the probability of that happening.

A public interface to the Bloom filter could look like this:

public interface BloomFilter<T>
{
  public void add(T item);
  public boolean hasBeenAdded(T item);
}

As there is no way to get an item back out of a Bloom filter, the type parameter only ever appears in argument position, so the generics can be a bit more flexible than they otherwise would be.

The storage for the Bloom filter is just an array of the appropriate length, but how do we implement the hash functions?

Well, one of the benefits of open source is that we can spy on what others have done, so after a bit of googling I found this Ruby library which uses the djb2 hash function:

unsigned long hash(unsigned char *str) {
  unsigned long hash = 5381;
  int c;
  while (c = *str++)
    hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
  return hash;
}

That is just one hash function though, which isn't enough. What the author did was XOR the hash with a different salt for each function, thereby creating a much larger number of hash values from a single underlying hash. This is somewhat of a departure from what is suggested in Building a Better Bloom Filter (and what appears to be the standard way) of deriving the extra values from two different hash functions. Building a Better Bloom Filter does show that that method produces both empirically and theoretically optimal results, but that doesn't mean the Ruby author was wrong: a binary XOR is about as cheap as an operation gets on any CPU I know of, and it will always be faster to run one hash function than two, all else being equal.
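
A sketch of that salting trick in Java (a straightforward port of djb2 with arbitrary salts of my own choosing, not the Ruby library's actual code) could look like this, producing hash functions that plug straight into the sketch above:

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

// Illustration only: the salts are arbitrary constants picked by the caller.
class SaltedDjb2 {
  static long djb2(byte[] bytes) {
    long hash = 5381;
    for (byte b : bytes)
      hash = ((hash << 5) + hash) + (b & 0xFF); // hash * 33 + c
    return hash;
  }

  // Derive several "different" hash functions from the single djb2 hash by
  // XOR-ing its result with a per-function salt before reducing it to an index.
  static List<ToIntFunction<String>> saltedFunctions(long[] salts, int filterSize) {
    List<ToIntFunction<String>> functions = new ArrayList<>();
    for (long salt : salts)
      functions.add(s -> (int) Math.floorMod(
          djb2(s.getBytes(StandardCharsets.UTF_8)) ^ salt, (long) filterSize));
    return functions;
  }
}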

Okay, so now we know how to make one (or more) hash functions; what next?

The only things now missing are the size of the storage (m) and the number of hash functions (k). These depend on the rate of false positives we are willing to accept (p), the number of items we want to add to the Bloom filter (n) and the amount of storage we can afford. Once we have a guess as to how many items we have to insert, we can compute the optimal number of hash functions for a given storage size as:

k = \frac{m}{n}\ln 2

Assuming we will be using the optimal number of hash functions (i.e. the k above), we can compute the size of the Bloom filter using this formula (n again being the number of items in the filter):

m = - \frac{n \ln p}{(\ln 2)^2}

Where p is the false positive rate we are willing to accept. This is obviously another trade-off: a filter with a large number of hash functions takes more time to compute and needs more memory to store its bits than a filter with fewer hash functions; on the other hand, the smaller filter is more likely to give a false positive (which may be very expensive if it results in unnecessary disk or internet access).
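
To get a feel for the numbers, here is a quick worked example (the figures are only illustrative): suppose we expect to insert a million items and can live with a 1% false positive rate. Then:

m = -\frac{10^6 \ln 0.01}{(\ln 2)^2} \approx 9.6 \times 10^6 \text{ bits} \approx 1.2 \text{ MB}

k = \frac{9.6 \times 10^6}{10^6}\ln 2 \approx 6.6

So we would round up and use 7 hash functions: roughly 9.6 bits and 7 hash evaluations per item, regardless of how large the items themselves are.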

Building a Better Bloom Filter (see link above) worked with an assumption of a 1% false positive rate, which again appears to be the standard, but that does not mean it will be the rate that is most suitable for your particular application: it depends on how expensive a false positive is, how often you expect something to be in the collection and how much computing power and memory you have. Do note, though, that m changes with the logarithm of p, so pushing the false positive rate down by a factor of ten only costs a fixed extra amount of space and a few extra hash functions per item.
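
For instance (illustrative numbers again), compare a 1% rate with a 0.1% rate:

p = 0.01 \Rightarrow \frac{m}{n} \approx 9.6 \text{ bits per item}, \quad k = 7

p = 0.001 \Rightarrow \frac{m}{n} \approx 14.4 \text{ bits per item}, \quad k = 10

A tenfold better false positive rate costs only about 50% more space and a few more hash functions per item.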


1. Meaning that unlike traditional data-structures they are not always going to give the correct answer, but have better memory and/or CPU characteristics. This is a trade-off of certainty vs performance, much like the more traditional trade-off of memory vs CPU. [↩](#fnref1)