For reasons I struggle to fully explain, I get overly excited about data structures. One data structure that has a permanent spot in my top five is the humble Bloom filter. Why? Two reasons, primarily: Bloom filters are beautifully simple yet incredibly versatile. In an era of vast datasets, they provide a mechanism for maintaining speed and overall performance.
What does a Bloom filter do?
A Bloom filter allows you to test whether an element belongs in a set or not. Actually, that’s untrue. It doesn’t. It allows you to test if an element most likely belongs to a set, or if it absolutely doesn’t. In others words, false positives may occur, but false negatives won’t. It’s this susceptibility to errors that makes Bloom filters extremely space efficient. But more on that later; first, let’s see what makes Bloom filters so easy to grasp.
Simplicity is the purest form of elegance
I like to think of a Bloom filter as a mathematical set on steroids. The data structure consists of two discrete components: a bunch of hash functions and a bit array. That’s it.
To initialise a Bloom filter, we set all bits in its bit array to zero. So, if we have an array of ten bits:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
To add an element, we give it to our hash functions as an input. Each hash function will output an array index. Let’s say we passed the string
foo to two hash functions which give us indices 0 and 4. We set the corresponding bits in our bit array to 1:
[1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
Querying for an element is just as straightforward. We pass the element to our hash functions and check the returned indices to see whether the corresponding bits in the array have a value of 1 or 0. If none of them are 1, the element in question is not part of the set. We don’t necessarily even need to check all of the values at the given indices. If we find at least one where the bit is zero, then the element can’t be part of the set. Let’s say we query for the string
bar using our two hash functions. Let’s further assume that the functions output indices 1 and 5. Checking either one of these bit array indices is enough for us to conclude that
bar is not part of our set, since the value at both indices is 0.
If all the corresponding bits are set, then the element may be part of the set. Since hash functions have a nasty habit of producing collisions, we can’t be 100% sure that the queried element is part of the set. There’s a chance that another previously added element caused the same bits to be flipped.
Removing an element is a simple operation inasmuch as it can’t be done. If we were to reset the corresponding bits of the element we wish to remove, we may end up removing other elements that have the same corresponding bits. What happens in a Bloom filter stays in a Bloom filter.
How efficient are Bloom filters?
Adding elements is very efficient in terms of time. Since we don’t iterate the bit array but simply set bits, the time the operation takes is only dependent on the number of hash functions we have. So, for k hash functions, the time complexity is O(k). Lovely.
The same goes for querying for an element. For k hash functions, there is a constant upper bound to the number of indices we need to check in the bit array. O(k) for both addition and lookup — very nice.
If you think the time complexities are encouraging, then the space complexity is even better. Since we aren’t storing the elements themselves, only the information tells us if they are present or not, the size of a Bloom filter isn’t dependent of the size or number of elements already in the set. If we are willing to accept a false probability of 1%, less than 10 bits per element to be added is sufficient.
What can Bloom filters be used for?
Short answer: lots of stuff.
Long answer: as it turns out, there are a vast number of applications where a small probability of getting false positives is fine as long as we don’t get any false negatives. The first example that springs to mind is using a Bloom filter to decide if we need to perform a more costly operation such as accessing some resource on disk or over a network. Asking the Bloom filter “does this resource even exist” before doing a costly lookup yields instant performance gains.
The same goes for maintaining blacklists. Google Chrome previously employed Bloom filters to identify malicious URLs. In this scenario, requested URLs are first checked against a local Bloom filter; a full check against the real blacklist is done if (and only if) the filter indicates that the URL is malicious.
Even though Chrome has moved away from using Bloom filters for blacklisting purposes, it’s very possible that when you first visited this page, a Bloom filter was used to make sure the rendering was snappy. WebKit, the popular web browser engine, consults a Bloom filter to decide which CSS selectors to even bother with before processing declarations.
Off the top of my head, I can think of many more applications, including eliminating unnecessary calls to web caches, preventing the use of weak passwords, and checking for reserved usernames during registration processes. I’ll leave the rest of the ideation up to the readers of this blog post; learning what a data structure does and thinking about what can be achieved with it evokes a childlike sense of discovery in me, and I hope it does the same for you!
A couple of interesting tidbits
- Since you can’t remove an element from a Bloom filter, the probability of false positives will continue to go up if you keep adding elements to it. It starts to degrade. In that sense, Bloom filters are not like wine. They don’t get better with age.
- We owe the existence of Bloom filters to a chap called Burton Howard Bloom, who invented the data structure way back in 1970. There’s precious little information about him online, but from what I could dig up, it seems he attended MIT and worked on database systems.
A Bloom filter is a semi-transient data structure that provides a time and space efficient way for testing if an element is part of a set. Bloom filters are cool. Consider using one in cases where you need 100% recall but a small probability of false positives is acceptable.
This post is also available, in its entirety, on SC5.io.