February 8, 2016

What are Bloom filters, and why are they useful?

For reasons I struggle to explain, I get overly excited about data structures. One data structure that has a permanent spot in my top five is the humble Bloom filter. Why? Two reasons, primarily: Bloom filters are beautifully simple yet incredibly versatile. In the world of data structures, that’s like having your cake and eating it, too.

So what does a Bloom filter do?

A Bloom filter allows you to test whether an element belongs in a set or not. Actually, that’s untrue. It doesn’t. It allows you to test if an element most likely belongs to a set, or if it doesn’t. In other words, false positives may occur, but false negatives won’t. It’s this susceptibility to errors that makes Bloom filters extremely space-efficient. But more on that later; first, let’s see what makes Bloom filters so easy to grasp.

Simplicity is the purest form of elegance

I like to think of a Bloom filter as a mathematical set on steroids. The data structure consists of two components: a bunch of hash functions and a bit array. That’s it.

To initialise a Bloom filter, we set all bits in its bit array to zero. So, if we have an array of ten bits[1]:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
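
If you want to play along in Python, that starting state is nothing more than a list of zeroes:

    size = 10          # our toy example; real filters use far more bits
    bits = [0] * size  # every bit starts out as zero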

To add an element, we give it to our hash functions as input. Each hash function will output an array index. Let’s say we passed the string hello to two hash functions, giving us the indexes 0 and 4. We set the corresponding bits in our bit array to 1:

[1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
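
Here’s a rough sketch of adding in Python. Salting SHA-256 with the hash function’s number is just one hypothetical way to conjure up k hash functions, so the indexes it actually produces won’t be the neat 0 and 4 from the example above:

    import hashlib

    def indexes(item, k=2, size=10):
        # Fake k independent hash functions by salting SHA-256
        # with the hash function's number.
        return [int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % size
                for i in range(k)]

    bits = [0] * 10
    for idx in indexes("hello"):
        bits[idx] = 1  # flip the bits our hash functions point at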

Querying for an element is just as straightforward. We pass the element to our hash functions and check the returned indices to see whether the corresponding bits in the array have a value of 1 or 0. If none of them are 1, the element in question is not part of the set. In fact, we don’t necessarily even need to check all of the values at the given indices. If we find at least one where the bit is zero, then the element can’t be part of the set.

If all the corresponding bits are set, then the element may be part of the set. Since hash functions have a nasty habit of producing collisions, we can’t be 100% sure that the queried element is part of the set. There’s a chance that another previously added element caused the same bits to be set. That said, the probability of false positives can be kept low by choosing the number of hash functions carefully.
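
In code, a query just checks those same indexes (this reuses the indexes helper from the previous snippet). For the curious, the textbook approximation for the false positive rate after inserting n elements into m bits with k hash functions is p ≈ (1 - e^(-kn/m))^k, and it’s minimised by picking k ≈ (m/n) ln 2:

    import math

    def might_contain(item, bits, k=2):
        # A single zero bit proves the element was never added;
        # all ones only means it's *probably* in the set.
        return all(bits[idx] == 1 for idx in indexes(item, k, len(bits)))

    def false_positive_rate(m, n, k):
        # p = (1 - e^(-k*n/m))^k for m bits, n elements, k hash functions
        return (1 - math.exp(-k * n / m)) ** k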

Removing an element is a simple operation inasmuch as it can’t be done. If we were to reset the corresponding bits of the element we wish to remove, we may end up removing other elements that have the same corresponding bits. What happens in a Bloom filter stays in a Bloom filter.

How efficient are Bloom filters?

Adding elements is very efficient in terms of time. Since we don’t iterate over the bit array but simply set bits, the time the operation takes depends only on the number of hash functions we have. So, for k hash functions, the time complexity is O(k). Lovely.

The same goes for querying for an element. For k hash functions, there is a constant upper bound to the number of indices we need to check in the bit array. O(k) for both addition and lookup — very nice.

If you think the time complexities are encouraging, then the space complexity will blow you away. Since we aren’t storing the elements themselves, only information telling us if they are present or not, the size of a Bloom filter is fixed. It doesn’t grow with each new addition. And, compared to other constructs, bit arrays really don’t take up much space at all.

If we wanted to optimise things even further, we could throw away the bit array altogether and replace it with a simple array that only keeps track of which indices have their value set to 1. Since the only other possible value is 0, we don’t really need to keep a bunch of zeroes in memory. This type of structure is known as a sparse array.
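
Here’s a hypothetical sketch of that idea, again leaning on the indexes helper from earlier. One caveat: a Python set of integers is far heavier per entry than a bit, so whether this actually saves memory depends on how full the filter is; treat it as an illustration of the concept rather than an optimisation recipe.

    class SparseBloom:
        # Same behaviour as before, but only the positions
        # of the 1 bits are stored.
        def __init__(self, size, k):
            self.size, self.k = size, k
            self.ones = set()

        def add(self, item):
            self.ones.update(indexes(item, self.k, self.size))

        def might_contain(self, item):
            return all(i in self.ones
                       for i in indexes(item, self.k, self.size))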

So what can Bloom filters be used for?

Short answer: lots of stuff.

Long answer: as it turns out, there are a vast number of applications where a small probability of getting false positives is fine as long as we don’t get any false negatives. The first example that springs to mind is using a Bloom filter to decide if we need to perform a more costly operation such as accessing some resource on disk or over a network. Asking the Bloom filter “does this resource even exist” before doing a costly lookup yields instant performance gains.
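
A sketch of that pattern, using the SparseBloom class from above; expensive_lookup and known_keys are hypothetical stand-ins for your slow resource and whatever has been stored in it:

    bloom = SparseBloom(size=1 << 20, k=7)
    for key in known_keys:  # known_keys: hypothetical set of stored keys
        bloom.add(key)

    def get(key):
        if not bloom.might_contain(key):
            return None  # definitely absent, so skip the slow path
        # A rare false positive costs one wasted lookup,
        # never a wrong answer.
        return expensive_lookup(key)  # hypothetical disk/network call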

The same goes for maintaining blacklists. Google Chrome uses (or at least used to use) Bloom filters to identify malicious URLs. Requested URLs are first checked against a local Bloom filter; a full check against the real blacklist is done if (and only if) the filter indicates that the URL is malicious.

Off the top of my head, I can think of many more applications, including eliminating unnecessary calls to web caches, preventing the use of weak passwords, and checking if a username is reserved. I’ll leave the rest of the ideation up to you; reading up on what a data structure does and thinking about what can be achieved with it evokes a childlike sense of discovery in me, and I hope it does the same for you!

A couple of interesting tidbits

  • Since you can’t remove an element from a Bloom filter, the probability of false positives will continue to go up if you keep adding elements to it. It starts to degrade. In that sense, Bloom filters are not like wine. They don’t get better with age.
  • We owe the existence of Bloom filters to a chap called Burton Howard Bloom, who invented the data structure way back in 1970. There’s precious little information about him online, but from what I could dig up, it seems he attended MIT and worked on database systems.

TL;DR

Bloom filters provide a time- and space-efficient way to test whether an element is part of a set. Bloom filters are cool. Consider using one in cases where you need 100% recall but a small probability of false positives is acceptable.

  1. I realise that a 10-bit array is an, err, unrealistic example. Bloom filter arrays tend to be much larger.
February 7, 2016

Heisenberg Developers

Finely grained management of software developers is compelling to a business. Any organization craves control. We want to know what we are getting in return for those expensive developer salaries. We want to be able to accurately estimate the time taken to deliver a system in order to do an effective cost-benefit analysis and to give the business an accurate forecast of delivery. There’s also the hope that by building an accurate database of estimates versus actual effort, we can fine-tune our estimation, and by analysis find efficiencies in the software development process.

The problem with this approach is that it fundamentally misunderstands the nature of software development. That it is a creative and experimental process. Software development is a complex system of multiple poorly understood feedback loops and interactions. It is an organic process of trial and error, false starts, experiments and monumental cock-ups. …

Amen. I’ve never been in a development project where estimates actually work. I encourage you to read the entire article — it’s 100% on point.

February 5, 2016

Three and a half degrees of separation

How connected is the world? Playwrights [1], poets [2], and scientists [3] have proposed that everyone on the planet is connected to everyone else by six other people. In honor of Friends Day, we’ve crunched the Facebook friend graph and determined that the number is 3.57. Each person in the world (at least among the 1.59 billion people active on Facebook) is connected to every other person by an average of three and a half other people. The average distance we observe is 4.57, corresponding to 3.57 intermediaries or “degrees of separation.” Within the US, people are connected to each other by an average of 3.46 degrees.

Well, there we are. Not six but something closer to four. Or maybe even closer to three, given that the figure has shrunk each time Facebook—which continues to grow—has done one of these studies.

February 2, 2016

On The Witness and digital piracy

I’ve always taken a harsh stance against pirating digital goods, and reading Jonathan Blow’s recent tweets about how many people have pirated The Witness just made me more angry about the situation.

For this particular game, the chief excuse seems to be “$40 is too high”.

I understand that some people may not be able to afford The Witness. But even if the price is too high for someone, it doesn’t mean the value proposition isn’t there — quite the opposite. The Witness has over 50 hours of content. It has sublime graphics, great gameplay, and runs well on both platforms it’s currently on. In my estimation, it’s one of the best puzzle games ever made. Just because you can’t afford it doesn’t mean it’s okay not to pay for it.

When else would I get the chance to use a South Park episode for cover art?

The problem isn’t the $40 price tag. Price is just one in a long line of excuses pirates use to justify not paying for someone else’s work.

  • If a game has DRM: “I’m pirating it because I won’t tolerate DRM”
  • If a game is developed by a large company: “I’m pirating because company X doesn’t care about its customers”
  • If a game isn’t available on every platform under the sun: “I’m pirating because it’s not also available on X”
  • If a game isn’t available in every region yet: “I’m pirating because company X doesn’t give a shit about country Y”
  • If the game has DLC: “I’m pirating because they want us to pay for stuff that should have been in the base game”

The list goes on and on. Examining it any further is a futile exercise. Piracy will continue regardless of prevailing circumstances, because pirates are a bunch of self-entitled people who simply like free shit.

January 29, 2016

You know when someone gives a presentation and uses the word we when discussing something that could be done better? “We can improve here”, “we should not do this”.

In my experience, it always translates into “you, the audience”. Just because someone uses the collective we doesn’t mean the audience doesn’t see right through it.

January 24, 2016

Today’s purchases: Donkey Kong Jungle Beat for the GameCube (I own the bongos, so why not) and Mario’s Tennis for the Virtual Boy (I admit I haven’t got a good reason for this one).

That makes 549 games in my collection. I wonder if I’ll break 600 before year’s end.

January 21, 2016

SEGA just announced that the SEGA 3D Classics Collection is coming to retail in the US (it was only available in Japan previously).

I really wish they’d announce it for Europe, too. Mainly because M2 is behind the compilation, which ensures it’ll be absolutely top-notch.