Monthly Archives: February 2016

February 21, 2016

Google To Stop Showing Ads On Right Side Of Desktop Search Results Worldwide

All of this represents the conclusion of a test that began all the way back in 2010, and has been tweaked over the years since then. Search Engine Land readers — particularly some outside the U.S. — notified us late last year that they were seeing the top-only ads more frequently.

Data-driven design. I love stuff like this.

February 19, 2016

Neglected Utopia: Photographer explores the forgotten modernist estates of Paris

From the 1950s to the 1980s, Paris was booming. Foreign migration and urbanisation of the city caused a huge surge in population and a crisis for housing. France’s solution came in the form of vast housing projects and so during this period massive, modernist and really quite unique estates sprung up across the city — aiming for a new way of living.

Just a few decades later and these towering buildings look dated, discarded and forgotten. Often stigmatised by the media, they divide opinion in France and have been left mostly occupied by the ageing community of ‘urban veterans’ who first made it their home, as the younger generation are drawn to more contemporary city living.

Say what you will about the buildings, but the photography itself is breathtaking.

February 8, 2016

What are Bloom filters, and why are they useful?

For reasons I struggle to fully explain, I get overly excited about data structures. One data structure that has a permanent spot in my top five is the humble Bloom filter. Why? Two reasons, primarily: Bloom filters are beautifully simple yet incredibly versatile. In an era of vast datasets, they provide a mechanism for maintaining speed and overall performance.

What does a Bloom filter do?

A Bloom filter allows you to test whether an element belongs to a set or not. Actually, that’s untrue. It doesn’t. It allows you to test whether an element most likely belongs to a set, or whether it absolutely doesn’t. In other words, false positives may occur, but false negatives won’t. It’s this susceptibility to errors that makes Bloom filters extremely space efficient. But more on that later; first, let’s see what makes Bloom filters so easy to grasp.

Simplicity is the purest form of elegance

I like to think of a Bloom filter as a mathematical set on steroids. The data structure consists of two discrete components: a bunch of hash functions and a bit array. That’s it.

To initialise a Bloom filter, we set all bits in its bit array to zero. So, if we have an array of ten bits:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

To add an element, we give it to our hash functions as an input. Each hash function will output an array index. Let’s say we passed the string foo to two hash functions which give us indices 0 and 4. We set the corresponding bits in our bit array to 1:

[1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
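As a minimal sketch of initialisation and addition (the two “hash functions” here are derived from seeded SHA-256 digests purely for illustration; real implementations typically use cheaper non-cryptographic hashes):

```python
import hashlib

NUM_BITS = 10

def bit_indices(item, num_hashes=2, num_bits=NUM_BITS):
    # Simulate k independent hash functions by salting one digest with a seed.
    return [
        int(hashlib.sha256(f"{seed}:{item}".encode()).hexdigest(), 16) % num_bits
        for seed in range(num_hashes)
    ]

bits = [0] * NUM_BITS         # initialise: every bit starts at zero
for i in bit_indices("foo"):  # add "foo": set the bits at its indices to 1
    bits[i] = 1

print(bits)
```

The exact positions depend on the hash functions used, so the printed array won’t necessarily match the 0-and-4 example above.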

Querying for an element is just as straightforward. We pass the element to our hash functions and check whether the bits at the returned indices are 1 or 0. If any of them is 0, the element in question is not part of the set. In fact, we don’t even need to check all of the indices: a single zero bit is enough to prove absence. Let’s say we query for the string bar using our two hash functions, and that they output indices 1 and 5. The value at both of these indices is 0, so checking either one is enough for us to conclude that bar is not part of our set.

If all the corresponding bits are set, then the element may be part of the set. Since hash functions have a nasty habit of producing collisions, we can’t be 100% sure that the queried element is part of the set. There’s a chance that another previously added element caused the same bits to be flipped.

Removing an element is a simple operation inasmuch as it can’t be done. If we were to reset the corresponding bits of the element we wish to remove, we may end up removing other elements that have the same corresponding bits. What happens in a Bloom filter stays in a Bloom filter.
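Putting addition and lookup together, a toy implementation might look like the following (a sketch, again assuming seeded SHA-256 digests as the hash functions; note the deliberate absence of a remove method):

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1000, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _indices(self, item):
        # Simulate k independent hash functions with seeded digests.
        return [
            int(hashlib.sha256(f"{seed}:{item}".encode()).hexdigest(), 16)
            % self.num_bits
            for seed in range(self.num_hashes)
        ]

    def add(self, item):
        for i in self._indices(item):
            self.bits[i] = 1

    def might_contain(self, item):
        # False means "definitely not in the set"; True only means "probably".
        return all(self.bits[i] == 1 for i in self._indices(item))

bf = BloomFilter()
bf.add("foo")
print(bf.might_contain("foo"))  # True: an added element is never reported absent
```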

How efficient are Bloom filters?

Adding elements is very efficient in terms of time. Since we don’t iterate the bit array but simply set bits, the time the operation takes is only dependent on the number of hash functions we have. So, for k hash functions, the time complexity is O(k). Lovely.

The same goes for querying for an element. For k hash functions, there is a constant upper bound to the number of indices we need to check in the bit array. O(k) for both addition and lookup — very nice.

If you think the time complexities are encouraging, the space complexity is even better. Since we aren’t storing the elements themselves, only the information that tells us whether they are present, the size of a Bloom filter doesn’t depend on the size of the elements in the set. If we are willing to accept a false positive probability of 1%, fewer than 10 bits per element is sufficient [1].
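The footnoted result follows from the standard sizing formulas: for a target false positive rate p, an optimally configured filter needs about -ln p / (ln 2)² bits per element, with k = (m/n) · ln 2 hash functions. A quick sanity check for p = 1%:

```python
import math

p = 0.01  # target false positive rate of 1%
bits_per_element = -math.log(p) / (math.log(2) ** 2)
optimal_hashes = bits_per_element * math.log(2)

print(f"{bits_per_element:.2f} bits per element")  # ~9.59, i.e. under 10
print(f"{optimal_hashes:.2f} hash functions")      # ~6.64, so 7 in practice
```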

What can Bloom filters be used for?

Short answer: lots of stuff.

Long answer: as it turns out, there are a vast number of applications where a small probability of getting false positives is fine as long as we don’t get any false negatives. The first example that springs to mind is using a Bloom filter to decide if we need to perform a more costly operation such as accessing some resource on disk or over a network. Asking the Bloom filter “does this resource even exist” before doing a costly lookup yields instant performance gains.
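That gating pattern can be sketched as follows (all names hypothetical; for brevity a plain set plays the role of the filter, since it exposes the same yes/maybe interface a Bloom filter would, just without the memory savings):

```python
class SetBackedFilter:
    """Stand-in exposing the interface a Bloom filter would offer.

    A real Bloom filter gives the same no-false-negatives guarantee in a
    fraction of the memory; only the calling pattern matters for this sketch.
    """

    def __init__(self, keys):
        self._keys = set(keys)

    def might_contain(self, key):
        return key in self._keys

# Hypothetical expensive resource (imagine a disk read or a network call).
EXPENSIVE_STORE = {"config.json": '{"debug": false}'}

prefilter = SetBackedFilter(EXPENSIVE_STORE)

def fetch(key):
    if not prefilter.might_contain(key):
        return None                  # definitely absent: skip the costly lookup
    return EXPENSIVE_STORE.get(key)  # probably present: do the real work

print(fetch("missing.txt"))  # None, without ever touching the store
```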

The same goes for maintaining blacklists. Google Chrome previously employed Bloom filters to identify malicious URLs. In this scenario, requested URLs are first checked against a local Bloom filter; a full check against the real blacklist is done if (and only if) the filter indicates that the URL is malicious.

Even though Chrome has moved away from using Bloom filters for blacklisting purposes, it’s very possible that when you first visited this page, a Bloom filter was used to make sure the rendering was snappy. WebKit, the popular web browser engine, consults a Bloom filter to decide which CSS selectors to even bother with before processing declarations.

Off the top of my head, I can think of many more applications, including eliminating unnecessary calls to web caches, preventing the use of weak passwords, and checking for reserved usernames during registration processes. I’ll leave the rest of the ideation up to the readers of this blog post; learning what a data structure does and thinking about what can be achieved with it evokes a childlike sense of discovery in me, and I hope it does the same for you!

A couple of interesting tidbits

  • Since you can’t remove an element from a Bloom filter, the probability of false positives will continue to go up if you keep adding elements to it. It starts to degrade. In that sense, Bloom filters are not like wine. They don’t get better with age.
  • We owe the existence of Bloom filters to a chap called Burton Howard Bloom, who invented the data structure way back in 1970. There’s precious little information about him online, but from what I could dig up, it seems he attended MIT and worked on database systems.

TL;DR

A Bloom filter is a semi-transient data structure that provides a time- and space-efficient way to test whether an element is part of a set. Bloom filters are cool. Consider using one in cases where you need 100% recall but a small probability of false positives is acceptable.

Links

This post is also available, in its entirety, on SC5.io.

  1. Bonomi, F., Mitzenmacher, M., Panigrahy, R., Singh, S. and Varghese, G., 2006. An improved construction for counting Bloom filters. In Algorithms – ESA 2006 (pp. 684–695). Springer Berlin Heidelberg.
February 7, 2016

Heisenberg Developers

Finely grained management of software developers is compelling to a business. Any organization craves control. We want to know what we are getting in return for those expensive developer salaries. We want to be able to accurately estimate the time taken to deliver a system in order to do an effective cost-benefit analysis and to give the business an accurate forecast of delivery. There’s also the hope that by building an accurate database of estimates versus actual effort, we can fine-tune our estimation, and by analysis find efficiencies in the software development process.

The problem with this approach is that it fundamentally misunderstands the nature of software development. That it is a creative and experimental process. Software development is a complex system of multiple poorly understood feedback loops and interactions. It is an organic process of trial and error, false starts, experiments and monumental cock-ups. …

Amen. I’ve never been in a development project where estimates actually work. I encourage you to read the entire article — it’s 100% on point.

February 5, 2016

Three and a half degrees of separation

How connected is the world? Playwrights [1], poets [2], and scientists [3] have proposed that everyone on the planet is connected to everyone else by six other people. In honor of Friends Day, we’ve crunched the Facebook friend graph and determined that the number is 3.57. Each person in the world (at least among the 1.59 billion people active on Facebook) is connected to every other person by an average of three and a half other people. The average distance we observe is 4.57, corresponding to 3.57 intermediaries or “degrees of separation.” Within the US, people are connected to each other by an average of 3.46 degrees.

Well, there we are. Not six but something closer to four. Or maybe even closer to three, given that the figure has shrunk each time Facebook—which continues to grow—has done one of these studies.