If you use Hadoop for batch processing of large data sets, your data-intensive computing needs probably include transaction-style processing as well. We won’t cover all the techniques for running real-time distributed data processing (caching, sharding, etc.). They aren’t necessarily Hadoop-related and are well beyond the scope of this book. One lesser-known tool for real-time data processing is the Bloom filter, which is a summary of a data set whose usage makes other data processing techniques more efficient. When that data set is big, Hadoop is often called in to generate the Bloom filter representation. As we mentioned earlier, a Bloom filter is also sometimes used for data joining within Hadoop itself. As a data processing expert, you’ll be well rewarded to have the Bloom filter in your bag of tricks.
What does a Bloom filter do?
At its most basic, a Bloom filter object supports two methods: add() and contains(). These two methods work in a similar way as in the Java Set interface. The method add() adds an object to the set, and the method contains() returns a Boolean true/false value denoting whether an object is in the set or not. But, for a Bloom filter, contains() doesn’t always give an accurate answer. It has no false negatives. If contains() returns false, you can be sure that the set doesn’t have the object queried. It does have a small probability of false positives though. contains() can return true for some objects not in the set. The probability of false positives depends on the number of elements in the set and some configuration parameters of the Bloom filter itself.
The major benefit of a Bloom filter is that its size, in number of bits, is constant and is set upon initialization. Adding more elements to a Bloom filter doesn’t increase its size. It only increases the false positive rate. A Bloom filter also has another configuration parameter to denote the number of hash functions it uses. We’ll discuss the reason for this parameter and how the hash functions are used later when we discuss the Bloom filter’s implementation. For now, its main implication is that it affects the false positive rate. The false positive rate is approximated by the equation:
where k is the number of hash functions used, m is the number of bits used to store the Bloom filter, and n is the number of elements to be added to the Bloom filter. In practice, m and n are determined by the requirement of the system, and therefore, k is chosen to minimize the false positive rate given m and n, which (after a little calculus) is:
The false positive rate with the given k is 0.6185^(m/n), and k has to be an integer. The false positive rate will only be an approximation. From a design point of view, one should think in terms of (m/n), number of bits per element, rather than m alone. For example, we have to store a set containing ten million URLs (n=10,000,000). Allocating 8 bits per URL (m/n=8) will require a 10 MB Bloom filter (m = 80,000,000 bits). This Bloom filter will have a false positive rate of(0.6185)^8, or about 2 percent. If we were to implement the Set class by storing the raw URLs, and let’s say the average URL length was 100 bytes, we would have to use 1 GB. Bloom filter has shrunk the storage requirement by 2 orders of magnitude at the expense of only a 2 percent false positive rate! A slight increase in storage allocated to the Bloom filter will reduce the false positive rate further. At 10 bits per URL, the Bloom filter will take up 12.5 MB and have a false positive rate of only 0.8 percent.
In summary, the signature of our Bloom filter class will look like the following:
Conceptually the implementation of a Bloom filter is quite straightforward. We describe its implementation in a single system first before implementing it using Hadoop in a distributed way. The internal representation of a Bloom filter is a bit array of size m. We have k independent hash functions, where each hash function takes an object as input and outputs an integer between 0 and m-1. We use the integer output as an index into the bit array. When we “add” an element to the Bloom filter, we use the hash functions to generate k indexes into the bit array. We set the k bits to 1. Figure 5.3 shows what happens when we add several objects (x, y, and z) over time, in a Bloom filter that uses three hash functions. Note that a bit will be set to 1 regardless of its previous state. The number of 1s in the bit array can only grow.
When an object comes in and we want to check whether it has been previously added to the Bloom filter, we use the same k hash functions to generate the bit array indexes as we would do in adding the object. Now we check whether all those k bits in the bit array are 1s. If all k bits are 1, we return true and claim that the Bloom filter contains the object. Otherwise we return false. We see that if the object has in fact been added before, then the Bloom filter will necessarily return true. There are no false negatives (returning false when the object is truly in the set). The k bits corresponding to the queried object can all be set to 1 even though the object has never been added to the set. It may happen that adding other objects set those bits leading to false positives.
An ingenious way of creating a Bloom filter for the union of two sets is by OR’ing the (bit array of the) Bloom filters of each individual set. As adding an object is setting certain bits in a bit array to 1, it’s easy to see why this union rule is true:
As the Bloom filter will be shuffled around as the mappers’ output, the BloomFilter class will have to implement the Writable interface, which consists of methodswrite() and readFields(). For our purpose these methods transform between the internal BitSet representation and a byte array such that the data can be serialized to DataInput/DataOutput. The final code is in listing 5.4.
- Listing 5.4 Basic Bloom filter implementation
The driver for the MapReduce program is straightforward. Our mappers will output a key/value pair where the value is a BloomFilter instance.
- Listing 5.5 A MapReduce program (Driver class)
Hadoop version 0.20 has a Bloom filter class in it. It plays a support role to some of the new classes introduced in version 0.20, and it will likely stay around for future versions as well. It functions much like our BloomFilter class in listing 5.4, although it’s much more rigorous in its implementation of the hashing functions. As a built-in class, it can be a good choice for semijoin within Hadoop. But it’s not easy to separate this class from the Hadoop framework to use it as a standalone class. If you’re building a Bloom filter for non-Hadoop applications, Hadoop’s built-in BloomFilter may not be appropriate.
* Chaining MapReduce jobs (Part1)
* Joining data from different sources (Part2)
* Replicated joins using DistributedCache (Part3)
* reduce-side join with map-side filtering (Part4)