[ In Action ] Ch5. Advanced MapReduce: Creating a Bloom filter (Part5)
Preface
If you use Hadoop for batch processing of large data sets, your data-intensive computing needs probably include transaction-style processing as well. We won’t cover all the techniques for running real-time distributed data processing (caching, sharding, etc.). They aren’t necessarily Hadoop-related and are well beyond the scope of this book. One lesser-known tool for real-time data processing is the Bloom filter, a compact summary of a data set that makes other data processing techniques more efficient. When that data set is big, Hadoop is often called in to generate the Bloom filter representation. As we mentioned earlier, a Bloom filter is also sometimes used for data joining within Hadoop itself. As a data processing expert, you’ll be well rewarded for having the Bloom filter in your bag of tricks.
What does a Bloom filter do?
At its most basic, a Bloom filter object supports two methods: add() and contains(). These two methods work much like their counterparts in the Java Set interface. The method add() adds an object to the set, and the method contains() returns a Boolean true/false value denoting whether an object is in the set or not. For a Bloom filter, though, contains() doesn’t always give an accurate answer. It has no false negatives: if contains() returns false, you can be sure that the set doesn’t have the object queried. It does have a small probability of false positives, however: contains() can return true for some objects not in the set. The probability of false positives depends on the number of elements in the set and on some configuration parameters of the Bloom filter itself.
The major benefit of a Bloom filter is that its size, in number of bits, is constant and is set upon initialization. Adding more elements to a Bloom filter doesn’t increase its size. It only increases the false positive rate. A Bloom filter also has another configuration parameter to denote the number of hash functions it uses. We’ll discuss the reason for this parameter and how the hash functions are used later when we discuss the Bloom filter’s implementation. For now, its main implication is that it affects the false positive rate. The false positive rate is approximated by the equation:
(1 – exp(-kn/m))^k
where k is the number of hash functions used, m is the number of bits used to store the Bloom filter, and n is the number of elements to be added to the Bloom filter. In practice, m and n are determined by the requirements of the system, and therefore k is chosen to minimize the false positive rate given m and n, which (after a little calculus) is:
k = ln(2) × (m/n) ≈ 0.7 × (m/n)
With this optimal k, the false positive rate is 0.6185^(m/n). Because k has to be an integer, the actual rate will only approximate this. From a design point of view, one should think in terms of m/n, the number of bits per element, rather than m alone. For example, say we have to store a set containing ten million URLs (n = 10,000,000). Allocating 8 bits per URL (m/n = 8) requires a 10 MB Bloom filter (m = 80,000,000 bits), which will have a false positive rate of (0.6185)^8, or about 2 percent. If we were to implement the Set class by storing the raw URLs, and the average URL length was 100 bytes, we would have to use 1 GB. The Bloom filter shrinks the storage requirement by two orders of magnitude, at the expense of only a 2 percent false positive rate! A slight increase in the storage allocated to the Bloom filter reduces the false positive rate further: at 10 bits per URL, the Bloom filter takes up 12.5 MB and has a false positive rate of only 0.8 percent.
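If you want to check this arithmetic yourself, the following plain-Java snippet (no Hadoop needed; the class name and values are only for illustration) reproduces the URL example above:

// Quick sanity check of the sizing arithmetic in the text.
public class BloomFilterSizing {
    public static void main(String[] args) {
        long n = 10000000L;                                 // number of URLs
        for (int bitsPerElement : new int[] {8, 10}) {
            long m = n * bitsPerElement;                    // total bits in the filter
            int k = (int) Math.round(bitsPerElement * Math.log(2));   // optimal k, rounded to an integer
            double fpRate = Math.pow(1 - Math.exp(-(double) k * n / m), k);
            System.out.printf("m/n=%d bits: size=%.1f MB, k=%d, false positive rate=%.2f%%%n",
                    bitsPerElement, m / 8 / 1e6, k, fpRate * 100);
        }
    }
}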
In summary, the signature of our Bloom filter class will look like the following:
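The listing that originally followed isn’t reproduced in this excerpt. As a rough sketch, the two-method surface described above amounts to something like the following (the interface name and the generic element type E are illustrative only; the concrete class is fleshed out in the next section):

// Sketch of the basic Bloom filter contract; not the book's actual listing.
public interface BloomFilterOps<E> {
    void add(E obj);          // add an object to the set
    boolean contains(E obj);  // false = definitely not in the set;
                              // true = in the set, barring a small false positive chance
}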
Implementing a Bloom filter
Conceptually the implementation of a Bloom filter is quite straightforward. We describe its implementation in a single system first, before implementing it with Hadoop in a distributed way. The internal representation of a Bloom filter is a bit array of size m. We have k independent hash functions, where each hash function takes an object as input and outputs an integer between 0 and m-1. We use the integer output as an index into the bit array. When we “add” an element to the Bloom filter, we use the hash functions to generate k indexes into the bit array and set those k bits to 1. Figure 5.3 shows what happens when we add several objects (x, y, and z) over time to a Bloom filter that uses three hash functions. Note that a bit will be set to 1 regardless of its previous state; the number of 1s in the bit array can only grow.
When an object comes in and we want to check whether it has previously been added to the Bloom filter, we use the same k hash functions to generate the bit array indexes as we would when adding the object, and then check whether all k of those bits are 1. If all k bits are 1, we return true and claim that the Bloom filter contains the object. Otherwise we return false. If the object has in fact been added before, the Bloom filter will necessarily return true: there are no false negatives (returning false when the object is truly in the set). However, the k bits corresponding to the queried object can all be 1 even though the object has never been added to the set, because adding other objects may have set those bits. This is what leads to false positives.
Implementing getHashIndexes() so that it truly behaves as k independent hash functions is nontrivial. Instead, in our Bloom filter implementation in listing 5.4, we use a hack to generate k indexes that are roughly independent and uniformly distributed. The getHashIndexes() method seeds the Java random number generator with an MD5 hash of the object and then takes k “random” numbers as indexes. The BloomFilter class would benefit from a more rigorous implementation of getHashIndexes(), but our hack suffices for illustration purposes.
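Listing 5.4 itself isn’t reproduced in this excerpt, but a minimal single-machine sketch of the hack just described might look like the following (class and variable names are illustrative, and error handling is kept crude for brevity):

import java.security.MessageDigest;
import java.util.BitSet;
import java.util.Random;

// A single-machine sketch of the Bloom filter described above (not the book's listing).
public class SimpleBloomFilter<E> {
    private BitSet bits;   // the m-bit array
    private int m;         // number of bits
    private int k;         // number of hash functions

    public SimpleBloomFilter(int m, int k) {
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m);
    }

    // The hack: seed java.util.Random with an MD5 digest of the object,
    // then draw k "random" (but reproducible) indexes in [0, m).
    private int[] getHashIndexes(E obj) {
        int[] indexes = new int[k];
        long seed = 0;
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                                         .digest(obj.toString().getBytes("UTF-8"));
            for (int i = 0; i < 6; i++) {
                seed = (seed << 8) | (digest[i] & 0xFF);
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        Random gen = new Random(seed);
        for (int i = 0; i < k; i++) {
            indexes[i] = gen.nextInt(m);
        }
        return indexes;
    }

    public void add(E obj) {
        for (int index : getHashIndexes(obj)) {
            bits.set(index);               // a bit stays 1 once set
        }
    }

    public boolean contains(E obj) {
        for (int index : getHashIndexes(obj)) {
            if (!bits.get(index)) {
                return false;              // definitely not in the set: no false negatives
            }
        }
        return true;                       // probably in the set (small false positive chance)
    }
}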
An ingenious way of creating a Bloom filter for the union of two sets is to OR the bit arrays of the Bloom filters of each individual set. As adding an object means setting certain bits in the bit array to 1, it’s easy to see why this union rule is true: a bit is set in the filter for the union exactly when at least one of the individual sets would have set it, which is precisely the bitwise OR of the two bit arrays (assuming both filters use the same size m and the same hash functions).
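Continuing the sketch above, the union operation is then a one-liner over the underlying BitSet (again an illustrative sketch; it assumes both filters were created with the same m, k, and hash functions):

// Union with another filter of identical configuration: just OR the bit arrays.
public void union(SimpleBloomFilter<E> other) {
    bits.or(other.bits);
}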
We’ll be exploiting this union trick to build Bloom filters in a distributed fashion. Each mapper will build a Bloom filter based on its own data split. We’ll send the Bloom filters to a single reducer, which will take a union of them and record the final output.
As the Bloom filter will be shuffled around as the mappers’ output, the BloomFilter class will have to implement the Writable interface, which consists of the methods write() and readFields(). For our purposes these methods transform between the internal BitSet representation and a byte array, so that the data can be serialized to DataInput/DataOutput. The final code is in listing 5.4. - Listing 5.4 Basic Bloom filter implementation
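Listing 5.4 isn’t reproduced in this excerpt. If the SimpleBloomFilter sketch above were declared to implement org.apache.hadoop.io.Writable (and given a no-argument constructor, which Hadoop needs in order to instantiate it before calling readFields()), the two methods could look roughly like this sketch, which uses BitSet.toByteArray()/valueOf() from Java 7+:

// Requires: import java.io.DataInput; import java.io.DataOutput; import java.io.IOException;
public SimpleBloomFilter() {
    this.bits = new BitSet();              // no-arg constructor for Hadoop's use
}

@Override
public void write(DataOutput out) throws IOException {
    out.writeInt(m);
    out.writeInt(k);
    byte[] bytes = bits.toByteArray();     // pack the BitSet into a byte array
    out.writeInt(bytes.length);
    out.write(bytes);
}

@Override
public void readFields(DataInput in) throws IOException {
    m = in.readInt();
    k = in.readInt();
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    bits = BitSet.valueOf(bytes);          // restore the bit array
}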
Next we’ll create the MapReduce program to make a Bloom filter using Hadoop. As we said earlier, each mapper will instantiate a BloomFilter object and add the key of each record in its split into its BloomFilter instance. (We’re using the key of the record to follow our data joining example.) We’ll create a union of the BloomFilter objects by collecting them into a single reducer.
The driver for the MapReduce program is straightforward. Our mappers will output a key/value pair where the value is a BloomFilter instance.
We want our reducer to output the final BloomFilter as a binary file. Hadoop’s OutputFormat classes either output text files or assume key/value pair output. Our reducer, therefore, won’t use Hadoop’s MapReduce output mechanism; instead, we’ll write the result out to a file ourselves.
Recall that our strategy for the mapper is to build a single Bloom filter over the entire split and output it at the end of the split to the reducer. Given that the map() method of the MapClass has no state information about which record in the split it’s processing, we output the BloomFilter in the cleanup() method to ensure that all the records in the split have been read. The MapClass looks like:
public static class MapClass
        extends Mapper<Text, Text, IntWritable, BloomFilter<String>> {

    // All mappers use the same key so their filters meet at a single reducer
    private final IntWritable one = new IntWritable(1);
    // The per-split Bloom filter (the BloomFilter class of listing 5.4; the
    // String element type and default constructor are assumed here)
    private final BloomFilter<String> bmf = new BloomFilter<String>();

    @Override
    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // Add the record's key into the Bloom filter
        bmf.add(key.toString());
    }

    @Override
    protected void cleanup(Context context) {
        try {
            // Emit the split's filter once, after every record has been added
            context.write(one, bmf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The BloomFilters generated by all the mappers are sent to a single reducer. The reduce() method in the Reduce class will do a Bloom filter union of all of them. - Reducer
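The excerpt below shows only the tail of this Reduce class (the part that writes the file). A sketch of the reduce() method that performs the union, assuming the BloomFilter class of listing 5.4 exposes a union() method that ORs the bit arrays as described earlier, might look like:

// Running union of all the mappers' Bloom filters (a field of the Reduce class;
// the String element type is an assumption of this sketch)
private BloomFilter<String> bf = new BloomFilter<String>();

@Override
public void reduce(IntWritable key, Iterable<BloomFilter<String>> values,
                   Context context)
        throws IOException, InterruptedException {
    for (BloomFilter<String> value : values) {
        bf.union(value);   // OR each incoming filter into the running union
    }
}

The cleanup() method, the end of which is shown below, then writes bf out to HDFS: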
@Override
protected void cleanup(Context context) {
    try {
        // Write the unioned Bloom filter out ourselves, in binary, to HDFS
        Path file = new Path(context.getConfiguration().get("mapred.output.dir")
                             + "/bloomfilter");
        FSDataOutputStream out =
            file.getFileSystem(context.getConfiguration()).create(file);
        bf.write(out);
        out.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
}
As we mentioned earlier, we want the final BloomFilter to be written out in a file of our own format rather than through one of Hadoop’s OutputFormats. We already set the reducer’s OutputFormat to NullOutputFormat in the driver to turn off that output mechanism. The rest of the cleanup() method in Reducer uses Hadoop’s file I/O to write out our BloomFilter in binary to a file in HDFS. The complete code is in listing 5.5. - Listing 5.5 A MapReduce program (Driver class)
public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new BloomFilterMR(), args);
    System.exit(res);
}
}
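The excerpt above shows only main(); the run() method, which isn’t shown, is where the job gets configured. A minimal sketch of what it might contain (class names, argument layout, and the use of FileOutputFormat.setOutputPath to populate mapred.output.dir are all assumptions of this sketch) is:

public int run(String[] args) throws Exception {
    Job job = new Job(getConf(), "Bloom filter builder");
    job.setJarByClass(BloomFilterMR.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    // NullOutputFormat turns off the normal output mechanism; we still set an output
    // path so the reducer can read "mapred.output.dir" and write the filter there itself
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setOutputFormatClass(NullOutputFormat.class);

    job.setMapperClass(MapClass.class);
    job.setReducerClass(Reduce.class);
    job.setNumReduceTasks(1);              // all the mappers' filters must meet in one reducer

    // The map() signature above expects Text/Text records, so choose a key/value
    // text input format appropriate to your Hadoop release
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(BloomFilter.class);

    return job.waitForCompletion(true) ? 0 : 1;
}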
Bloom filter in Hadoop version 0.20+
Hadoop version 0.20 has a Bloom filter class in it. It plays a supporting role to some of the new classes introduced in version 0.20, and it will likely stay around in future versions as well. It functions much like our BloomFilter class in listing 5.4, although it’s much more rigorous in its implementation of the hash functions. As a built-in class, it can be a good choice for a semijoin within Hadoop. But it’s not easy to separate this class from the Hadoop framework to use it as a standalone class. If you’re building a Bloom filter for non-Hadoop applications, Hadoop’s built-in BloomFilter may not be appropriate.
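For reference, using the built-in class directly looks roughly like the following standalone snippet (the vectorSize and nbHash values are arbitrary example numbers; the classes are those of Hadoop’s org.apache.hadoop.util.bloom package):

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

// Standalone illustration of Hadoop's built-in Bloom filter.
public class BuiltInBloomFilterExample {
    public static void main(String[] args) {
        int vectorSize = 80000000;   // m: number of bits (example value)
        int nbHash = 6;              // k: number of hash functions (example value)
        BloomFilter filter = new BloomFilter(vectorSize, nbHash, Hash.MURMUR_HASH);

        filter.add(new Key("http://example.com/a".getBytes()));

        // membershipTest() plays the role of contains(): no false negatives,
        // but a small chance of false positives
        System.out.println(filter.membershipTest(new Key("http://example.com/a".getBytes())));
        System.out.println(filter.membershipTest(new Key("http://example.com/b".getBytes())));
    }
}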