This chapter covers
So far this book has covered the core techniques for writing a MapReduce program. Hadoop is a large framework that supports many capabilities beyond those core techniques. In this age of Bing and Google, you can look up specialized MapReduce techniques easily, and we don’t try to be an encyclopedic reference. In our own usage and in discussions with other Hadoop users, we’ve found a number of techniques to be generally useful, such as the ability to use a standard relational database as the input or output of a MapReduce job. We’ve collected some of our favorite “recipes” in this cookbook chapter.
Passing job-specific parameters to your tasks
In writing your Mapper and Reducer, you often want to make certain aspects configurable. For example, our joining program in chapter 5 is hardcoded to take the first data column as the join key. The program would be more generally applicable if the user could specify the join-key column at run time. Hadoop itself uses a configuration object to store all the configuration properties of a job. You can use the same object to pass parameters to your Mapper and Reducer.
We’ve seen how the MapReduce driver configures the Configuration object (JobConf is a subclass) with properties such as input format, output format, mapper class, and so forth. To introduce your own property, you give it a unique name and set it with a value in the same configuration object. This configuration object is passed to all TaskTrackers, so the properties in it are available to every task in the job. Your Mapper and Reducer can read the configuration object and retrieve the property value.
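The round trip can be sketched as follows, assuming the classic `org.apache.hadoop.mapred` API; the property name `myjob.join.column` is our own invention, not a standard Hadoop property:

```java
// Sketch only: assumes the classic org.apache.hadoop.mapred API.
// The property name "myjob.join.column" is hypothetical.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class JoinDriver {
    public static void main(String[] args) {
        JobConf job = new JobConf(JoinDriver.class);
        // Store a custom property in the job's configuration object;
        // it is shipped to every task of this job.
        job.set("myjob.join.column", args[0]);
        // ... set input/output formats, mapper, reducer, and so forth ...
    }

    public static class JoinMapper extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {
        private int joinColumn;

        @Override
        public void configure(JobConf job) {
            // Retrieve the custom property inside the task.
            joinColumn = job.getInt("myjob.join.column", 0);
        }

        @Override
        public void map(Text key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // ... use joinColumn to pick out the join key ...
        }
    }
}
```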
The Configuration class has a number of generic setter methods. Properties are key/value pairs, where the key must be a String but the value can be one of several common types. The signatures of the common setter methods are:
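A representative subset, per the Hadoop Configuration Javadoc (see the Javadoc for the full list):

```java
// Common setters on org.apache.hadoop.conf.Configuration
public void set(String name, String value);
public void setInt(String name, int value);
public void setLong(String name, long value);
public void setFloat(String name, float value);
public void setBoolean(String name, boolean value);
public void setStrings(String name, String... values);
```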
We can remove the line in the driver that requires the user to always specify the value of this property as an argument. This is more convenient for the user when the default value would work most of the time. When you allow the user to specify custom properties, it’s good practice for the driver to validate any user input.
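That validation step can be sketched with a small helper; the helper and the property name in the comment are our own invention, not part of Hadoop:

```java
// Sketch: validate an optional command-line argument before storing it
// as a job property. The helper and property name are hypothetical.
public class ArgCheck {
    static int joinColumn(String[] args, int defaultColumn) {
        if (args.length == 0) {
            return defaultColumn;  // the default works most of the time
        }
        int column = Integer.parseInt(args[0]);
        if (column < 0) {
            throw new IllegalArgumentException(
                "join column must be non-negative, got " + column);
        }
        return column;
    }

    public static void main(String[] args) {
        // In the driver: job.setInt("myjob.join.column", joinColumn(args, 0));
        System.out.println(joinColumn(args, 0));
    }
}
```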
In addition to retrieving custom properties and global configuration, we can also use the getter methods on the configuration object to obtain certain state information about the current task and job. For example, in the Mapper you can read the map.input.file property to get the file path of the current map task’s input. This is exactly what the datajoin package’s DataJoinMapperBase does to infer a tag for the data source.
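A minimal sketch of that pattern, again assuming the classic `org.apache.hadoop.mapred` API (the class name here is ours, not the actual DataJoinMapperBase code):

```java
// Sketch: reading the current input file path from the job configuration.
// Assumes the classic org.apache.hadoop.mapred API.
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public abstract class FileAwareMapperBase extends MapReduceBase {
    protected String inputFile;

    @Override
    public void configure(JobConf job) {
        // "map.input.file" is set by the framework for each map task.
        inputFile = job.get("map.input.file");
        // ... for example, derive a data-source tag from the file name ...
    }
}
```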
Configuration properties are also available to Streaming programs through environment variables. Before executing a script, the Streaming API adds all configuration properties to the script’s environment. The property names are reformatted so that non-alphanumeric characters are replaced with an underscore (_). For example, a Streaming script should look at the environment variable map_input_file for the full path of the file the current mapper is reading.
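The renaming rule is simple enough to mirror in a one-line helper (ours, not part of Hadoop), which is handy when you need to compute the environment-variable name for an arbitrary property:

```java
public class EnvNames {
    // Mirror of Streaming's renaming rule: every non-alphanumeric
    // character in a property name becomes an underscore.
    static String toEnvName(String property) {
        return property.replaceAll("[^A-Za-z0-9]", "_");
    }

    public static void main(String[] args) {
        System.out.println(toEnvName("map.input.file")); // prints map_input_file
    }
}
```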