Friday, December 26, 2014

[ In Action ] Ch7. Cookbook: Passing job-specific parameters to your tasks (Part1)

Preface (P160) 
This chapter covers 
■ Passing custom parameters to tasks
■ Retrieving task-specific information
■ Creating multiple outputs
■ Interfacing with relational databases
■ Making output globally sorted

This book so far has covered the core techniques for making a MapReduce program. Hadoop is a big framework that supports many more functionalities than those core techniques. In this age of Bing and Google, you can look up specialized MapReduce techniques rather easily, and we don’t try to be an encyclopedic reference. In our own usage and from our discussion with other Hadoop users, we’ve found a number of techniques generally useful, techniques such as being able to take a standard relational database as input or output to a MapReduce job. We’ve collected some of our favorite “recipes” in this cookbook chapter. 

Passing job-specific parameters to your tasks 
In writing your Mapper and Reducer, you often want to make certain aspects configurable. For example, our joining program in chapter 5 is hardcoded to take the first data column as the join key. The program would be more generally applicable if the user could specify the column for the join key at run time. Hadoop itself uses a configuration object to store all the configuration properties for a job. You can use the same object to pass parameters to your Mapper and Reducer.

We’ve seen how the MapReduce driver configures the Configuration object (JobConf is one subclass of it) with properties such as input format, output format, mapper class, and so forth. To introduce your own property, you give it a unique name and set it with a value in the same configuration object. This configuration object is passed to all TaskTrackers, so the properties in it are available to every task in that job. Your Mapper and Reducer can read the configuration object and retrieve the property value.

The Configuration class has a number of generic setter methods. Properties are key/value pairs, where the key must be a String but the value can be one of several common types. The signatures of the common setter methods are:
  public void set(String name, String value)
  public void setBoolean(String name, boolean value)
  public void setInt(String name, int value)
  public void setLong(String name, long value)
  public void setStrings(String name, String... values)
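As a quick illustration, a driver might combine several of these setters before submitting the job. This is only a sketch; the property names below are hypothetical:

  import org.apache.hadoop.conf.Configuration;
  ...
  Configuration conf = job.getConfiguration();
  conf.setBoolean("myjob.verbose", true);            // simple on/off flag
  conf.setLong("myjob.maxrecords", 1000000L);        // numeric threshold
  conf.setStrings("myjob.columns", "name", "age");   // stored internally as "name,age"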
Your driver first sets the properties in the configuration object to make them available to all tasks. Your Mapper and Reducer can then access the configuration object by calling context.getConfiguration() inside their map() and reduce() methods. In the following example we set up our new property myjob.myproperty in the driver class, taking an integer value specified by the user.
  ...
  public int run(String[] args) throws Exception
  {
      Job job = new Job(getConf());
      ...
      job.getConfiguration().setInt("myjob.myproperty", Integer.parseInt(args[2]));
      ...
  }
In MapClass, the map() method can retrieve the property through the given Context parameter. The getter methods of the Configuration class require a default value, which is returned if the requested property is not set in the configuration object. For this example we use a default of 0:
  @Override
  public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
      ...
      int myproperty = context.getConfiguration().getInt("myjob.myproperty", 0);
      ...
  }
If you want to use the property in the Reducer, the steps are similar: 
  @Override
  public void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException
  {
      ...
      int myproperty = context.getConfiguration().getInt("myjob.myproperty", 0);
      ...
  }
The Configuration class has a longer list of getter methods than setter methods, but they are largely self-explanatory. Almost all the getter methods require a default value as an argument. The exception is get(String), which returns null if the property with the specified name is not set:
  public String get(String name)
  public String get(String name, String defaultValue)
  public boolean getBoolean(String name, boolean defaultValue)
  public float getFloat(String name, float defaultValue)
  public int getInt(String name, int defaultValue)
  public long getLong(String name, long defaultValue)
  public String[] getStrings(String name, String... defaultValue)
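A brief sketch of the difference (the property names here are hypothetical):

  Configuration conf = context.getConfiguration();
  String raw = conf.get("myjob.unset");                    // returns null when unset
  String name = conf.get("myjob.name", "default");         // returns "default" when unset
  String[] cols = conf.getStrings("myjob.columns", "0");   // splits a comma-separated value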
Given that our job class implements the Tool interface and uses ToolRunner, we can also let the user set custom properties directly using the generic options syntax, in the same way the user would set Hadoop configuration properties. 
$ hadoop jar MyJob.jar MyJob -D myjob.myproperty=1 input output

We can remove the line in the driver that requires the user to always specify the value of this property as an argument. This is more convenient for the user when the default value would work most of the time. When you allow the user to specify custom properties, it’s good practice for the driver to validate any user input. 
  public class MyJob extends Configured implements Tool {
      public int run(String[] args) throws Exception {
          Job job = new Job(getConf());
          Path in = new Path(args[0]);
          Path out = new Path(args[1]);
          FileInputFormat.setInputPaths(job, in);
          FileOutputFormat.setOutputPath(job, out);
          job.setJarByClass(MyJob.class);
          job.setJobName("MyJob");
          ...
          int myproperty = job.getConfiguration().getInt("myjob.myproperty", 0);
          if (myproperty < 0) {
              System.err.println("Invalid myjob.myproperty: " + myproperty);
              System.exit(1);    // exit with a non-zero status on invalid input
          }

          boolean success = job.waitForCompletion(true);
          return success ? 0 : 1;
      }
      ...
  }
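For the -D generic option to be parsed at all, the job must be launched through ToolRunner; a minimal main() for MyJob would look like this:

  public static void main(String[] args) throws Exception {
      // ToolRunner strips generic options such as -D myjob.myproperty=1
      // from args and applies them to the Configuration before calling run()
      int res = ToolRunner.run(new Configuration(), new MyJob(), args);
      System.exit(res);
  }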
Probing for task-specific information 
In addition to retrieving custom properties and global configuration, we can also use the getter methods on the configuration object to obtain certain state information about the current task and job. For example, in the Mapper you can read the map.input.file property to get the file path for the current map task. This is exactly what the datajoin package's DataJoinMapperBase does to infer a tag for the data source.
  this.inputFile = job.getConfiguration().get("map.input.file");
  this.inputTag = generateInputTag(this.inputFile);
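Note that map.input.file is an old-API property and is not always populated under the new org.apache.hadoop.mapreduce API used in the examples above; with file-based input formats, the same information can be read from the input split itself. A minimal sketch:

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
      // With file-based input formats (e.g. TextInputFormat) the split is a
      // org.apache.hadoop.mapreduce.lib.input.FileSplit
      FileSplit split = (FileSplit) context.getInputSplit();
      String inputFile = split.getPath().toString();
  }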
Table 7.1 lists some of the other task-specific state information. 
  Table 7.1 Task-specific state information

  Property                   Description
  ------------------------   ------------------------------------------------
  mapred.job.id              The job ID
  mapred.jar                 The location of the job's JAR file
  job.local.dir              The job-specific shared scratch space
  mapred.tip.id              The task ID
  mapred.task.id             The task attempt ID
  mapred.task.is.map         Whether this task is a map task
  mapred.task.partition      The ID of the task within the job
  map.input.file             The file path the map task is reading from
  map.input.start            The offset of the start of the map input split
  map.input.length           The number of bytes in the map input split
  mapred.work.output.dir     The task's temporary output directory

Configuration properties are also available to Streaming programs through environment variables. Before executing a script, Streaming adds all configuration properties to the script's environment. The property names are reformatted such that non-alphanumeric characters are replaced with an underscore (_). For example, a Streaming script should look at the environment variable map_input_file for the full file path that the current mapper is reading from.
  import os
  filename = os.environ["map_input_file"]
  localdir = os.environ["job_local_dir"]
The preceding code shows how one would access configuration properties in Python. 

Supplement 
Ch7. Cookbook: Partitioning into multiple output files (Part2) 
Exercise7 - Using ToolRunner and Passing Parameters

