程式扎記: [CCDH] Class3 - Programming with Hadoop Core API (1)

Thursday, November 27, 2014

Preface
In this chapter you will learn (P255)

* How to use the ToolRunner class
* How to decrease the amount of intermediate data with Combiners
* How to set up and tear down Mappers and Reducers using the setup and cleanup methods
* How to access HDFS programmatically
* How to use the distributed cache
* How to use the Hadoop API's library of Mappers, Reducers, and Partitioners

Using the ToolRunner Class
Using ToolRunner in a MapReduce driver class is not required, but it is a best practice. Internally it uses the GenericOptionsParser class, which:

* Allows you to specify configuration options from the command line.
* Allows you to specify items for the Distributed Cache from the command line.

Implement ToolRunner - Imports

import org.apache.hadoop.fs.Path;  
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;  
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;  
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;  
import org.apache.hadoop.mapreduce.Job;  
import org.apache.hadoop.conf.Configured;  
import org.apache.hadoop.conf.Configuration;  
import org.apache.hadoop.util.Tool;  
import org.apache.hadoop.util.ToolRunner;  

Implement ToolRunner - Driver Class Definition
The driver class implements the Tool interface and extends the Configured class:

public class AvgWordLength extends Configured implements Tool {  
    public static void main(String[] args) throws Exception {...}  
    @Override  
    public int run(String[] args) throws Exception {...}  
}  

Implement ToolRunner - Main Method
The driver main method calls ToolRunner.run:

...  
public static void main(String[] args) throws Exception {  
    Configuration conf = new Configuration();         
    int exitCode = ToolRunner.run(conf, new AvgWordLength(), args);  
    System.exit(exitCode);  
}  
...  

Implement ToolRunner - Run Method
The driver run method creates, configures, and submits the job:

@Override  
public int run(String[] args) throws Exception {  
    /* 
     * Validate that two arguments were passed from the command line. 
     */  
    if (args.length != 2) {
        System.out.printf("Usage: AvgWordLength <input dir> <output dir>\n");
        // Return the error code so that ToolRunner.run can hand it back to main.
        return -1;
    }
  
    /* 
     * Instantiate a Job object for your job's configuration. 
     */  
    Job job = new Job(getConf());  
  
    /* 
     * Specify the jar file that contains your driver, mapper, and reducer. 
     * Hadoop will transfer this jar file to nodes in your cluster running 
     * mapper and reducer tasks. 
     */  
    job.setJarByClass(AvgWordLength.class);  
  
    /* 
     * Specify an easily-decipherable name for the job. This job name will 
     * appear in reports and logs. 
     */  
    job.setJobName("Average Word Length");  
  
    /* 
     * Specify the paths to the input and output data based on the 
     * command-line arguments. 
     */  
    FileInputFormat.setInputPaths(job, new Path(args[0]));  
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  
  
    /* 
     * Specify the mapper and reducer classes. 
     */  
    job.setMapperClass(LetterMapper.class);  
    job.setReducerClass(AverageReducer.class);  
  
    /* 
     * The input file and output files are text files, so there is no need 
     * to call the setInputFormatClass and setOutputFormatClass methods. 
     */  
  
    /* 
     * The mapper's output keys and values have different data types than 
     * the reducer's output keys and values. Therefore, you must call the 
     * setMapOutputKeyClass and setMapOutputValueClass methods. 
     */  
    job.setMapOutputKeyClass(Text.class);  
    job.setMapOutputValueClass(IntWritable.class);  
  
    /* 
     * Specify the job's output key and value classes. 
     */  
    job.setOutputKeyClass(Text.class);  
    job.setOutputValueClass(DoubleWritable.class);  
  
    /* 
     * Start the MapReduce job and wait for it to finish. If it finishes 
     * successfully, return 0. If not, return 1. 
     */  
    boolean success = job.waitForCompletion(true);  
    return(success ? 0 : 1);  
}  

ToolRunner lets the user specify configuration options on the command line. This is most commonly used to set Hadoop properties with the -D flag; a property set this way overrides any default or site value in the configuration.

Note that -D options must appear before any additional program arguments. You can also:

* Specify an XML configuration file with -conf
* Specify the default filesystem with -fs uri - Shortcut for -D fs.default.name=uri
* ...
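
As a rough illustration of the ordering rule, the plain-Java sketch below mimics how leading -D options are separated from the program arguments that the run method receives. It is a simplified stand-in written for this note, not Hadoop's GenericOptionsParser; the class name GenericOptionSketch, the jar name myjob.jar, and the directory names are assumptions.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Simplified stand-in for the option handling that ToolRunner delegates to
 * GenericOptionsParser. Hypothetical command line being modeled:
 *
 *   hadoop jar myjob.jar AvgWordLength -D mapred.reduce.tasks=4 indir outdir
 */
class GenericOptionSketch {
    /** Properties collected from the leading "-D key=value" pairs. */
    final Map<String, String> properties = new LinkedHashMap<>();
    /** Everything after the leading options; this is what run(args) would see. */
    final String[] remainingArgs;

    GenericOptionSketch(String[] args) {
        int i = 0;
        // Consume "-D key=value" pairs only while they lead the argument list.
        while (i + 1 < args.length && args[i].equals("-D") && args[i + 1].contains("=")) {
            String[] keyValue = args[i + 1].split("=", 2);
            properties.put(keyValue[0], keyValue[1]);
            i += 2;
        }
        // Parsing stops at the first argument that is not a -D option.
        remainingArgs = Arrays.copyOfRange(args, i, args.length);
    }

    public static void main(String[] args) {
        GenericOptionSketch parsed = new GenericOptionSketch(
                new String[] {"-D", "mapred.reduce.tasks=4", "indir", "outdir"});
        System.out.println(parsed.properties);                     // {mapred.reduce.tasks=4}
        System.out.println(Arrays.toString(parsed.remainingArgs)); // [indir, outdir]
    }
}
```

In this sketch, a -D placed after indir is left untouched as an ordinary program argument, which is why the options have to come first.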

Setting Up and Tearing Down Mappers and Reducers
The setup Method
It is common to want your Mapper or Reducer to execute some code before the map or reduce method is called for the first time:

* Initialize data structures
* Read data from an external file
* Set parameters

The cleanup Method
Similarly, you may wish to perform some action(s) after all the records have been processed by your Mapper or Reducer. The cleanup method is called before the Mapper or Reducer terminates.
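
The call order can be sketched in plain Java. The class below is a hypothetical stand-in written for this note, not Hadoop's Mapper; it only mirrors the order in which setup, map, and cleanup are invoked for one task.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a Mapper task; it records the order of lifecycle calls. */
class LifecycleSketch {
    final List<String> calls = new ArrayList<>();

    /** Called once, before the first record is processed. */
    void setup() {
        calls.add("setup");
    }

    /** Called once per input record. */
    void map(String record) {
        calls.add("map:" + record);
    }

    /** Called once, after the last record has been processed. */
    void cleanup() {
        calls.add("cleanup");
    }

    /** Drives one task: setup, then map for every record, then cleanup. */
    void run(List<String> records) {
        setup();
        for (String record : records) {
            map(record);
        }
        cleanup();
    }

    public static void main(String[] args) {
        LifecycleSketch task = new LifecycleSketch();
        task.run(Arrays.asList("a", "b"));
        System.out.println(task.calls); // [setup, map:a, map:b, cleanup]
    }
}
```

In a real Mapper or Reducer, the corresponding methods to override are setup(Context) and cleanup(Context).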

Pass Parameters
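You can set parameters in the driver, and the Mapper or Reducer can read them to customize its behavior. The driver stores a value on the job's Configuration (for example with conf.setInt), and the Mapper or Reducer reads it back, typically in setup, through context.getConfiguration().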

Decreasing the Amount of Intermediate Data with Combiners
Mappers often produce large amounts of intermediate data, which must be passed to the Reducers and can result in a lot of network traffic. It is often possible to specify a Combiner, which acts like a 'mini-Reducer' and runs locally on a single Mapper's output.

Combiner and Reducer code are often identical. Reusing the Reducer as the Combiner is valid only if the operation performed is commutative and associative. The Combiner's input and output key/value types must be identical, because both have to match the Mapper's output types.

WordCount Revisited

WordCount with Combiner
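
To make the effect concrete, here is a plain-Java sketch of what a Combiner does to one Mapper's WordCount output. No Hadoop classes are used; the class name CombinerEffectSketch and the sample sentence are assumptions for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerEffectSketch {
    /** Map phase for one input split: emit (word, 1) for every word. */
    static List<Map.Entry<String, Integer>> map(String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            out.add(new SimpleEntry<>(word, 1));
        }
        return out;
    }

    /** Combine phase: sum the counts per word locally, on this Mapper's output only. */
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> record : mapOutput) {
            combined.merge(record.getKey(), record.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = map("the cat sat on the mat the end");
        Map<String, Integer> combined = combine(mapOutput);
        System.out.println("records without combiner: " + mapOutput.size()); // 8
        System.out.println("records with combiner:    " + combined.size());  // 6
        System.out.println(combined); // {cat=1, end=1, mat=1, on=1, sat=1, the=3}
    }
}
```

The three (the, 1) records collapse into a single (the, 3) record before anything is sent to a Reducer; on real input with many repeated words the saving is correspondingly larger.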

Writing a Combiner
The Combiner uses the same signature as the Reducer:

public void reduce(Key key, Iterable<Value> values,
                   Context context) throws IOException, InterruptedException
{
    ...
}

Combiners and Reducers
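Some Reducers can be used as Combiners, namely when the operation is associative and commutative. SumReducer is an example: adding up partial sums gives the same total as adding up the original values.

Some Reducers cannot be used as Combiners. AverageReducer is an example: the average of several partial averages is generally not the average of the original values.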
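
The difference is easy to check with plain arithmetic. The sketch below uses no Hadoop classes; CombinerSafetySketch and the sample values are assumptions, standing in for the values of one key spread across two Mappers.

```java
import java.util.Arrays;
import java.util.List;

class CombinerSafetySketch {
    static double sum(List<Double> values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    static double average(List<Double> values) {
        return sum(values) / values.size();
    }

    public static void main(String[] args) {
        // Values for one key, split across two Mappers.
        List<Double> mapper1 = Arrays.asList(1.0, 2.0, 3.0);
        List<Double> mapper2 = Arrays.asList(10.0);
        List<Double> all = Arrays.asList(1.0, 2.0, 3.0, 10.0);

        // Sum: combining per Mapper first gives the same answer.
        double sumOfSums = sum(Arrays.asList(sum(mapper1), sum(mapper2)));
        System.out.println(sumOfSums + " vs " + sum(all));     // 16.0 vs 16.0

        // Average: combining per Mapper first gives a different answer.
        double avgOfAvgs = average(Arrays.asList(average(mapper1), average(mapper2)));
        System.out.println(avgOfAvgs + " vs " + average(all)); // 6.0 vs 4.0
    }
}
```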

Specifying a Combiner
Specify the Combiner class for your MapReduce job in the driver, using the setCombinerClass method:

job.setCombinerClass(SumReducer.class);  

Input and output data types for the Combiner must be identical, and must match the Mapper's output types. The framework may run the Combiner zero times, once, or more than once on the output of any given Mapper, so don't put code in the Combiner that could change your results depending on how often it runs.

Supplement
* Exercise7 - Using ToolRunner and Passing Parameters
* Exercise8 - Writing a Partitioner
