In this chapter you will learn how to use the ToolRunner class, how to set up and tear down Mappers and Reducers, how to pass parameters to them, and how to decrease the amount of intermediate data with Combiners.
Using the ToolRunner Class
Using ToolRunner in MapReduce driver classes is not required, but it is a best practice. ToolRunner uses the GenericOptionsParser class internally, which:
- lets you set configuration options on the command line (for example, -D mapreduce.job.reduces=2)
- lets you specify files, jars, and archives to distribute with the job via the -files, -libjars, and -archives options
Implement ToolRunner - Imports
- import org.apache.hadoop.fs.Path;
- import org.apache.hadoop.io.DoubleWritable;
- import org.apache.hadoop.io.IntWritable;
- import org.apache.hadoop.io.Text;
- import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
- import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
- import org.apache.hadoop.mapreduce.Job;
- import org.apache.hadoop.conf.Configured;
- import org.apache.hadoop.conf.Configuration;
- import org.apache.hadoop.util.Tool;
- import org.apache.hadoop.util.ToolRunner;
The driver class implements the Tool interface and extends the Configured class:
- public class AvgWordLength extends Configured implements Tool {
- public static void main(String[] args) throws Exception {...}
- @Override
- public int run(String[] args) throws Exception {...}
- }
The driver main method calls ToolRunner.run:
- ...
- public static void main(String[] args) throws Exception {
- Configuration conf = new Configuration();
- int exitCode = ToolRunner.run(conf, new AvgWordLength(), args);
- System.exit(exitCode);
- }
- ...
The driver run method creates, configures, and submits the job:
- @Override
- public int run(String[] args) throws Exception {
- /*
- * Validate that two arguments were passed from the command line.
- */
- if (args.length != 2) {
- System.out.printf("Usage: AvgWordLength <input dir> <output dir>\n");
- System.exit(-1);
- }
- /*
- * Instantiate a Job object for your job's configuration.
- */
- Job job = Job.getInstance(getConf());
- /*
- * Specify the jar file that contains your driver, mapper, and reducer.
- * Hadoop will transfer this jar file to nodes in your cluster running
- * mapper and reducer tasks.
- */
- job.setJarByClass(AvgWordLength.class);
- /*
- * Specify an easily-decipherable name for the job. This job name will
- * appear in reports and logs.
- */
- job.setJobName("Average Word Length");
- /*
- * Specify the paths to the input and output data based on the
- * command-line arguments.
- */
- FileInputFormat.setInputPaths(job, new Path(args[0]));
- FileOutputFormat.setOutputPath(job, new Path(args[1]));
- /*
- * Specify the mapper and reducer classes.
- */
- job.setMapperClass(LetterMapper.class);
- job.setReducerClass(AverageReducer.class);
- /*
- * The input file and output files are text files, so there is no need
- * to call the setInputFormatClass and setOutputFormatClass methods.
- */
- /*
- * The mapper's output keys and values have different data types than
- * the reducer's output keys and values. Therefore, you must call the
- * setMapOutputKeyClass and setMapOutputValueClass methods.
- */
- job.setMapOutputKeyClass(Text.class);
- job.setMapOutputValueClass(IntWritable.class);
- /*
- * Specify the job's output key and value classes.
- */
- job.setOutputKeyClass(Text.class);
- job.setOutputValueClass(DoubleWritable.class);
- /*
- * Start the MapReduce job and wait for it to finish. If it finishes
- * successfully, return 0. If not, return 1.
- */
- boolean success = job.waitForCompletion(true);
- return success ? 0 : 1;
- }
Note that -D options (and other generic options) must appear on the command line before any additional program arguments; for example, -D mapreduce.job.reduces=2 must come before the input and output directory arguments. Because the driver uses ToolRunner, you can set any configuration property this way without changing the code.
Setting Up and Tearing Down Mappers and Reducers
The setup Method
It is common to want your Mapper or Reducer to execute some code before the map or reduce method is called for the first time. The setup method is called once per task, before any input records are processed, which makes it a good place to read configuration values or initialize resources; see the sketch below.
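A minimal sketch of a Mapper that overrides setup, assuming a hypothetical caseSensitive configuration property (the class name is also illustrative):
- import java.io.IOException;
- import org.apache.hadoop.io.IntWritable;
- import org.apache.hadoop.io.LongWritable;
- import org.apache.hadoop.io.Text;
- import org.apache.hadoop.mapreduce.Mapper;
- public class SetupExampleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
-   private boolean caseSensitive;
-   @Override
-   public void setup(Context context) {
-     // Called once per task, before the first call to map():
-     // read the (hypothetical) caseSensitive property just once.
-     caseSensitive = context.getConfiguration().getBoolean("caseSensitive", false);
-   }
-   @Override
-   public void map(LongWritable key, Text value, Context context)
-       throws IOException, InterruptedException {
-     String line = caseSensitive ? value.toString() : value.toString().toLowerCase();
-     for (String word : line.split("\\W+")) {
-       if (!word.isEmpty()) {
-         context.write(new Text(word), new IntWritable(1));
-       }
-     }
-   }
- }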
The cleanup Method
Similarly, you may wish to perform some action(s) after all the records have been processed by your Mapper or Reducer. The cleanup method is called once, after the last record has been processed and before the Mapper or Reducer terminates; see the sketch below.
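A minimal sketch of a Reducer that overrides cleanup; the class name and the counter update are illustrative (closing a file or connection opened in setup would be another typical use):
- import java.io.IOException;
- import org.apache.hadoop.io.IntWritable;
- import org.apache.hadoop.io.Text;
- import org.apache.hadoop.mapreduce.Reducer;
- public class CleanupExampleReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
-   private int distinctKeys = 0;
-   @Override
-   public void reduce(Text key, Iterable<IntWritable> values, Context context)
-       throws IOException, InterruptedException {
-     int sum = 0;
-     for (IntWritable value : values) {
-       sum += value.get();
-     }
-     distinctKeys++;
-     context.write(key, new IntWritable(sum));
-   }
-   @Override
-   public void cleanup(Context context) throws IOException, InterruptedException {
-     // Called once, after the last call to reduce() and before the task ends.
-     context.getCounter("CleanupExample", "DistinctKeys").increment(distinctKeys);
-   }
- }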
Pass Parameters
You can set parameters in the driver, and the Mapper or Reducer can fetch those parameters to customize MapReduce behavior, as sketched below.
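A minimal sketch, assuming a hypothetical wordlength.min property: the driver stores the value in the Configuration, and the Mapper reads it back in setup.
- // In the driver's run() method: set the parameter on the Configuration
- // before creating the Job, because Job makes its own copy of the conf.
- Configuration conf = getConf();
- conf.setInt("wordlength.min", 5); // hypothetical property name
- Job job = Job.getInstance(conf);
- // In the Mapper (or Reducer): fetch the parameter once in setup().
- private int minLength;
- @Override
- public void setup(Context context) {
-   minLength = context.getConfiguration().getInt("wordlength.min", 0);
- }
Because the driver uses ToolRunner, the same property could instead be supplied on the command line with -D wordlength.min=5, in which case it is already present in getConf().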
Decreasing the Amount of Intermediate Data with Combiners
Mappers often produce large amounts of intermediate data that must be passed to the Reducers, which can result in a lot of network traffic. It is often possible to specify a Combiner, which acts like a 'mini-Reducer' and runs locally on a single Mapper's output.
Combiner and Reducer code are often identical. Technically, this is possible only if the operation performed is commutative and associative. The Combiner's input and output data types must be identical, and they must match the Mapper's output types.
WordCount Revisited
WordCount with Combiner
With a Combiner, each map task's output is aggregated locally before the shuffle: instead of sending one (word, 1) pair per occurrence to the Reducers, the map side sends one (word, partial count) pair per distinct word, greatly reducing network traffic.
Writing a Combiner
The Combiner uses the same signature as the Reducer:
- public void reduce(Key key, Iterable<Value> values,
-     Context context) throws IOException, InterruptedException {
-   ...
- }
Some Reducers may be used as Combiners, if the operation performed is associative and commutative, e.g., SumReducer (see the sketch below):
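A sketch of what such a SumReducer might look like (a reconstruction for illustration, not the course's original source):
- import java.io.IOException;
- import org.apache.hadoop.io.IntWritable;
- import org.apache.hadoop.io.Text;
- import org.apache.hadoop.mapreduce.Reducer;
- public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
-   @Override
-   public void reduce(Text key, Iterable<IntWritable> values, Context context)
-       throws IOException, InterruptedException {
-     int sum = 0;
-     for (IntWritable value : values) {
-       sum += value.get();
-     }
-     // Summation is associative and commutative, and input and output
-     // types match (Text/IntWritable), so this works as a Combiner too.
-     context.write(key, new IntWritable(sum));
-   }
- }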
Some Reducers cannot be used as a Combiner, e.g., AverageReducer: averaging is not associative, so an average of partial averages is not the overall average. For example, avg(avg(1, 2), 3) = 2.25, but avg(1, 2, 3) = 2.
Specifying a Combiner
Specify the Combiner class to be used in your MapReduce job in the driver through the setCombinerClass method:
- job.setCombinerClass(SumReducer.class);
Supplement
* Exercise7 - Using ToolRunner and Passing Parameters
* Exercise8 - Writing a Partitioner