程式扎記

Preface
Files and Directories Used in This Exercise (P30)

Eclipse project: toolrunner
Java files:
AverageReducer.java (Reducer from AverageWordLength)
LetterMapper.java (Mapper from AverageWordLength)
AvgWordLength.java (Driver from AverageWordLength)

Exercise directory: ~/workspace/toolrunner

In this Exercise, you will implement a driver using ToolRunner.

Follow the steps below to start with the Average Word Length program you wrote in an earlier exercise, and modify the driver to use ToolRunner. Then modify the Mapper to reference a Boolean parameter called caseSensitive; if true, the mapper should treat upper and lower case letters as different; if false or unset, all letters should be converted to lower case.

Source Code
Driver
With ToolRunner, you can pass arguments on the command line, and the Mapper and Reducer can change their behavior based on those arguments. This time the driver class should extend Configured and implement Tool:
- solution/AvgWordLength.java

package solution;  
  
import org.apache.hadoop.fs.Path;  
import org.apache.hadoop.io.DoubleWritable;  
import org.apache.hadoop.io.IntWritable;  
import org.apache.hadoop.io.Text;  
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;  
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;  
import org.apache.hadoop.mapreduce.Job;  
import org.apache.hadoop.conf.Configured;  
import org.apache.hadoop.conf.Configuration;  
import org.apache.hadoop.util.Tool;  
import org.apache.hadoop.util.ToolRunner;  
  
  
public class AvgWordLength extends Configured implements Tool {  
  
    public static void main(String[] args) throws Exception {         
        Configuration conf = new Configuration();  
          
        /* 
         * set the caseSensitive configuration value for the job 
         * programmatically. 
         * Comment code out to set from the command line instead. 
         */  
        //conf.setBoolean("caseSensitive", false);  
          
        int exitCode = ToolRunner.run(conf, new AvgWordLength(),  
                args);  
        System.exit(exitCode);  
  
    }  
  
    @Override  
    public int run(String[] args) throws Exception {  
        /* 
         * Validate that two arguments were passed from the command line. 
         */  
        if (args.length != 2) {  
            System.out  
                    .printf("Usage: AvgWordLength <input dir> <output dir>\n");  
            System.exit(-1);  
        }  
  
        /* 
         * Instantiate a Job object for your job's configuration. 
         */  
        Job job = new Job(getConf());  
  
        /* 
         * Specify the jar file that contains your driver, mapper, and reducer. 
         * Hadoop will transfer this jar file to nodes in your cluster running 
         * mapper and reducer tasks. 
         */  
        job.setJarByClass(AvgWordLength.class);  
  
        /* 
         * Specify an easily-decipherable name for the job. This job name will 
         * appear in reports and logs. 
         */  
        job.setJobName("Average Word Length");  
  
        /* 
         * Specify the paths to the input and output data based on the 
         * command-line arguments. 
         */  
        FileInputFormat.setInputPaths(job, new Path(args[0]));  
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  
  
        /* 
         * Specify the mapper and reducer classes. 
         */  
        job.setMapperClass(LetterMapper.class);  
        job.setReducerClass(AverageReducer.class);  
  
        /* 
         * The input file and output files are text files, so there is no need 
         * to call the setInputFormatClass and setOutputFormatClass methods. 
         */  
  
        /* 
         * The mapper's output keys and values have different data types than 
         * the reducer's output keys and values. Therefore, you must call the 
         * setMapOutputKeyClass and setMapOutputValueClass methods. 
         */  
        job.setMapOutputKeyClass(Text.class);  
        job.setMapOutputValueClass(IntWritable.class);  
  
        /* 
         * Specify the job's output key and value classes. 
         */  
        job.setOutputKeyClass(Text.class);  
        job.setOutputValueClass(DoubleWritable.class);  
  
        /* 
         * Start the MapReduce job and wait for it to finish. If it finishes 
         * successfully, return 0. If not, return 1. 
         */  
        boolean success = job.waitForCompletion(true);  
        return(success ? 0 : 1);  
    }  
}  
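
To make the role of ToolRunner more concrete: before calling run(), it hands the command line to Hadoop's GenericOptionsParser, which copies generic options such as -D property=value into the Configuration and passes only the remaining arguments on to run(). The plain-Java sketch below imitates that separation without any Hadoop classes. GenericOptionSketch and its fields are hypothetical names made up for illustration, and the parsing is deliberately simplified; it is not Hadoop's implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, NOT part of Hadoop: a simplified imitation of how
// generic "-D" options are separated from the tool's own arguments.
class GenericOptionSketch {

    /** Properties collected from -D options, in the order given. */
    final Map<String, String> properties = new LinkedHashMap<>();

    /** Arguments left over for the tool's run() method. */
    final List<String> remaining = new ArrayList<>();

    GenericOptionSketch(String[] args) {
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            String pair = null;
            if (arg.equals("-D") && i + 1 < args.length) {
                pair = args[++i];                 // form: -D key=value
            } else if (arg.startsWith("-D") && arg.length() > 2) {
                pair = arg.substring(2);          // form: -Dkey=value
            }
            if (pair == null) {
                remaining.add(arg);               // ordinary argument
            } else if (pair.indexOf('=') > 0) {
                int eq = pair.indexOf('=');
                properties.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
            // A -D option without "key=value" is silently dropped in this sketch.
        }
    }

    /** Returns the default when the property is unset or not true/false. */
    boolean getBoolean(String name, boolean defaultValue) {
        String value = properties.get(name);
        if (value == null) {
            return defaultValue;
        }
        value = value.trim();
        if (value.equalsIgnoreCase("true")) {
            return true;
        }
        if (value.equalsIgnoreCase("false")) {
            return false;
        }
        return defaultValue;
    }

    public static void main(String[] args) {
        GenericOptionSketch parsed = new GenericOptionSketch(new String[] {
            "-DcaseSensitive=true", "shakespeare", "toolrunnerout" });
        System.out.println(parsed.properties);
        System.out.println(parsed.remaining);
        System.out.println(parsed.getBoolean("caseSensitive", false));
    }
}
```

In the sketch, only the two path arguments end up in remaining, which is why run() in the driver can keep checking for exactly two arguments even when -DcaseSensitive is supplied.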

Mapper
The mapper reads the caseSensitive parameter in setup() and uses it to decide whether the first letter of each word is treated case-sensitively:
- solution/LetterMapper.java

package solution;  
  
import java.io.IOException;  
  
import org.apache.hadoop.io.IntWritable;  
import org.apache.hadoop.io.LongWritable;  
import org.apache.hadoop.io.Text;  
import org.apache.hadoop.mapreduce.Mapper;  
import org.apache.hadoop.conf.Configuration;  
  
/** 
* To define a map function for your MapReduce job, subclass the Mapper class 
* and override the map method. The class definition requires four parameters: 
*  
* 1. The data type of the input key - LongWritable 
* 2. The data type of the input value - Text 
* 3. The data type of the output key - Text 
* 4. The data type of the output value - IntWritable 
*/  
public class LetterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {  
  
    boolean caseSensitive = false;  
  
    /** 
     * The map method runs once for each line of text in the input file. The 
     * method receives: 
     *  
     * @param key 
     *            a key of type LongWritable 
     * @param value 
     *            a value of type Text 
     * @param context 
     *            a Context object. 
     */  
    @Override  
    public void map(LongWritable key, Text value, Context context)  
            throws IOException, InterruptedException {  
  
        /* 
         * Convert the line, which is received as a Text object, to a String 
         * object. 
         */  
        String line = value.toString();  
  
        /* 
         * The line.split("\\W+") call uses regular expressions to split the 
         * line up by non-word characters. If you are not familiar with the use 
         * of regular expressions in Java code, search the web for 
         * "Java Regex Tutorial." 
         */  
        for (String word : line.split("\\W+")) {  
            if (word.length() > 0) {  
  
                /* 
                 * Obtain the first letter of the word 
                 */  
                String letter;  
  
                if (caseSensitive)  
                    letter = word.substring(0, 1);  
                else  
                    letter = word.substring(0, 1).toLowerCase();  
  
                /* 
                 * Call the write method on the Context object to emit a key and 
                 * a value from the map method. The key is the letter that the 
                 * word starts with (lower-cased unless caseSensitive is true); 
                 * the value is the word's length. 
                 */  
                context.write(new Text(letter), new IntWritable(word.length()));  
            }  
        }  
    }  
  
    @Override  
    public void setup(Context context) {  
        Configuration conf = context.getConfiguration();  
        caseSensitive = conf.getBoolean("caseSensitive", false);  
  
    }  
}  
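
The effect of the caseSensitive flag can be tried out without a cluster. The following is a minimal plain-Java sketch of the loop inside map(), assuming the same split("\\W+") tokenization; FirstLetterSketch and emit are hypothetical names used only for this illustration, and the "letter:length" strings stand in for the Text/IntWritable pairs written to the Context.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for the body of LetterMapper.map(); no Hadoop types.
class FirstLetterSketch {

    /** Returns one "letter:length" pair per word, in input order. */
    static List<String> emit(String line, boolean caseSensitive) {
        List<String> pairs = new ArrayList<>();
        for (String word : line.split("\\W+")) {
            // split() can yield an empty leading token, so skip empty words.
            if (word.length() > 0) {
                String letter = word.substring(0, 1);
                if (!caseSensitive) {
                    letter = letter.toLowerCase();
                }
                pairs.add(letter + ":" + word.length());
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String line = "To be, or not to be";
        System.out.println(emit(line, true));   // "To" keeps its upper-case T
        System.out.println(emit(line, false));  // "To" and "to" share the key t
    }
}
```

With the flag off, "To" and "to" fall under the same key, so the reducer later averages them together; with the flag on, they are averaged separately.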

Reducer
For each starting letter, the reducer sums the word lengths, counts the words, and outputs the average word length:
- solution/AverageReducer.java

package solution;  
  
import java.io.IOException;  
  
import org.apache.hadoop.io.DoubleWritable;  
import org.apache.hadoop.io.IntWritable;  
import org.apache.hadoop.io.Text;  
import org.apache.hadoop.mapreduce.Reducer;  
  
/** 
* To define a reduce function for your MapReduce job, subclass 
* the Reducer class and override the reduce method. 
* The class definition requires four parameters:  
* 1. The data type of the input key - Text 
* 2. The data type of the input value - IntWritable 
* 3. The data type of the output key - Text 
* 4. The data type of the output value - DoubleWritable 
*/  
public class AverageReducer extends  
    Reducer<Text, IntWritable, Text, DoubleWritable> {  
  
  /** 
   * The reduce method runs once for each key received from 
   * the shuffle and sort phase of the MapReduce framework. 
   * The method receives: 
   * @param key a key of type Text 
   * @param values a set of values of type IntWritable 
   * @param context a Context object 
   */  
  @Override  
  public void reduce(Text key, Iterable<IntWritable> values, Context context)  
      throws IOException, InterruptedException {  
  
    long sum = 0, count = 0;  
  
    /* 
     * For each value in the set of values passed to us by the mapper: 
     */  
    for (IntWritable value : values) {  
        
      /* 
       * Add up the values and increment the count 
       */  
      sum += value.get();  
      count++;  
    }  
    if (count != 0) {  
        
      /* 
       * The average length is the sum of the values divided by the count. 
       */  
      double result = (double)sum / (double)count;  
       
      /* 
       * Call the write method on the Context object to emit a key 
       * (the words' starting letter) and a value (the average length  
       * per word starting with this letter) from the reduce method.  
       */  
      context.write(key, new DoubleWritable(result));  
    }  
  }  
}  
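
To see what the whole job computes, the grouping done by the shuffle phase and the sum/count arithmetic of reduce() can be imitated in memory. This is only a rough local sketch, assuming the same tokenization as the mapper; AverageByLetterSketch and averages are hypothetical names, and a TreeMap stands in for the sorted grouping that the framework performs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of map + shuffle + reduce for this exercise; no Hadoop.
class AverageByLetterSketch {

    static Map<String, Double> averages(String text, boolean caseSensitive) {
        // "Map" and "shuffle": group word lengths by first letter, keys sorted.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : text.split("\\W+")) {
            if (word.isEmpty()) {
                continue;
            }
            String letter = word.substring(0, 1);
            if (!caseSensitive) {
                letter = letter.toLowerCase();
            }
            grouped.computeIfAbsent(letter, k -> new ArrayList<>())
                   .add(word.length());
        }

        // "Reduce": the same sum / count arithmetic as AverageReducer.reduce().
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            long sum = 0, count = 0;
            for (int length : entry.getValue()) {
                sum += length;
                count++;
            }
            result.put(entry.getKey(), (double) sum / (double) count);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The tiger and an Ant";
        System.out.println(averages(text, false));
        System.out.println(averages(text, true));
    }
}
```

For the sample text, the case-insensitive run produces two keys (a and t), while the case-sensitive run produces four (A, T, a, t), each with its own average.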

Lab Experiment
1. Build project and run MapReduce

$ ant -f build.xml # Build project
$ hadoop fs -rm -r toolrunnerout # Remove the toolrunnerout directory from HDFS in case it exists
$ hadoop jar toolrunner.jar solution.AvgWordLength -DcaseSensitive=false shakespeare toolrunnerout
# Change the value of -DcaseSensitive (true or false) to modify the behavior of the Mapper.

2. Review the result

$ hadoop fs -cat toolrunnerout/*

Friday, December 26, 2014

[CCDH] Exercise7 - Using ToolRunner and Passing Parameters (P30)
