Wednesday, December 3, 2014

[CCDH] Exercise3 - Writing a MapReduce Java Program (P16)

Preface 
Projects and Directories Used in this Exercise 
Eclipse project: averagewordlength
Java files:
AverageReducer.java (Reducer)
LetterMapper.java (Mapper)
AvgWordLength.java (driver)

Test data (HDFS):
shakespeare

Exercise directory: ~/workspace/averagewordlength

In this exercise you write a MapReduce job that reads any text input and computes the average length of all words that start with each character. For any text input, the job should report the average length of words that begin with 'a', 'b', and so forth. For example, for the input: 
No now is definitely not the time

The output would be (the keys arrive at the Reducer, and hence appear in the output, in sorted order): 
N 2.0
d 10.0
i 2.0
n 3.0
t 3.5

The Algorithm 
The algorithm for this program is a simple one-pass MapReduce program: 

The Mapper 
The Mapper receives a line of text for each input value. (Ignore the input key.) For each word in the line, it emits the first letter of the word as the key and the length of the word as the value. The source code is below: 
- LetterMapper.java 
package stubs;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LetterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    // Split the line on non-word characters; consecutive separators yield
    // empty tokens, which the length check below skips.
    for (String token : value.toString().split("\\W")) {
      if (token.length() > 0) {
        // Emit (first letter of the word, length of the word).
        context.write(new Text(String.valueOf(token.charAt(0))),
                      new IntWritable(token.length()));
      }
    }
  }
}
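Before running on the cluster, it is worth unit-testing the Mapper. Below is a minimal sketch using MRUnit and JUnit; this is my own addition, not part of the exercise, and it assumes both jars are on the classpath (the class name LetterMapperTest is hypothetical): 
- LetterMapperTest.java 
package stubs;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class LetterMapperTest {

  @Test
  public void testMap() throws IOException {
    // Hypothetical test: feed one line to the mapper and assert the
    // (first letter, word length) pairs it emits, in order.
    new MapDriver<LongWritable, Text, Text, IntWritable>()
        .withMapper(new LetterMapper())
        .withInput(new LongWritable(0), new Text("No now"))
        .withOutput(new Text("N"), new IntWritable(2))
        .withOutput(new Text("n"), new IntWritable(3))
        .runTest();
  }
}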
The Reducer 
Thanks to the shuffle and sort phase built into MapReduce, the Reducer receives the keys in sorted order, and all the values for one key are grouped together. For the Mapper output above, for example, the Reducer receives the key 't' with the grouped values [3, 4] and emits their average, 3.5. The Reducer source code is below: 
- AverageReducer.java 
package stubs;

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageReducer extends
        Reducer<Text, IntWritable, Text, DoubleWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
    // Sum the word lengths and count the words for this starting letter.
    double wls = 0.0;
    double cnt = 0.0;
    for (IntWritable wl : values) {
      wls += wl.get();
      cnt++;
    }
    // Emit (letter, average word length).
    context.write(key, new DoubleWritable(wls / cnt));
  }
}
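The Reducer can be unit-tested the same way. Again, a minimal MRUnit sketch of my own, not part of the exercise (the class name AverageReducerTest is hypothetical): 
- AverageReducerTest.java 
package stubs;

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class AverageReducerTest {

  @Test
  public void testReduce() throws IOException {
    // For the sample line, the reducer sees ("t", [3, 4]) after the
    // shuffle/sort and should emit ("t", 3.5).
    new ReduceDriver<Text, IntWritable, Text, DoubleWritable>()
        .withReducer(new AverageReducer())
        .withInput(new Text("t"),
                   Arrays.asList(new IntWritable(3), new IntWritable(4)))
        .withOutput(new Text("t"), new DoubleWritable(3.5))
        .runTest();
  }
}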
The Driver 
The driver is almost the same as the one in WordCount. The source code is below: 
- AvgWordLength.java 
package stubs;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AvgWordLength {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.printf("Usage: AvgWordLength <input dir> <output dir>\n");
      System.exit(-1);
    }

    Job job = new Job();

    job.setJarByClass(AvgWordLength.class);

    job.setJobName("Average Word Length");

    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    FileInputFormat.setInputPaths(job, new Path(args[0]));

    job.setMapperClass(LetterMapper.class);
    job.setReducerClass(AverageReducer.class);

    // The Mapper emits (Text, IntWritable) pairs...
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    // ...and the Reducer emits (Text, DoubleWritable) pairs.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);

    /*
     * Start the MapReduce job and wait for it to finish.
     * If it finishes successfully, return 0. If not, return 1.
     */
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}
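As an aside, Hadoop drivers are often written against ToolRunner so that generic options (e.g. -D properties) are parsed automatically. This is not required by the exercise; the following is only a sketch of that variant, assuming Hadoop 2.x (the class name AvgWordLengthTool is hypothetical): 
- AvgWordLengthTool.java 
package stubs;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical ToolRunner-based variant of the driver above.
public class AvgWordLengthTool extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.printf("Usage: AvgWordLengthTool <input dir> <output dir>\n");
      return -1;
    }

    // Job.getInstance picks up the Configuration populated by ToolRunner.
    Job job = Job.getInstance(getConf(), "Average Word Length");
    job.setJarByClass(AvgWordLengthTool.class);

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(LetterMapper.class);
    job.setReducerClass(AverageReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses generic Hadoop options before calling run().
    System.exit(ToolRunner.run(new AvgWordLengthTool(), args));
  }
}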
Lab Experiment 
Under the path ~/workspace/averagewordlength, you should have a build.xml file that you can use with ant to build the project. 
1. Build the project 
$ ant -f build.xml # The build process will output 'averagewordlength.jar'

2. Run the MapReduce program 
$ hadoop jar averagewordlength.jar stubs.AvgWordLength shakespeare wordlengths
$ hadoop fs -ls wordlengths # The results will be inside wordlengths in HDFS
...
... 2014-12-03 06:09 wordlengths/part-r-00000

3. Review the results 
$ hadoop fs -cat wordlengths/*
...
s 4.327014649237208
t 3.733261651336357
u 4.4905590522028875
v 5.726228030644434
w 4.3475752474027844
y 3.5292446231858716
z 4.672727272727273

The file should list all the numbers and letters that begin words in the data set, along with the average length of the words starting with each.
