Projects and Directories Used in this Exercise
In this exercise you write a MapReduce job that reads any text input and computes the average length of all words that start with each character. That is, for any text input, the job should report the average length of the words that begin with 'a', with 'b', and so forth. For example, for the input:
- No now is definitely not the time
The output would be:
- N 2.0
- d 10.0
- i 2.0
- n 3.0
- t 3.5
(Note that "No" is counted under the capital letter 'N', because the Mapper does not lowercase the words.)
The Algorithm
The program uses a simple one-pass MapReduce algorithm:
The Mapper
The Mapper receives a line of text as each input value. (Ignore the input key.) For each word in the line, it emits the first letter of the word as the key and the length of the word as the value. The source code is shown below:
LetterMapper.java

package stubs;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LetterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Split the line on non-word characters; for every non-empty token,
    // emit (first letter, word length).
    for (String token : value.toString().split("\\W")) {
      if (token.length() > 0) {
        context.write(new Text(String.valueOf(token.charAt(0))),
            new IntWritable(token.length()));
      }
    }
  }
}
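To see concretely what the Mapper emits for the sample line, here is a minimal stand-alone Java sketch (not part of the exercise; the class name MapperLogicDemo is invented for illustration) that applies the same split-and-emit logic and prints the (letter, length) pairs:

public class MapperLogicDemo {
  public static void main(String[] args) {
    String line = "No now is definitely not the time";
    // Same tokenization as LetterMapper: split on non-word characters
    // and skip any empty tokens.
    for (String token : line.split("\\W")) {
      if (token.length() > 0) {
        // The (first letter, length) pair LetterMapper would emit.
        System.out.println(token.charAt(0) + "\t" + token.length());
      }
    }
    // Prints (tab-separated): N 2, n 3, i 2, d 10, n 3, t 3, t 4
  }
}

Grouping these pairs by key and averaging the values gives exactly the output shown at the start of this post.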
Thanks to the shuffle and sort phase built into MapReduce, the Reducer receives the keys in sorted order, and all the values for one key are grouped together. So for the Mapper output above, the Reducer only has to sum the lengths for each letter and divide by the number of values. The source code is shown below:
AverageReducer.java

package stubs;

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Sum the word lengths for this letter and count how many words there were.
    double wls = 0.0;
    double cnt = 0.0;
    for (IntWritable wl : values) {
      wls += wl.get();
      cnt++;
    }
    // Emit the letter and the average word length.
    context.write(key, new DoubleWritable(wls / cnt));
  }
}
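For the sample line above, for instance, the key 'n' arrives with the grouped values (3, 3), so the Reducer writes 3.0 = (3 + 3) / 2; the key 't' arrives with (3, 4) and writes 3.5.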
The driver is almost the same as the one in WordCount. The source code is shown below:
AvgWordLength.java

package stubs;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AvgWordLength {

  public static void main(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.printf("Usage: AvgWordLength <input dir> <output dir>\n");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(AvgWordLength.class);
    job.setJobName("Average Word Length");

    // Set the input and output directories from the command-line arguments.
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(LetterMapper.class);
    job.setReducerClass(AverageReducer.class);

    // Declare the Mapper's intermediate output types and the job's final output types.
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);

    /*
     * Start the MapReduce job and wait for it to finish.
     * If it finishes successfully, return 0. If not, return 1.
     */
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}
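Two details worth noting in the driver: the Mapper's intermediate output value type (IntWritable) differs from the Reducer's final output value type (DoubleWritable), which is why setMapOutputValueClass and setOutputValueClass are called with different classes; and passing true to waitForCompletion makes Hadoop print the job's progress to the console while it runs.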
Under the path ~/workspace/averagewordlength you should have a build.xml file, which you can use with ant to build the project. The remaining steps are:
1. Build the project
2. Run the MapReduce program
3. Review the results
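In a typical run (the exact jar name depends on the targets defined in build.xml, so treat it as an assumption), you would run ant inside ~/workspace/averagewordlength to compile and package the classes, submit the job with hadoop jar, passing the jar produced by the build, the driver class stubs.AvgWordLength, an input directory, and a not-yet-existing output directory, and then inspect the part-r-* files in the output directory (for example with hadoop fs -cat) to review the results.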
The output file should list all the numbers and letters that start words in the data set, along with the average length of the words starting with each.