Sunday, December 28, 2014

[CCDH] Exercise 15 - Creating an Inverted Index (P54)

Preface
Files and Directories Used in this Exercise
Eclipse project: inverted_index
Java files:
IndexMapper.java (Mapper)
IndexReducer.java (Reducer)
InvertedIndex.java (Driver)

Data files:
~/training_materials/developer/data/invertedIndexInput.tgz

Exercise directory: ~/workspace/inverted_index

In this exercise, you will write a MapReduce job that produces an inverted index.

For this lab you will use an alternative input, provided in the file invertedIndexInput.tgz. When decompressed, this archive contains a directory of files; each is a Shakespeare play formatted as follows:


Each line contains:
Key: the line number
Separator: a tab character
Value: the line of text

This format can be read directly using the KeyValueTextInputFormat class provided in the Hadoop API. This input format presents each line as one record to your Mapper, with the part before the tab character as the key, and the part after the tab as the value.
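For example, a record in one of the play files might look like this (the content shown is a hypothetical placeholder; <TAB> stands for a literal tab character):

2175<TAB>the text of line 2175 of the play

KeyValueTextInputFormat would pass "2175" to the Mapper as the key and the remainder of the line as the value.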

Given a body of text in this form, your indexer should produce an index of all the words in the text. For each word, the index should have a list of all the locations where the word appears. For example, for the word "honeysuckle" your output should look like this:
honeysuckle 2kinghenryiv@1038,midsummernightsdream@2175,...

The index should contain such an entry for every word in the text.

Lab Experiment
Prepare The Input Data
1. Extract the invertedIndexInput directory and upload it to HDFS:
$ cd ~/training_materials/developer/data/
$ tar -xvf invertedIndexInput.tgz
$ hadoop fs -put invertedIndexInput invertedIndexInput
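Optionally, list the directory to confirm the upload:
$ hadoop fs -ls invertedIndexInput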

Define The MapReduce Solution
Remember that this program uses a special input format suited to the form of the data, so the driver class must configure it accordingly:
2. Implement the driver class:
package solution;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class InvertedIndex extends Configured implements Tool {

  public int run(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.printf("Usage: InvertedIndex <input dir> <output dir>\n");
      return -1;
    }

    Job job = new Job(getConf());
    job.setJarByClass(InvertedIndex.class);
    job.setJobName("Inverted Index");

    /*
     * The input is a key-value text file, so we must call
     * setInputFormatClass. There is no need to call
     * setOutputFormatClass, because the job writes a plain
     * text file, which is the default output format.
     */
    job.setInputFormatClass(KeyValueTextInputFormat.class);

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(IndexMapper.class);
    job.setReducerClass(IndexReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    boolean success = job.waitForCompletion(true);
    return success ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new Configuration(), new InvertedIndex(), args);
    System.exit(exitCode);
  }
}
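Not needed for this exercise, since the input already uses tabs, but the separator that KeyValueTextInputFormat splits on is configurable. A minimal sketch, assuming a Hadoop 2.x release (older releases use the property name key.value.separator.in.input.line instead):

    // Hypothetical tweak: split the key from the value on a comma
    // instead of the default tab. Set this on the Configuration
    // before constructing the Job.
    Configuration conf = getConf();
    conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");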
Note that the exercise requires you to retrieve the file name, since that is the name of the play. The Context object can be used to retrieve it, as the Mapper below shows.
3. Implement the Mapper class:
package solution;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.Mapper;

public class IndexMapper extends Mapper<Text, Text, Text, Text> {

  @Override
  public void map(Text key, Text value, Context context) throws IOException,
      InterruptedException {

    /*
     * Get the FileSplit for the input file, which provides access
     * to the file's path. You need the file's path because it
     * contains the name of the play.
     */
    FileSplit fileSplit = (FileSplit) context.getInputSplit();
    Path path = fileSplit.getPath();

    /*
     * Call the getName method on the Path object to retrieve the
     * file's name, which is the name of the play. Then append
     * "@" and the line number to the play's name. The resulting
     * string is the location of the words on that line.
     */
    String wordPlace = path.getName() + "@" + key.toString();
    Text location = new Text(wordPlace);

    /*
     * Convert the line to lower case.
     */
    String lc_line = value.toString().toLowerCase();

    /*
     * Split the line into words. For each word on the line,
     * emit an output record that has the word as the key and
     * the location of the word as the value.
     */
    for (String word : lc_line.split("\\W+")) {
      if (word.length() > 0) {
        context.write(new Text(word), location);
      }
    }
  }
}
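An optional refinement, common in Hadoop mappers: because context.write serializes its arguments immediately, a single Text object can safely be reused for every word instead of allocating a new one per record. A sketch of the loop with that change (same output, less garbage collection):

    // Declared once as an instance field and reused across records:
    private Text word = new Text();

    // ... inside map(), replacing the loop above:
    for (String w : lc_line.split("\\W+")) {
      if (w.length() > 0) {
        word.set(w);          // overwrite the reused Text in place
        context.write(word, location);
      }
    }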
The Reducer emits one index entry per word: the word as the key and a comma-separated list of the word's locations as the value:
4. Implement the Reducer class:
package solution;

import java.io.IOException;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Reducer;

/**
 * On input, the reducer receives a word as the key and a set
 * of locations in the form "play name@line number" for the values.
 * The reducer builds a readable string in the valueList variable that
 * contains an index of all the locations of the word.
 */
public class IndexReducer extends Reducer<Text, Text, Text, Text> {

  private static final String SEP = ",";

  @Override
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {

    StringBuilder valueList = new StringBuilder();
    boolean firstValue = true;

    /*
     * For each "play name@line number" in the input value set:
     */
    for (Text value : values) {

      /*
       * If this is not the word's first location, add a comma to the
       * end of valueList.
       */
      if (!firstValue) {
        valueList.append(SEP);
      } else {
        firstValue = false;
      }

      /*
       * Convert the location to a String and append it to valueList.
       */
      valueList.append(value.toString());
    }

    /*
     * Emit the index entry.
     */
    context.write(key, new Text(valueList.toString()));
  }
}
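Before running on the cluster, the reducer logic can be unit tested locally. A minimal sketch using MRUnit and JUnit, assuming both are on the project's classpath (the input locations are taken from the honeysuckle example above):

  package solution;

  import java.util.Arrays;

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
  import org.junit.Test;

  public class IndexReducerTest {

    @Test
    public void testReduce() throws Exception {
      // Two locations in, one comma-separated index entry out.
      new ReduceDriver<Text, Text, Text, Text>()
          .withReducer(new IndexReducer())
          .withInput(new Text("honeysuckle"),
              Arrays.asList(new Text("2kinghenryiv@1038"),
                            new Text("midsummernightsdream@2175")))
          .withOutput(new Text("honeysuckle"),
              new Text("2kinghenryiv@1038,midsummernightsdream@2175"))
          .runTest();
    }
  }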
5. Build the project and run the MapReduce job:
$ ant -f build.xml # Build project and output inverted_index.jar
$ hadoop fs -rm -r inverted_index # Clean previous result
$ hadoop jar inverted_index.jar solution.InvertedIndex invertedIndexInput inverted_index # Run MapReduce job
$ hadoop fs -ls inverted_index # Check result
...
... -rw-r--r-- 1 training supergroup 18446906 2014-12-28 21:24 inverted_index/part-r-00000

6. Check the result:
$ hadoop fs -cat inverted_index/part-r-00000 | less
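To spot-check a single word instead of paging through the whole file:
$ hadoop fs -cat inverted_index/part-r-00000 | grep -w honeysuckle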

