Sunday, December 28, 2014

[CCDH] Exercise 15 - Creating an Inverted Index (P54)

Preface
Files and Directories Used in this Exercise
Eclipse project: inverted_index
Java files:
IndexMapper.java (Mapper)
IndexReducer.java (Reducer)
InvertedIndex.java (Driver)

Data files:
~/training_materials/developer/data/invertedIndexInput.tgz

Exercise directory: ~/workspace/inverted_index

In this exercise, you will write a MapReduce job that produces an inverted index.

For this lab you will use an alternative input, provided in the file invertedIndexInput.tgz. When decompressed, this archive contains a directory of files; each is a Shakespeare play formatted as follows:


Each line contains:
Key: the line number
Separator: a tab character
Value: the line of text

This format can be read directly using the KeyValueTextInputFormat class provided in the Hadoop API. This input format presents each line as one record to your Mapper, with the part before the tab character as the key, and the part after the tab as the value.
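For example, a record in one of the play files might look like this (the content shown is a hypothetical placeholder; <TAB> stands for a literal tab character):

2175<TAB>the text of line 2175 of the play

KeyValueTextInputFormat would pass "2175" to the Mapper as the key and the remainder of the line as the value.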

Given a body of text in this form, your indexer should produce an index of all the words in the text. For each word, the index should have a list of all the locations where the word appears. For example, for the word "honeysuckle" your output should look like this:
honeysuckle 2kinghenryiv@1038,midsummernightsdream@2175,...

The index should contain such an entry for every word in the text.

Lab Experiment
Prepare The Input Data
1. Extract the invertedIndexInput directory and upload it to HDFS:
$ cd ~/training_materials/developer/data/
$ tar -xvf invertedIndexInput.tgz
$ hadoop fs -put invertedIndexInput invertedIndexInput
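Optionally, list the directory to confirm the upload:
$ hadoop fs -ls invertedIndexInput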

Define The MapReduce Solution
Remember that this program uses a special input format suited to the form of the data, so the driver class must configure it accordingly:
2. Implement the driver class:
package solution;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class InvertedIndex extends Configured implements Tool {

  public int run(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.printf("Usage: InvertedIndex <input dir> <output dir>\n");
      return -1;
    }

    Job job = new Job(getConf());
    job.setJarByClass(InvertedIndex.class);
    job.setJobName("Inverted Index");

    /*
     * The input is a key-value text file, so we must call
     * setInputFormatClass. There is no need to call
     * setOutputFormatClass, because the job writes a plain
     * text file, which is the default output format.
     */
    job.setInputFormatClass(KeyValueTextInputFormat.class);

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(IndexMapper.class);
    job.setReducerClass(IndexReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    boolean success = job.waitForCompletion(true);
    return success ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new Configuration(), new InvertedIndex(), args);
    System.exit(exitCode);
  }
}
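Not needed for this exercise, since the input already uses tabs, but the separator that KeyValueTextInputFormat splits on is configurable. A minimal sketch, assuming a Hadoop 2.x release (older releases use the property name key.value.separator.in.input.line instead):

    // Hypothetical tweak: split the key from the value on a comma
    // instead of the default tab. Set this on the Configuration
    // before constructing the Job.
    Configuration conf = getConf();
    conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");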
Note that the exercise requires you to retrieve the file name, since that is the name of the play. The Context object can be used to retrieve it, as the Mapper below shows.
3. Implement the Mapper class:
package solution;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.Mapper;

public class IndexMapper extends Mapper<Text, Text, Text, Text> {

  @Override
  public void map(Text key, Text value, Context context) throws IOException,
      InterruptedException {

    /*
     * Get the FileSplit for the input file, which provides access
     * to the file's path. You need the file's path because it
     * contains the name of the play.
     */
    FileSplit fileSplit = (FileSplit) context.getInputSplit();
    Path path = fileSplit.getPath();

    /*
     * Call the getName method on the Path object to retrieve the
     * file's name, which is the name of the play. Then append
     * "@" and the line number to the play's name. The resulting
     * string is the location of the words on that line.
     */
    String wordPlace = path.getName() + "@" + key.toString();
    Text location = new Text(wordPlace);

    /*
     * Convert the line to lower case.
     */
    String lc_line = value.toString().toLowerCase();

    /*
     * Split the line into words. For each word on the line,
     * emit an output record that has the word as the key and
     * the location of the word as the value.
     */
    for (String word : lc_line.split("\\W+")) {
      if (word.length() > 0) {
        context.write(new Text(word), location);
      }
    }
  }
}
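An optional refinement, common in Hadoop mappers: because context.write serializes its arguments immediately, a single Text object can safely be reused for every word instead of allocating a new one per record. A sketch of the loop with that change (same output, less garbage collection):

    // Declared once as an instance field and reused across records:
    private Text word = new Text();

    // ... inside map(), replacing the loop above:
    for (String w : lc_line.split("\\W+")) {
      if (w.length() > 0) {
        word.set(w);          // overwrite the reused Text in place
        context.write(word, location);
      }
    }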
The Reducer emits one index entry per word: the word as the key and a comma-separated list of the word's locations as the value:
4. Implement the Reducer class:
package solution;

import java.io.IOException;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Reducer;

/**
 * On input, the reducer receives a word as the key and a set
 * of locations in the form "play name@line number" for the values.
 * The reducer builds a readable string in the valueList variable that
 * contains an index of all the locations of the word.
 */
public class IndexReducer extends Reducer<Text, Text, Text, Text> {

  private static final String SEP = ",";

  @Override
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {

    StringBuilder valueList = new StringBuilder();
    boolean firstValue = true;

    /*
     * For each "play name@line number" in the input value set:
     */
    for (Text value : values) {

      /*
       * If this is not the word's first location, add a comma to the
       * end of valueList.
       */
      if (!firstValue) {
        valueList.append(SEP);
      } else {
        firstValue = false;
      }

      /*
       * Convert the location to a String and append it to valueList.
       */
      valueList.append(value.toString());
    }

    /*
     * Emit the index entry.
     */
    context.write(key, new Text(valueList.toString()));
  }
}
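Before running on the cluster, the reducer logic can be unit tested locally. A minimal sketch using MRUnit and JUnit, assuming both are on the project's classpath (the input locations are taken from the honeysuckle example above):

  package solution;

  import java.util.Arrays;

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
  import org.junit.Test;

  public class IndexReducerTest {

    @Test
    public void testReduce() throws Exception {
      // Two locations in, one comma-separated index entry out.
      new ReduceDriver<Text, Text, Text, Text>()
          .withReducer(new IndexReducer())
          .withInput(new Text("honeysuckle"),
              Arrays.asList(new Text("2kinghenryiv@1038"),
                            new Text("midsummernightsdream@2175")))
          .withOutput(new Text("honeysuckle"),
              new Text("2kinghenryiv@1038,midsummernightsdream@2175"))
          .runTest();
    }
  }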
5. Build the project and run the MapReduce job:
$ ant -f build.xml # Build project and output inverted_index.jar
$ hadoop fs -rm -r inverted_index # Clean previous result
$ hadoop jar inverted_index.jar solution.InvertedIndex invertedIndexInput inverted_index # Run MapReduce job
$ hadoop fs -ls inverted_index # Check result
...
... -rw-r--r-- 1 training supergroup 18446906 2014-12-28 21:24 inverted_index/part-r-00000

6. Check the result:
$ hadoop fs -cat inverted_index/part-r-00000 | less
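To spot-check a single word instead of paging through the whole file:
$ hadoop fs -cat inverted_index/part-r-00000 | grep -w honeysuckle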

