Preface
Files and Directories Used in this Exercise
In this exercise, you will write a MapReduce job that produces an inverted index.
For this lab you will use an alternative input, provided in the file invertedIndexInput.tgz. When decompressed, this archive contains a directory of files; each is a Shakespeare play formatted as follows:
Each line contains a line number, a tab character, and one line of the play's text.
This format can be read directly using the KeyValueTextInputFormat class provided in the Hadoop API. This input format presents each line as one record to your Mapper, with the part before the tab character as the key, and the part after the tab as the value.
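As a sketch of what KeyValueTextInputFormat does with each record, the split at the first tab can be reproduced in plain Java (the sample line below is invented for illustration):

```java
public class KeyValueSplitDemo {

    // Splits a record at the first tab, as KeyValueTextInputFormat does.
    static String[] splitAtTab(String line) {
        int tab = line.indexOf('\t');
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        // A hypothetical input line: a line number, a tab, then the play's text.
        String[] kv = splitAtTab("1038\tThe honeysuckle is sweet");
        System.out.println("key   = " + kv[0]);  // prints "key   = 1038"
        System.out.println("value = " + kv[1]);  // prints "value = The honeysuckle is sweet"
    }
}
```

Your Mapper then receives kv[0] as the key and kv[1] as the value, both as Text objects.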
Given a body of text in this form, your indexer should produce an index of all the words in the text. For each word, the index should list every location where the word appears. For example, for the word "honeysuckle" your output should be a single line with the word as the key, followed by a comma-separated list of name_of_play@line_number locations as the value.
The index should contain such an entry for every word in the text.
Lab Experiment
Prepare The Input Data
1. Extract the invertedIndexInput directory and upload it to HDFS:
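The extract-and-upload step might look like the following (the HDFS paths are illustrative; adjust them to your environment):

```shell
# Unpack the archive locally (creates the invertedIndexInput directory)
tar -xzf invertedIndexInput.tgz

# Upload the directory to HDFS so the job can read it
hadoop fs -mkdir -p /user/training
hadoop fs -put invertedIndexInput /user/training/invertedIndexInput
```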
Define The MapReduce Solution
Remember that this program uses a special input format to suit the form of your data, so your driver class must configure it:
2. Implement the driver class
Note that the exercise requires you to retrieve the file name, since that is the name of the play. The Context object can be used to retrieve the name of the file.
3. Implement the Mapper class
The Reducer will output the inverted index: each word as the key, and the list of locations where it appears as the value.
4. Implement the Reducer class
5. Build the project and run the MapReduce job
6. Check the result
2. Implement the driver class:
```java
package solution;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class InvertedIndex extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.printf("Usage: InvertedIndex <input dir> <output dir>\n");
      return -1;
    }

    Job job = new Job(getConf());
    job.setJarByClass(InvertedIndex.class);
    job.setJobName("Inverted Index");

    /*
     * We are using a KeyValueText file as the input file.
     * Therefore, we must call setInputFormatClass.
     * There is no need to call setOutputFormatClass, because the
     * application uses a text file for output.
     */
    job.setInputFormatClass(KeyValueTextInputFormat.class);

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(IndexMapper.class);
    job.setReducerClass(IndexReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    boolean success = job.waitForCompletion(true);
    return success ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new Configuration(), new InvertedIndex(), args);
    System.exit(exitCode);
  }
}
```
3. Implement the Mapper class
```java
package solution;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class IndexMapper extends Mapper<Text, Text, Text, Text> {

  @Override
  public void map(Text key, Text value, Context context)
      throws IOException, InterruptedException {

    /*
     * Get the FileSplit for the input file, which provides access
     * to the file's path. You need the file's path because it
     * contains the name of the play.
     */
    FileSplit fileSplit = (FileSplit) context.getInputSplit();
    Path path = fileSplit.getPath();

    /*
     * Call the getName method on the Path object to retrieve the
     * file's name, which is the name of the play. Then append
     * "@" and the line number to the play's name. The resulting
     * string is the location of the words on that line.
     */
    String wordPlace = path.getName() + "@" + key.toString();
    Text location = new Text(wordPlace);

    /*
     * Convert the line to lower case.
     */
    String lcLine = value.toString().toLowerCase();

    /*
     * Split the line into words. For each word on the line,
     * emit an output record that has the word as the key and
     * the location of the word as the value.
     */
    for (String word : lcLine.split("\\W+")) {
      if (word.length() > 0) {
        context.write(new Text(word), location);
      }
    }
  }
}
```
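The Mapper's tokenization can be checked in isolation. This plain-Java sketch reproduces the lowercasing and the \W+ split on an invented sample line:

```java
import java.util.ArrayList;
import java.util.List;

public class MapperTokenDemo {

    // Reproduces the Mapper's tokenization: lowercase, split on runs of
    // non-word characters, and drop any empty strings.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (word.length() > 0) {
                out.add(word);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Each token would be emitted with the same location value,
        // e.g. "hamlet@2724" (play file name + "@" + line number).
        System.out.println(tokens("To be, or not to be!"));
        // prints "[to, be, or, not, to, be]"
    }
}
```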
4. Implement the Reducer class
```java
package solution;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * On input, the reducer receives a word as the key and a set
 * of locations in the form "play name@line number" for the values.
 * The reducer builds a readable string in the valueList variable that
 * contains an index of all the locations of the word.
 */
public class IndexReducer extends Reducer<Text, Text, Text, Text> {

  private static final String SEP = ",";

  @Override
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {

    StringBuilder valueList = new StringBuilder();
    boolean firstValue = true;

    /*
     * For each "play name@line number" in the input value set:
     */
    for (Text value : values) {

      /*
       * If this is not the word's first location, add a comma to the
       * end of valueList.
       */
      if (!firstValue) {
        valueList.append(SEP);
      } else {
        firstValue = false;
      }

      /*
       * Convert the location to a String and append it to valueList.
       */
      valueList.append(value.toString());
    }

    /*
     * Emit the index entry.
     */
    context.write(key, new Text(valueList.toString()));
  }
}
```
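The build-and-run step might look like this (class names are from the code above; the jar name and paths are illustrative):

```shell
# Compile against the Hadoop classpath and package into a jar
mkdir -p classes
javac -classpath "$(hadoop classpath)" -d classes solution/*.java
jar cf invertedindex.jar -C classes .

# Run the job; the output directory must not already exist
hadoop jar invertedindex.jar solution.InvertedIndex invertedIndexInput output

# Inspect the result
hadoop fs -cat output/part-r-00000 | head
```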
6. Check the result