程式扎記

Preface
In this chapter you will learn

* Basic MapReduce API concepts
* How to write MapReduce drivers, Mappers, and Reducers in Java
* The differences between the old and new MapReduce APIs

Basic MapReduce API Concepts
The MapReduce Flow

A Sample MapReduce Program - WordCount
This will demonstrate the Hadoop API

To investigate the API, we will dissect t he WordCount program we covered in the previous chapter. This consists of three portions:

* The driver code - Code that runs on the client to configure and submit the job
* The Mapper
* The Reducer

Getting Data to the Mapper
The data passed to the Mapper is specified by an InputFormat:

*Specified in the driver code
* Defines the location of the input data - Typically a file or directory
* Determines how to split the input data into input splits
* Create a RecordReader object - RecordReader parses the input data into key/value pairs to pass to the Mapper

TextInputFormat is the default implementation of InputFormat which will create KeyValueLineRecordReader object to treat each "\n" as separator to break file into lines. The key is the byte offset of that line within the file and the value is each line.

There are other standard InputFormat implementation:

* FileInputFormat - Abstract base class used for all file-based InputFormats
* KeyValueTextInputFormat - Maps "\n"-terminated lines as 'key[separator]value'. By default, [separator] is a tab.
* SequenceFileInputFormat - Binary file of (key, value) pairs with some additional metadata
* SequenceFileAsTextInputFormat - Similar, but map(key.toString(), value.toString())
* ...

Keys and Values are Objects

* Keys and values in Hadoop are Java Objects - Not primitives
* Values are objects which implement Writable
* Keys are objects which implement WritableComparable
* Hadoop defines its own 'box classes' for strings, integers, and so on - IntWritable, LongWritable and Text etc.

Writing MapReduce Applications in Java
The Driver Code - Introduction
The driver code runs on the client machine. It configures the job, then submit it to the cluster.
- WordCount.java

view plaincopy to clipboardprint?
package solution;  
  
import org.apache.hadoop.fs.Path;  
import org.apache.hadoop.io.IntWritable;  
import org.apache.hadoop.io.Text;  
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;  
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;  
import org.apache.hadoop.mapreduce.Job;  
  
/*  
* MapReduce jobs are typically implemented by using a driver class. 
* The purpose of a driver class is to set up the configuration for the 
* MapReduce job and to run the job. 
* Typical requirements for a driver class include configuring the input 
* and output data formats, configuring the map and reduce classes, 
* and specifying intermediate data formats. 
*  
* The following is the code for the driver class: 
*/  
public class WordCount {  
  
  public static void main(String[] args) throws Exception {  
  
    /* 
     * The expected command-line arguments are the paths containing 
     * input and output data. Terminate the job if the number of 
     * command-line arguments is not exactly 2. 
     */  
    if (args.length != 2) {  
      System.out.printf(  
          "Usage: WordCount  \n");  
      System.exit(-1);  
    }  
  
    /* 
     * Instantiate a Job object for your job's configuration.   
     */  
    Job job = new Job();  
      
    /* 
     * Specify the jar file that contains your driver, mapper, and reducer. 
     * Hadoop will transfer this jar file to nodes in your cluster running 
     * mapper and reducer tasks. 
     */  
    job.setJarByClass(WordCount.class);  
      
    /* 
     * Specify an easily-decipherable name for the job. 
     * This job name will appear in reports and logs. 
     */  
    job.setJobName("Word Count");  
  
    /* 
     * Specify the paths to the input and output data based on the 
     * command-line arguments. 
     */  
    FileInputFormat.setInputPaths(job, new Path(args[0]));  
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  
  
    /* 
     * Specify the mapper and reducer classes. 
     */  
    job.setMapperClass(WordMapper.class);  
    job.setReducerClass(SumReducer.class);  
  
    /* 
     * For the word count application, the input file and output  
     * files are in text format - the default format. 
     *  
     * In text format files, each record is a line delineated by a  
     * by a line terminator. 
     *  
     * When you use other input formats, you must call the  
     * SetInputFormatClass method. When you use other  
     * output formats, you must call the setOutputFormatClass method. 
     */  
        
    /* 
     * For the word count application, the mapper's output keys and 
     * values have the same data types as the reducer's output keys  
     * and values: Text and IntWritable. 
     *  
     * When they are not the same data types, you must call the  
     * setMapOutputKeyClass and setMapOutputValueClass  
     * methods. 
     */  
  
    /* 
     * Specify the job's output key and value classes. 
     */  
    job.setOutputKeyClass(Text.class);  
    job.setOutputValueClass(IntWritable.class);  
  
    /* 
     * Start the MapReduce job and wait for it to finish. 
     * If it finishes successfully, return 0. If not, return 1. 
     */  
    boolean success = job.waitForCompletion(true);  
    System.exit(success ? 0 : 1);  
  }  
}  

1. Creating a New Job Object
The Job class allows you to set configuration options for your MapReduce job

* The class to be used for your Mapper and Reducer
* The input/output directories
* Many other options by given Configuration object to construction.

Any options not explicitly set in your driver code will be read from your Hadoop configuration files ($HADOOP_HOME/etc/hadoop/conf). Any options not specified in your configuration files will use Hadoop's default values. You can use Job object to submit the job, control its execution, and query its state.

2. Specifying Input/Output Directories

view plaincopy to clipboardprint?
/* 
* Specify the paths to the input and output data based on the 
* command-line arguments. 
*/  
FileInputFormat.setInputPaths(job, new Path(args[0]));  
FileOutputFormat.setOutputPath(job, new Path(args[1]));  

The default InputFormat (TextInputFormat) will be used unless you specify otherwise. To use other InputFormat than the default, use e.g.

view plaincopy to clipboardprint?
job.setInputFormatClass(KeyValueTextInputFormat.class)  

By default, FileInputFormat.setInputPaths() will read all files from a specified directory and send them to Mappers except items whose names begin with a period (.) or underscore (_). Or you can use Globs to restrict input. For example, /2010/*/01/*

Alternatively, FileInputFormat.addInputPath() can be called multiple times, specifying a single file or directory each time. More advanced filtering can be performed by implementing a PathFilter (Interface with a method named accept which returns true or false depending on whether or not the file should be processed).

FileOutputFormat.setOutputPath() specifies the directory to which the Reducers will write their final output. The default is a plain text file and can be explicitly written as job.setOutputFormatClass(...).

3. Specifying the Mapper/Reducer Classes

view plaincopy to clipboardprint?
/* 
* Specify the mapper and reducer classes. 
*/  
job.setMapperClass(WordMapper.class);  
job.setReducerClass(SumReducer.class);  

Setting the Mapper/Reducer classes is optional! If not set in your driver code, Hadoop uses its defaults IdentityMapper and IdentityReducer:

4. Specifying Intermediate/Final Output data type

view plaincopy to clipboardprint?
/* 
* Specify the intermediate key/value classes. 
*/  
job.setMapOutputKeyClass(Text.class);  
job.setMapOutputValueClass(IntWritable.class);  
  
/* 
* Specify the job's output key and value classes. 
*/  
job.setOutputKeyClass(Text.class);  
job.setOutputValueClass(IntWritable.class);  

5. Running The Job

view plaincopy to clipboardprint?
/* 
* Start the MapReduce job and wait for it to finish. 
* If it finishes successfully, return 0. If not, return 1. 
*/  
boolean success = job.waitForCompletion(true);  
System.exit(success ? 0 : 1);  

There are two ways to run your MapReduce job:

* job.waitForCompletion() - Block and wait for the job to complete before continuing
* job.submit() - Doesn't block.

The client determines the proper division of input data into inputSplits, and then sends the job information to the JobTracker daemon on the cluster.

WordCount Mapper Review

- WordMapper.java

view plaincopy to clipboardprint?
package solution;  
  
import java.io.IOException;  
  
import org.apache.hadoop.io.IntWritable;  
import org.apache.hadoop.io.LongWritable;  
import org.apache.hadoop.io.Text;  
import org.apache.hadoop.mapreduce.Mapper;  
  
/*  
* To define a map function for your MapReduce job, subclass  
* the Mapper class and override the map method. 
* The class definition requires four parameters:  
*   The data type of the input key 
*   The data type of the input value 
*   The data type of the output key (which is the input key type  
*   for the reducer) 
*   The data type of the output value (which is the input value  
*   type for the reducer) 
*/  
  
public class WordMapper extends Mapper {  
  
  /* 
   * The map method runs once for each line of text in the input file. 
   * The method receives a key of type LongWritable, a value of type 
   * Text, and a Context object. 
   */  
  @Override  
  public void map(LongWritable key, Text value, Context context)  
      throws IOException, InterruptedException {  
  
    /* 
     * Convert the line, which is received as a Text object, 
     * to a String object. 
     */  
    String line = value.toString();  
  
    /* 
     * The line.split("\\W+") call uses regular expressions to split the 
     * line up by non-word characters. 
     *  
     * If you are not familiar with the use of regular expressions in 
     * Java code, search the web for "Java Regex Tutorial."  
     */  
    for (String word : line.split("\\W+")) {  
      if (word.length() > 0) {  
          
        /* 
         * Call the write method on the Context object to emit a key 
         * and a value from the map method. 
         */  
        context.write(new Text(word), new IntWritable(1));  
      }  
    }  
  }  
}  

WordCount Reducer Review

- SumReducer.java

view plaincopy to clipboardprint?
package solution;  
  
import java.io.IOException;  
  
import org.apache.hadoop.io.IntWritable;  
import org.apache.hadoop.io.Text;  
import org.apache.hadoop.mapreduce.Reducer;  
  
/*  
* To define a reduce function for your MapReduce job, subclass  
* the Reducer class and override the reduce method. 
* The class definition requires four parameters:  
*   The data type of the input key (which is the output key type  
*   from the mapper) 
*   The data type of the input value (which is the output value  
*   type from the mapper) 
*   The data type of the output key 
*   The data type of the output value 
*/     
public class SumReducer extends Reducer {  
  
  /* 
   * The reduce method runs once for each key received from 
   * the shuffle and sort phase of the MapReduce framework. 
   * The method receives a key of type Text, a set of values of type 
   * IntWritable, and a Context object. 
   */  
  @Override  
    public void reduce(Text key, Iterable values, Context context)  
            throws IOException, InterruptedException {  
        int wordCount = 0;  
          
        /* 
         * For each value in the set of values passed to us by the mapper: 
         */  
        for (IntWritable value : values) {  
            
          /* 
           * Add the value to the word count counter for this key. 
           */  
            wordCount += value.get();  
        }  
          
        /* 
         * Call the write method on the Context object to emit a key 
         * and a value from the reduce method.  
         */  
        context.write(key, new IntWritable(wordCount));  
    }  
}  

Ensure "Types" Match
Mappers and Reducers declare input/output type parameters through extended Mapper/Reducer with Generic. These must match the types used in the class:

Output types must also match those setting in the driver:

Supplement
* Exercise2 - Running a MapReduce Job
* Exercise3 - Writting a MapReduce Java Program
* Exercise4 - More Practice With MapReduce Java Programs
* Apache Hadoop 2.5.1 - MapReduce Tutorial
* Hadoop2.2.0遇到NativeLibraries錯誤的解決過程

$ export HADOOP_ROOT_LOGGER=DEBUG,console # See more debug information

程式扎記

標籤

2014年11月24日星期一

[CCDH] Class2 - Writing Basic MapReduce Programs (1)

沒有留言:

張貼留言

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

標籤

2014年11月24日 星期一