Preface
In this chapter you will learn
Basic MapReduce API Concepts
The MapReduce Flow
A Sample MapReduce Program - WordCount
This will demonstrate the Hadoop API
To investigate the API, we will dissect the WordCount program we covered in the previous chapter. It consists of three portions: the driver code, the Mapper, and the Reducer.
Getting Data to the Mapper
The data passed to the Mapper is specified by an InputFormat:
TextInputFormat is the default implementation of InputFormat. It uses a LineRecordReader, which treats each newline ("\n") as a separator and breaks the file into lines. The key passed to the Mapper is the byte offset of that line within the file, and the value is the line itself.
There are other standard InputFormat implementations as well, such as KeyValueTextInputFormat, SequenceFileInputFormat, and NLineInputFormat.
Keys and Values are Objects
Keys and values in Hadoop are not Java primitives or Strings; they are objects that implement Hadoop's Writable interface, such as IntWritable, LongWritable, FloatWritable, and Text.
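Below is a minimal sketch (not part of the course code) showing these wrapper types in use; it only needs hadoop-common on the classpath:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        Text word = new Text("cat");            // wraps a Java String
        IntWritable count = new IntWritable(1); // wraps a Java int

        count.set(count.get() + 1);             // get()/set() expose the underlying primitive value
        System.out.println(word.toString() + " -> " + count.get()); // prints: cat -> 2
    }
}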
Writing MapReduce Applications in Java
The Driver Code - Introduction
The driver code runs on the client machine. It configures the job, then submits it to the cluster.
- WordCount.java
package solution;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

/*
 * MapReduce jobs are typically implemented by using a driver class.
 * The purpose of a driver class is to set up the configuration for the
 * MapReduce job and to run the job.
 * Typical requirements for a driver class include configuring the input
 * and output data formats, configuring the map and reduce classes,
 * and specifying intermediate data formats.
 *
 * The following is the code for the driver class:
 */
public class WordCount {

    public static void main(String[] args) throws Exception {

        /*
         * The expected command-line arguments are the paths containing
         * input and output data. Terminate the job if the number of
         * command-line arguments is not exactly 2.
         */
        if (args.length != 2) {
            System.out.printf("Usage: WordCount <input dir> <output dir>\n");
            System.exit(-1);
        }

        /*
         * Instantiate a Job object for your job's configuration.
         */
        Job job = new Job();

        /*
         * Specify the jar file that contains your driver, mapper, and reducer.
         * Hadoop will transfer this jar file to nodes in your cluster running
         * mapper and reducer tasks.
         */
        job.setJarByClass(WordCount.class);

        /*
         * Specify an easily-decipherable name for the job.
         * This job name will appear in reports and logs.
         */
        job.setJobName("Word Count");

        /*
         * Specify the paths to the input and output data based on the
         * command-line arguments.
         */
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        /*
         * Specify the mapper and reducer classes.
         */
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);

        /*
         * For the word count application, the input files and output
         * files are in text format - the default format.
         *
         * In text format files, each record is a line delimited by a
         * line terminator.
         *
         * When you use other input formats, you must call the
         * setInputFormatClass method. When you use other
         * output formats, you must call the setOutputFormatClass method.
         */

        /*
         * For the word count application, the mapper's output keys and
         * values have the same data types as the reducer's output keys
         * and values: Text and IntWritable.
         *
         * When they are not the same data types, you must call the
         * setMapOutputKeyClass and setMapOutputValueClass
         * methods.
         */

        /*
         * Specify the job's output key and value classes.
         */
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        /*
         * Start the MapReduce job and wait for it to finish.
         * If it finishes successfully, return 0. If not, return 1.
         */
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}
1. Creating a New Job Object
The Job class allows you to set configuration options for your MapReduce job. Any options not explicitly set in your driver code will be read from your Hadoop configuration files ($HADOOP_HOME/etc/hadoop/conf). Any options not specified in your configuration files will use Hadoop's default values. You can also use the Job object to submit the job, control its execution, and query its state.
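For example, the following sketch (not from the original driver; mapreduce.job.reduces is just one property you might set) shows an explicit setting that overrides both the configuration files and Hadoop's defaults:

// Inside the driver, before submitting the job.
// Requires org.apache.hadoop.conf.Configuration in addition to the driver's existing imports.
Configuration conf = new Configuration();   // picks up *-site.xml files found on the classpath
conf.set("mapreduce.job.reduces", "2");     // an explicit setting wins over file values and defaults
Job job = Job.getInstance(conf, "Word Count");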
2. Specifying Input/Output Directories
/*
 * Specify the paths to the input and output data based on the
 * command-line arguments.
 */
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
The default InputFormat (TextInputFormat) will be used unless you specify otherwise. To use an InputFormat other than the default, call setInputFormatClass(), e.g.
job.setInputFormatClass(KeyValueTextInputFormat.class);
By default, FileInputFormat.setInputPaths() will read all files from the specified directory and send them to the Mappers, except items whose names begin with a period (.) or an underscore (_). You can also use globs to restrict the input, for example /2010/*/01/*.
Alternatively, FileInputFormat.addInputPath() can be called multiple times, specifying a single file or directory each time. More advanced filtering can be performed by implementing a PathFilter, an interface with a single method, accept(), which returns true or false depending on whether or not the file should be processed.
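A sketch of both techniques (the paths and the filter class below are made-up examples, not from the course code):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

/* A PathFilter implementation: accept() returns true for files that should be processed. */
public class NoTmpFiles implements PathFilter {
    @Override
    public boolean accept(Path path) {
        return !path.getName().endsWith(".tmp");   // skip temporary files
    }
}

// In the driver (the Job is named "job", as in WordCount.java above):
FileInputFormat.addInputPath(job, new Path("/2010/01/01/logs"));
FileInputFormat.addInputPath(job, new Path("/2010/02/01/logs"));
FileInputFormat.setInputPathFilter(job, NoTmpFiles.class);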
FileOutputFormat.setOutputPath() specifies the directory to which the Reducers will write their final output. The default output format is plain text (TextOutputFormat); a different format can be set explicitly with job.setOutputFormatClass(...).
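For example (a sketch; SequenceFileOutputFormat is just one alternative to the default TextOutputFormat):

// In the driver:
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.class);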
3. Specifying the Mapper/Reducer Classes
Setting the Mapper/Reducer classes is optional! If they are not set in your driver code, Hadoop uses its defaults, IdentityMapper and IdentityReducer, which simply pass each input record through unchanged:
/*
 * Specify the mapper and reducer classes.
 */
job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
4. Specifying the Intermediate/Final Output Data Types
/*
 * Specify the intermediate key/value classes.
 */
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);

/*
 * Specify the job's output key and value classes.
 */
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
5. Running The Job
There are two ways to run your MapReduce job: job.waitForCompletion(), which submits the job and blocks until it completes (printing progress if you pass true), and job.submit(), which submits the job and returns immediately.

/*
 * Start the MapReduce job and wait for it to finish.
 * If it finishes successfully, return 0. If not, return 1.
 */
boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);
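A sketch of the second, non-blocking approach, in case you want the driver to do other work while the job runs; the polling interval is arbitrary:

// Alternative to waitForCompletion(): submit the job and poll its state.
job.submit();                                 // returns as soon as the job is handed to the cluster
while (!job.isComplete()) {
    Thread.sleep(5000);                       // arbitrary 5-second polling interval
}
System.exit(job.isSuccessful() ? 0 : 1);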
The client determines the proper division of input data into InputSplits, and then sends the job information to the JobTracker daemon on the cluster.
WordCount Mapper Review
- WordMapper.java
package solution;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/*
 * To define a map function for your MapReduce job, subclass
 * the Mapper class and override the map method.
 * The class definition requires four parameters:
 *   The data type of the input key
 *   The data type of the input value
 *   The data type of the output key (which is the input key type
 *   for the reducer)
 *   The data type of the output value (which is the input value
 *   type for the reducer)
 */
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    /*
     * The map method runs once for each line of text in the input file.
     * The method receives a key of type LongWritable, a value of type
     * Text, and a Context object.
     */
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        /*
         * Convert the line, which is received as a Text object,
         * to a String object.
         */
        String line = value.toString();

        /*
         * The line.split("\\W+") call uses regular expressions to split the
         * line up by non-word characters.
         *
         * If you are not familiar with the use of regular expressions in
         * Java code, search the web for "Java Regex Tutorial."
         */
        for (String word : line.split("\\W+")) {
            if (word.length() > 0) {
                /*
                 * Call the write method on the Context object to emit a key
                 * and a value from the map method.
                 */
                context.write(new Text(word), new IntWritable(1));
            }
        }
    }
}
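To see what the map method emits for one line of input, here is a tiny standalone sketch (a hypothetical class, not part of the course code) that applies the same split("\\W+") logic:

public class SplitDemo {
    public static void main(String[] args) {
        String line = "the cat sat on the mat";
        // Same tokenizing logic as WordMapper: split on runs of non-word characters.
        for (String word : line.split("\\W+")) {
            if (word.length() > 0) {
                // WordMapper would call context.write(new Text(word), new IntWritable(1)) here.
                System.out.println("(" + word + ", 1)");
            }
        }
    }
}

Running it prints (the, 1), (cat, 1), (sat, 1), (on, 1), (the, 1), (mat, 1); the two (the, 1) pairs are later summed by the Reducer.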
WordCount Reducer Review
- SumReducer.java
package solution;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/*
 * To define a reduce function for your MapReduce job, subclass
 * the Reducer class and override the reduce method.
 * The class definition requires four parameters:
 *   The data type of the input key (which is the output key type
 *   from the mapper)
 *   The data type of the input value (which is the output value
 *   type from the mapper)
 *   The data type of the output key
 *   The data type of the output value
 */
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    /*
     * The reduce method runs once for each key received from
     * the shuffle and sort phase of the MapReduce framework.
     * The method receives a key of type Text, a set of values of type
     * IntWritable, and a Context object.
     */
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {

        int wordCount = 0;

        /*
         * For each value in the set of values passed to us by the mapper:
         */
        for (IntWritable value : values) {
            /*
             * Add the value to the word count counter for this key.
             */
            wordCount += value.get();
        }

        /*
         * Call the write method on the Context object to emit a key
         * and a value from the reduce method.
         */
        context.write(key, new IntWritable(wordCount));
    }
}
Ensure "Types" Match
Mappers and Reducers declare their input and output types through the generic type parameters of the Mapper and Reducer classes they extend. These must match the types actually used in the class:
The output types must also match those set in the driver:
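Putting it together for WordCount (the annotations are mine, for illustration), the generic type parameters and the driver settings line up as follows:

// WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>
//   input key: LongWritable, input value: Text, output key: Text, output value: IntWritable
// SumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
//   input key/value match the mapper's output; output key: Text, output value: IntWritable

// In the driver:
job.setMapOutputKeyClass(Text.class);           // mapper (intermediate) output key
job.setMapOutputValueClass(IntWritable.class);  // mapper (intermediate) output value
job.setOutputKeyClass(Text.class);              // reducer (final) output key
job.setOutputValueClass(IntWritable.class);     // reducer (final) output value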
Supplement
* Exercise2 - Running a MapReduce Job
* Exercise3 - Writing a MapReduce Java Program
* Exercise4 - More Practice With MapReduce Java Programs
* Apache Hadoop 2.5.1 - MapReduce Tutorial
* Hadoop2.2.0 - Resolving the NativeLibraries error (in Chinese)