程式扎記

Preface
Files and Directories Used in this Exercise

Eclipse project: partitioner
Java files:
MonthPartitioner.java (Partitioner)
ProcessLogs.java (Driver)
CountReducer.java (Reducer)
LogMonthMapper.java (Mapper)

Test data (HDFS):
weblog (full web server access log)
testlog (partial data set for testing)

Exercise directory: ~/workspace/partitioner

In this Exercise, you will write a MapReduce job with multiple Reducers, and create a Partitioner to determine which Reducer each piece of Mapper output is sent to.

The Problem
In the "More Practice with Writing MapReduce Java Programs" exercise you did previously, you built the code in the log_file_analysis project. That program counted the number of hits for each distinct IP address in a web log file. The final output was a file containing a list of IP addresses and the number of hits from each address.

This time, we want to perform a similar task, but we want the final output to consist of 12 files, one for each month of the year: January, February, and so on. Each file will contain a list of IP addresses and the number of hits from each address in that month.

We will accomplish this by having 12 Reducers, each of which is responsible for processing the data for a particular month. Reducer 0 processes January hits, Reducer 1 processes February hits, and so on.
Note:

We are actually breaking the standard MapReduce paradigm here, which says that all the values for a particular key go to the same Reducer. In this example, which follows a very common pattern when analyzing log files, values for the same key (the IP address) go to multiple Reducers, based on the month portion of the line.
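To make the contrast concrete, here is a minimal plain-Java sketch with no Hadoop dependencies. The class and method names (PartitionContrast, hashPartition, monthPartition) are hypothetical and only serve the illustration. The hash formula is the one Hadoop's default HashPartitioner uses, applied here to a String rather than a Text key, so the actual partition numbers would differ from a real job.

```java
import java.util.Arrays;
import java.util.List;

class PartitionContrast {
    static final List<String> MONTHS = Arrays.asList(
        "Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec");

    // Key-based routing, using the same formula as Hadoop's default
    // HashPartitioner (applied to a String here, not a Text key).
    static int hashPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Value-based routing as used in this exercise: the month decides.
    static int monthPartition(String month) {
        return MONTHS.indexOf(month);
    }

    public static void main(String[] args) {
        String ip = "96.7.4.14";
        // Key-based: the same IP always lands on the same reducer.
        System.out.println(hashPartition(ip, 12) == hashPartition(ip, 12)); // true
        // Value-based: the same IP is spread across reducers by month.
        System.out.println(monthPartition("Apr")); // 3
        System.out.println(monthPartition("Jul")); // 6
    }
}
```

With key-based routing every record for 96.7.4.14 reaches one reducer; with month-based routing its April and July hits reach reducers 3 and 6 respectively.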

Sample Code
The mapper parses each log line and emits the IP address as the key and the three-letter month abbreviation as the value, i.e. (key, value) = (IP address, month):
- Mapper

package solution;  
  
import java.io.IOException;  
import java.util.Arrays;  
import java.util.List;  
  
import org.apache.hadoop.io.LongWritable;  
import org.apache.hadoop.io.Text;  
import org.apache.hadoop.mapreduce.Mapper;  
  
public class LogMonthMapper extends Mapper<LongWritable, Text, Text, Text> {

    public static List<String> months = Arrays.asList("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec");
  
  /** 
   * Example input line: 
   * 96.7.4.14 - - [24/Apr/2011:04:20:11 -0400] "GET /cat.jpg HTTP/1.1" 200 12433 
   * 
   */  
  @Override  
  public void map(LongWritable key, Text value, Context context)  
      throws IOException, InterruptedException {  
      
    /* 
     * Split the input line into space-delimited fields. 
     */  
    String[] fields = value.toString().split(" ");  
      
    if (fields.length > 3) {  
        
      /* 
       * Save the first field in the line as the IP address. 
       */  
      String ip = fields[0];  
        
      /* 
       * The fourth field contains [dd/Mmm/yyyy:hh:mm:ss]. 
       * Split the fourth field into "/" delimited fields. 
       * The second of these contains the month. 
       */  
      String[] dtFields = fields[3].split("/");  
      if (dtFields.length > 1) {  
        String theMonth = dtFields[1];  
          
        /* check if it's a valid month, if so, write it out */  
        if (months.contains(theMonth))  
            context.write(new Text(ip), new Text(theMonth));  
      }  
    }  
  }  
}  
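The field splitting in map() can be tried outside Hadoop. The following is a small sketch, assuming only the example input line from the mapper's comment; the class name LogLineParseSketch and the helper extractIpAndMonth are hypothetical and not part of the exercise code.

```java
import java.util.Arrays;
import java.util.List;

class LogLineParseSketch {
    static final List<String> MONTHS = Arrays.asList(
        "Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec");

    // Returns {ip, month}, or null when the line does not have the
    // expected shape or the month is not a known abbreviation.
    static String[] extractIpAndMonth(String line) {
        String[] fields = line.split(" ");
        if (fields.length > 3) {
            // fields[3] looks like "[24/Apr/2011:04:20:11"
            String[] dtFields = fields[3].split("/");
            if (dtFields.length > 1 && MONTHS.contains(dtFields[1])) {
                return new String[] { fields[0], dtFields[1] };
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String line = "96.7.4.14 - - [24/Apr/2011:04:20:11 -0400] "
            + "\"GET /cat.jpg HTTP/1.1\" 200 12433";
        String[] pair = extractIpAndMonth(line);
        // Prints the IP address and "Apr", separated by a tab.
        System.out.println(pair[0] + "\t" + pair[1]);
    }
}
```

Lines that are too short or carry an unknown month yield null here, which corresponds to the mapper silently skipping them.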

The partitioner extends the Partitioner class and implements getPartition(KEY key, VALUE value, int numPartitions):
- Month Partitioner

package solution;  
  
import java.util.HashMap;  
  
import org.apache.hadoop.io.Text;  
import org.apache.hadoop.conf.Configurable;  
import org.apache.hadoop.conf.Configuration;  
import org.apache.hadoop.mapreduce.Partitioner;  
  
public class MonthPartitioner extends Partitioner<Text, Text> implements
    Configurable {

  private Configuration configuration;
  HashMap<String, Integer> months = new HashMap<String, Integer>();
  
  /** 
   * Set up the months hash map in the setConf method. 
   */  
  @Override  
  public void setConf(Configuration configuration) {  
    this.configuration = configuration;  
    months.put("Jan", 0);  
    months.put("Feb", 1);  
    months.put("Mar", 2);  
    months.put("Apr", 3);  
    months.put("May", 4);  
    months.put("Jun", 5);  
    months.put("Jul", 6);  
    months.put("Aug", 7);  
    months.put("Sep", 8);  
    months.put("Oct", 9);  
    months.put("Nov", 10);  
    months.put("Dec", 11);  
  }  
  
  /** 
   * Implement the getConf method for the Configurable interface. 
   */  
  @Override  
  public Configuration getConf() {  
    return configuration;  
  }  
  
  /** 
   * You must implement the getPartition method for a partitioner class. 
   * This method receives the three-letter abbreviation for the month 
   * as its value. (It is the output value from the mapper.) 
   * It should return an integer representation of the month. 
   * Note that January is represented as 0 rather than 1. 
   *  
   * For this partitioner to work, the job configuration must have been 
   * set so that there are exactly 12 reducers. 
   */  
  public int getPartition(Text key, Text value, int numReduceTasks) {  
    return (int) (months.get(value.toString()));  
  }  
}  
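The getPartition method above unboxes the result of months.get directly, so a month missing from the map would cause a NullPointerException, and the returned index is never compared with numReduceTasks. The mapper already filters out invalid months, so this does not happen in the exercise, but a more defensive lookup could be sketched as follows. This is plain Java without Hadoop types; the class name GuardedMonthPartition and the modulo fallback for fewer than 12 reducers are assumptions made for illustration, not part of the original solution.

```java
import java.util.HashMap;
import java.util.Map;

class GuardedMonthPartition {
    private static final String[] NAMES = {
        "Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"};
    private static final Map<String, Integer> MONTHS = new HashMap<>();
    static {
        for (int i = 0; i < NAMES.length; i++) {
            MONTHS.put(NAMES[i], i);
        }
    }

    // Assumes numReduceTasks >= 1.
    static int getPartition(String month, int numReduceTasks) {
        Integer index = MONTHS.get(month);
        if (index == null) {
            throw new IllegalArgumentException("Unknown month: " + month);
        }
        // Assumption: with fewer than 12 reducers, several months share
        // one reducer instead of producing an out-of-range partition.
        return index % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(getPartition("Jan", 12)); // 0
        System.out.println(getPartition("Dec", 12)); // 11
        System.out.println(getPartition("Dec", 4));  // 3
    }
}
```

With exactly 12 reducers the result is identical to the original lookup; the modulo only matters when the job is configured with fewer reducers, in which case the one-file-per-month property is lost.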

The reducer is a simple count reducer:
- Reducer

package solution;  
  
import java.io.IOException;  
  
import org.apache.hadoop.io.IntWritable;  
import org.apache.hadoop.io.Text;  
import org.apache.hadoop.mapreduce.Reducer;  
  
/* Counts the number of values associated with a key */  
  
public class CountReducer extends Reducer<Text, Text, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {  
  
        /* 
         * Iterate over the values iterable and count the number 
         * of values in it. Emit the key (unchanged) and an IntWritable 
         * containing the number of values. 
         */  
  
        int count = 0;  
  
        /* 
         * Use for loop to count items in the iterator.  
         */  
          
        /* Ignore warnings that we 
         * don't use the value -- in this case, we only need to count the 
         * values, not use them. 
         */  
        for (@SuppressWarnings("unused")  
        Text value : values) {  
  
            /* 
             * for each item in the list, increment the count 
             */  
            count++;  
        }  
  
        context.write(key, new IntWritable(count));  
    }  
}  

Finally, below is the driver class:
- Driver

package solution;  
  
import org.apache.hadoop.fs.Path;  
import org.apache.hadoop.io.IntWritable;  
import org.apache.hadoop.io.Text;  
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;  
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;  
import org.apache.hadoop.mapreduce.Job;  
  
public class ProcessLogs {  
  
  public static void main(String[] args) throws Exception {  
  
    if (args.length != 2) {  
      System.out.printf("Usage: ProcessLogs <input dir> <output dir>\n");
      System.exit(-1);  
    }  
  
    Job job = new Job();  
    job.setJarByClass(ProcessLogs.class);  
    job.setJobName("Process Logs");  
  
    FileInputFormat.setInputPaths(job, new Path(args[0]));  
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  
  
    job.setMapperClass(LogMonthMapper.class);  
    job.setReducerClass(CountReducer.class);  
      
    job.setMapOutputKeyClass(Text.class);  
    job.setMapOutputValueClass(Text.class);  
  
    job.setOutputKeyClass(Text.class);  
    job.setOutputValueClass(IntWritable.class);  
      
    /* 
     * Set up the partitioner. Specify 12 reducers - one for each 
     * month of the year. The partitioner class must have a  
     * getPartition method that returns a number between 0 and 11. 
     * This number will be used to assign the intermediate output 
     * to one of the reducers. 
     */  
    job.setNumReduceTasks(12);  
      
    /* 
     * Specify the partitioner class. 
     */  
    job.setPartitionerClass(MonthPartitioner.class);  
  
    boolean success = job.waitForCompletion(true);  
    System.exit(success ? 0 : 1);  
  }  
}  

Remember to set the number of reducers to 12 with setNumReduceTasks(12) and to register the partitioner class with setPartitionerClass(MonthPartitioner.class).
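To see how the pieces fit together, the whole flow (map, partition by month, count per IP) can be simulated in memory. The sketch below assumes Java 8 or later and uses made-up log lines; the class name MonthlyHitsSimulation and its run method are hypothetical, and the code only imitates what the job does rather than using any Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MonthlyHitsSimulation {
    static final List<String> MONTHS = Arrays.asList(
        "Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec");

    // Returns 12 sorted maps (IP -> hit count), one per month/reducer.
    static List<Map<String, Integer>> run(List<String> lines) {
        List<Map<String, Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < 12; i++) {
            partitions.add(new TreeMap<String, Integer>());
        }
        for (String line : lines) {
            // "Map" step: extract IP and month, skipping malformed lines.
            String[] fields = line.split(" ");
            if (fields.length <= 3) continue;
            String[] dtFields = fields[3].split("/");
            if (dtFields.length <= 1) continue;
            // "Partition" step: the month picks the reducer.
            int partition = MONTHS.indexOf(dtFields[1]);
            if (partition < 0) continue;
            // "Reduce" step: count one hit for this IP in this month.
            partitions.get(partition).merge(fields[0], 1, Integer::sum);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines in the same format as the web log.
        List<String> lines = Arrays.asList(
            "10.0.0.1 - - [24/Apr/2011:04:20:11 -0400] \"GET /a HTTP/1.1\" 200 1",
            "10.0.0.1 - - [25/Apr/2011:04:20:11 -0400] \"GET /b HTTP/1.1\" 200 1",
            "10.0.0.2 - - [01/Jul/2011:04:20:11 -0400] \"GET /a HTTP/1.1\" 200 1",
            "10.0.0.1 - - [02/Jul/2011:04:20:11 -0400] \"GET /c HTTP/1.1\" 200 1");
        List<Map<String, Integer>> result = run(lines);
        System.out.println(result.get(3)); // {10.0.0.1=2}
        System.out.println(result.get(6)); // {10.0.0.1=1, 10.0.0.2=1}
    }
}
```

The same IP (10.0.0.1) shows up in both the April and the July partition, which is exactly the effect the custom partitioner produces in the real job.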

Lab Experiment
1. Compile the project and run the MapReduce job

$ ant -f build.xml # build project and output partitioner.jar
$ hadoop fs -rm -r output # Clean previous result
$ hadoop jar partitioner.jar solution.ProcessLogs testlog output # Run MapReduce job

2. Check the output

$ hadoop fs -ls output
...
... output/part-r-00000
...
... output/part-r-00011 # part-r-00000~part-r-00011
$ hadoop fs -cat output/part-r-00006
10.114.184.86 1
10.153.239.5 547
10.187.129.140 18
10.207.190.45 21
10.216.113.172 368
10.223.157.186 115
10.82.30.199 183
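Because reducer N writes its results to part-r-NNNNN and the partitioner sends January to reducer 0, the file shown above (part-r-00006) holds the July counts. A small helper for translating between month names and output file names might look like this; the class PartFileNames and its methods are hypothetical conveniences, not part of the exercise.

```java
class PartFileNames {
    static final String[] MONTHS = {
        "Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"};

    // Month abbreviation -> reducer output file name.
    static String partFileFor(String month) {
        for (int i = 0; i < MONTHS.length; i++) {
            if (MONTHS[i].equals(month)) {
                return String.format("part-r-%05d", i);
            }
        }
        throw new IllegalArgumentException("Unknown month: " + month);
    }

    // Reducer output file name -> month abbreviation.
    static String monthFor(String partFile) {
        String digits = partFile.substring(partFile.lastIndexOf('-') + 1);
        return MONTHS[Integer.parseInt(digits)];
    }

    public static void main(String[] args) {
        System.out.println(partFileFor("Jul"));       // part-r-00006
        System.out.println(monthFor("part-r-00011")); // Dec
    }
}
```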

Sunday, December 14, 2014

[CCDH] Exercise12 - Writing a Partitioner (P43)
