Monday, December 15, 2014

[CCDH] Exercise11 - Using Counters and a Map-Only Job (P41)

Preface
Files and Directories Used in this Exercise
Eclipse project: counters
Java files:
ImageCounter.java (Driver)
ImageCounterMapper.java (Mapper)

Test data (HDFS):
weblog (full web server access log)
testlog (partial data set for testing)

Exercise directory: ~/workspace/counters

In this exercise you will create a Map-only MapReduce job.

Your application will process a web server's access log to count the number of times gifs, jpegs, and other resources have been retrieved. Your job will report three figures: the number of gif requests, the number of jpeg requests, and the number of other requests.

Hints
1. You should use a Map-only MapReduce job, by setting the number of Reducers to 0 in the driver code.

2. For input data, use the Web access log file that you uploaded to the HDFS /user/training/weblog directory in the "Using HDFS" exercise.

3. Use a counter group such as ImageCounter, with names gif, jpeg and other.

4. In your driver code, retrieve the values of the counters after the job has completed and report them using System.out.println.

5. The output folder on HDFS will contain Mapper output files which are empty, because the Mapper did not write any data.

Solution Code
- Mapper
  package solution;

  import java.io.IOException;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  /**
   * Example input line:
   * 96.7.4.14 - - [24/Apr/2011:04:20:11 -0400] "GET /cat.jpg HTTP/1.1" 200 12433
   */
  public class ImageCounterMapper extends
      Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {

      /*
       * Split the line using the double-quote character as the delimiter.
       */
      String[] fields = value.toString().split("\"");
      if (fields.length > 1) {
        String request = fields[1];

        /*
         * Split the part of the line after the first double quote
         * using the space character as the delimiter to get a file name.
         */
        fields = request.split(" ");

        /*
         * Increment a counter based on the file's extension.
         * Nothing is written to the context, so the job produces no records.
         */
        if (fields.length > 1) {
          String fileName = fields[1].toLowerCase();
          if (fileName.endsWith(".jpg")) {
            context.getCounter("ImageCounter", "jpg").increment(1);
          } else if (fileName.endsWith(".gif")) {
            context.getCounter("ImageCounter", "gif").increment(1);
          } else {
            context.getCounter("ImageCounter", "other").increment(1);
          }
        }
      }
    }
  }
Inside map(), you retrieve a Counter from the Context object passed as a parameter by calling getCounter() with a group name and a counter name.
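The parsing steps in the Mapper can be tried out without a cluster. The following is a minimal sketch, not part of the exercise code: the class name ImageRequestClassifier and the method classify() are hypothetical, and the method returns the counter name as a string instead of incrementing a Hadoop counter.

```java
/**
 * Hypothetical helper (not part of the exercise solution) that repeats the
 * Mapper's parsing steps using only the Java standard library.
 */
public class ImageRequestClassifier {

  /**
   * Returns "jpg", "gif" or "other" for a log line, or null when the line
   * has no quoted request or the request has no file name.
   */
  public static String classify(String line) {
    // The request is the text between the first pair of double quotes.
    String[] fields = line.split("\"");
    if (fields.length <= 1) {
      return null;
    }

    // A request looks like: GET /cat.jpg HTTP/1.1
    String[] parts = fields[1].split(" ");
    if (parts.length <= 1) {
      return null;
    }

    // Compare extensions case-insensitively, as the Mapper does.
    String fileName = parts[1].toLowerCase();
    if (fileName.endsWith(".jpg")) {
      return "jpg";
    } else if (fileName.endsWith(".gif")) {
      return "gif";
    }
    return "other";
  }

  public static void main(String[] args) {
    String sample = "96.7.4.14 - - [24/Apr/2011:04:20:11 -0400] "
        + "\"GET /cat.jpg HTTP/1.1\" 200 12433";
    System.out.println(classify(sample)); // prints "jpg"
  }
}
```

Note that a line without a quoted request is skipped entirely, so it is counted in none of the three counters.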

- Driver 
  package solution;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  import org.apache.hadoop.mapreduce.Job;

  import org.apache.hadoop.conf.Configured;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.util.Tool;
  import org.apache.hadoop.util.ToolRunner;

  public class ImageCounter extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {

      // Expect exactly two arguments: the input and output paths.
      if (args.length != 2) {
        System.out.printf("Usage: ImageCounter <input dir> <output dir>\n");
        return -1;
      }

      Job job = new Job(getConf());
      job.setJarByClass(ImageCounter.class);
      job.setJobName("Image Counter");

      FileInputFormat.setInputPaths(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));

      // This is a map-only job, so we do not call setReducerClass.
      job.setMapperClass(ImageCounterMapper.class);

      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);

      /*
       * Set the number of reduce tasks to 0.
       */
      job.setNumReduceTasks(0);

      boolean success = job.waitForCompletion(true);
      if (success) {
        /*
         * Print out the counters that the mappers have been incrementing.
         */
        long jpg = job.getCounters().findCounter("ImageCounter", "jpg")
            .getValue();
        long gif = job.getCounters().findCounter("ImageCounter", "gif")
            .getValue();
        long other = job.getCounters().findCounter("ImageCounter", "other")
            .getValue();
        System.out.println("JPG   = " + jpg);
        System.out.println("GIF   = " + gif);
        System.out.println("OTHER = " + other);
        return 0;
      } else {
        return 1;
      }
    }

    public static void main(String[] args) throws Exception {
      int exitCode = ToolRunner.run(new Configuration(), new ImageCounter(), args);
      System.exit(exitCode);
    }
  }
The Job object provides getCounters(), which returns the job's Counters object; calling findCounter() on it with a group name and a counter name returns an individual Counter.

Lab Experiment
1. Build the project, run MapReduce job
$ ant -f build.xml # Build project and output counter.jar
$ hadoop fs -rm -r output # Clean previous result
$ hadoop jar counter.jar solution.ImageCounter weblog output # Run MapReduce
...
[exec] 14/12/16 02:04:29 INFO mapred.JobClient: ImageCounter
[exec] 14/12/16 02:04:29 INFO mapred.JobClient: gif=73682
[exec] 14/12/16 02:04:29 INFO mapred.JobClient: jpg=2629976
[exec] 14/12/16 02:04:29 INFO mapred.JobClient: other=1774178
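As a quick sanity check on the figures above, the three counters can be summed to get the number of requests the Mapper classified, and each counter expressed as a share of that total. This is a minimal sketch in plain Java; the class name CounterShares and the method percent() are hypothetical, and the values are copied from the log output above.

```java
/**
 * Hypothetical helper (not part of the exercise) that summarizes the
 * counter values reported by the job.
 */
public class CounterShares {

  /** Returns count as a percentage of total, or 0.0 when total is zero. */
  public static double percent(long count, long total) {
    return total == 0 ? 0.0 : 100.0 * count / total;
  }

  public static void main(String[] args) {
    // Counter values copied from the job output above.
    long gif = 73682L;
    long jpg = 2629976L;
    long other = 1774178L;

    long total = gif + jpg + other;
    System.out.println("total = " + total); // prints "total = 4477836"
    System.out.printf("gif   = %.2f%%%n", percent(gif, total));
    System.out.printf("jpg   = %.2f%%%n", percent(jpg, total));
    System.out.printf("other = %.2f%%%n", percent(other, total));
  }
}
```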

