Thursday, January 22, 2015

[CCDH] Exercise 4 - More Practice With MapReduce Java Programs (P24)

Preface
Files and Directories Used in this Exercise
Eclipse project: log_file_analysis
Java files:
SumReducer.java - the Reducer
LogFileMapper.java - the Mapper
ProcessLogs.java - the Driver class

Test data (HDFS):
weblog (full version)
testlog (test sample set)

Exercise directory: ~/workspace/log_file_analysis

In this exercise, you will analyze a log file from a web server to count the number of hits made from each unique IP address.

Your task is to count the number of hits made from each IP address in the sample web server log file that you uploaded to the /user/training/weblog directory in HDFS when you completed the "Using HDFS" exercise.

Source Code
Mapper
Extract the IP address field and output <IP address, 1> pairs:
- solution/LogFileMapper.java
package solution;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Example input line:
 * 96.7.4.14 - - [24/Apr/2011:04:20:11 -0400] "GET /cat.jpg HTTP/1.1" 200 12433
 */
public class LogFileMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    /*
     * Split the input line into space-delimited fields.
     */
    String[] fields = value.toString().split(" ");
    if (fields.length > 0) {

      /*
       * Emit the first field - the IP address - as the key
       * and the number 1 as the value.
       */
      String ip = fields[0];
      context.write(new Text(ip), new IntWritable(1));
    }
  }
}
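Before running on the cluster, the mapper can be exercised on its own. Below is a minimal sketch of an MRUnit-style unit test, assuming MRUnit and JUnit are on the classpath (they are not part of this exercise's build file); the class name LogFileMapperTest is chosen here only for illustration:

package solution;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class LogFileMapperTest {

  @Test
  public void mapperEmitsIpAndOne() throws Exception {
    // Feed the mapper the example log line from the Javadoc above
    // and assert that it emits (IP address, 1).
    MapDriver<LongWritable, Text, Text, IntWritable> driver =
        MapDriver.newMapDriver(new LogFileMapper());

    driver
        .withInput(new LongWritable(0),
            new Text("96.7.4.14 - - [24/Apr/2011:04:20:11 -0400] \"GET /cat.jpg HTTP/1.1\" 200 12433"))
        .withOutput(new Text("96.7.4.14"), new IntWritable(1))
        .runTest();
  }
}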
Reducer
The reducer simply sums the counts for each IP address and outputs <IP address, hit count> pairs:
- solution/SumReducer.java 
package solution;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * This is the SumReducer class from the word count exercise
 */
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int wordCount = 0;
    for (IntWritable value : values) {
      wordCount += value.get();
    }
    context.write(key, new IntWritable(wordCount));
  }
}
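Because this reduce operation is a simple associative and commutative sum, the same class can typically also serve as a combiner, cutting down the intermediate data shuffled across the network. This is an optional optimization and not part of the provided solution; the sketch below shows the single line that would be added to ProcessLogs.main(), next to job.setReducerClass(...):

// Optional optimization (an assumption, not in the provided solution):
// also run SumReducer on the map side as a combiner.
job.setCombinerClass(SumReducer.class);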
Driver
The driver is quite straightforward:
- solution/ProcessLogs.java
package solution;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class ProcessLogs {

  public static void main(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.printf("Usage: ProcessLogs <input dir> <output dir>\n");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(ProcessLogs.class);
    job.setJobName("Process Logs");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(LogFileMapper.class);
    job.setReducerClass(SumReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}
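One common snag when re-running the job is that FileOutputFormat refuses to write into an output directory that already exists in HDFS. Below is a minimal sketch of a hypothetical helper (the class name OutputDirCleaner and its use are assumptions, not part of the provided solution) that ProcessLogs.main() could call before submitting the job, e.g. OutputDirCleaner.deleteIfExists(job.getConfiguration(), args[1]):

package solution;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical helper, not part of the provided solution: deletes a job's
 * output directory if it already exists, so the job can be re-run without
 * manually removing the directory first.
 */
public class OutputDirCleaner {

  public static void deleteIfExists(Configuration conf, String dir) throws IOException {
    Path outputDir = new Path(dir);
    FileSystem fs = FileSystem.get(conf);
    if (fs.exists(outputDir)) {
      fs.delete(outputDir, true);   // true = delete recursively
    }
  }
}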
Lab Experiment
1. Build the project and run the MapReduce program
$ ant -f build.xml # The build process will output 'log_file_analysis.jar'
$ hadoop jar log_file_analysis.jar solution.ProcessLogs weblog ip_count # Output result to ip_count in HDFS

2. Review the result
$ hadoop fs -ls
...
...ip_count
...

$ hadoop fs -cat ip_count/*
...
10.99.99.186 6
10.99.99.247 1
10.99.99.58 21

