Preface
Files and Directories Used in this Exercise
In this exercise, you will write a MapReduce job with multiple Reducers and create a Partitioner that determines which Reducer each piece of Mapper output is sent to.
The Problem
In the "More Practice with Writing MapReduce Java Programs" exercise you completed previously, you built the code in the log_file_analysis project. That program counted the number of hits for each distinct IP address in a web log file. The final output was a file containing a list of IP addresses and the number of hits from each address.
This time, we want to perform a similar task, but the final output should consist of 12 files, one for each month of the year: January, February, and so on. Each file will contain a list of IP addresses and the number of hits from each address in that month.
We will accomplish this by having 12 Reducers, each of which is responsible for processing the data for a particular month: Reducer 0 processes January hits, Reducer 1 processes February hits, and so on.
Note:
Sample Code
The Mapper analyzes each line of the log and extracts the IP address and the month, emitting (key, value) = (IP address, month):
- Mapper
The Partitioner extends the Partitioner class and implements getPartition(KEY key, VALUE value, int numPartitions):
- Month Partitioner
The Reducer is a simple count reducer:
- Reducer
Finally, the Driver class configures and submits the job:
- Driver
Remember to set the number of Reducers to 12 with setNumReduceTasks(12), and to set the Partitioner class with setPartitionerClass(MonthPartitioner.class).
Mapper

package solution;

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogMonthMapper extends Mapper<LongWritable, Text, Text, Text> {

  public static List<String> months = Arrays.asList("Jan", "Feb", "Mar",
      "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec");

  /**
   * Example input line:
   * 96.7.4.14 - - [24/Apr/2011:04:20:11 -0400] "GET /cat.jpg HTTP/1.1" 200 12433
   */
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    /*
     * Split the input line into space-delimited fields.
     */
    String[] fields = value.toString().split(" ");
    if (fields.length > 3) {

      /*
       * Save the first field in the line as the IP address.
       */
      String ip = fields[0];

      /*
       * The fourth field contains [dd/Mmm/yyyy:hh:mm:ss].
       * Split the fourth field into "/"-delimited fields.
       * The second of these contains the month.
       */
      String[] dtFields = fields[3].split("/");
      if (dtFields.length > 1) {
        String theMonth = dtFields[1];

        /* Check that it is a valid month; if so, write it out. */
        if (months.contains(theMonth)) {
          context.write(new Text(ip), new Text(theMonth));
        }
      }
    }
  }
}
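As a quick sanity check, the mapper's parsing steps can be exercised outside Hadoop on the example line from the comment above. This is a standalone sketch; the class name ParseCheck and the helper parse are ours, not part of the exercise code:

```java
import java.util.Arrays;
import java.util.List;

public class ParseCheck {

    static final List<String> MONTHS = Arrays.asList(
        "Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec");

    // Mimics the mapper's logic: returns {ip, month}, or null if the line
    // is malformed or the month is not recognized.
    static String[] parse(String line) {
        String[] fields = line.split(" ");
        if (fields.length <= 3) return null;
        String ip = fields[0];
        String[] dtFields = fields[3].split("/");   // e.g. "[24", "Apr", "2011:04:20:11"
        if (dtFields.length <= 1 || !MONTHS.contains(dtFields[1])) return null;
        return new String[] { ip, dtFields[1] };
    }

    public static void main(String[] args) {
        String line = "96.7.4.14 - - [24/Apr/2011:04:20:11 -0400] "
                    + "\"GET /cat.jpg HTTP/1.1\" 200 12433";
        String[] kv = parse(line);
        System.out.println(kv[0] + " -> " + kv[1]);   // 96.7.4.14 -> Apr
    }
}
```

For the example line, fields[0] is "96.7.4.14" and fields[3] is "[24/Apr/2011:04:20:11", whose second "/"-delimited component is "Apr".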
Month Partitioner

package solution;

import java.util.HashMap;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MonthPartitioner extends Partitioner<Text, Text> implements
    Configurable {

  private Configuration configuration;
  HashMap<String, Integer> months = new HashMap<String, Integer>();

  /**
   * Set up the months hash map in the setConf method.
   */
  @Override
  public void setConf(Configuration configuration) {
    this.configuration = configuration;
    months.put("Jan", 0);
    months.put("Feb", 1);
    months.put("Mar", 2);
    months.put("Apr", 3);
    months.put("May", 4);
    months.put("Jun", 5);
    months.put("Jul", 6);
    months.put("Aug", 7);
    months.put("Sep", 8);
    months.put("Oct", 9);
    months.put("Nov", 10);
    months.put("Dec", 11);
  }

  /**
   * Implement the getConf method for the Configurable interface.
   */
  @Override
  public Configuration getConf() {
    return configuration;
  }

  /**
   * You must implement the getPartition method for a partitioner class.
   * This method receives the three-letter abbreviation for the month
   * as its value. (It is the output value from the mapper.)
   * It should return an integer representation of the month.
   * Note that January is represented as 0 rather than 1.
   *
   * For this partitioner to work, the job configuration must have been
   * set so that there are exactly 12 reducers.
   */
  @Override
  public int getPartition(Text key, Text value, int numReduceTasks) {
    return months.get(value.toString());
  }
}
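The month-to-partition mapping the HashMap encodes can be checked in isolation. This is a standalone sketch, independent of Hadoop; the class name PartitionCheck is ours, and it uses indexOf over the month list, which yields the same Jan -> 0 ... Dec -> 11 mapping as the partitioner's HashMap:

```java
import java.util.Arrays;
import java.util.List;

public class PartitionCheck {

    static final List<String> MONTHS = Arrays.asList(
        "Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec");

    // Returns the reducer index for a month abbreviation: Jan -> 0, ..., Dec -> 11.
    static int partitionFor(String month) {
        return MONTHS.indexOf(month);
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("Jan"));  // 0
        System.out.println(partitionFor("Apr"));  // 3
        System.out.println(partitionFor("Dec"));  // 11
    }
}
```

Because getPartition returns values 0 through 11, the job must run with exactly 12 reducers, as noted in the comment above.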
Reducer

package solution;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/* Counts the number of values associated with a key. */
public class CountReducer extends Reducer<Text, Text, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {

    /*
     * Iterate over the values iterable and count the number of values
     * in it. Emit the key (unchanged) and an IntWritable containing the
     * number of values. The "unused" warning is suppressed because we
     * only need to count the values, not use them.
     */
    int count = 0;
    for (@SuppressWarnings("unused") Text value : values) {
      count++;
    }
    context.write(key, new IntWritable(count));
  }
}
Driver

package solution;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ProcessLogs {

  public static void main(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.printf("Usage: ProcessLogs <input dir> <output dir>\n");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(ProcessLogs.class);
    job.setJobName("Process Logs");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(LogMonthMapper.class);
    job.setReducerClass(CountReducer.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    /*
     * Set up the partitioner. Specify 12 reducers - one for each month
     * of the year. The partitioner class must have a getPartition method
     * that returns a number between 0 and 11. This number will be used to
     * assign the intermediate output to one of the reducers.
     */
    job.setNumReduceTasks(12);

    /*
     * Specify the partitioner class.
     */
    job.setPartitionerClass(MonthPartitioner.class);

    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}
Lab Experiment
1. Compile the project and run the MapReduce job.
2. Check the output: you should see 12 output files (part-r-00000 through part-r-00011), one per month.
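A typical command sequence for these two steps looks like the following. This is a hedged sketch: the jar name processlogs.jar, the source layout src/solution, and the HDFS paths weblog and monthly_output are placeholders, not names from the exercise, and the commands require a working Hadoop installation:

```shell
# Step 1: compile and package the classes (assumes the Hadoop jars are on
# the classpath reported by `hadoop classpath`).
javac -classpath "$(hadoop classpath)" -d classes src/solution/*.java
jar cvf processlogs.jar -C classes .

# Run the job; the arguments are the input directory and a not-yet-existing
# output directory.
hadoop jar processlogs.jar solution.ProcessLogs weblog monthly_output

# Step 2: inspect the output - there should be 12 part files, one per
# reducer/month (part-r-00000 is January).
hadoop fs -ls monthly_output
hadoop fs -cat monthly_output/part-r-00000 | head
```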