Tuesday, December 16, 2014

[CCDH] Exercise13 - Implementing a Custom WritableComparable (P46)

Preface
Files and Directories Used in this Exercise
Eclipse project: writables
Java files:
StringPairWritable - implements a WritableComparable type
StringPairMapper - Mapper for test job
StringPairTestDriver - Driver for test job

Data file:
~/training_materials/developer/data/nameyeartestdata (small set of data for the test job)

Exercise directory: ~/workspace/writables

In this exercise, you will create a custom WritableComparable type that holds two strings.

Test the new type by creating a simple program that reads a list of names (first and last) and counts the number of occurrences of each name. The mapper should accept lines in the form:
lastname firstname other data

The goal is to count the number of times each lastname/firstname pair occurs in the dataset. For example, for the input:
Smith Joe 1963-08-12 Poughkeepsie, NY
Smith Joe 1832-01-20 Sacramento, CA
Murphy Alice 2004-06-02 Berlin, MA
We want to output:
(Smith,Joe) 2
(Murphy,Alice) 1
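To make the target concrete before any Hadoop code appears, the count can be mimicked in plain Java. This is only a sketch: the class name NamePairCount is hypothetical, the key is a plain String rather than a custom type, and a TreeMap stands in for the sorting that MapReduce applies to keys.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the expected result; no Hadoop involved.
class NamePairCount {

    // Counts "(lastname,firstname)" keys. The TreeMap keeps keys sorted,
    // loosely imitating the sorted key order of a MapReduce job's output.
    static Map<String, Long> count(String[] lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            // Split on runs of non-word characters, at most three pieces.
            String[] words = line.split("\\W+", 3);
            if (words.length > 2) {
                counts.merge("(" + words[0] + "," + words[1] + ")", 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = {
            "Smith Joe 1963-08-12 Poughkeepsie, NY",
            "Smith Joe 1832-01-20 Sacramento, CA",
            "Murphy Alice 2004-06-02 Berlin, MA"
        };
        count(lines).forEach((k, v) -> System.out.println(k + " " + v));
    }
}
```

Because the keys come back sorted, (Murphy,Alice) is printed before (Smith,Joe) here, which is also the order to expect from the real job when a single reducer is used.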

Solution Code
You need to implement a WritableComparable object that holds the two strings. That means implementing the readFields, write, and compareTo methods required by WritableComparable, and generating hashCode and equals methods. Here we define the StringPairWritable class to hold the pair of strings:
- Custom WritableComparable
    package solution;

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    public class StringPairWritable implements WritableComparable<StringPairWritable> {

      String left;
      String right;

      /**
       * Empty constructor - required for serialization.
       */
      public StringPairWritable() {

      }

      /**
       * Constructor with two String objects provided as input.
       */
      public StringPairWritable(String left, String right) {
        this.left = left;
        this.right = right;
      }

      /**
       * Serializes the fields of this object to out.
       */
      @Override
      public void write(DataOutput out) throws IOException {
        out.writeUTF(left);
        out.writeUTF(right);
      }

      /**
       * Deserializes the fields of this object from in.
       */
      @Override
      public void readFields(DataInput in) throws IOException {
        left = in.readUTF();
        right = in.readUTF();
      }

      /**
       * Compares this object to another StringPairWritable object by
       * comparing the left strings first. If the left strings are equal,
       * then the right strings are compared.
       */
      @Override
      public int compareTo(StringPairWritable other) {
        int ret = left.compareTo(other.left);
        if (ret == 0) {
          return right.compareTo(other.right);
        }
        return ret;
      }

      /**
       * A custom method that returns the two strings in the
       * StringPairWritable object inside parentheses and separated by
       * a comma. For example: "(left,right)".
       */
      @Override
      public String toString() {
        return "(" + left + "," + right + ")";
      }

      /**
       * The equals method compares two StringPairWritable objects for
       * equality. The equals and hashCode methods have been automatically
       * generated by Eclipse by right-clicking on an empty line, selecting
       * Source, and then selecting the Generate hashCode() and equals()
       * option.
       */
      @Override
      public boolean equals(Object obj) {
        if (this == obj)
          return true;
        if (obj == null)
          return false;
        if (getClass() != obj.getClass())
          return false;
        StringPairWritable other = (StringPairWritable) obj;
        if (left == null) {
          if (other.left != null)
            return false;
        } else if (!left.equals(other.left))
          return false;
        if (right == null) {
          if (other.right != null)
            return false;
        } else if (!right.equals(other.right))
          return false;
        return true;
      }

      /**
       * The hashCode method generates a hash code for a StringPairWritable
       * object. The equals and hashCode methods have been automatically
       * generated by Eclipse by right-clicking on an empty line, selecting
       * Source, and then selecting the Generate hashCode() and equals()
       * option.
       */
      @Override
      public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((left == null) ? 0 : left.hashCode());
        result = prime * result + ((right == null) ? 0 : right.hashCode());
        return result;
      }
    }
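The write/readFields pair can be exercised without a cluster by serializing a key to a byte array and reading it back. The sketch below uses a hypothetical stand-in class, PairRoundTrip, that copies the write, readFields, and compareTo logic from the listing above but deliberately does not implement the Hadoop interface, so it compiles with the JDK alone. The roundTrip helper is also an illustration, not part of the exercise.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for StringPairWritable (no Hadoop dependency).
class PairRoundTrip implements Comparable<PairRoundTrip> {
    String left;
    String right;

    PairRoundTrip() { }

    PairRoundTrip(String left, String right) {
        this.left = left;
        this.right = right;
    }

    void write(DataOutput out) throws IOException {
        out.writeUTF(left);
        out.writeUTF(right);
    }

    void readFields(DataInput in) throws IOException {
        left = in.readUTF();
        right = in.readUTF();
    }

    @Override
    public int compareTo(PairRoundTrip other) {
        int ret = left.compareTo(other.left);
        return ret != 0 ? ret : right.compareTo(other.right);
    }

    // Serializes a pair to bytes and reads the bytes into a fresh instance.
    static PairRoundTrip roundTrip(PairRoundTrip original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            PairRoundTrip copy = new PairRoundTrip();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PairRoundTrip copy = roundTrip(new PairRoundTrip("Smith", "Joe"));
        System.out.println(copy.left + "," + copy.right);
        // Left strings differ, so the right strings are never consulted.
        System.out.println(
            new PairRoundTrip("Murphy", "Zed").compareTo(new PairRoundTrip("Smith", "Al")) < 0);
    }
}
```

Two properties of writeUTF are worth keeping in mind with this design: it throws a NullPointerException when given null, so both strings must be set before a key is written, and it rejects strings whose encoded form exceeds 65535 bytes.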
The mapper simply extracts the last name/first name pair from each line and emits it as the key, with a count of 1 as the value:
- Mapper
    package solution;

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class StringPairMapper extends
            Mapper<LongWritable, Text, StringPairWritable, LongWritable> {

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {

            LongWritable one = new LongWritable(1);
            /*
             * Split the line into words. Create a new StringPairWritable consisting
             * of the first two strings in the line.  Emit the pair as the key, and
             * '1' as the value (for later summing).
             */
            String[] words = value.toString().split("\\W+", 3);

            if (words.length > 2) {
                context.write(new StringPairWritable(words[0], words[1]), one);
            }
        }
    }
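The split call with a limit of 3, combined with the length check, decides which lines produce a key at all. A small JDK-only sketch (the class name SplitLimitDemo and the sample lines other than the first are made up for illustration) shows how it behaves:

```java
import java.util.Arrays;

class SplitLimitDemo {

    // Same call the mapper makes: split on runs of non-word characters,
    // producing at most three pieces (the third holds the rest of the line).
    static String[] split(String line) {
        return line.split("\\W+", 3);
    }

    public static void main(String[] args) {
        // Three pieces: last name, first name, and everything else.
        System.out.println(Arrays.toString(split("Smith Joe 1963-08-12 Poughkeepsie, NY")));
        // Only two pieces, so the mapper's "length > 2" check skips this line.
        System.out.println(split("Smith Joe").length);
        // An apostrophe is a non-word character and therefore splits the name.
        System.out.println(Arrays.toString(split("O'Brien Pat 1970-01-01")));
    }
}
```

One consequence of splitting on \W+ is that names containing apostrophes or hyphens are broken apart, so such a line would be keyed on the fragments rather than on the full last and first name.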
Hadoop ships with several ready-made reducers. Here we use LongSumReducer to sum the counts for each name. Finally, the driver class:
- Driver
    package solution;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class StringPairTestDriver extends Configured implements Tool {

      @Override
      public int run(String[] args) throws Exception {

        if (args.length != 2) {
          System.out.printf("Usage: " + this.getClass().getName() + " <input dir> <output dir>\n");
          return -1;
        }

        Job job = new Job(getConf());
        job.setJarByClass(StringPairTestDriver.class);
        job.setJobName("Custom Writable Comparable");

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        /*
         * LongSumReducer is a Hadoop API class that sums values into
         * a LongWritable.  It works with any key type and LongWritable
         * values, and therefore supports the new StringPairWritable as
         * a key type.
         */
        job.setReducerClass(LongSumReducer.class);

        job.setMapperClass(StringPairMapper.class);

        /*
         * Set the key output class for the job
         */
        job.setOutputKeyClass(StringPairWritable.class);

        /*
         * Set the value output class for the job
         */
        job.setOutputValueClass(LongWritable.class);

        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
      }

      public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new StringPairTestDriver(), args);
        System.exit(exitCode);
      }
    }
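The driver does not set a partitioner, so Hadoop's default HashPartitioner decides which reducer receives each key, and this is where the generated hashCode() matters. The sketch below reproduces that arithmetic with the JDK alone. PartitionSketch and its method names are hypothetical, pairHash copies the hashCode() body from the StringPairWritable listing, and the reducer count of 4 is an arbitrary assumption (the job above runs with the default of one reducer, where every key lands in partition 0).

```java
class PartitionSketch {

    // Copy of the hashCode() logic shown for StringPairWritable.
    static int pairHash(String left, String right) {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((left == null) ? 0 : left.hashCode());
        result = prime * result + ((right == null) ? 0 : right.hashCode());
        return result;
    }

    // Same arithmetic as the default HashPartitioner: clear the sign bit,
    // then take the remainder by the number of reduce tasks.
    static int partition(int hash, int numReduceTasks) {
        return (hash & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4; // assumed value, for illustration only
        String[][] names = { {"Smith", "Joe"}, {"Murphy", "Alice"}, {"Smith", "Joe"} };
        for (String[] n : names) {
            int p = partition(pairHash(n[0], n[1]), reducers);
            System.out.println("(" + n[0] + "," + n[1] + ") -> reducer " + p);
        }
    }
}
```

Equal pairs always hash to the same value and so reach the same reducer, which is what allows the counts for one name to be summed in a single place. A key type that did not override hashCode() consistently with equals() could scatter equal keys across reducers once more than one is configured.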
Lab Experiment
You can use the simple test data in ~/training_materials/developer/data/nameyeartestdata to make sure your new type works as expected.
1. Build project and execute MapReduce job
$ ant -f build.xml # Build project and output writables.jar
$ rm -rf output # Clean previous result
$ hadoop jar writables.jar solution.StringPairTestDriver -fs=file:/// -jt=local ~/training_materials/developer/data/nameyeartestdata output
# Run LocalJobRunner and output result to output folder

2. Check output result
$ cat output/*
(Addams,Gomez) 1
(Addams,Jane) 1
(Addams,Morticia) 1
...
(Smith,John) 3
(Turing,Alan) 1
(Wamsley,Jayme) 1
(Webre,Josh) 1
(Weston,Clark) 1
(Woodburn,Louis) 1
(Woodburn,Providencia) 1

Supplement
[ Java Essence ] That thing in memory: how to reference it (object equality)

