程式扎記

Preface
Files and Directories Used in this Exercise

Eclipse project: writables
Java files:
StringPairWritable - implements a WritableComparable type
StringPairMapper - Mapper for test job
StringPairTestDriver - Driver for test job

Data file:
~/training_materials/developer/data/nameyeartestdata (small set of data for the test job)

Exercise directory: ~/workspace/writables

In this exercise, you will create a custom WritableComparable type that holds two strings.

Test the new type by creating a simple program that reads a list of names (first and last) and counts the number of occurrences of each name. The mapper should accept lines in the form:

lastname firstname other data

The goal is to count the number of times each lastname/firstname pair occurs within the dataset. For example, for the input:

Smith Joe 1963-08-12 Poughkeepsie, NY  
Smith Joe 1832-01-20 Sacramento, CA  
Murphy Alice 2004-06-02 Berlin, MA  

We want to output:

(Smith,Joe) 2
(Murphy,Alice) 1

Solution Code
You need to implement a WritableComparable object that holds the two strings. Then implement the readFields, write, and compareTo methods required by WritableComparable, and generate hashCode and equals methods so that equal keys are treated as equal (the default HashPartitioner uses hashCode to assign keys to reducers). Here we define the StringPairWritable class to hold the pair of strings:
- Custom WritableComparable

package solution;  
  
import java.io.DataInput;  
import java.io.DataOutput;  
import java.io.IOException;  
  
import org.apache.hadoop.io.WritableComparable;  
  
public class StringPairWritable implements WritableComparable<StringPairWritable> {  
  
  String left;  
  String right;  
  
  /** 
   * Empty constructor - required for serialization. 
   */   
  public StringPairWritable() {  
  
  }  
  
  /** 
   * Constructor with two String objects provided as input. 
   */   
  public StringPairWritable(String left, String right) {  
    this.left = left;  
    this.right = right;  
  }  
  
  /** 
   * Serializes the fields of this object to out. 
   */  
  public void write(DataOutput out) throws IOException {  
    out.writeUTF(left);  
    out.writeUTF(right);  
  }  
  
  /** 
   * Deserializes the fields of this object from in. 
   */  
  public void readFields(DataInput in) throws IOException {  
    left = in.readUTF();  
    right = in.readUTF();  
  }  
  
  /** 
   * Compares this object to another StringPairWritable object by 
   * comparing the left strings first. If the left strings are equal, 
   * then the right strings are compared. 
   */  
  public int compareTo(StringPairWritable other) {  
    int ret = left.compareTo(other.left);  
    if (ret == 0) {  
      return right.compareTo(other.right);  
    }  
    return ret;  
  }  
  
  /** 
   * A custom method that returns the two strings in the  
   * StringPairWritable object inside parentheses and separated by 
   * a comma. For example: "(left,right)". 
   */  
  public String toString() {  
    return "(" + left + "," + right + ")";  
  }  
  
  /** 
   * The equals method compares two StringPairWritable objects for  
   * equality. The equals and hashCode methods have been automatically 
   * generated by Eclipse by right-clicking on an empty line, selecting 
   * Source, and then selecting the Generate hashCode() and equals() 
   * option.  
   */  
  @Override  
  public boolean equals(Object obj) {  
    if (this == obj)  
      return true;  
    if (obj == null)  
      return false;  
    if (getClass() != obj.getClass())  
      return false;  
    StringPairWritable other = (StringPairWritable) obj;  
    if (left == null) {  
      if (other.left != null)  
        return false;  
    } else if (!left.equals(other.left))  
      return false;  
    if (right == null) {  
      if (other.right != null)  
        return false;  
    } else if (!right.equals(other.right))  
      return false;  
    return true;  
  }  
  
  /** 
   * The hashCode method generates a hash code for a StringPairWritable  
   * object. The equals and hashCode methods have been automatically 
   * generated by Eclipse by right-clicking on an empty line, selecting 
   * Source, and then selecting the Generate hashCode() and equals() 
   * option.  
   */  
  @Override  
  public int hashCode() {  
    final int prime = 31;  
    int result = 1;  
    result = prime * result + ((left == null) ? 0 : left.hashCode());  
    result = prime * result + ((right == null) ? 0 : right.hashCode());  
    return result;  
  }  
}  
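
To see how write and readFields fit together without a cluster, the following is a minimal sketch of a serialization round trip. It assumes a hypothetical stand-in class, Pair, that mirrors the fields and methods of StringPairWritable but has no Hadoop dependency; PairRoundTrip and roundTrip are likewise names invented for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class PairRoundTrip {

  /** Hypothetical stand-in for StringPairWritable with no Hadoop dependency. */
  static class Pair implements Comparable<Pair> {
    String left;
    String right;

    Pair() { }

    Pair(String left, String right) {
      this.left = left;
      this.right = right;
    }

    // Same field order as StringPairWritable.write: left, then right.
    void write(DataOutput out) throws IOException {
      out.writeUTF(left);
      out.writeUTF(right);
    }

    // Fields must be read back in exactly the order they were written.
    void readFields(DataInput in) throws IOException {
      left = in.readUTF();
      right = in.readUTF();
    }

    @Override
    public int compareTo(Pair other) {
      int ret = left.compareTo(other.left);
      return (ret != 0) ? ret : right.compareTo(other.right);
    }

    @Override
    public String toString() {
      return "(" + left + "," + right + ")";
    }
  }

  /** Serializes a pair to a byte array, then fills an empty object from those bytes. */
  static Pair roundTrip(Pair original) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      original.write(new DataOutputStream(bytes));
      Pair copy = new Pair(); // starts empty, which is why a no-arg constructor is needed
      copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return copy;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Pair copy = roundTrip(new Pair("Smith", "Joe"));
    System.out.println(copy);                                          // (Smith,Joe)
    System.out.println(copy.compareTo(new Pair("Smith", "Joe")));      // 0
    System.out.println(copy.compareTo(new Pair("Smith", "John")) < 0); // true
  }
}
```

Note that writeUTF throws a NullPointerException when given null, so a pair with an unset field cannot be serialized this way; the same applies to the write method in the listing above.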

The mapper simply extracts the last name/first name pair from each line and uses it as the key, so that the occurrences of each name can be counted:
- Mapper

package solution;  
  
import java.io.IOException;  
  
import org.apache.hadoop.io.LongWritable;  
import org.apache.hadoop.io.Text;  
import org.apache.hadoop.mapreduce.Mapper;  
  
public class StringPairMapper extends  
        Mapper<LongWritable, Text, StringPairWritable, LongWritable> {  
  
    @Override  
    public void map(LongWritable key, Text value, Context context)  
            throws IOException, InterruptedException {  
  
        LongWritable one = new LongWritable(1);  
        /* 
         * Split the line into words. Create a new StringPairWritable consisting 
         * of the first two strings in the line.  Emit the pair as the key, and 
         * '1' as the value (for later summing). 
         */  
        String[] words = value.toString().split("\\W+", 3);  
  
        if (words.length > 2) {  
            context.write(new StringPairWritable(words[0], words[1]), one);  
        }  
    }  
}  
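
The behavior of split("\\W+", 3) in the mapper can be checked in isolation. The sketch below is plain Java with no Hadoop dependency; SplitSketch and keyFor are hypothetical names used only for illustration, and keyFor returns the key as a string rather than as a StringPairWritable.

```java
import java.util.Arrays;

public class SplitSketch {

  /**
   * Returns the "(last,first)" key the mapper would emit for this line,
   * or null when the line has fewer than three fields and would be skipped.
   */
  static String keyFor(String line) {
    // A limit of 3 splits at most twice, so the rest of the line stays in words[2].
    String[] words = line.split("\\W+", 3);
    if (words.length > 2) {
      return "(" + words[0] + "," + words[1] + ")";
    }
    return null;
  }

  public static void main(String[] args) {
    String line = "Smith Joe 1963-08-12 Poughkeepsie, NY";
    System.out.println(Arrays.toString(line.split("\\W+", 3)));
    // [Smith, Joe, 1963-08-12 Poughkeepsie, NY]
    System.out.println(keyFor(line));        // (Smith,Joe)
    System.out.println(keyFor("Smith Joe")); // null
  }
}
```

Because \W matches any non-word character, a name containing an apostrophe or hyphen is split as well: a line starting with "O'Brien Pat" would yield the key (O,Brien). Splitting on whitespace ("\\s+") may be a better fit if such names appear in the data.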

Hadoop ships with several ready-made reducers. Here we use LongSumReducer to sum the counts for each name. Finally, the driver class:
- Driver

package solution;  
  
import org.apache.hadoop.fs.Path;  
import org.apache.hadoop.io.LongWritable;  
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;  
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;  
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;  
import org.apache.hadoop.mapreduce.Job;  
import org.apache.hadoop.conf.Configured;  
import org.apache.hadoop.conf.Configuration;  
import org.apache.hadoop.util.Tool;  
import org.apache.hadoop.util.ToolRunner;  
  
public class StringPairTestDriver extends Configured implements Tool {  
  
  @Override  
  public int run(String[] args) throws Exception {  
  
    if (args.length != 2) {  
      System.out.printf("Usage: " + this.getClass().getName() + " <input dir> <output dir>\n");  
      return -1;  
    }  
  
    Job job = new Job(getConf());  
    job.setJarByClass(StringPairTestDriver.class);  
    job.setJobName("Custom Writable Comparable");  
  
    FileInputFormat.setInputPaths(job, new Path(args[0]));  
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  
  
    /* 
     * LongSumReducer is a Hadoop API class that sums LongWritable values 
     * into a LongWritable. It works with any key type, and therefore 
     * supports the new StringPairWritable as a key type. 
     */  
    job.setReducerClass(LongSumReducer.class);  
  
    job.setMapperClass(StringPairMapper.class);  
      
    /* 
     * Set the key output class for the job 
     */     
    job.setOutputKeyClass(StringPairWritable.class);  
      
    /* 
     * Set the value output class for the job 
     */  
    job.setOutputValueClass(LongWritable.class);  
  
    boolean success = job.waitForCompletion(true);  
    return success ? 0 : 1;  
  }  
  
  public static void main(String[] args) throws Exception {  
    int exitCode = ToolRunner.run(new Configuration(), new StringPairTestDriver(), args);  
    System.exit(exitCode);  
  }  
}  
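
What the job computes can be approximated locally in plain Java. The sketch below is not how Hadoop executes the job; it only imitates the result, using a TreeMap to stand in for the sort-by-key step and a running sum to stand in for LongSumReducer. LocalCountSketch, Pair, and countPairs are hypothetical names for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalCountSketch {

  /** Hypothetical stand-in key: ordered by left, then right, like StringPairWritable. */
  static class Pair implements Comparable<Pair> {
    final String left;
    final String right;

    Pair(String left, String right) {
      this.left = left;
      this.right = right;
    }

    @Override
    public int compareTo(Pair other) {
      int ret = left.compareTo(other.left);
      return (ret != 0) ? ret : right.compareTo(other.right);
    }

    @Override
    public String toString() {
      return "(" + left + "," + right + ")";
    }
  }

  /** Emits (pair, 1) per valid line and sums the ones per key, keeping keys sorted. */
  static Map<Pair, Long> countPairs(List<String> lines) {
    Map<Pair, Long> counts = new TreeMap<>(); // TreeMap orders keys via compareTo
    for (String line : lines) {
      String[] words = line.split("\\W+", 3);
      if (words.length > 2) {
        counts.merge(new Pair(words[0], words[1]), 1L, Long::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList(
        "Smith Joe 1963-08-12 Poughkeepsie, NY",
        "Smith Joe 1832-01-20 Sacramento, CA",
        "Murphy Alice 2004-06-02 Berlin, MA");
    for (Map.Entry<Pair, Long> e : countPairs(lines).entrySet()) {
      System.out.println(e.getKey() + "\t" + e.getValue());
    }
    // (Murphy,Alice)	1
    // (Smith,Joe)	2
  }
}
```

The keys come out in compareTo order, which is why (Murphy,Alice) precedes (Smith,Joe) here.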

Lab Experiment
You can use the simple test data in ~/training_materials/developer/data/nameyeartestdata to make sure your new type works as expected.
1. Build project and execute MapReduce job

$ ant -f build.xml # Build project and output writables.jar
$ rm -rf output # Clean previous result
$ hadoop jar writables.jar solution.StringPairTestDriver -fs=file:/// -jt=local ~/training_materials/developer/data/nameyeartestdata output
# Run LocalJobRunner and output result to output folder

2. Check output result

$ cat output/*
(Addams,Gomez) 1
(Addams,Jane) 1
(Addams,Morticia) 1
...
(Smith,John) 3
(Turing,Alan) 1
(Wamsley,Jayme) 1
(Webre,Josh) 1
(Weston,Clark) 1
(Woodburn,Louis) 1
(Woodburn,Providencia) 1

Supplement
* [ Java Essence ] 記憶中的那個東西 : 要怎麼參考呢 (物件相等性) (in Chinese; roughly "That thing in memory: how to reference it (object equality)")

Tuesday, December 16, 2014

[CCDH] Exercise13 - Implementing a Custom WritableComparable (P46)
