程式扎記: [CCDH] Exercise14 - Using SequenceFiles and File Compression (P49)

Preface
Files and Directories Used in this Exercise

Eclipse project: createsequencefile
Java files:
CreateSequenceFile.java (A driver that converts a text file to a sequence file)
ReadCompressedSequenceFile.java (A driver that converts a compressed sequence file to text)

Test data (HDFS):
weblog (full web server access log)

Exercise directory: ~/workspace/createsequencefile

In this exercise you will practice reading and writing uncompressed and compressed SequenceFiles.

First, you will develop a MapReduce application to convert text data to a SequenceFile. Then you will modify the application to compress the SequenceFile using Snappy file compression. When creating the SequenceFile, use the full access log file for input data.

After you have created the compressed SequenceFile, you will write a second MapReduce application to read the compressed SequenceFile and write a text file that contains the original log file text.

Lab Experiment
Write a MapReduce program to create sequence files from text files
1. Determine the number of HDFS blocks occupied by the access log file:

a. In a browser window, start the Name Node Web UI - http://localhost:50070
b. Click "Browse the filesystem"
c. Navigate to the /user/training/weblog/access_log file.
d. Scroll down to the bottom of the page. The total number of blocks occupied by the access log file appears in the browser window.
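
The block count shown in the Web UI can be sanity-checked with a little arithmetic. The sketch below is only an illustration: the 64 MB block size and the 500 MB file size are assumptions, not values taken from the exercise environment, so substitute the numbers your cluster reports. With the default input format a map-only job usually starts one mapper per block, which is why this number matters in the later steps.

```java
class BlockCount {

  // Number of HDFS blocks needed for a file: ceiling of fileSize / blockSize.
  static long blocksFor(long fileSizeBytes, long blockSizeBytes) {
    if (fileSizeBytes < 0 || blockSizeBytes <= 0) {
      throw new IllegalArgumentException("sizes must be non-negative and block size positive");
    }
    return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
  }

  public static void main(String[] args) {
    long blockSize = 64L * 1024 * 1024;  // assumed block size, check your cluster
    long fileSize = 500L * 1024 * 1024;  // hypothetical size of the access log
    System.out.println("blocks: " + blocksFor(fileSize, blockSize));
  }
}
```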

2. Refer to the solution in the createsequencefile project to read the access log file and create a SequenceFile. Records emitted to the SequenceFile can have any key you like, but the values should match the text in the access log file.

package solution;  
  
import org.apache.hadoop.fs.Path;  
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;  
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;  
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;  
import org.apache.hadoop.mapreduce.Job;  
  
import org.apache.hadoop.conf.Configured;  
import org.apache.hadoop.conf.Configuration;  
import org.apache.hadoop.util.Tool;  
import org.apache.hadoop.util.ToolRunner;  
  
public class CreateUncompressedSequenceFile extends Configured implements Tool {  
  
  @Override  
  public int run(String[] args) throws Exception {  
  
    if (args.length != 2) {  
      System.out.printf("Usage: CreateUncompressedSequenceFile <input dir> <output dir>\n");
      return -1;  
    }  
  
    Job job = new Job(getConf());  
    job.setJarByClass(CreateUncompressedSequenceFile.class);  
    job.setJobName("Create Uncompressed Sequence File");  
  
    FileInputFormat.setInputPaths(job, new Path(args[0]));  
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  
  
    /* 
     * There is no need to call setInputFormatClass, because the input 
     * file is a text file. However, the output file is a SequenceFile. 
     * Therefore, we must call setOutputFormatClass. 
     */  
    job.setOutputFormatClass(SequenceFileOutputFormat.class);  
  
    /* 
     * This is a map-only job that uses the default (identity mapper), so we do not need to set 
     * the mapper or reducer classes.  We just need to set the number of reducers to 0. 
     */  
    job.setNumReduceTasks(0);  
  
    boolean success = job.waitForCompletion(true);  
    return success ? 0 : 1;  
  }  
  
  public static void main(String[] args) throws Exception {  
    int exitCode = ToolRunner.run(new Configuration(), new CreateUncompressedSequenceFile(), args);  
    System.exit(exitCode);  
  }  
}  

3. Build and test your solution so far. Use the access log as input data, and specify the uncompressedsf directory for output.

$ ant -f build.xml # Build the project; produces createsequencefile.jar
$ hadoop jar createsequencefile.jar solution.CreateUncompressedSequenceFile weblog uncompressedsf
$ hadoop fs -ls uncompressedsf # Expect 8 part files, one per mapper
...
-rw-r--r-- 1 training supergroup 77517687 2014-12-21 02:58 uncompressedsf/part-m-00000
-rw-r--r-- 1 training supergroup 77517464 2014-12-21 02:58 uncompressedsf/part-m-00001
-rw-r--r-- 1 training supergroup 77448148 2014-12-21 02:59 uncompressedsf/part-m-00002
-rw-r--r-- 1 training supergroup 77286206 2014-12-21 02:59 uncompressedsf/part-m-00003
-rw-r--r-- 1 training supergroup 77366617 2014-12-21 03:00 uncompressedsf/part-m-00004
-rw-r--r-- 1 training supergroup 77465310 2014-12-21 03:00 uncompressedsf/part-m-00005
-rw-r--r-- 1 training supergroup 77424243 2014-12-21 03:01 uncompressedsf/part-m-00006
-rw-r--r-- 1 training supergroup 40614390 2014-12-21 03:01 uncompressedsf/part-m-00007

4. Examine the initial portion of the output SequenceFile using the following command:

$ hadoop fs -cat uncompressedsf/part-m-00000 | less

Some of the data in the SequenceFile is unreadable, but parts of it should be recognizable:

* The string SEQ, which appears at the beginning of a SequenceFile.
* The Java classes for the keys and values
* Text from the access log file.
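
The readable pieces listed above come from the SequenceFile header. As a rough sketch, the header prefix can be decoded by hand. This assumes the current file layout (the bytes SEQ, a version byte of 6, the key and value class names, then two compression flags and an optional codec class name) and assumes every class name is shorter than 128 bytes so its length prefix fits in one byte. The class `SequenceFileHeaderPeek` and the helper `sampleHeader` are hypothetical names used only for this illustration; they are not part of Hadoop.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class SequenceFileHeaderPeek {

  /** Header fields that precede the first record. */
  static final class Header {
    final int version;
    final String keyClass;
    final String valueClass;
    final boolean compressed;
    final boolean blockCompressed;
    final String codecClass; // null when the file is not compressed

    Header(int version, String keyClass, String valueClass,
           boolean compressed, boolean blockCompressed, String codecClass) {
      this.version = version;
      this.keyClass = keyClass;
      this.valueClass = valueClass;
      this.compressed = compressed;
      this.blockCompressed = blockCompressed;
      this.codecClass = codecClass;
    }
  }

  // Reads a string whose length prefix is a single byte (assumption: name < 128 bytes).
  private static String readShortString(ByteBuffer in) {
    int len = in.get();
    if (len < 0) {
      throw new IllegalArgumentException("multi-byte length prefix is not handled in this sketch");
    }
    byte[] buf = new byte[len];
    in.get(buf);
    return new String(buf, StandardCharsets.UTF_8);
  }

  private static void putShortString(ByteBuffer out, String s) {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    out.put((byte) bytes.length).put(bytes);
  }

  // Decodes the header prefix from the first bytes of a part file.
  static Header parse(byte[] firstBytes) {
    ByteBuffer in = ByteBuffer.wrap(firstBytes);
    if (in.get() != 'S' || in.get() != 'E' || in.get() != 'Q') {
      throw new IllegalArgumentException("missing SEQ magic");
    }
    int version = in.get();
    String keyClass = readShortString(in);
    String valueClass = readShortString(in);
    boolean compressed = in.get() != 0;
    boolean blockCompressed = in.get() != 0;
    String codecClass = compressed ? readShortString(in) : null;
    return new Header(version, keyClass, valueClass, compressed, blockCompressed, codecClass);
  }

  // Builds demo bytes in the assumed layout; a codec here implies block compression.
  static byte[] sampleHeader(String keyClass, String valueClass, String codecClass) {
    ByteBuffer out = ByteBuffer.allocate(512);
    out.put((byte) 'S').put((byte) 'E').put((byte) 'Q').put((byte) 6);
    putShortString(out, keyClass);
    putShortString(out, valueClass);
    byte flag = (byte) (codecClass != null ? 1 : 0);
    out.put(flag).put(flag);
    if (codecClass != null) {
      putShortString(out, codecClass);
    }
    byte[] bytes = new byte[out.position()];
    out.flip();
    out.get(bytes);
    return bytes;
  }

  public static void main(String[] args) {
    Header h = parse(sampleHeader(
        "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text", null));
    System.out.println(h.keyClass + " / " + h.valueClass + " compressed=" + h.compressed);
  }
}
```

To try it on real data, you could save the first few hundred bytes of a part file locally and pass them to `parse`. The class names used in `main` are what the default text input format would be expected to produce, so treat them as an assumption until you see them in your own output.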

5. Verify that the number of files created by the job is equivalent to the number of blocks required to store the uncompressed SequenceFile.

Compress The Output
6. Modify the MapReduce job to compress the output SequenceFile. Add statements to your driver to configure the output as follows:

* Compress the output file
* Use block compression.
* Use the Snappy compression codec.

package solution;  
  
import org.apache.hadoop.fs.Path;  
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;  
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;  
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;  
import org.apache.hadoop.io.SequenceFile.CompressionType;  
import org.apache.hadoop.io.compress.SnappyCodec;  
import org.apache.hadoop.mapreduce.Job;  
  
import org.apache.hadoop.conf.Configured;  
import org.apache.hadoop.conf.Configuration;  
import org.apache.hadoop.util.Tool;  
import org.apache.hadoop.util.ToolRunner;  
  
public class CreateCompressedSequenceFile extends Configured implements Tool {  
  
  @Override  
  public int run(String[] args) throws Exception {  
  
    if (args.length != 2) {  
      System.out.printf("Usage: CreateCompressedSequenceFile <input dir> <output dir>\n");
      return -1;  
    }  
  
    Job job = new Job(getConf());  
    job.setJarByClass(CreateCompressedSequenceFile.class);  
    job.setJobName("Create Compressed Sequence File");  
  
    FileInputFormat.setInputPaths(job, new Path(args[0]));  
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  
  
    /* 
     * There is no need to call setInputFormatClass, because the input 
     * file is a text file. However, the output file is a SequenceFile. 
     * Therefore, we must call setOutputFormatClass. 
     */  
    job.setOutputFormatClass(SequenceFileOutputFormat.class);  
  
    /* 
     * Set the compression options. 
     */  
      
    /* 
     * Compress the output 
     */  
    FileOutputFormat.setCompressOutput(job, true);  
      
    /* 
     * Use Snappy compression 
     */  
    FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);  
    /* 
     * Use block compression 
     */  
    SequenceFileOutputFormat.setOutputCompressionType(job,  
        CompressionType.BLOCK);  
  
    /* 
     * This is a map-only job that uses the default (identity mapper), so we do not need to set 
     * the mapper or reducer classes.  We just need to set the number of reducers to 0. 
     */  
    job.setNumReduceTasks(0);  
  
    boolean success = job.waitForCompletion(true);  
    return success ? 0 : 1;  
  }  
  
  public static void main(String[] args) throws Exception {  
    int exitCode = ToolRunner.run(new Configuration(), new CreateCompressedSequenceFile(), args);  
    System.exit(exitCode);  
  }  
}  
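
To get a feel for why step 6 asks for block compression rather than per-record compression, here is a small stand-alone comparison. It is only a sketch: it uses the JDK's Deflate (`java.util.zip.Deflater`) in place of Snappy, because Snappy is not part of the JDK, and the log lines are made up. The point is the general effect. Compressing many similar records together lets the compressor exploit repetition across records, which compressing each record alone cannot do.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

class RecordVsBlockCompression {

  // Returns the number of bytes Deflate produces for the given input.
  static int deflatedSize(byte[] input) {
    Deflater deflater = new Deflater();
    deflater.setInput(input);
    deflater.finish();
    byte[] buffer = new byte[4096];
    int total = 0;
    while (!deflater.finished()) {
      total += deflater.deflate(buffer);
    }
    deflater.end();
    return total;
  }

  // RECORD style: each value is compressed on its own.
  static int recordStyleSize(List<String> records) {
    int total = 0;
    for (String record : records) {
      total += deflatedSize(record.getBytes(StandardCharsets.UTF_8));
    }
    return total;
  }

  // BLOCK style: many values are buffered and compressed together.
  static int blockStyleSize(List<String> records) {
    StringBuilder all = new StringBuilder();
    for (String record : records) {
      all.append(record).append('\n');
    }
    return deflatedSize(all.toString().getBytes(StandardCharsets.UTF_8));
  }

  // Made-up access log lines, for illustration only.
  static List<String> sampleRecords(int n) {
    List<String> records = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      records.add("10.0.0." + (i % 250) + " - - [21/Dec/2014:02:58:00 -0800] \"GET /page"
          + i + ".html HTTP/1.1\" 200 " + (1000 + i));
    }
    return records;
  }

  public static void main(String[] args) {
    List<String> records = sampleRecords(1000);
    System.out.println("record-style bytes: " + recordStyleSize(records));
    System.out.println("block-style bytes:  " + blockStyleSize(records));
  }
}
```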

7. Compile the code and run your modified MapReduce job. For the MapReduce output, specify the compressdsf directory.

$ hadoop jar createsequencefile.jar solution.CreateCompressedSequenceFile weblog compressdsf
$ hadoop fs -ls compressdsf
...
-rw-r--r-- 1 training supergroup 16820906 2014-12-21 05:44 compressdsf/part-m-00000
...

8. Examine the first portion of the output SequenceFile. Notice the differences between the uncompressed and compressed SequenceFiles:

* The compressed SequenceFile specifies the org.apache.hadoop.io.compress.SnappyCodec compression codec in its header.
* You cannot read the log file text in the compressed file.

9. Compare the file sizes of the uncompressed and compressed SequenceFiles in the uncompressedsf and compressdsf directories. The compressed SequenceFiles should be smaller.
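
As a worked example of that comparison, the sketch below computes the ratio for the two part-m-00000 sizes shown in the listings above (77517687 bytes uncompressed, 16820906 bytes compressed). It assumes both files were produced from the same input split, which should hold when both jobs read the same weblog input.

```java
class CompressionRatio {

  // How many times larger the uncompressed file is.
  static double ratio(long uncompressedBytes, long compressedBytes) {
    if (uncompressedBytes <= 0 || compressedBytes <= 0) {
      throw new IllegalArgumentException("sizes must be positive");
    }
    return (double) uncompressedBytes / compressedBytes;
  }

  // Percentage of space saved by compression.
  static double savingsPercent(long uncompressedBytes, long compressedBytes) {
    if (uncompressedBytes <= 0 || compressedBytes <= 0) {
      throw new IllegalArgumentException("sizes must be positive");
    }
    return 100.0 * (1.0 - (double) compressedBytes / uncompressedBytes);
  }

  public static void main(String[] args) {
    long uncompressed = 77517687L; // part-m-00000 from the uncompressed listing
    long compressed = 16820906L;   // part-m-00000 from the compressed listing
    System.out.printf("ratio: %.2f, saved: %.1f%%%n",
        ratio(uncompressed, compressed), savingsPercent(uncompressed, compressed));
  }
}
```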

Write Another MapReduce Program To Uncompress The Files
10. Write a MapReduce program to read the compressed SequenceFile and write a text file. This text file should have the same text data as the log file, plus keys. The keys can contain any values you like.

package solution;  
  
import org.apache.hadoop.fs.Path;  
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;  
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;  
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;  
import org.apache.hadoop.mapreduce.Job;  
  
import org.apache.hadoop.conf.Configured;  
import org.apache.hadoop.conf.Configuration;  
import org.apache.hadoop.util.Tool;  
import org.apache.hadoop.util.ToolRunner;  
  
public class ReadCompressedSequenceFile extends Configured implements Tool {  
  
  @Override  
  public int run(String[] args) throws Exception {  
  
    if (args.length != 2) {  
      System.out
          .printf("Usage: ReadCompressedSequenceFile <input dir> <output dir>\n");
      return -1;  
    }  
  
    Job job = new Job(getConf());  
    job.setJarByClass(ReadCompressedSequenceFile.class);  
    job.setJobName("Read Compressed Sequence File");  
  
    FileInputFormat.setInputPaths(job, new Path(args[0]));  
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  
  
    /* 
     * We are using a SequenceFile as the input file. 
     * Therefore, we must call setInputFormatClass. 
     * There is no need to call setOutputFormatClass, because the 
     * application uses a text file on output. 
     */  
    job.setInputFormatClass(SequenceFileInputFormat.class);  
  
    /* 
     * There is no need to set compression options for the input file. 
     * The compression implementation details are encoded within the 
     * input SequenceFile.     
     */  
  
    /* 
     * This is a map-only job that uses the default (identity mapper), so we do not need to set 
     * the mapper or reducer classes.  We just need to set the number of reducers to 0. 
     */  
    job.setNumReduceTasks(0);  
  
    boolean success = job.waitForCompletion(true);  
    return success ? 0 : 1;  
  }  
  
  public static void main(String[] args) throws Exception {  
    int exitCode = ToolRunner.run(new Configuration(), new ReadCompressedSequenceFile(), args);  
    System.exit(exitCode);  
  }  
}  

11. Compile the code and run your MapReduce job. For the MapReduce input, specify the compressdsf directory in which you created the compressed SequenceFile in the previous section. For the MapReduce output, specify the compresseddsftotext directory:

$ hadoop jar createsequencefile.jar solution.ReadCompressedSequenceFile compressdsf compresseddsftotext

12. Examine the first portion of the output in the compresseddsftotext directory. You should be able to read the textual log file entries.

$ hadoop fs -cat compresseddsftotext/part-m-00000 | less
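
Each output line should hold the key, a tab, and then the original log line, since the default text output format separates key and value with a tab. If you want to compare the result with the original log, a helper like the following could strip the key again. This is a minimal sketch; the class name `StripKey` and the sample line are hypothetical.

```java
class StripKey {

  // Returns the text after the first tab, or the whole line if it has no tab.
  static String valueOf(String outputLine) {
    int tab = outputLine.indexOf('\t');
    return tab < 0 ? outputLine : outputLine.substring(tab + 1);
  }

  public static void main(String[] args) {
    // Hypothetical output line: key 0 (a byte offset), a tab, then the log entry.
    String line = "0\t10.0.0.1 - - [21/Dec/2014:02:58:00 -0800] \"GET / HTTP/1.1\" 200 512";
    System.out.println(valueOf(line));
  }
}
```

Only the first tab is treated as the separator, so a log entry that itself contains tabs is returned unchanged.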

Optional: Use Command Line Options To Control Compression
13. If you used ToolRunner for your driver, you can control compression using command line arguments. Try commenting out the code in your driver where you call setCompressOutput. Then test setting the mapred.output.compress option on the command line, e.g.:

$ hadoop jar createsequencefile.jar solution.CreateUncompressedSequenceFile \
-Dmapred.output.compress=true \
weblog outdir

14. Review the output to confirm the files are compressed.
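
The three driver calls from the compressed solution each have a configuration property behind them, so the same settings could in principle all be passed as -D options. The sketch below just assembles such an argument list. It assumes the classic property names mapred.output.compress, mapred.output.compression.codec and mapred.output.compression.type; newer Hadoop releases use longer names for the same settings, so check which ones your version accepts. The class `CompressionArgs` is a hypothetical helper, not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CompressionArgs {

  // Properties assumed to mirror setCompressOutput, setOutputCompressorClass
  // and setOutputCompressionType (classic property names).
  static Map<String, String> blockSnappyOptions() {
    Map<String, String> options = new LinkedHashMap<>();
    options.put("mapred.output.compress", "true");
    options.put("mapred.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec");
    options.put("mapred.output.compression.type", "BLOCK");
    return options;
  }

  // Turns each property into a -Dname=value generic option, keeping insertion order.
  static List<String> toGenericOptions(Map<String, String> options) {
    List<String> args = new ArrayList<>();
    for (Map.Entry<String, String> e : options.entrySet()) {
      args.add("-D" + e.getKey() + "=" + e.getValue());
    }
    return args;
  }

  public static void main(String[] args) {
    List<String> command = new ArrayList<>();
    command.add("hadoop jar createsequencefile.jar solution.CreateUncompressedSequenceFile");
    command.addAll(toGenericOptions(blockSnappyOptions()));
    command.add("weblog outdir");
    System.out.println(String.join(" ", command));
  }
}
```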

Sunday, December 21, 2014