Files and Directories Used in this Exercise
In this exercise you will practice reading and writing uncompressed and compressed SequenceFiles.
First, you will develop a MapReduce application to convert text data to a SequenceFile. Then you will modify the application to compress the SequenceFile using Snappy file compression. When creating the SequenceFile, use the full access log file for input data.
After you have created the compressed SequenceFile, you will write a second MapReduce application to read the compressed SequenceFile and write a text file that contains the original log file text.
Write A MapReduce Program To Create Sequence Files From Text Files
1. Determine the number of HDFS blocks occupied by the access log file:
2. Refer to the solution in the createsequencefile project to read the access log file and create a SequenceFile. Records emitted to the SequenceFile can have any key you like, but the values should match the text in the access log file.
3. Build and test your solution so far. Use the access log as input data, and specify the uncompressdsf directory for output.
4. Examine the initial portion of the output SequenceFile using the following command:
Some of the data in the SequenceFile is unreadable, but parts of it should be recognizable:
5. Verify that the number of files created by the job equals the number of blocks required to store the uncompressed SequenceFile.
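The block counts in steps 1 and 5 can be sanity-checked with a little arithmetic. The sketch below is an illustration only, not part of the exercise solution. The class name BlockCount, the 128 MB block size, and the 500 MB file size are all assumptions chosen for the example. The real block size comes from your cluster's HDFS configuration, and the real file size and block count can be read from the output of hdfs fsck <path> -files -blocks (hadoop fsck on older releases).

```java
// Illustrative only: estimates how many HDFS blocks a file of a given size occupies.
class BlockCount {

    // Ceiling division: a partially filled final block still counts as one block.
    static long blocksFor(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes < 0 || blockSizeBytes <= 0) {
            throw new IllegalArgumentException(
                "file size must be non-negative and block size must be positive");
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // assumed block size (hypothetical value)
        long fileSize = 500L * 1024 * 1024;  // assumed file size (hypothetical value)
        System.out.println(blocksFor(fileSize, blockSize) + " blocks");
    }
}
```

With default settings a map-only job usually runs one map task per input block and writes one output file per map task, so the number of part files in the output directory should be close to the figure this calculation gives.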
Compress The Output
6. Modify the MapReduce job to compress the output SequenceFile. Add statements to your driver to configure the output as follows:
8. Examine the first portion of the output SequenceFile. Notice the differences between the uncompressed and compressed SequenceFiles:
9. Compare the file sizes of the uncompressed and compressed SequenceFiles in the uncompressdsf and compressdsf directories. The compressed SequenceFiles should be smaller.
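The driver statements that step 6 refers to are not reproduced above. As a rough sketch only, assuming the driver uses the newer org.apache.hadoop.mapreduce API and holds its job in a Job object, compressed SequenceFile output with Snappy could be configured along these lines. The names CompressedOutputSketch and configure are hypothetical, block-level compression is an assumption rather than a requirement of the exercise, and the code needs the Hadoop client libraries on the classpath to compile.

```java
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Hypothetical helper; in a real driver these calls would sit next to the other job settings.
class CompressedOutputSketch {

    static void configure(Job job) {
        // Write the job output as a SequenceFile rather than plain text.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        // Turn on output compression and select the Snappy codec.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);

        // Assumption: compress batches of records (BLOCK) instead of each record separately.
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
    }
}
```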
Write Another MapReduce Program To Uncompress The Files
10. Write a MapReduce program to read the compressed SequenceFile and write a text file. This text file should contain the same text data as the log file, plus keys. The keys can contain any values you like.
12. Examine the first portion of the output in the compresseddsftotext directory. You should be able to read the textual log file entries.
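One possible shape for the job in step 10 is sketched below. It is a sketch under assumptions, not the exercise solution: it assumes the newer org.apache.hadoop.mapreduce API, that the SequenceFile was written with Text keys and Text values (adjust the classes to whatever your first job emitted), and that a map-only job using Hadoop's default pass-through Mapper is acceptable. The class name SequenceFileToTextSketch is hypothetical. No codec is named on the read side, because a SequenceFile records its compression codec in its own header.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

class SequenceFileToTextSketch {

    public static void main(String[] args) throws Exception {
        // args[0]: directory holding the compressed SequenceFiles; args[1]: new output directory.
        Job job = Job.getInstance(new Configuration(), "SequenceFile to text");
        job.setJarByClass(SequenceFileToTextSketch.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Read SequenceFile records; decompression happens transparently.
        job.setInputFormatClass(SequenceFileInputFormat.class);
        // Write each record as one line of text: key, tab, value.
        job.setOutputFormatClass(TextOutputFormat.class);

        // No Mapper class is set, so the default Mapper passes records through unchanged.
        job.setNumReduceTasks(0); // map-only job

        // Assumed record types; they must match what the first job wrote.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```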
Optional: Use Command Line Options To Control Compression
13. If you used ToolRunner for your driver, you can control compression using command line arguments. Try commenting out the code in your driver where you call setCompressOutput. Then test setting the mapred.output.compress option (named mapreduce.output.fileoutputformat.compress in Hadoop 2 and later) on the command line, e.g.:
14. Review the output to confirm the files are compressed.
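Command-line -D options only reach the job if the driver builds it from the Configuration that ToolRunner has populated. The skeleton below is a minimal sketch of that pattern, assuming the newer org.apache.hadoop.mapreduce API. The class name ToolDriverSketch is hypothetical, and the job is left as a plain pass-through so that only the ToolRunner wiring is shown.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

class ToolDriverSketch extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains any -D key=value options parsed by ToolRunner;
        // args holds only the remaining arguments (here: input and output paths).
        Job job = Job.getInstance(getConf(), "tool driver sketch");
        job.setJarByClass(ToolDriverSketch.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Formats, mapper, and key/value classes for your own job would be set here.

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips generic options such as -D before calling run().
        System.exit(ToolRunner.run(new Configuration(), new ToolDriverSketch(), args));
    }
}
```

A driver of this shape would be invoked as hadoop jar <your-jar> <DriverClass> -D <property>=true <input-dir> <output-dir>, where every bracketed item is a placeholder. A setCompressOutput call inside run() would overwrite the value passed on the command line, which is why step 13 asks you to comment it out first.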