程式扎記: [CCDH] Exercise2 - Running a MapReduce Job

Thursday, January 22, 2015

Preface
Files and Directories Used in this Exercise
Source directory: ~/workspace/wordcount/src/solution
Files:
WordCount.java: A simple MapReduce driver class.
WordMapper.java: A mapper class for the job.
SumReducer.java: A reducer class for the job.
wc.jar: The compiled, assembled WordCount program

In this exercise you will compile Java files, create a JAR, and run MapReduce jobs.

In addition to manipulating files in HDFS, the wrapper program hadoop is used to launch MapReduce jobs. The code for a job is contained in a compiled JAR file. Hadoop loads the JAR into HDFS and distributes it to the worker nodes, where the individual tasks of the MapReduce job are executed.

One simple example of a MapReduce job is to count the number of occurrences of each word in a file or set of files. In this lab, you will compile and submit a MapReduce job to count the number of occurrences of every word in the works of Shakespeare.
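To get a feel for what the job computes before involving Hadoop, the same idea can be sketched locally with standard shell tools. This is only an illustration: the function names wc_map and wc_reduce are made up for this sketch, and splitting on whitespace is an assumption that may differ from how WordMapper.java tokenizes its input.

```shell
# Local sketch of the word count idea; no Hadoop involved.

# "Map": split the input on whitespace and emit one word per line.
wc_map() {
  tr -s '[:space:]' '\n' | sed '/^$/d'
}

# "Shuffle/sort" and "reduce": sort brings identical words together,
# uniq -c counts each run, and awk prints "word<TAB>count".
wc_reduce() {
  sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

printf 'to be or not to be\n' | wc_map | wc_reduce
```

On a cluster, the map and reduce steps run as many parallel tasks on the worker nodes, and the sorting and grouping between them is handled by the framework.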

Compiling and Submitting a MapReduce Job
1. In a terminal, change to the exercise source directory, and list the contents:
$ cd ~/workspace/wordcount/src
$ ls

This directory contains three "package" subdirectories: solution, stubs and hints. In this example we will be using the solution code.

2. Before compiling, examine the classpath Hadoop is configured to use:
$ hadoop classpath # We will use this information to compile the code

This lists the locations where the Hadoop core API classes are installed.
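The classpath is printed as one long colon-separated line, which can be hard to read. A small helper like the one below prints one entry per line. The name split_classpath and the demo paths are made up for illustration; the real entries depend on your installation.

```shell
# Print each entry of a colon-separated classpath on its own line.
split_classpath() {
  tr ':' '\n'
}

# Demo with a made-up classpath; real output depends on the installation.
printf '%s\n' '/hypothetical/conf:/hypothetical/lib/*' | split_classpath
```

With Hadoop installed, you would use it as: hadoop classpath | split_classpath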

3. Compile the three Java classes:
$ javac -classpath `hadoop classpath` solution/*.java

Note: in the command above, the quotes around hadoop classpath are backquotes. The shell runs the hadoop classpath command and uses its output as part of the javac command. The compiled (.class) files are placed in the solution directory.
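If backquotes are unfamiliar, the sketch below shows that they behave the same as the $(...) form of command substitution. To keep it runnable without Hadoop, a stand-in function named fake_classpath (hypothetical, with a made-up path) plays the role of hadoop classpath.

```shell
# Stand-in for `hadoop classpath`; the path it prints is a made-up placeholder.
fake_classpath() {
  echo "/hypothetical/hadoop/conf:/hypothetical/hadoop/lib/*"
}

# Backquote form, as used in step 3.
old_style=`fake_classpath`

# Equivalent $(...) form, which is easier to read and to nest.
new_style=$(fake_classpath)

# Both lines print the same javac command.
echo "javac -classpath $old_style solution/*.java"
echo "javac -classpath $new_style solution/*.java"
```

Step 3 could therefore also be written as javac -classpath "$(hadoop classpath)" solution/*.java, where the double quotes keep the shell from word-splitting or glob-expanding the classpath.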

4. Collect your compiled Java files into a JAR file:
$ jar cvf wc.jar solution/*.class

5. Submit a MapReduce job to Hadoop using your JAR file to count the occurrences of each word in Shakespeare:
$ hadoop jar wc.jar solution.WordCount shakespeare wordcounts

This hadoop jar command names the JAR file to use (wc.jar), the class whose main method should be invoked (solution.WordCount), and the HDFS input and output directories to use for the MapReduce job.

Your job reads all the files in your HDFS shakespeare directory and places its output in a new HDFS directory called wordcounts.

6. Try running this same command again without any change:
$ hadoop jar wc.jar solution.WordCount shakespeare wordcounts

Your job halts right away with an exception, because Hadoop automatically fails a job that tries to write its output into an existing directory.
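If you later want to rerun a job into the same output directory, one option is to remove the directory first. The following is a minimal sketch, not part of the exercise: the function name rerun_wordcount and the HADOOP variable are made up here, and it relies on hadoop fs -test -d returning a zero exit status when the path is an existing directory. Be aware that it discards the previous results.

```shell
# HADOOP names the command to run; it defaults to the real hadoop wrapper.
HADOOP=${HADOOP:-hadoop}

# Remove the HDFS output directory if it already exists, then run the job.
# Arguments: <input dir> <output dir>
rerun_wordcount() {
  input=$1
  output=$2
  if $HADOOP fs -test -d "$output"; then
    # WARNING: this deletes the results of the previous run.
    $HADOOP fs -rm -r "$output"
  fi
  $HADOOP jar wc.jar solution.WordCount "$input" "$output"
}
```

For example, rerun_wordcount shakespeare wordcounts would replace the wordcounts directory created in step 5.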

7. Review the result of your MapReduce job:
$ hadoop fs -ls wordcounts

This lists the output files for your job. (Your job ran with only one Reducer, so there should be one file, named part-r-00000, along with a _SUCCESS file and a _logs directory.)

8. View the contents of the output for your job:
$ hadoop fs -cat wordcounts/part-r-00000 | less

You can page through a few screens to see words and their frequencies in the works of Shakespeare. (The spacebar will scroll the output by one screen; the letter 'q' will quit the less utility.) Note that you could have specified wordcounts/* just as well in this command.
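Instead of paging through the whole file, you may want to see only the most frequent words. The sketch below assumes each output line has the form word<TAB>count, which is what the default text output format produces for key/value pairs; the function name top_words is made up for this example.

```shell
# Print the N most frequent words (default 10) from "word<TAB>count" lines on stdin.
# Ties on the count are broken alphabetically by word.
top_words() {
  n=${1:-10}
  sort -t "$(printf '\t')" -k2,2nr -k1,1 | head -n "$n"
}

# Demo on three made-up lines.
printf 'the\t5\nand\t7\nthou\t2\n' | top_words 2
```

Against the real job output, this would be used as: hadoop fs -cat wordcounts/part-r-00000 | top_words 20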


9. Try running the WordCount job against a single file:
$ hadoop jar wc.jar solution.WordCount shakespeare/poems pwords

When the job completes, inspect the contents of the pwords HDFS directory.

10. Clean up the output files produced by your job runs:
$ hadoop fs -rm -r wordcounts pwords

Stopping MapReduce Jobs
It is important to be able to stop jobs that are already running. This is useful if, for example, you accidentally introduced an infinite loop into your Mapper. An important point to remember is that pressing ^C to kill the current process (the one displaying the MapReduce job's progress) does not actually stop the job itself!

A MapReduce job, once submitted to Hadoop, runs independently of the initiating process, so losing the connection to the initiating process does not kill the job. Instead, you need to tell the Hadoop JobTracker to stop the job.

1. Start another word count job:
$ hadoop jar wc.jar solution.WordCount shakespeare count2

2. While this job is running, open another terminal and enter:
$ mapred job -list

This lists the job ids of all running jobs. A job id looks something like: job_200902131742_0002

3. Copy the job id, and then kill the running job by entering
$ mapred job -kill jobid

The JobTracker kills the job, and the program running in the original terminal completes.
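Copying the job id by hand can be avoided with a small filter that pulls out anything shaped like a job id. This is a sketch under assumptions: the function name extract_job_ids is made up, the pattern is based only on the example id shown above, and the demo line is a stand-in rather than the real layout of the mapred job -list output.

```shell
# Extract anything that looks like a job id (job_<digits>_<digits>) from stdin,
# printing one id per line.
extract_job_ids() {
  grep -Eo 'job_[0-9]+_[0-9]+'
}

# Demo on a stand-in line; this is not the real layout of `mapred job -list`.
printf 'job_200902131742_0002 (remaining columns omitted)\n' | extract_job_ids
```

On a cluster you could run mapred job -list | extract_job_ids to get the ids, then pass the one you want to mapred job -kill. Check the list before killing anything, since it includes every running job, not only yours.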

Supplement
Hadoop Tutorial 1 -- Running WordCount
This tutorial will introduce you to the Hadoop Cluster in the Computer Science Dept. at Smith College, and how to submit jobs on it...

