Preface
Files and Directories Used in this Exercise
In this exercise you will compile Java files, create a JAR, and run MapReduce jobs.
In addition to manipulating files in HDFS, the wrapper program hadoop is used to launch MapReduce jobs. The code for a job is contained in a compiled JAR file. Hadoop loads the JAR into HDFS and distributes it to the worker nodes, where the individual tasks of the MapReduce job are executed.
One simple example of a MapReduce job is to count the number of occurrences of each word in a file or set of files. In this lab, you will compile and submit a MapReduce job to count the number of occurrences of every word in the works of Shakespeare.
Compiling and Submitting a MapReduce Job
1. In a terminal, change to the exercise source directory, and list the contents:
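A minimal sketch of the commands; the directory path below is illustrative, so substitute the exercise source directory for your environment:

    $ cd ~/workspace/wordcount
    $ ls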
This directory contains three "package" subdirectories: solution, stubs and hints. In this example we will be using the solution code.
2. Before compiling, examine the classpath Hadoop is configured to use:
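The hadoop wrapper provides a classpath subcommand for exactly this purpose:

    $ hadoop classpath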
This shows a list of the locations where the Hadoop core API classes are installed.
3. Compile the three Java classes:
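A plausible form of the compile command, assuming the sources live in the solution subdirectory described above:

    $ javac -classpath `hadoop classpath` solution/*.java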
Note: in the command above, the quotes around hadoop classpath are backquotes. This runs the hadoop classpath command and uses its output as part of the javac command. The compiled (.class) files are placed in the solution directory.
4. Collect your compiled Java files into a JAR file:
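For example, packaging the compiled classes into wc.jar, the JAR name used in the next step:

    $ jar cvf wc.jar solution/*.class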
5. Submit a MapReduce job to Hadoop using your JAR file to count the occurrences of each word in Shakespeare:
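Using the JAR, class, and directory names described below, the command would be:

    $ hadoop jar wc.jar solution.WordCount shakespeare wordcounts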
This hadoop jar command names the JAR file to use (wc.jar), the class whose main method should be invoked (solution.WordCount), and the HDFS input and output directories to use for the MapReduce job.
Your job reads all the files in your HDFS shakespeare directory, and places its output in a new HDFS directory called wordcounts.
6. Try running this same command again without any change:
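That is, re-run the identical command:

    $ hadoop jar wc.jar solution.WordCount shakespeare wordcounts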
Your job halts right away with an exception, because Hadoop automatically fails if your job tries to write its output into an existing directory.
7. Review the result of your MapReduce job:
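For example, listing the output directory with the HDFS shell:

    $ hadoop fs -ls wordcounts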
This lists the output files for your job. (Your job ran with only one Reducer, so there should be one file, named part-r-00000, along with a _SUCCESS file and a _logs directory.)
8. View the contents of the output for your job:
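One way to do this is to pipe the single output file through less:

    $ hadoop fs -cat wordcounts/part-r-00000 | less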
You can page through a few screens to see words and their frequencies in the works of Shakespeare. (The spacebar will scroll the output by one screen; the letter 'q' will quit the less utility.) Note that you could have specified wordcounts/* just as well in this command.
9. Try running the WordCount job against a single file:
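A sketch of the command; the input path shakespeare/poems is an assumption (any single file in HDFS will do), while pwords is the output directory named in the next step:

    $ hadoop jar wc.jar solution.WordCount shakespeare/poems pwords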
When the job completes, inspect the contents of the pwords HDFS directory.
10. Clean up the output files produced by your job runs:
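For example, removing both output directories recursively (older Hadoop releases use hadoop fs -rmr instead of -rm -r):

    $ hadoop fs -rm -r wordcounts pwords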
Stopping MapReduce Jobs
It is important to be able to stop jobs that are already running. This is useful if, for example, you accidentally introduced an infinite loop into your Mapper. An important point to remember is that pressing ^C to kill the current process (which is displaying the MapReduce job's progress) does not actually stop the job itself!
A MapReduce job, once submitted to Hadoop, runs independently of the initiating process, so losing the connection to the initiating process does not kill the job. Instead, you need to tell the Hadoop JobTracker to stop the job.
1. Start another word count job:
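For example, re-running WordCount with a fresh output directory so the existing-directory check from step 6 does not trip (the name wordcounts2 is arbitrary):

    $ hadoop jar wc.jar solution.WordCount shakespeare wordcounts2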
2. While this job is running, open another terminal and enter:
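The JobTracker-era command for listing running jobs is:

    $ hadoop job -list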
This lists the job ids of all running jobs. A job id looks something like: job_200902131742_0002
3. Copy the job id, and then kill the running job by entering:
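Substituting the example job id shown in step 2:

    $ hadoop job -kill job_200902131742_0002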
The JobTracker kills the job, and the program running in the original terminal completes.
Supplement
* Hadoop Tutorial 1 -- Running WordCount