Preface
Files and Directories Used in this Exercise
In this Hands-On Exercise, you will practice running a job locally for debugging and testing purposes.
In the "Using ToolRunner and Passing Paremeters" exercise, you modified the Average Word Length program to use ToolRunner. This makes it simple to set job configuration properties on the command line!
Lab Experiment
Run the Average Word Length program using LocalJobRunner on the command line
1. Run the Average Word Length program again. Specify -jt=local to run the job locally instead of submitting it to the cluster, and -fs=file:/// to use the local file system instead of HDFS. Your input and output paths should refer to local files rather than HDFS files.
2. Review the job output in the local output folder you specified.
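A minimal sketch of what steps 1 and 2 might look like, assuming the JAR and driver class produced in the "Using ToolRunner and Passing Parameters" exercise; the JAR name, class name, and local paths below are illustrative assumptions, not part of the original exercise text:
$ hadoop jar averagewordlength.jar solution.AvgWordLength \
    -fs=file:/// -jt=local \
    ~/training_materials/data/shakespeare localout   # local input dir and local output dir
$ ls localout                                        # step 2: review the job output in the local folder
$ head localout/part-r-00000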
Thursday, January 22, 2015
[CCDH] Exercise4 - More Practice With MapReduce Java Programs (P24)
Preface
Files and Directories Used in this Exercise
Eclipse project: log_file_analysis
Java files:
SumReducer.java - the Reducer
LogFileMapper.java - the Mapper
ProcessLogs.java - the Driver class
Test data (HDFS):
weblog (full version)
testlog (test sample set)
Exercise directory: ~/workspace/log_file_analysis
In this exercise, you will analyze a log file from a web server to count the number of hits made from each unique IP address.
Your task is to count the number of hits made from each IP address in the sample web server log file that you uploaded to the /user/training/weblog directory in HDFS when you completed the "Using HDFS" exercise.
Source Code
Mapper
Extract the IP address field and output (IP address, 1) pairs:
- solution/LogFileMapper.java
package solution;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Example input line:
 * 96.7.4.14 - - [24/Apr/2011:04:20:11 -0400] "GET /cat.jpg HTTP/1.1" 200 12433
 */
public class LogFileMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    /*
     * Split the input line into space-delimited fields.
     */
    String[] fields = value.toString().split(" ");
    if (fields.length > 0) {
      /*
       * Emit the first field - the IP address - as the key
       * and the number 1 as the value.
       */
      String ip = fields[0];
      context.write(new Text(ip), new IntWritable(1));
    }
  }
}
Reducer
The reducer simply sums the counts for each IP address and outputs (IP address, count) pairs:
- solution/SumReducer.java
package solution;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * This is the SumReducer class from the word count exercise.
 */
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int wordCount = 0;
    for (IntWritable value : values) {
      wordCount += value.get();
    }
    context.write(key, new IntWritable(wordCount));
  }
}
Driver
The driver is quite straightforward:
- solution/ProcessLogs.java
package solution;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ProcessLogs {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.printf("Usage: ProcessLogs <input dir> <output dir>\n");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(ProcessLogs.class);
    job.setJobName("Process Logs");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(LogFileMapper.class);
    job.setReducerClass(SumReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}
Lab Experiment
1. Build the project and run the MapReduce program
$ ant -f build.xml # The build process will output 'log_file_analysis.jar'
$ hadoop jar log_file_analysis.jar solution.ProcessLogs weblog ip_count # Output result to ip_count in HDFS
2. Review the result
$ hadoop fs -ls
...
...ip_count
...
$ hadoop fs -cat ip_count/*
...
10.99.99.186 6
10.99.99.247 1
10.99.99.58 21
[CCDH] Exercise2 - Running a MapReduce Job
Preface
Files and Directories Used in this Exercise
Source directory: ~/workspace/wordcount/src/solution
Files:
WordCount.java: A simple MapReduce driver class.
WordMapper.java: A mapper class for the job.
SumReducer.java: A reducer class for the job.
wc.jar: The compiled, assembled WordCount program
In this exercise you will compile Java files, create a JAR, and run MapReduce jobs.
In addition to manipulating files in HDFS, the wrapper program hadoop is used to launch MapReduce jobs. The code for a job is contained in a compiled JAR file. Hadoop loads the JAR into HDFS and distributes it to the worker nodes, where the individual tasks of the MapReduce job are executed.
One simple example of a MapReduce job is to count the number of occurrences of each word in a file or set of files. In this lab, you will compile and submit a MapReduce job to count the number of occurrences of every word in the works of Shakespeare.
Compiling and Submitting a MapReduce Job
1. In a terminal, change to the exercise source directory, and list the contents:
$ cd ~/workspace/wordcount/src
$ ls
This directory contains three "package" subdirectories: solution, stubs and hints. In this example we will be using the solution code.
2. Before compiling, examine the classpath Hadoop is configured to use:
$ hadoop classpath # We will use this information to compile the code
This lists the locations where the Hadoop core API classes are installed.
3. Compile the three Java classes:
$ javac -classpath `hadoop classpath` solution/*.java
Note: in the command above, the quotes around hadoop classpath are backquotes. This runs the hadoop classpath command and uses its output as part of the javac command. The compiled (.class) files are placed in the solution directory.
4. Collect your compiled Java files into a JAR file:
$ jar cvf wc.jar solution/*.class
5. Submit a MapReduce job to Hadoop using your JAR file to count the occurrences of each word in Shakespeare:
$ hadoop jar wc.jar solution.WordCount shakespeare wordcounts
This hadoop jar command names the JAR file to use (wc.jar), the class whose main method should be invoked (solution.WordCount), and the HDFS input and output directories to use for the MapReduce job.
Your job reads all the files in your HDFS shakespeare directory and places its output in a new HDFS directory called wordcounts.
6. Try running this same command again without any change:
$ hadoop jar wc.jar solution.WordCount shakespeare wordcounts
Your job halts right away with an exception, because Hadoop automatically fails if your job tries to write its output into an existing directory.
7. Review the result of your MapReduce job:
$ hadoop fs -ls wordcounts
This lists the output files for your job. (Your job ran with only one Reducer, so there should be one file, named part-r-00000, along with a _SUCCESS file and a _logs directory.)
8. View the contents of the output for your job:
$ hadoop fs -cat wordcounts/part-r-00000 | less
You can page through a few screens to see words and their frequencies in the works of Shakespeare. (The spacebar will scroll the output by one screen; the letter 'q' will quit the less utility.) Note that you could have specified wordcounts/* just as well in this command.
9. Try running the WordCount job against a single file:
$ hadoop jar wc.jar solution.WordCount shakespeare/poems pwords
When the job completes, inspect the contents of the pwords HDFS directory.
10. Clean up the output files produced by your job runs:
$ hadoop fs -rm -r wordcounts pwords
Stopping MapReduce Jobs
It is important to be able to stop jobs that are already running. This is useful if, for example, you accidentally introduced an infinite loop into your Mapper. An important point to remember is that pressing ^C to kill the current process (the one displaying the MapReduce job's progress) does not actually stop the job itself!
A MapReduce job, once submitted to Hadoop, runs independently of the initiating process, so losing the connection to the initiating process does not kill the job. Instead, you need to tell the Hadoop JobTracker to stop the job.
1. Start another word count job:
$ hadoop jar wc.jar solution.WordCount shakespeare count2
2. While this job is running, open another terminal and enter:
$ mapred job -list
This lists the job ids of all running jobs. A job id looks something like: job_200902131742_0002
3. Copy the job id, and then kill the running job by entering
$ mapred job -kill jobid
The JobTracker kills the job, and the program running in the original terminal completes.
Supplement
* Hadoop Tutorial 1 -- Running WordCount
This tutorial will introduce you to the Hadoop Cluster in the Computer Science Dept. at Smith College, and how to submit jobs on it...
[CCDH] Exercise18 - Manipulating Data With Hive (P63)
Preface
Files and Directories Used in this Exercise
Test data (HDFS):
movie
movierating
Exercise directory: ~/workspace/hive
In this exercise, you will practice data processing in Hadoop using Hive.
Lab Experiment
The data sets for this exercise are the movie and movierating data imported from MySQL into Hadoop in the "Importing Data with Sqoop" exercise.
Review the Data
1. Make sure you've completed the "Importing Data with Sqoop" exercise. Review the data you already loaded into HDFS in that exercise:
$ hadoop fs -cat movie/part-m-00000 | head
...
$ hadoop fs -cat movierating/part-m-00000 | head
...
Prepare The Data For Hive
For Hive data sets, you create tables, which attach field names and data types to your Hadoop data for subsequent queries. You can create external tables on the movie and movierating data sets without having to move the data at all. Prepare the Hive tables for this exercise by performing the following steps (one possible set of commands is sketched after the list):
1. Invoke the Hive shell.
2. Create the movie table:
3. Create the movierating table:
4. Quit the Hive shell.
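The Hive statements for these four steps are not reproduced in these notes. Below is a minimal sketch, assuming the Sqoop import left tab-delimited text files under /user/training/movie and /user/training/movierating, and that movie has columns (id, name, year) while movierating has (userid, movieid, rating); the delimiter, HDFS locations, and column types are assumptions:
$ hive
hive> CREATE EXTERNAL TABLE movie (id INT, name STRING, year INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > LOCATION '/user/training/movie';
hive> CREATE EXTERNAL TABLE movierating (userid INT, movieid INT, rating INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > LOCATION '/user/training/movierating';
hive> quit;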
Practicing HiveQL
If you are familiar with SQL, most of what you already know is applicable to HiveQL. Skip ahead to the section called "The Questions" later in this exercise, and see if you can solve the problems based on your knowledge of SQL.
If you are unfamiliar with SQL, follow the steps below to learn how to use HiveQL to solve problems.
1. Start the Hive shell.
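The shell is invoked the same way as in the previous section:
$ hive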
2. Show the list of tables in Hive
hive> SHOW TABLES;
OK
customers
movie
movierating
order_details
orders
products
Time taken: 0.34 seconds
3. View the metadata for the two tables you created previously:
hive> DESCRIBE movie;
hive> DESCRIBE movierating;
Hint: You can use the up and down arrow keys to see and edit your command history in the hive shell, just as you can in the Linux command shell.
4. The SELECT * FROM TABLENAME command allows you to query data from a table. Although it is very easy to select all the rows in a table, Hadoop generally deals with very large tables, so it is best to limit how many rows you select. Use LIMIT to view only the first N rows:
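The command itself is not shown in these notes; for example, to view only the first 10 rows of the movie table:
hive> SELECT * FROM movie LIMIT 10;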
5. Use the WHERE clause to select only rows that match certain criteria. For example, select movies released before 1930:
hive> SELECT * FROM movie WHERE year < 1930;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
...
3289 Not One Less 0
3306 Circus, The 1928
3309 Dog's Life, A 1920
3310 Kid, The 1921
3320 Mifune 0
3357 East-West 0
...
6. The results include movies whose year field is 0, meaning that the year is unknown or unavailable. Exclude those movies from the results:
hive> SELECT * FROM movie WHERE year < 1930 AND year != 0;
7. The results now correctly include movies before 1930, but the list is unordered. Order them alphabetically by title:
hive> SELECT * FROM movie WHERE year < 1930 AND year != 0 ORDER BY name;
8. Now let's move on to the movierating table. List all the ratings by a particular user, e.g.
hive> SELECT * FROM movierating WHERE userid=149;
9. SELECT * shows all the columns, but as we've already selected by userid, display the other columns but not that one:
hive> SELECT movieid, rating FROM movierating WHERE userid=149;
10. Use the JOIN function to display data from both tables. For example, include the name of the movie (from the movie table) in the list of a user's ratings:
hive> SELECT movieid, rating, name FROM movierating
> JOIN movie ON movierating.movieid=movie.id
> WHERE userid=149;
Total MapReduce jobs = 1
Launching Job 1 out of 1
...
11. How tough a rater is user 149? Find out by calculating the average rating she gave to all movies using the AVG aggregate function:
hive> SELECT AVG(rating) FROM movierating WHERE userid=149;
...
Total MapReduce CPU Time Spent: 4 seconds 270 msec
OK
3.9408783783783785
Time taken: 14.753 seconds
12. List each user who rated movies, the number of movies they've rated, and their average rating.
hive> SELECT userid, COUNT(userid), AVG(rating) FROM movierating GROUP BY userid;
...
6038 20 3.8
6039 123 3.8780487804878048
6040 341 3.5777126099706744
Time taken: 17.281 seconds
13. Take the same data, and copy it into a new table called userrating.
hive> CREATE TABLE userrating (userid INT, numratings INT, avgrating FLOAT);
OK
Time taken: 0.069 seconds
hive> INSERT OVERWRITE TABLE userrating
> SELECT userid, COUNT(userid), AVG(rating)
> FROM movierating GROUP BY userid;
...
Total MapReduce CPU Time Spent: 8 seconds 120 msec
OK
Time taken: 20.651 seconds
The Questions
Now that the data is imported and suitably prepared, write a HiveQL command to implement each of the following queries.
Supplement
* Apache Hive - LanguageManual