Preface
Files and Directories Used in this Exercise
In this Hands-On Exercise, you will practice running a job locally for debugging and testing purposes.
In the "Using ToolRunner and Passing Paremeters" exercise, you modified the Average Word Length program to use ToolRunner. This makes it simple to set job configuration properties on the command line!
Lab Experiment
Run the Average Word Length program using LocalJobRunner on the command line
1. Run the Average Word Length program again. Specify -jt=local to run the job locally instead of submitting it to the cluster, and -fs=file:/// to use the local file system instead of HDFS. Your input and output paths should refer to local files rather than HDFS files.
2. Review the job output in the local output folder you specified.
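A minimal sketch of what steps 1 and 2 might look like, assuming the JAR and driver class produced in the "Using ToolRunner and Passing Parameters" exercise; the JAR name, class name, and local paths below are illustrative assumptions, not part of the original exercise text:
$ hadoop jar averagewordlength.jar solution.AvgWordLength \
    -fs=file:/// -jt=local \
    ~/training_materials/data/shakespeare localout   # local input dir and local output dir
$ ls localout                                        # step 2: review the job output in the local folder
$ head localout/part-r-00000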
Thursday, January 22, 2015
[CCDH] Exercise4 - More Practice With MapReduce Java Programs (P24)
Preface
Files and Directories Used in this Exercise
Eclipse project: log_file_analysis
Java files:
SumReducer.java - the Reducer
LogFileMapper.java - the Mapper
ProcessLogs.java - the Driver class
Test data (HDFS):
weblog (full version)
testlog (test sample set)
Exercise directory: ~/workspace/log_file_analysis
In this exercise, you will analyze a log file from a web server to count the number of hits made from each unique IP address.
Your task is to count the number of hits made from each IP address in the sample web server log file that you uploaded to the /user/training/weblog directory in HDFS when you completed the "Using HDFS" exercise.
Source Code
Mapper
Extract the IP address field and output (IP address, 1) pairs:
- solution/LogFileMapper.java
package solution;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Example input line:
 * 96.7.4.14 - - [24/Apr/2011:04:20:11 -0400] "GET /cat.jpg HTTP/1.1" 200 12433
 */
public class LogFileMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    /*
     * Split the input line into space-delimited fields.
     */
    String[] fields = value.toString().split(" ");
    if (fields.length > 0) {
      /*
       * Emit the first field - the IP address - as the key
       * and the number 1 as the value.
       */
      String ip = fields[0];
      context.write(new Text(ip), new IntWritable(1));
    }
  }
}
Reducer
The reducer simply sums the counts for each IP address and outputs (IP address, count) pairs:
- solution/SumReducer.java
package solution;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * This is the SumReducer class from the word count exercise.
 */
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int wordCount = 0;
    for (IntWritable value : values) {
      wordCount += value.get();
    }
    context.write(key, new IntWritable(wordCount));
  }
}
Driver
The driver is quite straightforward:
- solution/ProcessLogs.java
package solution;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ProcessLogs {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.printf("Usage: ProcessLogs <input dir> <output dir>\n");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(ProcessLogs.class);
    job.setJobName("Process Logs");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(LogFileMapper.class);
    job.setReducerClass(SumReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}
Lab Experiment
1. Build the project and run the MapReduce program
$ ant -f build.xml # The build process will output 'log_file_analysis.jar'
$ hadoop jar log_file_analysis.jar solution.ProcessLogs weblog ip_count # Output result to ip_count in HDFS
2. Review the result
$ hadoop fs -ls
...
...ip_count
...
$ hadoop fs -cat ip_count/*
...
10.99.99.186 6
10.99.99.247 1
10.99.99.58 21
[CCDH] Exercise2 - Running a MapReduce Job
Preface
Files and Directories Used in this Exercise
Source directory: ~/workspace/wordcount/src/solution
Files:
WordCount.java: A simple MapReduce driver class.
WordMapper.java: A mapper class for the job.
SumReducer.java: A reducer class for the job.
wc.jar: The compiled, assembled WordCount program
In this exercise you will compile Java files, create a JAR, and run MapReduce jobs.
In addition to manipulating files in HDFS, the wrapper program hadoop is used to launch MapReduce jobs. The code for a job is contained in a compiled JAR file. Hadoop loads the JAR into HDFS and distributes it to the worker nodes, where the individual tasks of the MapReduce job are executed.
One simple example of a MapReduce job is to count the number of occurrences of each word in a file or set of files. In this lab, you will compile and submit a MapReduce job to count the number of occurrences of every word in the works of Shakespeare.
Compiling and Submitting a MapReduce Job
1. In a terminal, change to the exercise source directory, and list the contents:
$ cd ~/workspace/wordcount/src
$ ls
This directory contains three "package" subdirectories: solution, stubs and hints. In this example we will be using the solution code.
2. Before compiling, examine the classpath Hadoop is configured to use:
$ hadoop classpath # We will use this information to compile the code
This lists the locations where the Hadoop core API classes are installed.
3. Compile the three Java classes:
$ javac -classpath `hadoop classpath` solution/*.java
Note: in the command above, the quotes around hadoop classpath are backquotes. This runs the hadoop classpath command and uses its output as part of the javac command. The compiled (.class) files are placed in the solution directory.
4. Collect your compiled Java files into a JAR file:
$ jar cvf wc.jar solution/*.class
5. Submit a MapReduce job to Hadoop using your JAR file to count the occurrences of each word in Shakespeare:
$ hadoop jar wc.jar solution.WordCount shakespeare wordcounts
This hadoop jar command names the JAR file to use (wc.jar), the class whose main method should be invoked (solution.WordCount), and the HDFS input and output directories to use for the MapReduce job.
Your job reads all the files in your HDFS shakespeare directory and places its output in a new HDFS directory called wordcounts.
6. Try running this same command again without any change:
$ hadoop jar wc.jar solution.WordCount shakespeare wordcounts
Your job halts right away with an exception, because Hadoop automatically fails if your job tries to write its output into an existing directory.
7. Review the result of your MapReduce job:
$ hadoop fs -ls wordcounts
This lists the output files for your job. (Your job ran with only one Reducer, so there should be one file, named part-r-00000, along with a _SUCCESS file and a _logs directory.)
8. View the contents of the output for your job:
$ hadoop fs -cat wordcounts/part-r-00000 | less
You can page through a few screens to see words and their frequencies in the works of Shakespeare. (The spacebar will scroll the output by one screen; the letter 'q' will quit the less utility.) Note that you could have specified wordcounts/* just as well in this command.
9. Try running the WordCount job against a single file:
$ hadoop jar wc.jar solution.WordCount shakespeare/poems pwords
When the job completes, inspect the contents of the pwords HDFS directory.
10. Clean up the output files produced by your job runs:
$ hadoop fs -rm -r wordcounts pwords
Stopping MapReduce Jobs
It is important to be able to stop jobs that are already running. This is useful if, for example, you accidentally introduced an infinite loop into your Mapper. An important point to remember is that pressing ^C to kill the current process (the one displaying the MapReduce job's progress) does not actually stop the job itself!
A MapReduce job, once submitted to Hadoop, runs independently of the initiating process, so losing the connection to the initiating process does not kill the job. Instead, you need to tell the Hadoop JobTracker to stop the job.
1. Start another word count job:
$ hadoop jar wc.jar solution.WordCount shakespeare count2
2. While this job is running, open another terminal and enter:
$ mapred job -list
This lists the job ids of all running jobs. A job id looks something like: job_200902131742_0002
3. Copy the job id, and then kill the running job by entering
$ mapred job -kill jobid
The JobTracker kills the job, and the program running in the original terminal completes.
Supplement
* Hadoop Tutorial 1 -- Running WordCount
This tutorial will introduce you to the Hadoop Cluster in the Computer Science Dept. at Smith College, and how to submit jobs on it...
[CCDH] Exercise18 - Manipulating Data With Hive (P63)
Preface
Files and Directories Used in this Exercise
Test data (HDFS):
movie
movierating
Exercise directory: ~/workspace/hive
In this exercise, you will practice data processing in Hadoop using Hive.
Lab Experiment
The data sets for this exercise are the movie and movierating data imported from MySQL into Hadoop in the "Importing Data with Sqoop" exercise.
Review the Data
1. Make sure you've completed the "Importing Data with Sqoop" exercise. Review the data you already loaded into HDFS in that exercise:
$ hadoop fs -cat movie/part-m-00000 | head
...
$ hadoop fs -cat movierating/part-m-00000 | head
...
Prepare The Data For Hive
For Hive data sets, you create tables, which attach field names and data types to your Hadoop data for subsequent queries. You can create external tables on the movie and movierating data sets without having to move the data at all. Prepare the Hive tables for this exercise by performing the following steps (one possible set of commands is sketched after the list):
1. Invoke the Hive shell.
2. Create the movie table:
3. Create the movierating table:
4. Quit the Hive shell.
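The Hive statements for these four steps are not reproduced in these notes. Below is a minimal sketch, assuming the Sqoop import left tab-delimited text files under /user/training/movie and /user/training/movierating, and that movie has columns (id, name, year) while movierating has (userid, movieid, rating); the delimiter, HDFS locations, and column types are assumptions:
$ hive
hive> CREATE EXTERNAL TABLE movie (id INT, name STRING, year INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > LOCATION '/user/training/movie';
hive> CREATE EXTERNAL TABLE movierating (userid INT, movieid INT, rating INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > LOCATION '/user/training/movierating';
hive> quit;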
Practicing HiveQL
If you are familiar with SQL, most of what you already know is applicable to HiveQL. Skip ahead to the section called "The Questions" later in this exercise, and see if you can solve the problems based on your knowledge of SQL.
If you are unfamiliar with SQL, follow the steps below to learn how to use HiveQL to solve problems.
1. Start the Hive shell.
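The shell is invoked the same way as in the previous section:
$ hive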
2. Show the list of tables in Hive
hive> SHOW TABLES;
OK
customers
movie
movierating
order_details
orders
products
Time taken: 0.34 seconds
3. View the metadata for the two tables you created previously:
hive> DESCRIBE movie;
hive> DESCRIBE movierating;
Hint: You can use the up and down arrow keys to see and edit your command history in the hive shell, just as you can in the Linux command shell.
4. The SELECT * FROM TABLENAME command allows you to query data from a table. Although it is very easy to select all the rows in a table, Hadoop generally deals with very large tables, so it is best to limit how many rows you select. Use LIMIT to view only the first N rows:
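The command itself is not shown in these notes; for example, to view only the first 10 rows of the movie table:
hive> SELECT * FROM movie LIMIT 10;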
5. Use the WHERE clause to select only rows that match certain criteria. For example, select movies released before 1930:
hive> SELECT * FROM movie WHERE year < 1930;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
...
3289 Not One Less 0
3306 Circus, The 1928
3309 Dog's Life, A 1920
3310 Kid, The 1921
3320 Mifune 0
3357 East-West 0
...
6. The results include movies whose year field is 0, meaning that the year is unknown or unavailable. Exclude those movies from the results:
hive> SELECT * FROM movie WHERE year < 1930 AND year != 0;
7. The results now correctly include movies before 1930, but the list is unordered. Order them alphabetically by title:
hive> SELECT * FROM movie WHERE year < 1930 AND year != 0 ORDER BY name;
8. Now let's move on to the movierating table. List all the ratings by a particular user, e.g.
hive> SELECT * FROM movierating WHERE userid=149;
9. SELECT * shows all the columns, but as we've already selected by userid, display the other columns but not that one:
hive> SELECT movieid, rating FROM movierating WHERE userid=149;
10. Use the JOIN function to display data from both tables. For example, include the name of the movie (from the movie table) in the list of a user's ratings:
hive> SELECT movieid, rating, name FROM movierating
> JOIN movie ON movierating.movieid=movie.id
> WHERE userid=149;
Total MapReduce jobs = 1
Launching Job 1 out of 1
...
11. How tough a rater is user 149? Find out by calculating the average rating she gave to all movies using the AVG aggregate function:
hive> SELECT AVG(rating) FROM movierating WHERE userid=149;
...
Total MapReduce CPU Time Spent: 4 seconds 270 msec
OK
3.9408783783783785
Time taken: 14.753 seconds
12. List each user who rated movies, the number of movies they've rated, and their average rating.
hive> SELECT userid, COUNT(userid), AVG(rating) FROM movierating GROUP BY userid;
...
6038 20 3.8
6039 123 3.8780487804878048
6040 341 3.5777126099706744
Time taken: 17.281 seconds
13. Take the same data, and copy it into a new table called userrating.
hive> CREATE TABLE userrating (userid INT, numratings INT, avgrating FLOAT);
OK
Time taken: 0.069 seconds
hive> INSERT OVERWRITE TABLE userrating
> SELECT userid, COUNT(userid), AVG(rating)
> FROM movierating GROUP BY userid;
...
Total MapReduce CPU Time Spent: 8 seconds 120 msec
OK
Time taken: 20.651 seconds
The Questions
Now that the data is imported and suitably prepared, write a HiveQL command to implement each of the following queries.
Supplement
* Apache Hive - LanguageManual