程式扎記: [CCDH] Exercise18 - Manipulating Data With Hive (P63)

2015年1月22日星期四

[CCDH] Exercise18 - Manipulating Data With Hive (P63)

Preface
Files and Directories Used in this Exercise

Test data (HDFS):
movie
movierating

Exercise directory: ~/workspace/hive

In this exercise, you will practice data processing in Hadoop using Hive.

Lab Experiment
The data sets for this exercise are the movie and movierating data imported from MySQL into Hadoop in the "Importing Data with Sqoop" exercise.

Review the Data
1. Make sure you've completed the "Importing Data with Sqoop" exercise. Review the data you already loaded into HDFS in that exercise:

$ hadoop fs -cat movie/part-m-00000 | head
...
$ hadoop fs -cat movierating/part-m-00000 | head
...

Prepare The Data For Hive
For Hive data sets, you create tables, which attach field names and data types to your Hadoop data for subsequent queries. You can create external tables on themovie and movierating data sets, without having to move the data at all. Prepare the Hive tables for this exercise by performing the following steps:

1. Invoke the Hive shell.
2. Create the movie table:

3. Create the movierating table:

4. Quit the Hive shell.

Practicing HiveQL
If you are familiar with SQL, most of what you already know is applicably to HiveQL. Skip ahead to section called "The Questions" later in this exercise, and see if you can solve the problems based on your knowledge of SQL.

If you are unfamiliar with SQL, follow the steps below to learn how to use HiveSQL to solve problems.
1. Start the Hive shell.
2. Show the list of tables in Hive

hive> SHOW TABLES;
OK
customers
movie
movierating
order_details
orders
products
Time taken: 0.34 seconds

3. View the metadata for the two tables you created previously:

hive> DESCRIBE movie;
hive> DESCRIBE movieratings;

Hint: You can use the up and down arrow keys to see and edit your command history in the hive shell, just as you can in the Linux command shell.

4. The SELECT * FROM TABLENAME command allows you to query data from a table. Although it is very easy to select all the rows in a table, Hadoop generally deals with very large tables; so it is best to limit how many you select. Use LIMIT to view only the first N rows:

5. Use the WHERE clause to select only rows that match certain criteria. For example, select movies released before 1930:

hive> SELECT * FROM movie WHERE year < 1930;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
...
3289 Not One Less 0
3306 Circus, The 1928
3309 Dog's Life, A 1920
3310 Kid, The 1921
3320 Mifune 0
3357 East-West 0
...

6. The results include movies whose year field is 0, meaning that the year is unknown or unavailable. Exclude those movies from the results:

hive> SELECT * FROM movie WHERE year < 1930 AND year != 0;

7. The results now correctly include movies before 1930, but the list is unordered. Order them alphabetically by title:

hive> SELECT * FROM movie WHERE year < 1930 AND year != 0 ORDER BY name;

8. Now let's move on to the movierating table. List all the ratings by a particular user, e.g.

hive> SELECT * FROM movierating WHERE userid=149;

9. SELECT * shows all the columns, but as we've already selected by userid, display the other columns but not that one:

hive> SELECT movieid, rating FROM movierating WHERE userid=149;

10. Use the JOIN function to display data from both tables. For example, include the name of the movie (from the movie table) in the list of a user's ratings:

hive> SELECT movieid, rating, name FROM movierating
> JOIN movie ON movierating.movieid=movie.id
> WHERE userid=149;
Total MapReduce jobs = 1
Launching Job 1 out of 1
...

11. How tough a rater is user 149? Find out by calculating the average rating she gave to all movies using the AVG function (UDAV):

hive> SELECT AVG(rating) FROM movierating WHERE userid=149;
...
Total MapReduce CPU Time Spent: 4 seconds 270 msec
OK
3.9408783783783785
Time taken: 14.753 seconds

12. List each user who rated movies, the number of movies they've rated, and their average rating.

hive> SELECT userid, COUNT(userid), AVG(rating) FROM movierating GROUP BY userid;
...
6038 20 3.8
6039 123 3.8780487804878048
6040 341 3.5777126099706744
Time taken: 17.281 seconds

13. Take the same data, and copy it into a new table called userrating.

hive> CREATE TABLE userrating (userid INT, numratings INT, avgrating FLOAT);
OK
Time taken: 0.069 seconds
hive> INSERT OVERWRITE TABLE userrating
> SELECT userid, COUNT(userid), AVG(rating)
> FROM movierating GROUP BY userid;
...
Total MapReduce CPU Time Spent: 8 seconds 120 msec
OK
Time taken: 20.651 seconds

The Questions
Now that the data is imported and suitably prepared, write a HiveQL command to implement each of the following queries.

Supplement
* Apache Hive - LanguageManual

程式扎記

標籤

2015年1月22日星期四

[CCDH] Exercise18 - Manipulating Data With Hive (P63)

沒有留言:

張貼留言

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

標籤

2015年1月22日 星期四

[CCDH] Exercise18 - Manipulating Data With Hive (P63)

沒有留言:

張貼留言

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

2015年1月22日星期四