Preface
Files and Directories Used in this Exercise
In this exercise, you will practice data processing in Hadoop using Hive.
Lab Experiment
The data sets for this exercise are the movie and movierating data imported from MySQL into Hadoop in the "Importing Data with Sqoop" exercise.
Review the Data
1. Make sure you've completed the "Importing Data with Sqoop" exercise. Review the data you already loaded into HDFS in that exercise:
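A sketch of the review commands, assuming the Sqoop exercise left the imported data in directories named movie and movierating under your HDFS home directory (adjust the paths if you imported elsewhere):

```shell
# List the imported files (paths are assumptions from the Sqoop exercise)
hadoop fs -ls movie
hadoop fs -ls movierating

# Peek at the first few records of each data set
hadoop fs -cat movie/part-m-00000 | head
hadoop fs -cat movierating/part-m-00000 | head
```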
Prepare The Data For Hive
For Hive data sets, you create tables, which attach field names and data types to your Hadoop data for subsequent queries. You can create external tables over the movie and movierating data sets without moving the data at all. Prepare the Hive tables for this exercise by performing the following steps:
1. Invoke the Hive shell.
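From the Linux command line:

```shell
hive
```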
2. Create the movie table:
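A minimal sketch of the statement, assuming the columns id, name, and year from the MySQL movie table, comma-delimited Sqoop output, and an import path of /user/training/movie (adjust the columns, delimiter, and LOCATION to match your import):

```sql
-- Column names and the HDFS path are assumptions from the Sqoop exercise
CREATE EXTERNAL TABLE movie
  (id INT, name STRING, year INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/training/movie';
```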
3. Create the movierating table:
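Similarly, a sketch assuming the columns userid, movieid, and rating and an import path of /user/training/movierating:

```sql
-- Column names and the HDFS path are assumptions from the Sqoop exercise
CREATE EXTERNAL TABLE movierating
  (userid INT, movieid INT, rating INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/training/movierating';
```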
4. Quit the Hive shell.
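```sql
quit;
```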
Practicing HiveQL
If you are familiar with SQL, most of what you already know is applicable to HiveQL. Skip ahead to the section called "The Questions" later in this exercise, and see if you can solve the problems based on your knowledge of SQL.
If you are unfamiliar with SQL, follow the steps below to learn how to use HiveQL to solve problems.
1. Start the Hive shell.
2. Show the list of tables in Hive
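```sql
SHOW TABLES;
```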
3. View the metadata for the two tables you created previously:
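```sql
DESCRIBE movie;
DESCRIBE movierating;
```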
Hint: You can use the up and down arrow keys to see and edit your command history in the hive shell, just as you can in the Linux command shell.
4. The SELECT * FROM TABLENAME command allows you to query data from a table. Although it is very easy to select all the rows in a table, Hadoop generally deals with very large tables, so it is best to limit how many rows you select. Use LIMIT to view only the first N rows:
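For example, to view just the first ten movies:

```sql
SELECT * FROM movie LIMIT 10;
```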
5. Use the WHERE clause to select only rows that match certain criteria. For example, select movies released before 1930:
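A sketch, assuming the movie table has a year column as created earlier:

```sql
SELECT * FROM movie WHERE year < 1930;
```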
6. The results include movies whose year field is 0, meaning that the year is unknown or unavailable. Exclude those movies from the results:
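One way to exclude the unknown years is an additional condition in the WHERE clause:

```sql
SELECT * FROM movie WHERE year < 1930 AND year != 0;
```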
7. The results now correctly include movies before 1930, but the list is unordered. Order them alphabetically by title:
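Adding an ORDER BY clause sorts the result set; here it sorts on the assumed name column:

```sql
SELECT * FROM movie
WHERE year < 1930 AND year != 0
ORDER BY name;
```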
8. Now let's move on to the movierating table. List all the ratings by a particular user, e.g. user 149:
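A sketch, assuming the movierating table has a userid column as created earlier:

```sql
SELECT * FROM movierating WHERE userid = 149;
```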
9. SELECT * shows all the columns, but since we have already selected rows by userid, display the other columns but not that one:
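Name the remaining columns explicitly instead of using * (column names assumed from the table definition above):

```sql
SELECT movieid, rating FROM movierating WHERE userid = 149;
```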
10. Use the JOIN function to display data from both tables. For example, include the name of the movie (from the movie table) in the list of a user's ratings:
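A sketch of the join, assuming movierating.movieid matches movie.id:

```sql
SELECT movierating.userid, movie.name, movierating.rating
FROM movierating
JOIN movie ON (movierating.movieid = movie.id)
WHERE movierating.userid = 149;
```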
11. How tough a rater is user 149? Find out by calculating the average rating she gave to all movies, using the built-in AVG aggregate function (a UDAF):
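```sql
SELECT AVG(rating) FROM movierating WHERE userid = 149;
```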
12. List each user who rated movies, the number of movies they've rated, and their average rating.
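Grouping by userid lets COUNT and AVG run per user rather than over the whole table:

```sql
SELECT userid, COUNT(userid), AVG(rating)
FROM movierating
GROUP BY userid;
```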
13. Take the same data, and copy it into a new table called userrating.
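A sketch using CREATE TABLE ... AS SELECT; the column aliases numratings and avgrating are illustrative names, not from the original exercise:

```sql
CREATE TABLE userrating AS
SELECT userid, COUNT(userid) AS numratings, AVG(rating) AS avgrating
FROM movierating
GROUP BY userid;
```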
The Questions
Now that the data is imported and suitably prepared, write a HiveQL command to implement each of the following queries.
Supplement
* Apache Hive - LanguageManual