Saturday, November 29, 2014

[CCDH] Exercise17 - Importing Data With Sqoop (P60)

Preface
In this exercise you will import data from a relational database using Sqoop. The data you load here will be used in subsequent exercises.

Lab Experiment
Consider the MySQL database movielens, derived from the MovieLens project at the University of Minnesota. (See the note at the end of this exercise.) The database consists of several related tables, but we will import only two of them: movie, which contains about 3,900 movies, and movierating, which has about 1,000,000 ratings of those movies.

Review the Database Tables
First, review the database tables to be loaded into Hadoop:
1. Log on to MySQL:
$ mysql --user=training --password=training movielens
...
mysql> # We are now in the MySQL interactive console

2. Review the structure and contents of the movie table:
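The lab text does not reproduce the exact statements, but a reasonable way to do this from the mysql> prompt is:

mysql> DESCRIBE movie;
mysql> SELECT * FROM movie LIMIT 10;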


3. Note the column names for the table
4. Review the structure and contents of the movierating table:
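As with the movie table, the exact statements are not shown in the lab text; a reasonable sketch (the COUNT should come back as roughly 1,000,000, matching the description above) is:

mysql> DESCRIBE movierating;
mysql> SELECT * FROM movierating LIMIT 10;
mysql> SELECT COUNT(*) FROM movierating;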



5. Note these column names
6. Exit mysql

mysql> quit

Import with Sqoop
Sqoop is invoked from the command line and supports several commands. With it you can connect to your database server, list the databases (schemas) to which you have access, and list the tables available for loading. For database access, you provide a connect string that identifies the server and, if required, your username and password.
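In general, a Sqoop invocation has the following shape (the angle-bracket placeholders are illustrative, not part of the lab):

$ sqoop <command> \
--connect jdbc:mysql://<host>/<database> \
--username <user> -P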

1. Show the commands available in Sqoop:
$ sqoop help

2. List the databases (schemas) in your database server:
$ sqoop list-databases \
--connect jdbc:mysql://localhost \
--username training --password training

14/11/30 00:06:54 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
14/11/30 00:06:54 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
dualcore
hue
metastore
movielens
mysql
test
training

Note: Instead of entering --password training on your command line, you may prefer to enter -P, and let Sqoop prompt you for the password, which is then not visible when you type it.
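For example, the list-databases command above could be run with -P instead, and Sqoop will prompt you for the password:

$ sqoop list-databases \
--connect jdbc:mysql://localhost \
--username training -P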

3. List the tables in the movielens database:
$ sqoop list-tables \
--connect jdbc:mysql://localhost/movielens \
--username training --password training

...
genre
movie
moviegenre
movierating
occupation
user

4. Import the movie table into Hadoop:
$ sqoop import \
--connect jdbc:mysql://localhost/movielens \
--username training --password training \
--fields-terminated-by '\t' --table movie

...
14/11/30 00:23:11 INFO mapreduce.ImportJobBase: Transferred 99.6602 KB in 18.9969 seconds (5.2461 KB/sec)
14/11/30 00:23:11 INFO mapreduce.ImportJobBase: Retrieved 3881 records.

5. Verify that the command has worked:
$ hadoop fs -ls movie
$ hadoop fs -tail movie/part-m-00000
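As an extra sanity check (not part of the original steps), you can count the imported records and compare the result with the "Retrieved 3881 records" line in the Sqoop output:

$ hadoop fs -cat movie/part-m-* | wc -l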

6. Import the movierating table into Hadoop, either with a single sqoop import command (as in step 4) or with a small wrapper script such as the one below:
- Step6.sh
#!/bin/bash
# Import the movierating table and verify that the target directory exists in HDFS.
Table="movierating"

echo "Start importing table $Table..."
sqoop import \
--connect jdbc:mysql://localhost/movielens \
--username training --password training \
--fields-terminated-by '\t' --table "$Table"

echo "Check importing result..."
Result=$(hadoop fs -ls "$Table" 2>&1)
if [[ $Result =~ "No such file or directory" ]]; then
    echo "Importing Table=$Table failed!"
else
    echo "Check import content:"
    hadoop fs -tail "$Table"/part-m-00000
fi
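
To run the script and double-check the result (not part of the original lab steps; the count should be roughly 1,000,000, matching the table description above):

$ chmod +x Step6.sh
$ ./Step6.sh
$ hadoop fs -cat movierating/part-m-* | wc -l
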
Supplement
Apache Sqoop Home Page
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

Sqoop 1.4.3 User Guide

