Preface
Files used in this exercise: shakespeare.tar.gz, shakespeare-stream.tar.gz, and a gzipped Web server access log.
In this exercise you will begin to get acquainted with the Hadoop tools. You will manipulate files in HDFS, the Hadoop Distributed File System.
Exercise
Before starting the exercises, run the course setup script in a terminal window:
Hadoop
Hadoop is already installed, configured, and running on your virtual machine. Most of your interaction with the system will be through a command-line wrapper called hadoop. If you run this program with no arguments, it prints a help message. To try this, run the command below in a terminal window:
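$ hadoop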
The hadoop command is subdivided into several subsystems. For example, there is a subsystem for working with files in HDFS and another for launching and managing MapReduce processing jobs.
Step 1: Exploring HDFS
The subsystem associated with HDFS in the Hadoop wrapper program is called FsShell. This subsystem can be invoked with the command hadoop fs.
1. In the terminal window, enter
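$ hadoop fs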
You will see a help message describing all the commands associated with the FsShell subsystem.
2. Enter:
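$ hadoop fs -ls /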
This shows you the contents of the root directory in HDFS. There will be multiple entries, one of which is /user. Individual users have a "home" directory under this directory, named after their username.
Step 2: Uploading Files
Besides browsing the existing filesystem, another important thing you can do with FsShell is to upload new data into HDFS.
1. Change directories to the local filesystem directory containing the sample data we will be using in the course.
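For example (the exact location of the course data is not given here, so the path below is only a placeholder):
$ cd ~/training_materials/data   # hypothetical path; use your course's actual data directory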
If you perform a regular Linux ls command in this directory, you will see a few files, including two named shakespeare.tar.gz and shakespeare-stream.tar.gz. Both contain the complete works of Shakespeare in text format, but they are packaged and organized differently. For now, we will work with shakespeare.tar.gz.
2. Unzip shakespeare.tar.gz by running
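$ tar zxvf shakespeare.tar.gz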
This creates a directory named shakespeare/ containing several files on your local filesystem.
3. Insert this directory into HDFS:
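$ hadoop fs -put shakespeare /user/training/shakespeare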
This copies the local shakespeare directory and its contents into a remote HDFS directory named /user/training/shakespeare.
4. List the contents of your HDFS home directory now:
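$ hadoop fs -ls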
You should see an entry for the shakespeare directory. If you don't pass a directory name to the -ls command, it assumes you mean your home directory, i.e. /user/training. Any relative path will be based on your home directory, too.
5. We also have a Web server log file, which we will put into HDFS for use in a future exercise:
The file is currently compressed using GZip. Rather than extract the file to the local disk and then upload it, we will extract and upload it in one step. The -c option to gunzip uncompresses to standard output, and the dash (-) in the command below tells hadoop fs -put to take whatever is sent to its standard input and place that data in HDFS:
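Assuming the log file is named access_log.gz and the uploaded copy should be called access_log (both names are placeholders; the actual file name is not given above):
$ gunzip -c access_log.gz | hadoop fs -put - access_log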
6. Run the hadoop fs -ls command to verify that the log file is in your HDFS home directory.
7. The access log file is quite large - around 500 MB. Create a small version of this file, consisting only of its first 5000 lines, and store the smaller version in HDFS. You can use the smaller version for testing in subsequent exercises.
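One way to do this, again assuming the local file is access_log.gz and naming the smaller HDFS copy access_log_small (both names are illustrative):
$ gunzip -c access_log.gz | head -n 5000 | hadoop fs -put - access_log_small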
Step 3: Viewing and Manipulating Files
Now let's view some of the data you just copied into HDFS.
1. Enter
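$ hadoop fs -ls shakespeare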
This lists the contents of the /user/training/shakespeare HDFS directory.
2. The glossary file included in the compressed file you began with is not strictly a work of Shakespeare, so let's remove it:
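Assuming the file inside the uploaded directory is simply named glossary:
$ hadoop fs -rm shakespeare/glossary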
3. Enter:
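A command that produces this output, assuming the histories are stored in a single HDFS file named shakespeare/histories:
$ hadoop fs -cat shakespeare/histories | tail -n 50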
This prints the last 50 lines of Henry IV, Part 1 to your terminal. This command is handy for viewing the output of MapReduce programs. Very often, an individual output file of a MapReduce program is very large, making it inconvenient to view the entire file in the terminal.
4. To download a file to work with on the local filesystem, use the fs -get command. This command takes two arguments: an HDFS path and a local path. It copies the HDFS contents into the local filesystem:
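For example, to copy the poems out of HDFS (the HDFS file name shakespeare/poems and the local name shakepoems.txt are only illustrative):
$ hadoop fs -get shakespeare/poems ~/shakepoems.txt
$ less ~/shakepoems.txt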
Other Commands
Commands useful for users of a Hadoop cluster are listed under the hadoop command's user commands. Commands useful for administrators of a Hadoop cluster are listed under its administration commands.
Supplement
* Apache Hadoop 2.5.1 - Commands Manual