Source: here
Preface:
In this post I will explain how to install Hadoop on a brand new Linux distribution. This post is based on Michael G. Noll's post Running Hadoop on Ubuntu Linux (Single-Node Cluster), which is still very useful but somewhat outdated. We assume that you already have Ubuntu 13.04 installed and running. If not, you can download it here. Once Ubuntu 13.04 is installed, we are ready to get Hadoop running.
STEP 1: Install Java
Since Ubuntu no longer comes with Java, the first thing we have to do is install it. For the sake of simplicity, I will not go through this step by step. The post HOW TO INSTALL ORACLE JAVA 7 UPDATE 25 ON UBUNTU 13.04 LINUX already covers it brilliantly.
STEP 2: Install SSH
As we are installing Hadoop on a clean version of Ubuntu 13.04, we also need an SSH server installed. A distributed Hadoop cluster requires SSH because it is through SSH that Hadoop manages its nodes, e.g. starting and stopping slave nodes. The following command will do that.
$ sudo apt-get install openssh-server
STEP 3: Create a dedicated user
A new user is not required, but in a large-scale environment I strongly recommend that you create a separate user account dedicated exclusively to Hadoop. This allows you to restrict permissions to the minimum needed by Hadoop. The account does not need extra privileges such as sudo; it only needs read and write access to some directories in order to perform Hadoop tasks.
Now let's create a dedicated user for Hadoop:
$ sudo addgroup hadoopgroup
$ sudo adduser --ingroup hadoopgroup hadoop
STEP 4: Configuring passphraseless SSH
To avoid entering a passphrase every time Hadoop interacts with its nodes, let's create an RSA key pair to manage authentication. The authorized_keys file holds the public keys that are allowed to authenticate into the account the key is added to.
$ su - hadoop # switch to the hadoop account
# Creates an RSA keypair
# The -P "" specifies that an empty password should be used
$ ssh-keygen -t rsa -P ""
# Append the public key of the generated RSA key pair to the authorized_keys file
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ exit
STEP 5: Downloading Hadoop
To download the latest stable version, go to Hadoop Releases and check the latest release. On the Hadoop Releases page, follow the Download a release now! link to find a mirror site for your download. Now, just copy the link to the hadoop-0.23.9.tar.gz file (the version used in this post). It will be used in the second command line below to download hadoop-0.23.9.tar.gz straight to the desired folder.
$ cd /usr/local
$ sudo wget http://ftp.unicamp.br/pub/apache/hadoop/common/had...op-0.23.9/hadoop-0.23.9.tar.gz
# Extract hadoop-0.23.9 files
$ sudo tar xzf hadoop-0.23.9.tar.gz
# Remove the hadoop-0.23.9.tar.gz file we downloaded
$ sudo rm hadoop-0.23.9.tar.gz
# Create a symbolic link to make things easier, but it is not required.
$ sudo ln -s hadoop-0.23.9 hadoop
# The next command gives ownership of hadoop-0.23.9 directory, files
# and sub-directories to the hadoop user.
$ sudo chown -R hadoop:hadoopgroup hadoop-0.23.9
STEP 6: Setting up JAVA_HOME for Hadoop
Now that you have Java installed, let's configure it for Hadoop. Previous versions of Hadoop shipped with a conf/hadoop-env.sh file for setting environment variables. Hadoop 0.23.9 does not include this file, so we will create it manually inside the $HADOOP_HOME/etc/hadoop folder and set the JAVA_HOME variable.
First, let's check where Java is installed:
$ echo $JAVA_HOME
/usr/lib/jvm/jdk1.7.0_25 # your JDK home directory
Now let's create the hadoop-env.sh file:
$ sudo vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh
and add the following line:
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_25
STEP 7: Disabling IPv6
Given the fact that Apache Hadoop is not currently supported on IPv6 networks (see Hadoop and IPv6), we will disable IPv6 in Java by editing hadoop-env.sh again.
$ sudo vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Add the following line at the bottom of the file:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
I am not sure whether disabling IPv6 on Ubuntu 13.04 itself is really necessary (it worked without this step for me in test environments), but just in case, you can do it by adding the following lines at the end of the sysctl.conf file.
$ sudo vi /etc/sysctl.conf
Add the following lines at the end:
# IPv6 configuration
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Reload the sysctl.conf configuration:
$ sudo sysctl -p
Check that IPv6 is disabled by typing:
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
Response:
0 – means that IPv6 is enabled.
1 – means that IPv6 is disabled, which is what we expect.
STEP 8: Configuring HDFS
The Hadoop Distributed File System (HDFS) is a reliable distributed file system designed to run on ordinary hardware and to store very large amounts of data (terabytes or even petabytes). HDFS is highly fault-tolerant because, from a practical standpoint, it was built on the premise that hardware failure is the norm rather than the exception (see HDFS Architecture Guide). Thus, failure detection, distributed replication and quick recovery are at the core of its architecture.
The configuration settings are a set of key-value pairs of the format:
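In the XML configuration files, that means each setting is written as a property element with a name and a value inside a configuration element (a generic sketch; property.name and property.value are just placeholders):

<configuration>
  <property>
    <name>property.name</name>
    <value>property.value</value>
  </property>
</configuration>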
The main configurations are stored in the 3 files below:
* core-site.xml – contains default values for core Hadoop properties.
* mapred-site.xml – contains configuration information for MapReduce properties.
* hdfs-site.xml – contains server side configuration of your distributed file system.
First, let’s create a temporary directory for Hadoop
$ sudo mkdir /home/hadoop/tmp
$ sudo chown -R hadoop:hadoopgroup /home/hadoop/tmp
# Set folder permissions
$ sudo chmod 750 /home/hadoop
and now set the core-site.xml properties:
$ sudo vi /usr/local/hadoop/etc/hadoop/core-site.xml
Update it with the following content:
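Below is a minimal sketch of core-site.xml for this single-node setup. It uses the /home/hadoop/tmp directory we just created; the hdfs://localhost:9000 address for fs.defaultFS is an assumption, so adjust it if you prefer another port:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>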
hadoop.tmp.dir - A base for other temporary directories.
fs.defaultFS - The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The URI's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The URI's authority is used to determine the host, port, etc. for a filesystem.
If you have any questions about core-site.xml configuration options, see here for more details.
As we are configuring a single node, we can edit the mapred-site.xml file and configure it as follows:
# Create a copy of the template mapred-site.xml file
$ sudo cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
# Edit the copy we just created
$ sudo vi /usr/local/hadoop/etc/hadoop/mapred-site.xml
Update it with the following content:
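A minimal sketch of mapred-site.xml for a single node; only the property described below is set, and the localhost:54311 host:port value is an assumption (any unused port will do):

<configuration>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>localhost:54311</value>
  </property>
</configuration>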
mapreduce.jobtracker.address - The host and port that the MapReduce job tracker runs at. If “local”, then jobs are run in-process as a single map and reduce task.
If you have any questions about mapred-site.xml configuration options, see here for more details.
By default, Hadoop will place DFS data node blocks in file://${hadoop.tmp.dir}/dfs/data (using the hadoop.tmp.dir property you have just configured in core-site.xml). This is fine while still in development or evaluation, but you should probably override this default value in a production system.
It's a little bit of work, but you're going to have to do it anyway, so let's just create the directories now:
$ sudo mkdir /home/hadoop/hdfs
$ sudo chown -R hadoop:hadoopgroup /home/hadoop/hdfs
$ sudo chmod 750 /home/hadoop/hdfs
Open hdfs-site.xml for editing:
$ sudo vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Update the content as below:
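A minimal sketch of hdfs-site.xml for this single-node setup: dfs.replication is set to 1 because we only have one DataNode, and dfs.datanode.data.dir points to the /home/hadoop/hdfs directory created above:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hdfs</value>
  </property>
</configuration>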
dfs.replication - Default block replication. The actual number of replications can be specified when the file is created. The default value of 3 is used if replication is not specified at create time.
dfs.datanode.data.dir - Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
If you have any questions about hdfs-site.xml configuration options, see here.
STEP 9: Formatting the NameNode
Before we start adding files to HDFS, we must format it. The command below will do it for us.
$ su - hadoop # switch to account "hadoop"
$ /usr/local/hadoop/bin/hdfs namenode -format
STEP 10: Starting the services
Now that we have formatted HDFS, use the following commands to launch Hadoop:
# Switch to account "hadoop" first!
$ /usr/local/hadoop/sbin/start-dfs.sh
$ /usr/local/hadoop/sbin/start-yarn.sh
Given that Hadoop is written in the Java programming language, we can use the Java Process Status tool (jps) to check which processes are currently running in the JVM.
$ jps
# Initiated by start-dfs.sh
5848 Jps
5795 SecondaryNameNode
5375 NameNode
5567 DataNode
# Initiated by start-yarn.sh
5915 ResourceManager
6101 NodeManager
STEP 11: Running a test job
To make sure everything was configured correctly, we will use the wordcount example that comes with Hadoop. It reads text files from a specified folder and lists, in another file, the number of times each word occurs. First, let's create a folder for our examples and download the plain-text book A JOURNEY TO THE CENTER OF THE EARTH by Jules Verne into this folder.
# The -p option makes parent directories as needed. In practice, all folders in the mkdir path are created.
$ mkdir -p /usr/local/hadoop/examples/jverne
$ cd /usr/local/hadoop/examples/jverne
$ wget http://www.textfiles.com/etext/FICTION/center_earth
# Copy the downloaded file to HDFS
$ cd /usr/local/hadoop
$ ./bin/hdfs dfs -copyFromLocal ./examples /
Check if the file was copied
$ ./bin/hdfs dfs -ls /examples/jverne
Found 1 items
-rw-r--r-- 1 hadoop supergroup 489319 2013-08-02 20:40 /examples/jverne/center_earth
If you have any questions regarding Hadoop shell commands, see Hadoop Shell Commands.
So, let’s run the sample itself!
$ ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-0.23.9.jar wordcount /examples/jverne /examples/jverne/output
# To print out the results
$ ./bin/hdfs dfs -cat /examples/jverne/output/part-r-00000
...
youthful 1
zeal 1
zero! 1
zigzag 2
zigzags 1
STEP 12: Stopping all services
In order to stop the services use:
# if you are already using account "hadoop"
$ /usr/local/hadoop/sbin/stop-dfs.sh
$ /usr/local/hadoop/sbin/stop-yarn.sh
Supplement:
* Hadoop: leaving "Name node is in safe mode"
When the distributed file system starts up, it initially enters safe mode. While the file system is in safe mode, its contents cannot be modified or deleted until safe mode ends. Safe mode exists mainly so that, at startup, the system can check the validity of the data blocks on each DataNode and, according to its policy, replicate or delete some of the blocks as necessary...
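If the NameNode stays in safe mode and you need to leave it manually, the hdfs dfsadmin tool can do it (assuming the same /usr/local/hadoop installation path used above):

$ /usr/local/hadoop/bin/hdfs dfsadmin -safemode get   # check the current safe mode status
$ /usr/local/hadoop/bin/hdfs dfsadmin -safemode leave # force the NameNode out of safe mode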