Source: here
Preface:
In this post I will explain how to install Hadoop on a brand new Linux installation. This post is based on Michael G. Noll's post Running Hadoop on Ubuntu Linux (Single-Node Cluster), which is still very useful but somewhat outdated. We assume that you already have Ubuntu 13.04 installed and running. If not, you can download it here. Once Ubuntu 13.04 is installed, we are ready to get Hadoop running.
STEP 1: Install Java
Since Ubuntu no longer comes with Java, the first thing we have to do is install it. For the sake of simplicity, I will not go through this step by step; the post HOW TO INSTALL ORACLE JAVA 7 UPDATE 25 ON UBUNTU 13.04 LINUX has already covered it brilliantly.
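In short, that guide boils down to something like the following sketch, assuming the WebUpd8 PPA that packaged the Oracle Java 7 installer at the time:

```
# Add the WebUpd8 PPA and install the Oracle Java 7 installer package
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer
```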
STEP 2: Install SSH
As we are installing Hadoop on a clean installation of Ubuntu 13.04, we also need an SSH server installed. A distributed Hadoop cluster requires SSH because it is through SSH that Hadoop manages its nodes, e.g. starting and stopping slave nodes. The following command will do that.
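A minimal sketch using Ubuntu's package manager (openssh-server provides the SSH daemon):

```
sudo apt-get install openssh-server
```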
STEP 3: Create a dedicated user
A new user is not required, but in a large-scale environment I strongly recommend creating a separate user account dedicated exclusively to Hadoop. This allows you to restrict its permissions to the minimum Hadoop needs. The account does not need extra privileges such as sudo; it only needs read and write access to a few directories in order to perform Hadoop tasks.
Now let's create a dedicated user for Hadoop:
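For example, assuming a group named hadoop and a user named hduser (these names are just a common convention, not a requirement):

```
# Create a dedicated group and user for Hadoop
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
```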
STEP 4: Configuring passphraseless SSH
To avoid entering a passphrase every time Hadoop interacts with its nodes, let's create an RSA key pair to manage authentication. The authorized_keys file holds the public keys that are allowed to authenticate into the account they are added to.
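A sketch, run as the dedicated Hadoop user created above:

```
# Ensure the .ssh directory exists, then generate an RSA key pair with an empty passphrase
mkdir -p ~/.ssh && chmod 700 ~/.ssh
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
# Allow the new public key to log in to this account
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# Verify that SSH to localhost now works without a passphrase
ssh localhost
```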
STEP 5: Downloading Hadoop
To download the latest stable version, go to the Hadoop Releases page and check the latest release. From there, follow the Download a release now! link to find a mirror site for your download. Now just copy the link to the hadoop-0.23.9.tar.gz file (the version used in this post). It will be used in the second command below to download hadoop-0.23.9.tar.gz straight into the desired folder.
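A sketch, assuming Hadoop is unpacked under /usr/local and that you replace the mirror URL with the one you copied:

```
cd /usr/local
# Replace MIRROR with the mirror link you copied from the releases page
sudo wget http://MIRROR/hadoop/common/hadoop-0.23.9/hadoop-0.23.9.tar.gz
sudo tar xzf hadoop-0.23.9.tar.gz
sudo mv hadoop-0.23.9 hadoop
sudo chown -R hduser:hadoop /usr/local/hadoop
```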
STEP 6: Setting up JAVA_HOME for Hadoop
Now that you have Java installed, let's configure it for Hadoop. In previous versions of Hadoop, the file conf/hadoop-env.sh was provided for setting environment variables. Hadoop 0.23.9 does not ship with this file, so we will create it manually inside the $HADOOP_HOME/etc/hadoop folder and set the JAVA_HOME variable there.
First, let's check where Java is installed:
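One way to find it, assuming the Oracle package from Step 1 (which typically installs under /usr/lib/jvm):

```
readlink -f $(which java)
# e.g. /usr/lib/jvm/java-7-oracle/jre/bin/java, so JAVA_HOME would be /usr/lib/jvm/java-7-oracle
```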
Now let's create the hadoop-env.sh file:
and add the following line:
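A sketch, assuming Hadoop lives in /usr/local/hadoop and Java under /usr/lib/jvm/java-7-oracle (adjust both paths to your system):

```
# Create/open the file in the Hadoop configuration folder
nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh

# Add this line, pointing to the Java location found above
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
```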
STEP 7: Disabling IPv6
Given that Apache Hadoop is not currently supported on IPv6 networks (see Hadoop and IPv6), we will disable IPv6 in Java by editing hadoop-env.sh again.
Add the following line at the bottom of the file:
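The usual line is the JVM option that prefers the IPv4 stack:

```
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true"
```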
I am not sure whether disabling IPv6 on Ubuntu 13.04 itself is really necessary (it worked for me without this step in test environments), but just in case, you can do it by adding the following lines at the end of the sysctl.conf file.
Add these lines at the end:
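These are the commonly used settings:

```
# Disable IPv6 system-wide
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
```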
Reload the sysctl.conf configuration:
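One way to do this:

```
sudo sysctl -p
```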
Check that IPv6 is disabled by typing:
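For example:

```
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
```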
Response:
0 – means that IPv6 is enabled.
1 – means that IPv6 is disabled, which is what we expect.
STEP 8: Configuring HDFS
The Hadoop Distributed File System (HDFS) is a reliable distributed file system designed to run on ordinary hardware and to store very large amounts of data (terabytes or even petabytes). HDFS is highly fault-tolerant because it was built on the premise that hardware failure is the norm rather than the exception (see the HDFS Architecture Guide). Thus, failure detection, distributed replication, and quick recovery are at the core of its architecture.
The configuration settings are a set of key-value pairs of the format:
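Hadoop's XML configuration files wrap each pair in a property element, roughly like this:

```
<property>
  <name>property.name</name>
  <value>property.value</value>
</property>
```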
The main configurations are stored in the three files below:
* core-site.xml – contains default values for core Hadoop properties.
* mapred-site.xml – contains configuration information for MapReduce properties.
* hdfs-site.xml – contains the server-side configuration of your distributed file system.
First, let’s create a temporary directory for Hadoop
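A sketch, assuming /app/hadoop/tmp as the temporary directory and the hduser account from Step 3 (any path writable by the Hadoop user will do):

```
sudo mkdir -p /app/hadoop/tmp
sudo chown hduser:hadoop /app/hadoop/tmp
```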
and now set the core-site.xml properties.
Update it with the content below:
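A sketch for a single-node setup, assuming the temporary directory created above and HDFS listening on localhost:9000 (the port is a common choice, not a requirement); the file lives in $HADOOP_HOME/etc/hadoop:

```
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```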
hadoop.tmp.dir - A base for other temporary directories.
fs.defaultFS - The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri’s scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri’s authority is used to determine the host, port, etc. for a filesystem.
If you have any questions about core-site.xml configuration options, see here for more details.
As we are configuring a single node, we can edit the mapred-site.xml file and configure it as follows:
Update it with the content below:
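A sketch, assuming the job tracker runs on localhost and the commonly used port 54311:

```
<configuration>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>localhost:54311</value>
  </property>
</configuration>
```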
mapreduce.jobtracker.address - The host and port that the MapReduce job tracker runs at. If “local”, then jobs are run in-process as a single map and reduce task.
If you have any questions about mapred-site.xml configuration options, see here for more details.
By default, Hadoop will place DFS data node blocks in file://${hadoop.tmp.dir}/dfs/data (the property you have just configured in core-site.xml). This is fine while still in development or evaluation, but you probably should override this default value in a production system.
It's a little extra work, but you're going to have to do it anyway, so let's create those directories now:
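For example, assuming the data node blocks will live under /usr/local/hadoop_store (any directory owned by the Hadoop user works):

```
sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
sudo chown -R hduser:hadoop /usr/local/hadoop_store
```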
Open hdfs-site.xml for editing
Update the content as below:
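A sketch, assuming the data node directory created above and a replication factor of 1, which suits a single-node cluster (the file lives in $HADOOP_HOME/etc/hadoop):

```
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop_store/hdfs/datanode</value>
  </property>
</configuration>
```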
dfs.replication - Default block replication. The actual number of replications can be specified when the file is created. The default value of 3 is used if replication is not specified at creation time.
dfs.datanode.data.dir - Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
If you have any questions about hdfs-site.xml configuration option, see here.
STEP 9: Formatting the NameNode
Before we start adding files to HDFS, we must format it. The command below will do that for us.
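Assuming Hadoop lives in /usr/local/hadoop, run as the dedicated user:

```
cd /usr/local/hadoop
bin/hdfs namenode -format
```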
STEP 10: Starting the services
Now that we have formatted HDFS, use the following commands to launch Hadoop:
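A sketch, again assuming /usr/local/hadoop (Hadoop 0.23 ships its start scripts in sbin/):

```
cd /usr/local/hadoop
sbin/start-dfs.sh
sbin/start-yarn.sh
```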
Since Hadoop is written in the Java programming language, we can use the Java Process Status tool (jps) to check which Java processes are currently running.
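For example:

```
jps
# You should see entries such as NameNode, DataNode, SecondaryNameNode,
# ResourceManager and NodeManager (plus Jps itself)
```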
STEP 11: Running job test
To make sure everything was configured correctly, we will use the wordcount example that comes with Hadoop. It reads text files from a specified folder and lists, in another file, the number of times each word occurs. First, let's create a folder for our examples and download the plain-text book A JOURNEY TO THE CENTER OF THE EARTH by Jules Verne into it.
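A sketch, assuming a local /tmp/examples folder and a plain-text copy of the book; the URL below is a placeholder for whatever link you use (Project Gutenberg hosts one):

```
mkdir /tmp/examples
cd /tmp/examples
# Hypothetical placeholder URL; substitute the real link to the plain-text book
wget -O journey.txt http://EXAMPLE/journey-to-the-center-of-the-earth.txt

# Copy the book into HDFS
/usr/local/hadoop/bin/hdfs dfs -mkdir /input
/usr/local/hadoop/bin/hdfs dfs -copyFromLocal journey.txt /input
```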
Check that the file was copied into HDFS:
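For example, using the /input folder assumed above:

```
/usr/local/hadoop/bin/hdfs dfs -ls /input
```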
If you have any questions regarding Hadoop shell commands, see Hadoop Shell Commands.
So, let’s run the sample itself!
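A sketch, assuming the examples jar shipped with Hadoop 0.23.9 and the /input folder created above:

```
cd /usr/local/hadoop
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-0.23.9.jar wordcount /input /output

# Inspect the result (output file names may vary)
bin/hdfs dfs -cat /output/part-r-00000
```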
STEP 12: Stopping all services
In order to stop the services use:
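A sketch, mirroring the start scripts used above:

```
cd /usr/local/hadoop
sbin/stop-yarn.sh
sbin/stop-dfs.sh
```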
Supplement:
* Hadoop: leaving "Name node is in safe mode"