程式扎記

Source: here
Preface:
In this post I will explain how to install Hadoop on a brand new Linux distribution. It is based on Michael G. Noll's post Running Hadoop on Ubuntu Linux (Single-Node Cluster), which is still very useful but somewhat outdated. We assume that you already have Ubuntu 13.04 installed and running. If not, you can download it here. Once Ubuntu 13.04 is installed, we are ready to get Hadoop running.

STEP 1: Install Java
Since Ubuntu no longer comes with Java, the first thing we have to do is install it. For the sake of simplicity, I will not go through this step by step. The post HOW TO INSTALL ORACLE JAVA 7 UPDATE 25 ON UBUNTU 13.04 LINUX already covers it thoroughly.

STEP 2: Install SSH
As we are installing Hadoop on a clean version of Ubuntu 13.04, we also need an SSH server installed. A distributed Hadoop cluster requires SSH because it is through SSH that Hadoop manages its nodes, e.g. starting and stopping slave nodes. The following command installs it.

$ sudo apt-get install openssh-server

STEP 3: Create a dedicated user
A new user is not required, but in a large-scale environment I strongly recommend that you create a separate user account dedicated exclusively to Hadoop. This allows you to restrict the permissions to the minimum needed by Hadoop. This account does not need extra privileges such as sudo. It only needs read and write access to some directories in order to perform Hadoop tasks.

Now let’s create a dedicated user for Hadoop:

$ sudo addgroup hadoopgroup
$ sudo adduser --ingroup hadoopgroup hadoop

STEP 4: Configuring passphraseless SSH
To avoid entering a passphrase every time Hadoop interacts with its nodes, let’s create an RSA key pair to manage authentication. The authorized_keys file holds the public keys that are allowed to authenticate into the account the file belongs to.

$ su - hadoop # switch to the hadoop account

# Create an RSA key pair
# The -P "" option specifies that an empty passphrase should be used
$ ssh-keygen -t rsa -P ""

# Append the public key of the generated RSA key pair to the authorized_keys file
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ exit
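
With its default StrictModes setting, sshd may ignore authorized_keys if the file or its directory is writable by other users, so it can be worth tightening the permissions after creating the key. The sketch below is only an illustration: secure_ssh_dir is a hypothetical helper (not part of OpenSSH or Hadoop), and it is demonstrated on a scratch directory instead of the real $HOME/.ssh.

```shell
# secure_ssh_dir is a hypothetical helper: it restricts an SSH directory
# to its owner (700) and its authorized_keys file to owner read/write (600).
secure_ssh_dir() {
    dir="$1"
    chmod 700 "$dir" || return 1
    if [ -f "$dir/authorized_keys" ]; then
        chmod 600 "$dir/authorized_keys" || return 1
    fi
}

# Demonstration on a scratch directory with deliberately loose permissions.
demo_dir="$(mktemp -d)"
touch "$demo_dir/authorized_keys"
chmod 644 "$demo_dir/authorized_keys"

secure_ssh_dir "$demo_dir"
ls -ld "$demo_dir" "$demo_dir/authorized_keys"
```

On the real machine the call would be secure_ssh_dir "$HOME/.ssh", run as the hadoop user. Afterwards, ssh localhost from that account should log in without asking for a passphrase; the first connection asks you to confirm the host key.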

STEP 5: Downloading Hadoop
To download the latest stable version, go to Hadoop Releases and check the latest release. On the Hadoop Releases page, follow the Download a release now! link to find a mirror site for your download. Now copy the link to the hadoop-0.23.9.tar.gz file (the version used in this post). It is used in the second command below to download hadoop-0.23.9.tar.gz straight into the desired folder.

$ cd /usr/local
$ sudo wget http://ftp.unicamp.br/pub/apache/hadoop/common/had...op-0.23.9/hadoop-0.23.9.tar.gz

# Extract hadoop-0.23.9 files
$ sudo tar xzf hadoop-0.23.9.tar.gz

# Remove the hadoop-0.23.9.tar.gz file we downloaded
$ sudo rm hadoop-0.23.9.tar.gz

# Create a symbolic link to make things easier, but it is not required.
$ sudo ln -s hadoop-0.23.9 hadoop

# The next command gives ownership of hadoop-0.23.9 directory, files
# and sub-directories to the hadoop user.
$ sudo chown -R hadoop:hadoopgroup hadoop-0.23.9

STEP 6: Setting up JAVA_HOME for Hadoop
Now that you have Java installed, let’s configure it for Hadoop. Previous versions of Hadoop shipped a conf/hadoop-env.sh file for setting environment variables. Hadoop 0.23.9 does not include this file, so we will create it manually inside the $HADOOP_HOME/etc/hadoop folder and set the JAVA_HOME variable there.

First, let's check where Java is installed:

$ echo $JAVA_HOME
/usr/lib/jvm/jdk1.7.0_25 # your JDK home directory
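
If echo $JAVA_HOME prints nothing, the JDK home can usually be derived from the location of the java executable. The following is a minimal sketch, assuming the usual JDK layout where the binary lives in bin/java or jre/bin/java under the JDK home; java_home_from_binary is a hypothetical helper name.

```shell
# java_home_from_binary is a hypothetical helper: it strips the trailing
# /bin/java (and a trailing /jre, if present) from the path of a java binary.
java_home_from_binary() {
    path="$1"
    path="${path%/bin/java}"
    path="${path%/jre}"
    printf '%s\n' "$path"
}

# Both layouts resolve to the same JDK home directory.
java_home_from_binary /usr/lib/jvm/jdk1.7.0_25/bin/java
java_home_from_binary /usr/lib/jvm/jdk1.7.0_25/jre/bin/java
```

On a real machine the argument would be the resolved binary, e.g. java_home_from_binary "$(readlink -f "$(command -v java)")".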

Now let’s create the hadoop-env.sh file:

$ sudo vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh

and add the following line:

export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_25

STEP 7: Disabling IPv6
Since Apache Hadoop is not currently supported on IPv6 networks (see Hadoop and IPv6), we will make Java prefer IPv4 by editing hadoop-env.sh again.

$ sudo vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh

Add the following line at the bottom of the file:

export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

I am not sure whether also disabling IPv6 at the Ubuntu 13.04 level is really necessary (it worked without this step for me in test environments), but just in case, you can do it by adding the following lines at the end of the sysctl.conf file.

$ sudo vi /etc/sysctl.conf

Add the lines below at the end:

# IPv6 configuration
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

Reload the configuration from sysctl.conf:

$ sudo sysctl -p

Check that IPv6 is disabled by typing:

$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

Response:
0 – IPv6 is enabled.
1 – IPv6 is disabled, which is what we expect.
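
The same check can be wrapped in a small function so the 0/1 flag is reported in words. This is just a sketch; ipv6_state is a hypothetical helper name, and it takes an optional file argument so it can be tried without touching the real kernel setting.

```shell
# ipv6_state is a hypothetical helper: it reads the kernel flag file
# (or the file given as $1) and prints enabled, disabled or unknown.
ipv6_state() {
    flag_file="${1:-/proc/sys/net/ipv6/conf/all/disable_ipv6}"
    if [ ! -r "$flag_file" ]; then
        echo "unknown"
        return 0
    fi
    case "$(cat "$flag_file")" in
        1) echo "disabled" ;;
        0) echo "enabled" ;;
        *) echo "unknown" ;;
    esac
}

# Report the state of the machine the script runs on.
ipv6_state
```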

STEP 8: Configuring HDFS
The Hadoop Distributed File System (HDFS) is a reliable distributed file system designed to run on ordinary hardware and to store very large amounts of data (terabytes or even petabytes). HDFS is highly fault-tolerant because, from a practical standpoint, it was built on the premise that hardware failure is the norm rather than the exception (see HDFS Architecture Guide). Thus, failure detection, distributed replication and quick recovery are at the core of its architecture.

The configuration settings are a set of key-value pairs: each one is a <property> element that holds a <name> and a <value>, all wrapped in a single <configuration> root element.

The main configuration is stored in the 3 files below:

* core-site.xml – contains default values for core Hadoop properties.
* mapred-site.xml – contains configuration information for MapReduce properties.
* hdfs-site.xml – contains server side configuration of your distributed file system.

First, let’s create a temporary directory for Hadoop:

$ sudo mkdir /home/hadoop/tmp
$ sudo chown -R hadoop:hadoopgroup /home/hadoop/tmp

# Set folder permissions
$ sudo chmod 750 /home/hadoop

Now set the core-site.xml properties:

$ sudo vi /usr/local/hadoop/etc/hadoop/core-site.xml

Set the following properties:

hadoop.tmp.dir - A base for other temporary directories.

fs.defaultFS - The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The URI’s scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The URI’s authority is used to determine the host, port, etc. for a filesystem.
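
A minimal core-site.xml with these two properties might look like the sketch below, written with a heredoc instead of vi. The /home/hadoop/tmp value is the directory created above; the hdfs://localhost:54310 URI is an assumption for a single node (the port number in particular is only illustrative). CONF_DIR is a hypothetical variable that defaults to a scratch directory so the sketch can be tried without overwriting anything.

```shell
# CONF_DIR is a hypothetical variable. On the machine from this post it
# would be /usr/local/hadoop/etc/hadoop; the default is a scratch directory.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmp</value>
  </property>
  <!-- Assumption: single node on localhost; the port is illustrative -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/core-site.xml"
```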

If you have any questions about core-site.xml configuration options, see here for more details.

As we are configuring a single node, we can edit the mapred-site.xml file and configure it as follows:

# Create a copy of the template mapred-site.xml file
$ sudo cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

# Edit the copy we just created
$ sudo vi /usr/local/hadoop/etc/hadoop/mapred-site.xml

Set the following property:

mapreduce.jobtracker.address - The host and port that the MapReduce job tracker runs at. If “local”, then jobs are run in-process as a single map and reduce task.
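
As a sketch, a mapred-site.xml with this single property could be written as follows. The value local is an assumption chosen because the description above explains it; a host:port pair would be used instead to point at a running job tracker. CONF_DIR is again a hypothetical variable that defaults to a scratch directory.

```shell
# CONF_DIR is a hypothetical variable. On the machine from this post it
# would be /usr/local/hadoop/etc/hadoop; the default is a scratch directory.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

cat > "$CONF_DIR/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: "local" runs jobs in-process; use host:port otherwise -->
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>local</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/mapred-site.xml"
```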

If you have any questions about mapred-site.xml configuration options, see here for more details.

By default, Hadoop will place DFS data node blocks in file://${hadoop.tmp.dir}/dfs/data (the property you have just configured in core-site.xml). This is fine while still in development or evaluation, but you probably should override this default value in a production system.

It’s a little bit of work, but you are going to have to do it anyway, so let’s create the directory now:

$ sudo mkdir /home/hadoop/hdfs
$ sudo chown -R hadoop:hadoopgroup /home/hadoop/hdfs
$ sudo chmod 750 /home/hadoop/hdfs

Open hdfs-site.xml for editing:

$ sudo vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Set the following properties:

dfs.replication - Default block replication. The actual number of replications can be specified when the file is created. The default value of 3 is used if replication is not specified at create time.

dfs.datanode.data.dir - Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
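
An hdfs-site.xml along these lines could be sketched as below. The /home/hadoop/hdfs path is the directory created above; the replication value of 1 is an assumption that suits a single node, where only one DataNode exists to hold each block. CONF_DIR is a hypothetical variable that defaults to a scratch directory.

```shell
# CONF_DIR is a hypothetical variable. On the machine from this post it
# would be /usr/local/hadoop/etc/hadoop; the default is a scratch directory.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: one replica, since a single node has one DataNode -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/hdfs</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/hdfs-site.xml"
```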

If you have any questions about hdfs-site.xml configuration options, see here.

STEP 9: Formatting NameNode
Before we start adding files to HDFS, we must format it. The command below will do it for us.

$ su - hadoop # switch to account "hadoop"
$ /usr/local/hadoop/bin/hdfs namenode -format

STEP 10: Starting the services
Now that we have formatted HDFS, use the following commands to launch Hadoop:

# Switch to account "hadoop" first!
$ /usr/local/hadoop/sbin/start-dfs.sh
$ /usr/local/hadoop/sbin/start-yarn.sh

Since Hadoop is written in the Java programming language, we can use the Java Virtual Machine Process Status Tool (jps) to check which Hadoop processes are currently running.

$ jps
# Initiated by start-dfs.sh
5848 Jps
5795 SecondaryNameNode
5375 NameNode
5567 DataNode
# Initiated by start-yarn.sh
5915 ResourceManager
6101 NodeManager
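
To avoid comparing the list by eye, the jps output can be checked against the five daemons shown above. The following is a minimal sketch; missing_daemons is a hypothetical helper name, and the sample input deliberately leaves out the DataNode line.

```shell
# missing_daemons is a hypothetical helper: it reads jps output on stdin
# and prints every expected daemon name that does not appear in it.
missing_daemons() {
    jps_output="$(cat)"
    for daemon in NameNode SecondaryNameNode DataNode ResourceManager NodeManager; do
        # The second column of jps output is the process name; -x makes
        # grep match whole lines so NameNode does not match SecondaryNameNode.
        if ! printf '%s\n' "$jps_output" | awk '{print $2}' | grep -qx "$daemon"; then
            echo "$daemon"
        fi
    done
}

# Sample input with the DataNode line left out on purpose.
printf '%s\n' '5848 Jps' '5795 SecondaryNameNode' '5375 NameNode' \
    '5915 ResourceManager' '6101 NodeManager' | missing_daemons
```

On the real machine the usage would be jps | missing_daemons, run as the hadoop user; empty output means all five daemons were found.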

STEP 11: Running job test
To make sure everything was configured correctly, we will use the wordcount example that comes with Hadoop. It reads text files from a specified folder and writes to another file the number of times each word occurs. First, let’s create a folder for our examples and download the plain text book A JOURNEY TO THE CENTER OF THE EARTH by Jules Verne into this folder.

# The -p option creates parent directories as needed, so every folder in the mkdir path is created.
$ mkdir -p /usr/local/hadoop/examples/jverne
$ cd /usr/local/hadoop/examples/jverne
$ wget http://www.textfiles.com/etext/FICTION/center_earth

# Copy the downloaded file to HDFS
$ cd /usr/local/hadoop
$ ./bin/hdfs dfs -copyFromLocal ./examples /

Check that the file was copied:

$ ./bin/hdfs dfs -ls /examples/jverne
Found 1 items
-rw-r--r-- 1 hadoop supergroup 489319 2013-08-02 20:40 /examples/jverne/center_earth

If you have any questions regarding Hadoop shell commands, see Hadoop Shell Commands.

So, let’s run the sample itself!

$ ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-0.23.9.jar wordcount /examples/jverne /examples/jverne/output
# To print out the results
$ ./bin/hdfs dfs -cat /examples/jverne/output/part-r-00000
...
youthful 1
zeal 1
zero! 1
zigzag 2
zigzags 1
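
The result file is sorted by word, so the most frequent words are scattered through it. A small sketch for listing them first is shown below; top_words is a hypothetical helper name, and the sample input reuses a few of the lines above.

```shell
# top_words is a hypothetical helper: it reads "word count" lines on stdin,
# sorts them by count (highest first, ties broken by word) and prints the
# first N lines (default 10).
top_words() {
    sort -k2,2nr -k1,1 | head -n "${1:-10}"
}

# Sample input in the word<TAB>count layout of the wordcount output.
printf '%s\t%s\n' youthful 1 zeal 1 zigzag 2 zigzags 1 | top_words 2
```

Against the real output this would be used as ./bin/hdfs dfs -cat /examples/jverne/output/part-r-00000 | top_words 20.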

STEP 12: Stopping all services
To stop the services, use:

# if you are already using account "hadoop"
$ /usr/local/hadoop/sbin/stop-dfs.sh
$ /usr/local/hadoop/sbin/stop-yarn.sh

Supplement:
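* Hadoop: leaving "Name node is in safe mode"

When the distributed file system starts up, it begins in safe mode. While the file system is in safe mode, its contents can be neither modified nor deleted until safe mode ends. Safe mode exists mainly so that, at startup, the system can check the validity of the data blocks on each DataNode and, according to policy, replicate or delete some data blocks as needed...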


Tuesday, October 29, 2013

[ Article Collection ] HADOOP-0.23.9 SINGLE NODE SETUP ON UBUNTU 13.04
