Wednesday, November 19, 2014

[ Doc ] Hadoop 2.5.1 - CentOS Cluster Setup (1 namenode+2 datanodes)

Source From Here
Prerequisite
This document describes how to install, configure and manage non-trivial Hadoop clusters ranging from a few nodes to extremely large clusters with thousands of nodes. To play with Hadoop, you may first want to install it on a single machine (See Single Node Setup). Here we are going to use CentOS 6.4 to build a cluster with one name node and two data nodes. To begin with, we should have our machines ready!

Prepare three machines (or VMs) that can SSH to each other and to themselves without a password (refer to SSH 免密碼登入, a guide on passwordless SSH login). Then assign IPs as below:
* NameNode: 192.168.192.128
* DataNode1: 192.168.192.129
* DataNode2: 192.168.192.130
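
A minimal way to set up the passwordless SSH just mentioned (a sketch, not from the original post; run it on every node as the user that will start the Hadoop daemons, and note the hostnames assume the /etc/hosts mapping below):
$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa # Generate a key pair with an empty passphrase
$ ssh-copy-id hduser@centosnn
$ ssh-copy-id hduser@centosd1
$ ssh-copy-id hduser@centosd2
$ ssh centosd1 hostname # Should print "centosd1" without prompting for a password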

Then modify /etc/hosts on every node to map each hostname to its corresponding IP address:
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
192.168.192.129 centosd1
192.168.192.128 centosnn
192.168.192.130 centosd2
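
To confirm the mappings resolve on each node (a quick check, not in the original post):
$ getent hosts centosnn centosd1 centosd2 # Should print the IP/hostname pairs above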
Create an account hduser with password "hduser":
$ sudo useradd -m hduser
$ sudo passwd hduser
# Give password "hduser"

Download a stable version of Hadoop from the Apache mirrors (our target is 2.5.1; do this on all nodes):
$ su hduser
$ cd ~ # Enter hduser's home
$ wget http://ftp.mirror.tw/pub/apache/hadoop/common/hadoop-2.5.1/hadoop-2.5.1.tar.gz
$ tar -xvf hadoop-2.5.1.tar.gz
$ ln -s /home/hduser/hadoop-2.5.1 hadoop # Build a soft link for easier management!
$ sudo vim /etc/profile
...
export HADOOP_HOME=/home/hduser/hadoop
export PATH=$HADOOP_HOME/bin/:$HADOOP_HOME/sbin/:$PATH
$ source /etc/profile # Reload the profile so the new PATH takes effect
$ hadoop version
Hadoop 2.5.1

Running Hadoop in Non-Secure Mode
The following sections describe how to configure a Hadoop cluster.

Configuring Environment of Hadoop Daemons
Administrators can configure individual daemons using the environment variables shown in the table below (set them in $HADOOP_HOME/etc/hadoop/hadoop-env.sh or yarn-env.sh):

Daemon                          Environment Variable
NameNode                        HADOOP_NAMENODE_OPTS
DataNode                        HADOOP_DATANODE_OPTS
Secondary NameNode              HADOOP_SECONDARYNAMENODE_OPTS
ResourceManager                 YARN_RESOURCEMANAGER_OPTS
NodeManager                     YARN_NODEMANAGER_OPTS
WebAppProxy                     YARN_PROXYSERVER_OPTS
Map Reduce Job History Server   HADOOP_JOB_HISTORYSERVER_OPTS

For example, to configure the Namenode to use parallelGC, the following statement should be added to hadoop-env.sh:
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"
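
hadoop-env.sh is also where JAVA_HOME must be set for the daemons; a minimal sketch (the JDK path below is an assumption, adjust it to your installation):
export JAVA_HOME=/usr/java/default # Assumption: point this at your actual JDK location
export HADOOP_HEAPSIZE=1000 # Maximum heap size in MB for the daemons (1000 is the default)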
Configuring the Hadoop Daemons in Non-Secure Mode
This section deals with important parameters to be specified in the given configuration files:
$HADOOP_HOME/etc/hadoop/core-site.xml

For both name node and data node, the template is:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://centosnn:9000</value>
  </property>
</configuration>
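
You can confirm the daemons will pick this value up (a quick check, not in the original post):
$ hdfs getconf -confKey fs.defaultFS
hdfs://centosnn:9000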
$HADOOP_HOME/etc/hadoop/hdfs-site.xml


Template for name node (remember to mkdir /home/hduser/namedir):
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hduser/namedir</value>
  </property>
</configuration>
Template for data node (remember to mkdir /home/hduser/hdfs):
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hduser/hdfs</value>
  </property>
</configuration>
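
As noted in the two templates above, create the local directories before starting HDFS:
$ mkdir -p /home/hduser/namedir # On the name node
$ mkdir -p /home/hduser/hdfs # On each data node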
$HADOOP_HOME/etc/hadoop/yarn-site.xml

Template for name node:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
Template for data node:
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>centosnn:8032</value>
    <description>Enter your ResourceManager hostname.</description>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>centosnn:8031</value>
    <description>Enter your ResourceManager hostname.</description>
  </property>
</configuration>
$HADOOP_HOME/etc/hadoop/mapred-site.xml

Template for name node:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
When all configuration files are ready, modify $HADOOP_HOME/etc/hadoop/slaves to add the two data node IPs:
192.168.192.129
192.168.192.130
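
On a brand-new cluster, HDFS must be formatted once on the name node before the daemons are started (a step not shown in the original post; note that it erases any existing HDFS metadata):
$ hdfs namenode -format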
Now it is time to start the cluster from the name node:
$ start-dfs.sh
$ start-yarn.sh
$ jps # Make sure the daemons below are up!
7964 NameNode
8323 ResourceManager
9370 Jps
8158 SecondaryNameNode
$ hdfs dfsadmin -report # Make sure the two data nodes are up too!
Configured Capacity: 37558796288 (34.98 GB)
Present Capacity: 28391260193 (26.44 GB)
DFS Remaining: 28385787904 (26.44 GB)
DFS Used: 5472289 (5.22 MB)
DFS Used%: 0.02%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Live datanodes (2):

Name: 192.168.192.130:50010 (centosd2)
...
Name: 192.168.192.129:50010 (centosd1)
...
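
As a quick smoke test (not in the original post), put a file into HDFS and run the bundled pi example:
$ hdfs dfs -mkdir -p /user/hduser
$ hdfs dfs -put hadoop-2.5.1.tar.gz /user/hduser/ # Write a file into HDFS
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar pi 2 10 # Run a small MapReduce job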

Hadoop Rack Awareness
The HDFS and YARN components are rack-aware. The NameNode and the ResourceManager obtain the rack information of the slaves in the cluster by invoking an API resolve in an administrator-configured module.

The API resolves the DNS name (or IP address) to a rack id. The site-specific module to use can be configured via the configuration item topology.node.switch.mapping.impl. The default implementation runs a script/command configured with topology.script.file.name. If topology.script.file.name is not set, the rack id /default-rack is returned for any passed IP address.
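
For illustration (not from the original post), a minimal topology script for this three-node cluster might look like the sketch below; the script path and rack layout are assumptions. Point topology.script.file.name at it and make it executable.
#!/bin/bash
# topology.sh - print one rack id per host/IP passed as an argument (hypothetical layout)
for node in "$@"; do
  case "$node" in
    192.168.192.128|192.168.192.129) echo "/rack1" ;; # centosnn and centosd1 (assumed)
    192.168.192.130)                 echo "/rack2" ;; # centosd2 (assumed)
    *)                               echo "/default-rack" ;;
  esac
done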

Logging
Hadoop uses Apache log4j via the Apache Commons Logging framework for logging. Edit the $HADOOP_HOME/etc/hadoop/log4j.properties file to customize the Hadoop daemons' logging configuration (log formats and so on). Alternatively, you can change the log level dynamically for debugging:
$ export HADOOP_ROOT_LOGGER=DEBUG,console
$ hadoop fs -ls # You will see more debug messages!
14/11/19 03:45:16 DEBUG util.Shell: setsid exited with exit code 0
...
DEBUG ipc.Client: IPC Client (1699966644) connection to centosnn/192.168.192.128:9000 from hduser: closed
DEBUG ipc.Client: IPC Client (1699966644) connection to centosnn/192.168.192.128:9000 from hduser: stopped, remaining connections 0
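
A running daemon's log level can also be changed without a restart, through its HTTP port (a sketch, assuming the NameNode web UI listens on centosnn:50070):
$ hadoop daemonlog -getlevel centosnn:50070 org.apache.hadoop.hdfs.server.namenode.NameNode
$ hadoop daemonlog -setlevel centosnn:50070 org.apache.hadoop.hdfs.server.namenode.NameNode DEBUG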

Web Interfaces
Once the Hadoop cluster is up and running, check the web UIs of the components as described below:

NameNode                        http://centosnn:50070/ (default HTTP port 50070)
ResourceManager                 http://centosnn:8088/ (default HTTP port 8088)
MapReduce JobHistory Server     http://<jhs_host>:19888/ (default HTTP port 19888; only if you start the history server)
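
To verify the UIs respond from a shell (a quick check, not in the original post):
$ curl -sI http://centosnn:50070/ | head -n 1 # Expect an HTTP 2xx/3xx status line
$ curl -sI http://centosnn:8088/ | head -n 1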
Supplement
Apache Hadoop 2.5.1 - MapReduce Tutorial
Hadoop2.2.0遇到NativeLibraries錯誤的解決過程 (resolving the NativeLibraries error in Hadoop 2.2.0)
$ export HADOOP_ROOT_LOGGER=DEBUG,console # See more debug information

Hadoop 2.5.1 - Setting up a Single Node Cluster

1 comment:

1. Because pasting XML here gets garbled, if you need the configuration template files, you can download them from the link below:
   https://www.space.ntu.edu.tw/navigate/s/4D93CF9BC99E424E8F10FFFC773E323CQQY
