This chapter covers:
* The building blocks of Hadoop and the daemons that run on a cluster
* Setting up your cluster and the SSH channels between its nodes
* The three operational modes of Hadoop and how to set them up
* Web-based tools for monitoring your cluster
After discussing the physical components of Hadoop in section 2.1, we’ll progress to setting up your cluster in sections 2.2 and 2.3. Section 2.3 will focus on the three operational modes of Hadoop and how to set them up. You’ll read about web-based tools that help you monitor your cluster in section 2.4.
2.1 The building blocks of Hadoop
On a fully configured cluster, “running Hadoop” means running a set of daemons, or resident programs, on the different servers in your network. These daemons have specific roles; some exist only on one server, some exist across multiple servers. The daemons include:
* NameNode
* DataNode
* Secondary NameNode
* JobTracker
* TaskTracker
We’ll discuss each one and its role within Hadoop in the following sections.
2.1.1 NameNode
Let’s begin with arguably the most vital of the Hadoop daemons—the NameNode. Hadoop employs a master/slave architecture for both distributed storage and distributed computation. The distributed storage system is called the Hadoop Distributed File System, or HDFS. The NameNode is the master of HDFS that directs the slave DataNode daemons to perform the low-level I/O tasks. The NameNode is the bookkeeper of HDFS; it keeps track of how your files are broken down into file blocks, which nodes store those blocks, and the overall health of the distributed filesystem.
The function of the NameNode is memory and I/O intensive. As such, the server hosting the NameNode typically doesn’t store any user data or perform any computations for a MapReduce program; this keeps the workload on that machine low. It also means that the NameNode server doesn’t double as a DataNode or a TaskTracker.
There is unfortunately a negative aspect to the importance of the NameNode—it’s a single point of failure of your Hadoop cluster. For any of the other daemons, if their host nodes fail for software or hardware reasons, the Hadoop cluster will likely continue to function smoothly or you can quickly restart it. Not so for the NameNode.
2.1.2 DataNode
Each slave machine in your cluster will host a DataNode daemon to perform the grunt work of the distributed filesystem—reading and writing HDFS blocks to actual files on the local filesystem. When you want to read or write an HDFS file, the file is broken into blocks and the NameNode tells your client which DataNode each block resides on. Your client communicates directly with the DataNode daemons to process the local files corresponding to those blocks. Furthermore, a DataNode may communicate with other DataNodes to replicate its data blocks for redundancy.
Figure 2.1 illustrates the roles of the NameNode and DataNodes. In this figure, we show two data files, one at /user/chuck/data1 and another at /user/james/data2. The data1 file takes up three blocks, which we denote 1, 2, and 3, and the data2 file consists of blocks 4 and 5. The contents of the files are distributed among the DataNodes. In this illustration, each block has three replicas. For example, block 1 (used for data1) is replicated over the three rightmost DataNodes. This ensures that if any one DataNode crashes or becomes inaccessible over the network, you’ll still be able to read the files.
DataNodes are constantly reporting to the NameNode. Upon initialization, each of the DataNodes informs the NameNode of the blocks it’s currently storing. After this mapping is complete, the DataNodes continually poll the NameNode to provide information regarding local changes as well as receive instructions to create, move, or delete blocks from the local disk.
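If you want to see this block-to-DataNode mapping for yourself on a running cluster, Hadoop’s fsck tool will report it. A minimal sketch, reusing the /user/chuck/data1 path from figure 2.1 and assuming the command is run from the top of the Hadoop installation:

    bin/hadoop fsck /user/chuck/data1 -files -blocks -locations

The -blocks and -locations options print each block of the file and the DataNodes that hold its replicas.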
2.1.3 Secondary NameNode
The Secondary NameNode (SNN) is an assistant daemon for monitoring the state of the cluster HDFS. Like the NameNode, each cluster has one SNN, and it typically resides on its own machine as well. No other DataNode or TaskTracker daemons run on the same server. The SNN differs from the NameNode in that this process doesn’t receive or record any real-time changes to HDFS. Instead, it communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration.
As mentioned earlier, the NameNode is a single point of failure for a Hadoop cluster, and the SNN snapshots help minimize the downtime and loss of data. Nevertheless, a NameNode failure requires human intervention to reconfigure the cluster to use the SNN as the primary NameNode. We’ll discuss the recovery process in chapter 8 when we cover best practices for managing your cluster.
2.1.4 JobTracker
The JobTracker daemon is the liaison between your application and Hadoop. Once you submit your code to your cluster, the JobTracker determines the execution plan by determining which files to process, assigns nodes to different tasks, and monitors all tasks as they’re running. Should a task fail, the JobTracker will automatically relaunch the task, possibly on a different node, up to a predefined limit of retries.
There is only one JobTracker daemon per Hadoop cluster. It’s typically run on a server as the master node of the cluster.
2.1.5 TaskTracker
As with the storage daemons, the computing daemons also follow a master/slave architecture: the JobTracker is the master overseeing the overall execution of a MapReduce job and the TaskTrackers manage the execution of individual tasks on each slave node. Figure 2.2 illustrates this interaction.
Figure 2.2 JobTracker and TaskTracker interaction.
Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns. Although there is a single TaskTracker per slave node, each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel.
One responsibility of the TaskTracker is to constantly communicate with the JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster.
2.2 Setting up SSH for a Hadoop cluster
When setting up a Hadoop cluster, you’ll need to designate one specific node as the master node. As shown in figure 2.3, this server will typically host the NameNode and JobTracker daemons. It’ll also serve as the base station contacting and activating the DataNode and TaskTracker daemons on all of the slave nodes. As such, we need to define a means for the master node to remotely access every node in your cluster.
Hadoop uses passphraseless SSH for this purpose. SSH utilizes standard public key cryptography to create a pair of keys for user verification—one public, one private. The master node’s public key is stored locally on every node in the cluster, and the master node uses its private key to prove its identity when it attempts to access a remote machine; the private key itself never leaves the master. The target machine checks the login attempt against the stored public key and grants access if it matches.
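A rough sketch of generating a passphraseless key pair on the master and distributing the public key to one slave (the hostname hadoop1 and the file paths are illustrative; adapt them to your cluster):

    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa              # generate a key pair with an empty passphrase
    scp ~/.ssh/id_rsa.pub hadoop1:~/master_key.pub        # copy the public key to a slave node
    ssh hadoop1 'cat ~/master_key.pub >> ~/.ssh/authorized_keys'   # authorize the master's key there

Repeat the last two steps for every node in the cluster (including the master itself if it will start daemons locally).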
Figure 2.3 Topology of a typical Hadoop cluster
For instructions on installing and setting up the Hadoop environment, you can refer to the guide here.
2.3 Running Hadoop
We need to configure a few things before running Hadoop. Let’s take a closer look at the Hadoop configuration directory:
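On a 0.20-series release, listing the conf/ directory shows files along these lines (the exact set varies by release):

    ls conf/
    core-site.xml     hadoop-env.sh    hdfs-site.xml    mapred-site.xml
    masters           slaves           log4j.properties ...

The files we care about in this chapter are hadoop-env.sh, the three *-site.xml files, and the masters and slaves files.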
The first thing you need to do is specify the location of Java on all the nodes, including the master. In hadoop-env.sh, define the JAVA_HOME environment variable to point to the Java installation directory.
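For example, if your JDK were installed under /usr/lib/jvm/java-6-sun (an assumed path; use whatever is correct on your machines), the line in hadoop-env.sh would read:

    export JAVA_HOME=/usr/lib/jvm/java-6-sun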
The hadoop-env.sh file contains other variables for defining your Hadoop environment, but JAVA_HOME is the only one requiring initial modification. The default settings on the other variables will probably work fine. As you become more familiar with Hadoop you can later modify this file to suit your individual needs (logging directory location, Java class path, and so on).
The majority of Hadoop settings are contained in XML configuration files. Before version 0.20, these XML files were hadoop-default.xml and hadoop-site.xml. As the names imply, hadoop-default.xml contains the default Hadoop settings to be used unless they are explicitly overridden in hadoop-site.xml. In practice you only deal with hadoop-site.xml. In version 0.20 this file has been separated out into three XML files: core-site.xml, hdfs-site.xml, and mapred-site.xml. This refactoring better aligns the configuration settings with the subsystem of Hadoop that they control. In the rest of this chapter we’ll generally point out which of the three files is used to adjust a given configuration setting. If you use an earlier version of Hadoop, keep in mind that all such configuration settings are modified in hadoop-site.xml.
2.3.1 Local (standalone) mode
The standalone mode is the default mode for Hadoop. When you first uncompress the Hadoop source package, it’s ignorant of your hardware setup. Hadoop chooses to be conservative and assumes a minimal configuration. All three XML files (or hadoop-site.xml before version 0.20) are empty under this default mode:
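“Empty” here means the file defines no properties at all; apart from the XML boilerplate, each site file contains roughly nothing but an empty configuration element:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
    </configuration>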
2.3.2 Pseudo-distributed mode
The pseudo-distributed mode runs Hadoop in a “cluster of one,” with all daemons running on a single machine. This mode complements the standalone mode for debugging your code, allowing you to examine memory usage, HDFS input/output issues, and other daemon interactions. Listing 2.1 provides simple XML files to configure a single server in this mode.
Listing 2.1 Example of the three configuration files for pseudo-distributed mode (sketched below)
core-site.xml
mapred-site.xml
hdfs-site.xml
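A minimal sketch of what these three files contain in pseudo-distributed mode (the hostname localhost and ports 9000/9001 are conventional choices, not requirements):

    core-site.xml:
    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

    mapred-site.xml:
    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
    </configuration>

    hdfs-site.xml:
    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>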
In core-site.xml and mapred-site.xml we specify the hostname and port of the NameNode and the JobTracker, respectively. In hdfs-site.xml we specify the default replication factor for HDFS, which should only be one because we’re running on only one node. We must also specify the location of the Secondary NameNode in the masters file and the slave nodes in the slaves file:
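In pseudo-distributed mode both files simply point back to the local machine:

    conf/masters:
      localhost

    conf/slaves:
      localhost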
While all the daemons are running on the same machine, they still communicate with each other using the same SSH protocol as if they were distributed over a cluster. Section 2.2 has a more detailed discussion of setting up the SSH channels, but for single-node operation simply check to see if your machine already allows you to ssh back to itself.
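A quick way to check, with localhost standing in for the machine itself:

    ssh localhost
    # If this prompts for a password, set up a passphraseless key
    # as described in section 2.2 before starting the daemons.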
You are almost ready to start Hadoop. But first you’ll need to format your HDFS by using the command:
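Assuming you run it from the top of your Hadoop installation, the format command is:

    bin/hadoop namenode -format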
We can now launch the daemons using the start-all.sh script. The Java jps command will list all running daemons so you can verify that the setup was successful:
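A sketch of the launch-and-verify sequence, again from the Hadoop installation directory:

    bin/start-all.sh
    jps
    # jps should list NameNode, DataNode, SecondaryNameNode,
    # JobTracker, TaskTracker, and the Jps process itself.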
When you’ve finished with Hadoop, you can shut down the Hadoop daemons with the command:
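From the same directory:

    bin/stop-all.sh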
Both standalone and pseudo-distributed modes are for development and debugging purposes. An actual Hadoop cluster runs in the third mode, the fully distributed mode.
2.3.3 Fully distributed mode
It’s time for us to set up a full cluster. In the discussion below we’ll use the following server names:
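For the sketches that follow, assume a small cluster with these hypothetical hostnames (substitute your own):

    master    - the master node, hosting the NameNode and JobTracker daemons
    backup    - the machine hosting the Secondary NameNode daemon
    hadoop1, hadoop2, hadoop3 - slave nodes, each hosting a DataNode and a TaskTracker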
Listing 2.2 is a modified version of the pseudo-distributed configuration files (listing 2.1) that can be used as a skeleton for your cluster’s setup.
Listing 2.2 Example configuration files for fully distributed mode (sketched below)
core-site.xml (Locate NameNode for filesystem)
mapred-site.xml (Locate JobTracker master)
hdfs-site.xml (Decide HDFS replication factor)
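A sketch of the three files under the hypothetical hostnames above (the ports are again conventional examples):

    core-site.xml:
    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
      </property>
    </configuration>

    mapred-site.xml:
    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>master:9001</value>
      </property>
    </configuration>

    hdfs-site.xml:
    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
    </configuration>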
We also need to update the masters and slaves files to reflect the locations of the other daemons.
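With the hypothetical hostnames above, the Secondary NameNode host goes in masters and each slave node gets a line in slaves:

    conf/masters:
      backup

    conf/slaves:
      hadoop1
      hadoop2
      hadoop3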
Once you have copied these files across all the nodes in your cluster, be sure to format HDFS to prepare it for storage:
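As in pseudo-distributed mode, run this on the master node from the Hadoop installation directory:

    bin/hadoop namenode -format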
Now you can start the Hadoop daemons:
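Also from the master node:

    bin/start-all.sh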
and verify with jps that each node is running its assigned daemons.
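Running jps on each machine shows which daemons it hosts; with the hypothetical layout assumed above, you would expect roughly:

    master:            NameNode, JobTracker
    backup:            SecondaryNameNode
    hadoop1..hadoop3:  DataNode, TaskTracker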
You have a functioning cluster!
2.4 Web-based cluster UI
Having covered the operational modes of Hadoop, we can now introduce the web interfaces that Hadoop provides to monitor the health of your cluster. The browser interface allows you to access the information you desire much faster than digging through logs and directories.
The NameNode hosts a general report on port 50070. It gives you an overview of the state of your cluster’s HDFS. Figure 2.4 displays this report for a 2-node cluster example. From this interface, you can browse through the filesystem, check the status of each DataNode in your cluster, and peruse the Hadoop daemon logs to verify your cluster is functioning correctly.
Hadoop provides a similar status overview of ongoing MapReduce jobs. Figure 2.5 depicts one hosted at port 50030 of the JobTracker.
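With the hypothetical master hostname from section 2.3.3, the two interfaces would be reachable at URLs like these:

    http://master:50070/   - NameNode status and HDFS browser
    http://master:50030/   - JobTracker status and MapReduce job reports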
Again, a wealth of information is available through this reporting interface. You can access the status of ongoing MapReduce tasks as well as detailed reports about completed jobs. The latter is of particular importance—these logs describe which nodes performed which tasks and the time/resources required to complete each task. Finally, the Hadoop configuration for each job is also available, as shown in figure 2.6. With all of this information you can streamline your MapReduce programs to better utilize the resources of your cluster.
Supplement:
* [深入雲計算] Installing and configuring Hadoop: an introduction to the Hadoop Eclipse plugin and its use
* [深入雲計算] Installing and configuring Hadoop: Linux configuration with 1 NameNode + 2 DataNodes