Saturday, November 22, 2014

[CCDH] Class1 - The Motivation for Hadoop

Course Objectives
During this course, you will learn:
* The core technologies of Hadoop
* How HDFS and MapReduce work
* How to develop and unit test MapReduce applications
* How to use MapReduce combiners, partitioners, and the distributed cache
* Best practices for developing and debugging MapReduce applications
* How to implement custom data input and output formats in MapReduce applications
* Algorithms for common MapReduce tasks
* How to join datasets in MapReduce
* How Hadoop integrates into the data center
* How Hive, Impala, and Pig can be used for rapid application development
* How to create workflows using Oozie


The Motivation for Hadoop
In this chapter you will learn
* What problems exist with traditional large-scale computing systems
* What requirements an alternative approach should have
* How Hadoop addresses those requirements

Problems with Traditional Large-scale Systems
* Traditionally, computation has been processor-bound
* The early solution: bigger computers
* The better solution: more computers


Introducing Hadoop
* A radical new approach to distributed computing
* Originally based on work done at Google
* Open-source project overseen by the Apache Software Foundation
* Applications are written in high-level code
* Data is distributed in advance
* Data is replicated for increased availability and reliability
* Hadoop is scalable and fault-tolerant
* Adding nodes adds capacity proportionally
* Increasing load results in a graceful decline in performance


Hadoop-able Problems
- Nature of the data
* Volume
* Velocity
* Variety

- Nature of the analysis
* Batch processing
* Parallel execution
* Distributed data

Hadoop Basic Concepts and HDFS
In this chapter you will learn
* What Hadoop is
* What features the Hadoop Distributed File System (HDFS) provides

The Hadoop Project and Hadoop Components

HDFS (Hadoop Distributed File System): Stores data on the cluster
MapReduce: Processes data on the cluster
A Hadoop cluster: a group of machines working together to store and process data
- Any number of ‘slave’ or ‘worker’ nodes: HDFS to store data; MapReduce to process data
- Two master nodes: NameNode to manage HDFS; JobTracker to manage MapReduce



The Hadoop Distributed File System (HDFS)
* HDFS is a filesystem written in Java
* Sits on top of the native filesystem.
* Provides redundant storage for massive amounts of data.
* Files in HDFS are written once - No random writes to files are allowed
* HDFS is optimized for large, streaming reads of files.



How "Files" Are Stored:
* Data files are split into blocks and distributed at load time
* Each block is replicated on multiple data nodes (default 3x)
* NameNode stores metadata



Options for accessing HDFS:
* FsShell Command line: hadoop fs ...
* Java API (see the sketch after this list)
* Ecosystem Projects
Flume: Collects data from network sources as it is generated.
Sqoop: Transfers data between HDFS and RDBMSs.
Hue: Web-based interactive UI.
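A minimal sketch of the Java API option above, assuming the standard org.apache.hadoop.fs classes; the class name and HDFS path are placeholders for illustration, not from the course materials:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsApiSketch {
      public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the cluster configuration on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a file (HDFS files are write-once; no random writes)
        Path out = new Path("/user/training/example.txt");  // placeholder path
        try (FSDataOutputStream os = fs.create(out)) {
          os.writeUTF("hello hdfs");
        }

        // Read it back as a stream
        try (FSDataInputStream is = fs.open(out)) {
          System.out.println(is.readUTF());
        }
      }
    }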


Introduction to MapReduce
In this chapter you will learn
* The concepts behind MapReduce
* How data flows through MapReduce stages
* Typical uses of Mappers
* Typical uses of Reducers

MapReduce Overview
* MapReduce is a method for distributing a task across multiple nodes
* Each node processes data stored on that node
* Consists of two phases: Map -> Reduce

Features of MapReduce
* Automatic parallelization and distribution
* Fault-tolerance
* A clean abstraction for programmers.




Example - WordCount
Hadoop runs Map tasks on the node storing the data (when possible)
- Minimizes network traffic
- Many Mappers can run in parallel
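A minimal sketch of a WordCount Mapper, assuming the standard org.apache.hadoop.mapreduce API; the class and variable names are illustrative rather than taken from the course materials:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // The input key is the byte offset of the line; the value is the line itself
        for (String token : value.toString().split("\\W+")) {
          if (!token.isEmpty()) {
            word.set(token.toLowerCase());
            context.write(word, ONE);   // emit (word, 1) for each word in the line
          }
        }
      }
    }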



WordCount shuffle and sort


WordCount reduce


Another Example - Analyzing Log Data


Mapper
* Input: key/value pair
* Output: A list of zero or more key/value pairs
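To illustrate "zero or more" output pairs, here is a hypothetical Mapper for the log-analysis example above. The log format (space-separated, with the client IP as the first field) is an assumption made purely for illustration; malformed lines simply produce no output:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LogIpMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text ip = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] fields = value.toString().split(" ");
        // Emit nothing for malformed lines -- a Mapper may output zero pairs
        if (fields.length > 0 && !fields[0].isEmpty()) {
          ip.set(fields[0]);          // assumed: first field is the client IP
          context.write(ip, ONE);     // emit (ip, 1) per request line
        }
      }
    }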



Reducer
After the Map phase is over, all intermediate values for a given intermediate key are grouped together. Each key and its list of values is passed to a Reducer.
* All values for a particular intermediate key go to the same Reducer
* The intermediate key/value lists are passed in sorted key order



The Reducer outputs zero or more final key/value pairs
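A minimal sketch of the matching WordCount Reducer, again assuming the standard org.apache.hadoop.mapreduce API: it receives each word together with the list of counts emitted by the Mappers and sums them.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        // All intermediate values for this key arrive together, in sorted key order
        for (IntWritable count : values) {
          sum += count.get();
        }
        context.write(key, new IntWritable(sum));  // emit (word, total count)
      }
    }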


Hadoop Clusters and the Hadoop Ecosystem
In this chapter you will learn
* The components of a Hadoop cluster
* How Hadoop jobs and tasks run on a cluster
* How a job's data flows in a Hadoop cluster
* What other Hadoop Ecosystem projects exist

Hadoop Cluster Overview
* A Hadoop cluster is a group of computers working together - Usually runs HDFS and MapReduce
* A node is an individual computer in the cluster - Master nodes manage distribution of work and data to slave nodes
* A daemon is a program running on a node - Each performs different functions in the cluster



MapReduce v1 & v2
* MapReduce v1 ("MRv1" or "Classic MapReduce")
- Uses a JobTracker/TaskTracker architecture
- One JobTracker per cluster - limits cluster size to about 4000 nodes
- Slots on slave nodes designated for Map/Reduce tasks

* MapReduce v2 ("MRv2")
- Built on top of YARN (Yet Another Resource Negotiator)
- Uses ResourceManager/NodeManager architecture
- Node resources can be used for any type of task (supports non-MapReduce jobs)

MRv1 daemons
* JobTracker - one per cluster. Manages MapReduce jobs, distributes individual tasks to TaskTrackers.
* TaskTracker - one per slave node. Starts and monitors individual Map/Reduce tasks.



MRv2 daemons
* ResourceManager - One per cluster. Starts ApplicationMasters, allocates resources on slave nodes. (Roughly the MRv1 JobTracker's role)
* ApplicationMaster - One per job. Requests resources, manages individual Map/Reduce tasks.
* NodeManager - One per slave node. Manages resources on individual slave nodes. (Roughly the MRv1 TaskTracker's role)
* JobHistoryServer - One per cluster. Archives jobs' metrics and metadata.



HDFS daemons
* NameNode - Holds the metadata for HDFS. Typically two on a production cluster: one active, one standby
* DataNode - Holds the actual HDFS data. One per slave node.



Hadoop Jobs and Tasks
* A job is a "full program" - A complete execution of Mapper/Reducer over a dataset.
* A task is the execution of a single Mapper/Reducer over a slice of data
* A task attempt is a particular instance of an attempt to execute a task
- There will be at least as many task attempts as there are tasks
- If a task attempt fails, another will be started by the JobTracker or ApplicationMaster
- Speculative execution (Hadoop may launch a duplicate attempt for a slow-running task; the slower attempt is killed once the quicker one completes) can also result in more task attempts than completed tasks
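To make the notion of a "job" concrete, below is a minimal driver sketch that wires the WordCount Mapper and Reducer sketched earlier into a single job and submits it to the cluster. It assumes the standard org.apache.hadoop.mapreduce.Job API; the input and output paths are taken from the command line and are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist

        // Submit the job and wait; one job = one complete Map/Reduce execution over the data
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }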



Running a Job on a MapReduce v1 Cluster


Running a Job on a MapReduce v2 Cluster


Other Hadoop Ecosystem Components


The following is an introduction to some of the most significant projects (pay particular attention to Sqoop and Hive for the examination)
Hive and Pig
* Languages for querying and manipulating data - Higher level than MapReduce
* Interpreter runs on a client machine - Turns queries into MapReduce jobs and submits them to the cluster
* Overview later in the course

Impala
* High-performance SQL engine for vast amounts of data - 10 to 50+ times faster than Hive, Pig or MapReduce
* Impala runs on Hadoop clusters - Data is stored in HDFS, but Impala does not use MapReduce
* Developed by Cloudera - 100% open source under Apache software license

Flume and Sqoop
* Flume imports data into HDFS as it is generated
* Sqoop transfers data between RDBMSs and HDFS - Sqoop = "SQL to Hadoop"

Oozie
* Workflow engine for MapReduce jobs
* Defines dependencies between jobs
* The Oozie server submits the jobs to the cluster in the correct sequence

Supplement
Apache Hadoop 2.5.1 - Commands Manual
All hadoop commands are invoked by the bin/hadoop script. Running the hadoop script without any arguments prints the description for all commands.
Usage: hadoop [--config confdir] [COMMAND] [GENERIC_OPTIONS] [COMMAND_OPTIONS]

Hadoop MapReduce Next Generation - Cluster Setup
This document describes how to install, configure and manage non-trivial Hadoop clusters ranging from a few nodes to extremely large clusters with thousands of nodes. To play with Hadoop, you may first want to install it on a single machine (see Single Node Setup).

