During this course, you will learn:
* The core technologies of Hadoop
* How HDFS and MapReduce work
* How to develop and unit test MapReduce applications
* How to use MapReduce combiners, partitioners, and the distributed cache
* Best practices for developing and debugging MapReduce applications
* How to implement custom data input and output formats in MapReduce applications
* Algorithms for common MapReduce tasks
* How to join datasets in MapReduce
* How Hadoop integrates into the data center
* How Hive, Impala, and Pig can be used for rapid application development
* How to create workflows using Oozie
The Motivation for Hadoop
In this chapter you will learn:
* What problems exist with traditional large-scale computing systems
* What requirements an alternative approach should have
* How Hadoop addresses those requirements
Problems with Traditional Large-scale Systems
* Traditionally, computation has been processor-bound
* The early solution: bigger computers
* The better solution: more computers
Introducing Hadoop
* A radical new approach to distributed computing
* Originally based on work done at Google
* Open-source project overseen by the Apache Software Foundation
* Applications are written in high-level code
* Data is distributed in advance
* Data is replicated for increased availability and reliability
* Hadoop is scalable and fault-tolerant
* Adding nodes adds capacity proportionally
* Increasing load results in a graceful decline in performance
Hadoop-able Problems
- Nature of the data
* Volume
* Velocity
* Variety
- Nature of the analysis
* Batch processing
* Parallel execution
* Distributed data
Hadoop Basic Concepts and HDFS
In this chapter you will learn:
* What Hadoop is
* What features the Hadoop Distributed File System (HDFS) provides
The Hadoop Project and Hadoop Components
* HDFS (Hadoop Distributed File System): Stores data on the cluster
* MapReduce: Processes data on the cluster
* A Hadoop cluster: a group of machines working together to store and process data
- Any number of ‘slave’ or ‘worker’ nodes: HDFS to store data; MapReduce to process data
- Two master nodes: NameNode to manage HDFS; JobTracker to manage MapReduce
The Hadoop Distributed File System (HDFS)
* HDFS is a filesystem written in Java
* Sits on top of the native filesystem
* Provides redundant storage for massive amounts of data
* Files in HDFS are written once - no random writes to files are allowed
* HDFS is optimized for large, streaming reads of files
How "Files" Are Stored:
* Data files are split into blocks and distributed at load time
* Each block is replicated on multiple data nodes (default 3x)
* NameNode stores metadata
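As an aside, the block layout described above can be inspected through the HDFS Java API. The following is a minimal sketch (the file path /user/training/shakespeare.txt and the class name are hypothetical placeholders); it asks the NameNode for the location of every block replica of a file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
  public static void main(String[] args) throws Exception {
    // Reads fs.defaultFS (the NameNode address) from core-site.xml on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file path, used purely for illustration
    Path file = new Path("/user/training/shakespeare.txt");
    FileStatus status = fs.getFileStatus(file);

    // Ask the NameNode (which holds the metadata) where each block's replicas live
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("Block at offset " + block.getOffset()
          + ", length " + block.getLength()
          + ", replicas on: " + String.join(", ", block.getHosts()));
    }
    fs.close();
  }
}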
Options for accessing HDFS:
* FsShell command line: hadoop fs ...
* Java API (see the sketch after this list)
* Ecosystem projects
- Flume: collects data from network sources
- Sqoop: transfers data between HDFS and RDBMSs
- Hue: web-based interactive UI
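To make the Java API option concrete, here is a minimal sketch that opens a file in HDFS and streams it to standard output, roughly what hadoop fs -cat does; the path and class name are hypothetical placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatHdfsFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml
    FileSystem fs = FileSystem.get(conf);       // connects to HDFS (fs.defaultFS)

    // Hypothetical path; roughly equivalent to: hadoop fs -cat /user/training/weblog.txt
    FSDataInputStream in = fs.open(new Path("/user/training/weblog.txt"));
    IOUtils.copyBytes(in, System.out, 4096, false);  // stream the file contents to stdout
    in.close();
    fs.close();
  }
}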
Introduction to MapReduce
In this chapter you will learn:
* The concepts behind MapReduce
* How data flows through MapReduce stages
* Typical uses of Mappers
* Typical uses of Reducers
MapReduce Overview
* MapReduce is a method for distributing a task across multiple nodes
* Each node processes data stored on that node
* Consists of two phases: Map -> Reduce
Features of MapReduce
* Automatic parallelization and distribution
* Fault-tolerance
* A clean abstraction for programmers.
Example - WordCount
* Hadoop runs Map tasks on the node storing the data (when possible)
- Minimizes network traffic
- Many Mappers can run in parallel
WordCount shuffle and sort
WordCount reduce
Another Example - Analyzing Log Data
Mapper
* Input: a key/value pair
* Output: a list of zero or more key/value pairs (see the WordCount Mapper sketch below)
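As an illustration of a typical Mapper, below is a minimal sketch of the classic WordCount Mapper written against the org.apache.hadoop.mapreduce API: its input key is the byte offset of a line, its input value is the line of text, and it emits one (word, 1) pair per word. The class name WordCountMapper is only illustrative.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key: byte offset of the line; input value: the line itself.
// Output: one (word, 1) pair for every word found in the line.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\W+")) {
      if (!token.isEmpty()) {
        word.set(token.toLowerCase());
        context.write(word, ONE);   // emit an intermediate key/value pair
      }
    }
  }
}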
Reducer
After the Map phase is over, all intermediate values for a given intermediate key are grouped together. Each key and its list of values is passed to a Reducer.
* All values for a particular intermediate key go to the same Reducer
* The intermediate key/value lists are passed in sorted key order
* The Reducer outputs zero or more final key/value pairs (see the WordCount Reducer sketch below)
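To complete the WordCount illustration, a matching Reducer sketch: it receives each word together with the list of counts emitted by the Mappers (already grouped and sorted by key) and writes a single (word, total) pair. The class name WordCountReducer is illustrative.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: a word and the list of counts emitted for it by the Mappers.
// Output: a single (word, total) pair.
public class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();          // add up all the 1s for this word
    }
    context.write(key, new IntWritable(sum));   // final key/value pair
  }
}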
Hadoop Clusters and the Hadoop Ecosystem
In this chapter you will learn:
* The components of a Hadoop cluster
* How Hadoop jobs and tasks run on a cluster
* How a job's data flows in a Hadoop cluster
* What other Hadoop Ecosystem projects exist
Hadoop Cluster Overview
* A Hadoop cluster is a group of computers working together - Usually runs HDFS and MapReduce
* A node is an individual computer in the cluster - Master nodes manage distribution of work and data to slave nodes
* A daemon is a program running on a node - Each performs different functions in the cluster
MapReduce v1 & v2
* MapReduce v1 ("MRv1" or "Classic MapReduce")
- Uses a JobTracker/TaskTracker architecture
- One JobTracker per cluster - limits cluster size to about 4000 nodes
- Slots on slave nodes designated for Map/Reduce tasks
* MapReduce v2 ("MRv2")
- Built on top of YARN (Yet Another Resource Negotiator)
- Uses ResourceManager/NodeManager architecture
- Node resources can be used for any type of task (supports non-MapReduce jobs)
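For reference, which framework executes a job is controlled by the mapreduce.framework.name property, normally set cluster-wide in mapred-site.xml rather than in code. A minimal sketch of setting it programmatically, for illustration only:

import org.apache.hadoop.conf.Configuration;

public class FrameworkSelection {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Documented values in Hadoop 2 are "local", "classic" (MRv1) and "yarn" (MRv2)
    conf.set("mapreduce.framework.name", "yarn");
    System.out.println("Framework: " + conf.get("mapreduce.framework.name"));
  }
}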
MRv1 daemons
* JobTracker - one per cluster. Manages MapReduce jobs, distributes individual tasks to TaskTrackers.
* TaskTracker - one per slave node. Starts and monitors individual Map/Reduce tasks.
MRv2 daemons
* ResourceManager - One per cluster. Starts ApplicationMasters, allocates resources on slave nodes. (MRv1 equivalent: JobTracker)
* ApplicationMaster - One per job. Requests resources, manages individual Map/Reduce tasks.
* NodeManager - One per slave node. Manages resources on individual slave nodes. (MRv1 equivalent: TaskTracker)
* JobHistoryServer - One per cluster. Archives jobs' metrics and metadata.
HDFS daemons
* NameNode - Holds the metadata for HDFS. Typically two on a production cluster: one active, one standby
* DataNode - Holds the actual HDFS data. One per slave node.
Hadoop Jobs and Tasks
* A job is a "full program" - a complete execution of Mapper/Reducer over a dataset
* A task is the execution of a single Mapper/Reducer over a slice of data
* A task attempt is a particular instance of an attempt to execute a task
- There will be at least as many task attempts as there are tasks
- If a task attempt fails, another will be started by the JobTracker or ApplicationMaster
- Speculative execution can also result in more task attempts than completed tasks: a duplicate attempt is launched for a slow task, and the slower attempt is killed once the quicker one completes (see the configuration sketch below)
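A minimal sketch of toggling speculative execution per job using the standard Hadoop 2 configuration properties (the values shown are illustrative, not recommendations):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeExecutionExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Allow duplicate attempts for slow-running map tasks (default: true)
    conf.setBoolean("mapreduce.map.speculative", true);
    // Do not launch duplicate attempts for reduce tasks
    conf.setBoolean("mapreduce.reduce.speculative", false);

    Job job = Job.getInstance(conf, "speculative execution demo");
    // ... set Mapper, Reducer, and input/output paths as usual before submitting
  }
}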
Running a Job on a MapReduce v1 Cluster
Running a Job on a MapReduce v2 Cluster
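From the developer's point of view, submitting a job looks the same on an MRv1 and an MRv2 cluster. Below is a minimal driver sketch that wires together the WordCountMapper and WordCountReducer sketched earlier (both class names are illustrative) and submits the job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: WordCountDriver <input path> <output path>");
      System.exit(1);
    }

    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    // Wire up the Mapper and Reducer sketched earlier in these notes
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Submit the job to the cluster (JobTracker in MRv1, ResourceManager in MRv2)
    // and wait for it to finish
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a JAR, it could be run with, for example: hadoop jar wordcount.jar WordCountDriver <input path> <output path>.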
Other Hadoop Ecosystem Components
The following is an introduction to some of the most significant projects (pay particular attention to Sqoop and Hive for the examination).
Hive and Pig
* Languages for querying and manipulating data - Higher level than MapReduce
* Interpreter runs on a client machine - Turns queries into MapReduce jobs and submits them to the cluster
* Overview later in the course
Impala
* High-performance SQL engine for vast amounts of data - 10 to 50+ times faster than Hive, Pig or MapReduce
* Impala runs on Hadoop clusters - Data is stored in HDFS, but Impala does not use MapReduce
* Developed by Cloudera - 100% open source under Apache software license
Flume and Sqoop
* Flume imports data into HDFS as it is generated
* Sqoop transfers data between RDBMSs and HDFS - Sqoop = "SQL to Hadoop"
Oozie
* Workflow engine for MapReduce jobs
* Defines dependencies between jobs
* The Oozie server submits the jobs to the cluster in the correct sequence
Supplement
* Apache Hadoop 2.5.1 - Commands Manual
All hadoop commands are invoked by the bin/hadoop script. Running the hadoop script without any arguments prints the description for all commands.
Usage: hadoop [--config confdir] [COMMAND] [GENERIC_OPTIONS] [COMMAND_OPTIONS]
* Hadoop MapReduce Next Generation - Cluster Setup
This document describes how to install, configure and manage non-trivial Hadoop clusters ranging from a few nodes to extremely large clusters with thousands of nodes. To play with Hadoop, you may first want to install it on a single machine (see Single Node Setup).