During this course, you will learn:
* The core technologies of Hadoop
* How HDFS and MapReduce work
* How to develop and unit test MapReduce applications
* How to use MapReduce combiners, partitioners, and the distributed cache
* Best practices for developing and debugging MapReduce applications
* How to implement custom data input and output formats in MapReduce applications
* Algorithms for common MapReduce tasks
* How to join datasets in MapReduce
* How Hadoop integrates into the data center
* How Hive, Impala, and Pig can be used for rapid application development
* How to create workflows using Oozie
The Motivation for Hadoop
In this chapter you will learn:
* What problems exist with traditional large-scale computing systems
* What requirements an alternative approach should have
* How Hadoop addresses those requirements
Problems with Traditional Large-scale Systems
* Traditionally, computation has been processor-bound
* The early solution: bigger computers
* The better solution: more computers
Introducing Hadoop
* A radical new approach to distributed computing
* Originally based on work done at Google
* Open-source project overseen by the Apache Software Foundation
* Applications are written in high-level code
* Data is distributed in advance
* Data is replicated for increased availability and reliability
* Hadoop is scalable and fault-tolerant
* Adding nodes adds capacity proportionally
* Increasing load results in a graceful decline in performance
Hadoop-able Problems
- Nature of the data
* Volume
* Velocity
* Variety
- Nature of the analysis
* Batch processing
* Parallel execution
* Distributed data
Hadoop Basic Concepts and HDFS
In this chapter you will learn:
* What Hadoop is
* What features the Hadoop Distributed File System (HDFS) provides
The Hadoop Project and Hadoop Components
* HDFS (Hadoop Distributed File System): Stores data on the cluster
* MapReduce: Processes data on the cluster
* A Hadoop cluster: a group of machines working together to store and process data
- Any number of ‘slave’ or ‘worker’ nodes: HDFS to store data; MapReduce to process data
- Two master nodes: NameNode to manage HDFS; JobTracker to manage MapReduce
The Hadoop Distributed File System (HDFS)
* HDFS is a filesystem written in Java
* Sits on top of the native filesystem
* Provides redundant storage for massive amounts of data
* Files in HDFS are written once - no random writes to files are allowed
* HDFS is optimized for large, streaming reads of files
How "Files" Are Stored:
* Data files are split into blocks and distributed at load time
* Each block is replicated on multiple data nodes (default 3x)
* NameNode stores metadata
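As an aside, the block layout described above can be inspected through the HDFS Java API. The following is a minimal sketch (the file path /user/training/shakespeare.txt and the class name are hypothetical placeholders); it asks the NameNode for the location of every block replica of a file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
  public static void main(String[] args) throws Exception {
    // Reads fs.defaultFS (the NameNode address) from core-site.xml on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file path, used purely for illustration
    Path file = new Path("/user/training/shakespeare.txt");
    FileStatus status = fs.getFileStatus(file);

    // Ask the NameNode (which holds the metadata) where each block's replicas live
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("Block at offset " + block.getOffset()
          + ", length " + block.getLength()
          + ", replicas on: " + String.join(", ", block.getHosts()));
    }
    fs.close();
  }
}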
Options for accessing HDFS:
* FsShell command line: hadoop fs ...
* Java API (see the sketch after this list)
* Ecosystem projects
- Flume: collects data from network sources
- Sqoop: transfers data between HDFS and RDBMSs
- Hue: web-based interactive UI
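To make the Java API option concrete, here is a minimal sketch that opens a file in HDFS and streams it to standard output, roughly what hadoop fs -cat does; the path and class name are hypothetical placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatHdfsFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml
    FileSystem fs = FileSystem.get(conf);       // connects to HDFS (fs.defaultFS)

    // Hypothetical path; roughly equivalent to: hadoop fs -cat /user/training/weblog.txt
    FSDataInputStream in = fs.open(new Path("/user/training/weblog.txt"));
    IOUtils.copyBytes(in, System.out, 4096, false);  // stream the file contents to stdout
    in.close();
    fs.close();
  }
}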
Introduction to MapReduce
In this chapter you will learn:
* The concepts behind MapReduce
* How data flows through MapReduce stages
* Typical uses of Mappers
* Typical uses of Reducers
MapReduce Overview
* MapReduce is a method for distributing a task across multiple nodes
* Each node processes data stored on that node
* Consists of two phases: Map -> Reduce
Features of MapReduce
* Automatic parallelization and distribution
* Fault-tolerance
* A clean abstraction for programmers.
Example - WordCount
* Hadoop runs Map tasks on the node storing the data (when possible)
- Minimizes network traffic
- Many Mappers can run in parallel
WordCount shuffle and sort
WordCount reduce
Another Example - Analyzing Log Data
Mapper
* Input: a key/value pair
* Output: a list of zero or more key/value pairs (see the WordCount Mapper sketch below)
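As an illustration of a typical Mapper, below is a minimal sketch of the classic WordCount Mapper written against the org.apache.hadoop.mapreduce API: its input key is the byte offset of a line, its input value is the line of text, and it emits one (word, 1) pair per word. The class name WordCountMapper is only illustrative.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key: byte offset of the line; input value: the line itself.
// Output: one (word, 1) pair for every word found in the line.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\W+")) {
      if (!token.isEmpty()) {
        word.set(token.toLowerCase());
        context.write(word, ONE);   // emit an intermediate key/value pair
      }
    }
  }
}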
Reducer
After the Map phase is over, all intermediate values for a given intermediate key are grouped together. Each key and its list of values is passed to a Reducer.
* All values for a particular intermediate key go to the same Reducer
* The intermediate key/value lists are passed in sorted key order
* The Reducer outputs zero or more final key/value pairs (see the WordCount Reducer sketch below)
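To complete the WordCount illustration, a matching Reducer sketch: it receives each word together with the list of counts emitted by the Mappers (already grouped and sorted by key) and writes a single (word, total) pair. The class name WordCountReducer is illustrative.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: a word and the list of counts emitted for it by the Mappers.
// Output: a single (word, total) pair.
public class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();          // add up all the 1s for this word
    }
    context.write(key, new IntWritable(sum));   // final key/value pair
  }
}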
Hadoop Clusters and the Hadoop Ecosystem
In this chapter you will learn:
* The components of a Hadoop cluster
* How Hadoop jobs and tasks run on a cluster
* How a job's data flows in a Hadoop cluster
* What other Hadoop Ecosystem projects exist
Hadoop Cluster Overview
* A Hadoop cluster is a group of computers working together - Usually runs HDFS and MapReduce
* A node is an individual computer in the cluster - Master nodes manage distribution of work and data to slave nodes
* A daemon is a program running on a node - Each performs different functions in the cluster
MapReduce v1 & v2
* MapReduce v1 ("MRv1" or "Classic MapReduce")
- Uses a JobTracker/TaskTracker architecture
- One JobTracker per cluster - limits cluster size to about 4000 nodes
- Slots on slave nodes designated for Map/Reduce tasks
* MapReduce v2 ("MRv2")
- Built on top of YARN (Yet Another Resource Negotiator)
- Uses ResourceManager/NodeManager architecture
- Node resources can be used for any type of task (supports non-MapReduce jobs)
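For reference, which framework executes a job is controlled by the mapreduce.framework.name property, normally set cluster-wide in mapred-site.xml rather than in code. A minimal sketch of setting it programmatically, for illustration only:

import org.apache.hadoop.conf.Configuration;

public class FrameworkSelection {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Documented values in Hadoop 2 are "local", "classic" (MRv1) and "yarn" (MRv2)
    conf.set("mapreduce.framework.name", "yarn");
    System.out.println("Framework: " + conf.get("mapreduce.framework.name"));
  }
}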
MRv1 daemons
* JobTracker - one per cluster. Manages MapReduce jobs, distributes individual tasks to TaskTrackers.
* TaskTracker - one per slave node. Starts and monitors individual Map/Reduce tasks.
MRv2 daemons
* ResourceManager - One per cluster. Starts ApplicationMasters, allocates resources on slave nodes. (MRv1 equivalent: JobTracker)
* ApplicationMaster - One per job. Requests resources, manages individual Map/Reduce tasks.
* NodeManager - One per slave node. Manages resources on individual slave nodes. (MRv1 equivalent: TaskTracker)
* JobHistoryServer - One per cluster. Archives jobs' metrics and metadata.
HDFS daemons
* NameNode - Holds the metadata for HDFS. Typically two on a production cluster: one active, one standby
* DataNode - Holds the actual HDFS data. One per slave node.
Hadoop Jobs and Tasks
* A job is a "full program" - a complete execution of Mapper/Reducer over a dataset
* A task is the execution of a single Mapper/Reducer over a slice of data
* A task attempt is a particular instance of an attempt to execute a task
- There will be at least as many task attempts as there are tasks
- If a task attempt fails, another will be started by the JobTracker or ApplicationMaster
- Speculative execution can also result in more task attempts than completed tasks: a duplicate attempt is launched for a slow task, and the slower attempt is killed once the quicker one completes (see the configuration sketch below)
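A minimal sketch of toggling speculative execution per job using the standard Hadoop 2 configuration properties (the values shown are illustrative, not recommendations):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeExecutionExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Allow duplicate attempts for slow-running map tasks (default: true)
    conf.setBoolean("mapreduce.map.speculative", true);
    // Do not launch duplicate attempts for reduce tasks
    conf.setBoolean("mapreduce.reduce.speculative", false);

    Job job = Job.getInstance(conf, "speculative execution demo");
    // ... set Mapper, Reducer, and input/output paths as usual before submitting
  }
}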
Running a Job on a MapReduce v1 Cluster
Running a Job on a MapReduce v2 Cluster
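From the developer's point of view, submitting a job looks the same on an MRv1 and an MRv2 cluster. Below is a minimal driver sketch that wires together the WordCountMapper and WordCountReducer sketched earlier (both class names are illustrative) and submits the job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: WordCountDriver <input path> <output path>");
      System.exit(1);
    }

    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    // Wire up the Mapper and Reducer sketched earlier in these notes
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Submit the job to the cluster (JobTracker in MRv1, ResourceManager in MRv2)
    // and wait for it to finish
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a JAR, it could be run with, for example: hadoop jar wordcount.jar WordCountDriver <input path> <output path>.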
Other Hadoop Ecosystem Components
The following is an introduction to some of the most significant projects (pay particular attention to Sqoop and Hive for the examination).
Hive and Pig
* Languages for querying and manipulating data - Higher level than MapReduce
* Interpreter runs on a client machine - Turns queries into MapReduce jobs and submits them to the cluster
* Overview later in the course
Impala
* High-performance SQL engine for vast amounts of data - 10 to 50+ times faster than Hive, Pig or MapReduce
* Impala runs on Hadoop clusters - Data is stored in HDFS, but Impala does not use MapReduce
* Developed by Cloudera - 100% open source under Apache software license
Flume and Sqoop
* Flume imports data into HDFS as it is generated
* Sqoop transfers data between RDBMSs and HDFS - Sqoop = "SQL to Hadoop"
Oozie
* Workflow engine for MapReduce jobs
* Defines dependencies between jobs
* The Oozie server submits the jobs to the cluster in the correct sequence
Supplement
* Apache Hadoop 2.5.1 - Commands Manual
All hadoop commands are invoked by the bin/hadoop script. Running the hadoop script without any arguments prints the description for all commands.
Usage: hadoop [--config confdir] [COMMAND] [GENERIC_OPTIONS] [COMMAND_OPTIONS]
* Hadoop MapReduce Next Generation - Cluster Setup
This document describes how to install, configure and manage non-trivial Hadoop clusters ranging from a few nodes to extremely large clusters with thousands of nodes. To play with Hadoop, you may first want to install it on a single machine (see Single Node Setup).