Tuesday, April 28, 2015

[ Article Collection ] Getting Hands-On with Docker

Source From Here
Goal
This article focuses on hands-on Docker usage and covers:
1. Creating a Docker container
2. Managing containers on a Docker host
3. Downloading images and building new images with commit
4. Building new images automatically by writing a Dockerfile

Before the hands-on part, we briefly cover how Docker differs from virtualization and introduce Docker's key components. The preparation section requires installing Docker and logging in to Docker Hub.

Introducing Docker
Docker is an open-source, multi-platform project for quickly deploying lightweight, isolated runtime environments anywhere from a laptop to a public or private cloud. Docker uses Linux kernel features such as namespaces and control groups (cgroups) to build isolated environments and to control resources such as CPU, memory, and network.

Project site: http://www.docker.com/

How Docker containers differ from virtualization
As mentioned above, Docker provides isolated environments, but it works differently from virtualization.

Virtualization
Virtualization usually installs a hypervisor on the host OS; the hypervisor manages the virtual machines, and each virtual machine needs its own operating system.


Docker Container
Docker runs applications in isolated containers. Unlike virtualization, containers do not need a hypervisor or a guest OS underneath; they are managed by the Docker Engine.


Key Docker components
Before operating Docker, let's introduce its three main parts:
* Docker Images
A Docker image is used to start a Docker container. Images are read-only: if you simply stop a container, changes made inside it are not saved back to the image, but Docker can create a new image from the modified container.

* Docker Containers
Docker containers provide an isolated, secure environment for applications to run in. A container is created from a Docker image and runs on the host.

* Docker Registries
Docker images can be uploaded to and downloaded from public or private Docker registries to share with others. Public registries such as Docker Hub offer many different images, for example Ubuntu, or Ubuntu with Ruby on Rails installed; just download one and start it as a container.

How to operate Docker
A Docker host runs the Docker daemon and can start many containers. Docker is operated through the Docker client, i.e. the docker commands (e.g. docker pull, docker images, ...), which control the Docker daemon on the host via either:
1. UNIX sockets
2. the network (RESTful API)

The Docker client and the Docker daemon can of course be on the same host or on different hosts.
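For example, the client picks its daemon endpoint from the DOCKER_HOST environment variable; a minimal sketch (the TCP address below is a hypothetical example, and a remote daemon must be configured to listen on TCP):

```shell
# By default the docker client talks to the local daemon over a UNIX socket
# (unix:///var/run/docker.sock). To control a daemon on another host,
# point DOCKER_HOST at its TCP endpoint instead:
export DOCKER_HOST=tcp://192.168.1.10:2375   # hypothetical remote daemon
echo "$DOCKER_HOST"
# Every docker command in this shell now targets the remote daemon;
# unset DOCKER_HOST to go back to the local UNIX socket.
```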


For more on operating Docker, see the Docker Remote API.

Prerequisites
Next, let's install Docker on the host:
Ubuntu (doc)
$ sudo apt-get update
$ sudo apt-get install docker.io
$ sudo docker info // check that the installation succeeded
Containers: 0
Images: 0
...

CentOS7 (doc)
$ sudo yum install docker

Mac OS X, Windows
Boot2Docker is required on Mac OS X and Windows because the Docker Engine relies on Linux-specific features. Boot2Docker uses VirtualBox to create a Linux VM that runs the Docker daemon, and the Docker client on Mac OS X or Windows controls that daemon on the Linux VM (more on this later). Boot2Docker installation references:


As mentioned earlier, Docker Hub offers many ready-made images. First register a Docker Hub account at: https://hub.docker.com/account/signup/

Then log in to Docker Hub with docker login:
$ sudo docker login
Username (johnklee): 
Login Succeeded

Operating Docker Containers
Creating a new container
As a warm-up, let's start a CentOS 6 container that runs a command to print today's date and time. Use docker run to start a new container:
# docker run centos:centos6 /bin/date
Unable to find image 'centos:centos6' locally
Trying to pull repository docker.io/centos ...
b9aeeaeb5e17: Download complete
f1b10cd84249: Download complete
Status: Downloaded newer image for docker.io/centos:centos6
Thu Apr 23 07:16:32 UTC 2015

Lines 1 through 5 of the output appear because the Docker host does not yet have the centos6 image, so it is first downloaded from Docker Hub.

The last line of the output is the result of the date command. Also note that, excluding the download time, starting a container takes very little time.
docker run: start a new container
centos:centos6
centos is the repository name and centos6 is a tag within that repo. Searching for centos on Docker Hub shows the available tags on the centos repo page.

/bin/date: the command to run once the container starts

docker run also has options that drop you into an interactive terminal inside the container:
// -t allocate a tty for the container
// -i keep STDIN open to interact with the container

# docker run -t -i centos:centos6 bash
[root@6dd59f78d33e /]# whoami
root

How do you leave the container's terminal?
1. Type exit or press Ctrl+D; the current container stops, and the next one you start will be brand new.
2. Press Ctrl+P, then Ctrl+Q, to detach from the container's tty without stopping it.

Managing containers
Leaving a container's terminal does not close or stop the container, so next we explain how to manage containers.

Container ID
Every container has a unique CONTAINER ID. Above, running docker run -t -i centos:centos6 bash started a new container and dropped us into its terminal:
# docker run -t -i centos:centos6 bash
[root@d9485c95064b /]#

"d9485c95064b" is the CONTAINER ID; from here on, this ID is what distinguishes one container from another.

Listing containers
The Docker client provides docker ps to list the containers that are currently running:
# docker ps --help
Usage: docker ps [OPTIONS]

List containers

-a, --all=false Show all containers (default shows just running)
--before= Show only container created before Id or Name
-f, --filter=[] Filter output based on conditions provided
--help=false Print usage
-l, --latest=false Show the latest created container, include non-running
-n=-1 Show n last created containers, include non-running
--no-trunc=false Don't truncate output
-q, --quiet=false Only display numeric IDs
-s, --size=false Display total file sizes
--since= Show created since Id or Name, include non-running

# docker ps -a -q
d9485c95064b
6dd59f78d33e
1fb091625c6d
5c4abb8dcc01
142873548b68
3fbd20430713

Running commands in a container
Leaving a container's terminal does not shut the container down; docker exec runs a command inside a running container. For example, the following returns to the terminal of container 034972c95a2d:
# docker exec -i -t 034972c95a2d bash
[root@034972c95a2d /]#

Stopping a container
To stop a running container, use docker stop:
# docker stop --help

Usage: docker stop [OPTIONS] CONTAINER [CONTAINER...]

Stop a running container by sending SIGTERM and then SIGKILL after a
grace period

--help=false Print usage
-t, --time=10 Seconds to wait for stop before killing it


# docker ps -a // docker ps -a shows every container that has not been deleted.
...
034972c95a2d centos:centos6 "bash" 3 minutes ago Up 3 minutes
...

# docker stop 034972c95a2d
034972c95a2d

# docker ps -a
...
034972c95a2d centos:centos6 "bash" 5 minutes ago Exited (137) 51 seconds ago
...

Starting a stopped container
Start a stopped container with docker start:
# docker start 034972c95a2d
034972c95a2d
# docker ps -a
...
034972c95a2d centos:centos6 "bash" 8 minutes ago Up 10 seconds
...

Starting it re-runs the command and arguments the container was originally created with, so if it was created without a terminal and interactive mode, the container exits again as soon as the command finishes.

Deleting a container
docker ps -a lists every container that has been created and not yet deleted; docker rm deletes a container you no longer need:
# docker rm 034972c95a2d
034972c95a2d
# docker ps -a // confirm the container has been deleted

Creating an extra mount point inside a container
When creating a container, you can add a custom mount point (a volume independent of the container's filesystem):
// -v, --volume=[]: Bind mount a volume
# docker run -t -i -v /tmp/myData centos:centos6 bash
# touch /tmp/myData/test
# echo "test" > /tmp/myData/test
// Ctrl+p, Ctrl+q to exit container

Sharing data between the host and a container
Docker can also share a host directory with a container, which is very handy during development: mount your project directory into the container and test and run it there directly. Here we mount the host directory /home/aming/Docker/tutorial/ to /app inside the container:
// -v [HOST_DIR]:[CONTAINER_DIR] mounts a host OS directory onto a mount point inside the container
# docker run -t -i -v \ 
/home/aming/Docker/tutorial/:/app \
tutum/apache-php bash



Managing Docker Images
As mentioned in "Creating a new container", when docker run cannot find a matching image locally, Docker pulls it from Docker Hub by default. So how do we know which images are already available? docker images lists the images that exist on the host:


You can see the image information mentioned earlier, such as the repo name and tag name.

1. Searching for images
Use docker search to search for images on Docker Hub, e.g. 'tomcat':
# docker search tomcat


2. Downloading an image
Use docker pull to download the "docker.io/consol/tomcat-7.0" image found above:
# docker pull docker.io/consol/tomcat-7.0
Trying to pull repository docker.io/consol/tomcat-7.0 ...
c8da30218989: Pulling dependent layers
511136ea3c5a: Download complete
36fd425d7d8a: Downloading 5.394 MB
 // Ongoing

To see which images are now on the host, docker images lists all local images:
# docker images | grep tomcat
docker.io/consol/tomcat-7.0 latest c8da30218989 4 months ago 617.9 MB

3. Building a new image
The consol/tomcat-7.0 image we just downloaded already includes Tomcat and Java. Let's build on it: install Git and commit the result as a new image:
# docker run -t -i consol/tomcat-7.0 bash // Start Container
root@33f15b0c22f2:~# apt-get update // Inside Container with ID=33f15b0c22f2
root@33f15b0c22f2:~# apt-get install git
root@33f15b0c22f2:~# git --version
git version 2.1.4
// Enter Ctrl+p, Ctrl+q to quit Container

We now have a new container whose ID is 33f15b0c22f2. Use docker commit to turn this container into a new image:
# docker commit --help

Usage: docker commit [OPTIONS] CONTAINER [REPOSITORY[:TAG]]

Create a new image from a container's changes

-a, --author= Author (e.g., "John Hannibal Smith ")
-c, --change=[] Apply Dockerfile instruction to the created image
--help=false Print usage
-m, --message= Commit message
-p, --pause=true Pause container during commit


# docker commit -m="Add Git version 2.1.4" -a="johnklee" 33f15b0c22f2 johnklee/tutorial:TomcatWithGit
c63e03f45541b38d4a8b486ddb18a367ec716d2c2bcca12848075dbb7fdb3ddd
# docker images | grep TomcatWithGit
johnklee/tutorial TomcatWithGit c63e03f45541 39 seconds ago 656.8 MB

After setting up the required environment inside the container, committing the modified container produces a new image; this image now additionally includes Git for our test environment.

4. Pushing to Docker Hub
After building a new image, we can push it to Docker Hub or a private registry to share it. First, create a repo on the Docker Hub website:
(1) Click "Add Repository" and choose "Repository"


(2) Fill in the repo name (must be lowercase) and a description, then click "Add Repository"


(3) The repo has been created on Docker Hub


Once the repo exists on Docker Hub, upload the image with docker push:
# docker push johnklee/tutorial
Do you really want to push to public registry? [Y/n]: Y
The push refers to a repository [docker.io/johnklee/tutorial] (len: 1)
Sending image list
Pushing repository docker.io/johnklee/tutorial (1 tags)
e39724bc32b2: Image already pushed, skipping
4fb1c181433b: Image already pushed, skipping
...

5. Removing an image
# docker rmi johnklee/tutorial:TomcatWithGit
Untagged: johnklee/tutorial:TomcatWithGit
Deleted: c63e03f45541b38d4a8b486ddb18a367ec716d2c2bcca12848075dbb7fdb3ddd

# docker images | grep TomcatWithGit // double-check that the image has been removed

If the image cannot be removed, a container is probably still using it; remove that container with docker rm first, and then the image can be removed.

Writing a Dockerfile
We have seen how to build an image by hand and share it through Docker. Besides creating containers from ready-made images, we can also write a Dockerfile, a script describing how to build an image, and let docker build create the image for us; the result can likewise be pushed to Docker Hub.

Below we demonstrate building an image automatically with a Dockerfile. As before, we build a Git-enabled environment on top of docker.io/consol/tomcat-7.0.
(1). Create a directory and a Dockerfile
# mkdir ~/dockerfile-demo && cd ~/dockerfile-demo

(2). Edit the Dockerfile
* FROM: the base image to build on
* MAINTAINER: the maintainer of the image
* RUN: commands to execute inside the image
* ADD: adds local or remote files into a directory inside the image; archives are unpacked automatically. To copy local files or directories as-is, use COPY
# vi Dockerfile
FROM docker.io/consol/tomcat-7.0
MAINTAINER Huang AMing
RUN apt-get update \
  && apt-get install -y git

(3). Build the image from the Dockerfile
A Dockerfile is like a script: docker build creates a new image according to its contents:
# docker build --help
Usage: docker build [OPTIONS] PATH | URL | -

Build a new image from the source code at PATH

-f, --file= Name of the Dockerfile(Default is 'Dockerfile')
--force-rm=false Always remove intermediate containers
--help=false Print usage
--no-cache=false Do not use cache when building the image
--pull=false Always attempt to pull a newer version of the image
-q, --quiet=false Suppress the verbose output generated by the containers
--rm=true Remove intermediate containers after a successful build
-t, --tag= Repository name (and optionally a tag) for the image


# docker build -t="docker.io/johnklee/tutorial:TomcatWithGit2.1.4" .
Sending build context to Docker daemon 2.048 kB
Sending build context to Docker daemon
Step 0 : FROM docker.io/consol/tomcat-7.0
---> c8da30218989
Step 1 : MAINTAINER johnklee
---> Running in c70a3921c389
---> 70b0bc570d28
Removing intermediate container c70a3921c389
Step 2 : RUN apt-get update && apt-get install -y git
...
Successfully built 744a3d5b654c


# docker images
REPOSITORY TAG IMAGE ID CREATED VIRTUAL SIZE
docker.io/johnklee/tutorial TomcatWithGit2.1.4 744a3d5b654c 16 seconds ago 656.8 MB


# docker run -i -t 744a3d5b654c bash // Using our new image to run Container
root@ef217701a1b7:/# git --version // Make sure Dockerfile works 
git version 2.1.4


Supplement
Docker User Guide - Managing Data in Containers
Permission denied on accessing host directory in docker
VBird - 第十七章、程序管理與 SELinux 初探
What is SELinux? It is short for "Security Enhanced Linux", which literally means a security-hardened Linux...


Sunday, April 26, 2015

[ 常見問題 ] How to get the input file name in the mapper in a Hadoop program?

Source From Here
Question
How can I get the name of the input file within a mapper? I have multiple input files stored in the input directory; each mapper may read a different file, and I need to know which file the mapper has read.

How-To
First you need to get the InputSplit object; using the MapReduce v2 API it would be done as follows:
...
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        InputSplit inputSplit = context.getInputSplit();
    }
...
But in order to get the file path and the file name you will need to first typecast the result into FileSplit. So, in order to get the input file path you may do the following:
Path filePath = ((FileSplit) context.getInputSplit()).getPath();
String filePathString = filePath.toString();
Similarly, to get the file name, you may just call upon getName(), like this:
String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();


[ In Action ] Ch10. Pig: Speaking Pig Latin (3)

Preface 
You now know how to use Grunt to run Pig Latin statements and investigate their execution and results. We can come back and give a more formal treatment of the language. You should feel free to use Grunt to explore these language concepts as we present them. 

Data types and schemas 
Let’s first look at Pig data types from a bottom-up view. Pig has six simple atomic types and three complex types, shown in tables 10.4 and 10.5 respectively. The atomic types include numeric scalars as well as string and binary objects. Type casting is supported and done in the usual manner. Fields default to bytearray unless specified otherwise.
 
 

A field in a tuple or a value in a map can be null or any atomic or complex type. This enables nesting and complex data structures. Whereas data structures can be arbitrarily complex, some are definitely more useful and occur more often than others, and nesting usually doesn’t go deeper than two levels. In the Excite log example earlier, the GROUP BY operator generated a relation grpd where each tuple has a field that is a bag. The schema for the relation seems more natural once you think of grpd as the query history of each user. Each tuple represents one user and has a field that is a bag of the user’s queries. 

We can also look at Pig’s data model from the top down. At the top, Pig Latin statements work with relations, and a relation is a bag of tuples. If you force all the tuples in a bag to have a fixed number of fields and each field has a fixed atomic type, then it behaves like a relational data model: the relation is a table, tuples are rows (records), and fields are columns. But Pig’s data model has more power and flexibility by allowing nested data types. Fields can themselves be tuples, bags, or maps. Maps are helpful in processing semistructured data such as JSON, XML, and sparse relational data. In addition, it isn’t necessary that tuples in a bag have the same number of fields. This allows tuples to represent unstructured data.

Besides declaring types for fields, schemas can also assign names to fields to make them easier to reference. Users can define schemas for relations using the AS keyword with the LOAD, STREAM, and FOREACH operators. For example, in the LOAD statement for getting the Excite query log, we defined the data types for the fields in log, as well as named the fields user, time, and query:
grunt> log = LOAD 'tutorial/data/excite-small.log'
➥ AS (user:chararray, time:long, query:chararray);

In defining a schema, if you leave out the type, Pig will default to bytearray as the most generic type. You can also leave out the name, in which case a field would be unnamed and you can only reference it by position. 

Expressions and functions 
You can apply expressions and functions to data fields to compute various values. The simplest expression is a constant value. Next is to reference the value of a field. You can reference a named field’s value directly by its name. You can reference an unnamed field by $n, where n is its position inside the tuple. (Positions are numbered starting at 0.) For example, this LOAD command provides named fields to log through the schema.
grunt> log = LOAD 'tutorial/data/excite-small.log'
➥ AS (user:chararray, time:long, query:chararray);

The three named fields are user, time, and query. For example, we can refer to the time field as either time or $1, because the time field is the second field in log (position number 1). Let’s say we want to extract the time field into its own relation; we can use this statement:
grunt> projection = FOREACH log GENERATE time;

We can also achieve the same with: 
grunt> projection = FOREACH log GENERATE $1;

Most of the time you should give names to fields. One use of referring to fields by position is when you’re working with unstructured data. 

When using complex types, you use the dot notation to reference fields nested inside tuples or bags. For example, recall earlier that we’d grouped the Excite log by user ID and arrived at relation grpd with a nested schema: 
 

The second field in grpd is named log, of type bag. Each bag has tuples with three named fields: user, time, and query. In this relation, log.query would refer to the two copies of “conan” “o’brien” when applied to the first tuple. You can get the same field with log.$2.

You reference fields inside maps through the pound operator instead of the dot operator. For a map named m, the value associated with key k is referenced through m#k.

Being able to refer to values is only a first step. Pig supports the standard arithmetic, comparison, conditional, type casting, and Boolean expressions that are common in most popular programming languages. See table 10.6.
 

Furthermore, Pig also supports functions. Table 10.7 shows Pig’s built-in functions, most of which are self-explanatory. We’ll discuss user-defined functions (UDF) in section 10.6.
 

You can’t use expressions and functions alone. You must use them within relational operators to transform data. 

Relational operators 
The most salient characteristic about Pig Latin as a language is its relational operators. These operators define Pig Latin as a data processing language. We’ll quickly go over the more straightforward operators first, to acclimate ourselves to their style and syntax. Afterward we’ll go into more details on the more complex operators such as COGROUP and FOREACH.

UNION combines multiple relations together whereas SPLIT partitions a relation into multiple ones. An example will make it clear: 
grunt> a = LOAD 'A' using PigStorage(',') AS (a1:int, a2:int, a3:int);
grunt> DUMP a;
(0,1,2)
(1,3,4)

grunt> b = LOAD 'B' using PigStorage(',') AS (b1:int, b2:int, b3:int);
grunt> DUMP b;
(0,5,2)
(1,7,8)

grunt> c = UNION a, b;
grunt> DUMP c;
(0,1,2)
(1,3,4)
(0,5,2)
(1,7,8)

grunt> SPLIT c INTO d IF $0 == 0, e IF $0 == 1;
grunt> DUMP d;
(0,5,2)
(0,1,2)

grunt> DUMP e;
(1,7,8)
(1,3,4)

The UNION operator allows duplicates. You can use the DISTINCT operator to remove duplicates from a relation. Our SPLIT operation on c sends a tuple to d if its first field ($0) is 0, and to e if it’s 1. It’s possible to write conditions such that some rows will go to both d and e, or to neither. You can simulate SPLIT with multiple FILTER operators. The FILTER operator alone trims a relation down to only the tuples that pass a certain test:
grunt> f = FILTER c BY $1 > 3;
grunt> DUMP f;
(0,5,2)
(1,7,8)
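This SPLIT-versus-FILTER behavior is easy to mimic in plain Python, treating a relation as a list of tuples (a sketch of the semantics only, not of how Pig executes; Pig also doesn't guarantee tuple order, so only the contents should be compared):

```python
# Relation c as a list of tuples (same data as above)
c = [(0, 1, 2), (1, 3, 4), (0, 5, 2), (1, 7, 8)]

# SPLIT c INTO d IF $0 == 0, e IF $0 == 1  ->  two independent filters
d = [t for t in c if t[0] == 0]
e = [t for t in c if t[0] == 1]

# FILTER c BY $1 > 3
f = [t for t in c if t[1] > 3]

print(d)  # [(0, 1, 2), (0, 5, 2)]
print(e)  # [(1, 3, 4), (1, 7, 8)]
print(f)  # [(0, 5, 2), (1, 7, 8)]
```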

We’ve seen LIMIT being used to take a specified number of tuples from a relation. SAMPLE is an operator that randomly samples tuples in a relation according to a specified percentage. 

The operations till now are relatively simple in the sense that they operate on each tuple as an atomic unit. More complex data processing, on the other hand, will require working on groups of tuples together. We’ll next look at operators for grouping. Unlike previous operators, these grouping operators will create new schemas in their output that rely heavily on bags and nested data types. The generated schema may take a little time to get used to at first. Keep in mind that these grouping operators are almost always for generating intermediate data. Their complexity is only temporary on your way to computing the final results.

The simpler of these operators is GROUP. Continuing with the same set of relations we used earlier, 
grunt> DUMP c;
(0,5,2)
(1,7,8)
(0,1,2)
(1,3,4)

grunt> g = GROUP c BY $2;
grunt> DUMP g;
(2,{(0,1,2),(0,5,2)})
(4,{(1,3,4)})
(8,{(1,7,8)})

grunt> DESCRIBE c;
c: {a1: int,a2: int,a3: int}
grunt> DESCRIBE g;
g: {group: int,c: {(a1: int,a2: int,a3: int)}}

We’ve created a new relation, g, from grouping tuples in c having the same value on the third column ($2, also named a3). The output of GROUP always has two fields. The first field is the group key, which is a3 in this case. The second field is a bag containing all the tuples with the same group key. Looking at g’s dump, we see that it has three tuples, corresponding to the three unique values in c’s third column. The first field of GROUP’s output relation is always named group, for the group key. In this case it may seem more natural to call the first field a3, but currently Pig doesn’t allow you to assign a name to replace group. You’ll have to adapt yourself to refer to it as group. The second field of GROUP’s output relation is always named after the relation it’s operating on, which is c in this case, and as we said earlier it’s always a bag. As we use this bag to hold tuples from c, the schema for this bag is exactly the schema for c: three fields of integers named a1, a2, and a3.

Before moving on, we want to note that one can GROUP by any function or expression. For example, if time is a timestamp and there exists a function DayOfWeek, one can conceivably do this grouping, which would create a relation with seven tuples:
GROUP log BY DayOfWeek(time);

Finally, one can put all tuples in a relation into one big bag. This is useful for aggregate analysis on relations, as functions work on bags but not relations. For example: 
grunt> h = GROUP c ALL;
grunt> DUMP h;
(all,{(0,1,2),(1,3,4),(0,5,2),(1,7,8)})
grunt> i = FOREACH h GENERATE COUNT($1);
grunt> DUMP i;
(4)

This is one way to count the number of tuples in c. The first field in GROUP ALL’s output is always the string all.
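The shape of GROUP’s output can be sketched in plain Python as key-to-bag pairs (a semantic sketch only, not how Pig actually executes):

```python
from collections import defaultdict

# Relation c as a list of tuples
c = [(0, 5, 2), (1, 7, 8), (0, 1, 2), (1, 3, 4)]

# g = GROUP c BY $2: one (group key, bag) pair per distinct third field
bags = defaultdict(list)
for t in c:
    bags[t[2]].append(t)
g = sorted(bags.items())
# [(2, [(0, 5, 2), (0, 1, 2)]), (4, [(1, 3, 4)]), (8, [(1, 7, 8)])]

# h = GROUP c ALL; i = FOREACH h GENERATE COUNT($1)
h = ("all", c)          # the whole relation in a single bag
i = len(h[1])           # 4, the number of tuples in c
```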

Now that you’re comfortable with GROUP, we can look at COGROUP, which groups together tuples from multiple relations. It functions much like a join. For example, let’s cogroup a and b on the third column. 
grunt> j = COGROUP a BY $2, b BY $2;
grunt> DUMP j;
(2,{(0,1,2)},{(0,5,2)})
(4,{(1,3,4)},{})
(8,{},{(1,7,8)})

grunt> DESCRIBE j;
j: {group: int,a: {(a1: int,a2: int,a3: int)},b: {(b1: int,b2: int,b3: int)}}

Whereas GROUP always generates two fields in its output, COGROUP always generates three (more if cogrouping more than two relations). The first field is the group key, whereas the second and third fields are bags. These bags hold tuples from the cogrouping relations that match the grouping key. If a grouping key matches only tuples from one relation but not the other, then the field corresponding to the nonmatching relation will have an empty bag. To ignore group keys that don’t exist for a relation, one can add the INNER keyword to the operation, like 
grunt> j = COGROUP a BY $2, b BY $2 INNER;
grunt> DUMP j;
(2,{(0,1,2)},{(0,5,2)})
(8,{},{(1,7,8)})

grunt> j = COGROUP a BY $2 INNER, b BY $2 INNER;
grunt> DUMP j;
(2,{(0,1,2)},{(0,5,2)})

Conceptually, you can think of the default behavior of COGROUP as an outer join, and the INNER keyword can modify it to be left outer join, right outer join, or inner join. Another way to do inner join in Pig is to use the JOIN operator. The main difference between JOIN and an inner COGROUP is that JOIN creates a flat set of output records, as indicated by looking at the schema: 
grunt> j = JOIN a BY $2, b BY $2;
grunt> DUMP j;
(0,1,2,0,5,2)
grunt> DESCRIBE j;
j: {a::a1: int,a::a2: int,a::a3: int,b::b1: int,b::b2: int,b::b3: int}
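Both behaviors can be sketched in plain Python with a hypothetical cogroup helper (semantics only; Pig’s actual execution differs):

```python
a = [(0, 1, 2), (1, 3, 4)]
b = [(0, 5, 2), (1, 7, 8)]

def cogroup(r1, r2, key=lambda t: t[2]):
    """Group two relations on a key; emit (key, bag from r1, bag from r2)."""
    keys = sorted({key(t) for t in r1} | {key(t) for t in r2})
    return [(k,
             [t for t in r1 if key(t) == k],
             [t for t in r2 if key(t) == k]) for k in keys]

j = cogroup(a, b)
# [(2, [(0, 1, 2)], [(0, 5, 2)]), (4, [(1, 3, 4)], []), (8, [], [(1, 7, 8)])]

# COGROUP ... b BY $2 INNER: drop keys with an empty bag on b's side
j_inner_b = [row for row in j if row[2]]
# Both sides INNER: only keys present in both, i.e. an inner join on the key
j_inner = [row for row in j if row[1] and row[2]]

# JOIN a BY $2, b BY $2: flatten the matching bags into flat records
j_flat = [ta + tb for k, ba, bb in j for ta in ba for tb in bb]
print(j_flat)  # [(0, 1, 2, 0, 5, 2)]
```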

The last relational operator we look at is FOREACH. It goes through all tuples in a relation and generates new tuples in the output. Behind this seeming simplicity lies tremendous power though, particularly when it’s applied to the complex data types output by the grouping operators. There’s even a nested form of FOREACH designed for handling complex types. First let’s familiarize ourselves with different FOREACH operations on simple relations.
grunt> k = FOREACH c GENERATE a2, a2*a3;
grunt> DUMP k;
(5,10)
(7,56)
(1,2)
(3,12)

FOREACH is always followed by an alias (name given to a relation) followed by the keyword GENERATE. The expressions after GENERATE control the output. At its simplest, we use FOREACH to project specific columns of a relation into the output. We can also apply arbitrary expressions, such as multiplication in the preceding example. 

For relations with nested bags (e.g., ones generated by the grouping operations), FOREACH has special projection syntax, and a richer set of functions. For example, applying nested projection to have each bag retain only the first field: 
grunt> k = FOREACH g GENERATE group, c.a1;
grunt> DUMP k;
(2,{(0),(0)})
(4,{(1)})
(8,{(1)})

To get two fields in each bag: 
grunt> k = FOREACH g GENERATE group, c.(a1,a2);
grunt> DUMP k;
(2,{(0,1),(0,5)})
(4,{(1,3)})
(8,{(1,7)})

Most built-in Pig functions are geared toward working on bags. 
grunt> k = FOREACH g GENERATE group, COUNT(c);
grunt> DUMP k;
(2,2)
(4,1)
(8,1)

Recall that g is based on grouping c on the third column. This FOREACH statement therefore generates a frequency count of the values in c’s third column. As we said earlier, grouping operators are mainly for generating intermediate data that will be simplified by other operators such as FOREACH. The COUNT function is one of the aggregate functions. As we’ll see, you can create many other functions via UDFs.
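In Python terms, this frequency count just measures each bag’s size; a sketch with g written as (key, bag) pairs:

```python
# g as (group key, bag) pairs, from grouping c on its third field
g = [(2, [(0, 1, 2), (0, 5, 2)]),
     (4, [(1, 3, 4)]),
     (8, [(1, 7, 8)])]

# k = FOREACH g GENERATE group, COUNT(c): replace each bag by its size
k = [(grp, len(bag)) for grp, bag in g]
print(k)  # [(2, 2), (4, 1), (8, 1)]
```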

The FLATTEN function is designed to flatten nested data types. Syntactically it looks like a function, such as COUNT and AVG, but it’s a special operator as it can change the structure of the output created by FOREACH...GENERATE. Its flattening behavior is also different depending on how it’s applied and what it’s applied to. For example, consider a relation with tuples of the form (a, (b, c)). The statement FOREACH... GENERATE $0, FLATTEN($1) will create one output tuple of the form (a, b, c) for each input tuple. 

When applied to bags, FLATTEN modifies the FOREACH...GENERATE statement to generate new tuples. It removes one layer of nesting and behaves almost the opposite of grouping operations. If a bag contains N tuples, flattening it will remove the bag and create N tuples in its place. 
grunt> k = FOREACH g GENERATE group, FLATTEN(c);
grunt> DUMP k;
(2,0,1,2)
(2,0,5,2)
(4,1,3,4)
(8,1,7,8)

grunt> DESCRIBE k;
k: {group: int,c::a1: int,c::a2: int,c::a3: int}

Another way to understand FLATTEN is to see that it produces a cross-product. This view is helpful when we use FLATTEN multiple times within a single FOREACH statement. For example, let’s say we’ve somehow created a relation l:
grunt> DUMP l;
(1,{(1,2)},{(3)})
(4,{(4,2),(4,3)},{(6),(9)})
(8,{(8,3),(8,4)},{(9)})

grunt> DESCRIBE l;
d: {group: int,a: {a1: int,a2: int},b: {b1: int}}

The following statement that flattens two bags outputs all combinations of those two bags for each tuple: 
grunt> m = FOREACH l GENERATE group, FLATTEN(a), FLATTEN(b);
grunt> DUMP m;
(1,1,2,3)
(4,4,2,6)
(4,4,2,9)
(4,4,3,6)
(4,4,3,9)
(8,8,3,9)
(8,8,4,9)

We see that the tuple with group key 4 in relation l has a bag of size 2 in field a and also a bag size 2 in field b. The corresponding output in m therefore has four rows representing the full cross-product. 
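The cross-product reading of double FLATTEN maps directly onto itertools.product in Python (again a sketch of the semantics only):

```python
from itertools import product

# Relation l: (group, bag a, bag b), with bag tuples as Python tuples
l = [(1, [(1, 2)], [(3,)]),
     (4, [(4, 2), (4, 3)], [(6,), (9,)]),
     (8, [(8, 3), (8, 4)], [(9,)])]

# m = FOREACH l GENERATE group, FLATTEN(a), FLATTEN(b):
# for each tuple, emit every combination of the two bags
m = [(grp,) + ta + tb
     for grp, bag_a, bag_b in l
     for ta, tb in product(bag_a, bag_b)]
print(m)
# [(1, 1, 2, 3), (4, 4, 2, 6), (4, 4, 2, 9),
#  (4, 4, 3, 6), (4, 4, 3, 9), (8, 8, 3, 9), (8, 8, 4, 9)]
```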

Finally, there’s a nested form of FOREACH to allow for more complex processing of bags. Let’s assume you have a relation (say l) and one of its fields (say a) is a bag; a FOREACH with a nested block has this form:
alias = FOREACH l {
    tmp1 = operation on a;
    [more operations...]
    GENERATE expr [, expr...]
}
The GENERATE statement must always be present at the end of the nested block. It will create some output for each tuple in l. The operations in the nested block can create new relations based on the bag a. For example, we can trim down the a bag in each of l’s tuples.
 

You can have multiple statements in the nested block. Each one can even be operating on different bags. 
 

As of this writing, only five operators are allowed in the nested block: DISTINCT, FILTER, LIMIT, ORDER, and SAMPLE. It’s expected that more will be supported in the future.
NOTE.
Sometimes the output of FOREACH can have a completely different schema from its input. In those cases, users can specify the output schema using the AS option after each field. This syntax differs from the LOAD command, where the schema is specified as a list after the AS option, but in both cases we use AS to specify a schema.

For more information on how to use Pig Latin, please refer to Official Document - Pig Latin Basics (r0.9.1). On many operators you’ll see an option for PARALLEL n (see more on Use the parallel feature). The number n is the degree of parallelism you want for executing that operator. In practice n is the number of reduce tasks in Hadoop that Pig will use. If you don’t set n, it defaults to the default setting of your Hadoop cluster. The Pig documentation recommends setting the value of n according to the following guideline:
n = (#nodes - 1) * 0.45 * RAM

where #nodes is the number of nodes and RAM is the amount of memory in GB on each node. 
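Plugging numbers into the guideline is straightforward; for instance, for a hypothetical cluster of 11 nodes with 8 GB of RAM each:

```python
def suggested_parallel(nodes, ram_gb):
    """Pig's guideline: n = (#nodes - 1) * 0.45 * RAM (GB per node)."""
    return int((nodes - 1) * 0.45 * ram_gb)

print(suggested_parallel(11, 8))  # 36 reduce tasks
```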

At this point you’ve learned various aspects of the Pig Latin language—data types, expressions, functions, and relational operators. You can extend the language further with user-defined functions. But before discussing that we’ll end this section with a note on Pig Latin compilation and optimization. 

Execution optimization 
As with many modern compilers, the Pig compiler can reorder the execution sequence to optimize performance, as long as the execution plan remains logically equivalent to the original program. For example, imagine a program that applies an expensive function (say, encryption) to a certain field (say, social security number) of every record, followed by a filtering function to select records based on a different field (say, limit only to people within a certain geography). The compiler can reverse the execution order of those two operations without affecting the final result, yet performance is much improved. Having the filtering step first can dramatically reduce the amount of data and work the encryption step will have to do.

As Pig matures, more optimization will be added to the compiler. Therefore it’s important to try to always use the latest version. But there’s always a limit to a compiler’s ability to optimize arbitrary code. You can read Pig’s web documentation for techniques to improve performance. A list of tips for enhancing performance under Pig version r0.9.1 is at https://pig.apache.org/docs/r0.9.1/perf.html.

[ Python 文章收集 ] Timing and Profiling in IPython

Source From  Here   Preface   Timing and profiling code is all sorts of useful, and it’s also just good ol’ fashioned fun ( and sometimes s...