Learning Pig Latin through Grunt
Before formally describing Pig’s data types and data processing operators, let’s run a few commands in the Grunt shell to get a feel for how to process data in Pig. For the purpose of learning, it’s more convenient to run Grunt in local mode, which you start from the command line with pig -x local.
You may want to first try some of the file commands, such as pwd and ls, to orient yourself around the filesystem.
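As a rough sketch of that orientation step, the following commands can be typed at the grunt> prompt. The directory tutorial/data is taken from the path mentioned in this section, and it is assumed here that Grunt was started from the Pig installation directory.

```pig
-- Print the current working directory as Grunt sees it.
pwd
-- List the contents of the current directory.
ls
-- List the tutorial data directory (assumes it exists relative to the current directory).
ls tutorial/data
```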
Let’s look at some data. We’ll later reuse the patent data we introduced in chapter 4, but for now let’s dig into an interesting data set of query logs from the Excite search engine. This data set comes with the Pig installation, in the file tutorial/data/excite-small.log under the Pig installation directory. The data comes in a three-column, tab-separated format. The first column is an anonymized user ID, the second is a Unix timestamp, and the third is the search query. A decidedly non-random sample from the 4,500 records of this file looks like:
From within Grunt, enter the following statement to load this data into an “alias” (i.e., variable) called log.
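The statement itself is missing from this copy of the text. A minimal sketch of what it might look like follows, assuming Grunt was started from the Pig installation directory (so the relative path resolves) and borrowing the field names user, time, and query that appear later in this section.

```pig
-- Load the query log into the alias `log`.
-- No loader is named, so Pig falls back to its default (PigStorage),
-- which splits each line on tab characters.
log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);
```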
Note that nothing seems to have happened after you entered the statement. In the Grunt shell, Pig parses your statements but doesn’t physically execute them until you use a DUMP or STORE command to ask for the results. The DUMP command prints out the content of an alias whereas the STORE command stores the content to a file. The fact that Pig doesn’t physically execute any command until you explicitly request some end result will make sense once you remember that we’re processing large data sets. There’s no memory space to “load” the data, and in any case we want to verify the logic of the execution plan before spending the time and resources to physically execute it.
We usually use the DUMP command only during development. Most often you’ll STORE significant results into a directory. (Like Hadoop, Pig will automatically partition the data into files named part-nnnnn.) When you DUMP an alias, you should be sure that its content is small enough to be reasonably printed to screen. The common way to do that is to create another alias through the LIMIT command and DUMP the new, smaller alias. The LIMIT command lets you specify how many tuples (rows) to return. For example, to see four tuples of log:
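The listing for this step is not present here, so the following is only a sketch. The alias name lmt is a hypothetical choice, and the LOAD statement is repeated so that the snippet stands on its own.

```pig
-- Assumes the working directory is the Pig installation directory.
log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);

-- `lmt` is a hypothetical alias holding at most four tuples of `log`.
lmt = LIMIT log 4;

-- DUMP triggers execution and prints the tuples to the screen.
DUMP lmt;
```

LIMIT does not promise which four tuples come back unless the relation has been ordered first, so the printed rows may differ between runs.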
Table 10.2 summarizes the read and write operators in Pig Latin. LIMIT is technically not a read or write operator, but as it’s often used alongside, we’ve included it in the table.
Let’s execute a few data processing statements and see how we can explore Pig Latin through Grunt.
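The statements referred to here are missing from this copy. Based on the alias names (log, grpd, cntd), the schema, and the operations described in the rest of this section, they plausibly resemble the sketch below. The output directory name 'output' is an assumption.

```pig
-- Load the log with an explicit schema (types as described later in this section).
log = LOAD 'tutorial/data/excite-small.log'
      AS (user:chararray, time:long, query:chararray);

-- Collect all tuples that share the same user into one group.
grpd = GROUP log BY user;

-- For each group, emit the user ID and the number of tuples in its bag.
cntd = FOREACH grpd GENERATE group, COUNT(log);

-- Write the result to a directory; 'output' is a hypothetical name
-- and must not already exist.
STORE cntd INTO 'output';
```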
The preceding statements count the number of queries each user has searched for. The content of the output files (you’ll have to look at the files from outside Grunt) looks like this:
Conceptually, we’ve performed an aggregation similar to a SQL query of the form SELECT user, COUNT(*) FROM log GROUP BY user.
Two main differences between the Pig Latin and SQL versions are worth pointing out. As we’ve mentioned earlier, Pig Latin is a data processing language. You’re specifying a series of data processing steps instead of a single complex SQL query with clauses. The other difference is more subtle: relations in SQL always have fixed schemas. In SQL, we define a relation’s schema before it’s populated with data. Pig takes a much looser approach to schema. In fact, you don’t need to use schemas if you don’t want to, which may be the case when handling semistructured or unstructured data. Here we do specify a schema for the relation log, but only in the load statement, and it is enforced only when the data is actually loaded. Any field that doesn’t obey the schema in the load operation is cast to null. In this way the relation log is guaranteed to obey our stated schema for subsequent operations.
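To illustrate the looser approach, here is a sketch of the same per-user count written without any schema. Fields are then addressed by position ($0 is the first column) and default to the bytearray type. The output directory name is hypothetical.

```pig
-- Load without an AS clause: no field names, no declared types.
log = LOAD 'tutorial/data/excite-small.log';

-- $0 refers to the first field of each tuple (the user ID column).
grpd = GROUP log BY $0;

-- Count the tuples in each group's bag, exactly as in the schema-based version.
cntd = FOREACH grpd GENERATE group, COUNT(log);

-- 'output_noschema' is a hypothetical directory name.
STORE cntd INTO 'output_noschema';
```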
As much as possible, Pig tries to figure out the schema for a relation based on the operation used to create it. You can expose Pig’s schema for any relation with the DESCRIBE command. This can be useful in understanding what a Pig statement is doing. For example, we’ll look at the schemas for grpd and cntd. Before doing this, let’s first see how the DESCRIBE command describes log.
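The command for this step is a single line. The sketch below repeats the typed LOAD statement (as assumed from the schema described in this section) so that it can be run on its own.

```pig
log = LOAD 'tutorial/data/excite-small.log'
      AS (user:chararray, time:long, query:chararray);

-- Print the schema Pig has recorded for `log`; no data is read for this.
DESCRIBE log;
```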
As expected, the load command gives log the exact schema we’ve specified. The relation log consists of three fields named user, time, and query. The fields user and query are both strings (chararray in Pig) whereas time is a long integer.
A GROUP BY operation on the relation log generates the relation grpd. Based on the operation and the schema for log, Pig infers a schema for grpd:
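A sketch of how one might ask Pig for that inferred schema, again repeating the assumed earlier statements so the snippet is self-contained:

```pig
log = LOAD 'tutorial/data/excite-small.log'
      AS (user:chararray, time:long, query:chararray);
grpd = GROUP log BY user;

-- Show the schema Pig inferred for the grouped relation.
DESCRIBE grpd;
```

The exact formatting of the printed schema varies somewhat between Pig versions.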
group and log are two fields in grpd. The field log is a bag with subfields user, time, and query. As we haven’t covered Pig’s type system and the GROUP BY operation, we don’t expect you to understand this schema yet. The point is that relations in Pig can have fairly complex schemas, and DESCRIBE is your friend in understanding the relations you’re working with:
Finally, the FOREACH command operates on the relation grpd to give us cntd. Having looked at the output of cntd, we know it has two fields: the user ID and a count of the number of queries. Pig’s schema for cntd, as given by DESCRIBE, also has two fields. The first one’s name, group, is taken from grpd’s schema. The second field has no name, but it has a type of long. This field is generated by the COUNT function, which doesn’t automatically provide a name, although it does tell Pig that the type has to be a long.
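For the final relation, a corresponding sketch (with the assumed earlier statements repeated) would be:

```pig
log = LOAD 'tutorial/data/excite-small.log'
      AS (user:chararray, time:long, query:chararray);
grpd = GROUP log BY user;
cntd = FOREACH grpd GENERATE group, COUNT(log);

-- Show the schema of the per-user counts.
DESCRIBE cntd;
```

If a name for the count field is wanted, Pig Latin allows one to be attached in the GENERATE clause, for example COUNT(log) AS cnt, where cnt is a hypothetical name.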
Whereas DESCRIBE can tell you the schema of a relation, ILLUSTRATE does a sample run to show a step-by-step process on how Pig would compute the relation. Pig tries to simulate the execution of the statements to compute a relation, but it uses only a small sample of data to make the execution fast. The best way to understand ILLUSTRATE is by applying it to a relation. In this case we use cntd. (The output is reformatted to fit the width of a printed page.)
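The ILLUSTRATE output itself is not reproduced in this copy. The command that would produce it can be sketched as follows, with the assumed earlier statements repeated for completeness:

```pig
log = LOAD 'tutorial/data/excite-small.log'
      AS (user:chararray, time:long, query:chararray);
grpd = GROUP log BY user;
cntd = FOREACH grpd GENERATE group, COUNT(log);

-- Simulate the pipeline on a small sample and print one table per step.
ILLUSTRATE cntd;
```

Because ILLUSTRATE picks its own sample, the example tuples shown can differ from run to run.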
The ILLUSTRATE command shows there to be four transformations to arrive at cntd. The header row of each table describes the schema of the output relation after transformation, and the rest of the table shows example data. The log relation is shown as two transformations. The data hasn’t changed from one to the next, but the schema has changed from a generic bytearray (Pig’s type for binary objects) to the specified schema. The GROUP operation on log is executed on the three sample log tuples to arrive at the data for grpd. Based on this we can infer the GROUP operation to have taken the user field and made it the group field. In addition, it groups all tuples in log with the same user value into a bag in grpd. Seeing sample data in a simulated run by ILLUSTRATE can greatly aid the understanding of different operations. Finally, we see the FOREACH operation applied to grpd to arrive at cntd. Having seen the data in grpd in the previous table, one can easily infer that the COUNT() function provided the size of each bag.
Although DESCRIBE and ILLUSTRATE are your workhorses in understanding Pig Latin statements, Pig also has an EXPLAIN command to show the logical and physical execution plan in detail. We summarize the diagnostic operators in table 10.3.
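As a sketch of the third diagnostic operator, EXPLAIN can be applied to the same alias. The statements leading up to it are the ones assumed throughout this section.

```pig
log = LOAD 'tutorial/data/excite-small.log'
      AS (user:chararray, time:long, query:chararray);
grpd = GROUP log BY user;
cntd = FOREACH grpd GENERATE group, COUNT(log);

-- Print the execution plans Pig would use to compute `cntd`.
-- Nothing is executed; only the plans are shown.
EXPLAIN cntd;
```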
* Pig Document 0.14.0 - Getting Started