Learning Pig Latin through Grunt
Before formally describing Pig’s data types and data processing operators, let’s run a few commands in the Grunt shell to get a feel for how to process data in Pig. For the purpose of learning, it’s more convenient to run Grunt in local mode:
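Assuming the pig script from the Pig installation is on your PATH, you can start Grunt in local mode with:

    pig -x local

Grunt then greets you with its grunt> prompt, from which the statements below are entered.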
You may want to first try some of the file commands, such as pwd and ls, to orient yourself around the filesystem.
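For example:

    grunt> pwd
    grunt> ls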
Let’s look at some data. Later we’ll reuse the patent data introduced in chapter 4, but for now let’s dig into an interesting data set of query logs from the Excite search engine. This data set comes with the Pig installation, in the file tutorial/data/excite-small.log under the Pig installation directory. The data comes in a three-column, tab-separated format: the first column is an anonymized user ID, the second is a Unix timestamp, and the third is the search query. A decidedly non-random sample from the 4,500 records of this file looks like:
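A few lines representative of this format (shown for illustration, with the tab separators rendered as spaces):

    2A9EABFB35F5B954    970916105432    +md foods +proteins
    BED75271605EBD0C    970916001949    yahoo chat
    BED75271605EBD0C    970916001954    yahoo chat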
From within Grunt, enter the following statement to load this data into an “alias” (i.e., variable) called log.
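A sketch of the load statement, with field types matching the schema we’ll see DESCRIBE report later in this section:

    grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user:chararray, time:long, query:chararray);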
Note that nothing seems to have happened after you entered the statement. In the Grunt shell, Pig parses your statements but doesn’t physically execute them until you use a DUMP or STORE command to ask for the results. The DUMP command prints out the content of an alias whereas the STORE command stores the content to a file. The fact that Pig doesn’t physically execute any command until you explicitly request some end result will make sense once you remember that we’re processing large data sets. There’s no memory space to “load” the data, and in any case we want to verify the logic of the execution plan before spending the time and resources to physically execute it.
We usually use the DUMP command only for development. Most often you’ll STORE significant results into a directory. (Like Hadoop, Pig will automatically partition the data into files named part-nnnnn.) When you DUMP an alias, you should be sure that its content is small enough to be reasonably printed to screen. The common way to do that is to create another alias through the LIMIT command and DUMP the new, smaller alias. The LIMIT command allows you to specify how many tuples (rows) to return. For example, to see four tuples of log:
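A sketch of those statements:

    grunt> lmt = LIMIT log 4;
    grunt> DUMP lmt;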
Table 10.2 summarizes the read and write operators in Pig Latin. LIMIT is technically not a read or write operator, but as it’s often used alongside them, we’ve included it in the table.
Let’s execute a few data processing statements and see how we can explore Pig Latin through Grunt.
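Assuming log has been loaded as above, the per-user query count can be computed along these lines (the output directory name 'output' is our choice):

    grunt> grpd = GROUP log BY user;
    grunt> cntd = FOREACH grpd GENERATE group, COUNT(log);
    grunt> STORE cntd INTO 'output';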
The preceding statements count the number of queries each user has searched for. The content of the output files (you’ll have to look at them from outside Grunt) looks like this:
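Each line pairs a user ID with that user’s query count, separated by a tab; representative lines (illustrative values, not actual counts) look like:

    002BB5A52580A8ED    18
    00A08A54CD03EB95    3
    011ACA65C2BF70B2    1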
Conceptually, we’ve performed an aggregating operation similar to the following SQL query:
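A sketch of that query, with excite_log standing in for a hypothetical table holding the same data:

    SELECT user, COUNT(*) FROM excite_log GROUP BY user;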
Two main differences between the Pig Latin and SQL versions are worth pointing out. As we’ve mentioned earlier, Pig Latin is a data processing language. You’re specifying a series of data processing steps instead of a single complex SQL query with clauses. The other difference is more subtle: relations in SQL always have fixed schemas. In SQL, we define a relation’s schema before it’s populated with data. Pig takes a much looser approach to schemas. In fact, you don’t need to use schemas at all if you don’t want to, which may be the case when handling semistructured or unstructured data. Here we do specify a schema for the relation log, but we do so only in the load statement, and it’s enforced only as the data is loaded. Any field that doesn’t obey the schema in the load operation is cast to null. In this way the relation log is guaranteed to obey our stated schema for subsequent operations.
As much as possible, Pig tries to figure out the schema for a relation based on the operation used to create it. You can expose Pig’s schema for any relation with the DESCRIBE command. This can be useful in understanding what a Pig statement is doing. For example, we’ll look at the schemas for grpd and cntd. Before doing this, let’s first see how the DESCRIBE command describes log.
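Assuming the load statement shown earlier:

    grunt> DESCRIBE log;
    log: {user: chararray,time: long,query: chararray}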
As expected, the load command gives log the exact schema we’ve specified. The relation log consists of three fields named user, time, and query. The fields user and query are both strings (chararray in Pig), whereas time is a long integer.
A GROUP BY operation on the relation log generates the relation grpd. Based on the operation and the schema for log, Pig infers a schema for grpd:
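The output looks along these lines (the exact formatting of DESCRIBE varies across Pig versions):

    grunt> DESCRIBE grpd;
    grpd: {group: chararray,log: {user: chararray,time: long,query: chararray}}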
group and log are the two fields in grpd. The field log is a bag with subfields user, time, and query. As we haven’t covered Pig’s type system and the GROUP BY operation yet, we don’t expect you to understand this schema at this point. The point is that relations in Pig can have fairly complex schemas, and DESCRIBE is your friend in understanding the relations you’re working with. A sketch of what it reports for cntd:
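    grunt> DESCRIBE cntd;
    cntd: {group: chararray,long}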
Finally, the FOREACH command operates on the relation grpd to give us cntd. Having looked at the output of cntd, we know it has two fields: the user ID and a count of the number of queries. Pig’s schema for cntd, as given by DESCRIBE, also has two fields. The first one’s name, group, is taken from grpd’s schema. The second field has no name, but it has a type of long. This field is generated by the COUNT function, and the function doesn’t automatically provide a name, although it does tell Pig that the type has to be a long.
Whereas DESCRIBE can tell you the schema of a relation, ILLUSTRATE does a sample run to show, step by step, how Pig would compute the relation. Pig simulates the execution of the statements to compute a relation, but it uses only a small sample of data to make the execution fast. The best way to understand ILLUSTRATE is by applying it to a relation; in this case we use cntd.
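The command itself is simply:

    grunt> ILLUSTRATE cntd;

We omit the sample output tables here; the discussion that follows walks through what they show.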
The ILLUSTRATE command shows that four transformations take place to arrive at cntd. The header row of each table describes the schema of the output relation after the transformation, and the rest of the table shows example data. The log relation is shown as two transformations: the data hasn’t changed from one to the next, but the schema has changed from a generic bytearray (Pig’s type for binary objects) to the specified schema. The GROUP operation on log is executed on the three sample log tuples to arrive at the data for grpd. Based on this, we can infer that the GROUP operation took the user field and made it the group field, and that it grouped all tuples in log with the same user value into a bag in grpd. Seeing sample data in a simulated run by ILLUSTRATE can greatly aid the understanding of the different operations. Finally, we see the FOREACH operation applied to grpd to arrive at cntd. Having seen the data in grpd in the previous table, one can easily infer that the COUNT() function provided the size of each bag.
Although DESCRIBE and ILLUSTRATE are your workhorses in understanding Pig Latin statements, Pig also has an EXPLAIN command to show the logical and physical execution plan in detail. We summarize the diagnostic operators in table 10.3.
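For example, a sketch of how you’d ask for the plans behind cntd:

    grunt> EXPLAIN cntd;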
Supplement
* Pig Document 0.14.0 - Getting Started