Thursday, April 23, 2015

[ In Action ] Ch10. Pig: Using Pig (2)

Learning Pig Latin through Grunt 
Before formally describing Pig’s data types and data processing operators, let’s run a few commands in the Grunt shell to get a feel for how to process data in Pig. For the purpose of learning, it’s more convenient to run Grunt in local mode: 
# pig -x local

You may want to first try some of the file commands, such as pwd and ls, to orient yourself around the filesystem. 
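For example (the output depends on the directory you started Pig from, so it isn't shown here):
grunt> pwd
grunt> ls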

Let’s look at some data. We’ll come back to the patent data introduced in chapter 4 later, but for now let’s dig into an interesting data set of query logs from the Excite search engine. This data set comes with the Pig installation, in the file tutorial/data/excite-small.log under the Pig installation directory. The data comes in a three-column, tab-separated format. The first column is an anonymized user ID, the second is a Unix timestamp, and the third is the search query. A few of the 4,500 records in this file look like this:

2A9EABFB35F5B954    970916105432    +md foods +proteins
BED75271605EBD0C    970916001949    yahoo chat
BED75271605EBD0C    970916001954    yahoo chat
BED75271605EBD0C    970916003523    yahoo chat

From within Grunt, enter the following statement to load this data into an “alias” (i.e., a variable) called log:
grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);

Note that nothing seems to have happened after you entered the statement. In the Grunt shell, Pig parses your statements but doesn’t physically execute them until you use a DUMP or STORE command to ask for the results. The DUMP command prints out the content of an alias whereas the STORE command stores the content to a file. The fact that Pig doesn’t physically execute any command until you explicitly request some end result will make sense once you remember that we’re processing large data sets. There’s no memory space to “load” the data, and in any case we want to verify the logic of the execution plan before spending the time and resources to physically execute it. 

We use the DUMP command usually only for development. Most often you’ll STORE significant results into a directory. (Like Hadoop, Pig will automatically partition the data into files named part-nnnnn.) When you DUMP an alias, you should be sure that its content is small enough to be reasonably printed to the screen. The common way to do that is to create another alias through the LIMIT command and DUMP that new, smaller alias. The LIMIT command lets you specify how many tuples (rows) to return. For example, to see four tuples of log:
grunt> lmt = LIMIT log 4;
grunt> DUMP lmt;
(2A9EABFB35F5B954,970916105432L,+md foods +proteins)
(BED75271605EBD0C,970916001949L,yahoo chat)
(BED75271605EBD0C,970916001954L,yahoo chat)
(BED75271605EBD0C,970916003523L,yahoo chat)

Table 10.2 summarizes the read and write operators in Pig Latin. LIMIT is technically not a read or write operator, but as it’s often used alongside them, we’ve included it in the table.
Table 10.2 Data read/write operators in Pig Latin
LOAD    Reads data from the file system into a relation (alias)
LIMIT   Limits a relation to a specified number of tuples
DUMP    Prints the content of a relation to the screen
STORE   Stores the content of a relation into files in a directory

Let’s execute a few data processing statements and see how we can explore Pig Latin through Grunt. 
grunt> log = LOAD 'tutorial/data/excite-small.log'
       AS (user:chararray, time:long, query:chararray);
grunt> grpd = GROUP log BY user; 
grunt> cntd = FOREACH grpd GENERATE group, COUNT(log);
grunt> STORE cntd INTO 'output';

The preceding statements count the number of queries each user has searched for. The content of the output file (you’ll have to look at it from outside Grunt) looks like this:
002BB5A52580A8ED 18
005BD9CD3AC6BB38 18
00A08A54CD03EB95 3
011ACA65C2BF70B2 5
01500FAFE317B7C0 15
...
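For example, from a regular shell you can inspect the stored results with something like the following; the exact part file names vary by Pig and Hadoop version, but they follow the part-nnnnn pattern mentioned earlier:
# ls output
# cat output/part-*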

Conceptually we’ve performed an aggregating operation similar to the SQL query:
SELECT user, COUNT(*) FROM excite-small.log GROUP BY user;

Two main differences between the Pig Latin and SQL versions are worth pointing out. As we’ve mentioned earlier, Pig Latin is a data processing language. You’re specifying a series of data processing steps instead of a complex SQL query with clauses. The other difference is more subtle—relations in SQL always have fixed schemas. In SQL, we define a relation’s schema before it’s populated with data. Pig takes a much looser approach to schemas. In fact, you don’t need to use schemas at all if you don’t want to, which may be the case when handling semistructured or unstructured data. Here we do specify a schema for the relation log, but it appears only in the load statement and it’s not enforced until the data is actually loaded. Any field that doesn’t obey the schema in the load operation is cast to null. In this way the relation log is guaranteed to obey our stated schema for subsequent operations.
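For example, the same file can be loaded without any schema at all. The fields of the resulting relation are then referred to by position ($0 for the first field, $1 for the second, and so on), and Pig will report that it doesn’t know the schema. A small sketch (the alias names raw and queries are arbitrary):
grunt> raw = LOAD 'tutorial/data/excite-small.log';
grunt> DESCRIBE raw;
Schema for raw unknown.
grunt> queries = FOREACH raw GENERATE $2;  -- the query field, referenced by position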

As much as possible, Pig tries to figure out the schema for a relation based on the operation used to create it. You can expose Pig’s schema for any relation with the DESCRIBE command. This can be useful in understanding what a Pig statement is doing. For example, we’ll look at the schemas for grpd and cntd. Before doing this, let’s first see how the DESCRIBE command describes log. 
grunt> DESCRIBE log;
log: {user: chararray,time: long,query: chararray}

As expected, the load command gives log the exact schema we’ve specified. The relation log consists of three fields named user, time, and query. The fields user and query are both strings (chararray in Pig) whereas time is a long integer.

The GROUP BY operation on the relation log generates the relation grpd. Based on the operation and the schema for log, Pig infers a schema for grpd:
grunt> DESCRIBE grpd;
grpd: {group: chararray,log: {(user: chararray,time: long,query: chararray)}}

group and log are the two fields in grpd. The field log is a bag with subfields user, time, and query. As we haven’t covered Pig’s type system and the GROUP BY operation, we don’t expect you to understand this schema yet. The point is that relations in Pig can have fairly complex schemas, and DESCRIBE is your friend in understanding the relations you’re working with:
grunt> DESCRIBE cntd;
cntd: {group: chararray,long}

Finally, the FOREACH command operates on the relation grpd to give us cntd. Having looked at the output of cntd, we know it has two fields—the user ID and a count of the number of queries. Pig’s schema for cntd, as given by DESCRIBE, also has two fields. The first one’s name—group—is taken from grpd’s schema. The second field has no name, but it has a type of long. This field is generated by the COUNT function, and the function doesn’t automatically provide a name, although it does tell Pig that the type has to be a long.
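If you want that second field to carry a name, you can supply one yourself with an AS clause in the FOREACH statement. A small sketch (the field name cnt is arbitrary):
grunt> cntd = FOREACH grpd GENERATE group, COUNT(log) AS cnt;
grunt> DESCRIBE cntd;
cntd: {group: chararray,cnt: long}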

Whereas DESCRIBE can tell you the schema of a relation, ILLUSTRATE does a sample run to show, step by step, how Pig would compute the relation. Pig tries to simulate the execution of the statements to compute a relation, but it uses only a small sample of data to make the execution fast. The best way to understand ILLUSTRATE is by applying it to a relation; in this case we use cntd. (The output is reformatted to fit the width of a printed page.)
grunt> ILLUSTRATE cntd;

The ILLUSTRATE command shows four transformations on the way to cntd. The header row of each table describes the schema of the output relation after the transformation, and the rest of the table shows example data. The log relation appears as two transformations: the data doesn’t change from one to the next, but the schema changes from a generic bytearray (Pig’s type for binary objects) to the specified schema. The GROUP operation on log is executed on the three sample log tuples to arrive at the data for grpd. From this we can infer that the GROUP operation takes the user field and makes it the group field, and groups all tuples in log with the same user value into a bag in grpd. Seeing sample data in a simulated run by ILLUSTRATE can greatly aid the understanding of the different operations. Finally, we see the FOREACH operation applied to grpd to arrive at cntd. Having seen the data in grpd in the previous table, one can easily infer that the COUNT() function provided the size of each bag.

Although DESCRIBE and ILLUSTRATE are your workhorses in understanding Pig Latin statements, Pig also has an EXPLAIN command to show the logical and physical execution plan in detail. We summarize the diagnostic operators in table 10.3. 
Table 10.3 Diagnostic operators in Pig Latin
DESCRIBE     Shows the schema of a relation
EXPLAIN      Shows the logical and physical execution plan used to compute a relation
ILLUSTRATE   Performs a sample run to show, step by step, how a relation is computed
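For example, to see the plans Pig would use to compute cntd (the output is lengthy, so it isn’t shown here):
grunt> EXPLAIN cntd;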

Supplement 
Pig Documentation 0.14.0 - Getting Started

