One frequent complaint about MapReduce is that it’s difficult to program. When you first think through a data processing task, you may think about it in terms of data flow operations, such as loops and filters. However, as you implement the program in MapReduce, you’ll have to think at the level of mapper and reducer functions and job chaining. Certain functions that are treated as first-class operations in higher-level languages become nontrivial to implement in MapReduce, as we’ve seen for joins in chapter 5. Pig is a Hadoop extension that simplifies Hadoop programming by giving you a high-level data processing language while keeping Hadoop’s simple scalability and reliability. Yahoo, one of the heaviest users of Hadoop (and a backer of both the Hadoop Core and Pig), runs 40 percent of all its Hadoop jobs with Pig. Twitter is another well-known user of Pig.
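Pig has two major components: a high-level data processing language called Pig Latin, and a compiler that compiles and runs your Pig Latin scripts.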
Pig simplifies programming because of the ease of expressing your code in Pig Latin. The compiler helps to automatically exploit optimization opportunities in your script. This frees you from having to tune your program manually. As the Pig compiler improves, your Pig Latin program will also get an automatic speed-up.
Thinking like a Pig
Pig has a certain philosophy about its design. We expect ease of use, high performance, and massive scalability from any Hadoop subproject. More distinctive, and crucial to understanding Pig, are the design choices of its programming language (a data flow language called Pig Latin), the data types it supports, and its treatment of user-defined functions (UDFs) as first-class citizens.
Data flow language
You write Pig Latin programs in a sequence of steps where each step is a single high-level data transformation. The transformations support relational-style operations, such as filter, union, group, and join. An example Pig Latin program that processes a search query log may look like:
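A minimal sketch of such a program, wrapped in a shell snippet that saves it to a script file. The file names (query_count.pig, query.log, user_counts) and the field names are hypothetical, and the log is assumed to be tab-delimited with one query per line:

```shell
# Save a small Pig Latin script to a file. Every file and field name
# here is hypothetical and chosen only for illustration.
cat > query_count.pig <<'EOF'
-- Load a tab-delimited query log and name its three fields.
log = LOAD 'query.log' AS (user, time, query);
-- Gather each user's records together.
grpd = GROUP log BY user;
-- Count the records in each user's group.
cntd = FOREACH grpd GENERATE group, COUNT(log);
-- Write one (user, count) pair per line to an output directory.
STORE cntd INTO 'user_counts';
EOF
```

Each statement names the result of one transformation, so the script reads top to bottom as a data flow: load, group, count, store.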
Data types
We can summarize Pig’s philosophy toward data types in its slogan of “Pigs eat anything.” Input data can come in any format. Popular formats, such as tab-delimited text files, are natively supported. Users can add functions to support other data file formats as well. Pig doesn’t require metadata or schema on data, but it can take advantage of them if they’re provided.
Pig can operate on data that is relational, nested, semistructured, or unstructured. To support this diversity of data, Pig supports complex data types, such as bags and tuples, that can be nested to form fairly sophisticated data structures.
User-defined functions
Pig was designed with many applications in mind, such as processing log data, natural language processing, and analyzing network graphs. It’s expected that many of the computations will require custom processing. Pig is architected from the ground up with support for user-defined functions. Knowing how to write UDFs is a big part of learning to use Pig.
Installing Pig
You can download the latest release of Pig from http://hadoop.apache.org/pig/releases.html. As of this writing, the latest versions of Pig are 0.4 and 0.5. Both of them require Java 1.6. The main difference between them is that Pig version 0.4 targets Hadoop version 0.18 whereas Pig version 0.5 targets Hadoop version 0.20. As usual, make sure to set JAVA_HOME to the root of your Java installation, and Windows users should install Cygwin. Your Hadoop cluster should already be set up. Ideally it’s a real cluster in fully distributed mode, although a pseudo-distributed setup is fine for practice.
You install Pig on your local machine by unpacking the downloaded distribution. There’s nothing you have to modify on your Hadoop cluster. Think of the Pig distribution as a compiler and some development and deployment tools. It enhances your MapReduce programming but is otherwise only loosely coupled with the production Hadoop cluster.
Under the directory where you unpacked Pig, you should create the subdirectories logs and conf (unless they’re already there). Pig will take custom configuration from files in conf. If you’re creating the conf directory just now, it contains no configuration files yet, and you’ll need to create a new file in it named pig-env.sh. This script is executed when you run Pig, and it can be used to set up environment variables for configuring Pig. Besides JAVA_HOME, the environment variables of particular interest are PIG_HADOOP_VERSION and PIG_CLASSPATH. You set these variables to tell Pig about your Hadoop cluster. For example, the following statements in pig-env.sh tell Pig that the cluster runs Hadoop version 0.18, and add the configuration directory of your local Hadoop installation to Pig’s classpath:
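A minimal sketch of such a pig-env.sh. It assumes the HADOOP_HOME environment variable already points at your local Hadoop installation, and that your Pig release expects the version written as the two digits 18; check the launcher script of your release if in doubt.

```shell
# conf/pig-env.sh: read by the pig launcher script at startup.
# Assumption: HADOOP_HOME is already set to your local Hadoop installation.

# Tell Pig the cluster runs Hadoop 0.18 (written as 18).
export PIG_HADOOP_VERSION=18

# Put Hadoop's configuration directory on Pig's classpath so Pig can
# pick up the cluster's filesystem and JobTracker settings from it.
export PIG_CLASSPATH="$HADOOP_HOME/conf/"
```

For a cluster running Hadoop 0.20 with Pig 0.5, the version value would change accordingly.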
Instead of using Pig’s classpath, you can also specify the location of your Hadoop cluster by creating a pig.properties file. This properties file goes under the conf directory you created earlier. It should define fs.default.name and mapred.job.tracker, which give the locations of the filesystem (that is, HDFS’s NameNode) and of the JobTracker. An example pig.properties file pointing to a Hadoop setup in pseudo-distributed mode is:
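A sketch of such a file for a pseudo-distributed setup, created here from the shell. The host name and port numbers (localhost, 9000, 9001) are assumptions; use the values that appear in your own Hadoop configuration.

```shell
# Run from the directory where you unpacked Pig.
# Note: this overwrites any existing conf/pig.properties.
mkdir -p conf

# Host and ports below are assumed values for a pseudo-distributed setup;
# copy the real ones from your Hadoop configuration files.
cat > conf/pig.properties <<'EOF'
# Location of the filesystem (HDFS NameNode)
fs.default.name=hdfs://localhost:9000
# Location of the JobTracker
mapred.job.tracker=localhost:9001
EOF
```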
Let’s start Pig’s interactive shell by running the pig command to see that it’s reading the configurations properly.
You’re now inside Pig’s interactive shell, also known as Grunt.
Running Pig
We can run Pig Latin commands in three ways: via the Grunt interactive shell, through a script file, and as embedded queries inside Java programs. Each way can work in one of two modes: local mode and Hadoop mode. (Hadoop mode is sometimes called Mapreduce mode in the Pig documentation.) At the end of the previous section we entered the Grunt shell running in Hadoop mode.
The Grunt shell allows you to enter Pig commands manually. This is typically used for ad hoc data analysis or during the interactive cycles of program development. Large Pig programs or ones that will be run repeatedly are run in script files. To enter Grunt, use the command pig. To run a Pig script, execute the same pig command with the file name as the argument, such as pig myscript.pig. The convention is to use the .pig extension for Pig scripts.
You can think of Pig programs as similar to SQL queries, and Pig provides a PigServer class that allows any Java program to execute Pig queries. Conceptually this is analogous to using JDBC to execute SQL queries. Embedding Pig programs is a fairly advanced topic, and you can find more details at http://wiki.apache.org/pig/EmbeddedPig.
When you run Pig in local mode, you don’t use Hadoop at all. Pig commands are compiled to run locally in their own JVM, accessing local files. This is typically used for development purposes, where you can get fast feedback by running locally against a small development data set. Running Pig in Hadoop mode means the compiled Pig program will physically execute in a Hadoop installation. Typically the Hadoop installation is a fully distributed cluster. (The pseudo-distributed Hadoop setup we used in section 10.2 was purely for demonstration. It’s rarely used except to debug configurations.) The execution mode is specified to the pig command via the -x or -exectype option. You can enter the Grunt shell in local mode with pig -x local. To enter the Grunt shell in Hadoop mode, use pig -x mapreduce, or use the pig command without arguments, as it chooses the Hadoop mode by default.
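To illustrate the fast feedback that local mode gives during development, here is a minimal sketch that builds a tiny data set and a script, then runs the script locally. The file names, field names, and sample records are all made up, and the run is skipped if the pig launcher isn’t on your PATH.

```shell
# Tiny tab-delimited sample log: user, timestamp, query (made-up data).
printf 'alice\t1\tpig latin\nbob\t2\thadoop\nalice\t3\tgrunt shell\n' > sample.log

# A small Pig Latin script (hypothetical name) that keeps one user's queries.
cat > filter_user.pig <<'EOF'
log = LOAD 'sample.log' AS (user:chararray, time:long, query:chararray);
mine = FILTER log BY user == 'alice';
DUMP mine;
EOF

# Run the script in local mode, but only if the pig launcher is installed.
if command -v pig >/dev/null 2>&1; then
    pig -x local filter_user.pig
else
    echo "pig not found on PATH; skipping the run"
fi
```

Running the same script with -x mapreduce would send it to the Hadoop installation instead, where sample.log would be read from HDFS rather than from the local disk.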
Managing the Grunt shell
In addition to running Pig Latin statements (which we’ll look at in a later section), the Grunt shell supports some basic utility commands. Typing help will print out a help screen of such utility commands. You exit the Grunt shell with quit. You can stop a Hadoop job with the kill command followed by the Hadoop job ID. Some Pig parameters are set with the set command, for example set debug on or set job.name 'my job'.
The debug parameter states whether debug-level logging is turned on or off. The job.name parameter takes a single-quoted string and will use that as the Pig program’s Hadoop job name. It’s useful to set a meaningful name to easily identify your Pig job in Hadoop’s Web UI.
The Grunt shell also supports file utility commands, such as ls and cp. You can see the full list of utility commands and file commands in table 10.1. The file commands are mostly a subset of the HDFS filesystem shell commands, and their usage should be self-explanatory.
Two new commands are exec and run. They run Pig scripts while inside the Grunt shell and can be useful in debugging Pig scripts. The exec command executes a Pig script in a separate space from the Grunt shell. Aliases defined in the script aren’t visible to the shell and vice versa. The run command executes a Pig script in the same space as Grunt (also known as interactive mode). It has the same effect as manually typing each line of the script into the Grunt shell.