CSCI 343 — Pig & Hadoop

Goal

Use a NoSQL database.

Getting started

Open the following links in additional browser tabs.

Also, just to make sure you can git it, use the following command to download the files used in the Pig tutorial.

[…]$ git clone https://github.com/rohitsden/pig-tutorial

This should create a subdirectory called pig-tutorial of your current directory. Be sure that you know your way around git before you look for a job.

Getting pig started

Using the following commands, change to your pig-tutorial directory and start pig with the -x local option.

[…]$ cd pig-tutorial
[…]$ ls -l
[…]$ tail movies_data.csv
[…]$ pig -x local

Take note of the movies_data.csv file in your directory. It is a 49590-line comma-separated file of movie data. That is your database.

Instead of using a networked Hadoop server, you will be using a local Hadoop server built into pig. You will soon notice that pig generates lots of annoying output. It is difficult to see the useful results within the oinks.

Your first Pig commands

Move about halfway through the first part of the Pig Tutorial until you see the “Pig Latin” section header. You can find it by searching for the string “PigStorage”.

Read this section carefully. There are five Pig Latin commands in the section. Execute each of these commands, but omit the /home/hduser/pig/myscripts/ prefix from the file names. You can just write something like movies_data.csv.
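As a sketch of what dropping the path prefix looks like, the statements below load the file from the current directory and print only a few tuples. The field names are borrowed from the typed LOAD that appears later in this lab, and the alias first_few is made up for illustration; the tutorial's own statements may differ. Type them at the grunt> prompt.

```pig
-- Load the CSV from the directory where pig was started,
-- rather than from /home/hduser/pig/myscripts/.
-- Field names here are an assumption taken from later in this lab.
movies = LOAD 'movies_data.csv' USING PigStorage(',') AS (id, name, year, rating, duration);

-- Keep a handful of tuples so the output is not buried in log messages.
-- first_few is a hypothetical alias used only in this sketch.
first_few = LIMIT movies 5;
DUMP first_few;
```

Using LIMIT before DUMP keeps the useful result short enough to spot among pig's log lines.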

Ask pig which movies were released in 1996.

Starting the second tutorial

Quit pig and then restart it. Remember to use -x local to get the built-in Hadoop server.

Now look at the second part of the tutorial. I am afraid there are a few formatting issues in its first section. You may find the following more readable and more useful for cut-and-paste.

grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id:int,name:chararray,year:int,rating:double,duration:int);
grunt> movies_rating_3_4 = FILTER movies BY rating>3.0 and rating<4.0; 
grunt> DESCRIBE ;
grunt> DESCRIBE movies ;

Notice that we are now giving types to the values stored in the movies_data.csv database.

Continuing the second tutorial

Continue through the second tutorial, but stop right before the section on ORDER BY.

pig has a FOREACH operator, which is a bit like Java’s for-each loop. pig is a procedural language: you spell out the sequence of steps that transforms the data. SQL is a declarative language: you describe the result you want.
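A minimal sketch of the for-each idea, assuming movies_data.csv is in the current directory and has the columns used in the typed LOAD above. The alias names_and_years is made up for illustration.

```pig
movies = LOAD 'movies_data.csv' USING PigStorage(',')
         AS (id:int, name:chararray, year:int, rating:double, duration:int);

-- FOREACH visits every tuple of movies; GENERATE says what to emit for each one.
-- This is roughly the step that SELECT name, year FROM movies expresses in SQL.
names_and_years = FOREACH movies GENERATE name, year;

-- Check the schema of the new relation before looking at any data.
DESCRIBE names_and_years;
```

Each statement names an intermediate result, so the script reads as a sequence of steps rather than as one query.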

Within the discussion of the GROUP operator, pay attention to how FOREACH, GENERATE, and SUM generate data structures. You will have a better chance of understanding what is going on if you type the DESCRIBE command immediately after you define each new relation. That is, type the following commands, rather than those shown on the tutorial page.

grunt> grouped_by_year = group movies by year;
grunt> DESCRIBE grouped_by_year ;
grunt> count_by_year = FOREACH grouped_by_year GENERATE group, COUNT(movies);
grunt> DESCRIBE count_by_year ;
grunt> group_all = GROUP count_by_year ALL;
grunt> DESCRIBE group_all ;
grunt> sum_all = FOREACH group_all GENERATE SUM(count_by_year.$1);
grunt> DESCRIBE sum_all ;
grunt> DUMP sum_all;

Don’t worry about the long output sequences. Just look at the last few lines produced by each statement. Notice how tuples, bags and maps are used to represent data. This looks more like data structures than databases.
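To see the bag inside each group from another angle, here is a sketch that averages ratings per year instead of counting movies. It assumes the same typed columns as above; the aliases by_year and avg_by_year are made up for illustration.

```pig
movies = LOAD 'movies_data.csv' USING PigStorage(',')
         AS (id:int, name:chararray, year:int, rating:double, duration:int);

-- Each tuple of by_year has two fields: group (the year) and movies,
-- a bag holding every movie tuple released in that year.
by_year = GROUP movies BY year;
DESCRIBE by_year;

-- movies.rating projects the rating field out of the bag; AVG reduces it to one value.
avg_by_year = FOREACH by_year GENERATE group AS year, AVG(movies.rating) AS avg_rating;
DESCRIBE avg_by_year;
```

Comparing the two DESCRIBE results shows the nested bag being collapsed to a single value per year.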

Being social

Stop pig. Download a colon-separated representation of the Highschooler table and store it on your computer.

1510:Jordan:9
1689:Gabriel:9
1381:Tiffany:9
1709:Cassandra:9
1101:Haley:10
1782:Andrew:10
1468:Kris:10
1641:Brittany:10
1247:Alexis:11
1316:Austin:11
1911:Gabriel:11
1501:Jessica:11
1304:Jordan:12
1025:John:12
1934:Kyle:12
1661:Logan:12
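Before the data can be filtered, it has to be loaded with a colon as the delimiter. A minimal sketch, assuming the lines above were saved as highschooler.txt (a hypothetical file name) in the directory where pig was started, and assuming the three columns are an ID, a name, and a grade:

```pig
-- PigStorage takes the field delimiter as its argument; here it is a colon, not a comma.
-- The file name and the field names are assumptions made for this sketch.
students = LOAD 'highschooler.txt' USING PigStorage(':')
           AS (id:int, name:chararray, grade:int);

-- Confirm the schema, then confirm that all 16 lines were read.
DESCRIBE students;
DUMP students;
```

The FILTER and GROUP steps for the exercises follow the same patterns used with the movies data.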

Generate a list of students in the 10th grade using pig.

If you feel ambitious, generate a table of the number of students in each grade.

That’ll do, pig.