Goal
Use a NoSQL database.
Getting started
Open the following links in additional browser tabs.
- Rohit Menon’s Apache Pig Tutorial — Part 1
- Rohit Menon’s Apache Pig Tutorial — Part 2
- Pig Latin Basics
Also, just to make sure you can git it, use the following command to download the files used in the Pig tutorial.
[…]$ git clone https://github.com/rohitsden/pig-tutorial
This should create a subdirectory called pig-tutorial of your current directory. Be sure that you know your way around git before you look for a job.
Getting pig started
Using the following commands, change to your pig-tutorial directory and start pig using the -x local option.
[…]$ cd pig-tutorial
[…]$ ls -l
[…]$ tail movies_data.csv
[…]$ pig -x local
Take note of the movies_data.csv file in your directory. It is a 49590-line comma-separated file of movie data. That is your database.
Instead of using a networked Hadoop server, you will be using a local Hadoop server built into pig. You will soon notice that pig generates lots of annoying output. It is difficult to see the useful results within the oinks.
Your first Pig commands
Move about halfway through the first part of the Pig Tutorial until you see the “Pig Latin” section header. You can find it by searching for the string “PigStorage”.
Read this section carefully. There are five Pig Latin commands in the section. Execute each of these commands, but omit the /home/hduser/pig/myscripts/ prefix from the file names. You can just say something like movies_data.csv.
Ask pig which movies were released in 1996.
Starting the second tutorial
Quit pig and then restart it. Remember to use the -x local option to get the built-in Hadoop server.
Now look at the second part of the tutorial. I am afraid there are a few formatting issues in the first section. You may find the following more readable and useful for cut-and-paste.
grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id:int,name:chararray,year:int,rating:double,duration:int);
grunt> movies_rating_3_4 = FILTER movies BY rating>3.0 and rating<4.0;
grunt> DESCRIBE ;
grunt> DESCRIBE movies ;
Notice that we are now giving types to the values stored in the movies_data.csv database.
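If you are more at home in a general-purpose language, it may help to see the typed LOAD and the FILTER above modelled in a few lines of Python. This is only a sketch of the semantics, not of how pig actually executes anything. The three sample rows are invented placeholders, not lines from movies_data.csv, and the names Movie and load_movies are made up for the illustration.

```python
from typing import List, NamedTuple


class Movie(NamedTuple):
    # Mirrors the schema given in the AS clause of the LOAD statement.
    id: int
    name: str
    year: int
    rating: float
    duration: int


def load_movies(text: str) -> List[Movie]:
    """Split each line on commas and cast every field to its declared type."""
    movies = []
    for line in text.splitlines():
        f = line.split(",")
        movies.append(Movie(int(f[0]), f[1], int(f[2]), float(f[3]), int(f[4])))
    return movies


# Invented placeholder rows (NOT taken from movies_data.csv).
SAMPLE = "1,Movie A,1990,3.5,5400\n2,Movie B,1996,4.2,6000\n3,Movie C,2001,2.9,4800\n"

movies = load_movies(SAMPLE)

# Same idea as: FILTER movies BY rating>3.0 and rating<4.0;
movies_rating_3_4 = [m for m in movies if 3.0 < m.rating < 4.0]

print(movies_rating_3_4)
```

Because rating was cast to a floating-point number on the way in, the comparison is numeric rather than textual. That is what the type annotations in the LOAD statement buy you.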
Continuing the second tutorial
Continue through the second tutorial, but stop right before the section for ORDER BY.
pig has a FOREACH operator which is a bit like Java’s for-each operator.
pig is a procedural language. SQL is a declarative language.
Within the discussion of the GROUP operator, pay attention to how FOREACH, GENERATE, and SUM generate data structures.
You will have a better chance of understanding what is going on if you frequently type the DESCRIBE command immediately after you define a new table.
That is, type the following commands, rather than those shown on the tutorial page.
grunt> grouped_by_year = group movies by year;
grunt> DESCRIBE grouped_by_year ;
grunt> count_by_year = FOREACH grouped_by_year GENERATE group, COUNT(movies);
grunt> DESCRIBE count_by_year ;
grunt> group_all = GROUP count_by_year ALL;
grunt> DESCRIBE group_all ;
grunt> sum_all = FOREACH group_all GENERATE SUM(count_by_year.$1);
grunt> DESCRIBE sum_all ;
grunt> DUMP sum_all;
Don’t get worried about the long output sequences. Just look at the last few lines produced by each statement. Notice how tuples, bags and maps are used to represent data. This looks more like data structures than databases.
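To see why these results look like data structures, here is a rough Python model of the same GROUP, COUNT and SUM statements, with a bag played by a list and a tuple played by a Python tuple. It is a sketch under stated assumptions: the three rows are invented placeholders rather than lines from movies_data.csv, only three of the five fields are kept, and the sort is there only to make the printed order stable (pig makes no such promise).

```python
from collections import defaultdict

# Invented placeholder rows (id, name, year), NOT taken from movies_data.csv.
movies = [
    (1, "Movie A", 1990),
    (2, "Movie B", 1996),
    (3, "Movie C", 1996),
]

# grouped_by_year = group movies by year;
# Each result is a tuple (group, bag); the bag holds the complete original tuples.
bags = defaultdict(list)
for movie in movies:
    bags[movie[2]].append(movie)
grouped_by_year = sorted(bags.items())

# count_by_year = FOREACH grouped_by_year GENERATE group, COUNT(movies);
count_by_year = [(group, len(bag)) for group, bag in grouped_by_year]

# group_all = GROUP count_by_year ALL;
# A single tuple: the constant key 'all' plus one bag of every (year, count) tuple.
group_all = [("all", count_by_year)]

# sum_all = FOREACH group_all GENERATE SUM(count_by_year.$1);
# count_by_year.$1 picks field 1 (the count) out of each tuple in the bag.
sum_all = [(sum(t[1] for t in bag),) for _, bag in group_all]

print(grouped_by_year)
print(count_by_year)
print(group_all)
print(sum_all)
```

Compare each printed value with what DESCRIBE reports for the relation of the same name. No map appears in this particular sequence, only tuples nested inside bags.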
Being social
Stop pig.
Download a colon-separated representation of the Highschooler table and store it on your computer.
1510:Jordan:9
1689:Gabriel:9
1381:Tiffany:9
1709:Cassandra:9
1101:Haley:10
1782:Andrew:10
1468:Kris:10
1641:Brittany:10
1247:Alexis:11
1316:Austin:11
1911:Gabriel:11
1501:Jessica:11
1304:Jordan:12
1025:John:12
1934:Kyle:12
1661:Logan:12
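Before loading the file into pig, it is worth being clear about its shape: one record per line, three fields per record, separated by colons. The Python sketch below parses the sixteen records listed above. The field order (numeric ID, name, grade) is an assumption read off the data, and the function name parse is made up for the illustration.

```python
# The sixteen records listed above, one per line.
HIGHSCHOOLER = """\
1510:Jordan:9
1689:Gabriel:9
1381:Tiffany:9
1709:Cassandra:9
1101:Haley:10
1782:Andrew:10
1468:Kris:10
1641:Brittany:10
1247:Alexis:11
1316:Austin:11
1911:Gabriel:11
1501:Jessica:11
1304:Jordan:12
1025:John:12
1934:Kyle:12
1661:Logan:12
"""


def parse(line):
    # Assumed field order: numeric ID, name, grade.
    ident, name, grade = line.split(":")
    return int(ident), name, int(grade)


students = [parse(line) for line in HIGHSCHOOLER.splitlines()]

print(len(students))
print(students[0])
```

In pig the same two decisions show up in the LOAD statement: the delimiter you hand to PigStorage and the types you list after as.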
Generate a list of students in the 10th grade using pig.
If you feel ambitious, generate a table of the number of students in each grade.