CSCI 343 — XML, JSON & relational algebra

Goal

This lab is going to introduce you to some programs for performing the following tasks.

Getting ready

Go to the command line and create a directory for the lab. In the examples of this lab, the “[…]$” indicates the bash shell prompt.

[…]$ mkdir -p csci/343/lab2

XML and DTD

Download an XML DTD specification for an XML file format used for data exchange about computer accounts and an XML file that conforms to that specification. Store them in your csci/343/lab2 directory.

Good XML

Connect to your csci/343/lab2 directory and use the xmllint program to verify that the XML file is not malformed. Many of those on-line XML verifiers actually use xmllint behind the scenes.

[…]$ cd csci/343/lab2
[…]$ xmllint --noout UsersInput.xml
[…]$ echo $?

Because the XML file is good and you used the --noout option, xmllint says nothing. However, its return code, which you printed with echo $? is zero. Following an ancient tradition from the era of mainframes, this indicates that no errors were found in processing the file.

Malformed XML

Now dirty up your XML by removing one of the end tags in your XML file. Store the bad XML into a file called BadInput.xml in your directory.

Following the advice of many CSCI graduates, I suggest you use NetBeans to edit your files in the lab. It has some for XML, DTD and JSON.

Now use xmllint on BadInput.xml, your bad file. You should get an error message and a non-zero return code for malformed XML because your XML file doesn’t have matching start and end tags.

Non-conforming XML

Restore the end tag to your XML file to make it well-formed. Now delete one of the required elements, such as a the last name of our first president. Go ahead and run the file through xmllint. It will not complain because you still have well-formed XML.

However, run xmllint with the --valid option. You should notice that xmllint will be a tad slower because is must fetch the DTD from the net. It will also print some error messages and yield a return code of 4.

[…]$ xmllint --valid --noout BadInput.xml
[…]$ echo $?

Keep fixing your XML until it passes xmllint’s tests. It may take more than one attempt.

Modifying the DTD and XML

Load up the DTD into your editor and make the following changes:

Store your updated DTD file as NamesInput.dtd .

Next, modify your XML file to conform to the new DTD and store it under the name NamesInput.xml . This is largely a matter of renaming the users tags and deleting a lot of elements that are no longer needed. You should also remove the !DOCTYPE tag.

Use xmllint to make sure that NamesInput.xml is well formed. Then use xmllint with the --dtdvalid option to make sure both your DTD and XML files are valid.

[…]$ xmllint --dtdvalid NamesInput.dtd --noout NamesInput.xml 
[…]$ echo $?

Again, iterate to you get the 0 return code.

JSON

Using NetBeans, create a new JSON file called names.json in your directory. If you are a NetBeans newbie, find a guru to help you get started.

Make your names.json file resemble your NamesInput.xml file. To increase the probably of the lab ending in time, start with the following and add the missing president. You need to think about the JSON-appropriate way to add a president with two middle names. It may require a slight modification of what you see below.

{
    "college"  :  "UNCA",
    "names"    :
    [
        {
            "first"      : "George" ,
            "last"       : "Washington"
        }
        ,
        {
            "first"      : "James" ,
            "last"       : "Carter" ,
            "middle"     : "Earl" ,
            "suffix"     : "Jr." ,
            "prefer"     : "Jimmy"
        }
    ]
}

NetBeans is pretty good about recognizing bad JSON. Pay attention to its suggestions.

JSON validation

Given JSON’s relation to JavaScript, it’s not surprising that there are many on-line JSON validators. Use Google to find one and try it out on your JSON.

JSON schema

Now let’s work on an appropriate JSON schema. Since JSON schemas are specified using JSON, create a second JSON file names-schema.json for your schema. I started with a cut-and-paste from my names.json and modified it by looking at the JSON examples we examined in class. However that took forever, so I’m just going to give you a JSON definition that is almost done.

{
    "$schema" : "http://json-schema.org/draft-04/schema#",
    "title" : "JSON Names schema for CSCI 343" ,
    "type" : "object" ,
    "properties" : {
        "college" : {
            "type" : "string"  
        },
        "names" : {
            "type" : "array" ,
            "items" : {
                "type" : "object" ,
                "properties" : {
                    "first" : {
                        "type" : "string"
                    },
                    "middle" : {
                        "type" : "string"              
                    } ,
                    "last" : {
                        "type" : "string"
                    },
                    "prefer" : {
                        "type" : "string"
                    }
                }
            }
        }
    }
}

Now its time to validate your JSON again your JSON schema. Again, there are many on-line JSON validators. I used the one at the top of my Google search for “json schema validator online.”

However, don’t just cut-and-paste this into the validator because it does have a bug for you to fix. (Think about the middle names.) When you are satisfied with your schema, load it into NetBeans and begin the validate-fix cycle. Also, see if you can figure out how to modify this JSON schema so that it really requires first and last fields in the name.

Relational algebra

Since it’s written in Java, you can install RA: A Relational Algebra Interpreter written by Jun Yang for just about any computer.

However, you will need to have your own copy to get RA to work during this lab. To do this, execute the following magic system administration commands. Be sure to ask someone what each of them does so that you can put “Linux system administrator” on your resume.

[…]$ cd ~/csci/343
[…]$ wget http://www.cs.duke.edu/~junyang/ra/ra-2.1b.tgz
[…]$ ls -l
[…]$ tar xfz ra-2.1b.tgz
[…]$ ls -l
[…]$ cd ra-2.1b

At this point you can start the relational algebra interpreter with the following command:

[…]$ java -jar ra.jar

Now you can type relational algebra statements to look at this database with six tables. (The database was created at Duke so don’t blame me for its contents.) Go ahead and type the following command to print a help message and to list the six tables of the database. Be sure to end all RA commands with a semi-colon.

ra> \help ;
ra> \list ;

RA reads input lines (given in an odd syntax) representing relational algebra expressions and translates them SQL statements that are passed on to an sqlite3 database for execution.

The simplest relational expressions

Let’s start with two very simple relational expressions to see the attributes and tuples of the DRINKER and LIKES relations.

ra> DRINKER ;
ra> LIKES ;

Π and σ

Next, use the relational algebra π (project) operator to list single attributes of these relations. Notice how an underscore is needed for any relational operator that is typically specialized with subscripted information. When the relational algebraist wants Πdrinker LIKES, they must request “\project_{drinker} LIKES ”.

ra> \project_{drinker} LIKES ;
ra> \project_{beer} LIKES ;

This time use a σ operation (select) to figure out what Coy likes.

ra> \select_{drinker='Coy'} LIKES ;

The returned relation has a single tuple, but that tuple contains a pair of values

Now do one by yourself. Use a select and a project to print only the item that Coy likes.

Joins

Next try out the following pair of operations. You will be disappointed at the results of the second expression.

ra> DRINKER \cross LIKES ;
ra> DRINKER \join LIKES ;

The problem is that DRINKER and LIKES use two different attribute names for people. You can use the \rename_ to fix that. Or you can just use a θ join.

So we end with an assignment. Write a relational expression to pairs the addresses and likes of all people, without listing the name of the people. This useless table is shown below.

Output schema: (address text, beer text)
-----
100 W. Main Street|Amstel
100 W. Main Street|Corona
101 W. Main Street|Amstel
101 W. Main Street|Budweiser
200 S. Duke Street|Dixie
300 N. Duke Street|Amstel
300 N. Duke Street|Budweiser
300 N. Duke Street|Corona
300 N. Duke Street|Dixie
300 N. Duke Street|Erdinger
-----
Total number of rows: 10

It is a good idea to build up your expression one operator at a time.