Goal
This lab is going to introduce you to some programs for performing the following tasks.
- Validating an XML file with a DTD specification
- Validating an JSON file with a JSON schema
- Writing some relational algebra
Getting ready
Go to the command line and create a directory for the lab. In the examples of this lab, the “[…]$” indicates the bash shell prompt.
[…]$ mkdir -p csci/343/lab2
XML and DTD
Download an XML DTD specification for an XML file format used for data exchange about computer accounts and an XML file that conforms to that specification. Store them in your csci/343/lab2 directory.
Good XML
Connect to your csci/343/lab2 directory and use the xmllint program to verify that the XML file is not malformed. Many of those on-line XML verifiers actually use xmllint behind the scenes.
[…]$ cd csci/343/lab2 […]$ xmllint --noout UsersInput.xml […]$ echo $?
Because the XML file is good and you used the
--noout
option,
xmllint says nothing.
However, its return code, which you printed with
echo $? is zero.
Following an ancient tradition from the era of mainframes,
this indicates that no errors were found in processing the file.
Malformed XML
Now dirty up your XML by removing one of the end tags in your XML file. Store the bad XML into a file called BadInput.xml in your directory.
Following the advice of many CSCI graduates, I suggest you use NetBeans to edit your files in the lab. It has some for XML, DTD and JSON.
Now use xmllint on BadInput.xml, your bad file. You should get an error message and a non-zero return code for malformed XML because your XML file doesn’t have matching start and end tags.
Non-conforming XML
Restore the end tag to your XML file to make it well-formed. Now delete one of the required elements, such as a the last name of our first president. Go ahead and run the file through xmllint. It will not complain because you still have well-formed XML.
However, run xmllint with the
--valid
option. You should notice that
xmllint will be a tad slower because
is must fetch the DTD from the net.
It will also print some error messages and yield a return code of 4.
[…]$ xmllint --valid --noout BadInput.xml […]$ echo $?
Keep fixing your XML until it passes xmllint’s tests. It may take more than one attempt.
Modifying the DTD and XML
Load up the DTD into your editor and make the following changes:
- Rename the top-level element from
users
tonames
and then make thenames
element an optionalcollege
element followed by a sequence of zero or morename
elements. That is, delete thesubmitter
attribute. - Remove all the element definitions that are no longer needed. There should be seven of these.
Store your updated DTD file as NamesInput.dtd .
Next, modify your XML file to conform to the new DTD and store it under
the name NamesInput.xml .
This is largely a matter of renaming the users
tags and
deleting a lot of elements that are no longer needed.
You should also remove the !DOCTYPE
tag.
Use xmllint
to make sure that
NamesInput.xml is well formed.
Then use xmllint
with the --dtdvalid
option to make sure both your DTD and XML files are valid.
[…]$ xmllint --dtdvalid NamesInput.dtd --noout NamesInput.xml […]$ echo $?
Again, iterate to you get the 0 return code.
JSON
Using NetBeans, create a new JSON file called names.json in your directory. If you are a NetBeans newbie, find a guru to help you get started.
Make your names.json file resemble your NamesInput.xml file. To increase the probably of the lab ending in time, start with the following and add the missing president. You need to think about the JSON-appropriate way to add a president with two middle names. It may require a slight modification of what you see below.
{ "college" : "UNCA", "names" : [ { "first" : "George" , "last" : "Washington" } , { "first" : "James" , "last" : "Carter" , "middle" : "Earl" , "suffix" : "Jr." , "prefer" : "Jimmy" } ] }
NetBeans is pretty good about recognizing bad JSON. Pay attention to its suggestions.
JSON validation
Given JSON’s relation to JavaScript, it’s not surprising that there are many on-line JSON validators. Use Google to find one and try it out on your JSON.
JSON schema
Now let’s work on an appropriate JSON schema. Since JSON schemas are specified using JSON, create a second JSON file names-schema.json for your schema. I started with a cut-and-paste from my names.json and modified it by looking at the JSON examples we examined in class. However that took forever, so I’m just going to give you a JSON definition that is almost done.
{ "$schema" : "http://json-schema.org/draft-04/schema#", "title" : "JSON Names schema for CSCI 343" , "type" : "object" , "properties" : { "college" : { "type" : "string" }, "names" : { "type" : "array" , "items" : { "type" : "object" , "properties" : { "first" : { "type" : "string" }, "middle" : { "type" : "string" } , "last" : { "type" : "string" }, "prefer" : { "type" : "string" } } } } } }
Now its time to validate your JSON again your JSON schema. Again, there are many on-line JSON validators. I used the one at the top of my Google search for “json schema validator online.”
However, don’t just cut-and-paste this into the validator because
it does have a bug for you to fix. (Think about the middle names.)
When you are satisfied with your schema, load it into
NetBeans and begin the
validate-fix cycle.
Also, see if you can figure out how to modify this JSON schema so
that it really requires first
and last
fields in the name.
Relational algebra
Since it’s written in Java, you can install RA: A Relational Algebra Interpreter written by Jun Yang for just about any computer.
However, you will need to have your own copy to get RA to work during this lab. To do this, execute the following magic system administration commands. Be sure to ask someone what each of them does so that you can put “Linux system administrator” on your resume.
[…]$ cd ~/csci/343 […]$ wget http://www.cs.duke.edu/~junyang/ra/ra-2.1b.tgz […]$ ls -l […]$ tar xfz ra-2.1b.tgz […]$ ls -l […]$ cd ra-2.1b
At this point you can start the relational algebra interpreter with the following command:
[…]$ java -jar ra.jar
Now you can type relational algebra statements to look at this database with six tables. (The database was created at Duke so don’t blame me for its contents.) Go ahead and type the following command to print a help message and to list the six tables of the database. Be sure to end all RA commands with a semi-colon.
ra> \help ; ra> \list ;
RA reads input lines (given in an odd syntax) representing relational algebra expressions and translates them SQL statements that are passed on to an sqlite3 database for execution.
The simplest relational expressions
Let’s start with two very simple relational expressions
to see the attributes
and tuples of the DRINKER
and LIKES
relations.
ra> DRINKER ; ra> LIKES ;
Π and σ
Next, use the relational algebra π (project) operator to
list single attributes of these relations.
Notice how an underscore is needed for any relational operator
that is typically specialized with subscripted information.
When the relational algebraist wants
Πdrinker LIKES, they must request
“\project_{drinker} LIKES
”.
ra> \project_{drinker} LIKES ; ra> \project_{beer} LIKES ;
This time use a σ operation
(select) to figure out what Coy
likes.
ra> \select_{drinker='Coy'} LIKES ;
The returned relation has a single tuple, but that tuple contains a pair of values
Now do one by yourself. Use a select and a project to print only the item that Coy likes.
Joins
Next try out the following pair of operations. You will be disappointed at the results of the second expression.
ra> DRINKER \cross LIKES ; ra> DRINKER \join LIKES ;
The problem is that DRINKER
and LIKES
use
two different attribute names for people.
You can use the \rename_
to fix that.
Or you can just use a θ join.
So we end with an assignment. Write a relational expression to pairs the addresses and likes of all people, without listing the name of the people. This useless table is shown below.
Output schema: (address text, beer text) ----- 100 W. Main Street|Amstel 100 W. Main Street|Corona 101 W. Main Street|Amstel 101 W. Main Street|Budweiser 200 S. Duke Street|Dixie 300 N. Duke Street|Amstel 300 N. Duke Street|Budweiser 300 N. Duke Street|Corona 300 N. Duke Street|Dixie 300 N. Duke Street|Erdinger ----- Total number of rows: 10
It is a good idea to build up your expression one operator at a time.