On April 11th, I will link each of your pages to one of two central pages (one for the "good" links and one for the "bad" links), both accessible from this page. Visit this page again on April 11th to find the central pages; you will need them later in the assignment.
In this project, each person will have the opportunity to run their own small web server; in particular, you will use the Tomcat server developed and freely distributed by Apache. You will run the server on penrose, a Linux box located in Joe Daugherty's office. Each web server expects a particular directory organization, and, in the case of Tomcat (as well as others), this directory structure is somewhat involved. To help you, I have a shell script that, when you run it, creates a directory structure that includes the following:
To run this script, log in to penrose remotely (i.e., give the command
"rlogin penrose") and type the command:
/usr/local/bin/tcsetup
You must be logged on to penrose to run this script. When the
script runs, it will create a directory named servlet
containing the files described above in your top level directory. The
servlet directory will be described in greater detail in class.
Once you have created your servlet directory, you should try
starting the server and running Ray Mooney's search engine interface.
To start the server, give the command (assuming you are in your
top-level directory):
servlet/bin/startup.sh
(You will be using this script frequently, so you may want to
alias this command in your .cshrc file. Here is an example:
alias starttc '~bruce/servlet/bin/startup.sh'
Such an alias makes the command easier to remember and type.)
Once the server has been started, view your index page (created by the tcsetup script) by loading your assigned authority, that is, your host and port (e.g., mine is http://jsp.cs.unca.edu:10835/), in a web browser. Next, test Ray Mooney's search engine interface by loading search.html (e.g., http://jsp.cs.unca.edu:10835/search.html) and running a few queries.
To stop the web server give the command:
servlet/bin/shutdown.sh
Always remember to stop the web server before logging out. In
addition, it will be necessary to stop and restart the web server
every time you recompile the code for a servlet.
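Since you will be stopping and restarting often, you may find it convenient to define a restart alias alongside the startup alias suggested above. This is just a suggestion, not part of the assignment; adjust the path to match your own servlet directory. The brief sleep gives Tomcat time to shut down cleanly before it is started again:

alias restarttc '~/servlet/bin/shutdown.sh; sleep 5; ~/servlet/bin/startup.sh'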
Modifying the Code
Phase 1: Adding Relevance Ranking
There are two different sets of modifications that you are required to make to your search engine code in this project. The first is to define a relevance ranking (an ordering) of the documents returned by your Boolean search engine using link analysis. (Remember that in a standard Boolean search, all documents that satisfy the query are considered to be equally relevant.) As discussed in class, link analysis is often used to order the documents returned in a search; Google is the best-known example.
While most approaches to link analysis require involved calculations and data structures (e.g., the HITS algorithm), we will use a very simple approach: documents should be ranked in order of popularity as measured by a count of their incoming links. The link counts should be accumulated over the entire document collection. In this project, you will work with two document collections: badLinks and goodLinks. The badLinks collection will consist of the first 200 UNCA web pages spidered (using the SiteSpider class) starting from the "badLinks" class page described in Part 1 above. The goodLinks collection will, similarly, consist of the first 200 UNCA web pages spidered (using the SiteSpider class) starting from the "goodLinks" class page described in Part 1 above.
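The counting itself can be quite simple. Here is a minimal sketch in Java of one way to tally in-coming links as the spider extracts them; the LinkCounter class and its method names are my own invention for illustration, not part of Ray Mooney's code:

import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for tallying in-coming links during spidering.
public class LinkCounter {
    // Maps a page's URL to the number of links pointing at it.
    private Map<String, Integer> inLinks = new HashMap<String, Integer>();

    // Call this once for every link the spider extracts.
    public void recordLink(String targetUrl) {
        Integer count = inLinks.get(targetUrl);
        inLinks.put(targetUrl, (count == null) ? 1 : count + 1);
    }

    // Number of in-coming links seen for a page (0 if never linked to).
    public int getCount(String url) {
        Integer count = inLinks.get(url);
        return (count == null) ? 0 : count;
    }
}

Each time the spider extracts a link from a page, it would call recordLink with the link's target URL; pages that are never linked to simply keep a count of zero.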
There are a number of ways to implement this phase of the project. In all cases, you will have to install your existing Boolean search system in the servlet directory described in the section above. In order to reduce the disk space that you use, please remove all unnecessary code in that directory (e.g., typically you can remove Ray Mooney's ir.vsr package once you have installed your bool package).
In his existing system, Ray Mooney separates the spidering phase from the search phase. His current web interface only drives the search engine over an existing document collection; he assumes that spidering has been used to build that collection prior to searching. While this is a reasonable approach, it presents a problem for you in implementing a document ranking based on page popularity: the information on link counts is available in the run-time data structures of the spider code, but it is not currently available in the run-time data structures of the search code. One plausible solution is to collect the link counts in the spidering code (where they are easily accessed) and write them to a file that can later be read by the search code. Other solutions are also possible.
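For example, one simple hand-off format is a plain text file with one "URL count" pair per line. The sketch below shows both halves of that hand-off; the class name, file name handling, and format are assumptions made for this example, not a required design:

import java.io.*;
import java.util.HashMap;
import java.util.Map;

public class LinkCountFile {
    // In the spider: write one "url count" pair per line.
    public static void save(Map<String, Integer> counts, String fileName)
            throws IOException {
        PrintWriter out = new PrintWriter(new FileWriter(fileName));
        for (Map.Entry<String, Integer> e : counts.entrySet())
            out.println(e.getKey() + " " + e.getValue());
        out.close();
    }

    // In the search code: read the pairs back into a map.
    public static Map<String, Integer> load(String fileName)
            throws IOException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        BufferedReader in = new BufferedReader(new FileReader(fileName));
        String line;
        while ((line = in.readLine()) != null) {
            int space = line.lastIndexOf(' ');
            counts.put(line.substring(0, space),
                       Integer.parseInt(line.substring(space + 1)));
        }
        in.close();
        return counts;
    }
}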
The information on link counts should be used by your Boolean search engine to order the documents returned in response to a query. The ordering of documents should be such that the most popular page satisfying the query (i.e., the page with the most in-coming links) should be listed first, the second most popular page satisfying the query should be next, and so on.
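With the counts loaded into a map, the ranking reduces to a descending sort on in-link count. Here is a sketch, assuming the search code can identify each retrieved document by its URL (the class and method names are hypothetical):

import java.util.*;

public class PopularityRanker {
    // Sort result URLs so the most-linked-to page comes first.
    public static void rank(List<String> results,
                            final Map<String, Integer> counts) {
        Collections.sort(results, new Comparator<String>() {
            public int compare(String a, String b) {
                int ca = counts.containsKey(a) ? counts.get(a) : 0;
                int cb = counts.containsKey(b) ? counts.get(b) : 0;
                return cb - ca;  // descending by in-link count
            }
        });
    }
}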
Phase 2: Modifying the Web Interface
Your search engine must also have a web interface. The simplest approach to achieving this is to modify the interface that Ray Mooney developed for his vsr search engine. The necessary modifications will include the following: