CS 373 Information Retrieval and Web Search: Project 2

Project 2 for CS 373: Web Search

Part 1: What Will We Search?

First, we need to build 2 collections of web documents (i.e., html files) to use in testing our search engines. Each collection will be rooted in a single web page that will be accessible from the 373 class page. I need your help to build these collections. Each person in the class must create two web pages each containing at least five links to sites within UNCA. These pages may also contain whatever text that you would like to include. In the first page, include only links that you enjoy for some reason (e.g., the pages are interesting, they are artistically pleasing, they concern a subject or a person that you like, etc); call this page goodLinks.html. The second page should include five links to pages that you don't like (again, for any reason); this page should be called badLinks.html. Place both pages in the top level of your public_html directory within your cs account. To receive full credit for this portion of the assignment, both pages must be in place by April 10th. This portion of the assignment is worth 10 points.

On April 11th, I will link each of your pages to a central page (one page for the "good" links, and one page for the "bad" links) that will be accessible from this page. Visit this page again on April 11th to find the central pages; you will need them later in the assignment.

Good Links

Bad Links

Part 2: Implementing Web Search

The Initial Setup

In this project, each person will have the opportunity to run their own small web server, in particular you will use the Tomcat server developed and freely distributed by Apache. Your server will run the server on penrose, a Linus box located in Joe Daugherty's office. Each web server expects a particular directory organization, and, in the case of Tomcat (as well as others), this directory structure is somewhat involved. To help you, I have a shell script that, when you run it, creates a directory structure that includes the following:

the scripts necessary to start and stop the web server
all html and servlet code from the book, Core Servlets and JavaServer Pages
Ray Mooney's search engine and spidering code as well as the queryparser package
Ray Mooney's search engine web interface consisting of the form search.html, and the servlet Search.java

To run this script, rlogin (i.e., remote login using the command "rlogin penrose") to penrose and type the command:
/usr/local/bin/tcsetup
You must be logged on to penrose to run this script. When the script runs, it will create a directory named servlet containing the files described above in your top level directory. The servlet directory will be described in greater detail in class.

Once you have created your servlet directory, you should try starting the server and running Ray Mooney's search engine interface. To start the server, give the command (assuming you are in your top-level directory):
servlet/bin/startup.sh
(You will be using this script frequently so you may want to alais this command in your .cshrc file. Here is an example:
alias starttc '~bruce/servlet/bin/startup.sh'
Such an alias makes the command easier to remember and type.)

Once the server has been started, view your index page (created by the tcsetup script) by loading your specified authority (e.g., mine is http://jsp.cs.unca.edu:10835/) in a web browser. Next, test Ray Mooney's search engine interface by loading search.html (e.g., http://jsp.cs.unca.edu:10835/search.html) and using it.

To stop the web server give the command:
servlet/bin/shutdown.sh
Always remember to stop the web server before logging out. In addition, it will be necessary to stop and restart the web server every time you recompile the code for a servlet.

Modifying the Code

Phase 1: Adding Relevance Ranking

There are two different sets of modifications that you are required to make to your search engine code in this project. The first modification is to define a relevance ranking (an ordering) of the documents returned by your Boolean search engine using link analysis. (Remember that in a standard Boolean search, all documents that satisfy the query are considered to be equally relevance.) As discussed in class, link analysis is often used to order the documents returned in a search; Google is the best known example.

While most approaches to link analysis require involved calculations and data structures (e.g., the HITS algorithm), we will use a very simple approach: documents should be ranked in order of popularity as measured by a count of their incoming links. The link counts should be accumulated over the entire document collection. In this project, you will work with two document collections: badLinks and goodLinks. The badLinks collection will consist of the first 200 UNCA web pages spidered (using the SiteSpider class) starting from the "badLinks" class page described in Part 1 above. The goodLinks collection will, similarly, consist of the first 200 UNCA web pages spidered (using the SiteSpider class) starting from the "goodLinks" class page described in Part 1 above.

There are a number of ways to implement this phase of the project. In all cases, you will have to install your existing Boolean search system in the servlet directory described in the section above. In order to reduce the disk space that you use, please remove all unnecessary code in that directory (e.g., typically you can remove Ray Mooney's ir.vsr package once you have installed your bool package).

In his existing system, Ray Mooney separates the spidering phase from the search phase. His current web interface only drives the search engine over an existing document collection. He assumes that spidering has been used to build that collection prior to searching. While this is a reasonable approach, it presents a problem for you in implementing a document ranking based on page popularity: the information on link counts is available in the run-time data structures of the spider code, but it is not currently available in the run-time data structures of the search code. One plausible solution is to collection the information on link counts in the spidering code (where it is easily accessed) and write it to a file that can later be read by the search code. Other solutions are also possible.

The information on link counts should be used by your Boolean search engine to order the documents returned in response to a query. The ordering of documents should be such that the most popular page satisfying the query (i.e., the page with the most in-coming links) should be listed first, the second most popular page satisfying the query should be next, and so on.

Phase 2: Modifying the Web Interface

Your search engine must also have a web interface. The simplest approach to achieving this is to modify the interface that Ray Mooney developed for his vsr search engine. The necessary modifications will include the following:

Modifying search.html to specify the badLinks and goodLinks collections described above.
Modify Search.java (the servlet) to run your Boolean system with document ranking as described in phase 1.

The "look and feel" of the search engine interface should otherwise be similar to Ray Mooney's with two exceptions. You are asked to enhance Ray Mooney's interface in two ways:

As currently implemented in Search.java (the servlet), the inverted index is rebuilt and the query is reprocessed each time a user indicates that they want "More Results" for a given query (i.e., 10 more documents satisfying a previously specified query). This is very inefficient! Modify the servlet code to store the inverted index for the collection and the ranked document list for a query in a HttpSession object. Reuse the inverted index and the ranked list of documents whenever possible---do not regenerate them if they have already been generated in the current session.
Add a "New Search" bottom to the results page produced by the servlet. This bottom should terminate the existing search and return the user to search.html to begin a new search.

Grading

Your grade will be based on the following factors (in order of importance):

correct compilation, execution and performance of your code
good program design
appropriate (i.e., time and space efficient) choices of data structures and algorithms
good documentation of your code
good documentation of the group work effort. As a group, formulate a plan for implementing the system, and make an effort to break the overall programming task into pieces that can be implemented by various group members. Try to assure that all group members share in the program design and implementation. As part of each project, each group must submit a brief summary of the contributions of each group member.

What to turn in

The following files must be placed in your course dropoff directory by mid-night of May 8th:

A zip file containing your entire servlet directory *excluding* any code taken from the book Core Servlets and JavaServer pages or other sources that is not used in your search system.
A text file containing a list of classes that your group either modified or created as part of this project. The list should also contain the package name for each class.
A text file outlining the contributions made by each group member to this project.