Spring 2015 CSCI 373 Homework 3

Your solution to this assignment should be uploaded to HW 3 on Moodle by 11:55 PM on Sunday, 15 February, 2015.

The objective

This assignment has one big objective: showing that you can, on your own, write a program that follows a well-documented interface.

The task

The big view

The task is to do a little web scraping. Specifically, you will write a program that is started by a command similar to the following:

[…]$ home3sol http://blah.com/blither.html

The program will download the referenced URL, perhaps by connecting directly to the web server, and then search the HTML for hyperlinks. For each hyperlink it finds, the program will print the hyperlink target and the hyperlink text (without any internal tags). For example, if the downloaded page contains the following text:

<p>
This is a repeat of
<a href="../../../Fall2003/363/projects/index.html">CSCI 363
Exercise 2</a> from the Fall 2003 semester.
</p>

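Output similar to one of the following two forms would be printed. The first keeps the line break inside the link text; the second collapses it into a single space.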

CSCI 363
Exercise 2
 ==> ../../../Fall2003/363/projects/index.html
CSCI 363 Exercise 2
 ==> ../../../Fall2003/363/projects/index.html

Is that really possible?

Yes! And it isn’t that hard, because there is a Python module, HTMLParser, that does all the hard work. Remember the objective: following a well-documented interface on your own. You need to read the HTMLParser documentation and look at some examples before you start programming.
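
As a rough illustration of the callback interface, the sketch below subclasses HTMLParser and overrides handle_starttag, handle_data, and handle_endtag to collect link text and targets. It is one possible approach, not the required design. The names LinkCollector, extract_links, format_link, and SAMPLE are hypothetical. The try/except import is an assumption added so the sketch also runs under Python 3, where the module is called html.parser. Collapsing whitespace in the link text is a choice that yields the second output form shown above.

```python
try:
    from html.parser import HTMLParser   # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser    # Python 2 module name used in this handout


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers (text, target) pairs for <a href=...> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []     # finished (text, target) pairs, in page order
        self._href = None   # target of the <a> that is currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Keep only text inside a hyperlink; tags nested in the link are
        # reported through other callbacks, so they never reach this list.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Assumption: collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._text).split())
            self.links.append((text, self._href))
            self._href = None


def extract_links(html):
    """Return a list of (text, target) pairs found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def format_link(text, target):
    """Format one link in the style of the sample output."""
    return text + "\n ==> " + target


# The HTML fragment from the example in this handout.
SAMPLE = (
    "<p>\n"
    "This is a repeat of\n"
    '<a href="../../../Fall2003/363/projects/index.html">CSCI 363\n'
    "Exercise 2</a> from the Fall 2003 semester.\n"
    "</p>\n"
)

for link_text, link_target in extract_links(SAMPLE):
    print(format_link(link_text, link_target))
```

Run on the sample fragment, this prints "CSCI 363 Exercise 2" followed by the " ==> " line with the relative target. The sketch deliberately skips anchors without an href attribute, and it does not treat character references specially under Python 2.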

Remember that the syllabus requires you to “cite any sources, including the work or advice of other students, used in completing their assignments”.

Do I really have to use the socket stuff?

You can use Python’s urllib2 module if you wish.
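
For orientation, a minimal sketch of downloading a page with urlopen follows. The helper name fetch_html is hypothetical, the fallback import is an assumption so the same code runs under Python 3 (where urllib2's urlopen lives in urllib.request), and decoding the bytes as UTF-8 with replacement is a simplification: a real page may declare a different encoding.

```python
try:
    from urllib.request import urlopen   # Python 3 location of urlopen
except ImportError:
    from urllib2 import urlopen          # Python 2 module named in this handout


def fetch_html(url):
    """Hypothetical helper: return the document at `url` as text."""
    response = urlopen(url)
    try:
        raw = response.read()            # the whole body, as bytes
    finally:
        response.close()
    # Assumption: treat the body as UTF-8 and replace undecodable bytes
    # instead of raising an error.
    return raw.decode("utf-8", "replace")
```

A program could call fetch_html on the URL taken from the command line and hand the returned string to its parser. No error handling is shown, which is consistent with the exception for illegal URLs in the grading section.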

Grading

Your program should be well-written and well-documented. It is also expected to be robust, with the following exception: if the URL is illegal, your program is allowed to go down in flames. In order to receive more than 50%, your program must demonstrate that it has been tested. In order to receive more than 25%, your program must be written in legal Python.