Homework 3

Your solution to this assignment should be uploaded to HW 3 on Moodle by 11:55 PM on Sunday, 15 February 2015.

The objective

This assignment has one big objective: showing that you can write a program following a well-documented interface on your own.

The task

The big view

The task is to do a little web scraping. Specifically, you will write a program that is started with a command similar to the following:

[…]$ home3sol http://blah.com/blither.html

The program will download the page at the referenced URL, perhaps by connecting to the web server directly, and then search the HTML for hyperlinks. For each hyperlink it finds, the program will print the hyperlink target and the hyperlink text (without any internal tags). For example, if the downloaded page contains the following text:

<p>
This is a repeat of
<a href="../../../Fall2003/363/projects/index.html">CSCI 363
Exercise 2</a> from the Fall 2003 semester.
</p>

then output similar to one of the following would be printed:

CSCI 363
Exercise 2
 ==> ../../../Fall2003/363/projects/index.html

CSCI 363 Exercise 2
 ==> ../../../Fall2003/363/projects/index.html

Is that really possible?

Yes! And it isn’t that hard, because Python’s HTMLParser module does all the hard work. Remember the objective: following a well-documented interface on your own. You need to read the HTMLParser documentation and look at some examples before you start programming.
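To give a feel for the interface before you read the documentation, here is a minimal sketch of the callback style HTMLParser uses. It is not a solution to the assignment: it only records the events the parser reports. The names EventLogger and list_events are made up for this illustration, and the import fallback is an assumption so that the sketch runs under either Python 2 or Python 3.

```python
try:
    from html.parser import HTMLParser   # Python 3 location
except ImportError:
    from HTMLParser import HTMLParser     # Python 2 location


class EventLogger(HTMLParser):
    """Hypothetical helper: remembers every event the parser reports."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # called for the text found between tags
        self.events.append(("data", data))


def list_events(html):
    """Feed html (a string) to the parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()
    return parser.events
```

Calling list_events on a small fragment shows that the parser reports start tags (with their attributes), text, and end tags in document order. Deciding which of those events matter for hyperlinks is the part you need to work out.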

Remember that the syllabus requires you to “cite any sources, including the work or advice of other students, used in completing their assignments”.

Do I really have to use the `socket` stuff?

No. You can use Python’s urllib2 module instead if you wish.
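A minimal sketch of downloading a page this way might look like the following. The function name fetch is made up for this illustration, and the import fallback is an assumption: under Python 3 the same urlopen function lives in urllib.request rather than urllib2.

```python
try:
    from urllib.request import urlopen   # Python 3 location
except ImportError:
    from urllib2 import urlopen           # Python 2 location


def fetch(url):
    """Hypothetical helper: return the body found at url as raw bytes."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        response.close()
```

Under Python 3 the result is a bytes object, so it would need to be decoded to a string before being handed to a parser that expects text. Error handling (bad URLs, servers that do not answer) is left out of this sketch.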

Grading