Your solution to this assignment should be uploaded to HW 3 on moodle by 11:55 PM on Sunday, 15 February, 2015.
The objective
This assignment has one big objective: showing that you can write a program following a well-documented interface on your own.
The task
The big view
The task is to do a little web scraping. Specifically, you will write a program that is started from a command similar to the following:
[…]$ home3sol http://blah.com/blither.html
The program will download the referenced URL, perhaps by connecting to the web server directly, and then search the HTML for hyperlinks. For each hyperlink it finds, the program will print the hyperlink target and the hyperlink text (without any internal tags). For example, if the downloaded page contains the following text:
<p> This is a repeat of <a href="../../../Fall2003/363/projects/index.html">CSCI 363 Exercise 2</a> from the Fall 2003 semester. </p>
Output similar to one of the following would be printed.
CSCI 363 Exercise 2 ==> ../../../Fall2003/363/projects/index.html
CSCI 363 Exercise 2 ==> ../../../Fall2003/363/projects/index.html
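The command line and the output format described above could be handled along these lines. This is only a sketch under assumptions: the helper names `url_from_args` and `format_link` are hypothetical, and the usage message is invented for illustration. Only the program name `home3sol` and the `==>` separator come from the examples above.

```python
def url_from_args(argv):
    """Return the single URL argument, or exit with a usage message.

    `argv` is expected to look like sys.argv: program name, then the URL.
    """
    if len(argv) != 2:
        # Assumed behaviour: the assignment does not say what to do here.
        raise SystemExit("usage: home3sol URL")
    return argv[1]


def format_link(text, target):
    """Format one hyperlink the way the sample output shows it."""
    return "{} ==> {}".format(text, target)
```

A caller would typically pass `sys.argv` to `url_from_args` and print one `format_link` line per hyperlink found.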
Is that really possible?
Yes! And it isn’t that hard because there is a Python module, HTMLParser, that does all the hard work. Remember the objective: following a well-documented interface on your own. You need to read the HTMLParser documentation and look at some examples before you start programming.
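To give a feel for the interface, here is a minimal sketch of an HTMLParser subclass that collects hyperlinks. It is written for Python 3, where the module is named `html.parser` (in Python 2 it is `HTMLParser`). The names `LinkExtractor` and `extract_links` are hypothetical, and collapsing runs of whitespace in the link text is an assumption, not a requirement of the assignment.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (text, target) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []     # finished (text, href) pairs
        self._href = None   # href of the <a> currently open, if any
        self._chunks = []   # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._chunks = []

    def handle_data(self, data):
        # Text inside nested tags such as <b> still arrives here,
        # so the internal tags themselves are dropped automatically.
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Assumption: collapse runs of whitespace in the link text.
            text = " ".join("".join(self._chunks).split())
            self.links.append((text, self._href))
            self._href = None


def extract_links(html):
    """Return a list of (text, target) pairs found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

The sketch ignores anchors without an `href` attribute and does not try to handle an `<a>` nested inside another `<a>`.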
Remember that the syllabus requires you to “cite any sources, including the work or advice of other students, used in completing their assignments”.
Do I really have to use the socket stuff?
You can use Python’s urllib2 module if you wish.
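As a rough sketch of the download step: in Python 3 the functionality of urllib2 lives in `urllib.request`, which is what the example below uses. The function name `fetch_html`, the 10-second timeout, and the fallback to UTF-8 when the server names no character set are all assumptions made for illustration.

```python
import urllib.request  # Python 2: import urllib2 and call urllib2.urlopen


def fetch_html(url, timeout=10):
    """Download `url` and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset from the Content-Type header when present;
        # falling back to UTF-8 is an assumption.
        charset = response.headers.get_content_charset() or "utf-8"
    # Replace undecodable bytes instead of failing on a mislabelled page.
    return raw.decode(charset, errors="replace")
```

An illegal or unreachable URL makes `urlopen` raise an exception (for example `ValueError` or `urllib.error.URLError`), which this sketch does not catch.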
Grading
Your program should be well-written and well-documented. It is also expected to be robust, with the following exception: if the URL is illegal, your program is allowed to go down in flames. In order to receive more than 50%, your program must demonstrate that it has been tested. In order to receive more than 25%, your program must be written in legal Python.