Applying Patterns and Framework Components to Develop a Web
Crawler (pt. 2)
This is the second part in a multi-part programming assignment. The
overall purpose of this assignment is to give you experience applying
communication software patterns and framework components by developing
a Web Crawler. A Web Crawler is a client application that
``visits'' URLs and performs various tasks, such as downloading the
contents of the URL, checking the validity of the links in an HTML
page, building a title index for a search engine, etc.
In the second part of this assignment, you'll develop a more
sophisticated Web Crawler that will visit a URL to determine if it's
valid or not. You'll get to reuse many of the patterns and framework
components from the first assignment. In addition, you'll get to use
some other ACE C++ framework components, e.g.,
ACE_Hash_Map_Manager and ACE_Mem_Map.
We'll extend our existing Web Crawler client program to contain a new
visitor, called a URL_Validation_Visitor. The square brackets in the
list below give hints about which ACE classes you might want to look
up for each activity (see the ACE HTML manual pages and the ACE
tutorial reference for additional details).
Web Crawler Client
The Web Crawler client program should perform the following activities:
- Read the HTTP URL from the command line. Create a socket
that is connected to the server machine at the specified port (e.g.,
HTTP port 80) [ACE_INET_Addr, ACE_SOCK_Connector, ACE_SOCK_Stream].
- Send the requested URL to the server, then process the header
that comes back across the HTTP connection, determine the reply
status of the URL, and print it to the terminal. If the
-r "recursive" option is not enabled, exit your client program at
this point.
- However, if (1) the reply status is valid, e.g., status == 200,
(2) the content type of the file is "text/html", and (3) the
-r "recursive" option is enabled, then download
and iterate over the entire body of the HTML file and recursively
validate all of the embedded http HREF links.
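
The header-processing step above can be sketched in plain standard C++. This is only an illustration, not part of the provided shells: the names Reply_Header, parse_reply_header, and should_recurse are hypothetical, the header is assumed to already be in a std::string rather than read through Mem_Map_Stream, and only status 200 is treated as valid.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>

// Hypothetical summary of the header fields the crawler cares about.
struct Reply_Header {
  int status = 0;            // e.g., 200
  std::string content_type;  // lowercased, parameters stripped
};

// Lowercase a string in place (ASCII only).
static void to_lower(std::string &s) {
  std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
}

// Parse a raw HTTP reply header: a status line such as
// "HTTP/1.0 200 OK" followed by "Name: value" fields and a blank line.
// Returns false if the status line is malformed.
bool parse_reply_header(const std::string &raw, Reply_Header &out) {
  std::istringstream in(raw);
  std::string line;
  if (!std::getline(in, line)) return false;

  std::istringstream status_line(line);
  std::string version;
  if (!(status_line >> version >> out.status)) return false;
  if (version.compare(0, 5, "HTTP/") != 0) return false;

  while (std::getline(in, line)) {
    if (!line.empty() && line.back() == '\r') line.pop_back();
    if (line.empty()) break;  // a blank line ends the header

    const std::string::size_type colon = line.find(':');
    if (colon == std::string::npos) continue;
    std::string name = line.substr(0, colon);
    to_lower(name);
    if (name != "content-type") continue;

    // Drop parameters such as "; charset=...", then trim blanks.
    std::string value = line.substr(colon + 1);
    value = value.substr(0, value.find(';'));
    const std::string::size_type first = value.find_first_not_of(" \t");
    const std::string::size_type last = value.find_last_not_of(" \t");
    value = (first == std::string::npos)
                ? std::string()
                : value.substr(first, last - first + 1);
    to_lower(value);
    out.content_type = value;
  }
  return true;
}

// The three conditions from the list above, with "valid" assumed to
// mean status == 200 only.
bool should_recurse(const Reply_Header &h, bool recursive_option) {
  return recursive_option && h.status == 200 &&
         h.content_type == "text/html";
}
```

A client would print the status field after a successful parse and then continue into the HTML body only when should_recurse returns true.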
Implementation Hints
- Make sure that the client won't hang indefinitely if the server
fails to follow the HTTP protocol or if the network or remote hosts
fail. For instance, make sure that all of your connect,
send and recv calls time out after a user-specified
period of time has elapsed.
- To simplify parsing of the HTTP header and the HTML body, you
might want to develop a memory-mapped stream class. This class should
allow your Web crawler to treat a connection as a stream of bytes,
similar to the C library stdio streams. As an example, I've included
a Mem_Map_Stream class with the shells. This class
buffers the contents of an ACE_SOCK_Stream connection
incrementally in a memory-mapped file [ACE_Mem_Map].
Note that this class is provided "as is," and you'll need to figure
out how it works.
- To avoid endless loops if there are cycles in the URL graph, make
sure to use a URL cache [ACE_Hash_Map_Manager] to detect
the cycles and return the cached reply status.
- For this part of the assignment, don't worry about the order in
which you validate the URLs. I suggest using a depth-first search
approach since it's the easiest way to recursively validate the
URLs. Part 3 of assignment 2 will explore other approaches involving
the use of the Command Pattern (which you can ignore for this part
of the assignment).
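
The URL cache and depth-first hints above can be sketched together as follows. This is a minimal illustration under stated assumptions, not the expected solution: std::unordered_map stands in for ACE_Hash_Map_Manager, and Fetch_Result, Fetcher, URL_Cache, and validate are hypothetical names. The fetch callback is assumed to hide the network access, the timeouts, and the extraction of HREF links from the body.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical result of visiting one URL: its reply status plus the
// http HREF links found in its body (empty if the body was not HTML).
struct Fetch_Result {
  int status;
  std::vector<std::string> links;
};

// Hypothetical callback that performs the actual network visit.
using Fetcher = std::function<Fetch_Result(const std::string &)>;

// Maps each URL already visited to its reply status.
using URL_Cache = std::unordered_map<std::string, int>;

// Depth-first validation of url and everything reachable from it.
// Returns the reply status of url itself.
int validate(const std::string &url, const Fetcher &fetch,
             URL_Cache &cache) {
  const URL_Cache::const_iterator hit = cache.find(url);
  if (hit != cache.end())
    return hit->second;  // seen before: reuse the cached status

  const Fetch_Result result = fetch(url);

  // Record the status before recursing, so that a link leading back
  // to this URL finds it in the cache and the recursion terminates.
  cache[url] = result.status;

  if (result.status == 200) {
    for (const std::string &link : result.links)
      validate(link, fetch, cache);
  }
  return result.status;
}
```

Because the entry is inserted before the recursive calls, a pair of pages that link to each other is fetched once each. Passing the fetcher as a callback also lets the traversal be exercised against an in-memory table of pages without touching the network.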
Concluding Remarks
I strongly encourage you to use and develop reusable C++ components
based on the code you write. You will reuse these components
throughout the course.
Please see the online help for
information on how to set up your development environment on Washington
University's computing system. You can also obtain the program
shells online.
Last modified 16:29:31 CST 28 February 1999