Applying Patterns and Framework Components to Develop a Web
Crawler (pt. 3)
This is the third part in a multi-part programming assignment. The
overall purpose of this assignment is to give you experience applying
communication software patterns and framework components by developing
a Web Crawler. A Web Crawler is a client application that
``visits'' URLs and performs various tasks, such as downloading the
contents of the URL, checking the validity of the links in an HTML
page, building a title index for a search engine, etc.
In the third part of this assignment, you'll enhance your Web Crawler
solution from part 2 so that it can visit all the links from a starting
point URL in FIFO order to determine whether each one is valid.
You'll get to reuse all the patterns and framework components
from the first and second assignments. In addition, you'll get
to use some other ACE C++ framework components, e.g.,
ACE_Unbounded_Queue.
Web Crawler Client
The Web Crawler client program should perform the following activities:
- Read the HTTP URL from the command line. Create a socket
that is connected to the server machine at the specified port (e.g.,
HTTP port 80) [ACE_INET_Addr, ACE_SOCK_Connector, ACE_SOCK_Stream].
- Send the requested URL to the server, then process the
header that comes back across the HTTP connection, determine the
reply status of the URL, and print it to the terminal. If the
-r "recursive" option is not enabled, exit your client
program at this point.
- However, if (1) the reply status is valid, e.g., status == 200,
(2) the content type of the file is "text/html", and (3) the
-r "recursive" option is enabled, then download
and iterate over the entire body of the HTML file and recursively
validate all of the embedded http HREF links in FIFO
or in LIFO order. Use the -o option to select
between the alternatives:
[-o FIFO|LIFO]
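The header checks in the last two steps can be sketched roughly as follows. This is only an illustration that uses the standard C++ library rather than the ACE shells; Reply_Header, parse_reply_header, and should_recurse are hypothetical names, and the sketch assumes the whole header (everything up to the blank line) is already available as one string.

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>

// Hypothetical result type; not part of the assignment shells.
struct Reply_Header
{
  int status = -1;           // -1 means the status line could not be parsed
  std::string content_type;  // lower-cased media type, parameters removed
};

// Lower-case a copy of s (header names and media types are case-insensitive).
std::string to_lower (std::string s)
{
  for (char &c : s)
    c = static_cast<char> (std::tolower (static_cast<unsigned char> (c)));
  return s;
}

// Remove leading and trailing blanks, tabs, and carriage returns.
std::string trim (const std::string &s)
{
  const std::string::size_type first = s.find_first_not_of (" \t\r");
  if (first == std::string::npos)
    return std::string ();
  const std::string::size_type last = s.find_last_not_of (" \t\r");
  assert (last >= first);  // a non-blank character exists, so last is valid
  return s.substr (first, last - first + 1);
}

// Parse the status line and the Content-Type field of a raw HTTP header.
Reply_Header parse_reply_header (const std::string &header)
{
  Reply_Header result;
  std::istringstream in (header);
  std::string line;

  // Status line, e.g. "HTTP/1.0 200 OK".
  if (std::getline (in, line))
    {
      std::istringstream status_line (line);
      std::string version;
      int status = 0;
      if (status_line >> version >> status
          && version.compare (0, 5, "HTTP/") == 0)
        result.status = status;
    }

  // Header fields, e.g. "Content-Type: text/html; charset=...".
  while (std::getline (in, line))
    {
      line = trim (line);
      if (line.empty ())
        break;  // the blank line ends the header
      const std::string::size_type colon = line.find (':');
      if (colon == std::string::npos
          || to_lower (line.substr (0, colon)) != "content-type")
        continue;
      std::string value = line.substr (colon + 1);
      value = value.substr (0, value.find (';'));  // drop any parameters
      result.content_type = to_lower (trim (value));
    }
  return result;
}

// The three conditions listed above for descending into the body.
bool should_recurse (const Reply_Header &header, bool recursive_option)
{
  return recursive_option
    && header.status == 200
    && header.content_type == "text/html";
}
```

In your solution the header bytes would come from the connection (for instance through the Mem_Map_Stream class) rather than from a ready-made string, and recursive_option would reflect whether -r was given.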
For the purposes of this lab, don't worry about handling large HTML
files, i.e., files that'll cause ACE_Mem_Map to remap the
file. Handling such files would require minor changes to the
Mem_Map_Stream implementation so that it won't try to map
the file at a fixed location.
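As a rough illustration of iterating over the body and choosing between the two -o alternatives, here is a sketch that uses only the standard C++ library. The names extract_http_links, Order, and Pending_URLs are hypothetical; std::deque merely stands in for whichever ACE container you choose (e.g., ACE_Unbounded_Queue), and the scanner is deliberately naive: it assumes double-quoted HREF values and is not a real HTML parser.

```cpp
#include <cassert>
#include <cctype>
#include <deque>
#include <string>
#include <vector>

// Traversal order selected by the -o option.
enum class Order { FIFO, LIFO };

// Return the double-quoted http:// HREF values in document order.
std::vector<std::string> extract_http_links (const std::string &html)
{
  // Search a lower-cased copy so "HREF" and "href" both match; the
  // offsets are identical to those in the original text.
  std::string lower (html);
  for (char &c : lower)
    c = static_cast<char> (std::tolower (static_cast<unsigned char> (c)));

  std::vector<std::string> links;
  const std::string key = "href=\"";
  std::string::size_type pos = 0;
  while ((pos = lower.find (key, pos)) != std::string::npos)
    {
      const std::string::size_type begin = pos + key.size ();
      const std::string::size_type end = html.find ('"', begin);
      if (end == std::string::npos)
        break;  // unterminated attribute value
      if (lower.compare (begin, 7, "http://") == 0)
        links.push_back (html.substr (begin, end - begin));
      pos = end + 1;
    }
  return links;
}

// Holds the URLs still to be validated and hands them out in the
// selected order.
class Pending_URLs
{
public:
  explicit Pending_URLs (Order order) : order_ (order) {}

  void add (const std::string &url) { urls_.push_back (url); }
  bool empty () const { return urls_.empty (); }

  // Remove and return the next URL: oldest first for FIFO, newest
  // first for LIFO.
  std::string next ()
  {
    assert (!urls_.empty ());
    std::string url;
    if (order_ == Order::FIFO)
      {
        url = urls_.front ();
        urls_.pop_front ();
      }
    else
      {
        url = urls_.back ();
        urls_.pop_back ();
      }
    return url;
  }

private:
  Order order_;
  std::deque<std::string> urls_;
};
```

With FIFO the links are validated level by level (breadth-first); with LIFO the crawler follows the most recently discovered link first (depth-first). Keeping both behind one small class means the rest of the crawler doesn't need to know which order was requested.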
Implementation Hints
- Make sure that the client won't hang indefinitely if the server
fails to follow the HTTP protocol or if the network or remote hosts
fail. For instance, make sure that all of your connect,
send and recv calls time out after a user-specified
period of time has elapsed.
- To simplify parsing of the HTTP header and the HTML body, you
might want to develop a memory-mapped stream class. This class should
allow your Web crawler to treat a connection as a stream of bytes,
similar to the C library stdio streams. As an example, I've included
a Mem_Map_Stream class with the shells. This class
buffers the contents of an ACE_SOCK_Stream connection
incrementally in a memory-mapped file [ACE_Mem_Map].
Note that this class is provided "as is," and you'll need to figure
out how it works.
- To avoid endless loops if there are cycles in the URL graph, make
sure to use a URL cache [ACE_Hash_Map_Manager] to detect
the cycles and return the cached reply status.
- For this part of the assignment I recommend using the Command
Pattern in order to queue up your requests in FIFO order and execute
them.
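The last two hints (the URL cache and the Command Pattern) fit together, and one possible shape is sketched below using only the standard C++ library. Everything here is an assumption made for illustration: Page, Fetcher, Command, Visit_URL, and Crawler are hypothetical names, std::unordered_map stands in for ACE_Hash_Map_Manager, std::deque stands in for an ACE queue, and the fetch step is a caller-supplied function so the sketch doesn't touch the network.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Result of fetching one URL: reply status plus the links found in it.
struct Page { int status; std::vector<std::string> links; };
using Fetcher = std::function<Page (const std::string &)>;

class Crawler;

// Command Pattern: each queued request is an object with execute().
class Command
{
public:
  virtual ~Command () = default;
  virtual void execute (Crawler &crawler) = 0;
};

class Crawler
{
public:
  explicit Crawler (Fetcher fetch) : fetch_ (std::move (fetch)) {}

  void enqueue (std::unique_ptr<Command> command)
  {
    assert (command != nullptr);
    queue_.push_back (std::move (command));
  }

  // Execute queued commands in FIFO order until none remain.
  void run ()
  {
    while (!queue_.empty ())
      {
        std::unique_ptr<Command> command = std::move (queue_.front ());
        queue_.pop_front ();
        command->execute (*this);
      }
  }

  // Return the reply status for url, fetching it only on first sight.
  int visit (const std::string &url);

  std::size_t fetch_count () const { return visited_.size (); }
  const std::vector<std::string> &visited () const { return visited_; }

private:
  Fetcher fetch_;
  std::deque<std::unique_ptr<Command>> queue_;
  std::unordered_map<std::string, int> cache_;  // URL -> reply status
  std::vector<std::string> visited_;            // URLs in fetch order
};

// Concrete command: validate a single URL.
class Visit_URL : public Command
{
public:
  explicit Visit_URL (std::string url) : url_ (std::move (url)) {}
  void execute (Crawler &crawler) override { crawler.visit (url_); }

private:
  std::string url_;
};

int Crawler::visit (const std::string &url)
{
  const auto cached = cache_.find (url);
  if (cached != cache_.end ())
    return cached->second;  // seen before (possibly a cycle): reuse status

  const Page page = fetch_ (url);
  cache_.emplace (url, page.status);
  visited_.push_back (url);

  // Queue the embedded links instead of recursing into them directly.
  if (page.status == 200)
    for (const std::string &link : page.links)
      enqueue (std::unique_ptr<Command> (new Visit_URL (link)));
  return page.status;
}
```

A crawl starts by enqueueing a single Visit_URL for the starting point and calling run(). Because every URL is recorded in the cache as soon as it has been fetched, a page that links back to an ancestor results in a cache hit rather than another request, so the queue eventually drains even when the URL graph has cycles.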
Concluding Remarks
I strongly encourage you to use and develop reusable C++ components
based on the code you write. You will reuse these components
throughout the course.
Please see the online help for
information on how to set up your development environment on Washington
University's computing system. You can also obtain the program
shells online.
Last modified 20:26:02 CST 18 February 1999