Applying Patterns and Framework Components to Develop a Web Crawler (pt. 2)

This is the second part in a multi-part programming assignment. The overall purpose of this assignment is to give you experience applying communication software patterns and framework components by developing a Web Crawler. A Web Crawler is a client application that "visits" URLs and performs various tasks, such as downloading the contents of the URL, checking the validity of the links in an HTML page, or building a title index for a search engine.

In the second part of this assignment, you'll develop a more sophisticated Web Crawler that will visit a URL to determine whether it is valid. You'll get to reuse many of the patterns and framework components from the first assignment. In addition, you'll get to use some other ACE C++ framework components, e.g., ACE_Hash_Map_Manager and ACE_Mem_Map.

We'll extend our existing Web Crawler client program to contain a new visitor, called a URL_Validation_Visitor. In the steps listed in the next section, square brackets give hints about which ACE classes you might want to look up for each step (see the ACE HTML manual pages and ACE tutorial reference for additional details).
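
As a rough illustration of how a new visitor might plug into the crawler, here is a minimal sketch of the Visitor pattern. Only the name URL_Validation_Visitor comes from this assignment. The URL_Visitor base class, the HTTP_URL class, and all member functions shown are assumptions made for illustration and may not match the program shells or your code from the first part.

```cpp
#include <string>
#include <utility>

class HTTP_URL;  // Hypothetical URL class, declared below.

// Hypothetical visitor interface: one visit() hook per kind of URL.
class URL_Visitor {
public:
  virtual ~URL_Visitor() = default;
  virtual void visit(HTTP_URL &url) = 0;
};

// Hypothetical URL object that records its name and the reply status
// obtained when the crawler contacted the server (-1 means "unknown").
class HTTP_URL {
public:
  explicit HTTP_URL(std::string name) : name_(std::move(name)) {}

  const std::string &name() const { return name_; }
  int reply_status() const { return reply_status_; }
  void reply_status(int status) { reply_status_ = status; }

  // Double dispatch: the URL hands itself to the visitor.
  void accept(URL_Visitor &visitor) { visitor.visit(*this); }

private:
  std::string name_;
  int reply_status_ = -1;
};

// The new visitor: counts how many visited URLs were valid or invalid.
// "Valid" is assumed here to mean a reply status of 200.
class URL_Validation_Visitor : public URL_Visitor {
public:
  void visit(HTTP_URL &url) override {
    if (url.reply_status() == 200)
      ++valid_count_;
    else
      ++invalid_count_;
  }

  int valid_count() const { return valid_count_; }
  int invalid_count() const { return invalid_count_; }

private:
  int valid_count_ = 0;
  int invalid_count_ = 0;
};
```

With this arrangement, adding another task (for example, a download visitor) would mean writing another URL_Visitor subclass rather than changing HTTP_URL itself.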


Web Crawler Client

The Web Crawler client program should perform the following activities:
  1. Read the HTTP URL from the command line. Create a socket that is connected to the server machine at the specified port (e.g., HTTP port 80) [ACE_INET_Addr, ACE_SOCK_Connector, ACE_SOCK_Stream].

  2. Send a request for the URL to the server, process the header that comes back across the HTTP connection, determine the reply status of the URL, and print it to the terminal. If the -r "recursive" option is not enabled, exit your client program at this point.

  3. However, if (1) the reply status is valid, e.g., status == 200, (2) the content type of the file is "text/html", and (3) the -r "recursive" option is enabled, then download and iterate over the entire body of the HTML file and recursively validate all of the embedded http HREF links.
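
The header processing and link iteration in steps 2 and 3 can be sketched with the C++ standard library alone, leaving out the ACE socket code. This is a simplified sketch, not the required implementation: Reply_Header, parse_reply_header, should_recurse, and extract_http_hrefs are hypothetical names, and the HREF scan is a naive text search rather than a real HTML parser.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical holder for the two header fields the steps above rely on.
struct Reply_Header {
  int status = -1;           // e.g., 200; -1 if the status line is malformed.
  std::string content_type;  // Lower-cased media type, e.g., "text/html".
};

static std::string to_lower(std::string s) {
  std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return s;
}

static std::string trim(const std::string &s) {
  const char *ws = " \t\r\n";
  const std::string::size_type b = s.find_first_not_of(ws);
  if (b == std::string::npos) return "";
  return s.substr(b, s.find_last_not_of(ws) - b + 1);
}

// Parse the status line and the Content-Type field of an HTTP reply header.
Reply_Header parse_reply_header(const std::string &header) {
  Reply_Header result;
  std::istringstream in(header);
  std::string line;
  if (std::getline(in, line)) {  // Status line, e.g., "HTTP/1.0 200 OK".
    std::istringstream status_line(line);
    std::string version;
    int code = 0;
    if (status_line >> version >> code && version.compare(0, 5, "HTTP/") == 0)
      result.status = code;
  }
  while (std::getline(in, line)) {
    line = trim(line);
    if (line.empty()) break;  // A blank line ends the header.
    const std::string::size_type colon = line.find(':');
    if (colon == std::string::npos) continue;
    if (to_lower(trim(line.substr(0, colon))) == "content-type") {
      const std::string value = to_lower(line.substr(colon + 1));
      // Drop parameters such as "; charset=...".
      result.content_type = trim(value.substr(0, value.find(';')));
    }
  }
  return result;
}

// Step 3's three conditions for descending into the body.
bool should_recurse(const Reply_Header &h, bool recursive_option) {
  return recursive_option && h.status == 200 && h.content_type == "text/html";
}

// Collect HREF attribute values that begin with "http://".
std::vector<std::string> extract_http_hrefs(const std::string &body) {
  const std::string ws = " \t\r\n";
  const std::string lower = to_lower(body);  // Same length as body.
  std::vector<std::string> links;
  std::string::size_type pos = 0;
  while ((pos = lower.find("href", pos)) != std::string::npos) {
    pos = lower.find_first_not_of(ws, pos + 4);
    if (pos == std::string::npos) break;
    if (lower[pos] != '=') continue;  // "href" was not an attribute here.
    pos = lower.find_first_not_of(ws, pos + 1);
    if (pos == std::string::npos) break;
    std::string::size_type end;
    if (lower[pos] == '"' || lower[pos] == '\'') {
      const char quote = lower[pos];
      ++pos;  // Skip the opening quote.
      end = lower.find(quote, pos);
    } else {
      end = lower.find_first_of(ws + ">", pos);  // Unquoted value.
    }
    if (end == std::string::npos) break;
    if (lower.compare(pos, 7, "http://") == 0)
      links.push_back(body.substr(pos, end - pos));  // Keep original case.
    pos = end;
  }
  return links;
}
```

In a full client, the header text would come from the connected socket, and each string returned by extract_http_hrefs would be validated in turn when should_recurse returns true. Keeping a record of URLs already visited helps avoid validating the same link twice.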


Implementation Hints


Concluding Remarks

I strongly encourage you to use and develop reusable C++ components based on the code you write. You will reuse these components throughout the course.

Please see the online help for information on how to set up your development environment on Washington University's computing system. You can also obtain the program shells online.


References

Last modified 16:29:31 CST 28 February 1999