Applying Patterns and Framework Components to Develop a Web Crawler (pt. 1)

This is the first part of a multi-part programming assignment. The overall purpose of this assignment is to give you experience applying communication software patterns and framework components by developing a Web Crawler. A Web Crawler is a client application that "visits" URLs and performs various tasks, such as downloading the contents of the URL, checking the validity of the links in an HTML page, or building a title index for a search engine.

In the first part of the assignment, you'll develop a simple Web Crawler that will download the contents of a URL and display it on the standard output. This first part is intended to be very simple. Its primary purpose is to expose you to several patterns, e.g., Factory Method, Visitor, Facade, Wrapper Facade, and Connector, and to ACE C++ framework components, e.g., ACE_Connector and ACE_SOCK_Stream. All of these patterns and framework components will be reused in subsequent parts of this assignment, so it's important to understand them and get them right.

To start off, we'll write one program: a Web Crawler client. The hints in brackets name the ACE classes you might want to look up for each step (see the ACE HTML manual pages and the ACE tutorial reference for additional details).


Web Crawler Download Client

The Web Crawler download client program should perform the following activities:
  1. Read the HTTP URL from the command line. Create a socket that is connected to the server machine at the specified port (e.g., HTTP port 80) [ACE_INET_Addr, ACE_SOCK_Connector, ACE_SOCK_Stream].

  2. Send the requested URL to the server, read all the content, i.e., both the HTTP header and the body, coming back across the HTTP connection, and write the content to a temp file in /tmp created by your Web Crawler on the local host [ACE_FILE_Addr, ACE_FILE_IO, ACE_SOCK_Stream]. To ensure unique file names in /tmp, I recommend you either use ACE_OS::mktemp or declare the file address as ACE_FILE_Addr file (ACE_sap_any_cast (ACE_FILE_Addr &)) to create a unique temporary file.

  3. Print the contents of the file to the terminal and remove the file [ACE_OS::unlink]. In fact, I recommend that you remove the temp file immediately after you open it so that it won't be left in /tmp if your program crashes.
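
As a starting point for steps 1 and 2, the sketch below splits an http:// URL into host, port, and path and formats the request text. It uses only the standard C++ library rather than ACE. The names HttpUrl, parse_http_url, and build_get_request are hypothetical, and the choice of an HTTP/1.0 GET with a Host header is an assumption (with HTTP/1.0 the server normally closes the connection after the response, so reading until end-of-file collects both the header and the body).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper type: the pieces of an http:// URL the client needs.
struct HttpUrl {
  std::string host;
  unsigned short port = 80;  // default HTTP port
  std::string path = "/";    // default when the URL names no path
};

// Splits "http://host[:port][/path]" into its parts.
// Returns false (leaving 'out' untouched) if the URL is not well formed.
bool parse_http_url(const std::string &url, HttpUrl &out) {
  const std::string scheme = "http://";
  if (url.compare(0, scheme.size(), scheme) != 0) return false;

  HttpUrl result;
  const std::size_t host_begin = scheme.size();
  const std::size_t path_begin = url.find('/', host_begin);
  std::string authority =
      path_begin == std::string::npos
          ? url.substr(host_begin)
          : url.substr(host_begin, path_begin - host_begin);
  if (path_begin != std::string::npos) result.path = url.substr(path_begin);

  const std::size_t colon = authority.find(':');
  if (colon != std::string::npos) {
    const std::string port_text = authority.substr(colon + 1);
    if (port_text.empty() || port_text.size() > 5) return false;
    unsigned long port = 0;
    for (char c : port_text) {
      if (c < '0' || c > '9') return false;  // digits only
      port = port * 10 + static_cast<unsigned long>(c - '0');
    }
    if (port == 0 || port > 65535) return false;
    result.port = static_cast<unsigned short>(port);
    authority.erase(colon);  // keep only the host name
  }
  if (authority.empty()) return false;
  result.host = authority;
  out = result;
  return true;
}

// Formats the request text to send over the connected socket.
std::string build_get_request(const HttpUrl &url) {
  assert(!url.host.empty());
  return "GET " + url.path + " HTTP/1.0\r\n"
         "Host: " + url.host + "\r\n"
         "\r\n";
}
```

In the real client, the parsed host and port would go to ACE_INET_Addr, and the request string would be sent over the connected ACE_SOCK_Stream.
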

The client should print an appropriate error message [ACE_OS::perror] and exit with a return status of 1 if any system call fails. If everything works correctly, the program should exit with a return status of 0.
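
The temp-file and error-handling advice above can be sketched as follows. This is a minimal illustration that calls the POSIX functions directly (mkstemp, unlink, write, lseek, read) instead of the ACE_OS wrappers and ACE_FILE_IO, so that it stands alone. The function names and the /tmp/webcrawlerXXXXXX template are hypothetical. The file is unlinked as soon as it is open, so nothing is left in /tmp even if the program crashes, and every failed call reports through perror and leads to a status of 1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <stdlib.h>     // mkstemp
#include <sys/types.h>  // ssize_t
#include <unistd.h>     // unlink, write, read, lseek, close

// Creates a unique file in /tmp and unlinks it right away; the open
// descriptor stays usable until it is closed. Returns -1 on failure.
int open_unlinked_temp_file() {
  char name[] = "/tmp/webcrawlerXXXXXX";  // hypothetical name template
  const int fd = mkstemp(name);
  if (fd == -1) { std::perror("mkstemp"); return -1; }
  if (unlink(name) == -1) { std::perror("unlink"); close(fd); return -1; }
  return fd;
}

// Writes all 'len' bytes, continuing after short writes.
bool write_all(int fd, const char *data, std::size_t len) {
  assert(data != nullptr || len == 0);
  while (len > 0) {
    const ssize_t n = write(fd, data, len);
    if (n == -1) { std::perror("write"); return false; }
    data += n;
    len -= static_cast<std::size_t>(n);
  }
  return true;
}

// Rewinds the file and reads its whole contents into 'out'.
bool read_back(int fd, std::string &out) {
  if (lseek(fd, 0, SEEK_SET) == -1) { std::perror("lseek"); return false; }
  out.clear();
  char buf[4096];
  for (;;) {
    const ssize_t n = read(fd, buf, sizeof buf);
    if (n == -1) { std::perror("read"); return false; }
    if (n == 0) return true;  // end of file
    out.append(buf, static_cast<std::size_t>(n));
  }
}

// Spools 'content' through the temp file and prints it on standard output.
// Returns the exit status the client should use: 0 on success, 1 on failure.
int spool_and_print(const std::string &content) {
  const int fd = open_unlinked_temp_file();
  if (fd == -1) return 1;
  std::string copy;
  bool ok = write_all(fd, content.data(), content.size()) && read_back(fd, copy);
  if (ok && std::fwrite(copy.data(), 1, copy.size(), stdout) != copy.size()) {
    std::perror("fwrite");
    ok = false;
  }
  close(fd);
  return ok ? 0 : 1;
}
```

A main function can simply return the value of spool_and_print once the download has finished.
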

Speaking of errors, make sure that the client won't hang indefinitely if the server fails to follow the HTTP protocol or if the network or remote host fails. For instance, make sure that all of your connect, send, and recv calls time out after a user-specified period of time has elapsed.
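
To show what a timed receive does underneath, here is a minimal sketch that uses POSIX poll directly rather than ACE. The function name recv_with_timeout and the use of ETIMEDOUT to signal a timeout are assumptions made for this illustration. In your solution, look up the timeout arguments of the ACE connect, send, and recv calls in the manual pages instead of hand-coding this.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Waits up to 'timeout_ms' milliseconds for data on 'fd', then reads it.
// Returns the number of bytes read (0 means the peer closed the connection),
// or -1 with errno set: ETIMEDOUT if nothing arrived in time, otherwise the
// error reported by poll or recv.
ssize_t recv_with_timeout(int fd, char *buf, std::size_t len, int timeout_ms) {
  assert(timeout_ms >= 0);
  struct pollfd pfd = {fd, POLLIN, 0};
  int ready;
  do {
    // Note: restarting after a signal waits the full timeout again.
    ready = poll(&pfd, 1, timeout_ms);
  } while (ready == -1 && errno == EINTR);
  if (ready == -1) return -1;  // poll itself failed
  if (ready == 0) {            // no data before the deadline
    errno = ETIMEDOUT;
    return -1;
  }
  return recv(fd, buf, len, 0);
}
```

A caller would print the error with perror and exit with status 1 when this returns -1, which keeps a silent server from hanging the client.
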


Design Overview

The following UML diagrams illustrate the Web Crawler design. The first diagram gives an overview of the key classes and the relationships between the classes:

The next diagram presents a more detailed view of the classes and their relationships:


Concluding Remarks

I strongly encourage you to use and develop reusable C++ components based on the code you write. You will reuse these components throughout the course.

Please see the online help for information on how to set up your development environment on Washington University's computing system. You can also obtain the program shells online.


Last modified 16:29:25 CST 28 February 1999