Applying Patterns and Framework Components to Develop a Web Crawler (pt. 3)

This is the third part in a multi-part programming assignment. The overall purpose of this assignment is to give you experience applying communication software patterns and framework components by developing a Web Crawler. A Web Crawler is a client application that "visits" URLs and performs various tasks, such as downloading the contents of the URL, checking the validity of the links in an HTML page, building a title index for a search engine, etc.

In the third part of this assignment, you'll enhance your Web Crawler solution from part 2 so that it can visit all the links from a starting point URL in FIFO order and determine whether each one is valid. You'll get to reuse all the patterns and framework components from the first and second assignments. In addition, you'll get to use some other ACE C++ framework components, e.g., ACE_Unbounded_Queue.


Web Crawler Client

The Web Crawler client program should perform the following activities:
  1. Read the HTTP URL from the command line. Create a socket that is connected to the server machine at the specified port (e.g., HTTP port 80) [ACE_INET_Addr, ACE_SOCK_Connector, ACE_SOCK_Stream].

  2. Send a request for the URL to the server, then process the header that comes back across the HTTP connection, determine the reply status of the URL, and print it to the terminal. If the -r "recursive" option is not enabled, exit your client program at this point.

  3. However, if (1) the reply status is valid, e.g., status == 200, (2) the content type of the file is "text/html", and (3) the -r "recursive" option is enabled, then download and iterate over the entire body of the HTML file and recursively validate all of the embedded http HREF links in FIFO or in LIFO order. Use the -o option to select between the alternatives:
    [-o FIFO|LIFO]
    

    For the purposes of this lab, don't worry about handling large HTML files, i.e., files that'll cause ACE_Mem_Map to remap the file. Handling such files would require minor changes to the Mem_Map_Stream implementation so that it won't try to map the file at a fixed location.
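
The checks in steps 2 and 3 can be sketched in standard C++ before you wire them into your ACE-based client. This is only a minimal illustration, not the required design: the function names parse_status_code and should_recurse are hypothetical, the status line is assumed to have the usual "HTTP/<version> <code> <reason>" shape, and the content-type test is a simple case-sensitive prefix match that tolerates a trailing "; charset=..." parameter.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: extract the numeric reply status from an HTTP
// status line such as "HTTP/1.0 200 OK". Returns -1 if the line does
// not look like a status line.
int parse_status_code(const std::string &status_line)
{
  std::istringstream in(status_line);
  std::string version;
  int code = 0;

  if (!(in >> version >> code))
    return -1;                              // missing version or code
  if (version.compare(0, 5, "HTTP/") != 0)
    return -1;                              // not an HTTP reply
  if (code < 100 || code > 599)
    return -1;                              // outside the HTTP status range
  return code;
}

// Hypothetical helper: the three conditions from step 3. The crawler
// follows links only when the reply is 200, the body is HTML, and the
// -r option was given on the command line.
bool should_recurse(int status, const std::string &content_type, bool recursive)
{
  return recursive
         && status == 200
         && content_type.compare(0, 9, "text/html") == 0;
}
```

In a real client the status line and the Content-Type value would come from the header you read off the ACE_SOCK_Stream; here they are passed in as plain strings so the logic can be tested in isolation.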
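
The difference between the FIFO and LIFO settings of the -o option in step 3 can be sketched with a single worklist. The sketch below deliberately uses std::deque from the standard library as a stand-in for ACE_Unbounded_Queue, so it does not show the ACE API. The names Order, Link_Extractor, and crawl are hypothetical, and the "seen" set that prevents revisiting a URL is an added assumption that the assignment text does not spell out.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Traversal order selected by the -o option.
enum class Order { FIFO, LIFO };

// Hypothetical callback: given a URL, return the http links embedded in
// its page (empty if the URL is invalid or is not an HTML page).
using Link_Extractor =
  std::function<std::vector<std::string>(const std::string &)>;

// Visit every URL reachable from `start` and return the URLs in the
// order they were visited.
std::vector<std::string> crawl(const std::string &start,
                               Order order,
                               const Link_Extractor &links_of)
{
  std::deque<std::string> pending;   // URLs waiting to be visited
  std::set<std::string> seen;        // assumption: skip URLs already queued
  std::vector<std::string> visited;

  pending.push_back(start);
  seen.insert(start);

  while (!pending.empty()) {
    std::string url;
    if (order == Order::FIFO) {      // oldest entry first
      url = pending.front();
      pending.pop_front();
    } else {                         // newest entry first
      url = pending.back();
      pending.pop_back();
    }
    visited.push_back(url);

    // Queue each newly discovered link exactly once.
    for (const std::string &link : links_of(url))
      if (seen.insert(link).second)
        pending.push_back(link);
  }
  return visited;
}
```

The only difference between the two orders is which end of the worklist the next URL is taken from, so the rest of the crawler does not need to know which order was chosen.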


Implementation Hints


Concluding Remarks

I strongly encourage you to use and develop reusable C++ components based on the code you write. You will reuse these components throughout the course.

Please see the online help for information on how to set up your development environment on Washington University's computing system. You can also obtain the program shells online.


References

Last modified 20:26:02 CST 18 February 1999