CS102 Lab 4:
File I/O, Parsing, and URL connections

Assigned: Tuesday, March 16
Demonstration (5 points) in lab section: Monday, March 29
Hard copy of code (15 points) due to the CS102 mailboxes: Tuesday, March 30

Goals:

By the end of this lab, you should...


Motivation and Overview:

As described in lecture, Java provides excellent support for applications that want to save objects to persistent storage and load them back later. However, when data needs to be extracted from human-generated text, the problem is harder to automate, since the program reading the data did not generate it. Therefore, rather than knowing ahead of time exactly what to expect, it is necessary to read through the data in order to determine what is there and to pick out what is useful.

In lecture, we saw that the java.io.StreamTokenizer provides useful support for text parsing by breaking up an input stream into its logical units called tokens. However, having a stream of tokens is not the end of the story, since the application must still decide what to do with them.
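
To make the idea of a token concrete, here is a minimal sketch that prints each token StreamTokenizer produces for one line of HTML. The class name TokenDemo and the method describeTokens are made up for illustration, and the call to ordinaryChar('/') is an assumption on our part: by default the tokenizer treats '/' as the start of a comment, which would hide everything after a closing tag on the same line.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class TokenDemo {
    // Returns a readable description of every token in the given text.
    static List<String> describeTokens(String text) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        // Assumption: treat '/' as an ordinary character instead of a comment starter.
        st.ordinaryChar('/');
        List<String> out = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            switch (st.ttype) {
                case StreamTokenizer.TT_WORD:
                    out.add("word:" + st.sval);
                    break;
                case StreamTokenizer.TT_NUMBER:
                    out.add("number:" + st.nval);
                    break;
                case '"':
                    // For a quoted string, ttype is the quote character and sval is the body.
                    out.add("quoted:" + st.sval);
                    break;
                default:
                    // Any other single character is its own token type.
                    out.add(String.valueOf((char) st.ttype));
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(describeTokens("<a href=\"page.html\">home</a>"));
    }
}
```

Notice that the tokenizer only splits the input. Recognizing that "<", "a", "href", "=" followed by a quoted string is a link is left to the application.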

In this lab, you will gain some experience with file I/O, text parsing, Applets, URL connections, and web browser control by writing an Applet that reads an .html file and generates a user-selectable list of hypertext links available in the file. When the user selects from the list, the document corresponding to the chosen link will be shown in a separate frame within the browser and the same process will be repeated on the selected link.


Before Starting:

Before beginning this lab, you should familiarize yourself with certain parts of some Java packages.
  1. Become familiar with the API provided by the java.io package. Pay particular attention to the classes seen in lecture, such as File, InputStream, FileInputStream, Reader, FileReader, and StreamTokenizer.
  2. Read over the methods provided by the String class. It provides a lot of methods you may find handy for completing this assignment.
  3. Look at the classes URL and URLConnection in the java.net package.
  4. Review the Applet class and related classes in the java.applet package, especially the getAppletContext method in Applet and the showDocument method in AppletContext.
  5. Look at the List class and the ItemListener interface in the java.awt package. While you're there, also review the TextField class, particularly the addActionListener method.
  6. Before doing the second extra credit option, look at the class Stack in the java.util package.


Assumptions:

We will make the following assumptions about the names and contents of the files you will be parsing.
  1. All of the actual files you will parse will end with the ".html" suffix. However, the link names typed into your user interface or found within the files themselves may not show the suffix explicitly. If a link ends with a "/", you should append the string "index.html" before processing. If a link does not end with a "/" and also does not end with ".html", you should append the string "/index.html" before processing.
  2. If you are currently looking at a page whose full pathname is "http://www.foo.bar/abc/nonsense.html", then the path "http://www.foo.bar/abc/" is considered to be the current directory URL.
  3. If a link does not begin with "http://" then it is a relative link, meaning that you should prefix it with the current directory URL before processing.
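
The three assumptions above can be sketched as a few String operations. This is only one possible arrangement, not a required design: the class name LinkNormalizer and the method names addDefaultFile, currentDirectory, and resolve are made up for illustration, and currentDirectory assumes it is given a full page pathname that contains at least one "/".

```java
class LinkNormalizer {
    // Assumption 1: make sure the link names an actual ".html" file.
    static String addDefaultFile(String link) {
        if (link.endsWith("/")) {
            return link + "index.html";
        }
        if (!link.endsWith(".html")) {
            return link + "/index.html";
        }
        return link;
    }

    // Assumption 2: everything up to and including the last "/" of the page's pathname.
    static String currentDirectory(String pageUrl) {
        return pageUrl.substring(0, pageUrl.lastIndexOf('/') + 1);
    }

    // Assumption 3, then assumption 1: prefix relative links, then add the suffix if needed.
    static String resolve(String link, String currentDirUrl) {
        String absolute = link.startsWith("http://") ? link : currentDirUrl + link;
        return addDefaultFile(absolute);
    }

    public static void main(String[] args) {
        String dir = currentDirectory("http://www.foo.bar/abc/nonsense.html");
        System.out.println(dir);
        System.out.println(resolve("myFavoriteThings/foo.html", dir));
        System.out.println(resolve("http://students.cec.wustl.edu/~xyz99", dir));
    }
}
```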


Part I: Finding links in HTML files

Create a Cafe project of type "application." As usual, create a file called Lab4.java and a Startup.java file. The method main in Lab4 should create a thread out of an instance of Startup and start the thread running. Begin by creating a class called Parser with the following methods.

  1. The method getTokenizer takes a filename as its parameter and returns a StreamTokenizer that is positioned at the beginning of the given file.

  2. The method tokenMatch takes a StreamTokenizer and a String targetWord as parameters. It should return true if the next token in the stream matches the given targetWord and false otherwise. Note that there are two ways a match can occur: either the next token has token type TT_WORD and the token's sval is the same as the targetWord, OR targetWord contains only one character and that character matches the current token type.

  3. The method consumeThroughPatternMatch takes a StreamTokenizer and an array of strings as its two parameters. It should read tokens until it has seen a sequence of tokens that matches the sequence of strings in the provided array (in order). For example, if the provided array is {"<","a","href","="} and the method is called with the StreamTokenizer at the start of the following file, it leaves the StreamTokenizer positioned just before the text "http://students.cec.wustl.edu/~xyz99". In general, it is possible to consume the entire stream before finding a match for the given pattern, in which case the method should return with the StreamTokenizer positioned at the end of the file.
    <HTML>
    
    <TITLE>My Award-Winning Home Page</TITLE>
    
    <H3>My Home Page</H3>
    This page is under construction.
    <a hbuf+"http://www.goofy.com">
    <P>
    
    See my <a href="http://students.cec.wustl.edu/~xyz99">friend's
    home page</a>.
    
    Also, you can look at <a href="myFavoriteThings/foo.html">my favorite things.</a>
    
    </HTML>
    

  4. The method getNextQuotedString should take a StreamTokenizer as a parameter and keep reading tokens until reaching one whose token type is '"'. It should then return the sval for that token. Again, EOF may be reached before a matching token is found.

  5. The method getTitle should take a StreamTokenizer as a parameter and should extract the title from the input stream. For example, for the file given above, the method would return the String "My Award-Winning Home Page" as its result.

  6. The method getLinks should take a StreamTokenizer as a parameter and should return a Vector that contains all the links in the remainder of the input stream. For example, for the file given above, the method would return a Vector containing two String objects: http://students.cec.wustl.edu/~xyz99 and myFavoriteThings/foo.html

Thoroughly test your methods from within the run() method in Startup. You can create your own test files, but you should also find some HTML files on the web, save them locally, and test your methods on them as well.
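
As a starting point, the sketch below shows one way the Parser methods described above might fit together. It is a rough outline under several assumptions, not the required solution. The class name ParserSketch and the helpers newTokenizer and currentMatches are made up for illustration. Returning a boolean from consumeThroughPatternMatch, returning null from getNextQuotedString at EOF, comparing words without regard to case, and making '/' and the apostrophe ordinary characters are all choices of this sketch. The simple restart in consumeThroughPatternMatch is only correct when the first string of the pattern does not appear again later in the pattern, which holds for {"<","a","href","="}. getTitle is left out.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.util.Vector;

class ParserSketch {
    // The pattern that precedes a link target, taken from the example in item 3.
    static final String[] HREF_PATTERN = {"<", "a", "href", "="};

    // Hypothetical helper: builds a tokenizer over any Reader.
    static StreamTokenizer newTokenizer(Reader in) {
        StreamTokenizer st = new StreamTokenizer(in);
        st.ordinaryChar('/');   // assumption: otherwise '/' starts a comment
        st.ordinaryChar('\'');  // assumption: keep apostrophes in prose from opening a quoted string
        return st;
    }

    static StreamTokenizer getTokenizer(String fileName) throws IOException {
        return newTokenizer(new BufferedReader(new FileReader(fileName)));
    }

    // Hypothetical helper: tests the token most recently read, without reading another.
    static boolean currentMatches(StreamTokenizer st, String targetWord) {
        if (st.ttype == StreamTokenizer.TT_WORD) {
            return st.sval.equalsIgnoreCase(targetWord); // assumption: ignore case
        }
        return targetWord.length() == 1 && targetWord.charAt(0) == st.ttype;
    }

    // Reads the next token and reports whether it matches targetWord.
    static boolean tokenMatch(StreamTokenizer st, String targetWord) throws IOException {
        st.nextToken();
        return currentMatches(st, targetWord);
    }

    // Returns true once the whole pattern has been seen, false if EOF comes first.
    static boolean consumeThroughPatternMatch(StreamTokenizer st, String[] pattern)
            throws IOException {
        int matched = 0;
        while (matched < pattern.length) {
            if (st.nextToken() == StreamTokenizer.TT_EOF) {
                return false;
            }
            if (currentMatches(st, pattern[matched])) {
                matched++;
            } else {
                // Simple restart: the mismatching token may itself begin a new match.
                matched = currentMatches(st, pattern[0]) ? 1 : 0;
            }
        }
        return true;
    }

    // Returns the body of the next quoted string, or null if EOF comes first.
    static String getNextQuotedString(StreamTokenizer st) throws IOException {
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == '"') {
                return st.sval;
            }
        }
        return null;
    }

    static Vector<String> getLinks(StreamTokenizer st) throws IOException {
        Vector<String> links = new Vector<>();
        while (consumeThroughPatternMatch(st, HREF_PATTERN)) {
            String link = getNextQuotedString(st);
            if (link != null) {
                links.add(link);
            }
        }
        return links;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ParserSketch.java somefile.html
        System.out.println(getLinks(getTokenizer(args[0])));
    }
}
```

Because newTokenizer accepts any Reader, you can test these methods on short strings with a StringReader before moving on to real files.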


Part II: Add a simple user interface

To create a user interface for your application, create a Panel that contains two components:
  1. A TextField into which a user can type a file name.
  2. A List in which you'll display all the links in the file.
Register an ActionListener with the TextField, so that when the user presses the enter (return) key on the keyboard, the file with the given name is parsed and the list of links is displayed.

In your run() method in Startup, put the Panel into a Frame and display it on the screen. Test thoroughly.
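
One possible layout for the Panel is sketched below. The class name LinkPanel, the methods showLinks and linkCount, and the linkFinder parameter are made up for illustration. linkFinder stands in for whatever code you wrote in Part I to turn a file name into a Vector of links. The lambda syntax requires a Java version that supports it; an anonymous ActionListener class would do the same job.

```java
import java.awt.BorderLayout;
import java.awt.List;
import java.awt.Panel;
import java.awt.TextField;
import java.util.Vector;
import java.util.function.Function;

class LinkPanel extends Panel {
    private final TextField input = new TextField(40);
    private final List links = new List(10);   // java.awt.List, not java.util.List

    // linkFinder is a hypothetical stand-in for "parse this file and return its links".
    LinkPanel(Function<String, Vector<String>> linkFinder) {
        setLayout(new BorderLayout());
        add(input, BorderLayout.NORTH);
        add(links, BorderLayout.CENTER);
        // The TextField fires an ActionEvent when the user presses enter.
        input.addActionListener(event -> showLinks(linkFinder.apply(input.getText())));
    }

    // Replaces the contents of the List with the given links.
    void showLinks(Vector<String> found) {
        links.removeAll();
        for (String link : found) {
            links.add(link);
        }
    }

    // Number of links currently displayed; handy for testing.
    int linkCount() {
        return links.getItemCount();
    }
}
```

To display it, add an instance of the Panel to a Frame, call pack() on the Frame, and make it visible.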


Part III: A Navigation Applet

In this part of the lab, you'll use the code you wrote in Parts I and II to create an Applet that works as follows:
  1. The user types a URL in the TextField and presses the enter key.
  2. The Applet opens an input stream for the given URL, extracts all the links from the file, and displays them in the List. In addition, the browser is instructed to show the document at the given URL in another frame (which we'll call the target frame).
  3. The user either types another URL and hits enter OR the user selects one of the items from the list. The typed or selected item is processed as the next input file. If it was a selected item, the full URL should be displayed in the TextField (even if the link in the list is a relative link).

First, modify your code to read its input from a URLConnection instead of from a local file.
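
The change from a file to a URL can be quite small, since a StreamTokenizer only needs a Reader. The sketch below shows one way to obtain that Reader from a URLConnection. The class name UrlSource and the method openReader are made up for illustration, and reading the page as UTF-8 is an assumption of this sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class UrlSource {
    // Opens a connection to the URL and wraps its input stream in a Reader.
    static Reader openReader(URL url) throws IOException {
        URLConnection connection = url.openConnection();
        // Assumption: the document is encoded as UTF-8.
        return new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java UrlSource.java <url>");
            return;
        }
        // Closing the Reader also closes the underlying connection stream.
        try (Reader in = openReader(URI.create(args[0]).toURL())) {
            StreamTokenizer st = new StreamTokenizer(in);
            int count = 0;
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                count++;
            }
            System.out.println(count + " tokens");
        }
    }
}
```

A "file:" URL works with the same code, which makes it possible to test this step on your local test files before trying a web server.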

Then, create a new Cafe project of type "applet". As its main HTML file, create a file named something like Lab4frames.html that creates two frames in the browser. For the contents of the first frame, create a file Sitemap.html in which to display your applet. The other frame will be used by the applet when the user requests a URL.

Create an Applet class whose init() method creates an instance of your Startup class and starts it running in a thread. Modify your code so that when a URL is selected (either typed or chosen from the list) by a user, your program will instruct the browser to display that URL as a document in the target frame. You will need to register an ItemListener with the List object to be informed of any ItemEvent that occurs when the user selects from the list.

Before testing, be sure to set the permissions appropriately. (The .class files need to be world readable. The directory itself needs to be both world readable and world executable.) Test thoroughly. Remember that your applet will only have permission to open URL connections to the web server from which it was loaded, so you won't be able to test your applet on web sites outside of CEC.


Extra-Credit Features:


Demonstration:

In your lab section on March 29, you will demonstrate your complete working lab. Have a completed CS102 cover sheet ready for the TA to record your demo grade and demo comments (what worked and didn't work).


Hard Copy:

After your demo, clean up your code, add documentation, and make it beautiful. If there was a problem during the demo, you should mark on your code where you think the problem is. If you have time, you can try fixing it, and you should describe how you fixed it and whether or not you were able to get it to work. You may replace your demo grade by doing another demo only if you use a late coupon or a rewrite coupon. By 5:00pm on March 30, turn in your cover sheet (with the demo grade recorded on it) and a printed copy of all your code to the CS102 mailbox.

Kenneth J. Goldman (kjg@cs.wustl.edu)