Computer Science 514
Lab 5: Dynamic Programming


Assigned: April 2, 2002
Code due: April 16, 2002 by e-mail at midnight


Objectives


Background

The diff utility is commonly used in UNIX to compare text files to see if they are identical. One helpful thing that diff does is show the shortest sequence of edits that would transform one file into another. This is accomplished using a dynamic programming algorithm, namely the edit distance algorithm discussed in class.

"But wait," you say. "The edit distance algorithm operates on individual character strings. It would be very slow to apply it to two text files, especially if they are long." The approach that diff uses is to think of each line in the file as a character. The edit distance algorithm is then applied at the line level rather than the character level.

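To make the line-as-character idea concrete, here is a minimal C++ sketch (assuming a C++11 compiler). It is not the staff solution. The helper names readLines and lineEditDistance are hypothetical, chosen only for illustration, and every insert, delete, and change is assumed to cost 1.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: read a text file into a vector, one entry per line.
// Returns an empty vector if the file cannot be opened.
std::vector<std::string> readLines(const std::string& path) {
    std::vector<std::string> lines;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}

// Hypothetical helper: edit distance in which a whole line plays the role
// of a single "character". A match costs 0; every other edit costs 1.
int lineEditDistance(const std::vector<std::string>& source,
                     const std::vector<std::string>& target) {
    const std::size_t m = source.size();
    const std::size_t n = target.size();
    // dist[i][j] = cost of turning the first i source lines
    // into the first j target lines.
    std::vector<std::vector<int>> dist(m + 1, std::vector<int>(n + 1, 0));
    for (std::size_t i = 0; i <= m; ++i) dist[i][0] = static_cast<int>(i);  // delete all
    for (std::size_t j = 0; j <= n; ++j) dist[0][j] = static_cast<int>(j);  // insert all
    for (std::size_t i = 1; i <= m; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            const int diag = dist[i - 1][j - 1] + (source[i - 1] == target[j - 1] ? 0 : 1);
            const int del = dist[i - 1][j] + 1;
            const int ins = dist[i][j - 1] + 1;
            dist[i][j] = std::min({diag, del, ins});
        }
    }
    return dist[m][n];
}
```

The table has (m+1) x (n+1) cells for files of m and n lines, so the work grows with the product of the line counts rather than the product of the character counts.
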
Your goal will be to write a program which takes in two files and mimics the operation of diff on the two files. The program will be called diff514, and its syntax will be similar to diff's:

C++:	diff514 [sourcefile] [targetfile]
Java:	java diff514 [sourcefile] [targetfile]

The output of diff514 should consist of two parts: (1) a "similarity score" which is equal to the "normalized" edit distance, and (2) the individual changes that must occur in order to transform sourcefile into targetfile.


Computation

  1. Read this tutorial on the edit distance algorithm. This is essentially the class lecture, redux, but has some notes specific to helping you with this lab.

  2. (10% of lab grade) Your lab can consist of any classes you like, as long as you provide a Makefile that will correctly compile your code on UNIX and produce an executable called diff514 (or a class called diff514 in Java). You can copy a Makefile from previous labs (such as Lab 1), and change it to fit your needs (yes, even possibly below the section which says "Do not change anything below this line"). Your Makefile must additionally support the make turnin target; the Makefile from Lab 1 already supports this, but you must change the Makefile to write a more appropriate subject header for the e-mail (e.g. "Assignment-5" instead of "Assignment-1").

  3. (40% of lab grade) Begin by implementing a dynamic programming algorithm to return the edit distance between two files. As mentioned in the tutorial, consider each line to be a character in the string and have the edit distance be measured in per-line operations rather than per-character operations. You may assume that all lines are no longer than 1024 characters in length.

    Your program must print out the normalized edit distance according to the output formatting guidelines described later in this lab handout. For the purposes of this lab, the normalized edit distance is the edit distance calculated by the dynamic programming algorithm divided by the number of lines in the larger of the two files. I expect this number to be between 0 and 1. If the normalized edit distance turns out to be larger than 1, there is a problem with your algorithm: even when the two files are completely different, you can change every line of the smaller file and then insert or delete the leftover lines, which costs exactly one edit per line of the larger file, so the optimal solution must be no worse than that.

  4. (30% of lab grade) Instrument your algorithm from step 3 above to also provide the shortest sequence of changes to make. If there are multiple optimal sequences (i.e. multiple pointers to follow out of a particular table cell), for the purposes of this lab favor changes/matches over deletes and deletes over inserts (in reality, it doesn't matter which one you provide, as all of them have the same edit distance). The format of your output is given in the output formatting guidelines described in the section below.

  5. The remaining 20% of your lab grade will be determined by an evaluation of code style, output format, and provided documentation.
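
Steps 3 and 4 can be sketched in C++ roughly as follows (assuming C++11). This is only one possible arrangement, not the required design. The names EditKind, Edit, editScript, and normalizedDistance are hypothetical, the choice to record the current table indices as line numbers is my own convention, and treating two empty files as distance 0 is an assumption the handout does not state. The sketch does no printing, since the output templates govern that.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

enum class EditKind { Match, Change, Delete, Insert };

// Hypothetical record of one step in the edit sequence.
// Line numbers are 1-based table indices at the time of the step.
struct Edit {
    EditKind kind;
    std::size_t sourceLine;
    std::size_t targetLine;
};

// Build the DP table, then walk back from the bottom-right cell.
// Ties are broken as the handout asks: change/match, then delete, then insert.
std::vector<Edit> editScript(const std::vector<std::string>& source,
                             const std::vector<std::string>& target) {
    const std::size_t m = source.size();
    const std::size_t n = target.size();
    std::vector<std::vector<int>> dist(m + 1, std::vector<int>(n + 1, 0));
    for (std::size_t i = 0; i <= m; ++i) dist[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= n; ++j) dist[0][j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= m; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            const int diag = dist[i - 1][j - 1] + (source[i - 1] == target[j - 1] ? 0 : 1);
            dist[i][j] = std::min({diag, dist[i - 1][j] + 1, dist[i][j - 1] + 1});
        }
    }

    std::vector<Edit> script;
    std::size_t i = m;
    std::size_t j = n;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0) {
            const bool same = source[i - 1] == target[j - 1];
            if (dist[i][j] == dist[i - 1][j - 1] + (same ? 0 : 1)) {
                script.push_back({same ? EditKind::Match : EditKind::Change, i, j});
                --i;
                --j;
                continue;
            }
        }
        if (i > 0 && dist[i][j] == dist[i - 1][j] + 1) {
            script.push_back({EditKind::Delete, i, j});
            --i;
        } else {
            script.push_back({EditKind::Insert, i, j});
            --j;
        }
    }
    std::reverse(script.begin(), script.end());  // traceback runs end-to-start
    return script;
}

// Edit distance divided by the line count of the larger file.
// Assumption: two empty files are reported as 0.0 to avoid dividing by zero.
double normalizedDistance(int distance, std::size_t sourceLines, std::size_t targetLines) {
    const std::size_t longest = std::max(sourceLines, targetLines);
    if (longest == 0) return 0.0;
    return static_cast<double>(distance) / static_cast<double>(longest);
}
```

Because the traceback tests the diagonal pointer first and the delete pointer second, the tie-breaking order falls out of the order of the if statements rather than needing a separate pointer table.
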

Output Formatting Guidelines

Any lab that fails to print out its output according to the following rules will lose points. Pay attention to the spacing guidelines and capitalization of the letters in the templates; they are not merely for show, but are rather part of the specification. Assume all output lines are flush with the left edge of the screen.

Checking Your Answers

To verify against a staff-provided, working copy of diff514, execute the following script at a UNIX prompt in your CEC account. You will need to provide absolute paths to source and target files that you wish to check against diff514, although the folder does contain some sample text files which you are free to copy to your own account.

	pushd ~cs514/Lab-Solutions/lab5/
	diff514 [source] [target]
	popd

Extra Credit Options (I will only accept these options)

DO NOT modify the interface of diff514 to accomplish these goals; rather, rely on a text file called diff514options (note that there is no file extension!) that exists in the same directory and can be customized to achieve this effect. (If diff514options cannot be found, you must use the default behavior of the lab -- don't crash the program!) The format of the diff514options file is up to you, but be sure to explain it carefully in your README.
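
As one illustration of the "fall back to defaults" requirement, here is a small C++ sketch (assuming C++11). Since the file format is up to you, everything about the format here is an assumption: one "key value" pair per line, with the hypothetical keys pretty, insert, delete, and change. The Options struct and loadOptions function are likewise hypothetical, and the sketch looks for the file relative to the current working directory.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical settings for the extra-credit options; the defaults
// reproduce the ordinary behavior of the lab.
struct Options {
    bool prettyPrint = false;
    double insertPenalty = 1.0;
    double deletePenalty = 1.0;
    double changePenalty = 1.0;
};

// Assumed format: one "key value" pair per line, e.g. "pretty on" or "delete 0.5".
// A missing file, an unknown key, or an out-of-range penalty leaves the default alone.
Options loadOptions(const std::string& path = "diff514options") {
    Options opts;
    std::ifstream in(path);
    if (!in) {
        return opts;  // no options file: use default behavior, do not crash
    }
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string key;
        if (!(fields >> key)) continue;  // skip blank lines
        if (key == "pretty") {
            std::string value;
            if (fields >> value) opts.prettyPrint = (value == "on");
        } else if (key == "insert" || key == "delete" || key == "change") {
            double penalty = 0.0;
            // Accept only penalties in (0, 1].
            if (fields >> penalty && penalty > 0.0 && penalty <= 1.0) {
                if (key == "insert") opts.insertPenalty = penalty;
                else if (key == "delete") opts.deletePenalty = penalty;
                else opts.changePenalty = penalty;
            }
        }
    }
    return opts;
}
```

Whatever format you settle on, the point of the sketch is that every failure path returns the defaults instead of aborting.
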

  1. (+5% EC) Allow the user to have the option of "pretty-printing" the edit sequence. In this "pretty-printing" scenario, you must group changes that occur over multiple lines into one. Deletes and inserts occurring over sequential lines should be kept together. So, instead of:
    Delete: 4 from source
    Delete: 6 from source
    Delete: 7 from source
    Delete: 8 from source
    Insert: 12 from target as 11 in source
    Insert: 13 from target as 11 in source
    Insert: 14 from target as 11 in source
    
    You would have:
    Delete: 4 from source
    Delete: 6-8 from source
    Insert: 12-14 from target as 11 in source
    
  2. (+10% EC) Instead of having each edit be worth +1, allow the user to customize the penalties for each edit. Each penalty P must obey the rule 0 < P <= 1, i.e. penalties are real numbers on the half-open interval (0,1], and no edit operation may contribute a non-positive penalty to the similarity score. Your edit distances will generally be smaller, but you should still calculate the normalized edit distance as the edit distance divided by the number of lines in the larger file.

    Also, in your README, outline a set of experiments that shows how changing the individual penalty weights affects the outcome of the diff514 utility on different files. You must justify any claims with experimental evidence. Since the README is plain text, you may reference one or more other files that you submit electronically with your lab to better present your data (but don't forget to do the README!).
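
Both extra-credit options can be sketched with two small C++ helpers (assuming C++11). The names groupRuns and weightedLineEditDistance are hypothetical, and neither helper prints anything, so wiring the results into the required output format is left out. The first collapses a sorted list of line numbers into ranges. The second is the same dynamic program as the basic lab, with real-valued penalties in place of +1.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Collapse sorted line numbers into "a" or "a-b" pieces,
// e.g. 4, 6, 7, 8 becomes "4" and "6-8".
std::vector<std::string> groupRuns(const std::vector<int>& lines) {
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < lines.size()) {
        std::size_t j = i;
        // Extend the run while the next number is exactly one larger.
        while (j + 1 < lines.size() && lines[j + 1] == lines[j] + 1) ++j;
        if (j == i) {
            out.push_back(std::to_string(lines[i]));
        } else {
            out.push_back(std::to_string(lines[i]) + "-" + std::to_string(lines[j]));
        }
        i = j + 1;
    }
    return out;
}

// Line-level edit distance with caller-supplied penalties.
// Assumes each penalty has already been validated to lie in (0, 1].
double weightedLineEditDistance(const std::vector<std::string>& source,
                                const std::vector<std::string>& target,
                                double insertPenalty,
                                double deletePenalty,
                                double changePenalty) {
    const std::size_t m = source.size();
    const std::size_t n = target.size();
    std::vector<std::vector<double>> dist(m + 1, std::vector<double>(n + 1, 0.0));
    for (std::size_t i = 1; i <= m; ++i) dist[i][0] = dist[i - 1][0] + deletePenalty;
    for (std::size_t j = 1; j <= n; ++j) dist[0][j] = dist[0][j - 1] + insertPenalty;
    for (std::size_t i = 1; i <= m; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            const double diag = dist[i - 1][j - 1] +
                                (source[i - 1] == target[j - 1] ? 0.0 : changePenalty);
            const double del = dist[i - 1][j] + deletePenalty;
            const double ins = dist[i][j - 1] + insertPenalty;
            dist[i][j] = std::min({diag, del, ins});
        }
    }
    return dist[m][n];
}
```

One effect worth looking for in your experiments: when the insert and delete penalties together are less than the change penalty, the optimal sequence replaces every change with a delete followed by an insert.
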

What to Submit

By midnight on the code due date, you are required to use the make turnin operation on your Makefile to send your lab to cs514gr@cec.wustl.edu. The make turnin operation zips your files up, runs the zip file through uuencode (to convert the binary file to a known text format), and then mails your assignment as the body of an e-mail. No other method of file submission (including manually attaching a ZIP file to an e-mail) will be accepted; non-compliant e-mails will simply be discarded. (You should be sure to get help from a TA well in advance of the due date if you think you will have a submission problem!)

You must include the following:


Robert Amar (raa4@cs.wustl.edu)