Assigned: April 2, 2002
Code due: April 16, 2002 by e-mail at midnight
Objectives
- See how dynamic programming is implemented and gain experience with a
common practical application of a dynamic programming algorithm.
Background
The diff utility is commonly used in UNIX to compare text files
to see if they are identical. One helpful thing that diff does
is show the shortest sequence of edits that would transform one
file into another. This is accomplished using a dynamic programming
algorithm, namely the edit distance algorithm discussed in class.
"But wait," you say. "The edit distance algorithm operates on
the individual characters of a string. It would be very slow to apply it to two
text files, especially if they are long." The approach that diff
uses is to think of each line in the file as a character. The edit
distance algorithm is then applied at the line level rather than the
character level.
Your goal will be to write a program which takes in two files and mimics
the operation of diff on the two files. The program will be
called diff514, and its syntax will be similar to
diff's:
C++: diff514 [sourcefile] [targetfile]
Java: java diff514 [sourcefile] [targetfile]
The output of diff514 should consist of two parts: (1) a
"similarity score" which is equal to the "normalized" edit distance, and
(2) the individual changes that must occur in order to transform
sourcefile into targetfile.
Computation
- Read this tutorial on the edit distance
algorithm. This is essentially the class lecture, redux, but has some
notes specific to helping you with this lab.
- (10% of lab grade) Your lab can consist of any classes you like, as
long as you provide a Makefile that will correctly compile your code on
UNIX and produce an executable called diff514 (or a class called
diff514 in Java). You can copy a Makefile from previous labs
(such as Lab 1), and
change it to fit your needs (yes, even possibly below the section which
says "Do not change anything below this line"). Your Makefile must
additionally support the make turnin target; the Makefile from
Lab 1 already supports this, but you must change the Makefile to write a
more appropriate subject header for the e-mail (e.g. "Assignment-5"
instead of "Assignment-1").
- (40% of lab grade) Begin by implementing a dynamic programming
algorithm to return the edit distance between two files. As mentioned in
the tutorial, consider each line to be a
character in the string and have the edit distance be measured in per-line
operations rather than per-character operations. You may assume
that all lines are no longer than 1024 characters in length.
Your program must print out the normalized edit distance according
to the output formatting guidelines described later in this lab handout.
For the purposes of this lab, the normalized edit distance is the edit
distance calculated by the dynamic programming algorithm divided by the
number of lines in the larger of the two files. I expect this number to
be between 0 and 1; if the normalized edit distance turns out to be
larger than 1, there is a problem with your algorithm. Even when the two
files share no lines, changing every line of the smaller file and then
inserting or deleting the leftover lines of the larger file takes exactly
one edit per line of the larger file, so the optimal solution must be no
worse than that.
- (30% of lab grade) Instrument your algorithm from step 3 above to also
provide the shortest sequence of changes to make. If there are multiple
shortest sequences (i.e. multiple pointers to follow out of a
particular table cell), for
purposes of this lab favor changes/matches over deletes and deletes over
inserts (in reality, it doesn't matter which one you provide, as they all
have the same edit distance). The format of your output is given in the
output formatting guidelines described in the section below.
- The remaining 20% of your lab grade will be determined by an
evaluation of code style, output format, and provided documentation.
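The table-filling step and the normalization described above could be sketched as follows. This is a sketch under stated assumptions, not the reference solution: it assumes the standard unit-cost recurrence (a match costs 0; a change, delete, or insert costs 1), which may differ in detail from the tutorial, and the names buildTable and normalizedDistance are hypothetical. Treating two empty files as having distance 0 is also an assumption; the handout does not address that case.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

using Lines = std::vector<std::string>;
using Table = std::vector<std::vector<int>>;

// d[i][j] holds the edit distance between the first i lines of the source
// and the first j lines of the target.
Table buildTable(const Lines& src, const Lines& tgt) {
    const std::size_t n = src.size();
    const std::size_t m = tgt.size();
    Table d(n + 1, std::vector<int>(m + 1, 0));
    for (std::size_t i = 1; i <= n; ++i) d[i][0] = static_cast<int>(i);  // delete all
    for (std::size_t j = 1; j <= m; ++j) d[0][j] = static_cast<int>(j);  // insert all
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            const int diag = d[i - 1][j - 1] + (src[i - 1] == tgt[j - 1] ? 0 : 1);
            const int del = d[i - 1][j] + 1;
            const int ins = d[i][j - 1] + 1;
            d[i][j] = std::min({diag, del, ins});
        }
    }
    return d;
}

// Edit distance divided by the line count of the larger file.
double normalizedDistance(const Lines& src, const Lines& tgt) {
    const std::size_t larger = std::max(src.size(), tgt.size());
    if (larger == 0) return 0.0;  // assumption: two empty files are identical
    const Table d = buildTable(src, tgt);
    return static_cast<double>(d[src.size()][tgt.size()]) / static_cast<double>(larger);
}
```

For example, a four-line source {a, b, c, d} and a three-line target {a, x, c} need one change and one delete, so the normalized distance is 2 / 4 = 0.5.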
Output Formatting Guidelines
Any lab that fails to print out its output according to the following
rules will lose points. Pay attention to the spacing guidelines and
capitalization of the letters in the templates; they are not merely for
show, but are rather part of the specification. Assume all output lines
are flush with the left edge of the screen.
- The normalized edit distance you print should be of type
double. It must be the first thing your program prints out, in a
line all by itself, with the following format:
Similarity score: 0.2023101
Obviously, you would replace 0.2023101 with the value your program computes.
- The sequence of operations should be the second thing printed out.
There should be NO extra blank lines between the similarity score and the
edit sequence section. All line numbers that refer to the source
file should refer to the source file AS IT WAS ORIGINALLY NUMBERED
(i.e. do not take deletes/inserts into account when printing out
line numbers for the source file). Format your output according to the
following example:
Similarity score: 0.2023101
==EDIT SEQUENCE==
Delete: 4 from source
Delete: 6 from source
Delete: 12 from source
Change: 13 from source to 11 from target
Delete: 18 from source
Insert: 12 from target as 21 in source
Insert: 15 from target as 22 in source
==END SEQUENCE==
If a certain line (with line number N) is to be deleted from the SOURCE
file to progress towards transforming it into the TARGET file, then you
print out the change as follows:
Delete: N from source
If a certain line (with line number N) is to be changed in the SOURCE
file so that it reads
the same as line M from the TARGET file, then you print out the change as
follows:
Change: N from source to M from target
If a certain line from the TARGET file (with line number M) is to be
inserted into the SOURCE file at a certain point (so that the new line
would be inserted immediately before the contents at line number N)
then you print out the change as follows:
Insert: M from target as N in source
When multiple lines from the TARGET file should be inserted into the
SOURCE file, list them in order, all with the same line number from the
source.
- There should be NO blank lines after the ==END SEQUENCE==
line.
- The edits should be listed in the order that the dynamic programming
algorithm traverses pointers (i.e. the "forward
cursor model"). If there are no edits to be done (files are
identical), you should still print out the ==EDIT SEQUENCE== and
==END SEQUENCE== lines. While each change is annotated with
information from the source file and can practically be done in any order
(except for the inserts), we have to have a standard approach so that
outputs can be easily compared.
Checking Your Answers
To verify against a staff-provided, working copy of diff514,
execute the following script at a UNIX prompt in your CEC account. You
will need to provide absolute paths to source and target files that you
wish to check against diff514, although the folder does contain
some sample text files which you are free to copy to your own account.
pushd ~cs514/Lab-Solutions/lab5/
diff514 [source] [target]
popd
Extra Credit Options (I will only accept these options)
DO NOT modify the interface of diff514 to accomplish
these goals; rather, rely on a text file called diff514options
(note that there is no file extension!) that exists in the same directory
and can be customized to achieve this effect. (If diff514options
cannot be found, you must use the default behavior of the lab -- don't
crash the program!) The format of the diff514options file is up
to you, but be sure to explain it carefully in your README.
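Since the file format is left to you, the following C++ sketch shows just one hypothetical choice: one key=value pair per line, with the invented keys pretty, delete, insert, and change. The struct and function names are also made up for illustration. The point of the sketch is the fallback behavior: a missing file or an unusable value leaves the defaults in place instead of crashing. The penalty check assumes the interval (0,1] given in the extra credit description.

```cpp
#include <cstddef>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical option set; the defaults reproduce the plain lab behavior.
struct Options {
    bool prettyPrint = false;
    double deleteCost = 1.0;
    double insertCost = 1.0;
    double changeCost = 1.0;
};

// Overwrite `target` only if the value parses and lies in (0, 1].
void setPenalty(std::istringstream& value, double& target) {
    double p = 0.0;
    if (value >> p && p > 0.0 && p <= 1.0) target = p;
}

// Parse "key=value" lines; unknown keys and malformed lines are ignored.
Options parseOptions(std::istream& in) {
    Options opts;
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t eq = line.find('=');
        if (eq == std::string::npos) continue;
        const std::string key = line.substr(0, eq);
        std::istringstream value(line.substr(eq + 1));
        if (key == "pretty") {
            int flag = 0;
            if (value >> flag) opts.prettyPrint = (flag != 0);
        } else if (key == "delete") {
            setPenalty(value, opts.deleteCost);
        } else if (key == "insert") {
            setPenalty(value, opts.insertCost);
        } else if (key == "change") {
            setPenalty(value, opts.changeCost);
        }
    }
    return opts;
}

// Fall back to the defaults when the options file cannot be opened.
Options loadOptions(const std::string& path) {
    std::ifstream in(path);
    if (!in) return Options{};
    return parseOptions(in);
}
```

A program using this sketch would call loadOptions("diff514options") once at startup and leave the command-line interface untouched.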
- (+5% EC) Allow the user to have the option of "pretty-printing" the
edit sequence. In this "pretty-printing" scenario, you must group changes
that occur over multiple lines into one. Deletes and inserts occurring
over sequential lines should be kept together. So, instead of:
Delete: 4 from source
Delete: 6 from source
Delete: 7 from source
Delete: 8 from source
Insert: 12 from target as 11 in source
Insert: 13 from target as 11 in source
Insert: 14 from target as 11 in source
You would have:
Delete: 4 from source
Delete: 6-8 from source
Insert: 12-14 from target as 11 in source
- (+10% EC) Instead of having each edit be worth +1, allow the user to
customize the penalties for each edit. Each penalty P must obey the rule
0 < P <= 1, i.e. penalties are real numbers on the
half-open interval (0,1], and no edit operation may be worth a
non-positive penalty towards the similarity score. Your edit distances
will be smaller, but you should still calculate the normalized edit
distance as the edit distance divided by the number of lines in the
larger file.
Also, in your README, outline a
set of experiments that shows how changing the individual penalty weights
affects the output of the diff514 utility on different files. You must
justify any claims with
experimental evidence. Since the README is plain text, you may reference
other files that you submit electronically with your lab
to better present your data (but don't forget to do the README!).
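For the pretty-printing option, the grouping of sequential deletes and inserts might look like the C++ sketch below. The Edit record, the Kind enum, and the function names are hypothetical, and the sketch covers only deletes and inserts, which are the two cases shown in the example above. It assumes that a run of inserts is only merged when every insert in it lands at the same source position.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class Kind { Delete, Insert };

// Hypothetical edit record. For a Delete, `line` is the source line number
// and `at` is unused. For an Insert, `line` is the target line number and
// `at` is the source position it is inserted at.
struct Edit {
    Kind kind;
    int line;
    int at;
};

// True when `next` extends the run that currently ends with `prev`.
bool continuesRun(const Edit& prev, const Edit& next) {
    if (prev.kind != next.kind || next.line != prev.line + 1) return false;
    return prev.kind == Kind::Delete || prev.at == next.at;
}

// "4" for a single line, "6-8" for a run of several lines.
std::string range(int first, int last) {
    if (first == last) return std::to_string(first);
    return std::to_string(first) + "-" + std::to_string(last);
}

// Collapse each maximal run of sequential edits into one output line.
std::vector<std::string> prettyPrint(const std::vector<Edit>& edits) {
    std::vector<std::string> out;
    std::size_t start = 0;
    while (start < edits.size()) {
        std::size_t end = start;
        while (end + 1 < edits.size() && continuesRun(edits[end], edits[end + 1])) {
            ++end;
        }
        const std::string lines = range(edits[start].line, edits[end].line);
        if (edits[start].kind == Kind::Delete) {
            out.push_back("Delete: " + lines + " from source");
        } else {
            out.push_back("Insert: " + lines + " from target as " +
                          std::to_string(edits[start].at) + " in source");
        }
        start = end + 1;
    }
    return out;
}
```

Fed the seven edits from the pretty-printing example (deletes of 4, 6, 7, 8 and inserts of 12, 13, 14 at position 11), this produces the three grouped lines shown there.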
What to Submit
By midnight on the code due date, you are required
to use the make turnin operation on your Makefile to send
your lab to cs514gr@cec.wustl.edu. The
make turnin operation zips your files up, runs the zip file
through uuencode (to convert the binary file to a known text
format), and then mails your assignment as the body of an e-mail. No
other method of file submission (including manually attaching a ZIP file
to an e-mail) will be accepted; non-compliant e-mails will simply be
discarded. (You should be sure to get help from a TA well in advance of
the due date if you think you will have a submission problem!)
You must include the following:
- Source code for any files you have written
- A plain-text README file documenting your design approach to
the problem, any problems you encountered during the design and
implementation of this lab, and any known bugs and possible fixes
- Your extra credit experimental report, if any
Robert Amar (raa4@cs.wustl.edu)