CSE 132 (Spring 2010)
Lab 2a: KWIC Index (ADT Review and I/O)

Abstract:

Search engines such as Google cannot afford to process all prospective source material in response to each search query. Instead, they preprocess the source material to develop an index that maps a keyword to the source material.

In CSE131, you studied the following ADTs:

In this lab you will use those ADTs, along with your studio 2 experience reading files, to implement a KWIC (KeyWord In Context) collection of classes in a package we will call kwic.

In the upcoming studio and lab sessions, you will design an interface for using the KWIC index and add some interesting features.


Getting Started

  1. You are encouraged, but not required, to work in pairs on this lab.
    • If you want to work in pairs, you must ask the professor to create a repo for you and partner using a stylish and unique group name of your choosing. Do not expect to get credit as a team later by showing work done in a personal repository and claiming team effort.
    • If you work in pairs, we will be looking for evidence in svn that you worked on separate parts of the code independently. In particular, you may not work in teams by having one member do all of the committing!
      Both team members must edit and commit code to the same stylishly named team repository using their individual @grid.cec.wustl.edu accounts for you to receive credit as a team.

      As an example of what we see as a result of your contributions to the repo, here is a view of an actual student's committed Controller file from Lab 1. Each line is annotated with the version and author who last contributed that line to the repository. Such a listing is produced using svn praise (also known as svn blame). You can see this information in eclipse using the Team...Show Annotation selection on a file. Mouse-over the colored bar on the left of the resulting display.

      I changed the student's name in the file to wooster keep him/her anonymous.

    • Ask, and ask again, if any of this is unclear.
    • Be sure to remember your repository name.
    • You can work on the project at the same time, but be sure to
      • Try to divide up the work so you don't have conflicts.
      • Update before you start any work and whenever your parter wants you to see the work he or she has done.
      • Commit often, as you work on the project.
  2. Open eclipse and update your personal repository (or checkout your group repository) for this lab.
  3. Run the KWICTest class as a JUnit test.
    It should fail at this point with a NullPointerException.

Overview

Supose you wanted to search a document for all phrases that include a given word, say "swordfish". There are two approaches that could be used in this endeavor:

  1. Once "swordfish" is supplied, a computer could search the document and return each phrase that contains that word. Every time a word is supplied, the entire document is scanned to find matching phrases.
  2. The document could be preprocessed offline , in anticipation of the need for search. When "swordfish" is supplied, the result is already computed and simply returned.
Which approach is best? It depends on In essence, the choice depends on the frequency of insertions and deletions to the document as compared with the frequency of lookups for its words.

We shall assume that offline preprocessing pays off, and that it would be expensive to search the document each time a word is supplied. As an analogy, consider the difficult task undertaken by Google to index the WWW. Imagine how slow it would be for Google to search the entire WWW each time you ask it to find a word.

As an example, consider the following phrases: The following table shows how phrases should be returned for words that might be supplied for KWIC:
Word Set of Phrases
swordfish
  • Swordfish goes well with pasta; the pasta should not be overcooked.
  • The password for entry to the castle is: "swordfish".
Well
  • Swordfish goes well with pasta; the pasta should not be overcooked.
  • All's well that ends well.
Notice that case and punctuation do not matter in matches, but that the returned phrases are exactly as they were entered. Also, although "well" is contained twice in one phrase, the set of phrases contains each phrase once (as sets are supposed to do).
You can find may sites that offer KWIC or concordance indices. For example this site has Shakespeare's works. Check out the results for rogue.

In summary, our KWIC index will:

  1. Process lines of text, decoding each into words.
  2. For each Word in a Phrase, remember that the Word occurs in that Phrase. We accomplish this using the ADT Map<Word, Set<Phrase>>. Each Word is mapped to a Set of Phrases that contain the Word.
  3. On demand, produce all Phrases that contain a given Word. The Map makes this easy.

Implementation notes


Classes

KWICTest
WordFilter
will show up in the documentation, because I found it useful to have a separate class for this. You do not necessarily need to write this class.
Word
Phrase
KWIC

For the Demo


Submitting your work

A TA will demo you or your group on the day the lab is due. Before asking the TA to demo you:

Last modified 08:58:57 CDT 03 June 2010 by Ron K. Cytron