CSE 132 (Spring 2009)
Lab 2: KWIC Index (part a)
(ADT Review and I/O)
Due: 29 Jan 2009 in Lab

Abstract:

Search engines such as Google cannot afford to process all prospective source material in response to each search query. Instead, they preprocess the source material to develop an index that maps a keyword to the source material.

In CSE131, you studied the following ADTS:

In this lab you will use those ADTs, along with your studio 2 experience reading files, to implement a KWIC (KeyWord In Context) collection of classes in a package we will call kwic.

In the upcoming studio and lab sessions, you will design an interface for using the KWIC index and add some interesting features.


Basic part

  1. You are encouraged, but not required, to work in pairs on this lab.
  2. Open eclipse and go to your CSE132 project.
  3. Save the lab 2 zip to your desktop.
  4. In eclipse, right click CSE132 and import the zip file as a general archive file.
  5. Run the KWICTest class as a JUnit test.
    It should fail at this point with a NullPointerException.

Overview

Supose you wanted to search a document for all phrases that include a given word, say "swordfish". There are two approaches that could be used in this endeavor:

  1. Once "swordfish" is supplied, a computer could search the document and return each phrase that contains that word. Every time a word is supplied, the entire document is scanned to find matching phrases.
  2. The document could be preprocessed offline , in anticipation of the need for search. When "swordfish" is supplied, the result is already computed and simply returned.
Which approach is best? It depends on In essence, the choice depends on the frequency of insertions and deletions to the document as compared with the frequency of lookups for its words.

We shall assume that offline preprocessing pays off, and that it would be expensive to search the document each time a word is supplied. As an analogy, consider the difficult task undertaken by Google to index the WWW. Imagine how slow it would be for Google to search the entire WWW each time you ask it to find a word.

As an example, consider the following phrases: The following table shows how phrases should be returned for words that might be supplied for KWIC:
Word Set of Phrases
swordfish
  • Swordfish goes well with pasta; the pasta should not be overcooked.
  • The password for entry to the castle is: "swordfish".
Well
  • Swordfish goes well with pasta; the pasta should not be overcooked.
  • All's well that ends well.
Notice that case and punctuation do not matter in matches, but that the returned phrases are exactly as they were entered. Also, although "well" is contained twice in one phrase, the set of phrases contains each phrase once (as sets are supposed to do).
You can find may sites that offer KWIC or concordance indices. For example this site has Shakespeare's works. Check out the results for rogue.

In summary, our KWIC index will:

  1. Process lines of text, decoding each into words.
  2. For each Word in a Phrase, remember that the Word occurs in that Phrase. We accomplish this using the ADT Map<Word, Set<Phrase>>. Each Word is mapped to a Set of Phrases that contain the Word.
  3. On demand, produce all Phrases that contain a given Word. The Map makes this easy.

Implementation notes


Classes

KWICTest
WordFilter
will show up in the documentation, because I found it useful to have a separate class for this. You do not necessarily need to write this class.
Word
Phrase
KWIC

Demo your lab

Next week, when the lab is due, grab a grade sheet, fill it out, and show what you did to a TA. Get him or her to sign the sheet.

Submit your project

  1. Generate a zip file to submit your work by exporting your kwic package as a zip file.
  2. From your CMS page, upload the zip file and submit your work.


Last modified 20:10:43 CST 11 January 2010 by Ron K. Cytron