CSE 132 (Spring 2015)
Lab 2a: KWIC Index (ADT Review and I/O)

Abstract:

Search engines such as Google cannot afford to process all prospective source material in response to each search query. Instead, they preprocess the source material to develop an index that maps a keyword to the source material.

In CSE 131, you studied the following ADTs:

In this lab you will use those ADTs to implement a KWIC (KeyWord In Context) collection of classes in a package we will call kwic.

In the upcoming studio and lab sessions, you will design an interface for using the KWIC index and add some interesting features.


Getting Started

  1. You are encouraged, but not required, to work in pairs on this lab.
    • If you work in pairs, then do all the work together in one person's repo.
    • When you demo, indicate that you want the results propagated to the other person's repo.
  2. Use Eclipse version 3.7 or later for this lab. It is available on all of the lab machines.
  3. Open Eclipse and update your repository.
  4. Look for the kwic package in the labs source folder.
  5. Run the KWICTest class as a JUnit test.
    It should fail at this point with a NullPointerException.

Overview

Supose you wanted to search a document for all phrases that include a given word, say "swordfish". There are two approaches that could be used in this endeavor:

  1. Once "swordfish" is supplied, a computer could search the document and return each phrase that contains that word. Every time a word is supplied, the entire document is scanned to find matching phrases.
  2. The document could be preprocessed offline , in anticipation of the need for search. When "swordfish" is supplied, the result is already computed and simply returned.
Which approach is best? It depends on In essence, the choice depends on the frequency of insertions and deletions to the document as compared with the frequency of lookups for its words.

We shall assume that offline preprocessing pays off, and that it would be expensive to search the document each time a word is supplied. As an analogy, consider the difficult task undertaken by Google to index the WWW. Imagine how slow it would be for Google to search the entire WWW each time you ask it to find a word.

As an example, consider the following phrases: The following table shows how phrases should be returned for words that might be supplied for KWIC:
Word Set of Phrases
swordfish
  • Swordfish goes well with pasta; the pasta should not be overcooked.
  • The password for entry to the castle is: "swordfish".
Well
  • Swordfish goes well with pasta; the pasta should not be overcooked.
  • All's well that ends well.
Notice that case and punctuation do not matter in matches, but that the returned phrases are exactly as they were entered. Also, although "well" is contained twice in one phrase, the set of phrases contains each phrase once (as sets are supposed to do).
You can find may sites that offer KWIC or concordance indices. For example this site has Shakespeare's works. Check out the results for rogue.

In summary, our KWIC index will:

  1. Process lines of text, decoding each into words.
  2. For each Word in a Phrase, remember that the Word occurs in that Phrase. We accomplish this using the ADT Map<Word, Set<Phrase>>. Each Word is mapped to a Set of Phrases that contain the Word.
  3. On demand, produce all Phrases that contain a given Word. The Map makes this easy.

Implementation notes


Classes

KWICTest
WordFilter
will show up in the documentation, because I found it useful to have a separate class for this. You do not necessarily need to write this class.
Word
Phrase
KWIC

For the Demo


Submitting your work

A TA will demo you or your group on the day the lab is due. Before asking the TA to demo you:

Demo

When you done with this studio, you must be cleared by the TA to receive credit.


Last name WUSTL Key Propagate?
(or your numeric ID) Do not propagate
e.g. Smith j.smith
1 Copy from 1 to all others
2 Copy from 2 to all others

TA: Make sure they commit their code before you sign them out!

TA: Password:




Last modified 15:08:38 CST 26 January 2015 by Ron K. Cytron