
CS102: Parsing Character Streams
Copyright © 1999, Kenneth J. Goldman
Parsing Character Streams
- So far, we've seen examples where program data is saved in a file for later use.
But what happens when the data is created not by a program, but by a
human user?
- For example, when you create a .java file, it's just your keystrokes that the
text editor puts into the file. The same is true of an .html file for your
web site.
- When a compiler reads your .java file or a browser reads your .html file, it
doesn't know exactly what to expect. (This is unlike the earlier examples,
where the program knows exactly what type of data is going to be in the file.)
Instead, the compiler or browser must read through the characters of the file,
break them up into logical units called tokens, and process
them. This activity is known as parsing.
- For example, the line
- public static int x = 75 ;
- has 7 tokens: public, static, int, x, =, 75, and ;
- To parse a text file, you need to know where one token ends and the next one begins.
Whitespace (spaces, tabs, line breaks) usually separates tokens, but other
characters, known as delimiters, also separate tokens (e.g., "), and
some characters (mostly punctuation) are generally tokens by
themselves (for example, =, ., and ;).
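- To make these rules concrete, here is a minimal hand-written sketch of the idea. It assumes a made-up set of punctuation characters (the isPunctuation helper is hypothetical, not part of any library), ignores quotes entirely, and would split a number like 3.14 at the period, so treat it as an illustration rather than a real tokenizer.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ManualTokens {
    // Hypothetical helper: characters we choose to treat as one-character tokens.
    static boolean isPunctuation(int c) {
        return "=.;,(){}[]".indexOf(c) >= 0;
    }

    // Reads the stream one character at a time and collects the tokens.
    static List<String> split(Reader in) throws IOException {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (Character.isWhitespace(c) || isPunctuation(c)) {
                if (current.length() > 0) {        // the token in progress just ended
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                if (isPunctuation(c)) {            // punctuation is a token by itself
                    tokens.add(String.valueOf((char) c));
                }
            } else {
                current.append((char) c);          // still inside a token
            }
        }
        if (current.length() > 0) {                // last token at end of stream
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(split(new StringReader("public static int x=75;")));
    }
}
```

Note that x=75; is split correctly even though it contains no whitespace, because = and ; end the token in progress.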
- When working with a raw text file, you need to read each character and figure out
where the next token starts. However, Java provides a handy class called
StreamTokenizer that does most of the work for you.
- You can think of the StreamTokenizer as an iterator that walks through the tokens
of a stream.
- Constructor:
- StreamTokenizer(Reader r) // for example, r is a FileReader
- Methods for navigating through the stream:
- int nextToken() throws IOException
- finds the next token and returns its token type:
- TT_EOF, TT_EOL, TT_NUMBER, TT_WORD, or a special character
(like ;)
- after nextToken is called, the following values are available in
public instance variables of the tokenizer:
- int ttype -- the token type
String sval -- the string, if ttype == TT_WORD (or the quoted text, if ttype is a quote character)
double nval -- the number, if ttype == TT_NUMBER
- void pushBack()
- makes the next call to nextToken return the current ttype again and consume
nothing (sval and nval are unchanged)
- int lineno()
- returns the current line number
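- As a rough sketch of how these methods fit together, the class below walks the tokens of the earlier example line, peeks at a token with pushBack, and reports a line number with lineno. The class and helper names (TokenWalk, describeTokens, peekType, lineOfFirst) are made up for illustration, and a StringReader stands in for a FileReader so the example needs no file.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class TokenWalk {
    // Describes each token as "kind:value". Quoted text is not handled
    // specially here; it would show up under "char" with the quote character.
    static List<String> describeTokens(String text) throws IOException {
        StreamTokenizer tokens = new StreamTokenizer(new StringReader(text));
        List<String> result = new ArrayList<>();
        while (tokens.nextToken() != StreamTokenizer.TT_EOF) {
            if (tokens.ttype == StreamTokenizer.TT_WORD) {
                result.add("word:" + tokens.sval);
            } else if (tokens.ttype == StreamTokenizer.TT_NUMBER) {
                result.add("number:" + tokens.nval);
            } else {
                result.add("char:" + (char) tokens.ttype);
            }
        }
        return result;
    }

    // Reports the type of the next token without consuming it.
    static int peekType(StreamTokenizer tokens) throws IOException {
        int type = tokens.nextToken();
        tokens.pushBack();   // the next nextToken() call returns this token again
        return type;
    }

    // Returns the line on which the word first appears as a token, or -1.
    static int lineOfFirst(String text, String word) throws IOException {
        StreamTokenizer tokens = new StreamTokenizer(new StringReader(text));
        while (tokens.nextToken() != StreamTokenizer.TT_EOF) {
            if (tokens.ttype == StreamTokenizer.TT_WORD && tokens.sval.equals(word)) {
                return tokens.lineno();
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(describeTokens("public static int x = 75 ;"));
        System.out.println(lineOfFirst("int x ;\nint y ;", "y"));
    }
}
```

Because nval is a double, the number token for 75 is reported as 75.0.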
- Other methods:
- The remaining methods are mainly for controlling how the tokenizer treats
the characters in the input stream. For example,
- void quoteChar(int ch)
- lets you specify an additional quote character (" and ' are quote characters by default)
- void wordChars(int low, int high)
- lets you specify a range of characters to be treated as parts of a word
(for example, = and [ are not considered word
characters by default)
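- A small sketch of what these two settings change, under the assumption that we want the underscore to count as part of a word and the vertical bar to act as a quote character. The class and method names (TokenizerSettings, words, firstQuoted) are invented for this example.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class TokenizerSettings {
    // Collects the word tokens, optionally treating '_' as a word character.
    static List<String> words(String text, boolean underscoreIsWordChar) throws IOException {
        StreamTokenizer tokens = new StreamTokenizer(new StringReader(text));
        if (underscoreIsWordChar) {
            tokens.wordChars('_', '_');   // a range containing just '_'
        }
        List<String> result = new ArrayList<>();
        while (tokens.nextToken() != StreamTokenizer.TT_EOF) {
            if (tokens.ttype == StreamTokenizer.TT_WORD) {
                result.add(tokens.sval);
            }
        }
        return result;
    }

    // Returns the first piece of text enclosed in the given quote character, or null.
    static String firstQuoted(String text, char quote) throws IOException {
        StreamTokenizer tokens = new StreamTokenizer(new StringReader(text));
        tokens.quoteChar(quote);
        while (tokens.nextToken() != StreamTokenizer.TT_EOF) {
            if (tokens.ttype == quote) {  // ttype is the quote character itself
                return tokens.sval;       // sval holds the text between the quotes
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(words("max_count = 3", false));
        System.out.println(words("max_count = 3", true));
        System.out.println(firstQuoted("name = |Ada Lovelace|", '|'));
    }
}
```

With the default settings, max_count comes back as the two words max and count, because the underscore is an ordinary character that ends a word.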
- Example: let's write a method that counts the number of times a given word appears
in a given file as a separate token (not as part of a quote).
- int numberOfOccurrences(String targetWord, String filename) throws IOException {
    // uses java.io.FileReader, java.io.IOException, and java.io.StreamTokenizer
    StreamTokenizer tokens = new StreamTokenizer(new FileReader(filename));
    int count = 0;
    while (tokens.nextToken() != StreamTokenizer.TT_EOF) {
        // count only word tokens that match the target exactly
        if (tokens.ttype == StreamTokenizer.TT_WORD && tokens.sval.equals(targetWord)) {
            count++;
        }
    }
    return count;
}
- What if we want to pick apart the text within a quote? The StreamTokenizer
provides each piece of quoted text as a single String. (We could change the
StreamTokenizer settings to make " an ordinary character, but then the quoted
Strings wouldn't be treated normally. For example, "123" would be a number,
not a String.)
- Given a String, we can further examine it using some of the built-in methods of
the String class.
- boolean endsWith(String suffix)
boolean startsWith(String prefix)
int indexOf(String str)
int indexOf(String str, int fromIndex)
String substring(int beginIndex, int endIndex)
String toLowerCase()
String toUpperCase()
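- As an illustration of these String methods, suppose a quoted token holds a web address. The sketch below uses made-up helpers (host and isHtml in a hypothetical StringPeek class) to pull out the host name and to check the file type without regard to case.

```java
public class StringPeek {
    // Returns the host part of an http:// address, or null if the prefix is missing.
    static String host(String url) {
        String prefix = "http://";
        if (!url.startsWith(prefix)) {
            return null;
        }
        int start = prefix.length();
        int slash = url.indexOf("/", start);             // first slash after the prefix
        int end = (slash == -1) ? url.length() : slash;  // no slash: host runs to the end
        return url.substring(start, end);
    }

    // Checks for an .html ending, ignoring upper and lower case.
    static boolean isHtml(String name) {
        return name.toLowerCase().endsWith(".html");
    }

    public static void main(String[] args) {
        System.out.println(host("http://www.example.com/index.html"));
        System.out.println(isHtml("INDEX.HTML"));
    }
}
```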
- To find the number of occurrences of a targetWord in a given searchString, we
can write:
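- int count = 0;
int index = 0;
while ((index = searchString.indexOf(targetWord, index)) != -1) {
    count++;   // found an occurrence starting at position index
    index++;   // resume the search one character later
}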
- We could add this to the previous example when ttype == '"' to include quotes
in our count.
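- One way the combination might look is sketched below. It makes two assumptions of its own: the method takes a Reader instead of a filename so it can be tried on a String, and an empty targetWord is rejected up front, since indexOf would otherwise keep matching forever. The class name WordCounter and the helper countInString are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class WordCounter {
    // Counts occurrences of targetWord inside searchString using indexOf.
    static int countInString(String searchString, String targetWord) {
        if (targetWord.isEmpty()) {
            return 0;   // assumption: an empty target never matches
        }
        int count = 0;
        int index = 0;
        while ((index = searchString.indexOf(targetWord, index)) != -1) {
            count++;
            index++;    // resume the search one character later
        }
        return count;
    }

    // Counts targetWord as a separate word token and inside double-quoted text.
    static int numberOfOccurrences(String targetWord, Reader in) throws IOException {
        StreamTokenizer tokens = new StreamTokenizer(in);
        int count = 0;
        while (tokens.nextToken() != StreamTokenizer.TT_EOF) {
            if (tokens.ttype == StreamTokenizer.TT_WORD && tokens.sval.equals(targetWord)) {
                count++;
            } else if (tokens.ttype == '"') {
                count += countInString(tokens.sval, targetWord);
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        String text = "cat dog \"a cat and a cat\" catalog";
        System.out.println(numberOfOccurrences("cat", new StringReader(text)));
    }
}
```

The two cases are not symmetric: outside quotes the word must match a whole token (so catalog is not counted), while inside quotes indexOf matches any substring, so a quoted "catalog" would be counted.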
