CS102: Parsing Character Streams

Copyright © 1999, Kenneth J. Goldman

Parsing Character Streams

So far, we've seen examples where program data is saved in a file for later use. But what happens when the data is created not by a program, but by a human user?

For example, when you create a .java file, it's just your keystrokes that the text editor puts into the file. The same is true of an .html file for your web site.

When a compiler reads your .java file or a browser reads your .html file, it doesn't know exactly what to expect. (It's not like the earlier examples where the program knows exactly what type of data is going to be in the file.) Instead, the compiler or browser must read through the characters of the file, break the file up into logical units called tokens, and process them. This activity is known as parsing.

For example, the line

public static int x = 75 ;

has 7 tokens: public, static, int, x, =, 75, and ;.

To parse a text file, you need to know where one token ends and the next one begins. Whitespace (spaces, tabs, line breaks) usually separates tokens, but other characters, known as delimiters, also separate tokens (e.g., "), and some characters (mostly punctuation) are generally tokens by themselves (for example, =, ., and ;).

When working with a raw text file, you need to read each character and figure out where the next token starts. However, Java provides a handy class called StreamTokenizer that does most of the work for you.

You can think of the StreamTokenizer as an iterator that walks through the tokens of a stream.


Constructor:

StreamTokenizer(Reader r) // for example, r is a FileReader

Methods for navigating through the stream:

int nextToken() throws IOException

finds the next token and returns its token type:

TT_EOF, TT_EOL, TT_NUMBER, TT_WORD, or the character itself for a special character (like ;)

After nextToken is called, the following values are available in public instance variables of the tokenizer:
String sval -- the string if ttype == TT_WORD (or the text of a quoted string if ttype is the quote character)
double nval -- the number if ttype == TT_NUMBER
int ttype -- the token type
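
To make the token types concrete, here is a minimal sketch that runs the earlier sample line through a StreamTokenizer with its default settings. The class name TokenDump and the helper describeTokens are made up for this illustration, and a StringReader stands in for a FileReader so the example needs no file.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class TokenDump {
    // Hypothetical helper: describes every token in the input, one entry per token.
    static List<String> describeTokens(String input) throws IOException {
        StreamTokenizer tokens = new StreamTokenizer(new StringReader(input));
        List<String> result = new ArrayList<>();
        while (tokens.nextToken() != StreamTokenizer.TT_EOF) {
            switch (tokens.ttype) {
                case StreamTokenizer.TT_WORD:
                    result.add("WORD " + tokens.sval);
                    break;
                case StreamTokenizer.TT_NUMBER:
                    result.add("NUMBER " + tokens.nval);
                    break;
                default:
                    // A special character: ttype is the character itself.
                    // (Quoted strings also land here, with ttype set to the quote character.)
                    result.add("CHAR " + (char) tokens.ttype);
                    break;
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        for (String t : describeTokens("public static int x = 75 ;")) {
            System.out.println(t);
        }
    }
}
```

Running main prints seven lines, one per token. Note that 75 is reported through nval as a double, so it prints as 75.0.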

void pushBack()

makes the next call to nextToken return the current ttype and consume nothing (sval and nval would be unchanged)
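
Pushing a token back is mostly useful for lookahead: read one token too many, then hand it back so other code can read it again. The following is a small sketch of that idea; the class PushBackDemo and the method sumLeadingNumbers are names invented for this example.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

class PushBackDemo {
    // Hypothetical helper: sums the leading run of numbers and leaves
    // the first non-number token unread for the caller.
    static double sumLeadingNumbers(StreamTokenizer tokens) throws IOException {
        double sum = 0;
        while (tokens.nextToken() == StreamTokenizer.TT_NUMBER) {
            sum += tokens.nval;
        }
        tokens.pushBack(); // the token that ended the loop will be returned again
        return sum;
    }

    public static void main(String[] args) throws IOException {
        StreamTokenizer tokens = new StreamTokenizer(new StringReader("10 20 30 apples"));
        System.out.println(sumLeadingNumbers(tokens)); // 60.0
        tokens.nextToken();                            // re-reads the pushed-back token
        System.out.println(tokens.sval);               // apples
    }
}
```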

int lineno()

returns the current line number

Other methods:

The remaining methods are mainly for controlling how the tokenizer treats the characters in the input stream. For example,

void quoteChar(int ch)

lets you specify another quote character (" and ' are quote characters by default)

void wordChars(int low, int high)
lets you specify characters to be treated as parts of a word (for example, = and [ are generally not considered word characters)
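
As a sketch of what wordChars changes, the example below tokenizes the same text twice, once with the default settings and once after declaring the underscore to be a word character. The class WordCharsDemo and the method words are hypothetical names used only here.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class WordCharsDemo {
    // Hypothetical helper: collects the TT_WORD tokens of the input,
    // optionally treating '_' as part of a word.
    static List<String> words(String input, boolean underscoreIsWordChar) throws IOException {
        StreamTokenizer tokens = new StreamTokenizer(new StringReader(input));
        if (underscoreIsWordChar) {
            tokens.wordChars('_', '_'); // a range containing just the underscore
        }
        List<String> result = new ArrayList<>();
        while (tokens.nextToken() != StreamTokenizer.TT_EOF) {
            if (tokens.ttype == StreamTokenizer.TT_WORD) {
                result.add(tokens.sval);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(words("max_value = 75", false)); // [max, value]
        System.out.println(words("max_value = 75", true));  // [max_value]
    }
}
```

With the default settings the underscore is a token by itself, so it splits the identifier into two words.
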
Example: Let's write a method that counts the number of times a given word appears in a given file as a separate token (not as part of a quote).

// requires: import java.io.*;
int numberOfOccurrences(String targetWord, String filename) throws IOException {
    StreamTokenizer tokens = new StreamTokenizer(new FileReader(filename));
    int count = 0;
    while (tokens.nextToken() != StreamTokenizer.TT_EOF) {
        // count only word tokens that match exactly; quoted text has a different ttype
        if (tokens.ttype == StreamTokenizer.TT_WORD && tokens.sval.equals(targetWord)) {
            count++;
        }
    }
    return count;
}

What if we want to pick apart the text within a quote? The StreamTokenizer provides each piece of quoted text as a single String. (We could change the StreamTokenizer settings to make " an ordinary character, but then the quoted Strings wouldn't be treated normally. For example, "123" would be a number, not a String.)

Given a String, we can further examine it using some of the built-in methods of the String class.

boolean endsWith(String suffix)
boolean startsWith(String prefix)
int indexOf(String str)
int indexOf(String str, int fromIndex)
String substring(int beginIndex, int endIndex)
String toLowerCase()
String toUpperCase()
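
Here is a short sketch that exercises the methods listed above on one sample string. The class StringMethodsDemo, the helper baseName, and the sample text are all invented for illustration.

```java
class StringMethodsDemo {
    // Hypothetical helper: returns the part of a file name before its first dot,
    // or the whole name if it contains no dot.
    static String baseName(String fileName) {
        int dot = fileName.indexOf(".");
        return (dot == -1) ? fileName : fileName.substring(0, dot);
    }

    public static void main(String[] args) {
        String text = "Hello.JAVA";
        System.out.println(text.startsWith("Hello"));             // true
        System.out.println(text.toLowerCase().endsWith(".java")); // true
        System.out.println(text.indexOf("."));                    // 5
        System.out.println(text.indexOf("l", 3));                 // 3
        System.out.println(baseName(text));                       // Hello
        System.out.println(text.toUpperCase());                   // HELLO.JAVA
    }
}
```

Note that indexOf returns -1 when the string is not found, and that substring includes beginIndex but excludes endIndex.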

To find the number of occurrences of a targetWord in a given searchString, we can write:

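int index = 0;
while ((index = searchString.indexOf(targetWord, index)) != -1) {
    count++;                       // found one more occurrence
    index += targetWord.length();  // resume searching just past this match
}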

We could add this to the previous example when ttype == '"' to include quotes in our count.
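
Putting the two pieces together might look like the sketch below. It is one possible arrangement, not the only one: the class name CountWithQuotes and the helper countIn are hypothetical, the method takes a Reader instead of a filename so the example can run on a StringReader, and the guard against an empty target word is an added assumption to keep the loop from running forever.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;

class CountWithQuotes {
    // Hypothetical helper: counts non-overlapping occurrences of targetWord in searchString.
    static int countIn(String searchString, String targetWord) {
        if (targetWord.isEmpty()) {
            return 0; // assumption: an empty target never matches
        }
        int count = 0;
        int index = 0;
        while ((index = searchString.indexOf(targetWord, index)) != -1) {
            count++;
            index += targetWord.length(); // resume searching just past this match
        }
        return count;
    }

    // Counts word tokens equal to targetWord, plus occurrences inside "quoted" text.
    static int numberOfOccurrences(String targetWord, Reader input) throws IOException {
        StreamTokenizer tokens = new StreamTokenizer(input);
        int count = 0;
        while (tokens.nextToken() != StreamTokenizer.TT_EOF) {
            if (tokens.ttype == StreamTokenizer.TT_WORD && tokens.sval.equals(targetWord)) {
                count++;
            } else if (tokens.ttype == '"') {
                count += countIn(tokens.sval, targetWord); // sval holds the quoted text
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        String text = "the cat said \"cat and concat\" to the cat";
        System.out.println(numberOfOccurrences("cat", new StringReader(text))); // 4
    }
}
```

The result is 4 rather than 3 because indexOf looks for substrings, so the "cat" inside "concat" is counted too. Outside quotes, the tokenizer only matches whole words.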