GAWK AGAIN

Now that you have seen programming for the first time,
we might as well try to cement a few of the ideas with a
follow-on lab.  How many metaphors in that sentence?  GAK.

Gawk is particularly good at processing textual data.

1.  Click on gawkterm.html again and grab a NY Times newspaper
article -- in fact, select the whole page -- and copy it into
the right window.


2.  As a sanity check, use the statement

	print NF, length($0)

in the middle part (remember that each GAWK program has a begin, a
middle, and an end part, even though we usually use only the begin
or the middle part).

As you remember, ahem, this will print the number of words on
each line followed by the number of characters on each line.

Try it.
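
If you happen to have a command-line awk or gawk outside of gawkterm
(that is an assumption, the lab itself only needs the web page), here
is a small sketch of the same check.  The two input lines are made up
for illustration; they stand in for your pasted article.

```shell
# Two made-up lines stand in for the pasted article (not from the lab).
# Swap awk for gawk if that is what your machine calls it.
out=$(printf 'The economy grew\nBush spoke today about it\n' |
  awk '{ print NF, length($0) }')
echo "$out"
# prints:
# 3 16
# 5 25
```

The braces around the print are the middle part; gawkterm supplies
them for you, the command line does not.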

You remember how to change it so it prints the number of
characters in the last word and the number of characters in
the first word, right?

Show the TA you remember this.


3.  I am sure that one of your lines has the word "Republican" in
it.  If not, choose a word that occurs in the text in a few places,
such as "Bush" or "disaster" or "economy".

You can make gawk print just the lines containing that word by
putting in the middle part:

	if ($0 ~ /Bush/) print $0

and try it!
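
For the curious, a minimal sketch of that test run from a shell,
assuming a command-line awk or gawk is available.  The three input
lines are invented for the example.

```shell
# Three made-up lines; only the ones containing "Bush" should survive.
out=$(printf 'Bush spoke today\nThe economy grew\nThen Bush left\n' |
  awk '{ if ($0 ~ /Bush/) print $0 }')
echo "$out"
# prints:
# Bush spoke today
# Then Bush left
```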

You can try different combinations, such as

	if ($0 ~ /Bush/) print $0
	if ($0 ~ /economy/) print "$$$"
	if ($0 ~ /Republican/) print $1

What is the action being taken when "Republican" is found in a line?


4.  There is another way of making gawk do this.

You can have, instead of the normal middle part which is

	{
	}

a logically conditional middle part, such as

	(/Bush/) {
	  print $0
	}

Try it.

If you want to have two tests, you have two middle parts, each with
its own conditional:

	(/Bush/) { print $0 }
	(/economy/) { print "$$$" }

And you can make the third one if you want.  Oops, I mean, I want
you to do it.
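
Here is a sketch of the two conditional middle parts working together,
assuming a command-line awk or gawk and some made-up input lines.
Every middle part is tried against every line, so a line that
mentions both words sets off both actions.

```shell
# Made-up input: line 1 has both words, line 2 has neither, line 3 has one.
out=$(printf 'Bush spoke about the economy\nNothing to see here\nThe economy grew\n' |
  awk '
    /Bush/    { print $0 }
    /economy/ { print "$$$" }
  ')
echo "$out"
# prints:
# Bush spoke about the economy
# $$$
# $$$
```

The parentheses around the pattern are optional, which is why this
sketch leaves them off.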


5.  The third way to make gawk process your data is to use
the begin part instead of the middle part.  Erase all the
stuff in the middle.  In the begin part, which happens only
once (the middle part applies to each line), put:

	while (getline) print $1

Try this.  What did it do?

How about

	while (getline) print length($NF)

And while you are at it,

	while (getline) {
	  if ($0 ~ /Bush/) print $0
	}

See if you can add the two other conditionals and get it to work.
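
A sketch of the begin-part loop with one conditional, run from a
shell (assuming a command-line awk or gawk; the input lines are made
up).  It tests (getline) > 0 rather than plain getline.  That is my
variation, not the lab's: getline gives back 1 for a line, 0 at the
end of the data, and -1 on a read error, so the comparison also stops
the loop if something goes wrong.

```shell
# Made-up input lines; the whole program lives in the BEGIN part.
out=$(printf 'Bush spoke today\nThe economy grew\nThen Bush left\n' |
  awk 'BEGIN {
    # getline returns 1 per line read, 0 at end of input, -1 on error
    while ((getline) > 0) {
      if ($0 ~ /Bush/) print $0
    }
  }')
echo "$out"
# prints:
# Bush spoke today
# Then Bush left
```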


6.  Choose whichever of the three methods you like most.

Of course you didn't sign up for cs100 to learn all the various ways of
using gawk.  Perhaps you should have, but I understand no one told you
to, so you didn't know.  I am telling you that if you had signed up
just to learn the various ways of using gawk, I think you would be a
very clever person indeed.  Or a g-person, as I say, and I very much
like g-people.

What you really could learn right now, which anyone in the universe,
g-person or not, would agree is well worth your interest, is the
idea of a regular expression.  Oh, we'll run into regular expressions
in yet another lab, but here is a good time to start.

You like "Bush"?  Or you think it should be "Bush" or "bush"?  Suppose
you had some case insensitivity in mind.  Go ahead and put some uppercase
and lowercase versions of the same word in your data.

You could use

	if ($0 ~ /Bush/) print $0
	if ($0 ~ /bush/) print $0

Or you could say

	if ($0 ~ /[Bb]ush/) print $0

Or even

	if ($0 ~ /[Bb][aeiouy]sh/) print $0

which would give you variations on your vowel.

Try it.
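
To see what the brackets buy you, here is a sketch that feeds a few
made-up words, one per line, through the last pattern (assuming a
command-line awk or gawk).  Each bracket stands for exactly one
character chosen from the list inside it.

```shell
# Made-up words, one per line.
# BISH fails: capital I is not in [aeiouy], and SH is not sh.
# brush fails: the b is followed by r, not by a vowel.
out=$(printf 'Bush\nbush\nbash\nBISH\nbrush\n' |
  awk '{ if ($0 ~ /[Bb][aeiouy]sh/) print $0 }')
echo "$out"
# prints:
# Bush
# bush
# bash
```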


7.  Inferential leap.  Can you figure out how to write gawk
code which prints all of the lines, and only the lines, that have
a capital letter in the first word?

Ok, I'll help you.  It's something like

	if ($1 ~ /[FOO]/) print $0

but the FOO part ain't right.


8.  Your test says print any line that contains a capital letter
anywhere in the first word.  Capitalized words typically start with
their capital letter, but there might be words like "buSH" which are not
what you had in mind.  You could insist that the regular expression
(the pattern) match the beginning of the word by using the caret
symbol.  That's the "^".

	if ($1 ~ /^[A-Z]/) print $0

will do the job.  Prove it to your TA.  How?
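
So as not to spoil your proof, here is the caret at work on a
different pattern.  This is a sketch assuming a command-line awk or
gawk, with three made-up one-word lines.  Without the caret the
pattern may match anywhere in the word; with it, the match must start
at the first character.

```shell
# Three made-up one-word lines.
words='Bush\nambush\nbushel\n'

# No caret: "ush" preceded by B or b may sit anywhere in the word.
anywhere=$(printf "$words" | awk '{ if ($1 ~ /[Bb]ush/) print $1 }')

# Caret: the B or b must be the very first character of the word.
anchored=$(printf "$words" | awk '{ if ($1 ~ /^[Bb]ush/) print $1 }')

printf 'without caret:\n%s\n' "$anywhere"   # Bush, ambush, bushel
printf 'with caret:\n%s\n' "$anchored"      # Bush, bushel
```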

9.  You could print all of the properly capitalized words
by using a loop.

In this case, you probably want to use the begin part, and not
the middle part.  Start with this:

	while (getline) {
	  for (i=1; i<=NF; i=i+1)
	    if ($i ~ /^[A-Z]/) print $i
	}

Does that get all the words for you?
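
A sketch of that loop on two made-up lines, assuming a command-line
awk or gawk.  NF is the number of words on the current line, and $i
is word number i, so the for loop visits every word in turn.  As a
small variation of mine, the while test compares getline against 0 so
that a read error (-1) also ends the loop.

```shell
# Two made-up lines; the loop walks every word of every line.
out=$(printf 'The New York Times reported today\nbuSH is not Capitalized here\n' |
  awk 'BEGIN {
    while ((getline) > 0) {
      for (i = 1; i <= NF; i = i + 1)
        if ($i ~ /^[A-Z]/) print $i
    }
  }')
echo "$out"
# prints:
# The
# New
# York
# Times
# Capitalized
```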

Change it so it prints just the words starting with a capital vowel.
How about just the words starting with a capital consonant?

10.  You don't normally want to just print capitalized words.

Can you modify your program so that it prints just the words that
contain a number?  For example, 99, Agent99, and 99AgentsAreUnitedAs1
are all fair game.

If you can change your program so that it prints the SUM of all the
numbers in the data area, I would consider you done with the lab.