CS313 Fall 2004
LAB 1

REGULAR EXPRESSIONS

I would like you to remember a few small things in this lab.

First, remember how regular expressions work:  .* stands for a run
of any characters (possibly empty), so a.*b is a run of characters
beginning with a and ending with b.  In fact, it is the longest run that
matches the pattern, among those that begin in the same spot.  [a-z]
matches a lowercase letter.  [^a-z] matches any character that is not a
lowercase letter.  So [a-z]+[^a-z]+[a-z]* matches the first, longest run
of lowercase letters followed by characters that are not lowercase
letters, followed possibly by more lowercase letters.  The + is an
extended operator, so use egrep for that one.
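
To see the "longest run" rule in action, here is a small sketch.  The
sample strings are made up, and the -o option (print only the part that
matched) is assumed to be available, as it is in GNU grep.

```shell
# Sample line, made up for illustration.
line='xxaYYbZZbqq'

# -o prints only the matched text; a.*b runs to the LAST b on the line.
greedy=$(printf '%s\n' "$line" | grep -o 'a.*b')
echo "$greedy"    # aYYbZZb

# + is an extended-regex operator, so use grep -E (the same as egrep).
run=$(printf '%s\n' 'abc123def456' | grep -oE '[a-z]+[^a-z]+[a-z]*')
echo "$run"       # abc123def
```

The second match stops at "def" because the trailing [a-z]* cannot absorb
the digits that follow.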

Next, remember how grep works.  To grep (or egrep) for a pattern in a file,
say "grep pattern file".  To filter the matching lines further, pipe them
into another grep:  "grep pattern1 file | grep pattern2".

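As a rough illustration of chaining greps and counting the survivors with
wc, the sketch below builds a tiny sample file.  Its three lines are
invented and are not in the format of the course log.

```shell
# Made-up sample data standing in for a real file.
tmp=$(mktemp)
printf '%s\n' 'GET /index.html 200' \
              'GET /missing.html 404' \
              'POST /form.asp 200' > "$tmp"

# Lines containing GET, then only those that also contain 200.
both=$(grep 'GET' "$tmp" | grep '200')
echo "$both"      # GET /index.html 200

# wc -l counts the lines that pass both filters; tr strips the
# padding that some versions of wc put in front of the number.
count=$(grep 'GET' "$tmp" | grep '200' | wc -l | tr -d ' ')
echo "$count"     # 1

rm -f "$tmp"
```
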
Finally, if you want to do stream substitution, use "sed -e".  For example,
to remove all digits from every line that contains ".com", say:
"grep '\.com' file | sed -e 's/[0-9]*//g' | less".  Don't forget the less
if you think the output will be long.  Use " > outfile" instead of
" | less" if you want the results in a file.  You could also say
" | tee outfile | less", which achieves both at the same time.
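
The same pipeline can be tried on a couple of invented lines.  In this
sketch the host names are made up, and the output is captured in a
variable instead of being paged through less, so that it can be compared
with the copy that tee saves.

```shell
tmp=$(mktemp)
out=$(mktemp)
printf '%s\n' 'host42.example.com 17 hits' \
              'host7.example.org 3 hits' > "$tmp"

# Keep only lines containing a literal ".com", delete every digit, and
# let tee save a copy in $out while the text also continues downstream.
result=$(grep '\.com' "$tmp" | sed -e 's/[0-9]*//g' | tee "$out")
echo "$result"    # host.example.com  hits

# The file written by tee holds the same text.
saved=$(cat "$out")

rm -f "$tmp" "$out"
```

The backslash in '\.com' matters: an unescaped dot would match any
character, so "xcom" would be accepted as well.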

If you want to do accumulations in gawk, do something like:
"gawk '{ count[$1]++ } END { for (i in count) print count[i], i }' file"
which counts how many times each word occurs as the first word of a line
in the file, and prints each count followed by its word.
This is a marvelous thing to use before piping to "sort -nr".
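
A minimal sketch of the count-then-sort idiom, fed with made-up input on
stdin.  The AWK variable is only a convenience assumed here: it picks
gawk when it is installed and falls back to plain awk, which accepts the
same program.

```shell
# Use gawk if present, otherwise awk.
AWK=$(command -v gawk || command -v awk)

# Count how often each first word appears, then sort by count, largest first.
counts=$(printf '%s\n' 'apple pie' 'pear tart' 'apple cake' \
                       'apple jam' 'pear jam' |
  "$AWK" '{ count[$1]++ } END { for (i in count) print count[i], i }' |
  sort -nr)
echo "$counts"
# 3 apple
# 2 pear
```

The order in which "for (i in count)" visits the array is not specified,
which is why the sort is needed to get a ranked list.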

If you get stuck, don't forget that "man gawk" contains a section
on regular expressions.

THE EXERCISE

Download a copy of ex000519.log from the course website.  This is just a
piece of a large webserver log.  

0.  To begin, can you tell me the times of the first and last
events logged in this file?

1.  Using just grep and wc, tell me how many requests in this log
resulted from a referral from doubleclick.net.

2.  Using just grep and wc, tell me how many requests in this log
resulted from a txtSearch at fff.com using [Rr]esults.asp.

3.  Using just grep and wc, tell me how many requests in this log
came from a machine in the range xxx.xxx.xxx.yyy, where xxx is
any number and yyy is between 0 and 199.  

4.  Using a combination of grep and sed, give me a list of all
the search terms used on the [Rr]esults.asp page.

5.  Using a combination of grep and sed, give me a list of all
the session IDs that show up in the logs.  I don't want the
attribute name=value pair; I want just the value.

6.  Do the same thing for NGUserID.

7.  Using a combination of grep and sed, give me a list of all
the client IP addresses and the times of their requests.

8.  Using gawk in addition to #7, tell me the top ten client IP
addresses ranked in order of the number of total server requests.

9.  Using gawk in addition to #6, give me a list of all the NGUserID's
that made more than 200 requests to the server.

10.  Using gawk again, give me a report saying how many bytes each
NGUserID requested, sorted in decreasing order of the total number of
bytes.