CS363-U Lab 4

As always, if you can't figure out how to do something, raise your hand. Also, when you are done, call over the instructor and show what you have before logging off.

PERL v. GAWK

  1. Here is your chance to learn some Perl. If you care about writing readable Perl, you might want to write code that looks somewhat like gawk. (Perl people may now get angry at me, but we have to start somewhere.)

  2. Take a look at the m1, m2, m3, ..., m7 programs that we used in the first lecture of this course (http://www.cs.wustl.edu/~loui/363s04/1). Translate each of these to perl. This will take you about half an hour.

    Remember: semicolons are mandatory between statements; only the last statement before a closing brace may omit one, e.g. if (blah) { blah; blah } is ok. The bodies of if, for, and while always take braces in perl, even when they hold a single statement. Print statements can interpolate variables inside double quotes. Don't forget to use $ in front of your scalars, @ in front of arrays, and % in front of hashes. Split takes two arguments, in the reverse order of gawk (separator first, then the string), and returns the array. Concatenation is explicit, using the period. You have to open a file before you can get lines from it. And eq is the string comparator, not ==, which compares values numerically. Finally, k9 and wolf need "cgi" in the name of the file in order to execute it on the server.

    Remember that you can look things up under perl0 in my subman (e.g., http://www.cs.wustl.edu/~loui/513f03/subman.cgi?manterm=perl0&tagterm=split), and you can look at the examples in http://www.cs.wustl.edu/~loui/363s04/4/.
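
    The reminders in item 2 can be seen together in one short script. This is only a sketch for orientation, not a translation of any of the m1..m7 programs. The sample strings and variable names are made up for illustration, and the file is read from an in-memory string so that the script runs on its own; with a real file you would pass its name to open instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scalars take $, arrays take @, hashes take %.
my $line  = "alpha beta gamma";
my @words = split(/ /, $line);   # separator first, then the string; returns the list
my %seen;                        # a hash plays the role of a gawk associative array

# A single element of an array or hash is a scalar, so it is written with $.
foreach my $w (@words) {
    $seen{$w}++;
}

# Variables interpolate inside double quotes.
print "first word: $words[0]\n";

# Concatenation is explicit, using the period.
my $joined = $words[0] . "-" . $words[1];
print $joined . "\n";

# eq compares strings; == compares numbers.
# The last statement before a closing brace may omit its semicolon.
if ($words[2] eq "gamma") { print "string match\n" }
if ("1.0" == 1)           { print "numeric match\n" }

# A file must be opened before lines can be read from it.
# Assumption for this sketch: the "file" is an in-memory string.
my $text = "line one\nline two\n";
open(my $fh, '<', \$text) or die "cannot open: $!";
my $n = 0;
while (my $l = <$fh>) {
    chomp($l);                   # remove the trailing newline
    $n++;
    print "$n: $l\n";
}
close($fh);
```

    Note that "1.0" == 1 is true while "1.0" eq "1" is false, which is why the choice of comparator matters.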

  3. It is now time to have a perl-vs-gawk performance test.

    First, write a program in gawk which takes lines from stdin, then reports the total number of lines, total number of words, and total number of bytes. This is essentially the wc program. Then do the same in perl. Now show which is faster, by finding a very large file and timing each one. You might try repeating the task 100 times if your run time is too short to measure reliably.
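
    One possible shape for the perl side of this count is sketched below. It is a sketch under stated assumptions, not the required solution: count_stream is a hypothetical helper name, the demo reads from an in-memory string rather than stdin so that it runs on its own, and bytes are counted with length, which equals the byte count only when each character is one byte.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count_stream is a hypothetical helper name, not part of the lab.
# It reads every line from a filehandle and returns (lines, words, bytes).
sub count_stream {
    my ($fh) = @_;
    my ($lines, $words, $bytes) = (0, 0, 0);
    while (my $line = <$fh>) {
        $lines++;
        $bytes += length($line);         # includes the newline; assumes one byte per character
        my @fields = split(' ', $line);  # ' ' splits on runs of whitespace, like gawk's default
        $words += scalar(@fields);
    }
    return ($lines, $words, $bytes);
}

# Demo on an in-memory string; pass \*STDIN instead to read standard input.
my $sample = "this is not the entire file at all!\n"
           . "this completes the entire file!!!\n";
open(my $in, '<', \$sample) or die "cannot open in-memory file: $!";
my ($nl, $nw, $nb) = count_stream($in);
close($in);
print "$nl $nw $nb\n";
```

    For the timing itself, running each program under the shell's time command on the same large file is usually enough.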

    Second, write a gawk program which takes lines from a file, then reports the frequency of each word (after stripping all non-alpha characters), most frequent first.

    So if the input file is:

    this is not the entire file at all!
    this is not the entire file either!
    this is not the entire file yet!
    this completes the entire file!!!
    
    the output will be:
    4 this
    4 the
    4 file
    4 entire
    3 not
    3 is
    1 yet
    1 either
    1 completes
    1 at
    1 all
    
    And now do it in perl. Compare which is faster and by how much.
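
    For the perl version of the word count, a minimal sketch might look like the following. The helper names word_counts and format_counts are hypothetical, the demo uses the four-line sample above as an in-memory string, and the tie-breaking rule (reverse string order among equal counts) is an assumption inferred from the sample output rather than something the task states.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# word_counts (hypothetical name) returns a hash of word => frequency.
sub word_counts {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        foreach my $word (split(' ', $line)) {
            $word =~ s/[^A-Za-z]//g;   # strip all non-alpha characters
            next if $word eq "";       # skip tokens that were only punctuation
            $count{$word}++;
        }
    }
    return %count;
}

# format_counts (hypothetical name) returns "count word" strings, highest count
# first; ties are broken in reverse string order (assumed from the sample output).
sub format_counts {
    my (%count) = @_;
    my @keys = sort { $count{$b} <=> $count{$a} || $b cmp $a } keys %count;
    return map { "$count{$_} $_" } @keys;
}

# Demo on the four-line sample, held in memory so the sketch runs on its own.
my $sample = "this is not the entire file at all!\n"
           . "this is not the entire file either!\n"
           . "this is not the entire file yet!\n"
           . "this completes the entire file!!!\n";
open(my $in, '<', \$sample) or die "cannot open in-memory file: $!";
my @report = format_counts(word_counts($in));
close($in);
print "$_\n" foreach @report;
```

    Keeping the counting and the sorting in separate subroutines makes it easier to see which of the two dominates the run time on a large file.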

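  4. Finally, write the program again in gawk so that it keeps track of bigrams, which are successive pairs of characters. So the word "this" has bigrams "th", "hi", and "is". Note that my output counts every pair of adjacent characters in a line, including spaces and punctuation, not just letters; the lone "!" entry is presumably "!" followed by the end-of-line character. My output on the file above is as follows: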
    11 e 
    9 th
    8 s 
    7 is
    5 le
    5 he
    5  e
    4 ti
    4 t 
    4 re
    4 nt
    4 ir
    4 il
    4 hi
    4 fi
    4 en
    4 !
    4  t
    4  f
    3 ot
    3 no
    3  n
    3  i
    2 et
    2 !!
    2  a
    1 ye
    1 te
    1 t!
    1 r!
    1 pl
    1 om
    1 mp
    1 ll
    1 l!
    1 it
    1 es
    1 er
    1 ei
    1 e!
    1 co
    1 at
    1 al
    1  y
    1  c
    
    Now do it again in perl and compare the run times of the two versions on a large file.
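
    A minimal perl sketch of the bigram count follows. It assumes, based on the sample output, that bigrams are taken over every pair of adjacent characters within a line, with the trailing newline left in place; bigram_counts is a hypothetical helper name, and the demo again uses the four-line sample as an in-memory string. Only the five most frequent entries are printed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# bigram_counts (hypothetical name) returns a hash of bigram => frequency.
# Assumption: pairs are taken within each line, and the newline is kept.
sub bigram_counts {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        # Slide a two-character window across the line.
        for my $i (0 .. length($line) - 2) {
            $count{ substr($line, $i, 2) }++;
        }
    }
    return %count;
}

# Demo on the four-line sample, held in memory so the sketch runs on its own.
my $sample = "this is not the entire file at all!\n"
           . "this is not the entire file either!\n"
           . "this is not the entire file yet!\n"
           . "this completes the entire file!!!\n";
open(my $in, '<', \$sample) or die "cannot open in-memory file: $!";
my %bigrams = bigram_counts($in);
close($in);

# Highest count first; ties in reverse string order (assumed from the sample output).
my @order = sort { $bigrams{$b} <=> $bigrams{$a} || $b cmp $a } keys %bigrams;
foreach my $bg (@order[0 .. 4]) {
    print "$bigrams{$bg} $bg\n";
}
```

    If you prefer bigrams of letters only, strip the other characters from each line before the loop; the counts will then differ from the output shown above.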