CS 527A Homework 2


You are expected to complete 40 points' worth of homework problems. If you select 10- and 20-point problems, you must choose problems from at least two of the chapters. No more than one paper critique may be selected.

If you are doing a 20- or 40-point problem, be sure to attach the appropriate cover sheet and review the guidelines given there and in the course information handout.

If you are interested in doing a group project, talk to Dr. Goldman.

Due on Wednesday February 28th.


Cover Sheets:


  1. (20 pts) Implement the delta training rule for a two-input linear unit. Train it to fit the target concept -2 + x1 + 2 x2 > 0. You can generate each training example (x1,x2) by drawing x1 and x2 uniformly from [0,10000], or pick a different distribution if you want. Plot the error E as a function of the number of training iterations. Plot the decision surface after 5, 10, 50, 100, ..., iterations.
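
As a starting point for problem 1, the incremental delta rule can be sketched roughly as follows. This is only a sketch under stated assumptions: the function names, the +1/-1 target encoding, the input range [-5, 5], the learning rate, and the epoch count are all illustrative choices, not part of the assignment. Plotting is left out.

```python
import random

def target(x1, x2):
    """Label from the target concept -2 + x1 + 2*x2 > 0, encoded as +1 / -1 (assumed encoding)."""
    return 1.0 if -2 + x1 + 2 * x2 > 0 else -1.0

def squared_error(w, data):
    """E = 1/2 * sum over examples of (t - o)^2 for the linear unit o = w0 + w1*x1 + w2*x2."""
    return 0.5 * sum((target(x1, x2) - (w[0] + w[1] * x1 + w[2] * x2)) ** 2
                     for x1, x2 in data)

def train_delta_rule(n_iterations=200, n_examples=100, eta=0.001, seed=0):
    """Incremental delta rule; returns final weights and E before/after each pass."""
    rng = random.Random(seed)
    # Assumption: inputs drawn uniformly from [-5, 5] instead of [0, 10000],
    # so that a fixed small learning rate stays stable.
    data = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(n_examples)]
    w = [0.0, 0.0, 0.0]                      # w0 (bias), w1, w2
    errors = [squared_error(w, data)]        # error before any training
    for _ in range(n_iterations):
        for x1, x2 in data:
            o = w[0] + w[1] * x1 + w[2] * x2          # unthresholded linear output
            step = eta * (target(x1, x2) - o)         # eta * (t - o)
            w[0] += step                              # bias input is the constant 1
            w[1] += step * x1
            w[2] += step * x2
        errors.append(squared_error(w, data))
    return w, errors

if __name__ == "__main__":
    weights, errors = train_delta_rule()
    print("weights:", weights)
    print("E before training:", errors[0], "E after training:", errors[-1])
```

The `errors` list holds one value per pass and can be handed to any plotting tool; the decision surface after a given pass is the line w0 + w1 x1 + w2 x2 = 0 for the weights at that point.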

  2. (10 pts) Derive a gradient descent training rule for a single unit with output o, where
    o = w0 + w1 x1 + w1 x1^2 + ... + wn xn + wn xn^2.
    What are the tradeoffs between using this non-linear unit and using the standard perceptron?
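
One possible way to set up the derivation for problem 2 is sketched below. It reads the unit's output as o = w0 + sum over i of wi (xi + xi^2) and assumes the usual squared-error criterion; the symbols E, D, t_d, and eta are assumptions borrowed from the standard delta-rule setting and are not given in the problem statement.

```latex
\begin{align*}
E(\vec{w}) &= \tfrac{1}{2} \sum_{d \in D} (t_d - o_d)^2, \qquad
o_d = w_0 + \sum_{i=1}^{n} w_i \,(x_{id} + x_{id}^2) \\
\frac{\partial E}{\partial w_i} &= -\sum_{d \in D} (t_d - o_d)\,(x_{id} + x_{id}^2) \quad (i \ge 1), \qquad
\frac{\partial E}{\partial w_0} = -\sum_{d \in D} (t_d - o_d) \\
\Delta w_i &= \eta \sum_{d \in D} (t_d - o_d)\,(x_{id} + x_{id}^2), \qquad
\Delta w_0 = \eta \sum_{d \in D} (t_d - o_d)
\end{align*}
```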

  3. (10 pts) Recall the 8 x 3 x 8 network described in Figure 4.7 of the text. Consider trying to train an 8 x 1 x 8 network for the same task; that is, a network with just one hidden unit. Notice that the 8 training examples could be represented by eight distinct values for the single hidden unit (e.g. 0.1, 0.2, ..., 0.8). Could a network with just one hidden unit therefore learn the identity function defined over these training examples? What would happen to the weight of the hidden unit? See Problem 4.9 of the text for some more direction.

  4. (10 pts) Consider the alternate error function shown in Problem 4.10. Derive the gradient descent update rule for this error function. Show that it can be implemented by multiplying each weight by some constant before performing the standard gradient descent update.

  5. (10 pts) In this problem you will derive a gradient descent algorithm to learn target concepts corresponding to rectangles in the plane. See Problem 4.12 of the text for the details.

  6. (40 pts) Apply backpropagation to the task of face recognition. Code for backpropagation, some images, and an assignment/instructions document (in PostScript) from Tom Mitchell are available; the document guides you through the task and also documents the provided code. You are expected to try at least one of the "extra credit" options from this handout. If you work in a group, you are expected to do some significant work that falls under "extra credit." I have placed the code and the quarter-size images on the web page (zipped up) in faces.zip to make it easier for you to copy everything. Save this into your directory. Then, to extract the quarter-size images, use tar xvf faces_4.tar. The trainset data is on the course web page (zipped up) in trainset.zip. Full-size images and additional image sources can all be found at http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/faces.html. Finally, here is some additional guidance. You will probably need to edit Makefile and add the following line at the top:
    CC = /pkg/gnu/bin/gcc
    
    Then to compile it use the command
    /pkg/gnu/bin/make
    

  7. (20 points) Read one of the following papers and write a paper critique.


  8. (10 points) In this problem you'll work with confidence intervals.

  9. (20 points) Read one of the following papers and write a critique. I have copies of both of these papers available.


  10. (10 points) This problem will help you review probability computations relevant to the PAC model.

  11. (10 points) We have seen in class that an algorithm that is capable of finding a hypothesis consistent with a given set of labeled examples can be turned into a PAC learning algorithm. Argue that the converse is true: a PAC learning algorithm can be used (with high probability) to find a hypothesis consistent with a given set of labeled examples.
    Hint: Show how to define an appropriate probability distribution on the given set of examples, and how to set epsilon and delta, so that the PAC learning algorithm returns a hypothesis consistent with the given set of examples, with high probability.

  12. (10 points) Consider the space of instances X corresponding to all points in the x,y plane. Compute the VC dimensions for the following hypothesis spaces and clearly explain why your answer is correct.

  13. (20 points) Write a consistent learner for the hypothesis space of rectangles in the plane. Generate a variety of target concepts at random, each corresponding to a different rectangle in the plane. Generate random examples of each of these target concepts based on a uniform distribution of instances within the rectangle from <0,0> to <100,100>. Plot the average generalization error (over about 100 different target concepts) as a function of the number of training examples, m. Along with each point, show a 95% confidence interval. On the same graph, plot the theoretical relationship between epsilon and m for delta = 0.95. How closely do they match? Consider generating random examples using some different distributions and compare the results. Give an explanation for your findings.
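
For problem 13, one common choice of consistent learner is the tightest axis-aligned rectangle around the positive examples. The sketch below assumes axis-aligned targets and uses illustrative names (`fit_rectangle`, `sample`, `estimate_error`) that do not come from the assignment; the averaging over target concepts, the confidence intervals, and the plots are left out.

```python
import random

def inside(rect, point):
    """True if point lies in rect = (x_min, y_min, x_max, y_max); None means the empty hypothesis."""
    if rect is None:
        return False
    x_min, y_min, x_max, y_max = rect
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max

def fit_rectangle(examples):
    """Tightest-fit rectangle around the positive examples (consistent with noise-free data)."""
    positives = [p for p, label in examples if label]
    if not positives:
        return None                      # no positives seen: predict negative everywhere
    xs = [x for x, _ in positives]
    ys = [y for _, y in positives]
    return (min(xs), min(ys), max(xs), max(ys))

def sample(target, m, rng):
    """Draw m labeled examples uniformly from the square <0,0> to <100,100>."""
    points = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(m)]
    return [(p, inside(target, p)) for p in points]

def estimate_error(hypothesis, target, rng, n_test=2000):
    """Monte Carlo estimate of the generalization error on fresh uniform examples."""
    test = sample(target, n_test, rng)
    return sum(inside(hypothesis, p) != label for p, label in test) / n_test

if __name__ == "__main__":
    rng = random.Random(0)
    target = (20.0, 30.0, 70.0, 80.0)    # hypothetical target rectangle
    for m in (10, 50, 200):
        h = fit_rectangle(sample(target, m, rng))
        print(m, estimate_error(h, target, rng))
```

Because the tightest-fit rectangle always lies inside the target, its only errors are positives it misses near the target's border.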

  14. (10 points) In this problem we consider the concept class of monomials.

  15. (20 points) A membership query is designed to model the ability to experiment. For instance space X and any x in X, MQ(x) returns the correct label for x. A monotone DNF formula is of the form t1 v t2 v ... v tk, where each term ti is a conjunction of any subset of the n Boolean variables x1, ..., xn and no negations can be used. For this problem you are to give an algorithm to learn any monotone DNF formula with k terms over n Boolean attributes that uses a polynomial number of MQs and makes at most k mistakes in the mistake-bound model of learning. In the worst case, how many MQs are made (as a function of k and n)?
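
One way to approach problem 15 is to shrink each misclassified positive example to a minimal positive example using membership queries, and add the resulting term to the hypothesis. The sketch below illustrates that idea only; it assumes noise-free labels, and the class and function names are hypothetical rather than taken from the course materials.

```python
from itertools import product

def make_monotone_dnf(terms):
    """Return a membership oracle MQ for the monotone DNF whose terms are sets of variable indices."""
    terms = [frozenset(t) for t in terms]
    def mq(x):
        ones = {i for i, bit in enumerate(x) if bit}
        return any(t <= ones for t in terms)
    return mq

class MonotoneDNFLearner:
    def __init__(self, mq):
        self.mq = mq
        self.terms = []        # hypothesis: a disjunction of learned terms
        self.queries = 0       # membership queries issued so far
        self.mistakes = 0      # prediction mistakes made so far

    def predict(self, x):
        ones = {i for i, bit in enumerate(x) if bit}
        return any(t <= ones for t in self.terms)

    def update(self, x, label):
        """Process one labeled example; on a mistake, learn one new term."""
        if self.predict(x) == label:
            return
        self.mistakes += 1
        # Every learned term implies the target, so with noise-free labels
        # a mistake can only be a positive example predicted negative.
        x = list(x)
        for i in range(len(x)):
            if x[i]:
                x[i] = 0                   # try dropping variable i
                self.queries += 1
                if not self.mq(x):
                    x[i] = 1               # variable i is needed; restore it
        self.terms.append(frozenset(i for i, bit in enumerate(x) if bit))

def train_on_all_inputs(learner, mq, n):
    """Feed every point of {0,1}^n to the learner once, labeled by the oracle."""
    for x in product([0, 1], repeat=n):
        learner.update(x, mq(x))

if __name__ == "__main__":
    mq = make_monotone_dnf([{0, 1}, {2}])  # hypothetical target: (x1 and x2) or x3
    learner = MonotoneDNFLearner(mq)
    train_on_all_inputs(learner, mq, 4)
    print(learner.terms, learner.mistakes, learner.queries)
```

Each mistake in this sketch issues at most one query per variable that is set to 1 in the example, which is the quantity the last question of the problem asks you to bound.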

  16. (10 pts) In this problem we consider r-of-k threshold functions. Here X = {0,1}^n. For a chosen set of k (k <= n) variables and a given number r (1 <= r <= k), an r-of-k threshold function is true if and only if at least r of the k relevant variables have value 1. Assuming that both r and k are unknown to the learner, show that the class of r-of-k threshold functions can be learned in the mistake-bound model using the halving algorithm.
    What mistake bound do you obtain?
    Recall the Binomial Theorem: sum_{k=0}^{n} C(n,k) = 2^n.
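
To make the halving algorithm in problem 16 concrete, the sketch below enumerates the r-of-k class for a small n and predicts by majority vote of the remaining hypotheses. Brute-force enumeration is an assumption made purely for illustration (it is exponential in n), and the function names are hypothetical.

```python
from itertools import combinations, product

def r_of_k_class(n):
    """Enumerate every hypothesis (r, relevant_variables) over n Boolean variables."""
    return [(r, subset)
            for k in range(1, n + 1)
            for subset in combinations(range(n), k)
            for r in range(1, k + 1)]

def evaluate(h, x):
    """True iff at least r of the relevant variables are 1 in x."""
    r, subset = h
    return sum(x[i] for i in subset) >= r

def halving(n, target, examples=None):
    """Run the halving algorithm; return (number of mistakes, final version space)."""
    if examples is None:
        examples = product([0, 1], repeat=n)     # default: every input once
    version_space = r_of_k_class(n)
    mistakes = 0
    for x in examples:
        votes = sum(evaluate(h, x) for h in version_space)
        prediction = 2 * votes >= len(version_space)   # majority vote, ties go to True
        label = evaluate(target, x)
        if prediction != label:
            mistakes += 1            # at least half of the version space was wrong
        version_space = [h for h in version_space if evaluate(h, x) == label]
    return mistakes, version_space

if __name__ == "__main__":
    mistakes, remaining = halving(4, (2, (0, 1, 3)))   # hypothetical 2-of-3 target
    print(mistakes, remaining)
```

Every mistake removes at least half of the surviving hypotheses, so the mistake count is tied to the size of the class, which is where the Binomial Theorem hint becomes useful.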

  17. (10 points) In this problem we consider a simple case of learning with queries where the feedback can be erroneous. The learner and adversary agree on a number n, and then the adversary thinks of a number between 1 and n, inclusive. The learner must find out which number the adversary has selected by asking questions of the form "Is your number less than t?" for various t. A binary-search approach finds the number after asking at most log_2 n questions. To make this an interesting problem, suppose that the adversary is allowed to respond incorrectly to at most one question. How many questions must the learner now ask? (A bound of 3 log_2 n is easy: the learner can just ask each question three times and take the majority vote of the adversary's responses.) Give a learning algorithm that uses a number of queries of the form
    log_2 n + f(n), where f(n) grows asymptotically more slowly than the logarithm function (i.e., f(n) = o(log n)).
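
The easy baseline mentioned in problem 17 (ask every question three times and take the majority) can be sketched as follows. This is only the 3 log_2 n baseline, not the improved algorithm the problem asks for; the `Adversary` class and its `lie_on` parameter are hypothetical devices for simulating a single wrong answer.

```python
class Adversary:
    """Holds a secret number and may answer exactly one question incorrectly."""
    def __init__(self, secret, lie_on=None):
        self.secret = secret
        self.lie_on = lie_on      # 1-based index of the question to lie on, or None
        self.asked = 0

    def less_than(self, t):
        """Answer 'Is your number less than t?', lying only on question number lie_on."""
        self.asked += 1
        truth = self.secret < t
        return (not truth) if self.asked == self.lie_on else truth

def find_number(n, adversary):
    """Binary search on [1, n], repeating each question three times."""
    lo, hi = 1, n                                  # invariant: lo <= secret <= hi
    while lo < hi:
        t = (lo + hi + 1) // 2
        answers = [adversary.less_than(t) for _ in range(3)]
        if sum(answers) >= 2:                      # majority says secret < t
            hi = t - 1
        else:                                      # majority says secret >= t
            lo = t
    return lo

if __name__ == "__main__":
    adversary = Adversary(secret=37, lie_on=5)
    print(find_number(100, adversary), adversary.asked)
```

Since at most one of the three repeated answers can be wrong, the majority is always truthful, at the cost of tripling the number of questions.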

  18. (10 points) Consider the hypothesis class H of "regular, depth-2 decision trees" over n Boolean variables. A "regular, depth-2 decision tree" is a depth-2 decision tree (a tree with four leaves, all distance 2 from the root) in which the left child and right child of the root are required to contain the same variable.

  19. (20 points) Read one of the following papers and write a critique.


  20. CHOOSE YOUR OWN ADVENTURE. You can propose any additional homework options (or variations of those given above) to Dr. Goldman. If an option is approved, a point value will be assigned.