CS 527A Homework 2
You are expected to complete 40 points worth of homework problems.
For those selecting 10 and 20 point problems, you must select problems
from at least two of the chapters. Also, no more than one paper
critique can be selected.
If you are doing a 20 or 40 point problem be sure to attach the appropriate
cover sheet and review the guidelines given there and in
the course information handout.
If you are interested in doing a group project talk to Dr. Goldman.
Due on Wednesday February 28th.
Cover Sheets:
- (20 pts) Implement the delta training rule for a two input linear
unit. Train it to fit the target concept -2 + x1 + 2 x2 > 0.
You can select each training example (x1,x2) where
x1 and x2 are selected uniformly from [0,10000] or pick
a different distribution if you want. Plot
the error E as a function of the number of training iterations. Plot
the decision surface after 5, 10, 50, 100, ..., iterations.
- Try normalizing x= (x1,x2) so ||x|| = 1
versus not doing any normalization. How does this affect performance?
- Try this using various constant values for eta (the learning rate) and
using a decaying learning rate of eta0/i for the ith iteration where
eta0=0.1. Which works better?
- Try incremental and batch learning. Which converges more quickly?
Consider both the number of weight updates and the total execution
time.
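As a starting point, the incremental delta rule for this unit might be sketched as follows. This is a minimal sketch, not a reference solution: the function names, the +1/-1 target encoding, the zero initial weights, and the small input range [0, 4] (one of the "different distributions" the problem allows, chosen here so that a constant eta of 0.01 stays stable) are all assumptions.

```python
import random

def target(x1, x2):
    """Target concept -2 + x1 + 2*x2 > 0, encoded as +1 / -1 (an assumed encoding)."""
    return 1.0 if -2 + x1 + 2 * x2 > 0 else -1.0

def squared_error(w, examples):
    """E = 1/2 * sum of (t - o)^2 for the linear unit o = w0 + w1*x1 + w2*x2."""
    return 0.5 * sum((t - (w[0] + w[1] * x1 + w[2] * x2)) ** 2
                     for (x1, x2), t in examples)

def train_delta_rule(examples, eta=0.01, epochs=50):
    """Incremental delta rule; returns final weights and E before/after each pass."""
    w = [0.0, 0.0, 0.0]                      # assumed zero initialization
    errors = [squared_error(w, examples)]    # error before any update
    for _ in range(epochs):
        for (x1, x2), t in examples:
            o = w[0] + w[1] * x1 + w[2] * x2
            step = eta * (t - o)             # delta rule: w_i += eta * (t - o) * x_i
            w[0] += step                     # x0 = 1 for the bias weight
            w[1] += step * x1
            w[2] += step * x2
        errors.append(squared_error(w, examples))
    return w, errors

rng = random.Random(0)
points = [(rng.uniform(0, 4), rng.uniform(0, 4)) for _ in range(200)]
examples = [(p, target(*p)) for p in points]
weights, errors = train_delta_rule(examples)
```

The errors list can be plotted against the pass number. With unnormalized inputs as large as 10000, a constant eta of this size would be expected to diverge, which is one reason the normalization and learning-rate questions above are worth trying.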
- (10 pts) Derive a gradient descent training rule for a single
unit with output o, where
o = w0 + w1 x1 + w1 x1^2 + ... + wn xn + wn xn^2.
What are the tradeoffs between using
this non-linear unit and using the standard
perceptron?
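For reference, the general recipe can be written out for the standard linear unit; the same steps apply to other units once the partial derivative of o is worked out. The notation (D for the training set, t_d and o_d for the target and output on example d, x_{id} for input i of example d, eta for the learning rate) is the usual textbook convention and is an assumption here.

```latex
E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2,
\qquad
\Delta w_i = -\eta \frac{\partial E}{\partial w_i}
           = \eta \sum_{d \in D} (t_d - o_d) \frac{\partial o_d}{\partial w_i}

% Linear unit: o_d = w_0 + \sum_i w_i x_{id}, so \partial o_d / \partial w_i = x_{id} and
\Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{id}
```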
- (10 pts) Recall the 8 x 3 x 8 network described in Figure 4.7 of
the text. Consider trying to train an 8 x 1 x 8 network for the same
task; that is, a network with just one hidden unit. Notice the 8
training examples could be represented by eight distinct values for
the single hidden unit (e.g. 0.1, 0.2, ... 0.8). Could a network with
just one hidden unit therefore learn the identity function defined
over these training examples? What would happen to the weight of the
hidden unit? See problem 4.9 of the text for some more direction.
- (10 pts) Consider the alternate error function shown in Problem
4.10. Derive the gradient descent update rule for this error function.
Show it can be implemented by multiplying each weight by some constant
before performing the standard gradient descent update.
- (10 pts) In this problem you will derive a gradient descent
algorithm to learn target concepts corresponding to rectangles in
the plane. See Problem 4.12 of the text for the details.
- (40 pts) Apply backpropagation to the task of face recognition.
Code for backpropagation, some images, and an
assignment/instructions document (in PostScript) from Tom Mitchell
are available to guide you; the document also describes the provided code. You
are expected to try at least one of the "extra credit" options from
this handout. If you work in a group then you would be expected to do
some significant work that falls under "extra credit." I have placed
the code and the quarter-size images on the web page (zipped up) in faces.zip to make it easier for you to copy
everything. Save this into your directory. Then to extract the
quarter-size images use tar xvf faces_4.tar. The
trainset data is on the course web page (zipped up) in trainset.zip. Full-size
images and additional image sources can all be found at
http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/faces.html
Finally, here is some additional guidance. You will probably need
to edit Makefile and at the top add the line
CC = /pkg/gnu/bin/gcc
Then to compile it use the command
/pkg/gnu/bin/make
- (20 points) Read one of the following papers and write a
paper critique.
- (10 points) In this problem you'll work with confidence intervals.
- Consider a learned hypothesis, h, for some boolean concept.
When h is tested on a set of 100 examples, it classified 83 correctly.
What are the standard deviation and the 95% confidence interval for
the true error rate Error_D(h)?
- You are about to test a hypothesis h whose
Error_D(h) is known to be
between 0.2 and 0.6. What is the minimum number of examples you
must collect to assure that the width of the two-sided 95%
confidence interval will be smaller than 0.1?
- Explain why the confidence interval estimate given in
Equation (5.17) applies to estimating the quantity in
(5.16), and not the quantity in Equation (5.14).
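The interval computation used in this problem can be sketched in a few lines. This is a minimal sketch assuming the usual normal approximation to the binomial (reasonable for roughly 30 or more test examples) and z = 1.96 for a two-sided 95% interval; the function name and the illustrative numbers are not from the problem.

```python
import math

def error_interval(n_wrong, n, z=1.96):
    """Sample error, its estimated standard deviation, and the z-based two-sided interval."""
    e = n_wrong / n                          # sample error error_S(h)
    sd = math.sqrt(e * (1 - e) / n)          # binomial standard deviation estimate
    return e, sd, (e - z * sd, e + z * sd)

# Illustrative numbers (not the ones in the problem): 12 errors on 40 test examples.
e, sd, (low, high) = error_interval(12, 40)
```

Because the half-width is z * sd, the width of the interval shrinks in proportion to 1/sqrt(n) for a fixed error rate.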
- (20 points) Read one of the following papers and write a
critique. I have copies of both of these papers available.
- (10 points) This problem will help you review probability computations relevant to the PAC model.
- Suppose you are a quality control inspector at a shirt
factory. The factory produces 10000 shirts a day. On a particular
day, the 10000 shirts are put in a big pile, and 500 of them are
missing a button, 200 are ripped, and 17 have only one sleeve. Some
shirts may have more than one defect.
- Suppose you randomly pick one shirt from the pile. What is the
probability that it has none of these defects?
- Suppose you pick a random shirt, and throw it back in the pile, and
you repeat this 50 times. What is the probability that you never see a shirt
with a missing button?
- Suppose you're shipwrecked on a desert island with a faulty
radio transmitter. Each time you try to broadcast an SOS message, there's
a probability of 3/4 that it won't be broadcast, and a probability of 1/4
that it will be (and the trials are independent).
Suppose that you want to successfully broadcast an SOS with probability
at least 1 - delta. As a function of delta, how many times do you
need to try to broadcast?
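Hand computations with independent trials like these are easy to sanity-check numerically. The sketch below compares the closed form (1 - p)^k for "the event never occurs in k independent trials" with a Monte Carlo estimate; the function names, the seed, and the values p = 0.1 and k = 10 are illustrative assumptions, not the numbers from the problems.

```python
import random

def prob_never(p, k):
    """Probability that an event of probability p never occurs in k independent trials."""
    return (1 - p) ** k

def simulate_never(p, k, runs=20000, seed=0):
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    never = sum(all(rng.random() >= p for _ in range(k)) for _ in range(runs))
    return never / runs

# Illustrative values (not taken from the problems): p = 0.1, k = 10.
exact = prob_never(0.1, 10)
estimate = simulate_never(0.1, 10)
```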
- (10 points)
We have seen in class that an algorithm that is capable
of finding a hypothesis consistent with a given set of labeled
examples can be turned into a PAC learning algorithm. Argue that the
converse is true: a PAC learning algorithm can be used (with high
probability) to find a hypothesis consistent with a given set of
labeled examples.
Hint: Show how to define an appropriate
probability distribution on the given set of examples, and how to set
epsilon and delta, so that the PAC learning algorithm returns a
hypothesis consistent with the given set of examples, with high
probability.
- (10 points) Consider the space of instances X corresponding to
all points in the x,y plane. Compute the VC dimensions for the following
hypothesis spaces and clearly explain why your answer is correct.
- Hr the set of all axis-aligned rectangles. Points on the boundary
or inside of the target rectangle are positive and the rest are negative.
Can you generalize your
answer and give the VC dimension of the class of axis-aligned boxes in d-dimensional
space? Give it a try.
- Hc the set of circles in the x,y plane.
- Hp the set of convex polygons in the x,y plane.
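When exploring candidate point sets for the rectangle class, a brute-force shattering check can help test a conjecture before writing the proof. The sketch below is an assumption-laden helper, not part of the assignment: it relies on the observation that a labeling is realizable by an axis-aligned rectangle exactly when the bounding box of the positive points contains no negative point, and it treats the all-negative labeling as realizable by a rectangle placed away from every point.

```python
from itertools import product

def realizable(points, labels):
    """True if some axis-aligned rectangle contains exactly the positively labeled points."""
    pos = [p for p, lab in zip(points, labels) if lab]
    if not pos:
        return True                          # a rectangle placed away from all points
    x_lo, x_hi = min(x for x, _ in pos), max(x for x, _ in pos)
    y_lo, y_hi = min(y for _, y in pos), max(y for _, y in pos)
    # The bounding box of the positives is the smallest candidate;
    # it must exclude every negative point.
    return not any(x_lo <= x <= x_hi and y_lo <= y <= y_hi
                   for (x, y), lab in zip(points, labels) if not lab)

def is_shattered(points):
    """True if every one of the 2^m labelings of the points is realizable."""
    return all(realizable(points, labels)
               for labels in product([False, True], repeat=len(points)))

collinear = [(0, 0), (1, 1), (2, 2)]         # illustrative point set
result = is_shattered(collinear)             # False: the middle point cannot be excluded
```

A check like this only tests particular point sets; showing an upper bound on the VC dimension still requires an argument that no set of the next size can be shattered.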
- (20 points) Write a consistent learner for the hypothesis space
of rectangles in the plane. Generate a variety of target concept
rectangles at random corresponding to different rectangles in the
plane. Generate random examples of each of these target concepts
based on a uniform distribution of instances within the rectangle from
<0,0> to <100,100>. Plot the average generalization error (over about
100 different target concepts) as a function of the number of training
examples, m. Along with each point, show a 95% confidence
interval. On the same graph plot the theoretical relationship between
epsilon and m for delta = 0.95. How closely do they match? Consider
generating random examples using some different distributions and
compare the results. Give an explanation for your findings.
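One common consistent learner for this class is the tightest-fit rectangle around the positive examples. The sketch below assumes axis-aligned rectangles and uses hypothetical names throughout (inside, fit_tightest_rectangle, estimate_error); the target rectangle, sample sizes, and seed are illustrative choices rather than requirements of the problem.

```python
import random

def inside(rect, point):
    """rect = (x_lo, y_lo, x_hi, y_hi); boundary points count as inside."""
    x_lo, y_lo, x_hi, y_hi = rect
    x, y = point
    return x_lo <= x <= x_hi and y_lo <= y <= y_hi

def fit_tightest_rectangle(examples):
    """Smallest axis-aligned rectangle containing all positives (None if there are none)."""
    pos = [p for p, label in examples if label]
    if not pos:
        return None                          # hypothesis that labels everything negative
    xs, ys = [x for x, _ in pos], [y for _, y in pos]
    return (min(xs), min(ys), max(xs), max(ys))

def predict(rect, point):
    return rect is not None and inside(rect, point)

def estimate_error(target, hypothesis, rng, n_test=5000):
    """Fraction of fresh uniform test points on which hypothesis and target disagree."""
    pts = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(n_test)]
    return sum(predict(hypothesis, p) != inside(target, p) for p in pts) / n_test

rng = random.Random(0)
target = (20.0, 30.0, 60.0, 80.0)            # one illustrative target rectangle
train = [(p, inside(target, p))
         for p in ((rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(200))]
h = fit_tightest_rectangle(train)
err = estimate_error(target, h, rng)
```

Repeating the last five lines over many random targets and several values of m gives the averages and confidence intervals the problem asks for.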
- (10 points) In this problem we consider the concept class of monomials.
- A monomial is monotone if it contains no negated
literals. Prove that the concept class of monotone monomials defined
on {0,1}^n has VC-dimension of precisely n.
- For Mn the class of (general) monomials defined on
{0,1}^n show that:
n <= VCD(Mn) <= n log_2 3.
- (20 points) A membership query is designed to model the ability
to experiment. For instance space X and any x in X, MQ(x) returns the
correct label for x. A monotone DNF formula is of the form
t1 v t2 v ... v tk where each term
ti is a conjunction of any subset of the n Boolean
variables x1, ..., xn where no negations can be
used. For this problem you are to give an algorithm to learn any
monotone DNF formula with k terms over n boolean attributes that uses
a polynomial number of MQs and makes at most k mistakes in the mistake
bound model of learning. In the worst case, how many MQs are made (as
a function of k and n)?
- (10 pts) In this problem we consider r-of-k threshold functions.
Here X = {0,1}^n. For a chosen
set of k (k <= n) variables and a given number r (1 <= r <= k),
an r-of-k threshold function is true if and only if
at least r of the k relevant variables have value 1.
Assuming that both r and k are unknown to the learner, show that
the class of r-of-k threshold functions can be learned in the
mistake-bound model using the halving algorithm.
What mistake bound
do you obtain?
Recall the Binomial Theorem: sum_{k=0}^{n} C(n,k) = 2^n.
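For intuition, the halving algorithm can be run on a tiny instance by enumerating the hypothesis class explicitly. The sketch below does this for n = 3; explicit enumeration is exponential in n and is only meant for experimentation, and the representation of a hypothesis as an (r, variables) pair and all function names are assumptions made for illustration.

```python
from itertools import combinations, product

def r_of_k_hypotheses(n):
    """All (r, variables) pairs: true iff at least r of the chosen variables are 1."""
    return [(r, chosen)
            for k in range(1, n + 1)
            for chosen in combinations(range(n), k)
            for r in range(1, k + 1)]

def predict(h, x):
    r, chosen = h
    return sum(x[i] for i in chosen) >= r

def halving(hypotheses, stream):
    """Predict with the majority of the version space, then drop inconsistent hypotheses."""
    version_space = list(hypotheses)
    mistakes = 0
    for x, label in stream:
        votes = sum(predict(h, x) for h in version_space)
        guess = 2 * votes >= len(version_space)   # ties broken toward True (arbitrary)
        if guess != label:
            mistakes += 1
        version_space = [h for h in version_space if predict(h, x) == label]
    return mistakes, version_space

H = r_of_k_hypotheses(3)
target = (2, (0, 1, 2))                      # "at least 2 of x0, x1, x2"
stream = [(x, predict(target, x)) for x in product([0, 1], repeat=3)]
mistakes, remaining = halving(H, stream)
```

Each mistake removes at least half of the version space, which is the fact the mistake bound is built on.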
- (10 points)
In this problem we consider a simple case of learning with queries
where the feedback can be erroneous. The learner and adversary agree
on a number n, and then the adversary thinks of a number between 1
and n, inclusive. The learner must find out which number the
adversary has selected by asking questions of the form, "Is your
number less than t?" for various t. A binary-search approach
allows you to ask at most log_2 n questions before finding the number.
To make this an interesting problem, suppose that the adversary is
allowed to incorrectly respond to at most one question. How
many questions must the learner now ask? (A bound of 3 log_2 n is
easy: the learner can just ask each question three times and take the
majority vote of the adversary's responses.) Give a learning
algorithm that uses a number of queries of the form
log_2 n + f(n)
where f(n) grows asymptotically more slowly than the logarithm
function (i.e. f(n) = o(log n)).
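The two baselines mentioned in the problem, plain binary search and the ask-three-times majority vote, can be sketched as follows. This is only the easy 3 log_2 n approach, not the improved algorithm the problem asks for; the function names and the adversary model (a closure that lies on one chosen query) are assumptions made for illustration.

```python
def find_number(n, less_than):
    """Binary search on 1..n using answers to 'Is your number less than t?'."""
    lo, hi = 1, n                            # invariant: the secret lies in [lo, hi]
    queries = 0
    while lo < hi:
        t = (lo + hi + 1) // 2
        queries += 1
        if less_than(t):
            hi = t - 1
        else:
            lo = t
    return lo, queries

def make_adversary(secret, lie_at=None):
    """Answers truthfully, except that the lie_at-th raw question (if any) is answered falsely."""
    count = 0
    def answer(t):
        nonlocal count
        count += 1
        truth = secret < t
        return (not truth) if count == lie_at else truth
    return answer

def majority_of_three(oracle):
    """Ask each question three times; a single lie cannot change the majority."""
    def wrapped(t):
        return sum(oracle(t) for _ in range(3)) >= 2
    return wrapped

n, secret = 100, 37
value, asked = find_number(n, majority_of_three(make_adversary(secret, lie_at=5)))
```

Here asked counts majority-voted questions, so the number of raw questions put to the adversary is three times that.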
- (10 points) Consider the hypothesis class H of "regular, depth-2
decision trees" over n Boolean variables. A "regular, depth-2 decision
tree" is a depth-2 decision tree (a tree with four leaves, all distance 2
from the root) in which the left child and right child of the root are
required to contain the same variable.
- As a function of n, how many syntactically distinct trees are there in H?
- Give an upper bound on the number of examples needed in the PAC model
to learn H with error epsilon and confidence delta.
- Consider the following Weighted-Majority algorithm for the class H.
You begin with all hypotheses in H assigned an initial weight equal to 1.
Every time you see a new example, you predict based on a weighted majority
vote over all hypotheses in H. Then instead of eliminating inconsistent trees,
you cut down their weight by a factor of 2. How many mistakes will the
procedure make in the worst case as a function of n and the number of mistakes
made by the best tree in H?
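The weighted-majority procedure described in the last part can be sketched generically, with the hypotheses passed in as a list of predictors. The sketch below uses a small hypothetical pool (one predictor per input bit and its negation) rather than the tree class H, and the function name, tie-breaking rule, and example stream are assumptions made for illustration.

```python
def weighted_majority(experts, stream, beta=0.5):
    """Weighted-majority voting; each wrong expert has its weight multiplied by beta."""
    weights = [1.0] * len(experts)
    mistakes = 0
    for x, label in stream:
        preds = [expert(x) for expert in experts]
        weight_true = sum(w for w, p in zip(weights, preds) if p)
        weight_false = sum(w for w, p in zip(weights, preds) if not p)
        guess = weight_true >= weight_false  # ties broken toward True (arbitrary)
        if guess != label:
            mistakes += 1
        # Penalize every expert that was wrong on this example.
        weights = [w * beta if p != label else w for w, p in zip(weights, preds)]
    return mistakes, weights

# Hypothetical pool: one expert per input bit, plus its negation (not the tree class H).
experts = [lambda x, i=i: bool(x[i]) for i in range(3)] + \
          [lambda x, i=i: not x[i] for i in range(3)]
stream = [((1, 0, 1), True), ((0, 0, 1), False), ((1, 1, 0), True),
          ((0, 1, 1), True), ((1, 0, 0), True), ((0, 1, 0), False)]
mistakes, weights = weighted_majority(experts, stream)
```

In this stream no expert is perfect, so every weight is cut at least once; comparing mistakes with the best expert's error count is a useful check on a derived bound.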
- (20 points) Read one of the following papers and write a
critique.
- Dana Angluin, Michael Frazier, and Lenny Pitt (1992).
Learning conjunctions of Horn clauses.
Machine Learning, 9, 147-164.
- A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth (1987).
Occam's razor.
Information Processing Letters, 24, 377-380.
- Michael Kearns (1993).
Efficient noise-tolerant learning from statistical queries.
In Proceedings of the 25th Annual ACM Symposium on Theory of
Computing (STOC '93), 392-401.
- Sally A. Goldman and Manfred Warmuth (1993).
Learning Binary Relations Using Weighted Majority Voting.
Proceedings of Sixth Annual ACM Conference on Computational Learning
Theory (COLT 1993). A longer version appears in
Machine Learning, 20(3):245-271, September 1995.
- CHOOSE YOUR OWN ADVENTURE. You can propose any additional homework options
(or variations of those given above) to Dr. Goldman. If approved a point value
will be given.