CS 527A Homework 3
You are expected to complete 40 points worth of homework problems.
For those selecting 10 and 20 point problems, you must select problems
from at least two of the chapters. Also, no more than one paper
critique can be selected.
If you are doing a 20 or 40 point problem, be sure to attach the appropriate
cover sheet and review the guidelines given there and in
the course information handout.
If you are interested in doing a group project, talk to Dr. Goldman.
Due on Wednesday April 4th. Late penalty will not apply until April 11th.
(So basically, there is a one week extension.)
Cover Sheets:
- (10 pts) Read any of the chapters from the
lecture notes from my
computational learning theory class (except for Chapter 1).
If you would also like to work a problem related
to the chapter you read, just let me know and I can provide a
question that will be worth an additional 10 points.
- (20 pts) Use the provided
AdaBoost Applet to experiment with boosting. Try some different
things to help you understand how (and when) it works. Then complete
a 20 point project report.
- (40 pts) Do some experimentation with boosting. You can use the
decision tree algorithm code (from HW 1) or anything of your choice as
the base algorithm for boosting. Or you can use the naive Bayes
algorithm designed for text categorization (provided with the next
problem) as the base algorithm. A variation of
AdaBoost called BoosTexter which is designed for text categorization
can be found at
http://www.research.att.com/~schapire/BoosTexter/.
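As a starting point for the experimentation, here is a minimal sketch of discrete AdaBoost, assuming labels in {-1, +1}, one-dimensional inputs, and decision stumps as the base algorithm. The function names and the stump learner are illustrative choices, not part of any provided code; swap in the decision tree or naive Bayes learner as the base algorithm for the actual experiments.

```python
import math

def train_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best 1-D stump.

    A stump predicts `polarity` when x >= threshold and `-polarity` otherwise.
    """
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def stump_predict(x, threshold, polarity):
    return polarity if x >= threshold else -polarity

def adaboost(xs, ys, rounds):
    """Run `rounds` rounds of AdaBoost; return a list of (alpha, threshold, polarity)."""
    n = len(xs)
    weights = [1.0 / n] * n          # start from the uniform distribution
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)      # avoid log(0) or division by zero
        alpha = 0.5 * math.log((1 - err) / err)    # vote of this round's stump
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified examples, decrease the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]     # renormalize to a distribution
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1
```

For example, on xs = [1, 2, 3, 4, 5, 6] with ys = [1, 1, -1, -1, 1, 1], no single stump is consistent with the data, so it is a convenient toy case for watching how the weights and votes change from round to round.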
- (20 pts) Use the provided
SVM Applet to experiment with Support Vector Machines (and kernel
functions). Try some different things to help you understand how (and
when) it works.
Write a 20 point project report.
- (40 pts) Use the SVM light software
package to experiment with support vector machines. I recommend
that you try the first application discussed on the page under
"Getting Started: An Example Problem." which is a text classification
problem. Once you understand how it works try some other things out.
Part of this homework will be reading some of the provided material
and describing that in your report.
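When you move on to your own data, you will need to write it in the sparse text format that SVM light reads. The sketch below assumes the format described in the SVM light documentation, one example per line as `<target> <index>:<value> ...` with feature indices in increasing order; check that documentation before relying on it. The helper name `to_svmlight_line` is hypothetical and not part of the package.

```python
def to_svmlight_line(label, features):
    """Format one example as '<target> <index>:<value> ...'.

    `label` is an integer class (+1 or -1 for classification).
    `features` maps 1-based feature indices to numeric values.
    Zero-valued features are omitted and indices are written in
    increasing order (assumed to be what the sparse format expects).
    """
    parts = [f"{label:d}"]
    for index in sorted(features):
        value = features[index]
        if value != 0:
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)
```

A document with word 1 weighted 2.0 and word 3 weighted 0.5 in the positive class would be written as `to_svmlight_line(1, {3: 0.5, 1: 2.0})`.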
- (10 pts) In this problem you'll study different aspects related
to Bayesian Learning.
- Consider the example application of Bayes rule in Section 6.2.1 of
the text. Suppose the doctor decides to order a second laboratory
test for the sample patient, and suppose the second test returns
a positive result as well. What are the posterior probabilities of
cancer and !cancer following these two tests? Assume that the two
tests are independent.
- In the example of Section 6.2.1 we computed the posterior probability
of cancer by normalizing the quantities
P(+|cancer)*P(cancer) and P(+ | !cancer) * P(!cancer)
so that they summed to one. Use Bayes theorem
and the theorem of total probability (see Table 6.1) to prove that this
method is valid (i.e. that normalizing in this way yields the correct
value for P(cancer|+)).
- Draw the Bayesian belief network that represents the conditional
independence assumptions of the naive Bayes classifier for the PlayTennis
problem of Section 6.9.1. Give the conditional probability table
associated with the node Wind.
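The normalization step used in the Bayes rule questions above can be sketched numerically as follows. The hypothesis names and all probabilities here are hypothetical and deliberately different from those in Section 6.2.1; the sketch assumes the observations are conditionally independent given each hypothesis, so their likelihoods multiply.

```python
def normalized_posterior(prior, likelihoods):
    """Return P(h | observations) for every hypothesis h.

    prior[h] is P(h); likelihoods[h] is a list of P(observation | h),
    one entry per observation, assumed conditionally independent given h.
    """
    unnormalized = {}
    for h, p in prior.items():
        for likelihood in likelihoods[h]:
            p *= likelihood                 # P(h) * product of P(obs | h)
        unnormalized[h] = p
    total = sum(unnormalized.values())      # equals P(observations)
    return {h: p / total for h, p in unnormalized.items()}

# Hypothetical numbers: 1% prior, two positive results from a test with
# P(+ | disease) = 0.9 and P(+ | healthy) = 0.1.
prior = {"disease": 0.01, "healthy": 0.99}
likelihoods = {"disease": [0.9, 0.9], "healthy": [0.1, 0.1]}
result = normalized_posterior(prior, likelihoods)
```

Dividing by `total` is the normalization in question: by the theorem of total probability, `total` is the probability of the observations, so each ratio is a posterior.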
- (10 pts) Consider the concept learning algorithm FindG, which
outputs a maximally general consistent hypothesis (e.g. some
maximally general member of the version space).
- Give a distribution for P(h) and P(D|h) under which FindG is guaranteed
to output a MAP hypothesis.
- Give a distribution for P(h) and P(D|h) under which FindG is not guaranteed
to output a MAP hypothesis.
- Give a distribution for P(h) and P(D|h) under which FindG is guaranteed
to output an ML hypothesis but not a MAP hypothesis.
- (20 pts) In the analysis of concept learning in Section 6.3 we
assumed that the sequence of instances (x1, ...,
xm) was held fixed. Therefore, in deriving an expression
for P(D|h) we needed only consider the probability of observing the
sequence of target values (d1, ..., dm) for this
fixed instance sequence. Consider the more general setting in which
the instances are not held fixed, but drawn independently from some
probability distribution defined over the instance space X. The data
D must now be described as the set of ordered pairs
{(xi,di)}, and P(D|h) must now reflect the
probability of encountering the specific instance xi
as well as the probability of the observed target value di.
Show that Equation (6.5) holds even under this more general setting.
Hint: Consider the analysis of Section 6.5.
- (10 pts) Consider the Minimum Description Length (MDL) principle
applied to the hypothesis space H consisting of conjunctions of up to
n boolean attributes (i.e. monotone monomials). Assume that each
hypothesis is encoded simply by listing the attributes present in the
hypothesis, where the number of bits needed to encode any one of the
n boolean attributes is log2 n. Suppose the encoding of
an example given the hypothesis uses zero bits if the example is
consistent with the hypothesis and uses log2 m bits
otherwise (to indicate which of the m examples was misclassified--the
correct classification can be inferred to be the opposite of that
predicted by the hypothesis).
- Write down the expression for the quantity to be minimized
according to the MDL principle.
- Is it possible to construct a set of training data such that a
consistent hypothesis exists, but MDL chooses a less consistent
hypothesis? If so, give such a training set. If not, explain why not.
- Give probability distributions P(h) and P(D|h) such that the above
MDL algorithm outputs MAP hypotheses.
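To experiment with the encoding described in this problem, the sketch below computes a total code length under one reading of it: log2 n bits per attribute listed in the hypothesis plus log2 m bits per misclassified example. Representing a hypothesis as a tuple of attribute indices and an example as a (tuple of booleans, label) pair is an assumption made for illustration.

```python
import math

def description_length(hypothesis, examples, n):
    """Bits to encode the hypothesis plus the data given the hypothesis.

    hypothesis: tuple of attribute indices forming a conjunction
                (the empty tuple classifies everything as positive).
    examples:   list of (attributes, label), attributes a tuple of n booleans.
    n:          number of boolean attributes.
    """
    m = len(examples)
    errors = sum(
        1
        for attributes, label in examples
        if all(attributes[i] for i in hypothesis) != label
    )
    hypothesis_bits = len(hypothesis) * math.log2(n)
    data_bits = errors * math.log2(m)
    return hypothesis_bits + data_bits
```

Evaluating this for several candidate hypotheses on a small hand-built training set is a quick way to check your reasoning for the second question.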
- (40 pts) In this problem you will use some provided code to explore
how the naive Bayes learning algorithm can be applied to text
categorization. Here's the
provided code.
Here's the assignment from Tom Mitchell to help guide you.
After running the provided install program you will need
to edit the Makefile to modify the line that begins
with "CC =" to be
CC = /pkg/gnu/bin/gcc
Then to compile it use the command
/pkg/gnu/bin/make
Also, in svm_base.c you may need to change the call to
sqrtf to sqrt.
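Before working with the provided code, it may help to see the idea at small scale. The following is a minimal sketch of a naive Bayes text classifier, assuming a bag-of-words model with add-one smoothing of the word estimates. It is independent of the provided code, and the function names are illustrative only.

```python
import math
from collections import Counter

def train_naive_bayes(docs):
    """docs: list of (list_of_words, label).

    Returns (log_priors, log_likelihoods, vocabulary), where
    log_likelihoods[label][word] = log((count + 1) / (total + |vocabulary|)).
    """
    vocabulary = {w for words, _ in docs for w in words}
    label_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in label_counts}
    for words, label in docs:
        word_counts[label].update(words)
    log_priors = {label: math.log(c / len(docs))
                  for label, c in label_counts.items()}
    log_likelihoods = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        log_likelihoods[label] = {
            w: math.log((counts[w] + 1) / (total + len(vocabulary)))
            for w in vocabulary
        }
    return log_priors, log_likelihoods, vocabulary

def classify(model, words):
    """Return the label maximizing log P(label) + sum of log P(word | label).

    Words not seen during training are ignored.
    """
    log_priors, log_likelihoods, vocabulary = model

    def score(label):
        return log_priors[label] + sum(
            log_likelihoods[label][w] for w in words if w in vocabulary
        )

    return max(log_priors, key=score)
```

Working in log space avoids numerical underflow when documents contain many words, which matters once you move from toy inputs to real articles.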
- (10 pts) In this problem we look at instance-based learning.
- Derive the gradient descent rule for a distance-weighted local
linear approximation to the target function, given by Equation (8.7).
- Problem 8.2 from text.
- (10 pts) Suggest a lazy version of the eager decision tree learning
algorithm ID3. Be sure to give a very clear description of your lazy
algorithm. What are the advantages and disadvantages of your lazy
learning algorithm as compared to the original eager algorithm? I'm
expecting a well-thought-out discussion of this.
- (20 pts) Read one of the following papers and write a paper
critique. Please write the summary of the paper so that someone in
this class who has not read the paper would understand what it was
about at a high level and would understand one part at a deeper level.
- Robert Amar, Daniel Dooly, Sally Goldman and Qi Zhang (2001).
Multiple-Instance Learning of Real-Valued Data.
Submitted to ICML 2001 (International Conference on Machine Learning).
NOTE: I have funding for an undergraduate to work this summer on
a project that is a continuation of the work described in this paper.
Please talk to me about this if you are interested.
- W.L. Buntine, 1994.
Operations for learning with graphical models.
Journal of Artificial Intelligence Research, 2, 159-225.
In postscript or in
pdf.
NOTE: You can just read through the end of Section 3 (page 177).
- P. Domingos and M. Pazzani, 1997.
Beyond Independence: Conditions for the optimality of the simple
Bayesian classifier.
Machine Learning, 29, pages 103-130.
In postscript or in
pdf.
- K. Lang, 1995.
Newsweeder: Learning to filter netnews.
Proceedings of the 12th International Conference on Machine
Learning, pages 331-339.
This paper applies the MDL principle to learn which articles
a user will find interesting. I've had trouble finding an
active link to the article. I have a hardcopy in my office if
you are unable to get it electronically.
- You can read any of the papers on lazy learning found at
Special AI Review Issue on Lazy Learning.
- (10 pts) In this problem you will compute the posterior probabilities
based on a given Bayesian belief network and some partial observations.
Consider the Fire Alarm example from the following
applet except
remove the attribute "reporting." For each of the following three sets
of observations, show your computation for obtaining the posterior probabilities
of all variables.
- alarm = T
- smoke = T and alarm = T
- leaving = T
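One way to check hand computations like these is inference by enumeration: sum the joint probability over every assignment consistent with the evidence, then normalize. The sketch below uses a deliberately smaller network (fire is the only parent of alarm and of smoke) with hypothetical probabilities; neither the structure nor the numbers are taken from the applet.

```python
from itertools import product

# Hypothetical conditional probability tables (not the applet's values).
P_FIRE = 0.01
P_ALARM_GIVEN_FIRE = {True: 0.95, False: 0.02}   # P(alarm = T | fire)
P_SMOKE_GIVEN_FIRE = {True: 0.90, False: 0.01}   # P(smoke = T | fire)

def bernoulli(p_true, value):
    """Probability that a boolean variable takes `value`."""
    return p_true if value else 1.0 - p_true

def joint(fire, alarm, smoke):
    """Joint probability factored according to the network structure."""
    return (bernoulli(P_FIRE, fire)
            * bernoulli(P_ALARM_GIVEN_FIRE[fire], alarm)
            * bernoulli(P_SMOKE_GIVEN_FIRE[fire], smoke))

def posterior_by_enumeration(query, evidence):
    """Return P(query = True | evidence) by summing over all assignments."""
    names = ("fire", "alarm", "smoke")
    numerator = denominator = 0.0
    for values in product((True, False), repeat=len(names)):
        world = dict(zip(names, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue                      # inconsistent with the observations
        p = joint(**world)
        denominator += p
        if world[query]:
            numerator += p
    return numerator / denominator
```

For instance, `posterior_by_enumeration("fire", {"alarm": True, "smoke": True})` gives the posterior of fire under both observations in this toy network. Enumeration is exponential in the number of variables, which is acceptable for a network of this size.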
- CHOOSE YOUR OWN ADVENTURE. You can propose any additional
homework options (or variations of those given above) to Dr. Goldman.
If approved, a point value will be assigned. If you are interested in
doing a problem from HW 2 you can do that under this option. Just send
email to Dr. Goldman to confirm the particular problem you want to do
(from HW 2) is acceptable for HW 3.