CS 527A Frequently Asked Questions (FAQs)




Homework 4 Questions

NOTE: For problem #4, you need only give one example of a transformation for which the optimal policy will not be changed.

For problem #4, can you say more about what is meant by an interesting transformation?

Let r(s,a) be the rewards. Some examples of transformation functions f are:
    f(r(s,a)) = r(s,a) + c for a constant c
    f(r(s,a)) = c * r(s,a) for a constant c
    f(r(s,a)) = log(r(s,a))
and so on. By interesting, I mean that you want the property that if r(s,a) < r(s,a') then f(r(s,a)) < f(r(s,a')). If a transformation does not have this property, then clearly the optimal policy will be changed (in general). You could also consider transformations that involve other rewards.
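Here is a small sketch of how you might test the ordering property on a handful of sample rewards. The function names and the constants 5.0, 3.0 and -2.0 are made up for illustration. Passing this check only says the ordering is preserved on the samples tried; it does not by itself show that the optimal policy is unchanged.

```c
#include <stddef.h>

/* Hypothetical example transformations of a single reward value. */
double add_constant(double r)   { return r + 5.0; }   /* f(r) = r + c        */
double scale_positive(double r) { return 3.0 * r; }   /* f(r) = c * r, c > 0 */
double scale_negative(double r) { return -2.0 * r; }  /* f(r) = c * r, c < 0 */

/* Returns 1 if, for every pair of the n sample rewards,
   r[i] < r[j] implies f(r[i]) < f(r[j]); returns 0 otherwise. */
int preserves_order(double (*f)(double), const double *r, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            if (r[i] < r[j] && !(f(r[i]) < f(r[j])))
                return 0;
    return 1;
}
```

With sample rewards such as 0.5, 1.0, 2.0 and 10.0, the first two transformations pass the check and the one with the negative constant fails it.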

I'm doing the pole balancing problem and would like some help with the physics needed to model the environment.

Here is everything you need for modeling the physics.
/*** Parameters for simulation ***/

#define GRAVITY 9.8
#define MASSCART 1.0
#define MASSPOLE 0.1
#define TOTAL_MASS (MASSPOLE + MASSCART)
#define LENGTH 0.5		  /* actually half the pole's length */
#define POLEMASS_LENGTH (MASSPOLE * LENGTH)
#define FORCE_MAG 10.0
#define TAU 0.02		  /* seconds between state updates */
#define FOURTHIRDS 1.3333333333333

  x is the cart's position (float)
  x_dot is the cart's velocity (float)
  theta is the pole's angle (float)
  theta_dot is the angular velocity of the pole (float)

Here is a function that updates the state variables to what they
would be TAU seconds later. It needs #include <math.h> and assumes that
x, x_dot, theta and theta_dot are float variables visible to the function
(for example, globals). The encoding of the action argument is only
illustrative; use whatever encoding your program has.

void update_state(int action)
{
    float xacc, thetaacc, force, costheta, sintheta, temp;

    /* Choose the force applied to the cart from the action. */
    if (action == 0)          /* stationary */
        force = 0;
    else if (action > 0)      /* moving forward */
        force = FORCE_MAG;
    else                      /* moving backwards */
        force = -FORCE_MAG;

    costheta = cos(theta);
    sintheta = sin(theta);

    temp = (force + POLEMASS_LENGTH * theta_dot * theta_dot * sintheta) / TOTAL_MASS;

    thetaacc = (GRAVITY * sintheta - costheta * temp)
               / (LENGTH * (FOURTHIRDS - MASSPOLE * costheta * costheta / TOTAL_MASS));

    xacc = temp - POLEMASS_LENGTH * thetaacc * costheta / TOTAL_MASS;

    /*** Update the four state variables, using Euler's method. ***/

    x += TAU * x_dot;
    x_dot += TAU * xacc;
    theta += TAU * theta_dot;
    theta_dot += TAU * thetaacc;
}

Homework 3 Questions

For the third part of Problem 6, how do I know which edges to include in the Bayesian belief network?

Note that you are supposed to give the Bayesian belief network that "represents the conditional independence assumptions of the naive Bayes classifier for the PlayTennis." You should give a Bayes net that

  • Keeps the same conditional probabilities as naive Bayes, and
  • Computes Prob(PlayTennis=yes) when given the value for each attribute exactly as naive Bayes would.

There is a Bayes net that will satisfy both of these. Use the example of Section 6.9.1 to give some guidance. Also, don't worry if the edges you put in the Bayes net do not really correspond to the causality that you would expect. If you satisfy the two conditions above, then you have given the Bayesian belief network that represents the conditional independence assumptions of the naive Bayes classifier.

Be sure to give the conditional probability table associated with the node Wind.

For the Bayesian network problem given in class on Wednesday (in HW 3 as problem 14), I'm getting answers that are close to those given in class but they are not quite right. What am I doing wrong?

The prior probabilities given in class for P(alarm), P(smoke) and P(leaving) were rounded to just four decimal places. For example, if you use .0267 for the prior probability that there is an alarm, you introduce round-off error that will cause your answers to be slightly off. The more often you reuse these rounded values, the worse the problem can become. Repeat your computation using the exact values. For example, the prior probability P(alarm) is really
P(a|f,t)P(f)P(t)+P(a|f,!t)P(f)P(!t)+P(a|!f,t)P(!f)P(t)+P(a|!f,!t)P(!f)P(!t)
    = .5*.01*.02 + .99*.01*.98 + .85*.99*.02 + .0001*.99*.98
    = .02672902
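To avoid rounding by hand, you can let the computer carry the full precision. The sketch below evaluates the expansion of P(alarm) given above; the function and variable names are made up for illustration, and the numbers are the ones in that expansion.

```c
/* Prior probability of the alarm, P(a), obtained by summing over the four
   joint settings of its parents f and t. */
double prior_alarm(void)
{
    const double p_f = 0.01;          /* P(f)       */
    const double p_t = 0.02;          /* P(t)       */
    const double p_a_f_t   = 0.5;     /* P(a|f,t)   */
    const double p_a_f_nt  = 0.99;    /* P(a|f,!t)  */
    const double p_a_nf_t  = 0.85;    /* P(a|!f,t)  */
    const double p_a_nf_nt = 0.0001;  /* P(a|!f,!t) */

    return p_a_f_t   * p_f         * p_t
         + p_a_f_nt  * p_f         * (1.0 - p_t)
         + p_a_nf_t  * (1.0 - p_f) * p_t
         + p_a_nf_nt * (1.0 - p_f) * (1.0 - p_t);
}
```

The result is about .02672902, which is what you should carry into the later computations instead of .0267.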

For the third part of Problem 7, is there a typo?

Yes. It should read "Give a distribution for P(h) and P(D|h) under which FindG is guaranteed to output a ML hypothesis but not a MAP hypothesis."

For Problem 10, I'm having some trouble compiling the provided code.

First you need to type make install (or on CEC type /pkg/gnu/bin/make install). You may need to edit the Makefile to change the line that begins with "CC =" to be
CC = /pkg/gnu/bin/gcc
Then to compile it use the command
/pkg/gnu/bin/make
Also, in svm_base.c you may need to change the call to sqrtf to sqrt.

The files README and INSTALL give additional guidance.

For Problem 10, where is the assignment that goes with the code?

Here's the assignment from Tom Mitchell to help guide you.


Homework 2 Questions

For problem #2, should it be o = w_0 + w_1 x_1 + w_1 x_1^2 + ... + w_n x_n + w_n x_n^2?

Yes.

I am trying to use the provided neural network code for face recognition (HW 2, problem 6) and it will not compile.

Here is some additional guidance. Edit Makefile and at the top add the line
CC = /pkg/gnu/bin/gcc
Then to compile it use the command
/pkg/gnu/bin/make
Finally, I have edited the training and testing data so it has the appropriate paths for you. Save trainset.zip into the same directory where you put the code and faces_4.tar before you used tar xvf faces_4.tar. Then use
unzip trainset
and you will have the training and testing sets ready. You should now be able to follow the directions given. Note that xv is found in pkg/X11/bin/xv. You should be able to use it by just typing xv followed by one of the images.

I am trying to use the provided decision tree code and it will not compile

Try /pkg/gnu/bin/make and contact Dr. Goldman if it does not work for you.

For Problem #8, how can we compute the standard deviation sigma_error_S(h) since we aren't given S?

Look at the top half of page 138 in the text. Notice that sigma_error_S(h) is really defined to be SQRT(p(1-p)/n) where p = error_D(h), so you can use it directly. Of course, you are not given p but rather a range of values. You can figure out a bound for n as a function of p, and you will then need to determine a value of n that is guaranteed to be good for any value of p in the given range.
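Here is a rough sketch of the last step, under the assumption that what you need is SQRT(p(1-p)/n) to be at most some target value for every p in the range. The function name min_samples and the parameter sigma_max are hypothetical; the actual target comes from the problem statement. The sketch uses the fact that p(1-p) is largest at the p in the range that is closest to 0.5.

```c
/* Smallest integer n such that p(1-p)/n <= sigma_max^2 for every p in
   [p_lo, p_hi], that is, SQRT(p(1-p)/n) <= sigma_max over the whole range.
   Assumes 0 <= p_lo <= p_hi <= 1 and sigma_max > 0. */
long min_samples(double p_lo, double p_hi, double sigma_max)
{
    double p_worst;                      /* p in the range maximizing p(1-p) */
    if (p_lo <= 0.5 && p_hi >= 0.5)
        p_worst = 0.5;
    else if (p_hi < 0.5)
        p_worst = p_hi;
    else
        p_worst = p_lo;

    double bound = p_worst * (1.0 - p_worst) / (sigma_max * sigma_max);
    long n = (long)bound;                /* truncates toward zero */
    if ((double)n < bound)
        n++;                             /* round up to the next integer */
    return n;
}
```

For example, with p anywhere in [0.2, 0.6] and a target of 0.03, the worst case is p = 0.5 and the function returns 278.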

For Problem #10 how many points is it worth?

10 points

For Problem #12(a) are the rectangles required to be axis-aligned?

Yes, that was the intention.

For Problem #12(c) are the polygons required to be convex?

Yes, that was the intention. If you solved it under the assumption that they need not be convex, then that is okay. However, I'd prefer you to consider the concept class of arbitrary convex polygons.

For Problem #14 what is meant by a monotone monomial on {0,1}^n?

A monotone monomial is a conjunction of non-negated variables. Saying it is on {0,1}^n means that each example is a bit vector of length n. In other words, there are n boolean variables.
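If it helps to see the definition concretely, here is a minimal sketch that evaluates a monotone monomial on one example. Packing the variables into the bits of a 32-bit word (so n is at most 32) and the function name are illustrative choices only.

```c
#include <stdint.h>

/* Evaluate a monotone monomial on an example from {0,1}^n, n <= 32.
   Bit i of `vars` is 1 if variable x_i appears in the conjunction.
   Bit i of `x` is the value of x_i in the example.
   The monomial is true exactly when every variable it contains is 1. */
int eval_monotone_monomial(uint32_t vars, uint32_t x)
{
    return (x & vars) == vars;
}
```

For instance, the monomial x_0 AND x_2 corresponds to vars = 0x5. It is true on the example 0x7 (x_0 = x_1 = x_2 = 1) and false on 0x6 (x_0 = 0).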

For Problem #17, is f(n) = (log n)/2 okay?

It does not satisfy the requirement of the problem. You are given that f(n) grows asymptotically more slowly than log n. The homework states this formally by saying that f(n) = o(log n). Note that this is "little-oh", NOT "big-oh". Let me formally define little-oh for those who have not seen it. We say that f(n) = o(g(n)) if lim_{n -> infinity} [f(n)/g(n)] = 0. In other words, f(n) grows asymptotically slower than g(n). So (log n)/2 != o(log n), since the limit as n goes to infinity of [(log n)/2] / (log n) is 1/2 (not 0).

For Problem #18 is there anywhere that I can find more information about proving mistake bounds using the weighted majority algorithm?

Look at Chapter 18 (starting at page 143) of the lecture notes from my 1991 offering of my computational learning theory course (in postscript or in pdf).


Lecture Related Questions (not directly tied to a homework problem)

Can you go over the ALVINN architecture?

Here is a description of the ALVINN architecture, which is a 960 x 4 x 30 network. (The same basic architecture was used for the face recognition network, but there were only 3 hidden units and 4 output units. That is, it is a 960 x 3 x 4 network.)

Each of the hidden units has a single real-valued output. When visualizing what the hidden units are doing, you can represent each hidden unit as a 30x32 grid of weights which correspond to the weights from the inputs to the hidden unit and by a vector of 30 weights which correspond to the weights to the output units. However, each hidden unit itself produces a single output. There are different options as to whether you directly use the dot product of the vectors w and x or if you threshold it (or in some other way guarantee that the outputs are between -1 and 1).

To help be sure this is clear, let's compute the total number of weights in the ALVINN system:


 # weights from input layer to hidden layer  = 960*4 = 3840
 # weights from hidden layer to output layer = 4*30  =  120
 # "w_0" weights (one per hidden and output unit)    =   34

So the total number of weights is 3840+120+34 = 3994.
What would happen if the hidden units were removed and instead the input and output layers were directly connected? Then you would need
   960*30 + 30 = 28,830 weights
since each input would be connected to each output (960*30) and there are 30 "w_0" weights. This is the value of the hidden units. As we talked about, the more hidden units you add, the more expensive the training, but you gain the ability to create more "intermediate" features, and hence if they are needed you can obtain better accuracy. So you want to have as few hidden units as you need to represent the target.
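The counting above can be written as a pair of small functions. This is just a sketch for checking the arithmetic; the function names are made up for illustration.

```c
/* Number of weights in a fully connected network with one hidden layer:
   input-to-hidden weights, hidden-to-output weights, and one "w_0" weight
   for each hidden unit and each output unit. */
long count_weights(long n_in, long n_hidden, long n_out)
{
    return n_in * n_hidden + n_hidden * n_out + (n_hidden + n_out);
}

/* Number of weights when the inputs connect directly to the outputs:
   one weight per input/output pair plus one "w_0" weight per output. */
long count_weights_direct(long n_in, long n_out)
{
    return n_in * n_out + n_out;
}
```

Here count_weights(960, 4, 30) gives 3994 and count_weights_direct(960, 30) gives 28830, matching the numbers above. The same function gives 2899 for the 960 x 3 x 4 face recognition network.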

As I mentioned, each weight will be initialized to a random value between -1 and 1 (or sometimes a smaller range like -.1 to .1 is used). Next class we will talk about how to adapt what we saw today for updating the weights for a single neuron to do the update for a full neural network.

Take a look at Figure 4.1 (on page 84), which shows the final weights for one of the hidden units of ALVINN, and Figure 4.10 (on page 113), which shows the weights for all three hidden units of the face recognition network after 1 iteration and then 100 iterations of training. (In Figure 4.10, they use the top left corner of the 30x32 weights from the input to hidden layer to show "w_0". For the weights from the hidden layer to the outputs, "w_0" is shown as the leftmost weight followed by the 3 weights to the output units.) I think looking at this will help you understand the role of the hidden units.


Homework 1 Questions

I am doing problem 6 (which is Problem 2.4 from the text). Am I to answer question (c) just for the particular set of training examples given?

No. For part (c) I want you to first describe an algorithm that given a set of labeled examples, outputs the sets S and G. Explain why this algorithm is correct. Then you are to describe (in general) a method to find a query that reduces the size of the version space as much as possible.

I am using the normalization factor of Z = x_0 + x_1 + ... + x_n but the weights are still getting larger and larger in absolute value.

You want to renormalize each x = (x_0, ..., x_n) so that ||x||=1. To do this you want to replace each x_i by x_i/ SQRT((x_0)^2 + (x_1)^2 + .... + (x_n)^2). Use this value for both updating the weights and making the predictions.

I'm implementing the game playing learning algorithm and the weights are getting bigger and bigger in absolute value. What is happening?

There is an error in the LMS rule given in the text book which is causing this. Assume that there are n attributes (and so n+1 weights). Let
Z = x_0 + x_1 + x_2 + ... + x_n where x_0=1. This is a normalization factor that is very important. The LMS rule in the text should be modified to be
      w_i = w_i + eta (V_train(b) - V_hat(b))  x_i / Z
The only change is that x_i in the formula given in the text is being replaced by x_i / Z. (That is, each x_i value is being divided by Z.)

Let me briefly explain why this should be done. The idea of the LMS rule is that after the update the value of V_hat(b) should be

      V_hat(b) + eta (V_train(b) - V_hat(b)).
So as an example, if eta = .1, V_hat(b)=10 and V_train(b)=100 then you want the update to modify the weights so that with the new weights
   V_hat(b) = 10 + .1 * 90 = 19.

Furthermore, the idea is to adjust the weights based on the relative values of x_i. That is why you want the x_i in the update rule. Without the normalization constant Z the problem with the weights getting too large occurs.

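However, with the adjustment notice that the sum over all i of x_i / Z = 1. When each x_i is 0 or 1, this guarantees that the total change in the value of V_hat is exactly eta (V_train(b) - V_hat(b)), and hence V_hat always moves closer to the correct value by a fraction eta of the error but can never overshoot it. This is what is wanted. (For features with other values, the change is eta (V_train(b) - V_hat(b)) times the sum over all i of x_i^2 / Z, which is why the answer above renormalizes x so that ||x|| = 1.)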
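Here is a minimal sketch of the modified LMS rule as stated above, with Z = x_0 + ... + x_n. The function names are hypothetical, and the code assumes x[0] is 1 and that Z is not zero.

```c
#include <stddef.h>

/* V_hat(b) = w_0 * x_0 + w_1 * x_1 + ... + w_n * x_n, with x[0] == 1. */
double v_hat(const double *w, const double *x, size_t len)
{
    double sum = 0.0;
    for (size_t i = 0; i < len; i++)
        sum += w[i] * x[i];
    return sum;
}

/* One step of the modified rule:
   w_i = w_i + eta (V_train(b) - V_hat(b)) x_i / Z, where Z = sum of x_i.
   Assumes Z != 0. */
void lms_update(double *w, const double *x, size_t len,
                double eta, double v_train)
{
    double z = 0.0;
    for (size_t i = 0; i < len; i++)
        z += x[i];

    /* Compute the error once, before any weight is changed. */
    double step = eta * (v_train - v_hat(w, x, len)) / z;
    for (size_t i = 0; i < len; i++)
        w[i] += step * x[i];
}
```

As a check against the example above, take the 0/1 features x = (1, 1, 0, 1) and weights w = (4, 3, 7, 3), so V_hat(b) = 10. One update with eta = .1 and V_train(b) = 100 changes the weights to (7, 6, 7, 6), and V_hat(b) becomes 19.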

Why is x_0 always 1?

Remember that
 V_hat = w_0 + w_1 * x_1 + ... + w_n * x_n 
where n is the number of features. Another way to write this is
V_hat = w_0 * 1 + w_1 * x_1 + ... + w_n * x_n
Here you can see that 1 fills the role of x_0.

Where can I submit my homework?

Either in class or in Professor Goldman's mailbox in Bryan 509 before class on the day it is due.

