CS 527A Homework 4
You are expected to complete 40 points worth of homework problems.
For those selecting 10 and 20 point problems, you must select problems
from at least two of the chapters. Also, no more than one paper
critique can be selected.
If you are doing a 20 or 40 point problem, be sure to attach the appropriate
cover sheet and review the guidelines given there and in
the course information handout.
If you are interested in doing a group project, talk to Dr. Goldman.
Due on Wednesday May 2nd. Note that the first 3 problems also appeared
on HW 3. Obviously you cannot pick a problem (or paper reading option)
to do on HW 4 that you did as part of HW 3.
Cover Sheets:
- (10 pts) In this problem we look at instance-based learning.
- Derive the gradient descent rule for a distance-weighted local
linear approximation to the target function, given by Equation (8.7).
- Problem 8.2 from text.
- (10 pts) Suggest a lazy version of the eager decision tree learning
algorithm ID3. Be sure to give a very clear description of your lazy
algorithm. What are the advantages and disadvantages of your lazy
learning algorithm as compared to the original eager algorithm? I'm
expecting a well-thought-out discussion on this.
- (10 pts) In this problem you will compute the posterior probabilities
based on a given Bayesian belief network and some partial observations.
Consider the Fire Alarm example from the following applet, but with the
attribute "reporting" removed. For each of the following three sets
of observations, show your computation for obtaining the posterior probabilities
of all variables.
- alarm = T
- smoke = T and alarm = T
- leaving = T
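As a reminder of the mechanics only, posterior computation by enumeration can be sketched on a two-node fragment. This is a minimal illustration, assuming a single edge from fire to alarm; the probabilities below are invented for the example and are not the applet's values, and the function names are hypothetical.

```python
# Hypothetical numbers for illustration only; NOT the applet's tables.
P_FIRE = 0.01                               # prior P(fire = T)
P_ALARM_GIVEN = {True: 0.90, False: 0.05}   # P(alarm = T | fire)

def joint(fire, alarm):
    """P(fire, alarm) from the chain rule P(fire) * P(alarm | fire)."""
    p_fire = P_FIRE if fire else 1.0 - P_FIRE
    p_alarm = P_ALARM_GIVEN[fire] if alarm else 1.0 - P_ALARM_GIVEN[fire]
    return p_fire * p_alarm

def posterior_fire(alarm):
    """P(fire = T | alarm) by summing the joint over the unobserved variable."""
    numerator = joint(True, alarm)
    evidence = sum(joint(f, alarm) for f in (True, False))
    return numerator / evidence
```

The full network needs the same two steps (multiply the conditional tables along the graph, then normalize by the probability of the evidence), only with more variables to sum over.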
- (10 pts) Here we look at some topics in the area of Reinforcement Learning.
- Describe two interesting transformations you could apply to the reward function
which would not change the policy learned by the Q-learning algorithm. Prove that the policy
is not changed. Give 1 transformation you could apply to the reward function that
would change the policy.
- Consider the following reward structure:
where the discount rate is 0.99. What policy will the Q-learning
algorithm select here? Is that policy the policy that you feel is
best? Carefully think about this question and discuss your answer and
what it reveals about Q-learning. Think about what reward values might
correspond to in the real world.
- (10 pts) Consider the Recycling Robot. This mobile robot has the job of collecting
empty soda cans in an office environment. It has sensors for detecting cans, and
an arm and gripper that can pick them up and place them in an onboard bin; it runs
on a rechargeable battery. The robot's control system has components for
interpreting sensory information, for navigating, and for controlling the arm and
gripper. High-level decisions can be made by a reinforcement learning agent
based on the current charge of the battery. This agent has to decide whether the robot
should (1) actively search for a can for a certain period of time, (2) remain
stationary and wait for someone to bring it a can, or (3) head back to its home
base to recharge its battery. The rewards might be zero most of the time, but then become
positive when the robot secures an empty can, or large and negative if the battery
runs all the way down.
In this problem you are to create a finite MDP for modelling this problem. Clearly
state any assumptions you make to do this. For example, you may want to assume that the
best way to find cans is to actively search for them, but this runs down the robot's
battery, whereas waiting does not. You can also assume that the agent makes its
decisions solely as a function of the energy level of the battery.
- (10 pts) Give the Bellman equation for Q* for the recycling robot described in problem 5.
Also calculate the optimal policy.
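For reference, the generic Bellman optimality equation for Q* in a finite MDP is shown below. The symbols are assumptions of this sketch, not notation fixed by the assignment: P(s' | s, a) is the transition probability, R(s, a, s') the expected reward for that transition, and gamma the discount rate. Specializing it to the recycling robot means substituting the states, actions, and probabilities of your own MDP.

```latex
Q^*(s,a) \;=\; \sum_{s'} P(s' \mid s,a)\,
  \Bigl[ R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \Bigr]
```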
- (20 pts) Problem 13.2 from the text.
- (20 pts) Problem 13.4 from the text.
- (20 pts) Problem 13.3 from the text. If you also implement your
design then this is worth 40 points.
- (40 pts) Implement a reinforcement learning algorithm for the Mountain-Car
task. Consider the task of driving an underpowered car up a steep mountain
road as illustrated in:

The difficulty is that gravity is stronger than the car's engine
and even at full throttle the car cannot accelerate up the steep
slope. The only solution is to first move away from the goal and up the
opposite slope on the left. This is a simple example of a
continuous control task where things have to get worse in a sense
(farther from the goal) before they can get better. The actions
available to the car are full throttle forward (+1), full throttle
reverse (-1) and zero throttle (0). The car moves according to a
simplified physics. Its position $x_t$ and velocity
$v_t$ are updated by:
- $x_{t+1} = \mathrm{bound}[x_t + v_{t+1}]$
- $v_{t+1} = \mathrm{bound}[v_t + 0.001\,a_t - 0.0025\cos(3x_t)]$
where the bound operation enforces $-1.2 \le x_{t+1} \le 0.5$ and
$-0.07 \le v_{t+1} \le 0.07$. When $x_{t+1}$ reaches the left bound it has
crashed into the wall and its velocity $v_{t+1}$ is reset to 0. When
$x_{t+1}$ reaches the right boundary it has reached the goal and the
episode is terminated. You can try out different values for the acceleration
$a_t$ at full throttle and full throttle reverse.
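The update rules above can be sketched as a single simulation step. This is a minimal sketch, not a required design: the names `clamp` and `step` are hypothetical, and it assumes the velocity is updated first so that the position update can use the new velocity, as the equations state.

```python
import math

X_MIN, X_MAX = -1.2, 0.5     # position bounds from the problem statement
V_MIN, V_MAX = -0.07, 0.07   # velocity bounds from the problem statement

def clamp(value, low, high):
    """The 'bound' operation: restrict value to the interval [low, high]."""
    return max(low, min(high, value))

def step(x, v, a):
    """Advance one time step with throttle a in {-1, 0, +1}.

    Returns (x_next, v_next, done), where done is True at the goal.
    """
    v_next = clamp(v + 0.001 * a - 0.0025 * math.cos(3 * x), V_MIN, V_MAX)
    x_next = clamp(x + v_next, X_MIN, X_MAX)
    if x_next <= X_MIN:       # crashed into the left wall
        v_next = 0.0
    done = x_next >= X_MAX    # reached the right boundary (goal)
    return x_next, v_next, done
```

Trying other accelerations only requires scaling the `0.001 * a` term.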
You can vary the reward structure. The reward structure suggested in the
figure is to have a reward of +1 when reaching the goal, -1 if hitting
the wall and 0 elsewhere. Other options you can consider are having a reward
of -1 for all positions except for the goal and then having a reward of +1
at the goal.
Once you've done the basic problem you can decide where to go with this. If
you want you could create an applet to visualize the car's progress.
You will have to discretize the velocity and position variables to apply
reinforcement learning. Try varying the discretization and see how that
changes the speed of learning. Another thing you can do is plot the
optimal trajectory that you learn by showing the value of position and
velocity from the start state to the end in a 2D graph (say with position
as the x-coordinate and velocity as the y-coordinate). Be creative with
other variations.
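One possible way to discretize the state and apply a tabular method is sketched here with Q-learning. Everything beyond the position and velocity bounds is an assumption of this sketch: the grid sizes `N_X` and `N_V`, the values of `ALPHA`, `GAMMA`, and `EPSILON`, and all function names are hypothetical choices you are free to change.

```python
import random

N_X, N_V = 20, 20                        # hypothetical grid resolution
ACTIONS = (-1, 0, 1)                     # reverse, zero, forward throttle
ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1   # hypothetical learning parameters

def to_bin(value, low, high, n_bins):
    """Map a value in [low, high] to a bin index in 0 .. n_bins - 1."""
    frac = (value - low) / (high - low)
    return min(n_bins - 1, max(0, int(frac * n_bins)))

def discretize(x, v):
    """Turn a continuous (position, velocity) pair into a table key."""
    return to_bin(x, -1.2, 0.5, N_X), to_bin(v, -0.07, 0.07, N_V)

def make_q_table():
    """One list of action values per discrete state, initialized to zero."""
    return {(i, j): [0.0] * len(ACTIONS)
            for i in range(N_X) for j in range(N_V)}

def choose_action(q, state, rng=random):
    """Epsilon-greedy choice; returns an index into ACTIONS."""
    if rng.random() < EPSILON:
        return rng.randrange(len(ACTIONS))
    values = q[state]
    return values.index(max(values))

def q_update(q, state, action_index, reward, next_state, done):
    """One Q-learning backup; no bootstrapping from a terminal state."""
    target = reward if done else reward + GAMMA * max(q[next_state])
    q[state][action_index] += ALPHA * (target - q[state][action_index])
```

Changing `N_X` and `N_V` is enough to experiment with coarser or finer grids.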
NOTE: If you create an applet for this (that ideally lets you vary a
few things) then you do not need to write a report. At the meeting you'll
just demo your applet.
- (40 pts) Implement a reinforcement learning algorithm for the
pole-balancing task.

The problem here is to apply forces to a cart moving along a track so
as to keep a pole hinged to the cart from falling over. A failure is
said to occur if the pole falls past a given angle from vertical or
if the cart reaches an end of the track. The pole is reset to vertical
after each failure. The reward in this case could be +1 for every time step
in which failure did not occur, so that the return at each time would be the
number of steps until failure. Alternatively, we could treat pole-balancing
as a continuing task, using discounting. In this case the reward would be
-1 on each failure and zero at all other times. The return at each time would then
be related to $-\gamma^k$, where $\gamma$ is the discounting factor and
k is the number of time steps before failure. In either case the return is
maximized by keeping the pole balanced as long as possible.
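The two reward formulations can be compared numerically with a short sketch. The helper names below are hypothetical, and the sketch assumes that k zero-reward steps are followed by a single failure reward of -1.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**i * rewards[i], accumulated from the last reward back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def episodic_rewards(steps_until_failure):
    """+1 on every time step in which failure did not occur."""
    return [1.0] * steps_until_failure

def continuing_rewards(k):
    """Zero for k steps, then -1 on the failure itself."""
    return [0.0] * k + [-1.0]
```

With `gamma = 1.0` the episodic rewards sum to the number of steps until failure, and with `gamma < 1` the continuing rewards give a return of `-gamma**k`, which moves toward zero as k grows. In both cases a longer balance yields a larger return.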
Notice that here you have four variables: the velocity of the cart, the
position of the cart, the angle between the cart and the pole, and
finally the velocity of the pole (in terms of the change of angle).
You can decide what actions you want to provide, with a basic starting
point being 3 actions (forward acceleration, backward
acceleration, no acceleration).
Once you've done the basic problem you can decide where to go with
this. If you want you could create an applet to visualize the
progress. You will have to discretize the 4 variables to apply
reinforcement learning. Try varying the discretization and see how that
changes the speed of learning. You can vary the actions or reward
structure. What happens when you introduce hidden state? Be creative
with other variations.
NOTE: If you create an applet for this (that ideally lets you vary a
few things) then you do not need to write a report. At the meeting you'll
just demo your applet.
- (20 pts) Read one of the following papers and write a paper
critique. Please write the summary of the paper so that someone in
this class who has not read the paper would understand what it was
about at a high level and would understand one part at a deeper level.
- CHOOSE YOUR OWN ADVENTURE. You can propose any additional
homework options (or variations of those given above) to Dr. Goldman.