CS 527A Homework 4


You are expected to complete 40 points' worth of homework problems. If you select 10- and 20-point problems, you must choose problems from at least two of the chapters. Also, no more than one paper critique may be selected.

If you are doing a 20- or 40-point problem, be sure to attach the appropriate cover sheet and review the guidelines given there and in the course information handout.

If you are interested in doing a group project, talk to Dr. Goldman.

Due on Wednesday May 2nd. Note that the first 3 problems also appeared on HW 3. Obviously you cannot pick a problem (or paper reading option) to do on HW 4 that you did as part of HW 3.


Cover Sheets:


  1. (10 pts) In this problem we look at instance-based learning.

  2. (10 pts) Suggest a lazy version of the eager decision tree learning algorithm ID3. Be sure to give a very clear description of your lazy algorithm. What are the advantages and disadvantages of your lazy learning algorithm as compared to the original eager algorithm? I'm expecting a well-thought-out discussion on this.

  3. (10 pts) In this problem you will compute the posterior probabilities based on a given Bayesian belief network and some partial observations.

    Consider the Fire Alarm example from the following applet except remove the attribute "reporting." For each of the following three sets of observations, show your computation for obtaining the posterior probabilities of all variables.
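
As a rough illustration of the kind of computation involved, the sketch below obtains posteriors by enumerating the joint distribution of a small network. The network structure (Tampering and Fire as parents of Alarm, Fire as parent of Smoke) and every probability in it are made-up assumptions for illustration, not the applet's tables, and the names `joint` and `posteriors` are hypothetical.

```python
from itertools import product

# ASSUMPTION: a reduced, made-up network. Replace structure and numbers
# with the conditional probability tables from the applet.
P_FIRE = 0.01                      # P(fire = True)
P_TAMPERING = 0.02                 # P(tampering = True)
P_ALARM = {                        # P(alarm = True | fire, tampering)
    (True, True): 0.5,
    (True, False): 0.99,
    (False, True): 0.85,
    (False, False): 0.0001,
}
P_SMOKE = {True: 0.9, False: 0.01}  # P(smoke = True | fire)

VARIABLES = ("fire", "tampering", "alarm", "smoke")


def bernoulli(p_true, value):
    """Probability of a Boolean value given P(value = True)."""
    return p_true if value else 1.0 - p_true


def joint(fire, tampering, alarm, smoke):
    """Joint probability of one full assignment (product of the CPT entries)."""
    return (bernoulli(P_FIRE, fire)
            * bernoulli(P_TAMPERING, tampering)
            * bernoulli(P_ALARM[(fire, tampering)], alarm)
            * bernoulli(P_SMOKE[fire], smoke))


def posteriors(evidence):
    """Return P(var = True | evidence) for every variable by enumeration."""
    totals = {name: 0.0 for name in VARIABLES}
    normalizer = 0.0
    for values in product([True, False], repeat=len(VARIABLES)):
        world = dict(zip(VARIABLES, values))
        # Skip assignments that contradict the observations.
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(**world)
        normalizer += p
        for name in VARIABLES:
            if world[name]:
                totals[name] += p
    return {name: totals[name] / normalizer for name in VARIABLES}
```

Calling `posteriors({"smoke": True})` returns the posterior of each variable given that smoke was observed. Enumeration is exponential in the number of variables, which is acceptable for a network this small.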

  4. (10 pts) Here we look at some topics in the area of Reinforcement Learning.

  5. (10 pts) Consider the Recycling Robot. This mobile robot has the job of collecting empty soda cans in an office environment. It has sensors for detecting cans, and an arm and gripper that can pick them up and place them in an onboard bin; it runs on a rechargeable battery. The robot's control system has components for interpreting sensory information, for navigating, and for controlling the arm and gripper. High-level decisions can be made by a reinforcement learning agent based on the current charge of the battery. This agent has to decide whether the robot should (1) actively search for a can for a certain period of time, (2) remain stationary and wait for someone to bring it a can, or (3) head back to its home base to recharge its battery. The rewards might be zero most of the time, but then become positive when the robot secures an empty can, or large and negative if the battery runs all the way down.

    In this problem you are to create a finite MDP for modelling this problem. Clearly state any assumptions you make to do this. For example, you may want to assume that the best way to find cans is to actively search for them, but this runs down the robot's battery, whereas waiting does not. You can also assume that the agent makes its decisions solely as a function of the energy level of the battery.
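
One possible way to write down a finite MDP is as a table from (state, action) pairs to lists of outcomes. The sketch below assumes a two-level battery ("high", "low"), assumes that a depleted robot is rescued and recharged, and uses made-up probabilities and rewards; all of `ALPHA`, `BETA`, the reward constants and the helper names are hypothetical choices, and your own model may differ.

```python
# States are battery levels; actions follow the problem statement.
ACTIONS = {"high": ("search", "wait"),
           "low": ("search", "wait", "recharge")}

# ASSUMPTIONS (not given in the problem): all numbers below are made up.
ALPHA = 0.8       # P(battery stays high | search while high)
BETA = 0.6        # P(battery stays low  | search while low)
R_SEARCH = 2.0    # expected reward from cans found while searching
R_WAIT = 1.0      # expected reward from cans handed over while waiting
R_RESCUE = -3.0   # penalty when the battery runs down and the robot is rescued

# (state, action) -> list of (probability, next_state, reward)
TRANSITIONS = {
    ("high", "search"): [(ALPHA, "high", R_SEARCH),
                         (1 - ALPHA, "low", R_SEARCH)],
    ("high", "wait"): [(1.0, "high", R_WAIT)],
    ("low", "search"): [(BETA, "low", R_SEARCH),
                        (1 - BETA, "high", R_RESCUE)],  # rescued, then recharged
    ("low", "wait"): [(1.0, "low", R_WAIT)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
}


def check_mdp(transitions):
    """Sanity check: legal actions only, and outcome probabilities sum to 1."""
    for (state, action), outcomes in transitions.items():
        if action not in ACTIONS[state]:
            return False
        if abs(sum(p for p, _, _ in outcomes) - 1.0) > 1e-9:
            return False
    return True


def expected_reward(state, action):
    """Expected one-step reward for taking `action` in `state`."""
    return sum(p * r for p, _, r in TRANSITIONS[(state, action)])
```

Keeping the model as plain data makes it easy to state assumptions explicitly and to check that each row is a valid probability distribution before reasoning about policies.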

  6. (10 pts) Give the Bellman equation for Q* for the recycling robot described in problem 5. Also calculate the optimal policy.
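
For reference, the general Bellman optimality backup for action values can be iterated to a fixed point, after which the greedy policy can be read off. The sketch below applies it to a made-up two-state toy MDP (not the recycling robot); the names `TOY`, `q_value_iteration` and `greedy_policy`, the discount factor and the number of sweeps are all assumptions for illustration.

```python
# ASSUMPTION: a made-up toy MDP in the form
# (state, action) -> list of (probability, next_state, reward).
TOY = {
    ("A", "go"): [(1.0, "B", 0.0)],
    ("A", "stay"): [(1.0, "A", 0.5)],
    ("B", "stay"): [(1.0, "B", 2.0)],
}


def q_value_iteration(transitions, gamma=0.9, sweeps=200):
    """Repeatedly apply
    Q(s,a) <- sum_s' P(s'|s,a) * (r + gamma * max_a' Q(s',a'))."""
    actions = {}
    for state, action in transitions:
        actions.setdefault(state, []).append(action)
    q = {sa: 0.0 for sa in transitions}
    for _ in range(sweeps):
        new_q = {}
        for (state, action), outcomes in transitions.items():
            new_q[(state, action)] = sum(
                p * (r + gamma * max(q[(nxt, a)] for a in actions[nxt]))
                for p, nxt, r in outcomes)
        q = new_q
    return q


def greedy_policy(q):
    """Pick, for each state, an action with the largest Q value."""
    best = {}
    for (state, action), value in q.items():
        if state not in best or value > q[(state, best[state])]:
            best[state] = action
    return best
```

With a discount factor below 1 the backup is a contraction, so a fixed number of sweeps is a reasonable stopping rule for a small table; a tolerance-based stopping test would be an equally valid design.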

  7. (20 pts) Problem 13.2 from the text.

  8. (20 pts) Problem 13.4 from the text.

  9. (20 pts) Problem 13.3 from the text. If you also implement your design then this is worth 40 points.

  10. (40 pts) Implement a reinforcement learning algorithm for the Mountain-Car task. Consider the task of driving an underpowered car up a steep mountain road as illustrated in:


    The difficulty is that gravity is stronger than the car's engine, and even at full throttle the car cannot accelerate up the steep slope. The only solution is to first move away from the goal and up the opposite slope on the left. This is a simple example of a continuous control task where things have to get worse in a sense (farther from the goal) before they can get better.

    The actions available to the car are full throttle forward (+1), full throttle reverse (-1) and zero throttle (0). The car moves according to a simplified physics. Its position x_t and velocity v_t are updated at each time step, with a bound operation that enforces -1.2 <= x_{t+1} <= 0.5 and -0.07 <= v_{t+1} <= 0.07. When x_{t+1} reaches the left bound, the car has crashed into the wall and its velocity v_{t+1} is reset to 0. When x_{t+1} reaches the right bound, the car has reached the goal and the episode is terminated. You can try out different values for the acceleration a_t at full throttle and full throttle reverse.

    You can vary the reward structure. The reward structure suggested in the figure is to have a reward of +1 when reaching the goal, -1 if hitting the wall and 0 elsewhere. Another option you can consider is a reward of -1 for all positions except the goal and a reward of +1 at the goal.

    Once you've done the basic problem you can decide where to go with this. If you want, you could create an applet to visualize the car's progress. You will have to discretize the velocity and position values to apply reinforcement learning. Try varying the discretization and see how that changes the speed of learning. Another thing you can do is plot the optimal trajectory that you learn by showing the value of position and velocity from the start state to the end in a 2D graph (say, with position as the x-coordinate and velocity as the y-coordinate). Be creative with other variations.

    NOTE: If you create an applet for this (that ideally lets you vary a few things) then you do not need to write a report. At the meeting you'll just demo your applet.
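
A minimal sketch of the environment side of this task is given below. The position and velocity bounds, the wall reset, the goal termination and the +1/-1/0 rewards come from the description above; the thrust and gravity constants (`THRUST`, `GRAVITY`) and the cosine slope term are assumed values commonly used for this task, and the function names and bin count are hypothetical.

```python
import math

X_MIN, X_MAX = -1.2, 0.5      # position bounds from the problem statement
V_MIN, V_MAX = -0.07, 0.07    # velocity bounds from the problem statement

# ASSUMPTIONS: commonly used constants, not stated in this handout.
THRUST = 0.001                # acceleration per unit of throttle
GRAVITY = 0.0025              # strength of the slope term


def bound(value, low, high):
    """Clip value into [low, high]."""
    return max(low, min(high, value))


def step(x, v, action):
    """Advance one time step; action is -1, 0 or +1.

    Returns (x, v, reward, done) using the +1 goal / -1 wall / 0 elsewhere
    reward structure."""
    v = bound(v + THRUST * action - GRAVITY * math.cos(3 * x), V_MIN, V_MAX)
    x = bound(x + v, X_MIN, X_MAX)
    if x <= X_MIN:            # crashed into the left wall: velocity reset
        return x, 0.0, -1.0, False
    if x >= X_MAX:            # reached the goal: episode ends
        return x, v, 1.0, True
    return x, v, 0.0, False


def discretize(x, v, n_bins=20):
    """Map a continuous (x, v) pair to a pair of bin indices in [0, n_bins)."""
    xi = min(n_bins - 1, int((x - X_MIN) / (X_MAX - X_MIN) * n_bins))
    vi = min(n_bins - 1, int((v - V_MIN) / (V_MAX - V_MIN) * n_bins))
    return xi, vi
```

The pair returned by `discretize` can index a table of action values for a tabular method such as Q-learning; changing `n_bins` is one way to run the discretization experiment suggested above.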

  11. (40 pts) Implement a reinforcement learning algorithm for the pole-balancing task.


    The problem here is to apply forces to a cart moving along a track so as to keep a pole hinged to the cart from falling over. A failure is said to occur if the pole falls past a given angle from vertical or if the cart reaches an end of the track. The pole is reset to vertical after each failure.

    The reward in this case could be +1 for every time step in which failure did not occur, so that the return at each time would be the number of steps until failure. Alternatively, we could treat pole-balancing as a continuing task, using discounting. In this case the reward would be -1 on each failure and zero at all other times. The return at each time would then be related to -gamma^k, where gamma is the discount factor and k is the number of time steps before failure. In either case the return is maximized by keeping the pole balanced as long as possible.

    Notice that here you have four variables: the velocity of the cart, the position of the cart, the angle between the cart and the pole, and finally the velocity of the pole (in terms of the change of angle). You can decide what actions you want to provide, with a basic starting point being 3 actions (forward acceleration, backward acceleration, no acceleration).

    Once you've done the basic problem you can decide where to go with this. If you want, you could create an applet to visualize the progress. You will have to discretize the 4 variables to apply reinforcement learning. Try varying the discretization and see how that changes the speed of learning. You can vary the actions or reward structure. What happens when you introduce hidden state? Be creative with other variations.

    NOTE: If you create an applet for this (that ideally lets you vary a few things) then you do not need to write a report. At the meeting you'll just demo your applet.
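
The two reward formulations described for pole-balancing can be compared with a small return calculation. This is only a sketch: the helper names are hypothetical, and the continuing-task reward sequence is truncated at the first failure, which is a simplifying assumption.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**i * rewards[i], with i = 0 for the next reward."""
    total = 0.0
    for i, reward in enumerate(rewards):
        total += (gamma ** i) * reward
    return total


def episodic_rewards(steps_until_failure):
    """Episodic formulation: +1 on every step on which no failure occurs."""
    return [1.0] * steps_until_failure


def continuing_rewards(steps_before_failure):
    """Continuing formulation: 0 on every step, then -1 at the failure.

    ASSUMPTION: only the first failure is included in the sequence."""
    return [0.0] * steps_before_failure + [-1.0]
```

With `gamma = 1` the episodic return is simply the number of steps survived, while in the continuing formulation a failure after `k` zero-reward steps contributes `-gamma**k`, so later failures are penalized less in both cases.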

  12. (20 pts) Read one of the following papers and write a paper critique. Please write the summary of the paper so that someone in this class who has not read the paper would understand what it was about at a high level and would understand one part at a deeper level.

  13. CHOOSE YOUR OWN ADVENTURE. You can propose any additional homework options (or variations of those given above) to Dr. Goldman.