CS 527A, Spring 2002, Homework 6


You are expected to complete 50 points' worth of homework problems. (Suggested Problems: 1, 2 and 3.)
This homework is due on Wednesday, April 17; the standard late policy applies.

A signed cover sheet for Homework 6 must be submitted with your homework.


  1. (10 points) Consider the following reward structure:



    where the discount rate is 0.99. What policy will the Q-learning algorithm select here? Is that the policy you feel is best? Think carefully about this question, and discuss your answer and what it reveals about Q-learning in terms of risk taking. Consider what the reward values might correspond to in the real world.
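
    As a reminder of the rule the question refers to, here is a minimal sketch of one tabular Q-learning update. The function name, the learning rate alpha, and the dictionary-based Q table are assumptions made for illustration; only the discount rate of 0.99 comes from the problem statement.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, reward, s_next, next_actions,
                      gamma=0.99, alpha=0.1):
    """Apply one tabular Q-learning update in place; return the new Q(s, a).

    Q is a dict keyed by (state, action). alpha is an assumed learning
    rate; gamma=0.99 matches the discount rate given in the problem.
    """
    # Value of the best action in the next state (0 if it is terminal,
    # i.e. next_actions is empty).
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the target.
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


# Hypothetical usage with made-up state and action names.
Q = defaultdict(float)
q_learning_update(Q, "s0", "go", reward=1.0, s_next="s1", next_actions=["go"])
```

    Note that the target depends only on the reward and the maximum estimated value of the next state, which is worth keeping in mind when discussing risk.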

  2. (20 pts) Imagine an environment in which a robot has two Boolean-valued sensors S1 and S2 that define the state of the robot. Consider the MDP shown below, where the initial state is always <0,0> (that is, S1 = 0 and S2 = 0).

    Note that a reward is associated with each state (rather than with a state/action pair). Given a state and an action, the immediate reward is that of the state that is reached.

    • Show the Q and V* values for a discount rate gamma = 0.9. The best way to show this is to draw a graph with the same structure as the MDP and then put the V* values inside the vertices and the Q values on the edges for the corresponding state/action pairs.

    • What is the optimal policy for gamma = 0.9? That is, for each of the 3 non-goal states, indicate what action the optimal policy would take from that state.

    • For what range of values of the discount rate (i.e. gamma) would the optimal policy for getting from <0,0> to <1,1> be the indirect route defined by always taking action A?
      Hint: Write an expression for the discounted reward for each policy and then you can algebraically determine the requirements on gamma for when the indirect route will be optimal.
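
    To check hand-computed Q and V* values, the following is a minimal sketch of repeated Bellman backups for a deterministic MDP whose rewards are attached to the state reached. The function name and the three-state example (states "x", "y", "g", with made-up rewards and gamma = 0.5) are hypothetical and are not the MDP from the figure.

```python
def solve_deterministic_mdp(transitions, state_reward, gamma, sweeps=100):
    """Return (Q, V) for a deterministic MDP via repeated Bellman backups.

    transitions[s][a] -> next state; states without an entry are terminal.
    state_reward[s]   -> reward received on entering state s (default 0).
    """
    states = set(transitions)
    for acts in transitions.values():
        states.update(acts.values())
    V = {s: 0.0 for s in states}  # terminal states keep value 0
    Q = {}
    for _ in range(sweeps):
        for s, acts in transitions.items():
            for a, s_next in acts.items():
                # Reward of the state reached plus discounted future value.
                Q[(s, a)] = state_reward.get(s_next, 0.0) + gamma * V[s_next]
            V[s] = max(Q[(s, a)] for a in acts)
    return Q, V


# Hypothetical example: from "x", action B goes straight to the goal "g",
# while action A detours through "y".
toy_transitions = {"x": {"A": "y", "B": "g"}, "y": {"A": "g"}}
toy_rewards = {"y": 10.0, "g": 50.0}
Q, V = solve_deterministic_mdp(toy_transitions, toy_rewards, gamma=0.5)
print(Q[("x", "A")], Q[("x", "B")], V["x"])
```

    Re-running such a check for several values of gamma is one way to sanity-check the algebraic range asked for in the last part.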

  3. (20 points) Consider applying a transformation function f to the rewards. That is, for all s,a, the transformation function f(r(s,a)) will be applied. For example, if f(r(s,a)) = r(s,a)+1, then 1 is added to every reward. For each of the two classes of transformation functions below, you are to prove whether or not the optimal policy is changed. That is, you must either prove that for all MDPs the transformation will not change the optimal policy, or give an MDP for which you can demonstrate that the optimal policy would change.

  4. (20 pts) Consider the Recycling Robot. This mobile robot has the job of collecting empty soda cans in an office environment. It has sensors for detecting cans, and an arm and gripper that can pick them up and place them in an onboard bin; it runs on a rechargeable battery. The robot's control system has components for interpreting sensory information, for navigating, and for controlling the arm and gripper. High-level decisions can be made by a reinforcement learning agent based on the current charge of the battery. This agent has to decide whether the robot should (1) actively search for a can for a certain period of time, (2) remain stationary and wait for someone to bring it a can, or (3) head back to its home base to recharge its battery. The rewards might be zero most of the time, but then become positive when the robot secures an empty can, or large and negative if the battery runs all the way down.

    In this problem you are to create a finite MDP for modeling this problem. Clearly state any assumptions you make to do this. For example, you may want to assume that the best way to find cans is to actively search for them, but this runs down the robot's battery, whereas waiting does not. You can also assume that the agent makes its decisions solely as a function of the energy level of the battery.

  5. (20 pts) Give the Bellman equation for Q* for the recycling robot described in Problem 4. Also calculate the optimal policy. You can only do this problem if you also do Problem 4.
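
    For reference, the general form of the Bellman optimality equation for Q* in a finite MDP with stochastic transitions can be written as below. The symbols P (transition probability) and R (expected immediate reward) are notation assumed here and do not appear in the problem statement; the task is to instantiate them with the states, actions, and assumptions of your own model from Problem 4.

```latex
\[
Q^*(s,a) \;=\; \sum_{s'} P(s' \mid s,a)\,
  \Bigl[\, R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \,\Bigr],
\qquad
\pi^*(s) \;=\; \arg\max_{a} Q^*(s,a).
\]
```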

  6. (20 pts) Problem 13.2 from the text.

  7. (20 pts) Problem 13.4 from the text.

  8. (20 pts) Problem 13.3 from the text.

  9. (30 pts) Implement the Q-learning tic-tac-toe learner designed in Problem 8. As part of this problem you are expected to present your results: show appropriate plots, such as the learning curve, and discuss the performance. You should also consider empirically what happens when the opponent plays optimally rather than randomly, and discuss the outcome. You can only do this problem if you also do Problem 8.
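
    One simple way to produce a learning curve is to average the game outcomes over consecutive blocks of games. The sketch below assumes a hypothetical outcome encoding (1 for a win, 0 for a draw, -1 for a loss) and a hypothetical helper name; neither is prescribed by the assignment.

```python
def learning_curve(outcomes, window=100):
    """Average score over each consecutive block of `window` games.

    outcomes: per-game scores in playing order, e.g. 1 = win, 0 = draw,
    -1 = loss (an assumed encoding). A trailing partial block is dropped.
    """
    curve = []
    for start in range(0, len(outcomes) - window + 1, window):
        block = outcomes[start:start + window]
        curve.append(sum(block) / window)
    return curve


# Hypothetical usage: eight games summarized in blocks of two.
points = learning_curve([1, 1, 0, 0, -1, -1, 1, 1], window=2)
```

    The resulting list can be plotted against the block index, once for the random opponent and once for the optimal opponent.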

  10. (30 points) Read one of the following papers (or, for the longer papers, one significant part of it plus whatever is needed from the introduction to understand the selected part) and write a paper critique following these guidelines. You will be required to have a conference with Dr. Goldman to discuss the paper, and part of your grade will be based on this conference. Please write the summary of the paper so that someone in this class who has not read the paper would understand what it is about at a high level and would understand one part at a deeper level.

  11. CHOOSE YOUR OWN ADVENTURE. You can propose any additional homework options (or variations of those given above) to Dr. Goldman.