CS 527A, Spring 2002, Homework 5/6 Combined


These problems are worth 150 points and will count for both Homework 5 and Homework 6. Since we have not yet covered reinforcement learning, you should start by constructing the applet. The reinforcement learning algorithm will eventually produce a policy that takes the current state as input and outputs an action to perform. To get started early, replace the reinforcement learning algorithm with a simple program that picks an action. Of course this won't learn, but it will allow you to debug your applet. You can then add the Q-learning component after it is covered in class.
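
As a rough illustration of such a stand-in, the sketch below picks an action uniformly at random and ignores the state. It is written in C to match the physics code given later in this handout; the function name and the action encoding (-1, 0, +1) are assumptions made for the example, and the same idea carries over directly to the applet's own language.

```c
#include <stdlib.h>

/* Assumed action encoding: -1 = reverse, 0 = no throttle, +1 = forward. */
enum { NUM_ACTIONS = 3 };

/* Placeholder "policy": ignores the state and returns a random action.
   It does not learn; it only exists so the applet can be debugged
   before Q-learning is added. */
int placeholder_policy(double position, double velocity)
{
    (void)position;   /* unused on purpose */
    (void)velocity;   /* unused on purpose */
    return (rand() % NUM_ACTIONS) - 1;
}
```

Once the learning component exists, only the body of this function should need to change.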

If you would like me to give you a quick preview of what you need to know about Q-learning to do these two problems, I'd be happy to do that. Just come by my office hours or schedule an appointment with me (by signing up on the advising sign-up sheet on my door). Also, as your "resident experts," Michal Bryc did the mountain car problem last year and Justin Domke did the pole balancing problem last year (and went on to build a physical system for the pole balancing problem).

If you do either of these problems, your homework will not be due until April 17. However, to make sure you begin working early enough, you must demonstrate to me by April 10 that you have made significant progress. For example, you could show me your applet or whatever portion of the assignment you have been working on.


  1. (150 pts) The Mountain Car Problem involves learning to drive an underpowered car up a steep mountain road as illustrated in:


    The difficulty is that gravity is stronger than the car's engine, and even at full throttle the car cannot accelerate up the steep slope when starting (at zero velocity) at the bottom. The only solution is to first move away from the goal and up the opposite slope on the left. This is a simple example of a continuous control task where things have to get worse in a sense (farther from the goal) before they can get better. The actions available to the car are full throttle forward (+1), full throttle reverse (-1) and zero throttle (0). The car moves according to simplified physics. Its position x_t and velocity v_t are updated by:
    • x_{t+1} = bound[ x_t + v_{t+1} ]
    • v_{t+1} = bound[ v_t + 0.001 a_t - 0.0025 cos(3 x_t) ]
    where a_t is the action taken at time t and the bound operation enforces -1.2 <= x_{t+1} <= 0.5 and -0.07 <= v_{t+1} <= 0.07. When x_{t+1} reaches the left bound, the car has crashed into the wall and its velocity v_{t+1} is reset to 0. When x_{t+1} reaches the right bound, the car has reached the goal and the episode is terminated. You can try out different values for the acceleration a_t at full throttle forward and full throttle reverse.

    Your task is to implement Q-learning for the Mountain Car Problem and create an applet that allows the user to watch the car's progress. Along with showing the car's movement on the hill, plot in another window the trajectory that the current policy follows, showing position and velocity from the start state to the end in a 2D graph (say, with position as the x-coordinate and velocity as the y-coordinate). Since showing the car moving along the hill will slow down the learning process, provide an option on the applet to show the simulation only once every t trials, where t is a value that can be changed by the user.
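
    For orientation before Q-learning is covered in class, here is a minimal sketch of the standard one-step tabular Q-learning update. The table size, the parameter names (alpha for the learning rate, discount for the discount factor) and the function names are assumptions made for the example; states are assumed to be already mapped to integer indices.

```c
#define NUM_STATES  400   /* hypothetical, e.g. 20 position x 20 velocity bins */
#define NUM_ACTIONS 3     /* reverse, zero throttle, forward */

static double Q[NUM_STATES][NUM_ACTIONS];   /* starts at all zeros */

/* Index of the action with the largest Q-value in this state. */
int greedy_action(int state)
{
    int best = 0;
    for (int a = 1; a < NUM_ACTIONS; a++)
        if (Q[state][a] > Q[state][best])
            best = a;
    return best;
}

/* One-step Q-learning update after taking action a in state s,
   receiving reward r and landing in s_next. If s_next is terminal,
   there is no future value to bootstrap from. */
void q_update(int s, int a, double r, int s_next, int terminal,
              double alpha, double discount)
{
    double target = r;
    if (!terminal)
        target += discount * Q[s_next][greedy_action(s_next)];
    Q[s][a] += alpha * (target - Q[s][a]);
}
```

    During learning the action is usually not always the greedy one; some exploration (for example, an occasional random action) is needed so that every state-action pair gets tried.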

    Once you've done the basic problem you can decide where to go with this. You will have to discretize the velocity and position variables to apply reinforcement learning. Try varying the discretization and see how that changes the speed of learning. Another interesting parameter to vary is the reward structure. The reward structure suggested in the figure is a reward of +1 when reaching the goal, -1 when hitting the wall and 0 elsewhere. Another option you can consider is a reward of -1 for all positions except the goal and a reward of +1 at the goal. Allow these values to be varied within the user interface provided for the applet.
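
    One simple way to discretize is a uniform grid over the allowed ranges of position and velocity. The sketch below assumes that approach; the function names, the RewardScheme struct and the number of bins are all illustrative choices, and the two reward schemes are the ones described above.

```c
/* Map value in [lo, hi] to a bin index in 0 .. bins-1 (clipped). */
int discretize(double value, double lo, double hi, int bins)
{
    int idx = (int)((value - lo) / (hi - lo) * bins);
    if (idx < 0) idx = 0;
    if (idx >= bins) idx = bins - 1;
    return idx;
}

/* Combine the position and velocity bins into one state index. */
int state_index(double x, double v, int x_bins, int v_bins)
{
    return discretize(x, -1.2, 0.5, x_bins) * v_bins
         + discretize(v, -0.07, 0.07, v_bins);
}

/* Rewards that the user interface could let the user change. */
typedef struct { double goal; double wall; double elsewhere; } RewardScheme;

static const RewardScheme FIGURE_SCHEME   = { 1.0, -1.0,  0.0 };
static const RewardScheme STEP_COST_SCHEME = { 1.0, -1.0, -1.0 };

double reward_for(const RewardScheme *r, int at_goal, int at_wall)
{
    if (at_goal) return r->goal;
    if (at_wall) return r->wall;
    return r->elsewhere;
}
```

    Making x_bins and v_bins run-time parameters keeps it easy to compare learning speed across grid resolutions.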

    You need not submit anything in writing for this problem. Instead you will be graded based on a demo which will be given during the week of April 17.

  2. (150 pts) The Pole Balancing Problem is to apply forces to a cart moving along a track so as to keep a pole hinged to the cart from falling over.


    A failure is said to occur if the pole falls past a given angle from vertical or if the cart reaches an end of the track. The pole is reset to vertical after each failure. The reward in this case could be +1 for every time step in which failure did not occur, so that the return for each episode would be the number of steps until failure. Alternatively, we could treat pole-balancing as a continuing task, using discounting. In this case the reward would be -1 on each failure and zero at all other times. The return at each time would then be related to -gamma^k, where gamma is the discounting factor and k is the number of time steps before failure. In either case the return is maximized by keeping the pole balanced as long as possible.
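
    The two notions of return can be checked with a small helper that sums a reward sequence with discounting. This is only a sketch; the function name is an assumption, and a discount of 1 is used to stand for the undiscounted episodic case.

```c
/* Discounted sum of rewards[0..n-1]:
   rewards[0] + discount * rewards[1] + discount^2 * rewards[2] + ...
   Computed backwards so no explicit power is needed. */
double discounted_return(const double *rewards, int n, double discount)
{
    double g = 0.0;
    for (int i = n - 1; i >= 0; i--)
        g = rewards[i] + discount * g;
    return g;
}
```

    With a reward of +1 on each of three steps and a discount of 1, the result is 3, the number of steps until failure. With rewards 0, 0, 0, -1 and a discount of 0.5, the result is -(0.5^3) = -0.125, so a failure further in the future gives a return closer to zero.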

    Notice that here you have four variables: the position of the cart, the velocity of the cart, the angle between the cart and the pole, and the angular velocity of the pole (the rate of change of the angle). You can decide what actions you want to provide; a basic starting point is to have 3 actions (forward acceleration, backward acceleration, no acceleration).

    Here is everything you need for modeling the physics.

    #include <math.h>

    /*** Parameters for simulation ***/

    #define GRAVITY 9.8
    #define MASSCART 1.0
    #define MASSPOLE 0.1
    #define TOTAL_MASS (MASSPOLE + MASSCART)
    #define LENGTH 0.5                /* actually half the pole's length */
    #define POLEMASS_LENGTH (MASSPOLE * LENGTH)
    #define FORCE_MAG 10.0
    #define TAU 0.02                  /* seconds between state updates */
    #define FOURTHIRDS 1.3333333333333

    /*
     * State variables:
     *   x          the cart's position (float)
     *   x_dot      the cart's velocity (float)
     *   theta      the pole's angle (float)
     *   theta_dot  the angular velocity of the pole (float)
     *
     * update_state() changes the state variables to what they would be
     * TAU seconds later. The function name and the action encoding
     * (0 = no acceleration, +1 = forward, -1 = backward) are choices made
     * here for illustration; adapt them to your own design.
     */
    void update_state(int action, float *x, float *x_dot,
                      float *theta, float *theta_dot)
    {
        float xacc, thetaacc, force, costheta, sintheta, temp;

        /* The applied force depends on the chosen action. */
        if (action > 0)
            force = FORCE_MAG;        /* forward acceleration */
        else if (action < 0)
            force = -FORCE_MAG;       /* backward acceleration */
        else
            force = 0;                /* no acceleration */

        costheta = cos(*theta);
        sintheta = sin(*theta);

        temp = (force + POLEMASS_LENGTH * *theta_dot * *theta_dot * sintheta)
               / TOTAL_MASS;

        thetaacc = (GRAVITY * sintheta - costheta * temp)
                   / (LENGTH * (FOURTHIRDS - MASSPOLE * costheta * costheta / TOTAL_MASS));

        xacc = temp - POLEMASS_LENGTH * thetaacc * costheta / TOTAL_MASS;

        /*** Update the four state variables, using Euler's method. ***/

        *x += TAU * *x_dot;
        *x_dot += TAU * xacc;
        *theta += TAU * *theta_dot;
        *theta_dot += TAU * thetaacc;
    }
    
    Your task is to implement Q-learning for the pole balancing problem and create an applet that allows the user to watch a simulation. Since showing the simulation will slow down the learning process, provide an option on the applet to show the simulation only once every t trials, where t is a value that can be changed by the user.

    Once you've done the basic problem you can decide where to go with this. You will have to discretize the 4 variables to apply reinforcement learning. Try varying the discretization and see how that changes the speed of learning. You can vary the actions or the reward structure. Be creative with other variations, setting up options on the applet to let the user control them.
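
    With four variables, one way to build a single table index is to bin each variable separately and combine the bin numbers like digits of a mixed-radix number. The sketch below assumes that scheme; the Axis struct and function names are illustrative, and the ranges and bin counts are left for the caller to choose because this handout does not fix them.

```c
#define NUM_VARS 4   /* x, x_dot, theta, theta_dot */

/* Range and resolution for one state variable (caller-chosen). */
typedef struct { double lo; double hi; int bins; } Axis;

/* Bin index of value along one axis, clipped to 0 .. bins-1. */
int bin_of(const Axis *axis, double value)
{
    int idx = (int)((value - axis->lo) / (axis->hi - axis->lo) * axis->bins);
    if (idx < 0) idx = 0;
    if (idx >= axis->bins) idx = axis->bins - 1;
    return idx;
}

/* Combine the four bin indices into one state index in
   0 .. (product of all bin counts) - 1. */
int pole_state_index(const Axis axes[NUM_VARS], const double vars[NUM_VARS])
{
    int index = 0;
    for (int i = 0; i < NUM_VARS; i++)
        index = index * axes[i].bins + bin_of(&axes[i], vars[i]);
    return index;
}
```

    The table size grows as the product of the four bin counts, so finer grids mean more state-action pairs to visit before the Q-values settle. This is one reason the discretization affects learning speed.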

    You need not submit anything in writing for this problem. Instead you will be graded based on a demo which will be given during the week of April 17.