If you would like a quick preview of what you need to know about Q-learning to do these two problems, I'd be happy to give you one. Just come by my office hours or schedule an appointment with me (by signing up on the advising sign-up sheet on my door). You also have two "resident experts": Michal Bryc did the mountain car problem last year, and Justin Domke did the pole balancing problem last year (and went on to build a physical system for the pole balancing problem).
If you do either of these problems, your homework will not be due until April 17. However, to make sure you begin working early enough, you must demonstrate to me by April 10 that you have made significant progress. For example, you could show me your applet or whatever portion of the assignment you have been working on.

Your task is to implement Q-learning for the Mountain Car Problem and create an applet that allows the user to watch the car's progress. Along with showing the car's movement on the hill, plot in another window the optimal trajectory of the current policy by showing the position and velocity from the start state to the end state in a 2D graph (say, with position as the x-coordinate and velocity as the y-coordinate). Since showing the car moving along the hill will slow down the learning process, provide an option on the applet to show the simulation only once every t trials, where t is a value that can be changed by the user.
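As a reminder of what the learning step involves, here is a minimal sketch of the tabular Q-learning update with epsilon-greedy action selection. The table size N_STATES, the function names, and the parameters alpha (step size), discount, and epsilon are illustrative assumptions rather than part of the assignment, so choose your own.

```c
#include <stdlib.h>

#define N_STATES  100  /* hypothetical: number of discretized states */
#define N_ACTIONS 3    /* e.g. forward, backward, no acceleration */

static double Q[N_STATES][N_ACTIONS];  /* action values, start at zero */

/* Action with the highest Q-value in state s (ties go to the lowest index). */
int greedy_action(int s)
{
    int best = 0;
    for (int a = 1; a < N_ACTIONS; a++)
        if (Q[s][a] > Q[s][best])
            best = a;
    return best;
}

/* Epsilon-greedy: explore with probability epsilon, otherwise act greedily. */
int choose_action(int s, double epsilon)
{
    if ((double)rand() / ((double)RAND_MAX + 1.0) < epsilon)
        return rand() % N_ACTIONS;
    return greedy_action(s);
}

/* One Q-learning backup after taking action a in state s, receiving
   reward r, and landing in s_next. A terminal s_next adds no future value. */
void q_update(int s, int a, double r, int s_next, int terminal,
              double alpha, double discount)
{
    double future = terminal ? 0.0 : Q[s_next][greedy_action(s_next)];
    Q[s][a] += alpha * (r + discount * future - Q[s][a]);
}
```

The trajectory to plot for the current policy can then be generated by following greedy_action from the start state with exploration switched off.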
Once you've done the basic problem you can decide where to go with this. You will have to discretize the velocity and position variables to apply reinforcement learning. Try varying the discretization and see how that changes the speed of learning. Another interesting parameter to vary is the reward structure. The reward structure suggested in the figure is a reward of +1 when reaching the goal, -1 when hitting the wall, and 0 elsewhere. Another option you can consider is a reward of -1 for all positions except the goal and a reward of +1 at the goal. Allow these values to be varied within the user interface provided for the applet.
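One possible way to set up the discretization and an adjustable reward structure is sketched below. The helper names (discretize, state_index, RewardConfig, reward) are hypothetical, and the range you pass as lo and hi has to come from your own mountain car model.

```c
/* Map a continuous value to a bin index in 0 .. bins-1.
   Values outside [lo, hi] are clamped to the first or last bin. */
int discretize(double value, double lo, double hi, int bins)
{
    if (value <= lo) return 0;
    if (value >= hi) return bins - 1;
    int i = (int)((value - lo) / (hi - lo) * bins);
    if (i >= bins) i = bins - 1;  /* guard against rounding at the top edge */
    return i;
}

/* Combine the position bin and the velocity bin into one table index. */
int state_index(int pos_bin, int vel_bin, int vel_bins)
{
    return pos_bin * vel_bins + vel_bin;
}

/* Reward values that the user interface could expose for editing. */
typedef struct {
    double goal;  /* reward for reaching the goal */
    double wall;  /* reward for hitting the wall */
    double step;  /* reward everywhere else */
} RewardConfig;

double reward(const RewardConfig *cfg, int at_goal, int hit_wall)
{
    if (at_goal)  return cfg->goal;
    if (hit_wall) return cfg->wall;
    return cfg->step;
}
```

With this layout, the first reward structure described above corresponds to the values {1.0, -1.0, 0.0} and the second to {1.0, -1.0, -1.0}. Changing the bin counts changes the size of the Q-table, so the table has to be reallocated whenever the user picks a new discretization.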
You need not submit anything in writing for this problem. Instead you will be graded based on a demo which will be given during the week of April 17.

Notice that here you have four variables: the velocity of the cart, the position of the cart, the angle between the cart and the pole, and finally the angular velocity of the pole (the rate of change of the angle). You can decide what actions you want to provide; a basic starting point is to have 3 actions (forward acceleration, backward acceleration, no acceleration).
Here is everything you need for modeling the physics.
#include <math.h> /* sin, cos */

/*** Parameters for simulation ***/
#define GRAVITY 9.8
#define MASSCART 1.0
#define MASSPOLE 0.1
#define TOTAL_MASS (MASSPOLE + MASSCART)
#define LENGTH 0.5 /* actually half the pole's length */
#define POLEMASS_LENGTH (MASSPOLE * LENGTH)
#define FORCE_MAG 10.0
#define TAU 0.02 /* seconds between state updates */
#define FOURTHIRDS 1.3333333333333

/*** State variables ***/
float x;         /* the cart's position */
float x_dot;     /* the cart's velocity */
float theta;     /* the pole's angle (in radians, as sin and cos expect) */
float theta_dot; /* the angular velocity of the pole */

/* Update the state variables according to what they would be TAU
   seconds later. The integer action encoding used here (0 = no
   acceleration, 1 = forward, 2 = backward) is only an illustrative
   choice; use whatever representation suits your program. */
void update_state(int action)
{
    float xacc, thetaacc, force, costheta, sintheta, temp;

    if (action == 1)
        force = FORCE_MAG;   /* forward acceleration */
    else if (action == 2)
        force = -FORCE_MAG;  /* backward acceleration */
    else
        force = 0;           /* no acceleration */

    costheta = cos(theta);
    sintheta = sin(theta);
    temp = (force + POLEMASS_LENGTH * theta_dot * theta_dot * sintheta) / TOTAL_MASS;
    thetaacc = (GRAVITY * sintheta - costheta * temp)
        / (LENGTH * (FOURTHIRDS - MASSPOLE * costheta * costheta / TOTAL_MASS));
    xacc = temp - POLEMASS_LENGTH * thetaacc * costheta / TOTAL_MASS;

    /*** Update the four state variables, using Euler's method. ***/
    x += TAU * x_dot;
    x_dot += TAU * xacc;
    theta += TAU * theta_dot;
    theta_dot += TAU * thetaacc;
}
Your task is to implement Q-learning for the pole balancing problem and create an applet that allows the user to watch a simulation. Since showing the simulation will slow down the learning process, provide an option on the applet to show the simulation only once every t trials, where t is a value that can be changed by the user.
Once you've done the basic problem you can decide where to go with this. You will have to discretize the 4 variables to apply reinforcement learning. Try varying the discretization and see how that changes the speed of learning. You can vary the actions or the reward structure. Be creative with other variations, setting up options on the applet to let the user control them.
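To make the four-variable discretization concrete, here is a rough sketch that turns one continuous state into a single index for a Q-table. The Axis struct and the functions num_states and encode_state are hypothetical names, and the ranges and bin counts are placeholders you would tune yourself.

```c
#define N_VARS 4  /* x, x_dot, theta, theta_dot */

typedef struct {
    double lo, hi;  /* range covered by the bins (your choice) */
    int bins;       /* number of bins for this variable */
} Axis;

/* Total number of discrete states: the product of the bin counts. */
int num_states(const Axis axes[N_VARS])
{
    int n = 1;
    for (int i = 0; i < N_VARS; i++)
        n *= axes[i].bins;
    return n;
}

/* Encode a continuous state as one index in 0 .. num_states-1.
   Values outside an axis range are clamped to its first or last bin. */
int encode_state(const Axis axes[N_VARS], const double state[N_VARS])
{
    int index = 0;
    for (int i = 0; i < N_VARS; i++) {
        double frac = (state[i] - axes[i].lo) / (axes[i].hi - axes[i].lo);
        int bin;
        if (frac <= 0.0)
            bin = 0;
        else if (frac >= 1.0)
            bin = axes[i].bins - 1;
        else
            bin = (int)(frac * axes[i].bins);
        if (bin >= axes[i].bins)
            bin = axes[i].bins - 1;  /* guard against rounding */
        index = index * axes[i].bins + bin;
    }
    return index;
}
```

Because the table size is the product of the four bin counts, even small increases in resolution grow the table quickly, which is one reason the discretization affects the speed of learning.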
You need not submit anything in writing for this problem. Instead you will be graded based on a demo which will be given during the week of April 17.