A signed cover sheet for Homework 6 must be submitted with your homework.
where the discount rate is 0.99. What policy will the Q-learning algorithm select here? Is that the policy that you feel is best? Think carefully about this question, and discuss your answer and what it reveals about Q-learning in terms of risk taking. Consider what the reward values might correspond to in the real world.
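As a way to start thinking about the risk-taking question, the following is a minimal sketch of tabular Q-learning on a hypothetical one-step problem. It is not the MDP from this assignment: the two actions ("safe" and "risky"), their reward values, the 1% failure probability, and the helper names `q_update`, `sample_reward`, and `learn` are all assumptions made for illustration. Because each episode ends after one action, the bootstrapped next-state value is zero.

```python
import random


def q_update(q, action, reward, step_size, gamma=0.99, next_value=0.0):
    """One Q-learning update; next_value is max_a' Q(s', a'), 0 at a terminal state."""
    target = reward + gamma * next_value
    q[action] += step_size * (target - q[action])


def sample_reward(action, rng):
    """Hypothetical rewards: 'safe' always pays 1; 'risky' pays 2.5 but
    yields -100 with probability 0.01 (values assumed for illustration)."""
    if action == "safe":
        return 1.0
    return -100.0 if rng.random() < 0.01 else 2.5


def learn(num_rounds=20000, seed=0):
    """Try both actions every round with step size 1/n, so each Q-value
    ends up as the sample average of that action's rewards."""
    rng = random.Random(seed)
    q = {"safe": 0.0, "risky": 0.0}
    for n in range(1, num_rounds + 1):
        for action in q:
            q_update(q, action, sample_reward(action, rng), step_size=1.0 / n)
    return q


q_values = learn()
greedy_action = max(q_values, key=q_values.get)
print(q_values, greedy_action)
```

In this toy setup the expected reward of the risky action is 0.99 * 2.5 + 0.01 * (-100) = 1.475, which exceeds the safe action's 1.0, so the learned greedy choice depends only on the mean and ignores how bad the rare outcome is. Whether that is acceptable depends on what a reward of -100 would stand for in practice.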
You are to create a finite MDP that models this problem. Clearly state any assumptions you make. For example, you may want to assume that the best way to find cans is to actively search for them, but that searching runs down the robot's battery, whereas waiting does not. You can also assume that the agent makes its decisions solely as a function of the battery's energy level.
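One possible way to write such an MDP down is as a transition table, sketched below. Everything beyond the hints in the problem statement is an assumption: the two battery levels (`high`, `low`), the `recharge` action, the rescue penalty when the battery runs out, and every probability and reward value are placeholders, and the helpers `is_valid_mdp` and `expected_reward` are hypothetical names. Your own model may reasonably differ.

```python
# Placeholder parameters; none of these values come from the assignment.
ALPHA = 0.9      # P(battery stays high | search while high)
BETA = 0.6       # P(battery stays low | search while low)
R_SEARCH = 2.0   # expected cans collected while searching
R_WAIT = 1.0     # expected cans collected while waiting
R_RESCUE = -3.0  # penalty when the battery dies and the robot is rescued

# TRANSITIONS[state][action] is a list of (probability, next_state, reward).
TRANSITIONS = {
    "high": {
        "search": [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
        "wait": [(1.0, "high", R_WAIT)],
    },
    "low": {
        "search": [(BETA, "low", R_SEARCH), (1 - BETA, "high", R_RESCUE)],
        "wait": [(1.0, "low", R_WAIT)],
        "recharge": [(1.0, "high", 0.0)],
    },
}


def is_valid_mdp(transitions, tol=1e-9):
    """Check that outcome probabilities sum to 1 and next states are known."""
    for actions in transitions.values():
        for outcomes in actions.values():
            if abs(sum(p for p, _, _ in outcomes) - 1.0) > tol:
                return False
            if any(nxt not in transitions for _, nxt, _ in outcomes):
                return False
    return True


def expected_reward(transitions, state, action):
    """Expected immediate reward of taking `action` in `state`."""
    return sum(p * r for p, _, r in transitions[state][action])


print(is_valid_mdp(TRANSITIONS))
```

Keeping the state to the battery level alone matches the assumption that the agent decides solely as a function of that level, and a table like this makes it easy to check that each state-action pair defines a proper probability distribution.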