HTML document prepared by Brian Blankstein.

Example for reinforcement learning: Playing Checkers

Task: playing checkers (and winning)
Performance: % games won against opponent (human)
Experience: practice against self

If the learner were told the quality of each move, this would be supervised, on-line learning.
Let's assume that you do not have that information - assume that the learner is given only the rules of the game. This is reinforcement learning because the only feedback you get is whether, at the end of the game, you won or lost.

What kind of data?

Discrete features - each of the 64 squares (really 32, since only the dark squares are used) has one of 5 states: empty, red piece, black piece, red king, black king
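
One possible encoding of this data, as a minimal sketch: the names Square and initial_board are hypothetical, and which side occupies which end of the list is an assumption made only for illustration.

```python
from enum import IntEnum


class Square(IntEnum):
    """The 5 states a playable square can be in."""
    EMPTY = 0
    RED = 1
    BLACK = 2
    RED_KING = 3
    BLACK_KING = 4


NUM_PLAYABLE_SQUARES = 32  # only the dark squares are ever occupied


def initial_board():
    """Opening position: 12 pieces per side with 8 empty squares between them.

    The ordering of the list (black first, red last) is an assumption.
    """
    return [Square.BLACK] * 12 + [Square.EMPTY] * 8 + [Square.RED] * 12


board = initial_board()
```

A board is then a list of 32 discrete values, which is the kind of input the learner works from.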

Training experience:

x1, x2, x3... are a sequence of board positions in a game, so we have a time series - each element is closely related to its neighbors.

Each is a board on the player's turn (not the opponent's turn).

Assume the learner plays against itself with no teacher - so it is an active learner that selects its own training experience. The learner must choose between experimenting with novel board states and honing its skill by playing minor variations of lines of play it currently finds most promising.

Training (games against self) vs Testing (games against human experts)

Ideally, it would train against a human expert, but this may not be possible. However, it can continue to improve from its games against human experts if it is given enough opportunities.

Choosing a target function

What do I want to learn?

First thought: a function ChooseMove: B -> M, where B = legal boards and M = legal moves. This is hard to learn because the available feedback says nothing about specific moves, only about the entire game.

Alternate function: V: B -> R, where R is the set of real numbers and V(b) indicates how good board b is. Positive is good, negative is bad, 0 is neutral.

Choose move (using V) by looking at all legal moves and picking the one that takes us to a board of maximum value.
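
The move-selection step can be sketched as follows. This is an illustration only: choose_move and the toy_* helpers are hypothetical names, and the toy game (a board is an integer, a move adds an offset) stands in for real checkers rules.

```python
def choose_move(board, legal_moves, apply_move, value):
    """Return the legal move that leads to the board of maximum value."""
    return max(legal_moves(board),
               key=lambda move: value(apply_move(board, move)))


# Toy stand-in for checkers (hypothetical): boards closer to 10 are better.
def toy_legal_moves(board):
    return [1, -2, 3]


def toy_apply_move(board, move):
    return board + move


def toy_value(board):
    return -abs(board - 10)


# From board 5 the reachable boards are 6, 3 and 8; board 8 has the highest value.
best = choose_move(5, toy_legal_moves, toy_apply_move, toy_value)
```

Passing the value function in as an argument means the same selection code works with the ideal V or with any learned approximation of it.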

How to determine how good a board is (target function = V):

Define V(b) for board b:

  1. b is a winning final board, V(b) = 100
  2. b is a losing final board, V(b) = -100
  3. b is a draw final board, V(b) = 0
  4. b is not a final board, V(b) = V(b') where b' is the best final board state that can be reached from b (assuming optimal play on both sides)

The problem with this definition is that the fourth part is not operational: you cannot compute it without already knowing how to play optimally.
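
To see what the fourth part asks for, here is a sketch on a tiny hand-built game tree. The nested-list encoding is hypothetical: a number is a final board with its V value, and a list holds the boards reachable in one move. Assuming optimal play on both sides, we pick the maximum on our turn and the opponent picks the minimum on theirs.

```python
def true_value(node, our_turn=True):
    """Compute V for a node by searching all the way to the final boards."""
    if isinstance(node, (int, float)):
        return node  # final board: its value is given by rules 1-3
    values = [true_value(child, not our_turn) for child in node]
    return max(values) if our_turn else min(values)


# Three moves for us, each followed by two possible replies from the opponent.
tree = [[100, -100], [0, 0], [-100, 100]]

# The opponent would turn the first and third moves into losses,
# so the best we can guarantee is the draw in the middle.
v = true_value(tree)
```

Even in this toy case every branch has to be followed to the end of the game. For real checkers the tree is far too large to search this way, which is why an approximation is needed.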

The new goal is to find a good operational approximation of V, written Û in these notes (it stands in for a V with a ^).

 

Choose a representation for Û:

There is a tradeoff: pick a representation that is expressive enough that Û can closely approximate V, yet not so expressive that the hypothesis space is too large (which would require more data and time to search for the best choice of Û).

Use a linear combination of the following attributes:

 

Picking good attributes is crucial, and general domain knowledge is used to do this.

Û(b) = w0 + w1 * x1 + w2 * x2 + w3 * x3 + w4 * x4 + w5 * x5 + w6 * x6

Note that you can write the first term as w0 * 1 and for notational convenience we will write it as w0 * x0 where x0=1.
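
A minimal sketch of evaluating this linear form, using the x0 = 1 convention. The function name u_hat and the example weights are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def u_hat(weights, features):
    """Evaluate w0*x0 + w1*x1 + ... + w6*x6 with x0 fixed at 1."""
    x = [1.0] + list(features)  # prepend x0 = 1 so w0 is treated like any other weight
    if len(weights) != len(x):
        raise ValueError("need one weight per attribute, plus w0")
    return sum(w * xi for w, xi in zip(weights, x))


features = (3, 0, 1, 0, 0, 0)  # values of x1..x6 for some board

# With all weights at 0, every board evaluates to 0.
neutral = u_hat([0.0] * 7, features)

# Hypothetical weights: 1 + 2*3 + 3*1 = 10
score = u_hat([1.0, 2.0, -2.0, 3.0, -3.0, 0.0, 0.0], features)
```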

 

The w values are adjustable weights, initially set to 0, so in the beginning all boards are considered neutral.

 

Source for training examples:

<b, Vtrain(b)> = (<x1, x2, x3, x4, x5, x6>, estimated value for V(b))

example: (<3, 0, 1, 0, 0, 0>, 100)

 

The following simple rule (* rule) works well for estimating training values:

Vtrain(b) <- Û(successor(b))

where successor(b) denotes the next board state following b for which it is again the program's turn

If the board is in a final board state, use the true V(b) value.
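
The rule can be sketched on the trace of one finished game. The name training_values and the toy evaluation function are hypothetical; boards is assumed to hold, in order, the attribute vectors of the boards seen on the program's turns, ending with the final board.

```python
def training_values(boards, final_value, u_hat):
    """Pair each board with the current estimate for its successor.

    The last board is final, so it gets the true value instead.
    """
    examples = []
    for i, b in enumerate(boards):
        if i == len(boards) - 1:
            target = final_value        # final board: use the true V(b)
        else:
            target = u_hat(boards[i + 1])  # Vtrain(b) <- Û(successor(b))
        examples.append((b, target))
    return examples


# Toy evaluation function (an assumption for illustration): sum of the attributes.
def toy_u_hat(b):
    return sum(b)


game = [(1, 0), (2, 1), (3, 0)]  # three boards from a game the program won
examples = training_values(game, 100, toy_u_hat)
```

Because successive boards are consecutive entries in the list, successor(b) is simply the next element.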

 

Adjusting the weights:

Using a previously played game, create a training example for each board position, giving a set of training examples: S = {<b, Vtrain(b)>}

We want to modify the weights to improve the fit between Vtrain and Û.  A common approach is to minimize the squared error, summed over the training examples in S:

  E = sum over <b, Vtrain(b)> in S of (Vtrain(b) - Û(b))^2
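
As a sketch, the squared error summed over a training set S can be computed as follows. The function names and the two examples are hypothetical, and the linear form of Û is assumed.

```python
def u_hat(weights, features):
    """Linear evaluation: w0 + w1*x1 + ... + wn*xn."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


def squared_error(weights, examples):
    """Sum of (Vtrain(b) - Û(b))^2 over all examples <b, Vtrain(b)>."""
    return sum((v_train - u_hat(weights, b)) ** 2 for b, v_train in examples)


S = [
    ((3, 0, 1, 0, 0, 0), 100),   # hypothetical example with target 100
    ((0, 3, 0, 1, 0, 0), -100),  # hypothetical example with target -100
]

# With all weights at 0, Û is 0 everywhere, so E = 100^2 + (-100)^2.
E = squared_error([0.0] * 7, S)
```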

Here, we need an algorithm that incrementally refines the weights as new training examples become available and that is robust to errors in these estimated values.  One such algorithm is least mean squares (LMS).  For each example, it adjusts the weights by a small amount to reduce the squared error.  This algorithm can be viewed as performing a stochastic gradient-descent search through the hypothesis space (weight values) to minimize the squared error E.

 

LMS weight update rule:

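For each training example <b, Vtrain(b)>:

  1. Use the current weights to calculate Û(b)
  2. For each weight wi, update it as wi <- wi + c * (Vtrain(b) - Û(b)) * xi

where c is a small constant (for example 0.1) that moderates the size of the update.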
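
A minimal sketch of one LMS step: each weight is nudged in proportion to the current error and to the value of its attribute. The function names and the step size of 0.01 are assumptions chosen for illustration.

```python
def u_hat(weights, features):
    """Linear evaluation: w0 + w1*x1 + ... + wn*xn."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


def lms_update(weights, features, v_train, rate=0.01):
    """Return new weights after one LMS step on a single example."""
    error = v_train - u_hat(weights, features)
    x = [1.0] + list(features)  # x0 = 1 pairs with w0
    return [w + rate * error * xi for w, xi in zip(weights, x)]


weights = [0.0] * 7
b, v_train = (3, 0, 1, 0, 0, 0), 100

before = v_train - u_hat(weights, b)   # error with the initial weights
weights = lms_update(weights, b, v_train)
after = v_train - u_hat(weights, b)    # error after one small step
```

Weights whose attribute is 0 for this board are left unchanged, and the error on this example shrinks after the step. Repeating the step over many examples is what moves the weights toward a smaller squared error.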

Observations – develop intuition

Using Û to make moves:

Given board position b do the following to select the next move: