Task: playing checkers (and winning)
Performance: % games won against opponent (human)
Experience: practice against self
If given the quality of each move, then you would have supervised, on-line learning.
Let's assume that you do not have that information - assume that the learner is
given only the rules of the game. So this is reinforcement learning because the
only feedback you get is knowing whether, at the end of the game, you won or
lost.
What kind of data?
Discrete features - each of 64 (really 32) board positions has one of 5 states: empty, red piece, black piece, red king, black king
Training experience:
x1, x2, x3... are a sequence of board positions in a game, so we have a time series - each element is closely related to its neighbors.
Each is a board on the player's turn (not the opponent's turn).
Assume the learner plays against itself with no teacher - so it is an active learner that selects its own training experience. The learner must choose between experimenting with novel board states and honing its skill by playing minor variations of lines of play it currently finds most promising.
Training (games against self) vs Testing (games against human experts)
Ideally, it would train against a human expert, but this may not be possible. However, it can continue to improve from its games against human experts if it is given enough opportunities.
Choosing a target function
What do I want to learn?
First thought: function (choose move): B -> M, where B = legal boards and M = legal moves. This is hard to learn because the only feedback available is for the entire game, not for specific moves.
Alternate function: V: B -> R, where R is the set of real numbers and V(b) indicates how good board b is. Positive is good, negative is bad, 0 is neutral.
Choose move (using V) by looking at all legal moves and picking the one that takes us to a board of maximum value.
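The move-selection rule above can be sketched in a few lines. This is only an illustration: the function name choose_move, the dictionary mapping each legal move to its resulting board, and the toy value table are all assumptions made for the example, not part of the original design.

```python
from typing import Callable, Dict, Hashable


def choose_move(successors: Dict[Hashable, Hashable],
                value: Callable[[Hashable], float]) -> Hashable:
    """Return the legal move whose resulting board has the highest value.

    successors maps each legal move to the board it leads to (a hypothetical
    representation); value plays the role of V.
    """
    if not successors:
        raise ValueError("no legal moves available")
    return max(successors, key=lambda move: value(successors[move]))


# Toy data: boards are just labels and V is a lookup table (assumed values).
toy_values = {"board_a": -5.0, "board_b": 12.0, "board_c": 3.0}
toy_successors = {"move_1": "board_a", "move_2": "board_b", "move_3": "board_c"}

best = choose_move(toy_successors, lambda board: toy_values[board])
print(best)  # move_2, because board_b has the largest value
```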
How to determine how good a board is (target function = V):
Define V(b) for board b:
1. if b is a final board state that is won, then V(b) = 100
2. if b is a final board state that is lost, then V(b) = -100
3. if b is a final board state that is drawn, then V(b) = 0
4. if b is not a final board state, then V(b) = V(b'), where b' is the best final board state that can be reached from b when both sides play optimally until the end of the game
The problem with this definition is that the fourth part is not operational: you cannot compute it without already knowing how to play optimally.
The new goal is to find a good operational approximation of V, written Û in these notes (standing in for V with a hat).
Choose a representation for Û:
There is a tradeoff: the representation must be expressive enough that Û can closely approximate V, yet not so expressive that the hypothesis space becomes too large (which would require more data and more time to search for the best choice of Û).
Use linear combination of the following attributes:
x1 = # black pieces (assume computer is black)
x2 = # red pieces
x3 = # black kings
x4 = # red kings
x5 = # black pieces threatened (in a single move)
x6 = # red pieces threatened (in a single move)
Picking good attributes is crucial, and general domain knowledge is used to do this.
Û(b) = w0 + w1 * x1 + w2 * x2 + w3 * x3 + w4 * x4 + w5 * x5 + w6 * x6
Note that you can write the first term as w0 * 1 and for notational convenience we will write it as w0 * x0 where x0=1.
w values are adjustable weights, initially set to 0; so, in the beginning, all boards are considered neutral.
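As a minimal sketch of this representation, the evaluation is a dot product between the weight vector and the attribute vector with x0 = 1 prepended. The function name u_hat and the non-zero weights below are illustrative assumptions; only the all-zero initial weights come from the notes.

```python
from typing import Sequence


def u_hat(weights: Sequence[float], features: Sequence[float]) -> float:
    """Linear evaluation w0*x0 + w1*x1 + ... + w6*x6 with x0 = 1."""
    x = [1.0, *features]  # prepend the constant attribute x0 = 1
    if len(weights) != len(x):
        raise ValueError("expected one weight per attribute plus w0")
    return sum(w * xi for w, xi in zip(weights, x))


# Attributes <x1, ..., x6> for a board with 3 black pieces and 1 black king.
features = [3, 0, 1, 0, 0, 0]

# Initial weights are all 0, so every board is scored as neutral.
zero_weights = [0.0] * 7
print(u_hat(zero_weights, features))  # 0.0

# Hypothetical weights, chosen only to show the arithmetic: 1*3 + 2*1 = 5.0
example_weights = [0.0, 1.0, -1.0, 2.0, -2.0, -0.5, 0.5]
print(u_hat(example_weights, features))  # 5.0
```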
Source for training examples:
<b, Vtrain(b)> = (<x1, x2, x3, x4, x5, x6>, estimated value for V(b))
example: (<3, 0, 1, 0, 0, 0>, 100)
The following simple rule (the * rule) works well for estimating training values:
Vtrain(b) <- Û(successor(b))
where successor(b) denotes the next board state following b for which it is again the program's turn
If the board is in a final board state, use the true V(b) value.
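One way to turn a finished game into training pairs under this rule is sketched below. It assumes the game is stored as the list of boards seen on the program's turns, each already reduced to its attribute tuple, and that the current Û is passed in as a function. The names training_values and toy_estimate are hypothetical, and the toy estimate (black pieces minus red pieces) merely stands in for the learned Û.

```python
def training_values(boards, final_value, estimate):
    """Pair each board with Vtrain(b).

    boards      -- boards at the program's turns, in the order played
    final_value -- true V of the last board (e.g. 100, -100 or 0)
    estimate    -- the current approximation, playing the role of Û
    """
    examples = []
    last = len(boards) - 1
    for t, board in enumerate(boards):
        if t == last:
            target = final_value             # final state: use the true value
        else:
            target = estimate(boards[t + 1])  # Vtrain(b) <- Û(successor(b))
        examples.append((board, target))
    return examples


def toy_estimate(board):
    """Stand-in for Û: black pieces minus red pieces (assumed, not learned)."""
    return board[0] - board[1]


# A two-board game trace that ends in a win for black.
game = [(3, 1, 1, 0, 0, 1), (3, 0, 1, 0, 0, 0)]
examples = training_values(game, 100, toy_estimate)
print(examples)
```

The first board is labelled with the estimate for its successor (3 here), and the last board is labelled with the true outcome, 100.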

Adjusting the weights:
Using a previously played game, create one training example for each board position, giving the set of training examples: S = {<b, Vtrain(b)>}
We want to modify weights to improve the fit between Vtrain and Û. Common approach is to minimize squared error:
E = sum over all <b, Vtrain(b)> in S of (Vtrain(b) - Û(b))^2
Here, we need an algorithm that incrementally refines the weights as new training examples become available and that is robust to errors in these estimated values. One such algorithm is least mean squares (LMS). For each example, it adjusts the weights by a small amount to reduce the squared error. This algorithm can be viewed as performing a stochastic gradient-descent search through the hypothesis space (weight values) to minimize the squared error E.
LMS weight update rule:
For each <b, Vtrain(b)>
use current weights to calculate Û(b)
for each weight wi, wi <- wi + N * (Vtrain(b) - Û(b)) * xi / Z
N is a small constant (e.g. 0.1) that moderates the size of the update
Z = x0 + x1 + ... + xn is a normalization factor, where n is the number of attributes. Recall that x0 = 1.
the size of N could be reduced over time
if Vtrain(b) = Û(b), then no weights will be changed
if Vtrain(b) > Û(b), then the estimate is too low, so each weight will be increased in proportion to its feature value, and this will raise Û(b) (and the opposite if Vtrain(b) < Û(b))
if xi = 0, then the weight wi will not change. The only weights updated are those whose features actually occur on the board.
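The update rule, including the normalization factor Z, can be written as a short function. This is a sketch under the notes' own assumptions (count-valued attributes, so Z is at least 1 because x0 = 1); the name lms_update and the default rate of 0.1 for N are illustrative choices.

```python
def lms_update(weights, features, v_train, rate=0.1):
    """Return new weights after one LMS step on a single training example.

    weights  -- [w0, w1, ..., wn]
    features -- [x1, ..., xn]; x0 = 1 is added here
    v_train  -- the training value Vtrain(b)
    rate     -- the small constant N
    """
    x = [1.0, *features]
    estimate = sum(w * xi for w, xi in zip(weights, x))  # current Û(b)
    error = v_train - estimate
    z = sum(x)  # Z = x0 + x1 + ... + xn; at least 1 for count-valued features
    return [w + rate * error * xi / z for w, xi in zip(weights, x)]


# One step from all-zero weights on the example (<3, 0, 1, 0, 0, 0>, 100).
weights = [0.0] * 7
weights = lms_update(weights, [3, 0, 1, 0, 0, 0], 100)
print(weights)
```

Here the error is 100 and Z = 5, so each weight moves by 2 * xi: w0 and w3 rise to about 2, w1 to about 6, and the weights of the zero-valued features stay at 0.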
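Given board position b, do the following to select the next move:
For each legal move i, let bi be the board that results from the program's move, and let bij be the board that results when the opponent answers with its move j. Let Û(bi) = min over j of Û(bij).
This minimum corresponds to the best reply for the opponent (from the opponent's perspective) if the program makes move i.
Then pick the move i that takes you to the board bi for which Û(bi) is maximized.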
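This two-step lookahead is a one-ply minimax, sketched below. It assumes the opponent's possible replies have already been generated and stored per program move, and that every move has at least one reply. The name choose_move_with_lookahead and the toy data are hypothetical; in the toy data each board is represented directly by its estimated value.

```python
def choose_move_with_lookahead(replies, estimate):
    """Pick the program move whose worst-case reply board scores highest.

    replies  -- maps each program move i to the list of boards b_ij the
                opponent can reach after it (assumed non-empty per move)
    estimate -- the current approximation, playing the role of Û
    """
    def worst_case(move):
        # Û(b_i) = min over opponent replies j of Û(b_ij)
        return min(estimate(board) for board in replies[move])

    return max(replies, key=worst_case)


# Toy data: each board is represented by its own estimated value.
toy_replies = {
    "i1": [4.0, -2.0],  # tempting, but the opponent can force -2.0
    "i2": [1.0, 0.5],   # worst case is 0.5
}
chosen = choose_move_with_lookahead(toy_replies, lambda board: board)
print(chosen)  # i2
```

Move i1 offers the single best board, but move i2 is chosen because its worst case (0.5) is better than the worst case of i1 (-2.0).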
