Concept Learning: inferring a Boolean-valued function from training examples of its input and output (supervised learning)
-label is + or – (boolean)
-things are described by their properties
ex. Regarding the property of being a mammal: +dog, +cat, -frog
Notation:
Target concept c: X -> {0, 1}, where X is the set of all possible examples (instances)
X is the input domain
C is a set of all possible target functions (called concept class)
Note that X and C are generally very large sets
If c(x) = 0, x is a negative example
If c(x) = 1, x is a positive example
Example from text: Determining what day Aldo wants to play his favorite water sport.
Each example (day) is specified by attributes:
Sky (sunny, cloudy, rainy)
Temperature (warm, cold)
Humidity (normal, high)
Wind (strong, weak)
Water (warm, cool)
Forecast (same, change)
So, |X| = 3*2*2*2*2*2 = 96 possible days
Target concept: "days when Aldo enjoys his favorite water sport"
One possible concept class is the set of conjunctions. Ex: (Temp=cold)^(Humidity=high)
Book denotes a concept c from the class of conjunctions by <v1, v2, v3, v4, v5, v6>, where vi is either a specific value for that attribute (meaning it is the only acceptable value), "?" (any value is acceptable), or {} (no value is acceptable).
Example: Target in the example above would be denoted by <?, cold, high, ?, ?, ?>
Note: if any attribute has the constraint {}, the hypothesis accepts no example at all, so we treat every such hypothesis as the single target concept {}.
For today, we assume that the hypothesis space H = C.
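What is |H|? 4*3*3*3*3*3 + 1 = 973 (each attribute takes one of its values or "?", plus the single all-negative hypothesis {})
An alternate view: a hypothesis h: X -> {0, 1} in H can be identified with the set of examples it labels positive.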
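The two counts above can be checked by brute force. This is only a sketch: the dictionary ATTRIBUTES and the helper matches are names made up here, and each hypothesis is identified with its set of positive examples, as in the alternate view.

```python
from itertools import product

# Attribute values copied from the list above (names are illustrative).
ATTRIBUTES = {
    "Sky": ["sunny", "cloudy", "rainy"],
    "Temperature": ["warm", "cold"],
    "Humidity": ["normal", "high"],
    "Wind": ["strong", "weak"],
    "Water": ["warm", "cool"],
    "Forecast": ["same", "change"],
}

def matches(h, x):
    """A conjunction h accepts x if every constraint is '?' or equals x's value."""
    return all(c == "?" or c == v for c, v in zip(h, x))

# X: every combination of attribute values.
instances = list(product(*ATTRIBUTES.values()))

# Every slot of a conjunction holds one of the attribute's values or '?'.
conjunctions = list(product(*[vals + ["?"] for vals in ATTRIBUTES.values()]))

# View each hypothesis as its set of positive examples; the empty set
# stands for the all-negative hypothesis {}.
positive_sets = {frozenset(x for x in instances if matches(h, x))
                 for h in conjunctions}
positive_sets.add(frozenset())

num_instances = len(instances)
num_hypotheses = len(positive_sets)

# The example concept <?, cold, high, ?, ?, ?> from above.
example_h = ("?", "cold", "high", "?", "?", "?")
num_accepted = sum(matches(example_h, x) for x in instances)
print(num_instances, num_hypotheses, num_accepted)
```

Counting distinct positive sets rather than distinct syntactic tuples is what makes all hypotheses containing {} collapse into one.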
Another example: Assume a 3 by 3 grid.
1 2 3
a * * *
b * * *
d * * *
X = {a1, a2, a3, b1, b2, b3, d1, d2, d3}
C = H = axis-aligned rectangles with one corner at upper left (a1)
C = H = { {}, {a1}, {a1,a2}, {a1,a2,a3}, {a1,b1}, …{a1, a2, a3, b1, b2, b3, d1, d2, d3}}
|C| = |H| = 10
Define the relation more_general_than = {(h1,h2) | h1 is a superset of h2}
Define the relation more_specific_than = {(h1,h2) | h1 is a subset of h2}
These are partial orders, so we can draw them as a Hasse diagram:
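A small sketch of the grid example, assuming hypotheses are stored as frozensets of cell names. The helpers rectangle, more_general_than, and hasse_edges are illustrative names, not from the text. It enumerates H and lists the edges a Hasse diagram of the ordering would contain.

```python
ROWS, COLS = "abd", "123"  # row and column labels as in the grid above

def rectangle(n_rows, n_cols):
    """Cells of the rectangle covering the first n_rows rows and n_cols columns."""
    return frozenset(r + c for r in ROWS[:n_rows] for c in COLS[:n_cols])

# H = the empty hypothesis plus the nine rectangles with a corner at a1.
H = [frozenset()] + [rectangle(i, j) for i in range(1, 4) for j in range(1, 4)]

def more_general_than(h1, h2):
    """h1 is more general than h2 when h1 is a superset of h2."""
    return h1 >= h2

def hasse_edges(hyps):
    """Pairs (lo, hi) where hi is strictly more general and nothing lies between."""
    return [(lo, hi) for lo in hyps for hi in hyps
            if lo < hi and not any(lo < mid < hi for mid in hyps)]

edges = hasse_edges(H)
for lo, hi in edges:
    print(sorted(lo), "->", sorted(hi))
```

Two hypotheses such as {a1,a2,a3} and {a1,b1} are not comparable in either direction, which is why the order is only partial.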

Candidate elimination algorithm – list all hypotheses in H that are consistent with the training examples.
Definition: h is consistent with set T of training examples if and only if h(x) = c(x) for all <x, c(x)> in T
Let's do an example.
Training data T = {<d3, ->, <a2, +>, <d1, +>}
d3 negative: h9 predicts positive, so it is too general and is eliminated
a2 positive: h0, h1, h2, and h4 predict negative, so they are too specific and are eliminated. If you simulate the candidate elimination algorithm given below, you will find that both h3 and h5 are initially placed in S and that h5 is then removed in the final step of the algorithm.

The items not crossed out are the version space:
VS_{H,T} = {h in H | h is consistent with every example in T}
If H is finite, you could maintain VS_{H,T} by initially letting VS_{H,{}} be a list of all the elements of H. Then, for each new labeled example, remove the hypotheses that are inconsistent with it. VS_{H,T} is useful because, for a new unlabeled example x, we could predict c(x) by voting:
Let VS+ be the set of h in VS with h(x) = +
Let VS- be the set of h in VS with h(x) = -
If |VS+| > |VS-| predict c(x) = +
If |VS+| < |VS-| predict c(x) = -
Else flip a fair coin and predict + iff heads.
Nice idea, but for most interesting problems |H| is exponential in the number of attributes, so this is not computationally feasible.
Goal: find a compact representation for VS_{H,T} that we can update and use for predictions efficiently
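For a tiny H such as the grid example, the list-and-vote scheme is easy to write out. A minimal sketch, assuming hypotheses are frozensets of cells and using only the first two training examples from the walkthrough. The names version_space and predict are made up here.

```python
import random

ROWS, COLS = "abd", "123"

def rectangle(n_rows, n_cols):
    return frozenset(r + c for r in ROWS[:n_rows] for c in COLS[:n_cols])

H = [frozenset()] + [rectangle(i, j) for i in range(1, 4) for j in range(1, 4)]

def version_space(hyps, examples):
    """Keep the hypotheses that agree with every labeled example (x, label)."""
    return [h for h in hyps if all((x in h) == label for x, label in examples)]

def predict(vs, x, rng=random):
    """Majority vote of the version space; a fair coin breaks ties."""
    pos = sum(x in h for h in vs)
    neg = len(vs) - pos
    if pos != neg:
        return pos > neg
    return rng.random() < 0.5

T = [("d3", False), ("a2", True)]  # first two examples of the walkthrough
VS = version_space(H, T)
print(len(VS), predict(VS, "b2"), predict(VS, "a3"))
```

The cost is one pass over all of H per example, which is exactly what becomes infeasible when H is large.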
Definition for Most General Boundary G and Most Specific Boundary S:
Define G = {g in H | (g is consistent with T) and (there is no g' in H that is strictly more general than g and consistent with T)}
Likewise, S = {s in H | (s is consistent with T) and (there is no s' in H that is strictly more specific than s and consistent with T)}
Version Space Representation Theorem (2.1): Given that H=C and that G and S are always well defined, the following holds:
Let c: X -> {0,1} be any concept in C, and let T = {<x, c(x)>} be an arbitrary sample. Then
VS_{H,T} = {h in H | there is an s in S and a g in G such that s is a subset of h and h is a subset of g}
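The theorem can be checked directly on the grid example. A sketch under the same assumptions as before (hypotheses as frozensets, first two training examples only). The helpers boundaries and between are illustrative names.

```python
ROWS, COLS = "abd", "123"

def rectangle(n_rows, n_cols):
    return frozenset(r + c for r in ROWS[:n_rows] for c in COLS[:n_cols])

H = [frozenset()] + [rectangle(i, j) for i in range(1, 4) for j in range(1, 4)]

def version_space(hyps, examples):
    return [h for h in hyps if all((x in h) == label for x, label in examples)]

def boundaries(vs):
    """G = maximal (most general) members of vs, S = minimal (most specific)."""
    G = [h for h in vs if not any(h < other for other in vs)]
    S = [h for h in vs if not any(other < h for other in vs)]
    return G, S

def between(hyps, S, G):
    """Hypotheses sandwiched between some s in S and some g in G."""
    return [h for h in hyps
            if any(s <= h for s in S) and any(h <= g for g in G)]

T = [("d3", False), ("a2", True)]
VS = version_space(H, T)
G, S = boundaries(VS)
print([sorted(g) for g in G], [sorted(s) for s in S])
print(set(between(H, S, G)) == set(VS))
```

Here the boundaries are computed from the full version space only to test the theorem; the point of the representation is that G and S can be stored without ever listing VS.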
When the third training example, <d1, +>, is added to the diagram above, the following happens:
h8: inconsistent with x, so remove it (it is on the most general boundary but still not general enough)
h2: its minimal generalizations are h5 and h6, but neither is consistent with x
h4: its minimal generalization is h7, which has already been placed in S
How to maintain G and S using Candidate Elimination Algorithm:
Initialize G to the most general hypothesis in H and S to the most specific
Repeat until G = S and |G| = |S| = 1
  For each training example <x, c(x)> do
    if (c(x) == 1)
      remove from G any hypothesis inconsistent with x
      for each s in S not consistent with x
        remove s from S
        add to S all minimal generalizations h of s such that:
          1. h is consistent with x
          2. for some g in G, g is at least as general as h
      remove from S any hypothesis that is more general than another hypothesis in S
    if (c(x) == 0)
      exactly like above except G <--> S and generalization <--> specialization
Try simulating this algorithm on the example above and you will see where all of the steps come into play.
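One way to simulate the algorithm on the grid example is sketched below. It assumes H is small enough to list explicitly as frozensets, so minimal generalizations and specializations are found by searching H rather than by a syntactic rule. All function names are made up for this sketch, and it makes a single pass over T = {<d3, ->, <a2, +>, <d1, +>}.

```python
ROWS, COLS = "abd", "123"

def rectangle(n_rows, n_cols):
    return frozenset(r + c for r in ROWS[:n_rows] for c in COLS[:n_cols])

H = [frozenset()] + [rectangle(i, j) for i in range(1, 4) for j in range(1, 4)]

def minimal_generalizations(s, x, hyps):
    """Smallest supersets of s in hyps that label x positive."""
    cands = [h for h in hyps if s <= h and x in h]
    return [h for h in cands if not any(o < h for o in cands)]

def minimal_specializations(g, x, hyps):
    """Largest subsets of g in hyps that label x negative."""
    cands = [h for h in hyps if h <= g and x not in h]
    return [h for h in cands if not any(h < o for o in cands)]

def candidate_elimination(hyps, examples):
    S = {frozenset()}          # assumes the empty hypothesis is in hyps
    G = {max(hyps, key=len)}   # assumes a unique most general hypothesis
    for x, positive in examples:
        if positive:
            G = {g for g in G if x in g}
            new_S = set()
            for s in S:
                if x in s:
                    new_S.add(s)
                else:
                    new_S.update(h for h in minimal_generalizations(s, x, hyps)
                                 if any(h <= g for g in G))
            # drop members that are more general than another member
            S = {s for s in new_S if not any(o < s for o in new_S)}
        else:
            S = {s for s in S if x not in s}
            new_G = set()
            for g in G:
                if x not in g:
                    new_G.add(g)
                else:
                    new_G.update(h for h in minimal_specializations(g, x, hyps)
                                 if any(s <= h for s in S))
            # drop members that are more specific than another member
            G = {g for g in new_G if not any(g < o for o in new_G)}
    return S, G

T = [("d3", False), ("a2", True), ("d1", True)]
S, G = candidate_elimination(H, T)
print([sorted(s) for s in S], [sorted(g) for g in G])
```

Searching H for the minimal generalizations is itself linear in |H|, so this sketch shows the bookkeeping of G and S, not the efficiency that a compact representation is meant to buy.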
Will this algorithm converge to the correct hypothesis? Yes, as long as there are no errors in the training data and H=C.
These are both very strong conditions that are not typically met in practice, which reduces the applicability of this algorithm. However, it provides a nice way to view search and is helpful in designing algorithms.
Another important issue: efficiency in updating S for + examples and G for - examples. In general G and/or S can be exponential in the number of bits to encode the training examples. However, in some special cases either S or G can be efficiently maintained.
We now consider a special case in which S can be efficiently maintained. Let's consider the EnjoySport problem:
Here H is the class of conjunctions. Let's consider what is needed to maintain S.
Find-S algorithm:
1. Initialize S to the most specific hypothesis in H (Ø for EnjoySport)
2. For each instance x
     if c(x) = +
       for each attribute constraint ai in S
         if ai is not satisfied by x
           replace ai by the next more general constraint that is satisfied by x
Note: if c(x) = -, nothing needs to be done.
Here |S| = 1, and hence if C = H and there is no noise, a negative example will not change it.
Let's now go through an example. Initially S = {Ø}
x1=<sunny, warm, normal, strong, warm, same>, +
S={<sunny, warm, normal, strong, warm, same>}
x2=<sunny, warm, high, strong, warm, same>, +
S={<sunny, warm, ?, strong, warm, same>}
x3=<rainy, cold, high, strong, warm, change>, -
S={<sunny, warm, ?, strong, warm, same>}
x4=<sunny, warm, high, strong, cool, change>, +
S={<sunny, warm, ?, strong, ?, ?>}
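The trace above can be reproduced with a short sketch of Find-S for conjunctions. Assumptions made here: examples are tuples of attribute values paired with a Boolean label, None stands for the most specific hypothesis Ø, and find_s is a name chosen for this sketch.

```python
def find_s(examples):
    """Return the most specific conjunction consistent with the positive examples."""
    h = None  # None plays the role of Ø: it accepts no example
    for x, positive in examples:
        if not positive:
            continue  # negative examples are ignored
        if h is None:
            h = list(x)  # first positive example: copy its values
        else:
            # relax every constraint that x violates to '?'
            h = [hc if hc == xv else "?" for hc, xv in zip(h, x)]
    return h

examples = [
    (("sunny", "warm", "normal", "strong", "warm", "same"), True),
    (("sunny", "warm", "high", "strong", "warm", "same"), True),
    (("rainy", "cold", "high", "strong", "warm", "change"), False),
    (("sunny", "warm", "high", "strong", "cool", "change"), True),
]
S = find_s(examples)
print(S)
```

Each positive example costs one pass over the attributes, which is why S is cheap to maintain in this special case.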
We know that this hypothesis will predict negative unless x1-x4 deductively imply that an example is positive.
A learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances.
Suppose for EnjoySport we use some H in which all 2^|X| = 2^96 ≈ 10^28 possible target concepts are defined.
You can prove that every example x not in the training set will be classified + by exactly half of the elements of VS_{H,T} and - by the other half.
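The half-and-half claim is easy to see on a toy domain. A sketch assuming a made-up four-element X (2^96 concepts cannot be enumerated) with H taken to be every subset of X:

```python
from itertools import combinations

X = ["x1", "x2", "x3", "x4"]  # toy domain, not EnjoySport

# Unbiased H: every subset of X is a hypothesis, so |H| = 2^|X|.
H = [frozenset(c) for r in range(len(X) + 1) for c in combinations(X, r)]

T = {"x1": True, "x2": False}  # labeled training examples

# Version space: hypotheses that agree with every training label.
VS = [h for h in H if all((x in h) == label for x, label in T.items())]

# For each unseen instance, count the hypotheses in VS that label it +.
votes = {x: sum(x in h for h in VS) for x in X if x not in T}
print(len(H), len(VS), votes)
```

Pairing each h in VS with the hypothesis that differs from it only on x gives the general argument: both are consistent with T, and they vote opposite ways on x.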
Book defines inductive bias as the minimal set B of assertions such that the target concept is...
Decision Tree
ex. from EnjoySport

Any Boolean function can be represented by a sufficiently large decision tree, so all 10^28 possible hypotheses can be represented.
So a very powerful representation is achieved.
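To illustrate why a large enough tree can represent any labeling, here is a sketch that builds a complete tree testing every attribute in turn. It assumes Boolean-valued attributes for simplicity (EnjoySport's Sky has three values). The names build_tree, classify, and xor3 are made up, and the tree is not pruned in any way.

```python
from itertools import product

def build_tree(f, n, prefix=()):
    """Complete decision tree for f over n Boolean attributes, as nested dicts."""
    if len(prefix) == n:
        return f(prefix)  # leaf: the label of this full assignment
    # internal node: branch on the value of attribute number len(prefix)
    return {v: build_tree(f, n, prefix + (v,)) for v in (False, True)}

def classify(tree, x):
    """Follow the branch chosen by each attribute value of x down to a leaf."""
    for v in x:
        tree = tree[v]
    return tree

def xor3(x):
    """Parity of three attributes: not expressible as a conjunction."""
    return x[0] ^ x[1] ^ x[2]

tree = build_tree(xor3, 3)
agree = all(classify(tree, x) == xor3(x)
            for x in product((False, True), repeat=3))
print(agree)
```

The tree has one leaf per instance, so the price of this expressive power is size exponential in the number of attributes.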