HTML document prepared by Brian Blankstein and Sean Waters.

Concept Learning and Version Spaces

Concept Learning: inferring a Boolean-valued function from training examples of its input and output (supervised learning)

Notation:

Example from text:  Determining what day Aldo wants to play his favorite water sport.

The book denotes a concept c from the class of conjunctions by <v1, v2, v3, v4, v5, v6>, where each vi is either a specific value for that attribute (meaning that it is the only acceptable value) or one of two special symbols: ? (any value is acceptable) or Ø (no value is acceptable).

For today, we assume that the hypothesis space H = C.

What is |H|?  4*3*3*3*3*3 + 1 = 973
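
The count above can be checked with a few lines of Python. This is only a sketch: the per-attribute value counts (3 values for the first attribute, 2 for each of the others) are an assumption inferred from the factors 4 and 3 in the product, and the names `VALUE_COUNTS` and `count_hypotheses` are made up for illustration.

```python
# Assumed number of specific values per attribute (inferred from 4*3*3*3*3*3).
VALUE_COUNTS = [3, 2, 2, 2, 2, 2]


def count_hypotheses(value_counts):
    """Count semantically distinct conjunctive hypotheses.

    Each attribute slot holds either one specific value or "?",
    giving (k + 1) choices for an attribute with k values.
    Every hypothesis containing an empty (Ø) slot labels all
    instances negative, so they are all counted as one hypothesis.
    """
    total = 1
    for k in value_counts:
        total *= k + 1
    return total + 1  # +1 for the single "always negative" hypothesis


print(count_hypotheses(VALUE_COUNTS))
```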

An alternate view: a hypothesis h in H is a function h: X -> {1,0}. Equivalently, h is the set of instances that it classifies as positive.

 

Another example: assume a 3 by 3 grid.

  1 2 3

a * * *

b * * *

d * * *

 

X = {a1, a2, a3, b1, b2, b3, d1, d2, d3}

C = H = axis-aligned rectangles with one corner at upper left (a1)

C = H = { {}, {a1}, {a1,a2}, {a1,a2,a3}, {a1,b1}, …{a1, a2, a3, b1, b2, b3, d1, d2, d3}}

|C| = |H| = 10
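
As a quick check of this count, the hypothesis space can be enumerated directly. The sketch below represents each hypothesis as a frozenset of cell names; the helper names `rectangle` and `hypothesis_space` are illustrative and not part of the notes.

```python
from itertools import product

ROWS = "abd"   # row labels as used in the grid above
COLS = "123"   # column labels


def rectangle(n_rows, n_cols):
    """Cells of the rectangle with a corner at a1 spanning n_rows x n_cols."""
    return frozenset(r + c for r, c in product(ROWS[:n_rows], COLS[:n_cols]))


def hypothesis_space():
    """The empty set plus all nine rectangles anchored at a1."""
    hs = {frozenset()}
    for n_rows in range(1, 4):
        for n_cols in range(1, 4):
            hs.add(rectangle(n_rows, n_cols))
    return hs


H = hypothesis_space()
print(len(H))
for h in sorted(H, key=lambda h: (len(h), sorted(h))):
    print(sorted(h))
```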

Define the relation more_general_than = {(h1,h2) | h1 is a superset of h2}

Define the relation more_specific_than = {(h1,h2) | h1 is a subset of h2}

 

These are partial orders, so we can draw Hasse diagrams for them:

Candidate elimination algorithm – list all hypotheses in H that are consistent with the training examples.

Definition: h is consistent with a set T of training examples if and only if h(x) = c(x) for all <x, c(x)> in T

 

Let's do an example.

Training data T = {<d3, ->, <a2, +>, <d1, +>}

d3 is negative; h9 predicts positive, so it is too general and is eliminated.

a2 is positive; h0, h1, h2, and h4 predict negative, so they are too specific and are eliminated. If you simulate the candidate elimination algorithm found below, you will find that both h3 and h5 are initially placed in S and then h5 is removed in the final step of the algorithm.

The items not crossed out are the version space:

VS_{H,T} = {h in H | h is consistent with each example in T}
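
For a small finite H, this definition can be applied by brute force: list every hypothesis and keep the consistent ones. The sketch below does this for the grid example, using only the two training examples <d3, -> and <a2, +>. The function names are illustrative, and hypotheses are again represented as frozensets of cells.

```python
from itertools import product

ROWS, COLS = "abd", "123"


def hypothesis_space():
    """The empty set plus all rectangles with a corner at a1."""
    hs = {frozenset()}
    for nr in range(1, 4):
        for nc in range(1, 4):
            hs.add(frozenset(r + c for r, c in product(ROWS[:nr], COLS[:nc])))
    return hs


def consistent(h, examples):
    """h is consistent if it labels every training example correctly."""
    return all((x in h) == label for x, label in examples)


def version_space(H, examples):
    """All hypotheses in H that are consistent with the examples."""
    return {h for h in H if consistent(h, examples)}


T = [("d3", False), ("a2", True)]
VS = version_space(hypothesis_space(), T)
for h in sorted(VS, key=lambda h: (len(h), sorted(h))):
    print(sorted(h))
```

With these two examples, five of the ten hypotheses survive: every rectangle that is at least two columns wide, except the full grid.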

 

If H is finite, you could maintain VS_{H,T} by initially letting VS_{H,{}} contain a list of all the elements of H.  Then, for each new labeled example, remove the hypotheses that are inconsistent with it.  VS_{H,T} is useful because, for some new unlabeled example x, we could predict c(x) by:

Nice idea, but for most interesting problems |H| is exponential in the number of attributes, so this is not computationally feasible.

Goal:  find a compact representation for VS_{H,T} that can be updated and used to make predictions efficiently

 

Definition for Most General Boundary G and Most Specific Boundary S:

Define G = {g in H | (g is consistent with T) and (there is no g' in H that is strictly more general than g and consistent with T)}

Likewise, define S = {s in H | (s is consistent with T) and (there is no s' in H that is strictly more specific than s and consistent with T)}

Version Space Representation Theorem (2.1): Given that H=C and that G and S are always well defined the following holds:

            Let c:X --> {0,1} for any c in C. Let T be an arbitrary sample T = {<x,c(x)>}  then,
           
VS_{H,T} = {h in H | there is an s in S and a g in G such that s is a subset of h and h is a subset of g}
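
The theorem can be checked numerically on the grid example. In the sketch below, S and G are computed as the minimal and maximal elements of the brute-force version space for the examples <d3, -> and <a2, +>, and the set described by the theorem is then compared against it. All function names are illustrative.

```python
from itertools import product

ROWS, COLS = "abd", "123"


def hypothesis_space():
    """The empty set plus all rectangles with a corner at a1."""
    hs = {frozenset()}
    for nr in range(1, 4):
        for nc in range(1, 4):
            hs.add(frozenset(r + c for r, c in product(ROWS[:nr], COLS[:nc])))
    return hs


def version_space(H, examples):
    return {h for h in H if all((x in h) == lab for x, lab in examples)}


def boundaries(VS):
    """S: no strictly more specific member; G: no strictly more general one."""
    S = {h for h in VS if not any(o < h for o in VS)}  # '<' is proper subset
    G = {h for h in VS if not any(o > h for o in VS)}  # '>' is proper superset
    return S, G


H = hypothesis_space()
VS = version_space(H, [("d3", False), ("a2", True)])
S, G = boundaries(VS)

# The set described by the representation theorem.
between = {h for h in H if any(s <= h for s in S) and any(h <= g for g in G)}
print(between == VS)
```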

 

When the third training example <d1, -> is added to the diagram above, the following happens:

 

How to maintain G and S using Candidate Elimination Algorithm:

Repeat until G = S with |G| = |S| = 1:

For each training example <x, c(x)> do

            if (c(x) == 1)

                        remove from G any hypothesis inconsistent with x

                        for each s in S not consistent with x

                                    remove s from S

                                    add to S all minimal generalizations h of s such that:

                                                1.      h is consistent with x

                                                2.      for some g in G, g is at least as general as h

                        remove from S any hypothesis that is more general than another hypothesis in S

            if (c(x) == 0)

                        exactly like above, except swap G <--> S and generalization <--> specialization

 

Try simulating this algorithm on the example above and you will see where all of the steps come into play.
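
One possible way to simulate the algorithm on the grid example is sketched below. It is a brute-force version that works only because H is small and every hypothesis is a set of cells. The initialization (S = the most specific hypotheses in H, G = the most general) follows the usual textbook presentation and is an assumption here, since the pseudocode above does not state it. Only the examples <d3, -> and <a2, +> are used.

```python
from itertools import product

ROWS, COLS = "abd", "123"


def hypothesis_space():
    """The empty set plus all rectangles with a corner at a1."""
    hs = {frozenset()}
    for nr in range(1, 4):
        for nc in range(1, 4):
            hs.add(frozenset(r + c for r, c in product(ROWS[:nr], COLS[:nc])))
    return hs


def candidate_elimination(H, examples):
    # Assumed initialization: most specific and most general members of H.
    S = {h for h in H if not any(o < h for o in H)}
    G = {h for h in H if not any(o > h for o in H)}
    for x, positive in examples:
        if positive:
            G = {g for g in G if x in g}          # drop inconsistent g
            new_S = set()
            for s in S:
                if x in s:                         # s already consistent
                    new_S.add(s)
                    continue
                cands = {h for h in H if s <= h and x in h}
                minimal = {h for h in cands if not any(o < h for o in cands)}
                new_S |= {h for h in minimal if any(h <= g for g in G)}
            # drop members more general than another member of S
            S = {s for s in new_S if not any(o < s for o in new_S)}
        else:
            S = {s for s in S if x not in s}      # drop inconsistent s
            new_G = set()
            for g in G:
                if x not in g:                     # g already consistent
                    new_G.add(g)
                    continue
                cands = {h for h in H if h <= g and x not in h}
                maximal = {h for h in cands if not any(o > h for o in cands)}
                new_G |= {h for h in maximal if any(s <= h for s in S)}
            # drop members more specific than another member of G
            G = {g for g in new_G if not any(o > g for o in new_G)}
    return S, G


S, G = candidate_elimination(hypothesis_space(), [("d3", False), ("a2", True)])
print("S:", [sorted(s) for s in S])
print("G:", [sorted(g) for g in G])
```

After these two examples, S holds the single rectangle {a1, a2} and G holds the 2-by-3 and 3-by-2 rectangles, which matches the minimal and maximal consistent hypotheses found by exhaustive filtering.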

Correctness of the Candidate Elimination Algorithm

Will this algorithm converge to the correct hypothesis? Yes, as long as there are no errors in the training data and H=C.

These are both very strong conditions that are not typically met in practice, which reduces the applicability of this algorithm. However, it provides a nice way to view learning as search and is helpful in designing algorithms.

Efficiency Issues of the Candidate Elimination Algorithm

Another important issue:  efficiency in updating S for + examples and G for - examples. In general, the size of G and/or S can be exponential in the number of bits needed to encode the training examples. However, in some special cases either S or G can be efficiently maintained.

We now consider a special case in which S can be efficiently maintained. Let's consider the EnjoySport problem:

   Here each hypothesis in H is a conjunction of attribute constraints.  Let's consider what is needed to maintain S.

Find-S algorithm:

   1.  Initialize S to the most specific hypothesis in H (Ø for EnjoySport)

   2.  For each instance x

         if c(x) = +

           For each attribute constraint ai in S

               if ai is not satisfied by x

                  replace ai by the next most general constraint that is satisfied by x

Note: If c(x) = - nothing need be done.

Here |S| = 1, and hence, if C = H and there is no noise, a negative example will never change S.

 

Let's now go through an example.

Initially S= {Ø}

   x1=<sunny, warm, normal, strong, warm, same>, +

     S={<sunny, warm, normal, strong, warm, same>}

   x2=<sunny, warm, high, strong, warm, same>, +

     S={<sunny, warm, ?, strong, warm, same>}

   x3=<rainy, cold, high, strong, warm, change>, -

     S={<sunny, warm, ?, strong, warm, same>}

   x4=<sunny, warm, high, strong, cool, change>, +

     S={<sunny, warm, ?, strong, ?, ?>}

We know that this hypothesis will predict negative unless x1-x4 deductively imply that an example is positive.
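
The trace above can be reproduced with a short Find-S sketch for conjunctive hypotheses. Here `None` stands for the initial all-Ø hypothesis and "?" for "any value"; the function name `find_s` and this encoding are illustrative choices, not taken from the notes.

```python
def find_s(examples):
    """Return the most specific conjunction consistent with the positives.

    `examples` is a list of (instance, label) pairs, where each instance
    is a tuple of attribute values and label is True for + and False for -.
    """
    s = None  # None plays the role of the all-Ø hypothesis
    for x, positive in examples:
        if not positive:
            continue  # negative examples are ignored
        if s is None:
            s = list(x)  # first positive: copy its values exactly
        else:
            # keep a constraint if x satisfies it, otherwise generalize to "?"
            s = [si if si == xi else "?" for si, xi in zip(s, x)]
    return s


training = [
    (("sunny", "warm", "normal", "strong", "warm", "same"), True),
    (("sunny", "warm", "high", "strong", "warm", "same"), True),
    (("rainy", "cold", "high", "strong", "warm", "change"), False),
    (("sunny", "warm", "high", "strong", "cool", "change"), True),
]
print(find_s(training))
```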

 

Inductive Bias

   A learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis

   for classifying any unseen instances.

 

   Suppose for EnjoySport we use some H in which all 2^|X| = 2^96 ≈ 10^28 possible target concepts are defined.

 

   You can prove that every example x not in the training set will be classified + by exactly half of the elements of VS_{H,T} and - by the other half.
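
This claim can be illustrated on a toy instance space, since the full power set for EnjoySport is far too large to enumerate. The sketch below assumes a made-up instance space of four instances, lets H be its power set, and counts the votes of the version space on each unseen instance. The names `power_set` and `positive_votes` are illustrative.

```python
from itertools import combinations


def power_set(X):
    """All subsets of X: the unbiased hypothesis space."""
    X = list(X)
    return [frozenset(c) for r in range(len(X) + 1) for c in combinations(X, r)]


def positive_votes(X, examples):
    """For each unseen instance, count (votes for +, size of version space)."""
    VS = [h for h in power_set(X)
          if all((x in h) == label for x, label in examples)]
    seen = {x for x, _ in examples}
    return {x: (sum(x in h for h in VS), len(VS)) for x in X if x not in seen}


X = ["x1", "x2", "x3", "x4"]                 # toy instance space (assumption)
T = [("x1", True), ("x2", False)]            # toy training set (assumption)
for x, (pos, total) in positive_votes(X, T).items():
    print(x, pos, "of", total, "hypotheses vote +")
```

Every unseen instance splits the version space evenly, so a vote over the version space gives no basis for classifying it either way.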

 

The book defines inductive bias as a minimal set B of assertions such that target concept is...

 

Decision Tree

ex. from EnjoySport

 

Any Boolean function can be represented by a sufficiently large decision tree, so all 10^28 possible hypotheses can be represented.

So a very powerful representation is achieved.