HTML document prepared by Brian Blankstein.

Computational learning theory

Goal: Identify concept classes that are inherently difficult or easy to learn. For a concept class C, we want to characterize the number of training examples necessary to reach a desired accuracy (with respect to the unknown distribution D).

How is the number of examples needed affected if the learner is allowed to pose queries to a teacher versus just passively getting labeled examples drawn from D?

How is the number of examples needed affected by noise in the training sample?

Can one characterize the number of prediction mistakes that the learner will make before it knows the target function?

Can you characterize the inherent computational complexity of classes of learning problems?

Learning theory addresses these kinds of questions.

Definitions:

Sample complexity: number of training examples needed to converge (with a high probability) to a sufficiently accurate hypothesis
Computational complexity: amount of computation time needed (often called time complexity)
Mistake bound: number of training examples misclassified before converging to a perfect hypothesis

We can vary many things:

By doing so we get different learning models, only some of which we'll study here.

PAC Model (Valiant 1984)

This was historically the start of computational learning theory and provides a good starting point.

X - set of all possible instances
C - set of possible target functions (we'll focus on boolean functions, but you can generalize)
D - arbitrary unknown distribution over X
H - class of hypotheses
error[D](h) = prob[x drawn at random from D](c(x) != h(x)), i.e. the probability that h misclassifies an instance drawn from D
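To make the definition of error[D](h) concrete, here is a minimal sketch that computes it exactly on a toy problem. The instance space {0,1}^2, the uniform distribution, and the names true_error, target, and hypothesis are all assumptions chosen for illustration; they do not come from the notes.

```python
from itertools import product

def true_error(c, h, dist):
    """Exact error[D](h): total probability of the instances where h and c disagree.

    dist maps each instance x to its probability under D (assumed to sum to 1).
    """
    return sum(p for x, p in dist.items() if c(x) != h(x))

# Hypothetical toy setup: X = {0,1}^2 with the uniform distribution.
instances = list(product([0, 1], repeat=2))
uniform = {x: 1 / len(instances) for x in instances}

target = lambda x: x[0] == 1 and x[1] == 1   # c = x1 AND x2
hypothesis = lambda x: x[0] == 1             # h = x1

# h and c disagree only on the instance (1, 0), which has probability 1/4.
print(true_error(target, hypothesis, uniform))
```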

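Definition: Consider a concept class C defined over a set of instances X of length n and a learner L using hypothesis space H. C is PAC-learnable by L using H if, for all c in C, distributions D over X, epsilon such that 0 < epsilon < 1/2, and delta such that 0 < delta < 1/2, learner L will with probability at least (1-delta) output a hypothesis h in H such that error[D](h) <= epsilon, in time that is polynomial in 1/epsilon, 1/delta, n, and size(c).
PAC is an acronym for "probably approximately correct" (probably = delta, approximately = epsilon)
Note: this definition implicitly assumes that there is always an h in H which has an error no greater than epsilon.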

Sample Complexity for finite hypothesis spaces (for noise-free data, to PAC learn)

Consistent learner: a learner that outputs a hypothesis that correctly classifies all training examples. Finding a polynomial-time algorithm that achieves this is called the consistency problem.

We now derive a bound on the number of training examples required by any consistent learner.

Let VS be the version space for the given sample and hypothesis space H
Goal: Ensure that for all h in VS, error[D](h) is no greater than epsilon.

Theorem: If H is finite and S is a sequence of m >= 1 iid examples drawn from D and labeled by c, then for any 0<=epsilon<=1/2, the probability that some epsilon-bad (error>epsilon) hypothesis is in the version space VS[H,S] is no greater than |H|e^(-epsilon*m)

PF: Let h1, h2,...hk be the epsilon-bad hypotheses. Consider h1. Its error is greater than epsilon. Hence, using a single example drawn from D, the probability that h1 is eliminated from VS is at least epsilon, and the probability that h1 is not eliminated is at most 1-epsilon. The same holds for h2, h3,...hk.
Since the m examples are independent, the probability that a given hi is still in VS after all m examples are processed is at most (1-epsilon)^m. By the union bound, the probability that at least one of h1,...hk is still in VS is at most k * (1-epsilon)^m.
Note, the number of epsilon-bad hypotheses is at most |H|
So, the probability that (an epsilon-bad hypothesis is consistent with m examples) is at most |H| * (1-epsilon)^m
From the Taylor expansion of e^x, it follows that for 0<=epsilon<=1, (1-epsilon) <= e^(-epsilon), and hence (1-epsilon)^m <= e^(-epsilon*m)
Thus, prob(epsilon-bad hypothesis is consistent with m examples) is at most |H| * e^(-epsilon*m)
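As a sanity check on the theorem, the sketch below enumerates every possible sample on a tiny problem and compares the exact probability that some epsilon-bad hypothesis survives with the bound |H|e^(-epsilon*m). The setup (four instances under a uniform distribution, a threshold target, five threshold hypotheses) is a hypothetical example, not something from the notes.

```python
import math
from itertools import product

X = [0, 1, 2, 3]                                # instances, uniform distribution assumed
target = lambda x: x >= 2                       # hypothetical target concept c
H = [lambda x, t=t: x >= t for t in range(5)]   # threshold hypotheses, |H| = 5

def error(h):
    """True error of h under the uniform distribution on X."""
    return sum(h(x) != target(x) for x in X) / len(X)

def prob_bad_survivor(eps, m):
    """Exact probability that some eps-bad hypothesis is consistent with m iid draws."""
    bad = [h for h in H if error(h) > eps]
    hits = 0
    for sample in product(X, repeat=m):         # all |X|^m equally likely samples
        if any(all(h(x) == target(x) for x in sample) for h in bad):
            hits += 1
    return hits / len(X) ** m

eps, m = 0.3, 8
exact = prob_bad_survivor(eps, m)
bound = len(H) * math.exp(-eps * m)
print(exact, bound)   # the exact probability should not exceed the bound
```

On this example the bound is loose, which is expected: it counts every hypothesis in H as potentially bad and uses the worst-case survival probability for each.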

To get a PAC algorithm, we just need to ensure that |H|e^(-epsilon*m) is at most delta.
Then, with probability at least 1-delta, the hypothesis output is not epsilon-bad.

Taking ln of each side:
ln(|H|e^(-epsilon*m)) <= ln(delta)
ln(|H|) - epsilon*m <= ln(delta)
m*epsilon >= ln|H| - ln(delta)
m >= 1/epsilon * (ln(|H|) + ln(1/delta))
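The bound just derived can be evaluated directly. The following is a small sketch; the function name pac_sample_size and the example numbers are illustrative assumptions.

```python
import math

def pac_sample_size(h_size, epsilon, delta):
    """Smallest integer m with m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    m = (math.log(h_size) + math.log(1 / delta)) / epsilon
    return math.ceil(m)

# Example: |H| = 3^10, epsilon = 0.1, delta = 0.05.
m = pac_sample_size(3 ** 10, 0.1, 0.05)
print(m)
# With this m, |H| * e^(-epsilon*m) is at most delta.
print(3 ** 10 * math.exp(-0.1 * m) <= 0.05)
```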

Thus one way to obtain a PAC algorithm to learn C (using H=C) is to give a poly-time algorithm to find an h in C consistent with a given labeled sample.

Monomials: use find-S
|C|=3^n
ln(3^n) = n*ln(3)
So, need sample of size (1/epsilon)*(n*ln(3)+ln(1/delta))
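The notes name Find-S without spelling it out. Below is a minimal sketch of the usual Find-S procedure for monomials, assuming instances are tuples of n bits and encoding a literal as a pair (i, v) meaning "bit i equals v". This encoding and the function names are assumptions made for illustration.

```python
def find_s(n, examples):
    """Return the most specific monomial consistent with the positive examples.

    examples: list of (x, label) with x a tuple of n bits and label a bool.
    Starts from the conjunction of all 2n literals and drops every literal
    that some positive example contradicts. Negative examples are ignored.
    """
    literals = {(i, v) for i in range(n) for v in (0, 1)}
    for x, label in examples:
        if label:
            literals = {(i, v) for (i, v) in literals if x[i] == v}
    return literals

def predict(literals, x):
    """A monomial is true on x iff every one of its literals is satisfied."""
    return all(x[i] == v for i, v in literals)

# Hypothetical sample labeled by the target x0 AND (NOT x2) over n = 3 bits.
sample = [((1, 0, 0), True), ((1, 1, 0), True),
          ((0, 1, 0), False), ((1, 1, 1), False)]
h = find_s(3, sample)
print(sorted(h))
```

When the target really is a monomial and the data is noise-free, the hypothesis returned this way is also consistent with the negative examples, so it is a consistent learner in the sense used above.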

k-CNF for k constant
CNF conjunctive normal form:
(l1 || l2 ||...|| lk) ^ (l1' || l2' ||...|| lk') ^ ...
Each li is a literal, and each set of parentheses is a clause.
Each literal is any variable or its negation. Each clause contains at most k literals.

Reduce this to monomial problem by creating one variable per possible clause.
There are at most (2n)^k possible clauses.
Then treat it as learning a monomial over at most (2n)^k variables, so need a sample of size:
(1/epsilon)*((2n)^k * ln(3) + ln(1/delta))
Note: could really view as a monotone monomial over new variables, but this just replaces ln(3) with ln(2).
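The reduction can be sketched as follows: enumerate every clause with at most k literals, treat each clause as a new boolean variable, and run the monomial-style elimination over those variables. The literal encoding (i, v) for "bit i equals v" and all function names are assumptions for illustration. Enumerating sets of distinct literals gives somewhat fewer clauses than the (2n)^k upper bound.

```python
from itertools import combinations

def all_clauses(n, k):
    """All non-empty clauses with at most k distinct literals over n variables."""
    literals = [(i, v) for i in range(n) for v in (0, 1)]
    return [c for size in range(1, k + 1) for c in combinations(literals, size)]

def clause_value(clause, x):
    """A clause (disjunction) is true on x iff at least one literal is satisfied."""
    return any(x[i] == v for i, v in clause)

def learn_k_cnf(n, k, examples):
    """Keep exactly the clauses that every positive example satisfies."""
    clauses = all_clauses(n, k)
    for x, label in examples:
        if label:
            clauses = [c for c in clauses if clause_value(c, x)]
    return clauses

def predict_cnf(clauses, x):
    """A CNF is true on x iff all of its clauses are true on x."""
    return all(clause_value(c, x) for c in clauses)

# Hypothetical sample labeled by the 2-CNF target (x0 OR x1) over n = 2 bits.
sample = [((0, 0), False), ((0, 1), True), ((1, 0), True), ((1, 1), True)]
h = learn_k_cnf(2, 2, sample)
print(len(all_clauses(2, 2)), len(h))
```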

Let's consider k-term DNF
DNF disjunctive normal form
T1 || T2 || ... || Tk
Each T is a term which is a monomial
k-term DNF is a DNF formula with at most k terms with no restriction on a term.
|H| <= 3^(nk) since there are k terms with at most 3^n choices for each (positive, negative, absent)
So we know that given a sample of size (1/epsilon)*(n*k*ln(3) + ln(1/delta)), if we can find a k-term DNF consistent with it then we have a PAC algorithm. However, it is believed that the consistency problem for k-term DNF is not solvable in poly time (unless RP=NP)
Does this mean we can't PAC-learn k-term DNF? No. In fact, there's an easy algorithm. By using DeMorgan and the distributive property, it's easy to show that every k-term DNF can be rewritten as a k-CNF, so k-CNF contains k-term DNF, and we have seen that k-CNF is PAC-learnable. Hence, we can PAC-learn C=k-term-DNF by letting H=k-CNF. This shows that it is sometimes necessary (for efficient learning, unless RP=NP) to use a hypothesis space H which is more expressive than C, even when the target is known to be in C.
Now try an unbiased learner, where H contains every boolean function over X, so |H|=2^|X|. For the boolean domain X={0,1}^n we have |X|=2^n, which gives the bound m=(1/epsilon)*(2^n * ln(2) + ln(1/delta)), exponential in n.
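The containment of k-term DNF in k-CNF used above can be illustrated with the distributive property: a disjunction of k terms equals the conjunction of all clauses formed by picking one literal from each term, and each such clause has at most k literals. The sketch below assumes literals are encoded as (i, v) for "bit i equals v"; the function names and the example formula are hypothetical.

```python
from itertools import product

def dnf_to_cnf(terms):
    """Distribute OR over AND: one clause per way of picking a literal from each term."""
    return [tuple(choice) for choice in product(*terms)]

def eval_dnf(terms, x):
    """True iff some term has all of its literals satisfied by x."""
    return any(all(x[i] == v for i, v in term) for term in terms)

def eval_cnf(clauses, x):
    """True iff every clause has at least one literal satisfied by x."""
    return all(any(x[i] == v for i, v in clause) for clause in clauses)

# Hypothetical 2-term DNF over 3 bits: (x0 AND x1) OR ((NOT x0) AND x2).
dnf = [[(0, 1), (1, 1)], [(0, 0), (2, 1)]]
cnf = dnf_to_cnf(dnf)
print(cnf)
```

The resulting CNF can have many more clauses than the DNF has terms, but for constant k the number stays polynomial in n, which is what the reduction needs.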

Agnostic Learning and Inconsistent Hypotheses

Suppose we want to remove any assumption about how H relates to C. So perhaps there is no epsilon-good hypothesis in H.

Agnostic Learning Model: don't assume H contains C. Let S be the training sample of m examples drawn from distribution D, let error[S](h) be the fraction of S that h misclassifies, and let h-best be the hypothesis from H with the lowest error on S.
Goal: Like PAC, but now we want to output a hypothesis h that satisfies error[D](h) <= epsilon + error[S](h-best)
Note: when H contains C, you can always output a consistent hypothesis, and so error[S](h-best)=0

To prove this we use the Hoeffding bound (also called the additive Chernoff bound), which upper-bounds the probability weight in the tail of a binomial distribution. Consider a biased coin whose probability of landing heads is the probability that h misclassifies a random example from D. The m coin flips correspond to m random draws from D.

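Hoeffding bound (error[D](h) is the true error under the distribution D; error[S](h) is the fraction of the m training examples in the sample S that h misclassifies):
prob(error[D](h) > error[S](h)+epsilon) <= e^(-2m * epsilon^2)

so, taking a union bound over all h in H, we want:
prob(there exists an h in H such that: error[D](h) > error[S](h)+epsilon) <= |H|*e^(-2m*epsilon^2) <= delta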

solving for m yields that it suffices to pick m >= (1/(2*epsilon^2)) * (ln|H| + ln(1/delta))
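The agnostic bound can be evaluated the same way as the noise-free one. This is a small sketch with an assumed function name and example numbers; the point to notice is the 1/epsilon^2 dependence instead of 1/epsilon.

```python
import math

def agnostic_sample_size(h_size, epsilon, delta):
    """Smallest integer m with m >= (1/(2*epsilon^2)) * (ln|H| + ln(1/delta))."""
    m = (math.log(h_size) + math.log(1 / delta)) / (2 * epsilon ** 2)
    return math.ceil(m)

# Example: |H| = 3^10, epsilon = 0.1, delta = 0.05.
m = agnostic_sample_size(3 ** 10, 0.1, 0.05)
print(m)
# With this m, |H| * e^(-2*m*epsilon^2) is at most delta.
print(3 ** 10 * math.exp(-2 * m * 0.1 ** 2) <= 0.05)
```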

Sample Complexity for Infinite Hypothesis Space

Consider C = set of all rectangles in a plane or C = set of all halfspaces in a plane
Replace ln|H| by a combinatorial measure of the complexity (or intuitively, degrees of freedom) of H.
Def: a set S of instances is shattered by H iff all 2^|S| possible +/- classifications of S can be represented using H.

Def: The Vapnik-Chervonenkis dimension VCD(H) is the size of the largest finite subset of X that can be shattered by H.

If arbitrarily large finite sets of X can be shattered by H, then VCD(H) = infinity.
Note: VCD(H) <= log[2](|H|), because shattering d instances requires at least 2^d distinct hypotheses from H.
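For a finite instance space and a finite H, shattering and the VC dimension can be checked by brute force. The sketch below is exponential in the number of instances and only meant for tiny examples; the function names and the threshold class used as an example are assumptions.

```python
import math
from itertools import combinations

def is_shattered(points, hypotheses):
    """True iff the hypotheses realize all 2^|points| labelings of the points."""
    labelings = {tuple(h(x) for x in points) for h in hypotheses}
    return len(labelings) == 2 ** len(points)

def vc_dimension(instances, hypotheses):
    """Size of the largest subset of instances shattered by the hypotheses."""
    d = 0
    for size in range(1, len(instances) + 1):
        if any(is_shattered(s, hypotheses) for s in combinations(instances, size)):
            d = size
        else:
            break   # subsets of shattered sets are shattered, so we can stop here
    return d

# Hypothetical class: thresholds h_t(x) = (x >= t) on the instances 1..4.
instances = [1, 2, 3, 4]
thresholds = [lambda x, t=t: x >= t for t in range(1, 6)]
d = vc_dimension(instances, thresholds)
print(d, math.log2(len(thresholds)))   # VCD(H) and log2(|H|)
```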

Examples:
Example 1: Intervals on a real line.
Each h in H is defined by two reals a and b where a <= b.
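For a real number x in X, h(x) = 1 iff a <= x <= b.
VCD(H) = 2 (any two distinct points can be shattered, but for three points x1 < x2 < x3 no interval labels x1 and x3 positive and x2 negative)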
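The interval example can be checked on concrete points. The sketch below uses a finite grid of endpoints as a stand-in for all real intervals, which is an assumption made to keep the enumeration finite. The grid is fine enough to realize every labeling that real intervals can produce on the chosen points.

```python
from itertools import product

def interval(a, b):
    """Hypothesis h(x) = True iff a <= x <= b."""
    return lambda x: a <= x <= b

def achievable_labelings(points, hypotheses):
    """Set of distinct labelings that the hypotheses induce on the points."""
    return {tuple(h(x) for x in points) for h in hypotheses}

# Endpoints taken from a grid that falls between and around the points 1, 2, 3.
grid = [0.5, 1.5, 2.5, 3.5]
H = [interval(a, b) for a, b in product(grid, repeat=2) if a <= b]

two = achievable_labelings([1, 2], H)
three = achievable_labelings([1, 2, 3], H)
print(len(two))     # all 4 labelings of two points are achievable
print(len(three))   # fewer than 8: (True, False, True) is never achievable
```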

Example 2: Let H be halfspaces in a plane.
VCD(H) = 3 (in general, halfspaces in d-dimensional space have VCD(H) = d+1)

Example 3: Perceptron with r inputs (i.e. a halfspace in r-dimensional space)
VCD(H) = r+1

Example 4: Conjunction of exactly 3 boolean literals
VCD(H) = 3

Thm: If you use a consistent learner, then to PAC-learn it suffices to use a sample of size:
m >= (1/epsilon) * (4*log[2](2/delta) + 8*VCD(H)*log[2](13/epsilon))
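This bound can be evaluated directly. The function name vc_sample_size and the example values are assumptions for illustration.

```python
import math

def vc_sample_size(vcd, epsilon, delta):
    """Smallest integer m with
    m >= (1/epsilon) * (4*log2(2/delta) + 8*VCD(H)*log2(13/epsilon))."""
    m = (4 * math.log2(2 / delta) + 8 * vcd * math.log2(13 / epsilon)) / epsilon
    return math.ceil(m)

# Example: VCD(H) = 3, epsilon = 0.1, delta = 0.05.
print(vc_sample_size(3, 0.1, 0.05))
```

Unlike the ln|H| bound, this one stays finite for infinite hypothesis spaces such as intervals or halfspaces, as long as VCD(H) is finite.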
Further, you can prove the following lower bound:

Thm: Consider any C such that VCD(C) >= 2, any learner L, and any 0 < epsilon < 1/8 and 0 < delta < 1/100. Then there exists a distribution D and target concept in C such that if L observes fewer than max(log(1/delta)/epsilon, (VCD(C)-1)/(32*epsilon)) examples, then with probability at least delta, L outputs a hypothesis h where error[D](h) > epsilon.
Note: L can use infinite computation and this still holds (information theoretic bound)
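The lower bound can be evaluated in the same way. The notes do not state the base of the logarithm, so the sketch below assumes the natural log; the function name and example values are also assumptions.

```python
import math

def pac_lower_bound(vcd, epsilon, delta):
    """max(log(1/delta)/epsilon, (VCD(C)-1)/(32*epsilon)), natural log assumed.

    The theorem applies for VCD(C) >= 2, 0 < epsilon < 1/8, 0 < delta < 1/100.
    """
    return max(math.log(1 / delta) / epsilon, (vcd - 1) / (32 * epsilon))

# For a small VC dimension the confidence term dominates ...
print(pac_lower_bound(3, 0.1, 0.005))
# ... and for a large VC dimension the VCD term takes over.
print(pac_lower_bound(2001, 0.1, 0.005))
```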

Very high level intuition

Let P[d](m) be the maximum number of possible divisions of m examples into + and - examples by any H with VCD d.
By induction you can show that P[d](m) = sum[i=0 to d](m choose i) <= (e*m/d)^d = O(m^d), where the inequality holds for m >= d >= 1.
This grows polynomially in m for m > d. Until then (m <= d), P[d](m) = 2^m. Intuitively you can replace |H| by P[d](m).
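The two regimes of P[d](m) are easy to see numerically. The sketch below computes the sum of binomial coefficients and compares it with 2^m and with (e*m/d)^d; the function name growth_bound is an assumption, and math.comb needs Python 3.8 or later.

```python
import math

def growth_bound(m, d):
    """sum_{i=0..d} (m choose i); math.comb returns 0 when i > m."""
    return sum(math.comb(m, i) for i in range(d + 1))

d = 3
for m in (2, 3, 10, 100):
    poly = (math.e * m / d) ** d
    print(m, growth_bound(m, d), 2 ** m, round(poly, 1))
# For m <= d the sum equals 2^m; for m > d it falls below 2^m
# and stays under (e*m/d)^d.
```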