How is the number of examples needed affected if the learner is allowed to pose queries to a teacher versus just passively getting labeled examples drawn from D?
How is the number of examples needed affected by noise in the training sample?
Can one characterize the number of prediction mistakes that the learner will make before it knows the target function?
Can you characterize the inherent computational complexity of classes of learning problems?
Learning theory addresses these kinds of questions.
We can vary many things:
X - set of all possible instances
C - set of possible target functions (we'll focus on boolean functions, but you can generalize)
D - arbitrary unknown distribution over X
H - class of hypotheses
error[D](h) = prob[x drawn from D](c(x)!=h(x)), i.e. the probability that h disagrees with the target c on a random instance drawn according to D
Definition: Consider a concept class C defined over a set of instances X of length n and a learner L using hypothesis H. C is PAC-learnable by L using H if, for all c in C, distributions D over X, epsilon such that 0
We now derive a bound on the number of training examples required by any consistent learner.
Let VS be the version space for the given sample and hypothesis space H
Theorem: If H is finite and D is a sequence of m iid examples from D labeled by c, then for any 0<=epsilon<=1/2, the probability that some epsilon-bad (error>epsilon) hypothesis is in VS[H,D] is no greater than |H|e^(-epsilon*m)
Pf: Let h1, h2,...hk be the epsilon-bad hypotheses. Consider h1. Its error is greater than epsilon. Hence, using a single example from D, the probability that h1 is eliminated from VS is at least epsilon, and the probability that h1 is not eliminated is at most 1-epsilon. Since the m examples are independent, h1 survives all of them with probability at most (1-epsilon)^m. The same holds for h2, h3,...hk
To get a PAC algorithm, we need just to ensure that |H|e^(-epsilon*m) is at most delta
Taking ln of each side:
Thus one way to obtain a PAC algorithm to learn C (using H=C) is to give a poly-time algorithm to find an h in C consistent with a given labeled sample.
Monomials: use find-S
k-CNF for k constant
Reduce this to monomial problem by creating one variable per possible clause.
Let's consider k-term DNF
Agnostic Learning Model: don't assume H contains C. Let h-best be the hypothesis from H with the lowest error on training data D (from distribution D).
To prove this we use the Hoeffding bound (also called the additive Chernoff bound), which upper-bounds the probability weight in the tail of a binomial distribution. Consider a biased coin whose probability of landing heads is the probability that h misclassifies a random example from D. The m coin flips correspond to m random draws from D.
Hoeffding bound:
so we want:
solving for m yields that it suffices to pick m >= (1/(2*epsilon^2)) * (ln|H| + ln(1/delta))
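As a rough numeric illustration of the agnostic bound just stated, here is a minimal sketch. The function name and the example values (|H| = 3^10, epsilon = 0.1, delta = 0.05) are made up for illustration and are not from the notes.

```python
import math

def agnostic_sample_size(hypothesis_count, epsilon, delta):
    """Smallest integer m with m >= (1/(2*epsilon^2)) * (ln|H| + ln(1/delta))."""
    if hypothesis_count < 1 or not (0 < epsilon < 1) or not (0 < delta < 1):
        raise ValueError("need |H| >= 1, 0 < epsilon < 1, 0 < delta < 1")
    total = math.log(hypothesis_count) + math.log(1 / delta)
    return math.ceil(total / (2 * epsilon ** 2))

# Illustrative values only: monomials over n = 10 variables, so |H| = 3^10.
m = agnostic_sample_size(3 ** 10, epsilon=0.1, delta=0.05)
print(m)
```

Note the 1/epsilon^2 dependence: halving epsilon roughly quadruples the sample size, whereas the bound for consistent learners depends only on 1/epsilon.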
Def: The Vapnik-Chervonenkis dimension VCD(H) is the size of the largest finite subset of X that can be shattered by H.
If arbitrarily large finite sets of X can be shattered by H, then VCD(H) = infinity.
Examples:
Example 2: Let H be halfspaces in a plane.
Example 3: Perceptron with r inputs (ie halfspace in r-dimensional space)
Example 4: Conjunction of exactly 3 boolean literals
Thm: If you use a consistent learner, then to PAC-learn it suffices to use a sample of size:
Thm: Consider any C such that VCD(C) >= 2, any learner L, and any 0 < epsilon < 1/8 and 0 < delta < 1/100. Then there exists a distribution D and target concept in C such that if L observes fewer than max(log(1/delta)/epsilon, (VCD(C)-1)/(32*epsilon)) examples, then with probability at least delta, L outputs a hypothesis h where error[D](h) > epsilon.
Note: this definition implicitly assumes that there is always an h in H which has an error no greater than epsilon.
Sample Complexity for finite hypothesis spaces (for noise-free data, to PAC learn)
Consistent learner: outputs a hypothesis that correctly classifies all training examples. The consistency problem is to find a polynomial-time algorithm that achieves this.
Goal: Ensure that for all h in VS, error[D](h) is no greater than epsilon.
So, by the union bound, the probability that (at least one of h1,...hk is in VS after all m examples are processed) is at most k * (1-epsilon)^m
Note, the number of epsilon-bad hypotheses is at most |H|
So, the probability that (an epsilon-bad hypothesis is consistent with m examples) is at most |H| * (1-epsilon)^m
From the Taylor expansion of e^x, it follows that for 0<=epsilon<=1, (1-epsilon) <= e^(-epsilon)
Thus, prob(epsilon-bad hypothesis is consistent with m examples) is at most |H| * e^(-epsilon*m)
That is, with probability at least 1-delta the hypothesis output is not epsilon-bad.
ln(|H|e^(-epsilon*m)) <= ln(delta)
ln(|H|) - epsilon*m <= ln(delta)
m*epsilon >= ln|H| - ln(delta)
m >= 1/epsilon * (ln(|H|) + ln(1/delta))
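The bound just derived is easy to evaluate numerically. A minimal sketch follows; the function name and the example values are assumptions for illustration, not part of the notes.

```python
import math

def pac_sample_size(hypothesis_count, epsilon, delta):
    """Smallest integer m with m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    if hypothesis_count < 1 or not (0 < epsilon <= 1) or not (0 < delta < 1):
        raise ValueError("need |H| >= 1, 0 < epsilon <= 1, 0 < delta < 1")
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / epsilon)

# Illustrative values only: |H| = 3^10, epsilon = 0.1, delta = 0.05.
m = pac_sample_size(3 ** 10, epsilon=0.1, delta=0.05)
# Sanity check against the inequality we started from: |H| * e^(-epsilon*m) <= delta.
assert 3 ** 10 * math.exp(-0.1 * m) <= 0.05
print(m)
```

The dependence on |H| is only logarithmic, which is why even exponentially large hypothesis classes can have polynomial sample complexity.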
|C|=3^n
ln(3^n) = n*ln(3)
So, need sample of size (1/epsilon)*(n*ln(3)+ln(1/delta))
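For monomials the consistent hypothesis can be found by the usual elimination form of Find-S: start with the conjunction of all 2n literals and delete every literal contradicted by a positive example. The sketch below is one possible rendering. The representation (a literal is an (index, value) pair, an example is a tuple of booleans) is an assumption made here, and the sketch assumes noise-free data with a monomial target.

```python
def find_s(examples, n):
    """Return the most specific monomial consistent with the positive examples.

    examples: iterable of (x, label), x a tuple of n booleans.
    The hypothesis is a set of literals (i, v) meaning "x[i] must equal v".
    """
    literals = {(i, v) for i in range(n) for v in (True, False)}
    for x, label in examples:
        if label:
            # Keep only the literals this positive example satisfies.
            literals = {(i, v) for (i, v) in literals if x[i] == v}
    return literals

def predict(literals, x):
    """Evaluate the conjunction of literals on instance x."""
    return all(x[i] == v for (i, v) in literals)
```

Negative examples are never inspected: when the target really is a monomial, every literal of the target survives, so the output can never label a negative example positive.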
CNF conjunctive normal form:
(l1 || l2 ||...|| lk) ^ (l1' || l2' ||...|| lk') ^ ...
Each li is a literal, and each set of parentheses is a clause.
Each literal is any variable or its negation. Each clause contains at most k literals.
There are at most (2n)^k possible clauses.
Then treat it as learning a monomial over at most (2n)^k variables, so need a sample of size:
(1/epsilon)*((2n)^k * ln(3) + ln(1/delta))
Note: could really view as a monotone monomial over new variables, but this just replaces ln(3) with ln(2).
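The reduction can be sketched as follows: enumerate the possible clauses, map each instance to one boolean feature per clause, and learn a monotone conjunction over those features by elimination. All names here are hypothetical. As a simplification, the enumeration uses only clauses over distinct variables, which gives fewer than (2n)^k clauses but is still enough to express any k-CNF.

```python
from itertools import combinations, product

def all_clauses(n, k):
    """All clauses with 1..k literals over distinct variables; a literal is (index, sign)."""
    clauses = []
    for size in range(1, k + 1):
        for idxs in combinations(range(n), size):
            for signs in product((True, False), repeat=size):
                clauses.append(tuple(zip(idxs, signs)))
    return clauses

def expand(x, clauses):
    """New instance: one boolean per clause, true iff x satisfies that clause."""
    return tuple(any(x[i] == s for i, s in c) for c in clauses)

def learn_kcnf(examples, n, k):
    """Keep every clause that is satisfied by all positive examples."""
    clauses = all_clauses(n, k)
    kept = set(range(len(clauses)))
    for x, label in examples:
        if label:
            z = expand(x, clauses)
            kept = {j for j in kept if z[j]}
    return [clauses[j] for j in sorted(kept)]

def predict_kcnf(hypothesis, x):
    """Evaluate the conjunction of the kept clauses on x."""
    return all(any(x[i] == s for i, s in c) for c in hypothesis)
```

For constant k the number of clause features is polynomial in n, so both the expansion and the elimination run in polynomial time.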
DNF disjunctive normal form
T1 || T2 || ... || Tk
Each T is a term which is a monomial
k-term DNF is a DNF formula with at most k terms with no restriction on a term.
|H| <= 3^(nk) since there are k terms with at most 3^n choices for each (positive, negative, absent)
So we know that given a sample of size (1/epsilon)*(n*k*ln(3) + ln(1/delta)), if we can find a k-term DNF consistent with it then we have a PAC algorithm. However, it is believed that the consistency problem for k-term DNF is not solvable in poly time (unless RP=NP)
Does this mean we can't PAC-learn k-term DNF? No. In fact, there's an easy algorithm. By using DeMorgan and the distributive property, it's easy to show k-CNF contains k-term DNF and that k-CNF is PAC-learnable. Hence, we can PAC-learn C=k-term-DNF by letting H=k-CNF. This proves that sometimes it is necessary to use a hypothesis space H which is more expressive than C even when the target is known to be in C.
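The containment of k-term DNF in k-CNF can be checked mechanically. Distributing OR over AND turns T1 || ... || Tk into the conjunction of all clauses that pick one literal from each term, so every clause has at most k literals. A small sketch, with function names and the literal encoding (index, sign) chosen here for illustration:

```python
from itertools import product

def dnf_to_cnf(terms):
    """terms: list of k terms, each a list of literals (index, sign).

    Returns a set of clauses; each clause is a frozenset containing one
    literal chosen from every term, hence at most k literals.
    """
    return {frozenset(choice) for choice in product(*terms)}

def eval_dnf(terms, x):
    return any(all(x[i] == s for i, s in term) for term in terms)

def eval_cnf(clauses, x):
    return all(any(x[i] == s for i, s in clause) for clause in clauses)

# Illustrative 2-term DNF over 3 variables: (x0 AND x1) OR (NOT x2).
terms = [[(0, True), (1, True)], [(2, False)]]
cnf = dnf_to_cnf(terms)
for x in product((False, True), repeat=3):
    assert eval_dnf(terms, x) == eval_cnf(cnf, x)
```

The resulting CNF can have up to n^k clauses, which is still polynomial for constant k.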
Try an unbiased C (all boolean functions over X). Then |H|=2^|X|. So, for the boolean domain, |X|=2^n and we get the bound m=(1/epsilon)*(2^n * ln(2) + ln(1/delta))
Agnostic Learning and Inconsistent Hypotheses
Suppose we want to remove any assumption about how H relates to C. So perhaps there is no epsilon-good hypothesis in H.
Goal: Like PAC, but now want to output hyp h that satisfies error[D](h) <= epsilon + error[D](h-best)
Note, when H contains C, then you can always output a consistent hypothesis and so error[D](h-best)=0
prob(error[D](h) > error[train](h)+epsilon) <= e^(-2m * epsilon^2), where error[train](h) is the fraction of the m training examples that h misclassifies
prob(there exists an h in H such that: error[D](h) > error[train](h)+epsilon) <= |H|*e^(-2m*epsilon^2) <= delta
Sample Complexity for Infinite Hypothesis Space
Consider C = set of all rectangles in a plane or C = set of all halfspaces in a plane
Replace ln|H| by a combinatorial measure of the complexity (or intuitively, degrees of freedom) of H.
def: a set S of instances is shattered by H iff all 2^|S| possible +/- classifications for S can be represented using H. For example:

Note: VCD(H) <= log[2](|H|)
To shatter d instances requires 2^d elements from H.
Example 1: Intervals on a real line.
Each h in H is defined by two reals a and b where a <= b.
For a real number x in X, h(x) iff a <= x <= b.
VCD(H) = 2
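The interval example can be verified by brute force on small point sets. The sketch below (function names are made up here) enumerates the labelings that closed intervals [a, b] can produce on a finite set of distinct reals. It relies on the observation that only endpoints at the points themselves matter, plus one interval that misses every point.

```python
def achievable_labelings(points):
    """Labelings of `points` (non-empty, distinct reals) realizable by some [a, b], a <= b."""
    pts = sorted(points)
    # The extra candidate below the minimum yields an interval containing no point.
    candidates = pts + [pts[0] - 1.0]
    labelings = set()
    for a in candidates:
        for b in candidates:
            if a <= b:
                labelings.add(tuple(a <= x <= b for x in points))
    return labelings

def is_shattered(points):
    """True iff intervals realize all 2^|points| labelings."""
    return len(achievable_labelings(points)) == 2 ** len(points)

print(is_shattered([1.0, 2.0]))       # two points: all four labelings are realizable
print(is_shattered([1.0, 2.0, 3.0]))  # three points: (+, -, +) is not realizable
```

No set of three points can be shattered, because an interval containing the two outer points must also contain the middle one.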
VCD(H) = 3 for halfspaces in the plane (in general, d+1 for halfspaces in d dimensions)
VCD(H) = r+1
VCD(H) = 3
m >= (1/epsilon) * (4*log[2](2/delta) + 8*VCD(H)*log[2](13/epsilon))
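To get a feel for the constants in this bound, here is a small calculator. The function name and the sample values (VCD = 3, epsilon = 0.1, delta = 0.05) are illustrative assumptions.

```python
import math

def vc_sample_size(vcd, epsilon, delta):
    """Smallest integer m with
    m >= (1/epsilon) * (4*log2(2/delta) + 8*VCD(H)*log2(13/epsilon))."""
    if vcd < 1 or not (0 < epsilon < 1) or not (0 < delta < 1):
        raise ValueError("need VCD >= 1, 0 < epsilon < 1, 0 < delta < 1")
    total = 4 * math.log2(2 / delta) + 8 * vcd * math.log2(13 / epsilon)
    return math.ceil(total / epsilon)

# Illustrative values only, e.g. halfspaces in a plane have VCD = 3.
print(vc_sample_size(3, epsilon=0.1, delta=0.05))
```

The bound is finite even though H is infinite, since it depends on H only through VCD(H).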
Further, you can prove the following lower bound
Note: L can use infinite computation and this still holds (information theoretic bound)
Very high level intuition
Let P[d](m) be the number of possible divisions of m examples into + and - examples by H with VCD d.
By induction you can show that P[d](m) = sum[i=0 to d](m choose i) <= (e*m/d)^d = O(m^d)
This grows polynomially in m once m > d. Until then (m <= d), P[d](m) = 2^m. Intuitively you can replace |H| by P[d](m).
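The switch from exponential to polynomial growth is easy to see numerically. A minimal sketch, with a function name chosen here for illustration, evaluates the sum of binomial coefficients and compares it with 2^m and with (e*m/d)^d:

```python
import math

def growth_bound(m, d):
    """sum over i = 0..d of (m choose i), the bound on P[d](m)."""
    return sum(math.comb(m, i) for i in range(d + 1))

d = 3
for m in (2, 3, 5, 10, 20):
    poly = (math.e * m / d) ** d
    print(m, growth_bound(m, d), 2 ** m, round(poly, 1))
```

For m <= d the sum equals 2^m because math.comb(m, i) is 0 whenever i > m. For m > d it falls below 2^m and stays under (e*m/d)^d.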