HTML document prepared by Sean Waters.
Mistake Bound Model
The PAC model is a batch model: you get a set of training examples, which are used to construct a hypothesis, and then the hypothesis is used to make predictions (generally without further updates).
The mistake-bound model is an on-line learning model in which you must use your hypothesis to predict as you are learning.
It works as follows:
  Adversary picks target concept c ∈ C (learner knows C)
  Repeat forever:
    adversary picks an example x ∈ X
    learner gives prediction h(x)
    learner is given true value c(x)
    learner updates h
If h(x) != c(x) we say the learner has made a mistake.
Say C is learnable in the mistake-bound model if:
1. The # of mistakes (over an infinite # of trials) is bounded by a polynomial in n (the # of bits in each example) and the # of bits in the target c.
2. The time per prediction is polynomial.
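The predict / reveal / update loop described above can be sketched in a few lines. This is only an illustration: `RoteLearner`, `run_online`, and the toy target are hypothetical names and choices made for this sketch, not part of the notes.

```python
from typing import Callable, Dict, Iterable, Tuple

Example = Tuple[int, ...]


class RoteLearner:
    """Toy learner (an illustrative assumption): remembers labels it has
    seen and predicts False on anything new."""

    def __init__(self) -> None:
        self.memory: Dict[Example, bool] = {}

    def predict(self, x: Example) -> bool:
        return self.memory.get(x, False)

    def update(self, x: Example, label: bool) -> None:
        self.memory[x] = label


def run_online(learner: RoteLearner,
               target: Callable[[Example], bool],
               examples: Iterable[Example]) -> int:
    """Play the on-line protocol and count the learner's mistakes."""
    mistakes = 0
    for x in examples:              # adversary picks x
        guess = learner.predict(x)  # learner gives prediction h(x)
        truth = target(x)           # learner is given c(x)
        if guess != truth:
            mistakes += 1
        learner.update(x, truth)    # learner updates h
    return mistakes


# Toy target: the first bit is 1.
target = lambda x: x[0] == 1
sequence = [(0, 0), (1, 0), (1, 0), (1, 1), (0, 1), (1, 1)]
mistakes = run_online(RoteLearner(), target, sequence)
print(mistakes)
```

A rote learner like this has no useful mistake bound in general (it can err once per distinct example), which is why the definition asks for a bound polynomial in n.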
Query Learning
A closely related model is the query model.
Two most common queries:
Membership Query MQ(x)
The learner picks an x ∈ X and MQ(x) returns c(x), where c is the target concept.
This models the ability to perform experiments.
Note: You can also add MQs to the PAC model (often called the PAC with MQ model).
It is interesting to explore the extent to which MQs enhance learning. In some cases (e.g. DFA) having MQs enables us to learn something we couldn't learn without them. In other cases we can prove MQs don't help. (For DNF, a reduction has been given showing that if we had a DNF algorithm that used MQs then we could also learn without them.)
In terms of time complexity, each MQ takes constant time.
Equivalence Query EQ(h)
For a hypothesis h (selected by the learner), EQ(h) either reports that h correctly classifies all examples or returns a counterexample x (i.e. h(x) != c(x)).
At first EQ seems too strong, but it is really extremely close to the mistake-bound model.
Instead of using EQ(h), just use h to make predictions. Whenever a mistake occurs you obtain a counterexample.
So the only difference is that with EQ you know when you have exactly identified c (i.e. for all x, h(x) = c(x)), whereas in the mistake-bound model you don't know you're done (but will not make any more mistakes). So #EQs = #mistakes + 1.
A common model is EQ+MQ: use EQ to "discover" a new region of the domain, then use MQ to refine.
This is very much like how we develop scientific theories. We use our current theory until it does not correctly predict some phenomenon. Then experiments are used to help revise the theory (hypothesis), and it is used until it does not predict correctly, and so on.
We now focus on the mistake-bound model without MQs (equivalent to learning with just EQs).
Let's view Find-S as a mistake-bound algorithm and bound the # of mistakes.
h is initially x1 AND !x1 AND x2 AND !x2 AND ... AND xn AND !xn (2n literals).
The 1st mistake removes n of them, leaving n literals after 1 mistake.
All other mistakes remove >= 1 literal, so at most n additional mistakes could occur.
So # mistakes <= n + 1 in the worst case.
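The counting argument above (one mistake that removes n literals, then at most n more) can be checked on a small case. The sketch below is a minimal on-line Find-S for monomials; the representation of a literal as a pair `(i, v)` meaning "x_i must equal v", the function name, and the toy target are assumptions made for illustration.

```python
from itertools import product


def find_s_online(labelled_examples, n):
    """On-line Find-S for monomials over n boolean variables.

    Returns the final hypothesis (a set of literals) and the mistake count.
    """
    # Start with all 2n literals: x_i and !x_i for every i.
    h = {(i, v) for i in range(n) for v in (0, 1)}
    mistakes = 0
    for x, label in labelled_examples:
        prediction = all(x[i] == v for i, v in h)
        if prediction != label:
            mistakes += 1
        if label:
            # Keep only the literals that this positive example satisfies.
            h = {(i, v) for i, v in h if x[i] == v}
    return h, mistakes


n = 4
target = lambda x: x[0] == 1 and x[2] == 0          # x1 AND !x3 in 1-based terms
stream = [(x, target(x)) for x in product((0, 1), repeat=n)]
h, mistakes = find_s_online(stream, n)
print(sorted(h), mistakes)
```

Because h always contains every literal of the target, it never predicts + on a negative example, so all mistakes happen on positive examples.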
Let C be a finite concept space. Suppose also that you are no longer required to make each prediction in poly time, and further that you can output any hypothesis (e.g. H is the power set of X, i.e. H is an unbiased hypothesis class).
Give an algorithm with a good upper bound on the # of mistakes (in the worst case).
Idea 1: |C| - 1, since each mistake allows one concept from C to be removed.
Halving Algorithm
For each x, predict according to the majority of concepts in VS (the version space).
Initially |VS| = |C|.
With each mistake, the # of items eliminated from VS is >= ½ of those left, since the majority of items were wrong.
For alg A:
Let $M_A(C) = \max_{c \in C}$ (max # mistakes made by A when learning c)
$M_{Halving}(C) \le \lfloor \log_2 |C| \rfloor \le \log_2 |C|$
Sometimes, e.g. for the interval [0,r] (meaning if x <= r then + else if x > r then -), this can be efficiently implemented, but in general exponential time is used to make each prediction.
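As a concrete illustration of the halving algorithm, here is a brute-force sketch that keeps the version space as an explicit list, which is exactly the (generally exponential) implementation the text warns about. The threshold class, the example order, and the tie-breaking rule (ties predict +) are assumptions chosen for this sketch.

```python
import math


def halving(concepts, target, examples):
    """Predict with the majority of the version space; prune after feedback."""
    version_space = list(concepts)
    mistakes = 0
    for x in examples:
        votes_plus = sum(1 for c in version_space if c(x))
        prediction = 2 * votes_plus >= len(version_space)  # ties predict +
        truth = target(x)
        if prediction != truth:
            mistakes += 1
        # Keep only the concepts consistent with the revealed label.
        version_space = [c for c in version_space if c(x) == truth]
    return mistakes, len(version_space)


# Interval concepts [0, r] over the domain {0, ..., 15}: c_r(x) = (x <= r).
concepts = [lambda x, r=r: x <= r for r in range(16)]
target = concepts[11]
order = [7, 3, 12, 0, 15, 11, 9, 10, 13, 1, 2, 4, 5, 6, 8, 14]
mistakes, vs_size = halving(concepts, target, order)
print(mistakes, vs_size, math.floor(math.log2(len(concepts))))
```

For this particular class the majority vote could be computed from the smallest and largest consistent r alone, which is the efficient implementation the interval remark refers to.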
Def: Optimal mistake bound
$Opt(C) = \min_A M_A(C)$, where A ranges over learning algorithms
We often call this the learning complexity in the mistake-bound model.
$VCD(C) \le Opt(C) \le M_{Halving}(C) \le \log_2 |C|$
Let's argue VCD(C) <= Opt(C).
Let S ⊆ X be a largest shattered set, so |S| = VCD(C).
The adversary can present the examples in S (in any order) and then always say that a mistake was made. Since S is shattered by C there must still be some c ∈ C consistent.
Hence >= |S| = VCD(C) mistakes will be made.
Note: $M_{Halving}(C)$ can be less than $\log_2 |C|$.
Consider the class of singletons C = {{x1},{x2},...,{xn}}.
Suppose you predict false; then at the first mistake you'll know the target.
The halving algorithm will do exactly this.
So here even though |C| = n, $M_{Halving}(C) = 1$.
Note: VCD(C) = 1. You cannot shatter any two examples since no c ∈ C classifies any two examples as +.
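The singleton claim is small enough to verify exhaustively. The sketch below runs a brute-force halving learner over every target and every presentation order for n = 5; the strict-majority tie rule (ties predict false) and the helper name are assumptions made for illustration.

```python
from itertools import permutations


def halving_mistakes(concepts, target, examples):
    """Brute-force halving; returns the number of mistakes on this run."""
    vs = list(concepts)
    mistakes = 0
    for x in examples:
        votes_plus = sum(1 for c in vs if c(x))
        prediction = 2 * votes_plus > len(vs)  # strict majority, ties predict False
        truth = target(x)
        if prediction != truth:
            mistakes += 1
        vs = [c for c in vs if c(x) == truth]
    return mistakes


n = 5
# Singleton concept {j}: positive only on the point j.
singletons = [lambda x, j=j: x == j for j in range(n)]

# Worst case over every target and every order of presenting the domain.
worst = max(halving_mistakes(singletons, c, order)
            for c in singletons
            for order in permutations(range(n)))
print(worst)
```

Here log2|C| = log2 5 is a little over 2, while the worst case found is a single mistake.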
Along with not being efficient (in general), the halving algorithm is not noise tolerant. We now present a general algorithm that is robust against noise. (The halving algorithm is a special case.)
Weighted Majority Algorithm
n experts, a1,...,an (each can be another algorithm, a concept in C, an attribute, a different parameter choice, ...)
Similar to the perceptron, but we'll use a multiplicative weight update.
WM algorithm (target c ∈ C selected by adversary)
  initialize w1 = w2 = ... = wn = 1
  For each example x (as selected by adversary)
    q- = 0
    q+ = 0
    For each expert ai
      if ai(x) = -, q- = q- + wi
      if ai(x) = +, q+ = q+ + wi
    Predict + iff q+ >= q-
    Get feedback c(x)
    For each expert ai (you can just perform this for loop on a mistake)
      if ai(x) != c(x) then wi = β wi
β is a tunable learning rate where 0 <= β < 1
If you have one expert for each c ∈ C and β = 0 then this is exactly the halving algorithm.
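A minimal sketch of the WM loop, with each expert's vote weighted by w_i. The expert pool (threshold functions), the label noise, and the function name are assumptions made for this illustration; weights are updated on every round, which the pseudocode allows.

```python
def weighted_majority(experts, labelled_examples, beta=0.5):
    """Weighted Majority: predict with the weighted vote, then multiply the
    weight of every expert that was wrong by beta."""
    weights = [1.0] * len(experts)
    mistakes = 0
    for x, truth in labelled_examples:
        votes = [expert(x) for expert in experts]
        q_plus = sum(w for w, v in zip(weights, votes) if v)
        q_minus = sum(w for w, v in zip(weights, votes) if not v)
        prediction = q_plus >= q_minus
        if prediction != truth:
            mistakes += 1
        weights = [w * beta if v != truth else w
                   for w, v in zip(weights, votes)]
    return mistakes, weights


# Experts: thresholds c_r(x) = (x <= r) for r = 0..7.
experts = [lambda x, r=r: x <= r for r in range(8)]

# Labels follow r = 3, except for two flipped (noisy) labels.
xs = list(range(8)) * 3
labels = [x <= 3 for x in xs]
labels[5] = not labels[5]
labels[13] = not labels[13]
stream = list(zip(xs, labels))

mistakes, weights = weighted_majority(experts, stream, beta=0.5)
print(mistakes, weights)
```

With β > 0 a wrong expert is only down-weighted, never discarded, so the two noisy labels do not eliminate the best expert the way they would under halving.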
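Let's analyze the # of mistakes made in this setting, where the # of experts is |C|.
Suppose the best expert makes $m_{opt}$ mistakes. Without loss of generality let's assume a1 is the best expert.
Two key facts:
1. $w_1 \ge \beta^{m_{opt}}$ (the weight is initially 1 and is only multiplied by β when a mistake is made)
2. On each mistake, let $W = \sum_{i=1}^{|C|} w_i$ be the total weight before the update. The weight of the experts predicting wrong is >= W/2, so after the update the total weight is
   $\le \frac{W}{2} + \beta \cdot \frac{W}{2} = W \cdot \frac{1+\beta}{2}$ (where the first W/2 is not updated and the β(W/2) part is updated)
Thus after m mistakes
$W \le |C| \left(\frac{1+\beta}{2}\right)^m$ (where |C| is the initial value of W)
Since $W \ge w_1$, by fact 1 we know $W \ge \beta^{m_{opt}}$.
Combined with $W \le |C| \left(\frac{1+\beta}{2}\right)^m$ this gives:
$|C| \left(\frac{1+\beta}{2}\right)^m \ge \beta^{m_{opt}}$
(the total weight is reduced by a factor of (1+β)/2 with each mistake, but we know it never gets smaller than $\beta^{m_{opt}}$)
$\log_2 |C| + m \log_2 \frac{1+\beta}{2} \ge m_{opt} \log_2 \beta$
$-\log_2 |C| + m \log_2 \frac{2}{1+\beta} \le m_{opt} \log_2 \frac{1}{\beta}$
$m \log_2 \frac{2}{1+\beta} \le \log_2 |C| + m_{opt} \log_2 \frac{1}{\beta}$
$m \le \frac{\log_2 |C| + m_{opt} \log_2 (1/\beta)}{\log_2 (2/(1+\beta))}$ (where m is the total # of mistakes)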
If you know (or approximately know) $m_{opt}$ then you can pick β to optimize the bound.
For example, if $m_{opt} = 0$ then the best bound is obtained when β = 0 ($m \le \log_2 |C|$).
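To see how the choice of β trades off against the best expert's mistake count, the bound m <= (log2|C| + m_opt log2(1/β)) / log2(2/(1+β)) can simply be tabulated. The function name and the example values (1024 experts) are assumptions; the β = 0 case is handled separately because log2(1/β) is infinite there.

```python
import math


def wm_mistake_bound(num_experts, m_opt, beta):
    """Upper bound on Weighted Majority mistakes from the analysis above."""
    if not 0 <= beta < 1:
        raise ValueError("beta must satisfy 0 <= beta < 1")
    if beta == 0:
        # Halving: the bound is log2|C| if the best expert is perfect,
        # and vacuous (infinite) otherwise.
        return math.log2(num_experts) if m_opt == 0 else math.inf
    numerator = math.log2(num_experts) + m_opt * math.log2(1 / beta)
    return numerator / math.log2(2 / (1 + beta))


for beta in (0.0, 0.25, 0.5, 0.75):
    perfect = wm_mistake_bound(1024, 0, beta)
    noisy = wm_mistake_bound(1024, 5, beta)
    print(f"beta={beta:.2f}  m_opt=0: {perfect:.1f}  m_opt=5: {noisy:.1f}")
```

With a perfect expert the bound is smallest at β = 0, while for m_opt > 0 the β = 0 bound is vacuous and some β strictly between 0 and 1 does better.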
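More generally:
Let $W_{init}$ be the initial $\sum_i w_i$
Let $W_{fin}$ be a lower bound on $\sum_i w_i$
Then $W_{init} \left(\frac{1+\beta}{2}\right)^m \ge W_{fin}$
$\frac{W_{init}}{W_{fin}} \ge \left(\frac{2}{1+\beta}\right)^m$
$\log \frac{W_{init}}{W_{fin}} \ge m \log \frac{2}{1+\beta}$
$m \le \frac{\log (W_{init}/W_{fin})}{\log (2/(1+\beta))}$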
Lots of interesting variations:
Suppose the target concept slowly changes over time.
Modify the algorithm to keep a minimum value for each weight.
One can then prove nice bounds.
Can modify for real-valued labels (vs. + and -).
Here you can use a multiplicative update that depends on $(h(x)-c(x))^2$ and can prove bounds on
$\sum (h(x)-c(x))^2$, summed over an infinite # of trials.
Note: In the boolean domain this is exactly the # of mistakes.
Winnow: learning with many irrelevant attributes. Suppose there are n attributes but only k are relevant, where k is much smaller than n.
Weight Update
        | prediction | correct output | update
false + | +          | -              | if xi = 1 then wi = 0
false - | -          | +              | if xi = 1 then wi = α wi
If the # of relevant attributes is known you can tune α and θ.
(Can also apply WM with different choices.)
Can prove # mistakes <= $\alpha k (\log_\alpha \theta + 1) + n/\theta$
A good generic choice for α and θ is α = 2, θ = n/2, giving
# mistakes <= $2k \log_2 n + 2$
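A small sketch of the update table in action. Several details are assumptions not spelled out in the notes: weights start at 1, the learner predicts + iff the sum of w_i over attributes with x_i = 1 is at least θ, and the target is a monotone disjunction of the k relevant attributes (the setting in which the stated bound is usually proved). The random stream and names are illustrative only.

```python
import math
import random


def winnow(labelled_examples, n, alpha=2.0, theta=None):
    """Winnow with elimination on false positives, promotion on false negatives."""
    theta = n / 2 if theta is None else theta
    w = [1.0] * n
    mistakes = 0
    for x, truth in labelled_examples:
        prediction = sum(wi for wi, xi in zip(w, x) if xi) >= theta
        if prediction and not truth:
            # False +: zero the weight of every active attribute.
            mistakes += 1
            w = [0.0 if xi else wi for wi, xi in zip(w, x)]
        elif truth and not prediction:
            # False -: multiply the weight of every active attribute by alpha.
            mistakes += 1
            w = [alpha * wi if xi else wi for wi, xi in zip(w, x)]
    return w, mistakes


rng = random.Random(0)
n, relevant = 64, (3, 17, 42)          # k = 3 relevant attributes out of 64
stream = []
for _ in range(500):
    x = tuple(int(rng.random() < 0.1) for _ in range(n))
    stream.append((x, any(x[i] for i in relevant)))

weights, mistakes = winnow(stream, n)
bound = 2 * len(relevant) * math.log2(n) + 2
print(mistakes, bound)
```

The bound grows with k log n rather than with n, which is the point of using Winnow when most attributes are irrelevant.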
Theoretical studies of additive updates (gradient descent) vs. multiplicative updates for real-valued predictions have been done.
Summary of what's known; N is the # of variables (assume all attributes are -1 or 1)
                | additive update                        | multiplicative update
K relevant vars | Loss <= KN + 2 (loss of best weights)  | Loss <= K^2 ln(2N) + 3 (loss of best weights)
N relevant vars | Loss <= N + 2 (loss of best weights)   | Loss <= N^2 ln(2N) + 3 (loss of best weights)
More generally:
Multiplicative update: Loss <= 3 (loss of best hyp in H + U^2 X^2 ln(2N)),
where U^2 is the max value of x1 + ... + xn in all examples and X^2 is the max value of any xi in all examples.
Additive update: Loss <= 2 (loss of best hyp in H + Z^2 Y^2),
where Z^2 is the max value of ||X||^2 possible in the domain and Y^2 is the max value of ||X||^2 in the examples.
Let's return to the PAC model and talk about some variations.
Gaining Noise Tolerance
First approach (by Angluin and Laird): output the hypothesis from H that disagrees with the sample on the fewest examples possible.
Good News: Can prove that by doing so, even when each example has the wrong label with prob P (the noise rate), a sample of size
$m \ge \frac{2}{\varepsilon^2 (1-2P_b)^2} \ln \frac{2|C|}{\delta}$ (where $P_b$ is the upper bound for P)
suffices. One can tolerate noise rates 0 <= P < ½ and still ensure that with prob >= 1-δ, error_D(h) <= ε.
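Plugging numbers into the sample-size bound shows how quickly it grows as the noise bound approaches ½. The sketch assumes the bound reads m >= 2 / (ε²(1-2P_b)²) · ln(2|C|/δ); the function name and the example values are hypothetical.

```python
import math


def angluin_laird_sample_size(epsilon, delta, num_concepts, noise_bound):
    """Smallest integer m with m >= 2 / (eps^2 (1 - 2 P_b)^2) * ln(2|C| / delta)."""
    if not 0 <= noise_bound < 0.5:
        raise ValueError("noise bound must satisfy 0 <= P_b < 1/2")
    m = (2.0 / (epsilon ** 2 * (1 - 2 * noise_bound) ** 2)
         * math.log(2 * num_concepts / delta))
    return math.ceil(m)


clean = angluin_laird_sample_size(0.1, 0.05, 1000, 0.0)
noisy = angluin_laird_sample_size(0.1, 0.05, 1000, 0.25)
print(clean, noisy)
```

The (1-2P_b)² factor in the denominator means a noise bound of 0.25 costs about four times as many examples as the noise-free case.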
Bad News: Even for simple concept classes such as monotone monomials, minimizing the # of disagreements cannot be solved in poly time (unless P=NP).
Better Solution: SQ Model
Replace the EX oracle of the PAC model by an SQ oracle.
Let Q be any predicate defined over labeled examples that can be evaluated in poly time,
e.g. xi = 0 and label = 1.
Let P_Q = the probability that Q(<x,l>) is true for <x,l> drawn from D.
SQ(Q,μ,θ) returns an estimate P-hat_Q such that P_Q(1-μ) <= P-hat_Q <= P_Q(1+μ),
or P-hat_Q = ⊥, in which case P_Q < θ.
Let's modify Find-S to be an SQ algorithm.
The advantage of an SQ algorithm is that even with a noisy example oracle you can simulate the SQ oracle.
Intuitively, you cannot look at any single example but rather just gather statistics, and hence will be tolerant of noise.
SQ algorithm for learning a monomial over vars x1,...,xn. Let l be the label of an example.
  h = T
  For i = 1 to n
    Q_xi = (xi = 0 AND l = 1)
    P-hat_xi = SQ(Q_xi, ½, ε/2n)   (where ½ is μ and ε/2n is θ)
    If P-hat_xi = 0 or ⊥
      h = h AND xi
    Q_!xi = (xi = 1 AND l = 1)
    P-hat_!xi = SQ(Q_!xi, ½, ε/2n)
    If P-hat_!xi = 0 or ⊥
      h = h AND !xi
  Output h
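The SQ monomial learner above can be exercised against a stand-in oracle. In this sketch the oracle is an assumption made for illustration: it computes P_Q exactly under the uniform distribution over {0,1}^n and returns `None` (standing in for ⊥) whenever P_Q < θ, which satisfies the SQ guarantee for any μ. The names `make_sq_oracle` and `sq_learn_monomial` are hypothetical.

```python
from itertools import product


def make_sq_oracle(target, n):
    """Stand-in SQ oracle for the uniform distribution over {0,1}^n."""
    points = list(product((0, 1), repeat=n))

    def sq(query, mu, theta):
        # Exact probability that the query holds on a labeled example.
        p = sum(1 for x in points if query(x, target(x))) / len(points)
        return None if p < theta else p   # None plays the role of ⊥

    return sq


def sq_learn_monomial(sq, n, epsilon):
    """Keep a literal iff it is (almost) never false in a positive example."""
    theta = epsilon / (2 * n)
    h = []  # literals as (i, v): the hypothesis requires x[i] == v
    for i in range(n):
        for v in (1, 0):  # v = 1 is the literal x_i, v = 0 is !x_i
            # Query: the literal is false AND the label is +.
            query = lambda x, label, i=i, v=v: x[i] != v and label
            estimate = sq(query, 0.5, theta)
            if estimate is None or estimate == 0:
                h.append((i, v))
    return h


n = 4
target = lambda x: x[0] == 1 and x[2] == 0
h = sq_learn_monomial(make_sq_oracle(target, n), n, epsilon=0.1)
print(h)
```

The learner never touches an individual example; it only sees 2n aggregate statistics, which is what makes the later noise-tolerant simulation possible.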
Let's argue this is correct.
If P-hat_literal = ⊥ then P_literal < ε/2n (this is the prob that the literal is false in a + example, which will cause an error).
If P-hat_literal = 0 then P_literal <= P-hat_literal / (1-μ) = 0.
The # of literals included is <= 2n. Hence the error introduced by including literals not in the target is <= 2n · ε/2n <= ε.
We must also be sure we don't leave out any vars in the target monomial.
If P-hat_literal > 0 then P_literal >= P-hat_literal / (1+μ) = P-hat_literal · 2/3 > 0 (so the literal is not in the target).
Thus if a literal is not in h then the literal is not in the target (equivalently, if a literal is in the target then it will be in h).
Hence if the SQ oracle is accurate then error_D(h) <= ε.
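What is left is to use the EX oracle to simulate the SQ oracle (with high probability).
If there is no noise, let Q be the class of queries used.
If Q is finite:
$m = c \cdot \frac{1}{\mu^2 \theta} \log \frac{|Q|}{\delta}$ examples suffice to simulate SQ, where c is a constant.
If Q is infinite:
$m = c \left( \frac{VCD(Q)}{\mu^2 \theta} \log \frac{1}{\mu\theta} + \frac{1}{\mu^2 \theta} \log \frac{1}{\delta} \right)$ examples suffice to simulate SQ.
For our algorithm |Q| = 2n, μ = ½, θ = ε/2n.
The # of examples we need is $c \cdot \frac{n}{\varepsilon} \log \frac{n}{\delta}$.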
Now suppose EX is replaced by EX_p, in which with prob p the label l is replaced by l-bar.
We can use EX_p to simulate SQ and still guarantee that:
  with prob >= 1-δ all SQ estimates meet their requirements (this is done in general), and
  when all SQ estimates meet their requirements, error_D(h) <= ε (we did this).
Sample size needed is roughly ($\tilde{O}$ drops constants and low-order terms):
Q finite: $\tilde{O}\left( \frac{\log |Q| + \log (1/\delta)}{\mu^2 \theta \, r \, (1-2p)^2} \right)$ (where μ^2 θ is the min value used and r is between θ and 1)
(r = θ / fraction of examples where the predicate's value depends on the label)
Q infinite: $\tilde{O}\left( \frac{VCD(Q) + \log (1/\delta)}{\mu^2 \theta \, r \, (1-2p)^2} \right)$
So for our monomial example we get a sample complexity of:
$\tilde{O}\left( \frac{(n/\varepsilon) \left( \log n + \log (1/\delta) \right)}{(1-2p)^2} \right)$
We get results very similar to Angluin/Laird, but with efficient algorithms.
Good News: With the exception of the parity PAC algorithm (like a monomial, but using XOR in place of AND), every PAC algorithm can be converted to an SQ algorithm.
Also, we do not have any noise-tolerant parity algorithm.
So the SQ model does a good job of capturing what can be done in the PAC model with noise
(contrived examples do exist where one can't learn in the SQ model but can PAC learn with noise, but no "natural" examples).
Dependence on the complexity of the target concept as represented in C
DFA (deterministic finite-state machine): learnable with MQ
NFA (nondeterministic finite-state machine): not learnable
or
boolean formula: not learnable
vs
decision tree: learnable (with MQ)
These can represent the same functions, but for some boolean function c the length of the shortest DT can be exponentially longer than the length of the shortest boolean formula.
What is PAC-learnable or exactly learnable?
Aside: exactly learnable with a poly # of EQs implies PAC learnable; the converse is not true.
C-PAC learnable (K constant)      | Not PAC learnable
K-CNF, K-DNF                      | DFA
K-decision lists                  | boolean formula
K-decision trees                  | read-thrice boolean formula
K-term-DNF, K-term-CNF            | constant depth threshold circuits
any boolean function over k vars  | context free grammars
union of K boxes in ℜn            | intersection of K halfspaces in ℜn
(where the first 2 in learnable column can learn using C as hypothesis
and the next 3 need H=C)
Learnable from MQ and EQ:
Decision trees
Read-once Boolean formulas
Read-twice DNF
DFA
Horn sentences
Open:
DNF (but if we can learn read-thrice DNF with EQ and MQ, then we can learn arbitrary DNF from EQ alone)