HTML document prepared by Sean Waters.

Mistake Bound Model
   PAC model is a batch model meaning you get a set of training exs which are used to
   construct a hypothesis and then the hypothesis is used to make predictions (generally
   without further updates)

Mistake-bound model is an on-line learning model in which you must use your hypothesis
to predict as you are learning

It works as follows:
   adversary picks target concept c element-of C (learner knows C)
   Repeat forever
       adversary picks an ex x element-of X
       learner gives prediction h(x)
       learner is given true value c(x)
       learner updates h
If h(x) != c(x) we say the learner has made a mistake

Say C is learnable in mistake bound model if:
   1.  # mistakes (in infinite # trials) is bounded by a polynomial in n (# bits in each example)
        and the # bits in the target concept c.
   2.  time per prediction is polynomial

Query Learning
   Closely related model is query model.
   Two most common queries:
   Membership Query MQ(x)
       Learner picks an x element-of X
       and MQ(x) returns c(x) where c is the target concept
       This models the ability to perform experiments.
      
       Note: You can also add MQs to the PAC model (often called the PAC with MQ model).  Interesting
       to explore the extent to which MQs enhance learning. In some cases (e.g. DFA) having MQs
       enables us to learn something we couldn't without them. In other cases we can prove MQs
       don't help.  (If we had a DNF algorithm that used MQs then a reduction has been given to
       show we can learn without them).
      
       In terms of time complexity, each MQ takes constant time.
  
   Equivalence Query
       For hypothesis h (selected by learner) EQ(h) either reports that h correctly classifies all exs
       or returns a counterexample x (i.e. h(x) != c(x))

       At first EQ seems too strong but it is really extremely close to the mistake bound model.
       Instead of using EQ(h), just use h to make predictions.  Whenever a mistake occurs you obtain a
       counterexample.  So the only difference is that with EQ you know when you have exactly identified
       c (i.e. for all x, h(x) = c(x)) and with the mistake bound model you don't know you're done (but will not
       make any more mistakes).  So #EQs = #mistakes + 1
  
   Common model is EQ+MQ:
       use EQ to "discover" new region of domain then use MQ to refine
  
   This is very much like how we develop scientific theories.  We use our current theory until it does not
   correctly predict some phenomena.  Then experiments are used to help revise the theory (hypothesis)
   and it is used until it does not predict correctly, and so on.

We now focus on the mistake bound model without MQs (equivalent to learning with just EQs)
Let's view find-S as a mistake bound algorithm and bound # mistakes.
   h is initially x_1 AND !x_1 AND x_2 AND !x_2 AND ... AND x_n AND !x_n
       2n literals
   1st mistake removes n of them
       n literals after 1 mistake
   all other mistakes remove >= 1 literal
       so at most n additional mistakes could occur
   So # mistakes <= n + 1 in worst case
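
The find-S argument above can be sketched in Python. This is only an illustration, assuming examples are 0/1 tuples and a literal is stored as a pair (i, v) meaning "attribute i must equal v"; the names find_s_online, target and stream are made up for this sketch and are not part of the notes.

```python
from itertools import product


def find_s_online(examples, n):
    """Run find-S as an on-line learner over boolean vectors of length n.

    examples: iterable of (x, label), x a tuple of 0/1 and label a bool.
    Returns (hypothesis, number_of_mistakes); the hypothesis is the AND
    of its literals (i, v), each read as "x[i] must equal v".
    """
    # Start with all 2n literals: x_i AND !x_i for every i (always false).
    h = {(i, v) for i in range(n) for v in (0, 1)}
    mistakes = 0
    for x, label in examples:
        prediction = all(x[i] == v for i, v in h)
        if prediction != label:
            mistakes += 1
            # For a conjunction target, mistakes only happen on positive
            # examples: drop every literal the example falsifies.
            h = {(i, v) for i, v in h if x[i] == v}
    return h, mistakes


# Hypothetical target over n = 4 attributes: x_0 AND !x_2.
n = 4
target = {(0, 1), (2, 0)}
stream = [(x, all(x[i] == v for i, v in target))
          for x in product((0, 1), repeat=n)]
h, mistakes = find_s_online(stream, n)
print(h == target, mistakes, "<=", n + 1)
```

On this stream the first positive example removes n literals and each later mistake removes at least one more, so the count stays within n + 1.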

Let C be a finite concept space.  Suppose also that you are no longer required to make each prediction
in poly time and further can output any hypothesis (e.g. H is the powerset of X, where H is an unbiased hyp class)

Give an algorithm with a good upper bound for # mistakes (in worst case)
Idea 1: |C| - 1 since each mistake allows one concept from C to be removed

Halving Algorithm
   For each x predict according to the majority of concepts in VS (version space)
   Initially |VS| = |C|
   With each mistake # items eliminated from VS is >= ½ of those left, since the majority of items were wrong

   For alg A:
       Let M_A(C) = max over c element-of C (max # mistakes made by A when learning c)
       M_Halving(C) <= floor(log_2 |C|) <= log_2 |C|
   Sometimes, e.g. interval [0,r] meaning if x <= r then + else if x > r then -, this can be efficiently
   implemented but in general exp. time is used to make each prediction.
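
A brute-force sketch of the halving algorithm in Python, keeping the version space as an explicit list (so each prediction costs time linear in |C|, which is exactly the inefficiency noted above). The interval class over {0,...,15}, the target r = 11 and the function name halving_run are assumptions chosen for illustration.

```python
import math


def halving_run(concepts, target, examples):
    """Run the halving algorithm; return (mistakes, final version space).

    concepts: list of functions x -> bool (the finite class C).
    target:   one element of concepts, chosen by the adversary.
    examples: sequence of instances x chosen by the adversary.
    """
    version_space = list(concepts)
    mistakes = 0
    for x in examples:
        votes_plus = sum(1 for c in version_space if c(x))
        prediction = 2 * votes_plus >= len(version_space)  # ties go to +
        truth = target(x)
        if prediction != truth:
            mistakes += 1
        # Keep only the concepts consistent with the revealed label.
        version_space = [c for c in version_space if c(x) == truth]
    return mistakes, version_space


# Illustrative class: intervals [0, r] for r = 0..15, i.e. c_r(x) = (x <= r).
concepts = [lambda x, r=r: x <= r for r in range(16)]
target = concepts[11]
mistakes, version_space = halving_run(concepts, target, range(16))
bound = math.floor(math.log2(len(concepts)))
print(mistakes, "<=", bound)
```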

Def Opt mistake bound
   Opt(C) = min M_A(C) where A element-of learning algs
   We often call this Learning Complexity in mistake bound model
   VCD(C) <= Opt(C) <= M_Halving(C) <= log_2 |C|

Let's argue VCD(C) <= Opt(C)
Let S ⊆ X be a shattered set with |S| = VCD(C).
The adversary can present the examples in S (in any order) and always say that a mistake
was made.  Since S is shattered by C there must still be some c element-of C consistent with every answer given.
Hence >= |S| = VCD(C) mistakes will be made.

Note: M_Halving(C) can be less than log_2 |C|.   Consider class of singletons C = {{x_1},{x_2},...,{x_n}}.
  Suppose you predict false; then at the first mistake you'll know the target.

The halving algorithm will do exactly this.
So here even though |C| = n
M_Halving(C) = 1
Note: VCD(C) = 1. You cannot shatter any two exs since no c element-of C classifies any two exs as +.
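
A small Python check of the singleton example, assuming n = 8 and examples presented in index order (both choices are arbitrary). The helper halving_mistakes is a hypothetical name; it predicts with a strict majority vote, so it says "false" until only the target's own singleton can still vote +.

```python
def halving_mistakes(concepts, target, examples):
    """Count the mistakes of majority-vote (halving) prediction."""
    version_space = list(concepts)
    mistakes = 0
    for x in examples:
        votes_plus = sum(1 for c in version_space if x in c)
        prediction = 2 * votes_plus > len(version_space)  # strict majority
        truth = x in target
        mistakes += prediction != truth
        # Discard every concept that disagrees with the revealed label.
        version_space = [c for c in version_space if (x in c) == truth]
    return mistakes


n = 8
singletons = [frozenset({i}) for i in range(n)]  # C = {{x_1}, ..., {x_n}}
# Worst case over all targets, for this fixed presentation order.
worst = max(halving_mistakes(singletons, c, range(n)) for c in singletons)
print(worst)
```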

Along with not being efficient (in general) halving algorithm is not noise tolerant.  We now present a general
algorithm that is robust against noise.  (Halving algorithm is special case)

Weighted Majority Algorithm
   n experts, A_1,...,A_n (can be other algorithms, concepts in C, attributes, different parameter choices,...)
       Similar to perceptron but we'll use multiplicative weight update.
   WM algorithm (target c element-of C selected by adversary)
       initialize w_1 = w_2 = ... = w_n = 1
       For each ex x (as selected by adversary)
           q- = 0
           q+ = 0
           For each expert a_i
               if a_i(x) = -, q- = q- + w_i
               if a_i(x) = +, q+ = q+ + w_i
           Predict + iff q+ >= q-
           Get feedback c(x)
           For each expert a_i (you can just perform this for loop on a mistake)
               if a_i(x) != c(x) then
                   w_i = β w_i
       β is tunable learning rate where 0 <= β < 1
       If you have one expert for each c element-of C and β = 0 then this is exactly the halving alg.
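
The WM loop can be sketched in Python as follows. This is a sketch under assumptions, not a reference implementation: the experts here are the four attribute readers "predict x[i]", the labels follow x[2] with one label deliberately flipped to imitate noise, and the weights are updated on every trial rather than only on mistakes.

```python
import math
from itertools import product


def weighted_majority(experts, labeled_examples, beta=0.5):
    """Run Weighted Majority; return (master_mistakes, final_weights).

    experts: list of functions x -> bool.
    labeled_examples: iterable of (x, c(x)) pairs.
    beta: multiplicative penalty, 0 <= beta < 1.
    """
    weights = [1.0] * len(experts)
    mistakes = 0
    for x, truth in labeled_examples:
        votes = [a(x) for a in experts]
        q_plus = sum(w for w, v in zip(weights, votes) if v)
        q_minus = sum(w for w, v in zip(weights, votes) if not v)
        prediction = q_plus >= q_minus
        if prediction != truth:
            mistakes += 1
        # Penalise every expert that was wrong on this example.
        weights = [w * beta if v != truth else w
                   for w, v in zip(weights, votes)]
    return mistakes, weights


# Hypothetical setup: expert i predicts attribute i; labels follow x[2].
experts = [lambda x, i=i: bool(x[i]) for i in range(4)]
examples = [(x, bool(x[2])) for x in product((0, 1), repeat=4)]
x0, y0 = examples[5]
examples[5] = (x0, not y0)  # one noisy label, so the best expert errs once
mistakes, weights = weighted_majority(experts, examples, beta=0.5)
# Mistake bound for |C| = 4 experts, m_opt = 1 and beta = 0.5.
bound = (math.log2(len(experts)) + 1 * math.log2(1 / 0.5)) / math.log2(2 / 1.5)
print(mistakes, "<=", round(bound, 2), weights)
```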

Let's analyze # mistakes made in this setting, where # experts is |C|.
Suppose the best expert makes m_opt mistakes. Without loss of generality let's assume a_1 is the best expert.
Two key facts
   1.  w_1 >= β^(m_opt)   (weight is initially 1 and is only multiplied by β when a_1 makes a mistake)
   2.  On each mistake, let W = Σ_{i=1..|C|} w_i be the total weight before the update.
       Weight of experts predicting wrong >= W/2, so after the update:
           W' <= W/2 + β(W/2) = W((1+β)/2)   (at most W/2 is not updated and at least W/2 is multiplied by β)
       Thus after m mistakes
           W <= |C| ((1+β)/2)^m   (where |C| is the initial value of W)

Since W >= w_1, by fact 1 we know W >= β^(m_opt)
Combined with W <= |C| ((1+β)/2)^m this gives:
   |C| ((1+β)/2)^m >= β^(m_opt)
       (total weight is reduced by a factor of (1+β)/2 with each mistake but we know it never gets smaller than β^(m_opt))
   log_2 |C| + m log_2((1+β)/2) >= m_opt log_2 β
   -log_2 |C| + m log_2(2/(1+β)) <= m_opt log_2(1/β)
   m log_2(2/(1+β)) <= log_2 |C| + m_opt log_2(1/β)
   m <= (log_2 |C| + m_opt log_2(1/β)) / log_2(2/(1+β))   (where m is the total # mistakes)

If you know (or approximately know) m_opt then you can pick β to optimize the bound.
For ex, if m_opt = 0 then the best bound is obtained when β = 0 (m <= log_2 |C|)
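
To see how the choice of β trades off the two terms, the bound m <= (log_2 |C| + m_opt log_2(1/β)) / log_2(2/(1+β)) can be tabulated with a few lines of Python. The function name and the sample values |C| = 1024, m_opt = 20 are arbitrary; β = 0 itself is excluded because log_2(1/β) is undefined there, so the m_opt = 0 case is approached with a tiny β instead.

```python
import math


def wm_mistake_bound(num_experts, m_opt, beta):
    """Upper bound on Weighted Majority mistakes, for 0 < beta < 1."""
    numerator = math.log2(num_experts) + m_opt * math.log2(1 / beta)
    return numerator / math.log2(2 / (1 + beta))


# With a noisy best expert (m_opt = 20) neither extreme of beta is best.
for beta in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(beta, round(wm_mistake_bound(1024, 20, beta), 1))

# With a perfect expert the bound tends to log2|C| = 10 as beta -> 0.
print(round(wm_mistake_bound(1024, 0, 1e-9), 6))
```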

More generally:
   Let W_init be the initial Σ w_i
   Let W_fin be a lower bound on the final Σ w_i

Then W_init ((1+β)/2)^m >= W_fin
   W_init/W_fin >= (2/(1+β))^m
   log(W_init/W_fin) >= m log(2/(1+β))
   m <= log(W_init/W_fin) / log(2/(1+β))

Lots of interesting variations:
   Suppose the target concept slowly changes over time.
   Modify algorithm to have a minimum value for each weight
   Can then prove nice bounds

Can modify for real-valued labels (vs. + and -).
Here you can use a multiplicative update that depends on (h(x)-c(x))^2 and can prove bounds on
the sum of (h(x)-c(x))^2 over all (infinitely many) trials.
Note: In boolean domain this is exactly the # of mistakes

Winnow - learning with many irrelevant attributes.  Suppose n attributes but only k are relevant where k
         is much smaller than n.

Weight Update
            | prediction | correct output | update
   false +  |     +      |       -        | if x_i = 1 then w_i = 0
   false -  |     -      |       +        | if x_i = 1 then w_i = α w_i

If # relevant attributes is known you can tune α and θ, where α > 1 is the promotion factor and θ is the
prediction threshold. (Can also apply WM with different choices)
Can prove # mistakes <= αk(log_α θ + 1) + n/θ
Good generic choice for α and θ: α = 2, θ = n/2
   # mistakes <= 2k log_2 n + 2
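
A Python sketch of Winnow with the update table above. The notes do not spell out the prediction rule; the sketch assumes the usual one (all weights start at 1, predict + iff the sum of w_i over attributes with x_i = 1 is >= θ). The target x_0 OR x_5 over n = 8 attributes is a made-up example of a disjunction with k = 2 relevant attributes.

```python
import math
from itertools import product


def winnow(labeled_examples, n, alpha=2.0, theta=None):
    """Winnow with elimination on false + and promotion on false -.

    Returns (weights, mistakes). theta defaults to n / 2.
    """
    theta = n / 2 if theta is None else theta
    w = [1.0] * n
    mistakes = 0
    for x, truth in labeled_examples:
        # Assumed prediction rule: weighted sum of active attributes.
        prediction = sum(wi for wi, xi in zip(w, x) if xi) >= theta
        if prediction and not truth:      # false +: zero active weights
            w = [0.0 if xi else wi for wi, xi in zip(w, x)]
            mistakes += 1
        elif truth and not prediction:    # false -: promote active weights
            w = [alpha * wi if xi else wi for wi, xi in zip(w, x)]
            mistakes += 1
    return w, mistakes


n, k = 8, 2
examples = [(x, bool(x[0] or x[5])) for x in product((0, 1), repeat=n)]
w, mistakes = winnow(examples, n)
bound = 2 * k * math.log2(n) + 2   # alpha = 2, theta = n/2
print(mistakes, "<=", bound)
```

The relevant weights are never zeroed, because a false + can only occur when every relevant attribute is 0.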

Theoretical studies of additive update (gradient descent)
   vs. multiplicative updates for real-valued predictions have been done.

Summary of what's known;  N is # variables

                  | additive update                      | multiplicative update
 K relevant vars  | Loss <= KN + 2 Loss of best weights  | Loss <= K^2 ln(2N) + 3 Loss of best weights
 N relevant vars  | Loss <= N + 2 Loss of best weights   | Loss <= N^2 ln(2N) + 3 Loss of best weights
(assume all attribs are -1 or 1)

More generally
Multiplicative Update  Loss <= 3(loss of best hyp in H + U^2 X^2 ln(2N))
   where U^2 is the max value of x_1 +...+ x_n in all exs and X^2 is the max value of any x_i in all exs
Additive Update  Loss <= 2(loss of best hyp in H + Z^2 Y^2)
   where Z^2 is the max value of ||X||^2 possible in domain and Y^2 is the max value of ||X||^2 in exs

Let's return to PAC model and talk about some variations
Gaining Noise Tolerance -
   First approach (by Angluin and Laird):
       output the hypothesis from H that disagrees with the sample on the fewest examples possible
       Good News  Can prove that by doing so, even when each example has the wrong label with prob P (the noise rate),
                  a sample of size
                      m >= (2 / (ε^2 (1-2P_b)^2)) ln(2|C|/δ)     (where P_b is the upper bound for P)
                  suffices to ensure with prob >= 1-δ that error_D(h) <= ε.  Note: can tolerate noise rates 0 <= P < ½
       Bad News   Even for simple concept classes such as monotone monomials, minimizing # disagreements cannot be
                  solved in poly time (unless P=NP)
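
The Angluin/Laird sample size m >= (2 / (ε^2 (1-2P_b)^2)) ln(2|C|/δ) is easy to evaluate numerically, which shows how quickly it grows as the noise bound P_b approaches ½. The function name and the sample values below (monotone monomials over 10 variables, so |C| = 2^10) are assumptions for illustration only.

```python
import math


def angluin_laird_sample_size(class_size, epsilon, delta, noise_bound):
    """Smallest integer m with m >= 2/(eps^2 (1-2P_b)^2) * ln(2|C|/delta)."""
    if not 0 <= noise_bound < 0.5:
        raise ValueError("noise bound must satisfy 0 <= P_b < 1/2")
    scale = 2.0 / (epsilon ** 2 * (1 - 2 * noise_bound) ** 2)
    return math.ceil(scale * math.log(2 * class_size / delta))


# Hypothetical setting: |C| = 2**10, epsilon = 0.1, delta = 0.05.
for p_b in (0.0, 0.1, 0.2, 0.3, 0.4):
    print(p_b, angluin_laird_sample_size(2 ** 10, 0.1, 0.05, p_b))
```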

   Better Solution - SQ Model
       Replace EX oracle of PAC model by SQ oracle
       Let Q be any predicate defined over labeled exs that can be evaluated in poly time.
       e.g. x_i = 0 and label = 1
       Let P_Q = probability that Q(<x,l>) is true for <x,l> drawn from D

       SQ(Q,μ,θ) returns
           an estimate P-hat_Q such that P_Q(1-μ) <= P-hat_Q <= P_Q(1+μ)
        or P-hat_Q = ⊥, in which case P_Q < θ

       Let's modify Find-S to be an SQ algorithm.  Advantage of SQ algorithm is that even with noisy ex oracle you can
       simulate the SQ oracle

       Intuitively you cannot look at any single ex but rather just gather statistics and hence will be tolerant of noise

       SQ algorithm for learning a monomial over vars x_1,...,x_n. Let l be the label of an ex
           h = T
           For i = 1 to n
               Q_{x_i} = (x_i = 0 AND l = 1)
               P-hat_{x_i} = SQ(Q_{x_i}, ½, ε/2n)   (where ½ is μ and ε/2n is θ)
               If P-hat_{x_i} = 0 or ⊥
                   h = h AND x_i
               Q_{!x_i} = (x_i = 1 AND l = 1)
               P-hat_{!x_i} = SQ(Q_{!x_i}, ½, ε/2n)
               If P-hat_{!x_i} = 0 or ⊥
                   h = h AND !x_i
           Output h
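
The SQ monomial learner can be tried out in Python against an idealised oracle. Everything about the oracle is an assumption of this sketch: it uses the uniform distribution on {0,1}^n, computes P_Q exactly by enumeration (which trivially meets the (1 ± μ) tolerance), and returns None in place of ⊥ when P_Q < θ. A real simulation would estimate P_Q from examples instead.

```python
from itertools import product


def make_exact_sq_oracle(target, n):
    """Idealised SQ oracle for the uniform distribution on {0,1}^n."""
    points = [(x, target(x)) for x in product((0, 1), repeat=n)]

    def sq(query, mu, theta):
        # Exact P_Q, so mu is satisfied trivially; None stands for "bottom".
        p = sum(1 for x, label in points if query(x, label)) / len(points)
        return None if p < theta else p

    return sq


def sq_learn_monomial(sq, n, epsilon):
    """Return the literals (i, v) of h, each read as 'x[i] must equal v'."""
    h = []
    theta = epsilon / (2 * n)
    for i in range(n):
        for v in (1, 0):  # v = 1 tests literal x_i, v = 0 tests !x_i
            # Query: the literal is false in a positive example.
            estimate = sq(lambda x, label, i=i, v=v: x[i] != v and label,
                          0.5, theta)
            if estimate is None or estimate == 0:
                h.append((i, v))
    return h


# Hypothetical target over n = 5 variables: x_0 AND !x_3.
n = 5
oracle = make_exact_sq_oracle(lambda x: x[0] == 1 and x[3] == 0, n)
h = sq_learn_monomial(oracle, n, epsilon=0.1)
print(h)
```

The learner never looks at an individual example, only at the oracle's answers, which is the property that makes the noise-tolerant simulation possible.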

       Let's argue this is correct
           if P-hat_literal = ⊥ then P_literal < ε/2n (this is the prob the literal is false in a + ex, which will cause an error)
           if P-hat_literal = 0 then P_literal <= P-hat_literal/(1-μ) = 0
           # literals included is <= 2n.  Hence error introduced by including literals not in target <= 2n · ε/2n <= ε
       We must also be sure we don't leave out any literals in the target monomial
       If P-hat_literal > 0 then P_literal >= P-hat_literal/(1+μ) = P-hat_literal · 2/3 > 0  (so literal not in target)
       Thus if a literal is not in h then the literal is not in target (equiv. if literal in target → literal will be in h)
       Hence if SQ oracle is accurate then error_D(h) <= ε.
       What is left is to use EX oracle to simulate the SQ oracle (with high probability)
       If no noise let Q be class of queries used.
           If Q finite: with c a constant,
               m = c · (1/(μ^2 θ)) log(|Q|/δ) exs suffice to simulate SQ
           If Q infinite:
               m = c · ((VCD(Q)/(μ^2 θ)) log(1/(μθ)) + (1/(μ^2 θ)) log(1/δ)) exs suffice to simulate SQ
       For our algorithm |Q| = 2n, μ = ½, θ = ε/2n
           # exs we need is c · (n/ε) log(n/δ)

       Now suppose EX is replaced by EX^p in which with prob p the label l is replaced by l-bar
       We can use EX^p to simulate SQ and still guarantee that:
           with prob >= 1-δ all SQ estimates meet requirements  (this is done in general) and
           when all SQ estimates meet requirements error_D(h) <= ε  (we did this)

       Sample size needed is roughly (O~ drops constants and low order terms)
           Q finite:    (log |Q| + log(1/δ)) / (μ^2 θ r (1-2p)^2)
           Q infinite:  (VCD(Q) + log(1/δ)) / (μ^2 θ r (1-2p)^2)
           (where μ^2 θ is the min value used and r is between θ and 1;
            r = θ / % of exs where predicate's value depends on label)

       So for our monomial ex we get sample complexity of:
           O~( (n/ε) · (log n + log(1/δ)) / (1-2p)^2 )

       Get results very similar to Angluin/Laird with efficient algorithms
       Good News  With the exception of the parity PAC algorithm (like monomial but use + (XOR) vs. AND), every PAC
                  algorithm can be converted to an SQ algorithm
                  Also we do not have any noise-tolerant parity algorithm
                  So SQ model does a good job capturing what can be done in PAC model with noise
                  (contrived exs do exist where can't learn in SQ model but can PAC learn with noise,
                   but no "natural exs")

Dependence on complexity of target concept as represented in C
   DFA - (learnable with MQ) deterministic finite state machine
   NFA - (not learnable) nondeterministic finite state machine
   or
   boolean formula - not learnable
       vs
   decision tree - learnable (with MQ)
   Can represent same functions but for some boolean function c the length of the shortest boolean formula can be
   exponentially shorter than the length of the shortest DT

What is PAC-learnable or exactly learnable?
   Aside: exactly learnable with poly # of EQs implies PAC learnable; the converse is not true

   PAC learnable (K constant)           | Not PAC learnable
   K-CNF, K-DNF                         | DFA
   K-decision lists                     | boolean formula
   K-decision trees                     | read-thrice boolean formula
   K-term-DNF, K-term-CNF               | constant depth threshold circuits
   any boolean function over K vars     | context free grammars
   union of K boxes in ℜ^n              |
   intersection of K halfspaces in ℜ^n  |
(where the first 2 in learnable column can learn using C as hypothesis and the next 3 need H != C)

Learnable from MQ and EQ
   Decision Trees
   Read-once Boolean formula
   Read-twice DNF
   DFA
   Horn sentences

Open
   DNF
   but if we learn read-thrice DNF with EQ and MQ then can learn arbitrary DNF from EQ alone