Boosting
(NOTE: Ep = Epsilon)

Let's begin by considering two questions:

1.  Suppose you are given a PAC algorithm A1 that works for any Ep but only for delta = 1/2.
That is, in poly time it outputs a hyp h with errorD(h) <= Ep with prob >= 1/2.

2.  Suppose you are given a PAC algorithm A2 that works for any delta but only for Ep = 1/2.

Can you use such an algorithm (as a black box) to obtain a PAC algorithm that works for any inputs
0 < delta <= 1/2 and 0 < Ep <= 1/2?

For (1) the answer is yes.  By running algorithm A1 O(log(1/delta)) times (and using hypothesis
testing to evaluate the resulting hypotheses) you achieve the desired goal.  This is often called
confidence boosting.
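
The confidence-boosting argument can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a definitive implementation: the names boost_confidence, run_a1, and estimate_error are hypothetical, run_a1 stands for one black-box call to A1, and estimate_error stands for the hypothesis-testing step on fresh examples. The choice k = ceil(log2(2/delta)) is one convenient constant assumed here; it leaves delta/2 for the testing step.

```python
import math

def boost_confidence(run_a1, estimate_error, delta):
    """Run A1 k times and keep the hypothesis with the lowest estimated error.

    run_a1()          -> a hypothesis, assumed Ep-good with probability >= 1/2
    estimate_error(h) -> estimated error of h (hypothesis testing on fresh exs)
    """
    # Each run fails with prob <= 1/2, so all k runs fail with prob <= 2^-k.
    # k = ceil(log2(2/delta)) makes that at most delta/2.
    k = math.ceil(math.log2(2.0 / delta))
    candidates = [run_a1() for _ in range(k)]
    return min(candidates, key=estimate_error), k

# Toy usage (assumption: a hypothesis is represented only by its true error,
# and the error estimate is exact).
outcomes = iter([0.4, 0.4, 0.05, 0.4, 0.4, 0.4, 0.4, 0.4])
best, k = boost_confidence(lambda: next(outcomes), lambda err: err, delta=0.01)
print(k, best)
```

In the toy run, one good hypothesis among the k candidates is enough, which is why the number of runs grows only logarithmically in 1/delta.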

Clearly the answer for (2) is no, since flipping a coin and predicting + if heads and - if tails works
as A2, but clearly no learning is taking place.

Many hardness results in PAC learning are of the form:
   If C is PAC-learnable then we can break the RSA cryptosystem.
There's even an interesting relationship between learning and breaking a cryptosystem.  Suppose
you want to predict the first bit b of the message that was encoded.
   If with prob >= 1-delta you could predict b with error <= 1/2 - x, where x is 1/poly
   (that is, accuracy >= 1/2 + 1/poly), then we say that the system is broken.

Def. - Strong Learner - standard PAC learner that works for any Ep and delta
Def. - Weak Learner (Rule of Thumb) - achieves PAC learning only for some fixed Ep <= 1/2 - 1/poly (and any delta)

It is important that the weak learner works for any distribution D.

Big question in the late 80's: Does weak PAC learning = strong PAC learning?
That is, given a weak learner Aw, can you use it to construct a strong learner A?

Rob Schapire proved this could be done in 1989.  This is known as boosting, since you can boost the
accuracy of weak learner Aw to any Ep > 0 desired, and the time complexity is O(1/Ep · time comp of Aw).

Since then better and better boosting algorithms have been obtained, and their success on real learning problems
(text categorization, character recognition, and many more) has been extremely good.

Basic Idea:
Train Aw on different training data to get a set of hypotheses and then combine these hypotheses.  Two key questions:
1.  Which collection of exs should be presented to Aw (often called "the expert") to extract weak hyps ("useful rules of thumb")?
2.  Once many weak hypotheses are obtained, how should they be combined to obtain an Ep-good hyp?

AdaBoost (introduced in 1995 and continues to find new real applications)

Given (x_1,y_1),...,(x_m,y_m) where x_i is an element of X and y_i is the label, which is +1 or -1 here
Initialize D_1(i) = 1/m (see note below)
For t = 1,...,T
   train weak learner on D_t
   get weak hyp h_t: X -> {-1,+1} with error Ep_t = Pr[h_t(x_i) != y_i] (note: this is measured wrt D_t)
   choose alpha_t = 1/2 ln((1 - Ep_t)/Ep_t)
   update D_{t+1}(i) = (D_t(i)/Z_t) · e^(-alpha_t)   if h_t(x_i) = y_i
                     = (D_t(i)/Z_t) · e^(alpha_t)    if h_t(x_i) != y_i
          that is,  D_{t+1}(i) = D_t(i) exp(-alpha_t y_i h_t(x_i)) / Z_t
          where Z_t is just a normalization factor so that D_{t+1} will be a distribution
Output final hyp H(x) = sign(Sum (t=1 to T) alpha_t h_t(x))

Note (as referenced above): D_t is a distribution over the m exs on which the weak learner must work.  The basic idea is to increase the weight of "hard"
exs to force the weak learner to learn the hard exs from the training set.
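
One round of the reweighting step above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the function name adaboost_round and its argument layout are hypothetical, and it assumes 0 < Ep_t < 1 so that alpha_t is finite. The example numbers are round 1 of the PlayTennis run later in these notes (uniform weights, stump wrong on x6, x9, x11, x14).

```python
import math

def adaboost_round(D, correct):
    """One AdaBoost reweighting step.

    D:       current weights D_t(i), summing to 1
    correct: correct[i] is True iff h_t(x_i) == y_i
    Returns (Ep_t, alpha_t, D_{t+1}). Assumes 0 < Ep_t < 1.
    """
    eps = sum(w for w, c in zip(D, correct) if not c)   # weighted error wrt D_t
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # Shrink the weight of correctly classified exs, grow the weight of mistakes.
    unnorm = [w * math.exp(-alpha if c else alpha) for w, c in zip(D, correct)]
    Z = sum(unnorm)                                     # normalization factor Z_t
    return eps, alpha, [w / Z for w in unnorm]

D1 = [1.0 / 14] * 14
correct = [i not in (5, 8, 10, 13) for i in range(14)]  # 0-based x6, x9, x11, x14
eps, alpha, D2 = adaboost_round(D1, correct)
print(round(eps, 6), round(alpha, 6))
print([round(w, 3) for w in D2])
```

After the update the four misclassified exs carry weight 0.125 each and the rest 0.05 each, so the mistakes together hold exactly half of the total weight.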

How do you create Dt?

Sometimes an algorithm lets you directly give weights to examples.  If not, a training set can be created by sampling according
to Dt, and this unweighted set of examples can be used to train the weak learner.
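
The sampling option can be sketched with the standard library. This is only an illustrative sketch: the helper name resample is hypothetical, and the sample size n is a free choice that the notes do not fix.

```python
import random

def resample(examples, D, n, rng=random):
    """Draw n examples i.i.d. (with replacement) according to the weights D."""
    return rng.choices(examples, weights=D, k=n)

# Example weights: 0.125 on x6, x9, x11, x14 and 0.05 on the other ten exs.
D = [0.125 if i in (6, 9, 11, 14) else 0.05 for i in range(1, 15)]
sample = resample(list(range(1, 15)), D, 14, random.Random(0))
print(sample)   # indices of the exs handed to an unweighted weak learner
```

Heavily weighted exs tend to appear several times in the sample, which is how an unweighted learner is made to pay more attention to them.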
Intuitively, alpha_t measures the importance that is assigned to h_t.  Note: alpha_t >= 0 if Ep_t <= 1/2 (and alpha_t gets larger as Ep_t gets smaller).

There are many extensions
Let Ep_t = 1/2 - delta_t (so delta_t is the advantage over random guessing)

Freund and Schapire prove that the training error of the final hyp H is at most:
Prod_t [2·sqrt(Ep_t(1 - Ep_t))] = Prod_t sqrt(1 - 4 delta_t^2) <= exp(-2·Sum_t delta_t^2)
So training error drops exponentially in the # of rounds (as long as each delta_t stays bounded away from 0).  Previous boosters required as input some lower bound delta > 0 such that
delta_t >= delta.  AdaBoost does not need to know delta.  It is adaptive, hence the name "Ada", short for adaptive.
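
The chain of (in)equalities in the training error bound can be checked numerically. The sketch below is only a sanity check, not a proof; it plugs in the epsilon values reported for rounds 1-3 of the PlayTennis run later in these notes, and the variable names are hypothetical.

```python
import math

# Ep_t for rounds 1-3, copied from the output log in these notes
eps = [0.285714, 0.3, 0.244048]
deltas = [0.5 - e for e in eps]          # advantage over random guessing

product = math.prod(2.0 * math.sqrt(e * (1.0 - e)) for e in eps)
middle = math.prod(math.sqrt(1.0 - 4.0 * d * d) for d in deltas)
bound = math.exp(-2.0 * sum(d * d for d in deltas))

print(product, middle, bound)
```

The first two quantities agree (about 0.71) and both sit below the exponential bound (about 0.74); the training error reported after round 3 is 0.142857, well under either value.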

Of course the real goal is to bound the generalization (true) error, errorD(H).

Let d = VCD (VC dimension) of the weak hypothesis space
let m = size of training set
let T = # boosting rounds
Freund and Schapire proved that with high prob:
errorD(H) <= error_training(H) + O~(sqrt(Td/m))   (note: O~ = soft-O, like big-O but drops log terms as well as constants)
So putting these together yields, with high prob:
errorD(H) <= exp(-2·Sum_t delta_t^2) + O~(sqrt(Td/m))

This bound would make it seem like AdaBoost will overfit the training data when T becomes very large.  While this sometimes
happens, empirically it often does not overfit even when run for thousands of rounds.  Moreover, the generalization error sometimes
continues to go down long after the training error reaches zero.  This clearly contradicts the spirit of the above bound.



Alternate analysis using margins
margin of example (x,y) = y · Sum_t alpha_t h_t(x) / Sum_t alpha_t,   where y is the label
The margin is in [-1,+1] and is positive iff H correctly classifies the ex.
The magnitude of the margin can be interpreted as a measure of confidence in the prediction.
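
The margin definition can be sketched directly. The function name margin is hypothetical; the alpha values are those reported for rounds 1-3 of the PlayTennis run later in these notes, and the stump predictions for x1 and x6 were read off the leaf labels in that log (an assumption of this sketch, not part of the original output).

```python
def margin(y, alphas, preds):
    """Normalized margin y * Sum_t alpha_t h_t(x) / Sum_t alpha_t, in [-1, +1]."""
    return y * sum(a * h for a, h in zip(alphas, preds)) / sum(alphas)

alphas = [0.458145, 0.423649, 0.565308]          # alpha_1, alpha_2, alpha_3

# x1 = (S, H, H, W), label -1: the three stumps predict -1, +1, -1
m_x1 = margin(-1, alphas, [-1, +1, -1])
# x6 = (R, C, N, S), label -1: the three stumps predict +1, -1, +1
m_x6 = margin(-1, alphas, [+1, -1, +1])

print(m_x1, m_x6)
```

Here x1 gets a positive margin (correctly classified after three rounds) and x6 a negative margin of the same size (still misclassified), which matches x6 being one of the heavily weighted exs in round 4.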
Schapire et al. proved that larger margins on the training set translate into a superior upper bound on generalization error.  Namely, with
high prob.
gen. error <= Pr[margin(x,y) <= Theta] + O~(sqrt(d/(m Theta^2)))
for any Theta > 0, where the probability is taken over the training set.
Note this bound is entirely independent of T, the # of boosting rounds.

They also proved that boosting is particularly aggressive at reducing the fraction of training exs with small margin (in a quantifiable sense), since it concentrates on the exs
with the smallest margins.
                 


Boosting Example
Each hyp will be a "decision stump", which is just a 1-node decision tree.  Let's use the PlayTennis training data from Table 3.2


Ex    Outlook  Temp  Humidity  Wind  Label
X1    S        H     H         W     -1
X2    S        H     H         S     -1
X3    O        H     H         W     +1
X4    R        M     H         W     +1
X5    R        C     N         W     +1
X6    R        C     N         S     -1
X7    O        C     N         S     +1
X8    S        M     H         W     -1
X9    S        C     N         W     +1
X10   R        M     N         W     +1
X11   S        M     N         S     +1
X12   O        M     H         S     +1
X13   O        H     N         W     +1
X14   R        M     H         S     -1

Choices for the root are: Outlook, Temp, Humidity, and Wind.

If Outlook is selected then there are 3 branches with the examples falling into the leaves as follows:
   S: X1, X2, X8, X9, X11      O: X3, X7, X12, X13      R: X4, X5, X6, X10, X14
If Temp is selected then there are 3 branches with the examples falling into the leaves as follows:
   H: X1, X2, X3, X13      M: X4, X8, X10, X11, X12, X14      C: X5, X6, X7, X9
If Humidity is selected then there are 2 branches with the examples falling into the leaves as follows:
   H: X1, X2, X3, X4, X8, X12, X14      N: X5, X6, X7, X9, X10, X11, X13
If Wind is selected then there are 2 branches with the examples falling into the leaves as follows:
   S: X2, X6, X7, X11, X12, X14      W: X1, X3, X4, X5, X8, X9, X10, X13

For the weak learner we will pick the attribute that maximizes the information gain (when weighting the examples based on the current distribution Dt). Since the information gain is maximized when the entropy is minimized, we'll just look at the entropy.

Let's consider an attribute A which has three possible values v1, v2, and v3. Let pi be the sum of the weights of the positive examples in which A=vi, and let ni be the sum of the weights of the negative examples in which A=vi. Let mi = pi+ni and let m=m1+m2+m3. Then the label for the leaf corresponding to vi is negative if and only if pi <= ni. The entropy for A (taking 0 lg 0 to be 0) is
Sum_i [ (mi/m) · ( -(pi/mi) lg(pi/mi) - (ni/mi) lg(ni/mi) ) ]

where lg denotes log2. When there are just two values then the only difference is that m=m1+m2 (and the sum goes over i=1,2, versus i=1,2,3).
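
The entropy formula can be sketched as a small function. The name split_entropy and the (ni, pi) pair layout are hypothetical choices for this sketch; the pair order (negative weight first, then positive weight) follows the order used in the output log below, and the example numbers are the round 1 weights with every example at 1/14.

```python
import math

def split_entropy(leaves):
    """Weighted entropy of a split.

    leaves: list of (n_i, p_i) pairs, the negative and positive weight
            falling into the leaf for each attribute value v_i.
    """
    m = sum(n + p for n, p in leaves)
    total = 0.0
    for n, p in leaves:
        m_i = n + p
        for w in (n, p):
            if w > 0:                      # 0 lg 0 is taken to be 0
                total -= (m_i / m) * (w / m_i) * math.log2(w / m_i)
    return total

# Round 1 (uniform weights): Outlook has leaves S, O, R; Humidity has H, N.
outlook = [(3 / 14, 2 / 14), (0.0, 4 / 14), (2 / 14, 3 / 14)]
humidity = [(4 / 14, 3 / 14), (1 / 14, 6 / 14)]
print(split_entropy(outlook), split_entropy(humidity))
```

With uniform weights Outlook comes out lower than Humidity, consistent with Outlook being chosen in round 1 of the log.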

Here is the output obtained when running boosting for 20 rounds. The third and fourth lines for each round give the weights D[x1],...,D[x7] and D[x8],...,D[x14]. Then for the line starting with "Outlook" the first number is the weight of the negative examples where Outlook=S, the second number is the weight of the positive examples where Outlook=S, the third number is the weight of the negative examples in which Outlook=O, the fourth number is the weight of the positive examples in which Outlook=O, and so on (with the last two numbers corresponding to Outlook=R). Finally, the entropy for Outlook is given. The lines for Temp, Humidity (abbreviated "Hum" in the output), and Wind have the same format, with the values considered in the order H,M,C for Temp, H,N for Humidity, and S,W for Wind.

Finally, the attribute selected is given, followed by the label for each leaf. Then epsilon and alpha for that round are given, followed by the training error of the resulting hypothesis (obtained by taking the sign of Sum_i alpha_i h_i).



Starting Round 1
distribution D[x1]...D[x14]:
0.0714286  0.0714286  0.0714286  0.0714286  0.0714286  0.0714286  0.0714286  
0.0714286  0.0714286  0.0714286  0.0714286  0.0714286  0.0714286  0.0714286  
Outlook: 0.214286  0.142857  0  0.285714  0.142857  0.214286  entropy is 0.693536
Temp: 0.142857  0.142857  0.142857  0.285714  0.0714286  0.214286  entropy is 0.911063
Hum: 0.285714  0.214286  0.0714286  0.428571  entropy is 0.78845
Wind: 0.214286  0.214286  0.142857  0.428571  entropy is 0.892159
Outlook minimizes the entropy
label for S is -, label for O is +, label for R is +
epsilon is 0.285714 and alpha is 0.458145
training error is 0.285714

Starting Round 2
distribution D[x1]...D[x14]:
0.05  0.05  0.05  0.05  0.05  0.125  0.05  
0.05  0.125  0.05  0.125  0.05  0.05  0.125  
Outlook: 0.15  0.25  0  0.2  0.25  0.15  entropy is 0.763547
Temp: 0.1  0.1  0.175  0.275  0.125  0.225  entropy is 0.962936
Hum: 0.275  0.15  0.125  0.45  entropy is 0.832424
Wind: 0.3  0.225  0.1  0.375  entropy is 0.869926
Outlook minimizes the entropy
label for S is +, label for O is +, label for R is -
epsilon is 0.3 and alpha is 0.423649
training error is 0.285714

Starting Round 3
distribution D[x1]...D[x14]:
0.0833333  0.0833333  0.0357143  0.0833333  0.0833333  0.0892857  0.0357143  
0.0833333  0.0892857  0.0833333  0.0892857  0.0357143  0.0357143  0.0892857  
Outlook: 0.25  0.178571  0  0.142857  0.178571  0.25  entropy is 0.839888
Temp: 0.166667  0.119048  0.172619  0.291667  0.0892857  0.208333  entropy is 0.939531
Hum: 0.339286  0.154762  0.0892857  0.416667  entropy is 0.783257
Wind: 0.261905  0.160714  0.166667  0.410714  entropy is 0.905551
Humidity minimizes the entropy
label for H is -, label for N is +
epsilon is 0.244048 and alpha is 0.565308
training error is 0.142857

Starting Round 4
distribution D[x1]...D[x14]:
0.0551181  0.0551181  0.0731707  0.170732  0.0551181  0.182927  0.023622  
0.0551181  0.0590551  0.0551181  0.0590551  0.0731707  0.023622  0.0590551  
Outlook: 0.165354  0.11811  0  0.193586  0.241982  0.280968  entropy is 0.798609
Temp: 0.110236  0.0787402  0.114173  0.358076  0.182927  0.137795  entropy is 0.894279
Hum: 0.224409  0.317073  0.182927  0.275591  entropy is 0.974903
Wind: 0.2971  0.155848  0.110236  0.436816  entropy is 0.817215
Outlook minimizes the entropy
label for S is -, label for O is +, label for R is +
epsilon is 0.360092 and alpha is 0.287482
training error is 0.142857

Starting Round 5
distribution D[x1]...D[x14]:
0.0430672  0.0430672  0.0571729  0.133403  0.0430672  0.254  0.0184574  
0.0430672  0.082  0.0430672  0.082  0.0571729  0.0184574  0.082  
Outlook: 0.129202  0.164  0  0.151261  0.336  0.219538  entropy is 0.82801
Temp: 0.0861345  0.0615246  0.125067  0.315643  0.254  0.143525  entropy is 0.91189
Hum: 0.211202  0.247749  0.254  0.287049  entropy is 0.996441
Wind: 0.379067  0.15763  0.0861345  0.377168  entropy is 0.78978
Wind minimizes the entropy
label for S is -, label for W is +
epsilon is 0.243765 and alpha is 0.566075
training error is 0.142857

Starting Round 6
distribution D[x1]...D[x14]:
0.0883377  0.0284748  0.037801  0.0882023  0.0284748  0.167937  0.037859  
0.0883377  0.0542159  0.0284748  0.168195  0.117271  0.0122035  0.0542159  
Outlook: 0.20515  0.222411  0  0.205134  0.222153  0.145152  entropy is 0.782632
Temp: 0.116812  0.0406782  0.142554  0.402143  0.167937  0.12055  entropy is 0.872507
Hum: 0.259366  0.243274  0.167937  0.329423  entropy is 0.961112
Wind: 0.250628  0.323325  0.176675  0.249372  entropy is 0.984348
Outlook minimizes the entropy
label for S is +, label for O is +, label for R is -
epsilon is 0.350302 and alpha is 0.308856
training error is 0.142857

Starting Round 7
distribution D[x1]...D[x14]:
0.126088  0.0406432  0.0290912  0.125895  0.0406432  0.129242  0.0291359  
0.126088  0.0417239  0.0406432  0.129441  0.0902501  0.00939164  0.0417239  
Outlook: 0.292819  0.171165  0  0.157869  0.170966  0.207181  entropy is 0.816346
Temp: 0.166731  0.0500348  0.167812  0.386229  0.129242  0.111503  entropy is 0.888703
Hum: 0.334543  0.245236  0.129242  0.290979  entropy is 0.943953
Wind: 0.21161  0.248827  0.252176  0.287388  entropy is 0.996169
Outlook minimizes the entropy
label for S is -, label for O is +, label for R is +
epsilon is 0.342131 and alpha is 0.326906
training error is 0.142857

Starting Round 8
distribution D[x1]...D[x14]:
0.0958306  0.03089  0.0221102  0.0956837  0.03089  0.188878  0.0221441  
0.0958306  0.0609765  0.03089  0.189168  0.0685928  0.00713793  0.0609765  
Outlook: 0.222551  0.250145  0  0.119985  0.249855  0.157464  entropy is 0.863603
Temp: 0.126721  0.0380279  0.156807  0.384335  0.188878  0.114011  entropy is 0.880028
Hum: 0.283528  0.186387  0.188878  0.341207  entropy is 0.953384
Wind: 0.280745  0.279905  0.191661  0.247688  entropy is 0.994831
Outlook minimizes the entropy
label for S is +, label for O is +, label for R is -
epsilon is 0.380015 and alpha is 0.244742
training error is 0.142857

Starting Round 9
distribution D[x1]...D[x14]:
0.126088  0.0406432  0.0178312  0.125895  0.0406432  0.152325  0.0178586  
0.126088  0.0491758  0.0406432  0.152559  0.0553181  0.00575654  0.0491758  
Outlook: 0.292819  0.201735  0  0.0967644  0.201501  0.207181  entropy is 0.891008
Temp: 0.166731  0.0463997  0.175264  0.374415  0.152325  0.107678  entropy is 0.89165
Hum: 0.341995  0.199044  0.152325  0.306636  entropy is 0.934264
Wind: 0.242144  0.225736  0.252176  0.279945  entropy is 0.998539
Outlook minimizes the entropy
label for S is -, label for O is +, label for R is +
epsilon is 0.403236 and alpha is 0.196001
training error is 0.142857

Starting Round 10
distribution D[x1]...D[x14]:
0.105643  0.0340529  0.0149399  0.105481  0.0340529  0.188878  0.0149629  
0.105643  0.0609765  0.0340529  0.189168  0.0463483  0.00482312  0.0609765  
Outlook: 0.245339  0.250145  0  0.0810742  0.249855  0.173587  entropy is 0.908929
Temp: 0.139696  0.0388761  0.166619  0.375051  0.188878  0.109992  entropy is 0.884063
Hum: 0.306315  0.166769  0.188878  0.338037  entropy is 0.938983
Wind: 0.283908  0.25048  0.211286  0.254326  entropy is 0.995617
Temp minimizes the entropy
label for H is -, label for M is +, label for C is -
epsilon is 0.296375 and alpha is 0.43231
training error is 0.0714286

Starting Round 11
distribution D[x1]...D[x14]:
0.0750705  0.0241982  0.0252044  0.0749554  0.0574491  0.134218  0.0252431  
0.178225  0.102871  0.0241982  0.134424  0.0329354  0.00813686  0.102871  
Outlook: 0.277494  0.237295  0  0.0915198  0.237089  0.156603  entropy is 0.89426
Temp: 0.0992687  0.0323351  0.281096  0.266513  0.134218  0.185563  entropy is 0.967972
Hum: 0.380365  0.133095  0.134218  0.352322  entropy is 0.837328
Wind: 0.261287  0.192603  0.253296  0.292815  entropy is 0.990409
Humidity minimizes the entropy
label for H is -, label for N is +
epsilon is 0.267313 and alpha is 0.504148
training error is 0.0714286

Starting Round 12
distribution D[x1]...D[x14]:
0.0512296  0.0165133  0.047144  0.140201  0.0392044  0.25105  0.0172264  
0.121624  0.070201  0.0165133  0.0917337  0.0616045  0.00555276  0.070201  
Outlook: 0.189367  0.161935  0  0.131528  0.321251  0.195919  entropy is 0.844796
Temp: 0.0677429  0.0220661  0.191825  0.310053  0.25105  0.126632  entropy is 0.929879
Hum: 0.259568  0.24895  0.25105  0.240432  entropy is 0.999675
Wind: 0.337764  0.170565  0.172854  0.318817  entropy is 0.927847
Outlook minimizes the entropy
label for S is -, label for O is +, label for R is -
epsilon is 0.357854 and alpha is 0.292346
training error is 0.142857

Starting Round 13
distribution D[x1]...D[x14]:
0.0398894  0.0128579  0.0367081  0.195892  0.0547771  0.195477  0.0134132  
0.0947016  0.0980861  0.0230727  0.128172  0.0479676  0.00432359  0.0546612  
Outlook: 0.147449  0.226258  0  0.102413  0.250139  0.273742  entropy is 0.884741
Temp: 0.0527473  0.0171815  0.149363  0.395104  0.195477  0.166276  entropy is 0.899243
Hum: 0.20211  0.280568  0.195477  0.321845  entropy is 0.968266
Wind: 0.262997  0.189553  0.134591  0.41286  entropy is 0.884411
Wind minimizes the entropy
label for S is -, label for W is +
epsilon is 0.324144 and alpha is 0.367397
training error is 0.0714286

Starting Round 14
distribution D[x1]...D[x14]:
0.0615304  0.00951233  0.0271568  0.144921  0.0405242  0.144615  0.0206901  
0.14608  0.0725643  0.0170693  0.197709  0.0739913  0.0031986  0.0404385  
Outlook: 0.217122  0.270273  0  0.125037  0.185053  0.202515  entropy is 0.870206
Temp: 0.0710427  0.0127109  0.186518  0.433691  0.144615  0.133779  entropy is 0.89242
Hum: 0.257561  0.246069  0.144615  0.351755  entropy is 0.935507
Wind: 0.194565  0.29239  0.20761  0.305435  entropy is 0.972189
Outlook minimizes the entropy
label for S is +, label for O is +, label for R is +
epsilon is 0.402175 and alpha is 0.198205
training error is 0

Starting Round 15
distribution D[x1]...D[x14]:
0.0764969  0.0118261  0.022713  0.121207  0.0338931  0.179791  0.0173045  
0.181612  0.0606903  0.0142761  0.165357  0.0618838  0.0026752  0.0502747  
Outlook: 0.269935  0.226047  0  0.104576  0.230065  0.169376  entropy is 0.885941
Temp: 0.088323  0.0145013  0.231886  0.362724  0.179791  0.111888  entropy is 0.924256
Hum: 0.320209  0.205804  0.179791  0.294196  entropy is 0.961789
Wind: 0.241891  0.244545  0.258109  0.255455  entropy is 0.99998
Outlook minimizes the entropy
label for S is -, label for O is +, label for R is -
epsilon is 0.395424 and alpha is 0.212285
training error is 0.0714286

Starting Round 16
distribution D[x1]...D[x14]:
0.0632649  0.00978048  0.0187842  0.153263  0.0428567  0.148691  0.0143113  
0.150198  0.0767409  0.0180517  0.209088  0.0511794  0.00221246  0.0415784  
Outlook: 0.223243  0.285829  0  0.0864874  0.19027  0.214171  entropy is 0.906929
Temp: 0.0730454  0.0119929  0.191776  0.431582  0.148691  0.133909  entropy is 0.895077
Hum: 0.264821  0.223226  0.148691  0.363261  entropy is 0.930522
Wind: 0.20005  0.274579  0.213462  0.311909  entropy is 0.978137
Temp minimizes the entropy
label for H is -, label for M is +, label for C is -
epsilon is 0.346681 and alpha is 0.316829
training error is 0.0714286

Starting Round 17
distribution D[x1]...D[x14]:
0.0484181  0.00748523  0.0270915  0.117295  0.0618099  0.113797  0.0206404  
0.216622  0.110679  0.0138154  0.16002  0.0391688  0.00319091  0.0599663  
Outlook: 0.272525  0.270699  0  0.0900916  0.173763  0.192921  entropy is 0.909182
Temp: 0.0559033  0.0106761  0.276588  0.3303  0.113797  0.19313  entropy is 0.956461
Hum: 0.332491  0.183556  0.113797  0.370156  entropy is 0.865401
Wind: 0.181249  0.219829  0.26504  0.333882  entropy is 0.991598
Humidity minimizes the entropy
label for H is -, label for N is +
epsilon is 0.297353 and alpha is 0.429968
training error is 0

Starting Round 18
distribution D[x1]...D[x14]:
0.0344541  0.00532645  0.0455544  0.197233  0.0439836  0.19135  0.0146876  
0.154147  0.0787587  0.00983097  0.113869  0.0658626  0.00227064  0.0426717  
Outlook: 0.193927  0.192628  0  0.128375  0.234022  0.251047  entropy is 0.87119
Temp: 0.0397805  0.00759708  0.196819  0.386796  0.19135  0.13743  entropy is 0.927977
Hum: 0.236599  0.30865  0.19135  0.263401  entropy is 0.984842
Wind: 0.239348  0.19442  0.188601  0.377631  entropy is 0.95023
Outlook minimizes the entropy
label for S is -, label for O is +, label for R is +
epsilon is 0.42665 and alpha is 0.147766
training error is 0

Starting Round 19
distribution D[x1]...D[x14]:
0.0300463  0.00464502  0.0397265  0.172  0.0383567  0.224247  0.0128086  
0.134427  0.092299  0.00857327  0.133446  0.0574366  0.00198015  0.0500079  
Outlook: 0.169118  0.225745  0  0.111952  0.274255  0.21893  entropy is 0.877684
Temp: 0.0346913  0.00662517  0.184434  0.371456  0.224247  0.143464  entropy is 0.923033
Hum: 0.219126  0.269164  0.224247  0.287464  entropy is 0.990647
Wind: 0.2789  0.203691  0.164473  0.352936  entropy is 0.940835
Outlook minimizes the entropy
label for S is +, label for O is +, label for R is -
epsilon is 0.388048 and alpha is 0.227762
training error is 0

Starting Round 20
distribution D[x1]...D[x14]:
0.0387146  0.00598511  0.0324589  0.221623  0.0494226  0.183223  0.0104653  
0.173209  0.0754136  0.0110467  0.109033  0.046929  0.0016179  0.0408593  
Outlook: 0.217908  0.184447  0  0.0914711  0.224082  0.282092  entropy is 0.901713
Temp: 0.0446997  0.00760301  0.214068  0.388631  0.183223  0.135302  entropy is 0.935041
Hum: 0.258768  0.30101  0.183223  0.256999  entropy is 0.988737
Wind: 0.230067  0.166427  0.211923  0.391582  entropy is 0.953429
Outlook minimizes the entropy
label for S is -, label for O is +, label for R is +
epsilon is 0.408529 and alpha is 0.185025
training error is 0
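
The run above can be approximated with a short script. This is a minimal sketch under stated assumptions, not the program that produced the log: all function and variable names are hypothetical, ties in entropy are broken by attribute order, a combined score of exactly 0 is counted as an error, and Ep_t is assumed to stay strictly between 0 and 1.

```python
import math

# PlayTennis data: (Outlook, Temp, Humidity, Wind, label), x1..x14 in order
DATA = [
    ("S", "H", "H", "W", -1), ("S", "H", "H", "S", -1), ("O", "H", "H", "W", +1),
    ("R", "M", "H", "W", +1), ("R", "C", "N", "W", +1), ("R", "C", "N", "S", -1),
    ("O", "C", "N", "S", +1), ("S", "M", "H", "W", -1), ("S", "C", "N", "W", +1),
    ("R", "M", "N", "W", +1), ("S", "M", "N", "S", +1), ("O", "M", "H", "S", +1),
    ("O", "H", "N", "W", +1), ("R", "M", "H", "S", -1),
]
ATTRS = ["Outlook", "Temp", "Humidity", "Wind"]

def leaf_weights(a, D):
    """Map each value of attribute a to [negative weight, positive weight]."""
    w = {}
    for row, d in zip(DATA, D):
        idx = 1 if row[4] > 0 else 0
        w.setdefault(row[a], [0.0, 0.0])[idx] += d
    return w

def entropy(w):
    """Weighted entropy of the split described by leaf_weights (0 lg 0 = 0)."""
    m = sum(n + p for n, p in w.values())
    return -sum(x / m * math.log2(x / (n + p))
                for n, p in w.values() for x in (n, p) if x > 0)

def train_stump(D):
    """Pick the min-entropy attribute; a leaf is negative iff p_i <= n_i."""
    a = min(range(4), key=lambda j: entropy(leaf_weights(j, D)))
    labels = {v: (-1 if p <= n else +1) for v, (n, p) in leaf_weights(a, D).items()}
    return a, labels

def adaboost(T):
    D = [1.0 / len(DATA)] * len(DATA)
    F = [0.0] * len(DATA)            # running Sum_t alpha_t h_t(x_i)
    history = []                     # (attribute, Ep_t, alpha_t, training error)
    for _ in range(T):
        a, labels = train_stump(D)
        preds = [labels[row[a]] for row in DATA]
        eps = sum(d for d, h, row in zip(D, preds, DATA) if h != row[4])
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        D = [d * math.exp(-alpha * row[4] * h) for d, h, row in zip(D, preds, DATA)]
        Z = sum(D)
        D = [d / Z for d in D]
        F = [f + alpha * h for f, h in zip(F, preds)]
        err = sum(1 for f, row in zip(F, DATA) if f * row[4] <= 0) / len(DATA)
        history.append((ATTRS[a], eps, alpha, err))
    return history

for name, eps, alpha, err in adaboost(3):
    print(f"{name}: epsilon={eps:.6f} alpha={alpha:.6f} training error={err:.6f}")
```

For the first three rounds this sketch should select Outlook, Outlook, and Humidity with the same epsilon, alpha, and training error values as the log above; later rounds were not checked by hand and may differ if floating-point ties are broken differently.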