HTML document prepared by Brian Blankstein.

Two-sided and one-sided bounds

The two-sided bound with N% confidence is:
error[S](h) - Z[N]*delta <= error[D](h) <= error[S](h) + Z[N]*delta

Suppose you just want to say that error[D](h) <= x. Then you only need a one-sided bound.
Let a = 1 - N/100
The two-sided N% interval leaves a/2 of the probability in each tail, so dropping the lower limit gives a one-sided bound with 100*(1-a/2)% confidence: error[D](h) <= error[S](h) + Z[N]*delta
So if N = 95:
a = 1 - .95 = .05
So you get 100*(1-.025) = 97.5% confidence for the one-sided bound.
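
The bounds above can be checked numerically. The following is a minimal sketch, not part of the original notes: it assumes delta is estimated by the usual binomial approximation sqrt(e*(1-e)/n), and the function names and the example values (sample error 0.30 on 100 examples) are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_two_sided(confidence):
    """Return z such that the central `confidence` mass of N(0, 1) lies in [-z, z]."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def error_bounds(sample_error, n, confidence=0.95):
    """Return (lower, upper, one_sided_confidence) for the true error.

    Assumption: delta is estimated as sqrt(e * (1 - e) / n).
    """
    z = z_two_sided(confidence)
    delta = sqrt(sample_error * (1.0 - sample_error) / n)
    lower = sample_error - z * delta
    upper = sample_error + z * delta
    # Keeping only the upper limit moves the lower tail (a/2) inside the bound.
    one_sided = 1.0 - (1.0 - confidence) / 2.0
    return lower, upper, one_sided


lower, upper, one_sided = error_bounds(0.30, 100, confidence=0.95)
print(f"two-sided 95%: {lower:.3f} <= error_D(h) <= {upper:.3f}")
print(f"one-sided {100 * one_sided:.1f}%: error_D(h) <= {upper:.3f}")
```

The same upper limit serves both statements; only the confidence attached to it changes.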

Central Limit Theorem

Consider a set of iid random variables Y1,Y2,...Yn governed by an arbitrary probability distribution with mean m and finite variance delta^2.
Let Ybar[n] (the bar indicates the average of the variables) = 1/n * sum[i=1 to n] Yi
Then, as n -> infinity, the distribution governing (Ybar[n] - m)/(delta/sqrt(n)) approaches a normal with mean 0 and standard deviation 1.
Hence we can apply the one-sided and two-sided confidence bounds regardless of the distribution defining the Yi's.
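
A small simulation can make the theorem concrete. This is only an illustrative sketch: the choice of an exponential distribution (mean 1, standard deviation 1), the sample size n=50, the number of trials, and the function name are all assumptions made for the example.

```python
import random
from math import sqrt
from statistics import mean, pstdev


def standardized_means(n, trials, rng):
    """Draw `trials` sample means of n exponential(1) variables and standardize each."""
    m, delta = 1.0, 1.0  # mean and standard deviation of exponential(rate=1)
    out = []
    for _ in range(trials):
        y_bar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        out.append((y_bar - m) / (delta / sqrt(n)))
    return out


rng = random.Random(0)  # fixed seed so the run is repeatable
z = standardized_means(n=50, trials=4000, rng=rng)
inside = sum(1 for v in z if abs(v) <= 1.96) / len(z)
print(f"mean={mean(z):.3f}  std={pstdev(z):.3f}  fraction within +-1.96: {inside:.3f}")
```

Although the exponential distribution is strongly skewed, the standardized means should come out with mean near 0, standard deviation near 1, and roughly 95% of values within 1.96 of zero.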

An important property of the Normal Distribution

Let Y1 be drawn based on N(m1, delta1^2)
Let Y2 be drawn based on N(m2, delta2^2)
Then Y1+Y2 is defined by N(m1+m2, delta1^2+delta2^2)
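Hence, if we want to estimate d = error[D](h1) - error[D](h2) using S1 for h1 (|S1|=n1) and S2 for h2 (|S2|=n2), then we can use the estimator d_hat = error[S1](h1) - error[S2](h2). The difference of two independent normal variables is also normal, with the means subtracted and the variances added, so d_hat is approximately normal with mean d and variance
delta[d]^2 ~ error[S1](h1)*(1-error[S1](h1))/n1 + error[S2](h2)*(1-error[S2](h2))/n2
Then the N% confidence interval is: d_hat - Z[N]*sqrt(delta[d]^2) <= d <= d_hat + Z[N]*sqrt(delta[d]^2)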
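
As a rough sketch of this interval in code: the function name and example numbers below are hypothetical, and the variance of each sample error is assumed to be estimated by the binomial approximation e*(1-e)/n, with the two variances added because the samples are independent.

```python
from math import sqrt
from statistics import NormalDist


def difference_interval(e1, n1, e2, n2, confidence=0.95):
    """Two-sided interval for d = error_D(h1) - error_D(h2).

    Assumption: var(error_S(h)) is approximated by e * (1 - e) / n.
    """
    d_hat = e1 - e2
    var_d = e1 * (1.0 - e1) / n1 + e2 * (1.0 - e2) / n2
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * sqrt(var_d)
    return d_hat - half_width, d_hat + half_width


lo, hi = difference_interval(0.30, 100, 0.20, 100, confidence=0.95)
print(f"95% interval for d: [{lo:.4f}, {hi:.4f}]")
```

With these example numbers the interval includes 0, so at the two-sided 95% level the data would not rule out the two hypotheses having equal true error.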

What if you have just one sample S? Then you can use S for both S1 and S2. This eliminates the variance due to random differences between S1 and S2, so the confidence interval given above will generally be overly conservative, but still correct.

Hypothesis Testing

Suppose we have h1 and h2 with corresponding samples S1 and S2, each of size 100. Suppose error[S1](h1) = 0.3 and error[S2](h2) = 0.2
d_hat = error[S1](h1) - error[S2](h2) = 0.1
What is the probability that error[D](h1) > error[D](h2)?
That is, with d = error[D](h1) - error[D](h2), what is the probability that d > 0, given that we observed d_hat = 0.1?
This is the probability that d_hat did not overestimate d by more than 0.1, i.e. the probability that d_hat < d + 0.1
Since d is the mean m[d] of the distribution governing d_hat, prob(d > 0) equals the probability that d_hat falls into the one-sided interval d_hat < m[d] + 0.1
delta[d] ~ .061 --> d_hat < m[d] + 1.64*delta[d]
From Table 5.1 we find that 1.64 standard deviations corresponds to a two-sided 90% confidence level, and hence the one-sided confidence is 95%.
Thus, the prob that error[D](h1) > error[D](h2) is about .95 (so 5% chance that h1 actually has lower error)
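
The arithmetic of this example can be reproduced with a short sketch. It takes the sample errors to be 0.3 and 0.2 on 100 examples each, which is what the observed difference of 0.1 and the standard deviation of about .061 imply, and it assumes the binomial variance estimate e*(1-e)/n for each sample error. The function name is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def prob_true_difference_positive(e1, n1, e2, n2):
    """Approximate prob(d > 0), where d = error_D(h1) - error_D(h2).

    Assumption: the observed difference is normal around d with
    variance e1*(1-e1)/n1 + e2*(1-e2)/n2.
    """
    d_hat = e1 - e2
    delta_d = sqrt(e1 * (1.0 - e1) / n1 + e2 * (1.0 - e2) / n2)
    # d > 0 exactly when d_hat overestimates d by less than d_hat itself,
    # i.e. when d_hat lies below its mean plus (d_hat / delta_d) standard deviations.
    z = d_hat / delta_d
    return delta_d, z, NormalDist().cdf(z)


delta_d, z, p = prob_true_difference_positive(0.30, 100, 0.20, 100)
print(f"delta_d={delta_d:.3f}  z={z:.2f}  prob(d > 0)={p:.3f}")
```

Using the normal CDF directly replaces the table lookup; the result agrees with the table-based figure of about .95.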

Comparing learning algorithms

Suppose we want to compare algorithms LA and LB. There is active debate about how to best do this but we'll present one approach. If you're interested in alternative approaches, let me know and I can direct you to some related papers.

We want to estimate:
E[all samples (S) of size n from D] (error[D](LA(S)) - error[D](LB(S))) where LA(S) is the hypothesis output by LA on sample S.
But we only have a limited sample, D0

Approach 1
Divide D0 into training set S0 and disjoint test set T0 and use S0 to train A and B. So, we would measure:
error[T0](LA(S0)) - error[T0](LB(S0))
So we've replaced D by T0 and the average over all Samples S by a single sample S0. We can improve this a little by using cross-validation so we get more than one sample. Namely, do the following: (k-fold cross validation)

  1. Partition D0 into k disjoint subsets T1, T2,...Tk of equal size, where the size >= 30
  2. For i = 1 to k, use Ti as the test set and the rest of the data as the training set:
       Si = D0 - Ti
       hA = LA(Si)
       hB = LB(Si)
       delta[i] = error[Ti](hA) - error[Ti](hB)
  3. Return delta_bar = 1/k * sum[i=1 to k](delta[i]) (delta_bar is the average of all the delta[i])
The approximate N% confidence interval is given by delta_bar +- t[N,k-1]*S[delta_bar]
where
S[delta_bar] = sqrt(sum[i=1 to k]((delta[i]-delta_bar)^2) / (k*(k-1)))

t[N,k-1] plays the same role as Z[N]. The second parameter is the number of degrees of freedom, usually denoted by v (nu), and is related to the number of independent random events that are used to produce the average of the delta[i]. It is k-1 here.
t[N,k-1] is read from a t distribution, which is similar to the normal, but wider and shorter.
As k --> infinity, t[N,k-1] --> Z[N]
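
The interval computation can be sketched as follows. This is a minimal illustration, not a definitive implementation: the per-fold differences are made-up numbers, the function name is hypothetical, and the t value is passed in by the caller because it comes from a t table (2.776 is the standard two-sided 95% value for 4 degrees of freedom).

```python
from math import sqrt


def paired_t_interval(deltas, t_value):
    """Return (mean, lower, upper) for the mean of the per-fold differences.

    `t_value` is t[N, k-1], looked up in a t table by the caller.
    """
    k = len(deltas)
    if k < 2:
        raise ValueError("need at least two folds")
    delta_bar = sum(deltas) / k
    # Estimated standard deviation of delta_bar: sqrt(sum of squares / (k*(k-1))).
    s = sqrt(sum((d - delta_bar) ** 2 for d in deltas) / (k * (k - 1)))
    return delta_bar, delta_bar - t_value * s, delta_bar + t_value * s


# Hypothetical differences error_Ti(hA) - error_Ti(hB) from k = 5 folds.
deltas = [0.02, 0.05, 0.01, 0.04, 0.03]
delta_bar, lo, hi = paired_t_interval(deltas, t_value=2.776)
print(f"mean difference {delta_bar:.3f}, 95% interval [{lo:.4f}, {hi:.4f}]")
```

With only 5 folds the t value (2.776) is noticeably larger than the corresponding normal value (1.96), which is why the interval is wider than a normal-based one would be.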

This test is called the t-test, and since both hypotheses are evaluated over the same sample, it is called a paired t-test. Paired tests typically produce tighter confidence intervals, since any differences seen must be due to differences between the hypotheses rather than also to differences in the make-up of the samples. So if the test reports that A and B are equally good, that is reliable, but if it says one is better, the true confidence may be lower than the level it claims.

Practical Considerations

An alternative to k-fold cross-validation is to randomly pick training and test sets k times. Advantage: this can be repeated any number of times to shrink the confidence interval to the desired width. Disadvantage: the delta[i] are then not independent of one another, since the randomly chosen test sets can share examples. The advantage of the cross-validation approach is that the test sets are independent, but k is limited since you need at least 30 examples in each group.
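
The difference between the two ways of choosing test sets can be seen in a short sketch. The function names, the data size of 300 examples, and k=10 are assumptions made only for this illustration; examples are represented by their indices.

```python
import random


def kfold_test_sets(n, k, rng):
    """Shuffle indices 0..n-1 and cut them into k disjoint test sets of equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    fold = n // k
    return [idx[i * fold:(i + 1) * fold] for i in range(k)]


def random_test_sets(n, k, test_size, rng):
    """Draw k test sets independently at random; different sets may share examples."""
    return [rng.sample(range(n), test_size) for _ in range(k)]


def distinct_examples(test_sets):
    """Count how many different examples appear in at least one test set."""
    return len(set().union(*test_sets))


rng = random.Random(0)
kfold = kfold_test_sets(300, 10, rng)
resampled = random_test_sets(300, 10, 30, rng)
print("k-fold: distinct test examples =", distinct_examples(kfold))
print("random: distinct test examples =", distinct_examples(resampled))
```

In the k-fold case every example is tested exactly once, so the count equals 10 * 30 = 300. With random resampling the count is typically smaller because some examples land in several test sets, which is the source of the dependence between the delta[i].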