HTML document prepared by Sean Waters.

Evaluating Hypotheses (chap 5)
   With enough data, evaluating a hypothesis is easy: hold out a large validation set.  The focus here is on
   evaluation when data is limited.  Two key difficulties:
      1.  Bias in the estimate - observed accuracy over the training exs is often a poor (optimistic) estimator
                                 because of overfitting, especially when a very rich hypothesis space is used.
                                 Address this by using a validation set or cross validation.
      2.  Variance in the estimate - the smaller the validation set, the larger the variance of the estimate
                                     (how much it would change if you repeated the measurement on different
                                     validation sets).  For ex. compare 10-fold vs 100-fold cross validation:
                                     each held-out fold in 100-fold is one tenth the size.
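The variance point can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true accuracy (0.8), the validation set sizes, the number of repetitions, and the function name `accuracy_estimates` are all made up for illustration, and each validation example is modeled as an independent trial that h gets right with that probability.

```python
import random
import statistics

def accuracy_estimates(true_accuracy, validation_size, repetitions, rng):
    """Simulate `repetitions` validation sets; return the observed accuracy on each."""
    estimates = []
    for _ in range(repetitions):
        # Each validation example is classified correctly with prob. true_accuracy.
        correct = sum(rng.random() < true_accuracy for _ in range(validation_size))
        estimates.append(correct / validation_size)
    return estimates

rng = random.Random(0)  # fixed seed so the run is repeatable
spread = {}
for size in (10, 100, 1000):  # hypothetical validation set sizes
    estimates = accuracy_estimates(0.8, size, 2000, rng)
    spread[size] = statistics.pstdev(estimates)
    print(f"validation size {size:4d}: std dev of accuracy estimate = {spread[size]:.3f}")
```

The printed spread should shrink as the validation set grows, which is the sense in which small validation sets give high-variance estimates.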

Estimating Hypothesis Accuracy
   Assume following setting:
      X - set of all possible instances (or exs) specified often by a set of attributes.
             Given the example of learning who will purchase skis for marketing purposes, X would be all people and
             possible attributes might be age, home city, weight, #days you would ski per year, etc.
      Ð - arbitrary probability distribution over X that represents how often each instance occurs in nature.  Ð is independent of the target concept.
      Pr_{x element-of Ð}(E(x)) - the probability that event E(x) is true for an x element-of X drawn randomly from Ð
      H - set of possible hypotheses
      C - set of possible target concepts (may be unknown)
      f element-of C - the target concept
      Ex. Suppose you want to learn the target concept "people who plan to purchase skis next year" (Ð differs depending on how the data is collected)
       option 1 - Survey people entering ski resort
       option 2 - Do phone survey of "random people"

        for option 1: Ð specifies for each person x element-of X the probability that they will be the next person arriving at the
                             ski resort
                             f: X-> {0,1} classifies each person as to whether they will buy skis next year
    Key Assumption
       Labeled data set S = {<x1, f(x1)>, <x2, f(x2)>, ... , <xn, f(xn)>} is obtained by drawing each xi independently from Ð and
       labeling it correctly by f.  Similar ideas can be used to model noise in the sample.
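A minimal sketch of the key assumption, for a discrete toy case. Everything concrete here is hypothetical and chosen only for illustration: the four-person instance space `X`, the weights `D`, the target concept `f`, and the helper `draw_labeled_sample`.

```python
import random

# Hypothetical instance space: each person is (name, days_skied_per_year).
X = [("ann", 0), ("bob", 3), ("cho", 12), ("dee", 25)]
# Hypothetical distribution D over X: D[i] is the prob. of drawing X[i].
D = [0.1, 0.2, 0.3, 0.4]

def f(x):
    """Hypothetical target concept: buys skis if they ski at least 10 days a year."""
    return 1 if x[1] >= 10 else 0

def draw_labeled_sample(n, rng):
    """Draw n instances independently from D (with replacement) and label each by f."""
    xs = rng.choices(X, weights=D, k=n)
    return [(x, f(x)) for x in xs]

S = draw_labeled_sample(5, random.Random(1))
for x, label in S:
    print(x, label)
```

Because the draws are independent and with replacement, the same person can appear more than once in S, exactly as repeated draws from Ð would allow.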

   Two questions we want to answer
        1. Given h and a labeled sample of n exs, what is the best estimate of the accuracy of h over future exs drawn from Ð?
        2. What is the possible error in this accuracy estimate?
    Defs
         Sample Error:
            error_S(h) = (1/n) * Sum_{x element-of S} [f(x) xor h(x)]
            where S is the sample; h is the hypothesis; n = |S|; f is the target concept; and f(x) xor h(x) is 1 if f(x) != h(x) and 0 otherwise
         True Error (often called generalization error):
            error_Ð(h) = Pr_{x element-of Ð}[f(x) != h(x)]
            This is the prob. that a random x drawn from Ð is incorrectly labeled by h.
            Note that if Ð is discrete then this is just the sum of the prob. weights of the x element-of X that are misclassified.
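The two definitions can be sketched in a few lines for a discrete distribution. The function names and the toy problem (ages as instances, the threshold concepts, the weights in `D`) are assumptions made only for this illustration.

```python
def sample_error(h, labeled_sample):
    """Fraction of labeled examples <x, f(x)> on which h(x) differs from the label."""
    return sum(1 for x, label in labeled_sample if h(x) != label) / len(labeled_sample)

def true_error(h, f, distribution):
    """Total prob. weight of the instances h misclassifies.

    `distribution` maps each instance to its probability under a discrete D.
    """
    return sum(prob for x, prob in distribution.items() if h(x) != f(x))

# Hypothetical toy problem: instances are ages.
f = lambda age: 1 if age >= 30 else 0   # target concept
h = lambda age: 1 if age >= 40 else 0   # hypothesis; it is wrong only on age 35
D = {20: 0.25, 35: 0.25, 45: 0.5}       # discrete distribution over instances
S = [(20, 0), (35, 1), (35, 1), (45, 1)]

print(sample_error(h, S))    # 2 of the 4 examples are misclassified -> 0.5
print(true_error(h, f, D))   # only age 35 is misclassified -> 0.25
```

On this small sample the sample error (0.5) is far from the true error (0.25), which is the gap the rest of the section quantifies.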
 
   How good an estimate of error_Ð(h) is error_S(h)?
    Let's apply some standard probability theory to our problem:

                       |  Coin flipping                     |  Hypothesis evaluation
                       |  Given a coin with some bias,      |  Want to estimate the prob. that
                       |  want to estimate the prob. p      |  h(x) = f(x) for a random x
                       |  that you'll get a Head            |  drawn from Ð
----------------------------------------------------------------------------------------------
  sample space         |  {Head, Tail}                      |  {h(x) = f(x), h(x) != f(x)}
  (possible outcomes)  |     H     T                        |        1            0
----------------------------------------------------------------------------------------------
  event E              |  coin lands heads up               |  h(x) = f(x)
----------------------------------------------------------------------------------------------
  iid sample           |  each flip is of the same coin,    |  each ex is drawn independently
  (independent and     |  with prob. p of heads             |  from Ð and labeled by f
  identically          |  compute estimate p-hat for p      |
  distributed)         |  p-hat - like sample error         |
                       |  p - like true error               |

  Note: with E defined as h(x) = f(x), p is the true accuracy, 1 - error_Ð(h).  Counting the event h(x) != f(x)
  instead makes p the true error and p-hat the sample error.

Underlying Justification
   Binomial Distribution:
      gives the prob. of observing r successful trials in n independent trials, where each trial has prob. p of success
       P(r) = C(n,r) p^r (1-p)^(n-r) = [n! / (r! (n-r)!)] p^r (1-p)^(n-r)
         Let x be the # of successes
          E(x) = np
          Var(x) = np(1-p)
          sigma_x = sqrt(np(1-p))
      when np(1-p) >= 5 the binomial is closely approximated by a normal distribution
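The binomial formulas above can be checked numerically. This is a sketch with hypothetical values n = 40 and p = 0.3; `binomial_pmf` is a helper name introduced here, built on the standard library's `math.comb`.

```python
import math

def binomial_pmf(r, n, p):
    """Prob. of exactly r successes in n independent trials, each with success prob. p."""
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 40, 0.3  # hypothetical values chosen for illustration
pmf = [binomial_pmf(r, n, p) for r in range(n + 1)]

# Mean and variance computed directly from the distribution.
mean = sum(r * pr for r, pr in enumerate(pmf))
variance = sum((r - mean) ** 2 * pr for r, pr in enumerate(pmf))

print(f"sum of probabilities = {sum(pmf):.6f}")
print(f"mean = {mean:.4f} (np = {n * p:.4f})")
print(f"variance = {variance:.4f} (np(1-p) = {n * p * (1 - p):.4f})")
```

With these values np(1-p) = 8.4, so the rule of thumb for the normal approximation is satisfied.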
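      For our problem, count a trial as a "success" when h misclassifies the ex, so r is the # of sample exs that h misclassifies:
           error_S(h) = r/n
           error_Ð(h) = p
      estimation bias - for an estimator Y of a parameter p, the bias is E[Y] - p
         error_S(h) is an unbiased estimator for error_Ð(h):
         E[r] = np   so E[r/n] = p
         E[error_S(h)] - error_Ð(h) = 0
      sigma_{error_S(h)} = sigma_r / n = sqrt[p(1-p)/n]
      p is unknown, so approximate by substituting r/n = error_S(h) for p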
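The steps behind these statements can be written out as a short derivation. It uses only the binomial mean and variance given above; the symbol \mathcal{D} stands for the distribution Ð, and r is the number of sample examples that h misclassifies.

```latex
\begin{aligned}
E[\mathit{error}_S(h)] &= E\!\left[\frac{r}{n}\right] = \frac{E[r]}{n} = \frac{np}{n} = p = \mathit{error}_{\mathcal{D}}(h) \\
\mathrm{Var}\!\left(\frac{r}{n}\right) &= \frac{\mathrm{Var}(r)}{n^2} = \frac{np(1-p)}{n^2} = \frac{p(1-p)}{n} \\
\sigma_{\mathit{error}_S(h)} &= \sqrt{\frac{p(1-p)}{n}} \;\approx\; \sqrt{\frac{\mathit{error}_S(h)\,\bigl(1-\mathit{error}_S(h)\bigr)}{n}}
\end{aligned}
```

The last approximation is the substitution of the observed r/n for the unknown p.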

Central Limit Theorem
   The sum (and hence the mean) of a large number of iid random vars approximately follows a Normal Distribution
    [Figure: normal curves with different standard deviations (std-dev = .5 and std-dev = 1.5), each marked with its 95% confidence interval [l,h]]

   The area under the curve is 1 (it's a prob. dist.).  Look at the portion, centered around the mean, that contains 95% of the area.
      Let p = prob. that event E occurs (e.g. coin lands heads or h(x) = f(x))
      Let p-hat = estimate for p = (# of trials in which E occurred)/(total # of trials)
      Prob(l <= p-hat <= h) = .95
      Note: the smaller sigma is (i.e. the bigger the sample), the narrower the interval: h - l is smaller than when sigma is larger
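As a rough check of the 95% statement, the sketch below computes the exact binomial probability that p-hat falls within 1.96 standard deviations of p, and shows the interval narrowing as n grows. The values of n and p, and the helper names, are assumptions made for illustration.

```python
import math

def binomial_pmf(r, n, p):
    """Prob. of exactly r occurrences of E in n independent trials."""
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

def coverage(n, p, z=1.96):
    """Exact prob. that p_hat = r/n lands within z standard deviations of p."""
    sigma = math.sqrt(p * (1 - p) / n)
    low, high = p - z * sigma, p + z * sigma
    return sum(binomial_pmf(r, n, p) for r in range(n + 1) if low <= r / n <= high)

def interval_width(n, p, z=1.96):
    """Width h - l of the interval [p - z*sigma, p + z*sigma]."""
    return 2 * z * math.sqrt(p * (1 - p) / n)

for n in (40, 400):  # hypothetical sample sizes
    print(f"n = {n:3d}: width = {interval_width(n, 0.3):.3f}, coverage = {coverage(n, 0.3):.3f}")
```

The coverage is close to, but not exactly, .95 because p-hat can only take the discrete values r/n while the normal curve is continuous.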
 
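    The best estimate for error_Ð(h) is error_S(h), where error_Ð(h) plays the role of p and error_S(h) the role of p-hat (taking E to be the event h(x) != f(x)).
    For an N% confidence interval we have:
       error_S(h) - Z_N * S <= error_Ð(h) <= error_S(h) + Z_N * S
    where S is the estimate of sigma computed from the sample (not the sample itself): S = sqrt[error_S(h) (1 - error_S(h)) / n]
     and Z_N is given by: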
    confidence level N%:   50%  |  68%  |  80%  |  90%  |  95%  |  98%  |  99%
                           -----------------------------------------------------
    constant Z_N:          0.67 |  1.00 |  1.28 |  1.64 |  1.96 |  2.33 |  2.58
     this normal approximation to the binomial works well as long as:
         n * error_S(h) * (1 - error_S(h)) >= 5
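The constants in the table can be recomputed from the standard normal distribution. A minimal sketch, assuming Python 3.8+ for `statistics.NormalDist`; the function name `z_for_confidence` is introduced here for illustration.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """Two-sided constant z: the central `level` fraction of a standard normal lies within +/- z."""
    # Leaving (1 - level)/2 in each tail puts the upper bound at the 0.5 + level/2 quantile.
    return NormalDist().inv_cdf(0.5 + level / 2)

for level in (0.50, 0.68, 0.80, 0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%}: z = {z_for_confidence(level):.2f}")
```

The 68% entry computes to roughly 0.99 rather than the tabulated 1.00, because one standard deviation actually covers about 68.3% of the area; the other entries agree with the table after rounding.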
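     ex   n = 40, h makes 12 errors on the sample
          error_S(h) = 12/40 = .30
          check: n * error_S(h) * (1 - error_S(h)) = 40 * .3 * .7 = 8.4 >= 5, so the normal approximation applies
          S = sqrt(.3 * .7 / 40) =~ .07
          So for the 95% confidence interval (Z_N = 1.96):
             .30 - .14 <= error_Ð(h) <= .30 + .14,   i.e.  .16 <= error_Ð(h) <= .44
          and for the 68% confidence interval (Z_N = 1.00):
             .23 <= error_Ð(h) <= .37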
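The worked example can be reproduced with a short function. This is a sketch, not a definitive implementation: the name `error_confidence_interval` and the dictionary `Z_TABLE` are introduced here, the constants are copied from the table above, and the function refuses to answer when the rule of thumb for the normal approximation fails.

```python
import math

# Constants Z_N copied from the table in these notes, keyed by confidence level in %.
Z_TABLE = {50: 0.67, 68: 1.00, 80: 1.28, 90: 1.64, 95: 1.96, 98: 2.33, 99: 2.58}

def error_confidence_interval(num_errors, n, confidence=95):
    """Approximate N% confidence interval for the true error, given errors on n sample exs."""
    sample_err = num_errors / n
    # Rule of thumb from the notes for when the normal approximation is reasonable.
    if n * sample_err * (1 - sample_err) < 5:
        raise ValueError("normal approximation unreliable: n * e * (1 - e) < 5")
    s = math.sqrt(sample_err * (1 - sample_err) / n)  # estimated standard deviation
    margin = Z_TABLE[confidence] * s
    return sample_err - margin, sample_err + margin

low95, high95 = error_confidence_interval(12, 40, 95)
low68, high68 = error_confidence_interval(12, 40, 68)
print(f"95%: [{low95:.2f}, {high95:.2f}]")  # 95%: [0.16, 0.44]
print(f"68%: [{low68:.2f}, {high68:.2f}]")  # 68%: [0.23, 0.37]
```

Raising an error for small samples is a design choice made here; an alternative would be to return the interval with a warning.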