HTML document prepared by Sean Waters.

Gradient Descent({<x->, t>}, n)   (where each <x->, t> is a training example and n is the learning rate)
For the stochastic approximation, replace:
        delta(wi) = delta(wi) + n(t-o)xi   with   wi = wi + n(t-o)xi

Differences between standard and stochastic gradient descent:
        Standard                              Stochastic
        delta(wi) is summed over all          wi is updated after each
        examples before updating wi           training example

        more computation per update, but      when E(w->) has multiple local
        uses the true gradient and hence      minima, can sometimes avoid
        can often use a larger n              them
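
The two update schedules in the table can be sketched as follows for a single linear unit o = w->·x->. This is a minimal illustration, not code from the lecture: the function names, the three training examples, and the choice n = .05 are all assumptions made for the example.

```python
def batch_epoch(w, examples, eta):
    """Standard gradient descent: sum delta(wi) over all examples, then update once."""
    delta = [0.0] * len(w)
    for x, t in examples:
        o = sum(wi * xi for wi, xi in zip(w, x))       # unthresholded output
        for i, xi in enumerate(x):
            delta[i] += eta * (t - o) * xi             # delta(wi) = delta(wi) + n(t-o)xi
    return [wi + di for wi, di in zip(w, delta)]


def stochastic_epoch(w, examples, eta):
    """Stochastic gradient descent: update the weights after every example."""
    w = list(w)
    for x, t in examples:
        o = sum(wi * xi for wi, xi in zip(w, x))
        w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]  # wi = wi + n(t-o)xi
    return w


# Hypothetical data generated by t = 2*x1 - 1; x0 = 1 is a constant bias input.
examples = [((1.0, 0.0), -1.0), ((1.0, 1.0), 1.0), ((1.0, 2.0), 3.0)]
w_batch = [0.0, 0.0]
w_sgd = [0.0, 0.0]
for _ in range(2000):
    w_batch = batch_epoch(w_batch, examples, eta=0.05)
    w_sgd = stochastic_epoch(w_sgd, examples, eta=0.05)
```

On this noise-free data both schedules approach the same weights, roughly (-1, 2); they differ only in when the update is applied.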

delta(wi) = n(t-o)xi is called the delta rule (also the LMS rule, Adaline rule, or Widrow-Hoff rule).
The perceptron rule looks the same, but o = w->·x-> is replaced by o = sign(w->·x->).
Even when the targets are +1 or -1 and the final output is thresholded, the delta rule still
minimizes the squared error of the unthresholded output w->·x->, so it will not necessarily
minimize the number of training examples misclassified by the thresholded output.

Summary
  perceptron training - update weights based on thresholded output; converges after a finite
                                  number of steps if the data are linearly separable.
  delta rule - update weights based on unthresholded output; converges only asymptotically
                   (may require unbounded time), but does so whether or not the data are
                   linearly separable.
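
As a small illustration of the first summary point, here is a sketch of perceptron training on the boolean AND function, which is linearly separable. The function names, the learning rate .1, the zero starting weights, and the cap of 100 passes are assumptions made for the example.

```python
def sign(y):
    """Threshold output: +1 if y > 0, otherwise -1."""
    return 1 if y > 0 else -1


def train_perceptron(examples, eta=0.1, max_epochs=100):
    """Perceptron rule: wi = wi + n(t-o)xi, with o the thresholded output."""
    w = [0.0] * len(examples[0][0])
    for epoch in range(max_epochs):
        mistakes = 0
        for x, t in examples:
            o = sign(sum(wi * xi for wi, xi in zip(w, x)))
            if o != t:                       # (t - o) is zero for correct outputs
                mistakes += 1
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
        if mistakes == 0:                    # every training example classified correctly
            return w, epoch
    return w, max_epochs


# AND with targets +1 / -1; x0 = 1 is a constant bias input.
and_examples = [((1, 0, 0), -1), ((1, 0, 1), -1), ((1, 1, 0), -1), ((1, 1, 1), 1)]
w, epochs = train_perceptron(and_examples)
```

Training stops as soon as a full pass makes no mistakes, which is the finite-step convergence the summary refers to.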

Multilayer nets and Back Propagation Algorithm




A differentiable threshold unit



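The sigmoid, sigma(y) = 1/(1 + e^(-y)), can be replaced by other functions as long as they are
differentiable.  Others used are e^(-ky) in place of e^(-y) (k a positive constant), or tanh.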
Since there are multiple outputs:
   E(w->) = ½ Sum(d element of D) Sum(k element of outputs) (tkd - okd)^2
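
A minimal sketch of the sigmoid unit and the multi-output error, for illustration only. The function names and the optional k parameter (the e^(-ky) variant) are assumptions; the slope formula o(1 - o) holds for k = 1.

```python
import math


def sigmoid(y, k=1.0):
    """Sigmoid output 1 / (1 + e^(-k*y)); k = 1 gives the standard sigmoid."""
    return 1.0 / (1.0 + math.exp(-k * y))


def sigmoid_slope(o):
    """Derivative of the standard sigmoid (k = 1), written in terms of its output o."""
    return o * (1.0 - o)


def squared_error(targets, outputs):
    """E = 1/2 * sum over examples d and output units k of (tkd - okd)^2."""
    return 0.5 * sum(
        (t - o) ** 2
        for t_vec, o_vec in zip(targets, outputs)
        for t, o in zip(t_vec, o_vec)
    )
```

The o(1 - o) slope is the factor that shows up in the error terms of the back propagation steps.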

Back Propagation
    Given {<x->, t->>}  (training examples)
    n learning rate (approximately .05)



Initialize all weights to small random numbers between -.05 and .05
Until termination condition is met
    For each <x->, t->> in training examples
  1. Input x-> to the network and compute the output Ou of each unit u
  2. For each output unit k, calculate its error term
       deltak = Ok (1 - Ok)(tk - Ok)
  3. For each hidden unit h, calculate its error term
      deltah = Oh (1 - Oh) Sum(k element of outputs) (Wkh deltak)
      where Wkh, the weight from hidden unit h to output unit k, measures h's responsibility
      for k's error, and deltak is the error term for output unit k
  4. Update each weight Wji
    Wji = Wji + n(deltaj)(xji)
    where n(deltaj)(xji) = delta(wji) and xji is the input from unit i to unit j
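
Steps 1 to 4 can be sketched for a network with one hidden layer of sigmoid units as follows. This is an illustration under assumptions, not the lecture's code: the function names, the bias input x0 = 1 stored at index 0 of every weight row, the 2-2-1 network, and the single training example are all made up for the example.

```python
import math
import random


def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))


def init_weights(n_units, n_inputs, rng):
    """One weight row per unit; index 0 is the weight on the bias input x0 = 1."""
    return [[rng.uniform(-0.05, 0.05) for _ in range(n_inputs + 1)]
            for _ in range(n_units)]


def forward(w_hidden, w_out, x):
    """Step 1: compute the output of every unit."""
    x_h = [1.0] + list(x)                                   # inputs to hidden units
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x_h))) for row in w_hidden]
    x_o = [1.0] + hidden                                    # inputs to output units
    out = [sigmoid(sum(w * v for w, v in zip(row, x_o))) for row in w_out]
    return x_h, hidden, x_o, out


def backprop_step(w_hidden, w_out, x, t, eta):
    """Steps 2 to 4 for one training example; weights are updated in place."""
    x_h, hidden, x_o, out = forward(w_hidden, w_out, x)
    # Step 2: deltak = Ok (1 - Ok)(tk - Ok)
    delta_k = [o * (1 - o) * (tk - o) for o, tk in zip(out, t)]
    # Step 3: deltah = Oh (1 - Oh) Sum_k Wkh deltak (computed before Wkh changes)
    delta_h = [oh * (1 - oh) * sum(w_out[k][h + 1] * delta_k[k] for k in range(len(out)))
               for h, oh in enumerate(hidden)]
    # Step 4: Wji = Wji + n deltaj xji
    for k, row in enumerate(w_out):
        for i in range(len(row)):
            row[i] += eta * delta_k[k] * x_o[i]
    for h, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * delta_h[h] * x_h[i]


rng = random.Random(0)                   # fixed seed so the run is repeatable
w_hidden = init_weights(2, 2, rng)       # hypothetical 2-2-1 network
w_out = init_weights(1, 2, rng)
x, t = (0.0, 1.0), (0.9,)                # one made-up training example
error_before = 0.5 * (t[0] - forward(w_hidden, w_out, x)[3][0]) ** 2
for _ in range(2000):
    backprop_step(w_hidden, w_out, x, t, eta=0.05)
error_after = 0.5 * (t[0] - forward(w_hidden, w_out, x)[3][0]) ** 2
```

Note that the hidden error terms are computed from the old output weights, before step 4 changes them.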

Issues with Training Multilayer Nets

Adding Momentum
  Replace delta(wji) by delta(wji)(p) = n deltaj xji + alpha delta(wji)(p-1)
   where p is the pth iteration, delta(wji)(p-1) is the change in weight last time, and
   0 <= alpha < 1 is the momentum
  Ideally this helps the search gain momentum to roll through small local minima and flat
  regions.  It also gradually increases the effective step size where the gradient keeps
  pointing the same way, which speeds convergence.
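
The momentum update for a single weight can be sketched like this. The function name and the numbers (n = .1, alpha = .5, a constant gradient term of 1) are assumptions chosen to make the effect easy to see.

```python
def momentum_step(w, gradient_term, previous_delta, eta, alpha):
    """delta(p) = n * (deltaj * xji) + alpha * delta(p-1); returns (new w, delta(p))."""
    delta = eta * gradient_term + alpha * previous_delta
    return w + delta, delta


# With a constant gradient term the step grows from n toward n / (1 - alpha).
w, delta = 0.0, 0.0
deltas = []
for p in range(50):
    w, delta = momentum_step(w, gradient_term=1.0, previous_delta=delta,
                             eta=0.1, alpha=0.5)
    deltas.append(delta)
```

Here the first step is .1 and later steps level off near .1 / (1 - .5) = .2, so part of each previous change carries over into the next one.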

Convergence and Avoiding Local Minima
Inductive Bias - can be roughly summarized as smooth interpolation between data points.

Generalization, Overfitting and Stopping Criterion
  Use k-fold cross validation as follows:
    Divide the m training examples into k disjoint sets
       s1,...,sk of m/k examples each
       For each si
          Train the net on the other k-1 sets, with si set aside as validation data
          Stop training when accuracy on the validation data decreases significantly
  To estimate accuracy, take the average accuracy across all k runs
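
A sketch of the partitioning step, assuming the usual k-fold scheme in which each subset serves once as validation data while the remaining k-1 subsets are used for training. The function name and the ten placeholder examples are made up for the illustration.

```python
def k_fold_splits(examples, k):
    """Yield (training, validation) pairs; each of the k subsets is validation once."""
    folds = [examples[i::k] for i in range(k)]           # k disjoint subsets
    for i in range(k):
        validation = folds[i]
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield training, validation


# Hypothetical data: 10 examples, k = 5, so each validation set has m/k = 2 examples.
examples = list(range(10))
splits = list(k_fold_splits(examples, k=5))
```

Each of the k runs would train a net on `training`, monitor accuracy on `validation`, and the k accuracies would then be averaged.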

  You can do a similar thing with different starting weights: train several nets and pick the
   one with the best performance on validation data

  Another option - compute the average number of training rounds, r-bar, across the k runs.  Then
  train on all m examples for r-bar rounds
 
  Another option - weight decay - decrease each weight by a small factor at each iteration
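
One way to read the weight decay option is to multiply every weight by (1 - decay) once per iteration. The notes do not give a value for the factor, so the function name and the decay values below are assumptions for illustration.

```python
def decay_weights(weights, decay=0.001):
    """Shrink every weight toward zero by the (hypothetical) fraction `decay`."""
    return [w * (1.0 - decay) for w in weights]


# Example: a 1% decay applied to two made-up weights.
decayed = decay_weights([1.0, -2.0], decay=0.01)
```

In a training loop this would be applied after (or folded into) the ordinary weight update, which keeps weights small and biases learning against overly complex decision surfaces.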

Design Decisions made for Face Recognition
  Input Encoding - rescale so each feature value is between 0 and 1
  Output Encoding:
     instead of <1,0,0,0> use <.9,.1,.1,.1>
     Reason: sigmoid units cannot output exactly 0 or 1 with finite weights, but they can
     reach .1 and .9
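
The output encoding can be sketched as below. The function names are assumptions, and decoding by taking the unit with the highest output is one common convention rather than something stated in these notes.

```python
def encode_target(class_index, n_classes, high=0.9, low=0.1):
    """Target vector such as <.9,.1,.1,.1> instead of <1,0,0,0>."""
    return [high if k == class_index else low for k in range(n_classes)]


def decode_output(outputs):
    """Assumed convention: predict the class whose output unit is largest."""
    return max(range(len(outputs)), key=lambda k: outputs[k])


target = encode_target(0, 4)                     # [0.9, 0.1, 0.1, 0.1]
predicted = decode_output([0.2, 0.8, 0.1, 0.3])  # index of the largest output
```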
  #hidden units
     3 hidden units   90% accuracy, about 5 minutes training time
     30 hidden units  91-92% accuracy, about 1 hour training time
     < 3 units did not give good accuracy
     Can use cross validation to help select the optimal number of units
         too few - won't have good accuracy
         too many - slower to train and more prone to overfitting
  Other Learning Parameters
     learning rate n = .3
     momentum alpha = .3
     lower values of these gave roughly the same generalization but longer training times
     higher values kept training from converging to an acceptable error
     all input unit weights "wo" initialized to 0 since this gave much more intelligible
     visualizations of learned weights without hurting generalization accuracy

  Stopping
     created separate validation set
     every 50 gradient descent steps the accuracy on the validation data was computed
     stopped when there was a significant decrease in accuracy
     final network used the weights giving highest accuracy on the validation set
     90% accuracy on separate test data set
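
The stopping procedure can be sketched generically as follows. `train_step` and `validation_accuracy` are hypothetical callbacks supplied by the caller, and the tolerance that defines a "significant" decrease is an assumption; the notes only fix the check interval of 50 steps.

```python
import copy


def train_with_early_stopping(weights, train_step, validation_accuracy,
                              check_every=50, tolerance=0.02, max_steps=10000):
    """Check validation accuracy every `check_every` steps and keep the best weights."""
    best_accuracy = validation_accuracy(weights)
    best_weights = copy.deepcopy(weights)
    for step in range(1, max_steps + 1):
        weights = train_step(weights)                 # one gradient descent step
        if step % check_every == 0:
            accuracy = validation_accuracy(weights)
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_weights = copy.deepcopy(weights)
            elif best_accuracy - accuracy > tolerance:
                break                                 # significant decrease: stop
    return best_weights, best_accuracy


# Toy stand-ins: the "weights" are one number and accuracy peaks when it equals 200.
best_w, best_acc = train_with_early_stopping(
    0,
    train_step=lambda w: w + 1,
    validation_accuracy=lambda w: 1.0 - abs(w - 200) / 1000.0,
)
```

The weights returned are those from the best validation check, not the weights at the moment training stopped.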

  8x3x8 Network
     training data - identity function
     3 hidden units basically learn binary encoding for 8 possible outputs
     Fig 4.8 shows how training proceeds and hidden units evolve