HTML document prepared by Sean Waters.
Gradient Descent({<x->, t>}, n)
(where the <x->, t> are the training examples and n is the learning rate)
- Initialize each w_i to some small random value, e.g. between -.05 and .05
- Repeat until the termination condition is met:
  - Initialize each delta(w_i) to zero
  - For each training example <x->, t>:
    - Compute o = w->·x-> = Sum_i(w_i x_i)
    - For each weight w_i: delta(w_i) = delta(w_i) + n(t-o)x_i
  - For each weight w_i: w_i = w_i + delta(w_i)
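The batch procedure above can be sketched in a few lines of Python. This is a
minimal illustration, not the course's code: the function name, the fixed
epoch count standing in for the termination condition, the seed, and the toy
data are all assumptions. `eta` plays the role of the learning rate n.

```python
import random

def gradient_descent(examples, eta=0.05, epochs=1000, seed=0):
    """Batch gradient descent for a linear unit o = w . x.

    examples: list of (x, t) pairs; x is a list of floats, t a float.
    """
    rng = random.Random(seed)
    n_inputs = len(examples[0][0])
    # small random initial weights, as in the notes
    w = [rng.uniform(-0.05, 0.05) for _ in range(n_inputs)]
    for _ in range(epochs):  # assumption: fixed epoch count as termination condition
        delta_w = [0.0] * n_inputs
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i, xi in enumerate(x):
                delta_w[i] += eta * (t - o) * xi
        # weights change only after all examples have been summed
        w = [wi + dwi for wi, dwi in zip(w, delta_w)]
    return w

# Toy data (assumed): x[0] = 1 acts as a constant input, targets follow t = 1 + 2*x1
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]
weights = gradient_descent(data)
```

On this toy data the weights should approach (1, 2), the coefficients used to
generate the targets.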
For the stochastic approximation, replace
  delta(w_i) = delta(w_i) + n(t-o)x_i
with
  w_i = w_i + n(t-o)x_i
so that the weights are updated after every training example.
Differences between standard and stochastic gradient descent:

Standard                                 Stochastic
---------------------------------------  ---------------------------------------
delta(w_i) is summed over all examples   w_i is updated after each training
before w_i is updated                    example
more computation per update, but uses    when E(w->) has multiple local minima,
the true gradient and hence can often    can sometimes avoid them
use a larger n
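For comparison, here is a sketch of the stochastic variant, in which each
weight is updated immediately after every example instead of after a full
pass. As before, the function name, epoch count, seed, and toy data are
assumptions made for illustration.

```python
import random

def stochastic_gradient_descent(examples, eta=0.05, epochs=1000, seed=0):
    """Stochastic approximation to gradient descent for a linear unit."""
    rng = random.Random(seed)
    n_inputs = len(examples[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(n_inputs)]
    for _ in range(epochs):  # assumption: fixed epoch count as termination condition
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            # the weights change right away, before the next example is seen
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

# Same assumed toy data: targets follow t = 1 + 2*x1, x[0] = 1 is a constant input
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]
sgd_weights = stochastic_gradient_descent(data)
```

Because this toy data is fit exactly by a linear unit, both variants end up at
essentially the same weights; they differ in when the updates happen.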
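The update delta(w_i) = n(t-o)x_i is called the delta rule (also the LMS
rule, Adaline rule or Widrow-Hoff rule).
The perceptron rule looks the same, but o = w->·x-> is replaced by the
thresholded output o = sign(w->·x->).
The delta rule can also be used to train a thresholded unit
o = sign(w->·x->) with targets +1 or -1. It is still applied to the
unthresholded output w->·x->, so it minimizes the squared error of that
linear output and will not necessarily minimize the number of training
examples misclassified by the threshold output.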
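A small sketch of the perceptron rule makes the contrast concrete: the update
has the same form, but o is the thresholded output, so weights change only on
misclassified examples. The zero initial weights (chosen here for
reproducibility instead of small random values), the treatment of sign(0) as
-1, the function names, and the AND data set are assumptions for illustration.

```python
def sign(y):
    # assumption: treat 0 as -1 so the output is always +1 or -1
    return 1 if y > 0 else -1

def perceptron_train(examples, eta=0.1, max_epochs=100):
    """Perceptron rule: w_i = w_i + eta * (t - o) * x_i with o = sign(w . x)."""
    w = [0.0] * len(examples[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in examples:
            o = sign(sum(wi * xi for wi, xi in zip(w, x)))
            if o != t:  # (t - o) is zero for correctly classified examples
                mistakes += 1
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
        if mistakes == 0:  # every example classified correctly
            break
    return w

# Logical AND with +1/-1 encoding; x[0] = 1 is a constant input (linearly separable)
and_data = [([1, -1, -1], -1), ([1, -1, 1], -1), ([1, 1, -1], -1), ([1, 1, 1], 1)]
w = perceptron_train(and_data)
predictions = [sign(sum(wi * xi for wi, xi in zip(w, x))) for x, _ in and_data]
```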
Summary
perceptron training - updates weights based on the thresholded output;
converges after a finite number of steps if the data are linearly
separable.
delta rule - updates weights based on the unthresholded output; converges
only asymptotically (may require unbounded time), but does so even if the
data are not linearly separable.
Multilayer nets and Back Propagation Algorithm


A differentiable threshold unit

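The sigmoid unit computes o = sigma(w->·x->), where
sigma(y) = 1/(1 + e^-y).
The sigmoid can be replaced by other functions as long as they are
differentiable. Variants used replace e^-y by e^-ky (k a positive
constant) or use tanh.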
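As a quick illustration of why differentiability matters, the sketch below
defines the sigmoid, the e^-ky variant, and the derivatives that gradient
descent needs. The function names are assumptions. The identity
sigma'(y) = sigma(y)(1 - sigma(y)) is what produces the o(1 - o) factors in
the back propagation error terms.

```python
import math

def sigmoid(y, k=1.0):
    """Sigmoid 1 / (1 + e^(-k*y)); k = 1 gives the standard form."""
    return 1.0 / (1.0 + math.exp(-k * y))

def sigmoid_derivative(y):
    """Derivative of the standard sigmoid: sigma(y) * (1 - sigma(y))."""
    s = sigmoid(y)
    return s * (1.0 - s)

def tanh_derivative(y):
    """Derivative of tanh: 1 - tanh(y)^2."""
    return 1.0 - math.tanh(y) ** 2
```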
Since there are multiple outputs, the error is summed over all of them:
E(w->) = ½ Sum(d element of D) Sum(k element of outputs) (t_kd - o_kd)^2
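The double sum can be read off directly in code. This is a sketch with an
assumed function name and made-up target and output vectors, meant only to
show which indices are summed.

```python
def network_error(targets, outputs):
    """E = 1/2 * sum over examples d and output units k of (t_kd - o_kd)^2.

    targets, outputs: lists with one vector (list of floats) per example.
    """
    return 0.5 * sum(
        (t - o) ** 2
        for t_vec, o_vec in zip(targets, outputs)   # sum over d in D
        for t, o in zip(t_vec, o_vec)               # sum over k in outputs
    )

# Two examples, two output units each (values assumed for illustration)
example_error = network_error([[1.0, 0.0], [0.0, 1.0]],
                              [[0.5, 0.5], [0.0, 1.0]])
```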
Back Propagation
Given {<x->, t->>} (training examples)
n learning rate (=~ .05)
Initialize all weights to random numbers between -.05 and .05
Until the termination condition is met:
  For each <x->, t-> in the training examples:
  - Input x-> to the network and compute o_u for each unit u
  - For each output unit k, calculate its error term
      delta_k = o_k (1 - o_k)(t_k - o_k)
  - For each hidden unit h, calculate its error term
      delta_h = o_h (1 - o_h) Sum(k element of outputs) (w_kh delta_k)
    where w_kh is the weight from hidden unit h to output unit k (it
    weights h's responsibility for k's error) and delta_k is the error
    term for output unit k
  - Update every weight w_ji (from unit i into unit j)
      w_ji = w_ji + n delta_j x_ji
    where n delta_j x_ji = delta(w_ji) and x_ji is the input from unit i
    into unit j
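The steps above can be sketched for a network with one hidden layer of sigmoid
units. This is a minimal illustration under stated assumptions, not a
definitive implementation: the function names, the constant 1 input used as a
bias for every unit, the network size, the seed, and the single training
example are all choices made here.

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def forward(x, w_hidden, w_out):
    """Return (hidden outputs, network outputs). Weight index 0 is a bias (assumed)."""
    xb = [1.0] + list(x)
    h = [sigmoid(sum(w * xi for w, xi in zip(row, xb))) for row in w_hidden]
    hb = [1.0] + h
    o = [sigmoid(sum(w * hi for w, hi in zip(row, hb))) for row in w_out]
    return h, o

def backprop_step(x, t, w_hidden, w_out, eta=0.05):
    """One stochastic update for one example; modifies the weights in place."""
    h, o = forward(x, w_hidden, w_out)
    xb, hb = [1.0] + list(x), [1.0] + h
    # error terms for output units: o_k (1 - o_k)(t_k - o_k)
    delta_k = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # error terms for hidden units: o_h (1 - o_h) * sum_k w_kh * delta_k
    # (w_out[k][j + 1] is the weight from hidden unit j to output k)
    delta_h = [hj * (1 - hj) * sum(w_out[k][j + 1] * delta_k[k]
                                   for k in range(len(o)))
               for j, hj in enumerate(h)]
    # w_ji = w_ji + eta * delta_j * x_ji
    for k, row in enumerate(w_out):
        for i in range(len(row)):
            row[i] += eta * delta_k[k] * hb[i]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * delta_h[j] * xb[i]
    return o

def init_weights(n_units, n_inputs, rng):
    return [[rng.uniform(-0.05, 0.05) for _ in range(n_inputs)]
            for _ in range(n_units)]

rng = random.Random(0)
w_hidden = init_weights(2, 3, rng)  # 2 hidden units, each with bias + 2 inputs
w_out = init_weights(1, 3, rng)     # 1 output unit with bias + 2 hidden inputs
x, t = [0.0, 1.0], [0.9]
error_before = 0.5 * (t[0] - forward(x, w_hidden, w_out)[1][0]) ** 2
for _ in range(200):
    backprop_step(x, t, w_hidden, w_out)
error_after = 0.5 * (t[0] - forward(x, w_hidden, w_out)[1][0]) ** 2
```

Note that the hidden error terms are computed from the output weights before
those weights are updated, matching the order of the steps in the algorithm.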
Issues with training multilayer nets
- the error surface can have multiple local minima, so training is not
guaranteed to converge to the global minimum
- delta_j = -dE_d/d(net_j), where net_j = Sum_i(w_ji x_ji) is the weighted
sum of inputs to unit j and E_d is the error on example d
- training will go through the training data many (1000s of) times
- the back propagation code given is the stochastic approximation; true
gradient descent can also be done
- overfitting is an issue
- the algorithm generalizes to any acyclic network: for each unit, just
consider the units directly connected to it and use the same algorithm
- there is work on recurrent nets (where the graph is cyclic); these won't
be discussed
Adding Momentum
Replace delta(w_ji) by
  delta(w_ji)(p) = n delta_j x_ji + alpha * delta(w_ji)(p-1)
where p is the pth iteration and delta(w_ji)(p-1) is the change in weight
at the previous iteration
Ideally this helps the search keep enough momentum to roll through small
local minima and flat regions. Where the gradient is unchanging it also
effectively increases the step size, which speeds convergence.
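A short sketch of the momentum update for a single weight follows. The
function name and the constant gradient term used in the demonstration are
assumptions. The default values n = .3 and alpha = .3 are the ones quoted for
the face recognition example.

```python
def momentum_step(gradient_term, previous_delta, eta=0.3, alpha=0.3):
    """delta(w_ji)(p) = eta * delta_j * x_ji + alpha * delta(w_ji)(p-1).

    gradient_term stands for delta_j * x_ji at iteration p.
    """
    return eta * gradient_term + alpha * previous_delta

# Assumed demonstration: the gradient term stays constant at 1.0
steps = []
delta = 0.0
for _ in range(5):
    delta = momentum_step(1.0, delta, eta=0.1, alpha=0.5)
    steps.append(delta)
```

With a constant gradient term the successive weight changes grow from
eta toward eta / (1 - alpha), here from 0.1 toward 0.2, which is the sense in
which momentum enlarges the effective step in regions where the gradient does
not change.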
Convergence and Avoiding Local Minima
- a momentum term often helps
- stochastic gradient descent (vs. true gradient descent) often helps,
since it descends a different error surface for each example
- train multiple networks using the same data but different initial
weights, then pick the best (or combine them) using a validation set
- the error surface begins fairly flat (while the weights are near 0) and
gets more complex as training continues; hence over time overfitting will
often occur (note: the sigmoid is nearly linear around w=0 and gets more
non-linear as the weights move towards 1 and -1)
Inductive Bias - can roughly be summarized as smooth interpolation between
data points.
Generalization, Overfitting and Stopping Criterion
Use k-fold cross validation as follows:
- Divide the m training examples into k disjoint sets s_1,...,s_k of m/k
examples each
- For each s_i:
  - Train the net on the other k-1 sets, with the m/k examples of s_i set
  aside as validation data
  - Stop training when accuracy on the validation data decreases
  significantly
- To estimate accuracy, look at the average accuracy across all k runs
- You can do a similar thing with different starting weights and pick the
network with the best performance on the validation data
- Another option - compute the average number of training rounds, r-bar,
across the k runs; then train on all m examples for r-bar rounds
- Another option - weight decay - decrease each weight by a small factor
at each iteration
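The partitioning step of k-fold cross validation can be sketched as below.
The function name and the way examples are dealt into folds (every k-th
example) are assumptions. Training and the stopping test are left out, since
they depend on the network code being used.

```python
def k_fold_splits(examples, k):
    """Yield (training, validation) pairs, one per fold.

    The examples are divided into k disjoint folds; each fold serves once as
    the validation set while the other k-1 folds form the training set.
    """
    folds = [examples[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield training, validation

# m = 10 placeholder examples, k = 5 folds of m/k = 2 examples each
splits = list(k_fold_splits(list(range(10)), 5))
```

Each of the k runs would train on `training`, monitor accuracy on
`validation`, and record the number of rounds at which training stopped, so
that the average r-bar can be computed afterwards.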
Design Decisions made for Face Recognition
Input Encoding - rescale so each feature value is between 0 and 1
Output Encoding:
  instead of <1,0,0,0> use <.9,.1,.1,.1>
  Reason: a sigmoid unit can approximate these using only finite weights
  (it cannot output exactly 0 or 1 with finite weights)
Number of hidden units
  3 hidden units: 90% accuracy, about 5 minutes training time
  30 hidden units: 91-92% accuracy, about 1 hour training time
  < 3 units did not give good accuracy
  Can use cross validation to help select the optimal number of units
    too few - won't have good accuracy
    too many - slower to train and more prone to overfitting
Other Learning Parameters
learning rate n = .3
momentum alpha = .3
lower values of these gave roughly the same generalization but longer
training times
higher values caused the error to be too high
all input unit weights "wo" set to 0, since this gave much more
intelligible visualizations of the learned weights without hurting
generalization accuracy
Stopping
created a separate validation set
every 50 gradient descent steps the accuracy on the validation data was
computed
stopped when there was a significant decrease in accuracy
output: the weights giving the highest accuracy on the validation set
90% accuracy on a separate test data set
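The input and output encodings described for the face recognition network can
be sketched as follows. The function names and the sample pixel values are
assumptions, and the sketch covers only the encodings, not the rest of the
training setup.

```python
def rescale(values, lo=None, hi=None):
    """Linearly rescale feature values into the range 0..1."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    if hi == lo:  # constant feature: map everything to 0 to avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def encode_target(class_index, n_classes, on=0.9, off=0.1):
    """Target vector using .9/.1 instead of 1/0, reachable with finite weights."""
    return [on if i == class_index else off for i in range(n_classes)]

scaled = rescale([0, 128, 255])   # assumed sample intensities in 0..255
target = encode_target(0, 4)      # first of four output classes
```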
8x3x8 Network
training data - identity function
3 hidden units basically learn a binary encoding for the 8 possible
outputs
Fig 4.8 shows how training proceeds and how the hidden units evolve