HTML document prepared by Brian Blankstein.
Artificial Neural Networks (ANN)
Artificial neural networks are a robust (i.e. noise-tolerant) approach to approximating real-valued, discrete-valued or vector-valued target functions. There are many tasks that humans do well and computers do poorly. ANNs are among the most effective methods known for many such problems, for example learning to interpret complex real-world sensor data (e.g. vision, character recognition, face recognition).
The human brain is made up of approximately 10^11 neurons, each connected to approximately 10^4 other neurons. Neurons send signals over these connections, and a signal can be weak or strong, excitatory or inhibitory. If the sum of the signals arriving at a given neuron is greater than some threshold, the neuron fires, sending a signal to each of the neurons it is connected to. These signals don't have to be the same: each connection has a weight, wi, which corresponds to the strength and sign of the signal that the receiving neuron gets.
The fastest a neuron can send a signal to another neuron is about 10^-3 seconds (for computers, the comparable switching time is about 10^-10 seconds). Humans can recognize faces in about 0.1 seconds, which leaves room for only about 0.1 / 10^-3 = 100 sequential firings. The sequence of firings therefore can't be longer than a few hundred steps, so biological neural networks must be highly parallel. While ANNs are loosely motivated by biological systems, biological systems have many complexities that ANNs do not model, and ANNs have features that are inconsistent with biology (e.g. biological neurons output a complex time series of spikes, while ANN units output a single constant value). There are two branches of research:
- use ANNs to learn about biological systems
- develop effective machine learning algorithms with ANN (what we focus on)
ALVINN
Learns to steer an autonomous vehicle driving at
normal speeds on public highways

Each of the 4 hidden units has 30x32 = 960 inputs, and its output feeds
all 30 output units. So there are (960+1)x4 = 3844 weights from the
input to the hidden units, where the "+1" is for the "w0" weight.
There are 30 output units, each of which has 5 weights (4 from the
hidden units and the "w0" value), giving 30x5 = 150 weights. Hence
there are 3844+150 = 3994 weights in total. Each output unit represents
a possible action (e.g. A might represent turn left sharply and C might
represent go straight).
Training is slow, but using the trained network is generally quick.
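The weight count above can be checked with a few lines of Python. This is only a sketch of the arithmetic: count_weights is a hypothetical helper, and it assumes a fully connected network with one hidden layer and one extra "w0" weight per unit, as described in the notes.

```python
def count_weights(n_inputs, n_hidden, n_outputs):
    """Count weights in a fully connected net with one hidden layer.

    Every hidden and output unit gets one extra "w0" weight.
    Returns (input-to-hidden, hidden-to-output, total).
    """
    input_to_hidden = (n_inputs + 1) * n_hidden
    hidden_to_output = (n_hidden + 1) * n_outputs
    return input_to_hidden, hidden_to_output, input_to_hidden + hidden_to_output


# ALVINN figures from the notes: 30x32 inputs, 4 hidden units, 30 outputs.
print(count_weights(30 * 32, 4, 30))  # (3844, 150, 3994)
```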
Face Recognition
The data are 120x128 grayscale images of faces. Each pixel is an intensity from 0 (black) to 255 (white), which is then scaled to the range 0 to 1. Suppose we want to train a network to recognize whether the person is looking left, straight, right or up. To make training faster, reduce each image to 30x32 (a quarter of the resolution in each dimension) by replacing each 4x4 cluster of 16 pixels with a single pixel (its value can be the cluster's mean, the upper-left pixel, or some other value). The architecture for the reduced images is a 960x3x4 ANN (960 inputs, 3 hidden units, 4 outputs), similar to that used by ALVINN.
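The image reduction described above might be sketched as follows. The function names are hypothetical, the mean is just one of the options mentioned, and the sketch assumes the image is a list of rows whose dimensions are exact multiples of the block size.

```python
def downsample(image, block=4):
    """Shrink a 2-D list of intensities by averaging block x block cells.

    Assumes both image dimensions are exact multiples of `block`.
    """
    rows, cols = len(image), len(image[0])
    small = []
    for r in range(0, rows, block):
        out_row = []
        for c in range(0, cols, block):
            cell = [image[r + dr][c + dc]
                    for dr in range(block) for dc in range(block)]
            out_row.append(sum(cell) / len(cell))
        small.append(out_row)
    return small


def to_network_input(image):
    """Scale 0..255 intensities to 0..1 and flatten row by row."""
    return [pixel / 255.0 for row in image for pixel in row]


# Made-up test image: 120 rows by 128 columns of uniform mid-gray.
full = [[128] * 128 for _ in range(120)]
quarter = downsample(full)
inputs = to_network_input(quarter)
print(len(quarter), len(quarter[0]), len(inputs))  # 30 32 960
```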
Perceptron - one model for a neuron:

Representational power of perceptron:
The output is 1 if w1x1 + w2x2 + w3x3 +...+ wnxn > -w0, and -1 otherwise.
Note that this inequality defines a half-space in n dimensions, bounded by a hyperplane (a line in 2-d), so a perceptron can correctly classify a set of + and - points only if they are linearly separable.
A single perceptron can implement many boolean functions (2-d shown here, but generalizable to n inputs):

m-of-n functions: true where at least m of the n inputs are true.
OR - m=1
AND - m=n
Any m-of-n function can be easily represented by setting w1 = w2 = ... = wn = 0.5 and setting the threshold weight w0 appropriately.
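As an illustration of the m-of-n construction, here is a minimal sketch. It assumes inputs are encoded as 1 (true) and -1 (false), matching the perceptron's own outputs. The notes do not state the encoding or the value of w0, so both the encoding and the formula for w0 are assumptions of this example, and the function names are hypothetical.

```python
def perceptron(weights, inputs):
    """Return 1 if w0 + w1*x1 + ... + wn*xn > 0, else -1 (weights[0] is w0)."""
    total = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if total > 0 else -1


def m_of_n_weights(m, n):
    """Weights for 'at least m of the n inputs are true' (true=1, false=-1).

    With every wi = 0.5 and k true inputs, the weighted sum is k - n/2,
    so w0 = n/2 - m + 0.5 puts the threshold halfway between k = m-1
    and k = m.
    """
    return [n / 2 - m + 0.5] + [0.5] * n


or_weights = m_of_n_weights(1, 2)   # OR is 1-of-2
and_weights = m_of_n_weights(2, 2)  # AND is 2-of-2
for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2,
              perceptron(or_weights, [x1, x2]),
              perceptron(and_weights, [x1, x2]))
```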
XOR cannot be represented by a single perceptron (not linearly separable).
Note that any boolean formula can be expressed as an OR of ANDs and so any boolean formula can be expressed via a 2-level perceptron network.
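XOR makes a small worked case of the OR-of-ANDs idea: x1 XOR x2 = (x1 AND NOT x2) OR (NOT x1 AND x2). The sketch below hand-picks the weights rather than learning them, and assumes the encoding true = 1, false = -1. The weight values are one possible choice, not taken from the notes.

```python
def perceptron(weights, inputs):
    """Return 1 if w0 + w1*x1 + ... + wn*xn > 0, else -1 (weights[0] is w0)."""
    total = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if total > 0 else -1


def xor(x1, x2):
    """Two-level perceptron network for XOR (true = 1, false = -1)."""
    # First level: two AND units, each with one input negated
    # (negation is a negative weight on that input).
    h1 = perceptron([-0.5, 0.5, -0.5], [x1, x2])  # x1 AND NOT x2
    h2 = perceptron([-0.5, -0.5, 0.5], [x1, x2])  # NOT x1 AND x2
    # Second level: OR of the two hidden units.
    return perceptron([0.5, 0.5, 0.5], [h1, h2])


for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, xor(x1, x2))
```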
Perceptron Training Rule
begin with random weights in [-1,1]
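repeat until all training examples are correctly classified (each training example may be used multiple times)
  for each training example (x1,...,xn) with target t, taking x0 = 1
    for i=0 to n
      wi = wi + N(t-o)xi
where N = learning rate (~0.1; may decay as the number of iterations increases)
t = target output from the training example
o = actual output from the perceptron
If an example is already classified correctly, then t-o = 0 and the weights are unchanged.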
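The training rule above can be sketched in Python as follows. The function names, the fixed random seed, and the max_epochs cap are assumptions added so the example is repeatable and always terminates. The cap matters because the loop never finishes on its own when the data are not linearly separable.

```python
import random


def predict(weights, x):
    """Thresholded output: 1 if w . x > 0, else -1 (x includes x0 = 1)."""
    total = sum(w * xi for w, xi in zip(weights, x))
    return 1 if total > 0 else -1


def train_perceptron(examples, rate=0.1, max_epochs=1000, seed=0):
    """Perceptron training rule: wi = wi + N(t-o)xi.

    examples: list of (inputs, target) pairs with target in {1, -1}.
    Returns (weights, converged).
    """
    rng = random.Random(seed)
    n = len(examples[0][0])
    weights = [rng.uniform(-1, 1) for _ in range(n + 1)]
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, t in examples:
            x = [1] + list(inputs)  # x0 = 1 pairs with w0
            o = predict(weights, x)
            if o != t:
                mistakes += 1
                for i in range(n + 1):
                    weights[i] += rate * (t - o) * x[i]
        if mistakes == 0:  # every example classified correctly
            return weights, True
    return weights, False


# AND with true = 1, false = -1 (linearly separable).
and_data = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
weights, converged = train_perceptron(and_data)
print(converged, weights)
```

Running the same function on XOR data returns converged = False once the epoch cap is reached, since no single perceptron classifies XOR correctly.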
Gradient descent and delta rule
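The perceptron rule above works for linearly separable data, but it can fail to converge when the data are not separable. The delta rule overcomes this and converges toward a best-fit approximation to the target. It is derived for a linear unit, i.e. a perceptron without the threshold, whose output is o = w0 + w1x1 +...+ wnxn.
training error E(w) = 0.5*SUM[training examples] (t-o)^2
derivation of gradient descent rule: (the gradient is the n-d equivalent of the slope of the tangent in 2-d)
GRAD[E(w)] = [dE/dw0, dE/dw1, ..., dE/dwn] (d is the symbol for partial derivative)
the gradient points in the direction of steepest increase of E, so step the opposite way: w = w - N*GRAD[E(w)]
thus wi = wi - N*dE/dwi
Note: dE/dwi = SUM[training examples] (t-o)(-xi), so wi = wi + N*SUM[training examples] (t-o)xi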
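A minimal sketch of batch gradient descent for a linear unit, following the formulas above. The function names, the zero initial weights, the fixed number of epochs, and the toy data are assumptions of this example. The learning rate must be small enough for the particular data set, or the weights diverge.

```python
def train_linear_unit(examples, rate=0.1, epochs=500):
    """Batch gradient descent on E(w) = 0.5 * SUM (t-o)^2, with o = w . x."""
    n = len(examples[0][0])
    weights = [0.0] * (n + 1)
    for _ in range(epochs):
        grad = [0.0] * (n + 1)
        for inputs, t in examples:
            x = [1.0] + list(inputs)  # x0 = 1 pairs with w0
            o = sum(w * xi for w, xi in zip(weights, x))  # unthresholded
            for i in range(n + 1):
                grad[i] += (t - o) * (-x[i])  # dE/dwi
        # Step against the gradient: wi = wi - N * dE/dwi
        weights = [w - rate * g for w, g in zip(weights, grad)]
    return weights


def training_error(weights, examples):
    """E(w) = 0.5 * SUM over training examples of (t-o)^2."""
    total = 0.0
    for inputs, t in examples:
        x = [1.0] + list(inputs)
        o = sum(w * xi for w, xi in zip(weights, x))
        total += (t - o) ** 2
    return 0.5 * total


# Toy data generated from t = 1 + 2*x1 - 3*x2.
data = [((0, 0), 1.0), ((1, 0), 3.0), ((0, 1), -2.0), ((1, 1), 0.0)]
learned = train_linear_unit(data)
print(learned, training_error(learned, data))
```

On this data the learned weights approach [1, 2, -3] and the training error approaches 0.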
Normalization: each update N(t-o)xi scales with the size of x, so the input vector can be divided by a scalar z chosen so that the scaled vector has unit length:
SQRT[ (x0/z)^2 + (x1/z)^2 +...+(xn/z)^2] = 1
(x0/z)^2 + (x1/z)^2 +...+(xn/z)^2 = 1
x0^2 + x1^2 +...+xn^2 = z^2
z = SQRT[ x0^2 + x1^2 +...+xn^2] = ||x||
But computing this takes time, so as long as N||x|| is well-behaved, there is no need for the normalization. If ||x|| is roughly constant, then you can just set N appropriately.
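The normalization above amounts to dividing x by its Euclidean length. A short sketch follows. The function names are hypothetical, and the zero-vector check is an added assumption, since the notes do not cover that case.

```python
import math


def norm(x):
    """z = SQRT[x0^2 + x1^2 + ... + xn^2], i.e. ||x||."""
    return math.sqrt(sum(xi * xi for xi in x))


def normalize(x):
    """Return x / z so that the result has unit length."""
    z = norm(x)
    if z == 0:
        raise ValueError("cannot normalize the zero vector")
    return [xi / z for xi in x]


x = [3.0, 4.0]
print(norm(x))       # 5.0
print(normalize(x))
```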