An Example: Text Classification
Each example is a text document.
The label is the type of document (e.g. articles I will find interesting).
We give an algorithm based on naive Bayes that is very effective.
Two key decisions:
- how to convert a text document into attributes
- how to estimate the needed probabilities
Use a very simple way to represent the document:
Define an attribute for each word position.
The value of the attribute is the English word found in that position.
So the number of attributes per example varies.
Ex: 1000 training documents that someone has classified:
700 classified as "dislikes"
300 classified as "likes"
Suppose document 1 is "This is a very silly document"
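$$V_{NB} = \arg\max_{V_j \in \{likes,\, dislikes\}} P(V_j) \prod_{i=1}^{6} P(a_i \mid V_j)$$
$$= \arg\max_{V_j \in \{likes,\, dislikes\}} P(V_j) \cdot P(a_1 = \text{this} \mid V_j) \cdot P(a_2 = \text{is} \mid V_j) \cdots P(a_6 = \text{document} \mid V_j)$$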
Note: the independence assumption, that the word in one position is independent of the words in the other positions, clearly does not hold here. Yet in practice it works quite well.
(HW3 paper option: Domingos and Pazzani, 1996 provide an interesting analysis of this phenomenon.)
Back to the example: we need estimates for P(Vj) and P(ai=Wk|Vj).
P(Vj) is easy; for example, P(likes) = .3 and P(dislikes) = .7.
$$V_{NB} = \arg\max_{V_j \in V} P(V_j) \prod_{i=1}^{n} P(a_i \mid V_j)$$
For P(Vj), use the fraction of the total documents that have label Vj.
ai is the ith word of the text.
Estimating P(ai=Wk|Vj) is still problematic.
English has about 50,000 distinct words. With 2 target values and 100 text positions, you would need to estimate (2)(100)(50000) = 10,000,000 terms.
Complexity is further reduced by making the very reasonable assumption that the probability of encountering a specific word is independent of position. That is, you assume:
$$P(a_i = W_k \mid V_j) = P(a_m = W_k \mid V_j) \quad \text{for all } i, j, k, m$$
So now (in the example above) you only need (2)(50000) = 100,000 estimates, which is large but manageable.
Finally, an m-estimate is used with uniform priors and with m = size of the word vocabulary.
That is, P(Wk|Vj) = (nk + 1)/(n + |Vocabulary|)
where nk is the number of times Wk appears in documents with label Vj
and n is the total number of word positions in all documents with label Vj.
Resulting algorithm:
LearnNaiveBayes(Examples, V) where V is the set of target values
1. Vocabulary = set of all distinct words and other tokens that occur in Examples
2. Calculate the P(Vj) and P(Wk|Vj) terms:
- for each Vj element-of V do
  - docsj = subset of Examples with label Vj
  - P(Vj) = |docsj|/|Examples|
  - Textj = document obtained by concatenating all documents in docsj
  - n = # of words/tokens in Textj
  - for each word Wk in Vocabulary
    - nk = # of times Wk occurs in Textj
    - P(Wk|Vj) = (nk + 1)/(n + |Vocabulary|)
ClassifyNaiveBayes(Doc)
positions = all word positions in Doc that contain tokens in Vocabulary
return
$$V_{NB} = \arg\max_{V_j \in V} P(V_j) \prod_{i=1}^{n} P(a_i \mid V_j)$$
where n is the number of positions and ai is the word at the ith position
Note: any words in Doc that do not occur in the training text are ignored
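The two procedures above can be sketched in Python roughly as follows. This is only an illustrative sketch, not the course's reference implementation: the function names, the lowercase whitespace tokenization, and the use of log probabilities (which leaves the argmax unchanged but avoids numeric underflow on long documents) are assumptions added here. It also assumes every label in V has at least one training document.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for "words and other tokens".
    return text.lower().split()

def learn_naive_bayes(examples, labels):
    """examples: list of (text, label) pairs; labels: the set V of target values."""
    vocabulary = {w for text, _ in examples for w in tokenize(text)}
    priors, word_probs = {}, {}
    for v in labels:
        docs = [text for text, label in examples if label == v]
        priors[v] = len(docs) / len(examples)            # P(Vj) = |docsj| / |Examples|
        counts = Counter(w for text in docs for w in tokenize(text))
        n = sum(counts.values())                         # word positions in Textj
        # m-estimate with uniform priors: (nk + 1) / (n + |Vocabulary|)
        word_probs[v] = {w: (counts[w] + 1) / (n + len(vocabulary))
                         for w in vocabulary}
    return vocabulary, priors, word_probs

def classify_naive_bayes(doc, vocabulary, priors, word_probs):
    # Words that never occurred in the training text are ignored.
    positions = [w for w in tokenize(doc) if w in vocabulary]

    def log_score(v):
        return math.log(priors[v]) + sum(math.log(word_probs[v][w]) for w in positions)

    return max(priors, key=log_score)
```

For example, after `model = learn_naive_bayes(examples, {"likes", "dislikes"})`, a new document is classified with `classify_naive_bayes(doc, *model)`.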
Experimental Results
20 Usenet groups, 1000 articles collected from each group, giving 20,000 examples.
2/3 were used for training; the rest were used as the test set.
Random guessing would have an accuracy of approx 5%.
Naive Bayes achieved an accuracy of 89%.
The only variations from the pseudocode we gave were that the 100 most frequent words (such as "the", "of", etc.) were removed, and any word occurring fewer than 3 times was also removed. The resulting vocabulary contained approx 38,500 words.
Newsreader (a program for reading netnews that allows the user to rate articles as he/she reads them):
16% of articles were interesting
59% of the articles Newsreader recommended were interesting
Bayesian Belief Nets
The independence assumption
$$P(a_1, \ldots, a_n \mid V_j) = P(a_1 \mid V_j) \cdots P(a_n \mid V_j)$$
made by naive Bayes greatly reduces the complexity, but this assumption is often too strong.
Let's begin with an example:
Suppose you want to predict if there's a forest fire. Suppose you observe 5 boolean attributes: storm, lightning, campfire, thunder, and Bus Tour Group (or, more broadly, you want to estimate the probability of one of the 2^5 = 32 possible examples).

Without any independence assumptions you would need to estimate 2^6 = 64 probabilities (and this is a toy example).
What does the Bayes net represent? As an example, let's look at Campfire.

Pr(Campfire|other 5 attribs) = Pr(campfire|storm,BusTourGroup)
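$$P(S,B,L,C,T,F) = P(S) \cdot \frac{P(S,B)}{P(S)} \cdot \frac{P(S,B,L)}{P(S,B)} \cdot \frac{P(S,B,L,C)}{P(S,B,L)} \cdot \frac{P(S,B,L,C,T)}{P(S,B,L,C)} \cdot \frac{P(S,B,L,C,T,F)}{P(S,B,L,C,T)}$$
$$= P(S) \cdot P(B \mid S) \cdot P(L \mid S,B) \cdot P(C \mid S,B,L) \cdot P(T \mid S,B,L,C) \cdot P(F \mid S,B,L,C,T)$$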
Conditional Independence Assumptions
P(B|S) = P(B)
P(L|S,B) = P(L|S)
P(C|S,B,L) = P(C|S,B)
P(T|S,B,L,C) = P(T|L)
P(F|S,B,L,C,T) = P(F|S,L,C)
Thus, P(S,B,L,C,T,F) = P(S)·P(B)·P(L|S)·P(C|S,B)·P(T|L)·P(F|S,L,C)
These probabilities come directly from the tables stored at the nodes.
Instead of the 64 probabilities in the joint distribution, here we only need to estimate:
1+1+2+4+2+8 = 18 probabilities (the table sizes for S, B, L, C, T, and F respectively)
From these 6 local conditional distributions (and using the conditional independence assumptions) you can compute any of the 64 probabilities in the joint distribution, and hence answer questions like the probability of a forest fire under certain conditions.
Q) What Bayes net would correspond to assuming all variables are independent?
A) no edges
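As a rough check of the counting argument, the sketch below encodes the parent sets from the factorization P(S)·P(B)·P(L|S)·P(C|S,B)·P(T|L)·P(F|S,L,C), counts the table entries, and multiplies one entry per node to get a joint probability. The table values produced by `placeholder_cpts` are purely hypothetical placeholders (the notes do not give the actual numbers), and all function names are invented for illustration.

```python
from itertools import product

# Parents of each node, read off the factorization above.
PARENTS = {
    "S": (), "B": (), "L": ("S",),
    "C": ("S", "B"), "T": ("L",), "F": ("S", "L", "C"),
}

def num_parameters(parents):
    """One P(node = True | parent values) entry per combination of parent values."""
    return sum(2 ** len(ps) for ps in parents.values())

def placeholder_cpts(parents):
    """HYPOTHETICAL tables: P(node=True | parents) = (1 + #true parents) / (2 + #parents)."""
    return {
        node: {vals: (1 + sum(vals)) / (2 + len(ps))
               for vals in product([False, True], repeat=len(ps))}
        for node, ps in parents.items()
    }

def joint(assignment, parents, cpts):
    """P(assignment) as the product of one table entry per node."""
    p = 1.0
    for node, ps in parents.items():
        p_true = cpts[node][tuple(assignment[q] for q in ps)]
        p *= p_true if assignment[node] else 1.0 - p_true
    return p

def all_assignments(parents):
    """Yield every True/False assignment to the nodes as a dict."""
    nodes = list(parents)
    for vals in product([False, True], repeat=len(nodes)):
        yield dict(zip(nodes, vals))
```

With these definitions, `num_parameters(PARENTS)` reproduces the 1+1+2+4+2+8 count, and summing `joint` over `all_assignments(PARENTS)` gives 1 for any valid set of tables.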
Now Let's Generalize
A Bayesian Belief Net represents the joint probability distribution for a set of variables by specifying a set of conditional independence assumptions (represented by a directed acyclic graph) together with a set of local conditional probabilities (often called marginals).
For values y1,...,yn of the variables Y1,...,Yn:
$$P(y_1, \ldots, y_n) = \prod_{i=1}^{n} P(y_i \mid Parents(Y_i))$$
where P(y1,...,yn) is shorthand for P(Y1=y1,...,Yn=yn) and Parents(Yi) are the nodes with edges directly into Yi.
Inference
Problem: infer the probability distribution for some variable (e.g. Forest Fire) given only observed values for a subset of the other variables.
(If Forest Fire had 5 values, you would be computing a probability distribution with 5 components.)
Applet: www.cs.ubc.ca/labs/lci/CIspace/bayes.html lets you try other variations.
We'll use this example:

Suppose you are given smoke = T. How do the probabilities change? That is, what are the posteriors?
By Bayes' theorem:
$$P(fire \mid smoke) = \frac{P(smoke \mid fire) \cdot P(fire)}{P(smoke)}$$
where P(smoke) is the prior:
$$P(smoke) = P(smoke \wedge fire) + P(smoke \wedge \neg fire)$$
$$= P(smoke \mid fire) \cdot P(fire) + P(smoke \mid \neg fire) \cdot P(\neg fire)$$
$$= (.9)(.01) + (.01)(.99) = .0189$$
Posterior for P(fire):
$$\frac{(.9)(.01)}{.0189} \approx .4762$$
where .9 = P(smoke|fire), .01 = prior for P(fire), and .0189 = prior for P(smoke).
So the posterior for P(!fire) = 1 - .4762 = .5238
Tampering and fire are independent, so the posterior for P(tampering) = .02 and P(!tampering) = .98.
$$P(alarm) = P(alarm \wedge fire \wedge tampering) + P(alarm \wedge fire \wedge \neg tampering) + P(alarm \wedge \neg fire \wedge tampering) + P(alarm \wedge \neg fire \wedge \neg tampering)$$
$$= P(a \mid f,t) P(f) P(t) + P(a \mid f,\neg t) P(f) P(\neg t) + P(a \mid \neg f,t) P(\neg f) P(t) + P(a \mid \neg f,\neg t) P(\neg f) P(\neg t)$$
$$= (.5)(.4762)(.02) + (.99)(.4762)(.98) + (.85)(.5238)(.02) + (.0001)(.5238)(.98) \approx .4757$$
$$P(leaving) = P(leaving \mid alarm) \cdot P(alarm) + P(leaving \mid \neg alarm) \cdot P(\neg alarm)$$
$$= (.88)(.4757) + (.001)(.5243) \approx .4192$$
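The hand computation above can be cross-checked by brute-force enumeration of the joint distribution. The sketch below uses only the probability values that appear in the worked example; the network structure (fire and tampering feed alarm, fire feeds smoke, alarm feeds leaving) is read off those computations, and the function names are illustrative assumptions. Enumeration is exponential in the number of variables, so this is only sensible for toy networks like this one.

```python
from itertools import product

# Table values as they appear in the worked example.
P_FIRE = 0.01
P_TAMPERING = 0.02
P_SMOKE = {True: 0.9, False: 0.01}                       # P(smoke | fire)
P_ALARM = {(True, True): 0.5, (True, False): 0.99,       # P(alarm | fire, tampering)
           (False, True): 0.85, (False, False): 0.0001}
P_LEAVING = {True: 0.88, False: 0.001}                   # P(leaving | alarm)

NAMES = ("tampering", "fire", "alarm", "smoke", "leaving")

def bern(p_true, value):
    """Probability that a boolean variable takes `value` when P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(tampering, fire, alarm, smoke, leaving):
    """Joint probability as a product of one local table entry per node."""
    return (bern(P_TAMPERING, tampering) * bern(P_FIRE, fire)
            * bern(P_ALARM[(fire, tampering)], alarm)
            * bern(P_SMOKE[fire], smoke)
            * bern(P_LEAVING[alarm], leaving))

def posterior(query, evidence):
    """P(query = True | evidence), summing the joint over all 32 assignments."""
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(NAMES)):
        world = dict(zip(NAMES, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue
        p = joint(*values)
        denominator += p
        if world[query]:
            numerator += p
    return numerator / denominator
```

For instance, `posterior("fire", {"smoke": True})` corresponds to the posterior for fire computed by hand above, and `posterior("alarm", {})` gives the prior for alarm.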
Summary
|           | smoke=T | alarm=T | smoke=T /\ alarm=T | priors |
| tampering | .02     | .6334   | .0287              | .02    |
| fire      | .4762   | .3667   | .9812              | .01    |
| alarm     | .4757   | 1.0     | 1.0                | .0267  |
| smoke     | 1.0     | .3364   | 1.0                | .0189  |
| leaving   | .4192   | .88     | .88                | .0245  |
For a 10 pt HW problem, show work (as above) to obtain the last two columns, and also give the probabilities given the observation leaving=T.
In fact, a Bayesian net can be used to compute the probability distribution for any subset of network variables given the values or distributions for any subset of the remaining variables.
Exact inference of probabilities for an arbitrary Bayes net is NP-hard, so in practice we try to approximate them. (Even approximate inference can be shown to be NP-hard, but in practice these heuristics work.)
Suppose you want to compute P(y|x), where X is the set of observed variables and Y is the set of variables deemed important for prediction or diagnosis.
By Bayes's rule
$$P(y \mid x) = \frac{P(x \wedge y)}{P(x)} = \frac{\sum_{s} P(y, x, s)}{\sum_{y, s} P(y, x, s)}$$
where s ranges over all variables except those in X and Y.
Complexity depends on the # of parents.
Learning Bayesian Belief Nets
If the network structure is given in advance and all variables are fully observable, then learning the conditional probability tables is straightforward: just estimate the table entries as we would for a naive Bayes classifier.
Now consider the case where the network structure is known but only some of the variable values are observable in the training data.
The problem is similar to learning weights for hidden units in a neural net.
(If the network, viewed as an undirected graph, is a tree then the problem can easily be solved as we did in the example, but with cycles it is harder, and it is NP-hard to find an exact solution.)
Objective function to maximize:
P(D|h), where D is the training data and h is the hypothesis.
By definition this corresponds to searching for the maximum likelihood hypothesis for the table entries.
Gradient Ascent Training of Bayesian Nets (Russell et al.)
Maximize P(D|h) by following the gradient of ln P(D|h) with respect to the parameters that define the conditional probability tables of the Bayesian network.
The rule you end up with is:
1. Update each table entry:
$$W_{ijk} \leftarrow W_{ijk} + \eta \sum_{d \in D} \frac{P_h(y_{ij}, U_{ik} \mid d)}{W_{ijk}}$$
where Wijk is one entry in a conditional probability table (for example, the table for Campfire),
eta is the learning rate,
yij is the value of the variable (e.g. the value of Campfire),
Uik is the value of its parents (e.g. the value of <Storm, BusTourGroup>)
2. Renormalize the Wijk to ensure they are valid probability distributions.
The EM Algorithm
A general technique to use when only a subset of the relevant features is observable. It can also be applied when the label is missing on some examples (i.e. you have unlabeled data along with some labeled data).
The EM algorithm has been used to train Bayesian Belief Nets.
Ex. Estimating the means of k Gaussians. For now suppose k=2, and G1 and G2 are two Gaussians (normal distributions) with the same variance.

Get data as follows:
- with prob 1/2 pick G1 and with prob 1/2 pick G2
- draw random x based on gaussian selected
- only x is given (whether you picked G1 or G2 is a hidden variable)
Goal: find h = <µ1,µ2>
Now consider the same problem but with k Gaussians (all with the same variance).
The goal is to output a hypothesis h = <µ1,...,µk> that is a maximum likelihood hypothesis, that is, h should maximize p(D|h).
For a moment suppose each example was <xi, Zi1, Zi2, ..., Zik>, where Zi1, ..., Zik are indicators of which normal xi came from.
This is an easy problem. For the mj examples from Gaussian j you want
$$\mu_{ML} = \arg\min_{\mu_j} \sum_{i=1}^{m_j} (x_i - \mu_j)^2$$
where µj is the mean of the jth Gaussian.
It can be shown that the sum of squared errors is minimized by the sample mean
$$\mu_{ML} = \frac{1}{m_j} \sum_{i=1}^{m_j} x_i$$
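As a brief sketch of why the sample mean is the minimizer (a standard calculus argument, not spelled out in the notes), set the derivative of the squared error with respect to µj to zero:

$$\frac{d}{d\mu_j} \sum_{i=1}^{m_j} (x_i - \mu_j)^2 = -2 \sum_{i=1}^{m_j} (x_i - \mu_j) = 0 \quad\Longrightarrow\quad \mu_j = \frac{1}{m_j} \sum_{i=1}^{m_j} x_i$$

The second derivative is 2·mj > 0, so this stationary point is a minimum.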
But now suppose the indicator attributes are hidden, so you just see xi.
EM, the algorithm we are about to see, can be applied when you have hidden attributes.
EM searches for a maximum likelihood hypothesis by repeatedly re-estimating the expected values of the hidden variables given the current hypothesis <µ1,...,µk>, then re-calculating the maximum likelihood hypothesis using these expected values for the hidden variables.
Let's look at EM for the problem of estimating the two means.
Note: in the "E" step the current hypothesis is used to estimate the unobserved variables;
in the "M" step the expected values of the unobserved variables are used to calculate an improved hypothesis.
Initialization: h = <µ1,µ2> where µ1 and µ2 are arbitrary initial values
Step 1: Calculate the expected value E[Zij] of each hidden variable Zij, assuming the current hypothesis h = <µ1,µ2> holds.
Step 2: Calculate a new maximum likelihood hypothesis h' = <µ'1,µ'2>, assuming the value taken on by each hidden variable Zij is its expected value E[Zij] calculated in step 1. Then replace h = <µ1,µ2> by the new hypothesis h' = <µ'1,µ'2> and iterate.
Let's look at how these two steps are implemented (for our two means example).
Step 1: You must estimate E[Zij], which is the probability that xi was generated by the jth normal:
$$E[Z_{ij}] = \frac{P(x = x_i \mid \mu = \mu_j)}{\sum_{n=1}^{2} P(x = x_i \mid \mu = \mu_n)} = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{2} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}}$$
Compute this by substituting the current values of <µ1,µ2> and the observed xi.
Step 2: You can show that the maximum likelihood hypothesis in this case is given by the weighted sample mean (for further info see 6.12.3):
$$\mu_j = \frac{\sum_{i=1}^{m} E[Z_{ij}] \, x_i}{\sum_{i=1}^{m} E[Z_{ij}]}$$
This algorithm (in general) converges to a local maximum likelihood hypothesis.
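The two steps can be sketched in Python as below. This is a minimal illustration under stated assumptions: the variance is known and shared, the function names and the fixed iteration count are invented here, the M step normalizes each weighted sum by the total expected weight of that Gaussian, and the largest exponent is subtracted before exponentiating purely for numerical stability (it cancels in the ratio). The data generator follows the sampling procedure described in the example.

```python
import math
import random

def em_means(xs, mu_init, sigma=1.0, iterations=50):
    """Estimate the means of len(mu_init) Gaussians that share a known sigma."""
    mu = list(mu_init)
    for _ in range(iterations):
        # Step 1 (E): expected value of each hidden indicator Z_ij under the current means.
        expected = []
        for x in xs:
            exps = [-((x - m) ** 2) / (2 * sigma ** 2) for m in mu]
            top = max(exps)                      # stability shift; cancels in the ratio
            weights = [math.exp(e - top) for e in exps]
            total = sum(weights)
            expected.append([w / total for w in weights])
        # Step 2 (M): each new mean is the sample mean weighted by E[Z_ij].
        for j in range(len(mu)):
            weight = sum(e[j] for e in expected)
            mu[j] = sum(e[j] * x for e, x in zip(expected, xs)) / weight
    return mu

def sample_two_gaussians(n, mu1, mu2, sigma=1.0, seed=0):
    """Pick each Gaussian with probability 1/2, then draw x from it; only x is returned."""
    rng = random.Random(seed)
    return [rng.gauss(mu1 if rng.random() < 0.5 else mu2, sigma) for _ in range(n)]
```

A typical call is `em_means(sample_two_gaussians(400, 0.0, 5.0), [1.0, 4.0])`. Different initial values can lead to different local maxima, so in practice several restarts are often tried.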
Sections 6.3-6.5 show how you can give Bayesian interpretations for:
- version spaces
- using least-squared error (given that noise obeys a normal distribution)
- gradient search to maximize likelihood
First Order Hidden Markov Models
These can be represented as the following Bayesian belief net:

Hi are hidden state variables
Oi are observed variables
Note: P(Ht+1|H1,...,Ht) = P(Ht+1|Ht)
That is, given state Ht, Ht+1 is independent of earlier states.
Joint distribution completely specified by:
P(H1) initial state prob
P(Ht+1|Ht) transition prob
P(Ot|Ht) emission probabilities
Left-to-right topology of an HMM:
each node represents a value of the state variable Ht
the HMM represents the distribution of acoustic sequences associated with a unit of speech (e.g. phoneme, word)

For speech recognition bring in many levels of abstraction

Estimating probabilities: use EM.
To make predictions from input observations O1,...,On: use dynamic programming. The subproblem is:
m[i,L] = probability of the most likely path from state i that produces output OL,...,On
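One way to fill in the m[i,L] table is sketched below (a Viterbi-style recurrence, run from the last observation backwards). The recurrence used is m[i,n] = P(On|i) and m[i,L] = P(OL|i) · max over j of P(j|i) · m[j,L+1]; the final answer is the max over i of P(H1=i) · m[i,1]. The two-state model and all names are hypothetical, chosen only for illustration, and the code assumes at least one observation and uses 0-based positions.

```python
# HYPOTHETICAL two-state HMM, for illustration only.
STATES = ("rain", "dry")
INIT = {"rain": 0.6, "dry": 0.4}                                   # P(H1)
TRANS = {"rain": {"rain": 0.7, "dry": 0.3},                        # P(Ht+1 | Ht)
         "dry": {"rain": 0.4, "dry": 0.6}}
EMIT = {"rain": {"umbrella": 0.9, "none": 0.1},                    # P(Ot | Ht)
        "dry": {"umbrella": 0.2, "none": 0.8}}

def most_likely_path(obs, states, init, trans, emit):
    """Return (probability, state sequence) of the most likely hidden path for obs."""
    n = len(obs)
    m = [dict() for _ in range(n)]       # m[L][i]: best path from state i emitting obs[L:]
    nxt = [dict() for _ in range(n)]     # nxt[L][i]: best successor of state i at position L
    for i in states:
        m[n - 1][i] = emit[i][obs[n - 1]]
    for L in range(n - 2, -1, -1):
        for i in states:
            best = max(states, key=lambda j: trans[i][j] * m[L + 1][j])
            nxt[L][i] = best
            m[L][i] = emit[i][obs[L]] * trans[i][best] * m[L + 1][best]
    state = max(states, key=lambda i: init[i] * m[0][i])
    prob = init[state] * m[0][state]
    path = [state]
    for L in range(n - 1):               # follow the stored successors forward
        state = nxt[L][state]
        path.append(state)
    return prob, path
```

The table has (number of states) × n entries and each entry takes a max over the states, so the running time is quadratic in the number of states and linear in n, instead of exponential as with enumerating every path.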