Decision Tree Learning
One of the most widely used methods for inductive inference.
It provides a method for approximating discrete-valued target functions.
A nice feature of decision trees is that they can be interpreted by humans.
Sample DT:

The tree can be expressed as a disjunction of conjunctions (one conjunction per path to a positive leaf):
(outlook=sunny ^ humidity=normal) v (outlook=overcast) v (outlook=rain ^ wind=weak)
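As a minimal sketch, the disjunction above can be written directly as a boolean function. The function name `sample_tree` and the lowercase string encoding of attribute values are assumptions made for illustration only.

```python
def sample_tree(outlook: str, humidity: str, wind: str) -> bool:
    """Evaluate the sample DT written as a disjunction of conjunctions."""
    return (
        (outlook == "sunny" and humidity == "normal")
        or outlook == "overcast"
        or (outlook == "rain" and wind == "weak")
    )


# Each conjunction corresponds to one root-to-leaf path ending in a positive leaf.
print(sample_tree("sunny", "high", "weak"))       # sunny with high humidity
print(sample_tree("overcast", "high", "strong"))  # any overcast day
```

Any instance that satisfies none of the three conjunctions is classified as negative.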
Appropriate Problems for DT Learning
Instances are represented by attribute-value pairs (can be extended to real-valued attributes)
The target function has discrete output values (can be extended to real-valued outputs)
Disjunctive descriptions may be required
Training data may contain errors
Training data may contain missing attribute values
Classification problems
- classify an example into one of a discrete set of possible categories
Basic Decision Tree Algorithm
(Top-down greedy search)
Pick the attribute most useful for classifying the examples.
Put that attribute at the root.
Recursively repeat for each subtree that has both + and - examples.
Terminate when all examples are +, all examples are -, no attributes are left (go with the most common label), or no examples fall into the leaf (go with the most common label at the parent).
ID3 and its successor C4.5 both use this structure.
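The steps above can be sketched in Python as follows. This is an illustrative sketch, not ID3 or C4.5 as published: the function names (`entropy`, `gain`, `id3`), the nested-dict tree representation, and the `domains` argument are all assumptions. "Most useful" is scored with information gain, which these notes define in the next section. The 14 rows are assumed to match the Table 3.2 that the notes reference later but do not reproduce.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def gain(examples, attr, target):
    """Information gain from splitting `examples` on `attr`."""
    expected = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attr] == value]
        expected += len(subset) / len(examples) * entropy(subset)
    return entropy([ex[target] for ex in examples]) - expected


def id3(examples, attributes, target, domains):
    """Return a class label (leaf) or a nested dict {attribute: {value: subtree}}."""
    labels = [ex[target] for ex in examples]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure or no attributes remain (use the majority label).
    if len(set(labels)) == 1 or not attributes:
        return majority
    # Greedy step: choose the attribute with the highest information gain.
    best = max(attributes, key=lambda a: gain(examples, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in domains[best]:
        subset = [ex for ex in examples if ex[best] == value]
        # An empty branch becomes a leaf with the parent's majority label.
        branches[value] = (id3(subset, remaining, target, domains)
                           if subset else majority)
    return {best: branches}


# Assumed training data (outlook, temperature, humidity, wind, label).
ATTRS = ["outlook", "temperature", "humidity", "wind"]
ROWS = """sunny hot high weak no
sunny hot high strong no
overcast hot high weak yes
rain mild high weak yes
rain cool normal weak yes
rain cool normal strong no
overcast cool normal strong yes
sunny mild high weak no
sunny cool normal weak yes
rain mild normal weak yes
sunny mild normal strong yes
overcast mild high strong yes
overcast hot normal weak yes
rain mild high strong no"""
examples = [dict(zip(ATTRS + ["play"], row.split())) for row in ROWS.splitlines()]
domains = {a: sorted({ex[a] for ex in examples}) for a in ATTRS}

tree = id3(examples, ATTRS, "play", domains)
print(tree)
```

On this data the sketch puts outlook at the root, with humidity under the sunny branch and wind under the rain branch, which agrees with the disjunction of conjunctions given for the sample DT.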
Which attribute is best?
Use Information Gain
First we need to define entropy, which characterizes the impurity
(or lack of order) in an arbitrary collection of examples.
Let S be a set of P positive and N negative examples:
Entropy(S) = - P/(P+N) log2 (P/(P+N)) - N/(P+N) log2 (N/(P+N))

If all examples are +: P/(P+N) = 1, so entropy = 0
If all examples are -: P/(P+N) = 0, so entropy = 0
If ½ of the examples are + and ½ are -, then entropy = 1
(In the first two cases, 0 log2 0 is taken to be 0.)
One interpretation from information theory: entropy is the minimum number of bits
needed to encode the classification of an arbitrary member of S.
General definition of entropy when the target can take c possible values:
Entropy(S) = Sum(i=1 to c) [ -Pi log2 Pi ]
where Pi is the proportion of S that belongs to class i
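A minimal Python sketch of the general definition, assuming the collection S is given as a plain list of class labels (the function name `entropy` and that representation are illustrative choices, not from the notes):

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a collection of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    # Counter only holds classes that actually occur, so 0 * log2(0) never arises.
    for count in Counter(labels).values():
        p = count / total
        result -= p * math.log2(p)
    return result


print(entropy(["+"] * 4))             # all examples in one class: 0.0
print(entropy(["+", "+", "-", "-"]))  # evenly split two classes: 1.0
print(entropy(["a", "b", "c", "d"]))  # four equally likely classes: 2.0
```

With c classes the entropy can be as large as log2 c, which is why the last call gives 2 bits rather than 1.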
Information gain, Gain(S,A), measures the expected reduction in entropy
given sample S and attribute A:
Gain(S,A) = Entropy(S) - Sum(v element-of Values(A)) [ |Sv|/|S| Entropy(Sv) ]
In the above equation, the sum is the expected value of the entropy
after S is partitioned on A; Values(A) is the set of possible values of A;
and Sv is the subset of S for which A has value v.
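The gain formula can be sketched in Python as below. The representation of examples as dicts, the function names, and the four-row toy dataset are all hypothetical, chosen only to make the two extreme cases visible.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(examples, attribute, target):
    """Gain(S, A) for examples given as dicts mapping attribute name -> value."""
    # Partition the labels of S into the subsets Sv, one per observed value v of A.
    subsets = defaultdict(list)
    for ex in examples:
        subsets[ex[attribute]].append(ex[target])
    # Expected entropy after the split: sum over v of |Sv|/|S| * Entropy(Sv).
    expected = sum(len(sv) / len(examples) * entropy(sv)
                   for sv in subsets.values())
    return entropy([ex[target] for ex in examples]) - expected


# Hypothetical toy data: "a" separates the classes perfectly, "b" not at all.
toy = [
    {"a": "x", "b": "p", "label": "+"},
    {"a": "x", "b": "q", "label": "+"},
    {"a": "y", "b": "p", "label": "-"},
    {"a": "y", "b": "q", "label": "-"},
]
print(information_gain(toy, "a", "label"))  # 1.0
print(information_gain(toy, "b", "label"))  # 0.0
```

Splitting on "a" removes all of the 1 bit of entropy in the toy set, while splitting on "b" leaves every subset as mixed as the original.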
(based on training data in table 3.2)
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
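The four values above can be checked from class counts alone. The (positive, negative) counts in this sketch are an assumption: they are what the standard 14-example table (9 positive, 5 negative) would give, and Table 3.2 itself is not reproduced in these notes. The function names are illustrative.

```python
import math


def entropy(pos, neg):
    """Two-class entropy from counts, treating 0 * log2(0) as 0."""
    total = pos + neg
    return -sum((n / total) * math.log2(n / total) for n in (pos, neg) if n)


def gain(splits):
    """Gain from a list of (pos, neg) counts, one pair per attribute value."""
    pos = sum(p for p, _ in splits)
    neg = sum(n for _, n in splits)
    total = pos + neg
    expected = sum((p + n) / total * entropy(p, n) for p, n in splits)
    return entropy(pos, neg) - expected


# Assumed (positive, negative) counts per attribute value.
counts = {
    "Outlook": [(2, 3), (4, 0), (3, 2)],      # sunny, overcast, rain
    "Humidity": [(3, 4), (6, 1)],             # high, normal
    "Wind": [(6, 2), (3, 3)],                 # weak, strong
    "Temperature": [(2, 2), (4, 2), (3, 1)],  # hot, mild, cool
}
for name, splits in counts.items():
    print(f"Gain(S, {name}) = {gain(splits):.3f}")
```

Under these assumed counts the computed gains agree with the listed values to within about 0.001; the small differences in the third decimal place come from rounding rather than truncating.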

Gain(Ssunny, Humidity) = 0.97
Gain(Ssunny, Temperature) = 0.57
Gain(Ssunny, Wind) = 0.019

Start with an empty tree.
Progressively elaborate it in search of a DT that correctly classifies the training data.
The evaluation function that guides this hill-climbing search is the information gain measure.
Important Observations:
· ID3’s hypothesis space of all DTs is complete, in that every finite discrete-valued function can be represented. So some hypothesis in the space will be consistent with the data (if noise-free).
· ID3 maintains a single current hypothesis. Unlike version spaces, it cannot determine how many alternative DTs are consistent with the data.
· ID3 in its pure form performs no backtracking.
· ID3 uses all examples at each level (vs. candidate-elimination, which uses each example only once).
· The advantage of using statistical properties of all the examples is that the resulting search is much less sensitive to errors. ID3 can handle noise by terminating with a hypothesis that does not perfectly classify the data.
· Inductive bias in DT learning: shorter trees are preferred over longer trees. Trees that place high-information-gain attributes close to the root are preferred over those that do not.
· ID3’s bias comes from its search method (preference bias or search bias).
· Candidate-elimination’s bias comes from its incomplete hypothesis space (restriction bias or language bias).
· The game-learning algorithm we saw (checkers) had both kinds of bias:
  - Restriction bias: it could only represent a linear function of the attributes
  - Preference bias: LMS bias from an ordered search through the space, with initial weights all equal
Occam’s Razor (1320): prefer the simplest hypothesis that fits the data
Why? There are fewer short hypotheses than long ones, so a short hypothesis
that fits the data is less likely to do so by coincidence.
- Scientists prefer shorter/simpler theories to longer ones if they both explain the observed data
Issues in DT learning:
- How deep to grow the DT (especially when there is noise)
- Handling continuous attributes
- Handling attributes with missing values
- Improving computational efficiency
C4.5 is an extension of ID3 that handles these issues.
Avoid Overfitting
ID3 grows the tree just deep enough to perfectly classify the training examples.
This is not good when there is noise or when there are not enough training examples.
Given a hypothesis space H, a hypothesis h ∈ H overfits the training data
if there is some alternative h’ ∈ H such that h has smaller error than h’
on the training examples but h’ has smaller error over the entire example space,
i.e. h’ is a better predictor even though it doesn’t fit the training data as well as h.
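Written symbolically, the definition reads as follows. The names error_train (error over the training examples) and error_D (error over the entire example space) are notation assumed here for illustration.

```latex
h \in H \text{ overfits the training data} \iff
\exists\, h' \in H :\;
\mathrm{error}_{\mathrm{train}}(h) < \mathrm{error}_{\mathrm{train}}(h')
\;\text{ and }\;
\mathrm{error}_{\mathcal{D}}(h) > \mathrm{error}_{\mathcal{D}}(h')
```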

It’s easy to see how noise could cause overfitting: the tree is made more complex
in order to classify a noisy example as it is labelled in the data, but this makes
performance on other examples worse.
Even in noise-free data, overfitting can occur when a small number of examples are
associated with a leaf. There may be coincidental regularities that the learner
“notices” but that are unrelated to the target function.
In one study, overfitting was found to reduce accuracy by 10-25%.
In either of these approaches you need a way to pick the “ideal” final tree size. Approaches include: