HTML document prepared by Sean Waters.

Decision Tree Learning

One of the most widely used inductive inference methods.  Provides a method for approximating discrete-valued target functions.

A nice feature of decision trees is that they can be interpreted by humans.

Sample DT:

 

 

  The tree can be expressed as a disjunction of conjunctions:

  (outlook=sunny ^ humidity=normal) v (outlook=overcast) v (outlook=rain ^ wind=weak)
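
To make the correspondence concrete, the disjunction above can be written directly as a Boolean function, with one conjunction per path that ends in a positive leaf. This is only a sketch: the function name `play_tennis` and the lowercase string values are assumptions chosen for the example.

```python
def play_tennis(outlook: str, humidity: str, wind: str) -> bool:
    """Evaluate the sample tree as a disjunction of conjunctions.

    Each conjunction corresponds to one root-to-leaf path that ends in a
    positive leaf. Names and value spellings are illustrative only.
    """
    return (
        (outlook == "sunny" and humidity == "normal")
        or outlook == "overcast"
        or (outlook == "rain" and wind == "weak")
    )


print(play_tennis("sunny", "high", "weak"))       # False
print(play_tennis("overcast", "high", "strong"))  # True
```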

Appropriate Problems for DT Learning

  instances are represented by attribute-value pairs (can be extended to real-valued attributes)

  the target function has discrete output values (can be extended to real-valued output)

  disjunctive descriptions may be required

  training data may contain errors

  training data may contain missing attribute values

classification problem - classify example into one of a discrete set of possible categories

 

Basic Decision Tree Algorithm (Top-down greedy search)

   Pick attribute most useful for classifying examples.

   Put that attribute at the root

   Recursively repeat for each subtree that has both + and – examples

   Terminate a branch when all examples are +, all examples are -, no attributes are left (go with the most common label), or no examples fall into the leaf (go with the most common label at the parent)

 

   ID3 and its successor C4.5 both use this structure.
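
The top-down greedy procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions, not ID3 or C4.5 as published: examples are assumed to be dictionaries, the tree is a nested tuple, and the names `id3`, `gain`, `entropy` and `values` are hypothetical. "Most useful" is measured here by information gain.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain(examples, attr, target):
    """Expected entropy reduction from splitting `examples` on `attr`."""
    remainder = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attr] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy([ex[target] for ex in examples]) - remainder


def id3(examples, attributes, target, values):
    """Grow a tree top-down. Leaves are labels; internal nodes are
    (attribute, {value: subtree}) tuples."""
    labels = [ex[target] for ex in examples]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the examples are pure or no attributes are left.
    if len(set(labels)) == 1 or not attributes:
        return majority
    best = max(attributes, key=lambda a: gain(examples, a, target))
    rest = [a for a in attributes if a != best]
    branches = {}
    for value in values[best]:
        subset = [ex for ex in examples if ex[best] == value]
        # An empty branch takes the most common label at the parent.
        branches[value] = id3(subset, rest, target, values) if subset else majority
    return (best, branches)


# Tiny made-up data set: humidity alone determines the label.
data = [
    {"wind": "weak", "humidity": "high", "play": "no"},
    {"wind": "strong", "humidity": "high", "play": "no"},
    {"wind": "weak", "humidity": "normal", "play": "yes"},
    {"wind": "strong", "humidity": "normal", "play": "yes"},
]
values = {"wind": ["weak", "strong"], "humidity": ["high", "normal"]}
print(id3(data, ["wind", "humidity"], "play", values))
# ('humidity', {'high': 'no', 'normal': 'yes'})
```

Passing the full value list for each attribute (the `values` argument) is a design choice that lets the sketch create a leaf even for attribute values that no example reaches.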

 

Which attribute is best?

   Use Information Gain

      First we need to define entropy, which characterizes the impurity (or lack of order) in an arbitrary collection of examples.

      Let S be set of P positive and N negative examples:

  Entropy(S) = - (P/(P+N)) log2 (P/(P+N)) - (N/(P+N)) log2 (N/(P+N))

 

 

If all exs +: P/(P+N)=1 then entropy = 0

If all exs -: P/(P+N)=0 then entropy = 0

If ½ exs + and ½ exs - then entropy = 1
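
The three cases above can be checked numerically. The sketch below assumes the usual convention that 0 log2 0 is taken to be 0; the function name `binary_entropy` is made up for the example.

```python
import math


def binary_entropy(p: int, n: int) -> float:
    """Entropy in bits of a set with p positive and n negative examples."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:  # skip empty classes: 0 * log2(0) is taken as 0
            frac = count / total
            result -= frac * math.log2(frac)
    return result


print(binary_entropy(14, 0))  # 0.0  (all positive)
print(binary_entropy(0, 14))  # 0.0  (all negative)
print(binary_entropy(7, 7))   # 1.0  (half and half)
```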

 

One interpretation from information theory: entropy is the minimum number of bits needed to encode the classification of an arbitrary member of S.

 

General definition of entropy when the target can take c possible values

  

    Entropy(S) = Sum(i=1 to c) [ -Pi log2 Pi ]

 

where Pi is the proportion of S that belongs to class i
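
A short sketch of the general definition, assuming the examples are given simply as a list of class labels (the function name `entropy` is illustrative). With c equally likely classes the entropy reaches its maximum of log2 c bits, as the four-class example shows.

```python
import math
from collections import Counter


def entropy(labels) -> float:
    """Entropy in bits of a collection of class labels (any number of classes)."""
    total = len(labels)
    # Only classes that actually occur are counted, so there is no log2(0).
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())


print(entropy(["a", "b", "c", "d"]))  # 2.0, i.e. log2(4)
```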

 

Information gain, Gain(S,A), measures the expected reduction in entropy caused by partitioning sample S according to attribute A

 

Gain(S,A) = Entropy(S) – Sum(all v element-of Values(A)) [ |Sv|/|S| Entropy(Sv)]

 

In the above equation the sum is the expected value of the entropy after S is partitioned using A; v element-of Values(A) ranges over the possible values of A, and Sv is the subset of S for which A has value v
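
The gain formula translates almost term by term into code. The sketch below assumes the attribute values and the class labels arrive as two parallel lists, one entry per example; the names `information_gain` and `partitions` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())


def information_gain(values, labels):
    """Gain(S, A), given A's value and the class label of each example in S."""
    partitions = defaultdict(list)  # v -> labels of the examples in Sv
    for v, label in zip(values, labels):
        partitions[v].append(label)
    # Expected entropy after the split: each Sv is weighted by |Sv| / |S|.
    expected = sum(len(sv) / len(labels) * entropy(sv)
                   for sv in partitions.values())
    return entropy(labels) - expected


labels = ["+", "+", "-", "-"]
print(information_gain(["x", "x", "y", "y"], labels))  # 1.0 (perfect split)
print(information_gain(["x", "y", "x", "y"], labels))  # 0.0 (uninformative)
```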

 

An example – PlayTennis

   (based on training data in table 3.2)

  

   Gain(S, outlook) = 0.246

   Gain(S, Humidity) = 0.151

   Gain(S, Wind) = 0.048

   Gain(S, Temperature) = 0.029

 

 

   Gain(Ssunny, Humidity) = 0.970

   Gain(Ssunny, Temperature) = 0.570

   Gain(Ssunny, Wind) = 0.019
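
These figures can be recomputed from the training data. The table itself is not reproduced in these notes, so the sketch below assumes the standard 14-example PlayTennis data from Mitchell's textbook (Table 3.2); the names `DATA`, `ATTRS` and `gain` are made up for the example.

```python
import math
from collections import Counter

# Assumed data: the standard PlayTennis table (not reproduced in the notes).
# Columns: Outlook, Temperature, Humidity, Wind, PlayTennis.
ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]
DATA = [line.split() for line in """
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
""".strip().splitlines()]


def entropy(rows):
    """Entropy in bits of the class label (last column) over `rows`."""
    total = len(rows)
    counts = Counter(row[-1] for row in rows)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def gain(rows, attr):
    """Information gain of splitting `rows` on the named attribute."""
    col = ATTRS.index(attr)
    remainder = 0.0
    for value in {row[col] for row in rows}:
        subset = [row for row in rows if row[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder


for attr in ATTRS:
    print(f"Gain(S, {attr}) = {gain(DATA, attr):.3f}")

# Second level: only the examples that reach the Outlook = Sunny branch.
sunny = [row for row in DATA if row[0] == "Sunny"]
for attr in ["Humidity", "Temperature", "Wind"]:
    print(f"Gain(Ssunny, {attr}) = {gain(sunny, attr):.3f}")
```

Under that assumption the computed gains agree with the values listed above to within a unit in the third decimal place; the small differences come from rounding versus truncation.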

 

Hypothesis Space Search in Decision Tree

   Start with an empty tree

   Progressively elaborate it in search of a DT that correctly classifies the training data

   The evaluation function that guides this hill-climbing search is the information gain measure

 

   Important Observations:

·        ID3’s hypothesis space of all DTs is complete: every finite discrete-valued function can be represented, so some hypothesis in the space is consistent with the data (if noise-free)

·        ID3 maintains a single current hypothesis. Unlike version spaces, it cannot determine how many alternative DTs are consistent with the data

·        ID3 in its pure form performs no backtracking

·        ID3 uses all examples at each level (vs. candidate-elimination, which uses each example only once)

·        Advantage of using statistical property of all examples is the resulting search is much less sensitive to errors.  Can handle noise by terminating with hypothesis that does not perfectly classify the data

·        Inductive Bias in DT learning: shorter trees preferred over longer trees. Trees that place high information gain attributes close to the root are preferred over those that do not.

·        ID3 bias comes from search method (preference bias or search bias)

·        Candidate-Elimination bias comes from incomplete hypothesis space (restriction bias or language bias)

·        Game learning algorithm we saw (checkers) had both kinds of biases

·        Restriction bias – could only represent linear function of attributes

·        Preference bias – LMS bias from ordered search through space with initial weights all equal

Occam’s Razor (1320) – prefer the simplest hypothesis that fits the data

   Why: there are fewer short hypotheses than long ones, so it is less likely that a short hypothesis fits the data by coincidence

-         scientists prefer shorter/simpler theories to longer ones, if they both explain the observed data

 

Practical Issues

-         How deep to grow DT (esp when noise)

-         Handling continuous attributes

-         Handling attributes with missing values

-         Improving computational efficiency

 

  C4.5 is an extension of ID3 that handles these issues

   Avoid Overfitting

     ID3 grows the tree just deep enough to classify the training examples perfectly

    This is not good when there is noise or when there are too few training examples

 

   Given hypothesis space H, a hypothesis h ∈ H overfits the training data if there is some alternative hypothesis h’ ∈ H such that h has smaller error than h’ on the training examples, but h’ has smaller error over the entire example space.

   i.e. h’ is a better predictor even though it doesn’t fit the training data as well as h
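
A toy illustration of this definition, with everything in it made up for the example: the target concept (x > 0), the single noisy training label, and a small noise-free held-out set standing in for the example space. Hypothesis `h` memorizes the training labels, while `h_prime` is the simpler rule.

```python
def error(hypothesis, examples):
    """Fraction of (x, label) examples that `hypothesis` misclassifies."""
    return sum(hypothesis(x) != label for x, label in examples) / len(examples)


# Hypothetical target concept: x is positive exactly when x > 0.
# The last training example, (3, False), is deliberately mislabeled (noise).
train = [(-2, False), (-1, False), (1, True), (2, True), (3, False)]
# Noise-free stand-in for the example space.
held_out = [(x, x > 0) for x in range(-5, 6) if x != 0]

memorized = dict(train)


def h(x):
    """Fits the training set exactly; unseen x falls back to the simple rule."""
    return memorized.get(x, x > 0)


def h_prime(x):
    """Simpler hypothesis that ignores the noisy example."""
    return x > 0


print(error(h, train), error(h_prime, train))        # 0.0 0.2
print(error(h, held_out), error(h_prime, held_out))  # 0.1 0.0
```

Here `h` has the smaller training error but `h_prime` has the smaller error on the held-out set, so `h` overfits in the sense defined above.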

 

 

  It’s easy to see how noise could cause overfitting.  The tree is made more complex in order to classify a noisy example as it is labeled in the data, but this makes performance on other examples worse.

  Overfitting can occur even with noise-free data when a small number of examples are associated with a leaf.  There may be coincidental regularities that the learner “notices” but that are unrelated to the target

 

   One study found that overfitting reduced accuracy by 10-25%

 

Ways to Avoid Overfitting

 

   In either of these approaches you need a way to pick “ideal” final tree size.  Approaches include: