HTML document prepared by Brian Blankstein.

Machine Learning Notes for 02-05-01

Variations of basic decision tree algorithm

Avoiding overfitting:

Pre-pruning: stop growing the DT before it begins overfitting (before it perfectly classifies the training data)
Post-pruning: allow the tree to overfit and then post-prune (two variations)
  1. Reduced-error pruning = prune the tree itself
  2. Rule post-pruning

Other variations:

Incorporating continuous-valued attributes - pick a threshold, c, so you go left if value <= c and right otherwise
ex: Temperature - sort the data by value
  Temp  | PlayTennis
--------------------
   40   |  -
   48   |  -
   60   |  +
   72   |  +
   80   |  +
   90   |  -

The idea is to pick the threshold that maximizes information gain. You can prove that the optimal c must occur at a midpoint between adjacent sorted values where the label changes between - and +. In this example, the candidates are (48+60)/2 = 54 and (80+90)/2 = 85.
These dynamically created boolean attributes can then compete with others for growing the DT. You can extend these ideas to then split again to create additional thresholds.
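
As a rough illustration of the threshold search described above, the following Python sketch finds the candidate midpoints for the Temperature table and scores each by information gain. The function names (entropy, candidate_thresholds, info_gain) are made up for this sketch and are not from the lecture; it also assumes no two examples share the same value with different labels.

```python
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        result -= p * log2(p)
    return result

def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose labels differ."""
    pairs = sorted(zip(values, labels))
    return [(v1 + v2) / 2
            for (v1, l1), (v2, l2) in zip(pairs, pairs[1:])
            if l1 != l2 and v1 != v2]

def info_gain(values, labels, c):
    """Information gain of the boolean test value <= c."""
    left = [l for v, l in zip(values, labels) if v <= c]
    right = [l for v, l in zip(values, labels) if v > c]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

# Data from the Temperature table above.
temps = [40, 48, 60, 72, 80, 90]
play = ["-", "-", "+", "+", "+", "-"]

cands = candidate_thresholds(temps, play)
best = max(cands, key=lambda c: info_gain(temps, play, c))
for c in cands:
    print(c, round(info_gain(temps, play, c), 3))
print("best threshold:", best)
```

On this data the two candidates are 54 and 85, and the split at 54 has the higher gain (about 0.459 bits versus about 0.191).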


Alternative ways to pick attributes
Information gain is biased to favor attributes with lots of values. For example, consider an attribute Date (and suppose each example has a different value). Then picking Date would maximize information gain, but would not be useful in classifying new examples. So, we want to modify the measure to reduce the bias towards attributes with many possible values.

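SplitInfo(S,A) = -sum[i=1 to c] of (|Si|/|S| * log2(|Si|/|S|))
   where c is the number of values of attribute A and Si is the subset of S for which A takes its i-th value
GainRatio(S,A) = InfoGain(S,A)/SplitInfo(S,A)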
Use GainRatio in place of InfoGain: dividing by SplitInfo creates a penalty for attributes with lots of values. There are still issues. For example, if SplitInfo is very small (or even 0), then GainRatio becomes very large (or undefined). Lots of variations and alternatives have been studied. Based on experiments, the method of pruning has a bigger impact on final accuracy than the method used to select attributes.
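
To make the penalty concrete, here is a small Python sketch of SplitInfo and GainRatio as defined above. The six-example data set, the attribute names Date and Mild, and the choice to return 0 when SplitInfo is 0 are all assumptions made for illustration, not part of the lecture.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def info_gain(attr_values, labels):
    """InfoGain(S,A): entropy of S minus the weighted entropy of each Si."""
    groups = defaultdict(list)
    for v, l in zip(attr_values, labels):
        groups[v].append(l)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def split_info(attr_values):
    """SplitInfo(S,A): entropy of S with respect to the values of A."""
    n = len(attr_values)
    return -sum((k / n) * log2(k / n) for k in Counter(attr_values).values())

def gain_ratio(attr_values, labels):
    """GainRatio(S,A) = InfoGain(S,A) / SplitInfo(S,A)."""
    si = split_info(attr_values)
    if si == 0:
        # Assumption for this sketch: A has a single value, so its gain is 0 anyway.
        return 0.0
    return info_gain(attr_values, labels) / si

# Made-up data: Date is unique per example, Mild is a boolean attribute.
labels = ["-", "-", "+", "+", "+", "-"]
date = ["d1", "d2", "d3", "d4", "d5", "d6"]
mild = [False, False, True, True, True, False]

print(info_gain(date, labels), info_gain(mild, labels))
print(round(gain_ratio(date, labels), 3), gain_ratio(mild, labels))
```

Both attributes separate this training set perfectly, so InfoGain cannot tell them apart (1 bit each). GainRatio divides Date's gain by log2(6), roughly 2.585, and so prefers the boolean attribute.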

Handling Missing Attributes
For a medical application, one attribute may be BloodTestResult, and for some patients its value may be missing. A good solution is to compute the fraction of examples (with known values) that follow each branch at a node and, from that, estimate the probability that each branch is taken. Use the resulting fractional examples in computing InfoGain and in computing the final output.

In the above example, use history to classify the last example: there is a 5/8 chance of it being -, and a 3/8 chance of it being +. You can either use these probabilities, or just go with - because it is the most probable output.
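
The fractional-example idea can be sketched in Python as follows. This is only an illustration under assumptions: the data is made up, None marks a missing value, and the helper names branch_fractions and distribute are hypothetical.

```python
from collections import Counter, defaultdict

def branch_fractions(values):
    """Fraction of the examples with a known value that take each branch."""
    known = [v for v in values if v is not None]
    counts = Counter(known)
    return {v: n / len(known) for v, n in counts.items()}

def distribute(values, labels):
    """Weighted class counts per branch.

    An example with a known value counts 1.0 in its own branch; an example
    with a missing value is split across all branches by branch_fractions.
    """
    fractions = branch_fractions(values)
    weights = defaultdict(lambda: defaultdict(float))
    for value, label in zip(values, labels):
        if value is None:
            for branch, frac in fractions.items():
                weights[branch][label] += frac
        else:
            weights[value][label] += 1.0
    return {b: dict(w) for b, w in weights.items()}

# Made-up data: the last patient has no BloodTestResult.
blood_test = ["pos", "pos", "neg", "neg", "neg", None]
diagnosis = ["+", "+", "-", "-", "+", "+"]

fractions = branch_fractions(blood_test)
weights = distribute(blood_test, diagnosis)
print(fractions)
print(weights)
```

Here 2/5 of the known examples go down the pos branch and 3/5 down the neg branch, so the last patient contributes 0.4 of an example to pos and 0.6 to neg. These weighted counts can stand in for integer counts when computing entropy and InfoGain.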

Handling attributes with differing costs
In the medical domain, you may use a DT to help in diagnosis and not compute an attribute value until it is needed. While certain tests may be very valuable, they may have a higher cost (monetary, risk/discomfort for the patient, ...). We want to associate a cost with each attribute. The simplest approach is to use InfoGain(S,A)/Cost(A) in picking which attribute to use. Other variations studied include InfoGain^2/Cost and (2^InfoGain - 1)/([Cost+1]^w), where w is a constant between 0 and 1 that determines the relative importance of cost versus info gain.
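
The three cost-sensitive measures above are easy to compare side by side. In this Python sketch the attribute names, gains, and costs are invented purely for illustration, and the function names are hypothetical.

```python
def gain_per_cost(gain, cost):
    """InfoGain / Cost."""
    return gain / cost

def gain_squared_per_cost(gain, cost):
    """InfoGain^2 / Cost."""
    return gain ** 2 / cost

def discounted_gain(gain, cost, w):
    """(2^InfoGain - 1) / (Cost + 1)^w, with w between 0 and 1."""
    return (2 ** gain - 1) / (cost + 1) ** w

# Made-up (InfoGain, Cost) pairs: a cheap weak test and a costly strong test.
attributes = {"Temperature": (0.3, 1.0), "BloodTest": (0.6, 10.0)}

def pick(score):
    """Name of the attribute with the highest score."""
    return max(attributes, key=lambda name: score(*attributes[name]))

print(pick(gain_per_cost))
print(pick(gain_squared_per_cost))
print(pick(lambda g, c: discounted_gain(g, c, 0.0)))  # w = 0: cost is ignored
print(pick(lambda g, c: discounted_gain(g, c, 1.0)))  # w = 1: cost counts fully
```

With these numbers, w = 0 selects the more informative BloodTest, while w = 1 and both ratio measures select the cheaper Temperature.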