HTML document prepared by Brian Blankstein.
Machine Learning Notes for 02-05-01
Variations of basic decision tree algorithm
Avoiding overfitting:
Pre-pruning: stop growing the decision tree (DT) before it begins overfitting (before it perfectly classifies the training data)
- stop growing when the information gain is less than some fixed constant (E)
- use chi-squared test to test for significance
- apply MDL (minimum description length) - minimize size(tree) + size(misclassifications(tree))...only grow if description length is reduced
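The first pre-pruning test above (stop when the information gain drops below a fixed constant E) can be sketched as follows. This is only a minimal illustration: the function names and the default value of epsilon are assumptions, not part of the notes.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def info_gain(labels, partition):
    """Information gain of splitting `labels` into the sublists in `partition`."""
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in partition)
    return entropy(labels) - remainder

def should_split(labels, partition, epsilon=0.05):
    """Pre-pruning test: keep growing only if the gain reaches the constant E.

    `epsilon` plays the role of E; 0.05 is an arbitrary illustrative value.
    """
    return info_gain(labels, partition) >= epsilon
```

A split that separates the classes perfectly passes the test, while a split that leaves the class mix unchanged has zero gain and is rejected.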
Post-pruning: allow tree to overfit and then post-prune (two variations)
- Reduced-error pruning: prune the tree itself
- consider nodes in post-order traversal of the decision tree
- remove a node only if the resulting pruned tree performs no worse than the original tree on validation set
- replace the removed subtree by a leaf node labeled with the majority class of the training examples that reach that node
- Drawback - if the amount of data is limited, then holding some of it out for a validation set leaves even less data for training
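A minimal sketch of reduced-error pruning as described above. The Node class, its fields, and the helper names are hypothetical. Each node is assumed to already store the majority class of the training examples that reach it, so pruning a node just means dropping its children.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    attribute: str = None                          # attribute tested here (None for a leaf)
    children: dict = field(default_factory=dict)   # attribute value -> child Node
    label: str = None                              # majority training class at this node

def classify(node, example):
    """Follow the example down the tree; fall back to the current label if stuck."""
    while node.children:
        child = node.children.get(example.get(node.attribute))
        if child is None:
            break
        node = child
    return node.label

def accuracy(root, data):
    """Fraction of (example, label) pairs that the tree classifies correctly."""
    return sum(classify(root, x) == y for x, y in data) / len(data)

def reduced_error_prune(root, validation):
    """Visit nodes in post-order; turn a node into a leaf if validation accuracy is no worse."""
    def visit(node):
        for child in node.children.values():
            visit(child)
        if not node.children:
            return
        before = accuracy(root, validation)
        saved = node.children
        node.children = {}                  # tentatively replace the subtree by a leaf
        if accuracy(root, validation) < before:
            node.children = saved           # pruning hurt, so undo it
    visit(root)
    return root
```

Recomputing accuracy on the whole validation set for every node is wasteful but keeps the sketch short.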
- Rule post-pruning
- let T be the DT learned by basic method
- convert T to a set of rules - one for each leaf, i.e. one per root-to-leaf path (e.g. "if (outlook==sunny)^(humidity==high) then playtennis = no")
- then independently consider each rule
- consider removing (individually) each precondition (attribute test) of the rule
- e.g. for the rule above, consider removing (outlook==sunny), (humidity==high), and (outlook==sunny)^(humidity==high)
- remove the one whose removal gives the best improvement in accuracy
- a validation set or a statistical measure can be used to estimate accuracy
- repeat until removing a precondition gives no further improvement
- then, sort the pruned rules by their estimated accuracy and consider them from most to least accurate when classifying subsequent examples
- advantages of converting to rules before pruning:
- allows different pruning decisions to be made for each path
- removes the distinction between attribute tests near the root and those near the leaves
- improves readability
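The greedy pruning of a single rule can be sketched as below. This assumes accuracy is estimated on a validation set, that a rule is represented as a dict of preconditions plus a conclusion, and that a precondition is dropped only on strict improvement (as in the notes). All names are hypothetical.

```python
def rule_accuracy(preconds, conclusion, data):
    """Accuracy of the rule on the examples it covers (0.0 if it covers none)."""
    covered = [y for x, y in data
               if all(x.get(a) == v for a, v in preconds.items())]
    if not covered:
        return 0.0
    return sum(y == conclusion for y in covered) / len(covered)

def prune_rule(preconds, conclusion, data):
    """Repeatedly drop the precondition whose removal improves accuracy the most."""
    preconds = dict(preconds)
    best = rule_accuracy(preconds, conclusion, data)
    while preconds:
        candidates = []
        for a in preconds:
            trial = {k: v for k, v in preconds.items() if k != a}
            candidates.append((rule_accuracy(trial, conclusion, data), a))
        acc, drop = max(candidates)
        if acc <= best:          # no strict improvement: stop
            break
        best = acc
        del preconds[drop]
    return preconds, best
```

After every rule has been pruned this way, the rules can be sorted by the returned accuracy estimate (highest first) and tried in that order when classifying new examples.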
Other variations:
Incorporating continuous valued attributes - pick a threshold, c, so you go left if value <=c and right otherwise
ex: Temperature - sort data by value
Temp | PlayTennis
-----+-----------
40 | -
48 | -
60 | +
72 | +
80 | +
90 | -
The idea is to pick a threshold temperature that maximizes information gain. You can prove that the optimal c must occur at a midpoint between two adjacent sorted values where the label changes (a transition between - and +). In this example, this is either (48+60)/2 = 54 or (80+90)/2 = 85.
These dynamically created boolean attributes can then compete with others for growing the DT. You can extend these ideas to then split again to create additional thresholds.
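As a rough sketch of the threshold search on the Temperature example above: the candidates are the midpoints where the label changes, and the one with the highest information gain wins. The function names are hypothetical, and the sketch assumes the attribute values are distinct.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose labels differ."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(pairs, pairs[1:]) if la != lb]

def threshold_gain(values, labels, c):
    """Information gain of the boolean test value <= c."""
    left = [y for v, y in zip(values, labels) if v <= c]
    right = [y for v, y in zip(values, labels) if v > c]
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - remainder

def best_threshold(values, labels):
    """Candidate threshold with the largest information gain."""
    return max(candidate_thresholds(values, labels),
               key=lambda c: threshold_gain(values, labels, c))

temps = [40, 48, 60, 72, 80, 90]
play = ['-', '-', '+', '+', '+', '-']
print(candidate_thresholds(temps, play))  # [54.0, 85.0]
print(best_threshold(temps, play))        # 54.0
```

On this data the split at 54 has a gain of about 0.459 bits versus about 0.191 bits for the split at 85, so Temperature <= 54 would be the boolean attribute that competes with the others.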

Alternative ways to pick attributes
Information gain is biased to favor attributes with lots of values. For example, consider an attribute Date (and suppose each example has a different value). Then picking Date would maximize information gain, but would not be useful in classifying new examples. So, we want to modify the measure to reduce the bias towards attributes with many possible values.
SplitInfo(S,A) = -sum[i=1 to c] of (|Si|/|S| * log2(|Si|/|S|)), where S1...Sc are the subsets of S produced by the c values of attribute A
GainRatio(S,A) = InfoGain(S,A)/SplitInfo(S,A)
Use GainRatio instead of InfoGain: dividing by SplitInfo creates a penalty for attributes with lots of values. There are still issues. For example, if SplitInfo is very small (or even 0) then GainRatio becomes very large or undefined. Lots of variations and alternatives have been studied. Based on experiments, the method of pruning has a bigger impact on final accuracy than the method used to select attributes.
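A small sketch of the two formulas, showing how GainRatio penalizes a Date-like attribute that takes a different value on every example. The function names are hypothetical, and returning 0.0 when SplitInfo is 0 is just a convention of this sketch, not something the notes prescribe.

```python
import math
from collections import Counter, defaultdict

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete items."""
    n = len(items)
    if n == 0:
        return 0.0
    return -sum((k / n) * math.log2(k / n) for k in Counter(items).values())

def info_gain(values, labels):
    """InfoGain(S,A): entropy of the labels minus the weighted entropy after splitting on A."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

def split_info(values):
    """SplitInfo(S,A): entropy of S with respect to the values of A."""
    return entropy(values)

def gain_ratio(values, labels):
    """GainRatio(S,A) = InfoGain(S,A) / SplitInfo(S,A)."""
    si = split_info(values)
    if si == 0:
        return 0.0  # sketch convention: A takes a single value, so the split is useless
    return info_gain(values, labels) / si

labels = ['+', '+', '-', '-']
date = ['d1', 'd2', 'd3', 'd4']      # a different value on every example
windy = ['a', 'a', 'b', 'b']         # two values that separate the classes
print(info_gain(date, labels), gain_ratio(date, labels))    # 1.0 0.5
print(info_gain(windy, labels), gain_ratio(windy, labels))  # 1.0 1.0
```

Both attributes have the same information gain here, but the gain ratio of the four-valued attribute is halved because its SplitInfo is log2(4) = 2.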
Handling Missing Attributes
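For a medical application, one attribute may be BloodTestResult, and for some patients this may be missing. A good solution is to compute, at the node that tests the attribute, the fraction of examples with a known value that go down each branch, and from that estimate the probability that each branch is taken. An example with a missing value is then split into fractional examples, one per branch, and these fractional examples are used in computing InfoGain and in computing the final output.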

In the above example, use the history to classify the last example: there is a 5/8 chance of it being - and a 3/8 chance of it being +. You can either use these probabilities, or just go with - because it is the most probable output.
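The fractional-example idea can be sketched as follows. This is only an illustration under assumed conventions: each example is a hypothetical (attributes, label, weight) triple, a missing value is represented by an absent key or None, and the branch probabilities are estimated from the weights of the examples whose value is known.

```python
from collections import defaultdict

def branch_probabilities(examples, attribute):
    """Estimate P(branch) from the weighted examples whose value is known."""
    totals = defaultdict(float)
    for x, _, w in examples:
        value = x.get(attribute)
        if value is not None:
            totals[value] += w
    known = sum(totals.values())
    return {v: t / known for v, t in totals.items()}

def distribute(examples, attribute):
    """Split weighted examples on `attribute`.

    An example with a missing value is sent down every branch, with its
    weight multiplied by the estimated probability of that branch.
    """
    probs = branch_probabilities(examples, attribute)
    branches = {v: [] for v in probs}
    for x, y, w in examples:
        value = x.get(attribute)
        if value is None:
            for v, p in probs.items():
                branches[v].append((x, y, w * p))
        else:
            branches[value].append((x, y, w))
    return branches
```

The weights in each branch can then stand in for example counts when computing entropy and InfoGain further down the tree.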
Handling attributes with differing costs
In the medical domain, you may use a DT to help in diagnosis and not compute an attribute value until it is needed. While certain tests may be very valuable, they may have a higher cost (monetary, risk/discomfort for the patient...). We want to take the cost of each attribute into account when selecting attributes. The simplest approach is to use InfoGain(S,A)/Cost(A) in picking which attribute to use. Other variations studied include InfoGain^2/Cost and (2^InfoGain - 1)/([Cost+1]^w), where w is a constant between 0 and 1 that determines the relative importance of cost versus info gain.
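The three cost-sensitive measures above are simple enough to write down directly. In this sketch the function names, the attribute names, and the gain and cost numbers are all made up for illustration.

```python
def gain_per_cost(gain, cost):
    """InfoGain(S,A) / Cost(A)."""
    return gain / cost

def gain_squared_per_cost(gain, cost):
    """InfoGain^2 / Cost."""
    return gain ** 2 / cost

def exp_gain_per_cost(gain, cost, w):
    """(2^InfoGain - 1) / (Cost + 1)^w, with w between 0 and 1."""
    return (2 ** gain - 1) / (cost + 1) ** w

# Hypothetical attributes: name -> (information gain, cost)
attributes = {"Temperature": (0.3, 1.0), "BloodTestResult": (0.9, 10.0)}

def pick(measure):
    """Attribute that maximizes the given measure of (gain, cost)."""
    return max(attributes, key=lambda a: measure(*attributes[a]))

print(pick(gain_per_cost))                               # Temperature
print(pick(lambda g, c: exp_gain_per_cost(g, c, w=0)))   # BloodTestResult
print(pick(lambda g, c: exp_gain_per_cost(g, c, w=1)))   # Temperature
```

With w = 0 the cost is ignored and the more informative but expensive test wins, while with w = 1 the cost dominates and the cheap attribute is preferred.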