If you are doing a 20 or 40 point problem, be sure to attach the appropriate cover sheet and review the guidelines given there and in the course information handout.
If you are interested in doing a group project, talk to Dr. Goldman.
Due on Wednesday, February 7th.
If you have selected this option you should (even before decision trees are discussed in class) be sure that you can compile and run this program. To compile it use make or gmake. Once you have done this you should find an executable called dt. To run it type
dt [-s ...]
If you do not provide a random seed, the system clock will be used. Each of train %, prune %, and test % is a real number between 0 and 1, and their sum must be at most 1.0. They specify, respectively, the fraction of the data that will be used for training, pruning, and testing. To be sure it is working correctly, try running
dt 1.0 0.0 0.0 play-tennis.ssv
When you do this you should get the decision tree shown in Figure 3.1 of the textbook. You can go this far on this HW option before decision trees are discussed in class.
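The constraint on the three fractions can be sketched in C as follows. This is only an illustration: split_counts and SplitCounts are hypothetical names that do not come from the provided source, and truncating toward zero when converting a fraction to a count is an assumption about rounding, not a description of what dt actually does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, NOT part of the provided dt source. */
typedef struct {
    size_t train;  /* examples used to grow the tree */
    size_t prune;  /* examples held out for pruning  */
    size_t test;   /* examples held out for testing  */
} SplitCounts;

/* Returns 1 if f is a valid fraction in [0, 1], 0 otherwise. */
static int valid_fraction(double f)
{
    return f >= 0.0 && f <= 1.0;
}

/*
 * Checks that each fraction lies in [0, 1] and that the three sum to at
 * most 1.0, then converts them to example counts for a data set of n
 * examples. Returns 0 on success and -1 if the fractions are invalid.
 * ASSUMPTION: counts are truncated toward zero; dt may round differently.
 */
int split_counts(size_t n, double train, double prune, double test,
                 SplitCounts *out)
{
    const double eps = 1e-9;  /* tolerance for floating-point sums */

    assert(out != NULL);
    if (!valid_fraction(train) || !valid_fraction(prune) ||
        !valid_fraction(test)) {
        return -1;
    }
    if (train + prune + test > 1.0 + eps) {
        return -1;
    }
    out->train = (size_t)(train * (double)n);
    out->prune = (size_t)(prune * (double)n);
    out->test  = (size_t)(test * (double)n);
    return 0;
}
```

With fractions 1.0, 0.0 and 0.0, every example goes to training and none are held out for pruning or testing, which is why that run reproduces the tree learned from the full data set.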
There are three parts to this problem. If you do only the first part it is worth 10 points. If you do the first two parts, then you can submit it as a 20 point homework option. Finally, if you do all three parts then it is a 40 point homework option. If you just do the first part then no cover sheet is needed. If you pick the 20 point or 40 point option then the appropriate cover sheet and write-up is expected.
NOTE: If you are doing the 40 point option, you may submit your final report as late as Monday, February 12th, without a late penalty. If you select this option, you must still submit something on February 7th that shows that you have completed the experiments for Parts I and II. This can be a handwritten rough draft of the thoughts on those parts that you will use in your report. It will not be graded; it is used only to confirm that you completed those parts by the deadline.
Try running the decision tree learner with a randomly chosen half of the examples for training and half for testing. What are the training and test accuracies? What conclusion(s) can you draw from this? Determine whether each of the following is true or false, assuming that the target concept is the one described by the decision tree in Figure 3.1, that all examples you add must be consistent with the target concept, and that no post-pruning is used. Explain how you reached your answers and convince us of them.
Here you will use the voting data. The first attribute of each example describes the political party of the representative, and the remaining attributes indicate his/her yes/no/absent vote on each bill considered by Congress. You will use this data to learn a decision tree that predicts the political party of a representative based on his/her votes.
Use the voting data to build a decision tree to predict the political party. Use 25% of the members of Congress for training and the rest for testing (so there is no pruning). Try this several times and study the impact of different random splits. Report the sizes and accuracies of these trees over 6 distinct runs.
Now measure the impact of training set size on the accuracy and size of the learned tree (no pruning, and 30% of the data for testing). Consider training set sizes in the range of 0-40% (include at least the values .02, .1, .2, .3 and .4 for training fractions). Because of the high variance due to random splits you should repeat each experiment with at least 10 different random seeds. Include in your report two plots showing how accuracy varies with training set size, and how the tree size varies with training set size. Also report the maximum, minimum and average accuracies and tree sizes for each training set size.
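One way to compute the maximum, minimum and average over the repeated runs is sketched below. The names Summary and summarize are hypothetical and are not part of the provided code. The sketch assumes you have already collected one value per random seed (an accuracy or a tree size) into an array, for example by copying it from the output of each run.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type, NOT part of the provided dt source. */
typedef struct {
    double min;
    double max;
    double mean;
} Summary;

/*
 * Computes the minimum, maximum and arithmetic mean of n measurements,
 * e.g. the test accuracies obtained with n different random seeds for
 * one training set size. Requires n > 0.
 */
Summary summarize(const double *values, size_t n)
{
    Summary s;
    double sum = 0.0;

    assert(values != NULL && n > 0);
    s.min = values[0];
    s.max = values[0];
    for (size_t i = 0; i < n; i++) {
        if (values[i] < s.min) {
            s.min = values[i];
        }
        if (values[i] > s.max) {
            s.max = values[i];
        }
        sum += values[i];
    }
    s.mean = sum / (double)n;
    return s;
}
```

Calling summarize once per training fraction, separately for the accuracies and for the tree sizes, gives the three numbers requested for each training set size; the means are also natural values to plot.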
Next measure the impact of pruning on accuracy and size of the learned trees, under the same conditions as above. Use the same training set sizes, but this time use 30% of the data for pruning and 30% for testing. What is the impact, if any, of post-pruning the tree?
The code as provided uses information gain to select attributes while growing the tree. Modify it to instead select attributes at random, and study the effect of this change by repeating all of the experiments you have done so far. To make this change, look at the function MaxGainAttribute() in entropy.c. Replace it with a method that randomly selects an attribute. From your experiments, what can you conclude about the impact of learning with randomly selected attributes?
Finally, explore one more non-trivial topic of your own choosing. For example, develop a different method for selecting attributes that you think will work well and report on it. Or you could implement a rule post-pruning method, add it to the source code, and report how its performance compares to that of reduced error pruning.
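A minimal sketch of random attribute selection is given below. The actual signature of MaxGainAttribute() in entropy.c is not shown in this handout, so the function name RandomAttribute, its parameters, and the idea that the still-available attributes arrive as an array of indices are all assumptions made for illustration; adapt them to whatever the real function receives and returns.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for MaxGainAttribute(); the real signature in
 * entropy.c may differ. Given the indices of the attributes that are
 * still available at the current node, returns one of them chosen at
 * random instead of the one with the highest information gain.
 */
int RandomAttribute(const int *candidates, int num_candidates)
{
    assert(candidates != NULL && num_candidates > 0);
    /* rand() % n is slightly biased when n does not divide RAND_MAX + 1,
       which is negligible for a handful of attributes. */
    return candidates[rand() % num_candidates];
}
```

The sketch relies on rand() from the C standard library, so the generator should be seeded once with srand() before the tree is grown. Whether the provided code already does this when it handles the random seed is something to check in the source rather than assume.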