CS 527A, Spring 2002, Homework 2


For this homework you must pick a set of problems worth 100 points. Unless you choose to do all three parts of Problem 6, you are required to do at least one part of a problem from each group: those related to Chapter 2 and those related to Chapter 3. For those of you who don't like choices, a good combination is #2, #5, and parts I and II of #6.

This homework is due on Wednesday, February 6th, with the standard late policy applying.

A signed cover sheet for Homework 2 must be submitted with your homework.


    Problems Related to Chapter 2


  1. Recall that FindS is an algorithm that uses as its hypothesis space the set of all conjunctions. Namely, the hypothesis is defined by selecting a subset of the attributes (called the relevant attributes) and giving a specific desired value for each relevant attribute. An example is positive if and only if every relevant attribute has the desired value. One example conjunctive hypothesis is
    (a2 == warm) AND (a4 == weak) AND (a5==warm)
    
    The inductive bias of FindS is to output a most specific hypothesis in the version space.

    Note that if you modify your example from Part B, you get an example illustrating that with a conjunctive hypothesis the cardinality of G (the most general boundary of the version space) can be exponential in the number of attributes.
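
    As a concrete reading of the conjunctive hypothesis representation described in Problem 1, the sketch below stores a hypothesis as a mapping from each relevant attribute to its desired value and classifies an example as positive only if every relevant attribute matches. The dictionary representation, the function name classify, and the attribute values other than those in the example hypothesis are assumptions made for illustration; this is not the FindS algorithm itself.

```python
# Hypothetical representation: a conjunctive hypothesis is a dict mapping
# each relevant attribute name to its desired value.
def classify(hypothesis, example):
    """Return True (positive) iff every relevant attribute has the desired value."""
    return all(example.get(attr) == value for attr, value in hypothesis.items())

# The example hypothesis from the text.
h = {"a2": "warm", "a4": "weak", "a5": "warm"}

# Illustrative examples (attribute values are made up for this sketch).
x_pos = {"a1": "sunny", "a2": "warm", "a3": "high", "a4": "weak", "a5": "warm"}
x_neg = {"a1": "sunny", "a2": "cold", "a3": "high", "a4": "weak", "a5": "warm"}

print(classify(h, x_pos))  # True: a2, a4 and a5 all match
print(classify(h, x_neg))  # False: a2 is cold, not warm
```

    Note that the empty hypothesis (no relevant attributes) classifies every example as positive under this representation.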

  2. (30 points) Consider a hypothesis space that is a disjunction of constraints over a set of n attributes (exactly as in Part B of the last problem). Propose an algorithm that accepts a sequence of training examples and outputs a consistent hypothesis if one exists. You are welcome to select any inductive bias that you would like. Please just explicitly state the inductive bias you select. Your algorithm should run in time that is polynomial in n and in the number of training examples. Be sure to clearly describe the algorithm, argue that it always outputs a consistent hypothesis and that it runs in the desired time.

  3. (20 points) Prove that given an unbiased hypothesis space H (i.e. one in which each of the 2^|X| possible ways to classify the examples in the example space X is in H), the learner would find that each unobserved instance would be predicted correctly by exactly half of the current members of the version space. That is, prove that for any instance space X, any set of training examples D, and any instance x in X not present in D, if H is the power set of X, then exactly half the hypotheses in the version space defined by H and D will classify x as positive and half will classify it as negative.
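
    Before attempting the proof in Problem 3, it can help to check the claim by brute force on a toy instance space. The sketch below enumerates every subset of X as a hypothesis, keeps those consistent with D, and counts how many label each unobserved instance positive. The instance names, the training set D, and the function name version_space are assumptions for illustration; a numerical check like this is not a substitute for the proof.

```python
from itertools import combinations

def version_space(X, D):
    """All hypotheses (sets of instances labeled positive) consistent with D.

    D maps each observed instance to its label (True = positive).
    """
    hypotheses = []
    for r in range(len(X) + 1):
        for subset in combinations(X, r):
            h = frozenset(subset)
            if all((x in h) == label for x, label in D.items()):
                hypotheses.append(h)
    return hypotheses

X = ["x1", "x2", "x3", "x4"]   # toy instance space (assumption)
D = {"x1": True, "x2": False}  # toy training data (assumption)

vs = version_space(X, D)
for x in X:
    if x not in D:
        positives = sum(1 for h in vs if x in h)
        # Prints the instance, then how many hypotheses say + and how many say -.
        print(x, positives, len(vs) - positives)
```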

  4. (50 points) Implement the FIND-S algorithm and verify that it successfully produces the trace in Section 2.4 of the text for the EnjoySport example. Now use this program to study the number of training examples required to exactly learn the target concept following the guidelines given in Exercise 2.10 of the text.


    Problems Related to Chapter 3


  5. (20 points). Using the basic decision tree algorithm (using information gain to select the attribute and no pruning) construct the decision tree to represent the boolean formula
    (X1 AND X2) OR (X3 AND X4).  
    In other words, use as your training data all 16 entries of the truth table, with those examples in which the formula is true labeled as positive (+) and those examples in which the formula is false labeled as negative (-). Be sure to give the decision tree created by the learning algorithm rather than just one you could create by hand that would properly represent the function. To make the grading easier, if there is a tie in the information gain, break ties by using the attribute with the smallest index.

    You can do this problem by hand, or by using a program. Just be sure to show your work. If you decide to do this by hand you might want to write a small program that computes the information gain for each attribute given a set of examples.
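
    A minimal sketch of the kind of helper program suggested above is given here: it computes the entropy of a set of labels and the information gain of splitting on one attribute, then applies it to the full 16-row truth table. The function names and the tuple-based data layout are assumptions for illustration. It reports the gain only for the first split; building the tree still requires repeating the computation on each subset and applying the tie-breaking rule.

```python
import math
from collections import Counter
from itertools import product

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, labels, attr):
    """Entropy reduction from splitting `examples` on attribute index `attr`."""
    total = len(examples)
    remainder = 0.0
    for value in sorted(set(ex[attr] for ex in examples)):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attr] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Training data: all 16 rows of the truth table for (X1 AND X2) OR (X3 AND X4).
examples = list(product([0, 1], repeat=4))
labels = [bool((x1 and x2) or (x3 and x4)) for x1, x2, x3, x4 in examples]

for attr in range(4):
    gain = information_gain(examples, labels, attr)
    print(f"X{attr + 1}: gain = {gain:.4f}")
```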

  6. (30, 50, or 100 points) This assignment gives you an opportunity to experiment with a decision tree learning program. You can use the PlayTennis learning data as well as a larger set of data describing the voting records of congressmen/women from the U.S. House of Representatives. The code is all contained in dt-code.zip. This code is designed for the gcc C++ compiler. On CEC for Unix you can access this by using pkgadd sc_5.0, and for NT users, from the Start menu go to "Programming", then to "Microsoft Visual Studio 6.0", then to "Microsoft Visual C++ 6.0".

    If you are just doing the first two parts (and using CEC) you should not need to compile the code. The executable is already included in the zip file. However, if you are doing the third part or using your own computer, compile the code with make or gmake. On CEC you can run gmake by typing /pkg/gnu/bin/make. Once you have done this you should find an executable called dt. To run it type

    dt [-s ]    
    
    If you do not provide a random seed then the system clock will be used. Each of train %, prune %, and test % are real numbers between 0 and 1 with the specification that their sum is at most 1.0. They specify, respectively, the fraction of the data that will be used for training, pruning (i.e. the validation data) and testing. To be sure it is working correctly, try doing
    dt 1.0 0.0 0.0 play-tennis.ssv
    
    When you do this you should get the decision tree shown in Figure 3.1 of the textbook.
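
    To illustrate what the three fractions mean, the sketch below shuffles a list of examples with an optional seed and cuts it into training, pruning, and testing portions. It is a hypothetical illustration and not the code in dt-code.zip: the function name split_data, the rounding (truncation toward zero), and the handling of leftover rows are all assumptions, and dt may behave differently.

```python
import random

def split_data(rows, train_frac, prune_frac, test_frac, seed=None):
    """Shuffle `rows` and cut it into (train, prune, test) lists.

    Each fraction must lie in [0, 1] and their sum must be at most 1.0;
    any leftover rows are left unused. With seed=None the generator is
    seeded by Python itself (an assumption standing in for the clock).
    """
    fracs = (train_frac, prune_frac, test_frac)
    if any(f < 0.0 or f > 1.0 for f in fracs) or sum(fracs) > 1.0 + 1e-9:
        raise ValueError("fractions must be in [0, 1] and sum to at most 1.0")
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(train_frac * n)  # truncation is an assumption
    n_prune = int(prune_frac * n)
    n_test = int(test_frac * n)
    train = shuffled[:n_train]
    prune = shuffled[n_train:n_train + n_prune]
    test = shuffled[n_train + n_prune:n_train + n_prune + n_test]
    return train, prune, test

rows = list(range(14))  # stand-in for the 14 PlayTennis examples
train, prune, test = split_data(rows, 0.5, 0.25, 0.25, seed=1)
print(len(train), len(prune), len(test))  # 7 3 3
```

    With fractions 1.0, 0.0, 0.0, as in the check above, every example lands in the training portion and the other two are empty.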

    There are three parts to this problem. If you do only the first part it is worth 30 points. If you do the first two parts, then it is worth 50 points, and if you do all three parts it is worth 100 points.


  7. CHOOSE YOUR OWN ADVENTURE. You can propose any additional homework options (or variations of those given above) to Dr. Goldman. If approved, a point value will be assigned. If you would like to work further with the Othello game player, that would fit well into this option.