Decision Trees

Introduction

Some facts about decision trees:

• They represent data-classification models.

• An internal node of the tree represents a question about a feature-vector attribute, and the answer to that question dictates which child node is queried next.

• Each leaf node represents a potential classification of the feature vector.

• Alternatively (from a logical perspective), decision trees allow classes of elements to be represented as logical disjunctions of conjunctions of attribute values.

Here are some characteristics of data universes that admit good decision-tree models.

• Instances are represented by feature vectors whose attributes are preferably discrete.

• Instances are discretely classified.

• A disjunctive description is a reasonable means of representing a class of elements.

• Errors may exist, such as missing attribute values or erroneous classification of some training vectors.

Entropy of a collection of classified feature vectors. Given a set of training vectors S, suppose that the vectors are classified into c different classes, and that p_i represents the proportion of vectors in S that belong to the i-th class. Then the classification entropy of S is defined as

H(S) = − Σ_{i=1}^{c} p_i log₂ p_i.

Example 1. Calculate the classification entropy for the following set of feature vectors.

Weight   Color    Texture   Classification
medium   orange   smooth    orange
heavy    green    smooth    melon
medium   green    smooth    apple
light    red      bumpy     berry
medium   orange   bumpy     orange
light    red      bumpy     berry
heavy    green    rough     melon
medium   red      smooth    apple
heavy    yellow   smooth    melon
medium   yellow   smooth    orange
medium   red      smooth    apple
medium   green    smooth    apple
medium   orange   rough     orange
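As a concrete illustration of the definition, here is a minimal Python sketch that computes H(S) for a labeled collection such as the table above. The tuple-based data layout, with the class label stored last, is an assumption made purely for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Classification entropy H(S) = -sum_i p_i * log2(p_i) over the class proportions."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())

# The fruit table from Example 1, with the class label in the last position.
fruits = [
    ("medium", "orange", "smooth", "orange"),
    ("heavy",  "green",  "smooth", "melon"),
    ("medium", "green",  "smooth", "apple"),
    ("light",  "red",    "bumpy",  "berry"),
    ("medium", "orange", "bumpy",  "orange"),
    ("light",  "red",    "bumpy",  "berry"),
    ("heavy",  "green",  "rough",  "melon"),
    ("medium", "red",    "smooth", "apple"),
    ("heavy",  "yellow", "smooth", "melon"),
    ("medium", "yellow", "smooth", "orange"),
    ("medium", "red",    "smooth", "apple"),
    ("medium", "green",  "smooth", "apple"),
    ("medium", "orange", "rough",  "orange"),
]

print(entropy([v[-1] for v in fruits]))  # H(S) for the 13 vectors above
```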

Quinlan’s ID3. At each phase of the construction of the decision tree, and for each branch of the tree under construction, the attribute A that is considered next is the one which

1. has yet to be considered along that branch; and

2. minimizes the conditional classification entropy H(S|A). Indeed, for a given feature/attribute A,

H(S|A) = Σ_{a ∈ A} (|S_a| / |S|) · H(S_a),

where S is the set of training vectors that reach the current branch under construction, and, for all a ∈ A, S_a represents the set of feature vectors v ∈ S such that v_A = a. Clearly, the smaller H(S|A), the less classification information remains in the vectors of S once they are divided according to their A-attribute.
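Continuing the sketch given after Example 1 (and reusing its entropy helper and fruits list, which are assumptions of that sketch), conditional classification entropy and information gain can be computed as follows.

```python
def conditional_entropy(vectors, attr_index):
    """H(S|A): weighted average of H(S_a) over the subsets S_a obtained by splitting on A."""
    groups = {}
    for v in vectors:
        groups.setdefault(v[attr_index], []).append(v[-1])   # group class labels by attribute value
    total = len(vectors)
    return sum(len(labels) / total * entropy(labels) for labels in groups.values())

def information_gain(vectors, attr_index):
    """IG(S, A) = H(S) - H(S|A)."""
    return entropy([v[-1] for v in vectors]) - conditional_entropy(vectors, attr_index)

# ID3's greedy choice at the root: the attribute (column 0, 1, or 2 of the fruit table)
# with minimal conditional entropy, equivalently maximal information gain.
best_attribute = min(range(3), key=lambda i: conditional_entropy(fruits, i))
```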

Example 2. Using the table of feature vectors from Example 1 and the concept of conditional classification entropy, construct a decision tree for classifying fruit as either apple, orange, melon, or berry.
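A recursive sketch of the ID3 construction, building on the helpers above, is given below; the dictionary-based tree representation, the attribute-index encoding, and the majority-vote handling of exhausted attributes are illustrative choices rather than anything prescribed by these notes.

```python
def id3(vectors, attributes):
    """Grow a decision tree greedily, stopping when a subset is pure or no attributes remain."""
    labels = [v[-1] for v in vectors]
    if len(set(labels)) == 1:
        return labels[0]                                  # pure leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]       # majority-class leaf
    best = min(attributes, key=lambda i: conditional_entropy(vectors, i))
    remaining = [i for i in attributes if i != best]
    branches = {}
    for v in vectors:
        branches.setdefault(v[best], []).append(v)
    return {"attribute": best,
            "children": {value: id3(subset, remaining) for value, subset in branches.items()}}

tree = id3(fruits, [0, 1, 2])   # attribute indices: weight = 0, color = 1, texture = 2
```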

Split Information. Let A be an attribute, considered here as a discrete set of possible values. Then the split information relative to A and a set of feature vectors S is defined as

SI(S, A) = − Σ_{a ∈ A} (|S_a| / |S|) log₂ (|S_a| / |S|),

where S_a represents the set of feature vectors v ∈ S such that v_A = a.

For attributes that take on many values, using the gain ratio IG(S, A)/SI(S, A) instead of IG(S, A) can help avoid favoring many-valued attributes that may not perform well in classifying the data. Here IG(S, A) is defined as H(S) − H(S|A).
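Under the same representation, split information and the gain ratio can be added to the earlier sketch as follows (this reuses the Counter and log2 imports and the information_gain helper, and assumes the attribute takes at least two values in S so that SI is nonzero).

```python
def split_information(vectors, attr_index):
    """SI(S, A) = -sum over a of (|S_a|/|S|) * log2(|S_a|/|S|)."""
    counts = Counter(v[attr_index] for v in vectors)
    total = len(vectors)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def gain_ratio(vectors, attr_index):
    """IG(S, A) / SI(S, A): penalizes attributes that shatter S into many small subsets."""
    return information_gain(vectors, attr_index) / split_information(vectors, attr_index)
```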

Example 3. Suppose attribute A has 8 possible values. If selecting A for the next node of a decision tree yields IG(S, A) = 2 bits of information, then compute the gain ratio IG(S, A)/SI(S, A).

Classifier Selection and the Bias-Variance Tradeoff

The hypothesis space of decision trees is complete in the sense that, for every set S of training vectors, there exists a decision tree T_S which correctly classifies all of the training vectors. Moreover, that tree can be constructed quite easily, assuming discrete features and a finite number of classes. So why bother with ID3 when there is already a tree that will correctly classify all the training data? The problem, of course, is that this tree does not inform us about how to classify any of the data that is not part of the training set. For those cases let us suppose that (assuming two classes that are equally probable) the tree is designed so that non-training vectors are classified by the toss of a coin. Such a classifier, which correctly classifies all training vectors and randomly classifies non-training vectors, represents an extreme example of an unbiased classifier, in that it makes no assumptions about correlations between a vector’s class and its attribute values. In practice, being unbiased implies a lack of learning. For example, if you touch a very hot baking dish just removed from the oven, chances are you will think twice about touching it again the next time you see it on the counter. In other words, your next encounter with that dish on the counter will be biased by the past encounter.

Another quality that T_S suffers from is high variance in how a vector is classified based on the training set. When a vector is in the training set, it is correctly classified, but in all other cases it receives a random classification whose variance grows with the number of possible classes. Ideally, a good learning algorithm should keep the variance low, meaning that the classification of a vector does not change much from training set to training set. For example, given a large enough basket of fruit to learn from, we would expect that our concept of an orange (e.g. medium-sized, orange-colored, and smooth) would not change from basket to basket. In this case our learning algorithm would display low variance.

It should also be noted that attempting to minimize variance can sometimes lead to an increase in bias, which in turn may increase the overall classification error. For example, suppose we are biased towards classifying medium-sized, orange-colored, smooth fruit as oranges. Doing so may cause the misclassification of some smaller orange-complexioned grapefruits. In other words, by increasing bias for the sake of reducing variance, we sometimes make errors on the “exceptions to the rule”.

The following mathematical derivation suggests that the ideal learning algorithm is one that strikes an optimal balance when attempting to reduce both bias and variance.

For a given vector x, let P(c|x) denote the classification associated with x. Let γ be a classifier, and let γ(x) denote the class that γ assigns to x. Then the mean squared error of γ, denoted mse(γ), is defined as

mse(γ) = E_x[γ(x) − P(c|x)]²,

where the expectation is taken over a probability distribution over the data universe X. Now let Γ denote a learning algorithm, and Γ_D denote the particular classifier that is derived by the algorithm upon input of a randomly drawn training-data sample D. Assume that all training samples have a fixed size, and that they are obtained by independent sampling from the distribution over X. Define the learning error of Γ, denoted learning-error(Γ), as

learning-error(Γ) = E_D[mse(Γ_D)] = E_D E_x[Γ_D(x) − P(c|x)]² = E_x E_D[Γ_D(x) − P(c|x)]²,

where the last equality is a change in the order of summation. Now the inner expectation can be simplified using the following claim.

Claim. E[x − k]² = (Ex − k)² + E[x − Ex]², where Ex denotes the expectation of x, and k is some constant. The term (Ex − k)² is called the bias term. Here, we are thinking of k as representing a desired target value that x is attempting to attain, while Ex denotes the average of what x actually attains in practice. Finally, E[x − Ex]² is the definition of the variance of the random variable x.

Proof of Claim. By linearity of expectation,

E[x − k]² = Ex² − 2kEx + k²

= [(Ex)² − 2kEx + k²] + [Ex² − 2(Ex)² + (Ex)²]

= (Ex − k)² + E[x − Ex]².

Applying the claim to the inner expectation E_D[Γ_D(x) − P(c|x)]², with x held fixed and k = P(c|x), we get

learning-error(Γ) = E_x[bias(Γ, x) + variance(Γ, x)],

where

bias(Γ, x) = (E_D Γ_D(x) − P(c|x))²  and  variance(Γ, x) = E_D[Γ_D(x) − E_D Γ_D(x)]².

Example 4. Suppose X = {1, 2, 3, 4} and that 1 and 2 are in class 0, while 3 and 4 are in class +1. Suppose that |D| = 2 (one training vector from each class) and that our learning algorithm uses a nearest neighbor algorithm, in that the resulting classifier classifies a number based on which training point it is nearest to, breaking ties by tossing a coin. Compute the learning error for this nearest neighbor algorithm.
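The learning error in Example 4 can be computed by exhaustive enumeration. The sketch below assumes, for illustration, that x is drawn uniformly from X and that each of the four possible training samples is equally likely; the coin toss for ties is averaged over explicitly.

```python
from itertools import product

# Example 4 setup: X = {1, 2, 3, 4}; 1 and 2 belong to class 0, 3 and 4 to class 1.
X = [1, 2, 3, 4]
target = {1: 0, 2: 0, 3: 1, 4: 1}

# Assumption for illustration: x and the training sample D are drawn uniformly.
# D holds one point from each class, so there are 2 * 2 = 4 possible samples.
training_sets = list(product([1, 2], [3, 4]))

def prediction_distribution(D, x):
    """Distribution of Gamma_D(x) for the 1-NN classifier, with ties broken by a fair coin."""
    d0, d1 = abs(x - D[0]), abs(x - D[1])
    if d0 < d1:
        return {0: 1.0}            # nearest training point is the class-0 point
    if d1 < d0:
        return {1: 1.0}            # nearest training point is the class-1 point
    return {0: 0.5, 1: 0.5}        # tie: classify by a coin toss

def learning_error():
    """E_x E_D [Gamma_D(x) - P(c|x)]^2, averaging over x, D, and the coin toss."""
    total = 0.0
    for x in X:
        for D in training_sets:
            total += sum(p * (pred - target[x]) ** 2
                         for pred, p in prediction_distribution(D, x).items())
    return total / (len(X) * len(training_sets))

print(learning_error())
```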

In light of the bias-variance tradeoff, we see that the ID3 algorithm attempts to reduce the length of branches in the decision tree. This has the effect of reducing variance, at the expense of increasing bias. To further this goal, the ID3 algorithm is usually followed by a rule-pruning phase in which one attempts to shorten the rules derived from the tree.

Rule post-pruning steps.

1. develop the decision tree without any concern towards overfitting

2. convert tree into an equivalent set of rules

3. prune each rule by removing any preconditions whose removal improves its estimated accuracy (a sketch of this step follows the list)

4. sort rules by their estimated accuracy and use them in this sequence
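Below is a minimal sketch of step 3. It assumes that each rule is a pair (preconditions, class), where a precondition is an (attribute index, value) test, and that accuracy is estimated on a held-out validation set of labeled vectors; the notes do not specify how accuracy should be estimated, so that choice is an assumption made here.

```python
def rule_matches(preconditions, vector):
    """A rule fires on a vector when every (attribute index, value) precondition holds."""
    return all(vector[i] == value for i, value in preconditions)

def rule_accuracy(preconditions, predicted_class, validation):
    """Fraction of matching validation vectors (class label last) that the rule labels correctly."""
    matched = [v for v in validation if rule_matches(preconditions, v)]
    if not matched:
        return 0.0
    return sum(v[-1] == predicted_class for v in matched) / len(matched)

def prune_rule(rule, validation):
    """Repeatedly drop any single precondition whose removal improves estimated accuracy."""
    preconditions, predicted_class = list(rule[0]), rule[1]
    improved = True
    while improved and preconditions:
        improved = False
        for p in list(preconditions):
            candidate = [q for q in preconditions if q != p]
            if rule_accuracy(candidate, predicted_class, validation) > \
               rule_accuracy(preconditions, predicted_class, validation):
                preconditions = candidate
                improved = True
                break
    return preconditions, predicted_class
```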

Exercises

1. Draw a minimum-sized decision tree for the three-input XOR function which produces a 1 iff an odd number of the inputs evaluate to one.

2. Provide decision trees to represent the following Boolean functions: A and not B, A or (B and C), A xor B, (A and B) or (C and D).

3. Consider the following set of training examples:

Instance   Classification   a1   a2
1          +                T    T
2          +                T    T
3          -                T    F
4          +                F    F
5          -                F    T
6          -                F    T

Calculate the entropy of the collection with respect to the classification, and determine which of the two attributes provides the most information gain.

4. Repeat Example 2, but instead use the measure IG(S, A)/SI(S, A) to calculate the attribute to use at a given node of the tree.

5. Create a decision tree using the ID3 Algorithm for the following table of data.

Vector   A1   A2   A3   Class
v1       1    0    0    0
v2       1    0    1    0
v3       0    1    0    0
v4       1    1    1    1
v5       1    1    0    1

6. Suppose X = {1, 2, 3, 4, 5, 6} and that 1, 2, 3 are in class 0, while 4, 5, 6 are in class 1. Suppose that each training set S has |S| = 2 (one training vector from each class) and that the learning algorithm Γ is again the nearest neighbor algorithm (see Example 4). Compute learning-error(Γ). Also, compute bias(Γ, 1) and variance(Γ, 1). Hint: there are only nine possible training sets to consider.
