Decision Trees Algorithms
Alon and Omri

Problem description

The problem we are facing is deciding an unknown outcome given a set of parameters. For example: decide whether to wait for a table at a restaurant, based on certain attributes such as wait time, price, etc.

Problem Modeling

A decision tree is a classifier, h: X -> Y, that predicts the label associated with an instance x ∈ X by traveling from the root to a leaf. Each instance x ∈ X has d features, and each feature i ∈ [d] ranges between 0 and j_i. Formally, X = {0, 1, ..., j_1} × ... × {0, 1, ..., j_d}. Y represents the label feature and ranges between 0 and some constant k. Each instance x ∈ X has a true label y_x ∈ Y.

The goal of a decision tree is to accurately predict an instance's label. Formally, we wish to minimize the generalization error, defined as

err_D(h) = \Pr_{(x,y) \sim \mathcal{D}}[h(x) \neq y],

where D is a probability distribution over X × Y.

Each node in the decision tree is either a label node (in case the node is a leaf) or has the form "x_i = ?" for some i ∈ [d] (in case the node is an internal node), and each arc coming out of such a node is a number between 0 and j_i. For any x ∈ X, h(x) is determined by the following method:

1. Set node to be the root of the tree.
2. Repeat until node is a leaf:
   2.1. If the current node is of the form "x_i = ?", check x_i, travel on the corresponding arc, and set node to be the corresponding child.
3. Return the label on the reached leaf.

Ideally, we wish to have a learning algorithm which, given a training set S of samples {(x_1, y_1), ..., (x_m, y_m)}, outputs a decision tree whose generalization error err_D(h) is as small as possible. Unfortunately, it turns out that even designing a decision tree that minimizes the empirical error err_S(h), where

err_S(h) = \frac{|\{i : h(x_i) \neq y_i\}|}{m},

is NP-complete. Consequently, practical decision-tree learning algorithms are based on heuristics such as a greedy approach, where locally optimal decisions are made at each node. Such algorithms cannot guarantee returning the globally optimal decision tree, but they tend to work reasonably well in practice.

The ID3 (Iterative Dichotomiser 3) / C4.5 algorithms

Each decision tree algorithm must define a strategy for splitting the tree according to the given features. When splitting a given node, we wish to choose an attribute that maximizes the information we gain for labeling the instances. The ID3 algorithm uses the "Information Gain" measure. We denote by H(S) the entropy of a set of examples S with respect to the label feature. Formally,

H(S) = -\sum_{j=0}^{k} \frac{|\{i \in S : y_i = j\}|}{|S|} \log_2 \frac{|\{i \in S : y_i = j\}|}{|S|}.

The information gain of attribute i with respect to a set of examples S is defined as:

IG(S, i) = H(S) - \sum_{j=0}^{j_i} \frac{|\{x \in S : x_i = j\}|}{|S|} \cdot H(\{x \in S : x_i = j\}).

We complete the algorithm description by reviewing its pseudo-code.

ID3(S, A ⊆ [d])
1. If all examples in S have the same label i ∈ {0, ..., k}, return the single-node tree Root, with label = i.
2. If A is empty, return the single-node tree Root, with label = the most common value of the label attribute in the examples.
3. Let i ∈ argmax_{j ∈ A} IG(S, j).
   3.1. For each possible value v ∈ {0, ..., j_i}:
      3.1.1. Let T_v be the tree returned by ID3({(x, y) ∈ S : x_i = v}, A \ {i}).
   3.2. Return a tree whose root is "x_i = ?" and whose children are {T_0, ..., T_{j_i}}. The arc connecting to each child T_v is labeled v.

Note that the algorithm is recursive. The initial call is ID3(S, [d]).
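To make the procedure above concrete, here is a minimal Java sketch of the root-to-leaf prediction walk and of the entropy and information-gain computations that drive ID3's greedy split choice. All names in it (TreeSketch, Node, predict, entropy, informationGain, bestAttribute) are illustrative assumptions for this sketch only; they are not the classes or methods of the project's ID3.java described later.

import java.util.ArrayList;
import java.util.List;

class TreeSketch {

    // A node is either a leaf holding a label, or an internal node of the form
    // "x_i = ?" with one child per possible value 0..j_i.
    static class Node {
        int label = -1;       // meaningful only for leaves
        int attribute = -1;   // meaningful only for internal nodes
        Node[] children;      // children[v] is the child reached on arc v

        boolean isLeaf() { return children == null; }
    }

    // h(x): travel from the root to a leaf, following the arc selected by x at each node.
    static int predict(Node root, int[] x) {
        Node node = root;
        while (!node.isLeaf()) {
            node = node.children[x[node.attribute]];
        }
        return node.label;
    }

    // Entropy H(S) of the labels of S; labels range over {0, ..., k}.
    static double entropy(List<Integer> labels, int k) {
        int m = labels.size();
        double h = 0.0;
        for (int j = 0; j <= k; j++) {
            final int lab = j;
            long c = labels.stream().filter(y -> y == lab).count();
            if (c > 0) {
                double p = (double) c / m;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    // Information gain IG(S, i) of attribute i, whose values range over {0, ..., ji}.
    static double informationGain(List<int[]> xs, List<Integer> ys, int k, int i, int ji) {
        double gain = entropy(ys, k);
        int m = xs.size();
        for (int v = 0; v <= ji; v++) {
            List<Integer> subsetLabels = new ArrayList<>();
            for (int s = 0; s < m; s++) {
                if (xs.get(s)[i] == v) {
                    subsetLabels.add(ys.get(s));
                }
            }
            if (!subsetLabels.isEmpty()) {
                gain -= ((double) subsetLabels.size() / m) * entropy(subsetLabels, k);
            }
        }
        return gain;
    }

    // The greedy choice made at each node: the attribute in A that maximizes IG(S, i).
    static int bestAttribute(List<int[]> xs, List<Integer> ys, int k,
                             List<Integer> attributes, int[] ranges) {
        int best = attributes.get(0);
        double bestGain = Double.NEGATIVE_INFINITY;
        for (int i : attributes) {
            double g = informationGain(xs, ys, k, i, ranges[i]);
            if (g > bestGain) {
                bestGain = g;
                best = i;
            }
        }
        return best;
    }
}

The ID3 recursion itself then follows the pseudo-code directly: split S by the value of the chosen attribute, recurse on each part with the attribute removed from A, and attach the returned subtrees as children.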
The C4.5 algorithm has the same structure but uses the information gain ratio measure instead of the original information gain in order to split the tree. The information gain ratio of a set S and an attribute i is defined as:

IGR(S, i) = \frac{IG(S, i)}{IV(S, i)},

where

IV(S, i) = -\sum_{j=0}^{j_i} \frac{|\{i \in S : x_i = j\}|}{|S|} \log_2 \frac{|\{i \in S : x_i = j\}|}{|S|}.

(A small code sketch of this measure appears at the end of the pruning discussion below.)

Pruning

We start by citing a result that was proved in the course Introduction to Machine Learning. We have proved that for any δ ∈ (0, 1), with probability at least 1 − δ over the choice of a training set of size m (drawn i.i.d.), it holds for any decision tree h with n nodes that

err_D(h) ≤ err_S(h) + \sqrt{\frac{n \log(d+1) + \log(2/\delta)}{2m}}.

We see that this bound expresses a trade-off: on the one hand, we expect larger, more complex decision trees to have a small err_S(h), but the respective value of n will be large. On the other hand, small decision trees will have a small value of n, but err_S(h) might be larger. Our hope is that we can find a decision tree with both a low empirical error err_S(h) and a number of nodes n that is not too high. Our bound indicates that such a tree will have a low generalization error err_D(h). This is sometimes referred to as the minimum description length principle, which is closely related to Occam's razor principle [1]: in a nutshell, a short explanation (that is, a hypothesis with a short description) tends to be more valid than a long explanation.

[1] http://en.wikipedia.org/wiki/Occam's_razor

The algorithm described above still suffers from a big problem: the returned tree will usually be very large. Such trees may have low empirical error, but their generalization error will tend to be high, both according to the theoretical analysis above and in practice. A common solution is to prune the tree after (or during, as we shall see later) it is built, hoping to reduce it to a much smaller tree that still has a similar empirical error. Theoretically, according to the bound we have stated, if we can make n much smaller without increasing err_S(h) too much, we are likely to get a decision tree with a better generalization error.

Usually, post-pruning is performed by a bottom-up walk on the tree. Each node might be replaced with one of its subtrees or with a leaf, based on some bound or estimate of the generalization error. Another approach is to prune the tree while it is being built: stop splitting nodes once a pre-defined criterion has been reached. This approach is sometimes referred to as "pre-pruning". We used the following pruning methods in our project:
- Pre-pruning, where the stopping criterion is that the information gain is too low (i.e., the information gain falls below a predefined threshold).
- Post-pruning whose estimate of the generalization error is based on the minimum description length bound shown above.
- Post-pruning whose estimate of the generalization error is based on a hold-out set.
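For completeness, here is a hedged sketch of the C4.5 gain-ratio measure defined at the start of this section, written in the same style as the earlier sketch and reusing its hypothetical informationGain helper. Here intrinsicValue plays the role of IV(S, i); again, these are not the names used in ID3.java.

    // Sketch of the C4.5 split measure IGR(S, i) = IG(S, i) / IV(S, i).
    // Assumes the hypothetical informationGain helper from the previous sketch.
    static double intrinsicValue(List<int[]> xs, int i, int ji) {
        int m = xs.size();
        double iv = 0.0;
        for (int v = 0; v <= ji; v++) {
            final int val = v;
            long c = xs.stream().filter(x -> x[i] == val).count();
            if (c > 0) {
                double p = (double) c / m;
                iv -= p * (Math.log(p) / Math.log(2));  // IV is the entropy of the split itself
            }
        }
        return iv;
    }

    static double gainRatio(List<int[]> xs, List<Integer> ys, int k, int i, int ji) {
        double iv = intrinsicValue(xs, i, ji);
        // An attribute that takes a single value on S carries no split information; skip it.
        return iv == 0.0 ? 0.0 : informationGain(xs, ys, k, i, ji) / iv;
    }

Because IV(S, i) grows with the number of values attribute i can take, the gain ratio penalizes many-valued attributes; Example 1 below shows this effect.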
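As an illustration of how the minimum description length bound can drive post-pruning decisions, the following sketch simply evaluates the bound stated above and compares the tree before and after a candidate pruning step. The method names (mdlBound, shouldPrune) are hypothetical, the logarithm base follows the bound as stated, and this is a sketch of the idea, not the project's pruning code.

    // err_D(h) <= err_S(h) + sqrt((n*log(d+1) + log(2/delta)) / (2m)), for a tree with
    // n nodes, d features, m training samples, and confidence parameter delta.
    static double mdlBound(double empiricalError, int n, int d, int m, double delta) {
        double penalty = Math.sqrt((n * Math.log(d + 1) + Math.log(2.0 / delta)) / (2.0 * m));
        return empiricalError + penalty;
    }

    // Prune a candidate node (replace it with a leaf) when the bound on the pruned tree
    // is no worse than the bound on the current tree: n drops, err_S may rise slightly.
    static boolean shouldPrune(double errBefore, int nodesBefore,
                               double errAfter, int nodesAfter,
                               int d, int m, double delta) {
        return mdlBound(errAfter, nodesAfter, d, m, delta)
                <= mdlBound(errBefore, nodesBefore, d, m, delta);
    }

A bottom-up pruning walk can call shouldPrune at every internal node, replacing the node with a majority-label leaf whenever the call returns true.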
Implementation

All the algorithms described above are implemented in a file called ID3.java. This file contains the following classes:
- ID3 – responsible for building the tree, given a training set.
- Attribute – represents a specific feature and lists its possible values.
- Sample – represents a sample x ∈ X. Contains a list with the values of x's features and its label.
- Node – represents a node in the ID3 tree. Contains the node's entropy, its label if it is a leaf, a list of already-used attributes, and a list of samples that still have to be examined.

In order to view a tree, one can use the function PrintTree in the ID3 class to get an if...else...then representation of it. We also implemented a toString() version (i.e., printing id3.toString() to the console, where id3 is an instance of the class ID3), which prints a description of the tree that can be turned into an actual graph using the following tool: http://www2.research.att.com/~john/Grappa/grappa1_2.html

Evaluation and Discussion

Example 1

As a first step, we ran our algorithm on the small dataset from the theoretical exercise. The results we got were, surprisingly enough, identical to the results we got in the exercise. When running the dataset with ID3 we got the following tree: [ID3 tree figure]. When running with C4.5 we got: [C4.5 tree figure]. We see that when running ID3 (information gain), the data was first split by the attribute "Weather". When running C4.5 (gain ratio), "Has-Friend" became the first attribute. This is not surprising, because the attribute "Weather" has 3 possible values while "Has-Friend" has only 2, and the gain ratio penalizes attributes with many values. We can even see that the attribute "Mood", which has 3 possible values, was dropped from the tree.

Example 2

We obtained a database from the following website: http://www.cis.temple.edu/~ingargio/cis587/readings/id3-c45.html. The database is taken from the U.S. Congressional Quarterly Almanac and contains information about voters, the way they voted on key issues, and whether they are Republicans or Democrats. The algorithm's purpose is to determine whether a voter is a Democrat or a Republican depending on the way they voted on key issues. When running the ID3 algorithm on the database without pruning we got the following tree: [tree figure]. Needless to say, this is a huge tree (it has 67 nodes), and it achieves a rather high error rate on the test set (0.0592). When running the algorithm with pre-pruning (a cut-off of 0.2 information gain) we got the following tree: [tree figure]. This tree has a much better error rate on the test set (0.0296).