Decision Tree Models
Pekka Malo, Associate Prof. (Statistics)
Aalto BIZ / Department of Information and Service Management

Learning objectives for this week
• Understand basic concepts in decision tree modeling
• Understand how decision trees can be used to solve classification problems
• Understand the risk of model over-fitting and the need to control it via pruning
• Be able to evaluate the performance of a classification model using training, validation and test datasets
• Be able to apply a decision tree algorithm to different classification problems

Classification and decision making
Typical classification questions:
• Should we grant a loan or not?
• How would our customer respond to a marketing offer?
• Do we suspect an insurance fraud?
• Spam or ham?

Classification trees (decision trees)
• Approximate discrete-valued target functions through tree structures
• Popular tools for classifying instances
  - Each internal node specifies a test of an attribute
  - Each branch of a node corresponds to a possible value of that attribute
• Benefits
  - Often good performance (classification accuracy)
  - Easy to understand
  - Can be represented as simple rules
  - Show the importance of variables
  - Easy to build (recursive partitioning)

Would Claudio default?
Claudio = (Employed = No, Balance = 115K, Age < 45)
Example write-off tree:
• Employed = Yes → Class: Not Write-off
• Employed = No → test Balance
  - Balance < 50K → Class: Not Write-off
  - Balance >= 50K → test Age
    - Age < 45 → Class: Not Write-off
    - Age >= 45 → Class: Write-off
Following the tree for Claudio: Employed = No → Balance >= 50K → Age < 45 → Class: Not Write-off.

Survival chances on the Titanic
[Figure: classification tree for Titanic passengers. Each leaf shows the probability of survival and the proportion of observations belonging to that leaf. sibsp = number of spouses or siblings aboard.]

Decision tree and partitioning of the instance space
[Figure: a tree that splits first on Balance (at 50K) and then on Age (at 50 and 45) partitions the (Balance, Age) plane into axis-parallel rectangles; each leaf/region is labelled Write-off or No Write-off together with its class probability (12/12, 4/7, 2/3, 10/10).]

[Figure 9.2 from Hastie, Tibshirani & Friedman, The Elements of Statistical Learning: Partitions and CART. The top right panel shows a partition of a two-dimensional feature space by recursive binary splitting, as used in CART, applied to some fake data. The top left panel shows a general partition that cannot be obtained from recursive binary splitting. The bottom left panel shows the tree corresponding to the partition in the top right panel, and a perspective plot of the prediction surface appears in the bottom right panel.]
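Because each path from the root to a leaf is a conjunction of attribute tests, the write-off tree shown earlier can be read directly as a set of nested if-then rules. Below is a minimal sketch in Python; the function name and the concrete age value used for Claudio are only illustrative (the slides state only that his age is below 45).

```python
# A minimal sketch of the example write-off tree expressed as nested rules.
# Field names (employed, balance, age) are illustrative, not from the slides.

def classify_write_off(employed: bool, balance: float, age: int) -> str:
    """Classify a loan applicant with the example tree: Employed -> Balance -> Age."""
    if employed:
        return "Not Write-off"   # Employed = Yes leaf
    if balance < 50_000:
        return "Not Write-off"   # Employed = No, Balance < 50K leaf
    if age < 45:
        return "Not Write-off"   # Employed = No, Balance >= 50K, Age < 45 leaf
    return "Write-off"           # Employed = No, Balance >= 50K, Age >= 45 leaf

# Claudio = (Employed = No, Balance = 115K, Age < 45); age 40 is a hypothetical value below 45
print(classify_write_off(employed=False, balance=115_000, age=40))  # -> "Not Write-off"
```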
Example: Good or Evil?

Training data:

Name       Gender  Super-strength  Mask  Cape  Tie  Bald  Pointy ears  Smokes  Class
J. Bond    male    No              No    No    Yes  No    No           Yes     good
Batman     male    No              Yes   Yes   No   No    Yes          No      good
Superman   male    Yes             No    Yes   No   No    No           No      good
Penguin    male    No              No    No    Yes  Yes   No           Yes     bad
Joker      male    No              No    No    No   No    No           No      bad
Moriarty   male    No              No    No    Yes  No    No           No      bad
Supergirl  female  Yes             No    Yes   No   No    No           No      good
Catwoman   female  No              Yes   No    No   No    Yes          No      bad

Test data:

Name       Gender  Super-strength  Mask  Cape  Tie  Bald  Pointy ears  Smokes  Class
Holmes     male    No              No    No    Yes  No    No           Yes     good
Batgirl    female  No              Yes   No    No   No    Yes          No      good

Source note: example adapted/modified from Prof. R. Schapire.

Building decision trees
• Choose a variable and a threshold for splitting → a splitting rule
• Divide the data into subsets

Root node: {J. Bond, Batman, Superman, Penguin, Joker, Moriarty, Supergirl, Catwoman}
Split on Super-strength:
- Yes → {Superman, Supergirl}
- No → {J. Bond, Batman, Penguin, Joker, Moriarty, Catwoman}

Building decision trees
• Continue splitting recursively
• Stop when nodes are close to "pure"

Within the Super-strength = No subset, split on Cape:
- Yes → {Batman}
- No → {J. Bond, Penguin, Joker, Moriarty, Catwoman}

Challenge: How to choose splits?
• Prefer splitting rules that increase the "purity" of the resulting nodes

Compare two candidate splits of the full training set:
- Super-strength: Yes → {Superman, Supergirl} (all good); No → {J. Bond, Batman, Penguin, Joker, Moriarty, Catwoman} (mixed)
- Tie: Yes → {J. Bond, Penguin, Moriarty} (mixed); No → {Batman, Superman, Joker, Supergirl, Catwoman} (mixed)
The Super-strength split is preferable because it produces one completely pure node.

Measuring node impurity (binary case)
[Figure 9.3 from Hastie, Tibshirani & Friedman, The Elements of Statistical Learning: node impurity measures (misclassification error, Gini index, entropy) for two-class classification, as a function of the proportion p in class 2. Cross-entropy has been scaled to pass through (0.5, 0.5).]

Excerpt from The Elements of Statistical Learning, Section 9.2:
For regression, the squared-error node impurity measure $Q_m(T)$ defined in (9.15) is used, but this is not suitable for classification. In a node $m$, representing a region $R_m$ with $N_m$ observations, let
$$\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k),$$
the proportion of class $k$ observations in node $m$. We classify the observations in node $m$ to class $k(m) = \arg\max_k \hat{p}_{mk}$, the majority class in node $m$. Different measures $Q_m(T)$ of node impurity include the following:
$$\text{Misclassification error: } \frac{1}{N_m} \sum_{i \in R_m} I(y_i \neq k(m)) = 1 - \hat{p}_{m k(m)}$$
$$\text{Gini index: } \sum_{k \neq k'} \hat{p}_{mk}\,\hat{p}_{mk'} = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$$
$$\text{Cross-entropy or deviance: } -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk} \qquad (9.17)$$
For two classes, if $p$ is the proportion in the second class, these three measures are $1 - \max(p, 1-p)$, $2p(1-p)$ and $-p \log p - (1-p)\log(1-p)$, respectively. They are shown in Figure 9.3. All three are similar, but cross-entropy and the Gini index are differentiable, and hence more amenable to numerical optimization. Comparing (9.13) and (9.15), we see that we need to weight the node impurity measures by the numbers $N_{mL}$ and $N_{mR}$ of observations in the two child nodes created by splitting node $m$. In addition, cross-entropy and the Gini index are more sensitive to changes in the node probabilities than the misclassification rate.
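For the two-class case, the three impurity measures reduce to the simple formulas quoted above. The sketch below (function names are illustrative, not from the lecture) computes them for a node and also weights the impurities of two child nodes by their sizes, as the excerpt requires when scoring a candidate split.

```python
import numpy as np

def misclassification(p: float) -> float:
    """Misclassification error for a two-class node, p = proportion in class 2."""
    return 1.0 - max(p, 1.0 - p)

def gini(p: float) -> float:
    """Gini index 2p(1-p) for a two-class node."""
    return 2.0 * p * (1.0 - p)

def entropy(p: float) -> float:
    """Cross-entropy/deviance -p log p - (1-p) log(1-p); taken as 0 at p in {0, 1}."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)

def weighted_split_impurity(counts_left, counts_right, measure=gini) -> float:
    """Impurity of a binary split, weighting each child node by its share of observations.
    counts_* = (class-1 count, class-2 count) in the child node."""
    n_left, n_right = sum(counts_left), sum(counts_right)
    n = n_left + n_right
    p_left = counts_left[1] / n_left
    p_right = counts_right[1] / n_right
    return (n_left / n) * measure(p_left) + (n_right / n) * measure(p_right)
```

For instance, weighted_split_impurity((300, 100), (100, 300)) and weighted_split_impurity((200, 400), (200, 0)) reproduce the comparison discussed next: the two splits tie under the misclassification error but differ under the Gini index and cross-entropy.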
For example, in a two-class problem with 400 observations in each class (denote this by (400, 400)), suppose one split created nodes (300, 100) and (100, 300), while the other created nodes (200, 400) and (200, 0). Both splits produce a misclassification rate of 0.25, but the second split produces a pure node and is probably preferable; both the Gini index and cross-entropy are lower for the second split.

Learning decision trees
• Decision trees provide a very popular and efficient hypothesis space
  - Can represent any Boolean function
  - Deterministic
  - Handle both discrete and continuous variables
• Decision tree learning algorithms can be characterized as
  - Constructive (the tree is built by adding nodes)
  - Eager (analyze the training data and construct an explicit hypothesis)
  - Mostly batch (collect samples, analyze them and output a hypothesis)

Learning decision trees
Basic idea in a nutshell:
1. Find the best initial split (all data at the root node)
2. For each child node
   i. Find the best split for the data subset at that node
   ii. Continue recursively until no more "reasonable" splits are found
3. Prune nodes to avoid overfitting and maximize the generalizability of the model
Objective: find attributes that split the data into groups that are as pure as possible (i.e., homogeneous with respect to the target variable)

Learning decision trees (more formally)
Splits are commonly chosen using purity (diversity) measures:
- Gini (population diversity)
- Entropy (information gain)
- Information gain ratio
- Chi-square test
Source: "Data Mining" by Pedro Domingos

Presemo (http://presemo.aalto.fi/dsfb1): Good split vs. bad split – which attributes are helpful?

Tree size, accuracy and overfit
• Fits the training data quite well
• Too complex?
Tree grown on the full training set:
- Gender = Female → Pointy ears? Yes → BAD; No → GOOD
- Gender = Male → Smokes?
  - Yes → Bald? Yes → BAD; No → GOOD
  - No → Cape? Yes → GOOD; No → BAD
(This tree classifies every training example correctly, but on the test data it misclassifies Batgirl as bad.)

Trees vs. linear models
[Figure 8.7 from James, Witten, Hastie & Tibshirani, An Introduction to Statistical Learning: Top row: a two-dimensional classification example in which the true decision boundary is linear, indicated by the shaded regions. A classical approach that assumes a linear boundary (left) will outperform a decision tree that performs splits parallel to the axes (right). Bottom row: here the true decision boundary is non-linear; a linear model is unable to capture the true decision boundary (left), whereas a decision tree is successful (right).]

Advantages and disadvantages of trees (excerpt from An Introduction to Statistical Learning, Section 8.1.4)
Decision trees for regression and classification have a number of advantages over the more classical approaches seen in Chapters 3 and 4 of that book:
• Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression!
• Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous chapters.
• Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).
• Trees can easily handle qualitative predictors without the need to create dummy variables.
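The points above about tree size, overfitting and pruning can be tried out in practice. The following is a minimal sketch (not from the lecture) using scikit-learn's DecisionTreeClassifier; the dataset and the max_depth grid are only illustrative. Training accuracy keeps improving as the tree grows, while validation accuracy eventually flattens or drops, which is the signal to prune (here by limiting depth); the held-out test set is used only once, for the final performance estimate.

```python
# A minimal sketch of fitting and pruning a classification tree with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold out validation and test sets, as in the learning objectives.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Grow trees of increasing depth; deeper trees fit the training data better
# but may overfit, which shows up as a drop in validation accuracy.
best_depth, best_val = None, -1.0
for depth in (1, 2, 3, 5, 8, None):          # None = grow the full tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    val_acc = tree.score(X_val, y_val)
    print(f"max_depth={depth}: train={train_acc:.3f}, validation={val_acc:.3f}")
    if val_acc > best_val:
        best_depth, best_val = depth, val_acc

# Refit with the depth chosen on the validation set and report test accuracy once.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("test accuracy:", final.score(X_test, y_test))
```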