DECISION TREES
CSCI 452: Data Mining
slides originally by Dr. Richard Burns, modified by Dr. Stephanie Schwartz

Today
- Decision Trees
  - Structure
  - Information Theory: Entropy
  - Information Gain, Gain Ratio, Gini
  - ID3 Algorithm
    - Efficiently Handling Continuous Features
  - Overfitting / Underfitting
    - Bias-Variance Tradeoff
  - Pruning
  - Regression Trees

Motivation: Guess Who Game
- I'm thinking of one of you.
- Figure out who I'm thinking of by asking a series of binary questions.

Decision Trees
- Simple, yet widely used classification technique
  - For nominal target variables
    - There are also regression trees, for continuous target variables
  - Predictor features: binary, nominal, ordinal, discrete, continuous
  - Evaluating the model:
    - One metric: error rate in predictions

Decision Trees
A decision tree consists of: 1. a root node, 2. internal nodes, 3. leaf / terminal nodes. The attributes tested at the root and internal nodes are the "splitting attributes".

Training Data (Home Owner and Marital Status are categorical, Annual Income is continuous, and Defaulted Borrower is the class):

Tid | Home Owner | Marital Status | Annual Income | Defaulted Borrower
  1 | Yes        | Single         | 125K          | No
  2 | No         | Married        | 100K          | No
  3 | No         | Single         | 70K           | No
  4 | Yes        | Married        | 120K          | No
  5 | No         | Divorced       | 95K           | Yes
  6 | No         | Married        | 60K           | No
  7 | Yes        | Divorced       | 220K          | No
  8 | No         | Single         | 85K           | Yes
  9 | No         | Married        | 75K           | No
 10 | No         | Single         | 90K           | Yes

Decision Tree Model #1 (consistent with the training dataset):

Home Owner?
  Yes -> NO
  No  -> Marital Status?
           Married            -> NO
           {Single, Divorced} -> Annual Income?
                                   <= 80K -> NO
                                   >  80K -> YES

Decision Tree Model #2 (also consistent with the training dataset):

Marital Status?
  Married            -> NO
  {Single, Divorced} -> Home Owner?
                          Yes -> NO
                          No  -> Annual Income?
                                   <= 80K -> NO
                                   >  80K -> YES

There can be more than one tree that fits the same data!

Decision Tree Classifier
- Decision tree models are relatively more descriptive than other types of classifier models:
  1. Easier interpretation
  2. Easier to explain predicted values
- Exponentially many decision trees can be built
  - Which is best?
    - Some trees will be more accurate than others.
  - How to construct the tree?
    - Computationally infeasible to try every possible tree.

Apply Model to Test Data
Start from the root of the tree and follow the matching branch at each node. Test record: Home Owner = No, Marital Status = Married, Annual Income = 80K, Defaulted Borrower = ?

Using Model #1: Home Owner is No, so take the No branch to Marital Status; Marital Status is Married, so take the Married branch, which is the leaf NO.
- Predict that this person will not default.
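To make the traversal concrete, here is a minimal Python sketch of Model #1 and the root-to-leaf walk for this test record. The Node class and predict function are an illustrative encoding of our own, not part of the original slides.

```python
class Node:
    def __init__(self, test=None, branches=None, label=None):
        self.test = test          # internal nodes: function mapping a record to a branch key
        self.branches = branches  # internal nodes: branch key -> child Node
        self.label = label        # leaf nodes: class label

    def is_leaf(self):
        return self.label is not None

def predict(node, record):
    """Follow the attribute tests from the root until a leaf is reached."""
    while not node.is_leaf():
        node = node.branches[node.test(record)]
    return node.label

# Decision Tree Model #1 from the slides, encoded by hand:
income = Node(test=lambda r: r["Annual Income"] <= 80,
              branches={True: Node(label="NO"), False: Node(label="YES")})
marital = Node(test=lambda r: r["Marital Status"] == "Married",
               branches={True: Node(label="NO"), False: income})
root = Node(test=lambda r: r["Home Owner"] == "Yes",
            branches={True: Node(label="NO"), False: marital})

# Test record from the slides: Home Owner = No, Married, Annual Income = 80K.
record = {"Home Owner": "No", "Marital Status": "Married", "Annual Income": 80}
print(predict(root, record))  # NO -> predict that this person will not default
```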
Formally…
- A decision tree has three types of nodes:
  1. A root node that has no incoming edges and zero or more outgoing edges
  2. Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges
  3. Leaf nodes (or terminal nodes), each of which has exactly one incoming edge and no outgoing edges
- Each leaf node is assigned a class label.
- Non-terminal nodes contain attribute test conditions to separate records that have different characteristics.

How to Build a Decision Tree?
- Referred to as decision tree induction.
- Exponentially many decision trees can be constructed from a given set of attributes
  - Infeasible to try them all to find the optimal tree.
- Different tree-building algorithms: Hunt's algorithm, CART, ID3, C4.5, …
- Usually a greedy strategy is employed.

Hunt's Algorithm
- Let Dt = the set of training records that reach a node t.
- Recursive procedure:
  1. If all records in Dt belong to the same class yt: then t is a leaf node labeled with class yt.
  2. If Dt is an empty set: then t is a leaf node, with its class determined by the majority class of records in Dt's parent.
  3. If Dt contains records that belong to more than one class: use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.

Hunt's Algorithm (on the training data)
- The tree begins with a single node whose label reflects the majority class value (here: Defaulted Borrower = No).
- The tree needs to be refined because the root node contains records from both classes.
- Divide the records recursively into smaller subsets.

Hunt's Algorithm will work if:
- Every combination of attribute values is present.
  - Question: realistic or unrealistic?
  - Unrealistic: at least 2^n records are necessary for n binary attributes.
  - Example: there is no record for {Home Owner = Yes, Marital Status = Single, Income = 60K}.
- Each combination of attribute values has a unique class label.
  - Question: realistic or unrealistic?
  - Unrealistic.
  - Example: suppose we have two individuals, each with the properties {Home Owner = Yes, Marital Status = Single, Income = 125K}, but one person defaulted and the other did not.

Hunt's Algorithm
- Scenarios the algorithm may run into (both are handled in the sketch below):
  1. All records associated with Dt have identical attributes except for the class label (not possible to split any further).
     - Solution: declare a leaf node with the same class label as the majority class of Dt.
  2. Some child nodes are empty (no records are associated with them; no combination of attribute values leads to the node).
     - Solution: declare a leaf node with the same class label as the majority class of the empty node's parent.
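A minimal Python sketch of the recursive procedure and both scenarios, assuming each record is a (feature-dict, label) pair. The names hunt, majority_class, and choose_test_and_split are our own, and the naive attribute choice is a stand-in for the greedy, impurity-based choice discussed next.

```python
from collections import Counter

def majority_class(records):
    """Most common class label; each record is a (features_dict, label) pair."""
    return Counter(label for _, label in records).most_common(1)[0][0]

def choose_test_and_split(records):
    """Naive stand-in: split on the first attribute that still varies, with one
    branch per distinct value (a multiway split). A real implementation picks
    the attribute greedily using an impurity measure such as entropy."""
    for attr in records[0][0]:
        values = {f[attr] for f, _ in records}
        if len(values) > 1:
            return attr, {v: [(f, y) for f, y in records if f[attr] == v]
                          for v in values}

def hunt(records, parent_records=None):
    """Returns a class label (leaf) or an (attribute, {value: subtree}) pair."""
    # Case 2: D_t is empty -> leaf labeled with the majority class of the parent.
    if not records:
        return majority_class(parent_records)
    labels = {label for _, label in records}
    # Case 1: all records in D_t belong to the same class -> leaf with that class.
    if len(labels) == 1:
        return labels.pop()
    # Scenario 1 above: identical attributes but mixed labels -> cannot split
    # any further, so declare a leaf with the majority class of D_t.
    if len({tuple(sorted(f.items())) for f, _ in records}) == 1:
        return majority_class(records)
    # Case 3: more than one class -> split and recurse on each subset.
    attr, partitions = choose_test_and_split(records)
    return (attr, {value: hunt(subset, records)
                   for value, subset in partitions.items()})

# E.g., two records from the training table:
print(hunt([({"Home Owner": "Yes", "Marital Status": "Single"}, "No"),
            ({"Home Owner": "No",  "Marital Status": "Single"}, "Yes")]))
# -> ('Home Owner', {'Yes': 'No', 'No': 'Yes'})  (branch order may vary)
```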
Design Issues of Decision Tree Induction
1. How should the training records be split?
   - Greedy strategy: split the records based on some attribute test (always choose the immediate best option).
   - Need to evaluate the "goodness" of each attribute test and select the best one.
2. How should the splitting procedure stop?
   - For now, we'll keep splitting until we can't split anymore.

Different Split Ideas…

Splitting Based on Nominal Attributes
- Multiway split: use as many partitions as there are distinct values.
- Binary split: group the attribute values into two subsets.
  - Need to find the optimal partitioning.
  - The CART decision tree algorithm only creates binary splits.

Splitting Based on Ordinal Attributes
- Ordinal attributes can also produce binary or multiway splits.
  - The grouping should not violate the ordering in the ordinal set.

Splitting Based on Continuous Attributes
- Continuous attributes can also have a binary or multiway split.
  - Binary: the decision tree algorithm must consider all possible split positions v and select the best one.
    - Comparison test: (A <= v) or (A > v), e.g. v = 80K for Annual Income in the training data.
    - Computationally intensive.
  - Multiway: outcomes of the form v_i <= A < v_{i+1}, for i = 1, ..., k.
    - Consider all possible ranges of the continuous variable?
    - Use the same binning strategies as discussed for preprocessing a continuous attribute into a discrete one.
  - Note: adjacent intervals/"bins" can always be aggregated into wider ones.

Tree Induction
- What should we split on (in the training data)?
  - Home Owner
  - Marital Status
    - Multiway or binary?
  - Annual Income
    - Multiway or binary?

Entropy
- Defined by the mathematician Claude Shannon.
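Entropy measures the impurity of a node's class distribution: entropy(D) = -sum_i p_i * log2(p_i), where p_i is the fraction of records in D belonging to class i (0 for a pure node, 1 bit for a 50/50 binary split). As a minimal sketch of how it can score the "goodness" of a split, here is the standard definition applied to the 10-record training table; the entropy function and the choice to score the Annual Income <= 80K test are our own illustration, not from the original slides.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(D) = -sum_i p_i * log2(p_i) over the class proportions p_i."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Class labels from the 10-record training table (Defaulted Borrower column):
labels = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(entropy(labels))   # ~0.881 bits: 7 No vs. 3 Yes, a fairly impure node

# Scoring the candidate binary split Annual Income <= 80K from the slides:
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
left  = [y for x, y in zip(incomes, labels) if x <= 80]  # 3 records, all No
right = [y for x, y in zip(incomes, labels) if x > 80]   # 4 No, 3 Yes
# Weighted child entropy: lower means purer children. The drop from the
# parent's entropy to this value is the split's information gain.
weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
print(weighted)          # ~0.690, down from ~0.881
```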