slides originally by Dr. Richard Burns, modified by Dr. Stephanie Schwartz

DECISION TREES
CSCI 452: Data Mining

Today

¨ Decision Trees
  ¤ Structure
  ¤ Entropy
  ¤ Information Gain, Gain Ratio, Gini
  ¤ ID3
    n Efficiently Handling Continuous Features
  ¤ Overfitting / Underfitting
    n Bias-Variance Tradeoff
  ¤ Pruning
  ¤ Regression Trees

Motivation: Guess Who Game

¨ I’m thinking of one of you.

¨ Figure out who I’m thinking of by asking a series of binary questions.

Decision Trees

¨ Simple, yet widely used classification technique
  ¤ For nominal target variables
    n There are also Regression Trees, for continuous target variables
  ¤ Predictor features: binary, nominal, ordinal, discrete, continuous
  ¤ Evaluating the model:
    n One metric: error rate in predictions

Node types: 1. Root node  2. Internal nodes  3. Leaf / terminal nodes. The internal nodes carry the "splitting attributes".

Training dataset (categorical, continuous, and class attributes):

Tid | Home Owner | Marital Status | Annual Income | Defaulted Borrower
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes

Decision Tree Model #1 (binary split at the root): Home Owner? → Yes: NO; No: Marital Status? → Married: NO; {Single, Divorced}: Annual Income? → < 80K: NO; > 80K: YES. This tree is consistent with the training dataset.


(Training Data / Decision Tree Model #1)

Decision Trees

Another tree that is also consistent with the training dataset:

Decision Tree Model #2 (same training data): Marital Status? (binary) → Married: NO; {Single, Divorced}: Home Owner? → Yes: NO; No: Annual Income? → < 80K: NO; > 80K: YES.

There could be more than one tree that fits the same data!

Decision Tree Classifier

¨ Decision tree models are relatively more descriptive than other types of classifier models
  1. Easier interpretation
  2. Easier to explain predicted values
¨ Exponentially many decision trees can be built
  ¤ Which is best?
    n Some trees will be more accurate than others
  ¤ How to construct the tree?
    n Computationally infeasible to try every possible tree.

Apply Model to Test Data

Start from the root of the tree and follow the test conditions until a leaf is reached.

Test Data:

Home Owner | Marital Status | Annual Income | Defaulted Borrower
No | Married | 80K | ?

Walk the record down Model #1: Home Owner = No, so follow the No branch to Marital Status; Marital Status = Married, so follow the Married branch, which is the leaf labeled NO.

• Predict that this person will not default.

Formally…

¨ A decision tree has three types of nodes:
  1. A root node that has no incoming edges and zero or more outgoing edges
  2. Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges
  3. Leaf nodes (or terminal nodes), each of which has exactly one incoming edge and no outgoing edges
¨ Each leaf node is assigned a class label
¨ Non-terminal nodes contain attribute test conditions to separate records that have different characteristics

How to Build a Decision Tree?

¨ Referred to as decision tree induction.
¨ Exponentially many decision trees can be constructed from a given set of attributes
  ¤ Infeasible to try them all to find the optimal tree
¨ Different decision tree building algorithms: Hunt's algorithm, CART, ID3, C4.5, …
¨ Usually a greedy strategy is employed

Hunt's Algorithm
(Training data: the Defaulted Borrower dataset shown earlier.)

¨ Dt = the set of training records that reach a node t
¨ Recursive procedure:
  1. If all records in Dt belong to the same class yt:
    n then t is a leaf node with class yt

  2. If Dt is an empty set:
    n then t is a leaf node, with its class determined by the majority class of records in Dt's parent
  3. If Dt contains records that belong to more than one class:
    n use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.

Hunt's Algorithm
(Training data: the Defaulted Borrower dataset shown earlier.)

¨ The tree begins with a single node whose label reflects the majority class value
¨ The tree needs to be refined because the root node contains records from both classes
¨ Divide the records recursively into smaller subsets

Hunt's Algorithm
(Training data: the Defaulted Borrower dataset shown earlier.)

¨ Hunt's Algorithm will work if:
  ¤ Every combination of attribute values is present

    n Question: realistic or unrealistic?
    n Unrealistic: at least 2^d records are necessary for d binary attributes
    n Example: no record for {HomeOwner=Yes, Marital=Single, Income=60K}
  ¤ Each combination of attribute values has a unique class label
    n Question: realistic or unrealistic?
    n Unrealistic.
    n Example: suppose we have two individuals, each with the properties {HomeOwner=Yes, Marital=Single, Income=125K}, but one person defaulted and the other did not.

Hunt's Algorithm

¨ Scenarios the algorithm may run into:

1. All records associated with Dt have identical attributes except for the class label (not possible to split any more)
  n Solution: declare a leaf node with the same class label as the majority class of Dt
2. Some child nodes are empty (no records associated; no combination of attribute values leads to this node)
  n Solution: declare a leaf node with the same class label as the majority class of the empty node's parent

Design Issues of Decision Tree Induction

1. How should the training records be split?
  ¤ Greedy strategy: split the records based on some attribute test (always choose the immediate best option)
  ¤ Need to evaluate the "goodness" of each attribute test and select the best one.
2. How should the splitting procedure stop?
  ¤ For now, we'll keep splitting until we can't split anymore.

Different Split Ideas…

Splitting Based on Nominal Attributes

¨ Multiway Split: Use as many partitions as distinct values

¨ Binary Split: Grouping attribute values into two subsets
  ¤ Need to find the optimal partitioning
  ¤ (The CART decision tree algorithm only creates binary splits.)

Splitting Based on Ordinal Attributes

¨ Ordinal attributes can also produce binary or multiway splits.
  ¤ Grouping should not violate the ordering in the ordinal set

Splitting Based on Continuous Attributes

(Training data: the Defaulted Borrower dataset shown earlier.)

¨ Continuous attributes can also have a binary or multiway split.
  ¤ Binary: the decision tree algorithm must consider all possible split positions v, and it selects the best one
    n Comparison test: (A <= v) or (A > v), where v = 80K
  ¤ Computationally intensive

Splitting Based on Continuous Attributes

¨ Continuous attributes can also have a binary or multiway split.
  ¤ Multiway: outcomes of the form v_i ≤ A < v_{i+1}, for i = 1, …, k
    n Consider all possible ranges of the continuous variable?
    n Use the same binning strategies as discussed for preprocessing a continuous attribute into a discrete one
  ¤ Note: adjacent intervals/"bins" can always be aggregated into wider ones

Tree Induction
(Training data: the Defaulted Borrower dataset shown earlier.)

¨ What to split on?
  ¤ Home Owner
  ¤ Marital Status
    n Multiway or binary?
  ¤ Annual Income
    n Multiway or binary?

Entropy

¨ Defined by mathematician Claude Shannon
¨ Measures the impurity (heterogeneity) of the elements of a set
¨ "What is the uncertainty of guessing the result of a random selection from a set?"

(Figure: example sets that are very pure / not impure, a little impure, and completely impure.)

Entropy

¨ Entropy(S) = − Σ_{i=1}^{n} p_i · log2(p_i)
¨ A weighted sum of the logs of the probabilities of each of the possible outcomes.

Entropy Examples

1. Entropy of the set of 52 playing cards:
  ¤ The probability of randomly selecting any specific card i is 1/52 (≈ 0.019).
  ¤ Entropy(S) = − Σ_{i=1}^{52} (1/52) · log2(1/52) ≈ 5.7
2. Entropy if only the 4 suits matter:
  ¤ The probability of randomly selecting any suit is 1/4.
  ¤ Entropy(S) = − Σ_{i=1}^{4} 0.25 · log2(0.25) = 2

The higher the impurity, the higher the entropy.
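To make the card arithmetic concrete, here is a minimal Python sketch (an illustration added to these notes, not course code); the function name `entropy` is just a convenient choice.

```python
import math

def entropy(probabilities):
    """Shannon entropy: -sum(p * log2(p)) over all outcomes with p > 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Entropy of picking one specific card out of 52 equally likely cards.
print(entropy([1 / 52] * 52))   # ~5.70 bits

# Entropy when only the 4 suits matter.
print(entropy([0.25] * 4))      # 2.0 bits
```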

Shannon's Entropy Model: the Reason for Using the log Function

Figure: (a) A graph illustrating how the value of a binary log (the log to the base 2) of a probability changes across the range of probability values. (b) The impact of multiplying these values by −1. We want a low "score" when something is highly probable or certain.

How to determine the Best Split?

How does entropy help us?
• We can calculate the entropy (impureness) of the Defaulted Borrower class under each candidate split:
  • Home Owner? (Yes / No)
  • Marital Status? (Single / Divorced / Married)
  • Annual Income <= 85K? (Yes / No)

Information Gain

¨ Try to split on each possible feature in a dataset. See which split works “best”.

¨ Measure the reduction in the overall entropy of a set of instances

¨ InformationGain = Entropy(D) − Σ_i (|D_i| / |D|) · Entropy(D_i)

(The factor |D_i| / |D| is the weighting term.)

Information Gain Example

Entropy of the dataset (3 spam, 3 ham):
  Entropy(D) = −(3/6 · log2(3/6) + 3/6 · log2(3/6)) = 1

Remainder (the weighted entropy that remains after a split):
  rem(d, D) = Σ_{l ∈ levels(d)} (|D_{d=l}| / |D|) · Entropy(D_{d=l})

For the CONTAINS IMAGES feature:
  rem = (2/6)·[−(1/2·log2(1/2) + 1/2·log2(1/2))] + (4/6)·[−(2/4·log2(2/4) + 2/4·log2(2/4))] = 1
  InformationGain = 1 − 1 = 0

ID | SUSPICIOUS WORDS | UNKNOWN SENDER | CONTAINS IMAGES | CLASS
376 | true | false | true | spam
489 | true | true | false | spam
541 | true | true | false | spam
693 | false | true | true | ham
782 | false | false | false | ham
976 | false | false | false | ham

• The 0 means there was no "information gain".
• Nothing was learned by splitting on "Contains Images".

Information Gain

Calculate the remainder for the UNKNOWN SENDER feature in the dataset.

rem(d, D) = Σ_{l ∈ levels(d)} (|D_{d=l}| / |D|) · Entropy(D_{d=l})
  (weighting term × entropy of partition D_{d=l})

For UNKNOWN SENDER: the true partition {489, 541, 693} and the false partition {376, 782, 976} each contain two records of one class and one of the other, so each has entropy 0.9183, and rem = (3/6)·0.9183 + (3/6)·0.9183 = 0.9183.

Calculate the Information Gain on Each Feature

(Using the same six-email dataset as above.)

¨ IG(Suspicious Words) = 1 − 0 = 1

¨ IG(Unknown Sender) = 1 − 0.9183 = 0.0817

¨ IG(Contains Images) = 1 − 1 = 0

"Suspicious Words" is the best split.
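These three gains can be reproduced with a short, self-contained Python sketch (added here for illustration; the column names simply mirror the table, and the `information_gain` function follows the Entropy(D) − rem(d, D) definition above).

```python
import math
from collections import Counter

# The six-email dataset: (suspicious_words, unknown_sender, contains_images, class)
dataset = [
    ("true", "false", "true", "spam"),
    ("true", "true", "false", "spam"),
    ("true", "true", "false", "spam"),
    ("false", "true", "true", "ham"),
    ("false", "false", "false", "ham"),
    ("false", "false", "false", "ham"),
]
FEATURES = {"suspicious_words": 0, "unknown_sender": 1, "contains_images": 2}

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, feature_index):
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for level in {row[feature_index] for row in rows}:
        part = [row[-1] for row in rows if row[feature_index] == level]
        remainder += (len(part) / len(rows)) * entropy(part)  # weighting * partition entropy
    return entropy(labels) - remainder

for name, idx in FEATURES.items():
    print(name, round(information_gain(dataset, idx), 4))
# suspicious_words 1.0, unknown_sender 0.0817, contains_images 0.0
```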

ID3 Algorithm

¨ Attempts to create the shallowest tree that is consistent with the training dataset
¨ Builds the tree in a recursive, depth-first manner
  ¤ beginning at the root node and working down to the leaf nodes

ID3 Algorithm

1. Figure out the best feature to split on by using information gain
2. Add this root node to the tree; label it with the selected test feature
3. Partition the dataset using this test
4. For each partition, grow a branch from this node
5. Recursively repeat the process for each of these branches using the remaining partition of the dataset

ID3 Algorithm: Stopping Condition

Stop the recursion and construct a leaf node when:
1. All of the instances in the remaining dataset have the same classification (target feature value).
  ¤ Create a leaf node with that classification as its label.
2. The set of features left to test is empty.
  ¤ Create a leaf node with the majority class of the dataset as its classification.
3. The remaining dataset is empty.
  ¤ Create a leaf node one level up (at the parent node), with the majority class.
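Below is a hedged, minimal Python sketch of the recursive procedure and stopping conditions just described (my own illustration, not the course's implementation); it represents a tree as nested dictionaries, and names such as `id3` and `majority_class` are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    return -sum((c / len(labels)) * math.log2(c / len(labels)) for c in counts.values())

def information_gain(rows, feature):
    labels = [r["class"] for r in rows]
    rem = 0.0
    for level in {r[feature] for r in rows}:
        part = [r["class"] for r in rows if r[feature] == level]
        rem += (len(part) / len(rows)) * entropy(part)
    return entropy(labels) - rem

def majority_class(rows):
    return Counter(r["class"] for r in rows).most_common(1)[0][0]

def id3(rows, features, parent_rows=None):
    # Stopping condition 3: empty dataset -> majority class of the parent's dataset.
    if not rows:
        return majority_class(parent_rows)
    labels = {r["class"] for r in rows}
    # Stopping condition 1: all instances share one classification.
    if len(labels) == 1:
        return labels.pop()
    # Stopping condition 2: no features left to test -> majority class.
    if not features:
        return majority_class(rows)
    # Step 1: choose the best feature by information gain.
    best = max(features, key=lambda f: information_gain(rows, f))
    tree = {best: {}}
    remaining = [f for f in features if f != best]
    # Steps 3-5: partition, grow a branch per level, and recurse.
    for level in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == level]
        tree[best][level] = id3(subset, remaining, rows)
    return tree

# Example: the spam dataset from the previous slides, as dictionaries.
data = [
    {"suspicious_words": "true", "unknown_sender": "false", "contains_images": "true", "class": "spam"},
    {"suspicious_words": "true", "unknown_sender": "true", "contains_images": "false", "class": "spam"},
    {"suspicious_words": "true", "unknown_sender": "true", "contains_images": "false", "class": "spam"},
    {"suspicious_words": "false", "unknown_sender": "true", "contains_images": "true", "class": "ham"},
    {"suspicious_words": "false", "unknown_sender": "false", "contains_images": "false", "class": "ham"},
    {"suspicious_words": "false", "unknown_sender": "false", "contains_images": "false", "class": "ham"},
]
print(id3(data, ["suspicious_words", "unknown_sender", "contains_images"]))
# -> {'suspicious_words': {'true': 'spam', 'false': 'ham'}}  (branch order may vary)
```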

Determine the Best Split

(Defaulted Borrower training data)

¨ Before splitting:
  ¤ 7 records of class No
  ¤ 3 records of class Yes

Candidate splits and the class counts in each child:
  • Home Owner? → Yes: (Yes: 0, No: 3); No: (Yes: 3, No: 4)
  • Marital Status? → Single: (Yes: 2, No: 2); Divorced: (Yes: 1, No: 1); Married: (Yes: 0, No: 4)
  • Annual Income <= 85K? → Yes: (Yes: 1, No: 3); No: (Yes: 2, No: 4)

Which test condition is best?

Other Measures of Node Impurity

1. Gini Index ("genie")
2. Entropy
3. Misclassification Error

With c = number of classes and p_i = the fraction of records belonging to class i at a given node (and using the convention 0·log 0 = 0):

  Entropy(n) = − Σ_{i=0}^{c−1} p_i · log2(p_i)
  Gini(n) = 1 − Σ_{i=0}^{c−1} p_i^2
  MisclassificationError(n) = 1 − max_i p_i

Example Calculations

Proposed split: how "impure" is each node that would be created by the split?

Node N1: Class=0: 0, Class=1: 6
  Gini = 1 − (0/6)^2 − (6/6)^2 = 0
  Entropy = −(0/6)·log2(0/6) − (6/6)·log2(6/6) = 0
  Misclassification Error = 1 − max(0/6, 6/6) = 0

Node N2: Class=0: 1, Class=1: 5
  Gini = 1 − (1/6)^2 − (5/6)^2 = 0.278
  Entropy = −(1/6)·log2(1/6) − (5/6)·log2(5/6) = 0.650
  Misclassification Error = 1 − max(1/6, 5/6) = 0.167

Node N3: Class=0: 3, Class=1: 3
  Gini = 1 − (3/6)^2 − (3/6)^2 = 0.5
  Entropy = −(3/6)·log2(3/6) − (3/6)·log2(3/6) = 1
  Misclassification Error = 1 − max(3/6, 3/6) = 0.5
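A quick Python check of the three impurity measures for N1, N2, and N3 (illustrative only; the function names are mine):

```python
import math

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def misclassification_error(counts):
    return 1 - max(counts) / sum(counts)

for name, counts in [("N1", (0, 6)), ("N2", (1, 5)), ("N3", (3, 3))]:
    print(name, round(gini(counts), 3), round(entropy(counts), 3),
          round(misclassification_error(counts), 3))
# N1 0.0 0.0 0.0 | N2 0.278 0.65 0.167 | N3 0.5 1.0 0.5
```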

Comparing Impurity Measures for Binary Classification Problems

• Assuming only two classes.
• p = fraction of records that belong to one of the two classes
• Example: a node with Class0: 3 and Class1: 3 has p = 3/6 = 0.5
(Figure: the three impurity measures plotted as a function of p.)

Comparing Impurity Measures

¨ Consistency among different impurity measures
¨ But the attribute chosen as the test condition may vary depending on the impurity measure choice

Gain

¨ Gain: "goodness of the split"
¨ Comparing:
  ¤ degree of impurity of the parent node (before splitting)
  ¤ degree of impurity of the child nodes (after splitting), weighted
¨ Larger gain => better split (better test condition)

  Δ(gain) = I(parent) − Σ_{j=1}^{k} [N(v_j) / N] · I(v_j)

where:
  • I(n) = impurity measure at node n
  • k = number of attribute values
  • I(v_j) = impurity measure at child node v_j
  • N(v_j) = total number of records at child node v_j
  • N = total number of records at the parent node

Footnote: "Information Gain" is the term used when entropy is used as the impurity measure.

Gain Example

• What is the Gini (index) of the child nodes produced by each proposed split?
• Is the gain greater for split A or split B? (Parent node: Gini = 0.500.)

Split A:
  Gini(N1) = 1 − (3/7)^2 − (4/7)^2 = 0.4898
  Gini(N2) = 1 − (3/5)^2 − (2/5)^2 = 0.480
  Weighted average Gini = (7/12)·0.4898 + (5/12)·0.480 = 0.486
  Gain_A = 0.500 − 0.486 = 0.014

Split B:
  Gain_B = 0.500 − 0.375 = 0.125

Since the descendant nodes after splitting with Attribute B have a smaller Gini index than after splitting with Attribute A, splitting with Attribute B is preferred. (The gain is greater.)

Computing Multiway Gini index

¨ Computed for every attribute value.

Example (splitting on Car Type):
  Gini({Family}) = 0.375
  Gini({Sports}) = 0
  Gini({Luxury}) = 0.219
  Overall Gini index = (4/20)·0.375 + (8/20)·0 + (8/20)·0.219 = 0.163
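The overall Gini index of a multiway split is just the size-weighted average of the per-partition Gini values. A hedged Python sketch, using the Car Type class counts that appear later in these slides (Family 1/3, Sports 8/0, Luxury 1/7):

```python
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(partitions):
    """partitions: list of per-branch class-count tuples."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

car_type = [(1, 3), (8, 0), (1, 7)]            # Family, Sports, Luxury (C0, C1) counts
print([round(gini(p), 3) for p in car_type])   # [0.375, 0.0, 0.219]
print(weighted_gini(car_type))                 # ~0.1625, i.e. 0.163 as on the slide
```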

Binary Splitting of Continuous Attributes

¨ Need to find the best value v to split against
¨ Brute-force method:
  ¤ Consider every attribute value v as a split candidate
    n O(n) possible candidates
    n For each candidate, need to iterate through all records again to determine the count of records with attribute < v or > v
    n O(n^2) complexity
¨ By sorting records by the continuous attribute, this improves to O(n log n)

Splitting of Continuous Attributes

¨ Need to find the best value v to split against
¨ Brute-force method:
  ¤ Consider every attribute value v as a split candidate

Example (the Annual Income predictor in the Defaulted Borrower data):
  • Need to find a split value for the Annual Income predictor
  • Try 125K? Compute Gini for <= 125K and > 125K. Complexity: O(n)
  • Try 100K? Complexity: O(n)
  • Try 70K?, etc.
  • Overall complexity: O(n^2)

Splitting of Continuous Attributes

¨ By sorting records by the continuous attribute, this improves to O(n log n)
  ¤ Candidate split positions are the midpoints between two adjacent, different, class values
  ¤ For the Annual Income attribute, only need to consider split positions 80 and 97
(Training data: the Defaulted Borrower dataset shown earlier.)

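A hedged sketch of the sorted-scan idea for the Annual Income attribute (illustrative Python, not course code): sort by income, take midpoints between adjacent records whose class labels differ, and evaluate each candidate with the weighted Gini.

```python
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts) if total else 0.0

# (annual income in K, defaulted?) from the ten-record training set
records = [(125, "No"), (100, "No"), (70, "No"), (120, "No"), (95, "Yes"),
           (60, "No"), (220, "No"), (85, "Yes"), (75, "No"), (90, "Yes")]
records.sort()                       # O(n log n)

# Candidate splits: midpoints between adjacent records with different classes.
candidates = [(records[i][0] + records[i + 1][0]) / 2
              for i in range(len(records) - 1)
              if records[i][1] != records[i + 1][1]]
print(candidates)                    # [80.0, 97.5]  (the slide rounds 97.5 to 97)

def weighted_gini(split):
    left = [c for v, c in records if v <= split]
    right = [c for v, c in records if v > split]
    def counts(part):
        return (part.count("Yes"), part.count("No"))
    n = len(records)
    return len(left) / n * gini(counts(left)) + len(right) / n * gini(counts(right))

best = min(candidates, key=weighted_gini)
print(best, round(weighted_gini(best), 3))   # 97.5 0.3
```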

Bias: Favoring attributes with large number of distinct values

¨ Entropy and Gini Impurity measures favor attributes with large number of distinct values

Possible nodes to split on:

  • Own Car? → Yes: (C0: 6, C1: 4); No: (C0: 4, C1: 6)
  • Car Type? → Family: (C0: 1, C1: 3); Sports: (C0: 8, C1: 0); Luxury: (C0: 1, C1: 7)
  • Student ID? → c1: (C0: 1, C1: 0), …, c10: (C0: 1, C1: 0), c11: (C0: 0, C1: 1), …, c20: (C0: 0, C1: 1)

• Student ID will result in perfectly pure children.
• It will have the greatest gain.
• It should have been removed as a predictor variable.

(The C4.5 algorithm uses Gain Ratio to determine the goodness of a split.)

Gain Ratio

¨ To avoid the bias of favoring attributes with a large number of distinct values:
  1. Restrict test conditions to only binary splits
    n CART decision tree algorithm
  2. Gain Ratio: take into account the number of outcomes produced by the attribute split condition
    n Adjusts information gain by the entropy of the partitioning (see the sketch after this list)
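A small illustrative Python sketch of the split-information correction just described (the formula appears on the next slide); the function names are mine, and the partition sizes are passed in directly.

```python
import math

def split_info(partition_sizes):
    """Entropy of the partitioning itself: -sum P(v_i) * log2 P(v_i)."""
    n = sum(partition_sizes)
    return -sum((s / n) * math.log2(s / n) for s in partition_sizes if s > 0)

def gain_ratio(info_gain, partition_sizes):
    return info_gain / split_info(partition_sizes)

# A 20-record node split 2 ways vs. shattered into 20 singleton branches
# (e.g. splitting on a Student ID-like attribute): the same information gain
# is penalized much more heavily in the second case.
print(round(gain_ratio(0.5, [10, 10]), 3))   # 0.5   (split info = 1 bit)
print(round(gain_ratio(0.5, [1] * 20), 3))   # 0.116 (split info = log2(20) ≈ 4.32 bits)
```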

GainRatio = Δ_info / SplitInfo

SplitInfo = − Σ_{i=1}^{k} P(v_i) · log2 P(v_i),   where k is the total number of splits
  • A large number of splits makes SplitInfo larger
  • and will reduce the Gain Ratio

Complete Example

Name | Body Temp. | Gives Birth | 4-legged | Hibernates | Class (Mammal?)
Porcupine | Warm | Yes | Yes | Yes | Yes
Cat | Warm | Yes | Yes | No | Yes
Bat | Warm | Yes | No | Yes | No
Whale | Warm | Yes | No | No | No
Salamander | Cold | No | Yes | Yes | No
Komodo Dragon | Cold | No | Yes | No | No
Python | Cold | No | No | Yes | No
Salmon | Cold | No | No | No | No
Eagle | Warm | No | No | No | No
Guppy | Cold | Yes | No | No | No

Initial node: 2 Yes, 8 No
  Gini = 1 − (2/10)^2 − (8/10)^2 = 0.32

Split on Gives Birth?
  Gini(GivesBirth=Yes) = 1 − (2/5)^2 − (3/5)^2 = 0.48
  Gini(GivesBirth=No) = 1 − (0/5)^2 − (5/5)^2 = 0
  Weighted average: (5/10)·0.48 + (5/10)·0 = 0.24
  Gain(GivesBirth) = 0.32 − 0.24 = 0.08

Split on 4-legged?
  Gini(4Legged=Yes) = 1 − (2/4)^2 − (2/4)^2 = 0.5
  Gini(4Legged=No) = 0
  Weighted average: (4/10)·0.5 + (6/10)·0 = 0.2
  Gain(4Legged) = 0.32 − 0.2 = 0.12

Split on Hibernates?
  Gini(Hibernates=Yes) = 1 − (1/4)^2 − (3/4)^2 = 0.375
  Gini(Hibernates=No) = 1 − (1/6)^2 − (5/6)^2 = 0.278
  Weighted average: (4/10)·0.375 + (6/10)·0.278 = 0.3168
  Gain(Hibernates) = 0.32 − 0.3168 = 0.0032

Split on Body Temperature?
  Gini(BodyTemperature=Warm) = 0.48
  Gini(BodyTemperature=Cold) = 0
  Weighted average: 0.24
  Gain(BodyTemperature) = 0.32 − 0.24 = 0.08

Splitting on 4-legged would yield the largest gain.

4-legged? → Yes: (2 Yes, 2 No); No: (0 Yes, 6 No)
  Gini of the Yes child = 1 − (2/4)^2 − (2/4)^2 = 0.5

Within the 4-legged = Yes partition:

Split on Gives Birth?
  Gini(GivesBirth=Yes) = 1 − (2/2)^2 − (0/2)^2 = 0
  Gini(GivesBirth=No) = 0
  Weighted average: 0
  Gain(GivesBirth) = 0.5 − 0 = 0.5

Split on Hibernates?
  Gini(Hibernates=Yes) = 1 − (1/2)^2 − (1/2)^2 = 0.5
  Gini(Hibernates=No) = 0.5
  Weighted average: (2/4)·0.5 + (2/4)·0.5 = 0.5
  Gain(Hibernates) = 0.5 − 0.5 = 0

Split on Body Temperature?
  Gini(BodyTemperature=Warm) = 0
  Gini(BodyTemperature=Cold) = 0
  Weighted average: 0
  Gain(BodyTemperature) = 0.5 − 0 = 0.5

Splitting on either Gives Birth or Body Temperature will fit the training data perfectly.

Final tree: 4-legged? → No: No (0 Yes, 6 No); Yes: Gives Birth? → Yes: Yes (2 Yes, 0 No); No: No (0 Yes, 2 No)

Decision Tree Rules

¨ If 4-legged And GivesBirth Then Yes

¨ If 4-legged And Not GivesBirth Then No
¨ If Not 4-legged Then No

Training Set vs. Test Set

¨ Overall dataset is divided into:
  1. Training set – used to build the model (the learning algorithm induces / "learns" the model)
  2. Test set – used to evaluate the model (the model is applied / "deduction")
  3. (Sometimes a Validation set is also used; more later)

(Figure: a labeled training set of records (Tid, Attrib1, Attrib2, Attrib3, Class) is fed to a learning algorithm to induce a model; the model is then applied to an unlabeled test set.)

Review: Model Evaluation on Test Set (Classification) – Error Rate

¨ Error Rate: the proportion of mistakes that are made by applying our model f̂ to the testing observations:

  Error rate = (1/n) · Σ_{i=1}^{n} I(y_i ≠ ŷ_i)

Observations in test set: {(x1,y1), …, (xn,yn)}

ŷ_i is the predicted class for the ith record

I(y_i ≠ ŷ_i) is an indicator variable: it equals 1 if y_i ≠ ŷ_i and 0 if y_i = ŷ_i

Review: Model Evaluation on Test Set (Classification) – Confusion Matrix

¨ Confusion Matrix: tabulation of counts of test records correctly and incorrectly predicted by the model

                            Predicted Class
                          Class = 1 | Class = 0
Actual Class   Class = 1 |   f11    |   f10
               Class = 0 |   f01    |   f00

(Confusion matrix for a 2-class problem.)

Review: Model Evaluation on Test Set (Classification) – Confusion Matrix

                            Predicted Class
                          Class = 1 | Class = 0
Actual Class   Class = 1 |   f11    |   f10
               Class = 0 |   f01    |   f00

Accuracy = (number of correct predictions) / (total number of predictions) = (f11 + f00) / (f11 + f10 + f01 + f00)

Error rate = (number of wrong predictions) / (total number of predictions) = (f10 + f01) / (f11 + f10 + f01 + f00)

Most classification tasks seek models that attain the highest accuracy when applied to the test set.
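As a quick check of the two formulas, a small Python snippet (the confusion-matrix counts below are made up purely for illustration):

```python
def accuracy(f11, f10, f01, f00):
    return (f11 + f00) / (f11 + f10 + f01 + f00)

def error_rate(f11, f10, f01, f00):
    return (f10 + f01) / (f11 + f10 + f01 + f00)

# Hypothetical counts for a 2-class test set of 100 records.
f11, f10, f01, f00 = 40, 10, 5, 45
print(accuracy(f11, f10, f01, f00))    # 0.85
print(error_rate(f11, f10, f01, f00))  # 0.15
```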

Review: Model Evaluation on Test Set (Regression) – Mean Squared Error

¨ Mean Squared Error: measures the "quality of fit"
  ¤ will be small if the predicted responses are very close to the true responses

  MSE = (1/n) · Σ_{i=1}^{n} (y_i − f̂(x_i))^2

Observations in test set: {(x1,y1), …, (xn,yn)}
f̂(x_i) is the predicted value for the ith record

Review: Problem

¨ Error rates on training set vs. testing set might be drastically different.

¨ No guarantee that the model with the smallest training error rate will have the smallest testing error rate Review: Overfitting

¨ Overfitting: occurs when model “memorizes” the training set data ¤ very low error rate on training data ¤ yet, high error rate on test data

¨ Model does not generalize to the overall problem

¨ This is bad! We wish to avoid overfitting.


Figure 5.2: The optimal level of model complexity is at the minimum error rate on the validation set. (The plot shows error rate versus complexity of model, with one curve for the error rate on the training set and one for the error rate on the validation set; the regions on either side of the optimum correspond to underfitting and overfitting.)

As model complexity increases, the error rate on the training set continues to fall in a monotone fashion. However, as the model complexity increases, the validation set error rate soon begins to flatten out and increase because the provisional model has memorized the training set rather than leaving room for generalizing to unseen data. The point where the minimal error rate on the validation set is encountered is the optimal level of model complexity, as indicated in Figure 5.2. Complexity greater than this is considered to be overfitting; complexity less than this is considered to be underfitting.

BIAS–VARIANCE TRADE-OFF

Suppose that we have the scatter plot in Figure 5.3 and are interested in constructing the optimal curve (or straight line) that will separate the dark gray points from the light gray points. The straight line in Figure 5.3 has the benefit of low complexity but suffers from some classification errors (points ending up on the wrong side of the line). In Figure 5.4 we have reduced the classification error to zero but at the cost of a much more complex separation function (the curvy line). One might be tempted to adopt the greater complexity in order to reduce the error rate. However, one should be careful not to depend on the idiosyncrasies of the training set. For example, suppose that we now add more data points to the scatter plot, giving us the graph in Figure 5.5. Note that the low-complexity separator (the straight line) need not change very much to accommodate the new data points. This means that this low-complexity separator has low variance. However, the high-complexity separator, the curvy line, must alter considerably if it is to maintain its pristine error rate. This high degree of change indicates that the high-complexity separator has a high variance.

Review: Bias and Variance

¨ Bias: the error introduced by modeling a real-life problem (usually extremely complicated) by a much simpler problem
  ¤ The more flexible (complex) a method is, the less bias it will generally have.
¨ Variance: how much the learned model will change if the training set was different
  ¤ Does changing a few observations in the training set dramatically affect the model?
  ¤ Generally, the more flexible (complex) a method is, the more variance it has.

Example: we wish to build a model that separates the dark-colored points from the light-colored points. The black line is a simple, linear model.

Currently, some classification error

• Low variance • Bias present

Figure 5.3 Low-complexity separator with high error rate.

Figure 5.4 High-complexity separator with low error rate.

Figure 5.5 With more data: low-complexity separator need not change much; high-complexity separator needs much revision.


More complex model (curvy line instead of linear)

Zero classification error for these data points

• No linear model bias • Higher Variance?

Figure 5.4 High-complexity separator with low error rate.




More data has been added.

Re-train both models (linear line and curvy line) in order to minimize the error rate.

Variance: • Linear model doesn’t change much • Curvy line significantly changes

Which model is better?

Figure 5.5 With more data: low-complexity separator need not change much; high-complexity separator needs much revision.

This could involve learning noise in the data. Or it could involve learning to identify specific inputs rather than whatever factors are actually predictive for the desired output. The other side of this is underfitting, producing a model that doesn't perform well even on the training data, although typically when this happens you decide your model isn't good enough and keep looking for a better one.

Figure 11-1. Overfitting and underfitting

In Figure 11-1, I've fit three polynomials to a sample of data. (Don't worry about how; we'll get to that in later chapters.) The horizontal line shows the best fit degree 0 (i.e., constant) polynomial. It severely underfits the training data. The best fit degree 9 (i.e., 10-parameter) polynomial goes through every training data point exactly, but it very severely overfits—if we were to pick a few more data points it would quite likely miss them by a lot. And the degree 1 line strikes a nice balance—it's pretty close to every point, and (if these data are representative) the line will likely be close to new data points as well. Clearly models that are too complex lead to overfitting and don't generalize well beyond the data they were trained on. So how do we make sure our models aren't too complex?

Model Overfitting

¨ Errors committed by a classification model are generally divided into:
  1. Training errors: misclassifications on training set records
  2. Generalization errors (testing errors): errors made on the testing set / previously unseen instances
¨ A good model has low training error and low generalization error.
¨ Overfitting: the model has a low training error rate, but high generalization errors

Model Underfitting and Overfitting

¨ When the tree is small:
  ¤ Underfitting
  ¤ Large training error rate
  ¤ Large testing error rate
  ¤ The structure of the data isn't yet learned
¨ When the tree gets too large:
  ¤ Beware of overfitting
  ¤ The training error rate decreases while the testing error rate increases
  ¤ The tree is too complex
  ¤ The tree "almost perfectly fits" the training data, but doesn't generalize to testing examples

Reasons for Overfitting

1. Presence of noise

2. Lack of representative samples

Training Set: the same 10-animal dataset (Name, Body Temp., Gives Birth, 4-legged, Hibernates, Class: Mammal?) shown earlier.
• Two training records are mislabeled.
• The induced tree perfectly fits the training data.

Testing Set:

Name | Body Temp. | Gives Birth | 4-legged | Hibernates | Class (Mammal?)
Human | Warm | Yes | No | No | Yes
Pigeon | Warm | No | No | No | No
Elephant | Warm | Yes | Yes | No | Yes
Leopard Shark | Cold | Yes | No | No | No
Turtle | Cold | No | Yes | No | No
Penguin | Cold | No | No | No | No
Eel | Cold | No | No | No | No
Dolphin | Warm | Yes | No | No | Yes
Spiny Anteater | Warm | No | Yes | Yes | Yes
Gila Monster | Cold | No | Yes | Yes | No

Test error rate: 30%

Reasons for misclassifications:
• Mislabeled records in the training data
• "Exceptional case"
  • Unavoidable
  • Minimal error rate achievable by any classifier

Comparing two trees on this data:
• Tree that perfectly fits the training data: training error rate 0%, test error rate 30% → overfitting
• Alternative tree: training error rate 20%, test error rate 10%

Overfitting and Decision Trees

¨ The likelihood of overfitting occurring increases as a tree gets deeper ¤ the resulting classifications are based on smaller subsets of the full training dataset

¨ Overfitting involves splitting the data on an irrelevant feature. Pruning: Handling Overfitting in Decision Trees

¨ Tree pruning identifies and removes subtrees within a decision tree that are likely to be due to noise and sample variance in the training set used to induce it.
¨ Pruning will result in decision trees being created that are not consistent with the training set used to build them.
¨ But we are more interested in creating prediction models that generalize well to new data!

1. Pre-pruning (Early Stopping)
2. Post-pruning

Pre-pruning Techniques

1. Stop creating subtrees when the number of instances in a partition falls below a threshold
2. The information gain measured at a node is not deemed to be sufficient to make partitioning the data worthwhile
3. The depth of the tree goes beyond a predefined limit
4. … other, more advanced approaches

Benefits: computationally efficient; works well for small datasets.
Downsides: stopping too early will fail to create the most effective trees.

Post-pruning

1. The decision tree is initially grown to its maximum size
2. Then examine each branch
3. Branches that are deemed likely to be due to overfitting are pruned.

¨ Post-pruning tends to give better results than pre-pruning
¨ Which is faster?
  ¤ Post-pruning is more computationally expensive than pre-pruning because the entire tree is grown

Post-pruning Techniques

1. Reduced Error Pruning

2. Cost Complexity Pruning

Reduced Error Pruning

¨ Starting at the leaves, each node is replaced with its most popular class.
¨ If the accuracy is not affected, then the change is kept.
  ¤ Evaluate accuracy on a validation set
  ¤ Set aside some of the training set as a validation set
¨ Advantages: simplicity and speed
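Here is a hedged sketch of reduced error pruning over a nested-dict tree like the one produced by the earlier ID3 sketch (an illustration under those assumptions, not the course's implementation): work bottom-up, tentatively replace each subtree with its most popular class, and keep the replacement whenever validation accuracy does not drop.

```python
from collections import Counter

def predict(tree, record, default):
    # Leaves are plain class labels; internal nodes look like {feature: {level: subtree}}.
    while isinstance(tree, dict):
        feature = next(iter(tree))
        branches = tree[feature]
        if record.get(feature) not in branches:
            return default
        tree = branches[record[feature]]
    return tree

def local_accuracy(tree, rows, default):
    if not rows:
        return 1.0
    return sum(predict(tree, r, default) == r["class"] for r in rows) / len(rows)

def reduced_error_prune(tree, train_rows, val_rows, default):
    if not isinstance(tree, dict):
        return tree
    feature = next(iter(tree))
    # Bottom-up: prune each child using only the records routed to that branch.
    for level, subtree in list(tree[feature].items()):
        tree[feature][level] = reduced_error_prune(
            subtree,
            [r for r in train_rows if r.get(feature) == level],
            [r for r in val_rows if r.get(feature) == level],
            default)
    # Tentatively replace this node with its most popular training class.
    majority = (Counter(r["class"] for r in train_rows).most_common(1)[0][0]
                if train_rows else default)
    # Keep the replacement if validation accuracy (on records reaching this node) does not drop.
    if local_accuracy(majority, val_rows, default) >= local_accuracy(tree, val_rows, default):
        return majority
    return tree
```

Because each node is compared only on the validation records that actually reach it, keeping a replacement whenever this local accuracy does not drop has the same effect as comparing whole-tree validation accuracy.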

Cost Complexity Pruning

¨ Nonnegative tuning parameter: α
  ¤ "Penalizing cost" / "complexity parameter"
¨ Will look at different pruned subtrees and compare their performance on a test sample
¨ α determines the trade-off between misclassification error and model complexity
  ¤ Small α: the penalty for a larger tree is small
  ¤ Larger α: smaller trees are preferred, depending on the # of errors

Post-Pruning Example

Example validation set:

Table: An example validation set for the post-operative patient routing task.

ID | CORE-TEMP | STABLE-TEMP | GENDER | DECISION
1 | high | true | male | gen
2 | low | true | female | icu
3 | high | false | female | icu
4 | high | false | male | icu
5 | low | false | female | icu
6 | low | true | male | icu

Figure: The decision tree induced from the training data for the post-operative patient routing task: CORE-TEMP [icu] at the root; low → GENDER [icu] (male → icu, female → gen); high → STABLE-TEMP [gen] (true → gen, false → icu).

• Need to prune?

(The same validation set is used to evaluate each candidate pruning step.)

Figure: The iterations of reduced error pruning for the decision tree above, using the validation set above. Panels (a), (b), and (c) show successive iterations; the subtree that is being considered for pruning in each iteration is highlighted. The prediction returned by each non-leaf node is listed in square brackets, and the error rate for each node is given in round brackets.

Occam's Razor

General Principle (Occam’s Razor): given two models with same generalization (testing) errors, the simpler model is preferred over the more complex model ¤ Additional components in a more complex model have greater chance at being fitted purely by chance

Problem solving principle by philosopher William of Ockham (1287-1347) Advantages of Pruning

1. Smaller trees are easier to interpret

2. Increased generalization accuracy. Regression Trees

¨ Target Attribute: ¤ Decision (Classification) Trees: qualitative ¤ Regression Trees: continuous

¨ Decision trees: reduce the entropy in each subtree

¨ Regression trees: reduce the variance in each subtree
  ¤ Idea: adapt the ID3 algorithm's measure of Information Gain to use variance rather than node impurity

Regression Tree Splits

Classification Trees:
  ¨ Gain: "goodness of the split"
  ¨ Larger gain => better split (better test condition)
  ¨ Δ(gain) = I(parent) − Σ_{j=1}^{k} [N(v_j) / N] · I(v_j)
    (I(n) = impurity measure at node n; k = number of attribute values; N(v_j) = number of records at child node v_j; N = total number of records at the parent node)

Regression Trees:
  ¨ The impurity (variance) at a node can be calculated using the following equation:
      var(t, D) = Σ_{i=1}^{n} (t_i − t̄)^2 / (n − 1)        (3)
  ¨ We select the feature to split on at a node by selecting the feature that minimizes the weighted variance across the resulting partitions:
      d[best] = argmin_{d} Σ_{l ∈ levels(d)} (|D_{d=l}| / |D|) · var(t, D_{d=l})        (4)

Need to watch out for Overfitting

¨ Want to avoid overfitting:
  ¤ Early stopping criterion
  ¤ Stop partitioning the dataset if the number of training instances is less than some threshold
  ¤ (e.g., 5% of the dataset)

Figure: (a) A set of instances on a continuous number line; (b), (c), and (d) depict some of the potential groupings that could be applied to these instances: (b) underfitting, (c) a "Goldilocks" grouping that is about right, and (d) overfitting.

Example

Table: A dataset listing the number of bike rentals per day.

ID | SEASON | WORK DAY | RENTALS
1 | winter | false | 800
2 | winter | false | 826
3 | winter | true | 900
4 | spring | false | 2 100
5 | spring | true | 4 740
6 | spring | true | 4 900
7 | summer | false | 3 000
8 | summer | true | 5 800
9 | summer | true | 6 200
10 | autumn | false | 2 910
11 | autumn | false | 2 880
12 | autumn | true | 2 820

Table: The partitioning of the dataset above based on the SEASON and WORK DAY features and the computation of the weighted variance for each partitioning.

Split by Feature | Level | Partition | Instances | |D_{d=l}|/|D| | var(t, D_{d=l}) | Weighted Variance
SEASON | 'winter' | D1 | d1, d2, d3 | 0.25 | 2 692 | 1 379 331 1/3
SEASON | 'spring' | D2 | d4, d5, d6 | 0.25 | 2 472 533 1/3 |
SEASON | 'summer' | D3 | d7, d8, d9 | 0.25 | 3 040 000 |
SEASON | 'autumn' | D4 | d10, d11, d12 | 0.25 | 2 100 |
WORK DAY | 'true' | D5 | d3, d5, d6, d8, d9, d12 | 0.50 | 4 026 346 2/3 | 2 551 813 1/3
WORK DAY | 'false' | D6 | d1, d2, d4, d7, d10, d11 | 0.50 | 1 077 280 |

Figure: The final decision tree induced from the bike rentals dataset above. To illustrate how the tree generates predictions, the tree lists the instances that ended up at each leaf node and the prediction (PRED.) made by each leaf node.

  Season = winter → Work Day? true: instance 3 (900), Pred. 900; false: instances 1, 2 (800, 826), Pred. 813
  Season = spring → Work Day? true: instances 5, 6 (4,740, 4,900), Pred. 4,820; false: instance 4 (2,100), Pred. 2,100
  Season = summer → Work Day? true: instances 8, 9 (5,800, 6,200), Pred. 6,000; false: instance 7 (3,000), Pred. 3,000
  Season = autumn → Work Day? true: instance 12 (2,820), Pred. 2,820; false: instances 10, 11 (2,910, 2,880), Pred. 2,895
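The weighted-variance numbers in the partitioning table can be reproduced with a short Python sketch (illustrative only; it uses the sample variance with an n − 1 denominator, as in Equation (3)).

```python
from statistics import variance  # sample variance, n - 1 denominator

# (season, work_day, rentals) for the twelve days in the table
days = [("winter", False, 800), ("winter", False, 826), ("winter", True, 900),
        ("spring", False, 2100), ("spring", True, 4740), ("spring", True, 4900),
        ("summer", False, 3000), ("summer", True, 5800), ("summer", True, 6200),
        ("autumn", False, 2910), ("autumn", False, 2880), ("autumn", True, 2820)]

def weighted_variance(feature_index):
    total = 0.0
    for level in {d[feature_index] for d in days}:
        rentals = [d[2] for d in days if d[feature_index] == level]
        total += len(rentals) / len(days) * variance(rentals)
    return total

print(round(weighted_variance(0)))  # SEASON   -> 1379331
print(round(weighted_variance(1)))  # WORK DAY -> 2551813
```

SEASON gives the lower weighted variance, so it is chosen as the root split, matching the final tree above.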

Advantages and Disadvantages of Trees (Compared to Linear Models)

Two classes: {green, blue}; the true decision boundary is linear.

¨ A linear model can perfectly separate the two regions.
¨ What about a decision tree? A decision tree cannot separate the regions exactly.

Advantages and Disadvantages of Trees (Compared to Linear Models)

Two classes: {green, blue}; the true decision boundary is nonlinear.

¨ A linear model cannot perfectly separate the two regions.
¨ What about a decision tree? A decision tree can!

Decision Trees can Separate Nonlinear Regions

Advantages and Disadvantages of Trees

Advantages:
¨ Trees are very easy to explain
  ¤ Easier to explain than linear regression
¨ Trees can be displayed graphically and interpreted by a non-expert
¨ Decision trees may more closely mirror human decision-making
¨ Trees can easily handle qualitative predictors
  ¤ No dummy variables

Disadvantages:
¨ Trees usually do not have the same level of predictive accuracy as other data mining algorithms
  • But the predictive performance of decision trees can be improved by aggregating trees.
  • Techniques: bagging, boosting, random forests

Decision Tree Advantages

¤ Inexpensive to construct
¤ Extremely fast at classifying unknown records
  n O(d) where d is the depth of the tree
¤ Presence of redundant attributes does not adversely affect the accuracy of decision trees
  n One of the two redundant attributes will not be used for splitting once the other attribute is chosen
¤ Nonparametric approach
  n Does not require any prior assumptions regarding probability distributions, means, variances, etc.

References

¨ Fundamentals of Machine Learning for Predictive Data Analytics, 1st Edition, Kelleher et al.
¨ Data Science from Scratch, 1st Edition, Grus
¨ Data Mining and Business Analytics with R, 1st Edition, Ledolter
¨ An Introduction to Statistical Learning, 1st Edition, James et al.
¨ Discovering Knowledge in Data, 2nd Edition, Larose et al.
¨ Introduction to Data Mining, 1st Edition, Tan et al.