Decision Trees: Overfitting STAT/CSE 416: Machine Learning Emily Fox University of Washington April 26, 2018
Decision trees recap

[Figure: decision tree on the loan data. The root (22 safe, 18 risky) splits on Credit: excellent (9 safe, 0 risky) → Safe; fair (9 safe, 4 risky) splits on Term (3 years → Risky, 5 years → Safe); poor (4 safe, 14 risky) splits on Income, where low income → Risky and high income splits on Term (3 years → Risky, 5 years → Safe). For each leaf node, set ŷ = majority value.]

Greedy decision tree learning
• Step 1: Start with an empty tree
• Step 2: Select a feature to split the data (pick the feature split leading to the lowest classification error)
• For each split of the tree:
  - Step 3: If there is nothing more to do, make predictions (stopping conditions 1 & 2)
  - Step 4: Otherwise, go to Step 2 and continue (recurse) on this split

Stopping condition 1: All data agrees on y
All data in these nodes have the same y value → nothing to do.
[Figure: the loan-data tree with its pure leaves highlighted, e.g. Credit = excellent (9 safe, 0 risky).]

Stopping condition 2: Already split on all features
Already split on all possible features → nothing to do.
[Figure: the loan-data tree with the Term leaves under Credit = poor, Income = high highlighted; Credit, Income, and Term have all been used on that path.]

Scoring a loan application
xi = (Credit = poor, Income = high, Term = 5 years)
Follow the branches from the root: Credit = poor → Income?, Income = high → Term?, Term = 5 years → Safe. So ŷi = Safe.

Predicting probabilities with decision trees
[Figure: tree with root (18 safe, 12 risky) split on Credit: excellent (9 safe, 2 risky) → Safe, fair (6 safe, 9 risky) → Risky, poor (3 safe, 1 risky) → Safe.]
For an input x that lands in the Credit = poor leaf:
P(y = Safe | x) = 3 / (3 + 1) = 0.75

Comparison with logistic regression
[Figure: class-probability decision boundaries for logistic regression (degree 2) vs. decision trees (depth 2).]

Overfitting in decision trees

What happens when we increase depth?
Training error decreases with depth:

  Tree depth       1     2     3     5     10
  Training error   0.22  0.13  0.10  0.03  0.00

[Figure: the decision boundary becomes increasingly complex as depth grows.]

Two approaches to picking simpler trees
1. Early stopping: Stop the learning algorithm before the tree becomes too complex
2. Pruning: Simplify the tree after the learning algorithm terminates

Technique 1: Early stopping
• Stopping conditions (recap):
  1. All examples have the same target value
  2. No more features to split on
• Early stopping conditions (see the sketch below):
  1. Limit tree depth (choose max_depth using a validation set)
  2. Do not consider splits that do not cause a sufficient decrease in classification error
  3. Do not split an intermediate node which contains too few data points
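These three early stopping conditions map directly onto hyperparameters in common decision tree implementations. Below is a minimal sketch using scikit-learn's DecisionTreeClassifier (not the tool used in the course); the dataset and parameter values are illustrative, and scikit-learn measures a split's usefulness by impurity decrease rather than classification error.

```python
# A minimal sketch of the three early stopping conditions; data and
# parameter values are illustrative, not from the lecture.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(
    max_depth=3,                 # condition 1: limit tree depth (tune on a validation set)
    min_impurity_decrease=0.01,  # condition 2: skip splits with too small a gain
    min_samples_split=20,        # condition 3: do not split nodes with too few points
    random_state=0,
)
tree.fit(X_train, y_train)
print("validation accuracy:", tree.score(X_valid, y_valid))
```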
Challenge with early stopping condition 1
Hard to know exactly when to stop. Also, we might want some branches of the tree to go deeper while others remain shallow.
[Figure: classification error vs. tree depth. Training error keeps decreasing with depth, while true error falls and then rises; max_depth must be chosen somewhere between simple and complex trees.]

Early stopping condition 2: Pros and cons
• Pros:
  - A reasonable heuristic for early stopping that avoids useless splits
• Cons:
  - Too short-sighted: we may miss out on "good" splits that occur right after "useless" splits
  - We saw this with the "xor" example

Two approaches to picking simpler trees
1. Early stopping: Stop the learning algorithm before the tree becomes too complex
2. Pruning: Simplify the tree after the learning algorithm terminates (complements early stopping)

Pruning: Intuition
Train a complex tree, then simplify it later: complex tree → simplify → simpler tree.

Pruning motivation
[Figure: classification error vs. tree depth. Rather than stopping too early on the simple-tree side, grow the tree and simplify it after it is built.]

Scoring trees: Desired total quality format
We want to balance:
i. How well the tree fits the data
ii. The complexity of the tree

Total cost = measure of fit + measure of complexity

where the measure of fit is the classification error (a large value means a bad fit to the training data) and a large measure of complexity means the tree is likely to overfit.

Simple measure of complexity of a tree
L(T) = # of leaf nodes
[Figure: a tree with a single split on Credit into five branches (excellent, good, fair, bad, poor), each ending in a Safe or Risky leaf, so L(T) = 5.]

Balance simplicity & predictive power
A tree that is too complex (many levels of splits) risks overfitting; a tree that is too simple (e.g. a single root node) has high classification error.
[Figure: the full loan-data tree (too complex, risk of overfitting) next to a single root node (too simple, high classification error).]

Balancing fit and complexity
Total cost: C(T) = Error(T) + λ L(T), where λ is a tuning parameter.
• If λ = 0: only the fit matters — standard decision tree learning.
• If λ = ∞: only complexity matters — the tree collapses to the root (majority-class prediction).
• If λ is in between: we trade off training fit against tree complexity.

Tree pruning algorithm

Step 1: Consider a split
[Figure: tree T, the full loan-data tree; the Term? split under Credit = poor, Income = high is the candidate for pruning.]

Step 2: Compute total cost C(T) of the split
With λ = 0.03 and C(T) = Error(T) + λ L(T):

  Tree       Error  #Leaves  Total
  T          0.25   6        0.43

Step 3: "Undo" the splits on Tsmaller
Replace the candidate split by a leaf node and recompute the cost:

  Tree       Error  #Leaves  Total
  T          0.25   6        0.43
  Tsmaller   0.26   5        0.41

Replace the split by a leaf node? (See the sketch below.)
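As a minimal sketch of this cost comparison: the error rates and leaf counts below are the ones in the table above, and λ = 0.03 is the value that makes those totals come out as shown.

```python
# A sketch of the pruning decision: compare C(T) = Error(T) + lambda * L(T)
# for the full tree T and the pruned tree T_smaller (numbers from the table above).
def total_cost(error, num_leaves, lam):
    """Total cost C(T) = Error(T) + lambda * L(T)."""
    return error + lam * num_leaves

lam = 0.03                                                       # tuning parameter lambda
cost_T         = total_cost(error=0.25, num_leaves=6, lam=lam)   # ~0.43
cost_T_smaller = total_cost(error=0.26, num_leaves=5, lam=lam)   # ~0.41

# Prune (replace the split with a leaf) if the smaller tree's total cost is no higher.
if cost_T_smaller <= cost_T:
    print(f"Prune: C(T_smaller) = {cost_T_smaller:.2f} <= C(T) = {cost_T:.2f}")
else:
    print("Keep the split")
```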
Step 4: Prune if total cost is lower: C(Tsmaller) ≤ C(T)
Tsmaller has a worse training error but a lower overall cost (0.41 vs. 0.43), so we replace the split by a leaf node. YES!

Step 5: Repeat Steps 1-4 for every split
Decide whether each remaining split in the tree can be "pruned".
[Figure: the pruned loan-data tree, with the remaining splits considered in turn.]

Summary of overfitting in decision trees

What you can do now…
• Identify when overfitting occurs in decision trees
• Prevent overfitting with early stopping
  - Limit tree depth
  - Do not consider splits that do not reduce classification error
  - Do not split intermediate nodes with only a few points
• Prevent overfitting by pruning complex trees
  - Use a total cost formula that balances classification error and tree complexity
  - Use total cost to merge potentially complex trees into simpler ones

Boosting
STAT/CSE 416: Machine Learning, Emily Fox, University of Washington, April 26, 2018

Simple (weak) classifiers are good!
Examples: logistic regression with simple features, shallow decision trees, decision stumps (e.g. "Income > $100K? Yes → Safe, No → Risky").
Low variance. Learning is fast! But high bias…

Finding a classifier that's just right
[Figure: classification error vs. model complexity. Training error keeps falling while true error falls and then rises; a weak learner sits on the high-bias side, so we need a stronger learner.]
Option 1: add more features or depth
Option 2: ?????

Boosting question
"Can a set of weak learners be combined to create a stronger learner?" — Kearns and Valiant (1988)
Yes! — Schapire (1990): Boosting
Amazing impact: a simple approach that is widely used in industry and wins most Kaggle competitions.

Ensemble classifier

A single classifier
Input: x; a single stump such as "Income > $100K?" (Yes → Safe, No → Risky)
Output: ŷ = f(x), either +1 or -1

Ensemble methods: Each classifier "votes" on the prediction
xi = (Income = $120K, Credit = Bad, Savings = $50K, Market = Good)
The ensemble applies several weak classifiers to xi, for example:
• Income > $100K? (Yes → Safe, No → Risky)
• Credit history? (Bad → Risky, Good → Safe)
• Savings > $100K? (Yes → Safe, No → Risky)
• Market conditions? (Bad → Risky, Good → Safe)
Each classifier casts a vote: f1(xi), …
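As an illustration of this voting scheme, here is a minimal sketch that trains one decision stump per feature on synthetic data and combines their ±1 votes by majority; the synthetic dataset and the use of scikit-learn stumps are assumptions for the example, not part of the lecture.

```python
# A sketch of ensemble voting: several weak classifiers (decision stumps),
# each voting +1 (safe) or -1 (risky); the ensemble takes the majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data (5 features, so the vote count is odd).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
y = np.where(y == 1, 1, -1)  # encode labels as +1 / -1

# Train one depth-1 stump per feature, mimicking "Income > $100K?", "Credit history?", ...
stumps = [DecisionTreeClassifier(max_depth=1, random_state=0).fit(X[:, [j]], y)
          for j in range(X.shape[1])]

# Each classifier votes on every example; the ensemble prediction is the sign of the sum.
votes = sum(stumps[j].predict(X[:, [j]]) for j in range(len(stumps)))
y_hat = np.sign(votes)
print("ensemble training accuracy:", np.mean(y_hat == y))
```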