Overfitting in Decision Trees, Boosting Slides
Decision Trees: Overfitting
CSE 446: Machine Learning. Emily Fox, University of Washington, January 30, 2017. ©2017 Emily Fox.

Decision tree recap
[Figure: example decision tree for loan data (root: 22 Safe, 18 Risky), splitting on Credit, Income, and Term.] For each leaf node, set ŷ = majority value.

Greedy decision tree learning
• Step 1: Start with an empty tree
• Step 2: Select a feature to split the data (pick the feature whose split leads to the lowest classification error)
• For each split of the tree:
  • Step 3: If there is nothing more to split on, make predictions (stopping conditions 1 & 2)
  • Step 4: Otherwise, go to Step 2 and continue (recurse) on this split

Scoring a loan application
xi = (Credit = poor, Income = high, Term = 5 years)
[Figure: traversing the learned tree — Credit = poor → Income = high → Term = 5 years → leaf labeled Safe.] Prediction: ŷi = Safe.

Decision trees vs logistic regression: Example

Logistic regression
Feature   Value   Weight learned
h0(x)     1        0.22
h1(x)     x[1]     1.12
h2(x)     x[2]    -1.07

Depth 1: Split on x[1]
[Figure: root node (18 −, 13 +) split on x[1]: the branch x[1] < -0.07 contains 13 −, 3 +; the branch x[1] >= -0.07 contains 4 −, 11 +.]

Depth 2
[Figure: each depth-1 branch is split again — the left branch on x[1] at -1.66, the right branch on x[2] at 1.55.]

Threshold split caveat
For threshold splits, the same feature can be used multiple times (here x[1] is split on at both depth 1 and depth 2).

Decision boundaries
[Figure: decision boundaries of trees of depth 1, depth 2, and depth 10.]

Comparing decision boundaries
[Figure: decision tree boundaries at depth 1, 3, and 10 vs. logistic regression boundaries with degree 1, 2, and 6 features.]

Predicting probabilities with decision trees
[Figure: loan-status tree (root: 18 Safe, 12 Risky) split on Credit into excellent (9 Safe, 2 Risky → Safe), fair (6 Safe, 9 Risky → Risky), and poor (3 Safe, 1 Risky → Safe) leaves.] For the leaf with 3 Safe and 1 Risky examples, P(y = Safe | x) = 3 / (3 + 1) = 0.75.

Depth 1 probabilities
[Figure: class probabilities in each region of the depth-1 tree from the earlier example.]

Depth 2 probabilities
[Figure: class probabilities in each region of the depth-2 tree from the earlier example.]

Comparison with logistic regression
[Figure: class probability surfaces for logistic regression with degree-2 features vs. a decision tree of depth 2.]

Overfitting in decision trees

What happens when we increase depth?
Training error reduces with depth:
Tree depth       1      2      3      5      10
Training error   0.22   0.13   0.10   0.03   0.00
[Figure: the corresponding decision boundaries, increasingly complex with depth.]
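The depth table above can be reproduced in spirit with any tree learner; below is a minimal sketch using scikit-learn on a synthetic 2-D dataset (the dataset, split sizes, and depth grid are assumptions for illustration, not the lecture's actual data):

```python
# Sketch: training error shrinks with depth while validation error
# eventually rises -- the overfitting pattern discussed above.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for depth in [1, 2, 3, 5, 10]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_err = 1 - tree.score(X_train, y_train)  # classification error = 1 - accuracy
    val_err = 1 - tree.score(X_val, y_val)
    print(f"depth={depth:2d}  train error={train_err:.2f}  validation error={val_err:.2f}")
```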
Two approaches to picking simpler trees
1. Early stopping: stop the learning algorithm before the tree becomes too complex.
2. Pruning: simplify the tree after the learning algorithm terminates.

Technique 1: Early stopping
• Stopping conditions (recap):
  1. All examples have the same target value.
  2. There are no more features to split on.
• Early stopping conditions:
  1. Limit tree depth (choose max_depth using a validation set).
  2. Do not consider splits that do not cause a sufficient decrease in classification error.
  3. Do not split an intermediate node which contains too few data points.

Challenge with early stopping condition 1
Hard to know exactly when to stop. Also, we might want some branches of the tree to go deeper while others remain shallow.
[Figure: classification error vs. tree depth — training error keeps decreasing from simple to complex trees, while true error reaches a minimum near the ideal max_depth and then rises.]

Early stopping condition 2: Pros and cons
• Pros:
  - A reasonable heuristic for early stopping to avoid useless splits.
• Cons:
  - Too short-sighted: we may miss "good" splits that occur right after "useless" splits.
  - We saw this with the "xor" example.

Two approaches to picking simpler trees
1. Early stopping: stop the learning algorithm before the tree becomes too complex.
2. Pruning: simplify the tree after the learning algorithm terminates. Complements early stopping.

Pruning: Intuition
Train a complex tree, then simplify it later (complex tree → simpler tree).

Pruning motivation
[Figure: classification error vs. tree depth — rather than stopping too early, grow the tree and simplify it after it is built.]

Scoring trees: Desired total quality format
Want to balance:
  i.  How well the tree fits the data
  ii. The complexity of the tree
Total cost = measure of fit + measure of complexity
  - Measure of fit (classification error): a large value means a bad fit to the training data.
  - Measure of complexity: a large value means the tree is likely to overfit.

Simple measure of complexity of tree
L(T) = number of leaf nodes.
[Figure: a tree that splits on Credit into five leaves (excellent, good, fair, bad, poor) has L(T) = 5.]

Balance simplicity & predictive power
[Figure: two extremes — the full loan tree from earlier is too complex, with a risk of overfitting; a single root node that always predicts Risky is too simple, with high classification error.]

Balancing fit and complexity
Total cost: C(T) = Error(T) + λ L(T), where λ ≥ 0 is a tuning parameter.
• If λ = 0: only the fit matters, and we recover standard (unpruned) tree learning.
• If λ = ∞: only complexity matters, and the tree collapses to a single leaf predicting the majority class.
• If λ is in between: the total cost balances training fit against tree complexity.
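The total cost can be computed directly from a tree's training error and leaf count. Below is a minimal sketch of the C(T) = Error(T) + λ L(T) comparison; the error and leaf numbers follow the worked pruning example in the next steps, and the function name is illustrative, not from the lecture:

```python
# Sketch: compare total cost C(T) = Error(T) + lambda * L(T) for a full tree
# and a candidate pruned tree, as in the pruning example that follows.
def total_cost(error, num_leaves, lam):
    """Total cost = classification error + lambda * (number of leaf nodes)."""
    return error + lam * num_leaves

lam = 0.03  # tuning parameter used in the worked example below
c_full = total_cost(error=0.25, num_leaves=6, lam=lam)     # tree T
c_smaller = total_cost(error=0.26, num_leaves=5, lam=lam)  # T_smaller: split replaced by a leaf

print(f"C(T) = {c_full:.2f}, C(T_smaller) = {c_smaller:.2f}")
print("Prune!" if c_smaller <= c_full else "Keep the split.")
# C(T) = 0.43, C(T_smaller) = 0.41 -> prune: worse training error, lower total cost
```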
Tree pruning algorithm

Step 1: Consider a split
[Figure: tree T, the loan tree from before; the lower Term? split is the candidate for pruning.]

Step 2: Compute total cost C(T) of the split
With λ = 0.03 and C(T) = Error(T) + λ L(T):
Tree       Error   #Leaves   Total
T          0.25    6         0.43

Step 3: "Undo" the split in Tsmaller
[Figure: tree Tsmaller, in which the candidate Term? split is replaced by a single leaf node.]
Tree       Error   #Leaves   Total
T          0.25    6         0.43
Tsmaller   0.26    5         0.41
Replace the split by a leaf node?

Step 4: Prune if the total cost is lower: C(Tsmaller) ≤ C(T)
Tsmaller has worse training error but lower overall cost, so replace the split by a leaf node? YES!

Step 5: Repeat Steps 1-4 for every split
Decide whether each split can be "pruned".

Summary of overfitting in decision trees

What you can do now…
• Identify when overfitting occurs in decision trees
• Prevent overfitting with early stopping
  - Limit tree depth
  - Do not consider splits that do not reduce classification error
  - Do not split intermediate nodes with only a few points
• Prevent overfitting by pruning complex trees
  - Use a total cost formula that balances classification error and tree complexity
  - Use the total cost to merge potentially complex trees into simpler ones

Boosting
CSE 446: Machine Learning. Emily Fox, University of Washington, January 30, 2017. ©2017 Emily Fox.

Simple (weak) classifiers are good!
[Figure: a decision stump — Income > $100K? Yes → Safe, No → Risky.]
Logistic regression; shallow decision w.
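The stump in the last slide can be written out as a tiny standalone classifier. This is a hypothetical sketch mirroring the Income > $100K example; the function name and sample applicants are illustrative, not code from the course:

```python
# Sketch: a decision stump -- the kind of simple (weak) classifier boosting builds on.
# The income feature and $100K threshold mirror the slide's example.
def income_stump(income_dollars: float) -> str:
    """Predict loan safety from a single threshold on income."""
    return "Safe" if income_dollars > 100_000 else "Risky"

applicants = [{"income": 120_000}, {"income": 45_000}]
for a in applicants:
    print(a["income"], "->", income_stump(a["income"]))
```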