Decision Trees: Overfitting STAT/CSE 416: Machine Learning Emily Fox University of Washington April 26, 2018
Decision trees recap
Decision tree recap
[Figure: decision tree on loan data. Root: 22 Safe, 18 Risky. Split on Credit?: excellent (9 Safe, 0 Risky) → Safe; fair (9 Safe, 4 Risky) → split on Term?: 3 years (0, 4) → Risky, 5 years (9, 0) → Safe; poor (4 Safe, 14 Risky) → split on Income?: high (4, 5) → split on Term?: 3 years (0, 2) → Risky, 5 years (4, 3) → Safe; low (0, 9) → Risky. For each leaf node, set ŷ = majority value.]
Greedy decision tree learning
• Step 1: Start with an empty tree
• Step 2: Select a feature to split the data: pick the feature whose split leads to the lowest classification error
• For each split of the tree:
  - Step 3: If there is nothing more to do (stopping conditions 1 & 2), make predictions
  - Step 4: Otherwise, go to Step 2 & continue (recurse) on this split
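To make the recursion concrete, here is a minimal Python sketch of this greedy procedure (illustrative only, not the course's code; it assumes categorical features, represents a tree as either a leaf label or a (feature, branches) tuple, and all helper names are made up):

```python
from collections import Counter

def majority(data):
    # Majority y value among (features, label) pairs
    return Counter(label for _, label in data).most_common(1)[0][0]

def split_error(data, feature):
    # Classification error if we split on `feature` and predict the
    # majority value within each branch
    branches = {}
    for x, y in data:
        branches.setdefault(x[feature], []).append((x, y))
    mistakes = sum(sum(y != majority(b) for _, y in b)
                   for b in branches.values())
    return mistakes / len(data)

def build_tree(data, features):
    if len({y for _, y in data}) == 1 or not features:  # stopping conditions 1 & 2
        return majority(data)                           # leaf: predict majority value
    best = min(features, key=lambda f: split_error(data, f))  # lowest-error split
    branches = {}
    for x, y in data:
        branches.setdefault(x[best], []).append((x, y))
    rest = [f for f in features if f != best]
    return (best, {v: build_tree(b, rest) for v, b in branches.items()})  # recurse
```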
Stopping condition 1: All data agrees on y
All data in these nodes have the same y value → nothing to do.
[Figure: the loan-status tree from the recap, with the leaves whose counts contain only one y value, e.g. excellent (9 Safe, 0 Risky) and low income (0 Safe, 9 Risky), highlighted.]
Stopping condition 2: Already split on all features
Already split on all possible features → nothing to do.
[Figure: the same loan-status tree; along each highlighted path, every feature has already been used.]
Scoring a loan application
xi = (Credit = poor, Income = high, Term = 5 years)
Traverse the tree from the root: Credit = poor → Income?; Income = high → Term?; Term = 5 years → Safe.
ŷi = Safe
Predicting probabilities with decision trees
[Figure: tree on loan data. Root: 18 Safe, 12 Risky. Split on Credit?: excellent (9 Safe, 2 Risky) → Safe; fair (6 Safe, 9 Risky) → Risky; poor (3 Safe, 1 Risky) → Safe.]
For an input x that reaches the poor leaf:
P(y = Safe | x) = 3 / (3 + 1) = 0.75
Comparison with logistic regression
Logistic Regression (Degree 2) vs. Decision Trees (Depth 2)
[Figure: side-by-side plots of the class predictions and the probability surfaces for the two models.]
Overfitting in decision trees
What happens when we increase depth?
Training error reduces with depth
Tree depth        depth = 1   depth = 2   depth = 3   depth = 5   depth = 10
Training error    0.22        0.13        0.10        0.03        0.00

[Figure: the decision boundary at each depth, growing more complex as depth increases.]
Two approaches to picking simpler trees
1. Early Stopping: Stop the learning algorithm before tree becomes too complex
2. Pruning: Simplify the tree after the learning algorithm terminates
Technique 1: Early stopping
• Stopping conditions (recap):
  1. All examples have the same target value
  2. No more features to split on
• Early stopping conditions:
  1. Limit tree depth (choose max_depth using validation set)
  2. Do not consider splits that do not cause a sufficient decrease in classification error
  3. Do not split an intermediate node which contains too few data points
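As a sketch, the three early stopping conditions might be bundled into a single check inside the tree-building recursion (illustrative only; the function name and default thresholds are made up):

```python
def stop_early(data, depth, error_reduction,
               max_depth=10, min_error_reduction=0.0, min_node_size=10):
    # Early stopping condition 1: limit tree depth
    if depth >= max_depth:
        return True
    # Condition 2: best split does not decrease classification error enough
    if error_reduction <= min_error_reduction:
        return True
    # Condition 3: node contains too few data points to split further
    if len(data) <= min_node_size:
        return True
    return False
```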
Challenge with early stopping condition 1
Hard to know exactly when to stop
Also, we might want some branches of the tree to go deeper while others remain shallow.
[Figure: classification error vs. tree depth, from simple trees to complex trees; training error keeps decreasing while true error turns back up, and max_depth marks where the tree should stop growing.]
Early stopping condition 2: Pros and Cons
• Pros: - A reasonable heuristic for early stopping to avoid useless splits
• Cons:
  - Too short-sighted: we may miss out on "good" splits that occur right after "useless" splits
  - Saw this with the "xor" example
Two approaches to picking simpler trees
1. Early Stopping: Stop the learning algorithm before tree becomes too complex
2. Pruning: Simplify the tree after the learning algorithm terminates
Complements early stopping
Pruning: Intuition
Train a complex tree, simplify later.
[Figure: a complex tree simplified into a simpler tree.]
Pruning motivation
Simplify after the tree is built; don't stop too early.
[Figure: classification error vs. tree depth, from simple tree to complex tree; training error keeps falling while true error rises once the tree gets too complex.]
Scoring trees: Desired total quality format
Want to balance: i. How well tree fits data ii. Complexity of tree
Total cost = measure of fit + measure of complexity
- Measure of fit (classification error): large # = bad fit to training data
- Measure of complexity: large # = likely to overfit
Simple measure of complexity of tree
L(T) = # of leaf nodes
[Figure: example tree splitting on Credit? into excellent, good, fair, bad, and poor, with leaves Safe, Safe, Safe, Risky, Risky, so L(T) = 5.]
Balance simplicity & predictive power
Too simple (high classification error): a tree with only the root node, e.g. always predicting Risky.
Too complex (risk of overfitting): the full tree, with Credit?, Income?, and Term? splits.
[Figure: the two trees side by side.]
Balancing fit and complexity
Total cost C(T) = Error(T) + λ L(T)
λ is a tuning parameter.
If λ = 0: no complexity penalty; standard decision tree learning (the most complex tree, lowest training error).
If λ = ∞: the complexity penalty dominates; the tree collapses to the root node (majority-class prediction).
If λ in between: balances fit and complexity of the tree.
Tree pruning algorithm
Step 1: Consider a split
[Figure: Tree T. Start → Credit?: excellent → Safe; the fair and poor branches split further on Term? and Income?, with a second Term? split deeper in the tree. That deepest Term? split is the candidate for pruning.]
Step 2: Compute total cost C(T) of split
λ = 0.03

Tree   Error   #Leaves   Total
T      0.25    6         0.43

C(T) = Error(T) + λ L(T) = 0.25 + 0.03 × 6 = 0.43
[Figure: Tree T, with the candidate split still in place.]
Step 3: "Undo" the splits on Tsmaller
λ = 0.03

Tree       Error   #Leaves   Total
T          0.25    6         0.43
Tsmaller   0.26    5         0.41

C(Tsmaller) = 0.26 + 0.03 × 5 = 0.41
[Figure: Tree Tsmaller, with the candidate Term? split replaced by a Safe leaf node.]
Replace split by leaf node?
Step 4: Prune if total cost is lower: C(Tsmaller) ≤ C(T)
Tsmaller has worse training error but lower overall cost:

Tree       Error   #Leaves   Total
T          0.25    6         0.43
Tsmaller   0.26    5         0.41

Replace split by leaf node? YES!
Step 5: Repeat Steps 1-4 for every split
Decide if each split can be "pruned".
[Figure: the tree after pruning the candidate split; every remaining split (Credit?, Term?, Income?) is checked in turn.]
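Putting steps 1 through 5 together, a bottom-up pruning pass might look like the following sketch (illustrative only; it reuses the leaf-or-(feature, branches) tree representation and the `majority` helper from the earlier sketch, and evaluates the cost locally at each node):

```python
def num_leaves(tree):
    if not isinstance(tree, tuple):
        return 1
    return sum(num_leaves(b) for b in tree[1].values())

def predict(tree, x):
    while isinstance(tree, tuple):   # assumes every feature value was seen in training
        feature, branches = tree
        tree = branches[x[feature]]
    return tree

def cost(tree, data, lam):
    # C(T) = classification error + lambda * number of leaves
    error = sum(predict(tree, x) != y for x, y in data) / len(data)
    return error + lam * num_leaves(tree)

def prune(tree, data, lam):
    if not isinstance(tree, tuple):
        return tree                  # leaf: nothing to prune
    feature, branches = tree
    tree = (feature, {v: prune(b, [(x, y) for x, y in data if x[feature] == v], lam)
                      for v, b in branches.items()})  # prune deeper splits first
    leaf = majority(data)            # candidate: replace the split by a leaf node
    return leaf if cost(leaf, data, lam) <= cost(tree, data, lam) else tree
```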
Summary of overfitting in decision trees
What you can do now…
• Identify when overfitting occurs in decision trees
• Prevent overfitting with early stopping
  - Limit tree depth
  - Do not consider splits that do not reduce classification error
  - Do not split intermediate nodes with only few points
• Prevent overfitting by pruning complex trees
  - Use a total cost formula that balances classification error and tree complexity
  - Use total cost to prune potentially complex trees into simpler ones
Boosting STAT/CSE 416: Machine Learning Emily Fox University of Washington April 26, 2018
Simple (weak) classifiers are good!
[Figure: decision stump. Income>$100K? Yes → Safe, No → Risky.]
Examples of weak classifiers: logistic regression w. simple features, shallow decision trees, decision stumps.
Low variance. Learning is fast!
But high bias…
Finding a classifier that's just right
[Figure: classification error vs. model complexity; train error keeps falling while true error turns back up. A weak learner sits at the too-simple end, so we need a stronger learner.]
Option 1: add more features or depth
Option 2: ?????
Boosting question
“Can a set of weak learners be combined to create a stronger learner?” Kearns and Valiant (1988)
Yes! Schapire (1990)
Boosting
Amazing impact: a simple approach that is widely used in industry and wins most Kaggle competitions
Ensemble classifier
A single classifier
Input: x
[Figure: decision stump classifier. Income>$100K? Yes → Safe, No → Risky.]
Output: ŷ = f(x), either +1 or -1
Ensemble methods: Each classifier “votes” on prediction
xi = (Income=$120K, Credit=Bad, Savings=$50K, Market=Good)
Income>$100K? (Yes → Safe, No → Risky):         f1(xi) = +1
Credit history? (Bad → Risky, Good → Safe):     f2(xi) = -1
Savings>$100K? (Yes → Safe, No → Risky):        f3(xi) = -1
Market conditions? (Bad → Risky, Good → Safe):  f4(xi) = +1
Combine? Learn an ensemble model with coefficients:
F(xi) = sign(w1 f1(xi) + w2 f2(xi) + w3 f3(xi) + w4 f4(xi))
Prediction with ensemble
The same four decision stumps vote: f1(xi) = +1, f2(xi) = -1, f3(xi) = -1, f4(xi) = +1.
Learned coefficients: w1 = 2, w2 = 1.5, w3 = 1.5, w4 = 0.5
F(xi) = sign(w1 f1(xi) + w2 f2(xi) + w3 f3(xi) + w4 f4(xi))
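Plugging these numbers in gives a worked check of the prediction:
F(xi) = sign(2·(+1) + 1.5·(−1) + 1.5·(−1) + 0.5·(+1)) = sign(−0.5) = −1
so the ensemble predicts Risky, even though two of the four stumps voted Safe: the two Risky votes carry more total weight (3.0 vs. 2.5).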
Ensemble classifier in general
• Goal: predict output y (either +1 or -1) from input x
• Learn ensemble model:
  - Classifiers: f1(x), f2(x), …, fT(x)
  - Coefficients: ŵ1, ŵ2, …, ŵT
• Prediction:
  ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )
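In code, the general prediction rule is essentially one line (a minimal sketch; `classifiers` is assumed to be a list of functions returning +1 or -1, and `coefficients` the corresponding ŵt values):

```python
def ensemble_predict(x, classifiers, coefficients):
    # y_hat = sign( sum_t w_t * f_t(x) )
    score = sum(w * f(x) for f, w in zip(classifiers, coefficients))
    return +1 if score > 0 else -1
```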
Boosting
Training a classifier
Training data → learn classifier f(x) → predict ŷ = sign(f(x))
Learning decision stump
Credit   Income   y
A        $130K    Safe
B        $80K     Risky
C        $110K    Risky
A        $110K    Safe
A        $90K     Safe
B        $120K    Safe
C        $30K     Risky
C        $60K     Risky
B        $95K     Safe
A        $60K     Safe
A        $98K     Safe

Split on Income?: > $100K → (3 Safe, 1 Risky), ŷ = Safe; ≤ $100K → (4 Safe, 3 Risky), ŷ = Safe
Boosting = Focus learning on “hard” points
Training data → learn classifier f(x) → predict ŷ = sign(f(x)) → evaluate.
Boosting: learn where f(x) makes mistakes, and focus the next classifier on the places where f(x) does less well.
Learning on weighted data: More weight on “hard” or more important points
• Weighted dataset:
- Each xi,yi weighted by αi
• More important point = higher weight αi
• Learning:
- Data point i counts as αi data points
• E.g., αi = 2 → count point twice
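Under this view, the weighted classification error is just the total weight of the mistakes divided by the total weight of all points; a minimal sketch:

```python
def weighted_error(predictions, labels, alpha):
    # Total weight of misclassified points / total weight of all points
    mistakes = sum(a for p, y, a in zip(predictions, labels, alpha) if p != y)
    return mistakes / sum(alpha)
```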
Learning a decision stump on weighted data
Increase weight α of harder/misclassified points
Credit   Income   y       Weight α
A        $130K    Safe    0.5
B        $80K     Risky   1.5
C        $110K    Risky   1.2
A        $110K    Safe    0.8
A        $90K     Safe    0.6
B        $120K    Safe    0.7
C        $30K     Risky   3
C        $60K     Risky   2
B        $95K     Safe    0.8
A        $60K     Safe    0.7
A        $98K     Safe    0.9

Split on Income?: > $100K → weighted counts (2 Safe, 1.2 Risky), ŷ = Safe; ≤ $100K → (3 Safe, 6.5 Risky), ŷ = Risky
Boosting = Greedy learning ensembles from data
Training data → learn classifier f1(x) → predict ŷ = sign(f1(x)).
Give higher weight to the points where f1(x) is wrong → weighted data.
Learn the next classifier & coefficient on the weighted data: ŵ2, f2(x).
Predict ŷ = sign(ŵ1 f1(x) + ŵ2 f2(x)).
AdaBoost algorithm
AdaBoost: learning ensemble [Freund & Schapire 1999]
• Start with same weight for all points: αi = 1/N
• For t = 1,…,T
- Learn ft(x) with data weights αi
- Compute coefficient ŵt
- Recompute weights αi
• Final model predicts by: T
yˆ = sign wˆ tft(x) t=1 ! 49 X©2018 Emily Fox STAT/CSE 416: Intro to Machine Learning
Computing coefficient ŵt
AdaBoost: Computing coefficient ŵt of classifier ft(x)
Is ft(x) good? Yes → ŵt large; No → ŵt small
• ft(x) is good → ft has low training error
• Measuring error in weighted data? - Just weighted # of misclassified points
Weighted classification error
Learned classifier ŷ = f(x); hide each label, predict, then check.
- Data point (Sushi was great, α = 1.2): correct → weight of correct = 1.20
- Data point (Food was OK, α = 0.5): mistake → weight of mistakes = 0.50
Weighted error = total weight of mistakes / total weight of all points.
AdaBoost: Formula for computing coefficient ŵt of classifier ft(x)

ŵt = (1/2) ln( (1 − weighted_error(ft)) / weighted_error(ft) )

with weighted_error(ft) measured on the training data. Is ft(x) good (small weighted_error(ft))? Yes → ŵt large; No → ŵt small.
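For example, with the weighted error of 0.2 that appears in the stump example later: ŵt = (1/2) ln(0.8/0.2) = (1/2) ln 4 ≈ 0.69. A classifier at chance (weighted error 0.5) gets ŵt = (1/2) ln 1 = 0, and a classifier worse than chance gets a negative coefficient, so its votes are flipped.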
AdaBoost: learning ensemble
• Start with same weight for all points: αi = 1/N
• For t = 1,…,T
  - Learn ft(x) with data weights αi
  - Compute coefficient ŵt = (1/2) ln( (1 − weighted_error(ft)) / weighted_error(ft) )
  - Recompute weights αi
• Final model predicts by:
  ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )
Recompute weights αi
AdaBoost: Updating weights αi based on where classifier ft(x) makes mistakes
Did ft get xi right? Yes → decrease αi; No → increase αi
AdaBoost: Formula for updating weights αi
αi ← αi e^(−ŵt), if ft(xi) = yi   (correct: multiply αi by e^(−ŵt) < 1, so the weight decreases)
αi ← αi e^(ŵt),   if ft(xi) ≠ yi   (mistake: multiply αi by e^(ŵt) > 1, so the weight increases)
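With ŵt ≈ 0.69 as in the running example, e^(−0.69) ≈ 1/2 and e^(0.69) ≈ 2: correct points have their weights halved and mistaken points have their weights doubled, exactly the factors used in the stump example later.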
AdaBoost: learning ensemble
• Start with same weight for all points: αi = 1/N
• For t = 1,…,T
  - Learn ft(x) with data weights αi
  - Compute coefficient ŵt = (1/2) ln( (1 − weighted_error(ft)) / weighted_error(ft) )
  - Recompute weights: αi ← αi e^(−ŵt) if ft(xi) = yi, αi ← αi e^(ŵt) if ft(xi) ≠ yi
• Final model predicts by:
  ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )
AdaBoost: Normalizing weights αi
If xi is often mistaken, weight αi gets very large; if xi is often correct, weight αi gets very small.
This can cause numerical instability after many iterations.
Normalize the weights to add up to 1 after every iteration:
αi ← αi / Σ_{j=1}^{N} αj
AdaBoost: learning ensemble
• Start with same weight for all points: αi = 1/N
• For t = 1,…,T
  - Learn ft(x) with data weights αi
  - Compute coefficient ŵt = (1/2) ln( (1 − weighted_error(ft)) / weighted_error(ft) )
  - Recompute weights: αi ← αi e^(−ŵt) if ft(xi) = yi, αi ← αi e^(ŵt) if ft(xi) ≠ yi
  - Normalize weights: αi ← αi / Σ_{j=1}^{N} αj
• Final model predicts by:
  ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )
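Here is the whole loop as a compact Python sketch (illustrative only; `learn_weak` is a hypothetical function that takes the data and the current weights αi and returns a classifier mapping x to {+1, -1}):

```python
import math

def adaboost(X, y, T, learn_weak):
    N = len(y)
    alpha = [1.0 / N] * N                       # same weight for all points
    classifiers, coefficients = [], []
    for t in range(T):
        f = learn_weak(X, y, alpha)             # learn f_t with data weights alpha
        err = sum(a for x_i, y_i, a in zip(X, y, alpha)
                  if f(x_i) != y_i) / sum(alpha)
        # Coefficient w_t; assumes 0 < err < 1 (weak learner better than chance)
        w = 0.5 * math.log((1 - err) / err)
        classifiers.append(f)
        coefficients.append(w)
        # Recompute weights: decrease if correct, increase if mistaken
        alpha = [a * math.exp(-w if f(x_i) == y_i else w)
                 for x_i, y_i, a in zip(X, y, alpha)]
        total = sum(alpha)                      # normalize weights to add up to 1
        alpha = [a / total for a in alpha]
    def predict(x):
        score = sum(w * f(x) for f, w in zip(classifiers, coefficients))
        return +1 if score > 0 else -1
    return predict
```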
AdaBoost example: A visualization
t=1: Just learn a classifier on original data
[Figure: the original data (two classes in the plane) and the learned decision stump f1(x).]
Updating weights αi
Increase weight αi of misclassified points
[Figure: the decision stump f1(x) with its boundary, and the new data weights αi; misclassified points are drawn larger.]
t=2: Learn classifier on weighted data
[Figure: the weighted data, using the αi chosen in the previous iteration, and the decision stump f2(x) learned on that weighted data; the boundary of f1(x) is shown for reference.]
Ensemble becomes weighted sum of learned classifiers
ŵ1 f1(x) + ŵ2 f2(x) = 0.61 f1(x) + 0.53 f2(x)
[Figure: the two stump boundaries combine into the ensemble's decision boundary.]
Decision boundary of ensemble classifier after 30 iterations
training_error = 0
AdaBoost example: Boosted decision stumps step-by-step
Boosted decision stumps
• Start with same weight for all points: αi = 1/N
• For t = 1,…,T
  - Learn ft(x): pick decision stump with lowest weighted training error according to αi
  - Compute coefficient ŵt
  - Recompute weights αi
  - Normalize weights αi
• Final model predicts by:
  ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )
Finding best next decision stump ft(x) Consider splitting on each feature:
Income>$100K? (Yes → Safe, No → Risky):         weighted_error = 0.2
Credit history? (Bad → Risky, Good → Safe):     weighted_error = 0.35
Savings>$100K? (Yes → Safe, No → Risky):        weighted_error = 0.3
Market conditions? (Bad → Risky, Good → Safe):  weighted_error = 0.4

Pick the stump with the lowest weighted error: ft = Income>$100K?
ŵt = (1/2) ln( (1 − weighted_error(ft)) / weighted_error(ft) ) = (1/2) ln(0.8/0.2) ≈ 0.69
Boosted decision stumps
• Start with same weight for all points: αi = 1/N
• For t = 1,…,T
  - Learn ft(x): pick decision stump with lowest weighted training error according to αi
  - Compute coefficient ŵt
  - Recompute weights αi
  - Normalize weights αi
• Final model predicts by:
  ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )
Updating weights αi with ŵt = 0.69, where ft = Income>$100K? (Yes → Safe, No → Risky):
αi ← αi e^(−0.69) = αi / 2, if ft(xi) = yi
αi ← αi e^(0.69) = 2 αi, if ft(xi) ≠ yi

Credit   Income   y      ŷ      Previous weight α   New weight α
A        $130K    Safe   Safe   0.5                 0.5/2 = 0.25
B        $80K     Risky  Risky  1.5                 0.75
C        $110K    Risky  Safe   1.5                 2 × 1.5 = 3
A        $110K    Safe   Safe   2                   1
A        $90K     Safe   Risky  1                   2
B        $120K    Safe   Safe   2.5                 1.25
C        $30K     Risky  Risky  3                   1.5
C        $60K     Risky  Risky  2                   1
B        $95K     Safe   Risky  0.5                 1
A        $60K     Safe   Risky  1                   2
A        $98K     Safe   Risky  0.5                 1
Boosting convergence & overfitting
Boosting question revisited
“Can a set of weak learners be combined to create a stronger learner?” Kearns and Valiant (1988)
Yes! Schapire (1990)
Boosting
After some iterations, training error of boosting goes to zero!!!
Training error of 1 decision stump = 22.5%
Training error of ensemble of 30 decision stumps = 0%
[Figure: training error vs. iterations of boosting, for boosted decision stumps on a toy dataset.]
AdaBoost Theorem
Under some technical conditions…
Training error of the boosted classifier → 0 as T → ∞. It may oscillate a bit, but will generally decrease, & eventually become 0!
[Figure: training error vs. iterations of boosting.]
Condition of AdaBoost Theorem
Condition = at every t, we can find a weak learner with weighted_error(ft) < 0.5.
This is not always possible. Extreme example: no classifier can separate a +1 sitting exactly on top of a -1.
Nonetheless, boosting often yields great training error.
Decision trees on loan data: 8% training error, 39% test error → overfitting.
Boosted decision stumps on loan data: 28.5% training error, 32% test error → better fit & lower test error.
Boosting tends to be robust to overfitting
Test set performance stays about constant for many iterations → less sensitive to choice of T
But boosting will eventually overfit, so must choose max number of components T
[Figure: test error vs. number of trees T. Best test error is around 31%; test error eventually increases to 33% (overfits).]
How do we decide when to stop boosting?
Choosing T ?
Like selecting parameters in other ML approaches (e.g., λ in regularization)
- Never on training data (not useful: training error improves as T increases)
- Validation set, if the dataset is large
- Cross-validation, for smaller datasets
- Never ever ever ever on the test data
Summary of boosting
Variants of boosting and related algorithms
There are hundreds of variants of boosting, most important:
• Gradient boosting: like AdaBoost, but useful beyond basic classification
Many other approaches to learn ensembles, most important:
• Bagging ("random forests"): pick random subsets of the data
  - Learn a tree in each subset
  - Average predictions
  - Simpler than boosting & easier to parallelize
  - Typically higher error than boosting for the same # of trees (# of iterations T)
  (a minimal sketch follows below)
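For contrast with boosting, here is a minimal sketch of bagging (illustrative only; `learn_tree` is a hypothetical function that fits one tree to a dataset and returns a classifier mapping x to {+1, -1}). Note each tree is trained independently, which is why bagging parallelizes so easily:

```python
import random

def bagging(X, y, num_trees, learn_tree):
    N = len(y)
    trees = []
    for _ in range(num_trees):
        idx = [random.randrange(N) for _ in range(N)]   # random subset (bootstrap sample)
        trees.append(learn_tree([X[i] for i in idx], [y[i] for i in idx]))
    def predict(x):
        votes = sum(t(x) for t in trees)                # average the +1/-1 predictions
        return +1 if votes > 0 else -1
    return predict
```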
Impact of boosting (spoiler alert... HUGE IMPACT)
Amongst most useful ML methods ever created
• Extremely useful in computer vision: standard approach for face detection, for example
• Used by most winners of ML competitions (Kaggle, KDD Cup, …): malware classification, credit fraud detection, ads click-through rate estimation, sales forecasting, ranking webpages for search, Higgs boson detection, …
• Most deployed ML systems use model ensembles: coefficients chosen manually, with boosting, with bagging, or others
What you can do now…
• Identify the notion of an ensemble classifier
• Formalize ensembles as weighted combinations of simpler classifiers
• Outline the boosting framework: sequentially learn classifiers on weighted data
• Describe the AdaBoost algorithm
  - Learn each classifier on weighted data
  - Compute coefficient of classifier
  - Recompute data weights
  - Normalize weights
• Implement AdaBoost to create an ensemble of decision stumps