
Decision Trees: Overfitting
STAT/CSE 416, University of Washington
Emily Fox, April 26, 2018

©2018 Emily Fox STAT/CSE 416: Intro to Machine Learning

Decision trees recap


Decision tree recap

[Figure: decision tree learned on the loan data. Root (22 Safe, 18 Risky) splits on Credit?:
  - excellent (9 Safe, 0 Risky) → Safe
  - fair (9 Safe, 4 Risky) → Term?: 3 years (0 Safe, 4 Risky) → Risky; 5 years (9 Safe, 0 Risky) → Safe
  - poor (4 Safe, 14 Risky) → Income?: high (4 Safe, 5 Risky) → Term?: 3 years (0 Safe, 2 Risky) → Risky; 5 years (4 Safe, 3 Risky) → Safe; low (0 Safe, 9 Risky) → Risky
For each leaf node, set ŷ = majority value.]


Greedy

• Step 1: Start with an empty tree
• Step 2: Select a feature to split the data (pick the feature whose split gives the lowest classification error)
• For each split of the tree:
  - Step 3: If there is nothing more to do (stopping conditions 1 & 2), make predictions
  - Step 4: Otherwise, go to Step 2 and continue (recurse) on this split
(A sketch of this greedy recursion follows below.)
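A minimal sketch of the greedy procedure in Python, assuming data points are dicts of feature values; helper names such as classification_error and greedy_tree are illustrative, not the course's reference code:

from collections import Counter

def majority_value(labels):
    # ŷ at a leaf = majority value of the labels reaching it
    return Counter(labels).most_common(1)[0][0]

def classification_error(labels):
    # fraction of points that disagree with the majority prediction
    if not labels:
        return 0.0
    majority_count = Counter(labels).most_common(1)[0][1]
    return 1.0 - majority_count / len(labels)

def greedy_tree(data, labels, features):
    # Stopping condition 1: all data agree on y
    # Stopping condition 2: no more features to split on
    if len(set(labels)) == 1 or not features:
        return {"leaf": True, "prediction": majority_value(labels)}

    # Step 2: pick the feature whose split gives the lowest classification error
    def split_error(feature):
        groups = {}
        for x, y in zip(data, labels):
            groups.setdefault(x[feature], []).append(y)
        n = len(labels)
        return sum(len(g) / n * classification_error(g) for g in groups.values())

    best_feature = min(features, key=split_error)

    # Steps 3-4: recurse on each branch of the chosen split
    children = {}
    for value in set(x[best_feature] for x in data):
        subset = [(x, y) for x, y in zip(data, labels) if x[best_feature] == value]
        remaining = [f for f in features if f != best_feature]
        children[value] = greedy_tree([x for x, _ in subset],
                                      [y for _, y in subset], remaining)
    return {"leaf": False, "feature": best_feature, "children": children}

For the loan data above, a call like greedy_tree(points, ys, ["Credit", "Income", "Term"]) (with hypothetical points/ys lists) should produce a tree shaped like the figure.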


Stopping condition 1: All data agree on y
If all data points in a node have the same y value, there is nothing more to do: the node becomes a leaf.
[Figure: the same loan-data tree as above; nodes such as Credit = excellent (9 Safe, 0 Risky) already agree on y.]


Stopping condition 2: Already split on all features
If we have already split on every possible feature along this branch, there is nothing more to do: the node becomes a leaf.
[Figure: the same loan-data tree as above.]


Scoring a loan application

xi = (Credit = poor, Income = high, Term = 5 years)

[Figure: traversing the tree from the Start node: Credit = poor → Income?; Income = high → Term?; Term = 5 years → Safe.]

ŷi = Safe


Predicting probabilities with decision trees

[Figure: tree with Root (18 Safe, 12 Risky) split on Credit?: excellent (9 Safe, 2 Risky) → Safe; fair (6 Safe, 9 Risky) → Risky; poor (3 Safe, 1 Risky) → Safe.]

For the 'poor' leaf: P(y = Safe | x) = 3 / (3 + 1) = 0.75
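As a small check of this computation (illustrative only), the predicted probability at a leaf is just the fraction of Safe examples that reach it:

def leaf_probability(n_safe, n_risky):
    # P(y = Safe | x) at a leaf = #Safe / (#Safe + #Risky)
    return n_safe / (n_safe + n_risky)

print(leaf_probability(3, 1))  # 0.75, the 'poor' leaf above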


Comparison with logistic regression

[Figure: class predictions and probability surfaces compared side by side for logistic regression (degree 2 features) and a decision tree (depth 2).]


Overfitting in decision trees


What happens when we increase depth?

Training error reduces with depth

Tree depth:     depth = 1   depth = 2   depth = 3   depth = 5   depth = 10
Training error: 0.22        0.13        0.10        0.03        0.00

[Figure: the corresponding decision boundaries grow increasingly complex with depth.]


Two approaches to picking simpler trees

1. Early Stopping: Stop the learning algorithm before tree becomes too complex

2. Pruning: Simplify the tree after the learning algorithm terminates


Technique 1: Early stopping

• Stopping conditions (recap):
  1. All examples have the same target value
  2. No more features to split on

• Early stopping conditions:
  1. Limit tree depth (choose max_depth using a validation set)
  2. Do not consider splits that do not cause a sufficient decrease in classification error
  3. Do not split an intermediate node which contains too few data points
(These map onto standard hyperparameters; see the sketch below.)
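A hedged sketch of these conditions as hyperparameters; the parameter names below belong to scikit-learn's DecisionTreeClassifier and are an assumption of this example, not part of the slides:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# 1. limit tree depth                       -> max_depth
# 2. require a sufficient impurity decrease -> min_impurity_decrease
# 3. do not split nodes with few points     -> min_samples_split
tree = DecisionTreeClassifier(
    max_depth=6,
    min_impurity_decrease=1e-3,
    min_samples_split=20,
)

# Choose max_depth (and the other knobs) on held-out data,
# e.g. with cross-validated grid search.
search = GridSearchCV(tree, {"max_depth": [2, 4, 6, 8, 10]}, cv=5)
# search.fit(X_train, y_train)  # X_train, y_train assumed to exist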


Challenge with early stopping condition 1

Hard to know exactly when to stop

Also, we might want some branches of the tree to go deeper while others remain shallow.

[Figure: classification error vs. tree depth, from simple trees (left) to complex trees (right). Training error keeps decreasing with depth, while true error first decreases and then increases; the chosen max_depth sits somewhere in between.]


Early stopping condition 2: Pros and Cons

• Pros: - A reasonable heuristic for early stopping to avoid useless splits

• Cons:
  - Too short-sighted: we may miss “good” splits that occur right after “useless” splits
  - We saw this with the “xor” example


Two approaches to picking simpler trees

1. Early Stopping: Stop the learning algorithm before tree becomes too complex

2. Pruning: Simplify the tree after the learning algorithm terminates

Complements early stopping


Pruning: Intuition
Train a complex tree, simplify later

[Figure: a complex tree is simplified into a smaller tree.]


Pruning motivation

Don't stop too early: simplify after the tree is built.

[Figure: classification error vs. tree depth, from simple trees to complex trees. Training error keeps decreasing while true error eventually rises; instead of stopping early, grow the complex tree and then prune it back.]


Scoring trees: Desired total quality format

Want to balance:
  i.  How well the tree fits the data
  ii. The complexity of the tree

Total cost = measure of fit + measure of complexity

The measure of fit is the classification error: a large value means a bad fit to the training data. The measure of complexity penalizes large trees: a large value means the tree is likely to overfit.

Simple measure of complexity of tree

L(T) = # of leaf nodes

[Figure: a tree that splits on Credit? into excellent, good, fair, bad, and poor, each branch ending in a Safe or Risky leaf; here L(T) = 5.]


Balance simplicity & predictive power

[Figure: two extremes. A tree consisting of only the root node is too simple and has high classification error. The full deep tree (splitting on Credit?, then Income? and Term?) is too complex and risks overfitting.]


Balancing fit and complexity

Total cost C(T) = Error(T) + λ L(T)

λ is a tuning parameter.

If λ = 0: only the training error matters, so we recover the standard (possibly overfit) tree.

If λ = ∞: the complexity penalty dominates, so we end up with a single-leaf tree that predicts the majority class.

If λ is in between: we trade off fit to the training data against the complexity of the tree.
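A one-line check of the formula, using the error, leaf counts, and λ from the pruning example later in this deck:

def total_cost(error, num_leaves, lam):
    # C(T) = Error(T) + λ · L(T)
    return error + lam * num_leaves

print(round(total_cost(0.25, 6, 0.03), 2))  # 0.43 for the full tree T
print(round(total_cost(0.26, 5, 0.03), 2))  # 0.41 for the smaller tree T_smaller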


Tree pruning algorithm


Step 1: Consider a split

[Figure: tree T, the full loan-data tree (Credit?, then Income? and Term? splits). The lower Term? split, under Income = high, is marked as the candidate for pruning.]


Step 2: Compute total cost C(T) of split

C(T) = Error(T) + λ L(T), with λ = 0.03:

Tree | Error | #Leaves | Total
T    | 0.25  | 6       | 0.43

[Figure: tree T as before, with the lower Term? split still marked as the candidate for pruning.]


Step 3: “Undo” the splits on Tsmaller

With λ = 0.03:

Tree     | Error | #Leaves | Total
T        | 0.25  | 6       | 0.43
Tsmaller | 0.26  | 5       | 0.41

[Figure: tree Tsmaller, in which the candidate Term? split has been replaced by a single Safe leaf.]

Replace split by leaf node?


Step 4: Prune if total cost is lower: C(Tsmaller) ≤ C(T)

Tsmaller has worse training error (0.26 vs. 0.25) but lower overall cost (0.41 vs. 0.43).

Replace split by leaf node? YES!

Step 5: Repeat Steps 1-4 for every split

Decide whether each split can be “pruned”.

[Figure: the pruned loan-data tree; every remaining split is evaluated the same way.]
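A minimal sketch of this bottom-up pruning pass in Python, assuming the dict-based tree representation from the greedy-learning sketch earlier; for simplicity the cost of each subtree is computed on the data reaching it, which is a simplification of the slides' global training error:

from collections import Counter

def majority_value(labels):
    return Counter(labels).most_common(1)[0][0]

def num_leaves(node):
    # L(T) = number of leaf nodes
    if node["leaf"]:
        return 1
    return sum(num_leaves(c) for c in node["children"].values())

def predict(node, x):
    # assumes every feature value seen at prediction time appeared in training
    while not node["leaf"]:
        node = node["children"][x[node["feature"]]]
    return node["prediction"]

def subtree_error(node, data, labels):
    mistakes = sum(predict(node, x) != y for x, y in zip(data, labels))
    return mistakes / len(labels) if labels else 0.0

def prune(node, data, labels, lam):
    if node["leaf"]:
        return node
    # Step 5: recurse so that every split is considered, bottom-up.
    for value in list(node["children"]):
        subset = [(x, y) for x, y in zip(data, labels)
                  if x[node["feature"]] == value]
        node["children"][value] = prune(node["children"][value],
                                        [x for x, _ in subset],
                                        [y for _, y in subset], lam)
    # Step 2: total cost with the split kept.
    cost_keep = subtree_error(node, data, labels) + lam * num_leaves(node)
    # Step 3: "undo" the split: a single leaf predicting the majority value.
    leaf = {"leaf": True, "prediction": majority_value(labels)}
    cost_prune = subtree_error(leaf, data, labels) + lam * 1
    # Step 4: prune if C(T_smaller) <= C(T).
    return leaf if cost_prune <= cost_keep else node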


Summary of overfitting in decision trees


What you can do now…

• Identify when overfitting occurs in decision trees
• Prevent overfitting with early stopping
  - Limit tree depth
  - Do not consider splits that do not reduce classification error
  - Do not split intermediate nodes with only a few points
• Prevent overfitting by pruning complex trees
  - Use a total cost formula that balances classification error and tree complexity
  - Use total cost to merge potentially complex trees into simpler ones


Boosting
STAT/CSE 416: Machine Learning
Emily Fox, University of Washington, April 26, 2018


Simple (weak) classifiers are good!

[Figure: a decision stump: Income>$100K? Yes → Safe, No → Risky.]

Examples of simple (weak) classifiers: logistic regression with simple features, shallow decision trees, decision stumps.

Low variance. Learning is fast! But high bias…


Finding a classifier that’s just right

[Figure: classification error vs. model complexity. Training error decreases with complexity while true error eventually rises; a weak learner sits at the low-complexity end, so we need a stronger learner.]

Option 1: add more features or depth
Option 2: ?????

Boosting question

“Can a set of weak learners be combined to create a stronger learner?” Kearns and Valiant (1988)

Yes! Schapire (1990)

Boosting

Amazing impact:
• simple approach
• widely used in industry
• wins most Kaggle competitions


Ensemble classifier


A single classifier

Input: x

[Figure: a single classifier: Income>$100K? Yes → Safe, No → Risky.]

Output: ŷ = f(x), either +1 or -1



Ensemble methods: Each classifier “votes” on prediction

xi = (Income=$120K, Credit=Bad, Savings=$50K, Market=Good)

[Figure: four weak classifiers voting on xi.
  f1: Income>$100K?      Yes → Safe, No → Risky    ⇒ f1(xi) = +1
  f2: Credit history?    Bad → Risky, Good → Safe  ⇒ f2(xi) = -1
  f3: Savings>$100K?     Yes → Safe, No → Risky    ⇒ f3(xi) = -1
  f4: Market conditions? Bad → Risky, Good → Safe  ⇒ f4(xi) = +1]

Combine? The ensemble model learns coefficients and predicts:

F(xi) = sign(w1 f1(xi) + w2 f2(xi) + w3 f3(xi) + w4 f4(xi))


Prediction with ensemble

[Figure: the same four weak classifiers, with f1(xi) = +1, f2(xi) = -1, f3(xi) = -1, f4(xi) = +1.]

Learned coefficients: w1 = 2, w2 = 1.5, w3 = 1.5, w4 = 0.5

F(xi) = sign(w1 f1(xi) + w2 f2(xi) + w3 f3(xi) + w4 f4(xi))
      = sign(2 - 1.5 - 1.5 + 0.5) = sign(-0.5), so the ensemble predicts Risky.


Ensemble classifier in general
• Goal: predict output y (either +1 or -1) from input x
• Learn ensemble model:
  - Classifiers: f1(x), f2(x), …, fT(x)
  - Coefficients: ŵ1, ŵ2, …, ŵT
• Prediction (see the sketch below):

  ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )
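A direct translation of this prediction rule, as a sketch (the stump functions and coefficient values come from the voting example above; ties at 0 are mapped to +1 here by convention):

def ensemble_predict(x, classifiers, coefficients):
    # ŷ = sign( Σ_t ŵ_t f_t(x) )
    score = sum(w * f(x) for w, f in zip(coefficients, classifiers))
    return 1 if score >= 0 else -1

f = [lambda x: 1 if x["Income"] > 100 else -1,    # Income > $100K?
     lambda x: -1 if x["Credit"] == "Bad" else 1,  # Credit history?
     lambda x: 1 if x["Savings"] > 100 else -1,    # Savings > $100K?
     lambda x: -1 if x["Market"] == "Bad" else 1]  # Market conditions?
w = [2, 1.5, 1.5, 0.5]
xi = {"Income": 120, "Credit": "Bad", "Savings": 50, "Market": "Good"}
print(ensemble_predict(xi, f, w))  # -1 (Risky), since 2 - 1.5 - 1.5 + 0.5 = -0.5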


Boosting


Training a classifier

[Figure: pipeline: training data → learn classifier f(x) → predict ŷ = sign(f(x)).]


Learning decision stump

Credit | Income | y
A      | $130K  | Safe
B      | $80K   | Risky
C      | $110K  | Risky
A      | $110K  | Safe
A      | $90K   | Safe
B      | $120K  | Safe
C      | $30K   | Risky
C      | $60K   | Risky
B      | $95K   | Safe
A      | $60K   | Safe
A      | $98K   | Safe

Decision stump on Income?:
  > $100K: 3 Safe, 1 Risky → ŷ = Safe
  ≤ $100K: 4 Safe, 3 Risky → ŷ = Safe


Boosting = Focus learning on “hard” points

[Figure: pipeline: training data → learn classifier f(x) → predict ŷ = sign(f(x)) → evaluate: learn where f(x) makes mistakes.]

Boosting: focus the next classifier on the places where f(x) does less well.


Learning on weighted data: More weight on “hard” or more important points

• Weighted dataset:

- Each xi,yi weighted by αi

• More important point = higher weight αi

• Learning:

- Data point i counts as αi data points

• E.g., αi = 2 → count the point twice (see the sketch below)
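Most tree learners accept per-point weights directly; a hedged sketch of the "αi = 2 counts the point twice" idea using scikit-learn's sample_weight argument (the toy arrays below are made up for illustration):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[130], [80], [110], [90]])  # income in $K (toy data)
y = np.array([+1, -1, -1, +1])            # +1 = Safe, -1 = Risky
alpha = np.array([1.0, 1.0, 2.0, 1.0])    # third point counts as 2 data points

stump = DecisionTreeClassifier(max_depth=1)
stump.fit(X, y, sample_weight=alpha)
# Equivalent in spirit to duplicating the third row in an unweighted fit.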


Learning a decision stump on weighted data

Increase weight α of harder/misclassified points

Credit | Income | y     | Weight α
A      | $130K  | Safe  | 0.5
B      | $80K   | Risky | 1.5
C      | $110K  | Risky | 1.2
A      | $110K  | Safe  | 0.8
A      | $90K   | Safe  | 0.6
B      | $120K  | Safe  | 0.7
C      | $30K   | Risky | 3
C      | $60K   | Risky | 2
B      | $95K   | Safe  | 0.8
A      | $60K   | Safe  | 0.7
A      | $98K   | Safe  | 0.9

Decision stump on Income?, using the total weight in each branch:
  > $100K: Safe weight 2,  Risky weight 1.2 → ŷ = Safe
  ≤ $100K: Safe weight 3,  Risky weight 6.5 → ŷ = Risky


Boosting = Greedy learning ensembles from data

[Figure: boosting pipeline. Training data → learn classifier f1(x) → predict ŷ = sign(f1(x)). Put higher weight on the points where f1(x) is wrong → weighted data → learn classifier & coefficient ŵ2, f2(x) → predict ŷ = sign(ŵ1 f1(x) + ŵ2 f2(x)).]


AdaBoost algorithm


AdaBoost: learning ensemble [Freund & Schapire 1999]

• Start with same weight for all points: αi = 1/N

• For t = 1,…,T

- Learn ft(x) with data weights αi

- Compute coefficient ŵt

- Recompute weights αi

• Final model predicts by:

  ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )

Computing coefficient ŵt


AdaBoost: Computing coefficient ŵt of classifier ft(x)

Is ft(x) good? Yes → ŵt large; No → ŵt small

• ft(x) is good → ft has low training error

• Measuring error on weighted data? Just use the weighted # of misclassified points


Weighted classification error

[Example: the learned classifier predicts ŷ for each data point and we compare against the hidden true label. Data point (Sushi was great, α = 1.2): correct. Data point (Food was OK, α = 0.5): mistake. Total weight of correct points = 1.20; total weight of mistakes = 0.50.]

weighted_error = (total weight of mistakes) / (total weight of all data points)


AdaBoost: Formula for computing coefficient ŵt of classifier ft(x)

  ŵt = (1/2) ln( (1 - weighted_error(ft)) / weighted_error(ft) )

where weighted_error(ft) is measured on the training data. If ft(x) is good (low weighted error), ŵt is large; if not, ŵt is small.
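Plugging in numbers (the weighted error of 0.2 comes from the stump-selection example later in this deck):

import math

weighted_error = 0.2
w_hat = 0.5 * math.log((1 - weighted_error) / weighted_error)
print(round(w_hat, 2))  # 0.69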


AdaBoost: learning ensemble

• Start with same weight for all points: αi = 1/N

• For t = 1,…,T
  - Learn ft(x) with data weights αi
  - Compute coefficient ŵt = (1/2) ln( (1 - weighted_error(ft)) / weighted_error(ft) )

- Recompute weights αi

• Final model predicts by:

  ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )


Recompute weights αi


AdaBoost: Updating weights αi based on where classifier ft(x) makes mistakes

Did ft get xi right? Yes → decrease αi; No → increase αi


AdaBoost: Formula for updating weights αi

  αi ← αi e^(-ŵt)  if ft(xi) = yi
  αi ← αi e^(ŵt)   if ft(xi) ≠ yi

If ft got xi right, multiply αi by e^(-ŵt) (a decrease when ŵt > 0); if ft got xi wrong, multiply αi by e^(ŵt) (an increase).
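With ŵt ≈ 0.69, as computed above, the multipliers come out to roughly 1/2 for correct points and 2 for mistakes, which are the values used in the worked table later in this deck:

import math

w_hat = 0.69
print(round(math.exp(-w_hat), 2))  # 0.5: correct points get their weight roughly halved
print(round(math.exp(w_hat), 2))   # 1.99: misclassified points get their weight roughly doubled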


AdaBoost: learning ensemble

• Start with same weight for all points: αi = 1/N

• For t = 1,…,T
  - Learn ft(x) with data weights αi
  - Compute coefficient ŵt = (1/2) ln( (1 - weighted_error(ft)) / weighted_error(ft) )

  - Recompute weights: αi ← αi e^(-ŵt) if ft(xi) = yi, and αi ← αi e^(ŵt) if ft(xi) ≠ yi
• Final model predicts by:

  ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )


AdaBoost: Normalizing weights αi

If xi is often misclassified, its weight αi gets very large; if xi is often correct, its weight αi gets very small.

Can cause numerical instability after many iterations

Normalize the weights to add up to 1 after every iteration:

  αi ← αi / Σ_{j=1}^{N} αj

AdaBoost: learning ensemble

• Start with the same weight for all points: αi = 1/N
• For t = 1,…,T
  - Learn ft(x) with data weights αi
  - Compute coefficient ŵt = (1/2) ln( (1 - weighted_error(ft)) / weighted_error(ft) )
  - Recompute weights: αi ← αi e^(-ŵt) if ft(xi) = yi, and αi ← αi e^(ŵt) if ft(xi) ≠ yi
  - Normalize weights: αi ← αi / Σ_{j=1}^{N} αj
• Final model predicts by:

  ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )
(A compact sketch of this loop follows below.)
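Putting the pieces together, a compact sketch of AdaBoost with decision-stump weak learners; scikit-learn's DecisionTreeClassifier is used for the stumps and the small clipping constant is an assumption added to avoid log(0), so treat this as an illustration rather than the course's reference implementation:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T):
    # Labels y must be +1/-1. Returns the stumps f_t and coefficients w_t.
    N = len(y)
    alpha = np.full(N, 1.0 / N)              # start with equal weights 1/N
    stumps, coeffs = [], []
    for t in range(T):
        # Learn f_t(x) with data weights alpha_i
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=alpha)
        pred = stump.predict(X)

        # Compute coefficient w_t from the weighted error
        weighted_error = np.sum(alpha * (pred != y)) / np.sum(alpha)
        weighted_error = np.clip(weighted_error, 1e-10, 1 - 1e-10)
        w_hat = 0.5 * np.log((1 - weighted_error) / weighted_error)

        # Recompute weights (e^{-w} if correct, e^{+w} if wrong), then normalize
        alpha *= np.exp(-w_hat * y * pred)
        alpha /= alpha.sum()

        stumps.append(stump)
        coeffs.append(w_hat)
    return stumps, coeffs

def adaboost_predict(X, stumps, coeffs):
    # ŷ = sign( Σ_t ŵ_t f_t(x) )
    score = sum(w * f.predict(X) for w, f in zip(coeffs, stumps))
    return np.where(score >= 0, 1, -1)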


AdaBoost example: A visualization


t=1: Just learn a classifier on original data

[Figure: left, the original 2D data; right, the decision boundary of the learned decision stump f1(x).]


Updating weights αi

Increase weight αi of misclassified points

[Figure: left, the boundary of the learned decision stump f1(x); right, the new data weights αi, with misclassified points given larger weight.]


t=2: Learn classifier on weighted data

[Figure: left, the weighted data using the αi chosen in the previous iteration (the f1(x) boundary shown for reference); right, the decision stump f2(x) learned on the weighted data.]


Ensemble becomes weighted sum of learned classifiers

[Figure: the ensemble after two rounds is the weighted sum 0.61·f1(x) + 0.53·f2(x), i.e., ŵ1 = 0.61 and ŵ2 = 0.53, giving a combined decision boundary.]


Decision boundary of ensemble classifier after 30 iterations

training_error = 0


AdaBoost example: Boosted decision stumps step-by-step


Boosted decision stumps

• Start with the same weight for all points: αi = 1/N
• For t = 1,…,T
  - Learn ft(x): pick the decision stump with lowest weighted training error according to αi
  - Compute coefficient ŵt
  - Recompute weights αi
  - Normalize weights αi
• Final model predicts by:

  ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )


Finding best next decision stump ft(x)
Consider splitting on each feature:

[Figure: candidate decision stumps and their weighted errors.
  Income>$100K?      (Yes → Safe, No → Risky):   weighted_error = 0.2
  Credit history?    (Bad → Risky, Good → Safe): weighted_error = 0.35
  Savings>$100K?     (Yes → Safe, No → Risky):   weighted_error = 0.3
  Market conditions? (Bad → Risky, Good → Safe): weighted_error = 0.4]

Pick ft = Income>$100K? (the lowest weighted error), so

  ŵt = (1/2) ln( (1 - weighted_error(ft)) / weighted_error(ft) ) = (1/2) ln(0.8 / 0.2) ≈ 0.69
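The selection step is just "compute the weighted error of every candidate stump and keep the smallest"; a tiny sketch using the errors from this slide (the candidate dictionary is only for illustration):

import math

candidates = {
    "Income>$100K?": 0.2,
    "Credit history?": 0.35,
    "Savings>$100K?": 0.3,
    "Market conditions?": 0.4,
}
best = min(candidates, key=candidates.get)
err = candidates[best]
w_hat = 0.5 * math.log((1 - err) / err)
print(best, round(w_hat, 2))  # Income>$100K? 0.69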

Boosted decision stumps

• Start with the same weight for all points: αi = 1/N
• For t = 1,…,T
  - Learn ft(x): pick the decision stump with lowest weighted training error according to αi
  - Compute coefficient ŵt
  - Recompute weights αi
  - Normalize weights αi
• Final model predicts by:

  ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )


Updating weights αi

With ŵt = 0.69:
  αi ← αi e^(-ŵt) = αi e^(-0.69) = αi / 2   if ft(xi) = yi
  αi ← αi e^(ŵt)  = αi e^(0.69)  = 2 αi     if ft(xi) ≠ yi

ft = Income>$100K? (Yes → Safe, No → Risky)

Credit | Income | y     | ŷ     | Previous weight α | New weight α
A      | $130K  | Safe  | Safe  | 0.5               | 0.5/2 = 0.25
B      | $80K   | Risky | Risky | 1.5               | 0.75
C      | $110K  | Risky | Safe  | 1.5               | 2 · 1.5 = 3
A      | $110K  | Safe  | Safe  | 2                 | 1
A      | $90K   | Safe  | Risky | 1                 | 2
B      | $120K  | Safe  | Safe  | 2.5               | 1.25
C      | $30K   | Risky | Risky | 3                 | 1.5
C      | $60K   | Risky | Risky | 2                 | 1
B      | $95K   | Safe  | Risky | 0.5               | 1
A      | $60K   | Safe  | Risky | 1                 | 2
A      | $98K   | Safe  | Risky | 0.5               | 1


Boosting convergence & overfitting


Boosting question revisited

“Can a set of weak learners be combined to create a stronger learner?” Kearns and Valiant (1988)

Yes! Schapire (1990)

Boosting


After some iterations, training error of boosting goes to zero!!!

Training error of 1 decision stump = 22.5%

Training error of ensemble of 30 decision stumps = 0%

[Figure: training error vs. iterations of boosting for boosted decision stumps on a toy dataset; the error decreases toward zero.]


AdaBoost Theorem

Under some technical conditions…

Training error of the boosted classifier → 0 as T → ∞

[Figure: training error vs. iterations of boosting; the error may oscillate a bit, but it generally decreases and eventually becomes 0.]

Condition of AdaBoost Theorem

The condition is: at every iteration t, we can find a weak learner with weighted_error(ft) < 0.5.

This is not always possible. Extreme example: no classifier can separate a +1 sitting exactly on top of a -1. Nonetheless, boosting often yields great training error.


Decision trees on loan data: 8% training error, 39% test error → overfitting.

Boosted decision stumps on loan data: 28.5% training error, 32% test error → better fit & lower test error.


Boosting tends to be robust to overfitting

Test set performance stays about constant for many iterations → less sensitive to the choice of T


But boosting will eventually overfit, so we must choose the maximum number of components T

Best test error around 31%

Test error eventually increases to 33% (overfits)


How do we decide when to stop boosting?

Choosing T ?

Like selecting parameters in other ML approaches (e.g., λ in regularization)

- Never ever ever ever on the training data: not useful, since training error improves as T increases
- Validation set: if the dataset is large
- Cross-validation: for smaller datasets
- Never ever ever ever on the test data
(A sketch of validation-set selection follows below.)
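A hedged sketch of validation-set selection using scikit-learn's AdaBoostClassifier and its staged_predict method; the synthetic dataset and split sizes are assumptions for illustration:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)   # stand-in dataset
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

model = AdaBoostClassifier(n_estimators=200)   # decision stumps by default
model.fit(X_train, y_train)

# Validation error after each boosting iteration t = 1, ..., 200
valid_errors = [1 - accuracy_score(y_valid, pred)
                for pred in model.staged_predict(X_valid)]
best_T = int(np.argmin(valid_errors)) + 1
print(best_T, round(valid_errors[best_T - 1], 3))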


Summary of boosting


Variants of boosting and related algorithms

There are hundreds of variants of boosting, most important:

Gradient boosting
• Like AdaBoost, but useful beyond basic classification

Many other approaches to learn ensembles, most important:

Random forests (via bagging):
• Bagging: pick random subsets of the data
  - Learn a tree in each subset
  - Average predictions
• Simpler than boosting & easier to parallelize
• Typically higher error than boosting for the same # of trees (# iterations T)


Impact of boosting (spoiler alert... HUGE IMPACT)

Amongst most useful ML methods ever created

Extremely useful in computer vision
• The standard approach for face detection, for example

Used by most winners of ML competitions (Kaggle, KDD Cup, …)
• Malware classification, credit fraud detection, ads click-through rate estimation, sales forecasting, ranking webpages for search, Higgs boson detection, …

Most deployed ML systems use model ensembles
• Coefficients chosen manually, with boosting, with bagging, or others


What you can do now…

• Identify the notion of ensemble classifiers
• Formalize ensembles as weighted combinations of simpler classifiers
• Outline the boosting framework: sequentially learn classifiers on weighted data
• Describe the AdaBoost algorithm
  - Learn each classifier on weighted data
  - Compute the coefficient of the classifier
  - Recompute data weights
  - Normalize weights
• Implement AdaBoost to create an ensemble of decision stumps
