
Decision Trees: Overfitting
CSE 446: Machine Learning
Emily Fox, University of Washington
January 30, 2017

©2017 Emily Fox CSE 446: Machine Learning

Decision tree recap

[Figure: example decision tree for loan status (Safe vs. Risky). The root splits on Credit?, with further splits on Income? and Term?. For each leaf node, set ŷ = majority value of the training points that reach that leaf.]


Greedy decision tree learning

• Step 1: Start with an empty tree

• Step 2: Select a feature to split the data
  - Pick the feature whose split gives the lowest classification error

• For each split of the tree:
  - Step 3: If there is nothing more to split on, make predictions (stopping conditions 1 & 2)
  - Step 4: Otherwise, go to Step 2 and continue (recurse) on this split
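To make the recursion above concrete, here is a minimal sketch of greedy tree learning on categorical features, using majority-vote leaves and classification error to choose splits. It is an illustrative sketch, not the course's reference implementation; the dict-based tree representation and function names are assumptions.

```python
# Illustrative sketch of greedy decision tree learning (not the course's reference code).
# Data: list of (x, y) pairs, x a dict of categorical features, y in {+1, -1}.

def classification_error(labels):
    """Error of predicting the majority value for this set of labels."""
    if not labels:
        return 0.0
    num_pos = sum(1 for y in labels if y == +1)
    return min(num_pos, len(labels) - num_pos) / len(labels)

def majority(labels):
    return +1 if sum(1 for y in labels if y == +1) >= len(labels) / 2 else -1

def best_feature(data, features):
    """Step 2: pick the feature whose split gives the lowest classification error."""
    best, best_err = None, float('inf')
    for f in features:
        err = sum(classification_error([y for x, y in data if x[f] == v]) *
                  sum(1 for x, _ in data if x[f] == v) / len(data)
                  for v in set(x[f] for x, _ in data))
        if err < best_err:
            best, best_err = f, err
    return best

def build_tree(data, features):
    labels = [y for _, y in data]
    # Stopping conditions 1 & 2: all labels agree, or no features left to split on.
    if classification_error(labels) == 0.0 or not features:
        return {'leaf': True, 'prediction': majority(labels)}
    f = best_feature(data, features)
    children = {v: build_tree([(x, y) for x, y in data if x[f] == v],
                              [g for g in features if g != f])
                for v in set(x[f] for x, _ in data)}
    return {'leaf': False, 'feature': f, 'children': children}

def predict(tree, x):
    # Follow the branch matching each feature value until a leaf is reached.
    while not tree['leaf']:
        tree = tree['children'][x[tree['feature']]]
    return tree['prediction']
```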


Scoring a loan application

xi = (Credit = poor, Income = high, Term = 5 years)

[Figure: the point xi is routed down the decision tree, starting at the root split on Credit? and following the branches that match its feature values (Income?, Term?) until it reaches a leaf. The leaf's majority value gives the prediction ŷi = Safe.]


Decision trees vs logistic regression: Example


Logistic regression

Feature   Value   Weight learned
h0(x)     1       0.22
h1(x)     x[1]    1.12
h2(x)     x[2]    -1.07


Depth 1: Split on x[1]

[Figure: root node with y values 18 (-) / 13 (+), split on x[1] at -0.07. The branch x[1] < -0.07 contains 13 (-) / 3 (+) points; the branch x[1] >= -0.07 contains 4 (-) / 11 (+).]


Depth 2

[Figure: the depth-1 tree above grown one more level. The left child (x[1] < -0.07) is split again on x[1] at -1.66, giving leaves with 7/0 and 6/3 (-/+) points; the right child (x[1] >= -0.07) is split on x[2] at 1.55, giving leaves with 1/11 and 3/0 points.]


Threshold split caveat

For threshold splits, the same feature can be used multiple times along a path.

[Figure: the depth-2 tree from the previous slide, in which x[1] is split on twice: first at -0.07 at the root, then again at -1.66 in the left branch.]


Decision boundaries

[Figure: decision boundaries of the learned tree at depth 1, depth 2, and depth 10.]


Comparing decision boundaries

[Figure: decision boundaries of decision trees (depth 1, depth 3, depth 10) versus logistic regression (degree 1, degree 2, and degree 6 features).]


Predicting probabilities with decision trees

[Figure: decision tree for loan status with root counts 18 Safe / 12 Risky, split on Credit?. Leaf counts: excellent 9 Safe / 2 Risky (predict Safe), fair 6 Safe / 9 Risky (predict Risky), poor 3 Safe / 1 Risky (predict Safe).]

At a leaf, estimate the class probability from the training counts in that leaf. For example, at the "poor" leaf:

P(y = Safe | x) = 3 / (3 + 1) = 0.75
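As a small illustration of the calculation above, a leaf can report both its majority prediction and an estimated probability from its training counts (an assumed sketch, not course code):

```python
# Estimate P(y = Safe | x) at a leaf from the training counts that reached it.
def leaf_prediction(num_safe, num_risky):
    p_safe = num_safe / (num_safe + num_risky)
    label = 'Safe' if p_safe >= 0.5 else 'Risky'
    return label, p_safe

print(leaf_prediction(3, 1))   # ('Safe', 0.75), matching the "poor" leaf above
```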


Depth 1 probabilities

[Figure: the depth-1 tree (split on x[1] at -0.07, with leaf counts 13/3 and 4/11 negative/positive); each leaf's probability estimate is the fraction of positive points it contains.]


Depth 2 probabilities

[Figure: the depth-2 tree from before, with leaf counts 7/0, 6/3, 1/11, and 3/0 (negative/positive); leaf probabilities are again the fraction of positive points in each leaf.]


Comparison with logistic regression

[Figure: class predictions (top row) and predicted probabilities (bottom row) for logistic regression with degree-2 features versus a depth-2 decision tree.]


Overfitting in decision trees


What happens when we increase depth?

Training error reduces with depth

Tree depth       depth = 1   depth = 2   depth = 3   depth = 5   depth = 10
Training error   0.22        0.13        0.10        0.03        0.00

[Figure: the corresponding decision boundary at each depth.]


Two approaches to picking simpler trees

1. Early Stopping: Stop the learning algorithm before tree becomes too complex

2. Pruning: Simplify the tree after the learning algorithm terminates


Technique 1: Early stopping

• Stopping conditions (recap):
  1. All examples have the same target value
  2. No more features to split on

• Early stopping conditions:
  1. Limit tree depth (choose max_depth using a validation set)
  2. Do not consider splits that do not cause a sufficient decrease in classification error
  3. Do not split an intermediate node which contains too few data points
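A hedged sketch of how these early stopping conditions might look as checks inside a recursive tree learner; the parameter names and defaults (max_depth, min_error_reduction, min_node_size) are illustrative assumptions, not values from the lecture:

```python
# Illustrative early-stopping checks; thresholds are assumptions, not from the slides.
def should_stop_early(data, depth, error_before_split, error_after_split,
                      max_depth=10, min_error_reduction=0.0, min_node_size=10):
    if depth >= max_depth:                                              # 1. limit tree depth
        return True
    if error_before_split - error_after_split <= min_error_reduction:  # 2. split barely helps
        return True
    if len(data) <= min_node_size:                                      # 3. too few data points
        return True
    return False
```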


Challenge with early stopping condition 1

Hard to know exactly when to stop

Also, we might want some branches of the tree to go deeper while others remain shallow.

[Figure: classification error versus tree depth, from simple trees to complex trees. Training error decreases with depth while true error eventually increases; a candidate max_depth is marked on the depth axis.]


Early stopping condition 2: Pros and Cons

• Pros:
  - A reasonable heuristic for early stopping that avoids useless splits

• Cons:
  - Too short-sighted: we may miss out on “good” splits that occur right after “useless” splits
  - We saw this with the “xor” example


Two approaches to picking simpler trees

1. Early Stopping: Stop the learning algorithm before tree becomes too complex

2. Pruning: Simplify the tree after the learning algorithm terminates

Complements early stopping


Pruning: Intuition
Train a complex tree, simplify later

[Figure: a complex tree is simplified into a smaller tree.]


Pruning motivation

Don’t stop too early: simplify after the tree is built.

[Figure: classification error versus tree depth (training error and true error), from simple trees to complex trees.]


Scoring trees: Desired total quality format

Want to balance:
  i. How well the tree fits the data
  ii. Complexity of the tree

Total cost = measure of fit + measure of complexity

  - Measure of fit (classification error): large value = bad fit to training data
  - Measure of complexity: large value = likely to overfit

Simple measure of complexity of tree

L(T) = # of leaf nodes

[Figure: example tree splitting on Credit? with branches excellent, good, fair, bad, and poor, each ending in a leaf (Safe or Risky); here L(T) = 5.]


Balance simplicity & predictive power

[Figure: two extremes. A deep tree splitting on Credit?, Income?, and Term? is too complex, with a risk of overfitting; a tree with essentially no splits is too simple, with high classification error.]


Balancing fit and complexity

Total cost C(T) = Error(T) + λ L(T)

λ is a tuning parameter:

  - If λ = 0: only the fit term matters, so we recover standard tree learning (the most complex tree wins).
  - If λ = ∞: the complexity penalty dominates, so the best tree is just the root node predicting the majority class.
  - If λ is in between: we trade off training fit against tree complexity.
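A minimal sketch of this cost, reusing the dict-based tree representation from the earlier sketch (the helper names are assumptions, not the course's code):

```python
# Total cost C(T) = Error(T) + lambda * L(T) for a dict-based tree as sketched earlier.
def count_leaves(tree):
    return 1 if tree['leaf'] else sum(count_leaves(c) for c in tree['children'].values())

def predict(tree, x):
    while not tree['leaf']:
        tree = tree['children'][x[tree['feature']]]
    return tree['prediction']

def total_cost(tree, data, lam):
    error = sum(1 for x, y in data if predict(tree, x) != y) / len(data)
    return error + lam * count_leaves(tree)
```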


Tree pruning algorithm


Step 1: Consider a split

[Figure: the full tree T, splitting on Credit?, Income?, and Term?; one of the Term? splits near the bottom of the tree is highlighted as a candidate for pruning.]


Step 2: Compute total cost C(T) of split

[Figure: tree T as above, with the candidate Term? split still highlighted.]

With λ = 0.03:

Tree   Error   #Leaves   Total cost
T      0.25    6         0.25 + 0.03 × 6 = 0.43

C(T) = Error(T) + λ L(T)


Step 3: “Undo” the split, giving Tsmaller

[Figure: tree Tsmaller, identical to T except that the candidate Term? split has been replaced by a single leaf node.]

With λ = 0.03:

Tree       Error   #Leaves   Total cost
T          0.25    6         0.43
Tsmaller   0.26    5         0.26 + 0.03 × 5 = 0.41

C(T) = Error(T) + λ L(T)

Replace the split by a leaf node?


Step 4: Prune if total cost is lower: C(Tsmaller) ≤ C(T)

Tsmaller has worse training error but lower overall cost.

With λ = 0.03:

Tree       Error   #Leaves   Total cost
T          0.25    6         0.43
Tsmaller   0.26    5         0.41

C(T) = Error(T) + λ L(T)

Replace the split by a leaf node? YES!
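A hedged sketch of this bottom-up pruning pass, using the same dict-based tree representation as the earlier sketches; the recursion order and helper names are assumptions, not the course's algorithm code:

```python
# Illustrative pruning pass: collapse a split whenever C(Tsmaller) <= C(T).
# Costs are compared locally; the rest of the tree contributes equally to both options.

def count_leaves(tree):
    return 1 if tree['leaf'] else sum(count_leaves(c) for c in tree['children'].values())

def predict(tree, x):
    while not tree['leaf']:
        tree = tree['children'][x[tree['feature']]]
    return tree['prediction']

def num_mistakes(tree, data):
    return sum(1 for x, y in data if predict(tree, x) != y)

def majority_leaf(data):
    labels = [y for _, y in data]
    return {'leaf': True, 'prediction': max(set(labels), key=labels.count)}

def prune(tree, data, lam, n_total):
    if tree['leaf'] or not data:
        return tree
    # Prune deeper splits first, then consider collapsing this one.
    for value in list(tree['children']):
        subset = [(x, y) for x, y in data if x[tree['feature']] == value]
        tree['children'][value] = prune(tree['children'][value], subset, lam, n_total)
    leaf = majority_leaf(data)
    cost_split = num_mistakes(tree, data) / n_total + lam * count_leaves(tree)
    cost_leaf = num_mistakes(leaf, data) / n_total + lam * 1
    return leaf if cost_leaf <= cost_split else tree
```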

Step 5: Repeat Steps 1-4 for every split

Decide whether each split can be “pruned”.

[Figure: the tree after pruning the candidate split, with every remaining split considered in turn.]


Summary of overfitting in decision trees


What you can do now…

• Identify when overfitting occurs in decision trees
• Prevent overfitting with early stopping
  - Limit tree depth
  - Do not consider splits that do not reduce classification error
  - Do not split intermediate nodes with only a few points
• Prevent overfitting by pruning complex trees
  - Use a total cost formula that balances classification error and tree complexity
  - Use total cost to merge potentially complex trees into simpler ones


Boosting
CSE 446: Machine Learning
Emily Fox, University of Washington
January 30, 2017

©2017 Emily Fox CSE 446: Machine Learning

Simple (weak) classifiers are good!

[Figure: a decision stump: Income > $100K? Yes → Safe, No → Risky.]

Examples of weak classifiers: logistic regression with simple features, shallow decision trees, decision stumps.

Low variance. Learning is fast! But high bias…


Finding a classifier that’s just right

[Figure: classification error versus model complexity: training error decreases with complexity while true error first decreases and then increases. A weak learner sits at the low-complexity end, so we need a stronger learner.]

Option 1: add more features or depth
Option 2: ?????

Boosting question

“Can a set of weak learners be combined to create a stronger learner?” Kearns and Valiant (1988)

Yes! Schapire (1990)

Boosting

Amazing impact:
  • simple approach
  • widely used in industry
  • wins most Kaggle competitions


Ensemble classifier


A single classifier

Input: x

[Figure: a single decision stump classifier: Income > $100K? Yes → Safe, No → Risky.]

Output: ŷ = f(x), either +1 or -1


[Figure: the same decision stump, Income > $100K? Yes → Safe, No → Risky, shown on its own.]

Ensemble methods: Each classifier “votes” on prediction

xi = (Income=$120K, Credit=Bad, Savings=$50K, Market=Good)

[Figure: four decision stumps:
  f1: Income > $100K? Yes → Safe, No → Risky
  f2: Credit history? Bad → Risky, Good → Safe
  f3: Savings > $100K? Yes → Safe, No → Risky
  f4: Market conditions? Bad → Risky, Good → Safe]

On xi: f1(xi) = +1, f2(xi) = -1, f3(xi) = -1, f4(xi) = +1

Combine into an ensemble model (learn coefficients):

F(xi) = sign(w1 f1(xi) + w2 f2(xi) + w3 f3(xi) + w4 f4(xi))


Prediction with ensemble

[Figure: the same four decision stumps and their votes on xi: f1(xi) = +1, f2(xi) = -1, f3(xi) = -1, f4(xi) = +1.]

With learned coefficients w1 = 2, w2 = 1.5, w3 = 1.5, w4 = 0.5:

F(xi) = sign(w1 f1(xi) + w2 f2(xi) + w3 f3(xi) + w4 f4(xi))
      = sign(2·(+1) + 1.5·(-1) + 1.5·(-1) + 0.5·(+1)) = sign(-0.5) = -1 (Risky)


Ensemble classifier in general

• Goal: predict output y (either +1 or -1) from input x
• Learn ensemble model:
  - Classifiers: f1(x), f2(x), …, fT(x)
  - Coefficients: ŵ1, ŵ2, …, ŵT
• Prediction:

  ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )
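A minimal sketch of this weighted vote, with the four stumps and coefficients from the example above encoded as plain functions (an illustration, not the course's code):

```python
# Ensemble prediction: sign of the weighted sum of the individual classifiers' votes.
def ensemble_predict(classifiers, coefficients, x):
    score = sum(w * f(x) for f, w in zip(classifiers, coefficients))
    return +1 if score > 0 else -1

f1 = lambda x: +1 if x['income'] > 100_000 else -1    # Income > $100K?
f2 = lambda x: -1 if x['credit'] == 'bad' else +1     # Credit history?
f3 = lambda x: +1 if x['savings'] > 100_000 else -1   # Savings > $100K?
f4 = lambda x: -1 if x['market'] == 'bad' else +1     # Market conditions?

xi = {'income': 120_000, 'credit': 'bad', 'savings': 50_000, 'market': 'good'}
print(ensemble_predict([f1, f2, f3, f4], [2, 1.5, 1.5, 0.5], xi))   # -1 (Risky)
```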


Boosting


Training a classifier

Training data → learn classifier → f(x) → predict ŷ = sign(f(x))


Learning decision stump

Credit  Income  y
A       $130K   Safe
B       $80K    Risky
C       $110K   Risky
A       $110K   Safe
A       $90K    Safe
B       $120K   Safe
C       $30K    Risky
C       $60K    Risky
B       $95K    Safe
A       $60K    Safe
A       $98K    Safe

[Figure: decision stump splitting on Income. The > $100K branch holds 3 Safe / 1 Risky points and the ≤ $100K branch holds 4 Safe / 3 Risky, so both leaves predict ŷ = Safe.]


Boosting = Focus learning on “hard” points

Training data → learn classifier → f(x) → predict ŷ = sign(f(x))

Evaluate: learn where f(x) makes mistakes.
Boosting: focus the next classifier on places where f(x) does less well.


Learning on weighted data: More weight on “hard” or more important points

• Weighted dataset:

- Each xi,yi weighted by αi

• More important point = higher weight αi

• Learning:

- Data point i counts as αi data points

• E.g., αi = 2 ⇒ count the point twice


Learning a decision stump on weighted data

Increase weight α of harder/misclassified points

Credit  Income  y      Weight α
A       $130K   Safe   0.5
B       $80K    Risky  1.5
C       $110K   Risky  1.2
A       $110K   Safe   0.8
A       $90K    Safe   0.6
B       $120K   Safe   0.7
C       $30K    Risky  3
C       $60K    Risky  2
B       $95K    Safe   0.8
A       $60K    Safe   0.7
A       $98K    Safe   0.9

[Figure: the same Income stump on the weighted data. The > $100K branch now has total weight 2 (Safe) vs. 1.2 (Risky), so ŷ = Safe; the ≤ $100K branch has total weight 3 (Safe) vs. 6.5 (Risky), so ŷ = Risky.]


Learning from weighted data in general

• Usually, when learning from weighted data:
  - Data point i counts as αi data points

• E.g., gradient ascent for logistic regression:

  w_j^(t+1) ← w_j^(t) + η Σ_{i=1}^{N} αi hj(xi) ( 1[yi = +1] − P(y = +1 | xi, w^(t)) )

(Sum over data points, with each point weighted by αi.)
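A numpy sketch of one such weighted gradient ascent step (the matrix layout and variable names are assumptions, not the course's code):

```python
import numpy as np

# One weighted gradient-ascent step for logistic regression.
# H: N x D matrix of features h_j(x_i); y: labels in {+1, -1}; alpha: per-point weights.
def weighted_gradient_step(w, H, y, alpha, eta):
    p = 1.0 / (1.0 + np.exp(-H @ w))            # P(y = +1 | x_i, w)
    indicator = (y == +1).astype(float)          # 1[y_i = +1]
    gradient = H.T @ (alpha * (indicator - p))   # each point's term weighted by alpha_i
    return w + eta * gradient
```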


Boosting = Greedy learning ensembles from data

Training data → learn classifier f1(x) → predict ŷ = sign(f1(x))

Give higher weight to points where f1(x) is wrong → weighted data

Weighted data → learn classifier and coefficient ŵ2, f2(x) → predict ŷ = sign(ŵ1 f1(x) + ŵ2 f2(x))


AdaBoost


AdaBoost: learning ensemble [Freund & Schapire 1999]

• Start with same weight for all points: αi = 1/N

• For t = 1,…,T

- Learn ft(x) with data weights αi

- Compute coefficient ŵt

- Recompute weights αi

• Final model predicts by:

  ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )


Computing coefficient ŵt


AdaBoost: Computing coefficient ŵt of classifier ft(x)

Is ft(x) good? Yes → ŵt large. No → ŵt small.

• ft(x) is good ⇒ ft has low training error

• Measuring error on weighted data? Just the weighted # of misclassified points


Weighted classification error

Learned classifier ŷ = … (hide the true label, apply the classifier, then check the prediction)

Example data points:
  - (Sushi was great, α = 1.2): correct → contributes 1.2 to the weight of correct points
  - (Food was OK, α = 0.5): mistake → contributes 0.5 to the weight of mistakes


Weighted classification error

• Total weight of mistakes: Σ_{i: ft(xi) ≠ yi} αi

• Total weight of all points: Σ_{i=1}^{N} αi

• Weighted error measures the fraction of the total weight that is on mistakes:

  weighted_error(ft) = (total weight of mistakes) / (total weight of all points)

  - Best possible value is 0.0
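In code, the weighted error is just the weight on mistakes divided by the total weight (a small sketch; the example numbers come from the two points above):

```python
# Weighted classification error = (total weight of mistakes) / (total weight of all points).
def weighted_error(predictions, labels, alpha):
    weight_of_mistakes = sum(a for p, y, a in zip(predictions, labels, alpha) if p != y)
    return weight_of_mistakes / sum(alpha)

# One correct point with weight 1.2 and one mistake with weight 0.5:
print(weighted_error([+1, +1], [+1, -1], [1.2, 0.5]))   # 0.5 / 1.7 ≈ 0.29
```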


AdaBoost: Formula for computing coefficient ŵt of classifier ft(x)

  ŵt = (1/2) ln( (1 − weighted_error(ft)) / weighted_error(ft) )

[Table: example values of weighted_error(ft) on the training data, the ratio (1 − weighted_error(ft)) / weighted_error(ft), and the resulting ŵt, showing that ŵt is large when ft(x) is good and small (or negative) when it is not.]
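The formula translates directly into code (a sketch; the example value 0.2 matches the stump-selection example later in the lecture):

```python
import math

# AdaBoost coefficient: w_t = (1/2) * ln((1 - weighted_error) / weighted_error).
def adaboost_coefficient(weighted_err):
    return 0.5 * math.log((1.0 - weighted_err) / weighted_err)

print(adaboost_coefficient(0.2))   # ≈ 0.69: a good classifier gets a large positive coefficient
print(adaboost_coefficient(0.5))   # 0.0: a classifier no better than random gets zero weight
```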


AdaBoost: learning ensemble

• Start with same weight for all points: αi = 1/N

• For t = 1, …, T:
  - Learn ft(x) with data weights αi
  - Compute coefficient ŵt = (1/2) ln( (1 − weighted_error(ft)) / weighted_error(ft) )
  - Recompute weights αi

• Final model predicts by:

  ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )


Recompute weights αi


AdaBoost: Updating weights αi based on where classifier ft(x) makes mistakes

Did ft get xi right? Yes → decrease αi. No → increase αi.


AdaBoost: Formula for updating weights αi

  αi ← αi e^(−ŵt)   if ft(xi) = yi
  αi ← αi e^(ŵt)    if ft(xi) ≠ yi

[Table: for each case (ft got xi right or not) and each sign of ŵt, the factor αi is multiplied by and its implication for the point's weight.]
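A one-line sketch of the per-point update (the ŵt = 0.69 example anticipates the worked table later in the lecture):

```python
import math

# Update one point's weight given the classifier coefficient w_t.
def updated_weight(alpha_i, w_t, correct):
    return alpha_i * math.exp(-w_t if correct else w_t)

print(updated_weight(1.0, 0.69, correct=True))    # ≈ 0.5: correct points are down-weighted
print(updated_weight(1.0, 0.69, correct=False))   # ≈ 2.0: mistakes are up-weighted
```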


AdaBoost: learning ensemble

• Start with same weight for all points: αi = 1/N

• For t = 1, …, T:
  - Learn ft(x) with data weights αi
  - Compute coefficient ŵt = (1/2) ln( (1 − weighted_error(ft)) / weighted_error(ft) )
  - Recompute weights:
      αi ← αi e^(−ŵt)  if ft(xi) = yi
      αi ← αi e^(ŵt)   if ft(xi) ≠ yi

• Final model predicts by:

  ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )


AdaBoost: Normalizing weights αi

If xi is often misclassified, its weight αi gets very large; if xi is often correct, its weight αi gets very small.

This can cause numerical instability after many iterations.

Normalize the weights to add up to 1 after every iteration:

  αi ← αi / Σ_{j=1}^{N} αj

AdaBoost: learning ensemble

• Start with the same weight for all points: αi = 1/N

• For t = 1, …, T:
  - Learn ft(x) with data weights αi
  - Compute coefficient ŵt = (1/2) ln( (1 − weighted_error(ft)) / weighted_error(ft) )
  - Recompute weights:
      αi ← αi e^(−ŵt)  if ft(xi) = yi
      αi ← αi e^(ŵt)   if ft(xi) ≠ yi
  - Normalize weights: αi ← αi / Σ_{j=1}^{N} αj

• Final model predicts by:

  ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )
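Putting the pieces together, a compact sketch of the whole loop above. It assumes the weak learners are given as a list of candidate classifiers f(x) → {+1, −1} and that y is a numpy array of ±1 labels; the function names and candidate-list interface are assumptions, not the course's API:

```python
import numpy as np

def adaboost(weak_learners, X, y, num_rounds):
    """Sketch of AdaBoost: each round, pick the candidate with lowest weighted error."""
    N = len(X)
    alpha = np.full(N, 1.0 / N)                                   # same weight for all points
    ensemble = []                                                 # list of (w_t, f_t)
    for _ in range(num_rounds):
        preds = [np.array([f(x) for x in X]) for f in weak_learners]
        errs = [np.sum(alpha * (p != y)) / np.sum(alpha) for p in preds]
        t = int(np.argmin(errs))                                  # learn f_t with weights alpha
        f_t = weak_learners[t]
        err = min(max(errs[t], 1e-10), 1 - 1e-10)                 # avoid log(0) in the sketch
        w_t = 0.5 * np.log((1.0 - err) / err)                     # compute coefficient
        ensemble.append((w_t, f_t))
        alpha = alpha * np.exp(np.where(preds[t] == y, -w_t, w_t))  # recompute weights
        alpha = alpha / np.sum(alpha)                             # normalize weights
    return ensemble

def adaboost_predict(ensemble, x):
    return +1 if sum(w * f(x) for w, f in ensemble) > 0 else -1
```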


AdaBoost example


t=1: Just learn a classifier on original data

[Figure: the original 2-D data (left) and the decision stump f1(x) learned on it, with its decision boundary (right).]


Updating weights αi

Increase weight αi of misclassified points

[Figure: the learned decision stump f1(x) and its boundary (left); the new data weights αi, with misclassified points up-weighted (right).]


t=2: Learn classifier on weighted data

[Figure: the weighted data, using the αi chosen in the previous iteration and with the boundary of f1(x) still shown (left); the decision stump f2(x) learned on this weighted data (right).]


Ensemble becomes weighted sum of learned classifiers

[Figure: the ensemble after two rounds is the weighted sum ŵ1 f1(x) + ŵ2 f2(x) = 0.61 · f1(x) + 0.53 · f2(x), shown as the combination of the two stump boundaries.]


Decision boundary of ensemble classifier after 30 iterations

training_error = 0


Boosted decision stumps


Boosted decision stumps

• Start with the same weight for all points: αi = 1/N

• For t = 1, …, T:
  - Learn ft(x): pick the decision stump with lowest weighted training error according to αi
  - Compute coefficient ŵt
  - Recompute weights αi
  - Normalize weights αi

• Final model predicts by:

  ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )


Finding the best next decision stump ft(x)

Consider splitting on each feature:

[Figure: four candidate stumps and their weighted errors:
  Income > $100K?       weighted_error = 0.2
  Credit history?       weighted_error = 0.35
  Savings > $100K?      weighted_error = 0.3
  Market conditions?    weighted_error = 0.4]

Pick the stump with lowest weighted error: ft = (Income > $100K? Yes → Safe, No → Risky), and

  ŵt = (1/2) ln( (1 − weighted_error(ft)) / weighted_error(ft) ) = (1/2) ln(0.8 / 0.2) ≈ 0.69
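For numeric features, "pick the decision stump with lowest weighted training error" can be implemented as a brute-force search over features, thresholds, and orientations (a sketch under those assumptions, not the course's implementation):

```python
import numpy as np

# Pick the decision stump (feature, threshold, orientation) with lowest weighted error.
# X: N x D numeric feature matrix (numpy), y: labels in {+1, -1}, alpha: per-point weights.
def best_stump(X, y, alpha):
    best, best_err = None, np.inf
    for d in range(X.shape[1]):
        for threshold in np.unique(X[:, d]):
            for sign in (+1, -1):                              # which side predicts +1
                preds = np.where(X[:, d] > threshold, sign, -sign)
                err = np.sum(alpha * (preds != y)) / np.sum(alpha)
                if err < best_err:
                    best, best_err = (d, threshold, sign), err
    return best, best_err
```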

Boosted decision stumps

• Start with the same weight for all points: αi = 1/N

• For t = 1, …, T:
  - Learn ft(x): pick the decision stump with lowest weighted training error according to αi
  - Compute coefficient ŵt
  - Recompute weights αi
  - Normalize weights αi

• Final model predicts by:

  ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )


Updating weights αi

With ŵt = 0.69 for the stump ft = (Income > $100K? Yes → Safe, No → Risky):

  αi ← αi e^(−0.69) = αi / 2   if ft(xi) = yi
  αi ← αi e^(0.69)  = 2 αi     if ft(xi) ≠ yi

Credit  Income  y      ŷ      Previous weight α   New weight α
A       $130K   Safe   Safe   0.5                 0.5 / 2 = 0.25
B       $80K    Risky  Risky  1.5                 0.75
C       $110K   Risky  Safe   1.5                 2 × 1.5 = 3
A       $110K   Safe   Safe   2                   1
A       $90K    Safe   Risky  1                   2
B       $120K   Safe   Safe   2.5                 1.25
C       $30K    Risky  Risky  3                   1.5
C       $60K    Risky  Risky  2                   1
B       $95K    Safe   Risky  0.5                 1
A       $60K    Safe   Risky  1                   2
A       $98K    Safe   Risky  0.5                 1

Summary of boosting


Variants of boosting and related algorithms

There are hundreds of variants of boosting, most important:

Gradient boosting
  • Like AdaBoost, but useful beyond basic classification

Many other approaches to learn ensembles, most important:

• Bagging (the basis of random forests): pick random subsets of the data
  - Learn a tree on each subset
  - Average the predictions
  - Simpler than boosting and easier to parallelize
  - Typically higher error than boosting for the same # of trees (# of iterations T)
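A minimal sketch of the bagging recipe just described, using scikit-learn's DecisionTreeClassifier as the base tree and a majority vote over bootstrap samples (illustrative only; random forests additionally subsample features at each split):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Bagging: train one tree per random (bootstrap) subset, then average the +/-1 votes.
# X, y: numpy arrays of features and +/-1 labels.
def bagged_trees(X, y, num_trees=10, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(num_trees):
        idx = rng.integers(0, len(X), size=len(X))     # random subset of the data
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    votes = np.mean([tree.predict(X) for tree in trees], axis=0)   # average prediction
    return np.where(votes > 0, +1, -1)
```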


Impact of boosting (spoiler alert... HUGE IMPACT)

Amongst most useful ML methods ever created

• Extremely useful in computer vision: the standard approach for face detection, for example
• Used by most winners of ML competitions (Kaggle, KDD Cup, …): malware classification, credit fraud detection, ads click-through rate estimation, sales forecasting, ranking webpages for search, Higgs boson detection, …
• Most deployed ML systems use model ensembles, with coefficients chosen manually, with boosting, with bagging, or others


What you can do now…

• Identify the notion of an ensemble classifier
• Formalize ensembles as the weighted combination of simpler classifiers
• Outline the boosting framework: sequentially learn classifiers on weighted data
• Describe the AdaBoost algorithm
  - Learn each classifier on weighted data
  - Compute the coefficient of each classifier
  - Recompute data weights
  - Normalize weights
• Implement AdaBoost to create an ensemble of decision stumps

