Decision Trees, Bagging and Boosting

CSE 446: Machine Learning, slides created by Emily Fox

Predicting potential loan defaults

What makes a loan risky?

I want to buy a new house!

Loan Application
- Credit History ★★★★
- Income ★★★
- Term ★★★★★
- Personal Info ★★★

Credit history explained
- Did I pay previous loans on time?
- Example: excellent, good, or fair

Income
- What's my income?
- Example: $80K per year

Loan terms
- How soon do I need to pay the loan?
- Example: 3 years, 5 years, …

Personal information
- Age, reason for the loan, marital status, …
- Example: home loan for a married couple

Intelligent application

Loan applications → intelligent loan application review system → Safe ✓ or Risky ✘

Classifier review

Input: xi (loan application) → Classifier MODEL → Output: ŷi (predicted class)
- ŷi = +1 (Safe)
- ŷi = -1 (Risky)

This module: decision trees

Start
- Credit? excellent → Safe
- Credit? fair → Term?
  - 3 years → Risky
  - 5 years → Safe
- Credit? poor → Income?
  - high → Term?
    - 3 years → Risky
    - 5 years → Safe
  - low → Risky

Scoring a loan application

xi = (Credit = poor, Income = high, Term = 5 years)

Traverse the tree above: Credit = poor → Income = high → Term = 5 years → leaf "Safe"

ŷi = Safe

Decision tree learning task

Learning problem

Training data: N observations (xi, yi)

Credit     Term   Income  y
excellent  3 yrs  high    safe
fair       5 yrs  low     risky
fair       3 yrs  high    safe
poor       5 yrs  high    risky
excellent  3 yrs  low     risky
fair       5 yrs  low     safe
poor       3 yrs  high    risky
poor       5 yrs  low     safe
fair       3 yrs  high    safe

Goal: optimize a quality metric on the training data to learn a tree T(X).

Quality metric: Classification error

• Error measures the fraction of mistakes

  Error = (# incorrect predictions) / (# examples)

- Best possible value: 0.0
- Worst possible value: 1.0
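As a quick illustration, a minimal Python sketch of this metric (my own code, not from the course materials; the function name classification_error is my choice):

```python
# Classification error: the fraction of examples where the predicted label
# differs from the true label.  0.0 is best, 1.0 is worst.
import numpy as np

def classification_error(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.mean(y_true != y_pred)

# Example: 2 mistakes out of 5 examples -> error = 0.4
print(classification_error([+1, +1, -1, -1, +1], [+1, -1, -1, +1, +1]))
```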

How do we find the best tree?

The exponentially large number of possible trees (T1(X), T2(X), T3(X), T4(X), T5(X), T6(X), …) makes decision tree learning hard! Learning the smallest decision tree is an NP-hard problem [Hyafil & Rivest '76].

Greedy decision tree learning

Our training data table

Assume N = 40 and 3 features (the table shows a sample of rows):

Credit     Term   Income  y
excellent  3 yrs  high    safe
fair       5 yrs  low     risky
fair       3 yrs  high    safe
poor       5 yrs  high    risky
excellent  3 yrs  low     risky
fair       5 yrs  low     safe
poor       3 yrs  high    risky
poor       5 yrs  low     safe
fair       3 yrs  high    safe

Start with all the data

(all data) Loan status counts: 22 Safe, 18 Risky, N = 40 examples

Compact visual notation: Root node

Root node: 22 Safe, 18 Risky (N = 40 examples)

Decision stump: Single level tree

Split on Credit:
Root (22 Safe, 18 Risky) → Credit?
- excellent: 9 Safe, 0 Risky (subset of data with Credit = excellent)
- fair: 9 Safe, 4 Risky (subset of data with Credit = fair)
- poor: 4 Safe, 14 Risky (subset of data with Credit = poor)

Visual notation: Intermediate nodes

The three children of the root (excellent, fair, poor) are intermediate nodes.

Proceed to build a tree on each of the children

Root (22 Safe, 18 Risky) → Credit?
- excellent: 9 Safe, 0 Risky → Safe
- fair: 9 Safe, 4 Risky
- poor: 4 Safe, 14 Risky

Selecting best feature to split on

How do we pick our first split?

Root (22 Safe, 18 Risky): find the "best" feature to split on!

How do we select the best feature?

Choice 1: Split on Credit
Root (22 Safe, 18 Risky) → Credit?
- excellent: 9 Safe, 0 Risky
- fair: 9 Safe, 4 Risky
- poor: 4 Safe, 14 Risky

OR

Choice 2: Split on Term
Root (22 Safe, 18 Risky) → Term?
- 3 years: 16 Safe, 4 Risky
- 5 years: 6 Safe, 14 Risky

How do we measure effectiveness of a split?

Idea: calculate the classification error of this decision stump.

  Error = (# mistakes) / (# data points)

Calculating classification error

• Step 1: ŷ = class of the majority of the data in the node
• Step 2: Calculate the classification error of predicting ŷ for this data

At the root (22 Safe, 18 Risky), the majority class is Safe: 22 correct, 18 mistakes.

Tree      Classification error
(root)    0.45

Choice 1: Split on Credit history?

Does a split on Credit reduce classification error below 0.45?

Split on Credit: Classification error

Choice 1: Split on Credit
- excellent: 9 Safe, 0 Risky → predict Safe (0 mistakes)
- fair: 9 Safe, 4 Risky → predict Safe (4 mistakes)
- poor: 4 Safe, 14 Risky → predict Risky (4 mistakes)

  Error = (4 + 4) / (22 + 18) = 0.2

Tree             Classification error
(root)           0.45
Split on credit  0.2

Choice 2: Split on Term?

Choice 2: Split on Term
- 3 years: 16 Safe, 4 Risky → predict Safe (4 mistakes)
- 5 years: 6 Safe, 14 Risky → predict Risky (6 mistakes)

Evaluating the split on Term

  Error = (4 + 6) / 40 = 0.25

Tree             Classification error
(root)           0.45
Split on credit  0.2
Split on term    0.25

Choice 1 vs Choice 2: Comparing split on Credit vs Term

Choice 1: Split on Credit (error 0.2) ← WINNER
Choice 2: Split on Term (error 0.25)

Feature split selection algorithm

• Given a subset of data M (a node in a tree)
• For each feature ϕi(x):
  1. Split the data of M according to feature ϕi(x)
  2. Compute the classification error of the split
• Choose the feature ϕ*(x) with the lowest classification error (a sketch of this procedure follows below)
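A minimal sketch of this selection step, assuming categorical features stored as dictionaries (my own illustration; the helper names split_error and best_split_feature are not from the course):

```python
# Pick the feature whose decision stump has the lowest classification error.
# Data is a list of (features_dict, label) pairs.
from collections import Counter

def split_error(data, feature):
    """Classification error of a decision stump that splits on `feature`."""
    groups = {}
    for x, y in data:
        groups.setdefault(x[feature], []).append(y)
    mistakes = 0
    for labels in groups.values():
        majority_count = Counter(labels).most_common(1)[0][1]
        mistakes += len(labels) - majority_count   # minority labels are mistakes
    return mistakes / len(data)

def best_split_feature(data, features):
    """Return the feature with the lowest stump classification error."""
    return min(features, key=lambda f: split_error(data, f))

# Toy usage with loan-style features:
data = [({"Credit": "excellent", "Term": "3 yrs"}, "safe"),
        ({"Credit": "fair", "Term": "5 yrs"}, "risky"),
        ({"Credit": "poor", "Term": "5 yrs"}, "risky"),
        ({"Credit": "poor", "Term": "3 yrs"}, "risky")]
print(best_split_feature(data, ["Credit", "Term"]))   # -> "Credit"
```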

Recursion & Stopping conditions

We've learned a decision stump, what next?

Root (22 Safe, 18 Risky) → Credit?
- excellent: 9 Safe, 0 Risky → all data points are Safe ⇒ predict Safe, nothing else to do with this subset of data (leaf node)
- fair: 9 Safe, 4 Risky
- poor: 4 Safe, 14 Risky

Tree learning = Recursive stump learning

Root (22 Safe, 18 Risky) → Credit?
- excellent: 9 Safe, 0 Risky → Safe
- fair: 9 Safe, 4 Risky → build a decision stump with the subset of data where Credit = fair
- poor: 4 Safe, 14 Risky → build a decision stump with the subset of data where Credit = poor

Second level

Root (22 Safe, 18 Risky) → Credit?
- excellent: 9 Safe, 0 Risky → Safe
- fair: 9 Safe, 4 Risky → Term?
  - 3 years: 0 Safe, 4 Risky → Risky
  - 5 years: 9 Safe, 0 Risky → Safe
- poor: 4 Safe, 14 Risky → Income?
  - high: 4 Safe, 5 Risky → build another stump on these data points
  - low: 0 Safe, 9 Risky → Risky

Final decision tree

Root (22 Safe, 18 Risky) → Credit?
- excellent: 9 Safe, 0 Risky → Safe
- fair: 9 Safe, 4 Risky → Term?
  - 3 years: 0 Safe, 4 Risky → Risky
  - 5 years: 9 Safe, 0 Risky → Safe
- poor: 4 Safe, 14 Risky → Income?
  - high: 4 Safe, 5 Risky → Term?
    - 3 years: 0 Safe, 2 Risky → Risky
    - 5 years: 4 Safe, 3 Risky → Safe
  - low: 0 Safe, 9 Risky → Risky

Simple greedy decision tree learning

Pick best feature to split on

Learn decision stump with this split

For each leaf of decision stump, recurse

When do we stop???

Stopping condition 1: All data agrees on y

All data points in a node have the same y value ⇒ nothing to do (e.g., the leaves of the final tree above).

Stopping condition 2: Already split on all features

Already split on all possible features ⇒ nothing to do.

Greedy decision tree learning

• Step 1: Start with an empty tree
• Step 2: Select a feature to split the data (pick the feature split leading to the lowest classification error)
• For each split of the tree:
  - Step 3: If nothing more to do (stopping conditions 1 & 2), make predictions
  - Step 4: Otherwise, go to Step 2 and continue (recurse) on this split (a code sketch of this recursion follows below)
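A minimal sketch of this greedy recursion under the same assumptions as the earlier split-selection sketch (my own illustration, not course code; categorical features as dictionaries):

```python
# Greedy recursive decision tree learning with classification-error splits.
# Data is a list of (features_dict, label) pairs.
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def split_error(data, feature):
    groups = {}
    for x, y in data:
        groups.setdefault(x[feature], []).append(y)
    mistakes = sum(len(ls) - Counter(ls).most_common(1)[0][1] for ls in groups.values())
    return mistakes / len(data)

def build_tree(data, features):
    labels = [y for _, y in data]
    # Stopping condition 1: all data agrees on y.
    # Stopping condition 2: no more features to split on.
    if len(set(labels)) == 1 or not features:
        return {"leaf": majority(labels)}
    best = min(features, key=lambda f: split_error(data, f))   # greedy choice
    remaining = [f for f in features if f != best]
    children = {}
    for value in {x[best] for x, _ in data}:
        subset = [(x, y) for x, y in data if x[best] == value]
        children[value] = build_tree(subset, remaining)        # recurse on this split
    return {"split_on": best, "children": children, "default": majority(labels)}

def predict(tree, x):
    while "leaf" not in tree:
        child = tree["children"].get(x[tree["split_on"]])
        if child is None:                  # unseen feature value: fall back to majority
            return tree["default"]
        tree = child
    return tree["leaf"]
```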

Is this a good idea?

Proposed stopping condition 3: Stop if no split reduces the classification error

Stopping condition 3: Don't stop if error doesn't decrease???

y = x[1] xor x[2]

x[1]   x[2]   y
False  False  False
False  True   True
True   False  True
True   True   False

Root: y values are 2 True, 2 False.  Error = 2/4 = 0.5

Tree     Classification error
(root)   0.5

Consider split on x[1]

Split on x[1]: each branch (True, False) contains 1 True and 1 False.  Error = 0.5

Tree            Classification error
(root)          0.5
Split on x[1]   0.5

Consider split on x[2]

Split on x[2]: each branch (True, False) contains 1 True and 1 False.  Error = 0.5

Neither feature improves training error… Stop now???

Tree            Classification error
(root)          0.5
Split on x[1]   0.5
Split on x[2]   0.5

Final tree with stopping condition 3

With stopping condition 3, we never split: the tree is just the root, which predicts True for every input.

Tree                           Classification error
with stopping condition 3      0.5

Without stopping condition 3

Condition 3 (stopping when training error doesn't improve) is not always recommended!

Split on x[1], then split each branch on x[2]:
- x[1] = True,  x[2] = True  → predict False
- x[1] = True,  x[2] = False → predict True
- x[1] = False, x[2] = True  → predict True
- x[1] = False, x[2] = False → predict False

Tree                           Classification error
with stopping condition 3      0.5
without stopping condition 3   0

Decision tree learning: Real-valued features

How do we use real-valued inputs?

Income   Credit     Term   y
$105K    excellent  3 yrs  Safe
$112K    good       5 yrs  Risky
$73K     fair       3 yrs  Safe
$69K     excellent  5 yrs  Safe
$217K    excellent  3 yrs  Risky
$120K    good       5 yrs  Safe
$64K     fair       3 yrs  Risky
$340K    excellent  5 yrs  Safe
$60K     good       3 yrs  Risky

Threshold split

Split on the feature Income:
Root (22 Safe, 18 Risky) → Income?
- Income < $60K: 8 Safe, 13 Risky
- Income >= $60K: 14 Safe, 5 Risky (subset of data with Income >= $60K)

Finding the best threshold split

There are infinitely many possible values of the threshold t*: split into Income < t* and Income >= t*.
[Figure: Safe and Risky points plotted along the Income axis from $10K to $120K]

Consider a threshold between points

Same classification error for any threshold split between consecutive values vA and vB.

Only need to consider mid-points

Finite number of splits to consider.

Threshold split selection algorithm

• Step 1: Sort the values of a feature hj(x):
  let {v1, v2, v3, …, vN} denote the sorted values
• Step 2: For i = 1 … N-1
  - Consider the split ti = (vi + vi+1) / 2
  - Compute the classification error for the threshold split hj(x) >= ti
• Choose the t* with the lowest classification error (a sketch follows below)
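A minimal sketch of this procedure (my own illustration, not course code; labels assumed to be +1/-1 and the helper name best_threshold is my choice):

```python
# Try mid-points between consecutive sorted feature values and keep the one
# with the lowest classification error.
import numpy as np

def best_threshold(values, labels):
    """values: 1-D array of a real feature; labels: array of +1/-1."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(values)
    v, y = values[order], labels[order]

    def stump_error(t):
        # Predict the majority class on each side of the threshold.
        mistakes = 0
        for side in (y[v < t], y[v >= t]):
            if len(side):
                counts = np.unique(side, return_counts=True)[1]
                mistakes += len(side) - counts.max()
        return mistakes / len(y)

    candidates = [(v[i] + v[i + 1]) / 2 for i in range(len(v) - 1)]
    return min(candidates, key=stump_error)

# Toy usage: incomes in $K, +1 = safe, -1 = risky -> best split at 70.0
print(best_threshold([30, 60, 80, 95, 110, 130], [-1, -1, +1, +1, +1, +1]))
```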

Visualizing the threshold split

[Figure: data plotted in the (Age, Income) plane; the threshold split is the vertical line Age = 38]

Split on Age >= 38

[Figure: one side of the Age = 38 line is predicted Risky, the other Safe]

Depth 2: Split on Income >= $60K

[Figure: a second threshold split along the line Income = $60K]

Each split partitions the 2-D space

[Figure: the plane is split into the regions Age < 38; Age >= 38 and Income >= $60K; Age >= 38 and Income < $60K]

Overfitting in decision trees

Decision tree recap

For each leaf node of the final tree, set ŷ = majority value of the data in that leaf.

What happens when we increase depth?

Training error reduces with depth:

Tree depth      depth = 1  depth = 2  depth = 3  depth = 5  depth = 10
Training error  0.22       0.13       0.10       0.03       0.00

[Figure: the decision boundary at each depth]

How do we regularize? In this context, regularization means preferring simpler trees.

[Figure: a complex tree vs. a simpler tree]

Two approaches to picking simpler trees

1. Early Stopping: Stop the learning algorithm before tree becomes too complex

2. Pruning: Simplify the tree after the learning algorithm terminates

Technique 1: Early stopping

• Stopping conditions (recap):
  1. All examples have the same target value
  2. No more features to split on

• Early stopping conditions (a sketch of these checks follows below):
  1. Limit tree depth (choose max_depth using a validation set)
  2. Do not consider splits that do not cause a sufficient decrease in classification error
  3. Do not split an intermediate node which contains too few data points
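A minimal sketch of these checks as a single predicate (my own illustration; the threshold names max_depth, min_error_reduction, and min_node_size are my own conventions, not from the course):

```python
# Decide whether to stop growing the tree at a node before splitting it.
def should_stop(node_labels, depth, best_split_error, current_error,
                max_depth=10, min_error_reduction=0.0, min_node_size=10):
    if len(set(node_labels)) <= 1:                                 # stopping condition 1
        return True
    if depth >= max_depth:                                         # early stopping 1
        return True
    if current_error - best_split_error <= min_error_reduction:    # early stopping 2
        return True
    if len(node_labels) < min_node_size:                           # early stopping 3
        return True
    return False
```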

Challenge with early stopping condition 1

Hard to know exactly when to stop. Also, we might want some branches of the tree to go deeper while others remain shallow.

[Figure: classification error vs. tree depth, from simple trees to complex trees; training error keeps decreasing while true error is U-shaped, and the chosen max_depth is marked]

Early stopping condition 2: Pros and Cons

• Pros:
  - A reasonable heuristic for early stopping to avoid useless splits
• Cons:
  - Too short-sighted: "good" splits may occur right after "useless" splits
  - We saw this with the "xor" example

Two approaches to picking simpler trees

1. Early Stopping: Stop the learning algorithm before tree becomes too complex

2. Pruning: Simplify the tree after the learning algorithm terminates

Complements early stopping

Pruning: Intuition

Train a complex tree, simplify later.

[Figure: a complex tree simplified into a smaller tree]

Pruning motivation

[Figure: classification error vs. tree depth; training error keeps decreasing while true error rises for complex trees. Don't stop too early: simplify after the tree is built.]

Scoring trees: Desired total quality format

Want to balance:
i. How well the tree fits the data
ii. Complexity of the tree

Total cost = measure of fit + measure of complexity
- Measure of fit (classification error): large # = bad fit to training data
- Measure of complexity: large # = likely to overfit

Simple measure of complexity of tree

L(T) = # of leaf nodes

[Figure: a tree splitting on Credit (excellent, good, fair, bad, poor) with leaves Safe, Safe, Safe, Risky, Risky, so L(T) = 5]

Balance simplicity & predictive power

Too complex, risk of overfitting: the deep tree from earlier (Credit?, then Term? and Income? splits).
Too simple, high classification error: a single root node that predicts Risky for everything.

Balancing fit and complexity

Total cost C(T) = Error(T) + λ L(T), where λ is a tuning parameter.

- If λ = 0: build out the full tree.
- If λ = ∞: won't ever split.
- If λ is in between: trades off training fit against tree size (a small numerical sketch follows below).
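A tiny illustration of how this cost can drive pruning decisions. The numbers are my own example, and the rule "replace a subtree by a leaf whenever that does not increase total cost" is the standard cost-complexity idea rather than something spelled out on the slide:

```python
# Total cost C(T) = Error(T) + lambda * L(T).
def total_cost(error, num_leaves, lam):
    return error + lam * num_leaves

# Example: a subtree with 4 leaves and training error 0.25 vs. collapsing it
# to a single leaf with error 0.30, at lambda = 0.02.
lam = 0.02
keep_subtree = total_cost(0.25, 4, lam)    # 0.33
prune_to_leaf = total_cost(0.30, 1, lam)   # 0.32
print("prune" if prune_to_leaf <= keep_subtree else "keep")   # -> prune
```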

Having said all that…

Vanilla decision trees just don't really work very well.

Bagging

Suppose we want to reduce variance

Recall the definition of variance.

If we had lots of independent data sets, we could average their results and get the variance to approach 0.

Given m independent data sets (D1, …, Dm), each of size n, learn m classifiers (f1, …, fm), where fi is trained on Di.

By the law of large numbers, the average (1/m) Σi fi converges to its expectation as m grows, so the variance of the averaged predictor approaches 0.

But we don’t have lots of data sets

Bootstrap Aggregation (aka bagging)

• A technique for addressing high variance (e.g., in decision trees)
  - Given a sample D, draw m samples (D1, …, Dm) of size n, with replacement.
  - Train a classifier on each.
  - Use the average (for regression) or majority vote (for classification).

Advantages of bagging

• Easy to implement
• Reduces variance
• Gives information about the uncertainty of the prediction
• Provides an estimate of test error, the "out of bag" (OOB) error, without using a validation set: for each training point, aggregate the predictions of only those classifiers whose bootstrap sample did not contain that point, and compute the error of these aggregated predictions (a sketch follows below).
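A minimal sketch of bagging with an out-of-bag error estimate (my own illustration, not course code), assuming scikit-learn decision trees as base classifiers, NumPy arrays X and y, and labels in {+1, -1}:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees_with_oob(X, y, m=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    trees, in_bag = [], []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)          # bootstrap sample, with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        in_bag.append(set(idx.tolist()))

    # Bagged prediction: majority vote over all trees.
    def predict(X_new):
        votes = np.sum([t.predict(X_new) for t in trees], axis=0)
        return np.sign(votes)

    # OOB error: for each point, vote only among trees that did not see it.
    # (In this simple sketch, ties and never-out-of-bag points count as mistakes.)
    oob_votes = np.zeros(n)
    for t, bag in zip(trees, in_bag):
        out = np.array([i for i in range(n) if i not in bag])
        if len(out):
            oob_votes[out] += t.predict(X[out])
    oob_error = np.mean(np.sign(oob_votes) != y)
    return predict, oob_error
```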

Bagging produces an ensemble classifier

Ensemble classifier in general
• Goal: predict output y (either +1 or -1) from input x
• Learn ensemble model:
  - Classifiers: f1(x), f2(x), …, fT(x)
  - Coefficients: ŵ1, ŵ2, …, ŵT
• Prediction: ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )

Random Forests

One of the most useful bagged algorithms

• Given data set D, draw m samples (D1, …, Dm) of size n from D, with replacement.
• For each Dj, train a full decision tree hj with one crucial modification:
  - Before each split, randomly subsample k features (without replacement) and only consider these for your split.
• Final classifier obtained by averaging.

One of the best, most popular, and easiest to use algorithms

• Only two hyperparameters, m and k.
• Extremely insensitive to both.
  - Classification: a good choice for k is √d (where d is the number of features)
  - Regression: a good choice for k is d/3
  - Set m as large as you can afford.
• Works really well!
• Features can be of different scale, magnitude, etc.
  - Very convenient with heterogeneous features (see the usage sketch below).
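A minimal usage sketch with scikit-learn's RandomForestClassifier, following the rule-of-thumb settings above (my own example; the dataset is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,      # m: as large as you can afford
    max_features="sqrt",   # k = sqrt(d) features considered at each split
    oob_score=True,        # out-of-bag estimate of generalization accuracy
    random_state=0,
).fit(X_tr, y_tr)

print("OOB accuracy:", forest.oob_score_)
print("Test accuracy:", forest.score(X_te, y_te))
```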

Boosting (Emily Fox, University of Washington, January 30, 2017)

Simple (weak) classifiers are good!

Example of a weak classifier, a decision stump: Income > $100K? Yes → Safe, No → Risky.
Other examples: classifiers with simple features, shallow decision trees, decision stumps.

Low variance. Learning is fast!

But high bias…

Finding a classifier that's just right

[Figure: classification error vs. model complexity; train error decreases while true error is U-shaped. A weak learner sits on the too-simple side; we need a stronger learner.]

Option 1: add more features or depth
Option 2: ?????

Boosting question

“Can a set of weak learners be combined to create a stronger learner?” Kearns and Valiant (1988)

Yes! Schapire (1990)

Boosting

Amazing impact:
• simple approach
• widely used in industry
• wins most Kaggle competitions

Ensemble classifier

A single classifier

Input: x → Classifier (e.g., the decision stump Income > $100K? Yes → Safe, No → Risky) → Output: ŷ = f(x), either +1 or -1

Ensemble methods: Each classifier "votes" on prediction

xi = (Income=$120K, Credit=Bad, Savings=$50K, Market=Good)

Four weak classifiers and their votes on xi:
- f1: Income > $100K? (Yes → Safe, No → Risky)       f1(xi) = +1
- f2: Credit history? (Bad → Risky, Good → Safe)     f2(xi) = -1
- f3: Savings > $100K? (Yes → Safe, No → Risky)      f3(xi) = -1
- f4: Market conditions? (Bad → Risky, Good → Safe)  f4(xi) = +1

Combine? Ensemble model: learn coefficients and predict

F(xi) = sign(w1 f1(xi) + w2 f2(xi) + w3 f3(xi) + w4 f4(xi))

Prediction with ensemble

With learned coefficients w1 = 2, w2 = 1.5, w3 = 1.5, w4 = 0.5 (worked out below):

F(xi) = sign(w1 f1(xi) + w2 f2(xi) + w3 f3(xi) + w4 f4(xi))
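Plugging the slide's votes and coefficients into this formula: F(xi) = sign(2·(+1) + 1.5·(-1) + 1.5·(-1) + 0.5·(+1)) = sign(-0.5) = -1, so the ensemble predicts Risky even though two of the four classifiers voted Safe.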

Ensemble classifier in general
• Goal: predict output y (either +1 or -1) from input x
• Learn ensemble model:
  - Classifiers: f1(x), f2(x), …, fT(x)
  - Coefficients: ŵ1, ŵ2, …, ŵT
• Prediction: ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )

Boosting

Training a classifier

Training data → Learn classifier f(x) → Predict ŷ = sign(f(x))

Learning a decision stump

Credit  Income  y
A       $130K   Safe
B       $80K    Risky ✗
C       $110K   Risky ✗
A       $110K   Safe
A       $90K    Safe
B       $120K   Safe
C       $30K    Risky ✗
C       $60K    Risky ✗
B       $95K    Safe
A       $60K    Safe
A       $98K    Safe

Decision stump: Income > $100K?
- Income > $100K: 3 Safe, 1 Risky → ŷ = Safe
- Income ≤ $100K: 4 Safe, 3 Risky → ŷ = Safe
(✗ marks the points this stump misclassifies)

Boosting = Focus learning on "hard" points

Training data → Learn classifier f(x) → Predict ŷ = sign(f(x)) → Evaluate: learn where f(x) makes mistakes.
Boosting: focus the next classifier on the places where f(x) does less well.

Learning on weighted data: More weight on "hard" or more important points

• Weighted dataset:
  - Each (xi, yi) weighted by αi
• More important point = higher weight αi
• Learning:
  - Data point i counts as αi data points
  - E.g., αi = 2 ⇒ count the point twice

Learning a decision stump on weighted data

Increase the weight α of harder / misclassified points:

Credit  Income  y      Weight α
A       $130K   Safe   0.5
B       $80K    Risky  1.5 ✗
C       $110K   Risky  1.2 ✗
A       $110K   Safe   0.8
A       $90K    Safe   0.6
B       $120K   Safe   0.7
C       $30K    Risky  3   ✗
C       $60K    Risky  2   ✗
B       $95K    Safe   0.8
A       $60K    Safe   0.7
A       $98K    Safe   0.9

Decision stump: Income > $100K? (now using total weight instead of counts)
- Income > $100K: weight 2 Safe vs. 1.2 Risky → ŷ = Safe
- Income ≤ $100K: weight 3 Safe vs. 6.5 Risky → ŷ = Risky

Learning from weighted data in general

• Learning from weighted data is generally no harder.
  - Data point i counts as αi data points
• E.g., gradient descent for logistic regression becomes a weighted sum over data points, with each point weighed by αi (a one-step sketch follows below):

  w_j^(t+1) ← w_j^(t) + η Σ_{i=1}^{N} α_i h_j(x_i) ( 1[y_i = +1] − P(y = +1 | x_i, w^(t)) )
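A minimal sketch of one such weighted gradient step (my own illustration, not course code), matching the update above:

```python
import numpy as np

def weighted_logistic_step(w, H, y, alpha, eta=0.1):
    """
    One gradient step for logistic regression on weighted data.
    w:     current coefficients, shape (D,)
    H:     feature matrix h_j(x_i), shape (N, D)
    y:     labels in {+1, -1}, shape (N,)
    alpha: per-point weights alpha_i, shape (N,)
    """
    p = 1.0 / (1.0 + np.exp(-H @ w))           # P(y = +1 | x_i, w)
    indicator = (y == +1).astype(float)         # 1[y_i = +1]
    gradient = H.T @ (alpha * (indicator - p))  # weighted sum over data points
    return w + eta * gradient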

Boosting = Greedy learning of ensembles from data

Training data → learn classifier f1(x) → predict ŷ = sign(f1(x))
→ assign higher weight to points where f1(x) is wrong → weighted data
→ learn classifier & coefficient ŵ2, f2(x) → predict ŷ = sign(ŵ1 f1(x) + ŵ2 f2(x))

AdaBoost

AdaBoost: learning ensemble [Freund & Schapire 1999]

• Start with same weight for all points: αi = 1/N

• For t = 1,…,T

- Learn ft(x) with data weights αi

- Compute coefficient ŵt

- Recompute weights αi

• Final model predicts by: ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )

Computing coefficient ŵt

AdaBoost: Computing coefficient ŵt of classifier ft(x)

Is ft(x) good? Yes → ŵt large; No → ŵt small.

• ft(x) is good ⇒ ft has low training error
• Measuring error on weighted data? Just the weighted # of misclassified points.

Weighted classification error

Example with a learned sentiment classifier: a correct prediction on the data point "Sushi was great" (α = 1.2) adds 1.2 to the weight of correct predictions, while a mistake on "Food was OK" (α = 0.5) adds 0.5 to the weight of mistakes.

• Total weight of mistakes: Σ_{i: ft(xi) ≠ yi} αi
• Total weight of all points: Σ_{i=1}^{N} αi
• Weighted error measures the fraction of the weight that is on mistakes (a small sketch follows below):

  weighted_error = (total weight of all mistakes) / (total weight of all data)

- Best possible value is 0.0
- Worst possible value is 1.0 (in practice the worst is 0.5)
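A minimal sketch of this quantity (my own illustration; the function name weighted_error is my choice):

```python
import numpy as np

def weighted_error(y_true, y_pred, alpha):
    """Weight of the mistakes divided by the total weight."""
    y_true, y_pred, alpha = map(np.asarray, (y_true, y_pred, alpha))
    return np.sum(alpha[y_true != y_pred]) / np.sum(alpha)

# Example: one mistake (weight 0.5) out of total weight 2.0 -> 0.25
print(weighted_error([+1, -1, +1], [+1, +1, +1], [1.2, 0.5, 0.3]))
```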

AdaBoost: Formula for computing coefficient ŵt of classifier ft(x)

  ŵt = (1/2) ln( (1 − weighted_error(ft)) / weighted_error(ft) )

Is ft(x) good?

weighted_error(ft) on training data   (1 − weighted_error(ft)) / weighted_error(ft)   ŵt
0.01 (good)                           (1 − 0.01) / 0.01 = 99                          0.5 ln(99) = 2.3
0.5                                   (1 − 0.5) / 0.5 = 1                             0.5 ln(1) = 0
0.99 (bad)                            (1 − 0.99) / 0.99 ≈ 0.01                        0.5 ln(0.01) = -2.3

The last one is a terrible classifier, but -ft(x) is great!

AdaBoost: learning ensemble

• Start with the same weight for all points: αi = 1/N
• For t = 1, …, T
  - Learn ft(x) with data weights αi
  - Compute coefficient ŵt = (1/2) ln( (1 − weighted_error(ft)) / weighted_error(ft) )
  - Recompute weights αi
• Final model predicts by: ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )

Recompute weights αi

AdaBoost: Updating weights αi based on where classifier ft(x) makes mistakes

Did ft get xi right? Yes → decrease αi; No → increase αi.

AdaBoost: Formula for updating weights αi

  αi ← αi e^(-ŵt),  if ft(xi) = yi
  αi ← αi e^(ŵt),   if ft(xi) ≠ yi

ft(xi) = yi ?   ŵt    Multiply αi by     Implication
correct         2.3   e^(-2.3) = 0.1     Decrease importance of (xi, yi)
correct         0     e^(0) = 1          Keep importance the same
mistake         2.3   e^(2.3) = 9.98     Increase importance
mistake         0     e^(0) = 1          Keep importance the same

AdaBoost: learning ensemble

• Start with the same weight for all points: αi = 1/N
• For t = 1, …, T
  - Learn ft(x) with data weights αi
  - Compute coefficient ŵt = (1/2) ln( (1 − weighted_error(ft)) / weighted_error(ft) )
  - Recompute weights: αi ← αi e^(-ŵt) if ft(xi) = yi, αi ← αi e^(ŵt) if ft(xi) ≠ yi
• Final model predicts by: ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )

AdaBoost: Normalizing weights αi

If xi is often a mistake, its weight αi gets very large; if xi is often correct, its weight αi gets very small. This can cause numerical instability after many iterations.

Normalize the weights to add up to 1 after every iteration:

  αi ← αi / Σ_{j=1}^{N} αj

AdaBoost: learning ensemble

• Start with the same weight for all points: αi = 1/N
• For t = 1, …, T
  - Learn ft(x) with data weights αi
  - Compute coefficient ŵt = (1/2) ln( (1 − weighted_error(ft)) / weighted_error(ft) )
  - Recompute weights: αi ← αi e^(-ŵt) if ft(xi) = yi, αi ← αi e^(ŵt) if ft(xi) ≠ yi
  - Normalize weights: αi ← αi / Σ_{j=1}^{N} αj
• Final model predicts by: ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) ) (a code sketch of this loop follows below)
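A minimal sketch of the full loop above (my own illustration, not course code), assuming scikit-learn depth-1 trees as the weak learners, NumPy arrays X and y, and labels in {+1, -1}:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=30):
    n = len(y)
    alpha = np.full(n, 1.0 / n)                          # start with equal weights
    stumps, coeffs = [], []
    for _ in range(T):
        # Learn f_t(x) with data weights alpha_i.
        f = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=alpha)
        pred = f.predict(X)
        err = np.sum(alpha[pred != y]) / np.sum(alpha)   # weighted error
        err = np.clip(err, 1e-10, 1 - 1e-10)             # avoid log(0)
        w_hat = 0.5 * np.log((1 - err) / err)            # coefficient
        # Recompute weights: shrink correct points, grow mistakes, then normalize.
        alpha *= np.exp(-w_hat * y * pred)               # e^{-w} if right, e^{+w} if wrong
        alpha /= alpha.sum()
        stumps.append(f)
        coeffs.append(w_hat)

    def predict(X_new):
        scores = sum(w * f.predict(X_new) for w, f in zip(coeffs, stumps))
        return np.sign(scores)
    return predict
```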

AdaBoost example

t=1: Just learn a classifier on the original data

[Figure: the original data and the learned decision stump f1(x)]

Updating weights αi

Increase the weight αi of misclassified points.
[Figure: the boundary of f1(x) and the new data weights αi]

t=2: Learn classifier on weighted data

[Figure: the weighted data, using the αi chosen in the previous iteration, and the decision stump f2(x) learned on the weighted data]

Ensemble becomes weighted sum of learned classifiers

ŷ = sign( ŵ1 f1(x) + ŵ2 f2(x) ) with ŵ1 = 0.61 and ŵ2 = 0.53

Decision boundary of ensemble classifier after 30 iterations

training_error = 0

Boosting convergence & overfitting

Boosting question revisited

"Can a set of weak learners be combined to create a stronger learner?" Kearns and Valiant (1988)

Yes! Schapire (1990)

Boosting

After some iterations, training error of boosting goes to zero!!!

[Figure: training error of boosted decision stumps on a toy dataset vs. iterations of boosting. Training error of 1 decision stump = 22.5%; training error of an ensemble of 30 decision stumps = 0%.]

AdaBoost Theorem

Under some technical conditions, the training error of the boosted classifier → 0 as T → ∞. It may oscillate a bit, but it will generally decrease and eventually become 0!

Condition of AdaBoost Theorem

The condition: at every t, we can find a weak learner with weighted_error(ft) < 0.5. This is not always possible; extreme example: no classifier can separate a +1 sitting on top of a -1. Nonetheless, boosting often yields great training error.

Overfitting in boosting

Decision trees on loan data: 8% training error, 39% test error (overfitting).

Boosted decision stumps on loan data: 28.5% training error, 32% test error (better fit & lower test error).

Boosting tends to be robust to overfitting

Test set performance stays about constant for many iterations ⇒ less sensitive to the choice of T.

But boosting will eventually overfit, so must choose max number of components T

Best test error around 31%

Test error eventually increases to 33% (overfits)

How do we decide when to stop boosting?

Choosing T? Like selecting parameters in other ML approaches (e.g., λ in regularization):
- Not on training data: not useful, since training error improves as T increases
- Validation set: if the dataset is large
- Cross-validation: for smaller data
- Never ever ever ever on the test data

Summary of boosting

Variants of boosting and related algorithms

There are hundreds of variants of boosting; the most important:

Gradient boosting
• Like AdaBoost, but useful beyond basic classification
• XGBoost (Tianqi Chen, UW)

Many other approaches to learn ensembles; the most important:

Random forests
• Bagging: pick random subsets of the data
  - Learn a tree in each subset
  - Average predictions
• Simpler than boosting & easier to parallelize
• Typically higher error than boosting for the same # of trees (# iterations T)

Impact of boosting (spoiler alert... HUGE IMPACT)

Amongst the most useful ML methods ever created

Extremely useful in computer vision
• Standard approach for face detection, for example

Used by most winners of ML competitions (Kaggle, KDD Cup, …)
• Malware classification, credit fraud detection, ads click-through rate estimation, sales forecasting, ranking webpages for search, Higgs boson detection, …

Most deployed ML systems use model ensembles
• Coefficients chosen manually, with boosting, with bagging, or others

What you should know about today's material for the final

• What a decision tree is, the basic greedy approach to constructing one, and what the decision boundary will look like if data is real-valued (axis-aligned partitions)
• Mitigating overfitting and high variance by limiting depth, pruning the tree, bagging, and random forests
• Understand the idea of an ensemble classifier
• Outline the boosting framework: sequentially learn classifiers on weighted data. Key idea: put more weight on data points where we make errors.
• Understand the AdaBoost algorithm at a high level
  - Learn each classifier on weighted data
  - Compute the coefficient of the classifier
  - Recompute the data weights
  - Normalize the weights