Decision Trees, Bagging and Boosting
CSE 446: Machine Learning, created by Emily Fox
©2017 Emily Fox

Predicting potential loan defaults

What makes a loan risky?
I want to buy a new house! A loan application is scored on several inputs: Credit History, Income, Term, and Personal Info.
Credit history explained
Did I pay previous loans on time?
Example: excellent, good, or fair
Income
What's my income?
Example: $80K per year
Loan terms
How soon do I need to pay off the loan?
Example: 3 years, 5 years, ...
Personal information
Age, reason for the loan, marital status, ...
Example: home loan for a married couple
Intelligent application
Loan applications are fed into an intelligent loan application review system, which labels each one Safe ✓ or Risky ✘.
Classifier review
Input: xi (the loan application). The classifier MODEL produces a predicted class: ŷi = +1 (Safe) or ŷi = -1 (Risky).
This module: decision trees

Start
Credit?
- excellent → Safe
- fair → Term? (3 years → Risky; 5 years → Safe)
- poor → Income?
    - high → Term? (3 years → Risky; 5 years → Safe)
    - low → Risky
Scoring a loan application

xi = (Credit = poor, Income = high, Term = 5 years)

Traverse the tree from the Start node: Credit = poor → Income = high → Term = 5 years, reaching the leaf ŷi = Safe.
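The traversal can be sketched in code. A minimal sketch, assuming the tree is stored as nested dicts (a hypothetical representation chosen for illustration, not from the slides):

```python
# Decision tree from the slides as nested dicts: an internal node maps a
# feature name to {feature_value: subtree}; a leaf is just a class label.
TREE = {"Credit": {
    "excellent": "Safe",
    "fair": {"Term": {"3 years": "Risky", "5 years": "Safe"}},
    "poor": {"Income": {
        "high": {"Term": {"3 years": "Risky", "5 years": "Safe"}},
        "low": "Risky",
    }},
}}

def score(tree, x):
    """Walk from the root until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        feature, branches = next(iter(tree.items()))
        tree = branches[x[feature]]
    return tree

xi = {"Credit": "poor", "Income": "high", "Term": "5 years"}
print(score(TREE, xi))  # Safe
```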
Decision tree learning task

Decision tree learning problem
Training data: N observations (xi, yi)

Credit     Term   Income  y
excellent  3 yrs  high    safe
fair       5 yrs  low     risky
fair       3 yrs  high    safe
poor       5 yrs  high    risky
excellent  3 yrs  low     risky
fair       5 yrs  low     safe
poor       3 yrs  high    risky
poor       5 yrs  low     safe
fair       3 yrs  high    safe

Goal: learn a tree T(X) by optimizing a quality metric on the training data.
Quality metric: Classification error
• Error measures the fraction of mistakes:
  Error = (# incorrect predictions) / (# examples)
- Best possible value: 0.0
- Worst possible value: 1.0
How do we find the best tree?
The number of possible trees T1(X), T2(X), ..., T6(X), ... is exponentially large, which makes decision tree learning hard! Learning the smallest decision tree is an NP-hard problem [Hyafil & Rivest '76].
Greedy decision tree learning

Our training data table
Assume N = 40 examples and 3 features (Credit, Term, Income), with rows as in the table above.
Start with all the data
Loan status at the root: 22 Safe, 18 Risky (N = 40 examples).

Compact visual notation: Root node
Each node is annotated with (# Safe, # Risky); the root is (22, 18).
Decision stump: Single level tree
Split the root (22 Safe, 18 Risky) on Credit:
- excellent: (9, 0), the subset of data with Credit = excellent
- fair: (9, 4), the subset of data with Credit = fair
- poor: (4, 14), the subset of data with Credit = poor

Visual notation: Intermediate nodes
The three children of the split above are intermediate nodes.
Proceed to build a tree on each of the children
After splitting the root on Credit, recurse on each child; the excellent branch (9, 0) is all Safe, so it predicts Safe.
Selecting the best feature to split on

How do we pick our first split?
Find the "best" feature to split the root (22 Safe, 18 Risky) on, e.g., Credit, giving children excellent (9, 0), fair (9, 4), poor (4, 14).
How do we select the best feature?
Choice 1: Split on Credit: excellent (9, 0), fair (9, 4), poor (4, 14)
Choice 2: Split on Term: 3 years (16, 4), 5 years (6, 14)

How do we measure effectiveness of a split?
Idea: calculate the classification error of the decision stump:
  Error = (# mistakes) / (# data points)
Calculating classification error
• Step 1: ŷ = class of the majority of data in the node
• Step 2: Calculate the classification error of predicting ŷ for this data

At the root (22 Safe, 18 Risky), ŷ = Safe: 22 correct, 18 mistakes.

Tree    Classification error
(root)  0.45
Choice 1: Split on Credit history?
Does a split on Credit reduce the classification error below 0.45?

Split on Credit: Classification error
excellent (9, 0) → Safe: 0 mistakes
fair (9, 4) → Safe: 4 mistakes
poor (4, 14) → Risky: 4 mistakes

Error = (0 + 4 + 4) / (22 + 18) = 0.2

Tree             Classification error
(root)           0.45
Split on credit  0.2
Choice 2: Split on Term?
3 years (16, 4) → Safe: 4 mistakes
5 years (6, 14) → Risky: 6 mistakes

Error = (4 + 6) / 40 = 0.25

Tree             Classification error
(root)           0.45
Split on credit  0.2
Split on term    0.25

Choice 1 vs Choice 2: Comparing split on Credit vs Term
Splitting on Credit (error 0.2) beats splitting on Term (error 0.25): Credit is the WINNER.
Feature split selection algorithm
• Given a subset of data M (a node in a tree)
• For each feature φi(x):
  1. Split the data of M according to feature φi(x)
  2. Compute the classification error of the split
• Choose the feature φ*(x) with the lowest classification error
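A minimal sketch of this selection rule, run on the slides' nine-row table (the function names are my own):

```python
ROWS = [("excellent", "3 yrs", "high", "safe"), ("fair", "5 yrs", "low", "risky"),
        ("fair", "3 yrs", "high", "safe"), ("poor", "5 yrs", "high", "risky"),
        ("excellent", "3 yrs", "low", "risky"), ("fair", "5 yrs", "low", "safe"),
        ("poor", "3 yrs", "high", "risky"), ("poor", "5 yrs", "low", "safe"),
        ("fair", "3 yrs", "high", "safe")]
DATA = [({"Credit": c, "Term": t, "Income": i}, y) for c, t, i, y in ROWS]

def split_error(data, feature):
    """Classification error of the decision stump that splits on `feature`:
    each child predicts its majority class; count the leftover mistakes."""
    groups = {}
    for x, y in data:
        groups.setdefault(x[feature], []).append(y)
    mistakes = sum(len(g) - max(g.count(lbl) for lbl in set(g))
                   for g in groups.values())
    return mistakes / len(data)

def best_split(data, features):
    """Pick the feature whose stump has the lowest classification error."""
    return min(features, key=lambda f: split_error(data, f))

print(best_split(DATA, ["Credit", "Term", "Income"]))  # Credit
```

On this table, splitting on Credit leaves 3 of 9 points misclassified, versus 4 of 9 for Term or Income, so Credit wins.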
Recursion & stopping conditions

We've learned a decision stump, what next?
In the Credit = excellent child (9, 0), all data points are Safe → nothing else to do with this subset of data. It becomes a leaf node predicting Safe.
Tree learning = Recursive stump learning
The excellent branch is done (Safe). Build a decision stump with the subset of data where Credit = fair, and another with the subset where Credit = poor.
Second level
- fair (9, 4) → split on Term: 3 years (0, 4) → Risky, 5 years (9, 0) → Safe
- poor (4, 14) → split on Income: high (4, 5) → build another stump on these data points, low (0, 9) → Risky

Final decision tree
Credit?
- excellent (9, 0) → Safe
- fair (9, 4) → Term?
    - 3 years (0, 4) → Risky
    - 5 years (9, 0) → Safe
- poor (4, 14) → Income?
    - high (4, 5) → Term?
        - 3 years (0, 2) → Risky
        - 5 years (4, 3) → Safe
    - low (0, 9) → Risky
Simple greedy decision tree learning
1. Pick the best feature to split on
2. Learn a decision stump with this split
3. For each leaf of the decision stump, recurse
When do we stop???
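The loop above can be sketched as a recursive procedure on the nine-row table from earlier. This is a minimal illustration, not the course's reference code; it stops on the two conditions discussed next (a pure node, or no features left):

```python
ROWS = [("excellent", "3 yrs", "high", "safe"), ("fair", "5 yrs", "low", "risky"),
        ("fair", "3 yrs", "high", "safe"), ("poor", "5 yrs", "high", "risky"),
        ("excellent", "3 yrs", "low", "risky"), ("fair", "5 yrs", "low", "safe"),
        ("poor", "3 yrs", "high", "risky"), ("poor", "5 yrs", "low", "safe"),
        ("fair", "3 yrs", "high", "safe")]
DATA = [({"Credit": c, "Term": t, "Income": i}, y) for c, t, i, y in ROWS]

def majority(labels):
    return max(set(labels), key=labels.count)

def build_tree(data, features):
    """Greedy recursive stump learning."""
    labels = [y for _, y in data]
    if len(set(labels)) == 1:      # stopping condition 1: all data agrees on y
        return labels[0]
    if not features:               # stopping condition 2: no features left
        return majority(labels)
    def stump_mistakes(f):         # mistakes if we split on f and predict
        groups = {}                # the majority class in each child
        for x, y in data:
            groups.setdefault(x[f], []).append(y)
        return sum(len(g) - g.count(majority(g)) for g in groups.values())
    best = min(features, key=stump_mistakes)
    rest = [f for f in features if f != best]
    children = {}
    for x, y in data:
        children.setdefault(x[best], []).append((x, y))
    return {best: {v: build_tree(subset, rest) for v, subset in children.items()}}

tree = build_tree(DATA, ["Credit", "Term", "Income"])
print(tree["Credit"]["excellent"])  # {'Income': {'high': 'safe', 'low': 'risky'}}
```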
Stopping condition 1: All data agrees on y
If all data points in a node have the same y value (e.g., the pure leaves of the tree above), there is nothing to do: make it a leaf.

Stopping condition 2: Already split on all features
If we have already split on all possible features along this branch, there is nothing left to split on: make it a leaf predicting the majority class.
Greedy decision tree learning
• Step 1: Start with an empty tree
• Step 2: Select a feature to split the data (pick the feature split leading to the lowest classification error)
• For each split of the tree:
  - Step 3: If nothing more to do (stopping conditions 1 & 2), make predictions
  - Step 4: Otherwise, go to Step 2 and continue (recurse) on this split

Is this a good idea?
Proposed stopping condition 3: Stop if no split reduces the classification error

Stopping condition 3: Don't stop if error doesn't decrease???
y = x[1] xor x[2]

x[1]   x[2]   y
False  False  False
False  True   True
True   False  True
True   True   False

Root: (2 True, 2 False). Error = 2/4 = 0.5

Tree    Classification error
(root)  0.5
Consider split on x[1]
Splitting on x[1] gives two branches with (1 True, 1 False) each: error = 2/4 = 0.5.

Consider split on x[2]
Splitting on x[2] likewise gives (1 True, 1 False) in each branch: error = 0.5.

Neither feature improves the training error... Stop now???

Tree           Classification error
(root)         0.5
Split on x[1]  0.5
Split on x[2]  0.5

Final tree with stopping condition 3
With stopping condition 3 we stop at the root and predict True: training error 0.5.

Without stopping condition 3
Condition 3 (stopping when training error doesn't improve) is not always recommended!
Split on x[1], then split each branch on x[2]: every leaf is pure, and training error drops to 0.

Tree                          Classification error
with stopping condition 3     0.5
without stopping condition 3  0
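The xor failure mode is easy to check numerically; a small sketch (encoding and names are mine):

```python
# y = x[1] xor x[2]; the four points from the slide, encoded as 0/1.
data = [((a, b), a ^ b) for a in (0, 1) for b in (0, 1)]

def stump_error(j):
    """Error of the best decision stump that splits only on feature j."""
    mistakes = 0
    for value in (0, 1):
        labels = [y for x, y in data if x[j] == value]
        mistakes += min(labels.count(0), labels.count(1))  # best leaf choice
    return mistakes / len(data)

print(stump_error(0), stump_error(1))  # 0.5 0.5 -- neither split helps

# But splitting on x[1] and then on x[2] isolates every point: the depth-2
# tree is perfect even though the first split did not reduce the error.
depth2_error = sum(int(x[0] ^ x[1] != y) for x, y in data) / len(data)
print(depth2_error)  # 0.0
```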
Decision tree learning: Real-valued features

How do we use real-valued inputs?

Income  Credit     Term   y
$105K   excellent  3 yrs  Safe
$112K   good       5 yrs  Risky
$73K    fair       3 yrs  Safe
$69K    excellent  5 yrs  Safe
$217K   excellent  3 yrs  Risky
$120K   good       5 yrs  Safe
$64K    fair       3 yrs  Risky
$340K   excellent  5 yrs  Safe
$60K    good       3 yrs  Risky
Threshold split
Split the root (22, 18) on the feature Income with a threshold:
- Income < $60K: (8, 13)
- Income >= $60K: (14, 5), the subset of data with Income >= $60K
Finding the best threshold split
There are infinitely many possible values of the threshold t (Income < t* vs Income >= t*).

Consider a threshold between points
Any threshold split between two consecutive observed values vA and vB gives the same classification error.

Only need to consider mid-points
So there is only a finite number of splits to consider.
Threshold split selection algorithm
• Step 1: Sort the values of a feature hj(x); let {v1, v2, v3, ..., vN} denote the sorted values
• Step 2: For i = 1 ... N-1:
  - Consider the split ti = (vi + vi+1) / 2
  - Compute the classification error for the threshold split hj(x) >= ti
• Choose the t* with the lowest classification error
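A minimal sketch of those two steps, run on the Income column from the table above (labels encoded as 1 = Safe, 0 = Risky; the helper name is mine):

```python
def best_threshold(values, labels):
    """Try the midpoint between each pair of consecutive sorted values;
    return (t*, classification error) for the best threshold split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_err = None, 1.0
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # no threshold fits between identical values
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [y for v, y in pairs if v < t]
        right = [y for v, y in pairs if v >= t]
        mistakes = (min(left.count(0), left.count(1))       # best prediction
                    + min(right.count(0), right.count(1)))  # in each child
        if mistakes / n < best_err:
            best_t, best_err = t, mistakes / n
    return best_t, best_err

incomes = [105, 112, 73, 69, 217, 120, 64, 340, 60]   # in $K
labels  = [1, 0, 1, 1, 0, 1, 0, 1, 0]                 # 1 = Safe, 0 = Risky
t, err = best_threshold(incomes, labels)
print(t, round(err, 3))  # 66.5 0.222
```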
Visualizing the threshold split
With two real-valued features (Age, Income), the threshold split Age >= 38 is a vertical line in the plane: the Age < 38 side predicts Risky, the Age >= 38 side predicts Safe.

Depth 2: Split on Income >= $60K
The second split, the line Income = $60K, is a horizontal line within the Age >= 38 region.

Each split partitions the 2-D space
- Age < 38
- Age >= 38, Income >= $60K
- Age >= 38, Income < $60K
Overfitting in decision trees

Decision tree recap
For each leaf node, set ŷ = the majority value of the training data in that leaf (as in the final tree above).

What happens when we increase depth?
Training error decreases with depth, and the decision boundary grows more complex:

Tree depth      1     2     3     5     10
Training error  0.22  0.13  0.10  0.03  0.00
How do we regularize? In this context: prefer simplicity, i.e., a simpler tree over a complex one.

Two approaches to picking simpler trees
1. Early stopping: Stop the learning algorithm before the tree becomes too complex
2. Pruning: Simplify the tree after the learning algorithm terminates
Technique 1: Early stopping
• Stopping conditions (recap):
  1. All examples have the same target value
  2. No more features to split on
• Early stopping conditions:
  1. Limit tree depth (choose max_depth using a validation set)
  2. Do not consider splits that do not cause a sufficient decrease in classification error
  3. Do not split an intermediate node which contains too few data points

Challenge with early stopping condition 1
It is hard to know exactly when to stop (where true error bottoms out as depth grows). Also, we might want some branches of the tree to go deeper while others remain shallow.
Early stopping condition 2: Pros and cons
• Pros:
  - A reasonable heuristic for avoiding useless splits
• Cons:
  - Too short-sighted: we may miss "good" splits that occur right after "useless" splits
  - We saw this with the "xor" example
Two approaches to picking simpler trees
1. Early stopping: Stop the learning algorithm before the tree becomes too complex
2. Pruning: Simplify the tree after the learning algorithm terminates; this complements early stopping

Pruning: Intuition
Train a complex tree, then simplify it.

Pruning motivation
Don't stop too early: grow the tree past where training error flattens, then simplify after the tree is built (true error rises with tree depth even as training error keeps falling).
Scoring trees: Desired total quality format
Want to balance:
i. How well the tree fits the data
ii. The complexity of the tree

Total cost = measure of fit + measure of complexity
- Measure of fit (classification error): large # = bad fit to training data
- Measure of complexity: large # = likely to overfit

Simple measure of complexity of tree
L(T) = # of leaf nodes

Example: a tree splitting on Credit? with branches excellent → Safe, good → Safe, fair → Safe, bad → Risky, poor → Risky has L(T) = 5.
Balance simplicity & predictive power
A tree with a single split (Credit? → Risky/Safe) is too simple: high classification error. The full deep tree is too complex: risk of overfitting.

Balancing fit and complexity
Total cost: C(T) = Error(T) + λ L(T), where λ is a tuning parameter.
- If λ = 0: build out the full tree.
- If λ = ∞: won't ever split (root only).
- If λ is in between: balance fit against tree size.
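The trade-off can be sketched numerically. The error values and leaf counts below are made-up illustrative numbers, not from the slides:

```python
def total_cost(error, num_leaves, lam):
    """C(T) = Error(T) + lambda * L(T)."""
    return error + lam * num_leaves

# Hypothetical candidate trees: the complex one fits better, the simple
# one is smaller.
complex_tree = {"error": 0.10, "leaves": 6}
simple_tree = {"error": 0.25, "leaves": 2}

for lam in (0.0, 0.05, 1.0):
    c = total_cost(complex_tree["error"], complex_tree["leaves"], lam)
    s = total_cost(simple_tree["error"], simple_tree["leaves"], lam)
    print(lam, "complex wins" if c < s else "simple wins")
# At lam = 0 the full tree wins on pure fit; as lam grows, the
# complexity penalty makes the simpler tree preferable.
```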
Having said all that...
Vanilla decision trees just don't really work very well.

Bagging
Suppose we want to reduce variance
Recall the definition of variance: if we had lots of independent data sets, we could average the resulting classifiers and drive the variance toward 0.

Given m independent data sets (D1, ..., Dm), each of size n, learn m classifiers (f1, ..., fm), where fi is trained on Di, and average them: f̄(x) = (1/m) Σi fi(x). By the law of large numbers, the variance of the average goes to 0 as m grows.

But we don't have lots of data sets.
Bootstrap Aggregation (aka bagging)
• A technique for addressing high variance (e.g., in decision trees):
  - Given sample D, draw m samples (D1, ..., Dm) of size n, with replacement.
  - Train a classifier on each.
  - Use the average (for regression) or majority vote (for classification).
Advantages of bagging
• Easy to implement
• Reduces variance
• Gives information about the uncertainty of a prediction
• Provides an estimate of test error, the "out-of-bag" error, without using a validation set:
  Let Si = { j : (xi, yi) ∉ Dj } and define f̃i(x) = (1/|Si|) Σ_{j in Si} fj(x).
  Then the out-of-bag error is:
  err_OOB = (1/n) Σi 1[ sign(f̃i(xi)) ≠ yi ]
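A compact sketch of bagging with an out-of-bag estimate, using a simple 1-D threshold stump as the base learner. All names and the toy data are my own; labels are ±1:

```python
import random

def train_stump(data):
    """Base learner: pick the best threshold/sign on a 1-D feature."""
    xs = sorted({x for x, _ in data})
    best = None
    for t in [(a + b) / 2 for a, b in zip(xs, xs[1:])]:
        for s in (+1, -1):
            err = sum(y != (s if x >= t else -s) for x, y in data)
            if best is None or err < best[0]:
                best = (err, t, s)
    if best is None:                    # all x identical: constant classifier
        return lambda x: 1
    _, t, s = best
    return lambda x: s if x >= t else -s

def bagging(data, m):
    """Draw m bootstrap samples of size n (with replacement), train one
    classifier per sample, and remember which points each sample missed."""
    n = len(data)
    classifiers, oob_sets = [], []
    for _ in range(m):
        idx = [random.randrange(n) for _ in range(n)]
        classifiers.append(train_stump([data[i] for i in idx]))
        oob_sets.append(set(range(n)) - set(idx))
    return classifiers, oob_sets

def oob_error(data, classifiers, oob_sets):
    """For each point, majority-vote only the classifiers that never saw it."""
    mistakes = total = 0
    for i, (x, y) in enumerate(data):
        votes = [f(x) for f, oob in zip(classifiers, oob_sets) if i in oob]
        if votes:
            total += 1
            mistakes += (1 if sum(votes) >= 0 else -1) != y
    return mistakes / total

random.seed(0)
data = [(x, 1 if x >= 5 else -1) for x in range(10)]
fs, oob = bagging(data, m=25)
print(oob_error(data, fs, oob))  # small on this separable toy problem
```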
Bagging produces an ensemble classifier

Ensemble classifier in general
• Goal: predict output y (either +1 or -1) from input x
• Learn ensemble model:
  - Classifiers: f1(x), f2(x), ..., fT(x)
  - Coefficients: ŵ1, ŵ2, ..., ŵT
• Prediction: ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )

Random Forests
One of the most useful bagged algorithms
• Given data set D, draw m samples (D1, ..., Dm) of size n from D, with replacement.
• For each Dj, train a full decision tree hj with one crucial modification:
  - Before each split, randomly subsample k features (without replacement) and only consider these for the split.
• The final classifier is obtained by averaging.

One of the best, most popular, and easiest-to-use algorithms
• Only two hyperparameters, m and k, and it is extremely insensitive to both:
  - Classification: a good choice is k = √d (where d is the number of features)
  - Regression: a good choice is k = d/3
  - Set m as large as you can afford.
• Works really well!
• Features can be of different scale, magnitude, etc.; very convenient with heterogeneous features.
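A minimal sketch of the crucial modification, reusing the greedy stump learning from earlier on the nine-row loan table. The per-split random feature subset is what distinguishes this from plain bagging; all function names are mine:

```python
import math
import random

ROWS = [("excellent", "3 yrs", "high", "safe"), ("fair", "5 yrs", "low", "risky"),
        ("fair", "3 yrs", "high", "safe"), ("poor", "5 yrs", "high", "risky"),
        ("excellent", "3 yrs", "low", "risky"), ("fair", "5 yrs", "low", "safe"),
        ("poor", "3 yrs", "high", "risky"), ("poor", "5 yrs", "low", "safe"),
        ("fair", "3 yrs", "high", "safe")]
DATA = [({"Credit": c, "Term": t, "Income": i}, y) for c, t, i, y in ROWS]

def majority(labels):
    return max(sorted(set(labels)), key=labels.count)

def build_tree(data, features, k):
    """Greedy tree learning, but each split only considers a random
    subsample of k features (the random-forest modification)."""
    labels = [y for _, y in data]
    if len(set(labels)) == 1 or not features:
        return majority(labels)
    def mistakes(f):
        groups = {}
        for x, y in data:
            groups.setdefault(x[f], []).append(y)
        return sum(len(g) - g.count(majority(g)) for g in groups.values())
    candidates = random.sample(features, min(k, len(features)))
    best = min(candidates, key=mistakes)
    children = {}
    for x, y in data:
        children.setdefault(x[best], []).append((x, y))
    rest = [f for f in features if f != best]
    return {best: {v: build_tree(s, rest, k) for v, s in children.items()}}

def predict(tree, x, default="risky"):
    while isinstance(tree, dict):
        f, branches = next(iter(tree.items()))
        if x[f] not in branches:   # value missing from this bootstrap sample
            return default
        tree = branches[x[f]]
    return tree

def random_forest(data, features, m, k):
    n = len(data)
    return [build_tree([data[random.randrange(n)] for _ in range(n)],
                       features, k) for _ in range(m)]

random.seed(1)
features = ["Credit", "Term", "Income"]
forest = random_forest(DATA, features, m=15, k=round(math.sqrt(len(features))))
votes = [predict(t, DATA[0][0]) for t in forest]
print(majority(votes))
```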
Boosting
CSE 446: Machine Learning, Emily Fox, University of Washington, January 30, 2017

Simple (weak) classifiers are good!
Examples: logistic regression with simple features, shallow decision trees, and decision stumps (e.g., Income > $100K? Yes → Safe, No → Risky).
Low variance, and learning is fast! But high bias...
Finding a classifier that's just right
As model complexity grows, training error falls but true error eventually rises. A weak learner sits at the too-simple end: we need a stronger learner.
Option 1: add more features or depth
Option 2: ?????

Boosting question
"Can a set of weak learners be combined to create a stronger learner?" (Kearns and Valiant, 1988)
Yes! (Schapire, 1990): Boosting
Amazing impact: a simple approach widely used in industry that wins most Kaggle competitions.

Ensemble classifier
A single classifier
Input: x (e.g., Income > $100K? Yes → Safe, No → Risky)
Output: ŷ = f(x), either +1 or -1

Ensemble methods: Each classifier "votes" on the prediction
xi = (Income = $120K, Credit = Bad, Savings = $50K, Market = Good)

Income > $100K? → f1(xi) = +1 (Safe)
Credit history Bad? → f2(xi) = -1 (Risky)
Savings > $100K? → f3(xi) = -1 (Risky)
Market conditions Good? → f4(xi) = +1 (Safe)

Combine the votes with learned coefficients into an ensemble model:
F(xi) = sign(w1 f1(xi) + w2 f2(xi) + w3 f3(xi) + w4 f4(xi))

Prediction with ensemble
With learned coefficients ŵ1 = 2, ŵ2 = 1.5, ŵ3 = 1.5, ŵ4 = 0.5:
F(xi) = sign(2·(+1) + 1.5·(-1) + 1.5·(-1) + 0.5·(+1)) = sign(-0.5) = -1 (Risky)

Ensemble classifier in general
• Goal: predict output y (either +1 or -1) from input x
• Learn ensemble model:
  - Classifiers: f1(x), f2(x), ..., fT(x)
  - Coefficients: ŵ1, ŵ2, ..., ŵT
• Prediction: ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )

Boosting
©2017 Emily Fox CSE 446: Machine Learning Training a classifier
Learn f(x) Predict Training data classifier ŷ = sign(f(x))
96 ©2017 Emily Fox CSE 446: Machine Learning Learning decision stump
Credit  Income  y
A       $130K   Safe
B       $80K    Risky  ✗
C       $110K   Risky  ✗
A       $110K   Safe
A       $90K    Safe
B       $120K   Safe
C       $30K    Risky  ✗
C       $60K    Risky  ✗
B       $95K    Safe
A       $60K    Safe
A       $98K    Safe

Income > $100K: (3 Safe, 1 Risky) → ŷ = Safe
Income ≤ $100K: (4 Safe, 3 Risky) → ŷ = Safe
(✗ marks the points this stump misclassifies.)

Boosting = Focus learning on "hard" points
Learn f(x) from the training data and predict ŷ = sign(f(x)); then evaluate. Boosting focuses the next classifier on places where f(x) does less well: learn where f(x) makes mistakes.

Learning on weighted data: More weight on "hard" or more important points
• Weighted dataset: each (xi, yi) is weighted by αi
• More important point = higher weight αi
• Learning: data point i counts as αi data points (e.g., αi = 2 → count the point twice)

Learning a decision stump on weighted data
Increase the weight α of the harder/misclassified points:

Credit  Income  y      Weight α
A       $130K   Safe   0.5
B       $80K    Risky  1.5
C       $110K   Risky  1.2
A       $110K   Safe   0.8
A       $90K    Safe   0.6
B       $120K   Safe   0.7
C       $30K    Risky  3
C       $60K    Risky  2
B       $95K    Safe   0.8
A       $60K    Safe   0.7
A       $98K    Safe   0.9

Income > $100K: total weight (2 Safe, 1.2 Risky) → ŷ = Safe
Income ≤ $100K: total weight (3 Safe, 6.5 Risky) → ŷ = Risky
Learning from weighted data in general
• Learning from weighted data is generally no harder: data point i counts as αi data points.
• E.g., gradient ascent for logistic regression, weighting each point by αi in the sum over data points:

  wj(t+1) ← wj(t) + η Σ_{i=1}^{N} αi hj(xi) ( 1[yi = +1] − P(y = +1 | xi, w(t)) )
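A minimal sketch of one such weighted gradient step (names are mine; features are used raw, so hj(xi) = x[j]):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weighted_gradient_step(w, data, alphas, eta):
    """One step of gradient ascent on the alpha-weighted log likelihood:
    w_j <- w_j + eta * sum_i alpha_i * x_ij * (1[y_i=+1] - P(y=+1|x_i,w))."""
    grad = [0.0] * len(w)
    for (x, y), a in zip(data, alphas):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        for j, xj in enumerate(x):
            grad[j] += a * xj * ((1.0 if y == +1 else 0.0) - p)
    return [wj + eta * gj for wj, gj in zip(w, grad)]

# alpha_i = 2 has exactly the effect of counting the point twice:
data = [((1.0, 0.5), +1), ((1.0, -1.0), -1)]
duplicated = weighted_gradient_step([0.0, 0.0], data + [data[0]], [1, 1, 1], 0.1)
weighted = weighted_gradient_step([0.0, 0.0], data, [2, 1], 0.1)
print(duplicated, weighted)
```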
Boosting = Greedy learning of ensembles from data
1. Learn classifier f1(x) on the training data; predict ŷ = sign(f1(x)).
2. Reweight the data: higher weight for points where f1(x) is wrong.
3. Learn classifier f2(x) and coefficient ŵ2 on the weighted data.
4. Predict ŷ = sign(ŵ1 f1(x) + ŵ2 f2(x)), and continue.

AdaBoost
AdaBoost: learning an ensemble [Freund & Schapire 1999]
• Start with the same weight for all points: αi = 1/N
• For t = 1, ..., T:
  - Learn ft(x) with data weights αi
  - Compute coefficient ŵt
  - Recompute weights αi
• Final model predicts by: ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )

Computing coefficient ŵt
AdaBoost: Computing coefficient ŵt of classifier ft(x)
• Is ft(x) good? Yes → ŵt large; No → ŵt small
• ft(x) is good → ft has low training error
• How do we measure error on weighted data? Just use the weighted # of misclassified points.

Weighted classification error
Example: on two data points, ("Sushi was great", α = 1.2), which the learned classifier gets correct, and ("Food was OK", α = 0.5), which it gets wrong, the weight of correct points is 1.20 and the weight of mistakes is 0.50.

• Total weight of mistakes
• Total weight of all points
• Weighted error measures the fraction of the weight that is mistaken:
  weighted_error = (total weight of all mistakes) / (total weight of all data)
- Best possible value: 0.0
- Worst possible value: 1.0 (in practice the effective worst is 0.5, since a classifier worse than random can be flipped)
AdaBoost: Formula for computing coefficient ŵt of classifier ft(x)

  ŵt = (1/2) ln( (1 − weighted_error(ft)) / weighted_error(ft) )

weighted_error(ft)  (1 − err)/err            ŵt
0.01                (1 − 0.01)/0.01 = 99     0.5 ln(99) ≈ 2.3
0.5                 (1 − 0.5)/0.5 = 1        0.5 ln(1) = 0
0.99                (1 − 0.99)/0.99 ≈ 0.01   0.5 ln(0.01) ≈ -2.3

The last one is a terrible classifier, but -ft(x) is great!
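The table's numbers are easy to reproduce (a one-line helper; the name is mine):

```python
import math

def coefficient(weighted_error):
    """AdaBoost coefficient: w_t = 0.5 * ln((1 - err) / err)."""
    return 0.5 * math.log((1 - weighted_error) / weighted_error)

for err in (0.01, 0.5, 0.99):
    print(round(coefficient(err), 2))  # 2.3, then 0.0, then -2.3
```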
AdaBoost: learning ensemble
• Start with the same weight for all points: αi = 1/N
• For t = 1, ..., T:
  - Learn ft(x) with data weights αi
  - Compute coefficient ŵt = (1/2) ln( (1 − weighted_error(ft)) / weighted_error(ft) )
  - Recompute weights αi
• Final model predicts by: ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )

Recompute weights αi
AdaBoost: Updating weights αi based on where classifier ft(x) makes mistakes
• Did ft get xi right? Yes → decrease αi; No → increase αi

AdaBoost: Formula for updating weights αi
  αi ← αi e^(-ŵt)  if ft(xi) = yi
  αi ← αi e^(ŵt)   if ft(xi) ≠ yi

ft(xi) = yi?  ŵt   Multiply αi by     Implication
correct       2.3  e^(-2.3) ≈ 0.1     Decrease importance of (xi, yi)
correct       0    e^0 = 1            Keep importance the same
mistake       2.3  e^(2.3) ≈ 9.97     Increase importance of (xi, yi)
mistake       0    e^0 = 1            Keep importance the same
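The update rule as a helper (name mine), reproducing the rows of the table:

```python
import math

def update_weight(alpha, w_t, correct):
    """alpha_i <- alpha_i * e^(-w_t) if correct, else alpha_i * e^(w_t)."""
    return alpha * math.exp(-w_t if correct else w_t)

print(round(update_weight(1.0, 2.3, True), 2))   # 0.1: downweight correct point
print(round(update_weight(1.0, 2.3, False), 2))  # 9.97: upweight mistake
print(update_weight(1.0, 0.0, True))             # 1.0: w_t = 0 keeps it
```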
AdaBoost: learning ensemble (with weight updates)
• Start with the same weight for all points: αi = 1/N
• For t = 1, ..., T:
  - Learn ft(x) with data weights αi
  - Compute coefficient ŵt = (1/2) ln( (1 − weighted_error(ft)) / weighted_error(ft) )
  - Recompute weights: αi ← αi e^(-ŵt) if ft(xi) = yi, else αi ← αi e^(ŵt)
• Final model predicts by: ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )

AdaBoost: Normalizing weights αi
If xi is often a mistake, its weight αi gets very large; if xi is often correct, its weight gets very small. This can cause numerical instability after many iterations, so normalize the weights to add up to 1 after every iteration:
  αi ← αi / Σ_{j=1}^{N} αj

AdaBoost: learning ensemble (complete algorithm)
• Start with the same weight for all points: αi = 1/N
• For t = 1, ..., T:
  - Learn ft(x) with data weights αi
  - Compute coefficient ŵt = (1/2) ln( (1 − weighted_error(ft)) / weighted_error(ft) )
  - Recompute weights: αi ← αi e^(-ŵt) if ft(xi) = yi, else αi ← αi e^(ŵt)
  - Normalize weights: αi ← αi / Σj αj
• Final model predicts by: ŷ = sign( Σ_{t=1}^{T} ŵt ft(x) )

AdaBoost example
t = 1: Just learn a classifier on the original data, a decision stump f1(x).

Updating weights αi: increase the weight αi of the points f1(x) misclassified (those on the wrong side of its boundary).

t = 2: Learn a classifier on the weighted data, a decision stump f2(x) trained using the αi chosen in the previous iteration.

The ensemble becomes a weighted sum of the learned classifiers: 0.61 f1(x) + 0.53 f2(x).

After 30 iterations, the decision boundary of the ensemble classifier achieves training_error = 0.

Boosting convergence & overfitting
Boosting question revisited
"Can a set of weak learners be combined to create a stronger learner?" (Kearns and Valiant, 1988)
Yes! (Schapire, 1990): Boosting

After some iterations, training error of boosting goes to zero!!!
On a toy dataset with boosted decision stumps: the training error of a single decision stump is 22.5%, while the training error of an ensemble of 30 decision stumps is 0%.

AdaBoost Theorem
Under some technical conditions, the training error of the boosted classifier → 0 as T → ∞. It may oscillate a bit, but it will generally decrease and eventually become 0.

Condition of the AdaBoost Theorem
The condition: at every t, we can find a weak learner with weighted_error(ft) < 0.5. This is not always possible; extreme example: no classifier can separate a +1 sitting on top of a -1. Nonetheless, boosting often yields great training error.

Overfitting in boosting
Decision trees on loan data: 8% training error but 39% test error (overfitting).
Boosted decision stumps on loan data: 28.5% training error and 32% test error, a better fit and lower test error.
Boosting tends to be robust to overfitting
Test set performance is about constant for many iterations, so boosting is less sensitive to the choice of T.

But boosting will eventually overfit, so we must choose the maximum number of components T
On the loan data, the best test error is around 31%; test error eventually increases to 33% (overfits).

How do we decide when to stop boosting?
Choosing T?
Like selecting parameters in other ML approaches (e.g., λ in regularization):
- Never on training data: not useful, since training error improves as T increases
- Never ever ever on test data
- Use a validation set (if the dataset is large) or cross-validation (for smaller data)

Summary of boosting
Variants of boosting and related algorithms
There are hundreds of variants of boosting; the most important:
- Gradient boosting: like AdaBoost, but useful beyond basic classification (e.g., XGBoost, by Tianqi Chen, UW)

There are many other approaches to learning ensembles; the most important:
- Random forests (bagging): pick random subsets of the data, learn a tree on each subset, and average the predictions. Simpler than boosting and easier to parallelize, but typically higher error than boosting for the same number of trees (# iterations T).

Impact of boosting (spoiler alert... HUGE IMPACT)
Amongst the most useful ML methods ever created:
- Extremely useful in computer vision: the standard approach for face detection, for example
- Used by most winners of ML competitions (Kaggle, KDD Cup, ...): malware classification, credit fraud detection, ads click-through rate estimation, sales forecasting, ranking webpages for search, Higgs boson detection, ...
- Most deployed ML systems use model ensembles, with coefficients chosen manually, by boosting, by bagging, or by other methods
What you should know about today's material for the final
• What a decision tree is; the basic greedy approach to constructing one; what the decision boundary looks like for real-valued data (axis-aligned partitions)
• Mitigating overfitting and high variance by limiting depth, pruning the tree, bagging, and random forests
• Understand the idea of an ensemble classifier
• Outline the boosting framework: sequentially learn classifiers on weighted data. Key idea: put more weight on data points where we make errors.
• Understand the AdaBoost algorithm at a high level:
  - Learn each classifier on weighted data
  - Compute the coefficient of the classifier
  - Recompute the data weights
  - Normalize the weights