CS480 Introduction to Ensemble Learning

Edith Law

Ensemble Learning

Models that combine the opinions of multiple classifiers.

[Illustration: several voters cast “guilty” / “not guilty” votes and the majority opinion is taken.]

Advantages:
• Use much simpler learners and still achieve great performance.
• Efficiency at training and test time because of parallelism.

Multiple Voting Classifiers

• All the learning algorithms we have seen so far are deterministic: if you train a decision tree multiple times on the same dataset, you will get the same tree back.
• To get an effect out of multiple voting classifiers, they need to differ.
• There are different ways to get variability:
  - change the learning algorithm
  - change the dataset

Approach #1: Combine different types of classifiers

• Instead of learning a single classifier (e.g., a decision tree) on this dataset, you train a set of different classifiers $h_1, \dots, h_K$ (e.g., a decision tree, KNN, multiple neural networks with different architectures).

•For a test point x’, you make a decision by voting:

$$\hat{y}_1 = h_1(x'), \ \dots, \ \hat{y}_K = h_K(x')$$

• Classification: predict +1 if there are more +1 votes.
• Regression: take the mean or median prediction from the different classifiers.
• Caveat: while it is unlikely that all classifiers will make exactly the same mistake, the inductive biases of different learning algorithms can be highly correlated, i.e., they are prone to similar types of errors.
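As a concrete illustration of this voting scheme, here is a minimal sketch assuming scikit-learn is available; the dataset and the particular classifiers are placeholders, not part of the lecture.

```python
# Sketch: majority voting over heterogeneous classifiers (assumes scikit-learn).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Train K different types of classifiers h_1, ..., h_K on the same data.
classifiers = [
    DecisionTreeClassifier(random_state=0),
    KNeighborsClassifier(n_neighbors=5),
    LogisticRegression(max_iter=1000),
]
for h in classifiers:
    h.fit(X, y)

def vote(x_new):
    """Predict by majority vote over the individual classifiers."""
    preds = np.array([h.predict(x_new.reshape(1, -1))[0] for h in classifiers])
    labels, counts = np.unique(preds, return_counts=True)
    return labels[np.argmax(counts)]

print(vote(X[0]))
```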

Overview

• Bagging
• Random Forests
• Boosting

Approach #2: Train on Multiple Datasets

• Instead of training different types of classifiers on the same dataset, you can train a single type of classifier on multiple datasets.
• But what datasets?
  - We could break the dataset into many pieces and train the model on each one.
  - But the performance may be poor due to the small size of each training set.
• Bootstrap resampling:
  - The dataset we are given, D', is a sample drawn i.i.d. from an unknown distribution D.
  - If we draw a new dataset D'' at random from D' with replacement, then D'' is also (approximately) a sample from D.

Recall Bootstrapping

•Given a dataset D with N training examples:

• Create a bootstrapped training set $D_k$, which contains N training examples drawn randomly from D with replacement.

• Use the learning algorithm to construct a hypothesis $h_k$ by training on $D_k$.

• Use $h_k$ to make predictions on each of the remaining points (from the set $T_k = D - D_k$).
• Repeat this process K times, where K is typically a few hundred.
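A minimal sketch of this bootstrapping loop, assuming a scikit-learn regression tree as the base learner; the dataset and K are placeholders.

```python
# Sketch: bootstrap resampling with out-of-bag predictions (assumes scikit-learn).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100, n_features=5, noise=1.0, random_state=0)
N, K = len(X), 200
rng = np.random.default_rng(0)

hypotheses, oob_sets = [], []
for k in range(K):
    idx = rng.integers(0, N, size=N)         # draw N indices with replacement -> D_k
    oob = np.setdiff1d(np.arange(N), idx)    # remaining points T_k = D - D_k
    h_k = DecisionTreeRegressor().fit(X[idx], y[idx])
    hypotheses.append(h_k)
    oob_sets.append(oob)

# Each h_k can now be evaluated on its out-of-bag points X[oob_sets[k]].
```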

Recall Bootstrapping: Estimate Bias and Variance

• For each point x, we have a set of estimates $h_1(x), \dots, h_K(x)$.

• The average empirical prediction of x is:

$$\hat{h}(x) = \frac{1}{K} \sum_{k=1}^{K} h_k(x)$$

• We estimate the bias as: $y - \hat{h}(x)$
• We estimate the variance as: $\frac{1}{K-1} \sum_{k=1}^{K} \left( \hat{h}(x) - h_k(x) \right)^2$
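A small numeric sketch of these two estimates at a single point x, using hypothetical bootstrap predictions and target value.

```python
# Sketch: bias and variance estimates at one point x from K bootstrap predictions.
import numpy as np

preds = np.array([2.1, 1.9, 2.4, 2.0, 2.2])   # hypothetical h_1(x), ..., h_K(x)
y = 2.5                                       # hypothetical true target at x

h_bar = preds.mean()                                        # average empirical prediction
bias = y - h_bar                                            # estimated bias at x
variance = ((h_bar - preds) ** 2).sum() / (len(preds) - 1)  # estimated variance at x
print(bias, variance)
```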

Bagging = Bootstrap Aggregation

• If we did all the work to get the hypotheses, why not use all of them to make a prediction (as opposed to just estimating bias/variance/error)?
• All hypotheses get to have a vote:
  - For classification: pick the majority class.
  - For regression: average all the predictions.

Bagging

•Start with a dataset D with N training examples

• Create B bootstrapped training sets $D_1, \dots, D_B$.
  - Each bootstrapped training set contains N training examples, drawn randomly from D with replacement.
• Train a model (e.g., a decision tree) separately on each of the datasets to obtain classifiers $h_1, \dots, h_B$.
• Use these classifiers to vote on new test points:

$$\hat{h}(x) = \frac{1}{B} \sum_{b=1}^{B} h_b(x)$$
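A minimal bagging sketch for regression, again assuming scikit-learn decision trees as the base learner; the dataset and B are placeholders. For classification, the mean would be replaced by a majority vote.

```python
# Sketch: bagging B decision trees and averaging their predictions (regression).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)
N, B = len(X), 100
rng = np.random.default_rng(0)

trees = []
for b in range(B):
    idx = rng.integers(0, N, size=N)                 # bootstrap sample D_b
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

def h_bag(x_new):
    """Bagged prediction: average of the B individual trees."""
    x_new = np.atleast_2d(x_new)
    return np.mean([t.predict(x_new)[0] for t in trees])

print(h_bag(X[0]))
```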

Bagging

• These bootstrapped datasets will be similar, but not too similar.
• What is the probability that a given training example is not selected at all? $\left(1 - \frac{1}{N}\right)^N$
• As N goes to infinity, this probability approaches $1/e \approx 0.3679$.
• Therefore, only about 63% of the original training examples will be represented in any given bootstrapped set.
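A quick numeric check of this calculation, for an arbitrary placeholder value of N.

```python
# Check: probability that a given example is never selected in a bootstrap sample.
import numpy as np

N = 10_000
p_excluded = (1 - 1 / N) ** N
print(p_excluded, 1 / np.e)   # both ~0.3679
print(1 - p_excluded)         # fraction of distinct examples represented, ~0.632
```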

Bagging

Which hypothesis classes would benefit most from this approach?

• In theory, bagging eliminates variance altogether.
  - Bagging reduces variance by providing an alternative approach to regularization.
  - That is, even if each of the learned classifiers $h_1, \dots, h_K$ is individually overfit, they are likely to overfit to different things.
  - Through voting, we can overcome a significant portion of this overfitting.
• In practice, bagging tends to reduce variance and increase bias.
• Use it with “unstable” learners that have high variance, e.g., decision trees, neural networks. (A stable algorithm is one for which perturbing the inputs a little does not change the predicted class labels.)

Original vs Bagged Trees

[Figures from HTF Ch. 8: original vs bagged trees]

Experiments (from Breiman)

[Table: misclassification rates for Decision Trees and KNN]

Example from the original Breiman paper on bagging, comparing the misclassification rates of the standard vs bagged classifiers.

https://www.stat.berkeley.edu/~breiman/bagging.pdf

Why does Bagging Work?

• Consider a binary classification task where Y = {1, 0} and a dataset where the ground-truth label is y = 1 for all x.
• Suppose that for a given input x, we have B independent classifiers $h_b(x)$, each of which predicts y = 1 with probability 0.6 and y = 0 with probability 0.4. That is, each $h_b(x)$ has misclassification error 0.4.
• Suppose we form a bagged classifier:

$$h^{\text{bag}}(x) = \arg\max_{k \in \{1, 0\}} \underbrace{\sum_{b=1}^{B} \mathbb{1}\{h_b(x) = k\}}_{\text{the number of votes for class } k}$$

Why does Bagging Work?

$$h^{\text{bag}}(x) = \arg\max_{k \in \{1, 0\}} \underbrace{\sum_{b=1}^{B} \mathbb{1}\{h_b(x) = k\}}_{\text{the number of votes for class } k}$$

• Let $B_0 = \sum_{b=1}^{B} \mathbb{1}\{h_b(x) = 0\}$, i.e., the number of incorrect votes.
• Note that $B_0 \sim \text{Binom}(B, 0.4)$.
• The misclassification rate is $P(h^{\text{bag}}(x) = 0) = P(B_0 \geq B/2)$, which goes to 0 as $B \to \infty$.

The bagged classifier has perfect predictive accuracy as $B \to \infty$. This assumes that the classifiers are independent (not true in practice).
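A quick numeric check of this argument using the binomial tail (assumes SciPy is available); setting p = 0.6 instead reproduces the failure case discussed below under "When will Bagging fail?".

```python
# Check: misclassification rate of the bagged classifier, P(B0 >= B/2), as B grows.
import math
from scipy.stats import binom

p = 0.4                       # error rate of each individual classifier
for B in [1, 11, 101, 1001]:
    k = math.ceil(B / 2)      # need at least B/2 incorrect votes to misclassify
    err = binom.sf(k - 1, B, p)   # P(B0 >= k) with B0 ~ Binom(B, p)
    print(B, err)
```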

Wisdom of the Crowd

Disadvantages of Bagging

• Loss of interpretability: a bagged tree is not a tree!
• Computationally expensive.
• More stable learning algorithms (like nearest neighbors) are typically not affected much by bagging.

When will Bagging fail?

• Suppose we have the same setup, but each of the B independent classifiers has misclassification rate 0.6.
• By the same argument, with $B_0 \sim \text{Binom}(B, 0.6)$, the bagged classifier has misclassification rate $P(h^{\text{bag}}(x) = 0) = P(B_0 \geq B/2)$, which goes to 1 as $B \to \infty$.

The bagged classifier is perfectly inaccurate as $B \to \infty$.

Lesson: bag a good classifier, because bagging a bad classifier can hurt accuracy.

Overview

• Bagging
• Random Forests
• Boosting

Random Forests

• For ensembles of decision trees, it is computationally expensive to train the decision trees themselves (i.e., to choose the tree structures).
• An effective alternative is to use trees with fixed structures and random features.
• Collections of trees are called forests, so classifiers built like this are called random forests.
• In doing so, random forests improve upon bagging by reducing the correlation between the sampled trees, and then averaging them.

Random Forests (Breiman)
http://www.math.usu.edu/~adele/forests/

Basic algorithm:
• Use K bootstrap training sets to train K different trees.
  - At each node, pick m variables at random (use m much smaller than the total number of features) and choose the best split at that node from among those m variables only.
• The final classifier combines the votes of the K trees.

Comments:
• Each tree has high variance, but the ensemble averages them, thus reducing variance.
• Random forests are very competitive in both classification and regression, but still subject to overfitting (especially when the number of features is large but the fraction of relevant features is small).

Extremely Randomized Trees (Geurts et al., 2005)

Basic algorithm:
• Use K bootstrap training sets to train K different trees.
  - At each node, pick m attributes at random (without replacement) and pick a random test involving each attribute.
  - Evaluate all tests (using a normalized information gain metric) and pick the best one for the node.
  - Continue until a desired depth or a desired number of instances ($n_{min}$) at each leaf is reached.
• The final classifier combines the votes from the K random trees.

Comments:
• Very reliable method for both classification and regression.
• The smaller m is, the more randomized the trees are; small m is best, especially with large levels of noise.
• Small $n_{min}$ means less bias and more variance, but the variance is controlled by averaging over trees.
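For reference, both random forests and extremely randomized trees have implementations in scikit-learn; a minimal usage sketch, where the dataset and hyperparameter values are arbitrary placeholders (the library versions may differ in detail from the algorithms described above).

```python
# Sketch: random forests and extremely randomized trees via scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# K trees, each split chosen among m = sqrt(#features) randomly selected features.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
et = ExtraTreesClassifier(n_estimators=200, max_features="sqrt", random_state=0)

rf.fit(X, y)
et.fit(X, y)
print(rf.score(X, y), et.score(X, y))
```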

Decision Boundary

[Figure from HTF Ch. 15: decision boundaries]

• Works well when all the features are at least marginally relevant.
• Intuitively, it works well because:
  - some of the trees will query on useless features and make random predictions;
  - but some will happen to query on good features and will make good predictions;
  - if there are enough trees, the random ones will wash out as noise, and only the good trees will have an effect on the final classification.

Randomization in General

• Instead of searching very hard for the best hypothesis, generate lots of random ones, then average their results.
• Examples:
  - Random forests, random projections.
• Advantages?
  - Very fast, easy, can handle lots of data.
  - Can circumvent difficulties in optimization.
  - Averaging reduces the variance introduced by randomization.
• Disadvantages?
  - The new predictor may be more expensive to evaluate (must go over all trees).
  - Still typically subject to overfitting.
  - Low interpretability compared to standard decision trees.

Overview

• Bagging
• Random Forests
• Boosting

Boosting (Intuition)

Which of the 10 horses will win and why?

Choose horse 3 because [rule of thumb]
⋮

Boosting

• In bagging and random forests, a committee of trees each casts a vote for the predicted class, and the final classifier makes a prediction by averaging the outputs of the individual hypotheses.
• Alternative idea: don't construct the hypotheses independently. Instead, new hypotheses should focus on instances that are problematic for existing hypotheses.
  - If an example is difficult, more components should focus on it.

Boosting

• A process of taking a weak learner (a learning algorithm that achieves an error rate only slightly better than 50%) and turning it into a strong learner.
• Unlike random forests, the committee of weak learners evolves over time, and the members cast a weighted vote.
• AdaBoost (adaptive boosting algorithm) is a famous, practical boosting algorithm:
  - runs in polynomial time;
  - does not require you to define a large number of hyperparameters;
  - automatically “adapts” to the data.

Boosting

Basic algorithm:
• Use the training set to train a simple predictor.
• Re-weight the training examples, putting more weight on examples that were not properly classified by the previous predictor.
• Repeat n times.
• Combine the simple hypotheses into a single, accurate predictor.

Weak Learners

Assume that examples are drawn independently from some probability distribution D.

Let $J_D(h)$ be the expected error of hypothesis h when data is drawn from D:

$$J_D(h) = \sum_{\langle x, y \rangle} P(\langle x, y \rangle) \, J(h(x), y)$$

where J(h(x), y) could be the squared error or the 0/1 loss.

Weak Learners

Assuming 2 classes and $\gamma > 0$, “weak” means

$$J_D(h) < \frac{1}{2} - \gamma$$

Since a hypothesis that guesses each instance's class at random has an error rate of 1/2 (on binary problems), $\gamma$ measures how much better than random $h_t$'s predictions are. So the true error of the weak classifier is only slightly better than random.

Weak Learners

Assume we have some “weak” binary classifiers:
• A decision stump: a single-node decision tree $x_i > t$.
• A single-feature Naïve Bayes classifier.
• A 1-nearest-neighbour classifier.

A decision stump is often trained by brute force: discretize the real numbers from the smallest to the largest value in the training set, enumerate all possible classifiers, and pick the one with the lowest training error.
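A minimal NumPy sketch of this brute-force stump search over features and thresholds; the data and the ±1 labels are hypothetical, and the example weights w anticipate AdaBoost below.

```python
# Sketch: brute-force training of a decision stump based on a test x_i > t.
import numpy as np

def train_stump(X, y, w):
    """Return (feature, threshold, polarity) minimizing the weighted 0/1 error."""
    n, d = X.shape
    best = (None, None, None, np.inf)
    for i in range(d):
        for t in np.unique(X[:, i]):             # candidate thresholds from the data
            for polarity in (+1, -1):
                pred = np.where(X[:, i] > t, polarity, -polarity)
                err = np.sum(w[pred != y])       # weighted training error
                if err < best[3]:
                    best = (i, t, polarity, err)
    return best[:3]

def stump_predict(stump, X):
    i, t, polarity = stump
    return np.where(X[:, i] > t, polarity, -polarity)

# Tiny usage example with uniform weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.where(X[:, 1] > 0.2, 1, -1)               # hypothetical labels in {-1, +1}
w = np.full(len(y), 1 / len(y))
stump = train_stump(X, y, w)
print(stump, np.mean(stump_predict(stump, X) == y))
```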

Adaboost

Questions:
• How do we re-weight the examples?
• How do we combine many simple predictors into a single classifier?

Adaboost (Freund & Schapire, 1995)

$\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$ is the weighted training error of $h_t$, and $\alpha_t$ is the importance of weak learner $h_t$.

The meaning of Alpha

$$\epsilon_t \leftarrow \sum_{i:\, h_t(x_i) \neq y_i} D_t(i), \qquad \alpha_t \leftarrow \frac{1}{2} \log\left( \frac{1 - \epsilon_t}{\epsilon_t} \right)$$

• Suppose that we have:
  - a dataset with 80 positive examples and 20 negative examples;
  - a weak learner that returns +1 if the total weight of the positive examples > the total weight of the negative examples.

• Since the positive examples outweigh the negative ones, $h_1(x) = +1$ for every x.
• $\epsilon_1 = \frac{1}{100} \times 20 = 0.2$
• $\alpha_1 = \frac{1}{2} \log\left( \frac{1 - 0.2}{0.2} \right) = \frac{1}{2} \log 4$
• Weight multiplier for the +ve (correctly classified) examples: $e^{-\frac{1}{2} \log 4} = 1/2$ (before normalization)
• Weight multiplier for the -ve (misclassified) examples: $e^{\frac{1}{2} \log 4} = 2$ (before normalization)
• Normalizer: $Z = 80 \times 1/2 + 20 \times 2 = 80$

The meaning of Alpha

$$\epsilon_t \leftarrow \sum_{i:\, h_t(x_i) \neq y_i} D_t(i), \qquad \alpha_t \leftarrow \frac{1}{2} \log\left( \frac{1 - \epsilon_t}{\epsilon_t} \right)$$

• Therefore, after normalization, the weight on any single positive example is 1/160 and on any single negative example is 1/40.
• Since there are 80 positive and 20 negative examples, the cumulative weight on:
  - all positive examples is 80 × 1/160 = 1/2,
  - all negative examples is 20 × 1/40 = 1/2.
• Thus, after a single boosting iteration, the data has become precisely evenly weighted. In the next iteration, our weak learner must do something more interesting than majority voting if it is to achieve an error rate of less than 50%.
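Putting the reweighting rule and the weighted vote together, here is a minimal AdaBoost sketch; it reuses the hypothetical train_stump and stump_predict helpers from the decision-stump sketch above and assumes labels in {-1, +1}.

```python
# Sketch: AdaBoost with decision stumps as the weak learner.
import numpy as np

def adaboost(X, y, T, train_weak, predict_weak):
    """Run T rounds of boosting; returns the weak hypotheses and their weights alpha."""
    n = len(y)
    D = np.full(n, 1 / n)                          # initial example weights D_1(i) = 1/n
    hypotheses, alphas = [], []
    for t in range(T):
        h_t = train_weak(X, y, D)                  # weak learner trained on weighted data
        pred = predict_weak(h_t, X)
        eps = max(np.sum(D[pred != y]), 1e-12)     # weighted training error epsilon_t
        alpha = 0.5 * np.log((1 - eps) / eps)      # importance of h_t
        D = D * np.exp(-alpha * y * pred)          # up-weight mistakes, down-weight correct
        D = D / D.sum()                            # normalize so D_{t+1} is a distribution
        hypotheses.append(h_t)
        alphas.append(alpha)
    return hypotheses, np.array(alphas)

def adaboost_predict(hypotheses, alphas, X, predict_weak):
    """Final classifier: sign of the alpha-weighted vote of the weak hypotheses."""
    votes = sum(a * predict_weak(h, X) for h, a in zip(hypotheses, alphas))
    return np.sign(votes)

# Usage (with the stump functions sketched earlier):
# hs, alphas = adaboost(X, y, T=50, train_weak=train_stump, predict_weak=stump_predict)
# y_hat = adaboost_predict(hs, alphas, X, stump_predict)
```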

[Worked example slides: First Step, Second Step, Third Step, Final Hypothesis, Calculations]

Properties of Adaboost

• Compared to other boosting algorithms, the main insight is that AdaBoost automatically adapts to the error rate at each iteration.

• Freund and Schapire proved that training error (fraction of mistakes on the training set) on the final hypothesis is at most:

$$\prod_t \left[ 2 \sqrt{\epsilon_t (1 - \epsilon_t)} \right] \;=\; \prod_t \sqrt{1 - 4\gamma_t^2} \;\leq\; \exp\left( -2 \sum_t \gamma_t^2 \right)$$

• Recall: $\gamma_t$ is how much better than random $h_t$ is.
• Thus, if each weak hypothesis is slightly better than random, so that $\gamma_t \geq \gamma$ for some $\gamma > 0$, then AdaBoost reduces the training error exponentially fast.
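As a rough numeric illustration of how fast this bound shrinks, assume a constant hypothetical edge $\gamma_t = \gamma = 0.05$ in every round:

```python
# Bound on training error after T rounds when every gamma_t equals gamma.
import numpy as np

gamma = 0.05
for T in [10, 100, 500, 1000]:
    bound = np.exp(-2 * T * gamma ** 2)
    print(T, bound)    # e.g. T=1000 gives exp(-5), roughly 0.0067
```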

Proof here: https://www.cs.princeton.edu/courses/archive/fall08/cos402/readings/boosting.pdf

Real Dataset: Text Categorization

Boosting Empirical Evaluation

[Scatter plot: y-axis = test error rate of C4.5; x-axis = test error rate of boosting stumps vs C4.5]

Boosting vs Bagging

• Bagging is typically faster, but may get a smaller error reduction (not by much).
• Bagging works well with “reasonable” classifiers.
• Boosting works with very simple classifiers.
  - E.g., Boostexter: text classification using decision stumps based on single words.
• Boosting may have a problem if a lot of the data is mislabeled, because it will focus on those examples a lot, leading to overfitting.

Why Does Boosting Work?

• Weak learners have high bias. By combining them, we get more expressive classifiers; hence, boosting is a bias-reduction technique.
• AdaBoost looks for a good approximation to the log-odds ratio, within the space of functions that can be captured by a linear combination of the base classifiers.
• What happens as we run boosting longer? Intuitively, we get more and more complex hypotheses. How would you expect bias and variance to evolve over time?

A Naive (But Reasonable) Analysis of Error

[Sketch: hypothesized training and test error curves vs. number of boosting rounds]

• Expect the training error to continue to drop (until it reaches 0).

• Expect the test error to increase as we get more voters and the final hypothesis $h_f$ becomes too complex.

Actual Typical Run of Adaboost

[Plot: training and test error vs. number of rounds for an actual run of AdaBoost]

• The test error does not increase, even after 1000 rounds (more than 2 million decision nodes)!
• The test error continues to drop even after the training error reaches 0!
• These results are consistent across many sets of experiments!
• Conjecture: boosting does not overfit!

What you should know

• The general idea behind ensemble methods • The algorithmic procedure of Bagging, Random Forests and Boosting • How Bagging, Random Forests and Boosting improve prediction accuracy