CS480 Introduction to Ensemble Learning

Edith Law

Ensemble Learning

Models that combine the opinions of multiple classifiers.

[Illustration: several voters cast “guilty” / “not guilty” votes and the majority opinion is taken.]

Advantages:
• Use much simpler learners and still achieve great performance.
• Efficiency at training and test time because of parallelism.

Multiple Voting Classifiers

• All the learning algorithms we have seen so far are deterministic: if you train a decision tree multiple times on the same dataset, you will get the same tree back.
• To get an effect out of multiple voting classifiers, they need to differ.
• There are different ways to get variability:
  - change the learning algorithm
  - change the dataset

Approach #1: Combine different types of classifiers

• Instead of learning a single classifier (e.g., a decision tree) on this dataset, you train a set of different classifiers $h_1, \dots, h_K$ (e.g., a decision tree, KNN, multiple neural networks with different architectures).

•For a test point x’, you make a decision by voting:

$$\hat{y}_1 = h_1(x'), \ \dots, \ \hat{y}_K = h_K(x')$$

• Classification: predict +1 if there are more +1 votes.
• Regression: take the mean or median prediction from the different classifiers.
• Caveat: while it is unlikely that all classifiers will make exactly the same mistake, the inductive biases of different learning algorithms can be highly correlated, i.e., they are prone to similar types of errors.
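As a concrete illustration of this voting scheme, here is a minimal sketch assuming scikit-learn is available; the dataset and the particular classifiers are placeholders, not part of the lecture.

```python
# Sketch: majority voting over heterogeneous classifiers (assumes scikit-learn).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Train K different types of classifiers h_1, ..., h_K on the same data.
classifiers = [
    DecisionTreeClassifier(random_state=0),
    KNeighborsClassifier(n_neighbors=5),
    LogisticRegression(max_iter=1000),
]
for h in classifiers:
    h.fit(X, y)

def vote(x_new):
    """Predict by majority vote over the individual classifiers."""
    preds = np.array([h.predict(x_new.reshape(1, -1))[0] for h in classifiers])
    labels, counts = np.unique(preds, return_counts=True)
    return labels[np.argmax(counts)]

print(vote(X[0]))
```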

Overview

• Bagging
• Random Forests
• Boosting

Approach #2: Train on Multiple Datasets

• Instead of training different types of classifiers on the same dataset, you can train a single type of classifier on multiple datasets.
• But what datasets?
  - We could break the dataset into many pieces and train the model on each one.
  - But the performance may be poor due to the small size of each training set.
• Bootstrap resampling:
  - The dataset we are given, D', is a sample drawn i.i.d. from an unknown distribution D.
  - If we draw a new dataset D'' at random from D' with replacement, then D'' is also (approximately) a sample from D.

Recall Bootstrapping

•Given a dataset D with N training examples:

• Create a bootstrapped training set $D_k$, which contains N training examples drawn randomly from D with replacement.

• Use the learning algorithm to construct a hypothesis $h_k$ by training on $D_k$.

• Use $h_k$ to make predictions on each of the remaining points (from the set $T_k = D - D_k$).
• Repeat this process K times, where K is typically a few hundred.
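A minimal sketch of this bootstrapping loop, assuming a scikit-learn regression tree as the base learner; the dataset and K are placeholders.

```python
# Sketch: bootstrap resampling with out-of-bag predictions (assumes scikit-learn).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100, n_features=5, noise=1.0, random_state=0)
N, K = len(X), 200
rng = np.random.default_rng(0)

hypotheses, oob_sets = [], []
for k in range(K):
    idx = rng.integers(0, N, size=N)         # draw N indices with replacement -> D_k
    oob = np.setdiff1d(np.arange(N), idx)    # remaining points T_k = D - D_k
    h_k = DecisionTreeRegressor().fit(X[idx], y[idx])
    hypotheses.append(h_k)
    oob_sets.append(oob)

# Each h_k can now be evaluated on its out-of-bag points X[oob_sets[k]].
```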

Recall Bootstrapping: Estimate Bias and Variance

• For each point x, we have a set of estimates $h_1(x), \dots, h_K(x)$.

• The average empirical prediction of x is:

$$\hat{h}(x) = \frac{1}{K} \sum_{k=1}^{K} h_k(x)$$

• We estimate the bias as: $y - \hat{h}(x)$
• We estimate the variance as: $\frac{1}{K-1} \sum_{k=1}^{K} \left( \hat{h}(x) - h_k(x) \right)^2$
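A small numeric sketch of these two estimates at a single point x, using hypothetical bootstrap predictions and target value.

```python
# Sketch: bias and variance estimates at one point x from K bootstrap predictions.
import numpy as np

preds = np.array([2.1, 1.9, 2.4, 2.0, 2.2])   # hypothetical h_1(x), ..., h_K(x)
y = 2.5                                       # hypothetical true target at x

h_bar = preds.mean()                                        # average empirical prediction
bias = y - h_bar                                            # estimated bias at x
variance = ((h_bar - preds) ** 2).sum() / (len(preds) - 1)  # estimated variance at x
print(bias, variance)
```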

Bagging = Bootstrap Aggregation

• If we did all the work to get the hypotheses, why not use all of them to make a prediction (as opposed to just estimating bias/variance/error)?
• All hypotheses get to have a vote:
  - For classification: pick the majority class.
  - For regression: average all the predictions.

Bagging

•Start with a dataset D with N training examples

• Create B bootstrapped training sets $D_1, \dots, D_B$.
  - Each bootstrapped training set contains N training examples, drawn randomly from D with replacement.
• Train a model (e.g., a decision tree) separately on each of the datasets to obtain classifiers $h_1, \dots, h_B$.
• Use these classifiers to vote on new test points:

$$\hat{h}(x) = \frac{1}{B} \sum_{b=1}^{B} h_b(x)$$
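A minimal bagging sketch for regression, again assuming scikit-learn decision trees as the base learner; the dataset and B are placeholders. For classification, the mean would be replaced by a majority vote.

```python
# Sketch: bagging B decision trees and averaging their predictions (regression).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)
N, B = len(X), 100
rng = np.random.default_rng(0)

trees = []
for b in range(B):
    idx = rng.integers(0, N, size=N)                 # bootstrap sample D_b
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

def h_bag(x_new):
    """Bagged prediction: average of the B individual trees."""
    x_new = np.atleast_2d(x_new)
    return np.mean([t.predict(x_new)[0] for t in trees])

print(h_bag(X[0]))
```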

Bagging

• These bootstrapped datasets will be similar, but not too similar.
• What is the probability that a given training example is not selected at all? $\left(1 - \frac{1}{N}\right)^N$
• As N goes to infinity, this probability approaches $1/e \approx 0.3679$.
• Therefore, only about 63% of the original training examples will be represented in any given bootstrapped set.
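A quick numeric check of this calculation, for an arbitrary placeholder value of N.

```python
# Check: probability that a given example is never selected in a bootstrap sample.
import numpy as np

N = 10_000
p_excluded = (1 - 1 / N) ** N
print(p_excluded, 1 / np.e)   # both ~0.3679
print(1 - p_excluded)         # fraction of distinct examples represented, ~0.632
```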

Bagging

Which hypothesis classes would benefit most from this approach?

• In theory, bagging eliminates variance altogether.
  - Bagging reduces variance by providing an alternative approach to regularization.
  - That is, even if each of the learned classifiers $h_1, \dots, h_K$ is individually overfit, they are likely to overfit to different things.
  - Through voting, we can overcome a significant portion of this overfitting.
• In practice, bagging tends to reduce variance and increase bias.
• Use it with “unstable” learners that have high variance, e.g., decision trees, neural networks. (A stable algorithm is one for which perturbing the inputs a little does not change the predicted class labels.)

Original vs Bagged Trees

[Figures from HTF Ch. 8: original vs bagged trees]

Experiments (from Breiman)

[Table: misclassification rates for Decision Trees and KNN]

Example from the original Breiman paper on bagging, comparing the misclassification rates of the standard vs bagged classifiers.

https://www.stat.berkeley.edu/~breiman/bagging.pdf

Why does Bagging Work?

• Consider a binary classification task where Y = {1, 0} and a dataset where the ground-truth label is y = 1 for all x.
• Suppose that for a given input x, we have B independent classifiers $h_b(x)$, each of which predicts y = 1 with probability 0.6 and y = 0 with probability 0.4. That is, each $h_b(x)$ has misclassification error 0.4.
• Suppose we form a bagged classifier:

$$h^{\text{bag}}(x) = \arg\max_{k \in \{1, 0\}} \underbrace{\sum_{b=1}^{B} \mathbb{1}\{h_b(x) = k\}}_{\text{the number of votes for class } k}$$

Why does Bagging Work?

$$h^{\text{bag}}(x) = \arg\max_{k \in \{1, 0\}} \underbrace{\sum_{b=1}^{B} \mathbb{1}\{h_b(x) = k\}}_{\text{the number of votes for class } k}$$

• Let $B_0 = \sum_{b=1}^{B} \mathbb{1}\{h_b(x) = 0\}$, i.e., the number of incorrect votes.
• Note that $B_0 \sim \text{Binom}(B, 0.4)$.
• The misclassification rate is $P(h^{\text{bag}}(x) = 0) = P(B_0 \geq B/2)$, which goes to 0 as $B \to \infty$.

The bagged classifier has perfect predictive accuracy as $B \to \infty$. This assumes that the classifiers are independent (not true in practice).
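A quick numeric check of this argument using the binomial tail (assumes SciPy is available); setting p = 0.6 instead reproduces the failure case discussed below under "When will Bagging fail?".

```python
# Check: misclassification rate of the bagged classifier, P(B0 >= B/2), as B grows.
import math
from scipy.stats import binom

p = 0.4                       # error rate of each individual classifier
for B in [1, 11, 101, 1001]:
    k = math.ceil(B / 2)      # need at least B/2 incorrect votes to misclassify
    err = binom.sf(k - 1, B, p)   # P(B0 >= k) with B0 ~ Binom(B, p)
    print(B, err)
```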

Wisdom of the Crowd

Disadvantages of Bagging

• Loss of interpretability: a bagged tree is not a tree!
• Computationally expensive.
• More stable learning algorithms (like nearest neighbors) are typically not affected much by bagging.

When will Bagging fail?

• Suppose we have the same setup, but each of the B independent classifiers has misclassification rate 0.6.
• By the same argument, with $B_0 \sim \text{Binom}(B, 0.6)$, the bagged classifier has misclassification rate $P(h^{\text{bag}}(x) = 0) = P(B_0 \geq B/2)$, which goes to 1 as $B \to \infty$.

The bagged classifier is perfectly inaccurate as $B \to \infty$.

Lesson: bag a good classifier, because bagging a bad classifier can hurt accuracy.

Overview

• Bagging
• Random Forests
• Boosting

Random Forests

• For ensembles of decision trees, it is computationally expensive to train the decision trees themselves (i.e., to choose the tree structures).
• An effective alternative is to use trees with fixed structures and random features.
• Collections of trees are called forests, so classifiers built like this are called random forests.
• In doing so, random forests improve upon bagging by reducing the correlation between the sampled trees, and then averaging them.

Random Forests (Breiman)
http://www.math.usu.edu/~adele/forests/

Basic algorithm:
• Use K bootstrap training sets to train K different trees.
  - At each node, pick m variables at random (use m much smaller than the total number of features) and choose the best split at that node from among those m variables only.
• The final classifier combines the votes of the K trees.

Comments:
• Each tree has high variance, but the ensemble averages them, thus reducing variance.
• Random forests are very competitive in both classification and regression, but still subject to overfitting (especially when the number of features is large but the fraction of relevant features is small).

Extremely Randomized Trees (Geurts et al., 2005)

Basic algorithm:
• Use K bootstrap training sets to train K different trees.
  - At each node, pick m attributes at random (without replacement) and pick a random test involving each attribute.
  - Evaluate all tests (using a normalized information gain metric) and pick the best one for the node.
  - Continue until a desired depth or a desired number of instances ($n_{min}$) at each leaf is reached.
• The final classifier combines the votes from the K random trees.

Comments:
• Very reliable method for both classification and regression.
• The smaller m is, the more randomized the trees are; small m is best, especially with large levels of noise.
• Small $n_{min}$ means less bias and more variance, but the variance is controlled by averaging over trees.
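For reference, both random forests and extremely randomized trees have implementations in scikit-learn; a minimal usage sketch, where the dataset and hyperparameter values are arbitrary placeholders (the library versions may differ in detail from the algorithms described above).

```python
# Sketch: random forests and extremely randomized trees via scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# K trees, each split chosen among m = sqrt(#features) randomly selected features.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
et = ExtraTreesClassifier(n_estimators=200, max_features="sqrt", random_state=0)

rf.fit(X, y)
et.fit(X, y)
print(rf.score(X, y), et.score(X, y))
```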

Decision Boundary

[Figure from HTF Ch. 15: decision boundaries]

• Works well when all the features are at least marginally relevant.
• Intuitively, it works well because:
  - some of the trees will query on useless features and make random predictions;
  - but some will happen to query on good features and will make good predictions;
  - if there are enough trees, the random ones will wash out as noise, and only the good trees will have an effect on the final classification.

Randomization in General

• Instead of searching very hard for the best hypothesis, generate lots of random ones, then average their results.
• Examples:
  - Random forests, random projections.
• Advantages?
  - Very fast, easy, can handle lots of data.
  - Can circumvent difficulties in optimization.
  - Averaging reduces the variance introduced by randomization.
• Disadvantages?
  - The new predictor may be more expensive to evaluate (must go over all trees).
  - Still typically subject to overfitting.
  - Low interpretability compared to standard decision trees.

Overview

• Bagging
• Random Forests
• Boosting

Boosting (Intuition)

Which of the 10 horses will win and why?

Choose horse 3 because [rule of thumb]
⋮

Boosting

• In bagging and random forests, a committee of trees each casts a vote for the predicted class, and the final classifier makes a prediction by averaging the outputs of the individual hypotheses.
• Alternative idea: don't construct the hypotheses independently. Instead, new hypotheses should focus on instances that are problematic for existing hypotheses.
  - If an example is difficult, more components should focus on it.

Boosting

• A process of taking a weak learner (a learning algorithm that achieves an error rate only slightly better than 50%) and turning it into a strong learner.
• Unlike random forests, the committee of weak learners evolves over time, and the members cast a weighted vote.
• AdaBoost (adaptive boosting algorithm) is a famous, practical boosting algorithm:
  - runs in polynomial time;
  - does not require you to define a large number of hyperparameters;
  - automatically “adapts” to the data.

Boosting

Basic algorithm:
• Use the training set to train a simple predictor.
• Re-weight the training examples, putting more weight on examples that were not properly classified by the previous predictor.
• Repeat n times.
• Combine the simple hypotheses into a single, accurate predictor.

Weak Learners

Assume that examples are drawn independently from some probability distribution D.

Let $J_D(h)$ be the expected error of hypothesis h when data is drawn from D:

$$J_D(h) = \sum_{\langle x, y \rangle} P(\langle x, y \rangle) \, J(h(x), y)$$

where J(h(x), y) could be the squared error or the 0/1 loss.

Weak Learners

Assuming 2 classes and $\gamma > 0$, “weak” means

$$J_D(h) < \frac{1}{2} - \gamma$$

Since a hypothesis that guesses each instance's class at random has an error rate of 1/2 (on binary problems), $\gamma$ measures how much better than random $h_t$'s predictions are. So the true error of the weak classifier is only slightly better than random.

Weak Learners

Assume we have some “weak” binary classifiers:
• A decision stump: a single-node decision tree $x_i > t$.
• A single-feature Naïve Bayes classifier.
• A 1-nearest-neighbour classifier.

A decision stump is often trained by brute force: discretize the real numbers from the smallest to the largest value in the training set, enumerate all possible classifiers, and pick the one with the lowest training error.
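A minimal NumPy sketch of this brute-force stump search over features and thresholds; the data and the ±1 labels are hypothetical, and the example weights w anticipate AdaBoost below.

```python
# Sketch: brute-force training of a decision stump based on a test x_i > t.
import numpy as np

def train_stump(X, y, w):
    """Return (feature, threshold, polarity) minimizing the weighted 0/1 error."""
    n, d = X.shape
    best = (None, None, None, np.inf)
    for i in range(d):
        for t in np.unique(X[:, i]):             # candidate thresholds from the data
            for polarity in (+1, -1):
                pred = np.where(X[:, i] > t, polarity, -polarity)
                err = np.sum(w[pred != y])       # weighted training error
                if err < best[3]:
                    best = (i, t, polarity, err)
    return best[:3]

def stump_predict(stump, X):
    i, t, polarity = stump
    return np.where(X[:, i] > t, polarity, -polarity)

# Tiny usage example with uniform weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.where(X[:, 1] > 0.2, 1, -1)               # hypothetical labels in {-1, +1}
w = np.full(len(y), 1 / len(y))
stump = train_stump(X, y, w)
print(stump, np.mean(stump_predict(stump, X) == y))
```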

Adaboost

Questions:
• How do we re-weight the examples?
• How do we combine many simple predictors into a single classifier?

Adaboost (Freund & Schapire, 1995)

$\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$ is the weighted training error of $h_t$, and $\alpha_t$ is the importance of weak learner $h_t$.

The meaning of Alpha

$$\epsilon_t \leftarrow \sum_{i:\, h_t(x_i) \neq y_i} D_t(i), \qquad \alpha_t \leftarrow \frac{1}{2} \log\left( \frac{1 - \epsilon_t}{\epsilon_t} \right)$$

• Suppose that we have:
  - a dataset with 80 positive examples and 20 negative examples;
  - a weak learner that returns +1 if the total weight of the positive examples > the total weight of the negative examples.

• Since the positive examples outweigh the negative ones, $h_1(x) = +1$ for every x.
• $\epsilon_1 = \frac{1}{100} \times 20 = 0.2$
• $\alpha_1 = \frac{1}{2} \log\left( \frac{1 - 0.2}{0.2} \right) = \frac{1}{2} \log 4$
• Weight multiplier for the +ve (correctly classified) examples: $e^{-\frac{1}{2} \log 4} = 1/2$ (before normalization)
• Weight multiplier for the -ve (misclassified) examples: $e^{\frac{1}{2} \log 4} = 2$ (before normalization)
• Normalizer: $Z = 80 \times 1/2 + 20 \times 2 = 80$

The meaning of Alpha

$$\epsilon_t \leftarrow \sum_{i:\, h_t(x_i) \neq y_i} D_t(i), \qquad \alpha_t \leftarrow \frac{1}{2} \log\left( \frac{1 - \epsilon_t}{\epsilon_t} \right)$$

• Therefore, after normalization, the weight on any single positive example is 1/160 and on any single negative example is 1/40.
• Since there are 80 positive and 20 negative examples, the cumulative weight on:
  - all positive examples is 80 × 1/160 = 1/2,
  - all negative examples is 20 × 1/40 = 1/2.
• Thus, after a single boosting iteration, the data has become precisely evenly weighted. In the next iteration, our weak learner must do something more interesting than majority voting if it is to achieve an error rate of less than 50%.
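Putting the reweighting rule and the weighted vote together, here is a minimal AdaBoost sketch; it reuses the hypothetical train_stump and stump_predict helpers from the decision-stump sketch above and assumes labels in {-1, +1}.

```python
# Sketch: AdaBoost with decision stumps as the weak learner.
import numpy as np

def adaboost(X, y, T, train_weak, predict_weak):
    """Run T rounds of boosting; returns the weak hypotheses and their weights alpha."""
    n = len(y)
    D = np.full(n, 1 / n)                          # initial example weights D_1(i) = 1/n
    hypotheses, alphas = [], []
    for t in range(T):
        h_t = train_weak(X, y, D)                  # weak learner trained on weighted data
        pred = predict_weak(h_t, X)
        eps = max(np.sum(D[pred != y]), 1e-12)     # weighted training error epsilon_t
        alpha = 0.5 * np.log((1 - eps) / eps)      # importance of h_t
        D = D * np.exp(-alpha * y * pred)          # up-weight mistakes, down-weight correct
        D = D / D.sum()                            # normalize so D_{t+1} is a distribution
        hypotheses.append(h_t)
        alphas.append(alpha)
    return hypotheses, np.array(alphas)

def adaboost_predict(hypotheses, alphas, X, predict_weak):
    """Final classifier: sign of the alpha-weighted vote of the weak hypotheses."""
    votes = sum(a * predict_weak(h, X) for h, a in zip(hypotheses, alphas))
    return np.sign(votes)

# Usage (with the stump functions sketched earlier):
# hs, alphas = adaboost(X, y, T=50, train_weak=train_stump, predict_weak=stump_predict)
# y_hat = adaboost_predict(hs, alphas, X, stump_predict)
```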

[Worked example slides: First Step, Second Step, Third Step, Final Hypothesis, Calculations]

Properties of Adaboost

• Compared to other boosting algorithms, the main insight is that AdaBoost automatically adapts to the error rate at each iteration.

• Freund and Schapire proved that training error (fraction of mistakes on the training set) on the final hypothesis is at most:

$$\prod_t \left[ 2 \sqrt{\epsilon_t (1 - \epsilon_t)} \right] \;=\; \prod_t \sqrt{1 - 4\gamma_t^2} \;\leq\; \exp\left( -2 \sum_t \gamma_t^2 \right)$$

• Recall: $\gamma_t$ is how much better than random $h_t$ is.
• Thus, if each weak hypothesis is slightly better than random, so that $\gamma_t \geq \gamma$ for some $\gamma > 0$, then AdaBoost reduces the training error exponentially fast.
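As a rough numeric illustration of how fast this bound shrinks, assume a constant hypothetical edge $\gamma_t = \gamma = 0.05$ in every round:

```python
# Bound on training error after T rounds when every gamma_t equals gamma.
import numpy as np

gamma = 0.05
for T in [10, 100, 500, 1000]:
    bound = np.exp(-2 * T * gamma ** 2)
    print(T, bound)    # e.g. T=1000 gives exp(-5), roughly 0.0067
```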

Proof here: https://www.cs.princeton.edu/courses/archive/fall08/cos402/readings/boosting.pdf

Real Dataset: Text Categorization

Boosting Empirical Evaluation

[Scatter plot: y-axis = test error rate of C4.5; x-axis = test error rate of boosting stumps vs C4.5]

Boosting vs Bagging

• Bagging is typically faster, but may get a smaller error reduction (not by much).
• Bagging works well with “reasonable” classifiers.
• Boosting works with very simple classifiers.
  - E.g., Boostexter: text classification using decision stumps based on single words.
• Boosting may have a problem if a lot of the data is mislabeled, because it will focus on those examples a lot, leading to overfitting.

Why Does Boosting Work?

• Weak learners have high bias. By combining them, we get more expressive classifiers; hence, boosting is a bias-reduction technique.
• AdaBoost looks for a good approximation to the log-odds ratio, within the space of functions that can be captured by a linear combination of the base classifiers.
• What happens as we run boosting longer? Intuitively, we get more and more complex hypotheses. How would you expect bias and variance to evolve over time?

A Naive (But Reasonable) Analysis of Error

[Sketch: hypothesized training and test error curves vs. number of boosting rounds]

• Expect the training error to continue to drop (until it reaches 0).

• Expect the test error to increase as we get more voters and the final hypothesis $h_f$ becomes too complex.

Actual Typical Run of Adaboost

[Plot: training and test error vs. number of rounds for an actual run of AdaBoost]

• The test error does not increase, even after 1000 rounds (more than 2 million decision nodes)!
• The test error continues to drop even after the training error reaches 0!
• These results are consistent across many sets of experiments!
• Conjecture: boosting does not overfit!

What you should know

• The general idea behind ensemble methods • The algorithmic procedure of Bagging, Random Forests and Boosting • How Bagging, Random Forests and Boosting improve prediction accuracy