Lecture 6 – Tree-Based Methods, Bagging and Random Forests
Niklas Wahlström
Division of Systems and Control, Department of Information Technology, Uppsala University
Email: [email protected]

Summary of Lecture 5 (I/IV)

When choosing models/methods, we are interested in how well they will perform when faced with new, unseen data. The new data error

    E_\text{new} \triangleq \mathbb{E}_\star\big[ E(\hat{y}(x_\star; \mathcal{T}), y_\star) \big]

describes how well a method (which is trained using the data set T) will perform "in production". E is, for instance, the mean squared error (regression) or the misclassification rate (classification). The overall goal in supervised machine learning is to achieve a small E_new.

    E_\text{train} \triangleq \frac{1}{n} \sum_{i=1}^{n} E(\hat{y}(x_i; \mathcal{T}), y_i)

is the training data error. It is not a good estimate of E_new.

Summary of Lecture 5 (II/IV)

Two methods for estimating E_new:
1. Hold-out validation data approach: Randomly split the data into a training set and a hold-out validation set. Learn the model using the training set. Estimate E_new using the hold-out validation set.
2. k-fold cross-validation: Randomly split the data into k parts (or folds) of roughly equal size.
   a) The first fold is kept aside as a validation set and the model is learned using only the remaining k − 1 folds. E_new is estimated on the validation set.
   b) The procedure is repeated k times, each time with a different fold treated as the validation set.
   c) The average of all k estimates is taken as the final estimate of E_new.

Summary of Lecture 5 (III/IV)

    \bar{E}_\text{new} = \underbrace{\mathbb{E}_\star\big[ (\bar{f}(x_\star) - f_0(x_\star))^2 \big]}_{\text{Bias}^2}
    + \underbrace{\mathbb{E}_\star\Big[ \mathbb{E}_\mathcal{T}\big[ (\hat{y}(x_\star; \mathcal{T}) - \bar{f}(x_\star))^2 \big] \Big]}_{\text{Variance}}
    + \underbrace{\sigma^2}_{\text{Irreducible error}}

where \bar{f}(x_\star) = \mathbb{E}_\mathcal{T}[\hat{y}(x_\star; \mathcal{T})].

• Bias: The inability of a method to describe the complicated patterns we would like it to describe.
• Variance: How sensitive a method is to the training data.

The more prone a model is to adapt to complicated patterns in the data, the higher the model complexity (or model flexibility).

Summary of Lecture 5 (IV/IV)

[Figure: Ē_new decomposed into bias², variance and irreducible error, plotted against model complexity, with the underfit and overfit regimes marked.]

Finding a balanced fit (neither over- nor underfit) is called the bias-variance tradeoff.

Contents – Lecture 6

1. Classification and regression trees (CART)
2. Bagging – a general variance reduction technique
3. Random forests

The idea behind tree-based methods

In both regression and classification settings we seek a function ŷ(x) which maps the input x into a prediction. One flexible way of designing this function is to partition the input space into disjoint regions and fit a simple model in each region.

[Figure: a two-dimensional input space (x_1, x_2) with training data, partitioned into disjoint regions.]

• Classification: Majority vote within the region.
• Regression: Mean of the training data within the region.

Finding the partition

The key challenge in using this strategy is to find a good partition. Even if we restrict our attention to seemingly simple regions (e.g. "boxes"), finding an optimal partition w.r.t. minimizing the training error is computationally infeasible! Instead, we use a "greedy" approach: recursive binary splitting.

1. Select one input variable x_j and a cut-point s. Partition the input space into two half-spaces, {x : x_j < s} and {x : x_j ≥ s}.
2. Repeat this splitting for each region until some stopping criterion is met (e.g., no region contains more than 5 training data points).
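To make the recursive splitting procedure concrete, here is a minimal sketch for the regression case, assuming NumPy. It uses the squared-error split criterion from the regression-tree slides below; the names best_split, grow_tree and predict, and the dictionary-based tree representation, are illustrative choices rather than anything prescribed in the lecture.

```python
import numpy as np

def best_split(X, y):
    """Brute-force search over all (input variable j, cut-point s) pairs,
    minimizing the sum of squared errors around the two half-space means."""
    best = None                                   # (sse, j, s)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[1:]:          # candidate cut-points for x_j
            left = X[:, j] < s
            right = ~left
            sse = ((y[left] - y[left].mean()) ** 2).sum() \
                + ((y[right] - y[right].mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, s)
    return best                                   # None if no split is possible

def grow_tree(X, y, max_points=5):
    """Recursive binary splitting: split greedily until no region
    contains more than `max_points` training data points."""
    split = best_split(X, y) if len(y) > max_points else None
    if split is None:                             # stopping criterion met
        return {"leaf": True, "prediction": y.mean()}
    _, j, s = split
    left = X[:, j] < s
    return {"leaf": False, "j": j, "s": s,
            "left": grow_tree(X[left], y[left], max_points),
            "right": grow_tree(X[~left], y[~left], max_points)}

def predict(tree, x):
    """Send the input x down the tree to its region's constant prediction."""
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["j"]] < tree["s"] else tree["right"]
    return tree["prediction"]
```

With training data X, y as NumPy arrays, a tree is grown with tree = grow_tree(X, y) and a new point is predicted with predict(tree, x_new).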
Recursive binary splitting

[Figure: Recursive binary splitting. Left: partitioning of a two-dimensional input space (x_1, x_2) into regions R_1, ..., R_5 by cut-points s_1, ..., s_4. Right: the corresponding tree representation with internal nodes x_1 ≤ s_1, x_2 ≤ s_2, x_1 ≤ s_3 and x_2 ≤ s_4.]

Regression trees (I/II)

Once the input space is partitioned into L regions, R_1, R_2, ..., R_L, the prediction model is

    \hat{y}_\star = \sum_{\ell=1}^{L} \hat{y}_\ell \, I\{x_\star \in R_\ell\},

where I{x_⋆ ∈ R_ℓ} is the indicator function

    I\{x_\star \in R_\ell\} = \begin{cases} 1 & \text{if } x_\star \in R_\ell \\ 0 & \text{if } x_\star \notin R_\ell \end{cases}

and ŷ_ℓ is a constant prediction within each region. For regression trees we use

    \hat{y}_\ell = \text{average}\{y_i : x_i \in R_\ell\}.

Regression trees (II/II) – How do we find the partition?

Recursive binary splitting is greedy – each split is made in order to minimize the loss without looking ahead at future splits. For any j and s, define

    R_1(j, s) = \{x \mid x_j < s\} \quad \text{and} \quad R_2(j, s) = \{x \mid x_j \ge s\}.

We then seek the pair (j, s) that minimizes

    \sum_{i:\, x_i \in R_1(j,s)} (y_i - \hat{y}_1(j,s))^2 + \sum_{i:\, x_i \in R_2(j,s)} (y_i - \hat{y}_2(j,s))^2,

where

    \hat{y}_1(j,s) = \text{average}\{y_i : x_i \in R_1(j,s)\},
    \hat{y}_2(j,s) = \text{average}\{y_i : x_i \in R_2(j,s)\}.

This optimization problem is easily solved by "brute force", by evaluating all possible splits.

Classification trees

Classification trees are constructed similarly to regression trees, but with two differences. Firstly, the class prediction for each region is based on the proportion of data points from each class in that region. Let

    \hat{\pi}_{\ell m} = \frac{1}{n_\ell} \sum_{i:\, x_i \in R_\ell} I\{y_i = m\}

be the proportion of training observations in the ℓth region that belong to the mth class. Then we approximate

    p(y = m \mid x) \approx \sum_{\ell=1}^{L} \hat{\pi}_{\ell m} \, I\{x \in R_\ell\}.

Secondly, the squared loss used to construct the tree needs to be replaced by a measure suitable for categorical outputs. Three common error measures are

    Misclassification error: 1 - \max_m \hat{\pi}_{\ell m}
    Entropy/deviance: -\sum_{m=1}^{M} \hat{\pi}_{\ell m} \log \hat{\pi}_{\ell m}
    Gini index: \sum_{m=1}^{M} \hat{\pi}_{\ell m} (1 - \hat{\pi}_{\ell m})

Classification error measures

[Figure: misclassification error, Gini index and deviance plotted against the class-1 probability for a binary classification problem (M = 2).]

ex) Spam data

[Figure: classification tree for the spam data, with splits on features such as cf_!, wf_remove, caps_max, cf_$, wf_free and wf_hp.]

Test error: Tree 11.3 %, LDA 10.9 %.

Improving CART

The flexibility/complexity of classification and regression trees (CART) is decided by the tree depth. To obtain a small bias the tree needs to be grown deep, but this results in a high variance! The performance of (simple) CARTs is therefore often unsatisfactory. To improve the practical performance:

• Pruning – grow a deep tree (small bias) which is then pruned into a smaller one (reduce variance).
• Ensemble methods – average or combine multiple trees.
  - Bagging and Random Forests (this lecture)
  - Boosted trees (next lecture)

Probability detour – Variance reduction by averaging

Let z_b, b = 1, ..., B, be identically distributed random variables with mean E[z_b] = µ and variance Var[z_b] = σ². Let ρ be the correlation between distinct variables. Then

    \mathbb{E}\Big[ \frac{1}{B} \sum_{b=1}^{B} z_b \Big] = \mu,
    \quad
    \mathrm{Var}\Big[ \frac{1}{B} \sum_{b=1}^{B} z_b \Big] = \underbrace{\frac{1-\rho}{B} \sigma^2}_{\text{small for large } B} + \rho \sigma^2.

The variance is reduced by averaging (if ρ < 1)!
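As a quick numerical sanity check of this identity (not part of the lecture), one can simulate equicorrelated variables. The construction below, with a shared Gaussian component, is one simple way to obtain pairwise correlation ρ ≥ 0; the parameter values are arbitrary.

```python
import numpy as np

# Monte Carlo check of the identity
#   Var[(1/B) * sum_b z_b] = (1 - rho)/B * sigma^2 + rho * sigma^2
# using equicorrelated Gaussians z_b built from a shared component w0 and
# independent components w_b (so Var[z_b] = sigma^2, Corr[z_b, z_c] = rho).
rng = np.random.default_rng(0)
B, sigma, rho, n_sim = 10, 2.0, 0.3, 200_000

w0 = rng.normal(size=(n_sim, 1))                  # shared component
wb = rng.normal(size=(n_sim, B))                  # independent components
z = sigma * (np.sqrt(rho) * w0 + np.sqrt(1.0 - rho) * wb)

empirical = z.mean(axis=1).var()
theoretical = (1.0 - rho) / B * sigma**2 + rho * sigma**2
print(f"empirical {empirical:.3f} vs theoretical {theoretical:.3f}")  # both close to 1.48
```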
Bagging (I/II)

For now, assume that we have access to B independent datasets T^1, ..., T^B. We can then train a separate deep tree ŷ^b(x) for each dataset b = 1, ..., B.

• Each ŷ^b(x) has a low bias but high variance.
• By averaging,

    \hat{y}_\text{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{y}^b(x),

the bias is kept small, but the variance is reduced by a factor B!

Bagging (II/II)

Obvious problem: We only have access to one training dataset.
Solution: Bootstrap the data!

• Sample n times with replacement from the original training data T = {x_i, y_i}_{i=1}^n.
• Repeat B times to generate B "bootstrapped" training datasets T̃^1, ..., T̃^B.

For each bootstrapped dataset T̃^b we train a tree ỹ^b(x). Averaging these,

    \tilde{y}_\text{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \tilde{y}^b(x),

is called "bootstrap aggregation", or bagging.

Bagging – Toy example

ex) Assume that we have a training set T = {(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4)}. We generate, say, B = 3 datasets by bootstrapping:

    T̃^1 = {(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_3, y_3)}
    T̃^2 = {(x_1, y_1), (x_4, y_4), (x_4, y_4), (x_4, y_4)}
    T̃^3 = {(x_1, y_1), (x_1, y_1), (x_2, y_2), (x_2, y_2)}

Compute B = 3 (deep) regression trees ỹ^1(x), ỹ^2(x) and ỹ^3(x), one for each dataset T̃^1, T̃^2 and T̃^3, and average

    \tilde{y}_\text{bag}(x) = \frac{1}{3} \sum_{b=1}^{3} \tilde{y}^b(x).

ex) Predicting US Supreme Court behavior

Random forest classifier built on SCDB data¹ to predict the votes of Supreme Court justices: Y ∈ {affirm, reverse, other}. Result: 70% correct classifications.

¹ D.
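To tie the bagging recipe together, here is a minimal sketch of bootstrap aggregation for regression, assuming NumPy and scikit-learn's DecisionTreeRegressor as the deep base tree. The function names bag_trees and predict_bagged and the toy sine data are illustrative, not from the lecture.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor    # deep (unpruned) base trees

def bag_trees(X, y, B=100, seed=None):
    """Bootstrap aggregation: train one deep regression tree per bootstrapped dataset."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)           # sample n times with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def predict_bagged(trees, X_new):
    """Average the B individual tree predictions."""
    return np.mean([tree.predict(X_new) for tree in trees], axis=0)

# Toy usage: a noisy sine curve, bagged with B = 100 trees.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 5.0, size=(200, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=200)
trees = bag_trees(X, y, B=100, seed=1)
print(predict_bagged(trees, np.array([[1.0], [4.0]])))  # roughly sin(1) and sin(4)
```

Each individual deep tree overfits its bootstrapped dataset; it is the averaging step in predict_bagged that reduces the variance, exactly as in the probability detour above.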