Lecture 6 – Tree-based methods, Bagging and Random Forests

Niklas Wahlström, Division of Systems and Control, Department of Information Technology, Uppsala University.

Email: [email protected]

Summary of Lecture 5 (I/IV)

When choosing models/methods, we are interested in how well they will perform when faced with new unseen data.

The new data error

$E_{\text{new}} \triangleq \mathbb{E}_\star\big[E(\hat{y}(x_\star; \mathcal{T}), y_\star)\big]$ describes how well a method (which is trained using data set $\mathcal{T}$) will perform "in production". $E$ is for instance mean squared error (regression) or misclassification (classification).

The overall goal in supervised machine learning is to achieve a small Enew.

$E_{\text{train}} \triangleq \frac{1}{n}\sum_{i=1}^{n} E(\hat{y}(x_i; \mathcal{T}), y_i)$ is the training data error. Not a good estimate of $E_{\text{new}}$.

Summary of Lecture 5 (II/IV)

Two methods for estimating Enew:

1. Hold-out validation data approach: Randomly split the data into a training set and a hold-out validation set. Learn the model using the training set. Estimate Enew using the hold-out validation set.

2. k-fold cross-validation: Randomly split the data into k parts (or folds) of roughly equal size.
   a) The first fold is kept aside as a validation set and the model is learned using only the remaining k − 1 folds. Enew is estimated on the validation set.
   b) The procedure is repeated k times, each time with a different fold treated as the validation set.
   c) The average of all k estimates is taken as the final estimate of Enew.
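As a concrete illustration of the k-fold procedure (not from the lecture), here is a minimal NumPy sketch; the `fit` and `predict` callables and the squared-error loss are placeholder assumptions standing in for any regression method.

```python
import numpy as np

def k_fold_cv(X, y, fit, predict, k=5, rng=None):
    """Estimate E_new by k-fold cross-validation.

    `fit(X_train, y_train) -> model` and `predict(model, X_val) -> predictions`
    are user-supplied placeholders for an arbitrary regression method.
    """
    rng = np.random.default_rng(rng)
    folds = np.array_split(rng.permutation(len(y)), k)   # random split into k folds
    errors = []
    for i in range(k):
        val_idx = folds[i]                               # fold i is the validation set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        y_hat = predict(model, X[val_idx])
        errors.append(np.mean((y[val_idx] - y_hat) ** 2))  # squared-error loss
    return np.mean(errors)                               # average of the k estimates
```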

Summary of Lecture 5 (III/IV)

$$\bar{E}_{\text{new}} = \underbrace{\mathbb{E}_\star\big[(\bar{f}(x_\star) - f_0(x_\star))^2\big]}_{\text{Bias}^2} + \underbrace{\mathbb{E}_\star\big[\mathbb{E}_{\mathcal{T}}\big[(\hat{y}(x_\star; \mathcal{T}) - \bar{f}(x_\star))^2\big]\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible error}}$$
where $\bar{f}(x_\star) = \mathbb{E}_{\mathcal{T}}[\hat{y}(x_\star; \mathcal{T})]$.

• Bias: The inability of a method to describe the complicated patterns we would like it to describe.
• Variance: How sensitive a method is to the training data.

The more prone a model is to adapt to complicated patterns in the data, the higher the model complexity (or model flexibility).

Summary of Lecture 5 (IV/IV)

[Figure: $\bar{E}_{\text{new}}$ decomposed into Bias², Variance, and the irreducible error, plotted against model complexity, with the underfit and overfit regimes indicated.]

Finding a balanced fit (neither over- nor underfit) is called the bias-variance tradeoff.

Contents – Lecture 6

1. Classification and regression trees (CART)
2. Bagging – a general variance reduction technique
3. Random forests

The idea behind tree-based methods

In both regression and classification settings we seek a function $\hat{y}(x)$ which maps the input $x$ into a prediction.

One flexible way of designing this function is to partition the input space into disjoint regions and fit a simple model in each region.

[Figure: a partition of a two-dimensional input space $(x_1, x_2)$ together with the training data.]

• Classification: Majority vote within the region.
• Regression: Mean of the training data within the region.

Finding the partition

The key challenge in using this strategy is to find a good partition.

Even if we restrict our attention to seemingly simple regions (e.g. “boxes”), finding an optimal partition w.r.t. minimizing the training error is computationally infeasible!

Instead, we use a “greedy” approach: recursive binary splitting.

1. Select one input variable $x_j$ and a cut-point $s$. Partition the input space into two half-spaces,
   $\{x : x_j < s\}$ and $\{x : x_j \ge s\}$.

2. Repeat this splitting for each region until some stopping criterion is met (e.g., no region contains more than 5 training data points).
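As an illustration (not the lecture's code), here is a minimal Python sketch of this greedy recursion for the regression case; the nested-dict tree representation is an assumption of these notes, and `find_best_split` is a hypothetical helper, sketched further below once the split criterion has been defined.

```python
import numpy as np

def grow_tree(X, y, min_leaf_size=5):
    """Recursive binary splitting (greedy): split until no valid split exists
    or a region contains at most `min_leaf_size` training points."""
    if len(y) <= min_leaf_size:
        return {"leaf": True, "prediction": y.mean()}   # constant prediction in the region
    split = find_best_split(X, y)                       # hypothetical helper, sketched below
    if split is None:                                   # no valid split exists
        return {"leaf": True, "prediction": y.mean()}
    j, s = split
    left = X[:, j] < s                                  # {x : x_j < s}
    return {
        "leaf": False, "feature": j, "cutpoint": s,
        "left": grow_tree(X[left], y[left], min_leaf_size),
        "right": grow_tree(X[~left], y[~left], min_leaf_size),
    }
```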

Recursive binary splitting

[Figure: "Partitioning of input space" (left) and "Tree representation" (right): the input space is split into regions R1–R5 by recursive binary splitting, and the corresponding tree has internal nodes $x_1 \le s_1$, $x_2 \le s_2$, $x_1 \le s_3$, and $x_2 \le s_4$, shown together with the training data.]

Regression trees (I/II)

Once the input space is partitioned into L regions, $R_1, R_2, \dots, R_L$, the prediction model is
$$\hat{y}_\star = \sum_{\ell=1}^{L} \hat{y}_\ell \, \mathbb{I}\{x_\star \in R_\ell\},$$
where $\mathbb{I}\{x_\star \in R_\ell\}$ is the indicator function
$$\mathbb{I}\{x_\star \in R_\ell\} = \begin{cases} 1 & \text{if } x_\star \in R_\ell \\ 0 & \text{if } x_\star \notin R_\ell \end{cases}$$
and $\hat{y}_\ell$ is a constant prediction within each region. For regression trees we use
$$\hat{y}_\ell = \text{average}\{y_i : x_i \in R_\ell\}.$$
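Continuing the nested-dict sketch from earlier (an assumption of these notes, not the lecture's notation), evaluating this sum of indicator functions simply means following the splits down to the single region that contains $x_\star$:

```python
def predict_one(tree, x):
    """Prediction for one input x: walk from the root to the leaf whose region
    contains x and return that region's constant prediction (the training average)."""
    node = tree
    while not node["leaf"]:
        node = node["left"] if x[node["feature"]] < node["cutpoint"] else node["right"]
    return node["prediction"]
```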

Regression trees (II/II) – How do we find the partition?

Recursive binary splitting is greedy - each split is made in order to minimize the loss without looking ahead at future splits.

For any j and s, define

$R_1(j, s) = \{x \mid x_j < s\}$ and $R_2(j, s) = \{x \mid x_j \ge s\}$.

We then seek the pair $(j, s)$ that minimizes
$$\sum_{i:\, x_i \in R_1(j,s)} \big(y_i - \hat{y}_1(j,s)\big)^2 + \sum_{i:\, x_i \in R_2(j,s)} \big(y_i - \hat{y}_2(j,s)\big)^2,$$
where
$$\hat{y}_1(j,s) = \text{average}\{y_i : x_i \in R_1(j,s)\}, \qquad \hat{y}_2(j,s) = \text{average}\{y_i : x_i \in R_2(j,s)\}.$$
This optimization problem is easily solved by "brute force", evaluating all possible splits.
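A minimal NumPy sketch of this brute-force search (an illustration, not the lecture's code); it plays the role of the hypothetical `find_best_split` helper assumed in the earlier tree-growing sketch:

```python
import numpy as np

def find_best_split(X, y):
    """Brute-force search over all (j, s): for every input variable and every
    candidate cut-point, evaluate the total squared error of the two half-spaces."""
    n, p = X.shape
    best, best_loss = None, np.inf
    for j in range(p):
        values = np.unique(X[:, j])
        # Candidate cut-points: midpoints between consecutive observed values,
        # so both R1(j, s) and R2(j, s) are guaranteed to be non-empty.
        for s in (values[:-1] + values[1:]) / 2:
            left = X[:, j] < s
            right = ~left
            loss = ((y[left] - y[left].mean()) ** 2).sum() \
                 + ((y[right] - y[right].mean()) ** 2).sum()
            if loss < best_loss:
                best_loss, best = loss, (j, s)
    return best  # None if no input variable has at least two distinct values
```

Classification trees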

Classification trees are constructed similarly to regression trees, but with two differences. Firstly, the class prediction for each region is based on the proportion of data points from each class in that region. Let

$$\hat{\pi}_{\ell m} = \frac{1}{n_\ell} \sum_{i:\, x_i \in R_\ell} \mathbb{I}\{y_i = m\}$$
be the proportion of training observations in the $\ell$th region that belong to the $m$th class. Then we approximate
$$p(y = m \mid x) \approx \sum_{\ell=1}^{L} \hat{\pi}_{\ell m} \, \mathbb{I}\{x \in R_\ell\}.$$

Classification trees

Secondly, the squared loss used to construct the tree needs to be replaced by a measure suitable for categorical outputs. Three common error measures are:

Misclassification error: $1 - \max_m \hat{\pi}_{\ell m}$
Entropy/deviance: $-\sum_{m=1}^{M} \hat{\pi}_{\ell m} \log \hat{\pi}_{\ell m}$
Gini index: $\sum_{m=1}^{M} \hat{\pi}_{\ell m} (1 - \hat{\pi}_{\ell m})$
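As an illustration (not from the lecture), a small NumPy function computing all three measures from one region's class proportions; it assumes the labels in `y_region` are integer-coded as 0, ..., M−1:

```python
import numpy as np

def node_impurity(y_region, n_classes):
    """Misclassification error, entropy/deviance, and Gini index for one region."""
    pi = np.bincount(y_region, minlength=n_classes) / len(y_region)  # proportions pi_hat
    misclassification = 1.0 - pi.max()
    nonzero = pi[pi > 0]                        # convention: 0 * log(0) = 0
    entropy = -(nonzero * np.log(nonzero)).sum()
    gini = (pi * (1.0 - pi)).sum()
    return misclassification, entropy, gini
```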

Classification error measures

For a binary classification problem (M = 2):

[Figure: misclassification error, Gini index, and deviance plotted as functions of the class-1 proportion $\hat{\pi}_{\ell 1}$.]

ex) Spam data

Classification tree for spam data:

[Figure: a classification tree fitted to the spam data, with splits on features such as cf_!, wf_remove, caps_max, cf_$, wf_free, and wf_hp; each node shows the predicted class, the spam proportion, and the fraction of data reaching the node.]

Test error: 11.3 % (classification tree) vs. 10.9 % (LDA).

Improving CART

The flexibility/complexity of classification and regression trees (CART) is decided by the tree depth: to obtain a small bias the tree needs to be grown deep, but this results in a high variance!

The performance of (simple) CARTs is often unsatisfactory!

To improve the practical performance:
• Pruning – grow a deep tree (small bias) which is then pruned into a smaller one (to reduce variance).
• Ensemble methods – average or combine multiple trees.
  - Bagging and random forests (this lecture)
  - Boosted trees (next lecture)

Probability detour – Variance reduction by averaging

Let $z_b$, $b = 1, \dots, B$, be identically distributed random variables with mean $\mathbb{E}[z_b] = \mu$ and variance $\mathrm{Var}[z_b] = \sigma^2$. Let $\rho$ be the correlation between distinct variables.

Then,
$$\mathbb{E}\Big[\frac{1}{B}\sum_{b=1}^{B} z_b\Big] = \mu, \qquad \mathrm{Var}\Big[\frac{1}{B}\sum_{b=1}^{B} z_b\Big] = \underbrace{\frac{1-\rho}{B}\sigma^2}_{\text{small for large } B} + \rho\sigma^2.$$

The variance is reduced by averaging (if ρ < 1)!
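A quick Monte Carlo sanity check of the variance formula (illustrative only; the particular values of B, σ², and ρ below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
B, sigma2, rho, n_sim = 10, 1.0, 0.3, 200_000

# Covariance with sigma^2 on the diagonal and rho * sigma^2 off the diagonal
cov = sigma2 * (rho * np.ones((B, B)) + (1 - rho) * np.eye(B))
z = rng.multivariate_normal(np.zeros(B), cov, size=n_sim)   # n_sim draws of (z_1, ..., z_B)

print("empirical variance of the average:", z.mean(axis=1).var())
print("theory, (1 - rho)/B * sigma^2 + rho * sigma^2:",
      (1 - rho) / B * sigma2 + rho * sigma2)
```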

Bagging (I/II)

For now, assume that we have access to B independent datasets $\mathcal{T}^1, \dots, \mathcal{T}^B$. We can then train a separate deep tree $\hat{y}^b(x)$ for each dataset $b = 1, \dots, B$.

• Each $\hat{y}^b(x)$ has a low bias but high variance.
• By averaging,
$$\hat{y}_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{y}^b(x),$$
the bias is kept small, but the variance is reduced by a factor B!

Bagging (II/II)

Obvious problem: We only have access to one training dataset.

Solution: Bootstrap the data!
• Sample n times with replacement from the original training data $\mathcal{T} = \{x_i, y_i\}_{i=1}^{n}$.
• Repeat B times to generate B "bootstrapped" training datasets $\tilde{\mathcal{T}}^1, \dots, \tilde{\mathcal{T}}^B$.

For each bootstrapped dataset $\tilde{\mathcal{T}}^b$ we train a tree $\tilde{y}^b(x)$. Averaging these,
$$\tilde{y}_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} \tilde{y}^b(x)$$
is called "bootstrap aggregation", or bagging.
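A minimal bagging sketch (illustration only), reusing the hypothetical `grow_tree` and `predict_one` functions from the earlier sketches:

```python
import numpy as np

def bagging_fit(X, y, B=100, min_leaf_size=5, rng=None):
    """Train one deep regression tree per bootstrapped dataset."""
    rng = np.random.default_rng(rng)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)        # sample n times with replacement
        trees.append(grow_tree(X[idx], y[idx], min_leaf_size))
    return trees

def bagging_predict(trees, x):
    """Average the predictions of the B ensemble members."""
    return np.mean([predict_one(tree, x) for tree in trees])
```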

Bagging – Toy example

ex) Assume that we have a training set

$\mathcal{T} = \{(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4)\}$. We generate, say, B = 3 datasets by bootstrapping:

$\tilde{\mathcal{T}}^1 = \{(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_3, y_3)\}$
$\tilde{\mathcal{T}}^2 = \{(x_1, y_1), (x_4, y_4), (x_4, y_4), (x_4, y_4)\}$
$\tilde{\mathcal{T}}^3 = \{(x_1, y_1), (x_1, y_1), (x_2, y_2), (x_2, y_2)\}$

Compute B = 3 (deep) regression trees $\tilde{y}^1(x)$, $\tilde{y}^2(x)$ and $\tilde{y}^3(x)$, one for each dataset $\tilde{\mathcal{T}}^1$, $\tilde{\mathcal{T}}^2$, and $\tilde{\mathcal{T}}^3$, and average
$$\tilde{y}_{\text{bag}}(x) = \frac{1}{3}\sum_{b=1}^{3} \tilde{y}^b(x).$$

ex) Predicting US Supreme Court behavior

Random forest classifier built on SCDB data [1] to predict the votes of Supreme Court justices:

Y ∈ {affirm, reverse, other}

Result: 70% correct classifications

D. M. Katz, M. J. Bommarito II and J. Blackman. A General Approach for Predicting the Behavior of the Supreme Court of the United States. arXiv.org, arXiv:1612.03473v2, January 2017.

[1] http://supremecourtdatabase.org

[Figure from Katz et al.: a justice-term accuracy heatmap (1915–2015) comparing the random forest model against an M = 10 finite-memory-window null model; green cells mark justice-terms where the model outperformed the baseline, pink cells where it only matched or underperformed it.]

ex) Predicting US Supreme Court behavior

Not only have random forests proven to be "unreasonably effective" in a wide array of contexts, but in our testing, random forests outperformed other common approaches including support vector machines [. . . ] and feedforward artificial neural network models such as multi-layer [. . . ]

— Katz, Bommarito II and Blackman (arXiv:1612.03473v2)

Random forests

Bagging can drastically improve the performance of CART!

However, the B bootstrapped datasets are correlated ⇒ the variance reduction due to averaging is diminished.

Idea: De-correlate the B trees by randomly perturbing each tree.

A random forest is constructed by bagging, but for each split in each tree only a random subset of q ≤ p inputs is considered as splitting variables.

Rule of thumb: $q = \sqrt{p}$ for classification trees and $q = p/3$ for regression trees. [2]

[2] Proposed by Leo Breiman, the inventor of random forests.

Random forest pseudo-code

Algorithm: Random forest for regression

1. For b = 1 to B (can run in parallel):
   (a) Draw a bootstrap data set $\tilde{\mathcal{T}}^b$ of size n from $\mathcal{T}$.
   (b) Grow a regression tree by repeating the following steps until a minimum node size is reached:
      i. Select q out of the p input variables uniformly at random.
      ii. Find the variable $x_j$ among the q selected, and the corresponding split point s, that minimizes the squared error.
      iii. Split the node into two children with $\{x_j \le s\}$ and $\{x_j > s\}$.
2. The final model is the average of the B ensemble members,
$$\hat{y}_{\star}^{\text{rf}} = \frac{1}{B}\sum_{b=1}^{B} \tilde{y}_{\star}^{b}.$$
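For practical use one would typically reach for an existing implementation rather than the from-scratch sketches above. A minimal scikit-learn example is given below; the toy data and the specific parameter values are illustrative assumptions, not part of the lecture:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy regression data, only for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=500)

forest = RandomForestRegressor(
    n_estimators=100,     # B ensemble members
    max_features=1 / 3,   # q = p/3 split variables per split (regression rule of thumb)
    min_samples_leaf=5,   # minimum node size
    bootstrap=True,       # draw bootstrap datasets of size n
    n_jobs=-1,            # the B trees can be grown in parallel
    random_state=0,
)
forest.fit(X, y)
y_hat = forest.predict(X[:5])   # average of the B trees' predictions
```

Random forests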

Recall: For i.d. random variables $\{z_b\}_{b=1}^{B}$,
$$\mathrm{Var}\Big[\frac{1}{B}\sum_{b=1}^{B} z_b\Big] = \frac{1-\rho}{B}\sigma^2 + \rho\sigma^2.$$

The random input selection used in random forests:
• increases the bias, but often very slowly,
• adds to the variance ($\sigma^2$) of each tree,
• reduces the correlation ($\rho$) between the trees.

The reduction in correlation is typically the dominant effect ⇒ there is an overall reduction in MSE!

ex) Toy regression model

For the toy model previously considered:

[Figure: test error as a function of the number of trees, comparing a single regression tree, bagging, and a random forest.]

Overfitting?

The complexity of a bagging/random forest model increases with an increasing number of trees B. Will this lead to overfitting as B increases?

No – more ensemble members do not increase the flexibility of the model!

Regression case:

$$\hat{y}_{\star}^{\text{rf}} = \frac{1}{B}\sum_{b=1}^{B} \tilde{y}_{\star}^{b} \;\to\; \mathbb{E}\big[\tilde{y}_{\star} \mid \mathcal{T}\big] \quad \text{as } B \to \infty,$$
where the expectation is w.r.t. the randomness in the data bootstrapping and input selection.

Advantages of random forests

Random forests have several computational advantages:
• Embarrassingly parallelizable!
• Using q < p potential split variables reduces the computational cost of each split.
• We could bootstrap fewer than n, say $\sqrt{n}$, data points when creating $\tilde{\mathcal{T}}^b$ – very useful for "big data" problems.

. . . and they also come with some other benefits:
• Often work well off-the-shelf – few tuning parameters.
• Require little or no input preparation.
• Implicit input selection.

ex) Automatic music generation

ALYSIA: automated music generation using random forests.

• The user specifies the lyrics.
• ALYSIA generates accompanying music via
  - a rhythm model
  - a melody model
• Trained on a corpus of pop songs.

[Figure: an original song co-written with ALYSIA (Fig. 5 of Ackerman and Loker).]

https://www.youtube.com/watch?v=whgudcj82_I
https://www.withalysia.com/

M. Ackerman and D. Loker. Algorithmic Songwriting with ALYSIA. In: Correia J., Ciesielski V., Liapis A. (eds) Computational Intelligence in Music, Sound, Art and Design. EvoMUSART, 2017.

A few concepts to summarize lecture 6

CART: Classification and regression trees. A class of nonparametric methods based on partitioning the input space into regions and fitting a simple model for each region.
Recursive binary splitting: A greedy method for partitioning the input space into "boxes" aligned with the coordinate axes.
Gini index and deviance: Commonly used error measures for constructing classification trees.
Ensemble methods: Umbrella term for methods that average or combine multiple models.
Bagging: Bootstrap aggregating. An ensemble method based on the statistical bootstrap.
Random forests: Bagging of trees, combined with random input selection for further variance reduction (and computational gains).
