
Classification and Regression with Breiman Random Forests (Paper: Random Forests - Breiman 2001)

Meghana Killi and Nam Tran (Dated: Mar 6, 2019)

I. INTRODUCTION

Classification and regression are common problems in machine learning. The former requires prediction of a discrete label for a new observation, based on training data that has already been grouped into categories. In contrast, the latter requires prediction of a continuous numerical quantity for new observations, based on a function obtained by mapping the input training data to the required training output.

Currently there exist several methods to solve these problems, the most common being some application of decision trees (described in II below). Random Forests (detailed in IV) is one such method that can solve both classification and regression problems.

The current paper, (Breiman 2001), discusses an improvement on regular random forests through the injection of random feature selection (the random subspace method). Thus, in Breiman Random Forests (BRF), bootstrap aggregating (bagging) is applied to decision trees in conjunction with the random subspace method. The key idea behind this method is that randomized methods can reproduce, or even exceed, deterministic approaches owing to the Strong Law of Large Numbers, which states that the sample average converges almost surely to the expected value (Loeve 1977). Breiman conjectures that the deterministic method known as AdaBoost emulates a random forest at some stage.

In this write-up we will briefly describe the idea of decision tree learning used to define the random forest procedure, explain bagging and random subspace methods, and finally summarize some of the results in (Breiman 2001).

FIG. 1. An example of how the parameter space can be partitioned into regions R1, R2 and R3. The partition of the parameter space will depend on the algorithm used to generate a decision tree.

II. DECISION TREES

A decision tree can be considered as a linear combination of indicator functions and is usually viewed as a tree-like model, as illustrated in fig. 2. To be more specific, suppose we have made some observations (x1, y1), . . . , (xn, yn) ∈ X × R. The idea of a decision tree is to partition the set X into smaller disjoint subsets R1, . . . , RN and determine a classifier f : X → R of the form

    f(x) = Σ_{i=1}^{N} ci 1_{Ri}(x)    (1)

that can predict the values y1, . . . , yn. Here 1_{Ri} : X → {0, 1} is the indicator function, defined as

    1_{Ri}(x) = { 1, if x ∈ Ri
                { 0, if x ∉ Ri    (2)

Similar to the least squares method, the goal is to determine appropriate constants c1, . . . , cN and, in this case, also proper regions R1, . . . , RN such that the loss function for the problem is minimized. It should be noted that although (Breiman 2001) uses a simple sum of squares loss function for regression with random forests, the loss function is not unique and its form should depend on the specific problem. A proper loss function is important since the prediction of the classifier will highly depend on the chosen function (Hastie et al. 2009).

FIG. 2. A decision tree that corresponds to the function x ↦ c1 1_{R1}(x) + c2 1_{R2}(x) + c3 1_{R3}(x), with R1 = A1 ∩ A2, R2 = A1 ∩ A2^c and R3 = A1^c.
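
To make eq. (1) concrete, the sketch below evaluates a decision tree as a piecewise-constant function over axis-aligned regions. The regions, constants, and function names are illustrative assumptions, not quantities taken from (Breiman 2001).

```python
# A decision tree viewed as the piecewise-constant function of eq. (1):
# f(x) = sum_i ci * 1_{Ri}(x).  Regions and constants are invented for
# illustration only.

def indicator(x, region):
    """1_{Ri}(x): 1 if every coordinate of x lies inside the axis-aligned box."""
    return int(all(lo <= xj < hi for xj, (lo, hi) in zip(x, region)))

def tree_predict(x, regions, constants):
    """Evaluate f(x) = sum_i ci * 1_{Ri}(x) over disjoint regions Ri."""
    return sum(c * indicator(x, region) for c, region in zip(constants, regions))

# Three disjoint boxes partitioning [0, 1)^2, echoing R1, R2, R3 of fig. 2.
regions = [
    [(0.0, 0.5), (0.0, 0.5)],   # R1
    [(0.0, 0.5), (0.5, 1.0)],   # R2
    [(0.5, 1.0), (0.0, 1.0)],   # R3
]
constants = [1.2, -0.4, 3.0]    # c1, c2, c3

print(tree_predict((0.25, 0.75), regions, constants))   # point lies in R2 -> -0.4
```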

III. BOOTSTRAPPING, BAGGING, AND RANDOM SUBSPACE METHODS

Taking the notation of section II, the number of observations in the training data is n, and each data point has N features, i.e. each input vector has N elements: xi = (x1, . . . , xN).

A. Bootstrapping

Bootstrapping is a sampling technique in which we randomly sample, with replacement, from the n known observations to create multiple bootstrap samples of n elements each. Since we allow for replacement, a bootstrap sample is most likely not identical to our initial sample: some data points may be duplicated, and others may be omitted.
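
A minimal sketch of bootstrap sampling, assuming the observations are held in a plain Python list (the toy data and helper name are illustrative, not code from the paper):

```python
import random

def bootstrap_sample(data, rng=random):
    """Draw n observations with replacement from a dataset of size n."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

# Toy dataset of n = 6 labelled observations; in a typical bootstrap sample
# some points appear more than once and others are omitted.
data = [(i, "y%d" % i) for i in range(6)]
random.seed(0)
print(bootstrap_sample(data))
```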

B. Bagging

Bootstrap aggregating (or bagging) is a machine learning ensemble method that uses bootstrap sampling to independently train classifiers, and then aggregates their predictions. When applied to decision trees, bagging provides bootstrap samples to several decision trees and then combines their outputs. Here, the number of trees is a free parameter, and at each node, the algorithm searches over all the N features to find the one that best splits the data.
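
A sketch of bagged decision trees, assuming scikit-learn is available; the dataset and parameter values are illustrative, not those used in (Breiman 2001):

```python
# Bagging: each tree is trained on a bootstrap sample, but every split still
# searches over all N features.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                                 bootstrap=True, random_state=0)
bagged_trees.fit(X_train, y_train)
print("test accuracy:", bagged_trees.score(X_test, y_test))
```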

C. Random Subspace Method

In the Random Subspace Method, instead of selecting from the total number N of features, we randomly generate a subspace with m < N features at each node. Then, we either directly use a few features randomly selected from the subspace (Forest-RI), or use random linear combinations of the features from the subspace (Forest-RC), such that the final selected features best split the data.
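
The per-node feature selection of the two variants can be sketched as follows, assuming a numpy feature matrix; the helper names, the value of m, and the coefficient distribution are assumptions for illustration rather than Breiman's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_subspace_ri(n_features, m):
    """Forest-RI style: pick m of the N input features at random for this node."""
    return rng.choice(n_features, size=m, replace=False)

def random_subspace_rc(X, m, n_combinations):
    """Forest-RC style: build candidate features as random linear combinations
    of m randomly chosen input features."""
    n_features = X.shape[1]
    candidates = []
    for _ in range(n_combinations):
        idx = rng.choice(n_features, size=m, replace=False)
        coeffs = rng.uniform(-1.0, 1.0, size=m)
        candidates.append(X[:, idx] @ coeffs)   # one derived feature per combination
    return np.column_stack(candidates)

X = rng.normal(size=(100, 10))        # toy data: 100 points, N = 10 features
print(random_subspace_ri(10, m=3))    # indices of 3 candidate features
print(random_subspace_rc(X, m=3, n_combinations=5).shape)   # (100, 5)
```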

IV. BREIMAN RANDOM FORESTS

A random forest is a collection of decision trees working in tandem to create a single (aggregate) tree, where each decision tree has some contribution to the final output. For classification problems, each tree votes on the discrete label output, and the decision with the most votes will be the decision that the single tree will take. In case of regression problems, each tree will output some numerical value and the single tree will combine them all through a mean or a weighted mean.

Whereas a normal random forest only utilizes bagging, BRF further extends this technique through the addition of the random subspace method. Therefore, in a BRF, after each decision tree receives a bootstrapped sample as input and performs random subspace selection of features at each node, the single (aggregate) tree combines their outputs into a final decision.
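
As an end-to-end illustration of this procedure, the sketch below fits a classifier and a regressor with scikit-learn's random forest estimators, which combine bootstrap sampling with per-node random feature selection; the datasets and parameter values are assumptions for illustration only:

```python
# Bagging plus per-node random feature subsets: the forest aggregates the
# predictions of many randomized trees into a single output.
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Classification: each tree contributes a prediction; the forest aggregates
# them into a single discrete label.
Xc, yc = make_classification(n_samples=500, n_features=10, random_state=0)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(Xc, yc, random_state=0)
clf = RandomForestClassifier(n_estimators=100, max_features=2, random_state=0)
clf.fit(Xc_tr, yc_tr)
print("classification accuracy:", clf.score(Xc_te, yc_te))

# Regression: the forest outputs the mean of the trees' numerical predictions.
Xr, yr = make_regression(n_samples=500, n_features=10, random_state=0)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xr, yr, random_state=0)
reg = RandomForestRegressor(n_estimators=100, max_features=2, random_state=0)
reg.fit(Xr_tr, yr_tr)
print("regression R^2:", reg.score(Xr_te, yr_te))
```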

V. SELECTED RESULTS FROM BREIMAN

BRF, unlike single decision trees and regular bagging methods, avoids overfitting because, for a sufficiently large number of iterations, the generalization error converges to some asymptotic value. It is also mostly insensitive to the final number of features selected to split the tree at each node (usually, 1 or 2 features is enough). Moreover, BRF, and especially Forest-RC, is robust to outliers and noise, more so than AdaBoost, because AdaBoost gives more weight to misclassified data, thus giving more weight to noise, whereas random forests give no extra weight to any one result.

VI. SUMMARY

Breiman Random Forests utilize a forest of decision trees that each makes decisions by randomly sampling from the input data, and from the features at each node. For classification, the algorithm outputs the result that the majority of the trees agree with (i.e., the mode of the results of individual trees), and for regression, it outputs the average value (i.e., the mean or weighted mean of the results of individual trees). It makes use of the Strong Law of Large Numbers to avoid overfitting, which is a common problem with other machine learning algorithms such as boosting and bagging.

L. Breiman. Random forests. Machine Learning, 45(1):5–32, Oct 2001. ISSN 1573-0565. doi: 10.1023/A:1010933404324. URL https://doi.org/10.1023/A:1010933404324.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, 2009. ISBN 9780387848846. URL https://books.google.dk/books?id=eBSgoAEACAAJ.

M. Loeve. Probability Theory. Springer Science and Business Media. Springer, 1977. ISBN 9781468494648.