Classification and Regression by Randomforest
Total Page:16
File Type:pdf, Size:1020Kb
Vol. 2/3, December 2002 18 Classification and Regression by randomForest Andy Liaw and Matthew Wiener variables. (Bagging can be thought of as the special case of random forests obtained when Introduction mtry = p, the number of predictors.) 3. Predict new data by aggregating the predic- Recently there has been a lot of interest in “ensem- tions of the ntree trees (i.e., majority votes for ble learning” — methods that generate many clas- classification, average for regression). sifiers and aggregate their results. Two well-known methods are boosting (see, e.g., Shapire et al., 1998) An estimate of the error rate can be obtained, and bagging Breiman (1996) of classification trees. In based on the training data, by the following: boosting, successive trees give extra weight to points 1. At each bootstrap iteration, predict the data incorrectly predicted by earlier predictors. In the not in the bootstrap sample (what Breiman end, a weighted vote is taken for prediction. In bag- calls “out-of-bag”, or OOB, data) using the tree ging, successive trees do not depend on earlier trees grown with the bootstrap sample. — each is independently constructed using a boot- strap sample of the data set. In the end, a simple 2. Aggregate the OOB predictions. (On the av- majority vote is taken for prediction. erage, each data point would be out-of-bag Breiman (2001) proposed random forests, which around 36% of the times, so aggregate these add an additional layer of randomness to bagging. predictions.) Calcuate the error rate, and call In addition to constructing each tree using a different it the OOB estimate of error rate. bootstrap sample of the data, random forests change how the classification or regression trees are con- Our experience has been that the OOB estimate of structed. In standard trees, each node is split using error rate is quite accurate, given that enough trees the best split among all variables. In a random for- have been grown (otherwise the OOB estimate can est, each node is split using the best among a sub- bias upward; see Bylander (2002)). set of predictors randomly chosen at that node. This somewhat counterintuitive strategy turns out to per- Extra information from Random Forests form very well compared to many other classifiers, including discriminant analysis, support vector ma- The randomForest package optionally produces two chines and neural networks, and is robust against additional pieces of information: a measure of the overfitting (Breiman, 2001). In addition, it is very importance of the predictor variables, and a measure user-friendly in the sense that it has only two param- of the internal structure of the data (the proximity of eters (the number of variables in the random subset different data points to one another). at each node and the number of trees in the forest), Variable importance This is a difficult concept to and is usually not very sensitive to their values. define in general, because the importance of a The randomForest package provides an R inter- variable may be due to its (possibly complex) face to the Fortran programs by Breiman and Cut- interaction with other variables. The random http://www.stat.berkeley.edu/ ler (available at forest algorithm estimates the importance of a users/breiman/ ). This article provides a brief intro- variable by looking at how much prediction er- duction to the usage and features of the R functions. ror increases when (OOB) data for that vari- able is permuted while all others are left un- The algorithm changed. The necessary calculations are car- ried out tree by tree as the random forest is The random forests algorithm (for both classification constructed. (There are actually four different and regression) is as follows: measures of variable importance implemented in the classification code. The reader is referred 1. Draw ntree bootstrap samples from the original to Breiman (2002) for their definitions.) data. proximity measure The (i, j) element of the prox- 2. For each of the bootstrap samples, grow an un- imity matrix produced by randomForest is the pruned classification or regression tree, with the fraction of trees in which elements i and j fall following modification: at each node, rather in the same terminal node. The intuition is than choosing the best split among all predic- that “similar” observations should be in the tors, randomly sample mtry of the predictors same terminal nodes more often than dissim- and choose the best split from among those ilar ones. The proximity matrix can be used R News ISSN 1609-3631 Vol. 2/3, December 2002 19 to identify structure in the data (see Breiman, Veh 7 4 6 0 0 0 0.6470588 2002) or for unsupervised learning with ran- Con 0 2 0 10 0 1 0.2307692 dom forests (see below). Tabl 0 2 0 0 7 0 0.2222222 Head 1 2 0 1 0 25 0.1379310 Usage in R The user interface to random forest is consistent with We can compare random forests with support that of other classification functions such as nnet() vector machines by doing ten repetitions of 10-fold (in the nnet package) and svm() (in the e1071 pack- cross-validation, using the errorest functions in the age). (We actually borrowed some of the interface ipred package: code from those two functions.) There is a formula interface, and predictors can be specified as a matrix or data frame via the x argument, with responses as a vector via the y argument. If the response is a factor, randomForest performs classification; if the response >library(ipred) is continuous (that is, not a factor), randomForest >set.seed(131) performs regression. If the response is unspecified, >error.RF<-numeric(10) randomForest performs unsupervised learning (see >for(iin1:10)error.RF[i]<- below). Currently randomForest does not handle +errorest(type~.,data=fgl, ordinal categorical responses. Note that categorical +model=randomForest,mtry=2)$error >summary(error.RF) predictor variables must also be specified as factors Min. 1st Qu. Median Mean 3rd Qu. Max. (or else they will be wrongly treated as continuous). 0.1869 0.1974 0.2009 0.2009 0.2044 0.2103 The randomForest function returns an object of >library(e1071) class "randomForest". Details on the components >set.seed(563) of such an object are provided in the online docu- >error.SVM<-numeric(10) mentation. Methods provided for the class includes >for(iin1:10)error.SVM[i]<- predict and print. +errorest(type~.,data=fgl, + model = svm, cost = 10, gamma = 1.5)$error >summary(error.SVM) A classification example Min. 1st Qu. Median Mean 3rd Qu. Max. 0.2430 0.2453 0.2523 0.2561 0.2664 0.2710 The Forensic Glass data set was used in Chapter 12 of MASS4 (Venables and Ripley, 2002) to illustrate vari- ous classification algorithms. We use it here to show how random forests work: We see that the random forest compares quite fa- >library(randomForest) vorably with SVM. >library(MASS) >data(fgl) We have found that the variable importance mea- >set.seed(17) sures produced by random forests can sometimes be >fgl.rf<-randomForest(type~.,data=fgl, useful for model reduction (e.g., use the “important” +mtry=2,importance=TRUE, variables to build simpler, more readily interpretable +do.trace=100) 100: OOB error rate=20.56% models). Figure 1 shows the variable importance of 200: OOB error rate=21.03% the Forensic Glass data set, based on the fgl.rf ob- 300: OOB error rate=19.63% ject created above. Roughly, it is created by 400: OOB error rate=19.63% 500: OOB error rate=19.16% >print(fgl.rf) Call: randomForest.formula(formula = type ~ ., >par(mfrow=c(2,2)) data = fgl, mtry = 2, importance = TRUE, >for(iin1:4) do.trace = 100) + plot(sort(fgl.rf$importance[,i], dec = TRUE), Type of random forest: classification +type="h",main=paste("Measure",i)) Number of trees: 500 No. of variables tried at each split: 2 OOB estimate of error rate: 19.16% Confusion matrix: We can see that measure 1 most clearly differentiates WinF WinNF Veh Con Tabl Head class.error the variables. If we run random forest again drop- WinF 63 6 1 0 0 0 0.1000000 ping Na, K, and Fe from the model, the error rate re- WinNF 9 62 1 2 2 0 0.1842105 mains below 20%. R News ISSN 1609-3631 Vol. 2/3, December 2002 20 No. of variables tried at each split: 4 Measure 1 Measure 2 RI RI Mg Al Mean of squared residuals: 10.64615 Mg %Varexplained:87.39 Ca Ba Ca K Ba Si Na Si The “mean of squared residuals” is computed as Al Fe Fe K Na n 010203040 051015 −1 OOB 2 MSEOOB = n ∑{yi − yˆi } , 1 Measure 3 Measure 4 where yˆOOB is the average of the OOB predictions RI Al Mg i Mg RI Ca Al Ba Ca for the ith observation. The “percent variance ex- K Si Na Fe Na K plained” is computed as Si Ba Fe MSE 1 − OOB , σ 2 ˆ y 0510152025 0.0 0.2 0.4 0.6 σ 2 where ˆ y is computed with n as divisor (rather than n − 1). Figure 1: Variable importance for the Forensic Glass We can compare the result with the actual data, data. as well as fitted values from a linear model, shown in Figure 2. The gain can be far more dramatic when there are 50 30 40 50 more predictors. In a data set with thousands of pre- 40 dictors, we used the variable importance measures to 30 30 Actual select only dozens of predictors, and we were able to 20 retain essentially the same prediction accuracy.