Forest-Type Regression with General Losses and Robust Forest

Forest-type Regression with General Losses and Robust Forest Alexander Hanbo Li 1 Andrew Martin 2 Abstract dict quantiles by discovering that besides calculating the weighted mean of the observed response variables, one This paper introduces a new general framework could also get information for the weighted distribution of for forest-type regression which allows the de- observed response variables using the sets of local weights velopment of robust forest regressors by select- generated by random forest. This method is strongly con- ing from a large family of robust loss functions. nected to the adaptive nearest neighbors procedure (Lin & In particular, when plugged in the squared error Jeon, 2006) which we will briefly review in section 1.2. and quantile losses, it will recover the classical Different from classical k-NN methods that rely on pre- random forest (Breiman, 2001) and quantile ran- defined distance metrics, the dissimilarities generated by dom forest (Meinshausen, 2006). We then use ro- random forest are data dependent and scale-invariant. bust loss functions to develop more robust forest- type regression algorithms. In the experiments, Another state-of-the-art algorithm AdaBoost (Freund & we show by simulation and real data that our ro- Schapire, 1995; Freund et al., 1996) has been generalized bust forests are indeed much more insensitive to to be applicable to a large family of loss functions (Fried- outliers, and choosing the right number of nearest man, 2001; Mason et al., 1999; Li & Bradic, 2016). Recent neighbors can quickly improve the generalization development of more flexible boosting algorithms such as performance of random forest. xgboost (Chen & Guestrin, 2016) have become the go-to forest estimators with tabular or matrix data. One way in which recent boosting algorithms have an advantage over 1. Introduction the random forest is the ability to customize the loss func- tion used to reduce the influence of outliers or optimize a Since its development by Breiman (2001), random forest metric more suited to the specific problem other than the has proven to be both accurate and efficient for classifica- mean squared error. tion and regression problems. In regression setting, random forest will predict the conditional mean of a response In this paper, we will propose a general framework for variable by averaging predictions of a large number of re- forest-type regression which can also be applied to a broad gression trees. Later then, many other machine learning family of loss functions. It is claimed in (Meinshausen, algorithms were developed upon random forest. Among 2006) that quantile random forest is another nonparamet- them, robust versions of random forest have also been pro- ric approach which does not minimize an empirical loss. posed using various methodologies. Besides the sampling However, we will show in fact both random forest and idea (Breiman, 2001) which adds extra randomness, the quantile random forest estimators can be re-derived as re- other variations are mainly based on two ideas: (1) use gression methods using the squared error or quantile loss more robust criterion to construct regression trees (Galim- respectively in our framework. Inspired by the adaptive berti et al., 2007; Brence & Brown, 2006; Roy & Larocque, nearest neighbor viewpoint, we explore how random forest 2012); (2) choose more robust aggregation method (Mein- makes predictions using the local weights generated by en- shausen, 2006; Roy & Larocque, 2012; Tsymbal et al., semble of trees, and connect that with locally weighted re- 2006). gression (Fan & Gijbels, 1996; Tibshirani & Hastie, 1987; Staniswalis, 1989; Newey, 1994; Loader, 2006; Hastie & Meinshausen (2006) generalized random forest to pre- Loader, 1993). The intuition is that when predicting the 1University of California at San Diego, San Diego, Califor- target value (e.g. E[Y jX = x]) at point x, the observations nia, USA 2Zillow, Seattle, Washington, USA. Correspondence to: closer to x should receive larger weights. Different from Alexander Hanbo Li <[email protected]>. predefining a kernel, random forest assigns the weights data dependently and adaptively. After we illustrate the re- Proceedings of the 34 th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 lation between random forest and local regression, we will by the author(s). use random forest weights to design other regression algo- Forest-type Regression with General Losses and Robust Forest rithms. By plugging robust loss functions like Huber loss 1.2. Adaptive nearest neighbors and Tukey’s redescending loss, we get forest-type regres- Lin and Jeon (2006) studies the connection between ran- sion methods that are more robust to outliers. Finally, mo- dom forest and adaptive nearest neighbor. They introduced tivated from the truncated squared error loss example, we the so-called potential nearest neighbors (PNN): A sample will show that decreasing the number of nearest neighbors point x is called a k-PNN to a target point x if there exists in random forest will also immediately improve its gener- i a monotone distance metric under which x is among the k alization performance. i closest to x among all the sample points. The layout of this paper is as follows. In Section 1.1 and 1.2 Therefore, any k-NN method can be viewed as choosing we review random forest and adaptive nearest neighbors. k points from the k-PNNs according to some monotone Section2 introduces the general framework of forest-type metric. For example, under Euclidean metric, the classical regression. In Section3 we plug in robust regression loss k-NN algorithm sorts the observations by their Euclidean functions to get robust forest algorithms. In Section4 we distances to the target point and outputs the k closest ones. motivate from the truncated squared error loss and inves- This is equivalent to weighting the k-PNNs using inverse tigate the importance of choosing right number of nearest L distance. neighbors. Finally, we test our robust forests in Section 2 5 and show that they are always superior to the traditional More interestingly, they prove that those observations with formulation in the presence of outliers in both synthetic and positive weights (3) all belong to the k-PNNs (Lin & Jeon, real data set. 2006). Therefore, random forests is another weighted k- PNN method, but it assigns weights to the observations dif- 1.1. Random forest ferent from any k-NN method under a pre-defined mono- tonic distance metric. In fact, the random forest weights Following the notation of Breiman (2001), let θ be the ran- are adaptive to the data if the splitting scheme is adaptive. dom parameter determining how a tree is grown, and data (X; Y ) 2 X × Y. For each tree T (θ), let L be the total number of leaves, and Rl denotes the rectangular subspace 2. General framework for forest-type in X corresponding to the l-th leaf. Then for every x 2 X , regression there is exactly one leaf l such that x 2 R . Denote this l In this section, we generalize the classical random forest to leaf by l(x; θ). a general forest-type regression (FTR) framework which is For each tree T (θ), the prediction of a new data point applicable to a broad family of loss functions. In Section X = x is the average of data values in leaf l(x; θ), that 2.1, we motivate the framework by connecting random for- Pn is, Yb(x; θ) = j=1 w(Xi; x; θ)Yi, where est predictor with locally weighted regression. Then in Sec- tion 2.2, we formally propose the new forest-type regression framework. In Section 2.3, we rediscover the quantile 1IfXi2Rl(x,θ)g w(Xi; x; θ) = : (1) random forest estimator by plugging the quantile loss func- #fj : Xj 2 Rl(x,θ)g tion into our framework. Finally, the conditional mean E[Y jX = x] is approxi- 2.1. Squared error and random forest mated by the averaged prediction of m trees, Yb(x) = −1 Pm m t=1 Yb(x; θt). After rearranging the terms, we can Classical random forest can be understood as an estimator write the prediction of random forest as of conditional mean E[Y jX]. As shown in (2), the estimator Yb(x) is weighted average of all response Yi’s. This n X special form reminds us of the classical least squares re- Y (x) = w(X ; x)Y ; (2) b i i gression, where the estimator is the sample mean. To be i=1 more precise, we rewrite (2) as where the averaged weight w(X ; x) is defined as n i X w(Xi; x)(Yi − Yb(x)) = 0: (4) m 1 X i=1 w(X ; x) = w(X ; x; θ ): (3) i m i t t=1 Equation (4) is the estimating equation (first order condi- tion) of the locally weighted least squares regression (Rup- From equation (2), the prediction of the conditional expec- pert & Wand, 1994): tation E[Y jX = x] is the weighted average of the response n values of all observations. Furthermore, it is easy to show X 2 Pn Yb(x) = argmin w(Xi; x)(Yi − λ) (5) that i=1 w(Xi; x) = 1. λ2R i=1 Forest-type Regression with General Losses and Robust Forest In classical local regression, the weight w(Xi; x) serves uses ensemble of trees to recursively partition the covariate as a local metric between the target point x and observa- space X . However, there are many other data dependent tion Xi. Intuitively, observations closer to target x should dissimilarity measures that can potentially be used, such as be given more weights when predicting the response at k-NN, mp-dissimilarity (Aryal et al., 2014), shared near- x.

Load more