Supervised Neighborhoods for Distributed Nonparametric Regression
Adam Bloniarz (Dept. of Statistics, UC Berkeley), Christopher Wu (Computer Science Dept., UCLA), Bin Yu (Dept. of Statistics and Dept. of EECS, UC Berkeley), Ameet Talwalkar (Computer Science Dept., UCLA)

Abstract

Techniques for nonparametric regression based on fitting small-scale local models at prediction time have long been studied in statistics and pattern recognition, but have received less attention in modern large-scale machine learning applications. In practice, such methods are generally applied to low-dimensional problems, but may falter with high-dimensional predictors if they use a Euclidean distance-based kernel. We propose a new method, Silo, for fitting prediction-time local models that uses supervised neighborhoods that adapt to the local shape of the regression surface. To learn such neighborhoods, we use a weight function between points derived from random forests. We prove the consistency of Silo, and demonstrate through simulations and real data that our method works well in both the serial and distributed settings. In the latter case, Silo learns the weighting function in a divide-and-conquer manner, entirely avoiding communication at training time.

1 INTRODUCTION

Modern datasets collected via the internet and in scientific domains hold the promise for a variety of transformational applications. Non-parametric methods present an attractive modeling choice for learning from these massive and complex datasets. While parametric methods that optimize over a fixed function space can suffer from high approximation error when the task is complex, non-parametric methods augment the capacity of the underlying statistical model as the datasets grow in scale and diversity. Moreover, the scale of these datasets makes non-parametric estimation feasible even when considering minimax rates for non-parametric functions that require the data to scale exponentially in the feature dimension.

Global non-parametric models, e.g., neural networks and kernel methods, learn complex response surfaces and can be quite expensive to train on large-scale datasets. Both machine learning and computer systems researchers are actively exploring techniques to efficiently embed these learning approaches into parallel and distributed computing frameworks, e.g., [1]–[5]. In contrast, local models are simple, interpretable, and provide an attractive computational profile, by potentially drastically reducing training time at the expense of a moderate increase in test time.

However, to date local methods have been rarely employed on large-scale learning tasks due to both statistical and computational concerns. Statistically, classical methods struggle with even a moderate number of features due to the curse of dimensionality. Although these local methods are minimax optimal, this minimax rate is quite conservative when the response does not involve all dimensions. Although local approaches relying on 'supervised neighborhoods,' e.g., DANN [6], CART [7], Random Forests [8], and Rodeo [9], demonstrate empirical and theoretical success at overcoming this dimensionality issue, they do so at great computational expense.

In this work, we address these statistical and computational issues, focusing on the problem of non-parametric regression. We propose a novel family of Supervised Local modeling methods (Silo) that build on the interpretation of random forests as a local model to identify supervised neighbors. Like k-NN methods, the width of our neighborhoods adapts to the local density of data: higher density leads to smaller neighborhoods, while lower density means neighborhoods spread out and borrow strength from distant points. Unlike k-NN or local polynomial methods, our neighborhoods are determined by the shape of the response rather than some fixed metric (hence they are supervised neighborhoods).

Additionally, we observe that datasets of the size we need to fit nonparametric functions are often stored and processed in a distributed fashion. We thus devise an efficient distributed variant of Silo, which naturally targets massive, distributed datasets. In this setting, worker nodes at training time independently calculate supervised neighborhoods using information from distinct random forests. Given a test point, the master node gathers supervised neighborhood data from all workers, and then trains a local linear model using this neighborhood data.
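To make this prediction-time workflow concrete, the following minimal sketch (plain Python with NumPy and scikit-learn forests, not the authors' Spark implementation; the `Worker` class, `master_predict`, and all hyperparameters are our own illustrative choices) shows workers returning only their supervised neighborhoods for a test point, and the master fitting a single weighted local linear model whose intercept is the prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class Worker:
    """One worker holding a shard of training data; illustrative sketch only."""

    def __init__(self, X_shard, y_shard, n_trees=50, seed=0):
        self.X, self.y = X_shard, y_shard
        # Each worker grows its own forest on its shard, with no communication.
        self.forest = RandomForestRegressor(
            n_estimators=n_trees, min_samples_leaf=5, random_state=seed
        ).fit(X_shard, y_shard)

    def neighborhood(self, x):
        """Supervised-neighborhood weights of test point x within this shard:
        co-leaf indicators averaged over trees, normalized by leaf size."""
        train_leaves = self.forest.apply(self.X)            # (n_shard, n_trees)
        test_leaves = self.forest.apply(x.reshape(1, -1))   # (1, n_trees)
        same_leaf = train_leaves == test_leaves
        w = (same_leaf / same_leaf.sum(axis=0)).mean(axis=1)
        keep = w > 0                                        # ship only the neighborhood
        return w[keep], self.X[keep], self.y[keep]


def master_predict(x, workers):
    """Gather neighborhoods from all workers and fit one weighted local linear
    model centered at x; the fitted intercept is the prediction."""
    parts = [wk.neighborhood(x) for wk in workers]
    w = np.concatenate([p[0] for p in parts])
    X = np.vstack([p[1] for p in parts]) - x                # center at the test point
    y = np.concatenate([p[2] for p in parts])
    Z = np.hstack([np.ones((len(y), 1)), X])                # intercept + linear terms
    sw = np.sqrt(w)[:, None]
    beta, *_ = np.linalg.lstsq(Z * sw, y * np.sqrt(w), rcond=None)
    return beta[0]
```

Splitting (X, y) into shards, building `workers = [Worker(Xs, ys, seed=j) for j, (Xs, ys) in enumerate(shards)]`, and calling `master_predict(x_test, workers)` exercises the whole path; only the (weight, feature, response) triples of each neighborhood cross the network at test time.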
To summarize, our contributions are as follows:

• We introduce a novel adaptive local learning approach, Silo, based on the idea of supervised neighbors.
• We present a distributed variant of Silo that is well suited for large-scale distributed datasets.
• We prove consistency of Silo under a simple model of random forests.
• We show experimentally that Silo outperforms both standard random forests and Euclidean-distance based local methods in single node settings.
• We implement the distributed variant of Silo in Apache Spark [10], and demonstrate its favorable performance relative to the default Spark/MLlib [11] Random Forest implementation.

2 METHODS BASED ON LOCAL MODELS

There are three ingredients to a method for local model-based nonparametric regression: a loss function L : R → R, a weight function w(·, ·), and a function space F. In this work, we assume that the loss function is the squared loss, i.e., L(u) = u^2. The weight function is a mapping w(·, ·) : R^p × R^p → [0, ∞), which may be fixed, or may depend on the training dataset. F is a family of functions where f ∈ F maps R^p → R. Generally F is a fairly simple function space, as it need only provide a good approximation to the response surface locally. Common function spaces are constant functions, linear functions, or higher-order polynomials of fixed degree.

We now describe the general form for local modeling methods. Assume we have a set of training data pairs, {(y_i, x_i)}_{i=1}^n. Given a test point x ∈ R^p, we define a function ĝ(x) which approximates E(y | x), via the following two-step process:

    \hat{f}_x(\cdot) = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{n} w(x_i, x) \, \big(y_i - f(x_i - x)\big)^2        (1)

    \hat{g}(x) = \hat{f}_x(0)        (2)

For instance, we can consider k-NN regression under this framework by first defining K_x as the set of x's k nearest neighbors. Then w(x_i, x) = 1 if x_i ∈ K_x and zero otherwise, and f̂_x(·) is a constant function that returns (1/k) Σ_{x_i ∈ K_x} y_i. Partitioning methods like CART [7] and random forests [8] can also be seen as examples of local models with constant f̂_x(·) functions. LOESS [12] uses a fixed weighting function, and fits local polynomial functions. We describe the particular setting of linear functions in more detail in Section 3 (see equations 8 and 9). Other methods exist that are similar in spirit to the local modeling framework defined in equations 1 and 2 but do not fit precisely with our formulation, e.g., tree-based piecewise polynomial models [13], and multivariate adaptive regression splines [14].
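The two-step recipe in equations (1) and (2) can be written down generically. The sketch below (our own illustrative NumPy code and function names, not from the paper) takes the weight function and polynomial degree as inputs, so that k-NN indicator weights with degree 0 recover the plain k-NN average described above, while degree 1 gives a LOESS-style local linear fit.

```python
import numpy as np

def local_predict(x, X, y, weight_fn, degree=0):
    """Generic local-model prediction, following eqs. (1)-(2): compute w(x_i, x),
    fit a polynomial in (x_i - x) by weighted least squares, and return its
    value at 0 (the intercept)."""
    w = weight_fn(X, x)                     # step 1: weights for every training point
    D = X - x                               # predictors centered at the test point
    cols = [np.ones(len(y))]                # constant term
    if degree >= 1:
        cols.append(D)                      # linear terms (higher degrees analogous)
    Z = np.column_stack(cols)
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Z * sw[:, None], y * sw, rcond=None)
    return beta[0]                          # f_hat_x(0)

def knn_weights(X, x, k=20):
    """w(x_i, x) = 1 for the k nearest neighbors of x, 0 otherwise."""
    dist = np.linalg.norm(X - x, axis=1)
    w = np.zeros(len(X))
    w[np.argsort(dist)[:k]] = 1.0
    return w

# local_predict(x0, X, y, knn_weights, degree=0) is the plain k-NN average;
# degree=1 with a smooth kernel weight gives a LOESS-style estimator.
```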
2.1 Euclidean distance-based local methods

A common form for the weight function w is the following:

    w(x_1, x_2) = k\big(\|x_1 - x_2\|^2\big)        (3)

where ‖·‖ is the Euclidean, or L2, norm and k : [0, ∞) → [0, ∞) is a univariate kernel function. Common examples include the rectangular, triangular, Epanechnikov, cosine, and Gaussian kernels. If the local model family F is chosen to be polynomials of degree ℓ, such an estimator is denoted an LP(ℓ) method (we adopt the notation of Tsybakov [15]). If the true conditional expectation function E(y | x) belongs to the Hölder class Σ(β, L), then, with appropriate scaling of the kernel function, the mean square error of the LP(⌊β⌋) estimator converges to zero at the minimax optimal rate of n^{-2β/(2β+p)}. In practice, despite their optimality properties, LP(ℓ) estimators are not generally applied to higher-dimensional problems where some dimensions may be irrelevant. As an example, Figure 1 shows the empirical performance of k-nearest neighbors, LOESS with local linear models, and random forests on the popular Friedman1 simulation from [14]. In this simulation, p = 10, x is distributed uniformly on the hypercube in 10 dimensions, and

    y = 10 \sin(\pi x_1 x_2) + 20\big(x_3 - \tfrac{1}{2}\big)^2 + 10 x_4 + 5 x_5 + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, 9).

Note that only 5 of the 10 dimensions of x are relevant to y. We see that random forests outperform LOESS and nearest neighbor methods; in this case the tuning parameters of LOESS and k-NN have been optimized on the holdout set, while the random forest parameters were left at their default values.

Figure 1: Comparison of non-parametric techniques on the Friedman1 simulation (test-set RMSE vs. training set size for k-NN, LOESS, and random forests). Parameters of LOESS and k-NN were tuned on the holdout set, while random forest parameters were set at default values (3 candidate variables at each node, trees trained to 5 points per leaf).

Figure 2: Weights of a random forest. Each circle represents a point in the training set, and the diameter of the circle represents the weight of that point when making a prediction at the origin. The function being fit does not vary along the x_1 coordinate.

The prediction of a random forest can be written as

    \hat{g}(x) = \frac{1}{K} \sum_{j=1}^{K} \frac{\sum_{i=1}^{n} w(x_i, x, \theta_j) \, y_i}{k_{\theta_j}(x)}        (5)

where θ_j denotes the randomness used to grow the j-th of the K trees, w(x_i, x, θ_j) ∈ {0, 1} indicates whether x_i falls in the same leaf as x in that tree, and k_{θ_j}(x) is the number of training points in that leaf. Note that this takes the form of equations 1 and 2, where F is the family of constant functions and the induced weight function is w(x_i, x) = \frac{1}{K} \sum_{j=1}^{K} w(x_i, x, \theta_j) / k_{\theta_j}(x).
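This weighting view of equation (5) is easy to reproduce with an off-the-shelf forest. The sketch below (our own code using scikit-learn; `forest_weights`, the synthetic data, and the hyperparameters are illustrative assumptions) computes the induced weights from leaf memberships and checks that the weighted average of training responses matches the forest's own prediction; bootstrapping is disabled so that each tree's leaf value is exactly the mean of the full-data points in that leaf.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def forest_weights(forest, X_train, x):
    """Weights induced by a fitted forest, as in eq. (5): per tree, the co-leaf
    indicator divided by that leaf's training count, averaged over the trees."""
    train_leaves = forest.apply(X_train)            # (n, K) leaf ids of training points
    test_leaves = forest.apply(x.reshape(1, -1))    # (1, K) leaf ids of the test point
    same_leaf = train_leaves == test_leaves         # w(x_i, x, theta_j)
    return (same_leaf / same_leaf.sum(axis=0)).mean(axis=1)

# Sanity check on synthetic data.
rng = np.random.default_rng(0)
X, y = rng.uniform(size=(500, 10)), rng.normal(size=500)
rf = RandomForestRegressor(n_estimators=25, max_features=3, min_samples_leaf=5,
                           bootstrap=False, random_state=0).fit(X, y)
x0 = rng.uniform(size=10)
w = forest_weights(rf, X, x0)
assert np.isclose(w.sum(), 1.0)                     # weights form a probability vector
assert np.allclose(w @ y, rf.predict(x0.reshape(1, -1))[0])
```

Plugging such a forest-derived weight function into a richer local family F (for instance, the local linear fit sketched earlier) is the step Silo takes beyond the constant-fit forest.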