Leveraging for Big Data Regression
Focus Article
Ping Ma∗ and Xiaoxiao Sun

Rapid advance in science and technology in the past decade brings an extraordinary amount of data, offering researchers an unprecedented opportunity to tackle complex research challenges. The opportunity, however, has not yet been fully utilized, because effective and efficient statistical tools for analyzing super-large datasets are still lacking. One major challenge is that the advance of computing resources still lags far behind the exponential growth of databases. To facilitate scientific discoveries using current computing resources, one may use an emerging family of statistical methods, called leveraging. Leveraging methods are designed under a subsampling framework, in which one samples a small proportion of the data (a subsample) from the full sample and then performs the intended computations for the full sample using the small subsample as a surrogate. The key to the success of the leveraging methods is to construct nonuniform sampling probabilities so that influential data points are sampled with high probabilities. These methods stand as a unique development of their type in big data analytics and allow pervasive access to massive amounts of information without resorting to high performance computing and cloud computing. © 2014 Wiley Periodicals, Inc.

How to cite this article: WIREs Comput Stat 2014. doi: 10.1002/wics.1324

Keywords: big data; subsampling; estimation; least squares; leverage

∗Correspondence to: [email protected]
Department of Statistics, University of Georgia, Athens, GA, USA
Conflict of interest: The authors have declared no conflicts of interest for this article.

INTRODUCTION

Rapid advance in science and technology in the past decade brings an extraordinary amount of data that were inaccessible just a decade ago, offering researchers an unprecedented opportunity to tackle much larger and more complex research challenges. The opportunity, however, has not yet been fully utilized, because effective and efficient statistical and computing tools for analyzing super-large datasets are still lacking. One major challenge is that the advance of computing technologies still lags far behind the exponential growth of databases. The cutting-edge computing technologies of today and the foreseeable future are high performance computing (HPC) platforms, such as supercomputers and cloud computing. Researchers have proposed the use of GPUs (graphics processing units),1 and explorations of FPGA (field-programmable gate array)-based accelerators are ongoing in both academia and industry (for example, with TimeLogic and Convey Computers). However, none of these technologies by themselves can address the big data problem accurately and at scale. Supercomputers are precious resources and are not designed to be used routinely by everyone. The selection of projects to run on a supercomputer is very stringent. For example, resource allocation on the recently deployed Blue Waters supercomputer at the National Center for Supercomputing Applications (https://bluewaters.ncsa.illinois.edu/) requires a rigorous National Science Foundation (NSF) review process and supports only a handful of projects. While clouds have large storage capabilities, 'cloud computing is not a panacea: it poses problems for developers and users of cloud software, requires large data transfers over precious low-bandwidth Internet uplinks, raises new privacy and security issues, and is an inefficient solution'.2
While cloud facilities like MapReduce do offer some advantages, the overall issues of data transfer and usability of today's clouds will indeed limit what we can achieve; the tight coupling between computing and storage is absent. A recently conducted comparison3 found that even high-end GPUs are sometimes outperformed by general-purpose multi-core implementations because of the time required to transfer data.

'One option is to invent algorithms that make better use of a fixed amount of computing power'.2 In this article, we review an emerging family of statistical methods, called leveraging methods, that are developed for achieving such a goal. Leveraging methods are designed under a subsampling framework, in which we sample a small proportion of the data (a subsample) from the full sample and then perform the intended computations for the full sample using the small subsample as a surrogate. The key to the success of the leveraging methods lies in effectively constructing nonuniform sampling probabilities so that influential data points are sampled with high probabilities. Traditionally, 'subsampling' has been used to refer to the 'm-out-of-n' bootstrap, whose primary motivation is to make approximate inference owing to the difficulty or intractability of deriving analytical expressions.4,5 The general motivation of the leveraging method, however, is different from that of traditional subsampling. In particular, (1) leveraging methods are used to achieve feasible computation even if simple analytic results are available, (2) leveraging methods enable the visualization of the data when visualization of the full sample is impossible, and (3) as we will see later, leveraging methods usually use unequal sampling probabilities for subsampling data, whereas traditional subsampling almost exclusively uses equal sampling probabilities. Leveraging methods stand as a unique development in big data analytics and allow pervasive access to massive amounts of information without resorting to HPC and cloud computing.

LINEAR REGRESSION MODEL AND LEVERAGING

In this article, we focus on the linear regression setting. When the sample size $n$ is super large, the computation of the OLS estimator using the full sample may become extremely expensive. For example, if $p = \sqrt{n}$, the computation of the OLS estimator is $O(n^2)$, which may be infeasible for super large $n$. Alternatively, one may opt to use leveraging methods, which are presented below.

There are two types of leveraging methods: weighted leveraging methods and unweighted leveraging methods. Let us look at the weighted leveraging methods first.

Linear Model and Least Squares Problem

Given $n$ observations $(x_i^T, y_i)$, where $i = 1, \ldots, n$, we consider a linear model,

$$y = X\beta + \epsilon, \qquad (1)$$

where $y = (y_1, \ldots, y_n)^T$ is the $n \times 1$ response vector, $X = (x_1, \ldots, x_n)^T$ is the $n \times p$ predictor or design matrix, $\beta$ is a $p \times 1$ coefficient vector, and $\epsilon$ is zero-mean random noise with constant variance. The unknown coefficient $\beta$ can be estimated via the OLS, which is to solve

$$\hat{\beta}_{ols} = \arg\min_{\beta} \|y - X\beta\|^2, \qquad (2)$$

where $\|\cdot\|$ represents the Euclidean norm. When the matrix $X$ is of full rank, the OLS estimate $\hat{\beta}_{ols}$ of $\beta$, the minimizer of (2), has the following expression,

$$\hat{\beta}_{ols} = (X^T X)^{-1} X^T y. \qquad (3)$$

When the matrix $X$ is not of full rank, the inverse of $X^T X$ in expression (3) is replaced by a generalized inverse. The predicted response vector is

$$\hat{y} = Hy, \qquad (4)$$

where the hat matrix $H = X(X^T X)^{-1} X^T$. The $i$th diagonal element of the hat matrix $H$, $h_{ii} = x_i^T (X^T X)^{-1} x_i$, is called the leverage score of the $i$th observation. It is well known that as $h_{ii}$ approaches 1, the predicted response of the $i$th observation, $\hat{y}_i$, gets close to $y_i$. Thus the leverage score has been regarded as the most important influence index, indicating how influential the $i$th observation is to the least squares estimator.
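To make the leverage scores concrete, here is a minimal NumPy sketch (ours, not from the article; the function name leverage_scores is hypothetical). It uses the thin QR decomposition $X = QR$, under which $H = QQ^T$, so $h_{ii}$ is simply the squared norm of the $i$th row of $Q$; this avoids forming $(X^T X)^{-1}$ explicitly.

```python
import numpy as np

def leverage_scores(X):
    """Leverage scores h_ii = x_i^T (X^T X)^{-1} x_i for a full-rank X.

    Via the thin QR decomposition X = QR, the hat matrix is H = Q Q^T,
    so h_ii is the squared Euclidean norm of the i-th row of Q.
    """
    Q, _ = np.linalg.qr(X)              # Q is n x p with orthonormal columns
    return np.einsum('ij,ij->i', Q, Q)  # row-wise squared norms of Q

# Sanity check: leverage scores lie in [0, 1] and sum to p = trace(H).
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
h = leverage_scores(X)
assert np.all((h >= 0) & (h <= 1)) and np.isclose(h.sum(), 3.0)
```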
Leveraging

The first effective leveraging method in regression was developed in Refs 6 and 7. The method uses the sample leverage scores to construct a nonuniform sampling probability. Thus, the resulting sample tends to concentrate more on important or influential data points for the ordinary least squares (OLS).6,7 The estimate based on such a subsample has been shown to provide an excellent approximation to the OLS based on the full data (when $p$ is small and $n$ is large). Ma et al.8,9 studied the statistical properties of the basic leveraging method in Refs 6 and 7 and proposed two novel methods to improve the basic leveraging method.

Algorithm 1. Weighted Leveraging

• Step 1. Taking a random subsample of size r from the data. Constructing sampling probabilities $\pi = \{\pi_1, \ldots, \pi_n\}$ for the data points, draw a random subsample (with replacement) of size $r \ll n$, denoted as $(X^*, y^*)$, from the full sample according to the probabilities $\pi$. Record the corresponding sampling probability matrix $\Phi^* = \mathrm{diag}\{\pi_k^*\}$.

• Step 2. Solving a weighted least squares (WLS) problem using the subsample. Obtain the estimate $\tilde{\beta}$ of $\beta$ by solving a WLS problem on the subsample, i.e., solve

$$\arg\min_{\beta} \|\Phi^{*-1/2} y^* - \Phi^{*-1/2} X^* \beta\|^2. \qquad (5)$$

[FIGURE 1 | A random sample of $n = 500$ was generated from $y_i = -x_i + \epsilon_i$, where $x_i$ is $t(6)$-distributed and $\epsilon_i \sim N(0, 1)$. The true regression function is shown as a dotted black line, the data as black circles, and the OLS estimator using the full sample as a solid black line. (a) The uniform leveraging estimator is shown as a green dashed line, with the uniform leveraging subsample superimposed as green crosses. (b) The weighted leveraging estimator is shown as a red dot-dashed line, with the points in the weighted leveraging subsample superimposed as red crosses.]

In the weighted leveraging methods (and the unweighted leveraging methods to be reviewed below), the key component is the construction of the sampling probabilities $\pi$. An easy choice is $\pi_i = 1/n$ for each data point, i.e., uniform sampling probability.

It was also noticed in Ref 8 that the variance of the weighted leveraging estimator may be inflated by data points with tiny sampling probabilities. Subsequently, Ma et al.8,9 introduced a shrinkage leveraging method, in which the sampling probability is shrunk toward the uniform sampling probability, i.e., $\pi_i = \alpha \frac{h_{ii}}{\sum_i h_{ii}} + (1 - \alpha) \frac{1}{n}$, where the tuning parameter $\alpha$ lies between 0 and 1.
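For illustration, here is a minimal NumPy sketch of Algorithm 1 (our rendering under stated assumptions, not the authors' code; the function name weighted_leveraging and its arguments are hypothetical). It computes leverage-based sampling probabilities, optionally shrunk toward uniform as in the shrinkage leveraging method, draws $r$ rows with replacement, and solves the WLS problem (5) by rescaling each sampled row by $1/\sqrt{\pi_i}$.

```python
import numpy as np

def weighted_leveraging(X, y, r, alpha=1.0, seed=None):
    """Sketch of weighted (and shrinkage) leveraging for linear regression.

    alpha = 1 gives leverage-based probabilities pi_i = h_ii / sum(h);
    alpha < 1 shrinks them toward uniform:
        pi_i = alpha * h_ii / sum(h) + (1 - alpha) / n.
    """
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    Q, _ = np.linalg.qr(X)                      # thin QR of the full design
    h = np.einsum('ij,ij->i', Q, Q)             # leverage scores h_ii
    pi = alpha * h / h.sum() + (1 - alpha) / n  # sampling probabilities

    # Step 1: draw a subsample of size r with replacement according to pi.
    idx = rng.choice(n, size=r, replace=True, p=pi)

    # Step 2: solve the WLS problem (5); multiplying each sampled row by
    # 1/sqrt(pi_i) and running ordinary least squares is equivalent.
    w = 1.0 / np.sqrt(pi[idx])
    beta_tilde, *_ = np.linalg.lstsq(X[idx] * w[:, None], y[idx] * w,
                                     rcond=None)
    return beta_tilde
```

As a usage example in the spirit of Figure 1, one might generate $x_i$ from a $t(6)$ distribution, set $y_i = -x_i + \epsilon_i$, and compare weighted_leveraging(X, y, r=20) against the full-sample OLS fit; with leverage-based probabilities the subsample concentrates on the high-leverage tails of the $x$ distribution.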