Focus Article

Leveraging for big data regression

Ping Ma∗ and Xiaoxiao Sun

Rapid advances in science and technology in the past decade have brought an extraordinary amount of data, offering researchers an unprecedented opportunity to tackle complex research challenges. The opportunity, however, has not yet been fully utilized, because effective and efficient statistical tools for analyzing super-large datasets are still lacking. One major challenge is that the advance of computing resources still lags far behind the exponential growth of databases. To facilitate scientific discoveries using current computing resources, one may use an emerging family of statistical methods, called leveraging. Leveraging methods are designed under a subsampling framework, in which one samples a small proportion of the data (subsample) from the full sample, and then performs intended computations for the full sample using the small subsample as a surrogate. The key to the success of the leveraging methods is to construct nonuniform sampling probabilities so that influential data points are sampled with high probabilities. These methods stand as a unique development of their type in big data analytics and allow pervasive access to massive amounts of information without resorting to high performance computing and cloud computing. © 2014 Wiley Periodicals, Inc.

How to cite this article: WIREs Comput Stat 2014. doi: 10.1002/wics.1324

Keywords: big data; subsampling; estimation; least squares; leverage

∗Correspondence to: [email protected]; Department of Statistics, University of Georgia, Athens, GA, USA

Conflict of interest: The authors have declared no conflicts of interest for this article.

INTRODUCTION

Rapid advance in science and technology in the past decade brings an extraordinary amount of data that were inaccessible just a decade ago, offering researchers an unprecedented opportunity to tackle much larger and more complex research challenges. The opportunity, however, has not yet been fully utilized, because effective and efficient statistical and computing tools for analyzing super-large datasets are still lacking. One major challenge is that the advance of computing technologies still lags far behind the exponential growth of databases. The cutting-edge computing technologies of today and the foreseeable future are high performance computing (HPC) platforms, such as supercomputers and cloud computing. Researchers have proposed the use of GPUs (graphics processing units),1 and explorations of FPGA (field-programmable gate array)-based accelerators are ongoing in both academia and industry (for example, with TimeLogic and Convey Computers). However, none of these technologies by themselves can address the big data problem accurately and at scale. Supercomputers are precious resources and are not designed to be used routinely by everyone. The selection of projects to run on a supercomputer is very stringent. For example, resource allocation on the recently deployed Blue Waters supercomputer at the National Center for Supercomputing Applications (https://bluewaters.ncsa.illinois.edu/) requires a rigorous National Science Foundation (NSF) review process and supports only a handful of projects. While clouds have large storage capabilities, 'cloud computing is not a panacea: it poses problems for developers and users of cloud software, requires large data transfers over precious low-bandwidth Internet uplinks, raises new privacy and security issues, and is an inefficient solution'.2 While cloud facilities like MapReduce do offer some advantages, the overall issues of data transfer and usability of today's clouds will indeed limit what we can achieve; the tight coupling between computing and storage is absent. A recently conducted comparison3 found that even high-end GPUs are sometimes outperformed by general-purpose multi-core implementations because of the time required to transfer data.

'One option is to invent algorithms that make better use of a fixed amount of computing power'.2 In this article, we review an emerging family of statistical methods, called leveraging methods, that are developed for achieving such a goal. Leveraging methods are designed under a subsampling framework, in which we sample a small proportion of the data (subsample) from the full sample, and then perform intended computations for the full sample using the small subsample as a surrogate. The key to the success of the leveraging methods relies on effectively constructing nonuniform sampling probabilities so that influential data points are sampled with high probabilities. Traditionally, 'subsampling' has been used to refer to the 'm-out-of-n' bootstrap, whose primary motivation is to make approximate inference owing to the difficulty or intractability of deriving analytical expressions.4,5 However, the general motivation of the leveraging method is different from that of traditional subsampling. In particular, (1) leveraging methods are used to achieve feasible computation even if the simple analytic results are available, (2) leveraging methods enable the visualization of the data when visualization of the full sample is impossible, and (3) as we will see later, leveraging methods usually use unequal sampling probabilities for subsampling data, whereas subsampling almost exclusively uses equal sampling probabilities. Leveraging methods stand as a unique development in big data analytics and allow pervasive access to massive amounts of information without resorting to HPC and cloud computing.

LINEAR REGRESSION MODEL AND LEVERAGING

In this article, we focus on the linear regression setting. When the sample size n is super large, the computation of the OLS estimator using the full sample may become extremely expensive. For example, if p = √n, the computation of the OLS estimator is O(n²), which may be infeasible for super large n. Alternatively, one may opt to use leveraging methods, which are presented below. There are two types of leveraging methods, weighted leveraging methods and unweighted leveraging methods. Let us look at the weighted leveraging methods first.

Linear Model and Least Squares Problem

Given n observations (x_i^T, y_i), where i = 1, …, n, we consider a linear model,

    y = Xβ + ε,    (1)

where y = (y_1, …, y_n)^T is the n × 1 response vector, X = (x_1, …, x_n)^T is the n × p predictor or design matrix, β is a p × 1 coefficient vector, and ε is zero-mean random noise with constant variance. The unknown coefficient β can be estimated via the OLS, which is to solve

    β̂_ols = arg min_β ||y − Xβ||²,    (2)

where ||·|| represents the Euclidean norm. When the matrix X is of full rank, the OLS estimate β̂_ols of β, the minimizer of (2), has the following expression,

    β̂_ols = (X^T X)^{−1} X^T y.    (3)

When the matrix X is not of full rank, the inverse of X^T X in expression (3) is replaced by the generalized inverse. The predicted response vector is

    ŷ = Hy,    (4)

where the hat matrix H = X(X^T X)^{−1} X^T. The ith diagonal element of the hat matrix H, h_ii = x_i^T (X^T X)^{−1} x_i, is called the leverage score of the ith observation. It is well known that as h_ii approaches 1, the predicted response of the ith observation ŷ_i gets close to y_i. Thus the leverage score has been regarded as the most important influence index, indicating how influential the ith observation is to the least squares estimator.

Leveraging

The first effective leveraging method in regression was developed in Refs 6 and 7. The method uses the sample leverage scores to construct a nonuniform sampling probability. Thus, the resulting sample tends to concentrate more on important or influential data points for the ordinary least squares (OLS).6,7 The estimate based on such a subsample has been shown to provide an excellent approximation to the OLS based on the full data (when p is small and n is large). Ma et al.8,9 studied the statistical properties of the basic leveraging method in Refs 6 and 7 and proposed two novel methods to improve the basic leveraging method.
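As a concrete illustration of Eqs (2)–(4), the following sketch (our own, not from the article; all variable names are illustrative) fits OLS on simulated data and computes the leverage scores h_ii without forming the n × n hat matrix:

```python
import numpy as np

# Illustrative sketch (ours, not the authors' code): OLS and leverage
# scores on a small simulated dataset.
rng = np.random.default_rng(0)
n, p = 1000, 5
X = rng.standard_normal((n, p))
y = X @ np.ones(p) + rng.standard_normal(n)   # true beta = (1, ..., 1)^T

# OLS estimate, Eq. (3): beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Leverage scores h_ii = x_i^T (X^T X)^{-1} x_i, the diagonal of the hat
# matrix H = X (X^T X)^{-1} X^T, computed row by row without forming H.
h = np.einsum('ij,ij->i', X @ np.linalg.inv(X.T @ X), X)

# The scores lie in [0, 1] and sum to p (the trace of H).
```

Because trace(H) = p for a full-rank X, the leverage scores always sum to p, which is why they can be normalized into sampling probabilities.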


Algorithm 1. Weighted Leveraging

• Step 1. Taking a random subsample of size r from the data. Construct sampling probabilities π = {π_1, …, π_n} for the data points. Draw a random subsample (with replacement) of size r ≪ n, denoted as (X*, y*), from the full sample according to the probability π. Record the corresponding sampling probability matrix Φ* = diag{π*_k}.

• Step 2. Solving a weighted least squares (WLS) problem using the subsample. Obtain the least squares estimate using the subsample. Estimate β via solving a WLS problem on the subsample to get the estimate β̃, i.e., solve

    arg min_β ||Φ*^{−1/2} y* − Φ*^{−1/2} X* β||².    (5)

FIGURE 1 | A random sample of n = 500 was generated from y_i = −x_i + ε_i, where x_i is t(6)-distributed and ε_i ~ N(0, 1). The true regression function is in the dotted black line, the data in black circles, and the OLS estimator using the full sample in the black solid line. (a) The uniform leveraging estimator is in the green dashed line. The uniform leveraging subsample is superimposed as green crosses. (b) The weighted leveraging estimator is in the red dot-dashed line. The points in the weighted leveraging subsample are superimposed as red crosses.

In the weighted leveraging methods (and the unweighted leveraging methods to be reviewed below), the key component is to construct the sampling probability π. An easy choice is π_i = 1/n for each data point, i.e., uniform sampling probability. The resulting leveraging estimator is denoted as β̃_UNIF. The leveraging algorithm with uniform subsampling probability is very simple to implement but performs poorly in many cases; see Figure 1. The first nonuniform weighted leveraging algorithm was developed by Drineas et al.6,7 In their weighted leveraging algorithm, they construct the sampling probability π_i using the (normalized) sample leverage score as π_i = h_ii/Σ_i h_ii. The resulting leveraging estimator is denoted as β̃_BLEV, which yields impressive results: a preconditioned version of this leveraging method 'beats LAPACK's direct dense least squares solver by a large margin on essentially any dense tall matrix'.10

An important feature of the weighted leveraging method is, in step 2, to use the sampling probability as the weight in solving the WLS based on the subsample. The weight is analogous to the weight in the classic Hansen–Hurwitz estimator.11 Recall in classic sampling methods, given data (y_1, …, y_n) with sampling probabilities (π_1, …, π_n), taking a random sample of size r, denoted by (y*_1, …, y*_r), the Hansen–Hurwitz estimator is r^{−1} Σ_{i=1}^r y*_i/π*_i, where π*_i is the corresponding sampling probability of y*_i. It is well known that r^{−1} Σ_{i=1}^r y*_i/π*_i is an unbiased estimate of Σ_{i=1}^n y_i. It was shown in Refs 8 and 9 that the weighted leveraging estimator is an approximately unbiased estimator of β̂_ols.

It was also noticed in Ref 8 that the variance of the weighted leveraging estimator may be inflated by the data with tiny sampling probabilities. Subsequently, Ma et al.8,9 introduced a shrinkage leveraging method. In the shrinkage leveraging method, the sampling probability is shrunk toward the uniform sampling probability, i.e., π_i = λ h_ii/Σ_i h_ii + (1 − λ)(1/n), where the tuning parameter λ is a constant in [0, 1]. The resulting estimator is denoted as β̃_SLEV. It was demonstrated in Refs 8 and 9 that β̃_SLEV has smaller mean squared error than β̃_BLEV when λ is chosen between 0.8 and 0.95 in their simulation studies.

It is worth noting that the first leveraging method is designed primarily to approximate the full sample OLS estimate.6,7 Ma et al.8 substantially expand the scope of the leveraging methods by applying them to infer the true value of the coefficient β. Moreover, if inference of the true value of the coefficient β is the major concern, one may use an unweighted leveraging method.8

Algorithm 2. Unweighted Leveraging

• Step 1. Taking a random sample of size r from the data. This step is the same as step 1 in Algorithm 1.

• Step 2. Solving an OLS problem using the subsample. Obtain the least squares estimate using the subsample. Estimate β by solving an OLS problem (instead of a WLS problem) on the subsample to get the estimate β̃_u, i.e., solve

    arg min_β ||y* − X*β||².    (6)
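To make the two algorithms concrete, here is a minimal sketch (our own implementation with illustrative names, not the authors' code). `leverage_probs` returns the BLEV probabilities π_i = h_ii/Σ h_ii for lam = 1, the shrinkage (SLEV) probabilities for 0 < lam < 1, and uniform probabilities for lam = 0; `leveraging_estimate` performs steps 1 and 2 of Algorithm 1 (weighted=True) or Algorithm 2 (weighted=False):

```python
import numpy as np

def leverage_probs(X, lam=1.0):
    # Leverage scores via the thin SVD, then shrink toward uniform (SLEV).
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    h = np.sum(U**2, axis=1)                   # h_ii = ||u_i||^2
    return lam * h / h.sum() + (1.0 - lam) / len(h)

def leveraging_estimate(X, y, r, pi, weighted=True, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    n = len(y)
    # Step 1: subsample r rows with replacement according to pi.
    idx = rng.choice(n, size=r, replace=True, p=pi)
    Xs, ys = X[idx], y[idx]
    if weighted:
        # Step 2, Algorithm 1: WLS, i.e., rescale rows by pi^{-1/2}.
        w = 1.0 / np.sqrt(pi[idx])
        Xs, ys = Xs * w[:, None], ys * w
    # Step 2, Algorithm 2 (weighted=False): plain OLS on the subsample.
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return beta

rng = np.random.default_rng(1)
n, p = 10000, 5
X = rng.standard_normal((n, p)) * rng.exponential(1.0, size=(n, 1))
y = X @ np.ones(p) + rng.standard_normal(n)
pi = leverage_probs(X, lam=0.9)                # SLEV probabilities
beta_slev = leveraging_estimate(X, y, r=500, pi=pi, rng=rng)
```

Scaling the subsampled rows by π_i^{−1/2} before the plain least squares solve is the WLS problem in Eq. (5); the constant factor r^{−1/2} is omitted since it does not change the minimizer.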


Although β̃_u is a biased estimator of the full sample OLS estimator β̂_ols, Ma et al.8 show that β̃_u is an unbiased estimator of the true value of the coefficient β. Simulation studies also indicate that the unweighted leveraging estimator β̃_u has smaller variance than the corresponding weighted leveraging estimator β̃.

Analytic Framework for Leveraging Estimators

We use the mean squared error (MSE) to compare the performance of the leveraging methods. In particular, the MSE can be decomposed into bias and variance components. We start with deriving the bias and variance of the weighted leveraging estimator in Algorithm 1. It is instructive to note that when studying the statistical properties of leveraging estimators there are two layers of randomness (from sampling and subsampling).

From Algorithm 1, we can easily see

    β̃ = (X*^T Φ*^{−1} X*)^{−1} X*^T Φ*^{−1} y*.

This expression is difficult to analyze as it involves both the random subsample (X*, y*) and the corresponding sampling probability matrix Φ*. Instead, we rewrite the weighted leveraging estimator in terms of the original data (X, y),

    β̃ = (X^T W X)^{−1} X^T W y,

where W = diag(w_1, w_2, …, w_n) is an n × n diagonal random matrix, w_i = k_i/(r π_i), and k_i is the number of times that the ith data point appears in the random subsample (recall that step 1 in Algorithm 1 takes the subsample with replacement).

Clearly, the estimator β̃ can be regarded as a function of the random weight vector w = (w_1, w_2, …, w_n)^T, denoted as β̃(w). As we take the random subsample with replacement, it is easy to see that w has a scaled multinomial distribution,

    Pr{w_1 = k_1/(r π_1), w_2 = k_2/(r π_2), …, w_n = k_n/(r π_n)} = (r!/(k_1! k_2! ⋯ k_n!)) π_1^{k_1} π_2^{k_2} ⋯ π_n^{k_n},

with mean E(w) = 1. By setting w_0, the vector around which we perform our Taylor series expansion, to be the all-ones vector, i.e., w_0 = 1, then β̃(w) can be expanded around the full sample OLS estimate β̂_ols, i.e., β̃(1) = β̂_ols. From this, we have the following lemma.

LEMMA 1. Let β̃ be the weighted leveraging estimator. Then, a Taylor expansion of β̃ around the point w_0 = 1 yields

    β̃ = β̂_ols + (X^T X)^{−1} X^T diag{ê}(w − 1) + R_W,    (7)

where ê = y − X β̂_ols is the OLS residual vector, and R_W = o_p(||w − w_0||) is the Taylor expansion remainder.

Note that the detailed proofs of all lemmas and theorems can be found in Ref 8.

Given Lemma 1, we can establish the following lemma, which provides expressions for the conditional and unconditional expectations and variances of the weighted leveraging estimator.

LEMMA 2. The conditional expectation and conditional variance for the weighted leveraging estimator are given by

    E{β̃ | y} = β̂_ols + E{R_W};    (8)

    Var{β̃ | y} = (X^T X)^{−1} X^T diag{ê} diag{1/(r π)} diag{ê} X (X^T X)^{−1} + Var{R_W},    (9)

where W specifies the probability distribution used in the sampling and rescaling steps. The unconditional expectation and unconditional variance for the weighted leveraging estimator are given by

    E{β̃} = β_0;    (10)

    Var{β̃} = σ² (X^T X)^{−1} + (σ²/r) (X^T X)^{−1} X^T diag{(1 − h_ii)²/π_i} X (X^T X)^{−1} + Var{R_W}.    (11)

Equation (8) states that, when the E{R_W} term is negligible, i.e., when the linear approximation is valid, then, conditioning on the observed data y, the estimate β̃ is approximately unbiased, relative to the full sample OLS estimate β̂_ols; and Eq. (10) states that the estimate β̃ is unbiased, relative to the true value β_0 of the parameter vector β. That is, given a particular dataset (X, y), the conditional expectation result of Eq. (8) states that the leveraging estimators can approximate β̂_ols well; and, as a statistical inference procedure for arbitrary datasets, the unconditional expectation

result of Eq. (10) states that the leveraging estimators can infer β_0 well.

Both the conditional variance of Eq. (9) and the (second term of the) unconditional variance of Eq. (11) are inversely proportional to the subsample size r; and both contain a sandwich-type expression, the middle of which depends on how the leverage scores interact with the sampling probabilities. Moreover, the first term of the unconditional variance, σ²(X^T X)^{−1}, equals the variance of the OLS estimator; this implies, e.g., that the unconditional variance of Eq. (11) is larger than the variance of the OLS estimator, which is consistent with the Gauss–Markov theorem.

Now consider the unweighted leveraging estimator, which differs from the weighted leveraging estimator in that no weight is used for the least squares based on the subsample. Thus, we shall derive the bias and variance of the unweighted leveraging estimator β̃_u. To do so, we first use a Taylor series expansion to get the following lemma.

LEMMA 3. Let β̃_u be the unweighted leveraging estimator. Then, a Taylor expansion of β̃_u around the point w_0 = rπ yields

    β̃_u = β̂_wls + (X^T W_0 X)^{−1} X^T diag{ê_w}(w − rπ) + R_u,    (12)

where β̂_wls = (X^T W_0 X)^{−1} X^T W_0 y is the full sample WLS estimator, ê_w = y − X β̂_wls is the LS residual vector, W_0 = diag{rπ}, and R_u is the Taylor expansion remainder.

This lemma is analogous to Lemma 1. In particular, here we expand around the point w_0 = rπ since E{w} = rπ when no reweighting takes place.

Given this Taylor expansion lemma, we can now establish the following lemma for the mean and variance of the unweighted leveraging estimator, both conditioned and unconditioned on the data y.

LEMMA 4. The conditional expectation and conditional variance for the unweighted leveraging estimator are given by

    E{β̃_u | y} = β̂_wls + E{R_u};

    Var{β̃_u | y} = (X^T W_0 X)^{−1} X^T diag{ê_w} W_0 diag{ê_w} X (X^T W_0 X)^{−1} + Var{R_u},

where W_0 = diag{rπ}, and where β̂_wls = (X^T W_0 X)^{−1} X^T W_0 y is the full sample WLS estimator. The unconditional expectation and unconditional variance for the unweighted leveraging estimator are given by

    E{β̃_u} = β_0;

    Var{β̃_u} = σ² (X^T W_0 X)^{−1} X^T W_0² X (X^T W_0 X)^{−1} + σ² (X^T W_0 X)^{−1} X^T diag{I − P_{X,W_0}} W_0 diag{I − P_{X,W_0}} X (X^T W_0 X)^{−1} + Var{R_u},    (13)

where P_{X,W_0} = X(X^T W_0 X)^{−1} X^T W_0.

The two expectation results in this lemma state that (1) when E{R_u} is negligible, then, conditioning on the observed data y, the estimator β̃_u is approximately unbiased, relative to the full sample WLS estimator β̂_wls, and (2) the estimator β̃_u is unbiased, relative to the true value β_0 of the parameter vector β.

As expected, when the leverage scores are all the same, the variance in Eq. (13) is the same as the variance of uniform random sampling. This is expected since, when reweighting with respect to the uniform distribution, one does not change the problem being solved, and thus the solutions to the weighted and unweighted OLS problems are identical. More generally, the variance is not inflated by very small leverage scores, as it is with the weighted leveraging estimator. For example, the conditional variance expression is also a sandwich-type expression, the center of which is W_0 = diag{r h_ii/n}, which is not inflated by very small leverage scores.

Computation

The running time of the leveraging methods depends on both the time to construct the probability π and the time to solve the least squares problem based on the subsample. Solving least squares based on a subsample of size r requires O(rp²) time. On the other hand, when constructing π involves the leverage scores h_ii, the running time is dominated by the computation of those scores. The leverage score h_ii can be computed using the singular value decomposition (SVD) of X. The SVD of X can be written as

    X = UΛV^T,

where U is an n × p orthogonal matrix whose columns contain the left singular vectors of X, V is a p × p orthogonal matrix whose columns contain the right singular vectors of X, the p × p matrix Λ = diag{λ_i}, and λ_i, i = 1, …, p, are the singular values of X. The hat matrix H can alternatively be expressed as H = UU^T.


TABLE 1 | Mean (Standard Deviation) of Sample MSEs over 100 Replicates for Different Subsample Sizes

Algorithms    r = 200             r = 500             r = 1000            r = 2000
Uniform       8.46e-5 (2.44e-5)   1.32e-5 (8.16e-7)   9.89e-6 (2.38e-7)   8.80e-6 (9.20e-8)
Weighted      7.72e-5 (2.64e-5)   1.24e-5 (6.21e-7)   9.61e-6 (2.47e-7)   8.80e-6 (1.60e-7)
Unweighted    7.03e-5 (2.36e-5)   1.18e-5 (5.14e-7)   9.28e-6 (1.99e-7)   8.51e-6 (6.88e-8)
Shrinkage     7.02e-5 (1.88e-5)   1.24e-5 (6.50e-7)   9.67e-6 (2.80e-7)   8.79e-6 (1.49e-7)

MSE, mean squared error.

Then, the leverage score of the ith observation can be expressed as

    h_ii = ||u_i||²,    (14)

where u_i is the ith row of U. The exact computation of h_ii, for i = 1, …, n, in Eq. (14) requires O(np²) time.12

The fast randomized algorithm from Ref 13 can be used to compute approximations to the leverage scores of X in o(np²) time. The algorithm of Ref 13 takes an arbitrary n × p matrix X as input.

• Generate an r_1 × n random matrix Π_1 and a p × r_2 random matrix Π_2.

• Let R be the R matrix from a QR decomposition of Π_1 X.

• Compute and return the leverage scores of the matrix X R^{−1} Π_2.

For appropriate choices of r_1 and r_2, if one chooses Π_1 to be a Hadamard-based random projection matrix, then this algorithm runs in o(np²) time, and it returns 1 ± ε approximations to all the leverage scores of X.13 In addition, with a high-quality implementation of the Hadamard-based random projection, this algorithm runs faster than traditional deterministic algorithms based on LAPACK.10,14 Ma et al.8 used two variants of this fast algorithm of Ref 13: the binary fast algorithm, where (up to normalization) each element of Π_1 and Π_2 is generated i.i.d. from {−1, 1} with equal sampling probabilities, and the Gaussian fast algorithm, where each element of Π_1 is generated i.i.d. from a Gaussian distribution with mean zero and variance 1/n and each element of Π_2 is generated i.i.d. from a Gaussian distribution with mean zero and variance 1/p. Ma et al.8 showed that these two algorithms perform very well.

Real Example

The Human Connectome Project (HCP) aims to delineate human brain connectivity and function in the healthy adult human brain and to provide these data as a public resource for biomedical research. In the HCP, functional magnetic resonance imaging (fMRI) data were collected as follows. Scanners acquire slices of a three-dimensional (3D) functional image, resulting in brain oxygenation-level dependent (BOLD) signal samples at different layers of the brain while subjects perform a certain task. In this example, fMRI BOLD intensities released in 2013 were used.15 The BOLD signals at approximately 0.2 million voxels in each subject's brain were taken at 176 consecutive time points during a task named emotion. Most conventional methods can only analyze single-subject imaging data due to the daunting computational cost. We stacked five subjects' BOLD signals together, resulting in a data matrix of 1,004,738 × 176, in which each row is a voxel and each column is a time point. We use the BOLD signal at the last time point as the response y and the remaining 175 BOLD measurements, with an additional intercept, as the predictor X, which is 1,004,738 × 176. We fit the linear regression model (1) to the data.

We applied four leveraging methods to fit the linear regression model. In particular, uniform leveraging, weighted leveraging, unweighted leveraging, and shrinkage leveraging with λ = 0.9 were compared, with subsample sizes ranging from r = 200 to 2000. The sample MSE = (y − ŷ)^T (y − ŷ)/n was calculated. We repeated applying the leveraging methods one hundred times at each subsample size. The means and standard deviations of the MSEs are reported in Table 1. From Table 1, one can clearly see that the unweighted leveraging outperforms the other methods in terms of sample MSE. The nonuniform leveraging methods are significantly better than uniform leveraging, especially when the subsample size is small.

As for computing time, the OLS takes 70.550 CPU seconds to complete on the full sample, whereas the leveraging methods only need about 28 CPU seconds to complete. In particular, in step 1 of the weighted leveraging method, the SVD takes 22.277 CPU seconds, calculating the leverage scores takes 5.848 seconds, and sampling takes 0.008 seconds. Step 2 takes an additional 0.012 CPU seconds.
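The three-step randomized approximation described above can be sketched with Gaussian projections (the 'Gaussian fast' variant). This is our own illustrative implementation of the procedure of Ref 13, not the authors' code; the choices r_1 = 200 and r_2 = 50 are arbitrary, and since the Gaussian scaling determines the scores only up to a constant factor, we rescale them to sum to p at the end:

```python
import numpy as np

def approx_leverage(X, r1, r2, rng):
    """Approximate leverage scores of X via random projections (Ref 13)."""
    n, p = X.shape
    Pi1 = rng.normal(0.0, np.sqrt(1.0 / n), size=(r1, n))  # variance 1/n
    Pi2 = rng.normal(0.0, np.sqrt(1.0 / p), size=(p, r2))  # variance 1/p
    _, R = np.linalg.qr(Pi1 @ X)           # R factor of Pi1 X (p x p)
    B = X @ np.linalg.inv(R) @ Pi2         # n x r2 sketch of X R^{-1}
    s = np.sum(B**2, axis=1)               # squared row norms ~ c * h_ii
    return s * (p / s.sum())               # rescale so the scores sum to p

rng = np.random.default_rng(2)
n, p = 5000, 10
X = rng.standard_normal((n, p))

# Exact scores via the thin SVD (O(np^2)) for comparison.
U, _, _ = np.linalg.svd(X, full_matrices=False)
h_exact = np.sum(U**2, axis=1)

h_approx = approx_leverage(X, r1=200, r2=50, rng=rng)
```

Since the scores are only used after normalization into sampling probabilities, the unknown constant factor introduced by the Gaussian projections is harmless in practice.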


CONCLUSION

In this article, we reviewed the emerging family of leveraging methods in the context of linear regression. For linear regression, there are many methods available for computing the exact least squares estimator of the full sample using divide-and-conquer approaches. However, none of them is designed to visualize the big data. Without looking at the data, exploratory data analysis is impossible. In contrast, step 1 in the leveraging algorithms provides a random 'sketch' of the data. Thus visualization of the subsample obtained in step 1 of the leveraging methods enables a surrogate visualization of the full data. While outliers are likely to be high leverage data points and thus likely to be selected into the subsample in leveraging methods, we suggest that model diagnostics may be performed after the leveraging estimates are calculated. Subsequently, outliers can be identified and removed. Finally, the leveraging estimates may be calculated on the subsample without outliers.

ACKNOWLEDGMENTS We thank Tianming Liu Lab for providing the fMRI data. This research was partially supported by National Science Foundation grants DMS-1055815, 12228288, and 1222718.

REFERENCES

1. Suchard MA, Wang Q, Chan C, Frelinger J, Cron A, West M. Understanding GPU programming for statistical computation: studies in massively parallel massive mixtures. J Comput Graph Stat 2010, 19:419–438.

2. Schatz MC, Langmead B, Salzberg SL. Cloud computing and the DNA data race. Nat Biotechnol 2010, 28:691.

3. Pratas F, Trancoso P, Sousa L, Stamatakis A, Shi G, Kindratenko V. Fine-grain parallelism using multi-core, cell/BE, and GPU systems. Parallel Comput 2012, 38:365–390.

4. Efron B. Bootstrap methods: another look at the jackknife. Ann Stat 1979, 7:1–26.

5. Wu CFJ. Jackknife, bootstrap and other resampling methods in regression analysis. Ann Stat 1986, 14:1261–1295.

6. Drineas P, Mahoney MW, Muthukrishnan S. Sampling algorithms for l2 regression and applications. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms, Miami, Florida, 2006, 1127–1136.

7. Drineas P, Mahoney MW, Muthukrishnan S, Sarlós T. Faster least squares approximation. Numer Math 2010, 117:219–249.

8. Ma P, Mahoney M, Yu B. A statistical perspective on algorithmic leveraging. ArXiv e-prints, June 2013.

9. Ma P, Mahoney M, Yu B. A statistical perspective on algorithmic leveraging. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014, 91–99.

10. Avron H, Maymounkov P, Toledo S. Blendenpik: Supercharging LAPACK's least-squares solver. SIAM J Sci Comput 2010, 32:1217–1236.

11. Hansen M, Hurwitz W. On the theory of sampling from a finite population. Ann Math Stat 1943, 14:333–362.

12. Golub G, Van Loan C. Matrix Computations. Baltimore: Johns Hopkins University Press; 1996.

13. Drineas P, Magdon-Ismail M, Mahoney MW, Woodruff DP. Fast approximation of matrix coherence and statistical leverage. J Mach Learn Res 2012, 13:3475–3506.

14. Gittens A, Mahoney MW. Revisiting the Nyström method for improved large-scale machine learning. Technical Report, Preprint: arXiv:1303.1849, 2013.

15. Van Essen DC, Smith SM, Barch DM, Behrens TE, Yacoub E, Ugurbil K. The WU-Minn Human Connectome Project: an overview. Neuroimage 2013, 80:62–79.
