review articles

DOI:10.1145/2842602

RandNLA: Randomized Numerical Linear Algebra

BY PETROS DRINEAS AND MICHAEL W. MAHONEY

Randomization offers new benefits for large-scale linear algebra computations.

key insights
• Randomization isn't just used to model noise in data; it can be a powerful computational resource to develop algorithms with improved running times and stability properties as well as algorithms that are more interpretable in downstream data science applications.
• To achieve best results, random sampling of elements or columns/rows must be done carefully; but random projections can be used to transform or rotate the input data to a random basis where simple uniform random sampling of elements or rows/columns can be successfully applied.
• Random sketches can be used directly to get low-precision solutions to data science applications; or they can be used indirectly to construct preconditioners for traditional iterative numerical algorithms to get high-precision solutions in scientific computing applications.

MATRICES ARE UBIQUITOUS in computer science, statistics, and applied mathematics. An m × n matrix can encode information about m objects (each described by n features), or the behavior of a discretized differential operator on a finite element mesh; an n × n positive-definite matrix can encode the correlations between all pairs of n objects, or the edge-connectivity between all pairs of nodes in a social network; and so on. Motivated largely by technological developments that generate extremely large scientific and Internet datasets, recent years have witnessed exciting developments in the theory and practice of matrix algorithms. Particularly remarkable is the use of randomization—typically assumed to be a property of the input data due to, for example, noise in the data generation mechanisms—as an algorithmic or computational resource for the development of improved algorithms for fundamental matrix problems such as matrix multiplication, least-squares (LS) approximation, low-rank matrix approximation, and Laplacian-based linear equation solvers.

Randomized Numerical Linear Algebra (RandNLA) is an interdisciplinary research area that exploits randomization as a computational resource to develop improved algorithms for large-scale linear algebra problems.32 From a foundational perspective, RandNLA has its roots in theoretical computer science (TCS), with deep connections to mathematics (convex analysis, probability theory, metric embedding theory) and applied mathematics (scientific computing, signal processing, numerical linear algebra). From an applied perspective, RandNLA is a vital new tool for machine learning, statistics, and data analysis. Well-engineered implementations have already outperformed highly optimized software libraries for ubiquitous problems such as least-squares,4,35 with good scalability in parallel and distributed environments.52 Moreover, RandNLA promises a sound algorithmic and statistical foundation for modern large-scale data analysis.

An Historical Perspective
To get a broader sense of RandNLA, recall that linear algebra—the mathematics of vector spaces and linear mappings between vector spaces—has had a long history in large-scale (by the standards of the day) statistical data analysis.46 For example, the least-squares method is due to Gauss, Legendre, and others, and was used in the early 1800s for fitting linear equations to data to determine planet orbits. Low-rank approximations based on Principal Component Analysis (PCA) are due to Pearson, Hotelling, and others, and were used in the early 1900s for exploratory data analysis and for making predictive models. Such methods are of interest for many reasons, but especially if there is noise or randomness in the data, because the leading principal components then tend to capture the signal and remove the noise.

With the advent of the digital computer in the 1950s, it became apparent that, even when applied to well-posed problems, many algorithms performed poorly in the presence of the finite precision that was used to represent real numbers. Thus, much of the early work in computer science focused on solving discrete approximations to continuous numerical problems. Work by Turing and von Neumann (then Householder, Wilkinson, and others) laid much of the foundations for scientific computing and NLA.48,49 Among other things, this led to the introduction of problem-specific complexity measures (for example, the condition number) that characterize the behavior of an input for a specific class of algorithms (for example, iterative algorithms). Meanwhile, Turing, Church, and others began the study of computation per se. It became clear that several seemingly different approaches (recursion theory, the λ-calculus, and Turing machines) defined the same class of functions; and this led to the belief in TCS that the concept of computability is formally captured in a qualitative and robust way by these three equivalent approaches, independent of the input data.

A split then occurred in the nascent field of computer science. Continuous linear algebra became the domain of applied mathematics, and much of computer science theory and practice became discrete and combinatorial.44 Nearly all subsequent work in scientific computing and NLA has been deterministic (a notable exception being the work on integral evaluation using the Markov Chain Monte Carlo method). This led to high-quality software in the 1980s and 1990s (LINPACK, EISPACK, LAPACK, and ScaLAPACK) that remains widely used today. Many of these developments were deterministic; but, motivated by early work on the Monte Carlo method, randomization—where the randomness is inside the algorithm and the algorithm is applied to arbitrary or worst-case data—was introduced and exploited as a powerful computational resource.


Figure 1. (a) Matrices are a common way to model data. In genetics, for example, matrices can describe data from tens of thousands of individuals typed at millions of Single Nucleotide Polymorphisms or SNPs (loci in the human genome). Here, the (i, j)th entry is the genotype of the ith individual at the jth SNP. (b) PCA/SVD can be used to project every individual on the top left singular vectors (or "eigenSNPs"), thereby providing a convenient visualization of the "out of Africa hypothesis" well known in population genetics.

[Figure 1(a): excerpt of a genotype matrix, with rows corresponding to individuals and columns to Single Nucleotide Polymorphisms (SNPs), and entries such as AA, AG, GG. Figure 1(b): scatterplot of individuals projected onto EigenSNP 1, EigenSNP 2, and EigenSNP 3, with samples labeled by region: Africa, America, Central/South Asia, East Asia, Europe, Middle East, Oceania, Gujarati, and Mexicans.]

Recent years have seen these two very different perspectives start to converge. Motivated by modern massive dataset problems, there has been a great deal of interest in developing algorithms with improved running times and/or improved statistical properties that are more appropriate for obtaining insight from the enormous quantities of noisy data that is now being generated. At the center of these developments is work on novel algorithms for linear algebra problems, and central to this is work on RandNLA algorithms.a In this article, we will describe the basic ideas that underlie recent developments in this interdisciplinary area.

For a prototypical data analysis example where RandNLA methods have been applied, consider Figure 1, which illustrates an application in genetics38 (although the same RandNLA methods have been applied in astronomy, mass spectrometry imaging, and related areas33,38,53,54). While the low-dimensional PCA plot illustrates the famous correlation between geography and genetics, there are several weaknesses of PCA/SVD-based methods. One is running time: computing PCA/SVD approximations of even moderately large data matrices is expensive, especially if it needs to be done many times as part of cross validation or exploratory data analysis. Another is interpretability: in general, eigenSNPs (that is, eigenvectors of individual-by-SNP matrices) as well as other eigenfeatures don't "mean" anything in terms of the processes generating the data. Both issues have served as motivation to develop RandNLA algorithms to compute PCA/SVD approximations faster than conventional numerical methods as well as to identify actual features (instead of eigenfeatures) that might be easier to interpret for domain scientists.

a Avron et al., in the first sentence of their Blendenpik paper, observe that RandNLA is "arguably the most exciting and innovative idea to have hit linear algebra in a long time."4
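To make the PCA/SVD computation behind Figure 1(b) concrete, here is a minimal NumPy sketch that projects each row of a small synthetic individuals-by-SNPs matrix onto its top two right singular vectors. The random 0/1/2 matrix is only a stand-in for real genotype data, and the code is purely illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic stand-in for an individuals-by-SNPs matrix (real data would be
    # thousands of individuals by millions of SNPs, encoded as 0/1/2 genotypes).
    m, n = 200, 1000
    A = rng.integers(0, 3, size=(m, n)).astype(float)

    # Center each SNP (column), as is standard before PCA.
    A -= A.mean(axis=0)

    # Thin SVD; the top right singular vectors play the role of "eigenSNPs".
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # Project every individual onto the top two eigenSNPs (cf. Figure 1(b)).
    scores = A @ Vt[:2].T          # equivalently U[:, :2] * s[:2]
    print(scores.shape)            # (200, 2): one 2-D point per individual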


Basic RandNLA Principles
RandNLA algorithms involve taking an input matrix; constructing a "sketch" of that input matrix—where a sketch is a smaller or sparser matrix that represents the essential information in the original matrix—by random sampling; and then using that sketch as a surrogate for the full matrix to help compute quantities of interest. To be useful, the sketch should be similar to the original matrix in some way, for example, small residual error on the difference between the two matrices, or the two matrices should have similar action on sets of vectors or in downstream classification tasks. While these ideas have been developed in many ways, several basic design principles underlie much of RandNLA: (i) randomly sample, in a careful data-dependent manner, a small number of elements from an input matrix to create a much sparser sketch of the original matrix; (ii) randomly sample, in a careful data-dependent manner, a small number of columns and/or rows from an input matrix to create a much smaller sketch of the original matrix; and (iii) preprocess an input matrix with a random-projection-type matrix, in order to "spread out" or uniformize the information in the original matrix, and then use naïve data-independent uniform sampling of rows/columns/elements in order to create a sketch.

Element-wise sampling. A naïve way to view an m × n matrix A is as an array of numbers: these are the mn elements of the matrix, and they are denoted by Aij (for all i = 1, . . ., m and all j = 1, . . ., n). It is therefore natural to consider the following approach in order to create a small sketch of a matrix A: instead of keeping all its elements, randomly sample and keep a small number of them. Algorithm 1 is a meta-algorithm that samples s elements from a matrix A in independent, identically distributed trials, where in each trial a single element of A is sampled with respect to the importance sampling probability distribution pij. The algorithm outputs a matrix Ã that contains precisely the selected elements of A, after appropriate rescaling. This rescaling is fundamental from a statistical perspective: the sketch Ã is an estimator for A. This rescaling makes it an unbiased estimator since, element-wise, the expectation of the estimator matrix Ã is equal to the original matrix A.

Algorithm 1 A meta-algorithm for element-wise sampling
Input: m × n matrix A; integer s > 0 denoting the number of elements to be sampled; probabilities pij (i = 1, . . ., m and j = 1, . . ., n) with ∑i,j pij = 1.
1. Let Ã be an all-zeros m × n matrix.
2. For t = 1 to s,
• Randomly sample one element of A using the probability distribution pij.
• Let Aitjt denote the sampled element and set

    Ãitjt = Aitjt / (s pitjt)    (1)

Output: Return the m × n matrix Ã.

How to sample is, of course, very important. A simple choice is to perform uniform sampling, that is, set pij = 1/mn, for all i, j, and sample each element with equal probability. While simple, this suffers from obvious problems: for example, if all but one of the entries of the original matrix equal zero, and only a single non-zero entry exists, then the probability of sampling the single non-zero entry of A using uniform sampling is negligible. Thus, the estimator would have very large variance, in which case the sketch would, with high probability, fail to capture the relevant structure of the original matrix. Qualitatively improved results can be obtained by using nonuniform data-dependent importance sampling distributions. For example, sampling larger elements (in absolute value) with higher probability is advantageous in terms of variance reduction and can be used to obtain worst-case additive-error bounds for low-rank matrix approximation.1,2,18,28 More elaborate probability distributions (the so-called element-wise leverage scores that use information in the singular subspaces of A10) have been shown to provide still finer results.

The first results for Algorithm 1 showed that if one chooses entries with probability proportional to their squared-magnitudes (that is, if pij = Aij² / ‖A‖F², in which case larger magnitude entries are more likely to be chosen), then the sketch Ã is similar to the original matrix A, in the sense that the error matrix, A − Ã, has, with high probability, a small spectral norm.2 A more refined analysis18 showed that

    ‖A − Ã‖2 ≤ C √((m + n) ln(m + n) / s) ‖A‖F    (2)

for a small constant C, where ‖·‖2 and ‖·‖F are the spectral and Frobenius norms, respectively, of the matrix.b If the spectral norm of the difference A − Ã is small, then Ã can be used as a proxy for A in applications. For example, one can use Ã to approximate the spectrum (that is, the singular values and singular vectors) of the original matrix.2 If s is set to be a constant multiple of (m + n) ln(m + n), then the error scales with the Frobenius norm of the matrix. This leads to an additive-error low-rank matrix approximation algorithm, in which ‖A‖F is the scale of the additional additive error.2 This is a large scaling factor, but improving upon this with element-wise sampling, even in special cases, is a challenging open problem.

The mathematical techniques used in the proof of these element-wise sampling results exploit the fact that the residual matrix A − Ã is a random matrix whose entries have zero mean and bounded variance. Bounding the spectral norm of such matrices has a long history in random matrix theory.50 Early RandNLA element-wise sampling bounds2 used a result of Füredi and Komlós on the spectral norm of symmetric, zero mean matrices of bounded variance.20 Subsequently, Drineas and Zouzias18 introduced the idea of using matrix measure concentration inequalities37,40,47 to simplify the proofs, and follow-up work28 has improved these bounds.

b In words, the spectral norm of a matrix measures how much the matrix elongates or deforms the unit ball in the worst case, and the Frobenius norm measures how much the matrix elongates or deforms the unit ball on average. Sometimes the spectral norm may have better properties, especially when dealing with noisy data, as discussed by Achlioptas and McSherry.2
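The following NumPy sketch is one possible implementation of Algorithm 1 with the squared-magnitude probabilities discussed above; the function name and the small test at the end are ours, and accumulating repeated draws is an implementation choice that preserves unbiasedness.

    import numpy as np

    def elementwise_sample(A, s, rng=None):
        """Algorithm 1 with p_ij proportional to A_ij^2: keep s rescaled
        entries of A in an otherwise all-zeros estimator A_tilde."""
        rng = np.random.default_rng() if rng is None else rng
        m, n = A.shape
        p = (A ** 2).ravel() / np.sum(A ** 2)     # squared-magnitude probabilities
        idx = rng.choice(m * n, size=s, replace=True, p=p)
        A_tilde = np.zeros(m * n)
        # np.add.at accumulates repeated draws, which keeps the estimator unbiased.
        np.add.at(A_tilde, idx, A.ravel()[idx] / (s * p[idx]))
        return A_tilde.reshape(m, n)

    # The spectral norm of A - A_tilde shrinks as the number of samples s grows.
    rng = np.random.default_rng(1)
    A = rng.standard_normal((100, 80))
    for s in (1000, 10000, 100000):
        err = np.linalg.norm(A - elementwise_sample(A, s, rng), 2)
        print(s, err / np.linalg.norm(A, 'fro'))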


Row/column sampling. A more sophisticated way to view a matrix A is as a linear operator, in which case the role of rows and columns becomes more central. Much RandNLA research has focused on sketching a matrix by keeping only a few of its rows and/or columns. This method of sampling predates element-wise sampling algorithms,19 and it leads to much stronger worst-case bounds.15,16

Algorithm 2 A meta-algorithm for row sampling
Input: m × n matrix A; integer s > 0 denoting the number of rows to be sampled; probabilities pi (i = 1, . . ., m) with ∑i pi = 1.
1. Let Ã be the empty matrix.
2. For t = 1 to s,
• Randomly sample one row of A using the probability distribution pi.
• Let Ait* denote the sampled row and set

    Ãt* = Ait* / √(s pit)    (3)

Output: Return the s × n matrix Ã.

Consider the meta-algorithm for row sampling (column sampling is analogous) presented in Algorithm 2. Much of the discussion of Algorithm 1 is relevant to Algorithm 2. In particular, Algorithm 2 samples s rows of A in independent, identically distributed trials according to the input probabilities pi's; and the output matrix Ã contains precisely the selected rows of A, after a rescaling that ensures un-biasedness of appropriate estimators (for example, the expectation of ÃᵀÃ is equal to AᵀA, element-wise).13,19 In addition, uniform sampling can easily lead to very poor results, but qualitatively improved results can be obtained by using nonuniform, data-dependent, importance sampling distributions. Some things, however, are different: the dimension of the sketch Ã is different than that of the original matrix A. The solution is to measure the quality of the sketch by comparing the difference between the matrices AᵀA and ÃᵀÃ. The simplest nonuniform distribution is known as ℓ2 sampling or norm-squared sampling, in which pi is proportional to the square of the Euclidean norm of the ith row Ai*:c

    pi = ‖Ai*‖2² / ‖A‖F²    (4)

When using norm-squared sampling, one can prove that

    ‖AᵀA − ÃᵀÃ‖F ≤ (1/√s) ‖A‖F²    (5)

holds in expectation (and thus, by standard arguments, with high probability) for arbitrary A.13,19,d The proof of Equation (5) is a simple exercise using basic properties of expectation and variance. This result can be generalized to approximate the product of two arbitrary matrices A and B.13 Proving such bounds with respect to other matrix norms is more challenging but very important for RandNLA. While Equation (5) trivially implies a bound for ‖AᵀA − ÃᵀÃ‖2, proving a better spectral norm error bound necessitates the use of more sophisticated methods such as the Khintchine inequality or matrix-Bernstein inequalities.42,47

Bounds of the form of Equation (5) immediately imply that Ã can be used as a proxy for A, for example, in order to approximate its (top few) singular values and singular vectors. Since Ã is an s × n matrix, with s ≪ n, computing its singular values and singular vectors is a very fast task that scales linearly with n. Due to the form of Equation (5), this leads to additive-error low-rank matrix approximation algorithms, in which ‖A‖F is the scale of the additional additive error.19 That is, while norm-squared sampling avoids pitfalls of uniform sampling, it results in additive-error bounds that are only comparable to what element-wise sampling achieves.2,19

To obtain stronger and more useful bounds, one needs information about the geometry or subspace structure of the high-dimensional Euclidean space spanned by the columns of A (if m ≫ n) or the space spanned by the best rank-k approximation to A (if m ∼ n). This can be achieved with leverage score sampling, in which pi is proportional to the ith leverage score of A.

c We will use the notation Ai* to denote the ith row of A as a row vector.
d That is, a provably good approximation to the product AᵀA can be computed using just a few rows of A; and these rows can be found by sampling randomly according to a simple data-dependent importance sampling distribution. This algorithm can be implemented in one pass over the data from external storage, using only O(sn) additional space and O(s²n) additional time.
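A minimal NumPy sketch of Algorithm 2 with the norm-squared probabilities of Equation (4); the last two lines empirically compare the error ‖AᵀA − ÃᵀÃ‖F with the bound of Equation (5) on a random Gaussian matrix (the function name and test matrix are ours, for illustration only).

    import numpy as np

    def row_sample(A, s, rng=None):
        """Algorithm 2 with norm-squared sampling: keep s rescaled rows of A
        so that E[A_tilde.T @ A_tilde] = A.T @ A."""
        rng = np.random.default_rng() if rng is None else rng
        p = np.sum(A ** 2, axis=1) / np.sum(A ** 2)    # Equation (4)
        idx = rng.choice(A.shape[0], size=s, replace=True, p=p)
        return A[idx] / np.sqrt(s * p[idx])[:, None]   # Equation (3) rescaling

    rng = np.random.default_rng(2)
    A = rng.standard_normal((10000, 50))
    A_tilde = row_sample(A, 500, rng)

    # Empirical check of Equation (5): error vs. ||A||_F^2 / sqrt(s).
    err = np.linalg.norm(A.T @ A - A_tilde.T @ A_tilde, 'fro')
    print(err, np.linalg.norm(A, 'fro') ** 2 / np.sqrt(500))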

To define these scores, for simplicity assume that m ≫ n and that U is any m × n orthogonal matrix spanning the column space of A.e In this case, UᵀU is equal to the identity and UUᵀ = PA is an m-dimensional projection matrix onto the span of A. Then, the importance sampling probabilities of Equation (4), applied to U, equal

    pi = ‖Ui*‖2² / n    (6)

Due to their historical importance in regression diagnostics and outlier detection, the pi's in Equation (6) are known as statistical leverage scores.9,14 In some applications of RandNLA, the largest leverage score is called the coherence of the matrix.8,14

Importantly, while one can naïvely compute these scores via Equation (6) by spending O(mn²) time to compute U exactly, this is not necessary.14 Let Π be the fast Hadamard-based random projection as used in Drineas et al.14 or the input-sparsity-time random projection of Refs.12,34,36 Then, in o(mn²) time, one can compute the matrix R from a QR decomposition of ΠA and from that compute 1 ± ε relative-error approximations to all the leverage scores.14

In RandNLA, one is typically interested in proving that

    ‖UᵀU − (SU)ᵀ(SU)‖2 = ‖I − (SU)ᵀ(SU)‖2 ≤ ε    (7)

for a sampling (or, more generally, sketching) matrix S, either for arbitrary ε ∈ (0, 1) or for some fixed ε ∈ (0, 1). Approximate matrix multiplication bounds of the form of Equation (7) are very important in RandNLA algorithm design since the resulting sketch Ã preserves rank properties of the original data matrix A and provides a subspace embedding: from the NLA perspective, this is simply an acute perturbation from the original high-dimensional space to a much lower dimensional space.22 From the TCS perspective, this provides bounds analogous to the usual Johnson–Lindenstrauss bounds, except that it preserves the geometry of the entire subspace.43

e A generalization holds if m ∼ n: in this case, U is any m × k orthogonal matrix spanning the best rank-k approximation to the column space of A, and one uses the leverage scores relative to the best rank-k approximation to A.14,16,33
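The leverage scores of Equation (6) can be computed from any orthogonal basis for the column space of A; the following NumPy sketch takes the naïve O(mn²) route via a thin SVD (not the fast o(mn²) approximation discussed above) and plants one high-leverage row to show how nonuniform the resulting sampling probabilities can be.

    import numpy as np

    def leverage_scores(A):
        """Exact leverage scores of a tall m x n matrix A (m >> n):
        squared row norms of an orthogonal basis U for range(A)."""
        U, _, _ = np.linalg.svd(A, full_matrices=False)   # thin SVD, O(mn^2)
        return np.sum(U ** 2, axis=1)                     # ell_i = ||U_{i*}||_2^2

    rng = np.random.default_rng(3)
    A = rng.standard_normal((5000, 20))
    A[0] += 100.0                      # plant one high-leverage row
    ell = leverage_scores(A)
    p = ell / A.shape[1]               # Equation (6): probabilities summing to 1
    print(p[0], p[1:].mean())          # the planted row is far more likely to be kept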

Subspace embeddings were first used in RandNLA in a data-aware manner (meaning, by looking at the input data to compute exact or approximate leverage scores14) to obtain sampling-based relative-error approximation to the LS regression and related low-rank CX/CUR approximation problems.15,16 They were then used in a data-oblivious manner (meaning, in conjunction with a random projection as a preconditioner) to obtain projection-based relative-error approximation to several RandNLA problems.43 A review of data-oblivious subspace embeddings for RandNLA, including its relationship with the early work on least absolute deviations regression,11 has been provided.51 Due to the connection with data-aware and data-oblivious subspace embeddings, approximating matrix multiplication is one of the most powerful primitives in RandNLA. Many error formulae for other problems ultimately boil down to matrix inequalities, where the randomness of the algorithm only appears as a (randomized) approximate matrix multiplication.

Random projections as preconditioners. Preconditioning refers to the application of a transformation, called the preconditioner, to a given problem instance such that the transformed instance is more easily solved by a given class of algorithms.f The main challenge for sampling-based RandNLA algorithms is the construction of the nonuniform sampling probabilities. A natural question arises: is there a way to precondition an input instance such that uniform random sampling of rows, columns, or elements yields an insignificant loss in approximation accuracy?

Figure 2. In RandNLA, random projections can be used to "precondition" the input data so that uniform sampling algorithms perform well, in a manner analogous to how traditional pre-conditioners transform the input to decrease the usual condition number so that iterative algorithms perform well (see (a)). In RandNLA, the random projection-based preconditioning involves uniformizing information in the eigenvectors, rather than flattening the eigenvalues (see (b)).
[Both panels plot leverage score against row index.]

f For example, if one is interested in iterative algorithms for solving the linear system Ax = b, one typically transforms a given problem instance to a related instance in which the so-called condition number is not too large.


The obvious obstacle to sampling uniformly at random from a matrix is that the relevant information in the matrix could be concentrated on a small number of rows, columns, or elements of the matrix. The solution is to spread out or uniformize this information, so that it is distributed almost uniformly over all rows, columns, or elements of the matrix. (This is illustrated in Figure 2.) At the same time, the preprocessed matrix should have similar properties (for example, singular values and singular vectors) as the original matrix, and the preprocessing should be computationally efficient (for example, it should be faster than solving the original problem exactly) to perform.

Consider Algorithm 3, our meta-algorithm for preprocessing an input matrix A in order to uniformize information in its rows or columns or elements. Depending on the choice of preprocessing (only from the left, only from the right, or from both sides) the information in A is uniformized in different ways (across its rows, columns, or elements, respectively). For pedagogical simplicity, Algorithm 3 is described such that the output matrix has the same dimensions as the original matrix (in which case Π is approximately a random rotation). Clearly, however, if this algorithm is coupled with Algorithm 1 or Algorithm 2, then with trivial to implement uniform sampling, only the rows/columns that are sampled actually need to be generated. In this case the sampled version of Π is known as a random projection.

Algorithm 3 A meta-algorithm for preconditioning a matrix for random sampling algorithms
1: Input: m × n matrix A, randomized preprocessing matrices ΠL and/or ΠR.
2: Output:
• To uniformize information across the rows of A, return ΠLA.
• To uniformize information across the columns of A, return AΠR.
• To uniformize information across the elements of A, return ΠLAΠR.

There is wide latitude in the choice of the random matrix Π. For example, although Π can be chosen to be a random orthogonal matrix, other constructions can have much better algorithmic properties: Π can consist of appropriately-scaled independent identically distributed (i.i.d.) Gaussian random variables, i.i.d. Rademacher random variables (+1 or −1, up to scaling, each with probability 50%), or i.i.d. random variables drawn from any sub-Gaussian distribution. Implementing these variants depends on the time to generate the random bits plus the time to perform the matrix-matrix multiplication that actually performs the random projection.

More interestingly, Π could be a so-called Fast Johnson Lindenstrauss Transform (FJLT). This is the product of two matrices, a random diagonal matrix with +1 or −1 on each diagonal entry, each with probability 1/2, and the Hadamard-Walsh (or related Fourier-based) matrix.3 Implementing FJLT-based random projections can take advantage of well-studied fast Fourier techniques and can be extremely fast for arbitrary dense input matrices.4,41 Recently, there has even been introduced an extremely sparse random projection construction that for arbitrary input matrices can be implemented in "input-sparsity time," that is, time depending on the number of nonzeros of A, plus lower-order terms, as opposed to the dimensions of A.12,34,36

With appropriate settings of problem parameters (for example, the number of uniform samples that are subsequently drawn, which equals the dimension onto which the data is projected), all of these methods precondition arbitrary input matrices so that uniform sampling in the randomly rotated basis performs as well as nonuniform sampling in the original basis. For example, if m ≫ n, in which case the leverage scores of A are given by Equation (6), then by keeping only roughly O(n log n) randomly-rotated dimensions, uniformly at random, one can prove that the leverage scores of the preconditioned system are, up to logarithmic fluctuations, uniform.g

Which construction for Π should be used in any particular application of RandNLA depends on the details of the problem, for example, the aspect ratio of the matrix, whether the RAM model is appropriate for the particular computational infrastructure, how expensive it is to generate random bits, and so on. For example, while slower in the RAM model, Gaussian-based random projections can have stronger conditioning properties than other constructions. Thus, given their ease of use, they are often more appropriate for certain parallel and cloud-computing architectures.25,35

Summary. Of the three basic RandNLA principles described in this section, the first two have to do with identifying nonuniformity structure in the input data; and the third has to do with preconditioning the input (that is, uniformizing the nonuniformity structure) so uniform random sampling performs well. Depending on the area in which RandNLA algorithms have been developed and/or implemented and/or applied, these principles can manifest themselves in very different ways. Relatedly, in applications where elements are of primary importance (for example, recommender systems26), element-wise methods might be most appropriate, while in applications where subspaces are of primary importance (for example, scientific computing25), column/row-based methods might be most appropriate.

Extensions and Applications of Basic RandNLA Principles
We now turn to several examples of problems in various domains where the basic RandNLA principles have been used in the design and analysis, implementation, and application of novel algorithms.

Low-precision approximations and high-precision numerical implementations: least-squares and low-rank approximation. One of the most fundamental problems in linear algebra is the least-squares (LS) regression problem: given an m × n matrix A and an m-dimensional vector b, solve

    minx∈ℝⁿ ‖Ax − b‖2    (8)

where ‖·‖2 denotes the ℓ2 norm of a vector. That is, compute the n-dimensional vector x that minimizes the Euclidean norm of the residual Ax − b.h If m ≫ n, then we have the overdetermined (or overconstrained) LS problem, and its solution can be obtained in O(mn²) time in the RAM model with one of several methods, for example, solving the normal equations, QR decompositions, or the SVD. Two major successes of RandNLA concern faster (in terms of low-precision asymptotic worst-case theory, or in terms of high-precision wall-clock time) algorithms for this ubiquitous problem.

g This is equivalent to the statement that the coherence of the preconditioned system is small.
h Observe this formulation includes as a special case the problem of solving systems of linear equations (if m = n and A has full rank, then the resulting system of linear equations has a unique solution).
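As a sanity check of the preconditioning claim above (and of footnote g), the following NumPy sketch applies a dense random rotation—standing in for the ΠL of Algorithm 3, where an FJLT or input-sparsity transform would be used on large problems—to a matrix with one high-leverage row and compares the leverage scores before and after.

    import numpy as np

    def leverage_scores(A):
        U, _, _ = np.linalg.svd(A, full_matrices=False)
        return np.sum(U ** 2, axis=1)

    rng = np.random.default_rng(4)
    m, n = 2000, 10
    A = rng.standard_normal((m, n))
    A[0] *= 100.0                       # one row that uniform sampling would miss

    # A dense random rotation standing in for Pi_L in Algorithm 3 (QR of a
    # Gaussian matrix); fast structured transforms replace this in practice.
    G = rng.standard_normal((m, m))
    Q, _ = np.linalg.qr(G)

    ell_before = leverage_scores(A)
    ell_after = leverage_scores(Q @ A)  # preconditioned matrix Pi_L A

    print(ell_before.max(), ell_before.mean())   # highly nonuniform
    print(ell_after.max(), ell_after.mean())     # close to uniform (about n/m each)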


One major success of RandNLA was the following random sampling algorithm for the LS problem: quickly compute 1 ± ε approximations to the leverage scores;14 form a subproblem by sampling with Algorithm 2 roughly Θ(n log(m)/ε) rows from A and the corresponding elements from b using those approximations as importance sampling probabilities; and return the LS solution of the subproblem.14,15 Alternatively, one can run the following random projection algorithm: precondition the input with a Hadamard-based random projection; form a subproblem by sampling with Algorithm 2 roughly Θ(n log(m)/ε) rows from A and the corresponding elements from b uniformly at random; and return the LS solution of the subproblem.17,43

Both of these algorithms return 1 ± ε relative-error approximate solutions for arbitrary or worst-case input; and both run in roughly Θ(mn log(n)/ε) = o(mn²) time, that is, qualitatively faster than traditional algorithms for the overdetermined LS problem. (Although this random projection algorithm is not faster in terms of asymptotic FLOPS than the corresponding random sampling algorithm, preconditioning with random projections is a powerful primitive more generally for RandNLA algorithms.) Moreover, both of these algorithms have been improved to run in time that is proportional to the number of nonzeros in the matrix, plus lower-order terms that depend on the lower dimension of the input.12

Another major success of RandNLA was the demonstration that the sketches constructed by RandNLA could be used to construct preconditioners for high-quality traditional NLA iterative software libraries.4 To see the need for this, observe that because of its dependence on ε, the previous RandNLA algorithmic strategy (construct a sketch and solve a LS problem on that sketch) can yield low-precision solutions, for example, ε = 0.1, but cannot practically yield high-precision solutions, for example, ε = 10⁻¹⁶. Blendenpik4 and LSRN35 are LS solvers that are appropriate for RAM and parallel environments, respectively, that adopt the following RandNLA algorithmic strategy: construct a sketch, using an appropriate random projection; use that sketch to construct a preconditioner for a traditional iterative NLA algorithm; and use that to solve the preconditioned version of the original full problem. This improves the ε dependence from poly(1/ε) to log(1/ε). Carefully-engineered implementations of this approach are competitive with or beat high-quality numerical implementations of LS solvers such as those implemented in LAPACK.4

The difference between these two algorithmic strategies (see Figure 3 for an illustration) highlights important differences between TCS and NLA approaches to RandNLA, as well as between computer science and scientific computing more generally: subtle but important differences in problem parameterization, between what counts as a "good" solution, and between error norms of interest. Moreover, similar approaches have been used to extend TCS-style RandNLA algorithms for providing 1 ± ε relative-error low-rank matrix approximation16,43 to NLA-style RandNLA algorithms for high-quality numerical low-rank matrix approximation.24,25,41
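The two strategies just described can be sketched schematically in a few lines of NumPy/SciPy. A Gaussian sketch stands in for the Hadamard-based or input-sparsity projections, and SciPy's LSQR plays the role of the traditional iterative solver, so this is an illustration of the idea rather than of the tuned Blendenpik or LSRN codes.

    import numpy as np
    from scipy.sparse.linalg import lsqr

    rng = np.random.default_rng(5)
    m, n = 20000, 50
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)

    # Sketch the problem: S has d << m rows (a Gaussian sketch, for simplicity).
    d = 4 * n
    S = rng.standard_normal((d, m)) / np.sqrt(d)
    SA, Sb = S @ A, S @ b

    # Strategy 1 (sketch-and-solve): low-precision solution from the sketch alone.
    x_sketch, *_ = np.linalg.lstsq(SA, Sb, rcond=None)

    # Strategy 2 (sketch-and-precondition, in the spirit of Blendenpik/LSRN):
    # use R from a QR of the sketch as a preconditioner, then run LSQR on the
    # full problem.
    _, R = np.linalg.qr(SA)
    R_inv = np.linalg.inv(R)
    y = lsqr(A @ R_inv, b, atol=1e-14, btol=1e-14)[0]
    x_precond = R_inv @ y

    x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(np.linalg.norm(x_sketch - x_exact))    # low precision
    print(np.linalg.norm(x_precond - x_exact))   # high precision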

Figure 3. (a) RandNLA algorithms for least-squares problems first compute sketches, SA and Sb, of the input data, A and b. Then, either they solve a least-squares problem on the sketch to obtain a low-precision approximation, or they use the sketch to construct a traditional preconditioner for an iterative algorithm on the original input data to get high-precision approximations. Subspace-preserving embedding: if S is a random sampling matrix, then the high leverage point will be sampled and included in SA; and if S is a random-projection-type matrix, then the information in the high leverage point will be homogenized or uniformized in SA. (b) The "heart" of RandNLA proofs is a subspace-preserving embedding for orthogonal matrices: if UA is an orthogonal matrix (say the matrix of the left singular vectors of A), then SUA is approximately orthogonal, that is, (SUA)ᵀ(SUA) ≈ I.
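A quick numerical check of the subspace-embedding property illustrated in Figure 3(b), using a Gaussian sketching matrix S (chosen here only because it is simple to write down): for an orthogonal basis UA of the column space of A, the matrix (SUA)ᵀ(SUA) should be close to the identity.

    import numpy as np

    rng = np.random.default_rng(6)
    m, n = 5000, 20
    A = rng.standard_normal((m, n))
    U_A, _, _ = np.linalg.svd(A, full_matrices=False)   # orthogonal basis of range(A)

    d = 1000                                            # sketch size, d << m
    S = rng.standard_normal((d, m)) / np.sqrt(d)        # Gaussian sketching matrix
    E = np.eye(n) - (S @ U_A).T @ (S @ U_A)

    # Spectral norm of I - (S U_A)^T (S U_A); a small value means S U_A is
    # approximately orthogonal, i.e., S preserves the geometry of range(A).
    print(np.linalg.norm(E, 2))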



For example, a fundamental structural condition for a sketching matrix to satisfy to obtain good low-rank matrix approximation is the following. Let Vk ∈ Rⁿˣᵏ (resp., Vk,⊥ ∈ Rⁿˣ⁽ⁿ⁻ᵏ⁾) be any orthogonal matrix spanning the top-k (resp., bottom-(n − k)) right singular subspace of A ∈ Rᵐˣⁿ, and let Σk (resp., Σk,⊥) be the diagonal matrix containing the top-k (resp., all but the top-k) singular values. In addition, let Z ∈ Rⁿˣʳ (r ≥ k) be any matrix (for example, a random sampling matrix S, a random projection matrix Π, or a matrix Z constructed deterministically) such that VkᵀZ has full rank. Then,

    ‖A − PAZ A‖ξ ≤ ‖A − Ak‖ξ + ‖Σk,⊥ (Vk,⊥ᵀZ)(VkᵀZ)⁺‖ξ    (9)

where ‖·‖ξ is any unitarily invariant matrix norm and PAZ denotes the projection onto the column space of AZ.

How this structural condition is used depends on the particular low-rank problem of interest, but it is widely used (either explicitly or implicitly) by low-rank RandNLA algorithms. For example, Equation (9) was introduced in the context of the Column Subset Selection Problem7 and was reproven and used to reparameterize low-rank random projection algorithms in ways that could be more easily implemented.25 It has also been used in ways ranging from developing improved bounds for kernel-based methods in machine learning21 to coupling with a version of the power method to obtain improved numerical implementations41 to improving subspace iteration methods.24

The structural condition in Equation (9) immediately suggests a proof strategy for bounding the error of RandNLA algorithms for low-rank matrix approximation: identify a sketching matrix Z such that VkᵀZ has full rank; and, at the same time, bound the relevant norms of (VkᵀZ)⁺ and Σk,⊥(Vk,⊥ᵀZ). Importantly, in many of the motivating scientific computing applications, the matrices of interest are linear operators that are only implicitly represented but that are structured such that they can be applied to an arbitrary vector quickly. In these cases, FJLT-based or input-sparsity-based projections applied to arbitrary matrices can be replaced with Gaussian-based projections applied to these structured operators with similar computational costs and quality guarantees.
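The structural condition of Equation (9) is what makes the following standard randomized range-finder recipe work (a schematic in the spirit of Halko et al.,25 with a Gaussian test matrix, oversampling, and an optional power iteration; the parameter choices here are illustrative only).

    import numpy as np

    def randomized_low_rank(A, k, oversample=10, power_iters=1, rng=None):
        """Rank-k approximation of A from a sketch Y = A Z, with Z a Gaussian
        test matrix (optionally refined by a few power iterations)."""
        rng = np.random.default_rng() if rng is None else rng
        m, n = A.shape
        Z = rng.standard_normal((n, k + oversample))
        Y = A @ Z
        for _ in range(power_iters):              # optional: sharpen the spectrum
            Y = A @ (A.T @ Y)
        Q, _ = np.linalg.qr(Y)                    # orthogonal basis for the sketch
        B = Q.T @ A                               # small (k + oversample) x n matrix
        Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
        U = Q @ Ub
        return U[:, :k], s[:k], Vt[:k]

    rng = np.random.default_rng(7)
    A = rng.standard_normal((2000, 300)) @ rng.standard_normal((300, 1000))
    U, s, Vt = randomized_low_rank(A, k=50, rng=rng)
    err = np.linalg.norm(A - (U * s) @ Vt, 2)
    best = np.linalg.svd(A, compute_uv=False)[50]   # best possible rank-50 error
    print(err, best)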

Matrix completion. Consider the following problem, which is an idealization of the important recommender systems problem.26 Given an arbitrary m × n matrix A, reconstruct A by sampling a set of O((m + n) poly(1/ε)), as opposed to all mn, entries of A such that the resulting approximation Ã satisfies, either deterministically or up to some failure probability,

    ‖A − Ã‖F ≤ a ‖A − Ak‖F    (10)

Here, a should be small (for example, 2); and the sample size could be increased by (less important) logarithmic factors of m, n, and ε. In addition, one would like to construct the sample and compute Ã after making a small number of passes over A or without even touching all of the entries of A.

A first line of research (already mentioned) on this problem from TCS focuses on element-wise sampling:2 sample entries from a matrix with probabilities that (roughly) depend on their magnitude squared. This can be done in one pass over the matrix, but the resulting additive-error bound is much larger than the requirements of Equation (10), as it scales with the Frobenius norm of A instead of the Frobenius norm of A − Ak.

A second line of research from signal processing and applied mathematics has referred to this as the matrix completion problem.8 In this case, one is interested in computing Ã without even observing all of the entries of A. Clearly, this is not possible without assumptions on A.i Typical assumptions are on the eigenvalues and eigenvectors of A: for example, the input matrix A has rank exactly k, with k ≪ min{m, n}, and also that A satisfies some sort of eigenvector delocalization or incoherence conditions.8 The simplest form of the latter is that the leverage scores of Equation (6) are approximately uniform. Under these assumptions, one can prove that given a uniform sample of O((m + n) k ln(m + n)) entries of A, the solution to the following nuclear norm minimization problem recovers A exactly, with high probability:

    min ‖Ã‖∗  s.t. Ãij = Aij, for all sampled entries Aij,    (11)

where ‖·‖∗ denotes the nuclear (or trace) norm of a matrix (basically, the sum of the singular values of the matrix). That is, if A is exactly low-rank (that is, A = Ak and thus A − Ak is zero) and satisfies an incoherence assumption, then Equation (10) is satisfied, since A = Ak = Ã. Recently, the incoherence assumption has been relaxed, under the assumption that one is given oracle access to A according to a non-uniform sampling distribution that essentially corresponds to element-wise leverage scores.10 However, removing the assumption that A has exact low-rank k, with k ≪ min{m, n}, is still an open problem.j

Informally, keeping only a few rows/columns of a matrix seems more powerful than keeping a comparable number of elements of a matrix. For example, consider an m × n matrix A whose rank is exactly equal to k, with k ≪ min{m, n}: selecting any set of k linearly independent rows allows every row of A to be expressed as a linear combination of the selected rows. The analogous procedure for element-wise sampling seems harder. This is reflected in that state-of-the-art element-wise sampling algorithms use convex optimization and other heavier-duty algorithmic machinery.

i This highlights an important difference in problem parameterization: TCS-style approaches assume worst-case input and must identify nonuniformity structure, while applied mathematics approaches typically assume well-posed problems where the worst nonuniformity structure is not present.
j It should be noted that there exists prior work on matrix completion for low-rank matrices with the addition of well-behaved noise; however, removing the low-rank assumption and achieving error that is relative to some norm of the residual A − Ak is still open.
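As a quick numerical check of the rows-versus-elements observation above: if A has rank exactly k, then any k linearly independent rows R reconstruct A exactly via A = (AR⁺)R. The NumPy snippet below picks the rows at random, which suffices for a generic rank-k matrix.

    import numpy as np

    rng = np.random.default_rng(8)
    m, n, k = 500, 400, 10
    A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))   # rank-k matrix

    idx = rng.choice(m, size=k, replace=False)   # k rows; generically independent
    R = A[idx]                                    # k x n submatrix of rows

    # Every row of A is a linear combination of the selected rows: A = (A R^+) R.
    A_rebuilt = A @ np.linalg.pinv(R) @ R
    print(np.linalg.norm(A - A_rebuilt) / np.linalg.norm(A))   # essentially zero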

Solving systems of Laplacian-based linear equations. Consider the special case of the LS regression problem of Equation (8) when m = n, that is, the well-known problem of solving the system of linear equations Ax = b. For worst-case dense input matrices A this problem can be solved in exactly O(n³) time, for example, using the partial LU decomposition and other methods. However, especially when A is positive semidefinite (PSD), iterative techniques such as the conjugate gradients method are typically preferable, mainly because of their linear dependency on the number of non-zero entries in the matrix A (times a factor depending on the condition number of A).

An important special case is when the PSD matrix A is the Laplacian matrix of an underlying undirected graph G = (V, E), with n = |V| vertices and |E| weighted, undirected edges.5 Variants of this special case are common in unsupervised and semi-supervised machine learning.6 Recall the Laplacian matrix of an undirected graph G is an n × n matrix that is equal to the n × n diagonal matrix D of node weights minus the n × n adjacency matrix of the graph. In this special case, there exist randomized, relative-error algorithms for the problem of Equation (8).5 The running time of these algorithms is

    O(nnz(A) polylog(n)),

where nnz(A) represents the number of non-zero elements of the matrix A, that is, the number of edges in the graph G. The first step of these algorithms corresponds to randomized graph sparsification and keeps a small number of edges from G, thus creating a much sparser Laplacian matrix. This is subsequently used (in a recursive manner) as an efficient preconditioner to approximate the solution of the problem of Equation (8).

While the original algorithms in this line of work were major theoretical breakthroughs, they were not immediately applicable to numerical implementations and data applications. In an effort to bridge the theory-practice gap, subsequent work proposed a much simpler algorithm for the graph sparsification step.45 This subsequent work showed that randomly sampling edges from the graph G (equivalently, rows from the edge-incidence matrix) with probabilities proportional to the effective resistances of the edges provides a sparse Laplacian matrix satisfying the desired properties. (On the negative side, in order to approximate the effective resistances of the edges of G, a call to the original solver was necessary, clearly hindering the applicability of the simpler sparsification algorithm.45) The effective resistances are equivalent to the statistical leverage scores of the weighted edge-incidence matrix of G. Subsequent work has exploited graph theoretic ideas to provide efficient algorithms to approximate them in time proportional to the number of edges in the graph (up to polylogarithmic factors).27 Recent improvements have essentially removed these polylogarithmic factors, leading to useful implementations of Laplacian-based solvers.27 Extending such techniques to handle general PSD input matrices A that are not Laplacian is an open problem.
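A small NumPy illustration of the equivalence used in the sparsification step above: for a weighted graph with edge-incidence matrix B and edge weights w, the statistical leverage score of edge e in the weighted edge-incidence matrix W¹ᐟ²B equals w_e times its effective resistance. The graph, weights, and tolerance below are arbitrary choices made only for this demonstration.

    import numpy as np

    rng = np.random.default_rng(9)
    n_nodes, n_edges = 30, 120

    # Build a random weighted graph; row e of B is e_u - e_v for edge e = (u, v).
    edges = set()
    while len(edges) < n_edges:
        u, v = rng.choice(n_nodes, size=2, replace=False)
        edges.add((int(min(u, v)), int(max(u, v))))
    edges = sorted(edges)
    w = rng.uniform(0.5, 2.0, size=len(edges))

    B = np.zeros((len(edges), n_nodes))
    for row, (u, v) in enumerate(edges):
        B[row, u], B[row, v] = 1.0, -1.0

    L = B.T @ (w[:, None] * B)                 # graph Laplacian L = B^T W B
    L_pinv = np.linalg.pinv(L)

    # Effective resistance of each edge, from the Laplacian pseudoinverse.
    reff = np.array([B[e] @ L_pinv @ B[e] for e in range(len(edges))])

    # Leverage scores of the weighted edge-incidence matrix W^{1/2} B.
    M = np.sqrt(w)[:, None] * B
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    r = int(np.sum(s > 1e-10 * s[0]))          # numerical rank (n_nodes - 1 if connected)
    lev = np.sum(U[:, :r] ** 2, axis=1)

    print(np.allclose(lev, w * reff))          # leverage score = weight x effective resistance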
Statistics and machine learning. RandNLA has been used in statistics and machine learning in several ways, the most common of which is in so-called kernel-based machine learning.21 This involves using a PSD matrix to encode nonlinear relationships between data points; and one obtains different results depending on whether one is interested in approximating a given kernel matrix,21 constructing new kernel matrices of particular forms,39 or obtaining a low-rank basis with which to perform downstream classification, clustering, and other related tasks.29 Alternatively, the analysis used to provide relative-error low-rank matrix approximation for worst-case input can also be used to provide bounds for kernel-based divide-and-conquer algorithms.31 More generally, CX/CUR decompositions provide scalable and interpretable solutions to downstream data analysis problems in genetics, astronomy, and related areas.33,38,53,54 Recent work has focused on statistical aspects of the "algorithmic leveraging" approach that is central to RandNLA algorithms.30

Looking Forward
RandNLA has proven to be a model for truly interdisciplinary research in this era of large-scale data. For example, while TCS, NLA, scientific computing, mathematics, machine learning, statistics, and downstream scientific domains are all interested in these results, each of these areas is interested for very different reasons. Relatedly, while the technical results underlying the development of RandNLA have been nontrivial, some of the largest obstacles to progress in RandNLA have been cultural: TCS being cavalier about polynomial factors, ε factors, and working in overly idealized computational models; NLA being extremely slow to embrace randomization as an algorithmic resource; scientific computing researchers formulating and implementing algorithms that make strong domain-specific assumptions; and machine learning and statistics researchers being more interested in results on hypothesized unseen data rather than the data being input to the algorithm.


In spite of this, RandNLA has already led to improved algorithms for several fundamental matrix problems, but it is important to emphasize that "improved" means different things to different people. For example, TCS is interested in these methods due to the deep connections with Laplacian-based linear equation solvers5,27 and since fast random sampling and random projection algorithms12,14,17,43 represent an improvement in the asymptotic running time of the 200-year-old Gaussian elimination algorithms for least-squares problems on worst-case input. NLA is interested in these methods since they can be used to engineer variants of traditional NLA algorithms that are more robust and/or faster in wall clock time than high-quality software that has been developed over recent decades. (For example, Blendenpik "beats LAPACK's direct dense least-squares solver by a large margin on essentially any dense tall matrix;"4 the randomized approach for low-rank matrix approximation in scientific computing "beats its classical competitors in terms of accuracy, speed, and robustness;"25 and least-squares and least absolute deviations regression problems "can be solved to low, medium, or high precision in existing distributed systems on up to terabyte-sized data."52) Mathematicians are interested in these methods since they have led to new and fruitful fundamental mathematical questions.23,40,42,47 Statisticians and machine learners are interested in these methods due to their connections with kernel-based learning and since the randomness inside the algorithm often implicitly implements a form of regularization on realistic noisy input data.21,29,30 Finally, data analysts are interested in these methods since they provide scalable and interpretable solutions to downstream scientific data analysis problems.33,38,54 Given the central role that matrix problems have historically played in large-scale data analysis, we expect RandNLA methods will continue to make important contributions not only to each of those research areas but also to bridging the gaps between them.

References
1. Achlioptas, D., Karnin, Z., Liberty, E. Near-optimal entrywise sampling for data matrices. In Advances in Neural Information Processing Systems 26: Proceedings of the 2013 Conference, 2013.
2. Achlioptas, D., McSherry, F. Fast computation of low-rank matrix approximations. J. ACM 54, 2 (2007), Article 9.
3. Ailon, N., Chazelle, B. Faster dimension reduction. Commun. ACM 53, 2 (2010), 97–104.
4. Avron, H., Maymounkov, P., Toledo, S. Blendenpik: Supercharging LAPACK's least-squares solver. SIAM J. Sci. Comput. 32 (2010), 1217–1236.
5. Batson, J., Spielman, D.A., Srivastava, N., Teng, S.-H. Spectral sparsification of graphs: Theory and algorithms. Commun. ACM 56, 8 (2013), 87–94.
6. Belkin, M., Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15, 6 (2003), 1373–1396.
7. Boutsidis, C., Mahoney, M.W., Drineas, P. An improved approximation algorithm for the column subset selection problem. In Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms (2009), 968–977.
8. Candes, E.J., Recht, B. Exact matrix completion via convex optimization. Commun. ACM 55, 6 (2012), 111–119.
9. Chatterjee, S., Hadi, A.S. Influential observations, high leverage points, and outliers in linear regression. Stat. Sci. 1, 3 (1986), 379–393.
10. Chen, Y., Bhojanapalli, S., Sanghavi, S., Ward, R. Coherent matrix completion. In Proceedings of the 31st International Conference on Machine Learning (2014), 674–682.
11. Clarkson, K. Subgradient and sampling algorithms for ℓ1 regression. In Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms (2005), 257–266.
12. Clarkson, K.L., Woodruff, D.P. Low rank approximation and regression in input sparsity time. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing (2013), 81–90.
13. Drineas, P., Kannan, R., Mahoney, M.W. Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM J. Comput. 36 (2006), 132–157.
14. Drineas, P., Magdon-Ismail, M., Mahoney, M.W., Woodruff, D.P. Fast approximation of matrix coherence and statistical leverage. J. Mach. Learn. Res. 13 (2012), 3475–3506.
15. Drineas, P., Mahoney, M.W., Muthukrishnan, S. Sampling algorithms for ℓ2 regression and applications. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (2006), 1127–1136.
16. Drineas, P., Mahoney, M.W., Muthukrishnan, S. Relative-error CUR matrix decompositions. SIAM J. Matrix Anal. Appl. 30 (2008), 844–881.
17. Drineas, P., Mahoney, M.W., Muthukrishnan, S., Sarlós, T. Faster least squares approximation. Numer. Math. 117, 2 (2010), 219–249.
18. Drineas, P., Zouzias, A. A note on element-wise matrix sparsification via a matrix-valued Bernstein inequality. Inform. Process. Lett. 111 (2011), 385–389.
19. Frieze, A., Kannan, R., Vempala, S. Fast Monte-Carlo algorithms for finding low-rank approximations. J. ACM 51, 6 (2004), 1025–1041.
20. Füredi, Z., Komlós, J. The eigenvalues of random symmetric matrices. Combinatorica 1, 3 (1981), 233–241.
21. Gittens, A., Mahoney, M.W. Revisiting the Nyström method for improved large-scale machine learning. J. Mach. Learn. Res. In press.
22. Golub, G.H., Van Loan, C.F. Matrix Computations. Johns Hopkins University Press, Baltimore, 1996.
23. Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inform. Theory 57, 3 (2011), 1548–1566.
24. Gu, M. Subspace iteration randomization and singular value problems. Technical report, 2014. Preprint: arXiv:1408.2208.
25. Halko, N., Martinsson, P.-G., Tropp, J.A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53, 2 (2011), 217–288.
26. Koren, Y., Bell, R., Volinsky, C. Matrix factorization techniques for recommender systems. IEEE Computer 42, 8 (2009), 30–37.
27. Koutis, I., Miller, G.L., Peng, R. A fast solver for a class of linear systems. Commun. ACM 55, 10 (2012), 99–107.
28. Kundu, A., Drineas, P. A note on randomized element-wise matrix sparsification. Technical report, 2014. Preprint: arXiv:1404.0320.
29. Le, Q.V., Sarlós, T., Smola, A.J. Fastfood—Approximating kernel expansions in loglinear time. In Proceedings of the 30th International Conference on Machine Learning, 2013.
30. Ma, P., Mahoney, M.W., Yu, B. A statistical perspective on algorithmic leveraging. J. Mach. Learn. Res. 16 (2015), 861–911.
31. Mackey, L., Talwalkar, A., Jordan, M.I. Distributed matrix completion and robust factorization. J. Mach. Learn. Res. 16 (2015), 913–960.
32. Mahoney, M.W. Randomized Algorithms for Matrices and Data. Foundations and Trends in Machine Learning. NOW Publishers, Boston, 2011.
33. Mahoney, M.W., Drineas, P. CUR matrix decompositions for improved data analysis. Proc. Natl. Acad. Sci. USA 106 (2009), 697–702.
34. Meng, X., Mahoney, M.W. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing (2013), 91–100.
35. Meng, X., Saunders, M.A., Mahoney, M.W. LSRN: A parallel iterative solver for strongly over- or under-determined systems. SIAM J. Sci. Comput. 36, 2 (2014), C95–C118.
36. Nelson, J., Huy, N.L. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science (2013), 117–126.
37. Oliveira, R.I. Sums of random Hermitian matrices and an inequality by Rudelson. Electron. Commun. Prob. 15 (2010), 203–212.
38. Paschou, P., Ziv, E., Burchard, E.G., Choudhry, S., Rodriguez-Cintron, W., Mahoney, M.W., Drineas, P. PCA-correlated SNPs for structure identification in worldwide human populations. PLoS Genet. 3 (2007), 1672–1686.
39. Rahimi, A., Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20: Proceedings of the 2007 Conference, 2008.
40. Recht, B. A simpler approach to matrix completion. J. Mach. Learn. Res. 12 (2011), 3413–3430.
41. Rokhlin, V., Szlam, A., Tygert, M. A randomized algorithm for principal component analysis. SIAM J. Matrix Anal. Appl. 31, 3 (2009), 1100–1124.
42. Rudelson, M., Vershynin, R. Sampling from large matrices: An approach through geometric functional analysis. J. ACM 54, 4 (2007), Article 21.
43. Sarlós, T. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (2006), 143–152.
44. Smale, S. Some remarks on the foundations of numerical analysis. SIAM Rev. 32, 2 (1990), 211–220.
45. Spielman, D.A., Srivastava, N. Graph sparsification by effective resistances. SIAM J. Comput. 40, 6 (2011), 1913–1926.
46. Stigler, S.M. The History of Statistics: The Measurement of Uncertainty before 1900. Harvard University Press, Cambridge, 1986.
47. Tropp, J.A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 12, 4 (2012), 389–434.
48. Turing, A.M. Rounding-off errors in matrix processes. Quart. J. Mech. Appl. Math. 1 (1948), 287–308.
49. von Neumann, J., Goldstine, H.H. Numerical inverting of matrices of high order. Bull. Am. Math. Soc. 53 (1947), 1021–1099.
50. Wigner, E.P. Random matrices in physics. SIAM Rev. 9, 1 (1967), 1–23.
51. Woodruff, D.P. Sketching as a Tool for Numerical Linear Algebra. Foundations and Trends in Theoretical Computer Science. NOW Publishers, Boston, 2014.
52. Yang, J., Meng, X., Mahoney, M.W. Implementing randomized matrix algorithms in parallel and distributed environments. Proc. IEEE 104, 1 (2016), 58–92.
53. Yang, J., Rübel, O., Prabhat, Mahoney, M.W., Bowen, B.P. Identifying important ions and positions in mass spectrometry imaging data using CUR matrix decompositions. Anal. Chem. 87, 9 (2015), 4658–4666.
54. Yip, C.-W., Mahoney, M.W., Szalay, A.S., Csabai, I., Budavari, T., Wyse, R.F.G., Dobos, L. Objective identification of informative wavelength regions in galaxy spectra. Astron. J. 147, 110 (2014), 15.

Petros Drineas ([email protected]) is an associate professor in the Department of Computer Science at Rensselaer Polytechnic Institute, Troy, NY.

Michael W. Mahoney ([email protected]) is an associate professor in ICSI and in the Department of Statistics at the University of California at Berkeley.

Copyright held by authors. Publication rights licensed to ACM. $15.00.
