
Determinantal Point Processes in Randomized Numerical Linear Algebra

Michał Dereziński and Michael W. Mahoney

A complete list of references is available in the technical report version of this paper: https://arxiv.org/pdf/2005.03185.pdf.

1. Introduction

Randomized Numerical Linear Algebra (RandNLA) is an area which uses randomness, most notably random sampling and random projection methods, to develop improved algorithms for ubiquitous matrix problems. It began as a niche area in theoretical computer science about fifteen years ago [11], and since then the area has exploded. Matrix problems are central to much of applied mathematics, from traditional scientific computing and partial differential equations to statistics, machine learning, and artificial intelligence. Generalizations and variants of matrix problems are central to many other areas of mathematics, via more general transformations and algebraic structures, nonlinear optimization, infinite-dimensional operators, etc. Much of the work in RandNLA has been propelled by recent developments in machine learning, artificial intelligence, and large-scale data science, and RandNLA both draws upon and contributes back to both pure and applied mathematics.

A seemingly different topic, but one which has a long history in pure and applied mathematics, is that of Determinantal Point Processes (DPPs). A DPP is a stochastic point process, the probability distribution of which is characterized by subdeterminants of some matrix. Such processes were first introduced to model the distribution of fermions at thermal equilibrium [19].

In the context of random matrix theory, DPPs emerged as the eigenvalue distribution for standard random matrix ensembles, and they are of interest in other areas of mathematics such as graph theory, combinatorics, and quantum mechanics [17]. More recently, DPPs have also attracted significant attention within machine learning and statistics as a tractable probabilistic model that is able to capture a balance between quality and diversity within data sets and that admits efficient algorithms for sampling, marginalization, conditioning, etc. [18]. This resulted in practical application of DPPs in experimental design, recommendation systems, stochastic optimization, and more.

Until very recently, DPPs have had little if any presence within RandNLA. However, recent work has uncovered deep connections between these two topics. The purpose of this article is to provide an overview of RandNLA, with an emphasis on discussing and highlighting these connections with DPPs. In particular, we will show how random sampling with a DPP leads to new kinds of unbiased estimators for the classical RandNLA task of least squares regression, enabling a more refined statistical and inferential understanding of RandNLA algorithms. We will also demonstrate that a DPP is, in some sense, an optimal randomized method for low-rank approximation, another ubiquitous matrix problem. Finally, we also discuss how a standard RandNLA technique, called leverage score sampling, can be derived as the marginal distribution of a DPP, as well as the algorithmic consequences this has for efficient DPP sampling.

We start (in Section 2) with a brief review of a prototypical RandNLA algorithm, focusing on the ubiquitous least squares problem and highlighting key aspects that will put in context the recent work we will review. In particular, we discuss the trade-offs between standard sampling methods from RandNLA, including uniform sampling, norm-squared sampling, and leverage score sampling. Next (in Section 3), we introduce the family of DPPs, highlighting some important subclasses and the basic properties that make them appealing for RandNLA. Then (in Section 4), we describe the fundamental connections between certain classes of DPPs and the classical RandNLA tasks of least squares regression and low-rank approximation, as well as the relationship between DPPs and the RandNLA method of leverage score sampling. Finally (in Section 5), we discuss the algorithmic aspects of both leverage scores and DPPs. We conclude (in Section 6) by briefly mentioning several other connections between DPPs and RandNLA, as well as a recently introduced class of random matrices, called determinant preserving, which has proven useful in this line of research.

2. RandNLA: Randomized Numerical Linear Algebra

Much of the early work in numerical linear algebra (NLA) focused on deterministic algorithms. However, Monte Carlo sampling approaches demonstrated that randomness inside the algorithm is a powerful computational resource which can lead to both greater efficiency and robustness to worst-case data [12, 20]. The success of RandNLA methods has been proven in many domains, e.g., when randomized least squares solvers such as Blendenpik or LSRN have outperformed the established high performance computing software LAPACK or other methods in parallel/distributed environments, respectively, or when RandNLA methods have been used in conjunction with traditional scientific computing solvers for low-rank approximation problems.

In a typical RandNLA setting, we are given a large dataset in the form of a real-valued matrix, say 퐗 ∈ ℝ^{푛×푑}, and our goal is to compute quantities of interest quickly. To do so, we efficiently downsize the matrix using a randomized algorithm, while approximately preserving its inherent structure, as measured by some objective. In doing so, we obtain a new matrix 퐗˜ (often called a sketch of 퐗) which is either smaller or sparser than the original matrix. This procedure can often be represented by matrix multiplication, i.e., 퐗˜ = 퐒퐗, where 퐒 is called the sketching matrix. Many applications of RandNLA follow the sketch-and-solve paradigm: Instead of performing a costly operation on 퐗, we first construct 퐗˜ (the sketch); we then perform the expensive operation (more cheaply) on the smaller 퐗˜ (the solve); and we use the solution from 퐗˜ as a proxy for the solution we would have obtained from 퐗. Here, cost often means computational time, but it can also be communication or storage space or even human work.

Many different approaches have been established for randomly downsizing data matrices 퐗 (see [12, 20] for a detailed survey). While some methods randomly zero out most of the entries of the matrix, most randomly keep only a small random subset of rows and/or columns. In either case, however, the choice of randomness is crucial in preserving the structure of the data. For example, if the data matrix 퐗 contains a few dominant entries/rows/columns (e.g., as measured by their absolute value or norm or some other “importance” score), then we should make sure that our sketch is likely to retain the information they carry. This leads to data-dependent sampling distributions that will be the focus of our discussion. However, data-oblivious sketching techniques, where the random sketching matrix 퐒 is independent of 퐗 (typically called a “random rotation” or a “random projection,” even if it is not precisely a rotation or projection in the linear algebraic sense), have also proven very successful [16, 24].

Among the most common examples of such random transformations are i.i.d. Gaussian matrices, fast Johnson-Lindenstrauss transforms and count sketches, all of which provide different trade-offs between efficiency and accuracy. These “data-oblivious random projections” can be interpreted either in terms of the Johnson-Lindenstrauss lemma or as a preconditioner for the “data aware random sampling” methods we discuss.

Most RandNLA techniques can be divided into one of two settings, depending on the dimensionality or aspect ratio of 퐗, and on the desired size of the sketch (see Figure 1):
1. Rank-preserving sketch. When 퐗 is a tall full-rank matrix (i.e., 푛 ≫ 푑), then we can reduce the larger dimension while preserving the rank.
2. Low-rank approximation. When 퐗 has comparably large dimensions (i.e., 푛 ∼ 푑), then the sketch typically has a much lower rank than 퐗.

Figure 1. Two RandNLA settings which are differentiated by whether the sketch (matrix 퐗˜) aims to preserve the rank of 퐗 (left panel: rank-preserving sketch, e.g., Projection DPP) or obtain a low-rank approximation (right panel: low-rank approximation, e.g., L-ensemble DPP). In Section 4 we associate each setting with a different DPP.

The classical application of rank-preserving sketches is least squares regression, where, given matrix 퐗 ∈ ℝ^{푛×푑} and vector 퐲 ∈ ℝ^푛, we wish to find:

퐰∗ = argmin_퐰 ℒ(퐰)  for  ℒ(퐰) = ‖퐗퐰 − 퐲‖2.

The least squares solution can be computed exactly, using the Moore-Penrose inverse, 퐰∗ = 퐗†퐲. In a traditional RandNLA setup, in order to avoid solving the full problem, our goal is to use the sketch-and-solve paradigm to obtain an (휖, 훿)-approximation of 퐰∗, i.e., ˆ퐰 such that:

ℒ(ˆ퐰) ≤ (1 + 휖) ℒ(퐰∗)  with probability 1 − 훿.  (1)

Imposing statistical modeling assumptions on the vector 퐲 leads to different objectives, such as the mean squared error (MSE):

MSE[ˆ퐰] = 피 ‖ˆ퐰 − 휷‖2,  given 퐲 = 퐗휷 + 흃,

where 흃 is a noise vector with a known distribution, and 휷 is fixed but unknown. There has been work on statistical aspects of RandNLA methods, and these statistical objectives pose different challenges than the standard RandNLA guarantees (some of which can be addressed by DPPs; see Section 4).

To illustrate the types of guarantees achieved by RandNLA methods on the least squares task, we will focus on row sampling, i.e., sketches consisting of a small random subset of the rows of 퐗, in the case that 푛 ≫ 푑. Concretely, the considered meta-strategy is to draw random i.i.d. row indices 푗1, ..., 푗푘 from {1, ..., 푛}, with each index distributed according to (푝1, ..., 푝푛), and then solve the subproblem formed from those indices:

ˆ퐰 = argmin_퐰 ‖퐗˜퐰 − 퐲˜‖2 = 퐗˜†퐲˜,  (2)

where 퐱˜푖⊤ = (1/√(푘푝_{푗푖})) 퐱_{푗푖}⊤ and 푦˜푖 = (1/√(푘푝_{푗푖})) 푦_{푗푖} for 푖 = 1, ..., 푘 denote the 푖th row of 퐗˜ and entry of 퐲˜, respectively. The rescaling is introduced to account for the biases caused by nonuniform sampling. We consider the following standard sampling distributions:
1. Uniform: 푝푖 = 1/푛 for all 푖.
2. Squared norms: 푝푖 = ‖퐱푖‖2/‖퐗‖_F^2, where ‖⋅‖_F is the Frobenius (Hilbert-Schmidt) norm.
3. Leverage scores: 푝푖 = 푙푖/푑 for 푙푖 = ‖퐱푖‖2_{(퐗⊤퐗)−1} (the 푖th leverage score of 퐗) and ‖퐯‖_퐀 = √(퐯⊤퐀퐯).

Both squared norm and leverage score sampling are standard RandNLA techniques used in a variety of applications [11, 14]. The following theorem (which, for convenience, we state with a failure probability of 훿 = 0.1) puts together the results that allow us to compare each row sampling distribution in the context of least squares. Below, we use 퐶 to denote an absolute positive constant.

Theorem 1. Estimator ˆ퐰 constructed as in (2) is an (휖, 0.1)-approximation, as in (1), if:
1. 푘 ≥ 퐶(휇푑 log 푑 + 휇푑/휖) for Uniform, where 휇 = (푛/푑) max_푖 푙푖 ≥ 1 is the matrix coherence of 퐗.
2. 푘 ≥ 퐶(휅푑 log 푑 + 휅푑/휖) for Squared norms, where 휅 ≥ 1 is the condition number of 퐗⊤퐗.
3. 푘 ≥ 퐶(푑 log 푑 + 푑/휖) for Leverage scores.

Recall that (when considering least squares) we typically assume that 푛 ≫ 푑, so any of the three sample sizes 푘 may be much smaller than 푛. Thus, each sampling method offers a potentially useful guarantee for the number of rows needed to achieve a (1 + 휖)-approximation. However, in the case of both uniform and squared norm sampling, the sample size depends not only on the dimension 푑, but also on other data-dependent quantities. For uniform sampling, that quantity is the matrix coherence 휇, which measures a degree of nonuniformity among the data points, with respect to the canonical axes. For squared norm sampling, that quantity is the condition number 휅 (ratio between the largest and the smallest eigenvalue) of the 푑 × 푑 data covariance 퐗⊤퐗. Leverage score sampling avoids both of these dependencies.
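As an illustration of the sketch-and-solve meta-strategy in (2), the following Python sketch (using NumPy) draws 푘 i.i.d. row indices from a chosen distribution, rescales the selected rows by 1/√(푘푝_{푗푖}), and solves the small subproblem. The matrix sizes, sample size, and synthetic data are illustrative assumptions, not taken from the text.

```python
import numpy as np

def row_sampling_least_squares(X, y, k, p, rng):
    """Sketch-and-solve least squares via i.i.d. row sampling, as in (2)."""
    n, d = X.shape
    idx = rng.choice(n, size=k, replace=True, p=p)
    scale = 1.0 / np.sqrt(k * p[idx])           # rescaling to correct the sampling bias
    X_tilde = scale[:, None] * X[idx]           # sketch of X
    y_tilde = scale * y[idx]                    # sketch of y
    w_hat, *_ = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)
    return w_hat

rng = np.random.default_rng(0)
n, d = 10_000, 20
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# The three standard distributions: uniform, squared norms, leverage scores.
p_unif = np.full(n, 1.0 / n)
p_norm = np.sum(X**2, axis=1) / np.sum(X**2)
Q, _ = np.linalg.qr(X)                 # squared row norms of Q are the leverage scores
p_lev = np.sum(Q**2, axis=1) / d

w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
for name, p in [("uniform", p_unif), ("sq. norms", p_norm), ("leverage", p_lev)]:
    w_hat = row_sampling_least_squares(X, y, k=500, p=p, rng=rng)
    ratio = np.sum((X @ w_hat - y)**2) / np.sum((X @ w_star - y)**2)
    print(f"{name:>9}: loss ratio = {ratio:.3f}")
```

For well-conditioned Gaussian data all three distributions behave similarly; the differences captured by Theorem 1 appear for coherent or ill-conditioned inputs.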

We now briefly discuss a key structural property, called the subspace embedding, which is needed to show the guarantees of Theorem 1. This important property—first introduced into RandNLA for data-aware random sampling by [14] and then for data-oblivious random projection by [23]—is ubiquitous in the analysis of many RandNLA techniques. Remarkably, most DPP results do not rely on subspace embedding techniques, which is an important differentiating factor for this class of sampling distributions.

Definition 1. A sketching matrix 퐒 is a (1 ± 휖) subspace embedding for the column space of 퐗 if:

(1 − 휖)‖퐗퐯‖2 ≤ ‖퐒퐗퐯‖2 ≤ (1 + 휖)‖퐗퐯‖2  for all 퐯 ∈ ℝ^푑.

The matrix 퐗˜ used in (2) for constructing ˆ퐰 can be written as 퐗˜ = 퐒퐗, by letting the 푖th row of 퐒 be the scaled standard basis vector (1/√(푘푝_{푗푖})) 퐞_{푗푖}⊤. Relying on established measure concentration results for random matrices, we can show that 퐒 is a subspace embedding (up to some failure probability) for each of the i.i.d. sampling methods from Theorem 1. However, only leverage score sampling (or random projections, which precondition the input to have approximately uniform leverage scores) achieves this for 푂(푑 log 푑) samples, independent of any input-specific quantities such as the coherence or condition number.

The least squares task formulated in (1), as well as the subspace embedding condition, require using rank-preserving sketches. However, these ideas can be naturally extended to the task of low-rank approximation. In particular, as discussed in more detail in Section 4.3, low-rank approximation can be reformulated as a set of least squares problems. Similarly, leverage score sampling has been extended to adapt to the low-rank setting. In Section 4.4, we discuss one of these extensions, called ridge leverage scores, and its connections to DPPs.

In the next sections, we show how non-i.i.d. sampling via DPPs goes beyond the standard RandNLA analysis. Among other things, this will allow us to obtain approximation guarantees with fewer than 푑 log 푑 samples and without a failure probability.

3. DPPs: Determinantal Point Processes

In this section, we define DPPs and related families of distributions (see Figure 2 for a diagram), including some basic properties and intuitions. A detailed introduction to DPPs can be found in [18]. We focus on sampling over a discrete domain [푛] = {1, ..., 푛} (continuous domains are discussed in [17]).

Definition 2 (Determinantal Point Process). Let 퐊 be an 푛 × 푛 positive semidefinite (p.s.d.) matrix with operator norm ‖퐊‖ ≤ 1. Point process 푆 ⊆ [푛] is drawn according to DPP(퐊), denoted as 푆 ∼ DPP(퐊), if for any 푇 ⊆ [푛],

Pr{푇 ⊆ 푆} = det(퐊_{푇,푇}).

Here, 퐊_{푇,푇} denotes the |푇| × |푇| submatrix indexed by the set 푇. Matrix 퐊 is called the marginal kernel of 푆 (it can be shown that any 퐊 as in Definition 2 defines a DPP). If 퐊 is diagonal, then DPP(퐊) corresponds to a series of 푛 independent biased coin flips deciding whether to include each index 푖 into the set 푆. A more interesting distribution is obtained for a general 퐊, in which case the inclusion events are no longer independent. Some of the key properties that make DPPs useful as a mathematical framework are:
1. Negative correlation: if 푖 ≠ 푗 and 퐊푖푗 ≠ 0, then Pr(푖 ∈ 푆 ∣ 푗 ∈ 푆) < Pr(푖 ∈ 푆).
2. Cardinality: while the size |푆| is in general random, its expectation equals tr(퐊) and the variance also has a simple expression.
3. Restriction: for 푅 ⊆ [푛], the set 푆˜ = 푆 ∩ 푅 is distributed as DPP(퐊_{푅,푅}) (after relabeling).
4. Complement: the complement set 푆˜ = [푛]\푆 is distributed as DPP(퐈 − 퐊).

In the context of linear algebra, a slightly more restrictive definition of DPPs has proven useful.

Definition 3 (L-ensemble). Let 퐋 be an 푛 × 푛 p.s.d. matrix. Point process 푆 ⊆ [푛] is drawn according to DPP_L(퐋) and called an L-ensemble if

Pr{푆} = det(퐋_{푆,푆}) / det(퐈 + 퐋).

It can be shown that any L-ensemble is a DPP by setting 퐊 = 퐋(퐈 + 퐋)−1, but not vice versa. (However, there are extensions of the L-ensemble parameterization which cover all DPPs.) Unlike Definition 2, this definition explicitly gives the probabilities of individual sets. These probabilities sum to one as a consequence of a determinantal identity (see Theorem 2.1 in [18], which is a special case of the classical formula for the Fredholm determinant). Furthermore, the L-ensemble formulation provides a natural geometric interpretation in the context of row sampling for RandNLA. Suppose that we let 퐋 = 퐗퐗⊤ for some 푛 × 푑 matrix 퐗. Then, the probability of sampling subset 푆 according to DPP_L(퐋) satisfies:

Pr{푆} ∝ Vol^2_{|푆|}({퐱푖 ∶ 푖 ∈ 푆}).

Namely, this sampling probability is proportional to the squared |푆|-dimensional volume of the parallelepiped spanned by the rows of 퐗 indexed by 푆. This immediately implies that the size of 푆 will never exceed the rank of 퐗 (which is bounded by 푑). Furthermore, such a distribution ensures that the set of sampled rows will be nondegenerate: no row can be obtained as a linear combination of the others.
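As a sanity check on Definitions 2 and 3, the following brute-force snippet (a small synthetic example; the sizes are illustrative assumptions) enumerates every subset 푆 ⊆ [푛] for 퐋 = 퐗퐗⊤, verifying that the probabilities det(퐋_{푆,푆})/det(퐈 + 퐋) sum to one, that the marginals Pr{푖 ∈ 푆} match the diagonal of 퐊 = 퐋(퐈 + 퐋)−1, and that Pr{푇 ⊆ 푆} = det(퐊_{푇,푇}) for a test set 푇.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 3
X = rng.standard_normal((n, d))
L = X @ X.T                                    # L-ensemble kernel (Definition 3)
K = L @ np.linalg.inv(np.eye(n) + L)           # marginal kernel K = L(I + L)^{-1}

norm_const = np.linalg.det(np.eye(n) + L)
total, marginals, pr_T = 0.0, np.zeros(n), 0.0
T = [0, 2]                                     # an arbitrary test subset for Definition 2
for r in range(n + 1):
    for S in itertools.combinations(range(n), r):
        S = list(S)
        # Pr{S} = det(L_{S,S}) / det(I + L); the empty set gets probability 1 / det(I + L).
        # Since L_{S,S} = X_S X_S^T, Pr{S} is also proportional to the squared volume of X_S.
        pr = (np.linalg.det(L[np.ix_(S, S)]) if S else 1.0) / norm_const
        total += pr
        marginals[S] += pr
        if set(T) <= set(S):
            pr_T += pr

print("sum of all Pr{S}:", round(total, 10))                               # 1.0
print("max |Pr{i in S} - K_ii|:", np.max(np.abs(marginals - np.diag(K))))  # ~1e-15
print("Pr{T in S} - det(K_{T,T}):", pr_T - np.linalg.det(K[np.ix_(T, T)]))
```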

Intuitively, this nondegeneracy property is desirable for RandNLA sampling as it avoids redundancies. Also, all else being equal, rows with larger norms are generally preferred as they contribute more to the volume.

While the subset size of a DPP is in most cases a random variable, it is easy to constrain the cardinality to some fixed value 푘. The resulting distribution is not a DPP, in the sense of Definition 2, but it retains many useful properties of proper DPPs.

Definition 4 (Cardinality constrained DPP). We will use 푘-DPP_L(퐋) to denote a distribution obtained by constraining DPP_L(퐋) to only subsets of size |푆| = 푘.

While a 푘-DPP is not, in general, a DPP in the sense of Definition 2, both families belong to a broader class of negatively correlated distributions called Strongly Rayleigh (SR) measures. See Figure 2.

Figure 2. A diagram illustrating different classes of determinantal distributions within a broader class of Strongly Rayleigh (SR) measures: DPPs (Definition 2), L-ensembles (Definition 3), 푘-DPPs (Definition 4), and Projection DPPs (Remark 1).

At the intersection of DPPs and 푘-DPPs lies a family of distributions called Projection DPPs. This family is of particular importance to RandNLA.

Remark 1 (Projection DPP). Point process 푆 ∼ 푘-DPP_L(퐋) satisfies Definition 2 iff 푘 = rank(퐋), in which case we call it a Projection DPP since its marginal kernel 퐊 = 퐋퐋† is an orthogonal projection (recall that (⋅)† denotes the Moore-Penrose inverse).

We chose to introduce Projection DPPs via the connection to L-ensembles to highlight once again the geometric interpretation. In this viewpoint, letting 퐋 = 퐗퐗⊤ for an 푛 × 푑 matrix 퐗 with full column rank, a Projection DPP associated with the L-ensemble 퐋 has marginal kernel 퐊 = 퐗퐗†, which is a 푑-dimensional projection onto the span of the columns of 퐗. Furthermore, the probability of a row subset 퐗_푆 under 푆 ∼ 푑-DPP_L(퐗퐗⊤) is proportional to the squared 푑-dimensional volume spanned by it, i.e., det(퐗_푆)^2. Here, the normalization constant is obtained via the classical Cauchy-Binet formula:

∑_{푆∶|푆|=푑} det(퐗_푆)^2 = det(퐗⊤퐗).

The form of the probability implies that the rows {퐱푖 ∶ 푖 ∈ 푆} sampled from a Projection DPP will with probability 1 capture all directions of the ambient space that are present in the matrix 퐗, which is crucial for rank-preserving RandNLA sketches.

4. DPPs in RandNLA

In this section, we demonstrate the fundamental connections between DPPs and standard RandNLA tasks, as well as the new kinds of RandNLA guarantees that can be achieved via these connections. Our discussion focuses on two types of DPP-based sketches (that were illustrated in Figure 1):
1. Projection DPPs as a rank-preserving sketch;
2. L-ensemble DPPs as a low-rank approximation.

These sketches can be efficiently constructed using DPP sampling algorithms which we discuss later (in Section 5). We also discuss the close relationship between DPPs and the RandNLA method of leverage score sampling, shedding light on why these two different randomized techniques have proven effective in RandNLA. For the summary, see Table 1.

4.1. Unbiased estimators. We now define the least squares estimators that naturally arise from row sampling with Projection DPPs and L-ensembles. The definitions are motivated by the fact that the estimators are unbiased, relative to the solutions of the full least squares problems. Importantly, this property is not shared by i.i.d. row sampling methods used in RandNLA.

We start with the rank-preserving setting, i.e., given a tall full-rank 푛 × 푑 matrix 퐗 and a vector 퐲 ∈ ℝ^푛, where 푛 ≫ 푑, we wish to approximate the least squares solution 퐰∗ = argmin_퐰 ‖퐗퐰 − 퐲‖2. To capture all of the directions present in the data and to obtain a meaningful estimate of 퐰∗, we must sample at least 푑 rows from 퐗. We achieve this by sampling from a Projection DPP defined as 푆 ∼ 푑-DPP_L(퐗퐗⊤) (see Remark 1). The linear system (퐗_푆, 퐲_푆), corresponding to the rows of 퐗 indexed by 푆, has exactly one solution because sets selected by a Projection DPP are always rank-preserving: ˆ퐰 = 퐗_푆^{−1}퐲_푆. Moreover, the obtained random vector is an unbiased estimator of the least squares solution 퐰∗ [9].

Theorem 2. If 푆 ∼ 푑-DPP_L(퐗퐗⊤), then

피 퐗_푆^{−1}퐲_푆 = argmin_퐰 ‖퐗퐰 − 퐲‖2 = 퐰∗.

This seemingly simple identity relies on the negative correlations between the samples in a DPP, and thus cannot hold for any i.i.d. row sampling method. It is perhaps no coincidence that the marginal kernel of this distribution, i.e., 퐊 = 퐗퐗† = 퐗(퐗⊤퐗)−1퐗⊤, coincides with the hat matrix (as it is known in statistics) of the ordinary least squares estimator.
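The unbiasedness in Theorem 2 can be checked directly on a small example by enumerating all size-푑 subsets and weighting each solution 퐗_푆^{−1}퐲_푆 by det(퐗_푆)^2/det(퐗⊤퐗), the probability dictated by the Cauchy-Binet normalization. The problem sizes below are illustrative assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, d = 8, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_star, *_ = np.linalg.lstsq(X, y, rcond=None)   # full least squares solution

# E[X_S^{-1} y_S] over S ~ d-DPP_L(X X^T), i.e., Pr{S} = det(X_S)^2 / det(X^T X).
norm_const = np.linalg.det(X.T @ X)              # Cauchy-Binet: sum of det(X_S)^2
expectation = np.zeros(d)
for S in itertools.combinations(range(n), d):
    X_S, y_S = X[list(S)], y[list(S)]
    pr = np.linalg.det(X_S) ** 2 / norm_const
    expectation += pr * np.linalg.solve(X_S, y_S)

print("max |E[w_hat] - w_star|:", np.max(np.abs(expectation - w_star)))   # ~1e-15
```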

                            | Rank-preserving sketch                  | Low-rank approximation
                            | Projection DPP 푆 ∼ 푑-DPP_L(퐗퐗⊤)       | L-ensemble 푆 ∼ DPP_L((1/휆)퐗퐗⊤)
 subset size 피|푆| =        | dimension 푑                             | effective dim. tr(퐗(퐗⊤퐗 + 휆퐈)−1퐗⊤)
 marginal Pr{푖 ∈ 푆} =      | leverage score 퐱푖⊤(퐗⊤퐗)−1퐱푖          | ridge lev. score 퐱푖⊤(퐗⊤퐗 + 휆퐈)−1퐱푖
 expectation 피 퐗_푆^†퐲_푆 = | least squares argmin_퐰 ‖퐗퐰 − 퐲‖2      | ridge regression argmin_퐰 ‖퐗퐰 − 퐲‖2 + 휆‖퐰‖2

Table 1. Key properties of the DPPs discussed in Section 4, as they relate to: RandNLA tasks of least squares and ridge regression; RandNLA methods of leverage score sampling and ridge leverage score sampling.

Theorem 2 has an analogue in the context of low-rank approximation, where both the dimensions of 퐗 are comparably large (i.e., 푛 ∼ 푑), and so the desired sample size is typically much smaller than 푑. When the selected subproblem (퐗_푆, 퐲_푆) has fewer than 푑 rows (i.e., it is under-determined) then it has multiple exact solutions. A standard way to address this is picking the solution with smallest Euclidean norm, defined via the Moore-Penrose inverse: ˆ퐰 = 퐗_푆^†퐲_푆. To sample the under-determined subproblem, we use a scaled L-ensemble DPP with the expected sample size controlled by a parameter 휆 > 0 [8].

Theorem 3. If 푆 ∼ DPP_L((1/휆)퐗퐗⊤), then

피 퐗_푆^†퐲_푆 = argmin_퐰 ‖퐗퐰 − 퐲‖2 + 휆‖퐰‖2.

Thus, the minimum norm solution of the under-determined subproblem is an unbiased estimator of the Tikhonov-regularized least squares problem, i.e., ridge regression. Ridge regression is a natural extension of the standard least squares task, particularly useful when 푛 ∼ 푑 or 푛 ≪ 푑.

Theorem 3 illustrates the implicit regularization effect that occurs when choosing one out of many exact solutions to a subsampled least squares task (see also Section 6). Increasing the regularization 휆‖퐰‖2 in ridge regression is interpreted as using fewer degrees of freedom, which aligns with the effect that 휆 has on the distribution 푆 ∼ DPP_L((1/휆)퐗퐗⊤): larger 휆 means that smaller subsets 푆 are more likely. In fact, since the marginal kernel of the L-ensemble coincides with the hat matrix of the ridge estimator, the expected subset size (trace of the marginal kernel) captures the notion of effective dimension (a.k.a. effective degrees of freedom) in the same way as it is commonly done for ridge regression in statistics [2]:

푑_휆 ∶= tr(퐗(퐗⊤퐗 + 휆퐈)−1퐗⊤) = 피 |푆|.  (3)
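Theorem 3 and the effective dimension identity (3) can both be verified by brute force: enumerate all subsets 푆 with weights det(퐋_{푆,푆})/det(퐈 + 퐋) for 퐋 = (1/휆)퐗퐗⊤, then compare 피 퐗_푆^†퐲_푆 with the ridge regression solution and 피|푆| with 푑_휆. The sizes and the value of 휆 below are illustrative assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 7, 3, 2.0
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

L = (X @ X.T) / lam
norm_const = np.linalg.det(np.eye(n) + L)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
d_lam = np.trace(X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T))   # formula (3)

expectation, expected_size = np.zeros(d), 0.0
for r in range(n + 1):
    for S in itertools.combinations(range(n), r):
        S = list(S)
        pr = (np.linalg.det(L[np.ix_(S, S)]) if S else 1.0) / norm_const
        w_S = np.linalg.pinv(X[S]) @ y[S] if S else np.zeros(d)   # minimum-norm solution
        expectation += pr * w_S
        expected_size += pr * len(S)

print("max |E[w_hat] - w_ridge|:", np.max(np.abs(expectation - w_ridge)))  # ~1e-15
print("E|S| =", round(expected_size, 6), " d_lambda =", round(d_lam, 6))
```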
4.2. Exact error analysis. The error analysis for DPP sampling differs significantly from the standard RandNLA techniques discussed in Section 2. In particular, approximation guarantees are formulated in terms of the expected error, without relying on measure concentration results. This means that we avoid failure probabilities such as the one present in Theorem 1, and the analysis is often much more precise, sometimes even exact. Furthermore, because of the non-i.i.d. nature of DPPs, these guarantees can be achieved with smaller sample sizes than for RandNLA sampling methods. Of course, this comes with computational trade-offs, which we discuss in Section 5.

We illustrate these differences in the context of rank-preserving sketches for least squares regression. Consider the estimator ˆ퐰 = 퐗_푆^{−1}퐲_푆 from Theorem 2, where the subset 푆 is sampled via the Projection DPP, i.e., 푆 ∼ 푑-DPP_L(퐗퐗⊤). Recall that the sample size here is only 푑, which is less than the 푑 log 푑 needed by i.i.d. sampling methods such as leverage scores (or random projection methods). Nevertheless, this estimator achieves an approximation guarantee in terms of the loss ℒ(퐰) = ‖퐗퐰 − 퐲‖2. Moreover, under minimal assumptions, the expected loss is given by a closed form expression [9].

Theorem 4. Assume that the rows of 퐗 are in general position, i.e., every set of 푑 rows is nondegenerate. If 푆 ∼ 푑-DPP_L(퐗퐗⊤), then

피 ℒ(퐗_푆^{−1}퐲_푆) = (푑 + 1) ℒ(퐰∗),

and the factor 푑 + 1 is worst-case optimal.

This exact error analysis is particularly useful in statistical modeling, where under additional assumptions about the vector 퐲, we wish to estimate accurately the generalization error of our estimator. Specifically, consider the following noisy linear model of the vector 퐲:

퐲 = 퐗휷 + 흃,  where 흃 ∼ 풩(ퟎ, 휎2퐈).  (4)

Here, 휷 is a fixed vector that we wish to recover, and the mean squared error in this context is defined as 피 ‖ˆ퐰 − 휷‖2, where the expectation is taken over both the sampling and the noise. The full least squares solution 퐰∗ is an unbiased estimator of 휷, with error given by 피 ‖퐰∗ − 휷‖2 = 휎2 tr((퐗⊤퐗)−1). The Projection DPP estimator, which observes only 푑 noisy measurements from 퐲, is also unbiased in this model, and its error scales linearly with that of 퐰∗ [9].

Theorem 5. Assume that the rows of 퐗 are in general position, i.e., every set of 푑 rows is nondegenerate, and consider 퐲 as in (4). If 푆 ∼ 푑-DPP_L(퐗퐗⊤), then

피 ‖퐗_푆^{−1}퐲_푆 − 휷‖2 = (푛 − 푑 + 1) 피 ‖퐰∗ − 휷‖2.

A number of extensions to Theorems 4 and 5 have been proposed, covering larger sample sizes [10] as well as different statistical models [8, 9].

4.3. Optimal approximation guarantees. As we have seen above, the non-i.i.d. nature of DPP sampling can lead to improved approximation guarantees, compared to standard RandNLA methods, when we wish to minimize the size of the downsampled problem. We next discuss this in the low-rank approximation setting, i.e., when 푛 ∼ 푑. Here, cardinality constrained L-ensembles are known to achieve optimal (1 + 휖)-approximation guarantees.

In Section 4.1, we used low-rank sketches to construct unbiased estimators for regularized least squares, given matrix 퐗 and a vector 퐲. However, even without introducing 퐲, a natural low-rank approximation objective for sketching 퐗 can be defined via a reduction to least squares. Namely, we can measure the error in reconstructing the 푖th row of 퐗 by finding the best fit among all linear combinations of the rows of the sketch 퐗_푆. Repeating this over all rows of 퐗, we get:

Er(푆) ∶= ∑_{푖=1}^푛 min_퐰 ‖퐗_푆⊤퐰 − 퐱푖‖2 = ‖퐗퐗_푆^†퐗_푆 − 퐗‖_F^2,

where each term of the sum is a least squares problem. Note that 퐗_푆^†퐗_푆 is the projection onto the span of {퐱푖 ∶ 푖 ∈ 푆}. If the size of 푆 is equal to some target rank 푟, then Er(푆) is at least as large as the error of the best rank 푟 approximation of 퐗, denoted 퐗_(푟) (obtained by projecting onto the top 푟 right-singular vectors of 퐗). However, [15] showed that using a cardinality constrained L-ensemble with 푘 = 푟 + 푟/휖 − 1 rows suffices to get within a 1 + 휖 factor of the best rank 푟 approximation error.

Theorem 6. If 푆 ∼ 푘-DPP_L(퐗퐗⊤), where the sample size satisfies 푘 ≥ 푟 + 푟/휖 − 1, then

피 Er(푆) ≤ (1 + 휖) ‖퐗_(푟) − 퐗‖_F^2,

and the size 푟 + 푟/휖 − 1 is worst-case optimal.

The task of finding the subset 푆 that minimizes Er(푆) is sometimes known as the Column Subset Selection Problem (with 퐗 replaced by 퐗⊤). Similar 1 + 휖 guarantees are achievable with RandNLA sampling techniques such as leverage scores. However, those require sample sizes 푘 of at least 푟 log 푟, they contain a failure probability, and they suffer from additional constant factors due to less exact analysis.

Nyström method. The task of low-rank approximation is often formulated in the context of symmetric positive semidefinite (p.s.d.) matrices. Let 퐋 be an 푛 × 푛 p.s.d. matrix. We briefly discuss how DPPs can be applied in this setting via the Nyström method, which constructs a rank 푘 approximation of 퐋 by using the eigendecomposition of a small 푘 × 푘 submatrix 퐋_{푆,푆} for some index subset 푆.

Definition 5. We define the Nyström approximation of 퐋 based on a subset 푆 as the 푛 × 푛 matrix 퐋˜(푆) = 퐋_{[푛],푆} 퐋_{푆,푆}^† 퐋_{푆,[푛]}.

Originally developed in the context of obtaining numerical solutions to integral equations, this method has found applications in a number of areas such as machine learning, Gaussian Process regression, and Independent Component Analysis. Theorem 6 can be adapted to the setting of Nyström approximation, providing the optimal sample size with respect to the nuclear norm approximation error. We use ‖퐀‖∗ = tr((퐀⊤퐀)^{1/2}) to denote the nuclear norm and 퐋_(푟) as the best rank 푟 approximation.

Theorem 7. If 푆 ∼ 푘-DPP_L(퐋), where the sample size satisfies 푘 ≥ 푟 + 푟/휖 − 1, then

피 ‖퐋 − 퐋˜(푆)‖∗ ≤ (1 + 휖) ‖퐋 − 퐋_(푟)‖∗,

and the size 푟 + 푟/휖 − 1 is worst-case optimal.
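A minimal NumPy sketch of the Nyström approximation from Definition 5. For simplicity the subset 푆 here is drawn uniformly at random rather than from a 푘-DPP (so the guarantee of Theorem 7 is not implied); the kernel and sizes are illustrative assumptions.

```python
import numpy as np

def nystrom(L, S):
    """Nystrom approximation L~(S) = L[:,S] pinv(L[S,S]) L[S,:] from Definition 5."""
    C = L[:, S]
    W = L[np.ix_(S, S)]
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(5)
n, r = 200, 5
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
L = A @ A.T + 1e-3 * np.eye(n)           # p.s.d. kernel, approximately of rank r

nuclear = lambda M: np.sum(np.linalg.svd(M, compute_uv=False))
# Best rank-r nuclear-norm error = sum of the n - r smallest eigenvalues of L.
best_rank_r_err = np.sum(np.linalg.eigvalsh(L)[:-r])

k = 2 * r                                 # sample size of the order suggested by Theorem 7
S = rng.choice(n, size=k, replace=False)  # uniform subset; a k-DPP would give the guarantee
print("Nystrom error:", nuclear(L - nystrom(L, S)),
      " best rank-r error:", best_rank_r_err)
```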
4.4. Connections to RandNLA methods. The natural applicability of DPPs in the RandNLA tasks of least squares regression and low-rank approximation discussed above raises the question of how DPPs relate to traditional RandNLA sampling methods used for these tasks. As discussed in Section 2, one of the main RandNLA techniques for constructing rank-preserving sketches (i.e., relative to a tall matrix 퐗 with 푛 ≫ 푑) is i.i.d. leverage score sampling. (From this perspective, random projections can be seen as preprocessing or preconditioning the input so that leverage scores are approximately uniform, thereby enabling uniform sampling—in the randomly-transformed space—to be successfully used.) Even though leverage score sampling was developed independently of DPPs, this method can be viewed as an i.i.d. counterpart of the Projection DPP from Theorem 2.

Theorem 8. Let 퐗 be 푛 × 푑 and rank 푑. For 푆 ∼ 푑-DPP_L(퐗퐗⊤) and any index 푖, the marginal probability of 푖 ∈ 푆 is the 푖th leverage score of 퐗:

Pr{푖 ∈ 푆} = 퐱푖⊤(퐗⊤퐗)−1퐱푖.

Here, the term marginal refers to the marginal distribution of any one out of 푛 binary variables 푏1, ..., 푏푛 which can be used to represent the random set 푆 via 푆 = {푖 ∶ 푏푖 = 1}. Recall that the marginal kernel of the Projection DPP is the projection matrix 퐊 = 퐗(퐗⊤퐗)−1퐗⊤. The marginal probabilities of this distribution lie on the diagonal of the marginal kernel, which also contains the leverage scores of 퐗.

Thus, leverage score sampling can be obtained as a distribution constructed from the marginals of the Projection DPP. Naturally, when going from non-i.i.d. to i.i.d. sampling, we lose all the negative correlations between the points in a DPP sample, and therefore the expectation formulas and inequalities from the preceding sections no longer hold for leverage score sampling.
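Theorem 8 is also easy to confirm numerically: accumulating the marginal probability of each index over all size-푑 subsets, weighted by det(퐗_푆)^2/det(퐗⊤퐗), reproduces the leverage scores 퐱푖⊤(퐗⊤퐗)−1퐱푖 (sizes below are illustrative assumptions).

```python
import itertools
import numpy as np

rng = np.random.default_rng(6)
n, d = 8, 3
X = rng.standard_normal((n, d))

# Exact leverage scores x_i^T (X^T X)^{-1} x_i.
leverage = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)

norm_const = np.linalg.det(X.T @ X)
marginals = np.zeros(n)
for S in itertools.combinations(range(n), d):
    pr = np.linalg.det(X[list(S)]) ** 2 / norm_const
    marginals[list(S)] += pr

print("max |marginal - leverage score|:", np.max(np.abs(marginals - leverage)))  # ~1e-15
```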

Furthermore, recall that to achieve a rank-preserving sketch (e.g., for least squares) with leverage score sampling for a full rank matrix 퐗, we require at least 푑 log 푑 rows (Theorem 1), whereas a Projection DPP generates only 푑 samples and also provides a rank-preserving sketch (Theorem 4). This shows that losing the negative correlations costs us a factor of log 푑 in the sample size.

Another connection between leverage scores and Projection DPPs emerges in the reverse direction, i.e., going from i.i.d. to non-i.i.d. samples. Namely, a leverage score sample of size at least 2푑 log 푑 contains a Projection DPP with probability at least 1/2 [7].

Theorem 9. Let 푗1, 푗2, ... be a sequence of i.i.d. leverage score samples from matrix 퐗. There is a random set 푇 ⊆ {1, 2, ...} of size 푑 s.t. max{푖 ∈ 푇} ≤ 2푑 log 푑 with probability at least 1/2, and:

{푗푖 ∶ 푖 ∈ 푇} ∼ 푑-DPP_L(퐗퐗⊤).

Many extensions of leverage scores have been proposed for use in the low-rank approximation setting (i.e., when 푛 ∼ 푑). Arguably the most popular one is called ridge leverage scores [2]. Ridge leverage scores can be recovered as the marginals of an L-ensemble.

Theorem 10. For 푆 ∼ DPP_L((1/휆)퐗퐗⊤) and any index 푖, the marginal probability of 푖 ∈ 푆 is the 휆-ridge leverage score of 퐗:

Pr{푖 ∈ 푆} = 퐱푖⊤(퐗⊤퐗 + 휆퐈)−1퐱푖.

The typical sample size required for low-rank approximation with ridge leverage scores is at least 푑_휆 log 푑_휆, where 푑_휆 is the ridge effective dimension (3) and also the expected size of the L-ensemble. Once again, the logarithmic factor appears as a trade-off coming from i.i.d. sampling.

A reverse connection analogous to Theorem 9, i.e., going from i.i.d. to non-i.i.d. sampling, can also be obtained for ridge leverage scores [5], although only a weaker version, with 푂(푑_휆^2) instead of 푂(푑_휆 log 푑_휆), is currently known in this setting.

Theorem 11. Let 푗1, 푗2, ... be a sequence of i.i.d. 휆-ridge leverage score samples from matrix 퐗. There is a random set 푇 ⊆ {1, 2, ...} such that max{푖 ∈ 푇} ≤ 2푑_휆^2 with probability at least 1/2, and:

{푗푖 ∶ 푖 ∈ 푇} ∼ DPP_L((1/휆)퐗퐗⊤).

5. Sampling Algorithms

One of the key considerations in RandNLA is computational efficiency of constructing random sketches. For example, the i.i.d. leverage score sampling sketch defined in Section 2 requires precomputing all of the leverage scores. If done naïvely, this costs as much as performing the singular value decomposition (SVD) of the data. However, by employing fast RandNLA projection methods, one obtains efficient near-linear time complexity approximation algorithms for leverage score sampling [13]. In the case of DPPs, the challenge may seem even more daunting, since the naïve algorithm has exponential time complexity relative to the data size. However, the connections between leverage scores and DPPs (summarized in Table 1) have recently played a crucial role in the algorithmic improvements for DPP sampling. In particular, recent advances in DPP sampling have resulted in several algorithmic techniques which are faster than the SVD, and in some regimes even approach the time complexity of fast leverage score sampling algorithms. See Table 2 for an overview.

5.1. Leverage scores: Approximation. We start by discussing fast sketching methods for approximating leverage scores more rapidly than by naïvely computing them via the SVD or a QR decomposition. This is a good illustration of RandNLA algorithmic techniques, and it is also relevant in our later discussion of DPP sampling.

For simplicity, we focus on constructing leverage scores for rank-preserving sketches (i.e., for a tall 푛 × 푑 matrix 퐗), but similar ideas apply to the low-rank approximation setup [13]. Recall that the 푖th leverage score of 퐗 is given by 푙푖 = 퐱푖⊤(퐗⊤퐗)−1퐱푖, which can be expressed as the squared norm of the 푖th row of the matrix 퐗(퐗⊤퐗)^{−1/2}. Assuming that 푛 ≫ 푑, the primary computational cost of obtaining this matrix involves two expensive matrix multiplications: first, computing 퐑 = 퐗⊤퐗 (or a similarly expensive operation such as a QR decomposition or the SVD); and second, computing 퐗퐑^{−1/2}. While each of these steps costs 푂(푛푑2) arithmetic operations, [13] showed that both of them can be approximated using efficient randomized sketching techniques, such as the Subsampled Randomized Hadamard Transform (SRHT) sketch [1]. The SRHT is a random sketching matrix 퐒 that, with high probability, satisfies the subspace embedding property (Definition 1), and that admits fast matrix-vector multiplication by exploiting the recursive structure of the Hadamard matrix. The resulting overall procedure returns leverage score approximations in time 푂(푛푑 log 푛 + 푑3 log 푑), i.e., much faster than the 푂(푛푑2) of the naïve algorithm.

A number of refinements have been proposed for approximating leverage scores, and similar approaches have also been developed for ridge leverage scores [2], which are used for low-rank approximation. In this case, we often consider the setting where instead of an 푛 × 푑 matrix 퐗, we are given an 푛 × 푛 matrix 퐊 = 퐗퐗⊤, i.e., the Gram matrix of the rows of 퐗 (this is particularly relevant in the context of DPPs). Here, 휆-ridge leverage scores can be defined as the diagonal entries of the matrix 퐊(휆퐈 + 퐊)−1 and can be approximately computed in time 푂(푛푑_휆^2 poly(log 푛)), with 푑_휆 as in (3). When 푑_휆 ≪ 푛, this is less than the naïve cost of 푂(푛3).
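The following sketch illustrates the idea behind fast approximate leverage scores for a tall matrix 퐗: form the 푑 × 푑 factor from a small sketch 퐒퐗 instead of from 퐗 itself, then read off approximate scores as squared row norms. For simplicity it uses a Gaussian sketching matrix in place of the SRHT, and it forms 퐗퐑^{−1} explicitly rather than applying the second (right-hand) sketch of [13]; sketch sizes are illustrative assumptions.

```python
import numpy as np

def approx_leverage_scores(X, m, rng):
    """Approximate leverage scores of a tall X from a sketch S X with m rows."""
    n, d = X.shape
    # Gaussian sketch standing in for the SRHT; it also satisfies a subspace embedding.
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    _, R = np.linalg.qr(S @ X)            # R^T R approximates X^T X
    Z = np.linalg.solve(R.T, X.T).T       # rows of Z approximate rows of X (X^T X)^{-1/2}
    return np.sum(Z**2, axis=1)

rng = np.random.default_rng(7)
n, d = 20_000, 30
X = rng.standard_normal((n, d))
exact = np.sum(np.linalg.qr(X)[0]**2, axis=1)         # exact scores, O(n d^2)
approx = approx_leverage_scores(X, m=4 * d, rng=rng)  # sketch with O(d) rows
print("max relative error:", np.max(np.abs(approx - exact) / exact))
```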

                                 Rank-preserving sketch                    | Low-rank approximation
                                 (input: 푛 × 푑 data matrix 퐗, 푛 ≫ 푑;      | (input: 푛 × 푛 kernel matrix 퐋;
                                  output: sample of size 푘 = 푂(푑))         |  output: sample of size 푘 ≪ 푛)
                                 First sample   | Subsequent samples        | First sample   | Subsequent samples
 Lev. scores: Exact              푛푑2            | 푑                         | 푛3             | 푘
              Approximate        푛푑 + 푑3        | 푑                         | 푛푘2            | 푘
 DPPs: Eigendecomposition        푛푑2            | 푑3                        | 푛3             | 푛푘 + 푘3
       Intermediate sampling     푛푑 + 푑4        | 푑4                        | 푛 ⋅ poly(푘)    | 푘6
       Monte Carlo sampling      푛 ⋅ poly(푑)    | 푛 ⋅ poly(푑)               | 푛 ⋅ poly(푘)    | 푛 ⋅ poly(푘)

Table 2. Comparison of sampling cost for DPP algorithms, alongside the cost of exact and approximate leverage score sampling, given either a tall data matrix 퐗 or a square p.s.d. kernel 퐋. Most methods can also be extended to the wide data matrix 퐗 setting. We assume that an L-ensemble kernel 퐋 is used for the DPPs (if given a data matrix 퐗, we use 퐋 = 퐗퐗⊤). We allow either a 푘-DPP or an L-ensemble with expected size 푘; however, in some cases, there are differences in the time complexities (in which case we give the better of the two). For simplicity, we omit the log terms in these expressions.

5.2. DPPs: Eigendecomposition. In this and subsequent sections, we discuss several algorithmic techniques for sampling from DPPs and 푘-DPPs. We focus on the general parameterization of a DPP via an 푛 × 푛 kernel matrix (either the marginal kernel 퐊 or the L-ensemble kernel 퐋), but we also discuss how these techniques can be applied to sampling from DPPs defined on a tall 푛 × 푑 matrix 퐗, which we used in Section 4.

We start with an important result of [17], which shows that any DPP can be decomposed into a mixture of Projection DPPs.

Theorem 12. Consider the eigendecomposition 퐊 = ∑푖 휆푖퐮푖퐮푖⊤, where 휆푖 ∈ [0, 1] and ‖퐮푖‖ = 1. For each 푖, let 푠푖 be a random variable which is 1 with probability 휆푖 and 0 otherwise. Then the mixture distribution DPP(∑푖 푠푖퐮푖퐮푖⊤) is identical to DPP(퐊).

Note that ∑푖 푠푖퐮푖퐮푖⊤ randomly selects one of 2^푛 projection matrices (all their eigenvalues are 0 or 1), which defines a corresponding Projection DPP. In addition to the mixture decomposition, [17] also gave a simple 푂(푛푘2) time procedure for sampling from this Projection DPP, where 푘 denotes the sample size ∑푖 푠푖. This procedure was recently accelerated to 푂(푛푘 + 푘3 log 푘) by [7]. Thus, combining the mixture decomposition and the Projection DPP algorithm, it became possible to sample from any determinantal point process in low-degree polynomial time, which significantly broadened the popularity of DPPs in the computer science community.

The overall sampling procedure proposed by [17] is very efficient, if we are given the eigendecomposition of kernel 퐊 or of the L-ensemble kernel 퐋. It can also be easily adapted to sampling cardinality constrained DPPs. However, obtaining the eigendecomposition itself can be a significant bottleneck: it costs 푂(푛3) time for a general 푛 × 푛 kernel. If we are given a tall 푛 × 푑 matrix 퐗 such that 퐋 = 퐗퐗⊤, as was the case in Section 4, then the sampling cost can be reduced to 푂(푛푑2). There have been a number of attempts at avoiding the eigendecomposition in this procedure, leading to several approximate algorithms. Finally, approaches using other factorizations of the kernel matrix have been proposed, and these offer computational advantages in certain settings.
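A compact Python sketch of the eigendecomposition-based sampler behind Theorem 12: eigenvectors of the marginal kernel 퐊 are selected independently with probabilities 휆푖, and the resulting Projection DPP is sampled by repeatedly choosing an index proportionally to the squared row norms of the selected eigenvector matrix and projecting that direction out, in the spirit of the 푂(푛푘2) procedure of [17]. This is an illustrative, unoptimized implementation.

```python
import numpy as np

def sample_projection_dpp(V, rng):
    """Sample from the Projection DPP with marginal kernel V V^T (V has orthonormal columns)."""
    V = V.copy()
    sample = []
    while V.shape[1] > 0:
        p = np.sum(V**2, axis=1)
        p /= p.sum()
        i = int(rng.choice(len(p), p=p))
        sample.append(i)
        # Eliminate the direction that accounts for item i, then re-orthonormalize.
        j = np.argmax(np.abs(V[i]))
        v = V[:, j].copy()
        V = np.delete(V, j, axis=1)
        V -= np.outer(v, V[i] / v[i])     # zero out row i in the remaining columns
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)
    return sorted(sample)

def sample_dpp(K, rng):
    """Sample S ~ DPP(K) for a marginal kernel K (0 <= K <= I), via Theorem 12."""
    eigvals, eigvecs = np.linalg.eigh(K)
    keep = rng.random(len(eigvals)) < np.clip(eigvals, 0.0, 1.0)
    return sample_projection_dpp(eigvecs[:, keep], rng)

# Example: K = X (X^T X)^{-1} X^T is a projection, so |S| = d with probability 1.
rng = np.random.default_rng(8)
X = rng.standard_normal((50, 4))
K = X @ np.linalg.solve(X.T @ X, X.T)
print(sample_dpp(K, rng))   # a set of 4 indices
```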
5.3. DPPs: Intermediate sampling. The DPP sampling algorithms from Section 5.2 can be accelerated with a recently introduced technique [5, 6], which uses leverage score sampling to reduce the size of the 푛 × 푛 kernel matrix, without distorting the underlying DPP distribution. Recall from Section 4.4 that i.i.d. leverage score sampling can be viewed as an approximation of a DPP in which we ignore the negative correlations between the sampled points. Naturally, in most cases such a sample as a whole will be a very poor approximation of a DPP. However, with high probability, it contains a DPP of a smaller size (Theorems 9 and 11).

This motivates a strategy called distortion-free intermediate sampling. To explain how this strategy can be implemented, we will consider the special case of a Projection DPP of size 푘, where this approach was first introduced by [10]. They showed that an i.i.d. sample of indices 푖1, ..., 푖푡 of size 푡 = 푂(푘2) drawn proportionally to the leverage scores contains with high probability a subset 푆 distributed according to the desired Projection DPP. Moreover, this subset can be found by downsampling with a DPP restricted to the smaller sample. This procedure essentially reduces the task of sampling from a DPP over a large domain {1, ..., 푛} into sampling from a potentially much smaller domain of size 푂(푘2). Surprisingly, this can be performed without any loss in accuracy, so that the final sample is drawn exactly from the target distribution. Similar intermediate sampling methods were later developed by [5, 6] for L-ensembles (where ridge leverage scores are used instead of the standard leverage scores) and 푘-DPPs, resulting in a preprocessing cost that is linear in 푛 (instead of cubic for the eigendecomposition), and a sampling cost independent of 푛 (see Table 2).

Theorem 13. Let 푆1, 푆2 be i.i.d. random sets from DPP_L(퐋) with 푘 = 피[|푆|], or from any 푘-DPP(퐋). Then, given access to 퐋, we can return
1. first, 푆1 in: 푛 ⋅ poly(푘 log 푛) time,
2. then, 푆2 in: poly(푘) time.

Analogous time complexity statements can be provided when 퐋 = 퐗퐗⊤ and we are given 퐗. In this case, the first sample can be obtained in 푂(푛푑 log 푛 + poly(푑)) time, and each subsequent sample takes poly(푑) time [5]. Also, extensions of intermediate sampling exist for classes of distributions beyond DPPs, including all Strongly Rayleigh measures [21].

5.4. DPPs: Monte Carlo sampling. A completely different approach to (approximately) sampling from a DPP was proposed by [3], who showed that a simple fast-mixing Monte Carlo Markov chain (MCMC) algorithm has a cardinality constrained L-ensemble 푘-DPP_L(퐋) as its stationary distribution. The state space of this chain consists of subsets 푆 ⊆ [푛] of some fixed cardinality 푘. At each step, we choose an index 푖 ∈ 푆 and 푗 ∉ 푆 uniformly at random. Letting 푇 = 푆 ∪ {푗}\{푖}, we transition from 푆 to 푇 with probability

(1/2) min{1, det(퐋_{푇,푇}) / det(퐋_{푆,푆})},

and otherwise, stay in 푆. It is easy to see that the stationary distribution of the above Markov chain is 푘-DPP_L(퐋). Moreover, [3] showed that the mixing time can be bounded as follows.

Theorem 14. The number of steps required to get to within 휖 total variation distance from 푘-DPP_L(퐋) is at most 푂(푛 ⋅ poly(푘) log(푛/휖)).

The advantages of these sampling procedures over the algorithm of [17] are that we are not required to perform the eigendecomposition and that the computational cost of the MCMC algorithm scales linearly with 푛. The disadvantages are that the sampling is approximate and that we have to run the entire chain every time we wish to produce a new sample 푆.
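A minimal NumPy sketch of the swap chain just described: starting from a random size-푘 subset, propose exchanging one element of 푆 for one outside of it, and accept with probability (1/2) min{1, det(퐋_{푇,푇})/det(퐋_{푆,푆})}. The kernel, subset size, and chain length are illustrative assumptions, and the determinant ratio is computed naïvely rather than via fast updates.

```python
import numpy as np

def kdpp_mcmc(L, k, n_steps, rng):
    """Approximate sampling from k-DPP_L(L) via the swap chain of Section 5.4."""
    n = L.shape[0]
    S = [int(i) for i in rng.choice(n, size=k, replace=False)]
    det_S = np.linalg.det(L[np.ix_(S, S)])
    for _ in range(n_steps):
        i = S[rng.integers(k)]                                   # uniform element of S
        j = int(rng.choice(np.setdiff1d(np.arange(n), S)))       # uniform element outside S
        T = [x for x in S if x != i] + [j]
        det_T = np.linalg.det(L[np.ix_(T, T)])
        # Accept the swap with probability (1/2) min{1, det(L_TT) / det(L_SS)}.
        if rng.random() < 0.5 * min(1.0, det_T / det_S):
            S, det_S = T, det_T
    return sorted(S)

rng = np.random.default_rng(9)
X = rng.standard_normal((100, 10))
L = X @ X.T                                  # rank-10 L-ensemble kernel
print(kdpp_mcmc(L, k=5, n_steps=2000, rng=rng))
```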
6. Looking Forward

We have briefly surveyed two established research areas which exhibit deep connections that have only recently begun to emerge:
1. Randomized Numerical Linear Algebra; and
2. Determinantal Point Processes.

In particular, we discussed recent developments in applying DPPs to classical tasks in RandNLA, such as least squares regression and low-rank approximation; and we surveyed recent results on sampling algorithms for DPPs, comparing and contrasting several different approaches.

We expect that these connections will be fruitful more generally. As an example of this, we briefly mention a recently proposed mathematical framework for studying determinants, which played a key role in obtaining some of these results.

Determinant-preserving random matrices. A square random matrix 퐀 is determinant preserving (d.p.) if all of its subdeterminants commute with taking the expectation, i.e., if:

피 det(퐀_{푆,푇}) = det(피 퐀_{푆,푇})

for all index subsets 푆, 푇 of the same size. Not all random matrices satisfy this property; however, there are many nontrivial examples. For instance, consider 퐀 = 푋퐂, where 푋 is a random variable with positive variance and 퐂 is a nonzero deterministic matrix. Then, 퐀 is d.p. if and only if 퐂 is rank 1. More elaborate positive examples, such as matrices with independent random entries, can be constructed by taking advantage of the algebraic structure of the d.p. class: if 퐀 and 퐁 are independent and d.p., then both 퐀 + 퐁 and 퐀퐁 are also determinant preserving. The first examples of d.p. matrices were given by [5] (used in the analysis of the fast DPP sampling algorithm from Theorem 13). Further discussion can be found in [8].

Of course, our survey of the applications of DPPs necessarily excluded many areas where this family of distributions appears. Here, we briefly discuss some other applications of DPPs which are relevant in the context of NLA and RandNLA but did not fit in the scope of this work.

Implicit regularization. In many optimization tasks (e.g., in machine learning), the true minimizer of a desired objective is not unique or not computable exactly, so that the choice of the optimization procedure affects the output. Implicit regularization occurs when these algorithmic choices provide an effect similar to explicitly introducing a regularization penalty into the objective. This has been observed for approximate solutions returned by stochastic and combinatorial optimization algorithms, but a precise characterization of this phenomenon for RandNLA sampling methods has proven challenging. Recently, DPPs have been used to derive exact expressions for implicit regularization in RandNLA algorithms [8], connecting it to a phase transition called the double descent curve.

Optimal design of experiments. In statistics, the task of selecting a subset of data points for a downstream regression task is referred to as optimal design. In this context, it is often assumed that the coefficients 푦푖 (or responses) are random variables obtained as a linear transformation of the vector 퐱푖 distorted by some mean zero noise, as in (4). A number of optimality criteria (such as A-optimality, which uses the mean squared error of the least squares estimator) have been considered for selecting the subsets. DPP subset selection has been shown to provide useful guarantees for some of the most popular criteria (including A-, C-, D-, and V-optimality), leading to new approximation algorithms [7, 22].

Stochastic optimization. Randomized selection of small batches of data or subsets of parameters has been very successful in speeding up many iterative optimization algorithms.

Here, nonuniform sampling can be used to reduce the bias and/or variance in the iteration steps. In particular, [25] showed that using a DPP for sampling mini-batches in stochastic gradient descent improves the convergence rate of the optimizer.

Monte Carlo integration. DPPs have been shown to achieve theoretically improved guarantees for numerical integration, i.e., using a weighted sum of function evaluations to approximate an integral. In particular, [4] constructed a DPP for which the root mean squared errors of Monte Carlo integration decrease as 푛^{−(1+1/푑)/2}, where 푛 is the number of function evaluations and 푑 is the dimension. This is faster than the typical 푛^{−1/2} rate.

In conclusion, despite having been studied for at least forty-five years, DPPs are enjoying an explosion of renewed interest, with novel applications emerging on a regular basis. Their rich connections to RandNLA, which we have only briefly summarized and which offer a nice example of how deep mathematics informs practical problems and vice versa, provide a particularly fertile ground for future work.

ACKNOWLEDGMENTS. We would like to acknowledge DARPA, NSF (via the TRIPODS program), and ONR (via the BRC on RandNLA) for providing partial support for this work.

References
[1] Nir Ailon and Bernard Chazelle, The fast Johnson-Lindenstrauss transform and approximate nearest neighbors, SIAM J. Comput. 39 (2009), no. 1, 302–322, DOI 10.1137/060673096. MR2506527
[2] Ahmed El Alaoui and Michael W. Mahoney, Fast randomized kernel ridge regression with statistical guarantees, Proceedings of the 28th International Conference on Neural Information Processing Systems, 2015, pp. 775–783.
[3] Nima Anari, Shayan Oveis Gharan, and Alireza Rezaei, Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes, 29th Annual Conference on Learning Theory, 2016, pp. 103–115.
[4] Rémi Bardenet and Adrien Hardy, Monte Carlo with determinantal point processes, Ann. Appl. Probab. 30 (2020), no. 1, 368–417, DOI 10.1214/19-AAP1504. MR4068314
[5] Michał Dereziński, Fast determinantal point processes via distortion-free intermediate sampling, Proceedings of the 32nd Conference on Learning Theory, 2019, pp. 1029–1049.
[6] Michał Dereziński, Daniele Calandriello, and Michal Valko, Exact sampling of determinantal point processes with sublinear time preprocessing, Advances in Neural Information Processing Systems, 2019, pp. 11542–11554.
[7] Michał Dereziński, Kenneth L. Clarkson, Michael W. Mahoney, and Manfred K. Warmuth, Minimax experimental design: Bridging the gap between statistical and worst-case approaches to least squares regression, Proceedings of the Thirty-Second Conference on Learning Theory, 2019, pp. 1050–1069.
[8] Michał Dereziński, Feynman Liang, and Michael W. Mahoney, Exact expressions for double descent and implicit regularization via surrogate random design, Advances in Neural Information Processing Systems, 2020.
[9] Michał Dereziński and Manfred K. Warmuth, Reverse iterative volume sampling for linear regression, J. Mach. Learn. Res. 19 (2018), Paper No. 23, 39. MR3862430
[10] Michał Dereziński, Manfred K. Warmuth, and Daniel Hsu, Unbiased estimators for random design regression, arXiv preprint, arXiv:1907.03411 (2019).
[11] Petros Drineas, Ravi Kannan, and Michael W. Mahoney, Fast Monte Carlo algorithms for matrices. I. Approximating matrix multiplication, SIAM J. Comput. 36 (2006), no. 1, 132–157, DOI 10.1137/S0097539704442684. MR2231643
[12] P. Drineas and M. W. Mahoney, RandNLA: Randomized numerical linear algebra, Communications of the ACM 59 (2016), 80–90.
[13] Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff, Fast approximation of matrix coherence and statistical leverage, J. Mach. Learn. Res. 13 (2012), 3475–3506. MR3033372
[14] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan, Sampling algorithms for 푙2 regression and applications, Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, ACM, New York, 2006, pp. 1127–1136, DOI 10.1145/1109557.1109682. MR2373840
[15] Venkatesan Guruswami and Ali Kemal Sinop, Optimal column-based low-rank matrix reconstruction, Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, ACM, New York, 2012, pp. 1207–1214. MR3205285
[16] N. Halko, P. G. Martinsson, and J. A. Tropp, Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions, SIAM Rev. 53 (2011), no. 2, 217–288, DOI 10.1137/090771806. MR2806637
[17] J. Ben Hough, Manjunath Krishnapur, Yuval Peres, and Bálint Virág, Determinantal processes and independence, Probab. Surv. 3 (2006), 206–229, DOI 10.1214/154957806000000078. MR2216966
[18] Alex Kulesza and Ben Taskar, Determinantal point processes for machine learning, Now Publishers Inc., Hanover, MA, USA, 2012.
[19] Odile Macchi, The coincidence approach to stochastic point processes, Advances in Applied Probability 7 (1975), no. 1, 83–122.
[20] M. W. Mahoney, Randomized algorithms for matrices and data, Foundations and Trends in Machine Learning, NOW Publishers, Boston, 2011.
[21] N. Anari and M. Dereziński, Isotropy and log-concave polynomials: Accelerated sampling and high-precision counting of matroid bases, Proceedings of the 61st Annual Symposium on Foundations of Computer Science, 2020.
[22] Aleksandar Nikolov, Mohit Singh, and Uthaipon Tao Tantipongpipat, Proportional volume sampling and approximation algorithms for 퐴-optimal design, Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SIAM, Philadelphia, PA, 2019, pp. 1369–1386, DOI 10.1137/1.9781611975482.84. MR3909553

[23] Tamas Sarlos, Improved approximation algorithms for large matrices via random projections, Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, 2006, pp. 143–152.
[24] David P. Woodruff, Sketching as a tool for numerical linear algebra, Found. Trends Theor. Comput. Sci. 10 (2014), no. 1-2, iv+157, DOI 10.1561/0400000060. MR3285427
[25] Cheng Zhang, Hedvig Kjellström, and Stephan Mandt, Determinantal point processes for mini-batch diversification, 33rd Conference on Uncertainty in Artificial Intelligence, UAI 2017, 2017.

Credits
Opening graphic is courtesy of Ryzhi via Getty.
Figures 1 and 2 are courtesy of Michał Dereziński.
Photo of Michał Dereziński is courtesy of Julie Lucas.
Photo of Michael W. Mahoney is courtesy of Madeleine Fitzgerald.

