Determinantal Point Processes in Randomized Numerical Linear Algebra
Michał Dereziński and Michael W. Mahoney

Michał Dereziński is a postdoctoral research scientist in the department of statistics at the University of California at Berkeley. His email address is mderezin@berkeley.edu.

Michael W. Mahoney is an associate professor in the department of statistics at the University of California at Berkeley and a research scientist at the International Computer Science Institute. His email address is mmahoney@stat.berkeley.edu.

A complete list of references is available in the technical report version of this paper: https://arxiv.org/pdf/2005.03185.pdf.

Communicated by Notices Associate Editor Reza Malek-Madani.

For permission to reprint this article, please contact: [email protected].

DOI: https://doi.org/10.1090/noti2202

1. Introduction

Matrix problems are central to much of applied mathematics, from traditional scientific computing and partial differential equations to statistics, machine learning, and artificial intelligence. Generalizations and variants of matrix problems are central to many other areas of mathematics, via more general transformations and algebraic structures, nonlinear optimization, infinite-dimensional operators, etc.

Randomized Numerical Linear Algebra (RandNLA) is an area which uses randomness, most notably random sampling and random projection methods, to develop improved algorithms for ubiquitous matrix problems. It began as a niche area in theoretical computer science about fifteen years ago [11], and since then the area has exploded. Much of the work in RandNLA has been propelled by recent developments in machine learning, artificial intelligence, and large-scale data science, and RandNLA both draws upon and contributes back to both pure and applied mathematics.

A seemingly different topic, but one which has a long history in pure and applied mathematics, is that of Determinantal Point Processes (DPPs). A DPP is a stochastic point process, the probability distribution of which is characterized by subdeterminants of some matrix. Such processes were first introduced to model the distribution of fermions at thermal equilibrium [19]. In the context of random matrix theory, DPPs emerged as the eigenvalue distribution for standard random matrix ensembles, and they are of interest in other areas of mathematics such as graph theory, combinatorics, and quantum mechanics [17]. More recently, DPPs have also attracted significant attention within machine learning and statistics as a tractable probabilistic model that is able to capture a balance between quality and diversity within data sets and that admits efficient algorithms for sampling, marginalization, conditioning, etc. [18]. This has resulted in practical applications of DPPs in experimental design, recommendation systems, stochastic optimization, and more.
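To make this defining property concrete, here is a minimal sketch (ours, not code from the article) that enumerates the subset probabilities of a small L-ensemble DPP, the subclass that reappears in Figure 1. The kernel `L` is an arbitrary positive semidefinite matrix chosen for illustration, and we use the standard L-ensemble formula $P(S) = \det(\mathbf{L}_S)/\det(\mathbf{L} + \mathbf{I})$, where $\mathbf{L}_S$ is the principal submatrix indexed by $S$.

```python
import itertools

import numpy as np

# An arbitrary positive semidefinite kernel, chosen only for illustration.
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 3))
L = B @ B.T  # 4 x 4, PSD, rank 3
n = L.shape[0]

# L-ensemble DPP over {0, ..., n-1}: P(S) = det(L_S) / det(L + I).
Z = np.linalg.det(L + np.eye(n))

total = 0.0
for size in range(n + 1):
    for S in itertools.combinations(range(n), size):
        # Convention: the determinant of the empty 0 x 0 matrix is 1.
        det_S = np.linalg.det(L[np.ix_(S, S)]) if S else 1.0
        p = det_S / Z
        total += p
        print(f"P({S}) = {p:.4f}")

# Sanity check: det(L + I) equals the sum of det(L_S) over all S,
# so the probabilities sum to 1.
print(f"sum of probabilities: {total:.6f}")
```

Since `L` here has rank 3, every 4-element subset has a singular submatrix and hence probability zero, a simple instance of how the determinantal structure favors diverse (linearly independent) subsets.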
Until very recently, DPPs have had little if any presence within RandNLA. However, recent work has uncovered deep connections between these two topics. The purpose of this article is to provide an overview of RandNLA, with an emphasis on discussing and highlighting these connections with DPPs. In particular, we will show how random sampling with a DPP leads to new kinds of unbiased estimators for the classical RandNLA task of least squares regression, enabling a more refined statistical and inferential understanding of RandNLA algorithms. We will also demonstrate that a DPP is, in some sense, an optimal randomized method for low-rank approximation, another ubiquitous matrix problem. Finally, we also discuss how a standard RandNLA technique, called leverage score sampling, can be derived as the marginal distribution of a DPP, as well as the algorithmic consequences this has for efficient DPP sampling.

We start (in Section 2) with a brief review of a prototypical RandNLA algorithm, focusing on the ubiquitous least squares problem and highlighting key aspects that will put in context the recent work we will review. In particular, we discuss the trade-offs between standard sampling methods from RandNLA, including uniform sampling, norm-squared sampling, and leverage score sampling. Next (in Section 3), we introduce the family of DPPs, highlighting some important subclasses and the basic properties that make them appealing for RandNLA. Then (in Section 4), we describe the fundamental connections between certain classes of DPPs and the classical RandNLA tasks of least squares regression and low-rank approximation, as well as the relationship between DPPs and the RandNLA method of leverage score sampling. Finally (in Section 5), we discuss the algorithmic aspects of both leverage scores and DPPs. We conclude (in Section 6) by briefly mentioning several other connections between DPPs and RandNLA, as well as a recently introduced class of random matrices, called determinant preserving, which has proven useful in this line of research.

2. RandNLA: Randomized Numerical Linear Algebra

Much of the early work in numerical linear algebra (NLA) focused on deterministic algorithms. However, Monte Carlo sampling approaches demonstrated that randomness inside the algorithm is a powerful computational resource which can lead to both greater efficiency and robustness to worst-case data [12, 20]. The success of RandNLA methods has been proven in many domains, e.g., when randomized least squares solvers such as Blendenpik or LSRN have outperformed the established high performance computing software LAPACK or other methods in parallel/distributed environments, respectively, or when RandNLA methods have been used in conjunction with traditional scientific computing solvers for low-rank approximation problems.

In a typical RandNLA setting, we are given a large dataset in the form of a real-valued matrix, say $\mathbf{X} \in \mathbb{R}^{n \times d}$, and our goal is to compute quantities of interest quickly. To do so, we efficiently downsize the matrix using a randomized algorithm, while approximately preserving its inherent structure, as measured by some objective. In doing so, we obtain a new matrix $\tilde{\mathbf{X}}$ (often called a sketch of $\mathbf{X}$) which is either smaller or sparser than the original matrix. This procedure can often be represented by matrix multiplication, i.e., $\tilde{\mathbf{X}} = \mathbf{S}\mathbf{X}$, where $\mathbf{S}$ is called the sketching matrix. Many applications of RandNLA follow the sketch-and-solve paradigm: Instead of performing a costly operation on $\mathbf{X}$, we first construct $\tilde{\mathbf{X}}$ (the sketch); we then perform the expensive operation (more cheaply) on the smaller $\tilde{\mathbf{X}}$ (the solve); and we use the solution from $\tilde{\mathbf{X}}$ as a proxy for the solution we would have obtained from $\mathbf{X}$. Here, cost often means computational time, but it can also be communication or storage space or even human work.
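As a minimal illustration of this paradigm (our own sketch, not code from the article), the snippet below applies sketch-and-solve to an overdetermined least squares problem: it forms $\tilde{\mathbf{X}} = \mathbf{S}\mathbf{X}$ with an i.i.d. Gaussian sketching matrix $\mathbf{S}$, one of the data-oblivious transformations discussed below, and performs the expensive solve on the small sketch. The problem sizes and all variable names are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 10_000, 50, 500  # tall n x d data matrix, sketch size k << n

X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# The sketch: X_tilde = S X, with S an i.i.d. Gaussian sketching matrix.
# The 1/sqrt(k) scaling makes S^T S equal the identity in expectation.
S = rng.standard_normal((k, n)) / np.sqrt(k)
X_tilde, y_tilde = S @ X, S @ y

# The solve: run the expensive operation on the small (k x d) problem ...
w_sketch = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)[0]
# ... and use it as a proxy for the solution of the full (n x d) problem.
w_exact = np.linalg.lstsq(X, y, rcond=None)[0]

res_sketch = np.linalg.norm(X @ w_sketch - y)
res_exact = np.linalg.norm(X @ w_exact - y)
print(f"residual inflation from sketching: {res_sketch / res_exact:.4f}x")
```

With $k$ only moderately larger than $d$, the sketched residual is typically within a small constant factor of the optimum, which is the kind of guarantee that RandNLA analyses make precise.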
Many different approaches have been established for randomly downsizing data matrices $\mathbf{X}$ (see [12, 20] for a detailed survey). While some methods randomly zero out most of the entries of the matrix, most keep only a small random subset of rows and/or columns. In either case, however, the choice of randomness is crucial in preserving the structure of the data. For example, if the data matrix $\mathbf{X}$ contains a few dominant entries/rows/columns (e.g., as measured by their absolute value or norm or some other "importance" score), then we should make sure that our sketch is likely to retain the information they carry. This leads to data-dependent sampling distributions that will be the focus of our discussion. However, data-oblivious sketching techniques, where the random sketching matrix $\mathbf{S}$ is independent of $\mathbf{X}$ (typically called a "random rotation" or a "random projection," even if it is not precisely a rotation or projection in the linear algebraic sense), have also proven very successful [16, 24]. Among the most common examples of such random transformations are i.i.d. Gaussian matrices, fast Johnson-Lindenstrauss transforms, and count sketches, all of which provide different trade-offs between efficiency and accuracy. These "data-oblivious random projections" can be interpreted either in terms of the Johnson-Lindenstrauss lemma or as a preconditioner for the "data-aware random sampling" methods we discuss.

Most RandNLA techniques can be divided into one of two settings, depending on the dimensionality or aspect ratio of $\mathbf{X}$, and on the desired size of the sketch (see Figure 1):
1. rank-preserving sketches, where $\tilde{\mathbf{X}}$ aims to preserve the rank of $\mathbf{X}$; and
2. low-rank approximation, where the sketch is used to obtain a low-rank approximation of $\mathbf{X}$.

[Figure 1: two panels, each showing an $n \times d$ matrix reduced to a smaller sketch $\tilde{\mathbf{X}}$; left panel "Rank-preserving sketch (e.g., Projection DPP)," right panel "Low-rank approximation (e.g., L-ensemble DPP)." Caption: Two RandNLA settings which are differentiated by whether the sketch (matrix $\tilde{\mathbf{X}}$) aims to preserve the rank of $\mathbf{X}$ (left) or obtain a low-rank approximation (right). In Section 4 we associate each setting with a different DPP.]

To illustrate the types of guarantees achieved by RandNLA methods on the least squares task, we will focus on row sampling, i.e., sketches consisting of a small random subset of the rows of $\mathbf{X}$, in the case that $n \gg d$. Concretely, the considered meta-strategy is to draw random i.i.d. row indices $j_1, \dots, j_k$ from $\{1, \dots, n\}$, with each index distributed according to $(p_1, \dots, p_n)$, and then solve the subproblem formed from those indices:

$$\hat{\mathbf{w}} = \operatorname*{argmin}_{\mathbf{w}} \|\tilde{\mathbf{X}}\mathbf{w} - \tilde{\mathbf{y}}\|^2 = \tilde{\mathbf{X}}^{\dagger}\tilde{\mathbf{y}}, \tag{2}$$

where $\tilde{\mathbf{x}}_i^{\top} = \frac{1}{\sqrt{k\,p_{j_i}}}\,\mathbf{x}_{j_i}^{\top}$ and $\tilde{y}_i = \frac{1}{\sqrt{k\,p_{j_i}}}\,y_{j_i}$ for $i = 1, \dots, k$ denote the $i$th row of $\tilde{\mathbf{X}}$ and entry of $\tilde{\mathbf{y}}$, respectively. The rescaling is introduced to account for the biases caused by nonuniform sampling. We consider the following standard sampling distributions:
1. Uniform: $p_i = 1/n$ for all $i$.
2. Squared norms: $p_i = \|\mathbf{x}_i\|^2 / \|\mathbf{X}\|_F^2$, where $\|\cdot\|_F$ is the Frobenius (Hilbert-Schmidt) norm.
3. Leverage scores: $p_i = l_i/d$ for $l_i = \|\mathbf{x}_i\|^2_{(\mathbf{X}^{\top}\mathbf{X})^{-1}}$ ($i$th leverage score of $\mathbf{X}$), where $\|\mathbf{v}\|_{\mathbf{A}} = \sqrt{\mathbf{v}^{\top}\mathbf{A}\mathbf{v}}$.
Both squared norm and leverage score sampling are stan-
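The row-sampling meta-strategy of equation (2) translates directly into code. Below is a minimal sketch (ours; the helper `row_sampling_lstsq` and the synthetic data are illustrative assumptions) that draws the i.i.d. indices, applies the $1/\sqrt{k\,p_{j_i}}$ rescaling, and compares the three sampling distributions. The leverage scores are computed from a QR decomposition $\mathbf{X} = \mathbf{Q}\mathbf{R}$, using the standard fact that $l_i$ equals the squared norm of the $i$th row of $\mathbf{Q}$.

```python
import numpy as np

def row_sampling_lstsq(X, y, k, p, rng):
    """Solve the subproblem of eq. (2): draw k i.i.d. row indices from the
    distribution p, rescaling each sampled row of X and entry of y by
    1/sqrt(k * p_j) to correct the bias caused by nonuniform sampling."""
    idx = rng.choice(X.shape[0], size=k, p=p)
    scale = 1.0 / np.sqrt(k * p[idx])
    return np.linalg.lstsq(X[idx] * scale[:, None], y[idx] * scale,
                           rcond=None)[0]

rng = np.random.default_rng(2)
n, d, k = 5_000, 20, 200
# Synthetic data with highly nonuniform row norms, where the choice of
# sampling distribution matters.
X = rng.standard_normal((n, d)) * np.logspace(0, 2, n)[:, None]
y = X @ rng.standard_normal(d) + rng.standard_normal(n)

# 1. Uniform: p_i = 1/n.
p_unif = np.full(n, 1.0 / n)

# 2. Squared norms: p_i = ||x_i||^2 / ||X||_F^2.
norms2 = np.einsum("ij,ij->i", X, X)
p_norm = norms2 / norms2.sum()

# 3. Leverage scores: l_i = ||x_i||^2 w.r.t. (X^T X)^{-1}; they sum to d,
#    so p_i = l_i / d. Via X = QR, l_i is the squared norm of row i of Q.
Q, _ = np.linalg.qr(X)
lev = np.einsum("ij,ij->i", Q, Q)
p_lev = lev / lev.sum()

for name, p in [("uniform", p_unif), ("squared norms", p_norm),
                ("leverage scores", p_lev)]:
    w_hat = row_sampling_lstsq(X, y, k, p, rng)
    print(f"{name:>15}: residual = {np.linalg.norm(X @ w_hat - y):.1f}")
```

On data like this, the two data-dependent distributions concentrate their samples on the influential rows that uniform sampling is likely to miss.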