Non-PSD Matrix Sketching with Applications to Regression and Optimization

Zhili Feng¹  Fred Roosta²  David P. Woodruff³
¹ Machine Learning Department, Carnegie Mellon University
² School of Mathematics and Physics, University of Queensland
³ Computer Science Department, Carnegie Mellon University

Accepted for the 37th Conference on Uncertainty in Artificial Intelligence (UAI 2021).

Abstract

A variety of dimensionality reduction techniques have been applied for computations involving large matrices. The underlying matrix is randomly compressed into a smaller one, while approximately retaining many of its original properties. As a result, much of the expensive computation can be performed on the small matrix. The sketching of positive semidefinite (PSD) matrices is well understood, but there are many applications where the related matrices are not PSD, including Hessian matrices in non-convex optimization and covariance matrices in regression applications involving complex numbers. In this paper, we present novel dimensionality reduction methods for non-PSD matrices, as well as their "square-roots", which involve matrices with complex entries. We show how these techniques can be used for multiple downstream tasks. In particular, we show how to use the proposed matrix sketching techniques for both convex and non-convex optimization, $\ell_p$-regression for every $1 \le p \le \infty$, and vector-matrix-vector queries.

1 INTRODUCTION

Many modern machine learning tasks involve massive datasets, where an input matrix $A \in \mathbb{R}^{n \times d}$ is such that $n \gg d$. In a number of cases, $A$ is highly redundant. For example, if we want to solve the ordinary least squares problem $\min_x \|Ax - b\|_2^2$, one can solve it exactly given only $A^\top A$ and $A^\top b$. To exploit this redundancy, numerous techniques have been developed to reduce the size of $A$. Such dimensionality reduction techniques are used to speed up various optimization tasks and are often referred to as sketching; for a survey, see Woodruff et al. [2014].

A lot of previous work has focused on sketching PSD matrices, for example, the Hessian matrices in convex optimization [Xu et al., 2016], the covariance matrices $X^\top X$ in regression over the reals, and quadratic form queries $x^\top A x$ [Andoni et al., 2016]. Meanwhile, less is understood for non-PSD matrices. These matrices are naturally associated with complex matrices: the Hessian of a non-convex optimization problem can be decomposed into $H = X^\top X$ where $X$ is a matrix with complex entries, and a complex design matrix $X$ has a non-PSD covariance matrix. However, almost all sketching techniques were developed for matrices with entries in the real field $\mathbb{R}$. While some results carry over to the complex numbers $\mathbb{C}$ (e.g., Tropp et al. [2015] develop concentration bounds that work for complex matrices), many do not and seem to require non-trivial extensions. In this work, we show how to efficiently sketch non-PSD matrices and extend several existing sketching results to the complex field. We also show how to use these in optimization, for both convex and non-convex problems, in the sketch-and-solve paradigm for complex $\ell_p$-regression with $1 \le p \le \infty$, as well as for vector-matrix-vector product queries.

Finite-sum Optimization. We consider optimization problems of the form

$$\min_{x \in \mathbb{R}^d} F(x) \triangleq \frac{1}{n} \sum_{i=1}^{n} f_i(a_i^\top x) + r(x), \qquad (1)$$

where $n \gg d \ge 1$, each $f_i : \mathbb{R} \to \mathbb{R}$ is a smooth but possibly non-convex function, $r(x)$ is a regularization term, and $a_i \in \mathbb{R}^d$, $i = 1, \dots, n$, are given. Problems of the form (1) are abundant in machine learning [Shalev-Shwartz and Ben-David, 2014]. Concrete examples include robust linear regression using Tukey's biweight loss [Beaton and Tukey, 1974], i.e., $f_i(\langle a_i, x \rangle) = (a_i^\top x - b_i)^2 / (1 + (a_i^\top x - b_i)^2)$, where $b_i \in \mathbb{R}$, and non-linear binary classification [Xu et al., 2020], i.e., $f_i(\langle a_i, x \rangle) = \big(1/(1 + \exp(-a_i^\top x)) - b_i\big)^2$, where $b_i \in \{0, 1\}$ is the class label. By incorporating curvature information, second-order methods are gaining popularity over first-order methods in certain applications. However, when $n \gg d \ge 1$, operations involving the Hessian of $F$ constitute a computational bottleneck. To this end, randomized Hessian approximations have shown great success in reducing computational complexity [Roosta and Mahoney, 2019, Xu et al., 2016, 2019, Pilanci and Wainwright, 2017, Erdogdu and Montanari, 2015, Bollapragada et al., 2019]. In the context of (1), it is easy to see that the Hessian of $F$ can be written as

$$\nabla^2 F(x) = \frac{1}{n} \sum_{i=1}^{n} f_i''(a_i^\top x)\, a_i a_i^\top + \nabla^2 r(x) = A^\top D(x) A / n + \nabla^2 r(x),$$

where $A = [a_1, \dots, a_n]^\top \in \mathbb{R}^{n \times d}$ and $D(x) = \mathrm{diag}\big(f_1''(a_1^\top x),\, f_2''(a_2^\top x),\, \dots,\, f_n''(a_n^\top x)\big) \in \mathbb{R}^{n \times n}$.
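As a quick numerical check of this decomposition, the following is a minimal NumPy sketch (our own illustration: the problem sizes, the random data, the absence of a regularizer $r$, and the choice of the Tukey biweight example above are all assumptions made for the demo). It verifies $\nabla^2 F(x) = A^\top D(x) A / n$ against the direct sum and shows that $D(x)$ can have negative diagonal entries when the $f_i$ are non-convex.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5                        # n >> d, as in (1)
A = rng.standard_normal((n, d))       # rows are a_i^T
b = rng.standard_normal(n)
x = rng.standard_normal(d)

# Tukey-biweight-style loss from the example above:
# f_i(t) = (t - b_i)^2 / (1 + (t - b_i)^2), with t = a_i^T x,
# so f_i''(t) = (2 - 6 s^2) / (1 + s^2)^3 with s = t - b_i.
s = A @ x - b
f2 = (2 - 6 * s**2) / (1 + s**2)**3   # f_i''(a_i^T x); can be negative

# Hessian of F (ignoring r) computed two ways: the direct sum
# (1/n) * sum_i f_i'' a_i a_i^T and the matrix form A^T D(x) A / n.
H_sum = sum(f2[i] * np.outer(A[i], A[i]) for i in range(n)) / n
D = np.diag(f2)
H_mat = A.T @ D @ A / n
assert np.allclose(H_sum, H_mat)

# Because some f_i'' < 0, D(x)^{1/2} has imaginary entries and the
# Hessian need not be PSD.
print("most negative f_i'':", f2.min())
print("smallest Hessian eigenvalue:", np.linalg.eigvalsh(H_mat).min())
```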
Of particular interest in this work is the application of randomized matrix approximation techniques [Woodruff et al., 2014, Mahoney, 2011, Drineas and Mahoney, 2016], in particular, constructing a random sketching matrix $S$ to ensure that

$$H(x) \triangleq A^\top D^{1/2} S^\top S D^{1/2} A + \nabla^2 r(x) \approx A^\top D A / n + \nabla^2 r(x) = \nabla^2 F(x).$$

Notice that $D^{1/2} A$ may have complex entries if $f_i$ is non-convex.

The Sketch-and-Solve Paradigm for Regression. In the overconstrained least squares regression problem, the task is to solve $\min_x \|Ax - b\|$ for some norm $\|\cdot\|$, and here we focus on the wide class of $\ell_p$-norms, where for a vector $y$, $\|y\|_p = (\sum_j |y_j|^p)^{1/p}$. Setting the value of $p$ allows for adjusting the sensitivity to outliers: for $p < 2$ the regression problem is often considered more robust than least squares because one does not square the differences, while for $p > 2$ the problem is considered more sensitive to outliers than least squares. The different $p$-norms also have statistical motivations; for instance, the $\ell_1$-regression solution is the maximum likelihood estimator given i.i.d. Laplacian noise. Approximation algorithms based on sampling and sketching have been thoroughly studied for $\ell_p$-regression, see, e.g., [Clarkson, 2005, Clarkson et al., 2016, Dasgupta et al., 2009, Meng and Mahoney, 2013, Sohler and Woodruff, 2011, Woodruff and Zhang, 2013, Clarkson and Woodruff, 2017, Wang and Woodruff, 2019]. These algorithms typically follow the sketch-and-solve paradigm, whereby the dimensions of $A$ and $b$ are reduced, resulting in a much smaller instance of $\ell_p$-regression, which is tractable. In the case of $p = \infty$, sketching is used inside of an optimization method to speed up linear programming-based algorithms [Cohen et al., 2019].

To highlight some of the difficulties in extending $\ell_p$-regression algorithms to the complex numbers, consider two popular cases, $\ell_1$- and $\ell_\infty$-regression. The standard way of solving these regression problems is by formulating them as linear programs. However, the complex numbers are not totally ordered, and linear programming algorithms therefore do not work with complex inputs. Stepping back, [...] than $\ell_p$ regressions over the reals.
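One concrete way to see how complex magnitudes can nevertheless be reduced to real quantities: the modulus of a complex number is not a linear function of its real and imaginary parts, but it is a maximum of real linear functionals, $|z| = \max_{\theta} \operatorname{Re}(e^{-i\theta} z)$. The sketch below is a toy illustration of this idea using a uniform grid of angles (our own simplification, not the Dvoretzky-type construction described in the contributions below):

```python
import numpy as np

rng = np.random.default_rng(2)

def modulus_via_angles(z, m=32):
    """Approximate |z| by max_j [cos(theta_j)*Re(z) + sin(theta_j)*Im(z)],
    a maximum of m real linear functionals of (Re(z), Im(z))."""
    theta = np.arange(m) * 2 * np.pi / m
    return np.max(np.cos(theta)[:, None] * z.real +
                  np.sin(theta)[:, None] * z.imag, axis=0)

z = rng.standard_normal(5) + 1j * rng.standard_normal(5)
print(np.abs(z))                    # true moduli
print(modulus_via_angles(z))        # grid approximation (never overshoots)
print("worst relative error:",
      np.max((np.abs(z) - modulus_via_angles(z)) / np.abs(z)))
```

With $m$ equally spaced angles, the approximation underestimates $|z|$ by at most a factor of $\cos(\pi/m)$, i.e., a relative error of $O(1/m^2)$.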
Vector-matrix-vector queries. Many applications require queries of the form $u^\top M v$, which we call vector-matrix-vector queries; see, e.g., Rashtchian et al. [2020]. For example, if $M$ is the adjacency matrix of a graph, then $u^\top M v$ answers whether there exists an edge between the pair $\{u, v\}$. These queries are also useful for independent set queries, cut queries, etc. Many past works have studied how to sketch positive definite $M$ (see, e.g., Andoni et al. [2016]), but it remains unclear how to handle the case when $M$ is non-PSD or has complex entries.

Contributions. We consider non-PSD matrices and their "square-roots", which are complex matrices, in the context of optimization and the sketch-and-solve paradigm. Our goal is to provide tools for handling such matrices in a number of different problems, and ours is, to the best of our knowledge, the first work to systematically study dimensionality reduction techniques for such matrices.

For optimization of (1), where each $f_i$ is potentially non-convex, we investigate non-uniform data-aware methods to construct a sampling matrix $S$ based on a new concept of leverage scores for complex matrices. In particular, we propose a hybrid deterministic-randomized sampling scheme, which is shown to have important properties for optimization. We show that our sampling schemes can guarantee appropriate matrix approximations (see (4) and (5)) with competitive sampling complexities. Subsequently, we investigate the application of such sampling schemes in the context of convex and non-convex Newton-type methods for (1).

For complex $\ell_p$-regression, we use Dvoretzky-type embeddings as well as an isometric embedding from $\ell_1$ into $\ell_\infty$ to construct oblivious embeddings from an instance of a complex $\ell_p$-regression problem to a real-valued $\ell_p$-regression problem, for $p \in [1, \infty]$. Our algorithm runs in $O(\mathrm{nnz}(A) + \mathrm{poly}(d/\varepsilon))$ time for constant $p \in [1, \infty)$, and in $O(\mathrm{nnz}(A) \cdot 2^{\tilde{O}(1/\varepsilon^2)})$ time for $p = \infty$. Here $\mathrm{nnz}(A)$ denotes the number of non-zero entries of the matrix $A$.

For vector-matrix-vector queries, we show that if the non-PSD matrix has the form $M = A^\top B$, then we can approximately compute $u^\top M v$ in just $O(\mathrm{nnz}(A) + n/\varepsilon^2)$ time, whereas the naïve approach takes $O(nd^2 + d^2 + d)$ time (an illustrative numerical sketch appears at the end of this section).

Notation. Vectors and matrices are denoted by bold lower-case and bold upper-case letters, respectively, e.g., $\mathbf{v}$ and $\mathbf{V}$. We use regular lower-case and upper-case letters to denote scalar constants, e.g., $d$ or $L$.
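To make the vector-matrix-vector setting above concrete, the following toy sketch estimates $u^\top M v$ for $M = A^\top B$ by applying one shared random projection to $Au$ and $Bv$. This is a generic Gaussian-sketch estimator included purely for illustration; it is not the estimator, nor the $O(\mathrm{nnz}(A) + n/\varepsilon^2)$ algorithm, developed in this paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 5000, 20, 400            # k: sketch size (an illustrative choice)
A = rng.standard_normal((n, d))
B = rng.standard_normal((n, d))
u = rng.standard_normal(d)
v = rng.standard_normal(d)

# Exact query against the (generally non-PSD) matrix M = A^T B.
exact = u @ (A.T @ B) @ v

# Generic sketched estimate: u^T A^T B v = (A u)^T (B v), so applying one
# shared random projection S to A u and B v gives (S A u)^T (S B v) ~ u^T M v.
# A dense Gaussian S is used purely for illustration.
S = rng.standard_normal((k, n)) / np.sqrt(k)
approx = (S @ (A @ u)) @ (S @ (B @ v))

scale = np.linalg.norm(A @ u) * np.linalg.norm(B @ v)
print("exact value   :", exact)
print("sketched value:", approx)
print("error / (||Au|| * ||Bv||):", abs(approx - exact) / scale)
```

For a Gaussian sketch with $k = O(1/\varepsilon^2)$ rows, the estimate is within an additive $\varepsilon \|Au\|_2 \|Bv\|_2$ of the true value with constant probability, which is why the error above is reported relative to $\|Au\|_2 \|Bv\|_2$ rather than to $|u^\top M v|$.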