Non-PSD Matrix Sketching with Applications to Regression and Optimization

Zhili Feng¹   Fred Roosta²   David P. Woodruff³
¹Machine Learning Department, Carnegie Mellon University
²School of Mathematics and Physics, University of Queensland
³Computer Science Department, Carnegie Mellon University

Accepted for the 37th Conference on Uncertainty in Artificial Intelligence (UAI 2021).

Abstract

A variety of dimensionality reduction techniques have been applied for computations involving large matrices. The underlying matrix is randomly compressed into a smaller one, while approximately retaining many of its original properties. As a result, much of the expensive computation can be performed on the small matrix. The sketching of positive semidefinite (PSD) matrices is well understood, but there are many applications where the related matrices are not PSD, including Hessian matrices in non-convex optimization and covariance matrices in regression applications involving complex numbers. In this paper, we present novel dimensionality reduction methods for non-PSD matrices, as well as their "square-roots", which involve matrices with complex entries. We show how these techniques can be used for multiple downstream tasks. In particular, we show how to use the proposed matrix sketching techniques for both convex and non-convex optimization, ℓ_p-regression for every 1 ≤ p ≤ ∞, and vector-matrix-vector queries.

1 INTRODUCTION

Many modern machine learning tasks involve massive datasets, where an input matrix A ∈ ℝ^{n×d} is such that n ≫ d. In a number of cases, A is highly redundant. For example, if we want to solve the ordinary least squares problem min_x ‖Ax − b‖_2², one can solve it exactly given only A^⊤A and A^⊤b. To exploit this redundancy, numerous techniques have been developed to reduce the size of A. Such dimensionality reduction techniques are used to speed up various optimization tasks and are often referred to as sketching; for a survey, see Woodruff et al. [2014].

A lot of previous work has focused on sketching PSD matrices, for example, the Hessian matrices in convex optimization [Xu et al., 2016], the covariance matrices X^⊤X in regression over the reals, and quadratic form queries x^⊤Ax [Andoni et al., 2016]. Meanwhile, less is understood for non-PSD matrices. These matrices are naturally associated with complex matrices: the Hessian of a non-convex optimization problem can be decomposed into H = X^⊤X where X is a matrix with complex entries, and a complex design matrix X has a non-PSD covariance matrix. However, almost all sketching techniques were developed for matrices with entries in the real field ℝ. While some results carry over to the complex numbers ℂ (e.g., Tropp et al. [2015] develops concentration bounds that work for complex matrices), many do not and seem to require non-trivial extensions. In this work, we show how to efficiently sketch non-PSD matrices and extend several existing sketching results to the complex field. We also show how to use these in optimization, for both convex and non-convex problems, in the sketch-and-solve paradigm for complex ℓ_p-regression with 1 ≤ p ≤ ∞, as well as for vector-matrix-vector product queries.

Finite-sum Optimization. We consider optimization problems of the form

    min_{x ∈ ℝ^d} F(x) ≜ (1/n) ∑_{i=1}^{n} f_i(a_i^⊤ x) + r(x),    (1)

where n ≫ d ≥ 1, each f_i : ℝ → ℝ is a smooth but possibly non-convex function, r(x) is a regularization term, and a_i ∈ ℝ^d, i = 1, …, n, are given. Problems of the form (1) are abundant in machine learning [Shalev-Shwartz and Ben-David, 2014]. Concrete examples include robust linear regression using Tukey's biweight loss [Beaton and Tukey, 1974], i.e., f_i(⟨a_i, x⟩) = (a_i^⊤ x − b_i)² / (1 + (a_i^⊤ x − b_i)²), where b_i ∈ ℝ, and non-linear binary classification [Xu et al., 2020], i.e., f_i(⟨a_i, x⟩) = (1/(1 + exp(−a_i^⊤ x)) − b_i)², where b_i ∈ {0, 1} is the class label.
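As a small illustration of the finite-sum objective (1) with the two example losses above, the following NumPy sketch evaluates F(x). It is our own toy code, not from the paper: the function names, the ridge regularizer r(x) = (λ/2)‖x‖², and the synthetic data are all assumptions made for the example.

import numpy as np

def tukey_biweight_loss(z, b):
    # f_i(a_i^T x) = (a_i^T x - b_i)^2 / (1 + (a_i^T x - b_i)^2)  [Beaton and Tukey, 1974]
    r = z - b
    return r ** 2 / (1.0 + r ** 2)

def classification_loss(z, b):
    # f_i(a_i^T x) = (1 / (1 + exp(-a_i^T x)) - b_i)^2 with b_i in {0, 1}  [Xu et al., 2020]
    return (1.0 / (1.0 + np.exp(-z)) - b) ** 2

def finite_sum_objective(A, b, x, loss, lam=1e-3):
    # F(x) = (1/n) * sum_i f_i(a_i^T x) + r(x); r(x) = (lam/2)*||x||^2 is an example regularizer
    z = A @ x                               # all inner products a_i^T x in one pass
    return loss(z, b).mean() + 0.5 * lam * float(x @ x)

# toy usage with n >> d
rng = np.random.default_rng(0)
n, d = 10_000, 20
A = rng.standard_normal((n, d))             # rows are a_i^T
b = A @ rng.standard_normal(d) + rng.standard_normal(n)
print(finite_sum_objective(A, b, rng.standard_normal(d), tukey_biweight_loss))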
By incorporating curvature information, second-order methods are gaining popularity over first-order methods in certain applications. However, when n ≫ d ≥ 1, operations involving the Hessian of F constitute a computational bottleneck. To this end, randomized Hessian approximations have shown great success in reducing computational complexity [Roosta and Mahoney, 2019, Xu et al., 2016, 2019, Pilanci and Wainwright, 2017, Erdogdu and Montanari, 2015, Bollapragada et al., 2019]. In the context of (1), it is easy to see that the Hessian of F can be written as

    ∇²F(x) = (1/n) ∑_{i=1}^{n} f_i''(a_i^⊤ x) a_i a_i^⊤ + ∇²r(x) = A^⊤ D(x) A/n + ∇²r(x),

where A = [a_1, …, a_n]^⊤ ∈ ℝ^{n×d} and D(x) = diag([f_1''(a_1^⊤ x), f_2''(a_2^⊤ x), …, f_n''(a_n^⊤ x)]) ∈ ℝ^{n×n}. Of particular interest in this work is the application of randomized matrix approximation techniques [Woodruff et al., 2014, Mahoney, 2011, Drineas and Mahoney, 2016], in particular, constructing a random sketching matrix S to ensure that

    H(x) ≜ A^⊤ D^{1/2} S^⊤ S D^{1/2} A + ∇²r(x) ≈ A^⊤ D A/n + ∇²r(x) = ∇²F(x).

Notice that D^{1/2} A may have complex entries if f_i is non-convex.
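A minimal NumPy sketch of the decomposition above, using Tukey's biweight loss so that D(x) has negative entries and the "square-root" D^{1/2}A becomes complex. For brevity, the sampling matrix S below is plain uniform row sampling; this is an assumption made only to keep the example short and is not the leverage-score-based or hybrid deterministic-randomized scheme proposed in the paper.

import numpy as np

rng = np.random.default_rng(1)
n, d = 10_000, 20
A = rng.standard_normal((n, d))              # rows are a_i^T
x = rng.standard_normal(d)
b = A @ rng.standard_normal(d) + rng.standard_normal(n)

# second derivatives f_i''(a_i^T x) for Tukey's biweight loss f(r) = r^2 / (1 + r^2):
# f''(r) = (2 - 6 r^2) / (1 + r^2)^3 can be negative, so D(x) is not PSD
r = A @ x - b
d2 = (2.0 - 6.0 * r ** 2) / (1.0 + r ** 2) ** 3

# exact (regularizer-free) Hessian term A^T D(x) A / n
H_exact = A.T @ (d2[:, None] * A) / n

# the "square-root" D^{1/2} A is complex wherever f_i'' < 0; note that the plain
# transpose (not the conjugate transpose) recovers A^T D A
B = np.sqrt(d2.astype(complex))[:, None] * A
assert np.allclose((B.T @ B).real / n, H_exact)

# sketched Hessian term A^T D^{1/2} S^T S D^{1/2} A via uniform sampling of s rows
# (the 1/sqrt(s) scaling makes the estimator unbiased for A^T D A / n)
s = 1_000
idx = rng.integers(0, n, size=s)
SB = B[idx] / np.sqrt(s)                     # S D^{1/2} A
H_sketch = (SB.T @ SB).real
print(np.linalg.norm(H_sketch - H_exact) / np.linalg.norm(H_exact))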
The Sketch-and-Solve Paradigm for Regression. In the overconstrained least squares regression problem, the task is to solve min_x ‖Ax − b‖ for some norm ‖·‖, and here we focus on the wide class of ℓ_p-norms, where for a vector y, ‖y‖_p = (∑_j |y_j|^p)^{1/p}. Setting the value p allows for adjusting the sensitivity to outliers: for p < 2 the regression problem is often considered more robust than least squares because one does not square the differences, while for p > 2 the problem is considered more sensitive to outliers than least squares. The different p-norms also have statistical motivations: for instance, the ℓ_1-regression solution is the maximum likelihood estimator given i.i.d. Laplacian noise. Approximation algorithms based on sampling and sketching have been thoroughly studied for ℓ_p-regression, see, e.g., [Clarkson, 2005, Clarkson et al., 2016, Dasgupta et al., 2009, Meng and Mahoney, 2013, Sohler and Woodruff, 2011, Woodruff and Zhang, 2013, Clarkson and Woodruff, 2017, Wang and Woodruff, 2019]. These algorithms typically follow the sketch-and-solve paradigm, whereby the dimensions of A and b are reduced, resulting in a much smaller instance of ℓ_p-regression, which is tractable. In the case of p = ∞, sketching is used inside of an optimization method to speed up linear programming-based algorithms [Cohen et al., 2019].

To highlight some of the difficulties in extending ℓ_p-regression algorithms to the complex numbers, consider two popular cases, ℓ_1- and ℓ_∞-regression. The standard way of solving these regression problems is by formulating them as linear programs. However, the complex numbers are not totally ordered, and linear programming algorithms therefore do not work with complex inputs. Stepping back, complex ℓ_p-regression problems therefore appear harder to handle than ℓ_p regressions over the reals.

Vector-matrix-vector queries. Many applications require queries of the form u^⊤ M v, which we call vector-matrix-vector queries; see, e.g., Rashtchian et al. [2020]. For example, if M is the adjacency matrix of a graph, then u^⊤ M v answers whether there exists an edge between the pair {u, v}. These queries are also useful for independent set queries, cut queries, etc. Many past works have studied how to sketch positive definite M (see, e.g., Andoni et al. [2016]), but it remains unclear how to handle the case when M is non-PSD or has complex entries.

Contributions. We consider non-PSD matrices and their "square-roots", which are complex matrices, in the context of optimization and the sketch-and-solve paradigm. Our goal is to provide tools for handling such matrices in a number of different problems, and to the best of our knowledge, this is the first work to systematically study dimensionality reduction techniques for such matrices.

For optimization of (1), where each f_i is potentially non-convex, we investigate non-uniform data-aware methods to construct a sampling matrix S based on a new concept of leverage scores for complex matrices. In particular, we propose a hybrid deterministic-randomized sampling scheme, which is shown to have important properties for optimization. We show that our sampling schemes can guarantee appropriate matrix approximations (see (4) and (5)) with competitive sampling complexities. Subsequently, we investigate the application of such sampling schemes in the context of convex and non-convex Newton-type methods for (1).

For complex ℓ_p-regression, we use Dvoretzky-type embeddings as well as an isometric embedding from ℓ_1 to ℓ_∞ to construct oblivious embeddings from an instance of a complex ℓ_p-regression problem to a real-valued ℓ_p-regression problem, for p ∈ [1, ∞]. Our algorithm runs in O(nnz(A) + poly(d/ε)) time for constant p ∈ [1, ∞), and O(nnz(A) · 2^{Õ(1/ε²)}) time for p = ∞. Here nnz(A) denotes the number of non-zero entries of the matrix A.

For vector-matrix-vector queries, we show that if the non-PSD matrix has the form M = A^⊤B, then we can approximately compute u^⊤ M v in just O(nnz(A) + n/ε²) time, whereas the naïve approach takes nd² + d² + d time.

Notation. Vectors and matrices are denoted by bold lower-case and bold upper-case letters, respectively, e.g., v and V. We use regular lower-case and upper-case letters to denote scalar constants, e.g., d or L.
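To make the vector-matrix-vector contribution above concrete, here is a small NumPy illustration of approximating u^⊤Mv for M = A^⊤B with a single shared sketch S. A dense Gaussian sketch is used purely for simplicity, so its preprocessing cost does not match the nnz(A)-type guarantee stated above; the dimensions and variable names are our own assumptions.

import numpy as np

rng = np.random.default_rng(2)
n, d = 5_000, 40
A = rng.standard_normal((n, d))
B = rng.standard_normal((n, d))
u = rng.standard_normal(d)
v = rng.standard_normal(d)

# naive approach: form M = A^T B explicitly (about n*d^2 operations), then each
# query u^T M v costs another d^2 + d operations
M = A.T @ B
exact = u @ M @ v

# sketched approach: compute SA and SB once with a k x n random sketch S satisfying
# E[(Sx)^T (Sy)] = x^T y; afterwards each query costs only O(k*d).  A Gaussian sketch
# is used for illustration; sparse sketches give nnz-time preprocessing.
k = 500
S = rng.standard_normal((k, n)) / np.sqrt(k)
SA, SB = S @ A, S @ B
approx = (SA @ u) @ (SB @ v)                 # estimates u^T A^T B v = (Au)^T (Bv)
print(exact, approx)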

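As a warm-up for the complex ℓ_p-regression discussion above, the special case p = 2 already admits a simple reduction to a real-valued problem by stacking real and imaginary parts, after which ordinary real sketch-and-solve applies. The sketch below shows only this elementary p = 2 reduction (our own illustration with a Gaussian embedding), not the Dvoretzky-type construction the paper uses for general p ∈ [1, ∞].

import numpy as np

rng = np.random.default_rng(3)
n, d = 2_000, 10
A = rng.standard_normal((n, d)) + 1j * rng.standard_normal((n, d))
b = rng.standard_normal(n) + 1j * rng.standard_normal(n)

# real embedding for p = 2: with x = xr + i*xi, each residual satisfies
# |(Ax - b)_j|^2 = Re(Ax - b)_j^2 + Im(Ax - b)_j^2, so the complex problem equals
# a real least-squares problem over the stacked unknowns [xr; xi]
A_real = np.block([[A.real, -A.imag],
                   [A.imag,  A.real]])       # (2n) x (2d)
b_real = np.concatenate([b.real, b.imag])

# sketch-and-solve on the real instance with a dense Gaussian embedding
# (illustration only; faster sparse embeddings also apply)
k = 200
S = rng.standard_normal((k, 2 * n)) / np.sqrt(k)
y = np.linalg.lstsq(S @ A_real, S @ b_real, rcond=None)[0]
x_sketch = y[:d] + 1j * y[d:]

# compare residuals against the exact complex least-squares solution
x_exact = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(A @ x_sketch - b), np.linalg.norm(A @ x_exact - b))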