Non-PSD Matrix Sketching with Applications to Regression and Optimization

Zhili Feng1 Fred Roosta2 David P. Woodruff3

1 Machine Learning Department, Carnegie Mellon University   2 School of Mathematics and Physics, University of Queensland   3 Computer Science Department, Carnegie Mellon University

Accepted for the 37th Conference on Uncertainty in Artificial Intelligence (UAI 2021).

Abstract

A variety of techniques have been applied for computations involving large matrices. The underlying matrix is randomly compressed into a smaller one, while approximately retaining many of its original properties. As a result, much of the expensive computation can be performed on the small matrix. The sketching of positive semidefinite (PSD) matrices is well understood, but there are many applications where the related matrices are not PSD, including Hessian matrices in non-convex optimization and covariance matrices in regression applications involving complex numbers. In this paper, we present novel dimensionality reduction methods for non-PSD matrices, as well as their "square-roots", which involve matrices with complex entries. We show how these techniques can be used for multiple downstream tasks. In particular, we show how to use the proposed matrix sketching techniques for both convex and non-convex optimization, ℓ_p-regression for every 1 ≤ p ≤ ∞, and vector-matrix-vector queries.

1 INTRODUCTION

Many modern machine learning tasks involve massive datasets, where an input matrix A ∈ R^{n×d} is such that n ≫ d. In a number of cases, A is highly redundant. For example, if we want to solve the ordinary least squares problem min_x ‖Ax − b‖₂², one can solve it exactly given only A^⊤A and A^⊤b. To exploit this redundancy, numerous techniques have been developed to reduce the size of A. Such dimensionality reduction techniques are used to speed up various optimization tasks and are often referred to as sketching; for a survey, see Woodruff et al. [2014].

A lot of previous work has focused on sketching PSD matrices, for example, the Hessian matrices in convex optimization [Xu et al., 2016], the covariance matrices X^⊤X in regression over the reals, and quadratic form queries x^⊤Ax [Andoni et al., 2016]. Meanwhile, less is understood for non-PSD matrices. These matrices are naturally associated with complex matrices: the Hessian of a non-convex optimization problem can be decomposed into H = X^⊤X where X is a matrix with complex entries, and a complex design matrix X has a non-PSD covariance matrix. However, almost all sketching techniques were developed for matrices with entries in the real field R. While some results carry over to the complex numbers C (e.g., Tropp et al. [2015] develop concentration bounds that work for complex matrices), many do not and seem to require non-trivial extensions. In this work, we show how to efficiently sketch non-PSD matrices and extend several existing sketching results to the complex field. We also show how to use these results in optimization, for both convex and non-convex problems, in the sketch-and-solve paradigm for complex ℓ_p-regression with 1 ≤ p ≤ ∞, and for vector-matrix-vector product queries.

Finite-sum Optimization. We consider optimization problems of the form

    min_{x ∈ R^d} F(x) ≜ (1/n) Σ_{i=1}^{n} f_i(a_i^⊤ x) + r(x),    (1)

where n ≫ d ≥ 1, each f_i : R → R is a smooth but possibly non-convex function, r(x) is a regularization term, and a_i ∈ R^d, i = 1, . . . , n, are given. Problems of the form (1) are abundant in machine learning [Shalev-Shwartz and Ben-David, 2014]. Concrete examples include robust linear regression using Tukey's biweight loss [Beaton and Tukey, 1974], i.e., f_i(⟨a_i, x⟩) = (a_i^⊤x − b_i)² / (1 + (a_i^⊤x − b_i)²), where b_i ∈ R, and non-linear binary classification [Xu et al., 2020], i.e., f_i(⟨a_i, x⟩) = (1/(1 + exp(−a_i^⊤x)) − b_i)², where b_i ∈ {0, 1} is the class label. By incorporating curvature information, second-order methods are gaining popularity over first-order methods in certain applications. However, when n ≫ d ≥ 1, operations involving the Hessian of F constitute a computational bottleneck.

To this end, randomized Hessian approximations have shown great success in reducing computational complexity [Roosta and Mahoney, 2019, Xu et al., 2016, 2019, Pilanci and Wainwright, 2017, Erdogdu and Montanari, 2015, Bollapragada et al., 2019]. In the context of (1), it is easy to see that the Hessian of F can be written as ∇²F(x) = Σ_{i=1}^{n} f_i''(a_i^⊤x) a_i a_i^⊤ / n + ∇²r(x) = A^⊤D(x)A/n + ∇²r(x), where A = [a_1, . . . , a_n]^⊤ ∈ R^{n×d} and D(x) = diag([f_1''(a_1^⊤x), f_2''(a_2^⊤x), . . . , f_n''(a_n^⊤x)]) ∈ R^{n×n}. Of particular interest in this work is the application of randomized matrix approximation techniques [Woodruff et al., 2014, Mahoney, 2011, Drineas and Mahoney, 2016], in particular, constructing a random sketching matrix S to ensure that H(x) ≜ A^⊤D^{1/2}S^⊤SD^{1/2}A + ∇²r(x) ≈ A^⊤DA/n + ∇²r(x) = ∇²F(x). Notice that D^{1/2}A may have complex entries if f_i is non-convex.

The Sketch-and-Solve Paradigm for Regression. In the overconstrained least squares regression problem, the task is to solve min_x ‖Ax − b‖ for some norm ‖·‖, and here we focus on the wide class of ℓ_p-norms, where for a vector y, ‖y‖_p = (Σ_j |y_j|^p)^{1/p}. Setting the value of p allows for adjusting the sensitivity to outliers; for p < 2 the regression problem is often considered more robust than least squares because one does not square the differences, while for p > 2 the problem is considered more sensitive to outliers than least squares. The different p-norms also have statistical motivations: for instance, the ℓ₁-regression solution is the maximum likelihood estimator given i.i.d. Laplacian noise. Approximation algorithms based on sampling and sketching have been thoroughly studied for ℓ_p-regression, see, e.g., [Clarkson, 2005, Clarkson et al., 2016, Dasgupta et al., 2009, Meng and Mahoney, 2013, Sohler and Woodruff, 2011, Woodruff and Zhang, 2013, Clarkson and Woodruff, 2017, Wang and Woodruff, 2019]. These algorithms typically follow the sketch-and-solve paradigm, whereby the dimensions of A and b are reduced, resulting in a much smaller instance of ℓ_p-regression, which is tractable. In the case of p = ∞, sketching is used inside of an optimization method to speed up linear programming-based algorithms [Cohen et al., 2019].

To highlight some of the difficulties in extending ℓ_p-regression algorithms to the complex numbers, consider two popular cases, ℓ₁- and ℓ∞-regression. The standard way of solving these regression problems is by formulating them as linear programs. However, the complex numbers are not totally ordered, and linear programming algorithms therefore do not work with complex inputs. Stepping back, what even is the meaning of the ℓ_p-norm of a complex vector y? In the definition above, ‖y‖_p = (Σ_j |y_j|^p)^{1/p}, and |y_j| denotes the modulus of the complex number, i.e., if y_j = a + b·i, where i = √−1, then |y_j| = √(a² + b²). Thus the ℓ_p-regression problem is really a question about minimizing the p-norm of a vector of Euclidean lengths, and as we show later, this problem is very different than ℓ_p regression over the reals.

Vector-matrix-vector queries. Many applications require queries of the form u^⊤Mv, which we call vector-matrix-vector queries; see, e.g., Rashtchian et al. [2020]. For example, if M is the adjacency matrix of a graph, then u^⊤Mv answers whether there exists an edge between the pair {u, v}. These queries are also useful for independent set queries, cut queries, etc. Many past works have studied how to sketch positive definite M (see, e.g., Andoni et al. [2016]), but it remains unclear how to handle the case when M is non-PSD or has complex entries.

Contributions. We consider non-PSD matrices and their "square-roots", which are complex matrices, in the context of optimization and the sketch-and-solve paradigm. Our goal is to provide tools for handling such matrices in a number of different problems, and to the best of our knowledge, this is the first work to systematically study dimensionality reduction techniques for such matrices.

For optimization of (1), where each f_i is potentially non-convex, we investigate non-uniform data-aware methods to construct a sampling matrix S based on a new concept of leverage scores for complex matrices. In particular, we propose a hybrid deterministic-randomized sampling scheme, which is shown to have important properties for optimization. We show that our sampling schemes can guarantee appropriate matrix approximations (see (4) and (5)) with competitive sampling complexities. Subsequently, we investigate the application of such sampling schemes in the context of convex and non-convex Newton-type methods for (1).

For complex ℓ_p-regression, we use Dvoretsky-type embeddings as well as an isometric embedding from ℓ₁ into ℓ∞ to construct oblivious embeddings from an instance of a complex ℓ_p-regression problem to a real-valued ℓ_p-regression problem, for p ∈ [1, ∞]. Our algorithm runs in O(nnz(A) + poly(d/ε)) time for constant p ∈ [1, ∞), and O(nnz(A)·2^{Õ(1/ε²)}) time for p = ∞. Here nnz(A) denotes the number of non-zero entries of the matrix A.

For vector-matrix-vector queries, we show that if the non-PSD matrix has the form M = A^⊤B, then we can approximately compute u^⊤Mv in just O(nnz(A) + n/ε²) time, whereas the naïve approach takes nd² + d² + d time.
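Before setting up notation, the following small numerical sketch (our own illustration, not from the paper; it assumes only NumPy, and the problem sizes are arbitrary) makes the central object concrete: for a non-convex finite sum the Gauss-Newton-style Hessian A^⊤DA/n is indefinite, its "square-root" D^{1/2}A has complex entries, and the plain (non-conjugated) product (D^{1/2}A)^⊤(D^{1/2}A) recovers A^⊤DA, whereas the conjugated product gives the PSD matrix A^⊤|D|A.

```python
# Illustrative sketch (not from the paper): the complex "square-root" D^{1/2}A
# of a non-PSD Hessian-type matrix A^T D A.  Assumes only NumPy.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.standard_normal((n, d))
# Second derivatives f_i'' can be negative when f_i is non-convex.
D_diag = rng.uniform(-1.0, 1.0, size=n)

H = A.T @ (D_diag[:, None] * A) / n                 # A^T D A / n, symmetric, typically indefinite
B = np.sqrt(D_diag.astype(complex))[:, None] * A    # D^{1/2} A has complex entries where f_i'' < 0

eigs = np.linalg.eigvalsh(H)
print("smallest/largest eigenvalue of A^T D A / n:", eigs[0], eigs[-1])  # typically mixed signs
# B^T B (plain transpose, NOT conjugate transpose) recovers the real matrix A^T D A:
print(np.allclose((B.T @ B) / n, H))                # True
# while B* B = A^T |D| A is PSD and can be very different from A^T D A:
print(np.allclose((B.conj().T @ B) / n, A.T @ (np.abs(D_diag)[:, None] * A) / n))  # True
```

The distinction between B^⊤B and B*B in this snippet is exactly the distinction the sampling results below are sensitive to.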
Notation. Vectors and matrices are denoted by bold lower-case and bold upper-case letters, respectively, e.g., v and V. We use regular lower-case and upper-case letters to denote scalar constants, e.g., d or L. For a complex vector v, its real and conjugate transposes are respectively denoted by v^⊤ and v*. For two vectors v, w, their inner product is denoted by ⟨v, w⟩. For a vector v and a matrix V, ‖v‖_p, ‖V‖, and ‖V‖_F denote the vector ℓ_p norm, the matrix spectral norm, and the Frobenius norm, respectively. For ‖v‖₂, we write ‖v‖ as an abbreviation. Let |V| denote the entry-wise modulus of the matrix V. Let V_{i,j} denote the (i, j)-th entry, V_i = V_{i,∗} be the i-th row, and V_{∗,j} be the j-th column. The iteration counter for the main algorithm appears as a subscript, e.g., x_k. For two symmetric matrices A and B, the Löwner partial order A ⪰ B indicates that A − B is symmetric positive semi-definite. A† denotes the Moore-Penrose generalized inverse of the matrix A. For a scalar d, we let poly(d) be a polynomial in d. We let diag(·) denote a diagonal matrix. Here we give the necessary definitions.

Definition 1 (Well-conditioned basis and ℓ_p leverage scores). An n × d matrix U is an (α, β, p)-well-conditioned basis for the column span of A if (i) (Σ_{i∈[n]} Σ_{j∈[d]} |U_{ij}|^p)^{1/p} ≤ α; (ii) for all x ∈ R^d, ‖x‖_q ≤ β‖Ux‖_p, where 1/p + 1/q = 1; and (iii) the column span of U is equal to the column span of A. For such a well-conditioned basis, ‖U_{i∗}‖_p^p is defined to be the ℓ_p leverage score of the i-th row of A. The ℓ_p leverage scores are not invariant to the choice of well-conditioned basis.

Definition 2 (ℓ_p Auerbach basis). An Auerbach basis A of U ∈ R^{n×d} is such that: (i) span(U) = span(A); (ii) for all j ∈ [d], ‖A_{∗j}‖_p = 1; and (iii) for all x ∈ R^d, d^{−1/q}‖x‖_q ≤ ‖x‖_∞ ≤ ‖Ax‖_p, where 1/p + 1/q = 1.

Definition 3 (ℓ_p-subspace embedding). Let A ∈ R^{n×d} and S ∈ C^{s×n}. We call S an ε ℓ_p-subspace embedding if for all x ∈ C^d, ‖Ax‖_p ≤ ‖SAx‖_p ≤ (1 + ε)‖Ax‖_p.

2 SKETCHING NON-PSD HESSIANS FOR NON-CONVEX OPTIMIZATION

We first present our sketching strategies and then apply them to an efficient solution of (1) using different optimization algorithms. All the proofs are in the supplementary material.

2.1 COMPLEX LEVERAGE SCORE SAMPLING

Algorithm 1 Construct Leverage Score Sampling Matrix
1: Input: D^{1/2}A ∈ C^{n×d}, number s of samples, empty matrices R ∈ R^{s×s} and Ω ∈ R^{n×s}
2: Compute the SVD of D^{1/2}A = UΣV*
3: for i ∈ [n] do
4:   Calculate the i-th leverage score ℓ_i = ‖U_{i,∗}‖²
5: for j ∈ [s] do
6:   Pick row i independently and with replacement with probability p_i = ℓ_i / Σ_i ℓ_i
7:   Set Ω_{i,j} = 1 and R_{j,j} = 1/√(s·p_i)
8: Output: S = R · Ω^⊤

It is well-known that leverage score sampling gives an ε ℓ₂-subspace embedding for real matrices with high probability; see, e.g., Woodruff et al. [2014]. Here we extend the result to the complex field:

Theorem 1. For i ∈ [n], let ℓ̃_i ≥ ℓ_i be a constant-factor overestimate of the leverage score of the i-th row of B ∈ C^{n×d}. Assume that B^⊤B ∈ R^{d×d}. Let p_i = ℓ̃_i / Σ_{j∈[n]} ℓ̃_j and t = c·d·ε^{−2} log(d/δ) for a large enough constant c. We sample t rows of B, where row i is sampled with probability p_i and rescaled by 1/√(t·p_i). Denote the sampled matrix by C. Then with probability 1 − δ, C satisfies: B^⊤B − ε·B*B ⪯ C^⊤C ⪯ B^⊤B + ε·B*B.

Theorem 2. Under the same assumptions and notation as in Theorem 1, let t = c·d·γ·ε^{−2} log(d/δ), where γ = Σ_{i∈[n]} ‖B_i* B_i‖₂ / ℓ̃_i. Then with probability 1 − δ, C satisfies ‖C^⊤C − B^⊤B‖ < ε.

Remark 1. It is hard to directly compare Theorem 2 to the sample complexity of Xu et al. [2019], which requires t ≥ (K_b/ε²)·2 log(2d/δ), where K_b ≥ ‖B*B‖ = O(σ₁²). To apply Theorem 2 to a Hessian of the form A^⊤DA, one should set B = D^{1/2}A. Compared to the previously proposed sketching A^⊤S^⊤SDA for non-convex F, our proposed sketching A^⊤D^{1/2}S^⊤SD^{1/2}A often has better performance in practice (see Section 2.2). We conjecture this is because B has several large singular values, but many rows have small row norms ‖B_i‖² ≪ ℓ̃_i; hence d·γ can be much smaller than K_b.

Note that the guarantee of Theorem 1 cannot be achieved by row-norm sampling. Consider B = diag(∞, 1). Then row-norm sampling will never sample the second row, yet the leverage scores of both rows are 1. All leverage scores can be computed up to a constant factor simultaneously in O((nnz(A) + d²) log n) time. See the appendix for details.
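The short sketch below (our own illustration, assuming only NumPy; the sample size and problem dimensions are arbitrary) implements the sampling step of Algorithm 1 for a complex B = D^{1/2}A and empirically tracks the spectral-norm error ‖C^⊤C − B^⊤B‖ that Theorem 2 controls. In practice, one would use the O((nnz(A) + d²) log n)-time approximate leverage scores mentioned above rather than a full SVD.

```python
# Minimal sketch of Algorithm 1 (complex leverage score sampling); illustrative only.
import numpy as np

def leverage_scores(B):
    """Exact leverage scores of the rows of a (possibly complex) matrix B via a thin SVD."""
    U, _, _ = np.linalg.svd(B, full_matrices=False)
    return np.sum(np.abs(U) ** 2, axis=1)            # ell_i = ||U_{i,*}||^2

def leverage_sample(B, s, rng):
    """Sample s rows of B with probability p_i ~ ell_i, rescaled by 1/sqrt(s * p_i)."""
    ell = leverage_scores(B)
    p = ell / ell.sum()
    idx = rng.choice(B.shape[0], size=s, replace=True, p=p)
    return B[idx] / np.sqrt(s * p[idx])[:, None]     # rows of C = S D^{1/2} A

rng = np.random.default_rng(1)
n, d, s = 20000, 10, 2000
A = rng.standard_normal((n, d))
D_diag = rng.uniform(-1.0, 1.0, size=n)              # f_i'' may be negative
B = np.sqrt(D_diag.astype(complex))[:, None] * A     # complex square-root D^{1/2} A

C = leverage_sample(B, s, rng)
err = np.linalg.norm(C.T @ C - B.T @ B, 2) / np.linalg.norm(B.T @ B, 2)
print("relative spectral error of the sampled A^T D A:", err)
```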

Hybrid Randomized-Deterministic Sampling. We propose Algorithm 2 to speed up the approximation of Hessian matrices by deterministically selecting the "heavy" rows. The proposed method provably outperforms the vanilla leverage score sampling algorithm under a relaxed RIP condition.

Algorithm 2 Hybrid Randomized-Deterministic Sampling (LS-Det)
1: Input: D^{1/2}A ∈ C^{n×d}, iteration number T, threshold m, precision ε, number k of rows left
2: Set k = n
3: for t ∈ [T] do
4:   Calculate the leverage scores {ℓ_1, . . . , ℓ_k} of D^{1/2}A
5:   for i ∈ [k] do
6:     if ℓ_i ≥ m then
7:       Select row i, set D^{1/2}A to be the set of remaining rows, set k = k − 1
8: Sample h = O(d/ε²) rows from the remaining rows using either their leverage score distribution, or uniformly at random and scaled by √(n/h)
9: Output: The set of sampled and rescaled rows

Theorem 3. Let A ∈ R^{n×d} and D = diag(d_1, . . . , d_n) ∈ R^{n×n}. For any matrix A and any index set N, let A_N ∈ R^{n×d} be such that for all i ∈ N, (A_N)_i = A_i, and all other rows of A_N are 0. Suppose A^⊤DA = Σ_{i=1}^{T} E^i + N, where E^i = A_{E_i}^⊤ D_{E_i} A_{E_i} ∈ R^{d×d}, E_i is an index set of size O(d) (that is, at each outer iteration of step 3 in Algorithm 2, we deterministically select at most |E_i| = O(d) rows), E = ∪_{i=1}^{T} E_i, N = {1, . . . , n} \ E, and N = A_N^⊤ D_N A_N ∈ R^{d×d}.

Assume D_N^{1/2} A_N has the following relaxed restricted isometry property (RIP) with parameter ρ: with probability 1 − 1/n over uniformly random sampling matrices S with t = O(d‖A^⊤DA‖/ε²) rows, each scaled by √(n/t), we have for all x ∉ kernel(D_N^{1/2} A_N) that ‖S D_N^{1/2} A_N x‖ = (1 ± ε)ρ‖x‖.

Also assume that, for some constant c > 1, c‖A_N^⊤ |D_N| A_N‖ ≤ ‖Σ_i E^i‖. Then the sketch can be expressed as Σ_i E^i + A_N^⊤ D_N^{1/2} S^⊤ S D_N^{1/2} A_N and, with probability 1 − O(1/d), we have ‖Σ_i E^i + A_N^⊤ D_N^{1/2} S^⊤ S D_N^{1/2} A_N − A^⊤DA‖ ≤ ε.

Remark 2. The takeaway from Theorem 3 is that the total sample complexity of such a sampling scheme, i.e., Algorithm 2, is O(Td + d‖A^⊤DA‖/ε²) = O(d‖A^⊤DA‖) for constant ε and T. On the other hand, the vanilla leverage score sampling scheme requires Ω(dγ log d) rows. Notice that often γ ≫ ‖A^⊤DA‖ because γ involves ‖A^⊤|D|A‖. Although T is a tunable parameter, we found in the experiments that T = 1 performs well.
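A compact sketch of Algorithm 2 with T = 1 follows (our illustration, assuming only NumPy; the threshold m and sample size h are left as inputs, and the random part here uses the leverage-score option with the usual 1/√(h·p_i) rescaling). Heavy rows are kept exactly, so their contribution Σ_i E^i to A^⊤DA is computed without error, while the remaining rows are leverage-sampled.

```python
# Minimal sketch of Algorithm 2 (LS-Det) with T = 1; illustrative only.
import numpy as np

def leverage_scores(B):
    U, _, _ = np.linalg.svd(B, full_matrices=False)
    return np.sum(np.abs(U) ** 2, axis=1)

def hybrid_sample(B, m, h, rng):
    """Keep rows of B with leverage score >= m exactly; leverage-sample h of the rest."""
    ell = leverage_scores(B)
    heavy = ell >= m
    E_rows = B[heavy]                              # deterministic part, no rescaling
    B_rest, ell_rest = B[~heavy], ell[~heavy]
    p = ell_rest / ell_rest.sum()
    idx = rng.choice(B_rest.shape[0], size=h, replace=True, p=p)
    N_rows = B_rest[idx] / np.sqrt(h * p[idx])[:, None]
    return np.vstack([E_rows, N_rows])

# Usage: with B = D^{1/2} A, the hybrid estimate of A^T D A is sketch.T @ sketch,
# i.e. the exact heavy-row term plus the randomized estimate of the light-row term.
```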

Which Matrix to Sketch? We give a general rule of thumb that guides which matrix we should sample to get a better sample complexity. Recall that the Hessian matrix we try to sketch is of the form A^⊤DA, where D is diagonal. There are two natural candidates for the sampling probabilities:

    (s(A_i) + s((DA)_i)) / Σ_i (s(A_i) + s((DA)_i))    (2)
    s((D^{1/2}A)_i) / Σ_i s((D^{1/2}A)_i)    (3)

for a score function s : R^d → R (e.g., leverage scores or row norms). It turns out that sampling according to (2) can lead to an arbitrarily worse upper bound than sampling using (3).

Theorem 4. Let A ∈ R^{n×d} and let D ∈ R^{n×n} be diagonal. Let T be an ε ℓ₂-subspace embedding for span(A, DA) and S be an ε ℓ₂-subspace embedding for span(D^{1/2}A, (D^{1/2})*A). Sampling by (2) can give an arbitrarily worse upper bound than sampling by (3).

2.2 APPLICATION TO OPTIMIZATION ALGORITHMS

As mentioned previously, to accelerate the convergence of second-order methods with an inexact Hessian, one needs to construct the sub-sampled matrix such that H(x) ≈ ∇²F(x). In randomized sub-sampling of the Hessian matrix, we select the i-th term in Σ_{i=1}^{n} ∇²f_i(x) with probability p_i, requiring Σ_{i∈[n]} p_i = 1. Let S denote the sample collection and define H(x) ≜ (1/(n|S|)) Σ_{i∈S} (1/p_i) ∇²f_i(x) + ∇²r(x). Uniform oblivious sampling is done with p_i = 1/n, which often results in a poor approximation unless |S| ≫ 1. Leverage score sampling is, in some sense, an optimal data-aware sampling scheme where each p_i is proportional to the leverage score ℓ_i (see Algorithm 1).

One condition on the quality of the approximation H(x) ≈ ∇²F(x) is typically taken to be

    ‖H(x) − ∇²F(x)‖ ≤ ε, for some 0 < ε ≤ 1,    (4)

which has been considered both in the contexts of convex and non-convex Newton-type optimization methods [Roosta and Mahoney, 2019, Bollapragada et al., 2019, Xu et al., 2019, Yao et al., 2018, Liu and Roosta, 2019]. For convex settings where ∇²F(x) ⪰ 0, a stronger condition can be considered:

    (1 − ε)∇²F(x) ⪯ H(x) ⪯ (1 + ε)∇²F(x),    (5)

which, in the context of sub-sampled Newton's method, leads to a faster convergence rate than (4) [Roosta and Mahoney, 2019, Xu et al., 2016, Liu et al., 2017]. However, in all prior work, (5) has only been considered in the restricted case where each f_i is convex. Here, using the results of Section 2.1, we show that (5) can also be guaranteed in the more general case where the f_i's in (1) are allowed to be non-convex. We demonstrate the theoretical advantages of complex leverage score sampling in Algorithms 1 and 2 as a way to guarantee (4) and (5) in convex and non-convex settings, respectively. For the convex case, we consider sub-sampled Newton-CG [Roosta and Mahoney, 2019, Xu et al., 2016]. For non-convex settings, we have chosen two examples of Newton-type methods: the classical trust-region method [Conn et al., 2000] and the more recently introduced Newton-MR method [Roosta et al., 2018]. We emphasize that the choice of these non-convex algorithms was, to an extent, arbitrary, and we could instead have picked any Newton-type method whose convergence has been previously studied under Hessian approximation models, e.g., adaptive cubic regularization [Yao et al., 2018, Tripuraneni et al., 2018]. The details of these optimization methods and theoretical convergence results are deferred to the supplementary material.
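As a concrete instance of the sub-sampled Hessian defined above, the sketch below (our illustration, assuming only NumPy; the sizes, seed, and the ridge term standing in for ∇²r are arbitrary choices) forms H(x) for the non-linear least-squares classification loss f_i(⟨a_i, x⟩) = (1/(1 + exp(−a_i^⊤x)) − b_i)² from (1), for an arbitrary importance distribution p. Taking p proportional to the complex leverage scores of D^{1/2}A gives the LS scheme evaluated in the experiments; p_i = 1/n gives the Uniform baseline.

```python
# Illustrative sub-sampled Hessian H(x) for non-linear least-squares classification.
import numpy as np

def second_derivs(A, b, x):
    """f_i''(a_i^T x) for f_i(t) = (sigmoid(t) - b_i)^2; may be negative (non-convex)."""
    s = 1.0 / (1.0 + np.exp(-(A @ x)))
    ds = s * (1 - s)                      # sigma'
    d2s = ds * (1 - 2 * s)                # sigma''
    return 2.0 * (ds ** 2 + (s - b) * d2s)

def subsampled_hessian(A, b, x, p, size, rng, lam=1e-3):
    """H(x) = (1/(n*|S|)) sum_{i in S} (1/p_i) f_i''(a_i^T x) a_i a_i^T + lam*I."""
    n, d = A.shape
    S = rng.choice(n, size=size, replace=True, p=p)
    w = second_derivs(A, b, x)[S] / p[S]
    H = (A[S] * w[:, None]).T @ A[S] / (n * size)
    return H + lam * np.eye(d)            # lam*I plays the role of grad^2 r(x)

rng = np.random.default_rng(2)
n, d = 50000, 20
A = rng.standard_normal((n, d))
b = rng.integers(0, 2, size=n).astype(float)
x = rng.standard_normal(d)
H_unif = subsampled_hessian(A, b, x, np.full(n, 1.0 / n), size=500, rng=rng)
```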
We verify the results of Section 2.1 by evaluating the empirical performance of the proposed non-uniform sampling strategies in the context of Newton-CG, Newton-MR, and trust-region; see the appendix for details.

Sub-sampling Schemes. We focus on several sub-sampling strategies (all done with replacement). Uniform: we set p_i = 1/n, i = 1, . . . , n. Leverage Scores (LS): complex leverage score sampling using the leverage scores of D^{1/2}A, as in Algorithm 1. Row Norms (RN): row-norm sampling of D^{1/2}A using (3), where s((D^{1/2}A)_i) = |f_i''(⟨a_i, x⟩)|·‖a_i‖₂². Mixed Leverage Scores (LS-MX): a mixed leverage score sampling strategy arising from a non-symmetric viewpoint of the product A^⊤(DA), using (2) with s(A_i) = ℓ_i(A) and s((DA)_i) = ℓ_i(DA). Mixed Row Norms (RN-MX): a mixed row-norm sampling strategy with the same non-symmetric viewpoint as in (2), with s(A_i) = ‖(A)_i‖ and s((DA)_i) = ‖(DA)_i‖. Hybrid Randomized-Deterministic (LS-Det): sampling using Algorithm 2. Full: the exact Hessian is used.

Model Problems and Datasets. We consider the task of binary classification using the non-linear least squares formulation of (1). Numerical experiments in this section are done using covertype, Drive Diagnostics, and UJIIndoorLoc from the UC Irvine ML Repository [Dua and Graff, 2017].

Performance Evaluation. For Newton-MR, convergence is measured by the norm of the gradient, and hence we evaluate it under the various sampling schemes by plotting ‖∇F(x_k)‖ vs. the total number of oracle calls. For Newton-CG and trust-region, which guarantee descent in the objective function, we plot F(x_k) vs. the total number of oracle calls. We deliberately choose not to use "wall-clock" time since it heavily depends on implementation details and system specifications.

Comparison Among Various Sketching Techniques. We present empirical evaluations of Uniform, LS, RN, LS-MX, RN-MX, and Full sampling in the context of Newton-MR in Figure 1, and an evaluation of all sampling schemes with Newton-CG, Newton-MR, trust-region, and hybrid sampling on covertype in Figure 2. For all algorithms, LS and RN sampling yield a more efficient algorithm than their LS-MX and RN-MX variants, respectively, and at times this difference is more pronounced than at others, as predicted in Theorem 4. Meanwhile, LS and LS-MX often outperform RN and RN-MX, as proven in Theorem 1 and Theorem 2.

Figure 1: Comparison of Newton-MR with various sampling schemes; ‖∇F(x_k)‖ vs. oracle calls on (a) Drive Diagnostics, (b) Cover Type, and (c) UJIIndoorLoc.

Evaluation of Hybrid Sketching Techniques. To verify the result on Algorithm 2, we evaluate the performance of the trust-region algorithm while varying the terms involved in E, where we call the rows with large leverage scores heavy and denote the matrix formed by the heavy rows by E. The matrix formed by the remaining rows is denoted by N; see Theorem 3 for details. We do this for a simple splitting H = E + N. We fix the overall sample size and change the fraction of samples that are deterministically picked in E. The results are depicted in Figure 2. The value in brackets after LS-Det is the fraction of samples that are included in E, i.e., deterministic samples; "LS-Det (0)" and "LS-Det (1)" correspond to E = 0 and N = 0, respectively. The latter strategy has been used in low-rank matrix approximations [McCurdy, 2018]. As can be seen, the hybrid sampling approach is always competitive with, and at times strictly better than, LS-Det (0). As expected, LS-Det (1), which amounts to entirely deterministic samples, consistently performs worse. This can be easily attributed to the high bias of such a deterministic estimator.



Figure 2: Comparison of (a) Newton-CG (F(x_k) vs. oracle calls), (b) Newton-MR (‖∇F(x_k)‖ vs. oracle calls), and (c) trust-region (F(x_k) vs. oracle calls) using various sampling schemes, and (d) performance of the hybrid sampling scheme (F(x_k) vs. oracle calls).

3 SKETCH-AND-SOLVE PARADIGM FOR COMPLEX REGRESSION

3.1 THEORETICAL RESULTS

Recall the ℓ_p-regression problem:

    min_x ‖Ax − b‖_p    (6)

Here we consider the complex version: A ∈ C^{n×d}, b ∈ C^n, x ∈ C^d.

The inner product over the complex field can be embedded into a higher-dimensional real vector space. It suffices to consider the scalar case.

Lemma 1. Let x, y ∈ C, where x = a + bi and y = c + di. Let φ : C → R² via φ(x) = φ(a + bi) = [a  b], and let σ : C → R^{2×2} via σ(y) = σ(c + di) = [c, d; −d, c] (rows separated by semicolons). Then φ and σ are bijections (between their domains and images), and we have φ(ȳx)^⊤ = σ(y)φ(x)^⊤.

Proof of Lemma 1. It is clear that φ and σ are bijections between their domains and images. Moreover, σ(y)φ(x)^⊤ = [c, d; −d, c]·[a, b]^⊤ = [ac + bd, cb − ad]^⊤ = φ(ȳx)^⊤.

We apply σ to each entry of A and concatenate in the natural way. Abusing notation, we then write A' = σ(A) ∈ R^{2n×2d}. Similarly, we write x' = φ(x) ∈ R^{2d} and b' = φ(b) ∈ R^{2n}.

Fact 1. For 1 ≤ p < q ≤ ∞ and x ∈ R^d, we have ‖x‖_q ≤ ‖x‖_p ≤ d^{1/p−1/q}‖x‖_q.

Fact 2. An ℓ_p Auerbach basis is well-conditioned and always exists.

Definition 4. Let x ∈ R^{2d} for some d ∈ N. Define |||x|||_{p,2} = (Σ_{i=0}^{d−1} (x_{2i+1}² + x_{2i+2}²)^{p/2})^{1/p}. Note that, letting x ∈ C^d, we have ‖x‖_p = |||φ(x)|||_{p,2}.

By Definition 4, we can "lift" the original ℓ_p-regression to R^{2d} and solve min_{φ(x)∈R^{2d}} |||σ(A)φ(x) − φ(b)|||_{p,2} instead of (6). This equivalence allows us to consider sketching techniques on real matrices with proper modification. See Algorithm 3 for details. In turn, such an embedding gives an arbitrarily good approximation with high probability, as shown in Theorem 5.
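The following small numerical check (our own illustration, assuming only NumPy) builds φ and σ entry-wise and verifies the identity of Lemma 1 as well as the norm identity ‖v‖_p = |||φ(v)|||_{p,2} of Definition 4 on random complex data.

```python
# Numerical check of Lemma 1 and Definition 4; illustrative only.
import numpy as np

def phi(z):
    """phi: C^d -> R^{2d}, interleaving the real and imaginary parts of each entry."""
    z = np.atleast_1d(z)
    return np.column_stack([z.real, z.imag]).ravel()

def sigma_scalar(y):
    """sigma: C -> R^{2x2}, mapping c + di to [[c, d], [-d, c]]."""
    return np.array([[y.real, y.imag], [-y.imag, y.real]])

rng = np.random.default_rng(3)
x = complex(rng.standard_normal(), rng.standard_normal())
y = complex(rng.standard_normal(), rng.standard_normal())
# Lemma 1: phi(conj(y) * x)^T = sigma(y) phi(x)^T
print(np.allclose(phi(np.conj(y) * x), sigma_scalar(y) @ phi(x)))        # True

# Definition 4: for v in C^d, ||v||_p equals the p-norm of the pairwise 2-norms of phi(v).
v = rng.standard_normal(7) + 1j * rng.standard_normal(7)
p = 3.0
pairs = phi(v).reshape(-1, 2)
lhs = np.sum(np.abs(v) ** p) ** (1 / p)
rhs = np.sum(np.linalg.norm(pairs, axis=1) ** p) ** (1 / p)
print(np.isclose(lhs, rhs))                                              # True
```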
Algorithm 3 Fast Algorithm for Complex ℓ_p-regression
1: Input: A' ∈ R^{2n×2d}, b' ∈ R^{2n}, precision ε > 0, p ∈ [1, ∞]
2: Set t ≥ C log(1/ε)·poly(d)/ε², γ = d^{−1/q−1}, σ^p = √π / (2^{p/2}·Γ((p+1)/2)), and P = {P = (2i + 1, 2i + 2) : i ∈ {0, . . . , n − 1}}
3: if p ≠ ∞ then
4:   Compute the ℓ_p leverage scores ℓ([A'_i b'_i]) of the rows of [A' b']
5:   Let P_h be the collection of pairs P = (a, b) ∈ P such that ℓ([A'_a b'_a]) ≥ γ or ℓ([A'_b b'_b]) ≥ γ. Denote the collection of all other pairs by P_l
6:   Rearrange [A' b'] such that the rows in P_h are on top
7:   for each P_i ∈ P_h do
8:     Sample a Gaussian random matrix G_{P_i} ∈ R^{t×2}, where each entry follows N(0, σ²)
9:   for each Q_i ∈ P_l do
10:    Sample a Gaussian random matrix G_{Q_i} ∈ R^{1×2}, where each entry follows N(0, σ²)
11:   Define the block diagonal matrix G = diag(G_{P_1}, . . . , G_{P_{|P_h|}}, G_{Q_1}, . . . , G_{Q_{|P_l|}})
12:   Output: the sketched instance ‖GA'y − Gb'‖_p
13: else
14:   for each P_i ∈ P do
15:     Sample a Gaussian matrix G_{P_i} ∈ R^{s×2} with entries from N(0, π/2), where s = O(log(1/ε)/ε²)
16:     Let R_{P_i} ∈ R^{2^s×s}, where each row of R_{P_i} is a vector in {−1, +1}^s
17:   Let R = diag(R_1, . . . , R_{|P|}) and G = diag(G_1, . . . , G_{|P|})
18:   Output: the sketched instance ‖RGA'y − RGb'‖_∞

Theorem 5. Let A ∈ C^{n×d} and b ∈ C^n. Then Algorithm 3 with input A' := σ(A) and b' := φ(b) returns a regression instance whose optimizer is an ε-approximation to (6), with probability at least 0.98. The total time complexity for p ∈ [1, ∞) is O(nnz(A) + poly(d/ε)); for p = ∞ it is O(2^{Õ(1/ε²)}·nnz(A)). The returned instance can then be optimized by any ℓ_p-regression solver.

Proof. For simplicity, we let A ∈ R^{2n×2d} to avoid repeatedly writing the prime symbol in A'. Let y ∈ R^{2n} be arbitrary, and let the pair set P and the collections P_h, P_l be as defined in Algorithm 3. We say a pair P = (a, b) is heavy if P ∈ P_h, and light otherwise.

As an overview, when p ∈ [1, ∞), for the heavy pairs we use a large Gaussian matrix and apply Dvoretsky's theorem to show that the |||·|||_{p,2} norm is preserved. For the light pairs, we use a single Gaussian vector and Bernstein's concentration. This is intuitive, since the heavy pairs represent the important directions in A, and hence we need more Gaussian vectors to preserve their norms more accurately; the light pairs are less important, and their variance can be averaged across multiple coordinates, so one Gaussian vector suffices for each light pair. For p = ∞, we need to preserve the ℓ₂ norm of every pair, and so in this case we apply Dvoretsky's theorem to sketch every single pair in P.

We split the analysis into two cases: p ∈ [1, ∞) and p = ∞. In the main text we only present p ∈ [1, ∞) and defer the other case to the appendix.

Case 1: p ∈ [1, ∞).

For light rows: Let U be an Auerbach basis of A and y = Ux, where x ∈ R^d is arbitrary. Then for any row index i ∈ [n]:

    |y_i| = |U_i x| ≤ ‖U_i‖_p ‖x‖_q ≤ d^{1/q} ‖U_i‖_p ‖y‖_p  ⟹  d^{−1/q} |y_i| / ‖y‖_p ≤ ‖U_i‖_p,

where the second inequality holds because the Auerbach basis satisfies ‖Ux‖_p = ‖y‖_p ≥ d^{−1/q}‖x‖_q. This implies that if |y_i| ≥ γ d^{1/q} ‖y‖_p, then ‖U_i‖_p ≥ γ. Hence, by the definition of γ, if ‖U_i‖_p ≤ d^{−1/q−1}, then |y_i| ≤ d^{−1}‖y‖_p.

For any light pair P = (a, b), we sample two i.i.d. Gaussians g_P = (g_a, g_b) from N(0, σ²), where σ^p = √π/(2^{p/2}Γ((p+1)/2)). Since g_a y_a + g_b y_b ∼ N(0, ‖y_P‖₂²σ²), we have E[|g_a y_a + g_b y_b|^p] = ‖y_P‖₂^p. Letting P_l be the set of all light pairs, we therefore have E[Σ_{P∈P_l} |⟨g_P, y_P⟩|^p] = Σ_{P∈P_l} ‖y_P‖₂^p.

Let the random variable Z_P = |⟨g_P, y_P⟩|. This is σ²‖y_P‖₂²-sub-Gaussian (the parameter here can be improved by subtracting µ_{Z_P}). Define the event A as max_{P∈P_l} Z_P^p ≥ O((log(|P_l|) + poly(d))^{p/2}) ≕ t. Since the Z_P variables are sub-Gaussian and p ≥ 1, so that (·)^p is monotonically increasing, we can use the standard sub-Gaussian bound to get

    P(A) < exp(−poly(d)).    (7)

Note that p is a constant, justifying the above derivation. Also note that Var(Z_P^p) = O(‖y_P‖₂^{2p}). Condition on ¬A. By Bernstein's inequality, we have

    P(|Σ_{P∈P_l} Z_P^p − E[Σ_{P∈P_l} Z_P^p]| > ε·E[Σ_{P∈P} Z_P^p])
    = P(|Σ_{P∈P_l} Z_P^p − Σ_{P∈P_l} ‖y_P‖₂^p| > ε·Σ_{P∈P} ‖y_P‖₂^p)
    ≤ exp(−ε²(Σ_{P∈P} ‖y_P‖₂^p)² / (Σ_{P∈P_l} ‖y_P‖₂^{2p} + t·ε·Σ_{P∈P_l} ‖y_P‖₂^p))
    ≤ exp(−O(ε²(Σ_{P∈P} ‖y_P‖₂^p)² / Σ_{P∈P_l} ‖y_P‖₂^{2p}))
    ≤ exp(−O(ε²/γ)) = exp(−O(ε² d log d)),    (8)

where the last step follows from the definition of light pairs. Using that if, for all P = (a, b) ∈ P_l, |y_a|, |y_b| < O(d^{−1})‖y‖_p, then ‖y_P‖₂ < O(d^{−1})‖y‖_p, we obtain

    Σ_{P∈P_l} ‖y_P‖₂^{2p} ≤ (1/O(d^{−1}))·O(d^{−2})‖y‖_p^{2p} = O(d^{−1})‖y‖_p^{2p}.

Since Σ_{P∈P} ‖y_P‖₂^p ≤ O(‖y‖_p^p), we will have (Σ_{P∈P} ‖y_P‖₂^p)² / Σ_{P∈P_l} ‖y_P‖₂^{2p} = O(d).

Net argument. In the above derivation, we fix a vector y ∈ R^{2n}. For the above argument to hold for all pairs, a naïve argument will not work since there are infinitely many such vectors. However, note that each pair lives in a two-dimensional subspace. Hence, we can take a finer union bound over O((1 + ε)²/ε²) items in a two-dimensional ℓ₂ space using a net argument. This argument is standard; see, e.g., [Woodruff et al., 2014, Chapter 2].

Using the net argument, (7) and (8) hold with probability at least 1 − O(e^{−d}) for all y ∈ R^{2n} simultaneously. In particular, with probability at least 0.99:

    Σ_{P∈P_l} |⟨g_P, y_P⟩|^p − Σ_{P∈P_l} ‖y_P‖₂^p ∈ (±ε)·Σ_{P∈P} ‖y_P‖₂^p.    (9)

For heavy rows: Since the ℓ_p leverage scores sum to d, there can be at most d/γ = poly(d) heavy rows. For each pair P ∈ P_h, we construct a Gaussian matrix G_P ∈ R^{s×2}, where s = poly(d/ε). Applying Dvoretsky's theorem for ℓ_p (Paouris et al. [2017], Theorem 1.2), with probability at least 0.99, for all 2-dimensional vectors y_P we have ‖G_P y_P‖_p = (1 ± ε)‖y_P‖₂. Hence,

    Σ_{P∈P_h} ‖G_P y_P‖_p^p = (1 ± Θ(ε)) Σ_{P∈P_h} ‖y_P‖₂^p.    (10)

Combining (9) and (10), we have with probability at least 0.98, for all y ∈ R^{2n}:

    ‖Gy‖_p^p = (1 ± Θ(ε)) Σ_{P∈P} ‖y_P‖₂^p = (1 ± Θ(ε)) |||y|||_{p,2}^p.

Letting y = Ux − b and taking the 1/p-th root, we obtain the final claim.
Case 2: p = ∞. In this case, we first construct a sketch G to embed every pair into ℓ₁. That is, for all P ∈ P, construct a Gaussian matrix G_P ∈ R^{O(log(1/ε)/ε²)×2}. By Dvoretsky's theorem, with probability at least 0.99, for all y_P ∈ R² we have ‖G_P y_P‖₁ = (1 ± ε)‖y_P‖₂. Hence, for all y ∈ R^{2n}:

    max_{P∈P} ‖G_P y_P‖₁ = max_{P∈P} (1 ± ε)‖y_P‖₂ = (1 ± ε) |||y|||_{∞,2}.    (11)

However, we do not want to optimize the left-hand side directly. Recall that, by construction, G = diag(G_1, . . . , G_{|P|}) is a block diagonal matrix. Construct R as in Algorithm 3. By Indyk [2001], for all P ∈ P, R_P is an isometric embedding ℓ₁ ↪ ℓ∞, i.e., for all y_P ∈ R², ‖R_P G_P y_P‖_∞ = ‖G_P y_P‖₁. Combining this with (11), we get that with probability at least 0.99, for all y ∈ R^{2n}:

    ‖RGy‖_∞ = (1 ± Θ(ε)) |||y|||_{∞,2}.

Letting y = Ux − b, we obtain the final claim.

Running time. For p ∈ [1, ∞), calculating a well-conditioned basis takes O(nnz(A) + poly(d/ε)) time. Since G is a block diagonal matrix and A is sparse, computing GA takes O(nnz(A) + poly(d/ε)) time. Calculating Gb takes n + poly(d/ε) time. Minimizing ‖GAx − Gb‖ up to a (1 + ε) factor takes O(nnz(A) + poly(d/ε)) time. Using the fact that n < nnz(A), the total running time is O(nnz(A) + poly(d/ε)).

For the case of p = ∞, note that R is also a block diagonal matrix, so RG can be computed by multiplying the corresponding blocks, which amounts to O(2^{Õ(1/ε²)} n log(1/ε)/ε²) time. A is sparse and RG is a block matrix, so computing RGA takes another O(2^{Õ(1/ε²)} nnz(A)) time. Computing RGb takes O(2^{Õ(1/ε²)} n log(1/ε)/ε²) time. Since n < nnz(A), in total these take O(2^{Õ(1/ε²)} nnz(A)) time. This concludes the proof.

Remark 3. Definition 4 shows that for sketching complex vectors in the ℓ_p norm, all one needs is an embedding ℓ₂ ↪ ℓ_p. In particular, for complex ℓ₂-regression, the identity map is such an embedding with no distortion. Hence, complex ℓ₂-regression can be sketched exactly as real-valued ℓ₂-regression, while for other complex ℓ_p-regression the transformation is non-trivial.

3.2 NUMERICAL EVALUATION

We evaluate the performance of our proposed embedding for ℓ₁ and ℓ∞ regression on synthetic data. With A ∈ C^{100×50} and b ∈ C^{100}, we solve min_{x∈C^{50}} ‖Ax − b‖₁ or min_{x∈C^{50}} ‖Ax − b‖_∞. Each entry of A and b is sampled from a standard normal distribution (the real and imaginary coefficients are sampled according to this distribution independently). Instead of picking the heavy (P_h) and light (P_l) pairs, we construct a t × 2 (or s × 2 if p = ∞) Gaussian matrix for each pair (that is, we treat all pairs as heavy), as it turns out in the experiments that a very small t or s is sufficient. For ℓ₁ complex regression, we test with t = 2, 4, 6, 8, 10, 20, and the result is shown in Figure 3(a). For ℓ∞ complex regression, we test with s = 2, 3, 4, 5, 6, as shown in Figure 3(b). In both figures, the x-axis represents our choice of t or s, and the y-axis is the approximation error ‖x̂ − x*‖₂, where x̂ is the minimizer of our sketched regression problem.

Figure 3: Approximation error of sketched ℓ_p regression with complex entries: (a) ℓ₁ regression (error vs. t); (b) ℓ∞ regression (error vs. s).
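A small sketch of the experiment above follows (our own illustration, assuming only NumPy; the problem sizes, the crude IRLS solver, and the use of a common unit variance for every Gaussian block are arbitrary simplifications). It lifts a complex ℓ₁-regression instance with σ and φ, applies an independent t × 2 Gaussian block to every coordinate pair (all pairs treated as heavy, as in Section 3.2), solves the small real problem, and evaluates the block-wise |||·|||_{1,2} residual, which by Definition 4 equals the complex ℓ₁ residual.

```python
# Sketch of the Section 3.2 experiment: complex ell_1 regression via the real lift
# plus per-pair Gaussian blocks. Illustrative only; any ell_1 solver can be used.
import numpy as np

def sigma(M):                       # C^{n x d} -> R^{2n x 2d}
    n, d = M.shape
    out = np.zeros((2 * n, 2 * d))
    out[0::2, 0::2] = M.real;  out[0::2, 1::2] = M.imag
    out[1::2, 0::2] = -M.imag; out[1::2, 1::2] = M.real
    return out

def phi(z):                         # C^n -> R^{2n}
    return np.column_stack([z.real, z.imag]).ravel()

def l1_regression_irls(M, c, iters=50, eps=1e-8):
    """Crude ell_1 solver via iteratively reweighted least squares."""
    y = np.linalg.lstsq(M, c, rcond=None)[0]
    for _ in range(iters):
        w = 1.0 / np.maximum(np.abs(M @ y - c), eps)
        Mw = M * w[:, None]
        y = np.linalg.lstsq(Mw.T @ M, Mw.T @ c, rcond=None)[0]
    return y

def lifted_l1_objective(A_lift, b_lift, y):
    """|||A_lift y - b_lift|||_{1,2}: the complex ell_1 residual (Definition 4)."""
    r = (A_lift @ y - b_lift).reshape(-1, 2)
    return np.linalg.norm(r, axis=1).sum()

rng = np.random.default_rng(4)
n, d, t = 100, 50, 6
A = rng.standard_normal((n, d)) + 1j * rng.standard_normal((n, d))
b = rng.standard_normal(n) + 1j * rng.standard_normal(n)
A_lift, b_lift = sigma(A), phi(b)

# One independent t x 2 Gaussian block per coordinate pair, stacked block-diagonally.
# A common variance for every block only rescales the objective, so the argmin is unchanged.
G = np.zeros((t * n, 2 * n))
for i in range(n):
    G[t * i:t * (i + 1), 2 * i:2 * i + 2] = rng.standard_normal((t, 2))

y_sketch = l1_regression_irls(G @ A_lift, G @ b_lift)
y_full = l1_regression_irls(A_lift, b_lift)
print(lifted_l1_objective(A_lift, b_lift, y_sketch),
      lifted_l1_objective(A_lift, b_lift, y_full))
```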

4 SKETCHING VECTOR-MATRIX-VECTOR QUERIES

The sketches in Section 2 can also be used for vector-matrix-vector queries, but they are sub-optimal when there are a lot of cancellations. For example, if we have A^⊤DA = Σ_{i=1}^{n} d_i a_i a_i^⊤ with a_1 = a_2 = · · · = a_n, d_1 = d_2 = · · · = d_{n/2}, and d_{n/2+1} = d_{n/2+2} = · · · = d_n = −d_1, then A^⊤DA = 0, yet our sampling techniques need their number of rows to scale with ‖A^⊤|D|A‖_F, which can be arbitrarily large. In this section, we give a sketching technique for vector-matrix-vector product queries that scales with ‖A^⊤DA‖_F instead of ‖A^⊤|D|A‖_F. Therefore, for vector-matrix-vector product queries, this new technique works well even if the matrices are complex. Such queries are widely used, including standard graph queries and independent set queries [Rashtchian et al., 2020].


In particular, we consider a vector-matrix-vector product query u^⊤Mv, where M = A^⊤B = Σ_{i=1}^{n} a_i b_i^⊤ has a tensor product form, with a_i, b_i ∈ C^d for all i ∈ [n]. One has to either sketch M or compute M first; the queries u and v arrive afterwards [Andoni et al., 2016]. In reality, this may be due to the fact that n ≫ d and one cannot afford to store A and B. Our approach is interesting when M is non-PSD and A, B might be complex. This can indeed happen, for example, in a graph Laplacian with negative weights [Chen et al., 2016].

Algorithm 4 Tensor Sketch for Vector-Matrix-Vector Products
1: Input: {a_i}_{i=1}^{n}, {b_i}_{i=1}^{n} ⊆ C^d, u, v ∈ C^d
2: Let S : C^d ⊗ C^d → C^k be a TensorSketch [Pham and Pagh, 2013] with k hash buckets.
3: Compute q = Σ_{i=1}^{n} S(a_i ⊗ b_i) ∈ C^k.
4: Compute p = S(u ⊗ v) ∈ C^k.
5: Output: ⟨p, q⟩

Theorem 6. With probability at least 0.99, for given input vectors u, v ∈ C^d and matrices A, B ∈ C^{n×d}, Algorithm 4 returns an answer z such that |z − u^⊤A^⊤Bv| ≤ ε, in time Õ(nnz(A) + n‖v‖₂²‖u‖₂²‖A^⊤B‖_F²/ε²).

Proof. It is known that TensorSketches are unbiased, that is, E[⟨S(A^⊤B), S(u ⊗ v)⟩] = u^⊤A^⊤Bv, where S(A^⊤B) = Σ_{i=1}^{n} S(a_i ⊗ b_i); this follows from the linearity of TensorSketch [Pham and Pagh, 2013]. The variance is bounded by

    Var(⟨S(A^⊤B), S(u ⊗ v)⟩) ≤ O((1/k)·‖v‖₂²‖u‖₂²‖A^⊤B‖_F²),

see, e.g., Pham and Pagh [2013]. By Chebyshev's inequality, setting k = O(‖v‖₂²‖u‖₂²‖A^⊤B‖_F²/ε²) produces an estimate of u^⊤A^⊤Bv with additive error ε. Note that computing the sketch S(A^⊤B) takes time nnz(A) + nnz(B) + nk log k, and computing the inner product ⟨S(A^⊤B), S(u ⊗ v)⟩ takes only k time. As a comparison, computing u^⊤A^⊤Bv naïvely takes nd² + O(d²) time, which can be arbitrarily worse than our sketched version. Note that it is prohibitive in our setting to compute u^⊤A^⊤ and Bv separately.
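Below is a minimal NumPy sketch of the estimator in Algorithm 4 (our own illustration, not the authors' code; the sizes and the value of k are arbitrary). It implements the degree-2 TensorSketch in its usual form, as a CountSketch of the outer product computed via FFTs of two independent CountSketches, and uses plain (bilinear, non-conjugated) products throughout so that it estimates u^⊤A^⊤Bv for complex inputs.

```python
# Minimal TensorSketch estimator for u^T A^T B v (Algorithm 4); illustrative only.
import numpy as np

class TensorSketch2:
    """Degree-2 TensorSketch: a CountSketch of a (x) b computed via the FFT identity."""
    def __init__(self, d, k, rng):
        self.k = k
        self.h = [rng.integers(0, k, size=d) for _ in range(2)]       # hash functions
        self.s = [rng.choice([-1.0, 1.0], size=d) for _ in range(2)]  # random signs

    def _count_sketch(self, x, which):
        out = np.zeros(self.k, dtype=complex)
        np.add.at(out, self.h[which], self.s[which] * x)
        return out

    def sketch_pair(self, a, b):
        """S(a (x) b) = IFFT( FFT(CS1(a)) * FFT(CS2(b)) )."""
        fa = np.fft.fft(self._count_sketch(a, 0))
        fb = np.fft.fft(self._count_sketch(b, 1))
        return np.fft.ifft(fa * fb)

def uAtBv_estimate(A, B, u, v, k, rng):
    ts = TensorSketch2(A.shape[1], k, rng)
    q = sum(ts.sketch_pair(A[i], B[i]) for i in range(A.shape[0]))    # S(A^T B), streamable
    p = ts.sketch_pair(u, v)                                           # S(u (x) v)
    return np.dot(p, q)            # plain inner product, no conjugation

rng = np.random.default_rng(5)
n, d, k = 2000, 30, 4096
A = rng.standard_normal((n, d)) + 1j * rng.standard_normal((n, d))
B = rng.standard_normal((n, d)) + 1j * rng.standard_normal((n, d))
u = rng.standard_normal(d) + 1j * rng.standard_normal(d)
v = rng.standard_normal(d) + 1j * rng.standard_normal(d)

exact = u @ (A.T @ B) @ v          # reference only; the sketch never forms A^T B
approx = uAtBv_estimate(A, B, u, v, k, rng)
print(abs(approx - exact) / abs(exact))
```

Note that q can be maintained in a stream over the rows (a_i, b_i), and the query vectors u and v only need to be sketched when they arrive, matching the setting described above.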

5 CONCLUSION

Our work highlights the many places where non-PSD matrices and their "square roots", which are complex matrices, arise in optimization and randomized numerical linear algebra. We give novel dimensionality reduction methods for such matrices in optimization, the sketch-and-solve paradigm, and vector-matrix-vector queries. These methods can be used for approximating indefinite Hessian matrices, which constitute a major bottleneck for second-order optimization. We also propose a hybrid sampling method for matrices that satisfy a relaxed RIP condition. We verify these methods numerically using the Newton-CG, trust-region, and Newton-MR algorithms. We also show how to reduce complex ℓ_p-regression to real ℓ_p-regression in a black-box way using random linear embeddings, showing that the many sketching techniques developed for real matrices can be applied to complex matrices as well. In addition, we also present how to efficiently sketch complex matrices for vector-matrix-vector queries.

Acknowledgments: The authors would like to acknowledge partial support from NSF grant No. CCF-181584, Office of Naval Research (ONR) grant N00014-18-1-256, and a Simons Investigator Award.

References

Alexandr Andoni, Jiecao Chen, Robert Krauthgamer, Bo Qin, David P Woodruff, and Qin Zhang. On sketching quadratic forms. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, pages 311–319, 2016.

Albert E Beaton and John W Tukey. The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics, 16(2):147–185, 1974.

Raghu Bollapragada, Richard H Byrd, and Jorge Nocedal. Exact and inexact subsampled Newton methods for optimization. IMA Journal of Numerical Analysis, 39(2):545–578, 2019.

Yongxin Chen, Sei Zhen Khong, and Tryphon T Georgiou. On the definiteness of graph Laplacians with negative weights: Geometrical and passivity-based approaches. In 2016 American Control Conference (ACC), pages 2488–2493. IEEE, 2016.

Kenneth L Clarkson. Subgradient and sampling algorithms for l1 regression. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 257–266. Society for Industrial and Applied Mathematics, 2005.

Kenneth L Clarkson and David P Woodruff. Low-rank approximation and regression in input sparsity time. Journal of the ACM (JACM), 63(6):1–45, 2017.

Kenneth L Clarkson, Petros Drineas, Malik Magdon-Ismail, Michael W Mahoney, Xiangrui Meng, and David P Woodruff. The fast Cauchy transform and faster robust linear regression. SIAM Journal on Computing, 45(3):763–810, 2016.

Michael B Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 938–942, 2019.

Andrew R Conn, Nicholas IM Gould, and Philippe L Toint. Trust Region Methods. SIAM, 2000.

Anirban Dasgupta, Petros Drineas, Boulos Harb, Ravi Kumar, and Michael W Mahoney. Sampling algorithms and coresets for ℓ_p regression. SIAM Journal on Computing, 38(5):2060–2078, 2009.

Petros Drineas and Michael W Mahoney. RandNLA: Randomized Numerical Linear Algebra. Communications of the ACM, 59(6):80–90, 2016.

Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Murat A Erdogdu and Andrea Montanari. Convergence rates of sub-sampled Newton methods. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, pages 3052–3060, 2015.

Piotr Indyk. Algorithmic applications of low-distortion geometric embeddings. In Proceedings 42nd IEEE Symposium on Foundations of Computer Science, pages 10–33. IEEE, 2001.

Xuanqing Liu, Cho-Jui Hsieh, Jason D Lee, and Yuekai Sun. An inexact subsampled proximal Newton-type method for large-scale machine learning. arXiv preprint arXiv:1708.08552, 2017.

Yang Liu and Fred Roosta. Stability analysis of Newton-MR under Hessian perturbations. arXiv preprint arXiv:1909.06224, 2019.

Michael W Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011.

Shannon McCurdy. Ridge regression and provable deterministic ridge leverage score sampling. In Advances in Neural Information Processing Systems, pages 2463–2472, 2018.

Xiangrui Meng and Michael W Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 91–100, 2013.

Grigoris Paouris, Petros Valettas, and Joel Zinn. Random version of Dvoretzky's theorem in ℓ_p^n. Stochastic Processes and their Applications, 127(10):3187–3227, 2017.

Ninh Pham and Rasmus Pagh. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 239–247, 2013.

Mert Pilanci and Martin J Wainwright. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245, 2017.

Cyrus Rashtchian, David P Woodruff, and Hanlin Zhu. Vector-matrix-vector queries for solving linear algebra, statistics, and graph problems. arXiv preprint arXiv:2006.14015, 2020.

Farbod Roosta and Michael W Mahoney. Sub-sampled Newton methods. Mathematical Programming, 174(1-2):293–326, 2019.

Farbod Roosta, Yang Liu, Peng Xu, and Michael W Mahoney. Newton-MR: Newton's method without smoothness or convexity. arXiv preprint arXiv:1810.00303, 2018.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Christian Sohler and David P Woodruff. Subspace embeddings for the l1-norm with applications. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pages 755–764, 2011.

Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I Jordan. Stochastic cubic regularization for fast nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2899–2908, 2018.

Joel A Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.

Ruosong Wang and David P Woodruff. Tight bounds for ℓ_p oblivious subspace embeddings. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1825–1843. SIAM, 2019.

David Woodruff and Qin Zhang. Subspace embeddings and ℓ_p-regression using exponential random variables. In Conference on Learning Theory, pages 546–567, 2013.

David P Woodruff et al. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1–2):1–157, 2014.

Peng Xu, Jiyan Yang, Farbod Roosta, Christopher Ré, and Michael W Mahoney. Sub-sampled Newton methods with non-uniform sampling. In Advances in Neural Information Processing Systems, pages 3000–3008, 2016.

Peng Xu, Farbod Roosta, and Michael W Mahoney. Newton-type methods for non-convex optimization under inexact Hessian information. Mathematical Programming, 2019. doi:10.1007/s10107-019-01405-z.
Peng Xu, Farbod Roosta, and Michael W Mahoney. Second-order optimization for non-convex machine learning: An empirical study. In Proceedings of the 2020 SIAM International Conference on Data Mining. SIAM, 2020.

Zhewei Yao, Peng Xu, Farbod Roosta, and Michael W Mahoney. Inexact non-convex Newton-type methods. arXiv preprint arXiv:1802.06925, 2018.