Input-Sparsity Low Rank Approximation in Schatten Norm

Yi Li * 1 David P. Woodruff * 2

Abstract

We give the first input-sparsity time algorithms for the rank-$k$ low rank approximation problem in every Schatten norm. Specifically, for a given $m\times n$ ($m \ge n$) matrix $A$, our algorithm computes $Y \in \mathbb{R}^{m\times k}$, $Z \in \mathbb{R}^{n\times k}$, which, with high probability, satisfy $\|A - YZ^T\|_p \le (1+\varepsilon)\|A - A_k\|_p$, where $\|M\|_p = (\sum_{i=1}^n \sigma_i(M)^p)^{1/p}$ is the Schatten $p$-norm of a matrix $M$ with singular values $\sigma_1(M), \ldots, \sigma_n(M)$, and where $A_k$ is the best rank-$k$ approximation to $A$. Our algorithm runs in time $\tilde O(\mathrm{nnz}(A) + mn^{\alpha_p}\,\mathrm{poly}(k/\varepsilon))$, where $\alpha_p = 0$ for $p \in [1,2)$, $\alpha_p = (\omega-1)(1-2/p)$ for $p > 2$, and $\omega \approx 2.374$ is the exponent of matrix multiplication. For the important case of $p = 1$, which corresponds to the more "robust" nuclear norm, we obtain $\tilde O(\mathrm{nnz}(A) + m\cdot\mathrm{poly}(k/\varepsilon))$ time, which was previously only known for the Frobenius norm ($p = 2$). Moreover, since $\alpha_p < \omega - 1$ for every $p$, our algorithm has a better dependence on $n$ than that in the singular value decomposition for every $p$. Crucial to our analysis is the use of dimensionality reduction for Ky-Fan $p$-norms.

*Equal contribution. ¹School of Physical and Mathematical Sciences, Nanyang Technological University. ²Department of Computer Science, Carnegie Mellon University. Correspondence to: Yi Li, David P. Woodruff.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

1. Introduction

A common task in processing or analyzing large-scale datasets is to approximate a large matrix $A \in \mathbb{R}^{m\times n}$ ($m \ge n$) with a low-rank matrix. Often this is done with respect to the Frobenius norm, that is, the objective is to minimize the error $\|A - X\|_F$ over all rank-$k$ matrices $X \in \mathbb{R}^{m\times n}$ for a rank parameter $k$. It is well known that the optimal solution is $A_k = P_L A = A P_R$, where $P_L$ is the orthogonal projection onto the top $k$ left singular vectors of $A$ and $P_R$ is the orthogonal projection onto the top $k$ right singular vectors of $A$. Typically this is found via the singular value decomposition (SVD) of $A$, which is an expensive operation.

For large matrices $A$ this is too slow, so we instead allow for randomized approximation algorithms in the hope of achieving a much faster running time. Formally, given an approximation parameter $\varepsilon > 0$, we would like to find a rank-$k$ matrix $X$ for which $\|A - X\|_F \le (1+\varepsilon)\|A - A_k\|_F$ with large probability. For this relaxed problem, a number of efficient methods are known, which are based on dimensionality reduction techniques such as random projections, importance sampling, and other sketching methods, with running times $\tilde O(\mathrm{nnz}(A) + m\,\mathrm{poly}(k/\varepsilon))$,$^{1,2}$ where $\mathrm{nnz}(A)$ denotes the number of non-zero entries of $A$. This is significantly faster than the SVD, which takes $\tilde\Theta(mn^{\omega-1})$ time, where $\omega$ is the exponent of matrix multiplication. See (Woodruff, 2014) for a survey.

¹We use the notation $\tilde O(f)$ to hide the polylogarithmic factors in $O(f\,\mathrm{poly}(\log f))$.
²Since outputting $X$ takes $O(mn)$ time, these algorithms usually output $X$ in factored form, where each factor has rank $k$.

In this work, we consider approximation error with respect to general matrix norms, i.e., to the Schatten $p$-norm. The Schatten $p$-norm, denoted by $\|\cdot\|_p$, is defined to be the $\ell_p$-norm of the singular values of the matrix. Below is the formal definition of the problem.

Definition 1.1 (Low-rank Approximation). Let $p \ge 1$. Given a matrix $A \in \mathbb{R}^{m\times n}$, find a rank-$k$ matrix $\hat X \in \mathbb{R}^{m\times n}$ for which
$$\|A - \hat X\|_p \le (1+\varepsilon)\min_{X:\,\mathrm{rank}(X)=k}\|A - X\|_p. \quad (1)$$

It is a well-known fact (Mirsky's Theorem) that the optimal solution for general Schatten norms coincides with the optimal rank-$k$ matrix $A_k$ for the Frobenius norm, given by the SVD. However, approximate solutions for the Frobenius norm loss function may give horrible approximations for other Schatten $p$-norms.

Of particular importance is the Schatten 1-norm, also called the nuclear norm or the trace norm, which is the sum of the singular values of a matrix. It is typically considered to be more robust than the Frobenius norm (Schatten 2-norm) and has been used in robust PCA applications (see, e.g., (Xu et al., 2010; Candès et al., 2011; Yi et al., 2016)).
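To make Definition 1.1 concrete, here is a minimal NumPy sketch (our own illustration, not part of the paper's algorithm; the function names are ours) that evaluates the Schatten $p$-norm of an error matrix and the optimal rank-$k$ error $\|A - A_k\|_p$ via a full SVD, which is the baseline the rest of the paper tries to avoid computing.

```python
import numpy as np

def schatten_norm(M, p):
    """Schatten p-norm: the l_p norm of the singular values of M."""
    s = np.linalg.svd(M, compute_uv=False)
    return np.sum(s ** p) ** (1.0 / p)

def best_rank_k_error(A, k, p):
    """||A - A_k||_p, where A_k is the truncated SVD (optimal by Mirsky's theorem)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Ak = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return schatten_norm(A - Ak, p)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 100))
    for p in (1, 2, 4):
        print(p, best_rank_k_error(A, k=10, p=p))
```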

For example, suppose the top singular value of an $n\times n$ matrix $A$ is $1$, the next $2k$ singular values are $1/\sqrt k$, and the remaining singular values are $0$. A Frobenius norm rank-$k$ approximation could just choose the top singular direction and pay a cost of $\sqrt{2k\cdot 1/k} = \sqrt 2$. Since the Frobenius norm of the bottom $n-k$ singular values is $\sqrt{(k+1)\cdot\frac 1k}\approx 1$, this is a $\sqrt 2$-approximation. On the other hand, if a Schatten 1-norm rank-$k$ approximation algorithm were to only output the top singular direction, it would pay a cost of $2k\cdot 1/\sqrt k = 2\sqrt k$. The bottom $n-k$ singular values have Schatten 1-norm $(k+1)\cdot\frac{1}{\sqrt k}\approx\sqrt k$. Consequently, the approximation factor would be $2(1-o(1))$, and one can show that if we insisted on a $\sqrt 2$-approximation or better, a Schatten 1-norm algorithm would need to capture a constant fraction of the top $k$ directions, and thus capture more of the underlying data than a Frobenius norm solution.

Consider another example where the top $k$ singular values are all $1$s and the $(k+i)$-th singular value is $1/i$. When $k = o(\log n)$, capturing only the top singular direction gives a $(1+o(1))$-approximation for the Schatten 1-norm but a $\Theta(\sqrt k)$-approximation for the Frobenius norm. This example, together with the preceding one, shows that the Schatten norm is genuinely a different error metric.

Surprisingly, no algorithms for low-rank approximation in the Schatten $p$-norm were known to run in time $\tilde O(\mathrm{nnz}(A) + m\,\mathrm{poly}(k/\varepsilon))$ prior to this work, except for the special case of $p = 2$. We note that the case of $p = 2$ has special geometric structure that is not shared by other Schatten $p$-norms. Indeed, a common technique for the $p = 2$ setting is to first find a $\mathrm{poly}(k/\varepsilon)$-dimensional subspace $V$ containing a rank-$k$ subspace inside of it which is a $(1+\varepsilon)$-approximate subspace to project the rows of $A$ on. Then, by the Pythagorean theorem, one can first project the rows of $A$ onto $V$, and then find the best rank-$k$ subspace of the projected points inside of $V$. For other Schatten $p$-norms, the Pythagorean theorem does not hold, and it is not hard to construct counterexamples to this procedure for $p \ne 2$.

To summarize, the SVD runs in time $\Theta(mn^{\omega-1})$, which is much slower than $\mathrm{nnz}(A) \le mn$. It is also not clear how to adapt existing fast Frobenius-norm algorithms to generate $(1+\varepsilon)$-factor approximations with respect to other Schatten $p$-norms.

Our Contributions. In this paper we obtain the first provably efficient algorithms for the rank-$k$ $(1+\varepsilon)$-approximation problem with respect to the Schatten $p$-norm for every $p \ge 1$.

Theorem 1.1 (informal, combination of Theorems 3.6 and 4.4). Suppose that $m \ge n$ and $A \in \mathbb{R}^{m\times n}$. There is a randomized algorithm which outputs two matrices $Y \in \mathbb{R}^{m\times k}$ and $Z \in \mathbb{R}^{n\times k}$ for which $\hat X = YZ^T$ satisfies (1) with probability at least $0.9$. The algorithm runs in time $O(\mathrm{nnz}(A)\log n) + \tilde O(mn^{\alpha_p}\,\mathrm{poly}(k/\varepsilon))$, where
$$\alpha_p = \begin{cases} 0, & 1 \le p \le 2;\\ (\omega-1)\left(1-\tfrac 2p\right), & p > 2,\end{cases}$$
and the hidden constants depend only on $p$.

In the particular case of $p = 1$, and more generally for all $p \in [1,2]$, our algorithm achieves a running time of $O(\mathrm{nnz}(A)\log n + m\,\mathrm{poly}(k/\varepsilon))$, which was previously known to be possible for $p = 2$ only. When $p > 2$, the running time begins to depend polynomially on $m$ and $n$, but the dependence remains $o(mn^{\omega-1})$ for all larger $p$. Thus, even for larger values of $p$, when $k$ is subpolynomial in $n$, our algorithm runs substantially faster than the SVD. Empirical evaluations are also conducted in Section 5 to demonstrate our improved algorithm when $p = 1$.

It was shown by Musco & Woodruff (2017) that computing a constant-factor low-rank approximation to $A^TA$, given only $A$, requires $\Omega(\mathrm{nnz}(A)\cdot k)$ time. Given that the squared singular values of $A$ are the singular values of $A^TA$, it is natural to suspect that obtaining a constant-factor low-rank approximation in the Schatten 4-norm would therefore require $\Omega(\mathrm{nnz}(A)\cdot k)$ time. Surprisingly, we show this is not the case, and obtain an $\tilde O(\mathrm{nnz}(A) + mn^{(\omega-1)/2}\,\mathrm{poly}(k/\varepsilon))$ time algorithm.

In addition, we generalize the error metric from matrix norms to a wide family of general loss functions; see Section 6 for details. Thus, we considerably broaden the class of loss functions for which input sparsity time algorithms were previously known.

Technical Overview. We illustrate our ideas for $p = 1$. Our goal is to find an orthogonal projection $\hat Q'$ for which $\|A(I-\hat Q')\|_1 \le (1+O(\varepsilon))\|A - A_k\|_1$. The crucial idea in the analysis is to split $\|\cdot\|_1$ into a head part $\|\cdot\|_{(r)}$, which, known as the Ky-Fan norm, equals the sum of the top $r$ singular values, and a tail part $\|\cdot\|_{(-r)}$ (this is just a notation; the tail part is not a norm), which equals the sum of all the remaining singular values. Observe that for $r \ge k/\varepsilon$ it holds that $\|A(I-\hat Q')\|_{(-r)} \le \|A\|_{(-r)} \le \|A - A_k\|_{(-r)} + \varepsilon\|A - A_k\|_1$ for any rank-$k$ orthogonal projection $\hat Q'$, and it thus suffices to find $\hat Q'$ for which $\|A(I-\hat Q')\|_{(r)} \le (1+\varepsilon)\|A - A_k\|_{(r)}$. To do this, we sketch $A(I-Q)$ on the left by a projection-cost preserving matrix $S$ of Cohen et al. (2017) such that $\|SA(I-Q)\|_{(r)} = (1\pm\varepsilon)\|A(I-Q)\|_{(r)} \pm \varepsilon\|A - A_k\|_1$ for all rank-$k$ projections $Q$. Then we solve $\min_Q \|SA(I-Q)\|_{(r)}$ over all rank-$k$ projections $Q$ and obtain a $(1+\varepsilon)$-approximate projection $\hat Q'$, which, intuitively, is close to the best projection $P_R$ for $\min_Q \|A(I-Q)\|_{(r)}$, and can be shown to satisfy the desired property above.
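The head/tail split above is easy to state in code. Below is a small NumPy sketch (our own illustration, not from the paper) of the Ky-Fan head $\|\cdot\|_{(r)}$ and the tail $\|\cdot\|_{(-r)}$, together with a numerical check that the Schatten 1-norm splits into the two parts and that projecting can only shrink the tail.

```python
import numpy as np

def ky_fan_head(M, r):
    """Sum of the top r singular values, i.e., the Ky-Fan r-norm ||M||_(r)."""
    s = np.linalg.svd(M, compute_uv=False)
    return s[:r].sum()

def schatten1_tail(M, r):
    """Sum of the singular values below the top r, i.e., ||M||_(-r) (not a norm)."""
    s = np.linalg.svd(M, compute_uv=False)
    return s[r:].sum()

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, k, eps = 300, 5, 0.25
    r = int(np.ceil(k / eps))
    A = rng.standard_normal((n, n))
    # A rank-k orthogonal projection Q onto an arbitrary k-dimensional subspace.
    V, _ = np.linalg.qr(rng.standard_normal((n, k)))
    Q = V @ V.T
    M = A @ (np.eye(n) - Q)
    assert np.isclose(ky_fan_head(M, r) + schatten1_tail(M, r),
                      np.linalg.svd(M, compute_uv=False).sum())
    assert schatten1_tail(M, r) <= schatten1_tail(A, r) + 1e-8
```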

The last step is to approximate $A\hat Q'$, which could be expensive if done trivially, so we reformulate it as a regression problem $\min_Y \|A - YZ^T\|_1$ over $Y \in \mathbb{R}^{m\times k}$, where $Z$ is an $n\times k$ matrix whose columns form an orthonormal basis of the target space of the projection $\hat Q'$. This latter idea has been applied successfully for Frobenius-norm low-rank approximation (see, e.g., (Clarkson & Woodruff, 2017)). Here we need to argue that the solution to the Frobenius-norm regression problem $\min_Y \|A - YZ^T\|_F$ gives a good solution to the Schatten 1-norm regression problem. Finally we output $Y$ and $Z$.

2. Preliminaries

Notation. For an $m\times n$ matrix $A$, let $\sigma_1(A) \ge \sigma_2(A) \ge \cdots \ge \sigma_s(A)$ denote its singular values, where $s = \min\{m,n\}$. The Schatten $p$-norm ($p \ge 1$) of $A$ is defined to be $\|A\|_p := (\sum_{i=1}^s \sigma_i(A)^p)^{1/p}$ and the singular $(p,r)$-norm ($r \le s$) to be $\|A\|_{(p,r)} := (\sum_{i=1}^r \sigma_i(A)^p)^{1/p}$. It is clear that $\|A\|_p = \|A\|_{(p,s)}$. When $p = 2$, the Schatten $p$-norm coincides with the Frobenius norm and we shall use the notation $\|\cdot\|_F$ in preference to $\|\cdot\|_2$.

Suppose that $A$ has the singular value decomposition $A = U\Sigma V^T$, where $\Sigma$ is a diagonal matrix of the singular values. For $k \le \min\{m,n\}$, let $\Sigma_k$ denote the diagonal matrix for the largest $k$ singular values only, i.e., $\Sigma_k = \mathrm{diag}\{\sigma_1(A),\ldots,\sigma_k(A),0,\ldots,0\}$. We define $A_k = U\Sigma_k V^T$. The famous Mirsky theorem states that $A_k$ is the best rank-$k$ approximation to $A$ for any rotationally invariant norm.

For a subspace $E \subseteq \mathbb{R}^n$, we define $P_E$ to be an $n\times\dim(E)$ matrix whose columns form an orthonormal basis of $E$.

Toolkit. There has been extensive research on randomized numerical linear algebra in recent years. Below are several existing results upon which our algorithm will be built.

Definition 2.1 (Sparse Embedding Matrix). Let $\varepsilon > 0$ be an error parameter. The $(n,\varepsilon)$-sparse embedding matrix $R$ of dimension $n\times r$ is constructed as follows, where $r$ is to be specified later. Let $h:[n]\to[r]$ be a random function and $\sigma:[n]\to\{-1,1\}$ be a random function. The matrix $R$ has only $n$ nonzero entries: $R_{i,h(i)} = \sigma(i)$ for all $i\in[n]$. The value of $r$ is chosen to be $r = \Theta(1/\varepsilon^2)$ such that
$$\Pr_R\left\{\|A^T RR^T B - A^T B\|_F^2 \le \varepsilon^2\|A\|_F^2\|B\|_F^2\right\} \ge 0.99$$
for all $A$ with orthonormal columns. This is indeed possible by (Clarkson & Woodruff, 2017; Meng & Mahoney, 2013; Nelson & Nguyen, 2013).

It is clear that, for a matrix $A$ with $n$ columns and an $(n,\varepsilon)$-sparse embedding matrix $R$, the matrix product $AR$ can be computed in $O(\mathrm{nnz}(A))$ time.

Lemma 2.1 (Thin SVD (Demmel et al., 2007)). Suppose that $A \in \mathbb{R}^{m\times n}$ with $m \ge n$ and the (thin) singular value decomposition is $A = U\Sigma V^T$, where $U \in \mathbb{R}^{m\times n}$ and $\Sigma, V \in \mathbb{R}^{n\times n}$. Computing the full thin SVD takes time $\tilde O(mn^{\omega-1})$.

Lemma 2.2 (Multiplicative Spectral Approximation (Cohen et al., 2015b)). Suppose that $A \in \mathbb{R}^{m\times n}$ ($n \le m \le \mathrm{poly}(n)$) has rank $r$. There exists a sampling matrix $R$ of $O(\varepsilon^{-2} r\log r)$ rows such that $(1-\varepsilon)A^TA \preceq (RA)^T(RA) \preceq (1+\varepsilon)A^TA$ with probability at least $0.9$, and $R$ can be computed in $O(\mathrm{nnz}(A)\log n + n^\omega\log^2 n + n^2\varepsilon^{-2})$ time.

Lemma 2.3 (Additive-Multiplicative Spectral Approximation (Cohen et al., 2017; Musco, 2018)). Suppose that $A \in \mathbb{R}^{m\times n}$ with $m \ge n$, with error parameters $\varepsilon \ge \eta \ge 1/\mathrm{poly}(n)$. Let $K = k + \varepsilon/\eta$. There exists a randomized algorithm which runs in $O(\mathrm{nnz}(A)\log n) + \tilde O(mK^{\omega-1})$ time and outputs a matrix $C$ of $t = \Theta(\varepsilon^{-2}K\log K)$ columns, which are rescaled column samples of $A$ without replacement, such that with probability at least $0.99$,
$$(1-\varepsilon)AA^T - \eta\|A - A_k\|_F^2 I \preceq CC^T \preceq (1+\varepsilon)AA^T + \eta\|A - A_k\|_F^2 I. \quad (2)$$

We also need an elementary inequality.

Lemma 2.4. Suppose that $p \ge 1$ and $\varepsilon \in (0,1]$. Let $C_{p,\varepsilon} = p(1+1/\varepsilon)^{p-1}$. It holds for $x \in [\varepsilon,1]$ that $(1+x)^p \le 1 + C_{p,\varepsilon}x^p$ and that $(1-x)^p \ge 1 - C_{p,\varepsilon}x^p$.

Proof. It is easy to see that for $x \in [\varepsilon,1]$, $(1+x)^p \le 1 + \frac{(1+\varepsilon)^p-1}{\varepsilon^p}x^p$ and $(1-x)^p \ge 1 - \frac{1-(1-\varepsilon)^p}{\varepsilon^p}x^p$. Then note that $p\left(1+\frac1\varepsilon\right)^{p-1} \ge \frac{(1+\varepsilon)^p-1}{\varepsilon^p} \ge \frac{1-(1-\varepsilon)^p}{\varepsilon^p}$.
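As an illustration of the sparse embedding matrix of Definition 2.1 above, here is a minimal sketch (our own, assuming SciPy is available) of building $R$ as a sparse matrix and applying it to a sparse $A$ in time proportional to $\mathrm{nnz}(A)$.

```python
import numpy as np
import scipy.sparse as sp

def sparse_embedding(n, eps, rng):
    """(n, eps)-sparse embedding matrix R of shape n x r with one +/-1 entry per row."""
    r = max(1, int(np.ceil(1.0 / eps**2)))        # r = Theta(1/eps^2)
    h = rng.integers(0, r, size=n)                # random hash h: [n] -> [r]
    sgn = rng.choice([-1.0, 1.0], size=n)         # random signs sigma: [n] -> {-1, +1}
    return sp.csr_matrix((sgn, (np.arange(n), h)), shape=(n, r))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = sp.random(500, 400, density=0.02, random_state=0, format="csr")
    R = sparse_embedding(A.shape[1], eps=0.5, rng=rng)
    AR = A @ R   # costs O(nnz(A)): each column of A contributes to a single column of R
    print(AR.shape)
```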

3. Case p < 2

The algorithm is presented in Algorithm 1. In this section we shall prove the correctness and analyze the running time for a constant $p \in [1,2)$. Throughout this section we set $r = k/\varepsilon$.

Algorithm 1 Outline of the algorithm for finding a low-rank approximation
1: if $p < 2$ then
2:   $\eta_1 \leftarrow O((\varepsilon^2/k)^{2/p})$, $\eta_2 \leftarrow O(\varepsilon^2/k^{2/p-1})$
3: else
4:   $\eta_1 \leftarrow O(\varepsilon^{1+2/p}/(k^{2/p} n^{1-2/p}))$, $\eta_2 \leftarrow O(\varepsilon^2/n^{1-2/p})$
5: end if
6: Use Lemma 2.3 to obtain a sampling matrix $S$ of $s$ rows such that
$$(1-\varepsilon)A^TA - \eta_1\|A - A_k\|_F^2 I \preceq A^TS^TSA \preceq (1+\varepsilon)A^TA + \eta_1\|A - A_k\|_F^2 I. \quad (3)$$
7: $T \leftarrow$ subspace embedding matrix for $s$-dimensional subspaces with error $O(\varepsilon)$
8: $W' \leftarrow$ projection onto the top $k$ left singular vectors of $SAT$
9: $Z \leftarrow$ matrix whose columns are an orthonormal basis of the row space of $W'SA$
10: $R \leftarrow (n,\Theta(\sqrt{\eta_2/k}))$-sparse embedding matrix
11: $\hat Y \leftarrow AR\,P_{\mathrm{rowspace}(Z^TR)}$
12: return $\hat Y, Z$

Lemma 3.1. Suppose that $p \in (0,2)$ and $S$ satisfies (3). It then holds for all rank-$k$ orthogonal projections $Q$ that
$$(1-\varepsilon)\|A(I-Q)\|_{(p,r)}^p - r\eta_1^{p/2}\|A-A_k\|_p^p \le \|SA(I-Q)\|_{(p,r)}^p \le (1+\varepsilon)\|A(I-Q)\|_{(p,r)}^p + r\eta_1^{p/2}\|A-A_k\|_p^p.$$

Proof. Since $S$ satisfies (3), it holds for any rank-$k$ orthogonal projection $Q$ that
$$(1-\varepsilon)(I-Q)A^TA(I-Q) - \eta_1\|A-A_k\|_F^2 I \preceq (I-Q)A^TS^TSA(I-Q) \preceq (1+\varepsilon)(I-Q)A^TA(I-Q) + \eta_1\|A-A_k\|_F^2 I.$$
The following relationship between the singular values of $SA(I-Q)$ and $A(I-Q)$ is an immediate corollary via the max-min characterization of singular values (cf., e.g., Lemma 7.2 of (Li et al., 2019)):
$$(1-\varepsilon)\sigma_i^2(A(I-Q)) - \eta_1\|A-A_k\|_F^2 \le \sigma_i^2(SA(I-Q)) \le (1+\varepsilon)\sigma_i^2(A(I-Q)) + \eta_1\|A-A_k\|_F^2. \quad (4)$$
Since $p < 2$ and thus $\|\cdot\|_F \le \|\cdot\|_p$, we have from (4) that
$$(1-\varepsilon)\sigma_i^p(A(I-Q)) - \eta_1^{p/2}\|A-A_k\|_p^p \le \sigma_i^p(SA(I-Q)) \le (1+\varepsilon)\sigma_i^p(A(I-Q)) + \eta_1^{p/2}\|A-A_k\|_p^p.$$
Passing to the singular $(p,r)$-norm yields the desired result.

Lemma 3.2. When $p \in (0,2)$ is a constant and $\varepsilon \in (0,1/2]$, let $\hat Q' = ZZ^T$ be the projection onto the column space of $Z$, where $Z$ is as defined in Line 9 of Algorithm 1. With probability at least $0.99$, it holds that
$$\|SA(I-\hat Q')\|_{(p,r)} \le (1+\varepsilon)\min_Q\|SA(I-Q)\|_{(p,r)}, \quad (5)$$
where the minimization on the right-hand side is over all rank-$k$ orthogonal projections $Q$.

Proof. Observe that
$$\min_Q\|SA - SAQ\|_{(p,r)} = \min_W\|SA - WSA\|_{(p,r)},$$
where the minimizations are over all rank-$k$ orthogonal projections $Q$ and all rank-$k$ orthogonal projections $W$, and the equality is achieved when $Q$ is the projection onto the top $k$ right singular vectors of $SA$ and $W$ is the projection onto the top $k$ left singular vectors. Since $T$ is an oblivious subspace embedding matrix and preserves all singular values of $(I-W)SA$ up to a factor of $1\pm\varepsilon$, we have
$$\min_W\|SAT - WSAT\|_{(p,r)} = (1\pm\varepsilon)\min_W\|SA - WSA\|_{(p,r)}.$$
The minimization on the left-hand side above is easy to solve: the minimizer $W'$ is exactly the projection onto the top $k$ singular vectors of $SAT$, as computed in Line 8 of Algorithm 1. Since $\hat Q'$ is the projection onto the row space of $W'SA$, it holds that the row space of $SA\hat Q'$ is the closest space to that of $SA$ in the row space of $W'SA$. Hence
$$\|SA - SA\hat Q'\|_{(p,r)} \le \|SA - W'SA\|_{(p,r)}.$$
The claimed result (5) then follows from
$$\|SA - W'SA\|_{(p,r)} \le \frac{1}{1-\varepsilon}\|SAT - W'SAT\|_{(p,r)} = \frac{1}{1-\varepsilon}\min_W\|SAT - WSAT\|_{(p,r)} \le \frac{1+\varepsilon}{1-\varepsilon}\min_W\|SA - WSA\|_{(p,r)} \le (1+4\varepsilon)\min_Q\|SA - SAQ\|_{(p,r)}$$
and rescaling $\varepsilon$.
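To fix ideas, the following is a heavily simplified NumPy sketch of the pipeline in Algorithm 1 for $p = 1$. It is our own illustration and is not input-sparsity time: the projection-cost preserving sampling matrix $S$ of Lemma 2.3, the subspace embedding $T$, and the sparse embedding $R$ are all replaced by dense Gaussian sketches of heuristically chosen sizes, and the sketched regression in the last step uses a pseudoinverse rather than Line 11 verbatim. Only the overall structure (sketch, compute $W'$, form $Z$, solve a sketched Frobenius regression for $\hat Y$) follows the algorithm.

```python
import numpy as np

def low_rank_schatten1_sketch(A, k, eps, rng):
    """Simplified analogue of Algorithm 1 (p = 1). Returns (Y, Z) with X = Y @ Z.T."""
    m, n = A.shape
    s = min(m, int(np.ceil(4 * k / eps**2)))      # sketch sizes chosen heuristically
    t = min(n, int(np.ceil(4 * k / eps**2)))

    S = rng.standard_normal((s, m)) / np.sqrt(s)  # stand-in for the sampling matrix of Lemma 2.3
    T = rng.standard_normal((n, t)) / np.sqrt(t)  # stand-in for the subspace embedding

    SA = S @ A
    U, _, _ = np.linalg.svd(SA @ T, full_matrices=False)
    Wp = U[:, :k]                                 # Line 8: top-k left singular vectors of S A T
    Z, _ = np.linalg.qr((Wp.T @ SA).T)            # Line 9: orthonormal basis of rowspace(W' S A)

    R = rng.standard_normal((n, t)) / np.sqrt(t)  # stand-in for the sparse embedding of Line 10
    Y = (A @ R) @ np.linalg.pinv(Z.T @ R)         # minimizer of ||(A - Y Z^T) R||_F
    return Y, Z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k = 10
    A = rng.standard_normal((400, 300))
    Y, Z = low_rank_schatten1_sketch(A, k=k, eps=0.5, rng=rng)
    err = np.linalg.svd(A - Y @ Z.T, compute_uv=False).sum()   # Schatten 1-norm error
    best = np.linalg.svd(A, compute_uv=False)[k:].sum()        # ||A - A_k||_1
    print(err / best)
```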

Lemma 3.3. Let $\varepsilon \in (0,1/2]$. Suppose that $\hat Q'$ satisfies (5). Then it holds that
$$\|A(I-\hat Q')\|_{(p,r)}^p \le (1+c_1\varepsilon)\|A-A_k\|_{(p,r)}^p + c_2 r\eta_1^{p/2}\|A-A_k\|_p^p$$
for some absolute constants $c_1, c_2 > 0$.

Proof. Let $\hat Q = \arg\min_Q\|SA(I-Q)\|_{(p,r)}$, where the minimization is over all rank-$k$ projections $Q$. Let $Q^*$ be the orthogonal projection onto the top $k$ right singular vectors of $A$. It follows that
$$\begin{aligned}
\|A(I-\hat Q')\|_{(p,r)}^p &\le \frac{1}{1-\varepsilon}\|SA(I-\hat Q')\|_{(p,r)}^p + \frac{1}{1-\varepsilon}\,r\eta_1^{p/2}\|A-A_k\|_p^p\\
&\le \frac{(1+\varepsilon)^p}{1-\varepsilon}\|SA(I-\hat Q)\|_{(p,r)}^p + \frac{1}{1-\varepsilon}\,r\eta_1^{p/2}\|A-A_k\|_p^p\\
&\le \frac{(1+\varepsilon)^p}{1-\varepsilon}\|SA(I-Q^*)\|_{(p,r)}^p + \frac{1}{1-\varepsilon}\,r\eta_1^{p/2}\|A-A_k\|_p^p\\
&\le \frac{(1+\varepsilon)^p}{1-\varepsilon}\left((1+\varepsilon)\|A(I-Q^*)\|_{(p,r)}^p + r\eta_1^{p/2}\|A-A_k\|_p^p\right) + \frac{1}{1-\varepsilon}\,r\eta_1^{p/2}\|A-A_k\|_p^p\\
&= \frac{(1+\varepsilon)^{p+1}}{1-\varepsilon}\|A-A_k\|_{(p,r)}^p + \frac{(1+\varepsilon)^p+1}{1-\varepsilon}\,r\eta_1^{p/2}\|A-A_k\|_p^p,
\end{aligned}$$
where the first inequality follows from Lemma 3.1, the second inequality from Lemma 3.2, the third inequality from the optimality of $\hat Q$, and the fourth inequality again from Lemma 3.1.

The next lemma is an immediate corollary of the preceding lemma.

Lemma 3.4. Let $\varepsilon \in (0,1/2]$. Suppose that $\hat Q'$ satisfies (5). Then it holds for some absolute constants $c_1, c_2 > 0$ that
$$\|A(I-\hat Q')\|_p^p \le (1+c_1\varepsilon)\|A-A_k\|_p^p$$
whenever $\eta_1 \le (\varepsilon^2/(c_2k))^{2/p}$.

Proof. Again let $\hat Q = \arg\min_Q\|SA(I-Q)\|_{(p,r)}$, where the minimization is over all rank-$k$ projections $Q$. Observe that
$$\begin{aligned}
\|A(I-\hat Q')\|_p^p &= \|A(I-\hat Q')\|_{(p,r)}^p + \sum_{i\ge r+1}\sigma_i^p(A(I-\hat Q'))\\
&\le (1+c_1\varepsilon)\|A-A_k\|_{(p,r)}^p + c_2r\eta_1^{p/2}\|A-A_k\|_p^p + \sum_{i\ge r+1}\sigma_i^p(A)\\
&\le (1+c_1\varepsilon)\|A-A_k\|_p^p + c_2r\eta_1^{p/2}\|A-A_k\|_p^p + \sum_{i=r+1}^{r+k+1}\sigma_i^p(A)\\
&\le (1+c_1\varepsilon)\|A-A_k\|_p^p + c_2r\eta_1^{p/2}\|A-A_k\|_p^p + \frac kr\|A-A_k\|_p^p\\
&\le (1+(c_1+1)\varepsilon)\|A-A_k\|_p^p + c_2r\eta_1^{p/2}\|A-A_k\|_p^p,
\end{aligned}$$
where we used the preceding lemma (Lemma 3.3) in the first inequality and $r = k/\varepsilon$ in the last inequality. The claimed result holds when $\eta_1 \le (\varepsilon/(c_2r))^{2/p}$.

So far we have found a rank-$k$ orthogonal projection $\hat Q' = ZZ^T$ such that
$$\|A - A\hat Q'\|_p \le (1+c_1\varepsilon)\|A-A_k\|_p$$
for some absolute constant $c_1$. However, it is not clear how to compute the matrix product $A\hat Q'$ efficiently. Hence we consider the regression problem
$$\min_{Y:\,\mathrm{rank}(Y)=k}\|A - YZ^T\|_p.$$
It is clear that the minimizer is $Y = AZ$, which satisfies $\|A - YZ^T\|_p = \|A - A\hat Q'\|_p$, since the row space of $YZ^T$ is a $k$-dimensional subspace of the row space of $Z^T$, which is exactly the range of $\hat Q'$. The next lemma shows that $\hat Y$ is an approximation to $Y$.

Lemma 3.5. When $1 \le p < 2$ is a constant, the matrix $\hat Y$ defined in Line 11 of Algorithm 1 satisfies with probability at least $0.99$ that
$$\|A - \hat YZ^T\|_p \le (1+c\varepsilon)\min_{Y:\,\mathrm{rank}(Y)=k}\|A - YZ^T\|_p$$
for some absolute constant $c > 0$, whenever $\eta_2 \le \varepsilon^2/(2k)^{2/p-1}$.

Proof. First, it is clear that the optimal solution to $\min_Y\|A - YZ^T\|_p$ is $Y = AZ$, where the minimization is over all $m\times k$ matrices $Y$ of rank at most $k$. Note that
$$\hat Y = AR\,P_{\mathrm{rowspace}(Z^TR)}$$
is the minimizer of the Frobenius-norm minimization problem $\min_Y\|(A - YZ^T)R\|_F$. Since $R$ is a sparse embedding matrix of error $\Theta(\sqrt{\eta_2/k})$, one can show (see, e.g., Lemma 7.8 of (Clarkson & Woodruff, 2017)) that with probability at least $0.99$,
$$\|AZ - \hat Y\|_F \le \sqrt{\eta_2}\,\|A - AZZ^T\|_F.$$
It follows that
$$\begin{aligned}
\|A - \hat YZ^T\|_p &\le \|A - AZZ^T\|_p + \|\hat YZ^T - AZZ^T\|_p\\
&\le (1+c_1\varepsilon)\|A - AZZ^T\|_p + \|\hat Y - AZ\|_p\\
&\le (1+c_1\varepsilon)\|A - AZZ^T\|_p + (2k)^{\frac1p-\frac12}\|\hat Y - AZ\|_F\\
&\le (1+c_1\varepsilon)\|A - AZZ^T\|_p + (2k)^{\frac1p-\frac12}\sqrt{\eta_2}\,\|A - AZZ^T\|_F\\
&\le (1+c_1\varepsilon)\|A - AZZ^T\|_p + (2k)^{\frac1p-\frac12}\sqrt{\eta_2}\,\|A - AZZ^T\|_p\\
&\le (1+(c_1+1)\varepsilon)\|A - AZZ^T\|_p,
\end{aligned}$$
provided that $\eta_2 \le \varepsilon^2/(2k)^{2/p-1}$.

Remark 1. The preceding lemma (Lemma 3.5) may be of independent interest, as it solves Schatten $p$-norm regression efficiently, which has not been discussed in the literature in the context of dimensionality reduction before.
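Remark 1 suggests a standalone use: approximately minimizing $\|A - YZ^T\|_p$ over $Y$ by solving a small sketched Frobenius-norm problem. The following NumPy sketch is our own illustration (a Gaussian sketch stands in for the sparse embedding $R$, and the pseudoinverse solve stands in for Line 11); it compares the sketched solution against the exact minimizer $Y = AZ$ in Schatten 1-norm.

```python
import numpy as np

def schatten(M, p):
    s = np.linalg.svd(M, compute_uv=False)
    return np.sum(s ** p) ** (1.0 / p)

def sketched_regression(A, Z, sketch_cols, rng):
    """Approximate argmin_Y ||A - Y Z^T||_p via the sketched problem min_Y ||(A - Y Z^T) R||_F."""
    n = A.shape[1]
    R = rng.standard_normal((n, sketch_cols)) / np.sqrt(sketch_cols)
    return (A @ R) @ np.linalg.pinv(Z.T @ R)   # minimizer of ||A R - Y (Z^T R)||_F

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    m, n, k = 400, 300, 10
    A = rng.standard_normal((m, n))
    Z, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal columns
    Y_exact = A @ Z                                    # exact minimizer for any Schatten norm
    Y_hat = sketched_regression(A, Z, sketch_cols=200, rng=rng)
    p = 1
    print(schatten(A - Y_hat @ Z.T, p) / schatten(A - Y_exact @ Z.T, p))
```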

In summary, we conclude the section with our main theorem.

Theorem 3.6. Let $p \in [1,2)$. Suppose that $A \in \mathbb{R}^{m\times n}$ with $m \ge n$. There is a randomized algorithm which outputs $Y \in \mathbb{R}^{m\times k}$ and $Z \in \mathbb{R}^{n\times k}$ such that $\hat X = YZ^T$ satisfies (1) with probability at least $0.97$. The algorithm runs in time $O(\mathrm{nnz}(A)\log n) + \tilde O_p(mk^{2(\omega-1)/p}/\varepsilon^{(4/p-1)(\omega-1)} + k^{2\omega/p}/\varepsilon^{(4/p-1)(2\omega+2)})$.

Proof. The correctness of the output is clear from the preceding lemmata by rescaling $\varepsilon$. We discuss the runtime below.

We first examine the runtime to obtain $Z$. For $\eta_1$ as in Lemma 3.4 we have $k\eta_1 \le \varepsilon$ and thus $K = \Theta(\varepsilon/\eta_1)$. Applying Lemma 2.3, we have $s = \tilde O(K/\varepsilon^2)$ and obtaining the matrix $S$ takes time $O(\mathrm{nnz}(A)\log n) + \tilde O(mK^{\omega-1})$. By Lemma 2.2, we can obtain $(SA)T$ in time $O(\mathrm{nnz}(SA)) + \tilde O(s^\omega/\varepsilon^2)$ for a matrix $T$ of $\tilde O(s/\varepsilon^2)$ columns, and thus the subsequent SVD of $SAT$, by Lemma 2.1, takes $\tilde O(s^\omega/\varepsilon^2)$ time. These three steps take time $\tilde O(mK^{\omega-1}) + O(\mathrm{nnz}(SA)) + \tilde O(s^\omega/\varepsilon^2) = O(\mathrm{nnz}(A)) + \tilde O(mK^{\omega-1} + K^\omega/\varepsilon^{2\omega+2})$, where we used the fact that $S$ samples the rows of $A$ without replacement and so $\mathrm{nnz}(SA) \le \mathrm{nnz}(A)$. Calculating the row span of $W'SA$, which is a $k$-by-$n$ matrix, takes $O(nk^{\omega-1})$ time. The total runtime is $O(\mathrm{nnz}(A)\log n) + \tilde O_p(mK^{\omega-1} + K^\omega/\varepsilon^{2\omega+2})$. Plugging in $K = \varepsilon/\eta_1 = \Theta(k^{2/p}/\varepsilon^{4/p-1})$ yields the runtime $O(\mathrm{nnz}(A)\log n) + \tilde O(mk^{2(\omega-1)/p}/\varepsilon^{(4/p-1)(\omega-1)} + k^{2\omega/p}/\varepsilon^{(4/p-1)(2\omega+2)})$.

Next we examine the runtime to obtain $\hat Y$. Since $R$ has $t = \Theta(k/\eta_2)$ columns and $AR$ can be computed in $O(\mathrm{nnz}(A))$ time, $Z^TR$ can be computed in $O(nk)$ time, the row space of $Z^TR$ (which is a $k\times t$ matrix) in $O(k^{\omega-1}t) = \tilde O(k^\omega/\eta_2)$ time, and the final matrix product $(AR)P_{\mathrm{rowspace}(Z^TR)}$ in $O(mtk^{\omega-2}) = \tilde O(mk^{\omega-1}/\eta_2)$ time. Overall, computing $\hat Y$ takes time $O(\mathrm{nnz}(A)) + \tilde O(mk^{\omega-1}/\eta_2) = O(\mathrm{nnz}(A)) + \tilde O(mk^{\omega+2/p-2}/\varepsilon^2)$.

The overall runtime follows immediately.

4. Case p > 2

The algorithm remains the same as in Algorithm 1. In this section we shall prove the correctness and analyse the runtime for constant $p > 2$. The outline of the proof is the same and we shall only highlight the differences. Again we let $r = k/\varepsilon$.

In place of Lemma 3.1, we now have:

Lemma 4.1. Suppose that $p > 2$ and $S$ satisfies (3). It then holds for all rank-$k$ orthogonal projections $Q$ that
$$(1-K_p\varepsilon)\|A(I-Q)\|_{(p,r)}^p - C_{p/2,\varepsilon}\,r\eta_1^{p/2}\|A-A_k\|_F^p \le \|SA(I-Q)\|_{(p,r)}^p \le (1+K_p\varepsilon)\|A(I-Q)\|_{(p,r)}^p + C_{p/2,\varepsilon}\,r\eta_1^{p/2}\|A-A_k\|_F^p,$$
where $K_p \ge 1$ is some constant that depends only on $p$.

Proof. We now have two cases based on (4).

• When $\sigma_i^2(A(I-Q)) \ge (1/\varepsilon)\eta_1\|A-A_k\|_F^2$, we have
$$(1-O_p(\varepsilon))\sigma_i^p(A(I-Q)) \le \sigma_i^p(SA(I-Q)) \le (1+O_p(\varepsilon))\sigma_i^p(A(I-Q)).$$

• When $\sigma_i^2(A(I-Q)) < (1/\varepsilon)\eta_1\|A-A_k\|_F^2$, we have from Lemma 2.4 that
$$(1-\varepsilon)\sigma_i^p(A(I-Q)) - C_{p/2,\varepsilon}\eta_1^{p/2}\|A-A_k\|_F^p \le \sigma_i^p(SA(I-Q)) \le (1+\varepsilon)\sigma_i^p(A(I-Q)) + C_{p/2,\varepsilon}\eta_1^{p/2}\|A-A_k\|_F^p.$$

The claimed result follows in the same manner as in the proof of Lemma 3.1.

The analogue of Lemma 3.4 is the following, where we apply Hölder's inequality in the form $\|A-A_k\|_F \le n^{1/2-1/p}\|A-A_k\|_p$.

Lemma 4.2. Let $\varepsilon \in (0,1/2]$. Suppose that $\hat Q'$ satisfies (5). Then it holds for some constants $c_p, c_p' > 0$ which depend only on $p$ that
$$\|A(I-\hat Q')\|_p^p \le (1+c_p\varepsilon)\|A-A_k\|_p^p,$$
whenever $\eta_1 \le c_p'\,\varepsilon^{1+2/p}/(k^{2/p}n^{1-2/p})$.

Proof. Similarly we have
$$\|A(I-\hat Q')\|_p^p \le (1+c_p\varepsilon)\|A-A_k\|_p^p + rC_{p/2,\varepsilon}\eta_1^{p/2}n^{p/2-1}\|A-A_k\|_p^p.$$
The conclusion follows when
$$C_{p/2,\varepsilon}\eta_1^{p/2}n^{p/2-1} \le \frac\varepsilon r = \frac{\varepsilon^2}k,$$
that is,
$$\frac p2\left(1+\frac1\varepsilon\right)^{\frac p2-1}\eta_1^{p/2}n^{\frac p2-1} \le \frac{\varepsilon^2}k.$$

The analogue of Lemma 3.5 is the following.

Lemma 4.3. When $p > 2$ is a constant, the matrix $\hat Y$ defined in Line 11 of Algorithm 1 satisfies with probability at least $0.9$ that
$$\|A - \hat YZ^T\|_p \le (1+c_p\varepsilon)\min_{Y:\,\mathrm{rank}(Y)=k}\|A - YZ^T\|_p$$
for some constant $c_p$ that depends only on $p$, whenever $\eta_2 \le \varepsilon^2/n^{1-2/p}$.

Proof. The proof is similar to that of Lemma 3.5, except that in the last part of the argument we have instead
$$\begin{aligned}
\|A - \hat YZ^T\|_p &\le (1+c_p\varepsilon)\|A - AZZ^T\|_p + \|\hat Y - AZ\|_p\\
&\le (1+c_p\varepsilon)\|A - AZZ^T\|_p + \|\hat Y - AZ\|_F\\
&\le (1+c_p\varepsilon)\|A - AZZ^T\|_p + \sqrt{\eta_2}\,\|A - AZZ^T\|_F\\
&\le (1+c_p\varepsilon)\|A - AZZ^T\|_p + \sqrt{\eta_2}\,n^{\frac12-\frac1p}\|A - AZZ^T\|_p,
\end{aligned}$$
and we would need $\eta_2 \le \varepsilon^2/n^{1-2/p}$.

In summary, we have the following main theorem.

Theorem 4.4. Let $p > 2$ be a constant. Suppose that $A \in \mathbb{R}^{m\times n}$ ($m \ge n$). There is a randomized algorithm which outputs $Y \in \mathbb{R}^{m\times k}$ and $Z \in \mathbb{R}^{n\times k}$ such that $\hat X = YZ^T$ satisfies (1) with probability at least $0.97$. The algorithm runs in time $O(\mathrm{nnz}(A)\log n) + \tilde O_p(n^{\omega(1-2/p)}k^{2\omega/p}/\varepsilon^{2\omega/p+2} + mn^{(\omega-1)(1-2/p)}(k/\varepsilon)^{2(\omega-1)/p})$.

Proof. The correctness follows from the previous lemmata as in the proof of Theorem 3.6. Below we discuss the running time.

First we examine the time required to obtain $Z$. It is easy to verify that $k\eta_1 \le \varepsilon$ and so $K = \Theta(\varepsilon/\eta_1)$. Similar to the analysis in Theorem 3.6, we have the total runtime $O(\mathrm{nnz}(A)\log n) + \tilde O_p(mK^{\omega-1} + K^\omega/\varepsilon^{2\omega+2})$. Note that $K = \Theta(\varepsilon/\eta_1) = \Theta(n^{1-2/p}(k/\varepsilon)^{2/p})$, so the runtime becomes $O(\mathrm{nnz}(A)\log n) + \tilde O_p(mn^{(\omega-1)(1-2/p)}(k/\varepsilon)^{2(\omega-1)/p} + n^{\omega(1-2/p)}k^{2\omega/p}/\varepsilon^{2\omega/p+2})$.

Next we examine the time required to obtain $\hat Y$. Again, similarly, the runtime is $O(\mathrm{nnz}(A)) + \tilde O(mk^{\omega-1}/\eta_2) = O(\mathrm{nnz}(A)) + \tilde O(mn^{1-2/p}k^{\omega-1}/\varepsilon^2)$.

Combining the two runtimes above yields the overall runtime.

5. Experiments

The contribution of our work is primarily theoretical: an input sparsity time algorithm for low-rank approximation for any Schatten $p$-norm. In this section, nevertheless, we give an empirical verification of the advantage of our algorithm on both synthetic and real-world data. We focus on the most important case of the nuclear norm, i.e., $p = 1$.

In addition to the solution provided by our algorithm, we also consider a natural candidate for a low-rank approximation algorithm, which is the solution in Frobenius norm, that is, a rank-$k$ matrix $X$ for which $\|A - X\|_F \le (1+\varepsilon)\|A - A_k\|_F$. This problem admits a simple solution as follows. Take $S$ to be a Count-Sketch matrix and let $Z$ be an $n\times k$ matrix whose columns form an orthonormal basis of the top-$k$ right singular vectors of $SA$. Then $X = AZZ^T$ is a Frobenius-norm solution with high probability (Cohen et al., 2015a).

We shall compare the quality (i.e., approximation ratio measured in Schatten 1-norm) of both solutions and the running times.³

³All tests are run under MATLAB 2019b on a machine with an Intel Core i7-6550U [email protected] and 2 cores.

Synthetic Data. We adopt a simpler version of Algorithm 1 by taking $S$ to be a Count-Sketch matrix of target dimension $k^2$ and both $R$ and $T$ to be identity matrices of appropriate dimension. For the Frobenius-norm solution, we also take $S$ to be a Count-Sketch matrix of target dimension $k^2$. We choose $n = 3000$ and generate a random $n\times n$ matrix $A$ of independent entries, each of which is uniform in $[0,1]$ with probability $0.05$ and $0$ with probability $0.95$. Since the regime of interest is $k \ll n$, we vary $k$ among $\{5, 10, 20\}$. For each value of $k$, we run our algorithm 50 times and record the relative approximation error $\varepsilon_1 = \|A - YZ^T\|_1/\|A - A_k\|_1 - 1$ and the running time, as well as the relative approximation error of the Frobenius-norm solution $\varepsilon_2 = \|A - X\|_1/\|A - A_k\|_1 - 1$ and its running time. The same matrix $A$ is used for all tests. In Table 1 we report the median of $\varepsilon_1$, the median of $\varepsilon_2$, and the median running time of both algorithms among 50 independent runs for each $k$, together with the median running time of a full SVD of $A$ among 10 runs.

Table 1: Performance of our algorithm on synthetic data compared with the approximate Frobenius-norm solution and the SVD.

                                        k = 5     k = 10    k = 20
  median of ε₁                          0.00372   0.00377   0.00486
  median of ε₂                          0.00412   0.00485   0.00637
  median runtime of ‖·‖₁ algorithm      0.067s    0.196s    0.428s
  median runtime of ‖·‖_F algorithm     0.044s    0.073s    0.191s
  median runtime of SVD                 5.788s

We can observe that our algorithm achieves a good (relative) approximation error, which is less than 0.005 in all such cases of $k$. Our algorithm also outperforms the approximate Frobenius-norm solution by 10%–30% in terms of approximation error, and it runs about 13-fold faster than a regular SVD.
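For readers who want to reproduce the flavor of the synthetic experiment in a different environment, here is a rough Python/SciPy analogue (our own sketch; the paper's experiments were run in MATLAB, use $n = 3000$, and report medians over 50 runs, none of which is reproduced here). It generates a sparse random matrix as described above, applies a Count-Sketch $S$ of target dimension $k^2$, and reports the relative Schatten 1-norm error of the resulting rank-$k$ approximation.

```python
import numpy as np
import scipy.sparse as sp

def count_sketch(d, n, rng):
    """d x n Count-Sketch matrix: one random +/-1 entry per column."""
    h = rng.integers(0, d, size=n)
    sgn = rng.choice([-1.0, 1.0], size=n)
    return sp.csr_matrix((sgn, (h, np.arange(n))), shape=(d, n))

def nuclear(M):
    return np.linalg.svd(M, compute_uv=False).sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, k = 1000, 10                           # the paper uses n = 3000
    # Entries uniform in [0,1] with probability 0.05 and 0 otherwise.
    A = sp.random(n, n, density=0.05, random_state=0, data_rvs=rng.random).toarray()

    S = count_sketch(k * k, n, rng)           # target dimension k^2
    _, _, Vt = np.linalg.svd(S @ A, full_matrices=False)
    Z = Vt[:k, :].T                           # top-k right singular vectors of S A
    X = (A @ Z) @ Z.T                         # rank-k approximation A Z Z^T

    s = np.linalg.svd(A, compute_uv=False)
    eps1 = nuclear(A - X) / s[k:].sum() - 1   # relative Schatten 1-norm error
    print(eps1)
```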

KOS data. For real-world data, we use a word frequency dataset, named KOS, from UC Irvine.⁴ The matrix represents word frequencies in blogs and has dimension $3430\times 6906$ with $353160$ non-zero entries. Again we report the median relative approximation error and the median running time of our algorithm and those of the Frobenius-norm algorithm among 50 independent runs for each value of $k \in \{5, 10, 20\}$. The results are shown in Table 2.

⁴https://archive.ics.uci.edu/ml/datasets/Bag+of+Words

Table 2: Performance of our algorithm on KOS data compared with the approximate Frobenius-norm solution and the SVD.

                                        k = 5     k = 10    k = 20
  median of ε₁                          0.0149    0.0145    0.0132
  median of ε₂                          0.0183    0.0216    0.0259
  median runtime of ‖·‖₁ algorithm      0.155s    0.204s    0.323s
  median runtime of ‖·‖_F algorithm     0.113s    0.154s    0.242s
  median runtime of SVD                 4.999s

Our algorithm achieves a good approximation error, less than 0.015, and surpasses the approximate Frobenius-norm solution for all such values of $k$. The gap between the two solutions in approximation error widens as $k$ increases: when $k = 10$, our algorithm outperforms the approximate Frobenius-norm solution by 30%; when $k = 20$, this increases to almost 50%. Our algorithm, although consistently slower than the Frobenius-norm algorithm by 30%–40%, still displays a 14.5-fold speed-up compared with the regular SVD.

6. Generalization

More generally, one can ask to solve the problem of low-rank approximation with respect to some function $\Phi$ of the matrix singular values, i.e.,
$$\min_{X:\,\mathrm{rank}(X)=k}\Phi(A - X). \quad (6)$$
Here we consider $\Phi(A) = \sum_i\phi(\sigma_i(A))$ for some increasing function $\phi:[0,\infty)\to[0,\infty)$. It is clear that $\Phi$ is rotationally invariant and that $\Phi(A) \ge \Phi(B)$ if $\sigma_i(A) \ge \sigma_i(B)$ for all $i$. These two properties allow us to conclude that $A_k$ remains an optimal solution for such general $\Phi$.

We further assume that $\phi$ satisfies the following conditions.

(a) There exists $\alpha > 0$ such that $\phi((1+\varepsilon)x) \le (1+\alpha\varepsilon)\phi(x)$ and $\phi((1-\varepsilon)x) \ge (1-\alpha\varepsilon)\phi(x)$ for all sufficiently small $\varepsilon$.
(b) It holds for each sufficiently small $\varepsilon$ that $K_{\phi,\varepsilon}^1 = \sup_{x>0}\sup_{y\in[\varepsilon x,x]}(\phi(x+y)-\phi(x))/\phi(y) < \infty$ and $K_{\phi,\varepsilon}^2 = \sup_{x>0}\sup_{y\in[\varepsilon x,x]}(\phi(x)-\phi(x-y))/\phi(y) < \infty$.
(c) It holds for each sufficiently small $\varepsilon$ that $L_{\phi,\varepsilon} = \sup_{x>0}\phi(\varepsilon x)/\phi(x) < \infty$.
(d) There exists $\gamma > 0$ such that $\phi(x+y) \le \gamma(\phi(x)+\phi(y))$.

When the function $\phi$ is clear from the context, we also abbreviate $K_{\phi,\varepsilon}^i$ and $L_{\phi,\varepsilon}$ as $K_\varepsilon^i$ and $L_\varepsilon$, respectively. Let $K_\varepsilon = \max\{K_\varepsilon^1, K_\varepsilon^2\}$.

It follows from a similar argument to Lemma 4.1 and Conditions (a)–(c) that
$$(1-\alpha\varepsilon)\phi(\sigma_i(A(I-Q))) - L_{\sqrt{\eta_1}}K_\varepsilon\,\phi(\|A-A_k\|_F) \le \phi(\sigma_i(SA(I-Q))) \le (1+\alpha\varepsilon)\phi(\sigma_i(A(I-Q))) + L_{\sqrt{\eta_1}}K_\varepsilon\,\phi(\|A-A_k\|_F).$$
Note that Condition (d) implies that $\phi\big(\sqrt{\textstyle\sum_i x_i^2}\big) \le \phi\big(\textstyle\sum_i x_i\big) \le \gamma\sum_i\phi(x_i)$, which further implies that $\phi(\|A-A_k\|_F) \le \gamma\Phi(A-A_k)$. Therefore
$$(1-\alpha\varepsilon)\phi(\sigma_i(A(I-Q))) - \gamma L_{\sqrt{\eta_1}}K_\varepsilon\,\Phi(A-A_k) \le \phi(\sigma_i(SA(I-Q))) \le (1+\alpha\varepsilon)\phi(\sigma_i(A(I-Q))) + \gamma L_{\sqrt{\eta_1}}K_\varepsilon\,\Phi(A-A_k).$$

Analogously to the singular $(p,r)$-norm, we define $\Phi_r(A) = \sum_{i=1}^r\phi(\sigma_i(A))$. It is easy to verify that the argument of Lemmata 3.2 to 3.4 goes through with minimal changes, yielding that
$$\Phi_r(A(I-\hat Q')) \le (1+c_1\varepsilon)\Phi_r(A-A_k) + c_2 r L_{\sqrt{\eta_1}}K_\varepsilon\,\Phi(A-A_k)$$
for some constants $c_1, c_2 > 0$ that depend on $\alpha$, $\gamma$, $K_\varepsilon$ and $L_\varepsilon$. When $\eta_1 \le c_3(\varepsilon/r)^{1/\alpha}$ we have
$$\Phi(A(I-\hat Q')) \le (1+c_4\varepsilon)\Phi(A-A_k).$$
We can then output $AZ$ and $Z$ in time $O(\mathrm{nnz}(A)\cdot k + nk)$. Performing a similar analysis on the running time as before, we arrive at the following theorem.

Theorem 6.1. Suppose that $\phi:[0,\infty)\to[0,\infty)$ is increasing and satisfies Conditions (a)–(d), and that $K_\varepsilon = \mathrm{poly}(1/\varepsilon)$ and $L_\varepsilon = \mathrm{poly}(1/\varepsilon)$. Let $A \in \mathbb{R}^{n\times n}$. There is a randomized algorithm which outputs matrices $Y, Z \in \mathbb{R}^{n\times k}$ such that $X = YZ^T$ satisfies (6) with probability at least $0.98$. The algorithm runs in time $O(\mathrm{nnz}(A)(k+\log n)) + \tilde O(n\,\mathrm{poly}(k/\varepsilon))$, where the hidden constants depend on $\alpha$, $\gamma$ and the polynomial exponents for $K_\varepsilon$ and $L_\varepsilon$.

We remark that a few common loss functions satisfy our conditions for $\phi$. These include the Tukey $p$-norm loss function $\phi(x) = x^p\cdot 1_{\{x\le\tau\}} + \tau^p\cdot 1_{\{x>\tau\}}$, the $\ell_1$-$\ell_2$ loss function $\phi(x) = 2(\sqrt{1+x^2/2}-1)$, and the Huber loss function $\phi(x) = x^2/2\cdot 1_{\{x\le\tau\}} + \tau(x-\tau/2)\cdot 1_{\{x>\tau\}}$.
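For concreteness, here is a small NumPy sketch (ours, not from the paper) of the three loss functions named above and of the resulting objective $\Phi(A - X) = \sum_i\phi(\sigma_i(A-X))$; the threshold $\tau$ is a free parameter.

```python
import numpy as np

def tukey(x, p, tau):
    """Tukey p-norm loss: x^p below tau, capped at tau^p above."""
    return np.where(x <= tau, x**p, tau**p)

def l1_l2(x):
    """l1-l2 loss: 2*(sqrt(1 + x^2/2) - 1)."""
    return 2.0 * (np.sqrt(1.0 + x**2 / 2.0) - 1.0)

def huber(x, tau):
    """Huber loss: quadratic below tau, linear above."""
    return np.where(x <= tau, x**2 / 2.0, tau * (x - tau / 2.0))

def Phi(M, phi):
    """Phi(M) = sum_i phi(sigma_i(M)) for a singular-value loss phi."""
    return phi(np.linalg.svd(M, compute_uv=False)).sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 80))
    X = np.zeros_like(A)   # e.g., the trivial rank-0 approximation
    print(Phi(A - X, lambda s: huber(s, tau=1.0)),
          Phi(A - X, l1_l2),
          Phi(A - X, lambda s: tukey(s, p=1, tau=1.0)))
```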

Acknowledgements

The authors would like to thank the anonymous reviewers for helpful comments. Y. Li was supported in part by Singapore Ministry of Education (AcRF) Tier 2 grant MOE2018-T2-1-013. D. P. Woodruff was supported in part by Office of Naval Research (ONR) grant N00014-18-1-2562. Part of this work was done during a visit of D. P. Woodruff to Nanyang Technological University, funded by the aforementioned Singapore Ministry of Education grant.

References

Candès, E. J., Li, X., Ma, Y., and Wright, J. Robust principal component analysis? J. ACM, 58(3), June 2011.

Clarkson, K. L. and Woodruff, D. P. Low-rank approximation and regression in input sparsity time. J. ACM, 63(6), January 2017. ISSN 0004-5411.

Cohen, M. B., Elder, S., Musco, C., Musco, C., and Persu, M. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC '15, pp. 163–172, New York, NY, USA, 2015a. Association for Computing Machinery.

Cohen, M. B., Lee, Y. T., Musco, C., Musco, C., Peng, R., and Sidford, A. Uniform sampling for matrix approximation. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, ITCS '15, pp. 181–190, 2015b.

Cohen, M. B., Musco, C., and Musco, C. Input sparsity time low-rank approximation via ridge leverage score sampling. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '17, pp. 1758–1777, USA, 2017. Society for Industrial and Applied Mathematics.

Demmel, J., Dumitriu, I., and Holtz, O. Fast linear algebra is stable. Numer. Math., 108(1):59–91, October 2007. ISSN 0029-599X.

Li, Y., Nguyen, H. L., and Woodruff, D. P. On approximating matrix norms in data streams. SIAM Journal on Computing, 48(6):1643–1697, 2019.

Meng, X. and Mahoney, M. W. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Symposium on Theory of Computing Conference, STOC '13, Palo Alto, CA, USA, June 1-4, 2013, pp. 91–100, 2013.

Musco, C. Faster linear algebra for data analysis and machine learning. PhD thesis, MIT, 2018.

Musco, C. and Woodruff, D. P. Is input sparsity time possible for kernel low-rank approximation? In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS '17, pp. 4438–4448, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.

Nelson, J. and Nguyen, H. L. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pp. 117–126, October 2013.

Woodruff, D. P. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1–2):1–157, 2014.

Xu, H., Caramanis, C., and Sanghavi, S. Robust PCA via outlier pursuit. In Lafferty, J. D., Williams, C. K. I., Shawe-Taylor, J., Zemel, R. S., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 23, pp. 2496–2504. Curran Associates, Inc., 2010.

Yi, X., Park, D., Chen, Y., and Caramanis, C. Fast algorithms for robust PCA via gradient descent. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS '16, pp. 4159–4167, Red Hook, NY, USA, 2016. Curran Associates Inc.