Input-Sparsity Low Rank Approximation in Schatten Norm


Yi Li*¹   David P. Woodruff*²

*Equal contribution. ¹School of Physical and Mathematical Sciences, Nanyang Technological University. ²Department of Computer Science, Carnegie Mellon University. Correspondence to: Yi Li <[email protected]>, David Woodruff <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

We give the first input-sparsity time algorithms for the rank-$k$ low rank approximation problem in every Schatten norm. Specifically, for a given $m \times n$ ($m \ge n$) matrix $A$, our algorithm computes $Y \in \mathbb{R}^{m \times k}$ and $Z \in \mathbb{R}^{n \times k}$ which, with high probability, satisfy
$$\|A - YZ^T\|_p \le (1+\varepsilon)\,\|A - A_k\|_p,$$
where $\|M\|_p = \bigl(\sum_{i=1}^n \sigma_i(M)^p\bigr)^{1/p}$ is the Schatten $p$-norm of a matrix $M$ with singular values $\sigma_1(M), \ldots, \sigma_n(M)$, and where $A_k$ is the best rank-$k$ approximation to $A$. Our algorithm runs in time $\tilde O(\mathrm{nnz}(A) + mn^{\alpha_p}\,\mathrm{poly}(k/\varepsilon))$, where $\alpha_p = 0$ for $p \in [1,2)$, $\alpha_p = (\omega-1)(1-2/p)$ for $p > 2$, and $\omega \approx 2.374$ is the exponent of matrix multiplication. For the important case of $p = 1$, which corresponds to the more "robust" nuclear norm, we obtain $\tilde O(\mathrm{nnz}(A) + m \cdot \mathrm{poly}(k/\varepsilon))$ time, which was previously known only for the Frobenius norm ($p = 2$). Moreover, since $\alpha_p < \omega - 1$ for every $p$, our algorithm has a better dependence on $n$ than the singular value decomposition for every $p$. Crucial to our analysis is the use of dimensionality reduction for Ky-Fan $p$-norms.

1. Introduction

A common task in processing or analyzing large-scale datasets is to approximate a large matrix $A \in \mathbb{R}^{m \times n}$ ($m \ge n$) with a low-rank matrix. Often this is done with respect to the Frobenius norm: the objective is to minimize the error $\|A - X\|_F$ over all rank-$k$ matrices $X \in \mathbb{R}^{m \times n}$ for a rank parameter $k$. It is well known that the optimal solution is $A_k = P_L A = A P_R$, where $P_L$ is the orthogonal projection onto the top $k$ left singular vectors of $A$ and $P_R$ is the orthogonal projection onto the top $k$ right singular vectors of $A$. Typically this is found via the singular value decomposition (SVD) of $A$, which is an expensive operation.

For large matrices $A$ this is too slow, so we instead allow for randomized approximation algorithms in the hope of achieving a much faster running time. Formally, given an approximation parameter $\varepsilon > 0$, we would like to find a rank-$k$ matrix $X$ for which $\|A - X\|_F \le (1+\varepsilon)\,\|A - A_k\|_F$ with large probability. For this relaxed problem, a number of efficient methods are known, based on dimensionality reduction techniques such as random projections, importance sampling, and other sketching methods, with running times $\tilde O(\mathrm{nnz}(A) + m\,\mathrm{poly}(k/\varepsilon))$,¹٬² where $\mathrm{nnz}(A)$ denotes the number of non-zero entries of $A$. This is significantly faster than the SVD, which takes $\tilde\Theta(mn^{\omega-1})$ time, where $\omega$ is the exponent of matrix multiplication. See (Woodruff, 2014) for a survey.

¹We use the notation $\tilde O(f)$ to hide the polylogarithmic factors in $O(f\,\mathrm{poly}(\log f))$.
²Since outputting $X$ takes $O(mn)$ time, these algorithms usually output $X$ in factored form, where each factor has rank $k$.

In this work, we consider approximation error with respect to general matrix norms, i.e., the Schatten $p$-norm. The Schatten $p$-norm, denoted by $\|\cdot\|_p$, is defined to be the $\ell_p$-norm of the singular values of the matrix. Below is the formal definition of the problem.

Definition 1.1 (Low-rank Approximation). Let $p \ge 1$. Given a matrix $A \in \mathbb{R}^{m \times n}$, find a rank-$k$ matrix $\hat X \in \mathbb{R}^{m \times n}$ for which
$$\|A - \hat X\|_p \le (1+\varepsilon) \min_{X:\,\mathrm{rank}(X)=k} \|A - X\|_p. \qquad (1)$$

It is a well-known fact (Mirsky's Theorem) that the optimal solution for general Schatten norms coincides with the optimal rank-$k$ matrix $A_k$ for the Frobenius norm, given by the SVD. However, approximate solutions for the Frobenius norm loss function may give horrible approximations for other Schatten $p$-norms.

Of particular importance is the Schatten 1-norm, also called the nuclear norm or the trace norm, which is the sum of the singular values of a matrix. It is typically considered to be more robust than the Frobenius norm (Schatten 2-norm) and has been used in robust PCA applications (see, e.g., (Xu et al., 2010; Candès et al., 2011; Yi et al., 2016)).

For example, suppose the top singular value of an $n \times n$ matrix $A$ is 1, the next $2k$ singular values are $1/\sqrt{k}$, and the remaining singular values are 0. A Frobenius norm rank-$k$ approximation could just choose the top singular direction and pay a cost of $\sqrt{2k \cdot 1/k} = \sqrt{2}$. Since the Frobenius norm of the bottom $n-k$ singular values is $\sqrt{(k+1) \cdot 1/k}$, this is a $\sqrt{2}$-approximation. On the other hand, if a Schatten 1-norm rank-$k$ approximation algorithm were to output only the top singular direction, it would pay a cost of $2k \cdot 1/\sqrt{k} = 2\sqrt{k}$. The bottom $n-k$ singular values have Schatten 1-norm $(k+1) \cdot 1/\sqrt{k}$. Consequently, the approximation factor would be $2(1-o(1))$, and one can show that if we insisted on a $\sqrt{2}$-approximation or better, a Schatten 1-norm algorithm would need to capture a constant fraction of the top $k$ directions, and thus capture more of the underlying data than a Frobenius norm solution.

Consider another example, where the top $k$ singular values are all 1 and the $(k+i)$-th singular value is $1/i$. When $k = o(\log n)$, capturing only the top singular direction gives a $(1+o(1))$-approximation for the Schatten 1-norm but a $\Theta(\sqrt{k})$-approximation for the Frobenius norm. This example, together with the preceding one, shows that the Schatten norm is a genuinely different error metric.
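The first spectrum above is easy to check numerically. A minimal NumPy sketch (the sizes $k = 25$, $n = 200$ are our own illustrative choices, not from the paper):

```python
import numpy as np

def schatten_norm(M, p):
    """Schatten p-norm: the l_p norm of the singular values of M."""
    s = np.linalg.svd(M, compute_uv=False)
    return (s ** p).sum() ** (1.0 / p)

def best_rank_k(M, k):
    """Best rank-k approximation M_k via truncated SVD (optimal for every
    Schatten p-norm simultaneously, by Mirsky's theorem)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * s[:k] @ Vt[:k, :]

# First example: sigma_1 = 1, the next 2k singular values are 1/sqrt(k),
# the rest are 0. A diagonal matrix realizes any prescribed spectrum.
k, n = 25, 200
sigma = np.zeros(n)
sigma[0] = 1.0
sigma[1:2 * k + 1] = 1.0 / np.sqrt(k)
A = np.diag(sigma)

top1 = best_rank_k(A, 1)  # keep only the top singular direction
topk = best_rank_k(A, k)  # the optimal rank-k approximation

# Frobenius (p = 2): keeping one direction costs sqrt(2), the optimum
# costs sqrt((k+1)/k), so the ratio is sqrt(2k/(k+1)), roughly sqrt(2).
frob_ratio = schatten_norm(A - top1, 2) / schatten_norm(A - topk, 2)

# Schatten 1: the same choice costs 2*sqrt(k) against (k+1)/sqrt(k),
# so the ratio 2k/(k+1) approaches 2 -- a much worse approximation.
nuclear_ratio = schatten_norm(A - top1, 1) / schatten_norm(A - topk, 1)
```

The same two helpers also confirm the second spectrum if the diagonal is set to $1, \ldots, 1, 1/1, 1/2, \ldots$ instead.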
Surprisingly, no algorithms for low-rank approximation in the Schatten $p$-norm were known to run in time $\tilde O(\mathrm{nnz}(A) + m\,\mathrm{poly}(k/\varepsilon))$ prior to this work, except for the special case $p = 2$. We note that the case $p = 2$ has special geometric structure that is not shared by other Schatten $p$-norms. Indeed, a common technique for the $p = 2$ setting is to first find a $\mathrm{poly}(k/\varepsilon)$-dimensional subspace $V$, containing a rank-$k$ subspace inside of it, which is a $(1+\varepsilon)$-approximate subspace onto which to project the rows of $A$. Then, by the Pythagorean theorem, one can first project the rows of $A$ onto $V$ and then find the best rank-$k$ subspace of the projected points inside $V$. For other Schatten $p$-norms, the Pythagorean theorem does not hold, and it is not hard to construct counterexamples to this procedure for $p \neq 2$.

To summarize, the SVD runs in time $\Theta(mn^{\omega-1})$, which is much slower than $\mathrm{nnz}(A) \le mn$, and it is not clear how to adapt existing fast Frobenius-norm algorithms to generate $(1+\varepsilon)$-factor approximations with respect to other Schatten $p$-norms.

Our main result is an algorithm that, with high probability, solves the problem of Definition 1.1 in time $\tilde O(\mathrm{nnz}(A)\log n) + O(mn^{\alpha_p}\,\mathrm{poly}(k/\varepsilon))$, where
$$\alpha_p = \begin{cases} 0, & 1 \le p \le 2,\\ (\omega-1)\bigl(1-\tfrac{2}{p}\bigr), & p > 2, \end{cases}$$
and the hidden constants depend only on $p$.

In the particular case of $p = 1$, and more generally for all $p \in [1,2]$, our algorithm achieves a running time of $O(\mathrm{nnz}(A)\log n + m\,\mathrm{poly}(k/\varepsilon))$, which was previously known to be possible only for $p = 2$. When $p > 2$, the running time begins to depend polynomially on $m$ and $n$, but the dependence remains $o(mn^{\omega-1})$ for all larger $p$. Thus, even for larger values of $p$, when $k$ is subpolynomial in $n$, our algorithm runs substantially faster than the SVD. Empirical evaluations demonstrating our improved algorithm for $p = 1$ are given in Section 5.
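For contrast with the general-$p$ problem, the project-then-solve route that works for $p = 2$ can be sketched in a few lines. This is a generic randomized sketch, not this paper's algorithm: the Gaussian sketching matrix, the dimensions, and the oversampling $s = 4k$ are illustrative choices (input-sparsity methods would use a sparse sketch such as CountSketch instead):

```python
import numpy as np

rng = np.random.default_rng(0)

def sketch_then_solve(A, k, s):
    """Project-then-solve for the Frobenius norm (p = 2): sketch the rows
    of A down to s dimensions, take the top-k left singular directions of
    the sketched matrix, and project A onto them."""
    m, n = A.shape
    S = rng.standard_normal((n, s)) / np.sqrt(s)  # Gaussian sketch
    U, _, _ = np.linalg.svd(A @ S, full_matrices=False)
    P = U[:, :k]                  # top-k directions found in the sketch
    return P @ (P.T @ A)          # rank-k approximation of A itself

# A nearly rank-k matrix: rank-k signal plus small Gaussian noise.
m, n, k = 300, 200, 5
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n)) \
    + 0.01 * rng.standard_normal((m, n))

# Truncated-SVD baseline (the optimal rank-k approximation).
U, sig, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] * sig[:k] @ Vt[:k, :]

err_sketch = np.linalg.norm(A - sketch_then_solve(A, k, s=4 * k))
err_opt = np.linalg.norm(A - A_k)
# err_sketch stays within a small constant factor of err_opt for p = 2;
# as the text notes, the Pythagorean argument behind this guarantee has
# no analogue for other Schatten p-norms.
```
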
It was shown by Musco & Woodruff (2017) that computing a constant-factor low-rank approximation to $A^T A$, given only $A$, requires $\Omega(\mathrm{nnz}(A) \cdot k)$ time. Given that the squared singular values of $A$ are the singular values of $A^T A$, it is natural to suspect that obtaining a constant-factor low-rank approximation in the Schatten 4-norm would therefore require $\Omega(\mathrm{nnz}(A) \cdot k)$ time. Surprisingly, we show this is not the case, and obtain an $\tilde O(\mathrm{nnz}(A) + mn^{(\omega-1)/2}\,\mathrm{poly}(k/\varepsilon))$ time algorithm.

In addition, we generalize the error metric from matrix norms to a wide family of general loss functions; see Section 6 for details. Thus, we considerably broaden the class of loss functions for which input-sparsity time algorithms were previously known.

Technical Overview. We illustrate our ideas for $p = 1$. Our goal is to find an orthogonal projection $\hat Q'$ for which $\|A(I - \hat Q')\|_1 \le (1 + O(\varepsilon))\,\|A - A_k\|_1$. The crucial idea in the analysis is to split $\|\cdot\|_1$ into a head part $\|\cdot\|_{(r)}$, known as the Ky-Fan norm, which equals the sum of the top $r$ singular values, and a tail part $\|\cdot\|_{(-r)}$ (this is just notation; the tail part is not a norm), which equals the sum of all the remaining singular values. Observe that for $r \ge k/\varepsilon$ it holds that $\|A(I - \hat Q')\|_{(-r)} \le$ …
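By definition, the head/tail split in the overview satisfies $\|M\|_1 = \|M\|_{(r)} + \|M\|_{(-r)}$. A minimal NumPy sketch of the two quantities (the matrix size and $r$ are illustrative):

```python
import numpy as np

def kyfan_head(M, r):
    """Ky-Fan r-norm ||M||_(r): the sum of the top r singular values."""
    s = np.linalg.svd(M, compute_uv=False)
    return s[:r].sum()

def kyfan_tail(M, r):
    """Tail term ||M||_(-r): the sum of the remaining singular values
    (just notation -- the tail part is not a norm)."""
    s = np.linalg.svd(M, compute_uv=False)
    return s[r:].sum()

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 40))
r = 10

# The Schatten 1-norm (nuclear norm) splits exactly into head + tail.
nuclear = np.linalg.svd(A, compute_uv=False).sum()
assert np.isclose(kyfan_head(A, r) + kyfan_tail(A, r), nuclear)
```
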
