Input-Sparsity Low Rank Approximation in Schatten Norm


Yi Li*¹   David P. Woodruff*²

*Equal contribution. ¹School of Physical and Mathematical Sciences, Nanyang Technological University. ²Department of Computer Science, Carnegie Mellon University. Correspondence to: Yi Li <[email protected]>, David Woodruff <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

We give the first input-sparsity time algorithms for the rank-k low rank approximation problem in every Schatten norm. Specifically, for a given m×n (m ≥ n) matrix A, our algorithm computes Y ∈ R^{m×k} and Z ∈ R^{n×k} which, with high probability, satisfy ‖A − YZ^T‖_p ≤ (1 + ε)‖A − A_k‖_p, where ‖M‖_p = (Σ_{i=1}^n σ_i(M)^p)^{1/p} is the Schatten p-norm of a matrix M with singular values σ_1(M), …, σ_n(M), and where A_k is the best rank-k approximation to A. Our algorithm runs in time Õ(nnz(A) + mn^{α_p} poly(k/ε)), where α_p = 0 for p ∈ [1, 2) and α_p = (ω − 1)(1 − 2/p) for p > 2, and ω ≈ 2.374 is the exponent of matrix multiplication. For the important case of p = 1, which corresponds to the more "robust" nuclear norm, we obtain Õ(nnz(A) + m·poly(k/ε)) time, which was previously only known for the Frobenius norm (p = 2). Moreover, since α_p < ω − 1 for every p, our algorithm has a better dependence on n than the singular value decomposition for every p. Crucial to our analysis is the use of dimensionality reduction for Ky-Fan p-norms.

1. Introduction

A common task in processing or analyzing large-scale datasets is to approximate a large matrix A ∈ R^{m×n} (m ≥ n) with a low-rank matrix. Often this is done with respect to the Frobenius norm: the objective is to minimize the error ‖A − X‖_F over all rank-k matrices X ∈ R^{m×n} for a rank parameter k. It is well known that the optimal solution is A_k = P_L A = A P_R, where P_L is the orthogonal projection onto the top k left singular vectors of A, and P_R is the orthogonal projection onto the top k right singular vectors of A. Typically this is found via the singular value decomposition (SVD) of A, which is an expensive operation.

For large matrices A this is too slow, so we instead allow for randomized approximation algorithms in the hope of achieving a much faster running time. Formally, given an approximation parameter ε > 0, we would like to find a rank-k matrix X for which ‖A − X‖_F ≤ (1 + ε)‖A − A_k‖_F with large probability. For this relaxed problem, a number of efficient methods are known, which are based on dimensionality reduction techniques such as random projections, importance sampling, and other sketching methods, with running times Õ(nnz(A) + m poly(k/ε))¹,², where nnz(A) denotes the number of non-zero entries of A. This is significantly faster than the SVD, which takes Θ̃(mn^{ω−1}) time, where ω is the exponent of matrix multiplication. See (Woodruff, 2014) for a survey.

¹We use the notation Õ(f) to hide the polylogarithmic factors in O(f poly(log f)).
²Since outputting X takes O(mn) time, these algorithms usually output X in factored form, where each factor has rank k.

In this work, we consider approximation error with respect to general matrix norms, i.e., to the Schatten p-norm. The Schatten p-norm, denoted by ‖·‖_p, is defined to be the ℓ_p-norm of the singular values of the matrix. Below is the formal definition of the problem.

Definition 1.1 (Low-rank Approximation). Let p ≥ 1. Given a matrix A ∈ R^{m×n}, find a rank-k matrix X̂ ∈ R^{m×n} for which

  ‖A − X̂‖_p ≤ (1 + ε) · min_{X: rank(X)=k} ‖A − X‖_p.  (1)

It is a well-known fact (Mirsky's theorem) that the optimal solution for general Schatten norms coincides with the optimal rank-k matrix A_k for the Frobenius norm, given by the SVD. However, approximate solutions for the Frobenius-norm loss function may give horrible approximations for other Schatten p-norms.

Of particular importance is the Schatten 1-norm, also called the nuclear norm or the trace norm, which is the sum of the singular values of a matrix. It is typically considered to be more robust than the Frobenius norm (Schatten 2-norm) and has been used in robust PCA applications (see, e.g., (Xu et al., 2010; Candès et al., 2011; Yi et al., 2016)).

For example, suppose the top singular value of an n × n matrix A is 1, the next 2k singular values are 1/√k, and the remaining singular values are 0. A Frobenius-norm rank-k approximation could just choose the top singular direction and pay a cost of √(2k · 1/k) = √2. Since the Frobenius norm of the bottom n − k singular values is √((k+1) · 1/k) ≈ 1, this is a √2-approximation. On the other hand, if a Schatten 1-norm rank-k approximation algorithm were to output only the top singular direction, it would pay a cost of 2k · 1/√k = 2√k, while the bottom n − k singular values have Schatten 1-norm (k+1) · 1/√k ≈ √k. Consequently, the approximation factor would be 2(1 − o(1)), and one can show that if we insisted on a √2-approximation or better, a Schatten 1-norm algorithm would need to capture a constant fraction of the top k directions, and thus capture more of the underlying data than a Frobenius-norm solution.

Our algorithm runs in time Õ(nnz(A) log n) + O(mn^{α_p} poly(k/ε)), where

  α_p = 0 for 1 ≤ p ≤ 2,  and  α_p = (ω − 1)(1 − 2/p) for p > 2,

and the hidden constants depend only on p. In the particular case of p = 1, and more generally for all p ∈ [1, 2], our algorithm achieves a running time of O(nnz(A) log n + m poly(k/ε)), which was previously known to be possible for p = 2 only. When p > 2, the running time begins to depend polynomially on m and n, but the dependence remains o(mn^{ω−1}) for all larger p.
Thus, even for larger values of p, when k is subpolynomial in n, our algorithm runs substantially faster than the SVD. Empirical evaluations are also conducted in Section 5 to demonstrate our improved algorithm for p = 1.

Consider another example, where the top k singular values are all 1 and the (k+i)-th singular value is 1/i. When k = o(log n), capturing only the top singular direction gives a (1 + o(1))-approximation for the Schatten 1-norm but a Θ(√k)-approximation for the Frobenius norm. This example, together with the preceding one, shows that the Schatten norm is genuinely a different error metric.

Surprisingly, no algorithms for low-rank approximation in the Schatten p-norm were known to run in time Õ(nnz(A) + m poly(k/ε)) prior to this work, except for the special case of p = 2. We note that the case of p = 2 has special geometric structure that is not shared by other Schatten p-norms. Indeed, a common technique for the p = 2 setting is to first find a poly(k/ε)-dimensional subspace V containing a rank-k subspace inside of it which is a (1 + ε)-approximate subspace onto which to project the rows of A. Then, by the Pythagorean theorem, one can first project the rows of A onto V, and then find the best rank-k subspace of the projected points inside of V. For other Schatten p-norms, the Pythagorean theorem does not hold, and it is not hard to construct counterexamples to this procedure for p ≠ 2.

To summarize, the SVD runs in time Θ̃(mn^{ω−1}), which is much slower than nnz(A) ≤ mn, and it is also not clear how to adapt existing fast Frobenius-norm algorithms to generate (1 + ε)-factor approximations with respect to other Schatten p-norms.

It was shown by Musco & Woodruff (2017) that computing a constant-factor low-rank approximation to A^T A, given only A, requires Ω(nnz(A) · k) time. Given that the squared singular values of A are the singular values of A^T A, it is natural to suspect that obtaining a constant-factor low-rank approximation in the Schatten 4-norm would therefore require Ω(nnz(A) · k) time. Surprisingly, we show this is not the case, and obtain an Õ(nnz(A) + mn^{(ω−1)/2} poly(k/ε)) time algorithm.

In addition, we generalize the error metric from matrix norms to a wide family of general loss functions; see Section 6 for details. Thus, we considerably broaden the class of loss functions for which input-sparsity time algorithms were previously known.

Technical Overview. We illustrate our ideas for p = 1. Our goal is to find an orthogonal projection Q̂′ for which ‖A(I − Q̂′)‖_1 ≤ (1 + O(ε))‖A − A_k‖_1. The crucial idea in the analysis is to split ‖·‖_1 into a head part ‖·‖_(r), known as the Ky-Fan norm, which equals the sum of the top r singular values, and a tail part ‖·‖_(−r) (this is just a notation; the tail part is not a norm), which equals the sum of all the remaining singular values. Observe that for r ≥ k/ε it holds that ‖A(I − Q̂′)‖_(−r) ≤
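The head/tail split used in the technical overview can be stated concretely: for any matrix, the Schatten 1-norm decomposes exactly as ‖M‖_1 = ‖M‖_(r) + ‖M‖_(−r). Below is an illustrative numpy sketch of this identity; the helper names `kyfan_head` and `kyfan_tail` are our own, not code from the paper.

```python
import numpy as np

def kyfan_head(A, r):
    """Ky-Fan r-norm: sum of the top r singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)  # returned in descending order
    return s[:r].sum()

def kyfan_tail(A, r):
    """Tail part (a notation, not a norm): sum of the remaining singular values."""
    s = np.linalg.svd(A, compute_uv=False)
    return s[r:].sum()

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 40))
r = 10

nuclear = np.linalg.svd(A, compute_uv=False).sum()  # Schatten 1-norm of A
# The split is exact: ||A||_1 = ||A||_(r) + ||A||_(-r).
print(kyfan_head(A, r) + kyfan_tail(A, r), nuclear)
```

In the analysis, the two parts are controlled separately: the head via dimensionality reduction for Ky-Fan norms, and the tail via the choice r ≥ k/ε.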