Input-Sparsity Low Rank Approximation in Schatten Norm

Yi Li * 1 David P. Woodruff * 2

Abstract

We give the first input-sparsity time algorithms for the rank-$k$ low rank approximation problem in every Schatten norm. Specifically, for a given $m\times n$ ($m \ge n$) matrix $A$, our algorithm computes $Y \in \mathbb{R}^{m\times k}$, $Z \in \mathbb{R}^{n\times k}$, which, with high probability, satisfy $\|A - YZ^T\|_p \le (1+\varepsilon)\|A - A_k\|_p$, where $\|M\|_p = (\sum_{i=1}^n \sigma_i(M)^p)^{1/p}$ is the Schatten $p$-norm of a matrix $M$ with singular values $\sigma_1(M), \ldots, \sigma_n(M)$, and where $A_k$ is the best rank-$k$ approximation to $A$. Our algorithm runs in time $\tilde O(\mathrm{nnz}(A) + mn^{\alpha_p}\,\mathrm{poly}(k/\varepsilon))$, where $\alpha_p = 0$ for $p \in [1,2)$, $\alpha_p = (\omega-1)(1-2/p)$ for $p > 2$, and $\omega \approx 2.374$ is the exponent of matrix multiplication. For the important case of $p = 1$, which corresponds to the more "robust" nuclear norm, we obtain $\tilde O(\mathrm{nnz}(A) + m\cdot\mathrm{poly}(k/\varepsilon))$ time, which was previously only known for the Frobenius norm ($p = 2$). Moreover, since $\alpha_p < \omega - 1$ for every $p$, our algorithm has a better dependence on $n$ than that in the singular value decomposition for every $p$. Crucial to our analysis is the use of dimensionality reduction for Ky-Fan $p$-norms.

*Equal contribution. ¹School of Physical and Mathematical Sciences, Nanyang Technological University. ²Department of Computer Science, Carnegie Mellon University. Correspondence to: Yi Li, David P. Woodruff.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

1. Introduction

A common task in processing or analyzing large-scale datasets is to approximate a large matrix $A \in \mathbb{R}^{m\times n}$ ($m \ge n$) with a low-rank matrix. Often this is done with respect to the Frobenius norm, that is, the objective is to minimize the error $\|A - X\|_F$ over all rank-$k$ matrices $X \in \mathbb{R}^{m\times n}$ for a rank parameter $k$. It is well known that the optimal solution is $A_k = P_L A = A P_R$, where $P_L$ is the orthogonal projection onto the top $k$ left singular vectors of $A$ and $P_R$ is the orthogonal projection onto the top $k$ right singular vectors of $A$. Typically this is found via the singular value decomposition (SVD) of $A$, which is an expensive operation.

For large matrices $A$ this is too slow, so we instead allow for randomized approximation algorithms in the hope of achieving a much faster running time. Formally, given an approximation parameter $\varepsilon > 0$, we would like to find a rank-$k$ matrix $X$ for which $\|A - X\|_F \le (1+\varepsilon)\|A - A_k\|_F$ with large probability. For this relaxed problem, a number of efficient methods are known, which are based on dimensionality reduction techniques such as random projections, importance sampling, and other sketching methods, with running times $\tilde O(\mathrm{nnz}(A) + m\,\mathrm{poly}(k/\varepsilon))$,$^{1,2}$ where $\mathrm{nnz}(A)$ denotes the number of non-zero entries of $A$. This is significantly faster than the SVD, which takes $\tilde\Theta(mn^{\omega-1})$ time, where $\omega$ is the exponent of matrix multiplication. See (Woodruff, 2014) for a survey.

¹We use the notation $\tilde O(f)$ to hide the polylogarithmic factors in $O(f\,\mathrm{poly}(\log f))$.
²Since outputting $X$ takes $O(mn)$ time, these algorithms usually output $X$ in factored form, where each factor has rank $k$.

In this work, we consider approximation error with respect to general matrix norms, i.e., to the Schatten $p$-norm. The Schatten $p$-norm, denoted by $\|\cdot\|_p$, is defined to be the $\ell_p$-norm of the singular values of the matrix. Below is the formal definition of the problem.

Definition 1.1 (Low-rank Approximation). Let $p \ge 1$. Given a matrix $A \in \mathbb{R}^{m\times n}$, find a rank-$k$ matrix $\hat X \in \mathbb{R}^{m\times n}$ for which
$$\|A - \hat X\|_p \le (1+\varepsilon)\min_{X:\,\mathrm{rank}(X)=k}\|A - X\|_p. \quad (1)$$

It is a well-known fact (Mirsky's Theorem) that the optimal solution for general Schatten norms coincides with the optimal rank-$k$ matrix $A_k$ for the Frobenius norm, given by the SVD. However, approximate solutions for the Frobenius norm loss function may give horrible approximations for other Schatten $p$-norms.

Of particular importance is the Schatten 1-norm, also called the nuclear norm or the trace norm, which is the sum of the singular values of a matrix. It is typically considered to be more robust than the Frobenius norm (Schatten 2-norm) and has been used in robust PCA applications (see, e.g., (Xu et al., 2010; Candès et al., 2011; Yi et al., 2016)).
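To make Definition 1.1 concrete, here is a minimal NumPy sketch (our own illustration, not part of the paper's algorithm; the function names are ours) that evaluates the Schatten $p$-norm of an error matrix and the optimal rank-$k$ error $\|A - A_k\|_p$ via a full SVD, which is the baseline the rest of the paper tries to avoid computing.

```python
import numpy as np

def schatten_norm(M, p):
    """Schatten p-norm: the l_p norm of the singular values of M."""
    s = np.linalg.svd(M, compute_uv=False)
    return np.sum(s ** p) ** (1.0 / p)

def best_rank_k_error(A, k, p):
    """||A - A_k||_p, where A_k is the truncated SVD (optimal by Mirsky's theorem)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Ak = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return schatten_norm(A - Ak, p)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 100))
    for p in (1, 2, 4):
        print(p, best_rank_k_error(A, k=10, p=p))
```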

For example, suppose the top singular value of an $n\times n$ matrix $A$ is $1$, the next $2k$ singular values are $1/\sqrt k$, and the remaining singular values are $0$. A Frobenius norm rank-$k$ approximation could just choose the top singular direction and pay a cost of $\sqrt{2k\cdot 1/k} = \sqrt 2$. Since the Frobenius norm of the bottom $n-k$ singular values is $\sqrt{(k+1)\cdot\frac 1k}\approx 1$, this is a $\sqrt 2$-approximation. On the other hand, if a Schatten 1-norm rank-$k$ approximation algorithm were to only output the top singular direction, it would pay a cost of $2k\cdot 1/\sqrt k = 2\sqrt k$. The bottom $n-k$ singular values have Schatten 1-norm $(k+1)\cdot\frac{1}{\sqrt k}\approx\sqrt k$. Consequently, the approximation factor would be $2(1-o(1))$, and one can show that if we insisted on a $\sqrt 2$-approximation or better, a Schatten 1-norm algorithm would need to capture a constant fraction of the top $k$ directions, and thus capture more of the underlying data than a Frobenius norm solution.

Consider another example where the top $k$ singular values are all $1$s and the $(k+i)$-th singular value is $1/i$. When $k = o(\log n)$, capturing only the top singular direction gives a $(1+o(1))$-approximation for the Schatten 1-norm but a $\Theta(\sqrt k)$-approximation for the Frobenius norm. This example, together with the preceding one, shows that the Schatten norm is genuinely a different error metric.

Surprisingly, no algorithms for low-rank approximation in the Schatten $p$-norm were known to run in time $\tilde O(\mathrm{nnz}(A) + m\,\mathrm{poly}(k/\varepsilon))$ prior to this work, except for the special case of $p = 2$. We note that the case of $p = 2$ has special geometric structure that is not shared by other Schatten $p$-norms. Indeed, a common technique for the $p = 2$ setting is to first find a $\mathrm{poly}(k/\varepsilon)$-dimensional subspace $V$ containing a rank-$k$ subspace inside of it which is a $(1+\varepsilon)$-approximate subspace to project the rows of $A$ on. Then, by the Pythagorean theorem, one can first project the rows of $A$ onto $V$, and then find the best rank-$k$ subspace of the projected points inside of $V$. For other Schatten $p$-norms, the Pythagorean theorem does not hold, and it is not hard to construct counterexamples to this procedure for $p \ne 2$.

To summarize, the SVD runs in time $\Theta(mn^{\omega-1})$, which is much slower than $\mathrm{nnz}(A) \le mn$. It is also not clear how to adapt existing fast Frobenius-norm algorithms to generate $(1+\varepsilon)$-factor approximations with respect to other Schatten $p$-norms.

Our Contributions. In this paper we obtain the first provably efficient algorithms for the rank-$k$ $(1+\varepsilon)$-approximation problem with respect to the Schatten $p$-norm for every $p \ge 1$.

Theorem 1.1 (informal, combination of Theorems 3.6 and 4.4). Suppose that $m \ge n$ and $A \in \mathbb{R}^{m\times n}$. There is a randomized algorithm which outputs two matrices $Y \in \mathbb{R}^{m\times k}$ and $Z \in \mathbb{R}^{n\times k}$ for which $\hat X = YZ^T$ satisfies (1) with probability at least $0.9$. The algorithm runs in time $O(\mathrm{nnz}(A)\log n) + \tilde O(mn^{\alpha_p}\,\mathrm{poly}(k/\varepsilon))$, where
$$\alpha_p = \begin{cases} 0, & 1 \le p \le 2;\\ (\omega-1)\left(1-\tfrac 2p\right), & p > 2,\end{cases}$$
and the hidden constants depend only on $p$.

In the particular case of $p = 1$, and more generally for all $p \in [1,2]$, our algorithm achieves a running time of $O(\mathrm{nnz}(A)\log n + m\,\mathrm{poly}(k/\varepsilon))$, which was previously known to be possible for $p = 2$ only. When $p > 2$, the running time begins to depend polynomially on $m$ and $n$, but the dependence remains $o(mn^{\omega-1})$ for all larger $p$. Thus, even for larger values of $p$, when $k$ is subpolynomial in $n$, our algorithm runs substantially faster than the SVD. Empirical evaluations are also conducted in Section 5 to demonstrate our improved algorithm when $p = 1$.

It was shown by Musco & Woodruff (2017) that computing a constant-factor low-rank approximation to $A^TA$, given only $A$, requires $\Omega(\mathrm{nnz}(A)\cdot k)$ time. Given that the squared singular values of $A$ are the singular values of $A^TA$, it is natural to suspect that obtaining a constant-factor low-rank approximation in the Schatten 4-norm would therefore require $\Omega(\mathrm{nnz}(A)\cdot k)$ time. Surprisingly, we show this is not the case, and obtain an $\tilde O(\mathrm{nnz}(A) + mn^{(\omega-1)/2}\,\mathrm{poly}(k/\varepsilon))$ time algorithm.

In addition, we generalize the error metric from matrix norms to a wide family of general loss functions; see Section 6 for details. Thus, we considerably broaden the class of loss functions for which input sparsity time algorithms were previously known.

Technical Overview. We illustrate our ideas for $p = 1$. Our goal is to find an orthogonal projection $\hat Q'$ for which $\|A(I-\hat Q')\|_1 \le (1+O(\varepsilon))\|A - A_k\|_1$. The crucial idea in the analysis is to split $\|\cdot\|_1$ into a head part $\|\cdot\|_{(r)}$, which, known as the Ky-Fan norm, equals the sum of the top $r$ singular values, and a tail part $\|\cdot\|_{(-r)}$ (this is just a notation; the tail part is not a norm), which equals the sum of all the remaining singular values. Observe that for $r \ge k/\varepsilon$ it holds that $\|A(I-\hat Q')\|_{(-r)} \le \|A\|_{(-r)} \le \|A - A_k\|_{(-r)} + \varepsilon\|A - A_k\|_1$ for any rank-$k$ orthogonal projection $\hat Q'$, and it thus suffices to find $\hat Q'$ for which $\|A(I-\hat Q')\|_{(r)} \le (1+\varepsilon)\|A - A_k\|_{(r)}$. To do this, we sketch $A(I-Q)$ on the left by a projection-cost preserving matrix $S$ of Cohen et al. (2017) such that $\|SA(I-Q)\|_{(r)} = (1\pm\varepsilon)\|A(I-Q)\|_{(r)} \pm \varepsilon\|A - A_k\|_1$ for all rank-$k$ projections $Q$. Then we solve $\min_Q \|SA(I-Q)\|_{(r)}$ over all rank-$k$ projections $Q$ and obtain a $(1+\varepsilon)$-approximate projection $\hat Q'$, which, intuitively, is close to the best projection $P_R$ for $\min_Q \|A(I-Q)\|_{(r)}$, and can be shown to satisfy the desired property above.
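The head/tail split above is easy to state in code. Below is a small NumPy sketch (our own illustration, not from the paper) of the Ky-Fan head $\|\cdot\|_{(r)}$ and the tail $\|\cdot\|_{(-r)}$, together with a numerical check that the Schatten 1-norm splits into the two parts and that projecting can only shrink the tail.

```python
import numpy as np

def ky_fan_head(M, r):
    """Sum of the top r singular values, i.e., the Ky-Fan r-norm ||M||_(r)."""
    s = np.linalg.svd(M, compute_uv=False)
    return s[:r].sum()

def schatten1_tail(M, r):
    """Sum of the singular values below the top r, i.e., ||M||_(-r) (not a norm)."""
    s = np.linalg.svd(M, compute_uv=False)
    return s[r:].sum()

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, k, eps = 300, 5, 0.25
    r = int(np.ceil(k / eps))
    A = rng.standard_normal((n, n))
    # A rank-k orthogonal projection Q onto an arbitrary k-dimensional subspace.
    V, _ = np.linalg.qr(rng.standard_normal((n, k)))
    Q = V @ V.T
    M = A @ (np.eye(n) - Q)
    assert np.isclose(ky_fan_head(M, r) + schatten1_tail(M, r),
                      np.linalg.svd(M, compute_uv=False).sum())
    assert schatten1_tail(M, r) <= schatten1_tail(A, r) + 1e-8
```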

The last step is to approximate $A\hat Q'$, which could be expensive if done trivially, so we reformulate it as a regression problem $\min_Y \|A - YZ^T\|_1$ over $Y \in \mathbb{R}^{m\times k}$, where $Z$ is an $n\times k$ matrix whose columns form an orthonormal basis of the target space of the projection $\hat Q'$. This latter idea has been applied successfully for Frobenius-norm low-rank approximation (see, e.g., (Clarkson & Woodruff, 2017)). Here we need to argue that the solution to the Frobenius-norm regression problem $\min_Y \|A - YZ^T\|_F$ gives a good solution to the Schatten 1-norm regression problem. Finally we output $Y$ and $Z$.

2. Preliminaries

Notation. For an $m\times n$ matrix $A$, let $\sigma_1(A) \ge \sigma_2(A) \ge \cdots \ge \sigma_s(A)$ denote its singular values, where $s = \min\{m,n\}$. The Schatten $p$-norm ($p \ge 1$) of $A$ is defined to be $\|A\|_p := (\sum_{i=1}^s \sigma_i(A)^p)^{1/p}$ and the singular $(p,r)$-norm ($r \le s$) to be $\|A\|_{(p,r)} := (\sum_{i=1}^r \sigma_i(A)^p)^{1/p}$. It is clear that $\|A\|_p = \|A\|_{(p,s)}$. When $p = 2$, the Schatten $p$-norm coincides with the Frobenius norm and we shall use the notation $\|\cdot\|_F$ in preference to $\|\cdot\|_2$.

Suppose that $A$ has the singular value decomposition $A = U\Sigma V^T$, where $\Sigma$ is a diagonal matrix of the singular values. For $k \le \min\{m,n\}$, let $\Sigma_k$ denote the diagonal matrix for the largest $k$ singular values only, i.e., $\Sigma_k = \mathrm{diag}\{\sigma_1(A),\ldots,\sigma_k(A),0,\ldots,0\}$. We define $A_k = U\Sigma_k V^T$. The famous Mirsky theorem states that $A_k$ is the best rank-$k$ approximation to $A$ for any rotationally invariant norm.

For a subspace $E \subseteq \mathbb{R}^n$, we define $P_E$ to be an $n\times\dim(E)$ matrix whose columns form an orthonormal basis of $E$.

Toolkit. There has been extensive research on randomized numerical linear algebra in recent years. Below are several existing results upon which our algorithm will be built.

Definition 2.1 (Sparse Embedding Matrix). Let $\varepsilon > 0$ be an error parameter. The $(n,\varepsilon)$-sparse embedding matrix $R$ of dimension $n\times r$ is constructed as follows, where $r$ is to be specified later. Let $h:[n]\to[r]$ be a random function and $\sigma:[n]\to\{-1,1\}$ be a random function. The matrix $R$ has only $n$ nonzero entries: $R_{i,h(i)} = \sigma(i)$ for all $i\in[n]$. The value of $r$ is chosen to be $r = \Theta(1/\varepsilon^2)$ such that
$$\Pr_R\left\{\|A^T RR^T B - A^T B\|_F^2 \le \varepsilon^2\|A\|_F^2\|B\|_F^2\right\} \ge 0.99$$
for all $A$ with orthonormal columns. This is indeed possible by (Clarkson & Woodruff, 2017; Meng & Mahoney, 2013; Nelson & Nguyen, 2013).

It is clear that, for a matrix $A$ with $n$ columns and an $(n,\varepsilon)$-sparse embedding matrix $R$, the matrix product $AR$ can be computed in $O(\mathrm{nnz}(A))$ time.

Lemma 2.1 (Thin SVD (Demmel et al., 2007)). Suppose that $A \in \mathbb{R}^{m\times n}$ with $m \ge n$ and the (thin) singular value decomposition is $A = U\Sigma V^T$, where $U \in \mathbb{R}^{m\times n}$ and $\Sigma, V \in \mathbb{R}^{n\times n}$. Computing the full thin SVD takes time $\tilde O(mn^{\omega-1})$.

Lemma 2.2 (Multiplicative Spectral Approximation (Cohen et al., 2015b)). Suppose that $A \in \mathbb{R}^{m\times n}$ ($n \le m \le \mathrm{poly}(n)$) has rank $r$. There exists a sampling matrix $R$ of $O(\varepsilon^{-2} r\log r)$ rows such that $(1-\varepsilon)A^TA \preceq (RA)^T(RA) \preceq (1+\varepsilon)A^TA$ with probability at least $0.9$, and $R$ can be computed in $O(\mathrm{nnz}(A)\log n + n^\omega\log^2 n + n^2\varepsilon^{-2})$ time.

Lemma 2.3 (Additive-Multiplicative Spectral Approximation (Cohen et al., 2017; Musco, 2018)). Suppose that $A \in \mathbb{R}^{m\times n}$ with $m \ge n$, with error parameters $\varepsilon \ge \eta \ge 1/\mathrm{poly}(n)$. Let $K = k + \varepsilon/\eta$. There exists a randomized algorithm which runs in $O(\mathrm{nnz}(A)\log n) + \tilde O(mK^{\omega-1})$ time and outputs a matrix $C$ of $t = \Theta(\varepsilon^{-2}K\log K)$ columns, which are rescaled column samples of $A$ without replacement, such that with probability at least $0.99$,
$$(1-\varepsilon)AA^T - \eta\|A - A_k\|_F^2 I \preceq CC^T \preceq (1+\varepsilon)AA^T + \eta\|A - A_k\|_F^2 I. \quad (2)$$

We also need an elementary inequality.

Lemma 2.4. Suppose that $p \ge 1$ and $\varepsilon \in (0,1]$. Let $C_{p,\varepsilon} = p(1+1/\varepsilon)^{p-1}$. It holds for $x \in [\varepsilon,1]$ that $(1+x)^p \le 1 + C_{p,\varepsilon}x^p$ and that $(1-x)^p \ge 1 - C_{p,\varepsilon}x^p$.

Proof. It is easy to see that for $x \in [\varepsilon,1]$, $(1+x)^p \le 1 + \frac{(1+\varepsilon)^p-1}{\varepsilon^p}x^p$ and $(1-x)^p \ge 1 - \frac{1-(1-\varepsilon)^p}{\varepsilon^p}x^p$. Then note that $p\left(1+\frac1\varepsilon\right)^{p-1} \ge \frac{(1+\varepsilon)^p-1}{\varepsilon^p} \ge \frac{1-(1-\varepsilon)^p}{\varepsilon^p}$.
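As an illustration of the sparse embedding matrix of Definition 2.1 above, here is a minimal sketch (our own, assuming SciPy is available) of building $R$ as a sparse matrix and applying it to a sparse $A$ in time proportional to $\mathrm{nnz}(A)$.

```python
import numpy as np
import scipy.sparse as sp

def sparse_embedding(n, eps, rng):
    """(n, eps)-sparse embedding matrix R of shape n x r with one +/-1 entry per row."""
    r = max(1, int(np.ceil(1.0 / eps**2)))        # r = Theta(1/eps^2)
    h = rng.integers(0, r, size=n)                # random hash h: [n] -> [r]
    sgn = rng.choice([-1.0, 1.0], size=n)         # random signs sigma: [n] -> {-1, +1}
    return sp.csr_matrix((sgn, (np.arange(n), h)), shape=(n, r))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = sp.random(500, 400, density=0.02, random_state=0, format="csr")
    R = sparse_embedding(A.shape[1], eps=0.5, rng=rng)
    AR = A @ R   # costs O(nnz(A)): each column of A contributes to a single column of R
    print(AR.shape)
```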

3. Case p < 2

The algorithm is presented in Algorithm 1. In this section we shall prove the correctness and analyze the running time for a constant $p \in [1,2)$. Throughout this section we set $r = k/\varepsilon$.

Algorithm 1 Outline of the algorithm for finding a low-rank approximation
1: if $p < 2$ then
2:   $\eta_1 \leftarrow O((\varepsilon^2/k)^{2/p})$, $\eta_2 \leftarrow O(\varepsilon^2/k^{2/p-1})$
3: else
4:   $\eta_1 \leftarrow O(\varepsilon^{1+2/p}/(k^{2/p} n^{1-2/p}))$, $\eta_2 \leftarrow O(\varepsilon^2/n^{1-2/p})$
5: end if
6: Use Lemma 2.3 to obtain a sampling matrix $S$ of $s$ rows such that
$$(1-\varepsilon)A^TA - \eta_1\|A - A_k\|_F^2 I \preceq A^TS^TSA \preceq (1+\varepsilon)A^TA + \eta_1\|A - A_k\|_F^2 I. \quad (3)$$
7: $T \leftarrow$ subspace embedding matrix for $s$-dimensional subspaces with error $O(\varepsilon)$
8: $W' \leftarrow$ projection onto the top $k$ left singular vectors of $SAT$
9: $Z \leftarrow$ matrix whose columns are an orthonormal basis of the row space of $W'SA$
10: $R \leftarrow (n,\Theta(\sqrt{\eta_2/k}))$-sparse embedding matrix
11: $\hat Y \leftarrow AR\,P_{\mathrm{rowspace}(Z^TR)}$
12: return $\hat Y, Z$

Lemma 3.1. Suppose that $p \in (0,2)$ and $S$ satisfies (3). It then holds for all rank-$k$ orthogonal projections $Q$ that
$$(1-\varepsilon)\|A(I-Q)\|_{(p,r)}^p - r\eta_1^{p/2}\|A-A_k\|_p^p \le \|SA(I-Q)\|_{(p,r)}^p \le (1+\varepsilon)\|A(I-Q)\|_{(p,r)}^p + r\eta_1^{p/2}\|A-A_k\|_p^p.$$

Proof. Since $S$ satisfies (3), it holds for any rank-$k$ orthogonal projection $Q$ that
$$(1-\varepsilon)(I-Q)A^TA(I-Q) - \eta_1\|A-A_k\|_F^2 I \preceq (I-Q)A^TS^TSA(I-Q) \preceq (1+\varepsilon)(I-Q)A^TA(I-Q) + \eta_1\|A-A_k\|_F^2 I.$$
The following relationship between the singular values of $SA(I-Q)$ and $A(I-Q)$ is an immediate corollary via the max-min characterization of singular values (cf., e.g., Lemma 7.2 of (Li et al., 2019)):
$$(1-\varepsilon)\sigma_i^2(A(I-Q)) - \eta_1\|A-A_k\|_F^2 \le \sigma_i^2(SA(I-Q)) \le (1+\varepsilon)\sigma_i^2(A(I-Q)) + \eta_1\|A-A_k\|_F^2. \quad (4)$$
Since $p < 2$ and thus $\|\cdot\|_F \le \|\cdot\|_p$, we have from (4) that
$$(1-\varepsilon)\sigma_i^p(A(I-Q)) - \eta_1^{p/2}\|A-A_k\|_p^p \le \sigma_i^p(SA(I-Q)) \le (1+\varepsilon)\sigma_i^p(A(I-Q)) + \eta_1^{p/2}\|A-A_k\|_p^p.$$
Passing to the singular $(p,r)$-norm yields the desired result.

Lemma 3.2. When $p \in (0,2)$ is a constant and $\varepsilon \in (0,1/2]$, let $\hat Q' = ZZ^T$ be the projection onto the column space of $Z$, where $Z$ is as defined in Line 9 of Algorithm 1. With probability at least $0.99$, it holds that
$$\|SA(I-\hat Q')\|_{(p,r)} \le (1+\varepsilon)\min_Q\|SA(I-Q)\|_{(p,r)}, \quad (5)$$
where the minimization on the right-hand side is over all rank-$k$ orthogonal projections $Q$.

Proof. Observe that
$$\min_Q\|SA - SAQ\|_{(p,r)} = \min_W\|SA - WSA\|_{(p,r)},$$
where the minimizations are over all rank-$k$ orthogonal projections $Q$ and all rank-$k$ orthogonal projections $W$, and the equality is achieved when $Q$ is the projection onto the top $k$ right singular vectors of $SA$ and $W$ is the projection onto the top $k$ left singular vectors. Since $T$ is an oblivious subspace embedding matrix and preserves all singular values of $(I-W)SA$ up to a factor of $1\pm\varepsilon$, we have
$$\min_W\|SAT - WSAT\|_{(p,r)} = (1\pm\varepsilon)\min_W\|SA - WSA\|_{(p,r)}.$$
The minimization on the left-hand side above is easy to solve: the minimizer $W'$ is exactly the projection onto the top $k$ singular vectors of $SAT$, as computed in Line 8 of Algorithm 1. Since $\hat Q'$ is the projection onto the row space of $W'SA$, it holds that the row space of $SA\hat Q'$ is the closest space to that of $SA$ in the row space of $W'SA$. Hence
$$\|SA - SA\hat Q'\|_{(p,r)} \le \|SA - W'SA\|_{(p,r)}.$$
The claimed result (5) then follows from
$$\|SA - W'SA\|_{(p,r)} \le \frac{1}{1-\varepsilon}\|SAT - W'SAT\|_{(p,r)} = \frac{1}{1-\varepsilon}\min_W\|SAT - WSAT\|_{(p,r)} \le \frac{1+\varepsilon}{1-\varepsilon}\min_W\|SA - WSA\|_{(p,r)} \le (1+4\varepsilon)\min_Q\|SA - SAQ\|_{(p,r)}$$
and rescaling $\varepsilon$.
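To fix ideas, the following is a heavily simplified NumPy sketch of the pipeline in Algorithm 1 for $p = 1$. It is our own illustration and is not input-sparsity time: the projection-cost preserving sampling matrix $S$ of Lemma 2.3, the subspace embedding $T$, and the sparse embedding $R$ are all replaced by dense Gaussian sketches of heuristically chosen sizes, and the sketched regression in the last step uses a pseudoinverse rather than Line 11 verbatim. Only the overall structure (sketch, compute $W'$, form $Z$, solve a sketched Frobenius regression for $\hat Y$) follows the algorithm.

```python
import numpy as np

def low_rank_schatten1_sketch(A, k, eps, rng):
    """Simplified analogue of Algorithm 1 (p = 1). Returns (Y, Z) with X = Y @ Z.T."""
    m, n = A.shape
    s = min(m, int(np.ceil(4 * k / eps**2)))      # sketch sizes chosen heuristically
    t = min(n, int(np.ceil(4 * k / eps**2)))

    S = rng.standard_normal((s, m)) / np.sqrt(s)  # stand-in for the sampling matrix of Lemma 2.3
    T = rng.standard_normal((n, t)) / np.sqrt(t)  # stand-in for the subspace embedding

    SA = S @ A
    U, _, _ = np.linalg.svd(SA @ T, full_matrices=False)
    Wp = U[:, :k]                                 # Line 8: top-k left singular vectors of S A T
    Z, _ = np.linalg.qr((Wp.T @ SA).T)            # Line 9: orthonormal basis of rowspace(W' S A)

    R = rng.standard_normal((n, t)) / np.sqrt(t)  # stand-in for the sparse embedding of Line 10
    Y = (A @ R) @ np.linalg.pinv(Z.T @ R)         # minimizer of ||(A - Y Z^T) R||_F
    return Y, Z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k = 10
    A = rng.standard_normal((400, 300))
    Y, Z = low_rank_schatten1_sketch(A, k=k, eps=0.5, rng=rng)
    err = np.linalg.svd(A - Y @ Z.T, compute_uv=False).sum()   # Schatten 1-norm error
    best = np.linalg.svd(A, compute_uv=False)[k:].sum()        # ||A - A_k||_1
    print(err / best)
```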

Lemma 3.3. Let $\varepsilon \in (0,1/2]$. Suppose that $\hat Q'$ satisfies (5). Then it holds that
$$\|A(I-\hat Q')\|_{(p,r)}^p \le (1+c_1\varepsilon)\|A-A_k\|_{(p,r)}^p + c_2 r\eta_1^{p/2}\|A-A_k\|_p^p$$
for some absolute constants $c_1, c_2 > 0$.

Proof. Let $\hat Q = \arg\min_Q\|SA(I-Q)\|_{(p,r)}$, where the minimization is over all rank-$k$ projections $Q$. Let $Q^*$ be the orthogonal projection onto the top $k$ right singular vectors of $A$. It follows that
$$\begin{aligned}
\|A(I-\hat Q')\|_{(p,r)}^p &\le \frac{1}{1-\varepsilon}\|SA(I-\hat Q')\|_{(p,r)}^p + \frac{1}{1-\varepsilon}\,r\eta_1^{p/2}\|A-A_k\|_p^p\\
&\le \frac{(1+\varepsilon)^p}{1-\varepsilon}\|SA(I-\hat Q)\|_{(p,r)}^p + \frac{1}{1-\varepsilon}\,r\eta_1^{p/2}\|A-A_k\|_p^p\\
&\le \frac{(1+\varepsilon)^p}{1-\varepsilon}\|SA(I-Q^*)\|_{(p,r)}^p + \frac{1}{1-\varepsilon}\,r\eta_1^{p/2}\|A-A_k\|_p^p\\
&\le \frac{(1+\varepsilon)^p}{1-\varepsilon}\left((1+\varepsilon)\|A(I-Q^*)\|_{(p,r)}^p + r\eta_1^{p/2}\|A-A_k\|_p^p\right) + \frac{1}{1-\varepsilon}\,r\eta_1^{p/2}\|A-A_k\|_p^p\\
&= \frac{(1+\varepsilon)^{p+1}}{1-\varepsilon}\|A-A_k\|_{(p,r)}^p + \frac{(1+\varepsilon)^p+1}{1-\varepsilon}\,r\eta_1^{p/2}\|A-A_k\|_p^p,
\end{aligned}$$
where the first inequality follows from Lemma 3.1, the second inequality from Lemma 3.2, the third inequality from the optimality of $\hat Q$, and the fourth inequality again from Lemma 3.1.

The next lemma is an immediate corollary of the preceding lemma.

Lemma 3.4. Let $\varepsilon \in (0,1/2]$. Suppose that $\hat Q'$ satisfies (5). Then it holds for some absolute constants $c_1, c_2 > 0$ that
$$\|A(I-\hat Q')\|_p^p \le (1+c_1\varepsilon)\|A-A_k\|_p^p$$
whenever $\eta_1 \le (\varepsilon^2/(c_2k))^{2/p}$.

Proof. Again let $\hat Q = \arg\min_Q\|SA(I-Q)\|_{(p,r)}$, where the minimization is over all rank-$k$ projections $Q$. Observe that
$$\begin{aligned}
\|A(I-\hat Q')\|_p^p &= \|A(I-\hat Q')\|_{(p,r)}^p + \sum_{i\ge r+1}\sigma_i^p(A(I-\hat Q'))\\
&\le (1+c_1\varepsilon)\|A-A_k\|_{(p,r)}^p + c_2r\eta_1^{p/2}\|A-A_k\|_p^p + \sum_{i\ge r+1}\sigma_i^p(A)\\
&\le (1+c_1\varepsilon)\|A-A_k\|_p^p + c_2r\eta_1^{p/2}\|A-A_k\|_p^p + \sum_{i=r+1}^{r+k+1}\sigma_i^p(A)\\
&\le (1+c_1\varepsilon)\|A-A_k\|_p^p + c_2r\eta_1^{p/2}\|A-A_k\|_p^p + \frac kr\|A-A_k\|_p^p\\
&\le (1+(c_1+1)\varepsilon)\|A-A_k\|_p^p + c_2r\eta_1^{p/2}\|A-A_k\|_p^p,
\end{aligned}$$
where we used the preceding lemma (Lemma 3.3) in the first inequality and $r = k/\varepsilon$ in the last inequality. The claimed result holds when $\eta_1 \le (\varepsilon/(c_2r))^{2/p}$.

So far we have found a rank-$k$ orthogonal projection $\hat Q' = ZZ^T$ such that
$$\|A - A\hat Q'\|_p \le (1+c_1\varepsilon)\|A-A_k\|_p$$
for some absolute constant $c_1$. However, it is not clear how to compute the matrix product $A\hat Q'$ efficiently. Hence we consider the regression problem
$$\min_{Y:\,\mathrm{rank}(Y)=k}\|A - YZ^T\|_p.$$
It is clear that the minimizer is $Y = AZ$, which satisfies $\|A - YZ^T\|_p = \|A - A\hat Q'\|_p$, since the row space of $YZ^T$ is a $k$-dimensional subspace of the row space of $Z^T$, which is exactly the range of $\hat Q'$. The next lemma shows that $\hat Y$ is an approximation to $Y$.

Lemma 3.5. When $1 \le p < 2$ is a constant, the matrix $\hat Y$ defined in Line 11 of Algorithm 1 satisfies with probability at least $0.99$ that
$$\|A - \hat YZ^T\|_p \le (1+c\varepsilon)\min_{Y:\,\mathrm{rank}(Y)=k}\|A - YZ^T\|_p$$
for some absolute constant $c > 0$, whenever $\eta_2 \le \varepsilon^2/(2k)^{2/p-1}$.

Proof. First, it is clear that the optimal solution to $\min_Y\|A - YZ^T\|_p$ is $Y = AZ$, where the minimization is over all $m\times k$ matrices $Y$ of rank at most $k$. Note that
$$\hat Y = AR\,P_{\mathrm{rowspace}(Z^TR)}$$
is the minimizer of the Frobenius-norm minimization problem $\min_Y\|(A - YZ^T)R\|_F$. Since $R$ is a sparse embedding matrix of error $\Theta(\sqrt{\eta_2/k})$, one can show (see, e.g., Lemma 7.8 of (Clarkson & Woodruff, 2017)) that with probability at least $0.99$,
$$\|AZ - \hat Y\|_F \le \sqrt{\eta_2}\,\|A - AZZ^T\|_F.$$
It follows that
$$\begin{aligned}
\|A - \hat YZ^T\|_p &\le \|A - AZZ^T\|_p + \|\hat YZ^T - AZZ^T\|_p\\
&\le (1+c_1\varepsilon)\|A - AZZ^T\|_p + \|\hat Y - AZ\|_p\\
&\le (1+c_1\varepsilon)\|A - AZZ^T\|_p + (2k)^{\frac1p-\frac12}\|\hat Y - AZ\|_F\\
&\le (1+c_1\varepsilon)\|A - AZZ^T\|_p + (2k)^{\frac1p-\frac12}\sqrt{\eta_2}\,\|A - AZZ^T\|_F\\
&\le (1+c_1\varepsilon)\|A - AZZ^T\|_p + (2k)^{\frac1p-\frac12}\sqrt{\eta_2}\,\|A - AZZ^T\|_p\\
&\le (1+(c_1+1)\varepsilon)\|A - AZZ^T\|_p,
\end{aligned}$$
provided that $\eta_2 \le \varepsilon^2/(2k)^{2/p-1}$.

Remark 1. The preceding lemma (Lemma 3.5) may be of independent interest, as it solves Schatten $p$-norm regression efficiently, which has not been discussed in the literature in the context of dimensionality reduction before.
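Remark 1 suggests a standalone use: approximately minimizing $\|A - YZ^T\|_p$ over $Y$ by solving a small sketched Frobenius-norm problem. The following NumPy sketch is our own illustration (a Gaussian sketch stands in for the sparse embedding $R$, and the pseudoinverse solve stands in for Line 11); it compares the sketched solution against the exact minimizer $Y = AZ$ in Schatten 1-norm.

```python
import numpy as np

def schatten(M, p):
    s = np.linalg.svd(M, compute_uv=False)
    return np.sum(s ** p) ** (1.0 / p)

def sketched_regression(A, Z, sketch_cols, rng):
    """Approximate argmin_Y ||A - Y Z^T||_p via the sketched problem min_Y ||(A - Y Z^T) R||_F."""
    n = A.shape[1]
    R = rng.standard_normal((n, sketch_cols)) / np.sqrt(sketch_cols)
    return (A @ R) @ np.linalg.pinv(Z.T @ R)   # minimizer of ||A R - Y (Z^T R)||_F

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    m, n, k = 400, 300, 10
    A = rng.standard_normal((m, n))
    Z, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal columns
    Y_exact = A @ Z                                    # exact minimizer for any Schatten norm
    Y_hat = sketched_regression(A, Z, sketch_cols=200, rng=rng)
    p = 1
    print(schatten(A - Y_hat @ Z.T, p) / schatten(A - Y_exact @ Z.T, p))
```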

In summary, we conclude the section with our main theorem.

Theorem 3.6. Let $p \in [1,2)$. Suppose that $A \in \mathbb{R}^{m\times n}$ with $m \ge n$. There is a randomized algorithm which outputs $Y \in \mathbb{R}^{m\times k}$ and $Z \in \mathbb{R}^{n\times k}$ such that $\hat X = YZ^T$ satisfies (1) with probability at least $0.97$. The algorithm runs in time $O(\mathrm{nnz}(A)\log n) + \tilde O_p(mk^{2(\omega-1)/p}/\varepsilon^{(4/p-1)(\omega-1)} + k^{2\omega/p}/\varepsilon^{(4/p-1)(2\omega+2)})$.

Proof. The correctness of the output is clear from the preceding lemmata by rescaling $\varepsilon$. We discuss the runtime below.

We first examine the runtime to obtain $Z$. For $\eta_1$ as in Lemma 3.4 we have $k\eta_1 \le \varepsilon$ and thus $K = \Theta(\varepsilon/\eta_1)$. Applying Lemma 2.3, we have $s = \tilde O(K/\varepsilon^2)$ and obtaining the matrix $S$ takes time $O(\mathrm{nnz}(A)\log n) + \tilde O(mK^{\omega-1})$. By Lemma 2.2, we can obtain $(SA)T$ in time $O(\mathrm{nnz}(SA)) + \tilde O(s^\omega/\varepsilon^2)$ for a matrix $T$ of $\tilde O(s/\varepsilon^2)$ columns, and thus the subsequent SVD of $SAT$, by Lemma 2.1, takes $\tilde O(s^\omega/\varepsilon^2)$ time. These three steps take time $\tilde O(mK^{\omega-1}) + O(\mathrm{nnz}(SA)) + \tilde O(s^\omega/\varepsilon^2) = O(\mathrm{nnz}(A)) + \tilde O(mK^{\omega-1} + K^\omega/\varepsilon^{2\omega+2})$, where we used the fact that $S$ samples the rows of $A$ without replacement and so $\mathrm{nnz}(SA) \le \mathrm{nnz}(A)$. Calculating the row span of $W'SA$, which is a $k$-by-$n$ matrix, takes $O(nk^{\omega-1})$ time. The total runtime is $O(\mathrm{nnz}(A)\log n) + \tilde O_p(mK^{\omega-1} + K^\omega/\varepsilon^{2\omega+2})$. Plugging in $K = \varepsilon/\eta_1 = \Theta(k^{2/p}/\varepsilon^{4/p-1})$ yields the runtime $O(\mathrm{nnz}(A)\log n) + \tilde O(mk^{2(\omega-1)/p}/\varepsilon^{(4/p-1)(\omega-1)} + k^{2\omega/p}/\varepsilon^{(4/p-1)(2\omega+2)})$.

Next we examine the runtime to obtain $\hat Y$. Since $R$ has $t = \Theta(k/\eta_2)$ columns and $AR$ can be computed in $O(\mathrm{nnz}(A))$ time, $Z^TR$ can be computed in $O(nk)$ time, the row space of $Z^TR$ (which is a $k\times t$ matrix) in $O(k^{\omega-1}t) = \tilde O(k^\omega/\eta_2)$ time, and the final matrix product $(AR)P_{\mathrm{rowspace}(Z^TR)}$ in $O(mtk^{\omega-2}) = \tilde O(mk^{\omega-1}/\eta_2)$ time. Overall, computing $\hat Y$ takes time $O(\mathrm{nnz}(A)) + \tilde O(mk^{\omega-1}/\eta_2) = O(\mathrm{nnz}(A)) + \tilde O(mk^{\omega+2/p-2}/\varepsilon^2)$.

The overall runtime follows immediately.

4. Case p > 2

The algorithm remains the same as in Algorithm 1. In this section we shall prove the correctness and analyse the runtime for constant $p > 2$. The outline of the proof is the same and we shall only highlight the differences. Again we let $r = k/\varepsilon$.

In place of Lemma 3.1, we now have:

Lemma 4.1. Suppose that $p > 2$ and $S$ satisfies (3). It then holds for all rank-$k$ orthogonal projections $Q$ that
$$(1-K_p\varepsilon)\|A(I-Q)\|_{(p,r)}^p - C_{p/2,\varepsilon}\,r\eta_1^{p/2}\|A-A_k\|_F^p \le \|SA(I-Q)\|_{(p,r)}^p \le (1+K_p\varepsilon)\|A(I-Q)\|_{(p,r)}^p + C_{p/2,\varepsilon}\,r\eta_1^{p/2}\|A-A_k\|_F^p,$$
where $K_p \ge 1$ is some constant that depends only on $p$.

Proof. We now have two cases based on (4).

• When $\sigma_i^2(A(I-Q)) \ge (1/\varepsilon)\eta_1\|A-A_k\|_F^2$, we have
$$(1-O_p(\varepsilon))\sigma_i^p(A(I-Q)) \le \sigma_i^p(SA(I-Q)) \le (1+O_p(\varepsilon))\sigma_i^p(A(I-Q)).$$

• When $\sigma_i^2(A(I-Q)) < (1/\varepsilon)\eta_1\|A-A_k\|_F^2$, we have from Lemma 2.4 that
$$(1-\varepsilon)\sigma_i^p(A(I-Q)) - C_{p/2,\varepsilon}\eta_1^{p/2}\|A-A_k\|_F^p \le \sigma_i^p(SA(I-Q)) \le (1+\varepsilon)\sigma_i^p(A(I-Q)) + C_{p/2,\varepsilon}\eta_1^{p/2}\|A-A_k\|_F^p.$$

The claimed result follows in the same manner as in the proof of Lemma 3.1.

The analogue of Lemma 3.4 is the following, where we apply Hölder's inequality in the form $\|A-A_k\|_F \le n^{1/2-1/p}\|A-A_k\|_p$.

Lemma 4.2. Let $\varepsilon \in (0,1/2]$. Suppose that $\hat Q'$ satisfies (5). Then it holds for some constants $c_p, c_p' > 0$ which depend only on $p$ that
$$\|A(I-\hat Q')\|_p^p \le (1+c_p\varepsilon)\|A-A_k\|_p^p,$$
whenever $\eta_1 \le c_p'\,\varepsilon^{1+2/p}/(k^{2/p}n^{1-2/p})$.

Proof. Similarly we have
$$\|A(I-\hat Q')\|_p^p \le (1+c_p\varepsilon)\|A-A_k\|_p^p + rC_{p/2,\varepsilon}\eta_1^{p/2}n^{p/2-1}\|A-A_k\|_p^p.$$
The conclusion follows when
$$C_{p/2,\varepsilon}\eta_1^{p/2}n^{p/2-1} \le \frac\varepsilon r = \frac{\varepsilon^2}k,$$
that is,
$$\frac p2\left(1+\frac1\varepsilon\right)^{\frac p2-1}\eta_1^{p/2}n^{\frac p2-1} \le \frac{\varepsilon^2}k.$$

The analogue of Lemma 3.5 is the following.

Lemma 4.3. When $p > 2$ is a constant, the matrix $\hat Y$ defined in Line 11 of Algorithm 1 satisfies with probability at least $0.9$ that
$$\|A - \hat YZ^T\|_p \le (1+c_p\varepsilon)\min_{Y:\,\mathrm{rank}(Y)=k}\|A - YZ^T\|_p$$
for some constant $c_p$ that depends only on $p$, whenever $\eta_2 \le \varepsilon^2/n^{1-2/p}$.

Proof. The proof is similar to that of Lemma 3.5, except that in the last part of the argument we have instead
$$\begin{aligned}
\|A - \hat YZ^T\|_p &\le (1+c_p\varepsilon)\|A - AZZ^T\|_p + \|\hat Y - AZ\|_p\\
&\le (1+c_p\varepsilon)\|A - AZZ^T\|_p + \|\hat Y - AZ\|_F\\
&\le (1+c_p\varepsilon)\|A - AZZ^T\|_p + \sqrt{\eta_2}\,\|A - AZZ^T\|_F\\
&\le (1+c_p\varepsilon)\|A - AZZ^T\|_p + \sqrt{\eta_2}\,n^{\frac12-\frac1p}\|A - AZZ^T\|_p,
\end{aligned}$$
and we would need $\eta_2 \le \varepsilon^2/n^{1-2/p}$.

In summary, we have the following main theorem.

Theorem 4.4. Let $p > 2$ be a constant. Suppose that $A \in \mathbb{R}^{m\times n}$ ($m \ge n$). There is a randomized algorithm which outputs $Y \in \mathbb{R}^{m\times k}$ and $Z \in \mathbb{R}^{n\times k}$ such that $\hat X = YZ^T$ satisfies (1) with probability at least $0.97$. The algorithm runs in time $O(\mathrm{nnz}(A)\log n) + \tilde O_p(n^{\omega(1-2/p)}k^{2\omega/p}/\varepsilon^{2\omega/p+2} + mn^{(\omega-1)(1-2/p)}(k/\varepsilon)^{2(\omega-1)/p})$.

Proof. The correctness follows from the previous lemmata as in the proof of Theorem 3.6. Below we discuss the running time.

First we examine the time required to obtain $Z$. It is easy to verify that $k\eta_1 \le \varepsilon$ and so $K = \Theta(\varepsilon/\eta_1)$. Similar to the analysis in Theorem 3.6, we have the total runtime $O(\mathrm{nnz}(A)\log n) + \tilde O_p(mK^{\omega-1} + K^\omega/\varepsilon^{2\omega+2})$. Note that $K = \Theta(\varepsilon/\eta_1) = \Theta(n^{1-2/p}(k/\varepsilon)^{2/p})$, so the runtime becomes $O(\mathrm{nnz}(A)\log n) + \tilde O_p(mn^{(\omega-1)(1-2/p)}(k/\varepsilon)^{2(\omega-1)/p} + n^{\omega(1-2/p)}k^{2\omega/p}/\varepsilon^{2\omega/p+2})$.

Next we examine the time required to obtain $\hat Y$. Again, similarly, the runtime is $O(\mathrm{nnz}(A)) + \tilde O(mk^{\omega-1}/\eta_2) = O(\mathrm{nnz}(A)) + \tilde O(mn^{1-2/p}k^{\omega-1}/\varepsilon^2)$.

Combining the two runtimes above yields the overall runtime.

5. Experiments

The contribution of our work is primarily theoretical: an input sparsity time algorithm for low-rank approximation for any Schatten $p$-norm. In this section, nevertheless, we give an empirical verification of the advantage of our algorithm on both synthetic and real-world data. We focus on the most important case of the nuclear norm, i.e., $p = 1$.

In addition to the solution provided by our algorithm, we also consider a natural candidate for a low-rank approximation algorithm, which is the solution in Frobenius norm, that is, a rank-$k$ matrix $X$ for which $\|A - X\|_F \le (1+\varepsilon)\|A - A_k\|_F$. This problem admits a simple solution as follows. Take $S$ to be a Count-Sketch matrix and let $Z$ be an $n\times k$ matrix whose columns form an orthonormal basis of the top-$k$ right singular vectors of $SA$. Then $X = AZZ^T$ is a Frobenius-norm solution with high probability (Cohen et al., 2015a).

We shall compare the quality (i.e., approximation ratio measured in Schatten 1-norm) of both solutions and the running times.³

³All tests are run under MATLAB 2019b on a machine with an Intel Core i7-6550U [email protected] and 2 cores.

Synthetic Data. We adopt a simpler version of Algorithm 1 by taking $S$ to be a Count-Sketch matrix of target dimension $k^2$ and both $R$ and $T$ to be identity matrices of appropriate dimension. For the Frobenius-norm solution, we also take $S$ to be a Count-Sketch matrix of target dimension $k^2$. We choose $n = 3000$ and generate a random $n\times n$ matrix $A$ of independent entries, each of which is uniform in $[0,1]$ with probability $0.05$ and $0$ with probability $0.95$. Since the regime of interest is $k \ll n$, we vary $k$ among $\{5, 10, 20\}$. For each value of $k$, we run our algorithm 50 times and record the relative approximation error $\varepsilon_1 = \|A - YZ^T\|_1/\|A - A_k\|_1 - 1$ and the running time, as well as the relative approximation error of the Frobenius-norm solution $\varepsilon_2 = \|A - X\|_1/\|A - A_k\|_1 - 1$ and its running time. The same matrix $A$ is used for all tests. In Table 1 we report the median of $\varepsilon_1$, the median of $\varepsilon_2$, and the median running time of both algorithms among 50 independent runs for each $k$, together with the median running time of a full SVD of $A$ among 10 runs.

Table 1: Performance of our algorithm on synthetic data compared with the approximate Frobenius-norm solution and the SVD.

                                        k = 5     k = 10    k = 20
  median of ε₁                          0.00372   0.00377   0.00486
  median of ε₂                          0.00412   0.00485   0.00637
  median runtime of ‖·‖₁ algorithm      0.067s    0.196s    0.428s
  median runtime of ‖·‖_F algorithm     0.044s    0.073s    0.191s
  median runtime of SVD                 5.788s

We can observe that our algorithm achieves a good (relative) approximation error, which is less than 0.005 in all such cases of $k$. Our algorithm also outperforms the approximate Frobenius-norm solution by 10%–30% in terms of approximation error, and it runs about 13-fold faster than a regular SVD.
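For readers who want to reproduce the flavor of the synthetic experiment in a different environment, here is a rough Python/SciPy analogue (our own sketch; the paper's experiments were run in MATLAB, use $n = 3000$, and report medians over 50 runs, none of which is reproduced here). It generates a sparse random matrix as described above, applies a Count-Sketch $S$ of target dimension $k^2$, and reports the relative Schatten 1-norm error of the resulting rank-$k$ approximation.

```python
import numpy as np
import scipy.sparse as sp

def count_sketch(d, n, rng):
    """d x n Count-Sketch matrix: one random +/-1 entry per column."""
    h = rng.integers(0, d, size=n)
    sgn = rng.choice([-1.0, 1.0], size=n)
    return sp.csr_matrix((sgn, (h, np.arange(n))), shape=(d, n))

def nuclear(M):
    return np.linalg.svd(M, compute_uv=False).sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, k = 1000, 10                           # the paper uses n = 3000
    # Entries uniform in [0,1] with probability 0.05 and 0 otherwise.
    A = sp.random(n, n, density=0.05, random_state=0, data_rvs=rng.random).toarray()

    S = count_sketch(k * k, n, rng)           # target dimension k^2
    _, _, Vt = np.linalg.svd(S @ A, full_matrices=False)
    Z = Vt[:k, :].T                           # top-k right singular vectors of S A
    X = (A @ Z) @ Z.T                         # rank-k approximation A Z Z^T

    s = np.linalg.svd(A, compute_uv=False)
    eps1 = nuclear(A - X) / s[k:].sum() - 1   # relative Schatten 1-norm error
    print(eps1)
```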

KOS data. For real-world data, we use a word frequency dataset, named KOS, from UC Irvine.⁴ The matrix represents word frequencies in blogs and has dimension $3430\times 6906$ with $353160$ non-zero entries. Again we report the median relative approximation error and the median running time of our algorithm and those of the Frobenius-norm algorithm among 50 independent runs for each value of $k \in \{5, 10, 20\}$. The results are shown in Table 2.

⁴https://archive.ics.uci.edu/ml/datasets/Bag+of+Words

Table 2: Performance of our algorithm on KOS data compared with the approximate Frobenius-norm solution and the SVD.

                                        k = 5     k = 10    k = 20
  median of ε₁                          0.0149    0.0145    0.0132
  median of ε₂                          0.0183    0.0216    0.0259
  median runtime of ‖·‖₁ algorithm      0.155s    0.204s    0.323s
  median runtime of ‖·‖_F algorithm     0.113s    0.154s    0.242s
  median runtime of SVD                 4.999s

Our algorithm achieves a good approximation error, less than 0.015, and surpasses the approximate Frobenius-norm solution for all such values of $k$. The gap between the two solutions in approximation error widens as $k$ increases: when $k = 10$, our algorithm outperforms the approximate Frobenius-norm solution by 30%; when $k = 20$, this increases to almost 50%. Our algorithm, although consistently slower than the Frobenius-norm algorithm by 30%–40%, still displays a 14.5-fold speed-up compared with the regular SVD.

6. Generalization

More generally, one can ask to solve the problem of low-rank approximation with respect to some function $\Phi$ of the matrix singular values, i.e.,
$$\min_{X:\,\mathrm{rank}(X)=k}\Phi(A - X). \quad (6)$$
Here we consider $\Phi(A) = \sum_i\phi(\sigma_i(A))$ for some increasing function $\phi:[0,\infty)\to[0,\infty)$. It is clear that $\Phi$ is rotationally invariant and that $\Phi(A) \ge \Phi(B)$ if $\sigma_i(A) \ge \sigma_i(B)$ for all $i$. These two properties allow us to conclude that $A_k$ remains an optimal solution for such general $\Phi$.

We further assume that $\phi$ satisfies the following conditions.

(a) There exists $\alpha > 0$ such that $\phi((1+\varepsilon)x) \le (1+\alpha\varepsilon)\phi(x)$ and $\phi((1-\varepsilon)x) \ge (1-\alpha\varepsilon)\phi(x)$ for all sufficiently small $\varepsilon$.
(b) It holds for each sufficiently small $\varepsilon$ that $K_{\phi,\varepsilon}^1 = \sup_{x>0}\sup_{y\in[\varepsilon x,x]}(\phi(x+y)-\phi(x))/\phi(y) < \infty$ and $K_{\phi,\varepsilon}^2 = \sup_{x>0}\sup_{y\in[\varepsilon x,x]}(\phi(x)-\phi(x-y))/\phi(y) < \infty$.
(c) It holds for each sufficiently small $\varepsilon$ that $L_{\phi,\varepsilon} = \sup_{x>0}\phi(\varepsilon x)/\phi(x) < \infty$.
(d) There exists $\gamma > 0$ such that $\phi(x+y) \le \gamma(\phi(x)+\phi(y))$.

When the function $\phi$ is clear from the context, we also abbreviate $K_{\phi,\varepsilon}^i$ and $L_{\phi,\varepsilon}$ as $K_\varepsilon^i$ and $L_\varepsilon$, respectively. Let $K_\varepsilon = \max\{K_\varepsilon^1, K_\varepsilon^2\}$.

It follows from a similar argument to Lemma 4.1 and Conditions (a)–(c) that
$$(1-\alpha\varepsilon)\phi(\sigma_i(A(I-Q))) - L_{\sqrt{\eta_1}}K_\varepsilon\,\phi(\|A-A_k\|_F) \le \phi(\sigma_i(SA(I-Q))) \le (1+\alpha\varepsilon)\phi(\sigma_i(A(I-Q))) + L_{\sqrt{\eta_1}}K_\varepsilon\,\phi(\|A-A_k\|_F).$$
Note that Condition (d) implies that $\phi\big(\sqrt{\textstyle\sum_i x_i^2}\big) \le \phi\big(\textstyle\sum_i x_i\big) \le \gamma\sum_i\phi(x_i)$, which further implies that $\phi(\|A-A_k\|_F) \le \gamma\Phi(A-A_k)$. Therefore
$$(1-\alpha\varepsilon)\phi(\sigma_i(A(I-Q))) - \gamma L_{\sqrt{\eta_1}}K_\varepsilon\,\Phi(A-A_k) \le \phi(\sigma_i(SA(I-Q))) \le (1+\alpha\varepsilon)\phi(\sigma_i(A(I-Q))) + \gamma L_{\sqrt{\eta_1}}K_\varepsilon\,\Phi(A-A_k).$$

Analogously to the singular $(p,r)$-norm, we define $\Phi_r(A) = \sum_{i=1}^r\phi(\sigma_i(A))$. It is easy to verify that the argument of Lemmata 3.2 to 3.4 goes through with minimal changes, yielding that
$$\Phi_r(A(I-\hat Q')) \le (1+c_1\varepsilon)\Phi_r(A-A_k) + c_2 r L_{\sqrt{\eta_1}}K_\varepsilon\,\Phi(A-A_k)$$
for some constants $c_1, c_2 > 0$ that depend on $\alpha$, $\gamma$, $K_\varepsilon$ and $L_\varepsilon$. When $\eta_1 \le c_3(\varepsilon/r)^{1/\alpha}$ we have
$$\Phi(A(I-\hat Q')) \le (1+c_4\varepsilon)\Phi(A-A_k).$$
We can then output $AZ$ and $Z$ in time $O(\mathrm{nnz}(A)\cdot k + nk)$. Performing a similar analysis on the running time as before, we arrive at the following theorem.

Theorem 6.1. Suppose that $\phi:[0,\infty)\to[0,\infty)$ is increasing and satisfies Conditions (a)–(d), and that $K_\varepsilon = \mathrm{poly}(1/\varepsilon)$ and $L_\varepsilon = \mathrm{poly}(1/\varepsilon)$. Let $A \in \mathbb{R}^{n\times n}$. There is a randomized algorithm which outputs matrices $Y, Z \in \mathbb{R}^{n\times k}$ such that $X = YZ^T$ satisfies (6) with probability at least $0.98$. The algorithm runs in time $O(\mathrm{nnz}(A)(k+\log n)) + \tilde O(n\,\mathrm{poly}(k/\varepsilon))$, where the hidden constants depend on $\alpha$, $\gamma$ and the polynomial exponents for $K_\varepsilon$ and $L_\varepsilon$.

We remark that a few common loss functions satisfy our conditions for $\phi$. These include the Tukey $p$-norm loss function $\phi(x) = x^p\cdot 1_{\{x\le\tau\}} + \tau^p\cdot 1_{\{x>\tau\}}$, the $\ell_1$-$\ell_2$ loss function $\phi(x) = 2(\sqrt{1+x^2/2}-1)$, and the Huber loss function $\phi(x) = x^2/2\cdot 1_{\{x\le\tau\}} + \tau(x-\tau/2)\cdot 1_{\{x>\tau\}}$.
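For concreteness, here is a small NumPy sketch (ours, not from the paper) of the three loss functions named above and of the resulting objective $\Phi(A - X) = \sum_i\phi(\sigma_i(A-X))$; the threshold $\tau$ is a free parameter.

```python
import numpy as np

def tukey(x, p, tau):
    """Tukey p-norm loss: x^p below tau, capped at tau^p above."""
    return np.where(x <= tau, x**p, tau**p)

def l1_l2(x):
    """l1-l2 loss: 2*(sqrt(1 + x^2/2) - 1)."""
    return 2.0 * (np.sqrt(1.0 + x**2 / 2.0) - 1.0)

def huber(x, tau):
    """Huber loss: quadratic below tau, linear above."""
    return np.where(x <= tau, x**2 / 2.0, tau * (x - tau / 2.0))

def Phi(M, phi):
    """Phi(M) = sum_i phi(sigma_i(M)) for a singular-value loss phi."""
    return phi(np.linalg.svd(M, compute_uv=False)).sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 80))
    X = np.zeros_like(A)   # e.g., the trivial rank-0 approximation
    print(Phi(A - X, lambda s: huber(s, tau=1.0)),
          Phi(A - X, l1_l2),
          Phi(A - X, lambda s: tukey(s, p=1, tau=1.0)))
```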

Acknowledgements

The authors would like to thank the anonymous reviewers for helpful comments. Y. Li was supported in part by Singapore Ministry of Education (AcRF) Tier 2 grant MOE2018-T2-1-013. D. P. Woodruff was supported in part by Office of Naval Research (ONR) grant N00014-18-1-2562. Part of this work was done during a visit of D. P. Woodruff to Nanyang Technological University, funded by the aforementioned Singapore Ministry of Education grant.

References

Candès, E. J., Li, X., Ma, Y., and Wright, J. Robust principal component analysis? J. ACM, 58(3), June 2011.

Clarkson, K. L. and Woodruff, D. P. Low-rank approximation and regression in input sparsity time. J. ACM, 63(6), January 2017. ISSN 0004-5411.

Cohen, M. B., Elder, S., Musco, C., Musco, C., and Persu, M. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC '15, pp. 163–172, New York, NY, USA, 2015a. Association for Computing Machinery.

Cohen, M. B., Lee, Y. T., Musco, C., Musco, C., Peng, R., and Sidford, A. Uniform sampling for matrix approximation. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, ITCS '15, pp. 181–190, 2015b.

Cohen, M. B., Musco, C., and Musco, C. Input sparsity time low-rank approximation via ridge leverage score sampling. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '17, pp. 1758–1777, USA, 2017. Society for Industrial and Applied Mathematics.

Demmel, J., Dumitriu, I., and Holtz, O. Fast linear algebra is stable. Numer. Math., 108(1):59–91, October 2007. ISSN 0029-599X.

Li, Y., Nguyen, H. L., and Woodruff, D. P. On approximating matrix norms in data streams. SIAM Journal on Computing, 48(6):1643–1697, 2019.

Meng, X. and Mahoney, M. W. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Symposium on Theory of Computing Conference, STOC '13, Palo Alto, CA, USA, June 1-4, 2013, pp. 91–100, 2013.

Musco, C. Faster linear algebra for data analysis and machine learning. PhD thesis, MIT, 2018.

Musco, C. and Woodruff, D. P. Is input sparsity time possible for kernel low-rank approximation? In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS '17, pp. 4438–4448, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.

Nelson, J. and Nguyen, H. L. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pp. 117–126, October 2013.

Woodruff, D. P. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1–2):1–157, 2014.

Xu, H., Caramanis, C., and Sanghavi, S. Robust PCA via outlier pursuit. In Lafferty, J. D., Williams, C. K. I., Shawe-Taylor, J., Zemel, R. S., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 23, pp. 2496–2504. Curran Associates, Inc., 2010.

Yi, X., Park, D., Chen, Y., and Caramanis, C. Fast algorithms for robust PCA via gradient descent. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS '16, pp. 4159–4167, Red Hook, NY, USA, 2016. Curran Associates Inc.