
Efficient Spectrum-Revealing CUR Matrix Decomposition

Cheng Chen (Shanghai Jiao Tong University)   Ming Gu (University of California, Berkeley)   Zhihua Zhang (Peking University)

Weinan Zhang (Shanghai Jiao Tong University)   Yong Yu (Shanghai Jiao Tong University)

Abstract

The CUR matrix decomposition is an important tool for low-rank matrix approximation. It approximates a data matrix through selecting a small number of columns and rows of the matrix. CUR algorithms with gap-dependent approximation bounds can obtain high approximation quality for matrices with good singular value spectrum decay, but they have impractically high time complexities. In this paper, we propose a novel CUR algorithm based on truncated LU factorization with an efficient variant of complete pivoting. Our algorithm has gap-dependent approximation bounds on both the spectral and Frobenius norms while maintaining high efficiency. Numerical experiments demonstrate the effectiveness of our algorithm and verify our theoretical guarantees.

1 Introduction

CUR matrix decomposition is a well-known method for low-rank approximation (Mahoney and Drineas, 2009; Drineas et al., 2008). It approximates the data matrix $A$ as the product of three matrices: $A \approx CUR$, where $C$ and $R$ are composed of sampled columns and sampled rows of $A$ respectively. Compared with other low-rank approximation methods such as the truncated Singular Value Decomposition (SVD), CUR decomposition can better preserve the sparsity of the input matrix and facilitates the interpretation of the computed results. Because of these advantages, CUR decomposition is more attractive than SVD in text mining, collaborative filtering, bioinformatics, and image and video processing (Drineas et al., 2008; Xu et al., 2015; Wang and Zhang, 2012, 2013; Mahoney et al., 2008; Mackey et al., 2011).

In recent years, many variants of the CUR decomposition have been developed (Drineas et al., 2008; Wang and Zhang, 2012, 2013; Boutsidis and Woodruff, 2017; Bien et al., 2010; Anderson et al., 2015; Drineas et al., 2006; Wang et al., 2016; Song et al., 2019). Many randomized CUR algorithms adopt random sampling of the columns and rows of the data matrix. They aim to achieve $1+\varepsilon$ relative-error bounds in the Frobenius norm with a failure probability $\delta$ (Drineas et al., 2008; Wang and Zhang, 2013; Boutsidis and Woodruff, 2017), i.e., $\|A - CUR\|_F^2 \leq (1+\varepsilon)\|A - A_k\|_F^2$, where $k$ is the target rank of the approximation. These bounds are theoretically tight, but are less useful in practical applications because they require $O(k/\varepsilon)$ columns and rows. For example, Algorithm 3 of Boutsidis and Woodruff (2017) requires selecting $4k + 4820k/\varepsilon$ columns and rows, but only obtains a success probability of 0.16. If we choose $k = 10$ and $\varepsilon = 0.5$ (a very common setting), the number of selected columns and rows is close to $10^5$. For real-world data sets, it is impractical to select so many columns and rows. Anderson et al. (2015) proposed a deterministic column selection method for CUR decomposition with gap-dependent approximation bounds. These bounds allow their method to choose only $\ell = k + O(1)$ columns and rows when the given matrix has a rapidly decaying spectrum. However, their algorithm is highly time-consuming and requires $O(\ell mn(m+n))$ time, where $\ell$ is the number of selected columns and rows.

A natural question now arises: can we develop a CUR algorithm which is as efficient as previous randomized CUR algorithms while maintaining gap-dependent approximation bounds? In this paper, we address this question by presenting a novel CUR algorithm, namely Spectrum-Revealing CUR (SRCUR) decomposition. Our method borrows its core idea from a classic numerical linear algebra method, pivoted LU factorization (Trefethen and Bau III, 1997). Unlike existing CUR algorithms which sample columns and rows separately (Drineas et al., 2008; Wang and Zhang, 2013; Boutsidis and Woodruff, 2017; Anderson et al., 2015), the pivoting procedure of LU factorization selects columns and rows simultaneously and accounts for the inter-connectivity of the selected columns and rows. However, the vanilla pivoting procedure is costly and does not bound the relative error. To exploit the advantages and bypass the disadvantages of pivoted LU, we propose Fast Spectrum-Revealing LU (FSRLU) factorization, which is efficient and has theoretical guarantees. Our SRCUR algorithm first adopts FSRLU to select columns and rows and then computes the intersection matrix. The main contributions of this paper are summarized as follows:

- We propose the FSRLU algorithm, which efficiently computes a truncated LU factorization with complete pivoting. We also analyze the stability and time complexity of FSRLU.
- We propose the SRCUR algorithm based on FSRLU. We provide error analysis for SRCUR in both the spectral and Frobenius norms. The proofs show that our method has gap-dependent error bounds in terms of the squared spectral gap $\sigma_{\ell+1}/\sigma_{k+1}$. These bounds indicate that our method only needs to select $\ell = k + O(1)$ columns and rows for data matrices with rapidly decaying singular values.
- SRCUR is much faster than existing gap-dependent CUR algorithms. The time complexity of our method is $O(\mathrm{nnz}(A)\log n) + \tilde{O}(\ell^2(m+n))$, which is comparable to the state-of-the-art randomized CUR algorithms.
- We validate our method with numerical experiments on four real-world datasets, which demonstrate that our algorithm is substantially faster than existing CUR algorithms while maintaining good performance.

The remainder of the paper is organized as follows. In Section 2, we describe the notation and preliminaries used in this paper. In Section 3 we review related work. We then present our algorithms in Section 4 and give theoretical analysis in Section 5. Experimental results are reported in Section 6. Conclusions and discussions are provided in Section 7.

2 Notation and Preliminaries

We use $I_m$ to denote the $m \times m$ identity matrix, and $\mathbf{0}$ to denote a zero vector or matrix of appropriate size. For a vector $x = (x_1, \ldots, x_n)^\top$, let $\|x\|_2 = \sqrt{\sum_i x_i^2}$ be the $\ell_2$-norm of $x$. For a matrix $A = [a_{ij}] \in \mathbb{R}^{m \times n}$, let $\|A\|_2$ denote the spectral norm and $\|A\|_F$ denote the Frobenius norm. We use $\mathrm{nnz}(A)$ to denote the number of nonzero elements of $A$. Let $\det(A)$ be the determinant of $A$ and $\mathrm{adj}(A)$ be the adjugate matrix of $A$. We use $\tilde{O}$ to hide logarithmic factors.

Let $\rho$ be the rank of matrix $A$. The reduced SVD of $A$ is defined as
$$A = U \Sigma V^\top = \sum_{i=1}^{\rho} \sigma_i u_i v_i^\top,$$
where the $\sigma_i$ are the positive singular values in descending order; that is, $\sigma_i(A)$ is the $i$-th largest singular value of $A$. Let $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^\top$ denote the best rank-$k$ approximation of $A$ and $A^\dagger = V \Sigma^{-1} U^\top$ denote the Moore-Penrose pseudo-inverse of $A$. We use $\kappa(A) = \sigma_{\max}/\sigma_{\min}$ to denote the condition number of $A$, where $\sigma_{\max}$ is the maximum singular value and $\sigma_{\min}$ is the minimum nonzero singular value of $A$.

We use the Matlab colon notation for block matrices. Let $A_{i,:}$ be the $i$-th row of $A$ and $A_{:,j}$ be the $j$-th column of $A$. We use $A_{i:j,:}$ to denote $[A_{i,:}^\top, A_{i+1,:}^\top, \ldots, A_{j,:}^\top]^\top$ and $A_{:,p:q}$ to denote $[A_{:,p}, A_{:,p+1}, \ldots, A_{:,q}]$. Let $A_{i:j,p:q}$ denote the block matrix
$$\begin{pmatrix} a_{ip} & \cdots & a_{iq} \\ \vdots & \ddots & \vdots \\ a_{jp} & \cdots & a_{jq} \end{pmatrix}.$$

The Johnson-Lindenstrauss (JL) transform (Johnson and Lindenstrauss, 1984) is a powerful tool for dimensionality reduction. It has been proved to preserve vector norms within an $\varepsilon$-error ball. We state the Johnson-Lindenstrauss Lemma as follows:

Lemma 1. For any vector $x \in \mathbb{R}^d$ and $0 < \varepsilon, \delta < 1/2$, there exists a JL transform matrix $S \in \mathbb{R}^{p \times d}$ with $p = \Theta(\log(1/\delta)\varepsilon^{-2})$ which satisfies
$$(1 - \varepsilon)\|x\|_2^2 \leq \|Sx\|_2^2 \leq (1 + \varepsilon)\|x\|_2^2$$
with probability $1 - \delta$.
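Lemma 1 is easy to observe numerically. Below is a minimal sketch assuming a Gaussian sketching matrix with i.i.d. $N(0, 1/p)$ entries, one standard JL construction; the dimensions are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 10000, 300                     # ambient dimension and sketch size (illustrative)
x = rng.standard_normal(d)

# Gaussian JL transform: i.i.d. N(0, 1/p) entries, so E[||Sx||_2^2] = ||x||_2^2.
S = rng.standard_normal((p, d)) / np.sqrt(p)

ratio = np.linalg.norm(S @ x) ** 2 / np.linalg.norm(x) ** 2
print(f"||Sx||^2 / ||x||^2 = {ratio:.3f}")  # concentrates around 1 as p grows
```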

3 Related Work

In Section 3.1, we review some representative methods for CUR matrix decomposition. In Section 3.2, we introduce truncated LU factorization with complete pivoting.

3.1 CUR Matrix Decomposition

CUR decomposition approximates the data matrix with actual columns and rows. Given a matrix $A \in \mathbb{R}^{m \times n}$, CUR decomposition selects a subset of columns $C \in \mathbb{R}^{m \times c}$ and a subset of rows $R \in \mathbb{R}^{r \times n}$ and computes an intersection matrix $U = C^\dagger A R^\dagger$. Then $\hat{A} = CUR$ is a low-rank approximation of $A$. Some works compute an approximation $\hat{A}_k$ with exact rank $k$ (Boutsidis and Woodruff, 2017; Anderson et al., 2015); the rank-constrained approximation is more directly comparable to the truncated SVD.

The essence of CUR decomposition is how to select columns and rows effectively and efficiently. Traditional approaches select columns according to deterministic pivoting rules (Stewart, 1999; Chan, 1987; Gu and Eisenstat, 1996; Berry et al., 2005), but these methods do not have good approximation error guarantees. More recently, randomized algorithms for CUR decomposition aim to obtain $1+\varepsilon$ relative-error bounds in the Frobenius norm (Mahoney and Drineas, 2009; Drineas et al., 2008; Wang and Zhang, 2013; Boutsidis and Woodruff, 2017; Mahoney et al., 2011):
$$\|A - CUR\|_F^2 \leq (1+\varepsilon)\|A - A_k\|_F^2.$$
Drineas et al. (2008) randomly sampled columns and rows according to their leverage scores and achieved $1+\varepsilon$ relative-error bounds in the Frobenius norm. Wang and Zhang (2013) proposed an adaptive CUR algorithm with better theoretical results, which were further improved by Boutsidis and Woodruff (2017) and Song et al. (2019). However, these methods have to choose $O(k/\varepsilon)$ rows and columns to achieve the $1+\varepsilon$ error bound, which is usually impractical for real-world data sets.

Anderson et al. (2015) proposed unweighted column selection for CUR decomposition, which provides spectral gap error bounds of the following form:
$$\|A - CUR\|_\xi^2 \leq (1 + O(\omega^2))\|A - A_k\|_\xi^2, \quad \text{for } \xi \in \{2, F\}.$$
Here $\omega$ is a quantity that depends on the rate of spectrum decay of $A$ and the amount of oversampling $\ell - k$. Their method only needs to choose $\ell = k + O(1)$ columns and rows to obtain a good rank-$k$ approximation when the data matrix has rapid spectrum decay. However, their algorithm is very expensive and requires $O(\ell mn(m+n))$ time to obtain a rank-$k$ approximation.

There have been several investigations of ways to construct the approximation itself. Wang et al. (2016) proposed an efficient method to compute $U$ by matrix sketching. Anderson et al. (2015) proposed a numerically stable algorithm, StableCUR, to construct $\hat{A}$ and $\hat{A}_k$; it avoids computing pseudo-inverses by performing QR factorizations on $C$ and $R$. We present their algorithm in Appendix A.

Some other variants of CUR algorithms, including tensor CUR decomposition (Song et al., 2019) and CUR with $\ell_1$-norm error (Song et al., 2017), are less relevant to our work.
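As a concrete illustration of the decomposition itself (not of any particular selection rule), the following minimal sketch assembles $\hat{A} = CUR$ from uniformly sampled columns and rows with $U = C^\dagger A R^\dagger$; the sizes and the uniform sampling are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, rank = 200, 150, 10
A = rng.standard_normal((m, rank)) @ rng.standard_normal((rank, n))  # low-rank test matrix

c = r = 15                                    # number of sampled columns and rows
cols = rng.choice(n, size=c, replace=False)   # uniform sampling, for illustration only
rows = rng.choice(m, size=r, replace=False)

C, R = A[:, cols], A[rows, :]
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)  # intersection matrix U = C^+ A R^+
A_hat = C @ U @ R

print(np.linalg.norm(A - A_hat, "fro") / np.linalg.norm(A, "fro"))  # ~0 here: C, R span A
```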
3.2 LU Factorization with Complete Pivoting

LU factorization factors a matrix as the product of a unit lower triangular matrix and an upper triangular matrix, with row and/or column permutations. Partial pivoting is widely used in numerical linear algebra because it is more efficient; however, it may fail on rank-deficient matrices. Complete pivoting, though much more expensive, is necessary for such matrices to guarantee that the factors are well-defined. Recently, Melgaard and Gu (2015) proposed a randomized LU factorization with complete pivoting that leverages random projection.

Rank-revealing LU algorithms (Pan, 2000; Miranian and Gu, 2003) produce good quality approximations. They guarantee that the approximation captures the rank of the data matrix within a low-degree polynomial factor in $k$, $m$ and $n$, which could nevertheless be very large. Spectrum-revealing LU (Anderson and Gu, 2017) performs extra swaps to ensure spectral gap error bounds, but it also introduces extra computational cost.
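For reference, below is a minimal dense sketch of truncated LU with complete pivoting (our own illustration, not code from any of the cited papers). Each step scans the entire Schur complement for the largest-magnitude entry, which is exactly the per-step cost that the algorithms of Section 4 avoid.

```python
import numpy as np

def lu_complete_pivoting(A, ell):
    """Truncated LU with complete pivoting (dense reference sketch).
    Returns unit lower-trapezoidal L (m x ell), upper-trapezoidal U (ell x n),
    and pivot orderings prow, pcol with A[np.ix_(prow, pcol)] = L @ U + residual,
    where the residual is zero outside the trailing Schur complement."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    prow, pcol = np.arange(m), np.arange(n)
    for i in range(ell):
        # Full O((m-i)(n-i)) scan of the Schur complement -- the expensive step.
        r, c = np.unravel_index(np.argmax(np.abs(A[i:, i:])), (m - i, n - i))
        A[[i, i + r], :] = A[[i + r, i], :]          # row swap
        A[:, [i, i + c]] = A[:, [i + c, i]]          # column swap
        prow[[i, i + r]] = prow[[i + r, i]]
        pcol[[i, i + c]] = pcol[[i + c, i]]
        A[i + 1:, i] /= A[i, i]                                    # multipliers
        A[i + 1:, i + 1:] -= np.outer(A[i + 1:, i], A[i, i + 1:])  # Schur update
    L = np.tril(A[:, :ell], -1) + np.eye(m, ell)     # unit diagonal
    U = np.triu(A[:ell, :])
    return L, U, prow, pcol
```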

4 Our Approaches

In this section, we present our main algorithms. We first propose the EELM algorithm, which quickly estimates the element with the largest magnitude of a given matrix. We then propose the FSRLU algorithm, which efficiently computes a spectrum-revealing LU factorization. Finally, we propose our SRCUR algorithm based on FSRLU.

4.1 Estimating the Element with Largest Magnitude

Finding the element with the largest magnitude in a matrix has many applications (Higham and Relton, 2016). Computing $\|A\|_{\max}$ of $A \in \mathbb{R}^{m \times n}$ from its definition usually has $O(mn)$ cost. However, if we can only access matrix $A$ implicitly, such as through a product $A = BC$ or an inverse $A = B^{-1}$, the computational cost can be prohibitive if we need to calculate all elements of $A$. The method of Higham and Relton (2016) uses matrix-vector products to estimate the position of the element with the largest magnitude, but it does not have theoretical guarantees on the estimated value.

We propose an efficient algorithm to estimate the element with the largest magnitude, called EELM. Given a matrix $A$ and a projected matrix $R = \Omega A$ (where $\Omega \in \mathbb{R}^{p \times m}$ is a JL transform matrix), EELM first finds the column with the maximum $\ell_2$-norm in the sketched matrix $R$ and then chooses the largest element in the corresponding column of the original matrix $A$. We describe EELM in Algorithm 1.

Algorithm 1 Estimating the Element with Largest Magnitude (EELM)
Input: matrix $A \in \mathbb{R}^{m \times n}$, sketch $R \in \mathbb{R}^{p \times n}$.
Output: row index $r$, column index $c$, estimated value $x$.
1: $c = \arg\max_{1 \leq j \leq n} \|R_{:,j}\|_2$.
2: Compute the $c$-th column of $A$.
3: $r = \arg\max_{1 \leq j \leq m} |A_{j,c}|$.
4: $x = |A_{r,c}|$.

Since the JL transform preserves vector norms well, the estimated value can be bounded as in the following theorem:

Theorem 1. The estimated value $x$ given by Algorithm 1 satisfies $x \geq \sqrt{\frac{1-\varepsilon}{mn(1+\varepsilon)}}\,\|A\|_F$ with probability $1 - \delta$.

As shown in the next section, $\varepsilon$ only affects the time complexity of our algorithm but has little influence on the error bounds. Thus, in the rest of the paper we assume $\varepsilon = 0.5$ and $\delta = 0.01$ for convenience.
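A minimal NumPy sketch of Algorithm 1 follows. The column accessor is passed as a function so that $A$ may be an implicit operator; all names in the sketch are our own.

```python
import numpy as np

def eelm(get_col, R):
    """Sketch of Algorithm 1 (EELM). R = Omega @ A is a JL sketch of A;
    get_col(j) returns column j of A, so A may be an implicit operator."""
    c = int(np.argmax(np.linalg.norm(R, axis=0)))  # column with largest sketched norm
    col = get_col(c)                                # form only this one column of A
    r = int(np.argmax(np.abs(col)))                 # largest-magnitude entry in it
    return r, c, float(abs(col[r]))

# usage on an explicit matrix (get_col could instead apply, e.g., B @ C[:, j])
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 400))
Omega = rng.standard_normal((20, 500)) / np.sqrt(20)
r, c, x = eelm(lambda j: A[:, j], Omega @ A)
print(x, np.abs(A).max())  # x is the exact magnitude of one entry, bounded as in Theorem 1
```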

4.2 Fast Spectrum-Revealing LU Factorization

In this section, we describe our FSRLU algorithm, which consists of two parts: Fast Pivoting for Truncated LU factorization (FPTLU) and Fast Spectrum-Revealing Pivoting (FSRP). FPTLU computes a rough pivoting for the truncated LU, and FSRP performs extra operations on the result of FPTLU in order to bound the error.

4.2.1 Fast Pivoting for Truncated LU Factorization

Truncated LU factorization aims to compute
$$\Pi_1 A \Pi_2 = \hat{L}\hat{U} + \begin{pmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & S \end{pmatrix},$$
where $\Pi_1$ and $\Pi_2$ are permutation matrices and $S$ is the Schur complement. In each iteration, the vanilla complete pivoting of LU factorization chooses the largest-magnitude element in the Schur complement as the pivot. This procedure is very expensive. To reduce the computational cost, we adopt the EELM algorithm to estimate the pivots. We present FPTLU in Algorithm 2. Note that lines 5-7 of Algorithm 2 are the standard LU update, and that the sketched matrix $B$ can be updated efficiently as in (Anderson and Gu, 2017) (see Appendix B).

Algorithm 2 Fast Pivoting for Truncated LU Factorization (FPTLU)
Input: data matrix $A \in \mathbb{R}^{m \times n}$, target number of columns and rows $\ell$.
Output: $\hat{L} \in \mathbb{R}^{m \times \ell}$, $\hat{U} \in \mathbb{R}^{\ell \times n}$, $\Pi_1$, $\Pi_2$.
1: Generate a JL transform matrix $\Omega \in \mathbb{R}^{p_1 \times m}$; compute $B = \Omega A$, $\Pi_1 = I_m$, $\Pi_2 = I_n$.
2: for $i = 1, \ldots, \ell$ do
3:   $[r, c, x] = \mathrm{EELM}(A, B)$.
4:   Swap the $i$-th and $c$-th columns of $A$ and update $\Pi_2$; swap the $i$-th and $r$-th rows of $A$ and update $\Pi_1$.
5:   $A_{i:m,i} = A_{i:m,i} - A_{i:m,1:i-1}A_{1:i-1,i}$, $A_{i+1:m,i} = A_{i+1:m,i}/A_{i,i}$.
6:   $A_{i,i+1:n} = A_{i,i+1:n} - A_{i,1:i-1}A_{1:i-1,i+1:n}$.
7:   Update $B_{:,i+1:n}$.
8: end for
9: Let $\hat{L}$ be the lower triangular part of $A_{:,1:\ell}$ with unit diagonal.
10: Let $\hat{U}$ be the upper triangular part of $A_{1:\ell,:}$.
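The heart of FPTLU is replacing the full Schur-complement scan with a sketched scan. A minimal sketch of that single step is shown below; it is a simplification, since it forms the Schur complement explicitly and redraws $\Omega$ on each call, whereas Algorithm 2 never materializes $S$ and updates $B$ incrementally (Appendix B).

```python
import numpy as np

def estimate_pivot(S, p1=20, rng=None):
    """EELM-style pivot estimate for a Schur complement S (sketch of the idea
    in Algorithm 2). Replaces the full complete-pivoting scan of S with a
    column-norm scan of a small sketch B = Omega @ S, then forms one column."""
    rng = rng or np.random.default_rng(0)
    Omega = rng.standard_normal((p1, S.shape[0])) / np.sqrt(p1)
    B = Omega @ S                                   # p1 x (n - i) sketch
    c = int(np.argmax(np.linalg.norm(B, axis=0)))   # sketched column norms
    r = int(np.argmax(np.abs(S[:, c])))             # exact scan of a single column
    return r, c
```

Substituting `estimate_pivot` for the `argmax` scan in the complete-pivoting sketch of Section 3.2 gives a dense illustration of FPTLU's logic, though not of its complexity.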

4.2.2 Fast Spectrum-Revealing Pivoting

Given a truncated LU factorization in the following form,
$$\Pi_1 A \Pi_2 = \begin{pmatrix} L_{11} & & \\ l_{21}^\top & 1 & \\ L_{31} & & I \end{pmatrix} \begin{pmatrix} U_{11} & u_{12} & U_{13} \\ & \alpha & s_{12}^\top \\ & s_{21} & S_{22} \end{pmatrix},$$
where $\hat{L} = \begin{pmatrix} L_{11} \\ l_{21}^\top \\ L_{31} \end{pmatrix}$, $\hat{U} = \begin{pmatrix} U_{11} & u_{12} & U_{13} \end{pmatrix}$ and $S = \begin{pmatrix} \alpha & s_{12}^\top \\ s_{21} & S_{22} \end{pmatrix}$, FSRP aims to find, in each iteration, the element $\alpha$ with the largest magnitude in $S$ and the largest element $\beta$ in the matrix $|A_{11}^{-\top}|$. Here $A_{11} = \begin{pmatrix} L_{11} & \\ l_{21}^\top & 1 \end{pmatrix}\begin{pmatrix} U_{11} & u_{12} \\ & \alpha \end{pmatrix}$. Since computing $S$ is very expensive, we use EELM to estimate $\alpha$ and $\beta$ rather than compute their exact values. We present FSRP in Algorithm 3. In each iteration, the sketched matrices $R$ and $W$ only need a rank-2 update after each swap. We leverage techniques from (Gondzio, 1992) to turn $\hat{L}$ and $\hat{U}$ back into trapezoidal form after the swaps are performed. We show the details of updating $\hat{L}$ and $\hat{U}$ in Appendix C.

Algorithm 3 Fast Spectrum-Revealing Pivoting (FSRP)
Input: truncated LU factorization $\Pi_1 A \Pi_2 \approx \hat{L}\hat{U}$ and tolerance $f$.
Output: $\hat{L} \in \mathbb{R}^{m \times \ell}$, $\hat{U} \in \mathbb{R}^{\ell \times n}$, $\Pi_1$, $\Pi_2$.
1: Generate JL transform matrices $\Omega_1 \in \mathbb{R}^{p_2 \times (m-\ell)}$, $\Omega_2 \in \mathbb{R}^{p_3 \times (\ell+1)}$.
2: $R = \Omega_1 S$; $[s_r, s_c, \alpha] = \mathrm{EELM}(S, R)$.
3: Swap the $(\ell+1)$-th row and the $(\ell+s_r)$-th row of $\hat{L}$ and update $\Pi_1$ correspondingly.
4: Swap the $(\ell+1)$-th column and the $(\ell+s_c)$-th column of $\hat{U}$ and update $\Pi_2$ correspondingly.
5: $W = \Omega_2 A_{11}^{-\top}$; $[a_r, a_c, \beta] = \mathrm{EELM}(A_{11}^{-\top}, W)$.
6: while $\alpha\beta > f$ do
7:   Exchange the $a_r$-th row and the $(\ell+1)$-th row of $A$ and update $\hat{L}$.
8:   Exchange the $a_c$-th column and the $(\ell+1)$-th column of $A$ and update $\hat{U}$.
9:   Update $\Pi_1$, $\Pi_2$, $S$, $R$ and perform steps 3, 4, 5.
10:  Update $W$; let $[a_r, a_c, \beta] = \mathrm{EELM}(A_{11}^{-\top}, W)$.
11: end while

4.3 Spectrum-Revealing CUR Decomposition

The permutation matrices produced by the FSRP algorithm indicate which columns and rows we should choose for the CUR decomposition. Namely, we compute the matrices $C$ and $R$ as
$$C = \Pi_1^\top \hat{L}\hat{U}_{:,1:\ell} \quad \text{and} \quad R = \hat{L}_{1:\ell,:}\hat{U}\Pi_2^\top.$$
To make our algorithm more stable, we construct the approximation matrices with StableCUR (Anderson et al., 2015), which is shown in Appendix A. We show our SRCUR algorithm in Algorithm 4.

Algorithm 4 Spectrum-Revealing CUR Decomposition (SRCUR)
Input: data matrix $A \in \mathbb{R}^{m \times n}$, target rank $k$, target column and row number $\ell$, $f > 1$.
Output: $\hat{A} \in \mathbb{R}^{m \times n}$ and $\hat{A}_k \in \mathbb{R}^{m \times n}$.
1: Perform Algorithm 2 to compute a truncated LU factorization $\Pi_1 A \Pi_2 \approx \hat{L}\hat{U}$.
2: Perform Algorithm 3 to update $\Pi_1$, $\Pi_2$, $\hat{L}$ and $\hat{U}$.
3: Compute $C = \Pi_1^\top \hat{L}\hat{U}_{:,1:\ell}$ and $R = \hat{L}_{1:\ell,:}\hat{U}\Pi_2^\top$.
4: Perform the StableCUR algorithm to obtain $\hat{A}$ and $\hat{A}_k$.
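Since the first $\ell$ columns and rows of the residual in the truncated factorization vanish, $C$ and $R$ in step 3 coincide with the columns and rows of $A$ selected by the pivots. The minimal sketch below assembles the approximation from pivot orderings in that spirit; it uses explicit pseudo-inverses for $U$ where the paper's StableCUR (Appendix A) uses QR factorizations, and the stand-in pivots are our own illustration.

```python
import numpy as np

def cur_from_pivots(A, prow, pcol, ell):
    """Assemble a CUR approximation from LU pivot orderings (sketch).
    C and R are the columns/rows of A picked by the first ell pivots; U is
    formed with pseudo-inverses for clarity, whereas StableCUR avoids
    explicit pseudo-inverses via QR factorizations of C and R."""
    C = A[:, pcol[:ell]]
    R = A[prow[:ell], :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R

# usage with pivot orderings from any pivoted truncated LU
rng = np.random.default_rng(0)
A = rng.standard_normal((300, 12)) @ rng.standard_normal((12, 200))
prow, pcol = rng.permutation(300), rng.permutation(200)  # stand-ins for FSRLU pivots
C, U, R = cur_from_pivots(A, prow, pcol, 14)
print(np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A))
```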

5 Theoretical Analysis

In this section, we analyze the time complexity of our algorithm in Section 5.1 and provide spectral gap error bounds on the singular values, the spectral norm and the Frobenius norm in Section 5.2.

5.1 Time Complexity

It is obvious that the time cost of EELM is $O(m + pn)$. We next discuss the time complexity of FPTLU. If we choose a Gaussian random projection matrix as the JL transform matrix, we only require $O(p_1\,\mathrm{nnz}(A))$ time to initialize the matrix $B$. In each iteration, FPTLU requires $O(p_1 n)$ time to update $B$, and the standard LU updates (lines 5-7) take $O(\ell^2(m+n))$ time in total. Since $p_1 = O(\log n)$, FPTLU can compute a truncated LU factorization in $O(\mathrm{nnz}(A)\log n + \ell^2(m+n))$ time.

To analyze the complexity of FSRP, we first give the following theorem, which bounds the number of iterations.

Theorem 2. If the input matrices $\hat{L}$ and $\hat{U}$ are obtained by Algorithm 2, then Algorithm 3 performs at most $O(\ell\log(mn))$ iterations with probability 0.99.

FSRP requires $O(\mathrm{nnz}(A)\log n + \ell(m+n)\log n)$ time to initialize the matrix $R$ and $O(\ell^2\log\ell)$ time to initialize the matrix $W$. Each iteration of FSRP costs $O(pn + \ell(m+n) + \ell^2\log\ell)$. Using Theorem 2, the time complexity of FSRP is $O(\mathrm{nnz}(A)\log\ell + \ell pn\log(mn) + \ell^2(m+n)\log(mn) + \ell^3\log(mn)\log\ell)$. StableCUR requires $O(\ell^2(m+n))$ time to construct the CUR approximation. Therefore, our SRCUR algorithm needs $O(\mathrm{nnz}(A)\log n) + \tilde{O}(\ell^2(m+n))$ time in total.

5.2 Spectral Gap Error Bounds

We present the norm error bounds in Theorem 3 and the singular value bounds in Theorem 4. The detailed proofs are given in the supplementary material due to the page limit.

Theorem 3. For $\gamma = O(f\ell\sqrt{mn})$, Algorithm 4 satisfies
$$\|A - \hat{A}\|_2 \leq \|A - \hat{A}\|_F \leq \gamma\,\sigma_{\ell+1}(A), \tag{1}$$
$$\|A - \hat{A}_k\|_F^2 \leq \left(1 + \frac{2\gamma^2\,\sigma_{\ell+1}^2(A)}{\sum_{i=k+1}^{\mathrm{rank}(A)}\sigma_i^2(A)}\right)\|A - A_k\|_F^2, \tag{2}$$
$$\|A - \hat{A}_k\|_2^2 \leq \left(1 + 2\gamma^2\,\frac{\sigma_{\ell+1}^2(A)}{\sigma_{k+1}^2(A)}\right)\|A - A_k\|_2^2, \tag{3}$$
with probability 0.98.

Theorem 4. For $1 \leq j \leq \ell$ and $\gamma = O(f\ell\sqrt{mn})$, Algorithm 4 satisfies
$$\sigma_j(A) \geq \sigma_j(\hat{A}) \geq \sqrt{1 - 2\gamma^2\left(\frac{\sigma_{\ell+1}(A)}{\sigma_j(A)}\right)^2}\;\sigma_j(A)$$
with probability 0.98.

5.3 Discussion

We would like to emphasize the differences between our work and the methods that achieve $1+\varepsilon$ relative-error bounds (Drineas et al., 2008; Wang and Zhang, 2013; Boutsidis and Woodruff, 2017). First, $1+\varepsilon$ relative-error bounds only control the Frobenius norm, while we provide bounds on the spectral norm, the Frobenius norm and the singular values. In addition, our bounds are data-dependent, while $1+\varepsilon$ relative-error bounds depend on the universal constant $\varepsilon$.

Next we compare our bounds with those of the DetUCS algorithm (Anderson et al., 2015). Both our bounds and those of DetUCS are governed by the squared spectral gap $\sigma_{\ell+1}^2(A)/\sigma_{k+1}^2(A)$; thus, our bounds have the same order as those of DetUCS. However, DetUCS costs $O(\ell mn(m+n))$ time, which is much higher than ours.

Finally, we discuss the differences between our method and SRLU (Anderson and Gu, 2017). First, our method and SRLU are designed for different tasks: the goal of SRLU is to obtain a high-quality LU factorization, while we aim to compute a CUR decomposition. Also, SRLU has no time guarantee on its extra swap procedure; thus, our FSRLU is faster than SRLU while maintaining the same theoretical guarantees. SRLU does provide bounds on $\|\Pi_1 A \Pi_2 - \hat{L}M\hat{U}\|_\xi$, where $M = \hat{L}^\dagger A \hat{U}^\dagger$ and $\xi \in \{2, F\}$, but this approximation is not rank-constrained and is worse than our result.

6 Experiments

In this section, we empirically evaluate the performance of our SRCUR algorithm. We choose a Gaussian random projection matrix as the JL transform matrix in SRCUR. The baseline methods include uniform sampling (Williams and Seeger, 2001), subspace sampling (Drineas et al., 2008), near-optimal sampling (Wang and Zhang, 2013; Boutsidis and Woodruff, 2017) and deterministic unweighted column selection (Anderson et al., 2015).

In our empirical evaluation, we consider the following six measures: $\|A-\hat{A}\|_F/\|A-A_k\|_F$, $\|A-\hat{A}\|_2/\|A-A_k\|_2$, $\|A-\hat{A}_k\|_F/\|A-A_k\|_F$, $\|A-\hat{A}_k\|_2/\|A-A_k\|_2$, $\sigma_k(\hat{A})/\sigma_k(A)$, and running time. We compare the CUR algorithms on the following four datasets: the Extended Yale Face Database B (Georghiades et al., 2001), Dexter (Guyon et al., 2005), Sido¹ and Reuters². We summarize these datasets in Table 1 and show their singular value distributions in Figure 1. We choose $\ell = k+1, 2k, \ldots, 10k$ for all datasets, where $k = \lceil \min(m,n)/200 \rceil$. We report our results in Figures 2-5. All results of randomized algorithms are averaged over 5 runs. The experiments are all performed in Matlab R2017b.

Table 1: Summary of datasets for CUR matrix decomposition

  Dataset                m      n       %nnz
  Extended Yale Face B   2414   1024    97.84
  Dexter                 2600   20000   4.8
  Sido                   4932   12678   9.84
  Reuters                8293   18933   0.25

[Figure 1: Singular value distributions of the four datasets (Yale Face B, Dexter, Sido, Reuters).]

The figures show that our algorithm is much more efficient than state-of-the-art CUR algorithms. The main reason is that state-of-the-art CUR algorithms all need to perform SVD or randomized SVD, while SRCUR is SVD-free. Randomized SVD, though it has good time complexity (Clarkson and Woodruff, 2017), is much less efficient than the LU update because the constant factor of SVD is extremely large. Furthermore, the approximation errors of SRCUR are better than or comparable with those of the other algorithms. In particular, SRCUR performs better than the other methods when $\ell$ is close to $k$. This matches our theoretical analysis in Section 5.

¹http://www.causality.inf.ethz.ch/data/SIDO.html
²http://www.zjucadcg.cn/dengcai/Data/TextData.html
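For concreteness, a minimal sketch of the error measures is given below (written in Python for illustration, whereas the paper's experiments are run in Matlab; the function name is our own).

```python
import numpy as np

def eval_measures(A, A_hat, A_hat_k, k):
    """The approximation-quality measures of Section 6 (illustrative sketch)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = (U[:, :k] * s[:k]) @ Vt[:k]               # best rank-k approximation
    fro = lambda M: np.linalg.norm(M, "fro")
    spec = lambda M: np.linalg.norm(M, 2)
    return {
        "F-norm error":         fro(A - A_hat) / fro(A - A_k),
        "2-norm error":         spec(A - A_hat) / spec(A - A_k),
        "rank-k F-norm error":  fro(A - A_hat_k) / fro(A - A_k),
        "rank-k 2-norm error":  spec(A - A_hat_k) / spec(A - A_k),
        "singular value ratio": np.linalg.svd(A_hat, compute_uv=False)[k - 1] / s[k - 1],
    }
```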

7 Conclusions

In this paper, we have proposed the Spectrum-Revealing CUR decomposition algorithm. We have built a bridge between CUR decomposition and LU factorization, leveraging truncated LU factorization to select columns and rows for CUR decomposition. We have developed spectral gap error bounds on the spectral norm, the Frobenius norm and the singular values. These bounds confirm the high-quality approximations computed by our method for matrices with sufficiently fast singular value decay, a property typically observed in practical data matrices. We have also conducted experiments that empirically demonstrate the effectiveness and reliability of our algorithms on four machine learning datasets.

[Figure 2: Results of CUR decomposition algorithm comparison on the Yale Face B data matrix. Panels: (a) F-norm error, (b) 2-norm error, (c) rank-k F-norm error, (d) rank-k 2-norm error, (e) singular value ratio, (f) running time. Methods: Leverage, Uniform, NearOptimal, DetUCS, SRCUR; x-axis: ℓ from k+1 to 10k.]

[Figure 3: Results of CUR decomposition algorithm comparison on the Dexter data matrix (same panels and methods as Figure 2).]

[Figure 4: Results of CUR decomposition algorithm comparison on the Sido data matrix (same panels and methods as Figure 2).]

[Figure 5: Results of CUR decomposition algorithm comparison on the Reuters data matrix (same panels and methods as Figure 2).]

Acknowledgements

The work is supported by the National Natural Science Foundation of China (61702327, 61772333, 61632017). Ming Gu is supported by the NSF (1760316). Zhihua Zhang is supported by the National Natural Science Foundation of China (No. 11771002) and the Beijing Natural Science Foundation (Z190001). Yong Yu is the corresponding author.

References

Anderson, D. and Gu, M. (2017). An efficient, sparsity-preserving, online algorithm for low-rank approximation. In International Conference on Machine Learning, pages 156-165.

Anderson, D. G., Du, S., Mahoney, M., Melgaard, C., Wu, K., and Gu, M. (2015). Spectral gap error bounds for improving CUR matrix decomposition and the Nyström method. In AISTATS.

Berry, M. W., Pulatova, S. A., and Stewart, G. (2005). Algorithm 844: Computing sparse reduced-rank approximations to sparse matrices. ACM Transactions on Mathematical Software (TOMS), 31(2):252-269.

Bien, J., Xu, Y., and Mahoney, M. W. (2010). CUR from a sparse optimization viewpoint. In Advances in Neural Information Processing Systems, pages 217-225.

Boutsidis, C. and Woodruff, D. P. (2017). Optimal CUR matrix decompositions. SIAM Journal on Computing, 46(2):543-589.

Chan, T. F. (1987). Rank revealing QR factorizations. Linear Algebra and its Applications, 88:67-82.

Clarkson, K. L. and Woodruff, D. P. (2017). Low-rank approximation and regression in input sparsity time. Journal of the ACM (JACM), 63(6):54.

Drineas, P., Kannan, R., and Mahoney, M. W. (2006). Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36(1):184-206.

Drineas, P., Mahoney, M. W., and Muthukrishnan, S. (2008). Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844-881.

Georghiades, A. S., Belhumeur, P. N., and Kriegman, D. J. (2001). From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):643-660.

Gondzio, J. (1992). Stable algorithm for updating dense LU factorization after row or column exchange and row and column addition or deletion. Optimization, 23(1):7-26.

Gu, M. and Eisenstat, S. C. (1996). Efficient algorithms for computing a strong rank-revealing QR factorization. SIAM Journal on Scientific Computing, 17(4):848-869.

Guyon, I., Gunn, S., Ben-Hur, A., and Dror, G. (2005). Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems, pages 545-552.

Higham, N. J. and Relton, S. D. (2016). Estimating the largest elements of a matrix. SIAM Journal on Scientific Computing, 38(5):C584-C601.

Johnson, W. B. and Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189-206.

Mackey, L. W., Jordan, M. I., and Talwalkar, A. (2011). Divide-and-conquer matrix factorization. In Advances in Neural Information Processing Systems, pages 1134-1142.

Mahoney, M. W. and Drineas, P. (2009). CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697-702.

Mahoney, M. W. et al. (2011). Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123-224.

Mahoney, M. W., Maggioni, M., and Drineas, P. (2008). Tensor-CUR decompositions for tensor-based data. SIAM Journal on Matrix Analysis and Applications, 30(3):957-987.

Melgaard, C. and Gu, M. (2015). Gaussian elimination with randomized complete pivoting. arXiv preprint arXiv:1511.08528.

Miranian, L. and Gu, M. (2003). Strong rank revealing LU factorizations. Linear Algebra and its Applications, 367:1-16.

Pan, C.-T. (2000). On the existence and computation of rank-revealing LU factorizations. Linear Algebra and its Applications, 316(1-3):199-222.

Song, Z., Woodruff, D. P., and Zhong, P. (2017). Low rank approximation with entrywise ℓ1-norm error. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 688-701. ACM.

Song, Z., Woodruff, D. P., and Zhong, P. (2019). Relative error tensor low rank approximation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2772-2789. Society for Industrial and Applied Mathematics.

Stewart, G. (1999). Four algorithms for the efficient computation of truncated pivoted QR approximations to a sparse matrix. Numerische Mathematik, 83(2):313-323.

Trefethen, L. N. and Bau III, D. (1997). Numerical Linear Algebra, volume 50. SIAM.

Wang, S. and Zhang, Z. (2012). A scalable CUR matrix decomposition algorithm: Lower time complexity and tighter bound. In Advances in Neural Information Processing Systems, pages 647-655.

Wang, S. and Zhang, Z. (2013). Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. Journal of Machine Learning Research, 14(1):2729-2769.

Wang, S., Zhang, Z., and Zhang, T. (2016). Towards more efficient SPSD matrix approximation and CUR matrix decomposition. Journal of Machine Learning Research, 17(210):1-49.

Williams, C. K. and Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682-688.

Xu, M., Jin, R., and Zhou, Z.-H. (2015). CUR algorithm for partially observed matrices. In ICML, pages 1412-1421.