
Efficient Spectrum-Revealing CUR Matrix Decomposition

Cheng Chen (Shanghai Jiao Tong University)   Ming Gu (University of California, Berkeley)   Zhihua Zhang (Peking University)

Weinan Zhang (Shanghai Jiao Tong University)   Yong Yu (Shanghai Jiao Tong University)

Abstract

The CUR matrix decomposition is an important tool for low-rank matrix approximation. It approximates a data matrix through selecting a small number of columns and rows of the matrix. CUR algorithms with gap-dependent approximation bounds can obtain high approximation quality for matrices with good singular value spectrum decay, but they have impractically high time complexities. In this paper, we propose a novel CUR algorithm based on truncated LU factorization with an efficient variant of complete pivoting. Our algorithm has gap-dependent approximation bounds on both the spectral and Frobenius norms while maintaining high efficiency. Numerical experiments demonstrate the effectiveness of our algorithm and verify our theoretical guarantees.

1 Introduction

CUR matrix decomposition is a well-known method for low-rank approximation (Mahoney and Drineas, 2009; Drineas et al., 2008). It approximates the data matrix $A$ as the product of three matrices: $A \approx CUR$, where $C$ and $R$ are composed of sampled columns and sampled rows of $A$ respectively. Compared with other low-rank approximation methods such as the truncated Singular Value Decomposition (SVD), CUR decomposition can better preserve the sparsity of the input matrix and facilitates the interpretation of the computed results. Because of these advantages, CUR decomposition is more attractive than SVD in text mining, collaborative filtering, bioinformatics, and image and video processing (Drineas et al., 2008; Xu et al., 2015; Wang and Zhang, 2012, 2013; Mahoney et al., 2008; Mackey et al., 2011).

In recent years, many variants of the CUR decomposition have been developed (Drineas et al., 2008; Wang and Zhang, 2012, 2013; Boutsidis and Woodruff, 2017; Bien et al., 2010; Anderson et al., 2015; Drineas et al., 2006; Wang et al., 2016; Song et al., 2019). Many randomized CUR algorithms adopt random sampling of the columns and rows of the data matrix. They aim to achieve $1+\varepsilon$ relative-error bounds in the Frobenius norm with a failure probability $\delta$ (Drineas et al., 2008; Wang and Zhang, 2013; Boutsidis and Woodruff, 2017), i.e., $\|A - CUR\|_F^2 \leq (1+\varepsilon)\|A - A_k\|_F^2$, where $k$ is the target rank of the approximation. These bounds are theoretically tight, but are less useful in practical applications because they require $O(k/\varepsilon)$ columns and rows. For example, Algorithm 3 of Boutsidis and Woodruff (2017) requires selecting $4k + 4820k/\varepsilon$ columns and rows, but only obtains a success probability of 0.16. If we choose $k = 10$ and $\varepsilon = 0.5$ (a very common setting), the number of selected columns and rows is close to $10^5$. For real-world data sets, it is impractical to select so many columns and rows. Anderson et al. (2015) proposed a deterministic column selection method for CUR decomposition with gap-dependent approximation bounds. These bounds allow their method to choose only $\ell = k + O(1)$ columns and rows when the given matrix has a rapidly decaying spectrum. However, their algorithm is highly time-consuming and requires $O(\ell mn(m+n))$ time, where $\ell$ is the number of selected columns and rows.

A natural question now arises: can we develop a CUR algorithm which is as efficient as previous randomized CUR algorithms while maintaining gap-dependent approximation bounds? In this paper, we address this question by presenting a novel CUR algorithm, namely Spectrum-Revealing CUR (SRCUR) decomposition. Our method borrows its core idea from a classic numerical linear algebra method, pivoted LU factorization (Trefethen and Bau III, 1997). Unlike existing CUR algorithms which sample columns and rows separately (Drineas et al., 2008; Wang and Zhang, 2013; Boutsidis and Woodruff, 2017; Anderson et al., 2015), the pivoting procedure of LU factorization selects columns and rows simultaneously and accounts for the inter-connectivity of the selected columns and rows. However, the vanilla pivoting procedure is costly and does not bound the relative error. To exploit the advantages and bypass the disadvantages of pivoted LU, we propose Fast Spectrum-Revealing LU (FSRLU) factorization, which is efficient and has theoretical guarantees. Our SRCUR algorithm first adopts FSRLU to select columns and rows and then computes the intersection matrix. The main contributions of this paper are summarized as follows:

- We propose the FSRLU algorithm, which efficiently computes a truncated LU factorization with complete pivoting. We also analyze the stability and time complexity of FSRLU.
- We propose the SRCUR algorithm based on FSRLU. We provide error analysis for SRCUR in both the spectral and Frobenius norms. The proofs show that our method has gap-dependent error bounds in terms of the squared spectral gap $\sigma_{\ell+1}/\sigma_{k+1}$. These bounds indicate that our method only needs to select $\ell = k + O(1)$ columns and rows for data matrices with rapidly decaying singular values.
- SRCUR is much faster than existing gap-dependent CUR algorithms. The time complexity of our method is $O(\mathrm{nnz}(A)\log n) + \tilde{O}(\ell^2(m+n))$, which is comparable to the state-of-the-art randomized CUR algorithms.
- We validate our method with numerical experiments on four real-world datasets, which demonstrate that our algorithm is substantially faster than existing CUR algorithms while maintaining good performance.

The remainder of the paper is organized as follows. In Section 2, we describe the notation and preliminaries used in this paper. In Section 3 we review related work. We then present our algorithms in Section 4 and give theoretical analysis in Section 5. Experimental results are reported in Section 6. Conclusions and discussions are provided in Section 7.

2 Notation and Preliminaries

We use $I_m$ to denote the $m \times m$ identity matrix, and $\mathbf{0}$ to denote a zero vector or matrix of appropriate size. For a vector $x = (x_1, \ldots, x_n)^\top$, let $\|x\|_2 = \sqrt{\sum_i x_i^2}$ be the $\ell_2$-norm of $x$. For a matrix $A = [a_{ij}] \in \mathbb{R}^{m \times n}$, let $\|A\|_2$ denote the spectral norm and $\|A\|_F$ denote the Frobenius norm. We use $\mathrm{nnz}(A)$ to denote the number of nonzero elements of $A$. Let $\det(A)$ be the determinant of $A$ and $\mathrm{adj}(A)$ be the adjugate matrix of $A$. We use $\tilde{O}$ to hide logarithmic factors.

Let $\rho$ be the rank of matrix $A$. The reduced SVD of $A$ is defined as
$$A = U \Sigma V^\top = \sum_{i=1}^{\rho} \sigma_i u_i v_i^\top,$$
where the $\sigma_i$ are the positive singular values in descending order; that is, $\sigma_i(A)$ is the $i$-th largest singular value of $A$. Let $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^\top$ denote the best rank-$k$ approximation of $A$ and $A^\dagger = V \Sigma^{-1} U^\top$ denote the Moore-Penrose pseudo-inverse of $A$. We use $\kappa(A) = \sigma_{\max}/\sigma_{\min}$ to denote the condition number of $A$, where $\sigma_{\max}$ is the maximum singular value and $\sigma_{\min}$ is the minimum nonzero singular value of $A$.

We use the Matlab colon notation for block matrices. Let $A_{i,:}$ be the $i$-th row of $A$ and $A_{:,j}$ be the $j$-th column of $A$. We use $A_{i:j,:}$ to denote $[A_{i,:}^\top, A_{i+1,:}^\top, \ldots, A_{j,:}^\top]^\top$ and $A_{:,p:q}$ to denote $[A_{:,p}, A_{:,p+1}, \ldots, A_{:,q}]$. Let $A_{i:j,p:q}$ denote the block matrix
$$\begin{pmatrix} a_{ip} & \cdots & a_{iq} \\ \vdots & \ddots & \vdots \\ a_{jp} & \cdots & a_{jq} \end{pmatrix}.$$

The Johnson-Lindenstrauss (JL) transform (Johnson and Lindenstrauss, 1984) is a powerful tool for dimensionality reduction. It has been proved to preserve vector norms within an $\varepsilon$-error ball. We state the Johnson-Lindenstrauss Lemma as follows:

Lemma 1. For any vector $x \in \mathbb{R}^d$ and $0 < \varepsilon, \delta < 1/2$, there exists a JL transform matrix $S \in \mathbb{R}^{p \times d}$ with $p = \Theta(\log(1/\delta)\varepsilon^{-2})$ which satisfies
$$(1 - \varepsilon)\|x\|_2^2 \leq \|Sx\|_2^2 \leq (1 + \varepsilon)\|x\|_2^2$$
with probability $1 - \delta$.
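Lemma 1 is easy to observe numerically. Below is a minimal sketch assuming a Gaussian sketching matrix with i.i.d. $N(0, 1/p)$ entries, one standard JL construction; the dimensions are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 10000, 300                     # ambient dimension and sketch size (illustrative)
x = rng.standard_normal(d)

# Gaussian JL transform: i.i.d. N(0, 1/p) entries, so E[||Sx||_2^2] = ||x||_2^2.
S = rng.standard_normal((p, d)) / np.sqrt(p)

ratio = np.linalg.norm(S @ x) ** 2 / np.linalg.norm(x) ** 2
print(f"||Sx||^2 / ||x||^2 = {ratio:.3f}")  # concentrates around 1 as p grows
```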

3 Related Work

In Section 3.1, we review some representative methods for CUR matrix decomposition. In Section 3.2, we introduce truncated LU factorization with complete pivoting.

3.1 CUR Matrix Decomposition

CUR decomposition approximates the data matrix with actual columns and rows. Given a matrix $A \in \mathbb{R}^{m \times n}$, CUR decomposition selects a subset of columns $C \in \mathbb{R}^{m \times c}$ and a subset of rows $R \in \mathbb{R}^{r \times n}$ and computes an intersection matrix $U = C^\dagger A R^\dagger$. Then $\hat{A} = CUR$ is a low-rank approximation of $A$. Some works compute an approximation $\hat{A}_k$ with exact rank $k$ (Boutsidis and Woodruff, 2017; Anderson et al., 2015); the rank-constrained approximation is more directly comparable to the truncated SVD.

The essence of CUR decomposition is how to select columns and rows effectively and efficiently. Traditional approaches select columns according to deterministic pivoting rules (Stewart, 1999; Chan, 1987; Gu and Eisenstat, 1996; Berry et al., 2005), but these methods do not have good approximation error guarantees. More recently, randomized algorithms for CUR decomposition aim to obtain $1+\varepsilon$ relative-error bounds in the Frobenius norm (Mahoney and Drineas, 2009; Drineas et al., 2008; Wang and Zhang, 2013; Boutsidis and Woodruff, 2017; Mahoney et al., 2011):
$$\|A - CUR\|_F^2 \leq (1+\varepsilon)\|A - A_k\|_F^2.$$
Drineas et al. (2008) randomly sampled columns and rows according to their leverage scores and achieved $1+\varepsilon$ relative-error bounds in the Frobenius norm. Wang and Zhang (2013) proposed an adaptive CUR algorithm with better theoretical results, which were further improved by Boutsidis and Woodruff (2017) and Song et al. (2019). However, these methods have to choose $O(k/\varepsilon)$ rows and columns to achieve the $1+\varepsilon$ error bound, which is usually impractical for real-world data sets.

Anderson et al. (2015) proposed unweighted column selection for CUR decomposition, which provides spectral gap error bounds of the following form:
$$\|A - CUR\|_\xi^2 \leq (1 + O(\omega^2))\|A - A_k\|_\xi^2, \quad \text{for } \xi \in \{2, F\}.$$
Here $\omega$ is a quantity that depends on the rate of spectrum decay of $A$ and the amount of oversampling $\ell - k$. Their method only needs to choose $\ell = k + O(1)$ columns and rows to obtain a good rank-$k$ approximation when the data matrix has rapid spectrum decay. However, their algorithm is very expensive and requires $O(\ell mn(m+n))$ time to obtain a rank-$k$ approximation.

There have been several investigations of ways to construct the approximation itself. Wang et al. (2016) proposed an efficient method to compute $U$ by matrix sketching. Anderson et al. (2015) proposed a numerically stable algorithm, StableCUR, to construct $\hat{A}$ and $\hat{A}_k$; it avoids computing pseudo-inverses by performing QR factorizations on $C$ and $R$. We present their algorithm in Appendix A.

Some other variants of CUR algorithms, including tensor CUR decomposition (Song et al., 2019) and CUR with $\ell_1$-norm error (Song et al., 2017), are less relevant to our work.
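As a concrete illustration of the decomposition itself (not of any particular selection rule), the following minimal sketch assembles $\hat{A} = CUR$ from uniformly sampled columns and rows with $U = C^\dagger A R^\dagger$; the sizes and the uniform sampling are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, rank = 200, 150, 10
A = rng.standard_normal((m, rank)) @ rng.standard_normal((rank, n))  # low-rank test matrix

c = r = 15                                    # number of sampled columns and rows
cols = rng.choice(n, size=c, replace=False)   # uniform sampling, for illustration only
rows = rng.choice(m, size=r, replace=False)

C, R = A[:, cols], A[rows, :]
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)  # intersection matrix U = C^+ A R^+
A_hat = C @ U @ R

print(np.linalg.norm(A - A_hat, "fro") / np.linalg.norm(A, "fro"))  # ~0 here: C, R span A
```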
3.2 LU Factorization with Complete Pivoting

LU factorization factors a matrix as the product of a unit lower triangular matrix and an upper triangular matrix, with row and/or column permutations. Partial pivoting is widely used in numerical linear algebra because it is more efficient; however, it may fail on rank-deficient matrices. Complete pivoting, though much more expensive, is necessary for such matrices to guarantee that the factors are well-defined. Recently, Melgaard and Gu (2015) proposed a randomized LU factorization with complete pivoting that leverages random projection.

Rank-revealing LU algorithms (Pan, 2000; Miranian and Gu, 2003) produce good quality approximations. They guarantee that the approximation captures the rank of the data matrix within a low-degree polynomial factor in $k$, $m$ and $n$, which could nevertheless be very large. Spectrum-revealing LU (Anderson and Gu, 2017) performs extra swaps to ensure spectral gap error bounds, but it also introduces extra computational cost.
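For reference, below is a minimal dense sketch of truncated LU with complete pivoting (our own illustration, not code from any of the cited papers). Each step scans the entire Schur complement for the largest-magnitude entry, which is exactly the per-step cost that the algorithms of Section 4 avoid.

```python
import numpy as np

def lu_complete_pivoting(A, ell):
    """Truncated LU with complete pivoting (dense reference sketch).
    Returns unit lower-trapezoidal L (m x ell), upper-trapezoidal U (ell x n),
    and pivot orderings prow, pcol with A[np.ix_(prow, pcol)] = L @ U + residual,
    where the residual is zero outside the trailing Schur complement."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    prow, pcol = np.arange(m), np.arange(n)
    for i in range(ell):
        # Full O((m-i)(n-i)) scan of the Schur complement -- the expensive step.
        r, c = np.unravel_index(np.argmax(np.abs(A[i:, i:])), (m - i, n - i))
        A[[i, i + r], :] = A[[i + r, i], :]          # row swap
        A[:, [i, i + c]] = A[:, [i + c, i]]          # column swap
        prow[[i, i + r]] = prow[[i + r, i]]
        pcol[[i, i + c]] = pcol[[i + c, i]]
        A[i + 1:, i] /= A[i, i]                                    # multipliers
        A[i + 1:, i + 1:] -= np.outer(A[i + 1:, i], A[i, i + 1:])  # Schur update
    L = np.tril(A[:, :ell], -1) + np.eye(m, ell)     # unit diagonal
    U = np.triu(A[:ell, :])
    return L, U, prow, pcol
```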

4 Our Approaches

In this section, we present our main algorithms. We first propose the EELM algorithm, which quickly estimates the element with the largest magnitude of a given matrix. We then propose the FSRLU algorithm, which efficiently computes a spectrum-revealing LU factorization. Finally, we propose our SRCUR algorithm based on FSRLU.

4.1 Estimating the Element with Largest Magnitude

Finding the element with the largest magnitude in a matrix has many applications (Higham and Relton, 2016). Computing $\|A\|_{\max}$ of $A \in \mathbb{R}^{m \times n}$ from its definition usually has $O(mn)$ cost. However, if we can only access matrix $A$ implicitly, such as through a product $A = BC$ or an inverse $A = B^{-1}$, the computational cost can be prohibitive if we need to calculate all elements of $A$. The method of Higham and Relton (2016) uses matrix-vector products to estimate the position of the element with the largest magnitude, but it does not have theoretical guarantees on the estimated value.

We propose an efficient algorithm to estimate the element with the largest magnitude, called EELM. Given a matrix $A$ and a projected matrix $R = \Omega A$ (where $\Omega \in \mathbb{R}^{p \times m}$ is a JL transform matrix), EELM first finds the column with the maximum $\ell_2$-norm in the sketched matrix $R$ and then chooses the largest element in the corresponding column of the original matrix $A$. We describe EELM in Algorithm 1.

Algorithm 1 Estimating the Element with Largest Magnitude (EELM)
Input: matrix $A \in \mathbb{R}^{m \times n}$, sketch $R \in \mathbb{R}^{p \times n}$.
Output: row index $r$, column index $c$, estimated value $x$.
1: $c = \arg\max_{1 \leq j \leq n} \|R_{:,j}\|_2$.
2: Compute the $c$-th column of $A$.
3: $r = \arg\max_{1 \leq j \leq m} |A_{j,c}|$.
4: $x = |A_{r,c}|$.

Since the JL transform preserves vector norms well, the estimated value can be bounded as in the following theorem:

Theorem 1. The estimated value $x$ given by Algorithm 1 satisfies $x \geq \sqrt{\frac{1-\varepsilon}{mn(1+\varepsilon)}}\,\|A\|_F$ with probability $1 - \delta$.

As shown in the next section, $\varepsilon$ only affects the time complexity of our algorithm but has little influence on the error bounds. Thus, in the rest of the paper we assume $\varepsilon = 0.5$ and $\delta = 0.01$ for convenience.
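A minimal NumPy sketch of Algorithm 1 follows. The column accessor is passed as a function so that $A$ may be an implicit operator; all names in the sketch are our own.

```python
import numpy as np

def eelm(get_col, R):
    """Sketch of Algorithm 1 (EELM). R = Omega @ A is a JL sketch of A;
    get_col(j) returns column j of A, so A may be an implicit operator."""
    c = int(np.argmax(np.linalg.norm(R, axis=0)))  # column with largest sketched norm
    col = get_col(c)                                # form only this one column of A
    r = int(np.argmax(np.abs(col)))                 # largest-magnitude entry in it
    return r, c, float(abs(col[r]))

# usage on an explicit matrix (get_col could instead apply, e.g., B @ C[:, j])
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 400))
Omega = rng.standard_normal((20, 500)) / np.sqrt(20)
r, c, x = eelm(lambda j: A[:, j], Omega @ A)
print(x, np.abs(A).max())  # x is the exact magnitude of one entry, bounded as in Theorem 1
```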

4.2 Fast Spectrum-Revealing LU Factorization

In this section, we describe our FSRLU algorithm, which consists of two parts: Fast Pivoting for Truncated LU factorization (FPTLU) and Fast Spectrum-Revealing Pivoting (FSRP). FPTLU computes a rough pivoting for the truncated LU, and FSRP performs extra operations on the result of FPTLU in order to bound the error.

4.2.1 Fast Pivoting for Truncated LU Factorization

Truncated LU factorization aims to compute
$$\Pi_1 A \Pi_2 = \hat{L}\hat{U} + \begin{pmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & S \end{pmatrix},$$
where $\Pi_1$ and $\Pi_2$ are permutation matrices and $S$ is the Schur complement. In each iteration, the vanilla complete pivoting of LU factorization chooses the largest-magnitude element in the Schur complement as the pivot. This procedure is very expensive. To reduce the computational cost, we adopt the EELM algorithm to estimate the pivots. We present FPTLU in Algorithm 2. Note that lines 5-7 of Algorithm 2 are the standard LU update, and that the sketched matrix $B$ can be updated efficiently as in (Anderson and Gu, 2017) (see Appendix B).

Algorithm 2 Fast Pivoting for Truncated LU Factorization (FPTLU)
Input: data matrix $A \in \mathbb{R}^{m \times n}$, target number of columns and rows $\ell$.
Output: $\hat{L} \in \mathbb{R}^{m \times \ell}$, $\hat{U} \in \mathbb{R}^{\ell \times n}$, $\Pi_1$, $\Pi_2$.
1: Generate a JL transform matrix $\Omega \in \mathbb{R}^{p_1 \times m}$; compute $B = \Omega A$, $\Pi_1 = I_m$, $\Pi_2 = I_n$.
2: for $i = 1, \ldots, \ell$ do
3:   $[r, c, x] = \mathrm{EELM}(A, B)$.
4:   Swap the $i$-th and $c$-th columns of $A$ and update $\Pi_2$; swap the $i$-th and $r$-th rows of $A$ and update $\Pi_1$.
5:   $A_{i:m,i} = A_{i:m,i} - A_{i:m,1:i-1}A_{1:i-1,i}$, $A_{i+1:m,i} = A_{i+1:m,i}/A_{i,i}$.
6:   $A_{i,i+1:n} = A_{i,i+1:n} - A_{i,1:i-1}A_{1:i-1,i+1:n}$.
7:   Update $B_{:,i+1:n}$.
8: end for
9: Let $\hat{L}$ be the lower triangular part of $A_{:,1:\ell}$ with unit diagonal.
10: Let $\hat{U}$ be the upper triangular part of $A_{1:\ell,:}$.
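The heart of FPTLU is replacing the full Schur-complement scan with a sketched scan. A minimal sketch of that single step is shown below; it is a simplification, since it forms the Schur complement explicitly and redraws $\Omega$ on each call, whereas Algorithm 2 never materializes $S$ and updates $B$ incrementally (Appendix B).

```python
import numpy as np

def estimate_pivot(S, p1=20, rng=None):
    """EELM-style pivot estimate for a Schur complement S (sketch of the idea
    in Algorithm 2). Replaces the full complete-pivoting scan of S with a
    column-norm scan of a small sketch B = Omega @ S, then forms one column."""
    rng = rng or np.random.default_rng(0)
    Omega = rng.standard_normal((p1, S.shape[0])) / np.sqrt(p1)
    B = Omega @ S                                   # p1 x (n - i) sketch
    c = int(np.argmax(np.linalg.norm(B, axis=0)))   # sketched column norms
    r = int(np.argmax(np.abs(S[:, c])))             # exact scan of a single column
    return r, c
```

Substituting `estimate_pivot` for the `argmax` scan in the complete-pivoting sketch of Section 3.2 gives a dense illustration of FPTLU's logic, though not of its complexity.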

4.2.2 Fast Spectrum-Revealing Pivoting

Given a truncated LU factorization in the following form,
$$\Pi_1 A \Pi_2 = \begin{pmatrix} L_{11} & & \\ l_{21}^\top & 1 & \\ L_{31} & & I \end{pmatrix} \begin{pmatrix} U_{11} & u_{12} & U_{13} \\ & \alpha & s_{12}^\top \\ & s_{21} & S_{22} \end{pmatrix},$$
where $\hat{L} = \begin{pmatrix} L_{11} \\ l_{21}^\top \\ L_{31} \end{pmatrix}$, $\hat{U} = \begin{pmatrix} U_{11} & u_{12} & U_{13} \end{pmatrix}$ and $S = \begin{pmatrix} \alpha & s_{12}^\top \\ s_{21} & S_{22} \end{pmatrix}$, FSRP aims to find, in each iteration, the element $\alpha$ with the largest magnitude in $S$ and the largest element $\beta$ in the matrix $|A_{11}^{-\top}|$. Here $A_{11} = \begin{pmatrix} L_{11} & \\ l_{21}^\top & 1 \end{pmatrix}\begin{pmatrix} U_{11} & u_{12} \\ & \alpha \end{pmatrix}$. Since computing $S$ is very expensive, we use EELM to estimate $\alpha$ and $\beta$ rather than compute their exact values. We present FSRP in Algorithm 3. In each iteration, the sketched matrices $R$ and $W$ only need a rank-2 update after each swap. We leverage techniques from (Gondzio, 1992) to turn $\hat{L}$ and $\hat{U}$ back into trapezoidal form after the swaps are performed. We show the details of updating $\hat{L}$ and $\hat{U}$ in Appendix C.

Algorithm 3 Fast Spectrum-Revealing Pivoting (FSRP)
Input: truncated LU factorization $\Pi_1 A \Pi_2 \approx \hat{L}\hat{U}$ and tolerance $f$.
Output: $\hat{L} \in \mathbb{R}^{m \times \ell}$, $\hat{U} \in \mathbb{R}^{\ell \times n}$, $\Pi_1$, $\Pi_2$.
1: Generate JL transform matrices $\Omega_1 \in \mathbb{R}^{p_2 \times (m-\ell)}$, $\Omega_2 \in \mathbb{R}^{p_3 \times (\ell+1)}$.
2: $R = \Omega_1 S$; $[s_r, s_c, \alpha] = \mathrm{EELM}(S, R)$.
3: Swap the $(\ell+1)$-th row and the $(\ell+s_r)$-th row of $\hat{L}$ and update $\Pi_1$ correspondingly.
4: Swap the $(\ell+1)$-th column and the $(\ell+s_c)$-th column of $\hat{U}$ and update $\Pi_2$ correspondingly.
5: $W = \Omega_2 A_{11}^{-\top}$; $[a_r, a_c, \beta] = \mathrm{EELM}(A_{11}^{-\top}, W)$.
6: while $\alpha\beta > f$ do
7:   Exchange the $a_r$-th row and the $(\ell+1)$-th row of $A$ and update $\hat{L}$.
8:   Exchange the $a_c$-th column and the $(\ell+1)$-th column of $A$ and update $\hat{U}$.
9:   Update $\Pi_1$, $\Pi_2$, $S$, $R$ and perform steps 3, 4, 5.
10:  Update $W$; let $[a_r, a_c, \beta] = \mathrm{EELM}(A_{11}^{-\top}, W)$.
11: end while

4.3 Spectrum-Revealing CUR Decomposition

The permutation matrices produced by the FSRP algorithm indicate which columns and rows we should choose for the CUR decomposition. Namely, we compute the matrices $C$ and $R$ as
$$C = \Pi_1^\top \hat{L}\hat{U}_{:,1:\ell} \quad \text{and} \quad R = \hat{L}_{1:\ell,:}\hat{U}\Pi_2^\top.$$
To make our algorithm more stable, we construct the approximation matrices with StableCUR (Anderson et al., 2015), which is shown in Appendix A. We show our SRCUR algorithm in Algorithm 4.

Algorithm 4 Spectrum-Revealing CUR Decomposition (SRCUR)
Input: data matrix $A \in \mathbb{R}^{m \times n}$, target rank $k$, target column and row number $\ell$, $f > 1$.
Output: $\hat{A} \in \mathbb{R}^{m \times n}$ and $\hat{A}_k \in \mathbb{R}^{m \times n}$.
1: Perform Algorithm 2 to compute a truncated LU factorization $\Pi_1 A \Pi_2 \approx \hat{L}\hat{U}$.
2: Perform Algorithm 3 to update $\Pi_1$, $\Pi_2$, $\hat{L}$ and $\hat{U}$.
3: Compute $C = \Pi_1^\top \hat{L}\hat{U}_{:,1:\ell}$ and $R = \hat{L}_{1:\ell,:}\hat{U}\Pi_2^\top$.
4: Perform the StableCUR algorithm to obtain $\hat{A}$ and $\hat{A}_k$.
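Since the first $\ell$ columns and rows of the residual in the truncated factorization vanish, $C$ and $R$ in step 3 coincide with the columns and rows of $A$ selected by the pivots. The minimal sketch below assembles the approximation from pivot orderings in that spirit; it uses explicit pseudo-inverses for $U$ where the paper's StableCUR (Appendix A) uses QR factorizations, and the stand-in pivots are our own illustration.

```python
import numpy as np

def cur_from_pivots(A, prow, pcol, ell):
    """Assemble a CUR approximation from LU pivot orderings (sketch).
    C and R are the columns/rows of A picked by the first ell pivots; U is
    formed with pseudo-inverses for clarity, whereas StableCUR avoids
    explicit pseudo-inverses via QR factorizations of C and R."""
    C = A[:, pcol[:ell]]
    R = A[prow[:ell], :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R

# usage with pivot orderings from any pivoted truncated LU
rng = np.random.default_rng(0)
A = rng.standard_normal((300, 12)) @ rng.standard_normal((12, 200))
prow, pcol = rng.permutation(300), rng.permutation(200)  # stand-ins for FSRLU pivots
C, U, R = cur_from_pivots(A, prow, pcol, 14)
print(np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A))
```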

5 Theoretical Analysis

In this section, we analyze the time complexity of our algorithm in Section 5.1 and provide spectral gap error bounds on the singular values, the spectral norm and the Frobenius norm in Section 5.2.

5.1 Time Complexity

It is obvious that the time cost of EELM is $O(m + pn)$. We next discuss the time complexity of FPTLU. If we choose a Gaussian random projection matrix as the JL transform matrix, we only require $O(p_1\,\mathrm{nnz}(A))$ time to initialize the matrix $B$. In each iteration, FPTLU requires $O(p_1 n)$ time to update $B$, and the standard LU updates (lines 5-7) take $O(\ell^2(m+n))$ time in total. Since $p_1 = O(\log n)$, FPTLU can compute a truncated LU factorization in $O(\mathrm{nnz}(A)\log n + \ell^2(m+n))$ time.

To analyze the complexity of FSRP, we first give the following theorem, which bounds the number of iterations.

Theorem 2. If the input matrices $\hat{L}$ and $\hat{U}$ are obtained by Algorithm 2, then Algorithm 3 performs at most $O(\ell\log(mn))$ iterations with probability 0.99.

FSRP requires $O(\mathrm{nnz}(A)\log n + \ell(m+n)\log n)$ time to initialize the matrix $R$ and $O(\ell^2\log\ell)$ time to initialize the matrix $W$. Each iteration of FSRP costs $O(pn + \ell(m+n) + \ell^2\log\ell)$. Using Theorem 2, the time complexity of FSRP is $O(\mathrm{nnz}(A)\log\ell + \ell pn\log(mn) + \ell^2(m+n)\log(mn) + \ell^3\log(mn)\log\ell)$. StableCUR requires $O(\ell^2(m+n))$ time to construct the CUR approximation. Therefore, our SRCUR algorithm needs $O(\mathrm{nnz}(A)\log n) + \tilde{O}(\ell^2(m+n))$ time in total.

5.2 Spectral Gap Error Bounds

We present the norm error bounds in Theorem 3 and the singular value bounds in Theorem 4. The detailed proofs are given in the supplementary material due to the page limit.

Theorem 3. For $\gamma = O(f\ell\sqrt{mn})$, Algorithm 4 satisfies
$$\|A - \hat{A}\|_2 \leq \|A - \hat{A}\|_F \leq \gamma\,\sigma_{\ell+1}(A), \tag{1}$$
$$\|A - \hat{A}_k\|_F^2 \leq \left(1 + \frac{2\gamma^2\,\sigma_{\ell+1}^2(A)}{\sum_{i=k+1}^{\mathrm{rank}(A)}\sigma_i^2(A)}\right)\|A - A_k\|_F^2, \tag{2}$$
$$\|A - \hat{A}_k\|_2^2 \leq \left(1 + 2\gamma^2\,\frac{\sigma_{\ell+1}^2(A)}{\sigma_{k+1}^2(A)}\right)\|A - A_k\|_2^2, \tag{3}$$
with probability 0.98.

Theorem 4. For $1 \leq j \leq \ell$ and $\gamma = O(f\ell\sqrt{mn})$, Algorithm 4 satisfies
$$\sigma_j(A) \geq \sigma_j(\hat{A}) \geq \sqrt{1 - 2\gamma^2\left(\frac{\sigma_{\ell+1}(A)}{\sigma_j(A)}\right)^2}\;\sigma_j(A)$$
with probability 0.98.

5.3 Discussion

We would like to emphasize the differences between our work and the methods that achieve $1+\varepsilon$ relative-error bounds (Drineas et al., 2008; Wang and Zhang, 2013; Boutsidis and Woodruff, 2017). First, $1+\varepsilon$ relative-error bounds only control the Frobenius norm, while we provide bounds on the spectral norm, the Frobenius norm and the singular values. In addition, our bounds are data-dependent, while $1+\varepsilon$ relative-error bounds depend on the universal constant $\varepsilon$.

Next we compare our bounds with those of the DetUCS algorithm (Anderson et al., 2015). Both our bounds and those of DetUCS are governed by the squared spectral gap $\sigma_{\ell+1}^2(A)/\sigma_{k+1}^2(A)$; thus, our bounds have the same order as those of DetUCS. However, DetUCS costs $O(\ell mn(m+n))$ time, which is much higher than ours.

Finally, we discuss the differences between our method and SRLU (Anderson and Gu, 2017). First, our method and SRLU are designed for different tasks: the goal of SRLU is to obtain a high-quality LU factorization, while we aim to compute a CUR decomposition. Also, SRLU has no time guarantee on its extra swap procedure; thus, our FSRLU is faster than SRLU while maintaining the same theoretical guarantees. SRLU does provide bounds on $\|\Pi_1 A \Pi_2 - \hat{L}M\hat{U}\|_\xi$, where $M = \hat{L}^\dagger A \hat{U}^\dagger$ and $\xi \in \{2, F\}$, but this approximation is not rank-constrained and is worse than our result.

6 Experiments

In this section, we empirically evaluate the performance of our SRCUR algorithm. We choose a Gaussian random projection matrix as the JL transform matrix in SRCUR. The baseline methods include uniform sampling (Williams and Seeger, 2001), subspace sampling (Drineas et al., 2008), near-optimal sampling (Wang and Zhang, 2013; Boutsidis and Woodruff, 2017) and deterministic unweighted column selection (Anderson et al., 2015).

In our empirical evaluation, we consider the following six measures: $\|A-\hat{A}\|_F/\|A-A_k\|_F$, $\|A-\hat{A}\|_2/\|A-A_k\|_2$, $\|A-\hat{A}_k\|_F/\|A-A_k\|_F$, $\|A-\hat{A}_k\|_2/\|A-A_k\|_2$, $\sigma_k(\hat{A})/\sigma_k(A)$, and running time. We compare the CUR algorithms on the following four datasets: the Extended Yale Face Database B (Georghiades et al., 2001), Dexter (Guyon et al., 2005), Sido¹ and Reuters². We summarize these datasets in Table 1 and show their singular value distributions in Figure 1. We choose $\ell = k+1, 2k, \ldots, 10k$ for all datasets, where $k = \lceil \min(m,n)/200 \rceil$. We report our results in Figures 2-5. All results of randomized algorithms are averaged over 5 runs. The experiments are all performed in Matlab R2017b.

Table 1: Summary of datasets for CUR matrix decomposition

  Dataset                m      n       %nnz
  Extended Yale Face B   2414   1024    97.84
  Dexter                 2600   20000   4.8
  Sido                   4932   12678   9.84
  Reuters                8293   18933   0.25

[Figure 1: Singular value distributions of the four datasets (Yale Face B, Dexter, Sido, Reuters).]

The figures show that our algorithm is much more efficient than state-of-the-art CUR algorithms. The main reason is that state-of-the-art CUR algorithms all need to perform SVD or randomized SVD, while SRCUR is SVD-free. Randomized SVD, though it has good time complexity (Clarkson and Woodruff, 2017), is much less efficient than the LU update because the constant factor of SVD is extremely large. Furthermore, the approximation errors of SRCUR are better than or comparable with those of the other algorithms. In particular, SRCUR performs better than the other methods when $\ell$ is close to $k$. This matches our theoretical analysis in Section 5.

¹http://www.causality.inf.ethz.ch/data/SIDO.html
²http://www.zjucadcg.cn/dengcai/Data/TextData.html
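For concreteness, a minimal sketch of the error measures is given below (written in Python for illustration, whereas the paper's experiments are run in Matlab; the function name is our own).

```python
import numpy as np

def eval_measures(A, A_hat, A_hat_k, k):
    """The approximation-quality measures of Section 6 (illustrative sketch)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = (U[:, :k] * s[:k]) @ Vt[:k]               # best rank-k approximation
    fro = lambda M: np.linalg.norm(M, "fro")
    spec = lambda M: np.linalg.norm(M, 2)
    return {
        "F-norm error":         fro(A - A_hat) / fro(A - A_k),
        "2-norm error":         spec(A - A_hat) / spec(A - A_k),
        "rank-k F-norm error":  fro(A - A_hat_k) / fro(A - A_k),
        "rank-k 2-norm error":  spec(A - A_hat_k) / spec(A - A_k),
        "singular value ratio": np.linalg.svd(A_hat, compute_uv=False)[k - 1] / s[k - 1],
    }
```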

7 Conclusions

In this paper, we have proposed the Spectrum-Revealing CUR decomposition algorithm. We have built a bridge between CUR decomposition and LU factorization, leveraging truncated LU factorization to select columns and rows for CUR decomposition. We have developed spectral gap error bounds on the spectral norm, the Frobenius norm and the singular values. These bounds confirm the high-quality approximations computed by our method for matrices with sufficiently fast singular value decay, a property typically observed in practical data matrices. We have also conducted experiments that empirically demonstrate the effectiveness and reliability of our algorithms on four machine learning datasets.

[Figure 2: Results of CUR decomposition algorithm comparison on the Yale Face B data matrix. Panels: (a) F-norm error, (b) 2-norm error, (c) rank-k F-norm error, (d) rank-k 2-norm error, (e) singular value ratio, (f) running time. Methods: Leverage, Uniform, NearOptimal, DetUCS, SRCUR; x-axis: ℓ from k+1 to 10k.]

[Figure 3: Results of CUR decomposition algorithm comparison on the Dexter data matrix (same panels and methods as Figure 2).]

[Figure 4: Results of CUR decomposition algorithm comparison on the Sido data matrix (same panels and methods as Figure 2).]

[Figure 5: Results of CUR decomposition algorithm comparison on the Reuters data matrix (same panels and methods as Figure 2).]

Acknowledgements

The work is supported by the National Natural Science Foundation of China (61702327, 61772333, 61632017). Ming Gu is supported by the NSF (1760316). Zhihua Zhang is supported by the National Natural Science Foundation of China (No. 11771002) and the Beijing Natural Science Foundation (Z190001). Yong Yu is the corresponding author.

References

Anderson, D. and Gu, M. (2017). An efficient, sparsity-preserving, online algorithm for low-rank approximation. In International Conference on Machine Learning, pages 156-165.

Anderson, D. G., Du, S., Mahoney, M., Melgaard, C., Wu, K., and Gu, M. (2015). Spectral gap error bounds for improving CUR matrix decomposition and the Nyström method. In AISTATS.

Berry, M. W., Pulatova, S. A., and Stewart, G. (2005). Algorithm 844: Computing sparse reduced-rank approximations to sparse matrices. ACM Transactions on Mathematical Software (TOMS), 31(2):252-269.

Bien, J., Xu, Y., and Mahoney, M. W. (2010). CUR from a sparse optimization viewpoint. In Advances in Neural Information Processing Systems, pages 217-225.

Boutsidis, C. and Woodruff, D. P. (2017). Optimal CUR matrix decompositions. SIAM Journal on Computing, 46(2):543-589.

Chan, T. F. (1987). Rank revealing QR factorizations. Linear Algebra and its Applications, 88:67-82.

Clarkson, K. L. and Woodruff, D. P. (2017). Low-rank approximation and regression in input sparsity time. Journal of the ACM (JACM), 63(6):54.

Drineas, P., Kannan, R., and Mahoney, M. W. (2006). Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36(1):184-206.

Drineas, P., Mahoney, M. W., and Muthukrishnan, S. (2008). Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844-881.

Georghiades, A. S., Belhumeur, P. N., and Kriegman, D. J. (2001). From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):643-660.

Gondzio, J. (1992). Stable algorithm for updating dense LU factorization after row or column exchange and row and column addition or deletion. Optimization, 23(1):7-26.

Gu, M. and Eisenstat, S. C. (1996). Efficient algorithms for computing a strong rank-revealing QR factorization. SIAM Journal on Scientific Computing, 17(4):848-869.

Guyon, I., Gunn, S., Ben-Hur, A., and Dror, G. (2005). Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems, pages 545-552.

Higham, N. J. and Relton, S. D. (2016). Estimating the largest elements of a matrix. SIAM Journal on Scientific Computing, 38(5):C584-C601.

Johnson, W. B. and Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189-206.

Mackey, L. W., Jordan, M. I., and Talwalkar, A. (2011). Divide-and-conquer matrix factorization. In Advances in Neural Information Processing Systems, pages 1134-1142.

Mahoney, M. W. and Drineas, P. (2009). CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697-702.

Mahoney, M. W. et al. (2011). Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123-224.

Mahoney, M. W., Maggioni, M., and Drineas, P. (2008). Tensor-CUR decompositions for tensor-based data. SIAM Journal on Matrix Analysis and Applications, 30(3):957-987.

Melgaard, C. and Gu, M. (2015). Gaussian elimination with randomized complete pivoting. arXiv preprint arXiv:1511.08528.

Miranian, L. and Gu, M. (2003). Strong rank revealing LU factorizations. Linear Algebra and its Applications, 367:1-16.

Pan, C.-T. (2000). On the existence and computation of rank-revealing LU factorizations. Linear Algebra and its Applications, 316(1-3):199-222.

Song, Z., Woodruff, D. P., and Zhong, P. (2017). Low rank approximation with entrywise ℓ1-norm error. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 688-701. ACM.

Song, Z., Woodruff, D. P., and Zhong, P. (2019). Relative error tensor low rank approximation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2772-2789. Society for Industrial and Applied Mathematics.

Stewart, G. (1999). Four algorithms for the efficient computation of truncated pivoted QR approximations to a sparse matrix. Numerische Mathematik, 83(2):313-323.

Trefethen, L. N. and Bau III, D. (1997). Numerical Linear Algebra, volume 50. SIAM.

Wang, S. and Zhang, Z. (2012). A scalable CUR matrix decomposition algorithm: Lower time complexity and tighter bound. In Advances in Neural Information Processing Systems, pages 647-655.

Wang, S. and Zhang, Z. (2013). Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. Journal of Machine Learning Research, 14(1):2729-2769.

Wang, S., Zhang, Z., and Zhang, T. (2016). Towards more efficient SPSD matrix approximation and CUR matrix decomposition. Journal of Machine Learning Research, 17(210):1-49.

Williams, C. K. and Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682-688.

Xu, M., Jin, R., and Zhou, Z.-H. (2015). CUR algorithm for partially observed matrices. In ICML, pages 1412-1421.