Fast Algorithms for Sparse Principal Component Analysis Based on Rayleigh Quotient Iteration
Volodymyr Kuleshov  [email protected]
Department of Computer Science, Stanford University, Stanford, CA

Abstract

We introduce new algorithms for sparse principal component analysis (sPCA), a variation of PCA which aims to represent data in a sparse low-dimensional basis. Our algorithms possess a cubic rate of convergence and can compute principal components with k non-zero elements at a cost of O(nk + k^3) flops per iteration. We observe in numerical experiments that these components are of equal or greater quality than ones obtained from current state-of-the-art techniques, but require between one and two orders of magnitude fewer flops to be computed. Conceptually, our approach generalizes the Rayleigh quotient iteration algorithm for computing eigenvectors, and can be viewed as a second-order optimization method. We demonstrate the applicability of our algorithms on several datasets, including the STL-10 machine vision dataset and gene expression data.

1. Introduction

A fundamental problem in statistics is to find simpler, low-dimensional representations for data. Such representations can help uncover previously unobserved patterns and often improve the performance of machine learning algorithms.

A basic technique for finding low-dimensional representations is principal component analysis (PCA). PCA produces a new K-dimensional coordinate system along which data exhibits the most variability. Although this technique can significantly reduce the dimensionality of the data, the new coordinate system it introduces is not easily interpretable. In other words, whereas initially each coordinate x_i of a data point x ∈ R^n corresponds to a well-understood parameter (such as the expression level of a gene), in the PCA basis the new coordinates x'_i are weighted combinations Σ_i w_i x_i of every original parameter x_i, and it is no longer easy to assign a meaning to different values of x'_i.

An effective way of ensuring that the new coordinates are interpretable is to require that each of them be a weighted combination of only a small subset of the original dimensions. In other words, we may require that the basis vectors for our new coordinate system be sparse. This technique is referred to in the literature as sparse principal component analysis (sPCA), and has been successfully applied in areas as diverse as bioinformatics (Lee et al., 2010), natural language processing (Richtárik et al., 2012), and signal processing (D'Aspremont et al., 2008).

Formally, an instance of sPCA is defined in terms of the cardinality-constrained optimization problem (1), in which Σ ∈ R^(n×n) is a symmetric matrix (normally a positive semidefinite covariance matrix) and k > 0 is a sparsity parameter:

    max   (1/2) x^T Σ x                                  (1)
    s.t.  ||x||_2 ≤ 1
          ||x||_0 ≤ k

Problem (1) yields a single sparse component x*, and for k = n it reduces to finding the leading eigenvector of Σ (i.e., to the standard formulation of PCA).
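To make problem (1) concrete, the NumPy snippet below (ours, not from the paper) solves it by brute force for a tiny Σ by enumerating all supports of size k, and checks the k = n special case against the leading eigenvector. The factor 1/2 in (1) does not change the maximizer, so the code compares x^T Σ x directly; all names here are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 6))
Sigma = A.T @ A / 50  # a small positive semidefinite covariance matrix

def sparse_pc_brute_force(Sigma, k):
    """Solve problem (1) exactly by enumerating every support of size k.

    Exponential in n, so only usable on tiny examples; shown purely to
    make the cardinality constraint ||x||_0 <= k concrete.
    """
    n = Sigma.shape[0]
    best_val, best_x = -np.inf, None
    for support in itertools.combinations(range(n), k):
        idx = list(support)
        # With the support fixed, (1) becomes a k-dimensional eigenproblem.
        vals, vecs = np.linalg.eigh(Sigma[np.ix_(idx, idx)])
        if vals[-1] > best_val:
            best_val = vals[-1]
            best_x = np.zeros(n)
            best_x[idx] = vecs[:, -1]
    return best_x, best_val

# For k = n, problem (1) reduces to standard PCA: the optimal value of
# x^T Sigma x equals the largest eigenvalue of Sigma.
_, val = sparse_pc_brute_force(Sigma, k=6)
assert np.isclose(val, np.linalg.eigvalsh(Sigma)[-1])
```

Enumeration is exponential in n, which is why practical sPCA methods such as the ones discussed next rely on iterative local updates instead.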
Although there exists a vast literature on sPCA, most popular algorithms are essentially variations of the same technique, called the generalized power method (GPM; see Journée et al. (2010)). This technique has been shown to match or outperform most other algorithms, including ones based on SDP formulations (D'Aspremont et al., 2007), bilinear programming (Witten et al., 2009), and greedy search (Moghaddam et al., 2006). The GPM is a straightforward extension of the power method for computing the largest eigenvector of a matrix (Parlett, 1998), and can also be interpreted as projected gradient ascent on a variation of problem (1). Pseudocode for its two standard variants, GPower0 and GPower1, is given in Algorithms 3 and 4.

Algorithm 3 GPower0(D, x0, γ, ε)
  Let S_0 : R^n × R → R^n be defined as (S_0(a, γ))_i := a_i [sgn(a_i^2 − γ)]_+
  j ← 0
  repeat
    y ← S_0(D^T x^(j), γ) / ||S_0(D^T x^(j), γ)||_2
    x^(j+1) ← D y / ||D y||_2
    j ← j + 1
  until ||x^(j) − x^(j−1)|| < ε
  return S_0(D^T x^(j), γ) / ||S_0(D^T x^(j), γ)||_2

Algorithm 4 GPower1(D, x0, γ, ε)
  Let S_1 : R^n × R → R^n be defined as (S_1(a, γ))_i := sgn(a_i) (|a_i| − γ)_+
  j ← 0
  repeat
    y ← S_1(D^T x^(j), γ) / ||S_1(D^T x^(j), γ)||_2
    x^(j+1) ← D y / ||D y||_2
    j ← j + 1
  until ||x^(j) − x^(j−1)|| < ε
  return S_1(D^T x^(j), γ) / ||S_1(D^T x^(j), γ)||_2

Although the GPM is a very simple and intuitive algorithm, it can be slow to converge when the covariance matrix Σ is large. This should not come as a surprise, as the GPM generalizes two rather unsophisticated algorithms: in practice, eigenvectors are computed using a technique called Rayleigh quotient iteration (normally implemented as part of the QR algorithm), and similarly, in unconstrained optimization, Newton's method converges orders of magnitude faster than gradient descent.

Here, we introduce a new algorithm for problem (1) that generalizes Rayleigh quotient iteration, and that can also be interpreted as a second-order optimization technique similar to Newton's method. It converges to a local solution of (1) in about eight iterations or fewer on most problem instances, and generally requires between one and two orders of magnitude fewer floating point operations (flops) than the GPM. Moreover, the solutions it finds tend to be of as good or better quality than those found by the GPM. Conceptually, our algorithm fills a gap in the family of eigenvalue algorithms and their generalizations that is left by the lack of second-order optimization methods in the sparse setting (Table 1).

  Eigenvalue algorithm            Power method               Rayleigh quotient iteration
  Equivalent optimization method  Gradient descent           Newton's method
  Sparse generalization           Generalized power method   Generalized Rayleigh quotient iteration
  Convergence rate                Linear                     Cubic

Table 1. Conceptual map of eigenvalue algorithms and their generalizations. The "Generalized Rayleigh quotient iteration" entry marks the previously unoccupied spot where our method can be placed.

2. Algorithm

We refer to our technique as generalized Rayleigh quotient iteration (GRQI). GRQI is an iterative algorithm that converges to local optima of (1) in a small number of iterations; see Algorithm 1 for a pseudocode definition.

Algorithm 1 GRQI(Σ, x0, k, J, ε)
  j ← 0
  repeat
    // Compute Rayleigh quotient and working set:
    µ ← (x^(j))^T Σ x^(j) / ((x^(j))^T x^(j))
    W ← {i : x_i^(j) ≠ 0}
    // Rayleigh quotient iteration update:
    x_W^(j) ← (Σ_W − µI)^(−1) x_W^(j)
    x_new ← x^(j) / ||x^(j)||_2
    // Power method update:
    if j < J then
      x_new ← Σ x_new / ||Σ x_new||_2
    end if
    x^(j+1) ← P_k(x_new)
    j ← j + 1
  until ||x^(j) − x^(j−1)|| < ε
  return x^(j)

Briefly, GRQI performs up to two updates at every iteration j: a Rayleigh quotient iteration update, followed by an optional power method update. The next iterate x^(j+1) is set to the Euclidean projection P_k(·) of the result x_new of these steps onto the set {||x||_0 ≤ k} ∩ {||x||_2 ≤ 1}.

The first update is an iteration of the standard Rayleigh quotient iteration algorithm on the set of working indices W = {i : x_i^(j) ≠ 0} of the current iterate x^(j); the indices that equal zero are left unchanged. Rayleigh quotient iteration is a method that computes eigenvectors by iteratively performing the update x_+ = (Σ − µI)^(−1) x for µ = x^T Σ x / x^T x. To get an understanding of why this technique works, observe that when µ is close to an eigenvalue λ_i of Σ, the i-th eigenvalue of (Σ − µI)^(−1) tends to infinity, and after the multiplication x_+ = (Σ − µI)^(−1) x, the vector x_+ becomes almost parallel to the i-th eigenvector. Since µ is typically a very good estimate of an eigenvalue of Σ, this happens after only a few multiplications in practice.

The second update is simply a power method step along all indices, starting from where the Rayleigh quotient update ended. In addition to improving the current iterate, it can introduce large changes to the working set at every iteration, which helps the algorithm find a good sparsity pattern quickly.

Finally, the projection P_k(x) onto the intersection of the l0 and l2 balls ensures that x^(j+1) is sparse. Note that this projection can be computed by setting all but the k largest components of x (in absolute value) to zero.
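The following NumPy sketch (ours, not the paper's reference implementation) follows one natural reading of Algorithm 1: `project_l0_l2` plays the role of P_k, and the restricted linear solve implements the (Σ_W − µI)^(−1) x_W update. The default J and the max_iter guard are illustrative assumptions, not values from the paper.

```python
import numpy as np

def project_l0_l2(x, k):
    """P_k: zero all but the k largest-magnitude entries of x, then pull
    the result into the unit l2 ball (rescale only if the norm exceeds 1)."""
    z = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-k:]
    z[keep] = x[keep]
    norm = np.linalg.norm(z)
    return z / norm if norm > 1 else z

def grqi(Sigma, x0, k, J=3, eps=1e-6, max_iter=50):
    """Sketch of Algorithm 1 (GRQI). Assumes Sigma is symmetric and that
    (Sigma_W - mu*I) is invertible at each step; a robust implementation
    would guard the linear solve."""
    x = project_l0_l2(np.asarray(x0, dtype=float), k)
    for j in range(max_iter):
        x_prev = x.copy()
        # Rayleigh quotient of the current iterate, and its working set.
        mu = (x @ Sigma @ x) / (x @ x)
        W = np.flatnonzero(x)
        # Rayleigh quotient iteration update on the working set only;
        # indices outside W are left unchanged.
        x_new = x.copy()
        x_new[W] = np.linalg.solve(
            Sigma[np.ix_(W, W)] - mu * np.eye(len(W)), x[W])
        x_new /= np.linalg.norm(x_new)
        # Optional power method step along all indices: improves the
        # iterate and lets new indices enter the working set.
        if j < J:
            x_new = Sigma @ x_new
            x_new /= np.linalg.norm(x_new)
        # Projection P_k keeps the iterate k-sparse.
        x = project_l0_l2(x_new, k)
        if np.linalg.norm(x - x_prev) < eps:
            break
    return x
```

Following the recommendation in Section 2.2 below, one would call this with x0 set to the largest column of Σ, e.g. grqi(Sigma, Sigma[:, np.argmax(np.linalg.norm(Sigma, axis=0))], k=10), reading "largest column" as the column of largest Euclidean norm.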
2.2. Starting point

We recommend setting x_0 to the largest column of Σ, as previously suggested in Journée et al. (2010). Interestingly, we also observed that we could initialize Algorithm 1 randomly, as long as the first Rayleigh quotient µ^(0) was close to the largest eigenvalue λ_1 of Σ. In practice, an accurate estimate of λ_1 can be obtained very quickly by performing only a few steps of power iteration (O'Leary et al., 1979).

2.3. Multiple sparse principal components

Most often, one wants to compute more than one sparse principal component of Σ. Algorithm 1 easily lends itself to this setting; the key extra step that must be added is the deflation of Σ, as in Algorithm 2.

Algorithm 2 SPCA-GRQI(Σ, κ, J, ε, K, δ)
  for k = 1 to K do
    x_0 ← largest column of Σ
    x^(k) ← GRQI(Σ, x_0, κ, J, ε)
    Σ ← Σ − δ ((x^(k))^T Σ x^(k)) x^(k) (x^(k))^T   // Deflation
  end for
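A short sketch of the deflation loop of Algorithm 2 is given below. It is again ours rather than the paper's code: it reuses the grqi function sketched in Section 2, and it reads "largest column of Σ" as the column of largest Euclidean norm.

```python
import numpy as np

def spca_grqi(Sigma, kappa, K, J=3, eps=1e-6, delta=1.0):
    """Sketch of Algorithm 2 (SPCA-GRQI): compute K sparse components of
    Sigma, each with kappa non-zeros, deflating Sigma after each one."""
    Sigma = np.array(Sigma, dtype=float)
    components = []
    for _ in range(K):
        # Starting point per Section 2.2: the largest column of Sigma.
        x0 = Sigma[:, np.argmax(np.linalg.norm(Sigma, axis=0))]
        x = grqi(Sigma, x0, kappa, J=J, eps=eps)
        components.append(x)
        # Deflation step from Algorithm 2: subtract the variance explained
        # by x, scaled by the parameter delta.
        Sigma -= delta * (x @ Sigma @ x) * np.outer(x, x)
    return np.column_stack(components)
```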