Fast algorithms for sparse principal component analysis based on Rayleigh quotient iteration

Volodymyr Kuleshov [email protected] Department of Computer Science, Stanford University, Stanford, CA

Abstract

We introduce new algorithms for sparse principal component analysis (sPCA), a variation of PCA which aims to represent data in a sparse low-dimensional basis. Our algorithms possess a cubic rate of convergence and can compute principal components with k non-zero elements at a cost of O(nk + k^3) flops per iteration. We observe in numerical experiments that these components are of equal or greater quality than ones obtained from current state-of-the-art techniques, but require between one and two orders of magnitude fewer flops to be computed. Conceptually, our approach generalizes the Rayleigh quotient iteration algorithm for computing eigenvectors, and can be viewed as a second-order optimization method. We demonstrate the applicability of our algorithms on several datasets, including the STL-10 machine vision dataset and gene expression data.

1. Introduction

A fundamental problem in statistics is to find simpler, low-dimensional representations for data. Such representations can help uncover previously unobserved patterns and often improve the performance of machine learning algorithms.

A basic technique for finding low-dimensional representations is principal component analysis (PCA). PCA produces a new K-dimensional coordinate system along which data exhibits the most variability. Although this technique can significantly reduce the dimensionality of the data, the new coordinate system it introduces is not easily interpretable. In other words, whereas initially each coordinate x_i of a data point x ∈ R^n corresponds to a well-understood parameter (such as the expression level of a gene), in the PCA basis the new coordinates x′_i are weighted combinations Σ_i w_i x_i of every original parameter x_i, and it is no longer easy to assign a meaning to different values of x′_i.

An effective way of ensuring that the new coordinates are interpretable is to require that each of them be a weighted combination of only a small subset of the original dimensions. In other words, we may require that the basis vectors for our new coordinate system be sparse. This technique is referred to in the literature as sparse principal component analysis (sPCA), and has been successfully applied in areas as diverse as bioinformatics (Lee et al., 2010), natural language processing (Richtárik et al., 2012), and signal processing (D'Aspremont et al., 2008).

Formally, an instance of sPCA is defined in terms of the cardinality-constrained optimization problem

    max  (1/2) x^T Σ x                    (1)
    s.t. ||x||_2 ≤ 1
         ||x||_0 ≤ k,

in which Σ ∈ R^{n×n} is a symmetric matrix (normally a positive semidefinite covariance matrix), and k > 0 is a sparsity parameter. Problem (1) yields a single sparse component x* and, for k = n, reduces to finding the leading eigenvector of Σ (i.e. to the standard formulation of PCA).

Although there exists a vast literature on sPCA, most popular algorithms are essentially variations of the same technique, called the generalized power method (GPM; see Journée et al. (2010)). This technique has been shown to match or outperform most other algorithms, including ones based on SDP formulations (D'Aspremont et al., 2007), bilinear programming (Witten et al., 2009), and greedy search (Moghaddam et al., 2006).
The GPM is a straightforward extension of the power method for computing the largest eigenvector of a matrix (Parlett, 1998) and can also be interpreted as projected gradient ascent on a variation of problem (1).

Although the GPM is a very simple and intuitive algorithm, it can be slow to converge when the covariance matrix Σ is large. This should not come as unexpected, as the GPM generalizes two rather unsophisticated algorithms. In practice, eigenvectors are computed using a technique called Rayleigh quotient iteration (normally implemented as part of the QR algorithm). Similarly, in unconstrained optimization, Newton's method converges orders of magnitude faster than gradient descent.

Here, we introduce a new algorithm for problem (1) that generalizes Rayleigh quotient iteration, and that can also be interpreted as a second-order optimization technique similar to Newton's method. It converges to a local solution of (1) in about eight iterations or less on most problem instances, and generally requires between one and two orders of magnitude fewer floating point operations (flops) than the GPM. Moreover, the solutions it finds tend to be of as good or better quality than those found by the GPM. Conceptually, our algorithm fills a gap in the family of eigenvalue algorithms and their generalizations that is left by the lack of second-order optimization methods in the sparse setting (Table 1).

Table 1. Conceptual map of eigenvalue algorithms and their generalizations. The light gray box indicates the previously unoccupied spot where our method can be placed.

Eigenvalue algorithm            | Power method              | Rayleigh quotient iteration
Equivalent optimization method  | Gradient descent          | Newton's method
Sparse generalization           | Generalized power method  | Generalized Rayleigh quotient iteration
Convergence rate                | Linear                    | Cubic

Algorithm 1 GRQI(Σ, x0, k, J, ε)
  j ← 0
  repeat
    // Compute Rayleigh quotient and working set:
    µ ← (x^(j))^T Σ x^(j) / ((x^(j))^T x^(j))
    W ← {i : x^(j)_i ≠ 0}

    // Rayleigh quotient iteration update:
    x^(j)_W ← (Σ_W − µI)^{-1} x^(j)_W
    x_new ← x^(j) / ||x^(j)||_2

    // Power method update:
    if j < J then
      x_new ← Σ x_new / ||Σ x_new||_2
    end if

    x^(j+1) ← P_k(x_new)
    j ← j + 1
  until ||x^(j) − x^(j−1)|| < ε
  return x^(j)

Algorithm 2 SPCA-GRQI(Σ, κ, J, ε, K, δ)
  for k = 1 to K do
    x0 ← largest column of Σ
    x^(k) ← GRQI(Σ, x0, κ, J, ε)
    Σ ← Σ − δ((x^(k))^T Σ x^(k)) x^(k) (x^(k))^T   // Deflation
  end for
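To make Algorithm 1 concrete, the following is a minimal NumPy sketch of GRQI (not an optimized implementation); the helper names grqi and project_l0_l2, the initial projection of the starting point, and the sign-fixing step before the convergence test are illustrative choices.

```python
import numpy as np

def project_l0_l2(x, k):
    """P_k: keep the k largest-magnitude entries of x, zero the rest,
    then rescale to unit Euclidean norm."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    norm = np.linalg.norm(out)
    return out / norm if norm > 0 else out

def grqi(Sigma, x0, k, J=np.inf, tol=1e-6, max_iter=100):
    """Sketch of Algorithm 1 (GRQI) for a symmetric matrix Sigma.
    The starting point is projected first so the working set has size k."""
    x = project_l0_l2(np.asarray(x0, dtype=float), k)
    for j in range(max_iter):
        x_prev = x
        W = np.flatnonzero(x)
        S_WW = Sigma[np.ix_(W, W)]
        # Rayleigh quotient of the current iterate (uses only the support).
        mu = x[W] @ S_WW @ x[W] / (x[W] @ x[W])
        # Rayleigh quotient iteration update on the working set: O(k^3).
        x_new = np.zeros_like(x)
        x_new[W] = np.linalg.solve(S_WW - mu * np.eye(len(W)), x[W])
        x_new /= np.linalg.norm(x_new)
        # Optional power method update along all indices; x_new has at most
        # k non-zeros, so the product costs O(nk) using only those columns.
        if j < J:
            Wn = np.flatnonzero(x_new)
            x_new = Sigma[:, Wn] @ x_new[Wn]
            x_new /= np.linalg.norm(x_new)
        # Project onto {||x||_0 <= k} intersected with the unit l2 ball.
        x = project_l0_l2(x_new, k)
        if x @ x_prev < 0:          # resolve the eigenvector sign ambiguity
            x = -x
        if np.linalg.norm(x - x_prev) < tol:
            break
    return x
```

With k non-zero entries in the iterate, the dominant per-iteration costs in this sketch are the k × k solve and the n × k product in the power step, in line with the O(nk + k^3) figure quoted in the abstract.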

2. Algorithm

We refer to our technique as generalized Rayleigh quotient iteration (GRQI). GRQI is an iterative algorithm that converges to local optima of (1) in a small number of iterations; see Algorithm 1 for a pseudocode definition.

Briefly, GRQI performs up to two updates at every iteration j: a Rayleigh quotient iteration update, followed by an optional power method update. The next iterate x^(j+1) is set to the Euclidean projection P_k(·) onto the set {||x||_0 ≤ k} ∩ {||x||_2 ≤ 1} of the result x_new of these steps.

The first update is an iteration of the standard Rayleigh quotient iteration algorithm on the set of working indices W = {i : x^(j)_i ≠ 0} of the current iterate x^(j). The indices which equal zero are left unchanged. Rayleigh quotient iteration is a method that computes eigenvectors by iteratively performing the update x_+ = (Σ − µI)^{-1} x for µ = x^T Σ x / x^T x. To get an understanding of why this technique works, observe that when µ is close to an eigenvalue λ_i of Σ, the i-th eigenvalue of (Σ − µI)^{-1} tends to infinity, and after the multiplication x_+ = (Σ − µI)^{-1} x, x_+ becomes almost parallel to the i-th eigenvector. Since µ is typically a very good estimate of an eigenvalue of Σ, this happens after only a few multiplications in practice.

The second update is simply a power method step along all indices. This step starts from where the Rayleigh quotient update ended. In addition to improving the current iterate, it can introduce large changes to the working set at every iteration, which helps the algorithm to find a good sparsity pattern quickly.

Finally, the projection P_k(x) onto the intersection of the l0 and l2 balls ensures that x^(j+1) is sparse. Note that this projection can be done by setting all but the k largest components of x (in absolute value) to zero and normalizing the resulting vector to a norm of one.

For the sake of simplicity, we defined Algorithm 1 to perform power method updates only at the first J iterations, with J ∈ [0, ∞] being a parameter. We also implemented more sophisticated ways of combining Rayleigh quotient and power method updates, but they offered only modest performance improvements over Algorithm 1, and so we focus here on a very simple but effective method.

2.1. Convergence

In all our experiments, we simply set J = ∞, and we recommend this choice of parameter for most applications. However, to formally guarantee convergence, J must take on a finite value, in which case we can establish the following proposition.

Proposition. Let Σ ∈ R^{n×n} with Σ = Σ^T, x0 ∈ R^n, k > 0, and J < ∞ be input parameters to Algorithm 1. There exists an x* with ||x*||_0 ≤ k such that the iterates (x_j)_{j=1}^∞ generated by Algorithm 1 converge to x* at a cubic rate:

    ||x_{j+1} − x*|| = O(||x_j − x*||^3).

This proposition follows from the fact that when j ≥ J, Algorithm 1 reduces to Rayleigh quotient iteration on a fixed set of k non-zero indices, and from the fact that standard Rayleigh quotient iteration converges at a cubic rate (Parlett, 1998).

2.2. Starting point

We recommend setting x0 to the largest column of Σ, as previously suggested in Journée et al. (2010). Interestingly, we also observed that we could initialize Algorithm 1 randomly, as long as the first Rayleigh quotient µ^(0) was close to the largest eigenvalue λ1 of Σ. In practice, an accurate estimate of λ1 can be obtained very quickly by performing only a few steps of power iteration (O'Leary et al., 1979).

2.3. Multiple sparse principal components

Most often, one wants to compute more than one sparse principal component of Σ. Algorithm 1 easily lends itself to this setting; the key extra step that must be added is the deflation of Σ. In order to avoid describing variation in the data that has already been explained by an earlier principal component x, we replace Σ at the next iteration by Σ_+ = Σ − (x^T Σ x) x x^T. Thus, if a new principal component x_+ explains any variance in the direction of x (i.e. if x^T x_+ ≠ 0), that variance will not count towards the new objective function x_+^T Σ_+ x_+.

In practice, there are settings where one may not want to perform a full deflation of Σ, but instead only give variance explained in the direction of x less weight. This can be done by changing the deflation step to Σ_+ = Σ − δ(x^T Σ x) x x^T, where 0 ≤ δ ≤ 1 is a parameter. For example, this partial deflation turns out to be useful when analyzing image features, as we show in Section 5. In addition, there exist several other deflation schemes which could be used with our algorithm; we refer the reader to the survey by Mackey (2009) for details. The full version of our method for computing K principal components can be found in Algorithm 2 (an illustrative code sketch of this loop is given at the end of this section).

Finally, we would like to point out that block algorithms that compute several components at once, like the ones in Journée et al. (2010), can also be derived from our work.
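The deflation loop of Algorithm 2 is then only a few lines. The sketch below reuses the grqi function from the earlier sketch, reads "largest column of Σ" as the column with the largest Euclidean norm, and uses the illustrative name spca_grqi.

```python
import numpy as np

def spca_grqi(Sigma, k, K, delta=1.0, J=np.inf, tol=1e-6):
    """Sketch of Algorithm 2: extract K sparse principal components,
    deflating Sigma (fully if delta = 1, partially if delta < 1)."""
    Sigma = np.array(Sigma, dtype=float)        # work on a copy
    components = []
    for _ in range(K):
        # Initialize from the column of Sigma with the largest norm.
        x0 = Sigma[:, np.argmax(np.linalg.norm(Sigma, axis=0))]
        x = grqi(Sigma, x0, k, J=J, tol=tol)    # grqi: sketch above
        components.append(x)
        # (Partial) deflation: Sigma <- Sigma - delta (x^T Sigma x) x x^T.
        Sigma -= delta * (x @ Sigma @ x) * np.outer(x, x)
    return np.column_stack(components)
```

Setting delta below one corresponds to the partial deflation discussed above, which is what allows more components than the rank of the data matrix to be recovered in Section 5.2.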

3. Discussion

Essentially, the GRQI algorithm modifies the power method by replacing the power iteration step by Rayleigh quotient iteration, followed by an optional step of the power method. Rayleigh quotient iteration is a simple technique that converges to an eigenvector much faster than power iteration, and is responsible for the high speed of modern eigenvalue algorithms. Perhaps our most interesting observation is that this technique dramatically improves the performance of algorithms in the sparse setting as well.

Existing algorithms for solving (1) essentially do two tasks at once: (a) identifying a good sparsity pattern, and (b) computing a good eigenvector within that pattern. Intuitively, our algorithm works well because Rayleigh quotient iteration converges to high-variance eigenvectors within the current working set of indices W much faster than the power method. Therefore, it solves task (b) much more quickly than the GPM, at the cost of some extra computation.

3.1. Rayleigh quotient iteration as Newton's method

Interestingly, the effectiveness of Rayleigh quotient iteration can also be explained by the fact that it is equivalent to a slight variation of Newton's method (Tapia & Whitley, 1988). By the same logic, one can view GRQI as a second-order projected optimization method on the objective (1).

In fact, Algorithm 1 resembles in some ways the projected Newton methods of Gafni & Bertsekas (1984). Both approaches take at every iteration a gradient step and a Newton step; in both cases, Newton steps are restricted to a subspace to avoid getting stuck at bad local optima (see Bertsekas (1982) for an example). Unfortunately, projected Newton methods can only be used with very simple constraints, unlike those in (1).

3.2. Computational complexity

Perhaps the chief concern with second-order methods is their per-step cost. Algorithm 1 performs at every iteration only O(k^3 + nk) flops, where k is the number of non-zero components. For comparison, the GPM requires O(n^2 + nk) flops (see the next section and Algorithms 3 and 4). Thus, our algorithm has an advantage when k ≪ n, which is precisely when the principal components are truly sparse.

3.3. Comparison to the generalized power method and to other sPCA algorithms

The generalized power method is a straightforward extension of the power method for computing eigenvalues; in its simplest formulation, it alternates between power iteration and projection steps until convergence. According to Journée et al. (2010), its most effective formulations are two algorithms called GPower0 and GPower1 (Algorithms 3 and 4 below). Both algorithms are equivalent to subgradient ascent on the objective function

    max_{x : ||x||_2 ≤ 1}  (1/2) x^T D^T D x − γ||x||,        (2)

where D is the data matrix, γ > 0 is a sparsity parameter, and ||·|| can be the l0 norm (in the case of GPower0) or the l1 norm (in the case of GPower1).

Our method has several immediate advantages over Algorithms 3 and 4. For one, the user can directly specify the desired cardinality of the solution, instead of having to search for a penalty parameter γ. In fact, we found that varying γ by < 1% in equation (2) sometimes led to changes in the cardinality of the solution of more than 10%. Moreover, the desired cardinality is kept constant at every iteration of our algorithm; thus one can stop iterating when a desired level of variance has been attained.

Other algorithms for sPCA are often variants of the generalized power method, including some highly cited ones, such as Zou et al. (2006) or Witten et al. (2009). Methods that are not equivalent to the GPM include the SDP relaxation approach of D'Aspremont et al. (2007); since it cannot scale to more than a few hundred variables, we do not consider it in this study. Another class of algorithms includes greedy methods that search the exponential space of sparsity patterns directly (Moghaddam et al., 2006); such algorithms have been shown to be much slower than the GPM (Journée et al., 2010), and we don't consider them either.

Algorithm 3 GPower0(D, x0, γ, ε)
  Let S0 : R^n × R → R^n be defined as (S0(a, γ))_i := a_i [sgn(a_i^2 − γ)]_+.
  j ← 0
  repeat
    y ← S0(D^T x^(j), γ) / ||S0(D^T x^(j), γ)||_2
    x^(j+1) ← D y / ||D y||_2
    j ← j + 1
  until ||x^(j) − x^(j−1)|| < ε
  return S0(D^T x^(j), γ) / ||S0(D^T x^(j), γ)||_2

Algorithm 4 GPower1(D, x0, γ, ε)
  Let S1 : R^n × R → R^n be defined as (S1(a, γ))_i := sgn(a_i)(|a_i| − γ)_+.
  j ← 0
  repeat
    y ← S1(D^T x^(j), γ) / ||S1(D^T x^(j), γ)||_2
    x^(j+1) ← D y / ||D y||_2
    j ← j + 1
  until ||x^(j) − x^(j−1)|| < ε
  return S1(D^T x^(j), γ) / ||S1(D^T x^(j), γ)||_2
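For comparison, here is an analogous NumPy sketch of the GPower1 iteration (Algorithm 4); the helper names and the guard against an all-zero thresholded vector are illustrative additions.

```python
import numpy as np

def soft_threshold(a, gamma):
    """S1(a, gamma)_i = sgn(a_i) * max(|a_i| - gamma, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - gamma, 0.0)

def gpower1(D, x0, gamma, tol=1e-6, max_iter=1000):
    """Sketch of Algorithm 4 (GPower1). D is an m x n data matrix; the
    returned vector is the sparse loading in R^n."""
    x = np.asarray(x0, dtype=float)
    x /= np.linalg.norm(x)
    for _ in range(max_iter):
        x_prev = x
        y = soft_threshold(D.T @ x, gamma)
        ny = np.linalg.norm(y)
        if ny == 0:                 # gamma too aggressive for this iterate
            break
        y /= ny
        x = D @ y
        x /= np.linalg.norm(x)
        if np.linalg.norm(x - x_prev) < tol:
            break
    z = soft_threshold(D.T @ x, gamma)
    nz = np.linalg.norm(z)
    return z / nz if nz > 0 else z
```

Each pass costs two dense matrix-vector products with D, i.e. O(mn) work, and the sparsity of the result is controlled only indirectly through γ.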
4. Evaluation

We now proceed to compare the performance of generalized Rayleigh quotient iteration to that of Algorithms 3 and 4. We evaluate the algorithms on a series of standard tasks that are modeled after the ones of Journée et al. (2010). The experiments in this section are performed on random positive semidefinite matrices Σ ∈ R^{n×n} of the form Σ = A^T A for some A ∼ N(0, 1)^{n×n}. Although we show results only for n = 1000, our observations carry over to other matrix dimensions.

Throughout this section, we use as our convergence criterion the condition ||x − x_prev|| < 10^{-6}. Note that we cannot use criteria based on the objective functions (1) and (2), as they are not comparable: one measures cardinality and one doesn't. Furthermore, we set J = ∞ in all experiments, in which case Algorithm 1 performs at every iteration a Rayleigh quotient step, followed by a full power method step.

4.1. Convergence rates

As a demonstration of the rapid convergence of generalized Rayleigh quotient iteration, we start by plotting in Figure 1 the variance, sparsity, and precision at every iteration of GRQI and GPower1. We set the sparsity parameter γ of GPower1 to 5, and set the parameter k to match the sparsity of the final solution returned by that algorithm, which was k = 44 (4.4% sparsity). We found that this setting was representative of what happens at other sparsities.

Figure 1. Comparison of generalized Rayleigh quotient iteration and the generalized power method on a random 1000 × 1000 matrix. (Panels show the convergence rate ||x − x_prev||, the variance, and the number of non-zero entries of GRQI and GPower1 as a function of the iteration number.)

We observe that GRQI converges in six iterations, which is quite typical for the problem instances we considered. This observation is consistent with the speed of regular Rayleigh quotient iteration and appears to support empirically the cubic convergence guarantees of our algorithm. Interestingly, the rapid convergence in Figure 1 was achieved without setting J < ∞, as the convergence argument formally requires.

4.2. Time complexity analysis

To further evaluate the speed of GRQI, we counted the number of flops that were required to compute a sparse component on a number of random matrices. We chose to measure algorithm speed in flops because, unlike time measurements, they do not depend on various implementation details. In fact, when implemented in a popular high-level language like MATLAB, GRQI would have an unfair advantage over the GPM, as it spends relatively more time in low-level subroutines.

To measure flops, we counted the number of matrix operations performed by each algorithm. For every k × n matrix-vector multiplication, we counted kn flops, whereas for k × k matrix inversions, we counted (1/3)k^3 + 2k^2. The latter is the complexity of inverting a symmetric matrix through an LDL^T factorization. Operations taking o(n^2) time were ignored, as their flop count was negligible.

Figure 2. Number of total flops versus sparsity of the final solution. Each curve is an average of ten 1000 × 1000 random matrices. (Curves: GPower0, GPower1, GRQI.)

Results of this experiment appear in Figure 2. GRQI uses between one and two orders of magnitude fewer flops at sparsities below 20%; at sparsities below 5%, it used a hundred times fewer flops. However, for large values of k, the complexity of GRQI outgrows that of the GPM. In this case, we found that we could bring down the number of GRQI flops to at least the level of the GPM by letting J be small; however, at those levels of k the problem is not truly sparse, and so we do not give details.

In practice, the computational requirements of the GPM would appear somewhat smaller if we used a loose convergence criterion, such as ||x − x_prev|| < 10^{-2}. However, even in that case, GRQI required at least an order of magnitude fewer flops. On the other hand, the GPM in our experiments needed about 3–4 restarts before we could find a penalty term that yielded the desired number of non-zero entries in the principal component. We did not count these restarts in our flop measurements.
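The small harness below, which assumes the grqi sketch from Section 2, reproduces the setup of these experiments on a random Σ = A^T A; it reports support size and captured variance rather than flop counts, and the dense leading eigenvalue is computed only as a reference point.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n))
Sigma = A.T @ A                      # random positive semidefinite test matrix

# Leading sparse component with an explicit cardinality (k = 44, as in 4.1).
x0 = Sigma[:, np.argmax(np.linalg.norm(Sigma, axis=0))]
x = grqi(Sigma, x0, k=44)            # grqi: sketch from Section 2

lam_max = np.linalg.eigvalsh(Sigma)[-1]      # dense leading eigenvalue
print("non-zeros:", np.count_nonzero(x))
print("variance captured:", float(x @ Sigma @ x), "of", float(lam_max))
```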

4.3. Variance

Finally, we demonstrate that the quality of the solutions to which our algorithm quickly arrives matches that of the solutions obtained by the GPM. In Figure 3, we plot the tradeoff between the sparsity and the variance of a solution for both algorithms. The two curves are essentially identical, and this observation could also be made for other matrix dimensions.

Figure 3. Variance explained versus sparsity of the leading principal component. Each curve represents an average of ten 1000 × 1000 random matrices. (Curves: GPower0, GPower1, GRQI.)

5. Sparse SVD

Thus far, our methods have been presented only in the context of positive semidefinite covariance matrices. However, they can be easily extended to handle an arbitrary rectangular m × n matrix R. In that setting, the problem we solve becomes a generalization of the singular value decomposition (SVD); hence, we refer to it as the sparse singular value decomposition (sSVD). Sparse SVD generalizes objective (1) as follows:

    max  u^T R v                              (3)
    s.t. ||u||_2 ≤ 1,   ||v||_2 ≤ 1
         ||u||_0 ≤ k_u,  ||v||_0 ≤ k_v

Problem (3) can be reduced to problem (1) by exploiting the well-known observation that

    u^T R v = (1/2) [v; u]^T [[0, R^T], [R, 0]] [v; u] =: (1/2) [v; u]^T Σ [v; u].

Thus we can solve (3) by running Algorithm 1 on the (m + n) × (m + n) matrix Σ. Fortunately, this can be done without ever explicitly forming (m + n) × (m + n) matrices. Using blockwise inversion, we compute the inverse of Σ − µI as follows:

    [[−µI, R^T], [R, −µI]]^{-1} = [[ (1/µ^2) R^T S^{-1} R − (1/µ) I ,  (1/µ) R^T S^{-1} ],
                                   [ (1/µ) S^{-1} R                 ,  S^{-1}           ]],

where S = (1/µ) R R^T − µI is the Schur complement of the upper-left block −µI. When R is a k1 × k2 submatrix with k1 ≤ k2, Σ − µI can therefore be inverted using O(k1^2 k2 + k1^3) floating point operations. Note that when the matrix R is square, this complexity reduces to that of Algorithm 1. Details may be found in Algorithm 5.

Algorithm 5 SSVD(R, u0, v0, ku, kv, J, ε)
  j ← 0
  repeat
    // Compute Rayleigh quotient and working sets:
    µ ← (u^(j))^T R v^(j) / (||u^(j)||_2 ||v^(j)||_2)
    W_u ← {i : u^(j)_i ≠ 0}
    W_v ← {i : v^(j)_i ≠ 0}

    // Rayleigh quotient step:
    A ← R_{W_u, W_v}
    [v^(j)_{W_v}; u^(j)_{W_u}] ← [[−µI, A^T], [A, −µI]]^{-1} [v^(j)_{W_v}; u^(j)_{W_u}]
    [v_new; u_new] ← [v^(j); u^(j)] / ||[v^(j); u^(j)]||_2

    // Power step:
    if j < J then
      u_new ← R v_new
      v_new ← R^T u_new
    end if

    u^(j+1) ← P_{ku}(u_new)
    v^(j+1) ← P_{kv}(v_new)
    j ← j + 1
  until max(||u^(j) − u^(j−1)||, ||v^(j) − v^(j−1)||) < ε
  return u^(j), v^(j)
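Below is a minimal NumPy sketch of Algorithm 5 that uses the blockwise formula above to avoid forming the (m + n) × (m + n) matrix. The helper names (hard_threshold, block_shift_solve, ssvd) are illustrative, and the sketch assumes a starting point for which the Rayleigh quotient µ stays away from zero (e.g. u0 taken to be the largest column of R and v0 = R^T u0).

```python
import numpy as np

def hard_threshold(x, k):
    """P_k: keep the k largest-magnitude entries and renormalize."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    n = np.linalg.norm(out)
    return out / n if n > 0 else out

def block_shift_solve(A, mu, v, u):
    """Apply [[-mu I, A^T], [A, -mu I]]^{-1} to the stacked vector [v; u]
    via the Schur complement S = (1/mu) A A^T - mu I (A is k1 x k2)."""
    S = (A @ A.T) / mu - mu * np.eye(A.shape[0])
    t = np.linalg.solve(S, A @ v / mu + u)
    return (A.T @ t - v) / mu, t            # (v part, u part)

def ssvd(R, u0, v0, ku, kv, J=np.inf, tol=1e-6, max_iter=100):
    """Sketch of Algorithm 5: sparse leading singular vectors of an m x n R."""
    u = hard_threshold(np.asarray(u0, dtype=float), ku)
    v = hard_threshold(np.asarray(v0, dtype=float), kv)
    for j in range(max_iter):
        u_prev, v_prev = u, v
        Wu, Wv = np.flatnonzero(u), np.flatnonzero(v)
        A = R[np.ix_(Wu, Wv)]
        mu = u[Wu] @ A @ v[Wv] / (np.linalg.norm(u) * np.linalg.norm(v))
        # Rayleigh quotient step on the working sets, using the block solve.
        v_new, u_new = np.zeros_like(v), np.zeros_like(u)
        v_new[Wv], u_new[Wu] = block_shift_solve(A, mu, v[Wv], u[Wu])
        scale = np.linalg.norm(np.concatenate([v_new, u_new]))
        v_new, u_new = v_new / scale, u_new / scale
        # Optional power (alternating multiplication) step.
        if j < J:
            u_new = R @ v_new
            v_new = R.T @ u_new
        u, v = hard_threshold(u_new, ku), hard_threshold(v_new, kv)
        if u @ u_prev < 0:                  # fix the sign for the convergence test
            u, v = -u, -v
        if max(np.linalg.norm(u - u_prev), np.linalg.norm(v - v_prev)) < tol:
            break
    return u, v
```

Only the small Schur complement is ever factored in block_shift_solve, which is what yields the O(k1^2 k2 + k1^3) count quoted above.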

5.1. Gene expression data

We test the performance of Algorithm 5 on gene expression data collected in the prostate cancer study by Singh et al. (2002). The study measured expression profiles for 52 tumor and 50 normal samples over 12,600 genes, resulting in a 102 × 12,600 data matrix.

Figure 4 shows the variance-sparsity tradeoffs of GRQI and GPower for the first principal component of the data matrix, as well as the number of flops they use. Both algorithms explain roughly the same variance (in fact, GRQI explains 2.35% more for small sparsities), whereas time complexity is again significantly smaller for Algorithm 1.
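As a purely hypothetical usage sketch (the stand-in data, centering, and cardinalities below are placeholders, not the preprocessing used in the study), the ssvd sketch given above could be applied to such a samples-by-genes matrix as follows.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the 102 x 12,600 samples-by-genes matrix; rows are samples,
# columns are genes. The real study data is not reproduced here.
X = rng.standard_normal((102, 12600))
X = X - X.mean(axis=0)                           # center each gene (illustrative)

u0 = X[:, np.argmax(np.linalg.norm(X, axis=0))]  # "largest column" start
v0 = X.T @ u0
u, v = ssvd(X, u0, v0, ku=20, kv=200)            # ssvd: sketch given above

print(np.count_nonzero(v), "genes enter the leading sparse component,",
      "objective value:", float(u @ X @ v))
```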

Although the relative advantage is not as large as in Figure 2, GRQI still uses about an order of magnitude fewer flops at small sparsity levels.

Figure 4. Computing the leading sparse principal component on the Singh et al. (2002) gene expression dataset: (a) variance/sparsity tradeoff; (b) total flops required at different sparsity levels.

Yao et al. (2012) have shown that the principal components of the above data matrix map to known pathways associated with cancer. Our algorithms can identify these pathways at a fraction of the computational cost of existing methods.

5.2. Application in deep learning/machine vision

As a further example of how our methods can be applied in practice, we use Algorithm 5 to perform unsupervised feature learning on the STL-10 dataset of images.

Briefly, we sampled 100,000 random 3 × 8 × 8 RGB image patches and computed sparse principal components on the resulting 196 × 100,000 matrix using a deflation parameter δ = 0.2. This partial deflation allowed us to recover 400 principal components, which is more than twice the rank of the data matrix.

Curiously, we observe that when visualized as 8 × 8 color matrices, these principal components take on the shape of little "edges", as shown in Figure 5. We found it challenging to reproduce the above results using the GPower algorithms because of the difficulties in tuning the sparsity parameter γ; therefore we do not report a speed comparison on this data.

Figure 5. Sparse components learned by sSVD on image patches (a) compared to features learned by an autoencoder neural net (b).

In the vision literature, the edges shown in Figure 5 are known as Gabor filters, and are used as a basis for representing images in many machine vision algorithms. Their use considerably improves these algorithms' performance.

In the past several years, learning such features automatically from data has been a major research topic in artificial intelligence. The fact that sparse principal component analysis can be used as a technique for unsupervised feature extraction may lead to new feature learning algorithms that are simple, fast, and that use matrix multiplications that can be parallelized over many machines.

More generally, sparse principal component analysis addresses a key challenge in statistics and thus has applications in many other areas besides bioinformatics and machine vision. We believe that developing fast, simple algorithms for this problem will spread its use to an even wider range of interesting application domains.

Acknowledgements

We wish to thank Stephen Boyd for helpful discussions and advice, as well as Ben Poole for reviewing an early draft of the paper.

References

Bertsekas, D. P. Projected Newton Methods for Optimization Problems with Simple Constraints. SIAM Journal on Control and Optimization, 20(2):221–246, 1982.

D'Aspremont, Alexandre, El Ghaoui, Laurent, Jordan, Michael I., and Lanckriet, Gert R. G. A Direct Formulation for Sparse PCA Using Semidefinite Programming. SIAM Rev., 49(3):434–448, 2007.

Tapia, R. A. and Whitley, David L. The Projected Newton Method has Order 1 + √2 for the Symmetric Eigenvalue Problem. SIAM Journal on Numerical Analysis, 25(6):1376–1382, 1988.

D'Aspremont, Alexandre, Bach, Francis, and El Ghaoui, Laurent. Optimal Solutions for Sparse Principal Component Analysis. J. Mach. Learn. Res., 9:1269–1294, 2008.

Witten, D. M., Tibshirani, R., and Hastie, T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.

Gafni, Eli M. and Bertsekas, Dimitri P. Two-Metric Projection Methods for Constrained Optimization. SIAM Journal on Control and Optimization, 22(6):936–964, 1984.

Journée, Michel, Nesterov, Yurii, Richtárik, Peter, and Sepulchre, Rodolphe. Generalized Power Method for Sparse Principal Component Analysis. J. Mach. Learn. Res., 11:517–553, 2010.

Lee, Donghwan, Lee, Woojoo, Lee, Youngjo, and Pawitan, Yudi. Super-sparse principal component analyses for high-throughput genomic data. BMC Bioinformatics, 11(1):296, 2010.

Yao, Fangzhou, Coquery, Jeff, and Le Cao, Kim-Anh. Independent Principal Component Analysis for biologically meaningful dimension reduction of large biological data sets. BMC Bioinformatics, 13(1):24, 2012.

Zou, Hui, Hastie, Trevor, and Tibshirani, Robert. Sparse Principal Component Analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006.

Mackey, Lester. Deflation Methods for Sparse PCA. In Advances in Neural Information Processing Systems, pp. 1017–1024. MIT Press, 2009.

Moghaddam, Baback, Weiss, Yair, and Avidan, Shai. Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms. In Advances in Neural Information Pro- cessing Systems, pp. 915–922. MIT Press, 2006.

O'Leary, Dianne P., Stewart, G. W., and Vandergraft, James S. Estimating the Largest Eigenvalue of a Positive Definite Matrix. Mathematics of Computation, 33(148):1289–1292, 1979.

Parlett, Beresford N. The symmetric eigenvalue prob- lem. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1998.

Richtárik, Peter, Takáč, Martin, and Ahipasaoglu, Selin Damla. Alternating Maximization: Unifying Framework for 8 Sparse PCA Formulations and Efficient Parallel Codes. CoRR, abs/1212.4137, 2012.

Singh, Dinesh, Febbo, Phillip G, Ross, Kenneth, Jack- son, Donald G, Manola, Judith, Ladd, Christine, Tamayo, Pablo, Renshaw, Andrew A, D’Amico, An- thony V, Richie, Jerome P, Lander, Eric S, Loda, Massimo, Kantoff, Philip W, Golub, Todd R, and Sellers, William R. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2): 203–209, 2002.