
A Selective Overview of Sparse Principal Component Analysis

This paper provides a selective overview of methodological and theoretical developments of sparse PCA, which produces principal components that are sparse, i.e., have only a few nonzero entries.

By HUI ZOU AND LINGZHOU XUE

Manuscript received January 30, 2018; revised May 28, 2018; accepted June 8, 2018. Date of current version August 2, 2018. The work of H. Zou was supported in part by the National Science Foundation (NSF) under Grant DMS-1505111. The work of L. Xue was supported by the National Science Foundation (NSF) under Grant DMS-1505256. (Corresponding author: Hui Zou.) H. Zou is with the Department of Statistics, University of Minnesota, Minneapolis, MN 55455 USA (e-mail: [email protected]). L. Xue is with the Pennsylvania State University, State College, PA 16801 USA (e-mail: [email protected]). Digital Object Identifier 10.1109/JPROC.2018.2846588

ABSTRACT | Principal component analysis (PCA) is a widely used technique for dimension reduction, data processing, and feature extraction. The three tasks are particularly useful and important in high-dimensional data analysis and statistical learning. However, the regular PCA encounters great fundamental challenges under high dimensionality and may produce "wrong" results. As a remedy, sparse PCA (SPCA) has been proposed and studied. SPCA is shown to offer a "right" solution under high dimensions. In this paper, we review methodological and theoretical developments of SPCA, as well as its applications in scientific studies.

KEYWORDS | Covariance matrices; mathematical programming; principal component analysis (PCA); statistical learning.

I. PRINCIPAL COMPONENT ANALYSIS

Principal component analysis (PCA) was invented by Pearson [45]. As a dimension reduction and feature extraction method, PCA has numerous applications in statistical learning, such as handwritten zip code classification [26], human face recognition [24], eigengenes analysis [1], gene shaving [25], and so on. It would not be exaggerating to say that PCA is one of the most widely used and most important multivariate statistical techniques.

This review article focuses on the high-dimensional extension of the regular PCA, which is often called sparse PCA (SPCA). There are several popular SPCA methods in the literature, which will be reviewed in Section II. Their formulations are different but related, because the regular PCA has several equivalent definitions from different viewing angles. These definitions are equivalent without sparsity constraints, and differ with sparsity constraints. To be self-contained, we briefly discuss the several views of PCA in the following.

From a dimension reduction perspective, PCA can be described as a set of orthogonal linear transformations of the original variables such that the transformed variables maintain the information contained in the original variables as much as possible. Specifically, let X be an n × p data matrix, where n and p are the number of observations and the number of variables, respectively. For ease of presentation, assume the column means of X are all 0. The first principal component is defined as Z_1 = \sum_{j=1}^{p} \alpha_{1j} X_j, where \alpha_1 = (\alpha_{11}, \ldots, \alpha_{1p})^T is chosen to maximize the variance of Z_1, i.e.,

\alpha_1 = \arg\max_{\alpha} \alpha^T \hat{\Sigma} \alpha \quad \text{subject to } \|\alpha\| = 1    (1)

with \hat{\Sigma} = (X^T X)/n. The rest principal components can be defined sequentially as follows:

\alpha_{k+1} = \arg\max_{\alpha} \alpha^T \hat{\Sigma} \alpha    (2)

subject to

\|\alpha\| = 1 \quad \text{and} \quad \alpha^T \alpha_l = 0, \ \forall\, 1 \le l \le k.    (3)
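To make the definitions above concrete, the following minimal sketch (assuming numpy; the data are synthetic) computes the loading vectors from an eigendecomposition of Σ̂ = X^T X/n as in (1)-(3) and checks that they agree, up to sign, with the right singular vectors of X.

```python
import numpy as np

def pca_loadings(X):
    """Loading vectors of a column-centered n x p data matrix X, obtained from
    the eigendecomposition of Sigma_hat = X^T X / n as in (1)-(3)."""
    n = X.shape[0]
    Sigma_hat = X.T @ X / n
    eigvals, eigvecs = np.linalg.eigh(Sigma_hat)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]
    V = eigvecs[:, order]                          # loading matrix (columns = alpha_k)
    Z = X @ V                                      # principal components Z_k
    return V, Z

# Synthetic check: the loadings coincide (up to sign) with the right singular
# vectors of X, since Sigma_hat = V D^2 V^T / n when X = U D V^T.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X -= X.mean(axis=0)
V, Z = pca_loadings(X)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(np.abs(V), np.abs(Vt.T)))        # True
```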


This definition implies that the first K loading vectors are the first K eigenvectors of Σ̂.

The eigendecomposition formulation of PCA also relates PCA to the singular value decomposition (SVD) of X. Let the SVD of X be

X = U D V^T

where D is a diagonal matrix with diagonal elements d_1, ..., d_p in a descending order, and U and V are n × p and p × p orthonormal matrices, respectively. Because the columns of V are the eigenvectors of Σ̂, V is the loading matrix of the principal components. By XV = UD, we see that Z_k = U_k d_k, where U_k is the kth column of U. Note that SVD can be interpreted as the best low-rank approximation to the data matrix.

PCA has another geometric interpretation, as the closest linear manifold approximation of the observed data. This definition actually matches the construction of PCA considered in [45]. Let x_i be the ith row of X. Consider the first k principal components jointly, V_k = [V_1 | ··· | V_k]. By definition, V_k is a p × k orthonormal matrix. Project each observation to the linear space spanned by {V_1, ..., V_k}. The projection operator is P_k = V_k V_k^T and the projected data are P_k x_i, 1 ≤ i ≤ n. One way to define the best projection is by minimizing the total ℓ2 approximation error

\min_{V_k} \sum_{i=1}^{n} \| x_i - V_k V_k^T x_i \|^2.    (4)

It is easy to show that the solution is exactly the first k principal components.

In applications, variables can have different scales and units. Practitioners often standardize each variable such that its marginal sample variance is one. When this practice is applied to PCA, the resulting covariance matrix of standardized variables is the sample correlation matrix of the raw variables. Note that the eigenvalues and eigenvectors of the correlation matrix can be different from those of the covariance matrix.

II. METHODS FOR SPARSE PRINCIPAL COMPONENTS

Each principal component is a linear combination of all p variables, which makes it difficult to interpret the derived principal components as new features. Rotation techniques are commonly used to help practitioners to interpret principal components [30]. Vines [51] considered simple principal components by restricting the loadings to take values from a small set of allowable integers such as 0, 1, and −1. This restriction may be useful for certain applications but not all. Simple thresholding is an ad hoc way to achieve sparse loadings by setting the loadings with absolute values smaller than a threshold to zero. Although the simple thresholding is frequently used in practice, it can be potentially misleading in various respects [9]. Sparse variants of PCA aim to achieve a good balance between variance explained (dimension reduction) and sparse loadings (interpretability).
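As a point of reference for the methods reviewed in this section, the sketch below (assuming numpy; the threshold value tau is an arbitrary illustration) implements the simple-thresholding heuristic just described. As noted above, this heuristic is ad hoc and can be misleading [9]; the methods that follow replace it with explicit sparsity-inducing formulations.

```python
import numpy as np

def threshold_loadings(v, tau):
    """Simple thresholding of a loading vector v: zero out entries with
    |v_j| < tau and renormalize the result to unit length."""
    v_sparse = np.where(np.abs(v) >= tau, v, 0.0)
    norm = np.linalg.norm(v_sparse)
    return v_sparse / norm if norm > 0 else v_sparse
```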


Sparse learning is ubiquitous in high-dimensional data analysis. Prior to the sparse principal component problem, an important question is how to select variables in high-dimensional regression. For the regression problem, the lasso proposed in [50] is a very promising technique for simultaneous variable selection and prediction. The lasso regression is an ℓ1 penalized least squares method. The use of ℓ1 penalization yields a sparse solution and also permits efficient computations.

A. SCoTLASS

Inspired by the lasso regression, Jolliffe et al. [31] proposed a procedure called SCoTLASS to obtain sparse loadings by directly imposing an ℓ1 constraint on the loading vector. SCoTLASS extends the standard PCA by taking the variance maximization perspective of the PCA. It successively maximizes the variance

a_k^T \hat{\Sigma} a_k    (5)

subject to

a_k^T a_k = 1 \quad \text{and (for } k \ge 2\text{)} \quad a_h^T a_k = 0, \ h < k    (6)

and the extra constraints

\sum_{j=1}^{p} |a_{kj}| \le t    (7)

for some tuning parameter t. However, SCoTLASS has a high computational cost, which makes it an impractical solution for high-dimensional data analysis. It motivated researchers to consider more efficient proposals for sparse principal components. A related but more efficient approach is the generalized power method presented in Section II-E.

B. Sparse Principal Component Analysis

After SCoTLASS, the first computationally efficient SPCA algorithm for high-dimensional data was introduced in [61]. Their method is named sparse principal component analysis (SPCA). Before reviewing the technical details, let us consider the application of SPCA to the pitprops data [28], a classical example showing the difficulty of interpreting principal components. The pitprops data have 180 observations and 13 measured variables. In [61], the first six ordinary principal components and the first six sparse principal components are computed. Here we only cite the results of the first three principal components in Table 1. Compared with the standard PCA, SPCA generated very sparse loading structures without losing much variance.

Table 1 Compare PCA and SPCA on the Pitprops Data. Empty Cells Mean Zero Loadings. The Variance of SPCA Is Expected to Be Smaller Than That of PCA, by the Definition of PCA. The Differences in Variance Are Small.

In the original lasso paper, Tibshirani used a quadratic programming solver to compute the lasso regression estimator, which is not very efficient for high-dimensional data. Motivated by the efficient least angle regression (LARS) algorithm [17] for computing the lasso, Zou et al. [61] proposed to tackle the sparse principal component problem from a regression formulation. The resulting algorithm is SPCA.

SPCA extends the linear manifold approximation view of the PCA to derive sparse loadings. Recall that the first principal component can be defined as

\alpha_1 = \arg\min_{\alpha} \sum_{i=1}^{n} \| x_i - \alpha \alpha^T x_i \|^2 \quad \text{subject to } \|\alpha\| = 1.    (8)

We reformulate (8) as

\arg\min_{\alpha,\beta} \sum_{i=1}^{n} \| x_i - \alpha \beta^T x_i \|^2 \quad \text{subject to } \|\alpha\| = 1 \text{ and } \alpha = \beta.    (9)

The following theorem says that we can drop the equality constraint in (9) and still recover the first loading vector exactly.

Theorem 1 [61]: For any λ0 > 0, let

(\hat{\alpha}, \hat{\beta}) = \arg\min_{\alpha,\beta} \sum_{i=1}^{n} \| x_i - \alpha \beta^T x_i \|^2 + \lambda_0 \|\beta\|^2 \quad \text{subject to } \|\alpha\| = 1.    (10)

Then \hat{\beta} \propto V_1.

In Theorem 1, the extra ℓ2 term λ0‖β‖^2 is not needed when p < n; when p > n, any λ0 > 0 should be used, and it does not affect the normalized β̂. By dropping the equality constraint α = β, we can use an alternating minimization algorithm to optimize the criterion in (10), because α and β are separated variables. With a fixed α, the optimization problem over β is a regression problem.

Based on Theorem 1, we can impose a sparse penalty on β to gain zero loadings, because the normalizing step does not change the support of β. The SPCA criterion for the first principal component is defined as

(\hat{\alpha}, \hat{\beta}) = \arg\min_{\alpha,\beta} \sum_{i=1}^{n} \| x_i - \alpha \beta^T x_i \|^2 + \lambda_0 \|\beta\|^2 + \lambda_1 \|\beta\|_1 \quad \text{subject to } \|\alpha\| = 1    (11)

and the output loading vector is V̂_1 = β̂/‖β̂‖. For n > p, we can let λ0 = 0, and solving for β with a fixed α is a lasso regression problem, which can be done efficiently. When n < p, a positive λ0 is needed, and (11) with a fixed α becomes an elastic net regression problem.

Theorem 2 [61]: For any λ0 > 0, let

(\hat{A}, \hat{B}) = \arg\min_{A,B} \sum_{i=1}^{n} \| x_i - A B^T x_i \|^2 + \lambda_0 \sum_{j=1}^{k} \|\beta_j\|^2 \quad \text{subject to } A^T A = I_{k\times k}.    (12)

Then β̂_j ∝ V_j for j = 1, 2, ..., k.

Then, the SPCA criterion for the first k sparse principal components is defined as

(\hat{A}, \hat{B}) = \arg\min_{A,B} \sum_{i=1}^{n} \| x_i - A B^T x_i \|^2 + \lambda_0 \sum_{j=1}^{k} \|\beta_j\|^2 + \sum_{j=1}^{k} \lambda_{1,j} \|\beta_j\|_1 \quad \text{subject to } A^T A = I_{k\times k}    (13)

where different λ_{1,j}'s are allowed for penalizing the loadings of different principal components.

Zou et al. [61] proposed an alternating algorithm to minimize the SPCA criterion (13). For each j, let Y_j = X α_j. It is shown that B̂ = [β̂_1, ..., β̂_k] and each β̂_j is obtained via

\hat{\beta}_j = \arg\min_{\beta_j} \| Y_j - X \beta_j \|^2 + \lambda_0 \|\beta_j\|^2 + \lambda_{1,j} \|\beta_j\|_1.    (14)

One can use either the LARS-EN algorithm [60] or the cyclic coordinate descent algorithm [21] to solve (14). Both algorithms are efficient for high-dimensional data. If B is fixed, the optimization problem of A is

\arg\min_{A} \sum_{i=1}^{n} \| x_i - A B^T x_i \|^2 = \arg\min_{A} \| X - X B A^T \|_F^2 \quad \text{subject to } A^T A = I_{k\times k}.

This is called a reduced rank Procrustes rotation problem in [61] because when k = p, it is the Procrustes rotation problem [41]. Zou et al. [61] derived an explicit solution to the reduced rank Procrustes rotation problem. We compute the SVD

(X^T X) B = U D V^T    (15)

and set Â = U V^T.

The SPCA algorithm iterates between the elastic net regression step and the SVD step till convergence. The output is the normalized B matrix: V̂_j = β̂_j/‖β̂_j‖, 1 ≤ j ≤ k.

Zou et al. [61] derived another SPCA criterion to further speed up the computation. The derivation is based on the observation that Theorem 2 is valid for all λ0 > 0. It turns out that a thrifty solution emerges if λ0 is taken to be a large constant.

Theorem 3 [61]: Let V_j(λ0) = β̂_j/‖β̂_j‖ (j = 1, ..., k) be the sparse loadings defined in (13). Let (Â*, B̂*) be the solution of the optimization problem

(\hat{A}^*, \hat{B}^*) = \arg\min_{A,B} -2\,\mathrm{Tr}\left( A^T X^T X B \right) + \sum_{j=1}^{k} \|\beta_j\|^2 + \sum_{j=1}^{k} \lambda_{1,j} \|\beta_j\|_1 \quad \text{subject to } A^T A = I_{k\times k}.    (16)

When λ0 → ∞, V_j(λ0) → β̂_j^*/‖β̂_j^*‖.

Solving (16) can also be done via an alternating minimization algorithm. Given A, we have that for each j

\hat{\beta}_j = \arg\min_{\beta_j} -2 \alpha_j^T (X^T X) \beta_j + \|\beta_j\|^2 + \lambda_{1,j} \|\beta_j\|_1    (17)

and the solution is given by

\hat{\beta}_j = S\left( X^T X \alpha_j, \tfrac{\lambda_{1,j}}{2} \right)

where S(Z, γ) is the soft-thresholding operator on a vector Z = (z_1, ..., z_p) with thresholding parameter γ and

S(Z, \gamma)_j = (|z_j| - \gamma)_+ \,\mathrm{sgn}(z_j), \quad 1 \le j \le p.

Given B, the solution of A is again Â = U V^T, where U, V are from the SVD of (X^T X)B: (X^T X)B = U D V^T.
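The following sketch (assuming numpy; penalty values, iteration counts, and the stopping rule are arbitrary choices) implements the alternating scheme just described in its "thrifty" form: the β-step soft-thresholds X^T X α_j as in (17), and the A-step solves the reduced-rank Procrustes problem via the SVD of (X^T X)B. Zou et al. [61] use the same two updates; here the elastic net step is replaced by its large-λ0 closed form (17) to keep the sketch short.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Componentwise soft-thresholding S(z, gamma)_j = (|z_j| - gamma)_+ sgn(z_j)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def spca(X, k, lambda1, n_iter=200, tol=1e-7):
    """Sketch of the SPCA alternating algorithm based on criterion (16).
    lambda1 is a length-k sequence of l1 penalties (lambda_{1,j})."""
    XtX = X.T @ X
    # initialize A with the ordinary PCA loadings (right singular vectors of X)
    A = np.linalg.svd(X, full_matrices=False)[2][:k].T
    B = np.zeros_like(A)
    for _ in range(n_iter):
        B_old = B.copy()
        # beta-step: solve (17) by soft-thresholding X^T X alpha_j
        for j in range(k):
            B[:, j] = soft_threshold(XtX @ A[:, j], lambda1[j] / 2.0)
        # A-step: reduced-rank Procrustes rotation, A = U V^T from SVD of (X^T X) B
        U, _, Vt = np.linalg.svd(XtX @ B, full_matrices=False)
        A = U @ Vt
        if np.max(np.abs(B - B_old)) < tol:
            break
    # normalize the nonzero columns of B to obtain the sparse loadings V_hat_j
    norms = np.linalg.norm(B, axis=0)
    return B / np.where(norms > 0, norms, 1.0)
```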


C. A Semidefinite Programming Approach

We introduce some necessary notation first. We use Card(M) to denote the number of nonzero elements of M, where M can be a vector or a matrix. The notation |M| means that we replace each element of M with its absolute value. Let 1_p be the p-vector of 1s.

Consider the first k-sparse principal component with at most k nonzero loadings. A natural definition of the optimal k-sparse loading vector is

\arg\max_{\alpha} \alpha^T \hat{\Sigma} \alpha \quad \text{subject to } \|\alpha\| = 1, \ \mathrm{Card}(\alpha) \le k.    (18)

When k = p, the above definition gives the loadings of the first principal component. However, (18) is nonconvex and computationally difficult, especially when p is large. Convex relaxation is a standard technique used in operational research to handle difficult nonconvex problems. d'Aspremont et al. [15] developed a convex relaxation of (18), which is expressed as a semidefinite programming problem.

Let P = αα^T. We write α^T Σ̂ α = Tr(Σ̂ P). The norm-1 constraint on α leads to a linear equality constraint on P: Tr P = 1. Moreover, the cardinality constraint ‖α‖_0 ≤ k implies Card(P) ≤ k^2. Hence, we consider the following optimization problem of P:

\arg\max_{P} \mathrm{Tr}(\hat{\Sigma} P) \quad \text{subject to } \mathrm{Tr}\,P = 1, \ \mathrm{Card}(P) \le k^2, \ P \succeq 0 \ \text{and} \ \mathrm{rank}(P) = 1.    (19)

The above formulation in (19) is still nonconvex and difficult to handle due to the cardinality constraint and the rank-1 constraint. By definition, P is symmetric and P^2 = P. Observe that

\|P\|_F^2 = \mathrm{Tr}(P^T P) = \mathrm{Tr}(P) = 1.

By Cauchy–Schwarz

1_p^T |P| 1_p \le \sqrt{\mathrm{Card}(P)}\, \|P\|_F \le k.

Therefore, d'Aspremont et al. [15] suggested relaxing the cardinality constraint in (19) to a linear inequality constraint 1_p^T |P| 1_p ≤ k. Furthermore, they dropped the rank-1 constraint and ended up with the DSPCA formulation

\arg\max_{P} \mathrm{Tr}(\hat{\Sigma} P) \quad \text{subject to } \mathrm{Tr}\,P = 1, \ 1_p^T |P| 1_p \le k, \ P \succeq 0.    (20)

The above is recognized as a semidefinite programming problem and can be solved by software such as SDPT3. DSPCA only solves for P but not α. To compute the loading vector α, d'Aspremont et al. [15] recommended truncating P and retaining only the dominant (sparse) eigenvector of P. For the second sparse principal component, it is suggested to replace Σ̂ with Σ̂ − (α^T Σ̂ α) αα^T in (20). The same procedure can be repeated to compute the rest of the sparse principal components.
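As an illustration, the relaxation (20) can be handed directly to a generic convex solver. The sketch below assumes the cvxpy package is available and that Σ̂ fits in memory as a dense matrix; it is only a small-scale illustration, not one of the specialized solvers discussed next.

```python
import cvxpy as cp
import numpy as np

def dspca_first_loading(Sigma_hat, k):
    """Solve the DSPCA semidefinite relaxation (20) and return a loading vector
    extracted as the dominant eigenvector of the optimal P."""
    p = Sigma_hat.shape[0]
    P = cp.Variable((p, p), symmetric=True)
    constraints = [P >> 0,                       # P positive semidefinite
                   cp.trace(P) == 1,             # Tr P = 1
                   cp.sum(cp.abs(P)) <= k]       # 1_p^T |P| 1_p <= k
    problem = cp.Problem(cp.Maximize(cp.trace(Sigma_hat @ P)), constraints)
    problem.solve()
    eigvals, eigvecs = np.linalg.eigh(P.value)
    return eigvecs[:, -1]                        # dominant (approximately sparse) eigenvector
```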


For larger problems, d'Aspremont et al. [15] discussed a Nesterov's smooth minimization technique to handle DSPCA. The computational complexity of the algorithm is O(p^4 \sqrt{\log p}/\epsilon), where ε is the numerical accuracy of the solution. d'Aspremont et al. [14] discussed a greedy algorithm to speed up the computation. An alternating direction method of multipliers was proposed in [39]. The DSPCA formulation generated much interest in the operational research and machine learning communities. Some follow-up works include [38], [53], and [13], among others.

D. Iterative Thresholding Methods

PCA can be done via the SVD of the data matrix. Thus, it is natural to consider an SPCA algorithm based on the SVD of X. This idea was explored in [48] and [56].

Let the SVD of X be X = U D V^T. Consider the first principal component. We know the loading vector is V_1, the first column of V. It is a well-known result that the SVD of X is related to the best lower rank approximation of X [16]. Specifically, let Ũ be a norm-1 n-vector and Ṽ be a p-vector. Consider Ũ Ṽ^T as a rank-1 approximation of X. The best rank-1 approximation is defined as

\min_{\tilde{U}, \tilde{V}} \| X - \tilde{U} \tilde{V}^T \|_F^2 \quad \text{subject to } \|\tilde{U}\| = 1    (21)

and the solution is Ũ = U_1 and Ṽ = d_1 V_1, where d_1 is the first singular value.

Based on (21), Shen and Huang [48] proposed the following optimization problem:

(\hat{U}, \hat{V}) = \arg\min_{U,V} \| X - U V^T \|_F^2 + \lambda \|V\|_1 \quad \text{subject to } \|U\| = 1    (22)

and the sparse loading vector is the normalized V̂, i.e., V̂/‖V̂‖. An alternating minimization algorithm is used to solve (22). Note that given V, the optimal U is U = XV/‖XV‖. Given U, the optimal V is

\arg\min_{V} -2\,\mathrm{Tr}(X^T U V^T) + \|V\|^2 + \lambda \|V\|_1

and the solution is given by the soft-thresholding operator

V = S\left( X^T U, \tfrac{\lambda}{2} \right).

Thus, Shen and Huang's method is an iterative thresholding algorithm. Note that the above procedure is similar in spirit to the SPCA algorithm in (16). The big difference is that SPCA solves k components simultaneously, but Shen and Huang's method only deals with one component at a time.

Shen and Huang [48] proposed to sequentially compute the rest of the sparse principal components. Suppose that we have computed the first k (U, V) pairs; then let X^{(k+1)} = X − \sum_{l=1}^{k} U_l V_l^T, and the iterative thresholding algorithm is applied to X^{(k+1)} to get (U^{(k+1)}, V^{(k+1)}). The normalized V^{(k+1)} is the loading vector of the (k+1)th sparse principal component. The λ parameter is allowed to differ for different principal components.

In the same vein, Witten et al. [56] proposed a penalized matrix decomposition (PMD) criterion as follows:

(\hat{U}, \hat{V}, \hat{d}) = \arg\min_{U,V,d} \| X - d U V^T \|_F^2 \quad \text{subject to } \|U\| = 1, \ \|U\|_1 \le c_1, \ \|V\| = 1, \ \|V\|_1 \le c_2.    (23)

By straightforward calculation, it can be shown that (23) is equivalent to the following optimization problem:

(\hat{U}, \hat{V}) = \arg\max_{U,V} U^T X V \quad \text{subject to } \|U\| = 1, \ \|U\|_1 \le c_1, \ \|V\| = 1, \ \|V\|_1 \le c_2    (24)

and d̂ = Û^T X V̂.

They also used an alternating minimization algorithm to compute (24). Given V, we update U by solving

\max_{U} U^T X V \quad \text{subject to } \|U\| = 1, \ \|U\|_1 \le c_1.    (25)

Given U, we update V by solving

\max_{V} U^T X V \quad \text{subject to } \|V\| = 1, \ \|V\|_1 \le c_2.    (26)

The equality constraints ‖U‖ = 1 and ‖V‖ = 1 in (25) and (26) can be replaced with the inequality constraints ‖U‖ ≤ 1 and ‖V‖ ≤ 1, and the solutions remain the same. So, (25) and (26) are examples of the following convex optimization problem:

\hat{Z} = \arg\max_{Z} Z^T R \quad \text{subject to } \|Z\| \le 1, \ \|Z\|_1 \le c.    (27)

It is easy to see that the solution to (27) is

\hat{Z} = \frac{S(R, \Delta_c)}{\| S(R, \Delta_c) \|}

where S is the soft-thresholding operator and Δ_c is selected as follows: Δ_c = 0 if ‖(R/‖R‖)‖_1 ≤ c; otherwise Δ_c > 0 is chosen to satisfy ‖Ẑ‖_1 = c.
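A minimal sketch of the rank-1 iterative thresholding scheme (22) is given below (assuming numpy; λ and the stopping rule are arbitrary choices). Subsequent components are obtained by deflating X as described above; PMD differs mainly in also constraining ‖U‖_1.

```python
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def rank_one_sparse_loading(X, lam, n_iter=500, tol=1e-8):
    """Alternate the two closed-form updates of (22):
    U = X V / ||X V|| and V = S(X^T U, lam / 2)."""
    U0, s, Vt = np.linalg.svd(X, full_matrices=False)
    V = s[0] * Vt[0]                              # start from the best rank-1 fit
    for _ in range(n_iter):
        V_old = V.copy()
        U = X @ V
        U /= np.linalg.norm(U)
        V = soft_threshold(X.T @ U, lam / 2.0)
        if np.linalg.norm(V) == 0:                # lam too large: everything thresholded
            break
        if np.linalg.norm(V - V_old) < tol:
            break
    norm = np.linalg.norm(V)
    return V / norm if norm > 0 else V            # normalized sparse loading vector
```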


E. A Generalized Power Method

Consider the first principal component. By the variance maximization definition, a direct formulation of the ℓ1-constrained sparse principal component is

\arg\max_{\|\alpha\|=1} \alpha^T X^T X \alpha \quad \text{subject to } \|\alpha\|_1 \le t.    (28)

Equivalently, we can solve

\arg\max_{\|\alpha\|=1} \sqrt{\alpha^T X^T X \alpha} \quad \text{subject to } \|\alpha\|_1 \le t.    (29)

Journée et al. [32] considered the Lagrangian form of (29)

\arg\max_{\|\alpha\|=1} \sqrt{\alpha^T X^T X \alpha} - \lambda \|\alpha\|_1.    (30)

They offered a generalized power method for solving (30). Their idea takes advantage of this simple observation: let Ũ = arg max_{‖U‖=1} U^T Z; then Ũ = Z/‖Z‖ and Ũ^T Z = ‖Z‖. Thus, an equivalent formulation of (30) is

(U^*, \alpha^*) = \arg\max_{U,\alpha} U^T X \alpha - \lambda \|\alpha\|_1 \quad \text{subject to } \|U\| = 1, \ \|\alpha\| = 1.    (31)

Note that the formulation (31) is the Lagrangian form of the PMD formulation (24) without imposing the ℓ1 constraint on U.

For any U, the optimal α and X^T U must share the same sign for each component. Let z_j = |α_j| and Z = |α|. Then, the optimal Z^* must satisfy

Z^* = \arg\max_{Z} \sum_{j=1}^{p} (|X^T U|_j - \lambda) z_j \quad \text{subject to } z_j \ge 0, \ \sum_{j=1}^{p} z_j^2 = 1.    (32)

When |X^T U|_j − λ ≤ 0, z_j^* = 0. By Cauchy–Schwarz, it is easy to see that the solution to (32) is

z_j^* = \frac{(|X^T U|_j - \lambda)_+}{\sqrt{\sum_{j=1}^{p} (|X^T U|_j - \lambda)_+^2}}    (33)

which yields

\alpha = \frac{S(X^T U, \lambda)}{\| S(X^T U, \lambda) \|}    (34)

where S is the soft-thresholding operator. Plugging (33) back into the objective function in (31), we obtain a new optimization criterion of U

U^* = \arg\max_{U: \|U\|=1} \sqrt{\sum_{j=1}^{p} (|X^T U|_j - \lambda)_+^2}

or equivalently

U^* = \arg\max_{U: \|U\|\le 1} \sum_{j=1}^{p} (|X^T U|_j - \lambda)_+^2.    (35)

Once U^* is solved, we have

\alpha = \frac{S(X^T U^*, \lambda)}{\| S(X^T U^*, \lambda) \|}.

Solving for U^* is an n-dimensional optimization problem, although the original formulation (31) is a p-dimensional optimization problem. When p ≫ n, the generalized power method achieves great computational savings. Moreover, the objective function in (35) is differentiable and convex, and the constraint set is compact and convex. Journée et al. [32] used an efficient gradient method to compute U^* and analyzed its convergence property. They also showed that the generalized power method can be extended to handle the first k principal components jointly.

There are other proposals for constructing sparse principal components, such as the truncated power method in [58] and the exact and greedy algorithms in [42].
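The closed-form expressions (31)-(34) also suggest a simple alternating scheme, sketched below under the assumption that numpy is available. Journée et al. [32] instead run a gradient method on the n-dimensional problem (35); the sketch is only an illustrative variant that alternates the two exact block updates of (31).

```python
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def generalized_power_sparse_loading(X, lam, n_iter=500, tol=1e-8):
    """Block-coordinate ascent on (31): U = X alpha / ||X alpha|| and
    alpha = S(X^T U, lam) / ||S(X^T U, lam)|| as in (34)."""
    alpha = np.linalg.svd(X, full_matrices=False)[2][0]   # ordinary first loading
    for _ in range(n_iter):
        alpha_old = alpha.copy()
        U = X @ alpha
        U /= np.linalg.norm(U)
        a = soft_threshold(X.T @ U, lam)
        if np.linalg.norm(a) == 0:
            return a                      # lam too large: all loadings set to zero
        alpha = a / np.linalg.norm(a)
        if np.linalg.norm(alpha - alpha_old) < tol:
            break
    return alpha
```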


III. THEORETICAL RESULTS

Theoretical analysis of SPCA received considerable attention in the past decade. In what follows, we first discuss the inconsistency of the classical PCA in the high-dimensional setting, and then present recent theoretical developments of SPCA.

A. Inconsistency of PCA Under High Dimensions

Statistical analysis of PCA views Σ̂ as the empirical covariance matrix, and there is the population PCA on the true covariance matrix Σ. In the conventional setting where the dimension is fixed and the sample size increases, the principal eigenvectors of the sample covariance matrix are consistent estimates of the principal eigenvectors of the corresponding population covariance matrix [3].

However, the sample principal eigenvectors are inconsistent estimates of the corresponding population principal eigenvectors in the high-dimensional setting where the dimension is no longer fixed and may be much larger than the sample size. The inconsistency phenomenon was first observed in the unsupervised learning theory literature in physics (for example, [7] and [55]). About a decade ago, a series of papers in the statistics literature (for example, [5], [29], [43], [44], and [33]) investigated the inconsistency results of the classical PCA when estimating the leading principal eigenvectors in the high-dimensional setting. Baik and Silverstein [5], Paul [44], and Nadler [43] showed that when lim_{n→∞} p/n = γ ∈ (0, 1), if the largest eigenvalue λ_1 is of unit multiplicity and λ_1 ≤ √γ, the leading sample principal eigenvector v̂_1 is asymptotically orthogonal to the leading population principal eigenvector v_1 almost surely, that is,

P\left( \lim_{n\to\infty} |v_1^T \hat{v}_1| = 0 \right) = 1.

Johnstone and Lu [29] considered the rank-1 case and gave the sufficient and necessary condition for the consistent estimation of the leading population principal eigenvector. Let R(v̂_1, v_1) = cos α(v̂_1, v_1) be the cosine of the angle between v̂_1 and v_1, and let ω = lim_{n→∞} ‖v_1‖^2/σ^2 be the limiting signal-to-noise ratio. Johnstone and Lu [29] proved that

P\left( \lim_{n\to\infty} R^2(\hat{v}_1, v_1) = R_\infty^2(\omega, c) \right) = 1

where c = lim_{n→∞} p/n and

R_\infty^2(\omega, c) = \frac{(\omega^2 - c)_+}{\omega^2 + c\,\omega}.

Note that R_∞^2(ω, c) < 1 if and only if c > 0. Thus, v̂_1 is a consistent estimate of v_1 if and only if c = 0, which implies the inconsistency of the classical PCA in the high-dimensional setting. Jung and Marron [33] further studied the strong inconsistency of the leading sample principal eigenvector in the high dimension and low sample size context where the sample size is fixed and the dimension increases.

These inconsistency results call for new formulations of principal components that are consistent estimators of the population principal components under high dimensions.
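A quick Monte Carlo illustration of this phenomenon is sketched below (assuming numpy; the single-spike setup, signal strength, and dimensions are arbitrary choices). With the signal-to-noise ratio held fixed, the overlap |v_1^T v̂_1| visibly deteriorates as p/n grows, in line with the limit R_∞^2(ω, c).

```python
import numpy as np

rng = np.random.default_rng(1)

def leading_eigenvector_overlap(n, p, snr=4.0, sigma=1.0):
    """Single-spike model x_i = nu_i * rho + sigma * z_i with ||rho||^2 / sigma^2 = snr;
    return |<v1, v1_hat>| for the leading sample eigenvector."""
    rho = np.zeros(p)
    rho[:10] = 1.0
    rho *= np.sqrt(snr) * sigma / np.linalg.norm(rho)
    nu = rng.normal(size=(n, 1))
    X = nu * rho + sigma * rng.normal(size=(n, p))
    v1_hat = np.linalg.eigh(X.T @ X / n)[1][:, -1]
    return abs(rho @ v1_hat) / np.linalg.norm(rho)

for p in [50, 500, 2000]:                 # c = p/n = 0.25, 2.5, 10
    print(p, round(leading_eigenvector_overlap(n=200, p=p), 3))
```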

B. Consistency of SPCA

In recent years, there have been a series of papers to develop the theoretical properties of the SPCA in the statistics literature. The consistency results are established for various regularized estimators of the leading eigenvectors. Under the rank-1 scenario with n^{-1} log(n ∨ p) → 0 as n → ∞, Johnstone and Lu [29] established a consistency result for the classical PCA performed on a selected subset of variables satisfying σ̂^2 ≥ σ^2(1 + α_n), where α_n = α(n^{-1} log(n ∨ p))^{1/2}. Specifically, Johnstone and Lu [29] proved that the estimated principal eigenvector v̂_1^I obtained via the subset selection rule is consistent

P\left( \lim_{n\to\infty} \alpha(\hat{v}_1^I, v_1) = 0 \right) = 1

when the magnitudes of the ordered coefficients of v_1 have rapid decay, i.e., the r-th largest magnitude of v_1 is no greater than C r^{-1/q}, r = 1, 2, ..., for some 0 < q < 2 and a constant C > 0. Shen et al. [47] studied the consistency of SPCA in the high dimension and low sample size context. Amini and Wainwright [2] studied the support recovery property of the semidefinite programming approach of [15] under the k-sparse assumption for the leading eigenvector in the rank-1 spiked covariance model. Ma [40] proved the consistency of the iterative thresholding approach under a spiked covariance model. Lei and Vu [37] provided general sufficient conditions for sparsistency for the Fantope projection and selection method. In a very recent paper, Jankova and van de Geer [27] proposed a debiased SPCA estimator and studied the asymptotic inference of the sparse eigenvectors.

C. Minimax Rates of Convergence

The minimax rate of estimation is another important theoretical development for the SPCA. The seminal paper by Birnbaum et al. [8] studied the minimax rates of convergence and adaptive estimation when the rank is a fixed number and the ordered coefficients of each principal eigenvector have rapid decay. Specifically, Birnbaum et al. [8] established a lower bound on the minimax risk of estimators under various models of sparsity for the population eigenvectors. Ma [40] showed that the iterative thresholding estimator attains the minimax rate of convergence over a certain Gaussian class of distributions when the rank is treated as a fixed constant. By allowing the rank to increase with the sample size, Cai et al. [10] and Vu and Lei [52] studied the minimax optimality and adaptive estimation of the principal subspace for the SPCA in the high-dimensional setting.

Following [10], we assume that the n × p data matrix X is generated as follows:

X = U D V^T + Z

where U is the n × k random effects matrix with independent identically distributed (i.i.d.) N(0, 1) entries, D = diag(λ_1^{1/2}, ..., λ_k^{1/2}) is a diagonal matrix with ordered eigenvalues λ_1 ≥ ··· ≥ λ_k > 0, V is an orthonormal matrix, Z is a random matrix with i.i.d. N(0, σ^2) entries, and U and Z are independent. Denote by Σ the covariance matrix of X. Note that Σ = V Λ V^T + σ^2 I_p and also that the estimation of span(V) is equivalent to the estimation of V V^T. Now, we consider the optimal estimation of the principal subspace span(V) under the commonly used loss function L(V, V̂) = ‖V V^T − V̂ V̂^T‖_F^2 and the following parameter space for Σ:

\Theta(s, p, k, \lambda) = \left\{ \Sigma = V \Lambda V^T + \sigma^2 I_p : \kappa\lambda \ge \lambda_1 \ge \cdots \ge \lambda_k \ge \lambda > 0, \ V^T V = I_k, \ \|V\|_w \le s \right\}

where κ > 1, Λ = diag(λ_1, ..., λ_k), and ‖V‖_w = max_{j=1,···,p} ‖V_{(j)*}‖_0 is the weak ℓ0 radius of V. Note that

\inf_{\hat{V}} \sup_{\Sigma \in \Theta(s,p,k,\lambda)} E\, L(V, \hat{V}) \asymp \frac{\lambda/\sigma^2 + 1}{n (\lambda/\sigma^2)^2} \left[ k(s - k) + s \log \frac{ep}{s} \right] \wedge 1.

Cai et al. [11] studied the minimax rates under the spectral norm, which is directly related to estimating the rank of the factor model.
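To make the setup above concrete, the sketch below (assuming numpy; all parameter values are arbitrary) draws data from the spiked model X = UDV^T + Z with a row-sparse V and evaluates the subspace loss L(V, V̂) for the ordinary PCA estimate of span(V); a sparse PCA estimate of V would be plugged into the same loss when comparing against the minimax bound.

```python
import numpy as np

rng = np.random.default_rng(0)

def spiked_data(n, p, k, lams, s, sigma=1.0):
    """Generate X = U D V^T + Z: V orthonormal with s nonzero rows,
    U and Z with i.i.d. Gaussian entries, D = diag(sqrt(lambda_j))."""
    V = np.zeros((p, k))
    V[:s, :] = np.linalg.qr(rng.normal(size=(s, k)))[0]
    U = rng.normal(size=(n, k))
    Z = sigma * rng.normal(size=(n, p))
    return U @ np.diag(np.sqrt(lams)) @ V.T + Z, V

def subspace_loss(V, V_hat):
    """L(V, V_hat) = ||V V^T - V_hat V_hat^T||_F^2."""
    return np.linalg.norm(V @ V.T - V_hat @ V_hat.T, 'fro') ** 2

n, p, k, s = 100, 1000, 2, 10
X, V = spiked_data(n, p, k, lams=[20.0, 10.0], s=s)
V_hat = np.linalg.eigh(X.T @ X / n)[1][:, -k:]   # ordinary PCA estimate of span(V)
print(subspace_loss(V, V_hat))
```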

D. Statistical and Computational Tradeoff

It is important to point out that there is a fundamental tradeoff between statistical and computational performance. In general, there are no known computationally efficient methods to obtain the minimax rate optimal estimators for the SPCA. Several seminal papers highlight the tradeoff between computational and statistical efficiency for the SPCA, including [2], [6], [22], [35], [54], and others. Amini and Wainwright [2] proved that no algorithm can reliably recover the sparse eigenvector under the single-spike covariance model when k ≥ Cn/log p for some positive constant C and all sufficiently large n. Krauthgamer et al. [35] further proved that the semidefinite programming approach [15] does not close the gap between computational and statistical efficiency when k ≥ C√n for some positive constant C and all sufficiently large n. Berthet and Rigollet [6] considered the optimal detection of sparse principal components in high dimension

H_0: x \sim N(0, I_p) \quad \text{versus} \quad H_1: x \sim N(0, I_p + \theta v_1 v_1^T)

where v_1 has a fixed number of nonzero components. To this end, Berthet and Rigollet [6] studied a minimax optimal test based on the k-sparse largest eigenvalue of the empirical covariance matrix. The computation of this sparse eigenvalue statistic depends on a well-known decision problem associated with finding whether a graph contains a clique of size k, whose computational complexity is proved to be NP-complete in general [34]. In the follow-up paper, under the hardness assumption of the planted clique problem [20], Wang et al. [54] showed that there is an effective sample size regime in which no randomized polynomial time algorithm can achieve the minimax optimal rate for new and larger classes satisfying a restricted covariance concentration condition. Recently, Gao et al. [22] obtained the first computational lower bounds for SPCA under the Gaussian single spiked covariance model and closed the gap in SPCA computational lower bounds left by [6] and [54].

IV. APPLICATIONS

SPCA can be used in applications where PCA is normally used. For example, the use of SPCA in clustering can lead to sparse clustering algorithms [12]. PCA is a part of integrated omic-data analysis, where SPCA can be used to replace the regular PCA [46], [59]. We discuss a few recent applications of SPCA in medical imaging, ecology, and neuroscience, respectively.

Shape/image analysis: Sjöstrand [49] applied SPCA to landmark-based shape analysis of the CC brain structure. The author extracted 5, 20, and 50 nonzero principal components out of the total 156 components corresponding to landmarks, and also applied the standard PCA as a benchmark. In the subsequent analysis, he used univariate regression to study the relationship between the resulting deformations based on extracted variables and four clinical outcome variables (gender, age, walking speed, and verbal fluency). His findings confirmed the male/female mean shape differences and identified the deformation of the CC corresponding to the measure of walking speed. The results for verbal fluency were also meaningful anatomically. Sjöstrand [49] found that SPCA is useful to derive localized and interpretable patterns of variability, while PCA did not provide much interpretational value.

Ecological study: Motivated by generating meaningful combinations of the explanatory variables, Gravuer et al. [23] applied SPCA to perform dimension reduction before fitting an aggregated boosted tree (ABT) model. The sparsity helps the interpretability of their model. Specifically, Gravuer et al. [23] used SPCA to study a range of human, biogeographic, and biological influences on the invasion of Trifolium species into New Zealand. The sparse principal components were obtained from 29 categorical and continuous variables for three invasion stages (i.e., introduction, naturalization, and spread), and the relationship of the sparse principal components to invasion success was studied by using aggregated boosted trees. Specifically, the authors identified eight sparse principal components on 22 variables for the intentional introduction and unintentional introduction-naturalization stages, seven sparse principal components on 25 variables for naturalization of intentionally introduced species, and seven sparse principal components on 28 variables for relative spread rate. Gravuer et al. [23] found that SPCA simultaneously improves interpretability and maintains high explained variance.

Neuroscience study: SPCA was used in [4] to study the light-driven Ca2+ signals of the GCL cells given a set of standardized visual stimuli in a probabilistic clustering framework. Baden et al. [4] first used SPCA to extract features that are localized in time and readily interpretable from the responses to the chirp, color, and moving bar stimuli, and then used a Gaussian mixture model on the extracted feature set for clustering. The authors extracted 20 features from the mean response to the chirp, six features from the mean response to the color stimulus, eight features from the response time course, and four features from its temporal derivative. Many classically used temporal response features were identified, including ON- and OFF-responses with different kinetics or selectivity to different temporal frequencies. They also tried the standard PCA and found that the results led to inferior cluster quality.

SPCA is implemented in the R package elasticnet available from CRAN: http://cran.r-project.org/. The Matlab implementation of SPCA is available in the toolbox SpaSM from http://www2.imm.dtu.dk/projects/spasm/.

V. CONCLUDING REMARKS

In our discussion, we have only presented the use of the ℓ1-norm for sparsity, but there are other equally suitable penalty functions to be used in the SPCA methods, including SCAD [18], [19] or ℓ0, among others.


We now have a good understanding of the role of sparsity in PCA and ways to effectively exploit the sparsity. There are still remaining issues. A very important issue to be investigated further is automated SPCA. By "automated" we mean that there is a principled but not overly complicated procedure to set the sparsity parameters in SPCA. This question is particularly challenging when we solve several sparse principal components jointly. We would also like to have more empirical results to help us understand the pros and cons of each proposed SPCA technique, which may also inspire new and better approaches.

Acknowledgments

The authors would like to thank Prof. T. Bouwmans, Prof. Y. Chi, and Prof. N. Vaswani for inviting them to contribute this review paper to the special issue. They are grateful for the very helpful comments from three anonymous reviewers.

REFERENCES

[1] O. Alter, P. O. Brown, and D. Botstein, "Singular value decomposition for genome-wide expression data processing and modeling," Proc. Nat. Acad. Sci. USA, vol. 97, no. 18, pp. 10101–10106, Aug. 2000.
[2] A. A. Amini and M. J. Wainwright, "High-dimensional analysis of semidefinite relaxations for sparse principal components," Ann. Stat., vol. 37, no. 5B, pp. 2877–2921, 2009.
[3] T. W. Anderson, An Introduction to Multivariate Statistical Analysis. New York, NY, USA: Wiley, 2003.
[4] T. Baden, P. Berens, K. Franke, M. Román Rosón, M. Bethge, and T. Euler, "The functional diversity of retinal ganglion cells in the mouse," Nature, vol. 529, no. 7586, pp. 345–350, 2016.
[5] J. Baik and J. W. Silverstein, "Eigenvalues of large sample covariance matrices of spiked population models," J. Multivariate Anal., vol. 97, no. 6, pp. 1382–1408, 2006.
[6] Q. Berthet and P. Rigollet, "Optimal detection of sparse principal components in high dimension," Ann. Stat., vol. 41, no. 4, pp. 1780–1815, 2013.
[7] M. Biehl and A. Mietzner, "Statistical mechanics of unsupervised structure recognition," J. Phys. A, Math. Gen., vol. 27, no. 6, pp. 1885–1897, 1994.
[8] A. Birnbaum, I. M. Johnstone, B. Nadler, and D. Paul, "Minimax bounds for sparse PCA with noisy high-dimensional data," Ann. Stat., vol. 41, no. 3, pp. 1055–1084, 2013.
[9] J. Cadima and I. T. Jolliffe, "Loading and correlations in the interpretation of principle components," J. Appl. Stat., vol. 22, no. 2, pp. 203–214, 1995.
[10] T. T. Cai, Z. Ma, and Y. Wu, "Sparse PCA: Optimal rates and adaptive estimation," Ann. Stat., vol. 41, no. 6, pp. 3074–3110, 2013.
[11] T. T. Cai, Z. Ma, and Y. Wu, "Optimal estimation and rank detection for sparse spiked covariance matrices," Probab. Theory Related Fields, vol. 161, nos. 3–4, pp. 781–815, 2015.
[12] G. Chen, P. F. Sullivan, and M. Kosorok, "Biclustering with heterogeneous variance," Proc. Nat. Acad. Sci. USA, vol. 110, no. 30, pp. 12253–12258, 2013.
[13] A. d'Aspremont, "Identifying small mean-reverting portfolios," Quant. Finance, vol. 11, no. 3, pp. 351–364, 2011.
[14] A. d'Aspremont, F. Bach, and L. E. Ghaoui, "Optimal solutions for sparse principal component analysis," J. Mach. Learn. Res., vol. 9, pp. 1269–1294, Jul. 2008.
[15] A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet, "A direct formulation for sparse PCA using semidefinite programming," SIAM Rev., vol. 49, no. 3, pp. 434–448, 2007.
[16] C. Eckart and G. Young, "The approximation of one matrix by another of lower rank," Psychometrika, vol. 1, no. 3, pp. 211–218, Sep. 1936.
[17] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," Ann. Stat., vol. 32, no. 2, pp. 407–499, 2004.
[18] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties," J. Amer. Stat. Assoc., vol. 96, no. 456, pp. 1348–1360, 2001.
[19] J. Fan, L. Xue, and H. Zou, "Strong oracle optimality of folded concave penalized estimation," Ann. Stat., vol. 42, no. 3, pp. 819–849, 2014.
[20] V. Feldman, E. Grigorescu, L. Reyzin, S. Vempala, and Y. Xiao, "Statistical algorithms and a lower bound for detecting planted cliques," J. ACM, vol. 64, no. 2, p. 8, 2014.
[21] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani, "Pathwise coordinate optimization," Ann. Appl. Stat., vol. 1, no. 2, pp. 302–332, 2007.
[22] C. Gao, Z. Ma, and H. H. Zhou, "Sparse CCA: Adaptive estimation and computational barriers," Ann. Stat., vol. 45, no. 5, pp. 2074–2101, 2017.
[23] K. Gravuer, J. J. Sullivan, P. A. Williams, and R. P. Duncan, "Strong human association with plant invasion success for Trifolium introductions to New Zealand," Proc. Nat. Acad. Sci. USA, vol. 105, no. 17, pp. 6344–6349, 2008.
[24] P. Hancock, A. Burton, and V. Bruce, "Face processing: Human perception and principal components analysis," Memory Cogn., vol. 24, no. 1, pp. 26–40, 1996.
[25] T. Hastie et al., "'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns," Genome Biol., vol. 1, pp. 1–21, Aug. 2000.
[26] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd ed. New York, NY, USA: Springer-Verlag, 2009.
[27] J. Jankova and S. van de Geer, "De-biased sparse PCA: Inference and testing for eigenstructure of large covariance matrices," 2018. [Online]. Available: https://arxiv.org/abs/1801.10567
[28] J. Jeffers, "Two case studies in the application of principal component analysis," Appl. Stat., vol. 16, no. 3, pp. 225–236, 1967.
[29] I. M. Johnstone and A. Y. Lu, "On consistency and sparsity for principal components analysis in high dimensions," J. Amer. Stat. Assoc., vol. 104, no. 486, pp. 682–693, 2009.
[30] I. T. Jolliffe, "Rotation of principal components: Choice of normalization constraints," J. Appl. Stat., vol. 22, no. 1, pp. 29–35, 1995.
[31] I. T. Jolliffe, N. T. Trendafilov, and M. Uddin, "A modified principal component technique based on the lasso," J. Comput. Graph. Stat., vol. 12, no. 3, pp. 531–547, 2003.
[32] M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre, "Generalized power method for sparse principal component analysis," J. Mach. Learn. Res., vol. 11, pp. 517–553, Feb. 2010.
[33] S. Jung and J. S. Marron, "PCA consistency in high dimension, low sample size context," Ann. Stat., vol. 37, no. 6B, pp. 4104–4130, 2009.
[34] R. M. Karp, "Reducibility among combinatorial problems," in Proc. Complexity Comput. Symp., Yorktown Heights, NY, USA: IBM Thomas J. Watson Research Center, 1972, pp. 85–103.
[35] R. Krauthgamer, B. Nadler, and D. Vilenchik, "Do semidefinite relaxations solve sparse PCA up to the information limit?" Ann. Stat., vol. 43, no. 3, pp. 1300–1322, 2015.
[36] L. LeCam, "Convergence of estimates under dimensionality restrictions," Ann. Stat., vol. 1, no. 1, pp. 38–53, 1973.
[37] J. Lei and V. Q. Vu, "Sparsistency and agnostic inference in sparse PCA," Ann. Stat., vol. 43, no. 1, pp. 299–322, 2015.
[38] Z. Lu and Y. Zhang, "An augmented Lagrangian approach for sparse principal component analysis," Math. Program., vol. 135, nos. 1–2, pp. 149–193, 2012.
[39] S. Ma, "Alternating direction method of multipliers for sparse principal component analysis," J. Oper. Res. Soc. China, vol. 1, no. 2, pp. 253–274, 2013.
[40] Z. Ma, "Sparse principal component analysis and iterative thresholding," Ann. Stat., vol. 41, no. 2, pp. 772–801, 2013.
[41] K. Mardia, J. Kent, and J. Bibby, Multivariate Analysis. Orlando, FL, USA: Academic, 1979.
[42] B. Moghaddam, Y. Weiss, and S. Avidan, "Spectral bounds for sparse PCA: Exact and greedy algorithms," in Adv. Neural Inf. Process. Syst., 2006, pp. 915–922.
[43] B. Nadler, "Finite sample approximation results for principal component analysis: A matrix perturbation approach," Ann. Stat., vol. 36, no. 6, pp. 2791–2817, 2008.
[44] D. Paul, "Asymptotics of sample eigenstructure for a large dimensional spiked covariance model," Stat. Sinica, vol. 17, no. 4, pp. 1617–1642, 2007.
[45] K. Pearson, "LIII. On lines and planes of closest fit to systems of points in space," Philos. Mag. J. Sci., vol. 2, no. 11, pp. 559–572, 1901.
[46] M. D. Ritchie, E. R. Holzinger, R. Li, S. A. Pendergrass, and D. Kim, "Methods of integrating data to uncover genotype–phenotype interactions," Nature Rev. Genetics, vol. 16, no. 2, pp. 85–97, 2015.
[47] D. Shen, H. Shen, and J. S. Marron, "Consistency of sparse PCA in high dimension, low sample size contexts," J. Multivariate Anal., vol. 115, pp. 317–333, Mar. 2013.
[48] H. Shen and J. Z. Huang, "Sparse principal component analysis via regularized low rank matrix approximation," J. Multivariate Anal., vol. 99, no. 6, pp. 1015–1034, 2008.
[49] K. Sjöstrand, "Sparse decomposition and modeling of anatomical shape variation," IEEE Trans. Med. Imag., vol. 26, no. 12, pp. 1625–1635, Dec. 2007.
[50] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Roy. Stat. Soc. B, Methodol., vol. 58, no. 1, pp. 267–288, 1996.
[51] S. K. Vines, "Simple principal components," Appl. Stat., vol. 49, no. 4, pp. 441–451, 2000.
[52] V. Vu and J. Lei, "Minimax sparse principal subspace estimation in high dimensions," Ann. Stat., vol. 41, no. 6, pp. 2905–2947, 2013.
[53] V. Q. Vu, J. Cho, J. Lei, and K. Rohe, "Fantope projection and selection: A near-optimal convex relaxation of sparse PCA," in Proc. 26th Int. Conf. Neural Inf. Process. Syst., 2013, pp. 2670–2678.
[54] T. Wang, Q. Berthet, and R. J. Samworth, "Statistical and computational trade-offs in estimation of sparse principal components," Ann. Stat., vol. 44, no. 5, pp. 1896–1930, 2016.
[55] T. L. H. Watkin and J.-P. Nadal, "Optimal unsupervised learning," J. Phys. A, Math. Gen., vol. 27, no. 6, pp. 1899–1915, 1994.
[56] D. M. Witten, R. Tibshirani, and T. Hastie, "A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis," Biostatistics, vol. 10, no. 3, pp. 515–534, 2009.


[57] Y. Yang and A. Barron, "Information-theoretic determination of minimax rates of convergence," Ann. Stat., vol. 27, no. 5, pp. 1564–1599, 1999.
[58] X.-T. Yuan and T. Zhang, "Truncated power method for sparse eigenvalue problems," J. Mach. Learn. Res., vol. 14, no. 1, pp. 899–925, 2013.
[59] C. Zang et al., "High-dimensional genomic data bias correction and data integration using MANCIE," Nature Commun., vol. 7, p. 11305, Apr. 2016.
[60] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," J. Roy. Stat. Soc. B, Methodol., vol. 67, no. 2, pp. 301–320, 2005.
[61] H. Zou, T. Hastie, and R. Tibshirani, "Sparse principal component analysis," J. Comput. Graph. Stat., vol. 15, no. 2, pp. 265–286, 2006.

ABOUT THE AUTHORS

Hui Zou received the B.S. and M.S. degrees in physics from the University of Science and Technology of China, Hefei, China, in 1997 and 1999, respectively, the M.S. degree in statistics from Iowa State University, Ames, IA, USA, in 2001, and the Ph.D. degree in statistics from Stanford University, Stanford, CA, USA, in 2005. His research interests include high-dimensional inference, statistical learning, and computational statistics.

Lingzhou Xue received the B.S. degree in mathematics from Peking University, Beijing, China, in 2008 and the M.S. and Ph.D. degrees in statistics from the University of Minnesota, Minneapolis, MN, USA, in 2011 and 2012, respectively. His research interests include high-dimensional statistical inference and large-scale optimization.
