The Annals of Statistics 2013, Vol. 41, No. 2, 772–801
DOI: 10.1214/13-AOS1097
© Institute of Mathematical Statistics, 2013
SPARSE PRINCIPAL COMPONENT ANALYSIS AND ITERATIVE THRESHOLDING
BY ZONGMING MA University of Pennsylvania
Principal component analysis (PCA) is a classical dimension reduction method which projects data onto the principal subspace spanned by the leading eigenvectors of the covariance matrix. However, it behaves poorly when the number of features p is comparable to, or even much larger than, the sample size n. In this paper, we propose a new iterative thresholding approach for estimating principal subspaces in the setting where the leading eigenvectors are sparse. Under a spiked covariance model, we find that the new approach recovers the principal subspace and leading eigenvectors consistently, and even optimally, in a range of high-dimensional sparse settings. Simulated examples also demonstrate its competitive performance.
1. Introduction. In many contemporary datasets, if we organize the p-dimensional observations x_1, ..., x_n into the rows of an n × p data matrix X, the number of features p is often comparable to, or even much larger than, the sample size n. For example, in biomedical studies, we usually have measurements on the expression levels of tens of thousands of genes, but only for tens or hundreds of individuals. One of the crucial issues in the analysis of such “large p” datasets is dimension reduction of the feature space.

As a classical method, principal component analysis (PCA) [8, 23] reduces dimensionality by projecting the data onto the principal subspace spanned by the m leading eigenvectors of the population covariance matrix Σ, which represent the principal modes of variation. In principle, one expects that for some m
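As a concrete illustration of the classical procedure just described, the following minimal Python/NumPy sketch (not part of the paper) projects centered observations onto the m leading eigenvectors of the sample covariance matrix; the function name pca_projection, the choice of m, and the simulated data are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def pca_projection(X, m):
    """Project the rows of the n-by-p data matrix X onto the m-dimensional
    principal subspace estimated from the sample covariance matrix."""
    Xc = X - X.mean(axis=0)              # center each feature
    S = Xc.T @ Xc / (X.shape[0] - 1)     # p-by-p sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S) # eigenvalues in ascending order
    V = eigvecs[:, ::-1][:, :m]          # m leading eigenvectors as columns
    return Xc @ V                        # n-by-m matrix of principal scores

# Example usage: n = 100 observations of p = 20 features (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
scores = pca_projection(X, m=2)
print(scores.shape)  # (100, 2)
```

When p is comparable to or larger than n, the sample covariance matrix S in this sketch is a poor estimate of Σ, which is exactly the regime motivating the sparse, iterative thresholding approach developed in the paper.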