Positive-Definite ℓ1-Penalized Estimation of Large Covariance Matrices

Lingzhou Xue, Shiqian Ma, and Hui Zou

The thresholding covariance estimator has nice asymptotic properties for estimating sparse large covariance matrices, but it often has negative eigenvalues when used in real data analysis. To fix this drawback of thresholding estimation, we develop a positive-definite ℓ1-penalized covariance estimator for estimating sparse large covariance matrices. We derive an efficient alternating direction method to solve the challenging optimization problem and establish its convergence properties. Under weak regularity conditions, nonasymptotic statistical theory is also established for the proposed estimator. The competitive finite-sample performance of our proposal is demonstrated by both simulation and real applications.

KEY WORDS: Alternating direction methods; Matrix norm; Positive-definite estimation; Soft-thresholding; Sparsity.

Lingzhou Xue is Postdoctoral Research Associate, Department of Operations Research & Financial Engineering, Princeton University, Princeton, NJ 08544. Shiqian Ma is Assistant Professor, Department of Systems Engineering & Engineering Management, The Chinese University of Hong Kong, Hong Kong. Hui Zou is Associate Professor, School of Statistics, University of Minnesota, Minneapolis, MN 55455 (E-mail: [email protected]). The article was completed when Lingzhou Xue was a Ph.D. student at the University of Minnesota and Shiqian Ma was a Postdoctoral Fellow in the Institute for Mathematics and Its Applications at the University of Minnesota. The authors thank Adam Rothman for sharing his code. We are grateful to the coeditor, the associate editor, and two referees for their helpful and constructive comments. Shiqian Ma was supported by the National Science Foundation postdoctoral fellowship through the Institute for Mathematics and Its Applications at the University of Minnesota. Lingzhou Xue and Hui Zou are supported in part by grants from the National Science Foundation and the Office of Naval Research.

© 2012 American Statistical Association. Journal of the American Statistical Association, December 2012, Vol. 107, No. 500, Theory and Methods. DOI: 10.1080/01621459.2012.725386

1. INTRODUCTION

Estimating covariance matrices is of fundamental importance for an abundance of statistical methodologies. Nowadays, the advance of new technologies has brought massive high-dimensional data into various research fields, such as functional magnetic resonance imaging (fMRI), web mining, bioinformatics, climate studies, and risk management. The usual sample covariance matrix is optimal in the classical setting with large samples and fixed low dimensions (Anderson 1984), but it performs very poorly in the high-dimensional setting (Marčenko and Pastur 1967; Johnstone 2001). In the recent literature, regularization techniques have been used to improve the sample covariance matrix estimator, including banding (Wu and Pourahmadi 2003; Bickel and Levina 2008a), tapering (Furrer and Bengtsson 2007; Cai, Zhang, and Zhou 2010), and thresholding (Bickel and Levina 2008b; El Karoui 2008; Rothman, Levina, and Zhu 2009). Banding or tapering is very useful when the variables have a natural ordering and off-diagonal entries of the target covariance matrix decay to zero as they move away from the diagonal. On the other hand, thresholding is proposed for estimating permutation-invariant covariance matrices. Thresholding can be used to produce consistent covariance matrix estimators when the true covariance matrix is bandable (Bickel and Levina 2008b; Cai and Zhou 2012a). In this sense, thresholding is more robust than banding/tapering for real applications.

Let $\hat{\Sigma}_n = (\hat{\sigma}_{ij})_{1 \le i,j \le p}$ be the sample covariance matrix. Rothman, Levina, and Zhu (2009) defined the general thresholding covariance matrix estimator as $\hat{\Sigma}_{\mathrm{thr}} = \{s_\lambda(\hat{\sigma}_{ij})\}_{1 \le i,j \le p}$, where $s_\lambda(z)$ is the generalized thresholding function. The generalized thresholding function covers a number of commonly used shrinkage procedures, for example, the hard thresholding $s_\lambda(z) = z I_{\{|z| > \lambda\}}$, the soft thresholding $s_\lambda(z) = \mathrm{sign}(z)(|z| - \lambda)_+$, the smoothly clipped absolute deviation thresholding (Fan and Li 2001), and the adaptive lasso thresholding (Zou 2006). Consistency results and explicit rates of convergence have been obtained for these regularized estimators in the literature, for example, Bickel and Levina (2008a, b), El Karoui (2008), Rothman, Levina, and Zhu (2009), and Cai and Liu (2011). The recent articles by Cai and Zhou (2012a, b) have established the minimax optimality of the thresholding estimator for estimating a wide range of large sparse covariance matrices under commonly used matrix norms. The existing theoretical and empirical results do not clearly favor any particular thresholding rule. In this article, we focus on soft-thresholding because it can be formulated as the solution of a convex optimization problem. Let $\|\cdot\|_F$ be the Frobenius norm and $|\cdot|_1$ be the element-wise ℓ1-norm of all off-diagonal elements. Then the soft-thresholding covariance estimator is equal to

$$\hat{\Sigma} = \arg\min_{\Sigma} \frac{1}{2}\|\Sigma - \hat{\Sigma}_n\|_F^2 + \lambda|\Sigma|_1. \qquad (1)$$

However, there is no guarantee that the thresholding estimator is always positive definite. Although the positive-definite property is guaranteed in the asymptotic setting with high probability, the actual estimator can be an indefinite matrix, especially in real data analysis. To illustrate this issue, we consider the Michigan lung cancer gene expression data (Beer et al. 2002), which have 86 tumor samples from patients with lung adenocarcinomas and 5217 gene expression values for each sample. For more details about this dataset, see Beer et al. (2002) and Subramanian et al. (2005). We randomly chose $p$ genes ($p = 200, 500$) and obtained the soft-thresholding sample correlation matrix for these genes. We repeated the process 10 times for $p = 200$ and $500$, respectively, and each time the thresholding parameter $\lambda$ was selected via fivefold cross-validation. We found that none of these soft-thresholding estimators was positive definite. On average, there are 22 and 124 negative eigenvalues for the soft-thresholding estimator for $p = 200$ and $500$, respectively. Figure 1 displays the 30 smallest eigenvalues for $p = 200$ and the 130 smallest eigenvalues for $p = 500$.

Figure 1. Illustration of the indefinite soft-thresholding estimator in the Michigan lung cancer data. Panels: (A) p = 200: the minimal 30 eigenvalues; (B) p = 500: the minimal 130 eigenvalues. Axes: Index (horizontal), Eigenvalue (vertical).

From both methodological and practical perspectives, the positive-definite property is crucial for any covariance matrix estimator. First of all, any statistical procedure that uses the normal distribution requires a positive-definite covariance matrix; otherwise the density function is ill defined. Two well-known examples are the parametric bootstrap method and quadratic discriminant analysis. Second, there are important statistical methods that do not use the normal distribution but still cannot be carried out without a positive-definite covariance matrix estimator. The celebrated Markowitz portfolio optimization problem is one such example. We will provide more detailed discussion and examples in Section 3.2 to illustrate the importance of a positive-definite covariance matrix estimator in the Markowitz portfolio optimization problem.

To deal with the indefiniteness issue in covariance matrix estimation, one possible solution is to use the eigen-decomposition of $\hat{\Sigma}$ and project $\hat{\Sigma}$ onto the convex cone $\{\Sigma \succeq 0\}$. Assume that $\hat{\Sigma}$ has the eigen-decomposition $\hat{\Sigma} = \sum_{i=1}^{p} \hat{\lambda}_i v_i^{T} v_i$; then a positive semidefinite estimator $\tilde{\Sigma}^{+}$ can be obtained by setting $\tilde{\Sigma}^{+} = \sum_{i=1}^{p} \max(\hat{\lambda}_i, 0)\, v_i^{T} v_i$. However, this strategy does not work well for sparse covariance matrix estimation, because the projection destroys the sparsity pattern of $\hat{\Sigma}$. Consider the Michigan data again: after semidefinite projection, the soft-thresholding estimator has no zero entry.

the positive-definite constraint $\{\Sigma \succeq \epsilon I\}$ for some arbitrarily small $\epsilon > 0$. Then the modified $\hat{\Sigma}^{+}$ is always positive definite. In this work, we focus on solving the positive-definite $\hat{\Sigma}^{+}$ as follows:

$$\hat{\Sigma}^{+} = \arg\min_{\Sigma \succeq \epsilon I} \frac{1}{2}\|\Sigma - \hat{\Sigma}_n\|_F^2 + \lambda|\Sigma|_1. \qquad (3)$$

It is important to note that $\epsilon$ is not a tuning parameter like $\lambda$. We simply include $\epsilon$ in the procedure to ensure that the smallest eigenvalue of the estimator is at least $\epsilon$. If one knows that the smallest eigenvalue of the true covariance matrix is bounded below by a positive number $\delta'$, then $\epsilon$ can be $\delta'$. To fix the idea, we use $\epsilon = 10^{-5}$ in all our numerical examples.

Despite its natural motivation, (3) is actually a very challenging optimization problem due to the positive-definite constraint. Rothman (2012) considered a slightly perturbed version of (3) by adding a log-determinant barrier function:

$$\breve{\Sigma}^{+} = \arg\min_{\Sigma \succ 0} \frac{1}{2}\|\Sigma - \hat{\Sigma}_n\|_F^2 - \tau \log\det(\Sigma) + \lambda|\Sigma|_1, \qquad (4)$$

where the barrier parameter $\tau$ is a small positive constant, say $10^{-4}$. From the optimization viewpoint, (4) is similar to the graphical lasso criterion (Friedman, Hastie, and Tibshirani 2008), which also has a log-determinant part and the element-wise ℓ1-penalty. Rothman (2012) derived an iterative procedure to solve (4). Rothman's (2012) proposal is based on heuristic arguments and its convergence property is unknown. Another