
Positive-Definite ℓ1-Penalized Estimation of Large Covariance Matrices

Lingzhou Xue, Shiqian Ma, and Hui Zou

Lingzhou Xue is Postdoctoral Research Associate, Department of Operations Research & Financial Engineering, Princeton University, Princeton, NJ 08544. Shiqian Ma is Assistant Professor, Department of Systems Engineering & Engineering Management, The Chinese University of Hong Kong, Hong Kong. Hui Zou is Associate Professor, School of Statistics, University of Minnesota, Minneapolis, MN 55455 (E-mail: [email protected]). The article was completed when Lingzhou Xue was a Ph.D. student at the University of Minnesota and Shiqian Ma was a Postdoctoral Fellow in the Institute for Mathematics and Its Applications at the University of Minnesota. The authors thank Adam Rothman for sharing his code. We are grateful to the coeditor, the associate editor, and two referees for their helpful and constructive comments. Shiqian Ma was supported by a National Science Foundation postdoctoral fellowship through the Institute for Mathematics and Its Applications at the University of Minnesota. Lingzhou Xue and Hui Zou are supported in part by grants from the National Science Foundation and the Office of Naval Research.

Journal of the American Statistical Association, December 2012, Vol. 107, No. 500, Theory and Methods. DOI: 10.1080/01621459.2012.725386

The thresholding covariance estimator has nice asymptotic properties for estimating sparse large covariance matrices, but it often has negative eigenvalues when used in real data analysis. To fix this drawback of thresholding estimation, we develop a positive-definite $\ell_1$-penalized covariance estimator for estimating sparse large covariance matrices. We derive an efficient alternating direction method to solve the challenging optimization problem and establish its convergence properties. Under weak regularity conditions, nonasymptotic statistical theory is also established for the proposed estimator. The competitive finite-sample performance of our proposal is demonstrated by both simulation and real applications.

KEY WORDS: Alternating direction methods; Matrix norm; Positive-definite estimation; Soft-thresholding; Sparsity.

1. INTRODUCTION

Estimating covariance matrices is of fundamental importance for an abundance of statistical methodologies. Nowadays, the advance of new technologies has brought massive high-dimensional data into various research fields, such as functional magnetic resonance imaging (fMRI), web mining, bioinformatics, climate studies, and risk management. The usual sample covariance matrix is optimal in the classical setting with large samples and fixed low dimensions (Anderson 1984), but it performs very poorly in the high-dimensional setting (Marčenko and Pastur 1967; Johnstone 2001). In the recent literature, regularization techniques have been used to improve the sample covariance matrix estimator, including banding (Wu and Pourahmadi 2003; Bickel and Levina 2008a), tapering (Furrer and Bengtsson 2007; Cai, Zhang, and Zhou 2010), and thresholding (Bickel and Levina 2008b; El Karoui 2008; Rothman, Levina, and Zhu 2009). Banding or tapering is very useful when the variables have a natural ordering and the off-diagonal entries of the target covariance matrix decay to zero as they move away from the diagonal. On the other hand, thresholding is proposed for estimating permutation-invariant covariance matrices. Thresholding can be used to produce consistent covariance matrix estimators even when the true covariance matrix is bandable (Bickel and Levina 2008b; Cai and Zhou 2012a). In this sense, thresholding is more robust than banding/tapering for real applications.

Let $\hat\Sigma_n = (\hat\sigma_{ij})_{1\le i,j\le p}$ be the sample covariance matrix. Rothman, Levina, and Zhu (2009) defined the generalized thresholding covariance matrix estimator as $\hat\Sigma_{\mathrm{thr}} = \{s_\lambda(\hat\sigma_{ij})\}_{1\le i,j\le p}$, where $s_\lambda(z)$ is the generalized thresholding function. The generalized thresholding function covers a number of commonly used shrinkage procedures, for example, the hard thresholding $s_\lambda(z) = zI_{\{|z|>\lambda\}}$, the soft thresholding $s_\lambda(z) = \mathrm{sign}(z)(|z|-\lambda)_+$, the smoothly clipped absolute deviation thresholding (Fan and Li 2001), and the adaptive lasso thresholding (Zou 2006). Consistency results and explicit rates of convergence have been obtained for these regularized estimators in the literature, for example, Bickel and Levina (2008a, b), El Karoui (2008), Rothman, Levina, and Zhu (2009), and Cai and Liu (2011). The recent articles by Cai and Zhou (2012a, b) have established the minimax optimality of the thresholding estimator for estimating a wide range of large sparse covariance matrices under commonly used matrix norms. The existing theoretical and empirical results show no clear favoritism to a particular thresholding rule. In this article, we focus on the soft thresholding because it can be formulated as the solution of a convex optimization problem. Let $\|\cdot\|_F$ be the Frobenius norm and $|\cdot|_1$ be the element-wise $\ell_1$-norm of all off-diagonal elements. Then the soft-thresholding covariance estimator is equal to
$$\hat\Sigma = \arg\min_{\Sigma}\ \tfrac12\|\Sigma - \hat\Sigma_n\|_F^2 + \lambda|\Sigma|_1. \quad (1)$$
However, there is no guarantee that the thresholding estimator is always positive definite. Although the positive-definite property is guaranteed in the asymptotic setting with high probability, the actual estimator can be an indefinite matrix, especially in real data analysis. To illustrate this issue, we consider the Michigan lung cancer gene expression data (Beer et al. 2002), which have 86 tumor samples from patients with lung adenocarcinomas and 5217 gene expression values for each sample. More details about this dataset can be found in Beer et al. (2002) and Subramanian et al. (2005). We randomly chose p genes (p = 200, 500) and obtained the soft-thresholding sample correlation matrix for these genes. We repeated the process 10 times for p = 200 and 500, respectively, and each time the thresholding parameter $\lambda$ was selected via fivefold cross-validation. We found that none of these soft-thresholding estimators was positive definite. On average, there are 22 and 124 negative eigenvalues for the soft-thresholding estimator for p = 200 and 500, respectively. Figure 1 displays the 30 smallest eigenvalues for p = 200 and the 130 smallest eigenvalues for p = 500.




Figure 1. Illustration of the indefinite soft-thresholding estimator in the Michigan lung cancer data: (A) p = 200, the minimal 30 eigenvalues; (B) p = 500, the minimal 130 eigenvalues. [Eigenvalue plots not reproduced here.]
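The phenomenon illustrated in Figure 1 is easy to reproduce, since the soft-thresholding estimator in (1) has a simple closed form: soft-threshold the off-diagonal entries of the sample covariance matrix and keep the diagonal. The following is a minimal numerical sketch (NumPy), an illustration rather than the authors' implementation:

```python
import numpy as np

def soft_threshold_cov(S, lam):
    """Closed-form solution of (1): soft-threshold the off-diagonal
    entries of the sample covariance matrix S, keep the diagonal."""
    out = np.sign(S) * np.maximum(np.abs(S) - lam, 0.0)
    np.fill_diagonal(out, np.diag(S))        # the diagonal is not penalized
    return out

# Toy illustration: the estimator is sparse but may be indefinite.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 100))           # n = 50, p = 100
S = np.cov(X, rowvar=False)
Sig = soft_threshold_cov(S, lam=0.2)
print("smallest eigenvalue:", np.linalg.eigvalsh(Sig).min())  # often < 0
```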

From both methodological and practical perspectives, the positive-definite property is crucial for any covariance matrix estimator. First of all, any statistical procedure that uses the normal distribution requires a positive-definite covariance matrix; otherwise, the density function is ill defined. Two well-known examples are the parametric bootstrap method and quadratic discriminant analysis. Second, there are important statistical methods that do not use the normal distribution but still cannot be carried out without a positive-definite covariance matrix estimator. The celebrated Markowitz portfolio optimization problem is one such example. We will provide more detailed discussion and examples in Section 3.2 to illustrate the importance of a positive-definite covariance matrix estimator in the Markowitz portfolio optimization problem. To deal with the indefiniteness issue in covariance matrix estimation, one possible solution is to use the eigen-decomposition of $\hat\Sigma$ and project $\hat\Sigma$ into the convex cone $\{\Sigma \succeq 0\}$. Assume that $\hat\Sigma$ has the eigen-decomposition $\hat\Sigma = \sum_{i=1}^p \hat\lambda_i v_i v_i^{\mathrm T}$; then a positive semidefinite estimator $\tilde\Sigma$ can be obtained by setting $\tilde\Sigma = \sum_{i=1}^p \max(\hat\lambda_i, 0)\, v_i v_i^{\mathrm T}$. However, this strategy does not work well for sparse covariance matrix estimation, because the projection destroys the sparsity pattern of $\hat\Sigma$. Considering the Michigan data again, after semidefinite projection the soft-thresholding estimator has no zero entry; a small sketch at the end of this section illustrates the point.

In order to simultaneously achieve sparsity and positive semidefiniteness, a natural solution is to add the positive semidefinite constraint to (1). Consider the following constrained $\ell_1$ penalization problem:
$$\hat\Sigma^+ = \arg\min_{\Sigma \succeq 0}\ \tfrac12\|\Sigma - \hat\Sigma_n\|_F^2 + \lambda|\Sigma|_1. \quad (2)$$
Note that the solution to (2) could be only positive semidefinite. To obtain a positive-definite covariance estimator, we can consider the positive-definite constraint $\{\Sigma \succeq \epsilon I\}$ for some arbitrarily small $\epsilon > 0$. Then the modified $\hat\Sigma^+$ is always positive definite. In this work, we focus on solving the positive-definite $\hat\Sigma^+$ as follows:
$$\hat\Sigma^+ = \arg\min_{\Sigma \succeq \epsilon I}\ \tfrac12\|\Sigma - \hat\Sigma_n\|_F^2 + \lambda|\Sigma|_1. \quad (3)$$
It is important to note that $\epsilon$ is not a tuning parameter like $\lambda$. We simply include $\epsilon$ in the procedure to ensure that the smallest eigenvalue of the estimator is at least $\epsilon$. If one knows that the smallest eigenvalue of the true covariance matrix is bounded below by a positive number $\delta'$, then $\epsilon$ can be $\delta'$. To fix the idea, we use $\epsilon = 10^{-5}$ in all our numerical examples.

Despite its natural motivation, (3) is actually a very challenging optimization problem due to the positive-definite constraint. Rothman (2012) considered a slightly perturbed version of (3) by adding a log-determinant barrier function:
$$\breve\Sigma^+ = \arg\min_{\Sigma \succ 0}\ \tfrac12\|\Sigma - \hat\Sigma_n\|_F^2 - \tau\log\det(\Sigma) + \lambda|\Sigma|_1, \quad (4)$$
where the barrier parameter $\tau$ is a small positive constant, say $10^{-4}$. From the optimization viewpoint, (4) is similar to the graphical lasso criterion (Friedman, Hastie, and Tibshirani 2008), which also has a log-determinant part and the element-wise $\ell_1$-penalty. Rothman (2012) derived an iterative procedure to solve (4). Rothman's (2012) proposal is based on heuristic arguments, and its convergence property is unknown. Another theoretical limitation of Rothman's (2012) proposal is that it requires the smallest eigenvalue of the true covariance matrix to be bounded away from zero; otherwise, the influence of the perturbing term due to the log-determinant barrier could not be well controlled.

In this article, we present an efficient alternating direction algorithm for solving (3) directly. Numerical examples show that our algorithm is much faster than the log-barrier method. We prove the convergence properties of our algorithm and discuss the statistical properties of the positive-definite constrained $\ell_1$-penalized covariance estimator. Besides the computational advantage, we also point out the methodological and theoretical advantages of our method over the log-barrier method.
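The eigenvalue-truncation projection discussed above is easy to implement but, as noted, loses sparsity. The sketch below (NumPy; an illustration, not the paper's code) makes both points concrete:

```python
import numpy as np

def project_psd(Sig, eps=0.0):
    """Project a symmetric matrix onto {Sigma >= eps * I} by
    truncating its eigenvalues at eps."""
    w, V = np.linalg.eigh(Sig)
    return (V * np.maximum(w, eps)) @ V.T

# A sparse indefinite matrix loses its zero pattern after projection.
Sig = np.array([[1.0, 0.9, 0.0],
                [0.9, 1.0, 0.9],
                [0.0, 0.9, 1.0]])               # indefinite: eigmin < 0
Sig_psd = project_psd(Sig)
print(np.linalg.eigvalsh(Sig_psd).min() >= 0)   # True
print(Sig_psd[0, 2])                            # no longer exactly 0
```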

2. ALTERNATING DIRECTION ALGORITHM

We use an alternating direction method to solve (3) directly. The alternating direction method is closely related to the operator-splitting method, which has a history dating back to the 1950s for solving numerical partial differential equations (see, e.g., Douglas and Rachford 1956; Peaceman and Rachford 1955). Recently, the alternating direction method has been revisited and successfully applied to solving large-scale problems arising from different applications. For example, Scheinberg, Ma, and Goldfarb (2010) introduced the alternating linearization methods to efficiently solve the graphical lasso optimization problem. We refer to Fortin and Glowinski (1983) and Glowinski and Le Tallec (1989) for more details on operator-splitting and alternating direction methods, and to the recent survey article by Boyd et al. (2011) for a more complete list of references.

In the sequel, we propose an alternating direction method to solve the $\ell_1$-penalized covariance matrix estimation problem (3) under the positive-definite constraint. We first introduce a new variable $\Theta$ and an equality constraint as follows:
$$(\hat\Theta^+, \hat\Sigma^+) = \arg\min_{\Theta,\Sigma}\ \Big\{\tfrac12\|\Sigma - \hat\Sigma_n\|_F^2 + \lambda|\Sigma|_1 : \Sigma = \Theta,\ \Theta \succeq \epsilon I\Big\}. \quad (5)$$
The solution to (5) gives the solution to (3). To deal with the equality constraint in (5), we shall minimize its augmented Lagrangian function for some given penalty parameter $\mu$, that is,
$$L(\Theta, \Sigma; \Lambda) = \tfrac12\|\Sigma - \hat\Sigma_n\|_F^2 + \lambda|\Sigma|_1 - \langle \Lambda, \Theta - \Sigma\rangle + \tfrac{1}{2\mu}\|\Theta - \Sigma\|_F^2, \quad (6)$$
where $\Lambda$ is the Lagrange multiplier. We iteratively solve
$$(\Theta^{i+1}, \Sigma^{i+1}) = \arg\min_{\Theta \succeq \epsilon I,\ \Sigma} L(\Theta, \Sigma; \Lambda^i) \quad (7)$$
and then update the Lagrange multiplier $\Lambda^{i+1}$ by
$$\Lambda^{i+1} = \Lambda^i - \tfrac{1}{\mu}(\Theta^{i+1} - \Sigma^{i+1}).$$
For (7), we alternatingly minimize $L(\Theta, \Sigma; \Lambda^i)$ with respect to $\Theta$ and $\Sigma$. To sum up, the entire algorithm proceeds as follows. For $i = 0, 1, 2, \ldots$, perform the following three steps sequentially till convergence:
$$\Theta\ \text{step}: \quad \Theta^{i+1} = \arg\min_{\Theta \succeq \epsilon I} L(\Theta, \Sigma^i; \Lambda^i), \quad (8)$$
$$\Sigma\ \text{step}: \quad \Sigma^{i+1} = \arg\min_{\Sigma} L(\Theta^{i+1}, \Sigma; \Lambda^i), \quad (9)$$
$$\Lambda\ \text{step}: \quad \Lambda^{i+1} = \Lambda^i - \tfrac{1}{\mu}(\Theta^{i+1} - \Sigma^{i+1}). \quad (10)$$
To further simplify the alternating direction algorithm, we derive the closed-form solutions for (8) and (9). Consider the $\Theta$ step. Define $(Z)_+$ as the projection of a matrix $Z$ onto the convex cone $\{\Theta \succeq \epsilon I\}$. Assume that $Z$ has the eigen-decomposition $\sum_{j=1}^p \lambda_j v_j v_j^{\mathrm T}$; then $(Z)_+$ can be obtained as $\sum_{j=1}^p \max(\lambda_j, \epsilon)\, v_j v_j^{\mathrm T}$. Then the $\Theta$ step can be analytically solved as follows:
$$\Theta^{i+1} = \arg\min_{\Theta \succeq \epsilon I} L(\Theta, \Sigma^i; \Lambda^i) = \arg\min_{\Theta \succeq \epsilon I}\ {-\langle \Lambda^i, \Theta\rangle} + \tfrac{1}{2\mu}\|\Theta - \Sigma^i\|_F^2 = \arg\min_{\Theta \succeq \epsilon I}\ \|\Theta - (\Sigma^i + \mu\Lambda^i)\|_F^2 = (\Sigma^i + \mu\Lambda^i)_+.$$
Next, define an entry-wise soft-thresholding rule for all the off-diagonal elements of a matrix $Z$ as $S(Z, \tau) = \{s(z_{j\ell}, \tau)\}_{1\le j,\ell\le p}$ with
$$s(z_{j\ell}, \tau) = \mathrm{sign}(z_{j\ell})\max(|z_{j\ell}| - \tau, 0)\, I_{\{j\ne\ell\}} + z_{j\ell} I_{\{j=\ell\}}.$$
Then the $\Sigma$ step has a closed-form solution given as follows:
$$\Sigma^{i+1} = \arg\min_{\Sigma} L(\Theta^{i+1}, \Sigma; \Lambda^i) = \arg\min_{\Sigma}\ \tfrac12\|\Sigma - \hat\Sigma_n\|_F^2 + \lambda|\Sigma|_1 + \langle \Lambda^i, \Sigma\rangle + \tfrac{1}{2\mu}\|\Sigma - \Theta^{i+1}\|_F^2$$
$$= \arg\min_{\Sigma}\ \tfrac12\left\|\Sigma - \frac{\mu(\hat\Sigma_n - \Lambda^i) + \Theta^{i+1}}{1+\mu}\right\|_F^2 + \frac{\lambda\mu}{1+\mu}|\Sigma|_1 = \frac{1}{1+\mu}\, S\big(\mu(\hat\Sigma_n - \Lambda^i) + \Theta^{i+1},\ \lambda\mu\big).$$

Algorithm 1 shows the complete details of our alternating direction method for (3). In Section 4, we provide the convergence analysis of Algorithm 1 and prove that Algorithm 1 always converges to the optimal solution of (5) from any starting point. In our implementation, we use the soft-thresholding estimator as the initial value for both $\Theta^0$ and $\Sigma^0$, and we set $\Lambda^0$ as a zero matrix. Note that our convergence result in Theorem 1 shows that Algorithm 1 globally converges for any $\mu > 0$. Unlike $\lambda$, $\mu$ does not change the final covariance estimator. In our numerical experiments, we fixed $\mu = 2$ just for simplicity. Before invoking Algorithm 1, we always check whether the soft-thresholding estimator is positive definite. If yes, then the soft-thresholding estimator is the final solution to (3).

Algorithm 1. Our alternating direction method for the $\ell_1$-penalized covariance estimator.

1. Input: $\mu$, $\Sigma^0$, and $\Lambda^0$.
2. Iterative alternating direction augmented Lagrangian step, for the $i$th iteration:
   2.1 Solve $\Theta^{i+1} = (\Sigma^i + \mu\Lambda^i)_+$;
   2.2 Solve $\Sigma^{i+1} = \frac{1}{1+\mu}\, S\big(\mu(\hat\Sigma_n - \Lambda^i) + \Theta^{i+1},\ \lambda\mu\big)$;
   2.3 Update $\Lambda^{i+1} = \Lambda^i - \frac{1}{\mu}(\Theta^{i+1} - \Sigma^{i+1})$.
3. Repeat the above cycle till convergence.
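Putting the two closed-form updates together, a compact sketch of Algorithm 1 might look as follows (NumPy). This is an illustrative implementation under our reading of the paper; in particular, the simple stopping rule is our own choice and is not specified by the authors.

```python
import numpy as np

def proj_eps(Z, eps):
    """Projection (Z)_+ onto {Theta >= eps * I}: truncate eigenvalues at eps."""
    w, V = np.linalg.eigh(Z)
    return (V * np.maximum(w, eps)) @ V.T

def soft_offdiag(Z, tau):
    """Operator S(Z, tau): soft-threshold off-diagonal entries only."""
    out = np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)
    np.fill_diagonal(out, np.diag(Z))
    return out

def admm_pd_cov(S_n, lam, eps=1e-5, mu=2.0, tol=1e-7, max_iter=1000):
    """Alternating direction method for (3), following Algorithm 1."""
    Sigma = soft_offdiag(S_n, lam)   # soft-thresholding estimator as start value
    Lam = np.zeros_like(S_n)
    for _ in range(max_iter):
        Theta = proj_eps(Sigma + mu * Lam, eps)                     # step 2.1
        Sigma_new = soft_offdiag(mu * (S_n - Lam) + Theta,          # step 2.2
                                 lam * mu) / (1.0 + mu)
        Lam = Lam - (Theta - Sigma_new) / mu                        # step 2.3
        # Stopping rule (our choice): small successive change; a fuller
        # rule would also monitor the primal residual ||Theta - Sigma||_F.
        if np.linalg.norm(Sigma_new - Sigma, "fro") < tol:
            Sigma = Sigma_new
            break
        Sigma = Sigma_new
    return Sigma
```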

3. NUMERICAL EXAMPLES

Before delving into the theoretical analysis of our proposed algorithm and estimator, we first demonstrate the computational advantage of our estimator over the log-barrier method and then show an application of our estimator to the Markowitz portfolio selection problem using stock data.

3.1 Comparing Two Positive-Definite Thresholding Estimators

We first use simulation to show the competitive performance of our proposal. In all examples, we standardize the variables to have zero mean and unit variance. In each simulation model, we generated 100 independent datasets, each with n = 50 independent p-variate random vectors from the multivariate normal distribution with mean 0 and covariance matrix $\Sigma_0 = (\sigma^0_{ij})_{1\le i,j\le p}$ for p = 100, 200, and 500. We considered two covariance models with different sparsity patterns:

Model 1: $\sigma^0_{ij} = (1 - |i - j|/10)_+$.

Model 2: Partition the indices $\{1, 2, \ldots, p\}$ into $K = p/20$ nonoverlapping subsets $I_1, \ldots, I_K$ of equal size, and let $i_k$ denote the maximum index in $I_k$:
$$\sigma^0_{ij} = 0.6\, I_{\{i=j\}} + 0.4\sum_{k=1}^K I_{\{i\in I_k,\ j\in I_k\}} + 0.4\sum_{k=1}^{K-1}\big(I_{\{i=i_k,\ j\in I_{k+1}\}} + I_{\{i\in I_{k+1},\ j=i_k\}}\big).$$

Model 1 has been used in Bickel and Levina (2008a) and Cai and Liu (2011), and Model 2 is similar to the overlapping block diagonal design that has been used in Rothman (2012).

First, we compare the run times of our estimator $\hat\Sigma^+$ with the log-barrier estimator $\breve\Sigma^+$ of Rothman (2012). As shown in Table 1, our method is much faster than the log-barrier method.

Table 1. Total time (in seconds) for computing a solution path with 99 thresholding parameters $\lambda = \{0.01, 0.02, \ldots, 0.99\}$. Timing was carried out on an AMD 2.8 GHz processor.

                        Model 1                    Model 2
p                   100    200      500       100    200      500
Soft thresholding   0.2    1.3      5.7       0.1    1.3      5.6
Our method          9.2   65.2   1156.0       7.5   51.1    986.6
Rothman's method   84.1  822.1  35911.8      51.3  611.1  32803.0

In what follows, we compare the performance of $\hat\Sigma^+$, $\breve\Sigma^+$, and the soft-thresholding estimator $\hat\Sigma$. For all three regularized estimators, the thresholding parameter was chosen over the 99 thresholding parameters $\lambda = \{0.01, 0.02, \ldots, 0.99\}$ by fivefold cross-validation (Bickel and Levina 2008b; Rothman, Levina, and Zhu 2009; Cai and Liu 2011). For $\breve\Sigma^+$, we set $\tau = 10^{-4}$ as in Rothman (2012). The estimation performance is measured by the average losses under both the Frobenius norm and the spectral norm. The selection performance is examined by the false positive rate
$$\frac{\#\{(i,j): \hat\sigma_{ij} \ne 0\ \&\ \sigma_{ij} = 0\}}{\#\{(i,j): \sigma_{ij} = 0\}}$$
and the true positive rate
$$\frac{\#\{(i,j): \hat\sigma_{ij} \ne 0\ \&\ \sigma_{ij} \ne 0\}}{\#\{(i,j): \sigma_{ij} \ne 0\}}.$$
Moreover, we compare the average number of negative eigenvalues over 100 replications and the percentage of positive-definite outcomes to check the positive definiteness.

Tables 2 and 3 show the average metrics over 100 replications. The soft-thresholding estimator $\hat\Sigma$ is positive definite in no more than 53 out of 100 simulation runs for Model 1 and no more than 19 out of 100 for Model 2, while $\hat\Sigma^+$ and $\breve\Sigma^+$ always guarantee a positive-definite estimator. The larger the dimension, the less likely the soft-thresholding estimator is to be positive definite. In terms of estimation, both $\hat\Sigma^+$ and $\breve\Sigma^+$ are more accurate than $\hat\Sigma$. As for the selection performance, $\hat\Sigma^+$ and $\breve\Sigma^+$ achieve a slightly better true positive rate than $\hat\Sigma$. Overall, $\hat\Sigma^+$ is the best among all three regularized estimators.

Table 2. Comparison of the three regularized estimators for Model 1

                   Frobenius norm  Spectral norm  False positive  True positive  Negative eigenvalues  Positive definiteness
p = 100
Soft thresholding   8.41 (0.06)    4.02 (0.04)    24.5 (0.1)      87.6 (0.0)     2.24 (0.14)            53/100
Our method          8.40 (0.06)    4.02 (0.04)    24.8 (0.1)      87.8 (0.0)     0.00 (0.00)           100/100
Rothman's method    8.40 (0.06)    4.02 (0.04)    24.5 (0.1)      87.7 (0.0)     0.00 (0.00)           100/100
p = 200
Soft thresholding  13.82 (0.06)    4.70 (0.03)    14.3 (0.4)      83.2 (0.3)     3.74 (0.22)            23/100
Our method         13.80 (0.06)    4.69 (0.03)    14.6 (0.4)      83.5 (0.3)     0.00 (0.00)           100/100
Rothman's method   13.81 (0.05)    4.69 (0.03)    14.6 (0.4)      83.5 (0.3)     0.00 (0.00)           100/100
p = 500
Soft thresholding  25.15 (0.11)    5.28 (0.04)     6.3 (0.2)      78.1 (0.3)     4.64 (0.60)             7/100
Our method         25.10 (0.11)    5.28 (0.04)     6.5 (0.2)      78.3 (0.3)     0.00 (0.00)           100/100
Rothman's method   NA              NA              NA              NA             NA                     NA

NOTE: Each metric is averaged over 100 replications with the standard error shown in brackets. NA means that the results for $\breve\Sigma^+$ (Rothman's method) are not available due to the extremely long run times.
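For reference, the two covariance models above can be generated as follows. This is a sketch under our reading of the model definitions (note that the within-block term adds 0.4 on the diagonal of Model 2, so its diagonal equals 1, consistent with unit variances):

```python
import numpy as np

def model1_cov(p):
    """Model 1: sigma0_ij = (1 - |i - j| / 10)_+."""
    idx = np.arange(p)
    return np.maximum(1.0 - np.abs(idx[:, None] - idx[None, :]) / 10.0, 0.0)

def model2_cov(p):
    """Model 2: overlapping block diagonal design with K = p / 20 blocks."""
    K, size = p // 20, 20
    Sig = 0.6 * np.eye(p)
    for k in range(K):
        blk = np.arange(k * size, (k + 1) * size)
        Sig[np.ix_(blk, blk)] += 0.4           # within-block term (incl. diagonal)
        if k < K - 1:
            nxt = np.arange((k + 1) * size, (k + 2) * size)
            i_k = blk[-1]                      # maximum index in I_k
            Sig[i_k, nxt] += 0.4               # overlap with block k + 1
            Sig[nxt, i_k] += 0.4
    return Sig
```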

Table 3. Comparison of the three regularized estimators for Model 2

                   Frobenius norm  Spectral norm  False positive  True positive  Negative eigenvalues  Positive definiteness
p = 100
Soft thresholding   9.81 (0.07)    4.87 (0.05)    29.5 (0.0)      97.2 (0.0)     1.54 (0.14)            19/100
Our method          9.78 (0.07)    4.85 (0.05)    30.2 (0.0)      97.3 (0.0)     0.00 (0.00)           100/100
Rothman's method    9.78 (0.07)    4.85 (0.05)    30.0 (0.0)      97.3 (0.0)     0.00 (0.00)           100/100
p = 200
Soft thresholding  15.95 (0.12)    5.90 (0.06)    17.1 (0.4)      94.1 (0.3)     3.93 (0.27)             7/100
Our method         15.81 (0.12)    5.84 (0.06)    18.8 (0.3)      95.0 (0.3)     0.00 (0.00)           100/100
Rothman's method   15.83 (0.12)    5.85 (0.06)    18.3 (0.4)      94.6 (0.3)     0.00 (0.00)           100/100
p = 500
Soft thresholding  29.46 (0.18)    6.92 (0.07)     7.6 (0.1)      87.7 (0.5)     3.84 (0.78)             4/100
Our method         29.17 (0.20)    6.84 (0.06)     8.7 (0.2)      88.8 (0.6)     0.00 (0.00)           100/100
Rothman's method   NA              NA              NA              NA             NA                     NA

NOTE: Each metric is averaged over 100 replications with the standard error shown in brackets. NA means that the results for $\breve\Sigma^+$ (Rothman's method) are not available due to the extremely long run times.
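Throughout, the thresholding parameter is chosen by V-fold cross-validation over the grid $\lambda \in \{0.01, \ldots, 0.99\}$. A minimal sketch of one reasonable implementation follows; the particular CV loss (Frobenius distance between the estimate fit on training folds and the held-out fold's sample covariance) is in the spirit of Bickel and Levina (2008b) but is our own construction, as the paper does not spell out this detail:

```python
import numpy as np

def cv_choose_lambda(X, fit, lambdas, n_folds=5, seed=0):
    """Pick lambda minimizing the average Frobenius distance between the
    estimator fit on training folds and the validation-fold sample
    covariance. `fit(S, lam)` returns a covariance estimate from S."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    scores = np.zeros(len(lambdas))
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        S_tr = np.cov(X[train], rowvar=False)
        S_va = np.cov(X[fold], rowvar=False)
        for j, lam in enumerate(lambdas):
            scores[j] += np.linalg.norm(fit(S_tr, lam) - S_va, "fro") ** 2
    return lambdas[int(np.argmin(scores))]

# e.g., lam = cv_choose_lambda(X, admm_pd_cov, np.arange(0.01, 1.0, 0.01))
```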

To show the advantage of our method over Rothman's proposal, we further consider two gene expression datasets: one from a small round blue-cell tumors microarray experiment (Khan et al. 2001) and the other from a cardiovascular microarray study (Efron 2009, 2010). The first dataset has 64 training tissue samples with four types of tumors (23 EWS, 8 BL-NHL, 12 NB, and 21 RMS) and 6567 gene expression values for each sample. We applied the prefiltering step used in Khan et al. (2001) and then picked the top 40 and bottom 160 genes based on the F-statistic, as done in Rothman, Levina, and Zhu (2009). The second dataset has 63 subjects, with 44 healthy controls and 19 cardiovascular patients, and 20,426 genes measured for each subject. We used the F-statistic to pick the top 50 and bottom 150 genes. By doing so, it is expected that there is weak dependence between the top and the bottom genes. We considered the soft-thresholding estimator (Bickel and Levina 2008b), the log-barrier estimator (Rothman 2012), and our estimator. For all three estimators, the thresholding parameter was chosen by fivefold cross-validation.

As evidenced in Figure 2, the soft-thresholding estimator yields an indefinite matrix for both real examples, whereas the other two regularized estimators guarantee positive definiteness. The soft-thresholding estimator contains 37 negative eigenvalues in the small round blue-cell data and 46 negative eigenvalues in the cardiovascular data. Regularized correlation matrix estimation has a natural application in clustering when the dissimilarity measure is constructed using the correlation among features. For both datasets, we did hierarchical clustering using the three regularized estimators; a sketch of this use is given below. The heat maps are shown in Figure 3, in which the estimated sparsity pattern well matches the expected sparsity pattern.

Finally, we compared the average run times over five cross-validations for both $\hat\Sigma^+$ and $\breve\Sigma^+$, as shown in Table 4. It is obvious that our proposal is much more efficient.

Table 4. Total time (in seconds) for computing a solution path with 99 thresholding parameters. Timing was carried out on an AMD 2.8 GHz processor.

                   Blue-cell data   Cardiovascular data
Our method               74.7               66.3
Rothman's method       1302.7             1575.3
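As an illustration of the clustering use just described, one can feed a correlation-based dissimilarity into standard hierarchical clustering. The sketch below uses SciPy; the $1 - |r_{ij}|$ dissimilarity is a common choice that we adopt for illustration, not a detail fixed by the paper:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

def order_by_clustering(R):
    """Order variables by hierarchical clustering of the estimated
    correlation matrix R, using 1 - |r_ij| as the dissimilarity."""
    D = 1.0 - np.abs(R)
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="average")
    return leaves_list(Z)   # permutation for reordering heat-map rows/cols

# order = order_by_clustering(R_hat)
# then plot np.abs(R_hat)[order][:, order] as a heat map
```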



Figure 2. Plots of the bottom 50 eigenvalues of all regularized estimators for (A) the small round blue-cell data and (B) the cardiovascular data: $\hat\Sigma$ (dashed line), $\hat\Sigma^+$ (solid line), and $\breve\Sigma^+$ (dotted line). [Eigenvalue plots not reproduced here.]

Figure 3. Heat maps of the absolute values of the three regularized sample correlation matrix estimators for (A) the small round blue-cell data and (B) the cardiovascular data: $\hat\Sigma$ (A1, B1), $\hat\Sigma^+$ (A2, B2), and $\breve\Sigma^+$ (A3, B3). The genes are ordered by hierarchical clustering using the estimated correlations.

3.2 Applications to Markowitz Portfolio Selection

To further support our proposal and illustrate the importance of positive definiteness, we consider the celebrated Markowitz portfolio selection problem (Markowitz 1952) in finance, which constructs the optimal mean-variance efficient portfolio by solving the following quadratic optimization problem:
$$\hat w = \arg\min_{w\in\mathbb{R}^p} w'\Sigma w \quad \text{such that } w'\mu = \mu_P,\ w'e = 1, \quad (11)$$
where $e$ is the p-dimensional vector whose entries are all equal to 1. The practical solution to the Markowitz problem is often obtained by solving the empirical version of (11), where the true covariance matrix $\Sigma$ is replaced with the sample covariance matrix. In the recent literature, many researchers have revisited the classic Markowitz problem under the high-dimensional setting (Jagannathan and Ma 2003; Brodie et al. 2009; DeMiguel et al. 2009; El Karoui 2010; Fan, Zhang, and Yu 2012). In particular, El Karoui (2010) provided a detailed theoretical analysis to show the undesirable risk underestimation issue in the high-dimensional Markowitz problem (11) when $\Sigma$ is simply estimated by the sample covariance matrix $\hat\Sigma_n$. El Karoui (2010) further suggested using the thresholding covariance estimator to deal with this challenge.

Note that if any empirical estimator of $\Sigma$ is used in the Markowitz problem, the estimator must be positive definite; otherwise, the corresponding optimization problem is ill defined. To elaborate, we provide an empirical study to evaluate the performance of the Markowitz problem when $\Sigma$ is estimated by the sample covariance estimator, the simple thresholding covariance estimator, and our proposal, respectively. We considered the monthly stock return data of companies in the S&P 100 index from January 1990 to December 2007. We disregarded 16 companies that were listed after 1990 and only used the other p = 84 companies in the S&P 100 index. We followed Jagannathan and Ma (2003) and Brodie et al. (2009) to focus on the Markowitz problem with no-shortsale constraints, that is,
$$\min_{w\in\mathbb{R}^p} w'\Sigma w \quad \text{such that } w \ge 0,\ w'e = 1. \quad (12)$$
For ease of notation, we denote the Markowitz problem (12) with $\Sigma = \hat\Sigma_n$ as (P1) and the Markowitz problem (12) with $\Sigma = \hat\Sigma^+$ as (P2). The thresholding parameter in $\hat\Sigma^+$ is chosen by threefold cross-validation. In the sequel, we consider the following strategy to construct two Markowitz portfolios using n = 36 historical monthly returns for each month from 1993 to 2007. At the beginning of each month, we use the past 36 months' stock return data to construct portfolios (P1) and (P2), and only the current month's returns are recorded for (P1) and (P2). Then we repeat the above process for the next month. There are 180 monthly predictions in total.

We compared the performance of (P1) and (P2) by computing the average monthly return, the standard deviation of monthly returns, and the Sharpe ratio for the time period 1993-2007. As shown in Table 5, (P2) significantly outperforms (P1) in terms of both returns and volatility. We also applied the log-barrier estimator to the Markowitz problem (12) with $\Sigma = \breve\Sigma^+$, and we found that the outcomes are similar to those of (P2), but the log-barrier method required about 15-20 times more computing time than (P2). Moreover, the results of the simple soft-thresholding estimator are not reported because, for the majority of the time period (137 out of 180 months), the soft-thresholding estimator is not positive definite and thus cannot be used in the Markowitz problem (12) to produce a portfolio.
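A minimal sketch of solving the no-shortsale problem (12) for a given positive-definite covariance estimate follows. The solver choice (scipy.optimize.minimize with SLSQP) is ours for illustration, not the one used in the paper:

```python
import numpy as np
from scipy.optimize import minimize

def min_variance_portfolio(Sigma):
    """Solve (12): min w' Sigma w  subject to  w >= 0, sum(w) = 1."""
    p = Sigma.shape[0]
    w0 = np.full(p, 1.0 / p)
    res = minimize(
        lambda w: w @ Sigma @ w,
        w0,
        jac=lambda w: 2.0 * Sigma @ w,
        bounds=[(0.0, None)] * p,                               # no short sales
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
        method="SLSQP",
    )
    return res.x

# (P2): w = min_variance_portfolio(admm_pd_cov(S_n, lam));
# the next month's realized portfolio return is w @ r_next.
```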

Table 5. Comparison of the two Markowitz portfolio selection methods (P1) and (P2) in the monthly stock return data.

Time period: Jan. 1993-Dec. 2007
        Mean    Std. Dev.    Sharpe ratio
P1      1.02      3.66          27.73
P2      1.13      3.42          33.06

NOTE: Monthly mean returns, standard deviations of monthly returns, and corresponding Sharpe ratios are all expressed in %.

4. THEORETICAL PROPERTIES

4.1 Convergence Analysis of the Algorithm

In this section, we prove that the sequence $(\Theta^i, \Sigma^i, \Lambda^i)$ produced by the alternating direction method (Algorithm 1) converges to $(\hat\Theta^+, \hat\Sigma^+, \hat\Lambda^+)$, where $(\hat\Theta^+, \hat\Sigma^+)$ is an optimal solution of (5) and $\hat\Lambda^+$ is the optimal dual variable. This automatically implies that Algorithm 1 gives an optimal solution of (3).

We define some necessary notation for ease of presentation. Let $G$ be a $2p \times 2p$ matrix defined as
$$G = \begin{pmatrix} \mu I_{p\times p} & 0 \\ 0 & \frac{1}{\mu} I_{p\times p} \end{pmatrix}.$$
Define the norm $\|\cdot\|_G$ by $\|U\|_G^2 = \langle U, GU\rangle$ and the corresponding inner product $\langle\cdot,\cdot\rangle_G$ by $\langle U, V\rangle_G = \langle U, GV\rangle$. Before we give the main theorem about the global convergence of Algorithm 1, we need the following lemma.

Lemma 1. Assume that $(\hat\Theta^+, \hat\Sigma^+)$ is an optimal solution of (5) and $\hat\Lambda^+$ is the corresponding optimal dual variable associated with the equality constraint $\Sigma = \Theta$. Then the sequence $\{(\Theta^i, \Sigma^i, \Lambda^i)\}$ produced by Algorithm 1 satisfies
$$\|U^i - U^*\|_G^2 - \|U^{i+1} - U^*\|_G^2 \ge \|U^i - U^{i+1}\|_G^2, \quad (13)$$
where $U^* = (\hat\Lambda^+, \hat\Sigma^+)^{\mathrm T}$ and $U^i = (\Lambda^i, \Sigma^i)^{\mathrm T}$.

Now we are ready to give the main convergence result of Algorithm 1.

Theorem 1. The sequence $\{(\Theta^i, \Sigma^i, \Lambda^i)\}$ produced by Algorithm 1 from any starting point converges to an optimal solution of (5).

4.2 Statistical Analysis of the Estimator

Define $\Sigma^0$ as the true covariance matrix for the observations $X = (X_{ij})_{n\times p}$, and define the active set of $\Sigma^0 = (\sigma^0_{jk})_{1\le j,k\le p}$ as $A_0 = \{(j,k): \sigma^0_{jk} \ne 0,\ j \ne k\}$ with cardinality $s = |A_0|$. Denote by $B_{A_0}$ the Hadamard product $B_{p\times p} \circ (I_{\{(j,k)\in A_0\}})_{1\le j,k\le p} = (b_{jk}\cdot I_{\{(j,k)\in A_0\}})_{1\le j,k\le p}$. Define $\sigma_{\max} = \max_j \sigma^0_{jj}$ as the maximal true variance in $\Sigma^0$.

Theorem 2. Assume that the true covariance matrix $\Sigma^0$ is positive definite.

(a) Under the exponential-tail condition that, for all $|t| \le \eta$ and $1 \le i \le n$, $1 \le j \le p$,
$$E\big[\exp\big(tX_{ij}^2\big)\big] \le K_1,$$
we also assume that $\log p \le n$. For any $M > 0$, we pick the thresholding parameter as
$$\lambda = c_0^2\,\frac{\log p}{n} + c_1\left(\frac{\log p}{n}\right)^{1/2},$$
where
$$c_0 = \tfrac12 eK_1\eta^{1/2} + \eta^{-1/2}(M+1)$$
and
$$c_1 = 2K_1\left(\eta^{-1} + \tfrac14\eta\sigma_{\max}^2\right)\exp\left(\tfrac12\eta\sigma_{\max}\right) + 2\eta^{-1}(M+2).$$
With probability at least $1 - 3p^{-M}$, we have
$$\|\hat\Sigma^+ - \Sigma^0\|_F \le 5\lambda(s+p)^{1/2}.$$

(b) Under the polynomial-tail condition that, for all $\gamma > 0$, $\varepsilon > 0$, and $1 \le i \le n$, $1 \le j \le p$,
$$E\big[|X_{ij}|^{4(1+\gamma+\varepsilon)}\big] \le K_2,$$
we also assume that $p \le cn^\gamma$ for some $c > 0$. For any $M > 0$, we pick the thresholding parameter as
$$\lambda = 8(K_2+1)(M+1)\,\frac{\log p}{n} + 8(K_2+1)(M+2)\left(\frac{\log p}{n}\right)^{1/2}.$$
With probability at least $1 - O(p^{-M}) - 3K_2\,p(\log n)^{2(1+\gamma+\varepsilon)} n^{-\gamma-\varepsilon}$, we have
$$\|\hat\Sigma^+ - \Sigma^0\|_F \le 5\lambda(s+p)^{1/2}.$$

Define $d = \max_j \sum_k I_{\{\sigma_{jk}\ne 0\}}$ and assume that $\sigma_{\max}$ is bounded by a fixed constant; then we can pick $\lambda = O((\log p/n)^{1/2})$ to achieve the minimax optimal rate of convergence under the Frobenius norm, as in Theorem 4 of Cai and Zhou (2012b):
$$\frac1p\|\hat\Sigma^+ - \Sigma^0\|_F^2 = O_p\left(\left(\frac{s}{p} + 1\right)\frac{\log p}{n}\right) = O_p\left(d\,\frac{\log p}{n}\right).$$
However, to attain the same rate in the presence of the log-determinant barrier term, Rothman (2012) instead would require that $\sigma_{\min}$, the minimal eigenvalue of the true covariance matrix, be bounded away from zero by some positive constant, and also that the barrier parameter be bounded by some positive quantity. We would like to point out that if $\sigma_{\min}$ is bounded away from zero, then the soft-thresholding estimator will be positive definite with an overwhelming probability tending to 1 (Bickel and Levina 2008b; Cai and Zhou 2012a, b). Therefore, the theory requiring a lower bound on $\sigma_{\min}$ is not very appealing.

5. DISCUSSION

The soft-thresholding estimator has been shown to enjoy good asymptotic properties for estimating large sparse covariance matrices, but its positive definiteness can be easily violated in practice, which prevents its use in many important applications such as quadratic discriminant analysis and Markowitz portfolio selection. In this article, we have put the soft-thresholding estimator in a convex optimization framework

and considered a natural modification by imposing the positive definiteness constraint. We have developed a fast alternating direction method to solve the constrained optimization problem, and the resulting estimator retains the sparsity and positive definiteness properties simultaneously. The algorithm and the new estimator are supported by numerical and theoretical results.

The log-determinant barrier method is also a valid technique to achieve positive definiteness in sparse covariance matrix estimation. However, it is still unclear whether the iterative procedure proposed by Rothman (2012) actually finds the right solution to the log-determinant barrier perturbed criterion in (4), although the code seems to produce a reasonably good estimator. We have clearly demonstrated the computational advantage of our method. We have shown that, unlike in Rothman (2012), our theory does not require a lower bound on the smallest eigenvalue of the true covariance matrix. Thus our theory is practically more relevant.

We would also like to argue that the main idea behind our method is much more flexible than the log-determinant barrier method, in the sense that the former can be easily extended to compute the positive-definite $\ell_1$-penalized covariance estimator under some additional constraints. For illustration, let us consider the scenario when we have the extra prior information that the eigenvalues of the true covariance matrix are bounded between two positive constants $\alpha$ and $\beta$ (d'Aspremont, Banerjee, and Ghaoui 2008; Lu 2010). Then we solve the following constrained optimization problem:
$$\hat\Sigma^+_{\alpha,\beta} = \arg\min_{\beta I \succeq \Sigma \succeq \alpha I}\ \tfrac12\|\Sigma - \hat\Sigma_n\|_F^2 + \lambda|\Sigma|_1. \quad (14)$$
Define $(Z)_{\alpha,\beta}$ as the projection of a matrix $Z$ onto the convex cone $\{\beta I \succeq \Theta \succeq \alpha I\}$, that is, $(Z)_{\alpha,\beta} = \sum_{j=1}^p \min(\max(\lambda_j, \alpha), \beta)\, v_j v_j^{\mathrm T}$ when $Z$ has the eigen-decomposition $\sum_{j=1}^p \lambda_j v_j v_j^{\mathrm T}$. Hence, an efficient alternating direction algorithm can be obtained from Algorithm 1 by simply modifying its $\Theta$ step as
$$\Theta^{i+1} = \arg\min_{\beta I \succeq \Theta \succeq \alpha I} L(\Theta, \Sigma^i; \Lambda^i) = (\Sigma^i + \mu\Lambda^i)_{\alpha,\beta};$$
a sketch follows. However, if we apply the log-determinant barrier method to solve the new problem, the corresponding optimization criterion becomes
$$\breve\Sigma^+_{\alpha,\beta} = \arg\min_{\beta I \succ \Sigma \succ \alpha I}\ \tfrac12\|\Sigma - \hat\Sigma_n\|_F^2 + \lambda|\Sigma|_1 - \tau_1\log\det(\Sigma - \alpha I) - \tau_2\log\det(\beta I - \Sigma). \quad (15)$$
It is not clear how to extend the heuristic iterative procedure in Rothman (2012) to handle the above optimization problem, which is considerably more complex than (4).
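The modified $\Theta$ step is a one-line change to the earlier sketch (the function name proj_box is our hypothetical choice):

```python
import numpy as np

def proj_box(Z, alpha, beta):
    """Projection (Z)_{alpha,beta} onto {beta*I >= Theta >= alpha*I}:
    clip the eigenvalues of Z to the interval [alpha, beta]."""
    w, V = np.linalg.eigh(Z)
    return (V * np.clip(w, alpha, beta)) @ V.T

# Modified Theta step in Algorithm 1:
# Theta = proj_box(Sigma + mu * Lam, alpha, beta)
```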

APPENDIX: TECHNICAL PROOFS

Proof of Lemma 1. Since $(\hat\Theta^+, \hat\Sigma^+, \hat\Lambda^+)$ is optimal to (5), it follows from the Karush-Kuhn-Tucker (KKT) conditions that the following hold:
$$\tfrac{1}{\lambda}\big({-\hat\Lambda^+} - \hat\Sigma^+ + \hat\Sigma_n\big)_{j\ell} \in \partial|\hat\Sigma^+_{j\ell}|, \quad \forall\, j \ne \ell, \quad (A.1)$$
$$\big(\hat\Sigma^+ - \hat\Sigma_n\big)_{jj} + \hat\Lambda^+_{jj} = 0, \quad \forall\, j = 1, \ldots, p, \quad (A.2)$$
$$\hat\Theta^+ = \hat\Sigma^+, \quad (A.3)$$
$$\hat\Theta^+ \succeq \epsilon I, \quad (A.4)$$
$$\langle \hat\Lambda^+,\ \Theta - \hat\Theta^+\rangle \le 0, \quad \forall\, \Theta \succeq \epsilon I. \quad (A.5)$$
Note that the optimality conditions for the first subproblem in Algorithm 1, that is, the subproblem with respect to $\Theta$ in (8), are given by
$$\big\langle \Lambda^i - \tfrac{1}{\mu}(\Theta^{i+1} - \Sigma^i),\ \Theta - \Theta^{i+1}\big\rangle \le 0, \quad \forall\, \Theta \succeq \epsilon I. \quad (A.6)$$
Using the updating formula for $\Lambda^i$ in Algorithm 1, that is,
$$\Lambda^{i+1} = \Lambda^i - \tfrac{1}{\mu}(\Theta^{i+1} - \Sigma^{i+1}), \quad (A.7)$$
(A.6) can be rewritten as
$$\big\langle \Lambda^{i+1} - \tfrac{1}{\mu}(\Sigma^{i+1} - \Sigma^i),\ \Theta - \Theta^{i+1}\big\rangle \le 0, \quad \forall\, \Theta \succeq \epsilon I. \quad (A.8)$$
Now, by letting $\Theta = \Theta^{i+1}$ in (A.5) and $\Theta = \hat\Theta^+$ in (A.8), we get
$$\langle \hat\Lambda^+,\ \Theta^{i+1} - \hat\Theta^+\rangle \le 0 \quad (A.9)$$
and
$$\big\langle \Lambda^{i+1} - \tfrac{1}{\mu}(\Sigma^{i+1} - \Sigma^i),\ \hat\Theta^+ - \Theta^{i+1}\big\rangle \le 0. \quad (A.10)$$
Summing (A.9) and (A.10) yields
$$\big\langle \Theta^{i+1} - \hat\Theta^+,\ (\Lambda^{i+1} - \hat\Lambda^+) + \tfrac{1}{\mu}(\Sigma^i - \Sigma^{i+1})\big\rangle \ge 0. \quad (A.11)$$
The optimality conditions for the second subproblem in Algorithm 1, that is, the subproblem with respect to $\Sigma$ in (9), are given by
$$0 \in \big(\Sigma^{i+1} - \hat\Sigma_n\big)_{j\ell} + \lambda\,\partial|\Sigma^{i+1}_{j\ell}| + \Lambda^i_{j\ell} + \tfrac{1}{\mu}\big(\Sigma^{i+1} - \Theta^{i+1}\big)_{j\ell}, \quad \forall\, j \ne \ell, \quad (A.12)$$
and
$$\big(\Sigma^{i+1} - \hat\Sigma_n\big)_{jj} + \Lambda^i_{jj} + \tfrac{1}{\mu}\big(\Sigma^{i+1} - \Theta^{i+1}\big)_{jj} = 0, \quad \forall\, j = 1, \ldots, p. \quad (A.13)$$
Note that, by using (A.7), (A.12) and (A.13) can be, respectively, rewritten as
$$\tfrac{1}{\lambda}\big({-\Lambda^{i+1}} - \Sigma^{i+1} + \hat\Sigma_n\big)_{j\ell} \in \partial|\Sigma^{i+1}_{j\ell}|, \quad \forall\, j \ne \ell, \quad (A.14)$$
and
$$\big(\Sigma^{i+1} - \hat\Sigma_n\big)_{jj} + \Lambda^{i+1}_{jj} = 0, \quad \forall\, j = 1, \ldots, p. \quad (A.15)$$
Using the fact that $\partial|\cdot|$ is a monotone operator, (A.1), (A.2), (A.14), and (A.15) imply
$$\big\langle \Sigma^{i+1} - \hat\Sigma^+,\ (\hat\Lambda^+ - \Lambda^{i+1}) + (\hat\Sigma^+ - \Sigma^{i+1})\big\rangle \ge 0. \quad (A.16)$$
The summation of (A.11) and (A.16) gives
$$\|\Sigma^{i+1} - \hat\Sigma^+\|_F^2 \le \langle \Sigma^{i+1} - \hat\Sigma^+,\ \hat\Lambda^+ - \Lambda^{i+1}\rangle + \langle \hat\Theta^+ - \Theta^{i+1},\ \hat\Lambda^+ - \Lambda^{i+1}\rangle + \tfrac{1}{\mu}\langle \hat\Theta^+ - \Theta^{i+1},\ \Sigma^{i+1} - \Sigma^i\rangle. \quad (A.17)$$
Combining (A.17) with $\Theta^{i+1} = \mu(\Lambda^i - \Lambda^{i+1}) + \Sigma^{i+1}$ (from (A.7)) and $\hat\Theta^+ = \hat\Sigma^+$, a direct calculation yields
$$\mu\langle \Lambda^i - \hat\Lambda^+,\ \Lambda^i - \Lambda^{i+1}\rangle + \tfrac{1}{\mu}\langle \Sigma^i - \hat\Sigma^+,\ \Sigma^i - \Sigma^{i+1}\rangle \ge \mu\|\Lambda^i - \Lambda^{i+1}\|_F^2 + \tfrac{1}{\mu}\|\Sigma^i - \Sigma^{i+1}\|_F^2 + \|\Sigma^{i+1} - \hat\Sigma^+\|_F^2 - \langle \Lambda^i - \Lambda^{i+1},\ \Sigma^i - \Sigma^{i+1}\rangle. \quad (A.20)$$
Using the notation of $U^i$ and $U^*$, (A.20) can be rewritten as
$$\langle U^i - U^*,\ U^i - U^{i+1}\rangle_G \ge \|U^i - U^{i+1}\|_G^2 + \|\Sigma^{i+1} - \hat\Sigma^+\|_F^2 - \langle \Lambda^i - \Lambda^{i+1},\ \Sigma^i - \Sigma^{i+1}\rangle. \quad (A.21)$$
Combining (A.21) with the identity
$$\|U^{i+1} - U^*\|_G^2 = \|U^{i+1} - U^i\|_G^2 - 2\langle U^i - U^{i+1},\ U^i - U^*\rangle_G + \|U^i - U^*\|_G^2,$$
we get
$$\|U^i - U^*\|_G^2 - \|U^{i+1} - U^*\|_G^2 = 2\langle U^i - U^*,\ U^i - U^{i+1}\rangle_G - \|U^i - U^{i+1}\|_G^2 \ge \|U^i - U^{i+1}\|_G^2 + 2\|\Sigma^{i+1} - \hat\Sigma^+\|_F^2 - 2\langle \Lambda^i - \Lambda^{i+1},\ \Sigma^i - \Sigma^{i+1}\rangle. \quad (A.22)$$
Now, using (A.14) and (A.15) with $i$ in place of $i+1$, we get
$$\tfrac{1}{\lambda}\big({-\Lambda^i} - \Sigma^i + \hat\Sigma_n\big)_{j\ell} \in \partial|\Sigma^i_{j\ell}|, \quad \forall\, j \ne \ell, \quad (A.23)$$
and
$$\big(\Sigma^i - \hat\Sigma_n\big)_{jj} + \Lambda^i_{jj} = 0, \quad \forall\, j = 1, \ldots, p. \quad (A.24)$$
Combining (A.14), (A.15), (A.23), and (A.24) and using again the monotonicity of $\partial|\cdot|$, we obtain $\langle \Sigma^i - \Sigma^{i+1},\ \Lambda^{i+1} - \Lambda^i + \Sigma^{i+1} - \Sigma^i\rangle \ge 0$, which immediately implies
$$\langle \Sigma^i - \Sigma^{i+1},\ \Lambda^{i+1} - \Lambda^i\rangle \ge \|\Sigma^{i+1} - \Sigma^i\|_F^2 \ge 0. \quad (A.25)$$
By substituting (A.25) into (A.22), we get the desired result (13).

Proof of Theorem 1. From Lemma 1, we can easily get that
(a) $\|U^i - U^{i+1}\|_G \to 0$;
(b) $\{U^i\}$ lies in a compact region;
(c) $\|U^i - U^*\|_G^2$ is monotonically nonincreasing and thus converges.
It follows from (a) that $\Lambda^i - \Lambda^{i+1} \to 0$ and $\Sigma^i - \Sigma^{i+1} \to 0$. Then (A.7) implies that $\Theta^i - \Theta^{i+1} \to 0$ and $\Theta^i - \Sigma^i \to 0$. From (b), we obtain that $U^i$ has a subsequence $\{U^{i_j}\}$ that converges to $\bar U = (\bar\Lambda, \bar\Sigma)$, that is, $\Lambda^{i_j} \to \bar\Lambda$ and $\Sigma^{i_j} \to \bar\Sigma$. From $\Theta^i - \Sigma^i \to 0$, we also get that $\Theta^{i_j} \to \bar\Theta := \bar\Sigma$. Therefore, $(\bar\Theta, \bar\Sigma, \bar\Lambda)$ is a limit point of $\{(\Theta^i, \Sigma^i, \Lambda^i)\}$.
Note that (A.14) and (A.15), respectively, imply that
$$\tfrac{1}{\lambda}\big({-\bar\Lambda} - \bar\Sigma + \hat\Sigma_n\big)_{j\ell} \in \partial|\bar\Sigma_{j\ell}|, \quad \forall\, j \ne \ell, \quad (A.26)$$
and
$$\big(\bar\Sigma - \hat\Sigma_n\big)_{jj} + \bar\Lambda_{jj} = 0, \quad \forall\, j = 1, \ldots, p, \quad (A.27)$$
and (A.8) implies that
$$\langle \bar\Lambda,\ \Theta - \bar\Theta\rangle \le 0, \quad \forall\, \Theta \succeq \epsilon I. \quad (A.28)$$
(A.26), (A.27), and (A.28), together with $\bar\Theta = \bar\Sigma$, mean that $(\bar\Theta, \bar\Sigma, \bar\Lambda)$ is an optimal solution to (5). Therefore, any limit point of $\{(\Theta^i, \Sigma^i, \Lambda^i)\}$ is an optimal solution to (5). Finally, since Lemma 1 holds with $U^*$ replaced by $\bar U$, the quantity $\|U^i - \bar U\|_G$ is nonincreasing; because a subsequence converges to $\bar U$, the whole sequence converges to $\bar U$, and hence $\{(\Theta^i, \Sigma^i, \Lambda^i)\}$ converges to the optimal solution $(\bar\Theta, \bar\Sigma, \bar\Lambda)$.

Proof of Theorem 2. Without loss of generality, we may always assume that $E(X_{ij}) = 0$ for all $1 \le i \le n$, $1 \le j \le p$. By the condition that $\Sigma^0$ is positive definite, we can always choose some very small $\epsilon > 0$ such that $\epsilon$ is smaller than the minimal eigenvalue of $\Sigma^0$. We introduce $\Delta = \Sigma - \Sigma^0$, and then we can write (3) in terms of $\Delta$ as follows:
$$\hat\Delta = \arg\min_{\Delta:\ \Delta = \Delta^{\mathrm T},\ \Delta + \Sigma^0 \succeq \epsilon I}\ \tfrac12\|\Delta + \Sigma^0 - \hat\Sigma_n\|_F^2 + \lambda|\Delta + \Sigma^0|_1 \quad (\equiv F(\Delta)).$$
Note that it is easy to see that $\hat\Delta = \hat\Sigma^+ - \Sigma^0$.
Now consider $\Delta \in \{\Delta: \Delta = \Delta^{\mathrm T},\ \Delta + \Sigma^0 \succeq \epsilon I,\ \|\Delta\|_F = 5\lambda(s+p)^{1/2}\}$. Under the probability event $\{|\hat\sigma_{ij} - \sigma^0_{ij}| \le \lambda,\ \forall (i,j)\}$, we have
$$F(\Delta) - F(0) = \tfrac12\|\Delta\|_F^2 + \langle \Delta,\ \Sigma^0 - \hat\Sigma_n\rangle + \lambda|\Delta_{A_0^c}|_1 + \lambda\big(|\Delta_{A_0} + \Sigma^0_{A_0}|_1 - |\Sigma^0_{A_0}|_1\big)$$
$$\ge \tfrac12\|\Delta\|_F^2 - \lambda\Big(|\Delta|_1 + \sum_i |\Delta_{ii}|\Big) + \lambda|\Delta_{A_0^c}|_1 - \lambda|\Delta_{A_0}|_1 \ge \tfrac12\|\Delta\|_F^2 - 2\lambda\Big(|\Delta_{A_0}|_1 + \sum_i |\Delta_{ii}|\Big)$$
$$\ge \tfrac12\|\Delta\|_F^2 - 2\lambda(s+p)^{1/2}\|\Delta\|_F \ge \tfrac52\lambda^2(s+p) > 0.$$
Note that $\hat\Delta$ is also the optimal solution to the following convex optimization problem:
$$\hat\Delta = \arg\min_{\Delta:\ \Delta = \Delta^{\mathrm T},\ \Delta + \Sigma^0 \succeq \epsilon I}\ F(\Delta) - F(0) \quad (\equiv G(\Delta)).$$
Under the same probability event, $\|\hat\Delta\|_F \le 5\lambda(s+p)^{1/2}$ must hold; otherwise, the fact that $G(\Delta) > 0$ for $\|\Delta\|_F = 5\lambda(s+p)^{1/2}$ would contradict the convexity of $G(\cdot)$ and $G(\hat\Delta) \le G(0) = 0$. Therefore, we obtain the following probability bound:
$$\Pr\big(\|\hat\Sigma^+ - \Sigma^0\|_F \le 5\lambda(s+p)^{1/2}\big) \ge 1 - \Pr\Big(\max_{i,j}\big|\hat\sigma_{ij} - \sigma^0_{ij}\big| > \lambda\Big).$$


Now we shall prove the probability bound under the exponential-tail condition. First, it is easy to verify two elementary inequalities: $1 + u \le \exp(u) \le 1 + u + \frac12 u^2\exp(|u|)$ and $v^2\exp(|v|) \le \exp(v^2 + 1)$. The first inequality can be proved by using the Taylor expansion, and the second follows from the facts that $\exp(|v|) \ge v^2$ and $2|v| \le v^2 + 1$, so that $v^2\exp(|v|) \le \exp(2|v|) \le \exp(v^2+1)$.

Let $t_0 = (\eta\log p/n)^{1/2}$ and $\varepsilon_0 = c_0(\log p/n)^{1/2}$, with $c_0$ as in Theorem 2(a). For any $M > 0$, the Markov inequality gives
$$\Pr\Big(\sum_{i=1}^n X_{ij} > n\varepsilon_0\Big) \le \exp(-t_0 n\varepsilon_0)\prod_{i=1}^n E\big[\exp(t_0 X_{ij})\big],$$
and applying $\exp(u) \le 1 + u + \frac12 u^2\exp(|u|)$ together with $E[X_{ij}] = 0$, then $v^2\exp(|v|) \le \exp(v^2+1)$ and $1 + u \le \exp(u)$, and finally the facts that $t_0^2 = \eta\log p/n \le \eta$ (recall $\log p \le n$) and $E[\exp(\eta X_{ij}^2)] \le K_1$, shows that the right-hand side is at most $p^{-M-1}$ by the choice of $c_0$.

Next, let $t_1 = \frac12\eta(\log p/n)^{1/2}$ and $\varepsilon_1 = c_1(\log p/n)^{1/2}$, with $c_1$ as in Theorem 2(a). Applying the Cauchy-Schwarz inequality twice, together with $|\sigma^0_{jk}| \le (\sigma^0_{jj}\sigma^0_{kk})^{1/2} \le \sigma_{\max}$ and $t_1 \le \frac12\eta < \eta$, one can bound
$$E\Big[\big(X_{ij}X_{ik} - \sigma^0_{jk}\big)^2\exp\big(t_1\big|X_{ij}X_{ik} - \sigma^0_{jk}\big|\big)\Big] \le 2K_1\big(4\eta^{-2} + \sigma_{\max}^2\big)\exp\Big(\tfrac12\eta\sigma_{\max}\Big).$$
Then the Markov inequality yields
$$\Pr\Big(\Big|\sum_{i=1}^n\big(X_{ij}X_{ik} - \sigma^0_{jk}\big)\Big| > n\varepsilon_1\Big) \le \exp(-t_1 n\varepsilon_1)\prod_{i=1}^n E\Big[\exp\big(t_1(X_{ij}X_{ik} - \sigma^0_{jk})\big)\Big] \le p^{-M-2}$$
by the choice of $c_1$.

Recall that $\lambda = c_0^2\,\frac{\log p}{n} + c_1\big(\frac{\log p}{n}\big)^{1/2} = \varepsilon_0^2 + \varepsilon_1$ and that
$$\hat\sigma_{jk} - \sigma^0_{jk} = \frac1n\sum_{i=1}^n\big(X_{ij}X_{ik} - \sigma^0_{jk}\big) - \Big(\frac1n\sum_{i=1}^n X_{ij}\Big)\Big(\frac1n\sum_{i=1}^n X_{ik}\Big).$$
Therefore, we can complete the probability bound under the exponential-tail condition as follows:
$$\Pr\Big(\max_{j,k}\big|\hat\sigma_{jk} - \sigma^0_{jk}\big| > \lambda\Big) \le p^2\Pr\Big(\Big|\sum_{i=1}^n\big(X_{ij}X_{ik} - \sigma^0_{jk}\big)\Big| > n\varepsilon_1\Big) + 2p\Pr\Big(\Big|\sum_{i=1}^n X_{ij}\Big| > n\varepsilon_0\Big) \le 3p^{-M}.$$

In the sequel, we shall prove the probability bound under the polynomial-tail condition. First, we define $c_2 = 8(K_2+1)(M+1)$ and $\varepsilon_2 = (c_2\log p/n)^{1/2}$. Define $\delta_n = n^{1/4}(\log n)^{-1/2}$, $Y_{ij} = X_{ij}I_{\{|X_{ij}|\le\delta_n\}}$, and $Z_{ij} = X_{ij}I_{\{|X_{ij}|>\delta_n\}}$. Then we have $X_{ij} = Y_{ij} + Z_{ij}$ and $E[X_{ij}] = E[Y_{ij}] + E[Z_{ij}]$. By construction, the $|Y_{ij}| \le \delta_n$ are bounded random variables, and $E[Z_{ij}]$ is bounded by $o(\varepsilon_2)$ due to the fact that
$$|E[Z_{ij}]| \le \delta_n^{-3}E\big[|X_{ij}|^4 I_{\{|X_{ij}|>\delta_n\}}\big] \le K_2\delta_n^{-3} = o(\varepsilon_2).$$
Now we can apply Bernstein's inequality (Bernstein 1946; Bennett 1962) to obtain that
$$\Pr\Big(\Big|\sum_{i=1}^n \{Y_{ij} - E[Y_{ij}]\}\Big| > \tfrac12 n\varepsilon_2\Big) \le \exp\left(\frac{-n\varepsilon_2^2}{8\,\mathrm{var}(Y_{ij}) + \tfrac43\delta_n\varepsilon_2}\right) \le \exp\left(\frac{-c_2\log p}{8K_2 + 8 + O(n^{-1/4})}\right) = O(p^{-M-1}),$$
where the fact that $\mathrm{var}(Y_{ij}) \le E[X_{ij}^2] \le E\big[X_{ij}^2 I_{\{|X_{ij}|\ge 1\}}\big] + E\big[X_{ij}^2 I_{\{|X_{ij}|\le 1\}}\big] \le K_2 + 1$ is used in the second inequality. Besides, we can apply the Markov inequality to obtain that
$$\Pr(|X_{ij}| > \delta_n) \le \delta_n^{-4(1+\gamma+\varepsilon)}E\big[|X_{ij}|^{4(1+\gamma+\varepsilon)}\big] \le K_2(\log n)^{2(1+\gamma+\varepsilon)} n^{-1-\gamma-\varepsilon}.$$

Then, we can derive the following probability bound:
$$\Pr\Big(\Big|\sum_{i=1}^n X_{ij}\Big| > n\varepsilon_2\Big) = \Pr\Big(\Big|\sum_{i=1}^n\{Y_{ij} + Z_{ij} - E[Y_{ij} + Z_{ij}]\}\Big| > n\varepsilon_2\Big)$$
$$\le \Pr\Big(\Big|\sum_{i=1}^n\{Y_{ij} - E[Y_{ij}]\}\Big| > \tfrac12 n\varepsilon_2\Big) + \Pr\Big(\Big|\sum_{i=1}^n\{Z_{ij} - E[Z_{ij}]\}\Big| > \tfrac12 n\varepsilon_2\Big)$$
$$\le O(p^{-M-1}) + \sum_{i=1}^n\Pr(|X_{ij}| > \delta_n) \le O(p^{-M-1}) + K_2(\log n)^{2(1+\gamma+\varepsilon)} n^{-\gamma-\varepsilon}.$$
Let $c_3 = 8(K_2+1)(M+2)$ and $\varepsilon_3 = (c_3\log p/n)^{1/2}$. Recall that $\delta_n = n^{1/4}(\log n)^{-1/2}$, and define $R_{ijk} = X_{ij}X_{ik}I_{\{|X_{ij}|>\delta_n\ \text{or}\ |X_{ik}|>\delta_n\}}$. Then we have $X_{ij}X_{ik} = Y_{ij}Y_{ik} + R_{ijk}$ and $\sigma^0_{jk} = E[X_{ij}X_{ik}] = E[Y_{ij}Y_{ik}] + E[R_{ijk}]$. By construction, the $|Y_{ij}Y_{ik}| \le \delta_n^2$ are bounded random variables, and $E[R_{ijk}]$ is bounded by $o(\varepsilon_3)$ due to the fact that
$$|E[R_{ijk}]| \le \big|E\big[X_{ij}X_{ik}I_{\{|X_{ij}|>\delta_n\}}\big]\big| + \big|E\big[X_{ij}X_{ik}I_{\{|X_{ik}|>\delta_n\}}\big]\big| \le 2K_2\delta_n^{-2-4\gamma}\ (= o(\varepsilon_3)).$$
Again, we can apply Bernstein's inequality to obtain that
$$\Pr\Big(\Big|\sum_{i=1}^n\{Y_{ij}Y_{ik} - E[Y_{ij}Y_{ik}]\}\Big| > \tfrac12 n\varepsilon_3\Big) \le \exp\left(\frac{-n\varepsilon_3^2}{8K_2 + 8 + \tfrac43\delta_n^2\varepsilon_3}\right) \le \exp\left(\frac{-c_3\log p}{8K_2 + 8 + O((\log n)^{-1/2})}\right) = O(p^{-M-2}),$$
where the fact that $\mathrm{var}(Y_{ij}Y_{ik}) \le E[X_{ij}^2X_{ik}^2] \le (E[X_{ij}^4]E[X_{ik}^4])^{1/2} \le K_2 + 1$ is used. It follows that
$$\Pr\Big(\max_{j,k}\Big|\sum_{i=1}^n\big(X_{ij}X_{ik} - \sigma^0_{jk}\big)\Big| > n\varepsilon_3\Big) \le \sum_{j,k}\Pr\Big(\Big|\sum_{i=1}^n\{Y_{ij}Y_{ik} - E[Y_{ij}Y_{ik}]\}\Big| > \tfrac12 n\varepsilon_3\Big) + \Pr\Big(\max_{j,k}\Big|\sum_{i=1}^n\{R_{ijk} - E[R_{ijk}]\}\Big| > \tfrac12 n\varepsilon_3\Big)$$
$$\le O(p^{-M}) + \sum_{i,j}\Pr(|X_{ij}| > \delta_n) \le O(p^{-M}) + K_2\,p(\log n)^{2(1+\gamma+\varepsilon)} n^{-\gamma-\varepsilon}.$$
Recall that $\lambda = c_2\frac{\log p}{n} + c_3\big(\frac{\log p}{n}\big)^{1/2} \ge \varepsilon_2^2 + \varepsilon_3$. Therefore, we can prove the desired probability bound under the polynomial-tail condition as follows:
$$\Pr\Big(\max_{j,k}\big|\hat\sigma_{jk} - \sigma^0_{jk}\big| > \lambda\Big) \le \Pr\Big(\max_{j,k}\Big|\sum_{i=1}^n\big(X_{ij}X_{ik} - \sigma^0_{jk}\big)\Big| > n\varepsilon_3\Big) + \Pr\Big(\max_j\Big|\sum_{i=1}^n X_{ij}\Big| > n\varepsilon_2\Big) \le O(p^{-M}) + 3K_2\,p(\log n)^{2(1+\gamma+\varepsilon)} n^{-\gamma-\varepsilon}.$$

[Received November 2011. Revised June 2012.]

REFERENCES

Anderson, T. (1984), An Introduction to Multivariate Statistical Analysis, New York: Wiley.
Beer, D., Kardia, S., Huang, C., Giordano, T., Levin, A., Misek, D., Lin, L., Chen, G., Gharib, T., Thomas, D., Lizyness, M. L., Kuick, R., Hayasaka, S., Taylor, J. M., Iannettoni, M. D., Orringer, M. B., and Hanash, S. (2002), "Gene-Expression Profiles Predict Survival of Patients With Lung Adenocarcinoma," Nature Medicine, 8, 816-824.
Bennett, G. (1962), "Probability Inequalities for the Sum of Independent Random Variables," Journal of the American Statistical Association, 57, 33-45.
Bernstein, S. (1946), The Theory of Probabilities, Moscow: Gostekhizdat.
Bickel, P., and Levina, E. (2008a), "Regularized Estimation of Large Covariance Matrices," The Annals of Statistics, 36, 199-227.
--- (2008b), "Covariance Regularization by Thresholding," The Annals of Statistics, 36, 2577-2604.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011), "Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers," Foundations and Trends in Machine Learning, 3, 1-122.
Brodie, J., Daubechies, I., De Mol, C., Giannone, D., and Loris, I. (2009), "Sparse and Stable Markowitz Portfolios," Proceedings of the National Academy of Sciences, 106, 12267-12272.
Cai, T., and Liu, W. (2011), "Adaptive Thresholding for Sparse Covariance Matrix Estimation," Journal of the American Statistical Association, 106, 672-684.
Cai, T., Zhang, C., and Zhou, H. (2010), "Optimal Rates of Convergence for Covariance Matrix Estimation," The Annals of Statistics, 38, 2118-2144.
Cai, T., and Zhou, H. (2012a), "Minimax Estimation of Large Covariance Matrices Under ℓ1-Norm" (with discussion), Statistica Sinica, 22, 1319-1348.
--- (2012b), "Optimal Rates of Convergence for Sparse Covariance Matrix Estimation," The Annals of Statistics, to appear.
d'Aspremont, A., Banerjee, O., and Ghaoui, L. (2008), "First-Order Methods for Sparse Covariance Selection," SIAM Journal on Matrix Analysis and Applications, 30, 56-66.
DeMiguel, V., Garlappi, L., Nogales, F., and Uppal, R. (2009), "A Generalized Approach to Portfolio Optimization: Improving Performance by Constraining Portfolio Norms," Management Science, 55, 798-812.
Douglas, J., and Rachford, H. H. (1956), "On the Numerical Solution of the Heat Conduction Problem in 2 and 3 Space Variables," Transactions of the American Mathematical Society, 82, 421-439.
Efron, B. (2009), "Are a Set of Microarrays Independent of Each Other?" The Annals of Applied Statistics, 3, 922-942.
--- (2010), Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, Cambridge: Cambridge University Press.
El Karoui, N. (2008), "Operator Norm Consistent Estimation of Large Dimensional Sparse Covariance Matrices," The Annals of Statistics, 36, 2717-2756.
--- (2010), "High-Dimensionality Effects in the Markowitz Problem and Other Quadratic Programs With Linear Constraints: Risk Underestimation," The Annals of Statistics, 38, 3487-3566.
Fan, J., and Li, R. (2001), "Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties," Journal of the American Statistical Association, 96, 1348-1360.
Fan, J., Zhang, J., and Yu, K. (2012), "Vast Portfolio Selection With Gross-Exposure Constraints," Journal of the American Statistical Association, 107, 592-606.


Fortin, M., and Glowinski, R. (1983), Augmented Lagrangian Methods: Applications to the Numerical Solution of Boundary-Value Problems, Amsterdam: North-Holland.
Friedman, J., Hastie, T., and Tibshirani, R. (2008), "Sparse Inverse Covariance Estimation With the Graphical Lasso," Biostatistics, 9, 432-441.
Furrer, R., and Bengtsson, T. (2007), "Estimation of High-Dimensional Prior and Posterior Covariance Matrices in Kalman Filter Variants," Journal of Multivariate Analysis, 98, 227-255.
Glowinski, R., and Le Tallec, P. (1989), Augmented Lagrangian and Operator-Splitting Methods in Nonlinear Mechanics, Philadelphia, PA: SIAM.
Jagannathan, R., and Ma, T. (2003), "Risk Reduction in Large Portfolios: Why Imposing the Wrong Constraints Helps," Journal of Finance, 58, 1651-1684.
Johnstone, I. (2001), "On the Distribution of the Largest Eigenvalue in Principal Components Analysis," The Annals of Statistics, 29, 295-327.
Khan, J., Wei, J., Ringnér, M., Saal, L., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C., Peterson, C., and Meltzer, P. S. (2001), "Classification and Diagnostic Prediction of Cancers Using Gene Expression Profiling and Artificial Neural Networks," Nature Medicine, 7, 673-679.
Lu, Z. (2010), "Adaptive First-Order Methods for General Sparse Inverse Covariance Selection," SIAM Journal on Matrix Analysis and Applications, 31, 2000-2016.
Marčenko, V. A., and Pastur, L. A. (1967), "Distribution of Eigenvalues for Some Sets of Random Matrices," Sbornik: Mathematics, 1, 457-483.
Markowitz, H. (1952), "Portfolio Selection," Journal of Finance, 7, 77-91.
Peaceman, D. H., and Rachford, H. H. (1955), "The Numerical Solution of Parabolic and Elliptic Differential Equations," SIAM Journal on Applied Mathematics, 3, 28-41.
Rothman, A. (2012), "Positive Definite Estimators of Large Covariance Matrices," Biometrika, 99, 733-740.
Rothman, A., Levina, E., and Zhu, J. (2009), "Generalized Thresholding of Large Covariance Matrices," Journal of the American Statistical Association, 104, 177-186.
Scheinberg, K., Ma, S., and Goldfarb, D. (2010), "Sparse Inverse Covariance Selection via Alternating Linearization Methods," in Advances in Neural Information Processing Systems. arXiv:1011.0097.
Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., Pomeroy, S. L., Golub, T. R., Lander, E. S., and Mesirov, J. P. (2005), "Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles," Proceedings of the National Academy of Sciences, 102, 15545-15550.
Wu, W., and Pourahmadi, M. (2003), "Nonparametric Estimation of Large Covariance Matrices of Longitudinal Data," Biometrika, 90, 831-844.
Zou, H. (2006), "The Adaptive Lasso and Its Oracle Properties," Journal of the American Statistical Association, 101, 1418-1429.