A Constrained $\ell_1$ Minimization Approach to Sparse Precision Matrix Estimation

Tony Cai, Weidong Liu, and Xi Luo

This article proposes a constrained $\ell_1$ minimization method for estimating a sparse inverse covariance matrix based on a sample of $n$ iid $p$-variate random variables. The resulting estimator is shown to have a number of desirable properties. In particular, the rate of convergence between the estimator and the true $s$-sparse precision matrix under the spectral norm is $s\sqrt{\log p/n}$ when the population distribution has either exponential-type tails or polynomial-type tails. We present convergence rates under the elementwise $\ell_\infty$ norm and Frobenius norm. In addition, we consider graphical model selection. The procedure is easily implemented by linear programming. Numerical performance of the estimator is investigated using both simulated and real data. In particular, the procedure is applied to analyze a breast cancer dataset and is found to perform favorably compared with existing methods.

KEY WORDS: Covariance matrix; Frobenius norm; Gaussian graphical model; Precision matrix; Rate of convergence; Spectral norm.

Tony Cai is Dorothy Silberberg Professor, Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104 (E-mail: [email protected]). Weidong Liu is Faculty Member, Department of Mathematics and Institute of Natural Sciences, Shanghai Jiao Tong University, and Postdoctoral Fellow, Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104. Xi Luo is Postdoctoral Fellow, Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104. The research of Tony Cai and Weidong Liu was supported in part by NSF FRG grant DMS-0854973. We would like to thank the Associate Editor and two referees for their very helpful comments, which have led to a better presentation of the paper.

Journal of the American Statistical Association, June 2011, Vol. 106, No. 494, Theory and Methods. DOI: 10.1198/jasa.2011.tm10155. © 2011 American Statistical Association.

1. INTRODUCTION

Estimation of a covariance matrix and its inverse is an important problem in many areas of statistical analysis; among the many interesting examples are principal components analysis, linear/quadratic discriminant analysis, and graphical models. Stable and accurate covariance estimation is becoming increasingly important in the high-dimensional setting where the dimension $p$ can be much larger than the sample size $n$. In this setting, classical methods and results based on fixed $p$ and large $n$ are no longer applicable. An additional challenge in the high-dimensional setting is the high computational cost. It is important that estimation procedures be computationally effective so that they can be used in high-dimensional applications.

Let $X = (X_1, \ldots, X_p)^T$ be a $p$-variate random vector with covariance matrix $\Sigma_0$ and precision matrix $\Omega_0 := \Sigma_0^{-1}$. Given an independent and identically distributed random sample $\{X_1, \ldots, X_n\}$ from the distribution of $X$, the most natural estimator of $\Sigma_0$ is perhaps

$\Sigma_n = \frac{1}{n}\sum_{k=1}^n (X_k - \bar{X})(X_k - \bar{X})^T$,

where $\bar{X} = n^{-1}\sum_{k=1}^n X_k$. However, $\Sigma_n$ is singular if $p > n$ and thus is unstable for estimating $\Sigma_0$, not to mention that its inverse cannot be used to estimate the precision matrix $\Omega_0$. To estimate the covariance matrix $\Sigma_0$ consistently, special structures are usually imposed, and various estimators have been introduced under these assumptions. When the variables exhibit a certain ordering structure, which is often the case for time series data, Bickel and Levina (2008a) proved that banding the sample covariance matrix leads to a consistent estimator. Cai, Zhang, and Zhou (2010) established the minimax rate of convergence and introduced a rate-optimal tapering estimator. El Karoui (2008) and Bickel and Levina (2008b) proposed thresholding of the sample covariance matrix for estimating a class of sparse covariance matrices and obtained rates of convergence for the thresholding estimators.

Estimation of the precision matrix is more involved due to the lack of a natural pivotal estimator like $\Sigma_n$. Assuming certain ordering structures, methods based on banding the Cholesky factor of the inverse have been proposed and studied (see, e.g., Wu and Pourahmadi 2003; Huang et al. 2006; Bickel and Levina 2008b). Penalized likelihood methods also have been introduced for estimating sparse precision matrices. In particular, the $\ell_1$ penalized normal likelihood estimator and its variants, which we call $\ell_1$-MLE type estimators, have been considered by several authors (see, e.g., Yuan and Lin 2007; d'Aspremont, Banerjee, and El Ghaoui 2008; Friedman, Hastie, and Tibshirani 2008; Rothman et al. 2008). Convergence rates under the Frobenius norm loss were given by Rothman et al. (2008). Yuan (2009) derived the convergence rates for sub-Gaussian distributions. Under more restrictive conditions, such as mutual incoherence or irrepresentable conditions, Ravikumar et al. (2008) obtained the convergence rates in the elementwise $\ell_\infty$ norm and spectral norm. Nonconvex penalties, which are usually computationally more demanding, also have been considered under the same normal likelihood model. For example, Lam and Fan (2009) and Fan, Feng, and Wu (2009) considered penalizing the normal likelihood with the nonconvex SCAD penalty. The main goal is to ameliorate the bias problem due to penalization.

A closely related problem is recovery of the support of the precision matrix, which is strongly connected to the selection of graphical models. To be more specific, let $G = (V, E)$ be a graph representing conditional independence relations between components of $X$. The vertex set $V$ has $p$ components $X_1, \ldots, X_p$, and the edge set $E$ consists of ordered pairs $(i, j)$, where $(i, j) \in E$ if there is an edge between $X_i$ and $X_j$. The edge between $X_i$ and $X_j$ is excluded from $E$ if and only if $X_i$ and $X_j$ are independent given $(X_k, k \neq i, j)$. If $X \sim N(\mu_0, \Sigma_0)$, then the conditional independence between $X_i$ and $X_j$ given the other variables is equivalent to $\omega_{ij}^0 = 0$, where we set $\Omega_0 = (\omega_{ij}^0)$. Thus, for Gaussian distributions, recovering the structure of the graph $G$ is equivalent to estimating the support of the precision matrix (Lauritzen 1996).
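As a small illustration of this equivalence (our own example, not part of the original article), the following R snippet inverts an AR(1)-type covariance matrix; the zero pattern of the resulting precision matrix is exactly the conditional independence graph of the corresponding Gaussian vector.

```r
## Illustration only (not from the paper): zeros of Omega_0 = solve(Sigma_0)
## encode conditional independence for a Gaussian vector.
p <- 5
Sigma0 <- 0.6^abs(outer(1:p, 1:p, "-"))  # AR(1)-type covariance: sigma_ij = 0.6^|i-j|
Omega0 <- solve(Sigma0)
round(Omega0, 3)
## Omega0 is tridiagonal: omega_ij = 0 whenever |i - j| > 1, so X_i and X_j
## are conditionally independent given the rest, and the graph is a chain.
```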

Liu, Lafferty, and Wasserman (2009) recently showed that for a class of non-Gaussian distributions called nonparanormal distributions, the problem of estimating the graph also can be reduced to estimating the precision matrix. In an important article, Meinshausen and Bühlmann (2006) convincingly demonstrated a neighborhood selection approach to recovering the support of $\Omega_0$ in a row-by-row fashion. Yuan (2009) replaced the lasso selection by a Dantzig-type modification, where first the ratios between the off-diagonal elements $\omega_{ij}$ and the corresponding diagonal element $\omega_{ii}$ were estimated for each row $i$ and then the diagonal entries $\omega_{ii}$ were obtained given the estimated ratios. Convergence rates under the matrix $\ell_1$ norm and spectral norm losses were established.

In this article, we study estimation of the precision matrix $\Omega_0$ for both sparse and nonsparse matrices, without restricting to a specific sparsity pattern. We also consider graphical model selection. We introduce a new method of constrained $\ell_1$ minimization for inverse matrix estimation (CLIME). Rates of convergence in the spectral norm, as well as the elementwise $\ell_\infty$ norm and Frobenius norm, are established under weaker assumptions and shown to be faster than those given for the $\ell_1$-MLE estimators when the population distribution has polynomial-type tails. A matrix is called $s$-sparse if there are at most $s$ nonzero elements on each row. We show that when $\Omega_0$ is $s$-sparse and $X$ has either exponential-type or polynomial-type tails, the error between our estimator $\hat{\Omega}$ and $\Omega_0$ satisfies $\|\hat{\Omega} - \Omega_0\|_2 = O_P(s\sqrt{\log p/n})$ and $|\hat{\Omega} - \Omega_0|_\infty = O_P(\sqrt{\log p/n})$, where $\|\cdot\|_2$ and $|\cdot|_\infty$ are the spectral norm and elementwise $\ell_\infty$ norm, respectively. We discuss properties of the CLIME estimator for estimating banded precision matrices. The CLIME method also can be adopted for the selection of graphical models, with an additional thresholding step. The elementwise $\ell_\infty$ norm result is instrumental for graphical model selection.

Besides having desirable theoretical properties, the CLIME estimator is computationally very attractive for high-dimensional data. It can be obtained one column at a time by solving a linear program, and the resulting matrix estimator is formed by combining the vector solutions (after a simple symmetrization). No outer iterations are needed, and the algorithm is easily scalable. An R package of our method has been developed and is publicly available on the Web. Here we investigate the numerical performance of the estimator using both simulated and real data. In particular, we apply the procedure to analyze a breast cancer dataset. The results show that the procedure performs favorably compared with existing methods.

The rest of the article is organized as follows. In Section 2, after introducing basic notations and definitions, we present the CLIME estimator. We establish the theoretical properties, including the rates of convergence, in Section 3, and discuss graphical model selection in Section 4. We consider the numerical performance of the CLIME estimator in Section 5 through simulation studies and a real data analysis. We provide further discussions on the connections and differences between our results and those of related work in Section 6. We relegate the proofs of our main results to Section 7.

2. ESTIMATION VIA CONSTRAINED $\ell_1$ MINIMIZATION

In the compressed sensing and high-dimensional linear regression literature, it is now well understood that constrained $\ell_1$ minimization provides an effective way to reconstruct a sparse signal (see, e.g., Donoho, Elad, and Temlyakov 2006; Candès and Tao 2007). A particularly simple and elementary analysis of constrained $\ell_1$ minimization methods was given by Cai, Wang, and Xu (2010). In this section, we introduce a method of constrained $\ell_1$ minimization for inverse covariance matrix estimation. We begin with basic notations and definitions.

Throughout, for a vector $a = (a_1, \ldots, a_p)^T \in \mathbb{R}^p$, we define $|a|_1 = \sum_{j=1}^p |a_j|$ and $|a|_2 = \sqrt{\sum_{j=1}^p a_j^2}$. For a matrix $A = (a_{ij}) \in \mathbb{R}^{p \times q}$, we define the elementwise $\ell_\infty$ norm $|A|_\infty = \max_{1 \le i \le p, 1 \le j \le q} |a_{ij}|$, the spectral norm $\|A\|_2 = \sup_{|x|_2 \le 1} |Ax|_2$, the matrix $\ell_1$ norm $\|A\|_{L_1} = \max_{1 \le j \le q} \sum_{i=1}^p |a_{ij}|$, the Frobenius norm $\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$, and the elementwise $\ell_1$ norm $\|A\|_1 = \sum_{i=1}^p \sum_{j=1}^q |a_{ij}|$. $I$ denotes a $p \times p$ identity matrix. For any two index sets $T$ and $T'$ and matrix $A$, we use $A_{TT'}$ to denote the $|T| \times |T'|$ matrix with rows and columns of $A$ indexed by $T$ and $T'$, respectively. The notation $A \succ 0$ indicates that $A$ is positive definite.

We now define our CLIME estimator. Let $\{\hat{\Omega}_1\}$ be the solution set of the following optimization problem:

$\min \|\Omega\|_1$ subject to $|\Sigma_n \Omega - I|_\infty \le \lambda_n$, $\Omega \in \mathbb{R}^{p \times p}$,   (1)

where $\lambda_n$ is a tuning parameter. In (1), we do not impose the symmetry condition on $\Omega$, and as a result the solution is not symmetric in general. The final CLIME estimator of $\Omega_0$ is obtained by symmetrizing $\hat{\Omega}_1$ as follows. Write $\hat{\Omega}_1 = (\hat{\omega}_{ij}^1) = (\hat{\omega}_1^1, \ldots, \hat{\omega}_p^1)$. The CLIME estimator $\hat{\Omega}$ of $\Omega_0$ is defined as $\hat{\Omega} = (\hat{\omega}_{ij})$, where

$\hat{\omega}_{ij} = \hat{\omega}_{ji} = \hat{\omega}_{ij}^1 I\{|\hat{\omega}_{ij}^1| \le |\hat{\omega}_{ji}^1|\} + \hat{\omega}_{ji}^1 I\{|\hat{\omega}_{ij}^1| > |\hat{\omega}_{ji}^1|\}$.   (2)

In other words, between $\hat{\omega}_{ij}^1$ and $\hat{\omega}_{ji}^1$, we take the one with smaller magnitude. Clearly, $\hat{\Omega}$ is a symmetric matrix; moreover, Theorem 1 shows that it is positive definite with high probability.

The convex program (1) can be further decomposed into $p$ vector minimization problems. Let $e_i$ be a standard unit vector in $\mathbb{R}^p$ with 1 in the $i$th coordinate and 0 in all other coordinates. For $1 \le i \le p$, let $\hat{\beta}_i$ be the solution of the following convex optimization problem:

$\min |\beta|_1$ subject to $|\Sigma_n \beta - e_i|_\infty \le \lambda_n$,   (3)

where $\beta$ is a vector in $\mathbb{R}^p$. The following lemma shows that solving the optimization problem (1) is equivalent to solving the $p$ optimization problems (3), that is, $\{\hat{\Omega}_1\} = \{\hat{B}\} := \{(\hat{\beta}_1, \ldots, \hat{\beta}_p)\}$. This simple observation is useful for both implementation and technical analysis.

Lemma 1. Let $\{\hat{\Omega}_1\}$ be the solution set of (1), and let $\{\hat{B}\} := \{(\hat{\beta}_1, \ldots, \hat{\beta}_p)\}$, where $\hat{\beta}_i$ are solutions to (3) for $i = 1, \ldots, p$. Then $\{\hat{\Omega}_1\} = \{\hat{B}\}$.
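The column-by-column computation described in Lemma 1 can be sketched in a few lines of R. The sketch below is ours, not the authors' released R package or their primal-dual interior point solver: it recasts (3) with the standard split $\beta = \beta^+ - \beta^-$ so that the generic LP solver in the lpSolve package (whose decision variables are nonnegative by default) can be used, and then applies the symmetrization (2).

```r
## Minimal CLIME sketch (assumes the lpSolve package; function names are ours).
library(lpSolve)

clime_column <- function(Sigma, i, lambda) {
  p <- ncol(Sigma)
  ei <- rep(0, p); ei[i] <- 1
  ## variables (beta_plus, beta_minus) >= 0; |beta|_1 = sum(beta_plus + beta_minus)
  obj  <- rep(1, 2 * p)
  Amat <- rbind(cbind(Sigma, -Sigma),   #  Sigma beta - e_i <= lambda
                cbind(-Sigma, Sigma))   # -(Sigma beta - e_i) <= lambda
  rhs  <- c(lambda + ei, lambda - ei)
  sol  <- lp("min", obj, Amat, rep("<=", 2 * p), rhs)$solution
  sol[1:p] - sol[(p + 1):(2 * p)]
}

clime <- function(Sigma, lambda) {
  p <- ncol(Sigma)
  B <- sapply(1:p, function(i) clime_column(Sigma, i, lambda))  # columns of Omega_hat_1
  ifelse(abs(B) <= abs(t(B)), B, t(B))                          # symmetrization (2)
}
```

Here Sigma would be the sample covariance matrix $\Sigma_n$ (or its $\rho$-perturbed version $\Sigma_{n,\rho}$ discussed in Sections 3 and 5), and lambda is the tuning parameter $\lambda_n$.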

To illustrate the motivation of (1), let us recall the method based on the $\ell_1$ regularized log-determinant program (cf. Banerjee, Ghaoui, and d'Aspremont 2008; d'Aspremont, Banerjee, and El Ghaoui 2008; Friedman, Hastie, and Tibshirani 2008) as follows, which we call Glasso after the algorithm that efficiently computes the solution:

$\hat{\Omega}_{\mathrm{Glasso}} := \arg\min_{\Omega \succ 0} \{\langle \Omega, \Sigma_n\rangle - \log\det(\Omega) + \lambda_n \|\Omega\|_1\}$.   (4)

The solution $\hat{\Omega}_{\mathrm{Glasso}}$ satisfies $\hat{\Omega}_{\mathrm{Glasso}}^{-1} - \Sigma_n = \lambda_n \hat{Z}$, where $\hat{Z}$ is an element of the subdifferential $\partial\|\hat{\Omega}_{\mathrm{Glasso}}\|_1$. This leads us to consider the following optimization problem:

$\min \|\Omega\|_1$ subject to $|\Omega^{-1} - \Sigma_n|_\infty \le \lambda_n$, $\Omega \in \mathbb{R}^{p\times p}$.   (5)

However, the feasible set in (5) is very complicated. By multiplying the constraint with $\Omega$, such a relaxation of (5) leads to the convex optimization problem (1), which can be solved easily. Figure 1 illustrates the solution for recovering a 2 by 2 precision matrix $\begin{bmatrix} x & z \\ z & y \end{bmatrix}$, and here we consider only the plane $x$ ($= y$) versus $z$ for simplicity. The point where the feasible polygon meets the dashed diamond is the CLIME solution $\hat{\Omega}$. Note that here, as in Glasso, the log-likelihood function is a smooth curve, rather than the polygon constraint in CLIME.

Figure 1. Plot of the elementwise $\ell_\infty$ constrained feasible set (shaded polygon) and the elementwise $\ell_1$ norm objective (dashed diamond near the origin) from CLIME. The log-likelihood function as in Glasso is represented by the dotted line.

3. RATES OF CONVERGENCE

In this section, we investigate the theoretical properties of the CLIME estimator and establish the rates of convergence under different norms. Write $\Sigma_n = (\hat{\sigma}_{ij}) = (\hat{\sigma}_1, \ldots, \hat{\sigma}_p)$, $\Sigma_0 = (\sigma_{ij}^0)$, and $EX = (\mu_1, \ldots, \mu_p)^T$. It is conventional to split the technical analysis into two cases according to the moment conditions on $X$.

(C1) Exponential-type tails: Suppose that there exists some $0 < \eta < 1/4$ such that $\log p/n \le \eta$ and $E e^{t(X_i - \mu_i)^2} \le K < \infty$ for all $|t| \le \eta$ and all $i$, where $K$ is a bounded constant.

(C2) Polynomial-type tails: Suppose that for some $\gamma, c_1 > 0$, $p \le c_1 n^\gamma$, and for some $\delta > 0$, $E|X_i - \mu_i|^{4\gamma + 4 + \delta} \le K$ for all $i$.

For $\ell_1$-MLE type estimators, the convergence rates in the case of polynomial-type tails typically are much slower than those in the case of exponential-type tails (see, e.g., Ravikumar et al. 2008). We show that our CLIME estimator attains the same rates of convergence under either of the two moment conditions and significantly outperforms $\ell_1$-MLE type estimators in the case of polynomial-type tails.

3.1 Rates of Convergence Under Spectral Norm

We begin by considering the uniformity class of matrices

$\mathcal{U} := \mathcal{U}(q, s_0(p)) = \big\{\Omega : \Omega \succ 0,\ \|\Omega\|_{L_1} \le M,\ \max_{1\le i\le p} \sum_{j=1}^p |\omega_{ij}|^q \le s_0(p)\big\}$

for $0 \le q < 1$, where $\Omega =: (\omega_{ij}) = (\omega_1, \ldots, \omega_p)$. Similar parameter spaces were used by Bickel and Levina (2008b) to estimate the covariance matrix $\Sigma_0$. Note that in the special case of $q = 0$, $\mathcal{U}(0, s_0(p))$ is a class of $s_0(p)$-sparse matrices. Let

$\theta = \max_{ij} E\big[(X_i - \mu_i)(X_j - \mu_j) - \sigma_{ij}^0\big]^2 =: \max_{ij} \theta_{ij}$.

The quantity $\theta_{ij}$ is related to the variance of $\hat{\sigma}_{ij}$, and the maximum value $\theta$ captures the overall variability of $\Sigma_n$. It is easy to see that under either (C1) or (C2), $\theta$ is a bounded constant depending only on $\gamma, \delta, K$.

The following theorem gives the rates of convergence for the CLIME estimator $\hat{\Omega}$ under the spectral norm loss.

Theorem 1. Suppose that $\Omega_0 \in \mathcal{U}(q, s_0(p))$.
(a) Assume that (C1) holds. Let $\lambda_n = C_0 M \sqrt{\log p/n}$, where $C_0 = 2\eta^{-2}(2 + \tau + \eta^{-1}e^2 K^2)$ and $\tau > 0$. Then

$\|\hat{\Omega} - \Omega_0\|_2 \le C_1 M^{2-2q} s_0(p) (\log p/n)^{(1-q)/2}$,   (6)

with probability greater than $1 - 4p^{-\tau}$, where $C_1 \le 2(1 + 2^{1-q} + 3^{1-q}) 4^{1-q} C_0^{1-q}$.
(b) Assume that (C2) holds. Let $\lambda_n = C_2 M \sqrt{\log p/n}$, where $C_2 = \sqrt{(5 + \tau)(\theta + 1)}$. Then

$\|\hat{\Omega} - \Omega_0\|_2 \le C_3 M^{2-2q} s_0(p) (\log p/n)^{(1-q)/2}$,   (7)

with probability greater than $1 - O(n^{-\delta/8} + p^{-\tau/2})$, where $C_3 \le 2(1 + 2^{1-q} + 3^{1-q}) 4^{1-q} C_2^{1-q}$.

When $M$ does not depend on $n, p$, the rates in Theorem 1 are the same as those used to estimate $\Sigma_0$ by Bickel and Levina (2008b). In the polynomial-type tails case and when $q = 0$, the rate in (7) is significantly better than the rate $O\big(s_0(p)\sqrt{p^{1/(\gamma+1+\delta/4)}/n}\big)$ for the $\ell_1$-MLE estimator obtained by Ravikumar et al. (2008).

It would be of great interest to get the convergence rates for $\sup_{\Omega_0\in\mathcal{U}} E\|\hat{\Omega} - \Omega_0\|_2^2$. However, it is difficult even to prove the existence of the expectation of $\|\hat{\Omega} - \Omega_0\|_2^2$, because we are dealing with the inverse matrix. We modify the estimator to ensure that such an expectation exists and the same rates are established. Let $\{\hat{\Omega}_{1\rho}\}$ be the solution set of the following optimization problem:

$\min \|\Omega\|_1$ subject to $|\Sigma_{n,\rho}\Omega - I|_\infty \le \lambda_n$, $\Omega \in \mathbb{R}^{p\times p}$,   (8)

where $\Sigma_{n,\rho} = \Sigma_n + \rho I$ with $\rho > 0$. Write $\hat{\Omega}_{1\rho} = (\hat{\omega}_{ij\rho}^1)$. Define the symmetrized estimator $\hat{\Omega}_\rho$ as in (2) by

$\hat{\Omega}_\rho = (\hat{\omega}_{ij\rho})$, where $\hat{\omega}_{ij\rho} = \hat{\omega}_{ji\rho} = \hat{\omega}_{ij\rho}^1 I\{|\hat{\omega}_{ij\rho}^1| \le |\hat{\omega}_{ji\rho}^1|\} + \hat{\omega}_{ji\rho}^1 I\{|\hat{\omega}_{ij\rho}^1| > |\hat{\omega}_{ji\rho}^1|\}$.   (9)

Clearly, $\Sigma_{n,\rho}^{-1}$ is a feasible point, and thus we have $\|\hat{\Omega}_{1\rho}\|_{L_1} \le \|\Sigma_{n,\rho}^{-1}\|_{L_1} \le \rho^{-1}p$. The expectation $E\|\hat{\Omega}_\rho - \Omega_0\|_2^2$ is then well defined. The other motivation to replace $\Sigma_n$ with $\Sigma_{n,\rho}$ comes from our implementation, which computes (1) by the primal dual interior point method. One usually needs to specify a feasible initialization. When $p > n$, it is hard to find an initial value for (1). For (8), we can simply set the initial value to $\Sigma_{n,\rho}^{-1}$.

Theorem 2. Suppose that $\Omega_0 \in \mathcal{U}(q, s_0(p))$ and (C1) holds. Let $\lambda_n = C_0 M\sqrt{\log p/n}$, with $C_0$ defined as in Theorem 1(a) and $\tau$ sufficiently large. Let $\rho = \sqrt{\log p/n}$. If $p \ge n^\xi$ for some $\xi > 0$, then we have

$\sup_{\Omega_0\in\mathcal{U}} E\|\hat{\Omega}_\rho - \Omega_0\|_2^2 = O\big(M^{4-4q} s_0^2(p) (\log p/n)^{1-q}\big)$.

Remark. It is not necessary to restrict $\rho = \sqrt{\log p/n}$. In fact, from the proof, we can see that Theorem 2 still holds for

$\min\{\sqrt{\log p/n},\, p^{-\alpha}\} \le \rho \le \sqrt{\log p/n}$   (10)

with any $\alpha > 0$.

When the variables of $X$ are ordered, better rates can be obtained. Similar to Bickel and Levina (2008a), we consider the following class of precision matrices:

$\mathcal{U}_o(\alpha, B) = \big\{\Omega : \Omega \succ 0,\ \max_j \sum_i \{|\omega_{ij}| : |i - j| \ge k\} \le B(k+1)^{-\alpha} \text{ for all } k \ge 0\big\}$

for $\alpha > 0$. Suppose that the modified Cholesky decomposition of $\Sigma_0$ is $\Sigma_0 = T^{-1}D(T^{-1})^T$, with the unique lower triangular matrix $T$ and diagonal matrix $D$. To estimate $\Omega_0$, Bickel and Levina (2008a) used the banding method and assumed $T \in \mathcal{U}_o(\alpha, B)$. It is easy to see that $T \in \mathcal{U}_o(\alpha, B)$ implies $\Omega_0 \in \mathcal{U}_o(\alpha, B_1)$ for some constant $B_1$. Rather than assuming $T \in \mathcal{U}_o(\alpha, B)$, we use a more general assumption that $\Omega_0 \in \mathcal{U}_o(\alpha, B)$.

Theorem 3. Let $\Omega_0 \in \mathcal{U}_o(\alpha, B)$ and $\lambda_n = CB\sqrt{\log p/n}$ with sufficiently large $C$.
(a) If (C1) or (C2) holds, then with probability greater than $1 - O(n^{-\delta/8} + p^{-\tau/2})$,

$\|\hat{\Omega} - \Omega_0\|_2 = O\big(B^2 (\log p/n)^{\alpha/(2\alpha+2)}\big)$.   (11)

(b) Suppose that $p \ge n^\xi$ for some $\xi > 0$. If (C1) holds and $\rho = \sqrt{\log p/n}$, then

$\sup_{\Omega_0\in\mathcal{U}_o(\alpha,B)} E\|\hat{\Omega}_\rho - \Omega_0\|_2^2 = O\big(B^4 (\log p/n)^{\alpha/(\alpha+1)}\big)$.   (12)

Theorem 3 shows that our estimator has the same rate as that obtained by Bickel and Levina (2008a) by banding the Cholesky factor of the precision matrix for the ordered variables.

3.2 Rates Under the $\ell_\infty$ Norm and Frobenius Norm

So far, we have focused on the performance of the estimator under the spectral norm loss. Rates of convergence also can be obtained under the elementwise $\ell_\infty$ norm and the Frobenius norm.

Theorem 4. (a) Under the conditions of Theorem 1(a), we have

$|\hat{\Omega} - \Omega_0|_\infty \le 4C_0 M^2 \sqrt{\log p/n}$,
$\frac{1}{p}\|\hat{\Omega} - \Omega_0\|_F^2 \le 4C_1 M^{4-2q} s_0(p) (\log p/n)^{1-q/2}$,

with probability greater than $1 - 4p^{-\tau}$.
(b) Under the conditions of Theorem 1(b), we have

$|\hat{\Omega} - \Omega_0|_\infty \le 4C_2 M^2 \sqrt{\log p/n}$,
$\frac{1}{p}\|\hat{\Omega} - \Omega_0\|_F^2 \le 4C_3 M^{4-2q} s_0(p) (\log p/n)^{1-q/2}$,

with probability greater than $1 - O(n^{-\delta/8} + p^{-\tau/2})$.

The rate in Theorem 4(b) is significantly faster than the rate obtained by Ravikumar et al. (2008); see Section 3.3 for a more detailed discussion. A similar rate to ours was obtained by Lam and Fan (2009) under the Frobenius norm. The elementwise $\ell_\infty$ norm result will lead to the model selection consistency result that we present in the next section. We now give the rates for $\hat{\Omega}_\rho - \Omega_0$ under expectation.

Theorem 5. Under the conditions of Theorem 2, we have

$\sup_{\Omega_0\in\mathcal{U}} E|\hat{\Omega}_\rho - \Omega_0|_\infty^2 = O\big(M^4 \log p/n\big)$,
$\sup_{\Omega_0\in\mathcal{U}} \frac{1}{p} E\|\hat{\Omega}_\rho - \Omega_0\|_F^2 = O\big(M^{4-2q} s_0(p)(\log p/n)^{1-q/2}\big)$.

The proofs of Theorems 1–5 rely on the following more general theorem.

Theorem 6. Suppose that $\Omega_0 \in \mathcal{U}(q, s_0(p))$ and $\rho \ge 0$. If $\lambda_n \ge \|\Omega_0\|_{L_1}(\max_{ij}|\hat{\sigma}_{ij} - \sigma_{ij}^0| + \rho)$, then we have

$|\hat{\Omega}_\rho - \Omega_0|_\infty \le 4\|\Omega_0\|_{L_1} \lambda_n$,   (13)
$\|\hat{\Omega}_\rho - \Omega_0\|_2 \le C_4 s_0(p) \lambda_n^{1-q}$,   (14)

and

$\frac{1}{p}\|\hat{\Omega}_\rho - \Omega_0\|_F^2 \le C_5 s_0(p) \lambda_n^{2-q}$,   (15)

where $C_4 \le 2(1 + 2^{1-q} + 3^{1-q})(4\|\Omega_0\|_{L_1})^{1-q}$ and $C_5 \le 4\|\Omega_0\|_{L_1} C_4$.

3.3 Comparison With the Lasso-Type Estimator

Here we compare our results with those of Ravikumar et al. (2008), who estimated $\Omega_0$ by solving the following $\ell_1$ regularized log-determinant program:

$\hat{\Omega} := \arg\min_{\Omega\succ 0}\{\langle \Omega, \Sigma_n\rangle - \log\det(\Omega) + \lambda_n\|\Omega\|_{1,\mathrm{off}}\}$,   (16)

where $\|\Omega\|_{1,\mathrm{off}} = \sum_{i\ne j}|\omega_{ij}|$. To obtain the rates of convergence in the elementwise $\ell_\infty$ norm and the spectral norm, they imposed the following condition:

Irrepresentable Condition in Ravikumar et al. (2008). There exists some $\alpha \in (0, 1]$ such that

$\|\Gamma_{S^c S}(\Gamma_{SS})^{-1}\|_{L_1} \le 1 - \alpha$,   (17)

where $\Gamma = \Omega_0^{-1} \otimes \Omega_0^{-1}$, $S$ is the support of $\Omega_0$, and $S^c = \{1,\ldots,p\}\times\{1,\ldots,p\} - S$.

The foregoing assumption is particularly strong. Under this assumption, Ravikumar et al. (2008) showed that $\hat{\Omega}$ estimates the zero elements of $\Omega_0$ exactly by 0 with high probability. In fact, a similar condition to (17) for Lasso, with the covariance matrix $\Sigma_0$ taking the place of the matrix $\Gamma$, is sufficient and nearly necessary for recovering the support using the ordinary Lasso (see, e.g., Meinshausen and Bühlmann 2006).

Suppose that $\Omega_0$ is $s_0(p)$-sparse and consider sub-Gaussian random variables $X_i/\sqrt{\sigma_{ii}^0}$ with the parameter $\sigma$. In addition to (17), Ravikumar et al. (2008) assumed that the sample size $n$ satisfies the bound

$n > C_1 s_0^2(p)(1 + 8/\alpha)^2 (\tau \log p + \log 4)$,   (18)

where $C_1 = \{48\sqrt{2}(1 + 4\sigma^2)\max_i(\sigma_{ii}^0)\max\{\|\Sigma_0\|_{L_1} K_\Gamma,\ \|\Sigma_0\|_{L_1}^3 K_\Gamma^2\}\}^2$. Under the aforementioned conditions, they showed that with probability greater than $1 - 1/p^{\tau-2}$,

$|\hat{\Omega} - \Omega_0|_\infty \le 16\sqrt{2}(1 + 4\sigma^2)\max_i(\sigma_{ii}^0)(1 + 8\alpha^{-1}) K_\Gamma \sqrt{(\tau\log p + \log 4)/n}$,

where $K_\Gamma = \|([\Omega_0^{-1} \otimes \Omega_0^{-1}]_{SS})^{-1}\|_{L_1}$. Note that their constant depends on the quantities $\alpha$ and $K_\Gamma$, whereas our constant depends on $M$, the bound of $\|\Omega_0\|_{L_1}$. They required (18), whereas we need only $\log p = o(n)$. Another substantial difference is that the irrepresentable condition (17) is not needed for our results.

We next compare our result with that of Ravikumar et al. (2008) under the case of polynomial-type tails. Suppose that (C2) holds. Corollary 2 of Ravikumar et al. (2008) shows that if $p = O(\{n/s_0^2(p)\}^{(\gamma+1+\delta/4)/\tau})$ for some $\tau > 2$, then with probability greater than $1 - 1/p^{\tau-2}$,

$|\hat{\Omega} - \Omega_0|_\infty = O\Big(\sqrt{p^{\tau/(\gamma+1+\delta/4)}/n}\Big)$.

Theorem 4 shows that our estimator still has the order of $\sqrt{\log p/n}$ in the case of polynomial-type tails. Moreover, when $\gamma \ge 1$, the range $p = O(n^\gamma)$ in our theorem is wider than their range $p = O(\{n/s_0^2(p)\}^{(\gamma+1+\delta/4)/\tau})$ with $\tau > 2$.

It is worth noting that our estimator allows for a wider class of matrices compared with the sparse precision matrices. For example, the estimator is still consistent for a model that is not truly sparse but has many small entries.

4. GRAPHICAL MODEL SELECTION CONSISTENCY

As mentioned in the Introduction, graphical model selection is an important problem. The constrained $\ell_1$ minimization procedure introduced in Section 2 for estimating $\Omega_0$ can be modified to recover the support of $\Omega_0$. We introduce an additional thresholding step based on $\hat{\Omega}$. More specifically, define a threshold estimator $\tilde{\Omega} = (\tilde{\omega}_{ij})$ with

$\tilde{\omega}_{ij} = \hat{\omega}_{ij} I\{|\hat{\omega}_{ij}| \ge \tau_n\}$,

where $\tau_n \ge 4M\lambda_n$ is a tuning parameter and $\lambda_n$ is given in Theorem 1. Define

$\mathcal{M}(\tilde{\Omega}) = \{\mathrm{sgn}(\tilde{\omega}_{ij}),\ 1 \le i, j \le p\}$,
$\mathcal{M}(\Omega_0) = \{\mathrm{sgn}(\omega_{ij}^0),\ 1 \le i, j \le p\}$,
$S(\Omega_0) = \{(i, j) : \omega_{ij}^0 \ne 0\}$,

and

$\theta_{\min} = \min_{(i,j)\in S(\Omega_0)} |\omega_{ij}^0|$.

From the elementwise $\ell_\infty$ results established in Theorem 4, with high probability, the resulting elements in $\tilde{\Omega}$ will exceed the threshold level if the corresponding element in $\Omega_0$ is large in magnitude. In contrast, the elements of $\hat{\Omega}$ outside the support of $\Omega_0$ will remain below the threshold level with high probability. Therefore, we have the following theorem on the threshold estimator $\tilde{\Omega}$.

Theorem 7. Suppose that (C1) or (C2) holds and $\Omega_0 \in \mathcal{U}(0, s_0(p))$. If $\theta_{\min} > 2\tau_n$, then, with probability greater than $1 - O(n^{-\delta/8} + p^{-\tau/2})$, we have $\mathcal{M}(\tilde{\Omega}) = \mathcal{M}(\Omega_0)$.

The threshold estimator $\tilde{\Omega}$ recovers not only the sparsity pattern of $\Omega_0$, but also the signs of the nonzero elements. This property is called sign consistency in some of the literature. The condition $\theta_{\min} > 2\tau_n$ is needed to ensure that nonzero elements are correctly retained. From Theorem 4, we see that if $M$ does not depend on $n, p$, then $\tau_n$ is of order $\sqrt{\log p/n}$, which is of the same order as in the assumption of Ravikumar et al. (2008) for exponential-type tails, but weaker than their assumption $\theta_{\min} \ge C\sqrt{p^{\tau/(\gamma+1+\delta/4)}/n}$ for polynomial-type tails.
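The additional thresholding step is a one-line operation once $\hat{\Omega}$ is available. The following R helper is our own sketch of it; beyond the condition $\tau_n \ge 4M\lambda_n$ stated above, the choice of the threshold in practice is not prescribed here.

```r
## Thresholded CLIME for graphical model selection (sketch; names are ours).
threshold_clime <- function(Omega_hat, tau_n) {
  Omega_hat * (abs(Omega_hat) >= tau_n)  # omega_tilde_ij = omega_hat_ij * 1{|omega_hat_ij| >= tau_n}
}
## Sign consistency event of Theorem 7: the estimated sign pattern matches the truth, e.g.
## all(sign(threshold_clime(Omega_hat, tau_n)) == sign(Omega0))
```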

Based on Meinshausen and Bühlmann (2006), Zhou, van de Geer, and Bühlmann (2009) applied the adaptive Lasso to covariance selection in Gaussian graphical models. For $X = (X_1,\ldots,X_p)^T \sim N(0, \Sigma_0)$, they regressed $X_i$ versus the other variables $\{X_k;\ k \ne i\}$: $X_i = \sum_{j\ne i} \beta_j^i X_j + V_i$, where $V_i$ is a normally distributed random variable with mean 0, and the underlying coefficients can be shown to be $\beta_j^i = -\omega_{ij}^0/\omega_{ii}^0$. They then used the adaptive Lasso to recover the support of $\{\beta_j^i\}$, which is identical to the support of $\Omega_0$. One of their main assumptions is the restricted eigenvalue assumption on $\Sigma_0$, which is weaker than the irrepresentable condition. Their method can recover the support of $\Omega_0$ but cannot estimate the elements in $\Omega_0$. Besides not imposing the unnecessary irrepresentable condition, an additional advantage of our method is that it not only recovers the support of $\Omega_0$, but also provides consistency results under the elementwise $\ell_\infty$ norm and the spectral norm.

5. NUMERICAL RESULTS

In this section, we turn to the numerical performance of our CLIME estimator. The procedure is easy to implement. An R package of our method has been developed and is available at http://stat.wharton.upenn.edu/~tcai/paper/html/Precision-Matrix.html. Here we first investigate the numerical performance of the estimator through simulation studies, and then apply our method to the analysis of a breast cancer dataset.

The proposed estimator $\hat{\Omega}$ can be obtained in a column-by-column fashion as illustrated in Lemma 1. Thus we focus on the numerical implementation of solutions to the optimization problem (3):

$\min |\beta|_1$ subject to $|\Sigma_n \beta - e_i|_\infty \le \lambda_n$.

We consider a relaxation of the foregoing, which is equivalent to the following linear programming problem:

$\min \sum_{j=1}^p u_j$ subject to:
$-\beta_j \le u_j$ for all $1 \le j \le p$,
$+\beta_j \le u_j$ for all $1 \le j \le p$,   (19)
$-\hat{\sigma}_k^T \beta + I\{k = i\} \le \lambda_n$ for all $1 \le k \le p$,
$+\hat{\sigma}_k^T \beta - I\{k = i\} \le \lambda_n$ for all $1 \le k \le p$.

The same linear relaxation was considered by Candès and Tao (2007), who found it very efficient for the Dantzig selector problem in regression. To solve (19), we follow the primal dual interior point method approach (see, e.g., Boyd and Vandenberghe 2004). The resulting algorithm has comparable numerical performance to other numerical procedures, including Glasso. Note that we need only sweep through the $p$ columns once, but Glasso requires an extra outer layer of iterations to loop through the $p$ columns several times by cyclical coordinate descent. Once $\hat{\Omega}_1$ is obtained by combining the $\hat{\beta}$'s for each column, we symmetrize $\hat{\Omega}_1$ by setting the entry $(i, j)$ to be the smaller in magnitude of the two entries $\hat{\omega}_{ij}^1$ and $\hat{\omega}_{ji}^1$, for all $1 \le i, j \le p$, as in (2).

Similar to many iterative methods, our method also requires a proper initialization within the feasible set. However, the initializing $\beta^0$ cannot simply be replaced by the solution of the linear system $\Sigma_n \beta = e_i$ for each $i$ when $p > n$, because $\Sigma_n$ is singular. The remedy is to add a small positive constant $\rho$ (e.g., $\rho = \sqrt{\log p/n}$) to all of the diagonal entries of the matrix $\Sigma_n$; that is, we use the $\rho$-perturbed matrix $\Sigma_{n,\rho} = \Sigma_n + \rho I$ to replace the $\Sigma_n$ in (19). Such a perturbation does not noticeably affect the computational accuracy of the final solution in our numerical experiments. In Sections 3 and 4, the resulting solution $\hat{\Omega}_\rho$ of the perturbed problem (8) is shown to have all of the theoretical properties, and, even better, the convergence rate of the spectral norm under expectation is established for $\hat{\Omega}_\rho$.

In the context of high-dimensional linear regression, a second-stage refitting procedure was considered by Candès and Tao (2007) to correct the biases introduced by the $\ell_1$ norm penalization. Their refitting procedure seeks the best coefficient vector, giving the maximum likelihood, which has the same support as the original Dantzig selector. Inspired by this two-stage procedure, we propose a similar two-stage procedure to further improve the numerical performance of the CLIME estimator by refitting as

$\check{\Omega} = \arg\min_{\Omega_{\hat{S}^c} = 0,\ \Omega \succ 0} \{\langle \Omega, \Sigma_n\rangle - \log\det(\Omega)\}$,

where $\hat{S} = S(\tilde{\Omega})$ and $\Omega_{\hat{S}^c} = \{\omega_{ij}, (i, j) \in \hat{S}^c\}$. Here the estimator $\check{\Omega}$ minimizes the Bregman divergence among all symmetric positive definite matrices under the constraint. We call $\check{\Omega}$ the refitted CLIME. We can establish the bounds under the three norms in Section 3 and the support recovery $S(\check{\Omega}) = S(\Omega_0)$. For example, the Frobenius loss bound can be easily derived from the approach used by Rothman et al. (2008) and Fan, Feng, and Wu (2009). Other theoretical properties are more involved, and we leave this to future work.

5.1 Simulations

Here we compare the numerical performance of the CLIME estimator $\hat{\Omega}_{\mathrm{CLIME}}$, the Refitted CLIME estimator, the Graphical Lasso $\hat{\Omega}_{\mathrm{Glasso}}$, and the SCAD estimator $\hat{\Omega}_{\mathrm{SCAD}}$ from Fan, Feng, and Wu (2009), which is defined as

$\hat{\Omega}_{\mathrm{SCAD}} := \arg\min_{\Omega\succ 0}\Big\{\langle \Omega, \Sigma_n\rangle - \log\det(\Omega) + \sum_{i=1}^p\sum_{j=1}^p \mathrm{SCAD}_{\lambda,a}(|\omega_{ij}|)\Big\}$,

where the SCAD function $\mathrm{SCAD}_{\lambda,a}$ is that proposed by Fan (1997). We use Fan and Li's (2001) recommended choice $a = 3.7$ throughout and set all $\lambda$ to be the same for all $(i, j)$ entries for simplicity. This setting for $a$ and $\lambda$ is the same as that of Fan, Feng, and Wu (2009). (See Fan, Feng, and Wu (2009) for further details on SCAD.) Note that Glasso performs equivalently to the SPICE estimator of Rothman et al. (2008) according to their study.

We consider three models as follows (a sketch of their construction follows the list):

• Model 1. $\omega_{ij}^0 = 0.6^{|i-j|}$.
• Model 2. The second model comes from Rothman et al. (2008). We let $\Omega_0 = B + \delta I$, where each off-diagonal entry in $B$ is generated independently and equals 0.5 with probability 0.1 or 0 with probability 0.9. $\delta$ is chosen such that the condition number (the ratio of maximal and minimal singular values of a matrix) is equal to $p$. Finally, the matrix is standardized to have unit diagonals.
• Model 3. In this model, we consider a nonsparse matrix and let $\Omega_0$ have all off-diagonal elements 0.5 and the diagonal elements 1.

Model 1 has a banded structure, and the values of the entries decay as they move away from the diagonal. Model 2 is an example of a sparse matrix without any special sparsity pattern. Model 3 serves as a dense matrix example.

For each model, we generate a training sample of size $n = 100$ from a multivariate normal distribution with mean 0 and covariance matrix $\Sigma_0$, and an independent sample of size 100 from the same distribution for validating the tuning parameter $\lambda$. Using the training data, we compute a series of estimators with 50 different values of $\lambda$ and use the one with the smallest likelihood loss on the validation sample, where the likelihood loss is defined by

$L(\Sigma, \Omega) = \langle \Sigma, \Omega\rangle - \log\det(\Omega)$.

We compute the Glasso and SCAD estimators on the same training and testing data using the same cross-validation scheme. We consider different values of $p = 30, 60, 90, 120, 200$ and replicate 100 times.

We first measure the estimation quality by the following matrix norms: the operator norm, the matrix $\ell_1$ norm, and the Frobenius norm. Table 1 reports the averages and standard errors of these losses. We see that CLIME nearly uniformly outperforms Glasso. The improvement tends to be slightly more significant for sparse models when $p$ is large, but overall the improvement is not dramatic. SCAD is the most computationally costly of the three methods, but numerically it has the best performance when $p < n$ and is comparable to CLIME when $p$ is large. Note that SCAD uses a nonconvex penalty to correct the bias, whereas CLIME currently optimizes the convex $\ell_1$ norm objective efficiently. A more comparable procedure that also corrects the bias is our two-stage Refitted CLIME, denoted by $\hat{\Omega}_{\mathrm{R\text{-}CLIME}}$. Table 2 illustrates the improvement from bias correction, and we only list the spectral norm loss for reasons of space. It is clear that our Refitted CLIME estimator has comparable or better performance than SCAD, and our Refitted CLIME is especially favorable when $p$ is large.

Gaussian graphical model selection has also received considerable attention in the literature. As we discussed earlier, this is equivalent to the support recovery of the precision matrix. The proportions of true zero (TN) and nonzero (TP) elements recovered by the methods are also reported here in Table 3. Numerical values over $10^{-3}$ in magnitude are considered to be nonzero, because the computation accuracy is set to be $10^{-4}$. It is noticeable that Glasso tends to be more noisy by including erroneous nonzero elements; CLIME tends to be more sparse than Glasso, which is usually favorable in real applications; SCAD produces the most sparse estimates among the three but at the price of erroneously estimating more true nonzero entries by zero. This conclusion also can be reached in Figure 2, where the TPR and FPR values of 100 realizations of these three procedures for the first two models (note that all elements in Model 3 are nonzero) are plotted for $p = 60$ as a representative example of the other cases.

To better illustrate the recovery performance elementwise, Figure 3 shows heatmaps of the nonzeros identified out of 100 replications. All of the heatmaps suggest that CLIME is more sparse than Glasso, and visual inspection shows that the sparsity pattern recovered by CLIME has a significantly closer resemblance to the true model than Glasso. When the true model has significant nonzero elements scattered on the off-diagonals, Glasso tends to include more nonzero elements than are needed. SCAD is the sparsest of the three methods but again could zero out more true nonzero entries, as shown in Model 1. Similar patterns were observed in our experiments for other values of $p$.

5.2 Analysis of a Breast Cancer Dataset

We now apply our CLIME method to a real data example. The breast cancer data were analyzed by Hess et al. (2006) and are available at http://bioinformatics.mdanderson.org/. The dataset consists of 22,283 gene expression levels of 133 subjects, including 34 subjects with pathological complete response (pCR) and 99 subjects with residual disease (RD). The pCR subjects are considered to have a high chance of cancer-free survival in the long term, and thus it is of great interest to study the response states of the patients (pCR or RD) to neoadjuvant (preoperative) chemotherapy. Based on the estimated inverse covariance matrix of the gene expression levels, we apply linear discriminant analysis (LDA) to predict whether or not a subject can achieve the pCR state.

For a fair comparison with other methods of estimating the inverse covariance matrix, we follow the same analysis scheme used by Fan, Feng, and Wu (2009) and the references therein. For completeness, we briefly describe these steps here. The data are randomly divided into the training and testing datasets. A stratified sampling approach is applied to divide the data, with 5 pCR subjects and 16 RD subjects randomly selected to constitute the testing data (roughly 1/6 of the subjects in each group). The remaining subjects form the training set. For the training set, a two-sample t test is performed between the two groups for each gene, and the 113 most significant genes (i.e., with the smallest p-values) are retained as the covariates for prediction. Note that the size of the training sample is 112, 1 less than the variable size, which allows us to examine the performance when $p > n$. The gene data are then standardized by the estimated standard deviation, estimated from the training data. Finally, following the LDA framework, the normalized gene expression data are assumed to be normally distributed as $N(\mu_k, \Sigma)$, where the two groups are assumed to have the same covariance matrix $\Sigma$ but different means $\mu_k$, $k = 1$ for pCR and $k = 2$ for RD. The estimated inverse covariance $\hat{\Omega}$ produced by the different methods is used in the LDA scores,

$\delta_k(x) = x^T \hat{\Omega} \hat{\mu}_k - \tfrac{1}{2} \hat{\mu}_k^T \hat{\Omega} \hat{\mu}_k + \log \hat{\pi}_k$,   (20)

where $\hat{\pi}_k = n_k/n$ is the proportion of group $k$ subjects in the training set and $\hat{\mu}_k = (1/n_k)\sum_{i\in \text{group } k} x_i$ is the within-group average vector in the training set. The classification rule is taken to be $\hat{k}(x) = \arg\max_k \delta_k(x)$ for $k = 1, 2$.

The classification performance is clearly associated with the estimation accuracy of $\hat{\Omega}$. We use the testing dataset to assess the estimation performance and to make comparisons with the existing results of Fan, Feng, and Wu (2009) using the same criterion.
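The LDA rule (20) is straightforward to apply once an estimate $\hat{\Omega}$ is available; a small R sketch is given below (the helper names are ours; x is a standardized test-sample expression vector, mu_hat is the $p \times 2$ matrix of within-group training means, and pi_hat the group proportions).

```r
## LDA scores (20) and classification rule (sketch; names are ours).
lda_scores <- function(x, Omega_hat, mu_hat, pi_hat) {
  sapply(1:2, function(k)
    drop(t(x) %*% Omega_hat %*% mu_hat[, k]) -
      0.5 * drop(t(mu_hat[, k]) %*% Omega_hat %*% mu_hat[, k]) +
      log(pi_hat[k]))
}
classify <- function(x, Omega_hat, mu_hat, pi_hat)
  which.max(lda_scores(x, Omega_hat, mu_hat, pi_hat))   # k = 1 (pCR) or k = 2 (RD)
```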
Table 1. Comparison of average (SE) matrix losses for three models over 100 replications. [The table reports the operator norm, matrix $\ell_1$ norm, and Frobenius norm losses of CLIME, Glasso, and SCAD under Models 1–3 for $p = 30, 60, 90, 120, 200$.]

Table 2. Comparison of average (SE) operator norm losses from Models 1 and 2 over 100 replications

          Model 1                        Model 2
p         R-CLIME       SCAD            R-CLIME       SCAD
30        1.56 (0.02)   2.38 (0.02)     0.85 (0.01)   0.59 (0.02)
60        2.15 (0.01)   2.71 (0.01)     1.14 (0.01)   0.95 (0.09)
90        2.42 (0.01)   2.76 (0.004)    1.17 (0.01)   1.14 (0.01)
120       2.56 (0.01)   2.79 (0.004)    1.44 (0.01)   1.38 (0.01)
200       2.71 (0.01)   2.83 (0.003)    1.91 (0.01)   2.11 (0.01)

For the tuning parameters, we use a 6-fold cross-validation on the training data for choosing $\lambda$. We repeat the foregoing estimation scheme 100 times.

To compare the classification performance, we use specificity, sensitivity, and Matthews correlation coefficient (MCC) criteria, defined as follows:

Specificity $= \dfrac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}$,  Sensitivity $= \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$,
MCC $= \dfrac{\mathrm{TP}\times\mathrm{TN} - \mathrm{FP}\times\mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}}$.

Here TP and TN stand for true positives (pCR) and true negatives (RD), respectively, and FP and FN stand for false positives/negatives. The larger the criterion value, the better the classification performance. The averages and standard errors of the foregoing criteria, along with the number of nonzero entries in $\hat{\Omega}$, over 100 replications are reported in Table 4. The Glasso, Adaptive lasso, and SCAD results are taken from the work of Fan, Feng, and Wu (2009), which used the same procedure on the same data set, except that here we use $\hat{\Omega}_{\mathrm{CLIME}}$ in place of $\hat{\Omega}$.

Clearly, CLIME significantly outperforms the other two methods on sensitivity and is comparable with them on specificity. The overall classification performance measured by MCC overwhelmingly favors CLIME, which shows a 25% improvement over the best alternative methods. CLIME also produces the sparsest matrix, which is usually favorable for interpretation purposes on real datasets.

6. DISCUSSION

This article presents a new constrained $\ell_1$ minimization method for estimating high-dimensional precision matrices. Both the method and the analysis are relatively simple and straightforward, and may be extended to other related problems. Moreover, the method and the results are not restricted to a specific sparsity pattern. Thus the estimator can be used to recover a wide class of matrices in theory as well as in applications. In particular, when applying our method to covariance selection in Gaussian graphical models, the theoretical results can be established without assuming the irrepresentable condition of Ravikumar et al. (2008), which is very stringent and hard to check in practice.

Several authors, including Yuan and Lin (2007), Rothman et al. (2008), and Ravikumar et al. (2008), have estimated the precision matrix by solving the optimization problem (16) with the $\ell_1$ penalty only on the off-diagonal entries, which is slightly different from our starting point (4) presented here. We also can consider the following optimization problem:

$\min \|\Omega\|_{1,\mathrm{off}}$ subject to $|\Sigma_n\Omega - I|_\infty \le \lambda_n$, $\Omega \in \mathbb{R}^{p\times p}$.

Analogous results can be established for the foregoing estimator. We omit these here, due to the close resemblance in proof techniques and conclusions.

There are several possible extensions of our method. For example, Zhou, Lafferty, and Wasserman (2010) considered time-varying undirected graphs and estimated $\Sigma(t)^{-1}$ by Glasso. It would be very interesting to study the estimation of $\Sigma(t)^{-1}$ by our method. Ravikumar and Wainwright (2009) considered high-dimensional Ising model selection using $\ell_1$-regularized logistic regression. It would be interesting to apply our method to their setting as well.

Another important subject is investigating the theoretical property of the tuning parameter selected by the cross-validation method, although based on our experiments, CLIME is not very sensitive to the choice of tuning parameter. An example of such results on cross-validation was provided by Bickel and Levina (2008b) for thresholding.

After submitting this article for publication, we were made aware that Zhang (2010) had proposed a precision matrix estimator, called GMACS, which is the solution of the following optimization problem:

$\min \|\Omega\|_{L_1}$ subject to $|\Sigma_n\Omega - I|_\infty \le \lambda_n$, $\Omega \in \mathbb{R}^{p\times p}$.

The objective function here is different from that of CLIME, and this basic version cannot be solved column-by-column and is not as easy to implement. Zhang (2010) considered only the Gaussian case and $\ell_0$ balls, whereas we consider sub-Gaussian and polynomial-tail distributions and more general $\ell_q$ balls. In addition, the GMACS estimator requires an additional thresholding step for the rates to hold over $\ell_0$ balls. In contrast, CLIME does not need an additional thresholding step, and the rates hold over general $\ell_q$ balls.

7. PROOF OF MAIN RESULTS

Proof of Lemma 1. Write $\Omega = (\omega_1, \ldots, \omega_p)$, where $\omega_i \in \mathbb{R}^p$. The constraint $|\Sigma_n\Omega - I|_\infty \le \lambda_n$ is equivalent to

$|\Sigma_n \omega_i - e_i|_\infty \le \lambda_n$ for $1 \le i \le p$.

Thus we have

$|\hat{\omega}_i^1|_1 \ge |\hat{\beta}_i|_1$ for $1 \le i \le p$.   (21)

Because $|\Sigma_n\hat{B} - I|_\infty \le \lambda_n$, by the definition of $\{\hat{\Omega}_1\}$, we have

$\|\hat{\Omega}_1\|_1 \le \|\hat{B}\|_1$.   (22)

By (21) and (22), we have $\hat{B} \in \{\hat{\Omega}_1\}$. On the other hand, if $\hat{\Omega}_1 \notin \{\hat{B}\}$, then there exists an $i$ such that $|\hat{\omega}_i^1|_1 > |\hat{\beta}_i|_1$. Thus, by (21), we have $\|\hat{\Omega}_1\|_1 > \|\hat{B}\|_1$. This is in conflict with (22).

The main results all rely on Theorem 6, which upper bounds the elementwise $\ell_\infty$ norm. We prove this theorem first.
Table 3. Comparison of average (SE) support recovery for three models over 100 replications. [The table reports TN% and TP% for CLIME, Glasso, and SCAD under Models 1–3 for $p = 30, 60, 90, 120, 200$; TN% is not applicable for Model 3, in which all entries are nonzero.]
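The TN% and TP% proportions summarized in Table 3 can be computed as follows. This helper is our own sketch; restricting attention to off-diagonal entries and using the $10^{-3}$ magnitude cutoff mentioned in Section 5.1 are our assumptions about how the table was produced.

```r
## Support recovery proportions (sketch; our own helper).
support_recovery <- function(Omega_hat, Omega0, tol = 1e-3) {
  est  <- abs(Omega_hat) > tol
  true <- Omega0 != 0
  off  <- row(Omega0) != col(Omega0)
  c(TN = 100 * mean(!est[off & !true]),   # % of true zeros estimated as zero
    TP = 100 * mean(est[off & true]))     # % of true nonzeros estimated as nonzero
}
```

For Model 3 all off-diagonal entries are nonzero, so the TN proportion is undefined (reported as N/A in Table 3).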

Figure 2. TPR vs FPR for p = 60. The solid, dashed and dotted lines are the average TPR and FPR values for CLIME, Glasso, and SCAD, respectively, as the tuning parameters of these methods vary. The circles, triangles and pluses correspond to 100 different realizations of CLIME, Glasso and SCAD respectively, with the tuning parameter picked by cross validation.

Figure 3. Heatmaps of the frequency of the zeros identified for each entry of the precision matrix (when p = 60) out of 100 replications. White represents 100 zeros identified out of 100 runs, and black represents 0/100.

Table 4. Comparison of average (SE) pCR classification errors over 100 replications. Glasso, Adaptive lasso, and SCAD results are taken from Fan, Feng, and Wu (2009, table 2)

Method           Specificity     Sensitivity     MCC             Nonzero entries in $\hat{\Omega}$
Glasso           0.768 (0.009)   0.630 (0.021)   0.366 (0.018)   3923 (2)
Adaptive lasso   0.787 (0.009)   0.622 (0.022)   0.381 (0.018)   1233 (1)
SCAD             0.794 (0.009)   0.634 (0.022)   0.402 (0.020)   674 (1)
CLIME            0.749 (0.005)   0.806 (0.017)   0.506 (0.020)   492 (7)
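The criteria in Table 4 follow the definitions given in Section 5.2 and can be computed from the test-set confusion counts; the helper below is our own sketch.

```r
## Specificity, sensitivity, and MCC from confusion counts (sketch; our code).
classification_criteria <- function(TP, TN, FP, FN) {
  c(specificity = TN / (TN + FP),
    sensitivity = TP / (TP + FN),
    MCC = (TP * TN - FP * FN) /
      sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)))
}
```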

Proof of Theorem 6. Let $\hat{\beta}_{i,\rho}$ be a solution of (3) with $\Sigma_n$ replaced by $\Sigma_{n,\rho}$. Note that Lemma 1 still holds for $\Sigma_{n,\rho}$ and $\{\hat{\beta}_{i,\rho}\}$ with $\rho \ge 0$. For conciseness of notation, we only prove the theorem for $\rho = 0$; the proof is exactly the same for general $\rho > 0$. By the condition in Theorem 6,

$|\Sigma_0 - \Sigma_n|_\infty \le \lambda_n/\|\Omega_0\|_{L_1}$.   (23)

We then have

$|I - \Sigma_n\Omega_0|_\infty = |(\Sigma_0 - \Sigma_n)\Omega_0|_\infty \le \|\Omega_0\|_{L_1}|\Sigma_0 - \Sigma_n|_\infty \le \lambda_n$,   (24)

where we use the inequality $|AB|_\infty \le |A|_\infty\|B\|_{L_1}$ for matrices $A, B$ of appropriate sizes. By the definition of $\hat{\beta}_i$, we can see that $|\hat{\beta}_i|_1 \le \|\Omega_0\|_{L_1}$ for $1 \le i \le p$. By Lemma 1,

$\|\hat{\Omega}_1\|_{L_1} \le \|\Omega_0\|_{L_1}$.   (25)

We have

$|\Sigma_n(\hat{\Omega}_1 - \Omega_0)|_\infty \le |\Sigma_n\hat{\Omega}_1 - I|_\infty + |I - \Sigma_n\Omega_0|_\infty \le 2\lambda_n$.   (26)

Therefore, by (23)–(26),

$|\Sigma_0(\hat{\Omega}_1 - \Omega_0)|_\infty \le |\Sigma_n(\hat{\Omega}_1 - \Omega_0)|_\infty + |(\Sigma_n - \Sigma_0)(\hat{\Omega}_1 - \Omega_0)|_\infty \le 2\lambda_n + \|\hat{\Omega}_1 - \Omega_0\|_{L_1}|\Sigma_0 - \Sigma_n|_\infty \le 4\lambda_n$.

It follows that

$|\hat{\Omega}_1 - \Omega_0|_\infty \le \|\Omega_0\|_{L_1}|\Sigma_0(\hat{\Omega}_1 - \Omega_0)|_\infty \le 4\|\Omega_0\|_{L_1}\lambda_n$.

This establishes (13) by the definition in (2).

We next prove (14). Let $t_n = |\hat{\Omega} - \Omega_0|_\infty$ and define

$h_j = \hat{\omega}_j - \omega_j^0$,  $h_j^1 = (\hat{\omega}_{ij}I\{|\hat{\omega}_{ij}| \ge 2t_n\};\ 1 \le i \le p)^T - \omega_j^0$,  $h_j^2 = h_j - h_j^1$.

By the definition (2) of $\hat{\Omega}$, we have $|\hat{\omega}_j|_1 \le |\hat{\omega}_j^1|_1 \le |\omega_j^0|_1$. Then

$|\omega_j^0|_1 - |h_j^1|_1 + |h_j^2|_1 \le |\omega_j^0 + h_j^1|_1 + |h_j^2|_1 = |\hat{\omega}_j|_1 \le |\omega_j^0|_1$,

which implies that $|h_j^2|_1 \le |h_j^1|_1$. It follows that $|h_j|_1 \le 2|h_j^1|_1$. Thus we only need to upper bound $|h_j^1|_1$. We have

$|h_j^1|_1 = \sum_{i=1}^p |\hat{\omega}_{ij}I\{|\hat{\omega}_{ij}| \ge 2t_n\} - \omega_{ij}^0|$
$\le \sum_{i=1}^p |\omega_{ij}^0|I\{|\omega_{ij}^0| \le 2t_n\} + \sum_{i=1}^p |\hat{\omega}_{ij}I\{|\hat{\omega}_{ij}| \ge 2t_n\} - \omega_{ij}^0 I\{|\omega_{ij}^0| \ge 2t_n\}|$
$\le (2t_n)^{1-q}s_0(p) + t_n\sum_{i=1}^p I\{|\hat{\omega}_{ij}| \ge 2t_n\} + \sum_{i=1}^p |\omega_{ij}^0|\,|I\{|\hat{\omega}_{ij}| \ge 2t_n\} - I\{|\omega_{ij}^0| \ge 2t_n\}|$
$\le (2t_n)^{1-q}s_0(p) + t_n\sum_{i=1}^p I\{|\omega_{ij}^0| \ge t_n\} + \sum_{i=1}^p |\omega_{ij}^0|I\{||\omega_{ij}^0| - 2t_n| \le |\hat{\omega}_{ij} - \omega_{ij}^0|\}$
$\le (2t_n)^{1-q}s_0(p) + (t_n)^{1-q}s_0(p) + (3t_n)^{1-q}s_0(p)$
$\le (1 + 2^{1-q} + 3^{1-q})t_n^{1-q}s_0(p)$,   (27)

where we use the following inequality: for any $a, b, c \in \mathbb{R}$, we have

$|I\{a < c\} - I\{b < c\}| \le I\{|b - c| < |a - b|\}$.

This completes the proof of (14). Finally, (15) follows from (13), (27), and the inequality $\|A\|_F^2 \le p\|A\|_{L_1}|A|_\infty$ for any $p \times p$ matrix $A$.

Proof of Theorems 1(a) and 4(a). By Theorem 6, we only need to prove

$\max_{ij}|\hat{\sigma}_{ij} - \sigma_{ij}^0| \le C_0\sqrt{\log p/n}$   (28)

with probability greater than $1 - 4p^{-\tau}$ under (C1). Without loss of generality, we assume that $EX = 0$. Let $\Sigma_n^0 := n^{-1}\sum_{k=1}^n X_kX_k^T$ and $Y_{kij} = X_{ki}X_{kj} - EX_{ki}X_{kj}$. We then have $\Sigma_n = \Sigma_n^0 - \bar{X}\bar{X}^T$. Let $t = \eta\sqrt{\log p/n}$. Using the inequality $|e^s - 1 - s| \le s^2 e^{\max(s,0)}$ for any $s \in \mathbb{R}$ and letting $C_{K1} = 2 + \tau + \eta^{-1}eK^2$, by basic calculations, we can get

$P\Big(\Big|\sum_{k=1}^n Y_{kij}\Big| \ge \eta^{-1}C_{K1}\sqrt{n\log p}\Big) \le e^{-C_{K1}\log p}\big(E\exp(tY_{kij})\big)^n \le \exp\big(-C_{K1}\log p + nt^2EY_{kij}^2e^{t|Y_{kij}|}\big) \le \exp\big(-C_{K1}\log p + \eta^{-1}eK^2\log p\big) \le \exp(-(\tau + 2)\log p)$.

Thus, we have

$P\big(|\Sigma_n^0 - \Sigma_0|_\infty \ge \eta^{-1}C_{K1}\sqrt{\log p/n}\big) \le 2p^{-\tau}$.   (29)

By the simple inequality $e^s \le e^{s^2 + 1}$ for $s > 0$, we have $Ee^{t|X_j|} \le eK^{1/2}$ for all $t \le \eta^{1/2}$. Let $C_{K2} = 2 + \tau + \eta^{-1}e^2K^2$ and $a_n = C_{K2}(\log p/n)^{1/2}$. As before, we can show that

$P\big(|\bar{X}\bar{X}^T|_\infty \ge \eta^{-2}a_n\log p/n\big) \le p\max_i P\Big(\sum_{k=1}^n X_{ki} \ge \eta^{-1}\sqrt{C_{K2}n\log p}\Big) + p\max_i P\Big(-\sum_{k=1}^n X_{ki} \ge \eta^{-1}\sqrt{C_{K2}n\log p}\Big) \le 2p^{-\tau-1}$.   (30)

By (29), (30), and the inequality $C_0 > \eta^{-1}C_{K1} + \eta^{-2}a_n$, we see that (28) holds.

Proof of Theorems 1(b) and 4(b). Let

$\bar{Y}_{kij} = X_{ki}X_{kj}I\{|X_{ki}X_{kj}| \le \sqrt{n/(\log p)^3}\} - EX_{ki}X_{kj}I\{|X_{ki}X_{kj}| \le \sqrt{n/(\log p)^3}\}$,
$\check{Y}_{kij} = Y_{kij} - \bar{Y}_{kij}$.

Because $b_n := \max_{i,j}E|X_{ki}X_{kj}|I\{|X_{ki}X_{kj}| \ge \sqrt{n/(\log p)^3}\} = O(1)n^{-\gamma-1/2}$, we have, by (C2),

$P\Big(\max_{i,j}\Big|\sum_{k=1}^n \check{Y}_{kij}\Big| \ge 2nb_n\Big)$
$\le P\Big(\max_{i,j}\sum_{k=1}^n |X_{ki}X_{kj}|I\{|X_{ki}X_{kj}| > \sqrt{n/(\log p)^3}\} \ge nb_n\Big)$
$\le P\Big(\max_{i,j}\sum_{k=1}^n |X_{ki}X_{kj}|I\{X_{ki}^2 + X_{kj}^2 \ge 2\sqrt{n/(\log p)^3}\} \ge nb_n\Big)$
$\le P\big(\max_{k,i} X_{ki}^2 \ge \sqrt{n/(\log p)^3}\big) \le pn\max_i P\big(X_{1i}^2 \ge \sqrt{n/(\log p)^3}\big) = O(1)n^{-\delta/8}$.

By Bernstein's inequality (cf. Bennett 1962) and some elementary calculations,

$P\Big(\max_{i,j}\Big|\sum_{k=1}^n \bar{Y}_{kij}\Big| \ge \sqrt{(\theta + 1)(4 + \tau)n\log p}\Big)$
$\le p^2\max_{i,j}P\Big(\Big|\sum_{k=1}^n \bar{Y}_{kij}\Big| \ge \sqrt{(\theta + 1)(4 + \tau)n\log p}\Big)$
$\le 2p^2\max_{i,j}\exp\Big(-(\theta + 1)(4 + \tau)n\log p\big/\big(2nE\bar{Y}_{1ij}^2 + \sqrt{(\theta + 1)(64 + 16\tau)}\,n/(3\log p)\big)\Big)$
$= O(1)p^{-\tau/2}$.

Thus we have

$P\big(|\Sigma_n^0 - \Sigma_0|_\infty \ge \sqrt{(\theta + 1)(4 + \tau)\log p/n} + 2b_n\big) = O\big(n^{-\delta/8} + p^{-\tau/2}\big)$.   (31)

Using the same truncation argument and Bernstein's inequality, we can show that

$P\Big(\max_i\Big|\sum_{k=1}^n X_{ki}\Big| \ge \sqrt{\max_i\sigma_{ii}^0(4 + \tau)n\log p}\Big) = O\big(n^{-\delta/8} + p^{-\tau/2}\big)$.

Thus

$P\big(|\bar{X}\bar{X}^T|_\infty \ge \max_i\sigma_{ii}^0(4 + \tau)\log p/n\big) = O\big(n^{-\delta/8} + p^{-\tau/2}\big)$.   (32)

Combining (31) and (32), we have

$\max_{ij}|\hat{\sigma}_{ij} - \sigma_{ij}^0| \le \sqrt{(\theta + 1)(5 + \tau)\log p/n}$   (33)

with probability greater than $1 - O(n^{-\delta/8} + p^{-\tau/2})$. The proof is completed by (33) and Theorem 6.

Proof of Theorems 2 and 5. Because $\Sigma_{n,\rho}^{-1}$ is a feasible point, we have, by (10),

$\|\hat{\Omega}_\rho\|_1 \le \|\hat{\Omega}_{1\rho}\|_1 \le \|\Sigma_{n,\rho}^{-1}\|_1 \le p^2\max\{\sqrt{n/\log p},\, p^\alpha\}$.

By (28), Theorem 6, the fact $p \ge n^\xi$, and because $\tau$ is large enough, we have

$\sup_{\Omega_0\in\mathcal{U}} E\|\hat{\Omega}_\rho - \Omega_0\|_2^2 = \sup_{\Omega_0\in\mathcal{U}} E\big[\|\hat{\Omega}_\rho - \Omega_0\|_2^2 I\big(\max_{ij}|\hat{\sigma}_{ij} - \sigma_{ij}^0| + \rho \le C_0\sqrt{\log p/n}\big)\big] + \sup_{\Omega_0\in\mathcal{U}} E\big[\|\hat{\Omega}_\rho - \Omega_0\|_2^2 I\big(\max_{ij}|\hat{\sigma}_{ij} - \sigma_{ij}^0| + \rho > C_0\sqrt{\log p/n}\big)\big]$
$= O\big(M^{4-4q}s_0^2(p)(\log p/n)^{1-q}\big) + O\big(p^4\max\{n/\log p,\, p^{2\alpha}\}\,p^{-\tau/2}\big)$
$= O\big(M^{4-4q}s_0^2(p)(\log p/n)^{1-q}\big)$.

This proves Theorem 2. The proof of Theorem 5 is similar.

Proof of Theorem 3. Let $k_n$ be an integer satisfying $1 \le k_n \le n$. Define

$h_j = \hat{\omega}_j - \omega_j^0$,  $h_j^1 = (\hat{\omega}_{ij}I\{1 \le i \le k_n\};\ 1 \le i \le p)^T - \omega_j^0$,  $h_j^2 = h_j - h_j^1$.

By the proof of Theorem 6, we can show that $|h_j|_1 \le 2|h_j^1|_1$. Because $\Omega_0 \in \mathcal{U}_o(\alpha, M)$, we have $\sum_{j=k_n}^p |\omega_{ij}^0| \le Mk_n^{-\alpha}$. By Theorem 4, $\sum_{j=1}^{k_n}|\hat{\omega}_{ij} - \omega_{ij}^0| = O(k_n\sqrt{\log p/n})$ with probability greater than $1 - O(n^{-\delta/8} + p^{-\tau/2})$. Theorem 3(a) is proved by taking $k_n = [(n/\log p)^{1/(2\alpha+2)}]$. The proof of Theorem 3(b) is similar to that of Theorem 2.

[Received March 2010. Revised January 2011.]

REFERENCES

Banerjee, O., Ghaoui, L. E., and d'Aspremont, A. (2008), "Model Selection Through Sparse Maximum Likelihood Estimation," Journal of Machine Learning Research, 9, 485–516.
Bennett, G. (1962), "Probability Inequalities for the Sum of Independent Random Variables," Journal of the American Statistical Association, 57, 33–45.
Bickel, P., and Levina, E. (2008a), "Regularized Estimation of Large Covariance Matrices," The Annals of Statistics, 36, 199–227.
Bickel, P., and Levina, E. (2008b), "Covariance Regularization by Thresholding," The Annals of Statistics, 36, 2577–2604.
Boyd, S., and Vandenberghe, L. (2004), Convex Optimization, Cambridge, U.K.: Cambridge University Press.
Cai, T., Wang, L., and Xu, G. (2010), "Shifting Inequality and Recovery of Sparse Signals," IEEE Transactions on Signal Processing, 58, 1300–1308.
Cai, T., Zhang, C.-H., and Zhou, H. (2010), "Optimal Rates of Convergence for Covariance Matrix Estimation," The Annals of Statistics, 38, 2118–2144.
Candès, E., and Tao, T. (2007), "The Dantzig Selector: Statistical Estimation When p Is Much Larger Than n," The Annals of Statistics, 35, 2313–2351.
d'Aspremont, A., Banerjee, O., and El Ghaoui, L. (2008), "First-Order Methods for Sparse Covariance Selection," SIAM Journal on Matrix Analysis and Its Applications, 30, 56–66.
Donoho, D., Elad, M., and Temlyakov, V. (2006), "Stable Recovery of Sparse Overcomplete Representations in the Presence of Noise," IEEE Transactions on Information Theory, 52, 6–18.
El Karoui, N. (2008), "Operator Norm Consistent Estimation of Large-Dimensional Sparse Covariance Matrices," The Annals of Statistics, 36, 2717–2756.

Fan, J. (1997), Comment on "Wavelets in Statistics: A Review," by A. Antoniadis, Journal of the Italian Statistical Association, 6, 131–138.
Fan, J., and Li, R. (2001), "Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties," Journal of the American Statistical Association, 96, 1348–1360.
Fan, J., Feng, Y., and Wu, Y. (2009), "Network Exploration via the Adaptive Lasso and SCAD Penalties," The Annals of Applied Statistics, 2, 521–541.
Friedman, J., Hastie, T., and Tibshirani, R. (2008), "Sparse Inverse Covariance Estimation With the Graphical Lasso," Biostatistics, 9, 432–441.
Hess, K. R., Anderson, K., Symmans, W. F., Valero, V., Ibrahim, N., Mejia, J. A., Booser, D., Theriault, R. L., Buzdar, A. U., Dempsey, P. J., Rouzier, R., Sneige, N., Ross, J. S., Vidaurre, T., Gómez, H. L., Hortobagyi, G. N., and Pusztai, L. (2006), "Pharmacogenomic Predictor of Sensitivity to Preoperative Chemotherapy With Paclitaxel and Fluorouracil, Doxorubicin, and Cyclophosphamide in Breast Cancer," Journal of Clinical Oncology, 24, 4236–4244.
Huang, J., Liu, N., Pourahmadi, M., and Liu, L. (2006), "Covariance Matrix Selection and Estimation via Penalized Normal Likelihood," Biometrika, 93, 85–98.
Lam, C., and Fan, J. (2009), "Sparsistency and Rates of Convergence in Large Covariance Matrix Estimation," The Annals of Statistics, 37, 4254–4278.
Lauritzen, S. L. (1996), Graphical Models, Oxford Statistical Science Series, New York: Oxford University Press.
Liu, H., Lafferty, J., and Wasserman, L. (2009), "The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs," Journal of Machine Learning Research, 10, 2295–2328.
Meinshausen, N., and Bühlmann, P. (2006), "High-Dimensional Graphs and Variable Selection With the Lasso," The Annals of Statistics, 34, 1436–1462.
Ravikumar, P., and Wainwright, M. (2009), "High-Dimensional Ising Model Selection Using l1-Regularized Logistic Regression," The Annals of Statistics, 38, 1287–1319.
Ravikumar, P., Wainwright, M., Raskutti, G., and Yu, B. (2008), "High-Dimensional Covariance Estimation by Minimizing l1-Penalized Log-Determinant Divergence," Technical Report 797, UC Berkeley, Statistics Dept.
Rothman, A., Bickel, P., Levina, E., and Zhu, J. (2008), "Sparse Permutation Invariant Covariance Estimation," Electronic Journal of Statistics, 2, 494–515.
Wu, W. B., and Pourahmadi, M. (2003), "Nonparametric Estimation of Large Covariance Matrices of Longitudinal Data," Biometrika, 90, 831–844.
Yuan, M. (2009), "Sparse Inverse Covariance Matrix Estimation via Linear Programming," Journal of Machine Learning Research, 11, 2261–2286.
Yuan, M., and Lin, Y. (2007), "Model Selection and Estimation in the Gaussian Graphical Model," Biometrika, 94, 19–35.
Zhang, C. (2010), "Estimation of Large Inverse Matrices and Graphical Model Selection," technical report, Rutgers University, Dept. of Statistics and Biostatistics.
Zhou, S., Lafferty, J., and Wasserman, L. (2010), "Time Varying Undirected Graphs," Machine Learning, 80, 295–319.
Zhou, S., van de Geer, S., and Bühlmann, P. (2009), "Adaptive Lasso for High Dimensional Regression and Gaussian Graphical Modeling," preprint, available at arXiv:0903.2515.