1</Sub> Minimization Approach to Sparse Precision Matrix Estimation

A Constrained 1 Minimization Approach to Sparse Precision Matrix Estimation Tony C AI, Weidong LIU, and Xi LUO This article proposes a constrained 1 minimization method for estimating a sparse inverse covariance matrix based on a sample of n iid p-variate random variables. The resulting estimator is shown to have a number of desirable properties.√ In particular, the rate of convergence between the estimator and the true s-sparse precision matrix under the spectral norm is s log p/n when the population distribution has either exponential-type tails or polynomial-type tails. We present convergence rates under the elementwise ∞ norm and Frobenius norm. In addition, we consider graphical model selection. The procedure is easily implemented by linear programming. Numerical performance of the estimator is investigated using both simulated and real data. In particular, the procedure is applied to analyze a breast cancer dataset and is found to perform favorably compared with existing methods. KEY WORDS: Covariance matrix; Frobenius norm; Gaussian graphical model; Precision matrix; Rate of convergence; Spectral norm. 1. INTRODUCTION and Bickel and Levina (2008b) proposed thresholding of the sample covariance matrix for estimating a class of sparse co- Estimation of a covariance matrix and its inverse is an im- variance matrices and obtained rates of convergence for the portant problem in many areas of statistical analysis; among thresholding estimators. the many interesting examples are principal components analy- Estimation of the precision matrix is more involved due to sis, linear/quadratic discriminant analysis, and graphical mod- 0 the lack of a natural pivotal estimator like . Assuming certain els. Stable and accurate covariance estimation is becoming in- n ordering structures, methods based on banding the Cholesky creasingly important in the high-dimensional setting where the factor of the inverse have been proposed and studied (see, e.g., p n dimension can be much larger than the sample size .In Wu and Pourahmadi 2003; Huang et al. 2006; Bickel and Lev- this setting, classical methods and results based on fixed p and ina 2008b). Penalized likelihood methods also have been in- large n are no longer applicable. An additional challenge in the troduced for estimating sparse precision matrices. In particular, high-dimensional setting is the high computational cost. It is the 1 penalized normal likelihood estimator and its variants, important that estimation procedures be computationally effec- which we call 1-MLE type estimators, have been considered tive so that they can be used in high-dimensional applications. by several authors (see, e.g., Yuan and Lin 2007; d’Aspremont, = T Let X (X1,...,Xp) be a p-variate random vector with co- Banerjee, and El Ghaoui 2008; Friedman, Hastie, and Tibshi- := −1 variance matrix 0 and precision matrix 0 0 .Given rani 2008; Rothman et al. 2008). Convergence rates under the an independent and identically distributed random sample Frobenius norm loss were given by Rothman et al. (2008). Yuan { } X1,...,Xn from the distribution of X, the most natural es- (2009) derived the convergence rates for sub-Gaussian distribu- timator of 0 is perhaps tions. Under more restrictive conditions, such as mutual inco- n herence or irrepresentable conditions, Ravikumar et al. (2008) 1 ¯ ¯ T = (X − X)(X − X) , obtained the convergence rates in the elementwise ∞ norm and n n k k k=1 spectral norm. Nonconvex penalties, which are usually compu- − tationally more demanding, also have been considered under where X¯ = n 1 n X . However, is singular if p > n and k=1 k n the same normal likelihood model. For example, Lam and Fan thus is unstable for estimating , not to mention that its inverse 0 (2009) and Fan, Feng, and Wu (2009) considered penalizing the cannot be used to estimate the precision matrix . To estimate 0 normal likelihood with the nonconvex SCAD penalty. The main the covariance matrix consistently, special structures are 0 goal is to ameliorate the bias problem due to penalization. usually imposed, and various estimators have been introduced 1 A closely related problem is recovery of the support of the under these assumptions. When the variables exhibit a certain precision matrix, which is strongly connected to the selection ordering structure, which is often the case for time series data, of graphical models. To be more specific, let G = (V, E) be Bickel and Levina (2008a) proved that banding the sample co- a graph representing conditional independence relations be- variance matrix leads to a consistent estimator. Cai, Zhang, and tween components of X. The vertex set V has p components Zhou (2010) established the minimax rate of convergence and X ,...,X , and the edge set E consists of ordered pairs (i, j), introduced a rate-optimal tapering estimator. El Karoui (2008) 1 p where (i, j) ∈ E if there is an edge between Xi and Xj. The edge between Xi and Xj is excluded from E if and only if Xi and = ∼ Tony Cai is Dorothy Silberberg Professor, Department of Statistics, The Xj are independent given (Xk, k i, j).IfX N(μ0, 0), then Wharton School, University of Pennsylvania, Philadelphia, PA 19104 (E-mail: the conditional independence between Xi and Xj given other [email protected]). Weidong Liu is Faculty Member, Department of 0 = = 0 Mathematics and Institute of Natural Sciences, Shanghai Jiao Tong University, variables is equivalent to ωij 0, where we set 0 (ωij). China and Postdoctoral Fellow, Department of Statistics, The Wharton School, Thus, for Gaussian distributions, recovering the structure of the University of Pennsylvania, Philadelphia, PA 19104. Xi Luo is Postdoctoral Fellow, Department of Statistics, The Wharton School, University of Pennsyl- vania, Philadelphia, PA 19104. The research of Tony Cai and Weidong Liu was © 2011 American Statistical Association supported in part by NSF FRG grant DMS-0854973. We would like to thank Journal of the American Statistical Association the Associate Editor and two referees for their very helpful comments which June 2011, Vol. 106, No. 494, Theory and Methods have led to a better presentation of the paper. DOI: 10.1198/jasa.2011.tm10155 594 Cai, Liu, and Luo: Estimating Sparse Precision Matrix 595 graph G is equivalent to estimating the support of the precision 2. ESTIMATION VIA CONSTRAINED 1 MINIMIZATION matrix (Lauritzen 1996). Liu, Lafferty, and Wasserman (2009) In the compressed sensing and high-dimensional linear re- recently showed that for a class of non-Gaussian distribution gression literature, it is now well understood that constrained 1 called nonparanormal distribution, the problem of estimating minimization provides an effective way to reconstruct a sparse the graph also can be reduced to estimating the precision ma- signal (see, e.g., Donoho, Elad, and Temlyakov 2006; Candès trix. In an important article, Meinshausen and Bühlmann (2006) and Tao 2007). A particularly simple and elementary analy- convincingly demonstrated a neighborhood selection approach sis of constrained 1 minimization methods was given by Cai, to recovering the support of 0 in a row-by-row fashion. Yuan Wang, and Xu (2010). In this section, we introduce a method (2009) replaced the lasso selection by a Dantzig-type modifica- of constrained 1 minimization for inverse covariance ma- tion, where first the ratios between the off-diagonal elements ωij trix estimation. We begin with basic notations and defini- and the corresponding diagonal element ω were estimated for tions. Throughout, for a vector a = (a ,...,a )T ∈ Rp,we ii 1 p each row i and then the diagonal entries ωii were obtained given | | = p | | | | = p 2 define a 1 j=1 aj and a 2 j=1 aj . For a matrix the estimated ratios. Convergence rates under the matrix 1 × A = (a ) ∈ Rp q, we define the elementwise l∞ norm |A|∞ = norm and spectral norm losses were established. ij max ≤ ≤ ≤ ≤ |a |, the spectral norm A = sup| | ≤ |Ax| , 1 i p,1 j q ij 2 x 2 1 2 In this article, we study estimation of the precision ma- p the matrix 1 norm AL = max1≤j≤q = |aij|, the Frobe- trix 0 for both sparse and nonsparse matrices, without restrict- 1 i 1 ing to a specific sparsity pattern. We also consider graphical nius norm A = a2 , and the elementwise norm F i,j ij 1 model selection. We introduce a new method of constrained = p q | | × A 1 i=1 j=1 ai,j . I denotes a p p identity matrix. For 1-minimization for inverse matrix estimation (CLIME). Rates any two index sets T and T and matrix A,weuseATT to de- of convergence in the spectral norm, as well as the elemen- note the |T|×|T | matrix with rows and columns of A indexed twise ∞ norm and Frobenius norm, are established under by T and T , respectively. The notation A 0 indicates that A weaker assumptions and shown to be faster than those given is positive definite. ˆ for the 1-MLE estimators when the population distribution has We now define our CLIME estimator. Let {1} be the solu- polynomial-type tails. A matrix is called s-sparse if there are at tion set of the following optimization problem: most s nonzero elements on each row. We show that when 0 is s-sparse and X has either exponential-type or polynomial- min 1 subject to: (1) ˆ p×p type tails, the error√ between our estimator and √0 satisfies |n − I|∞ ≤ λn, ∈ R , ˆ ˆ −02 = OP(s log p/n) and |−0|∞ = OP( log p/n), where λ is a tuning parameter. In (1), we do not impose the where · and |·|∞ are the spectral norm and elementwise n 2 symmetry condition on , and as a result the solution is not l∞ norm, respectively. We discuss properties of the CLIME es- symmetric in general.

Load more