
Accepted for publication by IEEE Access (2016)

Positive Definite Estimation of Large Covariance Matrices Using Generalized Nonconvex Penalties

Fei Wen, Member, IEEE, Yuan Yang, Peilin Liu, Member, IEEE, Robert C. Qiu, Fellow, IEEE

Abstract—This work addresses the issue of large covariance matrix estimation in high-dimensional statistical analysis. Recently, improved iterative algorithms with positive-definite guarantee have been developed. However, these algorithms cannot be directly extended to use a nonconvex penalty for sparsity inducing. Generally, a nonconvex penalty has the capability of ameliorating the bias problem of the popular convex lasso penalty, and thus is more advantageous. In this work, we propose a class of positive-definite covariance estimators using generalized nonconvex penalties. We develop a first-order algorithm based on the alternating direction method framework to solve the nonconvex optimization problem efficiently. The convergence of this algorithm has been proved. Further, the statistical properties of the new estimators have been analyzed for generalized nonconvex penalties. Moreover, extension of this algorithm to covariance estimation from sketched measurements has been considered. The performance of the new estimators has been demonstrated by both a simulation study and a gene clustering example for tumor tissues. Code for the proposed estimators is available at https://github.com/FWen/Nonconvex-PDLCE.git.

Index Terms—Covariance matrix estimation, covariance sketching, alternating direction method, positive-definite estimation, nonconvex optimization, sparse.

This work was supported in part by the National Natural Science Foundation of China (NSFC) under grants 61401501, 61573242 and 61472442, and by the Young Star Science and Technology Project in Shaanxi province under grant 2015KJXX-46. F. Wen, P. Liu and R. C. Qiu are with the Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: [email protected]; [email protected]; [email protected]). F. Wen is also with the Air Control and Navigation Institution, Air Force Engineering University, Xian 710000, China. Y. Yang is with the Air Control and Navigation Institution, Air Force Engineering University, Xian 710000, China.

I. INTRODUCTION

Nowadays, the advance of information technology makes massive high-dimensional data widely available for scientific discovery, which makes Big Data a very hot research topic. In this context, effective statistical analysis for high-dimensional data is becoming increasingly important. In much statistical analysis of high-dimensional data, estimating large covariance matrices is needed; this problem has attracted significant research attention in the past decade and has found applications in many fields, such as economics and finance, social networks, smart grid, and climate studies [1-5]. In these applications, the covariance information is necessary for effective dimensionality reduction and discriminant analysis. The goal of covariance estimation is to recover the population covariance matrix of a distribution from independent and identically distributed samples. In the high-dimensional setting, the dimensionality is often comparable to (or even larger than) the sample size, in which case the standard sample covariance matrix estimator has poor performance, since the number of unknown parameters grows quadratically in the dimension [2, 6, 7].

To achieve better estimation of a large covariance matrix, intrinsic structures of the covariance matrix can be exploited by using regularization techniques, such as banding and tapering for banded structure [8-12] and thresholding for sparse structure [13-15]. The former is useful for applications where the variables have a natural ordering and variables far apart are only weakly correlated, such as in longitudinal data, spatial data, or spectroscopy. For other applications, where the variables do not have such properties but the true covariance matrix is sparse, permutation-invariant thresholding methods have been proposed in [13, 14]. These methods have good theoretical properties and are computationally efficient. It has been shown in [15] that the generalized thresholding estimators are consistent over a large class of (approximately) sparse covariance matrices. However, in practical finite-sample applications, such an estimator is not always positive-definite, although it converges to a positive-definite limit in the asymptotic setting.

Positive definiteness is desirable in many statistical learning applications such as quadratic discriminant analysis and covariance-regularized regression [16]. To simultaneously achieve sparsity and positive definiteness, iterative methods have been proposed recently in [17-20]. In [17], a positive-definite estimator has been proposed via maximizing a penalized Gaussian likelihood with a lasso penalty, and a majorize-minimize algorithm has been designed to solve the estimation problem. In [18], a logarithmic barrier term is added into the objective function of the soft-thresholding estimator to enforce positive-definiteness. Then, new positive-definite estimators have been proposed in [19, 20] by imposing an eigenvalue constraint on the optimization problem of the soft-thresholding estimator. Although these positive-definite estimators have good theoretical properties, the derived algorithms are restricted to the convex $\ell_1$-norm (lasso) penalty and cannot be directly extended to use a nonconvex penalty, since they are not guaranteed to converge in that case. In [20], in addition to the $\ell_1$-penalty, the nonconvex minimax concave (MC) penalty has also been considered, and an algorithm has been proposed based on local linear approximation of the MC penalty.

Compared with the convex $\ell_1$-penalty, a nonconvex penalty, such as the hard-thresholding or smoothly clipped absolute deviation (SCAD) penalty, is more advantageous since it can ameliorate the bias problem of the $\ell_1$ one. This work proposes a class of positive-definite covariance estimators using generalized nonconvex penalties. We use an eigenvalue constraint to ensure the positive definiteness of the estimator, similar to [19, 20], but the penalty can be nonconvex. Imposing an eigenvalue constraint on the covariance while simultaneously employing a nonconvex penalty makes the optimization problem challenging. To solve the nonconvex optimization problem efficiently, we present a first-order algorithm based on the alternating direction method (ADM) framework. It has been proved that the sequence generated by the proposed algorithm converges to a stationary point of the objective function if the penalty is a Kurdyka-Lojasiewicz (KL) function. Further, the statistical properties of the new estimator have been analyzed for a generalized nonconvex penalty. The effectiveness of the new estimators has been demonstrated via both a simulation study and a real gene clustering example.

Moreover, extension of the proposed ADM algorithm to sparse covariance estimation from sketches or compressed measurements has also been considered. Covariance sketching is an efficient approach for covariance estimation from a high-dimensional data stream [36-42]. In many practical applications that extract the covariance information from a high-dimensional data stream at a high rate, it may be infeasible to sample and store the whole stream due to memory and processing power constraints. In this case, by exploiting the structure information of the covariance matrix, it can be reliably recovered from compressed measurements of the data stream with a significantly lower dimensionality. The proposed algorithm can simultaneously achieve positive-definiteness and sparsity in estimating the covariance from compressed measurements.

The rest of this paper is organized as follows. Section II introduces the background of large covariance estimation. In Section III, we detail the new algorithm and present some analysis of its convergence property. Section IV contains the statistical properties of the new method. In Section V, we extend the new method to positive-definite covariance estimation from compressed measurements. Section VI contains experimental results. Finally, Section VII ends the paper with concluding remarks.

The following notations are used throughout the paper. For a matrix $M$, $\mathrm{diag}(M)$ is a diagonal matrix which has the same diagonal elements as $M$, whilst for a vector $v$, $\mathrm{diag}(v)$ is a diagonal matrix with diagonal elements $v$. $I(\cdot)$ denotes the indicator function. $I_m$ stands for an $m\times m$ identity matrix and $1_{m\times m}$ denotes an $m\times m$ matrix with all elements equal to one. $(\cdot)^T$ denotes the transpose operator. $\lambda_{\min}(M)$ and $\lambda_{\max}(M)$ stand for the minimum and maximum eigenvalues of $M$, respectively. $\odot$, $\oslash$ and $\otimes$ stand for the Hadamard product, Hadamard (element-wise) division and Kronecker product, respectively. $\mathrm{vec}(M)$ is the "vectorization" operator stacking the columns of the matrix one below another. $\mathrm{dist}(X,\mathcal{S}) := \inf\{\|Y-X\|_F : Y\in\mathcal{S}\}$ denotes the distance from a point $X\in\mathbb{R}^{m\times n}$ to a subset $\mathcal{S}\subseteq\mathbb{R}^{m\times n}$. $X\succeq 0$ and $X\succ 0$ mean that $X$ is positive-semidefinite and positive-definite, respectively.

II. BACKGROUND

For a vector $x\in\mathbb{R}^d$ with covariance $\Sigma^0 = E\{xx^T\}$, the goal is to estimate the covariance from $n$ observations $x_1,\dots,x_n$. In this work, we are interested in estimating the correlation matrix $\Theta^0 = \mathrm{diag}(\Sigma^0)^{-1/2}\,\Sigma^0\,\mathrm{diag}(\Sigma^0)^{-1/2}$, where $\Theta^0$ is the true correlation matrix and $\mathrm{diag}(\Sigma^0)^{1/2}$ is the diagonal matrix of true standard deviations. With the estimated correlation matrix, denoted by $\hat{\Theta}$, the estimate of the covariance matrix is $\hat{\Sigma} = \mathrm{diag}(R)^{1/2}\,\hat{\Theta}\,\mathrm{diag}(R)^{1/2}$, where $R$ denotes the sample covariance matrix. This procedure is more favorable than estimating the covariance matrix directly, since the correlation matrix retains the same sparsity structure as the covariance matrix but with all the diagonal elements known to be one. Since the diagonal elements need not be estimated, the correlation matrix can be estimated more accurately than the covariance matrix [20-22].

A. Generalized Thresholding Estimator

Given the sample correlation matrix $S$, the generalized thresholding estimator [15] solves the following problem

$\min_{\Theta} \ \frac{1}{2}\|\Theta - S\|_F^2 + g_\lambda(\Theta)$   (1)

where $g_\lambda(\Theta)$ is a generalized penalty function depending on a penalty parameter $\lambda$. The penalty function can be expressed in an element-wise form as $g_\lambda(\Theta) = \sum_{i\neq j} g_\lambda(\Theta_{ij})$, which only penalizes the off-diagonal elements since the diagonal elements of a correlation (also covariance) matrix are always positive. The thresholding operation corresponding to $g_\lambda(\cdot)$ is defined as

$T_\lambda(x) = \arg\min_{z} \left\{ \frac{1}{2}(z-x)^2 + g_\lambda(z) \right\}$.   (2)

For the popular SCAD, hard-, $\ell_q$-, and soft-thresholding, the penalties and corresponding thresholding functions are given as follows.

(i) Hard-thresholding. The penalty is given by [23]

$g_\lambda(x) = \frac{\lambda^2}{2} - \frac{(|x|-\lambda)^2}{2}\, I(|x| < \lambda)$

and the corresponding thresholding function is

T ( x ) xI (| x |  ) . (3) imposes a constant shrinkage on the parameter. In contrast, the hard-thresholding and SCAD estimators are unbiased for large Note that, the  0 -norm penalty parameter. Another idea to mitigate the bias of 2  2 soft-thresholding is to employ an adaptive penalty [28]. The g ( x ) | x |0  I (| x |  0) thresholding rule corresponding to this adaptive penalty, as 2 2 well as SCAD and  q , fall in (sandwiched) between hard- and also results in (3). soft-thresholding.

(ii) Soft-thresholding, g ( x ) | x | . The corresponding These generalized covariance estimators are computation- thresholding function is [24] ally efficient, e.g., the SCAD, hard- and soft-thresholding estimators only need an element-wise thresholding operation of T( x ) sign( x )(| x |  ) .   the sample covariance matrix. For the adaptive lasso estimator Soft-thresholding is widely used in sparse variable selection, [28], only a few iterations are sufficient to achieve satisfactory performance, and each iteration only need a soft-thresholding since the  1 -norm minimization problem is more tractable (due to its convexity) than the problems using nonconvex operation. However, these estimators are not always positive definite in practical finite sample applications, although the penalties. However, the  1 -penalty has a bias problem as lasso soft-thresholding would produce biased estimates for large positive definiteness can be guaranteed in the asymptotic set- coefficients. This bias problem can be ameliorated by using a ting with probability tending to 1. Intuitively, to deal with the nonconvex penalty, e.g., Hard-thresholding or SCAD. infiniteness problem, we can project the estimate Θˆ into a ˆ d T (iii) SCAD. The penalty is given by convex cone {Θ 0} . Specifically, let Θ = i1 iv i v i de- note the eigen-decomposition of Θˆ , where i is eigenvalue |x |, 0 | x |   corresponding to the eigenvector . A positive-semidefinite  vi  2 2  d T p ( x ) (2 a | x |  x   ) [2( a  1)],   | x |  a  estimate can be obtained as Θˆ = max(i ,0)v i v i .  i1 (a 1)2 2, | x |  a  However, this projection would destroy the sparsity pattern of  the true correlation matrix and result in a non-sparse estimate for some a2 . The corresponding thresholding function is (see [19] for a detailed example). [25]

 sign(x )(| x | ) , | x |  2   T ( x ) [( a  1) x  sign( x ) a ] ( a  2), 2   | x |  a  .  x, | x | a 

q (iv)  q -norm ( 0q  1 ), p ( x ) (  , q )| x | where (,)() q     1q q with 2(1 q )  (2  q ) . The thresholding function is [26, 31]  0, |x |  T ( x ) {0,sign( x ) }, | x |   (4)  sign(x ) , | x |   where  is the solution of h( )  q q1    x  0 over the region ( ,|x |) . Since h() is convex, when |x | ,  can be Fig. 1. Generalized thesholding\shrinkage functions for the efficiently solved using a Newton’s method, e.g., hard-, soft-,  q - and SCAD penalties. h()k k1   k  , k 0,1,2,. h()k B. Positive Definite Estimator 0 The starting point can be simply chosen as  |x |. For the To simultaneously achieve sparsity and positive-definiteness, special cases of q 1/2 or q 2/3 , the proximal mapping iterative methods have been proposed recently in [18-20], e.g., can be explicitly expressed as the solution of a cubic or quartic the constrained correlation matrix estimator which solves the equation [27]. following problem [20] Fig. 1 shows the generalized thesholding\shrinkage func- 1 tions for the hard-, soft-,  q - and SCAD penalties with a same 2 min ΘS F  Θ 1,off threshold. When the true parameter has a relatively large Θ 2 magnitude, the soft-thresholding estimator is biased since it subject to diag(Θ )Id and Θ1Id (5)

where 1 0 is the lower bound for the minimum eigenvalue. (7) are naturally separated, which makes the problem easy to A covariance matrix version of (5) has been proposed in [19]. tackle as each step of the alternating minimization is much Unlike the works [19] and [20] imposing an eigenvalue con- easier than the global problem. More specifically, using two d d straint, the method in [18] employs a logarithmic barrier term auxiliary variables VV1, 2  , the problem (7) can be to ensure the positive-definiteness of the solution. These works equivalently rewritten as focus mainly on the  1 -norm penalty, since it results in convex 1 2 optimization problems which can be efficiently solved with minΘS F g (VV1 ) 1 ( 2 ) Θ,,VV1 2 2 convergence guarantee. In [20], in addition to the  1 -norm penalty, the nonconvex minimax concave penalty has also been subject to diag(VI1 ) d , GVΘ (8) considered, which retains the global convexity of the problem with when a tuning parameter is appropriately selected. However, as mentioned by the authors, the derived augmented Lagrangian     V1  Id  method (ALM) based algorithm often fails to converge. To V   , G   . V2  Id  address this problem, an alternative algorithm using local     linear approximation has been proposed in [20]. The constrained minimization problem (8) can be attacked via Generally, these iterative algorithms cannot be directly ex- solving tended to use a nonconvex penalty, since they are not guaran- teed to converge in that case. With a nonconvex penalty, the 1 2 2 minΘSGVFFg (VV1 ) 1 ( 2 )  Θ  optimization problem is more difficult to handle than the Θ,,VV1 2 2 2 convex case. subject to diag(VI1 ) d (9)

III. PROPOSED ESTIMATORS USING GENERALIZED NONCONVEX PENALTIES

In this section, we propose a class of positive-definite covariance estimators using generalized penalties as follows

$\min_{\Theta} \ \frac{1}{2}\|\Theta - S\|_F^2 + g_\lambda(\Theta)$
subject to $\mathrm{diag}(\Theta) = I_d$ and $\Theta \succeq \epsilon_1 I_d$   (6)

where $g_\lambda$ is a generalized (possibly nonconvex) penalty, such as the SCAD, hard-, soft-, or $\ell_q$-penalty. The problem (6) can be equivalently rewritten as

$\min_{\Theta} \ \frac{1}{2}\|\Theta - S\|_F^2 + g_\lambda(\Theta) + \iota_{\mathcal{B}_1}(\Theta)$
subject to $\mathrm{diag}(\Theta) = I_d$   (7)

where $\mathcal{B}_1 := \{\Theta : \Theta \succeq \epsilon_1 I_d\}$ is a closed convex set and $\iota_{\mathcal{B}_1}(\cdot)$ denotes the indicator function defined on this set, i.e.,

$\iota_{\mathcal{B}_1}(\Theta) = \begin{cases} 0, & \Theta \in \mathcal{B}_1 \\ \infty, & \text{otherwise} \end{cases}$.

The problem (7) is generally difficult to solve since, in addition to the nonconvexity of the penalty term $g_\lambda$, both the terms $g_\lambda$ and $\iota_{\mathcal{B}_1}$ are nonsmooth. For nonconvex $g_\lambda$, the ALM algorithms proposed in [19, 20] cannot be directly used to solve (7), since these algorithms may fail to converge in this case. In the following, we propose an ADM algorithm to solve the problem (7).

ADM is a powerful optimization framework that is well suited to large-scale problems arising in machine learning and signal processing. In the ADM framework, the three terms in (7) are naturally separated, which makes the problem easy to tackle, as each step of the alternating minimization is much easier than the global problem. More specifically, using two auxiliary variables $V_1, V_2 \in \mathbb{R}^{d\times d}$, the problem (7) can be equivalently rewritten as

$\min_{\Theta, V_1, V_2} \ \frac{1}{2}\|\Theta - S\|_F^2 + g_\lambda(V_1) + \iota_{\mathcal{B}_1}(V_2)$
subject to $\mathrm{diag}(V_1) = I_d$, $\ G\Theta = V$   (8)

with

$V = \begin{bmatrix} V_1 \\ V_2 \end{bmatrix}, \qquad G = \begin{bmatrix} I_d \\ I_d \end{bmatrix}$.

The constrained minimization problem (8) can be attacked via solving

$\min_{\Theta, V_1, V_2} \ \frac{1}{2}\|\Theta - S\|_F^2 + g_\lambda(V_1) + \iota_{\mathcal{B}_1}(V_2) + \frac{\rho}{2}\|G\Theta - V\|_F^2$
subject to $\mathrm{diag}(V_1) = I_d$   (9)

where $\rho > 0$ is a penalty parameter. For sufficiently large $\rho$, e.g., $\rho \to \infty$, the solution of (9) approaches that of the problem (8). In practical applications, selecting a moderate value of $\rho$ suffices to achieve satisfactory performance. Then, ADM applied to (9) consists of the following three steps in the $(k+1)$-th iteration

$V_1^{k+1} = \arg\min_{\mathrm{diag}(V_1)=I_d} \left\{ g_\lambda(V_1) + \frac{\rho}{2}\|\Theta^k - V_1\|_F^2 + \frac{c_k}{2}\|V_1 - V_1^k\|_F^2 \right\}$   (10)

$V_2^{k+1} = \arg\min_{V_2} \left\{ \iota_{\mathcal{B}_1}(V_2) + \frac{\rho}{2}\|\Theta^k - V_2\|_F^2 + \frac{d_k}{2}\|V_2 - V_2^k\|_F^2 \right\}$   (11)

$\Theta^{k+1} = \arg\min_{\Theta} \left\{ \frac{1}{2}\|\Theta - S\|_F^2 + \frac{\rho}{2}\|G\Theta - V^{k+1}\|_F^2 \right\}$   (12)

where $c_k > 0$ and $d_k > 0$. This method considers the proximal regularization of the Gauss-Seidel scheme via coupling the Gauss-Seidel iteration scheme with a proximal term. Using this proximal regularization strategy, as will be shown later, the sequence generated via (10)-(12) is guaranteed to converge in the case of a nonconvex $g_\lambda$.

The $V_1$-subproblem (10) is a proximal minimization problem, which has the solution

$(V_1^{k+1})_{ij} = \begin{cases} T_{\frac{\lambda}{\rho+c_k}}\!\left( \dfrac{\rho\Theta_{ij}^k + c_k (V_1^k)_{ij}}{\rho + c_k} \right), & i \neq j \\ 1, & i = j \end{cases}$   (13)

where $T(\cdot)$ is the thresholding operation corresponding to the penalty function $g_\lambda$. The $V_2$-subproblem (11) is also a proximal minimization problem, whose solution is given by

$V_2^{k+1} = \mathcal{P}\!\left( \frac{\rho\Theta^k + d_k V_2^k}{\rho + d_k}, \, \epsilon_1 \right)$   (14)

where $\mathcal{P}(\cdot, \epsilon_1)$ is a spectral projection operator. Let $M = \sum_{i=1}^d \lambda_i v_i v_i^T$ denote the eigen-decomposition of $M$, where $\lambda_i$ and $v_i$ are the eigenvalues and eigenvectors; $\mathcal{P}(\cdot, \epsilon_1)$ is defined as $\mathcal{P}(M, \epsilon_1) = \sum_{i=1}^d \max(\lambda_i, \epsilon_1)\, v_i v_i^T$.

The objective function in the $\Theta$-subproblem (12) has a quadratic form, whose solution is directly given by

$\Theta^{k+1} = \frac{S + \rho G^T V^{k+1}}{2\rho + 1}$.   (15)

Next, we give a result for the convergence of the ADM algorithm (10)-(12) for generalized penalty functions. While the convergence properties of ADM have been extensively studied for the convex case, there have been only a few studies for the nonconvex case. Very recently, the convergence of ADM has been analyzed under nonconvex frameworks in [29, 30]. The following result is derived via adopting the approaches in [29, 30].

Theorem 1. Suppose that $g_\lambda$ is a closed, proper, lower semi-continuous, KL function. Let $Z^k = (\Theta^k, V_1^k, V_2^k)$; for an arbitrary starting point $Z^0$, the sequence $\{Z^k\}$ generated by the ADM algorithm via (10), (11) and (12) has finite length

$\sum_{k=1}^{\infty} \|Z^{k+1} - Z^k\|_F < \infty$.   (16)

In particular, the sequence $\{Z^k\}$ converges to a stationary point of the problem (9).

Proof: See Appendix A.

In the proposed algorithm, a large value of the penalty parameter $\rho$ is desirable in order to enforce that $\|G\Theta - V\|_F^2 \to 0$. When $\rho \to \infty$, the solution of (9) accurately approaches that of the problem (8). However, an ADM algorithm in general tends to be very slow when $\rho$ gets very large. In practical applications, a standard trick to speed up the algorithm is to adopt a continuation process for the penalty parameter. Specifically, we can use a properly small starting value of the penalty parameter and gradually increase it by iteration until reaching the target value, e.g., $\rho_0 \le \rho_1 \le \cdots \le \rho_K = \rho_{K+1} = \cdots = \rho_{\max}$. In this case, Theorem 1 still applies, as the penalty parameter becomes fixed (at $\rho_{\max}$) after a finite number of iterations. Moreover, in the case of a nonconvex penalty, the performance of the proposed algorithm is closely related to the initialization. Intensive numerical studies show that the new algorithm can achieve satisfactory performance with an initialization by a convex $\ell_1$-penalty based method.

IV. STATISTICAL PROPERTIES

In this section, we analyze the statistical properties of the proposed estimator for generalized nonconvex penalties. Similar to [20], we consider the following class of "approximately sparse" correlation matrices

$\mathcal{M}(q, M_d, \epsilon) := \left\{ \Theta : \max_i \sum_{j\neq i} |\Theta_{ij}|^q \le M_d, \ \Theta_{ii} = 1, \ \lambda_{\min}(\Theta) \ge \epsilon \right\}$   (17)

for $0 \le q < 1$. Further, we define a class of covariance matrices based on (17) as

$\mathcal{U}(q, M_d, v, \epsilon) := \left\{ \Sigma : \max_i \Sigma_{ii} \le v, \ \Sigma_d^{-1/2}\Sigma\,\Sigma_d^{-1/2} \in \mathcal{M}(q, M_d, \epsilon) \right\}$

where $\Sigma_d = \mathrm{diag}(\Sigma_{11}, \dots, \Sigma_{dd})$.

To derive the statistical properties, we assume each marginal distribution of $X = (X_1, \dots, X_d)$ is sub-Gaussian and satisfies the exponential-tail condition

$E\{\exp(tX_i^2)\} \le K < \infty$ for $|t| \le \eta$   (18)

for some constants $K$ and $\eta > 0$.

We first give a result for the special case of strictly sparse covariance matrices with $q = 0$, under the assumption that the penalty $g_\lambda$ satisfies

$\max_{(i,j)\in S} g'_\lambda(\Theta_{ij}^0) = O\!\left(\sqrt{\frac{\log d}{n}}\right)$ and $\max_{(i,j)\in S} g''_\lambda(\Theta_{ij}^0) = o(1)$   (19)

where $\Theta^0$ is the true correlation matrix and $S$ denotes the off-diagonal support of $\Theta^0$, i.e., $S := \{(i,j) : \Theta_{ij}^0 \neq 0, \ i \neq j\}$. For example, if the penalty parameter $\lambda$ is selected to satisfy

$\lambda \le \frac{1}{2}\min_{(i,j)\in S} |\Theta_{ij}^0|$

as $n \to \infty$, then for the hard-thresholding and SCAD penalties, $\max_{(i,j)\in S} g'_\lambda(\Theta_{ij}^0) = \max_{(i,j)\in S} g''_\lambda(\Theta_{ij}^0) = 0$ for sufficiently large $n$, while for the $\ell_1$-penalty, $\max_{(i,j)\in S} g'_\lambda(\Theta_{ij}^0) = \lambda$ and $\max_{(i,j)\in S} g''_\lambda(\Theta_{ij}^0) = 0$.

Theorem 2: Suppose that $\Sigma^0 \in \mathcal{U}(0, M_d, v, \epsilon_{\min})$ for some $\epsilon_{\min} > 0$, and let $s = |S|$ denote the number of nonzero off-diagonal elements in $\Sigma^0$. Under condition (18) and for a nonconvex penalty satisfying (19), if $\epsilon_1 \le \epsilon_{\min}$, $\lambda = O(\sqrt{s\log d / n})$ and $\sqrt{\log d / n} = o(1)$, then there exists at least one local minimizer $\hat{\Theta}$ of (6) such that

$\|\hat{\Theta} - \Theta^0\|_F = O_P\!\left(\sqrt{\frac{s\log d}{n}}\right)$.

Further, for the spectral norm of the covariance matrix $\hat{\Sigma} = \mathrm{diag}(R)^{1/2}\hat{\Theta}\,\mathrm{diag}(R)^{1/2}$,

$\|\hat{\Sigma} - \Sigma^0\|_2 = O_P\!\left(\sqrt{\frac{s\log d}{n}}\right)$.

In addition, for the $\ell_1$-norm penalty, we only need $\lambda = O(\sqrt{\log d / n})$.

Proof: See Appendix B.

Note that this result allows $d \gg n$ as long as $\sqrt{\log d / n} = o(1)$, and it achieves the minimax optimal rate of convergence under both the Frobenius and spectral norms [11]. This result is derived for the procedure in which the covariance matrix is obtained by first estimating the correlation matrix. If we estimate the covariance directly, the resulting rate of convergence would be $O_P(\sqrt{(d+s)\log d / n})$, where the part $\sqrt{d\log d / n}$ comes from estimating the diagonal.

Next, we give an asymptotic result for the generalized case of approximately sparse covariance matrices with $0 \le q < 1$. Similar to [15], we assume that the thresholding/shrinkage operator $T_\lambda$ corresponding to the penalty $g_\lambda$ satisfies

(i) $T_\lambda(x) = \mathrm{sign}(x)\, T_\lambda(|x|)$ and $|T_\lambda(x)| \le |x|$;

(ii) $T_\lambda(x) = 0$ for $|x| \le \lambda$;

(iii) $|T_\lambda(x) - x| \le \lambda$.   (20)

These conditions establish the sign consistency, shrinkage, thresholding, and limited shrinkage properties for the thresholding/shrinkage operator corresponding to a generalized penalty.

Theorem 3: Suppose that $\Sigma^0 \in \mathcal{U}(q, M_d, v, \epsilon_{\min})$ for some $\epsilon_{\min} > 0$, and the thresholding/shrinkage operator corresponding to $g_\lambda$ satisfies (20). Let $\hat{\Theta}$ denote the estimate given by (6). Under condition (18), there exist constants $c_0$ and $c_1$ such that, if $\epsilon_1 \le \epsilon_{\min}$, $\lambda = c_0\sqrt{\log d / n} = o(1)$ and $n \ge [(c_1 M_d)/(\epsilon_{\min} - \epsilon_1)]^{2/(1-q)}\log d$, then

$\|\hat{\Theta} - \Theta^0\|_F = O_P\!\left( M_d\left(\frac{\log d}{n}\right)^{(1-q)/2} \right)$.

Further, for the spectral norm of the covariance matrix $\hat{\Sigma} = \mathrm{diag}(R)^{1/2}\hat{\Theta}\,\mathrm{diag}(R)^{1/2}$,

$\|\hat{\Sigma} - \Sigma^0\|_2 = O_P\!\left( M_d\left(\frac{\log d}{n}\right)^{(1-q)/2} \right)$.

Theorem 3 is derived based on the results in [15]. Specifically, under the conditions in Theorem 3, it can be shown that the estimator (6) gives the same estimate as the generalized thresholding estimator (1) with overwhelming probability in the asymptotic case. The detailed proof follows similarly to that in [20] with some changes, and is omitted here for succinctness. Note that, in the strictly sparse case of $q = 0$, $M_d$ is a bound on the number of non-zero elements in each row (without counting the diagonal element). Thus, in this case $s$ is controlled via $M_d$, and the rate in Theorem 3 coincides with that in Theorem 2, even though the approaches of proof are very different.

In finite-sample cases, the generalized thresholding estimator (1) is not guaranteed to be positive-definite, but, empirically, it can be positive-definite in many situations. Thus, in practical applications, we can first check the minimum eigenvalue of the generalized thresholding estimator, and proceed to the proposed algorithm only when the feasibility condition of positive-definiteness is not satisfied.

V. EXTENSION TO COVARIANCE SKETCHING

Sketching via random linear measurements is a useful algorithmic tool for dimensionality reduction, which has been widely used in computer science [32, 33] and compressed sensing [34, 35]. Recently, covariance estimation from sketches or compressed measurements has attracted much attention in various fields of science and engineering [36-42]. An important application of covariance sketching arises in Big Data, e.g., covariance estimation from a high-dimensional data stream. To extract the covariance information from a high-dimensional real-time data stream at a high rate, it may be infeasible or undesirable to sample and store the whole stream due to memory and processing power constraints. In such a scenario, it is desirable to extract the covariance information of the data instances from compressed inputs with low memory requirements and computational complexity, and without storing the entire stream.

Let $x\in\mathbb{R}^d$ denote a zero-mean random vector with covariance matrix $\Sigma^0$, and let $A\in\mathbb{R}^{m\times d}$ be the sketching matrix with $m \ll d$. With compressed samples $y_t = Ax_t$, $t = 1, 2, \dots, n$, define the sample covariance of $y_t$ as

$Y = \frac{1}{n}\sum_{t=1}^{n} y_t y_t^T = ARA^T = A\Sigma^0 A^T + N$   (21)

where $N := A(R - \Sigma^0)A^T$ is a perturbation term. Under the assumption that $x$ is zero-mean and its covariance is finite, the perturbation $N$ in (21) satisfies $E\{N\} = 0$ and $E\{\|N\|_F^2\} = O(1/n)$. Since $m \ll d$, the reconstruction of $\Sigma^0$ from $Y$ is an ill-posed problem. However, if the covariance matrix $\Sigma^0$ has low-dimensional structures such as low-rankness and sparsity, it can be accurately reconstructed from $Y$ by exploiting such structural information. Given the sample covariance $Y$ of the compressed observations, we can find a positive definite estimate of $\Sigma^0$ via solving the following optimization problem

$\min_{\Sigma} \ \frac{1}{2}\|Y - A\Sigma A^T\|_F^2 + g_\lambda(\Sigma)$ subject to $\Sigma \succeq \epsilon_2 I_d$   (22)

where $\epsilon_2 \ge 0$ is the lower bound for the minimum eigenvalue of the covariance matrix. When $A = I_d$ and the covariance matrix is replaced by the correlation matrix, (22) reduces to the correlation estimation problem (5).
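As an illustration of the measurement model (21), here is a small Python sketch (ours; the Gaussian sketching matrix and all sizes are arbitrary choices, not prescribed by the paper) that forms the compressed-sample covariance $Y$ from a stream of sketches $y_t = Ax_t$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 100, 40, 5000                          # ambient dim, sketch size (m << d), samples
A = rng.standard_normal((m, d)) / np.sqrt(m)     # assumed Gaussian sketching matrix

# a sparse positive-definite covariance to sample from (banded model of Section VI)
idx = np.arange(d)
Sigma0 = np.maximum(1.0 - np.abs(idx[:, None] - idx[None, :]) / 10.0, 0.0)

L = np.linalg.cholesky(Sigma0)
Y = np.zeros((m, m))
for _ in range(n):
    x = L @ rng.standard_normal(d)   # x ~ N(0, Sigma0), never stored in full
    y = A @ x                        # compressed sample y_t = A x_t, kept in R^m only
    Y += np.outer(y, y) / n          # running sample covariance of the sketches, cf. (21)

# Y approximates A Sigma0 A^T; the perturbation N = Y - A Sigma0 A^T has zero mean
print(np.linalg.norm(Y - A @ Sigma0 @ A.T, 'fro'))
```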

Very recently, an ALM algorithm has been proposed in [42] to solve (22) for convex penalties, e.g., $g_\lambda(\Sigma) = \lambda\|\Sigma\|_1$. However, this algorithm is not guaranteed to converge for a nonconvex $g_\lambda$. In the following, we show that the ADM algorithm presented in Section III can be extended to solve (22) with convergence guarantee for a nonconvex penalty.

Similar to (7), the problem (22) can be equivalently rewritten as

$\min_{\Sigma} \ \frac{1}{2}\|Y - A\Sigma A^T\|_F^2 + g_\lambda(\Sigma) + \iota_{\mathcal{B}_2}(\Sigma)$   (23)

where $\mathcal{B}_2 := \{\Sigma : \Sigma \succeq \epsilon_2 I_d\}$ is a closed convex set and $\iota_{\mathcal{B}_2}(\cdot)$ denotes the indicator function defined on this set. Using two auxiliary variables $\Gamma_1, \Gamma_2 \in \mathbb{R}^{d\times d}$, the problem (23) can be equivalently rewritten as

$\min_{\Sigma, \Gamma_1, \Gamma_2} \ \frac{1}{2}\|Y - A\Sigma A^T\|_F^2 + g_\lambda(\Gamma_1) + \iota_{\mathcal{B}_2}(\Gamma_2)$
subject to $G\Sigma = \Gamma$   (24)

with

$\Gamma = \begin{bmatrix} \Gamma_1 \\ \Gamma_2 \end{bmatrix}$.

The constrained minimization problem (24) can be attacked via solving

$\min_{\Sigma, \Gamma_1, \Gamma_2} \ \frac{1}{2}\|Y - A\Sigma A^T\|_F^2 + g_\lambda(\Gamma_1) + \iota_{\mathcal{B}_2}(\Gamma_2) + \frac{\rho}{2}\|G\Sigma - \Gamma\|_F^2$   (25)

where $\rho > 0$ is a penalty parameter. For sufficiently large $\rho$, e.g., $\rho \to \infty$, the solution of (25) accurately approaches that of the problem (24). Then, ADM applied to (25) consists of the following three steps in the $(k+1)$-th iteration

$\Gamma_1^{k+1} = \arg\min_{\Gamma_1} \left\{ g_\lambda(\Gamma_1) + \frac{\rho}{2}\|\Sigma^k - \Gamma_1\|_F^2 + \frac{c_k}{2}\|\Gamma_1 - \Gamma_1^k\|_F^2 \right\}$   (26)

$\Gamma_2^{k+1} = \arg\min_{\Gamma_2} \left\{ \iota_{\mathcal{B}_2}(\Gamma_2) + \frac{\rho}{2}\|\Sigma^k - \Gamma_2\|_F^2 + \frac{d_k}{2}\|\Gamma_2 - \Gamma_2^k\|_F^2 \right\}$   (27)

$\Sigma^{k+1} = \arg\min_{\Sigma} \left\{ \frac{1}{2}\|Y - A\Sigma A^T\|_F^2 + \frac{\rho}{2}\|G\Sigma - \Gamma^{k+1}\|_F^2 \right\}$.   (28)

The $\Gamma_1$- and $\Gamma_2$-subproblems can be respectively solved as

$(\Gamma_1^{k+1})_{ij} = \begin{cases} T_{\frac{\lambda}{\rho+c_k}}\!\left( \dfrac{\rho\Sigma_{ij}^k + c_k (\Gamma_1^k)_{ij}}{\rho + c_k} \right), & i \neq j \\ \Sigma_{ii}^k, & i = j \end{cases}$   (29)

and

$\Gamma_2^{k+1} = \mathcal{P}\!\left( \frac{\rho\Sigma^k + d_k \Gamma_2^k}{\rho + d_k}, \, \epsilon_2 \right)$.   (30)

Take the orthogonal eigen-decomposition of $A^T A$ as $A^T A = E\Lambda E^T$; the solution of the $\Sigma$-subproblem (28) is then given by (see Appendix C)

$\Sigma^{k+1} = E\left[ \left( E^T (A^T Y A + \rho\, G^T \Gamma^{k+1}) E \right) \oslash \left( aa^T + 2\rho\, 1_{d\times d} \right) \right] E^T$   (31)

where the vector $a\in\mathbb{R}^d$ contains the eigenvalues of $A^T A$, i.e., $\Lambda = \mathrm{diag}(a)$.

Theorem 4. Suppose that $g_\lambda$ is a closed, proper, lower semi-continuous, KL function. Let $Z^k = (\Sigma^k, \Gamma_1^k, \Gamma_2^k)$; for an arbitrary starting point $Z^0$, the sequence $\{Z^k\}$ generated by the ADM algorithm via (26), (27) and (28) has finite length

$\sum_{k=1}^{\infty} \|Z^{k+1} - Z^k\|_F < \infty$.   (32)

In particular, the sequence $\{Z^k\}$ converges to a stationary point of the problem (25).

Proof: See Appendix A.

In implementing the proposed algorithm, a large value of $\rho$ is desirable in order to enforce that $\|G\Sigma - \Gamma\|_F^2 \to 0$. Similar to the discussion in Section III, we can adopt a continuation process for the penalty parameter to speed up this algorithm. Moreover, for a nonconvex penalty, a good initialization, e.g., by a convex $\ell_1$-penalty based method, is crucial for this algorithm to achieve satisfactory performance.
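For concreteness, here is a short Python sketch (ours, not the released code) of the closed-form $\Sigma$-update (31); the eigendecomposition of $A^TA$ is computed once and reused at every iteration.

```python
import numpy as np

def sigma_update(Y, A, Gamma1, Gamma2, rho, E=None, a=None):
    # Closed-form minimizer of (28), cf. (31): solve A^T A S A^T A + 2*rho*S = Z
    # by transforming to the eigenbasis of A^T A = E diag(a) E^T.
    if E is None or a is None:
        a, E = np.linalg.eigh(A.T @ A)           # precompute once outside the loop
    Z = A.T @ Y @ A + rho * (Gamma1 + Gamma2)    # A^T Y A + rho * G^T Gamma
    denom = np.outer(a, a) + 2.0 * rho           # a a^T + 2*rho*1_{dxd}
    return E @ ((E.T @ Z @ E) / denom) @ E.T
```

The $\Gamma_1$- and $\Gamma_2$-steps reuse the element-wise thresholding and spectral projection helpers sketched earlier.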

VI. NUMERICAL EXPERIMENTS

We evaluate the proposed estimators using the SCAD, hard-, $\ell_q$-, and lasso penalties, termed SCAD-ADM, Hard-ADM, Lq-ADM and L1-ADM, respectively, in comparison with the ALM algorithm using the lasso penalty [20], termed L1-ALM. We chose $q = 0.5$ for Lq-ADM and set a lower bound of $10^{-3}$ for the minimum eigenvalue for each estimator. Moreover, for the proposed algorithms, a continuation process is used for the penalty parameter as $\rho_{k+1} = 1.09\rho_k$ if $\rho_k < \rho_{\max}$ and $\rho_{k+1} = \rho_{\max}$ otherwise. The thresholding parameter for each estimator is chosen by subsampling and fivefold cross-validation [13]. We conduct mainly two evaluation experiments, on simulated data sets and a real gene data set, respectively.

Matlab codes for the proposed estimators and for reproducing the results in the following experiments are available at https://github.com/FWen/Nonconvex-PDLCE.git.

A. Simulated Datasets

We consider three typical sparse covariance matrix models, which are standard test cases in the literature. Fig. 2 shows the heat maps of these three covariance models with $d = 100$.

Block matrix: Partition the indices $\{1, 2, \dots, d\}$ evenly into $K = d/20$ nonoverlapping subsets $S_1, \dots, S_K$; the covariance is given by

$(\Sigma^0)_{ij} = 0.2\, I(i = j) + 0.8\sum_{k=1}^{K} I(i \in S_k, j \in S_k)$.

Toeplitz matrix:

0 i j (Σ )ij  0.75 .

Banded matrix:

0 (Σ )ij  1  |i  j | 10 .  +


Fig. 2. Heat maps of the three simulated covariance matrices ((a) block, (b) Toeplitz, (c) banded) for $d = 100$.
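A short Python sketch (ours) that builds the three test covariance models; the block-model constants 0.2/0.8 follow the reconstructed definition above and are assumptions to be checked against the released code.

```python
import numpy as np

def block_cov(d, block=20):
    # (Sigma0)_ij = 0.2*I(i=j) + 0.8*I(i and j in the same block of size `block`)
    groups = np.repeat(np.arange(d // block), block)
    same = (groups[:, None] == groups[None, :]).astype(float)
    return 0.2 * np.eye(d) + 0.8 * same

def toeplitz_cov(d, r=0.75):
    # (Sigma0)_ij = r^{|i-j|}
    idx = np.arange(d)
    return r ** np.abs(idx[:, None] - idx[None, :])

def banded_cov(d, bandwidth=10):
    # (Sigma0)_ij = (1 - |i-j|/bandwidth)_+
    idx = np.arange(d)
    return np.maximum(1.0 - np.abs(idx[:, None] - idx[None, :]) / bandwidth, 0.0)
```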

We consider two dimension conditions, $d = 100$ and $d = 400$, and five sample-size conditions, $n \in \{100, 200, 400, 600, 800\}$. The performance is evaluated in terms of the relative error of estimation under both the Frobenius norm and the spectral norm. Each provided result is an average over 100 independent Monte Carlo runs. For each independent run, the data set (of size $n$) is generated from a $d$-dimensional Gaussian distribution with zero mean and covariance $\Sigma^0$.

Fig. 3 and Fig. 4 show the estimation performance of the compared estimators for $d = 100$ and $d = 400$, respectively. In the condition of $d = 100$, the proposed estimators with the nonconvex SCAD, hard- and $\ell_q$-penalties show considerable performance gain over the lasso penalty based L1-ALM and L1-ADM estimators, except for the case of the Toeplitz covariance model under the Frobenius norm metric. This is due to the fact that the considered Toeplitz covariance model is less sparse than the other two covariance models. For example, for the block covariance model and $n = 200$, the averaged estimation errors of the SCAD-ADM, Hard-ADM and Lq-ADM estimators under the Frobenius norm are approximately 57.3%, 59.7% and 62.3% that of the L1-ALM estimator, while those under the spectral norm are 66.4%, 69.5% and 74.2%, respectively.

Fig. 3. Estimation performance of the compared algorithms for $d = 100$ ((a) block, (b) Toeplitz, (c) banded).

The advantages of the SCAD-ADM, Hard-ADM and Lq-ADM estimators over the L1-ALM and L1-ADM estimators are more significant in the condition of $d = 400$, since the three covariance models in this condition are more sparse compared with the condition of $d = 100$. For example, in this condition, for the block covariance model and $n = 200$, the averaged estimation errors of the SCAD-ADM, Hard-ADM and Lq-ADM estimators under the Frobenius norm are approximately 59.5%, 59.1% and 61.3% that of the L1-ALM estimator, while those under the spectral norm are 66.3%, 67.5% and 67.3%, respectively. The proposed estimator with the lasso penalty (L1-ADM) performs comparably with the L1-ALM estimator.

Fig. 4. Estimation performance of the compared algorithms for $d = 400$ ((a) block, (b) Toeplitz, (c) banded).

Fig. 5 shows the eigenvalues of the compared estimators in a typical case for $d = 100$ and $n = 200$. Extensive numerical studies show that all the compared estimators can achieve a positive definiteness rate of 100%, but the generalized thresholding estimators often yield indefinite covariance matrices, similar to the results shown in [18-20].

Fig. 5. Typical plots of the eigenvalues of the compared estimators for $d = 100$ and $n = 200$.
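The evaluation protocol above is straightforward to script. The following Python sketch (ours) computes the average relative errors under the Frobenius and spectral norms over Monte Carlo runs, with a placeholder `estimate` callback standing in for any of the compared estimators.

```python
import numpy as np

def relative_errors(Sigma0, estimate, n=200, runs=100, seed=0):
    """Average relative error under the Frobenius and spectral norms.

    estimate(R) maps a sample covariance matrix R to a covariance estimate."""
    rng = np.random.default_rng(seed)
    d = Sigma0.shape[0]
    L = np.linalg.cholesky(Sigma0)
    err_f, err_2 = 0.0, 0.0
    for _ in range(runs):
        X = rng.standard_normal((n, d)) @ L.T        # n samples from N(0, Sigma0)
        R = X.T @ X / n                               # sample covariance
        Sigma_hat = estimate(R)
        err_f += np.linalg.norm(Sigma_hat - Sigma0, 'fro') / np.linalg.norm(Sigma0, 'fro')
        err_2 += np.linalg.norm(Sigma_hat - Sigma0, 2) / np.linalg.norm(Sigma0, 2)
    return err_f / runs, err_2 / runs
```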

B. Gene Clustering Example

Gene clustering based on the correlations among genes is a popular technique in gene expression data analysis [43]. Here we consider a gene clustering example using a gene expression dataset from a small round blue-cell tumors (SRBCTs) microarray experiment [44] to further evaluate the compared methods. This dataset contains 88 SRBCT tissue samples, with 63 labeled calibration samples and 25 test samples, and 2308 gene expression values are recorded for each of the samples. These 2308 genes were selected out of 6567 originally measured genes by requiring that each gene has a red intensity greater than 20 over all the tissue samples. We use the 63 labeled calibration samples and pick the top 40 and bottom 160 genes based on their F-statistics, as done in [15]. Accordingly, the top 40 genes are informative while the bottom 160 genes are non-informative, and the dependence between these two parts is weak. We apply hierarchical clustering (group average agglomerative clustering) to the genes using the correlations estimated by the compared methods; a minimal clustering script is sketched below. The generalized thresholding estimators [15], including soft-, hard- and SCAD-thresholding, are also tested in the comparison.

Fig. 6 shows the heat map of the gene expression data of the top 40 genes of the 63 samples. The genes are sorted by hierarchical clustering based on the sample correlations, and the samples are sorted by tissue class (there are four classes of tumors in the sample, including 23 EWS, 8 BL, 12 NB, and 20 RMS).

Fig. 7 shows the heat maps of the absolute values of the estimated correlations by the compared estimators for the selected 200 genes. For each estimator, the presented heat map is ordered by group average agglomerative clustering based on the estimated correlation matrix. Meanwhile, for each estimated correlation matrix, the percentage of the entries with absolute values less than $10^{-5}$ is also shown. Fig. 8 plots the bottom 80 eigenvalues (ordered in descending values) of the compared estimators. It can be clearly seen from the heat maps that, compared with the L1-ALM and L1-ADM estimators, the SCAD-ADM, Hard-ADM and Lq-ADM estimators give cleaner and more informative estimates of the sparsity pattern. Moreover, the soft-, hard- and SCAD-thresholding estimators yield cleaner estimates of the sparsity pattern than the iterative positive-definite estimators. However, from Fig. 8, while the estimates of the proposed methods and the L1-ALM method are positive-definite, the estimates of the soft-, hard- and SCAD-thresholding methods are indefinite. Specifically, these three generalized thresholding estimators contain 13, 40 and 10 negative eigenvalues, respectively.
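A minimal Python sketch (ours) of the group-average agglomerative clustering step, using SciPy and treating $1 - |\hat{\rho}_{ij}|$ as the dissimilarity between genes; the estimator used to produce `Theta_hat` is interchangeable.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_genes(Theta_hat, n_clusters=4):
    # Group-average (UPGMA) agglomerative clustering of genes from an
    # estimated correlation matrix, using 1 - |rho_ij| as dissimilarity.
    D = 1.0 - np.abs(Theta_hat)
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method='average')
    return fcluster(Z, t=n_clusters, criterion='maxclust'), Z
```

The leaf ordering implied by `Z` can then be used to sort the rows and columns of the heat maps as in Figs. 6 and 7.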

Fig. 6. Heat maps of (a) the gene expression data and (b) the sample correlations of the top 40 genes (with genes sorted by hierarchical clustering and tissue samples sorted by tissue class).

VII. CONCLUSION

This work proposed a class of positive-definite covariance estimators using generalized nonconvex penalties, and developed an alternating direction algorithm to efficiently solve the corresponding formulation. The proposed algorithm is guaranteed to converge in the case of a generalized nonconvex penalty. The established statistical properties of the new estimators indicate that, for generalized nonconvex penalties, they can attain the optimal rates of convergence in terms of both the Frobenius norm and spectral norm errors. Extension of the proposed algorithm to covariance sketching has been discussed. In the simulation study, the new estimators showed significantly lower estimation error under both the Frobenius norm and the spectral norm compared with lasso penalty based estimators. The advantage of the new estimators has also been demonstrated by a gene clustering example using a gene expression dataset from a small round blue-cell tumors microarray experiment.

Fig. 8. Plots of the bottom 80 eigenvalues of the compared estimators.

Fig. 7. Heat maps of the absolute values of the estimated correlations by the compared estimators for the selected 200 genes. The percentage of the entries with absolute values less than $10^{-5}$ is given in parentheses.

APPENDIX A

Note that the ADM algorithm (10)-(12) is a special case of the algorithm (26)-(28) when $A = I_d$ and the covariance matrix is replaced by the correlation matrix. Accordingly, we only present the proof of Theorem 4, while Theorem 1 can be derived in a similar manner. Theorem 4 is derived via adopting the approach in [29, 30]; we only sketch the proof here. In the sequel, for convenience we use the notations

$L(\Sigma, \Gamma_1, \Gamma_2) := \frac{1}{2}\|Y - A\Sigma A^T\|_F^2 + g_\lambda(\Gamma_1) + \iota_{\mathcal{B}_2}(\Gamma_2) + \frac{\rho}{2}\|G\Sigma - \Gamma\|_F^2$   (33)

and $Z^k := (\Sigma^k, \Gamma_1^k, \Gamma_2^k)$.

We first give the following lemmas used in the proof of Theorem 4.

Lemma 1. Let $\{Z^k\}$ be a sequence generated by the ADM algorithm (26)-(28). Then the sequence $\{L(Z^k)\}$ is nonincreasing,

$\sum_{k\ge 0} \|Z^{k+1} - Z^k\|_F^2 < \infty$,

and hence $\lim_{k\to\infty}\|Z^{k+1} - Z^k\|_F = 0$.

Lemma 2. Let $\{Z^k\}$ be a sequence generated by the ADM algorithm (26)-(28), and define

$B_{\Gamma_1}^k = c_{k-1}(\Gamma_1^{k-1} - \Gamma_1^k) + \rho(\Sigma^{k-1} - \Sigma^k)$,
$B_{\Gamma_2}^k = d_{k-1}(\Gamma_2^{k-1} - \Gamma_2^k) + \rho(\Sigma^{k-1} - \Sigma^k)$.

Then $(0, B_{\Gamma_1}^k, B_{\Gamma_2}^k) \in \partial L(Z^k)$ and there exists $C_1 > 0$ such that

$\|(0, B_{\Gamma_1}^k, B_{\Gamma_2}^k)\|_F \le C_1\|Z^k - Z^{k-1}\|_F$.

Lemma 3. Let $\{Z^k\}$ be a sequence generated by the ADM algorithm (26)-(28), and denote the cluster point set by $\Omega$. The following assertions hold.
(i) Any cluster point of $\{Z^k\}$ is a stationary point of $L$.
(ii) $\lim_{k\to\infty}\mathrm{dist}(Z^k, \Omega) = 0$ and $\Omega$ is a nonempty, compact and connected set.
(iii) The objective function $L$ is finite and constant on $\Omega$.

Proof of Lemma 1: From the definition of $\Gamma_1^{k+1}$ as a minimizer of the objective in (26), we have

$L(\Sigma^k, \Gamma_1^{k+1}, \Gamma_2^k) \le L(\Sigma^k, \Gamma_1^k, \Gamma_2^k) - \frac{c_k}{2}\|\Gamma_1^{k+1} - \Gamma_1^k\|_F^2$.   (34)

Similarly, it follows from (27) that

$L(\Sigma^k, \Gamma_1^{k+1}, \Gamma_2^{k+1}) \le L(\Sigma^k, \Gamma_1^{k+1}, \Gamma_2^k) - \frac{d_k}{2}\|\Gamma_2^{k+1} - \Gamma_2^k\|_F^2$.   (35)

Moreover, the Hessian of $L(\Sigma, \Gamma_1, \Gamma_2)$ with respect to $\Sigma$ satisfies

$\nabla_\Sigma^2 L(\Sigma, \Gamma_1, \Gamma_2) = (A^TA)\otimes(A^TA) + 2\rho I_{d^2} \succeq [\lambda_{\min}^2(A^TA) + 2\rho]\, I_{d^2}$   (36)

where the inequality follows from $(A^TA)\otimes(A^TA) \succeq \lambda_{\min}^2(A^TA)\, I_{d^2}$, since the eigenvalues of $A\otimes B$ are the pairwise products of the eigenvalues of $A$ and $B$. The inequality in (36) implies that the objective function in the $\Sigma$-subproblem (28) is $(\lambda_{\min}^2(A^TA) + 2\rho)$-strongly convex with respect to $\Sigma$. Then, from the definition of $\Sigma^{k+1}$ as a minimizer, i.e., $\nabla_\Sigma L(\Sigma^{k+1}, \Gamma_1^{k+1}, \Gamma_2^{k+1}) = 0$, we have

$L(\Sigma^{k+1}, \Gamma_1^{k+1}, \Gamma_2^{k+1}) \le L(\Sigma^k, \Gamma_1^{k+1}, \Gamma_2^{k+1}) - \frac{\lambda_{\min}^2(A^TA) + 2\rho}{2}\|\Sigma^{k+1} - \Sigma^k\|_F^2$.   (37)

Summing (34), (35) and (37), we obtain

$L(\Sigma^k, \Gamma_1^k, \Gamma_2^k) - L(\Sigma^{k+1}, \Gamma_1^{k+1}, \Gamma_2^{k+1}) \ge \frac{c_k}{2}\|\Gamma_1^{k+1} - \Gamma_1^k\|_F^2 + \frac{d_k}{2}\|\Gamma_2^{k+1} - \Gamma_2^k\|_F^2 + \frac{\lambda_{\min}^2(A^TA) + 2\rho}{2}\|\Sigma^{k+1} - \Sigma^k\|_F^2$   (38)

which implies that $L(\Sigma^k, \Gamma_1^k, \Gamma_2^k)$ is nonincreasing. It follows from (38) that we can find a positive constant $C_0 > 0$ such that

$L(\Sigma^k, \Gamma_1^k, \Gamma_2^k) - L(\Sigma^{k+1}, \Gamma_1^{k+1}, \Gamma_2^{k+1}) \ge \frac{C_0}{2}\|Z^{k+1} - Z^k\|_F^2$.   (39)

Let $N$ be a positive integer; summing up (39) from $k = 0$ to $N-1$ we have

$\sum_{k=0}^{N-1}\|Z^{k+1} - Z^k\|_F^2 \le \frac{2}{C_0}\left[L(Z^0) - L(Z^N)\right]$.

Since $L$ is lower semi-continuous and all its sub-functions are bounded from below, $L$ is bounded from below. Further, since $L(Z^k)$ is nonincreasing, it converges to some real number $L^*$. Taking the limit as $N\to\infty$, we obtain

$\sum_{k=0}^{\infty}\|Z^{k+1} - Z^k\|_F^2 \le \frac{2}{C_0}\left[L(Z^0) - L^*\right] < \infty$.

Proof of Lemma 2: From the definition of the iterative steps (26)-(28), $(\Sigma^k, \Gamma_1^k, \Gamma_2^k)$ given by the $k$-th iteration satisfies

$c_{k-1}(\Gamma_1^{k-1} - \Gamma_1^k) + \rho(\Sigma^{k-1} - \Gamma_1^k) \in \partial g_\lambda(\Gamma_1^k)$   (40)

$d_{k-1}(\Gamma_2^{k-1} - \Gamma_2^k) + \rho(\Sigma^{k-1} - \Gamma_2^k) \in \partial\iota_{\mathcal{B}_2}(\Gamma_2^k)$   (41)

and

$0 = \nabla_\Sigma L(Z^k)$.   (42)

Then, from (40) and (41), it is easy to see that

$B_{\Gamma_1}^k \in \partial_{\Gamma_1} L(Z^k), \qquad B_{\Gamma_2}^k \in \partial_{\Gamma_2} L(Z^k)$

which together with (42) implies $(0, B_{\Gamma_1}^k, B_{\Gamma_2}^k) \in \partial L(Z^k)$. Furthermore, it is easy to see from (33) that the sequence $\{Z^k\}$ is bounded, since all the sub-functions of $L$ are coercive. Thus, we have

$\|(0, B_{\Gamma_1}^k, B_{\Gamma_2}^k)\|_F \le \|B_{\Gamma_1}^k\|_F + \|B_{\Gamma_2}^k\|_F \le c_{k-1}\|\Gamma_1^{k-1} - \Gamma_1^k\|_F + d_{k-1}\|\Gamma_2^{k-1} - \Gamma_2^k\|_F + 2\rho\|\Sigma^{k-1} - \Sigma^k\|_F \le C_1\|Z^k - Z^{k-1}\|_F$   (43)

for some $C_1 > 0$.

Proof of Lemma 3: Since $\{Z^k\}$ is bounded, for a cluster point $Z^* := (\Sigma^*, \Gamma_1^*, \Gamma_2^*)$, there exists a subsequence $\{Z^{k_j}\}$ which converges to $Z^*$. Since $g_\lambda(\Gamma_1)$ is lower semi-continuous, it follows that

$\liminf_{j\to\infty} g_\lambda(\Gamma_1^{k_j}) \ge g_\lambda(\Gamma_1^*)$.   (44)

From the definition of $\Gamma_1^{k_j}$ as a minimizer, it follows that

$L(\Sigma^{k_j-1}, \Gamma_1^{k_j}, \Gamma_2^{k_j-1}) \le L(\Sigma^{k_j-1}, \Gamma_1^*, \Gamma_2^{k_j-1})$.

Then, taking the limit as $j\to\infty$, we obtain

$\limsup_{j\to\infty}\left[g_\lambda(\Gamma_1^{k_j}) + \frac{\rho}{2}\|\Sigma^* - \Gamma_1^{k_j}\|_F^2\right] \le g_\lambda(\Gamma_1^*) + \frac{\rho}{2}\|\Sigma^* - \Gamma_1^*\|_F^2$

which together with (44) yields $\lim_{j\to\infty} g_\lambda(\Gamma_1^{k_j}) = g_\lambda(\Gamma_1^*)$. In a similar manner, we have $\lim_{j\to\infty}\iota_{\mathcal{B}_2}(\Gamma_2^{k_j}) = \iota_{\mathcal{B}_2}(\Gamma_2^*)$ and finally get $\lim_{j\to\infty} L(Z^{k_j}) = L(Z^*)$.

From Lemma 2, $(0, B_{\Gamma_1}^k, B_{\Gamma_2}^k) \in \partial L(Z^k)$ and $(0, B_{\Gamma_1}^k, B_{\Gamma_2}^k) \to (0, 0, 0)$ as $k\to\infty$, which together with the closedness property of $\partial L$ implies that $(0, 0, 0) \in \partial L(Z^*)$. Thus, $Z^*$ is a stationary point of $L$.

Property (ii) is generic for any sequence $\{Z^k\}$ which satisfies $\lim_{k\to\infty}\|Z^{k+1} - Z^k\|_F = 0$; the proof is given in Lemma 5 in [30]. Property (iii) is straightforward since $L(Z^k)$ is convergent (see Lemma 1).

Proof of Theorem 4: Based on the above lemmas, the rest of the proof of Theorem 4 is to show that the generated sequence $\{Z^k\}$ has finite length, i.e.,

$\sum_{k=0}^{\infty}\|Z^{k+1} - Z^k\|_F < \infty$   (45)

which implies that $\{Z^k\}$ is a Cauchy sequence and thus is convergent. This property together with Lemma 3 then implies that $\{Z^k\}$ converges to a stationary point of $L$. The derivation of (45) relies heavily on the KL property of $L$, which holds if the penalty $g_\lambda$ is a KL function. This is the case for all the considered hard-thresholding, soft-thresholding, SCAD and $\ell_q$-norm (with rational $q$) penalties. With the above lemmas, the proof of (45) follows similarly to the proof of Theorem 1 in [30] with some minor changes, and is thus omitted here for succinctness.

APPENDIX B

The derivation of Theorem 2 follows similarly to [21, 22]. Consider the set

$\mathcal{B} := \left\{ \Delta : \Delta = \Delta^T, \ \Delta + \Theta^0 \succeq \epsilon_1 I_d, \ \|\Delta\|_F = C_1\sqrt{\frac{s\log d}{n}} \right\}$

where $\epsilon_1 \le \epsilon_{\min}$. Let

$f(\Theta^0 + \Delta) = \frac{1}{2}\|\Theta^0 + \Delta - S\|_F^2 + g_\lambda(\Theta^0 + \Delta)$.

If we can show that

$\inf\{f(\Theta^0 + \Delta) : \Delta\in\mathcal{B}\} > f(\Theta^0)$   (46)

then there exists at least one local minimizer $\hat{\Theta}$ such that

$\|\hat{\Theta} - \Theta^0\|_F \le C_1\sqrt{\frac{s\log d}{n}}$.   (47)

Define $G(\Delta) = f(\Theta^0 + \Delta) - f(\Theta^0)$. To show (46), it is sufficient to show that $G(\Delta) > 0$ for $\Delta\in\mathcal{B}$. $G(\Delta)$ can be expressed as

$G(\Delta) = \frac{1}{2}\|\Theta^0 + \Delta - S\|_F^2 - \frac{1}{2}\|\Theta^0 - S\|_F^2 + g_\lambda(\Theta^0+\Delta) - g_\lambda(\Theta^0) = \frac{1}{2}\|\Delta\|_F^2 - \langle S - \Theta^0, \Delta\rangle + g_\lambda(\Theta^0+\Delta) - g_\lambda(\Theta^0)$.   (48)

For $\sqrt{\log d / n} = o(1)$, with probability tending to 1 we have [9]

$\max_{i\neq j}|(S - \Theta^0)_{ij}| \le C_2\sqrt{\frac{\log d}{n}}$.

Then, it follows that

$\langle S - \Theta^0, \Delta\rangle = \sum_{i\neq j}(S - \Theta^0)_{ij}\Delta_{ij} + \sum_i (S - \Theta^0)_{ii}\Delta_{ii} \le \max_{i\neq j}|(S - \Theta^0)_{ij}|\,\|\Delta\|_{1,\mathrm{off}} \le C_2\sqrt{\frac{\log d}{n}}\|\Delta\|_{1,\mathrm{off}}$   (49)

with probability tending to 1, where $S_{ii} = \Theta^0_{ii} = 1$ is used. Let $M_S$ and $M_{S^c}$ denote the projections of the matrix $M$ onto the subspace $S$ and its complement, respectively. Under the assumption that the penalty function $g_\lambda$ is decomposable, i.e., $g_\lambda(M) = g_\lambda(M_S) + g_\lambda(M_{S^c})$, and with $\Theta^0_{S^c} = 0$, we have

$g_\lambda(\Theta^0 + \Delta) - g_\lambda(\Theta^0) = g_\lambda(\Theta^0_S + \Delta_S) + g_\lambda(\Delta_{S^c}) - g_\lambda(\Theta^0_S)$.   (50)

Plugging (49) and (50) into (48) yields

$G(\Delta) \ge \underbrace{\frac{1}{2}C_1^2\frac{s\log d}{n}}_{K_1} + \underbrace{\left[g_\lambda(\Theta^0_S + \Delta_S) - g_\lambda(\Theta^0_S) - C_2\sqrt{\frac{\log d}{n}}\|\Delta_S\|_1\right]}_{K_2} + \underbrace{\left[g_\lambda(\Delta_{S^c}) - C_2\sqrt{\frac{\log d}{n}}\|\Delta_{S^c}\|_{1,\mathrm{off}}\right]}_{K_3}$.   (51)

Next, we bound the three terms $K_1$, $K_2$ and $K_3$. Since $\|\Delta_S\|_1 \le \sqrt{s}\|\Delta_S\|_F \le \sqrt{s}\|\Delta\|_F$, we have

$C_2\sqrt{\frac{\log d}{n}}\|\Delta_S\|_1 \le C_1 C_2\frac{s\log d}{n}$.   (52)

Using Taylor's expansion of $f(t) = g_\lambda(\Theta^0_S + t\Delta_S)$ with the integral form of the remainder, it follows that

$\left|g_\lambda(\Theta^0_S + \Delta_S) - g_\lambda(\Theta^0_S)\right| \le \left|\langle g'_\lambda(\Theta^0_S), \Delta_S\rangle\right| + \left|\mathrm{vec}(\Delta_S)^T\!\left[\int_0^1 (1-t)\, g''_\lambda(\Theta^0_S + t\Delta_S)\,dt\right]\mathrm{vec}(\Delta_S)\right| \le \max_{(i,j)\in S} g'_\lambda(\Theta^0_{ij})\,\|\Delta_S\|_1 + o(1)\|\Delta_S\|_F^2 = O\!\left(C_1\frac{s\log d}{n}\right) + o(1)\,C_1^2\frac{s\log d}{n}$   (53)

where the conditions $\max_{(i,j)\in S} g'_\lambda(\Theta^0_{ij}) = O(\sqrt{\log d/n})$ and $\max_{(i,j)\in S} g''_\lambda(\Theta^0_{ij}) = o(1)$ are used. Hence, by (52) and (53), $|K_2|$ is of smaller order than $K_1$ for a sufficiently large constant $C_1$. Moreover, since

$\sup_{\Delta\in\mathcal{B}}|\Delta_{ij}| \le C_1\, O\!\left(\sqrt{\frac{s\log d}{n}}\right)$

we can find a constant $C_3 > 0$ such that $g_\lambda(\Delta_{ij}) \ge C_3\lambda|\Delta_{ij}|$ for all $(i,j)\in S^c$. Then, it follows that

$K_3 \ge \left(C_3\lambda - C_2\sqrt{\frac{\log d}{n}}\right)\|\Delta_{S^c}\|_{1,\mathrm{off}} \ge 0$   (54)

for $\lambda = O(\sqrt{s\log d/n})$ as in Theorem 2. Consequently, by taking a sufficiently large constant $C_1$, we have $G(\Delta) > 0$ with probability tending to 1, which completes the proof. For the $\ell_1$-norm penalty, we can set $\lambda = C_3\sqrt{\log d/n}$ with $C_3 > C_2$ such that (46) holds, since $g_\lambda(\Delta_{S^c}) = \lambda\|\Delta_{S^c}\|_{1,\mathrm{off}}$.

The proof of the spectral norm result follows from the same argument as Theorem 4.2 in [20], and is omitted for conciseness.

APPENDIX C

The objective function in the $\Sigma$-subproblem (28) is quadratic in $\Sigma$, and thus it has an analytical solution. Specifically, from the first-order optimality condition, the minimizer of the $\Sigma$-subproblem satisfies

$A^TA\,\Sigma^{k+1}A^TA + 2\rho\Sigma^{k+1} = A^TYA + \rho\, G^T\Gamma^{k+1}$.   (55)

We can construct a matrix $\Sigma^{k+1}$ that satisfies this condition and thus minimizes the objective. Let $Z^k = A^TYA + \rho\, G^T\Gamma^{k+1}$; it follows that

$\Lambda\, E^T\Sigma^{k+1}E\,\Lambda + 2\rho\, E^T\Sigma^{k+1}E = E^T Z^k E$

which can be equivalently written as

$(aa^T)\odot(E^T\Sigma^{k+1}E) + (2\rho\, 1_{d\times d})\odot(E^T\Sigma^{k+1}E) = E^T Z^k E$.   (56)

Then, it follows from (56) that

$E^T\Sigma^{k+1}E = \left[E^T Z^k E\right] \oslash \left(aa^T + 2\rho\, 1_{d\times d}\right)$   (57)

which finally results in (31).

REFERENCES

[1] J. Fan, F. Han, and H. Liu, "Challenges of big data analysis," National Science Review, vol. 1, pp. 293–314, 2014.
[2] R. C. Qiu and P. Antonik, Smart Grid and Big Data: Theory and Practice. John Wiley & Sons, 2015.
[3] C. Zhang and R. C. Qiu, "Massive MIMO as a big data system: Random matrix models and testbed," IEEE Access, vol. 3, pp. 837–851, 2015.
[4] Y. He, F. R. Yu, N. Zhao, H. Yin, H. Yao, and R. C. Qiu, "Big data analytics in mobile cellular networks," IEEE Access, Mar. 2016.
[5] J. Fan, Y. Liao, and H. Liu, "An overview on the estimation of large covariance and precision matrices," arXiv:1504, 2015.
[6] V. A. Marcenko and L. A. Pastur, "Distribution of eigenvalues for some sets of random matrices," Sbornik: Mathematics, vol. 1, pp. 457–483, 1967.
[7] I. Johnstone, "On the distribution of the largest eigenvalue in principal components analysis," The Annals of Statistics, vol. 29, pp. 295–327, 2001.
[8] W. Wu and M. Pourahmadi, "Nonparametric estimation of large covariance matrices of longitudinal data," Biometrika, vol. 90, pp. 831–844, 2003.
[9] P. Bickel and E. Levina, "Regularized estimation of large covariance matrices," The Annals of Statistics, vol. 36, pp. 199–227, 2008.
[10] R. Furrer and T. Bengtsson, "Estimation of high-dimensional prior and posterior covariance matrices in Kalman filter variants," Journal of Multivariate Analysis, vol. 98, pp. 227–255, 2007.
[11] T. Cai, C. Zhang, and H. Zhou, "Optimal rates of convergence for covariance matrix estimation," The Annals of Statistics, vol. 38, pp. 2118–2144, 2010.
[12] T. Cai and W. Liu, "Adaptive thresholding for sparse covariance matrix estimation," Journal of the American Statistical Association, vol. 106, pp. 672–684, 2011.
[13] P. Bickel and E. Levina, "Covariance regularization by thresholding," The Annals of Statistics, vol. 36, pp. 2577–2604, 2008.
[14] N. E. Karoui, "Operator norm consistent estimation of large dimensional sparse covariance matrices," The Annals of Statistics, vol. 36, pp. 2717–2756, 2008.
[15] A. Rothman, E. Levina, and J. Zhu, "Generalized thresholding of large covariance matrices," Journal of the American Statistical Association, vol. 104, pp. 177–186, 2009.
[16] D. M. Witten and R. Tibshirani, "Covariance-regularized regression and classification for high-dimensional problems," J. R. Statist. Soc. Ser. B, vol. 71, pp. 615–636, 2009.
[17] J. Bien and R. J. Tibshirani, "Sparse estimation of a covariance matrix," Biometrika, vol. 98, no. 4, pp. 807–820, 2011.
[18] A. J. Rothman, "Positive definite estimators of large covariance matrices," Biometrika, vol. 99, no. 3, pp. 733–740, 2012.
[19] L. Xue, S. Ma, and H. Zou, "Positive definite L1 penalized estimation of large covariance matrices," Journal of the American Statistical Association, vol. 107, pp. 1480–1491, 2012.
[20] H. Liu, L. Wang, and T. Zhao, "Sparse covariance matrix estimation with eigenvalue constraints," Journal of Computational and Graphical Statistics, vol. 23, no. 2, pp. 439–459, 2014.
[21] C. Lam and J. Fan, "Sparsistency and rates of convergence in large covariance matrix estimation," The Annals of Statistics, vol. 37, pp. 4254–4278, 2009.
[22] A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu, "Sparse permutation invariant covariance estimation," Electronic Journal of Statistics, vol. 2, pp. 494–515, 2008.
[23] A. Antoniadis, "Wavelets in statistics: A review" (with discussion), Journal of the Italian Statistical Association, vol. 6, pp. 97–144, 1997.
[24] R. J. Tibshirani, "Regression shrinkage and selection via the LASSO," Journal of the Royal Statistical Society, Ser. B, vol. 58, pp. 267–288, 1996.
[25] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American Statistical Association, vol. 96, pp. 1348–1360, 2001.
[26] G. Marjanovic and V. Solo, "On ℓq optimization and matrix completion," IEEE Trans. Signal Process., vol. 60, no. 11, pp. 5714–5724, 2012.
[27] Z. Xu, X. Chang, F. Xu, and H. Zhang, "L1/2 regularization: A thresholding representation theory and a fast solver," IEEE Trans. Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013–1027, 2012.
[28] H. Zou, "The adaptive lasso and its oracle properties," Journal of the American Statistical Association, vol. 101, pp. 1418–1429, 2006.
[29] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran, "Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Lojasiewicz inequality," Mathematics of Operations Research, vol. 35, pp. 438–457, 2010.

[30] J. Bolte, S. Sabach, and M. Teboulle, "Proximal alternating linearized minimization for nonconvex and nonsmooth problems," Mathematical Programming, vol. 146, pp. 459–494, 2014.
[31] F. Wen, P. Liu, Y. Liu, R. C. Qiu, and W. Yu, "Robust sparse recovery for compressive sensing in impulsive noise using Lp-norm model fitting," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2016, pp. 3885–3888.
[32] K. J. Ahn, S. Guha, and A. McGregor, "Graph sketches: Sparsification, spanners, and subgraphs," in Proc. 31st Symposium on Principles of Database Systems, ACM, 2012, pp. 5–14.
[33] Y. Chi, "Compressive graph clustering from random sketches," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2015, pp. 5466–5469.
[34] D. Donoho, "Compressed sensing," IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, Apr. 2006.
[35] E. Candes and T. Tao, "Decoding by linear programming," IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203–4215, Dec. 2005.
[36] D. Ariananda and G. Leus, "Compressive wideband power spectrum estimation," IEEE Trans. on Signal Proc., vol. 60, no. 9, pp. 4775–4789, 2012.
[37] A. Lexa, M. Davies, J. Thompson, and J. Nikolic, "Compressive power spectral density estimation," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2011, pp. 3884–3887.
[38] D. Cohen and Y. C. Eldar, "Sub-Nyquist sampling for power spectrum sensing in cognitive radios: A unified approach," CoRR, vol. abs/1308.5149, 2013.
[39] S. A. Razavi, M. Valkama, and D. Cabric, "High-resolution cyclic spectrum reconstruction from sub-Nyquist samples," IEEE SPAWC, pp. 205–254, 2013.
[40] T. Wimalajeewa, Y. C. Eldar, and P. K. Varshney, "Recovery of sparse matrices via matrix sketching," CoRR, vol. abs/1311.2448, 2013.
[41] G. Dasarathy, P. Shah, B. Bhaskar, and R. Nowak, "Sketching sparse matrices," arXiv preprint, arXiv:1303.6544, 2013.
[42] J. M. Bioucas-Dias, D. Cohen, and Y. C. Eldar, "Covalsa: Covariance estimation from compressive measurements using alternating minimization," in Proc. 22nd European Signal Processing Conference (EUSIPCO), 2014, pp. 999–1003.
[43] M. Eisen, P. Spellman, P. Brown, and D. Botstein, "Cluster analysis and display of genome-wide expression patterns," Proceedings of the National Academy of Sciences of the United States of America, vol. 95, pp. 14863–14868, 1998.
[44] J. Khan, J. Wei, M. Ringner, L. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson, and P. Meltzer, "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks," Nature Medicine, vol. 7, pp. 673–679, 2001.