An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss

Cheng Wang (a) and Binyan Jiang (b)

(a) School of Mathematical Sciences, MOE-LSC, Shanghai Jiao Tong University, Shanghai 200240, China.
(b) Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong.

July 10, 2019

arXiv:1811.04545v2 [stat.CO] 9 Jul 2019

Abstract

The estimation of high dimensional precision matrices has been a central topic in statistical learning. However, as the number of parameters scales quadratically with the dimension $p$, many state-of-the-art methods do not scale well to problems with a very large $p$. In this paper, we propose a very efficient algorithm for precision matrix estimation via penalized quadratic loss functions. Under the high dimension, low sample size setting, the computation complexity of our algorithm is linear in both the sample size and the number of parameters. Such a computation complexity is in some sense optimal, as it is the same as the complexity needed for computing the sample covariance matrix. Numerical studies show that our algorithm is much more efficient than other state-of-the-art methods when the dimension $p$ is very large.

Key words and phrases: ADMM, High dimension, Penalized quadratic loss, Precision matrix.

1 Introduction

Precision matrices play an important role in statistical learning and data analysis. On the one hand, estimation of the precision matrix is often required in various statistical analyses. On the other hand, under Gaussian assumptions, the precision matrix has been widely used to study the conditional independence among the random variables. Contemporary applications usually require fast methods for estimating a very high dimensional precision matrix (Meinshausen and Bühlmann, 2006). Despite recent advances, estimation of the precision matrix remains challenging when the dimension $p$ is very large, owing to the fact that the number of parameters to be estimated is of order $O(p^2)$. For example, in the Prostate dataset studied in this paper, 6033 genetic activity measurements are recorded for 102 subjects; the precision matrix to be estimated is of dimension $6033 \times 6033$, resulting in more than 18 million parameters.

A well-known and popular method for precision matrix estimation is the graphical lasso (Yuan and Lin, 2007; Banerjee et al., 2008; Friedman et al., 2008). Without loss of generality, assume that $X_1, \ldots, X_n$ are i.i.d. observations from a $p$-dimensional Gaussian distribution with mean $0$ and covariance matrix $\Sigma$. To estimate the precision matrix $\Omega^* = \Sigma^{-1}$, the graphical lasso seeks the minimizer of the $\ell_1$-regularized negative log-likelihood

    \mathrm{tr}(S\Omega) - \log|\Omega| + \lambda\|\Omega\|_1,    (1)

over the set of positive definite matrices. Here $S$ is the sample covariance matrix, $\|\Omega\|_1$ is the element-wise $\ell_1$ norm of $\Omega$, and $\lambda \geq 0$ is the tuning parameter. Although (1) is constructed from the Gaussian likelihood, it is known that the graphical lasso also works for non-Gaussian data (Ravikumar et al., 2011). Many algorithms have been developed to solve the graphical lasso. Friedman et al. (2008) proposed a coordinate descent procedure, and Boyd et al. (2011) provided an alternating direction method of multipliers (ADMM) algorithm for solving (1). To obtain faster convergence of the iterations, second order methods and proximal gradient algorithms on the dual problem have also been developed; see, for example, Hsieh et al. (2014), Dalal and Rajaratnam (2017), and the references therein. However, an eigen-decomposition or a determinant of a $p \times p$ matrix is unavoidable in these algorithms, owing to the log-determinant term in (1). Since the computation complexity of an eigen-decomposition or a matrix determinant is of order $O(p^3)$, the computation time of these algorithms scales cubically in $p$.
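To make the cost of the log-determinant term concrete, here is a minimal NumPy sketch (purely illustrative; it is not taken from any of the cited packages, and the function name graphical_lasso_objective is ad hoc) that evaluates objective (1) for a given candidate Omega:

```python
import numpy as np

def graphical_lasso_objective(S, Omega, lam):
    """Evaluate the l1-penalized negative Gaussian log-likelihood in (1)."""
    sign, logdet = np.linalg.slogdet(Omega)   # needs an O(p^3) factorization of Omega
    if sign <= 0:
        return np.inf                         # (1) is minimized over positive definite Omega
    return np.trace(S @ Omega) - logdet + lam * np.abs(Omega).sum()
```

The slogdet call (equivalently, a Cholesky or eigen-decomposition) is exactly the $O(p^3)$ step discussed above and, as noted, cannot be avoided by the solvers cited here; the quadratic losses introduced next sidestep it altogether.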
Recently, Zhang and Zou (2014) and Liu and Luo (2015) proposed to estimate $\Omega^*$ via trace-based quadratic loss functions. Using the Kronecker product and matrix vectorization, our interest is to estimate

    \mathrm{vec}(\Omega^*) = \mathrm{vec}(\Sigma^{-1}) = \mathrm{vec}(\Sigma^{-1}\cdot I_p\cdot I_p) = (I_p \otimes \Sigma)^{-1}\mathrm{vec}(I_p),    (2)

or equivalently,

    \mathrm{vec}(\Omega^*) = \mathrm{vec}(\Sigma^{-1}) = \mathrm{vec}(I_p\cdot I_p\cdot \Sigma^{-1}) = (\Sigma \otimes I_p)^{-1}\mathrm{vec}(I_p),    (3)

where $I_p$ denotes the identity matrix of size $p$ and $\otimes$ is the Kronecker product. Motivated by (2) and the LASSO (Tibshirani, 1996), a natural way to estimate $\beta^* = \mathrm{vec}(\Omega^*)$ is

    \arg\min_{\beta \in \mathbb{R}^{p^2}} \frac{1}{2}\beta^{\mathrm T}(I_p \otimes S)\beta - \beta^{\mathrm T}\mathrm{vec}(I_p) + \lambda\|\beta\|_1.    (4)

To obtain a symmetric estimator we can use both (2) and (3), and estimate $\beta^*$ by

    \arg\min_{\beta \in \mathbb{R}^{p^2}} \frac{1}{4}\beta^{\mathrm T}(S \otimes I_p + I_p \otimes S)\beta - \beta^{\mathrm T}\mathrm{vec}(I_p) + \lambda\|\beta\|_1.    (5)

Writing $\beta = \mathrm{vec}(\Omega)$, (4) can be expressed in matrix notation as

    \hat{\Omega}_1 = \arg\min_{\Omega \in \mathbb{R}^{p \times p}} \frac{1}{2}\mathrm{tr}(\Omega^{\mathrm T} S \Omega) - \mathrm{tr}(\Omega) + \lambda\|\Omega\|_1.    (6)

Symmetrization can then be applied to obtain a final estimator. The loss function

    L_1(\Omega) := \frac{1}{2}\mathrm{tr}(\Omega^{\mathrm T} S \Omega) - \mathrm{tr}(\Omega)

is used in Liu and Luo (2015), where a column-wise estimation approach called SCIO is proposed. Lin et al. (2016) obtained these quadratic losses from a more general score matching principle. Similarly, the matrix form of (5) is

    \hat{\Omega}_2 = \arg\min_{\Omega \in \mathbb{R}^{p \times p}} \frac{1}{4}\mathrm{tr}(\Omega^{\mathrm T} S \Omega) + \frac{1}{4}\mathrm{tr}(\Omega S \Omega^{\mathrm T}) - \mathrm{tr}(\Omega) + \lambda\|\Omega\|_1.    (7)

The loss function

    L_2(\Omega) = \frac{1}{2}\{L_1(\Omega^{\mathrm T}) + L_1(\Omega)\} = \frac{1}{4}\mathrm{tr}(\Omega S \Omega^{\mathrm T}) + \frac{1}{4}\mathrm{tr}(\Omega^{\mathrm T} S \Omega) - \mathrm{tr}(\Omega)

is equivalent to the D-trace loss proposed by Zhang and Zou (2014), since $L_2(\Omega)$ naturally forces the solution to be symmetric. In their original papers, Zhang and Zou (2014) and Liu and Luo (2015) established consistency results for the estimators (6) and (7) and showed that their performance is comparable to that of the graphical lasso.
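Identities (2) and (3) and the losses appearing in (6) and (7) can be sanity-checked numerically. The following NumPy sketch is purely illustrative (it is not part of the EQUAL package) and, for simplicity, uses a small synthetic $\Sigma$ and evaluates the losses at the population quantities:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5
M = rng.standard_normal((p, p))
Sigma = M @ M.T + p * np.eye(p)            # a small positive definite "covariance"
Omega_star = np.linalg.inv(Sigma)          # the target precision matrix

# Identity (2): vec(Sigma^{-1}) = (I_p kron Sigma)^{-1} vec(I_p)
vec = lambda A: A.reshape(-1, order="F")   # column-stacking vectorization
lhs = vec(Omega_star)
rhs = np.linalg.solve(np.kron(np.eye(p), Sigma), vec(np.eye(p)))
print(np.allclose(lhs, rhs))               # True

# The losses in (6) and (7), evaluated at Omega = Omega_star with S replaced by Sigma
S = Sigma
L1 = 0.5 * np.trace(Omega_star.T @ S @ Omega_star) - np.trace(Omega_star)
L2 = (0.25 * np.trace(Omega_star @ S @ Omega_star.T)
      + 0.25 * np.trace(Omega_star.T @ S @ Omega_star) - np.trace(Omega_star))
print(np.isclose(L1, L2))                  # True: L1 and L2 coincide at symmetric Omega
```

In the unpenalized case with $S$ replaced by $\Sigma$, both losses are minimized exactly at $\Omega = \Sigma^{-1}$, which is why they can serve as surrogates for the Gaussian log-likelihood in (1).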
As can be seen from the vectorized formulations (2) and (3), the loss functions $L_1(\Omega)$ and $L_2(\Omega)$ are quadratic in $\Omega$. In this note, we propose efficient ADMM algorithms for estimating the precision matrix via these quadratic loss functions. In Section 2, we show that under the quadratic loss functions, explicit solutions can be obtained in each step of the ADMM algorithm. In particular, we derive explicit formulas for the inverses of $(S + \rho I)$ and $(2^{-1}S \otimes I + 2^{-1}I \otimes S + \rho I)$ for any given $\rho > 0$, from which we can solve (6) and (7), or equivalently (4) and (5), with computation complexity of order $O(np^2)$. Such a rate is in some sense optimal, as the complexity of computing $S$ is also of order $O(np^2)$. Numerical studies are provided in Section 3 to demonstrate the computational efficiency and the estimation accuracy of our proposed algorithms. An R package "EQUAL" has been developed to implement our methods and is available at https://github.com/cescwang85/EQUAL, together with all the simulation codes. All technical proofs are relegated to the Appendix.

2 Main Results

For any real matrix $M$, we use $\|M\|_2 = \sqrt{\mathrm{tr}(MM^{\mathrm T})}$ to denote the Frobenius norm, $\|M\|$ to denote the spectral norm, i.e., the square root of the largest eigenvalue of $M^{\mathrm T}M$, and $\|M\|_\infty$ to denote the matrix infinity norm, i.e., the largest absolute value of the entries of $M$.

We consider estimating the precision matrix via

    \arg\min_{\Omega \in \mathbb{R}^{p \times p}} L(\Omega) + \lambda\|\Omega\|_1,    (8)

where $L(\Omega)$ is the quadratic loss function $L_1(\Omega)$ or $L_2(\Omega)$ introduced above. The augmented Lagrangian is

    L_a(\Omega, A, B) = L(\Omega) + \frac{\rho}{2}\|\Omega - A + B\|_2^2 + \lambda\|A\|_1,

where $\rho > 0$ is the step size of the ADMM algorithm. Following Boyd et al. (2011), the alternating iterations are

    \Omega^{k+1} = \arg\min_{\Omega \in \mathbb{R}^{p \times p}} L_a(\Omega, A^k, B^k),
    A^{k+1} = \arg\min_{A \in \mathbb{R}^{p \times p}} L_a(\Omega^{k+1}, A, B^k) = \mathrm{soft}(\Omega^{k+1} + B^k, \lambda/\rho),
    B^{k+1} = \Omega^{k+1} - A^{k+1} + B^k,

where $\mathrm{soft}(A, \lambda)$ is the element-wise soft thresholding operator. Clearly, the computation complexity is dominated by the update of $\Omega^{k+1}$, which amounts to solving

    \arg\min_{\Omega \in \mathbb{R}^{p \times p}} L(\Omega) + \frac{\rho}{2}\|\Omega - A^k + B^k\|_2^2.    (9)

By the convexity of the objective function, the solution of (9) satisfies

    L'(\Omega) + \rho(\Omega - A^k + B^k) = 0.

Consequently, for the estimators (6) and (7), we need to solve the following equations, respectively:

    S\Omega + \rho\Omega = I_p + \rho(A^k - B^k),    (10)
    2^{-1}S\Omega + 2^{-1}\Omega S + \rho\Omega = I_p + \rho(A^k - B^k).    (11)

From (10) and the vectorized formulation of (11) (cf. equation (5)), it follows that solving (10) and (11) requires the inverses of $(S + \rho I)$ and $(2^{-1}S \otimes I + 2^{-1}I \otimes S + \rho I)$. The following proposition provides explicit expressions for these inverses.

Proposition 1. Write the decomposition of $S$ as $S = U\Lambda U^{\mathrm T}$, where $U \in \mathbb{R}^{p \times m}$ with $m = \min(n, p)$, $U^{\mathrm T}U = I_m$, and $\Lambda = \mathrm{diag}\{\tau_1, \ldots, \tau_m\}$ with $\tau_1, \ldots, \tau_m \geq 0$. For any $\rho > 0$, we have

    (S + \rho I_p)^{-1} = \rho^{-1}I_p - \rho^{-1}U\Lambda_1 U^{\mathrm T},    (12)

and

    (2^{-1}S \otimes I_p + 2^{-1}I_p \otimes S + \rho I_{p^2})^{-1}
      = \rho^{-1}I_{p^2} - \rho^{-1}(U\Lambda_2 U^{\mathrm T}) \otimes I_p - \rho^{-1}I_p \otimes (U\Lambda_2 U^{\mathrm T})
        + \rho^{-1}(U \otimes U)\,\mathrm{diag}\{\mathrm{vec}(\Lambda_3)\}\,(U^{\mathrm T} \otimes U^{\mathrm T}),    (13)

where

    \Lambda_1 = \mathrm{diag}\Big\{\frac{\tau_1}{\tau_1 + \rho}, \ldots, \frac{\tau_m}{\tau_m + \rho}\Big\}, \quad
    \Lambda_2 = \mathrm{diag}\Big\{\frac{\tau_1}{\tau_1 + 2\rho}, \ldots, \frac{\tau_m}{\tau_m + 2\rho}\Big\}, \quad
    \Lambda_3 = \Big(\frac{\tau_i\tau_j(\tau_i + \tau_j + 4\rho)}{(\tau_i + 2\rho)(\tau_j + 2\rho)(\tau_i + \tau_j + 2\rho)}\Big)_{m \times m}.

Using the explicit formulas of Proposition 1, we can solve (10) and (11) efficiently.
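To illustrate how Proposition 1 is used, the following NumPy sketch implements the ADMM iteration for the loss $L_1$, solving (10) through the explicit inverse (12) so that each iteration costs $O(np^2)$. It is a minimal sketch under simplifying assumptions (mean-zero data, sample covariance $S = X^{\mathrm T}X/n$, a fixed number of iterations, no convergence check) with ad hoc function names; it is not the EQUAL implementation.

```python
import numpy as np

def soft(M, t):
    """Element-wise soft thresholding: sign(M) * max(|M| - t, 0)."""
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

def admm_quadratic_loss(X, lam, rho=1.0, n_iter=200):
    """ADMM for problem (6), with the Omega-update computed via formula (12)."""
    n, p = X.shape
    # Thin SVD of X^T / sqrt(n) gives S = U diag(tau) U^T with m = min(n, p) factors,
    # without ever forming the p x p matrix S explicitly.
    U, sing, _ = np.linalg.svd(X.T / np.sqrt(n), full_matrices=False)
    tau = sing ** 2
    lam1 = tau / (tau + rho)                    # diagonal of Lambda_1 in Proposition 1

    def solve_shifted(M):
        # (S + rho I)^{-1} M via (12); costs O(m p^2) per call
        return M / rho - U @ (lam1[:, None] * (U.T @ M)) / rho

    Omega = np.zeros((p, p))
    A = np.zeros((p, p))
    B = np.zeros((p, p))
    for _ in range(n_iter):
        Omega = solve_shifted(np.eye(p) + rho * (A - B))   # solves equation (10)
        A = soft(Omega + B, lam / rho)                     # sparse iterate A^{k+1}
        B = Omega - A + B                                  # dual update B^{k+1}
    return (A + A.T) / 2                                   # symmetrized final estimate
```

Only the $p \times m$ factor $U$ is ever formed, so no $p \times p$ inverse, determinant, or eigen-decomposition is needed; formula (13) plays the analogous role when solving (11) for the loss $L_2$.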