Learning Graph Laplacian with MCP
Yangjing Zhang∗ Kim-Chuan Toh† Defeng Sun‡
October 20, 2020
Abstract
Motivated by the observation that the ability of the `1 norm in promoting sparsity in graphical models with Laplacian constraints is much weakened, this paper proposes to learn graph Laplacian with a non-convex penalty: minimax concave penalty (MCP). For solving the MCP penalized graphical model, we design an inexact proximal difference-of-convex algorithm (DCA) and prove its convergence to critical points. We note that each subproblem of the proximal DCA enjoys the nice property that the objective function in its dual problem is continuously differentiable with a semismooth gradient. Therefore, we apply an efficient semismooth Newton method to subproblems of the proximal DCA. Numerical experiments on various synthetic and real data sets demonstrate the effectiveness of the non-convex penalty MCP in promoting sparsity. Compared with the state-of-the-art method [46, Algorithm 1], our method is demonstrated to be more efficient and reliable for learning graph Laplacian with MCP.
Keywords. Graph learning, graph Laplacian estimation, precision matrix estimation, non-convex penalty, difference-of-convex. 1 Introduction
In modern multivariate data analysis, one of the most important problems is the estimation of the preci- sion matrix (or the inverse covariance matrix) of a multivariate distribution via an undirected graphical model. A Gaussian graphical model for a Gaussian random vector ∆ ∼ Nn(µ, Σ) is represented by a graph G = (V, E), where V is a collection of n vertices corresponding to the n random variables (features), and an edge (i, j) is absent, i.e., (i, j) ∈/ E, if and only if the i-th and j-th random variables are conditionally arXiv:2010.11559v1 [cs.LG] 22 Oct 2020 independent of each other, given all other variables. The conditional independence is further equivalent −1 to having the (i, j)-th entry of the precision matrix (Σ )ij being zero [25]. Thus, finding the graph structure of a Gaussian graphical model is equivalent to the identification of zeros in the corresponding precision matrix. n n Let S+ (S++) denote the cone of positive semidefinite (definite) matrices in the space of n × n real n symmetric matrices S . Given a Gaussian random vector ∆ ∼ Nn(µ, Σ) and its sample covariance matrix ∗Department of Mathematics, National University of Singapore, Singapore ([email protected]). †Department of Mathematics, and Institute of Operations Research and Analytics, National University of Singapore, Singapore ([email protected]). The research of this author is partially supported by the Academic Research Fund of the Ministry of Education of Singapore under grant number MOE2019-T3-1-010. ‡Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Hong Kong, China ([email protected]). The research of this author is supported in part by Hong Kong Research Grant Council under grant number 15304019.
1 n S ∈ S+, a notable way of learning a precision matrix from the data matrix S is via the following `1-norm penalized maximum likelihood approach [2, 12, 36, 48]: min − log det Θ + hS, Θi + λkΘk , n 1,off (1) Θ∈S++ P where kΘk1,off = i6=j |Θij| and λ is a non-negative penalty parameter. The solutions to (1) are only constrained to be positive definite, and both positive and negative edge weights are allowed in the esti- mated graph. However, one may further require all edge weights to be non-negative, and thus prefer to estimate a graph Laplacian matrix from the data due to several reasons. First of all, a negative edge weight implies a negative partial correlation between the two connected random variables, which might be difficult to interpret in some applications. For certain types of data, one feature is likely to be pre- dicted by non-negative linear combinations of other features. Under such application settings, the extra non-negative constraints on the edge weights can provide a more accurate estimation of the graph than (1). More broadly, graph Laplacian matrices are desirable for a large majority of studies and applications, for example, spectral graph theory [6], clustering and partition problems [31, 38, 40], dimensionality re- duction and data representation [3], and graph signal processing [39]. Therefore, it is essential to learn graph Laplacian matrices from data. We start by giving the definition of a graph Laplacian matrix formally. For an undirected weighted graph G = (V, E) with V being the set of vertices and E the set of edges, the weight matrix of G is defined n as W ∈ S where Wij = w(ij) ∈ R+ is the weight of the edge (i, j) ∈ E and Wij = 0 if (i, j) ∈/ E. Therefore, the weight matrix W consists of non-negative off-diagonal entries and zero diagonal entries. The graph Laplacian matrix, also known as the combinatorial graph Laplacian matrix, is defined as
n L = D − W ∈ S , Pn n where D is the diagonal matrix such that Dii = j=1 Wij. The connectivity (adjacency) matrix A ∈ S is defined as the sparsity pattern of W , i.e., Aij = 1 if Wij > 0, and Aij = 0 if Wij = 0. The set of graph Laplacian matrices then consists of matrices with non-positive off-diagonal entries and zero row-sum:
n L = {Θ ∈ S | Θ1 = 0, Θij ≤ 0 for i 6= j}. (2) Here, 1 (0) denotes the vector of all ones (zeros). If the graph connectivity matrix A is known a priori, then the constrained set of graph Laplacian matrices is n Θij ≤ 0 if Aij = 1 L(A) = Θ ∈ S | Θ1 = 0, for i 6= j . (3) Θij = 0 if Aij = 0 As we know, a Laplacian matrix is diagonally dominant and positive semidefinite, and it has a zero eigenvalue with the associated eigenvector 1. If the graph is connected, then the null space of the Laplacian matrix is one-dimensional and spanned by the vector 1 [6]. Starting from the earlier work of imposing the Laplacian structure (2) on the estimation of precision matrix in [23], this line of research has seen a recent surge of interest in [8, 9, 14, 15, 17, 20, 50]. To handle the singularity issue of a Laplacian matrix in the calculation of the log-determinant term, Lake and Tenenbaum [23] considered a regularized graph Laplacian matrix Θ+αI (hence, full rank) by adding a positive scalar α to the diagonal entries of the graph Laplacian matrix Θ. More recently, by considering connected graphs, Egilmez et al. [9] and Hassan-Moghaddam et al. [14] used a modified version of (1) by adding the constant matrix J = (1/n)11T to the graph Laplacian matrix to compensate for the null space spanned by the vector 1. Moreover, Egilmez et al. [9] incorporated the connectivity matrix A into the model to exploit any prior structural information about the graph. Given the connectivity matrix A and a data matrix S (typically a sample covariance matrix), Egilmez et al. [9] proposed the following `1-norm penalized combinatorial graph Laplacian (CGL-L1) estimation model:
min {− log det (Θ + J) + hS, Θi + λkΘk1,off | Θ ∈ L(A)}. (CGL-L1)
2 When the prior knowledge of the structural information A is not available, especially for real data sets, one can take the fully connected matrix as the connectivity matrix, i.e., A = 11T − I, with I being an identity matrix. In this case, the model (CGL-L1) involves estimating both the graph structure and graph edge weights. The model (CGL-L1) is a natural extension of the classical model (1) due to the equality log det (Θ + J) = log pdet Θ for any Laplacian matrix Θ of a connected graph. Here pdet (·) denotes the pseudo-determinant of a square matrix, i.e., the product of all non-zero eigenvalues of the matrix. The model (CGL-L1) naturally extends the classical model (1) to incorporate the graph Laplacian constraint. However, the model (CGL-L1) has the drawback that the `1 penalty may lose its power in promoting sparsity in the estimated graph. In fact, we found from our experiments that the number of edges of the estimated graphs of (CGL-L1) does not always decrease with increasing values of the tuning parameter λ (see, e.g., Fig. 5 and Fig. 6). We note that the phenomena also appeared in the experiments of the existing work [50, Fig. 8(a)], wherein for many of the tested instances, setting the parameter λ = 1 results in denser graphs compared to choosing smaller values such as λ = 0 or 0.02. That the `1 penalty fails to promote sparsity can make it difficult for one to obtain a desired sparsity level in the graph estimated from the model (CGL-L1). This phenomena is likely caused by the zero row-sum constraint of a valid Laplacian matrix Θ ∈ L(A), which satisfies X X λkΘk1,off = −λ Θij = λ Θii = hλI, Θi. i6=j i
Thus the `1 penalty term in (CGL-L1) simply penalizes the diagonal entries of Θ but not the individual entries Θij. Hence adjusting λ may not affect the sparsity level of the solution of (CGL-L1). Very recently, the relative failure of `1 penalty to promote sparsity in the model (CGL-L1) was also noticed in [46]. Motivated by the observation above, we consider a non-convex penalty function, the minimax concave penalty (MCP) [49], to replace the `1 penalty. Non-convex penalties can generally reduce estimation bias, and they have been applied in sparse precision matrix estimation in [24, 37]. In this paper, we propose to learn a graph Laplacian matrix as the precision matrix from the constrained maximum likelihood estimation with the MCP:
min {− log det (Θ + J) + hS, Θi + P (Θ) | Θ ∈ L(A)}, (CGL-MCP) where the MCP function P is given as follows (we omit its dependence on the parameters γ and λ for brevity): X n P (Θ) = pγ (Θij; λ), γ > 1, for Θ ∈ S , i6=j ( 2 (4) λ|x| − x , if |x| ≤ γλ, p (x; λ) = 2γ for x ∈ , λ > 0. γ 1 2 R 2 γλ , if |x| > γλ, It is known that the MCP can be expressed as the difference of two convex (d.c.) functions [1, Section 6.2]. Therefore, we can design a d.c. algorithm (DCA) for solving (CGL-MCP) by using the d.c. property of the MCP function. DCA is an important tool for solving d.c. programs, and numerous research has been conducted on this topic; see [26, 32, 42, 43], to name only a few. Specifically in this paper, we aim to develop an inexact proximal DCA for solving (CGL-MCP). We solve approximately a sequence of convex minimization subproblems by finding an affine minorization of the second d.c. component. Allowing inexactness for solving the subproblem is crucial since computing the exact solution of the subproblem is generally impossible. The novel step of our algorithm is to introduce a proximal term for each subproblem. With the proximal terms, we can obtain a convex and continuously differentiable dual problem of the subproblem. Due to the nice property of the dual formulation, we can apply a highly
3 efficient semismooth Newton (SSN) method [21, 33, 41] for solving the dual of the subproblem. We also prove the convergence of the inexact proximal DCA to a critical point. Summary of contributions: 1) We propose a new model (CGL-MCP) for learning a graph Laplacian matrix with MCP to overcome the difficulty that the `1 penalty may not promote sparsity in the graph Laplacian matrix. 2) For solving the proposed model (CGL-MCP), we propose an inexact proximal DCA and prove its convergence to a critical point. 3) As the subroutine for solving the proximal DCA subproblems, we design a semismooth Newton method for solving the dual form of the subproblems. 4) The effectiveness of the proposed model (CGL-MCP) and the efficiency of the inexact proximal DCA for solving (CGL-MCP) are comprehensively demonstrated via various experiments on both synthetic and real data sets. 5) More generally, both the model and algorithm can be extended directly to other non-convex penalties, such as the smoothly clipped absolute deviation (SCAD) function [10]. Outline: The remainder of the paper is organized as follows. The inexact proximal DCA for solving (CGL-MCP), together with its convergence property, is given in section 2. Section 3 presents the semis- mooth Newton method for solving the subproblems of the proximal DCA. Numerical experiments are presented in section 4. Finally, section 5 concludes the paper.
2 Inexact proximal DCA for (CGL-MCP)
We propose an inexact proximal DCA for solving (CGL-MCP) in this section and prove its convergence to a critical point. We solve approximately a sequence of convex minimization subproblems by finding an affine minorization of the second d.c. component. Allowing inexactness for solving the subproblem is necessary since computing the exact solution of the subproblem is generally impossible. The novel step of our algorithm is to introduce a proximal term for each subproblem so that its dual has a favourable structure. As can be seen later, with the proximal term, we can obtain a convex and continuously differentiable dual form of the subproblem. Due to the nice property of the dual formulation, we can apply the SSN method for solving the dual of the subproblem.
2.1 Reformulation of (CGL-MCP) In order to take full advantage of the sparsity of the graph via the connectivity matrix, we reformulate the model (CGL-MCP) with the vector of edge weights as the decision variable. This formulation eliminates the extra computation incurred by non-existence edges (Aij = 0). Given an undirected graph G and its connectivity matrix A, we know that the edge set can be |E| characterized by the connectivity matrix: E = {(i, j) | Aij = 1, i < j}. Let R be the vector space such |E| that for any w ∈ R , the components of w are indexed by the elements of E, i.e., [w(ij)](i,j)∈E . We also n×|E| let B ∈ R be the node-arc incidence matrix such that the (ij)-th column is given by B(ij) = ei − ej, n ∗ |E| n where ei, i = 1, . . . , n are the standard unit vectors in R . We define a linear map A : R → S :
∗ T |E| A w = BDiag(w)B , ∀ w ∈ R .
We can see that A∗w ∈ L(A) if w ∈ R|E| is a non-negative weight vector by noting that P w + P w , if i = j, (ik) (ki) (i,k)∈E (k,i)∈E ∗ (A w)ij = −w(ij), if (i, j) ∈ E −w , if (j, i) ∈ E, (ji) 0, otherwise.
Example 2.1. Consider a graph with n = 3 nodes, and edge set E = {(1, 2), (2, 3)}. Then the connectivity
4 matrix and the node-arc incidence matrix are given, respectively, by
0 1 0 1 3×2 A = 1 0 1 ,B = −1 1 ∈ R . 0 1 0 −1
T |E| Let w = [w(12), w(23)] ∈ R . We have that w(12) −w(12) 0 ∗ A w = −w(12) w(12) + w(23) −w(23) . 0 −w(23) w(23)
If w ≥ 0, then it is easy to see that A∗w ∈ L(A).
It is easy to obtain from the definition that the adjoint map A : Sn → R|E| is given by AX = diag(BT XB), ∀ X ∈ Sn. With the relationship between the Laplacian matrix and the weight vector, the model (CGL-MCP) can be reformulated as follows:
min − log det (A∗w + J) + hS, A∗wi + P (A∗w) w (5) |E| s.t. w ∈ R+ . We note that the constraint Θ ∈ L(A) in the original model (CGL-MCP) is reformulated as a linear ∗ |E| constraint and a non-negative constraint, namely, Θ = A w and w ∈ R+ .
2.2 Inexact proximal DCA for solving (5)
The function pγ (·; λ) defined in (4) can be expressed as the difference of two convex functions [1, Sec- tion 6.2] as follows: pγ (x; λ) = λ|x| − hγ (x; λ), where γ and λ are given positive parameters, and
( 2 x , if |x| ≤ γλ, h (x; λ) = 2γ for x ∈ , λ > 0. γ 1 2 R λ|x| − 2 γλ , if |x| > γλ,