EXACT FORMULAS FOR THE NORMALIZING CONSTANTS OF WISHART DISTRIBUTIONS FOR GRAPHICAL MODELS
Gaussian graphical models have received considerable attention during the past four decades from the statistical and machine learning communities. In Bayesian treatments of this model, the G-Wishart distribution serves as the conjugate prior for inverse covariance ma- trices satisfying graphical constraints. While it is straightforward to posit the unnormalized densities, the normalizing constants of these distributions have been known only for graphs that are chordal, or decomposable. Up until now, it was unknown whether the normaliz- ing constant for a general graph could be represented explicitly, and a considerable body of computational literature emerged that at- tempted to avoid this apparent intractability. We close this question by providing an explicit representation of the G-Wishart normalizing constant for general graphs.
1. Introduction. Let $G = (V, E)$ be an undirected graph with vertex set $V = \{1, \dots, p\}$ and edge set $E$. Let $\mathbb{S}^p$ be the set of symmetric $p \times p$ matrices and $\mathbb{S}^p_{\succ 0}$ the cone of positive definite matrices in $\mathbb{S}^p$. Let
$$\text{(1.1)}\quad \mathbb{S}^p_{\succ 0}(G) = \{M = (M_{ij}) \in \mathbb{S}^p_{\succ 0} \mid M_{ij} = 0 \text{ for all } (i,j) \notin E\}$$
denote the cone in $\mathbb{S}^p$ of positive definite matrices with zeros in all entries not corresponding to edges in the graph. Note that the positivity of all diagonal entries $M_{ii}$ follows from the positive-definiteness of the matrices $M$. A random vector $X \in \mathbb{R}^p$ is said to satisfy the Gaussian graphical model (GGM) with graph $G$ if $X$ has a multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$, denoted $X \sim N_p(\mu, \Sigma)$, where $\Sigma^{-1} \in \mathbb{S}^p_{\succ 0}(G)$. The inverse covariance matrix $\Sigma^{-1}$ is called the concentration matrix and, throughout this paper, we denote $\Sigma^{-1}$ by $K$.
arXiv:1406.4901v2 [math.ST] 22 Jun 2016
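As a concrete illustration, membership in the cone of (1.1) is easy to check numerically. The following sketch is our own helper, not code from the paper; the function name `in_cone`, the tolerance, and the 0-based vertex labels are assumptions made for the example. It tests symmetry, positive definiteness, and the required zero pattern:

```python
import numpy as np

# Hypothetical helper (not from the paper): test whether M lies in the cone
# S^p_{>0}(G) of (1.1) for a graph on vertices {0, ..., p-1} with edge set E.
def in_cone(M, edges, tol=1e-10):
    p = M.shape[0]
    if not np.allclose(M, M.T, atol=tol):          # symmetry
        return False
    if np.linalg.eigvalsh(M).min() <= 0:           # positive definiteness
        return False
    E = {frozenset(e) for e in edges}
    for i in range(p):                             # zeros off the edge set
        for j in range(i + 1, p):
            if frozenset((i, j)) not in E and abs(M[i, j]) > tol:
                return False
    return True

# 4-cycle graph: edges (0,1), (1,2), (2,3), (0,3).
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
K = np.array([[2.0, 0.5, 0.0, 0.5],
              [0.5, 2.0, 0.5, 0.0],
              [0.0, 0.5, 2.0, 0.5],
              [0.5, 0.0, 0.5, 2.0]])
print(in_cone(K, edges))  # True: PD, symmetric, zeros off the 4-cycle
```

The matrix above is diagonally dominant, hence positive definite, and its zero pattern matches the 4-cycle exactly.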
MSC 2010 subject classifications: Primary 62H05, 60E05; secondary 62E15.
Keywords and phrases: Bartlett decomposition; Bipartite graph; Cholesky decomposition; Chordal graph; Directed acyclic graph; G-Wishart distribution; Gaussian graphical model; Generalized hypergeometric function of matrix argument; Moral graph; Normalizing constant; Wishart distribution.
Statistical inference for the concentration matrix $K$ constrained to $\mathbb{S}^p_{\succ 0}(G)$ goes back to Dempster [6], who proposed an algorithm for determining the maximum likelihood estimator [cf., 31]. A Bayesian framework for this problem was introduced by Dawid and Lauritzen [5], who proposed the Hyper-Inverse Wishart (HIW) prior distribution for chordal (also known as decomposable or triangulated) graphs $G$. Chordal graphs enjoy a rich set of properties that make the HIW distribution particularly amenable to Bayesian analysis. Indeed, for nearly a decade after the introduction of GGMs, Bayesian work on GGMs focused primarily on chordal graphs [see, e.g., 11]. This tractability stems from two causes: the ability to sample directly from HIWs [28], and the ability to calculate their normalizing constants, which are critical quantities when comparing graphs or nesting GGMs in hierarchical structures.
Roverato [29] extended the HIW to general $G$. Focusing on $K$, Atay-Kayis and Massam [2] further studied this prior distribution. Following Letac and Massam [22], Lenkoski and Dobra [21] termed this distribution the G-Wishart. For $D \in \mathbb{S}^p_{\succ 0}(G)$ and $\delta \in \mathbb{R}$, the G-Wishart density has the form
$$f_G(K \mid \delta, D) \propto |K|^{(\delta-2)/2} \exp\bigl(-\tfrac{1}{2}\operatorname{tr}(KD)\bigr)\, \mathbf{1}_{K \in \mathbb{S}^p_{\succ 0}(G)}.$$
This distribution is conjugate [29] and proper for $\delta > 1$ [24]. Early work on the G-Wishart distribution was largely computational in nature [4, 7, 8, 17, 21, 32, 33] and was predicated on two assumptions: first, that a direct sampler was unavailable for this class of models and, second, that the normalizing constant could not be explicitly calculated. Lenkoski [20] developed a direct sampler for G-Wishart variates, mimicking the algorithm of Dempster [6], thereby resolving the first open question.
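The unnormalized density above is straightforward to evaluate on the log scale. The sketch below is our own illustration (not the authors' code; the function name is an assumption), using the log-determinant for numerical stability and assuming $K$ already lies in $\mathbb{S}^p_{\succ 0}(G)$:

```python
import numpy as np

# Minimal sketch (ours, not from the paper): the unnormalized G-Wishart
# log-density, ((delta - 2)/2) * log|K| - tr(K D)/2.
def log_gwishart_unnorm(K, D, delta):
    sign, logdet = np.linalg.slogdet(K)
    if sign <= 0:
        raise ValueError("K must be positive definite")
    return 0.5 * (delta - 2.0) * logdet - 0.5 * np.trace(K @ D)

# 2x2 example with D = I_2 and delta = 3: |K| = 1.91, tr(K) = 3.
K = np.array([[2.0, 0.3],
              [0.3, 1.0]])
print(log_gwishart_unnorm(K, np.eye(2), 3.0))  # 0.5*log(1.91) - 1.5
```

For a complete graph this reduces to the familiar Wishart kernel; the graph enters only through the indicator constraining the zero pattern of $K$.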
In this paper, we close the second question by deriving for general graphs $G$ an explicit formula for the G-Wishart normalizing constant,
$$C_G(\delta, D) = \int_{\mathbb{S}^p_{\succ 0}(G)} |K|^{(\delta-2)/2} \exp\bigl(-\tfrac{1}{2}\operatorname{tr}(KD)\bigr)\, dK,$$
where $dK = \prod_{i=1}^p dk_{ii} \cdot \prod_{(i,j) \in E,\, i<j} dk_{ij}$. Setting
$$I_G(\delta, D) = \int_{\mathbb{S}^p_{\succ 0}(G)} |K|^{\delta} \exp\bigl(-\operatorname{tr}(KD)\bigr)\, dK,$$
the substitution $K = 2L$ shows that
$$C_G(\delta, D) = 2^{\frac{1}{2}p\delta + |E|}\, I_G\bigl(\tfrac{1}{2}(\delta - 2),\, D\bigr).$$
The normalizing constant $I_G(\delta, D)$ is well-known for complete graphs, in which every pair of vertices is connected by an edge. In such cases,
$$\text{(1.2)}\quad I_{\text{complete}}(\delta, D) = |D|^{-(\delta + \frac{1}{2}(p+1))}\, \Gamma_p\bigl(\delta + \tfrac{1}{2}(p+1)\bigr),$$
where
$$\text{(1.3)}\quad \Gamma_p(\alpha) = \pi^{p(p-1)/4} \prod_{i=1}^p \Gamma\bigl(\alpha - \tfrac{1}{2}(i-1)\bigr),\qquad \operatorname{Re}(\alpha) > \tfrac{1}{2}(p-1),$$
is the multivariate gamma function. The formula (1.2) has a long history, dating back to Wishart [34], Wishart and Bartlett [35], Ingham [15], Siegel [30, Hilfssatz 37], and Maass [23], with many derivations of a statistical nature; see Olkin [27] and Giri [10, p. 224].
As noted above, $I_G(\delta, D)$ is also known for chordal graphs. Let $G$ be chordal, and let $(T_1, \dots, T_d)$ denote a perfect sequence of cliques (i.e., complete subgraphs) of $V$. Further, let $S_i = (T_1 \cup \dots \cup T_i) \cap T_{i+1}$, $i = 1, \dots, d-1$; then $S_1, \dots, S_{d-1}$ are called the separators of $G$. Note that the separators $S_i$ are cliques as well. We denote the cardinalities by $t_i = |T_i|$ and $s_i = |S_i|$. For $S \subseteq \{1, \dots, p\}$, let $D_{SS}$ denote the submatrix of $D$ corresponding to the rows and columns in $S$. Then,
$$I_G(\delta, D) = \frac{\prod_{i=1}^d I_{T_i}(\delta, D_{T_i T_i})}{\prod_{j=1}^{d-1} I_{S_j}(\delta, D_{S_j S_j})}$$
$$\text{(1.4)}\qquad = \frac{\prod_{i=1}^d |D_{T_i T_i}|^{-(\delta + \frac{1}{2}(t_i+1))}\, \Gamma_{t_i}\bigl(\delta + \tfrac{1}{2}(t_i+1)\bigr)}{\prod_{j=1}^{d-1} |D_{S_j S_j}|^{-(\delta + \frac{1}{2}(s_j+1))}\, \Gamma_{s_j}\bigl(\delta + \tfrac{1}{2}(s_j+1)\bigr)}.$$
This result follows because, for a chordal graph $G$, the G-Wishart density function can be factored into a product of density functions [5]. For non-chordal graphs the problem of calculating $I_G(\delta, D)$ has been open for over 20 years, and much of the computational methodology mentioned above was developed with the objective of either approximating $I_G(\delta, D)$ or avoiding its calculation.
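The closed-form expressions (1.2)–(1.4) are easy to evaluate numerically. The sketch below is our own code (not the authors'; `scipy.special.multigammaln` computes the same $\log \Gamma_p$ if SciPy is available). It implements (1.3) on the log scale and evaluates (1.4) for a small chordal example, the path on three vertices with cliques $\{1,2\}, \{2,3\}$ and separator $\{2\}$, taking $D = I_3$ so the determinant factors drop out:

```python
import math

# log Gamma_p(alpha) from (1.3), valid for Re(alpha) > (p - 1)/2.
def log_multigamma(alpha, p):
    return (p * (p - 1) / 4.0) * math.log(math.pi) + sum(
        math.lgamma(alpha - i / 2.0) for i in range(p))

# log I_complete(delta, D) from (1.2), parameterized by log|D|.
def log_I_complete(delta, logdet_D, p):
    a = delta + (p + 1) / 2.0
    return -a * logdet_D + log_multigamma(a, p)

# Chordal formula (1.4) for the path 1-2-3 with D = I_3 (so log|D_SS| = 0):
# two cliques of size 2 over one separator of size 1.
delta = 1.0
log_I_path = 2 * log_I_complete(delta, 0.0, 2) - log_I_complete(delta, 0.0, 1)
print(log_I_path)  # log(pi) + 2*lgamma(2.5), approx. 1.7141
```

The printed value equals $\log\bigl(\pi\, \Gamma(5/2)^2\, \Gamma(2)\bigr)$, as one can check by expanding $\Gamma_2$ via (1.3).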
Our result shows that an explicit representation of this quantity is indeed possible. In deriving the explicit formula for the normalizing constant $I_G(\delta, D)$, we utilize methods that are familiar to researchers in this area. These methods include the Cholesky decomposition, or Bartlett decomposition, of a positive definite matrix, Schur complements for factorizing determinants, and the chordal cover of a graph. Furthermore, we make crucial use of certain formulas from the theory of generalized hypergeometric functions of matrix argument [13, 16], and of analytic continuation of differential operators on the cone of positive definite matrices [9].
The article proceeds as follows. In Section 2 we treat the case in which $D = I_p$, the $p \times p$ identity matrix, deriving a closed-form product formula for the normalizing constant $I_G(\delta, I_p)$ for various classes of non-chordal graphs. In Section 3 we consider the case of general matrices $D$; in our main result, Theorem 3.3, we derive an explicit representation of $I_G(\delta, D)$ for general graphs as a closed-form product formula involving differentials of principal minors of $D$. We end with a brief discussion in Section 4.
2. Computing the normalizing constant $I_G(\delta, I_p)$. In this section, we compute $I_G(\delta, I_p)$ for two classes of non-chordal graphs. We begin in Section 2.1 with the class of complete bipartite graphs and use an approach based on Schur complements to obtain a closed-form formula. In Section 2.2 we introduce directed Gaussian graphical models and show how these models relate to a Cholesky-factor approach to computing $I_G(\delta, I_p)$. This leads to a formula for computing normalizing constants of graphs with minimum fill-in equal to 1, namely graphs that become chordal after the addition of one edge. However, these approaches do not lead to a general formula for the normalizing constant in the case $D = I_p$.
To obtain a formula for any graph $G$, we found it necessary to treat the more general case $I_G(\delta, D)$ and then specialize to $D = I_p$, as is done for moment generating functions or Laplace transforms. This is explained in Section 3.
2.1. Bipartite graphs. A complete bipartite graph on $m + n$ vertices, denoted by $H_{m,n}$, is an undirected graph whose vertices can be divided into disjoint sets $U = \{1, \dots, m\}$ and $V = \{m+1, \dots, m+n\}$, such that each vertex in $U$ is connected to every vertex in $V$, but there are no edges within $U$ or within $V$. For the graph $H_{m,n}$, the corresponding matrix $K$ is a block matrix,
$$K = \begin{pmatrix} K_{AA} & K_{AB} \\ K_{AB}^T & K_{BB} \end{pmatrix},$$
where $K_{AA}, K_{BB}$ are diagonal matrices of sizes $m \times m$ and $n \times n$, respectively, and $K_{AB}$ is unconstrained, i.e., no entry of $K_{AB}$ is constrained to be zero.
Proposition 2.1. The integral $I_{H_{m,n}}(\delta, I_{m+n})$ converges absolutely for all $\delta > -1$, and
$$\text{(2.1)}\quad I_{H_{m,n}}(\delta, I_{m+n}) = \Gamma\bigl(\delta + \tfrac{1}{2}n + 1\bigr)^m\, \Gamma\bigl(\delta + \tfrac{1}{2}m + 1\bigr)^n \cdot \frac{\Gamma_{m+n}\bigl(\delta + \tfrac{1}{2}(m+n+1)\bigr)}{\Gamma_m\bigl(\delta + \tfrac{1}{2}(m+n+1)\bigr)\, \Gamma_n\bigl(\delta + \tfrac{1}{2}(m+n+1)\bigr)}.$$
Proof. Applying the Schur complement formula for block matrices, we obtain
$$I_{H_{m,n}}(\delta, I_{m+n}) = \int_{\mathbb{S}^{m+n}_{\succ 0}(G)} |K|^{\delta} \exp(-\operatorname{tr}(K))\, dK = \int_{\mathbb{S}^{m+n}_{\succ 0}(G)} |K_{AA}|^{\delta}\, |K_{BB} - K_{AB}^T K_{AA}^{-1} K_{AB}|^{\delta} \exp\bigl(-\operatorname{tr}(K_{AA}) - \operatorname{tr}(K_{BB})\bigr)\, dK_{AA}\, dK_{AB}\, dK_{BB}.$$
Since $K_{AB}$ is unconstrained, we can change variables by replacing $K_{AB}$ by $K_{AA}^{1/2} K_{AB} K_{BB}^{1/2}$; the corresponding Jacobian is $|K_{AA}|^{n/2} |K_{BB}|^{m/2}$. Since
$$|K_{BB} - K_{BB}^{1/2} K_{AB}^T K_{AB} K_{BB}^{1/2}| = |K_{BB}| \cdot |I_n - K_{AB}^T K_{AB}|,$$
we obtain
$$I_{H_{m,n}}(\delta, I_{m+n}) = \int |K_{AA}|^{\delta + \frac{1}{2}n}\, |K_{BB}|^{\delta + \frac{1}{2}m}\, |I_n - K_{AB}^T K_{AB}|^{\delta} \exp\bigl(-\operatorname{tr}(K_{AA}) - \operatorname{tr}(K_{BB})\bigr)\, dK_{AA}\, dK_{AB}\, dK_{BB},$$
where the range of integration is such that each diagonal entry of $K_{AA}$ and $K_{BB}$ is positive, $K_{AB}$ is unconstrained, and $I_n - K_{AB}^T K_{AB}$ is positive definite. Integrating over each diagonal entry of $K_{AA}$ and $K_{BB}$, we obtain
$$I_{H_{m,n}}(\delta, I_{m+n}) = \Gamma\bigl(\delta + \tfrac{1}{2}n + 1\bigr)^m\, \Gamma\bigl(\delta + \tfrac{1}{2}m + 1\bigr)^n \int_{K_{AB}} |I_n - K_{AB}^T K_{AB}|^{\delta}\, dK_{AB}.$$
Finally, since $K_{AB}$ is unconstrained, we deduce from (3.4) the value of the latter integral. ∎
In this computation, we used the special structure of the graph to decompose the inverse covariance matrix $K$ into a special block matrix. In Section 3 we use a similar approach to show how the normalizing constant changes when removing a clique (i.e., a completely connected subgraph) from a graph. This leads to an algorithm for computing the normalizing constant $I_G(\delta, D)$ for any graph $G$. In the remainder of this section, we show how an approach based on the Cholesky factorization of $K$ can be used to easily compute the normalizing constant for graphs that have minimum fill-in equal to 1. This requires introducing directed Gaussian graphical models.
2.2. Directed Gaussian graphical models. Let $\mathcal{G} = (V, \mathcal{E})$ be a directed acyclic graph (DAG) consisting of vertices $V = \{1, \dots, p\}$ and directed edges $\mathcal{E}$. We assume, without loss of generality, that the vertices in $\mathcal{G}$ are topologically ordered, meaning that $i < j$ for all $(i, j) \in \mathcal{E}$. We associate to $\mathcal{G}$ a strictly upper-triangular matrix $B$ of edge weights; so $B = (b_{ij})$ with $b_{ij} \neq 0$ if and only if $(i, j) \in \mathcal{E}$. Then a directed Gaussian graphical model on $\mathcal{G}$ for a random variable $X \in \mathbb{R}^p$ is defined by $X \sim N_p(0, \Sigma)$ with $\Sigma = [(I - B) D (I - B)^T]^{-1}$, where $D$ is a diagonal matrix. To simplify notation, let $a_{ii} = d_{ii}$ and $a_{ij} = b_{ij} \sqrt{d_{jj}}$, and let $A = (A_{ij})$ with $A_{ii} = \sqrt{a_{ii}}$ and $A_{ij} = -a_{ij}$ for all $i \neq j$. Then $\Sigma^{-1} = A A^T$, and $a_{ij} \neq 0$ for $i \neq j$ if and only if $(i, j) \in \mathcal{E}$. Note that $A A^T$ is the upper Cholesky decomposition of $\Sigma^{-1}$. Such a decomposition exists for any positive definite matrix and is unique.
We will associate to a DAG $\mathcal{G} = (V, \mathcal{E})$ and its corresponding directed Gaussian graphical model two undirected graphs. We denote by $\mathcal{G}^s = (V, E^s)$ the skeleton of $\mathcal{G}$, obtained by replacing all directed edges in $\mathcal{G}$ by undirected edges.
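The factorization $\Sigma^{-1} = A A^T$ and the resulting fill-in pattern can be checked numerically. The sketch below is an assumed example of our own (not from the paper; the DAG, the 0-based vertex labels, and the random weights are our choices). Nodes 2 and 3 share the child 4, so the entry $K_{23}$ of $K = A A^T$ is generically nonzero even though $(2,3)$ is not an edge of the DAG:

```python
import numpy as np

# Assumed example (not the authors' code): build the upper Cholesky factor A
# of K = A A^T for a directed Gaussian graphical model on a small DAG with
# vertices 0,...,4, then inspect the zero pattern of K.
rng = np.random.default_rng(0)
p = 5
dag_edges = [(0, 1), (0, 3), (1, 2), (2, 4), (3, 4)]  # (i, j) with i < j

A = np.zeros((p, p))
for i in range(p):
    A[i, i] = np.sqrt(rng.uniform(0.5, 2.0))   # A_ii = sqrt(a_ii)
for (i, j) in dag_edges:
    A[i, j] = -rng.uniform(0.5, 1.5)           # A_ij = -a_ij

K = A @ A.T
# Nodes 2 and 3 have the common child 4 ("married parents"), so
# K[2,3] = a_24 * a_34 is nonzero:
print(abs(K[2, 3]) > 1e-12)   # True
# Nodes 0 and 4 share no child and are non-adjacent, so K[0,4] = 0 exactly:
print(abs(K[0, 4]) < 1e-12)   # True
```

This is precisely the mechanism by which the zero pattern of $\Sigma^{-1}$ is that of the moral graph, discussed next.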
We denote by $\mathcal{G}^m = (V, E^m)$ the moral graph of $\mathcal{G}$, which reflects the conditional independencies in $N_p(0, \Sigma)$, i.e.,
$$(i, j) \notin E^m \quad\text{if and only if}\quad X_i \perp\!\!\!\perp X_j \mid X_{V \setminus \{i,j\}}.$$
Since $\Sigma^{-1}$ also encodes the conditional independence relations of the form $X_i \perp\!\!\!\perp X_j \mid X_{V \setminus \{i,j\}}$, this is equivalent to the criterion
$$(i, j) \notin E^m \quad\text{if and only if}\quad (\Sigma^{-1})_{ij} = 0.$$
So the moral graph $\mathcal{G}^m$ reflects the zero pattern of $\Sigma^{-1}$. The moral graph of $\mathcal{G}$ can also be defined graph-theoretically: it is formed by connecting all nodes $i, j \in V$ that have a common child in $\mathcal{G}$, i.e., for which there exists a node $k \in V \setminus \{i, j\}$ such that $(i, k), (j, k) \in \mathcal{E}$, and then making all edges in the graph undirected. The name stems from the fact that the moral graph is obtained by 'marrying' the parents. For a review of basic graph-theoretic concepts see, e.g., [19, Chapter 2].
The moral graph is an important concept for our application. Let $G = (V, E)$ be an undirected graph, with $V = \{1, \dots, p\}$, for which we want to compute $I_G(\delta, I_p)$. Let $G_0 = (V, E_0)$ with $G_0 = G$. Given a labeling of the vertices $V$, we associate a DAG $\mathcal{G}_0 = (V, \mathcal{E}_0)$ to $G_0$ by orienting the edges in $E_0$ according to the topological ordering, i.e., for all $(i, j) \in E_0$ let $(i, j) \in \mathcal{E}_0$ if $i < j$. Note that the skeleton of $\mathcal{G}_0$ is the original undirected graph $G_0$. Let $G_1 = (V, E_1)$ be the moral graph of $\mathcal{G}_0$, i.e., $G_1 = \mathcal{G}_0^m$, and let $\mathcal{G}_1 = (V, \mathcal{E}_1)$ be the corresponding DAG obtained by orienting the edges in $E_1$ according to the ordering of the vertices $V$. So $\mathcal{G}_0$ is a subgraph of $\mathcal{G}_1$. We repeat this procedure until $\mathcal{G}_{q+1} = \mathcal{G}_q$. This results in a sequence of DAGs,
$$\mathcal{G}_0 \subsetneq \mathcal{G}_1 \subsetneq \cdots \subsetneq \mathcal{G}_q.$$
In the following, we denote by $\mathcal{G} = (V, \mathcal{E})$ the DAG associated to $G = (V, E)$ obtained by orienting the edges in $E$ according to the ordering of the vertices $V$. We denote by $\bar{\mathcal{G}} = (V, \bar{\mathcal{E}})$ the DAG associated to $G = (V, E)$ obtained by repeatedly marrying parents in $\mathcal{G}$, i.e., $\bar{\mathcal{G}} = \mathcal{G}_q$. We call $\bar{\mathcal{G}}$ the moral DAG of $G$.
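The iterative construction above can be sketched in a few lines. The implementation below is our own (the function name `moral_dag` and the 0-based labels are assumptions, not the authors' code): it repeatedly marries parents with a common child, orients each added edge by the vertex ordering, and stops at the fixed point $\mathcal{G}_{q+1} = \mathcal{G}_q$:

```python
# Sketch (ours, not from the paper): compute the edge set of the moral DAG
# of G by repeatedly marrying parents that share a child. Edges are stored
# as pairs (i, j) with i < j, matching the orientation by vertex ordering.
def moral_dag(p, edges):
    E = {(min(i, j), max(i, j)) for (i, j) in edges}
    while True:
        added = set()
        for (i, k) in E:
            for (j, l) in E:
                if k == l and i != j:          # i and j share the child k
                    e = (min(i, j), max(i, j))
                    if e not in E:
                        added.add(e)
        if not added:                          # fixed point reached
            return sorted(E)
        E |= added

# 5-cycle on vertices 0..4: marrying parents adds (0,3), then (0,2).
print(moral_dag(5, [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)]))
# [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (2, 3), (3, 4)]
```

The skeleton of the result is a chordal cover of the 5-cycle, as guaranteed by the discussion that follows.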
Note that $\bar{\mathcal{G}}^s$, the skeleton of $\bar{\mathcal{G}}$, is a chordal graph with $G \subset \bar{\mathcal{G}}^s$ (Lauritzen [19, Chapter 2]), so $\bar{\mathcal{G}}^s$ is a chordal cover of $G$. A chordal cover in general is not unique; however, $\bar{\mathcal{G}}^s$ is the unique chordal cover obtained by repeatedly marrying parents according to the vertex labeling $V$. We call this chordal cover the moral chordal graph of $G$ and denote it by $\bar{G} = (V, \bar{E})$.
We now show how to deduce from the undirected graph $G = (V, E)$ the normalizing constant $I_G(\delta, I_p)$ as an integral in terms of the Cholesky factor $A$. Since the proof is the same for general correlation matrices $D \in \mathbb{S}^p_{\succ 0}$, we give the result directly for $I_G(\delta, D)$. In the following, we use the standard graph-theoretic notation $\mathrm{indeg}(i)$ for the indegree of node $i$, i.e., the number of edges "arriving at" (or "pointing to") node $i$ in a DAG $\mathcal{G}$.
Theorem 2.2. Let $G = (V, E)$ be an undirected graph with vertices $V = \{1, \dots, p\}$. Let $\mathcal{G} = (V, \mathcal{E})$ be the DAG associated to $G = (V, E)$ obtained by orienting the edges in $E$ according to the ordering of the vertices in $V$. Let $\bar{\mathcal{G}} = (V, \bar{\mathcal{E}})$ denote the moral DAG of $G$ and $\bar{G} = (V, \bar{E})$ its skeleton, the moral chordal graph of $G$. Let $A$ be an upper-triangular $p \times p$ matrix with diagonal entries $A_{ii} = \sqrt{a_{ii}}$ and off-diagonal entries $A_{ij} = -a_{ij}$ for all $i < j$. Then
$$I_G(\delta, D) = \int_{A_*} \prod_{i=1}^p a_{ii}^{\delta + \frac{1}{2}\mathrm{indeg}(i)} \exp\Bigl(-\sum_{i=1}^p \Bigl(a_{ii} + \sum_{j: (i,j) \in \bar{\mathcal{E}}} a_{ij}^2\Bigr)\Bigr) \cdot \exp\Bigl(-2 \sum_{(i,j) \in E} d_{ij} \Bigl(-a_{ij}\sqrt{a_{jj}} + \sum_{l: (i,l),(j,l) \in \bar{\mathcal{E}}} a_{il}\, a_{jl}\Bigr)\Bigr)\, dA_*,$$
where $D \in \mathbb{S}^p_{\succ 0}$ is a correlation matrix, $A_* = \{a_{ij} : i = j \text{ or } (i,j) \in \mathcal{E}\}$, the range of $a_{ii}$ is $(0, \infty)$, the range of $a_{ij}$ for $(i,j) \in \mathcal{E}$ is $(-\infty, \infty)$, $\mathrm{indeg}(i)$ denotes the indegree of node $i$ in $\mathcal{G}$, and for $a_{ij} \notin A_*$,
$$a_{ij} = \begin{cases} 0, & \text{if } (i,j) \notin \bar{\mathcal{E}}, \\[4pt] \dfrac{1}{\sqrt{a_{jj}}} \displaystyle\sum_{\substack{l \in V \\ (i,l),(j,l) \in \bar{\mathcal{E}}}} a_{il}\, a_{jl}, & \text{if } (i,j) \in \bar{\mathcal{E}} \setminus \mathcal{E}. \end{cases}$$
Proof. Let $K \in \mathbb{S}^p_{\succ 0}(G)$. Since $G \subset \bar{G}$, then $K \in \mathbb{S}^p_{\succ 0}(\bar{G})$ and we can view $K$ as an inverse covariance matrix of a directed Gaussian graphical model on $\bar{\mathcal{G}}$.
Because the Cholesky decomposition is unique, $A$ is a weighted adjacency matrix of $\bar{\mathcal{G}}$ and hence $a_{ij} = 0$ for all $(i,j) \notin \bar{\mathcal{E}}$. Let $(i, j)$ be an edge that is present in the moral chordal graph $\bar{G}$ but not in $G$. We can assume that $i < j$. Hence $(i, j) \in \bar{\mathcal{E}} \setminus \mathcal{E}$ and therefore
$$0 = K_{ij} = (A A^T)_{ij} = -a_{ij}\sqrt{a_{jj}} + \sum_{l > \max(i,j)} a_{il}\, a_{jl}.$$
Thus, for each edge $(i,j) \in \bar{\mathcal{E}} \setminus \mathcal{E}$, we obtain an equation,
$$a_{ij} = \frac{1}{\sqrt{a_{jj}}} \sum_{\substack{l \in V \\ (i,l),(j,l) \in \bar{\mathcal{E}}}} a_{il}\, a_{jl}.$$
To complete the proof, we need to compute the Jacobian $J$ of the change of variables from $K$ to $A$. We list the $a_{ij}$'s column-wise, meaning that $a_{ij}$ precedes $a_{lm}$ if $j < m$ or if $j = m$ and $i < l$, omitting $a_{ij}$ for $(i,j) \notin E$, corresponding to the zeros in $K$. We list the $k_{ij}$'s in the same ordering. Let the $a_{ij}$'s correspond to the columns of the Jacobian, while the $k_{ij}$'s correspond to the rows. In order to form $J$, we calculate the partial derivative of each $k_{ij}$ with respect to each $a_{lm}$. Since $K = A A^T$ and $A$ is upper-triangular, $J$ also is upper-triangular; therefore, $|J| = |\mathrm{diag}(J)|$. Since
$$k_{ii} = a_{ii} + \sum_{j: (i,j) \in \bar{\mathcal{E}}} a_{ij}^2 \qquad\text{and}\qquad k_{ij} = -a_{ij}\sqrt{a_{jj}} + \sum_{\substack{l \in V \\ (i,l),(j,l) \in \bar{\mathcal{E}}}} a_{il}\, a_{jl}$$
for all $(i,j) \in E$, then
$$|J| = \prod_{i=1}^p a_{ii}^{\mathrm{indeg}(i)/2}.$$
Collecting together these formulas completes the proof. ∎
The number of edges in $\bar{\mathcal{E}} \setminus \mathcal{E}$ depends on the ordering of the vertices. It is well-known (see, e.g., Lauritzen [19, Chapter 2]) that one can find an ordering of the vertices such that $\bar{G} = G$ if and only if $G$ is chordal. Hence when $G$ is chordal we can directly derive the normalizing constant $I_G(\delta, I_p)$ from Theorem 2.2 by evaluating Gaussian and Gamma integrals. One could also prove the following corollary using Equation (1.4).
Fig 1. Undirected graph $G_5$ (left) discussed in Example 2.4 and its moral DAG $\bar{\mathcal{G}}_5$ (right).
Corollary 2.3. Let $G = (V, E)$ be a chordal graph, where the vertices $V = \{1, \dots, p\}$ are labelled according to a perfect ordering.
Then
$$I_G(\delta, I_p) = \pi^{|E|/2} \prod_{i=1}^p \Gamma\bigl(\delta + \tfrac{1}{2}\mathrm{indeg}(i) + 1\bigr),$$
where $\mathrm{indeg}(i)$ denotes the indegree of node $i$ in the corresponding DAG $\mathcal{G}$.
Example 2.4. We illustrate Theorem 2.2 by studying the non-chordal graph $G_5$ shown in Figure 1 (left). We wish to calculate
$$\text{(2.2)}\quad I_{G_5}(\delta, I_5) = \int_{K \in \mathbb{S}^5_{\succ 0}(G_5)} |K|^{\delta} \exp(-\operatorname{tr}(K))\, dK$$
through the change of variables $K = A A^T$. The moral DAG of $G_5$ is denoted by $\bar{\mathcal{G}}_5$ and depicted in Figure 1 (right). Since the edges $(2,4)$ and $(2,5)$ are missing in $\bar{\mathcal{G}}_5$, we immediately deduce that $a_{24} = a_{25} = 0$. In this example, we chose an ordering where only one edge needed to be added in the process of marrying parents, namely the edge $(1,3)$. This results in one equation for $a_{13}$, which can be deduced from the colliders over the additional edge, i.e., nodes $l \in V$ with $(1,l), (3,l) \in \bar{\mathcal{E}}$, and reads
$$a_{13} = \frac{1}{\sqrt{a_{33}}}\bigl(a_{14} a_{34} + a_{15} a_{35}\bigr).$$
Finally, the Jacobian can be deduced from the indegrees of the nodes in $\mathcal{G}_5$, which corresponds to the moral DAG $\bar{\mathcal{G}}_5$ after omitting the edge $(1,3)$ (the red edge in Figure 1). Therefore, the determinant of the Jacobian is
$$a_{11}^{0/2}\, a_{22}^{1/2}\, a_{33}^{1/2}\, a_{44}^{2/2}\, a_{55}^{3/2},$$
and we find that the integral (2.2) equals
$$\int_{A} a_{11}^{\delta}\, a_{22}^{\delta + 1/2}\, a_{33}^{\delta + 1/2}\, a_{44}^{\delta + 1}\, a_{55}^{\delta + 3/2} \exp\Bigl(-\Bigl[a_{11} + a_{12}^2 + \frac{(a_{14} a_{34} + a_{15} a_{35})^2}{a_{33}} + a_{14}^2 + a_{15}^2 + a_{22} + a_{23}^2 + a_{33} + a_{34}^2 + a_{35}^2 + a_{44} + a_{45}^2 + a_{55}\Bigr]\Bigr)\, dA,$$
where $a_{ii} > 0$; $a_{ij} \in \mathbb{R}$, $i < j$; and $dA$ denotes the product of all differentials.
As seen in Example 2.4, the equations corresponding to the additional edges $(i,j) \in \bar{\mathcal{E}} \setminus \mathcal{E}$ complicate the integral significantly. Therefore, given a non-chordal graph $G$, it is desirable to find an ordering such that $|\bar{E} \setminus E|$ is minimized. Such an ordering is given by a perfect ordering of a minimal chordal cover of $G$, where minimality is with respect to the number of edges that must be added to make $G$ chordal. Using Corollary 2.3, we can compute the normalizing constant corresponding to a minimal chordal cover of $G$.
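As a sanity check, Corollary 2.3 can be verified numerically against the clique/separator formula (1.4). The sketch below is our own code (not the authors'); it treats the chordal path graph on four vertices with $D = I_4$, for which orienting the path by the vertex ordering gives indegrees $0, 1, 1, 1$, and confirms that the two formulas agree to machine precision:

```python
import math

# log Gamma_p(alpha) from (1.3).
def log_gamma_p(alpha, p):
    return (p * (p - 1) / 4.0) * math.log(math.pi) + sum(
        math.lgamma(alpha - i / 2.0) for i in range(p))

delta = 2.0

# Corollary 2.3 for the path 1-2-3-4: |E| = 3, indegrees 0, 1, 1, 1.
indeg = [0, 1, 1, 1]
log_cor = (3 / 2.0) * math.log(math.pi) + sum(
    math.lgamma(delta + d / 2.0 + 1.0) for d in indeg)

# (1.4) with D = I_4: cliques {1,2}, {2,3}, {3,4} (t_i = 2),
# separators {2}, {3} (s_j = 1); the |D_SS| factors are all 1.
log_14 = 3 * log_gamma_p(delta + 1.5, 2) - 2 * log_gamma_p(delta + 1.0, 1)

print(abs(log_cor - log_14) < 1e-12)  # True
```

Expanding $\Gamma_2$ via (1.3) shows both sides equal $\log\bigl(\pi^{3/2}\, \Gamma(\delta+1)\, \Gamma(\delta+\tfrac{3}{2})^3\bigr)$, so the agreement is exact, not merely numerical.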
The question arises: how can one compute the normalizing constant of $G$ from the normalizing constant of a minimal chordal cover of $G$? In the following theorem, we show how one can compute the normalizing constant of a graph $G$ that results from removing one edge from a chordal graph. Such graphs are said to have minimum fill-in equal to 1.
Theorem 2.5. Let $G = (V, E)$ be an undirected graph with minimum fill-in 1 and with vertices $V = \{1, \dots, p\}$. Let $G_e = (V, E_e)$ denote the graph $G$ with one additional edge $e$, i.e., $E_e = E \cup \{e\}$, such that $G_e$ is chordal. Let $d$ denote the number of triangles formed by the edge $e$ and two other edges in $G_e$. Then