EXACT FORMULAS FOR THE NORMALIZING CONSTANTS OF WISHART DISTRIBUTIONS FOR GRAPHICAL MODELS
Gaussian graphical models have received considerable attention during the past four decades from the statistical and machine learning communities. In Bayesian treatments of this model, the G-Wishart distribution serves as the conjugate prior for inverse covariance ma- trices satisfying graphical constraints. While it is straightforward to posit the unnormalized densities, the normalizing constants of these distributions have been known only for graphs that are chordal, or decomposable. Up until now, it was unknown whether the normaliz- ing constant for a general graph could be represented explicitly, and a considerable body of computational literature emerged that at- tempted to avoid this apparent intractability. We close this question by providing an explicit representation of the G-Wishart normalizing constant for general graphs.
1. Introduction. Let $G = (V, E)$ be an undirected graph with vertex set $V = \{1, \dots, p\}$ and edge set $E$. Let $\mathbb{S}^p$ be the set of symmetric $p \times p$ matrices and $\mathbb{S}^p_{\succ 0}$ the cone of positive definite matrices in $\mathbb{S}^p$. Let
$$\text{(1.1)}\quad \mathbb{S}^p_{\succ 0}(G) = \{M = (M_{ij}) \in \mathbb{S}^p_{\succ 0} \mid M_{ij} = 0 \text{ for all } (i,j) \notin E\}$$
denote the cone in $\mathbb{S}^p$ of positive definite matrices with zeros in all entries not corresponding to edges in the graph. Note that the positivity of all diagonal entries $M_{ii}$ follows from the positive-definiteness of the matrices $M$. A random vector $X \in \mathbb{R}^p$ is said to satisfy the Gaussian graphical model (GGM) with graph $G$ if $X$ has a multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$, denoted $X \sim N_p(\mu, \Sigma)$, where $\Sigma^{-1} \in \mathbb{S}^p_{\succ 0}(G)$. The inverse covariance matrix $\Sigma^{-1}$ is called the concentration matrix and, throughout this paper, we denote $\Sigma^{-1}$ by $K$.
arXiv:1406.4901v2 [math.ST] 22 Jun 2016
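As a concrete illustration, membership in the cone of (1.1) is easy to check numerically. The following sketch is our own helper, not code from the paper; the function name `in_cone`, the tolerance, and the 0-based vertex labels are assumptions made for the example. It tests symmetry, positive definiteness, and the required zero pattern:

```python
import numpy as np

# Hypothetical helper (not from the paper): test whether M lies in the cone
# S^p_{>0}(G) of (1.1) for a graph on vertices {0, ..., p-1} with edge set E.
def in_cone(M, edges, tol=1e-10):
    p = M.shape[0]
    if not np.allclose(M, M.T, atol=tol):          # symmetry
        return False
    if np.linalg.eigvalsh(M).min() <= 0:           # positive definiteness
        return False
    E = {frozenset(e) for e in edges}
    for i in range(p):                             # zeros off the edge set
        for j in range(i + 1, p):
            if frozenset((i, j)) not in E and abs(M[i, j]) > tol:
                return False
    return True

# 4-cycle graph: edges (0,1), (1,2), (2,3), (0,3).
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
K = np.array([[2.0, 0.5, 0.0, 0.5],
              [0.5, 2.0, 0.5, 0.0],
              [0.0, 0.5, 2.0, 0.5],
              [0.5, 0.0, 0.5, 2.0]])
print(in_cone(K, edges))  # True: PD, symmetric, zeros off the 4-cycle
```

The matrix above is diagonally dominant, hence positive definite, and its zero pattern matches the 4-cycle exactly.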
MSC 2010 subject classifications: Primary 62H05, 60E05; secondary 62E15.
Keywords and phrases: Bartlett decomposition; Bipartite graph; Cholesky decomposition; Chordal graph; Directed acyclic graph; G-Wishart distribution; Gaussian graphical model; Generalized hypergeometric function of matrix argument; Moral graph; Normalizing constant; Wishart distribution.
Statistical inference for the concentration matrix $K$ constrained to $\mathbb{S}^p_{\succ 0}(G)$ goes back to Dempster [6], who proposed an algorithm for determining the maximum likelihood estimator [cf., 31]. A Bayesian framework for this problem was introduced by Dawid and Lauritzen [5], who proposed the Hyper-Inverse Wishart (HIW) prior distribution for chordal (also known as decomposable or triangulated) graphs $G$. Chordal graphs enjoy a rich set of properties that make the HIW distribution particularly amenable to Bayesian analysis. Indeed, for nearly a decade after the introduction of GGMs, Bayesian work on GGMs focused primarily on chordal graphs [see, e.g., 11]. This tractability stems from two causes: the ability to sample directly from HIWs [28], and the ability to calculate their normalizing constants, which are critical quantities when comparing graphs or nesting GGMs in hierarchical structures.
Roverato [29] extended the HIW to general $G$. Focusing on $K$, Atay-Kayis and Massam [2] further studied this prior distribution. Following Letac and Massam [22], Lenkoski and Dobra [21] termed this distribution the G-Wishart. For $D \in \mathbb{S}^p_{\succ 0}(G)$ and $\delta \in \mathbb{R}$, the G-Wishart density has the form
$$f_G(K \mid \delta, D) \propto |K|^{(\delta-2)/2} \exp\bigl(-\tfrac{1}{2}\operatorname{tr}(KD)\bigr)\, \mathbf{1}_{K \in \mathbb{S}^p_{\succ 0}(G)}.$$
This distribution is conjugate [29] and proper for $\delta > 1$ [24]. Early work on the G-Wishart distribution was largely computational in nature [4, 7, 8, 17, 21, 32, 33] and was predicated on two assumptions: first, that a direct sampler was unavailable for this class of models and, second, that the normalizing constant could not be explicitly calculated. Lenkoski [20] developed a direct sampler for G-Wishart variates, mimicking the algorithm of Dempster [6], thereby resolving the first open question.
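The unnormalized density above is straightforward to evaluate on the log scale. The sketch below is our own illustration (not the authors' code; the function name is an assumption), using the log-determinant for numerical stability and assuming $K$ already lies in $\mathbb{S}^p_{\succ 0}(G)$:

```python
import numpy as np

# Minimal sketch (ours, not from the paper): the unnormalized G-Wishart
# log-density, ((delta - 2)/2) * log|K| - tr(K D)/2.
def log_gwishart_unnorm(K, D, delta):
    sign, logdet = np.linalg.slogdet(K)
    if sign <= 0:
        raise ValueError("K must be positive definite")
    return 0.5 * (delta - 2.0) * logdet - 0.5 * np.trace(K @ D)

# 2x2 example with D = I_2 and delta = 3: |K| = 1.91, tr(K) = 3.
K = np.array([[2.0, 0.3],
              [0.3, 1.0]])
print(log_gwishart_unnorm(K, np.eye(2), 3.0))  # 0.5*log(1.91) - 1.5
```

For a complete graph this reduces to the familiar Wishart kernel; the graph enters only through the indicator constraining the zero pattern of $K$.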
In this paper, we close the second question by deriving for general graphs $G$ an explicit formula for the G-Wishart normalizing constant,
$$C_G(\delta, D) = \int_{\mathbb{S}^p_{\succ 0}(G)} |K|^{(\delta-2)/2} \exp\bigl(-\tfrac{1}{2}\operatorname{tr}(KD)\bigr)\, dK,$$
where $dK = \prod_{i=1}^p dk_{ii} \cdot \prod_{(i,j) \in E,\, i<j} dk_{ij}$. Setting
$$I_G(\delta, D) = \int_{\mathbb{S}^p_{\succ 0}(G)} |K|^{\delta} \exp\bigl(-\operatorname{tr}(KD)\bigr)\, dK,$$
the substitution $K = 2L$ shows that
$$C_G(\delta, D) = 2^{\frac{1}{2}p\delta + |E|}\, I_G\bigl(\tfrac{1}{2}(\delta - 2),\, D\bigr).$$
The normalizing constant $I_G(\delta, D)$ is well-known for complete graphs, in which every pair of vertices is connected by an edge. In such cases,
$$\text{(1.2)}\quad I_{\text{complete}}(\delta, D) = |D|^{-(\delta + \frac{1}{2}(p+1))}\, \Gamma_p\bigl(\delta + \tfrac{1}{2}(p+1)\bigr),$$
where
$$\text{(1.3)}\quad \Gamma_p(\alpha) = \pi^{p(p-1)/4} \prod_{i=1}^p \Gamma\bigl(\alpha - \tfrac{1}{2}(i-1)\bigr),\qquad \operatorname{Re}(\alpha) > \tfrac{1}{2}(p-1),$$
is the multivariate gamma function. The formula (1.2) has a long history, dating back to Wishart [34], Wishart and Bartlett [35], Ingham [15], Siegel [30, Hilfssatz 37], and Maass [23], with many derivations of a statistical nature; see Olkin [27] and Giri [10, p. 224].
As noted above, $I_G(\delta, D)$ is also known for chordal graphs. Let $G$ be chordal, and let $(T_1, \dots, T_d)$ denote a perfect sequence of cliques (i.e., complete subgraphs) of $V$. Further, let $S_i = (T_1 \cup \dots \cup T_i) \cap T_{i+1}$, $i = 1, \dots, d-1$; then $S_1, \dots, S_{d-1}$ are called the separators of $G$. Note that the separators $S_i$ are cliques as well. We denote the cardinalities by $t_i = |T_i|$ and $s_i = |S_i|$. For $S \subseteq \{1, \dots, p\}$, let $D_{SS}$ denote the submatrix of $D$ corresponding to the rows and columns in $S$. Then,
$$I_G(\delta, D) = \frac{\prod_{i=1}^d I_{T_i}(\delta, D_{T_i T_i})}{\prod_{j=1}^{d-1} I_{S_j}(\delta, D_{S_j S_j})}$$
$$\text{(1.4)}\qquad = \frac{\prod_{i=1}^d |D_{T_i T_i}|^{-(\delta + \frac{1}{2}(t_i+1))}\, \Gamma_{t_i}\bigl(\delta + \tfrac{1}{2}(t_i+1)\bigr)}{\prod_{j=1}^{d-1} |D_{S_j S_j}|^{-(\delta + \frac{1}{2}(s_j+1))}\, \Gamma_{s_j}\bigl(\delta + \tfrac{1}{2}(s_j+1)\bigr)}.$$
This result follows because, for a chordal graph $G$, the G-Wishart density function can be factored into a product of density functions [5]. For non-chordal graphs the problem of calculating $I_G(\delta, D)$ has been open for over 20 years, and much of the computational methodology mentioned above was developed with the objective of either approximating $I_G(\delta, D)$ or avoiding its calculation.
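The closed-form expressions (1.2)–(1.4) are easy to evaluate numerically. The sketch below is our own code (not the authors'; `scipy.special.multigammaln` computes the same $\log \Gamma_p$ if SciPy is available). It implements (1.3) on the log scale and evaluates (1.4) for a small chordal example, the path on three vertices with cliques $\{1,2\}, \{2,3\}$ and separator $\{2\}$, taking $D = I_3$ so the determinant factors drop out:

```python
import math

# log Gamma_p(alpha) from (1.3), valid for Re(alpha) > (p - 1)/2.
def log_multigamma(alpha, p):
    return (p * (p - 1) / 4.0) * math.log(math.pi) + sum(
        math.lgamma(alpha - i / 2.0) for i in range(p))

# log I_complete(delta, D) from (1.2), parameterized by log|D|.
def log_I_complete(delta, logdet_D, p):
    a = delta + (p + 1) / 2.0
    return -a * logdet_D + log_multigamma(a, p)

# Chordal formula (1.4) for the path 1-2-3 with D = I_3 (so log|D_SS| = 0):
# two cliques of size 2 over one separator of size 1.
delta = 1.0
log_I_path = 2 * log_I_complete(delta, 0.0, 2) - log_I_complete(delta, 0.0, 1)
print(log_I_path)  # log(pi) + 2*lgamma(2.5), approx. 1.7141
```

The printed value equals $\log\bigl(\pi\, \Gamma(5/2)^2\, \Gamma(2)\bigr)$, as one can check by expanding $\Gamma_2$ via (1.3).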
Our result shows that an explicit representation of this quantity is indeed possible. In deriving the explicit formula for the normalizing constant $I_G(\delta, D)$, we utilize methods that are familiar to researchers in this area. These methods include the Cholesky decomposition, or Bartlett decomposition, of a positive definite matrix, Schur complements for factorizing determinants, and the chordal cover of a graph. Furthermore, we make crucial use of certain formulas from the theory of generalized hypergeometric functions of matrix argument [13, 16], and of analytic continuation of differential operators on the cone of positive definite matrices [9].
The article proceeds as follows. In Section 2 we treat the case in which $D = I_p$, the $p \times p$ identity matrix, deriving a closed-form product formula for the normalizing constant $I_G(\delta, I_p)$ for various classes of non-chordal graphs. In Section 3 we consider the case of general matrices $D$; in our main result, Theorem 3.3, we derive an explicit representation of $I_G(\delta, D)$ for general graphs as a closed-form product formula involving differentials of principal minors of $D$. We end with a brief discussion in Section 4.
2. Computing the normalizing constant $I_G(\delta, I_p)$. In this section, we compute $I_G(\delta, I_p)$ for two classes of non-chordal graphs. We begin in Section 2.1 with the class of complete bipartite graphs and use an approach based on Schur complements to obtain a closed-form formula. In Section 2.2 we introduce directed Gaussian graphical models and show how these models relate to a Cholesky-factor approach to computing $I_G(\delta, I_p)$. This leads to a formula for computing normalizing constants of graphs with minimum fill-in equal to 1, namely graphs that become chordal after the addition of one edge. However, these approaches do not lead to a general formula for the normalizing constant in the case $D = I_p$.
To obtain a formula for any graph $G$, we found it necessary to treat the more general case $I_G(\delta, D)$ and then specialize to $D = I_p$, as is done for moment generating functions or Laplace transforms. This is explained in Section 3.
2.1. Bipartite graphs. A complete bipartite graph on $m + n$ vertices, denoted by $H_{m,n}$, is an undirected graph whose vertices can be divided into disjoint sets $U = \{1, \dots, m\}$ and $V = \{m+1, \dots, m+n\}$, such that each vertex in $U$ is connected to every vertex in $V$, but there are no edges within $U$ or within $V$. For the graph $H_{m,n}$, the corresponding matrix $K$ is a block matrix,
$$K = \begin{pmatrix} K_{AA} & K_{AB} \\ K_{AB}^T & K_{BB} \end{pmatrix},$$
where $K_{AA}, K_{BB}$ are diagonal matrices of sizes $m \times m$ and $n \times n$, respectively, and $K_{AB}$ is unconstrained, i.e., no entry of $K_{AB}$ is constrained to be zero.
Proposition 2.1. The integral $I_{H_{m,n}}(\delta, I_{m+n})$ converges absolutely for all $\delta > -1$, and
$$\text{(2.1)}\quad I_{H_{m,n}}(\delta, I_{m+n}) = \Gamma\bigl(\delta + \tfrac{1}{2}n + 1\bigr)^m\, \Gamma\bigl(\delta + \tfrac{1}{2}m + 1\bigr)^n \cdot \frac{\Gamma_{m+n}\bigl(\delta + \tfrac{1}{2}(m+n+1)\bigr)}{\Gamma_m\bigl(\delta + \tfrac{1}{2}(m+n+1)\bigr)\, \Gamma_n\bigl(\delta + \tfrac{1}{2}(m+n+1)\bigr)}.$$
Proof. Applying the Schur complement formula for block matrices, we obtain
$$I_{H_{m,n}}(\delta, I_{m+n}) = \int_{\mathbb{S}^{m+n}_{\succ 0}(G)} |K|^{\delta} \exp(-\operatorname{tr}(K))\, dK = \int_{\mathbb{S}^{m+n}_{\succ 0}(G)} |K_{AA}|^{\delta}\, |K_{BB} - K_{AB}^T K_{AA}^{-1} K_{AB}|^{\delta} \exp\bigl(-\operatorname{tr}(K_{AA}) - \operatorname{tr}(K_{BB})\bigr)\, dK_{AA}\, dK_{AB}\, dK_{BB}.$$
Since $K_{AB}$ is unconstrained, we can change variables by replacing $K_{AB}$ by $K_{AA}^{1/2} K_{AB} K_{BB}^{1/2}$; the corresponding Jacobian is $|K_{AA}|^{n/2} |K_{BB}|^{m/2}$. Since
$$|K_{BB} - K_{BB}^{1/2} K_{AB}^T K_{AB} K_{BB}^{1/2}| = |K_{BB}| \cdot |I_n - K_{AB}^T K_{AB}|,$$
we obtain
$$I_{H_{m,n}}(\delta, I_{m+n}) = \int |K_{AA}|^{\delta + \frac{1}{2}n}\, |K_{BB}|^{\delta + \frac{1}{2}m}\, |I_n - K_{AB}^T K_{AB}|^{\delta} \exp\bigl(-\operatorname{tr}(K_{AA}) - \operatorname{tr}(K_{BB})\bigr)\, dK_{AA}\, dK_{AB}\, dK_{BB},$$
where the range of integration is such that each diagonal entry of $K_{AA}$ and $K_{BB}$ is positive, $K_{AB}$ is unconstrained, and $I_n - K_{AB}^T K_{AB}$ is positive definite. Integrating over each diagonal entry of $K_{AA}$ and $K_{BB}$, we obtain
$$I_{H_{m,n}}(\delta, I_{m+n}) = \Gamma\bigl(\delta + \tfrac{1}{2}n + 1\bigr)^m\, \Gamma\bigl(\delta + \tfrac{1}{2}m + 1\bigr)^n \int_{K_{AB}} |I_n - K_{AB}^T K_{AB}|^{\delta}\, dK_{AB}.$$
Finally, since $K_{AB}$ is unconstrained, we deduce from (3.4) the value of the latter integral. ∎
In this computation, we used the special structure of the graph to decompose the inverse covariance matrix $K$ into a special block matrix. In Section 3 we use a similar approach to show how the normalizing constant changes when removing a clique (i.e., a completely connected subgraph) from a graph. This leads to an algorithm for computing the normalizing constant $I_G(\delta, D)$ for any graph $G$. In the remainder of this section, we show how an approach based on the Cholesky factorization of $K$ can be used to easily compute the normalizing constant for graphs that have minimum fill-in equal to 1. This requires introducing directed Gaussian graphical models.
2.2. Directed Gaussian graphical models. Let $\mathcal{G} = (V, \mathcal{E})$ be a directed acyclic graph (DAG) consisting of vertices $V = \{1, \dots, p\}$ and directed edges $\mathcal{E}$. We assume, without loss of generality, that the vertices in $\mathcal{G}$ are topologically ordered, meaning that $i < j$ for all $(i, j) \in \mathcal{E}$. We associate to $\mathcal{G}$ a strictly upper-triangular matrix $B$ of edge weights; so $B = (b_{ij})$ with $b_{ij} \neq 0$ if and only if $(i, j) \in \mathcal{E}$. Then a directed Gaussian graphical model on $\mathcal{G}$ for a random variable $X \in \mathbb{R}^p$ is defined by $X \sim N_p(0, \Sigma)$ with $\Sigma = [(I - B) D (I - B)^T]^{-1}$, where $D$ is a diagonal matrix. To simplify notation, let $a_{ii} = d_{ii}$ and $a_{ij} = b_{ij} \sqrt{d_{jj}}$, and let $A = (A_{ij})$ with $A_{ii} = \sqrt{a_{ii}}$ and $A_{ij} = -a_{ij}$ for all $i \neq j$. Then $\Sigma^{-1} = A A^T$, and $a_{ij} \neq 0$ for $i \neq j$ if and only if $(i, j) \in \mathcal{E}$. Note that $A A^T$ is the upper Cholesky decomposition of $\Sigma^{-1}$. Such a decomposition exists for any positive definite matrix and is unique.
We will associate to a DAG $\mathcal{G} = (V, \mathcal{E})$ and its corresponding directed Gaussian graphical model two undirected graphs. We denote by $\mathcal{G}^s = (V, E^s)$ the skeleton of $\mathcal{G}$, obtained by replacing all directed edges in $\mathcal{G}$ by undirected edges.
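The factorization $\Sigma^{-1} = A A^T$ and the resulting fill-in pattern can be checked numerically. The sketch below is an assumed example of our own (not from the paper; the DAG, the 0-based vertex labels, and the random weights are our choices). Nodes 2 and 3 share the child 4, so the entry $K_{23}$ of $K = A A^T$ is generically nonzero even though $(2,3)$ is not an edge of the DAG:

```python
import numpy as np

# Assumed example (not the authors' code): build the upper Cholesky factor A
# of K = A A^T for a directed Gaussian graphical model on a small DAG with
# vertices 0,...,4, then inspect the zero pattern of K.
rng = np.random.default_rng(0)
p = 5
dag_edges = [(0, 1), (0, 3), (1, 2), (2, 4), (3, 4)]  # (i, j) with i < j

A = np.zeros((p, p))
for i in range(p):
    A[i, i] = np.sqrt(rng.uniform(0.5, 2.0))   # A_ii = sqrt(a_ii)
for (i, j) in dag_edges:
    A[i, j] = -rng.uniform(0.5, 1.5)           # A_ij = -a_ij

K = A @ A.T
# Nodes 2 and 3 have the common child 4 ("married parents"), so
# K[2,3] = a_24 * a_34 is nonzero:
print(abs(K[2, 3]) > 1e-12)   # True
# Nodes 0 and 4 share no child and are non-adjacent, so K[0,4] = 0 exactly:
print(abs(K[0, 4]) < 1e-12)   # True
```

This is precisely the mechanism by which the zero pattern of $\Sigma^{-1}$ is that of the moral graph, discussed next.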
We denote by $\mathcal{G}^m = (V, E^m)$ the moral graph of $\mathcal{G}$, which reflects the conditional independencies in $N_p(0, \Sigma)$, i.e.,
$$(i, j) \notin E^m \quad\text{if and only if}\quad X_i \perp\!\!\!\perp X_j \mid X_{V \setminus \{i,j\}}.$$
Since $\Sigma^{-1}$ also encodes the conditional independence relations of the form $X_i \perp\!\!\!\perp X_j \mid X_{V \setminus \{i,j\}}$, this is equivalent to the criterion
$$(i, j) \notin E^m \quad\text{if and only if}\quad (\Sigma^{-1})_{ij} = 0.$$
So the moral graph $\mathcal{G}^m$ reflects the zero pattern of $\Sigma^{-1}$. The moral graph of $\mathcal{G}$ can also be defined graph-theoretically: it is formed by connecting all nodes $i, j \in V$ that have a common child in $\mathcal{G}$, i.e., for which there exists a node $k \in V \setminus \{i, j\}$ such that $(i, k), (j, k) \in \mathcal{E}$, and then making all edges in the graph undirected. The name stems from the fact that the moral graph is obtained by 'marrying' the parents. For a review of basic graph-theoretic concepts see, e.g., [19, Chapter 2].
The moral graph is an important concept for our application. Let $G = (V, E)$ be an undirected graph, with $V = \{1, \dots, p\}$, for which we want to compute $I_G(\delta, I_p)$. Let $G_0 = (V, E_0)$ with $G_0 = G$. Given a labeling of the vertices $V$, we associate a DAG $\mathcal{G}_0 = (V, \mathcal{E}_0)$ to $G_0$ by orienting the edges in $E_0$ according to the topological ordering, i.e., for all $(i, j) \in E_0$ let $(i, j) \in \mathcal{E}_0$ if $i < j$. Note that the skeleton of $\mathcal{G}_0$ is the original undirected graph $G_0$. Let $G_1 = (V, E_1)$ be the moral graph of $\mathcal{G}_0$, i.e., $G_1 = \mathcal{G}_0^m$, and let $\mathcal{G}_1 = (V, \mathcal{E}_1)$ be the corresponding DAG obtained by orienting the edges in $E_1$ according to the ordering of the vertices $V$. So $\mathcal{G}_0$ is a subgraph of $\mathcal{G}_1$. We repeat this procedure until $\mathcal{G}_{q+1} = \mathcal{G}_q$. This results in a sequence of DAGs,
$$\mathcal{G}_0 \subsetneq \mathcal{G}_1 \subsetneq \cdots \subsetneq \mathcal{G}_q.$$
In the following, we denote by $\mathcal{G} = (V, \mathcal{E})$ the DAG associated to $G = (V, E)$ obtained by orienting the edges in $E$ according to the ordering of the vertices $V$. We denote by $\bar{\mathcal{G}} = (V, \bar{\mathcal{E}})$ the DAG associated to $G = (V, E)$ obtained by repeatedly marrying parents in $\mathcal{G}$, i.e., $\bar{\mathcal{G}} = \mathcal{G}_q$. We call $\bar{\mathcal{G}}$ the moral DAG of $G$.
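The iterative construction above can be sketched in a few lines. The implementation below is our own (the function name `moral_dag` and the 0-based labels are assumptions, not the authors' code): it repeatedly marries parents with a common child, orients each added edge by the vertex ordering, and stops at the fixed point $\mathcal{G}_{q+1} = \mathcal{G}_q$:

```python
# Sketch (ours, not from the paper): compute the edge set of the moral DAG
# of G by repeatedly marrying parents that share a child. Edges are stored
# as pairs (i, j) with i < j, matching the orientation by vertex ordering.
def moral_dag(p, edges):
    E = {(min(i, j), max(i, j)) for (i, j) in edges}
    while True:
        added = set()
        for (i, k) in E:
            for (j, l) in E:
                if k == l and i != j:          # i and j share the child k
                    e = (min(i, j), max(i, j))
                    if e not in E:
                        added.add(e)
        if not added:                          # fixed point reached
            return sorted(E)
        E |= added

# 5-cycle on vertices 0..4: marrying parents adds (0,3), then (0,2).
print(moral_dag(5, [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)]))
# [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (2, 3), (3, 4)]
```

The skeleton of the result is a chordal cover of the 5-cycle, as guaranteed by the discussion that follows.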
Note that $\bar{\mathcal{G}}^s$, the skeleton of $\bar{\mathcal{G}}$, is a chordal graph with $G \subset \bar{\mathcal{G}}^s$ (Lauritzen [19, Chapter 2]), so $\bar{\mathcal{G}}^s$ is a chordal cover of $G$. A chordal cover in general is not unique; however, $\bar{\mathcal{G}}^s$ is the unique chordal cover obtained by repeatedly marrying parents according to the vertex labeling $V$. We call this chordal cover the moral chordal graph of $G$ and denote it by $\bar{G} = (V, \bar{E})$.
We now show how to deduce from the undirected graph $G = (V, E)$ the normalizing constant $I_G(\delta, I_p)$ as an integral in terms of the Cholesky factor $A$. Since the proof is the same for general correlation matrices $D \in \mathbb{S}^p_{\succ 0}$, we give the result directly for $I_G(\delta, D)$. In the following, we use the standard graph-theoretic notation $\mathrm{indeg}(i)$ for the indegree of node $i$, i.e., the number of edges "arriving at" (or "pointing to") node $i$ in a DAG $\mathcal{G}$.
Theorem 2.2. Let $G = (V, E)$ be an undirected graph with vertices $V = \{1, \dots, p\}$. Let $\mathcal{G} = (V, \mathcal{E})$ be the DAG associated to $G = (V, E)$ obtained by orienting the edges in $E$ according to the ordering of the vertices in $V$. Let $\bar{\mathcal{G}} = (V, \bar{\mathcal{E}})$ denote the moral DAG of $G$ and $\bar{G} = (V, \bar{E})$ its skeleton, the moral chordal graph of $G$. Let $A$ be an upper-triangular $p \times p$ matrix with diagonal entries $A_{ii} = \sqrt{a_{ii}}$ and off-diagonal entries $A_{ij} = -a_{ij}$ for all $i < j$. Then
$$I_G(\delta, D) = \int_{A_*} \prod_{i=1}^p a_{ii}^{\delta + \frac{1}{2}\mathrm{indeg}(i)} \exp\Bigl(-\sum_{i=1}^p \Bigl(a_{ii} + \sum_{j: (i,j) \in \bar{\mathcal{E}}} a_{ij}^2\Bigr)\Bigr) \cdot \exp\Bigl(-2 \sum_{(i,j) \in E} d_{ij} \Bigl(-a_{ij}\sqrt{a_{jj}} + \sum_{l: (i,l),(j,l) \in \bar{\mathcal{E}}} a_{il}\, a_{jl}\Bigr)\Bigr)\, dA_*,$$
where $D \in \mathbb{S}^p_{\succ 0}$ is a correlation matrix, $A_* = \{a_{ij} : i = j \text{ or } (i,j) \in \mathcal{E}\}$, the range of $a_{ii}$ is $(0, \infty)$, the range of $a_{ij}$ for $(i,j) \in \mathcal{E}$ is $(-\infty, \infty)$, $\mathrm{indeg}(i)$ denotes the indegree of node $i$ in $\mathcal{G}$, and for $a_{ij} \notin A_*$,
$$a_{ij} = \begin{cases} 0, & \text{if } (i,j) \notin \bar{\mathcal{E}}, \\[4pt] \dfrac{1}{\sqrt{a_{jj}}} \displaystyle\sum_{\substack{l \in V \\ (i,l),(j,l) \in \bar{\mathcal{E}}}} a_{il}\, a_{jl}, & \text{if } (i,j) \in \bar{\mathcal{E}} \setminus \mathcal{E}. \end{cases}$$
Proof. Let $K \in \mathbb{S}^p_{\succ 0}(G)$. Since $G \subset \bar{G}$, then $K \in \mathbb{S}^p_{\succ 0}(\bar{G})$ and we can view $K$ as an inverse covariance matrix of a directed Gaussian graphical model on $\bar{\mathcal{G}}$.
Because the Cholesky decomposition is unique, $A$ is a weighted adjacency matrix of $\bar{\mathcal{G}}$ and hence $a_{ij} = 0$ for all $(i,j) \notin \bar{\mathcal{E}}$. Let $(i, j)$ be an edge that is present in the moral chordal graph $\bar{G}$ but not in $G$. We can assume that $i < j$. Hence $(i, j) \in \bar{\mathcal{E}} \setminus \mathcal{E}$ and therefore
$$0 = K_{ij} = (A A^T)_{ij} = -a_{ij}\sqrt{a_{jj}} + \sum_{l > \max(i,j)} a_{il}\, a_{jl}.$$
Thus, for each edge $(i,j) \in \bar{\mathcal{E}} \setminus \mathcal{E}$, we obtain an equation,
$$a_{ij} = \frac{1}{\sqrt{a_{jj}}} \sum_{\substack{l \in V \\ (i,l),(j,l) \in \bar{\mathcal{E}}}} a_{il}\, a_{jl}.$$
To complete the proof, we need to compute the Jacobian $J$ of the change of variables from $K$ to $A$. We list the $a_{ij}$'s column-wise, meaning that $a_{ij}$ precedes $a_{lm}$ if $j < m$ or if $j = m$ and $i < l$, omitting $a_{ij}$ for $(i,j) \notin E$, corresponding to the zeros in $K$. We list the $k_{ij}$'s in the same ordering. Let the $a_{ij}$'s correspond to the columns of the Jacobian, while the $k_{ij}$'s correspond to the rows. In order to form $J$, we calculate the partial derivative of each $k_{ij}$ with respect to each $a_{lm}$. Since $K = A A^T$ and $A$ is upper-triangular, $J$ also is upper-triangular; therefore, $|J| = |\mathrm{diag}(J)|$. Since
$$k_{ii} = a_{ii} + \sum_{j: (i,j) \in \bar{\mathcal{E}}} a_{ij}^2 \qquad\text{and}\qquad k_{ij} = -a_{ij}\sqrt{a_{jj}} + \sum_{\substack{l \in V \\ (i,l),(j,l) \in \bar{\mathcal{E}}}} a_{il}\, a_{jl}$$
for all $(i,j) \in E$, then
$$|J| = \prod_{i=1}^p a_{ii}^{\mathrm{indeg}(i)/2}.$$
Collecting together these formulas completes the proof. ∎
The number of edges in $\bar{\mathcal{E}} \setminus \mathcal{E}$ depends on the ordering of the vertices. It is well-known (see, e.g., Lauritzen [19, Chapter 2]) that one can find an ordering of the vertices such that $\bar{G} = G$ if and only if $G$ is chordal. Hence when $G$ is chordal we can directly derive the normalizing constant $I_G(\delta, I_p)$ from Theorem 2.2 by evaluating Gaussian and Gamma integrals. One could also prove the following corollary using Equation (1.4).
Fig 1. Undirected graph $G_5$ (left) discussed in Example 2.4 and its moral DAG $\bar{\mathcal{G}}_5$ (right).
Corollary 2.3. Let $G = (V, E)$ be a chordal graph, where the vertices $V = \{1, \dots, p\}$ are labelled according to a perfect ordering.
Then
$$I_G(\delta, I_p) = \pi^{|E|/2} \prod_{i=1}^p \Gamma\bigl(\delta + \tfrac{1}{2}\mathrm{indeg}(i) + 1\bigr),$$
where $\mathrm{indeg}(i)$ denotes the indegree of node $i$ in the corresponding DAG $\mathcal{G}$.
Example 2.4. We illustrate Theorem 2.2 by studying the non-chordal graph $G_5$ shown in Figure 1 (left). We wish to calculate
$$\text{(2.2)}\quad I_{G_5}(\delta, I_5) = \int_{K \in \mathbb{S}^5_{\succ 0}(G_5)} |K|^{\delta} \exp(-\operatorname{tr}(K))\, dK$$
through the change of variables $K = A A^T$. The moral DAG of $G_5$ is denoted by $\bar{\mathcal{G}}_5$ and depicted in Figure 1 (right). Since the edges $(2,4)$ and $(2,5)$ are missing in $\bar{\mathcal{G}}_5$, we immediately deduce that $a_{24} = a_{25} = 0$. In this example, we chose an ordering where only one edge needed to be added in the process of marrying parents, namely the edge $(1,3)$. This results in one equation for $a_{13}$, which can be deduced from the colliders over the additional edge, i.e., nodes $l \in V$ with $(1,l), (3,l) \in \bar{\mathcal{E}}$, and reads
$$a_{13} = \frac{1}{\sqrt{a_{33}}}\bigl(a_{14} a_{34} + a_{15} a_{35}\bigr).$$
Finally, the Jacobian can be deduced from the indegrees of the nodes in $\mathcal{G}_5$, which corresponds to the moral DAG $\bar{\mathcal{G}}_5$ after omitting the edge $(1,3)$ (the red edge in Figure 1). Therefore, the determinant of the Jacobian is
$$a_{11}^{0/2}\, a_{22}^{1/2}\, a_{33}^{1/2}\, a_{44}^{2/2}\, a_{55}^{3/2},$$
and we find that the integral (2.2) equals
$$\int_{A} a_{11}^{\delta}\, a_{22}^{\delta + 1/2}\, a_{33}^{\delta + 1/2}\, a_{44}^{\delta + 1}\, a_{55}^{\delta + 3/2} \exp\Bigl(-\Bigl[a_{11} + a_{12}^2 + \frac{(a_{14} a_{34} + a_{15} a_{35})^2}{a_{33}} + a_{14}^2 + a_{15}^2 + a_{22} + a_{23}^2 + a_{33} + a_{34}^2 + a_{35}^2 + a_{44} + a_{45}^2 + a_{55}\Bigr]\Bigr)\, dA,$$
where $a_{ii} > 0$; $a_{ij} \in \mathbb{R}$, $i < j$; and $dA$ denotes the product of all differentials.
As seen in Example 2.4, the equations corresponding to the additional edges $(i,j) \in \bar{\mathcal{E}} \setminus \mathcal{E}$ complicate the integral significantly. Therefore, given a non-chordal graph $G$, it is desirable to find an ordering such that $|\bar{E} \setminus E|$ is minimized. Such an ordering is given by a perfect ordering of a minimal chordal cover of $G$, where minimality is with respect to the number of edges that must be added to make $G$ chordal. Using Corollary 2.3, we can compute the normalizing constant corresponding to a minimal chordal cover of $G$.
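As a sanity check, Corollary 2.3 can be verified numerically against the clique/separator formula (1.4). The sketch below is our own code (not the authors'); it treats the chordal path graph on four vertices with $D = I_4$, for which orienting the path by the vertex ordering gives indegrees $0, 1, 1, 1$, and confirms that the two formulas agree to machine precision:

```python
import math

# log Gamma_p(alpha) from (1.3).
def log_gamma_p(alpha, p):
    return (p * (p - 1) / 4.0) * math.log(math.pi) + sum(
        math.lgamma(alpha - i / 2.0) for i in range(p))

delta = 2.0

# Corollary 2.3 for the path 1-2-3-4: |E| = 3, indegrees 0, 1, 1, 1.
indeg = [0, 1, 1, 1]
log_cor = (3 / 2.0) * math.log(math.pi) + sum(
    math.lgamma(delta + d / 2.0 + 1.0) for d in indeg)

# (1.4) with D = I_4: cliques {1,2}, {2,3}, {3,4} (t_i = 2),
# separators {2}, {3} (s_j = 1); the |D_SS| factors are all 1.
log_14 = 3 * log_gamma_p(delta + 1.5, 2) - 2 * log_gamma_p(delta + 1.0, 1)

print(abs(log_cor - log_14) < 1e-12)  # True
```

Expanding $\Gamma_2$ via (1.3) shows both sides equal $\log\bigl(\pi^{3/2}\, \Gamma(\delta+1)\, \Gamma(\delta+\tfrac{3}{2})^3\bigr)$, so the agreement is exact, not merely numerical.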
The question arises: how can one compute the normalizing constant of $G$ from the normalizing constant of a minimal chordal cover of $G$? In the following theorem, we show how one can compute the normalizing constant of a graph $G$ that results from removing one edge from a chordal graph. Such graphs are said to have minimum fill-in equal to 1.
Theorem 2.5. Let $G = (V, E)$ be an undirected graph with minimum fill-in 1 and with vertices $V = \{1, \dots, p\}$. Let $G_e = (V, E_e)$ denote the graph $G$ with one additional edge $e$, i.e., $E_e = E \cup \{e\}$, such that $G_e$ is chordal. Let $d$ denote the number of triangles formed by the edge $e$ and two other edges in $G_e$. Then