WISHART DISTRIBUTIONS FOR COVARIANCE GRAPH MODELS

By

Kshitij Khare Bala Rajaratnam

Technical Report No. 2009-1 July 2009

Department of STANFORD UNIVERSITY Stanford, California 94305-4065

WISHART DISTRIBUTIONS FOR COVARIANCE GRAPH MODELS

By

Kshitij Khare Bala Rajaratnam

Department of Statistics Stanford University

Technical Report No. 2009-1 July 2009

This research was supported in part by National Science Foundation grant DMS 0505303.

Department of Statistics STANFORD UNIVERSITY Stanford, California 94305-4065

http://statistics.stanford.edu Technical Report, Department of Statistics, Stanford University

WISHART DISTRIBUTIONS FOR COVARIANCE GRAPH MODELS

By Kshitij Khare† and Bala Rajaratnam∗ Stanford University Gaussian covariance graph models encode marginal independence among the compo- nents of a multivariate random vector by means of a graph G. These models are distinctly different from the traditional concentration graph models (often also referred to as Gaus- sian graphical models or covariance selection models), as the zeroes in the parameter are now reflected in the covariance Σ, as compared to the concentration matrix −1 Ω=Σ . The parameter space of interest for covariance graph models is the cone PG of positive definite matrices with fixed zeroes corresponding to the missing edges of G.As in Letac and Massam [20] we consider the case when G is decomposable. In this paper we construct on the cone PG a family of Wishart distributions, that serve a similar pur- pose in the covariance graph setting, as those constructed by Letac and Massam [20]and Dawid and Lauritzen [8] do in the concentration graph setting. We proceed to undertake a rigorous study of these “covariance” Wishart distributions, and derive several deep and useful properties of this class. First, they form a rich conjugate family of priors with mul- tiple shape parameters for covariance graph models. Second, we show how to sample from these distributions by using a block Gibbs sampling algorithm, and prove convergence of this block Gibbs sampler. Development of this class of distributions enables , which in turn allows for the estimation of Σ even in the case when the sample size is less than the dimension of the data (i.e., when ‘n

this particular case, the family IWQG of [20] is a special case of our covariance Wishart distributions. Fourth, we also show that this family of priors satisfies an analogue of the strong directed hyper Markov property, when G is homogeneous, and proceed to compute moments and Laplace transforms. We then introduce a new concept called ‘generalized posterior linearity’ for curved exponential families. We proceed to demonstrate that a subclass of our priors satisfies this notion of posterior linearity - thereby drawing paral- lels with the Diaconis-Ylvisaker priors for natural exponential families. Fifth and finally, we illustrate the use of our family of conjugate priors on real and simulated data.

1. Introduction. Due to recent advances in science and information technology, there has been a huge influx of high-dimensional data from various fields such as genomics, en- vironmental sciences, finance and the social sciences. Making sense of all the many complex relationships and multivariate dependencies present in the data and formulating correct models and developing inferential procedures is one of the major challenges in modern day statistics. In parametric models the covariance or correlation matrix (or its inverse) is the fundamental object that quantifies relationships between random variables. Estimating the covariance ma- trix in a sparse way is crucial in high dimensional problems and enables the detection of the most important relationships. In this light, graphical models have served as tools to discover structure in high-dimensional data. The primary aim of this paper is to develop a new family of distributions for covariance graph models (a subclass of graphical models), and consequently study the

∗Bala Rajaratnam was supported in part by NSF grant DMS 0505303 †Kshitij Khare was supported in part by the B.C. and E.J. Eaves Stanford graduate fellowship. 1 2 KHARE, RAJARATNAM properties of this family of distributions. These properties are highly attractive for Bayesian inference in high-dimensional settings. In covariance graph models, specific entries of the co- matrix are restricted to be zero, which implies marginal independence in the Gaussian case. Covariance graph models correspond to curved exponential families, and are distinctly different from the well-studied concentration graph models, which in turn correspond to nat- ural exponential families. A rich framework for Bayesian inference for natural exponential families has been estab- lished in the last three decades, starting with the seminal and celebrated work of Diaconis and Ylvisaker [10] that laid the foundations for constructing conjugate prior distributions for natural models. The Diaconis-Ylvisaker (henceforth referred to as DY) conjugate priors are characterized by posterior linearity. An analogous framework for curved exponential families is not available in the literature. Concentration graph models or covariance selection models, were one of the first graphical models to be formally introduced to the statistics community. These models reflect conditional independencies in multivariate probability distributions by means of a graph. In the Gaus- sian case, they induce sparsity or zeroes in the inverse , and correspond to natural exponential families. In their pioneering work, Dawid and Lauritzen [8] developed the DY prior for this class of models. In particular, they introduced the hyper inverse Wishart as the DY conjugate prior for concentration graph models. In a recent major contribution to this field, a rich family of conjugate priors that subsumes the DY class has been developed by Letac and Massam [20]. Both the hyper inverse Wishart priors and the “Letac-Massam” priors have attractive properties which enable Bayesian inference, with the latter allowing multiple shape parameters and hence suitable in high-dimensional settings. Bayesian procedures corre- sponding to these Letac-Massam priors have been derived in a decision theoretic framework in the recent work of Rajaratnam et al. [25]. Consider an undirected1 graph G with a finite set of vertices V (of size p) and a finite set E of edges between these vertices, i.e., G =(V,E). The Gaussian covariance graph model corresponding to the graph G is the collection of p-variate Gaussian distributions with co- variance matrix Σ, such that Σij = 0 whenever (i, j) ∈/ E. This class of models was first formally introduced by Cox and Wermuth [4, 5]. In the frequentist setting, maximum likeli- hood estimation in covariance graph models has been a topic of interest in recent years. Many iterative methods that obtain the maximum likelihood estimate have been proposed in the literature. The graphical modeling software MIM in Edwards [12] fits these models by using the “dual likelihood method” from Kauermann [17]. In Wermuth et al. [29], the authors de- rive asymptotically efficient approximations to the maximum likelihood estimate in covariance graph models for exponential families. Chaudhuri et al. [3] propose an iterative conditional fitting algorithm for maximum likelihood estimation in this class of models. Covariance graph models have also been used in applications in Butte et al. [2], Grzebyk et al. [15], Mao et al. [21] among others. 
Although Gaussian covariance graph models are simple and intuitive to understand, no comprehensive theoretical framework for Bayesian inference for this class of models has been developed in the literature. In that sense, Bayesian inference for covariance graph models has

1We shall use dotted edges for our graphs keeping in line with notation in the literature, bi-directed edges have also been used for representing covariance graphs. WISHART DISTRIBUTIONS 3 been an open problem since the introduction of these models by Cox and Wermuth [4, 5] more than 15 years ago. The main difficulty is that these models give rise to curved expo- nential families. The zero restrictions on the entries of the covariance matrix Σ translate into complicated restrictions on the corresponding entries of the natural parameter, Ω = Σ−1. Hence, the sparseness in Σ does not translate into sparseness in Σ−1. No general theory for Bayesian inference in curved exponential families for continuous random variables, akin to the Diaconis-Ylvisaker [10] or standard conjugate theory for natural exponential families, is available. There are several desirable properties that one might want when constructing a class of priors, but one of the foremost requirements is to be able to compute quantities such as the mean or of the posterior distribution, either in closed form or by sampling from the posterior distribution by a simple mechanism. This is especially important in high dimensional situations, where computations are complex and can become infeasible very quickly. Another desirable and related feature is conjugacy, i.e., the class of priors is such that the posterior distribution also belongs to this class. Among other things, this increases the prospects of obtaining closed form Bayes estimators and can also add to the interpretability of the hyper parameters. The classes of Wishart distributions developed by Letac and Massam [20], and used later on in the concentration graph setting by Rajaratnam et al. [25], seem promising as priors for this situation. We however establish that the posterior distribution fails to belong to the same class, and computing the posterior mean or mode either in closed form or by sampling from the posterior distribution is intractable. A principal objective of this paper is to develop a framework for Bayesian inference for Gaussian covariance graph models. We proceed to define a rich class of Wishart distributions on the space of positive definite matrices with fixed zeroes, that correspond to a decomposable graph G. Our approach yields a flexible class of conjugate priors with multiple shape param- eters for Gaussian covariance graph models. This class of distributions is specified up to a normalizing constant and conditions under which this normalizing constant can be evaluated in closed form are derived. We explore the distributional properties of our class of priors, and in particular show that the parameter can be partitioned into blocks so that the conditional distribution of each block given the others are tractable. Based on this property, we propose a block Gibbs sampling algorithm to simulate from the posterior distribution. We proceed to formally prove the convergence of this block Gibbs sampler. Our priors yield proper inferential procedures even in the case when the sample size n is less than the dimension p of the data, whereas maximum likelihood estimation is in general only possible when n ≥ p.Wealsoshow that our covariance Wishart distributions are in general very different from the Letac-Massam priors. However, when the underlying graph G is homogeneous, the Letac-Massam IWQG pri- ors are a special case of our distributions. 
We also show that when the underlying graph G is homogeneous, not only can one explicitly evaluate the normalizing constant, quantities like the posterior mean of the covariance matrix are also available in closed form. In this scenario, we show that our class of priors satisfies the strong directed “covariance” hyper Markov property (parallel to the one introduced in Dawid and Lauritzen [8] for concentration graph models). We noted above that for concentration graph models or the traditional Gaussian graphical models, a rich theory has been established by Dawid and Lauritzen [8], who derive the single parameter DY conjugate prior for these models, and by Letac and Massam [20] who derive 4 KHARE, RAJARATNAM a larger flexible class with multiple shape parameters. In essence, this paper is the analogue of these two above mentioned papers in the covariance graph model setting, with parallel results all of which are contained in one single comprehensive piece. Hence this work com- pletes the powerful theory that has been developed in the mathematical statistics literature for decomposable models. We also point out that a class of priors in a recent unpublished manuscript [27] are a narrow special case of our flexible covariance Wishart distributions2. Our family allows multiple shape parameters as compared to a single shape parameter, and hence yields a richer class suitable to high dimensional problems. Moreover we show that their iterative algorithm to sample from the posterior is different from ours. In fact, it turns out that their algorithm is not a Gibbs sampler in the true sense. Since the authors do not undertake a theoretical investigation of the convergence properties of this algorithm, it is not clear if it does indeed converge to the desired distribution. On the other hand, our algorithm is indeed a Gibbs sampler and we proceed to formally prove that our algorithm converges to the desired distribution. Moreover, contrary to what is stated in [27], we show that it is possible to explicitly evaluate the normalizing constant for a non-trivial subclass of covariance graph models. This property allows for Bayesian model selection using closed form expressions. All the remaining sections of our paper are vastly different from [27], as we undertake a rigorous theoretical analysis of our conjugate Wishart distributions for covariance graph models, whereas they give a treatment of latent variables and mixed graph models in a machine learning context. This paper is structured as follows. Section 2 introduces required preliminaries and no- tation. In Section 3, the class of covariance Wishart distributions are formally constructed. Conjugacy to the class of covariance graph models and sufficient conditions for integrability are established. Comparison with the Letac-Massam priors, which are not in general conju- gate in the covariance graph setting, are also undertaken. In Section 4, a block Gibbs sampler which enables sampling from the posterior distribution is proposed, and the corresponding conditional distributions are derived. Thereafter, a formal proof of convergence of this block Gibbs sampler is provided. In subsequent sections, we restrict ourselves to the case when G is a homogeneous graph. In Section 5, we show that our family of priors satisfies a version of the strong directed hyper Markov property. 
In Section 6, we compute moments and Laplace transforms for our class of distributions, and demonstrate that a subclass of our priors satisfies a generalized notion of posterior linearity - thereby drawing parallels to the DY approach for natural exponential families. Finally, we illustrate the use of our family of conjugate priors and the methodology developed in this paper on a real example as well as on simulated data. The appendix contains the proofs of various results stated in the main text.

2. Preliminaries. In this section, we give the necessary notation, background and pre- liminaries that are needed in subsequent sections.

2.1. Modified Cholesky decomposition. If Σ is a positive definite matrix, then there exists a unique decomposition (2.1) Σ = LDLT ,

2In a similar spirit to the way in which the HIW prior of Dawid and Lauritzen [8]isaspecialcaseofthe generalized family of Wishart distributions proposed by Letac and Massam [20] for the concentration graph setting. WISHART DISTRIBUTIONS 5 where L is a lower triangular matrix with diagonal entries equal to 1, and D a diagonal matrix with positive diagonal entries. This decomposition of Σ is referred to as the modified Cholesky decomposition of Σ (see [24]). We now provide a formula that explicitly computes the inverse of a lower triangular matrix with 1’s on the diagonal such as those that appear in (2.1).

Proposition 1. Let L be an m × m lower triangular matrix with 1’s on the diagonal. Let

m r A = ∪r=2 {τ : τ ∈{1, 2, ··· ,m} ,τi <τi−1 ∀ 2 ≤ i ≤ r} and dim(τ) ∀ ∈A Lτ = Lτi−1τi τ . i=2 where dim(τ) denotes the length of the vector τ.ThenL−1 = N,where ⎧ ⎪ ⎨⎪0 if ij. τ∈A,τ1=i,τdim(τ)=j

The proof is provided in the appendix. An undirected graph G is a pair (V,E), where V is a permutation3 of the set {1, 2, ··· ,m} denoting the set of vertices of G.ThesetE ⊆ V × V denotes the set of edges in the graph. If vertices u and v are such that (u, v) ∈ E then we say that there is an edge between u and v. It is also understood that (u, v) ∈ E implies (v,u) ∈ E, i.e., the edges are undirected. Though the dependence of G =(V,E) on the particular ordering in V is often suppressed, the reader should bear in mind that unlike traditional graphs, the graphs defined above are not equivalent up to permutation of the vertices4 modulo the edge structure. We describe below two classes of graphs which play a central role in this paper.

2.2. Decomposable graphs. An undirected graph G is said to be decomposable if any induced subgraph does not contain a cycle of length greater than or equal to four. The reader is referred to Lauritzen [19] for all the common notions of graphical models (and in particular decomposable graphs) that we will use here. One such important notion is that of a perfect order of the cliques. Every decomposable graph admits a perfect order of its cliques. Let (C1,C2, ··· ,Ck) be one such perfect order of the cliques of the graph G.Thehistory for the graph is given by H1 = C1 and

Hj = C1 ∪ C2 ∪···∪Cj,j=2, 3, ··· ,k, and the minimal separators of the graph are given by

Sj = Hj−1 ∩ Cj,j=2, 3, ··· ,k.

3The ordering in V is emphasized here as the elements of V will later correspond to rows or columns of matrices. 4This has been done for notational convenience, as will be seen later. 6 KHARE, RAJARATNAM

Let Rj = Cj \ Hj−1 for j =2, 3, ··· ,k. Let k ≤ k − 1 denote the number of distinct separators and ν(S) denote the multiplicity of S, i.e., the number of j such that Sj = S. Generally, we will denote by C the set of cliques of agraphandbyS its set of separators. Now, let Σ be an arbitrary positive definite matrix with zero restrictions according to 5 G =(V,E) , i.e., Σij = 0 whenever (i, j) ∈/ E.ItisknownthatifG is decomposable, there exists an ordering of the vertices such that if Σ = LDLT is the modified Cholesky decomposition corresponding to this ordering, then for i>j,

(2.2) Lij = 0 whenever (i, j) ∈/ E.

Although the ordering is not unique in general, the existence of such an ordering characterizes decomposable graphs (see [23]). A constructive way to obtain such an ordering is given as follows. Label the vertices in descending order starting with vertices in C1,R2,R3, ··· ,Rk, with vertices belonging to a particular set being ordered arbitrarily. Lauritzen [19]showsthat this ordering provides a ‘perfect vertex elimination scheme’ for G,andin[23] the authors show that such an ordering satisfies (2.2). Two related articles are [26, 28].

6 2.3. The spaces PG, QG and LG. An m-dimensional Gaussian covariance graph model can be represented by the class of multivariate normal distributions with fixed zeros in the covariance parameter (i.e., marginal independencies) described by a given graph G =(V,E). That is, if (i, j) ∈/ E,thei-th and j-th components of the multivariate random vector are marginally independent. Without loss of generality, we can assume that these models have mean zero and are characterized by the parameter set PG of positive definite covariance ma- trices Σ such that Σij = 0 whenever the edge (i, j)isnotinE. Following the notation in [20, 25]forG decomposable, we define QG to be the space on which the free elements of the matrices (or inverse covariance matrices) Ω live. + More formally let M denote the set of symmetric matrices of order m, Mm ⊂ M the cone of positive definite matrices (abbreviated > 0), IG the linear space of symmetric incomplete matrices x with missing entries xij, (i, j) ∈/ E and κ : M → IG the projection of M into IG. The parameter set of the precision matrices of Gaussian covariance graph models can also be −1 described as the set of incomplete matrices Ω = κ(Σ ), Σ ∈ PG. Indeed it is easy to verify that the entries Ωij, (i, j) ∈/ E are such that

−1 (2.3) Ωij =Ωi,V \{i,j}ΩV \{i,j},V \{i,j}ΩV \{i,j},j, and are therefore not free parameters of the precision matrix for Gaussian covariance graph models. We are therefore led to consider the two cones

+ (2.4) PG = {y ∈ Mm| yij =0, (i, j) ∈ E}

(2.5) QG = {x ∈ IG| xCi > 0,i =1,...,k}.

5It is emphasized here that the ordering of the vertices reflected in V plays a crucial role in the definitions and results that follow. 6A brief overview of the literature in this area is provided in the introduction. WISHART DISTRIBUTIONS 7 where PG ⊂ ZG and QG ⊂ IG,whereZG denotes the linear space of symmetric matrices with zero entries yij, (i, j) ∈ E. Gr˝one et al.. [14] proved the following:

+ Proposition 2. When G is decomposable, for any x in QG there exists a unique xˆ in Mm −1 such that ∀(i, j) ∈ E we have xij =ˆxij and such that xˆ is in PG.

7 This defines an isomorphism between PG and QG:

−1 −1 (2.6) ϕ : y =(ˆx) ∈ PG → x = ϕ(y)=κ(y ) ∈ QG , where κ denotes the projection of M into IG.Thematrixˆx is said to be the completion of x. 8 We now introduce new spaces that we shall need in our subsequent analysis .LetLG denote the space of all lower triangular matrices with diagonal entries equal to 1, such that the entries in the lower triangle have zero restrictions corresponding to G, i.e.,

LG = {L : Lij = 0 whenever i

Define ΘG (the modified Cholesky space) by

ΘG = {θ =(L, D): L ∈LG,Ddiagonal with Dii > 0 ∀1 ≤ i ≤ m}.

Denote the mapping + ψ : ΘG → Mm as defined by

(2.7) ψ(L, D)=LDLT

The above mapping ψ plays an important role in our analysis, and shall be studied later.

2.4. Homogeneous graphs. AgraphG =(V,E) is defined to be homogeneous if for all (i, j) ∈ E,either

{u : u = j or (u, j) ∈ E}⊆{u : u = i or (u, i) ∈ E}, or {u : u = i or (u, i) ∈ E}⊆{u : u = j or (u, j) ∈ E}. Equivalently, a graph G is said to be homogeneous if it is decomposable and does not contain 1 2 3 4 the graph •−•−•−•, denoted by A4, as an induced subgraph. Homogeneous graphs have an equivalent representation in terms of directed rooted trees, called Hasse diagrams.The reader is referred to [20] for a detailed account of the properties of homogeneous graphs. We write i → j whenever

{u : u = j or (u, j) ∈ E}⊆{u : u = i or (u, i) ∈ E}.

7Furthermore, it also defines a diffeomorphism. 8These spaces are not defined in [20, 25]. 8 KHARE, RAJARATNAM

Denote by R the equivalence relation on V defined by

iRj ⇔ i → j and j → i.

Let ¯i denote the equivalence class in V/R containing i. The Hasse diagram of G is defined as a directed graph with vertex set VH = V/R = {¯i : i ∈ V } and edge set EH consisting of directed edges with (¯i, ¯j) ∈ EH for ¯i = ¯j if the following holds: i → j and k such that i → k → j, k¯ = ¯i, k¯ = ¯j. If G is a homogeneous graph, then the Hasse diagram described above is a directed rooted tree such that the number of children of a vertex is never equal to one. It was proved in [20] that there is a one to one correspondence between the set of homogeneous graphs and the set of directed rooted trees with vertices weighted by positive integers (w(¯i)=|¯i|), such that no vertex has exactly one child. Also when iRj,wesayi and j are twins in the Hasse diagram of G.Figure1 provides an example of a homogeneous graph with seven vertices and the corresponding Hasse diagram.

f e

f b d e g

a b c a={a,d} g c

Fig 1. (a) An example of a homogeneous graph with 7 vertices (b) The corresponding Hasse diagram

We prove the following proposition for homogeneuos graphs, which will play an important role in our analysis.

Proposition 3. If G is a homogeneous graph, there exists an ordering of the vertices, such that for this ordering T 1. Σ ∈ PG ⇔ L ∈LG,whereΣ=LDL is the modified Cholesky decomposition of Σ. −1 2. L ∈LG ⇔ L ∈LG.

The proof of this proposition can be found in the appendix. We now describe a procedure for ordering the vertices, under which Proposition 3 holds. Given a homogeneous graph G, we first construct the Hasse diagram for G. The vertices are labelled in descending order starting from the root of the tree. If the equivalence class at any node has more than one element, they are labelled in any order. We start from the left at each level in the tree and exhaust all nodes in that level before going down to the next level. The process is terminated after reaching the rightmost leaf of the tree. Hereafter, we shall refer to this ordering scheme as the Hasse perfect vertex elimination scheme. For example, if we apply this ordering procedure for the graph in Figure 1, then the resulting labels are {a, b, c, d, e, f, g}→{4, 5, 1, 3, 7, 6, 2}. WISHART DISTRIBUTIONS 9

2.5. Vertex ordering. Let G =(V,E) be an undirected decomposable graph with vertex set V = {1, 2, ··· ,m} and edge set E.LetSV denote the permutation group associated −1 −1 with V . For any σ ∈ SV ,letGσ := (σ(V ),Eσ), where (u, v) ∈ Eσ iff (σ (u),σ (v)) ∈ E. + Let SD ⊂ SV denote the subset of permutations σ of V , such that for any Σ ∈ Mm with T + Σ=LDL , L ∈LGσ ⇔ Σ ∈ PGσ . Hence for every σ ∈ SD, the mapping ψσ : ΘGσ → Mm defined in (2.7), is a bijection from ΘGσ to PGσ . In particular, the ordering corresponding to any perfect vertex elimination scheme lies in SD (see Section 2.2). If G is homogeneous, let −1 SH ⊂ SD denote the subset of permutations σ of V , such that L ∈LGσ ⇔ L ∈LGσ .In particular, any ordering of the vertices corresponding to the Hasse perfect vertex elimination scheme lies in SH (see Section 2.4). The above defines a nested triplet of permutations of V given by SH ⊂ SD ⊂ SV .

3. Wishart distributions for covariance graphs. Let G =(V,E) be an undirected decomposable graph with vertex set V and edge set E. We assume that the vertices in V are ordered so that V ∈ SD. The covariance graph model associated with G is the family of distributions

G = {Nm(0, Σ) : Σ ∈ PG} ∼ T = {Nm(0,LDL ): (L, D) ∈ ΘG}.  m Consider the class of measures on ΘG with density w.r.t. i>j,(i,j)∈E dLij i=1 dDii  T −1 m tr (LDL ) U + αi log Dii − ( ( ) i=1 ) (3.1) π U,α(L, D)=e 2 ,θ=(L, D) ∈ ΘG.

These measures are parameterized by a positive definite matrix U and a vector α ∈ Rm with non-negative entries. Let us first establish some notation. •N(i):={j :(i, j) ∈ E} •N≺(i):={j :(i, j) ∈ E,i > j} ≺i • U := ((Ukl))k,l∈N ≺(i) i • U := ((Ukl))k,l∈N ≺(i)∪{i} ≺ • U·i := (Uki)k∈N ≺(i) Let  T −1 m tr (LDL ) U + αi log Dii − ( ( ) i=1 ) zG(U, α):= e 2 dLdD.

If zG(U, α) < ∞,thenπ U,α can be normalized to obtain a probability measure. A sufficient condition for the existence of a normalizing constant for π U,α(L, D) is provided in the following proposition.

m Theorem 1. Let dL := (i,j)∈E,i>j dLij and dD := i=1 dDii.Then,  T −1 m tr (LDL ) U + αi log Dii − ( ( ) i=1 ) e 2 dLdD < ∞ ΘG if ≺ αi > |N (i)| +2 ∀ i =1, 2, ··· ,m. 10 KHARE, RAJARATNAM

As the proof of this proposition is rather long and technical, it is given in the Appendix. The normalizing constant zG(U, α) is not available in closed form in general. Let us consider a simple example to illustrate the difficulty in computing the normalizing constant explicitly. 1 2 3 4 Let G = A4, i.e, the path on 4 vertices or •−•−•−•. Note that this is a decomposable (but not homogeneous) graph. The restrictions on L are L31 = L41 = L42 =0.LetU ∈ PG, and α =(4, 4, 4, 4). Then after integrating out the elements Dii, 1 ≤ i ≤ 4 (recognizing them as inverse-gamma integrals) and transforming the entries of L to the independent entries of L−1 (as in the proof of Proposition 1), the normalizing constant reduces to an integral of the form

1 2 2 2 2 2 (U22 +2U12x1 + U11x )(U11x x + U22x + U33 +2U12x1x +2U13x1x2 +2U23x2) R3 1 1 2 2 2 1 × dx. U x2x2x2 U x2x2 U x2 U U x x2x2 U x x x2 U x x x U x x2 U x x U x 11 1 2 3 + 22 2 3 + 33 3 + 44 +2 12 1 2 3 +2 13 1 2 3 +2 14 1 2 3 +2 23 2 3 +2 24 2 3 + 34 3

The above integral does not seem to be computable by standard techniques for general U. Despite this inherent difficulty, we propose a novel method which allows sampling from this rich family of distributions (see Section 4). We will show later that the condition in Theorem 1 is necessary and sufficient for the existence of a normalizing constant for homogeneous graphs (see Section 3.3). Moreover, in this case the normalizing constant can be computed in closed form. We denote by πU,α, the normalized version of π U,α whenever zG(U, α) < ∞. The following lemma shows that the family πU,α is a conjugate family for Gaussian covariance graph models.

Lemma 1. Let G =(V,E) be a decomposable graph, where vertices in V are ordered so T that V ∈ SD.LetY1, Y2, ··· , Yn be an i.i.d. sample from Nm(0,LDL ),where(L, D) ∈ ΘG. 1 n T Let S = n i=1 YiYi denote the empirical covariance matrix. If the prior distribution on

(L, D) is πU,α, then the posterior distribution of (L, D) is given by πU ,α ,whereU = nS + U and α =(n + α1,n+ α2, ··· ,n+ αm).

Proof. The likelihood of the data is given by T − tr (LDL ) 1(nS) +n log |D| 1 − ( ( ) ) f(y1, y2, ··· , yn | L, D)= √ e 2 . ( 2π)nm

Using πU,α as a prior for (L, D), the posterior distribution of (L, D)giventhedata(Y1, Y2, ··· , Yn) is  T −1 m tr (LDL ) (nS+U) + (n+αi)logDii − ( ( ) i=1 ) πU,α(L, D | Y1, Y2, ··· , Yn) ∝ e 2 ,θ∈ ΘG. Hence the posterior distribution belongs to the same family as the prior, i.e., ·| ··· · πU,α( Y1, Y2, , Yn)=πU, α ( ), where U = nS + U and α =(n + α1,n+ α2, ··· ,n+ αm). 

Remark. If we assume that the observations do not have mean zero, i.e., Y1, Y2, ··· , Yn are m i.i.d. N (μ, Σ), with μ ∈ R , Σ ∈ PG,then n 1 T S := (Yi − Y¯ )(Yi − Y¯ ) n i=1 WISHART DISTRIBUTIONS 11 is the minimal sufficient statistic for Σ. Here, nS has a Wishart distribution with parameter Σandn−1 degrees of freedom. Hence, if we assume a prior πU,α for (L, D), then the posterior distribution is given by ·| · πU,α( S)=πU, α ( ), where U = nS + U,andα =(n − 1+α1,n− 1+α2, ··· ,n− 1+αm).

3.1. Induced prior on PG and QG. The prior πU,α on ΘG (the modified Cholesky space) induces a prior on PG (the covariance matrix space) and QG. Recall from Section 2.3 that PG is the space of positive definite matrices with zero restrictions according to G,andQG is the space of incomplete matrices isomorphic to PG. We provide an expression for the induced priors on these spaces in order to compare our Wishart distributions with other classes of distributions. Note that since the vertices have been ordered so that V ∈ SD, the transformation

+ ψ : ΘG → Mm defined by ψ(L, D)=LDLT =: Σ is a bijection from ΘG to PG. The lemma below provides the required Jacobians for deriving the induced priors on PG and QG. We prove the first part, and the proof of the second part can be found in [26]. The reader is referred to Section 2.2 for notation on decomposable graphs. Note that if x is a matrix, |x| denotes its , while if C is a set, then |C| denotes its cardinality.

Lemma 2. Jacobians of transformations.

1. The Jacobian of the transformation ψ :(L, D) → Σ from ΘG to PG is m −nj Djj(Σ) . i=1

Here Djj(Σ) denotes that Djj is a function of Σ,andnj := |{i :(i, j) ∈ E,i > j}| for j =1, 2, ··· ,m. −1 2. The absolute value of the Jacobian of the bijection ζ : x → xˆ from QG to PG is −|C|−1 (|S|+1)ν(S) |xC | |xS| . C∈C S∈S

∈ | −1 ∈ Proof (Part 1). Let Σ PG,and(L, D)=(ψ PG ) (Σ). Note that for (i, j) E,i > j,

m j T Σij =(LDL )ij = LikLjkDkk = LikLjkDkk, k=1 k=1 since L is lower triangular. Now note that,

∂ T (LDL )ij = Djj. ∂Lij 12 KHARE, RAJARATNAM

For 1 ≤ i ≤ m, ∂ T (LDL )ii =1. ∂Dii

Arrange the entries of θ =(L, D) ∈ ΘG as D11, {L2k :(2,k) ∈ E,1 ≤ k<2},D22, {L3k : (3,k) ∈ E,1 ≤ k<3}, ··· ,Dm−1,m−1, {Lmk :(m, k) ∈ E,1 ≤ kj Djj(Σ)

It follows from the expression above that the Jacobian of the transformation ψ is m −n Djj(Σ) j . j=1 

These Jacobians allow us to compute the induced priors on PG and QG. The induced prior corresponding to π U,α on PG is given by

m tr −1U n α D ( (Σ )+ i (2 i+ i)log ii(Σ)) PG − =1 (3.2) π U,α(Σ) ∝ e 2 , Σ ∈ PG.

We first note that the traditional Inverse-Wishart distribution (see [22]) with parameters U and n is a special case of (3.2)whenG is the complete graph, and αi = n − 2m +2i, ∀ 1 ≤ i ≤ m. We also note that the G-inverse Wishart priors introduced in [27] have a one-dimensional shape PG parameter δ, and are a very special case of our richer class π U,α. The single shape parameter 9 δ is given by the relationship αi +2ni = δ +2m, 1 ≤ i ≤ m . −1 We now proceed to derive the induced prior on QG.Letx = ϕ(Σ) = κ(Σ )denotethe image of Σ in QG. Using the second part of Lemma 2, the induced prior corresponding to π U,α on QG is given by

m tr xU n α D x −1 ( (ˆ )+ i (2 i+ i)log ii(( ) )) QG − =1 π U,α(x) ∝ e 2 (|S|+1)ν(S) |xS| × S ∈S ∈ |C|+1 ,x QG. C∈C |xC |

9There is an interesting parallel here that becomes apparent from our derivations above. In the concentration graph setting, the single shape parameter hyper inverse Wishart (HIW) prior of Dawid and Lauritzen [8]isa special case of the multiple shape parameter class of priors introduced by Letac and Massam [20], in the sense 1 αi − δ ci − = 2 ( + 1) (see [25] for notation). In a similar spirit, we discover that the single shape parameter class of priors in [27] is a special case of the multiple shape parameter class of priors πU,α introduced in this paper, in the sense αi = δ − 2ni +2m. WISHART DISTRIBUTIONS 13

3.2. Comparison with the Letac-Massam priors. We now carefully compare our class of priors to those proposed in Letac and Massam [20]. In [20], the authors construct two classes of distributions named WPG and WQG on the spaces PG and QG respectively, for G decomposable. (see [20, Section 3.1]). These distributions are generalizations of the Wishart distribution on these convex cones and have been found to be very useful for high-dimensional Bayesian inference as illustrated in [25]. These priors lead to corresponding classes of inverse Wishart ∼ ˆ −1 ∼ distributions IWPG (on QG)andIWQG (on PG), i.e., U IWPG whenever U WPG and ∼ −1 ∼ V IWQG whenever κ(V ) WQG .In[20], it is shown that the family of distributions

IWPG yields a family of conjugate priors in the Gaussian concentration graph setting, i.e., when Σ ∈ QG.

As the IWQG priors of [20] are defined on the space PG, in principle, they can potentially serve as priors in the covariance graph setting, since the parameter of interest Σ lives in PG. Let us examine this class more carefully, first with a view to understanding their use in the covariance graph setting, and second to compare them to our priors. Following the notation for decomposable graphs in Section 2.2 and in [20], the density of the IWQG distribution is given by    α(C) Σ−1 U,α,β − 1 tr(Σ−1U) C∈C C IWQ (Σ) ∝ e 2   , Σ ∈ PG, G  −1β(S)ν(S) S∈S ΣS  where U ∈ PG and α(C),C ∈C,β(S),S ∈Sare real numbers. The posterior density of Σ under this prior is given by   n  α(C)+ Σ−1 2 IW − 1 tr(Σ−1(U+nS)) C∈C C πU,α,β(Σ | Y1, Y2, ··· , Yn) ∝ e 2 .  (β(S)ν(S)+ nν(S) )  −1 2 S∈S ΣS 

However, U + nS may not in general be in PG, which is a crucial assumption in the analysis in [20]. Hence the conjugacy breaks down.

We now investigate similarities and differences between our class of priors and the IWQG U,α,β ∈ G class. Since the IWQG density is defined only for U P , a pertinent question is whether our class of priors has the same functional form when U ∈ PG. We discover that this is not the case, and demonstrate this through an example. Consider the 4-path A4. One can easily verify, − 1 tr Σ−1U that the terms e 2 ( ) are identical in both priors. We now show that the remaining terms are not identical. If Σ = LDLT is the modified Cholesky decomposition of Σ, then for this particular graph with C1 = {1, 2},C2 = {2, 3},C3 = {3, 4} and S1 = {2},S2 = {3},S3 = {4}, the expression that is not in the exponential term for the IWQG density is given as follows:    αi   4 −1   α1−β1   α1 2 2 2 i=1 ΣCi 1 1 L L L   = + 32 + 32 43  βi 3  −1 D11 D22 D33 D44 i=1 ΣSi    α 2 α2−β2  α 1 2 1 L43 1 3 × + . D22 D33 D44 D33D44

− − 1 tr(Σ 1U) PG This expression is clearly different from the term other than the exponent e 2 in πU,α, which is a product of different powers of Dii,i=1, 2, 3, 4. Yet another way to see that our 14 KHARE, RAJARATNAM

class of priors is not the same in general as the IWQG is to note that the number of shape parameters in IWQG is equal to the sum of the number of cliques plus one, whereas the number of shape parameters in πPG is equal to the number of vertices in the graph. However, an interesting property emerges when G is homogeneous. Note that in this case, for any clique C, and any separator S,      −1 1  −1 1 ΣC  = , ΣS  = . i∈C Dii i∈S Dii

PG Hence when G is homogeneous, the class IWQG is contained in the class π because in IWQG , the exponent of Dii and Djj is the same if iRj, i.e., the shape parameter is shared for vertices in the same equivalence class as defined by the relation R. Hence the containment is strict, P notwithstanding the fact that U need not be in PG for our class π G . In the restrictive case first when G is homogeneous, second when no two vertices belong to the same equivalence class as defined by the relation R, and third when U ∈ PG, the two classes of distributions PG π and IWQG have the same functional form.

3.3. Normalizing constant for homogeneous graphs. Suppose G =(V,E) is a homogeneous graph, and the vertices are ordered such that V ∈ SH . In this case the normalizing constant can be computed explicitly, thus enabling closed form Bayesian model selection for this class of graphs. Recall that  T −1 m tr (LDL ) U + αi log Dii − ( ( ) i=1 ) zG(U, α):= e 2 dLdD.

We provide below necessary and sufficient conditions for existence of the normalizing constant, andanexplicitexpressionforitinsuchcases.

Theorem 2. Let G =(V,E) be a homogeneous graph with vertices ordered such that V ∈ SH .ThenzG(U, α) < ∞ if and only if α satisfies the conditions in Proposition 1, i.e., ≺ αi > |N (i)| +2 ∀i =1, 2, ··· ,m. In this case,

 α |N ≺ i | |N ≺(i)| αi √ ≺   i − ( ) − 3 m αi −1 |N (i)|  ≺i 2 2 2 Γ 2 − 2 − 1 2 2 ( π) U G (3.3) z (U, α)= α |N ≺(i)| . i i − −1 i=1 |U | 2 2

The proof is provided in the Appendix. Furthermore, it can be shown that the result is maximal in the sense that in general, if the graph G is not homogeneous, the normalizing constant cannot be evaluated in closed form (see [18] for more details).

4. Sampling from the posterior distribution. In this section, we study the proper- ties of our family of distributions, and thereby provide a method that allows us to generate samples from the posterior distribution corresponding to the priors defined in Section 3.In particular, we prove that θ =(L, D) ∈ ΘG can be partitioned into blocks so that the condi- tional distribution of each block given the others are standard distributions in statistics and hence easy to sample from. We can therefore generate samples from the posterior distribution by using the block Gibbs sampling algorithm. WISHART DISTRIBUTIONS 15

4.1. Distributional properties and the block Gibbs sampler. Let us introduce some notation before deriving the required conditional distributions. Let G =(V,E) be a decomposable graph, such that V ∈ SD. For a lower triangular matrix L with diagonal entries equal to 1,

th Lu· := u row of L, u =1, 2, ··· ,m, th L·v := v column of L, v =1, 2, ··· ,m, G L·v := (Luv)u>v,(u,v)∈E ,v=1, 2, ··· ,m− 1.

G th So L·v is the v column of L without the components which are specified to be zero under the model G (and without the vth diagonal entry, which is 1). In terms of this notation, the parameter space can be represented as (4.1)   G G G G ΘG = L·1,L·2,L·3, ··· ,L·m−1,D : Lij ∈ R, ∀ 1 ≤ j 0, ∀ 1 ≤ i ≤ m .

m Suppose θ ∼ πU,α for some positive definite U and α ∈ R with non-negative entries. Then ··· the posterior distribution is πU, α ,whereU = nS + U, α =(n + α1,n+ α2, ,n+ αm). In the following proposition, we derive the distributional properties which provide the essential ingredients for our block Gibbs sampling procedure.

Theorem 3. Using the notation above, the conditional distributions of each component of θ (as in (4.1)) given the other components and the data Y1, Y2, ··· , Yn are as follows. 1. G G v,G v,G L·v | (L \ L·v,D,Y1, Y2, ··· , Yn) ∼N(μ ,M ) ∀ v =1, 2, ··· ,m− 1, where  v,G v v,G −1 T −1 T −1 v μu := μu + M  L U(L ) (LDL )  μw uu vv u w u>v:(u,v)∈E w>v:(w,v)∈/E −1 or wv,(u, v) ∈ E, −1 v (L U)vu −1 μu :=  ∀u such that Lvu =0, L−1U (LT )−1  vv v,G −1 −1 T −1 T −1   (M )  := L U(L ) (LDL )  ∀u, u >v,(u, v), (u ,v) ∈ E. uu vv uu

2. ⎛  ⎞ −1 T −1 L U(L ) ⎝αi ii ⎠ Dii | L, Y1, Y2, ··· , Ym ∼ IG − 1, 2 2

independently for i =1, 2, ··· ,m,whereIG represents the inverse-.

−1 v,G −1 Remark The notation Lvw = 0 in the definition of μ above means indices w for which Lvw is 0 as a function of entries of L.

Deriving the required conditional distributions in Theorem 3 entails careful analysis. We first state two lemmas which are essential for deriving these distributions. 16 KHARE, RAJARATNAM

Lemma 3. Let u>v,(u, v) ∈ E.Then,

−1 ∂Lij −1 −1 = −Liu Lvj ∀1 ≤ j

−1 Proof. Let u>v.Ifiv, Lij does not depend on Luv from Proposition 1, hence,

∂ −1 −1 −1 Lij =0=−Liu Lvj . ∂Luv with the second equality above being a consequence of L−1 being lower triangular. If i ≥ u> v ≥ j, then using the identity L−1LL−1 = L−1, m m −1 −1 −1 Lij = Lii Lij Ljj i=1 j=1 −1 −1 −1 =2Lij + Liu LuvLvj + C,

−1 −1 where C is functionally independent of Luv.SincebothLiu and Lvj are functionally inde- pendent of Luv,wegetthat, −1 −1 ∂Lij ∂Lij −1 −1 =2 + Liu Lvj , ∂Luv ∂Luv and hence, −1 ∂Lij −1 −1 = −Liu Lvj . ∂Luv  −1 Recall from Proposition 1 that, Lij functionally depends on Luv only if i ≥ u>v≥ j.Weuse this observation repeatedly in our arguments. For a given v, to prove conditional multivariate G normality of the conditional distribution of L·v given the others, we shall demonstrate that if we treat D and the other columns of L as constants, then tr((LDLT )−1U ) is a quadratic form G in the entries of L·v .

Lemma 4. Let u, u >v,(u, v), (u,v) ∈ E.Then,

2  ∂ T −1 −1 T −1 T −1 tr((LDL ) U)=2 L U(L ) (LDL )uu , ∂Luv∂Luv vv

G which is functionally independent of the elements of L·v. WISHART DISTRIBUTIONS 17

Proof. First note that, ∂ tr((LDLT )−1U ) ∂Luv ⎛ ⎞ m m m −1 −1 ∂ ⎝ Lki Lkj ⎠ = Uij ∂Luv i=1 j=1 k=1 Dkk   m m m −1 −1 −1 −1 −1 −1 Lku Lvi Lkj + Lki Lku Lvj = − Uij i=1 j=1 k=1 Dkk ( by Lemma 3) m m m −1 −1 −1 Lku Lvi Lkj = −2 Uij. i=1 j=1 k=1 Dkk

Note that L−1 is a lower triangular matrix. Hence, ∂2 tr((LDLT )−1U ) ∂Luv∂Lu⎛v ⎞ m m m −1 −1 −1 ∂ ⎝ Lku Lvi Lkj ⎠ = −2 Uij  ∂Lu v i=1 j=1 k=1 Dkk m m m −1 −1 −1 −1 Lku Lvi Lku Lvj =2 Uij i=1 j=1 k=1 Dkk ( by Lemma 3) ⎛ ⎞   m m m −1 −1 ⎝ −1 −1⎠ Lku Lku =2 Lvi UijLvj i=1 j=1 k=1 Dkk  −1 T −1 T −1 =2L U(L ) (LDL )  . vv uu

−1 G −1 Note that by Proposition 1, Lvi is functionally independent of L·v for all 1 ≤ i ≤ m,andLku G is functionally independent of L·v for all 1 ≤ k ≤ m and u>v. We thereby conclude that −1 T −1 T −1 G 2 L U(L ) (LDL )  is independent of L·v.  vv uu Proof of Theorem 3. An immediate consequence of Lemma 4 is that we can write tr((LDLT )−1U ) as follows:

tr((LDLT )−1U )  −1 T −1 T −1 = L U(L ) (LDL )  (Luv − bu)(Luv − bu )+C, vv uu u>v,(u,v)∈E u>v,(u,v)∈E

G where b =(bu)u>v,(u,v)∈E and C are independent of L·v. In order to evaluate (bu)u>v,(u,v)∈E , ∂ T −1 G note that the term in ∂Luv tr((LDL ) U) which is independent of L·v is given by  −1 T −1 T −1 (4.2) − 2 L U(L ) (LDL )  bu , vv uu u>v,(u,v)∈E 18 KHARE, RAJARATNAM for every u>v,(u, v) ∈ E. However, from the proof of Lemma 4 we alternatively know that

m m m −1 −1 −1 ∂ T −1 Lku Lvi Lkj tr((LDL ) U)=−2 Uij. ∂Luv i=1 j=1 k=1 Dkk

−1 −1 G −1 Note that by Lemma 3, Lku Lkj is functionally dependent on L·v if and only if Lku =0and −1 ∂ T −1 Lvj = 0 (as a function of L). And hence the term in ∂Luv tr((LDL ) U) which is independent G of L·v is given by

m m m −1 −1 −1 Lku Lvi Lkj −2 Uij1{L−1=0 L−1=0} vj or ku i=1 j=1 k=1 Dkk    m m −1 −1 −1 Lku Lkj = −2 Lvi Uij −1 i=1 k=1 Dkk j:Lvj =0  = −2 L−1U (LDLT )−1 vj uj j:L−1=0 vj  −1 L U   vj −1 T −1 T −1 (4.3) = −2 L U(L ) (LDL )uj . −1 T −1 vv −1 L U(L ) j:Lvj =0 vv

Now observe the following facts.

1. The expressions  in (4.2)and(  4.3) should be the same for every u>v,(u, v) ∈ E. A1 A2 ξ1 2. If A = T ,ξ = and η are such that A2 A3 ξ2

A1ξ1 + A2ξ2 = A1η,

then −1 η = ξ1 + A1 A2ξ2. If we choose A, ξ and η as  −1 T −1 T −1  −1 −1 Auu := L U(L ) (LDL )  ∀u, u such that Lvu ,L  =0, vv uu vu −1 (L U)vu −1 ξu :=  ∀u such that Lvu =0, L−1U (LT )−1 vv ηu := bu ∀u>v,(u, v) ∈ E, then combining the observations above with (4.2)and(4.3), we obtain

tr((LDLT )−1U )  −1 T −1 T −1 v,G v,G = L U(L ) (LDL )  (Luv − μu )(Luv − μ  )+C. vv uu u u>v,(u,v)∈E u>v,(u,v)∈E WISHART DISTRIBUTIONS 19

As defined earlier,  L−1U v vu −1 μu =  ∀u such that Lvu =0, L−1U (LT )−1 vv  v,G v v,G −1 T −1 T −1 v μu = μu + M  L U(L ) (LDL )  μw uu vv u w u>v,(u,v)∈E w>v,(w,v)∈/E −1 or wv,(u, v) ∈ E,

G G and C is independent of L·v. It follows that under πU ,α , the conditional distribution of L·v v,G v,G given the other parameters and the data Y1, Y2, ··· , Yn is N (μ ,M ). For deriving the conditional distribution of the entries of D,wenotethat,

 L−1U LT −1 T −1 m m ( ( ) ) (tr((LDL ) U )+ αi log Dii) − jj − i=1 1 D e 2 = e 2 jj , α j=1 j 2 Djj

The above leads us to conclude that the conditional distribution  ofDjj given the other param- L−1U (LT )−1 α j jj eters and the data Y1, Y2, ··· , Yn is IG 2 − 1, 2 , independently for every j =1, 2, ··· ,m. 

4.2. Convergence of block Gibbs sampler. The block Gibbs sampling procedure, based on the conditional distributions derived above, gives rise to a Markov chain. It is natural to ask whether this Markov chain converges to the desired distribution πU, α . Convergence properties are sometimes overlooked due to the theoretical demands in establishing them. However, they yield theoretical safeguards that the block Gibbs sampling algorithm can be used for sampling from the posterior distribution. We now prove that sufficient conditions for convergence of a Gibbs sampling Markov chain to its stationary distribution (see [1, Theorem 6]) are satisfied by the Markov chain corresponding to our block Gibbs sampler. Let φ(x | μ, Σ) denote the N (μ, Σ) density evaluated at x.Let fIG(d | α, λ)denotetheIG(α, λ) density evaluated at d. Let us fix ψ, d1,d2 > 0 arbitrarily. Let { ∈ | |≤ ≤ ≤ ∀ ∈ } Θψ,d1,d2 := θ =(L, D) ΘG : Lij ψ, d1 Dii d2 i>j,(i, j) E . We now formally prove the conditions which are sufficient for establishing convergence.

Proposition . ∃ ∈ 4 δ>0 such that for all θ =(L, D) Θψ,d1,d2 ,  G v,G v,G φ L·,v | μ ,M >δ∀ v =1, 2, ··· ,m− 1, ⎛  ⎞ −1 T −1 L U(L ) ⎝ αi ii ⎠ fIG Dii| − 1, >δ∀ i =1, 2, ··· ,m. 2 2 20 KHARE, RAJARATNAM

Proof Firstly, by Proposition 1, all entries of L−1 are polynomials in the entries of L.Since

Θψ,d1,d2 is bounded and closed, there exists ψ1 > 0 such that     ∈ ⇒  −1 ≤ ∀ ∈ (L, D) Θψ,d1,d2 Luv ψ1 u>v,(u, v) E.

∈ Using the above, there exists a constant ψ2 > 0 such that if (L, D) Θψ,d1,d2 ,then        −1   T −1   −1 T −1  (4.4)  L U  ≤ ψ2, (LDL )   ≤ ψ2,  L U(L )  ≤ ψ2, vu uu vv

 −1 for every 1 ≤ v,u,u ≤ m. Secondly, since Lvv =1∀1 ≤ v ≤ m,andU is positive definite, it ∈ follows that there exists a constant ψ3 > 0 such that if (L, D) Θψ,d1,d2 ,then  −1 T −1 (4.5) ψ3 ≤ L U(L ) , vv for every 1 ≤ v ≤ m. ∈ Let (L, D) Θψ,d1,d2 .Notethat,

 T  −1  G v,G v,G G v,G L·v − μ M L·v − μ  T  −1  T  −1  T  −1 G v,G G G v,G v,G v,G v,G v,G = L·v M L·v − 2 L·v M μ + μ M μ .     ζ1 Σ11 Σ12 Observe that if ζ = ∈ Rm,andΣ= is a positive definite matrix, then ζ2 Σ21 Σ22   −1 T −1 T m ζ1 +Σ11 Σ12ζ2 Σ11 ζ1 +Σ11 Σ12ζ2 ≤ ζ Σζ, ∀ ζ ∈ R .

If we choose ζ and Σ as  −1 T −1 T −1  −1 −1 Σuu := L U(L ) (LDL )  ∀u, u such that Lvu ,L  =0, vv uu vu v −1 ζu = μu ∀u such that Lvu =0, then combining the observation above and the definition of μv,G,wegetthat,

 T  −1  v,G v,G v,G v −1 T −1 T −1 v μ M μ ≤ μ L U(L ) (LDL )  μ  . u vv uu u u:L−1=0 u:L−1 =0 vu vu

From the definitions in Theorem 3 we also have,    −1  M v,G μv,G = L−1U (LDLT )−1 ∀u>v,(u, v) ∈ E. vj uj u −1 j:Lvj =0

∈ It follows by (4.4), that for (L, D) Θψ,d1,d2 ,thereexistsψ4 > 0 such that

 T  −1  G v,G v,G G v,G (4.6) L·v − μ M L·v − μ ≤ ψ4, WISHART DISTRIBUTIONS 21

v,G for every v =1 , 2, ··· ,m− 1. Also, by the definition of M , it follows that for (L, D) ∈      v,G ∞  v,G Θψ,d1,d2 , 0 < M < and M is a continuous function of (L, D). Recall that for a | | matrix A, A denotes the determinant of A.Since Θ ψ,d1,d2 is a bounded and closed set, both    v,G the maximum and minimum of the function M are attained in Θψ,d1,d2 . It follows that ∈ for (L, D) Θψ,d1,d2 ,thereexist0<κ1 <κ2 such that    v,G (4.7) κ1 < M  <κ2

··· − ∈ for every v =1, 2, ,m 1. It follows by (4.4), (4.5), (4.6)and(4.7)thatfor(L, D) Θψ,d1,d2 , there exists δ1 > 0 such that  G v,G v,G φ L·v | μ ,M >δ1 ∀ v =1, 2, ··· ,m− 1.

∈ Note furthermore that if (L, D) Θψ,d1,d2 ,thenfrom(4.4)and(4.5)  −1 T −1 ψ3 ≤ L U(L ) ≤ ψ2 ii for every 1 ≤ i ≤ m. Hence, there exists δ2 > 0, such that ⎛  ⎞ −1 T −1 L U(L ) ⎝ αi ii ⎠ fIG Dii| − 1, >δ2 ∀ i =1, 2, ··· ,m. 2 2

∈ Let δ =min(δ1,δ2). Hence, for θ =(L, D) Θψ,d1,d2 ,  G v,G v,G φ L·v | μ ,M >δ ∀ v =1, 2, ··· ,m− 1, ⎛  ⎞ −1 T −1 L U(L ) ⎝ αi ii ⎠ fIG Dii | − 1, >δ ∀ i =1, 2, ···m. 2 2



Recall that nv = |{u : u>v,(u, v) ∈ E}|. Note that the measures corresponding to v,G v,G N μ ,M and N (0,Inv ) are mutually absolutely continuous, and the corresponding Rnv ∀ ··· − densities w.r.t. Lebesgure measure are positive everywhere on  ,  v =1, 2, ,m 1. In L−1U (LT )−1  α i ii α i addition, the measures corresponding to IG 2 − 1, 2 and the IG 2 − 1, 1 are mutually absolutely continuous, and the corresponding densities w.r.t. Lebesgue measure are positive everywhere on (0, ∞), ∀ i =1, 2, ··· ,m.Also,sinceΘψ,d ,d is bounded and   1 2 m−1 G | m α i − closed, v=1 φ L·v 0,Inv i=1 fIG 2 1, 1 is bounded on Θψ,d1,d2 . Combining this with Proposition 4, all required conditions in [1, Theorem 6] are satisfied. Hence, the block Gibbs sampling Markov chain, based on the derived conditional distributions converges to the desired stationary distribution πU,α. We note that in [27, Page 18], the authors introduce a procedure to sample from the G- PG inverse Wishart distributions (these are a narrow subclass of our priors πU,α). Essentially, at 22 KHARE, RAJARATNAM every iteration, they cycle through all the rows of Σ. At the ith step in an iteration, they sample G the vector Σi· := (Σij)j∈N (i) from its conditional distribution (Gaussian) given the other 1 entries of Σ, and then sample γi := −1 from its conditional distribution (Inverse-Gamma) Σii given the other entries of Σ. Since Σ is a symmetric  matrix, for (i, j) ∈ E,thevariableΣij G G G G G appears in Σi· as well as Σj·. Hence, Σ1·,γ1 , Σ2·,γ2 , ··· , Σm·,γm is not a disjoint partition of the variable space. Therefore, their procedure is not strictly a Gibbs sampling procedure, and its convergence properties are not clear. On the other hand, in our procedure, G G G we cycle through L·1,L·2, ··· ,L·m,D , which is a disjoint partition of the variable space. Hence, our procedure is a Gibbs sampler in the true sense. There are also other differences between the two procedures, such as the fact that γi = Dii unless i = m.

5. Covariance hyper Markov properties. In this section we study the “covariance” hyper Markov properties of the class of distributions introduced in this paper. The conceptual foundations of hyper Markov properties in the context of concentration graph models were laid in Dawid and Lauritzen [8]. The reader is referred to Letac and Massam [20]forabrief overview. We now prove a property of our class of priors that will be used to establish properties that are analogous to hyper Markov properties in the covariance graph setting. This property PG is also crucial in evaluating E[Σ] when Σ ∼ πU,α.

Theorem 4. Let G =(V,E) be homogeneous, with vertices ordered according to the Hasse PG perfect vertex elimination scheme specified in Section 2.4, i.e., V ∈ SH .IfΣ ∼ πU,α,and Σ=LDLT is its modified Cholesky decomposition, then   ≺i −1 ≺ Dii, (Σ ) Σ ·i 1≤i≤m are mutually independent. Furthermore, the distributions of these quantities are specified as follows.  ≺i −1 ≺ ≺i −1 ≺ ≺i −1 (Σ ) Σ·i | Dii ∼N (U ) U·i ,Dii(U ) , and   ≺ ≺ T ≺i −1 ≺ αi |N (i)| Uii − (U·i ) (U ) U·i Dii ∼ IG − − 1, ∀i =1, 2, ··· ,m. 2 2 2

Remark The above result decomposes Σ into mutually independent coordinates. Note that for any i such that ¯i is a leaf of the Hasse tree and i has the minimal label in its equivalence class ¯i, N ≺(i)=φ. ≺i ≺ In this case, it is understood that Σ and Σ·i are vacuous parameters and that Dii =Σii. Proof. Let G be a homogeneous graph with m vertices, with the vertices ordered according to the Hasse perfect elimination scheme specified in Section 2.4. Recall that the vertices of the Hasse diagram of G are equivalence classes formed by the relation R defined in Section 2.4. The vertex labelled m clearly lies in the equivalence class of vertices at the root of the corresponding Hasse diagram. Let us remove the vertex labelled m from the graph G,andlet G denote the induced graph on the remaining m − 1 vertices. The graph G can be of the following two types. WISHART DISTRIBUTIONS 23

• Case I: If the equivalence class of m contains more than one element, then G is a homogeneous graph with the Hasse diagram having the same depth as the Hasse diagram of G, but with one less vertex in the equivalence class at the root. Recall that the depth of a tree is the length of the longest path from its root to any leaf. • Case II: If the equivalence class of m contains only one element, then G is a discon- nected graph, with the connected components being homogeneous graphs with the Hasse diagram having depth one less than the depth of the Hasse diagram of G. Note that for every 1 ≤ i ≤ m such that N ≺(i) = φ, Σi can be partitioned as   ≺i ≺ i Σ Σ·i Σ = ≺ T . (Σ·i ) Σii

Also note that if Z ∼N(0, Σ) then Dii is the conditional variance of Zi given Z1,Z2, ··· ,Zi−1 ≺ ≺ (see [16]). Note that Σkl =0, ∀ 1 ≤ k, l ≤ i, k ∈N (i),l/∈N (i). It follows that ≺ T ≺i −1 ≺ Dii =Σii − (Σ·i ) (Σ ) Σ·i . Hence, by the formula for the inverse of a partitioned matrix, it follows that ⎡ T ⎤ ≺i −1 ≺ ≺i −1 ≺ ≺i −1 ≺ ≺i −1 ((Σ ) Σ·i )((Σ ) Σ·i ) (Σ ) Σ ⎢(Σ ) + − ·i ⎥ ⎢ Dii Dii ⎥ (Σi)−1 = ⎢ ⎥ . ⎣ T ⎦ (Σ≺i)−1Σ≺ −( ·i ) 1 Dii Dii Hence,   tr (Σi)−1U i = tr (Σ≺i)−1U ≺i +  T  1 ≺i −1 ≺ ≺i −1 ≺ ≺i ≺i −1 ≺ ≺i −1 ≺ (Σ ) Σ·i − (U ) U·i U (Σ ) Σ·i − (U ) U·i + Dii  1 ≺ T ≺i −1 ≺ (5.1) Uii − (U·i ) (U ) U·i Dii We note again that from our argument at the beginning of the proof, Σ≺i =Σ(i−1) or Σ≺i has a block diagonal structure (after an appropriate permutation of the rows and columns) i i i 10 with blocks Σ 1 , Σ 2 , ··· , Σ k for some k>1, 1 ≤ i1,i2, ··· ,ik

10 If the equivalence class of i has k children in the Hasse diagram of G,andVj the set of vertices in th V belonging to the subtree rooted at the j child, then Vj ,for1≤ j ≤ k, are disjoint subsets. Infact, if    ij =max{i : i ∈ Vj }, then it follows by the construction of the Hasse diagram that Vj = N (ij )for 1 ≤ j ≤ k. 24 KHARE, RAJARATNAM

Let us now evaluate the Jacobian of the transformation   ≺i −1 ≺ Σ → Dii, (Σ ) Σ . ·i 1≤i≤m   R u Suppose M = is a p × p matrix, and consider the transformation uT w

M =(R, u,w) → (R, R−1u,w− uT R−1u).

Since R−1u depends only on (R, u)andw − uT R−1u depends on (R, u,w), it follows that the ∂(R,R−1u,w−uT R−1u) gradient matrix ∂(R,u,w) is block lower triangular, and hence the Jacobian is given by          −1  T −1 −1 ∂R ∂(R u) ∂(w − u R u)   ·   , ∂R  ∂u  ∂w which simplifies to |R|. In particular, the Jacobian of the transformation  i ≺i ≺i −1 ≺ Σ → Σ , (Σ ) Σ·i ,Dii   is given by Σ≺i. Note once more that Σ = Σm, and as mentioned earlier, Σ≺i =Σ(i−1) or Σ≺i (after an appropriate permutation of the rows and columns) has a block diagonal structure with i i i blocks Σ 1 , Σ 2 , ··· , Σ k for some k>1, 1 ≤ i1,i2, ··· ,ik

Here as in Section 3.1 Lemma 2,

nj = |{(i>j:(i, j) ∈ E}| ∀j =1, 2, ··· ,m.

Also from Section 3.1,   −1 m tr Σ U + (2nj +αj )logDjj ( ) j=1 PG 1 − πU,α(Σ) = e 2 , Σ ∈ PG. zG(U, α) WISHART DISTRIBUTIONS 25

Let   ≺i −1 ≺ Γ = Dii, (Σ ) Σ . ·i 1≤i≤m   It follows from the decomposition of tr Σ−1U from (5.2) and the computation of the deter- minant of the Jacobian (5.3)that,   Γ ≺i −1 ≺ πU,α Dii, (Σ ) Σ ·i 1≤i≤m

m T 1 ≺i −1 ≺ ≺i −1 ≺ ≺i ≺i −1 ≺ ≺i −1 ≺ 1 − ((Σ ) Σ·i −(U ) U·i ) U ((Σ ) Σ·i −(U ) U·i ) = e 2Dii zG(U, α) i=1 m α 1 ≺ T ≺i − ≺ i − Uii−(U ) (U ) 1U − 2D ( ·i ·i ) 2 × e ii Dii . i=1   ≺i −1 ≺ The above proves the mutual independence of Dii, (Σ ) Σ·i 1≤i≤m.Fromthejointden- ≺i −1 ≺ sity of Dii, (Σ ) Σ·i , it is clear that  ≺i −1 ≺ ≺i −1 ≺ ≺i −1 (Σ ) Σ·i | Dii ∼N (U ) U·i ,Dii(U ) .

To evaluate the marginal density of D_ii, we integrate out (Σ^{≺i})^{-1}Σ^≺_{·i} from the joint density of (D_ii, (Σ^{≺i})^{-1}Σ^≺_{·i}). Note that

    ∫_{R^{|N^≺(i)|}} exp( −(1/(2D_ii)) ((Σ^{≺i})^{-1}Σ^≺_{·i} − (U^{≺i})^{-1}U^≺_{·i})^T U^{≺i} ((Σ^{≺i})^{-1}Σ^≺_{·i} − (U^{≺i})^{-1}U^≺_{·i}) ) d((Σ^{≺i})^{-1}Σ^≺_{·i}) = C D_ii^{|N^≺(i)|/2},

where C is a constant, since the above integral is essentially an unnormalized multivariate normal integral. Hence the marginal density of D_ii is given by

    π_{U,α}(d) ∝ exp( −(U_ii − (U^≺_{·i})^T (U^{≺i})^{-1} U^≺_{·i}) / (2d) ) d^{−α_i/2 + |N^≺(i)|/2}.

We can therefore conclude that

    D_ii ∼ IG( α_i/2 − |N^≺(i)|/2 − 1, (U_ii − (U^≺_{·i})^T (U^{≺i})^{-1} U^≺_{·i}) / 2 ).  □

The mutual independence of the components of Γ provides the ingredients to establish the strong directed “covariance” hyper Markov property of our class of priors. Let G = (V, E) be a decomposable graph with V ∈ S_D, i.e., the vertices have been ordered according to the perfect vertex elimination scheme for decomposable graphs outlined in Section 2.2. Let D be the directed graph obtained from G by directing all edges in G from the vertex with the smallest number to the vertex with the highest number. Let pa(i) denote the set of parents of i according to the direction specified in D. It follows that pa(i) = N^≺(i). As in [8, 20], let pr(i) = {1, 2, ··· , i − 1} denote the set of predecessors of i according to the direction specified in D. Analogous to the strong hyper Markov property in the context of concentration graph models defined in [8, 20], we now define the “covariance” hyper Markov property in the context of covariance graph models.

Definition 1. A family of priors F on P_G satisfies the strong “covariance” hyper Markov property with respect to the direction D if, whenever π ∈ F and Σ ∼ π,

    Σ_{i|pa(i)} ⊥ Σ^{pr(i)}  ∀ 1 ≤ i ≤ m,

where Σ_{i|pa(i)} := Σ_ii − (Σ^≺_{·i})^T (Σ^{≺i})^{-1} Σ^≺_{·i} = D_ii.

Let G = (V, E) be a homogeneous graph with V ∈ S_H, i.e., the vertices have been ordered according to the perfect vertex elimination scheme for homogeneous graphs outlined in Section 2.4. Recall that a homogeneous graph is decomposable and S_H ⊆ S_D. Let D be the directed graph obtained from G by directing all edges in G from the vertex with the smallest number to the vertex with the highest number (the Hasse diagram of G is directed in the reverse way). We show in the following proposition that in this situation, the family of priors π^{P_G}_{U,α} satisfies the strong “covariance” hyper Markov property with respect to the direction D.

Proposition 5. Let G = (V, E) be homogeneous, with V ∈ S_H. If Σ ∼ π^{P_G}_{U,α}, then

    D_ii ⊥ Σ^{{1,2,··· ,i−1}}  ∀ 1 ≤ i ≤ m.

Remark. Recall that Σ^{≺i} = ((Σ_uv))_{u,v∈N^≺(i)} is different from Σ^{{1,2,··· ,i−1}} = ((Σ_uv))_{1≤u,v≤i−1}.

Proof. Since the vertices have been labelled according to the Hasse perfect vertex elimination scheme, Σ^{{1,2,··· ,i−1}} (after an appropriate permutation of the rows and columns) has a block diagonal structure with diagonal blocks Σ^{≺i} and Σ^{j_1}, Σ^{j_2}, ··· , Σ^{j_l} for some l ≥ 0 and indices 1 ≤ j_1, j_2, ··· , j_l < i. By Theorem 4, D_ii is independent of Σ^{≺i} and of each of the remaining blocks, and therefore

    D_ii(Σ) ⊥ Σ^{{1,2,··· ,i−1}}.  □

We demonstrated in Section 3.2 that the family IW_{Q_G} of Letac and Massam [20] is a subfamily of our class of priors π^{P_G}_{U,α} when G is homogeneous. Consequently, we can now prove “covariance” hyper Markov properties for the IW_{Q_G} family.

Corollary 1. Let G = (V, E) be homogeneous, with V ∈ S_H. Let D be the directed graph obtained from G by directing all edges in G from the vertex with the smallest number to the vertex with the highest number. Then the family IW_{Q_G} is strong hyper Markov with respect to the direction D.

Hyper Markov properties for the IW_{Q_G} family were not established in [20]. Hence, we note that the corollary above is a new result for this family.

6. Laplace transforms, expected values and generalized posterior linearity. We now consider the task of computing the Laplace transform and expected values related to our class of priors π^{P_G}_{U,α}. In particular, we provide closed form expressions for these quantities when G is homogeneous. Let us first establish notation that is needed in this section. Suppose A is a matrix of dimension m and M is a subset of {1, 2, ··· , m}. We use the following notation.

    A_M := ((A_ij))_{i,j∈M},

    (A^0_M)_{ij} := A_ij if i, j ∈ M, and 0 otherwise (so that A^0_M is an m × m matrix).

We now provide an expression for the expected value of the inverse covariance matrix in the following proposition.

Proposition 6. Suppose α satisfies the condition in Proposition 1. Then, for Σ ∼ π^{P_G}_{U,α},

    E_{U,α}[Σ^{-1}] = Σ_{i=1}^m (α_i − |N^≺(i)| − 2) ((U^i)^{-1})^0 − Σ_{i=1}^m (α_i − |N^≺(i)| − 3) ((U^{≺i})^{-1})^0.
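A direct transcription of this formula is straightforward. The sketch below (Python; the helper pad0 and the list nbrs encoding N^≺(i) as 0-based indices are illustrative assumptions, not notation from the paper) assembles E_{U,α}[Σ^{-1}] from the zero-padded inverses.

    import numpy as np

    def pad0(A_sub, idx, m):
        """Embed a submatrix indexed by `idx` into an m x m matrix of zeros (the (.)^0 operation)."""
        out = np.zeros((m, m))
        out[np.ix_(idx, idx)] = A_sub
        return out

    def expected_inverse_sigma(U, alpha, nbrs):
        """Sketch of Proposition 6; nbrs[i] is an assumed list giving N^prec(i)."""
        m = U.shape[0]
        total = np.zeros((m, m))
        for i in range(m):
            prec = list(nbrs[i])              # N^prec(i)
            full = prec + [i]                 # N^prec(i) together with i
            U_i = U[np.ix_(full, full)]
            total += (alpha[i] - len(prec) - 2) * pad0(np.linalg.inv(U_i), full, m)
            if prec:                          # the U^{prec i} term is vacuous when N^prec(i) is empty
                U_prec = U[np.ix_(prec, prec)]
                total -= (alpha[i] - len(prec) - 3) * pad0(np.linalg.inv(U_prec), prec, m)
        return total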

The proposition is proved in the Appendix. We can also obtain the Laplace transform of Σ^{-1} when Σ ∼ π^{P_G}_{U,α}.

Lemma 5. For positive definite matrices U, B, and α satisfying the condition in Proposition 1, the Laplace transform of the distribution of Σ^{-1}, defined as L_{U,α}(B), is given as follows:

    L_{U,α}(B) = E_{U,α}[ e^{−tr(Σ^{-1}B)/2} ] = z_G(U + B, α) / z_G(U, α).

This lemma follows immediately from the definition of π^{P_G}_{U,α} and z_G(U, α).

We now provide a recursive method that gives closed form expressions for the expected value of the covariance matrix when Σ ∼ π^{P_G}_{U,α}. The basis for the method is the block mutual independence result given in Theorem 4. Since Σ_uv = 0 ∀ (u, v) ∉ E, we only need to evaluate the expectation of Σ_ii and Σ^≺_{·i} for every 1 ≤ i ≤ m. The procedure is given as follows. Let

    A_1 := {i ∈ V : N^≺(i) = φ}.

Recall from Section 2.4 that since G is homogeneous, there exists an associated Hasse diagram on the equivalence classes of the vertices generated by the relation R. All vertices whose equivalence classes are leaves of the Hasse diagram, and which have the smallest label in their equivalence classes, are members of the set A_1. Clearly, if i ∈ A_1, then Σ^{≺i} and Σ^≺_{·i} are vacuous parameters and D_ii = Σ_ii. It follows from Theorem 4 that for i ∈ A_1,

    E_{U,α}[Σ_ii] = E_{U,α}[D_ii] = (U_ii − (U^≺_{·i})^T (U^{≺i})^{-1} U^≺_{·i}) / (α_i − 4),

assuming that α_i > 4, since X ∼ IG(λ, γ) implies that E[X] = γ/(λ − 1).

For k = 2, 3, 4, ···, define

    A_k := { i ∈ V : N^≺(i) ⊆ ∪_{l=1}^{k−1} A_l } \ ∪_{l=1}^{k−1} A_l.

Since there are finitely many vertices in V, ∃ k* such that A_k = φ for k > k*. The sets {A_k}_{1≤k≤k*} essentially provide a way of computing E_{U,α}[Σ] by starting at the bottom of the Hasse diagram of G and then moving up sequentially.

Proposition 7. Given the expectations of Σ_jj and Σ^≺_{·j} for j ∈ ∪_{l=1}^{k−1} A_l, the expectations of Σ_ii and Σ^≺_{·i} for i ∈ A_k are given by the following expressions:

    E_{U,α}[Σ^≺_{·i}] = E_{U,α}[Σ^{≺i}] (U^{≺i})^{-1} U^≺_{·i},

    E_{U,α}[Σ_ii] = (U_ii − (U^≺_{·i})^T (U^{≺i})^{-1} U^≺_{·i}) / (α_i − |N^≺(i)| − 4)
        + tr( E_{U,α}[Σ^{≺i}] ( (U^{≺i})^{-1} (U_ii − (U^≺_{·i})^T (U^{≺i})^{-1} U^≺_{·i}) / (α_i − |N^≺(i)| − 4) + (U^{≺i})^{-1} U^≺_{·i} (U^≺_{·i})^T (U^{≺i})^{-1} ) ),

provided α_i > |N^≺(i)| + 4.

The proposition is proved in the Appendix. The expressions above yield a recursive but closed form method to calculate E[Σ] when Σ ∼ π^{P_G}_{U,α}.
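The recursion of Proposition 7 can be organized by simply visiting the vertices in their given order, since N^≺(i) ⊆ {1, ··· , i − 1}. The following Python sketch, under the stated assumptions (homogeneous G, Hasse perfect elimination ordering, and a hypothetical 0-based encoding nbrs of N^≺(i)), fills in the free entries of E_{U,α}[Σ]; entries corresponding to missing edges remain zero.

    import numpy as np

    def expected_sigma(U, alpha, nbrs):
        """Sketch of the recursive computation of E[Sigma] from Proposition 7."""
        m = U.shape[0]
        ESigma = np.zeros((m, m))
        for i in range(m):                    # the sets A_1, A_2, ... are visited in increasing order
            prec = list(nbrs[i])
            p = len(prec)
            if p == 0:                        # i in A_1: vacuous U^{prec i} terms
                ESigma[i, i] = U[i, i] / (alpha[i] - 4.0)
                continue
            U_prec = U[np.ix_(prec, prec)]
            u_col = U[prec, i]
            b = np.linalg.solve(U_prec, u_col)            # (U^{prec i})^{-1} U^{prec}_{.i}
            E_prec = ESigma[np.ix_(prec, prec)]           # E[Sigma^{prec i}], already available
            d_mean = (U[i, i] - u_col @ b) / (alpha[i] - p - 4.0)   # E[D_ii]
            # E[Sigma^{prec}_{.i}] = E[Sigma^{prec i}] (U^{prec i})^{-1} U^{prec}_{.i}
            ESigma[prec, i] = ESigma[i, prec] = E_prec @ b
            # E[Sigma_ii] from Proposition 7
            inner = d_mean * np.linalg.inv(U_prec) + np.outer(b, b)
            ESigma[i, i] = d_mean + np.trace(E_prec @ inner)
        return ESigma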

6.1. Posterior linearity. In their seminal paper, Diaconis and Ylvisaker [10] laid out a general framework for the construction of families of conjugate priors for natural exponential families. These conjugate priors have a specific form. More specifically, if X is a random vector distributed according to a natural exponential family with canonical parameter η ∈ R^d, their conjugate priors are characterized through the property of linear posterior expectation of the mean parameter of X: E_{π(η|x)}[ E_{P_η}[X | η] ] = ax + b, a ∈ R, b ∈ R^d. As covariance graph models give rise to curved exponential families, it is natural to ask if our class of priors has any posterior linearity properties. Note that for this class of models, the canonical parameter Ω = Σ^{-1} lies in a lower dimensional space (κ(Ω) ∈ Q_G), and

E[S | Ω] = Σ.

A concrete example of posterior linearity is available in the concentration graph setting. The hyper inverse Wishart (HIW) prior of Dawid and Lauritzen [8] is a special case of the larger flexible family of conjugate priors of Letac and Massam [20] and satisfies posterior linearity, i.e., the HIW is a DY prior, whereas the entire family of Letac-Massam priors does not satisfy this property in general. In the same spirit, a relevant question is whether any member of our flexible class of priors has this posterior linearity property, i.e., if it is a DY conjugate prior. More precisely, does there exist U and α such that

(6.1)    E_{U,α}[Σ | S] = E_{Ũ,α̃}[Σ] = aS + bU,

for some a, b ∈ R, where Ũ = nS + U and α̃ = (α_1 + n, α_2 + n, ··· , α_m + n). It is immediately apparent that it is not possible to obtain such a relation, because the left hand side of (6.1) lies in P_G, whereas the minimal sufficient statistic S almost surely does not lie in P_G but lies in the higher dimensional space P^+ (although if the sample size n is large enough, the entries of S corresponding to edges not in G will be close to zero, i.e., S_ij → 0 a.s. if (i, j) ∉ E). Nevertheless, in the case of curved exponential families, we introduce the notion of “pseudo posterior linearity” or “generalized posterior linearity”, which is a natural extension of the concept of posterior linearity in regular exponential family settings. We show that this pseudo/generalized posterior linearity is satisfied by certain members of our class of priors, and therefore draw parallels to the natural exponential family setting. We shall first introduce some new concepts that allow us to formalize this property.

Definition 2. Consider a curved exponential family with density (w.r.t. some measure μ) given by

(6.2)    e^{−Σ_{i=1}^d η_i T_i − φ(η)},   η ∈ Ξ,

where η ∈ R^d is the canonical parameter and Ξ is a lower-dimensional curve in R^d. This curved exponential family is called “separable” if ∃ d′ < d such that {η_i}_{i=1}^{d′} are functionally independent and the remaining components {η_i}_{i=d′+1}^{d} are functions of {η_i}_{i=1}^{d′}.

For these curved exponential families, we can “separate” out the functionally independent components of the canonical parameter from the dependent components. Since the minimal sufficient statistic and the canonical parameter in a regular exponential family are of the same dimension, the concept of posterior linearity is well defined for regular exponential family models. In order to obtain posterior linearity in a curved setting akin to the regular exponential family setup, a natural approach is to modify the likelihood so that it resembles a regular exponential family and then pose the question of posterior linearity for this modified likelihood. We make this notion more precise in the following definition.

Definition 3. The pseudo-likelihood of a distribution that arises from a “separable” curved exponential family (6.2) is defined to be

    e^{−Σ_{i=1}^{d′} η_i T_i − φ(η)},   η ∈ Ξ.

The “pseudo likelihood” is obtained by considering only the functionally independent variables in the first term in the exponent in (6.2). We now apply this notion to covariance graph models. Consider the class of distributions G as in Section 3, with a homogeneous graph G = (V, E). We assume that the vertices have been ordered according to the Hasse perfect vertex elimination scheme specified in Section 2.4. The family G is indeed a separable exponential family with canonical parameter Ω = Σ^{-1} and {Ω_ij : i > j, (i, j) ∈ E} being the functionally independent components. Hence, the “pseudo likelihood” for this class of models is given by

    e^{−(n/2) (tr(ΩS*) + log|Σ|)},

where

    S*_ij = S_ij if (i, j) ∈ E, and 0 otherwise.

If Σ has a prior distribution given by π^Σ_{U,α}, then the posterior distribution based on the pseudo likelihood is given by π^Σ_{U*,α*}, where U* = nS* + U and α* = (n + α_1, n + α_2, ··· , n + α_m). We now prove the following proposition.

Proposition 8. Let U ∈ P_G and α_i = 2|N^≺(i)| + 4 + c for some c > 0 and i = 1, 2, ··· , m. If Σ ∼ π^{P_G}_{U,α}, then

    E_{U,α}[Σ] = (1/c) U.

The proposition is proved in the Appendix.

Now note that if U ∈ P_G, then U* ∈ P_G (since S* ∈ P_G). Hence, if U is chosen so that U* is positive definite^{11}, then under the conditions of Proposition 8, the expectation of Σ under the distribution π^Σ_{U*,α*} (the posterior distribution based on the pseudo likelihood) is given by

    E_{U*,α*}[Σ] = (nS* + U) / (n + c),

which is a linear function of the data. Hence we conclude that a select subclass of our family of distributions satisfies the pseudo posterior linearity property.
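As a small numerical illustration of Proposition 8 (and hence of the linearity displayed above), one can combine the expected_sigma sketch given after Proposition 7 with the choice α_i = 2|N^≺(i)| + 4 + c; the star-shaped homogeneous graph encoded below is purely illustrative and not an example from the paper.

    import numpy as np

    # Uses expected_sigma from the sketch following Proposition 7.
    # Star graph with three leaves (labels 0,1,2) and centre labelled last (3):
    # nbrs[i] is an assumed 0-based encoding of N^prec(i).
    nbrs = [[], [], [], [0, 1, 2]]
    m, c = len(nbrs), 2.0
    alpha = np.array([2 * len(nbrs[i]) + 4 + c for i in range(m)])

    U = np.eye(m)                               # a simple U in P_G for this graph
    print(np.allclose(expected_sigma(U, alpha, nbrs), U / c))   # True: E[Sigma] = U / c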

7. Examples. While the main purpose of the paper is to undertake a theoretical investigation of our class of distributions and of their efficacy for use in Bayesian estimation in covariance graph models, we nevertheless provide two examples (one real and one simulated) to demonstrate how the methodology developed in this paper can be implemented.

7.1. Genomics Example. We provide an illustration of our methods on a dataset consisting of gene expression data from microarray experiments with yeast strains from Gasch et al. [13]. This dataset has also been analyzed in [3, 11]. As in [3, 11], we consider a subset of eight genes involved in galactose utilization. There are n = 134 experiments, and the empirical covariance matrix for these measurements is provided in Table 1. Note that the sample covariance matrix is obtained after centering, as the mean is not assumed to be zero. We consider the covariance graph model specified by the graph G in Figure 2, with the overall aim of estimating Σ under this covariance graph model. The maximum likelihood estimate for Σ ∈ P_G, obtained by the iterative conditional fitting algorithm of [3], yields a deviance of 4.694 over 7 degrees of freedom, thus indicating a good model fit. The maximum likelihood estimate is provided in Table 2. We use the ordering {GAL11, GAL4, GAL80, GAL3, GAL7, GAL10, GAL1, GAL2} for our analysis.

Our goal is to obtain the posterior mean for Σ under our new class of priors and then provide Bayes estimators for Σ. We use two diffuse priors to illustrate our methodology. The first prior is denoted π_{U_1,α^1}, where U_1 = (tr(S)/8) I_8 and α^1_i = 5 + |N^≺(i)|, i = 1, 2, ··· , 8, i.e., α^1 = (5, 6, 6, 8, 7, 8, 9, 12). The second prior used is π_{U_2,α^2}, where U_2 = 0 and α^2_i = 2, i = 1, 2, ··· , 8.

^{11} For example, U can be chosen so that U* is diagonally dominant.

Table 1
Empirical covariance matrix for yeast data

          GAL11    GAL4     GAL80    GAL3     GAL7     GAL10    GAL1     GAL2
GAL11     0.152
GAL4      0.034    0.130
GAL80     0.015    0.039    0.221
GAL3     -0.055    0.034    0.073    0.608
GAL7     -0.051   -0.053    0.183    0.722    3.423
GAL10    -0.048   -0.039   -0.188    0.553    2.503    2.372
GAL1     -0.066   -0.061    0.224    0.517    2.768    2.409    2.890
GAL2     -0.119   -0.018    0.208    0.583    2.547    2.278    2.514    2.890

[Figure 2: covariance graph on the vertices GAL80, GAL4, GAL7, GAL2, GAL11, GAL10, GAL3, GAL1]

Fig 2. Covariance graph for yeast data

          GAL11    GAL4     GAL80    GAL3     GAL7     GAL10    GAL1     GAL2     Method
GAL11     0.152    0.030    0       -0.052    0        0        0       -0.068    ICF
          0.164    0.030    0       -0.050    0        0        0       -0.068    BY1
          0.156    0.030    0       -0.052    0        0        0       -0.068    BY2
GAL4               0.128    0.040    0.042    0        0        0        0.030    ICF
                   0.142    0.040    0.041    0        0        0        0.027    BY1
                   0.133    0.041    0.042    0        0        0        0.028    BY2
GAL80                       0.223    0.082    0.197    0.198    0.239    0.227    ICF
                            0.237    0.072    0.193    0.194    0.235    0.216    BY1
                            0.232    0.076    0.199    0.200    0.243    0.223    BY2
GAL3                                 0.612    0.723    0.549    0.515    0.582    ICF
                                     0.626    0.713    0.544    0.509    0.575    BY1
                                     0.643    0.747    0.568    0.532    0.599    BY2
GAL7                                          3.422    2.593    2.768    2.540    ICF
                                              3.462    2.584    2.756    2.533    BY1
                                              3.588    2.682    2.866    2.636    BY2
GAL10                                                  2.372    2.409    2.267    ICF
                                                       2.373    2.400    2.266    BY1
                                                       2.453    2.497    2.358    BY2
GAL1                                                            2.890    2.502    ICF
                                                                2.961    2.501    BY1
                                                                3.086    2.604    BY2
GAL2                                                                     2.870    ICF
                                                                         3.003    BY1
                                                                         3.153    BY2

Table 2
ICF: Maximum likelihood estimate from iterative conditional fitting; BY1: Bayesian posterior mean estimate for prior π_{U_1,α^1}; BY2: Bayesian posterior mean estimate for prior π_{U_2,α^2}.

The block Gibbs sampling procedure was run for both priors as specified in Section 4. The burn-in period was chosen to be 1000 iterations, and the subsequent 1000 iterations were used to compute the posterior mean. Increasing the burn-in period to more than 1000 iterations results in insignificant changes to our estimates, thus indicating that the burn-in period chosen is sufficient. The posterior mean estimates for both priors, together with the maximum likelihood estimates, are provided in Table 2. The running time for the Gibbs sampling procedure for each prior is approximately 26 seconds on a Pentium M 1.6 GHz processor. We find that the Bayesian approach using our priors and the corresponding block Gibbs sampler gives stable estimates and thus yields a useful alternative methodology for inference in covariance graph models.
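The estimation procedure just described is a plain burn-in-and-average loop around the sampler. The sketch below assumes a function block_gibbs_step implementing one sweep of the block Gibbs sampler of Section 4 (not reproduced here); all names and the argument list are placeholders.

    import numpy as np

    def posterior_mean(Sigma0, S, n, U, alpha, G, burn_in=1000, keep=1000):
        """Burn-in-and-average driver; block_gibbs_step is an assumed one-sweep update."""
        Sigma = Sigma0.copy()
        for _ in range(burn_in):                  # discard burn-in sweeps
            Sigma = block_gibbs_step(Sigma, S, n, U, alpha, G)
        running = np.zeros_like(Sigma)
        for _ in range(keep):                     # average the next `keep` sweeps
            Sigma = block_gibbs_step(Sigma, S, n, U, alpha, G)
            running += Sigma
        return running / keep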

7.2. Simulation Example. A proof of convergence of the block Gibbs sampling algorithm proposed in Section 4.1 was provided in Section 4.2. The speed at which convergence occurs is also a very important concern for implementation of the algorithm. The number of steps required before one can generate a reasonable approximate sample from the posterior distribution is reflective of the rate of convergence. Understanding this is important for the accuracy of Bayes estimates, like the posterior mean. We proceed to investigate the performance of the block Gibbs sampling algorithm in a situation where the posterior mean is known exactly and hence allows a direct comparison. Consider a homogeneous graph G

Fig 3. Hasse diagram for a homogeneous graph with 50 vertices

with 50 vertices, with the corresponding Hasse tree given by Figure 3. Let Σ ∈ P_G, where the vertices have been ordered according to the Hasse perfect vertex elimination scheme in Section 2.4, the diagonal entries are 50 and all other non-zero entries are 1. We simulate 100 observation vectors Y_1, Y_2, Y_3, ··· , Y_99, Y_100 from N_50(0, Σ). For illustration purposes, we choose a diffuse prior π^Σ_{U,α} with U = 0 and α_i = 2|N^≺(i)| + 5, i = 1, 2, ··· , 50. Since the graph G is homogeneous, we can compute the posterior mean Σ_mean := E_{U,α}[Σ | Y_1, Y_2, ··· , Y_100] explicitly. We can therefore assess the ability of the block Gibbs sampling algorithm in estimating the posterior mean by comparing it to the true value of the mean.

We run the block Gibbs sampling algorithm to sample from the posterior distribution and subsequently check its performance in estimating Σ_mean. We use an initial burn-in period of B iterations and then average over the next I iterations to get the estimate Σ̂. The times needed for computation (using the R software) and the relative errors ‖Σ̂ − Σ_mean‖_2 / ‖Σ_mean‖_2 corresponding to various choices of B and I are provided in Table 3. The diagnostics in Table 3 indicate that the block Gibbs sampling algorithm performs exceptionally well, and yields estimates that approach the true mean in only a few thousand steps. The time taken for running the algorithm is also provided in Table 3.
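The relative error reported in Table 3 can be computed as below; we take ‖·‖_2 to be the spectral norm here, which is an assumption about the norm intended in the text.

    import numpy as np

    def relative_error(sigma_hat, sigma_mean):
        """Spectral-norm relative error between the Gibbs estimate and the exact posterior mean."""
        return np.linalg.norm(sigma_hat - sigma_mean, 2) / np.linalg.norm(sigma_mean, 2)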

8. Closing remarks. In this paper, we have proposed a theoretical framework for Bayesian inference in covariance graph models. The main challenge was the unexplored terrain of working with curved exponential families in the continuous setting. A parallel theory analogous to the standard conjugate theory for natural exponential families has been developed for this class of models. In particular, a rich class of conjugate priors has been developed for covariance

Table 3
Performance assessment of the Gibbs sampler in the simulation example

Burn-in (B)   Average (I)   Time      Relative Error
1000          1000          139.77    0.01748220
2000          1000          209.72    0.01240595
3000          1000          279.52    0.01300910
4000          1000          349.44    0.01142864
4000          3000          489.19    0.01246141
4000          5000          631.21    0.01081264
4000          7000          769.70    0.009244206

graph models, when the underlying graph is decomposable. We have been able to exploit the structure of the conjugate priors to develop a block Gibbs sampler to effectively sample from the posterior distribution. A rigorous proof of convergence is also given. Comparison with other classes of priors is also undertaken. We are able to compute the normalizing constant for homogeneous graphs, thereby making Bayesian model selection possible in a tractable way for this class of models. The Bayesian approach yields additional dividends in the sense that we can now do inference in covariance graph models even when the sample size n is less than the dimension p of the data, which is otherwise not possible in general in the maximum likelihood framework. Furthermore, we thoroughly explore the theoretical properties of our class of conjugate priors. In particular, we establish “covariance” hyper Markov properties, the expectation of the covariance matrix, and Laplace transforms, and investigate posterior linearity. Furthermore, the usefulness of the methodology that is developed is illustrated through examples. A few open problems are worth mentioning.
• What are the necessary conditions for existence of the normalizing constant for decomposable graphs?
• Does the “covariance” hyper Markov property for the class of priors developed in this paper hold for decomposable graphs?
We conclude by noting that the use of the class of Wishart distributions introduced in this paper for Bayesian inference, and a detailed study of Bayes estimators in this context, is clearly an important topic, and is the focus of current research.

Acknowledgments. The authors gratefully acknowledge the faculty at the Stanford Statistics department for their feedback and tremendous enthusiasm for this work.

References.
[1] Athreya, K.B., Doss, H. and Sethuraman, J. (1996). On the convergence of the Markov chain simulation method, Ann. Statist. 24, 69-100.
[2] Butte, A.J., Tamayo, P., Slonim, D., Golub, T.R. and Kohane, I.S. (2000). Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks, Proc. Nat. Acad. Sci. 97, 12182-12186.
[3] Chaudhuri, S., Drton, M. and Richardson, T.S. (2007). Estimation of a covariance matrix with zeroes, Biometrika 94, 199-216.
[4] Cox, D.R. and Wermuth, N. (1993). Linear dependencies represented by chain graphs (with Discussion), Statist. Sci. 8, 204-218, 247-277.
[5] Cox, D.R. and Wermuth, N. (1996). Multivariate Dependencies: Models, Analysis and Interpretation. London: Chapman and Hall.

[6] Daniels, M.J. and Pourahmadi, M. (2002). Bayesian analysis of covariance matrices and dynamic models for longitudinal data, Biometrika 89, 553-566.
[7] Datta, G.S. and Ghosh, M. (1995). Some remarks on noninformative priors, Journal of the American Statistical Association 90, 1357-1363.
[8] Dawid, A.P. and Lauritzen, S.L. (1993). Hyper-Markov laws in the statistical analysis of decomposable graphical models, Ann. Statist. 21, 1272-1317.
[9] Diaconis, P., Khare, K. and Saloff-Coste, L. (2008). Gibbs sampling, exponential families and orthogonal polynomials, Statistical Science 23, 151-178.
[10] Diaconis, P. and Ylvisaker, D. (1979). Conjugate priors for exponential families, Ann. Statist. 7, 269-281.
[11] Drton, M. and Richardson, T.S. (2008). Graphical methods for efficient likelihood inference in Gaussian covariance models, Journal of Machine Learning Research 9, 893-914.
[12] Edwards, D.M. (2000). Introduction to Graphical Modelling, 2nd ed. New York: Springer-Verlag.
[13] Gasch, A.P., Spellman, P.T., Kao, C.M., Carmel-Harel, O., Eisen, M.B., Storz, G., Botstein, D. and Brown, P.O. (2000). Genomic expression programs in the response of yeast cells to environmental changes, Mol. Biol. Cell 11, 4241-4257.
[14] Grone, R., Johnson, C.R., Sá, E.M. and Wolkowicz, H. (1984). Positive definite completions of partial hermitian matrices, Linear Algebra Appl. 58, 109-124.
[15] Grzebyk, M., Wild, P. and Chouaniere, D. (2004). On identification of multifactor models with correlated residuals, Biometrika 91, 141-151.
[16] Huang, J., Liu, N., Pourahmadi, M. and Liu, L. (2006). Covariance matrix selection and estimation via penalised normal likelihood, Biometrika 93, 85-98.
[17] Kauermann, G. (1996). On a dualization of graphical Gaussian models, Scand. J. Statist. 23, 105-116.
[18] Khare, K. and Rajaratnam, B. (2008). Homogeneous graphs and matrix decompositions, working paper, Department of Statistics, Stanford University.
[19] Lauritzen, S.L. (1996). Graphical Models, Oxford University Press Inc., New York.
[20] Letac, G. and Massam, H. (2007). Wishart distributions for decomposable graphs, Ann. Statist. 35, 1278-1323.
[21] Mao, Y., Kschischang, F.R. and Frey, B.J. (2004). Convolutional factor graphs as probabilistic models, In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, Eds. M. Chickering and J. Halpern, pp. 374-381, Arlington, VA: AUAI Press.
[22] Muirhead, R.J. (1982). Aspects of Multivariate Statistical Theory, John Wiley and Sons, New York.
[23] Paulsen, V.I., Power, S.C. and Smith, R.R. (1989). Schur products and matrix completions, J. Funct. Anal. 85, 151-178.
[24] Pourahmadi, M. (2007). Cholesky decompositions and estimation of a covariance matrix: orthogonality of variance-correlation parameters, Biometrika 94, 1006-1013.
[25] Rajaratnam, B., Massam, H. and Carvalho, C. (2008). Flexible covariance estimation in graphical models, Annals of Statistics 36, 2818-2849.
[26] Roverato, A. (2000). Cholesky decomposition of a hyper inverse Wishart matrix, Biometrika 87, 99-112.
[27] Silva, R. and Ghahramani, Z. (2008). The hidden life of latent variables: Bayesian learning with mixed graph models. Preprint, Cambridge University.
[28] Wermuth, N. (1980). Linear recursive equations, covariance selection and path analysis, J. Am. Statist. Assoc. 75, 963-972.
[29] Wermuth, N., Cox, D.R. and Marchetti, G.M. (2006). Covariance chains, Bernoulli 12, 841-862.
[30] Yang, R. and Berger, J.O. (1994). Estimation of a covariance matrix using the reference prior, Annals of Statistics 22, 1195-1211.

Appendix.

Proof of Proposition 1. From the definition of N and L, it is easy to verify that

    (LN)_ii = 1, ∀ 1 ≤ i ≤ m,    (LN)_ij = 0, ∀ 1 ≤ i < j ≤ m.

Now let i>j. It follows by the definition of N that,

    (LN)_ij = Σ_{k=j}^{i} L_ik N_kj
            = N_ij + Σ_{k=j+1}^{i−1} L_ik ( Σ_{τ∈A, τ_1=k, τ_{dim(τ)}=j} (−1)^{dim(τ)−1} L_τ ) + L_ij
            = N_ij − Σ_{k=j+1}^{i−1} Σ_{τ∈A, τ_1=k, τ_{dim(τ)}=j} (−1)^{dim(τ)} L_ik L_τ + L_ij.

Note that any τ′ ∈ A with τ′_1 = i, τ′_{dim(τ′)} = j, dim(τ′) > 2, can be uniquely expressed as τ′ = (i, τ), where j + 1 ≤ τ_1 ≤ i − 1 and τ_{dim(τ)} = j. Recall that by definition L_{τ′} = L_{iτ_1} L_τ. Also, if τ′ ∈ A with τ′_1 = i, τ′_{dim(τ′)} = j, dim(τ′) = 2, then τ′ = (i, j) and L_{τ′} = L_ij. Hence,

    (LN)_ij = N_ij − Σ_{τ′∈A, τ′_1=i, τ′_{dim(τ′)}=j} (−1)^{dim(τ′)−1} L_{τ′}
            = N_ij − N_ij = 0.

Hence LN = I, and thus L^{-1} = N.  □

Proof of Proposition 3. Assume that the vertices have been ordered according to the Hasse perfect vertex elimination scheme in Section 2.4. Suppose Σ ∈ P_G. We wish to prove that L ∈ L_G. Let i* = min_{1≤i≤m} { i : ∃ j < i with (i, j) ∉ E }. Let u > v be such that (u, v) ∉ E, i.e., Σ_uv = 0. Assume that L_ij = 0 ∀ (i, j) ∉ E with j < v. Note that

    Σ_uv = L_uv D_vv + Σ_{i=1}^{v−1} L_ui L_vi D_ii.

Suppose ∃ i, 1 ≤ i ≤ v − 1, such that (u, i) ∈ E and (v, i) ∈ E. This means that i is a descendant or twin of both u and v in the Hasse diagram, which then implies that v is a descendant or twin of u, which is not possible, because (u, v) ∉ E. Hence, using the induction hypothesis, L_ui L_vi = 0 for every 1 ≤ i ≤ v − 1. Hence L_uv = 0. We can therefore inductively prove that L_ij = 0 for all i > j, (i, j) ∉ E.

Conversely, suppose L ∈ L_G. Let u > v be such that (u, v) ∉ E. Then L_uv = 0. Suppose ∃ i, 1 ≤ i ≤ v − 1, such that (u, i) ∈ E and (v, i) ∈ E. This means that i is a descendant or twin of both u and v in the Hasse diagram, which then implies that v is a descendant or twin of u, which is not possible, because (u, v) ∉ E. Since L ∈ L_G, this implies L_ui L_vi = 0 for every 1 ≤ i ≤ v − 1. Hence, Σ_uv = 0. Hence, the first part of the proposition is proved.

For the second part, by Proposition 1, it follows that for u > v, (L^{-1})_uv = 0 ∀ L ∈ L_G if and only if there is no path from u to v in the graph G such that the vertices in the path are strictly decreasing. This happens if and only if v is not a descendant of u in the rooted Hasse tree of G, i.e., (u, v) ∉ E, which implies L_uv = 0. Hence the second part of the proposition is proved.  □

Proof of Theorem 1. Let us simplify the integral by integrating out the terms D_ii, 1 ≤ i ≤ m.

    ∫ exp( −(1/2) [ tr((LDL^T)^{-1} U) + Σ_{i=1}^m α_i log D_ii ] ) dL dD
    = ∫ exp( −(1/2) [ tr( D^{-1} (L^{-1} U (L^{-1})^T) ) + Σ_{i=1}^m α_i log D_ii ] ) dL dD

    = ∫ Π_{i=1}^m exp( −(L^{-1} U (L^{-1})^T)_{ii} / (2 D_ii) ) D_ii^{−α_i/2} dD dL
    = ∫ Π_{i=1}^m [ Γ(α_i/2 − 1) 2^{α_i/2 − 1} / ((L^{-1} U (L^{-1})^T)_{ii})^{α_i/2 − 1} ] dL     (assuming α_i > 2 ∀ i = 1, 2, ··· , m)
    = ∫ Π_{i=1}^m [ Γ(α_i/2 − 1) 2^{α_i/2 − 1} / ((L^{-1})_{i·} U ((L^{-1})_{i·})^T)^{α_i/2 − 1} ] dL.     (∗∗)

In order to simplify this integral we perform a change of measure by transforming the non-zero elements of L to the corresponding elements of L^{-1}. Note the following facts.

1. Let L ∈ L_G. From Proposition 1, for (i, j) ∈ E, i > j,

(8.1)    L^{-1}_{ij} = −L_{ij} + f( (L_{uv})_{(u,v)∈E, j ≤ v < u ≤ i, (u,v) ≠ (i,j)} ),

where f denotes a function of the indicated entries of L. We claim that L is a function of {L^{-1}_{uv}}_{u>v, (u,v)∈E}. Let i* = min{ i : L_{ij} ≠ 0 for some j < i with (i, j) ∈ E }; for this smallest index, every term in f vanishes and (8.1) gives L^{-1}_{i*j} = −L_{i*j}, which establishes the base case of an induction. Now fix i > j with (i, j) ∈ E and suppose the hypothesis is true for all entries L_{uv}, (u, v) ∈ E, appearing on the right hand side of (8.1). Rearranging (8.1) gives

    L_{ij} = −L^{-1}_{ij} + f( (L_{uv})_{(u,v)∈E, j ≤ v < u ≤ i, (u,v) ≠ (i,j)} ),

and by the induction hypothesis, the right hand side of the above equation is a function of {L^{-1}_{uv}}_{u>v, (u,v)∈E}. Hence the matrix L is a function of {L^{-1}_{uv}}_{u>v, (u,v)∈E}. It follows that the transformation

    {L_ij}_{(i,j)∈E, i>j} → {L^{-1}_{ij}}_{(i,j)∈E, i>j}

is a bijection, and the absolute value of the Jacobian of this transformation is 1, as it is the determinant of a lower triangular matrix with diagonal entries 1.

2. If x = (x_1^T, x_2^T)^T and

    U = [ U_11   U_12
          U_21   U_22 ]

is a positive definite matrix, then

(8.2)    x^T U x = z^T z + x_2^T (U_22 − U_21 U_11^{-1} U_12) x_2 ≥ x_2^T (U_22 − U_21 U_11^{-1} U_12) x_2,

where z = U_11^{1/2} x_1 + U_11^{−1/2} U_12 x_2.

Hence, after transforming the non-zero entries of L to the corresponding entries of L^{-1}, and using (8.2) to eliminate the dependent entries of L^{-1} from the integrand, we get

    ∫ exp( −(1/2) [ tr((LDL^T)^{-1} U) + Σ_{i=1}^m α_i log D_ii ] ) dL dD

    = ∫ Π_{i=1}^m [ Γ(α_i/2 − 1) 2^{α_i/2 − 1} / ((L^{-1})_{i·} U ((L^{-1})_{i·})^T)^{α_i/2 − 1} ] dL
    ≤ K Π_{i=2}^m ∫_{R^{|N^≺(i)|}} [ (a_i^T, 1) U_i^* (a_i^T, 1)^T ]^{−(α_i/2 − 1)} da_i.

Here K is a constant, U_i^* is an appropriate positive definite matrix, and a_i represents the independent entries in the i-th row of L^{-1}. By a suitable linear transformation b_i of each of the a_i, i = 2, 3, ··· , m, we get

    ∫ exp( −(1/2) [ tr((LDL^T)^{-1} U) + Σ_{i=1}^m α_i log D_ii ] ) dL dD ≤ K* Π_{i=2}^m ∫_{R^{|N^≺(i)|}} (b_i^T b_i + u_i^{**})^{−(α_i/2 − 1)} db_i.

Here K* and u_i^{**}, i = 2, 3, ··· , m, are constants. Using the standard fact that

    ∫_{R^k} (x^T x + 1)^{−γ} dx < ∞   if γ > k/2,

we conclude that

    ∫ exp( −(1/2) [ tr((LDL^T)^{-1} U) + Σ_{i=1}^m α_i log D_ii ] ) dL dD < ∞

if α_i > |N^≺(i)| + 2 ∀ i = 1, 2, ··· , m.  □

Proof of Theorem 2. We first note, by the formula provided in [9, page 16], that

    ∫_R (1 + x^2)^{−γ} dx = √π Γ(γ − 1/2) / Γ(γ)   if γ > 1/2, and ∞ otherwise.

By repeated application, we can generalize the above formula to

    ∫_{R^d} (x^T x + 1)^{−γ} dx = (√π)^d Γ(γ − d/2) / Γ(γ)   if γ > d/2, and ∞ otherwise.
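As a quick sanity check of the one-dimensional formula, one can compare numerical integration with the closed form for a sample value of γ (here γ = 3); this is purely illustrative.

    import numpy as np
    from scipy import integrate
    from scipy.special import gamma

    # int_R (1 + x^2)^(-3) dx should equal sqrt(pi) * Gamma(5/2) / Gamma(3).
    g = 3.0
    numeric, _ = integrate.quad(lambda x: (1.0 + x * x) ** (-g), -np.inf, np.inf)
    closed = np.sqrt(np.pi) * gamma(g - 0.5) / gamma(g)
    print(np.isclose(numeric, closed))   # True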

Let us now consider an integral of a similar form given by

    ∫_{R^d} [ (x^T, 1) U (x^T, 1)^T ]^{−γ} dx,

where

    U = [ U_0    u
          u^T    w ]

is a positive definite matrix. Consider the linear transformation y = U_0^{1/2} x + U_0^{−1/2} u. It follows that for γ > d/2,

    ∫_{R^d} [ (x^T, 1) U (x^T, 1)^T ]^{−γ} dx = (1 / |U_0|^{1/2}) ∫_{R^d} ( y^T y + w − u^T U_0^{-1} u )^{−γ} dy
        = (√π)^d Γ(γ − d/2) / ( Γ(γ) |U_0|^{1/2} (w − u^T U_0^{-1} u)^{γ − d/2} )
(8.3)   = (√π)^d Γ(γ − d/2) |U_0|^{γ − d/2 − 1/2} / ( Γ(γ) |U|^{γ − d/2} ).

Using arguments as in the proof of Theorem 1 (see (∗∗)),

    z_G(U, α) = ∫ Π_{i=1}^m [ Γ(α_i/2 − 1) 2^{α_i/2 − 1} / ((L^{-1})_{i·} U ((L^{-1})_{i·})^T)^{α_i/2 − 1} ] dL.

Let us use the transformation

    {L_ij}_{(i,j)∈E, i>j} → {L^{-1}_{ij}}_{(i,j)∈E, i>j}.

As demonstrated in the proof of Theorem 1, this transformation is a bijection and the absolute value of its Jacobian is 1. Also, Proposition 3 implies that

    L ∈ L_G ⇔ L^{-1} ∈ L_G.

Hence, L^{-1}_{kl} = 0 for k > l, (k, l) ∉ E. It follows that

    z_G(U, α) = [ Γ(α_1/2 − 1) 2^{α_1/2 − 1} / (U_11)^{α_1/2 − 1} ] Π_{i=2}^m Γ(α_i/2 − 1) 2^{α_i/2 − 1} ∫_{R^{|N^≺(i)|}} [ (x^T, 1) U^i (x^T, 1)^T ]^{−(α_i/2 − 1)} dx.

This implies that z_G(U, α) < ∞ if and only if α_i/2 − 1 > |N^≺(i)|/2 ∀ 1 ≤ i ≤ m. Assuming these conditions are satisfied, we obtain using (8.3) that

    z_G(U, α) = [ Γ(α_1/2 − 1) 2^{α_1/2 − 1} / (U_11)^{α_1/2 − 1} ] Π_{i=2}^m [ Γ(α_i/2 − 1) 2^{α_i/2 − 1} (√π)^{|N^≺(i)|} Γ(α_i/2 − |N^≺(i)|/2 − 1) |U^{≺i}|^{α_i/2 − |N^≺(i)|/2 − 3/2} ] / [ Γ(α_i/2 − 1) |U^i|^{α_i/2 − |N^≺(i)|/2 − 1} ].

It follows after a little simplification that,

    z_G(U, α) = Π_{i=1}^m [ Γ(α_i/2 − |N^≺(i)|/2 − 1) 2^{α_i/2 − 1} (√π)^{|N^≺(i)|} |U^{≺i}|^{α_i/2 − |N^≺(i)|/2 − 3/2} ] / |U^i|^{α_i/2 − |N^≺(i)|/2 − 1},

where it is implicitly assumed that |U^{≺i}| = 1 whenever N^≺(i) = φ.  □
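For completeness, the closed form above is easy to evaluate numerically; the Python sketch below computes log z_G(U, α) under the convention |U^{≺i}| = 1 when N^≺(i) = φ, with nbrs an assumed 0-based encoding of N^≺(i). By Lemma 5, exponentiating log_zG(U + B, alpha, nbrs) − log_zG(U, alpha, nbrs) then gives the Laplace transform L_{U,α}(B).

    import numpy as np
    from scipy.special import gammaln

    def log_zG(U, alpha, nbrs):
        """Hedged sketch of the closed-form normalizing constant, on the log scale."""
        total = 0.0
        for i in range(U.shape[0]):
            prec = list(nbrs[i])
            full = prec + [i]
            p = len(prec)
            logdet_prec = np.linalg.slogdet(U[np.ix_(prec, prec)])[1] if p > 0 else 0.0
            logdet_full = np.linalg.slogdet(U[np.ix_(full, full)])[1]
            total += (gammaln(alpha[i] / 2.0 - p / 2.0 - 1.0)
                      + (alpha[i] / 2.0 - 1.0) * np.log(2.0)
                      + (p / 2.0) * np.log(np.pi)
                      + (alpha[i] / 2.0 - p / 2.0 - 1.5) * logdet_prec
                      - (alpha[i] / 2.0 - p / 2.0 - 1.0) * logdet_full)
        return total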

Proof of Proposition 6. It follows from the proof of Theorem 2 that for (k, i), (k, j) ∈ E, i, j < k,

    E_{U,α}[ |L^{-1}_{ki} L^{-1}_{kj}| D^{-1}_{kk} ] = K ∫_{R^{|N^≺(k)|}} |x_i x_j| [ (x^T, 1) U^k (x^T, 1)^T ]^{−α_k/2} dx.

Here K is a constant. Hence,

    E_{U,α}[ |L^{-1}_{ki} L^{-1}_{kj}| D^{-1}_{kk} ] < ∞   if α_k/2 > |N^≺(k)|/2 + 1.

We observe that L^{-1}_{kk} = 1 ∀ 1 ≤ k ≤ m. Therefore,

    E_{U,α}[ |((LDL^T)^{-1})_{ij}| ] < ∞   if α_k > |N^≺(k)| + 2 ∀ 1 ≤ k ≤ m.

Hence,

    E_{U,α}[Σ^{-1}] < ∞   if α_k > |N^≺(k)| + 2, 1 ≤ k ≤ m.

Note that given any positive definite matrix U_0, ∃ ε > 0 such that if we perturb any entry of U_0 by at most ε, the resulting matrix is still positive definite. Also, π_{U,α}(L, D) is a continuous function of U in this bounded neighborhood of U_0. Since by definition

    z_G(U, α) = ∫ exp( −(1/2) [ tr(Σ^{-1}U) + Σ_{i=1}^m α_i log D_ii ] ) dL dD,

it follows, by combining the facts above to differentiate inside the integral w.r.t. U_ij (with i > j), that

    ∂z_G(U, α)/∂U_ij = −∫ Σ^{-1}_ij exp( −(1/2) [ tr(Σ^{-1}U) + Σ_{i=1}^m α_i log D_ii ] ) dL dD.

This gives

    E_{U,α}[Σ^{-1}_ij] = −(∂/∂U_ij) [ const + Σ_{k=1}^m (α_k/2 − |N^≺(k)|/2 − 3/2) log|U^{≺k}| − Σ_{k=1}^m (α_k/2 − |N^≺(k)|/2 − 1) log|U^k| ]
        = Σ_{k=1}^m (α_k/2 − |N^≺(k)|/2 − 1) (∂/∂U_ij) log|U^k| − Σ_{k=1}^m (α_k/2 − |N^≺(k)|/2 − 3/2) (∂/∂U_ij) log|U^{≺k}|.

Note that, for a matrix A and i>j,

    ∂ log|A| / ∂A_ij = 2 (A^{-1})_ij.

Note that log|U^k| depends on U_ij if and only if i, j ∈ N^≺(k) ∪ {k}, and log|U^{≺k}| depends on U_ij if and only if i, j ∈ N^≺(k). It follows that for i > j,

    E_{U,α}[Σ^{-1}_ij] = Σ_{k=1}^m (α_k − |N^≺(k)| − 2) (((U^k)^{-1})^0)_{ij} − Σ_{k=1}^m (α_k − |N^≺(k)| − 3) (((U^{≺k})^{-1})^0)_{ij}.

Recall that for a matrix A and a subset M of {1, 2, ··· , m},

    (A^0_M)_{ij} = A_ij if i, j ∈ M, and 0 otherwise.

Again, by differentiating inside the integral w.r.t. U_ii, we get that

    E_{U,α}[ Σ^{-1}_ii / 2 ] = −(∂/∂U_ii) log z_G(U, α).

We use the identity

    ∂ log|A| / ∂A_ii = (A^{-1})_ii,

and similarly conclude that

    E_{U,α}[Σ^{-1}_ii] = Σ_{k=1}^m (α_k − |N^≺(k)| − 2) (((U^k)^{-1})^0)_{ii} − Σ_{k=1}^m (α_k − |N^≺(k)| − 3) (((U^{≺k})^{-1})^0)_{ii}.

Hence it follows that

    E_{U,α}[Σ^{-1}] = Σ_{k=1}^m (α_k − |N^≺(k)| − 2) ((U^k)^{-1})^0 − Σ_{k=1}^m (α_k − |N^≺(k)| − 3) ((U^{≺k})^{-1})^0.  □

Proof of Proposition 7. Clearly, the expectation of Σ_ii for i ∈ A_1 can be computed explicitly as shown in Section 6. Assume that the expectations of Σ_ii and Σ^≺_{·i} for i ∈ ∪_{l=1}^{k−1} A_l are given. Now note that for i ∈ A_k,

    E_{U,α}[Σ^≺_{·i}] = E_{U,α}[ Σ^{≺i} (Σ^{≺i})^{-1} Σ^≺_{·i} ]
                      = E_{U,α}[Σ^{≺i}] E_{U,α}[ (Σ^{≺i})^{-1} Σ^≺_{·i} ]
                      = E_{U,α}[Σ^{≺i}] (U^{≺i})^{-1} U^≺_{·i}.

The last two equalities follow from Theorem 4. Note that if j ∈ N^≺(i), then j ∈ ∪_{l=1}^{k−1} A_l, and hence E_{U,α}[Σ^{≺i}] has already been computed by this stage. Similarly,

    E_{U,α}[Σ_ii] = E_{U,α}[ D_ii + (Σ^≺_{·i})^T (Σ^{≺i})^{-1} Σ^≺_{·i} ]
        = (U_ii − (U^≺_{·i})^T (U^{≺i})^{-1} U^≺_{·i}) / (α_i − |N^≺(i)| − 4) + E_{U,α}[ (Σ^≺_{·i})^T (Σ^{≺i})^{-1} Σ^≺_{·i} ],

where E[D_ii] is once more calculated by using the result from Theorem 4 that D_ii has an inverse Gamma distribution. Now note that

    E_{U,α}[ (Σ^≺_{·i})^T (Σ^{≺i})^{-1} Σ^≺_{·i} ]
        = tr( E_{U,α}[ ((Σ^{≺i})^{-1} Σ^≺_{·i})^T Σ^{≺i} ((Σ^{≺i})^{-1} Σ^≺_{·i}) ] )
        = tr( E_{U,α}[ Σ^{≺i} ((Σ^{≺i})^{-1} Σ^≺_{·i}) ((Σ^{≺i})^{-1} Σ^≺_{·i})^T ] )
        = tr( E_{U,α}[Σ^{≺i}] E_{U,α}[ E_{U,α}[ ((Σ^{≺i})^{-1} Σ^≺_{·i}) ((Σ^{≺i})^{-1} Σ^≺_{·i})^T | D_ii ] ] )
        = tr( E_{U,α}[Σ^{≺i}] E_{U,α}[ D_ii (U^{≺i})^{-1} + (U^{≺i})^{-1} U^≺_{·i} (U^≺_{·i})^T (U^{≺i})^{-1} ] )
        = tr( E_{U,α}[Σ^{≺i}] ( (U^{≺i})^{-1} (U_ii − (U^≺_{·i})^T (U^{≺i})^{-1} U^≺_{·i}) / (α_i − |N^≺(i)| − 4) + (U^{≺i})^{-1} U^≺_{·i} (U^≺_{·i})^T (U^{≺i})^{-1} ) ).

Note that we have once more used the joint distribution of (D_ii, (Σ^{≺i})^{-1} Σ^≺_{·i}) as specified in Theorem 4.  □

Proof of Proposition 8. We proceed by induction. Clearly if i = 1, then Σ^{≺1} and Σ^≺_{·1} are vacuous parameters and

    E[Σ_11] = E[D_11] = U_11 / (α_1 − 4) = (1/c) U_11.

Assume now that E[Σ^k] = (1/c) U^k ∀ 1 ≤ k < i. Recall that Σ^{≺i} = Σ^{(i−1)} or Σ^{≺i} has a block diagonal structure with blocks Σ^{i_1}, Σ^{i_2}, ··· , Σ^{i_k} for some k > 1, 1 ≤ i_1, i_2, ··· , i_k < i; in either case the induction hypothesis gives E_{U,α}[Σ^{≺i}] = (1/c) U^{≺i}, and hence, by Proposition 7, E_{U,α}[Σ^≺_{·i}] = (1/c) U^{≺i} (U^{≺i})^{-1} U^≺_{·i} = (1/c) U^≺_{·i}. It also follows from Proposition 7 that

    E_{U,α}[Σ_ii] = (U_ii − (U^≺_{·i})^T (U^{≺i})^{-1} U^≺_{·i}) / (|N^≺(i)| + c)
        + (1/c) tr( [ (U_ii − (U^≺_{·i})^T (U^{≺i})^{-1} U^≺_{·i}) / (α_i − |N^≺(i)| − 4) ] I_{|N^≺(i)|} + U^≺_{·i} (U^≺_{·i})^T (U^{≺i})^{-1} )
        = [ U_ii / (|N^≺(i)| + c) ] (1 + |N^≺(i)|/c) + (U^≺_{·i})^T (U^{≺i})^{-1} U^≺_{·i} ( 1/c − 1/(|N^≺(i)| + c) − |N^≺(i)| / (c (|N^≺(i)| + c)) )
        = (1/c) U_ii.

Hence E_{U,α}[Σ^i] = (1/c) U^i. This implies the proposition.  □

Department of Statistics, Stanford University, Stanford, CA 94305-4065, U.S.A.
E-mail: [email protected]

Department of Statistics, Stanford University, Stanford, CA 94305-4065, U.S.A.
E-mail: [email protected]