Wishart Distributions and Inverse-Wishart Sampling
Stanley Sawyer, Washington University. Version: April 30, 2007.

1. Introduction. The Wishart distribution $W(\Sigma, d, n)$ is a probability distribution of random nonnegative-definite $d \times d$ matrices that is used to model random covariance matrices. The parameter $n$ is the number of degrees of freedom, and $\Sigma$ is a nonnegative-definite symmetric $d \times d$ matrix called the scale matrix. By definition,

$$ W \sim W(\Sigma, d, n) \sim \sum_{i=1}^n X_i X_i', \qquad X_i \sim N(0, \Sigma) \tag{1.1} $$

so that $W \sim W(\Sigma, d, n)$ is the distribution of a sum of $n$ rank-one matrices defined by independent normal $X_i \in \mathbf{R}^d$ with $E(X) = 0$ and $\mathrm{Cov}(X) = \Sigma$. In particular,

$$ E(W) = n\,E(X_i X_i') = n\,\mathrm{Cov}(X_i) = n\Sigma $$

(See multivar.tex for more details.) In general, any $X \sim N(\mu, \Sigma)$ can be represented as

$$ X = \mu + AZ, \quad Z \sim N(0, I_d), \qquad \text{so that} \quad \Sigma = \mathrm{Cov}(X) = A\,\mathrm{Cov}(Z)\,A' = AA' \tag{1.2} $$

The easiest way to find $A$ in terms of $\Sigma$ is the Cholesky decomposition, which finds a unique lower triangular matrix $A$ with $A_{ii} \ge 0$ such that $AA' = \Sigma$. Then by (1.1) and (1.2) with $\mu = 0$,

$$ W(\Sigma, d, n) \sim \sum_{i=1}^n (AZ_i)(AZ_i)' \sim A \Big( \sum_{i=1}^n Z_i Z_i' \Big) A' \sim A\,W(d, n)\,A', \qquad Z_i \sim N(0, I_d), \tag{1.3} $$

where $W(d, n) = W(I_d, d, n)$. In particular, $W(\Sigma, d, n)$ can be easily represented in terms of $W(d, n) = W(I_d, d, n)$.

Assume in the following that $n > d$ and $\Sigma$ is invertible. Then the density of the random $d \times d$ matrix $W$ in (1.1) can be written

$$ f(w; n, \Sigma) = \frac{|w|^{(n-d-1)/2} \exp\!\big({-\tfrac12 \mathrm{tr}(w\Sigma^{-1})}\big)}{2^{dn/2}\,\pi^{d(d-1)/4}\,|\Sigma|^{n/2} \prod_{i=1}^d \Gamma\big((n+1-i)/2\big)} \tag{1.4} $$

where $|w| = \det(w)$, $|\Sigma| = \det(\Sigma)$, and $f(w; n, \Sigma) = 0$ unless $w$ is symmetric and positive definite (Anderson 2003, Section 7.2, page 252).

2. The Inverse-Wishart Conjugate Prior. An important use of the Wishart distribution is as a conjugate prior for multivariate normal sampling. This leads to a $d$-dimensional analog of the inverse-gamma-normal conjugate prior for normal sampling in one dimension.

The likelihood function of $n$ independent observations $X_i \sim N(\mu, \Sigma)$ for a $d \times d$ positive definite matrix $\Sigma$ is

$$ L(\mu, \Sigma; X) = \prod_{i=1}^n \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\!\big({-\tfrac12 (X_i - \mu)'\Sigma^{-1}(X_i - \mu)}\big) = \frac{1}{(2\pi)^{nd/2} |\Sigma|^{n/2}} \exp\!\Big({-\tfrac12 \sum_{i=1}^n (X_i - \mu)'\Sigma^{-1}(X_i - \mu)}\Big) \tag{2.1} $$

The sum in (2.1) can be written

$$ \sum_{i=1}^n \sum_{a=1}^d \sum_{b=1}^d (X_{ia} - \mu_a)(\Sigma^{-1})_{ab}(X_{ib} - \mu_b) = \sum_{a=1}^d \sum_{b=1}^d (\Sigma^{-1})_{ab} \sum_{i=1}^n (X_{ia} - \mu_a)(X_{ib} - \mu_b) = \sum_{a=1}^d \sum_{b=1}^d (\Sigma^{-1})_{ab}\, Q(\mu)_{ab} = \mathrm{tr}\big(\Sigma^{-1} Q(\mu)\big) \tag{2.2} $$

where

$$ Q(\mu) = \sum_{i=1}^n (X_i - \mu)(X_i - \mu)' = \sum_{i=1}^n (X_i - \bar X)(X_i - \bar X)' + n(\bar X - \mu)(\bar X - \mu)' = Q_0 + n\nu\nu', \qquad \nu = \bar X - \mu \tag{2.3} $$

Substituting (2.3) into (2.2) and (2.1) leads to the expression

$$ L(\mu, \Sigma; X) = \frac{\exp\!\big({-\tfrac12 \mathrm{tr}(Q_0 \Sigma^{-1})}\big) \exp\!\big({-\tfrac12 n\,\nu'\Sigma^{-1}\nu}\big)}{(2\pi)^{nd/2} |\Sigma|^{n/2}} = C_{nd}\,|\Sigma^{-1}|^{(n-1)/2} \exp\!\big({-\tfrac12 \mathrm{tr}(Q_0 \Sigma^{-1})}\big) \times \frac{1}{\sqrt{2\pi|\Sigma|}} \exp\!\big({-\tfrac12 n (\mu - \bar X)'\Sigma^{-1}(\mu - \bar X)}\big) \tag{2.4} $$

where

$$ Q_0 = \sum_{i=1}^n (X_i - \bar X)(X_i - \bar X)' \tag{2.5} $$

Note that the integral $\int L(\mu, \Sigma; X)\,d\mu$ of (2.4) is the same as the Wishart density (1.4) with $\Sigma^{-1}$ replaced by $w$ and $Q_0$ in (2.4) replaced by $\Sigma^{-1}$ in (1.4), so that $Q_0\Sigma^{-1}$ is replaced by $w\Sigma^{-1}$, and with $n$ replaced by $n-d$, within multiplicative constants that depend only on $n$, $d$, and $X$.
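As a concrete illustration of the representation (1.1)-(1.3), here is a minimal NumPy sketch (the function name and the sanity check are illustrative additions, not part of the original notes). It draws $W \sim W(\Sigma, d, n)$ directly from the definition, using the Cholesky factor $A$ with $AA' = \Sigma$, and checks that the average of many draws is close to $n\Sigma$.

```python
import numpy as np

def wishart_by_definition(Sigma, n, rng=None):
    """Draw W ~ W(Sigma, d, n) from definition (1.1): a sum of n rank-one
    matrices X_i X_i' with X_i ~ N(0, Sigma), using the Cholesky factor
    A (Sigma = A A') as in (1.2)-(1.3)."""
    rng = np.random.default_rng() if rng is None else rng
    d = Sigma.shape[0]
    A = np.linalg.cholesky(Sigma)      # lower triangular, A A' = Sigma
    Z = rng.standard_normal((n, d))    # n rows of Z_i ~ N(0, I_d)
    X = Z @ A.T                        # X_i = A Z_i, so Cov(X_i) = Sigma
    return X.T @ X                     # sum over i of X_i X_i'

# Sanity check: E(W) = n * Sigma, so an average of many draws should be close.
rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
n = 10
W_bar = sum(wishart_by_definition(Sigma, n, rng) for _ in range(2000)) / 2000
print(np.round(W_bar / n, 2))   # should be close to Sigma
```

This is the "basic definition" method, which costs on the order of $nd^2$ operations per draw; Section 3 below gives a faster construction.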
The similarity in form between (1.4) and the first factor in (2.4) suggests that we might be able to sample from the density $L(\mu, \Sigma; X)$ in (2.4) by generating random variables by

$$ \text{(i)} \quad W \sim W(Q_0^{-1}, d, n) \tag{2.6} $$
$$ \text{(ii)} \quad \Sigma = W^{-1} $$
$$ \text{(iii)} \quad \mu = \bar X + (A/\sqrt{n})\,Z, \qquad \Sigma = AA', \quad Z \sim N(0, I_d) $$

One subtlety is that the density of $\Sigma = W^{-1}$ in (2.6) will not be $f(\Sigma^{-1}; n, Q_0^{-1})$ for $f(w; n, \Sigma)$ in (1.4), or at least will not be this density with respect to Lebesgue measure $d\Sigma$ in $\mathbf{R}^{d^2}$. In general, for any function $\phi(y) \ge 0$,

$$ E\big(\phi(\Sigma)\big) = E\big(\phi(W^{-1})\big) = \int \phi(y^{-1})\, f(y; n, Q_0^{-1})\,dy = \int \phi(y)\, f(y^{-1}; n, Q_0^{-1})\,d(y^{-1}) = \int \phi(y)\, f(y^{-1}; n, Q_0^{-1})\,J_y(y^{-1})\,dy $$

where $J_y(y^{-1})$ is the absolute value of the Jacobian determinant of the mapping $y \to y^{-1}$. Among all invertible matrices, a space of dimension $d^2$, the Jacobian is $J_y(y^{-1}) = |y|^{-2d}$ (see Theorem 4.1 below). However, $f(w; n, \Sigma)$ in (1.4) is derived using the "Bartlett decomposition" (see Section 3 below) to parametrize positive definite symmetric matrices by a flat space of dimension $d(d+1)/2$. Anderson (2003) states $J_y(y^{-1}) = |y|^{-d-1}$ for the mapping $y \to y^{-1}$ restricted to symmetric matrices, but refers only to a theorem in his Appendix (Theorem A.4.6) that has only the $d^2$-dimensional result.

In any event, substituting $J_y(y^{-1}) = |y|^{-d-1}$ above leads to

$$ E\big(\phi(\Sigma)\big) = \int \phi(y)\,|y|^{-d-1}\, f(y^{-1}; n, Q_0^{-1})\,dy $$

Thus by (1.4) the joint density of $(\mu, \Sigma)$ generated by (2.6) is

$$ g(\mu, \Sigma) = C\,|\Sigma|^{-(n+d+1)/2} \exp\!\big({-\tfrac12 \mathrm{tr}(\Sigma^{-1} Q_0)}\big) \times \frac{1}{\sqrt{2\pi|\Sigma|}} \exp\!\big({-\tfrac12 n(\mu - \bar X)'\Sigma^{-1}(\mu - \bar X)}\big) \tag{2.7} $$

The first factor in (2.7) is called the inverse-Wishart distribution by Anderson (2003, Theorem 7.7.1).

Gelman et al. (2003) define the four-parameter inverse-Wishart-normal density for $(\mu, \Sigma)$ as the density generated by

$$ \Sigma^{-1} \sim W(\Lambda_0^{-1}, d, \nu_0) \tag{2.8} $$
$$ \mu \mid \Sigma \sim N(\mu_0, \Sigma/\kappa_0) $$

where $\kappa_0, \nu_0$ are positive numbers, $\mu_0 \in \mathbf{R}^d$, and $\Lambda_0$ is a $d \times d$ positive definite matrix. They also state the density

$$ p(\mu, \Sigma) = |\Sigma|^{-(\nu_0+d+1)/2} \exp\!\big({-\tfrac12 \mathrm{tr}(\Lambda_0 \Sigma^{-1})}\big) \times |\Sigma|^{-1/2} \exp\!\big({-\tfrac{\kappa_0}{2}(\mu - \mu_0)'\Sigma^{-1}(\mu - \mu_0)}\big) \tag{2.9} $$

for (2.8), which is the same as (2.7) with $\nu_0 = n$, $\Lambda_0 = Q_0$, $\kappa_0 = n$, and $\mu_0 = \bar X$. Gelman et al. (2003) say that updating (2.8) or (2.9) with respect to an independent multivariate normal sample $X_1, X_2, \ldots, X_n$ with distribution $N(\mu, \Sigma)$ preserves the distribution with $\mu_0$ etc. replaced by

$$ \mu_n = \frac{n}{\kappa_0 + n}\,\bar X + \frac{\kappa_0\,\mu_0}{\kappa_0 + n} \tag{2.10} $$
$$ \kappa_n = \kappa_0 + n, \qquad \nu_n = \nu_0 + n $$
$$ \Lambda_n = \Lambda_0 + Q_0 + \frac{\kappa_0 n}{\kappa_0 + n}\,(\bar X - \mu_0)(\bar X - \mu_0)' $$

for $Q_0$ in (2.5). Since $\nu_0$ appears in (2.10) only in the additive update $\nu_n = \nu_0 + n$, the initial power of $\Sigma$ in the density does not matter. Gelman et al. also discuss using a Jeffreys prior

$$ \pi_0(\mu, \Sigma) = C/|\Sigma|^{(d+1)/2} $$

This would amount to increasing $n$ by $d+1$ in (2.6) or (2.7), or $\nu_0$ by $d+1$ in (2.9). It would at least guarantee that the random matrices $W$ in (2.6) are always invertible.
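To make steps (i)-(iii) of (2.6) concrete, here is a hedged Python sketch assuming SciPy's scipy.stats.wishart, which is parametrized by a degrees-of-freedom argument df and a scale matrix; the function name sample_mu_sigma is illustrative, not from the notes.

```python
import numpy as np
from scipy.stats import wishart

def sample_mu_sigma(X, rng=None):
    """One draw of (mu, Sigma) following steps (i)-(iii) of (2.6):
    (i) W ~ W(Q0^{-1}, d, n), (ii) Sigma = W^{-1},
    (iii) mu = Xbar + (A / sqrt(n)) Z with Sigma = A A', Z ~ N(0, I_d).
    X is an (n, d) array of observations; assumes n > d."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    Xbar = X.mean(axis=0)
    Q0 = (X - Xbar).T @ (X - Xbar)                # scatter matrix (2.5)
    W = wishart.rvs(df=n, scale=np.linalg.inv(Q0), random_state=rng)  # (i)
    Sigma = np.linalg.inv(W)                      # (ii)
    A = np.linalg.cholesky(Sigma)                 # Sigma = A A'
    mu = Xbar + (A @ rng.standard_normal(d)) / np.sqrt(n)             # (iii)
    return mu, Sigma
```

Steps (i)-(ii) could presumably also be collapsed into a single call to scipy.stats.invwishart with scale Q0, which samples from the inverse-Wishart distribution directly.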
3. A Fast Way to Generate Wishart-Distributed Random Variables. Suppose that we want to estimate parameters in a model with independent multivariate normal variables. Bayesian methods based on Gibbs sampling using (2.4) and (2.6) or (2.8) depend on being able to simulate Wishart-distributed random matrices in an efficient manner.

If we simulate $W \sim W(\Sigma, d, n)$ using the basic definition (1.1)-(1.3), then we have to generate $nd$ independent standard normal random variables and use on the order of $nd^2$ operations for each simulated value of $W$. Odell and Feiveson (1966) (referenced in Liu, 2001) developed a way to simulate $W$ from only $O(d^2)$ random variates, which is a considerable improvement in time if $n \gg d$.

Theorem 3.1 (Odell and Feiveson, 1966). Suppose that $V_i$ ($1 \le i \le d$) are independent random variables where $V_i$ has a chi-square distribution with $n - i + 1$ degrees of freedom (so that $n - d + 1 \le n - i + 1 \le n$). Suppose that $N_{ij}$ are independent normal random variables with mean zero and variance one for $1 \le i < j \le d$, also independent of the $V_i$. Define random variables $b_{ij}$ for $1 \le i, j \le d$ by $b_{ji} = b_{ij}$ for $1 \le i < j \le d$ and

$$ b_{ii} = V_i + \sum_{r=1}^{i-1} N_{ri}^2, \qquad 1 \le i \le d \tag{3.1} $$
$$ b_{ij} = N_{ij}\sqrt{V_i} + \sum_{r=1}^{i-1} N_{ri} N_{rj}, \qquad i < j \le d $$

Then $B = \{b_{ij}\}$ has a Wishart distribution $W(d, n) = W(I_d, d, n)$.

Remarks. (1) We assume that empty sums in (3.1) are zero, so that (3.1) implies $b_{11} = V_1$ and $b_{1j} = N_{1j}\sqrt{V_1}$. Note that each diagonal entry $b_{ii}$ is individually chi-square with $n$ degrees of freedom. Odell and Feiveson state the theorem with $n-1$ in place of $n$, so that $V_i$ is chi-squared with $n - i$ degrees of freedom (ranging from $n - d$ to $n - 1$) and the conclusion is that $B \sim W(d, n-1)$.
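The construction (3.1) is equivalent to forming $B = TT'$ for a lower triangular $T$ with $T_{ii} = \sqrt{V_i}$ and $T_{ji} = N_{ij}$ for $i < j$ (the Bartlett decomposition mentioned in Section 2). Below is a minimal NumPy sketch along those lines; the function name is an illustrative choice, and the final comment applies the rescaling (1.3).

```python
import numpy as np

def wishart_odell_feiveson(d, n, rng=None):
    """Draw B ~ W(I_d, d, n) via Theorem 3.1 (Odell and Feiveson, 1966),
    implemented as the Bartlett decomposition B = T T', where T is lower
    triangular with T[i,i] = sqrt(V_i), V_i ~ chi^2(n - i + 1) (1-based i),
    and independent N(0,1) entries below the diagonal. Assumes n > d."""
    rng = np.random.default_rng() if rng is None else rng
    T = np.zeros((d, d))
    # Diagonal: square roots of chi-squares with n, n-1, ..., n-d+1 df.
    df = n - np.arange(d)                  # n - i + 1 for i = 1, ..., d
    T[np.diag_indices(d)] = np.sqrt(rng.chisquare(df))
    # Strictly lower triangle: d(d-1)/2 independent standard normals N_ij.
    T[np.tril_indices(d, k=-1)] = rng.standard_normal(d * (d - 1) // 2)
    return T @ T.T                         # entries agree with (3.1)

# For general W(Sigma, d, n), rescale with the Cholesky factor as in (1.3):
# A = np.linalg.cholesky(Sigma); W = A @ wishart_odell_feiveson(d, n) @ A.T
```

Expanding $(TT')_{ij} = \sum_k T_{ik}T_{jk}$ reproduces exactly the sums in (3.1), so the draw uses $d$ chi-square and $d(d-1)/2$ normal variates rather than $nd$ normals.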