
Conjugate Bayesian analysis of the Gaussian distribution

Kevin P. Murphy∗ [email protected]

Last updated October 3, 2007

1 Introduction

The Gaussian (or normal) distribution is one of the most widely used distributions in statistics and machine learning, and estimating its parameters using Bayesian inference with conjugate priors is correspondingly common. The use of conjugate priors allows all the results to be derived in closed form. Unfortunately, different books use different conventions for how to parameterize the various distributions (e.g., put the prior on the precision or on the variance, use an inverse gamma or an inverse chi-squared, etc.), which can be very confusing for the student. In this report, we summarize all of the most commonly used forms. We provide detailed derivations for some of these results; the rest can be obtained by simple reparameterization. See the appendix for the definitions of the distributions that are used.

2 Normal prior

Let us consider Bayesian estimation of the mean of a Gaussian whose variance is assumed to be known. (We discuss the unknown-variance case later.)

2.1 Likelihood

Let $D = (x_1, \ldots, x_n)$ be the data. The likelihood is

$$p(D \mid \mu, \sigma^2) = \prod_{i=1}^n p(x_i \mid \mu, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right\} \qquad (1)$$

Let us define the empirical mean and variance

$$\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i \qquad (2)$$
$$s^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 \qquad (3)$$

(Note that other authors (e.g., [GCSR04]) define $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$.) We can rewrite the term in the exponent as follows

$$\sum_i (x_i - \mu)^2 = \sum_i [(x_i - \bar{x}) - (\mu - \bar{x})]^2 \qquad (4)$$
$$= \sum_i (x_i - \bar{x})^2 + \sum_i (\mu - \bar{x})^2 - 2\sum_i (x_i - \bar{x})(\mu - \bar{x}) \qquad (5)$$
$$= n s^2 + n(\bar{x} - \mu)^2 \qquad (6)$$

since

$$\sum_i (x_i - \bar{x})(\mu - \bar{x}) = (\mu - \bar{x})\left( \Big(\sum_i x_i\Big) - n\bar{x} \right) = (\mu - \bar{x})(n\bar{x} - n\bar{x}) = 0 \qquad (7)$$

∗ Thanks to Hoyt Koepke for proof reading.

Hence

$$p(D \mid \mu, \sigma^2) = \frac{1}{(2\pi)^{n/2}} \frac{1}{\sigma^n} \exp\left\{ -\frac{1}{2\sigma^2}\left[ n s^2 + n(\bar{x} - \mu)^2 \right] \right\} \qquad (8)$$
$$\propto \left( \frac{1}{\sigma^2} \right)^{n/2} \exp\left\{ -\frac{n}{2\sigma^2}(\bar{x} - \mu)^2 \right\} \exp\left\{ -\frac{n s^2}{2\sigma^2} \right\} \qquad (9)$$

If $\sigma^2$ is a constant, we can write this as

$$p(D \mid \mu) \propto \exp\left\{ -\frac{n}{2\sigma^2}(\bar{x} - \mu)^2 \right\} \propto \mathcal{N}(\bar{x} \mid \mu, \tfrac{\sigma^2}{n}) \qquad (10)$$

since we are free to drop constant factors in the definition of the likelihood. Thus $n$ observations with variance $\sigma^2$ and mean $\bar{x}$ are equivalent to one observation $x_1 = \bar{x}$ with variance $\sigma^2/n$.

2.2 Prior

Since the likelihood has the form

$$p(D \mid \mu) \propto \exp\left\{ -\frac{n}{2\sigma^2}(\bar{x} - \mu)^2 \right\} \propto \mathcal{N}(\bar{x} \mid \mu, \tfrac{\sigma^2}{n}) \qquad (11)$$

the natural conjugate prior has the form

$$p(\mu) \propto \exp\left\{ -\frac{1}{2\sigma_0^2}(\mu - \mu_0)^2 \right\} \propto \mathcal{N}(\mu \mid \mu_0, \sigma_0^2) \qquad (12)$$

(Do not confuse $\sigma_0^2$, which is the variance of the prior, with $\sigma^2$, which is the variance of the observation noise.) (A natural conjugate prior is one that has the same form as the likelihood.)

2.3 Posterior

Hence the posterior is given by

$$p(\mu \mid D) \propto p(D \mid \mu, \sigma)\, p(\mu \mid \mu_0, \sigma_0^2) \qquad (13)$$
$$\propto \exp\left\{ -\frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2 \right\} \times \exp\left\{ -\frac{1}{2\sigma_0^2}(\mu - \mu_0)^2 \right\} \qquad (14)$$
$$= \exp\left\{ \frac{-1}{2\sigma^2}\sum_i (x_i^2 + \mu^2 - 2 x_i \mu) + \frac{-1}{2\sigma_0^2}(\mu^2 + \mu_0^2 - 2\mu_0\mu) \right\} \qquad (15)$$

Since the product of two Gaussians is a Gaussian, we will rewrite this in the form

$$p(\mu \mid D) \propto \exp\left\{ -\frac{\mu^2}{2}\left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} \right) + \mu\left( \frac{\mu_0}{\sigma_0^2} + \frac{\sum_i x_i}{\sigma^2} \right) - \left( \frac{\mu_0^2}{2\sigma_0^2} + \frac{\sum_i x_i^2}{2\sigma^2} \right) \right\} \qquad (16)$$
$$\stackrel{\text{def}}{=} \exp\left\{ -\frac{1}{2\sigma_n^2}(\mu^2 - 2\mu\mu_n + \mu_n^2) \right\} = \exp\left\{ -\frac{1}{2\sigma_n^2}(\mu - \mu_n)^2 \right\} \qquad (17)$$

Matching coefficients of $\mu^2$, we find $\sigma_n^2$ is given by

$$-\frac{\mu^2}{2\sigma_n^2} = -\frac{\mu^2}{2}\left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} \right) \qquad (18)$$
$$\frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} \qquad (19)$$
$$\sigma_n^2 = \frac{\sigma^2\sigma_0^2}{n\sigma_0^2 + \sigma^2} = \frac{1}{\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}} \qquad (20)$$

[Figure 1 about here: posterior of $\mu$ after $N = 0, 1, 2, 10$ observations.]

Figure 1: Sequentially updating a Gaussian mean starting with a prior centered on $\mu_0 = 0$. The true parameters are $\mu^* = 0.8$ (unknown) and $(\sigma^2)^* = 0.1$ (known). Notice how the data quickly overwhelm the prior, and how the posterior becomes narrower. Source: Figure 2.12 of [Bis06].

Matching coefficients of µ we get

$$\frac{2\mu\mu_n}{2\sigma_n^2} = \mu\left( \frac{\sum_{i=1}^n x_i}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \right) \qquad (21)$$
$$\frac{\mu_n}{\sigma_n^2} = \frac{\sum_{i=1}^n x_i}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \qquad (22)$$
$$= \frac{\sigma_0^2 n\bar{x} + \sigma^2\mu_0}{\sigma^2\sigma_0^2} \qquad (23)$$

Hence

$$\mu_n = \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\bar{x} = \sigma_n^2\left( \frac{n\bar{x}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \right) \qquad (24)$$

This operation of matching first and second powers of $\mu$ is called completing the square. Another way to understand these results is to work with the precision of a Gaussian, which is 1/variance (high precision means low variance, low precision means high variance). Let

$$\lambda = 1/\sigma^2 \qquad (25)$$
$$\lambda_0 = 1/\sigma_0^2 \qquad (26)$$
$$\lambda_n = 1/\sigma_n^2 \qquad (27)$$

Then we can rewrite the posterior as

$$p(\mu \mid D, \lambda) = \mathcal{N}(\mu \mid \mu_n, \lambda_n^{-1}) \qquad (28)$$
$$\lambda_n = \lambda_0 + n\lambda \qquad (29)$$
$$\mu_n = \frac{n\lambda\bar{x} + \lambda_0\mu_0}{\lambda_n} = w\mu_{ML} + (1 - w)\mu_0 \qquad (30)$$
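As a concrete illustration (this example is not part of the original report; the data and hyper-parameter values are invented), Equations 20, 24 and 29–30 can be checked numerically with a few lines of Python/NumPy:

import numpy as np

def posterior_mean_variance(x, sigma2, mu0, sigma02):
    """Posterior of mu for a Gaussian with known variance sigma2,
    under the prior N(mu | mu0, sigma02).  Implements Eqs. 20 and 24."""
    n = len(x)
    xbar = np.mean(x)
    sigma_n2 = 1.0 / (n / sigma2 + 1.0 / sigma02)          # Eq. 20
    mu_n = sigma_n2 * (n * xbar / sigma2 + mu0 / sigma02)  # Eq. 24
    return mu_n, sigma_n2

# illustrative data: 10 points from N(0.8, 0.1), prior N(0, 1)
rng = np.random.default_rng(0)
x = rng.normal(0.8, np.sqrt(0.1), size=10)
mu_n, sigma_n2 = posterior_mean_variance(x, sigma2=0.1, mu0=0.0, sigma02=1.0)

# same answer in precision form (Eqs. 29-30)
lam, lam0 = 1 / 0.1, 1 / 1.0
lam_n = lam0 + len(x) * lam
assert np.isclose(sigma_n2, 1 / lam_n)
print(mu_n, sigma_n2)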


In Equation 30, $n\bar{x} = \sum_{i=1}^n x_i$ and $w = \frac{n\lambda}{\lambda_n}$. The precision of the posterior, $\lambda_n$, is the precision of the prior, $\lambda_0$, plus one contribution of data precision, $\lambda$, for each observed data point. Also, we see the mean of the posterior is a convex combination of the prior and the MLE, with weights proportional to the relative precisions.

To gain further insight into these equations, consider the effect of sequentially updating our estimate of $\mu$ (see Figure 1). After observing one data point $x$ (so $n = 1$), we have the following posterior mean

$$\mu_1 = \frac{\sigma^2}{\sigma^2 + \sigma_0^2}\mu_0 + \frac{\sigma_0^2}{\sigma^2 + \sigma_0^2}x \qquad (31)$$
$$= \mu_0 + (x - \mu_0)\frac{\sigma_0^2}{\sigma^2 + \sigma_0^2} \qquad (32)$$
$$= x - (x - \mu_0)\frac{\sigma^2}{\sigma^2 + \sigma_0^2} \qquad (33)$$

The first equation is a convex combination of the prior and MLE. The second equation is the prior mean adjusted towards the data $x$. The third equation is the data $x$ adjusted towards the prior mean; this is called shrinkage. These are all equivalent ways of expressing the tradeoff between likelihood and prior. See Figure 2 for an example.

Figure 2: Bayesian estimation of the mean of a Gaussian from one sample. (a) Weak prior $\mathcal{N}(0, 10)$. (b) Strong prior $\mathcal{N}(0, 1)$. In the latter case, we see the posterior mean is "shrunk" towards the prior mean, which is 0. Figure produced by gaussBayesDemo.

2.4 Posterior predictive

The posterior predictive is given by

$$p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu \qquad (34)$$
$$= \int \mathcal{N}(x \mid \mu, \sigma^2)\, \mathcal{N}(\mu \mid \mu_n, \sigma_n^2)\, d\mu \qquad (35)$$
$$= \mathcal{N}(x \mid \mu_n, \sigma_n^2 + \sigma^2) \qquad (36)$$

This follows from general properties of the Gaussian distribution (see Equation 2.115 of [Bis06]). An alternative proof is to note that

$$x = (x - \mu) + \mu \qquad (37)$$
$$x - \mu \sim \mathcal{N}(0, \sigma^2) \qquad (38)$$
$$\mu \sim \mathcal{N}(\mu_n, \sigma_n^2) \qquad (39)$$

Since $E[X_1 + X_2] = E[X_1] + E[X_2]$ and $\mathrm{Var}[X_1 + X_2] = \mathrm{Var}[X_1] + \mathrm{Var}[X_2]$ if $X_1, X_2$ are independent, we have

$$X \sim \mathcal{N}(\mu_n, \sigma_n^2 + \sigma^2) \qquad (40)$$

since we assume that the residual error is conditionally independent of the parameter. Thus the predictive variance is the uncertainty due to the observation noise, $\sigma^2$, plus the uncertainty due to the parameters, $\sigma_n^2$.

2.5 Marginal likelihood

Writing $m = \mu_0$ and $\tau^2 = \sigma_0^2$ for the hyper-parameters, we can derive the marginal likelihood as follows:

$$\ell = p(D \mid m, \sigma^2, \tau^2) = \int \left[ \prod_{i=1}^n \mathcal{N}(x_i \mid \mu, \sigma^2) \right] \mathcal{N}(\mu \mid m, \tau^2)\, d\mu \qquad (41)$$
$$= \frac{\sigma}{(\sqrt{2\pi}\sigma)^n\sqrt{n\tau^2 + \sigma^2}} \exp\left( -\frac{\sum_i x_i^2}{2\sigma^2} - \frac{m^2}{2\tau^2} \right) \exp\left( \frac{\frac{\tau^2 n^2\bar{x}^2}{\sigma^2} + \frac{\sigma^2 m^2}{\tau^2} + 2n\bar{x}m}{2(n\tau^2 + \sigma^2)} \right) \qquad (42)$$

The proof is below, based on the appendix of [DMP+06]. We have

$$\ell = p(D \mid m, \sigma^2, \tau^2) = \int \left[ \prod_{i=1}^n \mathcal{N}(x_i \mid \mu, \sigma^2) \right] \mathcal{N}(\mu \mid m, \tau^2)\, d\mu \qquad (43)$$
$$= \frac{1}{(\sigma\sqrt{2\pi})^n}\frac{1}{\tau\sqrt{2\pi}} \int \exp\left( -\frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2 - \frac{1}{2\tau^2}(\mu - m)^2 \right) d\mu \qquad (44)$$

Let us define $S^2 = 1/\sigma^2$ and $T^2 = 1/\tau^2$. Then

$$\ell = \frac{1}{(\sqrt{2\pi}/S)^n(\sqrt{2\pi}/T)} \int \exp\left( -\frac{S^2}{2}\Big( \sum_i x_i^2 + n\mu^2 - 2\mu\sum_i x_i \Big) - \frac{T^2}{2}\big( \mu^2 + m^2 - 2\mu m \big) \right) d\mu \qquad (45)$$
$$= c \int \exp\left( -\tfrac{1}{2}\big( S^2 n\mu^2 - 2S^2\mu\sum_i x_i + T^2\mu^2 - 2T^2\mu m \big) \right) d\mu \qquad (46)$$

where

$$c = \frac{\exp\left( -\tfrac{1}{2}( S^2\sum_i x_i^2 + T^2 m^2 ) \right)}{(\sqrt{2\pi}/S)^n(\sqrt{2\pi}/T)} \qquad (47)$$

So

$$\ell = c \int \exp\left( -\tfrac{1}{2}(S^2 n + T^2)\left[ \mu^2 - 2\mu\frac{S^2\sum_i x_i + T^2 m}{S^2 n + T^2} \right] \right) d\mu \qquad (48)$$
$$= c \exp\left( \frac{(S^2 n\bar{x} + T^2 m)^2}{2(S^2 n + T^2)} \right) \int \exp\left( -\tfrac{1}{2}(S^2 n + T^2)\left[ \mu - \frac{S^2 n\bar{x} + T^2 m}{S^2 n + T^2} \right]^2 \right) d\mu \qquad (49)$$
$$= c \exp\left( \frac{(S^2 n\bar{x} + T^2 m)^2}{2(S^2 n + T^2)} \right) \frac{\sqrt{2\pi}}{\sqrt{S^2 n + T^2}} \qquad (50)$$
$$= \frac{\exp\left( -\tfrac{1}{2}( S^2\sum_i x_i^2 + T^2 m^2 ) \right)}{(\sqrt{2\pi}/S)^n(\sqrt{2\pi}/T)} \exp\left( \frac{(S^2 n\bar{x} + T^2 m)^2}{2(S^2 n + T^2)} \right) \frac{\sqrt{2\pi}}{\sqrt{S^2 n + T^2}} \qquad (51)$$

Now

$$\frac{1}{\sqrt{2\pi}/T}\,\frac{\sqrt{2\pi}}{\sqrt{S^2 n + T^2}} = \frac{\sigma}{\sqrt{n\tau^2 + \sigma^2}} \qquad (52)$$

and

$$\frac{\left( \frac{n\bar{x}}{\sigma^2} + \frac{m}{\tau^2} \right)^2}{2\left( \frac{n}{\sigma^2} + \frac{1}{\tau^2} \right)} = \frac{(n\bar{x}\tau^2 + m\sigma^2)^2}{2\sigma^2\tau^2(n\tau^2 + \sigma^2)} \qquad (53)$$
$$= \frac{n^2\bar{x}^2\tau^2/\sigma^2 + \sigma^2 m^2/\tau^2 + 2n\bar{x}m}{2(n\tau^2 + \sigma^2)} \qquad (54)$$

So

$$p(D) = \frac{\sigma}{(\sqrt{2\pi}\sigma)^n\sqrt{n\tau^2 + \sigma^2}} \exp\left( -\frac{\sum_i x_i^2}{2\sigma^2} - \frac{m^2}{2\tau^2} \right) \exp\left( \frac{\frac{\tau^2 n^2\bar{x}^2}{\sigma^2} + \frac{\sigma^2 m^2}{\tau^2} + 2n\bar{x}m}{2(n\tau^2 + \sigma^2)} \right) \qquad (55)$$

To check this, we should ensure that we get

$$p(x \mid D) = \frac{p(x, D)}{p(D)} = \mathcal{N}(x \mid \mu_n, \sigma_n^2 + \sigma^2) \qquad (56)$$

(To be completed)

2.6 Conditional prior $p(\mu \mid \sigma^2)$

Note that the previous prior is not, strictly speaking, conjugate, since it has the form $p(\mu)$ whereas the posterior has the form $p(\mu \mid D, \sigma)$, i.e., $\sigma$ occurs in the posterior but not the prior. We can rewrite the prior in conditional form as follows

$$p(\mu \mid \sigma) = \mathcal{N}(\mu \mid \mu_0, \sigma^2/\kappa_0) \qquad (57)$$

This means that if $\sigma^2$ is large, the variance on the prior of $\mu$ is also large. This is reasonable since $\sigma^2$ defines the measurement scale of $x$, so the prior belief about $\mu$ is equivalent to $\kappa_0$ observations of $\mu_0$ on this scale. (Hence a noninformative prior is $\kappa_0 = 0$.) Then the posterior is

$$p(\mu \mid D) = \mathcal{N}(\mu \mid \mu_n, \sigma^2/\kappa_n) \qquad (58)$$

where $\kappa_n = \kappa_0 + n$. In this form, it is clear that $\kappa_0$ plays a role analogous to $n$. Hence $\kappa_0$ is the equivalent sample size of the prior.

2.7 Reference analysis

To get an uninformative prior, we just set the prior variance to infinity to simulate a uniform prior on $\mu$.

$$p(\mu) \propto 1 = \mathcal{N}(\mu \mid \cdot, \infty) \qquad (59)$$
$$p(\mu \mid D) = \mathcal{N}(\mu \mid \bar{x}, \sigma^2/n) \qquad (60)$$

3 Normal-Gamma prior

We will now suppose that both the mean $\mu$ and the precision $\lambda = \sigma^{-2}$ are unknown. We will mostly follow the notation in [DeG70, p169].

3.1 Likelihood

The likelihood can be written in this form

$$p(D \mid \mu, \lambda) = \frac{1}{(2\pi)^{n/2}}\lambda^{n/2} \exp\left( -\frac{\lambda}{2}\sum_{i=1}^n (x_i - \mu)^2 \right) \qquad (61)$$
$$= \frac{1}{(2\pi)^{n/2}}\lambda^{n/2} \exp\left( -\frac{\lambda}{2}\left[ n(\mu - \bar{x})^2 + \sum_{i=1}^n (x_i - \bar{x})^2 \right] \right) \qquad (62)$$

3.2 Prior

The conjugate prior is the normal-Gamma:

$$NG(\mu, \lambda \mid \mu_0, \kappa_0, \alpha_0, \beta_0) \stackrel{\text{def}}{=} \mathcal{N}(\mu \mid \mu_0, (\kappa_0\lambda)^{-1})\, \mathrm{Ga}(\lambda \mid \alpha_0, \text{rate} = \beta_0) \qquad (63)$$
$$= \frac{1}{Z_{NG}(\mu_0, \kappa_0, \alpha_0, \beta_0)}\, \lambda^{\frac{1}{2}} \exp\left( -\frac{\kappa_0\lambda}{2}(\mu - \mu_0)^2 \right) \lambda^{\alpha_0 - 1} e^{-\lambda\beta_0} \qquad (64)$$
$$= \frac{1}{Z_{NG}}\, \lambda^{\alpha_0 - \frac{1}{2}} \exp\left( -\frac{\lambda}{2}\left[ \kappa_0(\mu - \mu_0)^2 + 2\beta_0 \right] \right) \qquad (65)$$
$$Z_{NG}(\mu_0, \kappa_0, \alpha_0, \beta_0) = \frac{\Gamma(\alpha_0)}{\beta_0^{\alpha_0}}\left( \frac{2\pi}{\kappa_0} \right)^{\frac{1}{2}} \qquad (66)$$
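For reference, a direct transcription of the prior in Equation 63 into Python (a sketch of my own, not code from the report), using scipy's normal and gamma densities and converting the rate $\beta_0$ to scipy's scale convention:

import numpy as np
from scipy import stats

def normal_gamma_logpdf(mu, lam, mu0, kappa0, alpha0, beta0):
    """log NG(mu, lambda | mu0, kappa0, alpha0, beta0), Eq. 63."""
    log_normal = stats.norm.logpdf(mu, loc=mu0, scale=1.0 / np.sqrt(kappa0 * lam))
    log_gamma = stats.gamma.logpdf(lam, alpha0, scale=1.0 / beta0)  # rate -> scale
    return log_normal + log_gamma

print(normal_gamma_logpdf(0.5, 1.2, mu0=0.0, kappa0=2.0, alpha0=3.0, beta0=1.0))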

Figure 3: Some Normal-Gamma distributions, with parameters $(\kappa_0 = 2, a_0 = 1, b_0 = 1)$, $(2, 3, 1)$, $(2, 5, 1)$ and $(2, 5, 3)$. Produced by NGplot2.

See Figure 3 for some plots. We can compute the prior marginal on µ as follows:

$$p(\mu) \propto \int_0^\infty p(\mu, \lambda)\, d\lambda \qquad (67)$$
$$= \int_0^\infty \lambda^{\alpha_0 + \frac{1}{2} - 1} \exp\left( -\lambda\Big( \beta_0 + \frac{\kappa_0(\mu - \mu_0)^2}{2} \Big) \right) d\lambda \qquad (68)$$

We recognize this as an unnormalized $\mathrm{Ga}(a = \alpha_0 + \frac{1}{2},\, b = \beta_0 + \frac{\kappa_0(\mu - \mu_0)^2}{2})$ distribution, so we can just write down

$$p(\mu) \propto \frac{\Gamma(a)}{b^a} \qquad (69)$$
$$\propto b^{-a} \qquad (70)$$
$$= \left( \beta_0 + \frac{\kappa_0}{2}(\mu - \mu_0)^2 \right)^{-\alpha_0 - \frac{1}{2}} \qquad (71)$$
$$\propto \left( 1 + \frac{\alpha_0\kappa_0(\mu - \mu_0)^2}{2\alpha_0\beta_0} \right)^{-(2\alpha_0 + 1)/2} \qquad (72)$$

which we recognize as a $t_{2\alpha_0}(\mu \mid \mu_0, \beta_0/(\alpha_0\kappa_0))$ distribution.
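A quick Monte Carlo sanity check of Equation 72 (again my own illustration, with arbitrary hyper-parameter values): sampling $(\mu, \lambda)$ from the Normal-Gamma prior and comparing the quantiles of $\mu$ against the claimed Student t marginal.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu0, kappa0, a0, b0 = 0.0, 2.0, 3.0, 1.0

lam = rng.gamma(shape=a0, scale=1.0 / b0, size=200_000)   # Ga(a0, rate=b0)
mu = rng.normal(mu0, 1.0 / np.sqrt(kappa0 * lam))         # N(mu0, (kappa0*lam)^-1)

scale = np.sqrt(b0 / (a0 * kappa0))
print(np.quantile(mu, [0.05, 0.5, 0.95]))
print(stats.t.ppf([0.05, 0.5, 0.95], 2 * a0, loc=mu0, scale=scale))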

3.3 Posterior

The posterior can be derived as follows.

$$p(\mu, \lambda \mid D) \propto NG(\mu, \lambda \mid \mu_0, \kappa_0, \alpha_0, \beta_0)\, p(D \mid \mu, \lambda) \qquad (73)$$
$$\propto \lambda^{\frac{1}{2}} e^{-(\kappa_0\lambda(\mu - \mu_0)^2)/2}\, \lambda^{\alpha_0 - 1} e^{-\beta_0\lambda} \times \lambda^{n/2} e^{-\frac{\lambda}{2}\sum_{i=1}^n (x_i - \mu)^2} \qquad (74)$$
$$\propto \lambda^{\frac{1}{2}}\, \lambda^{\alpha_0 + n/2 - 1} e^{-\beta_0\lambda} e^{-(\lambda/2)\left[ \kappa_0(\mu - \mu_0)^2 + \sum_i (x_i - \mu)^2 \right]} \qquad (75)$$

From Equation 6 we have

$$\sum_{i=1}^n (x_i - \mu)^2 = n(\mu - \bar{x})^2 + \sum_{i=1}^n (x_i - \bar{x})^2 \qquad (76)$$

Also, it can be shown that

$$\kappa_0(\mu - \mu_0)^2 + n(\mu - \bar{x})^2 = (\kappa_0 + n)(\mu - \mu_n)^2 + \frac{\kappa_0 n(\bar{x} - \mu_0)^2}{\kappa_0 + n} \qquad (77)$$

where

$$\mu_n = \frac{\kappa_0\mu_0 + n\bar{x}}{\kappa_0 + n} \qquad (78)$$

Hence

$$\kappa_0(\mu - \mu_0)^2 + \sum_i (x_i - \mu)^2 = \kappa_0(\mu - \mu_0)^2 + n(\mu - \bar{x})^2 + \sum_i (x_i - \bar{x})^2 \qquad (79)$$
$$= (\kappa_0 + n)(\mu - \mu_n)^2 + \frac{\kappa_0 n(\bar{x} - \mu_0)^2}{\kappa_0 + n} + \sum_i (x_i - \bar{x})^2 \qquad (80)$$

So

$$p(\mu, \lambda \mid D) \propto \lambda^{\frac{1}{2}} e^{-(\lambda/2)(\kappa_0 + n)(\mu - \mu_n)^2} \qquad (81)$$
$$\times\; \lambda^{\alpha_0 + n/2 - 1} e^{-\beta_0\lambda} e^{-(\lambda/2)\sum_i (x_i - \bar{x})^2} e^{-(\lambda/2)\frac{\kappa_0 n(\bar{x} - \mu_0)^2}{\kappa_0 + n}} \qquad (82)$$
$$\propto \mathcal{N}\big(\mu \mid \mu_n, ((\kappa_0 + n)\lambda)^{-1}\big) \times \mathrm{Ga}(\lambda \mid \alpha_0 + n/2, \beta_n) \qquad (83)$$

where

$$\beta_n = \beta_0 + \tfrac{1}{2}\sum_{i=1}^n (x_i - \bar{x})^2 + \frac{\kappa_0 n(\bar{x} - \mu_0)^2}{2(\kappa_0 + n)} \qquad (84)$$

In summary,

$$p(\mu, \lambda \mid D) = NG(\mu, \lambda \mid \mu_n, \kappa_n, \alpha_n, \beta_n) \qquad (85)$$
$$\mu_n = \frac{\kappa_0\mu_0 + n\bar{x}}{\kappa_0 + n} \qquad (86)$$
$$\kappa_n = \kappa_0 + n \qquad (87)$$

$$\alpha_n = \alpha_0 + n/2 \qquad (88)$$
$$\beta_n = \beta_0 + \tfrac{1}{2}\sum_{i=1}^n (x_i - \bar{x})^2 + \frac{\kappa_0 n(\bar{x} - \mu_0)^2}{2(\kappa_0 + n)} \qquad (89)$$

We see that the posterior sum of squares, $\beta_n$, combines the prior sum of squares, $\beta_0$, the sample sum of squares, $\sum_i (x_i - \bar{x})^2$, and a term due to the discrepancy between the prior mean and sample mean. As can be seen from Figure 3, the range of probable values for $\mu$ and $\sigma^2$ can be quite large even for moderate $n$. Keep this picture in mind whenever someone claims to have "fit a Gaussian" to their data.
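The update in Equations 86–89 is straightforward to implement; the following Python sketch (the function name and test data are invented for illustration) returns the posterior hyper-parameters.

import numpy as np

def ng_posterior(x, mu0, kappa0, alpha0, beta0):
    """Normal-Gamma posterior hyper-parameters for unknown mean and precision
    (Eqs. 86-89)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    alpha_n = alpha0 + n / 2.0
    beta_n = (beta0
              + 0.5 * np.sum((x - xbar) ** 2)
              + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n))
    return mu_n, kappa_n, alpha_n, beta_n

# example with arbitrary data and a weak prior
print(ng_posterior([2.1, 1.8, 2.5, 2.2], mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0))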

3.3.1 Posterior marginals

The posterior marginals are (using Equation 72)

$$p(\lambda \mid D) = \mathrm{Ga}(\lambda \mid \alpha_n, \beta_n) \qquad (90)$$
$$p(\mu \mid D) = t_{2\alpha_n}(\mu \mid \mu_n, \beta_n/(\alpha_n\kappa_n)) \qquad (91)$$

3.4 Marginal likelihood

To derive the marginal likelihood, we just re-derive the posterior, but this time we keep track of all the constant factors. Let $NG^0(\mu, \lambda \mid \mu_0, \kappa_0, \alpha_0, \beta_0)$ denote an unnormalized Normal-Gamma distribution, and let $Z_0 = Z_{NG}(\mu_0, \kappa_0, \alpha_0, \beta_0)$ be the normalization constant of the prior; similarly let $Z_n$ be the normalization constant of the posterior. Let $\mathcal{N}^0(x_i \mid \mu, \lambda)$ denote an unnormalized Gaussian with normalization constant $1/\sqrt{2\pi}$. Then

$$p(\mu, \lambda \mid D) = \frac{1}{p(D)}\frac{1}{Z_0} NG^0(\mu, \lambda \mid \mu_0, \kappa_0, \alpha_0, \beta_0)\left( \frac{1}{2\pi} \right)^{n/2}\prod_i \mathcal{N}^0(x_i \mid \mu, \lambda) \qquad (92)$$

The $NG^0$ and $\mathcal{N}^0$ terms combine to make the posterior $NG^0$:

$$p(\mu, \lambda \mid D) = \frac{1}{Z_n} NG^0(\mu, \lambda \mid \mu_n, \kappa_n, \alpha_n, \beta_n) \qquad (93)$$

Hence

$$p(D) = \frac{Z_n}{Z_0}(2\pi)^{-n/2} \qquad (94)$$
$$= \frac{\Gamma(\alpha_n)}{\Gamma(\alpha_0)}\frac{\beta_0^{\alpha_0}}{\beta_n^{\alpha_n}}\left( \frac{\kappa_0}{\kappa_n} \right)^{\frac{1}{2}}(2\pi)^{-n/2} \qquad (95)$$
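Equation 95 is best evaluated in log space. Here is a rough sketch (not from the report) that reuses the hypothetical ng_posterior helper from the previous snippet:

import numpy as np
from scipy.special import gammaln

def ng_log_marginal_likelihood(x, mu0, kappa0, alpha0, beta0):
    """log p(D) for the Normal-Gamma model (Eq. 95)."""
    n = len(x)
    _, kappa_n, alpha_n, beta_n = ng_posterior(x, mu0, kappa0, alpha0, beta0)
    return (gammaln(alpha_n) - gammaln(alpha0)
            + alpha0 * np.log(beta0) - alpha_n * np.log(beta_n)
            + 0.5 * (np.log(kappa0) - np.log(kappa_n))
            - 0.5 * n * np.log(2 * np.pi))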

3.5 Posterior predictive

The posterior predictive for $m$ new observations is given by

$$p(D_{new} \mid D) = \frac{p(D_{new}, D)}{p(D)} \qquad (96)$$
$$= \frac{Z_{n+m}}{Z_0}(2\pi)^{-(n+m)/2}\frac{Z_0}{Z_n}(2\pi)^{n/2} \qquad (97)$$
$$= \frac{Z_{n+m}}{Z_n}(2\pi)^{-m/2} \qquad (98)$$
$$= \frac{\Gamma(\alpha_{n+m})}{\Gamma(\alpha_n)}\frac{\beta_n^{\alpha_n}}{\beta_{n+m}^{\alpha_{n+m}}}\left( \frac{\kappa_n}{\kappa_{n+m}} \right)^{\frac{1}{2}}(2\pi)^{-m/2} \qquad (99)$$

In the special case that $m = 1$, it can be shown (see below) that this is a T-distribution

$$p(x \mid D) = t_{2\alpha_n}\left( x \,\Big|\, \mu_n, \frac{\beta_n(\kappa_n + 1)}{\alpha_n\kappa_n} \right) \qquad (100)$$

To derive the $m = 1$ result, we proceed as follows. (This proof is by Xiang Xuan, and is based on [GH94, p10].) When $m = 1$, the posterior parameters are

$$\alpha_{n+1} = \alpha_n + 1/2 \qquad (101)$$

$$\kappa_{n+1} = \kappa_n + 1 \qquad (102)$$
$$\beta_{n+1} = \beta_n + \frac{1}{2}\sum_{i=1}^1 (x_i - \bar{x})^2 + \frac{\kappa_n(\bar{x} - \mu_n)^2}{2(\kappa_n + 1)} \qquad (103)$$

Use the fact that when $m = 1$, we have $x_1 = \bar{x}$ (since there is only one observation), hence $\frac{1}{2}\sum_{i=1}^1 (x_i - \bar{x})^2 = 0$. Let us use $x$ to denote $D_{new}$; then $\beta_{n+1}$ is

$$\beta_{n+1} = \beta_n + \frac{\kappa_n(x - \mu_n)^2}{2(\kappa_n + 1)} \qquad (104)$$

Substituting, we have the following,

$$p(D_{new} \mid D) = \frac{\Gamma(\alpha_{n+1})}{\Gamma(\alpha_n)}\frac{\beta_n^{\alpha_n}}{\beta_{n+1}^{\alpha_{n+1}}}\left( \frac{\kappa_n}{\kappa_{n+1}} \right)^{\frac{1}{2}}(2\pi)^{-1/2} \qquad (105)$$
$$= \frac{\Gamma(\alpha_n + 1/2)}{\Gamma(\alpha_n)}\frac{\beta_n^{\alpha_n}}{\left( \beta_n + \frac{\kappa_n(x - \mu_n)^2}{2(\kappa_n + 1)} \right)^{\alpha_n + 1/2}}\left( \frac{\kappa_n}{\kappa_n + 1} \right)^{\frac{1}{2}}(2\pi)^{-1/2} \qquad (106)$$
$$= \frac{\Gamma((2\alpha_n + 1)/2)}{\Gamma((2\alpha_n)/2)}\frac{\beta_n^{\alpha_n}}{\beta_n^{\alpha_n + 1/2}\left( 1 + \frac{\kappa_n(x - \mu_n)^2}{2\beta_n(\kappa_n + 1)} \right)^{\alpha_n + 1/2}}\left( \frac{\kappa_n}{2(\kappa_n + 1)} \right)^{\frac{1}{2}}\pi^{-1/2} \qquad (107)$$

$$= \frac{\Gamma((2\alpha_n + 1)/2)}{\Gamma((2\alpha_n)/2)}\left( \frac{\kappa_n}{2\beta_n(\kappa_n + 1)} \right)^{\frac{1}{2}}\pi^{-1/2}\left( 1 + \frac{\kappa_n(x - \mu_n)^2}{2\beta_n(\kappa_n + 1)} \right)^{-(2\alpha_n + 1)/2} \qquad (108)$$
$$= \frac{\Gamma((2\alpha_n + 1)/2)}{\Gamma((2\alpha_n)/2)}\left( \frac{\alpha_n\kappa_n}{2\alpha_n\beta_n(\kappa_n + 1)} \right)^{\frac{1}{2}}(\pi)^{-1/2}\left( 1 + \frac{\alpha_n\kappa_n(x - \mu_n)^2}{2\alpha_n\beta_n(\kappa_n + 1)} \right)^{-(2\alpha_n + 1)/2} \qquad (109)$$

Let $\Lambda = \frac{\alpha_n\kappa_n}{\beta_n(\kappa_n + 1)}$; then we have,

$$p(D_{new} \mid D) = \frac{\Gamma((2\alpha_n + 1)/2)}{\Gamma((2\alpha_n)/2)}\left( \frac{\Lambda}{2\alpha_n} \right)^{\frac{1}{2}}(\pi)^{-1/2}\left( 1 + \frac{\Lambda(x - \mu_n)^2}{2\alpha_n} \right)^{-(2\alpha_n + 1)/2} \qquad (110)$$

We can see this is a T-distribution with center at $\mu_n$, precision $\Lambda = \frac{\alpha_n\kappa_n}{\beta_n(\kappa_n + 1)}$, and degrees of freedom $2\alpha_n$.

3.6 Reference analysis

The reference prior for NG is

$$p(m, \lambda) \propto \lambda^{-1} = NG(m, \lambda \mid \mu = \cdot,\; \kappa = 0,\; \alpha = -\tfrac{1}{2},\; \beta = 0) \qquad (111)$$

So the posterior is

$$p(m, \lambda \mid D) = NG\Big(\mu_n = \bar{x},\; \kappa_n = n,\; \alpha_n = (n-1)/2,\; \beta_n = \tfrac{1}{2}\sum_{i=1}^n (x_i - \bar{x})^2\Big) \qquad (112)$$

So the posterior marginal of the mean is

$$p(m \mid D) = t_{n-1}\left( m \,\Big|\, \bar{x}, \frac{\sum_i (x_i - \bar{x})^2}{n(n-1)} \right) \qquad (113)$$

which corresponds to the frequentist sampling distribution of the MLE $\hat{\mu}$. Thus in this case, the frequentist confidence interval and the Bayesian credible interval coincide.

4 Gamma prior

If $\mu$ is known, and only $\lambda$ is unknown (e.g., when implementing Gibbs sampling), we can use the following results, which can be derived by simplifying the results for the Normal-NG model.

4.1 Likelihood

$$p(D \mid \lambda) \propto \lambda^{n/2} \exp\left( -\frac{\lambda}{2}\sum_{i=1}^n (x_i - \mu)^2 \right) \qquad (114)$$

4.2 Prior

$$p(\lambda) = \mathrm{Ga}(\lambda \mid \alpha, \beta) \propto \lambda^{\alpha - 1} e^{-\lambda\beta} \qquad (115)$$

4.3 Posterior

$$p(\lambda \mid D) = \mathrm{Ga}(\lambda \mid \alpha_n, \beta_n) \qquad (116)$$
$$\alpha_n = \alpha + n/2 \qquad (117)$$
$$\beta_n = \beta + \tfrac{1}{2}\sum_{i=1}^n (x_i - \mu)^2 \qquad (118)$$

4.4 Marginal likelihood

To be completed.

4.5 Posterior predictive

$$p(x \mid D) = t_{2\alpha_n}(x \mid \mu, \sigma^2 = \beta_n/\alpha_n) \qquad (119)$$

4.6 Reference analysis

$$p(\lambda) \propto \lambda^{-1} = \mathrm{Ga}(\lambda \mid 0, 0) \qquad (120)$$
$$p(\lambda \mid D) = \mathrm{Ga}\left( \lambda \,\Big|\, n/2,\; \tfrac{1}{2}\sum_{i=1}^n (x_i - \mu)^2 \right) \qquad (121)$$

5 Normal-inverse-chi-squared (NIX) prior

We will see that the natural conjugate prior for $\sigma^2$ is the inverse-chi-squared distribution.

5.1 Likelihood

The likelihood can be written in this form

$$p(D \mid \mu, \sigma^2) = \frac{1}{(2\pi)^{n/2}}(\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2}\left[ \sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2 \right] \right) \qquad (122)$$

5.2 Prior

The normal-inverse-chi-squared prior is

$$p(\mu, \sigma^2) = NI\chi^2(\mu_0, \kappa_0, \nu_0, \sigma_0^2) \qquad (123)$$
$$= \mathcal{N}(\mu \mid \mu_0, \sigma^2/\kappa_0) \times \chi^{-2}(\sigma^2 \mid \nu_0, \sigma_0^2) \qquad (124)$$
$$= \frac{1}{Z_p(\mu_0, \kappa_0, \nu_0, \sigma_0^2)}\,\sigma^{-1}(\sigma^2)^{-(\nu_0/2 + 1)} \exp\left( -\frac{1}{2\sigma^2}\left[ \nu_0\sigma_0^2 + \kappa_0(\mu_0 - \mu)^2 \right] \right) \qquad (125)$$
$$Z_p(\mu_0, \kappa_0, \nu_0, \sigma_0^2) = \frac{\sqrt{2\pi}}{\sqrt{\kappa_0}}\,\Gamma(\nu_0/2)\left( \frac{2}{\nu_0\sigma_0^2} \right)^{\nu_0/2} \qquad (126)$$

Figure 4: The $NI\chi^2(\mu_0, \kappa_0, \nu_0, \sigma_0^2)$ distribution. $\mu_0$ is the prior mean and $\kappa_0$ is how strongly we believe this; $\sigma_0^2$ is the prior variance and $\nu_0$ is how strongly we believe this. (a) $\mu_0 = 0, \kappa_0 = 1, \nu_0 = 1, \sigma_0^2 = 1$. Notice that the contour plot (underneath the surface) is shaped like a "squashed egg". (b) We increase the strength of our belief in the mean, so it gets narrower: $\mu_0 = 0, \kappa_0 = 5, \nu_0 = 1, \sigma_0^2 = 1$. (c) We increase the strength of our belief in the variance, so it gets narrower: $\mu_0 = 0, \kappa_0 = 1, \nu_0 = 5, \sigma_0^2 = 1$. (d) We strongly believe the mean and variance are 0.5: $\mu_0 = 0.5, \kappa_0 = 5, \nu_0 = 5, \sigma_0^2 = 0.5$. These plots were produced with NIXdemo2.

See Figure 4 for some plots. The hyper-parameters $\mu_0$ and $\sigma^2/\kappa_0$ can be interpreted as the location and scale of $\mu$, and the hyper-parameters $\nu_0$ and $\sigma_0^2$ as the degrees of freedom and scale of $\sigma^2$.

For future reference, it is useful to note that the quadratic term in the prior can be written as

$$Q_0(\mu) = S_0 + \kappa_0(\mu - \mu_0)^2 \qquad (127)$$
$$= \kappa_0\mu^2 - 2(\kappa_0\mu_0)\mu + (\kappa_0\mu_0^2 + S_0) \qquad (128)$$

where $S_0 = \nu_0\sigma_0^2$ is the prior sum of squares.

5.3 Posterior

(The following derivation is based on [Lee04, p67].) The posterior is

$$p(\mu, \sigma^2 \mid D) \propto \mathcal{N}(\mu \mid \mu_0, \sigma^2/\kappa_0)\,\chi^{-2}(\sigma^2 \mid \nu_0, \sigma_0^2)\,p(D \mid \mu, \sigma^2) \qquad (129)$$
$$\propto \sigma^{-1}(\sigma^2)^{-(\nu_0/2 + 1)} \exp\left( -\frac{1}{2\sigma^2}\left[ \nu_0\sigma_0^2 + \kappa_0(\mu_0 - \mu)^2 \right] \right) \qquad (130)$$
$$\times\; (\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2}\left[ n s^2 + n(\bar{x} - \mu)^2 \right] \right) \qquad (131)$$
$$\propto \sigma^{-3}(\sigma^2)^{-(\nu_n/2)} \exp\left( -\frac{1}{2\sigma^2}\left[ \nu_n\sigma_n^2 + \kappa_n(\mu_n - \mu)^2 \right] \right) = NI\chi^2(\mu_n, \kappa_n, \nu_n, \sigma_n^2) \qquad (132)$$

Matching powers of $\sigma^2$, we find

$$\nu_n = \nu_0 + n \qquad (133)$$

To derive the other terms, we will complete the square. Let $S_0 = \nu_0\sigma_0^2$ and $S_n = \nu_n\sigma_n^2$ for brevity. Grouping the terms inside the exponential, we have

$$S_0 + \kappa_0(\mu_0 - \mu)^2 + n s^2 + n(\bar{x} - \mu)^2 = (S_0 + \kappa_0\mu_0^2 + n s^2 + n\bar{x}^2) + \mu^2(\kappa_0 + n) - 2(\kappa_0\mu_0 + n\bar{x})\mu \qquad (134)$$

Comparing to Equation 128, we have

$$\kappa_n = \kappa_0 + n \qquad (135)$$

$$\kappa_n\mu_n = \kappa_0\mu_0 + n\bar{x} \qquad (136)$$
$$S_n + \kappa_n\mu_n^2 = S_0 + \kappa_0\mu_0^2 + n s^2 + n\bar{x}^2 \qquad (137)$$
$$S_n = S_0 + n s^2 + \kappa_0\mu_0^2 + n\bar{x}^2 - \kappa_n\mu_n^2 \qquad (138)$$

One can rearrange this to get

$$S_n = S_0 + n s^2 + (\kappa_0^{-1} + n^{-1})^{-1}(\mu_0 - \bar{x})^2 \qquad (139)$$
$$= S_0 + n s^2 + \frac{n\kappa_0}{\kappa_0 + n}(\mu_0 - \bar{x})^2 \qquad (140)$$

We see that the posterior sum of squares, $S_n = \nu_n\sigma_n^2$, combines the prior sum of squares, $S_0 = \nu_0\sigma_0^2$, the sample sum of squares, $n s^2$, and a term due to the uncertainty in the mean. In summary,

$$\mu_n = \frac{\kappa_0\mu_0 + n\bar{x}}{\kappa_n} \qquad (141)$$
$$\kappa_n = \kappa_0 + n \qquad (142)$$

$$\nu_n = \nu_0 + n \qquad (143)$$
$$\sigma_n^2 = \frac{1}{\nu_n}\left( \nu_0\sigma_0^2 + \sum_i (x_i - \bar{x})^2 + \frac{n\kappa_0}{\kappa_0 + n}(\mu_0 - \bar{x})^2 \right) \qquad (144)$$

The posterior mean is given by

$$E[\mu \mid D] = \mu_n \qquad (145)$$
$$E[\sigma^2 \mid D] = \frac{\nu_n}{\nu_n - 2}\,\sigma_n^2 \qquad (146)$$

The posterior mode is given by (Equation 14 of [BL01]):

$$\mathrm{mode}[\mu \mid D] = \mu_n \qquad (147)$$
$$\mathrm{mode}[\sigma^2 \mid D] = \frac{\nu_n\sigma_n^2}{\nu_n - 1} \qquad (148)$$

The modes of the marginal posterior are

$$\mathrm{mode}[\mu \mid D] = \mu_n \qquad (149)$$
$$\mathrm{mode}[\sigma^2 \mid D] = \frac{\nu_n\sigma_n^2}{\nu_n + 2} \qquad (150)$$

5.3.1 Marginal posterior of $\sigma^2$

First we integrate out $\mu$, which is just a Gaussian integral.

$$p(\sigma^2 \mid D) = \int p(\sigma^2, \mu \mid D)\, d\mu \qquad (151)$$
$$\propto \sigma^{-1}(\sigma^2)^{-(\nu_n/2 + 1)} \exp\left( -\frac{\nu_n\sigma_n^2}{2\sigma^2} \right) \int \exp\left( -\frac{\kappa_n}{2\sigma^2}(\mu_n - \mu)^2 \right) d\mu \qquad (152)$$
$$\propto \sigma^{-1}(\sigma^2)^{-(\nu_n/2 + 1)} \exp\left( -\frac{\nu_n\sigma_n^2}{2\sigma^2} \right)\frac{\sigma\sqrt{2\pi}}{\sqrt{\kappa_n}} \qquad (153)$$
$$\propto (\sigma^2)^{-(\nu_n/2 + 1)} \exp\left( -\frac{\nu_n\sigma_n^2}{2\sigma^2} \right) \qquad (154)$$
$$= \chi^{-2}(\sigma^2 \mid \nu_n, \sigma_n^2) \qquad (155)$$

5.3.2 Marginal posterior of $\mu$

Let us rewrite the posterior as

$$p(\mu, \sigma^2 \mid D) = C\,\phi^{-\alpha}\phi^{-1} \exp\left( -\frac{1}{2\phi}\left[ \nu_n\sigma_n^2 + \kappa_n(\mu_n - \mu)^2 \right] \right) \qquad (156)$$

where $\phi = \sigma^2$ and $\alpha = (\nu_n + 1)/2$. This follows since

$$\sigma^{-1}(\sigma^2)^{-(\nu_n/2 + 1)} = \sigma^{-1}\sigma^{-\nu_n}\sigma^{-2} = \phi^{-\frac{\nu_n + 1}{2}}\phi^{-1} = \phi^{-\alpha - 1} \qquad (157)$$

Now make the substitutions

$$A = \nu_n\sigma_n^2 + \kappa_n(\mu_n - \mu)^2 \qquad (158)$$
$$x = \frac{A}{2\phi} \qquad (159)$$
$$\frac{d\phi}{dx} = -\frac{A}{2}x^{-2} \qquad (160)$$

so

$$p(\mu \mid D) = C\int \phi^{-(\alpha + 1)} e^{-A/(2\phi)}\, d\phi \qquad (161)$$
$$= -C\frac{A}{2}\int \left( \frac{A}{2x} \right)^{-(\alpha + 1)} e^{-x} x^{-2}\, dx \qquad (162)$$
$$\propto A^{-\alpha}\int x^{\alpha - 1} e^{-x}\, dx \qquad (163)$$
$$\propto A^{-\alpha} \qquad (164)$$
$$= \left( \nu_n\sigma_n^2 + \kappa_n(\mu_n - \mu)^2 \right)^{-(\nu_n + 1)/2} \qquad (165)$$
$$\propto \left( 1 + \frac{\kappa_n}{\nu_n\sigma_n^2}(\mu - \mu_n)^2 \right)^{-(\nu_n + 1)/2} \qquad (166)$$
$$\propto t_{\nu_n}(\mu \mid \mu_n, \sigma_n^2/\kappa_n) \qquad (167)$$
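Summarizing Section 5.3 computationally: the sketch below (my own illustration; the helper name and data are invented) implements Equations 141–144 and exposes the marginals of Equations 155 and 167 through scipy.stats.

import numpy as np
from scipy import stats

def nix_posterior(x, mu0, kappa0, nu0, sigma02):
    """Normal-inverse-chi-squared posterior hyper-parameters (Eqs. 141-144)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    nu_n = nu0 + n
    sigma_n2 = (nu0 * sigma02 + np.sum((x - xbar) ** 2)
                + n * kappa0 / kappa_n * (mu0 - xbar) ** 2) / nu_n
    return mu_n, kappa_n, nu_n, sigma_n2

x = [1.2, 0.7, 1.9, 1.4, 1.1]
mu_n, kappa_n, nu_n, sigma_n2 = nix_posterior(x, mu0=0.0, kappa0=1.0, nu0=1.0, sigma02=1.0)

# marginal of sigma^2 (Eq. 155): scaled inverse chi-squared == inverse gamma
post_var = stats.invgamma(nu_n / 2, scale=nu_n * sigma_n2 / 2)
# marginal of mu (Eq. 167): Student t
post_mean = stats.t(nu_n, loc=mu_n, scale=np.sqrt(sigma_n2 / kappa_n))
print(post_var.mean(), post_mean.interval(0.95))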

5.4 Marginal likelihood

Repeating the derivation of the posterior, but keeping track of the normalization constants, gives the following.

$$p(D) = \int\!\!\int p(D \mid \mu, \sigma^2)\, p(\mu, \sigma^2)\, d\mu\, d\sigma^2 \qquad (168)$$
$$= \frac{Z_p(\mu_n, \kappa_n, \nu_n, \sigma_n^2)}{Z_p(\mu_0, \kappa_0, \nu_0, \sigma_0^2)}\frac{1}{Z_l} \qquad (169)$$
$$= \frac{\sqrt{\kappa_0}}{\sqrt{\kappa_n}}\frac{\Gamma(\nu_n/2)}{\Gamma(\nu_0/2)}\left( \frac{\nu_0\sigma_0^2}{2} \right)^{\nu_0/2}\left( \frac{2}{\nu_n\sigma_n^2} \right)^{\nu_n/2}\frac{1}{(2\pi)^{n/2}} \qquad (170)$$
$$= \frac{\Gamma(\nu_n/2)}{\Gamma(\nu_0/2)}\sqrt{\frac{\kappa_0}{\kappa_n}}\frac{(\nu_0\sigma_0^2)^{\nu_0/2}}{(\nu_n\sigma_n^2)^{\nu_n/2}}\frac{1}{\pi^{n/2}} \qquad (171)$$

5.5 Posterior predictive

$$p(x \mid D) = \int\!\!\int p(x \mid \mu, \sigma^2)\, p(\mu, \sigma^2 \mid D)\, d\mu\, d\sigma^2 \qquad (172)$$
$$= \frac{p(x, D)}{p(D)} \qquad (173)$$
$$= \frac{\Gamma((\nu_n + 1)/2)}{\Gamma(\nu_n/2)}\sqrt{\frac{\kappa_n}{\kappa_n + 1}}\frac{(\nu_n\sigma_n^2)^{\nu_n/2}}{\left( \nu_n\sigma_n^2 + \frac{\kappa_n}{\kappa_n + 1}(x - \mu_n)^2 \right)^{(\nu_n + 1)/2}}\frac{1}{\pi^{1/2}} \qquad (174)$$
$$= \frac{\Gamma((\nu_n + 1)/2)}{\Gamma(\nu_n/2)}\left( \frac{\kappa_n}{(\kappa_n + 1)\pi\nu_n\sigma_n^2} \right)^{\frac{1}{2}}\left( 1 + \frac{\kappa_n(x - \mu_n)^2}{(\kappa_n + 1)\nu_n\sigma_n^2} \right)^{-(\nu_n + 1)/2} \qquad (175)$$
$$= t_{\nu_n}\left( \mu_n, \frac{(1 + \kappa_n)\sigma_n^2}{\kappa_n} \right) \qquad (176)$$

5.6 Reference analysis

The reference prior is $p(\mu, \sigma^2) \propto (\sigma^2)^{-1}$, which can be modeled by $\kappa_0 = 0$, $\nu_0 = -1$, $\sigma_0 = 0$, since then we get

$$p(\mu, \sigma^2) \propto \sigma^{-1}(\sigma^2)^{-(-\frac{1}{2} + 1)} e^{0} = \sigma^{-1}(\sigma^2)^{-1/2} = \sigma^{-2} \qquad (177)$$

(See also [DeG70, p197] and [GCSR04, p88].) With the reference prior, the posterior is

$$\mu_n = \bar{x} \qquad (178)$$
$$\nu_n = n - 1 \qquad (179)$$
$$\kappa_n = n \qquad (180)$$
$$\sigma_n^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1} \qquad (181)$$
$$p(\mu, \sigma^2 \mid D) \propto \sigma^{-n-2} \exp\left( -\frac{1}{2\sigma^2}\left[ \sum_i (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2 \right] \right) \qquad (182)$$

The posterior marginals are

$$p(\sigma^2 \mid D) = \chi^{-2}\left( \sigma^2 \,\Big|\, n - 1, \frac{\sum_i (x_i - \bar{x})^2}{n - 1} \right) \qquad (183)$$
$$p(\mu \mid D) = t_{n-1}\left( \mu \,\Big|\, \bar{x}, \frac{\sum_i (x_i - \bar{x})^2}{n(n - 1)} \right) \qquad (184)$$

These marginals are very closely related to the sampling distribution of the MLE.
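As an aside (not in the original report), the claim that Equation 184 recovers the frequentist result can be checked numerically: under the reference prior, the 95% credible interval for $\mu$ equals the classical t-based confidence interval. The data below are arbitrary.

import numpy as np
from scipy import stats

x = np.array([4.9, 5.3, 4.7, 5.1, 5.6, 4.8])
n, xbar = len(x), x.mean()
scale = np.sqrt(np.sum((x - xbar) ** 2) / (n * (n - 1)))   # Eq. 184

# 95% Bayesian credible interval under the reference prior ...
bayes = stats.t.interval(0.95, n - 1, loc=xbar, scale=scale)
# ... equals the classical 95% confidence interval for the mean
freq = stats.t.interval(0.95, n - 1, loc=xbar, scale=stats.sem(x))
print(bayes, freq)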

The posterior predictive is

$$p(x \mid D) = t_{n-1}\left( \bar{x}, \frac{(1 + n)\sum_i (x_i - \bar{x})^2}{n(n - 1)} \right) \qquad (185)$$

Note that [Min00] argues that Jeffrey's principle says the uninformative prior should be of the form

$$\lim_{k \to 0} \mathcal{N}(\mu \mid \mu_0, \sigma^2/k)\,\chi^{-2}(\sigma^2 \mid k, \sigma_0^2) \propto (2\pi\sigma^2)^{-\frac{1}{2}}(\sigma^2)^{-1} \propto \sigma^{-3} \qquad (186)$$

This can be achieved by setting $\nu_0 = 0$ instead of $\nu_0 = -1$.

6 Normal-inverse-Gamma (NIG) prior

Another popular parameterization is the following:

$$p(\mu, \sigma^2) = NIG(m, V, a, b) \qquad (187)$$
$$= \mathcal{N}(\mu \mid m, \sigma^2 V)\, IG(\sigma^2 \mid a, b) \qquad (188)$$

6.1 Likelihood

The likelihood can be written in this form

$$p(D \mid \mu, \sigma^2) = \frac{1}{(2\pi)^{n/2}}(\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2}\left[ n s^2 + n(\bar{x} - \mu)^2 \right] \right) \qquad (189)$$

6.2 Prior

$$p(\mu, \sigma^2) = NIG(m_0, V_0, a_0, b_0) \qquad (190)$$
$$= \mathcal{N}(\mu \mid m_0, \sigma^2 V_0)\, IG(\sigma^2 \mid a_0, b_0) \qquad (191)$$

This is equivalent to the $NI\chi^2$ prior, where we make the following substitutions.

$$m_0 = \mu_0 \qquad (192)$$
$$V_0 = \frac{1}{\kappa_0} \qquad (193)$$
$$a_0 = \frac{\nu_0}{2} \qquad (194)$$
$$b_0 = \frac{\nu_0\sigma_0^2}{2} \qquad (195)$$

6.3 Posterior

We can show that the posterior is also NIG:

$$p(\mu, \sigma^2 \mid D) = NIG(m_n, V_n, a_n, b_n) \qquad (196)$$
$$V_n^{-1} = V_0^{-1} + n \qquad (197)$$
$$\frac{m_n}{V_n} = V_0^{-1} m_0 + n\bar{x} \qquad (198)$$
$$a_n = a_0 + n/2 \qquad (199)$$
$$b_n = b_0 + \frac{1}{2}\left[ m_0^2 V_0^{-1} + \sum_i x_i^2 - m_n^2 V_n^{-1} \right] \qquad (200)$$

The NIG posterior follows directly from the $NI\chi^2$ results using the specified substitutions. (The $b_n$ term requires some tedious algebra...)
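The equivalence of the NIG and $NI\chi^2$ parameterizations is easy to verify numerically. The sketch below (my own illustration, reusing the hypothetical nix_posterior helper from Section 5.3) converts the NIG hyper-parameters via Equations 192–195, runs the $NI\chi^2$ update, and converts back.

import numpy as np

def nig_posterior(x, m0, V0, a0, b0):
    """NIG posterior via the NIX update (Eqs. 192-195 to convert, then Eqs. 141-144)."""
    # NIG -> NIX
    mu0, kappa0, nu0 = m0, 1.0 / V0, 2.0 * a0
    sigma02 = 2.0 * b0 / nu0
    mu_n, kappa_n, nu_n, sigma_n2 = nix_posterior(x, mu0, kappa0, nu0, sigma02)
    # NIX -> NIG
    return mu_n, 1.0 / kappa_n, nu_n / 2.0, nu_n * sigma_n2 / 2.0

x = [0.3, -0.1, 0.8, 0.4]
print(nig_posterior(x, m0=0.0, V0=1.0, a0=1.0, b0=1.0))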

6.3.1 Posterior marginals

To be derived.

6.4 Marginal likelihood

For the marginal likelihood, substituting into Equation 171 we have

$$p(D) = \frac{\Gamma(a_n)}{\Gamma(a_0)}\sqrt{\frac{V_n}{V_0}}\frac{(2b_0)^{a_0}}{(2b_n)^{a_n}}\frac{1}{\pi^{n/2}} \qquad (201)$$
$$= \frac{|V_n|^{\frac{1}{2}}}{|V_0|^{\frac{1}{2}}}\frac{b_0^{a_0}}{b_n^{a_n}}\frac{\Gamma(a_n)}{\Gamma(a_0)}\frac{1}{\pi^{n/2}}\,2^{a_0 - a_n} \qquad (202)$$
$$= \frac{|V_n|^{\frac{1}{2}}}{|V_0|^{\frac{1}{2}}}\frac{b_0^{a_0}}{b_n^{a_n}}\frac{\Gamma(a_n)}{\Gamma(a_0)}\frac{1}{\pi^{n/2}2^n} \qquad (203)$$

6.5 Posterior predictive

For the predictive density, substituting into Equation 176 we have

$$\frac{\kappa_n}{(1 + \kappa_n)\sigma_n^2} = \frac{1}{(\frac{1}{\kappa_n} + 1)\sigma_n^2} \qquad (204)$$
$$= \frac{2a_n}{2b_n(1 + V_n)} \qquad (205)$$

So

$$p(y \mid D) = t_{2a_n}\left( m_n, \frac{b_n(1 + V_n)}{a_n} \right) \qquad (206)$$

These results follow from [DHMS02, p240] by setting $x = 1$, $\beta = \mu$, $B^T B = n$, $B^T X = n\bar{x}$, $X^T X = \sum_i x_i^2$. Note that we use a different parameterization of the Student t. Also, our equations for $p(D)$ differ by a $2^{-n}$ term.

7 Multivariate Normal prior

If we assume $\Sigma$ is known, then a conjugate analysis of the mean is very simple, since the conjugate prior for the mean is Gaussian, the likelihood is Gaussian, and hence the posterior is Gaussian. The results are analogous to the scalar case. In particular, we use the general result from [Bis06, p92] with the following substitutions:

$$x = \mu, \quad y = \bar{x}, \quad \Lambda^{-1} = \Sigma_0, \quad A = I, \quad b = 0, \quad L^{-1} = \Sigma/N \qquad (207)$$

7.1 Prior

$$p(\mu) = \mathcal{N}(\mu \mid \mu_0, \Sigma_0) \qquad (208)$$

7.2 Likelihood

$$p(D \mid \mu, \Sigma) \propto \mathcal{N}(\bar{x} \mid \mu, \tfrac{1}{N}\Sigma) \qquad (209)$$

7.3 Posterior

$$p(\mu \mid D, \Sigma) = \mathcal{N}(\mu \mid \mu_N, \Sigma_N) \qquad (210)$$
$$\Sigma_N^{-1} = \Sigma_0^{-1} + N\Sigma^{-1} \qquad (211)$$
$$\mu_N = \Sigma_N\left( N\Sigma^{-1}\bar{x} + \Sigma_0^{-1}\mu_0 \right) \qquad (212)$$

7.4 Posterior predictive

$$p(x \mid D) = \mathcal{N}(x \mid \mu_N, \Sigma + \Sigma_N) \qquad (213)$$

7.5 Reference analysis

$$p(\mu) \propto 1 = \mathcal{N}(\mu \mid \cdot, \infty I) \qquad (214)$$
$$p(\mu \mid D) = \mathcal{N}(\bar{x}, \Sigma/n) \qquad (215)$$

8 Normal-Wishart prior

The multivariate analog of the normal-gamma prior is the normal-Wishart prior. Here we just state the results without proof; see [DeG70, p178] for details. We assume $x$ is a $d$-dimensional vector.

8.1 Likelihood

$$p(D \mid \mu, \Lambda) = (2\pi)^{-nd/2}\,|\Lambda|^{n/2}\exp\left( -\tfrac{1}{2}\sum_{i=1}^n (x_i - \mu)^T\Lambda(x_i - \mu) \right) \qquad (216)$$

8.2 Prior

$$p(\mu, \Lambda) = NWi(\mu, \Lambda \mid \mu_0, \kappa, \nu, T) = \mathcal{N}(\mu \mid \mu_0, (\kappa\Lambda)^{-1})\, Wi_\nu(\Lambda \mid T) \qquad (217)$$
$$= \frac{1}{Z}\,|\Lambda|^{\frac{1}{2}}\exp\left( -\frac{\kappa}{2}(\mu - \mu_0)^T\Lambda(\mu - \mu_0) \right)|\Lambda|^{(\nu - d - 1)/2}\exp\left( -\tfrac{1}{2}\mathrm{tr}(T^{-1}\Lambda) \right) \qquad (218)$$
$$Z = \left( \frac{2\pi}{\kappa} \right)^{d/2}|T|^{\nu/2}\,2^{d\nu/2}\,\Gamma_d(\nu/2) \qquad (219)$$

Here $T$ is the prior covariance. To see the connection to the scalar case, make the substitutions

$$\alpha_0 = \frac{\nu_0}{2}, \quad \beta_0 = \frac{T_0}{2} \qquad (220)$$

8.3 Posterior

$$p(\mu, \Lambda \mid D) = \mathcal{N}(\mu \mid \mu_n, (\kappa_n\Lambda)^{-1})\, Wi_{\nu_n}(\Lambda \mid T_n) \qquad (221)$$
$$\mu_n = \frac{\kappa\mu_0 + n\bar{x}}{\kappa + n} \qquad (222)$$
$$T_n = T + S + \frac{\kappa n}{\kappa + n}(\mu_0 - \bar{x})(\mu_0 - \bar{x})^T \qquad (223)$$
$$S = \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T \qquad (224)$$
$$\nu_n = \nu + n \qquad (225)$$

$$\kappa_n = \kappa + n \qquad (226)$$

Posterior marginals

$$p(\Lambda \mid D) = Wi_{\nu_n}(T_n) \qquad (227)$$
$$p(\mu \mid D) = t_{\nu_n - d + 1}\left( \mu \,\Big|\, \mu_n, \frac{T_n}{\kappa_n(\nu_n - d + 1)} \right) \qquad (228)$$
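For completeness, here is a small NumPy sketch (not from the report; the function name is invented, and the report's κ, ν, T are written kappa0, nu0, T0) of the Normal-Wishart update in Equations 222–226; the marginal of µ can then be evaluated as the multivariate t of Equation 228.

import numpy as np

def normal_wishart_posterior(X, mu0, kappa0, nu0, T0):
    """Posterior hyper-parameters for the Normal-Wishart model (Eqs. 222-226).
    X is an (n, d) array of observations."""
    X = np.atleast_2d(X)
    n, d = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)                          # scatter matrix, Eq. 224
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n             # Eq. 222
    diff = (mu0 - xbar).reshape(-1, 1)
    T_n = T0 + S + kappa0 * n / kappa_n * (diff @ diff.T)  # Eq. 223
    return mu_n, kappa_n, nu_n, T_n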

The MAP estimates are given by

$$(\hat{\mu}, \hat{\Lambda}) = \arg\max_{\mu, \Lambda}\, p(D \mid \mu, \Lambda)\, NWi(\mu, \Lambda) \qquad (229)$$
$$\hat{\mu} = \frac{\sum_{i=1}^n x_i + \kappa_0\mu_0}{N + \kappa_0} \qquad (230)$$
$$\hat{\Sigma} = \frac{\sum_{i=1}^n (x_i - \hat{\mu})(x_i - \hat{\mu})^T + \kappa_0(\hat{\mu} - \mu_0)(\hat{\mu} - \mu_0)^T + T_0^{-1}}{N + \nu_0 - d} \qquad (231)$$

This reduces to the MLE if $\kappa_0 = 0$, $\nu_0 = d$ and $|T_0| = 0$.

8.4 Posterior predictive

$$p(x \mid D) = t_{\nu_n - d + 1}\left( \mu_n, \frac{T_n(\kappa_n + 1)}{\kappa_n(\nu_n - d + 1)} \right) \qquad (232)$$

If $d = 1$, this reduces to Equation 100.

8.5 Marginal likelihood

This can be computed as a ratio of normalization constants.

$$p(D) = \frac{Z_n}{Z_0}\frac{1}{(2\pi)^{nd/2}} \qquad (233)$$
$$= \frac{1}{\pi^{nd/2}}\frac{\Gamma_d(\nu_n/2)}{\Gamma_d(\nu_0/2)}\frac{|T_0|^{\nu_0/2}}{|T_n|^{\nu_n/2}}\left( \frac{\kappa_0}{\kappa_n} \right)^{d/2} \qquad (234)$$

This reduces to Equation 95 if $d = 1$.

8.6 Reference analysis

We set

$$\mu_0 = 0, \quad \kappa_0 = 0, \quad \nu_0 = -1, \quad |T_0| = 0 \qquad (235)$$

to give

$$p(\mu, \Lambda) \propto |\Lambda|^{-(d+1)/2} \qquad (236)$$

Then the posterior parameters become

$$\mu_n = \bar{x}, \quad T_n = S, \quad \kappa_n = n, \quad \nu_n = n - 1 \qquad (237)$$

the posterior marginals become

$$p(\mu \mid D) = t_{n-d}\left( \mu \,\Big|\, \bar{x}, \frac{S}{n(n - d)} \right) \qquad (238)$$
$$p(\Lambda \mid D) = Wi_{n-d}(\Lambda \mid S) \qquad (239)$$

and the posterior predictive becomes

$$p(x \mid D) = t_{n-d}\left( \bar{x}, \frac{S(n + 1)}{n(n - d)} \right) \qquad (240)$$

9 Normal-Inverse-Wishart prior

The multivariate analog of the normal inverse chi-squared (NIX) distribution is the normal inverse Wishart (NIW) (see also [GCSR04, p85]).

9.1 Likelihood

The likelihood is

$$p(D \mid \mu, \Sigma) \propto |\Sigma|^{-\frac{n}{2}}\exp\left( -\frac{1}{2}\sum_{i=1}^n (x_i - \mu)^T\Sigma^{-1}(x_i - \mu) \right) \qquad (241)$$
$$= |\Sigma|^{-\frac{n}{2}}\exp\left( -\frac{1}{2}\mathrm{tr}(\Sigma^{-1}S) \right) \qquad (242)$$

where $S$ is the matrix of sum of squares (scatter matrix)

$$S = \sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})^T \qquad (244)$$

9.2 Prior

The natural conjugate prior is normal-inverse-wishart

$$\Sigma \sim IW_{\nu_0}(\Lambda_0^{-1}) \qquad (245)$$
$$\mu \mid \Sigma \sim \mathcal{N}(\mu_0, \Sigma/\kappa_0) \qquad (246)$$
$$p(\mu, \Sigma) \stackrel{\text{def}}{=} NIW(\mu_0, \kappa_0, \Lambda_0, \nu_0) \qquad (247)$$

$$= \frac{1}{Z}\,|\Sigma|^{-((\nu_0 + d)/2 + 1)}\exp\left( -\frac{1}{2}\mathrm{tr}(\Lambda_0\Sigma^{-1}) - \frac{\kappa_0}{2}(\mu - \mu_0)^T\Sigma^{-1}(\mu - \mu_0) \right) \qquad (248)$$
$$Z = \frac{2^{\nu_0 d/2}\,\Gamma_d(\nu_0/2)\,(2\pi/\kappa_0)^{d/2}}{|\Lambda_0|^{\nu_0/2}} \qquad (249)$$

9.3 Posterior

The posterior is

$$p(\mu, \Sigma \mid D, \mu_0, \kappa_0, \Lambda_0, \nu_0) = NIW(\mu, \Sigma \mid \mu_n, \kappa_n, \Lambda_n, \nu_n) \qquad (250)$$
$$\mu_n = \frac{\kappa_0\mu_0 + n\bar{x}}{\kappa_n} = \frac{\kappa_0}{\kappa_0 + n}\mu_0 + \frac{n}{\kappa_0 + n}\bar{x} \qquad (251)$$
$$\kappa_n = \kappa_0 + n \qquad (252)$$

$$\nu_n = \nu_0 + n \qquad (253)$$
$$\Lambda_n = \Lambda_0 + S + \frac{\kappa_0 n}{\kappa_0 + n}(\bar{x} - \mu_0)(\bar{x} - \mu_0)^T \qquad (254)$$

The marginals are

$$\Sigma \mid D \sim IW(\Lambda_n^{-1}, \nu_n) \qquad (255)$$
$$\mu \mid D \sim t_{\nu_n - d + 1}\left( \mu_n, \frac{\Lambda_n}{\kappa_n(\nu_n - d + 1)} \right) \qquad (256)$$

To see the connection with the scalar case, note that $\Lambda_n$ plays the role of $\nu_n\sigma_n^2$ (posterior sum of squares), so

$$\frac{\Lambda_n}{\kappa_n(\nu_n - d + 1)} = \frac{\Lambda_n}{\kappa_n\nu_n} = \frac{\sigma_n^2}{\kappa_n} \qquad (257)$$
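The NIW update of Equations 251–254 looks as follows in Python (an illustrative sketch with invented data; scipy.stats.multivariate_t is used for the marginal of µ in Equation 256):

import numpy as np
from scipy import stats

def niw_posterior(X, mu0, kappa0, Lambda0, nu0):
    """Normal-inverse-Wishart posterior hyper-parameters (Eqs. 251-254)."""
    X = np.atleast_2d(X)
    n, d = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)
    kappa_n, nu_n = kappa0 + n, nu0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    diff = (xbar - mu0).reshape(-1, 1)
    Lambda_n = Lambda0 + S + kappa0 * n / kappa_n * (diff @ diff.T)
    return mu_n, kappa_n, nu_n, Lambda_n

# arbitrary 2-d example with a vague prior
rng = np.random.default_rng(2)
X = rng.multivariate_normal([1.0, -1.0], [[1.0, 0.3], [0.3, 2.0]], size=50)
d = X.shape[1]
mu_n, kappa_n, nu_n, Lambda_n = niw_posterior(X, np.zeros(d), 0.01, np.eye(d), d + 2)

# marginal posterior of mu (Eq. 256) as a multivariate Student t
post_mu = stats.multivariate_t(loc=mu_n,
                               shape=Lambda_n / (kappa_n * (nu_n - d + 1)),
                               df=nu_n - d + 1)
print(mu_n)
print(post_mu.rvs(2, random_state=0))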

9.4 Posterior predictive

$$p(x \mid D) = t_{\nu_n - d + 1}\left( \mu_n, \frac{\Lambda_n(\kappa_n + 1)}{\kappa_n(\nu_n - d + 1)} \right) \qquad (258)$$

To see the connection with the scalar case, note that

$$\frac{\Lambda_n(\kappa_n + 1)}{\kappa_n(\nu_n - d + 1)} = \frac{\Lambda_n(\kappa_n + 1)}{\kappa_n\nu_n} = \frac{\sigma_n^2(\kappa_n + 1)}{\kappa_n} \qquad (259)$$

9.5 Marginal likelihood

The posterior is given by

$$p(\mu, \Sigma \mid D) = \frac{1}{p(D)}\frac{1}{Z_0}\, NIW^0(\mu, \Sigma \mid \alpha_0)\,\frac{1}{(2\pi)^{nd/2}}\,\mathcal{N}^0(D \mid \mu, \Sigma) \qquad (260)$$

$$= \frac{1}{Z_n}\, NIW^0(\mu, \Sigma \mid \alpha_n) \qquad (261)$$

where

$$NIW^0(\mu, \Sigma \mid \alpha_0) = |\Sigma|^{-((\nu_0 + d)/2 + 1)}\exp\left( -\frac{1}{2}\mathrm{tr}(\Lambda_0\Sigma^{-1}) - \frac{\kappa_0}{2}(\mu - \mu_0)^T\Sigma^{-1}(\mu - \mu_0) \right) \qquad (262)$$
$$\mathcal{N}^0(D \mid \mu, \Sigma) = |\Sigma|^{-\frac{n}{2}}\exp\left( -\frac{1}{2}\mathrm{tr}(\Sigma^{-1}S) \right) \qquad (263)$$

is the unnormalized prior and likelihood. Hence

$$p(D) = \frac{Z_n}{Z_0}\frac{1}{(2\pi)^{nd/2}} = \frac{2^{\nu_n d/2}\,\Gamma_d(\nu_n/2)\,(2\pi/\kappa_n)^{d/2}}{|\Lambda_n|^{\nu_n/2}}\,\frac{|\Lambda_0|^{\nu_0/2}}{2^{\nu_0 d/2}\,\Gamma_d(\nu_0/2)\,(2\pi/\kappa_0)^{d/2}}\,\frac{1}{(2\pi)^{nd/2}} \qquad (264)$$
$$= \frac{1}{(2\pi)^{nd/2}}\,\frac{2^{\nu_n d/2}}{2^{\nu_0 d/2}}\,\frac{(2\pi/\kappa_n)^{d/2}}{(2\pi/\kappa_0)^{d/2}}\,\frac{\Gamma_d(\nu_n/2)}{\Gamma_d(\nu_0/2)}\,\frac{|\Lambda_0|^{\nu_0/2}}{|\Lambda_n|^{\nu_n/2}} \qquad (265)$$
$$= \frac{1}{\pi^{nd/2}}\,\frac{\Gamma_d(\nu_n/2)}{\Gamma_d(\nu_0/2)}\,\frac{|\Lambda_0|^{\nu_0/2}}{|\Lambda_n|^{\nu_n/2}}\left( \frac{\kappa_0}{\kappa_n} \right)^{d/2} \qquad (266)$$

This reduces to Equation 171 if $d = 1$.

9.6 Reference analysis

A noninformative (Jeffrey's) prior is $p(\mu, \Sigma) \propto |\Sigma|^{-(d+1)/2}$, which is the limit of $\kappa_0 \to 0$, $\nu_0 \to -1$, $|\Lambda_0| \to 0$ [GCSR04, p88]. Then the posterior becomes

$$\mu_n = \bar{x} \qquad (267)$$

$$\kappa_n = n \qquad (268)$$
$$\nu_n = n - 1 \qquad (269)$$
$$\Lambda_n = S = \sum_i (x_i - \bar{x})(x_i - \bar{x})^T \qquad (270)$$
$$p(\Sigma \mid D) = IW_{n-1}(\Sigma \mid S) \qquad (271)$$
$$p(\mu \mid D) = t_{n-d}\left( \mu \,\Big|\, \bar{x}, \frac{S}{n(n - d)} \right) \qquad (272)$$
$$p(x \mid D) = t_{n-d}\left( x \,\Big|\, \bar{x}, \frac{S(n + 1)}{n(n - d)} \right) \qquad (273)$$

Note that [Min00] argues that Jeffrey's principle says the uninformative prior should be of the form

$$\lim_{k \to 0} \mathcal{N}(\mu \mid \mu_0, \Sigma/k)\, IW_k(\Sigma \mid k\Sigma) \propto |2\pi\Sigma|^{-\frac{1}{2}}|\Sigma|^{-(d+1)/2} \propto |\Sigma|^{-(\frac{d}{2} + 1)} \qquad (274)$$

This can be achieved by setting $\nu_0 = 0$ instead of $\nu_0 = -1$.

Figure 5: Some Ga(a, b) distributions, for a ∈ {0.5, 1, 1.5, 2, 5} with rate b = 1 (left) and b = 3 (right). If a < 1, the peak is at 0. As we increase b, we squeeze everything leftwards and upwards. Figures generated by gammaDistPlot2.

10 Appendix: some standard distributions

10.1 Gamma distribution

The gamma distribution is a flexible distribution for positive real valued rv's, $x > 0$. It is defined in terms of two parameters. There are two common parameterizations. This is the one used by Bishop [Bis06] (and many other authors):

$$\mathrm{Ga}(x \mid \text{shape} = a, \text{rate} = b) = \frac{b^a}{\Gamma(a)}\, x^{a-1} e^{-xb}, \quad x, a, b > 0 \qquad (275)$$

The second parameterization (and the one used by Matlab's gampdf) is

$$\mathrm{Ga}(x \mid \text{shape} = \alpha, \text{scale} = \beta) = \frac{1}{\beta^\alpha\Gamma(\alpha)}\, x^{\alpha-1} e^{-x/\beta} = \mathrm{Ga}^{\text{rate}}(x \mid \alpha, 1/\beta) \qquad (276)$$

Note that the shape parameter controls the shape of the distribution; the scale parameter merely defines the measurement scale (the horizontal axis). The rate parameter is just the inverse of the scale. See Figure 5 for some examples. This distribution has the following properties (using the rate parameterization):

$$\text{mean} = \frac{a}{b} \qquad (277)$$
$$\text{mode} = \frac{a - 1}{b} \quad \text{for } a \ge 1 \qquad (278)$$
$$\text{var} = \frac{a}{b^2} \qquad (279)$$

10.2 Inverse Gamma distribution

Let $X \sim \mathrm{Ga}(\text{shape} = a, \text{rate} = b)$ and $Y = 1/X$. Then it is easy to show that $Y \sim IG(\text{shape} = a, \text{scale} = b)$, where the inverse Gamma distribution is given by

$$IG(x \mid \text{shape} = a, \text{scale} = b) = \frac{b^a}{\Gamma(a)}\, x^{-(a+1)} e^{-b/x}, \quad x, a, b > 0 \qquad (280)$$
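Because the rate/scale conventions are a common source of bugs, here is a quick sanity check (my own illustration, not from the report) using scipy, whose gamma and invgamma distributions are parameterized by a scale argument:

import numpy as np
from scipy import stats
from scipy.special import gamma as gamma_fn

a, b = 2.0, 3.0            # shape and rate, as in Eq. 275
x = 1.7

# Ga(x | shape=a, rate=b) corresponds to scipy's gamma with scale = 1/b
pdf_rate = b**a / gamma_fn(a) * x**(a - 1) * np.exp(-x * b)
print(pdf_rate, stats.gamma.pdf(x, a, scale=1.0 / b))

# If X ~ Ga(shape=a, rate=b), then Y = 1/X ~ IG(shape=a, scale=b)  (Eq. 280);
# check via the change-of-variables identity p_Y(y) = p_X(1/y) / y^2
y = 1.0 / x
print(stats.invgamma.pdf(y, a, scale=b),
      stats.gamma.pdf(1.0 / y, a, scale=1.0 / b) / y**2)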

Figure 6: Some inverse gamma distributions (a = shape, b = rate), for a ∈ {0.1, 1, 2} and b ∈ {1, 2}. These plots were produced by invchi2plot.

The distribution has these properties

$$\text{mean} = \frac{b}{a - 1}, \quad a > 1 \qquad (281)$$
$$\text{mode} = \frac{b}{a + 1} \qquad (282)$$
$$\text{var} = \frac{b^2}{(a - 1)^2(a - 2)}, \quad a > 2 \qquad (283)$$

See Figure 6 for some plots. We see that increasing $b$ just stretches the horizontal axis, but increasing $a$ moves the peak up and closer to the left. There is also another parameterization, using the rate (inverse scale):

$$IG(x \mid \text{shape} = \alpha, \text{rate} = \beta) = \frac{1}{\beta^\alpha\Gamma(\alpha)}\, x^{-(\alpha+1)} e^{-1/(\beta x)}, \quad x, \alpha, \beta > 0 \qquad (284)$$

10.3 Scaled Inverse-Chi-squared

The scaled inverse-chi-squared distribution is a reparameterization of the inverse Gamma [GCSR04, p575].

$$\chi^{-2}(x \mid \nu, \sigma^2) = \frac{1}{\Gamma(\nu/2)}\left( \frac{\nu\sigma^2}{2} \right)^{\nu/2} x^{-\frac{\nu}{2} - 1}\exp\left[ -\frac{\nu\sigma^2}{2x} \right], \quad x > 0 \qquad (285)$$
$$= IG\left( x \,\Big|\, \text{shape} = \frac{\nu}{2}, \text{scale} = \frac{\nu\sigma^2}{2} \right) \qquad (286)$$

where the parameter $\nu > 0$ is called the degrees of freedom, and $\sigma^2 > 0$ is the scale. See Figure 7 for some plots. We see that increasing $\nu$ lifts the curve up and moves it slightly to the right. Later, when we consider Bayesian parameter estimation, we will use this distribution as a conjugate prior for a scale parameter (such as the variance of a Gaussian); increasing $\nu$ corresponds to increasing the effective strength of the prior.

Figure 7: Some scaled inverse-χ² distributions, for ν ∈ {1, 5} and σ² ∈ {0.5, 1, 2}. These plots were produced by invchi2plot.

The distribution has these properties

$$\text{mean} = \frac{\nu\sigma^2}{\nu - 2} \quad \text{for } \nu > 2 \qquad (287)$$
$$\text{mode} = \frac{\nu\sigma^2}{\nu + 2} \qquad (288)$$
$$\text{var} = \frac{2\nu^2\sigma^4}{(\nu - 2)^2(\nu - 4)} \quad \text{for } \nu > 4 \qquad (289)$$

The inverse chi-squared distribution, written $\chi^{-2}_\nu(x)$, is the special case where $\nu\sigma^2 = 1$ (i.e., $\sigma^2 = 1/\nu$). This corresponds to $IG(a = \nu/2,\, b = \text{scale} = 1/2)$.

10.4 Wishart distribution

Let $X$ be a $p$-dimensional symmetric positive definite matrix. The Wishart is the multidimensional generalization of the Gamma. Since it is a distribution over matrices, it is hard to plot as a density function. However, we can easily sample from it, and then use the eigenvectors of the resulting matrix to define an ellipse. See Figure 8.

There are several possible parameterizations. Some authors (e.g., [Bis06, p693], [DeG70, p.57], [GCSR04, p574], wikipedia), as well as WinBUGS and Matlab (wishrnd), define the Wishart in terms of the degrees of freedom $\nu \ge p$ and the scale matrix $S$ as follows:

$$Wi_\nu(X \mid S) = \frac{1}{Z}\,|X|^{(\nu - p - 1)/2}\exp\left[ -\tfrac{1}{2}\mathrm{tr}(S^{-1}X) \right] \qquad (290)$$
$$Z = 2^{\nu p/2}\,\Gamma_p(\nu/2)\,|S|^{\nu/2} \qquad (291)$$

where $\Gamma_p(a)$ is the generalized gamma function

$$\Gamma_p(\alpha) = \pi^{p(p-1)/4}\prod_{i=1}^p \Gamma\left( \frac{2\alpha + 1 - i}{2} \right) \qquad (292)$$

(So $\Gamma_1(\alpha) = \Gamma(\alpha)$.) The mean and mode are given by (see also [Pre05])

$$\text{mean} = \nu S \qquad (293)$$
$$\text{mode} = (\nu - p - 1)S, \quad \nu > p + 1 \qquad (294)$$
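A quick empirical check of Equation 293 (my own illustration, not from the report), using scipy's wishart, which follows the same degrees-of-freedom/scale parameterization:

import numpy as np
from scipy import stats

nu = 10
S = np.array([[4.0, 3.0], [3.0, 4.0]])

samples = stats.wishart(df=nu, scale=S).rvs(size=20_000, random_state=0)
print(samples.mean(axis=0))   # should be close to nu * S  (Eq. 293)
print(nu * S)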

Figure 8: Some samples from the Wishart distribution. Left: ν = 2, right: ν = 10. We see that if ν = 2 (the smallest valid value in 2 dimensions), we often sample nearly singular matrices. As ν increases, we put more mass on the S matrix. If S = I₂, the samples would look (on average) like circles. Generated by wishplot.

In 1D, this becomes $\mathrm{Ga}(\text{shape} = \nu/2, \text{rate} = S/2)$. Note that if $X \sim Wi_\nu(S)$ and $Y = X^{-1}$, then $Y \sim IW_\nu(S^{-1})$ and $E[Y] = \frac{S^{-1}}{\nu - d - 1}$.

In [BS94, p.138], and in the wishpdf in Tom Minka's lightspeed toolbox, they use the following parameterization

$$Wi(X \mid a, B) = \frac{|B|^a}{\Gamma_p(a)}\,|X|^{a - (p+1)/2}\exp[-\mathrm{tr}(BX)] \qquad (295)$$

We require that $B$ is a $p \times p$ symmetric positive definite matrix, and $2a > p - 1$. If $p = 1$, so $B$ is a scalar, this reduces to the $\mathrm{Ga}(\text{shape} = a, \text{rate} = b)$ density.

To get some intuition for this distribution, recall that $\mathrm{tr}(AB)$ is the sum of the inner products of the rows of $A$ with the corresponding columns of $B$. In Matlab notation we have

trace(A*B) = sum([a(1,:)*b(:,1), ..., a(n,:)*b(:,n)])

If $X \sim Wi_\nu(S)$, then we are performing a kind of template matching between the columns of $X$ and $S^{-1}$ (recall that both $X$ and $S$ are symmetric). This is a natural way to define the distance between two matrices.

10.5 Inverse Wishart

This is the multidimensional generalization of the inverse Gamma. Consider a $d \times d$ positive definite (covariance) matrix $X$, a dof parameter $\nu > d - 1$, and a psd matrix $S$. Some authors (e.g., [GCSR04, p574]) use this parameterization:

$$IW_\nu(X \mid S^{-1}) = \frac{1}{Z}\,|X|^{-(\nu + d + 1)/2}\exp\left( -\frac{1}{2}\mathrm{Tr}(SX^{-1}) \right) \qquad (296)$$
$$Z = \frac{|S|^{\nu/2}}{2^{\nu d/2}\,\Gamma_d(\nu/2)} \qquad (297)$$

where

$$\Gamma_d(\nu/2) = \pi^{d(d-1)/4}\prod_{i=1}^d \Gamma\left( \frac{\nu + 1 - i}{2} \right) \qquad (298)$$

The distribution has mean

$$E[X] = \frac{S}{\nu - d - 1} \qquad (299)$$

In Matlab, use iwishrnd. In the 1d case, we have

$$\chi^{-2}(\Sigma \mid \nu_0, \sigma_0^2) = IW_{\nu_0}(\Sigma \mid (\nu_0\sigma_0^2)^{-1}) \qquad (300)$$

Other authors (e.g., [Pre05, p117]) use a slightly different formulation (with $2d < \nu$)

$$IW_\nu(X \mid Q) = \left( 2^{(\nu - d - 1)d/2}\,\pi^{d(d-1)/4}\prod_{j=1}^d \Gamma((\nu - d - j)/2) \right)^{-1} \qquad (301)$$
$$\times\; |Q|^{(\nu - d - 1)/2}\,|X|^{-\nu/2}\exp\left( -\frac{1}{2}\mathrm{Tr}(X^{-1}Q) \right) \qquad (302)$$

which has mean

$$E[X] = \frac{Q}{\nu - 2d - 2} \qquad (303)$$

10.6 Student t distribution

The generalized t-distribution is given as

$$t_\nu(x \mid \mu, \sigma^2) = c\left[ 1 + \frac{1}{\nu}\left( \frac{x - \mu}{\sigma} \right)^2 \right]^{-\left( \frac{\nu + 1}{2} \right)} \qquad (304)$$
$$c = \frac{\Gamma(\nu/2 + 1/2)}{\Gamma(\nu/2)}\frac{1}{\sqrt{\nu\pi}\,\sigma} \qquad (305)$$

where $c$ is the normalization constant. $\mu$ is the mean, $\nu > 0$ is the degrees of freedom, and $\sigma^2 > 0$ is the scale. (Note that the $\nu$ parameter is often written as a subscript.) In Matlab, use tpdf. The distribution has the following properties:

$$\text{mean} = \mu, \quad \nu > 1 \qquad (306)$$
$$\text{mode} = \mu \qquad (307)$$
$$\text{var} = \frac{\nu\sigma^2}{\nu - 2}, \quad \nu > 2 \qquad (308)$$

Note: if $x \sim t_\nu(\mu, \sigma^2)$, then

$$\frac{x - \mu}{\sigma} \sim t_\nu \qquad (309)$$

which corresponds to a standard t-distribution with $\mu = 0$, $\sigma^2 = 1$:

$$t_\nu(x) = \frac{\Gamma((\nu + 1)/2)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)}(1 + x^2/\nu)^{-(\nu + 1)/2} \qquad (310)$$

In Figure 9, we plot the density for different parameter values. As $\nu \to \infty$, the T approaches a Gaussian. T-distributions are like Gaussian distributions with heavy tails. Hence they are more robust to outliers (see Figure 10). If $\nu = 1$, this is called a Cauchy distribution. This is an interesting distribution since if $X \sim$ Cauchy, then $E[X]$ does not exist, since the corresponding integral diverges. Essentially this is because the tails are so heavy that samples from the distribution can get very far from the center $\mu$.

It can be shown that the t-distribution is like an infinite sum of Gaussians, where each Gaussian has a different precision:

$$p(x \mid \mu, a, b) = \int \mathcal{N}(x \mid \mu, \tau^{-1})\,\mathrm{Ga}(\tau \mid a, \text{rate} = b)\, d\tau \qquad (311)$$
$$= t_{2a}(x \mid \mu, b/a) \qquad (312)$$

(See exercise 2.46 of [Bis06].)
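This scale-mixture representation is easy to verify by simulation; the following sketch (not from the report; the parameter values are arbitrary) samples from the mixture in Equation 311 and compares quantiles with the t distribution of Equation 312.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
mu, a, b = 0.0, 2.0, 3.0

tau = rng.gamma(shape=a, scale=1.0 / b, size=100_000)   # Ga(a, rate=b)
x = rng.normal(mu, 1.0 / np.sqrt(tau))                  # N(mu, 1/tau)

# compare quantiles with t(df=2a, loc=mu, scale=sqrt(b/a))
q = [0.05, 0.25, 0.5, 0.75, 0.95]
print(np.quantile(x, q))
print(stats.t.ppf(q, 2 * a, loc=mu, scale=np.sqrt(b / a)))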


0.25

0.2

0.15

0.1

0.05

0

−0.05 −6 −4 −2 0 2 4 6

Figure 9: Student t-distributions T (µ, σ2,ν) for µ = 0. The effect of σ is just to scale the horizontal axis. As ν→∞, the distribution approaches a Gaussian. See studentTplot.

Figure 10: Fitting a Gaussian and a Student distribution to some data (left) and to some data with outliers (right). The Student distribution (red) is much less affected by outliers than the Gaussian (green). Source: [Bis06] Figure 2.16.

Figure 11: Left: T distribution in 2d with dof = 2 and $\Sigma = 0.1 I_2$. Right: Gaussian density with $\Sigma = 0.1 I_2$ and $\mu = (0, 0)$; we see it goes to zero faster. Produced by multivarTplot.

10.7 Multivariate t distributions

The multivariate T distribution in $d$ dimensions is given by

$$t_\nu(x \mid \mu, \Sigma) = \frac{\Gamma(\nu/2 + d/2)}{\Gamma(\nu/2)}\frac{|\Sigma|^{-1/2}}{\nu^{d/2}\pi^{d/2}} \times \left[ 1 + \frac{1}{\nu}(x - \mu)^T\Sigma^{-1}(x - \mu) \right]^{-\left( \frac{\nu + d}{2} \right)} \qquad (313)$$

where $\Sigma$ is called the scale matrix (since it is not exactly the covariance matrix). This has fatter tails than a Gaussian: see Figure 11. In Matlab, use mvtpdf. The distribution has the following properties

$$E[x] = \mu \quad \text{if } \nu > 1 \qquad (315)$$
$$\mathrm{mode}[x] = \mu \qquad (316)$$
$$\mathrm{Cov}[x] = \frac{\nu}{\nu - 2}\Sigma \quad \text{for } \nu > 2 \qquad (317)$$

(The following results are from [Koo03, p328].) Suppose $Y \sim T(\mu, \Sigma, \nu)$ and we partition the variables into 2 blocks. Then the marginals are

$$Y_i \sim T(\mu_i, \Sigma_{ii}, \nu) \qquad (318)$$

and the conditionals are

$$Y_1 \mid y_2 \sim T(\mu_{1|2}, \Sigma_{1|2}, \nu + d_1) \qquad (319)$$
$$\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(y_2 - \mu_2) \qquad (320)$$
$$\Sigma_{1|2} = h_{1|2}\left( \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}^T \right) \qquad (321)$$
$$h_{1|2} = \frac{1}{\nu + d_2}\left[ \nu + (y_2 - \mu_2)^T\Sigma_{22}^{-1}(y_2 - \mu_2) \right] \qquad (322)$$

We can also show that linear combinations of Ts are Ts:

$$Y \sim T(\mu, \Sigma, \nu) \;\Rightarrow\; AY \sim T(A\mu, A\Sigma A', \nu) \qquad (323)$$

We can sample from $y \sim T(\mu, \Sigma, \nu)$ by sampling $x \sim T(0, I, \nu)$ and then transforming $y = \mu + R^T x$, where $R = \mathrm{chol}(\Sigma)$, so $R^T R = \Sigma$.

References

[Bis06] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[BL01] P. Baldi and A. Long. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics, 17(6):509–519, 2001.

[BS94] J. Bernardo and A. Smith. Bayesian Theory. John Wiley, 1994.

[DeG70] M. DeGroot. Optimal Statistical Decisions. McGraw-Hill, 1970.

[DHMS02] D. Denison, C. Holmes, B. Mallick, and A. Smith. Bayesian Methods for Nonlinear Classification and Regression. Wiley, 2002.

[DMP+06] F. Demichelis, P. Magni, P. Piergiorgi, M. Rubin, and R. Bellazzi. A hierarchical Naive Bayes model for handling sample heterogeneity in classification problems: an application to tissue microarrays. BMC Bioinformatics, 7:514, 2006.

[GCSR04] A. Gelman, J. Carlin, H. Stern, and D. Rubin. Bayesian Data Analysis. Chapman and Hall, 2004. 2nd edition.

[GH94] D. Geiger and D. Heckerman. Learning Gaussian networks. Technical Report MSR-TR-94-10, Microsoft Research, 1994.

[Koo03] Gary Koop. Bayesian Econometrics. Wiley, 2003.

[Lee04] Peter Lee. Bayesian Statistics: An Introduction. Arnold Publishing, 2004. Third edition.

[Min00] T. Minka. Inferring a Gaussian distribution. Technical report, MIT, 2000.

[Pre05] S. J. Press. Applied Multivariate Analysis, Using Bayesian and Frequentist Methods of Inference. Dover, 2005. Second edition.
