
STATS 370: Stanford University, Winter 2016

Problem Set 7 Solution provided by Ahmed Bou-Rabee

Problem 1. Multivariate Gaussian

Solution 1. We motivate our choice of conjugate prior by first writing the likelihood:
$$f(x_1, \ldots, x_n) \propto |\Sigma|^{-n/2} \exp\!\left(-\frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)^T \Sigma^{-1}(x_i - \mu)\right)$$
Then, using the identity
$$\sum_{i=1}^{n}(x_i - \mu)^T \Sigma^{-1}(x_i - \mu) = n(\mu - \bar{x})^T \Sigma^{-1}(\mu - \bar{x}) + \sum_{i=1}^{n}(x_i - \bar{x})^T \Sigma^{-1}(x_i - \bar{x})$$
and the trace trick, we can write this as
$$f(x_1, \ldots, x_n) \propto |\Sigma|^{-n/2} \exp\!\left(-\frac{n}{2}(\bar{x} - \mu)^T \Sigma^{-1}(\bar{x} - \mu)\right) \exp\!\left(-\frac{1}{2}\operatorname{tr}\!\left(\Sigma^{-1}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T\right)\right)$$
This looks like the product of a multivariate normal density in µ given Σ and an Inverse-Wishart density in Σ. Looking on Wikipedia, we see that this is the multivariate normal-Inverse-Wishart distribution. Specifically, (µ, Σ) ∼ NIW(µ0, k0, λ0, v0), where µ0, k0, λ0, v0 are the hyperparameters of the normal-Inverse-Wishart distribution.

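As a quick sanity check (not part of the original solution), the sum-of-squares decomposition and the trace trick can be verified numerically. A minimal sketch with arbitrary made-up test values, using numpy:

```python
import numpy as np

# Arbitrary test values: n observations of dimension k, any mean and
# any positive-definite covariance.
rng = np.random.default_rng(0)
n, k = 50, 3
X = rng.normal(size=(n, k))          # rows are x_1, ..., x_n
mu = rng.normal(size=k)
A = rng.normal(size=(k, k))
Sigma = A @ A.T + k * np.eye(k)      # positive definite by construction
P = np.linalg.inv(Sigma)             # Sigma^{-1}

xbar = X.mean(axis=0)
d = X - mu
lhs = np.einsum("ij,jk,ik->", d, P, d)       # sum_i (x_i - mu)^T P (x_i - mu)

S = (X - xbar).T @ (X - xbar)                # scatter matrix
rhs = n * (mu - xbar) @ P @ (mu - xbar) + np.trace(P @ S)  # trace trick

assert np.isclose(lhs, rhs)
```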
We now derive the joint posterior distribution associated with this prior. Let (µ, Σ) ∼ NIW(µ0, k0, λ0, v0), and let x1, . . . , xn be the observations from N(µ, Σ). Then the posterior satisfies
$$p(\Sigma, \mu \mid x_1, \ldots, x_n) \propto p(x_1, \ldots, x_n \mid \Sigma, \mu)\, p(\Sigma, \mu)$$
$$\propto |\Sigma|^{-n/2} \exp\!\left(-\frac{n}{2}(\bar{x} - \mu)^T \Sigma^{-1}(\bar{x} - \mu)\right) \exp\!\left(-\frac{1}{2}\operatorname{tr}\!\left(\Sigma^{-1}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T\right)\right)$$
$$\quad \times\, |\Sigma|^{-\left(\frac{v_0 + k}{2} + 1\right)} \exp\!\left(-\frac{1}{2}\operatorname{tr}(\lambda_0 \Sigma^{-1}) - \frac{k_0}{2}(\mu - \mu_0)^T \Sigma^{-1}(\mu - \mu_0)\right)$$

which we see is the normal-Inverse-Wishart distribution with parameters
$$\mu_0^* = \frac{k_0 \mu_0 + n\bar{x}}{k_0 + n}, \qquad k_0^* = k_0 + n, \qquad v_0^* = v_0 + n,$$
$$\lambda_0^* = \lambda_0 + \sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T + \frac{k_0 n}{k_0 + n}(\mu_0 - \bar{x})(\mu_0 - \bar{x})^T.$$
This is the same family as the prior, which is hence conjugate.

Now we compute the marginal posterior distribution of µ, and we find that it is a multivariate Student-t distribution. We do this by computing the marginal of the joint density and substituting in the posterior parameters. We integrate out Σ, using the fact that if A is a k × k nonsingular matrix and v is a k-dimensional column vector, then |A + vvᵀ| = |A|(1 + vᵀA⁻¹v). With the posterior parameters,
$$\pi(\mu \mid x_1, \ldots, x_n) = \int \pi(\mu, \Sigma \mid x_1, \ldots, x_n)\, d\Sigma$$
$$\propto \int |\Sigma|^{-\left(\frac{v_0^* + k}{2} + 1\right)} \exp\!\left(-\frac{1}{2}\operatorname{tr}(\lambda_0^* \Sigma^{-1}) - \frac{k_0^*}{2}(\mu - \mu_0^*)^T \Sigma^{-1}(\mu - \mu_0^*)\right) d\Sigma$$
$$\propto \left|\lambda_0^* + (k_0 + n)(\mu - \mu_0^*)(\mu - \mu_0^*)^T\right|^{-(v_0 + n + 1)/2}$$

$$\propto \left[1 + (k_0 + n)(\mu - \mu_0^*)^T (\lambda_0^*)^{-1}(\mu - \mu_0^*)\right]^{-(v_0 + n + 1)/2}$$
We recognize this as the kernel of a multivariate t distribution. Therefore, the posterior marginal distribution of µ is a multivariate t distribution with mean µ0*, v0* − k + 1 = v0 + n − k + 1 degrees of freedom, and scale matrix λ0*/(k0*(v0* − k + 1)).

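Since the conjugate update is purely mechanical, a short sketch may help. This is my own illustration, not from the problem set; the function name and the prior values in the example are made up. It maps data and NIW hyperparameters to the starred posterior parameters derived above, then builds the marginal t distribution of µ:

```python
import numpy as np
from scipy import stats

def niw_posterior(X, mu0, k0, lam0, v0):
    """Conjugate normal-Inverse-Wishart update for N(mu, Sigma) data.

    X: (n, k) array of observations; (mu0, k0, lam0, v0) are the prior
    hyperparameters. Returns the posterior hyperparameters.
    """
    n, k = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)                 # scatter matrix
    mu_star = (k0 * mu0 + n * xbar) / (k0 + n)
    k_star = k0 + n
    d = (mu0 - xbar).reshape(-1, 1)
    lam_star = lam0 + S + (k0 * n / (k0 + n)) * (d @ d.T)
    v_star = v0 + n
    return mu_star, k_star, lam_star, v_star

# Example with a weakly informative prior (hypothetical values):
rng = np.random.default_rng(1)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.3], [0.3, 1.0]], size=200)
k = X.shape[1]
mu_n, k_n, lam_n, v_n = niw_posterior(X, mu0=np.zeros(k), k0=1.0,
                                      lam0=np.eye(k), v0=k + 2)

# Marginal posterior of mu: multivariate t with v_n - k + 1 dof,
# location mu_n, scale matrix lam_n / (k_n * (v_n - k + 1)).
dof = v_n - k + 1
marg = stats.multivariate_t(loc=mu_n, shape=lam_n / (k_n * dof), df=dof)
```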
Problem 2. Collinearity

Solution 2. Let's assume the prior β ∼ N(0, Λ⁻¹) with Λ = λI for some λ > 0. We know that y − x_iᵀβ = ε_i ∼ N(0, σ²). Thus, using Bayes' rule, the posterior for β is
$$p(\beta \mid y, X, \sigma^2) \propto p(\beta)\, p(y, X \mid \beta, \sigma^2) \propto \exp\!\left(-\frac{1}{2\sigma^2}(y - X\beta)^T(y - X\beta) - \frac{1}{2}\beta^T \Lambda \beta\right)$$
Let β̂ = (XᵀX + σ²Λ)⁻¹Xᵀy. Then, completing the square, we can write
$$\frac{1}{\sigma^2}(y - X\beta)^T(y - X\beta) + \beta^T \Lambda \beta = (\beta - \hat{\beta})^T \frac{X^T X + \sigma^2 \Lambda}{\sigma^2}(\beta - \hat{\beta}) + \frac{1}{\sigma^2}\left(y^T y - \hat{\beta}^T (X^T X + \sigma^2 \Lambda)\hat{\beta}\right)$$
Discarding the terms that do not involve β, we get
$$p(\beta \mid y, X, \sigma^2) \propto \exp\!\left(-\frac{1}{2}(\beta - \hat{\beta})^T \frac{X^T X + \sigma^2 \Lambda}{\sigma^2}(\beta - \hat{\beta})\right)$$
We recognize this as p(β | y, X, σ²) = N((XᵀX + σ²Λ)⁻¹Xᵀy, σ²(XᵀX + σ²Λ)⁻¹). We complete the regression analysis by choosing the mean of the posterior as our estimate; that is, β̂ = (XᵀX + σ²Λ)⁻¹Xᵀy, which we recognize as the ridge regression solution. Note that while XᵀX does not have an inverse because of collinearity, XᵀX + λI always does when λ > 0; in particular, the solution still exists even when k > n.

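To make the collinearity point concrete, here is a small numerical sketch (my own, with made-up data): the design matrix has a duplicated column, so XᵀX is singular and ordinary least squares fails, yet the posterior mean is still well defined.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 3
X = rng.normal(size=(n, k))
X = np.column_stack([X, X[:, 0]])      # duplicate a column: X^T X is singular
beta_true = np.array([1.0, -0.5, 2.0, 0.0])
sigma = 0.5
y = X @ beta_true + sigma * rng.normal(size=n)

lam = 1.0                              # prior precision scale, Lambda = lam * I
M = X.T @ X + sigma**2 * lam * np.eye(X.shape[1])
beta_hat = np.linalg.solve(M, X.T @ y) # posterior mean = ridge solution

# X^T X alone has deficient rank, but M is positive definite:
print(np.linalg.matrix_rank(X.T @ X), M.shape[0])   # 3 vs. 4
print(beta_hat)
```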
Problem 3. How to use the posterior distribution for point estimates

Solution 3.

1. This function is convex and differentiable, so we differentiate with respect to θ̂ and find the point where the derivative is 0. Differentiating under the expectation gives E[2(θ̂ − θ)] = 0, so θ̂ = Eθ, which is the posterior mean.

2. Let F(θ) denote the posterior cumulative distribution function of θ. Again we differentiate with respect to θ̂, first writing the expected value in integral form:
$$\frac{d}{d\hat{\theta}} \int_{\Omega} |\theta - \hat{\theta}|\, p(\theta)\, d\theta = \frac{d}{d\hat{\theta}}\left(\int_{\{\theta > \hat{\theta}\}} (\theta - \hat{\theta})\, p(\theta)\, d\theta + \int_{\{\theta < \hat{\theta}\}} (\hat{\theta} - \theta)\, p(\theta)\, d\theta\right)$$
$$= \frac{d}{d\hat{\theta}}\left(\int_{\{\theta > \hat{\theta}\}} \theta\, p(\theta)\, d\theta - \int_{\{\theta < \hat{\theta}\}} \theta\, p(\theta)\, d\theta + \hat{\theta} F(\hat{\theta}) - \hat{\theta}\,(1 - F(\hat{\theta}))\right)$$
$$= F(\hat{\theta}) - (1 - F(\hat{\theta}))$$
Setting this equal to 0 and solving, we get F(θ̂) = 1/2; that is, θ̂ is the posterior median.

3. By definition, E L₀(θ, θ̂) = P(|θ − θ̂| > ε) = 1 − P(|θ − θ̂| ≤ ε). Taking ε → 0, this becomes 1 − P(θ = θ̂), which is minimized when P(θ = θ̂) is as large as possible; this occurs when θ̂ is chosen to be the maximizer of the posterior distribution, i.e., the mode of p(θ).
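A numerical check of all three results (my own illustration; the skewed Gamma density is an arbitrary stand-in for a posterior): minimizing the empirical squared, absolute, and 0-1 losses over a grid recovers the mean, median, and mode respectively.

```python
import numpy as np
from scipy import stats

post = stats.gamma(a=3.0, scale=1.0)             # stand-in "posterior"
theta = post.rvs(size=50_000, random_state=3)    # posterior draws
grid = np.linspace(0.01, 12, 600)                # candidate theta_hat values

l2 = [np.mean((theta - t) ** 2) for t in grid]        # squared-error loss
l1 = [np.mean(np.abs(theta - t)) for t in grid]       # absolute-error loss
eps = 0.05
l0 = [1 - np.mean(np.abs(theta - t) < eps) for t in grid]  # 0-1 loss, small eps

print(grid[np.argmin(l2)], post.mean())    # ~3.0, the posterior mean
print(grid[np.argmin(l1)], post.median())  # ~2.67, the posterior median
print(grid[np.argmin(l0)], 2.0)            # ~2.0, the mode (a - 1) * scale
```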
