STATS 370: Bayesian Statistics Stanford University, Winter 2016
Problem Set 7
Solutions provided by Ahmed Bou-Rabee

Problem 1. Multivariate Gaussian

Solution 1. We motivate our choice of conjugate prior by first writing the likelihood function:
\[
f(x_1,\dots,x_n \mid \mu,\Sigma) \propto |\Sigma|^{-n/2}\exp\!\left(-\frac{1}{2}\sum_{i=1}^{n}(x_i-\mu)^T\Sigma^{-1}(x_i-\mu)\right).
\]
Then, using
\[
\sum_{i=1}^{n}(x_i-\mu)^T\Sigma^{-1}(x_i-\mu) = n(\mu-\bar x)^T\Sigma^{-1}(\mu-\bar x) + \sum_{i=1}^{n}(x_i-\bar x)^T\Sigma^{-1}(x_i-\bar x)
\]
and the trace trick, we can write this as
\[
f(x_1,\dots,x_n \mid \mu,\Sigma) \propto |\Sigma|^{-n/2}\exp\!\left(-\frac{n}{2}(\bar x-\mu)^T\Sigma^{-1}(\bar x-\mu)\right)\exp\!\left(-\frac{1}{2}\operatorname{tr}\!\Big(\Sigma^{-1}\sum_{i=1}^{n}(x_i-\bar x)(x_i-\bar x)^T\Big)\right).
\]
This looks like the product of a multivariate normal distribution of \(\mu\) given \(\Sigma\) and an inverse-Wishart distribution of \(\Sigma\). Looking on Wikipedia, we see that this is the normal-inverse-Wishart distribution. Specifically, \((\mu,\Sigma)\sim \mathrm{NIW}(\mu_0,k_0,\Lambda_0,\nu_0)\), where \(\mu_0,k_0,\Lambda_0,\nu_0\) are hyperparameters of the normal-inverse-Wishart distribution.

We now derive the joint posterior distribution associated with this prior. Let \((\mu,\Sigma)\sim \mathrm{NIW}(\mu_0,k_0,\Lambda_0,\nu_0)\), let \(x_1,\dots,x_n\) be the observations from \(N(\mu,\Sigma)\), and let \(k\) denote the dimension. Then the posterior is
\[
p(\mu,\Sigma \mid x_1,\dots,x_n) \propto p(x_1,\dots,x_n \mid \mu,\Sigma)\,p(\mu,\Sigma)
\]
\[
\propto |\Sigma|^{-n/2}\exp\!\left(-\frac{n}{2}(\bar x-\mu)^T\Sigma^{-1}(\bar x-\mu)\right)\exp\!\left(-\frac{1}{2}\operatorname{tr}\!\Big(\Sigma^{-1}\sum_{i=1}^{n}(x_i-\bar x)(x_i-\bar x)^T\Big)\right)
\]
\[
\quad\times\ |\Sigma|^{-((\nu_0+k)/2+1)}\exp\!\left(-\frac{1}{2}\operatorname{tr}(\Lambda_0\Sigma^{-1}) - \frac{k_0}{2}(\mu-\mu_0)^T\Sigma^{-1}(\mu-\mu_0)\right),
\]
which we recognize as a normal-inverse-Wishart distribution with parameters
\[
\mu_0^* = \frac{k_0\mu_0 + n\bar x}{k_0+n},\qquad k_0^* = k_0+n,\qquad \nu_0^* = \nu_0+n,
\]
\[
\Lambda_0^* = \Lambda_0 + \sum_{i=1}^{n}(x_i-\bar x)(x_i-\bar x)^T + \frac{k_0 n}{k_0+n}(\bar x-\mu_0)(\bar x-\mu_0)^T.
\]
The posterior is in the same family as the prior, which is hence conjugate. (A short numerical sketch of this update appears after Problem 2.)

Now we compute the marginal distribution of \(\mu\) and find that it is the multivariate Student-t distribution. We do this by first computing the marginal under the joint prior and then substituting in the posterior parameters. We integrate out \(\Sigma\) and use the fact that if \(A\) is a \(k \times k\) nonsingular matrix and \(v\) is a \(k\)-dimensional column vector, then \(|A + vv^T| = |A|(1 + v^TA^{-1}v)\):
\[
\pi(\mu) = \int \pi(\mu,\Sigma)\,d\Sigma
\propto \int |\Sigma|^{-((\nu_0+k)/2+1)}\exp\!\left(-\frac{1}{2}\operatorname{tr}(\Lambda_0\Sigma^{-1}) - \frac{k_0}{2}(\mu-\mu_0)^T\Sigma^{-1}(\mu-\mu_0)\right)d\Sigma
\]
\[
\propto \left|\Lambda_0 + k_0(\mu-\mu_0)(\mu-\mu_0)^T\right|^{-(\nu_0+1)/2}
\propto \left[1 + k_0(\mu-\mu_0)^T\Lambda_0^{-1}(\mu-\mu_0)\right]^{-(\nu_0+1)/2}.
\]
We recognize this as the multivariate t distribution. Therefore, substituting the posterior parameters, the posterior marginal distribution of \(\mu\) is a multivariate t distribution with mean \(\mu_0^*\), \(\nu_0^* - k + 1 = \nu_0 + n - k + 1\) degrees of freedom, and scale matrix \(\Lambda_0^*/\big(k_0^*(\nu_0^* - k + 1)\big)\).

Problem 2. Collinearity

Solution 2. Let's assume that \(\beta \sim N\big(0,(\lambda I)^{-1}\big)\) for some \(\lambda > 0\), and let \(\Lambda = \lambda I\) denote the prior precision. We know that \(y_i - x_i^T\beta = \varepsilon_i \sim N(0,\sigma^2)\). Thus, using Bayes' rule, the posterior for \(\beta\) is
\[
p(\beta \mid y, X, \sigma^2) \propto p(\beta)\,p(y \mid X,\beta,\sigma^2) \propto \exp\!\left(-\frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta) - \frac{1}{2}\beta^T\Lambda\beta\right).
\]
Let \(\hat\beta = (X^TX + \sigma^2\Lambda)^{-1}X^Ty\). Completing the square, we can write
\[
\frac{1}{\sigma^2}(y-X\beta)^T(y-X\beta) + \beta^T\Lambda\beta = (\beta-\hat\beta)^T\,\frac{X^TX+\sigma^2\Lambda}{\sigma^2}\,(\beta-\hat\beta) + \text{terms that do not involve }\beta.
\]
Disregarding the components without \(\beta\), we get
\[
p(\beta \mid y, X, \sigma^2) \propto \exp\!\left(-\frac{1}{2}(\beta-\hat\beta)^T\,\frac{X^TX+\sigma^2\Lambda}{\sigma^2}\,(\beta-\hat\beta)\right),
\]
which we recognize as \(\beta \mid y, X, \sigma^2 \sim N\big((X^TX+\sigma^2\Lambda)^{-1}X^Ty,\ \sigma^2(X^TX+\sigma^2\Lambda)^{-1}\big)\). We complete the regression analysis by choosing the mean of the posterior as our estimate. That is, we take \(\hat\beta = (X^TX+\sigma^2\Lambda)^{-1}X^Ty\), which we recognize as the ridge regression solution. Note that while \(X^TX\) does not have an inverse because of collinearity, \(X^TX + \sigma^2\lambda I\) always does when \(\lambda > 0\), so the estimate remains well defined regardless of the rank of \(X\), in particular even when \(k > n\).
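As a quick numerical check of the Problem 1 update (an illustrative sketch, not part of the derivation above), the following Python snippet implements the map \((\mu_0,k_0,\Lambda_0,\nu_0)\mapsto(\mu_0^*,k_0^*,\Lambda_0^*,\nu_0^*)\). The function name niw_posterior and the test hyperparameters are arbitrary choices; only NumPy is assumed.

```python
# Illustrative sketch of the normal-inverse-Wishart update from Problem 1.
# The function name and the test values below are arbitrary choices.
import numpy as np

def niw_posterior(mu0, k0, Lambda0, nu0, X):
    """Return (mu0*, k0*, Lambda0*, nu0*) given data X of shape (n, k)."""
    n = X.shape[0]
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)          # sum_i (x_i - xbar)(x_i - xbar)^T
    mu_star = (k0 * mu0 + n * xbar) / (k0 + n)
    k_star = k0 + n
    Lambda_star = Lambda0 + S + (k0 * n / (k0 + n)) * np.outer(xbar - mu0, xbar - mu0)
    nu_star = nu0 + n
    return mu_star, k_star, Lambda_star, nu_star

# Sanity check: with many observations the posterior location tracks the
# sample mean, and Lambda0*/nu0* is close to the true covariance.
rng = np.random.default_rng(0)
Sigma_true = np.array([[2.0, 0.5, 0.0],
                       [0.5, 1.0, 0.3],
                       [0.0, 0.3, 1.5]])
X = rng.multivariate_normal(mean=[1.0, -2.0, 0.5], cov=Sigma_true, size=5000)
mu_star, k_star, Lambda_star, nu_star = niw_posterior(
    mu0=np.zeros(3), k0=1.0, Lambda0=np.eye(3), nu0=5.0, X=X)
print(mu_star)                # approximately [1, -2, 0.5]
print(Lambda_star / nu_star)  # roughly Sigma_true
```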
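For Problem 2, a similar sketch (again illustrative only; the dimensions, noise level, and \(\lambda\) below are arbitrary) shows that with an exactly collinear column \(X^TX\) is singular, yet the posterior mean \((X^TX+\sigma^2\lambda I)^{-1}X^Ty\) is well defined and minimizes the ridge objective \(\lVert y-X\beta\rVert^2 + \sigma^2\lambda\lVert\beta\rVert^2\).

```python
# Illustrative check for Problem 2: posterior mean = ridge solution, and it
# exists even though X^T X is singular. All constants below are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
n, sigma2, lam = 50, 0.25, 1.0

X = rng.normal(size=(n, 3))
X = np.column_stack([X, X[:, 0] + 2 * X[:, 1]])   # fourth column is exactly collinear
beta_true = np.array([1.0, -1.0, 0.5, 0.0])
y = X @ beta_true + np.sqrt(sigma2) * rng.normal(size=n)

print(np.linalg.matrix_rank(X.T @ X), X.shape[1])  # 3 4: the normal equations are singular

A = X.T @ X + sigma2 * lam * np.eye(X.shape[1])    # invertible for any lam > 0
beta_post_mean = np.linalg.solve(A, X.T @ y)       # posterior mean = ridge estimate
print(beta_post_mean)

# beta_post_mean minimizes ||y - X b||^2 + sigma2 * lam * ||b||^2:
obj = lambda b: np.sum((y - X @ b) ** 2) + sigma2 * lam * np.sum(b ** 2)
print(obj(beta_post_mean) <= obj(beta_post_mean + 1e-3 * rng.normal(size=4)))  # True
```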
Problem 3. How to use the posterior distribution for point estimates

Solution 3.

1. Quadratic loss. The expected posterior loss \(E[(\theta-\hat\theta)^2]\) is convex and differentiable in \(\hat\theta\), so we differentiate with respect to \(\hat\theta\) and take the point where the derivative is zero. Differentiating under the integral gives \(2\hat\theta - 2E[\theta] = 0\), so \(\hat\theta = E[\theta]\), the posterior mean.

2. Absolute loss. Let \(F(\theta)\) denote the cumulative posterior distribution function of \(\theta\). Then again, to find the minimizer we differentiate with respect to \(\hat\theta\), first rewriting the expected value in integral form:
\[
\frac{d}{d\hat\theta}\int_\Omega |\theta-\hat\theta|\,p(\theta)\,d\theta
= \frac{d}{d\hat\theta}\left(\int_{\{\theta>\hat\theta\}}(\theta-\hat\theta)\,p(\theta)\,d\theta + \int_{\{\theta<\hat\theta\}}(\hat\theta-\theta)\,p(\theta)\,d\theta\right)
\]
\[
= \frac{d}{d\hat\theta}\left(\int_{\{\theta>\hat\theta\}}\theta\,p(\theta)\,d\theta - \int_{\{\theta<\hat\theta\}}\theta\,p(\theta)\,d\theta + \hat\theta F(\hat\theta) - \hat\theta\big(1-F(\hat\theta)\big)\right)
\]
\[
= F(\hat\theta) - \big(1-F(\hat\theta)\big).
\]
Setting this equal to zero and solving, we get \(F(\hat\theta) = \tfrac{1}{2}\), i.e. \(\hat\theta\) is the posterior median.

3. Zero-one loss. By definition, \(E\,L_0(\theta,\hat\theta) = P(|\theta-\hat\theta| > \varepsilon) = 1 - P(|\theta-\hat\theta| \le \varepsilon)\). For small \(\varepsilon\), \(P(|\theta-\hat\theta| \le \varepsilon) \approx 2\varepsilon\,p(\hat\theta)\), so as \(\varepsilon \to 0\) the expected loss is minimized by making \(p(\hat\theta)\) as large as possible. This occurs when \(\hat\theta\) is chosen to be the maximizer of the posterior density, that is, the mode of \(p(\theta)\).
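As an empirical illustration of Solution 3 (a Monte Carlo sketch added for intuition; the Gamma(2, 1) "posterior", sample size, and grid are arbitrary), the candidate estimate minimizing the average squared loss lands near the posterior mean, the one minimizing the average absolute loss near the posterior median, and the mode is the density maximizer relevant to the \(\varepsilon \to 0\) zero-one loss.

```python
# Monte Carlo illustration of Problem 3 using an arbitrary skewed Gamma(2, 1)
# "posterior": mean = 2, median ~ 1.68, mode = 1.
import numpy as np

rng = np.random.default_rng(2)
shape, scale = 2.0, 1.0
theta = rng.gamma(shape, scale, size=100_000)   # posterior draws

mean, median, mode = theta.mean(), np.median(theta), (shape - 1) * scale

grid = np.linspace(0.5, 3.5, 601)               # candidate point estimates
sq_loss  = [np.mean((theta - t) ** 2) for t in grid]
abs_loss = [np.mean(np.abs(theta - t)) for t in grid]

print(grid[np.argmin(sq_loss)], mean)     # both near 2.0 (posterior mean)
print(grid[np.argmin(abs_loss)], median)  # both near ~1.68 (posterior median)
print(mode)                               # 1.0, the density maximizer from part 3
```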