STATS 370: Bayesian Statistics, Stanford University, Winter 2016
Problem Set 7
Solution provided by Ahmed Bou-Rabee

Problem 1. Multivariate Gaussian

Solution 1. We motivate our choice of conjugate prior by first writing the likelihood function:
$$
f(x_1, \dots, x_n \mid \mu, \Sigma) \propto |\Sigma|^{-n/2} \exp\left( -\frac{1}{2} \sum_{i=1}^n (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \right).
$$
Then, using $\sum_{i=1}^n (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) = n(\mu - \bar{x})^T \Sigma^{-1} (\mu - \bar{x}) + \sum_{i=1}^n (x_i - \bar{x})^T \Sigma^{-1} (x_i - \bar{x})$ and the trace trick, we can write this as
$$
f(x_1, \dots, x_n \mid \mu, \Sigma) \propto |\Sigma|^{-n/2} \exp\left( -\frac{n}{2} (\bar{x} - \mu)^T \Sigma^{-1} (\bar{x} - \mu) \right) \exp\left( -\frac{1}{2} \operatorname{tr}\left( \Sigma^{-1} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T \right) \right).
$$
This looks like the product of a multivariate normal distribution for $\mu$ given $\Sigma$ and an inverse-Wishart distribution for $\Sigma$. Looking on Wikipedia, we see that this is the normal-inverse-Wishart distribution. Specifically, $(\mu, \Sigma) \sim \mathrm{NIW}(\mu_0, k_0, \lambda_0, v_0)$, where $\mu_0, k_0, \lambda_0, v_0$ are the hyperparameters of the normal-inverse-Wishart distribution.

We now derive the joint posterior distribution associated with this prior. Let $(\mu, \Sigma) \sim \mathrm{NIW}(\mu_0, k_0, \lambda_0, v_0)$ and let $x_1, \dots, x_n$ be observations from $N(\mu, \Sigma)$ in dimension $k$. Then the posterior is
$$
\begin{aligned}
p(\mu, \Sigma \mid x_1, \dots, x_n) &\propto p(x_1, \dots, x_n \mid \mu, \Sigma)\, p(\mu, \Sigma) \\
&\propto |\Sigma|^{-n/2} \exp\left( -\frac{n}{2} (\bar{x} - \mu)^T \Sigma^{-1} (\bar{x} - \mu) - \frac{1}{2} \operatorname{tr}\left( \Sigma^{-1} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T \right) \right) \\
&\quad \times |\Sigma|^{-\left(\frac{v_0 + k}{2} + 1\right)} \exp\left( -\frac{1}{2} \operatorname{tr}(\lambda_0 \Sigma^{-1}) - \frac{k_0}{2} (\mu - \mu_0)^T \Sigma^{-1} (\mu - \mu_0) \right),
\end{aligned}
$$
which we recognize as a normal-inverse-Wishart distribution with parameters
$$
\mu_0^* = \frac{k_0 \mu_0 + n \bar{x}}{k_0 + n}, \qquad k_0^* = k_0 + n, \qquad \lambda_0^* = \lambda_0 + \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T + \frac{k_0 n}{k_0 + n} (\bar{x} - \mu_0)(\bar{x} - \mu_0)^T, \qquad v_0^* = v_0 + n.
$$
This is the same family as the prior, which is hence conjugate.

Now we compute the marginal distribution of $\mu$ and find that it is a multivariate Student-$t$ distribution. We do this by first computing the marginal under the prior and then substituting in the posterior parameters. We integrate out $\Sigma$ and use the fact that if $A$ is a $k \times k$ nonsingular matrix and $v$ is a $k$-dimensional column vector, then $|A + vv^T| = |A|(1 + v^T A^{-1} v)$:
$$
\begin{aligned}
\pi(\mu) &= \int \pi(\mu, \Sigma)\, d\Sigma \\
&\propto \int |\Sigma|^{-\left(\frac{v_0 + k}{2} + 1\right)} \exp\left( -\frac{1}{2} \operatorname{tr}(\lambda_0 \Sigma^{-1}) - \frac{k_0}{2} (\mu - \mu_0)^T \Sigma^{-1} (\mu - \mu_0) \right) d\Sigma \\
&\propto \left| \lambda_0 + k_0 (\mu - \mu_0)(\mu - \mu_0)^T \right|^{-(v_0 + 1)/2} \\
&\propto \left[ 1 + k_0 (\mu - \mu_0)^T \lambda_0^{-1} (\mu - \mu_0) \right]^{-(v_0 + 1)/2}.
\end{aligned}
$$
We recognize this as a multivariate $t$ distribution. Therefore, substituting the posterior parameters, the posterior marginal distribution of $\mu$ is a multivariate $t$ distribution with $v_0^* - k + 1 = v_0 + n - k + 1$ degrees of freedom, location $\mu_0^*$, and scale matrix $\lambda_0^*/\bigl(k_0^*(v_0^* - k + 1)\bigr)$.

Problem 2. Collinearity

Solution 2. Let us assume the prior $\beta \sim N\bigl(0, (\lambda I)^{-1}\bigr)$ for some $\lambda > 0$, and let $\Lambda = \lambda I$ denote the prior precision. We know that $\epsilon_i = y_i - x_i^T \beta \sim N(0, \sigma^2)$. Thus, using Bayes' rule, the posterior for $\beta$ is
$$
p(\beta \mid y, X, \sigma^2) \propto p(\beta)\, p(y \mid X, \beta, \sigma^2) \propto \exp\left( -\frac{1}{2\sigma^2} (y - X\beta)^T (y - X\beta) - \frac{1}{2} \beta^T \Lambda \beta \right).
$$
Let $\hat{\beta} = (X^T X + \sigma^2 \Lambda)^{-1} X^T y$. Completing the square in $\beta$, and using $(X^T X + \sigma^2 \Lambda)\hat{\beta} = X^T y$, we can write
$$
\frac{1}{\sigma^2}(y - X\beta)^T (y - X\beta) + \beta^T \Lambda \beta = \frac{1}{\sigma^2} (\beta - \hat{\beta})^T (X^T X + \sigma^2 \Lambda)(\beta - \hat{\beta}) + \frac{1}{\sigma^2}\left( y^T y - \hat{\beta}^T (X^T X + \sigma^2 \Lambda) \hat{\beta} \right).
$$
Disregarding the components that do not involve $\beta$, we get
$$
p(\beta \mid y, X, \sigma^2) \propto \exp\left( -\frac{1}{2\sigma^2} \bigl(\beta - (X^T X + \sigma^2 \Lambda)^{-1} X^T y\bigr)^T (X^T X + \sigma^2 \Lambda) \bigl(\beta - (X^T X + \sigma^2 \Lambda)^{-1} X^T y\bigr) \right).
$$
We recognize this as $\beta \mid y, X, \sigma^2 \sim N\bigl( (X^T X + \sigma^2 \Lambda)^{-1} X^T y,\ \sigma^2 (X^T X + \sigma^2 \Lambda)^{-1} \bigr)$. We complete the regression analysis by choosing the mean of the posterior as our estimate, that is, $\hat{\beta} = (X^T X + \sigma^2 \Lambda)^{-1} X^T y$, which we recognize as the ridge regression solution. Note that while $X^T X$ does not have an inverse because of collinearity, $X^T X + \sigma^2 \lambda I$ always does when $\lambda > 0$, so the estimate exists even when the columns of $X$ are collinear or when $k > n$.
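As a quick numerical illustration of this result, here is a minimal NumPy sketch. The simulated nearly collinear design, the value of $\lambda$, and the noise level are purely illustrative (not part of the problem): it forms the posterior mean $(X^T X + \sigma^2 \Lambda)^{-1} X^T y$ directly and checks that it coincides with the ridge estimate with penalty $\sigma^2 \lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: a design matrix with two nearly collinear columns.
n, k = 50, 3
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, -2.0, 0.5])
sigma = 0.3
y = X @ beta_true + sigma * rng.normal(size=n)

# Prior beta ~ N(0, (lambda I)^{-1}); Lambda = lambda * I is the prior precision.
lam = 2.0
Lambda = lam * np.eye(k)

# Posterior mean and covariance from the derivation above.
A = X.T @ X + sigma**2 * Lambda
post_mean = np.linalg.solve(A, X.T @ y)
post_cov = sigma**2 * np.linalg.inv(A)

# Ridge estimate minimizing ||y - Xb||^2 + sigma^2 * lam * ||b||^2.
ridge = np.linalg.solve(X.T @ X + sigma**2 * lam * np.eye(k), X.T @ y)

print(post_mean)
print(np.allclose(post_mean, ridge))  # True: posterior mean equals the ridge solution
```

Even though the first two columns of `X` are nearly identical, the regularized matrix `A` is well conditioned, so the posterior mean is stable where the ordinary least-squares solution would not be.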
Problem 3. How to use the posterior distribution for point estimates

Solution 3.

1. For squared-error loss $(\theta - \hat{\theta})^2$, the expected posterior loss is convex and differentiable in $\hat{\theta}$, so we differentiate with respect to $\hat{\theta}$ and take the point where the derivative is zero. Differentiating under the expectation gives $\frac{d}{d\hat{\theta}} E(\theta - \hat{\theta})^2 = 2\hat{\theta} - 2E\theta = 0$, so $\hat{\theta} = E\theta$, which is the posterior mean.

2. For absolute-error loss, let $F(\theta)$ denote the cumulative posterior distribution function of $\theta$. Again, to find the minimizer we differentiate with respect to $\hat{\theta}$, first writing the expected value in integral form:
$$
\begin{aligned}
\frac{d}{d\hat{\theta}} \int_\Omega |\theta - \hat{\theta}|\, p(\theta)\, d\theta
&= \frac{d}{d\hat{\theta}} \left( \int_{\{\theta > \hat{\theta}\}} (\theta - \hat{\theta})\, p(\theta)\, d\theta + \int_{\{\theta < \hat{\theta}\}} (\hat{\theta} - \theta)\, p(\theta)\, d\theta \right) \\
&= \frac{d}{d\hat{\theta}} \left( \int_{\{\theta > \hat{\theta}\}} \theta\, p(\theta)\, d\theta - \int_{\{\theta < \hat{\theta}\}} \theta\, p(\theta)\, d\theta + \hat{\theta} F(\hat{\theta}) - \hat{\theta}\bigl(1 - F(\hat{\theta})\bigr) \right) \\
&= F(\hat{\theta}) - \bigl(1 - F(\hat{\theta})\bigr).
\end{aligned}
$$
Setting this equal to zero and solving, we get $F(\hat{\theta}) = \tfrac{1}{2}$, so $\hat{\theta}$ is the posterior median.

3. For the 0-1 loss with tolerance $\epsilon$, by definition $E L_0(\theta, \hat{\theta}) = P(|\theta - \hat{\theta}| > \epsilon) = 1 - P(|\theta - \hat{\theta}| \le \epsilon)$. Taking $\epsilon \to 0$, we have $P(|\theta - \hat{\theta}| \le \epsilon) \approx 2\epsilon\, p(\hat{\theta})$, so the expected loss is minimized when $p(\hat{\theta})$ is as large as possible, which occurs when $\hat{\theta}$ is chosen at the maximum of the posterior density, i.e., the mode of $p(\theta)$ (a numerical check of all three rules is sketched below).
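As a numerical check of the three rules above, here is a minimal sketch using NumPy and SciPy, assuming an illustrative Beta(8, 4) posterior chosen only because its mean, median, and mode are easy to compute. For each loss it minimizes the Monte Carlo expected loss over a grid of candidate estimates and compares the minimizer with the corresponding closed-form posterior summary.

```python
import numpy as np
from scipy import stats

# Illustrative posterior: Beta(a, b), e.g. from a flat prior and binomial data.
a, b = 8.0, 4.0
posterior = stats.beta(a, b)

# Optimal point estimates under the three losses derived above.
post_mean = posterior.mean()            # squared-error loss  -> posterior mean
post_median = posterior.ppf(0.5)        # absolute-error loss -> posterior median
post_mode = (a - 1) / (a + b - 2)       # 0-1 loss            -> posterior mode (MAP)

# Monte Carlo check: minimize each expected loss over a grid using posterior draws.
rng = np.random.default_rng(0)
draws = posterior.rvs(size=200_000, random_state=rng)
grid = np.linspace(0.01, 0.99, 981)
eps = 0.005
l2 = [(np.mean((draws - t) ** 2), t) for t in grid]
l1 = [(np.mean(np.abs(draws - t)), t) for t in grid]
l0 = [(np.mean(np.abs(draws - t) > eps), t) for t in grid]

print(post_mean, min(l2)[1])    # both close to a / (a + b) = 0.667
print(post_median, min(l1)[1])  # both close to the Beta(8, 4) median
print(post_mode, min(l0)[1])    # both close to (a - 1) / (a + b - 2) = 0.7
```

The grid minimizers of the empirical squared, absolute, and 0-1 losses land near the posterior mean, median, and mode respectively, as the derivation predicts.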
