
STATS 370: Stanford University, Winter 2016

Problem Set 7 Solution provided by Ahmed Bou-Rabee

Problem 1. Multivariate Gaussian

Solution 1. We motivate our choice of conjugate prior by first writing the likelihood:
$$f(x_1, \ldots, x_n) \propto |\Sigma|^{-n/2} \exp\!\left(-\frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)^T \Sigma^{-1}(x_i - \mu)\right)$$
Then, using the identity
$$\sum_{i=1}^{n}(x_i - \mu)^T \Sigma^{-1}(x_i - \mu) = n(\mu - \bar{x})^T \Sigma^{-1}(\mu - \bar{x}) + \sum_{i=1}^{n}(x_i - \bar{x})^T \Sigma^{-1}(x_i - \bar{x})$$
and the trace trick, we can write this as
$$f(x_1, \ldots, x_n) \propto |\Sigma|^{-n/2} \exp\!\left(-\frac{n}{2}(\bar{x} - \mu)^T \Sigma^{-1}(\bar{x} - \mu)\right) \exp\!\left(-\frac{1}{2}\operatorname{tr}\!\left(\Sigma^{-1}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T\right)\right)$$
This looks like the product of a multivariate normal density in µ given Σ and an Inverse-Wishart density in Σ. Looking on Wikipedia, we see that this is the multivariate normal-Inverse-Wishart distribution. Specifically, (µ, Σ) ∼ NIW(µ0, k0, λ0, v0), where µ0, k0, λ0, v0 are the hyperparameters of the normal-Inverse-Wishart distribution.

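As a quick sanity check (not part of the original solution), the sum-of-squares decomposition and the trace trick can be verified numerically. A minimal sketch with arbitrary made-up test values, using numpy:

```python
import numpy as np

# Arbitrary test values: n observations of dimension k, any mean and
# any positive-definite covariance.
rng = np.random.default_rng(0)
n, k = 50, 3
X = rng.normal(size=(n, k))          # rows are x_1, ..., x_n
mu = rng.normal(size=k)
A = rng.normal(size=(k, k))
Sigma = A @ A.T + k * np.eye(k)      # positive definite by construction
P = np.linalg.inv(Sigma)             # Sigma^{-1}

xbar = X.mean(axis=0)
d = X - mu
lhs = np.einsum("ij,jk,ik->", d, P, d)       # sum_i (x_i - mu)^T P (x_i - mu)

S = (X - xbar).T @ (X - xbar)                # scatter matrix
rhs = n * (mu - xbar) @ P @ (mu - xbar) + np.trace(P @ S)  # trace trick

assert np.isclose(lhs, rhs)
```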
We now derive the joint posterior distribution associated with this prior. Let (µ, Σ) ∼ NIW(µ0, k0, λ0, v0), and let x1, . . . , xn be the observations from N(µ, Σ). Then the posterior satisfies
$$p(\Sigma, \mu \mid x_1, \ldots, x_n) \propto p(x_1, \ldots, x_n \mid \Sigma, \mu)\, p(\Sigma, \mu)$$
$$\propto |\Sigma|^{-n/2} \exp\!\left(-\frac{n}{2}(\bar{x} - \mu)^T \Sigma^{-1}(\bar{x} - \mu)\right) \exp\!\left(-\frac{1}{2}\operatorname{tr}\!\left(\Sigma^{-1}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T\right)\right)$$
$$\quad \times\, |\Sigma|^{-\left(\frac{v_0 + k}{2} + 1\right)} \exp\!\left(-\frac{1}{2}\operatorname{tr}(\lambda_0 \Sigma^{-1}) - \frac{k_0}{2}(\mu - \mu_0)^T \Sigma^{-1}(\mu - \mu_0)\right)$$

which we see is the normal-Inverse-Wishart distribution with parameters
$$\mu_0^* = \frac{k_0 \mu_0 + n\bar{x}}{k_0 + n}, \qquad k_0^* = k_0 + n, \qquad v_0^* = v_0 + n,$$
$$\lambda_0^* = \lambda_0 + \sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T + \frac{k_0 n}{k_0 + n}(\mu_0 - \bar{x})(\mu_0 - \bar{x})^T.$$
This is the same family as the prior, which is hence conjugate.

Now we compute the marginal posterior distribution of µ, and we find that it is a multivariate Student-t distribution. We do this by computing the marginal of the joint density and substituting in the posterior parameters. We integrate out Σ, using the fact that if A is a k × k nonsingular matrix and v is a k-dimensional column vector, then |A + vvᵀ| = |A|(1 + vᵀA⁻¹v). With the posterior parameters,
$$\pi(\mu \mid x_1, \ldots, x_n) = \int \pi(\mu, \Sigma \mid x_1, \ldots, x_n)\, d\Sigma$$
$$\propto \int |\Sigma|^{-\left(\frac{v_0^* + k}{2} + 1\right)} \exp\!\left(-\frac{1}{2}\operatorname{tr}(\lambda_0^* \Sigma^{-1}) - \frac{k_0^*}{2}(\mu - \mu_0^*)^T \Sigma^{-1}(\mu - \mu_0^*)\right) d\Sigma$$
$$\propto \left|\lambda_0^* + (k_0 + n)(\mu - \mu_0^*)(\mu - \mu_0^*)^T\right|^{-(v_0 + n + 1)/2}$$

$$\propto \left[1 + (k_0 + n)(\mu - \mu_0^*)^T (\lambda_0^*)^{-1}(\mu - \mu_0^*)\right]^{-(v_0 + n + 1)/2}$$
We recognize this as the kernel of a multivariate t distribution. Therefore, the posterior marginal distribution of µ is a multivariate t distribution with mean µ0*, v0* − k + 1 = v0 + n − k + 1 degrees of freedom, and scale matrix λ0*/(k0*(v0* − k + 1)).

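Since the conjugate update is purely mechanical, a short sketch may help. This is my own illustration, not from the problem set; the function name and the prior values in the example are made up. It maps data and NIW hyperparameters to the starred posterior parameters derived above, then builds the marginal t distribution of µ:

```python
import numpy as np
from scipy import stats

def niw_posterior(X, mu0, k0, lam0, v0):
    """Conjugate normal-Inverse-Wishart update for N(mu, Sigma) data.

    X: (n, k) array of observations; (mu0, k0, lam0, v0) are the prior
    hyperparameters. Returns the posterior hyperparameters.
    """
    n, k = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)                 # scatter matrix
    mu_star = (k0 * mu0 + n * xbar) / (k0 + n)
    k_star = k0 + n
    d = (mu0 - xbar).reshape(-1, 1)
    lam_star = lam0 + S + (k0 * n / (k0 + n)) * (d @ d.T)
    v_star = v0 + n
    return mu_star, k_star, lam_star, v_star

# Example with a weakly informative prior (hypothetical values):
rng = np.random.default_rng(1)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.3], [0.3, 1.0]], size=200)
k = X.shape[1]
mu_n, k_n, lam_n, v_n = niw_posterior(X, mu0=np.zeros(k), k0=1.0,
                                      lam0=np.eye(k), v0=k + 2)

# Marginal posterior of mu: multivariate t with v_n - k + 1 dof,
# location mu_n, scale matrix lam_n / (k_n * (v_n - k + 1)).
dof = v_n - k + 1
marg = stats.multivariate_t(loc=mu_n, shape=lam_n / (k_n * dof), df=dof)
```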
Problem 2. Collinearity

Solution 2. Let's assume the prior β ∼ N(0, Λ⁻¹) with Λ = λI for some λ > 0. We know that y − x_iᵀβ = ε_i ∼ N(0, σ²). Thus, using Bayes' rule, the posterior for β is
$$p(\beta \mid y, X, \sigma^2) \propto p(\beta)\, p(y, X \mid \beta, \sigma^2) \propto \exp\!\left(-\frac{1}{2\sigma^2}(y - X\beta)^T(y - X\beta) - \frac{1}{2}\beta^T \Lambda \beta\right)$$
Let β̂ = (XᵀX + σ²Λ)⁻¹Xᵀy. Then, completing the square, we can write
$$\frac{1}{\sigma^2}(y - X\beta)^T(y - X\beta) + \beta^T \Lambda \beta = (\beta - \hat{\beta})^T \frac{X^T X + \sigma^2 \Lambda}{\sigma^2}(\beta - \hat{\beta}) + \frac{1}{\sigma^2}\left(y^T y - \hat{\beta}^T (X^T X + \sigma^2 \Lambda)\hat{\beta}\right)$$
Discarding the terms that do not involve β, we get
$$p(\beta \mid y, X, \sigma^2) \propto \exp\!\left(-\frac{1}{2}(\beta - \hat{\beta})^T \frac{X^T X + \sigma^2 \Lambda}{\sigma^2}(\beta - \hat{\beta})\right)$$
We recognize this as p(β | y, X, σ²) = N((XᵀX + σ²Λ)⁻¹Xᵀy, σ²(XᵀX + σ²Λ)⁻¹). We complete the regression analysis by choosing the mean of the posterior as our estimate; that is, β̂ = (XᵀX + σ²Λ)⁻¹Xᵀy, which we recognize as the ridge regression solution. Note that while XᵀX does not have an inverse because of collinearity, XᵀX + λI always does when λ > 0; in particular, the solution still exists even when k > n.

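To make the collinearity point concrete, here is a small numerical sketch (my own, with made-up data): the design matrix has a duplicated column, so XᵀX is singular and ordinary least squares fails, yet the posterior mean is still well defined.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 3
X = rng.normal(size=(n, k))
X = np.column_stack([X, X[:, 0]])      # duplicate a column: X^T X is singular
beta_true = np.array([1.0, -0.5, 2.0, 0.0])
sigma = 0.5
y = X @ beta_true + sigma * rng.normal(size=n)

lam = 1.0                              # prior precision scale, Lambda = lam * I
M = X.T @ X + sigma**2 * lam * np.eye(X.shape[1])
beta_hat = np.linalg.solve(M, X.T @ y) # posterior mean = ridge solution

# X^T X alone has deficient rank, but M is positive definite:
print(np.linalg.matrix_rank(X.T @ X), M.shape[0])   # 3 vs. 4
print(beta_hat)
```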
Problem 3. How to use the posterior distribution for point estimates

Solution 3.

1. This function is convex and differentiable, so we differentiate with respect to θ̂ and find the point where the derivative is 0. Differentiating under the expectation gives E[2(θ̂ − θ)] = 0, so θ̂ = Eθ, which is the posterior mean.

2. Let F(θ) denote the posterior cumulative distribution function of θ. Again we differentiate with respect to θ̂, first writing the expected value in integral form:
$$\frac{d}{d\hat{\theta}} \int_{\Omega} |\theta - \hat{\theta}|\, p(\theta)\, d\theta = \frac{d}{d\hat{\theta}}\left(\int_{\{\theta > \hat{\theta}\}} (\theta - \hat{\theta})\, p(\theta)\, d\theta + \int_{\{\theta < \hat{\theta}\}} (\hat{\theta} - \theta)\, p(\theta)\, d\theta\right)$$
$$= \frac{d}{d\hat{\theta}}\left(\int_{\{\theta > \hat{\theta}\}} \theta\, p(\theta)\, d\theta - \int_{\{\theta < \hat{\theta}\}} \theta\, p(\theta)\, d\theta + \hat{\theta} F(\hat{\theta}) - \hat{\theta}\,(1 - F(\hat{\theta}))\right)$$
$$= F(\hat{\theta}) - (1 - F(\hat{\theta}))$$
Setting this equal to 0 and solving, we get F(θ̂) = 1/2; that is, θ̂ is the posterior median.

3. By definition, E L₀(θ, θ̂) = P(|θ − θ̂| > ε) = 1 − P(|θ − θ̂| ≤ ε). Taking ε → 0, this becomes 1 − P(θ = θ̂), which is minimized when P(θ = θ̂) is as large as possible; this occurs when θ̂ is chosen to be the maximizer of the posterior distribution, i.e., the mode of p(θ).
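A numerical check of all three results (my own illustration; the skewed Gamma density is an arbitrary stand-in for a posterior): minimizing the empirical squared, absolute, and 0-1 losses over a grid recovers the mean, median, and mode respectively.

```python
import numpy as np
from scipy import stats

post = stats.gamma(a=3.0, scale=1.0)             # stand-in "posterior"
theta = post.rvs(size=50_000, random_state=3)    # posterior draws
grid = np.linspace(0.01, 12, 600)                # candidate theta_hat values

l2 = [np.mean((theta - t) ** 2) for t in grid]        # squared-error loss
l1 = [np.mean(np.abs(theta - t)) for t in grid]       # absolute-error loss
eps = 0.05
l0 = [1 - np.mean(np.abs(theta - t) < eps) for t in grid]  # 0-1 loss, small eps

print(grid[np.argmin(l2)], post.mean())    # ~3.0, the posterior mean
print(grid[np.argmin(l1)], post.median())  # ~2.67, the posterior median
print(grid[np.argmin(l0)], 2.0)            # ~2.0, the mode (a - 1) * scale
```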
