9. Linear Models and Regression
AFM Smith

Objective
To illustrate the Bayesian approach to fitting normal and generalized linear models.

Recommended reading
• Lindley, D.V. and Smith, A.F.M. (1972). Bayes estimates for the linear model (with discussion). Journal of the Royal Statistical Society B, 34, 1–41.
• Broemeling, L.D. (1985). Bayesian Analysis of Linear Models. Marcel Dekker.
• Gelman, A., Carlin, J.B., Stern, H.S. and Rubin, D.B. (2003). Bayesian Data Analysis, Chapter 8.
• Wiper, M.P., Pettit, L.I. and Young, K.D.S. (2000). Bayesian inference for a Lanchester type combat model. Naval Research Logistics, 42, 609–633.

Introduction: the multivariate normal distribution

Definition 22
A random variable X = (X_1, ..., X_k)^T is said to have a multivariate normal distribution with mean µ and variance-covariance matrix Σ if

f(x|µ, Σ) = (2π)^{-k/2} |Σ|^{-1/2} exp( -(1/2)(x − µ)^T Σ^{-1}(x − µ) )   for x ∈ R^k.

In this case, we write X|µ, Σ ∼ N(µ, Σ).

The following properties of the multivariate normal distribution are well known.

i. Any subset of X has a (multivariate) normal distribution.
ii. Any linear combination ∑_{i=1}^k α_i X_i is normally distributed.
iii. If Y = a + BX is a linear transformation of X, then Y|µ, Σ ∼ N(a + Bµ, B Σ B^T).
iv. If X is partitioned as X = (X_1, X_2) with µ = (µ_1, µ_2) and Σ = ( Σ_{11}  Σ_{12} ; Σ_{21}  Σ_{22} ), then the conditional density of X_1 given X_2 = x_2 is
    X_1|x_2, µ, Σ ∼ N( µ_1 + Σ_{12} Σ_{22}^{-1}(x_2 − µ_2), Σ_{11} − Σ_{12} Σ_{22}^{-1} Σ_{21} ).

The multivariate normal likelihood function

Suppose that we observe a sample x = (x_1, ..., x_n) of data from N(µ, Σ). Then the likelihood function is given by

l(µ, Σ|x) = (2π)^{-nk/2} |Σ|^{-n/2} exp( -(1/2) ∑_{i=1}^n (x_i − µ)^T Σ^{-1}(x_i − µ) )
          ∝ |Σ|^{-n/2} exp( -(1/2) [ ∑_{i=1}^n (x_i − x̄)^T Σ^{-1}(x_i − x̄) + n(µ − x̄)^T Σ^{-1}(µ − x̄) ] )
          ∝ |Σ|^{-n/2} exp( -(1/2) [ tr(S Σ^{-1}) + n(µ − x̄)^T Σ^{-1}(µ − x̄) ] ),

where x̄ = (1/n) ∑_{i=1}^n x_i, S = ∑_{i=1}^n (x_i − x̄)(x_i − x̄)^T and tr(M) represents the trace of the matrix M.
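As a quick illustration (a minimal sketch, not part of the original notes), the sufficient statistics x̄ and S can be computed directly from an n × k data matrix; the data-generating values below are arbitrary and purely for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n observations of a k-dimensional normal vector.
n, k = 100, 3
true_mu = np.array([1.0, -2.0, 0.5])
true_Sigma = np.array([[2.0, 0.3, 0.0],
                       [0.3, 1.0, 0.2],
                       [0.0, 0.2, 0.5]])
x = rng.multivariate_normal(true_mu, true_Sigma, size=n)   # n x k data matrix

# Sufficient statistics appearing in the likelihood:
# x_bar = (1/n) sum_i x_i  and  S = sum_i (x_i - x_bar)(x_i - x_bar)^T.
x_bar = x.mean(axis=0)
centred = x - x_bar
S = centred.T @ centred            # note: S is not divided by n or n-1

print("x_bar =", x_bar)
print("S/n   =", S / n)            # compare with true_Sigma
```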
It is possible to carry out Bayesian inference with conjugate priors for µ, Σ. We shall consider two cases which reflect different levels of knowledge about the variance-covariance matrix Σ.

Conjugate Bayesian inference for the multivariate normal distribution I: Σ = (1/φ)C

Firstly, consider the case where the variance-covariance matrix is known up to a constant, i.e. Σ = (1/φ)C where C is a known matrix. Then we have X|µ, φ ∼ N(µ, (1/φ)C) and the likelihood function is

l(µ, φ|x) ∝ φ^{nk/2} exp( -(φ/2) [ tr(S C^{-1}) + n(µ − x̄)^T C^{-1}(µ − x̄) ] ).

Analogous to the univariate case, it can be seen that a multivariate normal-gamma prior distribution is conjugate.

The multivariate normal-gamma distribution

We say that µ, φ have a multivariate normal-gamma prior with parameters m, V^{-1}, a/2, b/2 if

µ|φ ∼ N(m, (1/φ)V),    φ ∼ G(a/2, b/2).

In this case, we write µ, φ ∼ NG(m, V^{-1}, a/2, b/2).

The marginal distribution of µ

In this case, the marginal distribution of µ is a multivariate, non-central t distribution.

Definition 23
A (k-dimensional) random variable T = (T_1, ..., T_k) has a multivariate t distribution with parameters d, µ_T, Σ_T if

f(t) = Γ((d + k)/2) / ( Γ(d/2) (πd)^{k/2} |Σ_T|^{1/2} ) [ 1 + (1/d)(t − µ_T)^T Σ_T^{-1}(t − µ_T) ]^{-(d + k)/2}.

In this case, we write T ∼ T(µ_T, Σ_T, d). The following theorem gives the marginal density of µ.

Theorem 34
Let µ, φ ∼ NG(m, V^{-1}, a/2, b/2). Then the marginal density of µ is µ ∼ T(m, (b/a)V, a).

Proof

p(µ) = ∫_0^∞ p(µ, φ) dφ
     = ∫_0^∞ p(µ|φ) p(φ) dφ
     = ∫_0^∞ φ^{k/2} / ( (2π)^{k/2} |V|^{1/2} ) exp( -(φ/2)(µ − m)^T V^{-1}(µ − m) ) · (b/2)^{a/2} / Γ(a/2) · φ^{a/2 − 1} exp( -(b/2)φ ) dφ
     = (b/2)^{a/2} / ( (2π)^{k/2} |V|^{1/2} Γ(a/2) ) ∫_0^∞ φ^{(a + k)/2 − 1} exp( -(φ/2) [ b + (µ − m)^T V^{-1}(µ − m) ] ) dφ
     = (b/2)^{a/2} / ( (2π)^{k/2} |V|^{1/2} Γ(a/2) ) Γ((a + k)/2) ( [ b + (µ − m)^T V^{-1}(µ − m) ] / 2 )^{-(a + k)/2}

and reordering terms proves the result.

The posterior distribution of µ, φ

Theorem 35
Let X|µ, φ ∼ N(µ, (1/φ)C) and assume the prior distributions µ|φ ∼ N(m, (1/φ)V) and φ ∼ G(a/2, b/2). Then, given sample data x, we have

µ|x, φ ∼ N(m*, (1/φ)V*),    φ|x ∼ G(a*/2, b*/2),

where

V* = (V^{-1} + n C^{-1})^{-1}
m* = V* (V^{-1} m + n C^{-1} x̄)
a* = a + nk
b* = b + tr(S C^{-1}) + m^T V^{-1} m + n x̄^T C^{-1} x̄ − m*^T V*^{-1} m*.

Proof Exercise. The proof is analogous to the univariate case.

A simplification

In the case where V ∝ C, we have a simplified result, similar to that for the univariate case.

Theorem 36
Let µ, φ ∼ NG(m, αC^{-1}, a/2, b/2). Then

µ|φ, x ∼ N( (αm + nx̄)/(α + n), (1/((α + n)φ)) C )
φ|x ∼ G( (a + nk)/2, ( b + tr( [ S + (αn/(α + n))(m − x̄)(m − x̄)^T ] C^{-1} ) ) / 2 ).

Proof This follows from the previous theorem, substituting V = (1/α)C.

Results with the reference prior

Theorem 37
Given the prior p(µ, φ) ∝ 1/φ, the posterior distribution is

p(µ, φ|x) ∝ φ^{nk/2 − 1} exp( -(φ/2) [ tr(S C^{-1}) + n(µ − x̄)^T C^{-1}(µ − x̄) ] )

µ|x, φ ∼ N( x̄, (1/(nφ)) C )
φ|x ∼ G( (n − 1)k / 2, tr(S C^{-1}) / 2 )
µ|x ∼ T( x̄, ( tr(S C^{-1}) / (n(n − 1)k) ) C, (n − 1)k ).

Proof µ, φ|x ∼ NG( x̄, n C^{-1}, (n − 1)k/2, tr(S C^{-1})/2 ) and the rest follows.

Conjugate inference for the multivariate normal distribution II: Σ unknown

In this case, it is useful to reparameterize the normal distribution in terms of the precision matrix Φ = Σ^{-1}, in which case the normal likelihood function becomes

l(µ, Φ|x) ∝ |Φ|^{n/2} exp( -(1/2) [ tr(SΦ) + n(µ − x̄)^T Φ(µ − x̄) ] ).

It is clear that a conjugate prior for µ and Φ must take a similar form to the likelihood. This is a normal-Wishart distribution.

The normal-Wishart distribution

Definition 24
A k × k dimensional symmetric, positive definite random variable W is said to have a Wishart distribution with parameters d and V if

f(W) = |W|^{(d − k − 1)/2} / ( 2^{dk/2} |V|^{d/2} π^{k(k − 1)/4} ∏_{i=1}^k Γ((d + 1 − i)/2) ) exp( -(1/2) tr(V^{-1} W) ),

where d > k − 1. In this case, E[W] = dV and we write W ∼ W(d, V). If W ∼ W(d, V), then the distribution of W^{-1} is said to be an inverse Wishart distribution, W^{-1} ∼ IW(d, V^{-1}), with mean E[W^{-1}] = (1/(d − k − 1)) V^{-1}.

Theorem 38
Suppose that X|µ, Φ ∼ N(µ, Φ^{-1}) and let µ|Φ ∼ N(m, (1/α)Φ^{-1}) and Φ ∼ W(d, W). Then

µ|Φ, x ∼ N( (αm + nx̄)/(α + n), (1/(α + n)) Φ^{-1} )
Φ|x ∼ W( d + n, [ W^{-1} + S + (αn/(α + n))(m − x̄)(m − x̄)^T ]^{-1} ).

Proof Exercise.

We can also derive a limiting prior distribution by letting d → 0, giving p(Φ) ∝ |Φ|^{-(k + 1)/2}; combined with a uniform limiting prior for µ, the posterior distribution is

µ|Φ, x ∼ N( x̄, (1/n) Φ^{-1} ),    Φ|x ∼ W( n − 1, S^{-1} ).

Semi-conjugate inference via Gibbs sampling

The conjugacy assumption that the prior precision of µ is proportional to the model precision Φ is very strong in many cases. Often, we may simply wish to use a prior distribution of the form µ ∼ N(m, V), where m and V are known, and a Wishart prior for Φ, say Φ ∼ W(d, W), as earlier. In this case, the conditional posterior distributions are

µ|Φ, x ∼ N( (V^{-1} + nΦ)^{-1}(V^{-1} m + nΦ x̄), (V^{-1} + nΦ)^{-1} )
Φ|µ, x ∼ W( d + n, [ W^{-1} + S + n(µ − x̄)(µ − x̄)^T ]^{-1} )

and therefore it is straightforward to set up a Gibbs sampling algorithm to sample the joint posterior, as in the univariate case.
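To make the two conditional updates concrete, here is a minimal Python sketch of that Gibbs sampler (not part of the original notes). The prior values m, V, d, W, the synthetic data and the number of iterations are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(1)

# Synthetic data (illustrative only).
n, k = 50, 2
x = rng.multivariate_normal([0.0, 1.0], [[1.0, 0.4], [0.4, 2.0]], size=n)
x_bar = x.mean(axis=0)
S = (x - x_bar).T @ (x - x_bar)

# Semi-conjugate prior: mu ~ N(m, V), Phi ~ Wishart(d, W).
m = np.zeros(k)
V = 10.0 * np.eye(k)
d, W = k + 1, np.eye(k)
V_inv, W_inv = np.linalg.inv(V), np.linalg.inv(W)

n_iter = 5000
mu = x_bar.copy()                    # initial value
Phi = np.linalg.inv(S / n)           # initial value
mu_draws = np.empty((n_iter, k))

for t in range(n_iter):
    # 1. mu | Phi, x ~ N((V^-1 + n Phi)^-1 (V^-1 m + n Phi x_bar), (V^-1 + n Phi)^-1)
    prec = V_inv + n * Phi
    cov = np.linalg.inv(prec)
    mean = cov @ (V_inv @ m + n * Phi @ x_bar)
    mu = rng.multivariate_normal(mean, cov)

    # 2. Phi | mu, x ~ W(d + n, [W^-1 + S + n (mu - x_bar)(mu - x_bar)^T]^-1)
    dev = (mu - x_bar).reshape(-1, 1)
    scale = np.linalg.inv(W_inv + S + n * (dev @ dev.T))
    Phi = wishart.rvs(df=d + n, scale=scale, random_state=rng)

    mu_draws[t] = mu

# Discard a burn-in period before summarizing the draws.
print("posterior mean of mu ≈", mu_draws[1000:].mean(axis=0))
```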
Aside: sampling the multivariate normal, multivariate t and Wishart distributions

Samplers for the multivariate normal distribution (usually based on the Cholesky decomposition) are available in most statistical packages such as R or Matlab. Sampling the multivariate t distribution is only slightly more complicated. Assume that we wish to sample from T ∼ T(µ, Σ, d). Then, from Theorem 34, the distribution of T is the same as the marginal distribution of T in the two-stage model

T|φ ∼ N(µ, (1/φ)Σ),    φ ∼ G(d/2, d/2).
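A minimal sketch of this two-stage sampler (not from the original notes; the values of µ, Σ and d below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def rmvt(mu, Sigma, d, size, rng):
    """Draw from T(mu, Sigma, d) via the two-stage representation:
    phi ~ Gamma(d/2, rate d/2), then T | phi ~ N(mu, Sigma / phi)."""
    k = len(mu)
    L = np.linalg.cholesky(Sigma)                          # Cholesky factor for the normal step
    phi = rng.gamma(shape=d / 2, scale=2 / d, size=size)   # rate d/2 corresponds to scale 2/d
    z = rng.standard_normal((size, k))
    return mu + (z @ L.T) / np.sqrt(phi)[:, None]

# Illustrative parameters.
mu = np.array([0.0, 2.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
draws = rmvt(mu, Sigma, d=5, size=10000, rng=rng)
print("sample mean ≈", draws.mean(axis=0))                 # ≈ mu for d > 1
```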