Bayesian Inference Chapter 4: Regression and Hierarchical Models
Conchi Ausín and Mike Wiper
Department of Statistics, Universidad Carlos III de Madrid
Master in Business Administration and Quantitative Methods / Master in Mathematical Engineering

Objective

[Slide photographs: A. F. M. Smith and Dennis Lindley]

We analyze the Bayesian approach to fitting normal and generalized linear models and introduce the Bayesian hierarchical modelling approach. We also study the modelling and forecasting of time series.

Contents

1 Normal linear models
  1.1 ANOVA model
  1.2 Simple linear regression model
2 Generalized linear models
3 Hierarchical models
4 Dynamic models

Normal linear models

A normal linear model has the form

$$y = X\theta + \epsilon,$$

where $y = (y_1, \dots, y_n)^T$ is the observed data, $X$ is a known $n \times k$ matrix called the design matrix, $\theta = (\theta_1, \dots, \theta_k)^T$ is the parameter vector, and $\epsilon$ follows a multivariate normal distribution. Usually, it is assumed that

$$\epsilon \sim \mathcal{N}\left(\mathbf{0}, \tfrac{1}{\phi} I_n\right).$$

A simple example of a normal linear model is the simple linear regression model, where

$$X^T = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix} \quad \text{and} \quad \theta = (\alpha, \beta)^T.$$

Consider a normal linear model, $y = X\theta + \epsilon$. A conjugate prior distribution is the normal-gamma distribution:

$$\theta \mid \phi \sim \mathcal{N}\left(m, \tfrac{1}{\phi} V\right), \qquad \phi \sim \mathcal{G}\left(\tfrac{a}{2}, \tfrac{b}{2}\right).$$

Then, the posterior distribution given $y$ is also normal-gamma, with

$$m^* = \left(X^T X + V^{-1}\right)^{-1}\left(X^T y + V^{-1} m\right),$$
$$V^* = \left(X^T X + V^{-1}\right)^{-1},$$
$$a^* = a + n,$$
$$b^* = b + y^T y + m^T V^{-1} m - m^{*T} V^{*-1} m^*.$$

The posterior mean is given by

$$E[\theta \mid y] = \left(X^T X + V^{-1}\right)^{-1}\left(X^T y + V^{-1} m\right) = \left(X^T X + V^{-1}\right)^{-1}\left(X^T X \hat\theta + V^{-1} m\right),$$

where $\hat\theta = (X^T X)^{-1} X^T y$ is the maximum likelihood estimator (the second equality uses $X^T y = X^T X (X^T X)^{-1} X^T y$). Thus, this expression may be interpreted as a weighted average of the prior mean, $m$, and the MLE, $\hat\theta$, with weights proportional to their precisions: conditional on $\phi$, the prior variance is $\tfrac{1}{\phi} V$, while the classical sampling distribution of the MLE is $\hat\theta \mid \phi \sim \mathcal{N}\left(\theta, \tfrac{1}{\phi}(X^T X)^{-1}\right)$.

Consider instead a normal linear model, $y = X\theta + \epsilon$, with the limiting prior distribution $p(\theta, \phi) \propto \tfrac{1}{\phi}$. Then, we have

$$\theta \mid y, \phi \sim \mathcal{N}\left(\hat\theta, \tfrac{1}{\phi}\left(X^T X\right)^{-1}\right), \qquad \phi \mid y \sim \mathcal{G}\left(\frac{n-k}{2}, \frac{y^T y - \hat\theta^T (X^T X)\hat\theta}{2}\right).$$

Note that $\hat\sigma^2 = \frac{y^T y - \hat\theta^T (X^T X)\hat\theta}{n-k}$ is the usual classical estimator of $\sigma^2 = \tfrac{1}{\phi}$. In this case, Bayesian credible intervals, estimators, etc. will coincide with their classical counterparts.
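The conjugate update above is just a handful of matrix operations. The following is a minimal Python sketch (the function name and the simulated-data example are ours, for illustration; the slides do not prescribe an implementation):

```python
import numpy as np

def normal_gamma_update(X, y, m, V, a, b):
    """Posterior (m*, V*, a*, b*) for y = X theta + eps with the conjugate
    prior theta | phi ~ N(m, V/phi), phi ~ G(a/2, b/2)."""
    V_inv = np.linalg.inv(V)
    V_star_inv = X.T @ X + V_inv          # posterior precision (up to phi)
    V_star = np.linalg.inv(V_star_inv)
    m_star = V_star @ (X.T @ y + V_inv @ m)
    a_star = a + len(y)
    b_star = b + y @ y + m @ V_inv @ m - m_star @ V_star_inv @ m_star
    return m_star, V_star, a_star, b_star

# Example: simple linear regression data with a weak prior.
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])      # design matrix with rows (1, x_i)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
m_star, V_star, a_star, b_star = normal_gamma_update(
    X, y, m=np.zeros(2), V=100.0 * np.eye(2), a=1.0, b=1.0)
print(m_star)            # E[theta | y]
print(a_star / b_star)   # E[phi | y], the posterior mean precision
```

With a diffuse prior (large $V$), $m^*$ approaches the MLE, which matches the weighted-average interpretation of the posterior mean given above.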
ANOVA model

The ANOVA model is an example of a normal linear model where

$$y_{ij} = \theta_i + \epsilon_{ij}, \qquad \epsilon_{ij} \sim \mathcal{N}\left(0, \tfrac{1}{\phi}\right), \quad i = 1, \dots, k, \quad j = 1, \dots, n_i.$$

Thus, the parameters are $\theta = (\theta_1, \dots, \theta_k)^T$, the observed data are $y = (y_{11}, \dots, y_{1n_1}, y_{21}, \dots, y_{2n_2}, \dots, y_{k1}, \dots, y_{kn_k})^T$, and the design matrix is the block matrix

$$X = \begin{pmatrix} \mathbf{1}_{n_1} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{1}_{n_2} & \cdots & \mathbf{0} \\ \vdots & & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{1}_{n_k} \end{pmatrix},$$

where $\mathbf{1}_{n_i}$ denotes a column vector of $n_i$ ones.

Assume conditionally independent normal priors, $\theta_i \sim \mathcal{N}\left(m_i, \tfrac{1}{\alpha_i \phi}\right)$, for $i = 1, \dots, k$, and a gamma prior $\phi \sim \mathcal{G}\left(\tfrac{a}{2}, \tfrac{b}{2}\right)$. This corresponds to a normal-gamma prior distribution for $(\theta, \phi)$ with $m = (m_1, \dots, m_k)^T$ and $V = \mathrm{diag}\left(\tfrac{1}{\alpha_1}, \dots, \tfrac{1}{\alpha_k}\right)$.

Then, it is obtained that

$$\theta \mid y, \phi \sim \mathcal{N}\left(\begin{pmatrix} \frac{n_1 \bar y_{1\cdot} + \alpha_1 m_1}{n_1 + \alpha_1} \\ \vdots \\ \frac{n_k \bar y_{k\cdot} + \alpha_k m_k}{n_k + \alpha_k} \end{pmatrix}, \; \frac{1}{\phi}\begin{pmatrix} \frac{1}{\alpha_1 + n_1} & & \\ & \ddots & \\ & & \frac{1}{\alpha_k + n_k} \end{pmatrix}\right)$$

and

$$\phi \mid y \sim \mathcal{G}\left(\frac{a + n}{2}, \; \frac{b + \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar y_{i\cdot})^2 + \sum_{i=1}^k \frac{n_i \alpha_i}{n_i + \alpha_i} (\bar y_{i\cdot} - m_i)^2}{2}\right).$$

If, alternatively, we assume the reference prior $p(\theta, \phi) \propto \tfrac{1}{\phi}$, we have

$$\theta \mid y, \phi \sim \mathcal{N}\left(\begin{pmatrix} \bar y_{1\cdot} \\ \vdots \\ \bar y_{k\cdot} \end{pmatrix}, \; \frac{1}{\phi}\begin{pmatrix} \frac{1}{n_1} & & \\ & \ddots & \\ & & \frac{1}{n_k} \end{pmatrix}\right), \qquad \phi \mid y \sim \mathcal{G}\left(\frac{n-k}{2}, \frac{(n-k)\hat\sigma^2}{2}\right),$$

where $\hat\sigma^2 = \frac{1}{n-k}\sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar y_{i\cdot})^2$ is the classical variance estimate for this problem.

A 95% posterior interval for $\theta_1 - \theta_2$ is given by

$$\bar y_{1\cdot} - \bar y_{2\cdot} \pm \hat\sigma \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\; t_{n-k}(0.975),$$

which is equal to the usual classical interval.

Example: ANOVA model

Suppose that an ecologist is interested in analysing how the masses of starlings (a type of bird) vary between four locations. Sample data on the weights of 10 starlings from each of the four locations can be downloaded from: http://arcue.botany.unimelb.edu.au/bayescode.html. Assume a Bayesian one-way ANOVA model for these data, where a different mean is considered for each location and the variation in mass between birds is described by a normal distribution with a common variance. Compare the results with those obtained with classical methods.

Simple linear regression model

Another example of a normal linear model is the simple regression model

$$y_i = \alpha + \beta x_i + \epsilon_i, \qquad i = 1, \dots, n, \qquad \epsilon_i \sim \mathcal{N}\left(0, \tfrac{1}{\phi}\right).$$

Suppose that we use the limiting prior $p(\alpha, \beta, \phi) \propto \tfrac{1}{\phi}$. Then, we have

$$\begin{pmatrix} \alpha \\ \beta \end{pmatrix} \,\Big|\, y, \phi \sim \mathcal{N}\left(\begin{pmatrix} \hat\alpha \\ \hat\beta \end{pmatrix}, \; \frac{1}{\phi n s_x}\begin{pmatrix} \sum_{i=1}^n x_i^2 & -n\bar x \\ -n\bar x & n \end{pmatrix}\right), \qquad \phi \mid y \sim \mathcal{G}\left(\frac{n-2}{2}, \frac{s_y(1-r^2)}{2}\right),$$

where

$$\hat\alpha = \bar y - \hat\beta \bar x, \quad \hat\beta = \frac{s_{xy}}{s_x}, \quad s_x = \sum_{i=1}^n (x_i - \bar x)^2, \quad s_y = \sum_{i=1}^n (y_i - \bar y)^2,$$
$$s_{xy} = \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y), \quad r = \frac{s_{xy}}{\sqrt{s_x s_y}}, \quad \hat\sigma^2 = \frac{s_y(1-r^2)}{n-2}.$$

Thus, the marginal posterior distributions of $\alpha$ and $\beta$ are Student-t distributions:

$$\frac{\alpha - \hat\alpha}{\sqrt{\hat\sigma^2 \frac{\sum_{i=1}^n x_i^2}{n s_x}}} \,\Big|\, y \sim t_{n-2}, \qquad \frac{\beta - \hat\beta}{\sqrt{\frac{\hat\sigma^2}{s_x}}} \,\Big|\, y \sim t_{n-2}.$$

Therefore, for example, a 95% credible interval for $\beta$ is given by

$$\hat\beta \pm \frac{\hat\sigma}{\sqrt{s_x}}\; t_{n-2}(0.975),$$

equal to the usual classical interval.
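Under the reference prior, these posterior summaries reduce to familiar least-squares quantities, so they are easy to compute directly. A minimal Python sketch (the function name is ours, for illustration):

```python
import numpy as np
from scipy import stats

def reference_prior_regression(x, y, level=0.95):
    """Posterior summaries for y_i = alpha + beta x_i + eps_i under the
    limiting prior p(alpha, beta, phi) proportional to 1/phi."""
    n = len(y)
    xbar, ybar = x.mean(), y.mean()
    s_x = np.sum((x - xbar) ** 2)
    s_xy = np.sum((x - xbar) * (y - ybar))
    beta_hat = s_xy / s_x
    alpha_hat = ybar - beta_hat * xbar
    resid = y - alpha_hat - beta_hat * x
    sigma2_hat = np.sum(resid ** 2) / (n - 2)   # = s_y (1 - r^2) / (n - 2)
    t_crit = stats.t.ppf(0.5 + level / 2, df=n - 2)
    half_width = t_crit * np.sqrt(sigma2_hat / s_x)
    # Credible interval for beta; it matches the classical interval exactly.
    return alpha_hat, beta_hat, (beta_hat - half_width, beta_hat + half_width)
```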
Simple linear regression model: prediction

Suppose now that we wish to predict a future observation, $y_{new} = \alpha + \beta x_{new} + \epsilon_{new}$. Note that

$$E[y_{new} \mid \phi, y] = \hat\alpha + \hat\beta x_{new},$$
$$V[y_{new} \mid \phi, y] = \frac{1}{\phi}\left(\frac{\sum_{i=1}^n x_i^2 + n x_{new}^2 - 2n\bar x x_{new}}{n s_x} + 1\right) = \frac{1}{\phi}\left(\frac{s_x + n\bar x^2 + n x_{new}^2 - 2n\bar x x_{new}}{n s_x} + 1\right).$$

Therefore,

$$y_{new} \mid \phi, y \sim \mathcal{N}\left(\hat\alpha + \hat\beta x_{new}, \; \frac{1}{\phi}\left(\frac{(\bar x - x_{new})^2}{s_x} + \frac{1}{n} + 1\right)\right).$$

And then,

$$\frac{y_{new} - (\hat\alpha + \hat\beta x_{new})}{\sqrt{\hat\sigma^2\left(\frac{(\bar x - x_{new})^2}{s_x} + \frac{1}{n} + 1\right)}} \,\Big|\, y \sim t_{n-2},$$

leading to the following 95% credible interval for $y_{new}$:

$$\hat\alpha + \hat\beta x_{new} \pm \hat\sigma\sqrt{\frac{(\bar x - x_{new})^2}{s_x} + \frac{1}{n} + 1}\; t_{n-2}(0.975),$$

which coincides with the usual classical interval.

Example: Simple linear regression model

Consider the data file prostate.data, which can be downloaded from: http://statweb.stanford.edu/~tibs/ElemStatLearn/. This includes, among other clinical measures, the level of prostate-specific antigen in logs (lpsa) and the log cancer volume (lcavol) for 97 men who were about to receive a radical prostatectomy. Use a Bayesian linear regression model to predict the lpsa in terms of the lcavol. Compare the results with a classical linear regression fit.

Generalized linear models

The generalized linear model generalizes the normal linear model by allowing the possibility of non-normal error distributions and a non-linear relationship between $y$ and $x$. A generalized linear model is specified by two functions:

1. A conditional density for $y$ given $x$ from the exponential family, parameterized by a mean parameter, $\mu = \mu(x) = E[Y \mid x]$, and (possibly) a dispersion parameter, $\phi > 0$, that does not depend on $x$.

2. A (one-to-one) link function, $g(\cdot)$, which relates the mean, $\mu = \mu(x)$, to the covariate vector, $x$, as $g(\mu) = x\theta$. A Poisson regression example is sketched below.
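As a concrete illustration, consider Poisson regression for count data: $y \mid x \sim \mathrm{Poisson}(\mu)$ with log link $g(\mu) = \log \mu = x\theta$. The following is a minimal Bayesian sketch using a flat prior $p(\theta) \propto 1$ and a random-walk Metropolis sampler; the prior, step size, and function names are our choices for illustration, since the slides do not prescribe a fitting method:

```python
import numpy as np
from scipy.special import gammaln

def poisson_loglik(theta, X, y):
    """Log-likelihood of a Poisson GLM with log link: log(mu_i) = x_i theta."""
    eta = X @ theta
    return np.sum(y * eta - np.exp(eta) - gammaln(y + 1))

def metropolis(log_post, theta0, n_iter=5000, step=0.05, seed=0):
    """Random-walk Metropolis; with a flat prior, log_post is the log-likelihood."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    draws = np.empty((n_iter, theta.size))
    for t in range(n_iter):
        prop = theta + step * rng.normal(size=theta.size)   # symmetric proposal
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:            # accept/reject
            theta, lp = prop, lp_prop
        draws[t] = theta
    return draws

# Simulated example with true theta = (0.5, 1.0).
rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(np.exp(X @ np.array([0.5, 1.0])))
draws = metropolis(lambda th: poisson_loglik(th, X, y), theta0=np.zeros(2))
print(draws[1000:].mean(axis=0))   # posterior means, close to (0.5, 1.0)
```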