
The Bayesian Linear Model

Sudipto Banerjee [email protected]

University of Minnesota

Linear Model Basics

The linear model is the most fundamental of all serious statistical models, encompassing ANOVA, regression, ANCOVA, and random- and mixed-effects modelling, among others.

Ingredients of a linear model include an $n \times 1$ response vector $y = (y_1, \ldots, y_n)$ and an $n \times p$ design matrix (e.g. including regressors) $X = [x_1, \ldots, x_p]$, assumed to have been observed without error. The linear model:

$$y = X\beta + \epsilon; \qquad \epsilon \sim N(0, \sigma^2 I).$$

Recall from standard statistical analysis, using calculus and linear algebra, that the classical unbiased estimates of the regression parameter $\beta$ and of $\sigma^2$ are

$$\hat{\beta} = (X^T X)^{-1} X^T y; \qquad \hat{\sigma}^2 = \frac{1}{n-p}\,(y - X\hat{\beta})^T (y - X\hat{\beta}).$$

The above estimate of $\beta$ is also the least-squares estimate. The predicted value of $y$ is given by

$$\hat{y} = X\hat{\beta} = P_X y, \quad \text{where } P_X = X(X^T X)^{-1} X^T.$$
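As a quick illustration, here is a minimal NumPy sketch of these classical computations (the simulated data and all variable names are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a small data set: n observations, p regressors (including an intercept).
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

# Classical (least-squares / unbiased) estimates.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)   # unbiased estimate of sigma^2

# Projection ("hat") matrix and fitted values.
P_X = X @ XtX_inv @ X.T
y_hat = P_X @ y                        # same as X @ beta_hat
```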

$P_X$ is called the projector matrix of $X$. It is an operator that projects any vector onto the space spanned by the columns of $X$. Note that $(y - X\hat{\beta})^T (y - X\hat{\beta}) = y^T (I - P_X) y$ is the residual sum of squares of the model.

Non-informative priors

For the Bayesian analysis we will need to specify priors for the unknown regression parameter $\beta$ and for $\sigma^2$. What are the “non-informative” priors that would make this Bayesian analysis equivalent to the classical distribution theory? We need to consider absolutely flat priors on $\beta$ and on $\log \sigma^2$. We have

$$P(\beta) \propto 1; \qquad P(\sigma^2) \propto \frac{1}{\sigma^2}, \quad \text{or equivalently} \quad P(\beta, \sigma^2) \propto \frac{1}{\sigma^2}.$$

Note that neither of the above two “distributions” is valid (they do not integrate to any finite number, let alone 1). So why is it that we are even discussing them? It turns out that even if the priors are improper (that is what we call them), as long as the resulting posterior distributions are valid we can still conduct legitimate inference with them. With a flat prior on $\beta$ we obtain, after some algebra, the conditional posterior distribution:

$$P(\beta \mid \sigma^2, y) = N\big((X^T X)^{-1} X^T y,\ (X^T X)^{-1} \sigma^2\big).$$
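A minimal sketch of a single draw from this conditional posterior (a hypothetical helper, not from the slides; it assumes a fixed value of $\sigma^2$):

```python
import numpy as np

def draw_beta_given_sigma2(X, y, sigma2, rng):
    """One draw from P(beta | sigma^2, y) = N((X'X)^{-1} X'y, (X'X)^{-1} sigma^2)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    mean = XtX_inv @ X.T @ y        # the least-squares estimate beta_hat
    cov = XtX_inv * sigma2
    return rng.multivariate_normal(mean, cov)
```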

contd.

The conditional posterior distribution of $\beta$ would have been the desired posterior distribution had $\sigma^2$ been known. Since that is not the case, we need to obtain the marginal posterior distribution by integrating out $\sigma^2$:

$$P(\beta \mid y) = \int P(\beta \mid \sigma^2, y)\, P(\sigma^2 \mid y)\, d\sigma^2.$$

So we need to find the marginal posterior distribution of $\sigma^2$, $P(\sigma^2 \mid y)$. With the choice of the flat prior we obtain:

$$P(\sigma^2 \mid y) \propto \frac{1}{(\sigma^2)^{(n-p)/2 + 1}}\,\exp\!\left(-\frac{(n-p)s^2}{2\sigma^2}\right), \quad \text{i.e.} \quad \sigma^2 \mid y \sim IG\!\left(\frac{n-p}{2},\ \frac{(n-p)s^2}{2}\right),$$

where $s^2 = \hat{\sigma}^2 = \frac{1}{n-p}\, y^T (I - P_X) y$. This is a scaled inverse-chi-square distribution, which is the same as an inverted Gamma distribution $IG((n-p)/2, (n-p)s^2/2)$. A striking similarity with the classical result: the sampling distribution of $\hat{\sigma}^2$ is also characterized by $(n-p)s^2/\sigma^2$ following a chi-square distribution with $n-p$ degrees of freedom.
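To see where this form comes from (a short sketch added here for completeness), start from the joint posterior under the flat prior,

$$p(\beta, \sigma^2 \mid y) \propto \frac{1}{\sigma^2}\,(\sigma^2)^{-n/2}\exp\!\left(-\frac{1}{2\sigma^2}\,(y - X\beta)^T(y - X\beta)\right),$$

write $(y - X\beta)^T(y - X\beta) = (n-p)s^2 + (\beta - \hat{\beta})^T X^T X (\beta - \hat{\beta})$, and integrate over $\beta$. The integral contributes a factor proportional to $(\sigma^2)^{p/2}$, leaving exactly the kernel $(\sigma^2)^{-(n-p)/2 - 1}\exp\big(-(n-p)s^2/(2\sigma^2)\big)$ displayed above.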

contd.

Returning to the marginal posterior distribution $P(\beta \mid y)$, we can perform further algebra (using the form of the inverted Gamma density) to show that $[\beta \mid y]$ is a multivariate $t_{n-p}$ distribution with $n-p$ degrees of freedom and location (non-centrality) parameter $\hat{\beta}$. This distribution is quite complicated, but we rarely need to work with it directly. In order to carry out a non-informative Bayesian analysis, we instead use a simpler sampling-based mechanism (composition sampling): for each $i = 1, \ldots, M$, first draw $\sigma^2_{(i)} \sim [\sigma^2 \mid y]$ (which is inverse-Gamma), and then draw $\beta_{(i)} \sim N\big((X^T X)^{-1} X^T y,\ (X^T X)^{-1} \sigma^2_{(i)}\big)$. The resulting pairs $\{(\beta_{(i)}, \sigma^2_{(i)})\}_{i=1}^{M}$ are precisely samples from the joint posterior distribution $P(\beta, \sigma^2 \mid y)$. Automatically, the samples $\{\beta_{(i)}\}_{i=1}^{M}$ are samples from the marginal posterior distribution $P(\beta \mid y)$, while the samples $\{\sigma^2_{(i)}\}_{i=1}^{M}$ are from the marginal posterior $P(\sigma^2 \mid y)$.
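A minimal sketch of this composition sampler in NumPy (all names are illustrative; the inverse-Gamma draw with shape $(n-p)/2$ and scale $(n-p)s^2/2$ is obtained here as the reciprocal of a Gamma draw):

```python
import numpy as np

def sample_posterior(X, y, M=5000, seed=0):
    """Composition sampling: sigma^2_(i) ~ IG((n-p)/2, (n-p)s^2/2), then
    beta_(i) ~ N(beta_hat, (X'X)^{-1} sigma^2_(i)), for i = 1, ..., M."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - p)

    # If sigma^2 ~ IG(a, b), then 1/sigma^2 ~ Gamma(shape=a, scale=1/b).
    a, b = (n - p) / 2.0, (n - p) * s2 / 2.0
    sigma2_draws = 1.0 / rng.gamma(shape=a, scale=1.0 / b, size=M)

    # One draw of beta per draw of sigma^2.
    beta_draws = np.array([rng.multivariate_normal(beta_hat, XtX_inv * s2_i)
                           for s2_i in sigma2_draws])
    return beta_draws, sigma2_draws
```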

contd.

A few important theoretical consequences are noted below. The multivariate $t$ density is given by:

$$P(\beta \mid y) = \frac{\Gamma(n/2)}{(\pi(n-p))^{p/2}\,\Gamma((n-p)/2)\,\big|s^2 (X^T X)^{-1}\big|^{1/2}} \left[1 + \frac{(\beta - \hat{\beta})^T (X^T X)\,(\beta - \hat{\beta})}{(n-p)\,s^2}\right]^{-n/2}.$$

The marginal distribution of each individual regression parameter $\beta_j$ is a non-central univariate $t_{n-p}$ distribution. In fact,

$$\frac{\beta_j - \hat{\beta}_j}{s\,\sqrt{(X^T X)^{-1}_{jj}}} \sim t_{n-p}.$$

In particular, with a simple univariate $N(\mu, \sigma^2)$ likelihood, we have

$$\frac{\mu - \bar{y}}{s/\sqrt{n}} \sim t_{n-1}.$$

The promised reproduction of the classical case is now complete.
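As a numerical check of this equivalence, here is a sketch (hypothetical helper names; SciPy assumed available) that computes the marginal 95% posterior credible intervals for each $\beta_j$, which coincide with the classical $t$ confidence intervals:

```python
import numpy as np
from scipy import stats

def credible_intervals(X, y, level=0.95):
    """Marginal posterior credible intervals for each beta_j under flat priors.

    Numerically identical to the classical t confidence intervals."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - p)
    se = np.sqrt(s2 * np.diag(XtX_inv))            # s * sqrt((X'X)^{-1}_jj)
    t_crit = stats.t.ppf(0.5 + level / 2.0, df=n - p)
    return np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])
```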

Predicting from Linear Models

Now we want to apply our fitted model to a new set of data, where we have observed a new covariate matrix $\tilde{X}$, and we wish to predict the corresponding outcome $\tilde{y}$. If $\beta$ and $\sigma^2$ were known, then $\tilde{y}$ would have a $N(\tilde{X}\beta, \sigma^2 I)$ distribution. When the parameters are unknown, all predictions must follow from the posterior predictive distribution:

$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \beta, \sigma^2)\, p(\beta, \sigma^2 \mid y)\, d\beta\, d\sigma^2.$$

For each posterior draw $(\beta_{(i)}, \sigma^2_{(i)})$, $i = 1, \ldots, M$, draw $\tilde{y}_{(i)}$ from $N(\tilde{X}\beta_{(i)}, \sigma^2_{(i)} I)$. The resulting sample $\{\tilde{y}_{(i)}\}_{i=1}^{M}$ represents the posterior predictive distribution.
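A sketch of this predictive composition step (assuming posterior draws such as those returned by the sampler sketched earlier, and a new design matrix X_new; the names are illustrative):

```python
import numpy as np

def posterior_predictive(X_new, beta_draws, sigma2_draws, seed=1):
    """Draw from p(y_new | y): one y_new draw per posterior (beta, sigma^2) draw."""
    rng = np.random.default_rng(seed)
    M = len(sigma2_draws)
    n_new = X_new.shape[0]
    draws = np.empty((M, n_new))
    for i in range(M):
        mean = X_new @ beta_draws[i]
        draws[i] = rng.normal(mean, np.sqrt(sigma2_draws[i]))
    return draws
```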

The theoretical mean and variance (conditional upon $\sigma^2$) are:

$$E(\tilde{y} \mid \sigma^2, y) = \tilde{X}\hat{\beta}, \qquad \operatorname{var}(\tilde{y} \mid \sigma^2, y) = \big(I + \tilde{X}(X^T X)^{-1}\tilde{X}^T\big)\,\sigma^2.$$

The theoretical unconditional predictive distribution $p(\tilde{y} \mid y)$ is a multivariate $t$ distribution, $t_{n-p}\big(\tilde{X}\hat{\beta},\ s^2(I + \tilde{X}(X^T X)^{-1}\tilde{X}^T)\big)$.

Lindley and Smith (1972)

Hierarchical Linear Models in their full general form:

$$y \mid \beta_1, \Sigma_1 \sim N(X_1\beta_1, \Sigma_1),$$
$$\beta_1 \mid \beta_2, \Sigma_2 \sim N(X_2\beta_2, \Sigma_2).$$

It is assumed that $X_1$, $X_2$, $\Sigma_1$ and $\Sigma_2$ are known matrices, with the latter two being positive definite covariance matrices. The marginal distribution of $y$ is given by:

$$y \sim N\big(X_1 X_2 \beta_2,\ \Sigma_1 + X_1 \Sigma_2 X_1^T\big).$$

The conditional distribution of $\beta_1$ given $y$ is:

$$\beta_1 \mid y \sim N(\Sigma^* \mu,\ \Sigma^*), \quad \text{where} \quad \Sigma^* = \big(X_1^T \Sigma_1^{-1} X_1 + \Sigma_2^{-1}\big)^{-1} \quad \text{and} \quad \mu = X_1^T \Sigma_1^{-1} y + \Sigma_2^{-1} X_2 \beta_2.$$
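A small NumPy sketch of these two results (the function and argument names are illustrative; $\Sigma_1$ and $\Sigma_2$ are assumed positive definite):

```python
import numpy as np

def lindley_smith_posterior(y, X1, X2, beta2, Sigma1, Sigma2):
    """Moments of the marginal of y and of beta1 | y in the two-stage model."""
    # Marginal of y: N(X1 X2 beta2, Sigma1 + X1 Sigma2 X1').
    marg_mean = X1 @ X2 @ beta2
    marg_cov = Sigma1 + X1 @ Sigma2 @ X1.T

    # Conditional of beta1 | y: N(Sigma_star @ mu, Sigma_star).
    Sigma1_inv = np.linalg.inv(Sigma1)
    Sigma2_inv = np.linalg.inv(Sigma2)
    Sigma_star = np.linalg.inv(X1.T @ Sigma1_inv @ X1 + Sigma2_inv)
    mu = X1.T @ Sigma1_inv @ y + Sigma2_inv @ X2 @ beta2
    return marg_mean, marg_cov, Sigma_star @ mu, Sigma_star
```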

The Proof: a sketch

To derive the marginal distribution of y, we first rewrite the system

$$y = X_1\beta_1 + u, \quad \text{where } u \sim N(0, \Sigma_1);$$
$$\beta_1 = X_2\beta_2 + v, \quad \text{where } v \sim N(0, \Sigma_2).$$

Also, u and v are independent of each other. It then follows that:

$$y = X_1 X_2 \beta_2 + X_1 v + u.$$
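Spelling out the moment calculation (using the independence of $u$ and $v$):

$$E(y) = X_1 X_2 \beta_2, \qquad \operatorname{var}(y) = \operatorname{var}(X_1 v) + \operatorname{var}(u) = X_1 \Sigma_2 X_1^T + \Sigma_1.$$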

This shows that the marginal distribution of $y$ is normal; with the mean and variance computed above, the stated marginal distribution follows. For the conditional distribution of $\beta_1$ given $y$, we see from Bayes' Theorem that $p(\beta_1 \mid y) \propto p(y \mid \beta_1)\, p(\beta_1)$. By substituting the expressions for the respective normal densities into the product on the right-hand side and completing the square in the quadratic form in the exponent, we find $p(\beta_1 \mid y) \propto \exp(-Q/2)$, where

$$Q = (\beta_1 - \Sigma^*\mu)^T\, \Sigma^{*-1}\, (\beta_1 - \Sigma^*\mu) + \big[\,y^T \Sigma_1^{-1} y + \beta_2^T X_2^T \Sigma_2^{-1} X_2 \beta_2 - \mu^T \Sigma^* \mu\,\big].$$

Since the bracketed term does not involve $\beta_1$, it follows that $\beta_1 \mid y \sim N(\Sigma^*\mu, \Sigma^*)$, as claimed.
