
A Bayesian Treatment of Linear Gaussian Regression

Frank Wood

December 3, 2009

Bayesian Approach to Classical Regression

In classical linear regression we have the following model:

$$y \mid \beta, \sigma^2, X \sim N(X\beta, \sigma^2 I)$$
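As a concrete reference point, here is a minimal Python sketch that simulates data from this model (the dimensions and parameter values are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3                           # observations and predictors
X = rng.normal(size=(n, p))            # design matrix
beta_true = np.array([1.0, -2.0, 0.5]) # illustrative weights
sigma2_true = 0.25                     # illustrative noise variance

# y | beta, sigma^2, X ~ N(X beta, sigma^2 I)
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2_true), size=n)
```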

Unfortunately we often don't know the observation error variance $\sigma^2$ and, furthermore, we don't know the vector of linear weights $\beta$ that relates the input(s) to the output.

In Bayesian regression, we are interested in several inference objectives. One is the posterior distribution of the model parameters, in particular the posterior distribution of the observation error variance given the inputs and the outputs:

$$P(\sigma^2 \mid X, y)$$

Posterior Distribution of the Error Variance

Of course, in order to derive

$$P(\sigma^2 \mid X, y)$$

we have to treat $\beta$ as a nuisance parameter and integrate it out:

$$P(\sigma^2 \mid X, y) = \int P(\sigma^2, \beta \mid X, y)\, d\beta = \int P(\sigma^2 \mid \beta, X, y)\, P(\beta \mid X, y)\, d\beta$$

Predicting a New Output for a (set of) new Input(s)

Of particular interest is the ability to predict the distribution of output values for a new input:

$$P(y_{new} \mid X, y, X_{new})$$

Here we have to treat both $\sigma^2$ and $\beta$ as nuisance parameters and integrate them out:

$$P(y_{new} \mid X, y, X_{new}) = \iint P(y_{new} \mid \beta, \sigma^2)\, P(\sigma^2 \mid \beta, X, y)\, P(\beta \mid X, y)\, d\beta\, d\sigma^2$$

Noninformative Prior for Classical Regression

For both objectives, we need to place a prior on the model parameters $\sigma^2$ and $\beta$. We will choose a noninformative prior to demonstrate the connection between the Bayesian approach to multiple regression and the classical approach:

$$P(\sigma^2, \beta) \propto \sigma^{-2}$$

Is this a proper prior? What form will the posterior take in this case? Will it be proper?

Clearly other, more informative priors can be imposed.

Posterior Distribution of β given σ²

Sometimes it is the case that $\sigma^2$ is known. In such cases the posterior distribution over the model parameters collapses to the posterior over $\beta$ alone. Even when $\sigma^2$ is also unknown, the factorization of the posterior distribution

$$P(\sigma^2, \beta \mid X, y) = P(\beta \mid \sigma^2, X, y)\, P(\sigma^2 \mid X, y)$$

suggests that determining the posterior distribution $P(\beta \mid \sigma^2, X, y)$ will be of use as a step in posterior analyses. Given our choice of (improper) prior we have

$$P(\beta \mid \sigma^2, X, y)\, P(\sigma^2 \mid X, y) \propto N(y \mid X\beta, \sigma^2 I)\, \sigma^{-2}$$

Plugging in the normal likelihood and ignoring terms that are not a function of $\beta$, we have

$$P(\beta \mid \sigma^2, X, y) \propto \exp\!\left(-\frac{1}{2}(y - X\beta)^T \frac{1}{\sigma^2} I\, (y - X\beta)\right)$$

When we expand out the exponent we get an expression that looks like (again dropping terms that do not involve $\beta$)

$$\exp\!\left(-\frac{1}{2}\left(-2 y^T \frac{1}{\sigma^2} I X \beta + \beta^T X^T \frac{1}{\sigma^2} I X \beta\right)\right)$$

Multivariate Quadratic Square Completion

We recognize the familiar form of the exponent of a multivariate Gaussian in this expression and can derive the mean and the variance of the distribution of $\beta \mid \sigma^2, \ldots$ by noting that

$$(\beta - \mu_\beta)^T \Sigma_\beta^{-1} (\beta - \mu_\beta) = \beta^T \Sigma_\beta^{-1} \beta - 2 \mu_\beta^T \Sigma_\beta^{-1} \beta + \text{const}$$

From this and the result from the previous slide,

$$\exp\!\left(-\frac{1}{2}\left(-2 y^T \frac{1}{\sigma^2} I X \beta + \beta^T X^T \frac{1}{\sigma^2} I X \beta\right)\right)$$

we can immediately identify $\Sigma_\beta^{-1} = X^T \frac{1}{\sigma^2} I X$ and thus that $\Sigma_\beta = \sigma^2 (X^T X)^{-1}$. Similarly we can solve for $\mu_\beta$ and we find

$$\mu_\beta = (X^T X)^{-1} X^T y$$
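To spell out that step: the linear terms $-2\mu_\beta^T \Sigma_\beta^{-1} \beta$ and $-2 y^T \frac{1}{\sigma^2} I X \beta$ must agree for every $\beta$, so

$$\Sigma_\beta^{-1} \mu_\beta = \frac{1}{\sigma^2} X^T y \quad\Longrightarrow\quad \mu_\beta = \Sigma_\beta \frac{1}{\sigma^2} X^T y = \sigma^2 (X^T X)^{-1} \frac{1}{\sigma^2} X^T y = (X^T X)^{-1} X^T y.$$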

Distribution of β given σ²

Mirroring the classical approach to regression, we have that the distribution of the regression coefficients given the observation noise variance is

$$\beta \mid y, X, \sigma^2 \sim N(\mu_\beta, \Sigma_\beta)$$

where $\Sigma_\beta = \sigma^2 (X^T X)^{-1}$ and $\mu_\beta = (X^T X)^{-1} X^T y$.

Note that $\mu_\beta$ is the same as the maximum likelihood or least squares estimate $\hat\beta = (X^T X)^{-1} X^T y$ of the regression coefficients.
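A minimal sketch of this conditional posterior in Python, continuing the simulated X and y above and treating σ² as known for now:

```python
# beta | y, X, sigma^2 ~ N(mu_beta, Sigma_beta) under the prior sigma^{-2}
sigma2 = sigma2_true                  # pretend the noise variance is known
XtX_inv = np.linalg.inv(X.T @ X)
mu_beta = XtX_inv @ X.T @ y           # posterior mean = least squares estimate
Sigma_beta = sigma2 * XtX_inv         # posterior covariance
```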

Of course we don't usually know the observation noise variance $\sigma^2$ and have to simultaneously estimate it from the data. To determine the distribution of this quantity we need a few facts.

Scaled inverse-chi-square distribution

If $\theta \sim \text{Inv-}\chi^2(\nu, s^2)$ then the pdf for $\theta$ is given by

$$P(\theta) = \frac{(\nu/2)^{\nu/2}}{\Gamma(\nu/2)}\, s^\nu\, \theta^{-(\nu/2+1)}\, e^{-\nu s^2/(2\theta)} \propto \theta^{-(\nu/2+1)}\, e^{-\nu s^2/(2\theta)}$$

You can think of the scaled inverse chi-squared distribution as the chi-squared distribution where the sum of squares is explicit in the parameterization. Here $\nu > 0$ is the number of "degrees of freedom" and $s > 0$ is the scale parameter.

Distribution of σ² given observations y and X

The posterior distribution of the observation noise can be derived by noting that

$$P(\sigma^2 \mid y, X) = \frac{P(\beta, \sigma^2 \mid y, X)}{P(\beta \mid \sigma^2, y, X)} \propto \frac{P(y \mid \beta, \sigma^2, X)\, P(\beta, \sigma^2 \mid X)}{P(\beta \mid \sigma^2, y, X)}$$

But we have all of these terms: $P(y \mid \beta, \sigma^2, X)$ is the standard regression likelihood, we have just solved for the posterior distribution of $\beta$ given $\sigma^2$, namely $P(\beta \mid \sigma^2, y, X)$, and we specified our prior $P(\sigma^2, \beta) \propto \sigma^{-2}$. When we plug all of these known distributions into this expression we get

$$P(\sigma^2 \mid y, X) \propto \frac{P(y \mid \beta, \sigma^2, X)\, P(\beta, \sigma^2 \mid X)}{P(\beta \mid \sigma^2, y, X)} \propto \frac{\sigma^{-n} \exp\!\left(-\frac{1}{2}(y - X\beta)^T \frac{1}{\sigma^2} I (y - X\beta)\right) \sigma^{-2}}{\sigma^{-p} \exp\!\left(-\frac{1}{2}(\beta - \mu_\beta)^T \Sigma_\beta^{-1} (\beta - \mu_\beta)\right)}$$

which simplifies to

$$\propto \sigma^{-n+p-2} \exp\!\left(-\frac{1}{2}\left((y - X\beta)^T \frac{1}{\sigma^2} I (y - X\beta) - (\beta - \mu_\beta)^T \Sigma_\beta^{-1} (\beta - \mu_\beta)\right)\right)$$

With significant algebraic effort (the key identity is the decomposition $(y - X\beta)^T(y - X\beta) = (y - X\mu_\beta)^T(y - X\mu_\beta) + (\beta - \mu_\beta)^T X^T X (\beta - \mu_\beta)$, whose second term cancels the $\Sigma_\beta^{-1}$ quadratic form) one can arrive at

$$P(\sigma^2 \mid y, X) \propto \sigma^{-n+p-2} \exp\!\left(-\frac{1}{2\sigma^2}(y - X\mu_\beta)^T (y - X\mu_\beta)\right)$$

Remembering that $\mu_\beta = \hat\beta$, we can rewrite this in a more familiar form, namely

$$P(\sigma^2 \mid y, X) \propto \sigma^{-n+p-2} \exp\!\left(-\frac{1}{2\sigma^2}(y - X\hat\beta)^T (y - X\hat\beta)\right)$$

where the exponent is the sum of squared errors SSE. By inspection, this expression follows a scaled inverse χ² distribution

$$P(\theta) \propto \theta^{-(\nu/2+1)}\, e^{-\nu s^2/(2\theta)}$$

with $\theta = \sigma^2 \implies \nu = n - p$ (i.e. the number of degrees of freedom is the number of observations $n$ minus the number of free parameters in the model $p$) and $s^2 = \frac{1}{n-p}(y - X\hat\beta)^T (y - X\hat\beta)$, the standard MSE estimate of the sample variance.

Note that this result,

$$\sigma^2 \sim \text{Inv-}\chi^2\!\left(n - p,\ \frac{1}{n-p}(y - X\hat\beta)^T (y - X\hat\beta)\right) \tag{1}$$

is exactly analogous to the following result from the classical estimation approach to linear regression.

From Cochran’s Theorem we have

$$\frac{SSE}{\sigma^2} = \frac{(y - X\hat\beta)^T (y - X\hat\beta)}{\sigma^2} \sim \chi^2(n - p) \tag{2}$$

To get from (1) to (2) one can use the change of distribution formula with the change of variable $\theta^* = (y - X\hat\beta)^T (y - X\hat\beta)/\sigma^2$.
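As a computational sketch of the posterior (1), one can sample σ² using the standard identity that $\text{Inv-}\chi^2(\nu, s^2)$ is an inverse-gamma distribution with shape $\nu/2$ and scale $\nu s^2/2$ (continuing the illustrative simulation above; scipy is assumed available):

```python
from scipy import stats

nu = n - p                            # degrees of freedom
resid = y - X @ mu_beta               # residuals at beta-hat
s2 = resid @ resid / nu               # classical MSE estimate of the variance
# Scaled Inv-chi^2(nu, s2) is InvGamma(shape=nu/2, scale=nu*s2/2)
sigma2_post = stats.invgamma(a=nu / 2, scale=nu * s2 / 2)
sigma2_samples = sigma2_post.rvs(size=10_000, random_state=rng)
```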

Distribution of output(s) given new input(s)

Last but not least, we will typically be interested in prediction:

$$P(y_{new} \mid X, y, X_{new}) = \iint P(y_{new} \mid \beta, \sigma^2)\, P(\sigma^2 \mid \beta, X, y)\, P(\beta \mid X, y)\, d\beta\, d\sigma^2$$

We will first assume, as usual, that $\sigma^2$ is known and proceed with evaluating

$$P(y_{new} \mid X, y, X_{new}, \sigma^2) = \int P(y_{new} \mid \beta, \sigma^2)\, P(\beta \mid X, y, \sigma^2)\, d\beta$$

instead. We know the form of each of these expressions: the likelihood is normal, as is the distribution of $\beta$ given the rest,

$$P(y_{new} \mid X, y, X_{new}, \sigma^2) = \int P(y_{new} \mid \beta, \sigma^2)\, P(\beta \mid X, y, \sigma^2)\, d\beta$$

In other words

$$P(y_{new} \mid X, y, X_{new}, \sigma^2) = \int N(y_{new} \mid X_{new}\beta, \sigma^2 I)\, N(\beta \mid \hat\beta, \Sigma_\beta)\, d\beta$$

Bayes Rule for Gaussians

To solve this integral we will use Bayes' rule for Gaussians (taken from Bishop).

If

$$P(x) = N(x \mid \mu, \Lambda^{-1})$$
$$P(y \mid x) = N(y \mid Ax + b, L^{-1})$$

where $x$, $y$, and $\mu$ are all vectors and $\Lambda$ and $L$ are (invertible) matrices of the appropriate size, then

$$P(y) = N(y \mid A\mu + b,\ L^{-1} + A\Lambda^{-1}A^T)$$
$$P(x \mid y) = N(x \mid \Sigma(A^T L(y - b) + \Lambda\mu),\ \Sigma)$$

where $\Sigma = (\Lambda + A^T L A)^{-1}$.
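As a quick numerical sanity check of the marginal formula, here is a Monte Carlo sketch with small illustrative matrices (all of the specific values are assumptions for the demonstration, not from the slides):

```python
# Check P(y) = N(A mu + b, L^{-1} + A Lambda^{-1} A^T) by simulation.
A = np.array([[1.0, 0.5],
              [0.0, 2.0]])
b = np.array([0.5, -1.0])
mu_x = np.array([1.0, 2.0])
Lambda_inv = np.array([[1.0, 0.2],
                       [0.2, 0.5]])   # Lambda^{-1}, covariance of x
L_inv = 0.1 * np.eye(2)               # L^{-1}, covariance of y given x

xs = rng.multivariate_normal(mu_x, Lambda_inv, size=200_000)
noise = rng.multivariate_normal(np.zeros(2), L_inv, size=200_000)
ys = xs @ A.T + b + noise

print(ys.mean(axis=0), A @ mu_x + b)  # empirical vs formula mean
print(np.cov(ys.T))                   # empirical covariance
print(L_inv + A @ Lambda_inv @ A.T)   # formula covariance
```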

Distribution of output(s) given new input(s)

Since this integral is just an application of Bayes' rule for Gaussians, we can directly write down the solution:

$$P(y_{new} \mid X, y, X_{new}, \sigma^2) = \int N(y_{new} \mid X_{new}\beta, \sigma^2 I)\, N(\beta \mid \hat\beta, \Sigma_\beta)\, d\beta = N(y_{new} \mid X_{new}\hat\beta,\ \sigma^2(I + X_{new} V_\beta X_{new}^T))$$

where $V_\beta = \Sigma_\beta/\sigma^2 = (X^T X)^{-1}$.
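In code, the predictive mean and covariance for a couple of hypothetical new inputs look like the following sketch (continuing the running example, with σ² still treated as known):

```python
# y_new | X, y, X_new, sigma^2 ~ N(X_new beta-hat, sigma^2 (I + X_new V X_new^T))
X_new = rng.normal(size=(2, p))       # two hypothetical new inputs
V_beta = XtX_inv                      # V_beta = Sigma_beta / sigma^2 = (X^T X)^{-1}
pred_mean = X_new @ mu_beta
pred_cov = sigma2 * (np.eye(2) + X_new @ V_beta @ X_new.T)
```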

This solution,

$$P(y_{new} \mid X, y, X_{new}, \sigma^2) = N(y_{new} \mid X_{new}\hat\beta,\ \sigma^2(I + X_{new} V_\beta X_{new}^T))$$

where $V_\beta = \Sigma_\beta/\sigma^2 = (X^T X)^{-1}$,

relies upon $\sigma^2$ being known. Our final inference objective is to come up with

$$P(y_{new} \mid X, y, X_{new}) = \iint P(y_{new} \mid \beta, \sigma^2)\, P(\sigma^2 \mid \beta, X, y)\, P(\beta \mid X, y)\, d\beta\, d\sigma^2 = \int P(y_{new} \mid X, y, X_{new}, \sigma^2)\, P(\sigma^2 \mid X, y)\, d\sigma^2$$

where we have just derived the first term, and the second we know is scaled inverse chi-squared. The distributional form of

$$P(y_{new} \mid X, y, X_{new}) = \int P(y_{new} \mid X, y, X_{new}, \sigma^2)\, P(\sigma^2 \mid X, y)\, d\sigma^2$$

is a multivariate Student-t distribution with center $X_{new}\hat\beta$, squared scale matrix $s^2(I + X_{new} V_\beta X_{new}^T)$, and $n - p$ degrees of freedom (left as homework).
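The marginal predictive can also be approximated by Monte Carlo composition: draw σ² from its posterior, then draw y_new from the Gaussian predictive given that σ². A sketch continuing the code above (the exact Student-t form is the homework result):

```python
# Compose: sigma^2 ~ posterior, then y_new | sigma^2 ~ Gaussian predictive.
draws = []
for s2_draw in sigma2_post.rvs(size=5000, random_state=rng):
    cov = s2_draw * (np.eye(2) + X_new @ V_beta @ X_new.T)
    draws.append(rng.multivariate_normal(X_new @ mu_beta, cov))
draws = np.array(draws)
# These draws follow the multivariate Student-t above: center X_new beta-hat,
# squared scale s2 * (I + X_new V_beta X_new^T), n - p degrees of freedom.
```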

Again this is the same result as in classical regression: the predictive distribution of a new (set of) point(s) is Student-t when $\sigma^2$ is unknown and marginalized out.

Take home

- The Bayesian perspective brings a new analytic approach to the classical regression setting.

- In classical regression we develop estimators and then determine their distribution under repeated sampling or measurement of the underlying population.

- In Bayesian regression we stick with the single given dataset and calculate the uncertainty in our parameter estimates arising from the fact that we have a finite dataset.

- Given a single choice of prior, namely a particular improper prior, we see that the posterior uncertainty regarding the model parameters corresponds exactly to the classical sampling distributions for regression estimators.

- Other priors can be utilized.