A Bayesian Treatment of Linear Gaussian Regression
Frank Wood
December 3, 2009

Bayesian Approach to Classical Linear Regression

In classical linear regression we have the following model
y | β, σ², X ∼ N(Xβ, σ²I)
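Before turning to inference, the generative model can be simulated directly. A minimal sketch (the sizes, weights, and noise variance below are made-up illustration values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: n observations, p weights (first column is an intercept).
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])    # illustrative "true" weights
sigma2_true = 0.25                        # illustrative "true" noise variance

# y | beta, sigma^2, X ~ N(X beta, sigma^2 I)
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2_true), size=n)
```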
Unfortunately we often don’t know the observation error variance σ², nor the vector of linear weights β that relates the input(s) to the output.
In Bayesian regression, we are interested in several inference objectives. One is the posterior distribution of the model parameters, in particular the posterior distribution of the observation error variance given the inputs and the outputs.
P(σ²|X, y)

Posterior Distribution of the Error Variance

Of course in order to derive

P(σ²|X, y)
we have to treat β as a nuisance parameter and integrate it out
P(σ²|X, y) = ∫ P(σ², β|X, y) dβ
           = ∫ P(σ²|β, X, y) P(β|X, y) dβ

Predicting a New Output for a (set of) new Input(s)

Of particular interest is the ability to predict the distribution of output values for a new input
P(y_new | X, y, X_new)
Here we have to treat both σ² and β as nuisance parameters and integrate them out
P(y_new | X, y, X_new) = ∬ P(y_new|β, σ²) P(σ²|β, X, y) P(β|X, y) dβ dσ²

Noninformative Prior for Classical Regression

For both objectives, we need to place a prior on the model parameters σ² and β. We will choose a noninformative prior to demonstrate the connection between the Bayesian approach to multiple regression and the classical approach.
P(σ², β) ∝ σ⁻²

Is this a proper prior? What form will the posterior take in this case? Will it be proper?
Clearly other priors can be imposed, priors that are more informative.

Posterior distribution of β given σ²

Sometimes it is the case that σ² is known. In such cases the posterior distribution over the model parameters collapses to the posterior over β alone. Even when σ² is also unknown, the factorization of the posterior distribution
P(σ², β|X, y) = P(β|σ², X, y) P(σ²|X, y)

suggests that determining the posterior distribution P(β|σ², X, y) will be of use as a step in posterior analyses.

Posterior distribution of β given σ²

Given our choice of (improper) prior we have
P(β|σ², X, y) P(σ²|X, y) ∝ N(y|Xβ, σ²I) σ⁻²

which, plugging in the normal likelihood and ignoring terms that are not a function of β, gives

P(β|σ², X, y) ∝ exp(−(1/2) (y − Xβ)ᵀ (1/σ²) I (y − Xβ))

When we expand out the exponent we get an expression that looks like (again dropping terms that do not involve β)

exp(−(1/2) (−2 yᵀ (1/σ²) I Xβ + βᵀ Xᵀ (1/σ²) I Xβ))

Multivariate Quadratic Square Completion

We recognize the familiar form of the exponent of a multivariate Gaussian in this expression and can derive the mean and the variance of the distribution of β|σ², … by noting that
(β − µ_β)ᵀ Σ_β⁻¹ (β − µ_β) = βᵀ Σ_β⁻¹ β − 2 µ_βᵀ Σ_β⁻¹ β + const

From this and the result from the previous slide

exp(−(1/2) (−2 yᵀ (1/σ²) I Xβ + βᵀ Xᵀ (1/σ²) I Xβ))

we can immediately identify Σ_β⁻¹ = Xᵀ (1/σ²) I X and thus that Σ_β = σ² (XᵀX)⁻¹. Similarly we can solve for µ_β and we find

µ_β = (XᵀX)⁻¹ Xᵀ y

Distribution of β given σ²

Mirroring the classical approach to matrix regression we have that the distribution of the regression coefficients given the observation noise variance is

β | y, X, σ² ∼ N(µ_β, Σ_β)

where Σ_β = σ² (XᵀX)⁻¹ and µ_β = (XᵀX)⁻¹ Xᵀ y
Note that µ_β is the same as the maximum likelihood or least squares estimate β̂ = (XᵀX)⁻¹ Xᵀ y of the regression coefficients.
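As a quick numerical check, a sketch on simulated data (all sizes and parameter values made up) showing that µ_β from the formulas above coincides with NumPy’s least squares solution:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data; sizes and parameter values are made up for illustration.
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
sigma2 = 0.25                                  # assume sigma^2 known for now
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=np.sqrt(sigma2), size=n)

# Posterior of beta given sigma^2 (noninformative prior):
XtX_inv = np.linalg.inv(X.T @ X)
mu_beta = XtX_inv @ X.T @ y                    # posterior mean
Sigma_beta = sigma2 * XtX_inv                  # posterior covariance

# mu_beta coincides with the least squares / ML estimate:
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(mu_beta, beta_ls)
```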
Of course we don’t usually know the observation noise variance σ² and have to simultaneously estimate it from the data. To determine the distribution of this quantity we need a few facts.

Scaled inverse-chi-square distribution

If θ ∼ Inv-χ²(ν, s²) then the pdf for θ is given by
P(θ) = ((ν/2)^(ν/2) / Γ(ν/2)) s^ν θ^(−(ν/2+1)) e^(−νs²/(2θ))
     ∝ θ^(−(ν/2+1)) e^(−νs²/(2θ))
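SciPy has no scaled inverse-χ² distribution built in, but Inv-χ²(ν, s²) is the same distribution as Inverse-Gamma(ν/2, νs²/2), which gives a way to check the density above numerically (the ν and s² values here are arbitrary illustration choices):

```python
import numpy as np
from math import gamma
from scipy.stats import invgamma

def scaled_inv_chi2_pdf(theta, nu, s2):
    """Density of Inv-chi^2(nu, s^2), written directly from the formula.
    Note s^nu = (s^2)^(nu/2)."""
    return ((nu / 2) ** (nu / 2) / gamma(nu / 2) * s2 ** (nu / 2)
            * theta ** (-(nu / 2 + 1)) * np.exp(-nu * s2 / (2 * theta)))

nu, s2 = 7.0, 0.4                      # arbitrary illustration values
theta = np.linspace(0.05, 3.0, 100)

# Inv-chi^2(nu, s^2) == Inverse-Gamma(alpha = nu/2, beta = nu * s^2 / 2)
ig = invgamma(a=nu / 2, scale=nu * s2 / 2)
assert np.allclose(scaled_inv_chi2_pdf(theta, nu, s2), ig.pdf(theta))
```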
You can think of the scaled inverse chi-squared distribution as the chi-squared distribution where the sum of squares is explicit in the parameterization. ν > 0 is the number of “degrees of freedom”; s > 0 is the scale parameter.

Distribution of σ² given observations y and X

The posterior distribution of the observation noise can be derived by noting that
P(σ²|y, X) = P(β, σ²|y, X) / P(β|σ², y, X)
           ∝ P(y|β, σ², X) P(β, σ²|X) / P(β|σ², y, X)

But we have all of these terms: P(y|β, σ², X) is the standard regression likelihood, we have just solved for the posterior distribution P(β|σ², y, X) of β given σ² and the rest, and we specified our prior P(σ², β) ∝ σ⁻².

Distribution of σ² given observations y and X

When we plug all of these known distributions into

P(σ²|y, X) ∝ P(y|β, σ², X) P(β, σ²|X) / P(β|σ², y, X)

we get

P(σ²|y, X) ∝ [σ^(−n) exp(−(1/2) (y − Xβ)ᵀ (1/σ²) I (y − Xβ)) σ⁻²] / [σ^(−p) exp(−(1/2) (β − µ_β)ᵀ Σ_β⁻¹ (β − µ_β))]

which simplifies to

∝ σ^(−n+p−2) exp(−(1/2) ((y − Xβ)ᵀ (1/σ²) I (y − Xβ) − (β − µ_β)ᵀ Σ_β⁻¹ (β − µ_β)))

Distribution of σ² given observations y and X

With significant algebraic effort one can arrive at

P(σ²|y, X) ∝ σ^(−n+p−2) exp(−(1/(2σ²)) (y − Xµ_β)ᵀ(y − Xµ_β))

Remembering that µ_β = β̂ we can rewrite this in a more familiar form, namely

P(σ²|y, X) ∝ σ^(−n+p−2) exp(−(1/(2σ²)) (y − Xβ̂)ᵀ(y − Xβ̂))

where the exponent is the sum of squared errors SSE.

Distribution of σ² given observations y and X

By inspection

P(σ²|y, X) ∝ σ^(−n+p−2) exp(−(1/(2σ²)) (y − Xβ̂)ᵀ(y − Xβ̂))

follows a scaled inverse χ² distribution
P(θ) ∝ θ^(−(ν/2+1)) e^(−νs²/(2θ))

where θ = σ² ⟹ ν = n − p (i.e. the number of degrees of freedom is the number of observations n minus the number of free parameters in the model p) and s² = (1/(n − p)) (y − Xβ̂)ᵀ(y − Xβ̂) is the standard MSE estimate of the sample variance.

Distribution of σ² given observations y and X

Note that this result

σ² ∼ Inv-χ²(n − p, (1/(n − p)) (y − Xβ̂)ᵀ(y − Xβ̂))    (1)

is exactly analogous to the following result from the classical estimation approach to linear regression.
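Result (1) is easy to sample from by composition, using the fact that if v ∼ χ²(n−p) then (n−p)s²/v ∼ Inv-χ²(n−p, s²). A sketch on simulated data (all values made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data with made-up "true" values.
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
sigma2_true = 0.25
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=np.sqrt(sigma2_true), size=n)

# beta_hat and s^2 as in the slides
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)

# sigma^2 | y, X ~ Inv-chi^2(n - p, s^2), sampled via (n-p) s^2 / chi^2_{n-p}
draws = (n - p) * s2 / rng.chisquare(n - p, size=100_000)
print(s2, draws.mean())   # both should sit near sigma2_true
```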
From Cochran’s Theorem we have
SSE/σ² = (y − Xβ̂)ᵀ(y − Xβ̂) / σ² ∼ χ²(n − p)    (2)

To get from (1) to (2) one can use the change of distribution formula with the change of variable θ* = (y − Xβ̂)ᵀ(y − Xβ̂)/σ².

Distribution of output(s) given new input(s)

Last but not least we will typically be interested in prediction.
P(y_new | X, y, X_new) = ∬ P(y_new|β, σ²) P(σ²|β, X, y) P(β|X, y) dβ dσ²
we will first assume, as usual, that σ² is known and proceed with evaluating
P(y_new | X, y, X_new, σ²) = ∫ P(y_new|β, σ²) P(β|X, y, σ²) dβ
instead.

Distribution of output(s) given new input(s)

We know the form of each of these expressions; the likelihood is normal, as is the distribution of β given the rest
P(y_new | X, y, X_new, σ²) = ∫ P(y_new|β, σ²) P(β|X, y, σ²) dβ
In other words
P(y_new | X, y, X_new, σ²) = ∫ N(y_new | X_new β, σ²I) N(β | β̂, Σ_β) dβ

Bayes Rule for Gaussians

To solve this integral we will use Bayes’ rule for Gaussians (taken from Bishop).
If
P(x) = N(x | µ, Λ⁻¹)
P(y|x) = N(y | Ax + b, L⁻¹)
where x, y, and µ are all vectors and Λ and L are (invertible) matrices of the appropriate size, then
P(y) = N(y | Aµ + b, L⁻¹ + AΛ⁻¹Aᵀ)
P(x|y) = N(x | Σ(AᵀL(y − b) + Λµ), Σ)

where Σ = (Λ + AᵀLA)⁻¹

Distribution of output(s) given new input(s)

Since this integral is just an application of Bayes rule for Gaussians we can directly write down the solution
P(y_new | X, y, X_new, σ²) = ∫ N(y_new | X_new β, σ²I) N(β | β̂, Σ_β) dβ
                           = N(y_new | X_new β̂, σ²(I + X_new V_β X_newᵀ))

where V_β = Σ_β/σ² = (XᵀX)⁻¹

Distribution of output(s) given new input(s)

This solution

P(y_new | X, y, X_new, σ²) = N(y_new | X_new β̂, σ²(I + X_new V_β X_newᵀ))

where V_β = Σ_β/σ² = (XᵀX)⁻¹
relies upon σ² being known. Our final inference objective is to come up with
P(y_new | X, y, X_new) = ∬ P(y_new|β, σ²) P(σ²|β, X, y) P(β|X, y) dβ dσ²
                       = ∫ P(y_new | X, y, X_new, σ²) P(σ²|X, y, X_new) dσ²
where we have just derived the first term and the second we know is scaled inverse chi-squared.

Distribution of output(s) given new input(s)

The distributional form of
P(y_new | X, y, X_new) = ∫ P(y_new | X, y, X_new, σ²) P(σ²|X, y, X_new) dσ²
is a multivariate Student-t distribution with center X_new β̂, squared scale matrix s²(I + X_new V_β X_newᵀ), and n − p degrees of freedom (left as homework).
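The claimed Student-t form can be checked by Monte Carlo composition sampling: draw σ² from its posterior, then β | σ², then y_new | β, σ², and compare the sample moments to the Student-t moments. A sketch on simulated data (the training data and the new input x_new are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated training data (all values made up for illustration).
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)
x_new = np.array([1.0, 0.3])            # a hypothetical new input

# Composition: sigma^2 ~ Inv-chi^2(n-p, s^2), beta | sigma^2, y_new | beta, sigma^2
m = 200_000
sig2 = (n - p) * s2 / rng.chisquare(n - p, size=m)
L = np.linalg.cholesky(XtX_inv)
beta = beta_hat + (rng.normal(size=(m, p)) @ L.T) * np.sqrt(sig2)[:, None]
y_new = beta @ x_new + rng.normal(size=m) * np.sqrt(sig2)

# Student-t center and squared scale from the slide
center = x_new @ beta_hat
scale2 = s2 * (1 + x_new @ XtX_inv @ x_new)
# Sample mean should match center; sample variance scale2 * (n-p)/(n-p-2)
print(y_new.mean() - center, y_new.var() - scale2 * (n - p) / (n - p - 2))
```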
Again this is the same result as in classical regression analysis: the predictive distribution of a new (set of) points is Student-t when σ² is unknown and marginalized out.

Take home
- The Bayesian perspective brings a new analytic perspective to the classical regression setting.
- In classical regression we develop estimators and then determine their distribution under repeated sampling or measurement of the underlying population.
- In Bayesian regression we stick with the single given dataset and calculate the uncertainty in our parameter estimates arising from the fact that we have a finite dataset.
- Given a single choice of prior, namely a particular improper prior, we see that the posterior uncertainty regarding the model parameters corresponds exactly to the classical sampling distributions for regression estimators.
- Other priors can be utilized.