A Bayesian Treatment of Linear Gaussian Regression
Frank Wood
December 3, 2009

Bayesian Approach to Classical Linear Regression

In classical linear regression we have the following model:
$$y \mid \beta, \sigma^2, X \sim \mathcal{N}(X\beta, \sigma^2 I).$$
Unfortunately we often know neither the observation error variance $\sigma^2$ nor the vector of linear weights $\beta$ that relates the input(s) to the output. In Bayesian regression we are interested in several inference objectives. One is the posterior distribution of the model parameters, in particular the posterior distribution of the observation error variance given the inputs and the outputs, $P(\sigma^2 \mid X, y)$.

Posterior Distribution of the Error Variance

In order to derive $P(\sigma^2 \mid X, y)$ we have to treat $\beta$ as a nuisance parameter and integrate it out:
$$P(\sigma^2 \mid X, y) = \int P(\sigma^2, \beta \mid X, y)\, d\beta = \int P(\sigma^2 \mid \beta, X, y)\, P(\beta \mid X, y)\, d\beta.$$

Predicting a New Output for a (set of) new Input(s)

Of particular interest is the ability to predict the distribution of output values for a new input, $P(y_{new} \mid X, y, X_{new})$. Here we have to treat both $\sigma^2$ and $\beta$ as nuisance parameters and integrate them out:
$$P(y_{new} \mid X, y, X_{new}) = \iint P(y_{new} \mid \beta, \sigma^2)\, P(\sigma^2 \mid \beta, X, y)\, P(\beta \mid X, y)\, d\beta\, d\sigma^2.$$

Noninformative Prior for Classical Regression

For both objectives we need to place a prior on the model parameters $\sigma^2$ and $\beta$. We will choose a noninformative prior to demonstrate the connection between the Bayesian approach to multiple regression and the classical approach:
$$P(\sigma^2, \beta) \propto \sigma^{-2}.$$
Is this a proper prior? What form will the posterior take in this case? Will it be proper? Clearly other, more informative priors can be imposed.

Posterior distribution of $\beta$ given $\sigma^2$

Sometimes it is the case that $\sigma^2$ is known. In such cases the posterior distribution over the model parameters collapses to the posterior over $\beta$ alone. Even when $\sigma^2$ is also unknown, the factorization of the posterior distribution
$$P(\sigma^2, \beta \mid X, y) = P(\beta \mid \sigma^2, X, y)\, P(\sigma^2 \mid X, y)$$
suggests that determining the posterior distribution $P(\beta \mid \sigma^2, X, y)$ will be of use as a step in posterior analyses.

Given our choice of (improper) prior we have
$$P(\beta \mid \sigma^2, X, y)\, P(\sigma^2 \mid X, y) \propto \mathcal{N}(y \mid X\beta, \sigma^2 I)\, \sigma^{-2}.$$
Plugging in the normal likelihood and ignoring terms that are not a function of $\beta$, we have
$$P(\beta \mid \sigma^2, X, y) \propto \exp\!\left(-\frac{1}{2}\,\frac{1}{\sigma^2}(y - X\beta)^T I (y - X\beta)\right).$$
When we expand the exponent we get an expression that looks like (again dropping terms that do not involve $\beta$)
$$\exp\!\left(-\frac{1}{2}\left(-2\,\frac{1}{\sigma^2} y^T I X \beta + \frac{1}{\sigma^2}\beta^T X^T I X \beta\right)\right).$$

Multivariate Quadratic Square Completion

We recognize the familiar form of the exponent of a multivariate Gaussian in this expression and can derive the mean and the variance of the distribution of $\beta \mid \sigma^2, \ldots$ by noting that
$$(\beta - \mu_\beta)^T \Sigma_\beta^{-1} (\beta - \mu_\beta) = \beta^T \Sigma_\beta^{-1} \beta - 2\mu_\beta^T \Sigma_\beta^{-1} \beta + \text{const}.$$
From this and the result from the previous slide,
$$\exp\!\left(-\frac{1}{2}\left(-2\,\frac{1}{\sigma^2} y^T I X \beta + \frac{1}{\sigma^2}\beta^T X^T I X \beta\right)\right),$$
we can immediately identify $\Sigma_\beta^{-1} = X^T \frac{1}{\sigma^2} I X$ and thus that $\Sigma_\beta = \sigma^2 (X^T X)^{-1}$. Similarly we can solve for $\mu_\beta$ and we find $\mu_\beta = (X^T X)^{-1} X^T y$.

Distribution of $\beta$ given $\sigma^2$

Mirroring the classical approach to matrix regression, the distribution of the regression coefficients given the observation noise variance is
$$\beta \mid y, X, \sigma^2 \sim \mathcal{N}(\mu_\beta, \Sigma_\beta),$$
where $\Sigma_\beta = \sigma^2 (X^T X)^{-1}$ and $\mu_\beta = (X^T X)^{-1} X^T y$. Note that $\mu_\beta$ is the same as the maximum likelihood or least squares estimate $\hat{\beta} = (X^T X)^{-1} X^T y$ of the regression coefficients. Of course we don't usually know the observation noise variance $\sigma^2$ and have to simultaneously estimate it from the data.
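To make this result concrete, here is a minimal NumPy sketch of the computation just derived; the function name posterior_beta and the use of an explicit matrix inverse are choices made for illustration here, not something from the slides.

```python
import numpy as np

def posterior_beta(X, y, sigma2):
    """Gaussian posterior over the weights given a known noise variance.

    Illustrative sketch: returns mu_beta = (X^T X)^{-1} X^T y and
    Sigma_beta = sigma^2 (X^T X)^{-1}, the mean and covariance of
    beta | y, X, sigma^2 under the improper prior P(sigma^2, beta) ∝ sigma^{-2}.
    """
    XtX_inv = np.linalg.inv(X.T @ X)   # (X^T X)^{-1}; assumes X has full column rank
    mu_beta = XtX_inv @ X.T @ y        # posterior mean = least squares estimate
    Sigma_beta = sigma2 * XtX_inv      # posterior covariance
    return mu_beta, Sigma_beta
```

In practice one would solve the normal equations with a QR or Cholesky factorization rather than forming $(X^T X)^{-1}$ explicitly; the inverse is kept here only to mirror the formula.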
To determine the distribution of this quantity we need a few facts.

Scaled inverse-chi-square distribution

If $\theta \sim \text{Inv-}\chi^2(\nu, s^2)$ then the pdf of $\theta$ is given by
$$P(\theta) = \frac{(\nu/2)^{\nu/2}}{\Gamma(\nu/2)}\, s^{\nu}\, \theta^{-(\nu/2+1)}\, e^{-\nu s^2/(2\theta)} \propto \theta^{-(\nu/2+1)}\, e^{-\nu s^2/(2\theta)}.$$
You can think of the scaled inverse chi-squared distribution as the chi-squared distribution where the sum of squares is explicit in the parameterization. Here $\nu > 0$ is the number of "degrees of freedom" and $s > 0$ is the scale parameter.

Distribution of $\sigma^2$ given observations $y$ and $X$

The posterior distribution of the observation noise can be derived by noting that
$$P(\sigma^2 \mid y, X) = \frac{P(\beta, \sigma^2 \mid y, X)}{P(\beta \mid \sigma^2, y, X)} \propto \frac{P(y \mid \beta, \sigma^2, X)\, P(\beta, \sigma^2 \mid X)}{P(\beta \mid \sigma^2, y, X)}.$$
But we have all of these terms: $P(y \mid \beta, \sigma^2, X)$ is the standard regression likelihood, we have just solved for the posterior distribution of $\beta$ given $\sigma^2$ and the rest, $P(\beta \mid \sigma^2, y, X)$, and we specified our prior $P(\sigma^2, \beta) \propto \sigma^{-2}$.

When we plug all of these known distributions into
$$P(\sigma^2 \mid y, X) \propto \frac{P(y \mid \beta, \sigma^2, X)\, P(\beta, \sigma^2 \mid X)}{P(\beta \mid \sigma^2, y, X)} \propto \frac{\sigma^{-n} \exp\!\left(-\frac{1}{2}\frac{1}{\sigma^2}(y - X\beta)^T I (y - X\beta)\right) \sigma^{-2}}{\sigma^{-p} \exp\!\left(-\frac{1}{2}(\beta - \mu_\beta)^T \Sigma_\beta^{-1} (\beta - \mu_\beta)\right)},$$
it simplifies to
$$P(\sigma^2 \mid y, X) \propto \sigma^{-n+p-2} \exp\!\left(-\frac{1}{2}\left(\frac{1}{\sigma^2}(y - X\beta)^T I (y - X\beta) - (\beta - \mu_\beta)^T \Sigma_\beta^{-1} (\beta - \mu_\beta)\right)\right).$$

With significant algebraic effort one can arrive at
$$P(\sigma^2 \mid y, X) \propto \sigma^{-n+p-2} \exp\!\left(-\frac{1}{2\sigma^2}(y - X\mu_\beta)^T (y - X\mu_\beta)\right).$$
Remembering that $\mu_\beta = \hat{\beta}$, we can rewrite this in a more familiar form, namely
$$P(\sigma^2 \mid y, X) \propto \sigma^{-n+p-2} \exp\!\left(-\frac{1}{2\sigma^2}(y - X\hat{\beta})^T (y - X\hat{\beta})\right),$$
where the exponent is the sum of squared errors (SSE).

By inspection,
$$P(\sigma^2 \mid y, X) \propto \sigma^{-n+p-2} \exp\!\left(-\frac{1}{2\sigma^2}(y - X\hat{\beta})^T (y - X\hat{\beta})\right)$$
follows a scaled inverse $\chi^2$ distribution
$$P(\theta) \propto \theta^{-(\nu/2+1)}\, e^{-\nu s^2/(2\theta)},$$
where $\theta = \sigma^2 \implies \nu = n - p$ (i.e. the number of degrees of freedom is the number of observations $n$ minus the number of free parameters in the model $p$) and $s^2 = \frac{1}{n-p}(y - X\hat{\beta})^T (y - X\hat{\beta})$ is the standard MSE estimate of the sample variance.

Note that this result,
$$\sigma^2 \sim \text{Inv-}\chi^2\!\left(n - p,\; \tfrac{1}{n-p}(y - X\hat{\beta})^T (y - X\hat{\beta})\right), \qquad (1)$$
is exactly analogous to the following result from the classical estimation approach to linear regression. From Cochran's Theorem we have
$$\frac{\text{SSE}}{\sigma^2} = \frac{(y - X\hat{\beta})^T (y - X\hat{\beta})}{\sigma^2} \sim \chi^2(n - p). \qquad (2)$$
To get from (1) to (2) one can use the change of distribution formula with the change of variable $\theta^* = (y - X\hat{\beta})^T (y - X\hat{\beta}) / \sigma^2$.
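Before moving on to prediction, here is a small sketch of how one might compute $s^2$ and draw samples from this scaled inverse-$\chi^2$ posterior; the helper name posterior_sigma2_samples is hypothetical and only for illustration. It relies on the standard fact that if $Z \sim \chi^2(\nu)$ then $\nu s^2 / Z \sim \text{Inv-}\chi^2(\nu, s^2)$.

```python
import numpy as np

def posterior_sigma2_samples(X, y, n_samples=10_000, rng=None):
    """Draw samples from P(sigma^2 | y, X) = Inv-chi^2(n - p, s^2).

    Illustrative sketch under the noninformative prior from the notes.
    Uses the recipe: sigma^2 = (n - p) * s^2 / Z with Z ~ chi^2(n - p).
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # least squares estimate
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - p)                      # MSE estimate of the variance
    z = rng.chisquare(n - p, size=n_samples)          # Z ~ chi^2(n - p)
    return (n - p) * s2 / z                           # scaled inverse chi^2 draws
```

As a quick sanity check, the empirical mean of these draws should be close to $(n-p)\,s^2/(n-p-2)$, the mean of the scaled inverse-$\chi^2$ distribution, when $n - p > 2$.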
Distribution of output(s) given new input(s)

Last but not least, we will typically be interested in prediction:
$$P(y_{new} \mid X, y, X_{new}) = \iint P(y_{new} \mid \beta, \sigma^2)\, P(\sigma^2 \mid \beta, X, y)\, P(\beta \mid X, y)\, d\beta\, d\sigma^2.$$
We will first assume, as usual, that $\sigma^2$ is known and proceed with evaluating
$$P(y_{new} \mid X, y, X_{new}, \sigma^2) = \int P(y_{new} \mid \beta, \sigma^2)\, P(\beta \mid X, y, \sigma^2)\, d\beta$$
instead.

We know the form of each of these expressions; the likelihood is normal, as is the distribution of $\beta$ given the rest. In other words,
$$P(y_{new} \mid X, y, X_{new}, \sigma^2) = \int \mathcal{N}(y_{new} \mid X_{new}\beta, \sigma^2 I)\, \mathcal{N}(\beta \mid \hat{\beta}, \Sigma_\beta)\, d\beta.$$

Bayes Rule for Gaussians

To solve this integral we will use Bayes' rule for Gaussians (taken from Bishop). If
$$P(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1}), \qquad P(y \mid x) = \mathcal{N}(y \mid Ax + b, L^{-1}),$$
where $x$, $y$, and $\mu$ are all vectors and $\Lambda$ and $L$ are (invertible) matrices of the appropriate size, then
$$P(y) = \mathcal{N}(y \mid A\mu + b,\; L^{-1} + A\Lambda^{-1}A^T),$$
$$P(x \mid y) = \mathcal{N}\!\left(x \mid \Sigma(A^T L (y - b) + \Lambda\mu),\; \Sigma\right), \qquad \Sigma = (\Lambda + A^T L A)^{-1}.$$

Distribution of output(s) given new input(s)

Since this integral is just an application of Bayes' rule for Gaussians, we can directly write down the solution:
$$P(y_{new} \mid X, y, X_{new}, \sigma^2) = \int \mathcal{N}(y_{new} \mid X_{new}\beta, \sigma^2 I)\, \mathcal{N}(\beta \mid \hat{\beta}, \Sigma_\beta)\, d\beta = \mathcal{N}\!\left(y_{new} \mid X_{new}\hat{\beta},\; \sigma^2 (I + X_{new} V_\beta X_{new}^T)\right),$$
where $V_\beta = \Sigma_\beta / \sigma^2 = (X^T X)^{-1}$.

This solution relies upon $\sigma^2$ being known. Our final inference objective is to come up with
$$P(y_{new} \mid X, y, X_{new}) = \iint P(y_{new} \mid \beta, \sigma^2)\, P(\sigma^2 \mid \beta, X, y)\, P(\beta \mid X, y)\, d\beta\, d\sigma^2 = \int P(y_{new} \mid X, y, X_{new}, \sigma^2)\, P(\sigma^2 \mid X, y, X_{new})\, d\sigma^2,$$
where we have just derived the first term and the second we know is scaled inverse chi-squared.

The distributional form of
$$P(y_{new} \mid X, y, X_{new}) = \int P(y_{new} \mid X, y, X_{new}, \sigma^2)\, P(\sigma^2 \mid X, y, X_{new})\, d\sigma^2$$
is a multivariate Student-$t$ distribution with center $X_{new}\hat{\beta}$, squared scale matrix $s^2 (I + X_{new} V_\beta X_{new}^T)$, and $n - p$ degrees of freedom (left as homework).
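As a closing illustration, the sketch below assembles the pieces above into the parameters of this Student-$t$ predictive; the function name predictive_posterior is made up for this example, and the code assumes the noninformative prior used throughout these notes.

```python
import numpy as np

def predictive_posterior(X, y, X_new):
    """Parameters of the Student-t posterior predictive P(y_new | X, y, X_new).

    Illustrative sketch under the prior P(sigma^2, beta) ∝ sigma^{-2}.
    Returns (center, squared scale matrix, degrees of freedom):
    center  = X_new beta_hat
    scale^2 = s^2 (I + X_new V_beta X_new^T)
    dof     = n - p
    """
    n, p = X.shape
    V_beta = np.linalg.inv(X.T @ X)        # (X^T X)^{-1}
    beta_hat = V_beta @ X.T @ y            # least squares / posterior mean
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - p)           # MSE estimate of sigma^2
    center = X_new @ beta_hat
    scale2 = s2 * (np.eye(X_new.shape[0]) + X_new @ V_beta @ X_new.T)
    return center, scale2, n - p
```

For a single new input the marginal predictive is a univariate Student-$t$, so, for example, an equal-tailed 95% predictive interval is $X_{new}\hat{\beta} \pm t_{0.975,\, n-p}\sqrt{s^2 (1 + X_{new} V_\beta X_{new}^T)}$.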