
Bayesian Linear Model: Gory Details

Pubh7440 Notes by Sudipto Banerjee

Let $y = [y_i]_{i=1}^{n}$ be an $n \times 1$ vector of independent observations on a dependent variable (or response) from $n$ experimental units. Associated with each $y_i$ is a $p \times 1$ vector of regressors, say $x_i$, leading to the model

\[
y = X\beta + \epsilon, \tag{1}
\]

where $X = [x_i^T]_{i=1}^{n}$ is the $n \times p$ matrix of regressors, with $i$-th row $x_i^T$, and is assumed fixed, $\beta$ is the $p \times 1$ vector of regression coefficients, and $\epsilon = [\epsilon_i]_{i=1}^{n}$ is the vector of random variables representing “pure error” or measurement error in the dependent variable. For independent observations, we assume $\epsilon \sim MVN(0, \sigma^2 I_n)$, viz. that each component $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$. Furthermore, we will assume that the columns of the matrix $X$ are linearly independent so that the rank of $X$ is $p$.
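As a concrete illustration (not part of the original notes), the following Python sketch simulates a data set from model (1); the design, the coefficients beta_true, and the error variance sigma2_true are hypothetical choices.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3

# Fixed n x p design matrix with an intercept column; rank p by construction.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, -2.0, 0.5])   # hypothetical regression coefficients
sigma2_true = 0.7                        # hypothetical error variance

# y = X beta + eps, with eps ~ N(0, sigma2 I_n) as in (1)
eps = rng.normal(scale=np.sqrt(sigma2_true), size=n)
y = X @ beta_true + eps
\end{verbatim}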

1 The NIG conjugate prior family

A popular Bayesian model builds upon the linear regression (1) of $y$ using conjugate priors, specifying

\begin{align*}
p(\beta, \sigma^2) &= p(\beta \mid \sigma^2)\, p(\sigma^2) = N(\mu_\beta, \sigma^2 V_\beta) \times IG(a, b) = NIG(\mu_\beta, V_\beta, a, b) \\
&= \frac{b^a}{(2\pi)^{p/2}|V_\beta|^{1/2}\Gamma(a)}\left(\frac{1}{\sigma^2}\right)^{a+p/2+1} \exp\left[-\frac{1}{\sigma^2}\left\{b + \frac{1}{2}(\beta - \mu_\beta)^T V_\beta^{-1}(\beta - \mu_\beta)\right\}\right] \\
&\propto \left(\frac{1}{\sigma^2}\right)^{a+p/2+1} \exp\left[-\frac{1}{\sigma^2}\left\{b + \frac{1}{2}(\beta - \mu_\beta)^T V_\beta^{-1}(\beta - \mu_\beta)\right\}\right], \tag{2}
\end{align*}
where $\Gamma(\cdot)$ represents the Gamma function and the $IG(a, b)$ prior density for $\sigma^2$ is given by

\[
p(\sigma^2) = \frac{b^a}{\Gamma(a)}\left(\frac{1}{\sigma^2}\right)^{a+1} \exp\left(-\frac{b}{\sigma^2}\right), \quad \sigma^2 > 0,
\]

where $a, b > 0$. We call this the Normal-Inverse-Gamma (NIG) prior and denote it as $NIG(\mu_\beta, V_\beta, a, b)$. The NIG is a joint probability distribution for a vector $\beta$ and a scalar $\sigma^2$. If $(\beta, \sigma^2) \sim NIG(\mu, V, a, b)$, then an interesting analytic form results from integrating out $\sigma^2$ from the joint density:

\begin{align*}
\int NIG(\mu, V, a, b)\, d\sigma^2
&= \frac{b^a}{(2\pi)^{p/2}|V|^{1/2}\Gamma(a)} \int \left(\frac{1}{\sigma^2}\right)^{a+p/2+1} \exp\left[-\frac{1}{\sigma^2}\left\{b + \frac{1}{2}(\beta - \mu)^T V^{-1}(\beta - \mu)\right\}\right] d\sigma^2 \\
&= \frac{b^a\,\Gamma\!\left(a + \frac{p}{2}\right)}{(2\pi)^{p/2}|V|^{1/2}\Gamma(a)} \left[b + \frac{1}{2}(\beta - \mu)^T V^{-1}(\beta - \mu)\right]^{-\left(a + \frac{p}{2}\right)} \\
&= \frac{\Gamma\!\left(a + \frac{p}{2}\right)}{\Gamma(a)\,\pi^{p/2}\left|(2a)\frac{b}{a}V\right|^{1/2}} \left[1 + \frac{(\beta - \mu)^T \left(\frac{b}{a}V\right)^{-1}(\beta - \mu)}{2a}\right]^{-\frac{2a+p}{2}},
\end{align*}
where the second equality uses the gamma integral $\int_0^\infty \left(\frac{1}{\sigma^2}\right)^{c+1}\exp\left(-\frac{d}{\sigma^2}\right) d\sigma^2 = \Gamma(c)/d^{c}$ with $c = a + p/2$ and $d = b + \frac{1}{2}(\beta - \mu)^T V^{-1}(\beta - \mu)$.

This is a multivariate t density:

\[
MVSt_\nu(\mu, \Sigma) = \frac{\Gamma\!\left(\frac{\nu+p}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)\pi^{p/2}|\nu\Sigma|^{1/2}} \left[1 + \frac{(\beta - \mu)^T \Sigma^{-1}(\beta - \mu)}{\nu}\right]^{-\frac{\nu+p}{2}}, \tag{3}
\]

with $\nu = 2a$ and $\Sigma = \frac{b}{a} V$.
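The fact that integrating out $\sigma^2$ leaves a multivariate t for $\beta$ can be checked by simulation. The sketch below (all settings are illustrative, and it takes $a > 1$ so that the covariance exists) draws $(\beta, \sigma^2)$ from the NIG prior by composition and compares the empirical covariance of $\beta$ with the covariance $\frac{\nu}{\nu - 2}\Sigma$ of $MVSt_\nu(\mu, \Sigma)$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
p, a, b = 2, 3.0, 2.0
mu = np.zeros(p)
V = np.array([[1.0, 0.3],
              [0.3, 2.0]])

L = 200_000
# sigma^2 ~ IG(a, b): draw Gamma(a, rate=b) and invert
sigma2 = 1.0 / rng.gamma(shape=a, scale=1.0 / b, size=L)
# beta | sigma^2 ~ N(mu, sigma^2 V)
beta = mu + np.sqrt(sigma2)[:, None] * rng.multivariate_normal(np.zeros(p), V, size=L)

nu = 2 * a
Sigma = (b / a) * V
print(np.cov(beta.T))            # empirical covariance of beta
print(nu / (nu - 2) * Sigma)     # covariance of MVSt_nu(mu, Sigma); should agree
\end{verbatim}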

2 The likelihood

The likelihood for the model is defined, up to proportionality, as the joint probability of observing the data given the parameters. Since $X$ is fixed, the likelihood is given by

\[
p(y \mid \beta, \sigma^2) = N(X\beta, \sigma^2 I) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left[-\frac{1}{2\sigma^2}(y - X\beta)^T(y - X\beta)\right]. \tag{4}
\]
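For later use it may help to see (4) in code; a minimal sketch of the log-likelihood, with the function name log_likelihood being ours rather than the notes':

\begin{verbatim}
import numpy as np

def log_likelihood(y, X, beta, sigma2):
    """log p(y | beta, sigma^2) for model (1) with eps ~ N(0, sigma^2 I_n)."""
    n = y.shape[0]
    resid = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * resid @ resid / sigma2
\end{verbatim}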

3 The posterior distribution from the NIG prior

Inference will proceed from the posterior distribution

\[
p(\beta, \sigma^2 \mid y) = \frac{p(\beta, \sigma^2)\, p(y \mid \beta, \sigma^2)}{p(y)},
\]
where $p(y) = \int p(\beta, \sigma^2)\, p(y \mid \beta, \sigma^2)\, d\beta\, d\sigma^2$ is the marginal distribution of the data. The key to

deriving the joint posterior distribution is the following easily verified multivariate completion of

squares or ellipsoidal rectification identity:

\[
u^T A u - 2\alpha^T u = \left(u - A^{-1}\alpha\right)^T A\left(u - A^{-1}\alpha\right) - \alpha^T A^{-1}\alpha, \tag{5}
\]

where $A$ is a symmetric positive definite (hence invertible) matrix. An application of this identity immediately reveals

\[
\frac{1}{\sigma^2}\left\{b + \frac{1}{2}(\beta - \mu_\beta)^T V_\beta^{-1}(\beta - \mu_\beta)\right\} + \frac{1}{\sigma^2}\left\{\frac{1}{2}(y - X\beta)^T(y - X\beta)\right\} = \frac{1}{\sigma^2}\left\{b^* + \frac{1}{2}(\beta - \mu^*)^T V^{*-1}(\beta - \mu^*)\right\},
\]
using which we can write the posterior as

\[
p(\beta, \sigma^2 \mid y) \propto \left(\frac{1}{\sigma^2}\right)^{a+(n+p)/2+1} \exp\left[-\frac{1}{\sigma^2}\left\{b^* + \frac{1}{2}(\beta - \mu^*)^T V^{*-1}(\beta - \mu^*)\right\}\right], \tag{6}
\]
where

\begin{align*}
\mu^* &= \left(V_\beta^{-1} + X^T X\right)^{-1}\left(V_\beta^{-1}\mu_\beta + X^T y\right), \\
V^* &= \left(V_\beta^{-1} + X^T X\right)^{-1}, \\
a^* &= a + n/2, \\
b^* &= b + \frac{1}{2}\left[\mu_\beta^T V_\beta^{-1}\mu_\beta + y^T y - \mu^{*T} V^{*-1}\mu^*\right].
\end{align*}

This posterior distribution is easily identified as $NIG(\mu^*, V^*, a^*, b^*)$, proving the NIG to be a conjugate family for the linear regression model.
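These updates translate directly into code. The following Python sketch implements the formulas above; the function name nig_posterior is a hypothetical label.

\begin{verbatim}
import numpy as np

def nig_posterior(y, X, mu_b, V_b, a, b):
    """Return (mu*, V*, a*, b*) of the NIG posterior for the conjugate model."""
    n = y.shape[0]
    V_b_inv = np.linalg.inv(V_b)
    V_star_inv = V_b_inv + X.T @ X
    V_star = np.linalg.inv(V_star_inv)
    mu_star = V_star @ (V_b_inv @ mu_b + X.T @ y)
    a_star = a + n / 2.0
    b_star = b + 0.5 * (mu_b @ V_b_inv @ mu_b + y @ y
                        - mu_star @ V_star_inv @ mu_star)
    return mu_star, V_star, a_star, b_star
\end{verbatim}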

Note that the marginal posterior distribution of $\sigma^2$ is immediately seen to be an $IG(a^*, b^*)$, whose density is given by:

\[
p(\sigma^2 \mid y) = \frac{(b^*)^{a^*}}{\Gamma(a^*)} \left(\frac{1}{\sigma^2}\right)^{a^*+1} \exp\left(-\frac{b^*}{\sigma^2}\right). \tag{7}
\]

The marginal posterior distribution of $\beta$ is obtained by integrating out $\sigma^2$ from the NIG joint posterior as follows:
\begin{align*}
p(\beta \mid y) &= \int p(\beta, \sigma^2 \mid y)\, d\sigma^2 = \int NIG(\mu^*, V^*, a^*, b^*)\, d\sigma^2 \\
&\propto \int \left(\frac{1}{\sigma^2}\right)^{a^*+p/2+1} \exp\left[-\frac{1}{\sigma^2}\left\{b^* + \frac{1}{2}(\beta - \mu^*)^T V^{*-1}(\beta - \mu^*)\right\}\right] d\sigma^2 \\
&\propto \left[1 + \frac{(\beta - \mu^*)^T V^{*-1}(\beta - \mu^*)}{2b^*}\right]^{-(a^* + p/2)}.
\end{align*}

This is a multivariate t density:

\[
MVSt_{\nu^*}(\mu^*, \Sigma^*) = \frac{\Gamma\!\left(\frac{\nu^*+p}{2}\right)}{\Gamma\!\left(\frac{\nu^*}{2}\right)\pi^{p/2}|\nu^*\Sigma^*|^{1/2}} \left[1 + \frac{(\beta - \mu^*)^T \Sigma^{*-1}(\beta - \mu^*)}{\nu^*}\right]^{-\frac{\nu^*+p}{2}}, \tag{8}
\]
with $\nu^* = 2a^*$ and $\Sigma^* = \frac{b^*}{a^*} V^*$.

4 A useful expression for the NIG scale parameter

Here we will prove:

\[
b^* = b + \frac{1}{2}(y - X\mu_\beta)^T \left(I + XV_\beta X^T\right)^{-1}(y - X\mu_\beta). \tag{9}
\]

On account of the expression for b∗ derived in the preceding section, it suffices to prove that

\[
y^T y + \mu_\beta^T V_\beta^{-1}\mu_\beta - \mu^{*T} V^{*-1}\mu^* = (y - X\mu_\beta)^T \left(I + XV_\beta X^T\right)^{-1}(y - X\mu_\beta).
\]

Substituting $\mu^* = V^*\left(V_\beta^{-1}\mu_\beta + X^T y\right)$ in the left hand side above we obtain:

\begin{align*}
y^T y + \mu_\beta^T V_\beta^{-1}\mu_\beta - \mu^{*T} V^{*-1}\mu^*
&= y^T y + \mu_\beta^T V_\beta^{-1}\mu_\beta - \left(V_\beta^{-1}\mu_\beta + X^T y\right)^T V^*\left(V_\beta^{-1}\mu_\beta + X^T y\right) \\
&= y^T\left(I - XV^*X^T\right)y - 2\,y^T X V^* V_\beta^{-1}\mu_\beta + \mu_\beta^T\left(V_\beta^{-1} - V_\beta^{-1} V^* V_\beta^{-1}\right)\mu_\beta. \tag{10}
\end{align*}

Further development of the proof will employ two tricky identities. The first is the well-known

Sherman-Woodbury-Morrison identity in matrix algebra:

\[
(A + BDC)^{-1} = A^{-1} - A^{-1}B\left(D^{-1} + CA^{-1}B\right)^{-1}CA^{-1}, \tag{11}
\]
where $A$ and $D$ are square matrices that are invertible and $B$ and $C$ are rectangular (square if $A$ and $D$ have the same dimensions) matrices such that the multiplications are well-defined. This identity is easily verified by multiplying the right hand side with $A + BDC$ and simplifying to reduce it to the identity matrix.
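A quick numerical sanity check of (11) with arbitrary illustrative matrices (the diagonal shifts only serve to make $A$ and $D$ comfortably invertible):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(11)
nA, nD = 4, 2
A = rng.normal(size=(nA, nA)) + nA * np.eye(nA)
D = rng.normal(size=(nD, nD)) + nD * np.eye(nD)
B = rng.normal(size=(nA, nD))
C = rng.normal(size=(nD, nA))

lhs = np.linalg.inv(A + B @ D @ C)
Ainv = np.linalg.inv(A)
rhs = Ainv - Ainv @ B @ np.linalg.inv(np.linalg.inv(D) + C @ Ainv @ B) @ C @ Ainv
print(np.allclose(lhs, rhs))   # expect True
\end{verbatim}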

Applying (11) twice, once with $A = V_\beta$ and $D = (X^T X)^{-1}$ to get the second equality and then with $A = (X^T X)^{-1}$ and $D = V_\beta$ to get the third equality, we have

\begin{align*}
V_\beta^{-1} - V_\beta^{-1} V^* V_\beta^{-1} &= V_\beta^{-1} - V_\beta^{-1}\left(V_\beta^{-1} + X^T X\right)^{-1} V_\beta^{-1} \\
&= \left[V_\beta + (X^T X)^{-1}\right]^{-1} \\
&= X^T X - X^T X\left(X^T X + V_\beta^{-1}\right)^{-1} X^T X \\
&= X^T\left(I_n - X V^* X^T\right) X. \tag{12}
\end{align*}

The next identity notes that since $V^*\left(V_\beta^{-1} + X^T X\right) = I_p$, we have $V^* V_\beta^{-1} = I_p - V^* X^T X$, so that

\[
X V^* V_\beta^{-1} = X - X V^* X^T X = \left(I_n - X V^* X^T\right) X. \tag{13}
\]

Substituting (12) and (13) in (10) we obtain

\begin{align*}
& y^T\left(I_n - XV^*X^T\right)y - 2\,y^T\left(I_n - XV^*X^T\right)X\mu_\beta + \mu_\beta^T X^T\left(I_n - XV^*X^T\right)X\mu_\beta \\
&\quad= (y - X\mu_\beta)^T\left(I_n - XV^*X^T\right)(y - X\mu_\beta) \\
&\quad= (y - X\mu_\beta)^T\left(I_n + XV_\beta X^T\right)^{-1}(y - X\mu_\beta), \tag{14}
\end{align*}
where the last step is again a consequence of (11):

\[
\left(I_n + XV_\beta X^T\right)^{-1} = I_n - X\left(V_\beta^{-1} + X^T X\right)^{-1}X^T = I_n - XV^*X^T.
\]
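Identity (9) is also easy to verify numerically. The sketch below (all data are illustrative) computes $b^*$ from the Section 3 expression and from (9) and checks that the two agree.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
mu_b = rng.normal(size=p)
A = rng.normal(size=(p, p))
V_b = A @ A.T + np.eye(p)            # a positive definite V_beta
a, b = 2.0, 1.0

# b* via the Section 3 expression
V_b_inv = np.linalg.inv(V_b)
V_star_inv = V_b_inv + X.T @ X
mu_star = np.linalg.solve(V_star_inv, V_b_inv @ mu_b + X.T @ y)
b_star = b + 0.5 * (mu_b @ V_b_inv @ mu_b + y @ y - mu_star @ V_star_inv @ mu_star)

# b* via identity (9)
r = y - X @ mu_b
b_star_alt = b + 0.5 * r @ np.linalg.solve(np.eye(n) + X @ V_b @ X.T, r)
print(np.isclose(b_star, b_star_alt))   # expect True
\end{verbatim}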

5 Marginal distributions – the hard way

To obtain the marginal distribution of $y$, we first compute the distribution $p(y \mid \sigma^2)$ by integrating out $\beta$ and subsequently integrate out $\sigma^2$ to obtain $p(y)$. To be precise, we use the expression for $b^*$ derived in the preceding section, proceeding as below:
\begin{align*}
p(y \mid \sigma^2) &= \int p(y \mid \beta, \sigma^2)\, p(\beta \mid \sigma^2)\, d\beta = \int N(X\beta, \sigma^2 I_n) \times N(\mu_\beta, \sigma^2 V_\beta)\, d\beta \\
&= \frac{1}{(2\pi\sigma^2)^{\frac{n+p}{2}}|V_\beta|^{1/2}} \int \exp\left[-\frac{1}{2\sigma^2}\left\{(y - X\beta)^T(y - X\beta) + (\beta - \mu_\beta)^T V_\beta^{-1}(\beta - \mu_\beta)\right\}\right] d\beta \\
&= \frac{1}{(2\pi\sigma^2)^{\frac{n+p}{2}}|V_\beta|^{1/2}} \int \exp\left[-\frac{1}{2\sigma^2}\left\{(y - X\mu_\beta)^T(I + XV_\beta X^T)^{-1}(y - X\mu_\beta) + (\beta - \mu^*)^T V^{*-1}(\beta - \mu^*)\right\}\right] d\beta \\
&= \frac{1}{(2\pi\sigma^2)^{\frac{n+p}{2}}|V_\beta|^{1/2}} \exp\left[-\frac{1}{2\sigma^2}(y - X\mu_\beta)^T(I + XV_\beta X^T)^{-1}(y - X\mu_\beta)\right] \int \exp\left[-\frac{1}{2\sigma^2}(\beta - \mu^*)^T V^{*-1}(\beta - \mu^*)\right] d\beta \\
&= \frac{|V^*|^{1/2}}{(2\pi\sigma^2)^{\frac{n}{2}}|V_\beta|^{1/2}} \exp\left[-\frac{1}{2\sigma^2}(y - X\mu_\beta)^T(I + XV_\beta X^T)^{-1}(y - X\mu_\beta)\right] \\
&= \frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}|I + XV_\beta X^T|^{1/2}} \exp\left[-\frac{1}{2\sigma^2}(y - X\mu_\beta)^T(I + XV_\beta X^T)^{-1}(y - X\mu_\beta)\right] \\
&= N\!\left(X\mu_\beta, \sigma^2(I + XV_\beta X^T)\right). \tag{15}
\end{align*}

Here we have applied the matrix identity

\[
|A + BDC| = |A|\,|D|\,\left|D^{-1} + CA^{-1}B\right| \tag{16}
\]

to obtain
\[
|I_n + XV_\beta X^T| = |V_\beta|\,\left|V_\beta^{-1} + X^T X\right| = \frac{|V_\beta|}{|V^*|}.
\]
Now, the marginal distribution $p(y)$ is obtained by integrating a NIG density as follows:

\begin{align*}
p(y) &= \int p(y \mid \sigma^2)\, p(\sigma^2)\, d\sigma^2 = \int N\!\left(X\mu_\beta, \sigma^2(I + XV_\beta X^T)\right) \times IG(a, b)\, d\sigma^2 \\
&= \int NIG\!\left(X\mu_\beta, (I + XV_\beta X^T), a, b\right) d\sigma^2 = MVSt_{2a}\!\left(X\mu_\beta, \frac{b}{a}(I + XV_\beta X^T)\right). \tag{17}
\end{align*}

Rewriting our result slightly differently reveals another useful property of the NIG density:

\begin{align*}
p(y) &= \int p(y \mid \beta, \sigma^2)\, p(\beta, \sigma^2)\, d\beta\, d\sigma^2 \\
&= \int N(X\beta, \sigma^2 I_n) \times NIG(\mu_\beta, V_\beta, a, b)\, d\beta\, d\sigma^2 = MVSt_{2a}\!\left(X\mu_\beta, \frac{b}{a}(I + XV_\beta X^T)\right). \tag{18}
\end{align*}

Of course, the computation of p(y) could also be carried out in terms of the NIG distribution

parameters more directly as

\begin{align*}
p(y) &= \int p(y \mid \beta, \sigma^2)\, p(\beta, \sigma^2)\, d\beta\, d\sigma^2 = \int N(X\beta, \sigma^2 I_n) \times NIG(\mu_\beta, V_\beta, a, b)\, d\beta\, d\sigma^2 \\
&= \frac{b^a}{(2\pi)^{(n+p)/2}|V_\beta|^{1/2}\Gamma(a)} \int \left(\frac{1}{\sigma^2}\right)^{a^* + p/2 + 1} \exp\left[-\frac{1}{\sigma^2}\left\{b^* + \frac{1}{2}(\beta - \mu^*)^T V^{*-1}(\beta - \mu^*)\right\}\right] d\beta\, d\sigma^2 \\
&= \frac{b^a}{(2\pi)^{(n+p)/2}|V_\beta|^{1/2}\Gamma(a)} \times \frac{\Gamma(a^*)\,(2\pi)^{p/2}|V^*|^{1/2}}{(b^*)^{a^*}} \\
&= \frac{b^a\,\Gamma\!\left(a + \frac{n}{2}\right)}{(2\pi)^{n/2}\Gamma(a)} \sqrt{\frac{|V^*|}{|V_\beta|}} \times \left[b + \frac{1}{2}\left\{\mu_\beta^T V_\beta^{-1}\mu_\beta + y^T y - \mu^{*T} V^{*-1}\mu^*\right\}\right]^{-(a + n/2)}. \tag{19}
\end{align*}
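As a numerical check of (19) against the multivariate t form in (17)/(18), the sketch below (illustrative data; it assumes scipy.stats.multivariate_t, available in recent SciPy releases) evaluates $\log p(y)$ both ways and confirms that they match.

\begin{verbatim}
import numpy as np
from scipy.special import gammaln
from scipy.stats import multivariate_t

rng = np.random.default_rng(3)
n, p = 15, 2
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
mu_b, V_b, a, b = np.zeros(p), np.eye(p), 3.0, 2.0

# Posterior pieces appearing in (19)
V_b_inv = np.linalg.inv(V_b)
V_star_inv = V_b_inv + X.T @ X
V_star = np.linalg.inv(V_star_inv)
mu_star = V_star @ (V_b_inv @ mu_b + X.T @ y)
b_star = b + 0.5 * (mu_b @ V_b_inv @ mu_b + y @ y - mu_star @ V_star_inv @ mu_star)

# log p(y) from (19)
log_py_19 = (a * np.log(b) + gammaln(a + n / 2) - gammaln(a)
             - 0.5 * n * np.log(2 * np.pi)
             + 0.5 * (np.linalg.slogdet(V_star)[1] - np.linalg.slogdet(V_b)[1])
             - (a + n / 2) * np.log(b_star))

# log p(y) from the MVSt_{2a}(X mu_b, (b/a)(I + X V_b X^T)) form
log_py_t = multivariate_t(loc=X @ mu_b,
                          shape=(b / a) * (np.eye(n) + X @ V_b @ X.T),
                          df=2 * a).logpdf(y)
print(np.isclose(log_py_19, log_py_t))   # expect True
\end{verbatim}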

6 Marginal distribution: the easy way

An alternative and much easier way to derive $p(y \mid \sigma^2)$, avoiding any integration at all, is to note that we can write the above model as:

\begin{align*}
y &= X\beta + \epsilon_1, \quad \text{where } \epsilon_1 \sim N(0, \sigma^2 I); \\
\beta &= \mu_\beta + \epsilon_2, \quad \text{where } \epsilon_2 \sim N(0, \sigma^2 V_\beta),
\end{align*}

where $\epsilon_1$ and $\epsilon_2$ are independent of each other. It then follows that

\[
y = X\mu_\beta + X\epsilon_2 + \epsilon_1 \sim N\!\left(X\mu_\beta, \sigma^2(I + XV_\beta X^T)\right).
\]

This gives $p(y \mid \sigma^2)$. Next we integrate out $\sigma^2$ to obtain $p(y)$ as in the preceding section.

In fact, the entire distribution theory for the Bayesian regression with NIG priors could proceed

by completely avoiding any integration. To be precise, we obtain this marginal distribution first

and derive the posterior distribution:

\[
p(\beta, \sigma^2 \mid y) = \frac{p(\beta, \sigma^2) \times p(y \mid \beta, \sigma^2)}{p(y)} = \frac{NIG(\mu_\beta, V_\beta, a, b) \times N(X\beta, \sigma^2 I)}{MVSt_{2a}\!\left(X\mu_\beta, \frac{b}{a}(I + XV_\beta X^T)\right)},
\]
which indeed reduces (after some algebraic manipulation) to the $NIG(\mu^*, V^*, a^*, b^*)$ density.
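The composition argument of this section is also easy to check by simulation: drawing $\beta \sim N(\mu_\beta, \sigma^2 V_\beta)$ and then $y \sim N(X\beta, \sigma^2 I)$ should give draws whose covariance is $\sigma^2(I + XV_\beta X^T)$. The values below are illustrative.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma2 = 4, 2, 1.5
X = rng.normal(size=(n, p))
mu_b = np.array([0.5, -1.0])
V_b = np.array([[1.0, 0.2],
                [0.2, 0.5]])

L = 200_000
beta = rng.multivariate_normal(mu_b, sigma2 * V_b, size=L)        # beta ~ N(mu_b, sigma^2 V_b)
ys = beta @ X.T + rng.normal(scale=np.sqrt(sigma2), size=(L, n))  # y | beta
print(np.cov(ys.T))                               # empirical covariance
print(sigma2 * (np.eye(n) + X @ V_b @ X.T))       # sigma^2 (I + X V_b X^T)
\end{verbatim}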

7 Bayesian prediction

Next consider Bayesian prediction in the context of the linear regression model. Suppose we now want to apply our regression model to a new set of data, where we have observed a new $m \times p$ matrix of regressors $\tilde{X}$, and we wish to predict the corresponding outcome $\tilde{y}$. Observe that if $\beta$ and $\sigma^2$ were known, then the probability law for the predicted outcomes would be described as $\tilde{y} \sim N(\tilde{X}\beta, \sigma^2 I_m)$ and would be independent of $y$. However, these parameters are not known; instead they are summarized through their posterior samples. Therefore, all predictions for the data must follow from the posterior predictive distribution:

\begin{align*}
p(\tilde{y} \mid y) &= \int p(\tilde{y} \mid \beta, \sigma^2)\, p(\beta, \sigma^2 \mid y)\, d\beta\, d\sigma^2 \\
&= \int N(\tilde{X}\beta, \sigma^2 I_m) \times NIG(\mu^*, V^*, a^*, b^*)\, d\beta\, d\sigma^2 \\
&= MVSt_{2a^*}\!\left(\tilde{X}\mu^*, \frac{b^*}{a^*}\left(I + \tilde{X} V^* \tilde{X}^T\right)\right), \tag{20}
\end{align*}

where the last step follows from (18). There are two sources of uncertainty in the posterior predictive distribution: (1) the fundamental source of variability in the model due to $\sigma^2$, unaccounted for by $\tilde{X}\beta$, and (2) the posterior uncertainty in $\beta$ and $\sigma^2$ as a result of their estimation from a finite sample $y$. As the sample size $n \to \infty$, the variability due to posterior uncertainty disappears, but the predictive uncertainty remains.

8 Posterior and posterior predictive sampling

Sampling from the NIG posterior distribution is straightforward: for each $l = 1, \ldots, L$, we sample $\sigma^{2(l)} \sim IG(a + n/2, b^*)$ and $\beta^{(l)} \sim MVN(\mu^*, \sigma^{2(l)} V^*)$. The resulting $\left\{\beta^{(l)}, \sigma^{2(l)}\right\}_{l=1}^{L}$ provide samples from the joint posterior distribution $p(\beta, \sigma^2 \mid y)$, while $\{\beta^{(l)}\}_{l=1}^{L}$ and $\{\sigma^{2(l)}\}_{l=1}^{L}$ provide samples from the marginal posterior distributions $p(\beta \mid y)$ and $p(\sigma^2 \mid y)$ respectively.

Predictions are carried out by sampling from the posterior predictive density (20). Sampling from this density is easy: for each posterior sample $(\beta^{(l)}, \sigma^{2(l)})$, we draw $\tilde{y}^{(l)} \sim N(\tilde{X}\beta^{(l)}, \sigma^{2(l)} I_m)$. The resulting $\{\tilde{y}^{(l)}\}_{l=1}^{L}$ are samples from the desired posterior predictive distribution in (20); the mean and variance of this sample provide estimates of the predictive mean and variance respectively.
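A minimal end-to-end sketch of this sampler follows (all data are illustrative, and X_new plays the role of $\tilde{X}$):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(5)
n, p, m = 50, 3, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)
X_new = rng.normal(size=(m, p))                   # hypothetical new design (Xtilde)
mu_b, V_b, a, b = np.zeros(p), 10.0 * np.eye(p), 2.0, 1.0

# NIG posterior parameters (Section 3)
V_b_inv = np.linalg.inv(V_b)
V_star_inv = V_b_inv + X.T @ X
V_star = np.linalg.inv(V_star_inv)
mu_star = V_star @ (V_b_inv @ mu_b + X.T @ y)
a_star = a + n / 2
b_star = b + 0.5 * (mu_b @ V_b_inv @ mu_b + y @ y - mu_star @ V_star_inv @ mu_star)

# Composition sampling: sigma^2(l) ~ IG(a*, b*), beta(l) ~ N(mu*, sigma^2(l) V*)
L = 5000
sigma2 = 1.0 / rng.gamma(shape=a_star, scale=1.0 / b_star, size=L)
beta = mu_star + np.sqrt(sigma2)[:, None] * rng.multivariate_normal(np.zeros(p), V_star, size=L)

# Posterior predictive draws: ytilde(l) ~ N(Xtilde beta(l), sigma^2(l) I_m)
y_new = beta @ X_new.T + np.sqrt(sigma2)[:, None] * rng.normal(size=(L, m))
print(y_new.mean(axis=0))   # Monte Carlo estimate of the predictive mean
print(y_new.var(axis=0))    # Monte Carlo estimate of the predictive variances
\end{verbatim}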

9 The posterior distribution from improper priors

Taking $V_\beta^{-1} \to 0$ (i.e. the null matrix), $a \to -p/2$ and $b \to 0$ leads to the improper prior $p(\beta, \sigma^2) \propto 1/\sigma^2$. The posterior distribution is $NIG(\mu^*, V^*, a^*, b^*)$ with

\begin{align*}
\mu^* &= \hat{\beta} = (X^T X)^{-1} X^T y, \\
V^* &= (X^T X)^{-1}, \\
a^* &= \frac{n - p}{2}, \\
b^* &= \frac{(n - p)s^2}{2}, \quad \text{where } s^2 = \frac{1}{n-p}(y - X\hat{\beta})^T(y - X\hat{\beta}) = \frac{1}{n-p}\, y^T(I - P_X)y \text{ and } P_X = X(X^T X)^{-1}X^T.
\end{align*}

Here $\hat{\beta}$ is the classical least squares estimate (also the maximum likelihood estimate) of $\beta$, $s^2$ is the classical unbiased estimate of $\sigma^2$, and $P_X$ is the projection matrix onto the column space of $X$.
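As a numerical sanity check (illustrative data only), the sketch below confirms that the flat-prior posterior mean $\mu^*$ coincides with the least squares estimate $\hat{\beta}$ and that $b^*/a^* = s^2$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(6)
n, p = 40, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=n)

# Classical quantities
betahat, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ betahat
s2 = resid @ resid / (n - p)

# Flat-prior posterior parameters from this section
V_star = np.linalg.inv(X.T @ X)
mu_star = V_star @ X.T @ y
a_star = (n - p) / 2
b_star = (n - p) * s2 / 2

print(np.allclose(mu_star, betahat))     # expect True
print(np.isclose(b_star / a_star, s2))   # expect True: b*/a* = s^2
\end{verbatim}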

Plugging the above values implied by the improper priors into the more general $NIG(\mu^*, V^*, a^*, b^*)$ density, we find that the marginal posterior distribution of $\sigma^2$ is an $IG\!\left(\frac{n-p}{2}, \frac{(n-p)s^2}{2}\right)$ (equivalently, the posterior distribution of $(n-p)s^2/\sigma^2$ is a $\chi^2_{n-p}$ distribution) and the marginal posterior distribution of $\beta$ is a $MVSt_{n-p}\!\left(\hat{\beta}, s^2(X^T X)^{-1}\right)$ with density:
\[
MVSt_{n-p}\!\left(\hat{\beta}, s^2(X^T X)^{-1}\right) = \frac{\Gamma\!\left(\frac{n}{2}\right)}{\Gamma\!\left(\frac{n-p}{2}\right)\pi^{p/2}\left|(n-p)s^2(X^T X)^{-1}\right|^{1/2}} \left[1 + \frac{(\beta - \hat{\beta})^T X^T X(\beta - \hat{\beta})}{(n-p)s^2}\right]^{-\frac{n}{2}}.
\]
Predictions with non-informative priors again follow by sampling from the posterior predictive

distribution as earlier, but some additional insight is gained by considering analytical expressions

for the expectation and variance of the posterior predictive distribution. Again, plugging in the

parameter values implied by the improper priors into (20), we obtain the posterior predictive density

as a $MVSt_{n-p}\!\left(\tilde{X}\hat{\beta},\; s^2\left(I + \tilde{X}(X^T X)^{-1}\tilde{X}^T\right)\right)$.

Note that

\begin{align*}
E(\tilde{y} \mid \sigma^2, y) &= E\left[E(\tilde{y} \mid \beta, \sigma^2, y) \mid \sigma^2, y\right] \\
&= E\left[\tilde{X}\beta \mid \sigma^2, y\right] \\
&= \tilde{X}\hat{\beta} = \tilde{X}(X^T X)^{-1}X^T y,
\end{align*}
where the inner expectation averages over $p(\tilde{y} \mid \beta, \sigma^2)$ and the outer expectation averages with respect to $p(\beta \mid \sigma^2, y)$. Note that given $\sigma^2$, the future observations have a mean which does not depend on $\sigma^2$. In analogous fashion,

\begin{align*}
\mathrm{var}(\tilde{y} \mid \sigma^2, y) &= E\left[\mathrm{var}(\tilde{y} \mid \beta, \sigma^2, y) \mid \sigma^2, y\right] + \mathrm{var}\left[E(\tilde{y} \mid \beta, \sigma^2, y) \mid \sigma^2, y\right] \\
&= E\left[\sigma^2 I_m\right] + \mathrm{var}\left[\tilde{X}\beta \mid \sigma^2, y\right] \\
&= \left(I_m + \tilde{X}(X^T X)^{-1}\tilde{X}^T\right)\sigma^2.
\end{align*}

Thus, conditional on $\sigma^2$, the posterior predictive variance has two components: $\sigma^2 I_m$, representing sampling variation, and $\tilde{X}(X^T X)^{-1}\tilde{X}^T\sigma^2$, due to uncertainty about $\beta$.
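A small sketch of this decomposition (illustrative values) shows that the component due to uncertainty about $\beta$ shrinks as $n$ grows while the $\sigma^2 I_m$ component does not:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(7)
sigma2, p, m = 1.3, 3, 2
X_new = rng.normal(size=(m, p))          # hypothetical new design (Xtilde)

for n in (20, 200, 2000):                # growing sample sizes
    X = rng.normal(size=(n, p))
    beta_part = sigma2 * X_new @ np.linalg.inv(X.T @ X) @ X_new.T  # uncertainty about beta
    total = sigma2 * np.eye(m) + beta_part                         # var(ytilde | sigma^2, y)
    print(n, np.round(np.diag(beta_part), 4), np.round(np.diag(total), 4))
\end{verbatim}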
