Bayesian

Bob Stine

May 11, 1998

Methods
– Review of Bayes ideas
– Shrinkage methods (ridge regression)
– Bayes factors: threshold |z| > √(log n)
– Calibration of selection methods
– Empirical Bayes (EBC): threshold |z| > √(log p/q)

Goals
– Characteristics, strengths, weaknesses
– Think about priors in preparation for the next step

1 Bayesian Estimation Parameters as random variables

Parameter θ is drawn randomly from the prior p(θ). Gather Y conditional on some fixed θ. Combine the data with the prior to form the posterior,

p(θ | Y) = p(Y | θ) p(θ) / p(Y)

Drop the marginal p(Y) since it is constant given Y:

p(θ | Y) ∝ p(Y | θ) p(θ)
posterior ∝ likelihood × prior

Key Bayesian example: inference on a normal mean

Observe Y_1, ..., Y_n ~ N(θ, σ²), with θ ~ N(M, τ² = c²σ²). The posterior is normal with (think about regression)

E(θ | Y) = M + τ²/(τ² + σ²/n) (Ȳ − M)
         = [(n/σ²)/(n/σ² + 1/τ²)] Ȳ + [(1/τ²)/(n/σ² + 1/τ²)] M

In the special case τ² = σ², the "prior is worth one observation":

E(θ | Y) = [n/(n+1)] Ȳ + [1/(n+1)] M
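As a numerical check (my own sketch, not part of the original slides), the posterior-mean weights can be coded directly; with τ² = σ² the weight on Ȳ is n/(n+1):

```python
import numpy as np

def normal_posterior_mean(y, sigma2, M, tau2):
    """Posterior mean of theta for Y_i ~ N(theta, sigma2), theta ~ N(M, tau2)."""
    n = len(y)
    w = (n / sigma2) / (n / sigma2 + 1 / tau2)   # weight on the sample mean
    return w * y.mean() + (1 - w) * M

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=10)
# tau2 = sigma2: "prior is worth one observation", weight n/(n+1) on ybar
print(normal_posterior_mean(y, sigma2=1.0, M=0.0, tau2=1.0), y.mean() * 10 / 11)
```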

2 Why Bayes Shrinkage Works
Stigler (1983) discussion of an article by C. Morris in JASA.

[Figure: plot of θ against Y with the line E[Y|θ] = θ.]

Regression slope is

Cov(Y, θ) / Var Y = c²σ² / ((1 + c²)σ²) = c²/(1 + c²)

3 Ridge Regression
Collinearity: Originates in optimization problems where the Hessian matrix in a Newton procedure becomes ill-conditioned. Marquardt's idea is to perturb the diagonal, avoiding a singular matrix.

Hoerl and Kennard (1970) apply this to regression, with graphical aids like the ridge trace:

plot β̂_λ = (X'X + λ I_p)^{-1} X'Y against λ

Bayes hierarchical model (Lindley and Smith, 1972): Data follow the standard linear model, with a normal prior on the slopes:

Y ~ N(Xβ, σ² I_n),  β ~ N(0, c²σ² I_p).

The posterior mean shrinks β̂ toward 0,

E(β | Y) = (X'X + (1/c²) I_p)^{-1} X'Y
         = (I_p + (1/c²)(X'X)^{-1})^{-1} β̂

Shrinkage grows as c² → 0.
Picture: collinear and orthogonal cases.
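A small sketch (my own illustration, not from the slides) of this posterior mean, (X'X + I_p/c²)^{-1} X'Y, showing the shrinkage grow as c² → 0:

```python
import numpy as np

def ridge_posterior_mean(X, y, c2):
    """Posterior mean for beta ~ N(0, c2*sigma2*I): (X'X + I/c2)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + np.eye(p) / c2, X.T @ y)

rng = np.random.default_rng(1)
n, p = 50, 5
X = rng.normal(size=(n, p))
beta = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ beta + rng.normal(size=n)

for c2 in [100.0, 1.0, 0.01]:        # shrinkage grows as c2 -> 0
    print(c2, np.round(ridge_posterior_mean(X, y, c2), 2))
```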

Where is the variable selection?

4 How to Obtain Selection from Bayes
Main point: To do more than just shrink linearly toward the prior mean, we have to get away from the bivariate normal.

[Figure: plot of θ against Y with the line E[Y|θ] = θ.]

Three alternatives

Spike-and-slab methods and BIC. Normal mixtures. Cauchy methods in information theory.

5 Bayesian Model Selection Conditioning on models

Jeffreys (1935), more recently Raftery, Kass, and others. Which model is most likely given the data? Think in terms of models M rather than parameters θ:

p(M | Y) = p(Y | M) p(M) / p(Y)

Express p(Y | M) as an average of the usual likelihood p(Y | M, θ) over the parameter space,

p(M | Y) = [ ∫ p(Y | θ, M) p(θ | M) dθ ] p(M) / p(Y)

where the bracketed integral is the usual likelihood averaged over the prior.

Bayes factors

Compare two models, M0 and M1. The posterior odds in favor of model M0 over the alternative M1 are

p(M0 | Y) / p(M1 | Y) = [ p(Y | M0) / p(Y | M1) ] × [ p(M0) / p(M1) ]
Posterior odds = Bayes factor × Prior odds

Since often p(M0) = p(M1) = 1/2, the prior odds = 1 so that posterior odds = Bayes factor = K.

6 Using Bayes Factors

Bayes factor for comparing null model M0 to alternative M1

p(M0 | Y) / p(M1 | Y) = [ p(Y | M0) / p(Y | M1) ] × 1
Posterior odds = Bayes factor K × Prior odds

What's a big Bayes factor? Jeffreys' Appendix (1961):

K > 1          ⇒ "Null hypothesis supported."
K < 1          ⇒ "Not worth more than bare mention."
K < 1/√10      ⇒ "Evidence against H0 substantial."
K < 1/10       ⇒ "strong"
K < 1/10^(3/2) ⇒ "very strong"
K < 1/100      ⇒ "Evidence against H0 decisive."

Computing

p(Y | M) = ∫ p(Y | θ, M) p(θ | M) dθ

can be very hard to evaluate, especially in high dimensions or in problems lacking a neat, closed-form solution. Details in Kass and Raftery (1995).

Approximations?

7 Approximating Bayes Factors: BIC
Goal: Schwarz (1978), Annals of Statistics. Approximate p(Y | M) = ∫ p(Y | θ, M) p(θ | M) dθ.

Approach: Make the integral look like a normal integral. Define g(θ) = log p(Y | θ, M) p(θ | M) so that

p(Y | M) = ∫ e^{g(θ)} dθ

Quadratic expansion around the max at θ̃ (the posterior mode):

g(θ) ≈ g(θ̃) + (θ − θ̃)' H(θ̃) (θ − θ̃)/2
     = g(θ̃) − (θ − θ̃)' Ĩ^{-1} (θ − θ̃)/2

where H = [∂²g/∂θ_i ∂θ_j] is the Hessian and Ĩ = (−H(θ̃))^{-1} is the inverse of the observed information matrix.

Laplace's method: For θ ∈ R^p, the marginal becomes

p(Y | M) ≈ exp(g(θ̃)) ∫ exp[ −(1/2)(θ − θ̃)' Ĩ^{-1} (θ − θ̃) ] dθ
         = exp(g(θ̃)) (2π)^{p/2} |Ĩ|^{1/2}

The log marginal is thus approximately a penalized log-likelihood at the MLE θ̂,

log p(Y | M) = log p(Y | θ̃, M) + log p(θ̃ | M) + (1/2) log |Ĩ| + O(1)
             = log p(Y | θ̂, M) + log p(θ̂ | M) − (p/2) log n + O(1)
             = log p(Y | θ̂, M) − (p/2) log n + O(1)
               (log-likelihood at MLE)   (penalty)

8 BIC Threshold in Orthogonal Regression
Orthogonal setup

X_j adds n β̂_j² = σ² (√n β_j/σ + Z_j)² to the regression SS.

Coefficient threshold

Add X_{p+1} to a model with p coefficients? The BIC criterion implies

Add X_{p+1} ⟺ the penalized log-likelihood increases

log p(Y | β̂_{p+1}) − ((p+1)/2) log n > log p(Y | β̂_p) − (p/2) log n

which implies that the change in the residual SS must satisfy

(RSS_p − RSS_{p+1}) / (2σ²) = n β̂²_{p+1} / (2σ²) = z²_{p+1} / 2 > (log n)/2

Add X_{p+1} when |z_{p+1}| > √(log n), so that the effective "α level" → 0 as n → ∞.

Consistency of the criterion: If there is a "true model" of dimension p, then as n → ∞ BIC will identify this model w.p. 1. In contrast, AIC asymptotically overfits since it tests with a fixed α level.
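A quick numerical illustration (mine, using scipy) of the last point: BIC's threshold |z| > √(log n) drives the effective α level to 0 as n grows, while AIC's fixed threshold √2 does not:

```python
import numpy as np
from scipy.stats import norm

# Effective two-sided "alpha level" of each criterion's z-threshold.
for n in [100, 1_000, 10_000, 100_000]:
    bic_thresh = np.sqrt(np.log(n))      # |z| > sqrt(log n)
    aic_thresh = np.sqrt(2.0)            # |z| > sqrt(2), fixed
    print(n,
          round(2 * norm.sf(bic_thresh), 4),   # shrinks toward 0
          round(2 * norm.sf(aic_thresh), 4))   # stays near 0.157
```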

9 Bayes Hypothesis Test
Problem: Test H0: θ = 0 vs H1: θ ≠ 0 given Y_1, ..., Y_n ~ N(θ, σ²). Pick the hypothesis with the higher posterior odds.

Test of point null (Berger, 1985)

Under H0, θ = 0 and the integration reduces to

p(H0 | Y) = p(Ȳ | θ = 0) p(H0) / p(Y),  where Ȳ | θ = 0 ~ N(0, σ²/n)

Under the alternative H1: θ ~ N(0, τ²),

p(H1 | Y) = [ ∫ p(Ȳ | θ) p(θ | H1) dθ ] p(H1) / p(Y)

Bayes factor: Assume p(H0) = p(H1) = 1/2 and σ² = τ² = 1,

p(H0 | ȳ) / p(H1 | ȳ) = N(0, 1/n)[ȳ] / N(0, 1 + 1/n)[ȳ]
                      = √((1 + 1/n)/(1/n)) · e^{−n ȳ²/2} / e^{−(n/(n+1)) ȳ²/2}
                      ≈ √n · e^{−n ȳ²/2} / e^{−ȳ²/2}

which is equal to one when

n ȳ² = log n,  or in z-scores  |z| = √(log n)

10 Bayes Hypothesis Test: The Picture
Scale: Think on the scale of the z-score, so in effect the Bayes factor

p(H0 | Y) / p(H1 | Y) = N(0, σ²/n)[ȳ] / N(0, σ²(1 + 1/n))[ȳ]

is, on the z-scale, the ratio of a N(0, 1) density to (approximately) a N(0, n) density.

Spike and slab: With σ² = 1 and n = 100,

the density of Ȳ under H1 is incredibly diffuse relative to H0. This leads to the notion of a "spike and slab" prior when these ideas are applied in estimation.
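A hedged numerical check (mine, with σ² = τ² = 1 as above) that the spike-versus-slab Bayes factor is near 1 when |z| ≈ √(log n):

```python
import numpy as np
from scipy.stats import norm

def bayes_factor(ybar, n, sigma2=1.0, tau2=1.0):
    """BF of H0: theta = 0 vs H1: theta ~ N(0, tau2), given the mean of n obs."""
    f0 = norm.pdf(ybar, scale=np.sqrt(sigma2 / n))          # spike: N(0, sigma2/n)
    f1 = norm.pdf(ybar, scale=np.sqrt(tau2 + sigma2 / n))   # slab:  N(0, tau2 + sigma2/n)
    return f0 / f1

n = 100
z = np.sqrt(np.log(n))          # z-score at the BIC-like boundary
ybar = z / np.sqrt(n)
print(bayes_factor(ybar, n))    # close to 1 near |z| = sqrt(log n)
```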

11 Discussion of BIC Bayes factor

Posterior odds compare one hypothesis to another. Requires a likelihood and various approximations.

Comparison to other criteria

Criterion   Penalty        Test level        z-threshold
BIC         (1/2) log n    Decreasing in n   |z| > √(log n)
AIC         1              Fixed             |z| > √2
RIC         log p          Decreasing in p   |z| > √(2 log p)

⇒ BIC tends to pick very parsimonious models.

Consistency: The strong penalty leads to consistency for fixed θ as n → ∞. Important? e.g., is an AR(p) model correct for any fixed order p?

Spike and slab prior for testimator

For the 2^p models M_γ indexed by γ = (γ_1, ..., γ_p), equal prior probabilities p(M_γ) = 1/2^p imply, via the Bayes factor, a prior on each coefficient of the form

p(β_j) = 1/2 for β_j = 0,  p(β_j) = c for β_j ≠ 0.

12 Criteria in McDonald's Example

[Figure: AIC and BIC criterion values plotted against the number of predictors k, for k = 0 to 20.]

13 Detour to Random Effects Model
Model: Simplified version of the variable selection problem.

Y_j = μ_j + ε_j,  ε_j ~ N(0, σ²),  j = 1, ..., p

or

Y_j | μ_j ~ N(μ_j, σ²),  μ_j ~ N(0, τ² = c²σ²)

Decomposition of the marginal

Var Y_j = σ² + τ² = σ² (1 + c²)

so that c2 measures “signal strength”.

Examples

– Repeated measures (Y_ij, i = 1, ..., n_j, with Ȳ_j playing the role of Y_j)
– Center effects in a clinical trial
– Growth curves (longitudinal models)

Relationship to regression

– Parameters: μ_j ⟺ coefficient of X_j
– Residual SS: sum of squares not fit, Σ_{j: μ_j = 0} Y_j²
– MLE fits everything, so μ̂_j = Y_j ⇒ Σ_j E(μ̂_j − μ_j)² = pσ²

14 Bayes for Random Effects Model

Y_j | μ_j ~ N(μ_j, σ²),  j = 1, ..., p
μ_j ~ N(0, τ² = c²σ²)

Posterior mean (given normality, c2)

E(μ_j | Y) = [(1/σ²)/(1/σ² + 1/τ²)] Y_j = [c²/(1 + c²)] Y_j

Risk of the Bayes estimator

Σ_j E(μ_j − E[μ_j | Y])² = p / (1/σ² + 1/τ²) = p σ² c²/(1 + c²)

Relative risk:

E(μ_j − μ̂_j)² / E(μ_j − E[μ_j | Y])² = 1 + 1/c²

If c² ≈ 0, the Bayes estimator does much better... but what's c²?

Approaches: Could put a prior on c², or try to estimate it from the data...

15 Shrink to Zero? Model and estimator

Y_j | μ_j ~ N(μ_j, σ²),  μ_j ~ N(0, c²σ²)

E(μ_j | Y) = [1 − 1/(1 + c²)] Y_j

Example: c² = 1 ⇒ E(μ_j | Y) = Y_j / 2.

How to shrink all the way to zero? Revise the model and use a mixture prior,

Y_j | μ_j ~ N(μ_j, σ²)
μ_j ~ π N(0, c²σ²) + (1 − π) 1_{0} ,

where π denotes the probability of a non-zero μ_j.

Bayes estimator?

Posterior mean: E(μ_j | Y_j) = ?, but surely not zero.
Prior estimates? Need π and c².
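Under the mixture prior just written down, the posterior mean has a closed form: the posterior probability that γ_j = 1 times the usual linear shrinkage c²/(1 + c²) applied to Y_j. A minimal sketch (my own; function name hypothetical, π and c² treated as known):

```python
import numpy as np
from scipy.stats import norm

def mixture_posterior_mean(y, sigma2, c2, pi):
    """E(mu_j | Y_j) under mu_j ~ pi*N(0, c2*sigma2) + (1-pi)*point mass at 0."""
    slab  = pi * norm.pdf(y, scale=np.sqrt(sigma2 * (1 + c2)))
    spike = (1 - pi) * norm.pdf(y, scale=np.sqrt(sigma2))
    p_nonzero = slab / (slab + spike)            # P(gamma_j = 1 | Y_j)
    shrink = c2 / (1 + c2)                       # posterior mean given gamma_j = 1
    return p_nonzero * shrink * y

y = np.array([0.2, 1.0, 2.0, 4.0])
print(np.round(mixture_posterior_mean(y, sigma2=1.0, c2=4.0, pi=0.2), 3))
```

The estimate is never exactly zero, only very small for small |Y_j|, which is why the indicator-variable strategy below is needed for actual selection.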

16 Alternative Strategy Approach

Avoid direct evaluation of the posterior mean E[μ_j | Y_j].

Indicator variables: Introduce indicators as used in regression,

γ = (γ_1, ..., γ_p),  γ_j ∈ {0, 1},

and write the revised model as

Y_j | μ_j ~ N(μ_j, σ²),  γ_j ~ Bernoulli(π),  where

μ_j ~ N(0, c²σ²) if γ_j = 1,  μ_j = 0 if γ_j = 0

Three-step process George & Foster, 1996

1. Estimate the prior parameters c² and π.
2. Maximize the posterior over γ rather than find the posterior mean, necessitating the next step.
3. Shrink the estimates identified by γ.

17 Calibration of Selection Criteria Revised model with mixture prior

Y_j | μ_j ~ N(μ_j, σ²),  μ_j ~ π N(0, c²σ²) + (1 − π) 1_{0}

(the two components correspond to γ_j = 1 and γ_j = 0)

Calibration idea

How do π and c² affect the choice of γ_j = 1? Do they result in a penalized likelihood with the previous penalties?

Posterior for γ: Given c², σ², and π,

p(γ | Y) ∝ p(γ) p(Y | γ)
         ∝ π^q (1 − π)^{p−q} (1/(1 + c²))^{q/2} exp[ −(1/(2σ²)) ( Σ_{j=1}^q Y_j²/(1 + c²) + Σ_{j=q+1}^p Y_j² ) ]
         ∝ (π/(1 − π))^q (1/(1 + c²))^{q/2} exp[ −(1/(2σ²)) ( RSS + RegrSS/(1 + c²) ) ]
         ∝ exp[ (c²/(1 + c²)) ( RegrSS/(2σ²) − q R(π, c²) ) ]

(ordering the Y_j so that the first q have γ_j = 1), where the penalty added to the log-likelihood for each parameter is

R(π, c²) = [(1 + c²)/c²] [ log((1 − π)/π) + (1/2) log(1 + c²) ]

18 Calibration of Selection Criteria
Revised model with mixture prior:

Y_j | μ_j ~ N(μ_j, σ²),  μ_j ~ π N(0, c²σ²) + (1 − π) 1_{0}

Penalty

R(π, c²) = [(1 + c²)/c²] [ log((1 − π)/π) + (1/2) log(1 + c²) ]

Maximizing the posterior gives:

π      c²      Criterion   Comments
1/2    3.92    AIC         Lots of small coefficients
1/2    n       BIC
1/2    p²      RIC         Few large coefficients

But is the value π = 1/2 reasonable?

Insight

Each criterion ⟺ a probability on the number of μ_j ≠ 0, plus a prior for the nonzero coefficients.



19 Empirical Bayes for Random Effects Model

Y_j | μ_j ~ N(μ_j, σ²),  μ_j ~ N(0, c²σ²)

Bayes estimator

E(μ_j | Y) = [c²/(1 + c²)] Y_j = [1 − 1/(1 + c²)] Y_j

Idea: Assume you know σ² or have a good estimate. Then estimate the prior from the marginal distribution,

Y_j ~ N(0, (1 + c²) σ²).  So define

ĉ² = s²/σ² − 1  where  s² = Σ_j Y_j² / p

Almost James-Stein: The plug-in estimator is

Ê(μ_j | Y) = [1 − 1/(1 + ĉ²)] Y_j = [1 − σ²/s²] Y_j

James-Stein adjusts for the fact that E(1/ĉ²) ≠ 1/E(ĉ²).
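A minimal simulation sketch (mine, with σ² known) comparing the raw Y_j, the plug-in estimate (1 − σ²/s²) Y_j, and the James-Stein form, which uses (p − 2)/Σ Y_j² in place of 1/s² to make the correction just mentioned:

```python
import numpy as np

def eb_shrinkage(y, sigma2=1.0):
    """Plug-in empirical Bayes estimate (1 - sigma2/s2) * Y_j, s2 = mean(Y_j^2)."""
    s2 = np.mean(y ** 2)
    return (1.0 - sigma2 / s2) * y

def james_stein(y, sigma2=1.0):
    """James-Stein form: uses (p-2)/sum(Y_j^2) instead of 1/s2."""
    p = len(y)
    return (1.0 - (p - 2) * sigma2 / np.sum(y ** 2)) * y

rng = np.random.default_rng(2)
mu = rng.normal(scale=np.sqrt(0.5), size=1000)   # c^2 = 0.5
y = mu + rng.normal(size=1000)
for est in (y, eb_shrinkage(y), james_stein(y)):
    print(round(np.mean((est - mu) ** 2), 3))    # average squared error
```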

20 Empirical Bayes Criterion Model with mixture prior

Y_j | μ_j ~ N(μ_j, σ²),  μ_j ~ π N(0, c²σ²) + (1 − π) 1_{0}

Step 1: Estimate the prior parameters via MLE for an approximate likelihood (no averaging over γ),

L(c², π) = π^q (1 − π)^{p−q} (1 + c²)^{−q/2} exp( c² RegrSS / (2σ²(1 + c²)) )

ĉ² = ( Σ_{j=1}^q Y_j² / (qσ²) − 1 )_+ ,  π̂ = q/p

Clearly need a large number of parameters (large p).

Step 2

Iteratively identify the nonzero μ_j as those maximizing the posterior for the most recent estimates ĉ² and π̂,

max_γ p(γ | Y) ∝ p(Y | γ) p(γ)

Step 3

Shrink the nonzero μ̂_j to compensate for the maximization. Simulations show that this step is essential.
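As a rough sketch of the whole three-step scheme in the orthogonal case (my own simplification, not George and Foster's code: σ² is assumed known, the selection step is a simple threshold derived from the penalty R(π, c²) of the calibration slides, and the starting set is RIC-style):

```python
import numpy as np

def R_penalty(pi, c2):
    """Per-coefficient penalty R(pi, c2) from the calibration slide."""
    return (1 + c2) / c2 * (np.log((1 - pi) / pi) + 0.5 * np.log(1 + c2))

def ebc_orthogonal(y, sigma2=1.0, n_iter=20):
    """Sketch of the three-step empirical Bayes scheme, orthogonal case."""
    p = len(y)
    nonzero = np.abs(y) > np.sqrt(2 * sigma2 * np.log(p))   # RIC-style start
    for _ in range(n_iter):
        q = max(nonzero.sum(), 1)
        # Step 1: estimate prior parameters from the currently selected set.
        c2 = max(np.sum(y[nonzero] ** 2) / (q * sigma2) - 1, 1e-6)
        pi = min(max(q / p, 1e-6), 1 - 1e-6)
        # Step 2: re-select gamma by maximizing the posterior (thresholding).
        nonzero = y ** 2 / (2 * sigma2) > R_penalty(pi, c2)
    # Step 3: shrink the selected coefficients.
    mu_hat = np.zeros(p)
    mu_hat[nonzero] = c2 / (1 + c2) * y[nonzero]
    return mu_hat, nonzero

rng = np.random.default_rng(3)
mu = np.concatenate([rng.normal(scale=3.0, size=50), np.zeros(950)])
y = mu + rng.normal(size=1000)
mu_hat, nz = ebc_orthogonal(y)
print(nz.sum(), round(np.sum((mu_hat - mu) ** 2), 1))
```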

21 Properties of EBC Maximized likelihood George & Foster 1996

The maximum value of the approximate likelihood L corresponds to the criterion

RegrSS/σ² − q [ 1 + log( RegrSS/(qσ²) ) ] − 2 p H(q/p)

where the Boolean entropy function is

H(ω) = −ω log ω − (1 − ω) log(1 − ω) ≥ 0.

Adaptive penalty: The penalty depends on π̂ and the estimated strength of signal in ĉ².

Features

– Large RegrSS relative to q ⇒ BIC-type behavior.
– Small π̂ = 1/p ⇒ RIC behavior, since p H(1/p) ≈ log p.
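A quick check (mine) of the approximation p H(1/p) ≈ log p used above:

```python
import numpy as np

def H(w):
    """Boolean entropy H(w) = -w log w - (1-w) log(1-w)."""
    return -w * np.log(w) - (1 - w) * np.log(1 - w)

for p in [100, 1_000, 10_000]:
    # p*H(1/p) is log p plus a constant near 1
    print(p, round(p * H(1 / p), 2), round(np.log(p), 2))
```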

22 Simulation of Predictive Risk

Conditions: Normal, random-effects setup with X = I_n, n = p = 1000. The full model scores 1000 in each case.

Weak signal, c² = 2

q      0      10     50     100    500    1000
Cp     572    577    599    627    845    1118
BIC    74.8   87.8   146    217    781    1488
RIC    2.9    21.0   93.3   183    899    1799
EBC    503    611    787    902    1000   999
EBC    20.3   61.7   474    872    1000   999
q̂      1.1    19.9   91.8   168    501    667

Strong signal, c² = 100

q      0      10     50     100    500    1000
Cp     572    578    597    623    824    1072
BIC    75.4   88.7   143    213    771    1460
RIC    3.3    26.0   116    229    1142   2274
EBC    496    26.5   106    194    737    999
EBC    15.9   26.5   106    194    737    999
q̂      1.1    26.4   106    193    731    990

23 Discussion Calibration

Model selection criteria correspond to priors on the μ_i, typically with many coefficients (p large):

AIC ⇒ prior variance four times σ²: lots of little μ_i.
RIC ⇒ prior variance p² times σ²: fewer, very large μ_i.

Adaptive selection

Tune the selection criterion to the problem at hand. With shrinkage, this does better than OLS with any fixed criterion.

Weakness of normal prior

– Cannot handle a mixture of large and small nonzero μ_i.
– Some effects are very significant (e.g., age and osteoporosis).
– Others are more speculative (e.g., hormonal factors).
– Mixing these "confuses" the normal prior μ_i ~ N(0, τ²).

Cauchy prior

– Tolerates a mixture of large and small μ_j (Jeffreys, 1961).
– Not conjugate.
– Useful in the information-theoretic context.

24 Empirical Bayes, Method 2
Full Bayes model

Y_j | μ_j ~ N(μ_j, σ²)
μ_j | γ_j ~ N(0, τ² = c²σ²) if γ_j = 1,  μ_j = 0 otherwise
γ ~ π^q (1 − π)^{p−q},  q = |γ|
π ~ U[0, 1]

Conjugate priors lead to all the needed conditional distributions...

π | (γ, μ, σ², τ², Y) ~ Beta(q + 1, p − q + 1)
σ² | others ~ inverted gamma
τ² | others ~ inverted gamma
(γ_j, μ_j) | others ~ (Bernoulli, normal)

Bayes probability: The Bernoulli probability comes via a likelihood ratio,

P{γ_j = 1 | ·} = π N(0, σ² + τ²)[Y_j] / ( π N(0, σ² + τ²)[Y_j] + (1 − π) N(0, σ²)[Y_j] )

Gibbs: The Gibbs sampler avoids maximizing the likelihood, but is much slooooooowwwwweeeeerrrrrr.
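A sketch of such a Gibbs sampler (my own simplification of the model above: σ² and τ² are held fixed instead of receiving inverted-gamma updates, so only π, γ, and μ are sampled; the γ update integrates μ_j out, as in the likelihood-ratio formula above):

```python
import numpy as np
from scipy.stats import norm

def gibbs_spike_slab(y, sigma2=1.0, tau2=4.0, n_iter=2000, seed=0):
    """Gibbs sketch with sigma2, tau2 held fixed; samples (pi, gamma, mu)."""
    rng = np.random.default_rng(seed)
    p = len(y)
    gamma = np.ones(p, dtype=bool)
    keep_mu = np.zeros(p)
    for it in range(n_iter):
        q = gamma.sum()
        pi = rng.beta(q + 1, p - q + 1)                       # pi | gamma
        slab  = pi * norm.pdf(y, scale=np.sqrt(sigma2 + tau2))
        spike = (1 - pi) * norm.pdf(y, scale=np.sqrt(sigma2))
        gamma = rng.random(p) < slab / (slab + spike)         # gamma_j | pi, Y (mu integrated out)
        mu = np.zeros(p)
        post_var = sigma2 * tau2 / (sigma2 + tau2)
        mu[gamma] = rng.normal(tau2 / (sigma2 + tau2) * y[gamma],
                               np.sqrt(post_var))             # mu_j | gamma_j = 1, Y_j
        if it >= n_iter // 2:                                 # discard first half as burn-in
            keep_mu += mu
    return keep_mu / (n_iter - n_iter // 2)

rng = np.random.default_rng(4)
mu_true = np.concatenate([rng.normal(scale=2.0, size=20), np.zeros(180)])
y = mu_true + rng.normal(size=200)
print(np.round(gibbs_spike_slab(y)[:5], 2))
```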

25 Calibration in Regression Goal Think about various methods in a common setting, comparing more deeply than just via a threshold.

Comparison

Method                  Threshold     Prior for q    Prior for β
Bayes factors (BIC)     √(log n)      Bi(1/2, p)     Spike-and-slab
Predictive risk (AIC)   √2            ?              ?
Minimax risk (RIC)      √(2 log p)    ?              ?

Bayesian variable selection George & Foster 1996

Put a prior on everything, particularly γ (γ_j = 1 implies fit β_j):

p(γ), p(β | γ) ⇒ p(γ | Y)

In particular, consider a Bernoulli-type prior for γ given some constant w,

p(γ | w) ∝ w^q (1 − w)^{p−q},  q = |γ|,

and a normal prior for β given γ, the observation variance σ², and a constant c²:

p(β | γ, c², σ²) = N(0, c²σ² (X'X)^{-1}).

26 Bayesian Calibration Results
The posterior p(γ | Y, σ, c, w) is increasing as a function of

RegrSS/σ² − |γ| F(c², w),  where

F(c², w) = [(1 + c²)/c²] [ log(1 + c²) + 2 log((1 − w)/w) ]

Match constants to known penalties: Since AIC, BIC, and RIC select a model based on a penalized log-likelihood (penalty factors 1, (1/2) log n, and log p per coefficient), calibrate by solving

F(c², w) = 2 × penalty factor

Which solution? Pick w = 1/2 and solve for c²? If p is very large (or perhaps very small), do you expect half of the coefficients to enter the model?
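A numerical sketch of the calibration (mine): fix w = 1/2 and solve F(c², w) = 2 × penalty for c². This reproduces c² ≈ 3.92 for AIC and values of order n and p² for BIC and RIC, matching the earlier table.

```python
import numpy as np
from scipy.optimize import brentq

def F(c2, w):
    """Calibration function F(c^2, w) from the slide."""
    return (1 + c2) / c2 * (np.log(1 + c2) + 2 * np.log((1 - w) / w))

def calibrated_c2(penalty, w=0.5):
    """Solve F(c^2, w) = 2 * penalty for c^2."""
    return brentq(lambda c2: F(c2, w) - 2 * penalty, 1e-6, 1e12)

n, p = 1000, 50
print(round(calibrated_c2(1.0), 2))               # AIC penalty 1        -> c^2 ~ 3.92
print(round(calibrated_c2(0.5 * np.log(n)), 1))   # BIC penalty (1/2)log n -> c^2 ~ n
print(round(calibrated_c2(np.log(p)), 1))         # RIC penalty log p    -> c^2 ~ p^2
```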

Next steps

– Put another prior on c² and w?
– Choose c² and w to maximize the likelihood L(c², w | Y).

27 Empirical Bayes Criterion
Criterion: Pick the subset of the p possible predictors which (approximately) maximizes the penalized likelihood

SS/σ̂² − q [ 1 + log( SS/(qσ̂²) ) ] + 2 [ (p − q) log(p − q) + q log q ]
          (BIC-like term)              (RIC-like term)

where q = |γ| is the number of chosen predictors and SS denotes the regression sum of squares for this collection of predictors.

Empirical Bayes estimates

ĉ² = ( SS/(σ̂² q) − 1 )_+ ,  ŵ = q/p

BIC component

Typically tr(X'X) = O(n) so that for fixed β,

1 + log( SS/(qσ̂²) ) = O(log n)

RIC component: With q = 1, the entropy term reduces to approximately 2 log p.

Key property: The threshold varies with the complexity of the model. Once q gets large, the decreasing threshold allows other variables into the model.
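A small check (mine) of this key property: the amount the entropy-style component charges for admitting one more predictor starts near 2 log p and falls toward zero as q grows.

```python
import numpy as np

def ric_part(q, p):
    """Entropy-style component 2((p-q)log(p-q) + q log q) of the EBC criterion."""
    t = lambda m: m * np.log(m) if m > 0 else 0.0
    return 2 * (t(p - q) + t(q))

p = 1000
for q in [1, 10, 100, 500]:
    # cost of admitting one more predictor; near 2 log p when q is small, shrinks as q grows
    print(q, round(ric_part(q - 1, p) - ric_part(q, p), 2), round(2 * np.log(p), 2))
```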
