Bayesian Model Selection
Bob Stine
May 11, 1998
Methods
– Review of Bayes ideas
– Shrinkage methods (ridge regression)
– Bayes factors: threshold $|z| > \sqrt{\log n}$
– Calibration of selection methods
– Empirical Bayes (EBC): threshold $|z| > \sqrt{\log p/q}$

Goals
– Characteristics, strengths, weaknesses
– Think about priors in preparation for next step
Bayesian Estimation

Parameters as random variables
Parameter $\theta$ is drawn randomly from the prior $p(\theta)$. Gather data $Y$ conditional on some fixed $\theta$. Combine data with prior to form the posterior,

    p(\theta \mid Y) = \frac{p(Y \mid \theta)\, p(\theta)}{p(Y)}
Drop the marginal $p(Y)$ since it is constant given $Y$:

    \underbrace{p(\theta \mid Y)}_{\text{posterior}} \;\propto\; \underbrace{p(Y \mid \theta)}_{\text{likelihood}} \; \underbrace{p(\theta)}_{\text{prior}}
Key Bayesian example: inference on a normal mean
Observe $Y_1, \ldots, Y_n \mid \theta \sim N(\theta, \sigma^2)$, with $\theta \sim N(M, \tau^2 = c^2\sigma^2)$. The posterior is normal with (think about regression)

    E(\theta \mid Y) = M + \frac{\tau^2}{\tau^2 + \sigma^2/n}\,(\bar Y - M)
                     = \frac{n/\sigma^2}{n/\sigma^2 + 1/\tau^2}\, \bar Y + \frac{1/\tau^2}{n/\sigma^2 + 1/\tau^2}\, M

In the special case $\tau^2 = \sigma^2$, the "prior is worth one observation":

    E(\theta \mid Y) = \frac{n}{n+1}\, \bar Y + \frac{1}{n+1}\, M
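A minimal numerical sketch of this precision-weighted posterior mean (in Python; all values illustrative, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative settings: prior N(M, tau2) with tau2 = c2 * sigma2
    M, sigma2, c2 = 0.0, 1.0, 1.0     # c2 = 1: the prior is "worth one observation"
    tau2 = c2 * sigma2
    n = 25
    y = rng.normal(2.0, np.sqrt(sigma2), size=n)
    ybar = y.mean()

    # Posterior mean: precision-weighted average of ybar and the prior mean M
    w = (n / sigma2) / (n / sigma2 + 1 / tau2)
    post_mean = w * ybar + (1 - w) * M

    print(post_mean, (n * ybar + M) / (n + 1))   # identical when tau2 == sigma2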
Why Bayes Shrinkage Works

Stigler (1983) discussion of an article by C. Morris in JASA.
[Figure: plot of $Y$ against $\theta$, with the line $E[Y \mid \theta] = \theta$.]
Regression slope is
    \frac{\operatorname{Cov}(Y, \theta)}{\operatorname{Var} Y} = \frac{c^2 \sigma^2}{(1 + c^2)\, \sigma^2} = \frac{c^2}{1 + c^2}
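To see the regression argument numerically, one can simulate $(\theta, Y)$ pairs and check that the least-squares slope of $\theta$ on $Y$ approaches $c^2/(1+c^2)$; a rough Python sketch with illustrative values:

    import numpy as np

    rng = np.random.default_rng(1)
    sigma2, c2, N = 1.0, 0.5, 200_000

    theta = rng.normal(0.0, np.sqrt(c2 * sigma2), size=N)  # theta ~ N(0, c^2 sigma^2)
    y = theta + rng.normal(0.0, np.sqrt(sigma2), size=N)   # Y | theta ~ N(theta, sigma^2)

    # Slope of the regression of theta on Y: Cov(Y, theta) / Var(Y)
    slope = np.cov(y, theta)[0, 1] / np.var(y)
    print(slope, c2 / (1 + c2))   # both near 1/3 for c2 = 0.5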
Ridge Regression

Collinearity

Originates in optimization problems where the Hessian matrix in a Newton procedure becomes ill-conditioned. Marquardt's idea is to perturb the diagonal, avoiding a singular matrix.
Hoerl and Kennard (1970) apply this to regression analysis, with graphical aids like the ridge trace:
    plot \hat\beta_\lambda = (X'X + \lambda I_p)^{-1} X'Y \quad \text{on } \lambda
Bayes hierarchical model (Lindley and Smith, 1972)

Data follow the standard linear model, with a normal prior on the slopes:

    Y \mid \beta \sim N(X\beta, \sigma^2 I_n), \qquad \beta \sim N(0, c^2 \sigma^2 I_p).

The posterior mean shrinks toward 0,
    E(\beta \mid Y) = (X'X + \tfrac{1}{c^2} I_p)^{-1} X'Y
                    = (I_p + \tfrac{1}{c^2} (X'X)^{-1})^{-1} \hat\beta

Shrinkage grows as $c^2 \to 0$.

Picture: collinear and orthogonal cases.
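A short Python sketch of this ridge/posterior-mean formula, tracing the shrinkage as $\lambda = 1/c^2$ grows (simulated collinear design; all settings illustrative):

    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 100, 3
    X = rng.normal(size=(n, p))
    X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=n)   # third column nearly collinear
    y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

    # beta_hat(lam) = (X'X + lam I)^(-1) X'y; lam = 1/c^2 under the N(0, c^2 sigma^2 I) prior
    for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
        b = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
        print(f"lam={lam:7.1f}  beta_hat={np.round(b, 3)}")   # coefficients shrink toward 0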
Where is the variable selection?
How to Obtain Selection from Bayes

Main point

To do more than just shrink linearly toward the prior mean, we have to get away from the bivariate normal.
[Figure: plot of $Y$ against $\theta$ with the line $E[Y \mid \theta] = \theta$, as on the earlier slide.]
Three alternatives
– Spike-and-slab methods and BIC.
– Normal mixtures.
– Cauchy methods in information theory.
Bayesian Model Selection

Conditioning on models
Jeffreys (1935), more recently Raftery, Kass, and others. Which model is most likely given the data? Think in terms of models $M$ rather than parameters $\theta$:
    p(M \mid Y) = p(Y \mid M)\, p(M) / p(Y)

Express $p(Y \mid M)$ as an average of the usual likelihood $p(Y \mid M, \theta)$ over the parameter space,
    p(M \mid Y) = \Big[ \int \underbrace{p(Y \mid \theta, M)}_{\text{usual likelihood}}\, p(\theta \mid M)\, d\theta \Big]\, p(M) / p(Y)

Bayes factors
Compare two models, $M_0$ and $M_1$. The posterior odds in favor of model $M_0$ over the alternative $M_1$ are

    \underbrace{\frac{p(M_0 \mid Y)}{p(M_1 \mid Y)}}_{\text{Posterior odds}} = \underbrace{\frac{p(Y \mid M_0)}{p(Y \mid M_1)}}_{\text{Bayes factor}} \times \underbrace{\frac{p(M_0)}{p(M_1)}}_{\text{Prior odds}}

Since often $p(M_0) = p(M_1) = 1/2$, the prior odds equal 1, so that Posterior odds = Bayes factor = $K$.
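A toy numerical reading of this identity in Python (numbers invented for illustration):

    # Posterior odds = Bayes factor * prior odds (toy values, not from the slides)
    bayes_factor = 0.2      # data favor M1 five-to-one
    prior_odds = 1.0        # p(M0) = p(M1) = 1/2
    posterior_odds = bayes_factor * prior_odds
    print(posterior_odds, posterior_odds / (1 + posterior_odds))  # odds 0.2, p(M0|Y) ~ 0.167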
Using Bayes Factors
Bayes factor for comparing the null model $M_0$ to the alternative $M_1$:
    \underbrace{\frac{p(M_0 \mid Y)}{p(M_1 \mid Y)}}_{\text{Posterior odds}} = \underbrace{\frac{p(Y \mid M_0)}{p(Y \mid M_1)}}_{\text{Bayes factor } K} \times \underbrace{1}_{\text{Prior odds}}

What's a big Bayes factor? Jeffreys' Appendix (1961):
    K > 1            ⇒ "Null hypothesis supported."
    K < 1            ⇒ "Not worth more than bare mention."
    K < 1/\sqrt{10}  ⇒ "Evidence against H_0 substantial."
    K < 1/10         ⇒ "strong"
    K < 1/10^{3/2}   ⇒ "very strong"
    K < 1/100        ⇒ "Evidence against H_0 decisive."
Computing
    p(Y \mid M) = \int p(Y \mid \theta, M)\, p(\theta \mid M)\, d\theta

can be very hard to evaluate, especially in high dimensions or in problems lacking a neat, closed-form solution. Details in Kass and Raftery (1995).
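For the normal-mean model the integral is tractable, which makes it a handy test case. A Python sketch that evaluates $p(Y \mid M)$ by brute-force quadrature and checks it against the closed form (settings illustrative):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(3)
    n, sigma, tau = 20, 1.0, 1.0
    y = rng.normal(0.5, sigma, size=n)

    def log_lik(theta):
        # log p(Y | theta, M) for iid N(theta, sigma^2) data, theta an array
        return norm.logpdf(y[:, None], loc=theta, scale=sigma).sum(axis=0)

    # p(Y|M) = integral of p(Y|theta, M) p(theta|M) dtheta, via a Riemann sum on a wide grid
    grid = np.linspace(-10, 10, 20001)
    integrand = np.exp(log_lik(grid)) * norm.pdf(grid, 0.0, tau)
    marginal = integrand.sum() * (grid[1] - grid[0])

    # Closed form: integrating theta out gives ybar ~ N(0, tau^2 + sigma^2/n)
    ybar, S = y.mean(), ((y - y.mean()) ** 2).sum()
    exact = ((2 * np.pi * sigma**2) ** (-n / 2) * np.exp(-S / (2 * sigma**2))
             * np.sqrt(2 * np.pi * sigma**2 / n)
             * norm.pdf(ybar, 0.0, np.sqrt(tau**2 + sigma**2 / n)))
    print(marginal, exact)   # should agree closely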
Approximations?
Approximating Bayes Factors: BIC

Goal

Schwarz (1978), Annals of Statistics. Approximate

    p(Y \mid M) = \int p(Y \mid \theta, M)\, p(\theta \mid M)\, d\theta.

Approach

Make the integral look like a normal integral. Define

    g(\theta) = \log p(Y \mid \theta, M)\, p(\theta \mid M)

so that

    p(Y \mid M) = \int e^{g(\theta)}\, d\theta.

Quadratic expansion around the max at $\tilde\theta$ (posterior mode):
    g(\theta) \approx g(\tilde\theta) + (\theta - \tilde\theta)'\, H(\tilde\theta)\, (\theta - \tilde\theta)/2
              = g(\tilde\theta) - (\theta - \tilde\theta)'\, \tilde I\, (\theta - \tilde\theta)/2

where $H = [\partial^2 g / \partial\theta_i \partial\theta_j]$ and $\tilde I = -H(\tilde\theta)$ is the information matrix.
Laplace's method. For $\theta \in R^p$, the posterior integral becomes

    p(Y \mid M) \approx \exp(g(\tilde\theta)) \int \exp[-(1/2)(\theta - \tilde\theta)'\, \tilde I\, (\theta - \tilde\theta)]\, d\theta
                = \exp(g(\tilde\theta))\, (2\pi)^{p/2}\, |\tilde I|^{-1/2}
So the log marginal is approximately a penalized log-likelihood at the MLE $\hat\theta$,
    \log p(Y \mid M) = \log p(Y \mid \tilde\theta, M) + \log p(\tilde\theta \mid M) - (1/2) \log |\tilde I| + O(1)
                     = \log p(Y \mid \hat\theta, M) + \log p(\hat\theta \mid M) - (p/2) \log n + O(1)
                     = \underbrace{\log p(Y \mid \hat\theta, M)}_{\text{log-likelihood at MLE}} - \underbrace{(p/2) \log n}_{\text{penalty}} + O(1)
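Continuing the tractable normal example, one can check numerically how this BIC-style approximation tracks the exact log marginal; a Python sketch (illustrative values, $p = 1$):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(4)
    n, sigma, tau = 200, 1.0, 1.0
    y = rng.normal(0.3, sigma, size=n)
    ybar, S = y.mean(), ((y - y.mean()) ** 2).sum()

    # Exact log p(Y|M) for the conjugate model theta ~ N(0, tau^2)
    exact = (-(n / 2) * np.log(2 * np.pi * sigma**2) - S / (2 * sigma**2)
             + 0.5 * np.log(2 * np.pi * sigma**2 / n)
             + norm.logpdf(ybar, 0.0, np.sqrt(tau**2 + sigma**2 / n)))

    # BIC-style approximation: log-likelihood at the MLE minus (p/2) log n, with p = 1
    loglik_mle = norm.logpdf(y, loc=ybar, scale=sigma).sum()
    bic_approx = loglik_mle - 0.5 * np.log(n)

    print(exact, bic_approx)   # differ by an O(1) term, as the expansion predicts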
BIC Threshold in Orthogonal Regression

Orthogonal setup

    X_j \text{ adds } n\hat\beta_j^2 = (\sqrt{n}\,\beta_j + \sigma Z_j)^2 \text{ to the regression SS}

Coefficient threshold
Add $X_{p+1}$ to a model with $p$ coefficients? The BIC criterion implies
Add $X_{p+1}$ ⟺ the penalized log-likelihood increases ⟺

    \log p(Y \mid \hat\beta_{p+1}) - \frac{p+1}{2} \log n > \log p(Y \mid \hat\beta_p) - \frac{p}{2} \log n

which implies that the change in the residual SS must satisfy
    \frac{RSS_p - RSS_{p+1}}{2\sigma^2} = \frac{n \hat\beta_{p+1}^2}{2\sigma^2} = \frac{z_{p+1}^2}{2} > \frac{\log n}{2}
Add $X_{p+1}$ when $|z_{p+1}| > \sqrt{\log n}$, so that the effective "$\alpha$ level" $\to 0$ as $n \to \infty$.

Consistency of criterion

If there is a "true model" of dimension $p$, then as $n \to \infty$ BIC will identify this model w.p. 1. In contrast, AIC asymptotically overfits since it tests with a fixed $\alpha$ level.
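A quick Python simulation of the thresholding view (orthogonal z-scores under pure noise, so every retained variable is a false positive; settings illustrative):

    import numpy as np

    rng = np.random.default_rng(5)
    n, p, reps = 1000, 50, 200

    kept_aic = kept_bic = 0
    for _ in range(reps):
        z = rng.normal(size=p)                               # null z-scores, p orthogonal predictors
        kept_aic += (np.abs(z) > np.sqrt(2)).sum()           # AIC: |z| > sqrt(2), fixed level
        kept_bic += (np.abs(z) > np.sqrt(np.log(n))).sum()   # BIC: |z| > sqrt(log n)

    print("false positives per fit:", kept_aic / reps, "(AIC) vs", kept_bic / reps, "(BIC)")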
Bayes Hypothesis Test

Problem

Test $H_0: \theta = 0$ vs $H_1: \theta \neq 0$ given $Y_1, \ldots, Y_n \sim N(\theta, \sigma^2)$. Pick the hypothesis with the higher posterior odds.
Test of point null (Berger, 1985)
Under $H_0$, $\theta = 0$ and the integration reduces to
    p(H_0 \mid Y) = \underbrace{p(\bar Y \mid \theta = 0)}_{N(0,\, \sigma^2/n)}\; p(H_0) / p(Y)
Under the alternative $H_1$: $\theta \sim N(0, \tau^2)$,
    p(H_1 \mid Y) = \Big[ \int p(\bar Y \mid \theta)\, p(\theta \mid H_1)\, d\theta \Big]\, p(H_1) / p(Y)
Bayes factor. Assume $p(H_0) = p(H_1) = 1/2$ and $\sigma^2 = \tau^2 = 1$,
    \frac{p(H_0 \mid y)}{p(H_1 \mid y)} = \frac{N(0, 1/n) \text{ density at } \bar y}{N(0, 1 + 1/n) \text{ density at } \bar y}
        = \sqrt{\frac{1 + 1/n}{1/n}}\; \frac{e^{-n\bar y^2/2}}{e^{-(n/(n+1))\,\bar y^2/2}}
        \approx \sqrt{n}\; e^{-n\bar y^2/2}\, e^{\bar y^2/2}

which is equal to one when
    n \bar y^2 = \log n, \quad \text{or in z-scores} \quad |z| = \sqrt{\log n}.
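A numeric check in Python that the exact Bayes factor crosses 1 near $|z| = \sqrt{\log n}$ (with $\sigma^2 = \tau^2 = 1$ as above; $n$ illustrative):

    import numpy as np
    from scipy.stats import norm

    n = 100
    z_star = np.sqrt(np.log(n))    # claimed crossing point, |z| = sqrt(log n)
    ybar = z_star / np.sqrt(n)     # z = sqrt(n) * ybar when sigma = 1

    # Exact Bayes factor: N(0, 1/n) density over N(0, 1 + 1/n) density at ybar
    bf = norm.pdf(ybar, 0, np.sqrt(1 / n)) / norm.pdf(ybar, 0, np.sqrt(1 + 1 / n))
    print(bf)   # ~1.03: close to 1, exactly 1 only in the slide's large-n approximation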
Bayes Hypothesis Test: The Picture

Scale

Think on the scale of the z-score, so in effect the Bayes factor

    \frac{p(H_0 \mid Y)}{p(H_1 \mid Y)} = \frac{N(0, \sigma^2/n)}{N(0, \sigma^2(1 + 1/n))}

is the ratio of a $N(0, 1)$ density to a $N(0, n)$ density.

Spike and slab

With $\sigma^2 = 1$ and $n = 100$:

[Figure: the $N(0,1)$ "spike" against the much flatter $N(0,100)$ "slab".]
Thus, the density of $Y$ under $H_1$ is incredibly diffuse relative to $H_0$. This leads to the notion of a "spike and slab" prior when these ideas are applied in estimation.
Discussion of BIC

Bayes factor
Posterior odds to compare one hypothesis to another. Requires a likelihood and various approximations.

Comparison to other criteria

    Criterion   Penalty          Test level α(n)     z-statistic
    BIC         (1/2) log n      decreasing in n     |z| > √(log n)
    AIC         1                fixed               |z| > √2
    RIC         log p            decreasing in p     |z| > √(2 log p)

⇒ BIC tends to pick very parsimonious models.

Consistency

The strong penalty leads to consistency for a fixed model as $n \to \infty$. Important? e.g., Is the time series AR(p) for any fixed order $p$?
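The test levels in this table are easy to make concrete as two-sided normal tail areas; a Python sketch ($n$ and $p$ values illustrative):

    import numpy as np
    from scipy.stats import norm

    for n in [50, 200, 1000, 10_000]:
        alpha_bic = 2 * norm.sf(np.sqrt(np.log(n)))      # BIC: |z| > sqrt(log n)
        print(f"n={n:6d}  BIC level ~ {alpha_bic:.4f}")

    print(f"AIC level ~ {2 * norm.sf(np.sqrt(2)):.4f}")  # AIC: |z| > sqrt(2), fixed
    for p in [10, 100, 1000]:
        alpha_ric = 2 * norm.sf(np.sqrt(2 * np.log(p)))  # RIC: |z| > sqrt(2 log p)
        print(f"p={p:5d}  RIC level ~ {alpha_ric:.4f}")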
Spike-and-slab prior for the testimator
For the models $M_j$ indexed by $\beta = (\beta_1, \ldots, \beta_p)$, Bayes factors with equal prior probability on each of the $2^p$ models correspond to the coordinate-wise prior
    p(\beta_j) = \begin{cases} 1/2, & \beta_j = 0 \\ c, & \beta_j \neq 0 \end{cases}

Criteria in McDonald's Example

[Figure: two panels, AIC and BIC, plotting each criterion against model size k = 0, 5, 10, 15, 20.]

Detour to Random Effects Model

Model

Simplified version of the variable selection problem.
    Y_j = \theta_j + \epsilon_j, \quad \epsilon_j \sim N(0, \sigma^2), \quad j = 1, \ldots, p

or

    Y_j \mid \theta_j \sim N(\theta_j, \sigma^2), \qquad \theta_j \sim N(0, \tau^2 = c^2 \sigma^2)

Decomposition of the marginal variance
    \operatorname{Var} Y_j = \sigma^2 + \tau^2 = \sigma^2 (1 + c^2)
so that $c^2$ measures "signal strength".
Examples
– Repeated measures ($Y_{ij},\ i = 1, \ldots, n_j \Rightarrow \bar Y_j$)
– Center effects in a clinical trial
– Growth curves (longitudinal models)

Relationship to regression
– Parameters: $\theta_j \Longleftrightarrow X_j \beta_j$
– Residual SS: sum of squares not fit, $\sum_{\theta_j = 0} Y_j^2$
– MLE fits everything, so $\hat\theta_j = Y_j \Rightarrow E \sum_j (\hat\theta_j - \theta_j)^2 = p\, \sigma^2$
Bayes Estimator for Random Effects Model
    Y_j \mid \theta_j \sim N(\theta_j, \sigma^2), \quad j = 1, \ldots, p, \qquad \theta_j \sim N(0, \tau^2 = c^2 \sigma^2)
Posterior mean (given normality and known $c^2$)
    E(\theta_j \mid Y) = Y_j\, \frac{1/\sigma^2}{1/\sigma^2 + 1/\tau^2} = Y_j\, \frac{c^2}{1 + c^2}

Risk of Bayes estimator
    E \sum_j \big(\theta_j - E[\theta_j \mid Y]\big)^2 = \frac{p}{1/\sigma^2 + 1/\tau^2} = p\, \sigma^2\, \frac{c^2}{1 + c^2}
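A simulation sketch in Python comparing this Bayes risk with the MLE's risk $p\sigma^2$ (all settings illustrative):

    import numpy as np

    rng = np.random.default_rng(6)
    p, sigma2, c2, reps = 200, 1.0, 0.5, 500
    tau2 = c2 * sigma2
    shrink = c2 / (1 + c2)               # posterior-mean shrinkage factor

    sse_mle = sse_bayes = 0.0
    for _ in range(reps):
        theta = rng.normal(0, np.sqrt(tau2), size=p)
        y = theta + rng.normal(0, np.sqrt(sigma2), size=p)
        sse_mle += ((y - theta) ** 2).sum()              # MLE: theta_hat = Y
        sse_bayes += ((shrink * y - theta) ** 2).sum()   # Bayes: theta_hat = (c^2/(1+c^2)) Y

    print("MLE risk   ~", sse_mle / reps,   "(theory:", p * sigma2, ")")
    print("Bayes risk ~", sse_bayes / reps, "(theory:", p * sigma2 * c2 / (1 + c2), ")")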