Bayesian Model Selection

Bayesian Model Selection Bob Stine May 11, 1998 Methods – Review of Bayes ideas – Shrinkage methods (ridge regression) – Bayes factors: threshold z > √log n | | – Calibration of selection methods – Empirical Bayes (EBC) z > log p/q | | p Goals – Characteristics, strengths, weaknesses – Think about priors in preparation for next step 1 Bayesian Estimation Parameters as random variables Parameter drawn randomly from prior p(). Gather data Y conditional on some xed . Combine data with prior to form posterior, p(Y )p() p( Y )= | | p(Y) Drop marginal p(Y ) since constant given Y p( Y ) p(Y ) p() | ∝ | posterior likelihood prior Key Bayesian example| {zInference} | on{z normal} |{z} mean 2 2 2 2 Observe Y1,...,Yn N(, ), with N(M, = c ). | Posterior is normal with (think about regression) 2 E( Y )=M+ (Y M) | 2+2/n n/2 1/2 = Y + M n/2 +1/2 n/2 +1/2 In special case of 2 = 2,“prior is worth one observation”: n 1 E( Y )= Y + M | n+1 n+1 2 Why Bayes Shrinkage Works Steigler (1983) discussion of article by C. Morris in JASA. θ E[Y|θ]=θ Y Regression slope is Cov(Y,) c2 2 = Var Y (1 + c2)2 c2 = 1+c2 3 Ridge Regression Collinearity Originates in optimization problems where Hessian matrix in a Newton procedure becomes ill-conditioned. Marquardt’s idea is to perturb the diagonal, avoiding singular matrix. Hoerl and Kennard (1970) apply to regression analysis, with graphical aides like the ridge trace: ˆ 1 plot =(X0X+Ip) X0Y on Bayes hierarchical model Lindley and Smith 1972 Data follows standard linear model, with a normal prior on slopes: 2 2 2 Y N(X, In) , N(0,c Ip). The posterior mean shrinks toward 0, ˆ 1 1 E( )=(X0X+ Ip)X0Y | c2 1 =(I+ (XX)1)1ˆ p c2 0 Shrinkage larger as c2 0. → Picture Collinear and orthogonal cases Where is the variable selection? 4 How to Obtain Selection from Bayes Main point To do more than just shrink linearly to prior mean, we have to get away from bivariate normal. θ E[Y|θ]=θ Y Three alternatives Spike-and-slab methods and BIC. Normal mixtures. Cauchy methods in information theory. 5 Bayesian Model Selection Conditioning on models Jereys (1935), recently Raftery, Kass, and others Which model is most likely given the data? Think in terms of models M rather than parameters : p(M Y )=p(Y M)p(M)/p(Y ) | | ? Express p(Y M) as average| of{z the} usual likelihood p(Y M,) | | over the parameter space, p(M Y )= p(Y , M) p( M)d p(M)/p(Y ) | | | Z usual like Bayes factors | {z } Compare two models, M0 and M1. Posterior odds in favor of model M0 over alternative M1 are p(M0 Y ) p(Y M0) p(M0) | = | p(M1 Y ) p(Y M1) p(M1) | | Posterior odds = Bayes factor Prior odds | {z } | {z } | {z } Since often p(M0)=p(M1)=1/2, prior odds = 1 so that Posterior odds = Bayes factor = K 6 Using Bayes Factors Bayes factor for comparing null model M0 to alternative M1 p(M0 Y ) p(Y M0) | = | 1 p(M1 Y ) p(Y M1) | | Prior odds Posterior odds = Bayes factor K |{z} | {z } | {z } What’s a big Bayes factor? Jereys’ Appendix (1961): K>1 “Null hypothesis supported.” ⇒ K<1 “Not worth more than bare mention.” ⇒ K<1/√10 “Evidence against H0 substantial.” ⇒ K<1/10 “strong” ⇒ K<1/103/2 “very strong” ⇒ K<1/100 “Evidence against H0 decisive.” ⇒ Computing p(Y M)= p(Y , M) p( M)d | | | Z can be very hard to evaluate, especially in high dimensions in problems lacking a neat, closed-form solution. Details in Kass and Raftery (1995). Approximations? 7 Approximating Bayes Factors: BIC Goal Schwarz (1976), Annals of Statistics Approximate p(Y M)= p(Y , M) p( M)d. | | | R Approach Make the integral look like a normal integral. Dene g()=logp(Y , M)p( M) so that | | p(Y M)= eg()d | R Quadratic expansion around max at ˜ (post. mode): g() g(˜)+( ˜)0H(˜)( ˜)/2 ˜ ˜ 1 ˜ g() ( )0(I) ( )/2 2 where H =[∂ g/∂i∂j]andI is information matrix. Laplace’s method For Rp , posterior becomes ∈ ˜ ˜ 1 ˜ p(Y M) exp(g()) exp[( 1/2)( )0(I) ( )]d | ˜ Z p/2 1/2 = exp(g()) (2) I | | Log posterior approximately penalized log likelihood at MLE ˆ, ˜ ˜ log P (Y M)=logp(Y, M)+logp(M)+(1/2) log I + O(1) | | | | | =logp(Y,ˆ M)+logp(ˆM) (p/2) log n + O(1) | | =logp(Y,ˆ M) (p/2) log n +O(1) | log-likelihood at MLE penalty | {z } | {z } 8 BIC Threshold in Orthogonal Regression Orthogonal setup √n 2 X adds nˆ2 = 2 j + Z to Regr SS j j Coecient threshold Add Xp+1 to a model with p coecients? BIC criterion implies Add Xp+1 penalized like increases ⇐⇒ ˆ p +1 ˆ p log p(Y p+1) log n>log p(Y p) log n | 2 | 2 which implies that the change in the residual SS must satisfy ˆ2 2 RSSp RSSp+1 np+1 zp+1 log n = = > 22 22 2 2 Add Xp+1 when zp+1 > log n, | | so that eective “ level” 0asp n . → →∞ Consistency of criterion If there is a “true model” of dimension p,asn BIC will →∞ identify this model w.p. 1. In contrast, AIC asymptotically overts since it tests with a xed level. 9 Bayes Hypothesis Test Problem 2 Test H0 : =0vsH1 := 0 given Y1,...,Yn N(, ). 6 Pick the hypothesis with higher posterior odds. Test of point null (Berger, 1985) Under H0, = 0 and integration reduces to p(H0 Y )=p(Y 0 =0)p(H0)/p(Y ) | | N(0,2/n) | {z 2 } Under alternative H1 : N(0, ), p(H1 Y )= p(Y )p(H1)d p(H1)/p(Y ) | | | Z 1 2 Bayes factor Assume p(H0)=p(H1)= 2 and =1, p(H0 y) Y N(0, 1/n) | = p(H1 y) Y N(0, 1+1/n) | 2 1+1/n e ny /2 = (n/(n+1))y2/2) s 1/n e ny2/2 e ny2/2 √n √ne y2/2 e which is equal to one when ny2 =logn or in z scores z = log n | | p 10 Bayes Hypothesis Test: The Picture Scale Think on scale of z-score, so in eect the Bayes factor 2 p(H0 Y ) N(0, /n) | = 2 p(H1 Y ) N(0, (1 + 1/n)) | is ratio of N(0, 1) to N(0,n). Spike and slab With 2 = 1 and n = 100 Thus, density of Y under H1 is incredibly diuse relative to H0. This leads to the notion of a “spike and slab” prior when ideas are applied in estimation. 11 Discussion of BIC Bayes factor Posterior odds to compare one hypothesis to another. Requires a likelihood and various approximations. Comparison to other criteria Penalty Test Level (n) z-statistic BIC 1 log n Decreasing in n z > √log n 2 | | AIC 1 Fixed z > √2 | | RIC log p Decreasing in p z > √2logp | | BIC tends to pick very parsimonious models. ⇒ Consistency Strong penalty leads to consistency for xed as n . →∞ Important? e.g., Is the time series AR(p) for any xed order p? Spike and slab prior for testimator For p models, M1,...,Mp, with equal priors p(Mj)=1/p indexed by =(1,...,p), Bayes factors implies equal probability for 2p models with prior 1/2,j=0 p(j)= c, j =0 6 12 Criteria in McDonald’s Example AIC 0 5 10 15 20 k BIC 0 5 10 15 20 k 13 Detour to Random Eects Model Model Simplied version of variable selection problem. 2 Yj = j + j,j N(0, ),j=1,...,p or 2 2 2 2 Yj j N(j,) ,j N(0, =c ) | Decomposition of the marginal variance Var Y = 2 + 2 = 2 (1 + c2) so that c2 measures “signal strength”. Examples Repeated measures (Yij ,i=1,...,nj,Yj Yj) ⇒ Center eects in a clinical trial Growth curves (longitudinal models) Relationship to regression Parameters: j X j j ⇐⇒ Residual SS: Sum of squares not t, Y 2 j =0 j MLE Fits everything, so P ˆ ˆ 2 2 j = Yj E(j j) = p ⇒ j P 14 Bayes Estimator for Random Eects Model 2 Yj j N(j,) j=1,...,p | 2 2 2 j N(0, =c ) Posterior mean (given normality, c2) 1/2 c2 E(j Y )=Yj = | 1/2 +1/2 1+c2 Risk of Bayes estimator 2 p E(j E[j Y ]) = | 1/2 +1/2 j X c2 = p2 1+c2 <p2=RiskofMLE Relative risk E( ˆ)2 1 2 =1+ 2 E(j E[j Y ]) c | If c2 0, Bayes estimator does much better... but what’s c2? Approaches Could put a prior on c2, or try to estimate from data... 15 Shrink to Zero? Model and estimator 2 2 2 Yj j N(j, ) j N(0,c ) | 1 E(jY)= 1 Yj | 1+c2 Example 2 Yj c =1 E(j Y)= ⇒ | 2 How to shrink all the way to zero? Revise model and use a mixture prior, 2 Yj j N(j,) | 2 2 j N(0,c )+(1 )10 , Where denotes the probability of non-zero j. Bayes estimator? Posterior mean E(j Yj) =?, but surely not zero. | Prior estimates? Need and c2. 16 Alternative Strategy Approach Avoid direct evaluation of posterior mean E[j Yj]. | Indicator variables Introduce indicators as used in regression =(1,...,p),j 0,1 , ∈{ } and write the revised model as 2 Yj j N(j,) | j Bernoulli() where 2 2 N(0,c ) j=1 j =0 j =0 Three-step process George & Foster, 1996 1.

Load more