
Bayesian Model Selection
Bob Stine
May 11, 1998

Methods
- Review of Bayes ideas
- Shrinkage methods (ridge regression)
- Bayes factors: threshold |z| > √(log n)
- Calibration of selection methods
- Empirical Bayes (EBC): |z| > √(log p/q)

Goals
- Characteristics, strengths, weaknesses
- Think about priors in preparation for the next step

Bayesian Estimation

Parameters as random variables
- Parameter θ is drawn randomly from the prior p(θ).
- Gather data Y conditional on some fixed θ.
- Combine data with prior to form the posterior,
    p(θ|Y) = p(Y|θ) p(θ) / p(Y)
- Drop the marginal p(Y) since it is constant given Y:
    p(θ|Y) ∝ p(Y|θ) p(θ)
    (posterior ∝ likelihood × prior)

Key Bayesian example: inference on a normal mean
- Observe Y_1,...,Y_n ~ N(θ, σ²), with θ ~ N(M, τ² = c²σ²).
- Posterior is normal with (think about regression)
    E(θ|Y) = M + [τ²/(τ² + σ²/n)] (Ȳ − M)
           = [(n/σ²)/(n/σ² + 1/τ²)] Ȳ + [(1/τ²)/(n/σ² + 1/τ²)] M
- In the special case σ² = τ², the "prior is worth one observation":
    E(θ|Y) = [n/(n+1)] Ȳ + [1/(n+1)] M

Why Bayes Shrinkage Works

Stigler (1983) discussion of an article by C. Morris in JASA.

[Figure: scatterplot of Y against θ with the line E[Y|θ] = θ]

Regression slope is
    Cov(Y, θ)/Var Y = c²σ² / ((1 + c²)σ²) = c²/(1 + c²)

Ridge Regression

Collinearity
- Originates in optimization problems where the Hessian matrix in a Newton procedure becomes ill-conditioned. Marquardt's idea is to perturb the diagonal, avoiding a singular matrix.
- Hoerl and Kennard (1970) apply this to regression analysis, with graphical aids like the ridge trace:
    plot β̂_λ = (X′X + λI_p)⁻¹ X′Y against λ

Bayes hierarchical model (Lindley and Smith, 1972)
- Data follow the standard linear model, with a normal prior on the slopes:
    Y ~ N(Xβ, σ²I_n),   β ~ N(0, c²σ²I_p).
- The posterior mean shrinks β̂ toward 0,
    E(β|Y) = (X′X + (1/c²) I_p)⁻¹ X′Y
           = (I_p + (1/c²)(X′X)⁻¹)⁻¹ β̂
- Shrinkage is larger as c² → 0.

Picture: collinear and orthogonal cases.

Where is the variable selection?

How to Obtain Selection from Bayes

Main point
- To do more than just shrink linearly toward the prior mean, we have to get away from the bivariate normal.

[Figure: scatterplot of Y against θ with the line E[Y|θ] = θ]

Three alternatives
- Spike-and-slab methods and BIC.
- Normal mixtures.
- Cauchy methods in information theory.

Bayesian Model Selection

Conditioning on models
- Jeffreys (1935), more recently Raftery, Kass, and others.
- Which model is most likely given the data? Think in terms of models M rather than parameters θ:
    p(M|Y) = p(Y|M) p(M) / p(Y)
- Express p(Y|M) as an average of the usual likelihood p(Y|M, θ) over the parameter space,
    p(M|Y) = [∫ p(Y|θ, M) p(θ|M) dθ] p(M) / p(Y)

Bayes factors
- Compare two models, M0 and M1. The posterior odds in favor of model M0 over the alternative M1 are
    p(M0|Y)/p(M1|Y) = [p(Y|M0)/p(Y|M1)] × [p(M0)/p(M1)]
    posterior odds  =   Bayes factor      ×   prior odds
- Since often p(M0) = p(M1) = 1/2, the prior odds equal 1, so that
    posterior odds = Bayes factor = K

Using Bayes Factors

Bayes factor for comparing the null model M0 to the alternative M1:
    p(M0|Y)/p(M1|Y) = [p(Y|M0)/p(Y|M1)] × 1
    posterior odds  =  Bayes factor K   ×  prior odds

What's a big Bayes factor? Jeffreys' Appendix (1961):
    K > 1          ⇒ "Null hypothesis supported."
    K < 1          ⇒ "Not worth more than bare mention."
    K < 1/√10      ⇒ "Evidence against H0 substantial."
    K < 1/10       ⇒ "strong"
    K < 1/10^(3/2) ⇒ "very strong"
    K < 1/100      ⇒ "Evidence against H0 decisive."

Computing
    p(Y|M) = ∫ p(Y|θ, M) p(θ|M) dθ
can be very hard, especially in high dimensions and in problems lacking a neat, closed-form solution. Details in Kass and Raftery (1995). Approximations?
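To make the computational point concrete, here is a minimal Python sketch (not from the slides) that evaluates the marginal likelihood p(Y|M1) for the one-dimensional normal-mean example by brute-force numerical integration and forms the Bayes factor K for a point null against a N(0, τ²) alternative. The choices σ = τ = 1, n = 100, the simulated data, and the helper function names are illustrative assumptions only.

```python
# A minimal sketch (not from the slides): Bayes factor K for the normal-mean example,
# with the marginal likelihood p(Y|M1) = ∫ p(Y|θ) p(θ|M1) dθ computed numerically.
import numpy as np
from scipy import stats
from scipy.integrate import quad

def log_likelihood(theta, y, sigma=1.0):
    """Log of the usual likelihood p(Y | θ) for i.i.d. N(θ, σ²) data."""
    return np.sum(stats.norm.logpdf(y, loc=theta, scale=sigma))

def log_marginal_likelihood(y, tau=1.0, sigma=1.0):
    """log p(Y | M1) with prior θ ~ N(0, τ²), integrated by quadrature."""
    shift = log_likelihood(y.mean(), y, sigma)  # factor out the peak so exp() stays well scaled
    integrand = lambda t: np.exp(log_likelihood(t, y, sigma) - shift) * stats.norm.pdf(t, 0.0, tau)
    value, _ = quad(integrand, -10.0 * tau, 10.0 * tau)
    return np.log(value) + shift

rng = np.random.default_rng(0)
n, theta_true = 100, 0.15                      # illustrative sample size and true mean
y = rng.normal(theta_true, 1.0, size=n)

log_K = log_likelihood(0.0, y) - log_marginal_likelihood(y)   # H0: θ = 0 vs H1: θ ~ N(0, τ²)
z = np.sqrt(n) * y.mean()
print(f"K = {np.exp(log_K):.3f}, |z| = {abs(z):.2f}, threshold sqrt(log n) = {np.sqrt(np.log(n)):.2f}")
```

Even in one dimension the integral has to be done numerically; with many parameters the same computation becomes the high-dimensional integral that the Laplace/BIC approximation developed next is designed to avoid.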
Approximating Bayes Factors: BIC

Goal (Schwarz, 1978, Annals of Statistics)
- Approximate p(Y|M) = ∫ p(Y|θ, M) p(θ|M) dθ.

Approach
- Make the integral look like a normal integral. Define
    g(θ) = log p(Y|θ, M) p(θ|M)
  so that
    p(Y|M) = ∫ e^{g(θ)} dθ
- Quadratic expansion around the max at θ̃ (the posterior mode):
    g(θ) ≈ g(θ̃) + (θ − θ̃)′ H(θ̃) (θ − θ̃)/2
         ≈ g(θ̃) − (θ − θ̃)′ Ĩ⁻¹ (θ − θ̃)/2
  where H = [∂²g/∂θ_i ∂θ_j] and Ĩ is the information matrix.

Laplace's method
- For θ ∈ R^p, the integral becomes
    p(Y|M) ≈ exp(g(θ̃)) ∫ exp[−(1/2)(θ − θ̃)′ Ĩ⁻¹ (θ − θ̃)] dθ
           = exp(g(θ̃)) (2π)^{p/2} |Ĩ|^{1/2}
- The log posterior is approximately a penalized log-likelihood at the MLE θ̂,
    log p(Y|M) = log p(Y|θ̃, M) + log p(θ̃|M) + (1/2) log|Ĩ| + O(1)
               = log p(Y|θ̂, M) + log p(θ̂|M) − (p/2) log n + O(1)
               = log p(Y|θ̂, M) − (p/2) log n + O(1)
    (log-likelihood at the MLE minus a penalty)

BIC Threshold in Orthogonal Regression

Orthogonal setup
- Adding X_j increases the regression SS by n β̂_j² = (√n β_j + σZ_j)², with Z_j ~ N(0, 1).

Coefficient threshold
- Add X_{p+1} to a model with p coefficients? The BIC criterion implies
    Add X_{p+1}  ⟺  the penalized likelihood increases
    ⟺  log p(Y|θ̂_{p+1}) − [(p+1)/2] log n > log p(Y|θ̂_p) − (p/2) log n
  which implies that the change in the residual SS must satisfy
    (RSS_p − RSS_{p+1})/(2σ²) = n β̂²_{p+1}/(2σ²) = z²_{p+1}/2 > (log n)/2
- Add X_{p+1} when |z_{p+1}| > √(log n), so that the effective "α level" → 0 as n → ∞.

Consistency of criterion
- If there is a "true model" of dimension p, then as n → ∞ BIC will identify this model w.p. 1.
- In contrast, AIC asymptotically overfits since it tests with a fixed α level.

Bayes Hypothesis Test

Problem
- Test H0: θ = 0 vs H1: θ ≠ 0 given Y_1,...,Y_n ~ N(θ, σ²). Pick the hypothesis with the higher posterior odds.

Test of a point null (Berger, 1985)
- Under H0, θ = 0 and the integration reduces to
    p(H0|Y) = p(Y|θ = 0) p(H0)/p(Y),   with Ȳ ~ N(0, σ²/n) under H0.
- Under the alternative H1: θ ~ N(0, τ²),
    p(H1|Y) = [∫ p(Y|θ) p(θ|H1) dθ] p(H1)/p(Y)

Bayes factor
- Assume p(H0) = p(H1) = 1/2 and σ² = τ² = 1, so the factor is the ratio of the N(0, 1/n) and N(0, 1 + 1/n) densities at Ȳ:
    p(H0|Y)/p(H1|Y) = √((1 + 1/n)/(1/n)) · e^{−nȲ²/2} / e^{−[n/(n+1)]Ȳ²/2}
                    ≈ √n · e^{−nȲ²/2} · e^{Ȳ²/2}
                    ≈ √n · e^{−nȲ²/2}
  which equals one when nȲ² = log n, or in z-scores, |z| = √(log n).

Bayes Hypothesis Test: The Picture

Scale
- Think on the scale of the z-score, so in effect the Bayes factor
    p(H0|Y)/p(H1|Y) = N(0, σ²/n) / N(0, σ²(1 + 1/n))
  is the ratio of an N(0, 1) density to an N(0, n) density.

Spike and slab
- With σ² = 1 and n = 100: [figure comparing the two densities]
- Thus the density of Ȳ under H1 is incredibly diffuse relative to H0. This leads to the notion of a "spike and slab" prior when these ideas are applied in estimation.

Discussion of BIC

Bayes factor
- Posterior odds to compare one hypothesis to another.
- Requires a likelihood and various approximations.

Comparison to other criteria

    Criterion   Penalty        Test level α(n)     z-statistic
    BIC         (1/2) log n    decreasing in n     |z| > √(log n)
    AIC         1              fixed               |z| > √2
    RIC         log p          decreasing in p     |z| > √(2 log p)

    ⇒ BIC tends to pick very parsimonious models.

Consistency
- The strong penalty leads to consistency for fixed p as n → ∞.
- Important? e.g., Is the time series AR(p) for any fixed order p?

Spike and slab prior for testimator
- For p models M_1,...,M_p with equal priors p(M_j) = 1/p, indexed by γ = (γ_1,...,γ_p), Bayes factors imply equal probability for the 2^p models, with prior
    p(θ_j) = 1/2 for θ_j = 0, and c for θ_j ≠ 0.

Criteria in McDonald's Example

[Figure: AIC plotted against model size k = 0,...,20]
[Figure: BIC plotted against model size k = 0,...,20]
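As a quick check on the comparison table above, the following Python sketch (an illustration, not part of the original notes) converts each criterion's penalty into its implied two-sided z-threshold and per-coefficient "α level"; the values n = 1000 and p = 50 are arbitrary choices.

```python
# A minimal sketch (not from the slides): the three criteria reduced to z-statistic
# thresholds in the orthogonal case, with the per-coefficient test level each implies.
import numpy as np
from scipy import stats

def thresholds(n, p):
    return {
        "AIC": np.sqrt(2.0),             # fixed level
        "BIC": np.sqrt(np.log(n)),       # level decreasing in n
        "RIC": np.sqrt(2.0 * np.log(p)), # level decreasing in p
    }

n, p = 1000, 50
for name, t in thresholds(n, p).items():
    alpha = 2 * stats.norm.sf(t)         # two-sided level under H0: z ~ N(0, 1)
    print(f"{name}: |z| > {t:.2f}  (per-coefficient level approx {alpha:.4f})")
```

As n grows, BIC's threshold √(log n) increases and its implied level falls toward zero, while AIC's stays fixed; this is the sense in which BIC picks very parsimonious models.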
Detour to Random Effects Model

Model
- Simplified version of the variable selection problem:
    Y_j = μ_j + ε_j,   ε_j ~ N(0, σ²),   j = 1,...,p
  or
    Y_j | μ_j ~ N(μ_j, σ²),   μ_j ~ N(0, τ² = c²σ²)

Decomposition of the marginal variance
    Var Y = τ² + σ² = σ²(1 + c²)
  so that c² measures "signal strength".

Examples
- Repeated measures (Y_ij, i = 1,...,n_j; Ȳ_j ⇒ Y_j)
- Center effects in a clinical trial
- Growth curves (longitudinal models)

Relationship to regression
- Parameters: μ_j ⟺ β_j x_j
- Residual SS: sum of squares not fit, Σ_{j: μ̂_j = 0} Y_j²

MLE
- Fits everything, so μ̂_j = Y_j  ⇒  Σ_j E(μ̂_j − μ_j)² = pσ².

Bayes Estimator for Random Effects Model

Model
    Y_j | μ_j ~ N(μ_j, σ²),   j = 1,...,p
    μ_j ~ N(0, τ² = c²σ²)

Posterior mean (given normality, c²)
    E(μ_j|Y) = Y_j (1/σ²)/(1/σ² + 1/τ²) = Y_j c²/(1 + c²)

Risk of Bayes estimator
    Σ_j E(μ_j − E[μ_j|Y])² = p/(1/σ² + 1/τ²) = pσ² c²/(1 + c²) < pσ² = risk of MLE

Relative risk
    Σ_j E(μ_j − μ̂_j)² / Σ_j E(μ_j − E[μ_j|Y])² = 1 + 1/c²

If c² ≈ 0, the Bayes estimator does much better... but what's c²?

Approaches: could put a prior on c², or try to estimate it from the data...

Shrink to Zero?

Model and estimator
    Y_j | μ_j ~ N(μ_j, σ²),   μ_j ~ N(0, c²σ²)
    E(μ_j|Y) = (1 − 1/(1 + c²)) Y_j

Example
    c² = 1  ⇒  E(μ_j|Y) = Y_j/2

How to shrink all the way to zero?
- Revise the model and use a mixture prior,
    Y_j | μ_j ~ N(μ_j, σ²)
    μ_j ~ π N(0, c²σ²) + (1 − π) 1_{0},
  where π denotes the probability of a non-zero μ_j.

Bayes estimator? The posterior mean E(μ_j|Y_j) = ?, but surely not zero.

Prior estimates? Need π and c².

Alternative Strategy

Approach
- Avoid direct evaluation of the posterior mean E[μ_j|Y_j].

Indicator variables
- Introduce indicators as used in regression,
    γ = (γ_1,...,γ_p),   γ_j ∈ {0, 1},
  and write the revised model as
    Y_j | μ_j ~ N(μ_j, σ²)
    γ_j ~ Bernoulli(π)
  where
    μ_j ~ N(0, c²σ²) if γ_j = 1,   μ_j = 0 if γ_j = 0.

Three-step process (George & Foster, 1996)
1.
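To see that the mixture ("spike and slab") prior shrinks small effects nearly, but not exactly, to zero, here is a short Python sketch (not from the notes) of the posterior mean E(μ_j|Y_j) under the revised model above. The prior weight π, signal strength c², and σ² are illustrative guesses; the notes leave them to be given a prior or estimated from the data.

```python
# A minimal sketch (not from the slides): posterior mean of μ under the mixture prior
#   μ ~ π N(0, c²σ²) + (1 - π) δ_0,   Y | μ ~ N(μ, σ²).
import numpy as np
from scipy import stats

def posterior_mean(y, pi=0.1, c2=25.0, sigma2=1.0):
    """E(μ | Y=y): P(μ ≠ 0 | y) times the conditional shrinkage c²/(1+c²) · y."""
    slab = pi * stats.norm.pdf(y, 0.0, np.sqrt(sigma2 * (1.0 + c2)))   # marginal of y given μ ≠ 0
    spike = (1.0 - pi) * stats.norm.pdf(y, 0.0, np.sqrt(sigma2))       # marginal of y given μ = 0
    w = slab / (slab + spike)                                          # P(γ = 1 | y)
    return w * (c2 / (1.0 + c2)) * y

for y in [0.5, 1.5, 2.5, 4.0]:
    print(f"y = {y:4.1f}  ->  E(mu | y) = {posterior_mean(y):6.3f}")
```

Small |Y_j| are pulled essentially (but not exactly) to zero because the posterior probability of γ_j = 1 is tiny, while large |Y_j| keep most of their value; this selection-like behavior is what the linear shrinkage estimator cannot produce.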