
arXiv:1308.6780v3 [stat.ME] 18 Aug 2015

Statistical Science
2015, Vol. 30, No. 2, 242-257
DOI: 10.1214/14-STS510
(c) Institute of Mathematical Statistics, 2015

Approximate Bayesian Model Selection with the Deviance Statistic

Leonhard Held, Daniel Sabanés Bové and Isaac Gravestock

Leonhard Held is Professor and Isaac Gravestock is Ph.D. Student, Department of Biostatistics, Epidemiology, Biostatistics and Prevention Institute, University of Zurich, Hirschengraben 84, 8001 Zurich, Switzerland (e-mail: [email protected]; [email protected]). Daniel Sabanés Bové is Biostatistician at F. Hoffmann-La Roche Ltd, 4070 Basel, Switzerland (e-mail: [email protected]).

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2015, Vol. 30, No. 2, 242-257. This reprint differs from the original in pagination and typographic detail.

Abstract. Bayesian model selection poses two main challenges: the specification of parameter priors for all models, and the computation of the resulting Bayes factors between models. There is now a large literature on automatic and objective parameter priors in the linear model. One important class are g-priors, which were recently extended from linear to generalized linear models (GLMs). We show that the resulting Bayes factors can be approximated by test-based Bayes factors (Johnson [Scand. J. Stat. 35 (2008) 354-368]) using the deviance statistics of the models. To estimate the hyperparameter g, we propose empirical and fully Bayes approaches and link the former to minimum Bayes factors and shrinkage estimates from the literature. Furthermore, we describe how to approximate the corresponding posterior distribution of the regression coefficients based on the standard GLM output. We illustrate the approach with the development of a clinical prediction model for 30-day survival in the GUSTO-I trial using logistic regression.

Key words and phrases: Bayes factor, deviance, generalized linear model, g-prior, model selection, shrinkage.

1. INTRODUCTION

The problem of model and variable selection is pervasive in statistical practice. For example, it is central for the development of clinical prediction models [Steyerberg (2009)]. For illustration, we consider the GUSTO-I trial, a large randomized study for the comparison of four different treatments in over 40,000 acute myocardial infarction patients [Lee et al. (1995)]. We study a publicly available subgroup from the Western region of the USA with n = 2188 patients and prognosis of the binary endpoint of 30-day survival [Steyerberg (2009)]. In order to develop a clinical prediction model for this endpoint, we focus our analysis on the assessment of the effects of the 17 covariates listed in Table 1 in a logistic regression model.

There is now a large literature on automatic and objective Bayesian model selection, which unburdens the statistician from manually eliciting parameter priors for all models in the absence of substantive prior information [see, e.g., Berger and Pericchi (2001)]. The g-prior on the regression coefficients is the standard choice [Liang et al. (2008)]. However, such objective Bayesian methodology is currently limited to the linear model [e.g., Bayarri et al. (2012)]. For non-Gaussian regression, there are computational and conceptual problems, and one solution to this are test-based Bayes factors [Johnson (2005, 2008)]. Consider a classical scenario with a null model M_0 nested within a more general alternative model.

Table 1. Description of the variables in the GUSTO-I data set

Variable   Description
y          Death within 30 days after acute myocardial infarction (Yes = 1, No = 0)
x1         Gender (Female = 1, Male = 0)
x2         Age [years]
x3         Killip class (4 categories)
x4         Diabetes (Yes = 1, No = 0)
x5         Hypotension (Yes = 1, No = 0)
x6         Tachycardia (Yes = 1, No = 0)
x7         Anterior infarct location (Yes = 1, No = 0)
x8         Previous myocardial infarction (Yes = 1, No = 0)
x9         Height [cm]
x10        Weight [kg]
x11        Hypertension history (Yes = 1, No = 0)
x12        Smoking (3 categories: Never/Ex/Current)
x13        Hypercholesterolaemia (Yes = 1, No = 0)
x14        Previous angina pectoris (Yes = 1, No = 0)
x15        Family history of myocardial infarctions (Yes = 1, No = 0)
x16        ST elevation on ECG: Number of leads (0-11)
x17        Time to relief of chest pain more than 1 hour (Yes = 1, No = 0)

Traditionally, the use of Bayes factors requires the specification of proper prior distributions on all unknown model parameters of the alternative model, which are not shared by the null model. In contrast, Johnson (2005) defines Bayes factors using the distribution of a suitable test statistic under the null and alternative models, effectively replacing the data with the test statistic. This approach eliminates the necessity to define prior distributions on model parameters and leads to simple closed-form expressions for chi-squared, F-, t- and z-statistics.

The Johnson (2005) approach is extended in Johnson (2008) to the likelihood ratio test statistic and, thus, if applied to generalized linear models (GLMs), to the deviance statistic [Nelder and Wedderburn (1972)]. This is explored further in Hu and Johnson (2009), where Markov chain Monte Carlo (MCMC) is used to develop a Bayesian variable selection algorithm for logistic regression. However, the factor g in the implicit g-prior is treated as fixed, and estimation of the regression coefficients is not discussed. We fill this gap and extend the work by Hu and Johnson (2009), combining g-prior methodology for the linear model with Bayesian model selection based on the deviance. This enables us to apply empirical [George and Foster (2000)] and fully Bayesian [Cui and George (2008)] approaches for estimating the hyperparameter g to GLMs. By linking g-priors to the theory on shrinkage estimates of regression coefficients [Copas (1983, 1997)], we finally obtain a unified framework for objective Bayesian model selection and parameter inference for GLMs.

The paper is structured as follows. In Section 2 we review the g-prior in the linear and generalized linear model, and show that this prior choice is implicit in the application of test-based Bayes factors computed from the deviance statistic. In Section 3 we describe how the hyperparameter g influences model selection and parameter inference, and introduce empirical and fully Bayes estimates for it. Using empirical Bayes to estimate g, we are able to analytically quantify the accuracy of test-based Bayes factors in the linear model. Connections to the literature on minimum Bayes factors and shrinkage of regression coefficients are outlined. In Section 4 we apply the methodology in order to build a logistic regression model for predicting 30-day survival in the GUSTO-I trial, and compare our methodology with selected alternatives in a bootstrap study. In Section 5 we summarize our findings and sketch possible extensions.

2. OBJECTIVE BAYESIAN MODEL SELECTION IN REGRESSION

Consider a generic regression model M with linear predictor η = α + x^⊤ β, from which we assume that the outcome y = (y_1, ..., y_n)^⊤ was generated. We collect the intercept α, the regression coefficients vector β, and possible additional parameters (e.g., the residual variance in a linear model) in θ ∈ Θ. Specific candidate models M_j, j ∈ J, differ with respect to the content and the dimension of the covariate vector x, and hence β, so each model M_j defines its own parameter vector θ_j with likelihood function p(y | θ_j, M_j).

Through optimizing this likelihood, we obtain the maximum likelihood estimate (MLE) θ̂_j of θ_j. For Bayesian inference a prior distribution with density p(θ_j | M_j) is assigned to the parameter vector θ_j to obtain the posterior density p(θ_j | y, M_j) ∝ p(y | θ_j, M_j) p(θ_j | M_j). This forms the basis to compute the posterior mean E(θ_j | y, M_j) and other suitable characteristics of the posterior distribution. The marginal likelihood

    p(y | M_j) = ∫_{Θ_j} p(y | θ_j, M_j) p(θ_j | M_j) dθ_j

is the key ingredient to transform prior model probabilities Pr(M_j), j ∈ J, into posterior model probabilities

(1) Pr(M_j | y) = p(y | M_j) Pr(M_j) / Σ_{k∈J} p(y | M_k) Pr(M_k)
               = DBF_{j,0} Pr(M_j) / Σ_{k∈J} DBF_{k,0} Pr(M_k).

In the second line, the usual (data-based) Bayes factor DBF_{j,0} = p(y | M_j)/p(y | M_0) of model M_j versus a reference model M_0 replaces the marginal likelihood p(y | M_j) from the first line. Improper priors can only be used for parameters that are common to all models (e.g., here the intercept α), because only then the indeterminate normalizing constant cancels in the posterior model probabilities (1).

In Section 2.1 we discuss the g-prior, a specific class of prior distributions p(θ_j | M_j) commonly used in linear model selection problems. The g-prior induces shrinkage of β, in the sense that the posterior mean is a shrunken version of the MLE toward the prior mean. Furthermore, it is an automatic prior, since it does not require specification of subjective prior information. Section 2.2 discusses the resulting test-based Bayes factors under the g-prior.

2.1 Zellner's g-Prior and Generalizations

We start with the original formulation of Zellner's g-prior for the Gaussian linear model in Section 2.1.1 and extend this to GLMs in Section 2.1.2.

2.1.1 Gaussian linear model. Consider the Gaussian linear model M_j: y_i ~ N(α + x_{ij}^⊤ β_j, σ²) with intercept α, regression coefficients vector β_j and variance σ², and collect all parameters in θ_j = (α, β_j^⊤, σ²)^⊤. Here N(µ, σ²) denotes the univariate Gaussian distribution with mean µ and variance σ², and x_{ij} = (x_{i1}, ..., x_{i d_j})^⊤ is the covariate vector for observation i = 1, ..., n. Using the n × d_j full rank design matrix X_j = (x_{1j}, ..., x_{nj})^⊤, the likelihood obtained from n independent observations is

(2) p(y | θ_j, M_j) = N_n(y | α 1 + X_j β_j, σ² I),

with 1 and I denoting the all-ones vector and identity matrix of dimension n, respectively. We assume that the covariates have been centered around 0, that is, X_j^⊤ 1 = 0. Here and in the following, 0 denotes the zero vector of length d_j.

Zellner's g-prior [Zellner (1986)] fixes a constant g > 0 and specifies the Gaussian prior

(3) β_j | σ², M_j ~ N_{d_j}(0, g σ² (X_j^⊤ X_j)^{-1})

for the regression coefficients β_j, conditional on σ². This prior can be interpreted as a posterior distribution, if α is fixed and a locally uniform prior for β_j is combined with an imaginary outcome y_0 = α 1 from the Gaussian linear model (2) with the same design matrix X_j but scaled residual variance g σ². The prior (3) on β_j is usually combined with an improper reference prior on the intercept α and the residual variance σ² [Liang et al. (2008)]: p(α, σ²) ∝ σ^{-2}. The posterior distribution of (α, β_j^⊤)^⊤ is then a multivariate t distribution, with posterior mean of β_j given by

(4) E(β_j | y, M_j) = g/(g + 1) · β̂_j = (n · β̂_j + n/g · 0) / (n + n/g).

This shows that the MLE β̂_j, the ordinary least squares (OLS) estimate, is shrunk toward the prior mean zero. The shrinkage factor t = g/(g + 1) scales the MLE to obtain the posterior mean (4). In other words, the posterior mean is a weighted average of the MLE and the prior mean with weights proportional to the data sample size n and the term n/g, respectively. Thus, n/g can be interpreted as the prior sample size, or 1/g as the relative prior sample size. The question of how to choose or estimate g will be addressed in Section 3.
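As a small numerical illustration of the shrinkage in (4), consider the following base R sketch (simulated data; all variable names are ours, and g = n is just one possible fixed choice, discussed in Section 3.1):

    ## Posterior mean under Zellner's g-prior, cf. equation (4):
    ## E(beta | y, M_j) = t * betahat with shrinkage factor t = g/(g+1).
    set.seed(1)
    n <- 100
    X <- scale(matrix(rnorm(n * 3), n, 3), center = TRUE, scale = FALSE)
    y <- drop(1 + X %*% c(0.5, -0.3, 0)) + rnorm(n)
    betahat <- coef(lm(y ~ X))[-1]        # OLS estimate of beta_j
    g <- n                                # unit information prior
    shrink <- g / (g + 1)                 # shrinkage factor t
    cbind(betahat, post.mean = shrink * betahat)

For g = n the prior carries one observation's worth of information, so the OLS estimate is barely shrunk.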

One advantage of Zellner's g-prior is that the marginal likelihood, or, equivalently, the (data-based) Bayes factor versus the null model M_0: β_j = 0, has a simple closed-form expression in terms of the usual coefficient of determination R_j² of model M_j [Liang et al. (2008)]:

(5) DBF_{j,0} = (g + 1)^{(n-d_j-1)/2} {1 + g(1 - R_j²)}^{-(n-1)/2}.

Note that R_j² can be written as a function of the F-statistic

(6) F_j = {(n - d_j - 1) R_j²} / {d_j (1 - R_j²)}

for testing β_j = 0. This suggests that similar expressions (in terms of test statistics) can be derived for the corresponding Bayes factors in GLMs. This conjecture will be confirmed in Section 2.2.

2.1.2 Generalized linear model. Now consider a GLM M_j with linear predictor η_{ij} = α + x_{ij}^⊤ β_j and mean µ_{ij} = h(η_{ij}) obtained with the response function h(η) and variance function v(µ) [Nelder and Wedderburn (1972)]. The direct extension of the standard g-prior in the Gaussian linear model is then the generalized g-prior [Sabanés Bové and Held (2011a)]

(7) β_j | M_j ~ N_{d_j}(0, g c (X_j^⊤ W X_j)^{-1}),

where W is a diagonal matrix with weights for the observations (e.g., the binomial sample sizes for logistic regression). Here the appropriate centering of the covariates is X_j^⊤ W 1 = 0. As in Section 2.1.1, we specify an improper uniform prior p(α) ∝ 1 for the intercept α. The constant c = v{h(α)} h'(α)^{-2} [Copas (1983); Sabanés Bové and Held (2011a)] in (7) corresponds to the variance σ² in the standard g-prior (3), which could also be formulated for general linear models with a nonunit weight matrix W. It preserves the interpretation of n/g as the prior sample size. Note that Sabanés Bové and Held (2011a) recommend to use α = 0 as default, but considerable improvements in accuracy can be obtained by using the MLE α̂ of α under the null model; see Section 4.1 for details.

The connection between (3) and (7) is as follows. Denote the expected Fisher information (conditional on the variance σ² in the Gaussian linear model) for (α, β_j^⊤)^⊤ as I(α, β_j). In the Gaussian linear model, this (d_j + 1) × (d_j + 1) matrix is block-diagonal due to the centering of the covariates, and depends neither on the intercept nor on the regression coefficients:

    I(α, β_j) = [ I_{α,α}, I_{α,β_j} ; I_{α,β_j}^⊤, I_{β_j,β_j} ] = σ^{-2} [ n, 0^⊤ ; 0, X_j^⊤ X_j ].

Hence, (3) can be written as

(8) β_j | M_j ~ N_{d_j}(0, g · I_{β_j,β_j}^{-1}).

In the GLM, I(α, β_j) depends on the parameters and is not necessarily block-diagonal. However, if we fix β_j at its prior mean 0, I(α, β_j = 0) is block-diagonal with I_{β_j,β_j} = c^{-1} X_j^⊤ W X_j, so (7) and (8) are equivalent; see Copas [(1983), Section 8] for details. Departures from the assumption β_j = 0 are also discussed in Copas (1983).

In contrast to Gaussian linear models, the marginal likelihood for GLMs no longer has a closed-form expression. For its computation, one has to resort to numerical approximations, for example, a Laplace approximation. This requires a Gaussian approximation of the posterior p(α, β_j | y, M_j), which can be obtained with the Bayesian iteratively weighted least squares algorithm. See Sabanés Bové and Held [(2011a), Section 3.1] for more details.

2.2 Test-Based Bayes Factors

Based on the asymptotic distribution of the deviance statistic in Section 2.2.1, we connect the resulting test-based Bayes factors with the g-prior in Section 2.2.2 and discuss the advantages over data-based Bayes factors in Section 2.2.3.

2.2.1 Asymptotic distributions of the deviance statistic. Consider the frequentist approach to model selection, where test statistics are used to assess the evidence against the null model M_0: β_j = 0 in a specific GLM M_j. A popular choice is the deviance (or likelihood ratio test) statistic

    z_j(y) = 2 log{ max_{α,β_j} p(y | α, β_j, M_j) / max_α p(y | α, M_0) }.

Then we have the well-known result that, conditional on M_0, the distribution of the deviance z_j(Y) converges for n → ∞ to a chi-squared distribution χ²(d_j) with d_j degrees of freedom.

To derive the asymptotic distribution of the deviance statistic under model M_j, Johnson (2008) considers a sequence of local alternative hypotheses H_1: β_j = O(1/√n), so the size of the true regression coefficients is scaled with 1/√n, and thus gets smaller with increasing number of observations n. This is the case of practical interest, because for larger β_j it would be trivial to differentiate between H_0: β_j = 0 and H_1, and for smaller β_j it would be too difficult [Johnson (2005), page 691]. In this setup, the distribution of the deviance converges for n → ∞ to a noncentral chi-squared distribution χ²(d_j, λ_j) with d_j degrees of freedom, where λ_j = β_j^⊤ I_{β_j,β_j} β_j is the noncentrality parameter. Here I_{β_j,β_j} denotes the expected Fisher information of β_j from (8). The generalized g-prior (8) for β_j translates into a gamma prior for the noncentrality parameter, λ_j | g ~ G(d_j/2, 1/(2g)); see Appendix A for details. Integrating the noncentral chi-squared distribution over this prior gives a scaled chi-squared distribution as the approximate marginal distribution of the deviance under M_j,

(9) z_j | g, M_j ~ (g + 1) χ²(d_j).

2.2.2 Connection with the g-prior. The test-based Bayes factor (TBF) of M_j versus M_0 is obtained by replacing the data y with the deviance statistic z_j, that is, by taking the ratio of the densities of z_j under (9) and under the central χ²(d_j) distribution:

(10) TBF_{j,0} = p_approx(z_j | M_j) / p_approx(z_j | M_0) = (g + 1)^{-d_j/2} exp{ (z_j/2) · g/(g + 1) }.

In the Gaussian linear model the deviance is z_j = -n log(1 - R_j²), so (10) can be compared directly with the data-based Bayes factor (5). Hence, in the linear model, TBF_{j,0} will be larger than DBF_{j,0} if both are calculated with the same g ≥ n - 1; however, it is not clear which Bayes factor is larger for g < n - 1.

2.2.3 Advantages over data-based Bayes factors. In contrast to DBFs, TBFs require neither numerical integration nor the explicit specification of the generalized g-prior (7): the deviance statistic z_j and its degrees of freedom d_j are standard output of any GLM fit. Moreover, for a common value of g, TBFs combine coherently across nested models, because deviance statistics are additive: for M_0 ⊂ M_j ⊂ M_k we have z_{k,0} = z_{k,j} + z_{j,0} and hence TBF_{k,0} = TBF_{k,j} · TBF_{j,0}.

3. ESTIMATION OF g

3.1 Fixed g

It is well known from standard GLM theory that the MLE θ̂_j = (α̂, β̂_j^⊤)^⊤ follows asymptotically a normal distribution with mean θ_j and covariance matrix equal to the inverse expected Fisher information I(α, β_j)^{-1}, evaluated at the true values α and β_j. As in Copas (1983), we replace β_j with its prior mean 0, that is, we assume that the asymptotic inverse covariance matrix of θ̂_j is I(α, 0) = diag{I_{α,α}, I_{β_j,β_j}}. Note that α̂ and β̂_j are now uncorrelated because we have centered the covariate vectors such that X_j^⊤ W 1 = 0.

Combining this Gaussian "likelihood" of θ_j with the generalized g-prior

    θ_j | g, M_j ~ N_{d_j+1}( (0, 0^⊤)^⊤, diag{∞, g · I_{β_j,β_j}^{-1}} )

gives the posterior distribution

(11) θ_j | y, g, M_j ~ N_{d_j+1}( (α̂, t · β̂_j^⊤)^⊤, diag{I_{α,α}^{-1}, t · I_{β_j,β_j}^{-1}} ).

Here t = g/(g + 1) is the same shrinkage factor for β̂_j as in the Gaussian linear model from Section 2.1.1. A smaller g leads to a smaller t and thus to stronger shrinkage of the β_j posterior toward 0. The approximate posterior covariance matrix of β_j is also shrunk by the shrinkage factor t compared to the frequentist covariance matrix. In Section 4.2 we provide an empirical comparison of the true shrinkage under the generalized g-prior and the theoretical shrinkage g/(g + 1).

The above assumption that the covariance matrix of the MLE is the inverse expected Fisher information I(α, 0)^{-1} enables us to derive a simple form of the posterior distribution. In practice, we use the corresponding sub-matrices of the observed Fisher information matrix evaluated at the MLE, easily available from fitting a standard GLM, and (11) holds only approximately. Likewise, the interpretation of g as the ratio between the data sample size and the prior sample size holds only approximately.

In order to understand the role of g for model selection, consider the TBF formula (10) and the limiting case of g → 0. Then the generalized g-prior converges to a point mass at β_j = 0, and thus M_j collapses to the null model M_0. Consequently, TBF_{j,0} → 1, because both models are equal descriptions of the data in the limit. On the other extreme, the case g → ∞ corresponds to an increasingly vague prior on β_j. As is well known, arbitrarily inflating the prior variance of parameters that are not common to all models is not a safe strategy. Here we see immediately from (10) that TBF_{j,0} → 0 in this case. This means that no matter how well the model M_j fits the data compared to the null model M_0, the latter is preferred if g is chosen large enough. This is an example of Lindley's paradox [Lindley (1957)].

In between these two extremes, quite a few fixed values for g have been recommended. The choice of g = n corresponds to the unit information prior [Kass and Wasserman (1995)], where the relative prior sample size is 1/n. For large n, the TBF is then asymptotically (n → ∞) equivalent to the Bayesian Information Criterion (BIC) [Johnson (2008), page 358]. However, Hu and Johnson [(2009), Section 3.1] report that g ∈ [2n, 6n] has led to favorable predictive properties and favorable operating characteristics in a particular linear model variable selection example. Other proposals in the linear model include the Risk Inflation Criterion (RIC) by Foster and George (1994), which sets g = d_j², and the Benchmark prior by Fernández, Ley and Steel (2001), where g = max{n, d_j²}.

3.2 Estimating g via Empirical Bayes

The empirical Bayes (EB) approach [George and Foster (2000)] avoids arbitrary choices of g which may be at odds with the data. The local EB approach, discussed in Section 3.2.1, retains computational simplicity in comparison to the global EB approach, which we will describe in Section 3.2.3. The local EB approach allows for an analytic comparison of TBFs and DBFs in the linear model, as derived in Section 3.2.2.

3.2.1 Local empirical Bayes. Consider one specific model M_j. If we choose g such that (10) is maximized, we obtain the estimate

(12) ĝ_LEB = max{ z_j/d_j - 1, 0 }.

This is a local EB estimate because the prior parameter g is separately optimized in terms of the marginal likelihood p_approx(z_j | M_j) of each model M_j, j ∈ J [George and Foster (2000)]. Using these values of g, the evidence in favor of the alternative hypothesis H_1 is maximized. This has the disadvantage that the resulting maximum TBFs

(13) mTBF_{j,0} = max{ (z_j/d_j)^{-d_j/2} exp{(z_j - d_j)/2}, 1 },

obtained by plugging (12) into (10), are not consistent if the null model is true [Johnson (2008), page 355], that is, Pr(M_0 | y) does not converge to 1 for n → ∞ if M_0 is true. This is clear from above because (13) will always be larger than 1, instead of converging to 0, which is necessary for consistent accumulation of evidence in favor of the null model.

However, the corresponding shrinkage factors

(14) t̂_LEB = ĝ_LEB/(ĝ_LEB + 1) = max{ 1 - d_j/z_j, 0 }

are exactly the same as proposed by Copas [(1997), page 176] for out-of-sample prediction. He developed this formula specifically for logistic regression by generalizing the formula for linear models. See also van Houwelingen and Le Cessie [(1990), page 1322] for another justification of this widely used shrinkage factor.

There is a close connection between maximum TBFs (13) and minimum Bayes factors, which are used to transform P-values into lower bounds on the corresponding Bayes factor. Just as TBFs, these methods usually consider the value of a test statistic (or the corresponding P-value) as the data [Edwards, Lindman and Savage (1963); Berger and Sellke (1987); Goodman (1999b); Sellke, Bayarri and Berger (2001)]. As already noted by Held (2010), depending on the degrees of freedom d_j, the maximum TBF (13) turns out to be equivalent to certain minimum Bayes factors (see Appendix B for explicit formulas and proofs): For d_j = 1, (13) is equal to the Berger and Sellke (1987) bound for a normal test statistic and a normal prior on its mean. For d_j = 2, (13) is equivalent to the Sellke, Bayarri and Berger (2001) bound. For d_j → ∞, (13) is equal to the Edwards, Lindman and Savage (1963) universal bound for one-sided P-values obtained from normal test statistics.

The maximum TBF also has close connections to the Bayesian Local Information Criterion (BLIC) proposed by Hjort and Claeskens (2003), Section 9.2. The only difference is that in the BLIC the deviance statistic is replaced by the squared Wald statistic for testing β_j = 0. However, the squared Wald statistic shares the same noncentral chi-squared distribution as the deviance statistic in the local asymptotic framework under the alternative model. Hence, the BLIC could be considered as a possibly even more computationally convenient approximation of the TBF in the sense of Lawless and Singhal (1978), who propose to replace the deviance statistic with the squared Wald statistic for model selection purposes. This comes at the price of losing the coherence of the TBF for nested models described in Section 2.2.3.
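In code, the local EB estimate (12), the maximum TBF obtained by plugging it into (10), and the shrinkage factor (14) are one-liners; a sketch with our own function names (the bias correction exp(-∆̃) anticipates the approximate error (17) derived in Section 3.2.2 below):

    ## Local empirical Bayes, cf. (12)-(14).
    g.leb <- function(z, d) pmax(z / d - 1, 0)
    t.leb <- function(z, d) pmax(1 - d / z, 0)              # shrinkage factor (14)
    mtbf  <- function(z, d)                                 # maximum TBF (13)
      ifelse(z > d, (z / d)^(-d / 2) * exp((z - d) / 2), 1)
    delta <- function(z, d, n)                              # approximate error (17)
      pmax((d + 1) / (2 * n) * (z - d) - d * z / (4 * n), 0)
    mtbf.corr <- function(z, d, n) mtbf(z, d) * exp(-delta(z, d, n))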

3.2.2 Comparison with data-based Bayes factors. We now continue the comparison of DBFs and TBFs in the linear model from Section 2.2.2, if the hyperparameter g is estimated with local empirical Bayes. For the DBFs (5), the local EB estimate of g is ĝ = max{F_j - 1, 0}, where F_j is the F-statistic (6); see, for example, Liang et al. (2008), equation (9). Plugging ĝ into (5) gives

(15) mDBF_{j,0} = max{ F_j^{(n-d_j-1)/2} [F_j (1 - R_j²) + R_j²]^{-(n-1)/2}, 1 }
             = max{ ((n - 1) R_j²/d_j)^{-d_j/2} ((1 - R_j²)/(1 - d_j/(n - 1)))^{-(n-d_j-1)/2}, 1 }.

A comparison of (15) with (13) allows us to quantify the accuracy of mTBFs in the Gaussian linear model. First note that 1 - R_j² = exp(-z_j/n), so R_j²/(1 - R_j²) = exp(z_j/n) - 1. Hence, F_j ≤ 1 if z_j ≤ d_j, that is, mDBF_{j,0} = 1 if mTBF_{j,0} = 1, and the error ∆ = log mTBF_{j,0} - log mDBF_{j,0} is nonnegative, if mDBF_{j,0} = 1. For mDBF_{j,0} > 1, the second-order Taylor approximation R_j² ≈ 1 - exp(-z_j/n) ≈ z_j/n {1 - z_j/(2n)} in the first term of (15) gives

(16) log mDBF_{j,0} ≈ -(d_j/2) { log(n - 1) + log(z_j/d_j) + log(1 - z_j/(2n)) - log(n) }
                      + ((n - d_j - 1)/2) ( z_j/n - d_j/(n - 1) )
                    ≈ -(d_j/2) log(z_j/d_j) + d_j z_j/(4n) + ((n - d_j - 1)/n) · (z_j - d_j)/2,

where we have used the first-order approximation log(1 - x) ≈ -x both for x = d_j/(n - 1) and for x = z_j/(2n), and have replaced n - 1 with n where suitable.

Comparing equation (16) with (13) finally reveals that the error ∆ is approximately

(17) ∆̃ = max{ ((d_j + 1)/(2n)) (z_j - d_j) - d_j z_j/(4n), 0 }.

This is an interesting result. First, ∆̃ is positive, so the mTBFs will tend to be larger than the corresponding mDBFs. Second, the error is approximately linear in the deviance z_j and inversely related to the sample size n. However, for fixed R_j² the deviance z_j grows linearly with n, which shows that the error ∆̃ is approximately independent of the sample size. Finally, this formula suggests a simple bias-correction of mTBFs in GLMs by multiplying (13) with exp(-∆̃), which we will apply in Section 4.1. We note that the approximation (17) is fairly accurate as long as z_j/n is not too large, say, z_j/n < 1.

3.2.3 Global empirical Bayes. An alternative EB approach is to maximize the weighted sum of the TBFs with weights equal to the prior model probabilities, that is, to maximize

(18) Σ_{j∈J} TBF_{j,0} Pr(M_j)

with respect to g. The resulting estimate ĝ_GEB parallels the global EB estimate [Liang et al. (2008), Section 2.4] based on DBFs and needs to be computed by numerical optimization of (18). It was investigated by George and Foster (2000) for the Gaussian linear model. Calculating ĝ_GEB is more costly than calculating the model-specific ĝ_LEB, and is even infeasible when |J| is very large. In this case one could first perform a stochastic model search and then restrict the sum in (18) to the set Ĵ of models visited. The stochastic model search could be based on the local EB estimates, say, and the resulting posterior model probabilities are then "corrected" using the global EB estimate.

3.3 Full Bayes Estimation of g

EB approaches ignore the uncertainty of the estimates ĝ_LEB and ĝ_GEB, respectively. As an alternative, we will now discuss fully Bayesian estimation of g using a continuous hyperprior for g. Thus, we obtain continuous mixtures of generalized g-priors, which we call generalized hyper-g priors [Sabanés Bové and Held (2011a)]. Mixtures of g-priors for model selection in the linear model were studied by Liang et al. (2008).

3.3.1 Priors for g. In order to retain a closed form for the marginal likelihood of the model M_j, the prior for g must be conjugate to the (approximate) "likelihood"

    p_approx(z_j | g, M_j) ∝ (g + 1)^{-d_j/2} exp{ -(z_j/2)/(g + 1) },

obtained from (9). From this we see that an inverse-gamma prior IG(a, b) on g + 1, truncated appropriately to the interval (1, ∞), is conjugate [Cui and George (2008), page 891]. The corresponding prior density function on g is

(19) p(g) = M(a, b) (g + 1)^{-(a+1)} exp{ -b/(g + 1) },

where M(a, b) = b^a { ∫_0^b u^{a-1} exp(-u) du }^{-1} is the normalizing constant. We denote this incomplete inverse-gamma distribution as g ~ IncIG(a, b). The model-specific posterior density then is

(20) g | z_j, M_j ~ IncIG(a + d_j/2, b + z_j/2).

Hence, the marginal likelihood of model M_j is

    p(z_j | M_j) = p_approx(z_j | g, M_j) p(g) / p(g | z_j, M_j)
                 = M(a, b) z_j^{d_j/2 - 1} / { M(a + d_j/2, b + z_j/2) 2^{d_j/2} Γ(d_j/2) },

and dividing this by p_approx(z_j | M_0) finally yields

    TBF_{j,0} = M(a, b) / M(a + d_j/2, b + z_j/2) · exp(z_j/2).

A useful analytic consequence of (20) is that the posterior mode of the shrinkage factor t is

(21) Mod(t | z_j, M_j) = max{ 1 - (a + d_j/2 - 1)/(b + z_j/2), 0 }.

If the prior for g is not conjugate, the required integration of (9), p(z_j | M_j) = ∫ p_approx(z_j | g, M_j) p(g) dg, can be performed by one-dimensional numerical integration. Two examples of nonconjugate hyperpriors on g which are used in the Gaussian linear model are the Zellner and Siow (1980) prior, where g ~ IG(1/2, n/2), and the hyper-g/n prior proposed by Liang et al. (2008):

(22) (g/n) / (g/n + 1) ~ U(0, 1).

Both priors give considerable probability mass to g values proportional to n: the mode of the Zellner-Siow prior is n/3, and the median of the hyper-g/n prior is n.
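All quantities of the conjugate analysis (19)-(21) are available in closed form through the lower incomplete gamma function, which base R exposes via pgamma(). The following sketch (our function names, computations on the log scale for numerical stability) is one way to implement them:

    ## Incomplete inverse-gamma prior IncIG(a, b) for g, cf. (19)-(21).
    logM <- function(a, b) {                 # log normalizing constant M(a, b)
      if (b == 0) return(log(a))             # limit used for the hyper-g prior
      a * log(b) - lgamma(a) - pgamma(b, shape = a, log.p = TRUE)
    }
    tbf.incig <- function(z, d, a, b)        # TBF_{j,0} under the IncIG(a, b) prior
      exp(logM(a, b) - logM(a + d / 2, b + z / 2) + z / 2)
    mode.t <- function(z, d, a, b)           # posterior mode (21) of t = g/(g+1)
      pmax(1 - (a + d / 2 - 1) / (b + z / 2), 0)
    ## Hyper-g prior: a = 1, b = 0.  ZS adapted prior: a = 1/2, b = (n + 3)/2.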

3.3.2 Choice of hyperparameters. The next question is then how to choose the hyperparameters a, b of the conjugate prior (19). Cui and George (2008) recommend a = 1 and b = 0, which leads to

(23) t = g/(g + 1) ~ U(0, 1),

a uniform prior on the shrinkage factor t. This is the hyper-g prior by Liang et al. (2008), a proper prior with normalizing constant defined as the limit lim_{b→0} M(a, b) = a. The model-specific posterior mode (21) of t now equals the local EB estimate t̂_LEB in (14), as it should, since we have used the uniform prior (23) on t. Moreover, the marginal posterior mode of t, taking into account all models, will equal the global EB estimate t̂_GEB = ĝ_GEB/(ĝ_GEB + 1). This indicates that using a hyper-g prior will lead to similar results as the EB methods. Alternatively, matching the mode n/3 of the Zellner-Siow (ZS) prior g ~ IG(1/2, n/2) suggests to use g ~ IncIG(a = 1/2, b = (n + 3)/2). We call this the ZS adapted prior. The posterior mode of t is now Mod(t | z_j, M_j) = 1 - (d_j - 1)/(z_j + n + 3), which is always larger than t̂_LEB in (14) and thus leads to weaker shrinkage of the regression coefficients.

The ZS prior and our adaptation depend on the sample size n, which leads to consistent model selection, even if the null model is true. Indeed, Johnson (2008) shows that for g = O(n) the TBF is consistent, because then the covariance matrix of the generalized g-prior (7) is O(1) and prevents the alternative model from collapsing with the null model. Here we have prior mode n/3, which fulfils this condition. By contrast, the hyper-g prior (23) has its median at 1, which clearly does not fulfil the condition. Moreover, the model-specific posterior mode under the hyper-g prior equals the local EB estimate, which is inconsistent if the null model is true; see Section 3.2. The hyper-g/n prior (22) corrects this by scaling the prior to have median n. However, these priors lead to weaker shrinkage than the local EB approach or the hyper-g prior. Stronger shrinkage as in the empirical Bayes approaches is in general advantageous for prediction [Copas (1983, 1997)].

3.3.3 Posterior parameter estimation. For a given GLM M_j with deviance statistic z_j, we would like to estimate the posterior distribution of its parameters θ_j = (α, β_j^⊤)^⊤. We do this by sampling from an approximation of the posterior distribution

    p(θ_j | y, M_j) = ∫ p(θ_j | g, y, M_j) p(g | y, M_j) dg,

where we replace the data-based posterior p(g | y, M_j) with the test-based posterior p(g | z_j, M_j) to retain computational simplicity.

If a conjugate incomplete inverse-gamma prior distribution is specified for g, we first need to sample from its model-specific (test-based) posterior (20). Sampling from an IncIG(a, b) distribution (19) is easy using inverse sampling via its quantile function

    F^{-1}_{IncIG(a,b)}(x) = b / F^{-1}_{IG(a,1)}{ (1 - x) F_{IG(a,1)}(b) } - 1   for b > 0,
    F^{-1}_{IncIG(a,b)}(x) = (1 - x)^{-1/a} - 1                                   for b = 0,

which is given in terms of the quantile and cumulative distribution functions of the IG(a, 1) distribution. If a nonconjugate prior is specified for g, then numerical methods can be used to sample from p(g | z_j, M_j). Specifically, we approximate the log posterior density using a linear interpolation, which is a by-product of the numerical integration to obtain the marginal likelihood of the model M_j.

In the second step, we sample the actual model parameters θ_j from their approximate posterior (11) given the sample for g. We use the observed Fisher information matrix, invert the corresponding sub-matrices for α̂ and β̂_j, and scale the latter one with t = g/(g + 1). The MLE β̂_j is also multiplied with t to obtain the appropriate mean of the conditional Gaussian distribution (11).
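The two-step sampler can be sketched in base R. Because g + 1 ~ IG(a, b) truncated to (1, ∞) is equivalent to b/(g + 1) following a G(a, 1) distribution truncated to (0, b), the IncIG quantile function can be written with qgamma() and pgamma(); the helper names are ours, and fit is assumed to be a glm() fit of M_j with centered covariates:

    ## Step 1: inverse-CDF sampling from the IncIG posterior (20) of g.
    rincig <- function(m, a, b) {
      u <- runif(m)
      if (b == 0) return((1 - u)^(-1 / a) - 1)            # closed form for b = 0
      b / qgamma((1 - u) * pgamma(b, a), a) - 1
    }
    ## Step 2: draw beta_j from the approximate posterior (11) given g.
    sample.beta <- function(fit, m, a, b) {
      z <- fit$null.deviance - fit$deviance               # deviance statistic z_j
      d <- fit$df.null - fit$df.residual                  # degrees of freedom d_j
      g <- rincig(m, a + d / 2, b + z / 2)
      tfac <- g / (g + 1)
      bhat <- coef(fit)[-1]                               # MLE of beta_j
      R <- chol(vcov(fit)[-1, -1, drop = FALSE])
      t(sapply(tfac, function(s)                          # mean s*bhat, variance s*V
        s * bhat + sqrt(s) * drop(rnorm(d) %*% R)))
    }

Here vcov(fit)[-1, -1] stands in for the inverted sub-matrix of the observed Fisher information; with centered covariates the two agree approximately because of the block-diagonal structure discussed in Section 2.1.2.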

4. APPLICATION

We consider data on 30-day survival from the GUSTO-I trial as introduced in Section 1 and use the TBF methodology as implemented in the R-package "glmBfp" available from R-Forge.[1]

[1] To install the R-package, just type install.packages("glmBfp", repos = "http://r-forge.r-project.org") into R.

4.1 Variable Selection

As there are 17 explanatory variables in this data set, there are |J| = 2^17 = 131,072 different models to be considered for variable selection. This is still a manageable size and we can evaluate all models easily with TBFs (relative to the null model) within a few minutes. In the absence of subjective prior information on the importance of covariates, we use prior inclusion probabilities of 1/2 for each covariate and a marginal uniform prior on d_j. This is a commonly used objective prior assumption [Geisser (1984); Scott and Berger (2010)].

We consider 4 approaches to estimate g: local EB, the hyper-g prior, the hyper-g/n prior, and the ZS adapted prior. Numerical computation of the corresponding DBFs [Sabanés Bové and Held (2011a)] is, depending on the method to estimate g, between 11 (local EB) and 50 (ZS adapted prior) times slower and requires explicit specification of the g-prior (7), including the constant c = v{h(α)} h'(α)^{-2}. As α is unknown, we fix it at the MLE α̂ obtained from the null model. We will use this example to quantify the accuracy of the approximation of DBFs by TBFs.

Fig. 1. Comparing test-based (TBF) and data-based (DBF) log Bayes factors. The Bayes factors are shown in four different colors, depending on whether or not the explanatory variables x2 (Age) and x3 (Killip class) are included in the corresponding models. (a) Local EB. (b) Hyper-g. (c) Hyper-g/n. (d) ZS adapted.

In Figure 1, we plot the error log TBF - log DBF against log DBF using the 4 different methods to estimate g. To reduce the size of the figures, we only show a random sample of 10,000 Bayes factors. We note that the log DBFs vary between 0 and 106.7 (for local EB, where the log Bayes factors cannot be negative), -0.7 and 103.5 (hyper-g), -6.8 and 102.9 (hyper-g/n), and -14.1 and 97.3 (under the ZS adapted prior). On average, the log TBFs tend to be slightly larger than the log DBFs, with mean difference between 0.28 (hyper-g) and 0.37 (ZS adapted). The standard deviations of the errors vary between 0.47 (hyper-g/n) and 0.70 (hyper-g). All Bayes factors for all four methods had absolute error less than 2, apart from 12 TBFs calculated with the EB approach, where the log DBF was zero, but the log TBF was larger than zero.

Closer inspection of Figure 1 reveals that the error under the hyper-g prior has a pattern similar to that under the local EB approach. For log DBFs larger than 50, the error of the TBFs tends to increase with increasing DBFs, a feature that is visible in all 4 figures and to be expected from the approximate error (17) in the linear model. Note that there is strong clustering visible for all four approaches depending on whether or not the two most important explanatory variables, x2 (Age) and x3 (Killip class), are included. The corresponding four groups are given in different colors in Figure 1. If both are included, the log DBFs are large and the error of the TBFs is nearly always positive, a feature that is present in all four approaches. Likewise, if the two variables are not included, the Bayes factors are small and the absolute error is close to zero. If one of the two is included, then the size and direction of the error depends on the approach used. Clustering is particularly pronounced for the ZS adapted prior, where, somewhat surprisingly, the error of the log TBFs with x2 excluded and x3 included is around 1, whereas the error of the log TBFs with x2 included and x3 excluded is negative, although the corresponding DBFs tend to be larger. Thus, in this case the error does not seem to increase in a monotone fashion with the DBFs.

Following the good agreement of TBFs and DBFs, the corresponding posterior variable inclusion probabilities are also very similar; see Figure 2. The two neighboring bars have almost the same height for all covariates and in all settings. The only exception is the variable Weight (x10), where the difference is between 5 and 6 percentage points. However, there are substantial differences in the inclusion probabilities obtained with the different methods to estimate g.
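The exhaustive evaluation behind these inclusion probabilities amounts to one glm() fit and one TBF per model. A self-contained sketch with simulated data and p = 5 covariates (g is fixed at n purely for brevity; local EB or the hyper-priors of Section 3.3 would be used instead):

    ## TBF-based posterior model and inclusion probabilities, cf. (1) and (10).
    set.seed(1)
    n <- 500; p <- 5
    X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
    y <- rbinom(n, 1, plogis(-1 + X %*% c(1, 0.5, 0, 0, 0)))
    dat <- data.frame(y, X)
    dev0 <- glm(y ~ 1, family = binomial, data = dat)$deviance
    subsets <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), p)))  # all 2^p models
    tbfs <- apply(subsets, 1, function(s) {
      if (!any(s)) return(1)                              # null model: TBF = 1
      fit <- glm(reformulate(colnames(X)[s], "y"), family = binomial, data = dat)
      z <- dev0 - fit$deviance
      (n + 1)^(-sum(s) / 2) * exp((z / 2) * n / (n + 1))  # TBF (10) with g = n
    })
    postprob <- tbfs / sum(tbfs)                          # (1), uniform model prior
    inclprob <- colSums(subsets * postprob)               # inclusion probabilities

Scaling this loop to the 2^17 models of the GUSTO-I application is what the glmBfp package automates.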

Fig. 2. Inclusion probabilities for all approaches, comparing the data-based (left bars) and the test-based (right bars) approach. The covariates are ordered with respect to the results from the data-based approach under the hyper-g/n prior. (a) Local EB. (b) Hyper-g. (c) Hyper-g/n. (d) ZS adapted.

Fig. 3. Comparison of priors (dashed lines) and posteriors (solid lines) of g under the conjugate incomplete inverse-gamma prior with hyper-g (left) and ZS adapted (right) hyperparameter choices. (a) Hyper-g prior and posterior, together with local EB (boxplot of the values at the bottom of the plot) and global EB (vertical line) estimates of g. (b) ZS adapted prior and posterior, together with g = n (vertical line).
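In outline, the curves in Figure 3 mix the model-specific IncIG posteriors (20), weighted by the posterior model probabilities, via the identity quoted in the text below. A sketch (our names; z, d and postprob are assumed to be vectors over all models, e.g., from the enumeration sketch above):

    ## Model-averaged posterior density of g, cf. Figure 3:
    ## p(g | z) = sum_j p(g | z_j, M_j) Pr(M_j | z_j).
    logM <- function(a, b) {                   # normalizing constant of (19)
      if (b == 0) return(log(a))
      a * log(b) - lgamma(a) - pgamma(b, shape = a, log.p = TRUE)
    }
    dincig <- function(g, a, b)                # IncIG(a, b) density, cf. (19)
      exp(mapply(logM, a, b) - (a + 1) * log(g + 1) - b / (g + 1))
    post.g <- function(ggrid, z, d, a, b, postprob)
      sapply(ggrid, function(g) sum(postprob * dincig(g, a + d / 2, b + z / 2)))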

As in the linear model [Liang et al. (2008)], the ZS adapted prior, favoring large values of g, leads to more parsimonious models than the other three approaches. For example, the local EB median probability model (MPM) under the TBF approach includes the eight variables x1, x2, x3, x5, x6, x8, x10, x16. Exactly the same model is selected under the hyper-g and the hyper-g/n prior, whereas the MPM model under the ZS adapted prior drops the variables x1 and x10 and includes only the remaining six variables.

In Figure 3, the posterior distributions of g are compared with the underlying conjugate prior distributions (ZS adapted and hyper-g) and local as well as global EB estimates of g. The posterior distributions are based on all models and computed using the identity

    p(g | z) = Σ_{j∈J} p(g | z_j, M_j) Pr(M_j | z_j).

We clearly see the difference between the two priors resulting from the different hyperparameter choices. The fixed choices g = n (BIC) and g = 2n are not supported by the data, as all estimates are far below these values. The local EB estimates of g tend to be small, with the posterior mode of g under the hyper-g prior and the global EB estimate having similar values. The posterior mode of g under the ZS adapted prior is larger than the other estimates but still much smaller than the fixed choices.

4.2 Shrinkage of Coefficients

We now consider the MPM model identified in the previous section with either the local EB, hyper-g, or hyper-g/n approach, which includes the eight variables x1, x2, x3, x5, x6, x8, x10, and x16. Integrated nested Laplace approximations [Rue, Martino and Chopin (2009)] have been used to fit Bayesian logistic regression models under the generalized g-prior for various values of g with the R-INLA package (www.r-inla.org). The constant c in (7) has been fixed based on the estimate α̂ of α in the null model. Empirical shrinkage is defined as the ratio of the resulting posterior mean estimates of the regression coefficients over the corresponding MLEs. Empirical shrinkage can also be computed based on the ratio of the resulting posterior variances over the corresponding variances of the MLEs; compare equation (11).

Figure 4 shows that there is a good agreement between empirical and theoretical shrinkage g/(g + 1) for most regression coefficients, which supports the validity of the approximation (11). The agreement is not so good for x2 (Age) and the factor variable x3 (Killip class), perhaps because the strong degree of discrimination of these important predictors may affect the validity of the approximation I(α, β_j) ≈ I(α, 0) from Section 3.1.

4.3 Bootstrap Cross-Validation

To quantify and compare the predictive performance of the TBF methods, we have performed a bootstrap cross-validation study. To reduce computation time, we have used the best 8000 models based on a stochastic model search, as described in Sabanés Bové and Held (2011b), with 30,000 iterations, instead of exhaustive evaluation of all models.

Fig. 4. Shrinkage of posterior means and variances of regression coefficients under the generalized g-prior for various values of g. The posterior distribution has been calculated with the R-INLA software and the empirical shrinkage is plotted against the theoretical shrinkage g/(g + 1).
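For a single covariate, the comparison in Figure 4 can be mimicked without INLA by brute-force integration of the exact posterior under the generalized g-prior (7). A rough sketch for binary logistic regression (our own grid settings; W = I and c = 1/{µ(1 - µ)} evaluated at the null-model MLE α̂):

    ## Empirical shrinkage E(beta | y)/betahat under the generalized g-prior (7).
    set.seed(1)
    n <- 500
    x <- rnorm(n); x <- x - mean(x)                    # centered covariate
    y <- rbinom(n, 1, plogis(-1 + 0.6 * x))
    mu0 <- plogis(coef(glm(y ~ 1, family = binomial)))
    cc <- 1 / (mu0 * (1 - mu0))                        # c = v{h(alpha)} h'(alpha)^{-2}
    emp.shrinkage <- function(g) {
      sd.b <- sqrt(g * cc / sum(x^2))                  # prior sd of beta from (7)
      grid <- expand.grid(a = seq(-2, 0, length = 81), b = seq(-0.5, 1.5, length = 81))
      lp <- apply(grid, 1, function(p)                 # log lik + log prior (flat in a)
        sum(dbinom(y, 1, plogis(p[1] + p[2] * x), log = TRUE))) +
        dnorm(grid$b, 0, sd.b, log = TRUE)
      w <- exp(lp - max(lp)); w <- w / sum(w)
      sum(w * grid$b) / coef(glm(y ~ x, family = binomial))[2]
    }
    emp.shrinkage(1)                                   # compare with t = g/(g+1) = 0.5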

We have used the area under the ROC curve (AUC, measures discrimination), the calibration slope (CS) [Cox (1958), measures calibration], and the logarithmic score (LS) (measures both discrimination and calibration) to quantify the predictive performance. See Gneiting and Raftery (2007) for a theoretical and Steyerberg (2009) for a more practical review of methods to validate and compare probabilistic predictions. Both AUC and CS are 1 for perfect discrimination and calibration, respectively. In practical applications they will be typically smaller than 1. The LS is defined as

    LS = -(1/m) Σ_{i=1}^m log{ π̂_i^{y_i} (1 - π̂_i)^{1-y_i} },

where π̂_i is the predicted probability of death (y_i = 1) for the i-th patient in the validation sample, i = 1, ..., m. The LS is negatively oriented, that is, the smaller, the better.

The apparent performance of the methods, using the original sample both for fitting and predicting, is well known to be of little value for estimating the predictive performance for new data. Therefore, we compute an estimate of the out-of-sample performance using bootstrap cross-validation. For each of 1000 bootstrap samples, we fit the methods and evaluate the above criteria based on the data not included in the bootstrap sample. We compare our methods with a more traditional AIC- or BIC-based approach for (Bayesian) model selection and averaging based on posterior model probabilities proportional to exp(-AIC_j/2) and exp(-BIC_j/2), respectively [see Claeskens and Hjort (2008)], and to the Hu and Johnson (2009) choice g = 2n. In addition, we apply a recently developed method for variable selection in generalized additive models to our setting [Marra and Wood (2011), Section 2.1]. The method gives component-wise shrinkage of the covariate effects included, similar to a Bayesian model average (BMA). Finally, simple backward selection with AIC or BIC has been included, as well as just fitting the full model.
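One out-of-bag replicate of this scheme, for a single fixed logistic model, can be sketched in a few lines of base R (the helper name and the assumption that dat contains the binary response y are ours; the study in the paper instead refits every selection method on each of the 1000 bootstrap samples):

    ## One bootstrap cross-validation replicate: AUC, CS and LS out of bag.
    cv.once <- function(form, dat) {
      idx <- sample(nrow(dat), replace = TRUE)
      test <- dat[-unique(idx), ]                  # data not in the bootstrap sample
      fit <- glm(form, family = binomial, data = dat[idx, ])
      eta <- predict(fit, newdata = test)          # linear predictor
      pihat <- plogis(eta)
      y <- test$y
      auc <- mean(outer(pihat[y == 1], pihat[y == 0], ">"))   # Pr(pi case > pi control)
      cs <- unname(coef(glm(y ~ eta, family = binomial))[2])  # calibration slope
      ls <- -mean(y * log(pihat) + (1 - y) * log(1 - pihat))  # logarithmic score
      c(AUC = auc, CS = cs, LS = ls)
    }

In practice one would guard against fitted probabilities of exactly 0 or 1 before taking logs.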

Table 2. GUSTO-I data: comparison of the predictive performance of variable selection using bootstrap cross-validation of AUC, calibration slope (CS) and logarithmic score (LS)

                     AUC     CS      LS
Local EB      MAP   0.8313  0.8643  0.1874
              MPM   0.8322  0.8616  0.1870
              BMA   0.8344  0.8864  0.1860
Hyper-g       MAP   0.8314  0.8141  0.1880
              MPM   0.8322  0.8196  0.1876
              BMA   0.8343  0.8406  0.1865
Hyper-g/n     MAP   0.8310  0.8558  0.1877
              MPM   0.8320  0.8547  0.1872
              BMA   0.8345  0.8818  0.1860
ZS adapted    MAP   0.8296  0.8396  0.1887
              MPM   0.8300  0.8398  0.1885
              BMA   0.8343  0.8662  0.1866
AIC           MAP   0.8316  0.8208  0.1886
              MPM   0.8318  0.8271  0.1884
              BMA   0.8339  0.8492  0.1873
BIC           MAP   0.8259  0.8415  0.1908
              MPM   0.8261  0.8424  0.1907
              BMA   0.8313  0.8837  0.1884
Fixed g = 2n  MAP   0.8250  0.8418  0.1906
              MPM   0.8251  0.8426  0.1905
              BMA   0.8308  0.8766  0.1881
GLM full            0.8314  0.8108  0.1888
GLM select          0.8330  0.8787  0.1871
Step AIC            0.8314  0.8205  0.1887
Step BIC            0.8285  0.8426  0.1898

The average criteria are shown in Table 2. Considering first the logarithmic score as our overall criterion, we see that, for any of the four methods to estimate g based on TBFs, BMA is better than MPM, and MPM is better than MAP, and this is also true for AUC. This is not surprising, given the theoretical advantage of BMA over single models concerning prediction. The empirical superiority of MPM over MAP indicates that the theoretical superiority of the MPM approach in the linear model may extend to GLMs. We note that the BMA is also superior in terms of calibration, whereas there is no clear preference for either MAP or MPM in terms of CS. Overall, the local EB approach performs best, closely followed by hyper-g/n. We would have expected more similarities between local EB and hyper-g, which is substantially worse, in particular, in terms of calibration. The ZS adapted approach is better than hyper-g in terms of calibration, but slightly worse in terms of discrimination and LS.

Considering the alternatives to the TBF approach, AIC-weighted model selection has a similar performance to hyper-g and ZS adapted, but is not as good as local EB or hyper-g/n. BIC-weighted model selection and fixing g at 2n perform substantially worse, and so do the two stepwise procedures. Simply using the full model gives reasonable discrimination, but very poor calibration, and so the LS is very poor. Among the alternative methods, the variable selection according to Marra and Wood (2011) ("GLM select") performs best. Its additional flexibility from separate shrinkage of the coefficients leads to a similar performance as our (global shrinkage) MPM model with either local EB or hyper-g/n. However, it is not as good as the BMAs (which also have implicit coefficient-wise shrinkage) with any of our four approaches.

5. DISCUSSION

In this paper we considered test-based Bayes factors derived from the deviance statistic for generalized linear models, emphasizing that the implicitly used prior on the regression coefficients is a generalized g-prior. As with the data-based Bayes factors, estimation of g is possible and recommended. Local EB estimation of g leads to posterior means of the regression coefficients that correspond to shrinkage estimates from the literature. Alternatively, full Bayes estimation of g is possible and leads to generalized hyper-g priors.

In an empirical comparison, the TBFs have been shown to be in good agreement with the corresponding DBFs. We developed a bias-correction in the linear model under empirical Bayes which has further reduced the error. It will be interesting to develop similar corrections for the fully Bayesian approaches. Another important area of theoretical research would be to investigate the conditions for optimality of the MPM model in GLMs.

TBFs are applicable in a wider context. In particular, the proposed methodology can be used for function selection [Sabanés Bové and Held (2011b)] and can be extended to the Cox proportional hazards model, which we will report elsewhere. Also, regression models for multicategorical data such as the proportional odds model or the multinomial logistic regression model return a deviance, so the TBF approach will be applicable in these settings. The same is true for CART models [Gravestock (2014)] and mixed models with fixed (known) random effects variances, where a (marginal) deviance is also available. This is important in our context, as it would allow us to combine the spline-based Bayesian model and function selection [Sabanés Bové, Held and Kauermann (2014)] with TBFs. However, more research on the asymptotic distribution of the deviance is needed for the application of TBFs to mixed models with unknown variance components.

APPENDIX A: PROOFS FOR SECTION 2.2.1

In Section 2.2.1 we claimed that the distribution of the deviance converges for n → ∞ to a noncentral chi-squared distribution with d_j degrees of freedom, where λ_j = β_j^⊤ I_{β_j,β_j} β_j is the noncentrality parameter. This is essentially proven by Davidson and Lever (1970), and we briefly show how their Theorem 1 applies here. In their notation the model is parametrized by θ = (θ_1^⊤, θ_2)^⊤ with θ_1 = β_j being the parameter of interest and θ_2 = α being the nuisance parameter. We test the null hypothesis H_0: θ = θ_0 = (0^⊤, θ_2)^⊤. We consider a sequence of local alternatives θ^n = (θ_1^n, θ_2)^⊤ with components θ_{1k}^n = δ_k/√n of θ_1^n, where δ_k ≠ 0, k = 1, ..., d_j. It follows that θ^n → θ_0 for n → ∞. Then Theorem 1 of Davidson and Lever (1970) states that for n → ∞ the deviance converges in distribution to a noncentral chi-squared distribution with d_j degrees of freedom and noncentrality parameter δ^⊤ C_{11}(θ_0) δ, where δ = (δ_1, ..., δ_{d_j})^⊤. Here C_{11}(θ_0) is the inverse of the submatrix corresponding to θ_1 of the inverse expected Fisher information from one observation, evaluated at θ = θ_0. But we know that the expected Fisher information is block-diagonal for θ = θ_0, so C_{11}(θ_0) is just the submatrix of the expected Fisher information from one observation. Moreover, for n observations we have I_{β_j,β_j} = n · C_{11}(θ_0), and combined with δ = √n β_j, we obtain the noncentrality parameter λ_j = β_j^⊤ I_{β_j,β_j} β_j.

In order to derive the prior distribution for λ_j based on the generalized g-prior (8) for β_j as stated in Section 2.2.2, first note that the generalized g-prior corresponds to

    β̃_j = (I_{β_j,β_j}^{1/2}/√g) β_j ~ N_{d_j}(0, I_{d_j}),

where I_{β_j,β_j}^{1/2} is the upper-triangular Cholesky root of I_{β_j,β_j}. Hence, β̃_j^⊤ β̃_j ~ χ²(d_j), which is a G(d_j/2, 1/2) distribution. Expanding the quadratic form, we obtain

    β̃_j^⊤ β̃_j = (1/√g) β_j^⊤ (I_{β_j,β_j}^{1/2})^⊤ I_{β_j,β_j}^{1/2} β_j (1/√g) = (1/g) β_j^⊤ I_{β_j,β_j} β_j = λ_j/g

and, finally, λ_j = g · λ_j/g ~ G(d_j/2, 1/(2g)).

APPENDIX B: PROOFS FOR SECTION 3.2

For ease of notation we drop the index j of the alternative model and simply denote the deviance with z, the associated degrees of freedom with d, while mTBF denotes the corresponding maximum TBF with respect to the null model.

For the bounds mentioned in Section 3.2, usually the minimum Bayes factor in favor of the null hypothesis is considered, which is mTBF^{-1} in our notation. Let the P-value be p = 1 - F_{χ²(d)}(z), where F_{χ²(d)} is the cumulative distribution function of the chi-squared distribution with d degrees of freedom. The proofs are adapted from Malaguerra (2012):

1. Let d = 1 and z > d = 1. Let q = Φ^{-1}(1 - p/2) be the corresponding quantile of the standard normal distribution with cumulative distribution function Φ. We have q² = z, since a squared standard normal random variable is χ²(1)-distributed, and, hence, mTBF^{-1} = z^{1/2} exp(-z/2) exp(1/2) = q exp(-q²/2) √e, which is the required result from Berger and Sellke (1987).

2. Let d = 2 and z > d = 2. Due to F_{χ²(2)}(z) = 1 - exp(-z/2), we have p = exp(-z/2) or z = -2 log(p), such that z > 2 is equivalent to p < 1/e. Moreover, mTBF^{-1} = (2/z)^{-1} exp{-(z - 2)/2} = -e p log(p), which is the required result from Sellke, Bayarri and Berger (2001).

3. The universal bound from Edwards, Lindman and Savage (1963) that we want to reach is exp(-q²/2), here with q = Φ^{-1}(1 - p). We have to show that for d → ∞ and fixed P-value, the ratio of mTBF^{-1} and this universal bound is 1. With d → ∞ we have (z - d)/√(2d) ~ N(0, 1) and, hence, z ≈ d + √(2d) q. Plugging this in (13), we obtain

    mTBF^{-1}/exp(-q²/2) ≈ ( d/(√(2d) q + d) )^{-d/2} exp( -√(d/2) q + q²/2 )
                         = exp{ -a q + a² log(1 + q/a) + q²/2 }

with a = √(d/2). Now for large d the term q/a is small and, hence, we can apply a second-order Taylor expansion of log(1 + x) around x = 0, giving log(1 + x) ≈ x - x²/2, and we obtain

    mTBF^{-1}/exp(-q²/2) ≈ exp{ -a q + a² (q/a - q²/(2a²)) + q²/2 } = exp(0) = 1,

which proves the statement.
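The first two equivalences hold exactly and are easy to check numerically; a quick verification in R (our code):

    ## Numerical check of the minimum Bayes factor equivalences.
    minbf <- function(z, d) (z / d)^(d / 2) * exp(-(z - d) / 2)  # 1/mTBF for z > d
    z <- 7
    q <- qnorm(1 - (1 - pchisq(z, df = 1)) / 2)                  # d = 1: q^2 = z
    c(minbf(z, 1), q * exp(-q^2 / 2) * sqrt(exp(1)))             # Berger and Sellke
    p <- exp(-z / 2)                                             # d = 2: p = exp(-z/2)
    c(minbf(z, 2), -exp(1) * p * log(p))                         # Sellke et al. (2001)

Both pairs print identical values.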
−     · ∼ = exp(0) = 1, APPENDIX B: PROOFS FOR SECTION 3.2 which proves the statement. For ease of notation we drop the index j of the ACKNOWLEDGMENTS alternative model and simply denote the deviance with z, the associated degrees of freedom with d, We thank Kerry L. Lee and Ewout W. Steyerberg while TBF denotes the corresponding TBF with re- for permission to use the GUSTO-I data set. We spect to the null model. are also grateful to Rafael Sauter for help with the 16 L. HELD, D. SABANES´ BOVE´ AND I. GRAVESTOCK

ACKNOWLEDGMENTS

We thank Kerry L. Lee and Ewout W. Steyerberg for permission to use the GUSTO-I data set. We are also grateful to Rafael Sauter for help with the R-INLA software in Section 4.2 and to Manuela Ott for proofreading the final manuscript. We finally acknowledge helpful comments by two referees on an earlier version of this article.

REFERENCES

Barbieri, M. M. and Berger, J. O. (2004). Optimal predictive model selection. Ann. Statist. 32 870-897.
Bayarri, M. J., Berger, J. O., Forte, A. and García-Donato, G. (2012). Criteria for Bayesian model choice with application to variable selection. Ann. Statist. 40 1550-1577.
Berger, J. O. and Pericchi, L. R. (2001). Objective Bayesian methods for model selection: Introduction and comparison. In Model Selection (P. Lahiri, ed.). Institute of Mathematical Statistics Lecture Notes—Monograph Series 38 135-207. IMS, Beachwood, OH.
Berger, J. O. and Sellke, T. (1987). Testing a point null hypothesis: Irreconcilability of P-values and evidence. J. Amer. Statist. Assoc. 82 112-139.
Claeskens, G. and Hjort, N. L. (2008). Model Selection and Model Averaging. Cambridge Univ. Press, Cambridge.
Copas, J. B. (1983). Regression, prediction and shrinkage. J. R. Stat. Soc. Ser. B. Stat. Methodol. 45 311-354.
Copas, J. B. (1997). Using regression models for prediction: Shrinkage and regression to the mean. Stat. Methods Med. Res. 6 167-183.
Cox, D. R. (1958). Two further applications of a model for binary regression. Biometrika 45 562-565.
Cui, W. and George, E. I. (2008). Empirical Bayes vs. fully Bayes variable selection. J. Statist. Plann. Inference 138 888-900.
Davidson, R. R. and Lever, W. E. (1970). The limiting distribution of the likelihood ratio statistic under a class of local alternatives. Sankhyā Ser. A 32 209-224.
Edwards, W., Lindman, H. and Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review 70 193-242.
Fernández, C., Ley, E. and Steel, M. F. J. (2001). Benchmark priors for Bayesian model averaging. J. Econometrics 100 381-427.
Foster, D. P. and George, E. I. (1994). The risk inflation criterion for multiple regression. Ann. Statist. 22 1947-1975.
Geisser, S. (1984). On prior distributions for binary trials. Amer. Statist. 38 244-251.
George, E. I. and Foster, D. P. (2000). Calibration and empirical Bayes variable selection. Biometrika 87 731-747.
Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. J. Amer. Statist. Assoc. 102 359-378.
Goodman, S. N. (1999a). Toward evidence-based medical statistics. 1: The P-value fallacy. Annals of Internal Medicine 130 995-1004.
Goodman, S. N. (1999b). Toward evidence-based medical statistics. 2: The Bayes factor. Annals of Internal Medicine 130 1005-1013.
Gravestock, I. (2014). Bayesian tree models: Priors and posterior approximations. Master's thesis, Univ. Zurich.
Held, L. (2010). A nomogram for P-values. BMC Medical Research Methodology 10 21.
Hjort, N. L. and Claeskens, G. (2003). Frequentist model average estimators. J. Amer. Statist. Assoc. 98 879-899.
Hu, J. and Johnson, V. E. (2009). Bayesian model selection using test statistics. J. R. Stat. Soc. Ser. B. Stat. Methodol. 71 143-158.
Johnson, V. E. (2005). Bayes factors based on test statistics. J. R. Stat. Soc. Ser. B. Stat. Methodol. 67 689-701.
Johnson, V. E. (2008). Properties of Bayes factors based on test statistics. Scand. J. Stat. 35 354-368.
Kass, R. E. and Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J. Amer. Statist. Assoc. 90 928-934.
Lawless, J. F. and Singhal, K. (1978). Efficient screening of nonnormal regression models. Biometrics 34 318-327.
Lee, K. L., Woodlief, L. H., Topol, E. J., Weaver, W. D., Betriu, A., Col, J., Simoons, M., Aylward, P., Van de Werf, F. and Califf, R. M. (1995). Predictors of 30-day mortality in the era of reperfusion for acute myocardial infarction: Results from an international trial of 41,021 patients. Circulation 91 1659-1668.
Liang, F., Paulo, R., Molina, G., Clyde, M. A. and Berger, J. O. (2008). Mixtures of g priors for Bayesian variable selection. J. Amer. Statist. Assoc. 103 410-423.
Lindley, D. V. (1957). A statistical paradox. Biometrika 44 187-192.
Malaguerra, A. (2012). Bayesian variable selection based on test statistics. Master's thesis, Univ. Zurich.
Marra, G. and Wood, S. N. (2011). Practical variable selection for generalized additive models. Comput. Statist. Data Anal. 55 2372-2387.
Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear models. J. Roy. Statist. Soc. Ser. A 135 370-384.
Rue, H., Martino, S. and Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models using integrated nested Laplace approximations (with discussion). J. Roy. Statist. Soc. Ser. B 71 319-392.
Sabanés Bové, D. and Held, L. (2011a). Hyper-g priors for generalized linear models. Bayesian Anal. 6 387-410.
Sabanés Bové, D. and Held, L. (2011b). Bayesian fractional polynomials. Stat. Comput. 21 309-324.
Sabanés Bové, D., Held, L. and Kauermann, G. (2014). Mixtures of g-priors for generalised additive model selection with penalised splines. J. Comput. Graph. Statist. DOI: 10.1080/10618600.2014.912136.
Scott, J. G. and Berger, J. O. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. Ann. Statist. 38 2587-2619.
Sellke, T., Bayarri, M. J. and Berger, J. O. (2001). Calibration of p-values for testing precise null hypotheses. Amer. Statist. 55 62-71.
Steyerberg, E. W. (2009). Clinical Prediction Models. Springer, New York.
van Houwelingen, J. C. and Le Cessie, S. (1990). Predictive value of statistical models. Stat. Med. 9 1303-1325.
Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques (P. K. Goel and A. Zellner, eds.). Stud. Bayesian Econometrics Statist. 6 233-243. North-Holland, Amsterdam.
Zellner, A. and Siow, A. (1980). Posterior odds ratios for selected regression hypotheses. In Bayesian Statistics: Proceedings of the First International Meeting Held in Valencia (J. M. Bernardo, M. H. DeGroot, D. V. Lindley and A. F. M. Smith, eds.) 585-603. Univ. Valencia Press, Valencia.