Introducing Bayes Factors

Leonhard Held
Division of Biostatistics, University of Zurich
25 November 2011


Preface

  There's no theorem like Bayes' theorem,
  Like no theorem we know.
  Everything about it is appealing,
  Everything about it is a wow.
  Let out all that a priori feeling
  You've been concealing right up to now!

      G. E. P. Box
      (Music: Irving Berlin, "There's No Business Like Show Business")

Outline

1. Bayes factors
2. P values
3. Generalized additive model selection


Bayes factors

- Consider two hypotheses H0 and H1 and some data x.
- Bayes' theorem implies

    \underbrace{\frac{P(H_0 \mid x)}{P(H_1 \mid x)}}_{\text{Posterior odds}}
      = \underbrace{\frac{p(x \mid H_0)}{p(x \mid H_1)}}_{\text{Bayes factor}}
      \cdot \underbrace{\frac{P(H_0)}{P(H_1)}}_{\text{Prior odds}}

- The Bayes factor (BF) quantifies the evidence in the data x for H0 vs. H1.
- The BF is the ratio of the marginal likelihoods

    p(x \mid H_i) = \int \underbrace{p(x \mid \theta, H_i)}_{\text{Likelihood}} \, \underbrace{p(\theta \mid H_i)}_{\text{Prior}} \, d\theta

  of the two hypotheses H_i, i = 0, 1.
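To make the marginal-likelihood definition concrete, here is a minimal numerical sketch (my own illustration, not from the talk): a single observation x ~ N(µ, σ²) with known σ, where H0 fixes µ = 0 and H1 places a N(0, τ²) prior on µ; the values of x, σ and τ are arbitrary assumptions.

    # Minimal sketch: a Bayes factor as a ratio of marginal likelihoods.
    # Illustrative setup: x ~ N(mu, sigma^2) with known sigma;
    # H0: mu = 0 (point null), H1: mu ~ N(0, tau^2).
    import numpy as np
    from scipy import stats
    from scipy.integrate import quad

    x, sigma, tau = 1.8, 1.0, 2.0  # arbitrary illustrative values

    # Marginal likelihood under H0: no free parameter, just the likelihood.
    m0 = stats.norm.pdf(x, loc=0.0, scale=sigma)

    # Marginal likelihood under H1: integrate likelihood times prior over mu.
    m1, _ = quad(lambda mu: stats.norm.pdf(x, mu, sigma) * stats.norm.pdf(mu, 0.0, tau),
                 -np.inf, np.inf)
    print(f"BF(H0 vs. H1) = {m0 / m1:.3f}")

    # Check: in this conjugate setup, x ~ N(0, sigma^2 + tau^2) marginally under H1.
    print(f"closed form   = {m0 / stats.norm.pdf(x, 0.0, np.sqrt(sigma**2 + tau**2)):.3f}")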

Properties of Bayes factors

Bayes factors
1. need proper priors p(θ | H_i),
2. reduce to likelihood ratios for simple hypotheses,
3. have an automatic penalty for model complexity,
4. work also for non-nested models,
5. are symmetric measures of evidence,
6. are related to the Bayesian Information Criterion (BIC).


Scaling of Bayes factors

  BF             Strength of evidence
  < 1:1          Negative (supports H1)
  1:1 to 3:1     Barely worth mentioning
  3:1 to 20:1    Substantial
  20:1 to 150:1  Strong
  > 150:1        Very strong
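A small helper (my own sketch, not part of the talk) that maps a Bayes factor for H0 over H1 to the verbal scale above:

    # Sketch: map a Bayes factor BF(H0 vs. H1) to the evidence scale above.
    def evidence_category(bf: float) -> str:
        if bf < 1:
            return "Negative (supports H1)"
        if bf <= 3:
            return "Barely worth mentioning"
        if bf <= 20:
            return "Substantial"
        if bf <= 150:
            return "Strong"
        return "Very strong"

    print(evidence_category(12.0))  # -> Substantial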

Jeffreys–Lindley "paradox"

When comparing models with different numbers of parameters and using diffuse priors, the simpler model is always favoured over the more complex one.

→ Priors matter.
→ The evidence against the simpler model is bounded.


About P values

Harold Jeffreys (1891-1989):

  "What the use of P implies [...] is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred. This seems a remarkable procedure."


About Jeffreys' book "Theory of Probability"

Ronald Fisher (1890-1962):

  "He makes a logical mistake on the first page which invalidates all the 395 formulae in his book."

Steve Goodman's conclusion

  "In fact, the P value is almost nothing sensible you can think of. I tell students to give up trying."

Q: What is the relationship between P values and Bayes factors?


The Edwards et al. (1963) approach

- Consider a Gauss test of H0: µ = µ0, where x ~ N(µ, σ²).
- This scenario reflects, at least approximately, many of the statistical procedures found in scientific journals.
- With the t-value t = (x − µ0)/σ we obtain p(x | H0) = ϕ(t)/σ.
- Under H1 we allow any prior distribution p(µ) for µ; it then follows that

    p(x \mid H_1) = \int \frac{1}{\sigma}\,\varphi\!\left(\frac{x - \mu}{\sigma}\right) p(\mu)\, d\mu \;\le\; \varphi(0)/\sigma.

  The upper bound corresponds to a prior density p(µ) concentrated at x.
- The Bayes factor BF for H0 vs. H1 is therefore bounded from below:

    \mathrm{BF} = \frac{p(x \mid H_0)}{p(x \mid H_1)} \;\ge\; \exp(-t^2/2) =: \underline{\mathrm{BF}}

- This is a universal lower bound on the BF, valid for any prior on µ!

The Edwards et al. (1963) approach (cont.)

Assuming equal prior probability for H0 and H1, we obtain:

  two-sided              P value
                  0.05    0.01    0.001
  t               1.96    2.58    3.29
  min BF          0.15    0.04    0.004
  min P(H0 | x)   12.8%   3.5%    0.4%

  one-sided              P value
                  0.05    0.01    0.001
  t               1.64    2.33    3.09
  min BF          0.26    0.07    0.008
  min P(H0 | x)   20.5%   6.3%    0.8%
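These numbers follow directly from the lower bound; a short check (my own sketch) that converts a two-sided P value into the minimum Bayes factor exp(−t²/2) and, with prior odds of 1, into the minimum posterior probability BF/(1 + BF):

    # Sketch: reproduce the two-sided Edwards et al. (1963) bounds from P values.
    import math
    from scipy import stats

    for p in (0.05, 0.01, 0.001):
        t = stats.norm.ppf(1 - p / 2)      # two-sided critical value
        min_bf = math.exp(-t**2 / 2)       # universal lower bound on the BF
        min_post = min_bf / (1 + min_bf)   # equal prior odds: P(H0 | x) >= this
        print(f"p={p:<6} t={t:.2f}  min BF={min_bf:.3f}  min P(H0|x)={min_post:.1%}")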

Some refinements by Berger and Sellke (1987)

Consider two-sided tests with
1. symmetric prior distributions, centered at µ0,
2. unimodal, symmetric prior distributions, centered at µ0,
3. normal prior distributions, centered at µ0.

                                  P value
  Prior                      5%      1%     0.1%
  Symmetric: min P(H0 | x)   22.7%   6.8%   0.9%
  Unimodal:  min P(H0 | x)   29.0%  10.9%   1.8%
  Normal:    min P(H0 | x)   32.1%  13.3%   2.4%


The Sellke et al. (2001) approach

- Idea: work directly with the P value p.
- Under H0: p ~ U(0, 1).
- Under H1: p ~ Be(ξ, 1) with 0 < ξ < 1.
- The Bayes factor of H0 vs. H1 is then

    \mathrm{BF} = 1 \Big/ \int \xi\, p^{\xi - 1}\, p(\xi)\, d\xi

  for some prior p(ξ) under H1.
- Calculus shows that a lower limit on the BF is

    \underline{\mathrm{BF}} = \begin{cases} -e\, p \log(p) & \text{for } p < e^{-1} \\ 1 & \text{else} \end{cases}

                     P value
                   5%      1%     0.1%
  min P(H0 | x)    28.9%  11.1%   1.8%
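A quick check of the last row (my own sketch), with equal prior odds as before:

    # Sketch: Sellke et al. (2001) calibration, -e * p * log(p) for p < 1/e.
    import math

    def min_bayes_factor(p: float) -> float:
        """Lower bound on BF(H0 vs. H1) as a function of the P value p."""
        return -math.e * p * math.log(p) if p < 1 / math.e else 1.0

    for p in (0.05, 0.01, 0.001):
        bf = min_bayes_factor(p)
        print(f"p={p:<6} min BF={bf:.3f}  min P(H0|x)={bf / (1 + bf):.1%}")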

Summary

- Using minimum Bayes factors, P values can be transformed into lower bounds on the posterior probability of the null hypothesis.
- It turns out that:

  "Remarkably, this smallest possible bound is by no means always very small in those cases when the datum would lead to a high classical significance level. Even the utmost generosity to the alternative hypothesis cannot make the evidence in favor of it as strong as classical significance levels might suggest."
  Edwards et al. (1963), Psychological Review


A Nomogram for P Values

[Figure: nomogram (Held, 2010) connecting the prior probability of H0 (left scale), the P value / minimum Bayes factor (middle scale) and the minimum posterior probability of H0 (right scale).]
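The nomogram implements the odds form of Bayes' theorem with a minimum Bayes factor: minimum posterior odds = prior odds × min BF. A sketch of that computation, assuming the Sellke et al. (2001) bound for the middle scale:

    # Sketch: the computation behind the nomogram --
    # minimum posterior odds = prior odds * minimum Bayes factor.
    import math

    def min_posterior_prob(prior: float, p: float) -> float:
        """Lower bound on P(H0 | x) given the prior P(H0) and a P value p."""
        bf = -math.e * p * math.log(p) if p < 1 / math.e else 1.0
        prior_odds = prior / (1 - prior)
        post_odds = prior_odds * bf
        return post_odds / (1 + post_odds)

    print(f"{min_posterior_prob(0.5, 0.05):.1%}")  # 50% prior, p = 0.05 -> 28.9%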


Example: p = 0.03

[Figure: nomogram with a line drawn through p = 0.03, reading off the minimum posterior probability of H0 for a given prior probability.]


Q: What prior do I need to achieve p = P(H0 | x)?

[Figure: the same nomogram, now read backwards from the minimum posterior probability scale to the required prior probability.]
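The second question can be answered by inverting the odds relation: to end at posterior probability p, the prior odds must equal the target posterior odds divided by the minimum Bayes factor. A sketch, again assuming the Sellke et al. (2001) bound:

    # Sketch: which prior P(H0) makes the minimum posterior probability equal p?
    # Inverts posterior odds = prior odds * min BF (Sellke et al. bound assumed).
    import math

    def required_prior(p: float) -> float:
        bf = -math.e * p * math.log(p) if p < 1 / math.e else 1.0
        post_odds = p / (1 - p)        # target posterior odds
        prior_odds = post_odds / bf    # invert the odds relation
        return prior_odds / (1 + prior_odds)

    print(f"{required_prior(0.03):.1%}")  # prior needed so that P(H0 | x) = 0.03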

Maximum difference between P(H0) and p = P(H0 | x)

[Figure: maximum difference (in percentage points) between the prior probability P(H0) and p = P(H0 | x), plotted against the P value (0.0001 to 0.1, log scale), for the bounds of Edwards et al. (two-sided and one-sided), Berger & Sellke (priors 1-3) and Sellke et al.]


Bayesian regression

- Consider the model

    y \sim \mathrm{N}(\mathbf{1}\beta_0 + X\beta, \sigma^2 I)

- Zellner's (1986) g-prior on the coefficients β:

    \beta \mid g, \sigma^2 \sim \mathrm{N}\left(0,\; g\sigma^2 (X^\top X)^{-1}\right)

- Jeffreys' prior on the intercept: p(β0) ∝ 1.
- Jeffreys' prior on the variance: p(σ²) ∝ (σ²)⁻¹.
- The factor g can be interpreted as an inverse relative prior sample size.
  → Fixed shrinkage factor g/(g + 1).
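Under this prior the posterior mean of β is the least-squares estimate shrunk by the fixed factor g/(g + 1); a minimal sketch with simulated data (all values are illustrative assumptions):

    # Sketch: shrinkage under Zellner's g-prior -- the posterior mean of beta
    # is g/(g+1) times the least-squares estimate.
    import numpy as np

    rng = np.random.default_rng(1)
    n, g = 50, 20.0                       # illustrative sample size and g
    X = rng.standard_normal((n, 3))
    X -= X.mean(axis=0)                   # center columns (flat prior on intercept)
    y = 2.0 + X @ np.array([1.0, 0.0, -0.5]) + rng.standard_normal(n)

    beta_ols = np.linalg.lstsq(X, y - y.mean(), rcond=None)[0]
    beta_post = g / (g + 1) * beta_ols    # fixed shrinkage factor g/(g+1)
    print("OLS estimate:  ", np.round(beta_ols, 3))
    print("posterior mean:", np.round(beta_post, 3))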

Hyper-g prior in the linear model

- A prior with fixed g has unattractive asymptotic properties.
- Hyper-g prior on g: g/(g + 1) ~ U(0, 1)
  (a small simulation sketch of this prior follows the next slide).
  ⇒ Model selection consistency.
  ⇒ p(y) has a closed form.


Model selection in generalized additive regression

- The problem of model selection in regression is pervasive in statistical practice.
- Complexity increases dramatically if non-linear covariate effects are allowed for.
- Parametric approaches:
  - fractional polynomials (FPs) (Sauerbrei and Royston, 1999),
  - Bayesian FPs (Sabanés Bové and Held, 2011a).
- Semiparametric approaches:
  - generalized additive model selection.
- Here we describe a Bayesian approach using penalized splines (joint work with Daniel Sabanés Bové and Göran Kauermann).
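The uniform prior on the shrinkage factor t = g/(g + 1) is, by a change of variables, the density p(g) = (1 + g)⁻² on g. A quick simulation sketch (my own) of this equivalence:

    # Sketch: a uniform prior on the shrinkage factor g/(g+1) is equivalent
    # to the density p(g) = (1+g)^(-2) on g (change of variables g = t/(1-t)).
    import numpy as np

    rng = np.random.default_rng(0)
    t = rng.uniform(size=200_000)   # shrinkage factor ~ U(0, 1)
    g = t / (1 - t)                 # implied draws of g

    # Compare the empirical c.d.f. of g with the analytic one, P(g <= q) = q/(1+q):
    for q in (0.5, 1.0, 4.0):
        print(f"P(g <= {q}): empirical {np.mean(g <= q):.3f}, analytic {q / (1 + q):.3f}")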

Additive semiparametric models

    y_i = \beta_0 + \sum_{j=1}^{p} f_j(x_{ij}) + \varepsilon_i, \quad \varepsilon_i \overset{iid}{\sim} \mathrm{N}(0, \sigma^2), \quad i = 1, \ldots, n

  Effect of x_j   Functional form                          Degrees of freedom
  not included    f_j(x_ij) ≡ 0                            d_j = 0
  linear          f_j(x_ij) = x_ij β_j                     d_j = 1
  smooth          f_j(x_ij) = x_ij β_j + z_j(x_ij)ᵀ u_j    d_j > 1

The degrees-of-freedom vector d = (d_1, ..., d_p) determines the model.


Penalised splines in mixed model layout

If d_j > 1 then

    \left(f_j(x_{1j}), \ldots, f_j(x_{nj})\right)^\top = x_j \beta_j + Z_j u_j

where
- x_j is the n × 1 covariate vector, zero-centred (1ᵀ x_j = 0),
- Z_j is the n × K spline basis matrix (1ᵀ Z_j = x_jᵀ Z_j = 0ᵀ),
- u_j is the K × 1 vector of spline coefficients, u_j ~ N(0, σ² ρ_j I).

The variance factor ρ_j corresponds to the degrees of freedom

    d_j = \mathrm{tr}\left\{ (Z_j^\top Z_j + \rho_j^{-1} I)^{-1} Z_j^\top Z_j \right\} + 1 \;<\; K + 1 \qquad (K: \text{number of knots})
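A small numerical sketch (with an arbitrary stand-in basis matrix, not the O'Sullivan basis used later) of how d_j moves between 1 and K + 1 as ρ_j varies:

    # Sketch: degrees of freedom of a penalised spline term as a function of
    # the variance factor rho: d = tr{(Z'Z + rho^{-1} I)^{-1} Z'Z} + 1.
    import numpy as np

    rng = np.random.default_rng(0)
    Z = rng.standard_normal((100, 6))   # stand-in for an n x K spline basis

    def degrees_of_freedom(Z: np.ndarray, rho: float) -> float:
        ZtZ = Z.T @ Z
        k = Z.shape[1]
        return np.trace(np.linalg.solve(ZtZ + np.eye(k) / rho, ZtZ)) + 1

    for rho in (1e-4, 1e-2, 1.0, 100.0):
        print(f"rho={rho:<8} d={degrees_of_freedom(Z, rho):.3f}")  # 1 < d < K + 1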

Transformation to standard linear model

1. Conditional model:

    y \mid \{u_j\} \sim \mathrm{N}\Big(\mathbf{1}\beta_0 + \underbrace{\textstyle\sum_{j: d_j \ge 1} x_j \beta_j}_{X\beta} + \textstyle\sum_{j: d_j > 1} Z_j u_j,\; \sigma^2 I\Big)

2. Marginal model:

    y \sim \mathrm{N}(\mathbf{1}\beta_0 + X\beta, \sigma^2 V) \quad \text{where} \quad V = I + \textstyle\sum_{j: d_j > 1} \rho_j Z_j Z_j^\top

3. Decorrelated marginal model:

    \tilde{y} \sim \mathrm{N}(\tilde{\mathbf{1}}\beta_0 + \tilde{X}\beta, \sigma^2 I) \quad \text{with} \quad \tilde{y} = V^{-T/2} y,\; \tilde{\mathbf{1}} = V^{-T/2} \mathbf{1},\; \tilde{X} = V^{-T/2} X

→ g-prior:

    \beta \mid g, \sigma^2 \sim \mathrm{N}\left(0,\; g\sigma^2 (X^\top V^{-1} X)^{-1}\right)


Model prior

1. The number of covariates included is uniform on {0, 1, ..., p}.
2. For a fixed number of included covariates, all covariate choices are equally likely.
3. Degrees of freedom are uniform on {1, 3, ..., K}.

⇒ P(d_j = 0) = P(d_j ≥ 1) = 1/2.
⇒ Note: the d_j's are dependent; the prior is multiplicity-corrected (Scott and Berger, 2010).

- The prior can be modified to put a fixed prior probability on a linear effect, e.g. P(d_j = 1) = 1/4.
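A brute-force check (my own sketch) of the implied marginal exclusion probability P(d_j = 0) = 1/2 under this hierarchical prior, for a small p:

    # Sketch: enumerate the hierarchical model prior and check P(d_j = 0) = 1/2.
    # Hierarchy: number of included covariates uniform on {0, ..., p}; given
    # that number, the included subset is uniform over all subsets of that size.
    from itertools import combinations
    from math import comb

    p = 3
    prob_excluded = 0.0
    for m in range(p + 1):                       # number of included covariates
        for subset in combinations(range(p), m):
            if 0 not in subset:                  # first covariate not included
                prob_excluded += 1 / (p + 1) / comb(p, m)

    print(f"P(d_j = 0) = {prob_excluded}")       # -> 0.5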

Hyper-g prior for generalized additive models

We use the working normal model

    z_0 \mid \beta_0, \beta_d, u_d \sim \mathrm{N}\left(\mathbf{1}_n \beta_0 + X_d \beta_d + Z_d u_d,\; W_0^{-1}\right)

with z_0 = η_0 + diag{dh(η_0)/dη}⁻¹ (y − h(η_0)), W_0 = W(η_0) and η_0 = 0_n. The generalised g-prior can now be derived as

    \beta_d \mid g \sim \mathrm{N}\left(0,\; g\left(\tilde{X}_d^\top (I_n + \tilde{Z}_d D_d \tilde{Z}_d^\top)^{-1} \tilde{X}_d\right)^{-1}\right),

where X̃_d = W_0^{1/2} X_d, Z̃_d = W_0^{1/2} Z_d and D_d is block-diagonal with entries ρ_j I_K (for d_j > 1).


Marginal likelihood computation

Approximate the marginal likelihood of model d via numerical integration with respect to g (Sabanés Bové and Held, 2011b):

    p(y \mid d) = \int_0^\infty p(y \mid g, d)\, p(g)\, dg.

Here p(y | g, d) is computed using a Laplace approximation based on a Gaussian approximation of p(β_0, β_d, u_d | y, g, d) obtained with the Bayesian IWLS algorithm (West, 1985).
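A schematic sketch of the outer integration step; log_lik_g below is only a stand-in for the Laplace-approximated log p(y | g, d) described above, and the integral is transformed to the shrinkage scale t = g/(1 + g), on which the hyper-g prior is uniform:

    # Schematic sketch of p(y|d) = integral of p(y|g,d) p(g) dg, hyper-g prior.
    # log_lik_g is a STAND-IN for the Laplace-approximated log p(y|g,d).
    import numpy as np
    from scipy.integrate import quad

    def log_lik_g(g: float) -> float:
        return -15.0 * np.log1p(1.0 / (1.0 + g))   # illustrative placeholder

    def integrand(t: float) -> float:
        g = t / (1 - t)                 # t = g/(1+g) is uniform under hyper-g,
        return np.exp(log_lik_g(g))     # so p(g) dg = dt and no Jacobian appears

    marg, _ = quad(integrand, 0.0, 1.0)
    print(f"approximate p(y | d) (up to a constant): {marg:.4f}")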

Model search

- The model space grows exponentially in the number of covariates p.
- An exhaustive model search may still be possible.
- Otherwise, efficient stochastic search algorithms are necessary.
- An easy setup is Metropolis-Hastings with two proposal types:
  - Move: choose a covariate j and decrease or increase d_j.
  - Switch: choose a pair (i, j) and switch d_i and d_j.
  → "MCMC model composition" (Madigan and York, 1995);
    a sketch follows the data description below.


Application: Pima Indians Data

  Variable  Description
  y         Signs of diabetes according to WHO criteria (Yes = 1, No = 0)
  x1        Number of pregnancies
  x2        Plasma glucose concentration in an oral glucose tolerance test [mg/dl]
  x3        Diastolic blood pressure [mm Hg]
  x4        Triceps skin fold thickness [mm]
  x5        Body mass index (BMI) [kg/m²]
  x6        Diabetes pedigree function
  x7        Age [years]

- Cubic O'Sullivan splines with K = 6 quantile-based knots.
- Exhaustive model search: 823 543 models in 94.3 hours.
- Stochastic model search: 43 766 models in 3.6 hours.
⇒ The 489 top models (66% probability mass) are identical; in total, 98% of the probability mass has been found.
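The promised sketch of the stochastic search (my own illustration): a Metropolis-Hastings walk over degrees-of-freedom vectors with the Move and Switch proposals; log_score is a stand-in for log p(y | d) plus the log model prior:

    # Sketch of "MCMC model composition" over degrees-of-freedom vectors d
    # (Madigan and York, 1995). log_score is a STAND-IN for the log marginal
    # likelihood plus log model prior from the slides above.
    import numpy as np

    rng = np.random.default_rng(42)
    p, K = 7, 6

    def log_score(d: np.ndarray) -> float:
        # Illustrative placeholder favouring one particular model.
        return -0.5 * np.sum((d - np.array([0, 1, 0, 0, 3, 2, 4])) ** 2)

    d = np.zeros(p, dtype=int)                   # start from the null model
    for _ in range(10_000):
        prop = d.copy()
        if rng.random() < 0.5:                   # "Move": de-/increase one d_j
            j = rng.integers(p)
            prop[j] = int(np.clip(prop[j] + rng.choice([-1, 1]), 0, K))
        else:                                    # "Switch": swap d_i and d_j
            i, j = rng.choice(p, size=2, replace=False)
            prop[i], prop[j] = prop[j], prop[i]
        if np.log(rng.random()) < log_score(prop) - log_score(d):
            d = prop                             # Metropolis-Hastings acceptance

    print("last visited model d =", d)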

Posterior inclusion probabilities

                          x1    x2    x3    x4    x5    x6    x7
  not included (d_j = 0)  0.63  0.00  0.81  0.84  0.00  0.02  0.01
  linear (d_j = 1)        0.09  0.48  0.09  0.06  0.11  0.26  0.00
  smooth (d_j > 1)        0.28  0.52  0.09  0.10  0.88  0.72  0.99


MAP model

[Figure: estimated effects in the MAP model: d2 = 1 for x2 (glucose concentration [mg/dl]), d5 = 3 for x5 (BMI [kg/m²]), d6 = 2 for x6 (diabetes pedigree function) and d7 = 4 for x7 (age); means with pointwise and simultaneous 95% credible intervals.]

The 10 best models

      npreg  glu  bp  skin  bmi  ped  age  logMargLik      prior         post
   1      0    1   0     0    3    2    4   -243.3850  2.755732e-06  0.017715737
   2      0    1   0     0    4    2    4   -243.5438  2.755732e-06  0.015115243
   3      0    1   0     0    3    2    3   -243.5813  2.755732e-06  0.014558931
   4      0    1   0     0    4    2    3   -243.7555  2.755732e-06  0.012231371
   5      0    2   0     0    3    2    4   -243.7892  2.755732e-06  0.011825606
   6      0    1   0     0    2    2    4   -243.9368  2.755732e-06  0.010202615
   7      0    2   0     0    4    2    4   -243.9417  2.755732e-06  0.010152803
   8      0    1   0     0    3    1    4   -243.9580  2.755732e-06  0.009988492
   9      0    2   0     0    3    2    3   -243.9862  2.755732e-06  0.009710784
  10      0    1   0     0    3    3    4   -244.0169  2.755732e-06  0.009417629


Postprocessing

- Meta-model: the posterior probabilities of its sub-models are added and used as weights for model averaging (a small aggregation sketch follows this slide).
- The best variable-selection meta-model includes x2, x5, x6 and x7 and has posterior probability 46.6%.
- We may also define the median probability model, comprising all models that include exactly those covariates with more than 50% posterior inclusion probability. Here it is identical to the best variable-selection meta-model.
- We can also optimize the degrees of freedom of the included covariates with respect to the marginal likelihood:

    d = (0, 1, 0, 0, 3, 2, 4)  →  d* = (0, 1, 0, 0, 3.42, 2.13, 3.69)
    log p(y | d) = −243.39  →  log p(y | d*) = −243.23
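The meta-model weights are obtained by summing posterior model probabilities over all d with the same inclusion pattern; a small sketch (my own) using only the top three models from the table above, so the weight shown is a fraction of the full posterior mass:

    # Sketch: aggregate posterior model probabilities into variable-selection
    # meta-models by summing over all d with the same inclusion pattern.
    from collections import defaultdict

    models = [  # (d vector, posterior probability), top 3 from the table above
        ((0, 1, 0, 0, 3, 2, 4), 0.017715737),
        ((0, 1, 0, 0, 4, 2, 4), 0.015115243),
        ((0, 1, 0, 0, 3, 2, 3), 0.014558931),
    ]

    meta = defaultdict(float)
    for d, prob in models:
        inclusion = tuple(dj > 0 for dj in d)  # which covariates are included
        meta[inclusion] += prob                # accumulate the meta-model weight

    for pattern, weight in meta.items():
        print(pattern, f"{weight:.4f}")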

MAP model / Best variable-selection meta-model

[Figure: estimated effects of x2 (glucose concentration [mg/dl]), x5 (BMI [kg/m²]), x6 (diabetes pedigree function) and x7 (age), under the MAP model and under the best variable-selection meta-model; means with pointwise and simultaneous 95% credible intervals.]

References

Berger, J. O. and Sellke, T. (1987). Testing a point null hypothesis: The irreconcilability of P values and evidence (with discussion). Journal of the American Statistical Association 82: 112–139.

Edwards, W., Lindman, H. and Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review 70: 193–242.

Held, L. (2010). A nomogram for P values. BMC Medical Research Methodology 10: 21.

Madigan, D. and York, J. (1995). Bayesian graphical models for discrete data. International Statistical Review 63(2): 215–232.

Sabanés Bové, D. and Held, L. (2011a). Bayesian fractional polynomials. Statistics and Computing 21: 309–324. Available from: http://dx.doi.org/10.1007/s11222-010-9170-7.

Sabanés Bové, D. and Held, L. (2011b). Hyper-g priors for generalized linear models. Bayesian Analysis. Forthcoming article as of 18/2/2011. Available from: http://ba.stat.cmu.edu/abstracts/Sabanes.php.

Sauerbrei, W. and Royston, P. (1999). Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. Journal of the Royal Statistical Society, Series A (Statistics in Society) 162(1): 71–94.

Scott, J. G. and Berger, J. O. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. Annals of Statistics 38(5): 2587–2619.

Sellke, T., Bayarri, M. J. and Berger, J. O. (2001). Calibration of p values for testing precise null hypotheses. The American Statistician 55: 62–71.

West, M. (1985). Generalized linear models: scale parameters, outlier accommodation and prior distributions. In J. M. Bernardo, M. H. DeGroot, D. V. Lindley and A. F. M. Smith (eds), Bayesian Statistics 2: Proceedings of the Second Valencia International Meeting, North-Holland, Amsterdam, pp. 531–558.

Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In P. K. Goel and A. Zellner (eds), Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, Vol. 6 of Studies in Bayesian Econometrics and Statistics, North-Holland, Amsterdam, chapter 5, pp. 233–243.