
Chapter 5: Model selection

Conchi Ausín and Mike Wiper
Universidad Carlos III de Madrid

Master in Business Administration and Quantitative Methods / Master in Mathematical Engineering

Objective

In this class we consider problems where we have various contending models, show how these can be compared, and look at the possibilities of model averaging.

Basics

In principle, model selection is easy.

Suppose that we wish to compare two models, M1 and M2. Then, define prior probabilities, P(M1) and P(M2) = 1 − P(M1). Given data x, we can calculate the posterior probabilities via Bayes theorem:

$$P(M_1 \mid x) = \frac{f(x \mid M_1)\, P(M_1)}{f(x)}$$

where $f(x) = f(x \mid M_1) P(M_1) + f(x \mid M_2) P(M_2)$.

Note also that if model $M_i$ is parametric with parameters $\theta_i$, then
$$f(x \mid M_i) = \int f(x \mid M_i, \theta_i)\, f(\theta_i \mid M_i)\, d\theta_i.$$

Now consider the possible losses (negative utilities) associated with taking a wrong decision.

L(select M2|M1 true) and L(select M1|M2 true).

(something like type I and type II errors). We take the decision which minimizes the expected loss (the Bayes decision).

Expected loss for choosing M1 is

$$P(M_2 \mid x)\, L(M_1 \mid M_2)$$

and similarly for M2. If the loss functions are equal, then we just select the model with the higher posterior probability.

Setting $L(M_1 \mid M_2) = 0.05$ and $L(M_2 \mid M_1) = 0.95$, we select $M_1$ if $0.05\,(1 - P(M_1 \mid x)) < 0.95\, P(M_1 \mid x)$, that is, if $P(M_1 \mid x) > 0.05$.
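As a minimal sketch (not from the slides), the following Python code computes posterior model probabilities from two marginal likelihood values and applies the loss-based decision rule; the numerical inputs are hypothetical.

```python
# A minimal sketch: posterior model probabilities and a Bayes decision
# under asymmetric losses. The marginal likelihood inputs are hypothetical.

def posterior_probs(ml1, ml2, prior1=0.5):
    """Posterior probabilities of two models from their marginal likelihoods."""
    prior2 = 1.0 - prior1
    fx = ml1 * prior1 + ml2 * prior2            # f(x) = sum_i f(x | Mi) P(Mi)
    return ml1 * prior1 / fx, ml2 * prior2 / fx

def bayes_decision(p1, loss_1_if_2=0.05, loss_2_if_1=0.95):
    """Choose the model that minimises the expected loss."""
    exp_loss_m1 = (1.0 - p1) * loss_1_if_2      # expected loss of selecting M1
    exp_loss_m2 = p1 * loss_2_if_1              # expected loss of selecting M2
    return "M1" if exp_loss_m1 < exp_loss_m2 else "M2"

p1, p2 = posterior_probs(ml1=2.4e-4, ml2=3.9e-4)  # hypothetical values
print(p1, bayes_decision(p1))                     # M1 selected whenever P(M1|x) > 0.05
```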

The coin tossing example again

Return to the coin tossing problem of Class I.

Suppose now that we wished to test the hypothesis $H_0 : \theta = 0.5$ versus the alternative $H_1 : \theta \neq 0.5$.

Assume that P(H0) = P(H1) = 0.5 and that θ|H1 ∼ Beta(5, 5).

$$f(x \mid H_0) = \binom{12}{9}\, 0.5^{12}$$
$$f(x \mid H_1) = \int_0^1 \binom{12}{9}\, \theta^9 (1 - \theta)^3\, \frac{1}{B(5,5)}\, \theta^{5-1} (1 - \theta)^{5-1}\, d\theta = \binom{12}{9}\, \frac{B(14, 8)}{B(5, 5)}$$

which implies that P(H0|x) ≈ 0.387. Therefore, we would reject H0 under equal loss functions, but would not if we used the 0.05 and 0.95 losses for type I and type II errors as previously. Note that the p-value (under binomial sampling) is 0.0386, so in this case we would reject H0 at the 5% level.
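These quantities are easy to reproduce numerically; here is a short Python sketch using scipy (not part of the original slides):

```python
# Sketch of the coin example: 12 tosses, 9 heads; H0: theta = 0.5,
# H1: theta ~ Beta(5, 5). Reproduces P(H0|x) approx 0.387.
from scipy.special import beta, comb

n, h = 12, 9
f_x_H0 = comb(n, h) * 0.5**n                               # binomial likelihood at 0.5
f_x_H1 = comb(n, h) * beta(5 + h, 5 + n - h) / beta(5, 5)  # beta-binomial marginal

p_H0 = f_x_H0 / (f_x_H0 + f_x_H1)                          # equal prior probabilities
print(round(f_x_H0 / f_x_H1, 4), round(p_H0, 3))           # 0.6309 and 0.387
```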

Two sided hypothesis tests: a paradox

In two sided hypothesis tests, Bayesian and classical results can often differ greatly.

Consider a coin tossing problem where we observe 49,581 heads and 48,870 tails.

Then a classical two-tailed test of H0 : θ = 0.5 vs H1 : θ ≠ 0.5 gives a p-value of 0.0232 and H0 is clearly rejected. However, given the set-up of the previous example, the posterior probability that H0 is true is P(H0|x) ≈ 0.89.

Given a uniform prior for θ under H1, this probability increases to 0.95.

In contrast, for the one-sided test with H1 : θ > 0.5, the p-value is 0.0116 and the Bayesian posterior probability of H0 is P(H0|x) = 0.0117.
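A sketch of the calculations behind this paradox (not from the slides; the two-tailed p-value uses a normal approximation, so it only approximately matches the 0.0232 quoted above):

```python
# Sketch: the paradox numbers for 49,581 heads and 48,870 tails.
# H0: theta = 0.5; under H1, theta ~ Beta(a, a), a = 5 or a = 1 (uniform).
import numpy as np
from scipy.special import betaln
from scipy.stats import norm

h, t = 49581, 48870
n = h + t

def post_prob_H0(a):
    """P(H0 | x) with P(H0) = P(H1) = 0.5, via log marginal likelihoods."""
    log_B01 = n * np.log(0.5) + betaln(a, a) - betaln(a + h, a + t)
    return 1.0 / (1.0 + np.exp(-log_B01))

z = (h - n / 2) / np.sqrt(n / 4)               # normal approximation to the binomial
print(2 * norm.sf(abs(z)))                     # two-tailed p-value, approx 0.023
print(post_prob_H0(5), post_prob_H0(1))        # approx 0.89 and 0.95
```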

What if we have a lot of possible models?

In principle we can proceed as earlier but ...

inference will be sensitive to the selection of the prior probabilities P(Mi), and ...
it is often difficult to define these in many contexts (e.g. variable selection in regression models).
We need a criterion which is less dependent on prior information.

The Bayes factor

The Bayes factor in favour of Mi and against model Mj is

$$B^i_j = \frac{P(M_i \mid x)}{P(M_j \mid x)} \cdot \frac{P(M_j)}{P(M_i)}.$$

This is the posterior odds divided by the prior odds. How does this get rid of the dependency on the prior model probabilities? Substituting $P(M_i \mid x) = f(x \mid M_i) P(M_i)/f(x)$, the prior probabilities cancel:
$$B^i_j = \frac{f(x \mid M_i)}{f(x \mid M_j)}.$$
If the models do not have any parameters, this is simply the likelihood ratio.

Otherwise, recall that the marginal likelihood, $f(x \mid M_i)$, depends on the prior $f(\theta_i \mid M_i)$.

Example

In our coin tossing example,

$$B^0_1 = \frac{f(x \mid H_0)}{f(x \mid H_1)} = \frac{\binom{12}{9}\, 0.5^{12}}{\binom{12}{9}\, \frac{B(14,8)}{B(5,5)}} = 0.6309$$

How do we interpret this number?

Consistency and scales of evidence

It is clear that $0 \le B^i_j < \infty$. When $B^i_j = 1$, the marginal likelihoods are the same, so the data provide equal evidence for both models. If model $i$ is true, then $B^i_j \to \infty$, and if model $j$ is true, then $B^i_j \to 0$, as $n \to \infty$.

Kass and Raftery (1995) provide the following table for interpreting the Bayes factor.

Bayes factor   Interpretation
1 to 3         Not worth more than a bare mention
3 to 20        Positive
20 to 150      Strong
> 150          Very strong

Relationship to classical model selection criteria

The Bayesian information criterion (Schwarz 1978) for evaluating a model, $M$, is

$$\mathrm{BIC} = -2 \log f(x \mid \hat\theta, M) + k \log n$$

where $\hat\theta$ is the MLE and the parameters defined under this model, $\theta$, have dimension $k$.

Then, under certain regularity conditions, as $n \to \infty$, for two models $M_i$ and $M_j$,
$$\mathrm{BIC}_i - \mathrm{BIC}_j \to -2 \log B^i_j.$$
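As an illustrative check (not in the slides), the sketch below compares the BIC difference with $-2 \log B^0_1$ for the coin model under a Beta(5,5) prior, using a large hypothetical sample; the agreement is only rough, since the result is asymptotic.

```python
# Illustrative check: BIC difference vs -2 log B01 for H0: theta = 0.5
# against H1: theta ~ Beta(5, 5), with hypothetical data (n = 10,000 tosses).
import numpy as np
from scipy.special import betaln

n, h = 10000, 5100
theta_hat = h / n

# Exact -2 log B01 (the binomial coefficient cancels between H0 and H1)
log_B01 = n * np.log(0.5) + betaln(5, 5) - betaln(5 + h, 5 + n - h)

# BICs, again dropping the common binomial coefficient: k = 0 under H0, k = 1 under H1
bic0 = -2 * n * np.log(0.5)
bic1 = -2 * (h * np.log(theta_hat) + (n - h) * np.log(1 - theta_hat)) + np.log(n)

print(bic0 - bic1, -2 * log_B01)   # similar order of magnitude; the result is asymptotic
```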

Problems with the use of Bayes factors I: philosophy

They require that:
one of the models is “true”;
there is a true value of the parameter θ under this model;
there is positive prior mass on this true value under this model.

When the true model is not included, what happens as n → ∞?

Problems with the use of Bayes factors II: calculation

Calculation of the Bayes factor is often tough. In order to calculate the Bayes factor, we need the marginal likelihoods. Outside of conjugate models, these are usually impossible to evaluate analytically. Various alternatives are available in the context of Monte Carlo simulation and MCMC.

The harmonic mean estimator

Consider an MCMC sample, $\theta^{(1)}, \ldots, \theta^{(N)}$. Then, we can estimate $f(x \mid M)$, or $f(x)$ (dropping dependence on $M$ for notational convenience), as

$$f(x) \approx \left( \frac{1}{N} \sum_{i=1}^{N} \frac{1}{f(x \mid \theta^{(i)})} \right)^{-1}.$$
This is consistent, as the expectation of what is being averaged is

$$\int \frac{1}{f(x \mid \theta)}\, f(\theta \mid x)\, d\theta = \int \frac{1}{f(x \mid \theta)} \cdot \frac{f(x \mid \theta)\, f(\theta)}{f(x)}\, d\theta = \frac{1}{f(x)} \int f(\theta)\, d\theta = \frac{1}{f(x)}.$$

However, the estimator is highly unstable and can often have infinite variance.
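To see the instability, here is a sketch (illustrative, not from the slides) for a conjugate normal model where the exact marginal likelihood is available in closed form; repeated harmonic mean estimates scatter widely around it.

```python
# Sketch: instability of the harmonic mean estimator for x_i ~ N(theta, 1)
# with prior theta ~ N(0, 1), where the exact marginal likelihood is known.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(0)
n = 30
x = rng.normal(0.5, 1.0, size=n)

# Exact: marginally, x ~ N(0, I + 11^T)
log_fx = multivariate_normal.logpdf(x, mean=np.zeros(n), cov=np.eye(n) + np.ones((n, n)))

# Harmonic mean estimates from exact posterior draws, theta | x ~ N(sum(x)/(n+1), 1/(n+1))
for _ in range(5):
    theta = rng.normal(x.sum() / (n + 1), np.sqrt(1 / (n + 1)), size=20000)
    loglik = norm.logpdf(x[None, :], theta[:, None], 1).sum(axis=1)
    log_hm = -(logsumexp(-loglik) - np.log(len(theta)))   # log of the harmonic mean
    print(log_hm, "vs exact", log_fx)                     # estimates scatter widely
```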

Chib's (1995) approach

Recall Bayes theorem:
$$f(\theta \mid x) = \frac{f(x \mid \theta)\, f(\theta)}{f(x)}$$
so that, taking logarithms and reordering,

$$\log f(x) = \log f(x \mid \theta) + \log f(\theta) - \log f(\theta \mid x).$$

Suppose that we have run the Gibbs sampler and have a sensible posterior point estimate, say $\tilde\theta$. Then, we can typically calculate $\log f(\tilde\theta)$ and $\log f(x \mid \tilde\theta)$ analytically. How do we calculate $\log f(\tilde\theta \mid x)$?

Assume that $\theta = (\theta_1, \ldots, \theta_k)$. Then, from the law of multiplication,

$$\log f(\tilde\theta \mid x) = \log f(\tilde\theta_1 \mid x) + \log f(\tilde\theta_2 \mid x, \tilde\theta_1) + \cdots + \log f(\tilde\theta_k \mid x, \tilde\theta_1, \ldots, \tilde\theta_{k-1}).$$

Firstly, if we run the Gibbs sampler and generate sample values $\theta^{(1)}, \ldots, \theta^{(N)}$, we can calculate

$$\log f(\tilde\theta_1 \mid x) \approx \log\left( \frac{1}{N} \sum_{i=1}^{N} f(\tilde\theta_1 \mid x, \theta^{(i)}_{-1}) \right)$$

Now fix $\theta_1 = \tilde\theta_1$ and run the Gibbs sampler again, generating a sample $\theta^{(i)}_{-1}$, for $i = 1, \ldots, N$, to calculate

$$\log f(\tilde\theta_2 \mid x, \tilde\theta_1) \approx \log\left( \frac{1}{N} \sum_{i=1}^{N} f(\tilde\theta_2 \mid x, \theta^{(i)}_{-(1,2)}, \tilde\theta_1) \right)$$

Then run the Gibbs sampler again with θ1 = θ˜1 and θ2 = θ˜2 and so on.
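As an illustration (not from the slides), here is a sketch of Chib's estimator for a simple two-block model, $x_i \sim N(\mu, 1/\tau)$ with independent normal and gamma priors. With only two blocks, the second conditional $f(\tilde\tau \mid x, \tilde\mu)$ is available exactly, so no second run is needed.

```python
# Sketch of Chib's estimator for x_i ~ N(mu, 1/tau) with priors
# mu ~ N(0, s0^2) and tau ~ Gamma(a0, rate b0). Both full conditionals are standard.
import numpy as np
from scipy.stats import gamma, norm

rng = np.random.default_rng(1)
x = rng.normal(1.0, 2.0, size=50)
n, s0, a0, b0 = len(x), 10.0, 2.0, 1.0

def gibbs(n_iter=6000):
    draws, tau = np.empty((n_iter, 2)), 1.0
    for i in range(n_iter):
        v = 1.0 / (n * tau + 1.0 / s0**2)                    # mu | tau, x is normal
        mu = rng.normal(v * tau * x.sum(), np.sqrt(v))
        tau = rng.gamma(a0 + n / 2, 1.0 / (b0 + 0.5 * ((x - mu)**2).sum()))  # tau | mu, x
        draws[i] = mu, tau
    return draws

draws = gibbs()[1000:]                                       # discard burn-in
mu_t, tau_t = draws.mean(axis=0)                             # point estimate (mu~, tau~)

# f(mu~ | x): average the normal full conditional over the tau draws
v = 1.0 / (n * draws[:, 1] + 1.0 / s0**2)
log_f_mu = np.log(np.mean(norm.pdf(mu_t, v * draws[:, 1] * x.sum(), np.sqrt(v))))

# f(tau~ | x, mu~): exact, since the second block's full conditional is gamma
log_f_tau = gamma.logpdf(tau_t, a0 + n / 2, scale=1.0 / (b0 + 0.5 * ((x - mu_t)**2).sum()))

log_lik = norm.logpdf(x, mu_t, 1.0 / np.sqrt(tau_t)).sum()
log_prior = norm.logpdf(mu_t, 0.0, s0) + gamma.logpdf(tau_t, a0, scale=1.0 / b0)
print("log f(x) approx:", log_lik + log_prior - log_f_mu - log_f_tau)
```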

Fairly accurate.
Relies on all conditional distributions being available analytically.
Extensions to more general MCMC samplers are available.
Requires running the Gibbs sampler several times, which can be very slow.

Software reliability example

Remember the software reliability example. Before, we assumed i.i.d. exponential failure times.

[Figure: total execution time plotted against failure count.]

Inter-failure times are longer after more faults have been observed (and corrected?)

The Jelinski-Moranda model

This model assumes N initial faults, each with rate θ, and that after each failure, the fault causing it is perfectly corrected.

$$X_i \mid \theta, N \sim \text{exponential}(\theta\, (N - i + 1)).$$

The likelihood is

$$f(x \mid \theta, N) \propto \frac{N!}{(N - m)!}\, \theta^m \exp\left( -\theta \sum_{i=1}^{m} (N - i + 1)\, x_i \right)$$
where m is the number of observed failures. Given an exponential(1) prior for θ, the conditional posterior of θ is gamma. Given a Poisson(200) prior for N, the conditional posterior for N − m is Poisson. It is easy to run a Gibbs sampler.
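A sketch of this Gibbs sampler in Python, using simulated failure times since the original data are not reproduced here (the true parameter values below are hypothetical):

```python
# Sketch: Gibbs sampler for the Jelinski-Moranda model. The original
# failure-time data are not reproduced here, so we simulate hypothetical data.
import numpy as np

rng = np.random.default_rng(2)
N_true, theta_true, m = 180, 3e-5, 130
x = rng.exponential(1.0 / (theta_true * (N_true - np.arange(m))))  # X_i ~ Exp(theta (N-i+1))
T, lam = x.sum(), 200.0                     # total time and Poisson prior mean for N

def gibbs(n_iter=10000):
    out, N = np.empty((n_iter, 2)), m
    for it in range(n_iter):
        rate = 1.0 + ((N - np.arange(m)) * x).sum()        # exp(1) prior adds 1 to the rate
        theta = rng.gamma(m + 1, 1.0 / rate)               # theta | N, x ~ Gamma
        N = m + rng.poisson(lam * np.exp(-theta * T))      # (N - m) | theta, x ~ Poisson
        out[it] = theta, N
    return out

draws = gibbs()[1000:]                      # discard burn-in
print("E[theta | x] approx", draws[:, 0].mean(), "; E[N | x] approx", draws[:, 1].mean())
```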

The estimated log marginal likelihood for this model is approximately −983; for the i.i.d. model it is −1012.

Problems with the use of Bayes factors III: existence

Return to the coin tossing problem and suppose that under $H_1$, we use Haldane's prior, $f(\theta) \propto \frac{1}{\theta(1 - \theta)}$. Then

$$f(x \mid H_1) \propto \int_0^1 \binom{12}{9}\, \theta^9 (1 - \theta)^3\, \theta^{-1} (1 - \theta)^{-1}\, d\theta$$
but how can we get rid of the proportionality? When we use improper priors for the parameters within any model, the Bayes factor is not defined!

Possible solution: quasi Bayes factors

Various quasi Bayes factors have been introduced to get round the problem of improper priors. All use some idea of dividing the data into a minimal training set and an evaluation set. The idea is to use the (smallest possible) training set to make the improper prior proper, and then to use the evaluation data as the sample with which to calculate a Bayes factor.

Intrinsic Bayes factors

Consider the example of normal data X |θ ∼ Normal(θ, 1) and suppose we wish to test H0 : θ ≤ 0 versus H1 : θ > 0. Assume we define uniform priors for θ conditional on each of these hypotheses. Suppose that we observe a sample of size n.

Note that given a single datum, say x1, then

θ|x1, Hi ∼ truncated normal(x1, 1)

where the truncation is onto $\mathbb{R}^-$ in the case of $H_0$ and $\mathbb{R}^+$ in the case of $H_1$. Then, conditional on $x_1$, we can define a (partial) Bayes factor

$$B^0_1(x_{-1} \mid x_1) = \frac{\int_{-\infty}^{0} f(x_{-1} \mid \theta)\, f(\theta \mid x_1, H_0)\, d\theta}{\int_{0}^{\infty} f(x_{-1} \mid \theta)\, f(\theta \mid x_1, H_1)\, d\theta} = \frac{1 - \Phi(-x_1)}{\Phi(-x_1)} \cdot \frac{\Phi(-\sqrt{n}\,\bar{x})}{1 - \Phi(-\sqrt{n}\,\bar{x})}$$

A problem is that this is clearly sensitive to the training sample chosen.

If x1 is an outlier, we could have problems. One possibility is to average over all possible training sets of size 1.

Geometric IBF:
$$\mathrm{GIBF}^0_1 = \left( \prod_{i=1}^{n} B^0_1(x_{-i} \mid x_i) \right)^{1/n} = \frac{\Phi(-\sqrt{n}\,\bar{x})}{1 - \Phi(-\sqrt{n}\,\bar{x})} \left( \prod_{i=1}^{n} \frac{1 - \Phi(-x_i)}{\Phi(-x_i)} \right)^{1/n}$$

Arithmetic IBF:

$$\mathrm{AIBF}^0_1 = \frac{1}{n} \sum_{i=1}^{n} B^0_1(x_{-i} \mid x_i) = \frac{\Phi(-\sqrt{n}\,\bar{x})}{1 - \Phi(-\sqrt{n}\,\bar{x})} \cdot \frac{1}{n} \sum_{i=1}^{n} \frac{1 - \Phi(-x_i)}{\Phi(-x_i)}$$
Both methods lose some of the nice properties of the original Bayes factor.
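A numerical sketch of both intrinsic Bayes factors for simulated data (illustrative, not from the slides):

```python
# Sketch: arithmetic and geometric intrinsic Bayes factors for
# H0: theta <= 0 vs H1: theta > 0, X_i ~ N(theta, 1), simulated data.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x = rng.normal(0.3, 1.0, size=20)
n, xbar = len(x), x.mean()

common = norm.cdf(-np.sqrt(n) * xbar) / (1 - norm.cdf(-np.sqrt(n) * xbar))
ratios = (1 - norm.cdf(-x)) / norm.cdf(-x)      # one correction term per training point

aibf = common * ratios.mean()                   # arithmetic IBF
gibf = common * np.exp(np.log(ratios).mean())   # geometric IBF
print(aibf, gibf)
```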

Information criteria

We have seen that the Bayes factor is often very difficult to calculate. In such cases, information criteria may be preferred. The best known classical information criteria are the Bayesian information criterion,

$$\mathrm{BIC} = -2 \log f(x \mid \hat\theta, M) + k \log n,$$

and the Akaike information criterion,

$$\mathrm{AIC} = -2 \log f(x \mid \hat\theta, M) + 2k.$$

Both criteria use the deviance, $D(\theta \mid x) = -2 \log f(x \mid \theta, M)$, plus a correction term to account for model complexity.

Linear models example

Suppose we have a linear model, $Y \sim \text{normal}\left(X\theta, \frac{1}{\tau} I\right)$. Then recall that the fitted values are $E[Y \mid \hat\theta] = HY$, where $H = X(X^T X)^{-1} X^T$ is the hat matrix.

Then the diagonal elements of $H$ satisfy the restriction $0 \le h_{ii} \le 1$ and $\sum_{i=1}^{n} h_{ii} = k$, the number of linear parameters in the model.

The elements $h_{ii}$ are influence measures. Higher values imply that $\hat\theta$ will change more if the $i$th datum is removed.

A Bayesian hat matrix

Suppose we introduce a normal prior distribution for $\theta$, say $\theta \sim \text{normal}\left(m, \frac{c}{\tau} V\right)$. Then the Bayesian hat matrix is now

$$H = X \left( \frac{1}{c} V^{-1} + X^T X \right)^{-1} X^T$$

and satisfies
$$E_{Y \mid \theta}[Y \mid \bar\theta] = E_Y[Y] + H\, (Y - E_Y[Y]).$$
The effective number of parameters for the linear model is then $p_D = \sum_{i=1}^{n} h_{ii}$.
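A small numerical sketch (hypothetical design matrix, V = I) showing how $p_D = \mathrm{tr}(H)$ moves towards $k$ as the prior becomes vaguer:

```python
# Sketch: effective number of parameters p_D = trace(H) for the Bayesian
# hat matrix, with a hypothetical design matrix and V = I.
import numpy as np

rng = np.random.default_rng(4)
n, k = 50, 4
X = rng.normal(size=(n, k))
V_inv = np.eye(k)

def p_d(c):
    """Trace of H = X (V^{-1}/c + X'X)^{-1} X'."""
    H = X @ np.linalg.solve(V_inv / c + X.T @ X, X.T)
    return np.trace(H)

for c in (0.1, 1.0, 100.0):
    print(c, p_d(c))        # p_D grows towards k = 4 as the prior becomes vaguer
```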

The deviance information criterion

The effective number of parameters in a model with data $x$ is defined in Spiegelhalter et al (2002) as $p_D = \bar{D} - D(\bar\theta)$, where $\bar\theta = E[\theta \mid x]$ and $\bar{D} = E[D(\theta \mid x) \mid x]$. Then the deviance information criterion is

$$\mathrm{DIC} = \bar{D} + p_D.$$

Equivalently, $\mathrm{DIC} = D(\bar\theta) + 2 p_D$. This is like a Bayesian AIC. For non-hierarchical linear models with a non-informative prior on $\theta$, DIC = AIC.
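A sketch of the DIC computation from posterior draws, for the simple model $x_i \sim N(\theta, 1)$ with a flat prior so that the posterior can be sampled directly (illustrative, not from the slides):

```python
# Sketch: DIC from posterior draws for x_i ~ N(theta, 1) with a flat prior,
# so that theta | x ~ N(xbar, 1/n) can be sampled directly.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
x = rng.normal(1.0, 1.0, size=40)
n = len(x)

draws = rng.normal(x.mean(), 1 / np.sqrt(n), size=20000)         # posterior draws
D = -2 * norm.logpdf(x[None, :], draws[:, None], 1).sum(axis=1)  # deviance per draw

D_bar = D.mean()
D_at_mean = -2 * norm.logpdf(x, draws.mean(), 1).sum()
p_D = D_bar - D_at_mean
print(p_D, D_bar + p_D)     # p_D close to 1 (one parameter); DIC = D_bar + p_D
```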

Advantages and disadvantages

Very easy to calculate from MCMC output.
Doesn’t matter if we don’t have a true model (unlike the Bayes factor or BIC).
Inconsistent (like AIC).
Only really useful in nested models.
pD can be negative.
Doesn’t work in latent variable models. For alternatives, see Celeux et al (2003).

Predictive performance

As well as comparing models in-sample, it is useful to assess their ability to predict new data. The standard way is to use the log-predictive score:

Divide the data into a training set xT and a prediction set, xP . Then calculate the mean log predictive score

$$-\frac{1}{n_P} \sum_{i : x_i \in x_P} \log f(x_i \mid x_T).$$

For time varying models, this can show which models predict better at different time periods. This can be very computationally intensive.
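As a concrete sketch (not from the slides), the following computes the mean log predictive score for a normal model with known variance and a flat prior, using a hypothetical 70/30 split:

```python
# Sketch: mean log predictive score for x_i ~ N(theta, 1) with a flat prior
# and a hypothetical 70/30 train/prediction split. The posterior predictive
# for a new observation is N(xbar_T, 1 + 1/n_T).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
data = rng.normal(0.7, 1.0, size=100)
x_T, x_P = data[:70], data[70:]

pred_sd = np.sqrt(1 + 1 / len(x_T))        # data noise plus parameter uncertainty
score = -norm.logpdf(x_P, x_T.mean(), pred_sd).mean()
print(score)                               # lower scores indicate better prediction
```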

Model averaging

Often, various models are available and will give similar fits but very different predictions. In such cases, we may wish to use model averaging to make predictions. In principle this is straightforward:

Given a set of models with well defined prior and posterior distributions, we can make predictions as

$$f(y \mid x) = \sum_{i=1}^{k} P(M_i \mid x)\, f(y \mid x, M_i).$$
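For instance, returning to the coin example, the model-averaged predictive probability of heads on the next toss can be sketched as follows (illustrative code, not from the slides):

```python
# Sketch: model-averaged predictive probability of heads on the next toss
# for the earlier coin example (12 tosses, 9 heads).
from scipy.special import beta, comb

n, h = 12, 9
f0 = comb(n, h) * 0.5**n                                 # marginal likelihood under H0
f1 = comb(n, h) * beta(5 + h, 5 + n - h) / beta(5, 5)    # under H1, theta ~ Beta(5, 5)
p0 = f0 / (f0 + f1)                                      # P(H0 | x), equal priors

pred_H0 = 0.5                                            # P(head | x, H0)
pred_H1 = (5 + h) / (10 + n)                             # E[theta | x, H1] = 14/22
print(p0 * pred_H0 + (1 - p0) * pred_H1)                 # model-averaged prediction
```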

Practical issues I: computing the sum

If we have lots of possible models, the full computation of the summation may be infeasible. Two possibilities:
Occam's window: exclude complex models if the data favour simpler models. For example, we can consider thresholding type ideas in regression problems.
MC³: use MCMC to directly approximate the summation.

We need to construct Metropolis-Hastings passes that propose moves through the model space.
For models of variable dimension this can be done using e.g. reversible jump moves.

Practical issues II: prior model probabilities

In regression type models with multiple possible regressors, we can consider priors for a model Mi of the form:

$$P(M_i) = \prod_{j=1}^{k} \pi_j^{\delta_{ji}} (1 - \pi_j)^{1 - \delta_{ji}}$$

where there are $k$ possible covariates and $\delta_{ji} = 1$ if covariate $j$ is included in model $M_i$ and 0 otherwise.

Setting πj = 0.5 gives a uniform prior over model space.

Setting πj < 0.5 penalizes large models.
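A tiny sketch of this prior (the inclusion vector below is hypothetical):

```python
# Sketch: prior probability of a model from covariate-inclusion indicators.
import numpy as np

def model_prior(delta, pi):
    """P(Mi) = prod_j pi_j^delta_ji (1 - pi_j)^(1 - delta_ji)."""
    delta, pi = np.asarray(delta), np.asarray(pi)
    return np.prod(pi**delta * (1 - pi)**(1 - delta))

delta = [1, 0, 1, 1]                      # hypothetical model: covariates 1, 3 and 4 included
print(model_prior(delta, [0.5] * 4))      # uniform over the 2^4 models: 0.0625
print(model_prior(delta, [0.2] * 4))      # pi_j < 0.5 penalizes larger models
```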

Summary

In this chapter, we have examined model comparison ideas.

Bayesian hypothesis testing is similar to classical testing in one-sided tests but often very different in two-sided tests.
Theoretical model comparison using Bayes factors is simple, but there are many practical complications.
We can use Bayesian model selection criteria.
We haven't said anything about goodness of fit.
