Chapter 5: Model selection
Conchi Ausín and Mike Wiper, Department of Statistics, Universidad Carlos III de Madrid
Master in Business Administration and Quantitative Methods; Master in Mathematical Engineering
Objective
In this class we consider problems where we have various contending models and show how these can be compared and also look at the possibilities of model averaging.
Basics
In principle, model selection is easy.
Suppose that we wish to compare two models, M1, M2. Then, define prior probabilities, P(M1) and P(M2) = 1 − P(M1). Given data, we can calculate the posterior probabilities via Bayes theorem:
$$P(M_1 \mid x) = \frac{f(x \mid M_1)\, P(M_1)}{f(x)}$$
where $f(x) = f(x \mid M_1) P(M_1) + f(x \mid M_2) P(M_2)$.
Note also that if model $M_i$ is parametric with parameters $\theta_i$, then
$$f(x \mid M_i) = \int f(x \mid M_i, \theta_i)\, f(\theta_i \mid M_i)\, d\theta_i .$$
Now consider the possible losses (negative utilities) associated with taking a wrong decision.
L(select M2|M1 true) and L(select M1|M2 true).
(These are something like type I and type II errors.)
Take the decision which minimizes the expected loss (Bayes decision).
Expected loss for choosing M1 is
P(M2|x)L(M1|M2)
and similarly for M2. If the loss functions are equal, then we just select the model with higher probability.
Setting L(M1|M2) = 0.05 and L(M2|M1) = 0.95 means we select M1 if P(M1|x) > 0.05.
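The decision rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `bayes_decision` and the posterior probability 0.3 are hypothetical choices made only for the example.

```python
def bayes_decision(p_m1, loss_m1_given_m2, loss_m2_given_m1):
    """Choose between two models by minimising the expected loss.

    p_m1             -- posterior probability P(M1|x)
    loss_m1_given_m2 -- L(select M1 | M2 true)
    loss_m2_given_m1 -- L(select M2 | M1 true)
    """
    p_m2 = 1.0 - p_m1
    exp_loss_m1 = p_m2 * loss_m1_given_m2  # expected loss of choosing M1
    exp_loss_m2 = p_m1 * loss_m2_given_m1  # expected loss of choosing M2
    choice = "M1" if exp_loss_m1 < exp_loss_m2 else "M2"
    return choice, exp_loss_m1, exp_loss_m2


# Hypothetical posterior probability P(M1|x) = 0.3.
choice_equal, _, _ = bayes_decision(0.3, 1.0, 1.0)      # equal losses
choice_unequal, _, _ = bayes_decision(0.3, 0.05, 0.95)  # losses from the text
print(choice_equal, choice_unequal)
```

With equal losses the rule reduces to picking the model with the higher posterior probability; with the 0.05 and 0.95 losses, M1 is kept whenever P(M1|x) exceeds 0.05.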
The coin tossing example again
Return to the coin tossing problem of Class I.
Suppose now that we wished to test the hypothesis H0 : θ = 0.5 versus the alternative H1 : θ ≠ 0.5.
Assume that P(H0) = P(H1) = 0.5 and that θ|H1 ∼ Beta(5, 5).
$$f(x \mid H_0) = \binom{12}{9} 0.5^{12}$$
$$f(x \mid H_1) = \int_0^1 \binom{12}{9} \theta^{9} (1-\theta)^{3}\, \frac{1}{B(5,5)}\, \theta^{5-1} (1-\theta)^{5-1}\, d\theta = \binom{12}{9} \frac{B(14,8)}{B(5,5)}$$
which implies that P(H0|x) ≈ 0.387. Therefore, we would reject H0 under equal loss functions, but would not if we used 0.05 and 0.95 losses for type I and type II errors as previously. Note that the p-value (under binomial sampling) is 0.0386 so in this case we reject H0 at a 5% level.
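As a check on the arithmetic, the two marginal likelihoods and the posterior probability of H0 can be computed with the standard library alone. This is a sketch under the assumptions stated in the example (9 heads and 3 tails, a Beta(5, 5) prior under H1, equal prior probabilities); the function names are illustrative.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def marginal_likelihood_ratio(heads, tails, a=5.0, b=5.0):
    """f(x|H0) / f(x|H1) for H0: theta = 0.5 and H1: theta ~ Beta(a, b)."""
    n = heads + tails
    # The binomial coefficient appears in both terms and cancels.
    log_m0 = n * log(0.5)
    log_m1 = log_beta(a + heads, b + tails) - log_beta(a, b)
    return exp(log_m0 - log_m1)


def posterior_prob_h0(ratio, prior_h0=0.5):
    """Convert a marginal likelihood ratio into P(H0|x)."""
    odds = ratio * prior_h0 / (1.0 - prior_h0)
    return odds / (1.0 + odds)


ratio = marginal_likelihood_ratio(9, 3)
p_h0 = posterior_prob_h0(ratio)
print(round(ratio, 4), round(p_h0, 3))
```

Working on the log scale with `lgamma` avoids overflow in the beta functions when the counts are large.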
Two sided hypothesis tests: a paradox
In two sided hypothesis tests, Bayesian and classical results can often differ greatly.
Consider a coin tossing problem where we observe 49,581 heads and 48,870 tails.
Then a classical two-tailed test of H0 : θ = 0.5 vs H1 : θ ≠ 0.5 gives a p-value of 0.0232 and H0 is clearly rejected. However, given the set up of the previous example, the posterior probability that H0 is true is P(H0|x) ≈ 0.89.
Given a uniform prior for θ under H1, this probability increases to 0.95.
On the contrary, for the one-sided test with H1 : θ > 0.5, the p-value is 0.0116 and the Bayesian posterior probability of H0 is P(H0|x) = 0.0117.
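The two-sided posterior probabilities quoted above can be reproduced numerically. The sketch below assumes equal prior probabilities for H0 and H1 and a Beta(a, b) prior for θ under H1, with Beta(1, 1) standing in for the uniform prior; the function name is illustrative.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def posterior_prob_point_null(heads, tails, a, b, prior_h0=0.5):
    """P(H0|x) for H0: theta = 0.5 against H1: theta ~ Beta(a, b)."""
    n = heads + tails
    # The binomial coefficient is common to both marginal likelihoods.
    log_m0 = n * log(0.5)
    log_m1 = log_beta(a + heads, b + tails) - log_beta(a, b)
    # Each marginal likelihood underflows on its own; their log ratio does not.
    ratio = exp(log_m0 - log_m1)
    odds = ratio * prior_h0 / (1.0 - prior_h0)
    return odds / (1.0 + odds)


p_beta55 = posterior_prob_point_null(49581, 48870, 5.0, 5.0)
p_uniform = posterior_prob_point_null(49581, 48870, 1.0, 1.0)
print(round(p_beta55, 2), round(p_uniform, 2))
```

The flatter the prior under H1, the more its mass is spread over values of θ that the data rule out, which is why the uniform prior favours H0 more strongly than the Beta(5, 5) prior.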
What if we have a lot of possible models?
In principle we can proceed as earlier, but inference will be sensitive to the selection of the prior probabilities P(Mi), and it is often difficult to define these in many contexts (e.g. variable selection in regression models).
We need a criterion which is less dependent on prior information.
The Bayes factor
The Bayes factor in favour of Mi and against model Mj is
$$B^i_j = \frac{P(M_i \mid x)}{P(M_j \mid x)} \cdot \frac{P(M_j)}{P(M_i)} .$$
This is the posterior odds divided by the prior odds. How does this get rid of the dependency on the priors? By Bayes theorem, the prior model probabilities cancel:
$$B^i_j = \frac{f(x \mid M_i)}{f(x \mid M_j)} .$$
If the models do not have any parameters, this is the likelihood ratio.
Otherwise, recall that the marginal likelihood, f (x|Mi ) depends on the prior f (θi |Mi ).
Example
In our coin tossing example,
$$B^0_1 = \frac{f(x \mid H_0)}{f(x \mid H_1)} = \frac{\binom{12}{9} 0.5^{12}}{\binom{12}{9} \frac{B(14,8)}{B(5,5)}} = 0.6309$$
How do we interpret this number?
Consistency and scales of evidence
It is clear that $0 \le B^i_j < \infty$.
When $B^i_j = 1$, the marginal likelihoods are the same, so the data provide equal evidence for both models.
If model $i$ is true, then $B^i_j \to \infty$, and if model $j$ is true, then $B^i_j \to 0$, as $n \to \infty$.
Kass and Raftery (1995) provide the following table for interpreting the Bayes factor.

Bayes factor    Interpretation
1 to 3          Not worth more than a bare mention
3 to 20         Positive
20 to 150       Strong
> 150           Very strong
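The table can be turned into a small lookup. This sketch makes two assumptions that the table itself does not state: a Bayes factor below 1 is inverted and read as evidence for the other model (since the Bayes factor for Mj against Mi is the reciprocal), and a value on a boundary such as 3 is assigned to the weaker category.

```python
def kass_raftery_label(bayes_factor):
    """Return (favoured model, strength of evidence) for a Bayes factor B^i_j."""
    if bayes_factor <= 0:
        raise ValueError("a Bayes factor must be positive")
    favoured = "M_i" if bayes_factor >= 1 else "M_j"
    # Read values below 1 on the reciprocal scale (assumption, see lead-in).
    b = bayes_factor if bayes_factor >= 1 else 1.0 / bayes_factor
    if b <= 3:
        strength = "Not worth more than a bare mention"
    elif b <= 20:
        strength = "Positive"
    elif b <= 150:
        strength = "Strong"
    else:
        strength = "Very strong"
    return favoured, strength


# The coin tossing example: B^0_1 = 0.6309, i.e. about 1.59 in favour of H1.
print(kass_raftery_label(0.6309))
```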
Relationship to classical model selection criteria
The Bayesian information criterion (Schwarz 1978) for evaluating a model, M is
$$\mathrm{BIC} = -2 \log f(x \mid \hat\theta, M) + k \log n$$
where $\hat\theta$ is the MLE and the parameters defined under this model, $\theta$, have dimension $k$.
Then, under certain regularity conditions, for two models $M_i$ and $M_j$, as $n \to \infty$,
$$\mathrm{BIC}_i - \mathrm{BIC}_j \to -2 \log B^i_j .$$
Problems with the use of Bayes factors I: philosophy
They require that one of the models is “true”. There is a true value of the parameter θ under this model. There is positive prior mass on this true value under this model.
When the true model is not included, what happens as n → ∞?
Problems with the use of Bayes factors II: calculation
Calculation of the Bayes factor is often tough. In order to calculate the Bayes factor, we need the marginal likelihoods. Outside of conjugate models, these are typically impossible to evaluate analytically. Various alternatives are available in the context of Gibbs sampling and MCMC.
Harmonic mean estimator
Consider an MCMC sample, θ(1), ..., θ(N). Then, we can estimate f (x|M) or f (x), (dropping dependence on M for notational convenience) as
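$$f(x) \approx \left( \frac{1}{N} \sum_{i=1}^{N} \frac{1}{f(x \mid \theta^{(i)})} \right)^{-1} .$$
This is consistent, as the expectation of what is being averaged is
$$\int \frac{1}{f(x \mid \theta)}\, f(\theta \mid x)\, d\theta = \int \frac{1}{f(x \mid \theta)}\, \frac{f(x \mid \theta) f(\theta)}{f(x)}\, d\theta = \frac{1}{f(x)} \int f(\theta)\, d\theta = \frac{1}{f(x)} .$$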
However, the estimator is highly unstable and can often have infinite variance.
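A minimal sketch of the harmonic mean estimator on the log scale follows. It takes the log likelihood values at the posterior draws as input and uses a log-sum-exp step for numerical stability. The illustration reuses the coin tossing data with a Beta(5, 5) prior, where the posterior is Beta(14, 8) and can be sampled directly, so independent draws stand in for an MCMC sample; the seed and sample size are arbitrary.

```python
import random
from math import comb, exp, log


def log_harmonic_mean_estimate(log_liks):
    """Estimate log f(x) from values of log f(x|theta_i) at posterior draws."""
    n = len(log_liks)
    neg = [-ll for ll in log_liks]  # logs of 1 / f(x|theta_i)
    top = max(neg)
    # log of the average of 1 / f(x|theta_i), computed stably.
    log_mean_inverse = top + log(sum(exp(v - top) for v in neg)) - log(n)
    return -log_mean_inverse


# Illustration: 9 heads and 3 tails, posterior Beta(14, 8) sampled directly.
rng = random.Random(0)  # arbitrary seed
draws = [rng.betavariate(14, 8) for _ in range(5000)]
log_liks = [log(comb(12, 9)) + 9 * log(t) + 3 * log(1 - t) for t in draws]
estimate = log_harmonic_mean_estimate(log_liks)
print(estimate)
```

Rerunning with different seeds gives a feel for the instability: a single draw with a very small likelihood can dominate the average of the reciprocals.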
Chib's (1995) approach
Recall Bayes theorem:
$$f(\theta \mid x) = \frac{f(x \mid \theta) f(\theta)}{f(x)}$$
so that, taking logarithms and reordering,
$$\log f(x) = \log f(x \mid \theta) + \log f(\theta) - \log f(\theta \mid x).$$
Suppose that we have run the Gibbs sampler and have a sensible posterior point estimate, say $\tilde\theta$. Then, we can typically calculate $\log f(\tilde\theta)$ and $\log f(x \mid \tilde\theta)$ analytically. How do we calculate $\log f(\tilde\theta \mid x)$?
Assume that $\theta = (\theta_1, \dots, \theta_k)$. Then, from the law of multiplication,
$$\log f(\tilde\theta \mid x) = \log f(\tilde\theta_1 \mid x) + \log f(\tilde\theta_2 \mid x, \tilde\theta_1) + \dots + \log f(\tilde\theta_k \mid x, \tilde\theta_1, \dots, \tilde\theta_{k-1}).$$
Firstly, if we run the Gibbs sampler again, and generate sample values $\theta^{(1)}, \dots, \theta^{(N)}$, we can calculate
$$\log f(\tilde\theta_1 \mid x) \approx \log \left( \frac{1}{N} \sum_{i=1}^{N} f(\tilde\theta_1 \mid x, \theta^{(i)}) \right).$$
Now fix $\theta_1 = \tilde\theta_1$ and run the Gibbs sampler again, generating a sample $\theta^{(i)}_{-1}$, for $i = 1, \dots, N$, to calculate
$$\log f(\tilde\theta_2 \mid x, \tilde\theta_1) \approx \log \left( \frac{1}{N} \sum_{i=1}^{N} f(\tilde\theta_2 \mid x, \theta^{(i)}_{-1}, \tilde\theta_1) \right).$$
Then run the Gibbs sampler again with $\theta_1 = \tilde\theta_1$ and $\theta_2 = \tilde\theta_2$ and so on.
The approach is fairly accurate.
It relies on all conditional distributions being available analytically.
Extensions to more general MCMC samplers are available.
The Gibbs sampler has to be run several times, so the approach can be very slow.
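The identity log f(x) = log f(x|θ) + log f(θ) − log f(θ|x) behind this approach can be checked in a conjugate model, where the posterior ordinate is known in closed form and no Gibbs runs are needed. The sketch below is therefore not the Gibbs-based estimator itself, only a check of the identity. It assumes the coin tossing data (9 heads, 3 tails) with a Beta(5, 5) prior; the function names are illustrative.

```python
from math import comb, lgamma, log


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_beta_pdf(theta, a, b):
    """Log density of a Beta(a, b) distribution at theta."""
    return (a - 1) * log(theta) + (b - 1) * log(1 - theta) - log_beta(a, b)


def log_marginal_via_identity(theta, heads, tails, a, b):
    """log f(x) = log f(x|theta) + log f(theta) - log f(theta|x)."""
    n = heads + tails
    log_lik = log(comb(n, heads)) + heads * log(theta) + tails * log(1 - theta)
    log_prior = log_beta_pdf(theta, a, b)
    log_post = log_beta_pdf(theta, a + heads, b + tails)  # conjugate posterior
    return log_lik + log_prior - log_post


# Closed-form log marginal likelihood for comparison.
exact = log(comb(12, 9)) + log_beta(14, 8) - log_beta(5, 5)
for theta_tilde in (0.3, 0.6, 0.9):
    print(theta_tilde, log_marginal_via_identity(theta_tilde, 9, 3, 5.0, 5.0), exact)
```

The identity holds at any θ, but when the posterior ordinate has to be estimated from simulation output, a high-density point such as the posterior mean or mode is the sensible choice because the density estimate is most accurate there.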
Software reliability example
Remember the software reliability example. Before, we assumed i.i.d. exponential failure times.
[Figure: total execution time plotted against failure count]
Inter-failure times are longer after more faults have been observed (and corrected?)
The Jelinski Moranda model
This model assumes N initial faults, each with rate θ, and that after each failure, the fault causing it is perfectly corrected.
$$X_i \mid \theta, N \sim \text{exponential}(\theta(N - i + 1)).$$
The likelihood is
$$f(x \mid \theta, N) \propto \frac{N!}{(N-m)!}\, \theta^{m} \exp\left( -\theta \sum_{i=1}^{m} (N - i + 1)\, x_i \right)$$
where $m$ is the number of observed failures.
Given an exponential(1) prior for $\theta$, the conditional posterior is gamma.
Given a Poisson(200) prior for $N$, the conditional posterior for $N - m$ is Poisson.
It is easy to run a Gibbs sampler.
The estimated log marginal likelihood for this model is approx. −983. The log marginal likelihood for the i.i.d. model is −1012.
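A Gibbs sampler for this model might look like the sketch below. The conditional distributions are worked out here from the stated likelihood and priors and are not quoted from the slides: θ | N, x is Gamma with shape m + 1 and rate 1 + Σ (N − i + 1) x_i, and N − m | θ, x is Poisson with mean 200 exp(−θ Σ x_i). The data are simulated with hypothetical values (150 faults, θ = 0.01, 100 observed failures), since the original failure times are not reproduced in this chapter. NumPy is assumed to be available.

```python
import numpy as np


def jm_gibbs(x, n_iter=2000, prior_mean_n=200.0, seed=0):
    """Gibbs sampler for the Jelinski Moranda model (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    m = len(x)
    total = x.sum()
    idx = np.arange(1, m + 1)
    n_faults = m + int(prior_mean_n)  # arbitrary starting value
    theta_draws = np.empty(n_iter)
    n_draws = np.empty(n_iter, dtype=int)
    for t in range(n_iter):
        # theta | N, x ~ Gamma(m + 1, rate = 1 + sum_i (N - i + 1) x_i)
        rate = 1.0 + np.sum((n_faults - idx + 1) * x)
        theta = rng.gamma(shape=m + 1, scale=1.0 / rate)
        # N - m | theta, x ~ Poisson(prior_mean_n * exp(-theta * sum_i x_i))
        n_faults = m + rng.poisson(prior_mean_n * np.exp(-theta * total))
        theta_draws[t] = theta
        n_draws[t] = n_faults
    return theta_draws, n_draws


# Simulated inter-failure times (hypothetical parameter values, not the real data).
true_n, true_theta, m_obs = 150, 0.01, 100
sim_rng = np.random.default_rng(1)
rates = true_theta * (true_n - np.arange(1, m_obs + 1) + 1)
x = sim_rng.exponential(scale=1.0 / rates)

theta_draws, n_draws = jm_gibbs(x)
print(theta_draws[500:].mean(), n_draws[500:].mean())  # discard 500 burn-in draws
```

Because both full conditionals are standard distributions, the same output could feed the conditional density evaluations needed for a marginal likelihood estimate.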
Problems with the use of Bayes factors III: existence
Return to the coin tossing problem and suppose that under H1, we use Haldane's prior, $f(\theta) \propto \frac{1}{\theta(1-\theta)}$. Then
$$f(x \mid H_1) \propto \int_0^1 \binom{12}{9} \theta^{9} (1-\theta)^{3}\, \theta^{-1} (1-\theta)^{-1}\, d\theta$$
but how can we get rid of the proportionality?
When we use improper priors for the parameters within any model, the Bayes factor is not defined!
Possible solution: quasi Bayes factors
Various quasi Bayes factors have been introduced to get round the problem of improper priors. All use some idea of dividing the data into a minimal training set and an evaluation set. The idea is to use the (smallest set of) training data to make the improper prior proper, and then to use the evaluation data as the sample to calculate a Bayes factor.
Intrinsic Bayes factors
Consider the example of normal data X |θ ∼ Normal(θ, 1) and suppose we wish to test H0 : θ ≤ 0 versus H1 : θ > 0. Assume we define uniform priors for θ conditional on each of these hypotheses. Suppose that we observe a sample of size n.
Note that given a single datum, say x1, then
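$$\theta \mid x_1, H_i \sim \text{truncated normal}(x_1, 1)$$
where the truncation is onto $\mathbb{R}^-$ in the case of $H_0$ and $\mathbb{R}^+$ in the case of $H_1$. Then, conditional on $x_1$, we can define a (partial) Bayes factor
$$B^0_1(x_{-1} \mid x_1) = \frac{\int_{-\infty}^{0} f(x \mid \theta, x_1)\, f(\theta \mid x_1, H_0)\, d\theta}{\int_{0}^{\infty} f(x \mid \theta, x_1)\, f(\theta \mid x_1, H_1)\, d\theta} = \frac{1 - \Phi(-x_1)}{\Phi(-x_1)} \cdot \frac{\Phi(-\sqrt{n}\,\bar{x})}{1 - \Phi(-\sqrt{n}\,\bar{x})}$$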
A problem is that this is clearly sensitive to the training sample chosen.
If x1 is an outlier, we could have problems. One possibility is to average over all possible training sets of size 1.
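Geometric IBF:
$$GIBF^0_1 = \left( \prod_{i=1}^{n} B^0_1(x_{-i} \mid x_i) \right)^{1/n} = \frac{\Phi(-\sqrt{n}\,\bar{x})}{1 - \Phi(-\sqrt{n}\,\bar{x})} \left( \prod_{i=1}^{n} \frac{1 - \Phi(-x_i)}{\Phi(-x_i)} \right)^{1/n}$$
Arithmetic IBF:
$$AIBF^0_1 = \frac{1}{n} \sum_{i=1}^{n} B^0_1(x_{-i} \mid x_i) = \frac{\Phi(-\sqrt{n}\,\bar{x})}{1 - \Phi(-\sqrt{n}\,\bar{x})} \cdot \frac{1}{n} \sum_{i=1}^{n} \frac{1 - \Phi(-x_i)}{\Phi(-x_i)}$$
Both methods lose some of the nice properties of the original Bayes factor.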
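Both averages are easy to compute for the normal example. In the sketch below the training-sample factor for observation x_i is taken as (1 − Φ(−x_i)) / Φ(−x_i), which is the ratio of the normalising constants of the two truncated normal distributions. The function names and the data are hypothetical, and the code assumes moderate values of x_i so that Φ(−x_i) does not underflow to zero.

```python
from math import erf, exp, log, sqrt


def phi_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def full_data_ratio(xs):
    """Phi(-sqrt(n) xbar) / (1 - Phi(-sqrt(n) xbar)), the full-data term."""
    n = len(xs)
    xbar = sum(xs) / n
    p0 = phi_cdf(-sqrt(n) * xbar)
    return p0 / (1.0 - p0)


def training_factor(x_i):
    """(1 - Phi(-x_i)) / Phi(-x_i), the term due to training datum x_i."""
    return (1.0 - phi_cdf(-x_i)) / phi_cdf(-x_i)


def geometric_ibf(xs):
    """Geometric mean of the partial Bayes factors over training sets of size 1."""
    mean_log = sum(log(training_factor(x)) for x in xs) / len(xs)
    return full_data_ratio(xs) * exp(mean_log)


def arithmetic_ibf(xs):
    """Arithmetic mean of the partial Bayes factors over training sets of size 1."""
    return full_data_ratio(xs) * sum(training_factor(x) for x in xs) / len(xs)


data = [0.3, -0.2, 1.1, 0.8]  # hypothetical sample
print(geometric_ibf(data), arithmetic_ibf(data))
```

By the inequality between arithmetic and geometric means, the arithmetic version is never smaller than the geometric one, so it is always at least as favourable to H0.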
Information criteria
We have seen that the Bayes factor is often very difficult to calculate. In such cases, information criteria may be preferred. The most well known classical information criteria are:
The Bayesian information criterion
$$\mathrm{BIC} = -2 \log f(x \mid \hat\theta, M) + k \log n$$
The Akaike information criterion
$$\mathrm{AIC} = -2 \log f(x \mid \hat\theta, M) + 2k.$$
Both criteria use the deviance, $D(\theta \mid x) = -2 \log f(x \mid \theta, M)$, plus a correction term to account for model complexity.
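Both criteria are one-line computations once the maximised log likelihood is available. The sketch below applies them to the coin tossing data (9 heads in 12 tosses), comparing the point hypothesis θ = 0.5 (k = 0) with a free θ (k = 1, MLE 0.75). Treating n as the number of tosses is an assumption made for the illustration.

```python
from math import comb, log


def bic(log_lik_hat, k, n):
    """Bayesian information criterion: deviance at the MLE plus k log n."""
    return -2.0 * log_lik_hat + k * log(n)


def aic(log_lik_hat, k):
    """Akaike information criterion: deviance at the MLE plus 2k."""
    return -2.0 * log_lik_hat + 2.0 * k


def binomial_log_lik(theta, heads, n):
    """Binomial log likelihood for `heads` successes in n tosses."""
    return log(comb(n, heads)) + heads * log(theta) + (n - heads) * log(1 - theta)


heads, n = 9, 12
log_lik_h0 = binomial_log_lik(0.5, heads, n)        # no free parameters
log_lik_h1 = binomial_log_lik(heads / n, heads, n)  # theta at its MLE

bic_h0, bic_h1 = bic(log_lik_h0, 0, n), bic(log_lik_h1, 1, n)
aic_h0, aic_h1 = aic(log_lik_h0, 0), aic(log_lik_h1, 1)
print(bic_h0, bic_h1, aic_h0, aic_h1)
```

Smaller values indicate the preferred model. BIC penalises the extra parameter by log n rather than 2, so it is the more conservative of the two once n exceeds e² ≈ 7.4.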
Linear models example