
Chapter 5: Model selection

Conchi Ausín and Mike Wiper
Universidad Carlos III de Madrid

Master in Business Administration and Quantitative Methods / Master in Mathematical Engineering

Objective

In this class we consider problems where we have various contending models, show how these can be compared, and look at the possibilities of model averaging.

Basics

In principle, model selection is easy.

Suppose that we wish to compare two models, M1 and M2. Then, define prior probabilities, P(M1) and P(M2) = 1 − P(M1). Given data x, we can calculate the posterior probabilities via Bayes theorem:

$$P(M_1 \mid x) = \frac{f(x \mid M_1)\, P(M_1)}{f(x)}$$

where $f(x) = f(x \mid M_1) P(M_1) + f(x \mid M_2) P(M_2)$.

Note also that if model $M_i$ is parametric with parameters $\theta_i$, then
$$f(x \mid M_i) = \int f(x \mid M_i, \theta_i)\, f(\theta_i \mid M_i)\, d\theta_i.$$

Now consider the possible losses (negative utilities) associated with taking a wrong decision.

L(select M2|M1 true) and L(select M1|M2 true).

(something like type I and type II errors). We take the decision which minimizes the expected loss (the Bayes decision).

Expected loss for choosing M1 is

$$P(M_2 \mid x)\, L(M_1 \mid M_2)$$

and similarly for M2. If the loss functions are equal, then we just select the model with the higher posterior probability.

Setting $L(M_1 \mid M_2) = 0.05$ and $L(M_2 \mid M_1) = 0.95$, we select $M_1$ if $0.05\,(1 - P(M_1 \mid x)) < 0.95\, P(M_1 \mid x)$, that is, if $P(M_1 \mid x) > 0.05$.
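As a minimal sketch (not from the slides), the following Python code computes posterior model probabilities from two marginal likelihood values and applies the loss-based decision rule; the numerical inputs are hypothetical.

```python
# A minimal sketch: posterior model probabilities and a Bayes decision
# under asymmetric losses. The marginal likelihood inputs are hypothetical.

def posterior_probs(ml1, ml2, prior1=0.5):
    """Posterior probabilities of two models from their marginal likelihoods."""
    prior2 = 1.0 - prior1
    fx = ml1 * prior1 + ml2 * prior2            # f(x) = sum_i f(x | Mi) P(Mi)
    return ml1 * prior1 / fx, ml2 * prior2 / fx

def bayes_decision(p1, loss_1_if_2=0.05, loss_2_if_1=0.95):
    """Choose the model that minimises the expected loss."""
    exp_loss_m1 = (1.0 - p1) * loss_1_if_2      # expected loss of selecting M1
    exp_loss_m2 = p1 * loss_2_if_1              # expected loss of selecting M2
    return "M1" if exp_loss_m1 < exp_loss_m2 else "M2"

p1, p2 = posterior_probs(ml1=2.4e-4, ml2=3.9e-4)  # hypothetical values
print(p1, bayes_decision(p1))                     # M1 selected whenever P(M1|x) > 0.05
```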

The coin tossing example again

Return to the coin tossing problem of Class I.

Suppose now that we wished to test the hypothesis $H_0 : \theta = 0.5$ versus the alternative $H_1 : \theta \neq 0.5$.

Assume that P(H0) = P(H1) = 0.5 and that θ|H1 ∼ Beta(5, 5).

$$f(x \mid H_0) = \binom{12}{9}\, 0.5^{12}$$
$$f(x \mid H_1) = \int_0^1 \binom{12}{9}\, \theta^9 (1 - \theta)^3\, \frac{1}{B(5,5)}\, \theta^{5-1} (1 - \theta)^{5-1}\, d\theta = \binom{12}{9}\, \frac{B(14, 8)}{B(5, 5)}$$

which implies that P(H0|x) ≈ 0.387. Therefore, we would reject H0 under equal loss functions, but would not if we used the 0.05 and 0.95 losses for type I and type II errors as previously. Note that the p-value (under binomial sampling) is 0.0386, so in this case we would reject H0 at the 5% level.
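These quantities are easy to reproduce numerically; here is a short Python sketch using scipy (not part of the original slides):

```python
# Sketch of the coin example: 12 tosses, 9 heads; H0: theta = 0.5,
# H1: theta ~ Beta(5, 5). Reproduces P(H0|x) approx 0.387.
from scipy.special import beta, comb

n, h = 12, 9
f_x_H0 = comb(n, h) * 0.5**n                               # binomial likelihood at 0.5
f_x_H1 = comb(n, h) * beta(5 + h, 5 + n - h) / beta(5, 5)  # beta-binomial marginal

p_H0 = f_x_H0 / (f_x_H0 + f_x_H1)                          # equal prior probabilities
print(round(f_x_H0 / f_x_H1, 4), round(p_H0, 3))           # 0.6309 and 0.387
```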

Two sided hypothesis tests: a paradox

In two sided hypothesis tests, Bayesian and classical results can often differ greatly.

Consider a coin tossing problem where we observe 49,581 heads and 48,870 tails.

Then a classical two-tailed test of H0 : θ = 0.5 vs H1 : θ ≠ 0.5 gives a p-value of 0.0232 and H0 is clearly rejected. However, given the set-up of the previous example, the posterior probability that H0 is true is P(H0|x) ≈ 0.89.

Given a uniform prior for θ under H1, this probability increases to 0.95.

In contrast, for the one-sided test with H1 : θ > 0.5, the p-value is 0.0116 and the Bayesian posterior probability of H0 is P(H0|x) = 0.0117.
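A sketch of the calculations behind this paradox (not from the slides; the two-tailed p-value uses a normal approximation, so it only approximately matches the 0.0232 quoted above):

```python
# Sketch: the paradox numbers for 49,581 heads and 48,870 tails.
# H0: theta = 0.5; under H1, theta ~ Beta(a, a), a = 5 or a = 1 (uniform).
import numpy as np
from scipy.special import betaln
from scipy.stats import norm

h, t = 49581, 48870
n = h + t

def post_prob_H0(a):
    """P(H0 | x) with P(H0) = P(H1) = 0.5, via log marginal likelihoods."""
    log_B01 = n * np.log(0.5) + betaln(a, a) - betaln(a + h, a + t)
    return 1.0 / (1.0 + np.exp(-log_B01))

z = (h - n / 2) / np.sqrt(n / 4)               # normal approximation to the binomial
print(2 * norm.sf(abs(z)))                     # two-tailed p-value, approx 0.023
print(post_prob_H0(5), post_prob_H0(1))        # approx 0.89 and 0.95
```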

What if we have a lot of possible models?

In principle we can proceed as earlier but ...

inference will be sensitive to the selection of the prior probabilities P(Mi), and ...
it is often difficult to define these in many contexts (e.g. variable selection in regression models).
We need a criterion which is less dependent on prior information.

The Bayes factor

The Bayes factor in favour of Mi and against model Mj is

$$B^i_j = \frac{P(M_i \mid x)}{P(M_j \mid x)} \cdot \frac{P(M_j)}{P(M_i)}.$$

This is the posterior odds divided by the prior odds. How does this get rid of the dependency on the prior model probabilities? Substituting $P(M_i \mid x) = f(x \mid M_i) P(M_i)/f(x)$, the prior probabilities cancel:
$$B^i_j = \frac{f(x \mid M_i)}{f(x \mid M_j)}.$$
If the models do not have any parameters, this is simply the likelihood ratio.

Otherwise, recall that the marginal likelihood, $f(x \mid M_i)$, depends on the prior $f(\theta_i \mid M_i)$.

Example

In our coin tossing example,

$$B^0_1 = \frac{f(x \mid H_0)}{f(x \mid H_1)} = \frac{\binom{12}{9}\, 0.5^{12}}{\binom{12}{9}\, \frac{B(14,8)}{B(5,5)}} = 0.6309$$

How do we interpret this number?

Consistency and scales of evidence

It is clear that $0 \le B^i_j < \infty$. When $B^i_j = 1$, the marginal likelihoods are the same, so the data provide equal evidence for both models. If model $i$ is true, then $B^i_j \to \infty$, and if model $j$ is true, then $B^i_j \to 0$, as $n \to \infty$.

Kass and Raftery (1995) provide the following table for interpreting the Bayes factor.

Bayes factor   Interpretation
1 to 3         Not worth more than a bare mention
3 to 20        Positive
20 to 150      Strong
> 150          Very strong

Relationship to classical model selection criteria

The Bayesian information criterion (Schwarz 1978) for evaluating a model, $M$, is

$$\mathrm{BIC} = -2 \log f(x \mid \hat\theta, M) + k \log n$$

where $\hat\theta$ is the MLE and the parameters defined under this model, $\theta$, have dimension $k$.

Then, under certain regularity conditions, as $n \to \infty$, for two models $M_i$ and $M_j$,
$$\mathrm{BIC}_i - \mathrm{BIC}_j \to -2 \log B^i_j.$$
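As an illustrative check (not in the slides), the sketch below compares the BIC difference with $-2 \log B^0_1$ for the coin model under a Beta(5,5) prior, using a large hypothetical sample; the agreement is only rough, since the result is asymptotic.

```python
# Illustrative check: BIC difference vs -2 log B01 for H0: theta = 0.5
# against H1: theta ~ Beta(5, 5), with hypothetical data (n = 10,000 tosses).
import numpy as np
from scipy.special import betaln

n, h = 10000, 5100
theta_hat = h / n

# Exact -2 log B01 (the binomial coefficient cancels between H0 and H1)
log_B01 = n * np.log(0.5) + betaln(5, 5) - betaln(5 + h, 5 + n - h)

# BICs, again dropping the common binomial coefficient: k = 0 under H0, k = 1 under H1
bic0 = -2 * n * np.log(0.5)
bic1 = -2 * (h * np.log(theta_hat) + (n - h) * np.log(1 - theta_hat)) + np.log(n)

print(bic0 - bic1, -2 * log_B01)   # similar order of magnitude; the result is asymptotic
```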

Problems with the use of Bayes factors I: philosophy

They require that:
one of the models is “true”;
there is a true value of the parameter θ under this model;
there is positive prior mass on this true value under this model.

When the true model is not included, what happens as n → ∞?

Problems with the use of Bayes factors II: calculation

Calculation of the Bayes factor is often tough. In order to calculate the Bayes factor, we need the marginal likelihoods. Outside of conjugate models, these are usually impossible to evaluate analytically. Various alternatives are available in the context of Monte Carlo simulation and MCMC.

The harmonic mean estimator

Consider an MCMC sample, $\theta^{(1)}, \ldots, \theta^{(N)}$. Then, we can estimate $f(x \mid M)$, or $f(x)$ (dropping dependence on $M$ for notational convenience), as

$$f(x) \approx \left( \frac{1}{N} \sum_{i=1}^{N} \frac{1}{f(x \mid \theta^{(i)})} \right)^{-1}.$$
This is consistent, as the expectation of what is being averaged is

$$\int \frac{1}{f(x \mid \theta)}\, f(\theta \mid x)\, d\theta = \int \frac{1}{f(x \mid \theta)} \cdot \frac{f(x \mid \theta)\, f(\theta)}{f(x)}\, d\theta = \frac{1}{f(x)} \int f(\theta)\, d\theta = \frac{1}{f(x)}.$$

However, the estimator is highly unstable and can often have infinite variance.
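To see the instability, here is a sketch (illustrative, not from the slides) for a conjugate normal model where the exact marginal likelihood is available in closed form; repeated harmonic mean estimates scatter widely around it.

```python
# Sketch: instability of the harmonic mean estimator for x_i ~ N(theta, 1)
# with prior theta ~ N(0, 1), where the exact marginal likelihood is known.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(0)
n = 30
x = rng.normal(0.5, 1.0, size=n)

# Exact: marginally, x ~ N(0, I + 11^T)
log_fx = multivariate_normal.logpdf(x, mean=np.zeros(n), cov=np.eye(n) + np.ones((n, n)))

# Harmonic mean estimates from exact posterior draws, theta | x ~ N(sum(x)/(n+1), 1/(n+1))
for _ in range(5):
    theta = rng.normal(x.sum() / (n + 1), np.sqrt(1 / (n + 1)), size=20000)
    loglik = norm.logpdf(x[None, :], theta[:, None], 1).sum(axis=1)
    log_hm = -(logsumexp(-loglik) - np.log(len(theta)))   # log of the harmonic mean
    print(log_hm, "vs exact", log_fx)                     # estimates scatter widely
```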

Chib's (1995) approach

Recall Bayes theorem:
$$f(\theta \mid x) = \frac{f(x \mid \theta)\, f(\theta)}{f(x)}$$
so that, taking logarithms and reordering,

$$\log f(x) = \log f(x \mid \theta) + \log f(\theta) - \log f(\theta \mid x).$$

Suppose that we have run the Gibbs sampler and have a sensible posterior point estimate, say $\tilde\theta$. Then, we can typically calculate $\log f(\tilde\theta)$ and $\log f(x \mid \tilde\theta)$ analytically. How do we calculate $\log f(\tilde\theta \mid x)$?

Assume that $\theta = (\theta_1, \ldots, \theta_k)$. Then, from the law of multiplication,

$$\log f(\tilde\theta \mid x) = \log f(\tilde\theta_1 \mid x) + \log f(\tilde\theta_2 \mid x, \tilde\theta_1) + \cdots + \log f(\tilde\theta_k \mid x, \tilde\theta_1, \ldots, \tilde\theta_{k-1}).$$

Firstly, if we run the Gibbs sampler and generate sample values $\theta^{(1)}, \ldots, \theta^{(N)}$, we can calculate

$$\log f(\tilde\theta_1 \mid x) \approx \log\left( \frac{1}{N} \sum_{i=1}^{N} f(\tilde\theta_1 \mid x, \theta^{(i)}_{-1}) \right)$$

Now fix $\theta_1 = \tilde\theta_1$ and run the Gibbs sampler again, generating a sample $\theta^{(i)}_{-1}$, for $i = 1, \ldots, N$, to calculate

$$\log f(\tilde\theta_2 \mid x, \tilde\theta_1) \approx \log\left( \frac{1}{N} \sum_{i=1}^{N} f(\tilde\theta_2 \mid x, \theta^{(i)}_{-(1,2)}, \tilde\theta_1) \right)$$

Then run the Gibbs sampler again with θ1 = θ˜1 and θ2 = θ˜2 and so on.
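As an illustration (not from the slides), here is a sketch of Chib's estimator for a simple two-block model, $x_i \sim N(\mu, 1/\tau)$ with independent normal and gamma priors. With only two blocks, the second conditional $f(\tilde\tau \mid x, \tilde\mu)$ is available exactly, so no second run is needed.

```python
# Sketch of Chib's estimator for x_i ~ N(mu, 1/tau) with priors
# mu ~ N(0, s0^2) and tau ~ Gamma(a0, rate b0). Both full conditionals are standard.
import numpy as np
from scipy.stats import gamma, norm

rng = np.random.default_rng(1)
x = rng.normal(1.0, 2.0, size=50)
n, s0, a0, b0 = len(x), 10.0, 2.0, 1.0

def gibbs(n_iter=6000):
    draws, tau = np.empty((n_iter, 2)), 1.0
    for i in range(n_iter):
        v = 1.0 / (n * tau + 1.0 / s0**2)                    # mu | tau, x is normal
        mu = rng.normal(v * tau * x.sum(), np.sqrt(v))
        tau = rng.gamma(a0 + n / 2, 1.0 / (b0 + 0.5 * ((x - mu)**2).sum()))  # tau | mu, x
        draws[i] = mu, tau
    return draws

draws = gibbs()[1000:]                                       # discard burn-in
mu_t, tau_t = draws.mean(axis=0)                             # point estimate (mu~, tau~)

# f(mu~ | x): average the normal full conditional over the tau draws
v = 1.0 / (n * draws[:, 1] + 1.0 / s0**2)
log_f_mu = np.log(np.mean(norm.pdf(mu_t, v * draws[:, 1] * x.sum(), np.sqrt(v))))

# f(tau~ | x, mu~): exact, since the second block's full conditional is gamma
log_f_tau = gamma.logpdf(tau_t, a0 + n / 2, scale=1.0 / (b0 + 0.5 * ((x - mu_t)**2).sum()))

log_lik = norm.logpdf(x, mu_t, 1.0 / np.sqrt(tau_t)).sum()
log_prior = norm.logpdf(mu_t, 0.0, s0) + gamma.logpdf(tau_t, a0, scale=1.0 / b0)
print("log f(x) approx:", log_lik + log_prior - log_f_mu - log_f_tau)
```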

Fairly accurate.
Relies on all conditional distributions being available analytically.
Extensions to more general MCMC samplers are available.
Requires running the Gibbs sampler several times, which can be very slow.

Software reliability example

Remember the software reliability example. Before, we assumed i.i.d. exponential failure times.

[Figure: total execution time plotted against failure count.]

Inter-failure times are longer after more faults have been observed (and corrected?)

The Jelinski-Moranda model

This model assumes N initial faults, each with rate θ, and that after each failure, the fault causing it is perfectly corrected.

$$X_i \mid \theta, N \sim \text{exponential}(\theta\, (N - i + 1)).$$

The likelihood is

$$f(x \mid \theta, N) \propto \frac{N!}{(N - m)!}\, \theta^m \exp\left( -\theta \sum_{i=1}^{m} (N - i + 1)\, x_i \right)$$
where m is the number of observed failures. Given an exponential(1) prior for θ, the conditional posterior of θ is gamma. Given a Poisson(200) prior for N, the conditional posterior for N − m is Poisson. It is easy to run a Gibbs sampler.
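A sketch of this Gibbs sampler in Python, using simulated failure times since the original data are not reproduced here (the true parameter values below are hypothetical):

```python
# Sketch: Gibbs sampler for the Jelinski-Moranda model. The original
# failure-time data are not reproduced here, so we simulate hypothetical data.
import numpy as np

rng = np.random.default_rng(2)
N_true, theta_true, m = 180, 3e-5, 130
x = rng.exponential(1.0 / (theta_true * (N_true - np.arange(m))))  # X_i ~ Exp(theta (N-i+1))
T, lam = x.sum(), 200.0                     # total time and Poisson prior mean for N

def gibbs(n_iter=10000):
    out, N = np.empty((n_iter, 2)), m
    for it in range(n_iter):
        rate = 1.0 + ((N - np.arange(m)) * x).sum()        # exp(1) prior adds 1 to the rate
        theta = rng.gamma(m + 1, 1.0 / rate)               # theta | N, x ~ Gamma
        N = m + rng.poisson(lam * np.exp(-theta * T))      # (N - m) | theta, x ~ Poisson
        out[it] = theta, N
    return out

draws = gibbs()[1000:]                      # discard burn-in
print("E[theta | x] approx", draws[:, 0].mean(), "; E[N | x] approx", draws[:, 1].mean())
```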

The estimated log marginal likelihood for this model is approximately −983; for the i.i.d. model it is −1012.

Problems with the use of Bayes factors III: existence

Return to the coin tossing problem and suppose that under $H_1$, we use Haldane's prior, $f(\theta) \propto \frac{1}{\theta(1 - \theta)}$. Then

$$f(x \mid H_1) \propto \int_0^1 \binom{12}{9}\, \theta^9 (1 - \theta)^3\, \theta^{-1} (1 - \theta)^{-1}\, d\theta$$
but how can we get rid of the proportionality? When we use improper priors for the parameters within any model, the Bayes factor is not defined!

Possible solution: quasi Bayes factors

Various quasi Bayes factors have been introduced to get round the problem of improper priors. All use some idea of dividing the data into a minimal training set and an evaluation set. The idea is to use the (smallest possible) training set to make the improper prior proper, and then to use the evaluation data as the sample with which to calculate a Bayes factor.

Intrinsic Bayes factors

Consider the example of normal data X |θ ∼ Normal(θ, 1) and suppose we wish to test H0 : θ ≤ 0 versus H1 : θ > 0. Assume we define uniform priors for θ conditional on each of these hypotheses. Suppose that we observe a sample of size n.

Note that given a single datum, say x1, then

θ|x1, Hi ∼ truncated normal(x1, 1)

where the truncation is onto $\mathbb{R}^-$ in the case of $H_0$ and $\mathbb{R}^+$ in the case of $H_1$. Then, conditional on $x_1$, we can define a (partial) Bayes factor

$$B^0_1(x_{-1} \mid x_1) = \frac{\int_{-\infty}^{0} f(x_{-1} \mid \theta)\, f(\theta \mid x_1, H_0)\, d\theta}{\int_{0}^{\infty} f(x_{-1} \mid \theta)\, f(\theta \mid x_1, H_1)\, d\theta} = \frac{1 - \Phi(-x_1)}{\Phi(-x_1)} \cdot \frac{\Phi(-\sqrt{n}\,\bar{x})}{1 - \Phi(-\sqrt{n}\,\bar{x})}$$

A problem is that this is clearly sensitive to the training sample chosen.

If x1 is an outlier, we could have problems. One possibility is to average over all possible training sets of size 1.

Geometric IBF:
$$\mathrm{GIBF}^0_1 = \left( \prod_{i=1}^{n} B^0_1(x_{-i} \mid x_i) \right)^{1/n} = \frac{\Phi(-\sqrt{n}\,\bar{x})}{1 - \Phi(-\sqrt{n}\,\bar{x})} \left( \prod_{i=1}^{n} \frac{1 - \Phi(-x_i)}{\Phi(-x_i)} \right)^{1/n}$$

Arithmetic IBF:

$$\mathrm{AIBF}^0_1 = \frac{1}{n} \sum_{i=1}^{n} B^0_1(x_{-i} \mid x_i) = \frac{\Phi(-\sqrt{n}\,\bar{x})}{1 - \Phi(-\sqrt{n}\,\bar{x})} \cdot \frac{1}{n} \sum_{i=1}^{n} \frac{1 - \Phi(-x_i)}{\Phi(-x_i)}$$
Both methods lose some of the nice properties of the original Bayes factor.
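A numerical sketch of both intrinsic Bayes factors for simulated data (illustrative, not from the slides):

```python
# Sketch: arithmetic and geometric intrinsic Bayes factors for
# H0: theta <= 0 vs H1: theta > 0, X_i ~ N(theta, 1), simulated data.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x = rng.normal(0.3, 1.0, size=20)
n, xbar = len(x), x.mean()

common = norm.cdf(-np.sqrt(n) * xbar) / (1 - norm.cdf(-np.sqrt(n) * xbar))
ratios = (1 - norm.cdf(-x)) / norm.cdf(-x)      # one correction term per training point

aibf = common * ratios.mean()                   # arithmetic IBF
gibf = common * np.exp(np.log(ratios).mean())   # geometric IBF
print(aibf, gibf)
```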

Information criteria

We have seen that the Bayes factor is often very difficult to calculate. In such cases, information criteria may be preferred. The best known classical information criteria are the Bayesian information criterion,

$$\mathrm{BIC} = -2 \log f(x \mid \hat\theta, M) + k \log n,$$

and the Akaike information criterion,

$$\mathrm{AIC} = -2 \log f(x \mid \hat\theta, M) + 2k.$$

Both criteria use the deviance, $D(\theta \mid x) = -2 \log f(x \mid \theta, M)$, plus a correction term to account for model complexity.

Linear models example

Suppose we have a linear model, $Y \sim \text{normal}\left(X\theta, \frac{1}{\tau} I\right)$. Then recall that the fitted values are $E[Y \mid \hat\theta] = HY$, where $H = X(X^T X)^{-1} X^T$ is the hat matrix.

Then the diagonal elements of $H$ satisfy the restriction $0 \le h_{ii} \le 1$ and $\sum_{i=1}^{n} h_{ii} = k$, the number of linear parameters in the model.

The elements $h_{ii}$ are influence measures. Higher values imply that $\hat\theta$ will change more if the $i$th datum is removed.

A Bayesian hat matrix

Suppose we introduce a normal prior distribution for $\theta$, say $\theta \sim \text{normal}\left(m, \frac{c}{\tau} V\right)$. Then the Bayesian hat matrix is now

$$H = X \left( \frac{1}{c} V^{-1} + X^T X \right)^{-1} X^T$$

and satisfies
$$E_{Y \mid \theta}[Y \mid \bar\theta] = E_Y[Y] + H\, (Y - E_Y[Y]).$$
The effective number of parameters for the linear model is then $p_D = \sum_{i=1}^{n} h_{ii}$.
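A small numerical sketch (hypothetical design matrix, V = I) showing how $p_D = \mathrm{tr}(H)$ moves towards $k$ as the prior becomes vaguer:

```python
# Sketch: effective number of parameters p_D = trace(H) for the Bayesian
# hat matrix, with a hypothetical design matrix and V = I.
import numpy as np

rng = np.random.default_rng(4)
n, k = 50, 4
X = rng.normal(size=(n, k))
V_inv = np.eye(k)

def p_d(c):
    """Trace of H = X (V^{-1}/c + X'X)^{-1} X'."""
    H = X @ np.linalg.solve(V_inv / c + X.T @ X, X.T)
    return np.trace(H)

for c in (0.1, 1.0, 100.0):
    print(c, p_d(c))        # p_D grows towards k = 4 as the prior becomes vaguer
```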

The deviance information criterion

The effective number of parameters in a model with data $x$ is defined in Spiegelhalter et al (2002) as $p_D = \bar{D} - D(\bar\theta)$, where $\bar\theta = E[\theta \mid x]$ and $\bar{D} = E[D(\theta \mid x) \mid x]$. Then the deviance information criterion is

$$\mathrm{DIC} = \bar{D} + p_D.$$

Equivalently, $\mathrm{DIC} = D(\bar\theta) + 2 p_D$. This is like a Bayesian AIC. For non-hierarchical linear models with a non-informative prior on $\theta$, DIC = AIC.
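A sketch of the DIC computation from posterior draws, for the simple model $x_i \sim N(\theta, 1)$ with a flat prior so that the posterior can be sampled directly (illustrative, not from the slides):

```python
# Sketch: DIC from posterior draws for x_i ~ N(theta, 1) with a flat prior,
# so that theta | x ~ N(xbar, 1/n) can be sampled directly.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
x = rng.normal(1.0, 1.0, size=40)
n = len(x)

draws = rng.normal(x.mean(), 1 / np.sqrt(n), size=20000)         # posterior draws
D = -2 * norm.logpdf(x[None, :], draws[:, None], 1).sum(axis=1)  # deviance per draw

D_bar = D.mean()
D_at_mean = -2 * norm.logpdf(x, draws.mean(), 1).sum()
p_D = D_bar - D_at_mean
print(p_D, D_bar + p_D)     # p_D close to 1 (one parameter); DIC = D_bar + p_D
```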

Advantages and disadvantages

Very easy to calculate from MCMC output.
Doesn’t matter if we don’t have a true model (unlike the Bayes factor or BIC).
Inconsistent (like AIC).
Only really useful in nested models.
pD can be negative.
Doesn’t work in latent variable models. For alternatives, see Celeux et al (2003).

Predictive performance

As well as comparing models in-sample, it is useful to assess their ability to predict new data. The standard way is to use the log-predictive score:

Divide the data into a training set xT and a prediction set, xP . Then calculate the mean log predictive score

$$-\frac{1}{n_P} \sum_{i : x_i \in x_P} \log f(x_i \mid x_T).$$

For time varying models, this can show which models predict better at different time periods. This can be very computationally intensive.
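As a concrete sketch (not from the slides), the following computes the mean log predictive score for a normal model with known variance and a flat prior, using a hypothetical 70/30 split:

```python
# Sketch: mean log predictive score for x_i ~ N(theta, 1) with a flat prior
# and a hypothetical 70/30 train/prediction split. The posterior predictive
# for a new observation is N(xbar_T, 1 + 1/n_T).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
data = rng.normal(0.7, 1.0, size=100)
x_T, x_P = data[:70], data[70:]

pred_sd = np.sqrt(1 + 1 / len(x_T))        # data noise plus parameter uncertainty
score = -norm.logpdf(x_P, x_T.mean(), pred_sd).mean()
print(score)                               # lower scores indicate better prediction
```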

Model averaging

Often, various models are available and will give similar fits but very different predictions. In such cases, we may wish to use model averaging to make predictions. In principle this is straightforward:

Given a set of models with well defined prior and posterior distributions, we can make predictions as

$$f(y \mid x) = \sum_{i=1}^{k} P(M_i \mid x)\, f(y \mid x, M_i).$$
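For instance, returning to the coin example, the model-averaged predictive probability of heads on the next toss can be sketched as follows (illustrative code, not from the slides):

```python
# Sketch: model-averaged predictive probability of heads on the next toss
# for the earlier coin example (12 tosses, 9 heads).
from scipy.special import beta, comb

n, h = 12, 9
f0 = comb(n, h) * 0.5**n                                 # marginal likelihood under H0
f1 = comb(n, h) * beta(5 + h, 5 + n - h) / beta(5, 5)    # under H1, theta ~ Beta(5, 5)
p0 = f0 / (f0 + f1)                                      # P(H0 | x), equal priors

pred_H0 = 0.5                                            # P(head | x, H0)
pred_H1 = (5 + h) / (10 + n)                             # E[theta | x, H1] = 14/22
print(p0 * pred_H0 + (1 - p0) * pred_H1)                 # model-averaged prediction
```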

Practical issues I: computing the sum

If we have lots of possible models, the full computation of the summation may be infeasible. Two possibilities:
Occam's window: exclude complex models if the data favour simpler models. For example, we can consider thresholding type ideas in regression problems.
MC³: use MCMC to directly approximate the summation.

We need to construct Metropolis-Hastings passes that propose moves through the model space.
For models of variable dimension this can be done using e.g. reversible jump moves.

Practical issues II: prior model probabilities

In regression type models with multiple possible regressors, we can consider priors for a model Mi of the form:

$$P(M_i) = \prod_{j=1}^{k} \pi_j^{\delta_{ji}} (1 - \pi_j)^{1 - \delta_{ji}}$$

where there are $k$ possible covariates and $\delta_{ji} = 1$ if covariate $j$ is included in model $M_i$ and 0 otherwise.

Setting πj = 0.5 gives a uniform prior over model space.

Setting πj < 0.5 penalizes large models.
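A tiny sketch of this prior (the inclusion vector below is hypothetical):

```python
# Sketch: prior probability of a model from covariate-inclusion indicators.
import numpy as np

def model_prior(delta, pi):
    """P(Mi) = prod_j pi_j^delta_ji (1 - pi_j)^(1 - delta_ji)."""
    delta, pi = np.asarray(delta), np.asarray(pi)
    return np.prod(pi**delta * (1 - pi)**(1 - delta))

delta = [1, 0, 1, 1]                      # hypothetical model: covariates 1, 3 and 4 included
print(model_prior(delta, [0.5] * 4))      # uniform over the 2^4 models: 0.0625
print(model_prior(delta, [0.2] * 4))      # pi_j < 0.5 penalizes larger models
```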

Summary

In this chapter, we have examined model comparison ideas.

Bayesian hypothesis testing is similar to classical testing in one-sided tests but often very different in two-sided tests.
Theoretical model comparison using Bayes factors is simple, but there are many practical complications.
We can use Bayesian model selection criteria.
We haven't said anything about goodness of fit.
