
Introduction to Bayesian thinking seminar

Rodrigo Díaz

[email protected]

Geneva Observatory, April 11th, 2016

Agenda (I)

• Part I. Basics.

• General concepts & history of Bayesian inference.

• Part II. Likelihoods.

• Constructing likelihoods.

• Part III. Priors.

• General concepts and problems. Subjective vs objective priors.

• Assigning priors.

Agenda (II)

• Part IV. Posterior.

• Conjugate priors. The right prior for a given likelihood.

• MCMC.

• Part V. Model comparison.

• Model evidence and Bayes factors. The philosophy of integrating over parameter space.

• Estimating the model evidence. Dos and Don’ts

• Point estimate: BIC.

• Importance sampling and a family of methods.

• Nested sampling.

Part I: Basics

[Fig. adapted from Gregory (2005): the loop of scientific inference. Deductive inference (predictions) leads from a testable hypothesis (theory) to observations (data); statistical inference (hypothesis testing, parameter estimation) leads from the data back to the theory.]

Statistical inference requires a theory of probability: frequentist or Bayesian.

Thomas Bayes (1701–1761). First appearance of the product rule (the basis for Bayes’ theorem; An Essay towards solving a Problem in the Doctrine of Chances).

p(Hi | I, D) = p(D | Hi, I) / p(D | I) · p(Hi | I)

Pierre-Simon Laplace (1749 – 1827) Wide application of the Bayes' rule. Principle of insufficient reason (non-informative priors). Primitive version of the Bernstein–von Mises theorem.

Laplace’s “inverse probability” is largely rejected for ~100 years. The reign of frequentism: Fisher, Pearson, etc.

Harold Jeffreys (1891–1989). Objective Bayesian probability revived. Jeffreys rule for priors.

(1940s–1960s) R. T. Cox, George Pólya, E. T. Jaynes. Plausible reasoning. Reasoning with uncertainty. Probability theory as an extension of Aristotelian logic. The product and sum rules deduced from basic principles. MAXENT priors.

See E. T. Jaynes, Probability Theory: The Logic of Science. http://www-biba.inrialpes.fr/Jaynes/prob.html

Statistical inference

[Fig. adapted from Gregory (2005): deductive inference (predictions) from a testable hypothesis (theory) to observations (data); statistical inference (hypothesis testing, parameter estimation) from the data back to the theory.]

The three desiderata

• Rules of deductive reasoning: strong syllogisms from Aristotelian logic.

• The brain works using plausible inference. (Woman in wood cabin story.)

• The rules for plausible reasoning are deduced from three simple desiderata on how the mind of a thinking robot should work (Cox-Pólya-Jaynes).

I. Degrees of plausibility are represented by real numbers.
II. Qualitative correspondence with common sense.
III. If a conclusion can be reasoned out in more than one way, then every possible way must lead to the same result.

“… if degrees of plausibility are represented by real numbers, then there is a uniquely determined set of quantitative rules for conducting inference. That is, any other rules whose results conflict with them will necessarily violate an elementary - and nearly inescapable - desideratum of rationality or consistency. But the final result was just the standard rules of probability theory, given already by Bernoulli and Laplace, so why all the fuss? The important new feature was that these rules were now seen as uniquely valid principles of logic in general, making no reference to ‘chance’ or ‘random variables’; so their range of application is vastly greater than had been supposed in the conventional probability theory that was developed in the early 20th century. As a result, the imaginary distinction between ‘probability theory’ and ‘statistical inference’ disappears, and the field achieves not only logical unity and simplicity, but far greater technical power and flexibility in applications.” (E. T. Jaynes)

The basic rules of probability theory

Sum rule: p(A + B) = p(A) + p(B) − p(AB)
Product rule: p(AB) = p(A | B) p(B) = p(B | A) p(A)

“Aristotelian deductive logic is the limiting form of our rules for plausible reasoning.” (E. T. Jaynes)

Once these rules are recognised as general ways of manipulating any proposition, a world of applications opens up.

The Bayesian recipe

Sum rule: p(A + B) = p(A) + p(B) − p(AB)
Product rule: p(AB) = p(A | B) p(B) = p(B | A) p(A)
Bayes’ theorem: p(A | B) = p(B | A) p(A) / p(B)

The Bayesian recipe

Law of total probability, for {Bi} exclusive and exhaustive:

p(A | I) = Σ_i p(A | Bi, I) p(Bi | I)

[Diagram: the hypothesis space partitioned into B1, …, B6, with proposition A overlapping several of the Bi.]

The Bayesian recipe

Sum rule: p(A + B) = p(A) + p(B) − p(AB)
Product rule: p(AB) = p(A | B) p(B) = p(B | A) p(A)
Bayes’ theorem: p(A | B) = p(B | A) p(A) / p(B)

Law of total probability, for {Bi} exclusive and exhaustive:

p(A | I) = Σ_i p(A | Bi, I) p(Bi | I)

The Bayesian recipe

Sum rule: p(A + B) = p(A) + p(B) − p(AB)
Product rule: p(AB) = p(A | B) p(B) = p(B | A) p(A)
Bayes’ theorem: p(A | B) = p(B | A) p(A) / p(B)

Law of total probability, for {Bi} exclusive and exhaustive:

p(A | I) = Σ_i p(A | Bi, I) p(Bi | I)

Normalisation, for {Bi} exclusive and exhaustive:

Σ_i p(Bi | I) = 1

"Bayes", "Bayesian", MCMC

Two basic tasks of statistical inference

Learning process (parameter estimation)

Decision making (model comparison)

Learning process

Bayesian probability represents a state of knowledge.

(θ̄: parameter vector; Hi: hypothesis; I: information; D: data)

Continuous space (parameter space): prior p(θ̄ | Hi, I) → posterior p(θ̄ | D, Hi, I)

Discrete space (hypothesis space): prior p(Hi | I) → posterior p(Hi | I, D)

Learning process

Enter the likelihood.

p(θ̄ | Hi, I, D) = p(D | θ̄, Hi, I) / p(D | Hi, I) · p(θ̄ | Hi, I)

(Posterior = Likelihood / evidence × Prior)

p(θ̄ | D, Hi, I) ∝ L_θ̄(Hi) · p(θ̄ | Hi, I)

The proportionality constant has many names: marginal likelihood, global likelihood, model evidence, prior predictive. Hard to compute. Important.

Optimising the learning process
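A minimal grid-based sketch of this learning step, posterior ∝ likelihood × prior, with invented Gaussian shapes (a broad prior and a selective likelihood):

```python
import numpy as np

# Grid-based posterior update: posterior ∝ likelihood × prior.
# The prior and likelihood widths/centres below are invented for illustration.
theta = np.linspace(-5.0, 5.0, 2001)
dtheta = theta[1] - theta[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

prior = gauss(theta, 0.0, 2.0)        # broad prior
likelihood = gauss(theta, 1.0, 0.3)   # selective likelihood -> data-dominated case

unnorm = likelihood * prior
evidence = unnorm.sum() * dtheta      # the normalising constant p(D | H, I)
posterior = unnorm / evidence

map_est = theta[np.argmax(posterior)]
print(round(map_est, 2))              # posterior peak sits close to the likelihood peak
```

Swapping the widths of the two Gaussians reproduces the opposite, prior-dominated case.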

• The likelihood needs to be selective for the learning process to be effective.

[Figure 1.2, adapted from P. Gregory (2005): Bayes’ theorem provides a model of the inductive learning process. The posterior PDF (lower graphs) is proportional to the product of the prior PDF and the likelihood function (upper graphs). Two extreme cases: (a) the prior much broader than the likelihood (posterior dominated by the data); (b) the likelihood much broader than the prior (posterior dominated by prior information).]

Two extreme cases are shown in Figure 1.2. In the first, panel (a), the prior is much broader than the likelihood. In this case, the posterior PDF is determined entirely by the new data. In the second extreme, panel (b), the new data are much less selective than our prior information and hence the posterior is essentially the prior.

Now suppose we acquire more data represented by proposition D2. We can again apply Bayes’ theorem to compute a posterior that reflects our new state of knowledge about the parameter. This time our new prior, I′, is the posterior derived from D1, I; i.e., I′ = D1, I. The new posterior is given by

p(H0 | D2, I′) ∝ p(H0 | I′) p(D2 | H0, I′)    (1.13)

1.3.4 Example of the use of Bayes’ theorem

Here we analyze a simple model comparison problem using Bayes’ theorem. We start by stating our prior information, I, and the new data, D. I stands for:

a) Model M1 predicts a star’s distance, d1 = 100 light years (ly).
b) Model M2 predicts a star’s distance, d2 = 200 ly.
c) The uncertainty, e, in distance measurements is described by a Gaussian distribution of the form …

Decision making

Bayes’ theorem is also the basis for model comparison

p(Hi | I, D) = p(D | Hi, I) / p(D | I) · p(Hi | I)

but now p(D | Hi, I) = ∫ p(D | θ̄, Hi, I) p(θ̄ | Hi, I) dⁿθ

Computation of the evidence cannot be escaped.

D = D_RV D_transit D_astrometry … D_N

For N independent datasets: p(D | H, I) = p(E1, E2, …, EN | H, I) = Π_{i=1}^{N} p(Ei | H, I)

Decision making

L_θ̄(Hi) = p(D | θ̄, Hi, I) ∝ exp(−χ²/2) (indep., Gaussian errors)

Model comparison consists in computing the ratio of the posteriors (odds ratio) of two competing hypotheses:

p(Hi | I, D) / p(Hj | I, D) = [p(D | Hi, I) / p(D | Hj, I)] · [p(Hi | I) / p(Hj | I)]

(Bayes factor × prior odds)

Parameter estimation:

p(θ̄ | Hi, I, D) = p(D | θ̄, Hi, I) / p(D | Hi, I) · p(θ̄ | Hi, I)

p(θ̄ | D, Hi, I) ∝ p(D | θ̄, Hi, I) · p(θ̄ | Hi, I) = L_θ̄(Hi) · p(θ̄ | Hi, I)

Model comparison:

p(Hi | D, I) ∝ p(D | Hi, I) · p(Hi | I)

p(Hi | I, D) / p(Hj | I, D) = [p(D | Hi, I) / p(D | Hj, I)] · [p(Hi | I) / p(Hj | I)]

Part II: Likelihoods

(θ̄: parameter vector; Hi: hypothesis; I: information; D: data)

The likelihood function

p(θ̄ | Hi, I, D) = p(D | θ̄, Hi, I) / p(D | Hi, I) · p(θ̄ | Hi, I)

(Posterior = Likelihood / evidence × Prior)

p(θ̄ | D, Hi, I) ∝ L_θ̄(Hi) · p(θ̄ | Hi, I)

The likelihood is the probability of obtaining the data D, for a given prior information I and a set of parameters θ̄.

Remember: the likelihood is not a probability distribution for the parameter vector θ̄ (for that you need the prior).

Optimising the learning process

• The likelihood needs to be selective for the learning process to be effective.

[Figure 1.2, adapted from P. Gregory (2005): the posterior PDF is proportional to the product of the prior PDF and the likelihood function; in case (a) the posterior is dominated by the data, in case (b) by the prior information.]

(θ̄: parameter vector; Hi: hypothesis; I: information; D: data)

Likelihood function

Ingredients

Physical model | Statistical (non-deterministic) model | Error statistics

• Physical model: analytic models, simulations, …
• Statistical (non-deterministic) model: unknown errors (jitter), instrument systematics, complex physics (activity, …), …
• Error statistics: non-Gaussianity, …

p(D | θ̄, Hi, I) = L_θ̄(Hi) ∝ exp(−χ²/2) (indep., Gaussian errors)

Constructing the likelihood

The data: D = D1 D2 … Dn = {Di}

Di: the statement “the i-th measurement is in the infinitesimal range yi to yi + dyi”.

The errors:

Ei: the statement “the i-th error is in the infinitesimal range ei to ei + dei”.

p(Ei | θ, H, I) = f_E(ei), the probability distribution of statement Ei.

Most used f_E: f_E(ei) = N(0, σi²)

The model:

Mi: the statement “the i-th model value is in the infinitesimal range mi to mi + dmi”.

p(Mi | θ, H, I) = f_M(mi), the probability distribution of statement Mi.

Constructing the likelihood

The data: D = D1 D2 … Dn = {Di}

We want to build the probability distribution: p(D | θ, H, I) = p(D1, D2, …, Dn | θ, H, I)

Write: yi = mi + ei

It can be shown that (the convolution equation):

p(Di | θ, H, I) = ∫ dmi f_M(mi) f_E(yi − mi)

Constructing the likelihood
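A numerical sketch of the convolution equation, with hypothetical Gaussian choices for both f_M and f_E, so the result can be checked against the analytic convolution of two Gaussians:

```python
import numpy as np

# Numerical sketch of  p(Di | θ, H, I) = ∫ dm f_M(m) f_E(y - m)
# with Gaussian f_M and f_E (all parameters invented for illustration).
m = np.linspace(-10.0, 10.0, 4001)
dm = m[1] - m[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

f_M = gauss(m, 2.0, 0.5)     # non-deterministic model-value distribution
sigma_e = 1.0                # Gaussian error distribution f_E = N(0, sigma_e^2)

def p_datum(y):
    """Density of measuring y, marginalised over the model value m."""
    return np.sum(f_M * gauss(y - m, 0.0, sigma_e)) * dm

# The convolution of two Gaussians is a Gaussian with the variances added:
expected = gauss(2.0, 2.0, np.sqrt(0.5**2 + sigma_e**2))
print(abs(p_datum(2.0) - expected) < 1e-6)   # True
```

Replacing f_M by a narrow spike recovers the deterministic-model limit discussed next.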

p(Di | θ, H, I) = ∫ dmi f_M(mi) f_E(yi − mi)

But for a deterministic model, mi is obtained from a (usually analytic) function f without any uncertainty (say, a Keplerian curve for RV measurements):

mi = f(xi | θ)

f_M(mi) = δ(mi − f(xi | θ))

Then:

p(Di | θ, H, I) = ∫ dmi δ(mi − f(xi | θ)) f_E(yi − mi) = f_E(yi − f(xi | θ)) = p(Ei | θ, H, I)

p(D | θ, H, I) = p(D1, D2, …, Dn | θ, H, I) = p(E1, E2, …, En | θ, H, I)

Constructing the likelihood

p(D | θ, H, I) = p(D1, D2, …, Dn | θ, H, I) = p(E1, E2, …, En | θ, H, I)

Now, for independent errors:

p(D | θ, H, I) = p(E1, E2, …, En | θ, H, I) = p(E1 | θ, H, I) … p(En | θ, H, I) = Π_{i=1}^{n} p(Ei | θ, H, I)

And if, in addition, the errors are distributed normally:

p(D | θ, H, I) ∝ exp(−χ²/2) (indep., Gaussian errors)

Constructing the likelihood
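A minimal sketch of this independent-Gaussian likelihood, for an invented linear model f(x | θ) = θ0 + θ1 x and made-up data:

```python
import numpy as np

# Independent Gaussian errors:  p(D | θ, H, I) ∝ exp(-χ²/2).
# The linear model f(x | θ) = θ0 + θ1·x and all data values are invented.

def log_likelihood(theta, x, y, sigma):
    model = theta[0] + theta[1] * x                       # f(x_i | θ)
    chi2 = np.sum(((y - model) / sigma) ** 2)
    norm = -np.sum(np.log(sigma * np.sqrt(2.0 * np.pi)))  # Gaussian normalisation
    return norm - 0.5 * chi2

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 1.2, 1.9, 3.1])
sigma = np.full(4, 0.2)

good = log_likelihood(np.array([0.0, 1.0]), x, y, sigma)
bad = log_likelihood(np.array([5.0, -2.0]), x, y, sigma)
print(good > bad)   # True: parameters near the data are favoured
```

Working in log space avoids underflow when many points (or small σi) are involved.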

Back to the convolution equation:

p(Di | θ, H, I) = ∫ dmi f_M(mi) f_E(yi − mi)

For a non-deterministic model, Mi is distributed:

Mi: the statement “the i-th model value is in the infinitesimal range mi to mi + dmi”.

p(Mi | θ, H, I) = f_M(mi), the probability distribution of statement Mi.

E.g., adding instrumental error / resolution: f_M(mi) = N(f(xi | θ), σ_inst²)

Part III: Priors

[Cartoon from xkcd.com]

Prior

p(Hi | I) (Hi: hypothesis, can be continuous; I: information)

• Prior information I is always present:

• The term “prior” does not necessarily mean “earlier in time”.

• Philosophical controversy on how to assign priors.

• Subjective vs. objective views.

• No single universal rule, but a few accepted methods.

• Informative priors. Usually based on the output from previous observations. (What was the prior of the first analysis?).

• Ignorance priors. Required as a starting point for the theory.

Assigning ignorance priors

1. Principle of indifference.

Given n mutually exclusive, exhaustive hypotheses {Hi}, with i = 1, …, n, the PoI states:

p(Hi | I) = 1/n

Assigning ignorance priors

2. Transformation groups. Location and scale parameters.

For a certain type of parameters (location and scale), “total ignorance” can be represented as invariance under a certain (group of) transformation.

Location: “position of the highest tree along a river.” The problem must be invariant under a translation:

X′ = X + c

p(X | I) dX = p(X′ | I) dX′ = p(X′ | I) d(X + c) = p(X′ | I) dX

⇒ p(X | I) = constant. Uniform prior.

2. Transformation groups. Location and scale parameters.

For a certain type of parameters (location and scale), “total ignorance” can be represented as invariance under a certain (group of) transformation.

Scale: “lifetime of a new bacterium” or “Poisson rate.” The problem must be invariant under scaling:

X′ = aX

p(X | I) dX = p(X′ | I) dX′ = p(X′ | I) d(aX) = a p(X′ | I) dX

⇒ p(X | I) = constant / X. “Jeffreys” prior.

Assigning ignorance priors

3. Jeffreys rule.

Besides location and scale parameters, little more can be said using transformation invariance.

Jeffreys priors use the Fisher information; parameterisation invariant, but strange behaviour in many dimensions.

Observed Fisher information: I_D = −d² log L_D / dθ²

But D is not known when we have to define a prior. Use the expectation value over D:

I(θ) = −E_D[d² log L_D / dθ²]

Assigning ignorance priors

3. Jeffreys rule says:

p(θ | I) ∝ √I(θ)

Examples:

• Mean μ of a Normal distribution with known σ²: p(μ | σ², I) ∝ constant

• Rate λ of a Poisson distribution: p(λ | I) ∝ 1/√λ

• Exercise: scale σ of a Normal with known mean value?

Assigning ignorance priors

3. Jeffreys rule:

p(θ | I) ∝ √I(θ), with I(θ) = −E_D[d² log L_D / dθ²]

• Favours parts of parameter space where the data provide more information.

• Is invariant under reparametrisation.

• Works fine only in one dimension…

See more examples here: en.wikipedia.org/wiki/Jeffreys_prior

Assigning ignorance priors
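Jeffreys’ rule can be checked numerically. The sketch below (grid size, finite-difference step, and test values are all chosen for illustration) estimates the expected Fisher information of a Poisson rate λ and recovers I(λ) = 1/λ, i.e. p(λ | I) ∝ 1/√λ:

```python
import math

# Numeric check of Jeffreys' rule for the Poisson rate lambda:
# I(lam) = -E_k[ d^2 log p(k | lam) / d lam^2 ]  should equal 1/lam,
# so p(lam | I) ∝ sqrt(I(lam)) = 1/sqrt(lam).

def log_pmf(k, lam):
    return -lam + k * math.log(lam) - math.lgamma(k + 1)

def fisher_information(lam, h=1e-4, kmax=200):
    total = 0.0
    for k in range(kmax):
        w = math.exp(log_pmf(k, lam))                    # Poisson probability of k
        d2 = (log_pmf(k, lam + h) - 2.0 * log_pmf(k, lam)
              + log_pmf(k, lam - h)) / h**2              # finite-difference d^2/dlam^2
        total += w * d2
    return -total

for lam in (0.5, 2.0, 10.0):
    print(round(fisher_information(lam), 4), round(1.0 / lam, 4))
```

The same recipe (expectation of a finite-difference second derivative) applies to any one-parameter likelihood.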

4. Maximum entropy.

Uses information theory to define a measure of the information content of a distribution:

S = −Σ_{i=1}^{n} p_i log(p_i)

• Maximise entropy subject to constraints (use Lagrange multipliers).

• Defined strictly for discrete distributions.

Assigning ignorance priors
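A minimal sketch of the MAXENT procedure for Jaynes’ dice problem: maximise S subject to a mean constraint, which via Lagrange multipliers gives p_i ∝ exp(−λ i); the multiplier is solved here by bisection (faces and target mean chosen for illustration):

```python
import math

# Maximum-entropy distribution for a six-sided die constrained to a given
# mean (Jaynes' "Brandeis dice" problem).  The Lagrange-multiplier solution
# is p_i ∝ exp(-lam * i); we solve for lam so the constraint is met.

faces = [1, 2, 3, 4, 5, 6]

def maxent_pmf(lam):
    w = [math.exp(-lam * i) for i in faces]
    z = sum(w)                                # partition function
    return [wi / z for wi in w]

def mean(p):
    return sum(i * pi for i, pi in zip(faces, p))

def solve(target_mean, lo=-50.0, hi=50.0):
    for _ in range(200):                      # bisection on the multiplier
        mid = 0.5 * (lo + hi)
        if mean(maxent_pmf(mid)) > target_mean:
            lo = mid                          # mean decreases as lam grows
        else:
            hi = mid
    return maxent_pmf(0.5 * (lo + hi))

p = solve(4.5)                                # constraint: <i> = 4.5
print([round(pi, 3) for pi in p])             # probabilities rise towards face 6
```

With a mean above 3.5 the multiplier comes out negative, so the MAXENT distribution tilts towards the high faces rather than staying uniform.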

5. Reference priors.

Recent development in the field (Bernardo; Bernardo, Berger, & Sun).

Similar to MAXENT, but maximise the Kullback-Leibler divergence between prior and posterior.

• Priors which maximise the expected difference with the posterior.

• Rigorous definition is complicated and subtle.

• For 1-D, reference priors are the Jeffreys priors.

• In multi-D, reference priors behave better than Jeffreys priors.

Agenda (II)

• Part IV. Posterior.

• Conjugate priors. The right prior for a given likelihood.

• MCMC.

• Part V. Model comparison.

• Model evidence and Bayes factor. The philosophy of integrating over parameter space.

• Estimating the model evidence. Dos and Don’ts

• BIC, Chib & Jeliazkov, etc.

• Importance sampling and a family of methods.

Part IV: Sampling the posterior / MCMC

Sampling from the posterior

p(θ̄ | D, Hi, I) ∝ L_θ̄(Hi) · p(θ̄ | Hi, I)

• The posterior distribution is proportional to the likelihood times the prior.

• The normalising constant (called model evidence, marginal likelihood, etc.) is of importance when comparing different models.

• The posterior contains all the information on a given model that a Bayesian can get, for a given set of priors and data.

• The posterior is only analytic in a few cases:

• Conjugate priors.

• Other methods needed to sample from posterior.

Remember: most Bayesian computations can be reduced to expectation values with respect to the posterior.

(θ̄: parameter vector; Hi: hypothesis; I: information; D: data)

p(θ̄ | D, Hi, I) ∝ L_θ̄(Hi) · p(θ̄ | Hi, I)

Metropolis-Hastings

1. Choose a starting point θ̄0 and evaluate L_θ̄0 · p(θ̄0 | I).
2. Create a proposal point θ̄1 from the proposal density q(θ̄1 | θ̄0).
3. Evaluate L_θ̄1 · p(θ̄1 | I).
4. Compute the ratio r = [L_θ̄1 · p(θ̄1 | I) · q(θ̄0 | θ̄1)] / [L_θ̄0 · p(θ̄0 | I) · q(θ̄1 | θ̄0)].
5. Accept the proposal with probability min(1, r); otherwise stay at the current point. Repeat from step 2.

Markov Chain Monte Carlo

Algorithms: Metropolis-Hastings, slice sampling, Hybrid Monte Carlo, …

Codes: pymc, emcee, kombine, cobmcmc, …

5 minutes to LEARN

a lifetime to MASTER

But beware: non-convergence, bad mixing. The dark side of MCMC are they.
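A minimal random-walk Metropolis sampler in Python (a sketch: the target density, step size, seed, and chain length are all invented for illustration; with a symmetric proposal the q-ratio cancels and r reduces to a ratio of unnormalised posteriors):

```python
import numpy as np

# Minimal random-walk Metropolis sampler.  The acceptance test is done in
# log space for numerical stability.
rng = np.random.default_rng(42)

def log_post(theta):
    # Hypothetical unnormalised log-posterior: a standard Gaussian.
    return -0.5 * theta**2

def metropolis(log_p, theta0, n_steps, step=1.0):
    chain = np.empty(n_steps)
    theta, lp = theta0, log_p(theta0)
    for i in range(n_steps):
        prop = theta + step * rng.normal()        # proposal from symmetric q
        lp_prop = log_p(prop)
        # Accept with probability min(1, r), r = posterior ratio:
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        chain[i] = theta                          # on rejection, repeat old point
    return chain

chain = metropolis(log_post, theta0=0.0, n_steps=20000)
print(chain[5000:].mean(), chain[5000:].std())    # ≈ 0 and ≈ 1 after burn-in
```

Rejected proposals must be recorded as repeats of the current point; silently dropping them biases the chain.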

Problems with correlations and multi-modal distributions.

The problem with correlations

If parameters exhibit correlations, then step size must be small to reach the demanded fraction of accepted jumps.

[Figure: a random walk in a correlated two-dimensional posterior, showing rejected and accepted proposals.]

A very long chain is needed to explore the entire posterior. Or, more importantly, the entire posterior will not be explored thoroughly (i.e. artificially reduced error bars!).

MCMC: the Good, the Bad, and the Ugly

Visual inspection of traces.

Good

Marginal mixing

No convergence

MCMC: the Good, the Bad, and the Ugly

Evaluation of the chain auto-correlations

Image credit: E. Ford

Multi-modal posteriors

Difficult for the sampler to move across modes if separated by a low-probability barrier. (“Has this chain converged?”)

Image credit: E. Ford

Multi-modal posteriors

Run as many chains as possible starting from significantly different places in the prior space.

Be paranoid! You can always be missing modes.
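The multi-chain advice can be automated with a crude between-chain vs. within-chain variance comparison, in the spirit of the Gelman-Rubin statistic (the bimodal target, starting points, and chain settings below are all invented for illustration):

```python
import numpy as np

# Sketch: start several chains at different points and compare them.  A crude
# Gelman-Rubin-style ratio far from 1 signals that the chains are not
# sampling the same distribution (here: chains stuck in separate modes).
rng = np.random.default_rng(1)

def log_post(x):
    # Hypothetical bimodal target: mixture of two well-separated Gaussians.
    return np.logaddexp(-0.5 * (x - 5.0)**2, -0.5 * (x + 5.0)**2)

def run_chain(x0, n=5000, step=0.5):
    out = np.empty(n)
    x, lp = x0, log_post(x0)
    for i in range(n):
        prop = x + step * rng.normal()
        lpp = log_post(prop)
        if np.log(rng.uniform()) < lpp - lp:
            x, lp = prop, lpp
        out[i] = x
    return out

chains = np.array([run_chain(x0) for x0 in (-6.0, -1.0, 1.0, 6.0)])
within = chains.var(axis=1, ddof=1).mean()            # within-chain variance
between = chains.mean(axis=1).var(ddof=1)             # variance of chain means
r_hat = np.sqrt((within + between) / within)
print(r_hat)   # far above 1: chains stuck in different modes
```

A ratio near 1 is necessary but not sufficient: if every chain starts in the same basin, the diagnostic cannot reveal the missing modes, hence the paranoia.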

Part V: Model comparison

Decision making

Bayes’ theorem is also the basis for model comparison

p(Hi | I, D) = p(D | Hi, I) / p(D | I) · p(Hi | I)

but now p(D | Hi, I) = ∫ p(D | θ̄, Hi, I) p(θ̄ | Hi, I) dⁿθ

Computation of the model evidence (a.k.a. marginal likelihood, global likelihood, prior predictive, …) cannot be escaped.

D = D_RV D_transit D_astrometry … D_N

For N independent datasets: p(D | H, I) = p(E1, E2, …, EN | H, I) = Π_{i=1}^{N} p(Ei | H, I)

Decision making

L_θ̄(Hi) = p(D | θ̄, Hi, I) ∝ exp(−χ²/2) (indep., Gaussian errors)

Model comparison consists in computing the ratio of the posteriors (odds ratio) of two competing hypotheses:

p(Hi | I, D) / p(Hj | I, D) = [p(D | Hi, I) / p(D | Hj, I)] · [p(Hi | I) / p(Hj | I)]

(Bayes factor × prior odds)

Parameter estimation:

p(θ̄ | Hi, I, D) = p(D | θ̄, Hi, I) / p(D | Hi, I) · p(θ̄ | Hi, I)

p(θ̄ | D, Hi, I) ∝ p(D | θ̄, Hi, I) · p(θ̄ | Hi, I) = L_θ̄(Hi) · p(θ̄ | Hi, I)

Model comparison:

p(Hi | D, I) ∝ p(D | Hi, I) · p(Hi | I)

p(Hi | I, D) / p(Hj | I, D) = [p(D | Hi, I) / p(D | Hj, I)] · [p(Hi | I) / p(Hj | I)]
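The odds-ratio decomposition in a two-line arithmetic sketch (all numbers invented for illustration):

```python
# Toy odds-ratio arithmetic: posterior odds = Bayes factor x prior odds.
evidence_H1 = 2.0e-3          # p(D | H1, I)  (invented)
evidence_H2 = 5.0e-4          # p(D | H2, I)  (invented)
prior_H1, prior_H2 = 0.5, 0.5

bayes_factor = evidence_H1 / evidence_H2        # B12
prior_odds = prior_H1 / prior_H2
posterior_odds = bayes_factor * prior_odds

# Equivalently, the posterior probability of H1 when only two models compete:
p_H1 = posterior_odds / (1.0 + posterior_odds)
print(posterior_odds, p_H1)   # 4.0 0.8
```

With equal priors the Bayes factor alone carries the decision; unequal priors simply rescale the odds.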

A built-in Occam’s razor

The Bayes factor naturally punishes models with more parameters.

p(D | Hi, I) = ∫ p(D | θ̄, Hi, I) p(θ̄ | Hi, I) dθ̄

E.g.: model M0, without free parameters; model M1, with one free parameter θ, such that M0 = M1(θ = θ0). Then

p(D | M0, I) = p(D | θ0, M1, I)

and, with a flat prior p(θ | M1, I) = 1/Δθ,

p(D | M1, I) = ∫ p(D | θ, M1, I) · p(θ | M1, I) dθ = (1/Δθ) ∫ p(D | θ, M1, I) dθ ≈ p(D | θ̂, M1, I) · δθ/Δθ

B10 ≈ [p(D | θ̂, M1, I) / p(D | θ0, M1, I)] · δθ/Δθ

δθ/Δθ is the Occam factor: one per parameter. (Gregory 2005)

[Figure 3.1, from Gregory (2005): the characteristic width δθ of the likelihood peak and Δθ of the prior, with L(θ̂) = p(D | θ̂, M1, I).]

The likelihood has a characteristic width which we represent by δθ. The characteristic width is defined by

∫_Δθ p(D | θ, M1, I) dθ = p(D | θ̂, M1, I) δθ.    (3.21)

Then we can approximate the global likelihood for M1 in the following way:

p(D | M1, I) = ∫ dθ p(θ | M1, I) p(D | θ, M1, I) = L(M1) = (1/Δθ) ∫ dθ p(D | θ, M1, I) ≈ p(D | θ̂, M1, I) δθ/Δθ, or alternatively, L(M1) ≈ L(θ̂) δθ/Δθ.    (3.22)

Since model M0 has no free parameters, no integral need be calculated to find its global likelihood, which is simply equal to the likelihood of model M1 for θ = θ0,

p(D | M0, I) = p(D | θ0, M1, I) = L(θ0).    (3.23)

Thus the Bayes factor in favor of the more complicated model is

B10 ≈ [p(D | θ̂, M1, I) / p(D | θ0, M1, I)] · δθ/Δθ = [L(θ̂)/L(θ0)] · δθ/Δθ.    (3.24)

If the likelihood function is really a Gaussian and the prior is flat, it is simple to show that δθ = √(2π) σ, where σ is the standard deviation of the posterior PDF for θ.

Estimating the marginal likelihood

p(D H , I)= p(D ✓¯, H , I)p(✓¯ H , I)dn✓ | i | i | i Z⇡ Marginal likelihood is a k-dimensional integral over parameter space. Accounting for the “size” of parameter space bring many good things to Bayesian statistics, but the integral is intractable… Large number of techniques to estimate the value of the integral: • Asymptotic estimates (Laplace approximation, BIC). • Importance sampling. • Chib & Jeliazkov. • Nested sampling. Information criteria for astrophysics L75 where the evidence is not readily calculable, and a simpler model 2.2 AIC and BIC selection technique is required. In this article I describe and apply an additional information cri- Much of the literature, both in astrophysics and elsewhere, seeks a terion, the Information Criterion (DIC) of Spiegelhalter simpler surrogate for the evidence which still encodes the tension et al. (2002, henceforth SBCL02), which combines heritage from between fit and model complexity. In Liddle (2004), I described two both Bayesian methods and information theory. It has interesting such statistics, the AIC and BIC, which have subsequently been quite properties. First, unlike the AIC and BIC it accounts for the sit- widely applied to astrophysics problems. They are relatively simple uation, common in astrophysics, where one or more parameters to apply because they require only the maximum likelihood achiev- or combination of parameters is poorly constrained by the data. able within a given model, rather than the likelihood throughout the Secondly, it is readily calculable from posterior samples, such as parameter space. Of course, such simplification comes at a cost, the those generated by MCMC methods. It has already been used in cost being that they are derived using various assumptions, partic- astrophysics to study quasar clustering (Porciani & Norberg 2006). ularly Gaussianity or near-Gaussianity of the posterior distribution, that may be poorly respected in real-world situations. 
Estimating the marginal likelihood

Point Estimations

Bayesian Information Criterion:

  BIC ≡ −2 ln L_max + k ln N

where L_max is the maximum likelihood achievable by the model, k the number of its parameters, and N the number of data points used in the fit (Schwarz 1978).

• Some nice properties are lost, such as reasonable parameter penalisation (see Liddle 2007).
• The error term of the BIC is generally O(1). "Even for large samples, it does not produce the right value."

[Background: page excerpt from Liddle (2007), MNRAS 377, L74–L78, defining the Bayesian evidence and the AIC, BIC, AICc and DIC information criteria.]
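These point estimates are cheap to compute. A minimal sketch, for a toy straight-line fit with Gaussian noise (the data, true parameters and noise level here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical example: AIC/BIC for a straight-line fit to noisy data.
x = np.linspace(0.0, 1.0, 50)
sigma = 0.1
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma, size=x.size)

# Maximum-likelihood fit (= least squares for known Gaussian noise).
coeffs = np.polyfit(x, y, 1)
resid = y - np.polyval(coeffs, x)
lnLmax = (-0.5 * np.sum((resid / sigma) ** 2)
          - x.size * np.log(sigma * np.sqrt(2.0 * np.pi)))

k, N = 2, x.size                      # free parameters, data points
AIC = -2.0 * lnLmax + 2.0 * k
BIC = -2.0 * lnLmax + k * np.log(N)   # penalises parameters more once ln N > 2
print(AIC, BIC)
```

For fixed k the two criteria differ only by the penalty term, so the BIC penalises extra parameters more strongly whenever N > e² ≈ 7.4 data points.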

Importance Sampling (I)

• Used to obtain moments of distributions using samples from another distribution.

• The evidence is the expectation value of the likelihood over the prior:

  p(D|H_i, I) = ∫ p(D|θ, H_i, I) p(θ|H_i, I) dⁿθ

• The simplest estimator takes S samples θ^(i) from the prior and averages the likelihoods p(D|θ^(i), H_i, I):

  p̂(D|H_i, I) = (1/S) Σ_{i=1}^{S} p(D|θ^(i), H_i, I)

• But this estimator is extremely inefficient if the likelihood is concentrated relative to the prior: a very large number of samples is needed, which is computationally expensive.

Importance Sampling (II)
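The prior-sampling estimator just described can be checked on a toy problem where the evidence is known in closed form (a single Gaussian datum with a Gaussian prior; all numbers here are assumptions of the sketch):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model (assumed): one datum d with likelihood N(d; theta, sigma^2)
# and prior theta ~ N(0, tau^2). The evidence is then N(d; 0, sigma^2 + tau^2).
d, sigma, tau = 1.0, 0.5, 2.0

def likelihood(theta):
    return np.exp(-0.5 * ((d - theta) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Prior-sampling estimator: average the likelihood over S prior draws.
S = 200_000
theta = rng.normal(0.0, tau, size=S)
evidence_hat = likelihood(theta).mean()

s2 = sigma**2 + tau**2
evidence_true = np.exp(-0.5 * d**2 / s2) / np.sqrt(2.0 * np.pi * s2)
print(evidence_hat, evidence_true)  # agree to ~1% here
```

Here the likelihood is only moderately narrower than the prior, so the estimator works; shrink `sigma` by a couple of orders of magnitude and most prior draws contribute essentially nothing, which is the inefficiency described above.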

More generally, we can choose a distribution g(θ) (from which we can sample) and express the integral as an expectation value over that distribution (dropping the conditioning on I for notational simplicity):

  ∫ p(θ|H_i) p(D|θ, H_i) dθ = ∫ [p(θ|H_i) / g(θ)] p(D|θ, H_i) g(θ) dθ
                            = E_g[ w_g(θ) p(D|θ, H_i) ]
                            ≈ (1/S) Σ_{i=1}^{S} w_g(θ^(i)) p(D|θ^(i), H_i)

with weights w_g(θ) ≡ p(θ|H_i) / g(θ), and where θ^(i) is sampled from g(θ), the importance sampling function.

Importance Sampling (III)
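The general weighted estimator can be sketched on the same assumed toy model, with g(θ) a Gaussian placed by hand near the likelihood mass (the placement is a choice of this sketch, not a general prescription):

```python
import numpy as np

rng = np.random.default_rng(3)

# Same assumed toy model: likelihood N(d; theta, sigma^2), prior N(0, tau^2).
d, sigma, tau = 1.0, 0.5, 2.0

def likelihood(theta):
    return np.exp(-0.5 * ((d - theta) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def prior(theta):
    return np.exp(-0.5 * (theta / tau) ** 2) / (tau * np.sqrt(2.0 * np.pi))

# Importance-sampling function g: a Gaussian near the likelihood mass
# (mu_g and s_g are assumptions of this sketch).
mu_g, s_g = 1.0, 1.0

def g(theta):
    return np.exp(-0.5 * ((theta - mu_g) / s_g) ** 2) / (s_g * np.sqrt(2.0 * np.pi))

S = 100_000
theta = rng.normal(mu_g, s_g, size=S)          # samples from g
w = prior(theta) / g(theta)                    # importance weights w_g(theta)
evidence_hat = np.mean(w * likelihood(theta))  # E_g[w_g(theta) p(D|theta)]

s2 = sigma**2 + tau**2
evidence_true = np.exp(-0.5 * d**2 / s2) / np.sqrt(2.0 * np.pi * s2)
print(evidence_hat, evidence_true)
```

A g(θ) that covers the region where prior × likelihood is appreciable keeps the weights bounded; the choices on the following slide are the two extreme special cases.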

• Different choices of g(θ) give different estimators:

  g(θ) = p(θ|H_i, I)     →  p̂(D|H_i, I) = (1/S) Σ_{i=1}^{S} p(D|θ^(i), H_i, I)              (prior: mean estimate)

  g(θ) = p(θ|D, H_i, I)  →  p̂(D|H_i, I) = [ (1/S) Σ_{i=1}^{S} 1 / p(D|θ^(i), H_i, I) ]⁻¹    (posterior: harmonic mean)

The Harmonic Mean estimator:
• does not satisfy a Gaussian central limit theorem (Kass & Raftery 1995); the sum is dominated by the occasional small terms in the likelihood.
• Its variance can be infinite.
• Is insensitive to diffuse priors (i.e. priors much wider than the likelihood).

Importance Sampling (IV)
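The harmonic-mean pathology is easy to see numerically before moving to better estimators. On the same assumed toy model (the exact Gaussian posterior stands in for MCMC output), the harmonic-mean estimate scatters far more between repeated runs than the naive prior-mean estimate:

```python
import numpy as np

rng = np.random.default_rng(4)

# Same assumed toy model: likelihood N(d; theta, sigma^2), prior N(0, tau^2).
d, sigma, tau = 1.0, 0.5, 2.0

def likelihood(theta):
    return np.exp(-0.5 * ((d - theta) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# The posterior is Gaussian here, so sample it directly (no MCMC needed).
post_var = 1.0 / (1.0 / sigma**2 + 1.0 / tau**2)
post_mean = post_var * d / sigma**2

S, reps = 5_000, 100
hm, pm = [], []
for _ in range(reps):
    th_post = rng.normal(post_mean, np.sqrt(post_var), size=S)
    hm.append(1.0 / np.mean(1.0 / likelihood(th_post)))   # harmonic mean
    th_prior = rng.normal(0.0, tau, size=S)
    pm.append(np.mean(likelihood(th_prior)))              # prior mean

# Run-to-run scatter (std of log estimate): the harmonic mean is dominated
# by occasional posterior draws with very small likelihood.
print(np.std(np.log(hm)), np.std(np.log(pm)))
```

In this toy case 1/p(D|θ) has infinite variance under the posterior, so the harmonic-mean scatter never settles down as S grows, which is the failure mode listed above.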

Perrakis et al. (2014) use the product of the posterior marginal distributions,

  g(θ) = ∏_{j=1}^{k} p(θ_j|D, H_i, I),

which leads to the estimator

  p̂(D|H, I) = (1/S) Σ_{i=1}^{S} [ L(θ^(i)) p(θ^(i)|H, I) / ∏_{j=1}^{k} p(θ_j^(i)|D, H, I) ]

• Behaves much better than the harmonic mean.

• Requires estimating the marginal posterior densities in the denominator. Different techniques are available (kernel density estimation, etc.).

• Samples from the marginal distributions are readily obtained from MCMC samples.

Estimating the marginal likelihood

Estimations based on importance sampling (IS)

ISF                            Leads to...                            But...
-----------------------------  -------------------------------------  ---------------------------------------
Prior                          Mean Estimate                          Very inefficient
Posterior                      Harmonic Mean Estimate                 Dominated by points with low likelihood
Mixture of Posterior & Prior   Mixture Estimate                       Requires drawing from both the
(Kass & Raftery 1995)                                                 posterior and the prior
Posterior × 2                  Truncated Posterior Mixture Estimate   Inconsistent, or as good as the HM
(Tuomi & Jones 2012)

Nested sampling (Skilling 2007)

  p(D|H_i, I) = ∫ p(D|θ, H_i, I) p(θ|H_i, I) dⁿθ

A non-trivial change of variables. Define the "prior volume" X:

  dX = p(θ|I, H) dⁿθ,    X(λ) = ∫_{L(θ) > λ} p(θ|I, H) dⁿθ

Then the integral becomes a "simple" 1-D integral over the prior volume:

  p(D|H, I) = ∫₀¹ L(X) dX

Nested sampling (Skilling 2007)
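The change of variables can be verified numerically on a toy 1-D problem (flat prior on (0, 1) and a Gaussian-shaped likelihood, both assumptions of this sketch): sorting the likelihood values in decreasing order and integrating them against the enclosed prior volume reproduces the direct integral.

```python
import numpy as np

# Toy problem (assumed): flat prior on (0, 1), Gaussian-shaped likelihood.
def L(theta):
    return np.exp(-0.5 * ((theta - 0.5) / 0.1) ** 2)

def trapezoid(y, x):
    return 0.5 * np.sum((y[1:] + y[:-1]) * np.diff(x))

n = 200_001
theta = np.linspace(0.0, 1.0, n)
Lvals = L(theta)

# Direct evidence: Z = ∫ L(theta) p(theta|H) dtheta, with p = 1 on (0, 1).
Z_direct = trapezoid(Lvals, theta)

# Prior-volume form: sort likelihood values in decreasing order; the prior
# volume enclosed by the i-th contour is X_i ≈ i/n, and Z = ∫0..1 L(X) dX.
Lsorted = np.sort(Lvals)[::-1]
X = np.arange(1, n + 1) / n
Z_volume = trapezoid(Lsorted, X)

print(Z_direct, Z_volume)  # both ≈ 0.1 * sqrt(2*pi) ≈ 0.2507
```

The sorting step is the whole point of the transformation: L(X) is monotonically decreasing whatever the dimensionality or shape of the original likelihood surface.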

1 p(D H, I)= (X)dX | L MULTINEST: efficient and robust Bayesian inference 1603 Z0

where wM j XM /N for all j.Thefinaluncertaintyonthecalculated evidence may+ = be straightforwardly estimated from a single run of the nested sampling algorithm by calculating the relative entropy of the full sequence of samples (see FH08). Once the evidence is found, posterior inferences can be easily generated using the fullZ sequence of (inactive and active) points generated in the nested sampling process. Each such point is simply assigned the weight w p Lj j , (9) j = Z Figure 1. Cartoon illustrating (a) the posterior of a two-dimensional prob- where the sample index j runs from 1 to M N, the to- N = + lem and (b) the transformed (X) function where the prior volumes Xi are tal number of sampled points. These samples can then be used to L associated with each likelihood i . L calculate inferences of posterior parameters, such as , stan- dard deviations, covariances and so on, or to construct marginalized where (X), the inverse of equation (4), is a monotonically de- posterior distributions. creasingL function of X. Thus, if one can evaluate the likelihoods i (Xi ), where Xi is a sequence of decreasing values, L = L 4ELLIPSOIDALNESTEDSAMPLING 0 at each iteration i.Employinganaiveapproachthat L Li M draws blindly from the prior would result in a steady decrease in i wi . (7) the acceptance rate of new samples with decreasing prior volume Z = L i 1 (and increasing likelihood). != In the following we will use the simple trapezium rule, for which Ellipsoidal nested sampling (Mukherjee et al. 2006) tries to over- 1 the weights are given by wi 2 (Xi 1 Xi 1). An example of come the above problem by approximating the iso-likelihood con- aposteriorintwodimensionsanditsassociatedfunction= − − + (X) is tour by a D-dimensional ellipsoid determined from the L = Li shown in Fig. 1. L matrix of the current set of active points. New points are The summation (equation 7) is performed as follows. 
The itera- then selected from the prior within this ellipsoidal bound (usu- tion counter is first set to i 0andN ‘active’ (or ‘live’) samples ally enlarged slightly by some user-defined factor) until one is = are drawn from the full prior π(Θ) (which is often simply the uni- obtained that has a likelihood exceeding that of the removed lowest- form distribution over the prior range), so the initial prior volume is likelihood point. In the limit that the ellipsoid coincides with the X 1. The samples are then sorted in order of their likelihood, true iso-likelihood contour, the acceptance rate tends to unity. 0 = and the smallest (with likelihood 0) is removed from the active set Ellipsoidal nested sampling as described above is efficient for (hence becoming ‘inactive’) andL replaced by a point drawn from simple unimodal posterior distributions without pronounced de- the prior subject to the constraint that the point has a likelihood generacies, but is not well suited to multimodal distributions. As > .Thecorrespondingpriorvolumecontainedwithinthisiso- advocated by Shaw et al. (2007) and shown in Fig. 2, the sampling L L0 likelihood contour will be a given by X1 t1X0, efficiency can be substantially improved by identifying distinct clus- N 1 = where t1 follows the distribution Pr(t) Nt − (i.e. the probabil- ters of active points that are well separated and by constructing ity distribution for the largest of N samples= drawn uniformly from an individual (enlarged) ellipsoid bound for each cluster. In some the interval [0, 1]). At each subsequent iteration i, the removal of problems, however, some modes of the posterior may exhibit a pro- the lowest-likelihood point i in the active set, the drawing of a nounced curving degeneracy so that it more closely resembles a replacement with > andL the reduction of the corresponding (multidimensional) ‘banana’. 
Such features are problematic for all L Li prior volume Xi ti Xi 1 are repeated, until the entire prior vol- sampling methods, including that of Shaw et al. (2007). ume has been traversed.= − The algorithm thus travels through nested In FH08, we made several improvements to the sampling method shells of likelihood as the prior volume is reduced. The mean and of Shaw et al. (2007), which significantly improved its efficiency standard deviations of log t, which dominates the geometrical ex- and robustness. Among these, we proposed a solution to the above ploration, are E[ log t] 1/N and σ [ log t] 1/N.Sinceeach problem by partitioning the set of active points into as many sub- value of log t is independent,= − after i iterations the= prior volume will clusters as possible to allow maximum flexibility in following the shrink down such that log Xi (i √i)/N.Thus,onetakesXi degeneracy. These clusters are then enclosed in ellipsoids and a new exp( i/N). ≈− ± = point is then drawn from the set of these ‘overlapping’ ellipsoids, The− algorithm is terminated on determining the evidence to some correctly taking into account the overlaps. Although this subcluster- specified precision (we use 0.5 in log-evidence): at iteration i, the ing approach provides maximum efficiency for highly degenerate largest evidence contribution that can be made by the remaining por- distributions, it can result in lower efficiencies for relatively simpler tion of the posterior is # i maxXi ,where max is the maximum problems owing to the overlap between the ellipsoids. Also, the likelihood in the currentZ set= ofL active points. TheL evidence estimate factor by which each ellipsoid was enlarged was chosen arbitrar- (equation 7) may then be refined by adding a final increment from ily. Another problem with our previous approach was in separating the set of N active points, which is given by modes with elongated curving degeneracies. We now propose solu-

N tions to all these problems, along with some additional modifications ULTI EST #Z j wM j , (8) to improve efficiency and robustness still further, in the M N = L + j 1 algorithm presented in the following section. !=

C C ⃝ 2009 The Authors. Journal compilation ⃝ 2009 RAS, MNRAS 398, 1601–1614 John Skilling 839

[Background: excerpt from Skilling (2006), "Nested sampling for general Bayesian computation", describing the shrinkage recurrence X_0 = 1, X_i = t_i X_{i−1} with Pr(t_i) = N t_i^(N−1), the statistics E(log t) = −1/N and dev(log t) = 1/N, and the replacement of the worst of N live points by a new draw from the prior subject to L(θ) > L_{i−1} (e.g. by MCMC from a copy of a random survivor).]

Nested sampling (Skilling 2007)

  p(D|H, I) = ∫₀¹ L(X) dX

At each step, the procedure has N points θ_1, ..., θ_N, with corresponding likelihoods L(θ_1), ..., L(θ_N); the lowest (minimum) such value is the likelihood L_i associated with step i. There are to be j iterative steps.

Algorithm:
  Start with N points θ_1, ..., θ_N drawn from the prior;
  initialise Z = 0, X_0 = 1.
  Repeat for i = 1, 2, ..., j:
    record the lowest of the current likelihood values as L_i;
    set X_i = exp(−i/N) (crude) or sample it to get the uncertainty;
    set w_i = X_{i−1} − X_i (simple) or (X_{i−1} − X_{i+1})/2 (trapezoidal);
    increment Z by L_i w_i;
    then replace the point of lowest likelihood by a new one drawn
    from within L(θ) > L_i, in proportion to the prior π(θ).
  Increment Z by N⁻¹ (L(θ_1) + ... + L(θ_N)) X_j.

The last step fills in the missing band 0 < X < X_j of the desired integral ∫₀¹ L dX, with weight w = N⁻¹ X_j for each surviving point, after the iterative steps have compressed the domain.

Two keys:
• Assigning X_i.
• Drawing with the condition L > L_i.
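The procedure above can be sketched in a few lines for an assumed toy problem (flat prior on (0, 1), narrow Gaussian likelihood). The constrained draw is done exactly here, which only works because the problem is 1-D and unimodal; real implementations use MCMC, ellipsoids, etc.:

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed toy problem: flat prior on (0, 1), Gaussian likelihood of width s.
# True evidence: Z = ∫0..1 L dtheta ≈ s * sqrt(2*pi) ≈ 0.02507.
mu, s = 0.5, 0.01

def logL(theta):
    return -0.5 * ((theta - mu) / s) ** 2

N, j = 100, 2000                       # live points, iterative steps
theta = rng.uniform(0.0, 1.0, size=N)  # start with N points from the prior
logl = logL(theta)

Z, X_prev = 0.0, 1.0
for i in range(1, j + 1):
    worst = np.argmin(logl)            # lowest likelihood L_i
    X_i = np.exp(-i / N)               # crude prior-volume assignment
    Z += np.exp(logl[worst]) * (X_prev - X_i)
    L_min = logl[worst]
    # Replace the worst point by a prior draw with L > L_min. In 1-D the
    # constrained region is just the interval (mu - r, mu + r), so we can
    # sample it directly instead of rejecting from the full prior.
    r = s * np.sqrt(max(-2.0 * L_min, 0.0))
    lo, hi = max(0.0, mu - r), min(1.0, mu + r)
    t = rng.uniform(lo, hi)
    theta[worst], logl[worst] = t, logL(t)
    X_prev = X_i

Z += np.exp(logl).mean() * X_prev      # final increment from the live points
print(Z)  # ≈ 0.025, within the nested-sampling scatter of ~sqrt(H/N) in log Z
```

With N = 100 live points the estimate typically lands within ~20 per cent of the analytic value; increasing N tightens it at linear cost.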

Nested sampling (Skilling 2007)

Draw new point with condition.

Ellipsoidal nested sampling (Mukherjee et al. 2006).
• Uses ellipsoidal contours around the active points.
• Becomes inefficient at high dimensions or for multi-modal distributions.

MultiNest algorithm (Feroz, Hobson & Bridges 2009).
• Ellipsoids break up as needed to improve efficiency.
• Still problems at high dimensions (the curse goes on).

Figure 2 of Feroz et al. (2009). Cartoon of ellipsoidal nested sampling from a simple bimodal distribution. In (a) the ellipsoid represents a good bound to the active region. In (b)–(d), as we nest inwards, the acceptance rate rapidly decreases as the bound steadily worsens. (e) illustrates the increase in efficiency obtained by sampling from each clustered region separately.

[Background: Section 5 of Feroz, Hobson & Bridges (2009), describing the MULTINEST algorithm: sampling in the unit hypercube, and the simultaneous partitioning of the active points into clusters with (possibly overlapping) ellipsoidal bounds.]

Nested sampling (Skilling 2007)

Assigning X_i:

  p(D|H, I) = ∫₀¹ L(X) dX

Use the statistical properties of randomly distributed points:

  X_0 = 1,   X_i = t_i X_{i−1},   p(t_i) = N t_i^(N−1)

  E[log t] = −1/N

Either take the mean value, X_i = exp(−i/N), or draw t_i from its distribution to account for the uncertainty.
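The shrinkage statistics quoted above are easy to verify by simulation: t_i is the largest of N Uniform(0, 1) draws, and log t has mean −1/N and standard deviation 1/N.

```python
import numpy as np

rng = np.random.default_rng(7)

# t_i is the largest of N Uniform(0,1) draws: Pr(t) = N t^(N-1).
# Check E[log t] = -1/N and dev[log t] = 1/N by simulation.
N, M = 50, 100_000
t = rng.uniform(0.0, 1.0, size=(M, N)).max(axis=1)

print(np.mean(np.log(t)), np.std(np.log(t)))  # ≈ -1/N = -0.02 and 1/N = 0.02
```

Equivalently, −log t is the minimum of N unit exponentials, i.e. exponential with rate N, which gives both moments exactly and underlies the crude choice X_i = exp(−i/N).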