ΛCDM and Beyond: Cosmology Tools in Theory and in Practice – CANTATA Cost-Action Summer School 2017
Signe Riemer-Sørensen, University of Oslo
Adapted from notes by Tamara Davis, University of Queensland

Statement of expectations

I don’t lecture, I teach. I’m here for you to learn. I expect active participation during the classes. This format requires much MORE EFFORT from the students than standard lectures, but the learning gain is also higher. We will cover some of the background information on how CosmoMC and MontePython work. During the classes we’ll be doing quizzes using the Kahoot! platform. I recommend you download the Kahoot! app for your smartphone1, but it’s absolutely not necessary. The quizzes work in any browser (go to https://kahoot.it and follow the instructions).

1 Introduction

Cosmology is about understanding the universe. We do some observations and then we compare them to models that fit in a bigger theory. But how do we actually determine the parameters of the models? And which model is the best?

“If your experiment needs a statistician, you need a better experiment.” – Ernest Rutherford, winner of the Nobel Prize in Chemistry

“Torture the data, and it will confess to anything.” – Ronald Coase, winner of the Nobel Prize in Economics

Handling statistics forms a crucial part of any data analysis, but it is often neglected in our formal education, or taught more formally than practically. Here we will try to bridge that gap and outline the basic data analysis techniques that are most commonly used in cosmology, including some common pitfalls and tips of the trade. To focus the discussion we will look at the constraints on our cosmological model parameters that can be derived from observations like supernovae (SNe), baryon acoustic oscillations (BAO), and the cosmic microwave background (CMB). In other words, how one gets from the observed data illustrated in the left part of Fig. 1 to the likelihood contours in the right part of the figure. Occasionally, we’ll comment on how CosmoMC and MontePython work, but most of the content is general. This is not meant to be a rigorous mathematical derivation of the statistics, just a practical guide on how to implement them.

Figure 1: The aim of this lecture is to reveal some of the subtleties involved in creating these types of contour plots for cosmology. How do you go from the supernova distance modulus vs redshift data on the left, to the red 1σ confidence contour on the right? Once you’ve got there, how do you combine that data with the other data sets including possible correlations, nuisance parameters, and multi-parameter models. Although we’re using cosmology as a case study, the techniques are applicable to data analysis in general. Plots adapted from Davis et al. (2007).

1App Store http://turl.no/1atg, Google Play http://turl.no/1atf

1.1 Models and parameters

It is important to distinguish up-front the difference between testing models and finding the best-fit parameters within those models. Different models could be based on different gravitational theories – for example General Relativity or f(R) gravity – or they could simply be different parameter combinations within a single theory of gravity – for example General Relativity with a cosmological constant and cold dark matter (ΛCDM) with the parameters θΛCDM = (H0, ΩM, ΩΛ), or General Relativity with a more general dark energy and cold dark matter (wCDM) with the parameters θwCDM = (H0, ΩM, Ωx, w). When you assume the universe is flat, that reduces the number of parameters you need to fit for and therefore constitutes a different model as well. So testing flat-ΛCDM is different to testing ΛCDM.

We consider three levels of inference, as shown in Tab. 1. In Sec. 2 we will discuss what we mean by the best model/parameters and the framework of Bayesian statistics. In Sec. 3 we will address the first level of inference, namely how to determine the best-fit parameters within a particular model. The second level of inference is the comparison between models, which will be the focus of Sec. 5, while the third level of “what to do when you’re unable to choose between models” is briefly discussed in Sec. 5.1.

Level 1 – Parameter inference: I have selected a model M and prior P(θ|M). What are the parameters? Use Bayes Theorem:
P(θ|d, M) = P(d|θ, M) P(θ|M) / P(d|M)
e.g. parameter determination by running CosmoMC or MontePython. ⇓ Sec. 2

Level 2 – Model comparison: Actually, there are several possible models: M0, M1, ... What is the relative plausibility of M0, M1, ... given the observed data?
odds = P(M0|d) / P(M1|d)
e.g. comparison of ΛCDM and wCDM as in Heavens et al. (2017). ⇓ Sec. 5

Level 3 – Model averaging: None of the models is clearly the best. What is the inference on the parameters accounting for model uncertainty? Model averaging:
P(θ|d) = Σ_i P(Mi|d) P(θ|d, Mi)
e.g. finding the curvature of the universe as in Vardanyan et al. (2011) or evolving dark energy as in Liddle et al. (2006). ⇓ Sec. 5.1

Table 1: Three different levels of inference.

2 Bayesian statistics

Bayesian statistical analysis is a general method where the probability of a theory/parameter value being correct is computed based on observations, and where the probability estimate can be updated when new observations become available. This is contrary to the frequentist interpretation where the probability is a direct measure of the proportion of outcomes in repeated trials of your experiment (e.g. often used in laboratory experiments).

Bayes Theorem gives the probability that the inferred parameters θ are true given that we observe the data d. This quantity is also called the posterior probability function:

P(θ|d, M) = P(d|θ, M) P(θ|M) / P(d|M),    (1)

where θ represents a given set of parameters, d is the observed data, and M is any other external information, e.g. the model2. P(d|θ, M) is the probability of observing the data d if the parameters θ are true; this is also called the likelihood. P(θ|M) is called the prior and is the probability that the parameters are true in the absence of data but given some external information (prior knowledge)3. P(d|M) is called the evidence and is the probability of observing the data given the external information (the choice of model) but before the data are known. For parameter estimation within a model we usually don’t have to consider the evidence (a normalisation), and the problem reduces to “posterior ∝ likelihood × prior”.

The output of a Bayesian analysis is a probability density function. This is contrary to frequentist statistics where the output is an estimator for the parameters, i.e. one value that represents a parameter. In Bayesian statistics the probability density function gives a quantitative measure of how well we can believe the estimated parameters given the combination of data and prior knowledge.

2 The model can contain physical knowledge, e.g. General Relativity or flatness of the universe.
3 E.g. something measured by another experiment, or an assumption that a parameter falls within a given interval, e.g. masses have to be positive.
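The relation “posterior ∝ likelihood × prior” can be made concrete in a few lines of numpy. This is a minimal sketch with hypothetical numbers (one parameter, one datum) – not CosmoMC/MontePython code:

```python
import numpy as np

# Minimal 1D illustration of "posterior ∝ likelihood × prior" on a grid.
theta = np.linspace(-5, 5, 1001)                        # parameter grid
dtheta = theta[1] - theta[0]
prior = np.exp(-0.5 * (theta / 2.0)**2)                 # broad Gaussian prior, sigma = 2
d, sigma_d = 1.3, 0.5                                   # one observed datum and its error
likelihood = np.exp(-0.5 * ((d - theta) / sigma_d)**2)  # P(d|theta)

posterior = likelihood * prior                          # unnormalised posterior
posterior /= posterior.sum() * dtheta                   # normalise (the role of the evidence)

mean = (theta * posterior).sum() * dtheta               # posterior mean
```

The posterior mean ends up between the prior centre (0) and the measurement (1.3), pulled toward the data because the likelihood here is tighter than the prior.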

2 Figure 2: Left: This schematic illustrates Bayes Theorem. The prior is the knowledge you have before you measure something. Then you obtain “sensory evidence” (measurements/observations) which leads to an updated posterior (dashed line). If you increase the precision of your measurements or you decrease the precision of your prior, your measurements will have stronger influence on the posterior. From ?. Right: Bayesian versus frequentist interpretation of a solar neutrino experiment from https://xkcd.com/1132/.

2.1 How to explore the parameter space

In any judgement of best-fit parameters you need to start by exploring the parameter space and testing how well the model fits the data for a wide range of parameter values. In cosmology, we often deal with a very large number of parameters (e.g. approximately 10^7 pixel temperatures in the CMB map), and a direct (analytical) evaluation of the posterior distribution is impossible. Instead we compute the posterior on a sample drawn from the real posterior distribution. There are several ways to do this, but the most common are either setting up a grid of parameter values and testing each one, or using a Monte-Carlo Markov Chain (MCMC) method to selectively test the most likely region of parameter space.

Testing over a grid of parameters can be very time consuming, because you spend a lot of time calculating your model in regions of parameter space that are very unlikely (the white space in Fig. 1). When you have many parameters this becomes prohibitive, since the number of grid points grows exponentially with the number of dimensions. That’s where MCMC comes in. Monte Carlo basically means representing a distribution by randomly sampling it, and a Markov Chain is a smart way of sampling. A Markov Chain is defined by a series of random elements where the conditional distribution of the nth element only depends on the (n−1)th element. The chain is stationary if the distribution does not depend on n (the sample number). For our purpose, the crucial property is that after some steps (the burn-in phase), the chain reaches a state where successive elements are drawn from high-density regions of the underlying distribution (the posterior).

You start running an MCMC chain somewhere in your parameter space; it then jumps to another set of parameters to test (you can tune how big the jumps should be). If the second point has a higher likelihood the chain moves there and makes another jump, but if it is a worse fit there’s an algorithm by which it decides whether to keep the new step or drop back to its previous step and try again.
The MCMC leaves a track of points in its wake (called a chain), and the density of those points eventually gives you the likelihood of your parameters. There’s a nice visualiser for MCMC sampling here https://chi-feng.github.io/mcmc-demo/app.html.

2.2 Metropolis-Hastings algorithm

The Metropolis-Hastings algorithm is a fancy name for how to sample the parameter space (even if you cannot sample the underlying distribution directly). Let p(θ) be the probability distribution that we want to sample and θ0 the arbitrary starting sample. Then we specify a distribution q(θ′|θ) which suggests a candidate for the next sample value, θ′, given the previous value θ (called the stepping/jumping or proposal function). For each candidate we can calculate the acceptance ratio as

a = [p(θ′) / p(θ)] × [q(θ|θ′) / q(θ′|θ)].    (2)

Figure 3: Illustration of the Metropolis-Hastings algorithm in one dimension. Left panel: Starting at θ1, θ2 is proposed and accepted (step A), θ3 is proposed and rejected (step B), θ4 is proposed and accepted (step C). The resulting chain is then {θ1, θ2, θ2, θ4, ...}. Middle panel: If the step size is too large, the chain can never get into the minimum (−ln(p(θ)) = L ∝ χ²). Right panel: If the step size is too small, the sampling will take a very long time to reach the high-probability region. Figure from Leclercq et al. (2014).

If a > 1, then θ′ is accepted; otherwise it is accepted with probability a. If accepted, θ′ becomes the new state of the chain; otherwise the chain stays at θ. A compact way of writing this is P(jump from θ to θ′) = min(1, a). Most often higher probability points will be accepted and lower probability points rejected, but occasionally a lower probability point will be accepted, so we also explore low probability regions. This is illustrated in Fig. 3. Since each step only depends on the previous step and is independent of the number of steps, the chain is a Markov Chain, and the sample distribution converges to the underlying probability distribution as you draw more samples.

The major disadvantage of Metropolis-Hastings is that the samples are correlated. Even though they follow the underlying distribution, neighbouring samples are correlated and you can only use every n-th sample. You could instead increase the step size to reduce the correlations, but that would mean more rejected steps; with too small a step size it takes a long time to get into the high-probability region (see Fig. 3). Also, the beginning of the sample may follow a distribution that does not reflect the underlying distribution very well, in particular if the starting point is in a low probability region. Consequently, we often remove the first part of the sample, called the burn-in period. The advantages are that we do not need to know the probability distribution itself, only a distribution proportional to it, and that the method works very well for a large number of dimensions (parameters).

The challenge in the Metropolis-Hastings algorithm is to choose an appropriate step size in all dimensions such that not all points are rejected, but you still explore the full range in all parameters. CosmoMC and MontePython use by default a Gaussian proposal density function. If you provide a covariance matrix (e.g. from a previous, similar but not necessarily identical, run), the information can be used to decompose the parameters into uncorrelated orthogonal base parameters and to split them into computationally fast (e.g. experimental nuisance) and slow (e.g. cosmological) parameters (Lewis & Bridle, 2002).
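As a concrete illustration, a Metropolis-Hastings sampler with a symmetric Gaussian proposal (so the q terms cancel in Eqn. 2) fits in a few lines of Python. This is a minimal sketch, not how CosmoMC or MontePython are implemented:

```python
import numpy as np

def metropolis_hastings(log_p, theta0, step_size, n_steps, seed=0):
    """Sample a distribution known only up to a constant, via its log-density."""
    rng = np.random.default_rng(seed)
    chain = [np.asarray(theta0, dtype=float)]
    log_p_current = log_p(chain[-1])
    n_accepted = 0
    for _ in range(n_steps):
        # Gaussian (symmetric) proposal, so q cancels in the acceptance ratio
        candidate = chain[-1] + step_size * rng.standard_normal(len(chain[-1]))
        log_p_candidate = log_p(candidate)
        # accept with probability min(1, p(candidate)/p(current))
        if np.log(rng.uniform()) < log_p_candidate - log_p_current:
            chain.append(candidate)
            log_p_current = log_p_candidate
            n_accepted += 1
        else:
            chain.append(chain[-1])  # stay put; the point is repeated in the chain
    return np.array(chain), n_accepted / n_steps

# Example: sample a 1D Gaussian with mean 1 and sigma 2 (log-density up to a constant)
chain, acc = metropolis_hastings(lambda t: -0.5 * ((t[0] - 1.0) / 2.0)**2,
                                 theta0=[0.0], step_size=2.0, n_steps=20000)
burned = chain[5000:]  # discard burn-in
```

The density of points in `burned` then traces the target distribution: its mean and standard deviation come out near 1 and 2, up to the sampling noise of a correlated chain.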

2.3 Convergence

The more samples you draw the better your chain will reflect the underlying probability distribution. How do you know when to stop sampling? When is it ok to make inferences based on a chain? This is called convergence. There exists no test that can tell us if a chain has converged, only if it hasn’t converged. So convergence diagnostics are a necessary but not sufficient condition. Some good signs of convergence are:

• Individual segments of the chain give similar results
• The chain is much longer than any obvious correlations
• “High” acceptance rate of proposed steps
• Multiple chains with different starting points give similar results
• MCMC on simulated data gives accurate results even with significantly fewer iterations

Within a single chain, one way to check for convergence is to look at the correlation between points separated by a fixed distance (as a function of that distance, or lag):

ρ_lag = Σ_{i}^{N−lag} (θ_i − θ̄)(θ_{i+lag} − θ̄) / Σ_{i}^{N} (θ_i − θ̄)².    (3)

What is the smallest lag that gives ρ_lag ≈ 0? Is it much smaller than the length of the chain?

Generally the uncertainty on the mean of N measurements of a given quantity goes as σ/√N. But since the MCMC samples are correlated, we need to replace the sample size N by the effective sample size ESS = N / Σ_lag ρ_lag. So if we want to reduce the uncertainty from 100% per measurement to a mean with uncertainty σx = 0.03σ, we need N ≈ 1000 in the uncorrelated case, but N = 4000 in a correlated chain whose effective sample size is a quarter of its length.

Another option is to run several chains and check for similarity between the chains. Running several chains in parallel might also be a good idea for practical and computational reasons. The Gelman-Rubin test compares the variance within individual chains to the variance between chains, and can be used to check that the chains are providing similar results (An et al., 1998). CosmoMC and MontePython can both provide the correlation and Gelman-Rubin diagnostics to help you determine whether your chain is long enough4,5, but you should also always check your chains yourself: look at trace/scatter plots, run several chains, and compare different parts of the chains. As an example, the convergence criteria used for Planck are:

• Chains must be individually converged
• The tails of the distribution must be well enough explored so that confidence intervals for each of the parameters can be determined from the last half of each of the chains
• The Gelman-Rubin criterion must satisfy R − 1 < 0.01 in the least converged orthogonalized parameter
• The first 30% of each chain is discarded as burn-in
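Both diagnostics are easy to compute from raw chains. Here is a minimal sketch (the exact definitions vary slightly between codes, e.g. in how the variances are weighted) for a single parameter:

```python
import numpy as np

def autocorr(chain, lag):
    """Autocorrelation of a 1D chain at a given lag, as in Eqn. 3."""
    centred = chain - chain.mean()
    return np.sum(centred[:-lag] * centred[lag:]) / np.sum(centred**2)

def gelman_rubin(chains):
    """Simple Gelman-Rubin R for one parameter over several chains.
    chains: 2D array of shape (n_chains, n_samples)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    within = chains.var(axis=1, ddof=1).mean()        # W: mean within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)     # B: n * variance of chain means
    var_est = (n - 1) / n * within + between / n      # pooled variance estimate
    return np.sqrt(var_est / within)                  # R -> 1 as the chains converge

rng = np.random.default_rng(1)
chains = rng.standard_normal((4, 5000))  # four well-mixed toy "chains"
R = gelman_rubin(chains)                 # close to 1, so R - 1 < 0.01 here
```

For the independent toy chains above the lag-1 autocorrelation is consistent with zero and R − 1 is far below the Planck threshold of 0.01; a real chain would show ρ_lag decaying gradually with lag.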

3 Comparing data to model

To determine how a model with a given set of parameters matches the data, we need to compute the likelihood. As we will see in Sec. 3.6 the likelihood is related to χ2 so we start by examining some properties of the χ2. If this is more than a repetition for you, you should probably consult a book or website on introductory statistics.

3.1 χ2, ∆χ2, reduced χ2, and combining data sets

First we start with a χ² test. The lower the χ², the better the fit. The value of χ² for a particular data/model combination is given by

χ²₀ = Σ_i [ (µ_model − µ_i) / σ_i ]²,    (4)

where µ_model is the value predicted by your cosmological model and µ_i ± σ_i is the ith data point and its uncertainty. Given a value of χ² you can calculate the likelihood of that model. We show how to calculate likelihoods in Sec. 3.6, but you can also calculate your uncertainties directly from the χ² values. These are appropriate in the case of Gaussian distributed likelihoods. (The key is not whether the data have Gaussian errors, but rather whether the resulting distribution of likelihoods is Gaussian.)
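As an example, a brute-force grid evaluation of Eqn. 4 for a toy one-parameter straight-line model (hypothetical data, not the SN likelihood) might look like:

```python
import numpy as np

# Hypothetical data: y = 2x + noise, with uncertainties sigma
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])
sigma = np.full_like(y, 0.2)

def chi2(slope):
    """Eqn. 4 for the one-parameter model y_model = slope * x."""
    return np.sum(((slope * x - y) / sigma)**2)

slopes = np.linspace(1.0, 3.0, 201)            # parameter grid
chi2_grid = np.array([chi2(s) for s in slopes])
best = slopes[np.argmin(chi2_grid)]            # lowest chi^2 = best fit
```

The best-fit slope lands close to the value used to generate the toy data; the same loop-over-a-grid pattern is what becomes prohibitive in many dimensions.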

3.2 What is the best fit, and the uncertainty?

The parameter values that give you the lowest χ2 are the best fit parameters of your model. To calculate the uncertainties on those best fits you compare the other “models” (same models with other parameter values) to the χ2 for the best fit,

∆χ2 = χ2 − min(χ2). (5)

The larger the ∆χ², the worse the fit. There are standard thresholds of ∆χ² that correspond to uncertainties of one, two, and three standard deviations (σ1, σ2, σ3). The meaning of the standard deviations is simply that 68.27%, 95.45%, and 99.73% of the data should lie within the one, two, and three σ limits, respectively. Note that these are not exactly round numbers, and the 95% confidence interval is slightly different to the 2σ limit.6

The values of ∆χ² corresponding to [σ1, σ2, σ3] differ depending on the number of parameters you’re fitting. When you’re fitting one parameter, ∆χ² = [1, 4, 9] corresponds to the first, second, and third standard deviations, respectively. When you’re fitting two parameters, the first three standard deviations are given by ∆χ² = [2.30, 6.18, 11.83]. When you’re fitting more parameters I recommend you use likelihoods rather than the raw χ² value anyway.
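These thresholds can be recovered from the χ² distribution itself; a quick check with scipy (assuming scipy is available):

```python
from scipy import stats

# Fraction of probability within 1, 2, 3 sigma of a Gaussian
levels = [0.6827, 0.9545, 0.9973]

# Delta chi^2 thresholds: inverse CDF of a chi^2 distribution with
# as many degrees of freedom as fitted parameters
one_param = [stats.chi2.ppf(p, df=1) for p in levels]  # ~ [1, 4, 9]
two_param = [stats.chi2.ppf(p, df=2) for p in levels]  # ~ [2.30, 6.18, 11.83]
```

This reproduces both rows of Tab. 2, and generalises to any number of fitted parameters by changing `df`.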

4 http://cosmologist.info/cosmomc/readme.html
5 http://monte-python.readthedocs.io/en/latest/getting_started.html
6 The values are actually defined as σ_z = erf(z/√2), where erf (the error function) is the cumulative integral of a Gaussian distribution,

erf(x) = (2/√π) ∫₀^x e^{−t²} dt.    (6)

Figure 5: No further explanation... https://xkcd.com/1725/.

∆χ² for          σ1     σ2     σ3
one parameter     1      4      9
two parameters    2.30   6.18   11.83
more parameters   Use likelihoods

Table 2: ∆χ² values corresponding to different standard deviations (σ1, σ2, σ3) for different numbers of parameters (one and two).

Figure 4: Demonstration of ∆χ² for one and two degrees of freedom (solid and dashed respectively). In the case of Gaussian errors the best fit and uncertainties after marginalisation correspond to the 1 degree of freedom limits.

3.3 How good is the fit?

When fitting models to data it is not enough to find out which parameters give the best fit; you also need to establish whether that best fit is actually a good fit. To get an approximate measure of goodness of fit, use the reduced χ²: divide χ² by the number of degrees of freedom, n_dof = n_data − 1 − n_params, where n_data is the number of data points and n_params is the number of parameters you are fitting for. For a good fit the result should be close to 1. This is not a rigorously defined rule, but rather a rule of thumb. A reduced χ² much less than 1 indicates that the error bars are probably overestimated (too large) or that you have too many parameters in your model. Meanwhile, a reduced χ² much larger than 1 indicates that the model is not a good fit to the data. However, care should be taken, because a high reduced χ² could actually indicate a number of other things as well. A high reduced χ² could mean,

• the model is inadequate and a better model is needed; • the uncertainties on the data points have been underestimated; • there is a systematic error in the data.

3.4 Including priors

If we have prior information about a particular parameter that goes into this model, then we want to weight the χ² value so it prefers parameter values close to our prior. We do this by adding an extra term in the χ² equation. Say we have a prior on a parameter, x, such that we know x = x_prior ± σ_prior. For example x_prior could be ΩM = 0.27 ± 0.03 or a combination such as ΩM + ΩΛ = 1.00 ± 0.02, which you know from previous experiments. The prior contributes

χ²_prior = [ (x_model − x_prior) / σ_prior ]².    (7)

Then the total is simply the sum of those χ² values, χ² = χ²₀ + χ²_prior. Your choice of priors will affect the posterior, so choose wisely. In some cases the prior has a physical meaning, e.g. masses must be positive, but in other cases it’s not obvious. If in doubt, be as inclusive as possible. For MCMC chains, the prior can also affect which part of parameter space will be explored, and “flat”, “logarithmic”, or “Gaussian” priors on a given parameter might lead to different results, in particular if your chain hasn’t converged well enough.
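Continuing the toy straight-line example from Sec. 3.1, a Gaussian prior on the slope (hypothetical numbers, as if from a “previous experiment”) just adds the term of Eqn. 7 to the χ²:

```python
import numpy as np

# Same hypothetical data as before
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])
sigma = np.full_like(y, 0.2)

slope_prior, sigma_prior = 1.8, 0.05   # hypothetical prior from another experiment

def chi2_total(slope):
    chi2_data = np.sum(((slope * x - y) / sigma)**2)       # Eqn. 4
    chi2_prior = ((slope - slope_prior) / sigma_prior)**2  # Eqn. 7
    return chi2_data + chi2_prior

slopes = np.linspace(1.0, 3.0, 2001)
best = slopes[np.argmin([chi2_total(s) for s in slopes])]
# the tight prior pulls the best fit from ~2.02 toward 1.8
```

The best fit lands between the data-only value and the prior centre, weighted by their respective precisions — exactly the behaviour described above.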

3.5 Including other independent data sets

Including other data sets is simple, when the data sets are independent. You simply repeat the above for whichever data set you are interested in, calculate the χ2 for that new data set, and add it to the χ2 total as though it were another prior so that

χ² = Σ_i χ²_data,i + Σ_j χ²_prior,j .    (8)

The ∆χ² thresholds for σ1,2,3 don’t change as you add extra data sets, because you still have the same number of parameters in your model. We deal with correlated data in Sec. A.

WARNING! Once you’ve combined data sets, make sure that your new best fit is still a good fit by checking that the reduced χ² is about one. When a model inadequately describes the data, you can still get sensible looking likelihoods, often with small error bars, but the model should be ruled out because the best fit is a poor one. See Fig. 6 for an example.

Figure 6: Example of two badly aligned data sets. The real model from which the data is generated is a parabola y = z². The data have been split into two overlapping data sets; the low-z data in blue and the high-z data in red. The model we’re testing is a linear fit, y = mz + b, with the slope, m, and the y-intercept, b, as the parameters we’re fitting for. The best linear fits are shown as dashed lines on the plot on the left, with the dotted line showing the best fit to the two data sets combined. The linear model is a good fit to each individual data set, given the uncertainties. The plot on the right shows the likelihood contours in the m vs b plane, and the best fit values are given in the upper right for each of the data sets (red and blue), and the data sets combined (green). The χ² per degree of freedom is given in the bottom left. The high-z data have more points and slightly tighter error bars, which results in tighter contours. The point of this plot is to show that even though both data sets are good fits to the linear model (χ²/dof ∼ 1), once you combine them the total indicates a bad fit (χ²/dof ∼ 4). Nevertheless, you often see people quoting an amazingly precise result, like the green contours here, when all that is actually showing is that the model is a bad one. Beware the wrath of the referee if you make that error.

3.6 Likelihoods

Converting a χ² value into a likelihood L is simple,

L = e^{−χ²/2},    (9)

from which you can see why χ² is often referred to as the ‘log-likelihood’,

−2 ln L = χ².    (10)

Thanks to the logarithm you can see that whereas you add χ2 values from independent experiments to get the total χ2 you multiply likelihoods to get the total likelihood. Once you have a likelihood distribution for a parameter, it is straightforward to integrate under the curve to find the value of the likelihood, within which 68.27% of the likelihood area is enclosed. The values of the parameter that match that likelihood are your ±1σ uncertainties. The same procedure works whether or not the likelihood surface is Gaussian.
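The integration step can be sketched numerically. Here is a minimal, hypothetical example that finds the parameter range enclosing 68.27% of the likelihood area on a grid (it keeps the most-likely grid points until the target fraction is enclosed, so the ±1σ limits share the same likelihood level):

```python
import numpy as np

def one_sigma_interval(theta, like, frac=0.6827):
    """Parameter range enclosing `frac` of the likelihood area
    (uniform grid, single-peaked likelihood assumed)."""
    order = np.argsort(like)[::-1]              # grid points, most likely first
    cum = np.cumsum(like[order]) / like.sum()   # enclosed fraction of the area
    inside = order[:np.searchsorted(cum, frac) + 1]
    return theta[inside].min(), theta[inside].max()

# Gaussian example: a parameter measured as 1.0 +/- 0.5
theta = np.linspace(-2.0, 4.0, 6001)
chi2 = ((theta - 1.0) / 0.5)**2
like = np.exp(-chi2 / 2.0)                      # Eqn. 9
lo, hi = one_sigma_interval(theta, like)        # ~ (0.5, 1.5), i.e. +/- 1 sigma
```

For this Gaussian case the recovered interval matches the ±1σ limits exactly; for a skewed likelihood the same routine returns asymmetric limits, which connects to the discussion in Sec. 3.8.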

Figure 7: Example of a Gaussian likelihood distribution (left) and a skewed likelihood distribution (middle). The relative likelihood appears in the top panel of each with the cumulative likelihood below. In the case of a Gaussian likelihood distribution there is no difference between the mean and max-likelihood methods for determining the best fit. However, in the skewed case the values are very different: the maximum likelihood method would give 0.57 +0.80/−0.33, whereas the mean of the posterior would give 0.93 +0.86/−0.48. The maximum likelihood makes more sense when looking at the relative likelihood plot, as the point chosen has the maximum likelihood (thus the name), and the ±1σ limits have the same relative likelihood (horizontal blue line). On the other hand, the mean makes more sense when looking at the cumulative likelihood plot, as the point chosen is in the middle of the probability distribution and the ±1σ limits enclose equal amounts of the cumulative probability. Which you use is a matter of taste. A truncated likelihood (right) also results in the mean and max-likelihood methods differing. This occurs where the parameter range being tested either (1) doesn’t include the full range that the data says is possible, or (2) butts up against a hard limit, such as the limit ΩM ≥ 0 needed because we can’t have a negative matter density.

3.7 Importance sampling

If you have an MCMC chain from a previous sampling but “forgot” to include a prior or an independent data set, you might be able to avoid re-running the chain by using a method called importance sampling (Lewis & Bridle, 2002). Basically this means re-assigning a new likelihood to all points in the sample following Eqn. 8. However, this is not always a reasonable approach: the MCMC chain and the new information have to cover the same region in parameter space and must not prefer wildly different best fit values for overlapping parameters. Phrased in terms of likelihood, an MCMC chain is importance sampled by re-computing the likelihood and updating the multiplicity, w:

L_IS = L_orig L_prior ,    (11)

w_IS = w_orig (L_IS / L_orig) .    (12)
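A minimal sketch of Eqns. 11–12, reweighting an existing chain with a Gaussian prior that was “forgotten” (hypothetical chain and prior, with all original multiplicities equal to one):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical chain: samples of one parameter, each with multiplicity (weight) 1
samples = rng.normal(0.3, 0.1, size=50000)
weights = np.ones_like(samples)

# New Gaussian prior we forgot to include: theta = 0.35 +/- 0.05
log_l_prior = -0.5 * ((samples - 0.35) / 0.05)**2

# Eqns. 11-12: L_IS = L_orig * L_prior, so w_IS = w_orig * L_prior
weights_is = weights * np.exp(log_l_prior)

mean_before = np.average(samples, weights=weights)
mean_after = np.average(samples, weights=weights_is)  # pulled toward the prior
```

The reweighted mean sits between the chain’s original mean (0.3) and the prior centre (0.35), as expected for a product of two Gaussians. If the prior barely overlapped the chain, almost all weight would fall on a handful of points and the result would be unreliable — the caveat noted above.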

3.8 Non-Gaussian likelihoods

In the case of a non-Gaussian distribution, picking your best fit value and uncertainties becomes tricky (see Fig. 7). The maximum likelihood value is no longer in the centre of the distribution, and if there is a long tail in one direction or the other, then the majority of the likelihood can end up sitting far to one side or the other of the actual best fit. Therefore, another way to choose the best fit value is to choose the value for which 50% of the likelihood is above and 50% below. In other words, take the mean of the posterior distribution (if you like Bayesian language) or find where the cumulative likelihood hits 0.5. When the likelihood distribution is Gaussian, these two are equivalent, but in the case where the likelihood distribution is skewed, the two can differ substantially (see Fig. 7). The other instance in which the mean and the maximum likelihood will not coincide, is when a parameter has a hard cut-off on one side, and that cutoff is within the realistic limits of your measurement. For example, matter density can’t be less than zero, so if the data allow for zero matter density then the likelihood value will be truncated (see Fig. 7).

4 Marginalisation

Marginalisation is basically about getting rid of unwanted parameters when quoting your results. You primarily need it in two circumstances. Firstly, when you have a nuisance parameter (e.g. instrumental effects that you don’t care about), and secondly when you have a multi-parameter fit for which you would like to quote results for a single parameter (or plot a 2D contour when you’ve tested more than two parameters). Marginalisation just reduces the dimensions of your array of likelihoods.

4.1 Simple grid example

At its heart marginalisation is a very simple procedure. We’ll start with an example. Say you’re testing ΩM with SN data, within the flat-ΛCDM model. The parameter ΩM is the only free parameter of the model, but due to uncertainties in the absolute magnitude of the SNe and H0, which both shift the magnitude–redshift curve up and down, you also have to vary the nuisance parameter M. So for each value of ΩM you test, you have an array of χ² values, corresponding to all the different M values you tested. To marginalise over M you just convert those χ² values to likelihoods and add up the total likelihood for each ΩM.7

Figure 8: Marginalising over a nuisance parameter M to extract constraints on ΩM. For each ΩM being tested you simply add the likelihoods for all possible values of M, to get the total relative likelihood for that ΩM. On the left this is shown as reducing a matrix of likelihoods to a 1D array. On the right is the equivalent pictorial representation, reducing a contour to a 1D likelihood curve.

4.2 Gaussian likelihood distribution

When you have a Gaussian likelihood distribution, you can use a shortcut: rather than summing over the likelihoods, you can just use the χ² values for the best-fit M for each ΩM.

The MCMC technique generates points for each tested model, and it is the density of points in each ΩM bin that gives you the likelihood of that ΩM value. So finding the maximum likelihood in an MCMC chain involves finding where the density of points is highest. Marginalising over extra parameters is just as simple in the MCMC case – you just add up the number of points, N, in the unwanted parameters and attribute them to the corresponding ΩM you’re testing (and normalise by the total number of points in the chain).

L(ΩM) = ∫ L(ΩM, M) dM   in theory   [ generally ∫ L(θ, θ_nuisance) dθ_nuisance ],    (13)
      = Σ_i L(ΩM, M_i)   over a grid   [ Σ_i L(θ, θ_nuisance,i) ],    (14)
      = L(ΩM, M_max)   when Gaussian   [ L(θ, θ_nuisance,max) ],    (15)
      = Σ_i N(ΩM, M_i)/N_tot   in MCMC   [ Σ_i N(θ, θ_nuisance,i)/N_tot ].    (16)
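The grid case (Eqn. 14) is a one-line sum over the nuisance axis. Here is a minimal numpy sketch on a toy 2D Gaussian likelihood surface (illustrative numbers; `om` and `nuis` just stand in for ΩM and M):

```python
import numpy as np

# Toy 2D likelihood surface on a grid: parameter of interest vs nuisance
om = np.linspace(0.0, 1.0, 201)      # stands in for Omega_M
nuis = np.linspace(-1.0, 1.0, 201)   # stands in for the nuisance parameter M
OM, NU = np.meshgrid(om, nuis, indexing="ij")

# Correlated Gaussian chi^2 surface (purely illustrative numbers)
chi2 = ((OM - 0.3) / 0.05)**2 + ((NU - 0.1) / 0.2)**2 + 5 * (OM - 0.3) * (NU - 0.1)
like = np.exp(-chi2 / 2.0)

# Eqn. 14: marginalise by summing the likelihood over the nuisance axis
like_marg = like.sum(axis=1)
best_om = om[np.argmax(like_marg)]   # ~0.3 for this toy surface
```

For higher-dimensional grids you repeat the sum over each unwanted axis; in the MCMC case the same operation is a weighted histogram of the chain in the parameter of interest.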

4.3 Marginalisation for contour plots

Even when there are no ‘nuisance’ parameters, you often need to marginalise to present the results of a multi-parameter fit. For example, when fitting for BAO in ΛCDM you need to fit for the set θΛCDM = (H0, ΩM, ΩΛ). How do you generate the contour plots for (ΩM, ΩΛ)? You marginalise over H0. In other words, L(ΩM, ΩΛ) = Σ_i L(H0,i, ΩM, ΩΛ). In general, for higher numbers of parameters, you just keep repeating the sum over the unwanted parameters until you get down to the parameters you’re interested in. Since you generally are just looking for relative likelihoods, normalising the likelihood surface is often not necessary, but it is good practice to make sure that the total likelihood adds up to one.

Is your best fit still a good fit? Because you sum likelihoods, the marginalised result no longer corresponds to a single χ² value, so don’t attempt to interpret χ² values after marginalisation. Just look at the lowest χ² in your unmarginalised grid to see whether that is a good fit.

4.4 Analytic marginalisation

In some cases it is possible to get rid of nuisance parameters by analytic marginalisation. This is possible e.g. when the nuisance appear in the mean of a Gaussian likelihood, and the parameters are Gaussian distributed (Bridle et al., 2002). In practice it

7In the case of the distance modulus data from SNe there is an analytical way to marginalise, but more about that in Sec. ?? 9 Figure 9: Likelihood contours for the flat-wCDM model, using the SN data from the Union 2 compilation. Black con- tours show the 1, 2, and 3, sigma χ2 limits for 2 degrees of freedom. Gray contours show the likelihoods for the same. They differ slightly because this test has been done on a grid and the grid truncates the edge of the distribution a little. The upper and right panels show the marginalised likelihood dis- tributions for ΩM and w respectively. The maximum likeli- hood point before marginalisation is shown as the grey/black bullseye in the contour plot. The maximum likelihood of the 1D distributions after marginalisation are shown as the blue crosses and error bars. The mean of the posterior is shown as the red diamond and error bars. Note that the maximum likelihood of the marginalised distribution does not find the maximum of the total distribution, but the mean of the poste- rior comes closer to picking the true best fit.

means that for a given set of parameters, you choose the value of the nuisance parameter that maximises the likelihood. The normal procedure for marginalisation is P(θ) = ∫ P(θ, n) dn, where n is the nuisance parameter you want to get rid of. For a Gaussian distribution you can instead maximise the likelihood as a function of the nuisance parameter, so that P(θ) ∝ max_n P(θ, n).
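A toy numerical check of this equivalence (the Gaussian likelihood and grid ranges below are assumed purely for illustration): when the nuisance parameter enters as the mean of a Gaussian, integrating over it and maximising over it give curves of the same shape in θ.

```python
import numpy as np

# Toy likelihood, Gaussian in the nuisance parameter n, centred on theta.
theta = np.linspace(-3.0, 3.0, 61)
n = np.linspace(-10.0, 10.0, 2001)
T, Ngrid = np.meshgrid(theta, n, indexing="ij")
sigma_n = 1.5
L = np.exp(-0.5 * (Ngrid - T)**2 / sigma_n**2) * np.exp(-0.5 * T**2)

dn = n[1] - n[0]
L_marg = L.sum(axis=1) * dn    # marginalise: integrate over n
L_prof = L.max(axis=1)         # profile: maximise over n

# For a Gaussian nuisance the two differ only by a constant factor,
# so the normalised curves coincide.
same_shape = np.allclose(L_marg / L_marg.max(),
                         L_prof / L_prof.max(), atol=1e-5)
```

For non-Gaussian nuisance parameters the two curves generally differ, which is why the analytic shortcut only applies in the Gaussian case.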

4.5 Mean of the posterior vs maximum likelihood

In Figure 9 we demonstrate the marginalisation procedure for the flat-wCDM model with supernova data. The two parameters are w and ΩM, and the 2D likelihood contours are shown in the bottom left, with the 1D marginalised likelihood distributions for ΩM and w in the upper and right-hand panels respectively. The peak of the entire likelihood distribution is shown as the black and grey ‘bullseye’ at the centre of the contours. Notably, this maximum likelihood is not the maximum likelihood of either parameter after marginalisation. If you marginalise and then choose the maximum likelihood for each parameter, you get the blue cross. On the other hand, if you take the mean of the marginalised likelihood distribution, you get the red diamond. When you look at the marginalised distributions you might think “Why would one ever use the mean of the posterior, since it obviously doesn’t pick the best fit value of this parameter?” However, the reason becomes clear when you go back to the full likelihood distribution and see that the mean after marginalisation is closer to the maximum likelihood point of the whole distribution than the maximum likelihood after marginalisation is. (The red diamond is closer to the black bullseye than the blue cross is.) In general (for non-Gaussian distributions), the mean of the posterior is a better representation of the overall likelihood distribution than the maximum likelihood point after marginalisation. Planck Collaboration et al. (2016) quote the mean of the likelihood and the two-sided 68% confidence limits (95% confidence limits for “capped” parameters). If we know the best fitting point overall (the bullseye), then why go to all the trouble of marginalising anyway? Well, that bullseye point does not come with any uncertainty estimates. The real reason we go to the trouble of marginalising is to figure out the size of our error bars.

5 Model comparison

So far we’ve been trying to figure out what the best fit parameters are within a model. We’ve seen that when a model is a poor fit, you get a χ² per degree of freedom significantly greater than one. That often indicates that a more complex model is needed – for example, one with an extra parameter, or a new model entirely. Some people have tried to quantify how bad the χ² must be before a new model, or extra parameter, is justified. This is by no means an exact science, and is to a great extent just quantifying your common sense. In general, when one model is nested within another, as in the case of flat-ΛCDM being a special case of ΛCDM, the model with extra parameters is guaranteed to be a better fit. So when comparing models, you can’t just prefer the one with the lowest χ², because you can always add another parameter and get a lower χ² again. There has to be some threshold of improvement in χ² to justify the addition of the extra parameter. There are various methods used to penalise the model with the extra parameters. Bayesian model selection is a popular way to select the preferred model. You will often see simplified criteria such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) used. Basically these give a threshold by which the more complex model must improve on the simpler one before the extra parameter is considered justified by the data. The aim is thus to quantify Occam’s razor and prefer the simpler model unless the data show a more complex model is necessary to get a good fit. The BIC is also known as the Schwarz information criterion (Schwarz, 1978) and is given by,

Figure 10: No comments. https://xkcd.com/892/

BIC = −2 ln L + k ln N,  (17)

where L is the maximum likelihood, k is the number of parameters, and N is the number of data points used in the fit. You can see that in the case of Gaussian errors the difference in the BIC values between two models just becomes ∆BIC = ∆χ² + ∆k ln N. A difference in BIC of 2 is considered positive evidence against the model with the higher BIC, while a ∆BIC of 6 is considered strong evidence (Liddle, 2004). Meanwhile the AIC (Akaike, 1974) is given by,

AIC = −2 ln L + 2k. (18)

This gives results similar to BIC, although the AIC is somewhat more lenient on models with extra parameters and has been shown to have practical advantages over BIC (better recovery when the candidate models are similar). Look at the data in Fig. 6. They were created from a parabola, but we fitted them with a linear model. If we instead fitted them with e.g. a cubic model, the χ² would be lower, but that doesn’t mean it’s the correct model. Instead, AIC or BIC would tell you that the extra parameters aren’t justified.
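The AIC/BIC bookkeeping can be sketched with toy data in the spirit of Fig. 6 (the data, noise level, and seed below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data drawn from a parabola with Gaussian noise (invented setup,
# echoing Fig. 6 where parabola data are fitted with a straight line).
x = np.linspace(-2.0, 2.0, 20)
sigma = 0.5
y = 1.0 + 0.5 * x + 1.0 * x**2 + rng.normal(0.0, sigma, x.size)

def ic(degree):
    """Least-squares polynomial fit; return (chi2, AIC, BIC)."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    chi2 = np.sum((resid / sigma)**2)
    k = degree + 1                       # number of fitted parameters
    # For Gaussian errors, -2 ln L = chi2 + const, so only the chi2
    # term matters for *differences* in AIC/BIC.
    return chi2, chi2 + 2 * k, chi2 + k * np.log(x.size)

for deg, label in [(1, "linear"), (2, "quadratic"), (3, "cubic")]:
    c, a, b = ic(deg)
    print(f"{label:9s} chi2 = {c:8.2f}  AIC = {a:8.2f}  BIC = {b:8.2f}")
```

The linear fit has a far worse χ², so its AIC and BIC are much higher; the cubic fit lowers the χ² slightly relative to the quadratic, but the k-dependent penalty terms are what decide whether that extra parameter is justified.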

5.1 Model averaging

Quite often no model will have decisive evidence over the competing models in terms of AIC or BIC. Instead of choosing a specific model, we can then include the uncertainty from the choice of model in the parameter estimates by computing a model-averaged posterior, where the individual posteriors from each model are summed together, weighted by the posterior probability of each model (Parkinson & Liddle, 2013)

P(θ̄|d) = [ Σ_i P(M_i|d) P(θ̄|d, M_i) ] / [ Σ_i P(M_i|d) ].

The idea is to produce a model averaged version of the posterior for a given parameter shared between the models. However, it is not always practically possible. Even when parameters are shared between several models, they might have slightly different meanings in each model, so the averaged posterior only makes sense in special cases. Other open questions are how many models you need to ensure that you probe the realistic model space, and how many common parameters they should have. Examples of practical applications can be found in Liddle et al. (2006), where they investigated whether dark energy evolves with time, and in Vardanyan et al. (2011), where model averaging is used to quantify the uncertainty on the flatness of the universe. But as stated by T. Loredo (Cornell University): “As a final remark, astronomers new to Bayesian methods really should consider Bayesian Model Averaging to be a somewhat advanced topic. Like nonparametrics, it’s an area where it’s a good idea to run your calculation by someone with experience and knowledge of the literature.”

6 Literature

If you want to know more, these are good places to start.

Cosmology: from theory to data, from data to theory (Leclercq et al., 2014). Probably the most useful reference if all you need is a practical guide to model fitting in cosmology.

Bayesian Methods in Cosmology (Hobson et al., 2014). A whole book dedicated to Bayesian statistics in the context of cosmology.

Markov Chain Monte Carlo Methods for Bayesian Data Analysis in Astronomy (Sharma, 2017). New and very detailed review article about all the details of MCMC (and probably a few more) that you will ever need. No cosmological examples, though.

A Correlated data sets

I don’t think I’ll go through this, but I’ll leave it in the notes. When two data points are correlated it means that they will both tend to lie either above or below the correct model. For example, if you measured the temperature at noon and 1pm each day you would find that those two data points were consistently above the mean daily temperature. If you averaged those two data points you would not get the correct mean daily temperature, and no matter how many noon-time data points you added to your average you would still not get it right. That means the uncertainty is not dropping as 1/√n, as you would expect if your n data points were independent. That’s because temperature measurements are correlated with the time they are taken. Temperature readings at midnight would be anti-correlated with temperature readings at noon. So when correlations exist in your data it is crucial that you account for them in your statistical analysis. Whether combining with the CMB, or taking the ratios of two BAO measurements, there are many potential sources of correlation between data sets. In this section we attempt to present in a straightforward manner how to account for them in your statistical analysis.

Figure 11: An example of correlated data from Blake et al. (2011). This is the correlation function measured for the galaxies observed by the WiggleZ survey on the Anglo-Australian Telescope. The correlation function tells you the overdensity of galaxy pairs as a function of separation. In this case we’re measuring the correlation between galaxy positions, but that’s not the correlation I want to point out. Rather, I want to point out that the data points on this plot are correlated. That’s because the volumes in which we’re measuring our galaxy pairs are overlapping, so we’re sampling the same density field. The correlation matrix is shown on the right. The axes are both separation (corresponding to s in the left panel). Red is correlated, blue is uncorrelated or anti-correlated. (To be precise, we plot the amplitude of the cross-correlation, C_ij/√(C_ii C_jj).) The correlation explains why the data points aren’t as scattered as you would expect for the size of the error bars. The error bars are just the diagonal part of the covariance matrix.

When two data sets are not independent you can’t simply add the χ² values together, because that would double-count the correlated part of the data and give falsely tight, and potentially misleading, constraints. The χ² technique described above is actually just a simplification of the full analysis taking correlations into account. To do it properly you should use the covariance matrix, which is simply the matrix of uncertainties. In the simple case of uncorrelated data, the equation for χ² can be rewritten as a matrix product. Say you have two data points with theoretical expectations x_thry and y_thry and measured values x_obs ± σx and y_obs ± σy. The covariance matrix is just the matrix of the variances, which in the uncorrelated case is simply,

" 2 # σx 0 V = 2 . (19) 0 σy

Writing the difference between observation and theory as a vector, Xᵀ = (x_thry − x_obs, y_thry − y_obs) = (∆x, ∆y), the equation for χ² is then

χ² = Xᵀ V⁻¹ X,  (20)

or writing it out longhand,

χ² = (∆x, ∆y) [ 1/σx²  0 ;  0  1/σy² ] (∆x, ∆y)ᵀ.  (21)

Correlated data means that if one data point lies above the expected curve, the other is likely to do so too. So if you don’t take that into account it can (a) lead to skewed results and (b) give you falsely tight constraints. The way to take correlations into account is to add uncertainty to the measurement. Another way to put it is that when data points are correlated, adding extra data reduces the uncertainty by less than the usual √n amount (where n is the number of data points).
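A quick numerical sketch of Eq. (20), with made-up residuals and uncertainties; it also previews what happens once a correlation coefficient is introduced:

```python
import numpy as np

# Made-up residuals (Delta_x, Delta_y) and uncertainties for illustration.
X = np.array([0.2, -0.3])
sx, sy = 0.3, 0.4

# Uncorrelated case: the matrix form reduces to the usual chi^2 sum.
V = np.diag([sx**2, sy**2])
chi2_matrix = X @ np.linalg.inv(V) @ X
chi2_simple = (X[0] / sx)**2 + (X[1] / sy)**2

# Correlated case: same residuals, but with correlation coefficient rho.
rho = 0.8
Vc = np.array([[sx**2, rho * sx * sy],
               [rho * sx * sy, sy**2]])
chi2_corr = X @ np.linalg.inv(Vc) @ X

print(chi2_matrix, chi2_simple, chi2_corr)
```

The first two numbers agree (the matrix form is just the weighted sum written compactly), while the third differs: ignoring the off-diagonal terms would misstate the constraints.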

In general, when two parameters, with theoretical expectations x_thry and y_thry and measured values x_obs ± σx and y_obs ± σy, are correlated with correlation coefficient ρxy (not to be confused with density), the covariance and inverse covariance matrices are given by (Wall & Jenkins, 2003, Eq. 4.4),

V = [ σx²  ρxy σx σy ;  ρxy σx σy  σy² ];   V⁻¹ = 1/(1 − ρxy²) [ 1/σx²  −ρxy/(σx σy) ;  −ρxy/(σx σy)  1/σy² ].  (22)

For three parameters, the covariance matrix (before inversion) is

V = [ σx²  ρxy σx σy  ρxz σx σz ;  ρxy σx σy  σy²  ρyz σy σz ;  ρxz σx σz  ρyz σy σz  σz² ].  (23)

At this point we stop writing out the inverse covariance matrices and recommend you get your computer to calculate them for you. There are various ways to calculate what the correlation between two measurements is, but we leave that for another day.
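For example, in Python you can let NumPy do the inversion, and check it against the analytic 2×2 form of Eq. (22) (the numbers here are illustrative):

```python
import numpy as np

# Illustrative uncertainties and correlation coefficient.
sx, sy, rho = 0.3, 0.4, 0.6
V = np.array([[sx**2, rho * sx * sy],
              [rho * sx * sy, sy**2]])

# Analytic inverse for the 2x2 correlated case, as in Eq. (22).
Vinv_analytic = (1.0 / (1.0 - rho**2)) * np.array(
    [[1.0 / sx**2, -rho / (sx * sy)],
     [-rho / (sx * sy), 1.0 / sy**2]])

# Numerical inverse -- what you would actually use for larger matrices.
Vinv_numeric = np.linalg.inv(V)
print(np.allclose(Vinv_numeric, Vinv_analytic))
```

For three or more correlated parameters the analytic expressions get unwieldy fast, which is exactly why handing the inversion to the computer is the sensible default.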

A.1 Slightly more complex correlations

Sometimes you need to combine data from different sources before calculating χ². If the data you’re combining are correlated, there are some subtleties in how to address that. For example, the s_z parameter used in BAO studies is actually a ratio between the sound horizon scale at last scattering, as measured by the CMB through ℓA, and that seen in the galaxies, d_z. (They’re actually defined in opposite senses, so the ratio becomes the product, ... details details ...).

When working with WiggleZ, we had three measurements of d_z and one measurement of ℓA, which we combined to create three measurements of s_z = ℓA d_z/π. Two of the d_z values were correlated with each other, while all of the d_z measurements were correlated with ℓA. How do you calculate the covariance matrix for that combination?

Start with the vector s = [s1, s2, s3] = [d1 ℓA, d2 ℓA, d3 ℓA]. The Jacobian is a handy matrix made up of the partial derivatives of the components of s with respect to each of d1, d2, d3, and ℓA:

J = [ ∂s1/∂d1  ∂s1/∂d2  ∂s1/∂d3  ∂s1/∂ℓA ;  ∂s2/∂d1  ∂s2/∂d2  ∂s2/∂d3  ∂s2/∂ℓA ;  ∂s3/∂d1  ∂s3/∂d2  ∂s3/∂d3  ∂s3/∂ℓA ] = [ ℓA  0  0  d1 ;  0  ℓA  0  d2 ;  0  0  ℓA  d3 ].  (24)

Imagine that d1 and d2 are correlated (as in the SDSS points at z = 0.2 and z = 0.35) but d3 is uncorrelated (for example, a WiggleZ point at z = 0.6), then the covariance matrix for the original four data points looks like

C = [ σ11²  σ12²  0  0 ;  σ12²  σ22²  0  0 ;  0  0  σ33²  0 ;  0  0  0  σ_ℓAℓA² ].  (25)

We need to convert that to a new covariance matrix for the three components of s. That is what the Jacobian is used for. The covariance matrix for s = [s1, s2, s3] is,

C_new = J C Jᵀ = [ σ11² ℓA² + σ_ℓAℓA² d1²  σ12² ℓA² + σ_ℓAℓA² d1 d2  σ_ℓAℓA² d1 d3 ;  σ12² ℓA² + σ_ℓAℓA² d1 d2  σ22² ℓA² + σ_ℓAℓA² d2²  σ_ℓAℓA² d2 d3 ;  σ_ℓAℓA² d1 d3  σ_ℓAℓA² d2 d3  σ33² ℓA² + σ_ℓAℓA² d3² ].  (26)
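A sketch of this propagation in Python, with invented numbers standing in for the actual measurements:

```python
import numpy as np

# Invented values standing in for the three d_z measurements and ell_A.
d = np.array([0.19, 0.11, 0.07])
lA = 300.0

# Covariance of [d1, d2, d3, ell_A] with the structure of Eq. (25):
# d1 and d2 correlated, d3 and ell_A independent of everything else.
C = np.diag([0.0057**2, 0.0036**2, 0.0033**2, 0.8**2])
C[0, 1] = C[1, 0] = 0.3 * 0.0057 * 0.0036

# Jacobian of s_i = d_i * ell_A with respect to (d1, d2, d3, ell_A), Eq. (24).
J = np.array([[lA, 0.0, 0.0, d[0]],
              [0.0, lA, 0.0, d[1]],
              [0.0, 0.0, lA, d[2]]])

C_new = J @ C @ J.T        # Eq. (26): covariance of s = [s1, s2, s3]
```

Note that `C_new` is fully populated even though `C` was mostly diagonal: the shared ℓA measurement correlates all three s values, exactly as the off-diagonal σ_ℓAℓA² d_i d_j terms in Eq. (26) say.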

A similar procedure then needs to be followed to calculate the correlation between the results of this measurement and any other correlated data, such as the CMB-R parameter.

B Exercise: Sampling and marginalising

The aim of this exercise is to write your own Metropolis-Hastings sampler. You’ll also do some basic analysis of the chains, including convergence check and likelihood functions. In most cases we apply MCMC sampling because we don’t know the target distribution or because it is difficult to calculate analytically (e.g. we only have measurements, not the underlying distribution). In terms of cosmology, the target function is often a likelihood. Here we will assume a simple form of the target distribution to make things easier. We use an exponential distribution with mean 1: t(x) = e−x for x > 0 otherwise t(x) = 0. For the proposal/jumping function we use a Gaussian with zero mean and unity standard deviation (Gaussian sampling is very standard).

1. Define the target distribution function (the likelihood).

2. Build a Metropolis-Hastings scheme to sample from the target distribution. The acceptance ratio is A = t(proposed x)/t(current x), and you jump from the current x to the proposed x with probability min(1, A). The framework is:

define target function
x_current = x_start = some value
N = number of steps/proposals
loop from zero to N:
    x_proposed = x_current + random number from Gaussian distribution (mean=0, sigma=1) (this is usually a built-in function)
    A = t(x_proposed)/t(x_current)
    B = random number from uniform distribution from 0 to 1 (this is usually a built-in function)
    update x_current to x_proposed if A>B, otherwise x_current stays the same
    record x_current (and t(x_current))

3. Plot the 1D sample distribution (x as a function of sample number). Is there any burn-in that you should remove?

4. Plot a histogram of sampled x values. Bin the likelihood values in x and overplot the distribution of normalised likelihood values.

5. Determine and plot the autocorrelation as a function of lag,

ρ_lag = [ Σ_{i}^{N−lag} (x_i − x̄)(x_{i+lag} − x̄) ] / [ Σ_{i}^{N} (x_i − x̄)² ].  (27)

What is the smallest lag that gives ρlag ≈ 0? Is it much smaller than the length of the chain?

6. Extend to multiple parameters. You can use a target distribution of t(x1, x2) = e^(−x1) + e^(−(2−x2)²/4). The framework now becomes:

define target function of x_1 and x_2
x1_current = x1_start = some value
x2_current = x2_start = some value
N = number of steps/proposals

loop from zero to N:
    x1_proposed = x1_current + random number from Gaussian distribution (mean=0, sigma=1)
    x2_proposed = x2_current + random number from Gaussian distribution (mean=0, sigma=1)
    A = t(x1_proposed, x2_proposed)/t(x1_current, x2_current)
    B = random number from uniform distribution from 0 to 1
    update x1_current to x1_proposed and x2_current to x2_proposed if A>B, otherwise they stay the same
    record x1_current, x2_current (and t(x1_current, x2_current))

7. Check the convergence of your chain

8. Plot x1 versus x2

9. Bin x1 and x2 and the target distribution and make a contour plot of one, two and three standard deviations. One way to find the contour values of the standard deviations is to sort your likelihood chain, normalise it to one, compute the cumulative sum, and choose the values of the sorted chain that give a cumulative sum of ≈ 0.68, 0.95, 0.99.

10. Marginalise over x2 and plot the 1D distribution of x1, same for x2. What are the 1σ uncertainties?
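A minimal Python implementation of the 1D sampler and autocorrelation from steps 1–5 might look like this (the number of steps, seed, and starting point are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def target(x):
    """Exponential target with mean 1: t(x) = exp(-x) for x > 0, else 0."""
    return np.exp(-x) if x > 0 else 0.0

# Metropolis-Hastings with a Gaussian proposal (mean 0, sigma 1).
n_steps = 50000
x_current = 1.0                        # some starting value inside the support
chain = np.empty(n_steps)
for i in range(n_steps):
    x_proposed = x_current + rng.normal(0.0, 1.0)
    A = target(x_proposed) / target(x_current)   # acceptance ratio
    B = rng.uniform(0.0, 1.0)
    if A > B:                                    # accept with prob. min(1, A)
        x_current = x_proposed
    chain[i] = x_current                         # record, repeats included

def rho(samples, lag):
    """Autocorrelation at a given lag, as in Eq. (27)."""
    dx = samples - samples.mean()
    return np.sum(dx[:len(samples) - lag] * dx[lag:]) / np.sum(dx**2)

print(chain.mean())       # should land close to the target mean of 1
print(rho(chain, 1))      # positive: neighbouring samples are correlated
```

Note that rejected proposals repeat the current point in the chain, which is what makes the sample histogram trace the target distribution; it is also why the lag-1 autocorrelation is clearly positive.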

C Exercise: Importance sampling and marginalisation

The aim of this exercise is to understand how the plots from CosmoMC and MontePython are generated, and to use importance sampling to add an extra prior on the chain. Rather than using GetDist you will write your own code to examine an MCMC chain with the format from CosmoMC or MontePython.

1. Download a chain from https://drive.google.com/open?id=0B3pDjkLTg3eCbzB3UEpvQ0dlNzA (same chains as in Matteo’s workshop).

2. Read the chain in as a table or data frame. The first column is the multiplicity (weight) and the second is the total − ln(L) = χ²/2.

3. The chain was run with Planck and BAO data only. Now use the value of the Hubble parameter from Riess et al. 2016 (http://adsabs.harvard.edu/abs/2016ApJ...826...56R) to importance sample the likelihood and multiplicity of the chain: H0 = 73.24 ± 1.74 km/s/Mpc. Remember:

χ²_prior = ((x_model − x_prior)/σ_prior)²,  χ²_new = χ²_0 + χ²_prior,  and w_IS = L_prior w.

4. Load the new and old chains into GetDist and see how the prior affects the contours.

5. Marginalise over all other parameters and make a contour plot of omegal* versus omegam* before and after importance sampling. In order to make the contour plot you can bin the chain in omegal* and omegam* and sum the multiplicities/likelihoods. Here the easiest is probably to use the ∆χ² values for two independent parameters for the contour levels, ∆χ² = [2.30, 6.18, 11.83].
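A sketch of the importance-sampling step in Python. Since the real chain has to be downloaded, the rows below are mock values, and the column holding H0 is an assumption for this mock layout — check your .paramnames file for the actual column:

```python
import numpy as np

# Mock chain rows (invented numbers): columns are multiplicity (weight),
# -ln(L), and -- assumed for this sketch -- H0 in column 2.
chain = np.array([
    [3.0, 120.5, 67.8],
    [1.0, 121.0, 68.9],
    [5.0, 120.7, 67.2],
    [2.0, 120.9, 70.1],
])

# H0 prior from Riess et al. 2016.
H0_prior, sigma_prior = 73.24, 1.74
w, neglnL, H0 = chain[:, 0], chain[:, 1], chain[:, 2]

chi2_prior = ((H0 - H0_prior) / sigma_prior)**2
neglnL_new = neglnL + 0.5 * chi2_prior      # chi2_new = chi2_0 + chi2_prior
w_new = w * np.exp(-0.5 * chi2_prior)       # w_IS = L_prior * w

print(w_new / w)   # points far from the prior are strongly down-weighted
```

Writing `w_new` and `neglnL_new` back into the chain file then gives the importance-sampled chain that GetDist can compare against the original.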

References

Akaike, H. 1974, IEEE Transactions on Automatic Control, 19, 716

Brooks, S., & Gelman, A. 1998, Journal of Computational and Graphical Statistics, 7, 434

Blake, C., et al. 2011, Monthly Notices of the Royal Astronomical Society, 415, 2892, 1105.2862

Bridle, S. L., Crittenden, R., Melchiorri, A., Hobson, M. P., Kneissl, R., & Lasenby, A. N. 2002, Monthly Notices of the Royal Astronomical Society, 335, 1193, astro-ph/0112114

Davis, T. M., et al. 2007, The Astrophysical Journal, 666, 716

Heavens, A., Fantaye, Y., Sellentin, E., Eggers, H., Hosenie, Z., Kroon, S., & Mootoovaloo, A. 2017, ArXiv e-prints, 1704.03467

Hobson, M. P., Jaffe, A. H., Liddle, A. R., Mukherjee, P., & Parkinson, D. 2014, Bayesian Methods in Cosmology

Leclercq, F., Pisani, A., & Wandelt, B. D. 2014, ArXiv e-prints, 1403.1260

Lewis, A., & Bridle, S. 2002, Physical Review D, 66, 103511, astro-ph/0205436

Liddle, A. R. 2004, Monthly Notices of the Royal Astronomical Society, 351, L49, astro-ph/0401198

Liddle, A. R., Mukherjee, P., Parkinson, D., & Wang, Y. 2006, Physical Review D, 74, 123506, astro-ph/0610126

Parkinson, D., & Liddle, A. R. 2013, Statistical Analysis and Data Mining, 6, 3, 1302.1721

Planck Collaboration et al. 2016, Astronomy & Astrophysics, 594, A13, 1502.01589

Schwarz, G. 1978, Annals of Statistics, 6, 461

Sharma, S. 2017, Annual Review of Astronomy and Astrophysics, 55, 213, 1706.01629

Vardanyan, M., Trotta, R., & Silk, J. 2011, Monthly Notices of the Royal Astronomical Society, 413, L91, 1101.5476

Wall, J., & Jenkins, C. 2003, Practical Statistics for Astronomers, Cambridge Observing Handbooks for Research Astronomers (Cambridge University Press)
