ΛCDM and Beyond: Cosmology Tools in Theory and in Practice – CANTATA Cost-Action Summer School 2017
Signe Riemer-Sørensen, University of Oslo
Adapted from notes by Tamara Davis, University of Queensland

Statement of expectations

I don’t lecture, I teach. I’m here for you to learn. I expect active participation during the classes. This format requires much MORE EFFORT from the students than standard lectures, but the learning gain is also higher. We will cover some of the background information on how CosmoMC and MontePython work. During the classes we’ll be doing quizzes using the Kahoot! platform. I recommend you download the Kahoot! app for your smartphone1, but it’s absolutely not necessary. The quizzes work in any browser (go to https://kahoot.it and follow the instructions).

1 Introduction

Cosmology is about understanding the universe. We do some observations and then we compare them to models that fit in a bigger theory. But how do we actually determine the parameters of the models? And which model is the best?

“If your experiment needs a statistician, you need a better experiment.” – Ernest Rutherford, winner of the Nobel Prize in Chemistry

“Torture the data, and it will confess to anything.” – Ronald Coase, winner of the Nobel Prize in Economics

Handling statistics forms a crucial part of any data analysis, but it is often neglected in our formal education, or taught more formally than practically. Here we will try to bridge that gap and outline the basic data analysis techniques that are most commonly used in cosmology, including some common pitfalls and tips of the trade. To focus the discussion we will look at the constraints on our cosmological model parameters that can be derived from observations like supernovae (SNe), baryon acoustic oscillations (BAO), and the cosmic microwave background (CMB). In other words, how one gets from the observed data illustrated in the left part of Fig. 1 to the likelihood contours in the right part of the figure. Occasionally, we’ll comment on how CosmoMC and MontePython work, but most of the content is general. This is not meant to be a rigorous mathematical derivation of the statistics, just a practical guide on how to implement them.

Figure 1: The aim of this lecture is to reveal some of the subtleties involved in creating these types of contour plots for cosmology. How do you go from the supernova distance modulus vs redshift data on the left, to the red 1σ confidence contour on the right? Once you’ve got there, how do you combine that data with the other data sets including possible correlations, nuisance parameters, and multi-parameter models. Although we’re using cosmology as a case study, the techniques are applicable to data analysis in general. Plots adapted from Davis et al. (2007).

1App Store http://turl.no/1atg, Google Play http://turl.no/1atf

1.1 Models and parameters

It is important to distinguish up-front the difference between testing models and finding the best-fit parameters within those models. Different models could be based on different gravitational theories – for example General Relativity or f(R) gravity – or they could simply be different parameter combinations within a single theory of gravity – for example General Relativity with a cosmological constant and cold dark matter (ΛCDM) with the parameters θΛCDM = (H0, ΩM, ΩΛ), or General Relativity with a more general dark energy and cold dark matter (wCDM) with the parameters θwCDM = (H0, ΩM, Ωx, w). When you assume the universe is flat, that reduces the number of parameters you need to fit for and therefore constitutes a different model as well. So testing flat-ΛCDM is different to testing ΛCDM.

We consider three levels of inference, as shown in Tab. 1. In Sec. 2 we will discuss what we mean by the best model/parameters and the framework of Bayesian statistics. In Sec. 3 we will address the first level of inference, namely how to determine the best-fit parameters within a particular model. The second level of inference is the comparison between models, which will be the focus of Sec. 5, while the third level of “what to do when you’re unable to choose between models” is briefly discussed in Sec. 5.1.

Level 1 – Parameter inference: I have selected a model M and prior P(θ|M). What are the parameters? Use Bayes Theorem:
P(θ|d, M) = P(d|θ, M) P(θ|M) / P(d|M)
e.g. parameter determination by running CosmoMC or MontePython. ⇓ Sec. 2

Level 2 – Model comparison: Actually, there are several possible models: M0, M1, ... What is the relative plausibility of M0, M1, ... given the observed data?
odds = P(M0|d) / P(M1|d)
e.g. comparison of ΛCDM and wCDM as in Heavens et al. (2017). ⇓ Sec. 5

Level 3 – Model averaging: None of the models is clearly the best. What is the inference on the parameters accounting for model uncertainty? Model averaging:
P(θ|d) = Σ_i P(Mi|d) P(θ|d, Mi)
e.g. finding the curvature of the universe as in Vardanyan et al. (2011) or evolving dark energy as in Liddle et al. (2006). ⇓ Sec. 5.1

Table 1: Three different levels of inference.

2 Bayesian statistics

Bayesian statistical analysis is a general method where the probability of a theory/parameter value being correct is computed based on observations, and where the probability estimate can be updated when new observations become available. This is contrary to the frequentist interpretation where the probability is a direct measure of the proportion of outcomes in repeated trials of your experiment (e.g. often used in laboratory experiments).

Bayes Theorem gives the probability that the inferred parameters θ are true given that we observe the data d. This quantity is also called the posterior probability function:

P(θ|d, M) = P(d|θ, M) P(θ|M) / P(d|M),    (1)

where θ represents a given set of parameters, d is the observed data, and M is any other external information, e.g. the model2. P(d|θ, M) is the probability of observing the data d if the parameters θ are true; this is also called the likelihood. P(θ|M) is called the prior and is the probability that the parameters are true in the absence of data but given some external information (prior knowledge)3. P(d|M) is called the evidence and is the probability of observing the data given the external information (the choice of model) but before the data are known. For parameter estimation within a model we usually don’t have to consider the evidence (a normalisation), and the problem reduces to “posterior ∝ likelihood × prior”.

The output of a Bayesian analysis is a probability density function. This is contrary to frequentist statistics where the output is an estimator for the parameters, i.e. one value that represents a parameter. In Bayesian statistics the probability density function gives a quantitative measure of how well we can believe the estimated parameters given the combination of data and prior knowledge.

2 The model can contain physical knowledge, e.g. General Relativity or flatness of the universe.
3 E.g. something measured by another experiment, or an assumption that a parameter falls within a given interval, e.g. masses have to be positive.
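The relation “posterior ∝ likelihood × prior” can be made concrete in a few lines of numpy. This is a minimal sketch with hypothetical numbers (one parameter, one datum) – not CosmoMC/MontePython code:

```python
import numpy as np

# Minimal 1D illustration of "posterior ∝ likelihood × prior" on a grid.
theta = np.linspace(-5, 5, 1001)                        # parameter grid
dtheta = theta[1] - theta[0]
prior = np.exp(-0.5 * (theta / 2.0)**2)                 # broad Gaussian prior, sigma = 2
d, sigma_d = 1.3, 0.5                                   # one observed datum and its error
likelihood = np.exp(-0.5 * ((d - theta) / sigma_d)**2)  # P(d|theta)

posterior = likelihood * prior                          # unnormalised posterior
posterior /= posterior.sum() * dtheta                   # normalise (the role of the evidence)

mean = (theta * posterior).sum() * dtheta               # posterior mean
```

The posterior mean ends up between the prior centre (0) and the measurement (1.3), pulled toward the data because the likelihood here is tighter than the prior.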

2 Figure 2: Left: This schematic illustrates Bayes Theorem. The prior is the knowledge you have before you measure something. Then you obtain “sensory evidence” (measurements/observations) which leads to an updated posterior (dashed line). If you increase the precision of your measurements or you decrease the precision of your prior, your measurements will have stronger influence on the posterior. From ?. Right: Bayesian versus frequentist interpretation of a solar neutrino experiment from https://xkcd.com/1132/.

2.1 How to explore the parameter space

In any judgement of best-fit parameters you need to start by exploring the parameter space and testing how well the model fits the data for a wide range of parameter values. In cosmology, we often deal with a very large number of parameters (e.g. approximately 10^7 pixel temperatures in the CMB map), and a direct (analytical) evaluation of the posterior distribution is impossible. Instead we compute the posterior on a sample drawn from the real posterior distribution. There are several ways to do this, but the most common are either setting up a grid of parameter values and testing each one, or using a Monte-Carlo Markov Chain (MCMC) method to selectively test the most likely region of parameter space.

Testing over a grid of parameters can be very time consuming, because you spend a lot of time calculating your model in regions of parameter space that are very unlikely (the white space in Fig. 1). When you have many parameters this becomes prohibitive, since the number of grid points grows exponentially with the number of dimensions. That’s where MCMC comes in. Monte Carlo basically means representing a distribution by randomly sampling it, and a Markov Chain is a smart way of sampling. A Markov Chain is defined by a series of random elements where the conditional distribution of the nth element only depends on the (n−1)th element. The chain is stationary if the distribution does not depend on n (the sample number). For our purpose, the crucial property is that after some steps (the burn-in phase), the chain reaches a state where successive elements are drawn from high-density regions of the underlying distribution (the posterior).

You start running an MCMC chain somewhere in your parameter space; it then jumps to another set of parameters to test (you can tune how big the jumps should be). If the second point has a higher likelihood the chain moves there and makes another jump, but if it is a worse fit there’s an algorithm by which it decides whether to keep the new step or drop back to its previous step and try again.
The MCMC leaves a track of points in its wake (called a chain), and the density of those points eventually gives you the likelihood of your parameters. There’s a nice visualiser for MCMC sampling here https://chi-feng.github.io/mcmc-demo/app.html.

2.2 Metropolis-Hastings algorithm

The Metropolis-Hastings algorithm is a fancy name for how to sample the parameter space (even if you cannot sample the underlying distribution directly). Let p(θ) be the probability distribution that we want to sample and θ0 the arbitrary starting sample. Then we specify a distribution q(θ′|θ) which suggests a candidate for the next sample value, θ′, given the previous value θ (called the stepping/jumping or proposal function). For each candidate we can calculate the acceptance ratio as

a = [p(θ′) / p(θ)] × [q(θ|θ′) / q(θ′|θ)].    (2)

Figure 3: Illustration of the Metropolis-Hastings algorithm in one dimension. Left panel: Starting at θ1, θ2 is proposed and accepted (step A), θ3 is proposed and rejected (step B), θ4 is proposed and accepted (step C). The resulting chain is then {θ1, θ2, θ2, θ4, ...}. Middle panel: If the step size is too large, the chain can never get into the minimum (−ln(p(θ)) = L ∝ χ²). Right panel: If the step size is too small, the sampling will take a very long time to reach the high-probability region. Figure from Leclercq et al. (2014).

If a > 1, then θ′ is accepted; otherwise it is accepted with probability a. If accepted, θ′ becomes the new state of the chain; otherwise the chain stays at θ. A compact way of writing this is P(jump from θ to θ′) = min(1, a). Most often higher probability points will be accepted and lower probability points rejected, but occasionally a lower probability point will be accepted, so we also explore low probability regions. This is illustrated in Fig. 3. Since each step only depends on the previous step and is independent of the number of steps, the chain is a Markov Chain, and the sample distribution converges to the underlying probability distribution as you draw more samples.

The major disadvantage of Metropolis-Hastings is that the samples are correlated. Even though they follow the underlying distribution, neighbouring samples are correlated and you can only use every n-th sample. You could instead increase the step size to reduce the correlations, but that would mean more rejected steps; with too small a step size it takes a long time to get into the high-probability region (see Fig. 3). Also, the beginning of the sample may follow a distribution that does not reflect the underlying distribution very well, in particular if the starting point is in a low probability region. Consequently, we often remove the first part of the sample, called the burn-in period. The advantages are that we do not need to know the probability distribution itself, only a distribution proportional to it, and that the method works very well for a large number of dimensions (parameters).

The challenge in the Metropolis-Hastings algorithm is to choose an appropriate step size in all dimensions such that not all points are rejected, but you still explore the full range in all parameters. CosmoMC and MontePython use by default a Gaussian proposal density function. If you provide a covariance matrix (e.g. from a previous, similar but not necessarily identical, run), the information can be used to decompose the parameters into uncorrelated orthogonal base parameters and to split them into computationally fast (e.g. experimental nuisance) and slow (e.g. cosmological) parameters (Lewis & Bridle, 2002).
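As a concrete illustration, a Metropolis-Hastings sampler with a symmetric Gaussian proposal (so the q terms cancel in Eqn. 2) fits in a few lines of Python. This is a minimal sketch, not how CosmoMC or MontePython are implemented:

```python
import numpy as np

def metropolis_hastings(log_p, theta0, step_size, n_steps, seed=0):
    """Sample a distribution known only up to a constant, via its log-density."""
    rng = np.random.default_rng(seed)
    chain = [np.asarray(theta0, dtype=float)]
    log_p_current = log_p(chain[-1])
    n_accepted = 0
    for _ in range(n_steps):
        # Gaussian (symmetric) proposal, so q cancels in the acceptance ratio
        candidate = chain[-1] + step_size * rng.standard_normal(len(chain[-1]))
        log_p_candidate = log_p(candidate)
        # accept with probability min(1, p(candidate)/p(current))
        if np.log(rng.uniform()) < log_p_candidate - log_p_current:
            chain.append(candidate)
            log_p_current = log_p_candidate
            n_accepted += 1
        else:
            chain.append(chain[-1])  # stay put; the point is repeated in the chain
    return np.array(chain), n_accepted / n_steps

# Example: sample a 1D Gaussian with mean 1 and sigma 2 (log-density up to a constant)
chain, acc = metropolis_hastings(lambda t: -0.5 * ((t[0] - 1.0) / 2.0)**2,
                                 theta0=[0.0], step_size=2.0, n_steps=20000)
burned = chain[5000:]  # discard burn-in
```

The density of points in `burned` then traces the target distribution: its mean and standard deviation come out near 1 and 2, up to the sampling noise of a correlated chain.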

2.3 Convergence

The more samples you draw the better your chain will reflect the underlying probability distribution. How do you know when to stop sampling? When is it ok to make inferences based on a chain? This is called convergence. There exists no test that can tell us if a chain has converged, only if it hasn’t converged. So convergence diagnostics are a necessary but not sufficient condition. Some good signs of convergence are:

• Individual segments of the chain give similar results
• The chain is much longer than any obvious correlations
• “High” acceptance rate of proposed steps
• Multiple chains with different starting points give similar results
• MCMC on simulated data gives accurate results even with significantly fewer iterations

Within a single chain, one way to check for convergence is to look at the correlation between points separated by a fixed distance (as a function of that distance, or lag):

ρ_lag = Σ_{i}^{N−lag} (θ_i − θ̄)(θ_{i+lag} − θ̄) / Σ_{i}^{N} (θ_i − θ̄)².    (3)

What is the smallest lag that gives ρ_lag ≈ 0? Is it much smaller than the length of the chain?

Generally the uncertainty on the mean of N measurements of a given quantity goes as σ/√N. But since the MCMC samples are correlated, we need to replace the sample size N by the effective sample size ESS = N / Σ_lag ρ_lag. So if we want to reduce the uncertainty from 100% per measurement to a mean with uncertainty σx = 0.03σ, we need N ≈ 1000 in the uncorrelated case, but N = 4000 in a correlated chain whose effective sample size is a quarter of its length.

Another option is to run several chains and check for similarity between the chains. Running several chains in parallel might also be a good idea for practical and computational reasons. The Gelman-Rubin test compares the variance within individual chains to the variance between chains, and can be used to check that the chains are providing similar results (An et al., 1998). CosmoMC and MontePython can both provide the correlation and Gelman-Rubin diagnostics to help you determine whether your chain is long enough4,5, but you should also always check your chains yourself: look at trace/scatter plots, run several chains, and compare different parts of the chains. As an example, the convergence criteria used for Planck are:

• Chains must be individually converged
• The tails of the distribution must be well enough explored so that confidence intervals for each of the parameters can be determined from the last half of each of the chains
• The Gelman-Rubin criterion must satisfy R − 1 < 0.01 in the least converged orthogonalized parameter
• The first 30% of each chain is discarded as burn-in
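Both diagnostics are easy to compute from raw chains. Here is a minimal sketch (the exact definitions vary slightly between codes, e.g. in how the variances are weighted) for a single parameter:

```python
import numpy as np

def autocorr(chain, lag):
    """Autocorrelation of a 1D chain at a given lag, as in Eqn. 3."""
    centred = chain - chain.mean()
    return np.sum(centred[:-lag] * centred[lag:]) / np.sum(centred**2)

def gelman_rubin(chains):
    """Simple Gelman-Rubin R for one parameter over several chains.
    chains: 2D array of shape (n_chains, n_samples)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    within = chains.var(axis=1, ddof=1).mean()        # W: mean within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)     # B: n * variance of chain means
    var_est = (n - 1) / n * within + between / n      # pooled variance estimate
    return np.sqrt(var_est / within)                  # R -> 1 as the chains converge

rng = np.random.default_rng(1)
chains = rng.standard_normal((4, 5000))  # four well-mixed toy "chains"
R = gelman_rubin(chains)                 # close to 1, so R - 1 < 0.01 here
```

For the independent toy chains above the lag-1 autocorrelation is consistent with zero and R − 1 is far below the Planck threshold of 0.01; a real chain would show ρ_lag decaying gradually with lag.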

3 Comparing data to model

To determine how a model with a given set of parameters matches the data, we need to compute the likelihood. As we will see in Sec. 3.6 the likelihood is related to χ2 so we start by examining some properties of the χ2. If this is more than a repetition for you, you should probably consult a book or website on introductory statistics.

3.1 χ2, ∆χ2, reduced χ2, and combining data sets

First we start with a χ² test. The lower the χ², the better the fit. The value of χ² for a particular data/model combination is given by

χ²₀ = Σ_i [ (µ_model − µ_i) / σ_i ]²,    (4)

where µ_model is the value predicted by your cosmological model and µ_i ± σ_i is the ith data point and its uncertainty. Given a value of χ² you can calculate the likelihood of that model. We show how to calculate likelihoods in Sec. 3.6, but you can also calculate your uncertainties directly from the χ² values. These are appropriate in the case of Gaussian distributed likelihoods. (The key is not whether the data have Gaussian errors, but rather whether the resulting distribution of likelihoods is Gaussian.)
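As an example, a brute-force grid evaluation of Eqn. 4 for a toy one-parameter straight-line model (hypothetical data, not the SN likelihood) might look like:

```python
import numpy as np

# Hypothetical data: y = 2x + noise, with uncertainties sigma
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])
sigma = np.full_like(y, 0.2)

def chi2(slope):
    """Eqn. 4 for the one-parameter model y_model = slope * x."""
    return np.sum(((slope * x - y) / sigma)**2)

slopes = np.linspace(1.0, 3.0, 201)            # parameter grid
chi2_grid = np.array([chi2(s) for s in slopes])
best = slopes[np.argmin(chi2_grid)]            # lowest chi^2 = best fit
```

The best-fit slope lands close to the value used to generate the toy data; the same loop-over-a-grid pattern is what becomes prohibitive in many dimensions.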

3.2 What is the best fit, and the uncertainty?

The parameter values that give you the lowest χ2 are the best fit parameters of your model. To calculate the uncertainties on those best fits you compare the other “models” (same models with other parameter values) to the χ2 for the best fit,

∆χ2 = χ2 − min(χ2). (5)

The larger the ∆χ², the worse the fit. There are standard thresholds of ∆χ² that correspond to uncertainties of one, two, and three standard deviations (σ1, σ2, σ3). The meaning of the standard deviations is simply that 68.27%, 95.45%, and 99.73% of the data should lie within the one, two, and three σ limits, respectively. Note that these are not exactly round numbers, and the 95% confidence interval is slightly different to the 2σ limit.6

The values of ∆χ² corresponding to [σ1, σ2, σ3] differ depending on the number of parameters you’re fitting. When you’re fitting one parameter, ∆χ² = [1, 4, 9] corresponds to the first, second, and third standard deviations, respectively. When you’re fitting two parameters, the first three standard deviations are given by ∆χ² = [2.30, 6.18, 11.83]. When you’re fitting more parameters I recommend you use likelihoods rather than the raw χ² value anyway.
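These thresholds can be recovered from the χ² distribution itself; a quick check with scipy (assuming scipy is available):

```python
from scipy import stats

# Fraction of probability within 1, 2, 3 sigma of a Gaussian
levels = [0.6827, 0.9545, 0.9973]

# Delta chi^2 thresholds: inverse CDF of a chi^2 distribution with
# as many degrees of freedom as fitted parameters
one_param = [stats.chi2.ppf(p, df=1) for p in levels]  # ~ [1, 4, 9]
two_param = [stats.chi2.ppf(p, df=2) for p in levels]  # ~ [2.30, 6.18, 11.83]
```

This reproduces both rows of Tab. 2, and generalises to any number of fitted parameters by changing `df`.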

4 http://cosmologist.info/cosmomc/readme.html
5 http://monte-python.readthedocs.io/en/latest/getting_started.html
6 The values are actually defined as σ_z = erf(z/√2), where erf (the error function) is the cumulative integral of a Gaussian distribution,

erf(x) = (2/√π) ∫₀^x e^{−t²} dt.    (6)

Figure 5: No further explanation... https://xkcd.com/1725/.

∆χ² for          σ1     σ2     σ3
one parameter     1      4      9
two parameters    2.30   6.18   11.83
more parameters   Use likelihoods

Table 2: ∆χ² values corresponding to different standard deviations (σ1, σ2, σ3) for different numbers of parameters (one and two).

Figure 4: Demonstration of ∆χ² for one and two degrees of freedom (solid and dashed respectively). In the case of Gaussian errors the best fit and uncertainties after marginalisation correspond to the 1 degree of freedom limits.

3.3 How good is the fit?

When fitting models to data it is not enough to find out which parameters give the best fit; you also need to establish whether that best fit is actually a good fit. To get an approximate measure of goodness of fit, use the reduced χ²: divide χ² by the number of degrees of freedom, n_dof = n_data − 1 − n_params, where n_data is the number of data points and n_params is the number of parameters you are fitting for. For a good fit the result should be close to 1. This is not a rigorously defined rule, but rather a rule of thumb. A reduced χ² much less than 1 indicates that the error bars are probably overestimated (too large) or that you have too many parameters in your model. Meanwhile, a reduced χ² much larger than 1 indicates that the model is not a good fit to the data. However, care should be taken, because a high reduced χ² could actually indicate a number of other things as well. A high reduced χ² could mean,

• the model is inadequate and a better model is needed; • the uncertainties on the data points have been underestimated; • there is a systematic error in the data.

3.4 Including priors

If we have prior information about a particular parameter that goes into this model, then we want to weight the χ² value so it prefers parameter values close to our prior. We do this by adding an extra term in the χ² equation. Say we have a prior on a parameter, x, such that we know x = x_prior ± σ_prior. For example x_prior could be ΩM = 0.27 ± 0.03 or a combination such as ΩM + ΩΛ = 1.00 ± 0.02, which you know from previous experiments. The prior contributes

χ²_prior = [ (x_model − x_prior) / σ_prior ]².    (7)

Then the total is simply the sum of those χ² values, χ² = χ²₀ + χ²_prior. Your choice of priors will affect the posterior, so choose wisely. In some cases the prior has a physical meaning, e.g. masses must be positive, but in other cases it’s not obvious. If in doubt, be as inclusive as possible. For MCMC chains, the prior can also affect which part of parameter space will be explored, and “flat”, “logarithmic”, or “Gaussian” priors on a given parameter might lead to different results, in particular if your chain hasn’t converged well enough.
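Continuing the toy straight-line example from Sec. 3.1, a Gaussian prior on the slope (hypothetical numbers, as if from a “previous experiment”) just adds the term of Eqn. 7 to the χ²:

```python
import numpy as np

# Same hypothetical data as before
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])
sigma = np.full_like(y, 0.2)

slope_prior, sigma_prior = 1.8, 0.05   # hypothetical prior from another experiment

def chi2_total(slope):
    chi2_data = np.sum(((slope * x - y) / sigma)**2)       # Eqn. 4
    chi2_prior = ((slope - slope_prior) / sigma_prior)**2  # Eqn. 7
    return chi2_data + chi2_prior

slopes = np.linspace(1.0, 3.0, 2001)
best = slopes[np.argmin([chi2_total(s) for s in slopes])]
# the tight prior pulls the best fit from ~2.02 toward 1.8
```

The best fit lands between the data-only value and the prior centre, weighted by their respective precisions — exactly the behaviour described above.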

3.5 Including other independent data sets

Including other data sets is simple, when the data sets are independent. You simply repeat the above for whichever data set you are interested in, calculate the χ2 for that new data set, and add it to the χ2 total as though it were another prior so that

χ² = Σ_i χ²_data,i + Σ_j χ²_prior,j .    (8)

The ∆χ² thresholds for σ1,2,3 don’t change as you add extra data sets, because you still have the same number of parameters in your model. We deal with correlated data in Sec. A.

WARNING! Once you’ve combined data sets, make sure that your new best fit is still a good fit by checking that the reduced χ² is about one. When a model inadequately describes the data, you can still get sensible looking likelihoods, often with small error bars, but the model should be ruled out because the best fit is a poor one. See Fig. 6 for an example.

Figure 6: Example of two badly aligned data sets. The real model from which the data is generated is a parabola y = z². The data have been split into two overlapping data sets; the low-z data in blue and the high-z data in red. The model we’re testing is a linear fit, y = mz + b, with the slope, m, and the y-intercept, b, as the parameters we’re fitting for. The best linear fits are shown as dashed lines on the plot on the left, with the dotted line showing the best fit to the two data sets combined. The linear model is a good fit to each individual data set, given the uncertainties. The plot on the right shows the likelihood contours in the m vs b plane, and the best fit values are given in the upper right for each of the data sets (red and blue), and the data sets combined (green). The χ² per degree of freedom is given in the bottom left. The high-z data have more points and slightly tighter error bars, which results in tighter contours. The point of this plot is to show that even though both data sets are good fits to the linear model (χ²/dof ∼ 1), once you combine them the total indicates a bad fit (χ²/dof ∼ 4). Nevertheless, you often see people quoting an amazingly precise result, like the green contours here, when all that is actually showing is that the model is a bad one. Beware the wrath of the referee if you make that error.

3.6 Likelihoods

Converting a χ² value into a likelihood L is simple,

L = e^{−χ²/2},    (9)

from which you can see why χ² is often referred to as the ‘log-likelihood’,

−2 ln L = χ².    (10)

Thanks to the logarithm you can see that whereas you add χ2 values from independent experiments to get the total χ2 you multiply likelihoods to get the total likelihood. Once you have a likelihood distribution for a parameter, it is straightforward to integrate under the curve to find the value of the likelihood, within which 68.27% of the likelihood area is enclosed. The values of the parameter that match that likelihood are your ±1σ uncertainties. The same procedure works whether or not the likelihood surface is Gaussian.
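The integration step can be sketched numerically. Here is a minimal, hypothetical example that finds the parameter range enclosing 68.27% of the likelihood area on a grid (it keeps the most-likely grid points until the target fraction is enclosed, so the ±1σ limits share the same likelihood level):

```python
import numpy as np

def one_sigma_interval(theta, like, frac=0.6827):
    """Parameter range enclosing `frac` of the likelihood area
    (uniform grid, single-peaked likelihood assumed)."""
    order = np.argsort(like)[::-1]              # grid points, most likely first
    cum = np.cumsum(like[order]) / like.sum()   # enclosed fraction of the area
    inside = order[:np.searchsorted(cum, frac) + 1]
    return theta[inside].min(), theta[inside].max()

# Gaussian example: a parameter measured as 1.0 +/- 0.5
theta = np.linspace(-2.0, 4.0, 6001)
chi2 = ((theta - 1.0) / 0.5)**2
like = np.exp(-chi2 / 2.0)                      # Eqn. 9
lo, hi = one_sigma_interval(theta, like)        # ~ (0.5, 1.5), i.e. +/- 1 sigma
```

For this Gaussian case the recovered interval matches the ±1σ limits exactly; for a skewed likelihood the same routine returns asymmetric limits, which connects to the discussion in Sec. 3.8.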

Figure 7: Example of a Gaussian likelihood distribution (left) and a skewed likelihood distribution (middle). The relative likelihood appears in the top panel of each with the cumulative likelihood below. In the case of a Gaussian likelihood distribution there is no difference between the mean and max-likelihood methods for determining the best fit. However, in the skewed case the values are very different: the maximum likelihood method would give 0.57 +0.80/−0.33, whereas the mean of the posterior would give 0.93 +0.86/−0.48. The maximum likelihood makes more sense when looking at the relative likelihood plot, as the point chosen has the maximum likelihood (thus the name), and the ±1σ limits have the same relative likelihood (horizontal blue line). On the other hand, the mean makes more sense when looking at the cumulative likelihood plot, as the point chosen is in the middle of the probability distribution and the ±1σ limits enclose equal amounts of the cumulative probability. Which you use is a matter of taste. A truncated likelihood (right) also results in the mean and max-likelihood methods differing. This occurs where the parameter range being tested either (1) doesn’t include the full range that the data says is possible, or (2) butts up against a hard limit, such as the limit ΩM ≥ 0 needed because we can’t have a negative matter density.

3.7 Importance sampling

If you have an MCMC chain from a previous sampling but “forgot” to include a prior or an independent data set, you might be able to avoid re-running the chain by using a method called importance sampling (Lewis & Bridle, 2002). Basically this means re-assigning a new likelihood to all points in the sample following Eqn. 8. However, this is not always a reasonable approach: the MCMC chain and the new information have to cover the same region in parameter space and must not prefer wildly different best fit values for overlapping parameters. Phrased in terms of likelihood, an MCMC chain is importance sampled by re-computing the likelihood and updating the multiplicity, w:

L_IS = L_orig L_prior ,    (11)

w_IS = w_orig (L_IS / L_orig) .    (12)
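A minimal sketch of Eqns. 11–12, reweighting an existing chain with a Gaussian prior that was “forgotten” (hypothetical chain and prior, with all original multiplicities equal to one):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical chain: samples of one parameter, each with multiplicity (weight) 1
samples = rng.normal(0.3, 0.1, size=50000)
weights = np.ones_like(samples)

# New Gaussian prior we forgot to include: theta = 0.35 +/- 0.05
log_l_prior = -0.5 * ((samples - 0.35) / 0.05)**2

# Eqns. 11-12: L_IS = L_orig * L_prior, so w_IS = w_orig * L_prior
weights_is = weights * np.exp(log_l_prior)

mean_before = np.average(samples, weights=weights)
mean_after = np.average(samples, weights=weights_is)  # pulled toward the prior
```

The reweighted mean sits between the chain’s original mean (0.3) and the prior centre (0.35), as expected for a product of two Gaussians. If the prior barely overlapped the chain, almost all weight would fall on a handful of points and the result would be unreliable — the caveat noted above.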

3.8 Non-Gaussian likelihoods

In the case of a non-Gaussian distribution, picking your best fit value and uncertainties becomes tricky (see Fig. 7). The maximum likelihood value is no longer in the centre of the distribution, and if there is a long tail in one direction or the other, then the majority of the likelihood can end up sitting far to one side or the other of the actual best fit. Therefore, another way to choose the best fit value is to choose the value for which 50% of the likelihood is above and 50% below. In other words, take the mean of the posterior distribution (if you like Bayesian language) or find where the cumulative likelihood hits 0.5. When the likelihood distribution is Gaussian, these two are equivalent, but in the case where the likelihood distribution is skewed, the two can differ substantially (see Fig. 7). The other instance in which the mean and the maximum likelihood will not coincide, is when a parameter has a hard cut-off on one side, and that cutoff is within the realistic limits of your measurement. For example, matter density can’t be less than zero, so if the data allow for zero matter density then the likelihood value will be truncated (see Fig. 7).

4 Marginalisation

Marginalisation is basically about getting rid of unwanted parameters when quoting your results. You primarily need it in two circumstances. Firstly, when you have a nuisance parameter (e.g. instrumental effects that you don’t care about), and secondly when you have a multi-parameter fit for which you would like to quote results for a single parameter (or plot a 2D contour when you’ve tested more than two parameters). Marginalisation just reduces the dimensions of your array of likelihoods.

4.1 Simple grid example

At its heart marginalisation is a very simple procedure. We’ll start with an example. Say you’re testing ΩM with SN data, within the flat-ΛCDM model. The parameter ΩM is the only free parameter of the model, but due to uncertainties in the absolute magnitude of the SNe and H0, which both shift the magnitude–redshift curve up and down, you also have to vary the nuisance parameter M. So for each value of ΩM you test, you have an array of χ² values, corresponding to all the different M values you tested. To marginalise over M you just convert those χ² values to likelihoods and add up the total likelihood for each ΩM.7

Figure 8: Marginalising over a nuisance parameter M to extract constraints on ΩM. For each ΩM being tested you simply add the likelihoods for all possible values of M, to get the total relative likelihood for that ΩM. On the left this is shown as reducing a matrix of likelihoods to a 1D array. On the right is the equivalent pictorial representation, reducing a contour to a 1D likelihood curve.

4.2 Gaussian likelihood distribution

When you have a Gaussian likelihood distribution, you can use a shortcut: rather than summing over the likelihoods, you can just use the χ² values for the best-fit M for each ΩM.

The MCMC technique generates points for each tested model, and it is the density of points in each ΩM bin that gives you the likelihood of that ΩM value. So finding the maximum likelihood in an MCMC chain involves finding where the density of points is highest. Marginalising over extra parameters is just as simple in the MCMC case – you just add up the number of points, N, in the unwanted parameters and attribute them to the corresponding ΩM you’re testing (and normalise by the total number of points in the chain).

L(ΩM) = ∫ L(ΩM, M) dM   in theory   [ generally ∫ L(θ, θ_nuisance) dθ_nuisance ],    (13)
      = Σ_i L(ΩM, M_i)   over a grid   [ Σ_i L(θ, θ_nuisance,i) ],    (14)
      = L(ΩM, M_max)   when Gaussian   [ L(θ, θ_nuisance,max) ],    (15)
      = Σ_i N(ΩM, M_i)/N_tot   in MCMC   [ Σ_i N(θ, θ_nuisance,i)/N_tot ].    (16)
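The grid case (Eqn. 14) is a one-line sum over the nuisance axis. Here is a minimal numpy sketch on a toy 2D Gaussian likelihood surface (illustrative numbers; `om` and `nuis` just stand in for ΩM and M):

```python
import numpy as np

# Toy 2D likelihood surface on a grid: parameter of interest vs nuisance
om = np.linspace(0.0, 1.0, 201)      # stands in for Omega_M
nuis = np.linspace(-1.0, 1.0, 201)   # stands in for the nuisance parameter M
OM, NU = np.meshgrid(om, nuis, indexing="ij")

# Correlated Gaussian chi^2 surface (purely illustrative numbers)
chi2 = ((OM - 0.3) / 0.05)**2 + ((NU - 0.1) / 0.2)**2 + 5 * (OM - 0.3) * (NU - 0.1)
like = np.exp(-chi2 / 2.0)

# Eqn. 14: marginalise by summing the likelihood over the nuisance axis
like_marg = like.sum(axis=1)
best_om = om[np.argmax(like_marg)]   # ~0.3 for this toy surface
```

For higher-dimensional grids you repeat the sum over each unwanted axis; in the MCMC case the same operation is a weighted histogram of the chain in the parameter of interest.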

4.3 Marginalisation for contour plots

Even when there are no ‘nuisance’ parameters, you often need to marginalise to present the results of a multi-parameter fit. For example, when fitting for BAO in ΛCDM you need to fit for the set θΛCDM = (H0, ΩM, ΩΛ). How do you generate the contour plots for (ΩM, ΩΛ)? You marginalise over H0. In other words, L(ΩM, ΩΛ) = Σ_i L(H0,i, ΩM, ΩΛ). In general, for higher numbers of parameters, you just keep repeating the sum over the unwanted parameters until you get down to the parameters you’re interested in. Since you generally are just looking for relative likelihoods, normalising the likelihood surface is often not necessary, but it is good practice to make sure that the total likelihood adds up to one.

Is your best fit still a good fit? Because you sum likelihoods, the marginalised result no longer corresponds to a single χ² value, so don’t attempt to interpret χ² values after marginalisation. Just look at the lowest χ² in your unmarginalised grid to see whether that is a good fit.

4.4 Analytic marginalisation

In some cases it is possible to get rid of nuisance parameters by analytic marginalisation. This is possible e.g. when the nuisance appear in the mean of a Gaussian likelihood, and the parameters are Gaussian distributed (Bridle et al., 2002). In practice it

7In the case of the distance modulus data from SNe there is an analytical way to marginalise, but more about that in Sec. ?? 9 Figure 9: Likelihood contours for the flat-wCDM model, using the SN data from the Union 2 compilation. Black con- tours show the 1, 2, and 3, sigma χ2 limits for 2 degrees of freedom. Gray contours show the likelihoods for the same. They differ slightly because this test has been done on a grid and the grid truncates the edge of the distribution a little. The upper and right panels show the marginalised likelihood dis- tributions for ΩM and w respectively. The maximum likeli- hood point before marginalisation is shown as the grey/black bullseye in the contour plot. The maximum likelihood of the 1D distributions after marginalisation are shown as the blue crosses and error bars. The mean of the posterior is shown as the red diamond and error bars. Note that the maximum likelihood of the marginalised distribution does not find the maximum of the total distribution, but the mean of the poste- rior comes closer to picking the true best fit.

means that for a given set of parameters, you choose the value of the nuisance parameter that maximises the likelihood. The normal procedure for marginalisation is P(θ) = ∫ P(θ, n) dn, where n is the nuisance parameter you want to get rid of. For a Gaussian distribution you can instead maximise the likelihood as a function of the nuisance parameter, so that P(θ) ∝ max_n P(θ, n).
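A toy numerical check of this equivalence (the Gaussian likelihood and grid ranges below are assumed purely for illustration): when the nuisance parameter enters as the mean of a Gaussian, integrating over it and maximising over it give curves of the same shape in θ.

```python
import numpy as np

# Toy likelihood, Gaussian in the nuisance parameter n, centred on theta.
theta = np.linspace(-3.0, 3.0, 61)
n = np.linspace(-10.0, 10.0, 2001)
T, Ngrid = np.meshgrid(theta, n, indexing="ij")
sigma_n = 1.5
L = np.exp(-0.5 * (Ngrid - T)**2 / sigma_n**2) * np.exp(-0.5 * T**2)

dn = n[1] - n[0]
L_marg = L.sum(axis=1) * dn    # marginalise: integrate over n
L_prof = L.max(axis=1)         # profile: maximise over n

# For a Gaussian nuisance the two differ only by a constant factor,
# so the normalised curves coincide.
same_shape = np.allclose(L_marg / L_marg.max(),
                         L_prof / L_prof.max(), atol=1e-5)
```

For non-Gaussian nuisance parameters the two curves generally differ, which is why the analytic shortcut only applies in the Gaussian case.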

4.5 Mean of the posterior vs maximum likelihood

In Figure 9 we demonstrate the marginalisation procedure for the flat-wCDM model with supernova data. The two parameters are w and ΩM, and the 2D likelihood contours are shown in the bottom left, with the 1D marginalised likelihood distributions for ΩM and w in the upper and right-hand panels respectively. The peak of the entire likelihood distribution is shown as the black and grey ‘bullseye’ at the centre of the contours. Notably, this maximum likelihood is not the maximum likelihood of either parameter after marginalisation. If you marginalise and then choose the maximum likelihood for each parameter, you get the blue cross. On the other hand, if you take the mean of the marginalised likelihood distribution, you get the red diamond. When you look at the marginalised distributions you might think “Why would one ever use the mean of the posterior, since it obviously doesn’t pick the best fit value of this parameter?” However, the reason becomes clear when you go back to the full likelihood distribution and see that the mean after marginalisation is closer to the maximum likelihood point of the whole distribution than the maximum likelihood after marginalisation is. (The red diamond is closer to the black bullseye than the blue cross is.) In general (for non-Gaussian distributions), the mean of the posterior is a better representation of the overall likelihood distribution than the maximum likelihood point after marginalisation. Planck Collaboration et al. (2016) quote the mean of the likelihood and the two-sided 68% confidence limits (95% confidence limits for “capped” parameters). If we know the best fitting point overall (the bullseye), then why go to all the trouble of marginalising anyway? Well, that bullseye point does not come with any uncertainty estimates. The real reason we go to the trouble of marginalising is to figure out the size of our error bars.

5 Model comparison

So far we’ve been trying to figure out what the best fit parameters are within a model. We’ve seen that when a model is a poor fit, you get a χ² per degree of freedom significantly greater than one. That often indicates that a more complex model is needed – for example, one with an extra parameter, or a new model entirely. Some people have tried to quantify how bad the χ² must be before a new model, or extra parameter, is justified. This is by no means an exact science, and is to a great extent just quantifying your common sense. In general, when one model is nested within another, as in the case of flat-ΛCDM being a special case of ΛCDM, the model with extra parameters is guaranteed to be a better fit. So when comparing models, you can’t just prefer the one with the lowest χ², because you can always add another parameter and get a lower χ² again. There has to be some threshold of improvement in χ² to justify the addition of the extra parameter. There are various methods used to penalise the model with the extra parameters. Bayesian model selection is a popular way to select the preferred model. You will often see simplified criteria such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) used. Basically these give a threshold by which the more complex model must improve on the simpler one before the extra parameter is considered justified by the data. The aim is thus to quantify Occam’s razor and prefer the simpler model unless the data show a more complex model is necessary to get a good fit. The BIC is also known as the Schwarz information criterion (Schwarz, 1978) and is given by,

Figure 10: No comments. https://xkcd.com/892/

BIC = −2 ln L + k ln N,  (17)

where L is the maximum likelihood, k is the number of parameters, and N is the number of data points used in the fit. You can see that in the case of Gaussian errors the difference in the BIC values between two models just becomes ∆BIC = ∆χ² + ∆k ln N. A difference in BIC of 2 is considered positive evidence against the model with the higher BIC, while a ∆BIC of 6 is considered strong evidence (Liddle, 2004). Meanwhile the AIC (Akaike, 1974) is given by,

AIC = −2 ln L + 2k. (18)

This gives results similar to BIC, although the AIC is somewhat more lenient on models with extra parameters and has been shown to have practical advantages over BIC (better recovery when the candidate models are similar). Look at the data in Fig. 6. They were created from a parabola, but we fitted them with a linear model. If we instead fitted them with e.g. a cubic model, the χ² would be lower, but that doesn’t mean it’s the correct model. Instead, AIC or BIC would tell you that the extra parameters aren’t justified.
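The AIC/BIC bookkeeping can be sketched with toy data in the spirit of Fig. 6 (the data, noise level, and seed below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data drawn from a parabola with Gaussian noise (invented setup,
# echoing Fig. 6 where parabola data are fitted with a straight line).
x = np.linspace(-2.0, 2.0, 20)
sigma = 0.5
y = 1.0 + 0.5 * x + 1.0 * x**2 + rng.normal(0.0, sigma, x.size)

def ic(degree):
    """Least-squares polynomial fit; return (chi2, AIC, BIC)."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    chi2 = np.sum((resid / sigma)**2)
    k = degree + 1                       # number of fitted parameters
    # For Gaussian errors, -2 ln L = chi2 + const, so only the chi2
    # term matters for *differences* in AIC/BIC.
    return chi2, chi2 + 2 * k, chi2 + k * np.log(x.size)

for deg, label in [(1, "linear"), (2, "quadratic"), (3, "cubic")]:
    c, a, b = ic(deg)
    print(f"{label:9s} chi2 = {c:8.2f}  AIC = {a:8.2f}  BIC = {b:8.2f}")
```

The linear fit has a far worse χ², so its AIC and BIC are much higher; the cubic fit lowers the χ² slightly relative to the quadratic, but the k-dependent penalty terms are what decide whether that extra parameter is justified.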

5.1 Model averaging

Quite often no model will have decisive evidence over the competing models in terms of AIC or BIC. Instead of choosing a specific model, we can then include the uncertainty from the choice of model in the parameter estimates by computing a model-averaged posterior, where the individual posteriors from each model are summed together, weighted by the posterior probability of each model (Parkinson & Liddle, 2013)

P(θ̄|d) = [ Σ_i P(M_i|d) P(θ̄|d, M_i) ] / [ Σ_i P(M_i|d) ].

The idea is to produce a model averaged version of the posterior for a given parameter shared between the models. However, it is not always practically possible. Even when parameters are shared between several models, they might have slightly different meanings in each model, so the averaged posterior only makes sense in special cases. Other open questions are how many models you need to ensure that you probe the realistic model space, and how many common parameters they should have. Examples of practical applications can be found in Liddle et al. (2006), where they investigated whether dark energy evolves with time, and in Vardanyan et al. (2011), where model averaging is used to quantify the uncertainty on the flatness of the universe. But as stated by T. Loredo (Cornell University): “As a final remark, astronomers new to Bayesian methods really should consider Bayesian Model Averaging to be a somewhat advanced topic. Like nonparametrics, it’s an area where it’s a good idea to run your calculation by someone with experience and knowledge of the literature.”

6 Literature

If you want to know more, these are good places to start.

Cosmology: from theory to data, from data to theory (Leclercq et al., 2014). Probably the most useful reference if all you need is a practical guide to model fitting in cosmology.

Bayesian Methods in Cosmology (Hobson et al., 2014). A whole book dedicated to Bayesian statistics in the context of cosmology.

Markov Chain Monte Carlo Methods for Bayesian Data Analysis in Astronomy (Sharma, 2017). New and very detailed review article about all the details of MCMC (and probably a few more) that you will ever need. No cosmological examples, though.

A Correlated data sets

I don’t think I’ll go through this, but I’ll leave it in the notes. When two data points are correlated it means that they will both tend to lie either above or below the correct model. For example, if you measured the temperature at noon and 1pm each day you would find that those two data points were consistently above the mean daily temperature. If you averaged those two data points you would not get the correct mean daily temperature, and no matter how many noon-time data points you added to your average you would still not get it right. That means the uncertainty is not dropping as 1/√n, as you would expect if your n data points were independent. That’s because temperature measurements are correlated with the time they are taken. Temperature readings at midnight would be anti-correlated with temperature readings at noon. So when correlations exist in your data it is crucial that you account for them in your statistical analysis. Whether combining with the CMB, or taking the ratios of two BAO measurements, there are many potential sources of correlation between data sets. In this section we attempt to present in a straightforward manner how to account for them in your statistical analysis.

Figure 11: An example of correlated data from Blake et al. (2011). This is the correlation function measured for the galaxies observed by the WiggleZ survey on the Anglo-Australian Telescope. The correlation function tells you the overdensity of galaxy pairs as a function of separation. In this case we’re measuring the correlation between galaxy positions, but that’s not the correlation I want to point out. Rather, I want to point out that the data points on this plot are correlated. That’s because the volumes in which we’re measuring our galaxy pairs are overlapping, so we’re sampling the same density field. The correlation matrix is shown on the right. The axes are both separation (corresponding to s in the left panel). Red is correlated, blue is uncorrelated or anti-correlated. (To be precise, we plot the amplitude of the cross-correlation, C_ij/√(C_ii C_jj).) The correlation explains why the data points aren’t as scattered as you would expect for the size of the error bars. The error bars are just the diagonal part of the covariance matrix.

When two data sets are not independent you can’t simply add the χ² values together, because that would double-count the correlated part of the data and give falsely tight, and potentially misleading, constraints. The χ² technique described above is actually just a simplification of the full analysis taking correlations into account. To do it properly you should use the covariance matrix, which is simply the matrix of uncertainties. In the simple case of uncorrelated data, the equation for χ² can be rewritten as a matrix product. Say you have two data points with theoretical expectations x_thry and y_thry and measured values x_obs ± σx and y_obs ± σy. The covariance matrix is just the matrix of the variances, which in the uncorrelated case is simply,

" 2 # σx 0 V = 2 . (19) 0 σy

Writing the difference between observation and theory as a vector, Xᵀ = (x_thry − x_obs, y_thry − y_obs) = (∆x, ∆y), the equation for χ² is then

χ² = Xᵀ V⁻¹ X,  (20)

or writing it out longhand,

χ² = (∆x, ∆y) [ 1/σx²  0 ;  0  1/σy² ] (∆x, ∆y)ᵀ.  (21)

Correlated data means that if one data point lies above the expected curve, the other is likely to do so too. So if you don’t take that into account it can (a) lead to skewed results and (b) give you falsely tight constraints. The way to take correlations into account is to add uncertainty to the measurement. Another way to put it is that when data points are correlated, adding extra data reduces the uncertainty by less than the usual √n amount (where n is the number of data points).
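A quick numerical sketch of Eq. (20), with made-up residuals and uncertainties; it also previews what happens once a correlation coefficient is introduced:

```python
import numpy as np

# Made-up residuals (Delta_x, Delta_y) and uncertainties for illustration.
X = np.array([0.2, -0.3])
sx, sy = 0.3, 0.4

# Uncorrelated case: the matrix form reduces to the usual chi^2 sum.
V = np.diag([sx**2, sy**2])
chi2_matrix = X @ np.linalg.inv(V) @ X
chi2_simple = (X[0] / sx)**2 + (X[1] / sy)**2

# Correlated case: same residuals, but with correlation coefficient rho.
rho = 0.8
Vc = np.array([[sx**2, rho * sx * sy],
               [rho * sx * sy, sy**2]])
chi2_corr = X @ np.linalg.inv(Vc) @ X

print(chi2_matrix, chi2_simple, chi2_corr)
```

The first two numbers agree (the matrix form is just the weighted sum written compactly), while the third differs: ignoring the off-diagonal terms would misstate the constraints.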

In general, when two parameters, with theoretical expectations x_thry and y_thry and measured values x_obs ± σx and y_obs ± σy, are correlated with correlation coefficient ρxy (not to be confused with density), the covariance and inverse covariance matrices are given by (Wall & Jenkins, 2003, Eq. 4.4),

V = [ σx²  ρxy σx σy ;  ρxy σx σy  σy² ];   V⁻¹ = 1/(1 − ρxy²) [ 1/σx²  −ρxy/(σx σy) ;  −ρxy/(σx σy)  1/σy² ].  (22)

For three parameters, the covariance matrix (before inversion) is

V = [ σx²  ρxy σx σy  ρxz σx σz ;  ρxy σx σy  σy²  ρyz σy σz ;  ρxz σx σz  ρyz σy σz  σz² ].  (23)

At this point we stop writing out the inverse covariance matrices and recommend you get your computer to calculate them for you. There are various ways to calculate what the correlation between two measurements is, but we leave that for another day.
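For example, in Python you can let NumPy do the inversion, and check it against the analytic 2×2 form of Eq. (22) (the numbers here are illustrative):

```python
import numpy as np

# Illustrative uncertainties and correlation coefficient.
sx, sy, rho = 0.3, 0.4, 0.6
V = np.array([[sx**2, rho * sx * sy],
              [rho * sx * sy, sy**2]])

# Analytic inverse for the 2x2 correlated case, as in Eq. (22).
Vinv_analytic = (1.0 / (1.0 - rho**2)) * np.array(
    [[1.0 / sx**2, -rho / (sx * sy)],
     [-rho / (sx * sy), 1.0 / sy**2]])

# Numerical inverse -- what you would actually use for larger matrices.
Vinv_numeric = np.linalg.inv(V)
print(np.allclose(Vinv_numeric, Vinv_analytic))
```

For three or more correlated parameters the analytic expressions get unwieldy fast, which is exactly why handing the inversion to the computer is the sensible default.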

A.1 Slightly more complex correlations

Sometimes you need to combine data from different sources before calculating χ². If the data you’re combining are correlated, there are some subtleties in how to address that. For example, the s_z parameter used in BAO studies is actually a ratio between the sound horizon scale at last scattering, as measured by the CMB through ℓA, and that seen in the galaxies, d_z. (They’re actually defined in opposite senses, so the ratio becomes the product, ... details details ...).

When working with WiggleZ, we had three measurements of d_z and one measurement of ℓA, which we combined to create three measurements of s_z = ℓA d_z/π. Two of the d_z values were correlated with each other, while all of the d_z measurements were correlated with ℓA. How do you calculate the covariance matrix for that combination?

Start with the vector s = [s1, s2, s3] = [d1 ℓA, d2 ℓA, d3 ℓA]. The Jacobian is a handy matrix made up of the partial derivatives of the components of s with respect to each of d1, d2, d3, and ℓA:

J = [ ∂s1/∂d1  ∂s1/∂d2  ∂s1/∂d3  ∂s1/∂ℓA ;  ∂s2/∂d1  ∂s2/∂d2  ∂s2/∂d3  ∂s2/∂ℓA ;  ∂s3/∂d1  ∂s3/∂d2  ∂s3/∂d3  ∂s3/∂ℓA ] = [ ℓA  0  0  d1 ;  0  ℓA  0  d2 ;  0  0  ℓA  d3 ].  (24)

Imagine that d1 and d2 are correlated (as in the SDSS points at z = 0.2 and z = 0.35) but d3 is uncorrelated (for example, a WiggleZ point at z = 0.6), then the covariance matrix for the original four data points looks like

C = [ σ11²  σ12²  0  0 ;  σ12²  σ22²  0  0 ;  0  0  σ33²  0 ;  0  0  0  σ_ℓAℓA² ].  (25)

We need to convert that to a new covariance matrix for the three components of s. That is what the Jacobian is used for. The covariance matrix for s = [s1, s2, s3] is,

C_new = J C Jᵀ = [ σ11² ℓA² + σ_ℓAℓA² d1²  σ12² ℓA² + σ_ℓAℓA² d1 d2  σ_ℓAℓA² d1 d3 ;  σ12² ℓA² + σ_ℓAℓA² d1 d2  σ22² ℓA² + σ_ℓAℓA² d2²  σ_ℓAℓA² d2 d3 ;  σ_ℓAℓA² d1 d3  σ_ℓAℓA² d2 d3  σ33² ℓA² + σ_ℓAℓA² d3² ].  (26)
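A sketch of this propagation in Python, with invented numbers standing in for the actual measurements:

```python
import numpy as np

# Invented values standing in for the three d_z measurements and ell_A.
d = np.array([0.19, 0.11, 0.07])
lA = 300.0

# Covariance of [d1, d2, d3, ell_A] with the structure of Eq. (25):
# d1 and d2 correlated, d3 and ell_A independent of everything else.
C = np.diag([0.0057**2, 0.0036**2, 0.0033**2, 0.8**2])
C[0, 1] = C[1, 0] = 0.3 * 0.0057 * 0.0036

# Jacobian of s_i = d_i * ell_A with respect to (d1, d2, d3, ell_A), Eq. (24).
J = np.array([[lA, 0.0, 0.0, d[0]],
              [0.0, lA, 0.0, d[1]],
              [0.0, 0.0, lA, d[2]]])

C_new = J @ C @ J.T        # Eq. (26): covariance of s = [s1, s2, s3]
```

Note that `C_new` is fully populated even though `C` was mostly diagonal: the shared ℓA measurement correlates all three s values, exactly as the off-diagonal σ_ℓAℓA² d_i d_j terms in Eq. (26) say.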

A similar procedure then needs to be followed to calculate the correlation between the results of this measurement and any other correlated data, such as the CMB-R parameter.

B Exercise: Sampling and marginalising

The aim of this exercise is to write your own Metropolis-Hastings sampler. You’ll also do some basic analysis of the chains, including convergence check and likelihood functions. In most cases we apply MCMC sampling because we don’t know the target distribution or because it is difficult to calculate analytically (e.g. we only have measurements, not the underlying distribution). In terms of cosmology, the target function is often a likelihood. Here we will assume a simple form of the target distribution to make things easier. We use an exponential distribution with mean 1: t(x) = e−x for x > 0 otherwise t(x) = 0. For the proposal/jumping function we use a Gaussian with zero mean and unity standard deviation (Gaussian sampling is very standard).

1. Define the target distribution function (the likelihood).

2. Build a Metropolis-Hastings scheme to sample from the target distribution. The acceptance ratio is A = t(proposed x)/t(current x), and you jump from the current x to the proposed x with probability min(1, A). The framework is:

define target function
x_current = x_start = some value
N = number of steps/proposals
loop from zero to N:
    x_proposed = x_current + random number from Gaussian distribution (mean=0, sigma=1) (this is usually a built-in function)
    A = t(x_proposed)/t(x_current)
    B = random number from uniform distribution from 0 to 1 (this is usually a built-in function)
    update x_current to x_proposed if A>B, otherwise x_current stays the same
    record x_current (and t(x_current))

3. Plot the 1D sample distribution (x as a function of sample number). Is there any burn-in that you should remove?

4. Plot a histogram of sampled x values. Bin the likelihood values in x and overplot the distribution of normalised likelihood values.

5. Determine and plot the autocorrelation as a function of lag,

ρ_lag = [ Σ_{i}^{N−lag} (x_i − x̄)(x_{i+lag} − x̄) ] / [ Σ_{i}^{N} (x_i − x̄)² ].  (27)

What is the smallest lag that gives ρlag ≈ 0? Is it much smaller than the length of the chain?

6. Extend to multiple parameters. You can use a target distribution of t(x1, x2) = e^(−x1) + e^(−(2−x2)²/4). The framework now becomes:

define target function of x_1 and x_2
x1_current = x1_start = some value
x2_current = x2_start = some value
N = number of steps/proposals

loop from zero to N:
    x1_proposed = x1_current + random number from Gaussian distribution (mean=0, sigma=1)
    x2_proposed = x2_current + random number from Gaussian distribution (mean=0, sigma=1)
    A = t(x1_proposed, x2_proposed)/t(x1_current, x2_current)
    B = random number from uniform distribution from 0 to 1
    update x1_current to x1_proposed and x2_current to x2_proposed if A>B, otherwise they stay the same
    record x1_current, x2_current (and t(x1_current, x2_current))

7. Check the convergence of your chain

8. Plot x1 versus x2

9. Bin x1 and x2 and the target distribution and make a contour plot of one, two and three standard deviations. One way to find the contour values of the standard deviations is to sort your likelihood chain, normalise it to one, compute the cumulative sum, and choose the values of the sorted chain that give a cumulative sum of ≈ 0.68, 0.95, 0.99.

10. Marginalise over x2 and plot the 1D distribution of x1, same for x2. What are the 1σ uncertainties?
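A minimal Python implementation of the 1D sampler and autocorrelation from steps 1–5 might look like this (the number of steps, seed, and starting point are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def target(x):
    """Exponential target with mean 1: t(x) = exp(-x) for x > 0, else 0."""
    return np.exp(-x) if x > 0 else 0.0

# Metropolis-Hastings with a Gaussian proposal (mean 0, sigma 1).
n_steps = 50000
x_current = 1.0                        # some starting value inside the support
chain = np.empty(n_steps)
for i in range(n_steps):
    x_proposed = x_current + rng.normal(0.0, 1.0)
    A = target(x_proposed) / target(x_current)   # acceptance ratio
    B = rng.uniform(0.0, 1.0)
    if A > B:                                    # accept with prob. min(1, A)
        x_current = x_proposed
    chain[i] = x_current                         # record, repeats included

def rho(samples, lag):
    """Autocorrelation at a given lag, as in Eq. (27)."""
    dx = samples - samples.mean()
    return np.sum(dx[:len(samples) - lag] * dx[lag:]) / np.sum(dx**2)

print(chain.mean())       # should land close to the target mean of 1
print(rho(chain, 1))      # positive: neighbouring samples are correlated
```

Note that rejected proposals repeat the current point in the chain, which is what makes the sample histogram trace the target distribution; it is also why the lag-1 autocorrelation is clearly positive.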

C Exercise: Importance sampling and marginalisation

The aim of this exercise is to understand how the plots from CosmoMC and MontePython are generated, and to use importance sampling to add an extra prior on the chain. Rather than using GetDist you will write your own code to examine an MCMC chain with the format from CosmoMC or MontePython.

1. Download a chain from https://drive.google.com/open?id=0B3pDjkLTg3eCbzB3UEpvQ0dlNzA (same chains as in Matteo’s workshop).

2. Read the chain in as a table or data frame. The first column is the multiplicity (weight) and the second is the total − ln(L) = χ²/2.

3. The chain was run with Planck and BAO data only. Now use the value of the Hubble parameter from Riess et al. 2016 (http://adsabs.harvard.edu/abs/2016ApJ...826...56R) to importance sample the likelihood and multiplicity of the chain: H0 = 73.24 ± 1.74 km/s/Mpc. Remember:

χ²_prior = ((x_model − x_prior)/σ_prior)²,  χ²_new = χ²_0 + χ²_prior,  and w_IS = L_prior w.

4. Load the new and old chains into GetDist and see how the prior affects the contours.

5. Marginalise over all other parameters and make a contour plot of omegal* versus omegam* before and after importance sampling. In order to make the contour plot you can bin the chain in omegal* and omegam* and sum the multiplicities/likelihoods. Here the easiest is probably to use the ∆χ² values for two independent parameters for the contour levels, ∆χ² = [2.30, 6.18, 11.83].
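A sketch of the importance-sampling step in Python. Since the real chain has to be downloaded, the rows below are mock values, and the column holding H0 is an assumption for this mock layout — check your .paramnames file for the actual column:

```python
import numpy as np

# Mock chain rows (invented numbers): columns are multiplicity (weight),
# -ln(L), and -- assumed for this sketch -- H0 in column 2.
chain = np.array([
    [3.0, 120.5, 67.8],
    [1.0, 121.0, 68.9],
    [5.0, 120.7, 67.2],
    [2.0, 120.9, 70.1],
])

# H0 prior from Riess et al. 2016.
H0_prior, sigma_prior = 73.24, 1.74
w, neglnL, H0 = chain[:, 0], chain[:, 1], chain[:, 2]

chi2_prior = ((H0 - H0_prior) / sigma_prior)**2
neglnL_new = neglnL + 0.5 * chi2_prior      # chi2_new = chi2_0 + chi2_prior
w_new = w * np.exp(-0.5 * chi2_prior)       # w_IS = L_prior * w

print(w_new / w)   # points far from the prior are strongly down-weighted
```

Writing `w_new` and `neglnL_new` back into the chain file then gives the importance-sampled chain that GetDist can compare against the original.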

References

Akaike, H. 1974, IEEE Transactions on Automatic Control, 19, 716

Brooks, S., & Gelman, A. 1998, Journal of Computational and Graphical Statistics, 7, 434

Blake, C., et al. 2011, Monthly Notices of the Royal Astronomical Society, 415, 2892, 1105.2862

Bridle, S. L., Crittenden, R., Melchiorri, A., Hobson, M. P., Kneissl, R., & Lasenby, A. N. 2002, Monthly Notices of the Royal Astronomical Society, 335, 1193, astro-ph/0112114

Davis, T. M., et al. 2007, The Astrophysical Journal, 666, 716

Heavens, A., Fantaye, Y., Sellentin, E., Eggers, H., Hosenie, Z., Kroon, S., & Mootoovaloo, A. 2017, ArXiv e-prints, 1704.03467

Hobson, M. P., Jaffe, A. H., Liddle, A. R., Mukherjee, P., & Parkinson, D. 2014, Bayesian Methods in Cosmology

Leclercq, F., Pisani, A., & Wandelt, B. D. 2014, ArXiv e-prints, 1403.1260

Lewis, A., & Bridle, S. 2002, Physical Review D, 66, 103511, astro-ph/0205436

Liddle, A. R. 2004, Monthly Notices of the Royal Astronomical Society, 351, L49, astro-ph/0401198

Liddle, A. R., Mukherjee, P., Parkinson, D., & Wang, Y. 2006, Physical Review D, 74, 123506, astro-ph/0610126

Parkinson, D., & Liddle, A. R. 2013, Statistical Analysis and Data Mining, 6, 3, 1302.1721

Planck Collaboration et al. 2016, Astronomy & Astrophysics, 594, A13, 1502.01589

Schwarz, G. 1978, Annals of Statistics, 6, 461

Sharma, S. 2017, Annual Review of Astronomy and Astrophysics, 55, 213, 1706.01629

Vardanyan, M., Trotta, R., & Silk, J. 2011, Monthly Notices of the Royal Astronomical Society, 413, L91, 1101.5476

Wall, J., & Jenkins, C. 2003, Practical Statistics for Astronomers, Cambridge Observing Handbooks for Research Astronomers (Cambridge University Press)
