Statistical Issues in Reproducibility
Werner A. Stahel

Abstract

Reproducibility of quantitative results is a theme that calls for probabilistic models and corresponding statistical methods. The simplest model is a random sample of normally distributed observations for both the original and the replication data. It will often lead to the conclusion of a failed reproduction because there is stochastic variation between observation campaigns, or unmodeled correlations lead to optimistic measures of precision. More realistic models include variance components and/or describe correlations in time or space. Since getting the "same circumstances" again, as required for reproducibility, is often not possible or not desirable, we discuss how regression models can help to achieve credibility by what we call a "data challenge" of a model rather than a replication of data. When model development is part of a study, reproducibility is a more delicate problem. More generally, widespread use of exploratory data analysis in some fields leads to unreliable conclusions and to a crisis of trust in published results. The role of models even entails philosophical issues.

1 Introduction

Empirical science is about extracting, from a phenomenon or process, features that are relevant for similar situations elsewhere or in the future. On the basis of data obtained from an experiment or of a set of observations, a model is determined which corresponds to the ideas we have about the generation of the data and allows for drawing conclusions concerning the posed questions. Frequently, the model comprises a structural part, expressed by a formula, and a random part. Both parts may include constants, called parameters of the model, that represent the features of interest, on which inference is drawn from the data. The simplest instance is the model of a "random sample" from a distribution whose expected value is the focus of interest.

The basic task of data analysis is then to determine the parameter(s) which make the model the best description of the data in some clearly defined sense. The core business of statistics is to supplement such an "estimate" with a measure of precision, usually in the form of an interval in which the "true value" should be contained with high probability. This leads to the basic theme of the present volume: If new data becomes available by reproducing an experiment or observation under circumstances as similar as possible, the new results should be compatible with the earlier ones within the precision that goes along with them. Clearly, then, reliable precision measures are essential for judging the success of such "reproductions".

The measures of precision are based on the assessment of variability in the data. Probability theory provides the link that leads from a measure of data variability to the precision of the estimated value of the parameter. For example, it is well known that the variance of the mean of n observations is the variance of the observations' distribution, divided by n, if the observations are independent. We will emphasize in this chapter that the assumption of independence is quite often inadequate in practice; therefore, statistical results appear more precise than they are, and precision indicators fail to describe adequately what is to be expected in replication studies. In such cases, models that incorporate structures of variation or correlation lead to appropriate adjustments of precision measures.
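To make the last point tangible, here is a small R sketch; the AR(1) correlation structure and all constants are illustrative assumptions, not taken from the chapter. It compares the actual campaign-to-campaign variability of the mean with the naive standard error computed under the independence assumption.

```r
## Illustrative simulation (assumed setup, not from the chapter): with positively
## autocorrelated observations, the naive standard error sd(x)/sqrt(n)
## understates the true variability of the mean.
set.seed(1)
n    <- 50      # observations per campaign
reps <- 2000    # simulated campaigns
rho  <- 0.6     # autocorrelation of successive observations (AR(1))

## variability of the mean over many campaigns
means_indep <- replicate(reps, mean(rnorm(n)))
means_ar1   <- replicate(reps, mean(arima.sim(model = list(ar = rho), n = n)))

## naive precision claimed within a single correlated campaign
naive_se <- mean(replicate(reps, {
  x <- arima.sim(model = list(ar = rho), n = n)
  sd(x) / sqrt(n)
}))

c(sd_mean_independent = sd(means_indep),   # close to 1/sqrt(n)
  sd_mean_correlated  = sd(means_ar1),     # substantially larger
  naive_se_correlated = naive_se)          # too small: overstates precision
```

The discrepancy between the last two numbers is exactly the kind of over-optimism that makes an honest replication look like a failure.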
Many research questions concern the relationship between some given input or "explanatory" variables and one or more output or "response" variables. They are formalized through regression models. These models even allow for relating studies that do not meet the requirement of "equal circumstances" that is necessary for testing reproducibility. We will show how such comparisons may still yield a kind of confirmation of the results of an original study, and thereby generalize the idea of reproducibility to what we call data challenges.

These considerations presume that the result of a study is expressed as a model for the data, and that quantitative information is the focus of the reproduction project. However, a more basic and more fascinating step in science is to develop or extend such a model for a new situation. Since the process of developing models is not sufficiently formalized, considerations on the reproducibility of such developments also remain somewhat vague. We will nevertheless address this topic, which leads to the question of what is to be reproduced.

In this paper, we assume that the reader knows the basic notions of statistics. For the sake of establishing the notation and recalling some basic concepts, we give a gentle introduction in Section 2. Since variation is the critical issue for the assessment of precision, we discuss structures of variation in Section 3. Regression models and their use for reproducibility assessment are explained in Section 4. Section 5 covers the issue of model development and consequences for reproducibility. Section 6 addresses issues of the analysis of large datasets. Comments on Bayesian statistics and reproducibility follow in Section 7. Some general conclusions are drawn in the final Section 8.¹

¹ While this contribution treats reproducibility in the sense of conducting a new measurement campaign and comparing its results to the original one, the aspect of reproducing the analysis of the data is left to Bailey et al. (this volume). The tools advocated by many statisticians to support this aim are the open source statistical system R and the documentation tools Sweave and knitr (Leisch 2002, Xie 2013, Stodden et al. 2014).

Figure 1: Measurements of the velocity of light by Newcomb in 1882, taken on three days (vertical axis: velocity [km/s] − 299.700; a horizontal line marks the "true" velocity). Three confidence intervals (t test, t without outliers, Wilcoxon) are shown on the right.

2 A Random Sample

2.1 Simple Inference for a Random Sample

The generic statistical problem consists of determining a constant from a set of observations that are all obtained under the "same circumstances" and are supposed to "measure" the constant in some sense. We begin with a classic scientific example: the determination of the velocity of light by Newcomb in 1882. The data set comprises 66 measurements taken on three days, see Fig. 1.

The probability model that is used in such a context is that of a random variable X with a supposed distribution, most commonly the normal distribution. It is given by a density f or, more generally, by the ("theoretical") cumulative distribution function F(x) = P(X ≤ x). The distribution is characterized by parameters: the expected value µ and the standard deviation σ (in the case of the normal distribution). For the posed problem, µ is the parameter of interest, and σ is often called a nuisance parameter, since the problem would be easier if it were not needed for a reasonable description of the data.
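As a concrete, if minimal, illustration of this setting in R: assuming the 66 measurements are available in a vector newcomb (the data are not reproduced here), interval estimates for µ of the kind shown in Fig. 1 could be obtained as follows.

```r
## Minimal sketch of inference for a random sample; 'newcomb' is assumed to
## hold the 66 measurements, coded as in Fig. 1 (velocity [km/s] - 299.700).
n    <- length(newcomb)
xbar <- mean(newcomb)                 # estimate of the expected value mu
s    <- sd(newcomb)                   # estimate of the nuisance parameter sigma
se   <- s / sqrt(n)                   # standard error of the mean
xbar + qt(c(0.025, 0.975), df = n - 1) * se   # 95% t confidence interval

t.test(newcomb)                       # same interval, plus a test, in one call
wilcox.test(newcomb, conf.int = TRUE) # rank-based interval, resistant to outliers
```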
We will use θ to denote a general parameter and the boldface symbol θ to denote the vector of all the parameters of the model. The distribution function will be denoted by Fθ(x).

Figure 2: Histogram (left) and empirical cumulative distribution function (right) for the Newcomb data. The dashed and dotted lines represent normal distributions obtained from fits to the data without and with two outliers, respectively.

If the model is suitable, the histogram describing the sample values x1, x2, ..., xn will be similar to the density function, and the empirical cumulative distribution function F̂(x) = #(xi ≤ x)/n will be like the theoretical one (Fig. 2). For small samples, the similarity is not very close, but by the law of large numbers, it will increase with n and approach identity for n → ∞. Probability theory determines in mathematical terms how close the similarity will be for a given n.

The question asked above concerns the mean of the random quantity. It is not the mean of the sample that is of interest, but the expected value of the random variable. It should be close to the sample mean, and again, probability theory shows how close the mean must be to the expected value. It thereby gives us the key to drawing statistical inferences about the model from the sample of data at hand; here, for inferences about the parameter θ.

The basic questions Q and answers A in statistical inference about a parameter of a model are the following.

Q1 Which value of the parameter θ (or parameter vector θ) is most plausible in the light of the data x1, x2, ..., xn?

A1 The most plausible value is given by a function θ̂(x1, x2, ..., xn) of the data (or θ̂(...) for the vector), called an estimator.

Q2 Is a given value θ0 of a parameter plausible in the light of the data?

A2 The answer is given by a statistical test. A test statistic T(x1, x2, ..., xn; θ0) is calculated, and the answer is "yes" if its value is small enough. The threshold for what is considered "small" is determined by a preselected probability α (called the level of the test, usually 5%) of falsely answering "no" (called the error of the first kind). Usually, the test statistic is a function of the estimator θ̂ of all model parameters and of the parameter θ0. If the answer is "no", one says that the test is (statistically) significant or, more explicitly, that the null hypothesis H0 that the parameter θ equals the supposed value θ0 is rejected.
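A short R sketch may help to connect the Q2/A2 scheme with a concrete calculation, here for the one-sample t-test of H0: µ = µ0 under the normal model; the data vector x and the hypothesized value mu0 are placeholders.

```r
## Sketch of Q2/A2 for the one-sample t-test of H0: mu = mu0.
## 'x' (the sample) and 'mu0' (the hypothesized value) are placeholders.
alpha  <- 0.05                                     # level of the test
n      <- length(x)
tstat  <- abs(mean(x) - mu0) / (sd(x) / sqrt(n))   # test statistic T(x1,...,xn; mu0)
cutoff <- qt(1 - alpha / 2, df = n - 1)            # threshold for "small enough"
tstat <= cutoff      # TRUE: mu0 is plausible; FALSE: H0 is rejected (significant)

t.test(x, mu = mu0)$p.value   # the equivalent built-in test, reported as a p-value
```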