What Is the Probability of Replicating a Statistically Significant Effect?

Psychonomic Bulletin & Review 2009, 16 (4), 617-640
doi:10.3758/PBR.16.4.617
© 2009 The Psychonomic Society, Inc.

THEORETICAL AND REVIEW ARTICLES

JEFF MILLER
University of Otago, Dunedin, New Zealand

If an initial experiment produces a statistically significant effect, what is the probability that this effect will be replicated in a follow-up experiment? I argue that this seemingly fundamental question can be interpreted in two very different ways and that its answer is, in practice, virtually unknowable under either interpretation. Although the data from an initial experiment can be used to estimate one type of replication probability, this estimate will rarely be precise enough to be of any use. The other type of replication probability is also unknowable, because it depends on unknown aspects of the research context. Thus, although it would be nice to know the probability of replicating a significant effect, researchers must accept the fact that they generally cannot determine this information, whichever type of replication probability they seek.

Scientific theories are built on replicable phenomena (see, e.g., Falk, 1998; Guttman, 1977; Tukey, 1969; Wainer & Robinson, 2003). In sciences with deterministic measurements, the idea of replication is simple: If two researchers measure the same phenomenon using the same instruments and procedures, they should obtain essentially the same results. Things are not so simple when the measurements are subject to random variability due to measurement error, individual differences, or both. In this case, real effects are only replicated with a certain probability, often called the “replication probability.” Even when a real effect is present, some replication failures must be expected as one of the unfortunate consequences of variability.

For researchers faced with random variability, it is useful to understand the nature and determinants of replication probability for at least three reasons. First, this probability is relevant in assessing the implications of discrepant results (“Is this a real effect that by chance was not replicated, or was the initial finding spurious?”). Second, it is also relevant when researchers want to show that an effect obtained in one circumstance disappears in some other situation (e.g., a control experiment); the absence of the effect in the new situation is only diagnostic if the experiment had a high probability of replicating a true effect. Third, replication probability is relevant when planning a series of experiments (“What are the chances that I will obtain this effect again in future experiments like this one?”).

Unfortunately, there is evidence that many psychological researchers do not understand replication probability (see, e.g., Tversky & Kahneman, 1971). For example, in an oft-cited (e.g., Cohen, 1994) study, Oakes (1986) presented a group of 70 researchers with a scenario in which a two-group comparison resulted in a t test that was significant at the level of p < .01. A majority (60%) thought that this indicated a 99% chance of a significant result in a replication study, although this is patently not the case (Oakes, 1986; cf. Haller & Kraus, 2002). More recently, others have documented additional confusions regarding what is to be expected from replications (e.g., Cumming, Williams, & Fidler, 2004).

Because of the importance of replication probability and the confusion surrounding it, recent articles in numerous disciplines have urged researchers to consider replication probability more carefully (e.g., Cumming, 2008; Cumming & Maillardet, 2006; Gorroochurn, Hodge, Heiman, Durner, & Greenberg, 2007; Greenwald, Gonzalez, Harris, & Guthrie, 1996; Killeen, 2005; Robinson & Levin, 1997; Sohn, 1998). Researchers have been offered formulas with which to compute the probability of replicating their current results, and they have been advised to report the resulting replication probabilities as well as, or even in preference to, more traditional statistical measures (e.g., Greenwald et al., 1996; Killeen, 2005; Psychological Science editorial board, 2005).

In this article, I consider further the questions of what replication probability is and what factors determine it, and I argue for two main theses. One thesis is that there are two quite different meanings of the term “replication probability,” each of which might be of interest to researchers under some circumstances. It is important to be clear about which meaning is under consideration, however, when discussing replication probability or trying to estimate it, because confusion between the two types of replication probability can lead to inappropriate conclusions. The other thesis is that in practice, neither of these replication probabilities can be estimated at all accurately from the data of an initial experiment, so they are both essentially unknowable. Moreover, the latter thesis implies that researchers are generally ill-advised to summarize their data in terms of estimated replication probabilities, despite the importance of these quantities, because the estimates that they obtain are nearly meaningless.

This article begins with a short review of the standard hypothesis-testing framework in which the question of replication probability often arises. The following sections examine in detail the two different meanings of “replication probability,” how each of these probabilities might be estimated, and why the estimates are not very accurate. The General Discussion then considers how the same conceptual distinctions and estimation uncertainty extend to the concept of replication probability within other inferential approaches (e.g., Bayesian).

HYPOTHESIS-TESTING BACKGROUND

Although the framework of null-hypothesis significance testing (NHST) remains controversial (see, e.g., Abelson, 1997; Cohen, 1994; Kline, 2004; Loftus, 1996; Lykken, 1991; Oakes, 1986; Wagenmakers, 2007), even its critics acknowledge that it is still in common use and that many of its problems stem more from misunderstanding and misuse than from inherent flaws. Therefore, replication probability is discussed here mainly within this hypothesis-testing framework. Importantly, this article should not be seen as arguing that NHST is superior to alternative statistical techniques (e.g., confidence intervals; cf. Cumming & Finch, 2005), although I do believe that NHST is one of a wide range of techniques that can usefully be employed, as long as its strengths and limitations are clearly understood.

Within the hypothesis-testing framework, researchers test for a significant effect by computing the probability, under the null hypothesis, of observing data at least as discrepant from the predictions of the null hypothesis as the data they have actually observed. They reject the null hypothesis if this computed probability (sometimes called the “attained significance level” or “p value”) is less than a predetermined cutoff alpha (α) level [...]

[...]cedure is chosen so that the Type I error probability has a certain predetermined value (typically set at α = .05, as already mentioned) when the null hypothesis is true.

When the null hypothesis is really false and some alternative hypothesis is true, the probability of rejecting the null hypothesis is called the “power” of the experiment, and the symbol for this probability is 1 − β. Correspondingly, under a particular alternative hypothesis, the probability that a false null hypothesis is incorrectly retained is β. As is well known (for a review, see, e.g., Cohen, 1992), power increases with the true size of the effect under study.¹ It also increases with the sample size of the experiment and with the α level associated with the hypothesis-testing procedure. Although the sample size and α level of a given experiment can be specified exactly, the true effect size is never known exactly in practice, precluding direct computation of power.

Researchers generally regard an effect as having been replicated successfully if the effect is statistically significant in both an initial study and a follow-up study, with the results of both studies in the same direction (e.g., larger mean for group A than for group B; Rosenthal, 1993).²

TWO MEANINGS OF “REPLICATION PROBABILITY”

It is useful to distinguish between two legitimate but quite different meanings of “replication probability” that might be of interest to researchers under different circumstances. Both may be defined within a frequentist framework. One, which I call the “aggregate” replication probability, is the probability that researchers who obtain significant results in their initial experiments will also obtain significant effects in identical follow-up experiments.³ As will be discussed in detail, this meaning of replication probability applies across a large pool of researchers working within a common experimental or theoretical context but testing different null hypotheses. It refers to the proportion of successful replications across all of the different null hypotheses tested.

The other meaning, which I call the “individual” replication probability, is the long-run proportion of significant results that would be obtained by a particular researcher in exact replications of that researcher’s own initial study. This meaning refers to the proportion of significant results within exact replications of a particular initial study (i.e., testing a sin-
