Is Psychology Suffering From a Replication Crisis? What Does “Failure to Replicate” Really Mean?
Scott E. Maxwell, University of Notre Dame; Michael Y. Lau, Teachers College, Columbia University; George S. Howard, University of Notre Dame

Psychology has recently been viewed as facing a replication crisis because efforts to replicate past study findings frequently do not show the same result. Often, the first study showed a statistically significant result but the replication does not. Questions then arise about whether the first study results were false positives, and whether the replication study correctly indicates that there is truly no effect after all. This article suggests these so-called failures to replicate may not be failures at all, but rather are the result of low statistical power in single replication studies, and the result of failure to appreciate the need for multiple replications in order to have enough power to identify true effects. We provide examples of these power problems and suggest some solutions using Bayesian statistics and meta-analysis. Although the need for multiple replication studies may frustrate those who would prefer quick answers to psychology’s alleged crisis, the large sample sizes typically needed to provide firm evidence will almost always require concerted efforts from multiple investigators. As a result, it remains to be seen how many of the recently claimed failures to replicate will be supported or instead may turn out to be artifacts of inadequate sample sizes and single-study replications.

Keywords: false positive results, statistical power, meta-analysis, equivalence tests, Bayesian methods

Author note: Scott E. Maxwell, Department of Psychology, University of Notre Dame; Michael Y. Lau, Department of Counseling and Clinical Psychology, Teachers College, Columbia University; George S. Howard, Department of Psychology, University of Notre Dame. We thank Melissa Maxwell Davis, David Funder, John Kruschke, Will Shadish, and Steve West for their valuable comments on an earlier version of this paper. Correspondence concerning this article should be addressed to Scott E. Maxwell, Department of Psychology, University of Notre Dame, Notre Dame, IN 46556. E-mail: [email protected]

Psychologists have recently become increasingly concerned about the likely overabundance of false positive results in the scientific literature. For example, Simmons, Nelson, and Simonsohn (2011) state that “In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not” (p. 1359). In a similar vein, Ioannidis (2005) concluded that for disciplines where statistical significance is a virtual prerequisite for publication, “most current published research findings are false” (p. 696). Such concerns led Pashler and Wagenmakers (2012) to conclude that there appears to be a “crisis of confidence in psychological science reflecting an unprecedented doubt among practitioners about the reliability of research findings in the field” (p. 528). Simmons et al. (2011) state that “a field known for publishing false positives loses its credibility” (p. 1359).

An initial reaction might be that psychology is immune to such concerns because published studies typically appear to control the probability of a Type I error (i.e., mistakenly reporting an effect when, in reality, no effect exists) at 5%. However, as a number of authors (e.g., Gelman & Loken, 2014; John, Loewenstein, & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011) have discussed, data analyses in psychology and other fields are often driven by the observed data. Data-driven analyses include but are not limited to noticing apparent patterns in the data and then testing them for significance, testing effects on multiple measures, testing effects on subgroups of participants, fitting multiple latent variable models, including or excluding various covariates, and stopping data collection once significant results have been obtained. Some of these practices may be entirely appropriate depending on the specific circumstances, but even at best the existence of such practices makes it difficult to evaluate the accuracy of a single published study, because these practices typically increase the probability of obtaining a significant result. Gelman and Loken (2014) state that “Fisher offered the idea of p values as a means of protecting researchers from declaring truth based on patterns in noise. In an ironic twist, p values are now often manipulated to lend credence to noisy claims based on small samples” (p. 460).
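The inflation is easy to demonstrate. The short simulation below is our illustrative sketch rather than anything from the studies cited above; it assumes Python with NumPy and SciPy and grants a hypothetical researcher just two of the practices listed: a second outcome measure and one round of extra data collection. No true effect exists anywhere, yet “significant” results appear far more often than 5% of the time.

    # Illustrative simulation (ours, not the authors'): a hypothetical
    # researcher tests two outcome measures and, when a first test on 20
    # participants per group misses significance, adds 10 more per group
    # and tests again. Both groups are always drawn from the same
    # population, so every "significant" result is a false positive.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_sims, n_initial, n_added = 10_000, 20, 10
    false_positives = 0

    for _ in range(n_sims):
        found = False
        for _ in range(2):  # two outcome measures (independent here for simplicity)
            a = rng.normal(size=n_initial)
            b = rng.normal(size=n_initial)
            if stats.ttest_ind(a, b).pvalue < .05:
                found = True
                break
            # optional stopping: collect more data, then test once more
            a = np.concatenate([a, rng.normal(size=n_added)])
            b = np.concatenate([b, rng.normal(size=n_added)])
            if stats.ttest_ind(a, b).pvalue < .05:
                found = True
                break
        false_positives += found

    # Prints an error rate well above the nominal .05 (around .14 here)
    print(f"Empirical Type I error rate: {false_positives / n_sims:.3f}")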
The question of whether a pattern seemingly identified in an original study is in fact more than just noise can often best be addressed by testing whether the pattern can be replicated in a new study, which has led to increased attention to the role of replication in psychological research. Moonesinghe, Khoury, and Janssens (2007) have shown that successful replications can greatly lower the risk of inflated false positive results. Both Moonesinghe et al. (2007, p. 218) and Simons (2014, p. 76) maintain that replication is “the cornerstone of science” because only replication can adjudicate whether a single study reporting an original result represents a true finding or a false positive result. Perspectives on Psychological Science devoted a special section to replicability in 2012 (Pashler & Wagenmakers, 2012). More recently, this journal has begun a new type of article, a Registered Replication Report (RRR). In an APS Observer column, Roediger (2012) stated that “By following the practice of both direct and systematic replication, of our own research and of others’ work, we would avoid the greatest problems we are now witnessing” (para. 19). Along these lines, collaborative efforts such as the Reproducibility Project (Open Science Collaboration, 2012) and the psychfiledrawer.org website, which provides an archive of replication studies, reflect systematic efforts to assess the extent to which original findings published in the literature are replicable and can be trusted.
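Moonesinghe et al.’s (2007) argument can be made concrete with a small Bayesian-style calculation, sketched below in Python under assumed numbers (the 10% prior, 50% power, and 5% alpha are our illustrative choices, not figures from their paper): a significant result multiplies the odds that the effect is real by the ratio of power to alpha, and each independent successful replication multiplies the odds again.

    # A small calculator in the spirit of Moonesinghe et al. (2007); the
    # prior, power, and alpha below are illustrative assumptions, not
    # values from the article.
    def prob_effect_is_real(prior, power, alpha=0.05, k_significant=1):
        """Posterior probability that an effect is real after k_significant
        independent studies (the original plus replications) all reached
        significance, each with the same power and alpha."""
        odds = prior / (1 - prior)                # prior odds the effect is real
        odds *= (power / alpha) ** k_significant  # each hit multiplies odds by power/alpha
        return odds / (1 + odds)

    # A long-shot hypothesis (10% prior) studied with 50% power:
    for k in (1, 2, 3):
        print(k, round(prob_effect_is_real(0.10, 0.50, k_significant=k), 3))
    # -> 0.526, 0.917, 0.991: each successful replication sharply cuts the
    #    risk that the reported effect is a false positive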
Although such concerns have arisen in several disciplines, much of the concern has focused on psychology. A 2012 article in The Chronicle of Higher Education, for example, raised the question, “Is Psychology About to Come Undone?” (Bartlett, 2012). More recently, a 2014 Chronicle of Higher Education article described the apparent crisis as “repligate” (Bartlett, 2014).

Several recent apparent replication failures have been widely publicized and have begun to cast doubt in some minds on the extent to which the field more broadly is beset with a preponderance of results that cannot be replicated. Most notably, various replication studies (e.g., Galak, LeBoeuf, Nelson, & Simmons, 2012; Ritchie, Wiseman, & French, 2012) apparently fail to confirm Bem’s (2011) highly publicized findings regarding the existence of psi. Another highly publicized example is the apparent failure of Doyen, Klein, Pichon, and Cleeremans (2012) and Pashler, Coburn, and Harris (2012) to replicate Bargh’s work on the influence of subtle priming on behavior. More generally, out of 14 replication attempts organized by …

A particular replication may fail to confirm the results of an original study for a variety of reasons, some of which may include intentional differences in procedures, measures, or samples, as in a conceptual replication (Cesario, 2014; Simons, 2014; Stroebe & Strack, 2014). Although conceptual replication studies can be very informative, they may not be able to identify false positive results in the published literature, because if the replication study fails to find an effect previously reported in a published study, any discrepancy in results may simply be due to procedural differences between the two studies. For this reason, there has recently been an increased emphasis on exact (or direct) replications. If exactly replicating the procedures of the original study fails to reproduce the results, then it might seem reasonable to conclude that the results of the original study are in reality nothing more than a Type I error (i.e., mistakenly reporting an effect when, in reality, no effect exists). The primary purpose of our article is to explain why even an exact replication may fail to obtain findings consistent with the original study and yet the effect identified in the original study may very well be true despite these discrepant findings.

It might seem straightforward to decide whether a replication study is a success or a failure, at least from a narrow statistical perspective. Generally speaking, a published original study has in all likelihood demonstrated a statistically significant effect. In the current zeitgeist, a replication study is usually interpreted as successful if it also demonstrates a statistically significant effect. On the other hand, a replication study that fails to show statistical significance would typically be interpreted as a failure.1 An immediate limitation of this perspective is that the replication study may have failed to produce a statistically significant result because it may have been underpowered. There is always some probability that a nonsignificant result may be a Type II error (i.e., failing to reject the null hypothesis even though it is false). However, this limitation seems to have an immediate solution, namely to design the replication study so as to have adequate statistical power and thus minimal risk of a Type II error. As Simons (2014) states, “If an effect is real and robust, any competent researcher should be able to obtain it when using the same procedures with adequate statistical power” (p.
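How much power a single replication actually has is easy to check. The sketch below uses the statsmodels library for a standard two-sample power calculation; the effect size and sample sizes are assumed for illustration and are not taken from this article.

    # Back-of-the-envelope power check using statsmodels; the effect and
    # sample sizes are assumed for illustration, not taken from the article.
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    # A real but small effect (Cohen's d = 0.3) and a replication with 50
    # participants per group, a fairly common sample size:
    power = analysis.power(effect_size=0.3, nobs1=50, alpha=0.05)
    print(f"power = {power:.2f}")                                # about 0.32
    print(f"P(nonsignificant | real effect) = {1 - power:.2f}")  # about 0.68

    # Per-group n needed for 80% power against the same effect:
    n_needed = analysis.solve_power(effect_size=0.3, power=0.80, alpha=0.05)
    print(f"n per group for 80% power = {n_needed:.0f}")         # about 175

Under these assumptions, a lone replication of a real but small effect comes up nonsignificant about two times in three, which is precisely the situation the rest of the article examines.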