Why Replications Do Not Fix the Reproducibility Crisis: A Model and Evidence from a Large-Scale Vignette Experiment

Adam J. Berinsky (a,1), James N. Druckman (b,1), and Teppei Yamamoto (a,1,2)

a Department of Political Science, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139; b Department of Political Science, Northwestern University, 601 University Place, Evanston, IL 60208

This manuscript was compiled on January 9, 2018.

There has recently been a dramatic increase in concern about whether "most published research findings are false" (Ioannidis 2005). While the extent to which this is true in different disciplines remains debated, less contested is the presence of "publication bias," which occurs when publication decisions depend on factors beyond research quality, most notably the statistical significance of an effect. Evidence of this "file drawer problem" and related phenomena across fields abounds, suggesting that an emergent scientific consensus may represent false positives. One of the most noted responses to publication bias has been an increased emphasis on replication, where a researcher repeats a prior study with different subjects and/or situations. But do replication studies themselves suffer from publication bias, and if so, what are the implications for knowledge accumulation? In this study, we contribute to the emerging literature on publication and replication practices in several important ways. First, we offer a micro-level model of the publication process involving an initial study and a replication. The model incorporates possible publication bias at both the initial and replication stages, enabling us to investigate the implications of such bias for various statistical metrics of the quality of evidence. Second, we estimate the key parameters of the model with a large-scale vignette experiment conducted with the universe of political science professors teaching at Ph.D.-granting institutions in the United States. Our results show substantial evidence of both types of publication bias in the discipline: on average, respondents judged statistically significant results about 20 percentage points more likely to be published than statistically insignificant results, and contradictory replication results about six percentage points more publishable than replications that are consistent with original results. Based on these findings, we offer practical recommendations for the improvement of publication practices in political science.

Replication is a hallmark of science. In the ideal, all empirical research findings would be subject to replication, with knowledge accumulating as replications proceed. There has been increasing concern that such an "ideal" would paint an unflattering portrait of science: a recent survey of scientists found that 90% of respondents agreed there is a reproducibility crisis (1). Evidence of such a crisis comes, in part, from the Open Science Collaboration (OSC) project, which replicated just 36% of initially statistically significant results from 100 previously published psychology experiments (2).

One possible driver of the "replication crisis" is publication bias at the initial stage: that is, the published literature overstates statistically significant results because those are the only kind that survive the publication process (3). Non-significant results are instead relegated to the discipline's collective "file drawer" (4). When publication decisions depend on factors beyond research quality, the emergent scientific consensus may be skewed. Encouraging replication seems to be one way to correct a biased record of published research resulting from this file drawer problem (5–7). Yet, in the current landscape, one must also consider potential publication biases at the replication stage. While this was not an issue for the OSC, since it relied on over 250 scholars to replicate the 100 studies, the reality is that individual replication studies also face a publication hurdle.

In what follows, we present a model and a survey experiment that capture the publication process for initial and replication studies. In so doing, we introduce a distinct type of publication bias, which we call "gotcha bias." This bias occurs only for replication studies, such that the likelihood of publication increases if the replication contradicts the findings of the original study. We also show that the common metric used to assess replication success, the "reproducibility" rate (i.e., the proportion of published replication results that successfully reproduce the original positive finding), is not affected by initial-study publication bias (i.e., the file drawer) according to our model. In other words, a low reproducibility rate provides no insight into whether there is a file drawer problem. Finally, our empirical results from a survey of political scientists suggest that publication biases occur with greater frequency at the replication phase and that the gotcha bias may, in practice, exacerbate the false positive rate.

1 A.J.B., J.N.D., and T.Y. contributed equally to this work.
2 To whom correspondence should be addressed. E-mail: [email protected]

Model of Publication Decisions

We consider two distinct types of publication bias. First, we examine the well-known file drawer problem (4). File drawer bias occurs if a positive test result (i.e., a statistical hypothesis test that rejects the null hypothesis of no effect) is more likely to be published than a negative test result (i.e., a hypothesis test that does not reject the null hypothesis), ceteris paribus. In other words, the published record of research is skewed away from the true distribution of the research record, overstating the collective strength of the findings. For example, if 1 out of 10 studies showed that sending varying types of health-related text messages leads people to eat less fatty food, and only that 1 is published, the result is a mis-portrayal of the effect of text messages. There is a large theoretical and empirical literature documenting the evidence and possible consequences of this type of publication bias (8–10).
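To make the distortion concrete, here is a minimal simulation sketch of this selection mechanism (in Python, with invented parameter values; it illustrates the file drawer generically and is not the model developed below): many studies estimate the same small true effect, but only statistically significant estimates enter the published record.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical literature: 10,000 studies estimate the same small true effect,
# each with the same standard error, but only results significant at the
# two-sided 5% level survive to publication (the file drawer).
n_studies, true_effect, se = 10_000, 0.10, 0.10
estimates = rng.normal(true_effect, se, n_studies)   # one estimate per study
published = np.abs(estimates / se) > 1.96            # significance filter

print(f"true effect:                {true_effect:.2f}")
print(f"mean estimate, all studies: {estimates.mean():.3f}")             # ~0.10
print(f"mean published estimate:    {estimates[published].mean():.3f}")  # ~0.25, inflated
print(f"share of studies published: {published.mean():.1%}")
```

In a setup like this, the mean published estimate runs roughly two and a half times the true effect, which is the sense in which the published record "overstates the collective strength of the findings."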
The file drawer bias reflects an entrenched culture that prioritizes statistical significance and novelty, as well as a funding system that rewards positive findings (3).

Second, we consider the possibility that the process of replication itself could induce bias. We define gotcha bias to be the phenomenon that, ceteris paribus, a positive test result is more likely to be published when there exists a prior study that tested the same hypothesis and had a negative result than when a prior study showed a positive test result (and vice versa). That is, replications are more likely to be published if they overturn extant findings. The published record of research therefore overemphasizes replications that run counter to existing findings, as compared to the true distribution of the research record. For example, there might be 10 similar studies of text messaging effects that each purport to replicate corresponding previous studies, all of which find significant effects, but 9 of them are successful replications of original studies that found equally significant results. If only the one study that finds a significant effect contrary to a previous non-significant study is published, the larger research community will see a distorted portrayal of the research record. The gotcha bias is a concept applicable to replication studies that attempt to reproduce the results of previous, original studies using similar study designs. The hypothesized mechanism behind

We note that our model is not intended to be an accurate, descriptive portrayal of actual scientific practice. For example, in reality not all published results will be replicated with fresh samples, even with the current push toward credible research. It is indeed possible that researchers might strategically choose which existing studies to replicate and which replication results to submit for review, given their perceptions of publication biases. Instead, our goal here is to examine how the idealized model of replication science, as exemplified by the OSC study (2), would differ if we "perturbed" it by adding possible publication bias for replication studies themselves.

To study the consequences of the two types of publication bias, we specifically consider the following metrics of evidence quality in published studies.

Definition 1 [Actual False Positive Rate (AFPR) in published replication studies]

$$\tilde{\alpha}_2 = \Pr(\text{replication test significant} \mid \text{replication published, the null is true})$$

Definition 2 [Reproducibility]

$$R = \Pr(\text{replication test significant} \mid \text{original test significant and published, replication published})$$
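To preview the intuition for the claim, made above, that $R$ is unaffected by the file drawer, consider a stylized reading of these definitions (the notation below is ours, not the paper's): let $S_1, S_2 \in \{0,1\}$ indicate whether the original and replication tests are significant, and $P_1, P_2$ whether each is published. If the initial publication decision depends on the data only through the original test result, so that $P_1 \perp (S_2, P_2) \mid S_1$, then

$$R = \Pr(S_2 = 1 \mid S_1 = 1,\, P_1 = 1,\, P_2 = 1) = \Pr(S_2 = 1 \mid S_1 = 1,\, P_2 = 1),$$

and the file-drawer probability $\Pr(P_1 = 1 \mid S_1)$ drops out of $R$ entirely. The file drawer can still shape which replications exist to be counted in $\tilde{\alpha}_2$, which conditions only on the replication being published.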
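The same logic can be checked numerically. Below is a stylized Monte Carlo sketch of the two-stage process: a file-drawer filter at the initial stage, a gotcha-style filter at the replication stage (replications are run only on published originals here), and the two metrics computed as in Definitions 1 and 2. All parameter values are invented for illustration; none come from the paper's vignette experiment.

```python
import numpy as np

# Stylized Monte Carlo of the two-stage publication process; the
# parameterization is our own illustration, not the paper's model.
rng = np.random.default_rng(0)

def simulate(n=1_000_000, share_real=0.2, power=0.8, alpha=0.05,
             pub_sig=0.9, pub_insig=0.3,        # file drawer: pub_sig > pub_insig
             pub_contra=0.6, pub_confirm=0.4):  # gotcha: pub_contra > pub_confirm
    """Simulate n original/replication study pairs; return (AFPR, R)."""
    effect_real = rng.random(n) < share_real
    p_reject = np.where(effect_real, power, alpha)  # chance a test is significant
    orig_sig = rng.random(n) < p_reject
    orig_pub = rng.random(n) < np.where(orig_sig, pub_sig, pub_insig)
    rep_sig = rng.random(n) < p_reject              # independent test, same truth
    contradicts = rep_sig != orig_sig
    rep_pub = orig_pub & (rng.random(n) <
                          np.where(contradicts, pub_contra, pub_confirm))
    # Definition 1: significant published replications among true nulls
    afpr = rep_sig[rep_pub & ~effect_real].mean()
    # Definition 2: significant replications among published replications of
    # originals that were significant and published
    repro = rep_sig[rep_pub & orig_sig & orig_pub].mean()
    return afpr, repro

# Tightening or loosening the file drawer should leave R untouched
for pub_insig in (0.05, 0.30, 0.90):
    afpr, repro = simulate(pub_insig=pub_insig)
    print(f"pub_insig={pub_insig:.2f}  AFPR={afpr:.3f}  R={repro:.3f}")
```

In runs like this, $R$ holds steady (about 0.55) across all three file-drawer settings, while $\tilde{\alpha}_2$ shifts with the composition of published replications, consistent with the claim that a low reproducibility rate is silent about the file drawer.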