
Why Replications Do Not Fix the Reproducibility Crisis: A Model and Evidence from a Large-Scale Vignette Experiment

Adam J. Berinsky^{a,1}, James N. Druckman^{b,1}, and Teppei Yamamoto^{a,1,2}

^{a}Department of Political Science, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139; ^{b}Department of Political Science, Northwestern University, 601 University Place, Evanston, IL 60208

This manuscript was compiled on January 9, 2018

There has recently been a dramatic increase in concern about whether "most published findings are false" (Ioannidis 2005). While the extent to which this is true in different disciplines remains debated, less contested is the presence of "publication bias," which occurs when publication decisions depend on factors beyond research quality, most notably the statistical significance of an effect. Evidence of this "file drawer problem" and related phenomena across fields abounds, suggesting that an emergent scientific consensus may represent false positives. One of the most noted responses to publication bias has involved an increased emphasis on replication, where a researcher repeats a prior research study with different subjects and/or situations. But do replication studies themselves suffer from publication bias, and if so, what are the implications for knowledge accumulation? In this study, we contribute to the emerging literature on publication and replication practices in several important ways. First, we offer a micro-level model of the publication process involving an initial study and a replication. The model incorporates possible publication bias at both the initial and replication stages, enabling us to investigate the implications of such bias for various statistical metrics of the quality of evidence. Second, we estimate the key parameters of the model with a large-scale vignette experiment conducted with the universe of political science professors teaching at Ph.D.-granting institutions in the United States. Our results show substantial evidence of both types of publication bias in the discipline: On average, respondents judged statistically significant results about 20 percentage points more likely to be published than statistically insignificant results, and contradictory replication results about six percentage points more publishable than replications that are consistent with original results. Based on these findings, we offer practical recommendations for the improvement of publication practices in political science.

^{1}A.J.B., J.N.D., and T.Y. contributed equally to this work. ^{2}To whom correspondence should be addressed. E-mail: [email protected]

Replication is a hallmark of science. In the ideal, all empirical research findings would be subject to replication, with knowledge accumulating as replications proceed. There has been increasing concern that such an "ideal" would paint an unflattering portrait of science – a recent survey of scientists found that 90% of respondents agreed there is a reproducibility crisis (1). Evidence of such a crisis comes, in part, from the Open Science Collaboration (OSC) project that replicated just 36% of initially statistically significant results from 100 previously published studies (2).

One possible driver of the "reproducibility crisis" is publication bias at the initial stage: that is, the published literature overstates statistically significant results because those are the only kind that survive the publication process (3). Non-significant results are instead relegated to the discipline's collective "file drawer" (4). When publication decisions depend on factors beyond research quality, the emergent scientific consensus may be skewed. Encouraging replication seems to be one way to correct a biased record of published research resulting from this file drawer problem (5–7). Yet, in the current landscape, one must also consider potential publication bias at the replication stage. While this was not an issue for the OSC, since they relied on over 250 scholars to replicate the 100 studies, the reality is that individual replication studies also face a publication hurdle.

In what follows, we present a model and a survey experiment that capture the publication process for initial and replication studies. In so doing, we introduce a distinct type of publication bias, what we call "gotcha bias." This bias occurs only for replication studies, such that the likelihood of publication increases if the replication contradicts the findings of the original study. We also show that the common metric used to assess replication success – the "reproducibility" rate (i.e., the proportion of published replication results that successfully reproduce the original positive finding) – is not affected by initial-study publication bias (i.e., the file drawer) according to our model. In other words, a low reproducibility rate provides no insight into whether there is a file drawer problem. Finally, our empirical results from a survey of political scientists suggest that publication biases occur with greater frequency at the replication phase and that the gotcha bias may, in practice, exacerbate the false positive rate.

Model of Publication Decisions

We consider two distinct types of publication bias. First, we examine the well-known file drawer problem (4). File drawer bias occurs if a positive test result (i.e., a statistical hypothesis test that rejects the null hypothesis of no effect) is more likely to be published than a negative test result (i.e., a hypothesis test that does not reject the null hypothesis), ceteris paribus. In other words, the published record of research is skewed away from the true distribution of the research record, overstating the collective strength of the findings. For example, if 1 out of 10 studies showed that sending varying types of health-related text messages leads people to eat less fatty food, and only that 1 is published, the result is a mis-portrayal of the effect of text messages. There is a large theoretical and empirical literature that documents the evidence and possible consequences of this type of publication bias (8–10).

The file drawer bias reflects an entrenched culture that prioritizes statistical significance and novelty, as well as a funding system that rewards positive findings (3).

Second, we consider the possibility that the process of replication itself could induce bias. We define gotcha bias to be the phenomenon that, ceteris paribus, a positive test result is more likely to be published when there exists a prior study that tested the same hypothesis and had a negative result than when a prior study showed a positive test result (and vice versa). That is, replications are more likely to be published if they overturn extant findings. The published record of research therefore overly emphasizes replications that run counter to existing findings, as compared to the true distribution of the research record. For example, there might be 10 similar studies of text messaging effects that each purport to replicate corresponding previous studies, and they all find significant effects, but 9 of them are successful replications of the original studies that found equally significant results. If only the one study that finds a significant effect contrary to a previous non-significant study is published, the larger research community will see a distorted portrayal of the research record. The gotcha bias is a concept applicable to replication studies that attempt to reproduce results of previous, original studies using similar study designs. The hypothesized mechanism behind this bias is again a proclivity for novelty and sensationalism. Although some authors have alluded to a similar phenomenon – most notably the Proteus phenomenon, which occurs when extreme opposite results are more likely to be published (11) – we are the first (as far as we are aware) to formulate our discussion of this form of bias in the same positive/negative test result terms used to describe file drawer bias.

We employ a rather simplified model of the publication process in order to study the consequences of these two types of publication biases (Figure 1). The model starts with a null hypothesis (e.g., no treatment effect) as the true state of the world and proceeds as follows. (We also consider a parallel model for the case where the null is false, as discussed in Supplementary Materials.) First, the "original study" tests the null hypothesis with the nominal type-I error probability of $\alpha_1$, based on a simple random sample of size $N$ drawn from the target population. The result, whether (false) positive or (true) negative, goes through a peer-review process and gets published, with probability $p_1$ for a positive result and $p_0$ for a negative result. The anticipated discrepancy between $p_1$ and $p_0$, such that $p_1 > p_0$, represents what we call the file drawer bias.

Second, only the published results from the first stage are subjected to replication studies, which we assume to be designed identically to the original study but conducted on a newly collected sample from the same population. With the type-I error rate of $\alpha_2$, the result is a (false) positive. The results then go through a peer-review process similar to the first stage, except that their publication probability now depends on both test results from the current and previous stages ($q_{11}$, $q_{10}$, $q_{01}$, $q_{00}$). If $q_{11} < q_{01}$ (such that a positive replication result is more likely to be published when it contradicts a previous negative result than when it confirms an existing positive result), then we call it the gotcha bias for significant replication results. Similarly, $q_{10} > q_{00}$ would represent a gotcha bias for insignificant replication results.
We note that our model is not intended to be an accurate, descriptive portrayal of actual scientific practice. For example, not all published results will ever be replicated with fresh samples in reality, even with the current push towards credible research. It is indeed possible that researchers might strategically choose which existing studies to replicate and which replication results to submit for review, given their perception about publication biases. Instead, our goal here is to examine how the idealized model of replication science, as exemplified by the OSC study (2), would differ if we "perturbed" it by adding possible publication bias for replication studies themselves.

To study the consequences of the two types of publication biases, we specifically consider the following metrics of evidence quality in published studies.

Definition 1 [Actual False Positive Rate (AFPR) in published replication studies]

$$\tilde{\alpha}_2 = \Pr(\text{replication test significant} \mid \text{replication published, the null is true})$$

Definition 2 [Reproducibility]

$$R = \Pr(\text{replication test significant} \mid \text{original test significant and published, replication published})$$
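Both metrics can be approximated by simulating the tree in Figure 1 directly, which may help fix ideas before the analytical discussion that follows. The sketch below is a minimal illustration under assumed parameter values; the probabilities are invented to exhibit file drawer and gotcha bias and are not estimates from our survey.

```python
import numpy as np

def simulate_model(n=2_000_000, pi0=0.5, alpha1=0.05, alpha2=0.05, power=0.5,
                   p1=0.70, p0=0.45, q11=0.60, q10=0.45, q01=0.65, q00=0.40,
                   seed=1):
    """Monte Carlo version of the two-stage publication model in Figure 1.

    pi0 is the assumed proportion of true nulls; p1/p0 are publication
    probabilities for positive/negative original results; q11, q10, q01, q00
    are replication publication probabilities for the four (original,
    replication) outcome combinations, so q01 > q11 encodes gotcha bias for
    significant replication results. Returns estimates of (AFPR, reproducibility).
    """
    rng = np.random.default_rng(seed)
    null_true = rng.random(n) < pi0

    # Stage 1: original test, then a file-drawer publication decision.
    orig_sig = rng.random(n) < np.where(null_true, alpha1, power)
    orig_pub = rng.random(n) < np.where(orig_sig, p1, p0)

    # Stage 2: replication on a fresh sample (drawn for every study, but only
    # studies with a published original can enter the published record).
    repl_sig = rng.random(n) < np.where(null_true, alpha2, power)
    q = np.array([[q00, q01], [q10, q11]])  # indexed by [original, replication]
    repl_pub = orig_pub & (rng.random(n) < q[orig_sig.astype(int),
                                             repl_sig.astype(int)])

    afpr = repl_sig[repl_pub & null_true].mean()            # Definition 1
    reproducibility = repl_sig[repl_pub & orig_sig].mean()  # Definition 2
    return afpr, reproducibility

afpr, r = simulate_model()
print(f"AFPR = {afpr:.3f}, reproducibility = {r:.3f}")  # AFPR exceeds the nominal 0.05
```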
The AFPR represents the proportion of the positive results in published replication studies that are actually false, i.e., where the null hypotheses are in fact true. In the ideal world, this rate would be equal to the nominal type-I error rate that the tests in replication studies are theoretically designed to achieve. However, $\tilde{\alpha}_2$ will diverge from its designed type-I error rate due to the two kinds of publication biases we consider. It is well known that file drawer bias tends to inflate the false positive rate by disproportionately "shelving" negative results that correctly identify true null hypotheses (4). The effect of gotcha bias, however, has not been documented.

Our analysis reveals that the gotcha bias affects the AFPR in replication results in several interconnected ways. Specifically, ceteris paribus, gotcha bias for significant replication results exacerbates the inflation of the AFPR in replication results, while gotcha bias for insignificant replication results has the opposite effect of deflating the AFPR closer to the nominal FPR in replication results. Intuitively, this tends to occur because gotcha bias makes publication of false positive results in replication studies more likely when the original studies correctly accepted the same null hypotheses, but less likely when the original test also rejected the hypothesis. Moreover, in the presence of gotcha bias, we find that the file drawer bias in original studies has the effect of decreasing the AFPR in replication results. The net effect of gotcha bias on the replication-study AFPR is thus ambiguous and depends on which of these mutually countervailing effects is dominant. Note, however, that our model also implies that the AFPR can never be less than the nominal FPR of replication tests as long as the replication results themselves are also subject to non-negative file drawer bias (i.e., $q_{01} > q_{00}$ and $q_{11} > q_{10}$). Supplementary Materials contain a more precise discussion, with reference to the exact mathematical expression for $\tilde{\alpha}_2$ in terms of our model parameters.
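For reference, reading off the true-null branch of Figure 1 yields the following sketch of that expression (our restatement from the tree; the Supplementary Materials give the precise version):

$$
\tilde{\alpha}_2 = \frac{\alpha_1 p_1 \alpha_2 q_{11} + (1-\alpha_1)\, p_0\, \alpha_2 q_{01}}{\alpha_1 p_1 \left[\alpha_2 q_{11} + (1-\alpha_2) q_{10}\right] + (1-\alpha_1)\, p_0 \left[\alpha_2 q_{01} + (1-\alpha_2) q_{00}\right]}.
$$

Setting $q_{11} = q_{10}$ and $q_{01} = q_{00}$ gives $\tilde{\alpha}_2 = \alpha_2$; file drawer bias at the replication stage ($q_{11} > q_{10}$, $q_{01} > q_{00}$) pushes the rate above $\alpha_2$, while the relative sizes of $p_1/p_0$ and of $q_{01}$ versus $q_{11}$ shift weight between the two branches, producing the countervailing effects described above.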

[Fig. 1. A Model of Publication Process with Two Stages. Under a true null hypothesis ("treatment has no effect"), the original study rejects the null with probability $\alpha_1$ and fails to reject it with probability $1-\alpha_1$; positive results are published with probability $p_1$ and negative results with probability $p_0$. Published originals are then replicated: the replication rejects the null with probability $\alpha_2$, and the replication result is published with probability $q_{11}$, $q_{10}$, $q_{01}$, or $q_{00}$, depending on the original and replication outcomes.]

Reproducibility refers to the proportion of the published replication test results that successfully reproduce the positive original results. In other words, $R$ asks "How often do replication studies that are published confirm the positive results of the original published test results?" This is one of the metrics employed by the Open Science Collaboration (2); that project involved more than 250 scholars who attempted to replicate 100 psychology experiments from three highly ranked journals. They reproduced statistically significant results for 36% of initially statistically significant effects and concluded "there is room to improve reproducibility in psychology," attributing the low rate to publication bias among other factors.

In Supplementary Materials, however, we provide an exact formula for $R$ in terms of the model parameters that casts serious doubt on their interpretation of the low reproducibility. Indeed, our model implies that reproducibility should have no direct relationship with the file drawer bias in the original studies, because, ceteris paribus, an increase in the FPR due to file drawer bias should be offset by a corresponding decrease in the type-II error rate. Our analysis instead points at more important determinants of reproducibility, including the power of the original and replication studies, publication bias in the replication studies themselves, and what Ioannidis calls the "pre-study odds" of a true relationship (i.e., the proportion of false nulls in the field) (12). In particular, publication bias in replication studies can either increase or decrease $R$, depending on the relative importance of file drawer bias and gotcha bias. Moreover, regardless of the presence of publication bias, reproducibility can easily be close to 20% or even lower in low-power studies or when researchers are testing mostly true nulls.
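To see why the file drawer parameters drop out of $R$, one can write the metric directly from the tree in Figure 1 and its false-null counterpart. In the sketch below, $\pi_0$ denotes the assumed proportion of true nulls and $1-\beta_1$, $1-\beta_2$ the power of the original and replication tests (this is our restatement; the exact formula appears in the Supplementary Materials):

$$
R = \frac{\left[\pi_0\,\alpha_1\alpha_2 + (1-\pi_0)(1-\beta_1)(1-\beta_2)\right] q_{11}}{\pi_0\,\alpha_1\left[\alpha_2 q_{11} + (1-\alpha_2) q_{10}\right] + (1-\pi_0)(1-\beta_1)\left[(1-\beta_2) q_{11} + \beta_2 q_{10}\right]}.
$$

The original-study publication probability $p_1$ multiplies every path that survives the conditioning on a significant, published original result, so it cancels; neither $p_1$ nor $p_0$ appears. What remain are the replication-stage publication probabilities $q_{11}$ and $q_{10}$, the power terms, and $\pi_0$, which is why those quantities drive the simulations reported below.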
Survey Experiment

To illustrate how our simple model can shed light on the "reproducibility crisis" in a scientific discipline, we conduct a large-scale vignette survey experiment. Our primary goal in the calibration analysis here is not to provide an exact estimate of evidence quality in a scientific discipline, but to illustrate the utility of our framework with data that have reasonable empirical relevance.

Our population constituted all political science department faculty at Ph.D.-granting institutions based in the United States. The Supplementary Materials contain a description of our procedure and a demographic portrait of our respondents. While caution should be taken in generalizing to other disciplines, it is noteworthy that, as with other scientific disciplines, questions of publication bias have become central to ongoing discussions and initiatives in political science (13, 14).

Participants were sent to a link where they were provided a set of vignettes that described a paper (on the validity of using vignettes, see (15)). Each respondent was provided with 5 different vignettes, each concerning a single paper. We requested them to act as if they were an author and asked whether they would submit the paper to a journal. Each respondent then received another 5 vignettes where they were asked to play the role of a reviewer. Here, we asked whether they would recommend the paper be published. Finally, we asked whether the respondent had ever edited a journal. If they had, we gave them 5 additional vignettes. These vignettes asked the respondent whether he or she would publish the paper.

Each vignette randomly varied a host of features, many of which are not relevant to our focus here (see Supplementary Materials for the full text of the vignette with all possible variations). For our purposes, a sample vignette for the "author" condition is presented in Figure 2. After each vignette, we asked the following question as our main outcome variable: "If you were the author of this paper, what is the percent chance you would send this paper to a peer-reviewed journal?" The entire wording included some normalization and is also provided in Supplementary Materials, along with versions for the "reviewer" and "editor" conditions.

Results

Estimating Two Types of Publication Bias. We begin by asking how much evidence our data show of the two types of publication bias – file drawer bias and gotcha bias. Figure 3 presents the average percent chance of taking an action toward publication (e.g., sending out a paper as an author, recommending publication as a reviewer, and supporting publication as an editor) that the respondents gave to different types of hypothetical papers described in our randomly generated vignettes, along with 95% confidence intervals. Here, we pool the author, reviewer, and editor conditions in our analysis; the results broken down for these roles are provided in Supplementary Materials. We also combine the two conditions in which test results are described as statistically significant into a single category in our analysis.
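The quantities in Figure 3 are condition means of the 0–100 "percent chance" responses with normal-approximation 95% confidence intervals. The snippet below illustrates that computation on placeholder data; the data frame and column names (`original_result`, `result_significant`, `percent_chance`) are hypothetical stand-ins rather than our actual replication code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Placeholder responses standing in for the pooled author/reviewer/editor data.
df = pd.DataFrame({
    "original_result": rng.choice(["none", "significant", "insignificant"], size=3000),
    "result_significant": rng.choice([True, False], size=3000),
    "percent_chance": rng.uniform(0, 100, size=3000),
})

# Mean percent chance and 95% CI by vignette condition, pooling the three roles.
summary = (df.groupby(["original_result", "result_significant"])["percent_chance"]
             .agg(["mean", "sem"]))
summary["ci_lo"] = summary["mean"] - 1.96 * summary["sem"]
summary["ci_hi"] = summary["mean"] + 1.96 * summary["sem"]
print(summary.round(1))
```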

We are interested in how you, as an author, decide to submit your research to a journal. To do this, we will present you with five descriptions of papers. After each description, we will ask you some questions about it.

Suppose that you are evaluating a paper reporting the results of an empirical study in your own subfield. The study aims at testing a hypothesis with quantitative data and has the following characteristics.

• [There is no existing empirical study that tests the same hypothesis. / It is a replication of an earlier study that had reported a result that is highly significant by conventional standards (e.g., p-value of less than .01) on the test of the same hypothesis. / It is a replication of an earlier study that had reported a result that is significant by conventional standards (e.g., p-value of less than .05) on the test of the same hypothesis. / It is a replication of an earlier study that had reported a result that is not significant (e.g., p-value of greater than .75) on the test of the same hypothesis.]
• A sample size of [50/150/1000/5000].
• The test result is [highly significant by conventional standards (e.g., p-value of less than .01) / significant by conventional standards (e.g., p-value of less than .05) / not significant by conventional standards (e.g., p-value of greater than .75)].
• Seemingly sound in terms of methods and analysis.

Fig. 2. Sample Vignette from the Experiment. The phrases in square brackets separated by slashes represent alternative texts that are randomly and independently assigned for each vignette.
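The bracketed phrases in Figure 2 define the attribute levels that were randomized (among other features not shown). As a purely illustrative sketch, each description can be assembled by drawing one level per attribute independently and uniformly at random; the function below is a hypothetical stand-in for the survey software, with the levels abridged from the figure.

```python
import random

# Attribute levels abridged from the sample vignette in Fig. 2.
PRIOR_STUDY = [
    "There is no existing empirical study that tests the same hypothesis.",
    "It is a replication of an earlier study that had reported a result that is "
    "highly significant by conventional standards (e.g., p-value of less than .01).",
    "It is a replication of an earlier study that had reported a result that is "
    "significant by conventional standards (e.g., p-value of less than .05).",
    "It is a replication of an earlier study that had reported a result that is "
    "not significant (e.g., p-value of greater than .75).",
]
SAMPLE_SIZE = [50, 150, 1000, 5000]
TEST_RESULT = [
    "highly significant by conventional standards (e.g., p-value of less than .01)",
    "significant by conventional standards (e.g., p-value of less than .05)",
    "not significant by conventional standards (e.g., p-value of greater than .75)",
]

def draw_vignette():
    """Draw one hypothetical-paper description by independent randomization."""
    return {
        "prior_study": random.choice(PRIOR_STUDY),
        "sample_size": random.choice(SAMPLE_SIZE),
        "test_result": random.choice(TEST_RESULT),
        "methods": "Seemingly sound in terms of methods and analysis.",
    }

# Each respondent sees five such descriptions per role (author, reviewer, editor).
vignettes = [draw_vignette() for _ in range(5)]
```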
As expected, given extant work (9), our estimates for original studies show clear evidence of file drawer bias. These results are presented in the left panel of Figure 3. While respondents, on average, indicated a 67.1% chance of submitting, recommending, or supporting a paper with a significant test result (95% CI = [65.6%, 68.7%]), they gave only a 45.2% chance of doing the same for a paper with a non-significant finding ([43.6%, 46.9%]).

More interestingly, our result clearly suggests that replication studies are subject to the same kind of file drawer bias as the original research studies. These results are presented in the right panel of Figure 3. On average, respondents reported a 62.5% chance of moving a significant test result in a replication study toward publication ([61.3%, 63.8%]), whereas they only gave a 44.1% chance for a non-significant replication test result ([42.6%, 45.6%]). Thus, regardless of whether a replication study "succeeds" or "fails" to reproduce the original finding, that replication is more likely to be published when its result is statistically significant than when it is a null finding. It is also noteworthy that replication studies are less likely to be published than original studies: on average, respondents indicated a 2.8% lower chance of taking an action toward publishing a replication result than an original test result (t = -4.65, p < 0.001). Our findings imply that extra efforts may need to be made in encouraging publication of replication studies in general.

Turning to gotcha bias, our results again show clear evidence that this more subtle form of publication bias is fostered among political scientists, at least when asked about hypothetical publication decisions. Respondents assigned a 49.1% chance of submitting/recommending/supporting publication of an insignificant test result ([47.3%, 50.8%]) when the study fails to replicate an earlier significant test result, compared to only 39.1% when it successfully replicates a previously non-significant finding ([37.3%, 40.9%]). Likewise, respondents indicated a 63.9% chance of making a decision in favor of publishing a replication test result when that replication finds a significant effect which runs contrary to a previous insignificant test result of the same hypothesis ([62.5%, 65.4%]). This percentage drops to 61.2% ([59.6%, 62.7%]) for a significant replication result that successfully reproduces an earlier significant finding (t = -2.92, p < 0.01). Thus, for replication studies, there is an increased probability of supporting publication of surprising results in either direction – findings that overturn previously published studies are privileged in the publication process. Importantly, however, the gotcha effect only goes so far; even at the replication stage, the standard file drawer problem exerts influence. Our overall evidence indicates that the standard file drawer bias is of larger magnitude than the gotcha effect for replication studies. Thus, replication results that find statistically significant effects are more likely to move toward publication than insignificant results, no matter what the original results may be.

In sum, in a world in which people are unlikely to seek to publish null results, there is a danger of biased collective findings because only significant results find an audience – not just in the initial stage, but in replications as well. This fact is further compounded by gotcha bias, of which we find smaller but still substantial evidence. These results suggest that, for example, had the Open Science Collaboration's replications been independently submitted, more than half would not have been published.

Estimating Actual False Positive Rates. In addition to estimating the two types of publication bias, our survey experimental data allow us to make inferences about the aforementioned key metrics of evidence quality: the AFPR and reproducibility. Here, we provide our estimates of the AFPR based on our formula and the vignette data (Figure 4). Consider a published study that tries to replicate an earlier published study by testing the same hypothesis with a new sample at the 0.05 significance level. If the original study reported a significant test result for the hypothesis at the 0.01 level, the AFPR in its replication test is estimated to be 0.063 (95% CI = [0.060, 0.067]). The estimate drops slightly to 0.060 ([0.057, 0.064]) if the original study rejected the null at the 0.05 level. In contrast, a positive replication result that is significant at the 0.05 level is estimated to have an AFPR of 0.079 ([0.075, 0.083]) if the result is contrary to an originally insignificant finding.

This nearly 0.02 point increase is a direct consequence of publication bias. A 0.05-level positive replication test result is a surprise, and therefore a more publishable finding, when the original result was non-significant. Moreover, we find that a large majority of true nulls would in fact be classified as non-significant by the original study (more precisely, 92.6 percent of the time if the nominal FPR of the test was 0.05, with 95% CI = [92.2, 92.9]). Thus, according to our model of the publication process, those published significant test results that contradict existing non-significant results will contain more false positives than those studies that confirm earlier significant findings. In Supplementary Materials, we show that the same pattern emerges for positive replication tests that use the nominal FPR of 0.01. The bottom line is that our data suggest that replications are not an elixir to correct the scientific record – publication bias in replication studies leads to an AFPR that exceeds what would occur by chance.
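As a rough illustration of how numbers like the ones just cited can be produced, the snippet below plugs publication probabilities into the closed-form sketch of $\tilde{\alpha}_2$ given earlier. The inputs are the headline percentages reported above, treated as crude proxies for $p_1$, $p_0$, and the $q$'s; this is an assumption made only for illustration, since our actual estimates come from the full vignette-level analysis described in the Supplementary Materials.

```python
def afpr(alpha1, alpha2, p1, p0, q11, q10, q01, q00):
    """Actual false positive rate among published replications of a true null."""
    num = alpha1 * p1 * alpha2 * q11 + (1 - alpha1) * p0 * alpha2 * q01
    den = (alpha1 * p1 * (alpha2 * q11 + (1 - alpha2) * q10)
           + (1 - alpha1) * p0 * (alpha2 * q01 + (1 - alpha2) * q00))
    return num / den

# Headline survey percentages used as rough plug-ins: originals 67.1% vs 45.2%;
# significant replications 61.2% (confirming) vs 63.9% (contradicting);
# insignificant replications 49.1% (contradicting) vs 39.1% (confirming).
print(round(afpr(0.05, 0.05, p1=0.671, p0=0.452,
                 q11=0.612, q10=0.491, q01=0.639, q00=0.391), 3))  # ~0.078, above 0.05
```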

[Fig. 3. Evidence of Two Types of Publication Bias. Two panels (Original Study, Replication Study) plot the percent of respondents who would take an action in favor of publication for insignificant versus significant test results, with replication results further broken down by whether the original finding was insignificant or significant (y-axis: % Would Take Action in Favor of Publication).]

[Fig. 4. Estimates of AFPR for Published Replication Study Results. The panel compares the estimated actual FPR for replications of an originally insignificant result and of results originally significant at the 0.01 and 0.05 levels against the nominal FPR of 0.05.]

Estimating Reproducibility. Finally, we look at another important metric of evidence quality: the reproducibility of original positive results in published replication studies. In addition to publication bias, the key parameters that determine reproducibility are power and the proportion of true null hypotheses that are tested in a given scientific field. We therefore first simulate reproducibility under different scenarios with respect to those two key parameters, in the assumed absence of publication bias. These simulated theoretical values of reproducibility are plotted as dashed lines in Figure 5.

In Figure 5, for a given combination of nominal type-I error rates in the original and replication studies, we provide four sets of reproducibility simulations, each corresponding to a specific sample size (50, 150, 1,000, and 5,000) and an implied power value that was also used in our vignette experiment. As mentioned above, reproducibility varies widely depending on these parameters. When hypotheses are tested with a small sample size (such as N = 50), these tests have low statistical power. Reproducibility therefore remains low even when researchers are all testing for effects that are true. This result occurs because a large majority of replication studies with such low power will fail to detect those effects. In contrast, high-powered replication studies can reproduce original positive results with high probability even when the pre-study odds of true effects are rather low, because such studies are unlikely to mis-classify those few true effects as insignificant. Of course, reproducibility eventually converges to the nominal type-I error of the replication test as the proportion of true nulls approaches one, at which point the tests are merely "replicating" the wrong results at their designed false positive rate.

What happens to reproducibility when we incorporate our estimated levels of publication bias in its calculation? Here, we again use our vignette survey data to produce such estimates corresponding to each of the simulated scenarios (solid lines in Figure 5), along with their 95% confidence bands (shaded regions). Somewhat counterintuitively, we find that the publication bias exhibited in our experiment would improve reproducibility by statistically significant margins across all possible values of statistical power and the pre-study odds of true effects. This result stems from the predominance of file drawer bias that we find even in replication studies. That is, because positive results are published more often than negative results, "successful" reproductions of original positive results are overrepresented in published replication studies compared to negative reproduction results. The gotcha bias does counterbalance this tendency to some extent, but because this bias is smaller than the file drawer bias, the net effect is to increase reproducibility.

Thus, our finding suggests that publication bias in replication studies might actually have a positive impact on the standard metric of reproducibility, assuming that, as in our vignette experiment, file drawer bias dominates gotcha bias in replication studies. This result, however, should not be taken as a recommendation to encourage publication bias in replication studies. For one thing, we also find that reproducibility is more strongly determined by other factors, such as the power of the studies and the pre-study odds of true effects in a given discipline. More fundamentally, reproducibility is not a direct indicator of whether study results represent true effects or not, but a metric of regularity in finding positive test results, whether or not they are actually indicative of the true state of the world.
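The logic behind the dashed and solid lines in Figure 5 can be sketched as follows. Power here is computed for a two-sided z-test at one assumed standardized effect size (an illustrative choice; the paper's simulations use the designs from the vignette experiment), and the "with bias" curves reuse the rough plug-in publication probabilities from above.

```python
import numpy as np
from scipy.stats import norm

def power(n, effect=0.1, alpha=0.05):
    """Approximate power of a two-sided z-test at an assumed standardized effect."""
    z = norm.ppf(1 - alpha / 2)
    return 1 - norm.cdf(z - effect * np.sqrt(n)) + norm.cdf(-z - effect * np.sqrt(n))

def reproducibility(pi0, pow1, pow2, a1=0.05, a2=0.05, q11=1.0, q10=1.0):
    """Pr(replication significant | original significant & published, replication published)."""
    num = (pi0 * a1 * a2 + (1 - pi0) * pow1 * pow2) * q11
    den = (pi0 * a1 * (a2 * q11 + (1 - a2) * q10)
           + (1 - pi0) * pow1 * (pow2 * q11 + (1 - pow2) * q10))
    return num / den

pi0 = np.linspace(0.0, 1.0, 5)          # assumed proportion of true nulls
for n in (50, 150, 1000, 5000):
    pw = power(n)
    no_bias = reproducibility(pi0, pw, pw)                          # dashed lines
    with_bias = reproducibility(pi0, pw, pw, q11=0.612, q10=0.491)  # solid lines
    print(n, np.round(no_bias, 2), np.round(with_bias, 2))
```

With the defaults ($q_{11} = q_{10}$), reproducibility converges to the nominal 0.05 as the proportion of true nulls approaches one, mirroring the convergence described in the text.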

[Fig. 5. Estimates of Reproducibility as a Function of Power and "Pre-Study Odds." Panels cross the replication test's significance level (0.05 or 0.01) with the original study's significance level (0.01 or 0.05). The x-axis is the assumed proportion of true nulls and the y-axis is the estimated reproducibility of positive results; lines show estimates with publication bias (estimated) and without publication bias (theoretical) for N = 50, 150, 1,000, and 5,000.]

Conclusion

The "reproducibility crisis" in science has generated substantial discussion and a number of efforts to encourage wide-scale replications. Indeed, one can only assess and ultimately address such a crisis if replications occur and become part of the scientific literature. This process of learning is not as straightforward as often thought. Our model isolates how an idyllic replication process works, showing two distinct types of biases that can skew the published literature. Moreover, the model shows that reproducibility is not contingent on publication bias in initial studies but rather is affected by power and bias in replication publication, inter alia.

Our survey experiment results show that even in the midst of widespread discussions about the importance of replication and publication bias, scholars still exhibit these biases. Moreover, changing incentives and behaviors is not easy: researchers, like everyone else, exhibit confirmation biases that lead them to privilege the preconception that statistical significance is critical (16). We find these preconceptions carry over to their evaluation of replication studies. Moreover, incentivizing journals to be "more encouraging" of replications (7) could perhaps backfire, since the gotcha bias might lead to a mis-portrayal of accumulated knowledge and possibly mis-incentivize researchers conducting replications. Put simply, there are no easy solutions.

Addressing publication bias with respect to replication studies will likely require broader institutional change, such as a collective commitment to pre-registration, a shift to open access journals and/or required publication, or blind review (3, 17). Each of these reforms involves considerable resource investments and has downsides, such as potentially prioritizing certain types of research (18, 19) and/or resulting in an overwhelming amount of information to assess (although see (20)). There are more modest approaches, including having journals publish brief "replication" sections and incentivizing citations to replications (21). These latter ideas could help attenuate the replication publication bias, and we believe they are worth exploring on a larger scale.

We also urge scholars to take a step back in assessing the "reproducibility crisis." While we do not question extant evidence from well-known studies and the meta-analytic literature which shows replication inconsistencies (10, 11), we also note that other large-scale replication attempts in political science (22) and economics (23) were relatively more successful than the Open Science Collaboration results. The question, then, is what research areas, theories, methods, and contexts affect the likelihood of replication, and this understanding, in turn, would facilitate the assessment of replications.

ACKNOWLEDGMENTS. We thank James Dunham, Jacob Rothschild, Robert Pressel, Chris Peng, and Shiyao Liu for research assistance. We are grateful to Melissa Sands and the participants at the 2017 Conference of the Society for Political Methodology for useful comments and suggestions.

1. Baker M (2016) Is there a reproducibility crisis? A survey lifts the lid on how researchers view the crisis rocking science and what they think will help. Nature 533(7604):452–455.
2. Open Science Collaboration (2015) Estimating the reproducibility of psychological science. Science 349(6251):aac4716.
3. Brown AW, Mehta TS, Allison DB (2017) Publication bias in science: What is it, why is it problematic, and how can it be addressed? The Oxford Handbook of the Science of Science Communication p. 93.
4. Rosenthal R (1979) The file drawer problem and tolerance for null results. Psychological Bulletin 86(3):638.
5. Klein SB (2014) What can recent replication failures tell us about the theoretical commitments of psychology? Theory & Psychology 24(3):326–338.
6. Bohannon J (2015) Many psychology papers fail replication test. Science 349(6251):910–911.
7. Nosek BA, et al. (2015) Promoting an open research culture. Science 348(6242):1422–1425.
8. Gerber AS, Malhotra N (2008) Publication bias in empirical sociological research: Do arbitrary significance levels distort published results? Sociological Methods & Research 37(1):3–30.
9. Franco A, Malhotra N, Simonovits G (2014) Publication bias in the social sciences: Unlocking the file drawer. Science 345(6203):1502–1505.
10. Fanelli D, Costas R, Ioannidis JP (2017) Meta-assessment of bias in science. Proceedings of the National Academy of Sciences p. 201618569.
11. Ioannidis JP, Trikalinos TA (2005) Early extreme contradictory estimates may appear in published research: The Proteus phenomenon in molecular genetics research and randomized trials. Journal of Clinical Epidemiology 58(6):543–549.
12. Ioannidis JPA (2005) Why most published research findings are false. PLoS Medicine 2(8):696–701.
13. Lupia A, Elman C (2014) Openness in political science: Data access and research transparency. PS: Political Science & Politics 47(1):19–42.
14. Monogan JE (2015) Research preregistration in political science: The case, counterarguments, and a response to critiques. PS: Political Science & Politics 48(3):425–429.
15. Hainmueller J, Hangartner D, Yamamoto T (2015) Validating vignette and conjoint survey experiments against real-world behavior. Proceedings of the National Academy of Sciences 112(8):2395–2400.
16. Bollen K, Cacioppo J, Kaplan R, Krosnick J, Olds J (2015) Social, behavioral, and economic sciences perspectives on robust and reliable science: Report of the Subcommittee on Replicability in Science, Advisory Committee to the National Science Foundation Directorate for Social, Behavioral, and Economic Sciences. Technical report. Retrieved at: www.nsf.gov/sbe/AC_Materials/SBE_Robust_and_Reliable_Research_Report.pdf.
17. Nosek BA, Ebersole CR, DeHaven A, Mellor D (2017) The preregistration revolution.
18. Coffman LC, Niederle M (2015) Pre-analysis plans have limited upside, especially where replications are feasible. The Journal of Economic Perspectives 29(3):81–97.
19. Freese J, Peterson D (2017) Replication in social science. Annual Review of Sociology (0).
20. de Winter J, Happee R (2013) Why selective publication of statistically significant results can be effective. PLoS ONE 8(6):e66463.
21. Coffman LC, Niederle M, Wilson AJ, et al. (2017) A proposal to organize and promote replications. American Economic Review 107(5):41–45.
22. Mullinix KJ, Leeper TJ, Druckman JN, Freese J (2015) The generalizability of survey experiments. Journal of Experimental Political Science 2(2):109–138.
23. Camerer CF, et al. (2016) Evaluating replicability of laboratory experiments in economics. Science 351(6280):1433–1436.
