review articles

DOI:10.1145/3360311

Threats of a Replication Crisis in Empirical Computer Science

BY ANDY COCKBURN, PIERRE DRAGICEVIC, LONNI BESANÇON, AND CARL GUTWIN

Replication only works if there is confidence built into the results.

"If we do not live up to the traditional standards of science, there will come a time when no one takes us seriously."
—Peter J. Denning, 1980.13

key insights
˽ Many areas of computer science research (performance analysis, software engineering, AI, and human-computer interaction) validate research claims by using statistical significance as the standard of evidence.
˽ A loss of confidence in statistically significant findings is plaguing other empirical disciplines, yet there has been relatively little debate of this issue and its associated 'replication crisis' in CS.
˽ We review factors that have contributed to the crisis in other disciplines, with a focus on problems stemming from an over-reliance on—and misuse of—null hypothesis significance testing.
˽ Our analysis of papers published in a cross section of CS journals suggests a large proportion of CS research faces the same threats to replication as those encountered in other areas.

FORTY YEARS AGO, Denning argued that computer science research could be strengthened by increased adoption of the scientific experimental method. Through the intervening decades, Denning's call has been answered. Few computer science graduate students would now complete their studies without some introduction to experimental hypothesis testing, and computer science research papers routinely use p-values to formally assess the evidential strength of experimental results. Our analysis of the 10 most-downloaded articles from 41 ACM Transactions journals showed that statistical significance was used as an evidentiary criterion in 61 articles (15%) across 21 different journals (51%), and in varied domains: from the evaluation of classification algorithms, to comparing the performance of cloud computing platforms, to assessing a new video-delivery technique in terms of quality of experience.

While computer science research has increased its use of experimental methods, the scientific community's faith in these methods has been eroded in several areas, leading to a 'replication crisis'27,32 in which experimental results cannot be reproduced and published findings are mistrusted. Consequently, many disciplines have taken steps to understand and try to address these problems. In particular, misuse of statistical significance as the standard of evidence for experimental success has been identified as a key contributor in the replication crisis. But there has been relatively little debate within computer science about this problem or how to address it. If computer science fails to adapt while others move on to new standards then Denning's concern will return—other disciplines will stop taking us seriously.

Beyond issues of statistical significance, computer science research raises some distinct challenges and opportunities for experimental replication. Computer science research often relies on complex artifacts such as source code and datasets, and with appropriate packaging, replication of some computer experiments can be substantially automated. The replicability problems associated with access to research artifacts have been broadly discussed in computer systems research (for example, Krishnamurthi25 and Collberg9), and the ACM now awards badges to recognize work that is repeatable (the original team of researchers can reliably produce the same result using the same experimental setup), replicable (a different team can produce the same result using the original setup), and reproducible (a different team can produce the same result using a different experimental setup).5 However, these definitions are primarily directed at experiments that analyze the results of computations (such as new computer algorithms, systems, or methods), and uptake of the badges has been slow in fields involving experiments with human participants. Furthermore, the main issues contributing to the replication crisis in other experimental disciplines do not stem from access to artifacts; rather, they largely stem from a misuse of evidentiary criteria used to determine whether an experiment was successful or not.

Here, we review the extent and causes of the replication crisis in other areas of science, with a focus on issues relating to the use of null hypothesis significance testing (NHST) as an evidentiary criterion. We then report on our analysis of a cross section of computer science publications to identify how common NHST is in our discipline. Later, we review potential solutions, dealing first with alternative ways to analyze data and present evidence for hypothesized effects, and second arguing for improved openness and transparency in experimental research.

The Replication Crisis in Other Areas of Science


In assessing the scale of the crisis in their discipline, cancer researchers attempted to reproduce the findings of landmark research papers, finding they could not do so in 47 of 53 cases,3 and researchers similarly failed to replicate 39 out of 100 studies.31 Results of a recent Nature survey of more than 1,500 researchers found that 90% agree there is a crisis, that more than 70% had tried and failed to reproduce another scientist's experiments, and that more than half had failed to replicate their own findings.2

Some Terminology
˲ Publication bias: Papers supporting their hypotheses are accepted for publication at a much higher rate than those that do not.
˲ File drawer effect: Null findings tend to be unpublished and therefore hidden from the scientific community.
˲ p-hacking: Manipulation of experimental and analysis methods to produce statistically significant results. Used as a collective term in this paper for a variety of undesirable research practices.
˲ p-fishing: Seeking statistically significant effects beyond the original hypothesis.
˲ HARKing: Hypothesising After the Results are Known: post-hoc reframing of experimental intentions to present a p-fished outcome as having been predicted from the start.

Experimental process. A scientist's typical process for experimental work is summarized along the top row of Figure 1, with areas of concern and potential solutions shown in the lower rows. In this process, initial ideas and beliefs (item 1) are refined through formative explorations (2), leading to the development of specific hypotheses and associated predictions (3). An experiment is designed and conducted (4, 5) to test the hypotheses, and the resultant data is analyzed and compared with the predictions (6). Finally, results are interpreted (7), possibly leading to adjustment of ideas and beliefs.

A critical part of this process concerns the evidentiary criteria used for determining whether experimental results (at 6) conform with hypotheses (at 3). Null hypothesis significance testing (NHST) is one of the main methods for providing this evidence. When using NHST, a p-value is calculated that represents the probability of encountering data at least as extreme as the observed data if a null hypothesis of no effect were true. If that probability is lower than a threshold value (the α level, normally .05, representing the Type I error rate of false positives) then the null hypothesis is deemed untenable and the resultant finding is labelled 'statistically significant.' When the p-value exceeds the α level, results interpretation is not straightforward—perhaps there is no effect, or perhaps the experiment lacked sufficient power to expose a real effect (a Type II error or false negative, where β represents the probability of this type of error).
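To make the mechanics concrete, the following short sketch (our illustration, not part of the article) runs the NHST procedure on two synthetic samples of task times using SciPy's independent-samples t-test and then applies the conventional dichotomous decision at α = .05; the group sizes, means, and random seed are assumptions chosen purely for illustration.

# Illustrative NHST sketch on synthetic data (not from the article).
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
alpha = 0.05

# Hypothetical task-time measurements (seconds) for two interface conditions.
group_a = rng.normal(loc=10.0, scale=2.0, size=20)
group_b = rng.normal(loc=11.5, scale=2.0, size=20)

# p is the probability of data at least this extreme if the null (no effect) were true.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

if p_value < alpha:
    print("Null hypothesis rejected: finding labelled 'statistically significant'.")
else:
    # Interpretation is not straightforward: there may be no effect, or the study
    # may lack the power to expose a real effect (a Type II error, probability beta).
    print("Null hypothesis not rejected.")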
Publication bias. In theory, rejection of the null hypothesis should elevate confidence that observed effects are real and repeatable. But concerns about the dichotomous interpretation of NHST as 'significant' or not have been raised for almost 60 years. Many of these concerns stem from a troublesome publication bias in which papers that reject the null hypothesis are accepted for publication at a much higher rate than those that do not. Demonstrating this effect, Sterling41 analyzed 362 papers published in major psychology journals between 1955 and 1956, noting that 97.3% of papers that used NHST rejected the null hypothesis.

The high publication rates for papers that reject the null hypothesis contribute to a file drawer effect35 in which papers that fail to reject the null go unpublished because they are not written up, written up but not submitted, or submitted and rejected.16 Publication bias and the file drawer effect combine to propagate the dissemination and maintenance of false knowledge: through the file drawer effect, correct findings of no effect are unpublished and hidden from view; and through publication bias, a single incorrect chance finding (a 1:20 chance at α = .05, if the null hypothesis is true) can be published and become part of a discipline's wrong knowledge.

Figure 1. Stages of a typical experimental process (top, adapted from Gundersen18), prevalent concerns at each stage (middle), and potential solutions (bottom).

Scientific Process: 1. Ideas and beliefs → 2. Exploratory studies and iteration → 3. Hypotheses and predictions → 4. Experimental design → 5. Conduct experiment → 6. Data analysis and hypothesis testing → 7. Interpretation and publication (with 'Compare' linking data analysis back to the predictions, and 'Adjust' linking interpretation back to ideas and beliefs).

Areas for Concern: a. Publication bias influences project choice; b. Publication bias influences exploration; c. HARKing; d. Statistical power and confidence thresholds; e. Quality control, mid-experiment adjustments; f. HARKing, p-hacking, and p-fishing; g. File drawer effect.

Potential Solutions: Registered reports; preregistration; improved evidentiary criteria; data repositories.


Ideally, scientists are objective and dispassionate throughout their investigations, but knowledge of the publication bias strongly opposes these ideals. Publication success shapes careers, so researchers need their experiments to succeed (rejecting the null in order to get published), creating many areas of concern (middle row of Figure 1), as follows.

Publication bias negatively influences project selection. There are risks that the direction of entire disciplines can be negatively affected by publication bias (Figure 1a and g). Consider a young faculty member or graduate student who has a choice between two research projects: one that is mundane, but likely to satisfy a perceived publication criterion of p < .05; the other is exciting but risky in that results cannot be anticipated and may end up in a file drawer. Publication bias is likely to draw researchers towards safer topics in which outcomes are more certain, potentially stifling researchers' interest in risky questions.

Publication bias also disincentivizes replication, which is a critical element of scientific validation. Researchers' low motivation to conduct replications is easy to understand—a successful replication is likely to be rejected because it merely confirms what is already 'known,' while a failure to replicate is likely to be rejected for failing to satisfy the p < .05 publication criterion.

Publication bias disincentivizes exploratory research. Exploratory studies and iteration play an important role in the scientific process (Figure 1b). This is particularly true in areas of computer science, such as human-computer interaction, where there may be a range of alternative solutions to a problem. Initial testing can quickly establish viability and provide directions for iterative refinement. Insights from explorations can be valuable for the research community, but if reviewers have been trained to expect standards of statistical evidence that only apply to confirmatory studies (such as the ubiquitous p-value) then publishing insights from exploratory studies and exploratory data analyses may be difficult. In addition, scientists' foreknowledge that exploratory studies may suffer from these problems can deter them from carrying out the exploratory step.

Figure 2. HARKing (Hypothesizing After the Results are Known) is an instance of the Texas sharpshooter fallacy. Illustration by Dirk-Jan Hoek, CC-BY.

Publication bias encourages HARKing. Publication bias encourages researchers to explore hypotheses that are different to those that they originally set out to test (Figure 1c and f). This practice is called 'HARKing,'23 which stands for Hypothesizing After the Results are Known, also known as 'outcome switching'. Diligent researchers will typically record a wide set of experimental data beyond that required to test their intended hypotheses—this is good practice, as doing so may help interpret and explain experimental observations. However, publication bias creates strong incentives for scientists to ensure that their experiments produce statistically significant results. Consciously or subconsciously, they may steer their studies to ensure that experimental data satisfies p < .05. If the researcher's initial hypothesis fails (concerning task time, say) but some other data satisfies p < .05 (error rate, for example), then authors may be tempted to reframe the study around the data that will increase the paper's chance of acceptance, presenting the paper as having predicted that outcome from the start. This reporting practice, which is an instance of the so-called "Texas sharpshooter fallacy" (see Figure 2), essentially invalidates the NHST procedure due to inflated Type I error rates. For example, if a researcher collects 15 dependent variables and only reports statistically significant ones, and if we assume that in reality the experimental manipulation has no effect on any of the variables, then the probability of a Type I error is 54% instead of the advertised 5%.19
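The 54% figure can be checked directly: if the 15 dependent variables are assumed independent and each is tested at α = .05 under a true null, the chance of at least one false positive is 1 − (1 − .05)^15 ≈ .54. The sketch below (ours, with the independence simplification as an assumption) reproduces the calculation and confirms it with a small simulation.

# Familywise Type I error when 15 (assumed independent) measures are each tested
# at alpha = .05 under a true null; illustrative check of the 54% figure above.
import numpy as np

alpha, n_dvs = 0.05, 15
analytic = 1 - (1 - alpha) ** n_dvs
print(f"Analytic chance of at least one false positive: {analytic:.0%}")   # ~54%

# Monte Carlo confirmation: under the null, each p-value is uniform on (0, 1).
rng = np.random.default_rng(seed=2)
p_values = rng.uniform(size=(100_000, n_dvs))        # 100,000 simulated experiments
simulated = np.mean((p_values < alpha).any(axis=1))
print(f"Simulated chance of at least one false positive: {simulated:.0%}")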


While many scientists might agree that other scientists are susceptible to questionable reporting practices such as HARKing, evidence suggests these practices are troublesomely widespread.20,21 For example, over 63% of respondents to a survey of 2,000 psychology researchers admitted failing to report all dependent measures, which is often associated with the selective reporting of favorable findings.20

Even without any intention to misrepresent data, scientists are susceptible to cognitive biases that may promote misrepresentations: for example, apophenia is the tendency to see patterns in data where none exists, and it has been raised as a particular concern for big-data analyses;6 confirmation bias is the tendency to favor evidence that aligns with prior beliefs or hypotheses;30 and hindsight bias is the tendency to see an outcome as having been predictable from the start,36 which may falsely assuage researchers' concerns when reframing their study around a hypothesis that differs from the original.

Publication bias encourages mid-experiment adjustments. In addition to the modification of hypotheses, other aspects of an experiment may be modified during its execution (Figure 1e), and the modifications may go unreported in the final paper. For example, the number of samples in the study may be increased mid-experiment in response to a failure to obtain statistical significance (56% of psychologists self-admitted to this questionable practice20). This, again, inflates Type I error rates, which impairs the validity of NHST.

Publication bias encourages questionable data analysis practices. Dichotomous interpretation of NHST can also lead to problems in analysis: once experimental data has been collected, researchers may be tempted to explore a variety of post-hoc data analyses to make their findings look stronger or to reach statistical significance (Figure 1f). For example, they might consciously or unconsciously apply techniques such as excluding certain data points (for example, removing outliers, excluding participants, or narrowing the set of conditions under test), applying various transformations to the data, or applying statistical tests only to particular data subsets. While such analyses can be entirely appropriate if planned and reported in full, engaging in a data 'fishing' exercise to satisfy p < .05 is not, especially if the results are then selectively reported. Flexible data analysis and selective reporting can dramatically increase Type I error rates, and these are major culprits in the replication crisis.38

Is Computer Science Research at Risk? (Spoiler: Yes)
Given that much of computer science research either does not involve experiments, or involves deterministic or large-sample computational experiments that are reproducible as long as data and code are made accessible, one could argue that the field is largely immune to replication issues that have plagued other empirical disciplines. To find out whether this argument is tenable, we analyzed the ten most downloaded articles for 41 ACM journals beginning with the name 'Transactions on.'

Figure 3. Count of articles from among the ‘10 most downloaded’ (5/24/19) that use dichotomous interpretations of p from among ACM journals titled ‘Transactions on...’

(Bar chart: one bar per ACM 'Transactions on ...' journal; the vertical axis, "Count of papers using dichotomous p," ranges from 0 to 10 and shows, for each of the 41 journals, how many of its 10 most-downloaded articles use a dichotomous interpretation of p.)

We inspected all 410 articles to determine whether or not they used p < α (with α normally 0.05) as a criterion for establishing evidence of a difference between conditions. The presence of p-values is an indication of statistical uncertainty, and therefore of the use of nondeterministic small-sample experiments (for example, involving human subjects). Furthermore, as we have previously discussed, the use of a dichotomous interpretation of p-values as 'significant' or 'not significant' is thought to promote publication bias and questionable data analysis practices, both of which heavily contributed to the replication crisis in other disciplines.

A total of 61 of the 410 computer science articles (15%) included at least one dichotomous interpretation of a p-value.a All but two of the papers that used dichotomous interpretations (97%) identified at least one finding as satisfying the p < .05 criterion, suggesting that publication bias (long observed in other disciplines41) is likely to also exist in empirical computer science. Furthermore, 21 different journals (51%) included at least one article using a dichotomous interpretation of p within the set of 10 papers inspected. The count of articles across journals is summarized in Figure 3, with fields such as applied perception, education, software engineering, information systems, bioinformatics, performance modeling, and security all showing positive counts.

a Data for this analysis is available at osf.io/hkqyt/, including a quote extracted from each counted paper showing its use of a dichotomous interpretation.

Our survey showed four main ways in which experimental techniques are used in computer science research, spanning work in graphics, software engineering, artificial intelligence, and performance analysis, as well as the expected use in human-computer interaction. First, empirical methods are used to assess the quality of an artifact produced by a technique, using humans as judges (for example, the photorealism of an image or the quality of streaming video). Second, empirical methods are used to evaluate classification or prediction algorithms on real-world data (for example, a power scheduler for electric vehicles, using real data from smart meters). Third, they are used to carry out performance analysis of hardware or software, using actual data from running systems (for example, a comparison of real cloud computing platforms). Fourth, they are used to assess human performance with interfaces or interaction techniques (for example, which of two menu designs is faster).

Given the high proportion of computer science journals that accept papers using dichotomous interpretations of p, it seems unreasonable to believe that computer science research is immune to the problems that have contributed to a replication crisis in other disciplines. Next, we review proposals from other disciplines on how to ease the replication crisis, focusing first on changes to the way in which experimental data is analyzed, and second on proposals for improving openness and transparency.

Proposals for Easing the Crisis: Better Data Analysis
Redefine statistical significance. Many researchers attribute some of the replication crisis to the dominant use of NHST. Among the noted problems with NHST is the ease with which experiments can produce false-positive findings, even without scientists contributing to the problem through questionable research practices. To address this problem, a group of 75 senior scientists from diverse fields (including computer science) proposed that the accepted norm for determining 'significance' in NHST tests be reduced from α = .05 to α = .005.4 Their proposal was based on two analyses—the relationship between Bayes factors and p-values, and the influence of statistical power on false positive rates—both of which indicated disturbingly high false positive rates at α = .05. The authors also recommended the word 'suggestive' be used to describe results in the range .005 ≤ p < .05.
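Mechanically, the proposed relabelling is simple; the sketch below (ours, not from Benjamin et al.4) shows the three-way description it implies.

# Sketch of the labelling implied by the proposed alpha = .005 threshold, with the
# band .005 <= p < .05 described as 'suggestive' (illustrative only).
def describe(p: float) -> str:
    if p < 0.005:
        return "statistically significant (proposed alpha = .005)"
    if p < 0.05:
        return "suggestive"
    return "not statistically significant"

for p in (0.001, 0.02, 0.20):
    print(f"p = {p}: {describe(p)}")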
Despite the impressive list of authors, this proposal attracted heavy criticism (see Perezgonzalez33 for a review). Some have argued the reasoning behind the .005 threshold is flawed, and that adopting it could actually make the replication crisis worse (by causing a drop in the statistical power of studies without reducing incentives for p-hacking, and by diverting resources away from replications). Another argument is that the threshold value remains arbitrary, and that focusing instead on effect sizes and their interval estimates (confidence intervals or credible intervals) can better characterize results. There is also a pragmatic problem that until publication venues firmly announce their standards, authors will be free to choose terminology ('statistically significant' at p < .05 or 'statistically significant' at p < .005) and reviewers/readers may differ in their expectations. Furthermore, the proposal does nothing to discourage or prevent problems associated with inappropriate modification of experimental methods and objectives after they begin.

Abandon statistical significance. Many researchers argue the replication crisis does not stem from the choice of the .05 cutoff, but from the general idea of using an arbitrary cutoff to classify results, in a dichotomous manner, as statistically significant or not. Some of these researchers have called for reporting exact p-values and abandoning the use of statistical significance thresholds.1 Recently, a comment published in Nature with more than 800 signatories called for abandoning binary statistical significance.28 Cumming12 argued for the banning of p-values altogether and recommended the use of estimation statistics where strength of evidence is assessed in a non-dichotomous manner, by examining confidence intervals. Similar recommendations have been made in computer science.14 The editorial board of the Basic and Applied Social Psychology journal went further by announcing it would not publish papers containing any statistics that could be used to derive dichotomous interpretations, including p-values and confidence intervals.42 Overall there is no consensus on what should replace NHST, but many methodologists are in favor of banning dichotomous statistical significance language.

Despite the forceful language opposing NHST (for example, "very few defenses of NHST have been attempted"12), some researchers believe NHST and the notion of dichotomous hypothesis testing still have their place.4 Others have suggested the calls to abandon NHST are a red herring in the replicability crisis,37 not least due to the lack of evidence that doing so will aid replicability.


Adopt Bayesian statistics. Several researchers propose replacing NHST with Bayesian statistical methods. One of the key motivators for doing so concerns a common misunderstanding of the p-value in NHST. Researchers wish to understand the probability that the null hypothesis is true given the observed data, P(H0|D), and p is often misunderstood to represent this value. However, the p-value actually represents the probability of observing data at least as extreme as the sample if the null hypothesis were true: P(D|H0). In contrast to NHST, Bayesian statistics can enable the desired computation of P(H0|D).

Bayesian statistics are perfectly suited for doing estimation statistics, and have several advantages over confidence intervals.22,26 Nevertheless, they can also be used to carry out dichotomous tests, possibly leading to the same issues as NHST. Furthermore, Bayesian analysis is not immune to the problems of p-hacking—researchers can still 'b-hack' to manipulate experimental evidence.37,39 In particular, the choice of priors adds an important additional experimenter degree of freedom in Bayesian analysis.39
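The distinction between the two conditional probabilities is easy to see in a toy example. The sketch below (ours, not from the article) assumes a coin whose null hypothesis is fairness (θ = 0.5), a single alternative of θ = 0.7, equal prior probabilities, and an observation of 8 heads in 10 flips; the likelihood P(D|H0) and the posterior P(H0|D) come out as very different numbers, and the posterior depends on the chosen prior.

# Toy illustration (not from the article) of P(D|H0) versus P(H0|D).
# Assumptions: H0 is a fair coin (theta = 0.5), one alternative H1 has theta = 0.7,
# both hypotheses get equal prior probability, and we observe 8 heads in 10 flips.
from scipy.stats import binom

heads, flips = 8, 10
prior_h0, prior_h1 = 0.5, 0.5

likelihood_h0 = binom.pmf(heads, flips, 0.5)   # P(D|H0), about 0.044
likelihood_h1 = binom.pmf(heads, flips, 0.7)   # P(D|H1), about 0.233

# Bayes' rule: P(H0|D) = P(D|H0) P(H0) / P(D)
evidence = likelihood_h0 * prior_h0 + likelihood_h1 * prior_h1
posterior_h0 = likelihood_h0 * prior_h0 / evidence

print(f"P(D|H0) = {likelihood_h0:.3f}")
print(f"P(H0|D) = {posterior_h0:.3f}")         # about 0.158: a different quantity
# Changing the priors changes the posterior -- the 'experimenter degree of freedom'
# mentioned above.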

Help the reader form their own conclusion. Given the contention over the relative merits of different statistical methods and thresholds, researchers have proposed that when reporting results, authors should focus on assisting the reader in reaching their own conclusions by describing the data and the evidence as clearly as possible. This can be achieved through the use of carefully crafted charts that focus on effect sizes and their interval estimates, and the use of cautionary language in the author's interpretations and conclusions.11,14

While improved explanation and characterization of underlying experimental data is naturally desirable, authors are likely to encounter problems if relying only on the persuasiveness of their data. First, the impact of using more cautious language on the persuasiveness of arguments when compared to categorical arguments is still uncertain.15 Second, many reviewers of empirical papers are familiar and comfortable with NHST procedures and its associated styles of results reporting, and they may criticize its absence; in particular, reviewers may suspect that the absence of reported dichotomous outcomes is a consequence of their failure to attain p < .05. Both of these concerns suggest that a paper's acceptance prospects could be harmed if lacking simple and clear statements of results outcome, such as those provided by NHST, despite the simplistic and often misleading nature of such dichotomous statements.

Quantify p-hacking in published work. None of the proposals discussed here address problems connected with researchers consciously or subconsciously revising experimental methods, objectives, and analyses after their study has begun. Statistical analysis methods exist that allow researchers to assess whether a set of already published studies is likely to have involved such practices. A common method is based on the p-curve, which is the distribution of statistically significant p-values in a set of studies.40 Studies of true effects should produce a right-skewed p-curve, with more low statistically significant p-values (for example, .01s) than high values (for example, .04s); but a set of p-hacked studies is likely to show a left-skewed p-curve, indicative of selecting variables that tipped analyses into statistical significance.

While use of p-curves appears promising, it has several limitations. First, it requires a set of study results to establish a meaningful curve, and its use as a diagnostic tool for evidence of p-hacking in any single article is discouraged. Second, its usefulness for testing the veracity of any particular finding in a field depends on the availability of a series of related or replicated studies; but replications in computer science are rare. Third, statisticians have questioned the effectiveness of p-curves for detecting questionable research practices, demonstrating through simulations that p-curve methods cannot reliably distinguish between p-hacking of null effects and studies of true effects that suffer experimental omissions such as unknown confounds.7
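To illustrate why p-curves of true effects are right-skewed, the following minimal simulation (our sketch, not the cited authors' diagnostic procedure) runs many synthetic two-group studies with an assumed effect size of 0.5 standard deviations and 30 participants per group, and contrasts the resulting significant p-values with those obtained under a true null.

# Minimal p-curve illustration on synthetic studies (not the published p-curve method).
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)

def significant_p_values(effect_size, n_per_group=30, n_studies=20_000, alpha=0.05):
    """Simulate many two-group studies and keep only the significant p-values."""
    a = rng.normal(0.0, 1.0, size=(n_studies, n_per_group))
    b = rng.normal(effect_size, 1.0, size=(n_studies, n_per_group))
    p = stats.ttest_ind(a, b, axis=1).pvalue
    return p[p < alpha]

for label, effect in [("true effect (d = 0.5)", 0.5), ("null effect (d = 0.0)", 0.0)]:
    sig = significant_p_values(effect)
    print(f"{label}: {np.mean(sig < 0.01):.0%} of significant p-values are below .01")
# True-effect studies pile up at very small p-values (right skew); under the null the
# significant p-values are spread evenly, and p-hacking tends to push them toward the
# .05 boundary (left skew).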

Openness, Preregistration, and Registered Reports
While the debate continues over the merits of different methods for data analysis, there is wide agreement on the need for improved openness and transparency in empirical science. This includes making materials, resources, and datasets available for future researchers who might wish to replicate the work.

Making materials and data available after a study's completion is a substantial improvement, because it greatly facilitates peer scrutiny and replication. However, it does not prevent questionable research practices, since the history of a data analysis (including possible p-hacking) is not visible in the final analysis scripts. And if others fail to replicate a study's findings, the original authors can easily explain away the inconsistencies by questioning the methodology of the new study or by claiming that an honest Type I error occurred.

Overcoming these limitations requires a clear statement of materials, methods, and hypotheses before the experiment is conducted, as provided by experimental preregistration and registered reports, discussed next.

Experimental preregistration. In response to concerns about questionable research practices, various authorities instituted registries in which researchers preregister their intentions, hypotheses, and methods (including sample sizes and precise plans for the data analyses) for upcoming experiments. Risks of p-hacking or outcome switching are dramatically reduced when a precise statement of method predates the experimental conduct. Furthermore, if the registry subsequently stores experimental data, then the file drawer is effectively opened on experimental outcomes that might otherwise have been hidden due to failure to attain statistical significance.

Although many think preregistration is only a recent idea, and therefore one that needs to be refined and tested before it can be fully adopted, it has in fact been in place for a long time in medical research. In 1997, the U.S. Food and Drug Administration Modernization Act (FDAMA) established the registry ClinicalTrials.gov, and over 96,000 experiments were registered in its first 10 years, assisted by the decision of the International Committee of Medical Journal Editors to make preregistration a requirement for publication in their journals.34 Results suggest that preregistration has had a substantial effect on scientific outcomes—for example, an analysis of studies funded by the National Heart, Lung, and Blood Institute between 1970 and 2012 showed the rate at which studies showed statistically significant findings plummeted from 57% before the introduction of mandatory preregistration (in 2000) to only 8% after.21

The success of ClinicalTrials.gov and the spread of the replication crisis to other disciplines has prompted many disciplines to introduce their own registries, including the American Economic Association (https://www.socialscienceregistry.org/) and the political science 'dataverse.'29 The Open Science Framework (OSF) also supports preregistration, ranging from simple and brief descriptions through to complete experimental specification (http://osf.io). Although originally focused on replications of psychological studies, it is now used in a range of disciplines, including by computer scientists.

Registered reports. While experimental preregistration should enhance confidence in published findings, it does not prevent reviewers from using statistical significance as a criterion for paper acceptance. Therefore, it does not solve the problem of publication bias and does not help prevent the file drawer effect. As a result, the scientific record can remain biased toward positive findings, and since achieving statistical significance is harder if p-hacking is not an option, researchers may be even more motivated to focus on unsurprising but safe hypotheses where the null is likely to be rejected. However, we do not want to simply take null results as equivalent to statistical significance, because null results are trivially easy to obtain; instead, the focus should be on the quality of the question being asked in the research.

Registered reports are a way to provide this focus. With registered reports, papers are submitted for review prior to conducting the experiment. Registered reports include the study motivation, related work, hypotheses, and detailed method; everything that might be expected in a traditional paper except for the results and their interpretation. Submissions are therefore considered based on the study's motivations (is this an interesting research question?) and method (is the way of answering the question sound and valid?). If accepted, a registered report is published regardless of the final results.

A recent analysis of 127 registered reports in the bio-medical and psychological sciences showed that 61% of studies did not support their hypothesis, compared to the estimated 5%–20% of null findings in the traditional literature.10 As of February 2019, the Center for Open Science (https://cos.io/rr/) lists 136 journals that accept registered reports and 27 journals that have accepted them as part of a special issue. No computer science journal is currently listed.

Recommendations for Computer Science
The use of NHST in relatively small-sample empirical studies is an important part of many areas of computer science, creating risks for our own reproducibility crisis.8,14,24 The following recommendations suggest activities and developments that computer scientists can work on to protect the credibility of the discipline's empirical research.

Promote preregistration. The ACM has the opportunity and perhaps the obligation to lead and support changes that improve empirical computer science—its stated purpose includes 'promotion of the highest standards' and the ACM Publications Board has the goal of 'aggressively developing the highest-quality content.' These goals would be supported by propagating to journal editors and conference chairs an expectation that empirical studies should be preregistered, preferably using transdisciplinary registries such as the Open Science Framework (http://osf.io). Authors of papers describing empirical studies could be asked or required to include a standardized statement at the end of their papers' abstract providing a link to the preregistration, or explicitly stating that the study was not preregistered (in other disciplines, preregistration is mandatory). Reviewers would also need to be educated on the value of preregistration and the potential implications of its absence.


It is worth noting that experimental preregistration has potential benefits to authors even if they do not intend to test formal hypotheses. If the registry entry is accessible at the time of paper submission (perhaps through a key that is disclosed to reviewers), then an author who preregisters an exploratory experiment is protected against reviewer criticism that the stated exploratory intent is due to HARKing following a failure to reject the null hypothesis.8

Another important point regarding preregistration is that it does not constrain authors from reporting unexpected findings. Any analysis that might be used in an unregistered experiment could also be used in a preregistered one, but the language used to describe the analysis in the published paper must make the post-hoc discovery clear, such as 'Contrary to expectations ...' or 'In addition to the preregistered analysis, we also ran ...'

Publish registered reports. The editorial boards of ACM journals that feature empirical studies could adapt their reviewing process to support the submission of registered reports and push for this publication format. This is perhaps the most promising of all interventions aimed at easing the replication crisis—it encourages researchers to address interesting questions, it eliminates the need to produce statistically significant results (and, thus, addresses the file drawer problem), and it encourages reviewers to focus on the work's importance and potential validity.10 In addition, it eliminates hindsight bias among reviewers, that is, the sentiment that they could have predicted the outcomes of a study, and that the findings are therefore unsurprising.

The prospect of permitting the submission of registered reports to large-scale venues is daunting (for example, the ACM 2019 Conference on Human-Computer Interaction received approximately 3,000 submissions to its papers track). However, the two-round submission and review process adopted by conferences within the Proceedings of the ACM (PACM) series could be adapted to embrace the submission of registered reports at round 1. We encourage conference chairs to experiment with registered report submissions.

Encourage data and materials openness. The ACM Digital Library supports access to resources that could aid replication through links to auxiliary materials. However, more could be done to encourage or require authors to make data and resources available. Currently, authors decide whether or not to upload resources. Instead, uploading data could be compulsory for publication, with exceptions made only following special permission from an editor or program chair. While such requirements may seem draconian given the permissive nature of current practice in computer science, the requirement is common in other disciplines and outlets, such as Nature's 'Scientific Data' (www.nature.com/sdata/).

A first step in this direction would be to follow transparency and openness guidelines (https://cos.io/our-services/top-guidelines/), which encourage authors to state in their submission whether or not they made their data, scripts, and preregistered analysis available online, and to provide links to them where available.

Promote clear reporting of results. While the debate over standards for data analysis and reporting continues, certain best-practice guidelines are emerging. First, authors should focus on two issues: conveying effect sizes (this includes simple effect sizes such as differences between means11), and helping readers to understand the uncertainty around those effect sizes by reporting interval estimates14,26 or posterior distributions.22 A range of recommendations already exist for improving reporting clarity and transparency and must be followed more widely. For example, most effect sizes only capture central tendencies and thus provide an incomplete picture. Therefore, it can help to also convey population variability through well-known practices such as reporting standard deviations (and their interval estimates) and/or plotting data distributions. When reporting the outcomes of statistical tests, the name of the test and its associated key data (such as degrees of freedom) should be reported. And, if describing the outcomes of an NHST test, the exact p-value should be reported. Since the probability of a successful replication depends on the order of magnitude of p,17 we suggest avoiding excessive precision (one or two significant digits are enough), and using scientific notation (for example, p = 2 × 10⁻⁵) instead of inequalities (for example, p < .001) when reporting very small p-values.
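As a sketch of what such estimation-focused reporting can look like in practice (our illustration, using synthetic task times and SciPy; the group sizes, means, and seed are assumptions), the snippet below reports a simple effect size with its 95% confidence interval, per-group standard deviations, the test name with degrees of freedom, and an exact p-value in scientific notation.

# Illustrative estimation-style report on synthetic data (not from the article).
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=4)
a = rng.normal(10.0, 2.0, size=40)   # hypothetical task times (s), condition A
b = rng.normal(11.8, 2.0, size=40)   # hypothetical task times (s), condition B

diff = b.mean() - a.mean()                                    # simple effect size
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
df = len(a) + len(b) - 2             # with equal group sizes this matches the pooled test
ci_low, ci_high = diff + np.array([-1, 1]) * stats.t.ppf(0.975, df) * se

t_stat, p_value = stats.ttest_ind(a, b)

print(f"Mean difference = {diff:.2f} s, 95% CI [{ci_low:.2f}, {ci_high:.2f}] s")
print(f"SD(A) = {a.std(ddof=1):.2f} s, SD(B) = {b.std(ddof=1):.2f} s")
print(f"Independent-samples t-test: t({df}) = {t_stat:.2f}, p = {p_value:.1e}")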

Encourage replications. The introduction of preregistration and registered reports in other disciplines caused a rapid decrease in the proportion of studies finding statistically significant effects. Assuming the same was to occur in computer science, how would this influence accepted publications? It is likely that many more empirical studies would be published with statistically non-significant findings or with no statistical analysis (such as exploratory studies that rely on qualitative methods). It is also likely that this would encourage researchers to consider conducting experimental replications, regardless of previous outcomes. Replications of studies with statistically significant results help reduce Type I error rates, and replications of studies with null outcomes reduce Type II error rates and can test the boundaries of hypotheses. If better data repositories were available, computer science students around the world could contribute to the robustness of findings by uploading to registries the outcomes of replications conducted as part of their courses on experimental methods. Better data repositories with richer datasets would also facilitate meta-analyses, which elevate confidence in findings beyond that possible from a single study.

Educate reviewers (and authors). Many major publication venues in computer science are under stress due to a deluge of submissions that creates challenges in obtaining expert reviews. Authors can become frustrated when reviewers focus on equivocal results of a well-founded and potentially important study—but reviewers can also become frustrated when authors fail to provide definitive findings on which to establish a clear contribution. In the spirit of registered reports, our recommendation is to educate reviewers (and authors) on the research value of studying interesting and important effects, largely irrespective of the results generated. If reviewers focused on questions and method rather than traditional evidentiary criteria such as p < .05, then researchers would be better motivated to identify interesting research questions, including potentially risky ones. One potential objection to risky studies is their typically low statistical power: testing null effects or very small effects with small samples can lead to vast overestimations of effect sizes.27 However, this is mostly true in the presence of p-hacking or publication bias, two issues that are eliminated by moving beyond the statistical significance filter and adopting registered reports.

References
1. Amrhein, V., Korner-Nievergelt, F., and Roth, T. The earth is flat (p > 0.05): Significance thresholds and the crisis of unreplicable research. PeerJ 5, 7 (2017), e3544.
2. Baker, M. Is there a reproducibility crisis? Nature 533, 7604 (2016), 452–454.
3. Begley, C.G., and Ellis, L.M. Raise standards for preclinical cancer research. Nature 483, 7391 (2012), 531.
4. Benjamin, D. et al. Redefine statistical significance. PsyArXiv (July 22, 2017).
5. Boisvert, R.F. Incentivizing reproducibility. Commun. ACM 59, 10 (Oct. 2016), 5.
6. Boyd, D., and Crawford, K. Critical questions for big data. Information, Communication & Society 15, 5 (2012), 662–679.
7. Bruns, S.B., and Ioannidis, J.P.A. P-curve and p-hacking in observational research. PLOS One 11, 2 (Feb. 2016), 1–13.
8. Cockburn, A., Gutwin, C., and Dix, A. HARK no more: On the preregistration of CHI experiments. In Proceedings of the 2018 ACM CHI Conference on Human Factors in Computing Systems (Montreal, Canada, Apr. 2018), 141:1–141:12.
9. Collberg, C., and Proebsting, T.A. Repeatability in computer systems research. Commun. ACM 59, 3 (Mar. 2016), 62–69.
10. Cristea, I.A., and Ioannidis, J.P.A. P-values in display items are ubiquitous and almost invariably significant: A survey of top science journals. PLOS One 13, 5 (2018), e0197440.
11. Cumming, G. Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-analysis. Multivariate Applications Series. Routledge, 2012.
12. Cumming, G. The new statistics: Why and how. Psychological Science 25, 1 (2014), 7–29.
13. Denning, P.J. ACM President's letter: What is experimental computer science? Commun. ACM 23, 10 (Oct. 1980), 543–544.
14. Dragicevic, P. Fair statistical communication in HCI. Modern Statistical Methods for HCI, J. Robertson and M. Kaptein, eds. Springer International Publishing, 2016, 291–330.
15. Durik, A.M., Britt, M.A., Reynolds, R., and Storey, J. The effects of hedges in persuasive arguments: A nuanced analysis of language. J. Language and Social Psychology 27, 3 (2008), 217–234.
16. Franco, A., Malhotra, N., and Simonovits, G. Publication bias in the social sciences: Unlocking the file drawer. Science 345, 6203 (2014), 1502–1505.
17. Goodman, S.N. A comment on replication, p-values and evidence. Statistics in Medicine 11, 7 (1992), 875–879.
18. Gundersen, O.E., and Kjensmo, S. State of the art: Reproducibility in artificial intelligence. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, the 30th Innovative Applications of Artificial Intelligence, and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (New Orleans, LA, USA, Feb. 2–7, 2018), 1644–1651.
19. Ioannidis, J.P.A. Why most published research findings are false. PLOS Medicine 2, 8 (Aug. 2005).
20. John, L.K., Loewenstein, G., and Prelec, D. Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science 23, 5 (2012), 524–532. PMID: 22508865.
21. Kaplan, R.M., and Irvin, V.L. Likelihood of null effects of large NHLBI clinical trials has increased over time. PLOS One 10, 8 (Aug. 2015), 1–12.
22. Kay, M., Nelson, G.L., and Hekler, E.B. Researcher-centered design of statistics: Why Bayesian statistics better fit the culture and incentives of HCI. In Proceedings of the 2016 ACM CHI Conference on Human Factors in Computing Systems, 4521–4532.
23. Kerr, N.L. HARKing: Hypothesizing after the results are known. Personality & Social Psychology Rev. 2, 3 (1998), 196. Lawrence Erlbaum Assoc.
24. Kosara, R., and Haroz, S. Skipping the replication crisis in visualization: Threats to study validity and how to address them. Evaluation and Beyond—Methodological Approaches for Visualization (Berlin, Germany, Oct. 2018).
25. Krishnamurthi, S., and Vitek, J. The real software crisis: Repeatability as a core value. Commun. ACM 58, 3 (Mar. 2015), 34–36.
26. Kruschke, J.K., and Liddell, T.M. The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Rev. 25, 1 (2018), 178–206.
27. Loken, E., and Gelman, A. Measurement error and the replication crisis. Science 355, 6325 (2017), 584–585.
28. McShane, B.B., Gal, D., Gelman, A., Robert, C., and Tackett, J.L. Abandon statistical significance. The American Statistician 73, sup1 (2019), 235–245.
29. Monogan, J.E., III. A case for registering studies of political outcomes: An application in the 2010 House elections. Political Analysis 21, 1 (2013), 21.
30. Nickerson, R.S. Confirmation bias: A ubiquitous phenomenon in many guises. Rev. General Psychology 2, 2 (1998), 175–220.
31. Open Science Collaboration and others. Estimating the reproducibility of psychological science. Science 349, 6251 (2015), aac4716.
32. Pashler, H., and Wagenmakers, E.-J. Editors' introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science 7, 6 (2012), 528–530.
33. Perezgonzalez, J.D., and Frias-Navarro, D. Retract 0.005 and propose using JASP, instead. Preprint, 2017; https://psyarxiv.com/t2fn8.
34. Rennie, D. Trial registration: A great idea switches from ignored to irresistible. JAMA 292, 11 (2004), 1359–1362.
35. Rosenthal, R. The file drawer problem and tolerance for null results. Psychological Bulletin 86, 3 (1979), 638–641.
36. Sanbonmatsu, D.M., Posavac, S.S., Kardes, F.R., and Mantel, S.P. Selective hypothesis testing. Psychonomic Bulletin & Rev. 5, 2 (June 1998), 197–220.
37. Savalei, V., and Dunn, E. Is the call to abandon p-values the red herring of the replicability crisis? Frontiers in Psychology 6 (2015), 245.
38. Simmons, J.P., Nelson, L.D., and Simonsohn, U. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science 22, 11 (2011), 1359–1366.
39. Simonsohn, U. Posterior-hacking: Selective reporting invalidates Bayesian results also. SSRN (2014).
40. Simonsohn, U., Nelson, L.D., and Simmons, J.P. P-curve: A key to the file-drawer. J. Experimental Psychology: General 143, 2 (2014), 534–547.
41. Sterling, T.D. Publication decisions and their possible effects on inferences drawn from tests of significance—or vice versa. J. American Statistical Assoc. 54, 285 (1959), 30–34.
42. Trafimow, D., and Marks, M. Editorial. Basic and Applied Social Psychology 37, 1 (2015), 1–2.

Andy Cockburn ([email protected]) is a professor at the University of Canterbury, Christchurch, New Zealand, where he is head of the HCI and Multimedia Lab.

Pierre Dragicevic is a research scientist at Inria, Orsay, France.

Lonni Besançon is a postdoctoral researcher at Linköping University, Linköping, Sweden.

Carl Gutwin is a professor in the Department of Computer Science and director of the HCI Lab at the University of Saskatchewan, Canada.

© 2020 ACM 0001-0782/20/8 $15.00
