http://cos.io/

Brian Nosek
University of Virginia
http://briannosek.com/

General Article

Psychological Science XX(X) 1–8
© The Author(s) 2011
Reprints and permission: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0956797611417632
http://pss.sagepub.com

False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant

Joseph P. Simmons¹, Leif D. Nelson², and Uri Simonsohn¹
¹The Wharton School, University of Pennsylvania; ²Haas School of Business, University of California, Berkeley

Abstract In this article, we accomplish two things. First, we show that despite empirical psychologists’ nominal endorsement of a low rate of false-positive findings (≤ .05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. We present computer simulations and a pair of actual experiments that demonstrate how unacceptably easy it is to accumulate (and report) statistically significant evidence for a false hypothesis. Second, we suggest a simple, low-cost, and straightforwardly effective disclosure-based solution to this problem. The solution involves six concrete requirements for authors and four guidelines for reviewers, all of which impose a minimal burden on the publication process.

Keywords methodology, motivated reasoning, publication, disclosure

Received 3/17/11; Revision accepted 5/23/11

Our job as scientists is to discover truths about the world. We generate hypotheses, collect data, and examine whether or not the data are consistent with those hypotheses. Although we aspire to always be accurate, errors are inevitable.

Perhaps the most costly error is a false positive, the incorrect rejection of a null hypothesis. First, once they appear in the literature, false positives are particularly persistent. Because null results have many possible causes, failures to replicate previous findings are never conclusive. Furthermore, because it is uncommon for prestigious journals to publish null findings or exact replications, researchers have little incentive to even attempt them. Second, false positives waste resources: They inspire investment in fruitless research programs and can lead to ineffective policy changes. Finally, a field known for publishing false positives risks losing its credibility.

In this article, we show that despite the nominal endorsement of a maximum false-positive rate of 5% (i.e., p ≤ .05), current standards for disclosing details of data collection and analyses make false positives vastly more likely. In fact, it is unacceptably easy to publish "statistically significant" evidence consistent with any hypothesis.

The culprit is a construct we refer to as researcher degrees of freedom. In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?

It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields "statistical significance," and to then report only what "worked." The problem, of course, is that the likelihood of at least one (of many) analyses producing a falsely positive finding at the 5% level is necessarily greater than 5%.

This exploratory behavior is not the by-product of malicious intent, but rather the result of two factors: (a) ambiguity in how best to make these decisions and (b) the researcher's desire to find a statistically significant result. A large literature documents that people are self-serving in their interpretation …

Corresponding Authors:
Joseph P. Simmons, The Wharton School, University of Pennsylvania, 551 Jon M. Huntsman Hall, 3730 Walnut St., Philadelphia, PA 19104. E-mail: [email protected]
Leif D. Nelson, Haas School of Business, University of California, Berkeley, Berkeley, CA 94720-1900. E-mail: [email protected]
Uri Simonsohn, The Wharton School, University of Pennsylvania, 548 Jon M. Huntsman Hall, 3730 Walnut St., Philadelphia, PA 19104. E-mail: [email protected]
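The claim that trying many analyses pushes the false-positive rate above 5% can be checked with a small Monte Carlo sketch (the function and parameter names are mine, not the authors'). It relies on the fact that, under the null hypothesis, a well-calibrated test's p-value is uniform on [0, 1]:

```python
import random

def familywise_fp_rate(k, alpha=0.05, n_sims=100_000, seed=1):
    """Estimate the chance that at least one of k independent analyses
    of pure-null data comes out 'significant' at p < alpha.
    Under the null, each analysis's p-value is uniform on [0, 1]."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < alpha for _ in range(k))
        for _ in range(n_sims)
    )
    return hits / n_sims

for k in (1, 3, 5, 10):
    analytic = 1 - (1 - 0.05) ** k  # 1 - P(no analysis reaches p < .05)
    print(f"{k:2d} analyses: simulated {familywise_fp_rate(k):.3f}, "
          f"analytic {analytic:.3f}")
```

With ten independent analyses the familywise rate is roughly 40%, not 5%. Real researcher degrees of freedom (alternative covariates, exclusions, transformations of the same data) are correlated rather than independent, so the inflation is smaller than this sketch suggests, but it still exceeds the nominal 5%.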

Electronic copy available at: http://ssrn.com/abstract=1850704

Open access, freely available online

Essay

Why Most Published Research Findings Are False

John P. A. Ioannidis

Summary

There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.

Published research findings are sometimes refuted by subsequent evidence, with ensuing confusion and disappointment. Refutation and controversy is seen across the range of research designs, from clinical trials and traditional epidemiological studies [1–3] to the most modern molecular research [4,5]. There is increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims [6–8]. However, this should not be surprising. It can be proven that most claimed research findings are false. Here I will examine the key factors that influence this problem and some corollaries thereof.

Modeling the Framework for False Positive Findings

Several methodologists have pointed out [9–11] that the high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values. Research findings are defined here as any relationship reaching formal statistical significance, e.g., effective interventions, informative predictors, risk factors, or associations. "Negative" research is also very useful. "Negative" is actually a misnomer, and the misinterpretation is widespread. However, here we will target relationships that investigators claim exist, rather than null findings.

As has been shown previously, the probability that a research finding is indeed true depends on the prior probability of it being true (before doing the study), the statistical power of the study, and the level of statistical significance [10,11]. Consider a 2 × 2 table in which research findings are compared against the gold standard of true relationships in a scientific field. In a research field both true and false hypotheses can be made about the presence of relationships. Let R be the ratio of the number of "true relationships" to "no relationships" among those tested in the field. R is characteristic of the field and can vary a lot depending on whether the field targets highly likely relationships or searches for only one or a few true relationships among thousands and millions of hypotheses that may be postulated. Let us also consider, for computational simplicity, circumscribed fields where either there is only one true relationship (among many that can be hypothesized) or the power is similar to find any of the several existing true relationships. The pre-study probability of a relationship being true is R/(R + 1). The probability of a study finding a true relationship reflects the power 1 − β (one minus the Type II error rate). The probability of claiming a relationship when none truly exists reflects the Type I error rate, α. Assuming that c relationships are being probed in the field, the expected values of the 2 × 2 table are given in Table 1. After a research finding has been claimed based on achieving formal statistical significance, the post-study probability that it is true is the positive predictive value, PPV. The PPV is also the complementary probability of what Wacholder et al. have called the false positive report probability [10]. According to the 2 × 2 table, one gets PPV = (1 − β)R/(R − βR + α). A research finding is thus …

Citation: Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2(8): e124.
Copyright: © 2005 John P. A. Ioannidis. This is an open-access article distributed under the terms of the Creative Commons Attribution License.
Abbreviation: PPV, positive predictive value
Competing Interests: The author has declared that no competing interests exist.
John P. A. Ioannidis is in the Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Ioannina, Greece, and Institute for Clinical Research and Health Policy Studies, Department of Medicine, Tufts-New England Medical Center, Tufts University School of Medicine, Boston, Massachusetts, United States of America. E-mail: [email protected]
The Essay section contains opinion pieces on topics of broad interest to a general medical audience.
DOI: 10.1371/journal.pmed.0020124
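The PPV formula above is simple enough to evaluate directly. A minimal sketch (the function names and example values are illustrative choices of mine, not taken from the essay):

```python
def ppv(R, beta, alpha=0.05):
    """Post-study probability that a claimed finding is true, per the
    2x2 framework: PPV = (1 - beta) * R / (R - beta * R + alpha)."""
    return (1 - beta) * R / (R - beta * R + alpha)

def pre_study_probability(R):
    """Pre-study probability that a probed relationship is true: R / (R + 1)."""
    return R / (R + 1)

# Illustrative (assumed) values: a field where 1 in 10 probed relationships
# is true (R = 0.1) and power is only 0.20 (beta = 0.8) gives
# PPV = 0.02 / 0.07, roughly 0.29 -- a "significant" finding there is more
# likely false than true, which is the essay's central point.
print(pre_study_probability(0.1))
print(ppv(R=0.1, beta=0.8))
```

Raising power (lowering β) or working in fields with higher R pushes PPV up, which is why the summary lists small studies, small effects, and long-shot hypothesis spaces as conditions under which findings are less likely to be true.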

PLoS Medicine | www.plosmedicine.org 0696 August 2005 | Volume 2 | Issue 8 | e124

New Problems?

• Low power
• Questionable research practices
• Overabundance of positive results
• Ignoring null results
• Lack of replication
• Limitations of NHST

Sterling, 1959; Cohen, 1962; Lykken, 1968; Tukey, 1969; Greenwald, 1975; Meehl, 1978; Rosenthal, 1979

Solutions

• Transparency
• Distinguish confirmatory v. exploratory
• Replication
• Aggregate evidence
• Narrow use of NHST

Sterling, 1959; Cohen, 1962; Lykken, 1968; Tukey, 1969; Greenwald, 1975; Meehl, 1978; Rosenthal, 1979

Problems ✔

Soluons ✔

Implementation ✗

Central issue

Incentives for individual success are focused on getting it published, not getting it right.

Challenges

• Temporal construal (Trope & Liberman, 2003)

• Perceived norms (Anderson, Martinson, & DeVries, 2007)

• Motivated reasoning (Kunda, 1990)

• Minimal accountability (Lerner & Tetlock, 1999)

• I am busy (Me & You, 2014)

Non-profit, incorporated 2013
4 foundation funders, $10 million
Fully FOSS technology
No competitors
Mission: Improve openness, integrity, and reproducibility of scientific research
Activities: Infrastructure, community, …

Strategy

1. Services to support existing workflow

2. Enable good practices

3. Nudge incentives top-down and bottom-up

1. Services to support existing workflow

[Diagram: the scientific workflow cycle — Develop idea → Design study → Acquire materials → Collect data → Store data → Analyze data → Interpret findings → Write report → Publish report → Search and discover → back to Develop idea]

Open Science Framework (OSF)
http://osf.io/
Jeff Spies

• For collaboration, documentation, archiving, sharing, registration
• Respects and integrates workflow
• Replaces ad hoc archiving with shared solution
• Merges private and public workflows
• Connects infrastructure
• Integrates top-down and bottom-up solutions
• Incentivizes openness

[Diagram: the workflow cycle again, annotated with OSF functions for the workflow — Collaboration, Authoring, Documentation, Archiving, Commenting, Peer Review, Version control, Registration, Sharing, (Alt)metrics]

2. Enable Good Practices

• Openness
• Registration
• Reproducibility

3. Nudge incentives top-down and bottom-up

• Altmetrics
• Disclosure standards
• Registered Reports
• Crowdsourcing
• Badges

4 tacks, 1 objective: Disclosure

1. Self-disclosure

"We report all data exclusions, manipulations, and measures, and how we determined our sample sizes."

Leif Nelson, Joe Simmons

4 tacks, 1 objective: Disclosure

1. Self-disclosure
2. Voluntary response

4 tacks, 1 objective: Disclosure

1. Self-disclosure
2. Voluntary response
3. Journal Requirement

4 tacks, 1 objective: Disclosure

1. Self-disclosure
2. Voluntary response
3. Journal Requirement
4. Reviewer Request

"I request that the authors add a statement to the paper confirming whether, for all experiments, they have reported all measures, conditions, data exclusions, and how they determined their sample sizes. …. This is the standard reviewer disclosure request endorsed by the [see http://osf.io/hadz3]. I include it in every review."

Registered Reports
Review of intro and methods prior to data collection; published regardless of outcome

In use:
– Perspectives on Psychological Science
– Social Psychology
– Cortex
– Frontiers in Cognition
– Attention, Perception, & Psychophysics
– Experimental Psychology
– AIMS Neuroscience

Many Labs Project

Rick Klein, Kate Ratliff, Michelangelo Vianello

+ Stepan Bahnik, Michael J. Bernstein, Konrad Bocian, Mark Brandt, Claudia Chloe Brumbaugh, Zeynep Cemalcilar, Jesse Chandler, Winnee Cheong, William E. Davis, Thierry Devos, Matthew Eisner, Natalia Frankowska, David Furrow, Elisa Maria Galliani, Fred Hasselman, Joshua A. Hicks, James F. Hovermale, S. Jane Hunt, Jeffrey R. Huntsinger, Hans IJzerman, Melissa-Sue John, Jennifer A. Joy-Gaba, Heather Kappes, Lacy E. Krueger, Jaime Kurtz, Carmel A. Levitan, Robyn Mallett, Wendy L. Morris, Anthony J. Nelson, Jason A. Nier, Grant Packard, Ronaldo Pilati, Abraham M. Rutchick, Kathleen Schmidt, Jeanine L. Skorinko, Robert Smith, Justin Storbeck, Lyn M. Van Swol, Donna Thompson, Anna van 't Veer, Leigh Ann Vaughn, Marek Vranka, Aaron Wichman, Julie Woodzicka

Charter Endorsing Organizations

Berkeley Init. for Transparency in Social Sciences
Mozilla Science Lab
Bio, Tech and Beyond
Network for Open Scientific Innovation
Center for Open Science
Open Science Federation
Databrary
OpenfMRI
DataONE
Prometheus Research
DuraSpace
PsychoPy
European Association for Personality Psych
Reproducibility Initiative
figshare
Science Exchange
Laura and John Arnold Foundation

Charter Adopting Journals

Cortex
Journal of Social Psychology
European Journal of Personality
Journal of Vision
Human Computation
Psi Chi Journal
Journal of Research in Personality
Psychological Science

Josh Carp Chun Wang Sam Portnow Denise Holman Alex Schiller Nan Chen

Jeff Spies Melissa Lewis Johanna Cohoon Tim Errington Wendy Zhu Jake Rosenberg

http://cos.io/

Lyndsy Simon, Andrew Sallans, Michael Lapuz, Saul Brodsky