Redefine statistical significance
We propose to change the default P-value threshold for statistical significance from 0.05 to 0.005 for claims of new discoveries.
Daniel J. Benjamin, James O. Berger, Magnus Johannesson, Brian A. Nosek, E.-J. Wagenmakers, Richard Berk, Kenneth A. Bollen, Björn Brembs, Lawrence Brown, Colin Camerer, David Cesarini, Christopher D. Chambers, Merlise Clyde, Thomas D. Cook, Paul De Boeck, Zoltan Dienes, Anna Dreber, Kenny Easwaran, Charles Efferson, Ernst Fehr, Fiona Fidler, Andy P. Field, Malcolm Forster, Edward I. George, Richard Gonzalez, Steven Goodman, Edwin Green, Donald P. Green, Anthony Greenwald, Jarrod D. Hadfield, Larry V. Hedges, Leonhard Held, Teck Hua Ho, Herbert Hoijtink, Daniel J. Hruschka, Kosuke Imai, Guido Imbens, John P. A. Ioannidis, Minjeong Jeon, James Holland Jones, Michael Kirchler, David Laibson, John List, Roderick Little, Arthur Lupia, Edouard Machery, Scott E. Maxwell, Michael McCarthy, Don Moore, Stephen L. Morgan, Marcus Munafò, Shinichi Nakagawa, Brendan Nyhan, Timothy H. Parker, Luis Pericchi, Marco Perugini, Jeff Rouder, Judith Rousseau, Victoria Savalei, Felix D. Schönbrodt, Thomas Sellke, Betsy Sinclair, Dustin Tingley, Trisha Van Zandt, Simine Vazire, Duncan J. Watts, Christopher Winship, Robert L. Wolpert, Yu Xie, Cristobal Young, Jonathan Zinman and Valen E. Johnson

The lack of reproducibility of scientific studies has caused growing concern over the credibility of claims of new discoveries based on 'statistically significant' findings. There has been much progress toward documenting and addressing several causes of this lack of reproducibility (for example, multiple testing, P-hacking, publication bias and under-powered studies). However, we believe that a leading cause of non-reproducibility has not yet been adequately addressed: statistical standards of evidence for claiming new discoveries in many fields of science are simply too low. Associating statistically significant findings with P < 0.05 results in a high rate of false positives even in the absence of other experimental, procedural and reporting problems.

For fields where the threshold for defining statistical significance for new discoveries is P < 0.05, we propose a change to P < 0.005. This simple step would immediately improve the reproducibility of scientific research in many fields. Results that would currently be called significant but do not meet the new threshold should instead be called suggestive. While statisticians have known the relative weakness of using P ≈ 0.05 as a threshold for discovery, and the proposal to lower it to 0.005 is not new1,2, a critical mass of researchers now endorse this change.

We restrict our recommendation to claims of discovery of new effects. We do not address the appropriate threshold for confirmatory or contradictory replications of existing claims. We also do not advocate changes to discovery thresholds in fields that have already adopted more stringent standards (for example, genomics and high-energy physics research; see the 'Potential objections' section below).

We also restrict our recommendation to studies that conduct null hypothesis significance tests. We have diverse views about how best to improve reproducibility, and many of us believe that other ways of summarizing the data, such as Bayes factors or other posterior summaries based on clearly articulated model assumptions, are preferable to P values. However, changing the P value threshold is simple, aligns with the training undertaken by many researchers, and might quickly achieve broad acceptance.

Strength of evidence from P values
In testing a point null hypothesis H0 against an alternative hypothesis H1 based on data xobs, the P value is defined as the probability, calculated under the null hypothesis, that a test statistic is as extreme as or more extreme than its observed value. The null hypothesis is typically rejected — and the finding is declared statistically significant — if the P value falls below the (current) type I error threshold α = 0.05.

From a Bayesian perspective, a more direct measure of the strength of evidence for H1 relative to H0 is the ratio of their probabilities. By Bayes' rule, this ratio may be written as:

Pr(H1 | xobs) / Pr(H0 | xobs) = [f(xobs | H1) / f(xobs | H0)] × [Pr(H1) / Pr(H0)] ≡ BF × (prior odds)   (1)

where BF is the Bayes factor that represents the evidence from the data, and the prior odds can be informed by researchers' beliefs, scientific consensus, and validated evidence from similar research questions in the same field. Multiple-hypothesis testing, P-hacking and publication bias all reduce the credibility of evidence. Some of these practices reduce the prior odds of H1 relative to H0 by changing the population of hypothesis tests that are reported. Prediction markets3 and analyses of replication results4 both suggest that for psychology experiments, the prior odds of H1 relative to H0 may be only about 1:10. A similar number has been suggested in cancer clinical trials, and the number is likely to be much lower in preclinical biomedical research5.
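Equation (1) is simple to evaluate numerically. The sketch below (Python) converts a Bayes factor and prior odds into posterior odds; the specific Bayes factor used is an illustrative assumption, while the 1:10 prior odds follow the discussion above.

```python
def posterior_odds(bayes_factor: float, prior_odds: float) -> float:
    """Equation (1): posterior odds of H1 relative to H0 = BF x prior odds."""
    return bayes_factor * prior_odds

# Illustrative inputs: prior odds of 1:10 (as suggested above for psychology
# experiments) and a hypothetical Bayes factor of 3.4 in favour of H1.
print(posterior_odds(bayes_factor=3.4, prior_odds=1 / 10))  # 0.34, i.e. about 3:1 against H1
```

Even with a Bayes factor of 3.4 in favour of H1, prior odds of 1:10 leave posterior odds of only about 0.34, that is, roughly 3:1 in favour of the null hypothesis.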
There is no unique mapping between the P value and the Bayes factor, since the Bayes factor depends on H1. However, the connection between the two quantities can be evaluated for particular test statistics under certain classes of plausible alternatives (Fig. 1).

Fig. 1 | Relationship between the P value and the Bayes factor. The Bayes factor (BF) is defined as f(xobs | H1) / f(xobs | H0). The figure assumes that observations are independent and identically distributed (i.i.d.) according to x ~ N(μ, σ²), where the mean μ is unknown and the variance σ² is known. The P value is from a two-sided z-test (or, equivalently, a one-sided χ² test with one degree of freedom) of the null hypothesis H0: μ = 0. Power (red curve): BF obtained by defining H1 as putting ½ probability on μ = ±m for the value of m that gives 75% power for the test of size α = 0.05. This H1 represents an effect size typical of that which is implicitly assumed by researchers during experimental design. Likelihood ratio bound (black curve): BF obtained by defining H1 as putting ½ probability on μ = ±x̂, where x̂ is approximately equal to the mean of the observations. These BFs are upper bounds among the class of all H1 terms that are symmetric around the null, but they are improper because the data are used to define H1. UMPBT (blue curve): BF obtained by defining H1 according to the uniformly most powerful Bayesian test2 that places ½ probability on μ = ±w, where w is the alternative hypothesis that corresponds to a one-sided test of size 0.0025. This curve is indistinguishable from the 'Power' curve that would be obtained if the power used in its definition was 80% rather than 75%. Local-H1 bound (green curve): BF = 1/(−e p ln p), where p is the P value, is a large-sample upper bound on the BF from among all unimodal alternative hypotheses that have a mode at the null and satisfy certain regularity conditions15. The red numbers on the y axis indicate the range of Bayes factors that are obtained for P values of 0.005 or 0.05. For more details, see the Supplementary Information.

A two-sided P value of 0.05 corresponds to Bayes factors in favour of H1 that range from about 2.5 to 3.4 under reasonable assumptions about H1 (Fig. 1). This is weak evidence from at least three perspectives. […]
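As a rough check on these endpoints, the sketch below (Python, standard library only) evaluates two of the quantities defined in the Fig. 1 caption, the likelihood ratio bound and the local-H1 bound, at the two thresholds; the helper names are ours, and the other two curves (Power and UMPBT) are not reproduced here.

```python
from math import exp, log
from statistics import NormalDist


def z_from_two_sided_p(p: float) -> float:
    """z statistic whose two-sided P value under a standard normal equals p."""
    return NormalDist().inv_cdf(1 - p / 2)


def likelihood_ratio_bound(p: float) -> float:
    """BF when H1 puts 1/2 probability on mu = +/- the observed mean
    (the 'Likelihood ratio bound' curve in Fig. 1), for a two-sided z-test."""
    z = z_from_two_sided_p(p)
    return 0.5 * exp(z * z / 2) * (1 + exp(-2 * z * z))


def local_h1_bound(p: float) -> float:
    """Large-sample bound 1 / (-e * p * ln p) over unimodal alternatives with a
    mode at the null (the 'Local-H1 bound' curve in Fig. 1)."""
    return 1 / (-exp(1) * p * log(p))


for p in (0.05, 0.005):
    print(f"P = {p}: likelihood-ratio bound ~= {likelihood_ratio_bound(p):.1f}, "
          f"local-H1 bound ~= {local_h1_bound(p):.1f}")
```

The printed values are approximately 3.4 and 2.5 at P = 0.05, and approximately 25.7 and 13.9 at P = 0.005, consistent with the ranges quoted in the text and with the red values on the y axis of Fig. 1.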
Why 0.005?
[…] First, a two-sided P value of 0.005 corresponds to Bayes factors of approximately 14 to 26 in favour of H1 (the red values in Fig. 1). This range represents 'substantial' to 'strong' evidence according to conventional Bayes factor classifications6. Second, in many fields the P < 0.005 standard would reduce the false positive rate to levels we judge to be reasonable.

In many studies, statistical power is low7. Figure 2 demonstrates that low statistical power and α = 0.05 combine to produce high false positive rates.

For many, the calculations illustrated by Fig. 2 may be unsettling. For example, the false positive rate is greater than 33% with prior odds of 1:10 and a P value threshold of 0.05, regardless of the level of statistical power. Reducing the threshold to 0.005 would reduce this minimum false positive rate to 5%. Similar reductions in false positive rates would occur over a wide range of statistical powers.

Empirical evidence from recent replication projects in psychology and experimental economics provides insights into the prior odds in favour of H1. In both projects, the rate of replication (that is, significance at P < 0.05 in the replication, in a consistent direction) was roughly double for initial studies with P < 0.005 relative to initial studies with 0.005 < P < 0.05: 50% versus 24% for psychology8, and 85% versus 44% for experimental economics9. Although based on relatively small samples of studies (93 in psychology and 16 in experimental economics, after excluding initial studies with P > 0.05), these numbers are suggestive of the potential gains in reproducibility that would accrue from the new threshold of P < 0.005 in these fields. In biomedical research, 96% of a sample of recent papers claim statistically significant results with the P < 0.05 threshold10. However, replication rates were very low for these studies, suggesting a potential for gains by adopting this new standard in these fields as well.

Potential objections
We now address the most compelling arguments against adopting this higher standard of evidence.
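The false positive rate calculation that Fig. 2 illustrates, referred to in the 'Why 0.005?' section above, can be sketched in a few lines (Python). The formula used here, false positive rate = α / (α + prior odds × power), treats the false positive rate as the proportion of significant findings for which H0 is in fact true; it is our reconstruction under the assumption that every test is carried out at level α, not a formula taken verbatim from the comment.

```python
def false_positive_rate(alpha: float, power: float, prior_odds: float) -> float:
    """Proportion of findings significant at level alpha that are false positives,
    given the power of the test under H1 and the prior odds of H1 relative to H0."""
    return alpha / (alpha + prior_odds * power)


PRIOR_ODDS = 1 / 10  # prior odds of H1 relative to H0 discussed in the text

for alpha in (0.05, 0.005):
    for power in (0.25, 0.50, 0.75, 1.00):
        fpr = false_positive_rate(alpha, power, PRIOR_ODDS)
        print(f"alpha = {alpha:>5}, power = {power:.2f}: false positive rate = {fpr:.0%}")
```

With prior odds of 1:10 this reproduces the numbers quoted above: at α = 0.05 the false positive rate cannot fall below about 33% even at perfect power, whereas at α = 0.005 the corresponding floor is about 5%.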