
Retire statistical significance

Valentin Amrhein, Sander Greenland, Blake McShane and more than 800 signatories call for an end to hyped claims and the dismissal of possibly crucial effects.

When was the last time you heard a seminar speaker claim there was ‘no difference’ between two groups because the difference was ‘statistically non-significant’?

If your experience matches ours, there’s a good chance that this happened at the last talk you attended. We hope that at least someone in the audience was perplexed if, as frequently happens, a plot or table showed that there actually was a difference.

How do statistics so often lead scientists to deny differences that those not educated in statistics can plainly see? For several generations, researchers have been warned that a statistically non-significant result does not ‘prove’ the null hypothesis (the hypothesis that there is no difference between groups or no effect of a treatment on some measured outcome)1. Nor do statistically significant results ‘prove’ some other hypothesis. Such misconceptions have famously warped the literature with overstated claims and, less famously, led to claims of conflicts between studies where none exists.

We have some proposals to keep scientists from falling prey to these misconceptions.

PERVASIVE PROBLEM
Let’s be clear about what must stop: we should never conclude there is ‘no difference’ or ‘no association’ just because a P value is larger than a threshold such as 0.05

©2019 Springer Nature Limited. All rights reserved. 21 MARCH 2019 | VOL 567 | NATURE | 305


or, equivalently, because a confidence interval includes zero. Neither should we conclude that two studies conflict because one had a statistically significant result and the other did not. These errors waste research efforts and misinform policy decisions.

For example, consider a series of analyses of unintended effects of anti-inflammatory drugs2. Because their results were statistically non-significant, one set of researchers concluded that exposure to the drugs was “not associated” with new-onset atrial fibrillation (the most common disturbance to heart rhythm) and that the results stood in contrast to those from an earlier study with a statistically significant outcome.

Now, let’s look at the actual data. The researchers describing their statistically non-significant results found a risk ratio of 1.2 (that is, a 20% greater risk in exposed patients relative to unexposed ones). They also found a 95% confidence interval that spanned everything from a trifling risk decrease of 3% to a considerable risk increase of 48% (P = 0.091; our calculation). The researchers from the earlier, statistically significant, study found the exact same risk ratio of 1.2. That study was simply more precise, with an interval spanning from 9% to 33% greater risk (P = 0.0003; our calculation).

It is ludicrous to conclude that the statistically non-significant results showed “no association”, when the interval estimate included serious risk increases; it is equally absurd to claim these results were in contrast with the earlier results showing an identical observed effect. Yet these common practices show how reliance on thresholds of statistical significance can mislead us (see ‘Beware false conclusions’).

[Figure: BEWARE FALSE CONCLUSIONS. Studies currently dubbed ‘statistically significant’ and ‘statistically non-significant’ need not be contradictory, and such designations might cause genuine effects to be dismissed. Two interval estimates are plotted on an axis running from ‘Decreased effect’ through ‘No effect’ to ‘Increased effect’: a ‘significant’ study (low P value) and a ‘non-significant’ study (high P value). The observed effect (or point estimate) is the same in both studies, so they are not in conflict, even if one is ‘significant’ and the other is not. Source: V. Amrhein et al.]

These and similar errors are widespread. Surveys of hundreds of articles have found that statistically non-significant results are interpreted as indicating ‘no difference’ or ‘no effect’ in around half (see ‘Wrong interpretations’ and Supplementary Information).

In 2016, the American Statistical Association released a statement in The American Statistician warning against the misuse of statistical significance and P values. The issue also included many commentaries on the subject. This month, a special issue in the same journal attempts to push these reforms further. It presents more than 40 papers on ‘Statistical inference in the 21st century: a world beyond P < 0.05’. The editors introduce the collection with the caution “don’t say ‘statistically significant’”3. Another article4 with dozens of signatories also calls on authors and journal editors to disavow those terms.

We agree, and call for the entire concept of statistical significance to be abandoned.

We are far from alone. When we invited others to read a draft of this comment and sign their names if they concurred with our message, 250 did so within the first 24 hours. A week later, we had more than 800 signatories, all checked for an academic affiliation or other indication of present or past work in a field that depends on statistical modelling (see the list and final count of signatories in the Supplementary Information). These include statisticians, clinical and medical researchers, biologists and psychologists from more than 50 countries and across all continents except Antarctica. One advocate called it a “surgical strike against thoughtless testing of statistical significance” and “an opportunity to register your voice in favour of better scientific practices”.

We are not calling for a ban on P values. Nor are we saying they cannot be used as a decision criterion in certain specialized applications (such as determining whether a manufacturing process meets some quality-control standard). And we are also not advocating for an anything-goes situation, in which weak evidence suddenly becomes credible. Rather, and in line with many others over the decades, we are calling for a stop to the use of P values in the conventional, dichotomous way: to decide whether a result refutes or supports a scientific hypothesis5.

QUIT CATEGORIZING
The trouble is human and cognitive more than it is statistical: bucketing results into ‘statistically significant’ and ‘statistically non-significant’ makes people think that the items assigned in that way are categorically different6–8. The same problems are likely to arise under any proposed statistical alternative that involves dichotomization, whether frequentist, Bayesian or otherwise.

Unfortunately, the false belief that crossing the threshold of statistical significance is enough to show that a result is ‘real’ has led scientists and journal editors to privilege such results, thereby distorting the literature. Statistically significant estimates are biased upwards in magnitude and potentially to a large degree, whereas statistically non-significant estimates are biased downwards in magnitude. Consequently, any discussion that focuses on estimates chosen for their significance will be biased. On top of this, the rigid focus on statistical significance encourages researchers to choose data and methods that yield statistical significance for some desired (or simply publishable) result, or that yield statistical non-significance for an undesired result, such as potential side effects of drugs, thereby invalidating conclusions.

The pre-registration of studies and a commitment to publish all results of all analyses can do much to mitigate these issues. However, even results from pre-registered studies can be biased by decisions invariably left open in the analysis plan9. This occurs even with the best of intentions.

Again, we are not advocating a ban on P values, confidence intervals or other statistical measures, only that we should not treat them categorically. This includes dichotomization as statistically significant or not, as well as categorization based on other statistical measures such as Bayes factors.

One reason to avoid such ‘dichotomania’ is that all statistics, including P values and confidence intervals, naturally vary from study to study, and often do so to a surprising degree. In fact, random variation alone can easily lead to large disparities in P values, far beyond falling just to either side of the 0.05 threshold. For example, even if researchers could conduct two perfect replication studies of some genuine effect, each with 80% power (chance) of achieving P < 0.05, it would not be very surprising for one to obtain P < 0.01 and the other P > 0.30.
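The figures quoted above can be checked in a few lines. The following is a minimal sketch, not the authors' own calculation: it assumes a normal approximation on the log risk-ratio scale to recover the P values for the two anti-inflammatory-drug studies from their reported 95% intervals, and it models each replication study as a two-sided z-test whose statistic is shifted just enough to give 80% power. The function names (`p_from_ratio_interval`, `shift_for_power`, `prob_p_below`) are hypothetical and introduced here only for illustration.

```python
from math import log
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def p_from_ratio_interval(ratio, lower, upper, level=0.95):
    """Two-sided P value against a null ratio of 1, recovered from a point
    estimate and its interval limits (assumes normality on the log scale)."""
    z_crit = Z.inv_cdf(0.5 + level / 2)              # about 1.96 for a 95% interval
    se = (log(upper) - log(lower)) / (2 * z_crit)    # standard error of log(ratio)
    z = log(ratio) / se
    return 2 * (1 - Z.cdf(abs(z)))


def shift_for_power(power, alpha=0.05):
    """Mean of the z statistic that gives roughly the requested power
    (assumption: the negligible rejection chance in the wrong tail is ignored)."""
    return Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)


def prob_p_below(alpha, shift):
    """Chance that a two-sided z-test yields P < alpha when the statistic
    is normal with mean `shift` and unit variance."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return (1 - Z.cdf(z_crit - shift)) + Z.cdf(-z_crit - shift)


# Risk ratios and 95% limits as quoted in the text (1.2; 0.97 to 1.48 and 1.09 to 1.33).
p_nonsig = p_from_ratio_interval(1.2, 0.97, 1.48)
p_sig = p_from_ratio_interval(1.2, 1.09, 1.33)

# Two independent replications, each with 80% power at the 0.05 threshold.
shift = shift_for_power(0.80)
p_small = prob_p_below(0.01, shift)        # one study obtains P < 0.01
p_large = 1 - prob_p_below(0.30, shift)    # one study obtains P > 0.30
discordant = 2 * p_small * p_large         # either study may be the 'small' one

print(f"P (wider interval):    {p_nonsig:.3f}")
print(f"P (narrower interval): {p_sig:.4f}")
print(f"P(one < 0.01 and other > 0.30): {discordant:.3f}")
```

Under these assumptions the recovered P values round to the 0.091 and 0.0003 quoted above, even though both studies share the same point estimate of 1.2. The probability of the discordant pair of replication results (one P < 0.01, the other P > 0.30) comes out at a few per cent; a different test model or a different notion of power would change that figure.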



Whether a P value is small or large, caution is warranted.

We must learn to embrace uncertainty. One practical way to do so is to rename confidence intervals as ‘compatibility intervals’ and interpret them in a way that avoids overconfidence. Specifically, we recommend that authors describe the practical implications of all values inside the interval, especially the observed effect (or point estimate) and the limits. In doing so, they should remember that all the values between the interval’s limits are reasonably compatible with the data, given the statistical assumptions used to compute the interval7,10. Therefore, singling out one particular value (such as the null value) in the interval as ‘shown’ makes no sense.

We’re frankly sick of seeing such nonsensical ‘proofs of the null’ and claims of non-association in presentations, research articles, reviews and instructional materials. An interval that contains the null value will often also contain non-null values of high practical importance. That said, if you deem all of the values inside the interval to be practically unimportant, you might then be able to say something like ‘our results are most compatible with no important effect’.

[Figure: WRONG INTERPRETATIONS. An analysis of 791 articles across 5 journals* found that around half mistakenly assume non-significance means no effect. Categories shown: appropriately interpreted; wrongly interpreted. Source: V. Amrhein et al.
*Data taken from: P. Schatz et al. Arch. Clin. Neuropsychol. 20, 1053–1059 (2005); F. Fidler et al. Conserv. Biol. 20, 1539–1544 (2006); R. Hoekstra et al. Psychon. Bull. Rev. 13, 1033–1037 (2006); F. Bernardi et al. Eur. Sociol. Rev. 33, 1–15 (2017).]

When talking about compatibility intervals, bear in mind four things. First, just because the interval gives the values most compatible with the data, given the assumptions, it doesn’t mean values outside it are incompatible; they are just less compatible. In fact, values just outside the interval do not differ substantively from those just inside the interval. It is thus wrong to claim that an interval shows all possible values.

Second, not all values inside are equally compatible with the data, given the assumptions. The point estimate is the most compatible, and values near it are more compatible than those near the limits. This is why we urge authors to discuss the point estimate, even when they have a large P value or a wide interval, as well as discussing the limits of that interval. For example, the authors above could have written: ‘Like a previous study, our results suggest a 20% increase in risk of new-onset atrial fibrillation in patients given the anti-inflammatory drugs. Nonetheless, a risk difference ranging from a 3% decrease, a small negative association, to a 48% increase, a substantial positive association, is also reasonably compatible with our data, given our assumptions.’ Interpreting the point estimate, while acknowledging its uncertainty, will keep you from making false declarations of ‘no difference’, and from making overconfident claims.

Third, like the 0.05 threshold from which it came, the default 95% used to compute intervals is itself an arbitrary convention. It is based on the false idea that there is a 95% chance that the computed interval itself contains the true value, coupled with the vague feeling that this is a basis for a confident decision. A different level can be justified, depending on the application. And, as in the anti-inflammatory-drugs example, interval estimates can perpetuate the problems of statistical significance when the dichotomization they impose is treated as a scientific standard.

Last, and most important of all, be humble: compatibility assessments hinge on the correctness of the statistical assumptions used to compute the interval. In practice, these assumptions are at best subject to considerable uncertainty7,8,10. Make these assumptions as clear as possible and test the ones you can, for example by plotting your data and by fitting alternative models, and then reporting all results.

Whatever the statistics show, it is fine to suggest reasons for your results, but discuss a range of potential explanations, not just favoured ones. Inferences should be scientific, and that goes far beyond the merely statistical. Factors such as background evidence, study design, data quality and understanding of underlying mechanisms are often more important than statistical measures such as P values or intervals.

The objection we hear most against retiring statistical significance is that it is needed to make yes-or-no decisions. But for the choices often required in regulatory, policy and business environments, decisions based on the costs, benefits and likelihoods of all potential consequences always beat those made based solely on statistical significance. Moreover, for decisions about whether to pursue a research idea further, there is no simple connection between a P value and the probable results of subsequent studies.

What will retiring statistical significance look like? We hope that methods sections and data tabulation will be more detailed and nuanced. Authors will emphasize their estimates and the uncertainty in them, for example by explicitly discussing the lower and upper limits of their intervals. They will not rely on significance tests. When P values are reported, they will be given with sensible precision (for example, P = 0.021 or P = 0.13), without adornments such as stars or letters to denote statistical significance and not as binary inequalities (P < 0.05 or P > 0.05). Decisions to interpret or to publish results will not be based on statistical thresholds. People will spend less time with statistical software, and more time thinking.

Our call to retire statistical significance and to use confidence intervals as compatibility intervals is not a panacea. Although it will eliminate many bad practices, it could well introduce new ones. Thus, monitoring the literature for statistical abuses should be an ongoing priority for the scientific community. But eradicating categorization will help to halt overconfident claims, unwarranted declarations of ‘no difference’ and absurd statements about ‘replication failure’ when the results from the original and replication studies are highly compatible. The misuse of statistical significance has done much harm to the scientific community and those who rely on scientific advice. P values, intervals and other statistical measures all have their place, but it’s time for statistical significance to go. ■

Valentin Amrhein is a professor of zoology at the University of Basel, Switzerland. Sander Greenland is a professor of epidemiology and statistics at the University of California, Los Angeles. Blake McShane is a statistical methodologist and professor of marketing at Northwestern University in Evanston, Illinois. For a full list of co-signatories, see Supplementary Information.
e-mail: [email protected]

1. Fisher, R. A. Nature 136, 474 (1935).
2. Schmidt, M. & Rothman, K. J. Int. J. Cardiol. 177, 1089–1090 (2014).
3. Wasserstein, R. L., Schirm, A. & Lazar, N. A. Am. Stat. 19.1583913 (2019).
4. Hurlbert, S. H., Levine, R. A. & Utts, J. Am. Stat. 616 (2019).
5. Lehmann, E. L. Testing Statistical Hypotheses 2nd edn 70–71 (Springer, 1986).
6. Gigerenzer, G. Adv. Meth. Pract. Psychol. Sci. 1, 198–218 (2018).
7. Greenland, S. Am. J. Epidemiol. 186, 639–645 (2017).
8. McShane, B. B., Gal, D., Gelman, A., Robert, C. & Tackett, J. L. Am. Stat. 0031305.2018.1527253 (2019).
9. Gelman, A. & Loken, E. Am. Sci. 102, 460–465 (2014).
10. Amrhein, V., Trafimow, D. & Greenland, S. Am. Stat. 543137 (2019).

Supplementary information accompanies this article; see
