
COMMENT

Retire statistical significance

Valentin Amrhein, Sander Greenland, Blake McShane and more than 800 signatories call for an end to hyped claims and the dismissal of possibly crucial effects.

ILLUSTRATION BY DAVID PARKINS

When was the last time you heard a seminar speaker claim there was 'no difference' between two groups because the difference was 'statistically non-significant'?

If your experience matches ours, there's a good chance that this happened at the last talk you attended. We hope that at least someone in the audience was perplexed if, as frequently happens, a plot or table showed that there actually was a difference.

How do statistics so often lead scientists to deny differences that those not educated in statistics can plainly see? For several generations, researchers have been warned that a statistically non-significant result does not 'prove' the null hypothesis (the hypothesis that there is no difference between groups or no effect of a treatment on some measured outcome)1. Nor do statistically significant results 'prove' some other hypothesis. Such misconceptions have famously warped the literature with overstated claims and, less famously, led to claims of conflicts between studies where none exists.

We have some proposals to keep scientists from falling prey to these misconceptions.

PERVASIVE PROBLEM
Let's be clear about what must stop: we should never conclude there is 'no difference' or 'no association' just because a P value is larger than a threshold such as 0.05 or, equivalently, because a confidence interval includes zero. Neither should we conclude that two studies conflict because one had a statistically significant result and the other did not. These errors waste research efforts and misinform policy decisions.

For example, consider a series of analyses of unintended effects of anti-inflammatory drugs2. Because their results were statistically non-significant, one set of researchers concluded that exposure to the drugs was "not associated" with new-onset atrial fibrillation (the most common disturbance to heart rhythm) and that the results stood in contrast to those from an earlier study with a statistically significant outcome.

Now, let's look at the actual data. The researchers describing their statistically non-significant results found a risk ratio of 1.2 (that is, a 20% greater risk in exposed patients relative to unexposed ones). They also found a 95% confidence interval that spanned everything from a trifling risk decrease of 3% to a considerable risk increase of 48% (P = 0.091; our calculation). The researchers from the earlier, statistically significant, study found the exact same risk ratio of 1.2. That study was simply more precise, with an interval spanning from 9% to 33% greater risk (P = 0.0003; our calculation).

It is ludicrous to conclude that the statistically non-significant results showed "no association", when the interval estimate included serious risk increases; it is equally absurd to claim these results were in contrast with the earlier results showing an identical observed effect. Yet these common practices show how reliance on thresholds of statistical significance can mislead us (see 'Beware false conclusions').

[Figure: BEWARE FALSE CONCLUSIONS. Studies currently dubbed 'statistically significant' and 'statistically non-significant' need not be contradictory, and such designations might cause genuine effects to be dismissed. The plot contrasts a 'significant' study (low P value) with a 'non-significant' study (high P value) on an axis running from decreased effect through no effect to increased effect: the observed effect (or point estimate) is the same in both studies, so they are not in conflict, even if one is 'significant' and the other is not. Source: V. Amrhein et al.]

These and similar errors are widespread. Surveys of hundreds of articles have found that statistically non-significant results are interpreted as indicating 'no difference' or 'no effect' in around half (see 'Wrong interpretations' and Supplementary Information).

[Figure: WRONG INTERPRETATIONS. An analysis of 791 articles across 5 journals* found that around half mistakenly assume …]

In 2016, the American Statistical Association released a statement in The American Statistician warning against the misuse of statistical significance and P values. The issue also included many commentaries on the subject. This month, a special issue in the same journal attempts to push these reforms further. It presents more than 40 papers on 'Statistical inference in the 21st century: a world beyond P < 0.05'. The editors introduce the collection with the caution "don't say 'statistically significant'"3. Another article4 with dozens of signatories also calls on authors and journal editors to disavow those terms.

We agree, and call for the entire concept of statistical significance to be abandoned.

"Eradicating categorization will help to halt overconfident claims, unwarranted declarations of 'no difference' and absurd statements about 'replication failure'."

We are far from alone. When we invited others to read a draft of this comment and sign their names if they concurred with our message, 250 did so within the first 24 hours. A week later, we had more than 800 signatories — all checked for an academic affiliation or other indication of present or past work in a field that depends on statistical modelling (see the list and final count of signatories in the Supplementary Information). These include statisticians, clinical and medical researchers, biologists and psychologists from more than 50 countries and across all continents except Antarctica. One advocate called it a "surgical strike against thoughtless testing of statistical significance" and "an opportunity to register your voice in favour of better scientific practices".

We are not calling for a ban on P values. Nor are we saying they cannot be used as a decision criterion in certain specialized applications (such as determining whether a manufacturing process meets some quality-control standard). And we are also not advocating for an anything-goes situation, in which weak evidence suddenly becomes credible. Rather, and in line with many others over the decades, we are calling for a stop to the use of P values in the conventional, dichotomous way — to decide whether a result refutes or supports a scientific hypothesis5.

QUIT CATEGORIZING
The trouble is human and cognitive more than it is statistical: bucketing results into 'statistically significant' and 'statistically non-significant' makes people think that the items assigned in that way are categorically different6–8. The same problems are likely to arise under any proposed statistical alternative that involves dichotomization, whether frequentist, Bayesian or otherwise.

Unfortunately, the false belief that crossing the threshold of statistical significance is enough to show that a result is 'real' has led scientists and journal editors to privilege such results, thereby distorting the literature. Statistically significant estimates are biased upwards in magnitude, and potentially to a large degree, whereas statistically non-significant estimates are biased downwards in magnitude. Consequently, any discussion that focuses on estimates chosen for their significance will be biased. On top of this, the rigid focus on statistical significance encourages researchers to choose data and methods that yield statistical significance for some desired (or simply publishable) result, or that yield statistical non-significance for an undesired result, such as potential side effects of drugs — thereby invalidating conclusions.

The pre-registration of studies and a commitment to publish all results of all analyses can do much to mitigate these issues. However, even results from pre-registered studies can be biased by decisions invariably left open in the analysis plan9. This occurs even with the best of intentions.

Again, we are not advocating a ban on P values, confidence intervals or other statistical measures — only that we should not treat them categorically. This includes dichotomization as statistically significant or not, as well as categorization based on other statistical measures such as Bayes factors.

One reason to avoid such 'dichotomania' is that all statistics, including P values and confidence intervals, naturally vary from study to study, and often do so to a surprising degree. In fact, random variation alone can easily lead to large disparities in P values, far beyond falling just to either side of the 0.05 threshold. For example, even if researchers could conduct two perfect replication studies of some genuine effect, each with 80% power (chance) of achieving P < 0.05, it would not be very surprising for one to obtain P < 0.01 and the other P > 0.30.

Whether a P value is small or large, caution is warranted. We must learn to embrace uncertainty. One practical way

… and data tabulation will be more detailed and nuanced. Authors will emphasize their estimates and the uncertainty in them — for

21 MARCH 2019 | VOL 567 | NATURE | 305

©2019 Springer Nature Limited. All rights reserved.
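The two "our calculation" P values quoted for the atrial-fibrillation studies can be reproduced from the published risk ratios and 95% confidence intervals alone. A minimal sketch, assuming the standard normal (Wald) approximation on the log risk-ratio scale; the helper function and its name are ours, not from the article:

```python
import math

def p_value_from_rr_ci(rr, lo, hi):
    """Approximate two-sided P value for a risk ratio, recovered from its
    95% confidence interval via the normal (Wald) approximation on the
    log scale. (Illustrative helper; not the authors' code.)"""
    se = (math.log(hi) - math.log(lo)) / (2 * 1.96)    # SE of the log risk ratio
    z = math.log(rr) / se                              # z statistic against RR = 1
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF
    return 2 * (1 - phi)

# 'Non-significant' study: RR 1.2, 95% CI from a 3% decrease to a 48% increase
print(round(p_value_from_rr_ci(1.2, 0.97, 1.48), 3))   # 0.091

# 'Significant' study: identical RR 1.2, 95% CI from 9% to 33% greater risk
print(round(p_value_from_rr_ci(1.2, 1.09, 1.33), 4))   # 0.0003
```

The identical point estimate of 1.2 drops out of both calls; only the interval widths, and hence the P values, differ, which is the article's point.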
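The claim that two perfect replications, each with 80% power, can easily land at P < 0.01 and P > 0.30 is straightforward to check by simulation. A minimal sketch under a simple z-test model (our modelling choice for illustration; the article does not specify one):

```python
import math
import random

def two_sided_p(z):
    """Two-sided P value for a z statistic under the standard normal."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 80% power at alpha = 0.05 corresponds to a z statistic centred near
# 1.96 + 0.84 = 2.80 (0.84 being the 80th percentile of the standard normal).
MU = 2.80

random.seed(20190321)  # arbitrary seed for reproducibility
TRIALS = 100_000
disparate = 0
for _ in range(TRIALS):
    p1 = two_sided_p(random.gauss(MU, 1))  # replication study 1
    p2 = two_sided_p(random.gauss(MU, 1))  # replication study 2
    if min(p1, p2) < 0.01 and max(p1, p2) > 0.30:
        disparate += 1

# Roughly 5% of pairs of perfect replications split this sharply.
print(f"{disparate / TRIALS:.1%}")
```

So even with a genuine effect and ideal studies, a 'significant'/'non-significant' split of this size occurs in a non-trivial fraction of replication pairs, supporting the article's warning against reading such splits as conflict.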