
Editorial. Postgrad Med J: first published as 10.1136/postgradmedj-2019-137079 on 29 October 2019.

Problem with p values: why p values do not tell you if your treatment is likely to work

Robert Price,1 Rob Bethune,2,3 Lisa Massey2

1Anaesthetics, Royal Devon and Exeter NHS Foundation Trust, Exeter, UK
2Surgery, Royal Devon and Exeter NHS Foundation Trust, Exeter, UK
3South West Academic Health Science Network, Exeter, UK

Correspondence to Dr Robert Price, Anaesthetics, Exeter, Devon, UK; robert.price1@nhs.net

Introduction
Medicine has made remarkable progress within the lifetime of the oldest members of our society. Evidence from trials has come to replace expert opinion as the arbiter of treatment effectiveness. Following the work of Fisher (1925) and Neyman and Pearson (1933), null hypothesis significance testing (NHST) and the p value have become the cornerstones of clinical research. The attractions of a rule-based 'algorithm' approach are that it is easy to implement, permits binary decisions to be made and makes it simple for investigators, editors, readers and funding bodies to count or discount the work. But does that make it reliable? Even in the early days, this approach was controversial. Worse still, as these ideas became broadly adopted, fundamental misinterpretations were embedded in the literature and practice of biomedical research.1

Fisher originally proposed using the exact p value for a single trial as an indication of the credibility of the null hypothesis when considered together with all the available evidence.2 It is worth noting that the null hypothesis does not necessarily mean no difference between groups (that is the nil null hypothesis3); rather, it is the hypothesis we aim to nullify by experiment, which can provide more powerful evidence if it includes a quantitative prediction of the expected difference. In the following decade, Neyman and Pearson developed the concepts of the alternative hypothesis, α, power, β, type I error and type II error as a formal decision-making procedure to automate industrial production control. This method requires multiple samples and repeated analyses to control the long-term error rates, using the p value only as a binary decision-making threshold. Despite the inherent conflict in these two approaches, they have been fused into NHST for single trials, where a p value threshold is used to accept or reject the null hypothesis.2

The problem with p values and their misinterpretation
For several decades, there has been a failure by many authors to realise that all these probabilities (p, α, β, type I error, type II error) are conditional.4 The order of terms matters in conditional probabilities, and they must be used according to a simple set of mathematical rules or their meanings are changed. However, the shorthand notation adopted in biomedical research assumes that everyone knows this. This makes it easy for authors to offer their own mistaken interpretations in the belief that they are providing clarity.

For example, the standard definition of the p value is 'the probability of having observed our data (or something more extreme) given the null hypothesis is true'.5 In NHST, the p value is a conditional probability, conditioned on the null hypothesis always being true. Unfortunately, authors often replace given with due to, without realising that this makes it the probability of a completely different event. We calculate p values because it is easy, not because they are the probability we really want. We usually want the more intuitive probability: that our hypothesis is true given the data. It is tempting to claim that this is the same as the p value and that if p=0.05 there is a 1 in 20 probability that the data arose by chance alone (the null hypothesis). This is incorrect: because the p value is conditioned on the null hypothesis being true, it cannot be used to give a probability that the null hypothesis is true or indeed false.4 6

This error is an example of the fallacy of the transposed conditional, or the prosecutor's fallacy, so called because it may be used by the prosecution to exaggerate the weight of evidence against a defendant in a criminal trial. Similarly, this common misinterpretation of the p value exaggerates the weight of evidence against the null hypothesis. What we actually need is the false discovery rate (FDR), which is the proportion of reported discoveries that are false positives. Some authors prefer the term false positive risk, to emphasise that this is the risk that, having observed a significant p value from a single experiment, it is a false positive.7 The FDR can be calculated from the power, the type I error rate and an estimate of the prevalence of real effects among the many ideas we may test.8 It is crucial to note that the type I error rate is the long-term probability of a false positive for a single experiment repeated with exact replication. It is not the same as the FDR, which applies to a single run of each experiment. Even when experiments are perfectly designed and executed without bias, missing data or multiple statistical testing, the FDR in NHST using a p<0.05 threshold has been estimated to be in the 30%–60% range, depending on the field of research and the power of the study.9–12

Measles and spots
Another example may help to clarify the problem of the transposed conditional. Consider the difference between the probability of having spots given you have measles, which is about 1, and the probability that the spots you have are caused by measles. In the latter, spots have become certain, the given, while measles has become an uncertain cause of the spots. So, by changing the word given to caused by, we have inadvertently inverted the conditional probability and now have a probability which is much less than 1. There are many other causes of spots than measles, but if you have measles you are very likely to have spots. This example shows the danger of changing given to caused by, which is what many researchers do.

In the last decade, attempts have been made in social science, psychology, medicine and pharmacology to replicate important experiments, with very disappointing results. This has led to increasing concern about the validity of much scientific work. Attention has inevitably turned to the statistical techniques and our interpretation of them.2 11

In biomedical research, we often do not know the effect size, as it is frequently small; replication is difficult; and variability is often large and poorly known. Crucially, we only do the experiment once and have only a one-point estimate of the p value. Additionally, our theories are only weakly predictive and do not generate precise numerical quantities that can be checked in quantitative experiments, as is possible in the physical sciences. NHST does not perform well under these circumstances.
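The FDR arithmetic described above can be sketched in a few lines. The prevalence of real effects among tested hypotheses is unknowable in practice, so the 10% figure below is purely an illustrative assumption, and the formula is one common way of combining prevalence with the type I error rate and power:

```python
# Sketch of the false discovery rate (FDR) calculation described above.
# The prevalence of real effects is an illustrative assumption, not a
# measured quantity.

def false_discovery_rate(prevalence, alpha=0.05, power=0.8):
    """Proportion of 'significant' results that are false positives."""
    true_positives = power * prevalence          # real effects correctly detected
    false_positives = alpha * (1 - prevalence)   # null effects crossing p < alpha
    return false_positives / (true_positives + false_positives)

# If only 10% of the hypotheses we test are real effects:
print(round(false_discovery_rate(0.1), 2))   # 0.36, i.e. ~36% of 'discoveries' are false
```

Note how the FDR climbs as the prevalence of real effects falls, even with the type I error rate fixed at 5%; this is why a field testing many speculative hypotheses can see FDRs in the 30%–60% region quoted above.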

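The measles example can be made concrete with Bayes' theorem, which is the rule for inverting a conditional probability correctly. Every probability below is invented purely for illustration:

```python
# Bayes' theorem applied to the measles example: inverting a conditional
# probability changes its value dramatically. All numbers are invented
# for illustration only.

def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

p_spots_given_measles = 0.98   # having measles makes spots almost certain
p_measles = 0.001              # assumed prevalence of measles
p_spots = 0.05                 # overall probability of spots, from any cause

p_measles_given_spots = bayes(p_spots_given_measles, p_measles, p_spots)
print(round(p_measles_given_spots, 4))   # 0.0196: same words reordered, very different number
```

Swapping "spots given measles" (0.98) for "measles given spots" (about 0.02) is exactly the transposed conditional: the words look interchangeable, the numbers are not.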
Price R, et al. Postgrad Med J January 2020 Vol 96 No 1131

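The point that a single experiment yields only a one-point estimate of the p value can be shown by simulation. The effect size, sample size and test below are arbitrary choices for illustration; a one-sample z-test with known variance keeps the sketch self-contained:

```python
# Simulation: repeated runs of the *same* well-designed experiment give
# widely varying p values, so a single run is only a one-point estimate.
# Effect size, sample size and seed are arbitrary illustrative choices.
import math
import random

def one_sided_p(sample, sigma=1.0):
    """p value for H0: mean = 0 against mean > 0, with sigma known."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))  # 1 - normal CDF

random.seed(1)
true_effect = 0.5   # a genuine, moderate effect is present in every run
p_values = [
    one_sided_p([random.gauss(true_effect, 1.0) for _ in range(30)])
    for _ in range(10)
]
print([round(p, 3) for p in p_values])  # ten runs of one experiment, ten scattered p values
```

Running this repeatedly (or changing the seed) shows p values ranging over orders of magnitude even though the underlying effect never changes, which is why treating one observed p value as a stable property of the treatment is misleading.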
Added to this are the frequent misunderstandings about the meaning of the p value. To reiterate, the p value is strictly the probability of obtaining data as extreme as ours, or more so, given the null hypothesis is true.

To investigate how common this misunderstanding is in medical textbooks, we audited the definition of the term p value in books held in the medical library of the University of Exeter Medical School at the Royal Devon and Exeter Foundation Trust (table 1). We included texts under the subject groups statistics, research, evidence-based medicine and examination revision (see online supplementary data file 1).

Table 1  Fraction of correct definitions of the p value found in medical textbooks by library subject classification

  Subject area               Fraction of correct and unambiguous definitions of the p value
  Evidence-based medicine    3/15
  Exam revision              3/17
  Research                   8/13
  Statistics                 4/4
  Total over all subjects    18/49

The most common error was to claim that the p value is the probability that the data were generated by chance alone. This definition has been frequently and vigorously refuted in the statistics literature but is very persistent in medical textbooks and journals.4–6

Beyond misinterpretations of p values, there are also widespread problems with multiple testing, sometimes inadvertent, which grossly inflate the proportion of false positive results. This is known as 'p-hacking' or 'data dredging' and allows researchers to selectively report spurious results as significant.13

Discussion
Given the scale of this problem, what should be done? There are two main areas to address. First, we need to teach the correct statistical interpretation of NHST because of the huge volume of trials already published; this has been attempted without success for at least the last 40 years. Second, we need to move to statistical models that are better suited to current research problems and that address some of the shortcomings of NHST. Both of these issues will need the entire research community to change. Research funding agencies, universities and journals must recognise that they have played a key role in promoting a culture where the p value has had primacy over reason. Researchers must resist redefining statistical quantities to suit their own arguments, because it is mathematically wrong to do so. Possible statistical approaches include the use of effect size estimation accompanied by 95% CIs, a reduction in the p value threshold for significance,14 rigorous trial pre-registration and the mandatory publication of negative studies. However, CIs are also misinterpreted and are often used to produce results identical to those of a p<0.05 significance test. Reducing the p value threshold will reduce the false positive rate at the cost of an increase in the false negative rate, particularly in under-powered studies.

A more intuitive methodology is Bayesian statistics, which calculates the probability that a hypothesis is true given the data; this is mostly what researchers and readers actually assume the p value to be. The mathematical and logical basis of Bayes theory has been established by Cox,15 Jaynes16 and others.17 However, it is not without its problems, and if misapplied it can be just as misleading as NHST. It makes it possible to directly compare the probabilities of different hypotheses being true, updating knowledge as new data arrive. Because Bayesians use probability distributions, they can easily calculate a sensible measure of uncertainty on a parameter, the Bayesian credible interval. This is much more meaningful than the frequentist CI, which is again based on performance over many repetitions but is measured only once.

In the past, the two major objections to Bayesian methods have been the difficulty of calculating intractable integrals and the use of prior probabilities. The first is a practical point, while the second is philosophical. On the first point, computing power, Markov chain Monte Carlo algorithms, Gibbs samplers and open-source Bayesian statistics software have made numerical solutions to the integrals required for Bayesian data analysis relatively easy to obtain, which was not the case in the past. On the second point, we already know that prior probabilities influence NHST FDRs, even though this is ignored in the data analysis.8 9 12 It would be much better to state the priors explicitly and test the analysis for sensitivity to different assumptions, as can be done in Bayesian statistics.

We urge authors and editors to demote the prominence of p values in journal articles, to have the actual null hypothesis formally stated at the beginning of the article and to promote the use of the more intuitive (but harder to calculate) Bayesian statistics.

To the point
►► The p value in null hypothesis significance testing is conditioned on the null hypothesis being true.
►► This means that a p value of 0.05 does not mean that the probability our data arose by chance alone is 1 in 20.
►► In fact, the chance of us mistakenly rejecting the null hypothesis and concluding we have a successful treatment is more in the region of 30%–60%.
►► Scientific journals and textbooks need to be explicit about how p values are used and defined.
►► Use of the more intuitive Bayesian statistics should become more widespread.

Correction notice This article has been corrected since it was published Online First. This paper is now Open Access.
Contributors RP and RB conceived the idea of writing the article. RP was the primary author and conducted the audit; RB and LM contributed to drafting and revising the article, in particular helping to clarify the text and make it suitable for a wider medical audience.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None declared.
Patient consent for publication Not required.
Provenance and peer review Not commissioned; externally peer reviewed.
Open access This is an open access article distributed in accordance with the Creative Commons Attribution 4.0 Unported (CC BY 4.0) license, which permits others to copy, redistribute, remix, transform and build upon this work for any purpose, provided the original work is properly cited, a link to the licence is given, and indication of whether changes were made. See: https://creativecommons.org/licenses/by/4.0/.
© Author(s) (or their employer(s)) 2020. Re-use permitted under CC BY. Published by BMJ.
►► Additional material is published online only. To view please visit the journal online (http://dx.doi.org/10.1136/postgradmedj-2019-137079).


To cite: Price R, Bethune R, Massey L. Postgrad Med J 2020;96:1–3.

Received 28 August 2019
Revised 14 October 2019
Accepted 17 October 2019
Published Online First 29 October 2019

Postgrad Med J 2020;96:1–3. doi:10.1136/postgradmedj-2019-137079

ORCID iDs
Robert Price http://orcid.org/0000-0002-1933-8728
Rob Bethune http://orcid.org/0000-0002-1855-0639

References
1 Westover MB, Westover KD, Bianchi MT. Significance testing as perverse probabilistic reasoning. BMC Med 2011;9:20.
2 Szucs D, Ioannidis JPA. When null hypothesis significance testing is unsuitable for research: a reassessment. Front Hum Neurosci 2017;11:390.
3 Cohen J. The earth is round (p<0.05). American Psychologist 1994;49:997–1003.
4 Goodman S. A dirty dozen: twelve p-value misconceptions. Semin Hematol 2008;45:135–40.
5 Altman DG. Practical statistics for medical research. 1st edition. London: Chapman and Hall, 1991: 167–71.
6 Wasserstein RL, Lazar NA. The ASA statement on p-values: context, process, and purpose. Am Stat 2016;70:129–33.
7 Colquhoun D. The reproducibility of research and the misinterpretation of p-values. R Soc Open Sci 2017;4.
8 Vidgen B, Yasseri T. P-values: misunderstood and misused. Front Phys 2016;4:1–5.
9 Kirkwood BR, Sterne JAC. Essential medical statistics. 2nd edition. Massachusetts: Blackwell Science Ltd, 2003: 426–8.
10 Ioannidis JPA. Why most published research findings are false. PLoS Med 2005;2:e124.
11 Begley CG, Ioannidis JPA. Reproducibility in science: improving the standard for basic and preclinical research. Circ Res 2015;116:116–26.
12 Nuzzo R. Statistical errors. Nature 2014;506:150–2.
13 Davey Smith G, Ebrahim S. Data dredging, bias, or confounding: they can all get you into the BMJ and the Friday papers. BMJ 2002;325:1437–8.
14 Benjamin DJ, Berger JO, Johannesson M, et al. Redefine statistical significance. Nat Hum Behav 2018;2:6–10.
15 Cox RT. Probability, frequency and reasonable expectation. Am J Phys 1946;14:1–13.
16 Jaynes ET. Probability theory: the logic of science. Cambridge University Press, 2003.
17 Terenin A, Draper D. Cox's theorem and the Jaynesian interpretation of probability. arXiv:1507.06597v2 [math.ST], 2017.
