Statistical Tests, P Values, Confidence Intervals, and Power
Eur J Epidemiol (2016) 31:337–350. DOI 10.1007/s10654-016-0149-3

ESSAY

Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations

Sander Greenland (1) · Stephen J. Senn (2) · Kenneth J. Rothman (3) · John B. Carlin (4) · Charles Poole (5) · Steven N. Goodman (6) · Douglas G. Altman (7)

Received: 9 April 2016 / Accepted: 9 April 2016 / Published online: 21 May 2016
© The Author(s) 2016. This article is published with open access at Springerlink.com

Abstract: Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant. A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof. Instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists. This high cognitive demand has led to an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously so, and yet these misinterpretations dominate much of the scientific literature. In light of this problem, we provide definitions and a discussion of basic statistics that are more general and critical than typically found in traditional introductory expositions. Our goal is to provide a resource for instructors, researchers, and consumers of statistics whose knowledge of statistical theory and technique may be limited but who wish to avoid and spot misinterpretations. We emphasize how violation of often unstated analysis protocols (such as selecting analyses for presentation based on the P values they produce) can lead to small P values even if the declared test hypothesis is correct, and can lead to large P values even if that hypothesis is incorrect. We then provide an explanatory list of 25 misinterpretations of P values, confidence intervals, and power. We conclude with guidelines for improving statistical interpretation and reporting.

Editor's note: This article has been published online as supplementary material with an article of Wasserstein RL, Lazar NA. The ASA's statement on p-values: context, process and purpose. The American Statistician 2016. (Albert Hofman, Editor-in-Chief EJE)

Affiliations: (1) Department of Epidemiology and Department of Statistics, University of California, Los Angeles, CA, USA; (2) Competence Center for Methodology and Statistics, Luxembourg Institute of Health, Strassen, Luxembourg; (3) RTI Health Solutions, Research Triangle Institute, Research Triangle Park, NC, USA; (4) Clinical Epidemiology and Biostatistics Unit, Murdoch Children's Research Institute, School of Population Health, University of Melbourne, Melbourne, VIC, Australia; (5) Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, NC, USA; (6) Meta-Research Innovation Center, Departments of Medicine and of Health Research and Policy, Stanford University School of Medicine, Stanford, CA, USA; (7) Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford, UK

Keywords: Confidence intervals · Hypothesis testing · Null testing · P value · Power · Significance tests · Statistical testing

Introduction

Misinterpretation and abuse of statistical tests has been decried for decades, yet remains so rampant that some scientific journals discourage use of "statistical significance" (classifying results as "significant" or not based on a P value) [1]. One journal now bans all statistical tests and mathematically related procedures such as confidence intervals [2], which has led to considerable discussion and debate about the merits of such bans [3, 4].

Despite such bans, we expect that the statistical methods at issue will be with us for many years to come. We thus think it imperative that basic teaching as well as general understanding of these methods be improved. Toward that end, we attempt to explain the meaning of significance tests, confidence intervals, and statistical power in a more general and critical way than is traditionally done, and then review 25 common misconceptions in light of our explanations. We also discuss a few more subtle but nonetheless pervasive problems, explaining why it is important to examine and synthesize all results relating to a scientific question, rather than focus on individual findings. We further explain why statistical tests should never constitute the sole input to inferences or decisions about associations or effects. Among the many reasons are that, in most scientific settings, the arbitrary classification of results into "significant" and "non-significant" is unnecessary for and often damaging to valid interpretation of data; and that estimation of the size of effects and the uncertainty surrounding our estimates will be far more important for scientific inference and sound judgment than any such classification.

More detailed discussion of the general issues can be found in many articles, chapters, and books on statistical methods and their interpretation [5–20]. Specific issues are covered at length in these sources and in the many peer-reviewed articles that critique common misinterpretations of null-hypothesis testing and "statistical significance" [1, 12, 21–74].

Statistical tests, P values, and confidence intervals: a caustic primer

Statistical models, hypotheses, and tests

Every method of statistical inference depends on a complex web of assumptions about how data were collected and analyzed, and how the analysis results were selected for presentation. The full set of assumptions is embodied in a statistical model that underpins the method. This model is a mathematical representation of data variability, and thus ideally would capture accurately all sources of such variability. Many problems arise, however, because this statistical model often incorporates unrealistic or at best unjustified assumptions. This is true even for so-called "non-parametric" methods, which (like other methods) depend on assumptions of random sampling or randomization. These assumptions are often deceptively simple to write down mathematically, yet in practice are difficult to satisfy and verify, as they may depend on successful completion of a long sequence of actions (such as identifying, contacting, obtaining consent from, obtaining cooperation of, and following up subjects, as well as adherence to study protocols for treatment allocation, masking, and data analysis).

There is also a serious problem of defining the scope of a model, in that it should allow not only for a good representation of the observed data but also of hypothetical alternative data that might have been observed. The reference frame for data that "might have been observed" is often unclear, for example if multiple outcome measures or multiple predictive factors have been measured, and many decisions surrounding analysis choices have been made after the data were collected, as is invariably the case [33].

The difficulty of understanding and assessing underlying assumptions is exacerbated by the fact that the statistical model is usually presented in a highly compressed and abstract form, if presented at all. As a result, many assumptions go unremarked and are often unrecognized by users as well as consumers of statistics. Nonetheless, all statistical methods and interpretations are premised on the model assumptions; that is, on an assumption that the model provides a valid representation of the variation we would expect to see across data sets, faithfully reflecting the circumstances surrounding the study and phenomena occurring within it.

In most applications of statistical testing, one assumption in the model is a hypothesis that a particular effect has a specific size, and has been targeted for statistical analysis. (For simplicity, we use the word "effect" when "association or effect" would arguably be better in allowing for noncausal studies such as most surveys.) This targeted assumption is called the study hypothesis or test hypothesis, and the statistical methods used to evaluate it are called statistical hypothesis tests. Most often, the targeted effect size is a "null" value representing zero effect (e.g., that the study treatment makes no difference in average outcome), in which case the test hypothesis is called the null hypothesis. Nonetheless, it is also possible to test other effect sizes. We may also test hypotheses that the effect does or does not fall within a specific range; for example, we may test the hypothesis that the effect is no greater than a particular amount, in which case the hypothesis is said to be a one-sided or dividing hypothesis [7, 8].

Much statistical teaching and practice has developed a strong (and unhealthy) focus on the idea that the main aim

Specifically, the distance between the data and the model prediction is measured using a test statistic (such as a t-statistic or a chi-squared statistic). The P value is then the probability that the chosen test statistic would have been at least as large as its observed value if every model assumption were correct, including the test hypothesis. This definition embodies a crucial
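The P value definition above can be made concrete with a short simulation. The sketch below (hypothetical data, not code or an example from the article) uses the absolute difference in group means as the test statistic for a randomized two-group comparison, and estimates the probability that re-randomized data would yield a statistic at least as large as the one observed when every assumption, including the zero-effect test hypothesis, is built into the simulation:

```python
import random

# Hypothetical outcomes for two groups (e.g., treated vs. control).
treated = [4.1, 5.0, 6.2, 5.8, 4.9, 6.5]
control = [3.9, 4.4, 5.1, 4.0, 4.7, 4.2]

def mean_diff(a, b):
    return sum(a) / len(a) - sum(b) / len(b)

observed = mean_diff(treated, control)

# Under the model assumptions (random treatment assignment) plus the
# test hypothesis (zero effect), group labels are exchangeable, so the
# distribution of the test statistic comes from re-randomizing labels.
random.seed(0)
pooled = treated + control
n_sims = 100_000
count_at_least_as_large = 0
for _ in range(n_sims):
    random.shuffle(pooled)
    sim = mean_diff(pooled[:len(treated)], pooled[len(treated):])
    if abs(sim) >= abs(observed):   # "at least as large", two-sided
        count_at_least_as_large += 1

p_value = count_at_least_as_large / n_sims
print(round(p_value, 3))
```

The resulting proportion estimates exactly the probability the definition describes; note that a small value signals that the data are unusual under the full set of assumptions taken together, not under the test hypothesis alone.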
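The warning given earlier, that selecting analyses for presentation based on the P values they produce can lead to small P values even when the declared test hypothesis is correct, can likewise be checked numerically. In this hypothetical sketch every null hypothesis is exactly true, yet a researcher who tests 20 independent outcomes per study and reports only the smallest P value obtains a "significant" result far more often than the nominal 5 % of the time:

```python
import math
import random
import statistics

random.seed(1)

def two_sided_p(z):
    # Two-sided P value for a z statistic (normal approximation).
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def smallest_p(n_outcomes=20, n_per_group=50):
    # One study: every outcome is drawn from the same distribution in
    # both groups, so each declared test (null) hypothesis is true.
    smallest = 1.0
    for _ in range(n_outcomes):
        a = [random.gauss(0, 1) for _ in range(n_per_group)]
        b = [random.gauss(0, 1) for _ in range(n_per_group)]
        se = math.sqrt(statistics.variance(a) / n_per_group
                       + statistics.variance(b) / n_per_group)
        z = (statistics.mean(a) - statistics.mean(b)) / se
        smallest = min(smallest, two_sided_p(z))
    return smallest

n_studies = 1000
false_alarms = sum(smallest_p() < 0.05 for _ in range(n_studies))
print(false_alarms / n_studies)   # far above the nominal 0.05
```

With 20 independent tests of true nulls, the chance that the smallest P value falls below 0.05 is 1 − 0.95^20 ≈ 0.64, which the simulated rate should approximate: the small P value reflects the unstated selection protocol, not any real effect.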