
EDITORIAL VIEWS

Anesthesiology 2007; 106:415–7
Copyright © 2007, the American Society of Anesthesiologists, Inc. Lippincott Williams & Wilkins, Inc.

Importance of Effect Sizes for the Accumulation of Knowledge

Editor's Note: Starting in January 2007, all manuscripts accepted for publication in ANESTHESIOLOGY are undergoing review by Timothy Houle, Ph.D., for statistical testing and reporting. There are two reasons for this additional review. As indicated in the following editorial, authors and readers often confuse P values with magnitude of effects, when it is oftentimes the latter that matters most. One goal for a universal statistical review is to remind authors to report and emphasize the magnitude of the effects they observe, in both the Results and Discussion sections, rather than restricting their comments to P values only. In addition, the medical literature, including numerous reviews and original articles in ANESTHESIOLOGY, relies on combining results from separately published reports to reach consensus conclusions. A second goal for statistical review is to assure that statistical reporting is provided in a manner that facilitates this subsequent research.

For the past several decades, there has been an escalating debate regarding the appropriate techniques to evaluate scientific hypotheses. In fact, the traditional method of significance testing is now being called into question. The discussion that follows is the first of several editorials over the coming months that will examine possible methods for reporting scientific findings, looking at their strengths and weaknesses and at common reporting errors, beginning with a comparison of interpreting P values versus effect size measures. My hope for these discussions is that they will lead to clear guidelines for statistical reporting that will eventually be incorporated into the Instructions for Authors in this journal and others in our specialty.

James C. Eisenach, M.D., Editor-in-Chief

Accepted for publication November 28, 2006. The author is not supported by, nor maintains any financial interest in, any commercial activity that may be associated with the topic of this subject.

SINCE Sir Ronald Fisher, who is widely considered the father of inferential statistics,1 there has been ongoing debate about the proper way to evaluate scientific hypotheses. In particular, the value of significance testing, or the use of a P value–based criterion, has come under scrutiny, with some authors asserting that the use of P values should be discontinued altogether.2 Indeed, there are several convincing arguments that the reliance on P values to evaluate hypotheses is problematic, specifically that it is a practice plagued by misunderstanding,3 a perversion of the entire philosophy of hypothesis testing,3 and even a barrier to the accumulation of knowledge.2
Despite these acrid criticisms, the use of P values as the sole arbiter for evaluating hypotheses remains widespread. This debate over the continued use of P values has persisted for decades and is not likely to be settled in the near future, because the need for which the use of P values arose remains the same: to provide a standard for evaluating the reliability of the effects under study. Considering the litany of problems accompanying reliance on P values, what is the alternative?

A substantial improvement over traditional hypothesis testing is to report an effect size when publishing research. This conclusion certainly is not new in the general scientific sense3,4 and has already been suggested in anesthesiology research.5 Reporting effect sizes provides all of the information available from "traditional" statistical reporting but also adds an element of the magnitude of the effect that simply cannot be ascertained from P values alone. Despite the advantages of effect size measures over P values, many scientists remain largely unaware of effect size reporting and frequently neglect to report a measure of treatment (or experimental) effect. With the goal of increasing awareness of effect size reporting, what follows is an introduction to effect sizes, a description of several commonly used effect size measures, and recommendations for the inclusion of effect size measures as a standard in research reports.

An effect size is simply an index of the magnitude of the observed relation under study.6 Whereas a P value provides a dichotomous indication of the presence or absence of an effect, and at most a direction of the effect (in the case of one-tailed tests), an effect size characterizes the degree of the observed effect. The simplicity of this definition conceals the complexity of the concept, especially when one considers the wide array of possible metrics, inferences, and hypotheses that could be tested. Although there are many possible choices for measuring effect size, P values as a measure of effect are always a poor choice and should not be used.

P values are not measures of effect size. Quite often, statistical significance (as indicated by the P value) is mistaken as indicating large and/or meaningful effects. But a statistically significant P value, regardless of its size, indicates neither magnitude nor clinical meaningfulness. This is so because statistical significance confounds the size of the effect with the size of the sample, to the extent that even the most minuscule effect can be made statistically significant with a large enough sample size; or, as Thompson7 remarks:

Statistical significance testing can involve a tautological logic in which tired researchers, having collected data on hundreds of subjects, then conduct a significance test to evaluate whether there were a lot of subjects, which the researchers already know, because they collected the data and know they are tired.
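To make this confounding of effect and sample size concrete, the following simulation sketch (hypothetical numbers, not taken from the editorial) holds a minuscule drug-placebo difference fixed and simply enlarges the groups; the observed effect stays tiny while the P value collapses.

```python
# Illustrative sketch only: hypothetical data generated to show how a fixed,
# minuscule effect becomes "statistically significant" once n is large enough.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
true_difference = 0.1   # tiny mean difference between groups (arbitrary units)
sd = 1.0                # within-group standard deviation

for n_per_group in (20, 200, 2000, 20000):
    drug = rng.normal(loc=true_difference, scale=sd, size=n_per_group)
    placebo = rng.normal(loc=0.0, scale=sd, size=n_per_group)
    t_stat, p_value = stats.ttest_ind(drug, placebo)
    observed = drug.mean() - placebo.mean()
    print(f"n per group = {n_per_group:6d}   "
          f"observed difference = {observed:+.3f}   P = {p_value:.4g}")

# The observed difference hovers near 0.1 at every sample size, yet the
# P value tends toward zero as n grows: "significance" here reflects the
# number of subjects at least as much as the magnitude of the effect.
```

Nothing in the shrinking P values indicates that the effect itself has become any larger or more clinically meaningful.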
Another misconception about P values, and one that is often translated into their inappropriate use as an effect size, is that they are mistakenly interpreted to reflect the probability that the null hypothesis is true.3 This reasoning, if it were valid, would logically lead to the use of a P value as a measure of effect, but a P value reflects the probability of observing the data given the null hypothesis, not the probability of the null hypothesis given the data.3 For example, an investigator conducts a comparison between a drug and a placebo condition, testing the null hypothesis that drug = placebo, and finds that for the comparison P = 0.032. A valid interpretation would be that there is less than a 5% probability that differences of this magnitude (or greater) would be observed if in fact there were no "true" differences between the two conditions (the observed differences being caused by sampling error). An appealing, but invalid, interpretation would be that there is less than a 5% chance that the drug condition is the same as placebo. Most researchers are far more interested in discovering this second probability, but this information cannot be gleaned from a single P value. Understanding that P values are not valid measures of effect size, there are several pragmatic methods by which to communicate the magnitude of an effect.

One widely used method of analyzing and reporting research findings is the comparison of mean differences. In anesthesiology research, there are several commonly used metrics (e.g., procedure time, dose information, number of treatment responders, costs) that naturally lead to meaningful effect size indices when considered in the form of mean group differences. Consider, for example, a hypothetical study of postoperative pain in which two treatment groups differ by 1.5 points on a 0–10 numerical rating scale; whether such a difference is large or small depends on the variability within the groups. Therefore, when using raw mean differences as a measure of effect size, it is necessary to provide estimates of the mean difference as well as estimates of confidence around those estimates (which are created using the variability of the effect as well as the sample size). The use of 95% confidence intervals is recommended for this very purpose.

The practice of reporting 95% confidence intervals surrounding a point estimate not only reflects the degree of uncertainty that is encountered anytime a sample is used as a population estimate, but also provides information akin to significance testing.2 Evaluation of the magnitude of the raw score difference and its 95% confidence band can be used to evaluate both the size and direction of the effect (as in the example of the 1.5-point difference above) as well as traditional statistical significance (if the bounds of the band do not include the value specified by the null hypothesis [usually zero], the observed effect is statistically significant by conventional hypothesis-testing standards). The practice of providing confidence intervals around these observed differences is made practical by widely available software applications that routinely provide the information needed for such estimates when conducting regressions, analyses of variance, and more.
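As a concrete sketch of that recommendation, the code below uses hypothetical pain scores on a 0–10 numerical rating scale (invented to echo the 1.5-point example above) and reports the raw mean difference with a pooled-variance 95% confidence interval; whether the interval excludes zero reproduces the conventional significance judgment.

```python
# Minimal sketch: raw mean difference plus a 95% confidence interval.
# The scores are hypothetical 0-10 pain ratings, not data from any study.
import numpy as np
from scipy import stats

treatment = np.array([2.0, 3.5, 1.0, 4.0, 2.5, 3.0, 1.5, 2.0, 3.5, 2.5])
control = np.array([4.0, 5.0, 3.0, 5.0, 4.0, 4.5, 3.5, 3.5, 4.0, 4.0])

diff = treatment.mean() - control.mean()   # raw effect size, in points on the scale

# Pooled-variance standard error of the difference (equal-variance assumption)
n1, n2 = len(treatment), len(control)
pooled_var = ((n1 - 1) * treatment.var(ddof=1) +
              (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))

# 95% confidence interval based on the t distribution with n1 + n2 - 2 df
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

t_stat, p_value = stats.ttest_ind(treatment, control)  # the matching significance test

print(f"mean difference (treatment - control) = {diff:.2f} points")
print(f"95% CI [{ci_low:.2f}, {ci_high:.2f}]   (t test P = {p_value:.4f})")
# An interval that excludes 0 (the null-hypothesis value) conveys the same
# significance verdict as the P value, while also showing the size and
# direction of the effect in the units of the pain scale.
```

Statistical packages report the same interval automatically when fitting regressions or analyses of variance; the manual computation here is only to make the ingredients (difference, variability, and sample size) explicit.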
There is one additional problem, however, with using raw score metrics as a measure of effect size. When an effect is calculated using a specific metric, it is difficult to compare the findings of a given study to other studies that use a different metric. Often, the same construct is measured using substantially different strategies, thus making the comparison of effects much more complex. Returning to the example about postoperative pain, one study could have measured pain using the 0–10 numerical rating scale, whereas another could have defined the effect as the percentage of patients who requested additional pain medication. A direct comparison between studies that present effect sizes using such different metrics or that use conceptually different outcome measures