Statistical Power, Sample Size, and Their Reporting in Randomized Controlled Trials
David Moher, MSc; Corinne S. Dulberg, PhD, MPH; George A. Wells, PhD

Objective: To describe the pattern over time in the level of statistical power and the reporting of sample size calculations in published randomized controlled trials (RCTs) with negative results.

Design: Our study was a descriptive survey. Power to detect 25% and 50% relative differences was calculated for the subset of trials with negative results in which a simple two-group parallel design was used. Criteria were developed both to classify trial results as positive or negative and to identify the primary outcomes. Power calculations were based on results from the primary outcomes reported in the trials.

Population: We reviewed all 383 RCTs published in JAMA, Lancet, and the New England Journal of Medicine in 1975, 1980, 1985, and 1990.

Results: Twenty-seven percent of the 383 RCTs (n=102) were classified as having negative results. The number of published RCTs more than doubled from 1975 to 1990, with the proportion of trials with negative results remaining fairly stable. Of the simple two-group parallel design trials with negative results and dichotomous or continuous primary outcomes (n=70), only 16% and 36% had sufficient statistical power (80%) to detect a 25% or 50% relative difference, respectively. These percentages did not consistently increase over time. Overall, only 32% of the trials with negative results reported sample size calculations, but the percentage doing so has improved over time, from 0% in 1975 to 43% in 1990. Only 20 of the 102 reports made any statement related to the clinical significance of the observed differences.

Conclusions: Most trials with negative results did not have large enough sample sizes to detect a 25% or a 50% relative difference. This result has not changed over time. Few trials discussed whether the observed differences were clinically important. There are important reasons to change this practice. The reporting of statistical power and sample size also needs to be improved. (JAMA. 1994;272:122-124)

From the Clinical Epidemiology Unit, Loeb Medical Research Institute (Mr Moher and Dr Wells), and the Faculties of Medicine (Mr Moher and Dr Wells) and Health Sciences (Dr Dulberg), University of Ottawa, Ottawa, Ontario. Presented in part at the Second International Congress on Peer Review in Biomedical Publication, Chicago, Ill, September 10, 1993. Reprint requests to the Clinical Epidemiology Unit, Loeb Medical Research Institute, Ottawa Civic Hospital, 1053 Carling Ave, Ottawa, Ontario, Canada K1Y 4E9 (Mr Moher).

The efficacy of new interventions is most readily accepted if the results are from randomized controlled trials (RCTs).1 Essential to the planning of an RCT is estimation of the required sample size. The investigator should ensure that there is sufficient power to detect, as statistically significant, a treatment effect of an a priori specified size. The opposite way of conceiving the problem is that the investigator should ensure a low β value, that is, a low probability of making a type II error: concluding that the result is not statistically significant when the observed effect is clinically meaningful.
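For reference, these definitions can be written compactly. The formulation below is a generic sketch, with Δ denoting the clinically meaningful difference and σ the common SD of a continuous outcome; it is not a formula taken from any of the trials reviewed.

\[
\beta = \Pr(\text{do not reject } H_0 \mid \text{true difference} = \Delta),
\qquad \text{power} = 1 - \beta,
\]

so that, for a two-sided test at level \(\alpha\) comparing two means, the approximate per-group sample size needed to achieve power \(1-\beta\) is

\[
n \;\approx\; \frac{2\,\sigma^{2}\,\bigl(z_{1-\alpha/2} + z_{1-\beta}\bigr)^{2}}{\Delta^{2}} .
\]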
The relationship between negative findings (ie, when statistical significance was not reached) and statistical power has been well illustrated in Freiman and colleagues'2 review of 71 RCTs with negative results published during 1960 to 1977. These RCTs were drawn from a collection of 300 simple two-group parallel design trials with dichotomous primary outcomes published in 20 journals. Freiman and colleagues were interested in assessing whether RCTs with negative results had sufficient statistical power to detect a 25% and a 50% relative difference between treatment interventions. Their review indicated that most of the trials had low power to detect these effects: only 7% (5/71) had at least 80% power to detect a 25% relative change between treatment groups, and only 31% (22/71) had at least 80% power to detect a 50% relative change, as statistically significant (α=.05, one tailed).

Since its publication, the report by Freiman and colleagues2 has been cited more than 700 times, possibly indicating the seriousness with which investigators have taken the findings. Given this citation record, one might expect an increase over time in the awareness of the consequences of low power in published RCTs and, hence, an increase in the reporting of sample size calculations made before clinical trials are conducted.

Our objective was to describe the pattern, over time, in the level of power and in the reporting of sample size calculations in published RCTs with negative results since the publication of Freiman and colleagues'2 report. We did this by extending Freiman and colleagues' objectives to the period from 1975 to 1990.

METHODS

The Study Sample

We reviewed RCTs published in JAMA, Lancet, and the New England Journal of Medicine. These were the three of the 20 journals from which more than half (36/71) of the trials with negative results reported by Freiman and colleagues2 were drawn. To capture the denominator, each volume published in 1975, 1980, 1985, and 1990 was hand-searched to extract the RCTs. To be considered an RCT, the study being assessed had to contain an explicit statement about randomization. The identified RCTs were consecutively coded and divided into three equal groups for review by members of the study team, with use of a structured data collection form. The data collected included information on whether the trial results were positive or negative, the study design, whether a priori or post hoc sample size calculations were performed, and the elements necessary for us to calculate the observed power (eg, proportions or means and SDs of the primary outcome obtained in each group). For example, if outcomes included both disease-free survival and overall mortality, mortality was considered to be the primary outcome.

Power Calculations

After closely following Freiman and colleagues'2 selection criteria, our power calculations were performed on the subset of the trials with negative results in which a simple two-group parallel design was used. However, rather than calculating power only for trials with dichotomous outcomes, we also calculated power for trials with continuous primary outcomes. We calculated each trial's power to detect a 25% and a 50% relative change, with an α value of .05, using a χ² test or a t test, as appropriate for the scale of measurement of the primary outcome. Our calculations differed from those of Freiman and colleagues in that we employed a two-tailed rather than a one-tailed α. A standard program3 was used for our power calculations.
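The standard program3 is not described further here. The Python sketch below merely illustrates the kind of calculation outlined above: a normal-approximation (χ²-type) comparison for a dichotomous outcome, a noncentral-t calculation for a continuous outcome, and a two-tailed α of .05. All inputs (control event rate, relative change, difference in means, SD, and group sizes) are hypothetical, as is the final function, which shows the handful of elements (control group event rate, α, sidedness, and power) needed to reproduce a reported sample size calculation.

# Illustrative power and sample size calculations (hypothetical inputs; not the
# authors' standard program cited as reference 3).
from math import sqrt, ceil
from scipy.stats import norm, t, nct


def power_two_proportions(p_control, relative_change, n_per_group, alpha=0.05):
    """Approximate power of a two-tailed comparison of two proportions
    (normal approximation, the large-sample equivalent of an uncorrected
    chi-square test) to detect the given relative change in the control rate."""
    p_exp = p_control * (1 - relative_change)        # eg, a 25% relative reduction
    se = sqrt(p_control * (1 - p_control) / n_per_group
              + p_exp * (1 - p_exp) / n_per_group)
    z_crit = norm.ppf(1 - alpha / 2)                 # two-tailed critical value
    z_effect = abs(p_control - p_exp) / se
    # Probability of exceeding the critical value in either tail under H1
    return norm.cdf(z_effect - z_crit) + norm.cdf(-z_effect - z_crit)


def power_two_means(delta, sd, n_per_group, alpha=0.05):
    """Power of a two-sample, two-tailed t test to detect a difference in means
    of `delta`, assuming a common SD, via the noncentral t distribution."""
    df = 2 * n_per_group - 2
    ncp = delta / (sd * sqrt(2 / n_per_group))       # noncentrality parameter
    t_crit = t.ppf(1 - alpha / 2, df)
    return nct.sf(t_crit, df, ncp) + nct.cdf(-t_crit, df, ncp)


def n_per_group_two_proportions(p_control, relative_change, power=0.80, alpha=0.05):
    """Per-group sample size needed to detect the given relative change with the
    stated power: the elements (control event rate, alpha, sidedness, power)
    that allow a reported sample size calculation to be reproduced."""
    p_exp = p_control * (1 - relative_change)
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    variance = p_control * (1 - p_control) + p_exp * (1 - p_exp)
    return ceil((z_a + z_b) ** 2 * variance / (p_control - p_exp) ** 2)


if __name__ == "__main__":
    # Hypothetical dichotomous outcome: 40% control event rate, 100 patients per group
    for rel in (0.25, 0.50):
        print(f"{rel:.0%} relative difference: "
              f"power = {power_two_proportions(0.40, rel, 100):.2f}, "
              f"n per group for 80% power = {n_per_group_two_proportions(0.40, rel)}")
    # Hypothetical continuous outcome: 5-unit difference in means (SD 10), 100 per group
    print(f"continuous outcome: power = {power_two_means(5.0, 10.0, 100):.2f}")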
RESULTS

Description of Study Sample

A total of 393 RCTs were published in JAMA, Lancet, and the New England Journal of Medicine during the 4 years of our review. Ten trials were excluded from the analysis for the following reasons: results based on invalid statistical analyses precluded classification of the trial (n=5), randomization was not explicit or not all patients were randomized (n=2), and no statistical analysis was performed (n=3).

Twenty-seven percent (n=102) of the remaining 383 trials had negative results. Although the number of RCTs published more than doubled between 1975 and 1990 (67 vs 148), the percentage that had negative results remained fairly stable over time: 33%, 27%, 25%, and 25% in 1975, 1980, 1985, and 1990, respectively.

Statistical Power

We calculated power for 70 of the 102 RCTs with negative results that employed a simple two-group parallel design with dichotomous (n=52) or continuous (n=18) primary outcomes. The Table shows, by publication year, the percentage of these trials with at least 80% power to detect each effect size.

Table. Published Reports of Simple Two-Group Parallel Design Randomized Controlled Trials With Negative Results With Dichotomous (n=52) or Continuous (n=18) Primary Outcomes That Had at Least 80% Power to Detect Two Effect Sizes Between Control and Experimental Treatment Groups*

             Effect Size: Relative Difference
Year         25%          50%
1975         12 (16)      25 (16)
1980         13 (15)      47 (15)
1985         7 (15)       27 (15)
1990         25 (24)      42 (24)
Overall      16 (70)      36 (70)

*Sample size calculations were based on a two-tailed α value of .05 with use of either a χ² test or a t test, as appropriate for the scale of measurement of the primary outcome measure. Values are percentages (number of trials).

… calculation was based. Only 45% reported the control group event rate, but 79% specified the power level. Slightly more than half (58%) reported the α level, but few (18%) indicated whether the α value was one tailed or two tailed. In fact, in only 30% of these trials was there sufficient detail to enable us to replicate the reported calculated sample size.

COMMENT

If a trial with negative results has a sufficient sample size to detect a clinically important effect, then the negative results are interpretable: the treatment did not have an effect at least as large as the effect considered to be clinically relevant. If a trial with negative results has insufficient power, a clinically important but statistically nonsignificant effect is usually ignored or, worse, is taken to mean that the treatment under study made no difference. Thus, there are important scientific reasons to report sample size and/or power calculations.

There are also ethical reasons to estimate sample size when planning a trial. Altman5 noted that ethics committees may not want to approve the rare oversized trial because of the unnecessary costs and involvement of additional patients. More commonly, ethics committees may not want to approve trials that …