Statistical Power, Sample Size, and Their Reporting in Randomized Controlled Trials

David Moher, MSc; Corinne S. Dulberg, PhD, MPH; George A. Wells, PhD

From the Clinical Epidemiology Unit, Loeb Medical Research Institute (Mr Moher and Dr Wells), and the Faculties of Medicine (Mr Moher and Dr Wells) and Health Sciences (Dr Dulberg), University of Ottawa, Ottawa, Ontario. Presented in part at the Second International Congress on Peer Review in Biomedical Publication, Chicago, Ill, September 10, 1993. Reprint requests to the Clinical Epidemiology Unit, Loeb Research Institute, Ottawa Civic Hospital, 1053 Carling Ave, Ottawa, Ontario, Canada K1Y 4E9 (Mr Moher).

Objective.—To describe the pattern over time in the level of statistical power and the reporting of sample size calculations in published randomized controlled trials (RCTs) with negative results.

Design.—Our study was a descriptive survey. Power to detect 25% and 50% relative differences was calculated for the subset of trials with negative results in which a simple two-group parallel design was used. Criteria were developed both to classify trial results as positive or negative and to identify the primary outcomes. Power calculations were based on results from the primary outcomes reported in the trials.

Population.—We reviewed all 383 RCTs published in JAMA, Lancet, and the New England Journal of Medicine in 1975, 1980, 1985, and 1990.

Results.—Twenty-seven percent of the 383 RCTs (n=102) were classified as having negative results. The number of published RCTs more than doubled from 1975 to 1990, with the proportion of trials with negative results remaining fairly stable. Of the trials with negative results having a simple two-group parallel design with dichotomous or continuous primary outcomes (n=70), only 16% and 36% had sufficient statistical power (80%) to detect a 25% or 50% relative difference, respectively. These percentages did not consistently increase over time. Overall, only 32% of the trials with negative results reported sample size calculations, but the percentage doing so has improved over time, from 0% in 1975 to 43% in 1990. Only 20 of the 102 reports made any statement related to the clinical significance of the observed differences.

Conclusions.—Most trials with negative results did not have large enough sample sizes to detect a 25% or a 50% relative difference. This result has not changed over time. Few trials discussed whether the observed differences were clinically important. There are important reasons to change this practice. The reporting of statistical power and sample size also needs to be improved. (JAMA. 1994;272:122-124)

THE EFFICACY of new interventions is most readily accepted if the results are from randomized controlled trials (RCTs).1 Essential to the planning of an RCT is estimation of the required sample size. The investigator should ensure that there is sufficient power to detect, as statistically significant, a treatment effect of an a priori specified size. An opposite way of conceiving the problem is that the investigator should ensure that there is a low β value, or probability of making a type II error: concluding that the result is not statistically significant when the observed effect is clinically meaningful.

The relationship between negative findings (ie, when statistical significance was not reached) and statistical power has been well illustrated in Freiman and colleagues'2 review of 71 RCTs with negative results published during 1960 to 1977. These RCTs were drawn from a collection of 300 simple two-group parallel design trials with dichotomous primary outcomes published in 20 journals. Freiman and colleagues were interested in assessing whether RCTs with negative results had sufficient statistical power to detect a 25% and a 50% relative difference between treatment interventions. Their review indicated that most of the trials had low power to detect these effects: only 7% (5/71) had at least 80% power to detect a 25% relative change between treatment groups, and 31% (22/71) had at least 80% power to detect a 50% relative change, as statistically significant (α=.05, one tailed).

Since its publication, the report of Freiman and colleagues2 has been cited more than 700 times, possibly indicating the seriousness with which investigators have taken the findings. Given this citation record, one might expect an increase over time in the awareness of the consequences of low power in published RCTs and, hence, an increase in the reporting of sample size calculations made before clinical trials are conducted.

Our objective was to describe the pattern, over time, in the level of power and in the reporting of sample size calculations in published RCTs with negative results since the publication of Freiman and colleagues'2 report. We did this by extending Freiman and colleagues' objectives during the period from 1975 to 1990.

METHODS

The Study Sample

We reviewed RCTs published in JAMA, Lancet, and the New England Journal of Medicine. These were the three of the 20 journals from which more than half (36/71) of the trials with negative results reported by Freiman and colleagues2 were drawn. To capture the denominator, each volume published in 1975, 1980, 1985, and 1990 was hand-searched to extract the RCTs. To be considered an RCT, the study being assessed had to contain an explicit statement about randomization. The identified RCTs were consecutively coded and divided into three equal groups for review by members of the study team, with use of a structured data collection form. The data collected included information on whether the trial results were positive or negative, the study design, whether a priori or post hoc sample size calculations were performed, and the elements necessary for us to calculate observed power (eg, proportions or means and SDs of the primary outcome obtained in each group).

After closely following Freiman and colleagues'2 selection criteria, our power calculations were performed on the subset of the trials with negative results in which a simple two-group parallel design was used. However, rather than calculating power only for trials with dichotomous outcomes, we also calculated power for trials with continuous primary outcomes. We calculated each trial's power to detect a 25% and a 50% relative change, with an α value of .05, using a χ² test or a t test, as appropriate for the scale of measurement of the primary outcome. Our calculations differed from those of Freiman and colleagues in that we employed a two-tailed rather than a one-tailed α. A standard program3 was used for our power calculations. We also calculated the percentages, over time, of trials reporting sample size calculations.
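For concreteness, the following is a minimal sketch of the kind of calculation involved, assuming equal group sizes and the usual normal and noncentral-t approximations. The function names, the example figures, and the use of Python with scipy are our illustrative assumptions; the study itself used the published program of Dupont and Plummer.3

```python
# Illustrative only: power of a two-group trial to detect a given relative
# difference, two tailed, alpha = .05. This is not the program used in the
# study (Dupont and Plummer[3]); names and approximations are assumptions.
from math import sqrt

from scipy.stats import nct, norm, t


def power_two_proportions(p_control, rel_diff, n_per_group, alpha=0.05):
    """Normal-approximation power of a chi-square (two-proportion) test to
    detect a relative difference `rel_diff` from the control event rate."""
    p_treat = p_control * (1 - rel_diff)  # eg, a 25% relative reduction
    z_alpha = norm.ppf(1 - alpha / 2)
    p_bar = (p_control + p_treat) / 2
    numerator = (abs(p_control - p_treat) * sqrt(n_per_group)
                 - z_alpha * sqrt(2 * p_bar * (1 - p_bar)))
    denominator = sqrt(p_control * (1 - p_control) + p_treat * (1 - p_treat))
    return norm.cdf(numerator / denominator)


def power_two_means(delta, sd, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample t test to detect a difference in
    means `delta`, given a common within-group standard deviation `sd`."""
    df = 2 * n_per_group - 2
    noncentrality = (delta / sd) * sqrt(n_per_group / 2)
    t_crit = t.ppf(1 - alpha / 2, df)
    return (1 - nct.cdf(t_crit, df, noncentrality)
            + nct.cdf(-t_crit, df, noncentrality))


# Invented example: 100 patients per group and a 40% control event rate give
# only about 32% power to detect a 25% relative reduction (0.40 -> 0.30).
print(round(power_two_proportions(0.40, 0.25, 100), 2))
```

On these assumptions it is easy to see how trials of the sizes typical of the period fell well short of 80% power for a 25% relative difference.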
Selection of Trials and Identification of Primary Outcome

For an RCT to be classified as having negative results, there had to be an explicit statement in the text that negative results had been obtained. When an explicit statement was missing, classification of a trial as having negative results required identification of the primary outcome measure. As Pocock et al4 observed in 1987, primary outcome measures are not usually clearly specified. Encountering the same problem, we specified a series of decision rules to select the primary outcome. If an article reported a sample size calculation, the outcome used in the calculation was taken as the primary outcome. Published descriptive statistics on this variable were then used in our power calculations. If sample size calculations were not reported and multiple outcomes were evaluated, at least 50% of the results of statistical tests had to be nonsignificant for the RCT to be classified as having negative results. Among the multiple outcomes, the most serious was identified as the primary outcome. For example, if outcomes included both disease-free survival and overall mortality, mortality was considered to be the primary outcome.

RESULTS

Description of Study Sample

A total of 393 RCTs were published in JAMA, Lancet, and the New England Journal of Medicine during the 4 years of our review. Ten trials were excluded from the analysis for the following reasons: results based on invalid statistical analyses precluded classification of the trial (n=5), randomization was not explicit or not all patients were randomized (n=2), and no statistical analysis was performed (n=3). Twenty-seven percent (n=102) of the remaining 383 trials had negative results. Although the number of RCTs published more than doubled between 1975 and 1990 (67 vs 148), the percentage that had negative results has remained fairly stable over time: 33%, 27%, 25%, and 25% in 1975, 1980, 1985, and 1990, respectively.

Statistical Power

We calculated power for 70 of the 102 RCTs with negative results that employed a simple two-group parallel design with dichotomous (n=52) or continuous (n=18) primary outcomes. The Table presents the distribution, over time, in the percentage of trials that had at least 80% power (α=.05, two tailed) to detect two effect sizes: a relative difference between treatment groups of 25% and of 50%. Overall, only 16% and 36% of the trials had at least 80% power to detect a 25% or a 50% relative difference, respectively. These figures have not consistently improved over time.

Table.—Published Reports of Simple Two-Group Parallel Design Randomized Controlled Trials With Negative Results With Dichotomous (n=52) or Continuous (n=18) Primary Outcomes That Had at Least 80% Power to Detect Two Effect Sizes Between Control and Experimental Treatment Groups*

              Effect Size: Relative Difference
Year          25%            50%
1975          12 (16)        25 (16)
1980          13 (15)        47 (15)
1985           7 (15)        27 (15)
1990          25 (24)        42 (24)
Overall       16 (70)        36 (70)

*Sample size calculations were based on a two-tailed α value of .05 with use of either a χ² test or a t test, as appropriate for the scale of measurement of the primary outcome. Values are percentages (number of trials).
Sample Size Calculations

Among the 102 trials with negative findings, only 32% (n=33) reported a sample size calculation. While this number is small, the situation has improved over time since 1980. None of the 22 trials with negative results published in 1975 was found to have included a sample size calculation; 32% (7/22) did so in 1980, 48% (10/21) did so in 1985, and 43% (16/37) did so in 1990. Only 20 (20%) of the 102 trials with negative results made any kind of statement about the clinical significance of the results with respect to the observed statistical differences between the treatment groups.

An examination of the 33 trials with negative results with sample size calculations revealed serious deficiencies in the reporting of the variables essential for these calculations. No trial indicated the statistical test on which the calculation was based. Only 45% reported the control group event rate, but 79% specified the power level. Slightly more than half (58%) reported the α level, but few (18%) indicated whether the α value was one tailed or two tailed. In fact, in only 30% of these trials was there sufficient detail to enable us to replicate the reported calculated sample size.
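To see why each unreported element matters, consider the standard sample-size formula for comparing two proportions, sketched below: without the control event rate, the α level and its directionality, and the desired power, the calculation cannot be reproduced. This is a hedged illustration with invented figures, not a reconstruction of any reviewed trial's calculation, and the function name is ours.

```python
# Illustrative only: the standard (Fleiss-type) per-group sample size for a
# two-proportion comparison, without continuity correction. Every argument
# corresponds to one of the reporting elements discussed above; the example
# figures are invented.
from math import ceil, sqrt

from scipy.stats import norm


def n_per_group(p_control, p_treat, alpha=0.05, power=0.80, two_tailed=True):
    """Approximate number of patients per group for a chi-square comparison
    of two independent proportions."""
    z_alpha = norm.ppf(1 - alpha / 2) if two_tailed else norm.ppf(1 - alpha)
    z_beta = norm.ppf(power)
    p_bar = (p_control + p_treat) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_control * (1 - p_control)
                                 + p_treat * (1 - p_treat)))
    return ceil((numerator / (p_control - p_treat)) ** 2)


# A 25% relative reduction from a 40% control event rate (0.40 -> 0.30)
# requires roughly 356 patients per group at 80% power, alpha=.05 two tailed.
print(n_per_group(0.40, 0.30))
```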

COMMENT

If a trial with negative results has a sufficient sample size to detect a clinically important effect, then the negative results are interpretable: the treatment did not have an effect at least as large as the effect considered to be clinically relevant. If a trial with negative results has insufficient power, a clinically important but statistically nonsignificant effect is usually ignored or, worse, is taken to mean that the treatment under study made no difference. Thus, there are important scientific reasons to report sample size and/or power calculations.

There are also ethical reasons to estimate sample size when planning a trial. Altman5 noted that ethics committees may not want to approve the rare oversized trial because of the unnecessary costs and involvement of additional patients. More commonly, ethics committees may not want to approve trials that are too small to observe clinically important differences because, as Altman put it, such a trial may "be scientifically useless, and hence unethical in its use of subjects and other resources."

Our results indicate that most trials with negative results had too few patients to detect a relative difference of 25% or 50% with sufficient statistical power and that this has not changed over time. Despite our unique set of decision rules to identify trials with negative results and primary outcomes, the results are similar to those originally reported by Freiman and colleagues2 16 years ago.

Our observation that most trials do not report sample size calculations is consistent with other descriptive surveys of RCTs that did not focus solely on trials with negative results. DerSimonian and colleagues6 and Pocock and colleagues4 evaluated general methodologic and statistical problems of clinical trials published in 1979 and 1985, respectively. Both reports found that statistical power was discussed in only about 12% of the published RCTs selected for review. More recently, Altman and Doré7 reported that 39% of a convenience sample of 80 RCTs published in 1987 reported calculating sample size.

It is possible that investigators do plan required sample sizes but that this information is not included in the published reports; this does not appear to be the case, however. After personally contacting principal authors, Liberati and colleagues8 discovered that only a very small percentage had conducted sample size calculations without including this information in the published report.

Another explanation for the absence of sample size calculations is a lack of understanding of the concept of effect size. Indeed, one of the most challenging aspects of sample size planning is determining a clinically important effect. The 25% and 50% relative differences on which our power calculations and those of Freiman and colleagues2 were based may, in fact, represent very large differences. More modest but clinically important treatment effects would necessitate trials with substantially larger sample sizes. Reviewing the cardiovascular literature to evaluate the magnitude of treatment effects, Yusuf and colleagues9 found that relatively small treatment effects should be expected.

A third possible reason that most trials fail to report sample size calculations is that this may reflect the view that planning sample size is unnecessary because RCTs, whatever their outcome, are invaluable to systematic reviews (meta-analyses).10 Even if one holds this view, it does not preclude the value of publishing a post hoc calculation of the power of a study with negative results to detect a clinically important difference. This calculation would enable the reader to make a more informed judgment as to the clinical relevance of the observed absence of statistical significance. Furthermore, evidence indicates that because of publication bias,11 studies with positive results are more likely to be published than are those with negative results. As a consequence, unpublished trials with negative results might not enter into a systematic review.
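As a hedged illustration of the reader-side judgment such a post hoc calculation allows, the sketch below inverts the same normal-approximation power formula sketched earlier to ask what relative difference a completed trial of a given size could actually have detected with 80% power. All names and figures are ours, invented for illustration.

```python
# Illustrative only: the smallest relative reduction a completed "negative"
# trial could have detected with 80% power, found by numerically inverting
# the normal-approximation power formula. Figures are invented.
from math import sqrt

from scipy.optimize import brentq
from scipy.stats import norm


def power(p_control, p_treat, n_per_group, alpha=0.05):
    """Two-sided normal-approximation power for two proportions."""
    z_alpha = norm.ppf(1 - alpha / 2)
    p_bar = (p_control + p_treat) / 2
    num = (abs(p_control - p_treat) * sqrt(n_per_group)
           - z_alpha * sqrt(2 * p_bar * (1 - p_bar)))
    den = sqrt(p_control * (1 - p_control) + p_treat * (1 - p_treat))
    return norm.cdf(num / den)


def minimum_detectable_relative_difference(p_control, n_per_group,
                                           target_power=0.80):
    """Solve power(rel_diff) = target_power for the relative reduction."""
    return brentq(
        lambda r: power(p_control, p_control * (1 - r), n_per_group)
        - target_power,
        1e-6, 0.999,
    )


# Invented example: with 60 patients per group and a 30% control event rate,
# only a relative reduction of roughly 68% was detectable with 80% power.
print(round(minimum_detectable_relative_difference(0.30, 60), 2))
```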
In a recent commentary, Cohen12 found the absence of power calculations in published psychological research to be inexplicable. He suggested that this problem could exemplify the "slow movement of methodological advance" or could reflect difficulties researchers face in performing the appropriate calculations. He also commented that the "passive acceptance of this state of affairs by editors and reviewers is even more of a mystery."

Our results indicate that even when sample size calculations were published, the details necessary to replicate the calculations were missing in most cases. These deficiencies are consistent with a general concern about the quality of reporting of trials.13,14 Reports of RCTs should provide readers with detailed information about the design, execution, analysis, and interpretation of the trial and its findings. A minimum set of required information, ie, "structured reporting," would help readers to evaluate the quality of a trial. A similar approach, more informative abstracts, has had a positive impact on how the results of abstracts are communicated.15

We propose that authors should report sample size calculations and that the following information should be contained in all published reports of RCTs: (1) the primary dependent measure(s) should be clearly identified; (2) a clinically important treatment effect should be specified; (3) the treatment effect should be clearly indicated as being an absolute or a relative difference; and (4) the statistical test, directionality, α level, and statistical power used to estimate sample size should be reported. If sample size calculations were not conducted a priori, a published report of an RCT with negative results should provide post hoc statistical power calculations to detect a clinically important difference. The benefits of including this information are clearly worth the extra space required in the publications.

References

1. Cook DJ, Guyatt GH, Laupacis A, Sackett DL. Rules of evidence and clinical recommendations on the use of antithrombotic agents. Chest. 1992;102(suppl):305S-311S.
2. Freiman JA, Chalmers TC, Smith H, Kuebler RR. The importance of beta, the type II error, and sample size in the design and interpretation of the randomized controlled trial: survey of 71 "negative" trials. N Engl J Med. 1978;299:690-694.
3. Dupont WD, Plummer WD. Power and sample size calculations: a review and computer program. Controlled Clin Trials. 1990;11:116-128.
4. Pocock SJ, Hughes MD, Lee RJ. Statistical problems in the reporting of clinical trials: a survey of three medical journals. N Engl J Med. 1987;317:426-432.
5. Altman DG. Statistics and ethics in medical research, III: how large a sample? BMJ. 1980;281:1336-1338.
6. DerSimonian R, Charette LJ, McPeek B, Mosteller F. Reporting on methods in clinical trials. N Engl J Med. 1982;306:1332-1337.
7. Altman DG, Doré CJ. Randomisation and baseline comparisons in clinical trials. Lancet. 1990;335:149-153.
8. Liberati A, Himel HN, Chalmers TC. A quality assessment of randomized control trials of primary treatment of breast cancer. J Clin Oncol. 1986;4:942-951.
9. Yusuf S, Collins R, Peto R. Why do we need some large, simple randomized trials? Stat Med. 1984;3:409-420.
10. Chalmers TC. Quality needs to be improved to facilitate meta-analyses. Online J Curr Clin Trials. September 11, 1993; Doc No. 89.
11. Dickersin K, Min YI. NIH clinical trials and publication bias. Online J Curr Clin Trials. April 28, 1993; Doc No. 50.
12. Cohen J. A power primer. Psychol Bull. 1992;112:155-159.
13. Grant A. Reporting controlled trials. Br J Obstet Gynaecol. 1989;96:397-400.
14. Mosteller F, Gilbert JP, McPeek B. Reporting standards and research strategies for controlled trials. Controlled Clin Trials. 1980;1:37-58.
15. Haynes RB, Mulrow CD, Huth EJ, Altman DG, Gardner MJ. More informative abstracts revisited. Ann Intern Med. 1990;113:69-76.
