Sample Size Debate
Does increasing the sample size provide greater precision for detecting treatment effects?

Evolution of this debate
Original debate plan: Is bigger better? Traditional calculations target a sample size based on the requirement for statistical power of 0.8, but a strategy of overpowering studies with larger samples increases the likelihood of study success.

Today's debate format
• Introduction: G Sachs, 5 min
• Pro: G Sachs, 10 min
• Con: S Brannan, 15 min
• Regulatory perspective: A Potter, 15 min
• Presenter rebuttal: 10 min
• Your views
• Emerging consensus

Potential conflicts: Gary Sachs, MD
• Bracket: full-time employee
• Collaborative Care Initiative: owner

Is bigger better? Pro: Overpowering studies with larger samples increases the likelihood of study success
• Bigger is better
• The Central Limit Theorem
• The Dakota Effect
• The Boston Red Sox
• Other examples
• Why what Steve thinks is wrong

Central Limit Theorem
• "Irrespective of the shape of the underlying distribution, sample means and proportions will approximate normal distributions if the sample size is sufficiently large."
• Large sample, narrow confidence interval

Fundamental features of RCT design
• Three features reduce the risk of bias: randomized group assignment, double-blinded assessments, and control or comparison groups. These sine qua non elements are standard for RCTs in psychopharmacology.
• Two other aspects of RCT design may become impediments in clinical trial implementation:
• Multiplicity, comparing the investigational and comparator groups on multiple outcomes, inflates type I error.
• Unreliability introduces bias, reduces statistical power, and thus diminishes the feasibility of a clinical trial.
• Many task paradigms in neuroscience have either unknown reliability or levels of reliability that may be less than optimal.
• Many such paradigms are complex and may provide several different outcome measures that could be considered equally valid indicators of cognitive improvement.
• Thus, the problems of unreliability and multiplicity may be particularly acute for RCTs using such measures.

[Figure: Empirical estimates of Cohen's d with 95% confidence interval (population Δ = 0.50). Leon AC, Schizophrenia Bulletin 34(4):664-669, 2008]

Key formulas
• A quick approximation of the 95% CI: d ± 4/√N
• Effect size, Cohen's d: d = (X̄₁ − X̄₂) / s.d.
• Statistical power is based on Cohen's d = (Δ Active − Δ Placebo) / s.d. (pooled)

The sample size (N) required in each of 2 treatment groups (assuming equal cell sizes) to detect various effects with statistical power of 0.80, using a t test with a 2-tailed α level of .05 (Leon AC, Schizophrenia Bulletin 34(4):664-669, 2008); a quick software check follows the table:

Effect           Cohen's d    N required per group
Small effect     d = 0.20     393
Medium effect    d = 0.50     64
Large effect     d = 0.80     26
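These tabled values can be reproduced with standard power software. Below is a minimal sketch using Python's statsmodels package (the tool choice is ours; the slides do not specify one); it solves for the per-arm N at power 0.80 and two-tailed α = .05.

```python
# Minimal check of the per-arm sample sizes in the table above,
# via statsmodels' power calculator for a two-sample t test.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for label, d in [("small", 0.20), ("medium", 0.50), ("large", 0.80)]:
    n_per_arm = analysis.solve_power(effect_size=d, alpha=0.05,
                                     power=0.80, ratio=1.0,
                                     alternative="two-sided")
    print(f"{label} effect (d = {d:.2f}): N per arm = {n_per_arm:.1f}")
# Yields roughly 393-394, 64, and 26 per arm, matching the table
# up to rounding convention.
```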
The 2017 Boston Red Sox
• First place in the American League East
• Data for all non-pitchers on the 2017 roster
• Batting average month by month vs. full season
• Who is the best player on the team?
• How many Red Sox are .400 hitters?
• Hypothesis: there is more variance in smaller samples, and players added after April add variance

Sample size impacts s.d.
[Figure: Batting average by month (April through September) and full-season average, y-axis 0 to 0.8, for all 21 non-pitchers on the roster: Mookie Betts, Andrew Benintendi, Xander Bogaerts, Mitch Moreland, Hanley Ramirez, Jackie Bradley Jr., Dustin Pedroia, Christian Vazquez, Sandy Leon, Chris Young, Rafael Devers, Deven Marrero, Eduardo Nunez†, Brock Holt, Josh Rutledge, Pablo Sandoval†, Sam Travis, Marco Hernandez, Tzu-Wei Lin, Rajai Davis†, Steve Selsky]
• Can we assume these 21 players are all the same?

Larger N is associated with smaller s.d. and yields more power
[Figure: s.d. of batting average by month: April 0.137, May 0.106, June 0.095, July 0.083, August 0.057, September 0.049, full season 0.031]
A small simulation of this effect appears below.
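The month-to-month spread is what the Central Limit Theorem predicts. The following simulation makes the point under stated assumptions: a roster of 21 identical .265 hitters and rough at-bat counts of 90 per month versus 550 per season (illustrative numbers, not 2017 data).

```python
# Sampling variability demo: 21 identical hitters, observed over a
# month of at-bats vs a full season. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(2017)
true_avg, n_players = 0.265, 21

for label, at_bats in [("one month", 90), ("full season", 550)]:
    hits = rng.binomial(at_bats, true_avg, size=n_players)
    averages = hits / at_bats
    print(f"{label:11s} (AB = {at_bats}): "
          f"s.d. of batting average = {averages.std():.3f}, "
          f"best 'hitter' = {averages.max():.3f}")
# The monthly s.d. is roughly 2.5x the full-season s.d. (~ sqrt(550/90)),
# so a small sample can make an average hitter look elite by chance alone.
```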
Traditional 2-cell study: sample requirement N = 350
Sample size justification: We are planning a study of a continuous response variable from independent control and experimental subjects, with 1 control per experimental subject. In a previous study the response within each subject group was normally distributed with standard deviation 10. If the true difference between the experimental and control means is 3, we will need to study 175 experimental subjects and 175 control subjects to be able to reject the null hypothesis that the population means of the experimental and control groups are equal with probability (power) 0.8. The Type I error probability associated with this test of the null hypothesis is 0.05.

Support-of-concept study, 2 cells: sample requirement N = 150
Note the impact of the smaller sample: We are planning a study of a continuous response variable from independent control and experimental subjects, with 1 control per experimental subject. In a previous study the response within each subject group was normally distributed with standard deviation 10. If the true difference between the experimental and control means is 3, we will need to study 75 experimental subjects and 75 control subjects to be able to reject the null hypothesis that the population means of the experimental and control groups are equal with probability (power) 0.75. The Type I error probability associated with this test of the null hypothesis is 0.25.

• Achieving a low placebo response is only critical for the success of underpowered treatment arms (A Khan et al., 2017)

Table 1. The effect of a median split of treatment-arm sample size on the relationship between magnitude of placebo response and treatment-arm success.

Depression                          N < 170       N > 170
Mean placebo response               29.9%         36.7%
Placebo response vs. success        R² = 0.346    R² = 0.043
                                    p < 0.001     p = 0.164
Success rate                        40%           68%

Even though placebo response is significantly (p < 0.001) higher in the N > 170 group (36.7%) than in the N < 170 group (29.9%), the success rate is 28 percentage points higher.

Schizophrenia                       N < 202       N ≥ 202
Mean placebo response               6.3%          10.1%
Placebo response vs. success        R² = 0.377    R² = 0.067
                                    p = 0.001     p = 0.089
Success rate                        63%           77%

Even though placebo response is significantly (p = 0.018) higher in the N ≥ 202 group (10.1%) than in the N < 202 group (6.3%), the success rate is 14 percentage points higher.

Bigger is better!

Debate conclusion: Bigger is better
• Ceteris paribus, larger samples deliver more power to detect true differences.
• Power reflects sample size, s.d., and effect size.
• The chance of deviation from the true population value lessens as sample size increases.
• Studies designed to take advantage of this mathematical reality must also pay attention to quality control and feasibility.
• Multiplicity: comparing the investigational and comparator groups on multiple outcomes inflates type I error.
• Unreliability: introduces bias and reduces statistical power.
• Feasibility: requires qualified sites and raters.

Rebuttal: Is bigger better?

[Figure: Protocol xxx site quality heat map: the problem of outlier sites violating ceteris paribus. Sites A through F plotted by quality rank, from worst to best quality, against number of subjects randomized]

[Figure: Protocol xxx cumulative quality indicator matrix: Site A, PI yyyyy]

Extras for back-up

Is bigger better? Interpretation of "Δ" (d < 0.3 vs. d > 0.4): critical evaluation required
• Treatment development relies on detection of small deltas (differences between change scores).
• Should development decisions be weighed on both study outcome and study quality?
• Most RCTs are designed to detect effect sizes between 0.3 and 0.4.

Confidence for decision making, by study result and study quality:
• Positive result, good-quality study: High (avoid repetition)
• Positive result, poor-quality study: ?
• Negative result, good-quality study: High (avoid repetition)
• Negative result, poor-quality study: Low, or exclude

Study Management Using Blinded Data Analytics (December 8, 2016)

Case study: Probing the fog of study operations
• Large, complex multisite trial
• Slow recruitment
• Quality concerns
• High respondent burden: screening is burdensome (requires up to 6 hours)
• Site feedback: anecdotal reports collected by the sponsor and by Bracket
• Blinded data analytics (BDA)

[Figure: Site A, PI yyyy: cases of identical YMRS item scoring by the rater during the double-blind phase, subject ####. Solid line = rater score; dashed line = computer score]

[Figure: BPI duration by site: average BPI duration in minutes and number of assessments at each of 50 sites]
[Figure: YMRS duration by site: average YMRS duration in minutes and number of assessments at each of 50 sites]
[Figure: MADRS duration by site, across all post-randomization visits: average MADRS duration in minutes and number of assessments at each of 50 sites]

[Figure: Site 84, median MADRSComp duration varies with severity: total 0-8 (n = 30) about 2.5 minutes, s.d. 2.9; total 9-16 (n = 19) about 12 minutes, s.d. 2.4; total 17-24 (n = 39) about 21 minutes, s.d. 2.2; total 25-32 (n = 36) about 27 minutes, s.d. 1.7]

[Figure: MADRS at screen: average rater MADRS interview duration at individual sites at screening, plotted with mean and s.d. of the MADRS score and a +2 s.d. duration reference line]
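As a sketch of how a blinded-data-analytics duration check like the ones above might be implemented, the following Python snippet flags sites whose mean interview duration falls below a plausibility floor. The table layout (site, scale, minutes) and the 10-minute threshold are assumptions for illustration, not Bracket's actual method.

```python
# Illustrative blinded-data check: flag sites whose average rating
# interview is implausibly short for the scale being administered.
import pandas as pd

def flag_fast_sites(df: pd.DataFrame, scale: str,
                    min_minutes: float = 10.0) -> pd.DataFrame:
    """Return sites whose mean interview duration for `scale` falls
    below `min_minutes` (a plausibility floor for a real interview)."""
    per_site = (df[df["scale"] == scale]
                .groupby("site")["minutes"]
                .agg(n="count", mean_minutes="mean"))
    return per_site[per_site["mean_minutes"] < min_minutes]

# Toy example: a site averaging ~6 minutes per MADRS is implausibly
# fast and would be flagged for follow-up.
toy = pd.DataFrame({
    "site":    [422, 422, 313, 313, 313],
    "scale":   ["MADRS"] * 5,
    "minutes": [24, 26, 6, 7, 5],
})
print(flag_fast_sites(toy, "MADRS"))
```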