Sample Size Debate

Does increasing the sample size provide greater precision for detecting treatment effects? Evolution of this debate

Original Debate Plan: Is bigger better? Traditional calculations target a sample size that delivers statistical power of 0.80, but a strategy of overpowering studies with larger samples increases the likelihood of study success.

Today's Debate Format
• Introduction: G Sachs (5 min)
• Pro: G Sachs (10 min)
• Con: S Brannan (15 min)
• Regulatory Perspective: A Potter (15 min)
• Presenter rebuttal (10 min)
• Your Views
• Emerging Consensus

Potential Conflicts: Gary Sachs, MD

• Bracket: full-time employee
• Collaborative Care Initiative: owner

Is bigger better? Pro: Overpowering studies with larger samples increases the likelihood of study success

• Bigger is better
• The Central Limit Theorem
• The Dakota Effect
• Other examples
• Why what Steve thinks is wrong

Central Limit Theorem

• “Irrespective of the shape of the underlying distribution, sample mean & proportions will approximate normal distributions if the sample size is sufficiently large”

• Large sample – Narrow CI
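Both bullets can be seen in a short simulation. Below is a minimal numpy sketch (the skewed distribution, sample sizes, and seed are arbitrary illustrative choices, not from the talk): the distribution of sample means looks increasingly normal, and its 95% spread narrows roughly as 1/√N.

```python
import numpy as np

rng = np.random.default_rng(42)

# Heavily skewed underlying distribution (exponential, mean = 1).
for n in (10, 100, 1000):
    # 5,000 replicate samples of size n; keep each sample's mean.
    means = rng.exponential(scale=1.0, size=(5000, n)).mean(axis=1)
    lo, hi = np.percentile(means, [2.5, 97.5])
    print(f"n = {n:4d}: mean of means = {means.mean():.3f}, "
          f"95% interval width = {hi - lo:.3f}")   # width shrinks ~ 1/sqrt(n)
```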

Fundamental features of RCT design

• 3 elements reduce the risk of bias: randomized group assignment, blinded assessments, and control or comparison groups.
• These sine qua non elements are standard for RCTs in psychopharmacology.
• 2 other aspects of RCT design may become impediments in clinical trial implementation:
• Multiplicity, comparing the investigational and comparator groups on multiple outcomes, inflates type I error (see the simulation sketch below).
• Unreliability introduces bias, reduces statistical power, and thus diminishes the feasibility of a clinical trial.
• Many task paradigms in neuroscience have either unknown reliability or levels of reliability that may be less than optimal.
• Many such paradigms are complex and may provide several different outcome measures that could be considered equally valid indicators of cognitive improvement.
• Thus, the problems of unreliability and multiplicity may be particularly acute for RCTs using such measures.
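As a rough illustration of the multiplicity point, the sketch below simulates a trial with no true drug effect and scores it a "win" if any of k equally valid outcomes reaches p < .05; the outcome counts, arm size, and simulation settings are arbitrary assumptions, not from the talk.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_arm, n_sims = 100, 2000

for k in (1, 3, 5, 10):                       # number of outcome measures compared
    wins = 0
    for _ in range(n_sims):
        a = rng.normal(size=(k, n_per_arm))   # k outcomes, no true effect
        b = rng.normal(size=(k, n_per_arm))
        p = stats.ttest_ind(a, b, axis=1).pvalue
        wins += (p < 0.05).any()              # "significant" on any outcome
    print(f"{k:2d} outcomes: family-wise type I rate = {wins / n_sims:.2f}")
# For independent outcomes this approaches 1 - 0.95**k (0.05, 0.14, 0.23, 0.40).
```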

Empirical Estimates of Cohen's d With 95% Confidence Interval (Population ∆ = 0.50)

Leon AC. Schizophrenia Bulletin. 2008;34(4):664–669.

Key Formulas

A quick approximation: 95% CI ≈ d ± 4/√N

Effect Size, Cohen's d: d = (X̄1 − X̄2) / s.d.
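A minimal Python rendering of these two formulas; the example values are hypothetical, not from the talk.

```python
import numpy as np

def cohens_d(x1, x2):
    """Cohen's d: (mean1 - mean2) / pooled s.d."""
    n1, n2 = len(x1), len(x2)
    pooled_sd = np.sqrt(((n1 - 1) * np.var(x1, ddof=1) +
                         (n2 - 1) * np.var(x2, ddof=1)) / (n1 + n2 - 2))
    return (np.mean(x1) - np.mean(x2)) / pooled_sd

def quick_ci_95(d, n_total):
    """Rule-of-thumb 95% CI for d: d +/- 4/sqrt(N)."""
    half_width = 4 / np.sqrt(n_total)
    return d - half_width, d + half_width

# A 'medium' effect estimated from 128 total subjects is still imprecise:
print(quick_ci_95(0.50, 128))   # approx (0.15, 0.85)
```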

Statistical Power Based on Cohen's d = (∆ Active − ∆ Placebo) / s.d. (pooled)

The sample size (N) required for each of 2 treatment groups (assuming equal cell sizes) to detect various effects with statistical power of 0.80, using a t test with a 2-tailed α level of .05:

• small effect (d = 0.20): N = 393 per group
• medium effect (d = 0.50): N = 64 per group
• large effect (d = 0.80): N = 26 per group

Leon AC. Schizophrenia Bulletin. 2008;34(4):664–669.
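These Ns can be reproduced with standard power software; below is a sketch using statsmodels (assumed available), rounding to whole subjects.

```python
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()
for label, d in [("small", 0.20), ("medium", 0.50), ("large", 0.80)]:
    # Per-group n for power 0.80, two-tailed alpha 0.05, equal cell sizes.
    n = power_analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                                   alternative='two-sided')
    print(f"{label} effect (d = {d:.2f}): N = {n:.0f} per group")
# -> 393, 64, 26, matching Leon (2008)
```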

The 2017 Boston Red Sox: First place in the American League East

Data for all non-pitchers on the 2017 roster
• Batting average month by month vs. full season
• Who is the best player on the team?
• How many Red Sox are .400 hitters?
• Hypothesis: more variance in smaller samples; players added after April add variance (see the simulation sketch below)
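A toy simulation of that hypothesis: suppose, hypothetically, that all 21 players are identical .280 hitters; the monthly and full-season at-bat counts below are rough assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
true_avg, n_players = 0.280, 21   # every player assumed a true .280 hitter

for label, at_bats in [("one month  ", 80), ("full season", 500)]:
    hits = rng.binomial(at_bats, true_avg, size=n_players)
    avgs = hits / at_bats
    print(f"{label}: s.d. across players = {avgs.std(ddof=1):.3f}, "
          f"best 'hitter' = {avgs.max():.3f}")
# Small samples inflate both the spread and the best observed average.
```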

Sample size impacts s.d.

[Figure: batting average by month (April–September) and full-season BA for all 21 non-pitchers on the 2017 roster: Mookie Betts, Andrew Benintendi, Xander Bogaerts, Mitch Moreland, Hanley Ramirez, Jackie Bradley Jr., Dustin Pedroia, Christian Vazquez, Sandy Leon, Chris Young, Rafael Devers, Deven Marrero, Eduardo Nunez†, Brock Holt, Josh Rutledge, Pablo Sandoval†, Sam Travis, Marco Hernandez, Tzu-Wei Lin, Rajai Davis†, Steve Selsky]

• Can we assume these 21 players are all the same?

Larger N Associated with Smaller s.d. Yields More Power

[Bar chart: s.d. of batting average by month: April 0.137, May 0.106, June 0.095, July 0.083, August 0.057, September 0.049; full season 0.031]

Traditional 2-Cell Study: Sample requirement N=350

Sample size justification: We are planning a study of a continuous response variable from independent control and experimental subjects, with 1 control per experimental subject. In a previous study, the response within each subject group was normally distributed with standard deviation 10. If the true difference between the experimental and control means is 3, we will need to study 175 experimental subjects and 175 control subjects to be able to reject the null hypothesis that the population means of the experimental and control groups are equal, with probability (power) 0.8. The Type I error probability associated with this test of the null hypothesis is 0.05. (See the power sketch below.)
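This justification matches a standard two-sample power calculation; here is a sketch using statsmodels (assumed available), which also previews the relaxed-alpha variant on the next slide.

```python
from statsmodels.stats.power import TTestIndPower

tt = TTestIndPower()
d = 3 / 10   # true mean difference / within-group s.d. from the text above

# Traditional design: power 0.80 at two-tailed alpha 0.05.
n = tt.solve_power(effect_size=d, alpha=0.05, power=0.80)
print(f"required n per arm = {n:.0f}")          # -> 175

# Support-of-concept variant (next slide): 75 per arm, alpha relaxed to 0.25.
p = tt.power(effect_size=d, nobs1=75, alpha=0.25)
print(f"power at n=75, alpha=0.25 = {p:.2f}")   # -> about 0.75
```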

Support of Concept Study: 2-Cell Sample requirement N=150

Note the impact of the smaller sample: We are planning a study of a continuous response variable from independent control and experimental subjects, with 1 control per experimental subject. In a previous study, the response within each subject group was normally distributed with standard deviation 10. If the true difference between the experimental and control means is 3, we will need to study 75 experimental subjects and 75 control subjects to be able to reject the null hypothesis that the population means of the experimental and control groups are equal, with probability (power) 0.75. The Type I error probability associated with this test of the null hypothesis is 0.25.

• Achieving a low placebo response is only critical for the success of underpowered treatment arms.

A. Khan et al. (2017), Table 1: The effect of a median split of treatment arm sample size on the relationship between magnitude of placebo response and treatment arm success.

Depression                          <170 N        >170 N
Mean placebo response               29.9%         36.7%
Placebo response vs. success        R² = 0.346    R² = 0.043
                                    p < 0.001     p = 0.164
Success rate                        40%           68%

Even though placebo response is significantly (p < 0.001) higher in the >170 N group (36.7%) than in the <170 N group (29.9%), the success rate is 28% higher.

Schizophrenia                       <202 N        ≥202 N
Mean placebo response               6.3%          10.1%
Placebo response vs. success        R² = 0.377    R² = 0.067
                                    p = 0.001     p = 0.089
Success rate                        63%           77%

Even though placebo response is significantly (p = 0.018) higher in the ≥202 N group (10.1%) than in the <202 N group (6.3%), the success rate is 14% higher.

Bigger is Better! Debate Conclusion:

Bigger is better
• Ceteris paribus, larger samples deliver more power to detect true differences
• Power reflects sample size, s.d., and effect size

• The chance of deviation from the true population value lessens as sample size increases

• Studies designed to take advantage of this mathematical reality must also pay attention to quality control and feasibility.

• Multiplicity: comparing the investigational and comparator groups on multiple outcomes inflates type I error
• Unreliability: introduces bias, reduces statistical power
• Feasibility: requirement for qualified sites and raters

Rebuttal

Is bigger better? Protocol xxx Site Quality Heat Map: the problem of outlier sites violates ceteris paribus

[Heat map: sites A–F ranked from worst to best quality against number of subjects randomized]

Protocol xxx Cumulative Quality Indicator Matrix: Site A PI yyyyy

Extras for back-up

Is bigger better? Interpretation of “∆”: Critical evaluation required

Treatment development relies on detection of small deltas (differences between change scores). Weigh development decisions based on outcomes and study quality?

Confidence for decision making (thresholds of interest: d < 0.3 vs. d > 0.4):

                          Study Quality: Good          Study Quality: Poor
Positive study result     High (avoid repetition)      ?
Negative study result     Low (avoid repetition)       Exclude (avoid repetition)

Most RCTs are designed to detect effect sizes between 0.3 and 0.4.

Study Management Using Blinded Data Analytics

December 8, 2016

Case Study: Probing the fog of study operations
• Large, complex multisite trial
• Slow recruitment
• Quality concerns
• High respondent burden: screening is burdensome (requires up to 6 hrs)
• Site feedback: sponsor-collected anecdotes, Bracket-collected anecdotes, BDA

Site A PI yyyy: Cases of YMRS (Rater) Identical Item Scoring During DB Phase

[Figure: Subject #### YMRS item scores during the DB phase; solid line = rater score, dashed line = computer score]

BPI duration by site

[Figure: per-site N of assessments and average BPI duration in minutes]

YMRS durations by site

[Figure: per-site N of assessments and average YMRS duration in minutes]

MADRS durations by site

[Figure: per-site N of assessments and average MADRS duration in minutes, across all post-randomization visits]

Site 84 MADRSComp: Median duration varies with severity

[Figure: median MADRSComp interview duration by severity band]

MADRSComp Total      0–8        9–16       17–24      25–32
n                    30         19         39         36
Median duration      2.5 min    12 min     21 min     27 min
SD                   2.9        2.4        2.2        1.7

MADRS at Screen

Average rater MADRS interview duration at individual sites at screening

[Figure: per-site mean MADRS score (+SD) and average interview duration in minutes at screening, with mean and ±2 s.d. reference lines]

MADRS at Baseline

Average rater MADRS interview duration at individual sites at baseline

[Figure: per-site mean MADRS score (+SD) and average interview duration in minutes at baseline, with mean and ±2 s.d. reference lines]

MADRS during DB Phase

Average rater MADRS interview duration at individual sites

[Figure: per-site mean MADRS score (+SD) and average interview duration in minutes during the double-blind phase, with mean and ±2 s.d. reference lines]

YMRS at Screening Visit

Average rater YMRS interview duration at individual sites at screening

[Figure: per-site mean YMRS score (+SD) and average interview duration in minutes at screening, with mean and ±2 s.d. reference lines]

YMRS at Baseline

Average rater YMRS interview duration at individual sites at baseline

[Figure: per-site mean YMRS score (+SD) and average interview duration in minutes at baseline, with mean and ±2 s.d. reference lines]

YMRS during DB phase

Average rater YMRS interview duration at individual sites

[Figure: per-site mean YMRS score (+SD) and average interview duration in minutes during the double-blind phase, with mean and ±2 s.d. reference lines]

XXX screening: MADRSSBR vs MADRSComp

MADRS rater vs. computer scores at screening: Subjects: 464; Rater Mean (SD): 36.23 (6.08); Computer Mean (SD): 34.99 (7.75)

[Scatter plot: MADRS rater score vs. MADRS computer score at screening; points flagged as randomized (concordant or borderline concordant) and not randomized (concordant or discordant)]

XXX Visit 1: ∆MADRSSBR vs ∆MADRSComp

Rater vs. computer MADRS change at Visit 1: Rater Mean (SD): −9.71 (9.88); Computer Mean (SD): −10.25 (10.11)

[Scatter plot: rater vs. computer MADRS change in points, concordant vs. discordant]

Statistical Power Based on Cohen's d = (∆ Active − ∆ Placebo) / s.d. (pooled)

The sample size (N) required for each of 2 treatment groups (assuming equal cell sizes) to detect various effects with statistical power of 0.80, using a t test with a 2-tailed α level of .05:

• small effect (d = 0.20): N = 393 per group
• medium effect (d = 0.50): N = 64 per group
• large effect (d = 0.80): N = 26 per group

Leon AC. Schizophrenia Bulletin. 2008;34(4):664–669.

Good inter-rater reliability is important because it has an enormous impact on statistical power

Sample size required per arm, by intraclass correlation coefficient (rater reliability):

ICC:          1.0    0.9    0.8    0.7    0.6    0.5
N per arm:    100    111    125    143    167    200

Adapted from Kobak et al. Journal of Clinical Psychopharmacology. 2009;29:82–85.
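One way to read this table: if unreliability attenuates the observed effect size by roughly √ICC, the required sample per arm scales as 1/ICC. A minimal sketch under that assumption (taking 100 per arm as the perfect-reliability base case, as in the table) reproduces the numbers.

```python
# Attenuation model (assumption): observed d = true d * sqrt(ICC),
# and required n is proportional to 1/d**2, so n scales as 1/ICC.
n_perfect = 100   # per-arm requirement at ICC = 1.0 (base case from the table)

for icc in (1.0, 0.9, 0.8, 0.7, 0.6, 0.5):
    print(f"ICC = {icc:.1f}: n per arm = {n_perfect / icc:.0f}")
# -> 100, 111, 125, 143, 167, 200
```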

A formula for estimating the number of subjects needed per group for statistical power of 0.80 to detect population effect sizes of other magnitudes with a t test is: N = 16/d².

For example, if ∆ Active − ∆ Placebo = 4 (s.d. = 10), then to detect an effect size of d = 0.40 the number of subjects required is 16/(0.4)² = 100 per group; total N = 200.

If ∆ Active − ∆ Placebo = 3, then to detect an effect size of d = 0.30 the number required is 16/(0.3)² ≈ 178 per group; total N ≈ 356.

The cost of the roughly 156 extra randomized subjects exceeds $2 million. (See the sketch below.)
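A one-function sketch of Lehr's rule with the two signals above (within-group s.d. of 10 assumed, as in the examples):

```python
def lehr_n_per_group(delta, sd):
    """Lehr's approximation: n per group ~= 16 / d**2 (power 0.80, alpha .05)."""
    d = delta / sd
    return 16 / d ** 2

for delta in (4, 3):   # active-minus-placebo signal in raw points; s.d. = 10
    n = lehr_n_per_group(delta, 10)
    print(f"delta = {delta}: d = {delta / 10:.1f}, "
          f"n per group = {n:.0f}, total N = {2 * n:.0f}")
# -> 100 per group (N = 200) vs. about 178 per group (N = 356)
```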

Lehr R. 16 s-squared over d-squared: a relation for crude sample size estimates. Stat Med. 1992;11:1099–1102.