
Advanced Power Analysis Workshop

PD Dr. Felix Schönbrodt & Dr. Stella Bollmann
Ludwig-Maximilians-Universität München
www.nicebread.de | www.researchtransparency.org | Twitter: @nicebread303

• Part I: General concepts of power analysis

•Part II: Hands-on: Repeated measures ANOVA and multiple regression

•Part III: Power analysis in multilevel models

• Part IV: Tailored design analyses by simulations in R

Part I: General concepts of power analysis

• What is "statistical power"?
• Why power is important
• From power analysis to design analysis: Planning for precision (and other stuff)
• How to determine the expected/minimally interesting effect size

What is statistical power? A 2x2 classification

                              Reality:            Reality:
                              Effect present      No effect present
Test indicates:
Effect present                True Positive       False Positive
Test indicates:
No effect present             False Negative      True Negative

https://effectsizefaq.files.wordpress.com/2010/05/type-i-and-type-ii-errors.jpg

A priori power analysis: We assume that the effect exists in reality

                              Reality:                  Reality:
                              Effect present            No effect present
Test indicates:
p < .05                       True Positive             False Positive
                              (Power = 1 − β)           (α = 5%)
Test indicates:
p > .05                       False Negative            True Negative
                              (β = 20%)

Calibrate your power feeling

Test                                                          total n
Two-sample t-test (between design), d = 0.5                   128 (64 in each group)
One-sample t-test (within design), d = 0.5                    34
Correlation, r = .21                                          173
Difference between two correlations, r₁ = .15, r₂ = .40
(➙ q = 0.273)                                                 428
ANOVA, 2x2 design: interaction effect, f = 0.21               180 (45 in each group)

All a priori power analyses with α = 5%, β = 20%, and two-tailed tests.
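Most of these values can be reproduced with the R package pwr; a sketch (results may differ by a participant or two from G*Power's):

library(pwr)
pwr.t.test(d = 0.5, sig.level = .05, power = .80, type = "two.sample")  # ~64 per group
pwr.t.test(d = 0.5, sig.level = .05, power = .80, type = "paired")      # ~34 pairs
pwr.r.test(r = .21, sig.level = .05, power = .80)                       # total n approx. 173-176, depending on the approximation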

The power of within-subjects designs

May, K., & Hittner, J. B. (2012). Effect of correlation on power in within-subjects versus between-subjects designs. Innovative Teaching, 1, 2.

Typical reported effect sizes I

• e.g., Richard, Bond, & Stokes-Zoota (2003)
• Meta-meta-analysis; > 25,000 studies, > 8,000,000 participants
• Mean effect: r = .21 (SD across the literature = .15); median: r = .18

Richard, F. D., Bond, C. F. J., & Stokes-Zoota, J. J. (2003). One Hundred Years of Social Psychology Quantitatively Described. Review of General Psychology, 7(4), 331–363. doi:10.1037/1089-2680.7.4.331

Typical reported effect sizes II

• e.g., Bosco et al. (2015)
• 147,328 correlations from Journal of Applied Psychology and Personnel Psychology
• Median effect: r = .16, mean effect: r = .22 (SD = .20)

Bosco, F. A., Aguinis, H., Singh, K., Field, J. G., & Pierce, C. A. (2015). Correlational effect size benchmarks. Journal of Applied Psychology, 100(2), 431–449. http://doi.org/10.1037/a0038047

Average sample size: n = 40

Average published effect size: d = .5 / r = .21 (certainly overstated due to publication bias)

Average power: <34%

Bakker, M., van Dijk, A., & Wicherts, J. M. (2012). The Rules of the Game Called Psychological Science. Perspectives on Psychological Science, 7(6), 543–554.

[Figure from Fanelli (2011): proportion of papers reporting positive results, by discipline; for Psychology/Psychiatry (PP) the overall value is 92%!]

Fanelli, D. (2011). Negative results are disappearing from most disciplines and countries. Scientometrics, 90(3), 891–904. doi:10.1007/s11192-011-0494-7

Why power is important: Exercise

Given that p < .05: What is the probability that a real effect exists in the population? ➙ prob(H₁ | D)

Worked example:
• 1000 tests are performed; 30% of the investigated effects are real ➙ a real effect in 300 tests, no effect in 700 tests
• Power = 35%, α = 5%
• Real effects (300): 35% ➙ 105 effects detected (true positives), 195 effects not detected (false negatives)
• No effect (700): 5% ➙ 35 significant results (false positives), 665 n.s. results (true negatives)

35 of the (35 + 105) = 140 significant p-values actually come from a population with a null effect, 105 of 140 from a real effect.
False discovery rate (FDR) = 35/140 = 25%
Positive predictive value (PPV) = 105/140 = 75%
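The same bookkeeping in R; a minimal sketch using the numbers from the example above:

prior   <- .30    # proportion of investigated effects that are real
power   <- .35
alpha   <- .05
n_tests <- 1000

true_pos  <- n_tests * prior * power              # 105
false_neg <- n_tests * prior * (1 - power)        # 195
false_pos <- n_tests * (1 - prior) * alpha        #  35
true_neg  <- n_tests * (1 - prior) * (1 - alpha)  # 665

false_pos / (false_pos + true_pos)   # FDR = 0.25
true_pos  / (false_pos + true_pos)   # PPV = 0.75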

Nuzzo, R. (2014). Statistical errors. Nature.
Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 1(3), 140216.

Practice with the PPV app! http://shinyapps.org/apps/PPV/

Smaldino, P. E., & McElreath, R. (2016). The natural selection of bad science. arXiv:1605.09511 [physics, stat]. Retrieved from http://arxiv.org/abs/1605.09511

"Our results indicate that the median statistical power in neuroscience is 21%."

Assume that the tested hypotheses are true in 30% of all cases (a not-too-risky research scenario):

• A typical neuroscience study must fail in 94% of all cases (it will not produce a true positive: 1 − .30 × .21 ≈ 94%)
• In the most likely outcome of p > .05, we have no idea whether a) the effect does not exist, or b) we simply missed the effect. Virtually no knowledge has been gained.

Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci, 14(5), 365–376. doi:10.1038/nrn3475

When a study is underpowered, it most likely provides only weak inference. Even before a single participant is assessed, it is highly unlikely that an underpowered study provides an informative result.

Consequently, research unlikely to produce diagnostic outcomes is inefficient and can even be considered unethical. Why sacrifice people's time, animals' lives, and society's resources on an endeavor that is highly unlikely to be informative?

Schönbrodt, F. D. & Wagenmakers, E.-J. (2016). Bayes Factor Design Analysis: Planning for Compelling Evidence. http://dx.doi.org/10.2139/ssrn.2722435

Power is a frequentist property - beware of fallacies!

• Power is a pre-data measure (i.e., computed before data are collected) that averages over infinitely many hypothetical experiments
• Only one of these hypothetical experiments will actually be observed
• Power is a property of the test procedure, not of a single study's outcome!
• Power is conditional on a hypothetical effect size, not conditional on the actual data obtained
• "Once the actual data are available, a power calculation is no longer conditioned on what is known, no longer corresponds to a valid inference, and may now be misleading."
• ➙ For inference, better use likelihood ratios or Bayes factors. Then pre-data power considerations are irrelevant.

Wagenmakers, E.-J., Verhagen, J., Ly, A., Bakker, M., Lee, M. D., Matzke, D., Rouder, J. N., et al. (2014). A power fallacy. Behavior Research Methods, 47, 913–917. doi:10.3758/s13428-014-0517-4

From power analysis to design analysis

Classical a priori power analysis:
• Assume that the real effect has a certain size (say, Cohen's d = 0.5)
• When I run many identical studies with a fixed sample size n and α level of 5% …
• … what proportion of them will result in a p-value smaller than α?
• ➙ Task: Find the sample size n that ensures that at least, say, 80% of all studies have significant results.

Planning for precision:
• Assume that the real effect has a certain size (say, Cohen's d = 0.5)
• When I run many identical studies with a fixed sample size n and α level of 5% …
• … what proportion of them will result in a confidence interval with a width < 0.10?
• ➙ Task: Find the sample size n that ensures that at least, say, 80% of all studies have a CI width < 0.10.
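A simulation sketch of planning for precision, in the same style as the code in Part IV (all numbers are illustrative assumptions; dedicated accuracy-in-parameter-estimation functions exist as well, e.g. in the MBESS package):

n     <- 3200   # per group; illustrative starting value, tune until the criterion is met
delta <- 0.5    # assumed true effect size
B     <- 2000   # number of simulated studies

widths <- replicate(B, {
  x <- rnorm(n, mean = 0,     sd = 1)
  y <- rnorm(n, mean = delta, sd = 1)
  diff(t.test(x, y)$conf.int)          # width of the 95% CI for the mean difference
})

mean(widths < 0.10)   # proportion of studies with CI width < 0.10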

Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample Size Planning for Statistical Power and Accuracy in Estimation. Annual Review of Psychology, 59(1), 537–563. http://doi.org/10.1146/annurev.psych.59.103006.093735

General a priori design analysis

Assume properties of reality (effect sizes, distributions, etc.)

When I run many identical studies with certain design properties (e.g., sample size, sampling plan, thresholds, priors) …

… what proportion of them will yield certain test results (e.g., p < .01, CI width < .05, Bayes factor > 10)?

➙ Task: Tune the design properties in a way that at least X% of all hypothetical studies give the desired outcome

Kruschke, J. (2014). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan (2nd edition). Boston: Academic Press.
Schönbrodt, F. D. & Wagenmakers, E.-J. (2016). Bayes Factor Design Analysis: Planning for Compelling Evidence. http://dx.doi.org/10.2139/ssrn.2722435

How to determine the expected effect?

Can we base our power analyses on published effect sizes?

• No.
• See RP:P: 83% of all effect sizes are smaller than the original: mean original r = .40 ➙ mean replication r = .20
• See also Franco et al. (2015): reported ES are about 2x larger than unreported ES
• RP:P's median power conditional on reported ES: 92%
• RP:P's median power conditional on replication ES: <30%
• Suggestion: Divide the reported effect by 2, then compute the power analysis (see the sketch below).
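A quick illustration of the halving heuristic with the pwr package (d = 0.5 is just the reported value used in the earlier examples):

library(pwr)
pwr.t.test(d = 0.50, sig.level = .05, power = .80)   # reported effect: ~64 per group
pwr.t.test(d = 0.25, sig.level = .05, power = .80)   # halved effect:  ~253 per group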

Safeguard power (Perugini et al., 2014)

• Incorporate the uncertainty in the original study's ES estimate
• Aim for the lower end of the 60% CI
• Example:
  • Original study finds d = 0.5 (n = 30 in each group)
  • 60% CI = [0.28; 0.72]
  • Naive 80% power analysis: n = 64
  • Safeguard 80% power analysis: n = 202
• Rewards precise estimates in the original study

library(MBESS)
ci.smd(smd=0.5, n.1=30, n.2=30, conf.level=0.60)
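A follow-up sketch: plug the lower limit of the 60% CI (≈ 0.28, as reported above) into a standard power analysis:

library(MBESS)
library(pwr)
ci.smd(smd = 0.5, n.1 = 30, n.2 = 30, conf.level = 0.60)   # lower limit is approx. 0.28
pwr.t.test(d = 0.28, sig.level = .05, power = .80)         # safeguard n: ~202 per group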

Perugini, M., Gallucci, M., & Costantini, G. (2014). Safeguard Power as a Protection Against Imprecise Power Estimates. Perspectives on Psychological Science, 9(3), 319–332. http://doi.org/10.1177/1745691614528519

2.5x rule of thumb (Simonsohn, 2015)

• Rule of thumb: run the replication with 2.5 times the original sample size
• Two possible questions in a replication:
  • Does the replication ES differ significantly from zero?
  • Does the replication ES differ significantly from a detectable effect?
• Not very useful in already well-powered, large-n situations

Smallest Effect Size of Interest (SESOI)

• "I don't care about smaller effects"
• If p > .05, one of three possible conditions holds:
  • The effect exists in a relevant size, but I was unlucky (with chance β)
  • The effect exists, but is so small that I don't care about it
  • The effect does not exist
• Problem 1: The SESOI is hard to determine
• Problem 2: Highly inefficient if the true effect is larger than the minimally interesting effect

Sensitivity analysis

• Sometimes we cannot commit to a point estimate of the assumed/minimally interesting effect size
• Compute the power analysis for a range of plausible values (e.g., d = 0.3, 0.5, and 0.7), as in the table below

δ      n for 80% power in a two-group t-test (each group)
0.3    175
0.5    64
0.7    33
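The table can be reproduced with the pwr package; a sketch (values may differ by one participant from the table because of rounding conventions):

library(pwr)
sapply(c(0.3, 0.5, 0.7), function(d)
  ceiling(pwr.t.test(d = d, sig.level = .05, power = .80)$n))   # n per group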

Bayesian (hybrid) power analysis

Quantify the uncertainty about the true effect under H₁ with a prior: δ ∼ N(0.5, σ² = 0.1²)

• Run a Monte Carlo simulation:
  • Draw an effect size δᵢ from the prior distribution
  • Draw a sample of size n from a population with effect size δᵢ
  • Repeat 10,000 times and record how many studies have a significant p-value
• For this prior, the necessary sample size for obtaining a power of 80% would be n = 70.

O'Hagan, A., Stevens, J. W., & Campbell, M. J. (2005). Assurance in clinical trial design. Pharmaceutical Statistics, 4(3), 187–201. http://doi.org/10.1002/pst.175
Schönbrodt, F. D. & Wagenmakers, E.-J. (submitted). Bayes factor design analysis: Planning for compelling evidence.

Part II: Hands-on

Repeated measures ANOVA and multiple regression

Statistical tests

• For any power analysis, one needs the distribution of the test statistic under the alternative hypothesis (the noncentral distribution)

• That is, we have to specify the alternative hypothesis

Effect sizes

• The noncentrality (location) of the distribution is determined by the pre-specified effect size

• Possible effect sizes we can test for significance:

ANOVA:
• One-way ANOVA: η² = σ²_between / σ²_total
• Two-way ANOVA: η²_partial(A) = σ²_A / (σ²_A + σ²_ε)
• Repeated measures: η²_partial = σ²_between / (σ²_between + σ²_ε)

Regression:
• Simple linear regression: ρ² = β²_z
• Multiple linear regression: ρ², partial ρ²_j (e.g., for β_zj)

Effect sizes

ANOVA:
• One-way ANOVA: f² = η² / (1 − η²)
• Two-way ANOVA & repeated measures: f² = η²_partial / (1 − η²_partial)

Regression:
• Simple linear regression: f²(β_z) = ρ² / (1 − ρ²)
• Multiple linear regression: f² = ρ² / (1 − ρ²), or f²(β_zj) = ρ²_j / (1 − ρ²) for a single predictor

Easier to estimate using the partial r²

Sample size determination: Step 1

• Choose the minimum expected/interesting effect size
• Cohen's conventions:
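Cohen's conventional values can be looked up with pwr::cohen.ES() (small/medium/large correspond to d = 0.2/0.5/0.8, r = .10/.30/.50, f = 0.10/0.25/0.40, f² = 0.02/0.15/0.35):

library(pwr)
cohen.ES(test = "t", size = "medium")      # d  = 0.5
cohen.ES(test = "r", size = "medium")      # r  = .30
cohen.ES(test = "anov", size = "medium")   # f  = 0.25
cohen.ES(test = "f2", size = "medium")     # f² = 0.15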

Sample size determination in G*Power

• What has to be pre-specified?

1. Expected minimum effect size
2. Required alpha level
3. Minimum required power

Power analysis for MLR

• Problem: collinear predictors
• The SD of the estimate of β_zj depends on the degree of collinearity
• The power for β_zj depends on the degree of collinearity: the higher the collinearity, the lower the power
• Solution:
  1. Use β_zj of uncorrelated predictors (very unlikely in practice)
  2. Use the partial r², which is independent of the degree of collinearity (see the sketch below)
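A sketch of the partial-r² route with pwr.f2.test(); the number of predictors and the partial R² are illustrative assumptions:

library(pwr)
k   <- 5                       # number of predictors in the model (assumed)
pr2 <- 0.10                    # assumed partial R² of the focal predictor
f2  <- pr2 / (1 - pr2)         # convert to Cohen's f²
res <- pwr.f2.test(u = 1, f2 = f2, sig.level = .05, power = .80)
ceiling(res$v) + k + 1         # required total n, since v = n - k - 1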

Sample size determination in R

• Package: "pwr"
• One-way ANOVA
• Simple linear regression
(sketches for both cases below)
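Hedged examples with the pwr package (the effect sizes are assumptions, not recommendations):

library(pwr)

# One-way ANOVA: 4 groups, medium effect f = 0.25
pwr.anova.test(k = 4, f = 0.25, sig.level = .05, power = .80)        # n per group

# Simple linear regression: assumed ρ² = .10, i.e. f² = .10 / .90
pwr.f2.test(u = 1, f2 = 0.10 / 0.90, sig.level = .05, power = .80)   # total n = v + 2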

Part III: Power analysis in multilevel models

Power in multilevel models

• Assume that testing one student costs 10€, and you have a budget of 10,000€ ...

• 40 classes with 25 students each?

• 100 classes with 10 students each?
• Or, in a 3-level structure:

• 10 classes from 10 schools with 10 students?

• 20 classes from 5 schools with 10 students?

• 5 classes from 10 schools with 20 students?
• Trade-off: Each new school generates 500€ extra costs (e.g., for driving there). Shall we sacrifice 50 students for an extra school on level 3?

Power in MLMs: Difficulties

• Sample size matters on multiple levels
• You need a priori estimates for the random effects (e.g., how much variance in intercepts do you expect?) ➙ hard to estimate, typically unknown
• Introducing covariates can change the optimal sample size on each level, depending on how much variance the covariates explain within or between groups
• Power also depends on the intraclass correlation coefficient (ICC)

[Figures: some exemplary ICCs, simulated with 20 groups of 20 persons each]

Power in MLMs: Dependency on ICC

• Dependency on ICC

• When the ICC ➙ 1, there are no differences between the units within each group

• each L1 unit carries virtually no additional information beyond the existing L1 units

• Measuring more L1 units yields no informational gain; the only way is to add L2 units
• "ICC values typically range between .05 and .20" (Snijders & Bosker, 1999; Bliese, 2000)
• "Values greater than .30 will be rare" (Bliese, 2000)
• "Median value of .12; this is likely an overestimate" (James, 1982)
• "A value between .10 and .15 will provide a conservative estimate of the ICC when it cannot be precisely computed"
• But: there are some research fields where ICCs > 30% are not uncommon! Always try to tune the assumptions to your field, and do not rely on defaults.
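For orientation, the ICC of a random-intercept model is τ₀₀ / (τ₀₀ + σ²). A minimal self-contained sketch with simulated data (all values, including the true ICC of .15, are assumptions):

library(lme4)
set.seed(1)
J <- 20; n_j <- 20                                 # 20 groups with 20 persons each
dat <- data.frame(group = rep(1:J, each = n_j))
u0 <- rnorm(J, 0, sqrt(0.15))                      # group-level variance tau00 = .15
dat$y <- u0[dat$group] + rnorm(J * n_j, 0, sqrt(0.85))   # residual variance sigma² = .85
fit <- lmer(y ~ 1 + (1 | group), data = dat)
vc <- as.data.frame(VarCorr(fit))
tau00  <- vc$vcov[vc$grp == "group"]               # between-group (intercept) variance
sigma2 <- vc$vcov[vc$grp == "Residual"]            # within-group residual variance
tau00 / (tau00 + sigma2)                           # estimated ICC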

Scherbaum, C. A., & Ferreter, J. M. (2008). Estimating Statistical Power and Required Sample Sizes for Organizational Research Using Multilevel Modeling. Organizational Research Methods, 12(2), 347–367. http://doi.org/10.1177/1094428107308906

Power in MLMs: More L1 or L2?

• For the power of L1 predictors, the # of L1 units is more relevant (but also take the ICC into account)
• For the power of L2 predictors (e.g., γ₀₁, the slope of an L2 predictor), the # of L2 units is more relevant, as well as the size of the L2 units (i.e., the L1 units within each L2 unit)
• Typically, the n on L2 is more important than the n within each L2 unit

Scherbaum, C. A., & Ferreter, J. M. (2008). Estimating Statistical Power and Required Sample Sizes for Organizational Research Using Multilevel Modeling. Organizational Research Methods, 12(2), 347–367. http://doi.org/10.1177/1094428107308906

Rules of thumb (which are often criticized!)

•Kreft (1996): 30/30 rule

• >= 30 units on L2

• >= 30 L1 units for each L2 unit

• ➙ 30 × 30 = 900 units on L1
• Hox (1998): 50/20 rule

• >= 50 units on L2

• >= 20 L1-units for each L2 unit

• ➙ 50 × 20 = 1000 units on L1

Scherbaum, C. A., & Ferreter, J. M. (2008). Estimating Statistical Power and Required Sample Sizes for Organizational Research Using Multilevel Modeling. Organizational Research Methods, 12(2), 347–367. http://doi.org/10.1177/1094428107308906

Example: 3-level therapy study

• L1: measurements (5, 11, or 21 per patient), nested in ...
• L2: patients (2, 4, or 8 per therapist), nested in ...
• L3: therapists (focus of the power analysis: How many therapists do you acquire?)
• Outcome variable on L1: therapy progress (self-reported)
• Focal experimental factor: Therapists get feedback about self-reported therapy progress (or not), i.e., a between-group experimental factor on L3
• ➙ moderating effect on the L1 slope: Is therapy progress over time (rate of change) stronger when therapists get feedback about the current status?

de Jong, K., Moerbeek, M., & van der Leeden, R. (2010). A priori power analysis in longitudinal three-level multilevel models: An example with therapist effects. Psychotherapy Research, 20(3), 273–284. http://doi.org/10.1080/10503300903376320

Example: 3-level therapy study

• Saturation at ~60 therapists: more are not necessary (if each therapist has 4 patients)
• The # of measurements per patient is irrelevant

de Jong, K., Moerbeek, M., & van der Leeden, R. (2010). A priori power analysis in longitudinal three-level multilevel models: An example with therapist effects. Psychotherapy Research, 20(3), 273–284. http://doi.org/10.1080/10503300903376320

Example: 3-level therapy study

[Figure: power curves; 80% power is reached at 168 patients in each scenario: 21 therapists × 8 patients, 42 × 4, or 84 × 2]

de Jong, K., Moerbeek, M., & van der Leeden, R. (2010). A priori power analysis in longitudinal three-level multilevel models: An example with therapist effects. Psychotherapy Research, 20(3), 273–284. http://doi.org/10.1080/10503300903376320

MLPowSim

• MLPowSim (http://www.bristol.ac.uk/cmm/software/mlpowsim/)
• Generates simulation code for R (a hand-rolled sketch follows below)
• 2- and 3-level models
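A hand-rolled alternative in the spirit of the Part IV simulations: a power simulation for a level-2 predictor in a two-level random-intercept model (all parameter values are illustrative assumptions, not recommendations):

library(lme4)
library(lmerTest)   # adds p-values for fixed effects

J   <- 50        # number of L2 units (groups)
n_j <- 20        # L1 units per group
gamma01 <- 0.3   # assumed effect of the L2 predictor (standardized)
tau2    <- 0.10  # random-intercept variance (ICC = .10 with residual variance .90)
B   <- 500       # Monte Carlo replications (increase for stable estimates)

pvals <- replicate(B, {
  z  <- rnorm(J)                            # L2 predictor
  u0 <- rnorm(J, 0, sqrt(tau2))             # random intercepts
  dat <- data.frame(group = rep(1:J, each = n_j), z = rep(z, each = n_j))
  dat$y <- gamma01 * dat$z + u0[dat$group] + rnorm(J * n_j, 0, sqrt(1 - tau2))
  fit <- lmer(y ~ z + (1 | group), data = dat)
  summary(fit)$coefficients["z", "Pr(>|t|)"] # p-value of the L2 predictor
})

mean(pvals < .05)   # estimated power; tune J and n_j until the target is reached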

Web app from Jake Westfall: Stimuli nested in persons

Westfall, J., Kenny, D. A., & Judd, C. M. (2014). Statistical power and optimal design in experiments in which samples of participants respond to samples of stimuli. Journal of Experimental Psychology: General, 143(5), 2020–2045.

Literature

• Scherbaum, C. A., & Ferreter, J. M. (2008). Estimating Statistical Power and Required Sample Sizes for Organizational Research Using Multilevel Modeling. Organizational Research Methods, 12(2), 347–367. http://doi.org/10.1177/1094428107308906

• de Jong, K., Moerbeek, M., & van der Leeden, R. (2010). A priori power analysis in longitudinal three-level multilevel models: An example with therapist effects. Psychotherapy Research, 20(3), 273–284. http://doi.org/10.1080/10503300903376320

• Maas, C., & Hox, J. J. (2005). Sufficient sample sizes for multilevel modeling. Methodology, 1(3), 85– 91. http://doi.org/10.1027/1614-2241.1.3.85

• Mathieu, J. E., Aguinis, H., Culpepper, S. A., & Chen, G. (2012). Understanding and estimating the power to detect cross-level interaction effects in multilevel modeling. Journal of Applied Psychology, 97(5), 951–966. http://doi.org/10.1037/a0028380

• Snijders, T. A. B. (2005). Power and sample size in multilevel linear models. In Encyclopedia of Statistics in Behavioral Science.

Part IV: Tailored design analyses by simulation in R

General a priori design analysis, translated into code:
• Assume properties of reality (effect sizes, distributions, etc.)
• When I run many identical studies with certain design properties (e.g., sample size, sampling plan, thresholds, priors) …
• … what proportion of them will yield certain test results (e.g., p < .01, CI width < .05, Bayes factor > 10)?
• ➙ Task: Tune the design properties in a way that at least X% of all hypothetical studies give the desired outcome

# Example: two-group t-test
n1 <- 100      # sample size first group
n2 <- n1       # sample size second group
delta <- 0.5   # assumed true effect size
alpha <- .05   # Type I error level
B <- 10000     # number of Monte Carlo simulations (= hypothetical studies)

ps <- c()      # stores p-values from each simulation
for (i in 1:B) {
  # draw a sample from a population that has the assumed effect size
  x <- rnorm(n1, mean=0, sd=1)
  y <- rnorm(n2, mean=delta, sd=1)
  t1 <- t.test(x, y)
  ps <- c(ps, t1$p.value)
}

# compute power (= proportion of studies that have a p-value < alpha)
prop.table(table(ps < alpha))
print(paste0("Power = ", sum(ps < alpha)/B*100, "%"))

# tune n1 and n2 until the desired power is achieved

# Example: two-group t-test with prior on effect size
n1 <- 100      # sample size first group
n2 <- n1       # sample size second group
alpha <- .05   # Type I error level
B <- 10000     # number of Monte Carlo simulations

ps <- c()      # stores p-values from each simulation
for (i in 1:B) {
  x <- rnorm(n1, mean=0, sd=1)
  # in each Monte Carlo run: choose another true ES from the prior distribution
  d_i <- rnorm(1, mean=0.5, sd=0.2)
  y <- rnorm(n2, mean=d_i, sd=1)
  t1 <- t.test(x, y)
  ps <- c(ps, t1$p.value)
}

# compute power and tune n1 and n2 as above
prop.table(table(ps < alpha))
print(paste0("Power = ", sum(ps < alpha)/B*100, "%"))

Hand-carved simulations in R

• Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press. Chapter 20.5

• Kruschke, J. (2014). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan (2nd edition.). Boston: Academic Press. Chapter 13

63 Bayesian (hybrid) power analysis

# ------------------------------------------------------------------
# More advanced: Bayesian hybrid power analysis
# Use a prior distribution to quantify uncertainty about the true effect;
# then compute the percentage of significant studies

# Our standard scenario: a two-group t-test

# find a reasonable prior plausibility distribution of effect sizes
# e.g.: Normal(mean=0.5, sd=0.2)
effectSizeRange <- seq(-0.5, 2, length=100)
plot(x=effectSizeRange, y=dnorm(effectSizeRange, mean=0.5, sd=0.2),
     type="l", ylab="Plausibility", xlab="Effect size")

n1 <- 100   # sample size first group
n2 <- n1
B  <- 1000  # number of Monte Carlo simulations (e.g., 5000)

ps <- c()   # this vector stores the B p-values
for (i in 1:B) {
  x <- rnorm(n1, mean=0, sd=1)
  # in each Monte Carlo run: choose another true ES from the prior distribution
  y <- rnorm(n2, mean=rnorm(1, mean=0.5, sd=0.2), sd=1)
  t1 <- t.test(x, y)
  ps <- c(ps, t1$p.value)
}

# compute power (= proportion of studies that have a p-value < .05)
prop.table(table(ps < .05))

# plot the p-curve in the significant range
hist(ps[ps < .05])

# --> now increase the sample size until the desired power is achieved

Schönbrodt, F. D. & Wagenmakers, E.-J. (submitted). Bayes factor design analysis: Planning for compelling evidence.