Advanced Power Analysis Workshop
PD Dr. Felix Schönbrodt & Dr. Stella Bollmann, Ludwig-Maximilians-Universität München
www.nicebread.de | www.researchtransparency.org | Twitter: @nicebread303

•Part I: General concepts of power analysis
•Part II: Hands-on: Repeated measures ANOVA and multiple regression
•Part III: Power analysis in multilevel models
•Part IV: Tailored design analyses by simulations in R
Part I: General concepts of power analysis

•What is “statistical power”?
•Why power is important
•From power analysis to design analysis: Planning for precision (and other stuff)
•How to determine the expected/minimally interesting effect size
What is statistical power? A 2x2 classification matrix

                                Reality:             Reality:
                                Effect present       No effect present
Test indicates:
Effect present                  True Positive        False Positive
Test indicates:
No effect present               False Negative       True Negative
https://effectsizefaq.files.wordpress.com/2010/05/type-i-and-type-ii-errors.jpg
A priori power analysis: We assume that the effect exists in reality

                                Reality:             Reality:
                                Effect present       No effect present
Test indicates:
p < .05                         True Positive        False Positive
                                (Power = 1 − β)      (α = 5%)
Test indicates:
p > .05                         False Negative       True Negative
                                (β = 20%)
Calibrate your power feeling

Test                                                      total n
Two-sample t-test (between design), d = 0.5               128 (64 per group)
One-sample t-test (within design), d = 0.5                34
Correlation: r = .21                                      173
Difference between two correlations,
  r₁ = .15, r₂ = .40 ➙ q = 0.273                          428
ANOVA, 2x2 design: interaction effect, f = 0.21           180 (45 per group)

All a priori power analyses with α = 5%, β = 20%, and two-tailed.
The power of within-subjects designs

May, K., & Hittner, J. B. (2012). Effect of correlation on power in within-subjects versus between-subjects designs. Innovative Teaching, 1, 2.
Typical reported effect sizes I

• e.g., Richard, Bond, & Stokes-Zoota (2003)
• Meta-meta-analysis: > 25,000 studies, > 8,000,000 participants
• mean effect r = .21 (SD across the literature = .15); median r = .18
Richard, F. D., Bond, C. F. J., & Stokes-Zoota, J. J. (2003). One Hundred Years of Social Psychology Quantitatively Described. Review of General Psychology, 7(4), 331–363. doi:10.1037/1089-2680.7.4.331

Typical reported effect sizes II

• e.g., Bosco et al. (2015)
• 147,328 correlations from Journal of Applied Psychology and Personnel Psychology
• median effect r = .16, mean effect r = .22 (SD = .20)
Bosco, F. A., Aguinis, H., Singh, K., Field, J. G., & Pierce, C. A. (2015). Correlational effect size benchmarks. Journal of Applied Psychology, 100(2), 431–449. http://doi.org/10.1037/a0038047

• Average sample size: n = 40
• Average published effect size: d = .5 / r = .21 (certainly overstated due to publication bias)
• Average power: < 34%
Bakker, M., van Dijk, A., & Wicherts, J. M. (2012). The Rules of the Game Called Psychological Science. Perspectives on Psychological Science, 7(6), 543–554.

Overall proportion of positive results in published Psychology/Psychiatry papers: 92%!

Fanelli, D. (2011). Negative results are disappearing from most disciplines and countries. Scientometrics, 90(3), 891–904. doi:10.1007/s11192-011-0494-7

Why power is important: Exercise
Given that p < .05: What is the probability that a real effect exists in the population? ➙ prob(H₁ | D)

Scenario: 1000 tests performed; 30% of the investigated effects are real; power = 35%; α = 5%.

• 30% ➙ real effect in 300 tests; 70% ➙ no effect in 700 tests
• Of the 300 real effects: 35% are detected (105 true positives, p < α); 65% are missed (195 false negatives, n.s.)
• Of the 700 null effects: 5% come out significant (35 false positives, p < α); 95% are correct n.s. results (665 true negatives)

35 of the (35 + 105) = 140 significant p-values actually come from a population with a null effect; 105 of 140 come from a real effect.
False discovery rate (FDR) = 35/140 = 25%
Positive predictive value (PPV) = 105/140 = 75%
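The tree above can be reproduced in a few lines of R (a sketch using only the numbers from the slide; no packages involved):

```r
# Scenario from the slide: 1000 tests, 30% real effects, power = 35%, alpha = 5%
n_tests <- 1000
prior   <- 0.30   # proportion of investigated effects that are real
power   <- 0.35
alpha   <- 0.05

true_pos  <- n_tests * prior * power               # 105 true positives (p < alpha)
false_neg <- n_tests * prior * (1 - power)         # 195 false negatives (n.s.)
false_pos <- n_tests * (1 - prior) * alpha         #  35 false positives (p < alpha)
true_neg  <- n_tests * (1 - prior) * (1 - alpha)   # 665 true negatives (n.s.)

FDR <- false_pos / (false_pos + true_pos)   # 35/140  = 0.25
PPV <- true_pos  / (false_pos + true_pos)   # 105/140 = 0.75
```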
Nuzzo, R. (2014). Statistical errors. Nature. Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 1(3), 140216.
Practice with the PPV app! http://shinyapps.org/apps/PPV/

Smaldino, P. E., & McElreath, R. (2016). The natural selection of bad science. arXiv:1605.09511 [physics, stat]. Retrieved from http://arxiv.org/abs/1605.09511

“Our results indicate that the median statistical power in neuroscience is 21%.”
Assuming that the tested hypotheses are true in 30% of all cases (a not-too-risky research scenario):

• A typical neuroscience study must fail in 94% of all cases
• In the most likely outcome of p > .05, we have no idea whether (a) the effect does not exist, or (b) we simply missed the effect. Virtually no knowledge has been gained.
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci, 14(5), 365–376. doi:10.1038/nrn3475

“When a study is underpowered it most likely provides only weak inference. Even before a single participant is assessed, it is highly unlikely that an underpowered study provides an informative result. Consequently, research unlikely to produce diagnostic outcomes is inefficient and can even be considered unethical. Why sacrifice people’s time, animals’ lives, and societies’ resources on an experiment that is highly unlikely to be informative?”
Schönbrodt, F. D. & Wagenmakers, E.-J. (2016). Bayes Factor Design Analysis: Planning for Compelling Evidence. http://dx.doi.org/10.2139/ssrn.2722435

Power is a frequentist property – beware of fallacies!

• Power is a pre-data measure (i.e., computed before data are collected) that averages over infinite hypothetical experiments
• Only one of these hypothetical experiments will actually be observed
• Power is a property of the test procedure – not of a single study’s outcome!
• Power is conditional on a hypothetical effect size – not conditional on the actual data obtained
• “Once the actual data are available, a power calculation is no longer conditioned on what is known, no longer corresponds to a valid inference, and may now be misleading.” ➙ For inference, better use likelihood ratios or Bayes factors; then pre-data power considerations are irrelevant.
Wagenmakers, E.-J., Verhagen, J., Ly, A., Bakker, M., Lee, M. D., Matzke, D., Rouder, J. N., et al. (2014). A power fallacy. Behavior Research Methods, 47, 913–917. doi:10.3758/s13428-014-0517-4

From power analysis to design analysis

Classical a priori power analysis:
• Assume that the real effect has a certain size (say, Cohen’s d = 0.5)
• When I run many identical studies with a fixed sample size n and α level of 5% …
• … what proportion of them will result in a p-value smaller than α?
• ➙ Task: Find the sample size n that ensures at least, say, 80% significant results.

Planning for precision:
• Assume that the real effect has a certain size (say, Cohen’s d = 0.5)
• When I run many identical studies with a fixed sample size n …
• … what proportion of them will result in a confidence interval with a width smaller than 0.10?
• ➙ Task: Find the sample size n that ensures that at least, say, 80% of all studies have a CI width < 0.10.
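Planning for precision can be done analytically for standardized mean differences. A sketch assuming the MBESS package's Accuracy in Parameter Estimation function `ss.aipe.smd` (argument names as I recall them from the MBESS documentation; verify with `?ss.aipe.smd`):

```r
library(MBESS)
# Sample size per group so that the 95% CI around d = 0.5 has a
# width of at most 0.10, with 80% assurance that the obtained CI
# is indeed that narrow (analogous to the 80% criterion above)
ss.aipe.smd(delta = 0.5, conf.level = 0.95, width = 0.10,
            assurance = 0.80)
```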
Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample Size Planning for Statistical Power and Accuracy in Parameter Estimation. Annual Review of Psychology, 59(1), 537–563. http://doi.org/10.1146/annurev.psych.59.103006.093735

General a priori design analysis
Assume properties of reality (effect sizes, distributions, etc.)
When I run many identical studies with certain design properties (e.g., sample size, sampling plan, thresholds, priors) …
… what proportion of them will result in certain test outcomes (e.g., p < .01, CI width < .05, Bayes factor > 10)?
➙ Task: Tune the design properties in a way that at least X% of all hypothetical studies give the desired outcome
Kruschke, J. (2014). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan (2nd edition). Boston: Academic Press. Schönbrodt, F. D. & Wagenmakers, E.-J. (2016). Bayes Factor Design Analysis: Planning for Compelling Evidence. http://dx.doi.org/10.2139/ssrn.2722435

How to determine the expected effect? Can we base our power analyses on published effect sizes?
• No.
• See RP:P (Reproducibility Project: Psychology): 83% of all effect sizes are smaller than the original: mean original r = .40 ➙ mean replication r = .20
• See also Franco et al. (2015): reported effect sizes are 2x larger than unreported ones
• RP:P’s median power conditional on the reported ES: 92%
• RP:P’s median power conditional on the replication ES: < 30%
• Suggestion: Divide the reported effect by 2, then compute the power analysis.
Safeguard power (Perugini et al., 2014)
• Incorporate the uncertainty of the original study’s ES estimate
• Aim for the lower end of its 60% CI
• Example:
  • Original study finds d = 0.5 (n = 30 in each group)
  • 60% CI = [0.28; 0.72]
  • Naive 80% power analysis: n = 64
  • Safeguard 80% power analysis: n = 202
• Rewards precise estimates in the original study
library(MBESS)
ci.smd(smd = 0.5, n.1 = 30, n.2 = 30, conf.level = 0.60)
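The safeguard estimate can then be plugged into an ordinary power analysis. A sketch assuming the pwr package; the list component `Lower.Conf.Limit.smd` is the name used in MBESS's `ci.smd` output as I recall it (check `str(ci)` if in doubt):

```r
library(MBESS)
library(pwr)
# 60% CI around the original estimate d = 0.5 (n = 30 per group)
ci <- ci.smd(smd = 0.5, n.1 = 30, n.2 = 30, conf.level = 0.60)
d_safe <- ci$Lower.Conf.Limit.smd   # lower end, approx. 0.28

# Power analysis based on the safeguard estimate instead of the reported d
pwr.t.test(d = d_safe, power = 0.80, sig.level = 0.05)
```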
Perugini, M., Gallucci, M., & Costantini, G. (2014). Safeguard Power as a Protection Against Imprecise Power Estimates. Perspectives on Psychological Science, 9(3), 319–332. http://doi.org/10.1177/1745691614528519

2.5x rule of thumb (Simonsohn, 2015)
•Two possible questions in a replication:
• Does the replication ES differ significantly from zero?
• Does the replication ES differ significantly from an effect the original study could have detected?
•Rule of thumb: run the replication with 2.5 times the original sample size
•Not very useful in already well-powered large-n situations
Smallest Effect Size of Interest (SESOI)
•“I don’t care about effects smaller than this”
•If p > .05, one of three possible conditions holds:
• The effect exists in a relevant size, but I was unlucky (with probability β)
• The effect exists, but is so small that I don’t care about it
• The effect does not exist
•Problem 1: The SESOI is hard to determine
•Problem 2: Highly inefficient if the true effect is larger than the minimally interesting effect
Sensitivity analysis
• Sometimes we cannot commit to a point estimate of the assumed/minimally interesting effect size
• Compute the power analysis for a range of plausible values (e.g., d = 0.3, 0.5, and 0.7)
δ      n for 80% power in a two-group t-test (each group)
0.3    175
0.5    64
0.7    33
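The table can be reproduced with the pwr package (a sketch; due to rounding, the computed n may differ by ±1 from the table):

```r
library(pwr)
# n per group for 80% power, alpha = .05, two-tailed two-group t-test,
# over the range of plausible effect sizes
sapply(c(0.3, 0.5, 0.7), function(d)
  ceiling(pwr.t.test(d = d, power = 0.80, sig.level = 0.05)$n))
```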
Bayesian (hybrid) power analysis
Quantify the uncertainty about the true effect under H₁ with a prior distribution: δ ∼ N(0.5, σ² = 0.2²)
•Run a Monte Carlo simulation:
• Draw an effect size δᵢ from the prior distribution
• Draw a sample of size n from a population with effect size δᵢ
• Repeat 10,000 times; record how many studies yield a significant p-value
•For this prior, the necessary sample size for obtaining a power of 80% would be n = 70.
O'Hagan, A., Stevens, J. W., & Campbell, M. J. (2005). Assurance in clinical trial design. Pharmaceutical Statistics, 4(3), 187–201. http://doi.org/10.1002/pst.175 Schönbrodt, F. D. & Wagenmakers, E.-J. (submitted). Bayes factor design analysis: Planning for compelling evidence.

Part II: Hands-on
Repeated measures ANOVA and multiple regression

Statistical tests
• Any power analysis requires the distribution of the test statistic under the alternative hypothesis (a noncentral distribution)
• That means we have to specify the alternative hypothesis
Effect sizes
• The expected value of that noncentral distribution is determined by the pre-specified effect size
• Possible effect sizes we can test for significance:

ANOVA                                            Regression
One-way ANOVA:                                   Simple linear regression:
  η² = σ²_between / σ²_total                       ρ² = β_z²
Two-way ANOVA:                                   Multiple linear regression:
  η²_partial(A) = σ²_A / (σ²_A + σ²_ε)             ρ², ρ_j² (e.g., for β_zj)
Repeated measures:
  η²_partial = σ²_between / (σ²_between + σ²_ε)
Effect sizes

ANOVA                                            Regression
One-way ANOVA:                                   Simple linear regression:
  f² = η² / (1 − η²)                               f² = ρ² / (1 − ρ²)
Two-way ANOVA & repeated measures:               Multiple linear regression:
  f² = η²_partial / (1 − η²_partial)               f²(β_zj) = ρ_j² / (1 − ρ²)
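In R, converting an η² or ρ² into Cohen's f for use in a power function is a one-liner (a sketch with an illustrative η² value, not one from the slides):

```r
eta2 <- 0.06                  # example (partial) eta^2
f  <- sqrt(eta2 / (1 - eta2)) # Cohen's f, here approx. 0.25
f2 <- eta2 / (1 - eta2)       # f^2, the scale used by pwr.f2.test()
```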
Easier to estimate using the partial r²
Sample size determination: Step 1

• Choose the minimum expected/interesting effect size
• Cohen’s conventions:
Sample size determination in G*Power

• What has to be pre-specified?
1. Expected minimum effect size
2. α level
3. Minimum required power
Power analysis for MLR

• Problem: collinear predictors
• The standard errors of the β_zj estimates depend on the degree of collinearity
• The power of the test of β_zj therefore also depends on the degree of collinearity
• The higher the collinearity, the lower the power
• Solutions:
  1. β_zj of uncorrelated predictors (very unlikely in practice)
  2. Partial r²: independent of the degree of collinearity
Sample size determination in R

• Package: “pwr”
• One-way ANOVA
• Simple linear regression
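The pwr calls for these two cases look roughly as follows (a sketch with illustrative effect sizes, not the workshop's exact numbers):

```r
library(pwr)

# One-way ANOVA: 4 groups, Cohen's f = 0.25, alpha = .05, power = .80
pwr.anova.test(k = 4, f = 0.25, sig.level = 0.05, power = 0.80)

# Simple linear regression via the general linear model:
# u = numerator df (1 predictor), f2 = rho^2 / (1 - rho^2)
pwr.f2.test(u = 1, f2 = 0.10 / (1 - 0.10), sig.level = 0.05, power = 0.80)
# the returned v is the denominator df; total n = u + v + 1
```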
Part III: Power Analyses for Multilevel Models

Power in multilevel models
• Assume that testing one student costs 10€, and you have a budget of 10,000€ ...
• 40 classes with 25 students each?
• 100 classes with 10 students each?
• Or, in a 3-level structure:
• 10 classes from 10 schools with 10 students?
• 20 classes from 5 schools with 10 students?
• 5 classes from 10 schools with 20 students?
• Trade-off: Each new school generates 500€ extra costs (e.g., for driving there). Shall we sacrifice 50 students for an extra school on level 3?
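The budget arithmetic behind these scenarios can be laid out in R (a sketch using only the cost figures from the slide: 10€ per tested student, 500€ fixed cost per school):

```r
# Total cost of a 3-level design: schools x classes x students
cost <- function(n_schools, n_classes, n_students) {
  n_schools * 500 + n_schools * n_classes * n_students * 10
}
cost(10, 10, 10)   # 10 schools x 10 classes x 10 students = 15000 EUR
cost(5, 20, 10)    #  5 schools x 20 classes x 10 students = 12500 EUR
```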
Power in MLMs: Difficulties

•Sample sizes exist on multiple levels
•You need a priori estimates of the random variances (e.g., how much variance in intercepts do you expect?) ➙ hard to estimate, typically unknown
•Introducing covariates can change the optimal sample size on each level, depending on how much variance the covariates explain within or between groups
•Power depends on the intraclass correlation coefficient (ICC)
Some exemplary ICCs

• 20 groups with 20 persons each
Power in MLMs: Dependency on ICC
• When ICC ➙ 1, there are no differences between units within each group:
  • each additional L1 unit carries virtually no information beyond the existing L1 units
  • measuring more L1 units yields no informational gain; the only remedy is to add L2 units
• “ICC values typically range between .05 and .20” (Snijders & Bosker, 1999; Bliese, 2000)
• “values greater than .30 will be rare” (Bliese, 2000)
• “median value of .12; this is likely an overestimate” (James, 1982)
• “a value between .10 and .15 will provide a conservative estimate of the ICC when it cannot be precisely computed”
• But: there are some research fields where ICCs > .30 are not uncommon! Always tune to your field, and do not rely on defaults.
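A standard way to quantify why the ICC matters is the design effect (Kish's approximation; this formula is a common textbook result, not from the slides):

```r
# Design effect: factor by which clustering inflates the required n
# deff = 1 + (m - 1) * ICC, with m = cluster size (L1 units per L2 unit)
deff <- function(m, icc) 1 + (m - 1) * icc
deff(m = 20, icc = 0.12)   # = 3.28

# A clustered sample of n observations is worth about n / deff
# independent observations:
400 / deff(m = 20, icc = 0.12)   # approx. 122
```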
Scherbaum, C. A., & Ferreter, J. M. (2008). Estimating Statistical Power and Required Sample Sizes for Organizational Research Using Multilevel Modeling. Organizational Research Methods, 12(2), 347–367. http://doi.org/10.1177/1094428107308906

Power in MLMs: More L1 or L2?
• For the power of L1 predictors, the number of L1 units is more relevant (but also take the ICC into account)
• For the power of L2 predictors (e.g., γ₀₁ = slope of the L2 predictor), the number of L2 units is more relevant, as well as the size of the L2 units (i.e., the number of L1 units within each L2 unit)
• Typically, the n on L2 is more important than the n within each L2 unit
Rules of thumb (which are often criticized!)
•Kreft (1996): 30/30 rule
  • >= 30 units on L2
  • >= 30 L1 units for each L2 unit
  • ➙ 30 * 30 = 900 units on L1
•Hox (1998): 50/20 rule
  • >= 50 units on L2
  • >= 20 L1 units for each L2 unit
  • ➙ 50 * 20 = 1000 units on L1
Example: 3-level therapy study
• L1: measurements (5, 11, or 21 per patient), nested in …
• L2: patients (2, 4, or 8 per therapist), nested in …
• L3: therapists (focus of the power analysis: How many therapists do you need to acquire?)
• Outcome variable on L1: therapy progress (self-reported)
• Focal experimental factor: therapists get feedback about self-reported therapy progress (or not) (i.e., a between-group experimental factor on L3)
• ➙ moderating effect on the L1 slope: Is therapy progress over time (rate of change) stronger when therapists get feedback about the current status?

de Jong, K., Moerbeek, M., & van der Leeden, R. (2010). A priori power analysis in longitudinal three-level multilevel models: An example with therapist effects. Psychotherapy Research, 20(3), 273–284. http://doi.org/10.1080/10503300903376320

Example: 3-level therapy study
•Saturation at ~60 therapists: more are not necessary (if each has 4 patients)
•The number of measurements per patient is largely irrelevant

Example: 3-level therapy study
80% power is reached with 168 patients in total:
  21 × 8 = 168 patients
  42 × 4 = 168 patients
  84 × 2 = 168 patients
de Jong, K., Moerbeek, M., & van der Leeden, R. (2010). A priori power analysis in longitudinal three-level multilevel models: An example with therapist effects. Psychotherapy Research, 20(3), 273–284. http://doi.org/10.1080/10503300903376320

MLPowSim
• MLPowSim (http://www.bristol.ac.uk/cmm/software/mlpowsim/)
• Generates simulation code for R
• Supports 2- and 3-level models
57 Web app from Jake Westfall: Stimuli nested in persons
Westfall, J., Kenny, D. A., & Judd, C. M. (2014). Statistical power and optimal design in experiments in which samples of participants respond to samples of stimuli. Journal of Experimental Psychology: General, 143(5), 2020–2045.

Literature
• Scherbaum, C. A., & Ferreter, J. M. (2008). Estimating Statistical Power and Required Sample Sizes for Organizational Research Using Multilevel Modeling. Organizational Research Methods, 12(2), 347–367. http://doi.org/10.1177/1094428107308906
• de Jong, K., Moerbeek, M., & van der Leeden, R. (2010). A priori power analysis in longitudinal three-level multilevel models: An example with therapist effects. Psychotherapy Research, 20(3), 273–284. http://doi.org/10.1080/10503300903376320
• Maas, C., & Hox, J. J. (2005). Sufficient sample sizes for multilevel modeling. Methodology, 1(3), 85– 91. http://doi.org/10.1027/1614-2241.1.3.85
• Mathieu, J. E., Aguinis, H., Culpepper, S. A., & Chen, G. (2012). Understanding and estimating the power to detect cross-level interaction effects in multilevel modeling. Journal of Applied Psychology, 97(5), 951–966. http://doi.org/10.1037/a0028380
• Snijders, T. A. B. (2005). Power and sample size in multilevel linear models. In Encyclopedia of Statistics in Behavioral Science.
Part IV: Tailored design analyses by simulation in R

General a priori design analysis (recap):
• Assume properties of reality (effect sizes, distributions, etc.)
• When I run many identical studies with certain design properties (e.g., sample size, sampling plan, thresholds, priors) …
• … what proportion of them will result in certain outcomes (e.g., p < .01, CI width < .05, Bayes factor > 10)?
• ➙ Task: Tune the design features in a way that at least X% of all hypothetical studies give the desired outcome.

# Example: two-group t-test
n1 <- 100      # sample size first group
n2 <- n1       # sample size second group
delta <- 0.5   # assumed true effect size
alpha <- .05   # Type I error level
B <- 10000     # number of Monte Carlo simulations (= hypothetical studies)

ps <- c()      # stores p-values from each simulation
for (i in 1:B) {
  # draw a sample from a population that has the assumed effect size
  x <- rnorm(n1, mean=0, sd=1)
  y <- rnorm(n2, mean=delta, sd=1)
  t1 <- t.test(x, y)
  ps <- c(ps, t1$p.value)
}

# compute power (= proportion of studies that have a p-value < alpha)
prop.table(table(ps < alpha))
print(paste0("Power = ", sum(ps < alpha)/B*100, "%"))

# tune n1 and n2 until the desired power is achieved

# Example: two-group t-test with a prior on the effect size
As before, but in each Monte Carlo run a new true effect size is drawn from the prior distribution:

n1 <- 100      # sample size first group
n2 <- n1       # sample size second group
alpha <- .05   # Type I error level
B <- 10000     # number of Monte Carlo simulations

ps <- c()      # stores p-values from each simulation
for (i in 1:B) {
  x <- rnorm(n1, mean=0, sd=1)
  # in each Monte Carlo run: choose another true ES from the prior distribution
  d_i <- rnorm(1, mean=0.5, sd=0.2)
  y <- rnorm(n2, mean=d_i, sd=1)
  t1 <- t.test(x, y)
  ps <- c(ps, t1$p.value)
}

# compute power (= proportion of studies that have a p-value < alpha)
prop.table(table(ps < alpha))
print(paste0("Power = ", sum(ps < alpha)/B*100, "%"))

# tune n1 and n2 until the desired power is achieved

Hand-carved simulations in R
• Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/ hierarchical models. Cambridge University Press. Chapter 20.5
• Kruschke, J. (2014). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan (2nd edition.). Boston: Academic Press. Chapter 13
Bayesian (hybrid) power analysis
# ------------------------------------------------------------------
# More advanced: Bayesian hybrid power analysis.
# Use a prior distribution to quantify uncertainty about the true effect;
# then compute the percentage of significant studies.
# Our standard scenario: a two-group t-test
# find a reasonable prior plausibility distribution of effect sizes
# e.g.: Normal(mean=0.5, sd=0.2)
effectSizeRange <- seq(-0.5, 2, length=100)
plot(x=effectSizeRange, y=dnorm(effectSizeRange, mean=0.5, sd=0.2),
     type="l", ylab="Plausibility", xlab="Effect size")
n1 <- 100  # sample size first group
n2 <- n1   # sample size second group
B <- 1000  # number of Monte Carlo simulations (e.g., 5000)
ps <- c()  # this vector stores the B p-values
for (i in 1:B) {
  x <- rnorm(n1, mean=0, sd=1)
  # in each Monte Carlo run: choose another true ES from the prior distribution
  y <- rnorm(n2, mean=rnorm(1, mean=0.5, sd=0.2), sd=1)
  t1 <- t.test(x, y)
  ps <- c(ps, t1$p.value)
}
# compute power (= number of studies that have a p-value < .05) prop.table(table(ps < .05))
# plot p-curve in significant range hist(ps[ps<.05])
# --> now increase the sample size until the desired power is achieved

Schönbrodt, F. D. & Wagenmakers, E.-J. (submitted). Bayes factor design analysis: Planning for compelling evidence.