Advanced Power Analysis Workshop
PD Dr. Felix Schönbrodt & Dr. Stella Bollmann
Ludwig-Maximilians-Universität München
www.nicebread.de | www.researchtransparency.org | Twitter: @nicebread303

•Part I: General concepts of power analysis
•Part II: Hands-on: Repeated measures ANOVA and multiple regression
•Part III: Power analysis in multilevel models
•Part IV: Tailored design analyses by simulations in R

Part I: General concepts of power analysis
•What is “statistical power”?
•Why power is important
•From power analysis to design analysis: Planning for precision (and other stuff)
•How to determine the expected/minimally interesting effect size

What is statistical power? A 2x2 classification matrix

                                     Reality: Effect present   Reality: No effect present
Test indicates: Effect present       True Positive             False Positive
Test indicates: No effect present    False Negative            True Negative

[Figure: Type I and Type II errors; https://effectsizefaq.files.wordpress.com/2010/05/type-i-and-type-ii-errors.jpg]

A priori power analysis: We assume that the effect exists in reality

                           Reality: Effect present        Reality: No effect present
Test indicates: p < .05    True Positive (power = 1−β)    False Positive (α = 5%)
Test indicates: p > .05    False Negative (β = 20%)       True Negative

Calibrate your power feeling (all a priori power analyses with α = 5%, β = 20%, and two-tailed; see the R sketch below):

•Two-sample t test (between design), d = 0.5 ➙ total n = 128 (64 per group)
•One-sample t test (within design), d = 0.5 ➙ total n = 34
•Correlation, r = .21 ➙ total n = 173
•Difference between two correlations, r₁ = .15, r₂ = .40 ➙ q = 0.273 ➙ total n = 428
•ANOVA, 2x2 design, interaction effect, f = 0.21 ➙ total n = 180 (45 per cell)

The power of within-subjects designs
[Figure: power of within-subjects vs. between-subjects designs as a function of the correlation between measures]
May, K., & Hittner, J. B. (2012). Effect of correlation on power in within-subject versus between-subjects designs. Innovative Teaching, 1, 2.
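These benchmark values can be approximately reproduced in R, for example with the pwr package (a sketch; the slides themselves do not prescribe a tool, and the two-correlations case has no dedicated pwr function, so it is computed by hand on the Fisher-z scale):

```r
# All a priori analyses with alpha = .05, power = .80, two-tailed.
library(pwr)

pwr.t.test(d = 0.5, sig.level = .05, power = .80,
           type = "two.sample")    # n ~ 64 per group ➙ total n = 128
pwr.t.test(d = 0.5, sig.level = .05, power = .80,
           type = "one.sample")    # n ~ 34
pwr.r.test(r = .21, sig.level = .05, power = .80)  # n ~ 173-176, depending on the approximation

# Difference between two independent correlations via Cohen's q:
q <- atanh(.40) - atanh(.15)       # q = 0.273
n_per_group <- ceiling(2 * (qnorm(.975) + qnorm(.80))^2 / q^2 + 3)  # ~ 214 ➙ total ~ 428

# 2x2 interaction with f = 0.21 (1 numerator df):
pwr.f2.test(u = 1, f2 = 0.21^2, sig.level = .05, power = .80)
# denominator df v ~ 176 ➙ N ~ v + 4 cell means ~ 180 (45 per cell)
```

Note how the within design (n = 34) needs roughly a quarter of the between-design sample to detect the same d = 0.5.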
Typical reported effect sizes I
•e.g., Richard, Bond, & Stokes-Zoota (2003)
•Meta-meta-analysis: > 25,000 studies, > 8,000,000 participants
•mean effect r = .21 (SD across literatures = .15); median r = .18
Richard, F. D., Bond, C. F. J., & Stokes-Zoota, J. J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7(4), 331–363. doi:10.1037/1089-2680.7.4.331

Typical reported effect sizes II
•e.g., Bosco et al. (2015)
•147,328 correlations from Journal of Applied Psychology and Personnel Psychology
•median effect r = .16, mean effect r = .22 (SD = .20)
Bosco, F. A., Aguinis, H., Singh, K., Field, J. G., & Pierce, C. A. (2015). Correlational effect size benchmarks. Journal of Applied Psychology, 100(2), 431–449. doi:10.1037/a0038047

•Average sample size: n = 40
•Average published effect size: d = .5 / r = .21 (certainly overestimated due to publication bias)
•Average power: < 34%
Bakker, M., van Dijk, A., & Wicherts, J. M. (2012). The rules of the game called psychological science. Perspectives on Psychological Science, 7(6), 543–554.

Positive results dominate the published literature: in psychology/psychiatry, about 92% of published papers report positive results.
Fanelli, D. (2011). Negative results are disappearing from most disciplines and countries. Scientometrics, 90(3), 891–904. doi:10.1007/s11192-011-0494-7

Why power is important

Exercise: Given that p < .05, what is the probability that a real effect exists in the population? ➙ prob(H₁|D)

Assume 1000 tests are performed, 30% of the investigated effects are real, power = 35%, and α = 5% (the arithmetic is sketched in R below):
•300 tests concern a real effect; of these, 105 effects are detected (true positives, p < α) and 195 are missed (false negatives, n.s.).
•700 tests concern a null effect; of these, 35 come out significant (false positives, p < α) and 665 are n.s. (true negatives).
•35 of the (35 + 105) = 140 significant p-values actually come from a population with a null effect; 105 of 140 come from a real effect.
•False discovery rate (FDR) = 35/140 = 25%
•Positive predictive value (PPV) = 105/140 = 75%
Nuzzo, R. (2014). Statistical errors. Nature.
Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 1(3), 140216. doi:10.1098/rsos.140216

Practice with the PPV app! http://shinyapps.org/apps/PPV/

Smaldino, P. E., & McElreath, R. (2016). The natural selection of bad science. arXiv:1605.09511 [physics, stat]. http://arxiv.org/abs/1605.09511

“Our results indicate that the median statistical power in neuroscience is 21%.”
Assuming that the tested hypotheses are true in 30% of all cases (not too risky a research scenario):
•A typical neuroscience study must fail in 94% of all cases.
•In the most likely outcome of p > .05, we have no idea whether (a) the effect does not exist, or (b) we simply missed the effect. Virtually no knowledge has been gained.
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365–376. doi:10.1038/nrn3475
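The tree arithmetic above (and the Button et al. scenario) fits in a few lines of R; a minimal sketch with the slides' numbers as inputs, where the function name and the second call are illustrative:

```r
# PPV/FDR arithmetic: 30% true effects, power = 35%, alpha = 5%, 1000 tests.
ppv_fdr <- function(prior = .30, power = .35, alpha = .05, n_tests = 1000) {
  true_pos  <- n_tests * prior * power               # 105 effects detected
  false_neg <- n_tests * prior * (1 - power)         # 195 effects missed
  false_pos <- n_tests * (1 - prior) * alpha         # 35 false alarms
  true_neg  <- n_tests * (1 - prior) * (1 - alpha)   # 665 correct n.s. results
  c(FDR = false_pos / (false_pos + true_pos),        # 35/140 = 25%
    PPV = true_pos  / (false_pos + true_pos))        # 105/140 = 75%
}

ppv_fdr()             # FDR = .25, PPV = .75
ppv_fdr(power = .21)  # Button et al. scenario: PPV ~ .64
# Note: with prior = .30 and power = .21, only .30 * .21 ~ 6% of all studies
# yield a significant true positive; presumably the "must fail in 94%" above.
```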
When a study is underpowered it most likely provides only weak inference. Even before a single participant is assessed, it is highly unlikely that an underpowered study provides an informative result. Consequently, research unlikely to produce diagnostic outcomes is inefficient and can even be considered unethical. Why sacrifice people's time, animals' lives, and societies' resources on an experiment that is highly unlikely to be informative?
Schönbrodt, F. D., & Wagenmakers, E.-J. (2016). Bayes factor design analysis: Planning for compelling evidence. http://dx.doi.org/10.2139/ssrn.2722435

Power is a frequentist property – beware of fallacies!
•Power is a pre-data measure (i.e., computed before data are collected) that averages over infinite hypothetical experiments.
•Only one of these hypothetical experiments will actually be observed.
•Power is a property of the test procedure – not of a single study’s outcome!
•Power is conditional on a hypothetical effect size – not conditional on the actual data obtained.
•“Once the actual data are available, a power calculation is no longer conditioned on what is known, no longer corresponds to a valid inference, and may now be misleading.” ➙ For inference, better use likelihood ratios or Bayes factors; then pre-data power considerations are irrelevant.
Wagenmakers, E.-J., Verhagen, J., Ly, A., Bakker, M., Lee, M. D., Matzke, D., Rouder, J. N., et al. (2014). A power fallacy. Behavior Research Methods, 47, 913–917. doi:10.3758/s13428-014-0517-4

From power analysis to design analysis

Classical a priori power analysis:
•Assume that the real effect has a certain size (say, Cohen’s d = 0.5).
•When I run many identical studies with a fixed sample size n and α level of 5% … what proportion of them will result in a p-value smaller than α?
•➙ Task: Find the sample size n that ensures at least, say, 80% significant results.

Planning for precision:
•Assume that the real effect has a certain size (say, Cohen’s d = 0.5).
•When I run many identical studies with a fixed sample size n and α level of 5% … what proportion of them will result in a confidence interval with a width < 0.10?
•➙ Task: Find the sample size n that ensures that at least, say, 80% of all studies have a CI width < 0.10.

Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59(1), 537–563. doi:10.1146/annurev.psych.59.103006.093735

General a priori design analysis
•Assume properties of reality (effect sizes, distributions, etc.).
•When I run many identical studies with certain design properties (e.g., sample size, sampling plan, thresholds, priors) … what proportion of them will result in certain test results (e.g., p < .01, CI width < .05, Bayes factor > 10)?
•➙ Task: Tune the design properties in a way that at least X% of all hypothetical studies give the desired outcome (see the simulation sketch below).
Kruschke, J.
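A minimal Monte Carlo sketch of such a design analysis, assuming a two-group design with true d = 0.5 and unit SDs (so the mean difference is on the d scale); the sample sizes, criteria, and number of simulations are illustrative choices, not prescriptions from the slides:

```r
# Simulate many hypothetical studies and count how often each design
# criterion is met: significance (p < alpha) and precision (CI width < crit).
design_analysis <- function(n = 64, d = 0.5, alpha = .05,
                            width_crit = 0.10, n_sim = 5000) {
  sig <- precise <- logical(n_sim)
  for (i in seq_len(n_sim)) {
    g1 <- rnorm(n, mean = 0, sd = 1)
    g2 <- rnorm(n, mean = d, sd = 1)
    tt <- t.test(g2, g1)                          # Welch two-sample t test
    sig[i]     <- tt$p.value < alpha              # classical power criterion
    precise[i] <- diff(tt$conf.int) < width_crit  # precision criterion
  }
  c(prop_significant = mean(sig), prop_precise = mean(precise))
}

set.seed(1)
design_analysis(n = 64)    # ~80% significant, but no CI narrower than 0.10
design_analysis(n = 3200)  # precision goals typically demand far larger samples
```

Tuning n (or any other design property) until the desired proportion of simulated studies meets the chosen criterion is exactly the tuning task described above, and the same loop generalizes to other criteria such as Bayes factor thresholds.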