
Bayesian and Reproducibility

NCS 2018, Paris

5th October 2018

Leonhard Held & Manuela Ott, Epidemiology, Biostatistics and Prevention Institute and

Center for Reproducible Science

University of Zurich, Department of Biostatistics

Reproducibility in Drug Development

Nature Reviews Drug Discovery 10(9), 712-713 (2011)

“With reasonable efforts (sometimes the equivalent of 3–4 full-time employees over 6–12 months), we have frequently been unable to reconfirm published data.”

Reproducibility in Biomedical Research

Nature (2014)

“The recent evidence showing the irreproducibility of significant numbers of biomedical research publications demands immediate and substantive action.”

The Four Horsemen of the Reproducibility Crisis

Dorothy Bishop, Oxford Reproducibility Lecture (2017)

Diagnostic Tests

Mr. Smith and Dr. Jones

Mr. Smith is worried that he has disorder D and consults Dr. Jones. Dr. Jones orders a test T that has only 2 outcomes, positive and negative, and Mr. Smith’s test result comes back positive. Dr. Jones tells his patient that he has a 50% chance of having the disorder given this test result.

Mossman and Berger (2001)

Diagnostic Tests

Mr. Smith and Dr. Jones (cont.)

Mr. Smith asks his doctor how he came to this conclusion. Dr. Jones – an unusual physician who uses Bayes’ theorem and discusses it with his patients – answers:

“Ten percent of men who are your age have D. Test T has 90% sensitivity and 90% specificity. Applying Bayes’ theorem, your positive test result means that your chance of having D is 50%.”

The Fagan Nomogram

[Figure: Fagan nomogram with three scales: pre-test probability (%, 0.1 to 99.9), likelihood ratio (0.001 to 1000) and post-test probability (%, 99.9 to 0.1).]

Likelihood Ratio

https://www.youtube.com/watch?v=OssWmN9pm4Q

Diagnostic Tests

Bayes’ theorem

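$$
\underbrace{\frac{\text{PPV}}{1 - \text{PPV}}}_{\text{Posterior odds}}
= \underbrace{\frac{\text{Sensitivity}}{1 - \text{Specificity}}}_{\text{Likelihood ratio}}
\times \underbrace{\frac{\text{Prevalence}}{1 - \text{Prevalence}}}_{\text{Prior odds}}
$$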

PPV : positive predictive value
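Dr. Jones's 50% can be checked with the odds form of Bayes' theorem. The following is a minimal Python sketch; the function name `positive_predictive_value` is illustrative and does not come from the talk.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Return the PPV using posterior odds = likelihood ratio x prior odds."""
    prior_odds = prevalence / (1 - prevalence)
    likelihood_ratio = sensitivity / (1 - specificity)
    posterior_odds = likelihood_ratio * prior_odds
    # Convert odds back to a probability
    return posterior_odds / (1 + posterior_odds)


# Mr. Smith: 10% prevalence, 90% sensitivity, 90% specificity
ppv = positive_predictive_value(0.10, 0.90, 0.90)
print(f"PPV = {ppv:.2f}")  # prints "PPV = 0.50"
```

The likelihood ratio of 9 exactly offsets the prior odds of 1/9, so the posterior odds are 1 and the PPV is 50%.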

Statistical Tests

Two interpretations

– Test a point null hypothesis H0 against an alternative H1

– Use a p-value to quantify the evidence against the null hypothesis H0

Two different interpretations of the evidence:

1. “p-less-than”, e.g. p < 0.05
   – Dichotomization into “significant” and “non-significant”
   – Allows the study of frequentist properties of “significant” tests
2. “p-equals”, e.g. p = 0.04
   – Adequate interpretation for a single test, no loss of information

Colquhoun (2014, 2017)

Statistical Tests

Bayes’ theorem for “p-less-than” interpretation

$$
\frac{\text{TPR}}{1 - \text{TPR}}
= \frac{\text{Power}}{\text{Type-I error rate}}
\times \frac{\Pr(H_1)}{\Pr(H_0)}
$$

Pre-experimental odds

Bayarri et al. (2016)

TPR = Pr(H1 | significance): true positive rate

FPR = 1 − TPR: false positive rate
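As a rough numerical illustration of the “p-less-than” formula, the sketch below computes TPR and FPR in Python. The inputs (80% power, Pr(H1) = 10%) are assumptions chosen for illustration and are not stated on the slide; the function name is hypothetical.

```python
def true_positive_rate(power, alpha, prob_h1):
    """TPR = Pr(H1 | significance) from power, type-I error rate and Pr(H1)."""
    prior_odds_h1 = prob_h1 / (1 - prob_h1)
    tpr_odds = (power / alpha) * prior_odds_h1
    return tpr_odds / (1 + tpr_odds)


# Assumed inputs: 80% power and a 10% prior probability of H1
for alpha in (0.05, 0.005):
    fpr = 1 - true_positive_rate(power=0.8, alpha=alpha, prob_h1=0.1)
    print(f"alpha = {alpha}: FPR = {fpr:.3f}")
```

Under these assumptions the false positive rate is 36% at α = 0.05 and about 5% at α = 0.005, provided the power is kept at 80%, which in turn requires a larger sample.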

Claims of New Discoveries

“This simple step would immediately improve the reproducibility of scientific research in many fields.”

Lower the p-Value Threshold for Significance! Suggestive evidence emphasizes the need for

[Figure: P-value scale (1, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0001) divided into three regions labelled “Not significant”, “Suggestive” and “Significant”.]

Why Would You Do This?

Why Most Published Research is Probably False

https://www.economist.com/blogs/graphicdetail/2013/10/daily-chart-2

The False Positive Rate

70% sample size increase for p < 0.005 threshold

Benjamin et al. (2018)
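The 70% figure can be reproduced approximately under simple assumptions. The sketch below assumes a two-sided z-test with normal approximation, a fixed effect size and 80% power, so that the required sample size is proportional to (z_{1−α/2} + z_{power})². These assumptions and the function name are added for illustration.

```python
from statistics import NormalDist


def relative_sample_size(alpha_new, alpha_old=0.05, power=0.80):
    """Ratio of required sample sizes when moving from alpha_old to alpha_new.

    Assumes a two-sided z-test and a fixed effect size, so that
    n is proportional to (z_{1 - alpha/2} + z_{power})^2.
    """
    z = NormalDist().inv_cdf

    def required(alpha):
        return (z(1 - alpha / 2) + z(power)) ** 2

    return required(alpha_new) / required(alpha_old)


ratio = relative_sample_size(0.005)
print(f"Sample size increase: {100 * (ratio - 1):.0f}%")
```

With these inputs the ratio is close to 1.70, in line with the 70% increase quoted on the slide.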

This Argument is Not New

Staquet et al. (1979, Cancer Treat Rep, 1917–21)

Statistical Tests

Bayes’ theorem for “p-equals” interpretation

$$
\frac{\text{FPR}}{1 - \text{FPR}}
= \underbrace{\frac{f(p \mid H_0)}{f(p \mid H_1)}}_{\text{Bayes factor } \text{BF}_{01}}
\times \underbrace{\frac{\Pr(H_0)}{\Pr(H_1)}}_{\text{Prior odds}}
$$

FPR = Pr(H0 | p): false positive risk

Colquhoun (2017, 2018)

Bayes Factors

Data-based, test-based or p-based

[Diagram: Data y, Test, p-value; Bayes factor BF01]

Held & Ott (2018)

Interpretation of Bayes Factors

Bayes Factors

Choice of the alternative

– Point null hypothesis H0: θ = θ0

– Two types of alternatives H1:

  – Local, e.g. H1: θ ∼ N(θ0, σ²) ⇒ for exploratory studies

  – Simple, i.e. H1: θ = θ1 ≠ θ0 ⇒ for confirmatory studies

– The simple alternative is often chosen such that a given power (80%) is reached under a pre-specified type I error rate (5%).

Colquhoun (2017, 2018)

Minimum Bayes Factors

– Minimum Bayes factor (minBF): the smallest possible Bayes factor within a specific class of alternatives

→ The smaller the minBF, the stronger the maximal evidence against H0 within that class

– The minBF also depends on the
  – type of test, e.g. z-test vs. t-test
  – sample size n
  – dimension of the parameter of interest.

Bayes and Minimum Bayes Factors

Simple alternative, 80% power for µ = 2.8

[Figure: Bayes factor (log scale, 1/100 to 10) as a function of µ (0 to 4, with 2.8 marked) for t* = 0.67 (p = 0.5), t* = 1.64 (p = 0.1), t* = 2.58 (p = 0.01) and t* = 3.29 (p = 0.001).]
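The relation between a Bayes factor for a simple alternative and the minimum Bayes factor can be sketched as follows. This assumes a test statistic t that is N(0, 1) under H0 and N(µ, 1) under H1 (an assumption made here for illustration), so that BF01 = exp(µ²/2 − tµ), which is smallest at µ = t. Under the same assumption, µ = 2.8 ≈ 1.96 + 0.84 corresponds to 80% power at a two-sided 5% level. Function names are hypothetical.

```python
from math import exp


def simple_bf01(t, mu):
    """BF01 for H0: mean 0 vs. simple H1: mean mu, given a N(., 1) statistic t."""
    # Ratio of normal densities N(t; 0, 1) / N(t; mu, 1)
    return exp(mu * mu / 2 - t * mu)


def min_bf01(t):
    """Smallest BF01 over all simple alternatives, attained at mu = t."""
    return exp(-t * t / 2)


for t in (0.67, 1.64, 2.58, 3.29):
    bf = simple_bf01(t, mu=2.8)
    print(f"t* = {t}: BF01(mu = 2.8) = {bf:.4f}, minBF = {min_bf01(t):.4f}")
```

For every t the Bayes factor at µ = 2.8 is at least as large as the minimum Bayes factor, and the two are closest when t is near 2.8.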

Minimum Bayes Factors

The R package pCalibrate

– Implements various calibrations of p-values as minimum Bayes factors
– Available on CRAN (https://cran.r-project.org/package=pCalibrate)
– Covers different classes of alternatives and one- and two-sided tests

Examples of some functions:
– p-based minBFs: pCalibrate(p, ... )
– test-based minBFs
  – z-test: zCalibrate(p, ... )
  – t-test: tCalibrate(p, ... )
  – likelihood ratio test: LRCalibrate(p, ... )

p-Based Minimum Bayes Factors

For a two-sided p-value p

The “−e p log(p)” calibration

$$
\text{minBF}(p) =
\begin{cases}
-e \, p \log p & \text{for } p < 1/e \approx 0.37 \\
1 & \text{otherwise.}
\end{cases}
$$

Sellke, Bayarri & Berger (2001): pCalibrate(p)

The “−e q log(q)” calibration, where q = 1 − p.

Held & Ott (2018): pCalibrate(p, alternative="informative")
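The two calibrations are simple enough to evaluate by hand. The Python sketch below re-implements the formulas directly and does not call the pCalibrate R package. The cut-off p < 1 − 1/e for the “−e q log(q)” calibration is an assumption made here by analogy with the first formula, and the function names are hypothetical.

```python
from math import e, log


def min_bf_p(p):
    """'-e p log(p)' calibration (Sellke, Bayarri & Berger, 2001)."""
    return -e * p * log(p) if p < 1 / e else 1.0


def min_bf_q(p):
    """'-e q log(q)' calibration with q = 1 - p (Held & Ott, 2018).

    Assumption: the formula is applied for q > 1/e (p < 1 - 1/e),
    and the minBF is set to 1 otherwise.
    """
    q = 1 - p
    return -e * q * log(q) if q > 1 / e else 1.0


for p in (0.05, 0.005):
    print(f"p = {p}: minBF = 1/{1 / min_bf_p(p):.1f} (p-based), "
          f"1/{1 / min_bf_q(p):.1f} (q-based)")
```

At p = 0.05 this gives roughly 1/2.5 and 1/7.5, and at p = 0.005 roughly 1/14 and 1/74, so the “−e q log(q)” calibration always yields the smaller minimum Bayes factor of the two.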

p-Based Minimum Bayes Factors

[Figure: minBF (log scale, 1 to 1/300; marked values 1/2.5, 1/7.5, 1/14 and 1/74) against the two-sided p-value (0.0001 to 0.5) for the “−e p log(p)” and “−e q log(q)” calibrations.]

Minimum Bayes Factors

Two-sample t-test setting: tCalibrate(... )

[Figure, two panels (“Local alternatives” and “Simple alternative”): minBF (log scale, 1 to 1/300) against the two-sided t-test p-value (0.001 to 0.3) for n = 10, n = 20, n = 50 and n large.]

Bayes Factors vs. Minimum Bayes Factors

Simple alternative (at 80% power) in two-sample t-test

[Figure, two panels (“Simple BF” and “tCalibrate minBF”): BF and minBF (log scale) against the two-sided t-test p-value (0.0001 to 0.1) for n = 10, n = 20 and n = 50.]

Back to Probabilities

Forward – Reverse – Inverse Bayes

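$$
\frac{\text{FPR}}{1 - \text{FPR}} = \text{BF}_{01}(p) \times \frac{\Pr(H_0)}{\Pr(H_1)}
$$

Forward Bayes: $\Pr(H_0) = 50\% \xrightarrow{\;p\;} \text{FPR}$

Reverse Bayes: $\text{FPR} = 5\% \xrightarrow{\;p\;} \Pr(H_0)$

Inverse Bayes: $\Pr(H_0) = 50\% \xrightarrow{\;\text{FPR}\;} p$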
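The three directions can be sketched in a few lines of Python. As a stand-in for BF01(p), the sketch uses the “−e p log(p)” minimum Bayes factor, so the numbers are bounds and will differ from the t-test-based results in the talk's figures. All function names are hypothetical.

```python
from math import e, log


def min_bf(p):
    """'-e p log(p)' minimum Bayes factor, used here as a stand-in for BF01(p)."""
    return -e * p * log(p) if p < 1 / e else 1.0


def forward_bayes(p, prob_h0=0.5, bf=min_bf):
    """Given p and Pr(H0), return the false positive risk Pr(H0 | p)."""
    odds = bf(p) * prob_h0 / (1 - prob_h0)
    return odds / (1 + odds)


def reverse_bayes(p, fpr=0.05, bf=min_bf):
    """Given p and a target FPR, return the Pr(H0) that yields this FPR."""
    prior_odds = (fpr / (1 - fpr)) / bf(p)
    return prior_odds / (1 + prior_odds)


def inverse_bayes(fpr, prob_h0=0.5, bf=min_bf, tol=1e-12):
    """Given a target FPR and Pr(H0), return the p-value that yields this FPR."""
    target = (fpr / (1 - fpr)) / (prob_h0 / (1 - prob_h0))
    if target >= 1:
        raise ValueError("no p-value below 1/e reaches this FPR")
    lo, hi = 1e-300, 1 / e  # min_bf is increasing on (0, 1/e)
    while hi - lo > tol:  # bisection
        mid = (lo + hi) / 2
        if bf(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


print(f"Forward: p = 0.05, Pr(H0) = 50% -> FPR >= {forward_bayes(0.05):.3f}")
print(f"Reverse: p = 0.05, FPR = 5%    -> Pr(H0) <= {reverse_bayes(0.05):.3f}")
print(f"Inverse: Pr(H0) = 50%, FPR = 5% -> p <= {inverse_bayes(0.05):.4f}")
```

Because a minimum Bayes factor is plugged in, the forward result is a lower bound on the FPR, the reverse result is the maximal Pr(H0), and the inverse result is the largest p-value compatible with the target FPR.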

Forward-Bayes

Under the assumption of equipoise: Pr(H0) = Pr(H1) = 50%

[Figure, two panels (n = 10 and n = 50): FPR (in %, log scale) against the two-sided t-test p-value (0.0001 to 0.5) for “Simple BF” and “tCalibrate”. Annotated FPR values: 16.8, 21.6, 3.3 and 1.8.]

Reverse-Bayes

For a given p-value, determine (max) Pr(H0) such that FPR = 5%.

[Figure, two panels (n = 10 and n = 50): Pr(H0) (in %) against the two-sided t-test p-value (0.0001 to 0.5) for “Simple BF” and “tCalibrate”. Annotated Pr(H0) values: 20.7, 16, 60.5 and 74.1.]

Inverse-Bayes

Under the assumption of equipoise: Pr(H0) = Pr(H1) = 50%

[Figure, two panels (n = 10 and n = 50): two-sided t-test p-value (log scale, 0.0001 to 0.3) against FPR (in %, 0.5 to 25) for “tCalibrate” and “Simple BF”. Annotated p-values: 0.014, 0.0079, 0.0029 and 0.0014.]

This Argument is Not New Either

Goodman (1999, Ann Intern Med, 995-1013)

Misunderstanding of Significance Tests

Steve Goodman in Science (2016)

“Let us hope the next century will see as much progress in the inferential methods of science as in its substance.”

Conclusion

A Bayesian perspective on statistical significance tests offers important insights:

– A significance threshold of α = 0.05 combined with low power produces high false positive rates.
– A substantial reduction in the false positive rate can be obtained with α = 0.005 and a sample size increase of 70%.
– p-values equal to or slightly below 0.05 carry only weak evidence against the null hypothesis.

References

– Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J. et al. (2018). Redefine statistical significance. Nature Human Behaviour, 2:6–10.

– Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 1:140216.

– Colquhoun, D. (2017). The reproducibility of research and the misinterpretation of p-values. Royal Society Open Science, 4:171085.

– Colquhoun, D. (2018). The false positive risk: a proposal concerning what to do about p-values. arXiv:1802.04888v4 [stat.AP].

– Held, L. & Ott, M. (2018). On p-values and Bayes factors. Annual Review of Statistics and Its Application, 5:393–419.

– Sellke, T., Bayarri, M. J. & Berger, J. O. (2001). Calibration of p values for testing precise null hypotheses. The American Statistician, 55(1):62–71.
