Reproducibility from Discovery to Clinical Research: What Could Be the Way Out?
Bruno Boulanger, November 2019

Agenda
◼ The reproducibility crisis, i.e. the replicability crisis
◼ The p-value crisis
◼ How to make a decision? What is the question?
◼ The Bayesian learning process in pharmaceutical R&D
◼ The posterior predictive distribution
◼ The power, the Bayesian power and the assurance
◼ The missing component: the elephant in the room?
◼ Take-away message

The replicability crisis: the beginning
Nature, 2014.

As the American Statistical Association officially reminded in March 2016...
The American Statistical Association recalled some key points in a press release:
1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

Reproducibility is now a public concern
Emergency meeting: ASA Symposium in October 2017, followed by The American Statistician (2019).
Maybe the key issue is the training in statistics (S. Goodman, The American Statistician, 2019).
See also Nature, March 2019, and the National Academies report, May 2019.

To "p" or not to "p": what is the question?
The objective: is my product effective? How to make a decision?
A. What is the probability of obtaining the observed data, if the product is not effective?
B. What is the probability that the product is effective, given the observed data?

Two different ways to make a decision, based on:
A. Pr(observed data | product is not effective)
◼ Better known as the p-value concept
◼ Used in the null hypothesis test (or decision)
◼ This is the likelihood of the data assuming a hypothetical explanation (e.g. the "null hypothesis")
◼ Classical statistics perspective (frequentist)
B. Pr(product effective | observed data)
◼ Bayesian perspective
◼ It is the probability of efficacy given the data

Nothing has changed in 20 years (Nature, 2017; the same point was already made in 1999).

What is the question?
A new patient: cancer? A diagnostic test returns a result; how should it be interpreted? What is the probability that the patient has cancer given the observed positive result?

Context
A disease D with a low prevalence: 1% of the population is diseased (D+). There are major consequences if the disease is not detected. This is a problem of decision making.
The accuracy of a diagnostic test is assessed as follows:
Sensitivity: Pr(positive result | cancer)
Specificity: Pr(negative result | no cancer)
In practice the question is reversed: given that the diagnostic test result is positive, what is the probability you truly have cancer?
Pr(cancer | positive result) = ?

Example
With prevalence = 1%, sensitivity = 86% and specificity = 88%:
Pr(cancer | positive result) = 1 / (12 + 1) = 0.077
How can that be so low? The small proportion of errors among the large majority of women who do not have breast cancer swamps the large proportion of correct diagnoses for the few women who have it.
Agresti, A. (2007). An Introduction to Categorical Data Analysis, 2nd ed. Wiley.
Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. R. Soc. Open Sci. 1(3): 140216.
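The arithmetic behind this example is plain Bayes' rule, and it is the same arithmetic the slides that follow apply to significance tests, with power playing the role of sensitivity and alpha the role of 1 - specificity. A minimal Python sketch (not part of the talk; the function name is illustrative) reproduces both sets of figures:

```python
def positive_predictive_value(prior, true_positive_rate, false_positive_rate):
    """Pr(condition | positive result) via Bayes' rule."""
    true_pos = prior * true_positive_rate
    false_pos = (1 - prior) * false_positive_rate
    return true_pos / (true_pos + false_pos)

# Diagnostic test: prevalence 1%, sensitivity 86%, specificity 88%.
ppv = positive_predictive_value(0.01, 0.86, 1 - 0.88)
print(f"Pr(cancer | positive result) = {ppv:.3f}")
# Prints 0.068; the slide's 1/(12 + 1) = 0.077 is the same tree
# rounded to whole counts per 100 patients.

# Significance testing, same formula: power = 0.80, alpha = 0.05.
for prior in (0.01, 0.10, 0.70):
    p = positive_predictive_value(prior, 0.80, 0.05)
    print(f"P(real) = {prior:.2f}: Pr(real effect | p < 0.05) = {p:.2f}")
# Prints 0.14, 0.64 and 0.97, matching the trees on the next slides.
```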
The clinical trial analogy
Is the drug effective? Given the clinical trial data:
Pr(drug effective | data) = ?
The answer depends largely on the prior probability that there is a real effect.

"If you use p = 0.05 to suggest that you have made a discovery, you will be wrong at least 30% of the time."
With a prior probability of a real effect of 10% and power = 0.8, out of 1000 tests one expects 80 true positives and 45 false positives, so
Pr(real effect | p < 0.05) = 80 / (80 + 45) = 0.64
Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. R. Soc. Open Sci. 1(3): 140216.

"If you use p = 0.05"... when you are in early discovery
With prior probability P(real) = 0.01, out of 1000 tests: 10 have a real effect, giving 8 true positive and 2 false negative tests; 990 have no effect, giving 940 true negative and 50 false positive tests.
Pr(real effect | p < 0.05) = 8 / (8 + 50) = 0.14 !!

... and if the prior is good
With prior probability P(real) = 0.7, out of 1000 tests: 700 have a real effect, giving 560 true positive and 140 false negative tests; 300 have no effect, giving 285 true negative and 15 false positive tests.
Pr(real effect | p < 0.05) = 560 / (560 + 15) = 0.97

[Figure: false "discovery" rate for p < 0.05 and power = 0.8, as a function of the prior probability; the FDR falls from near 1 at very low prior probability toward 0 as the prior probability approaches 1.]
[Figure: the same curve annotated with the Tufts report figure that 11.8% of products entering clinical development reach approval, contrasting the low prior (and probability of success) in research with the higher prior in clinical development.]

The value of the Bayesian approach in drug development
Bayesian inference is the mechanism used to update the state of knowledge:
prior information p(θ) × data information p(data | θ) → posterior information p(θ | data)
The process to arrive at a posterior distribution makes use of Bayes' formula.

The coin flipping experiment
Prior information: Pr(θ < 0.5) ≈ 0.75. Data: 43 heads in 100 flips. Posterior information: Pr(θ < 0.5) = 0.94.

Decision rules based on posterior probability
[Figure: posterior densities of mean differences (smaller is better) in Delta E change from baseline, a clinical end-point, ranking formulas (White Perfect vs. vehicle) against Rucinol. Bayesian model: change = base + treatment, with random subject, study and study*formula effects and flat priors.]

Drug development is a learning process
Pre-clinical discovery, Phase I, Phase II, Phase III. We now have the computing power to apply Bayesian statistics.

Regulatory point of view
2010 & 2016 guidance for medical device clinical trials: Guidance for Industry and FDA Staff, Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials, document issued on February 5, 2010.

FDA CID (Complex Innovative Trial Design) initiative

Historical control

Part II: Power, Bayesian power and Assurance

Power vs assurance
Assumptions: independent samples t-test (H0: μ1 = μ2 vs H1: μ1 ≠ μ2).
Frequentist approach (power): a power calculation takes a particular value of the effect within the range of possible values given by H1 and poses the question: if this particular value happens to obtain, what is the probability of coming to the correct conclusion that there is a difference?
Assumptions: μ1 = 100; μ2 = 120; σ1² = σ2² = 39 (very strong priors!)
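For concreteness, here is how that conditional power calculation might look in code. This sketch is not from the talk; it reads the slide's 39 as the common standard deviation (the extraction is ambiguous, and a variance of 39 would make the effect size implausibly large) and uses the standard two-sample t-test power routine from statsmodels:

```python
from statsmodels.stats.power import TTestIndPower

# Slide assumptions, with 39 read as the common SD (an assumption).
mu1, mu2, sigma = 100.0, 120.0, 39.0
effect_size = (mu2 - mu1) / sigma  # Cohen's d, about 0.51

analysis = TTestIndPower()
for n in (30, 60, 100):  # sample size per arm
    power = analysis.power(effect_size=effect_size, nobs1=n, alpha=0.05)
    print(f"n = {n:3d} per arm: power = {power:.2f}")
# Roughly 0.5, 0.8 and 0.95: each value is conditional on the single
# assumed effect size, i.e. on "very strong priors".
```

The next slide replaces the single assumed effect size with a prior distribution over effect sizes; the sketch after that slide shows the corresponding computation.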
Power vs assurance
Assumptions: independent samples t-test (H0: μ1 = μ2 vs H1: μ1 ≠ μ2).
Bayesian approach (assurance): in order to reflect the uncertainty, a large number of effect sizes, i.e. (μ1 − μ2)/σpooled, are generated using the prior distributions. A power curve is obtained for each effect size, and the expected (weighted by prior beliefs) power curve is calculated.
Note 1: given those priors, using the frequentist power approach, the probability of success of the trial is 50%.
Note 2: about 50% of Phase III trials fail because of lack of efficacy (S. Wang, FDA, 2008).

(Frequentist) power
Let R denote the rejection of the null hypothesis. The power, assuming parameter values θ = θ*, is
π(θ*, n) := Pr(R | θ*, n)
It is a conditional probability: conditional on the parameters of the model (e.g. the "true effect size" in a frequentist test) and on the sample size.

Assurance
"Assurance is the unconditional probability that a trial will lead to a specific outcome":
γ(n) := ∫ π(θ, n) f(θ) dθ = Pr(R) = E_θ[π(θ, n)]
It is thus also a function of n (and possibly of other nuisance parameters). The assurance is the expected power over all parameter values, averaged with respect to the prior f(θ).
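To make γ(n) = E_θ[π(θ, n)] concrete, the following is a minimal Monte Carlo sketch of assurance under stated assumptions: the normal prior on the treatment difference is illustrative (the talk does not spell out its prior), and 39 is again read as the common standard deviation:

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(2019)
analysis = TTestIndPower()

sigma = 39.0                               # common SD, assumed known
delta = rng.normal(20.0, 10.0, size=2000)  # illustrative prior on mu2 - mu1
effect_sizes = delta / sigma               # prior draws of (mu2 - mu1)/sigma

n_per_arm = 60
# Assurance: average the conditional power pi(theta, n) over prior draws of
# theta, a Monte Carlo estimate of the integral defining gamma(n).
powers = np.array([analysis.power(effect_size=es, nobs1=n_per_arm, alpha=0.05)
                   for es in effect_sizes])

print(f"power at the prior mean: {analysis.power(20.0 / sigma, n_per_arm, 0.05):.2f}")
print(f"assurance at n = {n_per_arm}: {powers.mean():.2f}")
# The assurance (about 0.7 here) sits below the roughly 0.8 power computed
# at the prior mean: spreading belief over effect sizes deflates the
# probability of success.
```

This is the deck's point: a trial "powered" at 80-90% against an optimistic point estimate can have an unconditional probability of success much lower, consistent with the 50% figures quoted in the notes above.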