
P-values and statistical tests
7. Multiple test corrections

Marek Gierliński, Division of Computational Biology

Hand-outs available at http://is.gd/statlec

Let's perform a test m times

                  H0 true   H0 false   Total
Significant         FP        TP         R      (number of discoveries)
Not significant     TN        FN       m − R
Total               m0        m1         m      (number of tests)

FP = false positives, TP = true positives, TN = true negatives, FN = false negatives

Family-wise error rate

FWER = Pr(FP ≥ 1)

Probabilities of independent events multiply

P(A and B) = P(A) × P(B)

Toss two coins: each shows heads with probability 1/2, so

P(heads and heads) = 1/2 × 1/2 = 1/4

The probability of either event is 1 − (1 − p)^2

P(not A) = 1 − P(A)
P(not A and not B) = P(not A) × P(not B)
P(A or B) = 1 − P(not A and not B) = 1 − [1 − P(A)][1 − P(B)]

Toss two coins:

P(A or B) = 1 − (1 − 1/2)(1 − 1/2) = 3/4
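The two-coin arithmetic can be checked numerically. A minimal Python sketch (the lecture's own code examples are in R; the function name `p_either` is mine):

```python
def p_either(p_a, p_b):
    # For independent events: P(A or B) = 1 - (1 - P(A)) * (1 - P(B))
    return 1 - (1 - p_a) * (1 - p_b)

# Two fair coins: the chance of at least one head
print(p_either(0.5, 0.5))  # 0.75
```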

False positive probability

H0: no effect. Set α = 0.05.

One test: the probability of a false positive is

P = α

Two independent tests: the probability of at least one false positive in either test is

P = 1 − (1 − α)^2

m independent tests: the probability of at least one false positive in any test is

P = 1 − (1 − α)^m

Family-wise error rate (FWER)

The probability of at least one false positive among m tests, with α = 0.05:

FWER = 1 − (1 − α)^m

Jelly beans test, m = 20, α = 0.05:

FWER = 1 − (1 − 0.05)^20 = 0.64
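The jelly-bean number is easy to verify, and the same formula shows how quickly FWER grows with the number of tests. A Python sketch (mine, not from the lecture):

```python
def fwer(alpha, m):
    # Probability of at least one false positive among m independent tests
    return 1 - (1 - alpha) ** m

print(round(fwer(0.05, 20), 2))    # 0.64, the jelly beans example
print(round(fwer(0.05, 1000), 4))  # essentially 1 for 1000 tests
```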

Bonferroni limit – to control FWER

Controlling FWER: we want to make sure that FWER ≤ α′; then the FWER is controlled at level α′.

Bonferroni limit: perform each individual test at

α = α′ / m

Then

FWER = 1 − (1 − α′/m)^m ≈ α′
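A quick numeric check (again a Python sketch of mine) that the Bonferroni limit α = α′/m keeps the FWER just below α′:

```python
def fwer(alpha, m):
    # Probability of at least one false positive among m independent tests
    return 1 - (1 - alpha) ** m

m, alpha_prime = 1000, 0.05
alpha = alpha_prime / m       # Bonferroni per-test limit: 5e-05
print(fwer(alpha, m))         # about 0.0488, just under alpha_prime
```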

Test data (1000 independent experiments)

Random samples, size n = 5, from two normal distributions:

Negatives: μ = 20 g, m0 = 970
Positives: μ = 40 g, m1 = 30
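A dataset of this shape can be simulated. Here is a sketch in Python with numpy/scipy; the standard deviation is not stated on the slide, so sd = 5 g is my assumption:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, mu0, sd = 5, 20.0, 5.0   # sd is assumed; the slide does not state it

# 970 null samples (mu = 20 g) and 30 alternative samples (mu = 40 g)
null_data = rng.normal(mu0, sd, size=(970, n))
alt_data = rng.normal(40.0, sd, size=(30, n))

# One-sample t-tests against H0: mu = 20 g
p_null = stats.ttest_1samp(null_data, mu0, axis=1).pvalue
p_alt = stats.ttest_1samp(alt_data, mu0, axis=1).pvalue

fp = int(np.sum(p_null < 0.05))   # false positives, roughly 5% of 970
tp = int(np.sum(p_alt < 0.05))    # true positives
print(fp, tp)
```

The exact counts depend on the random seed, but without correction the false positives land near 5% of the 970 null tests, as on the next slide.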

One-sample t-test, H0: μ = 20 g

No correction (970 negatives, 30 positives):

                  H0 true   H0 false   Total
Significant          56        30        86
Not significant     914         0       914
Total               970        30      1000

One-sample t-test, H0: μ = 20 g

No correction:

                  H0 true     H0 false   Total
Significant       FP = 56     TP = 30      86
Not significant   TN = 914    FN = 0      914
Total             970         30         1000

Family-wise error rate:   FWER = Pr(FP ≥ 1)

False positive rate:      FPR = FP / (FP + TN) = 56 / (56 + 914) = 0.058

False negative rate:      FNR = FN / (FN + TP) = 0 / (0 + 30) = 0

Bonferroni limit

        No correction   Bonferroni
α       0.05            5×10⁻⁵
FPR     0.058           0
FNR     0               0.87

No correction:

                  H0 true   H0 false   Total
Significant          56        30        86
Not significant     914         0       914
Total               970        30      1000

Bonferroni:

                  H0 true   H0 false   Total
Significant           0         4         4
Not significant     970        26       996
Total               970        30      1000

Holm-Bonferroni method

Sort p-values: p(1) ≤ p(2) ≤ … ≤ p(m)

Reject H(1) if p(1) ≤ α/m
Reject H(2) if p(2) ≤ α/(m − 1)
Reject H(3) if p(3) ≤ α/(m − 2)
...
Stop when p(i) > α/(m − i + 1)

Example, m = 5, α = 0.05:

i   p(i)    α      α/m     α/(m − i + 1)
1   0.003   0.05   0.01    0.01
2   0.005   0.05   0.01    0.0125
3   0.012   0.05   0.01    0.017
4   0.04    0.05   0.01    0.025
5   0.058   0.05   0.01    0.05

The Holm-Bonferroni method controls FWER.
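The stepwise rule can be written out directly. A Python sketch using the example's five p-values (`holm` is my naming; in R this is `p.adjust(p, "holm")`):

```python
def holm(pvalues, alpha=0.05):
    """Indices of hypotheses rejected by the Holm-Bonferroni procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    rejected = []
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha / (m - rank + 1):
            rejected.append(i)
        else:
            break   # stop at the first p-value above its threshold
    return rejected

# The example above: 0.04 > 0.05/2 = 0.025, so only the first three are rejected
print(holm([0.003, 0.005, 0.012, 0.04, 0.058]))  # [0, 1, 2]
```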

Holm-Bonferroni method

        No correction   Bonferroni   Holm-Bonferroni
α       0.05            5×10⁻⁵       5×10⁻⁵
FPR     0.058           0            0
FNR     0               0.87         0.87

[Figure: the 30 smallest p-values (out of 1000) with the Bonferroni and Holm-Bonferroni limits]

Holm-Bonferroni:

                  H0 true   H0 false   Total
Significant           0         4         4
Not significant     970        26       996
Total               970        30      1000

False discovery rate

False positive rate: FPR = FP / m0 = FP / (FP + TN), the fraction of truly non-significant events we falsely marked as significant.

False discovery rate: FDR = FP / R = FP / (FP + TP), the fraction of discoveries that are false.

For the uncorrected test data:

FPR = 56 / 970 = 0.058
FDR = 56 / 86 = 0.65

No correction:

                  H0 true     H0 false   Total
Significant       FP = 56     TP = 30      86
Not significant   TN = 914    FN = 0      914
Total             970         30         1000

Benjamini-Hochberg method

Sort p-values: p(1) ≤ p(2) ≤ … ≤ p(m)

Find the largest k such that

p(k) ≤ (k / m) α

and reject all null hypotheses for i = 1, …, k.

Example, m = 5, α = 0.05:

i   p(i)    α      α/m     α/(m − i + 1)   (i/m)α
1   0.003   0.05   0.01    0.01            0.01
2   0.005   0.05   0.01    0.0125          0.02
3   0.012   0.05   0.01    0.017           0.03
4   0.038   0.05   0.01    0.025           0.04
5   0.058   0.05   0.01    0.05            0.05

The Benjamini-Hochberg method controls FDR.
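The step-up rule can also be sketched in a few lines of Python (`benjamini_hochberg` is my naming; in R this is `p.adjust(p, "BH")`):

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # find the largest k such that p_(k) <= (k / m) * alpha
        if pvalues[i] <= rank / m * alpha:
            k = rank
    return [order[j] for j in range(k)]

# The example above: 0.038 <= (4/5) * 0.05 = 0.04, so four hypotheses are rejected
print(benjamini_hochberg([0.003, 0.005, 0.012, 0.038, 0.058]))  # [0, 1, 2, 3]
```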

Benjamini-Hochberg method

        No correction   Bonferroni   HB         BH
α       0.05            5×10⁻⁵       3.7×10⁻⁵   0.0011
FPR     0.058           0            0          0.0021
FNR     0               0.87         0.87       0.30
FDR     0.65            0            0          0.087

Benjamini-Hochberg:

                  H0 true   H0 false   Total
Significant           2        21        23
Not significant     968         9       977
Total               970        30      1000

Controlling FWER and FDR

Holm-Bonferroni controls FWER:        FWER = Pr(FP ≥ 1)

Benjamini-Hochberg controls FDR:      FDR = FP / (FP + TP)

FDR is a random variable.

Controlling FWER is guaranteed:       FWER ≤ α′

Controlling FDR is guaranteed in expectation:   E[FDR] ≤ α

The Benjamini-Hochberg procedure controls FDR

Controlling FDR:

E[FDR] ≤ α

E[FDR] can be approximated by the mean FDR over many experiments.

Bootstrap: generate the test data 10,000 times, perform 1000 t-tests for each set and find the FDR of the BH procedure.

Adjusted p-values

p-values can be "adjusted", so that they compare directly with α rather than with α/m.

[Figure: adjusted p-values vs raw p-values]

Problem: an adjusted p-value does not express any probability.

Despite their popularity, I recommend against using adjusted p-values.

How to do this in R

# Read generated data
> d = read.table("http://tiny.cc/two_hypotheses", header=TRUE)
> p = d$p

# Holm-Bonferroni procedure
> p.adj = p.adjust(p, "holm")
> p[which(p.adj < 0.05)]
[1] 1.476263e-05 2.662440e-05 3.029839e-05

# Benjamini-Hochberg procedure
> p.adj = p.adjust(p, "BH")
> p[which(p.adj < 0.05)]
 [1] 1.038835e-03 6.670798e-04 1.050547e-03 1.476263e-05 5.271367e-04
 [6] 3.503370e-04 9.664789e-04 1.068863e-03 7.995860e-04 5.404476e-04
[11] 9.681321e-04 1.580069e-04 1.732747e-04 3.159954e-04 2.662440e-05
[16] 4.709732e-04 1.517964e-04 2.873971e-04 3.258726e-04 4.087615e-04
[21] 3.029839e-05 9.320438e-04 1.713309e-04 2.863402e-04 4.082322e-04

Estimating false discovery rate: control and estimate

Controlling FDR:
1. Fix an acceptable FDR limit, α, beforehand.
2. Find a thresholding rule, so that E[FDR] ≤ α.

Estimating FDR:
For each p-value, p, form a point estimate of FDR, FDR(p).

P-value distribution

[Figure: p-value histograms for data set 2: 100% null, and 80% null with 20% alternative]

[Figure: examples of good and bad p-value distributions]

Definition of π0

π0 is the proportion of null tests (here 80% null, 20% alternative):

π0 = (# non-significant tests) / (# all tests)

[Figure: p-value histogram with the H0 and H1 components marked]

Storey method

Estimate π0:

Use the histogram for p > λ to estimate π0(λ).

Do it for all 0 < λ < 1 and then find the best λ.

Storey, J.D., 2002, J. R. Statist. Soc. B, 64, 479
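The λ-based estimate can be sketched in Python. Null p-values are uniform, so the flat tail of the histogram above λ estimates π0; the simulated mixture and the fixed λ = 0.5 below are my choices, not from the slide:

```python
import numpy as np

def pi0_estimate(pvalues, lam=0.5):
    # Null p-values are uniform, so a fraction (1 - lam) of them lies
    # above lam; count p-values above lam and rescale
    pvalues = np.asarray(pvalues)
    return float(np.mean(pvalues > lam) / (1 - lam))

rng = np.random.default_rng(1)
# Mixture: 8000 null (uniform) + 2000 alternative (near zero), so pi0 = 0.8
p = np.concatenate([rng.uniform(size=8000), rng.beta(0.5, 20.0, size=2000)])
print(round(pi0_estimate(p), 2))  # close to 0.8
```

The `qvalue` package used later does this for a grid of λ values and smooths the result to pick the final estimate.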

Point estimate of FDR

Point estimate FDR(p), for the 80% null, 20% alternative data.

Choose an arbitrary limit p: every p-value below p is called significant. The number of significant tests is

S(p) = #{p(i) < p}

Under the null, p-values are uniform, so the expected number of false positives is

FP(p) = π0 m p

Hence,

FDR(p) = π0 m p / S(p)

Storey method

The point estimate of FDR leads to the so-called q-value:

q(p) = min over p′ ≥ p of FDR(p′)

If FDR(p) is monotonic,

q(p) = FDR(p)
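Both the point estimate and its monotonic version can be computed in a few lines. A numpy-based Python sketch (names are mine; the q-values here take S(p) as the count of p-values up to and including p):

```python
import numpy as np

def q_values(pvalues, pi0=1.0):
    """FDR point estimate FDR(p) = pi0 * m * p / S(p), and the q-value
    q(p) = minimum of FDR(p') over all p' >= p."""
    p = np.sort(np.asarray(pvalues))
    m = len(p)
    s = np.arange(1, m + 1)                     # S(p): p-values <= p
    fdr = pi0 * m * p / s
    q = np.minimum.accumulate(fdr[::-1])[::-1]  # running minimum from the right
    return p, fdr, q

p, fdr, q = q_values([0.003, 0.005, 0.012, 0.038, 0.058])
print(np.round(q, 4))
```

With pi0 = 1 this reproduces the Benjamini-Hochberg adjusted p-values, which is the comparison made a few slides below.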

How to do this in R

> library(qvalue)

# Read data set 1
> pvalues = read.table('http://tiny.cc/multi_FDR', header=TRUE)
> p = pvalues$p

# Benjamini-Hochberg limit
> p.adj = p.adjust(p, method='BH')
> sum(p.adj <= 0.05)
[1] 216

# q-values
> qobj = qvalue(p)
> q = qobj$qv
> summary(qobj)
pi0: 0.8189884

Cumulative number of significant calls:

          <1e-04 <0.001 <0.01 <0.025 <0.05  <0.1    <1
p-value       40    202   611    955  1373  2138 10000
q-value        0      1     1     96   276   483 10000
local FDR      0      1     3     50   141   278  5915

> plot(qobj)
> hist(qobj)

Interpretation of q-value

q ≤ 0.0273

No.   ID     p-value    q-value
...
100   9249   0.000328   0.0266
101   8157   0.000328   0.0266
102   8228   0.000335   0.0269
103   8291   0.000338   0.0269
104   8254   0.000347   0.0272
105   8875   0.000348   0.0272
106   8055   0.000353   0.0273
107   8235   0.000375   0.0284
108   8148   0.000376   0.0284
109   8236   0.000381   0.0284
110   8040   0.000382   0.0284
...

There are 106 tests with q ≤ 0.0273. Expect 2.73% of false positives among these tests: ~3 false positives if you set a limit of q ≤ 0.0273 or p ≤ 0.000353.

The q-value tells you how many false positives you should expect after choosing a significance limit.

Q-values vs Benjamini-Hochberg

When π0 = 1, both methods give the same result.

For the same FDR, Storey's method provides more significant p-values. Hence, it is more powerful, especially for small π0.

But this depends on how good the estimate of π0 is.

[Figure: BH adjusted p-values vs q-values]

π0 is the estimate of the proportion of null (non-significant) tests.

Which multiple-test correction should I use?

[Diagram: no correction, Storey, Benjamini-Hochberg and Bonferroni ordered on a scale from more false positives to more false negatives]

False positive:
- We "discover" an effect where there is no effect
- Can be tested in follow-up experiments
- Not hugely important in small samples, but impossible to manage in large samples

False negative:
- A missed discovery
- Once you've missed it, it's gone

Multiple test procedures: summary

Method: No correction
  Controls: FPR
  Advantages: False negatives not inflated
  Disadvantages: Can result in FP ≫ TP
  Recommendation: Small samples, when the cost of FN is high

Method: Bonferroni
  Controls: FWER
  Advantages: None
  Disadvantages: Lots of false negatives
  Recommendation: Do not use

Method: Holm-Bonferroni
  Controls: FWER
  Advantages: Slightly better than Bonferroni
  Disadvantages: Lots of false negatives
  Recommendation: Appropriate only when you want to guard against any false positives

Method: Benjamini-Hochberg
  Controls: FDR
  Advantages: Good trade-off between false positives and negatives
  Disadvantages: On average, α of your positives will be false
  Recommendation: Better in large samples

Method: Storey
  Controls: --
  Advantages: More powerful than BH, in particular for small π0
  Disadvantages: Depends on a good estimate of π0
  Recommendation: The best method, gives more insight into FDR

Hand-outs available at http://tiny.cc/statlec

Benjamini-Hochberg method

Theoretical null distribution: uniform p-values,

p(i) = i / m

[Figure: null distribution of the generated data]

Benjamini-Hochberg method

[Figure: sorted p-values for the test data (970 H0 + 20 H1) and for the null data (1000 H0), with the 100% FDR and 5% FDR limits]

Benjamini-Hochberg method

[Figure: p-values of the test data and of the null data plotted against rank]