False Discovery Rate Control Is a Recommended Alternative to Bonferroni-Type Adjustments in Health Studies Mark E

Author's personal copy Journal of Clinical Epidemiology 67 (2014) 850e857 REVIEW ARTICLES False discovery rate control is a recommended alternative to Bonferroni-type adjustments in health studies Mark E. Glickmana,b,*, Sowmya R. Raoa,c, Mark R. Schultza aCenter for Health care Organization and Implementation Research, Bedford VA Medical Center, 200 Springs Road (152), Bedford, MA 01730, USA bDepartment of Health Policy and Management, Boston University School of Public Health, 715 Albany Street, Talbot Building, Boston, MA 02118, USA cDepartment of Quantitative Health Sciences, University of Massachusetts Medical School, 55 N. Lake Avenue, Worcester, MA 01655, USA Accepted 12 March 2014; Published online 13 May 2014 Abstract Objectives: Procedures for controlling the false positive rate when performing many hypothesis tests are commonplace in health and medical studies. Such procedures, most notably the Bonferroni adjustment, suffer from the problem that error rate control cannot be local- ized to individual tests, and that these procedures do not distinguish between exploratory and/or data-driven testing vs. hypothesis-driven testing. Instead, procedures derived from limiting false discovery rates may be a more appealing method to control error rates in multiple tests. Study Design and Setting: Controlling the false positive rate can lead to philosophical inconsistencies that can negatively impact the practice of reporting statistically significant findings. We demonstrate that the false discovery rate approach can overcome these inconsistencies and illustrate its benefit through an application to two recent health studies. Results: The false discovery rate approach is more powerful than methods like the Bonferroni procedure that control false positive rates. Controlling the false discovery rate in a study that arguably consisted of scientifically driven hypotheses found nearly as many significant results as without any adjustment, whereas the Bonferroni procedure found no significant results. Conclusion: Although still unfamiliar to many health researchers, the use of false discovery rate control in the context of multiple testing can provide a solid basis for drawing conclusions about statistical significance. Ó 2014 Elsevier Inc. All rights reserved. Keywords: False discovery rate; False positive rate; FWER; Multiple tests; P-value; Study-wide error rate 1. Introduction made, with plenty of advocates on each side of the argument. It appears doubtful that researchers will coalesce We are now in an age of scientific inquiry where health behind a unified point of view any time soon. and medical studies are routinely collecting large amounts Significance level adjustments that control study-wide of data. These studies typically involve the researcher at- error rates are still common in peer-reviewed health studies. tempting to draw many inferential conclusions through An examination of recent issues of several highly cited numerous hypothesis tests. Researchers are typically medical and health journals (Journal of the American Med- advised to perform some type of significance-level adjust- ical Association, New England Journal of Medicine, Annals ment to account for the increased probability of reporting of Internal Medicine, and Medical Care) reveals an abun- false positive results through multiple tests. Such adjust- dant use of multiple-test adjustments that control study- ments are designed to control study-wide error rates and wide error rates: We found 191 articles published in 2012 lower the probability of falsely rejecting true null hypothe- to 2013 making some adjustment for multiple testing, with ses. The most commonly understood downside of these 102 (53.4%) performing the Bonferroni or another study- procedures is the loss of power to detect real effects. Argu- wide error adjustment. Some other studies reported explic- ments have been put forth over the years whether adjust- itly, and almost apologetically, that they had not performed ments for controlling study-wide error rates should be an adjustment, and some even reported consequences of not having adjusted for multiple tests. Despite the continued popularity of multiple test adjust- Conflict of interest: None. Funding: None. ments in health studies that control false positive error * Corresponding author. Tel.: 781-687-2875; fax: 781-687-3106. rates, we argue that controlling the false discovery rate E-mail address: [email protected] (M.E. Glickman). [1] is an attractive alternative. The false discovery rate 0895-4356/$ - see front matter Ó 2014 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.jclinepi.2014.03.012 Author's personal copy M.E. Glickman et al. / Journal of Clinical Epidemiology 67 (2014) 850e857 851 in-roads into more general health studies. Of the 191 arti- What is new? cles we found in highly cited journals that mention adjustments for multiple tests, only 14 (7.3%) include false Controlling the false positive rate to address multi- discovery rate adjustments. plicity of tests in health studies can result in logical This article is intended to remind readers of the funda- inconsistencies and opportunities for abuse. mental challenges of multiple-test adjustment procedures Errors in hypothesis test conclusions depend on the that control study-wide error rates and explain why false frequency of the truth of null hypotheses being discovery rate control may be an appealing alternative for tested. drawing statistical inferences in health studies. In doing so, we distinguish between tests that are exploratory and False discovery rate control procedures do not suf- those that are hypothesis driven. The explanations we pre- fer from the philosophical challenges evident with sent to discourage use of adjustments based on study-wide Bonferroni-type procedures. error rate control are not newdthe case has been made Health researchers may benefit from relying on strongly over the past 10 to 20 years [4e9]. Arguments false discovery rate control in studies with multiple in favor of using false discovery rate control have been tests. made based on power considerations [10e13], but we are unaware of explanations based on the distinction between exploratory and hypothesis-driven testing. is the expected fraction of tests declared statistically signif- 2. Adjustment for multiple testing through false posi- icant in which the null hypothesis is actually true. The false tive rate control discovery rate can be contrasted with the false positive rate, which is the expected fraction of tests with true null The usual argument to convince researchers that adjust- hypotheses that are mistakenly declared statistically ments are necessary when multiple tests are performed is to significant. In other words, the false positive rate is the point out that, without adjustments, the probability of at least probability of rejecting a null hypothesis given that it is one null hypothesis being rejected is larger than acceptable true, while the false discovery rate is the probability that levels. Suppose, for example, that a researcher performs 100 a null hypothesis is true given that the null hypothesis has tests at the a 5 0.05 significance level in which the null been rejected. hypothesis is true in every case. If all the tests are indepen- Table 1 illustrates the distinction between the false pos- dent, then the probability that at least one test would be incor- itive and false discovery rates. Suppose that a set of tests rectly rejected is 1 À (1 À 0.05)100 5 0.9941, or 99.41%. In can be cross-classified into a 2 Â 2 table according to truth most studies, tests are not independent (eg, when tests share of the hypotheses (whether the null hypothesis is true or the same data), in which case, the probability of at least one not), and the decision made based on the data (whether to incorrect rejection would not be quite so large, though likely reject the null hypothesis or not). Let a be the fraction of large enough to be of some concern. Recognizing that the tests with true null hypotheses that are not rejected, b be probability of at least one false positive may be unacceptably the fraction of tests with true null hypotheses that are large, a common strategy is to adjust the significance level as a mistakenly rejected, c be the fraction of tests with false null function of the number of tests performed. hypotheses that are mistakenly not rejected, and d be the One of the simplest and most commonly used ap- fraction of tests with false null hypotheses that are rejected. proaches to adjusting significance levels is the Bonferroni Assuming these fractions can be viewed as long-run rates, procedure [14,15]. Letting n be the number of tests per- the false positive rate is computed as b/(a þ b), whereas formed, and a the significance level one would normally the false discovery rate is computed as b/(b þ d ). Although use if performing only one test, the Bonferroni procedure the numerators of these fractions are the same, the denom- involves rejecting null hypotheses whose P-values are less inator of the false positive rate is the rate of encountering than a/n rather than a. For example, if a study involves 100 true null hypotheses, and the denominator of the false dis- hypothesis tests and the researcher would ordinarily use covery rate is the overall rate of rejecting null hypotheses. a 5 0.05 as the significance level for a single test, then Conventional hypothesis testing, along with procedures the Bonferroni procedure requires the researcher to to control study-wide error rates, are set up to limit false compare each of the 100 P-values to a/n 5 0.05/ positive rates, but not false discovery rates. False discovery 100 5 0.0005. By invoking this procedure, the researcher rate control has become increasingly standard practice in is guaranteed that the probability of at least one false pos- genomic studies and the analysis of micro-array data where itive, regardless of the dependence among the tests, is no an abundance of testing occurs.

False Discovery Rate Control Is a Recommended Alternative to Bonferroni-Type Adjustments in Health Studies Mark E

Underestimation of Type I Errors in Scientific Analysis

The Reproducibility of Research and the Misinterpretation of P Values

From P-Value to FDR

Statistical Significance for Genome-Wide Experiments

Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy

Redefine Statistical Significance We Propose to Change the Default P-Value Threshold for Statistical Significance from 0.05 to 0.005 for Claims of New Discoveries

Benjamini-Hochberg Method

Hypothesis Testing

8. Multiple Test Corrections

Hypothesis Testing (1/22/13)

How Should We Use Hypothesis Tests to Guide the Next Experiment?

Multiple Comparisons: Problem and Solutions