<<

STATS 200: Introduction to Statistical Inference
Lecture 11: Testing multiple hypotheses

The multiple testing problem

Why Most Published Research Findings Are False
John P. A. Ioannidis

[Slide reproduces the first page of this essay, which argues that for most study designs and settings, a claimed research finding is more likely to be false than true. Citation: Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2(8): e124.]

What are some ways we can think about acceptance/rejection errors across multiple hypothesis tests/experiments?

What statistical procedures can control these measures of errors?

The multiple testing problem

Multiple testing problem: If I test n true null hypotheses at level α, then on average I’ll still (falsely) reject αn of them.

Examples:

I Test the safety of a drug in terms of a dozen different side effects

I Test whether a disease is related to 10,000 different gene expressions

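To see the problem concretely, here is a minimal Python sketch (the simulation setup is my own, not from the lecture): with n = 10,000 true nulls tested at level α = 0.05, we expect about αn = 500 false rejections.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 10_000, 0.05

# Under a true null hypothesis, a (continuous) test's p-value is Uniform(0, 1).
pvalues = rng.uniform(size=n)

# Testing each hypothesis at level alpha falsely rejects about alpha * n of them.
false_rejections = int(np.sum(pvalues <= alpha))
print(false_rejections)  # close to alpha * n = 500
```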

The Bonferroni correction

Consider testing n different null hypotheses H0^(1), ..., H0^(n), all of which are, in fact, true. One goal we might set is to ensure

P[reject any null hypothesis] ≤ α.

A simple and commonly-used method of achieving this is called the Bonferroni method: perform each test at significance level α/n, instead of level α.

Verification:

P[reject any null hypothesis]
  = P[{reject H0^(1)} ∪ ... ∪ {reject H0^(n)}]
  ≤ P[reject H0^(1)] + ... + P[reject H0^(n)]
  = α/n + ... + α/n
  = α.

Family-wise error rate

More generally, suppose we test n null hypotheses, n0 of which are true and n − n0 of which are false. Results of the tests might be tabulated as follows:

            H0 is true   H0 is false   Total
Reject H0       V             S           R
Accept H0       U             T         n − R
Total          n0          n − n0         n

R = # rejected null hypotheses, V = # type I errors, T = # type II errors

Remark: We consider n0 and n − n0 to be fixed quantities. The number of hypotheses we reject, R, as well as the cell counts U, V, S, T, are random, as they depend on the data observed in each experiment.

The family-wise error rate (FWER) is the probability of falsely rejecting at least one true null hypothesis,

P[V ≥ 1].

A procedure controls FWER at level α if P[V ≥ 1] ≤ α, regardless of the (possibly unknown) number of true null hypotheses n0.

Bonferroni controls FWER: Without loss of generality, let H0^(1), ..., H0^(n0) be the true null hypotheses. Then

P[V ≥ 1] = P[{reject H0^(1)} ∪ ... ∪ {reject H0^(n0)}]
         ≤ P[reject H0^(1)] + ... + P[reject H0^(n0)]
         = α/n + ... + α/n = αn0/n ≤ α.
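A small simulation (a Python sketch with made-up settings, not from the lecture) comparing the estimated FWER of uncorrected testing versus Bonferroni when all n nulls are true:

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha, trials = 100, 0.05, 2_000

# Each row is one "experiment": n true nulls, so n Uniform(0, 1) p-values.
pvals = rng.uniform(size=(trials, n))

# FWER = P[V >= 1], estimated as the fraction of trials with any false rejection.
fwer_uncorrected = np.mean((pvals <= alpha).any(axis=1))
fwer_bonferroni = np.mean((pvals <= alpha / n).any(axis=1))

print(fwer_uncorrected)  # near 1 - (1 - alpha)^n, i.e. almost 1
print(fwer_bonferroni)   # bounded by alpha = 0.05
```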

Thinking in terms of p-values

Many multiple-testing procedures are formulated as operating on the p-values returned by individual tests, rather than on the original data or the test statistics that were used.

For example, Bonferroni may be described as follows: Reject those null hypotheses whose corresponding p-values are at most α/n.

Advantages:

I Abstracts away details about how individual tests were performed

I Applicable regardless of which tests/test statistics were used for each experiment

I Allows for meta-analyses of previous experiments without access to the original data

The null distribution of a p-value

Suppose a null hypothesis H0 is true, and we perform a statistical test of H0 and obtain a p-value P. What is the distribution of P?

If our test statistic T has a continuous distribution under H0 with CDF F , and we reject for small values of T , then the p-value is just the lower tail probability

P = F (T ).

For any t ∈ (0, 1),

P[P ≤ t] = P[F(T) ≤ t] = P[T ≤ F⁻¹(t)] = F(F⁻¹(t)) = t.

So P ∼ Uniform(0, 1). Similarly P ∼ Uniform(0, 1) if we reject for large T , or both large and small T .∗
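A numerical check of this fact (a sketch; the z-test setup is my own choice): simulate many datasets under H0 and verify that the empirical CDF of the resulting p-values is close to that of Uniform(0, 1).

```python
import math
import numpy as np

rng = np.random.default_rng(2)

def z_test_pvalue(x):
    """Two-sided z-test of H0: mean = 0, for a sample with known variance 1."""
    z = math.sqrt(len(x)) * float(np.mean(x))
    # standard normal CDF evaluated at |z|
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return 2 * (1 - phi)

# 10,000 datasets generated under H0, giving one null p-value each
pvals = np.array([z_test_pvalue(rng.normal(0, 1, size=20)) for _ in range(10_000)])

# If P ~ Uniform(0, 1), then P[P <= t] = t for every t in (0, 1).
for t in (0.1, 0.5, 0.9):
    print(t, np.mean(pvals <= t))  # second number should be close to t
```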

∗ If T has a discrete distribution under H0, then so does P, so the null distribution of P wouldn’t be exactly Uniform(0, 1).

Ordered p-value plots

We can understand multiple testing procedures visually in terms of the plot of the ordered p-values (sorted from smallest to largest):

[Plot: the 100 ordered p-values; x-axis: rank (0–100), y-axis: p-value (0–1).]

Ordered p-value plots

Applying each test at level 0.05, we reject the null hypotheses corresponding to the 18 red points below.

[Plot: same ordered p-values; the 18 points with p-value below 0.05 are shown in red.]

Ordered p-value plots

Applying the Bonferroni correction, we reject null hypotheses with p-value less than 0.0005, corresponding to the 4 blue points below.

[Plot: same ordered p-values; the 4 points with p-value below 0.0005 are shown in blue.]

False discovery rate

            H0 is true   H0 is false   Total
Reject H0       V             S           R
Accept H0       U             T         n − R
Total          n0          n − n0         n

Controlling the FWER P[V ≥ 1] may be too conservative and greatly reduce our power to detect real effects, especially when n (the total number of tested hypotheses) is large.

In many modern “large-scale testing” applications, focus has shifted to the false-discovery proportion (FDP),

FDP = V/R if R ≥ 1, and FDP = 0 if R = 0,

and on procedures that control its expected value E[FDP], called the false-discovery rate (FDR).

FWER vs. FDR

Controlling FDR is a shift in paradigm—we are willing to tolerate some type I errors (false discoveries), as long as most of the discoveries we make are still true.

It has been argued that in applications where the statistical test is thought of as providing a “definitive answer” for whether an effect is real, FWER control is still the correct objective. In contrast, for applications where the statistical test identifies candidate effects that are likely to be real and which merit further study, it may be better to target FDR control.

The Benjamini-Hochberg procedure

The Benjamini-Hochberg (BH) procedure compares the sorted p-values to a diagonal cutoff line, finds the largest p-value that still falls below this line, and rejects the null hypotheses for the p-values up to and including this one.

[Plot: ordered p-values with the BH diagonal cutoff line.]

The Benjamini-Hochberg procedure

To control FDR at level q, the diagonal cutoff line is set to equal the Bonferroni level q/n at the smallest p-value and to equal the uncorrected level q at the largest p-value.

[Plot: ordered p-values with the diagonal cutoff line running from q/n at the smallest rank to q at the largest.]

The Benjamini-Hochberg procedure

Here’s the same picture, zoomed in to the 30 smallest p-values. In this example, the BH procedure rejects the 10 null hypotheses corresponding to the points in green.

[Plot: the 30 smallest p-values (y-axis 0–0.12) with the BH cutoff line; the 10 rejected points are shown in green.]

Theorem (Benjamini and Hochberg (1995))

Consider tests of n null hypotheses, n0 of which are true. If the test statistics (or equivalently, the p-values) of these tests are independent, then the FDR of the above procedure satisfies

FDR ≤ n0 q/n ≤ q.

Note: FDR control is not guaranteed if the test statistics are dependent.
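The theorem can be illustrated numerically. Below is a hedged Python sketch (the settings n = 100, n0 = 80, effect size 3.5, and q = 0.1 are my own choices, and `bh_reject` is my own helper): averaging the FDP over many simulated independent experiments should give roughly n0 q/n = 0.08.

```python
import math
import numpy as np

rng = np.random.default_rng(3)
n, n0, q, trials = 100, 80, 0.1, 2_000
mu = np.zeros(n)
mu[n0:] = 3.5  # the last n - n0 hypotheses are false nulls with a real effect

def bh_reject(p, q):
    """Indices of hypotheses rejected by Benjamini-Hochberg at level q."""
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, len(p) + 1) / len(p)
    if not below.any():
        return order[:0]
    r = int(np.max(np.where(below)[0])) + 1  # largest r with P(r) <= qr/n
    return order[:r]

erf = np.vectorize(math.erf)
fdps = []
for _ in range(trials):
    z = rng.normal(mu, 1)
    p = 2 * (1 - 0.5 * (1 + erf(np.abs(z) / math.sqrt(2))))  # two-sided p-values
    rej = bh_reject(p, q)
    v = int(np.sum(rej < n0))  # false discoveries: rejected true nulls
    fdps.append(v / len(rej) if len(rej) > 0 else 0.0)

print(np.mean(fdps))  # estimated FDR, approximately n0 * q / n = 0.08
```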

The Benjamini-Hochberg procedure

Formally, the BH procedure at level q is defined as follows:

1. Sort the p-values. Call them P(1) ≤ ... ≤ P(n).
2. Find the largest r such that P(r) ≤ qr/n.
3. Reject the null hypotheses H(1), ..., H(r).
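The three steps can be sketched directly in code (a minimal Python version; the function name and the toy p-values are hypothetical):

```python
import numpy as np

def benjamini_hochberg(pvalues, q):
    """Return a boolean mask of the hypotheses rejected by BH at level q."""
    p = np.asarray(pvalues, dtype=float)
    n = len(p)
    order = np.argsort(p)                     # step 1: sort the p-values
    thresholds = q * np.arange(1, n + 1) / n  # cutoffs q*r/n for r = 1, ..., n
    below = np.where(p[order] <= thresholds)[0]
    reject = np.zeros(n, dtype=bool)
    if len(below) > 0:
        r = int(below[-1]) + 1                # step 2: largest r with P(r) <= qr/n
        reject[order[:r]] = True              # step 3: reject H(1), ..., H(r)
    return reject

# toy example with hypothetical p-values (already sorted for readability)
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p, q=0.05))  # rejects the two smallest p-values
```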

Motivating the BH procedure

For each α ∈ (0, 1), let R(α) be the number of p-values ≤ α. If we reject hypotheses with p-value ≤ α, then we expect (on average) to falsely reject αn0 null hypotheses, since the null p-values are distributed as Uniform(0, 1). So we might estimate the false discovery proportion by αn0/R(α).

As we don’t know n0, let’s take the conservative upper bound

αn/R(α).

If we set α = P(r), the rth smallest p-value, then αn/R(α) ≤ q exactly when P(r) ≤ qr/n. So the BH procedure chooses α (in a data-dependent way) so as to reject as many hypotheses as possible, subject to the constraint αn/R(α) ≤ q.

Proof of FDR control

Let’s prove (more formally) the previous theorem, that BH controls the FDR. For any event E, we use the indicator notation

1{E} = 1 if E holds, and 1{E} = 0 if E does not hold.

Without loss of generality, order the n null hypotheses H0^(1), ..., H0^(n) so that the first n0 of them are true nulls. Then

FDR = E[FDP]
    = E[ Σ_{r=1}^{n} (V/r) 1{R = r} ]
    = E[ Σ_{r=1}^{n} Σ_{j=1}^{n0} (1/r) 1{reject H0^(j)} 1{R = r} ],

(where we have noted V = Σ_{j=1}^{n0} 1{reject H0^(j)}).

Proof of FDR control

Applying linearity of expectation,

FDR = Σ_{r=1}^{n} Σ_{j=1}^{n0} (1/r) E[ 1{reject H0^(j)} 1{R = r} ]
    = Σ_{r=1}^{n} Σ_{j=1}^{n0} (1/r) P[ reject H0^(j) and R = r ].

For fixed j, let P*(1) ≤ ... ≤ P*(n−1) be the sorted n − 1 p-values other than Pj. Then the BH procedure rejects r total hypotheses including H0^(j) if and only if Pj ≤ qr/n and the following event holds:

E(r) := { P*(1), ..., P*(r−1) ≤ qr/n;  P*(r) > q(r+1)/n, P*(r+1) > q(r+2)/n, ..., P*(n−1) > q }.

Proof of FDR control

As the p-values are independent, Pj is independent of P*(1), ..., P*(n−1). Furthermore, Pj ∼ Uniform(0, 1). So

FDR = Σ_{r=1}^{n} Σ_{j=1}^{n0} (1/r) P[ Pj ≤ qr/n and E(r) holds ]
    = Σ_{j=1}^{n0} Σ_{r=1}^{n} (1/r) P[ Pj ≤ qr/n ] P[ E(r) holds ]
    = Σ_{j=1}^{n0} Σ_{r=1}^{n} (1/r) (qr/n) P[ E(r) holds ]
    = (q/n) Σ_{j=1}^{n0} Σ_{r=1}^{n} P[ E(r) holds ].

Proof of FDR control

Finally, note that (for any fixed j) the events E(1), ..., E(n) are mutually exclusive: E(r) holds if and only if the largest index k such that P*(k) ≤ q(k+1)/n is exactly k = r − 1 (with E(1) holding if P*(k) > q(k+1)/n for all k), and this is true for exactly one value of r ∈ {1, ..., n}. So

Σ_{r=1}^{n} P[ E(r) holds ] = 1.

Hence

FDR ≤ (q/n) Σ_{j=1}^{n0} 1 = qn0/n ≤ q.