
Many tests, many strata: How should we use hypothesis tests to guide the next experiment?

Jake Bowers (University of Illinois @ Urbana-Champaign, Departments of Political Science & Statistics; Associate Fellow, Office of Evaluation Sciences, GSA; Methods Director, EGAP; Fellow, CASBS; http://jakebowers.org)

Nuole Chen (University of Illinois @ Urbana-Champaign, Department of Political Science; Fellow, Office of Evaluation Sciences, GSA; https://publish.illinois.edu/nchen3)

25 January 2019

The Plan

Motivation and Application: An experiment with many blocks raises questions about where to focus the next experiment

Background: Hypothesis testing and error for many tests

Adjustment based methods for controlling error rate.

Limitations and Advantages of Different Approaches

Motivation and Application: An experiment with many blocks raises questions about where to focus the next experiment

A (Fake) Field Experiment

"...We successfully randomized the college application communication to people within each of the 100 HUD buildings. The estimated effect of the new policy is 5 (p < .001)."

[Figure: outcomes by treatment arm (0 vs 1), pooled across the 100 buildings.]

A (Fake) Field Experiment

Policy Expert: ...Great! But this seems small. Could it be mainly affecting people in just a few buildings?

[Figure: a panel of per-building plots of outcomes by treatment arm (0 vs 1).]

A (Fake) Field Experiment

Jake: This is a natural and important question, both for the development of new theory and for better policy. Let me get back to you...

[Figure: the same panel of per-building plots of outcomes by treatment arm (0 vs 1).]

One response: use large-scale hypothesis testing

Given a block-randomized experiment, if effects are concentrated in a few blocks (but zero in most or many blocks), how can we detect them?

• Report 100 simple average treatment effects? But reporting τ̂b = {5, 0, −2, 100, ...} will require p-values (or CIs, or SEs, or posterior distributions).
• Report 100 simple p-values? But we know that the false rejection rate / false positive rate for 100 tests is 1 − (1 − .05)^100 ≈ .99.
• Report 100 adjusted p-values? But we know that Bonferroni-style adjustments will reduce power to detect effects (i.e. we would only reject if p < .05/100 = .0005).

Our proposal to increase power: test many hypotheses but control the false positive rate via testing hypotheses in order (novel in this context) OR by controlling the false discovery rate (common in genetics and machine learning).

Background: Hypothesis testing and error for many tests

A hypothesis test summarizes design information against a claim or model.

What is a good hypothesis test?

A good test casts doubt on the truth rarely (few false positive errors) and detects falsehoods often (high statistical power). Rarely and often refer to repeated use in a given research design. To assess the false positive rate, for example:

1. Shuffle the treatment assignment to make the true relationship zero, keeping everything else about the data fixed.

2. Test the true null hypothesis, H0 : τi = 0, and record the p-value.

3. Repeat steps 1 and 2 B times.

4. The false positive rate (aka Type I error rate, aka size of the test) is ∑b I(pb ≤ α)/B.
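A minimal sketch of this check in R, assuming a data frame dat with outcome y, treatment z, and block id b (all hypothetical names), and using a plain t-test as a stand-in for whatever test statistic the design calls for:

    set.seed(20190125)
    B <- 1000
    alpha <- 0.05

    false_pos <- replicate(B, {
      ## Step 1: shuffle treatment within blocks so the true effect is zero,
      ## keeping everything else about the data fixed.
      dat$z_sh <- ave(dat$z, dat$b, FUN = sample)
      ## Step 2: test the true null H0: tau_i = 0 and record whether p <= alpha.
      t.test(y ~ z_sh, data = dat)$p.value <= alpha
    })

    ## Step 4: the estimated false positive rate, which should be near or below alpha.
    mean(false_pos)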

The false positive error rate increases with the number of tests.

[Figure: probability of rejecting at least one null hypothesis as a function of the number of hypotheses m, with α = 0.05 marked.]

This figure shows the relationship between the number of hypotheses m and the probability of rejecting at least one null hypothesis, 1 − (1 − α)^m, when α = 0.05 (Bretz, Hothorn, and Westfall, 2011).
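The curve in the figure (and the m = 100 value used earlier) can be reproduced with a few lines of base R:

    alpha <- 0.05
    m <- 1:200
    plot(m, 1 - (1 - alpha)^m, type = "l",
         xlab = "Number of hypotheses m",
         ylab = "Probability of rejecting at least one true null")
    abline(h = alpha, lty = 2)  # the single-test error rate, for reference

    1 - (1 - alpha)^100  # the value for m = 100 quoted earlier
    #> [1] 0.9940795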

Error rates of multiple testing procedures

Table 1 from Benjamini and Hochberg (1995): number of errors committed when testing m null hypotheses. Cells are numbers of tests; for example, R is the number of "discoveries" and V is the number of false discoveries.

                               Declared non-significant   Declared significant   Total
    True null hypotheses       U                          V                      m0
    Non-true null hypotheses   T                          S                      m − m0
    Total                      m − R                      R                      m

Two main error rates to control when testing many hypotheses:

• Family wise error rate (FWER) is P(V > 0) (the probability of any false positive error).
• False discovery rate (FDR) is E(V/R | R > 0) (the average proportion of false positive errors given some rejections).

From Benjamini and Hochberg (1995), Section 2.1 (Definition of the false discovery rate):

The proportion of errors committed by falsely rejecting null hypotheses can be viewed through the random variable Q = V/(V + S), the proportion of the rejected null hypotheses which are erroneously rejected. Naturally, we define Q = 0 when V + S = 0, as no error of false rejection can be committed. Q is an unobserved (unknown) random variable, as we do not know v or s, and thus q = v/(v + s), even after experimentation and data analysis. We define the FDR, Qe, to be the expectation of Q,

Qe = E(Q) = E{V/(V + S)} = E(V/R).

Two properties of this error rate are easily shown, yet are very important.

(a) If all null hypotheses are true, the FDR is equivalent to the FWER: in this case s = 0 and v = r, so if v = 0 then Q = 0, and if v > 0 then Q = 1, leading to P(V ≥ 1) = E(Q) = Qe. Therefore control of the FDR implies control of the FWER in the weak sense.

(b) When m0 < m, the FDR is smaller than or equal to the FWER: in this case, if v > 0 then v/r ≤ 1, leading to I(V ≥ 1) ≥ Q. Taking expectations on both sides we obtain P(V ≥ 1) ≥ Qe, and the two can be quite different. As a result, any procedure that controls the FWER also controls the FDR. However, if a procedure controls the FDR only, it can be less stringent, and a gain in power may be expected. In particular, the larger the number of non-true null hypotheses, the larger S tends to be, and so is the difference between the error rates. As a result, the potential for increase in power is larger when more of the hypotheses are non-true.
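In a simulation where we know which nulls are true, these quantities are straightforward to compute. A sketch in R, with p a vector of m p-values and true_null a logical vector marking the truly null hypotheses (both names hypothetical):

    fdp_and_fwer <- function(p, true_null, alpha = 0.05) {
      rejected <- p <= alpha
      R <- sum(rejected)              # number of "discoveries"
      V <- sum(rejected & true_null)  # number of false discoveries
      c(Q = if (R > 0) V / R else 0,  # the false discovery proportion Q above
        any_false_pos = as.numeric(V > 0))
    }
    ## Averaging Q over many simulated experiments estimates the FDR = E(Q);
    ## averaging any_false_pos estimates the FWER = P(V > 0).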

Family Wise Error Rate (FWER)

Simulations to show the FWER with two different tests, ranging from 10 to 100 hypotheses (blocks), where each hypothesis (block) is one test. Same data as shown earlier, but sampled to show effects in datasets with 10, ..., 100 blocks. "pOneway" is a t-test using means. "pWilcox" is a Wilcoxon-Mann-Whitney rank-based test.
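The slide does not show the test functions themselves; as stand-ins, block-level p-values like these could be computed for one block bdat with outcome y and treatment z (hypothetical names):

    pOneway <- function(bdat) t.test(y ~ z, data = bdat)$p.value        # compares means
    pWilcox <- function(bdat) wilcox.test(y ~ z, data = bdat)$p.value   # compares ranks

    ## Applied block by block, e.g. with a list of per-block data frames `blocks`:
    ## pvals <- sapply(blocks, pOneway)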

False Discovery Rate

This simulation of 100 hypothesis tests shows that the false discovery rate decreases when τ increases and/or when more blocks (hypotheses) have true effects.

Adjustment based methods for controlling error rate.

Some ways to control error rates: Holm-Bonferroni-style adjustment

Family-Wise Error Rate (FWER) control via Bonferroni-style adjustment:

1. Set α.

2. Test hypotheses and record p-values.

3. Order the p-values from smallest to largest.

4. Reject hypothesis i if pi ≤ α/(m − i + 1), where m is the number of hypotheses, stopping at the first hypothesis that cannot be rejected.
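A minimal sketch of this adjustment in R, assuming pvals is the vector of m block-level p-values (the step-down rule is built into base R's p.adjust):

    alpha <- 0.05
    p_holm <- p.adjust(pvals, method = "holm")
    which(p_holm <= alpha)  # blocks whose nulls are rejected with FWER control

    ## Equivalently, by hand: sort the p-values and compare the i-th smallest to
    ## alpha / (m - i + 1), stopping at the first comparison that fails.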

Simulations show the FWER with Holm-Bonferroni adjustments.

Some ways to control error rates: Benjamini-Hochberg FDR control

• Set α.
• Test hypotheses and record p-values.
• Order the p-values from smallest to largest.
• Reject hypothesis i if its p-value pi ≤ (i/m)α, where m is the number of hypotheses.
• Reject all hypotheses with p-values smaller than the largest such pi.
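The corresponding sketch in R for the Benjamini-Hochberg step-up rule, again assuming a hypothetical vector pvals of m block-level p-values:

    alpha <- 0.05
    p_bh <- p.adjust(pvals, method = "BH")
    which(p_bh <= alpha)  # blocks whose nulls are rejected with FDR control

    ## By hand: find the largest i with p_(i) <= (i/m) * alpha and reject the
    ## hypotheses with the i smallest p-values.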

Some ways to control error rates: Benjamini-Hochberg FDR control

Controlling False Discovery Rate (FDR)

The false discovery rates, before and after the Benjamini-Hochberg adjustment, from our simulations.

Some ways to control error rates: Structured hypothesis testing.

Our approach: (1) create a hierarchy of hypotheses based on a priori power (e.g. sample size of block); (2) test hypotheses on smaller and smaller subsets of blocks in a predetermined order, stopping testing when p > .05.

1. Test overall H0 : τi = 0. If p0 > α stop testing.

2. If p0 ≤ α, split the sample.

3. Test H0 in each set of blocks. Record the maximum p-value for the set so far, ps (the overall test versus the test in this split). If ps > α or the split contains only 1 block, stop.

4. Repeat steps 2 and 3 until stopping.

Why should this work? The overall test, p0, is the maximum-power test. If we cannot reject the overall null, we should not be able to reject it on any subset. (A rough code sketch of the procedure follows.)
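A rough sketch of the recursion in R. Here blocks is a named list of per-block data frames, test_blocks() returns a single p-value for H0: τi = 0 in a set of blocks, and split_fn() is one of the splitting rules on the next slide; all of these names are hypothetical stand-ins rather than the exact code behind our simulations.

    siu_test <- function(blocks, test_blocks, split_fn, alpha = 0.05, p_parent = 0) {
      ## p-value for this set of blocks, carried forward as the maximum along the path
      p <- max(test_blocks(blocks), p_parent)
      out <- list(list(blocks = names(blocks), p = p, rejected = p <= alpha))
      ## Stop descending if we cannot reject here or only one block remains.
      if (p > alpha || length(blocks) == 1) return(out)
      halves <- split_fn(blocks)  # a list of two subsets of blocks
      c(out, do.call(c, lapply(halves, siu_test, test_blocks = test_blocks,
                               split_fn = split_fn, alpha = alpha, p_parent = p)))
    }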

Splitting algorithms

How should we choose to split? We need the statistical power of each split to be less than that of the preceding splits.

Idea 1: Leave-one-out splitting. Quickly focus down on individual blocks. Algorithm: rank blocks in order of sample size (or harmonic mean weight, w). Split the data into (1) the block with the largest w and (2) the rest of the blocks.

Idea 2: Equal-power splitting. Ensure that each test has more or less the same statistical power (e.g. by sample size). Algorithm: rank blocks in order of sample size (or harmonic mean weight, w). Split the data into two groups of equal ∑b wb (equal sums of weights).
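Illustrative versions of the two splitting rules in R, defaulting to block sample size as the weight (harmonic mean weights could be substituted); each takes a named list of per-block data frames and returns a list of two subsets, matching the split_fn() used in the sketch above:

    ## Idea 1: leave-one-out -- the single largest block versus all the rest.
    split_leave_one_out <- function(blocks, w = sapply(blocks, nrow)) {
      ord <- order(w, decreasing = TRUE)
      list(blocks[ord[1]], blocks[ord[-1]])
    }

    ## Idea 2: equal power -- two groups with (roughly) equal total weight.
    split_equal_weight <- function(blocks, w = sapply(blocks, nrow)) {
      ord <- order(w, decreasing = TRUE)
      first <- cumsum(w[ord]) <= sum(w) / 2
      first[1] <- TRUE  # the largest block always anchors the first group
      list(blocks[ord[first]], blocks[ord[!first]])
    }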

SIU and FWER control

How well do the splitting algorithms control FWER?

Comparison of FWER control between splitting algorithms and the Holm-Bonferroni procedure. The average number of splits when all τi = 0 is 1.5.

SIU and power

How powerful are the splitting algorithms?

Comparison of power between splitting algorithms and the Benjamini-Hochberg procedure.

SIU and power

How powerful are the splitting algorithms?

Comparison of power between splitting algorithms and the Benjamini-Hochberg procedure, for 100 hypotheses (blocks).

Applying the procedures

In our fake data of 100 blocks and 12,500 units, we gave 51 blocks some treatment effect (between .5 and 1.4 SDs) and 49 blocks no effect.

• SIU found 32/51 correctly and made no false positive errors (indicating a true effect where none existed); the same with both splitting algorithms.
• FDR (Benjamini-Hochberg) found 32/51 correctly and made 1 false positive error.
• Holm found 27/51 correctly with no false positives.

Applying the procedures

We can now go back to our policy expert with a list of blocks (hypotheses):

• for further exploration of mechanisms (e.g. interviews, observation, databases, discussions, design sprints, ...)
• to help direct attention toward blocks where no effects could be detected
• to design studies powered for specific blocks.

Limitations and Advantages of Different Approaches

Advantages and Limitations

Limitations:

• Limitation of testing versus estimation: Each block stands alone. There is no (partial) pooling of information across blocks, and no model of differences between blocks to enhance precision and enable subgroup estimation (cf. Feller and Miratrix on multi-site trials).
• Limitation of the SIU approach versus FDR: The SIU approach requires a choice of splitting algorithm and a ranking of subgroups by a fixed quantity; there is little guidance on this compared to the large literature on FDR control.
• Limitation of the SIU approach versus FDR: SIU controls the FWER and so should be conservative in terms of the false discovery rate compared to FDR-controlling procedures. (We could probably get more power from the SIU procedure by targeting FDR control rather than FWER control.)

Advantages

Advantages and Limitations

• Advantage of testing over estimation of average treatment effects: Tests summarize information of direct relevance for policy planning. Flexibility with test statistics can increase power (e.g. means have low power when we have outliers, compared to ranks or robust means).
• Advantage of testing over estimation of average treatment effects: Randomization inference within small or skewed blocks ensures correct false positive rates for the individual tests.

Advantages of Sequential, Structured Testing:

• Advantage of SIU over FDR: Fewer tests are required (e.g. when the effect is zero everywhere, SIU often reports a single test rather than m tests; when the effect is not zero, it seems to require roughly 10% of the tests of the other procedures).
• Advantage of SIU over FDR: More power with the same FWER control.

References

Benjamini, Yoav and Yosef Hochberg (1995). "Controlling the false discovery rate: a practical and powerful approach to multiple testing". In: Journal of the Royal Statistical Society, Series B 57.1, pp. 289–300. URL: http://www.jstor.org/stable/2346101.

Bretz, Frank, Torsten Hothorn, and Peter Westfall (2011). Multiple Comparisons Using R. Boca Raton: CRC Press. ISBN: 9781584885740.
