Sequential Multiple Hypothesis Testing with Type I Error Control

Sequential Multiple Hypothesis Testing with Type I Error Control Alan Malek Sumeet Katariya MIT University of Wisconsin-Madison [email protected] [email protected] Yinlam Chow Mohammad Ghavamzadeh Stanford University Adobe Research [email protected] [email protected] Abstract type I error, the probability of falsely rejecting the null hypothesis, is bounded. A second goal is to minimize This work studies multiple hypothesis testing the probability type II error, which occurs when the in the setting when we obtain data sequen- statistician erroneously fails to reject the null (when tially and may choose when to stop sampling. the alternative is indeed true). HT is a fundamental We summarize the notion of a sequential p- problem in statistical inference that arises in numer- value (one that can be continually updated ous fields. Examples include online marketing [6] and and still maintain a type I error guarantee) pharmaceutical studies [20]. and provide several examples from the litera- HT has been traditionally studied in the fixed-horizon ture. This tool allows us to convert fixed- setting. The desired parameters such as type I error horizon step-up or step-down multiple hy- and power are used to calculate a lower bound on the pothesis testing procedures (which includes necessary sample size, the requisite number of samples Benjamini-Hochberg, Holm, and Bonferroni) are collected, and then a decision is made. In particular, into a sequential version that allows the statis- one must wait until the minimum sample size is reached tician to reject a hypothesis as soon as the or the type I error guarantee is forgone. While this sequential p-value reaches a threshold while approach provides rigorous statistical guarantees in maintaining type I error control. We show accuracy and robustness, it usually takes a long time to that if the original procedure has a type I error conclude a test and also its offline nature possesses huge guarantee in a certain family (including FDR limitations to emerging applications, such as digital and FWER), then the sequential conversion marketing, where data is streaming in a sequential inherits an analogous guarantee. The con- fashion and a data-efficient approach is crucial to allow version also allows for allocating samples in real-time adaption to user’s preferences. a data-dependent way, and we provide simu- lated experiments demonstrating an increased Multiple hypothesis testing (MHT) is a natural exten- number of rejections when compared to the sion of HT to control error across multiple hypotheses. fixed-horizon setting. While type I and type II errors are straightforward in single tests, there are many ways one could penalize incorrect decisions in MHT. The two most widely used 1 Introduction notions are family-wise error rate (FWER), which is the probability of at least one false rejection, and false discovery rate (FDR), which is the expected proportion of Hypothesis testing (HT) is a statistically rigorous pro- the rejections that are false. Generally, the statistician cedure that makes decision about the truth of a hy- would like to reject as many tests as possible while still pothesis is a way that allows the probability of error controlling the number of false positives. to be carefully controlled. Specifically, given a null hypothesis and an alternative hypothesis, the statistician Recent advances in the scale and rate of data collec- decides when to reject the null hypothesis such that the tion have increased the need for sequential hypothesis testing; in this framework, the data arrives sequentially Proceedings of the 20th International Conference on Artifi- and the statistician may make decisions after observing cial Intelligence and Statistics (AISTATS) 2017, Fort Laud- partial data, such as deciding when to stop collecting erdale, Florida, USA. JMLR: W&CP volume 54. Copyright data or from which populations to collect data from. 2017 by the author(s). Sequential Multiple Hypothesis Testing with Type I Error Control Large clinical trials and Hypothesis testing in social dynamic allocation property allows us to conclude more networks, mobile applications, and large internet sys- tests (than their static/uniform allocation scenario) at tems are examples where sequential HT is desirable. each time-step during the lifespan of the procedure. In contrast, in fixed-horizon setting all the data are Moreover, our framework is more flexible and allows collected a priori. control of FDR, FWER, and a host of other metrics, in contrast to only controlling FDR. Of course, such data-dependent decision introduce de- pendencies between future data and past data, and the statistician must be more careful designing proce- 1.2 Our Contributions dures if type I error control is desired. For example, We first describe a framework that fits most a common goal in sequential hypothesis testing is to fixed-horizon hypothesis test procedures, including observe the data sequence and stop at the statisticians all step-up procedures (e.g., Hochberg [16] and choosing. However, if the statistician were to follow the Benjamini-Hochberg [7]) and step-down procedures evolution of a statistic in the traditional fixed-horizon (e.g., Holm’s [17]) with an accompanying family of type HT and reject the null hypothesis as soon as it crosses I errors that includes FWER and FDR. By leveraging the fixed-horizon decision boundary, the type I error the powerful tool of sequential p-values, we propose would be greatly inflated [18] and many erroneous re- a method for converting any fixed-horizon MHT into jections would be made. We survey some alternative a sequential MHT and show that the fixed-horizon hypothesis testing procedures that provide a type I guarantees generalizes to the sequential setting in Sec- error guarantee in the sequential setting, but our work tion 5.2. Further, our sequential MHT can stop tests will use the existence of such sequential tests as a start- early in a data-dependent manner, and therefore stops ing point. Our goal will be to use sequential hypothesis easier hypotheses first and potentially allowing more tests to adapt multiple hypothesis testing procedures samples for difficult hypotheses. We discuss this dy- into sequential variants that can enjoy the benefits of namic allocation property and demonstrate that it does data-depending sampling and early stopping. lead to more rejected tests on synthetic data in Sec- tion 7. Additionally, Section 6 discusses controlling 1.1 Related Work type II errors and provides a sequential calculator for estimating a minimum sample size to control both FDR Sequential hypothesis testing has a long history [30, and false non-discovery rate (FNR) in sequential MHT. 20, 24], but its extension to multiple hypotheses and tests is more recent. Several recent papers study the 2 Hypothesis Testing mechanics of statistics that allow for uniform type I error guarantees for any stopping time [1, 15, 28, 29]. In hypothesis testing (HT), the statistician tests a Bartroff and co-authors have established several scenar- null hypothesis H0 against an alternative hypothesis H1. ios where sequential MHT has type I error in the cases The goal is to design a procedure that controls the of all simple hypotheses with FWER [2] and FDR con- type I error α, the probability of erroneously rejecting trol [3]. Similarly, De and Baron [10] proposed a sequen- H0, while simultaneously minimizing the type II error tial MHT algorithm that guarantees the overall type β, the probability of failing to reject H0 when it is I error. However since this algorithm is based on the false. A prototypical example is A/B testing with Bonferroni correction [9], the power of rejecting alterna- baseline option A and alternative option B; the null tive hypotheses has been empirically shown to be low. and alternative hypotheses are H0 : µA = µB and Fellouris and Tartakovsky [14] also studied the sequen- H1 : µA 6= µB, where µA and µB are the mean values tial testing problem of a simple null hypothesis vs. a (e.g., conversion rates or click through rates). composite alternative hypothesis, for which the decision Multiple hypothesis testing (MHT) is an extension of boundaries were derived using both mixture-based and HT to a finite number of tests m. Examples are a weighted-generalized likelihood ratio statistics. While base option A against a set of m alternative options they have shown asymptotic optimally of such a sequen- A ;:::;A , a single null hypothesis against a set of tial testing procedure, their proposed method requires 1 m alternative hypotheses [23, 5], and a test with a set of high-dimensional hyper-parameter tuning and cannot nested hypotheses [19]. When the test stops, the MHT be easily extended to multiple hypothesis testing. Ze- procedure rejects R tests, out of which V are rejected by hetmayer and collaborators [31, 32] looked at control- mistake, and does not reject m − R tests, out of which ling FDR when the p-values are modeled as Gaussian. W should have been rejected. Table 1 summarizes Similar to our results, the approach of [18] relies on these quantities. The decision of the MHT procedure sequential p-values. Our work extends their results to reject/not-reject a test is a random quantity, because by allowing each test to be stopped individually. This it depends on the test sample that is random. Thus, all Alan Malek, Sumeet Katariya, Yinlam Chow, and Mohammad Ghavamzadeh not-rejected rejected total The most common procedure for fixed-horizon multi- H0 true U V m0 ple hypothesis testing is to pick sample sizes, gather H0 false W S m − m0 the samples and compute the p-values of the m tests, total m − R R m then determine, as a whole, which null hypotheses to reject. The most common MHT procedure to control Table 1: MHT Error Quantities.

Sequential Multiple Hypothesis Testing with Type I Error Control

Sequential Analysis Tests and Confidence Intervals

Trial Sequential Analysis: Novel Approach for Meta-Analysis

Detecting Reinforcement Patterns in the Stream of Naturalistic Observations of Social Interactions

©2018 Oxford University Press

The Problem of Multiple Testing and Its Solutions for Genom-Wide Studies

Sequential Analysis Exercises

Simultaneous Confidence Intervals for Comparing Margins of Multivariate

Sequential and Multiple Testing for Dose-Response Analysis

P and Q Values in RNA Seq the Q-Value Is an Adjusted P-Value, Taking in to Account the False Discovery Rate (FDR)

Dataset Decay and the Problem of Sequential Analyses on Open Datasets

A Partitioning Approach for the Selection of the Best Treatment

Group-Sequential Analysis for Two Proportions