<<

Sequential Multiple Hypothesis Testing with Type I Error Control

Alan Malek Sumeet Katariya MIT University of Wisconsin-Madison [email protected] [email protected] Yinlam Chow Mohammad Ghavamzadeh Stanford University Adobe Research [email protected] [email protected]

Abstract type I error, the probability of falsely rejecting the null hypothesis, is bounded. A second goal is to minimize This work studies multiple hypothesis testing the probability type II error, which occurs when the in the setting when we obtain data sequen- statistician erroneously fails to reject the null (when tially and may choose when to stop . the alternative is indeed true). HT is a fundamental We summarize the notion of a sequential p- problem in that arises in numer- value (one that can be continually updated ous fields. Examples include online marketing [6] and and still maintain a type I error guarantee) pharmaceutical studies [20]. and provide several examples from the litera- HT has been traditionally studied in the fixed-horizon ture. This tool allows us to convert fixed- setting. The desired parameters such as type I error horizon step-up or step-down multiple hy- and power are used to calculate a lower bound on the pothesis testing procedures (which includes necessary sample size, the requisite number of samples Benjamini-Hochberg, Holm, and Bonferroni) are collected, and then a decision is made. In particular, into a sequential version that allows the statis- one must wait until the minimum sample size is reached tician to reject a hypothesis as soon as the or the type I error guarantee is forgone. While this sequential p-value reaches a threshold while approach provides rigorous statistical guarantees in maintaining type I error control. We show accuracy and robustness, it usually takes a long time to that if the original procedure has a type I error conclude a test and also its offline nature possesses huge guarantee in a certain family (including FDR limitations to emerging applications, such as digital and FWER), then the sequential conversion marketing, where data is streaming in a sequential inherits an analogous guarantee. The con- fashion and a data-efficient approach is crucial to allow version also allows for allocating samples in real-time adaption to user’s preferences. a data-dependent way, and we provide simu- lated demonstrating an increased Multiple hypothesis testing (MHT) is a natural exten- number of rejections when compared to the sion of HT to control error across multiple hypotheses. fixed-horizon setting. While type I and type II errors are straightforward in single tests, there are many ways one could penalize incorrect decisions in MHT. The two most widely used 1 Introduction notions are family-wise error rate (FWER), which is the probability of at least one false rejection, and false dis- covery rate (FDR), which is the expected proportion of Hypothesis testing (HT) is a statistically rigorous pro- the rejections that are false. Generally, the statistician cedure that makes decision about the truth of a hy- would like to reject as many tests as possible while still pothesis is a way that allows the probability of error controlling the number of false positives. to be carefully controlled. Specifically, given a null hy- pothesis and an alternative hypothesis, the statistician Recent advances in the scale and rate of data collec- decides when to reject the null hypothesis such that the tion have increased the need for sequential hypothesis testing; in this framework, the data arrives sequentially Proceedings of the 20th International Conference on Artifi- and the statistician may make decisions after observing cial Intelligence and (AISTATS) 2017, Fort Laud- partial data, such as deciding when to stop collecting erdale, Florida, USA. JMLR: W&CP volume 54. Copyright data or from which populations to collect data from. 2017 by the author(s). Sequential Multiple Hypothesis Testing with Type I Error Control

Large clinical trials and Hypothesis testing in social dynamic allocation property allows us to conclude more networks, mobile applications, and large internet sys- tests (than their static/uniform allocation scenario) at tems are examples where sequential HT is desirable. each time-step during the lifespan of the procedure. In , in fixed-horizon setting all the data are Moreover, our framework is more flexible and allows collected a priori. control of FDR, FWER, and a host of other metrics, in contrast to only controlling FDR. Of course, such data-dependent decision introduce de- pendencies between future data and past data, and the statistician must be more careful designing proce- 1.2 Our Contributions dures if type I error control is desired. For example, We first describe a framework that fits most a common goal in sequential hypothesis testing is to fixed-horizon hypothesis test procedures, including observe the data sequence and stop at the statisticians all step-up procedures (e.g., Hochberg [16] and choosing. However, if the statistician were to follow the Benjamini-Hochberg [7]) and step-down procedures evolution of a in the traditional fixed-horizon (e.g., Holm’s [17]) with an accompanying family of type HT and reject the null hypothesis as soon as it crosses I errors that includes FWER and FDR. By leveraging the fixed-horizon decision boundary, the type I error the powerful tool of sequential p-values, we propose would be greatly inflated [18] and many erroneous re- a method for converting any fixed-horizon MHT into jections would be made. We survey some alternative a sequential MHT and show that the fixed-horizon hypothesis testing procedures that provide a type I guarantees generalizes to the sequential setting in Sec- error guarantee in the sequential setting, but our work tion 5.2. Further, our sequential MHT can stop tests will use the existence of such sequential tests as a start- early in a data-dependent manner, and therefore stops ing point. Our goal will be to use sequential hypothesis easier hypotheses first and potentially allowing more tests to adapt multiple hypothesis testing procedures samples for difficult hypotheses. We discuss this dy- into sequential variants that can enjoy the benefits of namic allocation property and demonstrate that it does data-depending sampling and early stopping. lead to more rejected tests on synthetic data in Sec- tion 7. Additionally, Section 6 discusses controlling 1.1 Related Work type II errors and provides a sequential calculator for estimating a minimum sample size to control both FDR Sequential hypothesis testing has a long history [30, and false non-discovery rate (FNR) in sequential MHT. 20, 24], but its extension to multiple hypotheses and tests is more recent. Several recent papers study the 2 Hypothesis Testing mechanics of statistics that allow for uniform type I error guarantees for any stopping time [1, 15, 28, 29]. In hypothesis testing (HT), the statistician tests a Bartroff and co-authors have established several scenar- null hypothesis H0 against an alternative hypothesis H1. ios where sequential MHT has type I error in the cases The goal is to design a procedure that controls the of all simple hypotheses with FWER [2] and FDR con- type I error α, the probability of erroneously rejecting trol [3]. Similarly, De and Baron [10] proposed a sequen- H0, while simultaneously minimizing the type II error tial MHT algorithm that guarantees the overall type β, the probability of failing to reject H0 when it is I error. However since this algorithm is based on the false. A prototypical example is A/B testing with [9], the power of rejecting alterna- baseline option A and alternative option B; the null tive hypotheses has been empirically shown to be low. and alternative hypotheses are H0 : µA = µB and Fellouris and Tartakovsky [14] also studied the sequen- H1 : µA =6 µB, where µA and µB are the values tial testing problem of a simple null hypothesis vs. a (e.g., conversion rates or click through rates). composite alternative hypothesis, for which the decision Multiple hypothesis testing (MHT) is an extension of boundaries were derived using both mixture-based and HT to a finite number of tests m. Examples are a weighted-generalized likelihood ratio statistics. While base option A against a set of m alternative options they have shown asymptotic optimally of such a sequen- A ,...,A , a single null hypothesis against a set of tial testing procedure, their proposed method requires 1 m alternative hypotheses [23, 5], and a test with a set of high-dimensional hyper-parameter tuning and cannot nested hypotheses [19]. When the test stops, the MHT be easily extended to multiple hypothesis testing. Ze- procedure rejects R tests, out of which V are rejected by hetmayer and collaborators [31, 32] looked at control- mistake, and does not reject m − R tests, out of which ling FDR when the p-values are modeled as Gaussian. W should have been rejected. Table 1 summarizes Similar to our results, the approach of [18] relies on these quantities. The decision of the MHT procedure sequential p-values. Our work extends their results to reject/not-reject a test is a random quantity, because by allowing each test to be stopped individually. This it depends on the test sample that is random. Thus, all Alan Malek, Sumeet Katariya, Yinlam Chow, and Mohammad Ghavamzadeh

not-rejected rejected total The most common procedure for fixed-horizon multi- H0 true U V m0 ple hypothesis testing is to pick sample sizes, gather H0 false W S m − m0 the samples and compute the p-values of the m tests, total m − R R m then determine, as a whole, which null hypotheses to reject. The most common MHT procedure to control Table 1: MHT Error Quantities. FWER is Bonferroni [9], which controls FWER at the level q by applying the union bound and only rejecting the tests whose p-values are smaller than q/m. How- the elements in Table 1, except m and m0 are random variables. ever, Bonferroni procedure lacks power and is overly conservative, especially when the number of hypotheses Unlike HT, there is no single notion of type I error is large or the test statistics are highly correlated [22]. for multiple hypotheses. The two common are family- Popular alternatives to Bonferroni to control FWER wised error rate (FWER) and (FDR) are Holm’s step-down procedure [17] that has larger that are defined as1 power than Bonferroni with the same type I guaran-  V  tee, and Hochberg’s step-up procedure [16] that rejects FWER := P (V ≥ 1) and FDR := . (1) E R ∨ 1 even more tests than Holm’s with the same guarantees. FDR is typically controlled by Benjamini-Hochberg’s The goal is usually to control these metrics at the procedure [7], which is step-up. Formally, level of q. FWER, first introduced by Bonferroni [9], Definition 2.1 (MHT Procedures). A MHT procedure is the probability of making at least one false positive. P consisting of m tests is a mapping [0, 1]m → {0, 1}m As the number of tests grows, requiring FWER con- from the set of p-values of the individual tests to m trol quickly makes rejection very difficult. Therefore, reject/not-reject decisions. FDR [7], which controls the expected proportion of false discoveries, has become a popular alternative as Denote the ascending p-values of the tests by p(1) ≤ it allows the number of false positives to grow with the ... ≤ p(m). We say that P is a step-up procedure, if number of tests. This notion has been expanded to de- for some sequence of decision thresholds α1, . . . , αm, P pendent tests [8] and to variations such as γ-FDR [21]. rejects tests (1),..., (k∗) for

One can also ask for guarantees on the type II error ∗ (k) k = max{k : p ≤ αk}. (see e.g., [26, 27]) by defining Similarly, P is a step-down procedure, if for some  W  FWER II:=P (W ≥1) and FNR:=E . sequence of decision thresholds α1, . . . , αm, P rejects (m−R)∨1 tests (1),..., (k∗ − 1) for

We do not consider procedures with type II error ∗ (k) bounds in this paper, as they require strong assump- k = min{k : p > αk}. tions on the distributions of the statistics under the Finally, we will call P a monotonic test procedure, if alternative hypothesis. it is either step-up or step-down.

2.1 Fixed-Horizon Hypothesis Testing The step-up procedures Hochberg and Benjamini- Hochberg set their decision thresholds to α = q Fixed-horizon is the traditional approach to HT that is k m−k and α = kq , respectively, and the step-down proce- widely used in industry. In the fixed-horizon HT, the k m dure Holm sets its decision threshold to α = q . statistician calculates the minimum necessary sample k m−k size for some desired type I and type II errors (plus Note that q is the desired FWER level in Hochberg some information that depends on the hypotheses being and Holm procedures, and the desired FDR level in tested). The test should be continued until this horizon Benjamini-Hochberg. is reached. At this point, a typical fixed-horizon HT The following definition encapsulate a large set of MHT procedure first computes the test statistic from the type I metrics for different functions f. This unified observations and then uses it to calculate the p-value of formulation will help us with our in Section 5.2. the test, i.e., the probability under the null hypothesis Definition 2.2 (Error Guarantee). Let f : 2 → + of sampling a test statistic at least as extreme as the R R be a monotonically increasing function in its first ar- one that was observed. Finally, it rejects the null gument and the error level q > 0 be a positive real hypothesis if the p-value is less than or equal to the number. We say that a test procedure P has an (f, q) desired type I error, i.e. p ≤ α. error guarantee if 1We denote by ∨ the maximum operator, e.g., R ∨ 1 =   max{R, 1}. E f(V,S) ≤ q (2) Sequential Multiple Hypothesis Testing with Type I Error Control

for all values of m and m0, and all distributions of of the paper, we present a few examples in Section 4 true rejections S, assuming that the distributions of that include Wald’s sequential probability ratio test, the p-values corresponding to the m0 true null hypothe- and under a simple null hypothesis. ses pπ(1), . . . , pπ(m0) are marginally uniform on [0, 1], where π is any arbitrary permutation of 1, . . . , m0. Problem definition: We can now state the problem that we study. Suppose that the statistician has a If Eq. 2 holds only when pπ(1), . . . , pπ(m0) are i.i.d. uni- family of m simultaneous sequential hypothesis tests form, we will say that P has an (f, q) error guarantee and her goal is to establish a procedure for rejecting under independence (a weaker condition). tests while controlling the type I error across the whole k For the well-known MHT type I error metrics FWER family. For the hypothesis test k, we denote by H0 k and FDR, function f is equal to f(V,S) = 1[V > 0] the null hypothesis, H1 the alternative hypothesis, and k and f(V,S) = V , respectively. {pt } the sequential p-value. At each time t during (V +S) ∨ 1 the test, the statistician has access to the sequential 1 m p-values of all the m tests, {pt },..., {pt }, and can 3 Problem Formulation generate new samples and update p-values arbitrarily. Her goal is to select samples in order to maximize the Consider testing the null hypothesis H0 against the number of rejected null hypotheses without violating alternative hypothesis H1 with type I error α. Let the considered notion of type I error over the family of X ∈ X be the random variable of test samples and x tests (e.g., FWER or FDR). be a realization of this random variable. We start by defining the notion of sequential p-value, the crucial tool that we make use of throughout the paper. 4 Examples of Sequential P -value

Definition 3.1. A sequential p-value for testing the In the previous section, we defined sequential p-values null hypothesis H0 against the alternative hypothe- without arguing about their existence. This section sis H1, denoted by {pt}, is a sequence of mappings t describes several sequential p-values and provides the pt : X 7→ [0, 1], t ≥ 1 that satisfy the following two 2 reader with more references. We also prove some im- properties under the null hypothesis: portant properties.

1. (super-uniform) For any δ ∈ [0, 1] and any t ≥ 1, 4.1 Sequential Probability Ratio Tests we have   The sequential probability ratio tests (SPRT) [30] was P sup ps(X1,...,Xs) ≤ δ ≤ δ. (3) one of the first sequential hypothesis testing frameworks s≤t where only two hypotheses are taken into the account. In order to extend the aforementioned SPRT techniques 2. (non-increasing) For any fixed realization to address the sequential multiple hypothesis testing ∞ {xt}t≥1 ∈ X and any t ≥ 1, we have problem, Robbins and Siegmund [25] developed the M−ary sequential probability ratio test (MSPRT) pro- p (x , . . . , x ) ≥ p (x , . . . , x ). t 1 t t+1 1 t+1 cedure that leverages the notion of mixture posterior probabilities in the likelihood ratio test. In particu- Sequential p-value allows “peeking” into the test, i.e. lar, the stopping rule effectively picks one of the M rejecting the null hypothesis without violating the type hypotheses that has the largest , I error guarantee. If we reject H0 as soon as pt ≤ α, given all the previous samples observed. More recent the super-uniform property guarantees that the type I work on the extension of MSPRT can also be found error remains below α. Note that unlike the sequential in [5, 12, 11]. setting, peeking in fixed-horizon hypothesis testing can greatly increase the type I error. This is due to the fact Recall the infinite sequence of independent and identi- that in this class of tests, the p-value has been designed cally distributed (i.i.d.) random variables X1,X2,... to hold only at the test horizon and not uniformly with an underlying probability density function f. across the entire sample path. Also let Hj be the hypothesis that f = fj, for j ∈ {1,...,M}, such that fk 6= fj, almost surely for all Our framework is agnostic to the hypothesis test as long j 6= k. We further assume that the prior probabilities as we have access to a sequential p-value for the test. of the hypotheses are known, and let πj denote the There is an extensive literature on creating sequential of hypothesis Hj for each j. There- p-values for hypothesis tests. While this is not the focus fore, given a sequence of thresholds {A1,...,AM } that 2Note that a sequential p-value is a sequence of random characterizes the false discovery rate of each hypothesis, variables whose randomness comes from the samples X. the stopping time T for the MSPRT can be described Alan Malek, Sumeet Katariya, Yinlam Chow, and Mohammad Ghavamzadeh as the minimum sample n ≥ 1 such that testing (i.e. SPRT) procedures, finding a test super- martingale is still a non-trivial task. In order to have π Qn f (X ) 1 k i=1 k i > . a more natural construction of the sequential p−value PM Qn 1 + A in many of such tests, we will introduce the concept j=1 πj i=1 fj(Xi) k of Bayes factor and draw some important connections for some k ∈ {1,...,M}. Correspondingly, the final between Bayes factor, test super-martingale and the decision is defined as δ = Hk. sequential p−value. Consider an arbitrary probability space (Ω, F,P ), a non-negative measurable function For comparison, Let A and A > 0 be the specific 0 1 B :Ω → [0, ∞] is known as a Bayes factor for prob- parameters that characterize the type-I and type-II ability measure P if R (1/B)dP ≤ 1. A Bayes factor error guarantees of a sequential two hypothesis testing B is said to be precise if R (1/B)dP = 1. In order procedure, we have the SPRT [30] stopping time T to relate this definition of Bayes factor with the tra- defined as the minimum sample n ≥ 1 such that ditional definition of a likelihood ratio, we note first Qn that whenever Q is a probability measure on (Ω, F), i=1 f1(Xi) Ln = 6∈ [A0,A1]. the Radon-Nikodym derivative (or the likelihood ratio) Qn f (X ) i=1 0 i dQ/dP will satisfy (dQ/dP )dP ≤ 1, with equality if Q is absolutely continuous with respect to P . Therefore, Here the final decision is given by δ = H1 if Ln > by definition B = 1/(dQ/dP ) is a Bayes factor for P . A1, and it is given by δ = H0 if Ln < A0. It is straightforward to show that the MSPRT with M = 2 The Bayes factor B will be precise if Q is absolutely continuous with respect to P . In this case B will be a is identical to the SPRT with parameters π0A0/π1 and version of the Radon-Nikodym derivative dP/dQ. π0/(π1A1). This implies that for the SPRT, the prior probabilities can be incorporated into the parameters Equipped with the mathematical definition of a Bayes 0 A0 and B . factor, we now provide the following result that shows for a sequential hypothesis test (A/B test) with a simple 4.2 Test Martingales null-hypothesis, the Bayes factor defined under the null hypothesis is a test martingale. The construction of p–tests is an important problem but not the focus of this paper. We refer the reader to [29, Corollary 4.3. Consider any sequential hypothesis 15] for a more thorough treatment but will include a tests with an alternative hypothesis H1 and a simple null few examples below. We are particularly interested in hypothesis H0. Both the likelihood ratio under H0 : θ0 test super-martingales, which are nonnegative super- and Bayes factor under H0 : θ0 are test martingales. martingales X ≥ 0 for any t that satisfy [X ] ≤ 1, t E 0 The proof of this corollary can be found in Appendix A. and test martingales, which are martingales that are Combining this result with the aforementioned one nonnegative and satisfy [X ] = 1. We require test E 0 from Corollary 4.2, one concludes that the reciprocal martingales have initial value 1. A simple application of the supremum of Bayes factor (or likelihood ratio) of the optional stopping theorem yields the following over all time points is a sequential p−value. well known result, often called the maximal inequality. See Appendix A for a proof. 4.3 Empirical Approaches Lemma 4.1. For any test martingale Xt, we have 1 P (supt Xt > b) ≤ b . Besides using MSPRT or test martingales to construct decision boundaries for sequential hypothesis testing, This inequality shows that Xt can take the value ∞ one can also construct non-parametric methods for se- only with probability zero. Now we will connect the quential A/B testing [1]. By keeping track of a single notion of test super-martingale and sequential p-value scalar test statistic that is derived based on a zero- by showing that the inverse of a supremum of a test mean random walk process under the null hypothesis, super-martingale is indeed a sequential p−value. This this non-parametric procedure tests the null hypothesis statement is true when the supremum is taken over all whenever a new data point is processed, and once a time points, and its proof can be found in Appendix A. hypothesis is rejected it controls the type I error by Corollary 4.2. If the test-statistic Λt is a positive utilizing the classical probability result of the law of test super-martingale, then 1 is a sequential the iterated logarithm (LIL). Since this procedure is supn0≤t Λn0 p-value. sequential, by nature it only takes linear time and con- stant space to compute the decision at each step. Fur- While the above result shows that a sequential p−value thermore, by using the empirical Bernstein-LIL-based can be easily derived from a test super-martingale, analysis from the above paper, it has also been shown in many likelihood ratio based sequential hypothesis that this algorithm has the same power guarantee as Sequential Multiple Hypothesis Testing with Type I Error Control its non-sequential counterpart. Fixed-horizon The fixed-horizon multiple hypothe- sis test with horizon N corresponds to the sequential k k procedure with pt = pN (the p-values are constant), 5 Algorithms for Sequential Multiple m St = 1 (every test is sampled every round), and Hypothesis Testing T = N (the stopping time is deterministic).

Now that we have presented a framework for fixed- Our definition of sequential test procedure provides an horizon MHT and sequential p-values, we can describe easy extension of the (f, q) guarantee for fixed-horizon: our procedure for conducting MHT in the sequential Definition 5.2. Given a function f : R2 → R+ that setting. Here, we assume that, for each hypothesis k, is monotonically increasing in the first argument and the decision maker has access to a sequential p-value a real number q > 0, we say that a sequential test k k p1 , p2 ,.... In the fixed horizon MHT, the statistician procedure (St,T, P) has an (f, q) error guarantee if collects samples from each hypothesis, calculated the corresponding p-value, then applies a testing procedure; E[f(VT ,ST )] ≤ q (4) we would like to applying a rejection procedure before all the samples are collected with the hope that easier for all m0 and all distributions of true rejections S, k tests can be concluded early. The intuition is that, assuming that pt is a sequential p-values for every true k null hypothesis. because sequential p-values pt are valid for all t and non-decreasing, we may apply any monotonic rejection If E[f(VT ,ST )] ≤ q only holds when the sequential p- procedure iteratively without distorting which tests are values corresponding to the true null hypothesis are rejected. Hence, we need only sample from the tests independent, we will say that (St,T, P) has a error that have not been rejected. This section is devoted guarantee under independence (a weaker condition). to proving that this intuition does hold, and indeed the sequential procedure inherits the same (f, q) error 5.1 The Sequential Conversion guarantee as the fixed-horizon procedure. Specifically, we begin by defining the sequential analog We now have an idea of what a sequential test proce- of a test procedure and then show how sequential p- dure consists of and what type of error guarantee we values can be used to upgrade a fixed horizon procedure can hope to achieve. This section describes a method to with a (f, q)-guarantee into a sequential test procedure transform a fixed-horizon test procedure into a sequen- with an analogous guarantee. This result is the main tial test procedure, and the next section shows that this theorem of this paper and is presented in Section 5.2. sequential conversion method transforms fixed-horizon procedures with (f, q)-guarantees into sequential pro- Definition 5.1 (Sequential Test Procedure). Given m cedures with (f , q)-guarantees. 1 m T sequential p-values {pt , . . . , pt }t≥1, a sequential test 1 m procedure consists of, for every round t = 1, 2,..., Let Ft be the filtration generated by {pt , . . . , pt } and consider a test procedure P. By a slight abuse of notation, we say that i ∈ B, where B ∈ {0, 1}m is some • Samplers S : {pk} 7→ {0, 1}m to de- t s s≤t−1,1≤k≤m binary vector, if B = 1. The sequential conversion C cide which hypothesis tests to sample from during i of P is defined as follows. round t, Definition 5.3. Assume we have sequential p-values 1 m • A stopping time T with respect to the filtration pt , . . . , pt , and some stopping time T with respect to 1 m generated by {pt , . . . , pt } (i.e. T is measurable Ft with finite expectation. The sequential conversion 0 0 w.r.t. this filtration) that determines when to stop of the test procedure P, which we label C = (St,T, P ), sampling, and is defined to be • For round t, we sample S0 = P(p1 , . . . , pm ) • A mapping P : [0, 1]m×∞ 7→ {0, 1}m from histories t t−1 t−1 0 of sequential p-values to reject/not-reject decisions. • If k ∈ St, sample from the hypothesis and update k k the sequential p-value; otherwise, pt = pt−1. In words, the sequential test procedure (St,T, P) dic- • At the end of the test, we reject P0 = tates what hypotheses to sample from, when to stop, P(p1 , . . . , pm). and what hypotheses to reject once stopped. T T Further, the analogous quantities to U, V, and R in Fig- Intuitively, C applies the fixed-horizon procedure P ure 1 will be denoted as UT ,VT , and RT to emphasize at every round to the sequential p-values and does that these are the random variables evaluated when the not sample from the already rejected hypotheses. The testing procedure finishes at stopping time T . pseudocode for C is presented in Algorithm 1. Alan Malek, Sumeet Katariya, Yinlam Chow, and Mohammad Ghavamzadeh

k Without the use of sequential p-values, using C to to the p–tests at time T . Denoting VT = |{R = 1 : k ∈ k conduct a hypothesis test would greatly inflated type H0}| and ST = m0 − |{R = 1 : k ∈ H0}| ( the number I errors for the same reasons that peeking invalidates of false positives and true positives, respectively), we type I error guarantees in single hypothesis testing. have However, if P is of a certain form and we use sequential E[f(VT ,ST )] ≤ q. (5) p-values, we maintain error control in a strong sense: the (f, q) guarantee of P translates exactly into an Furthermore, if P has only an independent (f, q) guar- (f, q) guarantee of C. This results is precisely stated antee, then (5) holds when the p–tests are independent. and proven in the next section. We would like to Lemma 5.6. Under the same setting as Lemma 5.5, emphasize that the algorithm allows the user to stop consider C, the sequential test procedure defined the each test once they have reached a significance level sequential conversion in Definition 5.3, and C0, defined while still maintaining a type 1 error guarantee. to be the sequential test procedure that is identical except 1 m with T = N. Then, for every realization of pt , . . . , pt , Algorithm 1 Sequential Conversion C C and C0 reject the same tests. 1: Input: stopping time T , rejection procedure P, Type 1 error α Proof of Theorem 5.4. Consider C and C0 as in 1 m 2: Initialize p0, . . . , p0 to 1 Lemma 5.6; let VT and ST be the random variables 3: for t = 1, 2,..., do 0 0 corresponding to S(P) and VN , SN the the random 1 m 0 4: Set St = P(pt−1, . . . , pt−1) variables corresponding to S (P). 5: For each k∈ / S , draw a sample and update pk t t 0 0 k k From Lemma 5.5, we have that [f(V ,S )] ≤ q. On 6: For each k ∈ St−1, set p = p E N N t t−1 the other hand, Lemma 5.6 implies that the decision 7: if Stopping time T is reached or St = ∅ then 8: break of both procedures are exactly the same, and hence 0 0 9: end if VT = VN and ST = SN almost surely. Combining 10: end for these, we have E[f(VT ,ST )] ≤ q. 11: Return rejected tests ST ∨t 6 Sequential Calculator for FWER

5.2 Main Result Even for the sequential setting, it is desirable to have an an estimate of the number of samples required. Theorem 5.4. Let P be a monotonic test procedure While it is standard to estimate the effective horizon with a (f, q) guarantee. Then its sequential conver- in a fixed-horizon setting [13], effective horizon calcula- sion C of P given by Definition 5.3 also has an (f, q) tions for sequential tests are underdeveloped. There- guarantee. That is, fore, we propose a sequential calculator for the general MHT procedure. Consider the case when there are m [f(V ,S )] ≤ q. E T T tests with corresponding sequential p-values given by 1 m Furthermore, if P only has an independent (f, q) guar- pt , . . . , pt . Equipped with the following parameters: antee, then C only has an independent (f, q)-guarantee. 1) FWER α, and 2) FWER II β, the problem is to find a deterministic horizon N ∗ such that both FWER and Before the proof, we give two lemmas. First, we ar- FWER II are guaranteed, while similar performance gue that applying a test procedure P to sequential guarantees are seen in SPRT-based algorithms such p–values instead of offline p–values retains the same as [4] for simple hypotheses. error guarantee (at the potential cost of lower power). Second, we show that the sequential procedure with 6.1 Bonferroni Correction early stopping, as described in Algorithm 1, produces the same decisions as the procedure where all sequen- Recall from the case of an A/B test with statistic tial statistics are run to the same length. Putting these Λt, where the stopping time T is the random time at two facts together, we have that the early stopping which the gap first crosses the decision boundary, i.e., sequential procedure inherits the same error guarantee T = inft{Λt ≥ 1/α}. as the test procedure P. Proofs for both lemmas are We can use the Bonferroni correction to extend this in the appendix. property to multiple hypothesis testing and obtain a Lemma 5.5. Consider a monotonic rejection proce- stopping condition that guarantees a total FWER of k dure P with an (f, q) guarantee, sequential p–values α: Λt ≥ m/α, ∀k ∈ {1, . . . , m}. Here the term m/α 1 m pt , . . . , pt , and a stopping time T with respect to Ft. corresponds to the union bound with FWER guarantee, 1 m k Let R = P(pT , . . . , pT ), i.e. the results of applying P and T is the stopping time of test k. Sequential Multiple Hypothesis Testing with Type I Error Control

Let K be set of tests with a true alternative hypothesis. 800 We wish to find the smallest horizon N ∗ such that P(T k ≤ N ∗, ∀k ∈ K|K) ≥ 1 − β. Because the underly- ing hypothesis of each test is generally unknown, we 750 Sequential cannot solve for N ∗ exactly and must choose N ∗ to Classical hold for all possible values of K. By the union bound # rejected 700 for intersection of events, we have that 10 20 30 40 50 60 70 80 90 100 ! k  \ k P T ≤ t, ∀k ∈K|K = P {T ≤t}|K 0.015 0.4 k∈K 0.3 0.01 X k ≥ P(T ≤ t|K) − (|K| − 1) 0.2 FDR FNR k∈K 0.005 X 0.1 = (T k ≤ t|Hk) − (|K| − 1), P 1 0 0 k∈K 0 50 100 0 50 100 where last equality is due to the independence of tests. Figure 1: Experiments for sequential vs. fixed-horizon When t = Ne ∗, we have that MHT, plotted against thousands of samples. X k ∗ k P(T ≤ Ne |H1 ) − (|K| − 1) ≥ 1 − |K|β/m ≥ 1 − β, k∈K p-value in Corollary 4.2. The simulation results are presented in Fig. 1. The top plot shows the number of which by definition of N ∗ implies Ne ∗ ≥ N ∗. Uti- rejected tests per thousand sample points; the perfect lizing this property, one can approximate the effec- algorithm (with unlimited data) would reject all 800 tive horizon by upper bounding N ∗ as follows: Ne ∗ = true alternatives. We see that the sequential does lead ∗,k maxk∈{1,...,m} N such that to more rejections, as desired. The FDR and FNR are  β  plotted against budget in the bottom left and right N ∗,k = inf t : (T k ≤ t|Hk) ≥ 1 − . plots, respectively. The sequential algorithm has bet- P 1 m ter performance in both metrics: it makes fewer false positives and false negatives. 7 Experiments 8 Conclusion and Future Work One of the primary advantages of dynamic allocation is that we can stop easy tests early. Consider the scenario We introduced a unified framework for sequential multi- when we have a limit on the number of samples but ple hypothesis testing that includes most of the known can allocate them to arbitrary hypotheses. In the non- common error notions such as Family-Wised Error sequential framework, the budget is uniformly allocated Rate (FWER) and False Discovery Rate (FDR). Af- to each test, a p-value is computed, and the Benjamini- ter familiarizing the reader with sequential p-values Hochberg procedure is applied at the end. However, and providing several examples, we showed how the our sequential framework allows sampling to cease for MHT procedure can be successively applied to the se- a test that will certainly be rejected at the end of the quential p-values to create a sequential procedure that sequential procedure, a property we refer to as dynamic dynamically allocates samples, stops as soon as enough allocation. Thus, our procedure can use proportionally evidence is obtained, and retains the same false positive more samples on harder tests. control. We then discussed type II error and provided some experimental validation. We tested 1000 simple vs. composite hypotheses where the null is a zero-mean Gaussian and the alterna- There are many interesting directions. Type II errors tives are non-zero mean Gaussians. Of these tests, are still not well-understood beyond the simple vs. sim- 800 are true alternatives with uniformly chosen ple hypothesis case; what classes of hypotheses are from [−10, 10] and the other 200 are true nulls (with compatible with guarantees on type II error? We have zero mean). We compare the performance of our se- sacrificed performance for generality by using sequen- quential algorithm against the classical non-sequential tial p-values as our starting point; we would like to Benjamini-Hochberg procedure [7]. This entire proce- investigate when we can exploit the structure of spe- dure was repeated over 1000 independent instantiations cific sequential p-values to obtain tighter type I error and the results were averaged. The p-values for the non- guarantees. Lastly, our (f, q)-guarantee encompasses sequential test were calculated using the independent most known type I error criteria, but there likely exist two-sample t-test for groups with equal . The looser notions that are more amenable to providing sequential p-values were calculated using the sequential error guarantees in the sequential setting. Alan Malek, Sumeet Katariya, Yinlam Chow, and Mohammad Ghavamzadeh

References [16] Y. Hochberg. A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75(4):800– [1] Akshay Balsubramani and Aaditya Ramdas. Sequen- 802, 1988. tial nonparametric testing with the law of the iterated logarithm. In Proceedings of the Thirty-Second Confer- [17] S. Holm. A simple sequentially rejective multiple test ence on Uncertainty in Artificial Intelligence, UAI’16, procedure. Scandinavian journal of statistics, pages pages 42–51, Arlington, Virginia, United States, 2016. 65–70, 1979. AUAI Press. [18] R. Johari, L. Pekelis, and D. Walsh. Always valid inference: Bringing to A/B testing. [2] J. Bartroff and T. Lai. Multistage tests of multiple arXiv:1512.04922, 2015. hypotheses. Communications in Statistics-Theory and Methods, 39(8-9):1597–1607, 2010. [19] R. Kass and L. Wasserman. A reference Bayesian test for nested hypotheses and its relationship to the [3] J. Bartroff and J. Song. Sequential tests of multiple hy- Schwarz criterion. Journal of the American statistical potheses controlling false discovery and non-discovery association, 90(431):928–934, 1995. rates. arXiv:1311.3350, 2013. [20] T. Lai. Incorporating scientific, ethical and economic [4] J. Bartroff and J. Song. Sequential tests of multiple considerations into the design of clinical trials in hypotheses controlling type I and II family-wise error the pharmaceutical industry: A sequential approach. rates. Journal of statistical planning and inference, Communications in Statistics-Theory and Methods, 153:100–114, 2014. 13(19):2355–2368, 1984. [5] C. Baum and V. Veeravalli. A sequential procedure [21] E. Lehmann and J. Romano. Generalizations of the for multi-hypothesis testing. IEEE Transactions on familywise error rate. Annals of statistics, 33(3):1138– Information Theory, 40(6), 1994. 1154, 2005. [6] H. Baumgartner and C. Homburg. Applications of [22] S. Nakagawa. A farewell to Bonferroni: The problems structural equation modeling in marketing and con- of low statistical power and publication bias. Behav- sumer research: A review. International journal of ioral Ecology, 15(6):1044–1045, 2004. Research in Marketing, 13(2):139–161, 1996. [23] A. Novikov. Optimal sequential multiple hypothesis [7] Y. Benjamini and Y. Hochberg. Controlling the false tests. arXiv:0811.1297, 2008. discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Soci- [24] H Robbins and D Siegmund. The expected sample size ety. Series B (Methodological), pages 289–300, 1995. of some tests of power one. The Annals of Statistics, pages 415–436, 1974. [8] Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. [25] H. Robbins and D. Siegmund. A class of stopping rules Annals of statistics, pages 1165–1188, 2001. for testing parametric hypotheses. In Herbert Robbins Selected Papers, pages 293–297. Springer, 1985. [9] C. Bonferroni. Teoria statistica delle classi e calcolo delle probabilita. Libreria internazionale Seeber, 1936. [26] S. Sarkar. False discovery and false nondiscovery rates in single-step multiple testing procedures. The Annals [10] S. De and M. Baron. Step-up and step-down methods of Statistics, pages 394–415, 2006. for testing multiple hypotheses in sequential experi- [27] S. Sarkar. Step-up procedures controlling generalized ments. Journal of Statistical Planning and Inference, FWER and generalized FDR. The Annals of Statistics, 142(7):2059–2070, 2012. pages 2405–2420, 2007. [11] V. Dragalin, A. Tartakovsky, and V. Veeravalli. Mul- [28] D. Seugmund. Sequential Analysis: tests and confi- tihypothesis sequential probability ratio tests. ii. Ac- dence intervals. Springer Science & Business Media, curate asymptotic expansions for the expected sam- 1985. ple size. IEEE Transactions on Information Theory, 46(4):1366–1383, 2000. [29] G. Shafer, A. Shen, N. Vereshchagin, and V. Vovk. Test martingales, Bayes factors and p−values. Statistical [12] V. Draglia, A. Tartakovsky, and V. Veeravalli. Multi- Science, pages 84–101, 2011. hypothesis sequential probability ratio tests. I. Asymp- totic optimality. IEEE Transactions on Information [30] A. Wald. Sequential tests of statistical hypotheses. Theory, 45(7):2448–2461, 1999. The annals of mathematical statistics, 16(2):117–186, 1945. [13] W. Dupont and W. Plummer. Power and sample size calculations: A review and computer program. [31] S. Zehetmayer, P. Bauer, and M. Posch. Two-stage Controlled clinical trials, 11(2):116–128, 1990. designs for experiments with a large number of hy- potheses. , 21(19):3771–3777, 2005. [14] G. Fellouris and A. Tartakovsky. Almost optimal se- quential tests of discrete composite hypotheses. arXiv [32] S. Zehetmayer, P. Bauer, and M. Posch. Optimized preprint arXiv:1204.5291, 2012. multi-stage designs controlling the false discovery or the family-wise error rate. Statistics in medicine, [15] Peter Grünwald. Safe probability. arXiv preprint 27(21):4145–4160, 2008. arXiv:1604.01785, 2016.