Multiple conditional randomization tests∗

Yao Zhang1 and Qingyuan Zhao†2

¹Department of Applied Mathematics and Theoretical Physics, University of Cambridge
²Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics, University of Cambridge

June 30, 2021

Abstract

We propose a general framework for (multiple) conditional randomization tests that incorporate several important ideas in the recent literature. We establish a general sufficient condition on the construction of multiple conditional randomization tests under which their p-values are “independent”, in the sense that their joint distribution stochastically dominates the product of uniform distributions under the null. Conceptually, we argue that randomization inference should be understood as the mode of inference precisely based on randomization. We show that under a change of perspective, many existing statistical methods, including permutation tests for (conditional) independence and conformal prediction, are special cases of the general conditional randomization test. The versatility of our framework is further illustrated with an example concerning lagged treatment effects in stepped-wedge randomized trials.

1 Introduction

Randomization is one of the oldest and most important topics in statistics and the design of experiments (Fisher 1926; Fisher 1935). Randomization tests were proposed by Fisher (1935, Section 21) as a substitute for the t-test when normality does not hold and to restore randomization as “the physical basis of the validity of the test”. This idea was immediately extended by Pitman (1937), Welch (1937), Wilcoxon (1945), and Kempthorne (1952), among many others. Randomization tests are appealing because they are exact and do not rely on distributional

∗We thank Stephen Bates, Jing Lei, and Paul Rosenbaum for their comments on an earlier version of this manuscript.
†[email protected]

assumptions. For this reason, they are often advocated together with rank-based statistics under the name of nonparametric tests (Lehmann 1975; Hájek, Šidák, and Sen 1999). This has led to a widely held belief that randomization tests are synonymous with permutation tests, a term that emphasizes the algorithm rather than the basis of inference. Unfortunately, the role of physical randomization has also become obscure in statistics education and practice. For example, at the time of writing, randomization tests are described on the Wikipedia page for “Resampling (statistics)” alongside the bootstrap, subsampling, and cross-validation procedures. As another example, the Cambridge Dictionary of Statistics (Everitt and Skrondal 2002) defines randomization tests as “procedures for determining statistical significance directly from data without recourse to some particular sampling distribution”, without referring to the physical act of randomization at all.

More recently, there has been a rejuvenated interest in randomization tests in several areas of statistics, including testing associations in genomics (Efron et al. 2001; Bates et al. 2020), testing conditional independence (Candès et al. 2018; Berrett et al. 2020), conformal inference for machine learning methods (Vovk, Gammerman, and Shafer 2005; Lei, Robins, and Wasserman 2013; Tibshirani et al. 2019), analysis of complex experimental designs (Morgan and Rubin 2012; Ji et al. 2017), evidence factors for observational studies (Rosenbaum 2010; Rosenbaum 2017; Karmakar, French, and Small 2019), and causal inference with interference (Athey, Eckles, and Imbens 2018; Basse, Feller, and Toulis 2019). Because randomization tests are distribution-free, they offer an easy way out of the difficulties (or even impossibilities) in theoretically deriving the sampling distribution, a frequent task in modern statistical applications.

One of the main goals of this article is to provide a unified framework for (conditional) randomization tests and try to push some existing ideas to their natural boundaries. This framework should subsume several general ideas and concepts that have appeared in the literature:

(i) Explicit conditioning on the counterfactual or potential outcomes of the experiment (Rubin 1980; Rosenbaum 1984);

(ii) Algebraic structure of permutation tests (Lehmann and Romano 2006; Southworth, Kim, and Owen 2009; Rosenbaum 2017);

(iii) Randomization model versus population model (Lehmann 1975; Ernst 2004);

(iv) Post-experiment conditioning and randomization (Hennessy et al. 2016; Basse, Feller, and Toulis 2019; Bates et al. 2020);

(v) Using exchangeability to obtain distribution-free predictive intervals (Vovk, Gammerman, and Shafer 2005).

Some of these ideas were introduced recently, but many were re-introduced from the earlier literature. For example, counterfactual outcomes were used in one of the earliest papers on randomization tests by Welch (1937). Conditioning is not a new technique in statistical

inference (e.g., Fisher (1925)’s exact test for 2 × 2 tables), nor are randomized tests (e.g., the Neyman–Pearson lemma for discrete distributions, where the decision at the critical value needs to be randomized to make the test exact). However, we believe that these ideas have not been fully explored in the context of randomization inference, perhaps partly because they originated from very different applications.

Our main thesis is that randomization inference should be precisely understood as what its name suggests: it is a mode of statistical inference that is based on randomization and nothing more than randomization. To make the nature of randomization clear, we argue that it is helpful to consider counterfactual versions of the data, even if they cannot be immediately conceptualized for the problem at hand. The introduction of potential/counterfactual outcomes allows us to trichotomize the randomness in data as

(i) Randomness in nature that is involved in all the counterfactual variables;

(ii) Randomness that is introduced by the experimenter through physical acts (e.g., drawing balls from an urn or using a pseudo-random number generator on a computer);

(iii) Randomness that is optionally introduced by the analyst.

Using this trichotomy, a randomization test can be understood as a null hypothesis significance test that conditions on the potential outcomes and obtains the sampling distribution (often called the randomization distribution) by random variations from the second and third sources. In other words, a randomization test is based solely on the randomness introduced by humans (experimenters and/or analysts) and thus provides a coherent logic of scientific induction as envisioned by Fisher (1956).

The above trichotomy is hardly new, nor is the definition of randomization distribution. As mentioned above, conditioning on the potential outcomes is almost always implicit. The difference between randomization before and after the experiment has also been well recognized (see e.g., Basu 1980). However, these conceptual and methodological ideas often only appear as neat tricks that solve some specific problems or witty points in a philosophical debate about statistics. We argue that if these ideas are put together in a single rigorous framework, a great deal can be learned: the structure of randomization tests becomes clearer, one can understand the strengths and limitations of randomization tests much better, and some confusions and misunderstandings in the literature can be settled.

As an example, a common belief is that randomization tests rely on certain kinds of group structure or exchangeability (Lehmann and Romano 2006; Southworth, Kim, and Owen 2009; Rosenbaum 2017), which is perhaps why some texts treat randomization tests and permutation tests as synonyms. We will show that such algebraic structures are not necessary. This point becomes almost immediately clear once randomization inference is understood as the mode of inference based on randomization (our main thesis). That being said, in conditional randomization tests some invariance properties are needed to ensure that the conditioning is well defined. This point can be easily overlooked from an algorithmic point of view and has led to mistakes.

Another frequent point of confusion is that randomization tests are applicable in two different models, one involving physical randomization and one involving exchangeability. The former is usually referred to as the randomization model and the latter as the population model (Lehmann 1975; Ernst 2004). Although such a classification may be useful for pedagogic purposes, we argue that these two models are fundamentally “two sides of the same coin”. A randomization test not only tests the null hypothesis (a relation between the potential outcomes), but also tests the assumption that the treatment is independent of the potential outcomes (Rubin 1980). The two models are thus unified: the first hypothesis is satisfied by definition in the population model (so independence is tested), while the second hypothesis is automatically satisfied by the physical act of randomization in the randomization model (so the null hypothesis on the potential outcomes is tested).

Our main theoretical contribution is a very general sufficient condition in Theorem 2 below that ensures the p-values from multiple conditional randomization tests are “independent”, in the sense that their joint distribution stochastically dominates the multivariate uniform distribution under the null. This is important because not only is the type I error of each test under control, but the individual tests can also be combined by standard methods, such as Fisher’s combination method, to test the global null. Our sufficient condition generalizes the knit product structure described by Rosenbaum (2017), which essentially requires that the conditional randomization tests be constructed in a sequential manner. Our condition places fewer constraints on the construction of randomization tests and allows common techniques like sample splitting.

We briefly introduce some notation used in this article. We denote the set of natural numbers by N and the set of real numbers by R. We use calligraphic letters for other sets, boldface letters for vectors, upper-case letters for random quantities, and lower-case letters for fixed quantities. We use 1_l to denote a vector of ones of length l and 0_l a vector of zeros of length l; the subscript l is often omitted if the vector’s dimension is clear from the context. We use a single integer in square brackets as shorthand for the indexing set from 1: [N] = {1, . . . , N}. We use a set-valued subscript to denote a sub-vector; for example, Y_{1,3,5} = (Y_1, Y_3, Y_5).

The rest of this paper is structured as follows. Section 2 builds a general framework for constructing a conditional randomization test (CRT) by pooling ideas scattered in the literature. Section 3 introduces and proves the main theorem of this paper, which gives a sufficient condition for “independent” CRTs. Because the theory of CRTs in Sections 2 and 3 is quite abstract, some simple practical methods and techniques are provided in Sections 2.5 and 3.3. Section 4 illustrates our theory by discussing a variety of related ideas in the literature. Section 5 further demonstrates the versatility of our theory through a new example concerning lagged treatment effects in stepped-wedge randomized trials. The article is concluded with some further discussion in Section 6.

2 A single conditional randomization test

We start with a general construction of randomization tests for the existence of a treatment effect in randomized experiments. This requires us to adopt a causal inference perspective and use the language of potential outcomes (Neyman 1923; Rubin 1974). Many randomization tests in the literature (such as tests of independence and conformal prediction) do not explicitly involve potential outcomes, but we argue in Section 4 below that they can be viewed as special cases of our general construction.

2.1 Potential outcomes framework

Consider an experiment on N units in which a treatment variable Z ∈ Z is randomized. We use boldface Z to emphasize that the treatment Z is usually multivariate. Most experiments assume that Z = (Z_1, . . . , Z_N) collects a common attribute of the experimental units (e.g., whether a drug is administered). However, the dimension and nature of the treatment variable Z is not very important.¹ All that is required by the general theory below is that

(i) Z is randomized in an exogenous way by the experimenter (e.g., by tossing coins or using a random number generator);

(ii) The distribution of Z is known (this is often called the treatment assignment mechanism);

(iii) One can reasonably define or conceptualize the potential outcomes of the experimental units under different treatment assignments.

To formalize these requirements, we adopt the potential outcomes (also called the Neyman–Rubin or counterfactual) framework for causal inference (Neyman 1923; Rubin 1974; Imbens and Rubin 2015). In this framework, unit i has a vector of real-valued potential outcomes (or counterfactual outcomes) (Y_i(z) | z ∈ Z). We assume the observed outcome (or factual outcome) for unit i is given by Y_i = Y_i(Z), where Z is the realized treatment assignment. This is often referred to as the consistency assumption in the causal inference literature. When the treatment Z = (Z_1, . . . , Z_N) is an N-vector, it is often reasonable to assume that there is no interference, in the sense that Y_i(z) only depends on z through z_i.² However, our theory does not rely on this assumption; rather, we treat no interference as part of the sharp null hypothesis introduced in Section 2.2 below.

¹For example, consider an experiment where we randomize how the units interact. More specifically, suppose the units only interact with their neighbors in an undirected network and we randomly choose which pairs of units are connected. The theory in this article can be used for such an experiment, but the dimension of the treatment Z is N(N − 1)/2 rather than N.
²The well-known stable unit treatment value assumption (SUTVA) assumes both no interference and consistency (Rubin 1980).

It is convenient to introduce some vector notation for the potential and realized outcomes. Let Y(z) = (Y_1(z), . . . , Y_N(z)) ∈ Y ⊆ R^N and Y = (Y_1, . . . , Y_N) ∈ Y. Furthermore, let W = (Y(z) : z ∈ Z) ∈ W collect all the potential outcomes (which are random variables defined on the same probability space as Z). We will call W the potential outcomes schedule, following the terminology in Freedman (2009).³ This is also known as the science table in the literature (Rubin 2005). It may be helpful to view the potential outcomes Y(·) as a (vector-valued) function that maps Z to Y; in this sense, W consists of all such functions from Z to Y.

Throughout this article, we assume that the experiment is randomized in the following sense.

Assumption 1 (Randomized experiment). Z ⊥⊥ W and the density function π(·) of Z (with respect to some reference measure on Z) is known and positive everywhere.

We write the conditional distribution of Z given W in Assumption 1 as Z | W ∼ π(·). This assumption formalizes the requirement that Z is randomized in an exogenous way. Intuitively, the potential outcomes schedule W is determined by the nature of the experimental units. Since Z is randomized by the experimenter, it is reasonable to assume that Z ⊥⊥ W.⁴

In many experiments, the treatment is randomized according to some other observed covariates X (e.g., characteristics of the units or some observed network structure on the units). This can be dealt with by assuming Z ⊥⊥ W | X. Notice that in this case the treatment assignment mechanism π may depend on X. To simplify the exposition, unless otherwise mentioned we will simply treat X as fixed, so Z ⊥⊥ W is still true (in the conditional probability space with X fixed at the observed value).

Our theory allows an arbitrary support Z for the treatment variable, but we will start with a finite Z (e.g., Z = {0, 1}^N) to be consistent with most of the existing causal inference literature (Rosenbaum 1984; Athey, Eckles, and Imbens 2018). In the discrete case, we always use the counting measure on Z as the reference measure. This is convenient because all subsets of Z are then measurable. Section 2.4 extends the theory to an arbitrary Z.

2.2 Partially sharp null hypothesis

Causal inference is fundamentally difficult because, under the consistency assumption, only one potential outcome is observed for each unit (Holland 1986). To overcome this problem, additional assumptions on the potential outcomes schedule beyond randomization (Assumption 1) are needed. In randomization inference, the additional assumptions are

³Freedman (2009) actually called this the response schedule.
⁴From the perspective of measure-theoretic probability theory, we can view Z and W as random variables on two different sample spaces. The former is determined by the experimenter while the latter is generated by nature. We then consider the product space equipped with the product probability measure P, so Z ⊥⊥ W under P by definition.

(partially) sharp null hypotheses that connect different potential outcomes.⁵

A typical (partially) sharp null hypothesis assumes that certain potential outcomes are equal or related in certain ways. As a concrete example, the no-interference assumption assumes that Y_i(z) = Y_i(z^∗) whenever z_i = z_i^∗. This allows one to simplify the notation Y_i(z) to Y_i(z_i). The no-treatment-effect hypothesis (often referred to as Fisher’s sharp or exact null hypothesis) further assumes that Y_i(z_i) = Y_i(z_i^∗) for all i and z_i, z_i^∗. Suppose each unit’s treatment is binary (i.e., Z_i is either 0 or 1). Under the no-interference assumption, we may also consider the null hypothesis that the treatment effect is equal to a constant τ, that is, H_0 : Y_i(1) − Y_i(0) = τ for all i. Under the consistency assumption, this allows us to impute the potential outcomes as Y_i(z_i) = Y_i + (z_i − Z_i)τ for z_i = 0, 1.

More abstractly, a (partially) sharp null hypothesis H defines a number of relationships between the potential outcomes. Each relationship allows us to impute some of the potential outcomes if another potential outcome is observed (through consistency). We can summarize these relationships using a set-valued mapping:

Definition 1. A partially sharp null hypothesis H defines an imputability mapping

H : Z × Z → 2^[N],   (z, z^∗) ↦ H(z, z^∗),

where H(z, z^∗) is the largest subset of [N] such that Y_{H(z,z^∗)}(z^∗) is imputable from Y(z) under H.

If we assume no interference and no treatment effect, all the potential outcomes are observed or imputable regardless of the realized treatment assignment z, so H(z, z^∗) = [N]. In this case, we call H a fully sharp null hypothesis. In more sophisticated problems, H(z, z^∗) may depend on z and z^∗ in a nontrivial way, and we call such a hypothesis partially sharp. The concept of imputability has appeared before, though the property was tied to a test statistic instead of a null hypothesis (Basse, Feller, and Toulis 2019; Puelz et al. 2019); see Definition 4 below.

2.3 Conditional randomization tests for discrete treatments

To test a (partially) sharp null hypothesis, a randomization test compares an observed test statistic with its randomization distribution, which is given by the value of the statistic under other treatment assignments. However, it may be impossible to compute the entire randomization distribution when some potential outcomes are not imputable (i.e., H(z, z^∗) is smaller than [N]). To tackle this issue, we confine ourselves to a smaller set of treatment assignments. This is formalized in the next definition.

⁵In super-population inference, the fundamental problem of causal inference is usually dealt with by additional distributional assumptions, such as assuming the potential outcomes of different units are independent and identically distributed.

Definition 2. A conditional randomization test (CRT) for a discrete treatment Z is defined by

(i) A partition R = {S_m}_{m=1}^M of Z, i.e., S_1, . . . , S_M are disjoint subsets of Z satisfying Z = ∪_{m=1}^M S_m; and

(ii) A collection of test statistics (T_m(·, ·))_{m=1}^M, where T_m : Z × W → R is a real-valued function that computes a test statistic for each realization of the treatment assignment Z given the potential outcomes schedule W.

Any partition R defines an equivalence relation ≡_R (and vice versa), so S_1, . . . , S_M are simply the equivalence classes generated by ≡_R. With an abuse of notation, we let S_z ∈ R denote the equivalence class containing z. For any z ∈ S_m, we thus have S_z = S_m and T_z(·, ·) = T_m(·, ·). This notation is convenient because the p-value of the CRT defined below conditions on Z^∗ ∈ S_z when we observe Z = z. The following property follows immediately from the fact that ≡_R is an equivalence relation:

Lemma 1 (Invariance of conditioning sets and test statistics). For any z ∈ Z and z^∗ ∈ S_z, we have z ∈ S_z, S_{z^∗} = S_z, and T_{z^∗}(·, ·) = T_z(·, ·).

Definition 3. The p-value of the CRT in Definition 2 is given by

P(Z, W) = P^∗{T_Z(Z^∗, W) ≤ T_Z(Z, W) | Z^∗ ∈ S_Z, W},   (1)

where Z^∗ is an independent copy of Z conditional on W.

Because Z ⊥⊥ W (Assumption 1), the independent copy satisfies Z^∗ ∼ π(·) and is independent of Z and W. In (1), we use the notation P^∗ to emphasize that the probability is taken over the randomness of Z^∗.

The invariance property in Lemma 1 is important because it ensures that, when computing the p-value, the same conditioning set is used for all treatment assignments within it. By using the equivalence relation ≡_R defined by the partition R, we can rewrite (1) as

P(Z, W) = P^∗{T_Z(Z^∗, W) ≤ T_Z(Z, W) | Z^∗ ≡_R Z, W}.

When S_z = Z for all z ∈ Z (so z ≡_R z^∗ for all z, z^∗ ∈ Z), this reduces to an unconditional randomization test.
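To make this concrete, here is a minimal computational sketch of Definition 3 (the toy setup and all names are ours, not part of the formal development). It enumerates a small treatment space and computes the p-value (1) exactly under a fully sharp null, for which the needed part of W is known once Y is observed:

    from itertools import product

    # Toy example: N = 4 units, Z uniform on {0,1}^4, and the trivial
    # partition S_z = Z, so this CRT is unconditional; a nontrivial
    # partition would only change `same_cell`. Under the fully sharp null,
    # Y(z) = y_obs for every z, so the p-value is computable from (Z, Y).
    N = 4
    Z_space = list(product([0, 1], repeat=N))
    pi = {z: 1 / len(Z_space) for z in Z_space}  # known assignment density

    def same_cell(z, z_star):
        return True  # trivial equivalence relation: a single cell S_z = Z

    def test_statistic(z, y):
        treated = [yi for zi, yi in zip(z, y) if zi == 1]
        control = [yi for zi, yi in zip(z, y) if zi == 0]
        if not treated or not control:
            return 0.0
        return sum(treated) / len(treated) - sum(control) / len(control)

    def crt_pvalue(z_obs, y_obs):
        cell = [z for z in Z_space if same_cell(z_obs, z)]
        total = sum(pi[z] for z in cell)
        t_obs = test_statistic(z_obs, y_obs)
        # Definition 3 uses "<=": small values of the statistic are extreme
        mass = sum(pi[z] for z in cell if test_statistic(z, y_obs) <= t_obs)
        return mass / total

    print(crt_pvalue((1, 1, 0, 0), (3.2, 2.8, 1.1, 0.9)))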

Notice that T_Z(Z^∗, W) generally depends on some unobserved potential outcomes in W. Thus (1) may not be computable if the null hypothesis does not place enough restrictions on W. Using the imputability mapping H(z, z^∗) in Definition 1, this is formalized in the next definition.

Definition 4. Consider a CRT defined by the partition R = {S_m}_{m=1}^M and test statistics (T_m(·, ·))_{m=1}^M. We say the test statistic T_z(·, ·) is imputable under a partially sharp null hypothesis H if, for all z^∗ ∈ S_z, T_z(z^∗, W) only depends on the potential outcomes schedule W = (Y(z) : z ∈ Z) through its imputable part Y_{H(z,z^∗)}(z^∗).

The proof of the next result can be found in Appendix A.1.

Lemma 2. Suppose Assumption 1 is satisfied and T_z(·, ·) is imputable for all z ∈ Z. Then the p-value P(Z, W) only depends on Z and Y.

Definition 5. Under the assumptions of Lemma 2, we say the p-value is computable under H and denote it by P(Z, Y) with an abuse of notation.

Given a computable p-value, the CRT rejects the null hypothesis H at significance level α ∈ [0, 1] if P(Z, Y) ≤ α. The next theorem establishes the validity of this test.

Theorem 1. Consider a CRT defined by the partition R = {S_m}_{m=1}^M and test statistics (T_m(·, ·))_{m=1}^M. Then the p-value P(Z, W) stochastically dominates the uniform distribution on [0, 1] in the following sense:

P{P(Z, W) ≤ α | Z ∈ S_z, W} ≤ α,   ∀α ∈ [0, 1], z ∈ Z.   (2)

In consequence, given Assumption 1 and a partially sharp null hypothesis H, if P(Z, W) is computable, then

P{P(Z, Y) ≤ α} ≤ α,   ∀α ∈ [0, 1].   (3)

Proof. We first write the p-value (1) as a probability integral transform. For any fixed z ∈ Z, let F_z(·; W) denote the distribution function of T_z(Z, W) given W and Z ∈ S_z. Given Z ∈ S_z (so S_Z = S_z and T_Z = T_z by Lemma 1), the p-value can be written as

P(Z, W) = F_z(T_z(Z, W); W).

To prove (2), we can simply use the following probabilistic result: let T be a random variable and F(t) = P(T ≤ t) its distribution function; then P(F(T) ≤ α) ≤ α for all 0 ≤ α ≤ 1. See Appendix A.2 for a proof.

If the p-value is computable, we have P(Z, W) = P(Z, Y) by Lemma 2. By the law of total probability, for any α ∈ [0, 1],

P{P(Z, Y) ≤ α | W} = Σ_{m=1}^M P{P(Z, Y) ≤ α | Z ∈ S_m, W} P(Z ∈ S_m | W) ≤ α Σ_{m=1}^M P(Z ∈ S_m | W) = α.

Marginalizing over the potential outcomes schedule W , we obtain (3).

2.4 Extension to continuous treatments

Although most randomized experiments only involve discrete treatment variables, the results in Section 2.3 can be extended to the case of continuous treatments using measure-theoretic

probability theory. This allows us not only to consider experiments involving continuous dosage, but also to recast permutation tests as CRTs later in Section 4.1.

Suppose the treatment assignment Z and potential outcomes schedule W are defined on a probability space (Ω, F, P); that is, suppose (Z, W) is a measurable function from the sample space Ω to the product space Z × W. As in the discrete setting, we assume Z ⊥⊥ W (Assumption 1). We modify Definition 2 to allow the CRT to be defined by a countable partition R = {S_m}_{m=1}^∞ of Z, where S_1, S_2, . . . are measurable subsets, and by a countable sequence of measurable test statistics (T_m(·, ·))_{m=1}^∞. We require ∫_{S_m} π(z) dz > 0 for all m to avoid conditioning on zero-probability events, although this (and the countability of the partition) can be relaxed if suitable conditional density functions can be defined on zero-probability events.

As in the discrete setting, we abuse the notation and denote S_z = S_m and T_z(·, ·) = T_m(·, ·) for all z ∈ S_m. Similarly, the p-value P(Z, W) is still given by (1). This is well defined, and P(Z, W) is indeed measurable because the test statistics (T_m(·, ·))_{m=1}^∞ are measurable.

A benefit of this measure-theoretic formulation is that it provides a more concise way of stating (2). Let G be the σ-algebra generated by the conditioning events in (1):

G = σ({Z ∈ S_m}_{m=1}^∞).

Because {S_m}_{m=1}^∞ is a partition of Z, G consists of all countable unions of the sets S_m. This allows us to rewrite (2) as

P(P(Z, W) ≤ α | G, W) ≤ α,   ∀α ∈ [0, 1].   (4)

The conditional probability on the left-hand side of (4) is a random variable (a function from Z × W to [0, 1]) and is well defined by the Radon–Nikodym theorem, as P(Z, W) is measurable. In fact, because G is generated by a countable partition, we may write

P(P(Z, W) ≤ α | G, W) = Σ_{m=1}^∞ 1{Z ∈ S_m} P(P(Z, W) ≤ α | Z ∈ S_m, W).

This measure-theoretic formulation not only is more general but also allows us to state the assumptions for multiple conditional randomization tests in Section 3 more easily. We will adopt this formulation in what follows.

2.5 Practical methods

The theory above is quite abstract and does not offer any guidance on how to construct computable and powerful tests. Next, we summarize some practical techniques that have been proposed in the literature.

2.5.1 Conditioning on a function of the treatment

First, to construct invariant conditioning sets, it is common to condition on a function of Z in a CRT (Hennessy et al. 2016). This idea is formalized by the following result.

Proposition 1. Any function g : Z → N defines a countable collection of invariant conditioning sets S_z = {z^∗ ∈ Z : g(z^∗) = g(z)}.

Proof. This follows immediately from the equivalence relation defined by z ≡_R z^∗ if g(z^∗) = g(z).

In the measure-theoretic view of Section 2.4, this means that we simply condition on the σ-algebra G generated by the random variable g(Z). We require that the image of g(·) is countable to avoid conditioning on zero-probability events.⁶
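For instance (a toy sketch under our own setup), taking g(Z) to be the number of treated units gives the following conditioning sets:

    from itertools import product

    # Sketch of Proposition 1: the conditioning set S_z is the level set of
    # g containing z; here g counts the treated units, a common choice.
    N = 4
    Z_space = list(product([0, 1], repeat=N))

    def conditioning_set(z_obs, g=sum):
        return [z for z in Z_space if g(z) == g(z_obs)]

    print(len(conditioning_set((1, 0, 1, 0))))  # 6 = C(4, 2) assignments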

2.5.2 Focal units

In practice, the test statistic of a CRT usually only depends on the potential outcomes corresponding to z^∗ and takes the form T_z(z^∗, W) = T_z(z^∗, Y(z^∗)). The fundamental problem then is that only a sub-vector Y_{H(z,z^∗)}(z^∗) of Y(z^∗) is imputable under H. To solve this problem, a natural idea is to only use the imputable potential outcomes. This is formalized in the next result.

Proposition 2. Given any partition R = {S_m}_{m=1}^∞ of Z, let H_m = ∩_{z,z^∗∈S_m} H(z, z^∗). Then, under Assumption 1, the partition R and the test statistics (T_m(z, Y_{H_m}(z)))_{m=1}^∞ define a computable p-value.

Although Proposition 2 provides a general way of constructing imputable test statistics, the CRT is powerless if H_m is an empty set. More generally, the power of the CRT depends on the sizes of S_m and H_m, and there is a trade-off: with a coarser R, the CRT is able to utilize a larger subset S_m of treatment assignments but a smaller subset H_m of experimental units.

In many problems, choosing a good partition R is not trivial. This is particularly challenging when the problem involves interference, as the imputability mapping H(z, z^∗) can be quite complex. In such cases, it may be helpful to impose some structure on the imputability mapping.

Definition 6. A partially sharp null hypothesis H is said to have a level-set structure if there exist exposure functions D_i : Z → D, i = 1, . . . , N, such that D is countable and

H(z, z^∗) = {i ∈ [N] : D_i(z) = D_i(z^∗)}.   (5)

⁶In principle, the theory should extend to more general g(·) as long as the conditional distributions are well defined (see e.g., Pollard 2002, Chapter 4).

In other words, the imputability mapping is defined by the level sets of the exposure functions. To the best of our knowledge, Definition 6 first appeared in Athey, Eckles, and Imbens (2018), but the concept of an exposure mapping can be traced back to earlier articles by Manski (2013), Ugander et al. (2013), and Aronow and Samii (2017).

An immediate consequence of the level-set structure is that H(z, z^∗) is symmetric, i.e., H(z, z^∗) = H(z^∗, z) for all z, z^∗ ∈ Z. Moreover, by using the level-set structure, we can write H_m in Proposition 2 as

H_m = ∩_{z,z^∗∈S_m} H(z, z^∗) = {i ∈ [N] : D_i(z) is constant over z ∈ S_m}.   (6)

This provides a reasonable way to choose the test statistic (i.e., the experimental units) once a partition R = {S_m}_{m=1}^∞ is given. However, it is often more practical to proceed in the other direction and choose the experimental units first. Aronow (2012) and Athey, Eckles, and Imbens (2018) proposed to choose a partition R = {S_m}_{m=1}^∞ such that H_m is equal to a fixed subset of “focal units”, I ⊆ [N], for all m. Given any I ⊆ [N], the conditioning set is given by all the treatment assignments under which all the units in I receive the same exposure. That is,

S_z = {z^∗ ∈ Z : I ⊆ H(z, z^∗)} = {z^∗ ∈ Z : D_I(z^∗) = D_I(z)},   (7)

where D_I(·) = (D_i(·) : i ∈ I). From the right-hand side of (7), it is easy to see that {S_z : z ∈ Z} satisfies Lemma 1 and thus forms a partition of Z. Furthermore, {S_z : z ∈ Z} is countable because S_z is determined by D_I(z), which takes values in the countable set D^I.

The next proposition summarizes the method proposed by Athey, Eckles, and Imbens (2018)⁷ and immediately follows from our discussion above.

Proposition 3. Suppose the null hypothesis H has a level-set structure as in Definition 6 and let I ⊆ [N] be a set of focal units. Under Assumption 1, the partition R = {S_z : z ∈ Z} defined in (7) and any test statistic T(z, Y_I(z)) induce a computable p-value.
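The following sketch shows how the conditioning set (7) can be enumerated in a small hypothetical interference problem of our own construction (it is not an example from Athey, Eckles, and Imbens (2018)); a unit's exposure is 1 if the unit itself or a neighbor on a line graph is treated:

    from itertools import product

    # Hypothetical setting for Proposition 3: units 0..5 on a line, and the
    # exposure D_i(z) = 1{unit i or one of its neighbours is treated} has a
    # countable range, so the null has the level-set structure (5).
    N = 6
    Z_space = list(product([0, 1], repeat=N))

    def exposure(i, z):
        nbrs = [j for j in (i - 1, i, i + 1) if 0 <= j < N]
        return int(any(z[j] for j in nbrs))

    focal = [0, 3]  # the analyst's choice of focal units I

    def D_I(z):
        return tuple(exposure(i, z) for i in focal)

    def conditioning_set(z_obs):
        # (7): assignments giving every focal unit its observed exposure
        return [z for z in Z_space if D_I(z) == D_I(z_obs)]

    z_obs = (0, 1, 0, 0, 0, 1)
    print(len(conditioning_set(z_obs)), "assignments in S_z")

A test statistic that uses only Y_I(z) on this conditioning set then yields a computable p-value, as in Proposition 3.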

2.5.3 Bipartite graph representation

Puelz et al. (2019) provided an alternative way to use the level-set structure. They consider imputability mappings of the form (suppose 0 ∈ D)⁸

H(z, z^∗) = {i ∈ [N] : D_i(z) = D_i(z^∗) = 0},   (8)

which is slightly more restrictive than (5). The “conditional focal units” H_m in (6) can then be written as

H_m = {i ∈ [N] : D_i(z) = 0, ∀z ∈ S_m}.   (9)

⁷Athey, Eckles, and Imbens (2018) used the same test statistic in all conditioning events, which is reflected in Proposition 3. Our theory can further allow the test statistic T_Z(z, Y_I(z)) to depend on Z through D_I(Z).
⁸The “null exposure graph” in Puelz et al. (2019) actually allows D_i(z) and D_i(z^∗) to belong to a prespecified subset of D. This can be accommodated in our setup by redefining the exposure functions.

Their key insight is that an imputability mapping of the above form can be represented as a bipartite graph with vertex set V = [N] ∪ Z and edge set

E = {(i, z) ∈ [N] × Z : D_i(z) = 0}.

Puelz et al. (2019) referred to this as the null exposure graph. Then by using (9), we have

Proposition 4. V_m = H_m ∪ S_m and E_m = {(i, z) ∈ H_m × S_m} form a biclique (i.e., a complete bipartite subgraph) of the null exposure graph.

Therefore, the challenging problem of finding a good partition of Z is reduced to finding a collection of large bicliques {(V_m, E_m)}_{m=1}^M in the graph such that {S_m}_{m=1}^M partitions Z (this was called a biclique decomposition by Puelz et al. (2019)). They further described an algorithm that finds a biclique decomposition by greedily removing the treatment assignments in the largest biclique (an approximate algorithm is needed to find the largest biclique, as this problem is generally NP-hard).
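The sketch below conveys the flavor of such a decomposition under the structure (8). It greedily grows one biclique at a time and is only illustrative; it is not the algorithm of Puelz et al. (2019), which targets the largest biclique at each step:

    # Naive greedy biclique decomposition (illustrative only). D(i, z) is
    # the exposure of unit i under assignment z; H is the common set of
    # null-exposed units, and S_m collects assignments null-exposing all
    # of H, so (H, S_m) is a biclique in the null exposure graph.
    def biclique_decomposition(Z_space, units, D):
        remaining = set(Z_space)
        decomposition = []
        while remaining:
            z0 = next(iter(remaining))
            H = {i for i in units if D(i, z0) == 0}
            S_m = [z for z in remaining
                   if H <= {i for i in units if D(i, z) == 0}]
            decomposition.append((H, S_m))
            remaining -= set(S_m)
        return decomposition

    # toy usage: 2 units, unit i is null-exposed iff z[i] == 0
    Z_space = [(0, 0), (0, 1), (1, 0), (1, 1)]
    print(biclique_decomposition(Z_space, units=[0, 1],
                                 D=lambda i, z: z[i]))

Note that a cell with empty H yields a powerless test, echoing the trade-off discussed after Proposition 2.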

2.5.4 Randomized CRTs

When using the method of focal units or bicliques, the power of the CRT often heavily depends on the set of focal units or the biclique decomposition we use. The CRT is allowed to use the treatment assignments in the conditioning set S_Z, which depends on the realized assignment Z. In general, we may have several reasonable ways to partition Z, and each partition may have higher power for different realizations of Z. In this circumstance, a natural idea is to post-randomize the test.

Consider a collection of CRTs defined by partitions R(v) = {S_m(v)}_{m=1}^∞ and test statistics (T_m(·, ·; v))_{m=1}^∞ that are indexed by v ∈ V, where V is countable. Each v defines a p-value

P(Z, W; v) = P^∗{T_Z(Z^∗, W; v) ≤ T_Z(Z, W; v) | Z^∗ ∈ S_Z(v), W},

where Z^∗ is an independent copy of Z. Since this defines a CRT for each fixed v, the theory in Section 2.3 immediately applies. Similar to Definition 5, we say P(Z, W; v) is computable if it is a function of Z and Y, and we write it as P(Z, Y; v). In a randomized CRT, the analyst can take V to be a random variable that is independent of Z and W.⁹ The next result shows that we can safely test a partially sharp null hypothesis by using a randomly drawn CRT.

Corollary 1. Under the setting above, the randomized CRT is valid in the following sense:

P{P(Z, W; V) ≤ α | Z ∈ S_z(V), W, V} ≤ α,   ∀α ∈ [0, 1], z ∈ Z.

Proof. This immediately follows from Theorem 1.

⁹This may require enlarging the sample space as in footnote 4, as the randomness in V is introduced by the analyst of the experiment, who may be different from the experimenter who only randomizes Z. In what follows, we use P to denote the product probability measure on (Z, W, V) when the CRT is randomized.

We can further generalize Proposition 1 and Corollary 1 to allow conditioning on a random variable G = g(Z, V) that depends on both the randomness introduced by the experimenter through Z and the randomness introduced by the analyst through V. As above, suppose V ⊥⊥ (Z, W) and G has countable support (to avoid conditioning on zero-probability events; see footnote 6). Because the distribution of V is known, the conditional distribution of G given Z is known. Let π(· | g) be the density function of Z given G = g, which can be obtained from Bayes’ formula:

π(z | g) = P(G = g | Z = z) π(z) / ∫ P(G = g | Z = z′) π(z′) dz′.

Let T_g(·, W) be the test statistic, which is now indexed by g in the support of G. Then the randomized p-value is defined as

P(Z, W; G) = P^∗{T_G(Z^∗, W) ≤ T_G(Z, W) | G, W},

where the probability is taken over Z^∗ whose conditional distribution given (G, W) equals that of Z. In other words, Z^∗ ∼ π(· | G), and the randomized p-value can be written as

P(Z, W; G) = ∫ 1{T_G(z^∗, W) ≤ T_G(Z, W)} π(z^∗ | G) dµ(z^∗),

where µ is the reference measure on Z. Similar to above, we say P(Z, W; g) is computable if it is a function of Z and Y, and we write it as P(Z, Y; g).

Corollary 2. Under the setting above, the randomized CRT is valid in the following sense:

P{P(Z, W; G) ≤ α | W, G} ≤ α,   ∀α ∈ [0, 1].

Proof. This follows from the observation that, once G is given, this is simply an unconditional randomization test. The p-value is computed using Z^∗ ∼ π(· | G), so given G, we can write P(Z, W; G) as a probability integral transform as in the proof of Theorem 1.

Corollary 2 is essentially the same as Theorem 1 in Basse, Feller, and Toulis (2019), although we do not require imputability of the test statistic. In other words, imputability only affects whether the p-value can be computed using the observed data; it is not necessary for the validity of the p-value. Corollary 2 generalizes several results obtained above. Theorem 1 is essentially the special case where G = S_Z. Proposition 1 is also a special case, where G = g(Z) is not randomized. Finally, Corollary 1 amounts to conditioning on the post-randomized set G = S_Z(V).
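To make Corollary 2 concrete, here is a toy numeric sketch (the setup, including G = sum(Z) + V, is entirely hypothetical): the analyst reveals a noisy function of Z, computes π(· | G) by Bayes' formula, and then runs what is effectively an unconditional randomization test under π(· | G).

    import random
    from itertools import product

    # Z is uniform on {0,1}^3; the analyst draws V ~ Bernoulli(1/2) and
    # forms G = sum(Z) + V. pi(. | g) follows from Bayes' formula.
    Z_space = list(product([0, 1], repeat=3))
    pi = {z: 1 / 8 for z in Z_space}

    def p_G_given_Z(g, z):
        return sum(0.5 for v in (0, 1) if sum(z) + v == g)

    def pi_given_g(g):
        weights = {z: p_G_given_Z(g, z) * pi[z] for z in Z_space}
        total = sum(weights.values())
        return {z: w / total for z, w in weights.items()}

    def randomized_pvalue(z_obs, g, stat):
        t_obs = stat(z_obs)
        return sum(p for z, p in pi_given_g(g).items() if stat(z) <= t_obs)

    random.seed(2)
    z_obs = (1, 0, 1)
    g = sum(z_obs) + random.randint(0, 1)  # analyst's post-randomization
    print(randomized_pvalue(z_obs, g, stat=lambda z: z[0] + z[1]))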

3 Multiple conditional randomization tests

3.1 Main theorem

In the previous section, we have defined what it means for a p-value to be computable under a null hypothesis. We have also shown that a CRT provides a valid test for any partially sharp

null hypothesis. In this section, we are concerned with establishing a sufficient condition under which multiple CRTs are jointly valid.

Consider K conditional randomization tests, defined by partitions R^(k) = {S_m^(k)}_{m=1}^∞ and test statistics (T_m^(k)(·, ·))_{m=1}^∞, for K possibly different hypotheses H^(k), k = 1, . . . , K. We denote the corresponding p-values (1) as P^(1)(Z, W), . . . , P^(K)(Z, W). We introduce some additional notation before stating the main theorem. For any subset of tests J ⊆ [K], we define the union, refinement, and coarsening of the conditioning sets as

R^J = ∪_{k∈J} R^(k),   R̲^J = {∩_{j∈J} S_z^(j) : z ∈ Z},   R̄^J = {∪_{j∈J} S_z^(j) : z ∈ Z}.

As in Section 2.4, let G^(k) = σ({Z ∈ S : S ∈ R^(k)}) be the σ-algebra generated by the conditioning events for test k. Let G^J be the σ-algebra generated by the sets in R^J, so G^J = σ({Z ∈ S : S ∈ R^J}). Similarly, we denote G̲^J = σ({Z ∈ S : S ∈ R̲^J}) and Ḡ^J = σ({Z ∈ S : S ∈ R̄^J}).

Theorem 2. Suppose the following two conditions are satisfied for all j, k ∈ [K], j ≠ k:

R̲^{j,k} ⊆ R^{j,k},   (10)

T_Z^(j)(Z, W) ⊥⊥ T_Z^(k)(Z, W) | G̲^{j,k}, W.   (11)

Then we have

P{P^(1)(Z, W) ≤ α^(1), . . . , P^(K)(Z, W) ≤ α^(K) | Ḡ^[K], W} ≤ ∏_{k=1}^K α^(k),   ∀α^(1), . . . , α^(K) ∈ [0, 1].   (12)

In consequence, given Assumption 1 and that the null hypotheses H^(1), . . . , H^(K) are satisfied, if the CRTs are computable, then

P{P^(1)(Z, Y) ≤ α^(1), . . . , P^(K)(Z, Y) ≤ α^(K)} ≤ ∏_{k=1}^K α^(k),   ∀α^(1), . . . , α^(K) ∈ [0, 1].   (13)

Before sketching a proof of this theorem, we first give a simple illustration of this general result. When K = 2, condition (10) amounts to assuming a nested structure between the two partitions: S_z^(1) ⊆ S_z^(2) or S_z^(1) ⊇ S_z^(2) for all z ∈ Z. In other words, S_z^(1) ∩ S_z^(2) = S_z^(1) or S_z^(2) for all z ∈ Z. Notice that this condition allows S_z^(1) ⊆ S_z^(2) for some z and S_{z^∗}^(1) ⊇ S_{z^∗}^(2) for another z^∗ ≠ z. Furthermore, when K = 2, condition (11) is equivalent to assuming

T_z^(1)(Z, W) ⊥⊥ T_z^(2)(Z, W) | Z ∈ S_z^(1) ∩ S_z^(2), W,   ∀z ∈ Z.

Finally, when K = 2, the main conclusion (12) is equivalent to

P{P^(1)(Z, W) ≤ α^(1), P^(2)(Z, W) ≤ α^(2) | Z ∈ S_z^(1) ∪ S_z^(2), W} ≤ α^(1) α^(2)

for all z ∈ Z and α^(1), α^(2) ∈ [0, 1]. See Section 3.3.2 for a simple proof of the K = 2 case under slightly stronger conditions than (10) and (11), which sheds some light on the proof for the general case.

3.2 Proof outline

We now outline a proof of Theorem 2. To this end, we need to consider the structure of the conditioning sets given by (10). We start with the following observation (the proof can be found in Appendix A.3).

Lemma 3. Suppose (10) is satisfied. Then for any J ⊆ [K] and S, S′ ∈ R^J, the sets S and S′ are either disjoint or nested; that is,

S ∩ S′ ∈ {∅, S, S′}.

Furthermore, we have R̲^[K] ⊆ R^[K] and R̄^[K] ⊆ R^[K].

The sets in R^[K] can be partially ordered by set inclusion. This induces a graphical structure on R^[K]:

Definition 7. The Hasse diagram for R^[K] = {S_z^(k) : z ∈ Z, k ∈ [K]} is a graph where each node is a set in R^[K], and a directed edge S′ → S″ exists between two distinct nodes S′, S″ ∈ R^[K] if S′ ⊃ S″ and there is no S‴ ∈ R^[K] such that S′ ⊃ S‴ ⊃ S″.

It is straightforward to show that all edges in the Hasse diagram for R^[K] are directed and that the graph has no cycles; thus, the Hasse diagram is a directed acyclic graph. For any node S ∈ R^[K] in this graph, we can further define its parent set pa(S) = {S′ ∈ R^[K] : S′ → S}, child set ch(S) = {S′ ∈ R^[K] : S → S′}, ancestor set an(S) = {S′ ∈ R^[K] : S′ ⊃ S}, and descendant set de(S) = {S′ ∈ R^[K] : S ⊃ S′}.

Notice that one conditioning set S ∈ R^[K] can be used in multiple CRTs. To fully characterize this structure, we introduce additional notation.

Definition 8. For any S ∈ R^[K], let K(S) = {k ∈ [K] : S ∈ R^(k)} be the collection of indices such that S is a conditioning set in the corresponding test. Furthermore, for any collection of conditioning sets R ⊆ R^[K], denote K(R) = ∪_{S∈R} K(S).

Lemma 4. Suppose (10) is satisfied. Then for any S ∈ R^[K], we have

(i) If ch(S) ≠ ∅, then ch(S) is a partition of S;

(ii) {K(an(S)), K(S), K(de(S))} forms a partition of [K];

(iii) For any S′ ∈ ch(S), K(an(S′)) = K(an(S) ∪ {S}) and K({S′} ∪ de(S′)) = K(de(S)).

Using Lemma 4, we can prove the following key lemma, which establishes the conditional independence between two p-values. The proofs of Lemmas 4 and 5 can be found in Appendices A.4 and A.5.

Lemma 5. Suppose (10) and (11) are satisfied. Then for any S ∈ R^[K], j ∈ K(S), and k ∈ K(an(S) ∪ {S}) \ {j}, we have

P^(j)(Z, W) ⊥⊥ P^(k)(Z, W) | Z ∈ S, W.

Finally, we state a result based on the above Hasse diagram that is more general than Theorem 2.

Lemma 6. Given conditions (10) and (11), we have, for any S ∈ R^[K],

P{P^(1)(Z, W) ≤ α^(1), . . . , P^(K)(Z, W) ≤ α^(K) | Z ∈ S, W}
≤ P{P^(k)(Z, W) ≤ α^(k) for k ∈ K(an(S)) | Z ∈ S, W} × ∏_{j∈K({S}∪de(S))} α^(j).   (14)

Theorem 2 almost immediately follows from Lemma 6. The proofs of Lemma 6 and Theorem 2 can be found in Appendices A.6 and A.7, respectively.

3.3 Practical methods

Theorem 2 provides a sufficient condition for jointly valid CRTs: the partitions R^(k), k = 1, . . . , K, need to satisfy the nested relation characterized by (10), and the test statistics need to satisfy the conditional independence in (11). However, these two conditions are quite abstract. To illustrate the versatility of the theorem, we provide some practical techniques to construct CRTs that satisfy both conditions. To simplify the exposition, below we assume that the test statistics satisfy T^(k)(·, ·) = T_m^(k)(·, ·) for all m and k.

3.3.1 Independent treatment variables

We begin by noting that (10) is immediately satisfied if all the randomization tests are unconditional (i.e., S_z^(k) = Z for all z ∈ Z and k ∈ [K]). Further, the conditional independence (11) may be satisfied if the problem structure allows the treatment Z to be decomposed into independent components. This strategy is summarized in the proposition below.

Proposition 5. The conditions (10) and (11) are satisfied for all j, k ∈ [K], j ≠ k, if

(i) S_z^(k) = Z for all k and z; and

(ii) T^(k)(Z, W) only depends on Z through Z^(k) = h^(k)(Z) for all k, and Z^(j) ⊥⊥ Z^(k) for all j ≠ k.

This result can be easily extended to the case where the same partition is used for all the tests, i.e., R^(1) = · · · = R^(K).
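A quick Monte Carlo sanity check of Proposition 5 (our own toy example): two unconditional tests whose statistics depend on independent components of the treatment; under the null, the joint rejection rate should not exceed α^(1) α^(2).

    import random
    from math import comb

    # Z = (Z1, Z2) with independent components; each unconditional CRT uses
    # one component. Discreteness makes the p-values conservative, so the
    # observed joint rate typically falls below the product bound.
    random.seed(0)

    def pvalue_binary(z, n):
        # unconditional randomization p-value with statistic sum(z) when
        # Z is uniform on {0,1}^n: P*{sum(Z*) <= sum(z)}
        return sum(comb(n, k) for k in range(sum(z) + 1)) / 2 ** n

    n, reps, a1, a2 = 6, 20000, 0.2, 0.3
    hits = 0
    for _ in range(reps):
        z1 = [random.randint(0, 1) for _ in range(n)]
        z2 = [random.randint(0, 1) for _ in range(n)]
        hits += (pvalue_binary(z1, n) <= a1) and (pvalue_binary(z2, n) <= a2)
    print(hits / reps, "vs bound", a1 * a2)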

3.3.2 Sequential CRTs

Our next strategy is to construct a sequence of CRTs, where each CRT conditions on the randomness utilized by the preceding CRTs. Mathematically, we assume that the CRTs are constructed in a way such that G^(1) ⊇ · · · ⊇ G^(K). By definition, this is equivalent to using a sequence of nested partitions so that S_z^(1) ⊆ · · · ⊆ S_z^(K) for all z. Under this construction, it is easy to see that R̲^{j,k} = R^(min(j,k)) and R̄^{j,k} = R^(max(j,k)), so (10) is immediately satisfied. Moreover, (11) is automatically satisfied if T^(k)(z, W) does not depend on z when z ∈ S_m^(j) for any m and j < k.¹⁰ To see this, the conditional independence

T^(j)(Z, W) ⊥⊥ T^(k)(Z, W) | Z ∈ S_m^(j), W

is immediately satisfied because T^(k)(Z, W) is just a constant given Z ∈ S_m^(j).

To summarize, we have the following result.

Proposition 6. The conditions (10) and (11) are satisfied for all j, k ∈ [K], j ≠ k, if

(i) S_z^(1) ⊆ · · · ⊆ S_z^(K) for all z ∈ Z; and

(ii) T^(k)(z, W) does not depend on z when z ∈ S_m^(j) for all m and j < k.

Under the conditions in Proposition 6, the proof of Theorem 2 can be greatly simplified. To illustrate this, we consider the simplest case, K = 2. Let ψ^(k)(Z, W) = 1{P^(k)(Z, W) ≤ α^(k)} for k = 1, 2 be the test functions. By Proposition 6(i), G^(2) ⊆ G^(1). By Proposition 6(ii) and the definition of the p-value (1), P^(2)(Z, w), and thus ψ^(2)(Z, w), are G^(1)-measurable for any fixed w. Then, by the law of iterated expectations, for any w ∈ W,

P{P^(1)(Z, w) ≤ α^(1), P^(2)(Z, w) ≤ α^(2) | G^(2)}
= E{ψ^(1)(Z, w) ψ^(2)(Z, w) | G^(2)}
= E{E[ψ^(1)(Z, w) ψ^(2)(Z, w) | G^(1)] | G^(2)}
= E{ψ^(2)(Z, w) E[ψ^(1)(Z, w) | G^(1)] | G^(2)}
≤ α^(1) E{ψ^(2)(Z, w) | G^(2)}
≤ α^(1) α^(2).

By using the Hasse diagram for the conditioning sets, our proof of Theorem 2 essentially extends this argument to the more general setting.
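The following toy sketch (our construction) instantiates Proposition 6 with K = 2: test 1 conditions on sum(Z), the finer partition, while test 2 is unconditional and its statistic depends on z only through sum(z), so it is constant on every cell of test 1's partition, as Proposition 6(ii) requires.

    from itertools import product
    from math import comb

    # Z uniform on {0,1}^4; S_z^(1) = {z*: sum(z*) = sum(z)}, S_z^(2) = Z.
    N = 4
    Z_space = list(product([0, 1], repeat=N))

    def pvalue_test1(z_obs, y):
        # conditions on sum(Z); statistic = total outcome of treated units
        cell = [z for z in Z_space if sum(z) == sum(z_obs)]
        def t(z):
            return sum(yi for zi, yi in zip(z, y) if zi == 1)
        return sum(1 for z in cell if t(z) <= t(z_obs)) / len(cell)

    def pvalue_test2(z_obs):
        # unconditional; statistic sum(z) is a function of test 1's cell
        return sum(comb(N, k) for k in range(sum(z_obs) + 1)) / 2 ** N

    y = (2.0, 1.5, 0.7, 0.3)
    print(pvalue_test1((1, 0, 1, 0), y), pvalue_test2((1, 0, 1, 0)))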

¹⁰In other words, T^(k)(z, W) only depends on z through the indicator of the event {z ∈ S_m^(j)}. In our measure-theoretic language, this means that T^(max(j,k))(Z, W) is G^(min(j,k))-measurable given W.

3.3.3 Randomized CRTs

One can also randomize the CRTs, as in Section 2.5.4. We illustrate this idea using a proposal by Bates et al. (2020). In the problem considered there, the test statistics are of the form T^(k)(Z^(k), W) (as in Proposition 5), and there exists a random variable U such that Z^(1) ⊥⊥ · · · ⊥⊥ Z^(K) | U. The challenge is that U is unobserved, although the joint distribution of (U, Z) is known. The key idea of Bates et al. (2020) is to construct a post-treatment random variable G = g(Z, V) (V is randomized by the analyst), where the function g(·, ·) and the distribution of V are chosen such that G has the same conditional distribution given Z as U. Then we immediately have

Z^(1) ⊥⊥ · · · ⊥⊥ Z^(K) | G.   (15)

Conditional on G, (10) is satisfied because the tests are unconditional, and (11) is also immediately satisfied due to (15). To summarize, we have the following result.

Proposition 7. Suppose the test statistics are constructed in a randomized way as in Corollary 2. Then, conditional on G, conditions (10) and (11) are satisfied for all j, k ∈ [K], j ≠ k, if

(i) S_z^(k) = Z for all k and z; and

(ii) T^(k)(Z, W) only depends on Z through Z^(k) = h^(k)(Z) for all k, and Z^(j) ⊥⊥ Z^(k) | G for all j ≠ k.

In consequence, (12) is satisfied conditional on G.

Proposition 7 is basically the post-randomized version of Proposition 5. The nontrivial idea is that when conditional independence between the treatment variables requires conditioning on some unobserved variable U, we can instead generate another variable G that follows the known conditional distribution of U given Z and treat it as given. This proposal might seem magical at first, but notice that the power of the test will depend crucially on G and on how well it resembles U. If G is not very similar to U, the post-randomized CRTs may have very little power.

4 Conditional randomization tests in the literature

Next, we apply the general theory in Sections 2 and 3 to better understand some (conditional) randomization tests that have been proposed in the literature.

4.1 Permutation tests for treatment effect

The permutation test is the best-known form of conditional randomization test. In a permutation test, the p-value is obtained by calculating the value of the test statistic under all possible permutations of the observed data points. In our setup, this amounts to using the conditioning sets (suppose Z is a vector of length N)

S_z = {(z_σ(1), . . . , z_σ(N)) : σ is a permutation of [N]}.   (16)

In view of Proposition 1, the permutation test is a CRT that conditions on the order statistics of Z. In permutation tests, the treatment assignments are typically assumed to be exchangeable,

(Z_1, . . . , Z_N) =_d (Z_σ(1), . . . , Z_σ(N)) for all permutations σ of [N],

so each permutation of Z has the same probability of being realized under the treatment assignment mechanism π(·). See Kalbfleisch (1978) for an alternative formulation of rank-based tests based on marginal and conditional likelihoods. Examples of exchangeable assignments include repeated Bernoulli trials and simple random sampling with or without replacement (let Z_i be the number of times unit i is sampled). Exchangeability makes it straightforward to compute the p-value (1), as Z^∗ is uniformly distributed over S_Z if Z has distinct elements. In this sense, our assumption that the assignment distribution of Z is known (Assumption 1) is more general than exchangeability. See Roach and Valdar (2018) for some recent developments on generalized permutation tests in non-exchangeable models.

Notice that in permutation tests, the invariance of S_z in Lemma 1 is satisfied because the permutation group is closed under composition; that is, the composition of two permutations of [N] is still a permutation of [N]. This property can be violated when the test conditions on additional events. Southworth, Kim, and Owen (2009) give a counterexample in which Z is randomized uniformly over Z = {z ∈ {0, 1}^N : z^T 1 = N/2}, so exactly half of the units are treated. They consider the “balanced permutation test” of Efron et al. (2001, Section 6) that uses the following conditioning set:

S_z = {z^∗ : z^∗ is a permutation of z and z^T z^∗ = N/4}.   (17)

Southworth, Kim, and Owen (2009) showed that the standard theory for permutation tests in Lehmann and Romano (2006) does not establish the validity of the balanced permutation test, because S_z is not a group under balanced permutations (nor is S_z ∪ {z}). They also provided numerical examples in which the balanced permutation test has an inflated type I error.

With the general theory in Section 2 in mind, we see that (17) clearly does not satisfy the invariance property in Lemma 1. Moreover, a group structure is not necessary for the construction of a valid randomization test in Section 2. The crucial algebraic structure is that R = {S_z : z ∈ Z} must be a partition of Z, or equivalently, that the conditioning set S_z must be defined by an equivalence relation. This is clearly not satisfied by (17).
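A few lines of enumeration (toy check with N = 4, our construction) make the failure concrete: the set (17) is not an equivalence class, and z does not even belong to its own "conditioning set".

    from itertools import product

    N = 4
    Z_space = [z for z in product([0, 1], repeat=N) if sum(z) == N // 2]

    def S(z):
        # (17): permutations of z (all balanced vectors) with z . z* = N/4
        return {z_star for z_star in Z_space
                if sum(a * b for a, b in zip(z, z_star)) == N // 4}

    z = (1, 1, 0, 0)
    print("z in S(z)?", z in S(z))  # False
    for z_star in S(z):
        if S(z_star) != S(z):
            print(z_star, "has a different 'conditioning set' than", z)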

4.2 Permutation tests for independence

Randomization tests are frequently used to test the independence of random variables. In this problem, it is typically assumed that we observe independent and identically distributed random variables (Z_1, Y_1), . . . , (Z_N, Y_N) and would like to test the null hypothesis that Z_1 and Y_1 are independent. In the classical treatment of this problem (Lehmann and Romano 2006; Hájek, Šidák, and Sen 1999), the key idea is to establish the following permutation principle:

(Z_1, . . . , Z_N, Y_1, . . . , Y_N) =_d (Z_σ(1), . . . , Z_σ(N), Y_1, . . . , Y_N) for all permutations σ of [N].

Let Z = (Z_1, . . . , Z_N), Y = (Y_1, . . . , Y_N), and Z_σ = (Z_σ(1), . . . , Z_σ(N)). Then, given a test statistic T(Z, Y), independence is rejected if the following p-value is less than the significance level (recall that there are N! permutations of [N]):

P(Z, Y) = (1/N!) Σ_σ 1{T(Z_σ, Y) ≤ T(Z, Y)}.   (18)

We are using the same notation P(Z, Y) for the p-value of this permutation test in (18) because the two are indeed identical. It might seem that the permutation test is solving a different problem from testing a causal null hypothesis—after all, no counterfactuals are involved in testing independence. In fact, Lehmann (1975) referred to the causal inference problem as the randomization model and the independence testing problem as the population model. Ernst (2004) argued that the reasoning behind these two models is different.

However, a statistical test not only explicitly tests the null hypothesis H_0 but also implicitly tests any assumptions needed to set up the problem. Therefore, the CRT described in Section 2 tests not only a partially sharp null hypothesis about the causal effect but also the assumption that the treatment is randomized (Assumption 1). If we simply “define” the potential outcomes as Y(z) = Y for all z ∈ Z, so that W consists of many identical copies of Y, the “causal” null hypothesis H_0 : Y(z) = Y(z^∗), ∀z, z^∗ ∈ Z, would be automatically satisfied. Recall that the p-value (1) of a CRT is given by

P(Z, W) = P^∗{T_Z(Z^∗, W) ≤ T_Z(Z, W) | Z^∗ ∈ S_Z, W},

and suppose the test statistic is given by T_z(z^∗, W) = T(z^∗, Y(z^∗)) as in Section 2.5.2. Because of how the potential outcomes schedule W is defined in this case, the test statistic is equal to T(z^∗, Y). Furthermore, the independence of Z and Y is equivalent to the independence of Z and W. When S_Z is given by all the permutations of Z as in (16), Z^∗ | Z^∗ ∈ S_Z has the same distribution as Z_σ, where σ is a random permutation. Combining these observations, we see that the permutation test of independence is identical to the conditional randomization test of no causal effect.

In summary, the two formulations of permutation (or, more broadly, randomization) tests are unified in our framework. The CRT tests not only a partially sharp null hypothesis about the potential outcomes schedule W but also the independence of Z and W (Assumption 1). In the randomization model, Z ⊥⊥ W is guaranteed by randomization, so the CRT gives

a test of the causal null hypothesis. In the population model, the causal null hypothesis about W is automatically satisfied by definition, so the CRT gives a test of independence. We would like to stress that such a unification is only a mathematical one; the logic and philosophy behind the two formulations are fundamentally different. In the randomization model, inference is justified by the belief that the randomness introduced by the experimenter is exogenous. In the population model, inference is justified by mathematical assumptions such as the exchangeability of observations. See Section 6 for further discussion.

4.3 Randomization tests for conditional independence

Recently, there has been growing interest in randomization tests for conditional independence (Candès et al. 2018; Tansey et al. 2018; Berrett et al. 2020; Katsevich and Ramdas 2020; Liu et al. 2020). Similar to the last section, it is assumed that we observe independent and identically distributed random variables (Z_1, Y_1, X_1), . . . , (Z_N, Y_N, X_N) and would like to test Z_1 ⊥⊥ Y_1 | X_1. This can be easily incorporated in our framework by treating X = (X_1, . . . , X_N) as fixed; see the last paragraph of Section 2.1. In this case, the randomization distribution of Z = (Z_1, . . . , Z_N) is given by the conditional distribution of Z given X. The same randomization test can then be applied; see Candès et al. (2018, Section 4.1) for more detail. Berrett et al. (2020) extended this test by further conditioning on the order statistics of Z, resulting in a permutation test.

As a remark on the terminology, the test in the last paragraph was called a “conditional randomization test” by Candès et al. (2018) because the procedure conditions on X. However, such conditioning is fundamentally different from the post-experimental conditioning on S_Z, which is what distinguishes conditional randomization tests from unconditional ones in this article. When Z is randomized according to X, conditioning on X is obligatory in randomization inference, because the inference needs to use the randomness introduced by the experimenter (which is conditional on X). On the other hand, further conditioning on S_Z (or on g(Z), as in Proposition 1) can be introduced by the analyst to improve power or to make the p-value computable. For this reason, we advocate reserving the terminology “conditional randomization test” for the latter case.
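The test of Candès et al. (2018) can be sketched as follows (the known "propensity" p(·) below is hypothetical and purely for illustration): resample Z^∗ from the conditional law of Z given X and compare test statistics, exactly as in (1) with S_Z = Z.

    import random

    # Z_i | X_i ~ Bernoulli(p(X_i)) with p(.) assumed known; the p-value is
    # a Monte Carlo version of (1) with the trivial, unconditional partition.
    def crt_cond_independence(z, y, x, p, stat, n_draws=2000, seed=7):
        rng = random.Random(seed)
        t_obs = stat(z, y)
        count = 0
        for _ in range(n_draws):
            z_star = [int(rng.random() < p(xi)) for xi in x]
            count += stat(z_star, y) <= t_obs
        return (1 + count) / (1 + n_draws)

    def p(x):
        return 0.2 + 0.6 * (x > 0)

    def stat(z, y):
        return sum(zi * yi for zi, yi in zip(z, y))

    x = [-1, 1, 1, -1, 1, -1]
    z = [0, 1, 1, 0, 1, 0]
    y = [0.1, 1.2, 0.8, 0.2, 1.0, 0.3]
    print(crt_cond_independence(z, y, x, p, stat))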

4.4 Covariate imbalance and rerandomization

Morgan and Rubin (2012) proposed to use rerandomization to improve covariate balance in experiments and regenerated some interest in randomization inference (Ding, Feller, and Miratrix 2015; Banerjee, Chassang, and Snowberg 2017; Heckman and Karapakula 2019). They recounted a conversation between Cochran and Fisher, in which Fisher advised rerandomizing if some baseline covariates are not well balanced by the randomly chosen assignment. The key insight of Morgan and Rubin (2012) is that the experiment should then be analyzed with the rerandomization taken into account. More specifically, rerandomization

is simply a rejection sampling algorithm for randomly choosing Z from Z = {z : g(z) ≤ η}, where g(z) measures the covariate imbalance implied by the treatment assignment z and η is the experimenter’s level of tolerance. Therefore, we simply need to use the randomization distribution over this Z to carry out the randomization test. This provides an excellent example of our main thesis that randomization inference should be based exactly on the randomization introduced by the experimenter and the analyst.
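The rejection sampling view can be sketched in a few lines (the covariates and imbalance measure below are hypothetical):

    import random

    # Rerandomization: draw Z until the imbalance g(Z) is within tolerance.
    # The analysis then uses the randomization distribution restricted to
    # {z : g(z) <= eta}, here approximated by the same rejection sampler.
    def imbalance(z, x):
        t = [xi for zi, xi in zip(z, x) if zi == 1]
        c = [xi for zi, xi in zip(z, x) if zi == 0]
        return abs(sum(t) / len(t) - sum(c) / len(c))

    def rerandomize(x, eta, seed=3):
        rng = random.Random(seed)
        while True:
            z = [rng.randint(0, 1) for _ in range(len(x))]
            if 0 < sum(z) < len(x) and imbalance(z, x) <= eta:
                return z

    x = [0.5, -1.2, 0.3, 1.1, -0.4, 0.8]
    print(rerandomize(x, eta=0.2))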

Another way to deal with unlucky draws of the treatment assignment is to condition on the covariate imbalance g(Z) (Hennessy et al. 2016). This inspires our Proposition 1 in Section 2.5.1. In our terminology, this is a conditional randomization test because the analyst has the liberty to choose which function g(Z) to condition on. On the other hand, the randomization test proposed by Morgan and Rubin (2012) is unconditional.

4.5 Interference

Much of our theoretical development in Section 2 is inspired by an exploding literature on causal inference with interference (Hudgens and Halloran 2008; Bowers, Fredrickson, and Panagopoulos 2013; Aronow and Samii 2017; Athey, Eckles, and Imbens 2018; Li et al. 2019; Basse, Feller, and Toulis 2019; Puelz et al. 2019). The key challenge with interference is that not all potential outcomes are imputable. This can be tackled by conditioning on a smaller set of treatment assignments, as illustrated in Sections 2.5.2 and 2.5.3. Many methods in this literature can be regarded as innovative uses of the general theory outlined in Section 2. In particular, null hypotheses in this literature commonly have the level-set structure described in Definition 6, which significantly simplifies the construction of CRTs.

4.6 Evidence factors for observational studies

So far our discussion has been focused on randomized experiments, but the same theory also applies to randomization inference for observational studies (Rosenbaum 2002). Many observational studies are analyzed under the ignorability or no unmeasured confounders assumption that the treatment is randomly assigned after conditioning on some covariates X. In our notation, this means that Z ⊥⊥ W | X. However, the distribution of Z given X (often called the propensity score) is typically unknown and needs to be estimated (Rosenbaum and Rubin 1983).

Alternatively, one can match the treated units with control units to recreate a paired or blocked randomized experiment. In the simplest setup with a binary treatment and pair matching, each pair has one treated unit and one control unit with exactly the same covariates. The treatment vector can then be written as Z = (Z_1, ..., Z_N) (suppose N is even) and is supported on

Z = {z ∈ {0, 1}^N : z_1 + z_2 = 1, ..., z_{N−1} + z_N = 1}.

Typically, it is assumed that Z is uniformly distributed over Z, and a randomization test can be conducted in a straightforward way by permuting the treatment assignments within the pairs. The uniform distribution over Z can be justified by, for example, assuming that the distribution of Z_i only depends on X_i and that the Z_i are independent across i (Rosenbaum 2002, Section 3.2).¹¹
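
A minimal sketch (ours) of this within-pair permutation test, using the mean of the within-pair differences as an illustrative statistic:

```python
import numpy as np

def paired_randomization_test(z, y, n_draws=10000, rng=None):
    """Randomization test for a pair-matched design.

    Units 2k and 2k+1 (0-indexed) form pair k and exactly one of them is
    treated. Under the uniform distribution over Z, which unit in a pair
    is treated is a fair coin flip, so permuting treatment within a pair
    flips the sign of that pair's treated-minus-control difference.
    """
    rng = np.random.default_rng(rng)
    z = np.asarray(z, dtype=int)
    y = np.asarray(y, dtype=float)
    diffs = np.where(z[0::2] == 1, y[0::2] - y[1::2], y[1::2] - y[0::2])
    t_obs = abs(diffs.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_draws, diffs.size))
    t_null = np.abs((signs * diffs).mean(axis=1))
    return (1 + np.sum(t_null >= t_obs)) / (1 + n_draws)
```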

When there are concerns about unmeasured confounders, Rosenbaum (1987) and Rosenbaum (2002) introduced a general framework for sensitivity analysis of the ignorability assumption. In essence, Rosenbaum's sensitivity model allows the distribution of Z to be non-uniform over Z, but the “degree” of non-uniformness is bounded. More specifically, it assumes that

1/(1 + Γ) ≤ P(Z_i = 1) ≤ Γ/(1 + Γ), i = 1, ..., N, (19)

where Γ ≥ 1 is chosen by the analyst.

Let Π be the set of distributions of Z that satisfy (19). We denote the p-value in (1) as P(Z, Y; π) to emphasize that it depends on the unknown randomization distribution π of Z. For simplicity, we will assume that the p-value is always computable when discussing sensitivity analysis. Because each p-value P(Z, Y; π) satisfies (3) under the null hypothesis if π is indeed the distribution of Z, and the supremum below is no smaller than the p-value at the true π,

P(Z, Y) = sup_{π∈Π} P(Z, Y; π) (20)

satisfies (3) (that is, it stochastically dominates the uniform distribution over [0, 1]) whenever the actual distribution of Z is in Π. Rosenbaum (1987) and Rosenbaum (2002) provide analytical solutions to the optimization problem in (20) for a class of sign-score statistics.
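
The analytical solutions are beyond this sketch, but for a statistic that is monotone increasing in each pair's treatment indicator (such as a sign-score statistic with nonnegative scores), the supremum in (20) is attained by setting every assignment probability to its upper bound Γ/(1 + Γ). The following Monte Carlo sketch (ours, under that monotonicity assumption) computes the resulting p-value upper bound:

```python
import numpy as np

def sensitivity_pvalue_upper_bound(diffs, gamma, n_draws=10000, rng=None):
    """Monte Carlo upper bound over the sensitivity model (19).

    diffs : within-pair treated-minus-control differences.
    Statistic: sign-score statistic T = sum of |d_i| over pairs with
    d_i > 0. T is monotone increasing in each pair's "treated unit has
    the larger outcome" indicator, so the supremum in (20) is attained
    at the extreme probability gamma / (1 + gamma) in every pair.
    """
    rng = np.random.default_rng(rng)
    d = np.asarray(diffs, dtype=float)
    scores = np.abs(d)
    t_obs = scores[d > 0].sum()
    p_max = gamma / (1 + gamma)
    flips = rng.random((n_draws, d.size)) < p_max
    t_null = (flips * scores).sum(axis=1)
    return (1 + np.sum(t_null >= t_obs)) / (1 + n_draws)
```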

In subsequent works (Rosenbaum 2010; Rosenbaum 2011; Rosenbaum 2017), Rosenbaum introduced a concept for observational studies called “evidence factors”. This is closely related to our theory of multiple CRTs in Section 3. Loosely speaking, a causal theory can usually be represented by several hypotheses. For example, the conclusion “smoking does not cause lung cancer” entails at least two hypotheses: whether or not a person smokes does not alter the risk of lung cancer, and given that the person smokes, the amount of cigarettes he or she smokes does not alter the risk of lung cancer either. Therefore, the causal theory can be rejected if either hypothesis is rejected. When there are concerns about unmeasured confounders, we may represent the evidence factors by the p-values obtained from suitable sensitivity analyses. This typically requires us to specify a sensitivity model Π for each hypothesis and compute the upper bound in (20).

Rosenbaum's key observation is that in certain problems, these evidence factors (p-value upper bounds) are “independent” in the sense that they satisfy (13) when all the null hypotheses are true. This allows one to test the global null and the partial conjunction hypotheses using standard methods (Karmakar, French, and Small 2019). Rosenbaum (2010) and Rosenbaum (2011) provide some concrete examples of evidence factors, which essentially correspond to the sequential CRTs described in Section 3.3.2.

¹¹In practice, units within a pair are not always exactly matched, which complicates the randomization inference (Rosenbaum 2002, Section 3.6); see also Abadie and Imbens (2006) for the large sample properties of the matching estimator of the average treatment effect. Another complication omitted here is that the matched sets are typically constructed from a larger set of units. The standard permutation test can be regarded as a CRT that conditions on Z. However, our general theory in Section 2 may be inapplicable because Z depends on the treatment variables of a larger set of units and may not induce a partition. To our knowledge, this subtle issue has not been acknowledged in the literature yet.

Rosenbaum (2017) demonstrates that the permutation groups of a general class of “independent” evidence factors have a knit product structure. Similar to our remark on Southworth, Kim, and Owen (2009) in Section 4.1, we believe that the requirement that Z is a group or has a knit product structure is not necessary. Instead, the key is to construct sequential CRTs that satisfy the conditions in Proposition 6. Because the conditions in Proposition 6 do not rely on the distribution of Z, the key stochastic dominance result (13) holds for the p-values P^{(1)}(Z, Y; π), ..., P^{(K)}(Z, Y; π) for any distribution π of Z. By computing the p-value suprema as in (20), we see that the evidence factors satisfy the same stochastic dominance result if the sensitivity model is correct.

4.7 Conformal prediction

Conformal prediction is another topic related to randomization inference that is receiving growing interest in statistics and machine learning (Vovk, Gammerman, and Shafer 2005; Shafer and Vovk 2008; Lei, Robins, and Wasserman 2013; Lei, G'Sell, et al. 2018). Consider a typical regression problem where the data points (X_1, Y_1), ..., (X_N, Y_N) are drawn exchangeably from the same distribution and Y_N is unobserved. We would like to construct a prediction interval Ĉ(X_N) for the next observation (X_N, Y_N) such that

P(Y_N ∈ Ĉ(X_N)) ≥ 1 − α.

The key idea of conformal inference is that the exchangeability of (X_1, Y_1), ..., (X_N, Y_N) allows us to test the null hypothesis H_0 : Y_N = y using permutation tests. More concretely, we may fit any regression model to (X_1, Y_1), ..., (X_{N−1}, Y_{N−1}), (X_N, y) and let the p-value be one minus the percentile of the absolute residual of (X_N, y) among all N absolute regression residuals. Intuitively, a large residual indicates that (X_N, y) “conforms” poorly with the other observations, so H_0 : Y_N = y should be rejected.
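
A minimal sketch (ours) of this construction, using simple linear regression as the illustrative choice of regression model; the prediction set is obtained by inverting the test over a grid of candidate values:

```python
import numpy as np

def conformal_pvalue(x, y, x_new, y_candidate):
    """p-value for H0: Y_N = y_candidate, with simple linear regression
    as the (illustrative) choice of regression model."""
    xs = np.append(np.asarray(x, float), x_new)
    ys = np.append(np.asarray(y, float), y_candidate)
    slope, intercept = np.polyfit(xs, ys, 1)      # fit on augmented data
    resid = np.abs(ys - (intercept + slope * xs))
    # Proportion of absolute residuals at least as large as the candidate's.
    return np.mean(resid >= resid[-1])

def conformal_interval(x, y, x_new, grid, alpha=0.1):
    """Invert the test over a grid of candidate values for Y_N; the
    retained values form a prediction set with coverage >= 1 - alpha."""
    return [v for v in grid if conformal_pvalue(x, y, x_new, v) > alpha]
```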

We argue that conformal prediction is a special instance of the randomization inference described in Section 2. The key idea is to view random sampling as a kind of randomization. Notice that conformal prediction implicitly conditions on the unordered data points (X_1, Y_1), ..., (X_N, Y_N). To make this more explicit, let the treatment variable Z be the order of the samples, which is a random permutation of [N] or, equivalently, a random bijection from [N] to [N]. The “potential outcomes” are given by

Y(z) = ((X_{z(1)}, Y_{z(1)}), ..., (X_{z(N)}, Y_{z(N)})), (21)

where Y_{z(N)} is unobserved (that is, the observed Y is the first 2N − 1 elements of Y(Z)). The notation Y is overloaded here to be consistent with Section 2. The “potential outcomes schedule” in our setup is W = {Y(z) : z is a permutation of [N]}.

The null hypothesis in conformal prediction can be represented by a regression model f(x) such that the missing Y_{z(N)} is imputed by f(X_{z(N)}). This is a sharp null hypothesis, and our theory in Section 2 can be directly used to construct a randomization test. Notice that the key randomization assumption Z ⊥⊥ W in our theory (Assumption 1) is justified here by the assumption that (X_1, Y_1), ..., (X_N, Y_N) are exchangeable. In other words, we may view random sampling as a form of randomizing the order of the observations.

The above argument can be extended to allow “covariate shift” in the next observation X_N (Tibshirani et al. 2019; Hu and Lei 2020). The key idea is, again, to consider the randomization involved in sampling. For this we need to consider a (potentially infinite) super-population (X_i, Y_i)_{i∈I}. The “treatment” Z : [N] → I selects which units are observed. For example, Z(1) = i means that (X_i, Y_i) is the first data point. The “potential outcomes” are still given by (21). By conditioning on the unordered Z, we can obtain a conditional randomization test for the unobserved Y_{z(N)}. Covariate shift in X_N can be easily incorporated by conditioning on (X_{Z(1)}, ..., X_{Z(N)}) and deriving the conditional distribution of Z. For example, suppose the first N − 1 units are sampled from the population according to a covariate distribution π_1(x) and the last unit is sampled according to another covariate distribution π_2(x), that is,

P(Z(k) = i) ∝ π_1(X_i) for k = 1, ..., N − 1, i ∈ I, (22)
P(Z(N) = i) ∝ π_2(X_i) for i ∈ I.

For simplicity, we assume the data points are sampled without replacement. Then the unordered Z = {Z(1), ..., Z(N)} is a set S = {s_1, ..., s_N} of N distinct units. Let X = (X_{s_1}, ..., X_{s_N}). The conditional randomization distribution of Z is given by

P(Z(N) = s_i | X, S) ∝ π_2(X_{s_i}) ∏_{j=1}^{N−1} π_1(X_{z(j)}) ∝ π_2(X_{s_i}) / π_1(X_{s_i}), i = 1, ..., N,

by using the fact that the product ∏_{j=1}^{N} π_1(X_{z(j)}) = ∏_{j=1}^{N} π_1(X_{s_j}) only depends on S. By normalizing the probabilities, we obtain

P(Z(N) = s_i | X, S) = [π_2(X_{s_i}) / π_1(X_{s_i})] / [∑_{j=1}^{N} π_2(X_{s_j}) / π_1(X_{s_j})], i = 1, ..., N.

The last display equation was termed “weighted exchangeability” by Tibshirani et al. (2019). Again, we would like to stress that exchangeability (unweighted or weighted) is not necessary here. What is necessary is a careful consideration of the randomization Z in the problem, which could be either explicit as in a physical intervention, or implicit as in random sampling. Once the randomization distribution of Z is given and a suitable conditioning set S_Z is found, the rest is just a straightforward application of the general conditional randomization test. For example, sampling schemes more general than (22) can in principle be allowed, although the conditional randomization distribution might not be as simple as above.
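
For concreteness, a sketch (ours) of the resulting weighted conformal p-value: given the absolute residuals of the N units (candidate point last) and the likelihood ratios w_i = π_2(X_i)/π_1(X_i), the randomization distribution above assigns probability proportional to w_i to unit i being the last draw.

```python
import numpy as np

def weighted_conformal_pvalue(abs_resid, lik_ratio):
    """Weighted conformal p-value under the covariate shift model (22).

    abs_resid : absolute regression residuals for the N units, with the
        candidate point (X_N, y) in the last position.
    lik_ratio : w_i = pi_2(X_i) / pi_1(X_i) for each of the N units.
    """
    r = np.asarray(abs_resid, dtype=float)
    w = np.asarray(lik_ratio, dtype=float)
    p = w / w.sum()
    # Randomization probability of a residual at least as large as the
    # candidate's: this is the conformal p-value for H0: Y_N = y.
    return float(np.sum(p[r >= r[-1]]))
```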

5 Example: Testing lagged treatment effect in stepped-wedge randomized trials

We further demonstrate the versatility of conditional randomization tests using a relatively complex study design. The stepped-wedge randomized trial is a monotonic cross-over design in which all units start out in the control condition and then cross over to the treatment at staggered times (Group 1987; Copas et al. 2015). Figure 1 shows the treatment schedule for a typical stepped-wedge trial; the name “stepped wedge” refers to the wedge shape given by the treated groups between two subsequent crossover points. This design is also known as staggered adoption in the econometrics literature (Athey and Imbens 2018; Abraham and Sun 2018). In this section, we assume that cross-over times are randomized at the unit level. Many stepped-wedge trials are instead randomized in a clustered fashion, i.e., the units in the same cluster always have the same treatment status. The method below still applies to the clustered design by aggregating the outcomes of the units in the same cluster (Middleton and Aronow 2015).


Figure 1: Stepped-wedge randomized trials. A group of units or clusters switch from control to treatment at each time point and then remain exposed to the treatment. This figure shows the treatment status of the units right before the treatment assignment at each time point.

Consider a stepped-wedge trial on N units over an evenly spaced time grid [T] = {1, ..., T}. Let N_1, ..., N_T be positive integers that satisfy ∑_{t=1}^{T} N_t = N. The cross-overs can be represented by a binary matrix Z ∈ {0, 1}^{N×T}, where Z_{it} = 1 indicates that unit i crosses over from control to treatment at time t. We assume that the treatment assignment mechanism is given by the uniform distribution over

Z = {z ∈ {0, 1}^{N×T} : z 1_T = 1_N, 1_N^⊤ z = (N_1, ..., N_T)}. (23)

The set Z contains all binary matrices with exactly one entry equal to 1 in each row and N_t entries equal to 1 in the t-th column, so |Z| = N! / (N_1! ··· N_T!).

We can view the uniform distribution over Z as a sequentially randomized experiment. This key observation allows us to invoke the theory in Section 3. Let Z_t be the t-th column of Z. It is not difficult to show that the uniform distribution of Z over Z is equivalent to randomly assigning the treatment to N_t control units at each time t = 1, ..., T. Denote z_{[t]} = (z_1, ..., z_t) ∈ {0, 1}^{N×t} and N_{[t]} = ∑_{s=1}^{t} N_s. More precisely, we may decompose the uniform distribution π(z) over z ∈ Z as

π(z) = (N_1! ··· N_T!) / N! = π_1(z_1) π_2(z_2 | z_1) ··· π_T(z_T | z_{[T−1]}),

where

π_t(z_t | z_{[t−1]}) = N_t! (N − N_{[t]})! / (N − N_{[t−1]})! for all z_t ∈ {0, 1}^N such that (1_N − z_{[t−1]} 1_{t−1})^⊤ z_t = N_t.
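
To make the sequential decomposition concrete, the following sketch (ours) draws Z uniformly from Z in (23) by assigning treatment to N_t of the still-untreated units at each time t, which reproduces the factorization π_1 π_2 ··· π_T above:

```python
import numpy as np

def draw_stepped_wedge(N_t, rng=None):
    """Draw Z uniformly from the set (23) by sequential randomization.

    N_t : sequence (N_1, ..., N_T) of column sums with sum N.
    At each time t, N_t of the not-yet-treated units are chosen
    uniformly at random to cross over, matching pi_t in the text.
    """
    rng = np.random.default_rng(rng)
    N, T = sum(N_t), len(N_t)
    Z = np.zeros((N, T), dtype=int)
    untreated = np.arange(N)
    for t, n_t in enumerate(N_t):
        chosen = rng.choice(untreated, size=n_t, replace=False)
        Z[chosen, t] = 1
        untreated = np.setdiff1d(untreated, chosen)
    return Z
```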

After the cross-overs at time t, the experimenter then measures the outcomes of all units, which we denote as Y_t ∈ ℝ^N. Let Y = (Y_1, ..., Y_T) denote all the realized outcomes and Y(z) denote the collection of potential outcomes under treatment assignment z ∈ Z. Like before, we maintain the consistency assumption that Y = Y(Z).

Recall that a null hypothesis H is fully sharp if its imputability mapping satisfies H(z, z^*) = [N] for all z, z^* ∈ Z (Section 2.2). An example of a fully sharp null hypothesis in this setting is that the treatment has no effect whatsoever, i.e.,

H : Y_{it}(z) = Y_{it}(z^*), ∀ z, z^* ∈ Z, i ∈ [N], and t ∈ [T]. (24)

This hypothesis can be tested by an unconditional randomization test in a straightforward manner. Because Z is completely randomized, this is also a permutation test and has already been considered in the literature (Ji et al. 2017; Wang and De Gruttola 2017; Hughes et al. 2020). The test statistics are usually obtained by fitting some linear mixed-effects model.

However, this sharp null hypothesis is rather restrictive. In particular, the staggered treatment assignment offers an opportunity to investigate how soon the treatment takes effect, which is ignored in (24). This is what we will consider in the rest of this section. To make the problem more tractable, we assume no interference and no anticipation effect (Athey and Imbens 2018) in the following sense:

Assumption 2. For all i ∈ [N] and t ∈ [T], Y_{it}(z) only depends on z through z_{i,[t]}.

The partially sharp hypothesis that we will consider assumes that the lag-l treatment effect is equal to a constant τ_l (where l is a fixed integer between 0 and T − 1). More precisely, consider the following sequence of hypotheses

H^{(t)} : Y_{i,t+l}((0_{t−1}, 1, 0_l)) − Y_{i,t+l}(0_{t+l}) = τ_l, ∀ i ∈ [N], (25)

for t = 1, ..., T − l. The hypothesis H^{(t)} states that treatment cross-over at time t has a constant effect τ_l on the outcome at time t + l for all the units. We are interested in testing the intersection of H^{(1)}, ..., H^{(T−l)}.

Notice that H^{(t)} is not fully sharp. In fact, if we denote the imputability mapping (see Definition 1) of H^{(t)} as H^{(t)}(z, z^*), then unit i ∈ H^{(t)}(z, z^*) if and only if i crosses over either at time t or after time t + l under both assignments, i.e.,

H^{(t)}(z, z^*) = {i : z_{i,[t+l]}, z^*_{i,[t+l]} ∈ {(0_{t−1}, 1, 0_l), 0_{t+l}}}.

Thus H^{(t)}(z, z^*) has a level-set structure (recall Definition 6) and is usually a small subset of [N]. This motivates us to condition on the units that have nontrivial imputable potential outcomes,

g^{(t)}(Z) = {i ∈ [N] : Z_{i,[t+l]} = (0_{t−1}, 1, 0_l) or 0_{t+l}}, (26)

and use Proposition 1.
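
A sketch (ours) of the resulting CRT for a single H^{(t)}: restrict to the units in g^{(t)}(Z), impute the outcomes under H^{(t)} by subtracting τ_l from the units that crossed over at time t, and permute the cross-over labels within the conditioning set (the nested variant described below merely restricts the comparison group further).

```python
import numpy as np

def lagged_effect_crt(Z, Y, t, l, tau, n_draws=5000, rng=None):
    """Sketch of the CRT for H^(t) in (25), conditioning on g^(t)(Z) in (26).

    Z : (N, T) binary cross-over matrix; Y : (N, T) outcome matrix.
    Times are 1-indexed in the text and mapped to 0-indexed columns here.
    """
    rng = np.random.default_rng(rng)
    cross_at_t = Z[:, t - 1] == 1              # Z_{i,[t+l]} = (0_{t-1}, 1, 0_l)
    never_yet = Z[:, :t + l].sum(axis=1) == 0  # Z_{i,[t+l]} = 0_{t+l}
    cond = cross_at_t | never_yet              # the conditioning set g^(t)(Z)
    labels = cross_at_t[cond]
    # Impute the untreated potential outcomes at time t + l under H^(t).
    y = Y[cond, t + l - 1] - tau * labels
    t_obs = y[labels].mean() - y[~labels].mean()
    t_null = np.empty(n_draws)
    for b in range(n_draws):
        perm = rng.permutation(labels)         # uniform given g^(t)(Z)
        t_null[b] = y[perm].mean() - y[~perm].mean()
    return (1 + np.sum(np.abs(t_null) >= abs(t_obs))) / (1 + n_draws)
```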

However, the corresponding conditioning sets for H^{(1)}, ..., H^{(T−l)} do not satisfy the nested condition (10) in Theorem 2 when the lag l ≥ 1. To see this, consider the case of l = 1. Each CRT is essentially a permutation test that compares the time t + 1 outcomes of the units that cross over at time t with those of the units that cross over after time t + 1. More specifically,

• The first CRT compares units that cross over at time 1 and units that cross over at time 3, 4, ...;

• The second CRT compares units that cross over at time 2 and units that cross over at time 4, 5, ...;

and so on (see Figure 2a for an illustration). It is easy to see that the corresponding conditioning sets are not completely nested, so the independence of the CRTs cannot be established.

To resolve this problem, we can gain some inspiration from the l = 0 case. The conditioning variable g^{(t)}(Z) in (26) actually induces nested CRTs for l = 0, as

• The first CRT compares units that cross over at time 1 and units that cross over at time 2, 3, ...;

• The second CRT compares units that cross over at time 2 and units that cross over at time 3, 4, ...;

and so on. Intuitively, the first CRT utilizes the randomness in Z_1, the second CRT utilizes the randomness in Z_2 given Z_1, and so on. It is easy to see that the conditions in Proposition 7 are satisfied if the t-th CRT only depends on Z through Z_t.

This inspires us to modify the tests for the l = 1 case as follows:

• The first CRT compares units that cross over at time 1 and units that cross over at time 3, 5, ...;

Figure 2: Illustration of the CRTs for lagged treatment effect in a stepped-wedge randomized trial with T = 7 time points (lag l = 1). (a) Non-nested CRTs. (b) Nested CRTs.

• The second CRT compares units that cross over at time 2 and units that cross over at time 4, 6, ...;

• The third CRT compares units that cross over at time 3 and units that cross over at time 5, 7, ...;

and so on (see Figure 2b for an illustration). In other words, by considering smaller conditioning sets, the odd tests become a sequence of nested CRTs and so do the even tests. Thus, the nesting condition (10) in Theorem 2 holds. Moreover, notice that all the CRTs condition on

g(Z) = {i ∈ [N] : Z_{it} = 1 for some odd t},

and it is not difficult to show that

(Z_1, Z_3, ...) ⊥⊥ (Z_2, Z_4, ...) | g(Z). (27)

For every t, the t-th CRT only depends on Z through Z_t and uses the randomization distribution of Z_t given Z_{[t−1]} and g(Z). It is then straightforward to verify that any two odd tests are conditionally independent and thus satisfy condition (11) in Theorem 2. Similarly, (11) also holds for the even tests. Finally, any pair of one odd and one even test satisfies (11) due to (27).

By construction, these p-values are computable. Therefore, given that the hypotheses H^{(1)}, ..., H^{(T−l)} are true, the p-values of the modified CRTs satisfy (13) and can be combined by standard global testing methods such as Fisher's combination method (Fisher 1925).
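
For completeness, a minimal sketch (ours) of Fisher's combination method; under (13), the statistic −2 ∑_k log P^{(k)} is stochastically dominated by a χ² distribution with 2K degrees of freedom, so the combined p-value remains valid.

```python
import numpy as np
from scipy import stats

def fisher_combination(pvalues):
    """Combine p-values satisfying (13) with Fisher's method."""
    p = np.asarray(pvalues, dtype=float)
    statistic = -2.0 * np.sum(np.log(p))
    # Survival function of a chi-squared with 2K degrees of freedom.
    return stats.chi2.sf(statistic, df=2 * p.size)
```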

The main idea of the example in this section is that we can further restrict the conditioning sets when the most obvious conditioning sets are not nested. This argument can be easily extended to a longer time lag l. When l = 1, we divided [T] into two subsets of cross-over times (odd and even). To test the lag-l effect for l > 1, we can simply increase the gap and divide [T] into l + 1 disjoint subsets: {1, l + 2, 2l + 3, ...}, {2, l + 3, 2l + 4, ...}, .... Using the same argument as in the l = 1 case above, we can verify the two key conditions in Theorem 2; further details are omitted. A remaining question is how to combine the independent CRTs in an efficient way (Heard and Rubin-Delanchy 2018), which we leave as future work.

6 Discussion

As described in Section 1, the main thesis of this article is that randomization inference is a mode of statistical inference that is based on randomization and nothing more than randomization. This is made precise by trichotomizing the randomness in the data and analysis and conditioning on the potential outcomes schedule (the randomness in nature). Through a number of examples, we demonstrate how conditional randomization tests can be applied to classical and modern problems. This sometimes requires an artificial definition of counterfactuals (e.g., Section 4.2) or viewing random sampling as a form of randomization (e.g., Section 4.7). Some might argue that such a change of perspective is unnecessary, but we believe it provides a unified understanding of the literature and makes the basis of the inference clear. After all, it is easy for us statisticians to assume the data are independent and identically distributed and forget that this is only a theoretical model. On this, Fisher (1922) offered the following sobering remark: “The postulate of randomness thus resolves itself into the question, ‘Of what population is this a random sample?’ which must frequently be asked by every practical statistician.”

The framework introduced in Sections 2 and 3 generalizes the existing ideas mentioned in Section 1. In Section 2 we have described three perspectives on conditioning in randomization tests: conditioning on a set of treatment assignments, conditioning on a σ-algebra, and conditioning on a random variable. They are useful for different purposes: the first perspective is useful when the null hypothesis is fully sharp and one needs to construct a computable p-value; the second perspective is useful for describing the nested structure of multiple CRTs in Section 3; and the third perspective is the most general and allows us to consider post-experimental randomization.

We have collected many useful practical techniques for applications in Sections 2.5 and 3.3. Another common technique not mentioned above is sample splitting (a tangential example is the separation of units treated at odd and even times in Section 5). These techniques are useful when the most obvious randomization tests are not computable, not nested, or not independent, thereby greatly expanding the scope of randomization inference. However, it is

not always easy to construct a good conditional randomization test. Echoing Brillinger, Jones, and Tukey (1978, page F-1) and Rubin (1980), we think randomization inference should be viewed as a (if not the most) pristine way of inference for a very specific task—testing a sharp null hypothesis. It is a powerful weapon for statisticians, but certainly not the ultimate weapon.

References

[1] Alberto Abadie and Guido W Imbens. “Large sample properties of matching estimators for average treatment effects”. In: Econometrica 74.1 (2006), pp. 235–267.
[2] Sarah Abraham and Liyang Sun. “Estimating dynamic treatment effects in event studies with heterogeneous treatment effects”. In: Available at SSRN 3158747 (2018).
[3] Peter M Aronow. “A general method for detecting interference between units in randomized experiments”. In: Sociological Methods & Research 41.1 (2012), pp. 3–16.
[4] Peter M Aronow and Cyrus Samii. “Estimating average causal effects under general interference, with application to a social network experiment”. In: The Annals of Applied Statistics 11.4 (2017), pp. 1912–1947.
[5] Susan Athey, Dean Eckles, and Guido W Imbens. “Exact p-values for network interference”. In: Journal of the American Statistical Association 113.521 (2018), pp. 230–240.
[6] Susan Athey and Guido W Imbens. Design-based analysis in difference-in-differences settings with staggered adoption. Tech. rep. National Bureau of Economic Research, 2018.
[7] A.V. Banerjee, S. Chassang, and E. Snowberg. “Decision Theoretic Approaches to Experiment Design and External Validity”. In: Handbook of Field Experiments. Ed. by Abhijit Vinayak Banerjee and Esther Duflo. Vol. 1. Handbook of Economic Field Experiments. North-Holland, 2017, pp. 141–174. doi: 10.1016/bs.hefe.2016.08.005.
[8] GW Basse, A Feller, and P Toulis. “Randomization tests of causal effects under interference”. In: Biometrika 106.2 (2019), pp. 487–494.
[9] D. Basu. “Randomization Analysis of Experimental Data: The Fisher Randomization Test”. In: Journal of the American Statistical Association 75.371 (1980), pp. 575–582. doi: 10.1080/01621459.1980.10477512.
[10] Stephen Bates, Matteo Sesia, Chiara Sabatti, and Emmanuel Candès. “Causal Inference in Genetic Trio Studies”. In: Proceedings of the National Academy of Sciences 117.39 (2020), pp. 24117–24126. doi: 10.1073/pnas.2007743117.
[11] Thomas B Berrett, Yi Wang, Rina Foygel Barber, and Richard J Samworth. “The conditional permutation test for independence while controlling for confounders”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82.1 (2020), pp. 175–197.

[12] Jake Bowers, Mark M. Fredrickson, and Costas Panagopoulos. “Reasoning About Interference Between Units: A General Framework”. In: Political Analysis 21.1 (2013), pp. 97–124. doi: 10.1093/pan/mps038.
[13] D R Brillinger, L V Jones, and J W Tukey. “The role of statistics in weather resources management”. In: The Management of Weather Resources. Washington DC: Government Printing Office, 1978.
[14] Emmanuel Candès, Yingying Fan, Lucas Janson, and Jinchi Lv. “Panning for Gold: ‘model-X’ Knockoffs for High Dimensional Controlled Variable Selection”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80.3 (2018), pp. 551–577. doi: 10.1111/rssb.12265.
[15] Andrew J Copas, James J Lewis, Jennifer A Thompson, Calum Davey, Gianluca Baio, and James R Hargreaves. “Designing a stepped wedge trial: three main designs, carry-over effects and randomisation approaches”. In: Trials 16.1 (2015), p. 352.
[16] Peng Ding, Avi Feller, and Luke Miratrix. “Randomization Inference for Treatment Effect Variation”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78.3 (2015), pp. 655–671. doi: 10.1111/rssb.12124.
[17] Bradley Efron, Robert Tibshirani, John D Storey, and Virginia Tusher. “Empirical Bayes analysis of a microarray experiment”. In: Journal of the American Statistical Association 96.456 (2001), pp. 1151–1160.
[18] Michael D. Ernst. “Permutation Methods: A Basis for Exact Inference”. In: Statistical Science 19.4 (2004). doi: 10.1214/088342304000000396.
[19] Brian Everitt and Anders Skrondal. The Cambridge Dictionary of Statistics. Cambridge: Cambridge University Press, 2002.
[20] Ronald A Fisher. “On the Mathematical Foundations of Theoretical Statistics”. In: Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 222.594-604 (1922), pp. 309–368. doi: 10.1098/rsta.1922.0009.
[21] Ronald A Fisher. Statistical Methods and Scientific Inference. Edinburgh: Oliver & Boyd, 1956.
[22] Ronald A Fisher. “The arrangement of field experiments”. In: Journal of the Ministry of Agriculture of Great Britain 33 (1926), pp. 503–513.
[23] Ronald Aylmer Fisher. Design of Experiments. Edinburgh: Oliver and Boyd, 1935.
[24] Ronald Aylmer Fisher. Statistical Methods for Research Workers. Edinburgh and London: Oliver and Boyd, 1925.
[25] David A Freedman. Statistical Models: Theory and Practice. Cambridge University Press, 2009.
[26] Gambia Hepatitis Study Group. “The Gambia hepatitis intervention study”. In: Cancer Research 47.21 (1987), pp. 5782–5787.
[27] Jaroslav Hájek, Zbyněk Šidák, and Pranab K. Sen. Theory of Rank Tests. Second Edition. Probability and Mathematical Statistics. San Diego: Academic Press, 1999. doi: 10.1016/B978-012642350-1/50018-7.

[28] Nicholas A Heard and Patrick Rubin-Delanchy. “Choosing between methods of combining p-values”. In: Biometrika 105.1 (2018), pp. 239–246.
[29] James J Heckman and Ganesh Karapakula. The Perry Preschoolers at Late Midlife: A Study in Design-Specific Inference. Tech. rep. National Bureau of Economic Research, 2019.
[30] Jonathan Hennessy, Tirthankar Dasgupta, Luke Miratrix, Cassandra Pattanayak, and Pradipta Sarkar. “A conditional randomization test to account for covariate imbalance in randomized experiments”. In: Journal of Causal Inference 4.1 (2016), pp. 61–80.
[31] Paul W Holland. “Statistics and causal inference”. In: Journal of the American Statistical Association 81.396 (1986), pp. 945–960.
[32] Xiaoyu Hu and Jing Lei. “A Distribution-Free Test of Covariate Shift Using Conformal Prediction”. In: CoRR (2020). arXiv: 2010.07147 [stat.ME].
[33] Michael G Hudgens and M. Elizabeth Halloran. “Toward Causal Inference With Interference”. In: Journal of the American Statistical Association 103.482 (2008), pp. 832–842. doi: 10.1198/016214508000000292.
[34] James P Hughes, Patrick J Heagerty, Fan Xia, and Yuqi Ren. “Robust inference for the stepped wedge design”. In: Biometrics 76.1 (2020), pp. 119–130.
[35] Guido W Imbens and Donald B Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
[36] Xinyao Ji, Gunther Fink, Paul Jacob Robyn, and Dylan S Small. “Randomization inference for stepped-wedge cluster-randomized trials: an application to community-based health insurance”. In: The Annals of Applied Statistics 11.1 (2017), pp. 1–20.
[37] John D. Kalbfleisch. “Likelihood Methods and Nonparametric Tests”. In: Journal of the American Statistical Association 73.361 (1978), pp. 167–170. doi: 10.1080/01621459.1978.10480021.
[38] B Karmakar, B French, and DS Small. “Integrating the evidence from evidence factors in observational studies”. In: Biometrika 106.2 (2019), pp. 353–367.
[39] Eugene Katsevich and Aaditya Ramdas. “A Theoretical Treatment of Conditional Independence Testing Under Model-X”. In: (2020). arXiv: 2005.05506 [math.ST].
[40] Oscar Kempthorne. The Design and Analysis of Experiments. New York: John Wiley & Sons, 1952.
[41] Erich L Lehmann and Joseph P Romano. Testing Statistical Hypotheses. Springer Science & Business Media, 2006.
[42] Erich Leo Lehmann. Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, Inc., 1975.
[43] Jing Lei, Max G'Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. “Distribution-Free Predictive Inference for Regression”. In: Journal of the American Statistical Association 113.523 (2018), pp. 1094–1111. doi: 10.1080/01621459.2017.1307116.

[44] Jing Lei, James Robins, and Larry Wasserman. “Distribution-Free Prediction Sets”. In: Journal of the American Statistical Association 108.501 (2013), pp. 278–287. doi: 10.1080/01621459.2012.751873.
[45] Xinran Li, Peng Ding, Qian Lin, Dawei Yang, and Jun S. Liu. “Randomization Inference for Peer Effects”. In: Journal of the American Statistical Association 114.528 (2019), pp. 1651–1664. doi: 10.1080/01621459.2018.1512863.
[46] Molei Liu, Eugene Katsevich, Lucas Janson, and Aaditya Ramdas. “Fast and Powerful Conditional Randomization Testing Via Distillation”. In: (2020). arXiv: 2006.03980 [stat.ME].
[47] Charles F. Manski. “Identification of Treatment Response With Social Interactions”. In: The Econometrics Journal 16.1 (2013), S1–S23. doi: 10.1111/j.1368-423x.2012.00368.x.
[48] Joel A Middleton and Peter M Aronow. “Unbiased estimation of the average treatment effect in cluster-randomized experiments”. In: Statistics, Politics and Policy 6.1-2 (2015), pp. 39–75.
[49] Kari Lock Morgan and Donald B Rubin. “Rerandomization to improve covariate balance in experiments”. In: Annals of Statistics 40.2 (2012), pp. 1263–1282.
[50] Jerzy S Neyman. “On the application of probability theory to agricultural experiments. Essay on principles. Section 9.” In: Annals of Agricultural Sciences 10 (1923), pp. 1–51. (Translated to English and edited by D. M. Dabrowska and T. P. Speed, Statistical Science (1990), 5, 465–480.)
[51] Edwin James George Pitman. “Significance tests which may be applied to samples from any populations. II. The correlation coefficient test”. In: Supplement to the Journal of the Royal Statistical Society 4.2 (1937), pp. 225–232.
[52] David Pollard. A User's Guide to Measure Theoretic Probability. Cambridge University Press, 2002.
[53] David Puelz, Guillaume Basse, Avi Feller, and Panos Toulis. “A Graph-Theoretic Approach To Randomization Tests of Causal Effects Under General Interference”. In: (2019). arXiv: 1910.10862 [stat.ME].
[54] Jeffrey Roach and William Valdar. “Permutation tests of non-exchangeable null models”. In: arXiv preprint arXiv:1808.10483 (2018).
[55] Paul R Rosenbaum. “Conditional permutation tests and the propensity score in observational studies”. In: Journal of the American Statistical Association 79.387 (1984), pp. 565–574.
[56] Paul R Rosenbaum. “Evidence factors in observational studies”. In: Biometrika 97.2 (2010), pp. 333–345.
[57] Paul R Rosenbaum. Observational Studies. Springer, 2002.
[58] Paul R Rosenbaum. “Sensitivity analysis for certain permutation inferences in matched observational studies”. In: Biometrika 74.1 (1987), pp. 13–26.
[59] Paul R Rosenbaum. “Some approximate evidence factors in observational studies”. In: Journal of the American Statistical Association 106.493 (2011), pp. 285–295.

[60] Paul R Rosenbaum. “The general structure of evidence factors in observational studies”. In: Statistical Science 32.4 (2017), pp. 514–530.
[61] Paul R Rosenbaum and Donald B Rubin. “The central role of the propensity score in observational studies for causal effects”. In: Biometrika 70.1 (1983), pp. 41–55.
[62] Donald B Rubin. “Causal inference using potential outcomes: Design, modeling, decisions”. In: Journal of the American Statistical Association 100.469 (2005), pp. 322–331.
[63] Donald B Rubin. “Estimating causal effects of treatments in randomized and nonrandomized studies.” In: Journal of Educational Psychology 66.5 (1974), p. 688.
[64] Donald B. Rubin. “Comment on ‘Randomization Analysis of Experimental Data: The Fisher Randomization Test’”. In: Journal of the American Statistical Association 75.371 (1980), p. 591. doi: 10.2307/2287653.
[65] Glenn Shafer and Vladimir Vovk. “A Tutorial on Conformal Prediction”. In: Journal of Machine Learning Research 9.12 (2008), pp. 371–421.
[66] Lucinda K Southworth, Stuart K Kim, and Art B Owen. “Properties of balanced permutations”. In: Journal of Computational Biology 16.4 (2009), pp. 625–638.
[67] Wesley Tansey, Victor Veitch, Haoran Zhang, Raul Rabadan, and David M. Blei. “The Holdout Randomization Test for Feature Selection in Black Box Models”. In: (2018). arXiv: 1811.00645 [stat.ME].
[68] Ryan J Tibshirani, Rina Foygel Barber, Emmanuel Candes, and Aaditya Ramdas. “Conformal Prediction Under Covariate Shift”. In: Advances in Neural Information Processing Systems. Vol. 32. Curran Associates, Inc., 2019.
[69] Johan Ugander, Brian Karrer, Lars Backstrom, and Jon Kleinberg. “Graph cluster randomization: Network exposure to multiple universes”. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2013, pp. 329–337.
[70] Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer Science & Business Media, 2005.
[71] Rui Wang and Victor De Gruttola. “The use of permutation tests for the analysis of parallel and stepped-wedge cluster-randomized trials”. In: Statistics in Medicine 36.18 (2017), pp. 2831–2843.
[72] B L Welch. “On the z-test in randomized blocks and Latin squares”. In: Biometrika 29.1/2 (1937), pp. 21–52.
[73] Frank Wilcoxon. “Individual Comparisons By Ranking Methods”. In: Biometrics Bulletin 1.6 (1945), p. 80. doi: 10.2307/3001968.

A Technical proofs

A.1 Proof of Lemma 2

Proof. By Definition 1, let W_Z = (Y_{H(Z, z^*)}(z^*) : z^* ∈ S_Z) denote all the potential outcomes imputable from Y = Y(Z) under H. Then W_Z is fully determined by Z and Y. By assumption, T_Z(z^*, W) only depends on W through W_Z for all z^* ∈ S_Z. Thus, T_Z(z^*, W) is a function of z^*, Z, and Y. With an abuse of notation, we denote it as T_Z(z^*, Y). By the definition of the p-value in (1) and Z^* ⊥⊥ Z ⊥⊥ W,

P(Z, W) = P^*{T_Z(Z^*, W) ≤ T_Z(Z, W) | Z^* ∈ S_Z, W}
        = P^*{T_Z(Z^*, Y) ≤ T_Z(Z, Y) | Z^* ∈ S_Z, W}
        = P^*{T_Z(Z^*, Y) ≤ T_Z(Z, Y) | Z^* ∈ S_Z}

is a function of Z and Y (since P^* only depends on the known density π(·)).

A.2 Proof of probability integral transform

Lemma 7. Let T be a random variable and F (t) = P(T ≤ t) be its distribution function. Then F (T ) has a distribution that stochastically dominates the uniform distribution on [0, 1].

Proof. Let F^{−1}(α) = sup{t ∈ ℝ : F(t) ≤ α}. We claim that P{F(T) ≤ α} = P{T < F^{−1}(α)}; this can be verified by considering whether T has a positive mass at F^{−1}(α) (equivalently, by considering whether F(t) jumps at t = F^{−1}(α)). By using the fact that F(t) is non-decreasing and right-continuous, we have

P{F(T) ≤ α} = P{T < F^{−1}(α)} = lim_{t ↑ F^{−1}(α)} F(t) ≤ α.
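
For intuition, here is a small worked example of our own showing the role of atoms: let T ∼ Bernoulli(1/2), so that F(0) = 1/2 and F(1) = 1. Then F(T) takes the values 1/2 and 1 with probability 1/2 each, and

P{F(T) ≤ α} = 0 for α < 1/2,  P{F(T) ≤ α} = 1/2 for 1/2 ≤ α < 1,  P{F(T) ≤ α} = 1 for α = 1,

which is at most α in every case. The atoms of T create the slack in the stochastic dominance.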

A.3 Proof of Lemma 3

Proof. Consider S ∈ R^{(j)} and S′ ∈ R^{(k)} for some j, k ∈ [K] with S ∩ S′ ≠ ∅. By the definition of refinement and (10), S ∩ S′ ∈ R^{{j,k}} ⊆ R^{(j)} ∪ R^{(k)}, so there exists an integer m such that S ∩ S′ = S_m^{(j)} or S_m^{(k)}. Because R^{(j)} and R^{(k)} are partitions, this means that S = S_m^{(j)} or S′ = S_m^{(k)}. In either case, S ∩ S′ = S or S′.

Now consider any z ∈ Z and j, k ∈ [K]. Because S_z^{(j)}, S_z^{(k)} ∈ R^{[K]} and S_z^{(j)} ∩ S_z^{(k)} ≠ ∅ (because they both contain z), either S_z^{(j)} ⊇ S_z^{(k)} or S_z^{(k)} ⊇ S_z^{(j)} must be true. This means that we can order the conditioning events S_z^{(k)}, k ∈ [K] according to the relation ⊇. Without

loss of generality, we assume that, at z,

S_z^{(1)} ⊇ S_z^{(2)} ⊇ ··· ⊇ S_z^{(K−1)} ⊇ S_z^{(K)}.

Then ∩_{k=1}^{K} S_z^{(k)} = S_z^{(K)} and ∪_{k=1}^{K} S_z^{(k)} = S_z^{(1)}. Thus the intersection and the union of {S_z^{(k)}}_{k=1}^{K} are contained in R^{[K]}, which collects all the sets S_z^{(k)} by definition. As this is true for all z ∈ Z, we have R_{[K]} ⊆ R^{[K]} and R̄^{[K]} ⊆ R^{[K]}.

A.4 Proof of Lemma 4

Proof. (i) Suppose S′, S″ are two distinct nodes in ch(S). By Lemma 3, they are either disjoint or nested. If they are nested, without loss of generality suppose S″ ⊃ S′. However, this contradicts the definition of the edge S → S′, as by Definition 7 there should be no S″ satisfying S ⊃ S″ ⊃ S′. This shows that any two nodes in ch(S) are disjoint.

Next we show that the union of the sets in ch(S) is S. Suppose there exists z ∈ S such that z ∉ S′ for all S′ ∈ ch(S). In consequence, z ∉ S′ for all S′ ∈ de(S). Similar to the proof of Lemma 3, we can order S_z^{(k)}, k ∈ [K] according to set inclusion. Without loss of generality, suppose

S_z^{(1)} ⊇ S_z^{(2)} ⊇ ··· ⊇ S_z^{(K−1)} ⊇ S_z^{(K)}.

This shows that S = S_z^{(K)}, so ch(S) = de(S) = ∅. This contradicts the assumption.

(ii) Consider any S ∈ R^{[K]}, S′ ∈ an(S) and S″ ∈ de(S). By definition, S′ ⊃ S ⊃ S″. Because the sets in any partition R^{(k)} are disjoint, no two of S, S′, S″ can belong to the same partition R^{(k)}. Thus, K(S′), K(S), K(S″) are disjoint. Because this is true for any S′ ∈ an(S) and S″ ∈ de(S), the sets K(an(S)), K(S), and K(de(S)) are disjoint.

Now consider any z ∈ S. By the proof of (i), S is in a nested sequence consisting of S_z^{(1)}, ..., S_z^{(K)}. Hence, K(an(S)) ∪ K(S) ∪ K(de(S)) = K(an(S) ∪ {S} ∪ de(S)) = [K].

(iii) The first result follows immediately from the fact that an(S′) = an(S) ∪ {S}. The second result is true because both K({S′} ∪ de(S′)) and K(de(S)) are equal to [K] \ K(an(S) ∪ {S}).

A.5 Proof of Lemma 5

Proof. First, we claim that for any j, k ∈ [K] and z ∈ Z, given that Z ∈ S_z^{(j)} ∩ S_z^{(k)}, P^{(k)}(Z, W) only depends on Z through T_z^{(k)}(Z, W). This is true because of the definition of the p-value (1) and Lemma 1 (so S_Z^{(k)} = S_z^{(k)} and T_Z^{(k)}(·, ·) = T_z^{(k)}(·, ·)). By using this claim, the conditional independence (11) implies that

P^{(j)}(Z, W) ⊥⊥ P^{(k)}(Z, W) | Z ∈ S_z^{(j)} ∩ S_z^{(k)}, W, ∀ j, k ∈ [K], z ∈ Z.

Now fix S ∈ R^{[K]} and consider any j ∈ K(S) and k ∈ K(an(S) ∪ {S}) \ {j}. By Lemma 3, S_z^{(j)} ∩ S_z^{(k)} = S_z^{(j)} = S for any z ∈ S. Then the claimed independence follows immediately.

A.6 Proof of Lemma 6

Proof. Let ψ^{(k)} = ψ^{(k)}(Z, W) = 1{P^{(k)}(Z, W) ≤ α^{(k)}} for k ∈ [K], so the left-hand side of (14) can be written as

P{P^{(1)}(Z, W) ≤ α^{(1)}, ..., P^{(K)}(Z, W) ≤ α^{(K)} | Z ∈ S, W} = E{∏_{k=1}^{K} ψ^{(k)} | Z ∈ S, W}.

We prove Lemma 6 by induction. First, consider any leaf node in the Hasse diagram, that is, any S ∈ R^{[K]} such that ch(S) = ∅. By Lemma 4, K(S) ∪ K(an(S)) = [K]. Then by Lemma 5, ψ^{(j)} ⊥⊥ ψ^{(k)} for any j ∈ K(S) and k ≠ j. By using the validity of each CRT (Theorem 1), we obtain

( K )   Y (k)  Y (k)  Y n (j) o E ψ | Z ∈ S, W = E ψ | Z ∈ S, W E ψ | Z ∈ S, W k=1 k∈K(an(S))  j∈K(S)    Y (k)  Y (j) ≤ E ψ | Z ∈ S, W α . k∈K(an(S))  j∈K(S)

This is exactly (14) for a leaf node. Now consider a non-leaf node S ∈ R^{[K]} (so ch(S) ≠ ∅) and suppose (14) holds for every descendant of S. We have

E{∏_{k=1}^{K} ψ^{(k)} | Z ∈ S, W}
    = E{∑_{S′∈ch(S)} 1{Z ∈ S′} E{∏_{k=1}^{K} ψ^{(k)} | Z ∈ S′, W} | Z ∈ S, W}    (by Lemma 4(i))
    ≤ E{∑_{S′∈ch(S)} 1{Z ∈ S′} E{∏_{k∈K(an(S′))} ψ^{(k)} | Z ∈ S′, W} ∏_{j∈K({S′}∪de(S′))} α^{(j)} | Z ∈ S, W}    (by the induction hypothesis)
    = E{∏_{k∈K(an(S)∪{S})} ψ^{(k)} | Z ∈ S, W} ∏_{j∈K(de(S))} α^{(j)}    (by Lemma 4(iii))
    = E{∏_{k∈K(an(S))} ψ^{(k)} | Z ∈ S, W} ∏_{j∈K(S)} E{ψ^{(j)} | Z ∈ S, W} ∏_{j∈K(de(S))} α^{(j)}    (by Lemma 4(ii) and Lemma 5)

    ≤ E{∏_{k∈K(an(S))} ψ^{(k)} | Z ∈ S, W} ∏_{j∈K({S}∪de(S))} α^{(j)}.    (by Theorem 1)

By induction, this shows that (14) holds for all S ∈ R^{[K]}.

A.7 Proof of Theorem 2

Proof. Lemma 3 shows that R̄^{[K]} ⊆ R^{[K]}, thus (14) holds for every S ∈ R̄^{[K]} ⊆ R^{[K]}. Moreover, any S ∈ R̄^{[K]} has no superset in R^{[K]} and thus has no ancestors in the Hasse diagram. This means that the right-hand side of (14) is simply ∏_{k=1}^{K} α^{(k)}. Because G^{[K]} is the σ-algebra generated by the events {Z ∈ S} for S ∈ R̄^{[K]}, equation (12) holds. Finally, equation (13) holds trivially by taking the expectation of (12) over G^{[K]}.
