
Statistical Inference and Power Analysis for Direct and Spillover Effects in Two-Stage Randomized Experiments∗

Zhichao Jiang† Kosuke Imai‡

November 16, 2020

Abstract

Two-stage randomized experiments are becoming an increasingly popular experimental design for causal inference when the outcome of one unit may be affected by the treatment assignments of other units in the same cluster. In this paper, we provide a methodological framework for general tools of statistical inference and power analysis for two-stage randomized experiments. Under the randomization-based framework, we propose unbiased point estimators of direct and spillover effects, construct conservative variance estimators, develop hypothesis testing procedures, and derive sample size formulas. We also establish the equivalence relationships between the randomization-based and regression-based methods. We theoretically compare the two-stage randomized design with the completely randomized and cluster randomized designs, which represent two limiting designs. Finally, we conduct simulation studies to evaluate the empirical performance of our sample size formulas. For empirical illustration, the proposed methodology is applied to the analysis of the data from a field experiment on a job placement assistance program.

Keywords: experimental design, interference between units, partial interference, spillover effects, statistical power

∗Imai thanks the Alfred P. Sloan Foundation for partial support (Grant number 2020–13946).

†Assistant Professor, Department of Biostatistics and Epidemiology, University of Massachusetts, Amherst MA 01003.

‡Professor, Department of Government and Department of Statistics, Institute for Quantitative Social Science, Harvard University, Cambridge MA 02138. Phone: 617–384–6778, Email: [email protected], URL: https://imai.fas.harvard.edu

1 Introduction

Much of the early causal inference literature relied upon the assumption that the outcome of one unit cannot be affected by the treatment assignment of another unit. Over the last two decades, however, researchers have made substantial progress by developing a variety of methodological tools to relax this assumption (see e.g., Sobel, 2006; Rosenbaum, 2007; Hudgens and Halloran, 2008; Tchetgen Tchetgen and VanderWeele, 2010; Forastiere et al., 2016; Aronow and Samii, 2017; Athey et al., 2018; Basse and Feller, 2018; Imai et al., 2020, and many others).

Two-stage randomized experiments, originally proposed by Hudgens and Halloran (2008), have become an increasingly popular experimental design for studying spillover effects. Under this experimental design, researchers first randomly assign clusters of units to different treatment assignment mechanisms, each of which has a different probability of treatment assignment. For example, one treatment assignment mechanism may randomly assign 80% of units to the treatment group whereas another mechanism may only treat 40%. Then, within each cluster, units are randomized to the treatment and control conditions according to the treatment assignment mechanism selected for that cluster. By comparing units who are assigned to the same treatment condition but belong to different clusters with different treatment assignment mechanisms, one can infer how the treatment conditions of other units within the same cluster affect one's outcome. Two-stage randomized experiments are now frequently used in a number of disciplines, including economics (e.g., Crépon et al., 2013; Angelucci and Di Maro, 2016), education (e.g., Muralidharan and Sundararaman, 2015; Rogers and Feller, 2018), political science (e.g., Sinclair et al., 2012), and public health (e.g., Benjamin-Chung et al., 2018).

The increasing use of two-stage randomized experiments in applied scientific research calls for the development of general methodology for analyzing and designing such experiments. Building on the prior methodological literature (e.g., Hudgens and Halloran, 2008; Basse and Feller, 2018; Imai et al., 2020), we consider various causal quantities, representing direct and spillover effects, and develop their unbiased point estimators and conservative variance estimators under the nonparametric randomization-based framework. We also show how to conduct hypothesis tests and derive the sample size formulas for the estimation of these causal effects. The resulting formulas can be used to conduct power analysis when designing two-stage randomized experiments. Finally, we theoretically compare the two-stage randomized design with its two limiting designs, the completely randomized and cluster randomized designs. Through this comparison, we analyze the potential efficiency loss of the two-stage randomized design when no spillover effect exists.

We make several methodological contributions to the literature. First, the proposed causal quantities generalize those of Hudgens and Halloran (2008) to more than two treatment assignment mechanisms. We consider the joint estimation of the average direct and spillover effects to characterize the causal heterogeneity across different treatment assignment mechanisms. We also propose the average marginal direct effect as a scalar summary of several average direct effects. Second, our variance estimators are guaranteed to be conservative while those of Hudgens and Halloran (2008) are not when generalized to our setting. Third, while Baird et al. (2018) develop power analysis under similar settings, they adopt linear regression models and focus on the randomized saturation design, in which the proportion of treated units for each cluster is considered as a parameter to be optimized. In contrast, we study the standard two-stage randomized design and use the nonparametric randomization-based framework without making modeling assumptions. In addition, we prove the equivalence relationships between the proposed randomization-based estimators and the popular least squares estimators. This extends the result of Basse and Feller (2018) to more general two-stage randomized experiments. Our result can also be viewed as a generalization of Samii and Aronow (2012) in the presence of interference. We conduct simulation studies to evaluate the sample size formulas and use data from an experiment on a job placement assistance program to illustrate the proposed methodology.

The remainder of the paper is organized as follows. Section 2 introduces our motivating study, which uses the two-stage randomized design to evaluate the efficacy of a job placement assistance program (Crépon et al., 2013). In Section 3, we formally present the two-stage randomized design and define the three causal quantities of interest. In Section 4, we propose a general methodology for statistical inference and power analysis. While Section 5 presents simulation studies, Section 6 revisits the evaluation study of the job placement assistance program and applies the proposed methodology. Finally, Section 7 compares the two-stage randomized design with the cluster and individual randomized designs before providing concluding remarks in Section 8.

2 Randomized Evaluation of a Job Placement Assistance Program

In this section, we describe the randomized evaluation of a job placement assistance program (Crépon et al., 2013), which serves as a motivating application. The goal of the study is to assess the impacts of the job placement assistance program on the labor market outcomes of young, educated job seekers in France. The experiment took place in a total of 235 areas (e.g., cities), each of which is covered by one of the French public unemployment agency offices in 10 administrative regions. Each office represents a small labor market. The program eligibility criteria are based on age (30 years old or younger), education (at least a two-year college degree), and unemployment status (having spent either 12 out of the last 18 months or 6 months continuously unemployed).

Treatment assignment mechanism      I       II      III     IV      V
Treatment assignment probability    0%      25%     50%     75%     100%
Number of clusters                  47      47      47      47      47
Number of job seekers               4,467   4,839   4,899   4,598   4,517

Table 1: The Two-stage Randomized Design for the Evaluation of the Job Placement Assistance Program.

The evaluation was conducted through the two-stage randomized design shown in Table 1. In the first stage, 235 areas were randomly assigned to one of five treatment assignment mechanisms, which correspond to different levels of treatment assignment probability: 0, 25, 50, 75, and 100%. In the second stage, job seekers were assigned to the treatment within each area according to the treatment assignment probability chosen for that area in the first stage. Those assigned to the treatment group were offered an opportunity to enroll in the job placement program, whereas those in the control group received the standard placement assistance.

Job seekers participated in the experiment as 14 monthly cohorts, starting in September 2007. The study focused on cohorts 3–11, which consist of 11,806 unemployed individuals. It targeted two binary labor market outcomes measured eight months after the assignment: fixed-term contract of six months or more (LTFC) and permanent contract (PC). Four follow-up surveys, conducted 8 months, 12 months, 16 months, and 20 months after the treatment assignment, collected the information on labor market outcomes.

Both direct and spillover effects are of interest in this evaluation. While the direct effect characterizes how much the job seekers would benefit from the program, the spillover effect corresponds to the displacement effect, representing the possibility that job seekers who benefit from the program may get a job but at the expense of other unemployed workers in the same labor market. Moreover, the heterogeneity in the direct and spillover effects is also of interest. For example, a greater treatment assignment probability may lead to a larger spillover effect. The original study relied upon linear regression models (Crépon et al., 2013). In contrast, our analysis presented in Section 6 is based on the proposed nonparametric randomization-based framework, which, unlike linear regression models, does not suffer from possible model misspecification. We also conduct power analysis and obtain the sample size required for detecting a pre-specified effect size in a future experiment.

3 Experimental Design and Causal Quantities of Interest

In this section, we formally describe the two-stage randomized experimental design and define the causal quantities of interest using the potential outcomes framework (e.g., Neyman, 1923; Rubin, 1974; Holland, 1986).

3.1 Assumptions

Suppose that we have a total of $J$ clusters and each cluster $j$ has $n_j$ units. Let $n$ represent the total number of units, i.e., $n = \sum_{j=1}^{J} n_j$. Under the two-stage randomized design, we first randomly assign clusters to different treatment assignment mechanisms, and then assign a certain proportion of individual units within a cluster to the treatment condition by following the treatment assignment mechanism selected at the first stage of randomization.

Let $A_j$ denote the treatment assignment mechanism chosen for cluster $j$, which takes a value in $\mathcal{M} = \{1, 2, \ldots, m\}$. Let $A = (A_1, A_2, \ldots, A_J)$ denote the vector of treatment assignment mechanisms for all $J$ clusters and $a = (a_1, a_2, \ldots, a_J)$ represent the vector of realized assignment mechanisms. We assume complete randomization such that a total of $J_a$ clusters are assigned to the assignment mechanism $a \in \mathcal{M}$ where $\sum_{a=1}^{m} J_a = J$.

The second stage of randomization concerns the treatment assignment for each unit within cluster $j$ based on the assignment mechanism $A_j$. Let $Z_{ij}$ be the binary treatment assignment variable for unit $i$ in cluster $j$ where $Z_{ij} = 1$ and $Z_{ij} = 0$ imply that the unit is assigned to the treatment and control conditions, respectively. Then, $\Pr(Z_j = z_j \mid A_j = a)$ represents the distribution of the treatment assignment when cluster $j$ is assigned to the assignment mechanism $A_j = a$, where $Z_j = (Z_{1j}, \ldots, Z_{n_j j})$ is the vector of assigned treatments for the $n_j$ units in the cluster and $z_j = (z_{1j}, \ldots, z_{n_j j})$ is the vector of realized assignments. We assume complete randomization such that a total of $n_{jz}$ units in cluster $j$ are assigned to the treatment condition $z \in \{0, 1\}$ where $n_{j0} + n_{j1} = n_j$. We now formally define the two-stage randomized design.

Assumption 1 (Two-Stage Randomization)

1. Complete randomization of treatment assignment mechanisms across clusters:
\[
\Pr(A = a) = \frac{J_1! \cdots J_m!}{J!}
\]
for all $a$ such that $\sum_{j=1}^{J} \mathbf{1}\{a_j = a'\} = J_{a'}$ for $a' \in \mathcal{M}$.

2. Complete randomization of treatment assignment across units within each cluster:
\[
\Pr(Z_j = z_j \mid A_j = a) = \frac{1}{\binom{n_j}{n_{j1}}}
\]
for all $z_j$ such that $\mathbf{1}_{n_j}^\top z_j = n_{j1}$, where $\mathbf{1}_{n_j}$ is the $n_j$-dimensional vector of ones.
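The two stages of Assumption 1 can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `two_stage_assign`, the argument `treated_share` (the share of units treated under each mechanism), and the rounding used to obtain $n_{j1}$ are all introduced here for illustration.

```python
import random

def two_stage_assign(cluster_sizes, clusters_per_mech, treated_share, seed=0):
    """Return (A, Z): the mechanism of each cluster and its 0/1 treatment vector.

    cluster_sizes     -- list of n_j, one entry per cluster
    clusters_per_mech -- list of J_a (must sum to the number of clusters J)
    treated_share     -- hypothetical share of treated units under each mechanism
    """
    rng = random.Random(seed)
    J = len(cluster_sizes)
    assert sum(clusters_per_mech) == J

    # Stage 1: complete randomization of mechanisms across clusters,
    # so that exactly J_a clusters receive mechanism a.
    A = [a for a, J_a in enumerate(clusters_per_mech, start=1) for _ in range(J_a)]
    rng.shuffle(A)

    # Stage 2: complete randomization within each cluster,
    # so that exactly n_j1 units are treated.
    Z = []
    for n_j, a in zip(cluster_sizes, A):
        n_j1 = round(treated_share[a - 1] * n_j)
        z = [1] * n_j1 + [0] * (n_j - n_j1)
        rng.shuffle(z)
        Z.append(z)
    return A, Z

# Six clusters of eight units; three clusters per mechanism (25% and 75% treated).
A, Z = two_stage_assign([8] * 6, [3, 3], [0.25, 0.75])
```

Shuffling a fixed multiset of labels is one simple way to make every admissible assignment equally likely, which is what complete randomization requires at both stages.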

Next, we introduce the potential outcomes. For unit $i$ in cluster $j$, let $Y_{ij}(z)$ be the potential value of the outcome if the assigned treatment vector for the entire sample is $z$, where $z$ is an $n$-dimensional vector. The observed outcome is given by $Y_{ij} = Y_{ij}(Z)$. This notation implies that the outcome of one unit may be affected by the treatment assignment of any other unit in the sample. Unfortunately, it is impossible to learn about causal effects without additional assumptions because each unit has $2^n$ possible potential outcome values. Following the literature (Sobel, 2006; Hudgens and Halloran, 2008), we assume that the potential outcome of one unit cannot be affected by the treatment assignment of another unit in other clusters while allowing for possible interference between units within a cluster.

Assumption 2 (No Interference Between Clusters)

\[
Y_{ij}(z) = Y_{ij}(z') \quad \text{for any } z, z' \text{ with } z_j = z_j'.
\]

Assumption 2, which is known as the partial interference assumption in the literature, partially relaxes the standard assumption of no interference between units (Rubin, 1990). This assumption reduces the number of potential outcome values for each unit from $2^n$ to $2^{n_j}$. In our application, Assumption 2 is plausible because the areas are sufficiently large, and it appears that most participants of the experiment did not move to another area during the study. Lastly, we rely upon the stratified interference assumption proposed by Hudgens and Halloran (2008) to further reduce the number of potential outcome values.

Assumption 3 (Stratified Interference)

\[
Y_{ij}(z_j) = Y_{ij}(z_j') \quad \text{if } z_{ij} = z_{ij}' \text{ and } \sum_{i=1}^{n_j} z_{ij} = \sum_{i=1}^{n_j} z_{ij}'.
\]

Assumption 3 implies that the outcome of one unit depends on the treatment assignment of other units only through the number of those who are assigned to the treatment condition within the same cluster. Under Assumptions 2 and 3, we can simplify the potential outcome as a function of one's own treatment $z$ and the treatment assignment mechanism $a$ of its cluster, i.e., we write the potential outcome as $Y_{ij}(z, a)$.

3.2 Direct effect

Under the above assumptions, we now define the main causal quantities of interest. The first quantity is the direct effect of the treatment on one’s own outcome. We define the unit-level direct effect for unit i in cluster j as,

ADEij(a) = Yij(1, a) − Yij(0, a)

for $a = 1, \ldots, m$. This quantity may depend on the treatment assignment mechanism $a$ due to the possible spillover effect from other units' treatments. The direct effect quantifies how the treatment of a unit affects its outcome under a specific assignment mechanism. This unit-level direct effect can be aggregated, leading to the definition of the cluster-level direct effect,
\[
\mathrm{ADE}_j(a) = \frac{1}{n_j} \sum_{i=1}^{n_j} \mathrm{ADE}_{ij}(a) = \overline{Y}_j(1, a) - \overline{Y}_j(0, a),
\]
where
\[
\overline{Y}_j(z, a) = \frac{1}{n_j} \sum_{i=1}^{n_j} Y_{ij}(z, a).
\]
We can further aggregate this quantity and obtain the population-level direct effect,
\[
\mathrm{ADE}(a) = \frac{1}{J} \sum_{j=1}^{J} \mathrm{ADE}_j(a) = \overline{Y}(1, a) - \overline{Y}(0, a), \tag{1}
\]
where
\[
\overline{Y}(z, a) = \frac{1}{J} \sum_{j=1}^{J} \overline{Y}_j(z, a).
\]
The direct effects depend on the treatment assignment mechanisms; we denote them by a column vector, $\mathrm{ADE} = (\mathrm{ADE}(1), \ldots, \mathrm{ADE}(m))^\top$.

3.3 Marginal direct effect

With $m$ treatment assignment mechanisms, we have a total of $m$ direct effects $\mathrm{ADE}(a)$ for $a = 1, \ldots, m$. Although such direct effects are informative about how the treatment of a unit affects its own outcome given different treatment assignment mechanisms, applied researchers may be more interested in having a single quantity that summarizes all the direct effects. Therefore, we define the unit-level marginal direct effect by marginalizing the direct effects over the treatment assignment mechanisms,
\[
\mathrm{MDE}_{ij} = \sum_{a=1}^{m} \frac{J_a}{J} \{Y_{ij}(1, a) - Y_{ij}(0, a)\}.
\]

The weight $J_a/J$ is the proportion of the clusters assigned to treatment assignment mechanism $a$. Based on the unit-level effect, we define the cluster-level marginal direct effect as,
\[
\mathrm{MDE}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} \mathrm{MDE}_{ij},
\]
and the population-level marginal direct effect as,
\[
\mathrm{MDE} = \frac{1}{J} \sum_{j=1}^{J} \mathrm{MDE}_j. \tag{2}
\]
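As a small numerical check of these definitions, the sketch below computes ADE(a) and MDE from a toy table of potential outcomes. The data layout (each unit stored as a dictionary keyed by `(z, a)`) and the function names are assumptions made for illustration. In a real experiment only one potential outcome per unit is observed, so this code illustrates the estimands rather than an estimator.

```python
def cluster_mean(cluster, z, a):
    """Average potential outcome Y_ij(z, a) over the units of one cluster."""
    return sum(unit[(z, a)] for unit in cluster) / len(cluster)

def ade(Y, a):
    """Population-level ADE(a): equal-weight average of cluster-level direct effects."""
    effects = [cluster_mean(c, 1, a) - cluster_mean(c, 0, a) for c in Y]
    return sum(effects) / len(Y)

def mde(Y, J_a):
    """MDE: direct effects averaged over mechanisms with weights J_a / J."""
    J = sum(J_a)
    return sum(J_a[a - 1] / J * ade(Y, a) for a in range(1, len(J_a) + 1))

# Toy potential outcomes for J = 2 clusters and m = 2 mechanisms (made-up numbers).
Y = [
    [  # cluster 1, two units
        {(1, 1): 3, (0, 1): 1, (1, 2): 4, (0, 2): 2},
        {(1, 1): 5, (0, 1): 1, (1, 2): 4, (0, 2): 0},
    ],
    [  # cluster 2, one unit
        {(1, 1): 2, (0, 1): 2, (1, 2): 3, (0, 2): 1},
    ],
]

ade_1 = ade(Y, 1)         # (3 + 0) / 2 = 1.5
ade_2 = ade(Y, 2)         # (3 + 2) / 2 = 2.5
mde_val = mde(Y, [1, 1])  # 0.5 * 1.5 + 0.5 * 2.5 = 2.0
```

Because the weights $J_a/J$ do not depend on the unit, averaging the unit-level marginal direct effects is the same as taking the $J_a/J$-weighted average of the ADE(a), which is what `mde` does.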

3.4 Spillover effect

In two-stage randomized experiments, another causal quantity of interest is the spillover effect, which quantifies how one’s treatment affects the outcome of another unit. Under Assumptions 2 and 3, we define the unit-level spillover effect on the outcome as,

\[
\mathrm{ASE}_{ij}(z; a, a') = Y_{ij}(z, a) - Y_{ij}(z, a'),
\]
which compares the potential outcomes under two different assignment mechanisms, $a$ and $a'$, while holding one's treatment assignment constant at $z$. We then define the spillover effects on the outcome at the cluster and population levels,
\[
\mathrm{ASE}_j(z; a, a') = \frac{1}{n_j} \sum_{i=1}^{n_j} \mathrm{ASE}_{ij}(z; a, a') = \overline{Y}_j(z, a) - \overline{Y}_j(z, a'),
\]
\[
\mathrm{ASE}(z; a, a') = \frac{1}{J} \sum_{j=1}^{J} \mathrm{ASE}_j(z; a, a') = \overline{Y}(z, a) - \overline{Y}(z, a').
\]

The spillover effects depend on both the treatment condition and treatment assignment mechanisms; we denote them by $\mathrm{ASE} = (\mathrm{ASE}(1; 1, 2), \mathrm{ASE}(1; 2, 3), \ldots, \mathrm{ASE}(1; m-1, m), \mathrm{ASE}(0; 1, 2), \mathrm{ASE}(0; 2, 3), \ldots, \mathrm{ASE}(0; m-1, m))^\top$, which consists of the spillover effects comparing adjacent treatment assignment mechanisms for both the treatment and control conditions.

We give equal weight to each cluster in the quantities defined above (see Hudgens and Halloran, 2008), while Basse and Feller (2018) assign an equal weight to each individual unit. For example, Basse and Feller (2018) define the direct effect as
\[
\mathrm{ADE}(a) = \sum_{j=1}^{J} \frac{n_j}{N} \cdot \mathrm{ADE}_j(a) = \frac{1}{N} \sum_{j=1}^{J} \sum_{i=1}^{n_j} \mathrm{ADE}_{ij}(a).
\]

When the cluster sizes are equal, these two types of estimands are identical. While our analysis focuses on the cluster-weighted quantities rather than individual-weighted quantities, our method can be generalized to any weighting scheme.
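The difference between the two weighting schemes can be seen in a short sketch. The function names and the toy unit-level effects are hypothetical and serve only to show that the cluster-weighted and individual-weighted averages coincide when cluster sizes are equal and can differ otherwise.

```python
def cluster_weighted(effects):
    """Equal weight per cluster: average of cluster-level averages."""
    return sum(sum(c) / len(c) for c in effects) / len(effects)

def unit_weighted(effects):
    """Equal weight per unit: pooled average over all units."""
    total_units = sum(len(c) for c in effects)
    return sum(sum(c) for c in effects) / total_units

# Hypothetical unit-level direct effects, grouped by cluster.
unequal = [[2, 4], [0]]    # cluster sizes 2 and 1
equal = [[2, 4], [0, 2]]   # cluster sizes 2 and 2

cw_unequal = cluster_weighted(unequal)  # (3 + 0) / 2 = 1.5
uw_unequal = unit_weighted(unequal)     # (2 + 4 + 0) / 3 = 2.0
cw_equal = cluster_weighted(equal)      # (3 + 1) / 2 = 2.0
uw_equal = unit_weighted(equal)         # 8 / 4 = 2.0
```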

4 A General Methodology for Two-Stage Randomized Experiments

In this section, we develop a general methodology for the direct and spillover effects introduced above. We show how to estimate these quantities of interest, compute the randomization-based variance, and conduct hypothesis tests. We also derive the sample size formulas for testing the direct and spillover effects.

Formally, define $\overline{Y} = (\overline{Y}(1, 1), \overline{Y}(0, 1), \ldots, \overline{Y}(1, m), \overline{Y}(0, m))^\top$, which is a $2m$-dimensional column vector with the $(2a-1)$-th and $2a$-th elements representing the treatment and control potential outcomes under treatment assignment mechanism $a$, respectively, for $a = 1, \ldots, m$. The direct, marginal direct, and spillover effects can all be written as linear transformations of $\overline{Y}$. In particular, let $e_l$ denote the $2m$-dimensional column vector whose $l$-th element is equal to 1 with other elements being equal to 0. Then, the direct effect can be written as $\mathrm{ADE} = C_1 \overline{Y}$, where $C_1 = (e_1 - e_2, e_3 - e_4, \ldots, e_{2m-1} - e_{2m})^\top$ is an $m \times 2m$ matrix with the $a$-th row representing the contrast in $\mathrm{ADE}(a)$ for $a = 1, \ldots, m$. Similarly, the marginal direct effect can be written as $\mathrm{MDE} = C_2 \overline{Y}$, where $C_2 = (J_1, -J_1, J_2, -J_2, \ldots, J_m, -J_m)/J$ is a $2m$-dimensional row vector. Lastly, the spillover effect can be written as $\mathrm{ASE} = C_3 \overline{Y}$, where $C_3 = (C_{31}^\top, C_{30}^\top)^\top$ with $C_{31} = (e_1 - e_3, e_3 - e_5, \ldots, e_{2m-3} - e_{2m-1})^\top$ and $C_{30} = (e_2 - e_4, e_4 - e_6, \ldots, e_{2m-2} - e_{2m})^\top$. That is, the $a$-th rows of $C_{31}$ and $C_{30}$ represent the contrasts in $\mathrm{ASE}(1; a, a+1)$ and $\mathrm{ASE}(0; a, a+1)$, respectively, for $a = 1, \ldots, m-1$. In the following, we will develop a general methodology by exploiting these linear transformations.
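The contrast matrices can be built mechanically from the unit vectors $e_l$. The following sketch uses plain Python lists; the helper names (`unit`, `diff`, `contrast_matrices`, `matvec`) and the numbers in the example are illustrative assumptions, and each matrix is stored as a list of rows so that it multiplies the stacked vector of average potential outcomes from the left.

```python
def unit(l, dim):
    """e_l: vector of length dim with a 1 in (1-based) position l."""
    return [1 if i == l - 1 else 0 for i in range(dim)]

def diff(u, v):
    return [x - y for x, y in zip(u, v)]

def contrast_matrices(J_a):
    """Return (C1, C2, C3) as lists of rows for m = len(J_a) mechanisms."""
    m, J = len(J_a), sum(J_a)
    d = 2 * m
    # Row a of C1 is e_{2a-1} - e_{2a}: treated minus control under mechanism a.
    C1 = [diff(unit(2 * a - 1, d), unit(2 * a, d)) for a in range(1, m + 1)]
    # C2 is the single row (J_1, -J_1, ..., J_m, -J_m) / J.
    C2 = [[s * J_a[a] / J for a in range(m) for s in (1, -1)]]
    # Rows of C31 and C30 compare adjacent mechanisms for treated and control units.
    C31 = [diff(unit(2 * a - 1, d), unit(2 * a + 1, d)) for a in range(1, m)]
    C30 = [diff(unit(2 * a, d), unit(2 * a + 2, d)) for a in range(1, m)]
    return C1, C2, C31 + C30

def matvec(C, y):
    return [sum(c * v for c, v in zip(row, y)) for row in C]

# Made-up average potential outcomes, ordered (Y(1,1), Y(0,1), Y(1,2), Y(0,2), Y(1,3), Y(0,3)).
Y_bar = [4, 1, 5, 2, 7, 3]
C1, C2, C3 = contrast_matrices([2, 1, 1])

ADE = matvec(C1, Y_bar)  # [3, 3, 4]
MDE = matvec(C2, Y_bar)  # [3.25] = (2*3 + 1*3 + 1*4) / 4
ASE = matvec(C3, Y_bar)  # [-1, -2, -1, -1]
```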

4.1 Unbiased estimation

Hudgens and Halloran (2008) propose unbiased estimators of the average direct and spillover effects. Here, we present analogous estimators for the three causal quantities defined above. Define

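\[
\widehat{Y}_j(z) = \frac{\sum_{i=1}^{n_j} Y_{ij} \mathbf{1}(Z_{ij} = z)}{\sum_{i=1}^{n_j} \mathbf{1}(Z_{ij} = z)} \quad \text{and} \quad \widehat{Y}(z, a) = \frac{\sum_{j=1}^{J} \widehat{Y}_j(z) \mathbf{1}(A_j = a)}{\sum_{j=1}^{J} \mathbf{1}(A_j = a)},
\]
where $\widehat{Y}_j(z)$ is the average outcome under treatment condition $z$ in cluster $j$, and $\widehat{Y}(z, a)$ is the average of $\widehat{Y}_j(z)$ in clusters with treatment assignment mechanism $a$. The following theorem gives the unbiased estimators of the ADE, MDE, and ASE.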

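Theorem 1 (Unbiased Estimation) Define $\widehat{Y} = (\widehat{Y}(1, 1), \widehat{Y}(0, 1), \ldots, \widehat{Y}(1, m), \widehat{Y}(0, m))^\top$. Under Assumptions 1, 2, and 3, $\widehat{Y}$ is unbiased for $\overline{Y}$, i.e., $E(\widehat{Y}) = \overline{Y}$. Therefore, $\widehat{\mathrm{ADE}} = C_1 \widehat{Y}$, $\widehat{\mathrm{MDE}} = C_2 \widehat{Y}$, and $\widehat{\mathrm{ASE}} = C_3 \widehat{Y}$ are unbiased for ADE, MDE, and ASE, respectively, i.e.,
\[
E(\widehat{\mathrm{ADE}}) = \mathrm{ADE}, \quad E(\widehat{\mathrm{MDE}}) = \mathrm{MDE}, \quad E(\widehat{\mathrm{ASE}}) = \mathrm{ASE}.
\]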
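The estimators in Theorem 1 are simple averages of averages, as the following sketch shows. The data layout (a list of `(A_j, units)` pairs with `units` a list of `(Z_ij, Y_ij)` pairs), the function names, and the toy data are assumptions made for illustration. The sketch also assumes every cluster contains both treated and control units, which would not hold under a mechanism that treats 0% or 100% of a cluster.

```python
def cluster_avg(units, z):
    """Average observed outcome among units of one cluster with Z_ij = z."""
    ys = [y for z_ij, y in units if z_ij == z]
    return sum(ys) / len(ys)

def y_hat(data, m):
    """Stacked estimates ordered (1,1), (0,1), ..., (1,m), (0,m)."""
    out = []
    for a in range(1, m + 1):
        clusters = [units for A_j, units in data if A_j == a]
        for z in (1, 0):
            out.append(sum(cluster_avg(u, z) for u in clusters) / len(clusters))
    return out

def effects(y, J_a):
    """Point estimates of ADE(a), MDE, and adjacent-mechanism ASE from y_hat."""
    m, J = len(J_a), sum(J_a)
    ade = [y[2 * a] - y[2 * a + 1] for a in range(m)]
    mde = sum(J_a[a] / J * ade[a] for a in range(m))
    ase1 = [y[2 * a] - y[2 * a + 2] for a in range(m - 1)]      # treated units
    ase0 = [y[2 * a + 1] - y[2 * a + 3] for a in range(m - 1)]  # control units
    return ade, mde, ase1 + ase0

# Toy observed data: two clusters per mechanism (made-up numbers).
data = [
    (1, [(1, 3.0), (0, 1.0), (0, 2.0), (0, 3.0)]),
    (1, [(1, 5.0), (0, 2.0), (0, 2.0), (0, 5.0)]),
    (2, [(1, 4.0), (1, 6.0), (1, 8.0), (0, 1.0)]),
    (2, [(1, 2.0), (1, 4.0), (1, 6.0), (0, 3.0)]),
]

y = y_hat(data, m=2)                    # [4.0, 2.5, 5.0, 2.0]
ade_hat, mde_hat, ase_hat = effects(y, [2, 2])
```

With these numbers the direct effect estimates are 1.5 and 3.0, the marginal direct effect estimate is 2.25, and the spillover estimates for treated and control units are -1.0 and 0.5.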

We note that the theory of simple random sampling implies $E\{\widehat{Y}_j(z) \mid A_j = a\} = \overline{Y}_j(z, a)$. Therefore, it is straightforward to show that $E\{\widehat{Y}(z, a)\} = \overline{Y}(z, a)$ and hence $E(\widehat{Y}) = \overline{Y}$.

4.2 Variance

Hudgens and Halloran (2008) derive the variances of $\widehat{\mathrm{ADE}}(a)$ and $\widehat{\mathrm{ASE}}(z; a', a)$ under stratified interference (Assumption 3). However, this is not sufficient for obtaining the variance of our causal quantities, which requires the covariance between the elements in $\widehat{Y}$. We first derive the covariance matrix of $\widehat{Y}$ and then use it to obtain the covariance matrix of the estimators of our causal quantities of interest, i.e., ADE, MDE, and ASE.

The covariance matrix of $\widehat{Y}$ consists of the variance of $\widehat{Y}(z, a)$ and the covariance between $\widehat{Y}(z, a)$ and $\widehat{Y}(z', a')$. Define,
\[
\sigma_j^2(z, z'; a, a') = \frac{1}{n_j - 1} \sum_{i=1}^{n_j} \{Y_{ij}(z, a) - \overline{Y}_j(z, a)\}\{Y_{ij}(z', a') - \overline{Y}_j(z', a')\},
\]
\[
\sigma_b^2(z, z'; a, a') = \frac{1}{J - 1} \sum_{j=1}^{J} \{\overline{Y}_j(z, a) - \overline{Y}(z, a)\}\{\overline{Y}_j(z', a') - \overline{Y}(z', a')\},
\]
where $\sigma_j^2(z, z'; a, a')$ is the within-cluster covariance between $Y_{ij}(z, a)$ and $Y_{ij}(z', a')$, and $\sigma_b^2(z, z'; a, a')$ is the between-cluster covariance between $Y_{ij}(z, a)$ and $Y_{ij}(z', a')$. When $a = a'$, $\sigma_b^2(z, z'; a, a')$ reduces to $\sigma_b^2(z, z'; a)$. Similarly, when $z = z'$, $\sigma_b^2(z, z'; a, a')$ reduces to $\sigma_b^2(z; a, a')$. Lastly, when $z = z'$ and $a = a'$, $\sigma_j^2(z, z'; a, a')$ reduces to $\sigma_j^2(z, a)$ and $\sigma_b^2(z, z'; a, a')$ equals $\sigma_b^2(z, a)$. The following theorem gives each element of the covariance matrix of $\widehat{Y}$.

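Theorem 2 (Variance-Covariance Matrix) Under Assumptions 1, 2, and 3, we have
\[
\operatorname{var}\{\widehat{Y}(z, a)\} = \frac{1}{J_a}\left(1 - \frac{J_a}{J}\right) \sigma_b^2(z, a) + \frac{1}{J_a J} \sum_{j=1}^{J} \frac{1}{n_{jz}}\left(1 - \frac{n_{jz}}{n_j}\right) \sigma_j^2(z, a),
\]
\[
\operatorname{cov}\{\widehat{Y}(1, a), \widehat{Y}(0, a)\} = \frac{1}{J_a}\left(1 - \frac{J_a}{J}\right) \sigma_b^2(1, 0; a) - \frac{1}{J_a J} \sum_{j=1}^{J} \frac{\sigma_j^2(1, 0; a)}{n_j},
\]
\[
\operatorname{cov}\{\widehat{Y}(z, a), \widehat{Y}(z, a')\} = -\frac{1}{J} \sigma_b^2(z; a, a'),
\]
\[
\operatorname{cov}\{\widehat{Y}(1, a), \widehat{Y}(0, a')\} = -\frac{1}{J} \sigma_b^2(1, 0; a, a'),
\]
for $a \neq a'$.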

Proof is given in Appendix S1.1. Let $D$ denote the covariance matrix of $\widehat{Y}$ multiplied by $J$. The multiplication facilitates the development of sample size formulas in Section 4.4. Theorem 2 implies that the covariance matrices of $\widehat{\mathrm{ADE}}$, $\widehat{\mathrm{MDE}}$, and $\widehat{\mathrm{ASE}}$ can be written as,
\[
\operatorname{var}\{\widehat{\mathrm{ADE}}\} = \frac{C_1 D C_1^\top}{J}, \quad \operatorname{var}\{\widehat{\mathrm{MDE}}\} = \frac{C_2 D C_2^\top}{J}, \quad \operatorname{var}\{\widehat{\mathrm{ASE}}\} = \frac{C_3 D C_3^\top}{J}.
\]

Because we cannot observe $Y_{ij}(1, a)$ and $Y_{ij}(0, a)$ simultaneously, no unbiased estimator exists for $\sigma_j^2(1, 0; a)$. Similarly, no unbiased estimators exist for $\sigma_b^2(z; a, a')$ and $\sigma_b^2(1, 0; a, a')$. This implies that no unbiased estimation of $D$ is possible. Following the idea of Hudgens and Halloran (2008), we propose a conservative estimator. Define
\[
\widehat{\sigma}_b^2(z, a) = \frac{1}{J_a - 1} \sum_{j=1}^{J} \{\widehat{Y}_j(z) - \widehat{Y}(z, a)\}^2 \mathbf{1}(A_j = a),
\]
\[
\widehat{\sigma}_b^2(1, 0; a) = \frac{1}{J_a - 1} \sum_{j=1}^{J} \{\widehat{Y}_j(1) - \widehat{Y}(1, a)\}\{\widehat{Y}_j(0) - \widehat{Y}(0, a)\} \mathbf{1}(A_j = a),
\]
where $\widehat{\sigma}_b^2(z, a)$ represents the between-cluster sample variance of $Y_{ij}(z, a)$, and $\widehat{\sigma}_b^2(1, 0; a)$ denotes the between-cluster sample covariance between $Y_{ij}(1, a)$ and $Y_{ij}(0, a)$. The following theorem proposes a conservative variance estimator, which is exactly unbiased when the cluster-level average potential outcomes, i.e., $\overline{Y}_j(z, a)$, do not vary across clusters.

Theorem 3 (Conservative Estimator of Variance) Let $\widehat{D}$ be a $2m \times 2m$ block diagonal matrix with the $a$-th matrix ($a = 1, \ldots, m$) on the diagonal
\[
\widehat{D}_a = \frac{J}{J_a} \begin{pmatrix} \widehat{\sigma}_b^2(1, a) & \widehat{\sigma}_b^2(1, 0; a) \\ \widehat{\sigma}_b^2(1, 0; a) & \widehat{\sigma}_b^2(0, a) \end{pmatrix}.
\]
Then, $\widehat{D}$ is a conservative estimator for $D$, i.e., $E\{\widehat{D}\} - D$ is a positive semi-definite matrix. It is an unbiased estimator for $D$ when the cluster-level average potential outcomes, i.e., $\overline{Y}_j(z, a)$, are constant across clusters.

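Proof is given in Appendix S1.2. The covariance matrix estimator $\widehat{D}$ estimates $\operatorname{var}\{\widehat{Y}(z, a)\}$ and $\operatorname{cov}\{\widehat{Y}(1, a), \widehat{Y}(0, a)\}$ by their corresponding between-cluster sample variance and covariance, $\widehat{\sigma}_b^2(z, a)$ and $\widehat{\sigma}_b^2(1, 0; a)$, while replacing $\operatorname{cov}\{\widehat{Y}(1, a), \widehat{Y}(0, a')\}$ with 0. Theorem 3 yields the following conservative variance estimators for $\widehat{\mathrm{ADE}}$, $\widehat{\mathrm{MDE}}$, and $\widehat{\mathrm{ASE}}$,
\[
\widehat{\operatorname{var}}\{\widehat{\mathrm{ADE}}\} = \frac{C_1 \widehat{D} C_1^\top}{J}, \quad \widehat{\operatorname{var}}\{\widehat{\mathrm{MDE}}\} = \frac{C_2 \widehat{D} C_2^\top}{J}, \quad \widehat{\operatorname{var}}\{\widehat{\mathrm{ASE}}\} = \frac{C_3 \widehat{D} C_3^\top}{J}.
\]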
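A sketch of how the blocks of the conservative estimator and the resulting variance estimates for the direct effects might be computed from cluster-level averages follows. The input layout (for each mechanism, a list of pairs holding a cluster's average treated and average control outcome), the function names, and the toy numbers are assumptions for illustration. Because the estimator is block diagonal and each row of $C_1$ touches a single block, the estimated covariance matrix of the direct effect estimators is diagonal, so only its diagonal is returned.

```python
def sample_cov(x, y):
    """Sample covariance with divisor len(x) - 1 (requires at least two clusters)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

def d_hat_blocks(cluster_means):
    """Return J and the 2 x 2 blocks (J / J_a) * [[s11, s10], [s10, s00]].

    cluster_means[a - 1] lists (treated average, control average) for each
    cluster assigned to mechanism a.
    """
    J = sum(len(c) for c in cluster_means)
    blocks = []
    for c in cluster_means:
        y1 = [pair[0] for pair in c]
        y0 = [pair[1] for pair in c]
        s11, s00, s10 = sample_cov(y1, y1), sample_cov(y0, y0), sample_cov(y1, y0)
        scale = J / len(c)
        blocks.append([[scale * s11, scale * s10], [scale * s10, scale * s00]])
    return J, blocks

def var_ade_hat(cluster_means):
    """Diagonal of C1 D_hat C1^T / J, one entry per mechanism."""
    J, blocks = d_hat_blocks(cluster_means)
    return [(B[0][0] + B[1][1] - 2 * B[0][1]) / J for B in blocks]

# Made-up cluster-level averages: three clusters under each of two mechanisms.
cluster_means = [
    [(3.0, 2.0), (5.0, 3.0), (4.0, 1.0)],
    [(6.0, 1.0), (4.0, 3.0), (5.0, 2.0)],
]
J, blocks = d_hat_blocks(cluster_means)  # blocks: [[2, 1], [1, 2]] and [[2, -2], [-2, 2]]
variances = var_ade_hat(cluster_means)   # [1/3, 4/3]
```

Each entry reduces to the between-cluster sample variance of the cluster-level differences divided by $J_a$, which is a convenient sanity check on the matrix expression.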

Similar to $\widehat{D}$, these variance estimators become unbiased if $\overline{Y}_j(z, a)$ are the same across clusters. Note that an alternative conservative estimator exists for each $\widehat{\mathrm{ADE}}(a)$. Hudgens and Halloran (2008) propose the following conservative variance estimator,
\[
\frac{1}{J_a}\left(1 - \frac{J_a}{J}\right) \left\{\widehat{\sigma}_b^2(1, a) + \widehat{\sigma}_b^2(0, a) - 2\widehat{\sigma}_b^2(1, 0; a)\right\} + \frac{1}{J J_a} \sum_{j=1}^{J} \left\{\frac{\widehat{\sigma}_j^2(1)}{n_{j1}} + \frac{\widehat{\sigma}_j^2(0)}{n_{j0}}\right\} \mathbf{1}(A_j = a),
\]
where
\[
\widehat{\sigma}_j^2(z) = \frac{1}{n_{jz} - 1} \sum_{i=1}^{n_j} \{Y_{ij} - \widehat{Y}_j(z)\}^2 \mathbf{1}(Z_{ij} = z)
\]
represents the within-cluster sample variance of $Y_{ij}(z)$. They show that it is a conservative estimator of the variance of $\widehat{\mathrm{ADE}}(a)$, and is unbiased if the unit-level direct effects, $Y_{ij}(1, a) - Y_{ij}(0, a)$, do not vary within each cluster. In practice, this variance estimator is generally smaller than the $a$-th diagonal element of $\widehat{\operatorname{var}}\{\widehat{\mathrm{ADE}}\}$. However, its conservativeness property holds only for the variance of each $\widehat{\mathrm{ADE}}(a)$. No similar estimator can be obtained for the covariance matrix of $\widehat{\mathrm{ADE}}$. For example, replacing the diagonal elements of $\widehat{\operatorname{var}}\{\widehat{\mathrm{ADE}}\}$ with the estimators of Hudgens and Halloran (2008) does not yield a conservative estimator for $\operatorname{var}\{\widehat{\mathrm{ADE}}\}$. Therefore, we recommend using the estimator of Hudgens and Halloran (2008) when the variance of $\widehat{\mathrm{ADE}}(a)$ alone is of interest, whereas our proposed estimator should be used when the joint distribution of $\widehat{\mathrm{ADE}}$ is of interest.

4.3 Hypothesis testing

We consider testing the following three null hypotheses of no direct effect, no marginal direct effect, and no spillover effect,

\[
H_0^{\mathrm{de}}: \mathrm{ADE} = 0, \quad H_0^{\mathrm{mde}}: \mathrm{MDE} = 0, \quad H_0^{\mathrm{se}}: \mathrm{ASE} = 0.
\]

Because ADE, MDE, and ASE are linear transformations of Y , we focus on a more general null hypothesis,

\[
H_0: C\overline{Y} = 0, \tag{3}
\]
where $C$ is a constant contrast matrix with full row rank. By setting $C$ to $C_1$, $C_2$, and $C_3$, $H_0$ becomes $H_0^{\mathrm{de}}$, $H_0^{\mathrm{mde}}$, and $H_0^{\mathrm{se}}$, respectively. We propose the following Wald-type test statistic,
\[
T = J (C\widehat{Y})^\top (C\widehat{D}C^\top)^{-1} (C\widehat{Y}), \tag{4}
\]
where the covariance matrix of $\widehat{Y}$ is replaced with its conservative estimator $\widehat{D}/J$. Unfortunately, $T$ does not follow the $\chi^2$ distribution asymptotically because the covariance matrix estimator is conservative. Instead, the following theorem shows that it is asymptotically dominated by the $\chi^2$ distribution.

Theorem 4 (Asymptotic Distribution of the Test Statistic) Suppose that the rank of $C$ is $k$. Under the null hypothesis in equation (3), the asymptotic distribution of the test statistic $T$ defined in equation (4) is stochastically dominated by the $\chi^2$ distribution with $k$ degrees of freedom, i.e., $\operatorname{pr}(T \geq t) \leq \operatorname{pr}(X \geq t)$ for any constant $t$ where $X \sim \chi^2(k)$.

Proof is given in Appendix S1.3. With a pre-specified significance level $\alpha$, we can reject $H_0$ if $T > \chi^2_{1-\alpha}(k)$, where $\chi^2_{1-\alpha}(k)$ represents the $(1 - \alpha)$ quantile of the $\chi^2$ distribution with $k$ degrees of freedom. Theorem 4 implies that this rejection rule controls the type I error asymptotically. We can use the following three Wald-type test statistics for the direct, marginal direct, and spillover effects, respectively,
\[
T_{\mathrm{de}} = J (C_1\widehat{Y})^\top (C_1\widehat{D}C_1^\top)^{-1} (C_1\widehat{Y}), \tag{5}
\]
\[
T_{\mathrm{mde}} = J (C_2\widehat{Y})^\top (C_2\widehat{D}C_2^\top)^{-1} (C_2\widehat{Y}), \tag{6}
\]
\[
T_{\mathrm{se}} = J (C_3\widehat{Y})^\top (C_3\widehat{D}C_3^\top)^{-1} (C_3\widehat{Y}). \tag{7}
\]

Theorem 4 implies that under the corresponding null hypothesis, the asymptotic distributions of $T_{\mathrm{de}}$, $T_{\mathrm{mde}}$, and $T_{\mathrm{se}}$ are stochastically dominated by a $\chi^2$ distribution with the degrees of freedom equal to $m$, one, and $2(m - 1)$, respectively.
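For the marginal direct effect the contrast is a single row, so the Wald statistic reduces to a ratio of scalars and its reference distribution has one degree of freedom. The sketch below illustrates this case only. The function name `wald_mde` and the inputs are hypothetical, and the reported p-value is an upper bound in the sense of Theorem 4 because the variance estimator is conservative. It relies on the identity that the upper tail probability of a chi-square variable with one degree of freedom at t equals `erfc(sqrt(t / 2))`.

```python
import math

def wald_mde(y_hat, d_hat, J_a):
    """Wald-type statistic for H0: MDE = 0 and its chi-square(1) tail bound.

    y_hat -- stacked estimates ordered (1,1), (0,1), ..., (1,m), (0,m)
    d_hat -- 2m x 2m conservative covariance estimate (list of rows)
    J_a   -- number of clusters assigned to each mechanism
    """
    J = sum(J_a)
    c2 = [s * Ja / J for Ja in J_a for s in (1, -1)]
    mde_hat = sum(c * y for c, y in zip(c2, y_hat))
    dim = len(c2)
    quad = sum(c2[i] * d_hat[i][j] * c2[j] for i in range(dim) for j in range(dim))
    T = J * mde_hat ** 2 / quad
    p_bound = math.erfc(math.sqrt(T / 2.0))  # pr(chi-square(1) >= T)
    return T, p_bound

# Made-up inputs for m = 2 mechanisms with three clusters each.
y_hat = [4.0, 2.0, 5.0, 2.0]
d_hat = [
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, -2.0],
    [0.0, 0.0, -2.0, 2.0],
]
T, p_bound = wald_mde(y_hat, d_hat, [3, 3])  # T = 6 * 2.5**2 / 2.5 = 15.0
```

With these inputs the statistic exceeds the 5% critical value of about 3.84, so the null hypothesis of no marginal direct effect would be rejected.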

4.4 Sample size formula

When planning a two-stage randomized experiment, we may wish to determine the sample size needed to detect a certain effect size with a given statistical power (1 − β) and a significance level (α). The sample size depends on the number of clusters and the cluster sizes. However, in two-stage randomized experiments, the cluster sizes are often fixed. Therefore, we will derive the required number of clusters of fixed sizes that ensures sufficient power to detect a deviation from the null hypothesis of interest.

General formulation. We begin by considering a general alternative hypothesis,

\[
H_1: C\overline{Y} = x, \tag{8}
\]
where $C$ is a $k \times 2m$ matrix of full row rank ($k \leq 2m$) and $x$ is a vector of constants. With the test statistic given in equation (4), the required number of clusters $J$ should satisfy
\[
\operatorname{pr}\{J (C\widehat{Y})^\top (C\widehat{D}C^\top)^{-1} (C\widehat{Y}) \geq \chi^2_{1-\alpha}(k) \mid C\overline{Y} = x\} \geq 1 - \beta. \tag{9}
\]

However, because $\widehat{D}$ is a conservative estimator for $D$, the statistic $J (C\widehat{Y})^\top (C\widehat{D}C^\top)^{-1} (C\widehat{Y})$ follows a generalized chi-square distribution instead of a standard chi-square distribution asymptotically, which makes it difficult to solve equation (9) for $J$ directly. Fortunately, based on the properties of the generalized chi-square distribution, the following theorem gives a conservative sample size formula.

Theorem 5 (General sample size formula) Consider a statistical hypothesis test with level $\alpha$ where the null and alternative hypotheses are given in equations (3) and (8), respectively. We reject the null hypothesis if $T > \chi^2_{1-\alpha}(k)$, where the test statistic $T$ is defined in equation (4) and $k$ is the rank of $C$. Then, the number of clusters required for this hypothesis test to have the statistical power of $(1 - \beta)$ is given by,
\[
J \geq \frac{s^2(\chi^2_{1-\alpha}(k), 1 - \beta, k)}{x^\top \{C E(\widehat{D}) C^\top\}^{-1} x},
\]
where $s^2(q, 1 - \beta, k)$ represents the non-centrality parameter of the non-central $\chi^2$ distribution with $k$ degrees of freedom whose $\beta$ quantile is equal to $q$.

Proof is given in Appendix S1.4. In practice, we must compute $s^2(\chi^2_{1-\alpha}(k), 1 - \beta, k)$ numerically. Based on Theorem 5, we can obtain the sample size formulas for the direct, marginal direct, and spillover effects by setting $C$ to $C_1$, $C_2$, and $C_3$, so that $k$ equals $m$, one, and $2(m - 1)$, respectively.
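For a scalar contrast (k = 1) the non-centrality parameter can be computed with the standard normal distribution alone, because a non-central chi-square variable with one degree of freedom has the same distribution as the square of a normal variable with unit variance. The sketch below solves for it by bisection and then applies the general formula, reading its denominator as the quadratic form in the inverse of $C E(\widehat{D}) C^\top$, which for k = 1 is the squared effect divided by a scalar variance term. The function names and the inputs `x` and `c_ed_c` are illustrative assumptions.

```python
import math
from statistics import NormalDist

def power_k1(lam, alpha):
    """pr{(Z + sqrt(lam))^2 >= chi-square_{1-alpha}(1)} for standard normal Z."""
    nd = NormalDist()
    c = nd.inv_cdf(1 - alpha / 2)  # square root of the chi-square(1) critical value
    d = math.sqrt(lam)
    return (1 - nd.cdf(c - d)) + nd.cdf(-c - d)

def noncentrality_k1(alpha, power, upper=100.0):
    """Smallest non-centrality parameter giving the requested power (bisection)."""
    lo, hi = 0.0, upper
    for _ in range(200):
        mid = (lo + hi) / 2
        if power_k1(mid, alpha) < power:
            lo = mid
        else:
            hi = mid
    return hi

def clusters_needed(x, c_ed_c, alpha=0.05, power=0.8):
    """k = 1 case: J >= s^2 * {C E(D_hat) C^T} / x^2, rounded up."""
    return math.ceil(noncentrality_k1(alpha, power) * c_ed_c / x ** 2)

s2 = noncentrality_k1(0.05, 0.8)          # about 7.85
J_min = clusters_needed(x=0.1, c_ed_c=0.5)
```

The value of about 7.85 matches the familiar approximation obtained by squaring the sum of the 0.975 and 0.80 standard normal quantiles.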

Simplification. The practical difficulty of the sample size formula in Theorem 5 is that it requires the specification of many parameters in $E(\widehat{D})$ and the value of vector $x$ in the alternative hypothesis. Thus, we consider the further simplification of the sample size formula to facilitate its application by reducing the number of parameters to be specified by researchers. Specifically, we consider the following simplifying conditions.

Assumption 4 (Simplification) The following conditions are assumed for simplifying the sample size formulas:

(a) the cluster sizes are equal: nj = n for all j;

(b) the within-cluster variances of $Y_{ij}(z, a)$ are the same across different clusters, different treatments, and different treatment assignment mechanisms: $\sigma_j^2(z, a) = \sigma_w^2$ for all $z$ and $a$;

(c) the between-cluster variances of $Y_{ij}(z, a)$ are the same across different treatments and different treatment assignment mechanisms: $\sigma_b^2(z, a) = \sigma_b^2$ for all $z$ and $a$;

(d) the within-cluster and between-cluster correlation coefficients between $Y_{ij}(1, a)$ and $Y_{ij}(0, a)$ are the same and non-negative: $\sigma_j^2(1, 0; a) = \sigma_j^2(1, 0; a') \geq 0$ and $\sigma_b^2(1, 0; a) = \sigma_b^2(1, 0; a') \geq 0$ for all $a$ and $a'$.

Under these simplifying conditions, we can write $\sigma_j^2(1, 0; a) = \rho\sigma_w^2$ and $\sigma_b^2(1, 0; a) = \rho\sigma_b^2$, where $\rho \geq 0$ is the within-cluster and between-cluster correlation coefficient between $Y_{ij}(1, a)$ and $Y_{ij}(0, a)$. We can also rewrite $\sigma_w^2$ and $\sigma_b^2$ as,
\[
\sigma_w^2 = (1 - r)\sigma^2, \quad \sigma_b^2 = r\sigma^2,
\]
where $\sigma^2 = \sigma_w^2 + \sigma_b^2$ represents the total variance of $Y_{ij}(z, a)$ and $r = \sigma_b^2/(\sigma_w^2 + \sigma_b^2)$ is the intracluster correlation coefficient with respect to $Y_{ij}(z, a)$. Denote $D_0^* = \operatorname{diag}(D_{01}^*, D_{02}^*, \ldots, D_{0m}^*)$, where
\[
D_{0a}^* = \frac{1}{q_a} \begin{pmatrix} r + \dfrac{(1 - p_a)(1 - r)}{n p_a} & \rho\left(r - \dfrac{1 - r}{n}\right) \\[2ex] \rho\left(r - \dfrac{1 - r}{n}\right) & r + \dfrac{p_a(1 - r)}{n(1 - p_a)} \end{pmatrix}
\]
for $a = 1, \ldots, m$. Thus, $D_0^*$ is a $2m \times 2m$ block diagonal matrix whose $a$-th $2 \times 2$ diagonal block is $D_{0a}^*$ for $a = 1, \ldots, m$.

We derive the sample size formula for the direct effect under the simplifying conditions given in Assumption 4. To reduce the number of parameters in the alternative hypothesis $H_1: \mathrm{ADE} = x$, we consider the alternative hypothesis about the greatest direct effect across $m$ treatment assignment mechanisms:
\[
H_1^{\mathrm{de}}: \max_a |\mathrm{ADE}(a)| = \mu. \tag{10}
\]

The following theorem gives the sample size formula for rejecting the null hypothesis H0 : ADE = 0, with respect to the alternative hypothesis in equation (10).

Theorem 6 (Simplified Sample Size Formula for Direct Effects) Consider a statistical hypothesis test with level $\alpha$ where the null hypothesis is $H_0^{de}: \mathrm{ADE}(a) = 0$ for all $a$ and the alternative hypothesis is given in equation (10). We reject the null hypothesis if $T_{de} > \chi^2_{1-\alpha}(m)$ where the test statistic $T_{de}$ is defined in equation (5). Under Assumption 4, the number of clusters required for this test to have the statistical power of $1-\beta$ is given by,

\[ J \geq \frac{s^2(\chi^2_{1-\alpha}(m),\, 1-\beta,\, m) \cdot \sigma^2}{\mu^2} \cdot \max_a \left\{ (1,-1)\, D_{0a}^*\, (1,-1)^\top \right\}. \tag{11} \]

Moreover, if $r \geq 1/(n+1)$, then the required number of clusters is given by,

\[ J \geq \frac{s^2(\chi^2_{1-\alpha}(m),\, 1-\beta,\, m) \cdot \sigma^2}{\mu^2} \cdot \max_a \left\{ (1,-1)\, D_{0a}\, (1,-1)^\top \right\}, \tag{12} \]

where

\[ D_{0a} = \frac{1}{q_a} \begin{pmatrix} r + \dfrac{(1-p_a)(1-r)}{np_a} & 0 \\[2ex] 0 & r + \dfrac{p_a(1-r)}{n(1-p_a)} \end{pmatrix}. \]

Proof is given in Appendix S1.5. To apply equation (11), one needs to specify $(p_a, q_a, n)$, based on the study design, and $(\rho, r, \sigma^2)$, based on prior knowledge or pilot studies. Since $\rho$ is the correlation coefficient between potential outcomes under different treatment conditions, it is an unidentifiable parameter. Therefore, we provide a more conservative sample size formula in equation (12) that does not involve $\rho$. The condition $r \geq 1/(n+1)$ is equivalent to $r - (1-r)/n \geq 0$, so the off-diagonal entries of $D_{0a}^*$ are non-negative and dropping them can only increase $(1,-1)D_{0a}^*(1,-1)^\top$. This condition is easily satisfied so long as the cluster size is moderate or large. Under this condition, if $J$ satisfies equation (12), then it also satisfies equation (11).

Next, we derive the sample size formula for the marginal direct effect under Assumption 4. Because the marginal direct effect is a scalar, we continue to use the alternative hypothesis considered above, i.e., $H_1: \mathrm{MDE} = \mu$. The following theorem gives the sample size formula.

Theorem 7 (Simplified Sample Size Formula for Marginal Direct Effect) Consider a statistical hypothesis test with level $\alpha$ where the null hypothesis is $H_0^{mde}: \mathrm{MDE} = 0$ and the alternative hypothesis is $H_1^{mde}: \mathrm{MDE} = \mu$. We reject the null hypothesis if $T_{mde} > \chi^2_{1-\alpha}(1)$ where $T_{mde}$ is the test statistic defined in equation (6). Under Assumption 4, the number of clusters required for the test to have the statistical power of $1-\beta$ is given by,

\[ J \geq \frac{s^2(\chi^2_{1-\alpha}(1),\, 1-\beta,\, 1) \cdot \sigma^2}{\mu^2} \cdot \sum_{a=1}^m q_a^2 \left\{ (1,-1)\, D_{0a}^*\, (1,-1)^\top \right\}. \tag{13} \]

Moreover, if $r \geq 1/(n+1)$, then the number of clusters required is given by,

\[ J \geq \frac{s^2(\chi^2_{1-\alpha}(1),\, 1-\beta,\, 1) \cdot \sigma^2}{\mu^2} \cdot \sum_{a=1}^m q_a^2 \left\{ (1,-1)\, D_{0a}\, (1,-1)^\top \right\}. \tag{14} \]

Proof is given in Appendix S1.6. Similar to Theorem 6, the application of equation (13) requires the specification of both $(p_a, q_a, n)$ and $(\rho, r, \sigma^2)$, while the more conservative formula given in equation (14) does not depend on $\rho$.

Finally, we derive the sample size formula for the spillover effect under Assumption 4. To reduce the number of parameters in the alternative hypothesis $H_1: \mathrm{ASE} = x$, we consider the following alternative hypothesis about the greatest spillover effect across different treatment conditions and treatment assignment mechanisms,

\[ H_1^{se}: \max_{z,\, a \neq a'} |\mathrm{ASE}(z; a, a')| = \mu. \tag{15} \]

The next theorem gives the sample size formula.

Theorem 8 (Simplified Sample Size Formula for Spillover Effects) Consider a statistical hypothesis test with level $\alpha$ where the null hypothesis is $H_0^{se}: \mathrm{ASE}(z; a, a') = 0$ for all $z$ and $a \neq a'$ and the alternative hypothesis is given in equation (15). We reject the null hypothesis if $T_{se} > \chi^2_{1-\alpha}(2(m-1))$ where the test statistic $T_{se}$ is defined in equation (7). Under Assumption 4, the number of clusters required for the test to have the statistical power $1-\beta$ is given by,

\[ J \geq \frac{s^2(\chi^2_{1-\alpha}(2(m-1)),\, 1-\beta,\, 2(m-1)) \cdot \sigma^2}{\mu^2 \cdot \min_{s \in S} s^\top \{C_3 D_0^* C_3^\top\}^{-1} s}, \tag{16} \]

where $S$ is the set of $s = (\mathrm{ASE}(0;1,2), \mathrm{ASE}(0;2,3), \ldots, \mathrm{ASE}(0;m-1,m), \mathrm{ASE}(1;1,2), \mathrm{ASE}(1;2,3), \ldots, \mathrm{ASE}(1;m-1,m))$ satisfying $\max_{z,\, a \neq a'} |\mathrm{ASE}(z; a, a')| = 1$.

Proof is given in Appendix S1.7. In Appendix S3, we show how to numerically compute the denominator of equation (16) using quadratic programming. Unlike Theorems 6 and 7, we cannot obtain a more conservative sample size formula by setting $\rho$ to zero. Nonetheless, we use the following formula that does not involve $\rho$ and evaluate its performance in our simulation study of Section 5,

\[ J \geq \frac{s^2(\chi^2_{1-\alpha}(2(m-1)),\, 1-\beta,\, 2(m-1)) \cdot \sigma^2}{\mu^2 \cdot \min_{s \in S} s^\top \{C_3 D_0 C_3^\top\}^{-1} s}, \tag{17} \]

where $D_0 = \mathrm{diag}(D_{01}, D_{02}, \ldots, D_{0m})$.
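The closed-form bounds for the direct and marginal direct effects, equations (11) to (14), are simple enough to evaluate directly. The sketch below is illustrative only and all function names are hypothetical. The non-centrality parameter `s2` is passed in by the caller; for $k = 1$ the helper `s2_one_df` uses the usual normal-quantile approximation $(z_{1-\alpha/2} + z_{1-\beta})^2$, which ignores a negligible far-tail probability, while for $k > 1$ the value has to be computed numerically. The spillover bounds in equations (16) and (17) need a quadratic program and are not covered here.

```python
import math
from statistics import NormalDist


def quad_form(p_a, q_a, n, r, rho=0.0):
    """(1, -1) D (1, -1)^T for one treatment assignment mechanism.

    rho = 0 gives D_{0a} (equations 12 and 14); rho > 0 gives D*_{0a}.
    """
    treated = r + (1.0 - p_a) * (1.0 - r) / (n * p_a)
    control = r + p_a * (1.0 - r) / (n * (1.0 - p_a))
    off_diag = rho * (r - (1.0 - r) / n)
    return (treated + control - 2.0 * off_diag) / q_a


def clusters_direct(s2, sigma2, mu, p, q, n, r, rho=0.0):
    """Lower bound on J for the direct effects, equation (11) or (12)."""
    worst = max(quad_form(pa, qa, n, r, rho) for pa, qa in zip(p, q))
    return s2 * sigma2 / mu ** 2 * worst


def clusters_marginal(s2, sigma2, mu, p, q, n, r, rho=0.0):
    """Lower bound on J for the marginal direct effect, equation (13) or (14)."""
    total = sum(qa ** 2 * quad_form(pa, qa, n, r, rho) for pa, qa in zip(p, q))
    return s2 * sigma2 / mu ** 2 * total


def s2_one_df(alpha, power):
    """Approximate s^2 for k = 1 from standard normal quantiles."""
    z = NormalDist()
    return (z.inv_cdf(1.0 - alpha / 2.0) + z.inv_cdf(power)) ** 2


def round_up_to_multiple(x, m):
    """Smallest multiple of m that is at least x (equal arms, J_a = J / m)."""
    return m * math.ceil(x / m)


# Illustrative design: three mechanisms treating 25%, 50%, and 75% of units.
p = [0.25, 0.50, 0.75]
q = [1 / 3, 1 / 3, 1 / 3]
bound = clusters_marginal(s2_one_df(0.05, 0.8), 1.0, 0.5, p, q, n=20, r=0.2)
print(bound, round_up_to_multiple(bound, 3))
```

Under these illustrative inputs ($\sigma^2 = 1$, $\mu = 0.5$, $n = 20$, $r = 0.2$) the marginal direct effect bound is a little above 16, which becomes 18 clusters once rounded up to a multiple of $m = 3$. If $n = 20$ is the relevant setting, this seems in line with the value $J \geq 18$ quoted for $r = 0.2$ in Section 5.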

4.5 Connections to linear regression

In this section, we establish direct connections between the proposed estimators and the least squares estimators, which are popular among applied researchers. Basse and Feller (2018) study the relationships between the ordinary least squares and randomization-based estimators for the direct and spillover effects under a particular two-stage randomized experiment. Here, we extend these previous results to a general setting with $m$ treatment assignment mechanisms. We consider the following linear model for the outcome,

\[ Y_{ij} = \sum_{a=1}^m \left\{ \beta_{1a} Z_{ij} 1(A_j = a) + \beta_{0a} (1 - Z_{ij}) 1(A_j = a) \right\} + \epsilon_{ij}, \tag{18} \]

where $\epsilon_{ij}$ is the error term. Unlike the two-step procedure in Basse and Feller (2018), we fit the weighted least squares regression with the following inverse probability weights,

\[ w_{ij} = \frac{1}{J_{A_j}} \cdot \frac{1}{n_{jZ_{ij}}}. \tag{19} \]

Let $\hat\beta = (\hat\beta_{11}, \hat\beta_{01}, \ldots, \hat\beta_{1m}, \hat\beta_{0m})^\top$ be the weighted least squares estimators of the coefficients in the model of equation (18). For the variance estimator, we need additional notation. Let $X_j = (X_{1j}, \ldots, X_{n_j j})^\top$ be the design matrix of cluster $j$ for the model given in (18) with $X_{ij} = (Z_{ij} 1(A_j = 1), (1 - Z_{ij}) 1(A_j = 1), \ldots, Z_{ij} 1(A_j = m), (1 - Z_{ij}) 1(A_j = m))^\top$. Let $X = (X_1^\top, \ldots, X_J^\top)^\top$ be the entire design matrix, $W_j = \mathrm{diag}(w_{1j}, \ldots, w_{n_j j})$ be the weight matrix for cluster $j$, and $W = \mathrm{diag}(W_1, \ldots, W_J)$ be the entire weight matrix. We use $\hat\epsilon_j = (\hat\epsilon_{1j}, \ldots, \hat\epsilon_{n_j j})^\top$ to denote the residual vector for cluster $j$ obtained from the weighted least squares fit of the model given in equation (18), and $\hat\epsilon = (\hat\epsilon_1^\top, \ldots, \hat\epsilon_J^\top)^\top$ to represent the residual vector for the entire sample. We consider the cluster-robust generalization of the HC2 covariance matrix (Bell and McCaffrey, 2002),

\[ \widehat{\mathrm{var}}^{\mathrm{cluster}}_{\mathrm{hc2}}(\hat\beta) = (X^\top W X)^{-1} \left\{ \sum_j X_j^\top W_j (I_{n_j} - P_j)^{-1/2} \hat\epsilon_j \hat\epsilon_j^\top (I_{n_j} - P_j)^{-1/2} W_j X_j \right\} (X^\top W X)^{-1}, \]

where $I_{n_j}$ is the $n_j \times n_j$ identity matrix and $P_j$ is the following cluster leverage matrix,

\[ P_j = W_j^{1/2} X_j (X^\top W X)^{-1} X_j^\top W_j^{1/2}. \]

The next theorem establishes the equivalence relationship between the regression-based inference and randomization-based inference.

Theorem 9 (Equivalent Weighted Least Squares Estimators) The weighted least squares estimators based on the model of equation (18) are equivalent to the randomization-based estimators of the average potential outcomes, i.e., $\hat\beta = \widehat{Y}$. The cluster-robust generalization of the HC2 covariance matrix is equivalent to the randomization-based covariance matrix estimator, i.e., $\widehat{\mathrm{var}}^{\mathrm{cluster}}_{\mathrm{hc2}}(\hat\beta) = \widehat{D}/J$.

Proof is in Appendix S1.8.
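The first claim of Theorem 9 can be checked numerically on a toy data set. The sketch below is an illustration under stated assumptions, not the authors' code: the simulated design, the outcome model, and all function names are hypothetical, and the randomization-based estimator is taken to be the average over clusters assigned to mechanism $a$ of the within-cluster mean outcome among units with $Z_{ij} = z$. Only the point estimates are compared; the variance equivalence is not checked here.

```python
import random


def solve(a_mat, b_vec):
    """Solve a_mat x = b_vec by Gauss-Jordan elimination with pivoting."""
    size = len(b_vec)
    aug = [row[:] + [b_vec[i]] for i, row in enumerate(a_mat)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(size):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [x - factor * y for x, y in zip(aug[row], aug[col])]
    return [aug[i][size] / aug[i][i] for i in range(size)]


def simulate(seed=0, probs=(0.25, 0.5), clusters_per_arm=3, n=8):
    """Toy two-stage design: rows are (cluster j, mechanism a, z, y)."""
    rng = random.Random(seed)
    rows, j = [], 0
    for a, prob in enumerate(probs):
        for _ in range(clusters_per_arm):
            n_treated = round(n * prob)
            z = [1] * n_treated + [0] * (n - n_treated)
            rng.shuffle(z)
            rows += [(j, a, zi, rng.gauss(zi + a, 1.0)) for zi in z]
            j += 1
    return rows


def wls_and_randomization(rows, m):
    """Return (WLS coefficients, randomization-based estimates)."""
    clusters = {(j, a) for j, a, _, _ in rows}
    J_a = [sum(1 for _, a in clusters if a == arm) for arm in range(m)]
    n_jz = {}
    for j, _, z, _ in rows:
        n_jz[(j, z)] = n_jz.get((j, z), 0) + 1

    dim = 2 * m
    xtwx = [[0.0] * dim for _ in range(dim)]
    xtwy = [0.0] * dim
    for j, a, z, y in rows:
        x = [0.0] * dim
        x[2 * a + (0 if z == 1 else 1)] = 1.0  # order (beta_1a, beta_0a)
        w = 1.0 / (J_a[a] * n_jz[(j, z)])      # weights of equation (19)
        for r in range(dim):
            xtwy[r] += w * x[r] * y
            for c in range(dim):
                xtwx[r][c] += w * x[r] * x[c]
    beta = solve(xtwx, xtwy)

    # Randomization-based estimator: average of within-cluster cell means.
    y_hat = [0.0] * dim
    for j, a in clusters:
        for z in (1, 0):
            ys = [y for jj, _, zz, y in rows if jj == j and zz == z]
            y_hat[2 * a + (0 if z == 1 else 1)] += sum(ys) / len(ys) / J_a[a]
    return beta, y_hat


beta, y_hat = wls_and_randomization(simulate(), m=2)
print(max(abs(b - y) for b, y in zip(beta, y_hat)))
```

The agreement is exact up to rounding because the regressors are mutually exclusive cell indicators and the weights in equation (19) sum to one within each $(z, a)$ cell.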

5 Simulation Studies

We conduct simulation studies to evaluate the empirical performance of the sample size formulas for the direct, marginal direct, and spillover effects. We consider a two-stage randomized experiment with three different treatment assignment mechanisms ($m = 3$), under which the treated proportions are 25%, 50%, and 75%, respectively. We generate the treatment assignment mechanism $A_j$ with $\Pr(A_j = a) = 1/3$ for $a = 1, 2, 3$ such that $J_a = J/3$. We then completely randomize the treatment assignment $Z_{ij}$ within each cluster according to the selected assignment mechanism.

Our data generating process is as follows. First, we generate the cluster-level average potential outcomes as,

\[ \overline{Y}_j(0,a) \sim N(\theta_{0a}, \sigma_b^2), \qquad \overline{Y}_j(1,a) \sim N\left(\theta_{1a} + \rho\{\overline{Y}_j(0,a) - \theta_{0a}\},\ (1-\rho^2)\sigma_b^2\right) \]

for $a = 1, 2, 3$. Second, we generate the individual-level average potential outcomes $Y_{ij}(z,a)$ as,

\[ \begin{pmatrix} Y_{ij}(1,a) \\ Y_{ij}(0,a) \end{pmatrix} \sim N_2\left( \begin{pmatrix} \overline{Y}_j(1,a) \\ \overline{Y}_j(0,a) \end{pmatrix}, \begin{pmatrix} \sigma_w^2 & \rho\sigma_w^2 \\ \rho\sigma_w^2 & \sigma_w^2 \end{pmatrix} \right) \]

for $a = 1, 2, 3$. In this super population setting, the direct effect under treatment assignment mechanism $a$ is given by $\theta_{1a} - \theta_{0a}$ for $a = 1, 2, 3$, whereas the marginal direct effect equals $(\theta_{11} + \theta_{12} + \theta_{13})/3 - (\theta_{01} + \theta_{02} + \theta_{03})/3$. The spillover effect comparing treatment assignment mechanisms $a$ and $a'$ under treatment condition $z$ is $\theta_{za} - \theta_{za'}$ for $a, a' = 1, 2, 3$. However, our target causal quantities of interest are finite-sample causal effects $(\mathrm{ADE}(a), \mathrm{MDE}, \mathrm{ASE}(z; a, a'))$, which generally do not equal their super-population counterparts due to sample variation. Therefore, we center the generated potential outcomes so that the finite-sample and super-population causal effects are equal to one another, i.e., $\overline{Y}(z,a) = \theta_{za}$ for $z = 0, 1$ and $a = 1, 2, 3$.

We choose different values of the $\theta$'s based on the different alternative hypotheses for our three causal effects of interest. For the direct effect, we generate $\theta_{0a}$ ($a = 1, 2, 3$) from a uniform distribution on the interval $[-0.5, 0.5]$, $\theta_{1a}$ ($a = 1, 2$) from a uniform distribution on the interval $[-0.5 + \theta_{0a}, 0.5 + \theta_{0a}]$, and set $\theta_{13} = 0.5 + \theta_{03}$; the generated potential outcomes satisfy $\max_a |\mathrm{ADE}(a)| = 0.5$. For the marginal direct effect, we generate $\theta_{0a}$ ($a = 1, 2, 3$) from a uniform distribution on the interval $[-0.5, 0.5]$ and set $\theta_{11} = 0.25 + \theta_{01}$, $\theta_{12} = 0.75 + \theta_{02}$, and $\theta_{13} = 0.5 + \theta_{03}$; the generated potential outcomes satisfy $\mathrm{MDE} = 0.5$. For the spillover effect, we generate $\theta_{0a}$ ($a = 1, 2, 3$) and $\theta_{1a}$ ($a = 1, 2$) from a uniform distribution on the interval $[-0.25, 0.25]$ and set $\theta_{13} = 0.5 + \min(\theta_{11}, \theta_{12})$; the generated potential outcomes satisfy $\max_{z,\, a \neq a'} |\mathrm{ASE}(z; a, a')| = 0.5$.
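The data generating process described above can be sketched as follows. This is a minimal illustration with equal cluster sizes, and several details are assumptions: the function name and data layout are hypothetical, and the centering step is implemented by shifting every unit's outcome by a common constant so that the grand mean equals $\theta_{za}$, which is one simple way to meet the stated condition but may differ from the authors' implementation.

```python
import math
import random


def generate_potential_outcomes(theta, J, n, sigma2, r, rho, seed=0):
    """theta maps mechanism a -> (theta_1a, theta_0a); returns a -> (Y1, Y0).

    Y1 and Y0 are lists of J clusters, each a list of n unit-level outcomes,
    so Y1[j][i] plays the role of Y_ij(1, a).
    """
    rng = random.Random(seed)
    sd_b = math.sqrt(r * sigma2)          # between-cluster standard deviation
    sd_w = math.sqrt((1.0 - r) * sigma2)  # within-cluster standard deviation
    comp = math.sqrt(1.0 - rho ** 2)
    out = {}
    for a, (theta1, theta0) in theta.items():
        y1, y0 = [], []
        for _ in range(J):
            # Cluster-level means, correlated with coefficient rho.
            mean0 = rng.gauss(theta0, sd_b)
            mean1 = rng.gauss(theta1 + rho * (mean0 - theta0), comp * sd_b)
            # Unit-level outcomes: bivariate normal with correlation rho.
            e0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
            e1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
            y0.append([mean0 + sd_w * u for u in e0])
            y1.append([mean1 + sd_w * (rho * u + comp * v)
                       for u, v in zip(e0, e1)])
        # Centering (assumed form): shift so the grand mean equals theta_za.
        for clusters, target in ((y1, theta1), (y0, theta0)):
            shift = target - sum(map(sum, clusters)) / (J * n)
            for cluster in clusters:
                cluster[:] = [y + shift for y in cluster]
        out[a] = (y1, y0)
    return out


theta = {1: (0.5, 0.0), 2: (0.3, -0.1), 3: (0.6, 0.1)}
outcomes = generate_potential_outcomes(theta, J=6, n=5, sigma2=1.0, r=0.3, rho=0.3)
y1, y0 = outcomes[3]
print(sum(map(sum, y1)) / 30 - sum(map(sum, y0)) / 30)  # finite-sample ADE(3)
```

The values of `theta` in the usage lines are arbitrary placeholders; in the simulation they would be drawn from the uniform distributions described in the text for each alternative hypothesis.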

[Figure 1 plot. Panels: Direct Effect, Marginal Direct Effect, Spillover Effect. Vertical axis: number of clusters. Horizontal axis: intracluster correlation coefficient.]

Figure 1: The required number of clusters calculated from equations (12), (14), and (17) for the statistical power of 80%. The parameters are set to $\sigma^2 = 1$, $\mu = 0.5$, $\alpha = 0.05$, $\beta = 0.2$ with the intracluster correlation coefficient varying from 0 to 1 (horizontal axis). The solid lines indicate the setting with cluster size of $n = 20$, and the dashed lines indicate the setting with $n = 100$.

We first consider the scenario with equal cluster size $n$ for all clusters. We choose the settings with the total variance $\sigma^2 = 1$, two levels of cluster size ($n = 20$ and $n = 100$), and three levels of correlation coefficient between potential outcomes ($\rho = 0$, $\rho = 0.3$, and $\rho = 0.6$). In each setting, we vary the intracluster correlation coefficient $r = \sigma_b^2/(\sigma_w^2 + \sigma_b^2)$ from 0 to 1, which also determines the value of $\sigma_w^2$. We compute the required number of clusters using the sample size formulas and then generate the data based on the resulting number of clusters. The statistical power is estimated under each setting by averaging over 1,000 Monte Carlo simulations.

Figure 1 shows the required number of clusters calculated from equations (12), (14), and (17) for the statistical power of 80%. The parameters are set to $\sigma^2 = 1$, $\mu = 0.5$, $\alpha = 0.05$, and $\beta = 0.2$ with the intracluster correlation coefficient varying from 0 to 1 (horizontal axis). The required number of clusters for the marginal direct effect (middle panel) is much smaller than those for the direct and spillover effects (left and right panels, respectively). Across all settings, the required number of clusters increases linearly with the intracluster correlation coefficient. The difference between the settings with a small cluster size $n = 20$ and a moderate cluster size $n = 100$ is not substantial. This is because the conservative variance (covariance) matrix estimators rely solely on the estimated between-cluster variances, in which the cluster size plays a minimal role. As a result, having a large

cluster size does not affect the required number of clusters significantly.

Figure 2(a) presents the estimated statistical power for testing the alternative hypotheses concerning the direct, marginal direct, and spillover effects in the left, middle, and right plots, respectively. The numbers of clusters used for generating the data are the same under each of the three values of $\rho$ because the sample size formulas do not depend on $\rho$. For the direct effect, the achieved power is greater than the expected level (0.8) under almost all settings and is close to 1 under $\rho = 0.3$ and $\rho = 0.6$ when the intracluster correlation coefficient is moderate or large. This suggests that the sample size formula is conservative in some cases. Note that in practice it is difficult to specify $\rho$. The statistical power for the marginal direct and spillover effects exhibits a similar pattern, though it is less conservative for the spillover effect. When the intracluster correlation coefficient is small, the statistical power for the marginal direct effect is sometimes below the nominal level of 0.8. This might arise because the required number of clusters is small under these settings (e.g., $J \geq 18$ when the intracluster correlation coefficient is 0.2), making the normal approximation used by the sample size formulas less accurate.

Finally, we consider the scenario with unequal cluster size. For $n = 20$ or $n = 100$, we generate the data with clusters equally divided to have cluster sizes $0.6n$, $n$, and $1.4n$, and otherwise use the same settings as those of the case with equal cluster size. Figure 2(b) shows the results, which are largely similar to those presented in Figure 2(a).

6 Empirical Analysis

In this section, we analyze the data from the randomized experiment of the job placement assistance program described in Section 2. As mentioned earlier, we focus on two outcomes: fixed-term contract of six months or more (LTFC) and permanent contract (PC). The experimental design used for this study combines the two-stage randomized design with the cluster randomized design, which has the treatment assignment mechanisms of 0% and 100% treatment probabilities. For the sake of illustration, we focus on the two-stage randomized design part of the experiment and exclude any clusters with these two treatment assignment mechanisms. Figure 3 shows the estimated direct, marginal direct, and spillover effects for LTFC and PC with their 95% confidence intervals. We find that none of the estimated effects for LTFC (left panel) is statistically distinguishable from zero. For PC (right panel), however, we find the estimated direct effect under treatment assignment mechanism 1 (treatment probability 25%) to be positive,

[Figure 2 plots. Two sets of panels, (a) equal cluster size and (b) unequal cluster size. Columns: Direct Effect, Marginal Direct Effect, Spillover Effect. Rows: $\rho = 0$, $\rho = 0.3$, $\rho = 0.6$. Vertical axis: power. Horizontal axis: intracluster correlation coefficient.]

Figure 2: Estimated statistical power for testing the alternative hypotheses concerning the direct, marginal direct, and spillover effects. The solid lines indicate the setting with cluster size of $n = 20$, and the dashed lines indicate the setting with $n = 100$. In each plot, we vary the correlation between potential outcomes $\rho$ as well as the intracluster correlation coefficient (horizontal axis).

[Figure 3 plot. Panels: LTFC, PC. Rows from top to bottom: ADE(3), ADE(2), ADE(1), MDE, ASE(0;2,3), ASE(0;1,2), ASE(1;2,3), ASE(1;1,2). Horizontal axis: estimated effect, from −4.0% to 8.0%.]

Figure 3: Estimated direct, marginal direct, and spillover effects for the two outcomes of interest, LTFC and PC. The top three lines are the direct effects (ADE) under the three treatment assignment mechanisms; the middle line is the marginal direct effect (MDE); the bottom four lines are the spillover effects (ASE) comparing adjacent treatment assignment mechanisms under the treatment and control conditions. 95% confidence intervals as well as point estimates are shown.

while the estimated direct effect under treatment assignment mechanism 2 (treatment probability 50%) is negative.

In addition, we find some evidence of spillover effects for PC under the control condition. In particular, the spillover effect is estimated to be positive when comparing treatment assignment mechanism 3 (treatment probability 75%) with treatment assignment mechanism 2 whereas the estimated spillover effect is negative when comparing treatment assignment mechanism 2 with treatment assignment mechanism 1. The finding suggests that treating about half of the job seekers yields the greatest spillover effect.

Finally, we consider a hypothetical scenario, in which researchers use this study as a pilot study for planning a future experiment. We suppose that researchers wish to compute the sample size required for detecting certain effect sizes at statistical power 0.8 and significance level 0.05. For each of the two outcomes, we consider three alternative hypotheses: $\max_a |\mathrm{ADE}(a)| = 3\%$, $\mathrm{MDE} = 3\%$, and $\max_{z,\, a \neq a'} |\mathrm{ASE}(z; a, a')| = 3\%$. Note that the total variance is $\sigma^2 = 0.167$ for PC and $\sigma^2 = 0.195$ for LTFC; the intracluster correlation coefficient is $r = 0.02$ for both PC and LTFC. Table 2 presents the results of the sample size calculation. As expected, we find that a greater sample size is required for detecting the direct and spillover effects than the marginal direct effect of the same size.

| Outcome | $\max_a |\mathrm{ADE}(a)| = 3\%$ | $\mathrm{MDE} = 3\%$ | $\max_{z,\, a \neq a'} |\mathrm{ASE}(z; a, a')| = 3\%$ |
| LTFC | 516 | 116 | 614 |
| PC | 428 | 97 | 512 |

Table 2: The required number of clusters for detecting the causal effects of certain sizes with the statistical power 0.8 at the significance level 0.05.

7 Theoretical Comparison of Three Randomized Experiments

Although the two-stage randomized design allows for the detection of spillover effects, this may come at the cost of statistical efficiency for detecting the average treatment effect if it turns out that spillover effects do not exist. In this section, we conduct a theoretical comparison of the two-stage randomized design with the completely randomized design and cluster randomized design in the absence of interference between units. The latter two are the most popular experimental designs and are limiting designs of the two-stage randomized designs. That is, we compute the relative efficiency loss due to the use of the two-stage randomized design when there is no spillover effect.

Formally, when there is no interference between units, we can write $Y_{ij}(z,a) = Y_{ij}(z)$, $\overline{Y}_j(z,a) = \overline{Y}_j(z)$, and $\overline{Y}(z,a) = \overline{Y}(z)$. As a result, both the direct and marginal direct effects reduce to the standard average treatment effect. To unify the notation in the three types of experiments, we define the unit-level average treatment effect as $\mathrm{ATE}_{ij} = Y_{ij}(1) - Y_{ij}(0)$, the cluster-level average treatment effect as $\mathrm{ATE}_j = \sum_{i=1}^{n_j} \{Y_{ij}(1) - Y_{ij}(0)\}/n_j$, and the population-level average treatment effect as $\mathrm{ATE} = \sum_{j=1}^{J} \mathrm{ATE}_j/J$. As noted above, our comparison of three designs assumes no interference between units. The reason for this assumption is that the average treatment effect represents a different causal quantity under the three designs in the presence of interference, making the efficiency comparison across the designs less meaningful (Karwa and Airoldi, 2018).

For simplicity, consider the case when the cluster size is equal, i.e., $n_j = n$ for all $j$. Define the within-cluster variance of $Y_{ij}(z)$ and $\mathrm{ATE}_{ij}$ as,

\[ \eta_w^2(z) = \frac{\sum_{j=1}^J \sum_{i=1}^n \{Y_{ij}(z) - \overline{Y}_j(z)\}^2}{nJ - 1}, \qquad \tau_w^2 = \frac{\sum_{j=1}^J \sum_{i=1}^n \{\mathrm{ATE}_{ij} - \mathrm{ATE}_j\}^2}{nJ - 1}, \]

the between-cluster variance of $Y_{ij}(z)$ and $\mathrm{ATE}_{ij}$ as,

\[ \eta_b^2(z) = \frac{\sum_{j=1}^J \{\overline{Y}_j(z) - \overline{Y}(z)\}^2}{J - 1}, \qquad \tau_b^2 = \frac{\sum_{j=1}^J \{\mathrm{ATE}_j - \mathrm{ATE}\}^2}{J - 1}, \]

and the total variance of $Y_{ij}(z)$ and $\mathrm{ATE}_{ij}$ as,

\[ \eta^2(z) = \frac{\sum_{j=1}^J \sum_{i=1}^n \{Y_{ij}(z) - \overline{Y}(z)\}^2}{nJ - 1}, \qquad \tau^2 = \frac{\sum_{j=1}^J \sum_{i=1}^n \{\mathrm{ATE}_{ij} - \mathrm{ATE}\}^2}{nJ - 1}. \]

We can connect these variances by defining the intracluster correlation coefficient with respect to $Y_{ij}(z)$ in cluster $j$ under treatment condition $z$ as,

\[ r_j(z) = \frac{\sum_{i \neq i'} (Y_{ij}(z) - \overline{Y}(z))(Y_{i'j}(z) - \overline{Y}(z))}{(n-1) \cdot \sum_{i=1}^n (Y_{ij}(z) - \overline{Y}(z))^2}, \]

and the intracluster correlation coefficient with respect to $\mathrm{ATE}_{ij}$ in cluster $j$ as,

\[ r_j' = \frac{\sum_{i \neq i'} (\mathrm{ATE}_{ij} - \mathrm{ATE})(\mathrm{ATE}_{i'j} - \mathrm{ATE})}{(n-1) \cdot \sum_{i=1}^n (\mathrm{ATE}_{ij} - \mathrm{ATE})^2}. \]

To further facilitate our theoretical comparison, we make additional approximation assumptions. First, the intracluster correlation coefficients are approximately the same with respect to $Y_{ij}(z)$ and $\mathrm{ATE}_{ij}$ across clusters and treatment conditions, i.e., $r_j(z) \approx r_j' \approx r$. Second, the cluster size is relatively small compared to the number of clusters, so that $nJ - 1 \approx nJ \approx n(J-1)$. These approximations help simplify the expressions of the variances as

\[ \eta_w^2(z) \approx (1-r) \cdot \eta^2(z), \quad \tau_w^2 \approx (1-r) \cdot \tau^2, \quad \eta_b^2(z) \approx r \cdot \eta^2(z), \quad \tau_b^2 \approx r \cdot \tau^2. \tag{20} \]

We consider three randomized experiments in the population with $nJ$ units. Under the two-stage randomized design, the treatment is randomized according to Assumption 1. Under the completely randomized design, the treatment is randomized across units,

\[ \Pr(Z = z) = \frac{1}{\binom{nJ}{\sum_{a=1}^m J_a n p_a}}, \]

for all $z$ such that $1_{nJ}^\top z = \sum_{a=1}^m J_a n p_a$. Finally, under the cluster randomized design, the treatment is randomized across clusters, where all the units in each cluster are assigned to the same treatment condition, i.e.,

\[ \Pr(A = a) = \frac{1}{\binom{J}{\sum_{a=1}^m J_a p_a}}, \]

for all $a$ such that $1_J^\top a = \sum_{a=1}^m J_a p_a$. Note that under this setting, the number of treated units will be the same in the three types of randomized experiments. We consider the difference in means estimator for estimating ATE,

\[ \widehat{\mathrm{ATE}} = \frac{1}{J} \sum_{j=1}^J \left\{ \frac{\sum_{i=1}^n Y_{ij} Z_{ij}}{n_{j1}} - \frac{\sum_{i=1}^n Y_{ij}(1 - Z_{ij})}{n_{j0}} \right\}. \tag{21} \]

The following theorem gives the variances of this estimator under the three experimental designs.

Theorem 10 (Comparison of Three Experimental Designs) Under the approximation assumptions of equation (20), the variance of the average treatment effect estimator $\widehat{\mathrm{ATE}}$ given in equation (21) under the two-stage randomized design is

\[ \frac{1-r}{J^2} \sum_{a=1}^m \frac{J_a}{np_a} \cdot \eta^2(1) + \frac{1-r}{J^2} \sum_{a=1}^m \frac{J_a}{n(1-p_a)} \cdot \eta^2(0) - \frac{1-r}{nJ} \cdot \tau^2, \tag{22} \]

the variance of $\widehat{\mathrm{ATE}}$ under the completely randomized design is

\[ \frac{1}{\sum_{a=1}^m J_a n p_a} \cdot \eta^2(1) + \frac{1}{\sum_{a=1}^m J_a n (1-p_a)} \cdot \eta^2(0) - \frac{1}{Jn} \cdot \tau^2, \tag{23} \]

and the variance of $\widehat{\mathrm{ATE}}$ under the cluster randomized design is

\[ \frac{r}{\sum_{a=1}^m J_a p_a} \cdot \eta^2(1) + \frac{r}{\sum_{a=1}^m J_a (1-p_a)} \cdot \eta^2(0) - \frac{r}{J} \cdot \tau^2. \tag{24} \]

Proof is given in Appendix S1.9. According to Theorem 10, the ratio of the coefficients of $\eta^2(1)$ in equations (22) and (23) is

\[ (1-r) \cdot \sum_{a=1}^m q_a p_a \cdot \sum_{a=1}^m \frac{q_a}{p_a}, \tag{25} \]

whereas the ratio of the coefficients of $\eta^2(1)$ in equations (22) and (24) is

\[ \frac{1-r}{nr} \cdot \sum_{a=1}^m q_a p_a \cdot \sum_{a=1}^m \frac{q_a}{p_a}. \tag{26} \]

The ratios of the coefficients of other parameters take similar forms. Thus, our discussion focuses on equations (25) and (26).

Equation (25) implies that the relative efficiency of the two-stage randomized design over the completely randomized design depends on the intracluster correlation coefficient and on the assignment probabilities at the first and second stages of randomization. By the Cauchy-Schwarz inequality, $\sum_a q_a p_a \cdot \sum_a q_a/p_a \geq (\sum_a q_a)^2 = 1$, so equation (25) is greater than or equal to $1-r$. The value of this quantity increases as the heterogeneity between the $p_a$ increases. Therefore, as the difference in treated proportion between clusters becomes large, the two-stage randomized design becomes less efficient for estimating the average treatment effect. On the other hand, the ability to detect spillover effects relies on the heterogeneity of $p_a$. This implies that there is a tradeoff between the efficiency of estimating the average treatment effects and the ability to detect spillover effects.

In addition, when the treated proportion is identical across clusters, $p_a = p_{a'}$ for any $a, a'$, the two-stage randomized design becomes a stratified randomized design. In this case, equation (25) equals $1-r$, which is less than 1. This is consistent with the classic result that the stratified randomized design improves efficiency over the completely randomized design.

Lastly, equation (26) implies that the relative efficiency of the two-stage randomized design with respect to the cluster randomized design depends additionally on the cluster size. As the cluster size increases, the two-stage randomized design becomes more efficient than the cluster randomized design. When cluster size is large, the two-stage randomized design may be preferable because it allows for the detection of spillover effects while maintaining efficiency in estimating the average treatment effect.
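The two ratios in equations (25) and (26) are easy to tabulate for a candidate design. The sketch below is only an illustration with hypothetical function names; it assumes $q_a = J_a/J$, so that the $q_a$ sum to one, and the numerical inputs are arbitrary.

```python
def ratio_vs_complete(p, q, r):
    """Equation (25): coefficient ratio, two-stage over completely randomized."""
    mean_p = sum(qa * pa for pa, qa in zip(p, q))
    mean_inv_p = sum(qa / pa for pa, qa in zip(p, q))
    return (1.0 - r) * mean_p * mean_inv_p


def ratio_vs_cluster(p, q, r, n):
    """Equation (26): coefficient ratio, two-stage over cluster randomized."""
    return ratio_vs_complete(p, q, r) / (n * r)


q = [1 / 3, 1 / 3, 1 / 3]
r = 0.2
for p in ([0.5, 0.5, 0.5], [0.25, 0.5, 0.75], [0.1, 0.5, 0.9]):
    print(p, ratio_vs_complete(p, q, r), ratio_vs_cluster(p, q, r, n=20))
```

A value below one favors the two-stage design. With identical treated proportions the first ratio equals $1-r$, and it grows as the $p_a$ spread out, which is the tradeoff described in the text; the second ratio shrinks in proportion to $1/n$.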

8 Concluding Remarks

In this paper, we introduced a general methodology for analyzing and planning two-stage randomized experiments, which have recently gained popularity in various scientific disciplines. Future research should address several remaining methodological challenges in order to further facilitate the use of the two-stage randomized design. First, many experiments suffer from attrition, which leads to missing outcome data for some units. It is of interest to deal with such a complication in the presence of spillover effects. Second, it is often believed that spillover effects arise from interactions among a relatively small number of units. How to explore this causal heterogeneity is an important question to be addressed. Third, the standard two-stage randomized design can be extended to sequential experimentation, allowing researchers to examine how spillover effects evolve over time. Finally, from a policy-making perspective, it is of interest to develop an optimal policy that exploits spillover effects. The two-stage randomized design, or its extensions, may be able to shed light on the construction of such cost-effective policies.

References

Angelucci, M. and Di Maro, V. (2016). Programme evaluation and spillover effects. Journal of Development Effectiveness 8, 1, 22–43.

Aronow, P. and Samii, C. (2017). Estimating average causal effects under general interference. Annals of Applied Statistics 11, 4, 1912–1947.

Athey, S., Eckles, D., and Imbens, G. W. (2018). Exact p-values for network interference. Journal of the American Statistical Association 113, 521, 230–240.

Bahr, B. (1972). On sampling from a finite set of independent random variables. Probability Theory and Related Fields 24, 4, 279–286.

Baird, S., Bohren, J. A., McIntosh, C., and Ozler, B. (2018). Optimal design of experiments in the presence of interference. Review of Economics and Statistics 100, 5, 844–860.

Basse, G. and Feller, A. (2018). Analyzing multilevel experiments in the presence of peer effects. Journal of the American Statistical Association 113, 521, 41–55.

Bell, R. M. and McCaffrey, D. F. (2002). Bias reduction in standard errors for linear regression with multi-stage samples. Survey Methodology 28, 2, 169–181.

Benjamin-Chung, J., Arnold, B. F., Berger, D., Luby, S. P., Miguel, E., Colford Jr, J. M., and Hubbard, A. E. (2018). Spillover effects in epidemiology: parameters, study designs and method- ological considerations. International Journal of Epidemiology 47, 1, 332–347.

Crépon, B., Duflo, E., Gurgand, M., Rathelot, R., and Zamora, P. (2013). Do labor market policies have displacement effects? Evidence from a clustered randomized experiment. The Quarterly Journal of Economics 128, 2, 531–580.

Forastiere, L., Mealli, F., and VanderWeele, T. J. (2016). Identification and estimation of causal mechanisms in clustered encouragement designs: Disentangling bed nets using Bayesian principal stratification. Journal of the American Statistical Association 111, 514, 510–525.

Holland, P. W. (1986). Statistics and causal inference (with discussion). Journal of the American Statistical Association 81, 945–960.

Hudgens, M. G. and Halloran, M. E. (2008). Toward causal inference with interference. Journal of the American Statistical Association 103, 482, 832–842.

Imai, K., Jiang, Z., and Malani, A. (2020). Causal inference with interference and noncompliance in two-stage randomized experiments. Journal of the American Statistical Association Forthcoming.

Karwa, V. and Airoldi, E. M. (2018). A systematic investigation of classical causal inference strategies under mis-specification due to network interference. arXiv preprint arXiv:1810.08259 .

Muralidharan, K. and Sundararaman, V. (2015). The aggregate effect of school choice: Evidence from a two-stage experiment in India. Quarterly Journal of Economics 130, 3, 1011–1066.

Neyman, J. (1923). On the application of probability theory to agricultural experiments: Essay on principles, section 9. (translated in 1990). Statistical Science 5, 465–480.

Ohlsson, E. (1989). Asymptotic normality for two-stage sampling from a finite population. Probability theory and related fields 81, 3, 341–352.

Rogers, T. and Feller, A. (2018). Reducing student absences at scale by targeting parents’ misbeliefs. Nature Human Behaviour 2, 335–342.

Rosenbaum, P. R. (2007). Interference between units in randomized experiments. Journal of the American Statistical Association 102, 477, 191–200.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and non-randomized studies. Journal of Educational Psychology 66, 688–701.

Rubin, D. B. (1990). Comments on “On the application of probability theory to agricultural ex- periments. Essay on principles. Section 9” by J. Splawa-Neyman translated from the Polish and edited by D. M. Dabrowska and T. P. Speed. Statistical Science 5, 472–480.

Samii, C. and Aronow, P. M. (2012). On equivalencies between design-based and regression-based variance estimators for randomized experiments. Statistics & Probability Letters 82, 2, 365–370.

Sinclair, B., McConnell, M., and Green, D. P. (2012). Detecting spillover effects: Design and analysis of multilevel experiments. American Journal of Political Science 56, 4, 1055–1069.

Sobel, M. E. (2006). What do randomized studies of housing mobility demonstrate? causal inference in the face of interference. Journal of the American Statistical Association 101, 476, 1398–1407.

Tchetgen Tchetgen, E. J. and VanderWeele, T. J. (2010). On causal inference in the presence of interference. Statistical Methods in Medical Research 21, 1, 55–75.

Supplementary Appendix

S1 Proofs of the Theorems

S1.1 Proof of Theorem 2

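First, we derive the variance of $\widehat{Y}(z,a)$. Theory of simple random sampling implies,

\[ \mathrm{var}\left\{ \widehat{Y}_j(z,a) \mid A_j = a \right\} = \frac{1}{n_{jz}} \left(1 - \frac{n_{jz}}{n_j}\right) \sigma_j^2(z,a). \]

From the law of total variance, we have

\begin{align*}
\mathrm{var}\left\{\widehat{Y}(z,a)\right\} &= \mathrm{var}\left[ E\{\widehat{Y}(z,a) \mid A\} \right] + E\left[ \mathrm{var}\{\widehat{Y}(z,a) \mid A\} \right] \\
&= \frac{1}{J_a}\left(1 - \frac{J_a}{J}\right) \sigma_b^2(z,a) + \frac{1}{J_a J} \sum_{j=1}^J \frac{1}{n_{jz}}\left(1 - \frac{n_{jz}}{n_j}\right) \sigma_j^2(z,a).
\end{align*}

Second, we derive the covariance between $\widehat{Y}(1,a)$ and $\widehat{Y}(0,a)$. Theory of simple random sampling implies

\[ \mathrm{cov}\left\{ \widehat{Y}_j(1,a), \widehat{Y}_j(0,a) \mid A_j = a \right\} = -\frac{1}{n_j} \sigma_j^2(1,0;a). \]

From the law of total covariance, we have

\begin{align*}
\mathrm{cov}\left\{\widehat{Y}(1,a), \widehat{Y}(0,a)\right\} &= \mathrm{cov}\left[ E\{\widehat{Y}(1,a) \mid A\}, E\{\widehat{Y}(0,a) \mid A\} \right] + E\left[ \mathrm{cov}\{\widehat{Y}(1,a), \widehat{Y}(0,a) \mid A\} \right] \\
&= \frac{1}{J_a}\left(1 - \frac{J_a}{J}\right) \sigma_b^2(1,0;a) - \frac{1}{J_a J} \sum_{j=1}^J \frac{1}{n_j} \sigma_j^2(1,0;a).
\end{align*}

Third, we derive the covariance between $\widehat{Y}(z,a)$ and $\widehat{Y}(z,a')$ for $a \neq a'$. From the theory of simple random sampling, we have

\begin{align*}
\mathrm{cov}\left\{\widehat{Y}(z,a), \widehat{Y}(z,a')\right\} &= \mathrm{cov}\left[ E\{\widehat{Y}(z,a) \mid A\}, E\{\widehat{Y}(z,a') \mid A\} \right] + E\left[ \mathrm{cov}\{\widehat{Y}(z,a), \widehat{Y}(z,a') \mid A\} \right] \\
&= \mathrm{cov}\left[ E\{\widehat{Y}(z,a) \mid A\}, E\{\widehat{Y}(z,a') \mid A\} \right] \\
&= -\frac{1}{J} \sigma_b^2(z; a, a'),
\end{align*}

where the second equality follows from the conditional independence $Z_j \perp\!\!\!\perp Z_{j'} \mid A$ for $j \neq j'$.

Finally, we derive the covariance between $\widehat{Y}(z,a)$ and $\widehat{Y}(z',a')$ for $z \neq z'$ and $a \neq a'$. We have,

\begin{align*}
\mathrm{cov}\left\{\widehat{Y}(z,a), \widehat{Y}(z',a')\right\} &= \mathrm{cov}\left[ E\{\widehat{Y}(z,a) \mid A\}, E\{\widehat{Y}(z',a') \mid A\} \right] + E\left[ \mathrm{cov}\{\widehat{Y}(z,a), \widehat{Y}(z',a') \mid A\} \right] \\
&= \mathrm{cov}\left[ E\{\widehat{Y}(z,a) \mid A\}, E\{\widehat{Y}(z',a') \mid A\} \right] \\
&= -\frac{1}{J} \sigma_b^2(z, z'; a, a'),
\end{align*}

where the second equality follows from the conditional independence $Z_j \perp\!\!\!\perp Z_{j'} \mid A$ for $j \neq j'$. $\square$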

S1.2 Proof of Theorem 3

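Recall that $\widehat{D}$ is a $2m \times 2m$ block diagonal matrix whose $a$-th diagonal block is

$$\widehat{D}_a = \frac{J}{J_a}\begin{pmatrix} \widehat{\sigma}_b^2(1, a) & \widehat{\sigma}_b^2(1, 0; a) \\ \widehat{\sigma}_b^2(1, 0; a) & \widehat{\sigma}_b^2(0, a) \end{pmatrix}.$$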

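We calculate the expectation of each term in $\widehat{D}_a$. We have

$$\begin{aligned}
E\{\widehat{\sigma}_b^2(z, a)\}
&= \frac{1}{J_a - 1} E\left\{\sum_{j=1}^{J} \widehat{Y}_j(z, a)^2 I(A_j = a) - J_a \widehat{Y}(z, a)^2\right\} \\
&= \frac{1}{J_a - 1} E\left(\sum_{j=1}^{J}\left[\mathrm{var}\{\widehat{Y}_j(z, a) \mid A_j = a\} + \overline{Y}_j(z, a)^2\right] I(A_j = a)\right) - \frac{J_a}{J_a - 1}\left[\mathrm{var}\{\widehat{Y}(z, a)\} + \overline{Y}(z, a)^2\right] \\
&= \frac{J_a}{J(J_a - 1)}\sum_{j=1}^{J}\mathrm{var}\{\widehat{Y}_j(z, a) \mid A_j = a\} + \frac{J_a}{J(J_a - 1)}\sum_{j=1}^{J}\overline{Y}_j(z, a)^2 - \frac{J_a}{J_a - 1}\left[\mathrm{var}\{\widehat{Y}(z, a)\} + \overline{Y}(z, a)^2\right] \\
&= \frac{J_a}{J(J_a - 1)}\sum_{j=1}^{J}\mathrm{var}\{\widehat{Y}_j(z, a) \mid A_j = a\} + \frac{J_a(J - 1)}{(J_a - 1)J}\sigma_b^2(z, a) - \frac{J_a}{J_a - 1}\mathrm{var}\{\widehat{Y}(z, a)\} \\
&= \frac{J_a}{J(J_a - 1)}\sum_{j=1}^{J}\mathrm{var}\{\widehat{Y}_j(z, a) \mid A_j = a\} + \frac{J_a(J - 1)}{(J_a - 1)J}\sigma_b^2(z, a) \\
&\qquad - \frac{J_a}{J_a - 1}\left[\left(1 - \frac{J_a}{J}\right)\frac{\sigma_b^2(z, a)}{J_a} + \frac{1}{J_a J}\sum_{j=1}^{J}\mathrm{var}\{\widehat{Y}_j(z, a) \mid A_j = a\}\right] \\
&= \sigma_b^2(z, a) + \frac{1}{J}\sum_{j=1}^{J}\mathrm{var}\{\widehat{Y}_j(z, a) \mid A_j = a\} \\
&= \sigma_b^2(z, a) + \frac{1}{J}\sum_{j=1}^{J}\frac{1}{n_{jz}}\left(1 - \frac{n_{jz}}{n_j}\right)\sigma_j^2(z, a).
\end{aligned}$$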

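Similarly, we obtain

$$\begin{aligned}
E\{\widehat{\sigma}_b^2(1, 0; a)\} &= \sigma_b^2(1, 0; a) + \frac{1}{J}\sum_{j=1}^{J}\mathrm{cov}\{\widehat{Y}_j(1, a), \widehat{Y}_j(0, a) \mid A_j = a\} \\
&= \sigma_b^2(1, 0; a) - \frac{1}{J}\sum_{j=1}^{J}\frac{\sigma_j^2(1, 0; a)}{n_j}.
\end{aligned}$$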

Finally, we prove that $\widehat{D}$ is a conservative estimator for $D$. Denote $R = E(\widehat{D}) - D$ with the $(i, j)$-th element $r_{ij}$. We have

$$r_{2a-1,2a-1} = \sigma_b^2(1, a), \quad r_{2a,2a} = \sigma_b^2(0, a), \quad r_{2a-1,2a} = \sigma_b^2(1, 0; a)$$

for $a = 1, \ldots, m$. For $a \neq a'$, we have

$$r_{2a-1,2a'-1} = \sigma_b^2(1; a, a'), \quad r_{2a,2a'-1} = \sigma_b^2(1, 0; a, a').$$

Therefore, for any vector $c = (c_1, \ldots, c_{2m})$, $c R c^\top$ is the between-cluster variance of $\sum_{a=1}^{m} c_{2a-1} Y_{ij}(1, a) + \sum_{a=1}^{m} c_{2a} Y_{ij}(0, a)$. As a result, $\widehat{D}$ is a conservative estimator for $D$ and is unbiased for $D$ if $\overline{Y}_j(z, a)$ is constant across clusters. $\square$

S1.3 Proof of Theorem 4

To prove Theorem 4, we need the following lemma.

Lemma S1 (i) If $X \sim N_k(0, A)$, then $X^\top B X \overset{d}{=} \sum_{j=1}^{k} \lambda_j(AB)\xi_j^2$, where the $\lambda_j(AB)$'s are the eigenvalues of $AB$ and the $\xi_j^2$'s are i.i.d. $\chi^2(1)$ random variables.
(ii) If $X_n \overset{d}{\to} N_k(0, A)$ and $B_n \overset{p}{\to} B$, then $X_n^\top B_n^{-1} X_n \overset{d}{\to} \sum_{j=1}^{k} \lambda_j(AB^{-1})\xi_j^2$. If $B - A$ is positive semidefinite, then $0 \le \lambda_j(AB^{-1}) \le 1$ for all $j$.

Proof of Lemma S1. Lemma S1(i) follows from linear algebra and Lemma S1(ii) follows from Slutsky's theorem. $\square$

We now prove Theorem 4. From the asymptotic properties of $\widehat{Y}$, we know that $\sqrt{J}(C\widehat{Y} - x) \overset{d}{\to} N_k(0, CDC^\top)$ and $C\widehat{D}C^\top \overset{p}{\to} CE(\widehat{D})C^\top$. Because $CE(\widehat{D})C^\top - CDC^\top$ is positive semidefinite, from Lemma S1(ii), we have $T \overset{d}{\to} \sum_{j=1}^{k} \lambda_j \xi_j^2$, where $k$ is the rank of $C$ and $0 \le \lambda_j \le 1$ for all $j$. $\square$

S1.4 Proof of Theorem 5

To prove Theorem 5, we need the following lemma.

Lemma S2 Suppose $(X_1, \ldots, X_k)$ follows a standard multivariate normal distribution. If $0 < a_j \le a_j'$ for $j = 1, \ldots, k$, then as $J$ goes to infinity,

$$\mathrm{pr}\left\{\sum_{j=1}^{k}\left(a_j' X_j + \sqrt{J} x_j\right)^2 \ge t\right\} \ge p$$

implies

$$\mathrm{pr}\left\{\sum_{j=1}^{k}\left(a_j X_j + \sqrt{J} x_j\right)^2 \ge t\right\} \ge p,$$

where the $x_j$'s, $t$, and $p$ are arbitrary non-zero constants.

Proof of Lemma S2. Without loss of generality, we can assume $x_j > 0$ for all $j$. Since the $X_j$'s are independent of each other, it suffices to show that

$$\mathrm{pr}\left\{\left(a_j' X_j + \sqrt{J} x_j\right)^2 \ge t\right\} \ge p \tag{S1}$$

implies

$$\mathrm{pr}\left\{\left(a_j X_j + \sqrt{J} x_j\right)^2 \ge t\right\} \ge p \tag{S2}$$

for all $j$. By some algebra, (S1) is equivalent to

$$\Phi\left(\frac{\sqrt{J} x_j - \sqrt{t}}{a_j'}\right) + \Phi\left(\frac{-\sqrt{J} x_j - \sqrt{t}}{a_j'}\right) \ge p. \tag{S3}$$

As $J$ goes to infinity, the second term on the left-hand side of (S3) goes to 0. Therefore, we can write (S3) as

$$\Phi\left(\frac{\sqrt{J} x_j - \sqrt{t}}{a_j'}\right) \ge p. \tag{S4}$$

Similarly, we can show that (S2) is equivalent to

$$\Phi\left(\frac{\sqrt{J} x_j - \sqrt{t}}{a_j}\right) \ge p. \tag{S5}$$

Because $a_j' \ge a_j$, (S4) implies (S5). This completes the proof. $\square$
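The one-dimensional comparison behind (S3) to (S5) can be evaluated numerically. The sketch below is an illustration with hypothetical values of $J$, $x_j$, $t$, $a_j$ and $a_j'$ (none of them come from the paper). It computes the left-hand side of (S3) for a given scale and shows that, once $\sqrt{J} x_j$ exceeds $\sqrt{t}$, the smaller scale yields the larger exceedance probability.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal distribution function


def exceed_prob(a, shift, t):
    """pr{(a * X + shift)^2 >= t} for X ~ N(0, 1) and a > 0, as in (S3)."""
    return PHI((shift - sqrt(t)) / a) + PHI((-shift - sqrt(t)) / a)


# Hypothetical constants chosen so that sqrt(J) * x_j = 3.0 exceeds sqrt(t).
J, x_j, t = 100, 0.3, 3.84
shift = sqrt(J) * x_j
p_large = exceed_prob(1.0, shift, t)   # scale a'_j = 1.0
p_small = exceed_prob(0.6, shift, t)   # scale a_j = 0.6 <= a'_j
```

In this example `p_small` is at least `p_large`, so any bound $p$ met with scale $a_j'$ is also met with scale $a_j$, in line with the lemma.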

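We now prove Theorem 5. The number of clusters required for the test to have power $1 - \beta$ should satisfy

$$\mathrm{pr}\left\{J (C\widehat{Y})^\top (C\widehat{D}C^\top)^{-1} (C\widehat{Y}) \ge \chi^2_{1-\alpha}(k) \mid C\overline{Y} = x\right\} \ge 1 - \beta.$$

Theorem S1 implies

$$\sqrt{J}(C\widehat{Y} - x) \overset{d}{\to} N_k(0, CDC^\top).$$

Therefore, we can write $C\widehat{Y} = 1/\sqrt{J} \cdot (CDC^\top)^{1/2} \cdot W_k + x$, where $W_k$ is a $k$-length vector following a standard multivariate normal distribution. As a result, we can write the test statistic as

$$\left\{(CDC^\top)^{1/2} W_k + \sqrt{J} x\right\}^\top (C\widehat{D}C^\top)^{-1} \left\{(CDC^\top)^{1/2} W_k + \sqrt{J} x\right\}.$$

By Slutsky's theorem, it has the same asymptotic distribution as

$$\begin{aligned}
T' &= \left\{(CDC^\top)^{1/2} W_k + \sqrt{J} x\right\}^\top \{CE(\widehat{D})C^\top\}^{-1} \left\{(CDC^\top)^{1/2} W_k + \sqrt{J} x\right\} \\
&= \left[\{CE(\widehat{D})C^\top\}^{-1/2}(CDC^\top)^{1/2} W_k + \sqrt{J}\{CE(\widehat{D})C^\top\}^{-1/2} x\right]^\top \\
&\qquad \cdot \left[\{CE(\widehat{D})C^\top\}^{-1/2}(CDC^\top)^{1/2} W_k + \sqrt{J}\{CE(\widehat{D})C^\top\}^{-1/2} x\right].
\end{aligned}$$

From matrix theory, we can write $(CDC^\top)^{1/2}\{CE(\widehat{D})C^\top\}^{-1}(CDC^\top)^{1/2} = P^\top \Lambda P$, where $P$ is an orthogonal matrix and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_k)$ is a diagonal matrix. Because $D_0 - D$ is positive semidefinite, $0 \le \lambda_j \le 1$ for all $j$. Denote $U = (U_1, \ldots, U_k) = P W_k$, which also follows a standard multivariate normal distribution. Then, we can write

$$\begin{aligned}
T' &= \left[\Lambda^{1/2} U + \sqrt{J}\{CE(\widehat{D})C^\top\}^{-1/2} x\right]^\top \left[\Lambda^{1/2} U + \sqrt{J}\{CE(\widehat{D})C^\top\}^{-1/2} x\right] \\
&= \sum_{j=1}^{k}\left(\sqrt{\lambda_j}\, U_j + \sqrt{J} x_j'\right)^2,
\end{aligned}$$

where $x_j'$ is the $j$-th element of $\{CE(\widehat{D})C^\top\}^{-1/2} x$. From Lemma S2, $\mathrm{pr}(T' \ge t) \ge 1 - \beta$ is implied by

$$\mathrm{pr}\left\{\sum_{j=1}^{k}\left(U_j + \sqrt{J} x_j'\right)^2 \ge t\right\} \ge 1 - \beta. \tag{S6}$$

Based on the definition of $s^2(q, 1 - \beta, k)$, (S6) is equivalent to

$$J \sum_{j=1}^{k} x_j'^2 \ge s^2(\chi^2_{1-\alpha}(k), 1 - \beta, k).$$

Because $\sum_{j=1}^{k} x_j'^2 = x^\top\{CE(\widehat{D})C^\top\}^{-1} x$, we obtain the sample size formula,

$$J \ge \frac{s^2(\chi^2_{1-\alpha}(k), 1 - \beta, k)}{x^\top\{CE(\widehat{D})C^\top\}^{-1} x}.$$

$\square$

S1.5 Proof of Theorem 6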

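We first derive the expression of $E(\widehat{D})$ under Assumption 4. From Appendix S1.2, we have

$$\begin{aligned}
E\{\widehat{\sigma}_b^2(1, a)\} &= \sigma_b^2(1, a) + \frac{1}{J}\sum_{j=1}^{J}\frac{1}{n_{j1}}\left(1 - \frac{n_{j1}}{n_j}\right)\sigma_j^2(1, a) \\
&= \sigma_b^2 + \frac{1 - p_a}{n p_a}\sigma_w^2 \\
&= \left\{r + \frac{(1 - p_a)(1 - r)}{n p_a}\right\}\sigma^2,
\end{aligned}$$

where the second equality follows from conditions (a), (b), and (c) of Assumption 4. Similarly, we obtain

$$\begin{aligned}
E\{\widehat{\sigma}_b^2(0, a)\} &= \sigma_b^2(0, a) + \frac{1}{J}\sum_{j=1}^{J}\frac{1}{n_{j0}}\left(1 - \frac{n_{j0}}{n_j}\right)\sigma_j^2(0, a) \\
&= \sigma_b^2 + \frac{p_a}{n(1 - p_a)}\sigma_w^2 \\
&= \left\{r + \frac{p_a(1 - r)}{n(1 - p_a)}\right\}\sigma^2
\end{aligned}$$

and

$$\begin{aligned}
E\{\widehat{\sigma}_b^2(1, 0; a)\} &= \sigma_b^2(1, 0; a) - \frac{1}{J}\sum_{j=1}^{J}\frac{\sigma_j^2(1, 0; a)}{n_j} \\
&= \rho\sigma_b^2 - \frac{\rho\sigma_w^2}{n} \\
&= \rho\left(r - \frac{1 - r}{n}\right)\cdot\sigma^2.
\end{aligned}$$

Therefore, under Assumption 4, $E(\widehat{D}) = D_0^* = \sigma^2 \cdot \mathrm{diag}(D_{01}^*, D_{02}^*, \ldots, D_{0m}^*)$, where

$$D_{0a}^* = \frac{1}{q_a}\begin{pmatrix} r + \dfrac{(1 - p_a)(1 - r)}{n p_a} & \rho\left(r - \dfrac{1 - r}{n}\right) \\ \rho\left(r - \dfrac{1 - r}{n}\right) & r + \dfrac{p_a(1 - r)}{n(1 - p_a)} \end{pmatrix}$$

for $a = 1, \ldots, m$.

We next prove the sample size formula. From Theorem 5, the number of clusters required for detecting the alternative hypothesis $H_1^{\mathrm{de}}: \mathrm{ADE} = x$ with power $1 - \beta$ based on $T_{\mathrm{de}}$ is given as,

$$J \ge \frac{s^2(\chi^2_{1-\alpha}(m), 1 - \beta, m)}{x^\top\{C_1 E(\widehat{D}) C_1^\top\}^{-1} x},$$

which, under Assumption 4, is equivalent to

$$J \ge \frac{s^2(\chi^2_{1-\alpha}(m), 1 - \beta, m)\cdot\sigma^2}{x^\top\{C_1 D_0^* C_1^\top\}^{-1} x}.$$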

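Therefore, under the alternative hypothesis $H_1: \max_a |\mathrm{ADE}(a)| = \mu$, the sample size formula is

$$J \ge \max_{\|x\|_\infty = \mu} \frac{s^2(\chi^2_{1-\alpha}(m), 1 - \beta, m)\cdot\sigma^2}{x^\top\{C_1 D_0^* C_1^\top\}^{-1} x}.$$

It suffices to calculate $\min_{\|x\|_\infty = \mu} x^\top\{C_1 D_0^* C_1^\top\}^{-1} x$. Suppose that $x = (x_1, \ldots, x_m)$. We can write

$$x^\top\{C_1 D_0^* C_1^\top\}^{-1} x = \sum_{a=1}^{m} x_a^2 \left\{(1, -1) D_{0a}^* (1, -1)^\top\right\}^{-1}.$$

Therefore, the minimum is attained at $x_{a_0} = \mu$ and $x_a = 0$ for $a \neq a_0$, where

$$a_0 = \operatorname*{argmax}_a \left\{(1, -1) D_{0a}^* (1, -1)^\top\right\}.$$

As a result,

$$\min_{\|x\|_\infty = \mu} x^\top\{C_1 D_0^* C_1^\top\}^{-1} x = \mu^2 \cdot \left[\max_a \left\{(1, -1) D_{0a}^* (1, -1)^\top\right\}\right]^{-1}$$

and the sample size formula is given as,

$$J \ge \frac{s^2(\chi^2_{1-\alpha}(m), 1 - \beta, m)\cdot\sigma^2}{\mu^2}\cdot\max_a \left\{(1, -1) D_{0a}^* (1, -1)^\top\right\}.$$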
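The minimization step above can be checked by brute force. The sketch below is only an illustration: the vector `v` holds hypothetical values of $(1, -1) D_{0a}^* (1, -1)^\top$ for $m = 3$, and the grid search is a stand-in for the analytic argument, not part of the paper's procedure. It evaluates $\sum_a x_a^2 / v_a$ over a grid of points whose largest absolute coordinate equals $\mu$ and compares the smallest value with $\mu^2 / \max_a v_a$.

```python
from itertools import product


def quad_form(x, v):
    """sum_a x_a^2 / v_a, the diagonal quadratic form in the proof."""
    return sum(xa * xa / va for xa, va in zip(x, v))


def grid_min(v, mu, steps=10):
    """Smallest value of the quadratic form over grid points with max |x_a| = mu."""
    grid = [mu * (k / steps) for k in range(-steps, steps + 1)]
    best = None
    for x in product(grid, repeat=len(v)):
        if abs(max(abs(xa) for xa in x) - mu) > 1e-12:
            continue  # keep only points on the boundary max |x_a| = mu
        val = quad_form(x, v)
        if best is None or val < best:
            best = val
    return best


v = [0.8, 1.5, 1.1]   # hypothetical values of (1,-1) D*_{0a} (1,-1)^T
mu = 0.5
brute = grid_min(v, mu)
closed = mu ** 2 / max(v)
```

On this grid the brute-force minimum coincides with the closed form, attained by placing $\mu$ on the coordinate with the largest $v_a$ and zero elsewhere.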

Under $r \ge 1/(n + 1)$, we have $(1, -1) D_{0a} (1, -1)^\top \ge (1, -1) D_{0a}^* (1, -1)^\top$. Thus, a more conservative sample size formula is given as,

$$J \ge \frac{s^2(\chi^2_{1-\alpha}(m), 1 - \beta, m)\cdot\sigma^2}{\mu^2}\cdot\max_a \left\{(1, -1) D_{0a} (1, -1)^\top\right\}.$$

$\square$

S1.6 Proof of Theorem 7

From Theorem 5, the number of clusters required for detecting the alternative hypothesis $H_1^{\mathrm{mde}}: \mathrm{MDE} = \mu$ with power $1 - \beta$ based on $T_{\mathrm{mde}}$ is given as,

$$J \ge \frac{s^2(\chi^2_{1-\alpha}(1), 1 - \beta, 1)}{\mu^2\{C_2 E(\widehat{D}) C_2^\top\}^{-1}},$$

which, under Assumption 4, is equivalent to

$$J \ge \frac{s^2(\chi^2_{1-\alpha}(1), 1 - \beta, 1)\cdot\sigma^2}{\mu^2}\sum_{a=1}^{m} q_a^2 \left\{(1, -1) D_{0a}^* (1, -1)^\top\right\}.$$

Under $r \ge 1/(n + 1)$, we have $(1, -1) D_{0a} (1, -1)^\top \ge (1, -1) D_{0a}^* (1, -1)^\top$. Thus, a more conservative sample size formula is given as,

$$J \ge \frac{s^2(\chi^2_{1-\alpha}(1), 1 - \beta, 1)\cdot\sigma^2}{\mu^2}\cdot\sum_{a=1}^{m} q_a^2 \left\{(1, -1) D_{0a} (1, -1)^\top\right\}.$$

$\square$
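As a numerical illustration of the formula based on $D_{0a}^*$, the sketch below evaluates the bound for a scalar effect. It rests on two assumptions that should be checked against the main text: $s^2(q, 1 - \beta, 1)$ is read as the noncentrality parameter at which a noncentral $\chi^2(1)$ variable exceeds $q$ with probability $1 - \beta$, and the entries of $D_{0a}^*$ are those given in the proof of Theorem 6. The function names and the example inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()


def s2_one_df(alpha, power):
    """Noncentrality lam with pr{(Z + sqrt(lam))^2 >= chi2_{1-alpha}(1)} = power."""
    root_q = STD.inv_cdf(1 - alpha / 2)   # square root of the chi-square(1) critical value

    def reached(lam):
        return STD.cdf(sqrt(lam) - root_q) + STD.cdf(-sqrt(lam) - root_q)

    lo, hi = 0.0, 100.0
    for _ in range(200):                  # bisection; reached() increases with lam
        mid = (lo + hi) / 2
        if reached(mid) < power:
            lo = mid
        else:
            hi = mid
    return hi


def design_factor(p, q, n, r, rho):
    """(1,-1) D*_{0a} (1,-1)^T using the entries of D*_{0a} from the proof of Theorem 6."""
    d11 = r + (1 - p) * (1 - r) / (n * p)
    d22 = r + p * (1 - r) / (n * (1 - p))
    d12 = rho * (r - (1 - r) / n)
    return (d11 + d22 - 2 * d12) / q


def clusters_for_mde(mu, sigma2, p, q, n, r, rho, alpha=0.05, power=0.8):
    """Lower bound on J; p and q list the p_a and q_a of the m assignment mechanisms."""
    total = sum(qa ** 2 * design_factor(pa, qa, n, r, rho) for pa, qa in zip(p, q))
    return s2_one_df(alpha, power) * sigma2 / mu ** 2 * total


# Hypothetical single-mechanism design: p_a = 0.5, q_a = 1, n = 10, r = rho = 0.
bound = clusters_for_mde(mu=0.5, sigma2=1.0, p=[0.5], q=[1.0], n=10, r=0.0, rho=0.0)
```

Rounding `bound` up to the next integer gives the number of clusters under these assumptions. Larger `r` or smaller `n` raises the bound through `design_factor`.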

S1.7 Proof of Theorem 8

From Theorem 5, the number of clusters required for detecting the alternative hypothesis $\mathrm{ASE} = x$ with power $1 - \beta$ based on $T_{\mathrm{se}}$ is given as,

$$J \ge \frac{s^2(\chi^2_{1-\alpha}(2m - 2), 1 - \beta, 2m - 2)}{x^\top\{C_3 E(\widehat{D}) C_3^\top\}^{-1} x},$$

which, under Assumption 4, is equivalent to

$$J \ge \frac{s^2(\chi^2_{1-\alpha}(2m - 2), 1 - \beta, 2m - 2)\cdot\sigma^2}{x^\top\{C_3 D_0^* C_3^\top\}^{-1} x}.$$

Therefore, under the alternative hypothesis $H_1^{\mathrm{se}}: \max_{z, a \neq a'} |\mathrm{ASE}(z; a, a')| = \mu$, the sample size formula is

$$J \ge \frac{s^2(\chi^2_{1-\alpha}(2m - 2), 1 - \beta, 2m - 2)\cdot\sigma^2}{\mu^2 \cdot \min_{s \in S} s^\top\{C_3 D_0^* C_3^\top\}^{-1} s},$$

where $S$ is the set of $s = (\mathrm{ASE}(0; 1, 2), \mathrm{ASE}(0; 2, 3), \ldots, \mathrm{ASE}(0; m-1, m), \mathrm{ASE}(1; 1, 2), \mathrm{ASE}(1; 2, 3), \ldots, \mathrm{ASE}(1; m-1, m))$ satisfying $\max_{z, a \neq a'} |\mathrm{ASE}(z; a, a')| = 1$. $\square$

S1.8 Proof of Theorem 9

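We first prove the equivalence between the point estimators. The OLS estimate can be written as,

$$\widehat{\beta} = (X^\top W X)^{-1} X^\top W Y.$$

Because the columns of $X$ are orthogonal to each other, we can consider each element of $\widehat{\beta}$ separately. Therefore, we have

$$\begin{aligned}
\widehat{\beta}_{za} &= \left\{\sum_{j=1}^{J}\sum_{i=1}^{n_j} 1(Z_{ij} = z, A_j = a) w_{ij}\right\}^{-1}\left\{\sum_{j=1}^{J}\sum_{i=1}^{n_j} 1(Z_{ij} = z, A_j = a) w_{ij} Y_{ij}\right\} \\
&= \sum_{j=1}^{J}\sum_{i=1}^{n_j}\frac{1}{J_a n_{jz}}\cdot 1(Z_{ij} = z, A_j = a) Y_{ij} \\
&= \widehat{Y}(z, a).
\end{aligned}$$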
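The equivalence of the point estimators can be illustrated on a toy data set. In the sketch below the clusters, outcomes and function names are hypothetical, and the weights are taken to be $w_{ij} = 1/(J_a n_{jz})$ as in the display above. Because the regressors are mutually exclusive indicators, each weighted least-squares coefficient reduces to a weighted mean, which the code compares with the average of within-cluster means.

```python
from statistics import mean

# Hypothetical data: each cluster has an assignment mechanism a and unit-level (z, y) pairs.
clusters = [
    {"a": 1, "units": [(1, 3.0), (1, 5.0), (0, 2.0)]},
    {"a": 1, "units": [(1, 4.0), (0, 1.0), (0, 3.0)]},
    {"a": 2, "units": [(1, 6.0), (0, 2.0), (0, 4.0)]},
]


def y_hat(z, a):
    """Average over clusters with A_j = a of the within-cluster mean outcome under Z_ij = z."""
    return mean(
        mean(y for zi, y in c["units"] if zi == z) for c in clusters if c["a"] == a
    )


def wls_coefficient(z, a):
    """Weighted least-squares coefficient on the indicator 1(Z_ij = z, A_j = a)."""
    J_a = sum(1 for c in clusters if c["a"] == a)
    num = den = 0.0
    for c in clusters:
        if c["a"] != a:
            continue
        n_jz = sum(1 for zi, _ in c["units"] if zi == z)
        w = 1.0 / (J_a * n_jz)           # weight w_ij = 1 / (J_a * n_jz)
        for zi, y in c["units"]:
            if zi == z:
                num += w * y
                den += w
    return num / den
```

For every pair $(z, a)$ in this example, `wls_coefficient(z, a)` equals `y_hat(z, a)` up to rounding, since the weights within each indicator column sum to one.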

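We then prove the equivalence between the variance estimators. Recall the variance estimator,

$$\widehat{\mathrm{var}}^{\mathrm{cluster}}_{\mathrm{hc2}}(\widehat{\beta}) = (X^\top W X)^{-1}\left\{\sum_{j} X_j^\top W_j (I_{n_j} - P_j)^{-1/2}\,\widehat{\epsilon}_j \widehat{\epsilon}_j^\top\,(I_{n_j} - P_j)^{-1/2} W_j X_j\right\}(X^\top W X)^{-1},$$

where $\widehat{\epsilon}_j$ is the vector of residuals for cluster $j$, $I_{n_j}$ is the $n_j \times n_j$ identity matrix, and $P_j$ is the following cluster leverage matrix,

$$P_j = W_j^{1/2} X_j (X^\top W X)^{-1} X_j^\top W_j^{1/2}.$$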

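Without loss of generality, suppose $A_j = 1$. We have

$$(X^\top W X)^{-1} = I_{2m},$$

$$\begin{aligned}
P_j &= W_j^{1/2} X_j (X^\top W X)^{-1} X_j^\top W_j^{1/2} \\
&= \begin{pmatrix} \dfrac{1}{\sqrt{J_1 n_{j1}}} 1_{n_{j1}} & 0_{n_{j1}} \\ 0_{n_{j0}} & \dfrac{1}{\sqrt{J_1 n_{j0}}} 1_{n_{j0}} \end{pmatrix}\begin{pmatrix} \dfrac{1}{\sqrt{J_1 n_{j1}}} 1_{n_{j1}} & 0_{n_{j1}} \\ 0_{n_{j0}} & \dfrac{1}{\sqrt{J_1 n_{j0}}} 1_{n_{j0}} \end{pmatrix}^\top \\
&= \begin{pmatrix} \dfrac{1}{J_1 n_{j1}} 1_{n_{j1} \times n_{j1}} & 0_{n_{j1} \times n_{j0}} \\ 0_{n_{j0} \times n_{j1}} & \dfrac{1}{J_1 n_{j0}} 1_{n_{j0} \times n_{j0}} \end{pmatrix},
\end{aligned}$$

where $I_k$ is a $k$-dimensional identity matrix, $1_k$ ($0_k$) is a $k$-dimensional vector of ones (zeros), and $1_{k_1 \times k_2}$ ($0_{k_1 \times k_2}$) is a $k_1 \times k_2$ matrix of ones (zeros).

Since $(1_{n_{j1}}^\top, 0_{n_{j0}}^\top)^\top$ and $(0_{n_{j1}}^\top, 1_{n_{j0}}^\top)^\top$ are two eigenvectors of $I_{n_j} - P_j$ whose eigenvalue is $(J_1 - 1)/J_1$, we have,

$$(I_{n_j} - P_j)^{-1/2}(1_{n_{j1}}^\top, 0_{n_{j0}}^\top)^\top = \sqrt{\frac{J_1}{J_1 - 1}}\,(1_{n_{j1}}^\top, 0_{n_{j0}}^\top)^\top,$$

$$(I_{n_j} - P_j)^{-1/2}(0_{n_{j1}}^\top, 1_{n_{j0}}^\top)^\top = \sqrt{\frac{J_1}{J_1 - 1}}\,(0_{n_{j1}}^\top, 1_{n_{j0}}^\top)^\top.$$

Thus,

$$(I_{n_j} - P_j)^{-1/2} W_j X_j = \sqrt{\frac{J_1}{J_1 - 1}}\begin{pmatrix} \dfrac{1}{J_1 n_{j1}} 1_{n_{j1}} & 0_{n_{j1}} & 0_{n_{j1} \times (2m-2)} \\ 0_{n_{j0}} & \dfrac{1}{J_1 n_{j0}} 1_{n_{j0}} & 0_{n_{j0} \times (2m-2)} \end{pmatrix}.$$

For a unit with $(A_j = 1, Z_{ij} = 1)$, we have $\widehat{\epsilon}_{ij} = Y_{ij} - \widehat{\beta}_{11} = Y_{ij} - \widehat{Y}(1, 1)$, and for a unit with $(A_j = 1, Z_{ij} = 0)$, we have $\widehat{\epsilon}_{ij} = Y_{ij} - \widehat{\beta}_{01} = Y_{ij} - \widehat{Y}(0, 1)$. As a result,

$$\begin{aligned}
\widehat{\epsilon}_j^\top (I_{n_j} - P_j)^{-1/2} W_j X_j
&= \sqrt{\frac{J_1}{J_1 - 1}}\left(Y_{1j} - \widehat{Y}(1, 1), \ldots, Y_{n_j j} - \widehat{Y}(0, 1)\right)\begin{pmatrix} \dfrac{1}{J_1 n_{j1}} 1_{n_{j1}} & 0_{n_{j1}} & 0_{n_{j1} \times (2m-2)} \\ 0_{n_{j0}} & \dfrac{1}{J_1 n_{j0}} 1_{n_{j0}} & 0_{n_{j0} \times (2m-2)} \end{pmatrix} \\
&= \sqrt{\frac{J_1}{J_1 - 1}}\left(\frac{1}{J_1 n_{j1}}\left\{\sum_{i=1}^{n_j} Y_{ij} Z_{ij} - n_{j1}\widehat{Y}(1, 1)\right\},\ \frac{1}{J_1 n_{j0}}\left\{\sum_{i=1}^{n_j} Y_{ij}(1 - Z_{ij}) - n_{j0}\widehat{Y}(0, 1)\right\},\ 0_{2m-2}^\top\right) \\
&= \sqrt{\frac{1}{J_1(J_1 - 1)}}\left(\widehat{Y}_j(1) - \widehat{Y}(1, 1),\ \widehat{Y}_j(0) - \widehat{Y}(0, 1),\ 0_{2m-2}^\top\right).
\end{aligned}$$

A similar result applies for $A_j = a$, where $a = 1, \ldots, m$. Therefore, $\widehat{\mathrm{var}}^{\mathrm{cluster}}_{\mathrm{hc2}}(\widehat{\beta})$ is a block diagonal matrix with the $a$-th block

$$\begin{aligned}
&\frac{1}{J_a(J_a - 1)}\sum_{j=1}^{J} 1(A_j = a)\left(\widehat{Y}_j(1) - \widehat{Y}(1, a),\ \widehat{Y}_j(0) - \widehat{Y}(0, a)\right)^\top\left(\widehat{Y}_j(1) - \widehat{Y}(1, a),\ \widehat{Y}_j(0) - \widehat{Y}(0, a)\right) \\
&= \frac{1}{J_a(J_a - 1)}\begin{pmatrix} \sum_{j=1}^{J}\{\widehat{Y}_j(1) - \widehat{Y}(1, a)\}^2 1(A_j = a) & \sum_{j=1}^{J}\{\widehat{Y}_j(1) - \widehat{Y}(1, a)\}\{\widehat{Y}_j(0) - \widehat{Y}(0, a)\} 1(A_j = a) \\ \sum_{j=1}^{J}\{\widehat{Y}_j(1) - \widehat{Y}(1, a)\}\{\widehat{Y}_j(0) - \widehat{Y}(0, a)\} 1(A_j = a) & \sum_{j=1}^{J}\{\widehat{Y}_j(0) - \widehat{Y}(0, a)\}^2 1(A_j = a) \end{pmatrix} \\
&= \frac{\widehat{D}_a}{J}.
\end{aligned}$$

$\square$

S1.9 Proof of Theorem 10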

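First, we calculate the variance of $\widehat{\mathrm{ATE}}$ under the two-stage randomized design. In this case, $\widehat{\mathrm{ATE}}$ is the same as $\widehat{\mathrm{ADE}}$. From Theorem 2, we have

$$\mathrm{var}\left(\widehat{\mathrm{ADE}}\right) = \sum_{a=1}^{m}\frac{J_a^2}{J^2}\cdot\mathrm{var}\left\{\widehat{\mathrm{ADE}}(a)\right\} + \sum_{a \neq a'}\frac{J_a J_{a'}}{J^2}\cdot\mathrm{cov}\left\{\widehat{\mathrm{ADE}}(a), \widehat{\mathrm{ADE}}(a')\right\}.$$

When there is no interference, we have

$$\begin{aligned}
\mathrm{var}\left\{\widehat{\mathrm{ADE}}(a)\right\}
&= \left(1 - \frac{J_a}{J}\right)\frac{\tau_b^2}{J_a} + \frac{1}{J_a J}\sum_{j=1}^{J}\left\{\frac{\sum_{i=1}^{n}(Y_{ij}(1) - \overline{Y}_j(1))^2}{(n - 1) n p_a} + \frac{\sum_{i=1}^{n}(Y_{ij}(0) - \overline{Y}_j(0))^2}{(n - 1) n (1 - p_a)} - \frac{\sum_{i=1}^{n}(\mathrm{ATE}_{ij} - \mathrm{ATE}_j)^2}{(n - 1) n}\right\} \\
&= \left(1 - \frac{J_a}{J}\right)\frac{\tau_b^2}{J_a} + \frac{nJ - 1}{(n - 1) n J_a J}\left\{\frac{\eta_w^2(1)}{p_a} + \frac{\eta_w^2(0)}{1 - p_a} - \tau_w^2\right\}
\end{aligned}$$

and $\mathrm{cov}\{\widehat{\mathrm{ADE}}(a), \widehat{\mathrm{ADE}}(a')\} = -\tau_b^2/J$. Therefore, we can obtain

$$\begin{aligned}
\mathrm{var}\left(\widehat{\mathrm{ADE}}\right)
&= \sum_{a=1}^{m} J_a\left(1 - \frac{J_a}{J}\right)\frac{\tau_b^2}{J^2} - \sum_{a \neq a'}\frac{J_a J_{a'}}{J^2}\cdot\frac{\tau_b^2}{J} + \sum_{a=1}^{m}\frac{J_a^2}{J^2}\cdot\frac{nJ - 1}{(n - 1) n J_a J}\left\{\frac{\eta_w^2(1)}{p_a} + \frac{\eta_w^2(0)}{1 - p_a} - \tau_w^2\right\} \\
&= \frac{nJ - 1}{J^3(n - 1)}\sum_{a=1}^{m}\frac{J_a}{n p_a}\cdot\eta_w^2(1) + \frac{nJ - 1}{J^3(n - 1)}\sum_{a=1}^{m}\frac{J_a}{n(1 - p_a)}\cdot\eta_w^2(0) - \frac{nJ - 1}{J^3(n - 1)}\sum_{a=1}^{m}\frac{J_a}{n}\cdot\tau_w^2 \\
&\approx \frac{1 - \rho}{J^2}\sum_{a=1}^{m}\frac{J_a}{n p_a}\cdot\eta^2(1) + \frac{1 - \rho}{J^2}\sum_{a=1}^{m}\frac{J_a}{n(1 - p_a)}\cdot\eta^2(0) - \frac{1 - \rho}{nJ}\cdot\tau^2,
\end{aligned}$$

where the last line follows from the approximation assumptions in equation (20).

Second, the variance of $\widehat{\mathrm{ATE}}$ under the completely randomized experiment with the number of the treated units equal to $\sum_{a=1}^{m} J_a n p_a$ is given as,

$$\frac{1}{\sum_{a=1}^{m} J_a n p_a}\cdot\eta^2(1) + \frac{1}{\sum_{a=1}^{m} J_a n (1 - p_a)}\cdot\eta^2(0) - \frac{1}{Jn}\cdot\tau^2.$$

Third, we calculate the variance of $\widehat{\mathrm{ATE}}$ under cluster randomized experiments with the same number of treated units. In the cluster randomized experiments, the units in each cluster get the same treatment condition. Thus, the number of the treated clusters is $\sum_{a=1}^{m} J_a p_a$. As a result, the variance of $\widehat{\mathrm{ATE}}$ is given as,

$$\begin{aligned}
&\frac{\eta_b^2(1)}{\sum_{a=1}^{m} J_a p_a} + \frac{\eta_b^2(0)}{\sum_{a=1}^{m} J_a (1 - p_a)} - \frac{\tau_b^2}{J} \\
&\approx \frac{r}{\sum_{a=1}^{m} J_a p_a}\cdot\eta^2(1) + \frac{r}{\sum_{a=1}^{m} J_a (1 - p_a)}\cdot\eta^2(0) - \frac{r}{J}\cdot\tau^2,
\end{aligned}$$

where the last line follows from the approximation assumptions in equation (20). $\square$

S2 Asymptotic properties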

The following theorem gives the asymptotic normality result for the estimators of direct and spillover effects. Imai et al. (2020) prove a similar result, but under stronger conditions.

Theorem S1 Suppose that Assumptions 1, 2 and 3 hold. As $J \to \infty$, if $Y$ is bounded, then

$$\sqrt{J}\{\widehat{Y} - \overline{Y}\} \overset{d}{\to} N_{2m}(0, D), \tag{S7}$$

where $D = \lim_{J \to \infty} J\,\mathrm{cov}(\widehat{Y})$.

The covariance matrix $D$ is a $2m \times 2m$ matrix with the $(i, j)$-th element $d_{ij}$, where the $(2a-1)$-th and $2a$-th rows and columns represent $\widehat{Y}(1, a)$ and $\widehat{Y}(0, a)$, respectively. It is straightforward to obtain the expression of $D$ based on the results in Section S1.1. In the univariate setting, Bahr (1972) and Ohlsson (1989) give the asymptotic result for two-stage random sampling. Using their techniques, we can prove Theorem S1. Because the details for the asymptotic properties are not the focus of the paper, we leave the proof to future work.

Theorem S2 As $J \to \infty$, if $Y$ is bounded, then

$$\widehat{D} \overset{p}{\to} E(\widehat{D}).$$

S3 Computational details

We provide a strategy for numerically calculating the required number of clusters in Theorem 8. We focus on the following optimization problem,

$$\min_{s \in S} s^\top\{C_3 D_0 C_3^\top\}^{-1} s,$$

where $s = (\mathrm{ASE}(0; 1, 2), \mathrm{ASE}(0; 2, 3), \ldots, \mathrm{ASE}(0; m-1, m), \mathrm{ASE}(1; 1, 2), \mathrm{ASE}(1; 2, 3), \ldots, \mathrm{ASE}(1; m-1, m))$ satisfies the constraint $\max_{z, a \neq a'} |\mathrm{ASE}(z; a, a')| = 1$.

We consider all the possible cases in which $\max_{z, a \neq a'} |\mathrm{ASE}(z; a, a')| = 1$ holds. First, using quadratic programming, we can obtain the minimum of $s^\top\{C_3 D_0 C_3^\top\}^{-1} s$ under the constraints $\mathrm{ASE}(0; 1, 2) = 1$ and $-1 \le \mathrm{ASE}(z; a, a') \le 1$ for all $z, a, a'$. We denote it by $l_{0,1,2}$. Similarly, for any other $(z, a, a')$, we can obtain the minimum of $s^\top\{C_3 D_0 C_3^\top\}^{-1} s$, denoted by $l_{z,a,a'}$, under the constraints $\mathrm{ASE}(z; a, a') = 1$ and $-1 \le \mathrm{ASE}(z; a, a') \le 1$ for all $z, a, a'$. Therefore, we can obtain $l_{z,a,a'}$ for all $z, a, a'$ by implementing this procedure for each of the possible cases satisfying $\max_{z, a \neq a'} |\mathrm{ASE}(z; a, a')| = 1$. As a result, the solution to the optimization problem is $\min_{z,a,a'} l_{z,a,a'}$.

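The case-by-case strategy can be sketched for the smallest design, $m = 2$, where $s = (\mathrm{ASE}(0; 1, 2), \mathrm{ASE}(1; 1, 2))$ has two entries. The sketch makes several assumptions: the matrix `M` is a hypothetical stand-in for $\{C_3 D_0 C_3^\top\}^{-1}$ and is taken to be symmetric positive definite, and a clipped one-dimensional minimizer replaces a general quadratic programming solver, which would be needed for $m > 2$. Fixing an entry at $-1$ gives the same value by symmetry, so only the value $1$ is considered.

```python
def min_with_coordinate_fixed(M, fixed):
    """Minimise s^T M s over 2-vectors s with s[fixed] = 1 and the other entry in [-1, 1]."""
    other = 1 - fixed
    # Unconstrained minimiser of the one-dimensional quadratic, clipped to the box.
    s_other = max(-1.0, min(1.0, -M[fixed][other] / M[other][other]))
    s = [0.0, 0.0]
    s[fixed], s[other] = 1.0, s_other
    return sum(s[i] * M[i][j] * s[j] for i in range(2) for j in range(2))


def min_over_unit_max_norm(M):
    """Smallest of the case-wise minima, one case per entry fixed at 1."""
    return min(min_with_coordinate_fixed(M, k) for k in range(2))


M = [[2.0, 1.0], [1.0, 2.0]]   # hypothetical symmetric positive definite matrix
solution = min_over_unit_max_norm(M)
```

Each call to `min_with_coordinate_fixed` plays the role of one $l_{z,a,a'}$, and the outer minimum corresponds to $\min l_{z,a,a'}$.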