
Statistical Science 2009, Vol. 24, No. 1, 29–53
DOI: 10.1214/08-STS274
© Institute of Mathematical Statistics, 2009

The Essential Role of Pair Matching in Cluster-Randomized Experiments, with Application to the Mexican Universal Health Insurance Evaluation

Kosuke Imai, Gary King and Clayton Nall

Abstract. A basic feature of many field experiments is that investigators are only able to randomize clusters of individuals (such as households, communities, firms, medical practices, schools or classrooms) even when the individual is the unit of interest. To recoup the resulting efficiency loss, some studies pair similar clusters and randomize treatment within pairs. However, many other studies avoid pairing, in part because of claims in the literature, echoed by clinical trials standards organizations, that this matched-pair, cluster-randomization design has serious problems. We argue that all such claims are unfounded. We also prove that the estimator recommended for this design in the literature is unbiased only in situations when matching is unnecessary; its standard error is also invalid. To overcome this problem without modeling assumptions, we develop a simple design-based estimator with much improved statistical properties. We also propose a model-based approach that includes some of the benefits of our design-based estimator as well as the estimator in the literature. Our methods also address individual-level noncompliance, which is common in applications but not allowed for in most existing methods. We show that from the perspective of bias, efficiency, power, robustness or research costs, and in large or small samples, pairing should be used in cluster-randomized experiments whenever feasible; failing to do so is equivalent to discarding a considerable fraction of one's data. We develop these techniques in the context of a randomized evaluation we are conducting of the Mexican Universal Health Insurance Program.

Key words and phrases: Causal inference, community intervention trials, field experiments, group-randomized trials, place-randomized trials, health policy, matched-pair design, noncompliance, power.

Kosuke Imai is Assistant Professor, Department of Politics, Princeton University, Princeton, New Jersey 08544, USA (e-mail: [email protected]; URL: http://imai.princeton.edu). Gary King is Albert J. Weatherhead III University Professor, Institute for Quantitative Social Science, Harvard University, 1737 Cambridge St., Cambridge, Massachusetts 02138, USA (e-mail: King@Harvard.edu; URL: http://GKing.Harvard.edu). Clayton Nall is Ph.D. Candidate, Department of Government, Institute for Quantitative Social Science, Harvard University, 1737 Cambridge St., Cambridge, Massachusetts 02138, USA (e-mail: [email protected]).

1. INTRODUCTION

For political, ethical or administrative reasons, researchers conducting field experiments are often unable to randomize treatment assignment to individuals and so instead randomize treatments to clusters of individuals (Murray, 1998; Donner and Klar, 2000a; Raudenbush, Martinez and Spybrook, 2007). For example, 19 (68%) of the 28 field experiments we found published in major political science journals since 2000 randomized households, precincts, city-blocks or villages even though individual voters were the inferential target (e.g., Arceneaux, 2005); in public health and medicine, where "the number of trials reporting a cluster design has risen exponentially since 1997" (Campbell, 2004), randomization occurs at the level of health clinics, physicians or other administrative and geographical units even though individuals are the units of interest (e.g., Sommer et al., 1986; Varnell et al., 2004); and numerous education researchers randomize schools, classrooms or teachers instead of students (e.g., Angrist and Lavy, 2002).

Since efficiency drops when randomizing clusters of individuals instead of individuals themselves (Cornfield, 1978), many scholars attempt to recoup some of this lost efficiency by pairing clusters, based on the similarity of available background characteristics, before randomly assigning one cluster within each pair to receive the treatment assignment (e.g., Ball and Bogatz, 1972; Gail et al., 1992; Hill, Rubin and Thomas, 1999). Since matching prior to random treatment assignment can greatly improve the efficiency of causal effect estimation (Bloom, 1978; Greevy et al., 2004), and matching in pairs can be substantially more efficient than matching in larger blocks, matched-pair, cluster-randomization (MPCR) would appear to be an attractive design for field experiments (Imai, King and Stuart, 2008). [See also Moulton (2004).] The design is especially useful for public policy experiments since, when used properly, it can be robust to interventions by politicians and others that have ruined many policy evaluations, such as when office-holders arrange program benefits for constituents who live in control group clusters (King et al., 2007).

Unfortunately, despite its apparent benefits and common usage, this experimental design has an uncertain scientific status. Researchers in this area and formal statements from standards organizations (e.g., Donner and Klar, 2004; Feng et al., 2001; Medical Research Council, 2002) claim that certain "analytic limitations" make MPCR, or at least the existing methods available to analyze data from this design, inappropriate. These claimed limitations include "the restriction of prediction models to cluster-level baseline risk factors (e.g., cluster size), the inability to test for homogeneity of ... [causal effects across clusters], and difficulties in estimating the intracluster correlation coefficient, a measure of similarity among cluster members" (Klar and Donner, 1997, page 1754). In addition, in a widely cited article, Martin et al. (1993) claim that in small samples, pairing can reduce statistical power.

We show that each of the claims regarding analytical limitations of MPCR is incorrect. We also demonstrate that the power calculations leading Martin et al. (1993) to recommend against MPCR in small samples depend on an assumption of equal cluster sizes that vitiates one major advantage of pair matching; we show in real data that the assumption does not apply and that, without it, pair matching on cluster sizes and pre-treatment variables that affect the outcome improves both efficiency and power a great deal, even in samples as small as three pairs. In fact, because the efficiency gain of MPCR depends on the correlation of cluster means weighted by cluster size, the advantage can be much larger than the unweighted correlations that have been studied seem to indicate, even when cluster sizes are independent of the outcome.

Finally, there exists no published formal evaluation of the statistical properties of the estimator for MPCR data most commonly recommended in the methodological literature. By defining the quantities of interest separately from the methods used to estimate them, and identifying a model that gives rise to the most commonly used estimator, we show that this approach depends on assumptions, such as the homogeneity of treatment effects across all clusters, that apply best when matching is not needed to begin with. The commonly used estimator is also biased. We then offer new simple design-based estimators and their variances. We also propose an alternative model-based approach that includes the benefits of our design-based estimator, which has little or no bias, and the estimator in the literature, which under certain circumstances has lower variance. Finally, we extend our methods to situations with individual-level noncompliance, which is a basic feature of many MPCR experiments but for which most prior methods have not been adapted.

With the results and new methods offered here, ambiguity about what to do in cluster randomized experiments vanishes: pair matching should be used whenever feasible.

2. EVALUATION OF THE MEXICAN UNIVERSAL HEALTH INSURANCE PROGRAM

As a running example of MPCR, we introduce a randomized evaluation we are conducting of Seguro Popular de Salud (SPS) in Mexico. A major domestic initiative of the Vicente Fox presidency, the program seeks "to provide social protection in health to the 50 million uninsured Mexicans" (Frenk et al., 2003, page 1667), constituting about half the population (King et al., 2007). The government intends to spend an additional one percent of GDP on health compared to 2002 once the program is fully introduced.

SPS permitted a cluster randomized (CR) study to be built into the program rollout. Under national legislation, Mexican states must apply to the federal government for funds both to publicize the program and fund its operations. The federal government approves these requests only when local health clinics are brought up to federal standards. When an area is approved to begin program enrollment, families who affiliate are expected to receive free preventative and regular medical care, pharmaceuticals and medical procedures. However, because local health clinics and hospitals may take years to meet federal standards, and also because of budget restrictions, a staged rollout was necessary and also allowed us the chance to run this randomized study. Finally, since SPS allows individuals to decide for themselves whether to enroll (if necessary, by traveling from unenrolled to enrolled areas), it was possible to adopt a clustered encouragement design (Frangakis et al., 2002), thereby permitting estimation of individual-level program effects. (We focus on the ITT effect until Section 6.)

The MPCR design was implemented in geographic areas created for the project which we call "health clusters," defined as the geographic catchment area of a local hospital or clinic. The country is tiled by 12,824 such clusters, and negotiations with the Mexican government produced more than 100 for which random assignment was acceptable. The chosen clusters were paired based on demographics, poverty, education, and health infrastructure. Within each pair, one "treatment" cluster was randomly chosen for early program rollout, receiving funds to upgrade their health clinics and encourage individual enrollment. The "control" cluster in each pair had its rollout set for some future time. (Individuals could still obtain SPS benefits by traveling to SPS-approved clusters, but did not receive encouragement or resources to do so.) For design details, see King et al. (2007).

The primary outcome of interest at this stage was the level of out-of-pocket health expenditures, while secondary outcomes of interest included medical utilization, health self-assessment and self-reported health behaviors. Outcomes were measured in a baseline and followup panel survey of more than 32,000 households. Our examples draw upon 67 of these variables measured at the 10-month followup.

3. MATCHED-PAIR, CLUSTER-RANDOMIZED EXPERIMENTS

We now introduce MPCR experiments, including the theories of inference commonly applied (Section 3.1), the formal definitions, notation and assumptions used (Section 3.2), and the quantities of interest typically sought (Section 3.3).

3.1 Theories of Inference

We describe the model-based and permutation-based theories of inference that have been applied to MPCR data and then the design-based theory from which our work is derived.

First, model-based inference applied to MPCR typically uses generalized mixed-effects models, generalized estimating equations or multi-level models (Feng et al., 2001). Most of these work only if the modeling assumptions are correct; they also rely on asymptotic approximations. Model-based and model-assisted approaches have proved to be powerful in other areas, especially in survey research and observational studies where modeling is often necessary, but they violate the purpose and spirit of experimental work, which goes to great lengths and expense to avoid these types of assumptions.

Fisher's (1935) permutation-based theory of inference, which constructs exact nonparametric hypothesis tests based only on the random treatment assignment, has also been applied to MPCR. Although permutation inference in principle requires no models or approximations, in practice applications typically have required additional assumptions such as constant treatment effects across clusters or some kind of (e.g., Monte Carlo, large sample) approximations. The existing applications include Gail et al. (1996) and Braun and Feng (2001), which combine permutation inference with parametric modeling, and Small, Ten Have and Rosenbaum (2008), which considers quantile effects using different and more modest assumptions.

In contrast, we use Neyman's (1923) theory of inference, which is well known but has not before been attempted for MPCR. Like Fisher's permutation-based theory, Neyman's approach is also design (or "randomization") based and nonparametric, but it naturally avoids the constant treatment effect assumption and can provide valid inferences about both sample and population average treatment effects without modeling assumptions (Rubin, 1991). The estimators we derive are also simple to understand and easier to compute (requiring only weighted means and no numerical optimization or simulation).
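The design described in Section 2, in which clusters are first paired on background characteristics and one cluster per pair is then chosen at random for treatment, can be sketched in a few lines of Python. This is only an illustrative sketch: the greedy sort-and-pair rule, the function names and the single made-up covariate are assumptions introduced here, not the matching procedure used in the SPS evaluation.

```python
import random

def pair_clusters(covariate):
    """Greedy pairing: sort cluster indices by one covariate, pair neighbors.

    Hypothetical helper for illustration; assumes an even number of clusters.
    """
    order = sorted(range(len(covariate)), key=lambda c: covariate[c])
    return [(order[i], order[i + 1]) for i in range(0, len(order), 2)]

def randomize_within_pairs(pairs, rng):
    """Flip one fair coin per pair; a 1 treats the first cluster of the pair."""
    assignment = {}
    for first, second in pairs:
        z = rng.randint(0, 1)        # probability 1/2, independent across pairs
        assignment[first] = z        # first cluster treated when z == 1
        assignment[second] = 1 - z   # second cluster gets the other arm
    return assignment

poverty_rate = [0.31, 0.12, 0.29, 0.55, 0.14, 0.52]   # made-up cluster covariate
pairs = pair_clusters(poverty_rate)
treatment = randomize_within_pairs(pairs, random.Random(0))
```

Because the coin is flipped once per pair, every pair contributes exactly one treated and one control cluster, which is the only source of randomness the design-based theory relies on.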

3.2 Formal Design Definition, Notation and Assumptions

Consider a MPCR experiment where $2m$ clusters are paired, based on a known function of the cluster characteristics, prior to the randomization of a binary treatment. We assume the $j$th cluster in the $k$th pair contains $n_{jk}$ units, where $j = 1, 2$ and $k = 1, \ldots, m$, and thus the total number of units is equal to $n = \sum_{k=1}^{m} (n_{1k} + n_{2k})$.

Under MPCR, simple randomization of an indicator variable, $Z_k$ for $k = 1, 2, \ldots, m$, is conducted independently across the $m$ pairs. For a pair with $Z_k = 1$, the first cluster within pair $k$ is treated (in our case, assigned encouragement to affiliate with SPS), and the second cluster is assigned control. In contrast, for a pair with $Z_k = 0$, the first cluster is the control whereas the second is treated. Thus, using $T_{jk}$ for the treatment indicator for the $j$th cluster in the $k$th pair, $T_{1k} = Z_k$ and $T_{2k} = 1 - Z_k$. In the context of the SPS evaluation, we consider an intention-to-treat (ITT) analysis to estimate the causal effects of encouragement to affiliate with the program (see Section 6 on the estimation of causal effects of the actual affiliation).

We denote $Y_{ijk}(T_{jk})$ as the potential outcomes under the treatment ($T_{jk} = 1$) and control ($T_{jk} = 0$) conditions for the $i$th unit in the $j$th cluster of the $k$th pair (Holland, 1986; Maldonado and Greenland, 2002). The observed outcome variable is $Y_{ijk} = T_{jk} Y_{ijk}(1) + (1 - T_{jk}) Y_{ijk}(0)$. Finally, the order of clusters within each pair is randomized so that the population distribution of $(Y_{i1k}(1), Y_{i1k}(0))$ equals that of $(Y_{i2k}(1), Y_{i2k}(0))$ (though this equality may not hold in sample).

A defining feature of CR experiments is that the potential outcomes for the $i$th unit in the $j$th cluster of the $k$th pair are a function of the cluster-level randomized treatment variable, $T_{jk}$, rather than its unit-level treatment counterpart. Similarly, the unit-level causal effect, $Y_{ijk}(1) - Y_{ijk}(0)$, is the difference between two unit-level potential outcomes that are functions of the cluster-level treatment variable. Thus, in CR experiments, the usual assumption of no interference (Cox, 1958; Rubin, 1990) applies only at the cluster level. Moreover, in MPCR, assuming no interference only between pairs of clusters is sufficient. This advantage of MPCR designs can be substantial if contagion or social influence is present at the individual level, where, for example, individuals may affect the behavior of neighbors or friends, but such interference does not exist across clusters or pairs of clusters. Thus, we only assume:

ASSUMPTION 1 (No interference between matched-pairs). Let $Y_{ijk}(\mathbf{T})$ be the potential outcomes for the $i$th unit in the $j$th cluster of the $k$th matched-pair where $\mathbf{T}$ is a $(m \times 2)$ matrix whose $(j, k)$ element is $T_{jk}$. We assume that if $T_{jk} = T'_{jk}$, then $Y_{ijk}(\mathbf{T}) = Y_{ijk}(\mathbf{T}')$.

The assumption allows us to write $Y_{ijk}(T_{jk})$ rather than $Y_{ijk}(\mathbf{T})$. Since $T_{1k} = Z_k$ and $T_{2k} = 1 - Z_k$, $Y_{ijk}(T_{jk})$ only depends on $Z_k$. Given that the assumption of no interference among individuals is often highly unrealistic (Sobel, 2006), MPCR offers an attractive alternative. In the Mexico experiment, Assumption 1 is reasonable because most of the clusters in our experiment are noncontiguous and the travel times between them are substantial. However, especially in small villages, individual-level no interference assumptions would have been implausible.

Finally, we formalize the cluster-level randomized treatment assignment as follows.

ASSUMPTION 2 (Cluster randomization under matched-pair design). The potential outcomes are independent of the randomization indicator variable: $(Y_{ijk}(1), Y_{ijk}(0)) \perp\!\!\!\perp Z_k$, for all $i$, $j$ and $k$. Also, $Z_k$ is independent across matched-pairs, and $\Pr(Z_k = 1) = 1/2$ for all $k$.

The assumption also implies $(Y_{ijk}(1), Y_{ijk}(0)) \perp\!\!\!\perp T_{jk}$ since $T_{jk}$ is a function of $Z_k$.

3.3 Quantities of Interest

We now offer the definitions of the causal effects of interest under MPCR (or CR in general) which have not been formally defined in the literature. At least two types of each of four distinct quantities may be of interest in these experiments. We begin with the four quantities, which define the target population, and then discuss the two types, which clarify the role of interference. Section 6 introduces additional quantities of interest when individual-level noncompliance exists. (All of the quantities below are based on causal effects defined as grouped individual-level phenomena; we discuss cluster-level causal quantities in Section 4.6.)

3.3.1 Target population quantities. Table 1 offers an overview of the four target population causal effects. All four quantities represent the causal treatment effect (the potential outcome under treatment minus the potential outcome under control) averaged over different sets of units.

TABLE 1
Quantities of interest: For each causal effect, this table lists whether clusters and units within clusters are treated as observed and fixed or instead as a sample from a larger population. The resulting inferential target is also given

Quantity          Clusters   Units within clusters   Inferential target
$\psi_S$ (SATE)   Observed   Observed                Observed sample
$\psi_C$ (CATE)   Observed   Sampled                 Population within observed clusters
$\psi_U$ (UATE)   Sampled    Observed                Observable units within the population of clusters
$\psi_P$ (PATE)   Sampled    Sampled                 Population

First is the sample average treatment effect (SATE or $\psi_S$), which is an average over the set of all units in the observed sample (which we denote as $S$):

$$\psi_S \equiv E_S\bigl(Y(1) - Y(0)\bigr) = \frac{1}{n} \sum_{k=1}^{m} \sum_{j=1}^{2} \sum_{i=1}^{n_{jk}} \bigl(Y_{ijk}(1) - Y_{ijk}(0)\bigr), \tag{1}$$

where the sums go over pairs, the two clusters within each pair and the units within each cluster.

The second quantity treats observed clusters as fixed (and not necessarily representative of some population) and the units within clusters as randomly sampled from the finite population of units within each cluster. This gives the cluster average treatment effect (CATE or $\psi_C$):

$$\psi_C \equiv E_C\bigl(Y(1) - Y(0)\bigr) = \frac{1}{N} \sum_{k=1}^{m} \sum_{j=1}^{2} \sum_{i=1}^{N_{jk}} \bigl(Y_{ijk}(1) - Y_{ijk}(0)\bigr), \tag{2}$$

where the expectation is taken over the set $C$ which contains all observed units within the sample clusters, $N_{jk}$ is the known (and finite) population cluster size, and $N \equiv \sum_{k=1}^{m} (N_{1k} + N_{2k})$. Throughout, we assume simple random sampling within each cluster for simplicity, but other random sampling procedures can easily be accommodated via unit-level weights. Thus, the only difference between SATE and CATE is whether each unit within clusters is treated as fixed or randomly drawn based on a known sampling mechanism.

A third quantity treats the clusters as randomly sampled from a larger population, but the units within the sampled clusters are treated as fixed. The inferential target is the set $U$, which includes all units in the population of clusters that would be observed if its cluster were in the observed sample. This is what we call the unit average treatment effect (UATE or $\psi_U$) and is defined as $\psi_U \equiv E_U(Y(1) - Y(0))$.

The final quantity of interest is the population average treatment effect (PATE or $\psi_P$), which is defined as $\psi_P \equiv E_P(Y(1) - Y(0))$, where the expectation is taken over the entire population $P$, that is, the population of units within the population of clusters. For simplicity throughout, we assume an infinite population of clusters, but this is easily extended to finite populations at some cost in additional notation.

Researchers should design their experiments to make inferences to their desired quantity of interest, though in practice they may choose to estimate other quantities of interest when they face design limitations. In the SPS evaluation, for example, we would like to infer PATE for all of Mexico, but our health clusters were not (and due to political and administrative constraints could not be) randomly selected. This means that, like most medical experiments, any method applied to our data to estimate PATE will be dependent on assumptions about the selection process. An alternative approach would be to try to estimate one of the other quantities. CATE or SATE are straightforward possibilities, and CATE is probably most apt in this case, since individuals within clusters were randomly selected, and both quantities condition on the clusters we observe. Of course, even when inferences are made to restricted populations, readers may still extrapolate to a different population of interest, and so the researcher needs to decide on the appropriate presentation strategy. From a public policy perspective, UATE may be a reasonable target quantity, where we try to infer to the individuals who would be sampled in all the health clusters in Mexico that are similar to our observed clusters, and from which our clusters could plausibly have been randomly drawn.

3.3.2 Interference. Inference in CR experiments may be affected by three different types of interference, each of which may require different assumptions. First, when interference exists among individuals within a cluster, the potential outcomes of one person (or unit) within a cluster may be different depending on other units' treatment assignment. This type of interference is expected and no assumptions are required for the four causal quantities of interest. In CR experiments, within-cluster interference is part of the outcome, and researchers can estimate the causal effects of cluster-level treatment on unit-level outcome. Understanding the effect of individuals independent of and isolated from other individuals in the same cluster is best left to studies where individual randomization is possible.

Second, interference between clusters in different pairs may affect outcomes. Assumption 1 requires the absence of such interference between clusters in different pairs. We continue to maintain this assumption, as Sobel (2006) demonstrates that without it even the definition of a causal effect is complicated (see also Rosenbaum, 2007).

Third, interference between treatment and control clusters in the same pair requires us to redefine causal effects to account for interference. For example, if one cluster is assigned SPS, individuals in the other (control) cluster within the pair may become envious or depressed as a consequence. This type of interference within a pair can be dealt with in two ways. In the first, which we call the no-interference version, we define the causal effect (SATE, CATE, UATE or PATE) so that the treatment in one cluster has no effect on the potential outcomes of units in the control cluster. In the second, which we call the with-interference version, the causal effect is defined so that it includes interference between clusters within pairs as well as interference between units within each cluster. (For our Mexico experiment, we do not expect much direct interference within or across pairs, although nearby clusters outside our experiment might exert some influence over those we observe, in which case the definition of UATE or PATE might change.)

Estimating the no-interference version of SATE, CATE, UATE or PATE in the presence of interference is feasible only with assumption-laden estimators. In contrast, estimating the with-interference version is easier since it accepts whatever level of non-interference one's data happens to present. Of course, having a quantity that is easy to estimate is not a satisfactory substitute for having an estimate of the quantity of interest. The best way to avoid this problem is to use these facts to design better experiments. For example, we can select noncontiguous clusters to pair, and pairs that are not contiguous to other pairs. Following rules like this whenever feasible reduces the difference between the no-interference and with-interference quantities.

4. ESTIMATORS

We now define our estimators and derive their statistical properties. Our strategy throughout is to make as few assumptions as feasible beyond the experimental design. We also briefly discuss an approach that has been offered in the literature. Since our approach has little or no bias, and the existing estimator is biased but may have low variance in some circumstances, we also offer a model-based method that combines some of the benefits of both approaches.

4.1 Definitions

The point estimators for the with-interference version of the four quantities of interest are each weighted averages of within-pair mean differences between the treated and control clusters, but with different weights. We thus define

$$\hat\psi(w_k) \equiv \frac{1}{\sum_{k=1}^{m} w_k} \sum_{k=1}^{m} w_k \left\{ Z_k \left( \frac{\sum_{i=1}^{n_{1k}} Y_{i1k}}{n_{1k}} - \frac{\sum_{i=1}^{n_{2k}} Y_{i2k}}{n_{2k}} \right) + (1 - Z_k) \left( \frac{\sum_{i=1}^{n_{2k}} Y_{i2k}}{n_{2k}} - \frac{\sum_{i=1}^{n_{1k}} Y_{i1k}}{n_{1k}} \right) \right\}, \tag{3}$$

where the weight for the $k$th pair of clusters, denoted by $w_k$, defines a specific estimator.

The estimator most commonly recommended in the methodological literature is based on a weight using the harmonic mean of sample cluster sizes, which can be written as $\hat\psi(n_{1k} n_{2k}/(n_{1k} + n_{2k}))$ (see, e.g., Donner, 1987; Donner and Donald, 1987; Donner and Klar, 1993; Hayes and Bennett, 1999; Bloom, 2006; Raudenbush, 1997; Turner, White and Croudace, 2007). This estimator, and its variance estimator, are in general biased (see Appendix A.4), but may have low variance in some situations, an issue we return to in Section 4.5.

As shown in Table 2, $\hat\psi(n_{1k} + n_{2k})$ is our point estimator for both SATE and UATE, whereas $\hat\psi(N_{1k} + N_{2k})$ applies to both CATE and PATE. This is intuitive, as SATE and UATE are based on those units (which would be) sampled in a cluster whereas CATE and PATE are based on the population of units within clusters. Our estimator for SATE and UATE differs from the existing estimator based on harmonic mean weights unless the sample cluster sizes within each matched pair are equal ($n_{1k} = n_{2k}$ for all $k = 1, \ldots, m$), which rarely occurs at least in field experiments.

Table 2 also summarizes the variances and their estimators. Under our design-based inference, UATE and PATE have identifiable variances, the exact expression for which we give below. SATE and CATE have unidentifiable variances, and so we offer their upper bound, leading to a conservative confidence interval.
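The point estimator in equation (3) needs nothing beyond cluster means and pair weights, which the following Python sketch illustrates. The function names, the tuple layout of the data and the outcome values are all assumptions made for this illustration; only the two weighting rules come from the text.

```python
def psi_hat(pairs, weight):
    """Weighted average of within-pair mean differences, as in equation (3).

    pairs  : list of (z, y1, y2); z is the pair's assignment indicator, y1 and
             y2 are the observed outcomes in its first and second cluster.
    weight : function (n1, n2) -> pair weight.
    """
    numerator = 0.0
    denominator = 0.0
    for z, y1, y2 in pairs:
        n1, n2 = len(y1), len(y2)
        diff = sum(y1) / n1 - sum(y2) / n2   # first minus second cluster mean
        if z == 0:
            diff = -diff                     # second cluster is the treated one
        w = weight(n1, n2)
        numerator += w * diff
        denominator += w
    return numerator / denominator

def arithmetic(n1, n2):
    return n1 + n2                 # weights proposed for SATE and UATE

def harmonic(n1, n2):
    return n1 * n2 / (n1 + n2)     # weights of the estimator in the literature

# Made-up data: two pairs, each with 8 sampled units in total.
data = [
    (1, [3.0, 5.0], [1.0, 2.0, 3.0, 4.0, 2.0, 0.0]),   # treated mean 4, control mean 2
    (0, [1.0, 1.0, 1.0, 1.0], [4.0, 6.0, 5.0, 5.0]),   # control mean 1, treated mean 5
]
est_arithmetic = psi_hat(data, arithmetic)
est_harmonic = psi_hat(data, harmonic)
```

In this toy example both pairs contain eight units, so the arithmetic weights treat them equally, while the harmonic-mean weights down-weight the first pair because its two clusters differ in size; the two estimates therefore differ even though the data are identical.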

TABLE 2
Point estimators and variances for the four causal quantities of interest. "Identified" refers to design-based identification of estimated causal effects without modeling assumptions

                  SATE                          CATE                          UATE                          PATE
Point estimator   $\hat\psi(n_{1k} + n_{2k})$   $\hat\psi(N_{1k} + N_{2k})$   $\hat\psi(n_{1k} + n_{2k})$   $\hat\psi(N_{1k} + N_{2k})$
Variance          $\mathrm{Var}_{a}(\hat\psi)$  $\mathrm{Var}_{au}(\hat\psi)$ $\mathrm{Var}_{ap}(\hat\psi)$ $\mathrm{Var}_{aup}(\hat\psi)$
Identified        no                            no                            YES                           YES
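As a brief worked step that is not part of the source, the following identity may help show why the harmonic-mean weights of the existing estimator agree with the weights $n_{1k} + n_{2k}$ in Table 2 only when cluster sizes are matched within pairs.

```latex
\[
\frac{n_{1k}\, n_{2k}}{n_{1k} + n_{2k}}
  = \frac{(n_{1k} + n_{2k})^2 - (n_{1k} - n_{2k})^2}{4\,(n_{1k} + n_{2k})}
  = \frac{n_{1k} + n_{2k}}{4} - \frac{(n_{1k} - n_{2k})^2}{4\,(n_{1k} + n_{2k})}.
\]
```

When $n_{1k} = n_{2k}$ for every $k$, the last term vanishes and each harmonic-mean weight is exactly one quarter of $n_{1k} + n_{2k}$; since the estimator divides by the sum of the weights, the constant factor cancels and the two estimators coincide. Any within-pair imbalance in size makes the last term positive and so reduces that pair's harmonic-mean weight relative to its share of units.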

Our variance estimators differ from the existing estimator even when sample cluster sizes are matched exactly. Our variance estimator is approximately unbiased for any weights. Estimates of UATE and PATE (or equivalently SATE and CATE) will differ depending on how sample and population sizes vary across clusters.

4.2 Bias

We first focus on SATE. This allows us, following Neyman (1923), to use the randomized treatment assignment mechanism as the sole basis for statistical inference (see also Imai, 2008). Here, the potential outcomes are assumed fixed, but possibly unknown, quantities. We begin by rewriting $\hat\psi(n_{1k} + n_{2k})$ using potential outcome notation:

$$\hat\psi(n_{1k} + n_{2k}) = \frac{1}{n} \sum_{k=1}^{m} (n_{1k} + n_{2k}) \left\{ Z_k \left( \frac{\sum_{i=1}^{n_{1k}} Y_{i1k}(1)}{n_{1k}} - \frac{\sum_{i=1}^{n_{2k}} Y_{i2k}(0)}{n_{2k}} \right) + (1 - Z_k) \left( \frac{\sum_{i=1}^{n_{2k}} Y_{i2k}(1)}{n_{2k}} - \frac{\sum_{i=1}^{n_{1k}} Y_{i1k}(0)}{n_{1k}} \right) \right\}.$$

Then, taking the expectation with respect to $Z_k$ yields

$$E_a\{\hat\psi(n_{1k} + n_{2k})\} - \psi_S = \frac{1}{n} \sum_{k=1}^{m} \sum_{j=1}^{2} \left( \frac{n_{1k} + n_{2k}}{2} - n_{jk} \right) \sum_{i=1}^{n_{jk}} \frac{Y_{ijk}(1) - Y_{ijk}(0)}{n_{jk}}, \tag{4}$$

where the expectation is taken with respect to the randomization of treatment assignment which we indicate by the subscript "a."

Although the bias does not generally equal zero, either of two common conditions can eliminate it. These two conditions motivate our choice of weights ($w_k = n_{1k} + n_{2k}$). First, when cluster sizes are equal within each matched-pair (i.e., $n_{1k} = n_{2k}$ for all $k$), the bias is always zero. This implies that researchers may wish to form pairs of clusters, at least partially, based on their sample cluster size if SATE is the estimand. Second, $\hat\psi(n_{1k} + n_{2k})$ is also unbiased if matching is effective, so that the within-cluster SATEs are identical for each matched-pair (i.e., $\sum_{i=1}^{n_{1k}} (Y_{i1k}(1) - Y_{i1k}(0))/n_{1k} = \sum_{i=1}^{n_{2k}} (Y_{i2k}(1) - Y_{i2k}(0))/n_{2k}$ for all $k$). In contrast, bias may remain if cluster sizes are poorly matched and within each pair cluster sizes are strongly associated with the cluster-specific SATEs. However, the bounds on the bias can be found by applying the Cauchy–Schwarz inequality to equation (4) and they can be consistently estimated from the observed data. In sum, roughly speaking, if cluster sizes and important confounders are matched well so that pre-randomization matching accomplishes the purpose for which it was designed, this estimator will be approximately unbiased.

A similar bias expression can be derived for our CATE estimator, $\hat\psi(N_{1k} + N_{2k})$, where the weights are now based on the arithmetic mean of the population cluster sizes rather than their sample counterparts. A calculation analogous to the one above yields the following bias expression:

$$E_{au}\{\hat\psi(N_{1k} + N_{2k})\} - \psi_C = \frac{1}{N} \sum_{k=1}^{m} \sum_{j=1}^{2} \left( \frac{N_{1k} + N_{2k}}{2} - N_{jk} \right) E_u\bigl(Y_{ijk}(1) - Y_{ijk}(0)\bigr), \tag{5}$$

where subscript "au" means that the expectation is taken with respect to random treatment assignment and the simple random sampling of units within each cluster. The conditions under which this bias disappears are analogous to the ones for SATE: If matching is effective so that the cluster-specific average causal effects, that is, $E_u[Y_{ijk}(1) - Y_{ijk}(0)]$, are constant across clusters within each pair, then the bias is zero. The bias also vanishes if the population cluster sizes are identical within each pair, that is, $N_{1k} = N_{2k}$ for all $k$. Again, the bounds on the bias can be obtained in a manner similar to the case of SATE above.

Finally, the bias for UATE and PATE can be obtained by taking the expectation of the bias for SATE and CATE, respectively, with the expectation defined based on random sampling of cluster pairs. If the within-cluster sample (population) average treatment effects are uncorrelated with cluster sizes within each matched-pair, then the bias for the estimation of UATE (PATE) is zero, regardless of whether one can match exactly on cluster sizes. In general, however, cluster sizes may be correlated with the size of average treatment effects. In such cases, the matching strategies to reduce the bias for the estimation of SATE (CATE) also work for the estimation of UATE (PATE). That is, pairs of clusters should be constructed such that within each pair, cluster sizes and important pre-treatment covariates are similar. (We also derived an unbiased estimator and its variance, but we do not present them here because they are not invariant to a constant shift of the outcome variable when cluster sizes vary within each pair.)

4.3 Variance

In a critical comment about Klar and Donner (1997), Thompson (1998) shows how to obtain valid variance estimates under the linear mixed effects model and the "common effect assumption." In their reply, Klar and Donner (1998) criticize the common effect assumption and, as a result, maintain their claim of analytical difficulties with MPCRs. We show here how to obtain valid variance estimates without the common treatment effect assumption or other modeling assumptions.

Rather than focusing on each of our proposed estimators, $\hat\psi(n_{1k} + n_{2k})$ and $\hat\psi(N_{1k} + N_{2k})$, separately we consider the variance of the general estimator, $\hat\psi(w_k)$ in equation (3), so that the analytical results we develop apply to any choice of weights including the harmonic mean weights. For notational simplicity, we use normalized weights, that is, $\tilde w_k \equiv n w_k / \sum_{k=1}^{m} w_k$ (so that the weights sum up to $n$ as in our estimator of SATE and UATE), and consider the variances of $\hat\psi(\tilde w_k)$. First, we use potential outcomes notation and write $\hat\psi(\tilde w_k) = \sum_{k=1}^{m} \tilde w_k \{Z_k D_k(1) + (1 - Z_k) D_k(0)\}/n$, where, matching the potential outcome expression in Section 4.2, $D_k(1) \equiv \sum_{i=1}^{n_{1k}} Y_{i1k}(1)/n_{1k} - \sum_{i=1}^{n_{2k}} Y_{i2k}(0)/n_{2k}$ and $D_k(0) \equiv \sum_{i=1}^{n_{2k}} Y_{i2k}(1)/n_{2k} - \sum_{i=1}^{n_{1k}} Y_{i1k}(0)/n_{1k}$ are the within-pair mean differences under $Z_k = 1$ and $Z_k = 0$, respectively. Then, our variance estimator is

$$\hat\sigma(\tilde w_k) \equiv \frac{m}{(m - 1) n^2} \sum_{k=1}^{m} \left[ \tilde w_k \left\{ Z_k \left( \frac{\sum_{i=1}^{n_{1k}} Y_{i1k}}{n_{1k}} - \frac{\sum_{i=1}^{n_{2k}} Y_{i2k}}{n_{2k}} \right) + (1 - Z_k) \left( \frac{\sum_{i=1}^{n_{2k}} Y_{i2k}}{n_{2k}} - \frac{\sum_{i=1}^{n_{1k}} Y_{i1k}}{n_{1k}} \right) \right\} - \frac{n \hat\psi(\tilde w_k)}{m} \right]^2. \tag{6}$$

SATE. We first consider the variance of $\hat\psi(\tilde w_k)$ for SATE. Taking the expectation of $\hat\psi(\tilde w_k)$ with respect to $Z_k$, the true variance of $\hat\psi(\tilde w_k)$ is given by

$$\mathrm{Var}_a(\hat\psi(\tilde w_k)) = \frac{1}{4 n^2} \sum_{k=1}^{m} \tilde w_k^2 \bigl(D_k(1) - D_k(0)\bigr)^2. \tag{7}$$

This variance is not identified since we do not jointly observe $D_k(1)$ and $D_k(0)$ for each $k$. Thus, we identify an upper bound of this variance, making no additional assumptions, and estimate it from the observed data.

The next proposition establishes that the true variance, $\mathrm{Var}_a(\hat\psi(\tilde w_k))$, is not identifiable, and shows that our proposed variance estimator, $\hat\sigma(\tilde w_k)$, is conservative.

PROPOSITION 1 (SATE variance identification). Suppose that SATE is the estimand. Then, the true variance of $\hat\psi(\tilde w_k)$ is not identifiable. The bias of $\hat\sigma(\tilde w_k)$ is given by

$$E_a(\hat\sigma(\tilde w_k)) - \mathrm{Var}_a(\hat\psi(\tilde w_k)) = \frac{m}{4 n^2} \operatorname{var}\bigl\{ \tilde w_k \bigl(D_k(1) + D_k(0)\bigr) \bigr\},$$

where $\operatorname{var}(\cdot)$ represents the sample variance with denominator $m - 1$.

See Appendix A.1 for a proof. This proposition implies that on average $\hat\sigma(\tilde w_k)$ overestimates the true variance $\mathrm{Var}_a(\hat\psi(\tilde w_k))$ unless the sample variance of weighted within-cluster SATEs across pairs is zero. For example, if SATE is constant across pairs, and the cluster sizes are equal, $\hat\sigma(\tilde w_k)$ estimates the true variance without bias. However, such a scenario is highly unlikely under MPCR, and thus $\hat\sigma(\tilde w_k)$ should be seen as a conservative estimator of the variance. It is also possible to obtain a less conservative variance estimate than $\hat\sigma(\tilde w_k)$. For example, researchers may use a consistent estimator of $\{(\sum_{k=1}^{m} \tilde w_k^2 D_k(1)^2)^{1/2} + (\sum_{k=1}^{m} \tilde w_k^2 D_k(0)^2)^{1/2}\}^2 / 4n^2$, which is obtained by applying the Cauchy–Schwarz inequality to equation (7). Another approach to obtain a tighter bound would be to apply the covariance inequality to the bias expression given in Proposition 1.

CATE. Next, we study the variance for CATE, $\hat\psi(\tilde w_k)$, which we write as

$$\mathrm{Var}_{au}(\hat\psi(\tilde w_k)) = E_u\{\mathrm{Var}_a(\hat\psi(\tilde w_k))\} + \mathrm{Var}_u\{E_a(\hat\psi(\tilde w_k))\} = \frac{1}{4 n^2} \sum_{k=1}^{m} \tilde w_k^2 \Bigl[ E_u\bigl\{ \bigl(D_k(1) - D_k(0)\bigr)^2 \bigr\} + \mathrm{Var}_u\bigl(D_k(1) + D_k(0)\bigr) \Bigr], \tag{8}$$

where the second equality holds because sampling of units is independent within clusters. Similar to the SATE variance, this is not identified since we do not jointly observe $D_k(1)$ and $D_k(0)$ for each $k$. The next proposition shows that $\hat\sigma(\tilde w_k)$ is again conservative.

PROPOSITION 2 (CATE variance identification). Suppose that CATE is the estimand. The true variance of $\hat\psi(\tilde w_k)$, $\mathrm{Var}_{au}(\hat\psi(\tilde w_k))$, is not identifiable. The bias of $\hat\sigma(\tilde w_k)$ is given by [...]

[...] PATE are given by:

$$\mathrm{Var}_{ap}(\hat\psi(\tilde w_k)) = \frac{1}{m \bar w^2} \mathrm{Var}_p(w_k D_k),$$

$$\mathrm{Var}_{apu}(\hat\psi(\tilde w_k)) = \frac{1}{m \bar w^2} \bigl[ E_p\{w_k^2 \mathrm{Var}_u(D_k)\} + \mathrm{Var}_p\{w_k E_u(D_k)\} \bigr],$$

respectively, where $D_k = Z_k D_k(1) + (1 - Z_k) D_k(0)$ and "p" represents the expectation with respect to simple random sampling of matched-pairs of clusters. Conditional on $\bar w$, both variances can be estimated by $\hat\sigma(\tilde w_k)$ without bias under their corresponding sampling schemes.

See Appendix A.3 for a proof. The proposition shows that when estimating PATE, the variance of $\hat\psi(\tilde w_k)$ is proportional to the sum of two elements: the mean of within-cluster variances and the variance of within-cluster means. If all units are included in each cluster, then the first term will be zero because the within-cluster means are observed without sampling uncertainty, that is, $\mathrm{Var}_u(D_k) = 0$ for all $k$.
In either case, however, our proposed variance estimator σ(ˆ w˜ ) ˆ ˜ − ˆ ˜ k Ea(σ(wk)) Var a(ψ(wk)) is unbiased, conditional on the mean weight, w¯ . m    = ˜ + Inference. Given our proposed estimators and vari- 2 var wkEu Dk(1) Dk(0) . 4n ances, we make statistical inferences by assuming that ˆ See Appendix A.2 for a proof. The proposition im- ψ(w˜ k) is approximately unbiased. We consider three plies that our proposed variance estimator, σ(ˆ w˜ k),isan situations: upper bound of the true variance. As in the case of the 1. Many pairs. When the number of pairs is large (re- SATE, this upper bound can be improved. For example, gardless of the number of units within each clus- rewrite the variance in equation (8)as ter), no additional assumption is necessary due to ˆ ˜ the . For PATE and UATE, the Var au(ψ(wk)) [ ˆ ˜ − levelα confidence intervals are given by ψ(wk) 1 m z σ(ˆ w˜ ), ψ(ˆ w˜ ) + z σ(ˆ w˜ )] where z (9) = w˜ 2 Var (D (1)) + Var (D (0)) α/2 k k α/2 k α/2 2n2 k u k u k represents the critical value of two-sided level α k=1 normal test. For the SATE and CATE, the confi-    1 2 dence level of this interval will be greater than or + Eu Dk(1) − Dk(0) . 2 equal to α. 2. Few pairs, many units. For CATE (and PATE), Then, apply the Cauchy–Schwarz inequality to the the central limit theorem implies that D follows third term in the bracket of equation (9). Alternatively, k the normal distribution. Since the weights are as- applying the covariance inequality to the bias expres- sumedtobefixedforCATE,w˜ kDk is also nor- sion in Proposition 2 yields a tighter bound. mally distributed. For the other three quantities, UATE and PATE. Unlike in the case of the SATE we assume w˜ kDk is normally distributed. In ei- and the CATE, the variance of ψˆ is identified and can ther case, the level α confidence intervals are [ ˆ ˜ − ˆ ˜ ˆ ˜ + estimated approximately without bias when UATE or given by ψ(wk) tm−1,α/2 σ(wk), ψ(wk) PATE is the estimand. 
We establish this result as the tm−1,α/2 σ(ˆ w˜ k)],wheretm−1,α/2 represents the following proposition: critical value of the one-sample, two-sided level αt-  test with (m − 1) degrees of freedom. For the SATE ¯ = m PROPOSITION 3. Conditional on w k=1 wk/ and CATE, the confidence level of this interval will ˆ m, the variances of ψ(w˜ k) for estimating the UATE and be greater than or equal to α. 38 K. IMAI, G. KING AND C. NALL

3. Few pairs, few units. When little information is available, a distributional assumption is required for the inferences about all four quantities. We may assume $\tilde w_k D_k$ follows the normal distribution as above and construct the confidence intervals and conduct hypothesis tests based on the $t$-distribution.

Finally, although it was once thought that the need for, and inability to estimate, the intracluster correlation coefficient (ICC) was a major disadvantage of MPCR designs (Campbell, Mollison and Grimshaw, 2001; Klar and Donner, 1997; Donner, 1998), estimates of the ICC are in fact not needed for our estimators or their variances. Below, we also show that efficiency analysis, power comparisons and sample size calculations can be conducted without estimating the ICC.

4.4 Performance in Practice

We now study how our estimator and the harmonic mean estimator work in practice. The results here also motivate a combined approach to estimation we offer in Section 4.5.

Confidence interval coverage. To construct realistic simulations, we begin with the observed cluster-specific mean for two out-of-pocket health expenditures from the SPS evaluation data (measured in pesos) and use this to set the potential outcomes’ true population for the simulation. Finally, we generate the outcome variables via independent normal draws for units within clusters using a set of heterogeneous variances. Thus, the existing harmonic mean estimator’s mean and variance constancy assumptions are violated, as is common in real data, although its normality and independence assumptions are maintained. (The data are available in Imai, King and Nall, 2009.)

We study the properties of the proposed and existing variance estimators with PATE or UATE as the estimand. (As shown in Proposition 2, the CATE variance is not identified and the expectation of our variance estimator equals an upper bound.) We generate a population of clusters by bootstrapping the observed pairs of SPS clusters along with their observed means and a set of heterogeneous variances. We then compute coverage probabilities under both estimators, where the arithmetic and harmonic mean weights are used for the proposed and existing estimators, respectively. We draw from the discrete empirical distribution, which is far from a Gaussian distribution, yielding a hard case for both estimators. The left panels of Figure 1 summarize the results. As expected due to the central limit theorem, both sets of our 90% confidence intervals (solid disks) approach their corresponding nominal coverage probabilities as the number of pairs increases. In contrast, the confidence intervals based on the harmonic mean variance estimator (open diamonds) are biased (too wide in the top graph and too narrow in the bottom), and the magnitude of bias does not decrease even as the number of pairs grows.

Standard error comparisons. We begin by computing the standard error (the square root of the estimated variance) based on the general variance formula proposed in Donner (1987), Donner and Donald (1987) and Donner and Klar (1993), as well as the one based on our approximately unbiased alternative. For comparability, we use the arithmetic mean weights for both standard error calculations. We make these computations for a large number of outcome variables from the SPS evaluation survey conducted 10 months after randomization. The outcome variables include some which were binary (e.g., did the respondent suffer catastrophic medical expenditures? Does our blood test indicate that the respondent has high cholesterol? Has the respondent been diagnosed with asthma?) and others denominated in Mexican pesos (e.g., out-of-pocket expenditures for health care, for drugs, etc.). We then divided this standard error by our alternative for each variable. The top right graph in Figure 1 gives a smoothed histogram of these ratios (plotted on the log scale but labeled in original units, with 1 the point of equality). In these real data, the biased standard errors range from about two times too small to two times too large. Note that the shape of this histogram has no particular meaning, as it is constructed from whatever questions happened to be asked on the survey. The key point is that in real data the deviation from the approximately unbiased estimator for any one such standard error can be large in either direction.

Bias-variance tradeoff. Using data from an expenditure outcome in the SPS sample, we simulate an instance in which the variance of the existing estimator outperforms our estimator. To distinguish between the harmonic and arithmetic mean, we begin by setting all within-pair cluster sizes equal to the size of the treatment cluster in the SPS evaluation. Then, keeping the total pair size constant, we increase the difference in within-pair cluster size such that the added difference in cluster sizes is proportional to the within-pair treatment effect. This leaves the average treatment effect constant while demonstrating differences in the two weighting schemes. The bottom right graph in Figure 1 presents

the absolute difference between the two estimators in mean square error, squared bias and variance, with the observed SPS value marked with a vertical line.

FIG. 1. Inference Accuracy. Simulations in the left panels demonstrate how our estimator’s coverage is approximately correct and increasingly so for larger sample sizes, while the existing estimator can yield confidence intervals that are either too large (top left) or too small (bottom left). The top right panel uses real data to give the ratios of the harmonic mean standard error to our approximately unbiased alternative on the horizontal axis (on the log scale, but labeled as ratios). The bottom right figure gives squared bias, MSE, and variance comparisons as a function of the average cluster size ratio; a vertical line marks the observed heterogeneity in the SPS data.

The overall picture from these results indicates that the arithmetic estimator would be preferred because it has lower mean square error than the harmonic mean estimator. However, at the expense of introducing bias when the treatment effect is both variable across pairs and correlated with the cluster size, the harmonic mean estimator can have substantially lower variance. These results suggest the possibility of an improved estimator based on the combination of both approaches, a subject to which we now turn.

4.5 An Encompassing, Model-Based Approach

The standard harmonic mean estimator is unbiased when applied to data where the between-cluster homogeneity assumption holds. In this situation, the harmonic mean weights also have the attractive property of downweighting observations with worse matches and larger variances, thereby reducing variance. If the homogeneity assumption is violated, however, then one cannot afford to downweight pairs, no matter how badly matched or imprecisely measured, because doing so could result in arbitrarily large biases. In contrast, the arithmetic mean estimator avoids bias by making no assumptions about the nature of how treatment effects vary over the pairs. However, a consequence of its imposing no structure on treatment effect heterogeneity is that mismatched pairs are not downweighted, and so some inefficiency may result if in fact the treatment effects are similar across pairs.

We now combine the insights of these two approaches and propose a single encompassing model that provides some of the advantages of each, at the cost of somewhat more stringent assumptions than with our design-based approach. Consider data with $m^*$ groups of clusters, where the homogeneity assumption holds within each group. Assume that the clusters within any one pair are never split between groups. Let $g(k) = l$ denote the group to which pair $k$ belongs, $l = 1, 2, \ldots, m^*$ with $m^* \le m$. Then, make the modeling assumption that $Y_{ijk}(t) \overset{\text{i.i.d.}}{\sim} N(\mu_t^{(g(k))}, \tilde\sigma^{(g(k))})$ for $t = 0, 1$, where $\mu_t^{(l)}$ and $\tilde\sigma^{(l)}$ are not necessarily equal to $\mu_t^{(l')}$ and $\tilde\sigma^{(l')}$ for $l \neq l'$. Under this model, CATE equals

$$\psi_C = \frac{1}{N} \sum_{k=1}^{m} (N_{1k} + N_{2k}) \big( \mu_1^{(g(k))} - \mu_0^{(g(k))} \big). \qquad (10)$$

When the group membership is known ex ante, an unbiased and efficient estimator of CATE is given by replacing $\mu_k^{(l)} \equiv \mu_1^{(l)} - \mu_0^{(l)}$ with its harmonic mean estimate

$$\hat\mu_k^{(l)} = \sum_{k=1}^{m} 1\{g(k) = l\} w_k D_k \Big/ \sum_{k'=1}^{m} 1\{g(k') = l\} w_{k'},$$

where $w_k = n_{1k} n_{2k} / (n_{1k} + n_{2k})$ and $1\{\cdot\}$ represents the indicator function. Thus, this mixture model estimator is an arithmetic mean of within (homogeneous) group harmonic mean estimators. A special case is the harmonic mean estimator in the literature, where the homogeneity assumption is made across all clusters, that is, $m^* = 1$. When every pair belongs to a different group, that is, $m^* = m$, this estimator approximates our proposed design-based estimator. In many applications, the group membership as well as the number of groups may be unknown. In this case, CATE may be estimated via standard methods for fitting finite mixture models (e.g., McLaughlan and Peel, 2000).

4.6 Cluster-Level Quantities of Interest

The eight quantities of interest defined in Section 3.3 (SATE, CATE, PATE and UATE, both with and without interference) are all defined as aggregations of unit-level causal effects. For some purposes, however, analogous quantities of interest can be defined at the cluster level. For example, quantities of interest in the SPS evaluation include the health clinic-level variables. Some of these effects, such as the supply of drugs and doctors, are defined and measured at the health clinic, and so are effectively unit-level variables amenable to cluster-level analyses.

However, for other variables, individual-level survey responses are required to measure the aggregate variables. Examples include the success health clinics in our experiment have in protecting privacy, reducing waiting times, etc. If these latter variables are used to judge the causal effect of SPS on the clinics, we have a CR experiment, but a quantity of interest at the cluster level. In this situation, our estimator is a special case of equation (3), with a constant weight, $\hat\psi(1)$. Similarly, the variance of this estimator is a special case of our general formulation in equation (6), $\hat\sigma(1)$. This estimator for aggregate quantities is unbiased and invariant for all quantities of interest. In the case of unit-level variables amenable to cluster-level analysis (such as those collected via survey), there will likely be sampling error, which may result in a larger variance.

5. COMPARING MATCHED-PAIR AND OTHER DESIGNS

We now study the relative efficiency and power of the MPCR and unmatched cluster randomization (UMCR) designs, and give sample size calculations for MPCR. We also briefly compare MPCR with the stratified design and discuss the consequences of loss of clusters under each.

5.1 Unmatched Cluster Randomized Design

The UMCR design is defined as follows. Consider a random sample of $2m$ clusters from a population. We observe a total of $n_j$ units within the $j$th cluster in the sample, and use $n$ to denote the total number of units in the sample, $n = \sum_{j=1}^{2m} n_j$. Under this design, $m$ randomly selected clusters are assigned to the treatment group with equal probability while the remaining $m$ clusters are assigned to the control group.

We construct an estimator analogous to that proposed for the MPCR as

$$\hat\tau(\tilde w_j) \equiv \frac{2}{n} \sum_{j=1}^{2m} \frac{\tilde w_j}{n_j} \sum_{i=1}^{n_j} \{Z_j Y_{ij} - (1 - Z_j) Y_{ij}\} = \frac{2}{n} \sum_{j=1}^{2m} \frac{\tilde w_j}{n_j} \sum_{i=1}^{n_j} \{Z_j Y_{ij}(1) - (1 - Z_j) Y_{ij}(0)\}, \qquad (11)$$

where $Z_j$ is the randomized binary treatment variable, $Y_{ij}(t)$ is the potential outcome for the $i$th unit in the $j$th cluster under the treatment value $t$ for $t = 0, 1$, and $\tilde w_j$ is the known normalized weight with $\sum_{j=1}^{2m} \tilde w_j = n$. For SATE and UATE, we use $\tilde w_j = n_j$. For CATE and PATE, we use $\tilde w_j \propto N_j$ where $N_j$ is the population size of the $j$th cluster. Analysis similar to the one in Section 4.2 shows that this estimator is unbiased for all four quantities in UMCR experiments.

The commonly used estimator in the literature for this design takes a form slightly different from equation (11):

$$\hat\kappa \equiv \frac{\sum_{j=1}^{2m} Z_j \sum_{i=1}^{n_j} Y_{ij}}{\sum_{j=1}^{2m} Z_j n_j} - \frac{\sum_{j=1}^{2m} (1 - Z_j) \sum_{i=1}^{n_j} Y_{ij}}{n - \sum_{j=1}^{2m} Z_j n_j}.$$

This estimator is applicable to SATE and UATE but not CATE and PATE because it ignores cluster population weights. The estimator is also biased for SATE and UATE, and the magnitude of bias can be derived using the Taylor series. Without modeling assumptions, the exact variance calculation is difficult within the design-based framework because the denominator as well as the numerator is a function of the randomized treatment variable. In addition, the usual approximate variance calculations for such a ratio estimator yield either the same variance as $\hat\tau(n_j)$ or a variance estimator that is not invariant to a constant shift. Thus, for the sake of simplicity, we focus on $\hat\tau(\tilde w_j)$ in this section, although $\hat\kappa$ and its approximate variance estimator may perform reasonably well in practice.

For the rest of this section, we assume that the estimand is UATE. However, the same calculations apply when the estimand is PATE since the variance estimator is the same for both. For SATE and CATE, we can interpret these results as conservative estimates of efficiency, power and sample sizes.

5.2 Efficiency

When the estimand is UATE, the variance of $\hat\tau(\tilde w_j)$ is approximately (conditional on $n = \sum_{j=1}^{2m} \tilde w_j$) equal to

$$\mathrm{Var}_{ac}(\hat\tau(\tilde w_j)) = \frac{4m}{n^2} \big\{ \mathrm{Var}_c(\tilde w_j \overline{Y}_j(1)) + \mathrm{Var}_c(\tilde w_j \overline{Y}_j(0)) \big\},$$

where $\overline{Y}_j(t) \equiv \sum_{i=1}^{n_j} Y_{ij}(t)/n_j$ for $t = 0, 1$, and the subscript “c” represents the simple random sampling of clusters. To facilitate comparison, assume that under MPCR one is able to match on cluster sizes so that $n_{1k} = n_{2k}$ for all $k$. Proposition 3 implies that under the same condition the variance of $\hat\psi(\tilde w_k)$ can be approximated by $\mathrm{Var}_{ap}(\hat\psi(\tilde w_k)) = m \, \mathrm{Var}_p\{\tilde w_k (\overline{Y}_{jk}(1) - \overline{Y}_{j'k}(0))\}/n^2$, where $\overline{Y}_{jk}(t) \equiv \sum_{i=1}^{n_{jk}} Y_{ijk}(t)/n_{jk}$ and $j \neq j'$. Since the assumption of $n_{1k} = n_{2k}$ means $\tilde w_k = 2 \tilde w_j$, we have $\mathrm{Var}_p(\tilde w_k \overline{Y}_{jk}(t)) = 4 \mathrm{Var}_c(\tilde w_j \overline{Y}_j(t))$ for $t = 0, 1$. Thus, the relative efficiency of MPCR over UMCR is

$$\frac{\mathrm{Var}_{ac}(\hat\tau(\tilde w_j))}{\mathrm{Var}_{ap}(\hat\psi(\tilde w_k))} \approx \left[ 1 - \frac{2 \, \mathrm{Cov}_p(\tilde w_k \overline{Y}_{jk}(1), \tilde w_k \overline{Y}_{j'k}(0))}{\sum_{t=0}^{1} \mathrm{Var}_p(\tilde w_k \overline{Y}_{jk}(t))} \right]^{-1}.$$

This implies that the relative efficiency of MPCR depends on the correlation of the observed within-pair cluster mean outcomes weighted by cluster sizes. If matching induces a positive correlation, as is its purpose and will normally occur in practice, then MPCR is more efficient. (In the worst case scenario where matching is implemented in a manner opposite to the way it was designed, and thus induces a negative correlation, MPCR can be less efficient.) Under MPCR, we can estimate $\mathrm{Cov}_p(\tilde w_k \overline{Y}_{jk}(1), \tilde w_k \overline{Y}_{j'k}(0))$ without bias using the sample covariance between $\tilde w_k \overline{Y}_{jk}(1)$ and $\tilde w_k \overline{Y}_{j'k}(0)$, which are jointly observed for each $k$. And thus, under MPCR, the variance one would obtain under UMCR can also be estimated without bias (see also Imai, 2008). This is another advantage of MPCR since the converse is not true. (If cluster sizes are equal, one can also estimate the ICC nonparametrically and separately for the treated and control groups; there is no reason to assume the ICC is the same for two potential outcomes as done in the literature. Note that the ICC is not required for efficiency, power or sample size calculations.)

Empirical evidence. Although the MPCR design has other advantages in public policy evaluations (King et al., 2007), its advantage in statistical efficiency can be considerable. We estimate the efficiency of MPCR as used in the SPS evaluation over the efficiency that our experiment would have achieved if we had used complete randomization without matching. Figure 2 plots the relative efficiency of our estimator for MPCR over UMCR for UATE and for PATE. We do this for our 14 outcome variables denominated in pesos. For UATE, the estimator based on the MPCR is between 1.13 and 2.92 times more efficient, which means that our standard errors would have been as much as $\sqrt{2.92} = 1.7$ times larger if we had neglected to pair clusters first. The result is even more dramatic for estimating PATE, for which the MPCR design for different variables is between 1.8 and 38.3 times more efficient. In this situation, our standard errors would have been as much as six times larger if we had neglected to match first.

FIG. 2. Relative efficiency of matched-pair over unmatched cluster randomized designs in the SPS evaluation.

5.3 Power

We now use the variance results in Section 4.3 to calculate statistical power, that is, the probability of rejecting the null if it is indeed false, for UATE and PATE, which also represent the minimum power for SATE and CATE, respectively.

5.3.1 Power calculations under the matched-pair design. We begin with power calculation for UATE given a null hypothesis of $H_0 : \psi_U = 0$, the alternative hypothesis of $H_A : \psi_U = \psi$, and the level $\alpha$ $t$-test. In this setting, Proposition 3 implies the power function,

$$1 + T_{m-1}\Big( -t_{m-1,\alpha/2} \,\Big|\, n\psi \big/ \sqrt{m \, \mathrm{Var}_p\{\tilde w_k D_k\}} \Big) - T_{m-1}\Big( t_{m-1,\alpha/2} \,\Big|\, n\psi \big/ \sqrt{m \, \mathrm{Var}_p\{\tilde w_k D_k\}} \Big),$$

where $T_{m-1}(\cdot \mid \zeta)$ is the distribution function of the noncentral $t$ distribution with $(m - 1)$ degrees of freedom and the noncentrality parameter $\zeta$, and $\tilde w_k = n_{1k} + n_{2k}$. For UATE, we sample cluster pairs but not units within each cluster. Thus, a simpler expression for the power function results if we assume equal cluster sizes. In that case, a researcher may reparameterize the power function by normalizing $\psi$ in terms of the standard deviation of within-pair mean differences, that is, $d_U \equiv \psi / \sqrt{\mathrm{Var}(D_k)}$. Then, we write the power function as

$$1 + T_{m-1}\big( -t_{m-1,\alpha/2} \mid d_U \sqrt{m} \big) - T_{m-1}\big( t_{m-1,\alpha/2} \mid d_U \sqrt{m} \big). \qquad (12)$$

Next, for PATE, we sample units within each cluster as well as pairs of clusters. The null hypothesis is given by $H_0 : \psi_P = 0$ and the alternative is $H_a : \psi_P = \psi$. Again, for simplicity, we assume $\bar n = n_{jk} = n/(2m)$ for all $j$ and $k$. Then, Proposition 3 implies the power function is of the same form as equation (12) except that the noncentrality parameter is given by

$$\psi \sqrt{m} \Bigg/ \sqrt{ \sum_{j=1}^{2} \frac{E_p\{\tilde w_k^2 \mathrm{Var}_u(Y_{ijk})\}}{\bar n} + \mathrm{Var}_p(\tilde w_k E_u(D_k)) },$$

where $\tilde w_k \propto N_{1k} + N_{2k}$. Similar to UATE, if population cluster sizes are equal, we obtain a simpler power function

$$1 + T_{m-1}\left( -t_{m-1,\alpha/2} \,\Big|\, \frac{d_P \sqrt{m}}{\sqrt{1 + \pi/\bar n}} \right) - T_{m-1}\left( t_{m-1,\alpha/2} \,\Big|\, \frac{d_P \sqrt{m}}{\sqrt{1 + \pi/\bar n}} \right), \qquad (13)$$

where, as for UATE, $\psi$ is normalized by the standard deviation of the within-pair mean difference, $d_P \equiv \psi / \sqrt{\mathrm{Var}_p\{E_u(D_k)\}}$, and $\pi$ is the ratio of the mean variances of the potential outcomes to the variance of within-pair differences-in-means, $\pi \equiv \sum_{j=1}^{2} E_p\{\mathrm{Var}_u(Y_{ijk})\} / \mathrm{Var}_p(E_u(D_k))$.

5.3.2 Sample size calculations. We use the above results to estimate the sample size required to achieve a given precision in a future experiment under MPCR. Suppose an investigator wishes to specify the desired degree of precision in terms of Type I and Type II error rates in hypothesis testing, denoted by $\alpha$ and $\beta$, respectively. In particular, the goal is to calculate the sample size required to achieve a given degree of power, $1 - \beta$, against a particular alternative (Snedecor and Cochran, 1989, Section 6.14), using the power functions just derived. For example, for UATE under equal cluster sizes, and using equation (12), the desired number of cluster pairs is the smallest value of $m$ such that $1 + T_{m-1}(-t_{m-1,\alpha/2} \mid d_U \sqrt{m}) - T_{m-1}(t_{m-1,\alpha/2} \mid d_U \sqrt{m}) \ge 1 - \beta$, where $d_U \equiv \psi / \sqrt{\mathrm{Var}(D_k)}$, $\alpha$, and $\beta$ are specified by the researcher. Similarly, for PATE, equation (13) is used to determine the number of pairs and units within each cluster.

Empirical evidence. To illustrate, we use SPS evaluation data on the annualized out-of-pocket health care expenditure that a household spent in the most recent month. Using estimates of $\pi$ and $\mathrm{Var}_p\{E_u(D_k)\}$ from the SPS data and equation (13), we calculate the minimal absolute effect size for PATE that can be detected using a two-sided $t$-test with size 0.95 and power 0.8, for any given cluster size and number of cluster pairs. Since the household is the unit of interest in this example, our population count involves the number of households per cluster, instead of the number of individuals.

In the left panel of Figure 3, the horizontal axis is the number of pairs and the vertical axis indicates the number of units within each cluster. The contour lines represent the minimum detectable size in pesos. The graph shows that MPCR with 30 pairs and 100 units within each cluster can detect the true absolute effect size of approximately 450 pesos with the given precision. The figure displays the obvious result that experiments with more pairs or clusters can detect smaller effects (contour lines are labeled with smaller numbers as we move to the top right of the figure). More importantly, the nearly vertical contour lines (above 50 or so units within each cluster) indicate that adding more pairs of clusters adds more statistical power than adding more units within each pair. However, adding one more pair means that many more units will be added, and in some situations sampling units within new clusters is more expensive than within existing clusters. As such, the exact tradeoff depends on the specifics of each application, and it would be incorrect to conclude that more clusters always dominates more units. (We discuss the right panel of the figure next.)

FIG. 3. Sample size calculations for PATE under MPCR. The left panel plots the smallest detectable absolute effect size of SPS on annualized out-of-pocket expenditures (in pesos) using a 0.05 level two-sided test with power 0.95, with π estimated from SPS data. The horizontal and vertical axes plot m and n̄, respectively. The right panel compares correlations with and without population weights between treatment and control group cluster-specific means in SPS data. All but one variable has higher correlation when incorporating weights, as seen by a dot below the 45° line. The graph also presents “break-even” correlations (indicated by dashed and dotted lines with and without weights, respectively), which are the smallest possible correlations matching must induce in order for MPCR to detect smaller effect size than the UMCR, given fixed power (0.8) and size (0.95). The graph suggests, when weights are appropriately taken into consideration, that MPCR should be preferred (for all but possibly one variable) even when the number of pairs is as small as three.
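The quantities compared in the right panel can be illustrated with a short sketch. The Python code below is a minimal example, not the authors' code: it computes the across-pair correlation between treated and control cluster means with and without weights, and the approximate relative efficiency of MPCR over UMCR from Section 5.2, which assumes cluster sizes are matched within pairs. The function names and the data are hypothetical.

```python
from statistics import mean

def sample_cov(x, y):
    """Sample covariance across pairs, with denominator m - 1."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Across-pair sample correlation."""
    return sample_cov(x, y) / (sample_cov(x, x) * sample_cov(y, y)) ** 0.5

def weighted(cluster_means, weights):
    """Multiply each pair's cluster mean by that pair's weight."""
    return [w * v for w, v in zip(weights, cluster_means)]

def relative_efficiency(treated, control, weights):
    """Approximate Var(UMCR) / Var(MPCR) as [1 - 2 Cov / (Var_1 + Var_0)]^(-1).

    Not defined when the weighted means are perfectly correlated with
    equal variances, because the bracketed term is then zero.
    """
    t = weighted(treated, weights)
    c = weighted(control, weights)
    ratio = 2.0 * sample_cov(t, c) / (sample_cov(t, t) + sample_cov(c, c))
    return 1.0 / (1.0 - ratio)

# Hypothetical cluster means and pair weights, for illustration only.
treated = [310.0, 280.0, 455.0, 390.0, 520.0]
control = [295.0, 300.0, 430.0, 410.0, 480.0]
weights = [120.0, 95.0, 210.0, 160.0, 260.0]

unweighted_corr = correlation(treated, control)
weighted_corr = correlation(weighted(treated, weights), weighted(control, weights))
efficiency = relative_efficiency(treated, control, weights)
```

A relative efficiency above 1 favors MPCR; it falls below 1 only when the weighted cluster means are negatively correlated across pairs.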

5.3.3 Power comparison. Although MPCR is typically more efficient than UMCR regardless of sample size, Martin et al. [(1993), page 330] point out that when the number of pairs is small (fewer than about 10), “the matched design will probably have less power than the unmatched design” due to the loss of degrees of freedom. Here, we show that this conclusion critically hinges on Martin et al.’s assumption of equal cluster population sizes as well as their particular assumed parametric model relating the matching and outcome variables. Modeling assumptions are always worrisome, but the equal cluster size assumption is especially problematic because varying cluster sizes is in fact a fundamental feature of numerous CR experiments.

When cluster sizes are unequal, the efficiency gain of matching in CR trials depends on the correlations of weighted cluster means between the treatment and control clusters across pairs (with weights based on sample or population cluster sizes depending on the quantity of interest), not the unweighted correlations used in Martin et al.’s calculations. Since population cluster sizes are typically observed prior to the treatment randomization, researchers can incorporate this variable into their matching procedure. As a result, correlations of weighted outcomes (constructed from clusters with matched weights) will usually be substantially higher than those of unweighted outcomes; this is true even when cluster sizes are independent of outcomes. Thus, in CR trials with unequal cluster sizes, the efficiency gain due to pre-randomization matching is likely to be considerably greater than the equal cluster size case considered by Martin et al. (1993). Any power comparison must take this factor into consideration, and along with the bias reduction, this is another reason to incorporate cluster sizes into one’s matching procedure.

Empirical evidence. The right panel of Figure 3 illustrates the argument above using the SPS evaluation data, by calculating the across-pair correlations between treatment and control cluster means of 67 outcome variables (ranging from health related variables to household health care expenditure variables), both with and without weights. We use population cluster sizes as weights, which were observed prior to the randomization of the treatment and incorporated into the matching procedure used (King et al., 2007). The graph shows that all but one variable has considerably higher correlations when weights are incorporated (which does not make the equal cluster size assumption) than when they are ignored (which assumes constant cluster size); this can be seen by all but one of the dots falling below the solid 45° line. In fact, the median of the correlations is more than three times larger with (0.68) than without (0.20) weights. In their conclusion, Martin et al. [(1993), page 336] recommend that if the number of pairs is 10 or fewer, then matching should be used only if researchers are confident that the correlation due to matching is at least 0.2. Indeed, all variables in SPS meet this criterion if the weights are appropriately taken into account, the minimum correlation with weights being 0.22. (If the correlations are calculated incorrectly without weights, as they did, then only about half of the variables meet their criterion.)

To illustrate the above result in terms of power and sample size calculations, the graph also presents the “break-even” matching correlations (indicated by dashed and dotted lines for correlations with and without weights, respectively) that are used by Martin et al. [(1993), Section 7]. As in the original article, we set the power and size of the test to be 0.8 and 0.95, respectively, and derive the smallest correlation matching must induce in order for the matched-pair design to be able to detect smaller effect sizes than the UMCR design. The result indicates that even with as few as three pairs, more than 85% of the variables had a correlation higher than the break-even point, which is 0.56. With five pairs, all but one variable exceeds the threshold. In contrast, if one ignores the weights, by incorrectly assuming that the clusters are equally sized, as in Martin et al., then only 4% and 34% of the variables have correlations higher than the break-even correlations of three and five pairs, respectively. Martin et al. (1993) described the correlation of 0.25 as “difficult to achieve by matching” (page 335). However, as the data from the SPS evaluation show, since one can match on cluster sizes, the level of weighted correlations is much higher when cluster sizes are different.

For another example, Donner and Klar [(2000b), page 37] give the unweighted correlations from seven different studies, only one of which is negative (0.49, 0.41, 0.13, 0.63, −0.32, 0.94 and 0.21). The correct weighted correlations are not reported, but in all cases would be higher, and in all likelihood all seven would be positive.

Thus, by dropping the assumption that all clusters are equally sized we have shown here that, for practical purposes, the matched pair design may well have more statistical power than the UMCR design, even for small samples. Of course, if one has fewer than

sign, where units are matched in blocks of larger than two. However, a stratified design is merely a UMCR design operating within each stratum. If all units within a stratum have identical values on important background covariates, then it is effectively equivalent to MPCR. But if any heterogeneity on these covariates or cluster sizes remains within strata, then the stratified design may leave some efficiency on the table. Thus, when feasible, switching from a stratified to an MPCR design has the potential to greatly increase efficiency and power.

Second, Donner and Klar [(2000a), page 40] explain that a “disadvantage of the matched-pair design is that the loss to follow-up of a single cluster in a pair implies that both clusters in that pair must effectively be discarded from the trial, at least with respect to testing the effect of the intervention. This problem...clearly does not arise if there is some replication of clusters within each combination of intervention and stratum.” Indeed, the loss of a single cluster from a stratum with more than two clusters may make it possible to estimate the causal effect within that stratum, but the missingness process must be ascertained or assumed and some type of imputation strategy (or other procedure; e.g., Wei, 1982) must be used, risking model dependence. These are issues for both MPCR and stratified designs. Alternatively, if a cluster is lost in an MPCR study, then dropping the other member of the pair makes it possible to retain the benefits of randomization for SATE or CATE defined in the remaining pairs, without losing other observations, without imputation and possible model dependence, and regardless of the missing data mechanism (King et al., 2007). In contrast, the loss of a cluster in a UMCR design turns an experimental study into an observational study requiring the addition of ignorability assumptions which experimentalists normally try to avoid. The loss of a single cluster within a stratum larger than two units means that more than one cluster will need to be dropped in order to retain the benefits of randomization, which may lead to unnecessary efficiency losses.

Third is the claim that MPCR restricts “prediction models to cluster-level baseline risk factors (for example, cluster size)” (Donner and Klar, 2004).
This sen- three matched pairs, it’s probably time to stop worrying tence has been widely interpreted to mean that pre- about the properties of statistical estimators and head to diction models under MPCR cannot include baseline the field to gather more data. risk factors, but Donner and Klar clearly intended it to indicate (and confirmed to us that they meant) the 5.4 Lost Clusters, Stratified Designs and Causal more straightforward point that cluster-level fixed ef- Heterogeneity fects cannot be included in regression models under We now clarify four additional issues about MPCR MPCR. Of course, results can be analyzed within strata that have arisen. First, some recommend a stratified de- defined by any individual or cluster level variable, so THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 45 long as it is pre-treatment. For example, the bottom two approach to MPCR experiments with unit-level non- rows of Table 3 repeat the same analysis as the top two compliance. To complement the parametric Bayesian rows but only for male-headed households, a variable approach to this problem (under the unmatched clus- measured only at the unit-level and used to separate ter randomized design) by Frangakis, Rubin and Zhou the sample at that level. (The results for each quantity (2002), we consider a design-based analysis applying of interest in this case appear only slightly larger than the approach introduced in Section 4. for the entire sample.) Regression models with fixed 6.1 Causal Quantities of Interest effects for clusters are unidentified under MPCR, al- though substituting in random effects is unproblematic, We consider the two types of causal quantities of at least for identification purposes. 
interest under MPCR encouragement designs—the Finally, Donner and Klar (2004) explain that MPCR intention-to-treat (ITT) effect and the complier aver- is to be faulted because of its “inability to test for ho- age causal effect (CACE) (Angrist, Imbens and Rubin, mogeneity” of causal effects within a pair. And hy- 1996). The ITT effect is the average causal effect of en- pothesis tests cannot be conducted for the difference couragement (rather than treatment) and is equivalent between two pairs. However, the causal effect is easy to the various versions of the average treatment effect to measure without bias or model dependence under in Section 3.3 (i.e., SATE, CATE, UATE and PATE, MPCR (but not under UMCR) at the pair level with- with or without interference). out bias merely by taking the difference in means be- In contrast, the CACE estimand is the average treat- tween the two clusters. This may be a noisy estimate ment effect (for SATE, CATE, UATE or PATE, with or if matching quality is poor, but it serves as a useful without interference) among compliers only. Compli- unbiased dependent variable for subsequent analyses. ers are neither those merely observed to affiliate among We can see how it varies as a function of any vari- those in encouragement clusters nor those observed not able measured at the unit level and then aggregated to to affiliate in clusters not encouraged since the former the cluster-pair level, or measured directly at the aggre- includes always-takers and the latter includes never- gate level from existing data, such as from census data. takers. Note that always-takers (never-takers) are those Even hypothesis tests are possible if we pool pairs. For who always (never) take the treatment regardless of example, since the point of SPS was to help poor fam- whether or not they are encouraged. In addition, these ilies, we could examine whether the causal effect of groups are defined as a consequence of the treatment. 
rolling out SPS on various outcome variables increases Compliers are those who would affiliate only if they as the wealth of an area drops. This can be done by a were encouraged and would not affiliate only if they simple plot of the pair-level causal effect by wealth, or were not encouraged, and so this group is defined prior fitting a regression model. to the encouragement but its members are not com- pletely observed. We propose a method that can be 6. METHODS FOR UNIT-LEVEL NONCOMPLIANCE used to estimate CACE. CR trials typically have imperfect treatment compli- 6.2 Design and Notation ance at the unit level. Some individuals in treatment The setup is the same as Section 3.2 except that clusters refuse treatment while others in the control Tjk represents whether the units in the jth cluster in cluster receive the treatment. Since most CR social ex- the kth pair are encouraged to receive the treatment periments, including the SPS evaluation, allow non- rather than whether it received the treatment. Recall compliance, analyses, in addition to ITT estimates, that T1k = Zk and T2k = 1 − Zk.Now,letRij k (Tjk) may account for noncompliance and estimate the effect be the potential treatment receipt indicator variables of the program only for individuals who would adhere for the ith unit in the jth cluster of the kth pair un- to the experimental protocol. Thus, we now extend our der the encouragement (Tjk = 1) and control (Tjk = 0) approach to CR trials under the MPCR encouragement conditions. The observed treatment variable is, then, design, where the encouragement to receive a treat- Rij k ≡ TjkRij k (1) + (1 − Tjk)Rij k (0). Similar to the ment, rather than the receipt of the treatment itself, is potential outcomes, these potential treatment variables randomized at the cluster-level. 
depend on cluster-level encouragement variable rather Angrist, Imbens and Rubin (1996)showhowanin- than the unit-level encouragement variable, requiring strumental variable method can be used to analyze a different interpretation of the resulting causal ef- unit-randomized experiments with noncompliance un- fects. Finally, we write the potential outcomes as func- der individually randomized designs. We extend their tions of (cluster-level) randomized encouragement and 46 K. IMAI, G. KING AND C. NALL actual receipt of treatment (at the unit-level),that is, monotonicity assumption excludes the existence of de- Yij k (Rij k ,Tjk). This formulation makes the following fiers. assumption, which an extension of Assumption 1: ASSUMPTION 5 (Monotonicity). There exists no ASSUMPTION 3 (No interference between units). defier. That is, Rij k (1) ≥ Rij k (0) holds for all i, j, k. Let R (T) be the potential outcomes for the ith unit ij k In our Mexico evaluation, never-takers are those who in the jth cluster of the kth matched-pair where T is would not affiliate with SPS regardless of whether the an (m × 2) matrix whose (j, k) element is T . Fur- jk government encourages them to do so or not. Since thermore, let Yij k (R, T) be the potential outcomes for the ith unit in the jth cluster of the kth matched-pair SPS was designed for the poor, many wealthy citizens with their own preexisting health care arrangements where R is an (njk × m × 2) ragged array whose (i,j,k)element is R . Then: may be never-takers. We expected a substantial propor- ij k tion of the population to qualify as never-takers, and = = 1. If Tjk Tjk, then Rij k (T) Rij k (T ). in fact estimate them at 56%. Always-takers are those = = = 2. If Tjk Tjk and Rij k Rij k , then Yij k (R, T) who would affiliate with SPS regardless of assignment. Yij k (R , T ). 
These are more uncommon, and would likely be the poor without access to health care who nevertheless In other words, this assumption requires that one per- have the information and financial resources neces- son’s decision to affiliate has no effect on any other sary to travel to the place to sign up for SPS and to person’s outcomes within the same cluster; as such, travel back to receive care. (The estimated proportion the requirements are more demanding than for the ITT of always-takers is only 7%.) The last type is defiers, or effects above. This assumption might be violated for people who would affiliate with SPS if not encouraged certain health outcomes in the SPS evaluation: if all to do so but would not affiliate if encouraged. Assum- of one’s neighbors affiliate with SPS, the health care ing the absence of defiers seems reasonable. they receive may reduce the prevalence of infectious diseases and so might thereby improve that person’s 6.3 Estimation health outcomes (an example of “herd immunity”). If we assume sampling of both pairs of clusters and Relaxing this assumption thus remains an important units within each cluster, then the ITT causal effect can methodological issue that seems worthy of future re- be defined as ψ . Thus, ψ(Nˆ + N ) can be used search. P 1k 2k to estimate this ITT effect, and the approximately un- The no interference assumption allows us to write biased estimation of its variance is possible using the R (T) = R (T ) and Y (R, T) = Y (R ,T ). ij k ij k jk ij k ij k ij k jk results given in Section 4.3. Since T = Z and T = 1 − Z , both R (T ) and 1k k 2k k ij k jk Next, we consider population CACE. Under the as- Y (T ) depend on Z alone. 
ij k jk k sumption of simple random sampling of both clus- Extending the framework of Angrist, Imbens and ters and units within each cluster, this estimand is de- Rubin (1996) to CR trials, we make an exclusion re- fined as γ ≡ E (Y (1) − Y(0)|C = c) = E (Y (1) − striction so that cluster-level encouragement affects the P P Y(0))/E (R(1) − R(0)), where the equality follows unit-level outcome only through the unit-level receipt P from the direct application Angrist, Imbens and Rubin of the treatment: (1996) to CR trials under the assumptions stated above. ASSUMPTION 4 (Exclusion restriction). Yij k (r, If we only assume simple random sampling of clusters 0) = Yij k (r, 1) for r = 0, 1 and all i, j, and k. as in UATE, then the expectation in γ is taken with respect to the set U rather than P . These assumptions together simplify the problem by Thus, the instrumental variable estimator based on enabling us to write the potential outcomes as func- the general weighted estimator in equation (3)is tions of T (or Z ) alone, that is, Y (R ,T ) = jk k ij k jk jk γ(wˆ ) ≡ ψ(wˆ )/τ(wˆ ),whereτ(wˆ ) is the estimator Y (T ). k k k k ij k jk of the ITT effect on the receipt of the treatment Finally, following Angrist, Imbens and Rubin (1996), we call the units with R (T ) = T com- 1 ij k jk jk τ(wˆ ) ≡  pliers (and denote them by C = c), those with k m ij k k=1 wk Rij k (Tjk) = 1 always-takers (Cij k = a), those with     m n1k n2k = = = Ri1k = Ri2k Rij k (Tjk) 0 never-takers (Cij k n), and the units · w Z i 1 − i 1 = − = k k n n with Rij k (Tjk) 1 Tjk defiers (Cij k d). The k=1 1k 2k THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 47    + − n2k n1k (1 Zk) i=1 Ri2k i=1 Ri1k    + (1 − Zk) − n2k R n1k R n2k n1k · i=1 i2k − i=1 i1k . n n nτ(ˆ w˜ k) 2k 1k − . When matching is effective or when cluster sizes are m equal within each matched-pair, this estimator is con- sistent and approximately unbiased. Using a Taylor se- 7. 
SEGURO POPULAR EVALUATION ries expansion, the variance of this estimator can be We now estimate the causal effect of SPS on the approximated by probability of a household suffering catastrophic health Var apu(γ(wˆ k)) expenditures (out-of-pocket health care expenditures totaling more than 30% of a household’s annual post- 1 ≈ subsistence or disposable income). As nearly 10% of {E (τ(wˆ ))}4 apu k households suffer catastrophic health expenditures in a 2 ˆ ·[{Eapu(τ(wˆ k))} Var apu(ψ(wk)) year, it is easy to see why this would be a major pri- (14) ority. We estimate all four target population quantities +{ ˆ }2 ˆ Eapu(ψ(wk)) Var apu(τ(wk)) of interest (SATE, CATE, UATE, and PATE) both for ˆ the intention to treat (ITT) effect of encouragement to − 2Eapu(ψ(wk))Eapu(τ(wˆ k)) affiliate an the average causal effect among compliers ˆ · Covapu(ψ(wk), τ(wˆ k))], (CACE). Although in most applications, substantive where if simple random sampling of pairs of clusters interest would narrow this list to one or a few of these alone is assumed, then the subscript “apu” (for assign- quantities, for our methodological purposes we present ment, pairs, and units) is replaced with “ap.” Further- all eight estimates (and standard errors) in Table 3. more, the argument given in Section 4.3 implies, for A table like this will always have some of the same example, that the variance of γ(ˆ w˜ k) for estimating the features, no matter what variable is analyzed. Recall, sample CACE is on average less than the variance for for example, that point estimates of SATE and UATE the population CACE given in equation (14). are the same, as they are for CATE and PATE. In addi- Finally, Proposition 3 shows how to estimate tion, standard errors of UATE and PATE are the upper ˆ ˆ Var apu(ψ(wk)),Varapu(τ(wˆ k)) (or Varap (ψ(wk)) and bounds of the standard errors for SATE and CATE, re- Var ap (τ(wˆ k))) approximately without bias. Thus, we spectively. 
CACE estimates of course are never smaller only need an estimate of the covariance between of than those for ITT. ˆ ψ(wk) and τ(wˆ k) from the observed data. Using the For the specific estimates, consider first the two top normalized weights w˜ k, Appendix A.5 proves that the lines of Table 3 corresponding to all households. For following estimator is approximately unbiased for both these data, the CACE estimates are about 2.7 times ˆ ˆ Covapu(ψ(wk), τ(wˆ k)) and Covpu(ψ(wk), τ(wˆ k)) un- larger than that for ITT. The large difference is be- der their respective sampling assumptions: cause of all those who had preexisting health care and ν(ˆ w˜ k) so were largely never-takers. Overall, these results in- dicate that SPS was clearly successful in reducing the ≡ m (m − 1)n2 most devastating type of medical expenditures. The     differences among the columns indicate that the aver- m n1k Y n2k Y · w˜ Z i=1 i1k − i=1 i2k age causal effect of encouragement to affiliate to SPS k k n n k=1 1k 2k (the ITT effect) is somewhat larger in the population of − + − individuals represented by our sample ( 0.023) than (1 Zk) −    among the individuals we directly observe ( 0.014). n2k Y n1k Y · i=1 i2k − i=1 i1k The same is also true among compliers, but at a higher − − n2k n1k level ( 0.038 vs. 0.064). Substantively, these numbers are quite large. Since nψ(ˆ w˜ ) − k those who suffer from catastrophic health expenditures m are mostly the poor without access to health insurance,     n1k R n2k R they are likely to be disproportionately represented · ˜ i=1 i1k − i=1 i2k wk Zk among compliers as compared to the wealthy with pre- n1k n2k 48 K. IMAI, G. KING AND C. NALL

TABLE 3 Estimates of eight causal effect of SPS on the probability of catastrophic health expenditures for all households and male-headed households (standard errors in parentheses)

SATE CATE UATE PATE

All ITT −0.014 (≤ 0.007) −0.023 (≤ 0.015) −0.014 (0.007) −0.023 (0.015) CACE −0.038 (≤ 0.018) −0.064 (≤ 0.024) −0.038 (0.018) −0.064 (0.024) Male-headed ITT −0.016 (≤ 0.008) −0.025 (≤ 0.018) −0.016 (0.008) −0.025 (0.018) CACE −0.042 (≤ 0.020) −0.070 (≤ 0.031) −0.042 (0.020) −0.070 (0.031)

existing health care arrangements. As such, this analy- Of course, administrative, political, ethical and other sis indicates that the causal effect of rolling out the issues will sometimes constrain the ability of re- policy reduces by about 23% the proportion of those searchers to pair clusters prior to randomization. With who experience catastrophic expenditures (i.e., −0.023 the results and new estimators offered here, the effort of the 10% with catastrophic expenditures). (Detailed in the design of cluster-randomized experiments can analyses of these and other substantive results from the now shift from debates about when pairing is useful to SPS evaluation appear in King et al., 2009.) practical discussions of how best to marshal creative arguments and procedures to ensure that clusters can 8. CONCLUDING REMARKS more often be paired prior to randomization. The methods developed here are designed for re- Cornfield [(1978), pages 101–102] concludes his searchers lucky enough to be able to randomize treat- now classic study by writing that “Randomization ment assignment, but stuck because of political or by cluster accompanied by an analysis appropriate other constraints with having to randomize clusters to randomization by individual is an exercise in self- of individuals rather than the individuals themselves. deception, ... and should be avoided,” and an enor- Field experiments in particular frequently require clus- mous literature has grown in many fields echoing this ter randomization. Individual-level randomization was warning. We can now add that randomization by cluster impossible in our evaluation of the Mexican SPS pro- without prior construction of matched pairs, when pair- gram; in fact, negotiations with the Mexican govern- ing is feasible, is an exercise in self-destruction. 
Failing ment began with the presumption that no type of ran- to match can greatly reduce efficiency, power and ro- domization would be politically feasible, but it eventu- bustness, and is equivalent to discarding a large portion ally concluded by allowing cluster-level randomization of experimental data or wasting grant money and inves- to be implemented. tigator effort. This result should affect practice, espe- When clusters of individuals are randomized rather cially in literatures like political science where exper- than the individuals themselves, the best practice imental analyses routinely use cluster-randomization should involve three steps. First, researchers should but examples of matched-pair designs have almost choose their causal quantity of interest, as defined never been used, as well as community consensus in Section 3.3. They should then identify available recommendations for best practices in the conduct pre-treatment covariates likely to affect the outcome and analysis of cluster-randomized experiments, which variable, and, if possible, pair clusters based on the similarity of these covariates and cluster sizes; this closely follow prior methodological literature. These step is severely underutilized and, when feasible, will include the extension to the “CONSORT” agreement translate into considerable research resources saved among the major biomedical journals (Campbell, El- and numerous observations gained. Finally, researchers bourne and Altman, 2004), the Cochrane Collabora- should randomly choose one treated and one control tion requirements for reviewing research (Higgins and cluster within each pair. Claims in the literature about Green, 2006, Section 8.11.2), the prominent Medical problems with matched-pair cluster randomization de- Research Council (2002) guidelines, and the educa- signs are misguided: clusters should be paired prior to tion research What Works Clearinghouse (2006). 
Each randomization when considered from the perspective would seem to require crucial modifications in light of of efficiency, power, bias or robustness. the results given here. THE ESSENTIAL ROLE OF PAIR MATCHING IN CLUSTER-RANDOMIZED EXPERIMENTS 49

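APPENDIX A: MATHEMATICAL APPENDIX

A.1 Proof of Proposition 1

This proof uses a strategy similar to that of Proposition 1 of Imai (2008). First, rewrite $\hat\sigma(\tilde w_k)$ as

$$\frac{(m-1) n^2}{m} \hat\sigma(\tilde w_k) = \sum_{k=1}^{m} \left[ \tilde w_k \{ Z_k D_k(1) + (1 - Z_k) D_k(0) \} - \frac{1}{m} \sum_{k'=1}^{m} \tilde w_{k'} \{ Z_{k'} D_{k'}(1) + (1 - Z_{k'}) D_{k'}(0) \} \right]^2$$

$$= \frac{m-1}{m} \sum_{k=1}^{m} \tilde w_k^2 \{ Z_k D_k(1)^2 + (1 - Z_k) D_k(0)^2 \} - \frac{1}{m} \sum_{k=1}^{m} \sum_{k' \ne k} \tilde w_k \tilde w_{k'} \{ Z_k Z_{k'} D_k(1) D_{k'}(1) + Z_k (1 - Z_{k'}) D_k(1) D_{k'}(0) + (1 - Z_k) Z_{k'} D_k(0) D_{k'}(1) + (1 - Z_k)(1 - Z_{k'}) D_k(0) D_{k'}(0) \}.$$

Assumption 2 implies $E_a(Z_k) = 1/2$ and $E_a(Z_k Z_{k'}) = 1/4$ for $k \ne k'$. Thus, taking expectations over $Z_k$ and rearranging gives

$$E_a(\hat\sigma(\tilde w_k)) = \frac{1}{2 n^2} \left[ \sum_{k=1}^{m} \tilde w_k^2 \left( D_k(1)^2 + D_k(0)^2 \right) - \frac{1}{2(m-1)} \sum_{k=1}^{m} \sum_{k' \ne k} \tilde w_k \tilde w_{k'} \left( D_k(1) + D_k(0) \right) \left( D_{k'}(1) + D_{k'}(0) \right) \right]. \tag{15}$$

Finally, we compare this with the true variance expression in (7): $E_a(\hat\sigma(\tilde w_k)) - \mathrm{Var}_a(\hat\psi(\tilde w_k))$, which equals

$$\frac{1}{4 n^2} \left[ \sum_{k=1}^{m} \tilde w_k^2 \{ D_k(1) + D_k(0) \}^2 - \frac{1}{m-1} \sum_{k=1}^{m} \sum_{k' \ne k} \tilde w_k \tilde w_{k'} \left( D_k(1) + D_k(0) \right) \left( D_{k'}(1) + D_{k'}(0) \right) \right] = \frac{m}{4 n^2} \mathrm{var}\left[ \tilde w_k \{ D_k(1) + D_k(0) \} \right].$$

This bias term is not identifiable because $D_k(1)$ and $D_k(0)$ are not jointly observed for any $k$, implying that the variance is not identifiable either.

A.2 Proof of Proposition 2

Applying the law of iterated expectations to equation (15), we have

$$E_{au}(\hat\sigma(\tilde w_k)) = \frac{1}{2 n^2} \left[ \sum_{k=1}^{m} \tilde w_k^2 E_u\{ D_k(1)^2 + D_k(0)^2 \} - \frac{1}{2(m-1)} \sum_{k=1}^{m} \sum_{k' \ne k} \tilde w_k \tilde w_{k'} E_u\left( D_k(1) + D_k(0) \right) E_u\left( D_{k'}(1) + D_{k'}(0) \right) \right], \tag{16}$$

where the equality follows from the assumption that sampling of units is independent across clusters. Together with the definition of $\mathrm{Var}_{au}(\hat\psi(\tilde w_k))$ given above, we have

$$E_{au}(\hat\sigma(\tilde w_k)) - \mathrm{Var}_{au}(\hat\psi(\tilde w_k)) = \frac{1}{4 n^2} \left[ \sum_{k=1}^{m} \tilde w_k^2 \left\{ E_u\left( D_k(1)^2 + D_k(0)^2 \right) - \mathrm{Var}_u(D_k(1)) - \mathrm{Var}_u(D_k(0)) + 2 E_u(D_k(1)) E_u(D_k(0)) \right\} - \frac{1}{m-1} \sum_{k=1}^{m} \sum_{k' \ne k} \tilde w_k \tilde w_{k'} E_u\left( D_k(1) + D_k(0) \right) E_u\left( D_{k'}(1) + D_{k'}(0) \right) \right]$$

$$= \frac{m}{4 n^2} \mathrm{var}\left[ \tilde w_k E_u\left( D_k(1) + D_k(0) \right) \right].$$

Since we do not observe $D_k(1)$ and $D_k(0)$ jointly, this variance is not identified.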
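The identity at the end of the proof of Proposition 1 can be checked numerically for a small number of pairs by enumerating all $2^m$ equally likely assignments. The sketch below is only an illustration under stated assumptions: the weights passed in play the role of the normalized weights $\tilde w_k$ (so their sum is taken to be $n$), `d1` and `d0` hold hypothetical values of $D_k(1)$ and $D_k(0)$, and all function names are invented for this example.

```python
from itertools import product


def realized(z, w, d1, d0):
    """Weighted within-pair differences w_k * D_k under assignment z."""
    return [wk * (a if zk == 1 else b) for zk, wk, a, b in zip(z, w, d1, d0)]


def psi_hat(x, n):
    """Weighted difference-in-means estimator, (1/n) * sum_k w_k D_k."""
    return sum(x) / n


def sigma_hat(x, n):
    """Variance estimator m / ((m-1) n^2) * sum_k (x_k - n*psi_hat/m)^2."""
    m = len(x)
    x_bar = sum(x) / m  # equals n * psi_hat / m
    return m / ((m - 1) * n**2) * sum((xk - x_bar) ** 2 for xk in x)


def proposition1_gap(w, d1, d0):
    """Return (E_a[sigma_hat] - Var_a[psi_hat], m/(4 n^2) * var[w_k (D_k(1)+D_k(0))])."""
    m, n = len(w), sum(w)
    # Each pair is randomized independently with probability 1/2,
    # so all 2^m assignment vectors are equally likely.
    draws = [realized(z, w, d1, d0) for z in product([0, 1], repeat=m)]
    psis = [psi_hat(x, n) for x in draws]
    e_psi = sum(psis) / len(psis)
    var_psi = sum((p - e_psi) ** 2 for p in psis) / len(psis)
    e_sigma = sum(sigma_hat(x, n) for x in draws) / len(draws)
    s = [wk * (a + b) for wk, a, b in zip(w, d1, d0)]
    s_bar = sum(s) / m
    sample_var = sum((sk - s_bar) ** 2 for sk in s) / (m - 1)
    return e_sigma - var_psi, m / (4 * n**2) * sample_var


# Hypothetical values for four pairs.
w = [120.0, 80.0, 150.0, 50.0]
d1 = [-0.05, -0.02, -0.08, 0.01]
d0 = [0.03, 0.06, 0.02, 0.04]
gap, formula = proposition1_gap(w, d1, d0)
```

The two returned numbers agree up to floating-point error, and both are zero when $\tilde w_k \{D_k(1) + D_k(0)\}$ is constant across pairs, which is the case in which the variance estimator is exactly unbiased for the sample quantity.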

A.3 Proof of Proposition 3

Since UATE is a special case of PATE where all units within each cluster are observed ($n_{jk} = N_{jk}$), we first derive the variance of $\hat\psi(\tilde w_k)$ for PATE. Let $\tilde D_k(t) = w_k D_k(t)$, $\tilde\mu_k(t) = E_u(\tilde D_k(t))$, and $\tilde\eta_k(t) = \mathrm{Var}_u(\tilde D_k(t))$ for $t = 0, 1$. Then, randomizing the order of clusters within each pair implies $E_p(\tilde\eta_k) = E_p(\tilde\eta_k(1)) = E_p(\tilde\eta_k(0))$ and $\mathrm{Var}_p(\tilde\mu_k) = \mathrm{Var}_p(\tilde\mu_k(1)) = \mathrm{Var}_p(\tilde\mu_k(0))$. Then, the variance is

$$\mathrm{Var}_{apu}(\hat\psi(\tilde w_k)) = \frac{1}{2 m \bar w^2} E_p\left[ \mathrm{Var}_u(\tilde D_k(1)) + \mathrm{Var}_u(\tilde D_k(0)) + \frac{1}{2} \left\{ E_u\left( \tilde D_k(1) - \tilde D_k(0) \right) \right\}^2 \right] + \frac{1}{4 m \bar w^2} \mathrm{Var}_p\left[ E_u\left( \tilde D_k(1) + \tilde D_k(0) \right) \right] = \frac{1}{m \bar w^2} \{ E_p(\tilde\eta_k) + \mathrm{Var}_p(\tilde\mu_k) \}.$$

When the estimand is UATE, $\tilde\eta_k = 0$ for all $k$ since within-cluster means are observed without sampling variability. Thus, $\mathrm{Var}_{apu}(\hat\psi(\tilde w_k)) = \mathrm{Var}_p(\tilde\mu_k) / (m \bar w^2)$.

Next, we show that $\hat\sigma(\tilde w_k)$ is approximately unbiased by applying the law of iterated expectations to equation (16):

$$E_{apu}(\hat\sigma(\tilde w_k)) = \frac{1}{2 m \bar w^2} \left[ E_p\left\{ w_k^2 E_u\left( D_k(1)^2 + D_k(0)^2 \right) \right\} - \frac{1}{2} \left\{ E_p\left[ w_k E_u\left( D_k(1) + D_k(0) \right) \right] \right\}^2 \right] = \frac{1}{m \bar w^2} \left[ E_p\{ \mathrm{Var}_u(\tilde D_k) \} + \mathrm{Var}_p(\tilde\mu_k) \right],$$

where $E_u(\tilde D_k^2) = E_u(\tilde D_k(0)^2) = E_u(\tilde D_k(1)^2)$ holds because the order of clusters within each pair is randomized. For UATE, $\mathrm{Var}_u(\tilde D_k) = 0$ for all $k$ since within-cluster means are observed without sampling uncertainty. Thus, $E_{ap}(\hat\sigma(\tilde w_k)) = \mathrm{Var}_p(\tilde\mu_k) / (m \bar w^2)$.

A.4 Properties of the Harmonic Mean Estimator and Standard Error

Modeling assumptions. The harmonic mean estimator, with weights based on the harmonic mean of sample cluster sizes $w_k = n_{1k} n_{2k} / (n_{1k} + n_{2k})$, stems from the weighted one-sample t-test for the difference in means: $D_k \overset{\text{indep.}}{\sim} N\left( \mu, \left( w_k / \sum_{k'=1}^{m} w_{k'} \right)^{-1} \sigma \right)$ for $k = 1, 2, \ldots, m$ where $w_k$ is the known harmonic mean weight. In our context, $D_k$ is the observed within-pair mean difference, that is, $D_k \equiv Z_k D_k(1) + (1 - Z_k) D_k(0)$ where $D_k(1) \equiv \sum_{i=1}^{n_{1k}} Y_{i1k}(1) / n_{1k} - \sum_{i=1}^{n_{2k}} Y_{i2k}(0) / n_{2k}$ and $D_k(0) \equiv \sum_{i=1}^{n_{2k}} Y_{i2k}(1) / n_{2k} - \sum_{i=1}^{n_{1k}} Y_{i1k}(0) / n_{1k}$. It is well known that under this model, $\sum_{k=1}^{m} w_k D_k / \sum_{k'=1}^{m} w_{k'}$ is the uniformly minimum variance unbiased estimator.

Although the derivation of this model is not discussed in the cluster randomization literature, a model commonly used in the statistics literature for other purposes gives rise to these weights (see e.g., Kalton, 1968): $Y_{ijk}(t) \overset{\text{i.i.d.}}{\sim} N(\mu_t, \tilde\sigma)$ for $t = 0, 1$ where $\tilde\sigma = \sigma \sum_{k=1}^{m} w_k$ and $\sum_{k=1}^{m} w_k$ is a known constant since $w_k$ is assumed fixed. The normality assumption is not necessary for some inferential purposes, but this model does require (1) independent and identical distributions across units within each cluster as well as (2) across clusters and pairs (which of course implies constant means and variances within and across clusters and pairs) and (3) equal variances for the two potential outcomes. In sum, the model assumes homogeneity within and across matched-pairs. [Although we focus on the t-test here, for binary outcomes the suggested approach in the literature is also based on a homogeneity assumption where the odds ratio is assumed constant across clusters; see, e.g., Donner and Donald (1987); Donner and Hauck (1989).]

Bias conditions. The harmonic mean weight differs from our proposed weight in three ways. First, it gives more weight to pairs with well-matched cluster sizes than to pairs whose cluster sizes are unbalanced. That is, if we assume the sum $n_{1k} + n_{2k}$ is fixed, the harmonic mean is the largest when $n_{1k} = n_{2k}$ and becomes smaller as $|n_{1k} - n_{2k}|$ increases. Second, and most importantly, this weight does not remove the bias when within-cluster average treatment effects are identical within pairs (so long as heterogeneity across matched-pairs remains), meaning that bias may remain even when matching is effective. (The direction of the bias depends on the data.) One condition under which it is unbiased is with exact matching on sample cluster sizes (i.e., $n_{1k} = n_{2k}$ for all $k$), in which case this estimator coincides with our proposed estimator. Finally, since the weight is based on sample cluster sizes, this estimator is not valid for estimating CATE or PATE. When its assumptions hold, the harmonic mean estimator is uniformly minimum variance unbiased, and is clearly useful in those circumstances.

Bias in the variance estimator. We show here that ·{Zk Dk (1) the variance estimator proposed in the literature (see,  e.g., Donner, 1987; Donner and Donald, 1987; Don- + (1 − Zk )Dk (0)} . ner and Klar, 1993) may be biased regardless of choice of weights and the direction of bias is indeterminate. ˆ ˜ A condition under which this variance estimator is un- Taking the expectation with respect to Zk, Ea(δ(wk)), biased (and approximately equal to ours) is when m is gives    large and the weights are identical across pairs, which m 2 m = w˜ w˜   is uncommon in practice. We first write this estimator k 1 k 1 − k w˜ D (1)2 + D (0)2 2n3 n k k k using our notation: k=1  m 2 m  = w˜ 1   ˆ ˜ ≡ k 1 k − w˜ w˜ D (1) + D (0) δ(wk) 3 k k k k n 2n = =     k 1 k k m n1k n2k i=1 Yi1k i=1 Yi2k · w˜ k Zk −   n n · D (1) + D (0) . k=1 1k 2k k k

\[
{} + (1 - Z_k) \Biggl( \frac{\sum_{i=1}^{n_{2k}} Y_{i2k}}{n_{2k}} - \frac{\sum_{i=1}^{n_{1k}} Y_{i1k}}{n_{1k}} \Biggr) - \hat\psi(\tilde w_k) \Biggr]^2 . \tag{17}
\]

Next, we rewrite $\hat\delta(\tilde w_k)$ as

\[
\begin{aligned}
\frac{n^3\, \hat\delta(\tilde w_k)}{\sum_{k=1}^m \tilde w_k^2}
&= \sum_{k=1}^m \tilde w_k \Bigl[ Z_k D_k(1) + (1 - Z_k) D_k(0) - \frac{1}{n} \sum_{k'=1}^m \tilde w_{k'} \{ Z_{k'} D_{k'}(1) + (1 - Z_{k'}) D_{k'}(0) \} \Bigr]^2 \\
&= \sum_{k=1}^m \tilde w_k \{ Z_k D_k(1)^2 + (1 - Z_k) D_k(0)^2 \} \\
&\quad - \frac{2}{n} \sum_{k=1}^m \tilde w_k \{ Z_k D_k(1) + (1 - Z_k) D_k(0) \} \cdot \sum_{k'=1}^m \tilde w_{k'} \{ Z_{k'} D_{k'}(1) + (1 - Z_{k'}) D_{k'}(0) \} \\
&\quad + \frac{1}{n^2} \sum_{k=1}^m \tilde w_k \Bigl[ \sum_{k'=1}^m \tilde w_{k'} \{ Z_{k'} D_{k'}(1) + (1 - Z_{k'}) D_{k'}(0) \} \Bigr]^2 .
\end{aligned}
\]

Comparing this expression with $E_a(\hat\sigma(\tilde w_k))$ in equation (15) shows a difference that remains even after taking the expectation with respect to simple random sampling of pairs of clusters or units within clusters. Since $\hat\sigma(\tilde w_k)$ is an approximately unbiased estimate of the variance for UATE and PATE, $\hat\delta(\tilde w_k)$ may be biased.

A.5 Covariance Estimation

This Appendix derives unbiased estimates of $\operatorname{Cov}_{auc}(\hat\psi(\tilde w_k), \hat\tau(\tilde w_k))$ and $\operatorname{Cov}_{ac}(\hat\psi(\tilde w_k), \hat\tau(\tilde w_k))$ using the proofs in Propositions 1–3. First, we derive the true covariance between $\hat\psi(\tilde w_k)$ and $\hat\tau(\tilde w_k)$. Define

\[
G_k(1) \equiv \frac{\sum_{i=1}^{n_{1k}} R_{i1k}(1)}{n_{1k}} - \frac{\sum_{i=1}^{n_{2k}} R_{i2k}(0)}{n_{2k}}
\quad\text{and}\quad
G_k(0) = \frac{\sum_{i=1}^{n_{2k}} R_{i2k}(1)}{n_{2k}} - \frac{\sum_{i=1}^{n_{1k}} R_{i1k}(0)}{n_{1k}} .
\]

Taking the expectation with respect to $Z_k$ yields

\[
\operatorname{Cov}_a(\hat\psi(\tilde w_k), \hat\tau(\tilde w_k)) = \frac{1}{4n^2} \sum_{k=1}^m \tilde w_k^2 \, (D_k(1) - D_k(0)) \cdot (G_k(1) - G_k(0)),
\]

where the factor $1/4$ is the variance of $Z_k$ under within-pair randomization with probability $1/2$. Then, we have

\[
\begin{aligned}
\operatorname{Cov}_{ap}(\hat\psi(\tilde w_k), \hat\tau(\tilde w_k))
&= E_p\{\operatorname{Cov}_a(\hat\psi(\tilde w_k), \hat\tau(\tilde w_k))\} + \operatorname{Cov}_p\{E_a(\hat\psi(\tilde w_k)), E_a(\hat\tau(\tilde w_k))\} \\
&= \frac{1}{m \bar w^2} \operatorname{Cov}_p(\tilde D_k, \tilde G_k),
\end{aligned}
\]

where $\tilde G_k(t) = w_k G_k(t)$ for $t = 0, 1$, and the last equality follows from the fact that $E_p(\tilde D_k) = E_p(\tilde D_k(t))$, $E_p(\tilde G_k) = E_p(\tilde G_k(t))$ and $E_p(\tilde D_k \tilde G_k) = E_p(\tilde D_k(t) \tilde G_k(t))$ for $t = 0, 1$. Similarly,

\[
\begin{aligned}
\operatorname{Cov}_{au}(\hat\psi(\tilde w_k), \hat\tau(\tilde w_k))
&= E_u\{\operatorname{Cov}_a(\hat\psi(\tilde w_k), \hat\tau(\tilde w_k))\} + \operatorname{Cov}_u\{E_a(\hat\psi(\tilde w_k)), E_a(\hat\tau(\tilde w_k))\} \\
&= \frac{1}{2 m^2 \bar w^2} \sum_{k=1}^m \Bigl[ \operatorname{Cov}_u(\tilde D_k(1), \tilde G_k(1)) + \operatorname{Cov}_u(\tilde D_k(0), \tilde G_k(0)) \\
&\qquad + \frac{1}{2} \{ E_u(\tilde D_k(1)) - E_u(\tilde D_k(0)) \} \cdot \{ E_u(\tilde G_k(1)) - E_u(\tilde G_k(0)) \} \Bigr] .
\end{aligned}
\]

And thus,

\[
\begin{aligned}
\operatorname{Cov}_{apu}(\hat\psi(\tilde w_k), \hat\tau(\tilde w_k))
&= \frac{1}{m \bar w^2} \Bigl[ E_p\{\operatorname{Cov}_u(\tilde D_k, \tilde G_k)\} \\
&\qquad + \frac{1}{4} E_p\bigl[ \{ E_u(\tilde D_k(1)) - E_u(\tilde D_k(0)) \} \cdot \{ E_u(\tilde G_k(1)) - E_u(\tilde G_k(0)) \} \bigr] \\
&\qquad + \frac{1}{4} \operatorname{Cov}_p\{ E_u(\tilde D_k(1) + \tilde D_k(0)), \, E_u(\tilde G_k(1) + \tilde G_k(0)) \} \Bigr] \\
&= \frac{1}{m \bar w^2} \Bigl[ E_p\{\operatorname{Cov}_u(\tilde D_k, \tilde G_k)\} + \operatorname{Cov}_p\{ E_u(\tilde D_k), E_u(\tilde G_k) \} \Bigr] .
\end{aligned}
\]

Then, calculations analogous to the ones above show that $E_{ap}(\hat\nu(\tilde w_k)) = \operatorname{Cov}_{ap}(\hat\psi(\tilde w_k), \hat\tau(\tilde w_k))$ and $E_{apu}(\hat\nu(\tilde w_k)) = \operatorname{Cov}_{apu}(\hat\psi(\tilde w_k), \hat\tau(\tilde w_k))$.

ACKNOWLEDGMENTS

Our proposed methods can be implemented using an R package, experiment, which is available for download at the Comprehensive R Archive Network (http://cran.r-project.org). We thank Kevin Arceneaux, Jake Bowers, Paula Diehr, Ben Hansen, Jennifer Hill, and Dylan Small for helpful comments, and Neil Klar and Allan Donner for detailed suggestions and gracious extended conversations. For research support, our thanks go to the National Institute of Public Health of Mexico, the Mexican Ministry of Health, the National Science Foundation (SES-0550873, SES-0752050), the Princeton University Committee on Research in the Humanities and Social Sciences and the Institute for Quantitative Social Science at Harvard.
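As a numerical illustration of the assignment covariance in Appendix A.5, the sketch below checks the closed form $\frac{1}{4n^2} \sum_k \tilde w_k^2 (D_k(1) - D_k(0))(G_k(1) - G_k(0))$ against a brute-force enumeration of all $2^m$ within-pair assignments. It assumes that each $Z_k$ is an independent fair coin flip (so that $\operatorname{Var}(Z_k) = 1/4$) and that the normalized weights sum to $n$. The function names, the synthetic inputs and the value of $n$ are hypothetical and are not taken from the experiment package.

```python
import itertools
import random


def observed_estimate(w_tilde, z, x1, x0):
    """(1/n) * sum_k w~_k {Z_k x_k(1) + (1 - Z_k) x_k(0)}, with n = sum_k w~_k."""
    n = sum(w_tilde)
    return sum(w * (zk * a + (1 - zk) * b)
               for w, zk, a, b in zip(w_tilde, z, x1, x0)) / n


def exact_cov_a(w_tilde, D1, D0, G1, G0):
    """Covariance of (psi_hat, tau_hat) over all 2^m equally likely assignments."""
    m = len(w_tilde)
    draws = [(observed_estimate(w_tilde, z, D1, D0),
              observed_estimate(w_tilde, z, G1, G0))
             for z in itertools.product((0, 1), repeat=m)]
    mean_psi = sum(p for p, _ in draws) / len(draws)
    mean_tau = sum(t for _, t in draws) / len(draws)
    return sum((p - mean_psi) * (t - mean_tau) for p, t in draws) / len(draws)


def closed_form_cov_a(w_tilde, D1, D0, G1, G0):
    """(1 / (4 n^2)) * sum_k w~_k^2 (D_k(1) - D_k(0)) (G_k(1) - G_k(0))."""
    n = sum(w_tilde)
    return sum(w ** 2 * (d1 - d0) * (g1 - g0)
               for w, d1, d0, g1, g0 in zip(w_tilde, D1, D0, G1, G0)) / (4 * n ** 2)


# Synthetic, purely illustrative inputs (not data from the evaluation).
random.seed(0)
m = 6                                    # number of matched pairs
raw_w = [random.uniform(1, 5) for _ in range(m)]
n = 100.0                                # hypothetical total number of units
w_tilde = [n * w / sum(raw_w) for w in raw_w]
D1, D0, G1, G0 = ([random.gauss(0, 1) for _ in range(m)] for _ in range(4))

exact = exact_cov_a(w_tilde, D1, D0, G1, G0)
closed = closed_form_cov_a(w_tilde, D1, D0, G1, G0)
print(exact, closed)
```

The two printed numbers should agree up to floating-point error, because only the $Z_k$ are random under the assignment distribution and they are independent across pairs. Enumeration is feasible only for a small number of pairs; for larger $m$, a Monte Carlo draw of assignments would be the natural substitute.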
