
Optimal Rerandomization Designs via a Criterion that Provides Insurance Against Failed Experiments

Adam Kapelner∗

Department of Mathematics, Kiely Hall Room 604, 65-30 Kissena Boulevard, Queens, NY, 11367, USA

Abba M. Krieger

Department of Statistics, Huntsman Hall Room 442, 3730 Walnut Street, Philadelphia, PA, 19104, USA

Michael Sklar

Department of Statistics, Sequoia Hall Room 105, 390 Serra Mall, Stanford, CA, 94305, USA

David Azriel

Faculty of Industrial Engineering and Management, Bloomfield Building Room 301, Technion City, Haifa 32000, Israel

Abstract

We present an optimized rerandomization design procedure for a non-sequential treatment-control experiment. Randomized experiments are the gold standard for finding causal effects in nature. But sometimes random assignments result in unequal partitions of the treatment and control group, visibly seen as imbalance in observed covariates. There can additionally be imbalance on unobserved covariates. Imbalance in either observed or unobserved covariates increases treatment effect estimator error, inflating the width of confidence regions and reducing experimental power. "Rerandomization" is a strategy that omits poorly balanced assignments by limiting imbalance in the observed covariates to a prespecified threshold. However, limiting this threshold too much can increase the risk of contracting error from unobserved covariates. We introduce a criterion that combines observed imbalance with the risk of inadvertently imbalancing unobserved covariates. We then use this criterion to locate the optimal rerandomization threshold based on the practitioner's level of desired insurance against high estimator error. We demonstrate the gains of our designs in simulation and in a dataset from a large experiment in education. We provide an open source package available on CRAN named OptimalRerandExpDesigns which generates designs according to our algorithm.

Keywords: Randomized experiments, Rerandomization, Experimental design, Optimization

∗Corresponding author. Email address: [email protected] (Adam Kapelner)

1. Background

We consider a classic problem: a treatment-control experiment with n subjects (individuals, participants or units) and one clearly defined outcome of interest (also called the response or endpoint) for the n subjects denoted $y = [y_1, \ldots, y_n]^\top$. We scope our discussion to the response being continuous and uncensored (with incidence and survival responses left to further research). Each subject is assigned to either the treatment group or the control group, denoted T and C and referred to as the two arms. Here we consider the setting where all subjects, along with their observed subject-specific covariates (measurements or characteristics) denoted $x$, are known beforehand and considered fixed. This non-sequential setting was studied by Fisher (1925) when assigning treatments to agricultural plots and this setting is still of great importance today. In fact, it occurs in clinical trials as "many phase I studies use 'banks' of healthy volunteers ... [and] ... in most cluster randomised trials, the clusters are identified before treatment is started" (Senn, 2013, page 1440).

Synonymously referred to as a randomization, an allocation or an assignment is a vector $w = [w_1, \ldots, w_n]^\top$ whose entries indicate whether the subject received T (coded as +1) or C (coded as −1), and thus this vector must be an element of $\{-1, +1\}^n$. The goals of such experiments are to estimate and test a population average treatment effect (PATE) denoted $2\beta_T$ (where the multiplicative factor of 2 would not be present if $w$ were coded with 0 and 1). After the sample is provided, the only degree of control the experimenter has is to choose the entries of $w$. The process that results in the choices of such allocations of the n subjects to the two arms is termed synonymously an experimental design (or just design), a strategy, an algorithm, a method or a procedure. The design essentially is a generalized multivariate Bernoulli random variable denoted $W$ with mass function denoted $P(w)$, variance-covariance matrix denoted $\Sigma_W$ and support denoted $\mathcal{W}$, synonymously termed the allocation space. If all realizations satisfy $w^\top 1_n = 0$, i.e. all of the assignments feature an equal number of subjects, n/2, assigned to the treatment group and the control group, we term $W$ a forced balance procedure (Rosenberger and Lachin, 2016, Chapter 3.3).

The question then becomes: what is the optimal design strategy? Regardless of how optimality is defined on whichever criterion, this is an impossible question to answer. The multivariate Bernoulli random variable has $2^n - 1$ parameters (Teugels, 1990, Section 2.3). Finding the optimal design is tantamount to solving for exponentially many parameters in n using only the n observations,

a hopelessly unidentifiable task. Thus all "optimal" designs will fundamentally be heuristics. The question then becomes: which heuristic design should we employ? This has been hotly debated for over 100 years. There are loosely two main camps: those that optimize assignments and those that randomize assignments. The latter, first advocated by Fisher (1925), has prevailed, certainly in the context of causal inference and clinical trials. However, the design being "best" is dependent on many choices:

(C1) the choice of estimator for $\beta_T$,
(C2) the source of the randomness in the response-generating process $Y$,
(C3) the form of the response model as a function of the observed and unobserved covariates (and embedded within is the importance of the observed covariates relative to the unobserved) and
(C4) a definition of optimality and a criterion that gauges this optimality.

Further, drawing valid inference will be different depending on these choices, a point that we address in Section 2.3, but here we will continue with the focus on estimation. We now briefly describe our choices for (C1-C3); (C4) will be developed iteratively in the next section of the paper.

For (C1), we examine two choices of estimator for $\beta_T$. First, the classic difference-in-means estimator $\hat{\beta}_T^{DM} := (\bar{Y}_T - \bar{Y}_C)/2$, where $\bar{Y}_T$ is the random variable of the average response in the treatment group and $\bar{Y}_C$ is the random variable of the average response in the control group. If the model were unknown, $\hat{\beta}_T^{DM}$ would be the uniformly minimum variance unbiased estimator for the PATE (Lehmann and Casella, 1998, Example 4.7). Second, the covariate-adjusted regression estimator or OLS estimator, $\hat{\beta}_T^{LR}$, defined as the slope coefficient of $w$ in an ordinary least squares fit of $y$ using $x$ and $w$. If $y$ is known to be linear in $x$, then $\hat{\beta}_T^{LR}$ is known to be the best linear unbiased estimator for the PATE.¹

In (C2), the source of randomness in $Y$ is typically assumed to either be the population model (Rosenberger and Lachin, 2016, Chapter 6.2) or the randomization model (Rosenberger and Lachin, 2016, Chapter 6.3). To explain the difference between these two, assume for (C3) the following linear response model, which we generalize later:

$$y = \beta_T w + \beta_x x + z \quad\quad (1)$$

¹The employment of $\hat{\beta}_T^{LR}$ in the realistic setting where the linearity assumption is untestable (and most likely incorrect) results in bias. This bias is the subject of a large debate (Freedman, 2008; Berk et al., 2010; Lin, 2013) but is known to disappear rapidly in sample size.

To make the expressions even simpler, we assume $x$ is centered and scaled so its entries have average zero and standard deviation one. This normalization allows us to omit an intercept term in this simple response model without consequences to this discussion. In the population model, subjects in the treatment group are sampled at random from an infinitely-sized superpopulation and thus $z$ is a realization from some generating process $Z$ while $w$ is fixed. In the randomization model, it is the opposite: the source of the randomness is in the treatment assignments $w$, which is drawn from $W$, while $z$ is fixed. These two choices for (C2) represent very different worldviews with a rich history of debate (see for instance Reichardt and Gollob, 1999, pages 125-127). We touch on some of these issues by elaborating on the vastly different properties of our two estimators for $\beta_T$ in Appendix A.1 and Appendix A.2.

1.1. Assumption of the Randomization Model

In this work, we employ the randomization model perspective following Rosenberger and Lachin (2016); Freedman (2008); Lin (2013) and others, mostly because the population model requires the subjects to truly be sampled at random from a superpopulation. However, in most experiments, subjects are recruited from nonrandom sources in nonrandom locations. Experimental settings are frequently selected because of expertise, an ability to recruit subjects, and their budgetary requirements (Rosenberger and Lachin, 2016, page 99). In the context of clinical trials, Lachin (1988, page 296) states rather harshly that, "the invocation of a population model for the analysis of a clinical trial becomes a matter of faith that is based upon assumptions that are inherently untestable".²

1.2. Restricted Designs

We now return to our initial question of "which heuristic design should we employ?" and develop more terminology. If we were to assign each individual to treatment by an independent fair coin flip, we term this the Bernoulli Trial (Imbens and Rubin, 2015, Chapter 4.2). In the Bernoulli Trial, all assignments are possible, i.e. $\mathcal{W} = \{-1, +1\}^n$, and all assignments (the n elements of $w$) are independent. Thus the variance-covariance matrix of the design $W$ is simply $\Sigma_W = I_n$, where the latter notation indicates the $n \times n$ identity matrix. Any other design besides this Bernoulli Trial is conventionally termed a restricted design because the permissible allocations are restricted to a subset of all possible allocations. A very weak restriction is that of forced balance. If each forced balance allocation is equally likely, we term this the balanced completely randomized design (BCRD) as in Wu (1981). Here, there is a slight, but vanishing, negative correlation between the assignments and thus the off-diagonal elements of $\Sigma_W$ are all $-(n-1)^{-1}$.
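As an illustration of these two designs, consider the following minimal base-R sketch (our own illustration, not the API of the paper's package); it draws assignments under each design and checks the off-diagonal covariance empirically.

# Minimal sketch: the Bernoulli Trial versus BCRD, coded in {-1, +1}
n <- 100
S <- 10000

# Bernoulli Trial: each entry is an independent fair coin flip
W_bernoulli <- matrix(sample(c(-1, 1), n * S, replace = TRUE), nrow = S)

# BCRD: exactly n/2 subjects per arm in every realization
W_bcrd <- t(replicate(S, sample(c(rep(-1, n / 2), rep(1, n / 2)))))

# Empirical Sigma_W = E[w w^T] (E[w] = 0 under both designs)
Sigma_bern <- crossprod(W_bernoulli) / S
Sigma_bcrd <- crossprod(W_bcrd) / S
mean(Sigma_bern[upper.tri(Sigma_bern)])  # approximately 0
mean(Sigma_bcrd[upper.tri(Sigma_bcrd)])  # approximately -1/(n - 1) = -0.0101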

²However, assuming the randomization model complicates the claim of the generalizability of the study's results. For reasons why these complications may be ignored, see the discussion in Rosenberger and Lachin (2016, Chapter 6.3).

However, there is a large problem with employing Bernoulli Trial or BCRD assignments, described at the outset: under some "unlucky" assignments $w$ there are large differences in the distribution of observed covariates between the two groups, creating error in the estimation. Large differences are destructive for either choice of estimator (C1) and either choice of source of randomness model (C2), as shown explicitly in Appendix A.2.1. The solution is to eliminate these unlucky assignments from $\mathcal{W}$, the main reason restricted randomization has been employed for the past 100 years. What kind of heuristic design sufficiently performs this elimination? There are many, but in this work we focus on rerandomization.

1.3. The Rerandomization Design > We denote the imbalance in x as Bx := w x/(n/2) =x ¯T − x¯C . A naive restricted design that eliminates poor w’s goes as follows. (1) Realize a w from 2 2 BCRD and measure Bx and (2) if this Bx is less than a predetermined threshold a, then retain w and run the experiment otherwise return to step (1). Here, the variance- of the strategy W will be dependent on both x and threshold a. This rerandomization heuristic dates to the inception of randomized ex- perimentation. Student (1938, page 366) wrote that after an unlucky, highly imbalanced randomization, “it would be pedantic to [run the experiment] with [an assignment] known beforehand to be likely to lead to a misleading conclu- sion”. His solution is for “common sense [to prevail] and chance [be] invoked a second time”. Although this rerandomization design is classic, it has been rigorously investigated only recently (Morgan and Rubin, 2012). However, se- lecting the optimal threshold wich we denote as a∗ “remains an open problem” (Li et al., 2018, page 9162). This threshold is a critical quantity because it controls the degree of randomness in the design. A miniscule a will demand the optimal, deterministic design; a large a would allow for complete random- ization. Hence, solving this problem can bridge the intellectual gap between those that deterministically design experiments via optimization and those that design experiments via randomization.
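The naive accept-reject loop described at the start of this subsection is simple enough to sketch directly; below is a hedged base-R rendering for a single covariate (the function name and the retry cap are our own, not the paper's package).

# Sketch of the naive rerandomization loop for one covariate x:
# redraw BCRD assignments until the squared imbalance falls below a
rerandomize <- function(x, a, max_tries = 1e5) {
  n <- length(x)
  for (i in 1:max_tries) {
    w <- sample(c(rep(-1, n / 2), rep(1, n / 2)))  # one BCRD draw
    B_x <- sum(w * x) / (n / 2)                    # xbar_T - xbar_C
    if (B_x^2 < a) return(w)                       # accept and run the experiment
  }
  stop("no acceptable assignment found; the threshold a may be too small")
}

x <- as.numeric(scale(rnorm(100)))  # a centered and scaled covariate
w_exp <- rerandomize(x, a = 0.01)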

2. Methodology

Herein, we provide procedures to locate this optimal threshold $a^*$ for the regression estimator under a realistic response model. For reasons that will become clear later, we leave an algorithm for the threshold in the context of the difference-in-means estimator for future work. Before we begin, we would like to reiterate that the rerandomization design features valid inference for any threshold (see Section 2.3). Thus, any specific $a^*$ that results from our algorithm will not affect the validity of the resulting inference.

We assume $p < n$ covariates measured for each subject and collect them into a row vector $x_i := [x_{1,i}, \ldots, x_{p,i}]$. We denote $X$ as the $n \times p$ matrix that stacks the n vectors for each subject row-wise. Without loss of generality, we assume that each column in $X$ is centered and scaled. We examine a more general response model (than we introduced in Equation 1), linear in the observed covariates,

$$y = \beta_T w + X\beta + z. \quad\quad (2)$$

Here, $\beta$ is the vector of p weights for each covariate, a column vector of length p, and $z$ is the fixed contribution of unobserved measurements for the subjects as in the model of Equation 1 plus the misspecification errors of the true response function f minus the linear component, i.e. $[f(x_1) - x_1\beta, \ldots, f(x_n) - x_n\beta]^\top$. Because each column in $X$ is standardized, the linear component of this model does not require the additive intercept term.

In order to simplify our expressions, we note that the rerandomization design $W$ has the mirror property (Property 2.1) whereby the treatment group and control group can be swapped without changing the probability of the assignment.³

Property 2.1 (Mirror Property). For all $w \in \mathcal{W}$, $P(W = w \mid X) = P(W = -w \mid X)$.

We now need to discuss how we gauge performance of our estimators (C4). We begin by employing the mean squared error (MSE), i.e. the weighted mean over the squared loss considering all allocations, with $X$ and $z$ fixed. As an example, if $P(w)$ is discrete uniform (such as in the case of BCRD and rerandomization), the expressions are computed via $|\mathcal{W}|^{-1} \sum_{w \in \mathcal{W}} (\hat{\beta}(w) - \beta_T)^2$.

In Appendix A.3 we show that in the randomization model, the MSE of $\hat{\beta}_T^{DM}$ is

$$\mathrm{MSE}_{w}[\hat{\beta}_T^{DM} \mid z, X, \beta] = \frac{1}{n^2}\, (X\beta + z)^\top \Sigma_W (X\beta + z) \quad\quad (3)$$

and in Appendix A.5 we show that the MSE of $\hat{\beta}_T^{LR}$ can be approximated to the third order as

$$\mathrm{MSE}_{w}[\hat{\beta}_T^{LR} \mid z, X, \beta] \approx \frac{1}{n^2}\, z^\top \left( G + \frac{2}{n} D \right) z \quad\quad (4)$$

where $G := (I - P)\,\Sigma_W\,(I - P)$, where $P := X(X^\top X)^{-1} X^\top$ is the standard orthogonal projection matrix onto the column space of $X$ and $D := \mathbb{E}_{w}\left[w w^\top P w w^\top\right]$, an expectation of a quartic form that cannot be simplified.⁴

³Note that every design currently in wide use is mirror (e.g. BCRD, blocked designs, matched pair designs and of course rerandomization, the design investigated herein). Further, a non-mirror design would be suspicious in the following way. Consider a partition of subjects into two subsets of treatments and controls. A non-mirror design would say you could not switch the treatment and control subsets with equal probability.

⁴For intuition about the behavior of Equation 4, see the Appendix's Table A.2, which provides it for the special case of p = 1.

The expression of Equation 4 is independent of the unknown $\beta$ coefficients, a fact that will allow us to evaluate explicit designs in the following sections. In theory, the optimal MSE rerandomization strategy is to locate $a^*$, the threshold corresponding to the minimum of Equation 3 or 4 over $a$. As $a$ is varied, the determining matrix of the quadratic form varies, as $\Sigma_W$ is a function of $X$ and $a$. However, in practice this is impossible as $z$ (and $\beta$ in the case of $\hat{\beta}_T^{DM}$) is unknown. We now discuss three criterions (C4) that remove this mathematical dependence on the fixed set of $z$.
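To make the quantities in Equation 4 concrete, here is a hedged base-R sketch (our own illustration, not the paper's package) that computes P and G exactly from a sample of assignments and estimates the quartic-form expectation D by Monte Carlo.

# Sketch: the matrices of Equation 4 for a design represented by a sample
# of assignments (the rows of W_mat). P and G follow their definitions;
# D = E[w w^T P w w^T] is estimated by averaging over the sampled w's
compute_G_D <- function(X, W_mat) {
  n <- nrow(X)
  P <- X %*% solve(crossprod(X), t(X))       # projection onto col. space of X
  Sigma_W <- crossprod(W_mat) / nrow(W_mat)  # estimate of E[w w^T]
  G <- (diag(n) - P) %*% Sigma_W %*% (diag(n) - P)
  D <- matrix(0, n, n)
  for (s in 1:nrow(W_mat)) {
    ww <- tcrossprod(W_mat[s, ])             # w w^T for one assignment
    D <- D + ww %*% P %*% ww
  }
  list(P = P, Sigma_W = Sigma_W, G = G, D = D / nrow(W_mat))
}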

2.1. Criterions of Design Optimality

2.1.1. The Minimax Design

If we assume nature to be adversarial, we can examine the supremum of the MSE over $z$, i.e. the worst possible finite vector of values $z$. The optimal minimax strategy would then be BCRD for $\hat{\beta}_T^{DM}$ (Kapelner et al., 2020, Theorem 2.1) and would asymptotically be BCRD for $\hat{\beta}_T^{LR}$ (ibid, Appendix 6.11), corresponding to $a^* = \infty$. This result is not only trivial but would also correspond to the situation where $z$ is aligned with one arbitrary direction in $\mathbb{R}^n$, a punitively conservative and unrealistic situation.

2.1.2. The Mean Unobserved Design

Can we consider the case of an "average $z$"? To do so, we must imagine many possible experimental datasets of size n indexed by the elements of the support of $Z$, assumed a continuous measure. Although this is incoherent in the randomization model (where $z$ is fixed), and thus considering these many possible experimental datasets is tantamount to mixing the randomization model with the population model, we do so only in order to create a sensible criterion. The assumption of random $Z$ is not used in any way during inference (Section 2.3). Since $Z$ is exclusively a useful temporary mathematical device, we refer to it as the "faux measure". To obtain tractable results, we make the standard regression assumptions: we assume $Z$ is mean centered (Assumption 2.1) and the n subjects' unobserved covariates are independent (Assumption 2.2). And, only to make our expressions simpler, we assume homoskedasticity in $Z$ (Assumption 2.3).

Assumption 2.1 (Faux Measure Mean Centeredness). $\mathbb{E}[Z \mid X, \beta, w] = 0_n$, i.e. linearity in the observed covariates.

Assumption 2.2 (Faux Measure Independence). Var [Z | X] is diagonal.

Assumption 2.3 (Faux Measure Homoskedasticity). $\mathrm{Var}[Z_i \mid X] = \sigma_z^2$ for all i.

In Appendix A.4, we derive the mean criterion for $\hat{\beta}_T^{DM}$,

$$\mathbb{E}_{z}\left[\mathrm{MSE}_{w}[\hat{\beta}_T^{DM} \mid z, X, \beta]\right] = \frac{\sigma_z^2}{n} + \frac{1}{n^2}\, \beta^\top X^\top \Sigma_W X \beta \quad\quad (5)$$

and in Appendix A.5 we derive the mean criterion for $\hat{\beta}_T^{LR}$ approximated to the third order as

$$\mathbb{E}_{z}\left[\mathrm{MSE}_{w}[\hat{\beta}_T^{LR} \mid z, X, \beta]\right] \approx \frac{\sigma_z^2}{n} + \frac{\sigma_z^2}{n^2}\, \mathrm{tr}\left[X_\perp^\top \Sigma_W X_\perp\right] \quad\quad (6)$$

where we denote $X_\perp$ as the orthogonalized $X$ matrix with columns $x_{\perp \cdot 1}, \ldots, x_{\perp \cdot p}$.

The optimal design for $\hat{\beta}_T^{DM}$ would be the single vector $w_*$ that minimizes $(w^\top X\beta)^2$. Note that this assumes knowledge of $\beta$, which is unknown in practice. Thus we leave this design problem for future work. The optimal design for $\hat{\beta}_T^{LR}$ would be the single vector $w_*$ that minimizes $\sum_{j=1}^p (w^\top x_{\perp \cdot j})^2$, which does not depend upon $\beta$. This is similar to the optimal design under the population model setting as explained in Appendix A.1, which makes sense since we have now made the population model-like assumptions (2.1, 2.2, 2.3).

Since there are an exponential number of $w$ vectors, the corresponding rerandomization threshold will be $a^* \approx 0$. Finding this vector is practically impossible as there is no known polynomial-time algorithm. Thus, this is not a practical way of selecting a rerandomization threshold. Even if it were, there would be inferential complications with such a deterministic design that are explained in Section 2.3.

2.1.3. The Tail Unobserved Design

The supremum criterion is too conservative and the mean criterion does not incorporate the ruinous effect of vectors that can be near the supremum. Following Kapelner et al. (2020, Section 2.2.6), we consider a tail criterion for (C4) that we employ for the rest of this paper: the qth quantile of the MSE of the estimator, i.e. $\mathrm{Quantile}_{z}[\mathrm{MSE}_{w}[\hat{\beta}_T \mid z, X, \beta], q]$. This criterion gauges the average experimental error at the worst $100(1-q)$ percent of $z$'s. For example, setting q = 95% would create a criterion that considers the "5% worst experiments". As q increases, the practitioner takes out insurance on unobserved covariates that are more and more improbable. As q → 100%, the only way to insure against every event is more randomness in $W$ (and more imbalance), as explained in Section 2.1.1. As q decreases towards the quantile corresponding to the mean, minimizing imbalance in $W$ becomes more important at the expense of less randomization, as seen in Section 2.1.2. Hence the tail criterion provides a tradeoff between the two competing interests of optimality and randomization. We will understand the tradeoff explicitly in Section 2.2.2, but we first describe the algorithm.

2.2. Algorithms to Optimize the Tail Criterion

Our Algorithm 1 (see Appendix B) is an exhaustive search to locate a minimal MSE design for $\hat{\beta}_T^{LR}$. An analogous algorithm to locate the minimal MSE design for $\hat{\beta}_T^{DM}$ we leave to future work as it depends on $\beta$, an unknown quantity. We first outline two inputs to the algorithm, besides the raw data $X$, that are required.

The algorithm takes as input a collection of assignments $\mathcal{W}_{base} := \{w_1, \ldots, w_S\}$ where S is large, so that $\mathcal{W}_{base}$ loosely approximates the full space $\mathcal{W}$. Since "rerandomization is simply a tool that allows us to draw from some predetermined set of acceptable randomizations" (Morgan and Rubin, 2012, page 1267), there is no theory we know of that specifies $\mathcal{W}_{base}$. Thus, following Morgan and Rubin (2012) and Li et al. (2018), we consider the initial set to be S draws from BCRD. In practice, this decision can limit our algorithm's performance in certain situations that will become apparent in the simulation study in Section 3; strategies to mitigate these limitations are discussed in Section 4.

We then sort the assignments based on imbalance in $X$. The imbalance calculation requires the second input: a metric of imbalance, of which our algorithm is agnostic. An imbalance metric B takes $X$ and $w$ as input and outputs a non-negative scalar. An output of zero indicates perfect balance between the treatment and control groups. One such metric is the Mahalanobis distance, a collinearity-adjusted squared sum of average differences in each dimension. This metric has nice properties; for instance, it is an "affinely invariant scalar measure" (Morgan and Rubin, 2012, page 1271). In our notation, it can be compactly expressed as

$$B(X, w) = \frac{1}{n}\, w^\top P w. \quad\quad (7)$$

We discuss other choices of metrics in Section 4. Let $w_{(1)}$ be the vector with the smallest imbalance in the set and $w_{(S)}$ be the largest, i.e. $w_{(S)} := \mathrm{argmax}_{w \in \mathcal{W}_{base}}\{B(X, w)\}$. The closest vector in our small initial subset $\mathcal{W}_{base}$ to $w_*$, the assignment that truly minimizes imbalance over the exponential number of theoretical vectors, will be $w_{(1)}$.

Beginning with $W \sim \{w_{(1)}$ w.p. 1$\}$, we compute Q, the quantile tail criterion at q, for both the estimators. In order to calculate the quantile, we require the inverse CDF in $z$ for Equations 3 and 4. In general the closed forms of the density functions are unknown asymptotically (see e.g. Götze and Tikhomirov, 2002). For both estimators, we present three procedures in the next three sections, but first we finish explaining the algorithm. We then proceed to the strategy $W \sim \{w_{(1)}$ w.p. 1/2 and $w_{(2)}$ w.p. 1/2$\}$ and recalculate Q, and then $W \sim \{w_{(1)}$ w.p. 1/3, $w_{(2)}$ w.p. 1/3 and $w_{(3)}$ w.p. 1/3$\}$ and recalculate Q. We repeat this procedure until the Sth iteration where $W \sim U(\mathcal{W}_{base})$, i.e. the uniform discrete distribution which constitutes the finite approximation of BCRD. Over all iterations, there is a minimum quantile $Q^*$ corresponding to an iteration number $s^* \leq S$, defining the approximate optimal strategy $\mathcal{W}_* = \{w_{(1)}, \ldots, w_{(s^*)}\}$ and optimal threshold $a^*$ corresponding to the imbalance $B(X, w_{(s^*)})$.

Our exhaustive search algorithm is computationally intensive but can be approximated by skip-stepping through the S iterations at only a tiny loss of resolution. Also, from our empirical experience with hundreds of plot illustrations of $Q(a^*)$, we conjecture that the criterion is convex in s but leave a proof for further work. We provide a golden-section type line-search method, which performs exponentially faster, for those who wish to rely on this conjecture.
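The following hedged base-R sketch renders the exhaustive search just described; tail_Q stands in for one of the three tail-computation procedures of Algorithm 2, and all names here are our own illustrations, not the package's API.

# Sketch of the exhaustive search of Algorithm 1: sort the S base
# assignments by imbalance, then scan the nested candidate designs
# {w_(1)}, {w_(1), w_(2)}, ... for the one minimizing the tail criterion
optimal_rerandomization <- function(X, W_base, tail_Q, q = 0.95, skip = 10) {
  P <- X %*% solve(crossprod(X), t(X))
  imbalance <- apply(W_base, 1, function(w) c(w %*% P %*% w) / nrow(X))
  W_sorted <- W_base[order(imbalance), ]    # rows w_(1), ..., w_(S)
  s_grid <- seq(2, nrow(W_base), by = skip) # skip-stepping for speed; s = 1 omitted
  Q <- sapply(s_grid, function(s) tail_Q(X, W_sorted[1:s, ], q))
  s_star <- s_grid[which.min(Q)]
  list(W_star = W_sorted[1:s_star, ],       # the optimal design's support
       a_star = sort(imbalance)[s_star])    # the optimal threshold a*
}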

We now discuss three strategies to compute Q based on assumptions about the measure on $Z$. Algorithm 1 thus takes an argument that specifies one of three different tail computation functions found in Algorithm 2 for $\hat{\beta}_T^{LR}$ (see Appendix B). These three we offer for convenience, in order of the degree of practitioner knowledge. The first procedure (Section 2.2.1) uses the usual OLS assumption of normality. The second procedure (Section 2.2.2) allows the practitioner the flexibility to specify a departure from normality (summed up by an estimate of the excess kurtosis). This is useful if the practitioner happens to have prior knowledge about the nature of $Z$. This use case would be rare, but it renders the software package more general and hence more powerful. The third procedure (Section 2.2.3) requires full knowledge of the $Z$ distribution. This full knowledge is completely unrealistic but may be useful in certain physical experimental settings and is also useful for comparison purposes during simulation to understand sensitivity. As we provide the practitioner with options, it is important to provide use-case recommendations and we do so in Section 4.1.

2.2.1. Assume Normality and Use a CDF Approximation

If, in addition to assuming independence and homoskedasticity (Assumptions 2.2 and 2.3), we make the assumption that $Z$ is Gaussian, then the distribution of the quadratic form of the MSE of $\hat{\beta}_T^{LR}$ (Equation 4) is known explicitly. Since the optimal threshold $a^*$ will be invariant to shifts or scales of the MSE, we assume a standard Gaussian. Since the determining matrix is positive definite and symmetric, the distribution of this quadratic form is a standard result: a weighted sum of standard chi-squared distributions,

$$\mathrm{MSE}_{w,z}[\hat{\beta}_T^{LR} \mid X, \beta] \approx Z^\top \left( G + \frac{2}{n} D \right) Z \sim \sum_{i=1}^n \lambda_i \chi_1^2 \quad\quad (8)$$

where the $\lambda_i$'s are the eigenvalues of $G + (2/n)D$. Quantiles of this distribution are unknown in closed form but can be computed by numerical integration using the characteristic function. Fast approximations have been studied since the 1930's. Bodenham and Adams (2016) compared many approximations on accuracy and computation time and recommend the Hall-Buckley-Eagleson method (Buckley and Eagleson, 1988), which has error that is $O(n^{-1})$. This is the approximation used in our R package.
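For intuition, here is a hedged sketch of a three-moment (cumulant-matching) chi-squared quantile approximation in the spirit of Hall-Buckley-Eagleson; the package's actual implementation may differ in detail.

# Sketch: approximate the q-th quantile of sum_i lambda_i * chisq_1 by
# matching its first three cumulants to a shifted, scaled chi-squared
weighted_chisq_quantile <- function(lambda, q) {
  k1 <- sum(lambda)        # cumulants of the weighted sum
  k2 <- 2 * sum(lambda^2)
  k3 <- 8 * sum(lambda^3)
  nu <- 8 * k2^3 / k3^2    # d.o.f. matching the skewness sqrt(8/nu)
  k1 + sqrt(k2 / (2 * nu)) * (qchisq(q, df = nu) - nu)
}

# e.g. lambda <- eigen(G + (2 / n) * D, symmetric = TRUE)$values as in Equation 8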

2.2.2. Assume an Excess Kurtosis and Use a Double Approximation

The previous strategies are weak in that they require an assumed distribution of $Z$. However, any quantile can be expressed as

$$Q := \mathbb{E}_{z}\left[\mathrm{MSE}_{w}[\hat{\beta}_T \mid z, X, \beta]\right] + c \times \mathrm{SE}_{z}\left[\mathrm{MSE}_{w}[\hat{\beta}_T \mid z, X, \beta]\right] \quad\quad (9)$$

where q is one-to-one with c given $\Sigma_W$ and $\mathrm{SE}[\cdot]$ denotes the standard error of an expression with respect to the subscripted random variable. For example, if

q = 95% and the $\mathrm{MSE}_{w}[\hat{\beta}_T \mid z, X, \beta]$ is normal, c = 1.65, the Gaussian quantile. The MSE for both estimators is conjectured to be asymptotically of the form of Equation 8, which is well-approximated by the normal distribution as long as (1) there are many non-zero eigenvalues relative to n and (2) these eigenvalues are fairly uniformly distributed. These assumptions will be true as long as $W$ remains fairly random, i.e. this approximation will work for rerandomization thresholds that are not too small and thus maintain randomness in $W$. The simulations of Kapelner et al. (2020, Section 3), as well as much unshown simulation by its authors, demonstrate that the Gaussian quantiles are approximately correct in many situations. This is the approximation employed in this section and in our R package.

In order to derive the standard error of the MSE for both expressions, we need to assume that the $Z_i$'s have a finite fourth moment and that the third and fourth moments are independent of $X$.

Assumption 2.4 (Faux Measure has Finite Fourth Moment). $\mathbb{E}[Z_i^4] < \infty$ for all i.

Assumption 2.5 (Faux Measure Independence of Higher Moments). $\mathbb{E}[Z_i^3]$ and $\mathbb{E}[Z_i^4]$ are independent of $X$ for all i.

Although we do not have a procedure for finding the optimal threshold for $\hat{\beta}_T^{DM}$ due to its dependence on $\beta$, we derive the standard error of its MSE expression in order to gain intuition for the estimator we study later. In Appendix A.7, we derive the standard error for the MSE of the difference-in-means estimator,

$$\mathrm{SE}_{z}\left[\mathrm{MSE}_{w}[\hat{\beta}_T^{DM} \mid z, X, \beta]\right] = \frac{\sigma_z^2}{n^2} \left( n\kappa_z + 2\,||\Sigma_W||_F^2 + \frac{4}{\sigma_z^2}\, \beta^\top X^\top \Sigma_W^2 X \beta \right)^{1/2} \quad\quad (10)$$

where $\kappa_z$ is the excess kurtosis in $Z$ and $||\cdot||_F$ denotes the Frobenius norm. Putting Equations 5 and 10 together, and recalling that the optimal threshold $a^*$ will be invariant to shifts or scales of the MSE, we derive the proportional tail criterion objective function in Appendix A.7 and denote it $Q'$,

$$Q'_{\hat{\beta}_T^{DM}} = \underbrace{\beta^\top X^\top \Sigma_W X \beta}_{BAL_1} + c\, \sigma_z^2 \left( \underbrace{n\kappa_z + 2\,||\Sigma_W||_F^2}_{RAND} + \frac{4}{\sigma_z^2} \underbrace{\beta^\top X^\top \Sigma_W^2 X \beta}_{BAL_2} \right)^{1/2}. \quad\quad (11)$$

The quantities in grey (the constants c, $\sigma_z^2$, n and $\kappa_z$) are independent of the choice of design given the normal approximation. The BAL terms measure observed imbalance. Over the range of possible $a^*$ values, the $BAL_1$ term ranges from O(n) in the case of BCRD down to $O(n^3 2^{-n})$, i.e. effectively zero, in the case of the deterministic minimal imbalance (Kallus, 2018, Section 3.3). For BCRD, $BAL_2 = n/(n-1)\, BAL_1 = O(n)$. We have the general upper bound $BAL_2 \leq n\, BAL_1$. The RAND term measures design randomness; over the range of possible $a^*$ values, the RAND term ranges from approximately n in the case of BCRD all the way to $n^2$ in the case of the deterministic minimal imbalance. The tradeoff between imbalance and randomness is now clearly seen in the tail criterion $Q'$.

In Appendix A.9, we derive an approximation for the standard error of the MSE for the linear regression estimator,

$$\mathrm{SE}_{z}\left[\mathrm{MSE}_{w}[\hat{\beta}_T^{LR} \mid z, X, \beta]\right] \approx \frac{\sigma_z^2}{n^2} \left( 2\,||(I-P)\Sigma_W||_F^2 + \frac{8}{n}\, \mathrm{tr}[GD] + \frac{8}{n^2}\, \mathrm{tr}\left[D^2\right] + \kappa_z SS \right)^{1/2} \quad\quad (12)$$

where $SS := \sum_{i=1}^n (g_{i,i} + 2d_{i,i}/n)^2$. Putting Equations 6 and 12 together and dropping additive and multiplicative constants, we find an approximate proportional tail criterion objective function,

$$Q'_{\hat{\beta}_T^{LR}} \approx \underbrace{\mathrm{tr}\left[X_\perp^\top \Sigma_W X_\perp\right]}_{BAL} + c \left( \underbrace{2\,||(I-P)\Sigma_W||_F^2}_{RAND^2} + 8 \underbrace{\frac{\mathrm{tr}[GD]}{n}}_{\leq RAND \times BAL} + 8 \underbrace{\frac{\mathrm{tr}\left[D^2\right]}{n^2}}_{\leq BAL^2} + \kappa_z SS \right)^{1/2}. \quad\quad (13)$$

Once again, quantities in grey (here the constants c and $\kappa_z$) are independent of the choice of design, and the tradeoff between imbalance and randomness (see footnote 5) is clearly observed as in the case for $\hat{\beta}_T^{DM}$. The inequalities underbraced above are proven in Appendix A.9.
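Continuing the earlier sketch (compute_G_D above), the criterion of Equation 13 can be evaluated directly; the helper below is our own hedged illustration, with κz = 0 by default.

# Sketch: the approximate tail criterion Q' of Equation 13 for a candidate
# design given by the rows of W_mat (uses compute_G_D from the earlier sketch)
Q_prime_LR <- function(X, W_mat, kappa_z = 0, c = qnorm(0.95)) {
  n <- nrow(X)
  m <- compute_G_D(X, W_mat)
  X_perp <- qr.Q(qr(X))                                # orthogonalized X
  BAL <- sum(diag(t(X_perp) %*% m$Sigma_W %*% X_perp))
  RAND2 <- 2 * norm((diag(n) - m$P) %*% m$Sigma_W, "F")^2
  SS <- sum((diag(m$G) + 2 * diag(m$D) / n)^2)
  BAL + c * sqrt(RAND2 + 8 * sum(diag(m$G %*% m$D)) / n +
                   8 * sum(diag(m$D %*% m$D)) / n^2 + kappa_z * SS)
}

# usage with the Algorithm 1 sketch:
# tail_Q <- function(X, W, q) Q_prime_LR(X, W, c = qnorm(q))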

2.2.3. Provide a Distribution Explicitly

If the distribution of $Z$ is provided explicitly, then in each iteration of Algorithm 1, the empirical quantile q of the MSE of $\hat{\beta}_T^{LR}$ (Equation 4) can be approximated over many draws of $z$. This will not only be computationally slow, but the average MSE at each $a$ will be noisy, and the criterion will have to be smoothed, e.g. via cubic smoothing splines (Hastie and Tibshirani, 1990, Chapter 2.2). The optimal threshold $a^*$ will be invariant to shifts or scales of the MSE and thus the $Z$ distribution can be standardized without changing the result. Note that assuming independence and homoskedasticity (Assumptions 2.2 and 2.3) is unnecessary in this section.

⁵Note that the naive estimator for $||(I-P)\Sigma_W||_F^2$, i.e. setting $\Sigma_W$ equal to the estimate $\hat{\Sigma}_W = \frac{1}{S}\sum_{s=1}^S w_s w_s^\top$ and then adding all the squared entries of $(I-P)\hat{\Sigma}_W$, is biased upwards by Jensen's inequality. If this naive estimate were to be used in our computations, it would erroneously place more weight on the RAND term, pushing the optimal design more towards BCRD and thus increasing $a^*$. To mitigate this, we derive a modified estimator computed by de-biasing the naive estimator (see Appendix A.11). This de-biased estimator we found to be unstable at small S. Future work will investigate this phenomenon further. Since the bias in our current expression is only $O(S^{-1})$, our designs will be conservative only for small S, which is irrelevant as we recommend S to be in the tens of thousands when using our procedures.

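A hedged sketch of this exact-distribution procedure (again reusing compute_G_D; draw_z is any standardized sampler the practitioner supplies):

# Sketch: estimate the q-th quantile of the MSE of Equation 4 by Monte
# Carlo over draws from an explicitly provided (standardized) Z distribution
exact_tail_Q <- function(X, W_mat, q = 0.95, draw_z = rnorm, n_draws = 1000) {
  n <- nrow(X)
  m <- compute_G_D(X, W_mat)
  M <- (m$G + (2 / n) * m$D) / n^2     # determining matrix of Equation 4
  mse <- replicate(n_draws, {
    z <- draw_z(n)
    c(z %*% M %*% z)                   # one realization of the quadratic form
  })
  unname(quantile(mse, q))             # empirical tail criterion
}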

2.3. Inference in Our Designs

To run the experiment, we first draw one $w_{exp}$ from our optimal-threshold rerandomization strategy $W$ and these treatments are administered to the subjects. Then, responses for the treatment group $Y_{T,1}, \ldots, Y_{T,n/2}$ and for the control group $Y_{C,1}, \ldots, Y_{C,n/2}$ are collected. Using these responses, we wish to test theories about $\beta_T$ and produce a confidence interval for $\beta_T$.

Under the randomization model, the only randomness is in $w$ and thus a randomization test is most appropriate; this was Fisher's "reasoned basis" for inference. The randomization test "can incorporate whatever rerandomization procedure was used, will preserve the significance level of the test and works for any estimator" (Morgan and Rubin, 2012, p. 1268). This test requires a sharp null hypothesis whereby all subjects' responses will be exactly equal under both treatment and control conditions.

First, using the experimental $w_{exp}$ we compute $\hat{\beta}_{exp}$ via linear regression. A null distribution can then be created by using every other $w \in \mathcal{W}_*$, the curated set of assignments thresholded by the optimal $a^*$, and computing the same estimate (since these estimates are computed by labeling responses to treatment and control in a way unrelated to the design used in the actual experiment). "One should analyze as one designs" (Rosenberger and Lachin, 2016, p. 105), so replicates can only be drawn from the same restricted allocation set where $w_{exp}$ originated. If $\mathcal{W}_*$ is large, the null distribution can be approximated with $R < |\mathcal{W}_*|$ draws from $\mathcal{W}_*$, where the value of R controls the precision of the p-value. Each of the R replicates involves drawing one $w \in \mathcal{W}_*$ and recomputing the estimate. A properly-sized α-level two-sided test can be run by assessing whether the experimental $\hat{\beta}_{exp}$ falls in the retainment region bounded by the α/2 and the 1−α/2 quantiles of the replicate set $\hat{\beta}_1, \ldots, \hat{\beta}_R$. By the duality of confidence intervals and hypothesis tests, one can alter the sharp null by adding (or subtracting) δ and rerun the test. The set of values of δ where the test does not reject constitutes an approximate 1−α confidence interval for the average treatment effect. This is computationally expensive, but efficient algorithms exist (e.g. Garthwaite, 1996).

If the optimal threshold $a^*$ is small, our design "must leave enough acceptable randomizations to perform a randomization test" (Morgan and Rubin, 2012, page 1268) and there may not be enough assignments out of the total S that satisfy this threshold. The smallest imbalance values are in the left tail of the distribution $\{B(X, w) : w \in \mathcal{W}_{base}\}$, which are $O_p(p S^{-2/p})$ in the case of $X$ being sampled from an iid Gaussian generating process and B being the Mahalanobis distance. Kallus (2018, Section 3.3) proved that the imbalance for the optimal assignment in this case is $B(X, w_*) = O(n 2^{-2n})$. Thus, BCRD will not be able to locate vectors far into this left tail in a reasonable amount of time.
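A hedged base-R sketch of this randomization test (function names are ours; by Property 2.1 the null distribution is symmetric, so a two-sided p-value can equivalently be computed by the absolute-value comparison below):

# Sketch: randomization test using the curated allocation set W_star
# (rows are acceptable assignments); the estimator is the OLS slope on w
randomization_test <- function(y, X, w_exp, W_star, R = 1000) {
  est <- function(w) coef(lm(y ~ w + X))["w"]
  beta_exp <- est(w_exp)
  # replicates drawn from the same restricted allocation set
  idx <- sample(nrow(W_star), R, replace = TRUE)
  beta_null <- apply(W_star[idx, ], 1, est)
  # two-sided p-value under the sharp null (symmetric by the mirror property)
  mean(abs(beta_null) >= abs(beta_exp))
}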

Additionally, estimating the terms in our criterion $Q'$ of Equation 13 may suffer from smaller and smaller sets of assignments. The RAND term, for instance, relies on an accurate estimate of a function of the $n \times n$ variance-covariance matrix $\Sigma_W$ and the $n \times n$ matrix D. These matrices have $O(n^2)$ parameters and require many samples for high accuracy and a bias correction.

3. Simulations for the Linear Regression Estimator

3.1. The Three Strategies in Different Settings

We first simulate a case with n = 100 and p = 10 where the entries of $X$ are standard normal realizations. We begin with $\mathcal{W}_{base}$ of size S = 25,000 sampled from BCRD. The baseline imbalances are given in Figure 1. It is the job of our algorithms to threshold this distribution at $a^*$ to optimize the tail of estimator MSE.


Figure 1: Imbalances in Wbase as measured by the Mahalanobis distance save constants.

For the linear estimator, we run Algorithm 1 to find the optimal threshold for many settings and plot the 95% tail criterion over $a$ in Figure 2. We simulate three settings where the faux measure is normal: OPT-chisq denotes the normal approximation of Section 2.2.1, OPT-tail-kappa0 denotes the tail approximation of Section 2.2.2 with $\kappa_z = 0$ and OPT-exact-Normal denotes the exact strategy of Section 2.2.3 simulated with 1,000 standard normal realizations for each subset $\mathcal{W}'$ and smoothed. We also simulate three settings where the faux measure is leptokurtic: OPT-exact-Laplace denotes the exact strategy for $Z$ being the standard Laplace distribution simulated with 1,000 realizations for each subset $\mathcal{W}'$ and smoothed, OPT-exact-T6 denotes the exact strategy for $Z$ being the Student's t distribution with six degrees of freedom with 1,000 realizations for each subset $\mathcal{W}'$ and smoothed and OPT-tail-kappa3 denotes the tail approximation with $\kappa_z = 3$. Both the Laplace distribution and the Student's t distribution with six degrees of freedom have excess kurtosis of 3. Thus, OPT-tail-kappa3 sets $\kappa_z = 3$ to test our tail approximation procedure's ability to approximate this degree of leptokurtosis.

There are many observations to be made from this simulation. First, note that for all plots of exact distributions the criterion is essentially flat until a < 0.20.


Figure 2: Relative tail criterion (Q′ save additive and multiplicative constants) for q = 95% for many settings and strategies plotted versus a on a log scale. The vertical line indicates the optimal threshold a∗ for each strategy. For the purpose of illustration, all tail criterions were set to 1 at their 99%ile values so that the six curves coincide in order to clearly visualize differences.

At these thresholds, there is too much restriction, which will be detrimental. Any design with threshold a > 0.20 (a large range of potential thresholds) will be about equal and will not affect tail MSE performance, as will be shown in Section 3.2. Second, the $a^*$ values returned by all six procedure-design settings are about the same and hence the number of vectors $|\mathcal{W}_*|$ will also be about the same. Their values can be read from the vertical lines in Figure 2 and are as follows: OPT-chisq has $a^* = 0.271$, OPT-tail-kappa0 has $a^* = 0.266$, OPT-exact-Normal has $a^* = 0.289$, OPT-exact-Laplace has $a^* = 0.255$, OPT-exact-T6 has $a^* = 0.245$ and OPT-tail-kappa3 has $a^* = 0.232$. This indicates a great degree of robustness to the assumptions on $Z$ and choice of procedure, implying unambiguous advice to a practitioner (see Section 4.1).

3.2. Visualizing the Optimal MSE Tail

We now simulate response models of Equation 2 under different scenarios to demonstrate that our algorithm finds the optimal threshold for a tail quantile. We retain the same $X$ from before with n = 100 and p = 10. To draw responses, we set $\beta_T = 1$ and draw $\beta$ from $\mathcal{N}_p(0_p, I_p)$. The entire simulation is fixed upon these values. We then draw 5,000 different $z$'s from $\mathcal{N}_n(0_n, \sigma_z^2 I_n)$ where $\sigma_z^2 = 3$ so that $R_X^2 \approx 26\%$, emulating a typical clinical case where the covariates are only modestly explanatory of the outcome measure.

We employ seven design strategies: BCRD, balanced complete randomization with S = 25,000 vectors; DET, the deterministic design of the best vector out of the S vectors; GOOD, a naive selection of the 0.5% least imbalanced vectors of all S; BAD, a naive selection of the 0.5% most imbalanced vectors of all S; and our three proposed procedures: OPT-chisq, the optimal design according to the weighted chi-squared approximation algorithm of Section 2.2.1; OPT-tail, the optimal design according to the tail criterion approximation algorithm of Section 2.2.2; and OPT-exact, the optimal design when simulating $Z$'s exact distribution via the algorithm of Section 2.2.3.

For each design and for each of the 5,000 draws of $z$, we draw 1,000 different $w$ vectors from the BCRD, OPT-chisq, OPT-tail and OPT-exact designs, all 0.5%·S vectors in the GOOD and BAD designs, and the one optimal vector $w_{(1)}$ from the DET design. We then compute $y$ via Equation 2 and $\hat{\beta}_T^{LR}$ from $X$, $w$ and $y$, and then we record the squared error $(\hat{\beta}_T^{LR} - \beta_T)^2$. The average over all $w$'s is recorded as an $\mathrm{MSE}_{w}[\hat{\beta}_T^{LR} \mid z, X, \beta]$. Over the 5,000 draws of $z$, the MSE density is estimated. Figure 3a shows the MSE density estimates for all design strategies. Then, the expectation of the MSE over $Z$ is estimated (the vertical solid lines), as well as the 95% empirical quantile, which estimates the tail criterion (the vertical dashed lines).
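The measurement loop just described is compact enough to sketch (a hedged base-R rendering with our own names; runtimes at the paper's settings are substantial):

# Sketch of the MSE measurement loop: for each fixed z, average the OLS
# squared error over draws of w from a design, then study the MSE over z
simulate_mse <- function(X, W_design, beta, beta_T = 1, sigma_z = sqrt(3),
                         n_z = 5000, n_w = 1000) {
  replicate(n_z, {
    z <- rnorm(nrow(X), sd = sigma_z)            # one fixed z
    mean(replicate(n_w, {
      w <- W_design[sample(nrow(W_design), 1), ]
      y <- c(beta_T * w + X %*% beta + z)        # response model, Equation 2
      (coef(lm(y ~ w + X))["w"] - beta_T)^2      # OLS squared error
    }))                                          # = MSE_w for this z
  })
}
# mean MSE: mean(mse_draws); tail criterion: quantile(mse_draws, 0.95)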

Figure 3: Empirical MSE densities for the different design procedures. The solid vertical lines are the average MSE and the dotted vertical lines are the empirical 95% quantiles. Panel (a): all seven designs; panel (b): closeup of the three proposed rerandomization designs in this paper.

There are many observations from this figure. First, as Equation 6 predicts, DET has the lowest expected MSE, shown as the solid yellow line. However, note the shape of the MSE density for DET with its long right tail. The tail shape can be understood by examining the first term in the MSE expression (Equation 4), which can be written as $\frac{1}{n^2}\, z_\perp^\top \Sigma_W z_\perp$ where $z_\perp$ denotes the

component of $z$ orthogonal to the column space of $X$. In the case of DET, the quadratic form becomes $(z_\perp^\top w_{(1)})^2$. This implies that for $z$'s that interact poorly with the single, lowest imbalance allocation $w_{(1)}$, there is a lot of error. The dotted yellow line is its 95%ile, which is more than twice as large as the maximum of the visible tails of the other designs' MSE densities (save BAD).

Second, the OPT-chisq, OPT-tail and OPT-exact designs (the best 5,881, 5,601 and 6,901 vectors out of the original S = 25,000, respectively) provide the best insurance against the 95% worst MSE (the blue, red and purple dotted lines respectively) and have the lowest 95%ile among all the seven strategies. They also perform the best for the mean MSE, only nominally larger than the mean MSE for DET (which again features a long tail of disastrous realizations).

Third, the OPT-chisq, OPT-tail and OPT-exact designs perform similarly (Figure 3b). More important is that OPT-chisq and OPT-tail produce similar answers, as these are the two choices a practitioner has when implementing our proposed designs (we elaborate on recommendations in Section 4.1). One reason why OPT-tail may feature a slight performance penalty in error when compared to the OPT-chisq design is that it is designed to be robust to other distributions besides the Gaussian (meaning it will be closer to BCRD, have a larger rerandomization threshold and more vectors in $\mathcal{W}_*$). Additionally, this procedure has more variability due to the double approximation explained in Section 2.2.2.

Fourth, there is a lot of leeway in finding the optimal design. The GOOD strategy of the top 0.5% assignments performs about the same as all of our OPT designs (but our OPT designs do indeed provide a lower quantile). This is due to the flatness of the tail criterion among designs that are neither too random nor too deterministic (see the discussion below Figure 2 about the range where a > 0.20).

Fifth, BCRD performs somewhat worse, but not terribly, when compared to OPT. This is a testament to the advantage of the OLS estimator: the contribution of a priori imbalance is attenuated by a factor of $1/n^2$, a rapid rate of vanishing (Equation 6). We will see in the next section that OLS cannot rest on its laurels in the case of larger p.

Sixth, the BAD strategy (the worst 0.5% assignments) performs much worse in both the expectation and the tail criterions. This illustrates a case of covariate imbalance that is so lopsided that even OLS cannot fix it.

3.3. Optimal Rerandomization Threshold Dependence on p

We now investigate how the optimal threshold $a^*$ depends on the number of observed covariates p. We fix n = 200 and S = 25,000 and iteratively add more columns to $X$ so that p ∈ {1, 2, ..., 199}, where new column entries are standard normal realizations. For each p, we compute the optimal threshold via the weighted chi-squared approximation algorithm of Section 2.2.1 and record $|\mathcal{W}_*(p)|/S$, i.e. the fraction of vectors in the original set of S that meet this threshold.

Figure 4 shows that the fraction of acceptable vectors seems to be converging to zero when $p < \frac{1}{2}n$. This portion of the illustration is consistent with our intuition.

17 1.0

0.8

0.6 fraction

0.4

0.2 0 50 100 150 200 p

Figure 4: An investigation into the optimal threshold's dependence on the number of covariates: the fraction of base vectors less than a∗ versus the number of covariates p.

As there are more observed covariates, a priori restrictions on the randomization to decrease imbalance become more important to the resultant MSE tail (even when employing the linear adjustment using OLS). A decrease in imbalance means a decrease in the level of randomness and this seems to be well-justified.

The second portion of the illustration (i.e. when $p > \frac{1}{2}n$) displays the fraction of acceptable vectors increasing, and thus Q, the tail of the MSE, must be increasing. As p → n, then $P \to I_n$, implying that $B(X, w) \to 1$ (Equation 7), meaning there is a large imbalance between the treatment groups with very little variability, meaning that all risk is in the unmeasured confounders. To illustrate this point more clearly, $P \to I_n$ implies that $G \to 0_n$ and $D \to n\Sigma_W$ and thus $\mathrm{MSE}_{w}[\hat{\beta}_T^{LR} \mid z, X, \beta] \to \frac{2}{n^2}\, z^\top \Sigma_W z$ (Equation 4). Because there is no longer much variability in $B(X, w)$, the expectation of the MSE over $Z$ (Equation 6) will be relatively unaffected by the choice of $a^*$. Thus, the quantile is almost completely controlled by the variability in the MSE over $Z$. This standard deviation expression (Equation 12) converges to a constant times $(\mathrm{tr}[\Sigma_W^2])^{1/2}$, which is minimized by employing BCRD, i.e. $a^* \to \infty$ (with only a nominal effect of $\kappa_z$). Luckily, most randomized experiments and clinical trials do not feature very many informative control variables relative to the number of observations.

One final point is that the relationship does not appear to be smooth. Theoretically it should be smooth and convex; the observed jaggedness is likely due to the $O(n^{-1})$ numerical error in the Hall-Buckley-Eagleson method.

3.4. Designing an Educational Intervention Experiment

The Tennessee Student Teacher Achievement Ratio (STAR) randomized experiment measured the effect of class size on student standardized test scores (Word et al., 1990). Similar to Kallus et al. (2018, Section 5), we used this data set to validate our methodology using simulation on subsamples.

We focused on two of the experimental conditions: small classes versus regular-sized classes. The outcome metric we used was the average of the reading, math, language and listening standardized aptitude tests. Covariates we

included, and their data types, were student gender (binary, 1 for male), race (binary, 1 for African American), class size (integer), whether they received free lunch (binary), birthdate (continuous), rural home address (binary), suburban home address (binary) and inner city home address (binary), for a total of p = 8 control covariates. The experiment monitored students from kindergarten through third grade. We only used the outcomes in the final year. After dropping students with missing data, we were left with a total of n = 3,685 observations.

We then paralleled the MSE measurement simulation of Section 3.2 for BCRD, DET, OPT-chisq and OPT-tail (with $\kappa_z = 0$). We did not run OPT-exact as this procedure is not realistic: the practitioner would not have knowledge of the exact distribution of $Z$ in a social science experiment setting. As in the previous simulation, in order to measure MSE by averaging squared errors over $w$ as a function of $z$, we need a way to vary both $w$ and $z$. Since the experimental data has one fixed known $w$ and one fixed unknown $z$, we employ a parametric bootstrap. We model the underlying data generating process assuming a response model additive in the conditional expectation, PATE and residuals:

$$y = f(x_{\cdot 1}, \ldots, x_{\cdot 8}) + \beta_T w + z. \quad\quad (14)$$

We estimated $\beta_T = 3.056$ on the whole dataset and this was employed as the baseline truth. We then subtracted $\beta_T$ from the responses of all observations assigned the treatment and added $\beta_T$ to the responses of all observations assigned the control. We then used a Random Forest model (Breiman, 2001) to flexibly fit non-linearities and interactions within $f(x_{\cdot 1}, \ldots, x_{\cdot 8})$, the conditional expectation function, by fitting the $\beta_T$-adjusted responses with the covariates. We recorded the predictions of f(x) for all x, denoted $\hat{f}$, which were computed out-of-bag (in order to not use overfit predictions). We then computed the residuals as $z := y - \hat{f}$.

The parametric bootstrap then works as follows: we sample n rows from $X$ with replacement, along with their corresponding conditional expectation function estimates. We then draw the n residuals independently of the n rows, using a sample with replacement $z_{rep}$ from the set of all $z$. This residual procedure assumes homoskedasticity in the faux measure $Z$, and the experimental subjects are assumed independent. For this fixed $z_{rep}$ we can compare the experimental designs by drawing $w$ from the design, computing the response value via Equation 14, computing the estimate $\hat{\beta}_T^{LR}$, computing the squared error $(\hat{\beta}_T^{LR} - \beta_T)^2$ and averaging these squared errors over the $w$'s to estimate $\mathrm{MSE}_{w}[\hat{\beta}_T^{LR} \mid z, X, \beta]$. The variability in this MSE is then assessed by doing many iterations of these steps for different $z_{rep}$'s. In our simulation, we used n = 100 students and S = 25,000. Figure 5 parallels Figure 3 by illustrating the density of these conditional MSEs over all 1,000 $z_{rep}$'s.
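A hedged sketch of this bootstrap (variable names such as beta_T_hat, w_obs and W_design are ours; the paper's exact code is linked in the Replication note; out-of-bag predictions come from the randomForest package):

# Sketch of the parametric bootstrap: estimate f by random forest with
# out-of-bag predictions, pool the residuals, then resample rows and
# residuals to manufacture new (z_rep, y_rep) pairs via Equation 14
library(randomForest)
y_adj <- y - beta_T_hat * w_obs               # remove the estimated PATE
f_hat <- randomForest(X, y_adj)$predicted     # out-of-bag estimate of f(x)
z_pool <- y_adj - f_hat                       # residual pool for the faux Z

idx <- sample(length(y), 100, replace = TRUE) # a subsample of n = 100 students
X_sub <- X[idx, ]; f_sub <- f_hat[idx]        # designs are then built on X_sub
mse_one_zrep <- function(W_design, n_w = 1000) {
  z_rep <- sample(z_pool, 100, replace = TRUE)
  mean(replicate(n_w, {
    w <- W_design[sample(nrow(W_design), 1), ]
    y_rep <- f_sub + beta_T_hat * w + z_rep   # Equation 14
    (coef(lm(y_rep ~ w + X_sub))["w"] - beta_T_hat)^2
  }))
}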


Figure 5: Empirical MSE densities for the different design procedures. The solid vertical lines are the average MSE and the dotted vertical lines are the empirical 95% quantiles.

Our results indicate that the OPT-chisq and OPT-tail rerandomization designs outperform BCRD in the mean MSE by 4.6% and 6.2% respectively and in the 95% tail of the MSE by 5.7% and 4.9%. The tail is the target of our optimization procedure and thus a performance boost is expected, but the performance gain in the mean MSE comes for free because we employed restricted randomization, providing a boost even with an OLS estimator. Once again DET slightly beats the OPT designs in the mean, but to realize that tiny benefit one pays a big price: the 95% tail MSE is almost 300% larger. Note that other samples of students resulted in illustrations that were not appreciably different.

This simulation also further vindicates the idea that the OPT-chisq and OPT-tail designs perform similarly (the third observation on Figure 3b). This is important because OPT-chisq and OPT-tail are the two choices a practitioner has at the design stage. These results are further expected as the sample size, number of covariates, as well as the $R_X^2$ (which was found to be 22% when averaged over 1,000 samples of size n) parallel the conditions simulated before in Section 3.2. As mentioned before, this gain will be attenuated as sample size increases due to the OLS estimator's ability to adjust covariate imbalance a posteriori, obviating the need to restrict the randomizations to improve imbalance a priori.

4. Discussion

We have introduced an algorithm that can provide an optimal rerandomization threshold for the randomization model when inferring the average treatment effect using the covariate-adjusted linear regression estimator when insurance on high error is sought. The expressions we derived and our simulations confirm that relatively little rerandomization restriction is needed in the prevalent practical setting where p is small relative to a large n. This underscores the robustness of complete randomization coupled with linear adjustment as a

design-estimation approach in experimentation. However, experimental settings with small sample size are likely to benefit from our approach.

One can argue that we have replaced explicitly picking a value of $a^*$ with instead picking a value of q in $\mathrm{Quantile}_{z}[\mathrm{MSE}_{w}[\hat{\beta}_T \mid z, X, \beta], q]$, the quantile in our tail criterion. We would argue that one does not understand the scale of $a^*$, but q is much more interpretable: it is taking out insurance on bad experiments. Further, the range of reasonable q choices is small, likely at most between 80-99%.

Our work leaves many unanswered questions. First, there are other imbalance metrics, and the choice of Mahalanobis distance seems arbitrary, or one of convenience, especially when real-world response models are non-linear. Recently, Kallus (2018, Section 2.3) proved that the optimal imbalance metric is dependent on the choice of norm for the response model. For instance, if the norm is Lipschitz, then the optimal B would capture the deviation from a paired allocation, such as pairwise Mahalanobis distance (ibid, Theorem 4). Krieger et al. (2020) propose a design that uses both rerandomization and matching together. If the norm is the sup norm, then the optimal B is one that captures deviation from an incomplete design (ibid, Theorem 3). If the squared norm is $n^{-1}\beta^\top (X^\top X)^{-1} \beta$, i.e. a measure of the signal in the linear model, then the optimal B is the Mahalanobis distance (ibid, Section 2.3.3). If the response function is represented by a reproducing kernel Hilbert space, then the optimal B would be $w^\top K w$ where K is the $n \times n$ Gram matrix, a matrix whose i, j entries tally the kernel distance from $x_i$ to $x_j$ (ibid, Equation 4.2). These kernel spaces can be infinite in dimension and thereby can model non-parametric response functions. The exponential kernel with kernel distance function $\exp(x_i^\top x_j)$ can be seen as modeling all polynomials (with infinite degree) and the Gaussian kernel with kernel distance function $\exp(-||x_i - x_j||^2)$ is especially flexible (the two examples in ibid, Section 2.4). Since the true model is unknown, one would want to employ a B sufficiently flexible to not suffer poor performance if the model deviates from their arbitrary choice. Thus, the Gaussian kernel is recommended in practice (ibid, Section 6); a small illustrative sketch of this metric appears at the end of this section. Our software contains options for a wide variety of choices for B.

Additionally, as mentioned in Section 2.2, there is no theory we know of that specifies how $\mathcal{W}_{base}$ should be sampled from the full space $\mathcal{W}$. We have employed BCRD here, but there may be other choices, for instance those that provide more vectors with exponentially smaller imbalance. For example, Krieger et al. (2019, Algorithm 1) is a greedy heuristic that begins with BCRD and in each iteration finds the switch of a treatment-control subject pair that is optimal, until no more minimization is possible. This algorithm finds assignments whose imbalance is $O_p(p\, n^{-1-2/p})$. A nice property of this greedy heuristic is that it converges quickly, requiring very few switches, and thus its assignments are nearly as random as BCRD. Thus, including assignments from their algorithm is unlikely to affect the RAND terms but greatly decreases the BAL terms in our criterion $Q'$. Since the imbalance of vectors of BCRD is known, the greedy pair-switching algorithm can be used for stratified sampling for imbalance distributions upper

bounded at small $a^*$. Our algorithm would proceed identically, but early simulation suggests the optimal designs can change drastically. This we believe is the most pressing open question for future research.
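As a hedged base-R illustration of the Gaussian-kernel imbalance metric recommended above (the function name is ours):

# Sketch: the kernel imbalance metric w' K w with a Gaussian kernel,
# K_ij = exp(-||x_i - x_j||^2), as discussed in the text
gaussian_kernel_imbalance <- function(X, w) {
  D2 <- as.matrix(dist(X))^2   # squared pairwise distances ||x_i - x_j||^2
  K <- exp(-D2)                # Gram matrix of the Gaussian kernel
  c(w %*% K %*% w)
}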

4.1. Recommendations for the Practitioner

We provided two procedures to locate $a^*$ based on the assumptions on the $Z$ distribution: the first is to assume normality and the second is to specify a departure from normality. (Our third procedure requires the unrealistic full specification of the $Z$ distribution and thus we omit it in this discussion.) Overall, our procedures seem to have a high degree of robustness: differing assumptions about $Z$ will only result in minor effects on the estimation error. Thus, we believe the first procedure (the normality assumption) should be used in practice unless the practitioner feels strongly that the errors (plus misspecification) have significantly thinner or fatter tails than a Gaussian, in which case the second procedure could be used.

Using the second procedure in place of the first when the normality assumption is justified only incurs a slight cost (as seen when comparing the red and purple vertical lines of Figures 2, 3b and 5). The difference between the two procedures will likely only be large under high $R^2$ settings, which are not the norm, at least in clinical trials. Regardless of the procedure employed or the value of q chosen, we reiterate that our rerandomization design features valid inference (an unbiased estimator and properly sized hypothesis tests) for any value of a (see Section 2.3).

Replication The figures and values given in Section 3 can be duplicated by running the code found in https://github.com/kapelner/OptimalRerandExpDesigns/blob/ master/JSPI_paper_figures_and_data.

Acknowledgements

We thank Nathan Kallus and Uri Shalit for helpful discussions. This research was supported by Grant No. 2018112 from the United States-Israel Binational Science Foundation (BSF).

References

Berk, R., Barnes, G., Ahlman, L., Kurtz, E., 2010. When second best is good enough: A comparison between a true experiment and a regression disconti- nuity quasi-experiment. Journal of Experimental Criminology 6, 191–208. Bertsimas, D., Johnson, M., Kallus, N., 2015. The power of optimization over randomization in designing experiments involving small samples. Operations Research 63, 868–876.

22 Bodenham, D.A., Adams, N.M., 2016. A comparison of efficient approximations for a weighted sum of chi-squared random variables. Statistics and Computing 26, 917–928. Breiman, L., 2001. Random forests. Machine Learning 45, 5–32. Buckley, M.J., Eagleson, G., 1988. An approximation to the distribution of quadratic forms in normal random variables. Australian Journal of Statistics 30, 150–159. Fisher, R.A., 1925. Statistical methods for research workers. Edinburgh Oliver & Boyd. Freedman, D.A., 2008. On regression adjustments to experimental data. Ad- vances in Applied Mathematics 40, 180–193. Garthwaite, P.H., 1996. Confidence intervals from randomization tests. Bio- metrics 52, 1387–1393. G¨otze,F., Tikhomirov, A., 2002. Asymptotic distribution of quadratic forms and applications. Journal of Theoretical Probability 15, 423–475. Harville, D.A., 1975. Experimental randomization: Who needs it? The Ameri- can 29, 27–31. Hastie, T., Tibshirani, R., 1990. Generalized Additive Models. CRC Mono- graphs on Statistics and Applied Probability. first ed., CRC Press. Imbens, G.W., Rubin, D.B., 2015. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press. Kallus, N., 2018. Optimal a priori balance in the design of controlled experi- ments. Journal of the Royal Statistical Society: Series B (Statistical Method- ology) 80, 85–112. Kallus, N., Puli, A.M., Shalit, U., 2018. Removing hidden by exper- imental grounding, in: Advances in Neural Information Processing Systems, pp. 10888–10897. Kapelner, A., Krieger, A.M., Sklar, M., Shalit, U., Azriel, D., 2020. Harmonizing optimized designs with classic randomization in experiments. The American Statistician 0, 1–12. doi:10.1080/00031305.2020.1717619. Kiefer, J., Wolfowitz, J., 1959. Optimum designs in regression problems. The Annals of 30, 271–294. Krieger, A.M., Azriel, D., Kapelner, A., 2019. Nearly random designs with greatly improved balance. Biometrika 106, 695–701. Krieger, A.M., Azriel, D., Kapelner, A., 2020. Better experimental design by hybridizing binary matching with imbalance optimization. arXiv:1604.00698 .

23 Lachin, J.M., 1988. Properties of simple randomization in clinical trials. Con- trolled clinical trials 9, 312–326. Land, A.H., Doig, A.G., 1960. An automatic method of solving discrete pro- gramming problems. Econometrica 28, 497–520. Lehmann, E.L., Casella, G., 1998. Theory of . 2 ed., Springer Verlag. Li, X., Ding, P., Rubin, D.B., 2018. Asymptotic theory of rerandomization in treatment–control experiments. Proceedings of the National Academy of Sciences of the United States of America 115, 9157–9162. Lin, W., 2013. Agnostic notes on regression adjustments to experimental data: Reexamining freedman’s critique. Annals of Applied Statistics 7, 295–318. Morgan, K.L., Rubin, D.B., 2012. Rerandomization to improve covariate bal- ance in experiments. Annals of Statistics 40, 1263–1282. Petersen, K.B., Pedersen, M.S., 2012. The matrix cookbook. URL: http: //www2.imm.dtu.dk/pubdb/p.php?3274. version 20121115. Reichardt, C.S., Gollob, H.F., 1999. Justifying the use and increasing the power of at test for a randomized experiment with a convenience sample. Psycho- logical Methods 4, 117. Rosenberger, W.F., Lachin, J.M., 2016. Randomization in clinical trials: theory and practice. Second ed., John Wiley & Sons. Senn, S., 2013. Seven myths of randomisation in clinical trials. Statistics in Medicine 32, 1439–1450. Smith, K., 1918. On the standard deviations of adjusted and interpolated values of an observed polynomial function and its constants and the guidance they give towards a proper choice of the distribution of observations. Biometrika 12, 1–85. Steinberg, D.M., Hunter, W.G., 1984. Experimental design: review and com- ment. Technometrics 26, 71–97. Student, 1938. Comparison between balanced and random arrangements of field plots. Biometrika 29, 363–378. Teugels, J.L., 1990. Some representations of the multivariate bernoulli and binomial distributions. Journal of Multivariate Analysis 32, 256–268. Word, E.R., Johnston, J., Bain, H.P., Fulton, B.D., Zaharias, J.B., Achilles, C.M., Lintz, M.N., Folger, J., Breda, C., 1990. The State of Tennessee’s Stu- dent/Teacher Achievement Ratio (STAR) Project: Technical Report (1985- 1990). Technical Report. Tennessee Board of Education. Wu, C.F., 1981. On the robustness and efficiency of some randomized designs. Annals of Statistics 9, 1168–1177.

24 Appendix A. Appendix

Appendix A.1. The Population Model Perspective In the population model (Rosenberger and Lachin, 2016, Chapter 6.2), sub- jects in the treatment group are sampled at random from an infinitely-sized superpopulation and thus the responses in the treatment group are considered independent and identically distributed with density fY (y | θ = βT ). Subjects in the control group are sampled analogously from a different superpopulation density fY (y | θ = −βT ) where θ denotes one of many possible parameters. In the simple model of Equation 1, the population model can be created via the assumption that z is a random draw from Z such that z := y − E[Y | x, w] and its entries are independent. The law of iterated expectation yields that the distribution of the Z0s are mean-centered. Each experiment gets one draw from z which is usually referred to as “noise” or “measurement error”. To make our expressions simpler, we also assume homoskedasticity so that each of the Zi’s 2 share variance denoted σz . It is straightforward to demonstrate that the strategy’s mass function P (W ) is ancillary to the likelihood of the responses and thus the randomization pro- cedure can be ignored during estimation (Rosenberger and Lachin, 2016, page 98). ˆDM Table A.1 provides the expression of the two estimators we consider (βT ˆLR and βT ), their expectations, and mean squared errors (MSE). The expectations are taken over the distribution of Z, as this is the source of ran- domness in the response and thus the expressions are conditional on w and z. To simplify the expressions we denote the imbalance in x as Bx :=x ¯T − x¯C and the imbalance in z as Bz :=z ¯T − z¯C where the bar notation denotes sample av- −1 Pn erage and the subscript indicates arm and r := n i=1 xizi, a covariance-like term measuring the relatedness of the observed and unobserved measurements. For further simplicity we let βx = 1 which simplifies our expressions by re- moving constants that merely signify the signal ratio of observed to unobserved covariates.

Table A.1: Estimator properties in the population model.

ˆDM ˆLR βT βT 2 −1 Finite Estimate βT + Bx + Bz βT + Bx + (Bz − rBx)(1 − Bx) Expectation βT + Bx βT 2 2 σz σz 2 −1 Variance n n (1 − Bx) 2 2 σz 2 σz 2 −1 MSE n + Bx n (1 − Bx)

If the optimal design is selected by the MSE, then the best assignment is the one that minimizes Bx, the imbalance in the observed covariate. This is the main justification used to advocate against randomization and instead use deterministic optimal designs (Kiefer and Wolfowitz, 1959; Harville, 1975; Bertsimas et al., 2015; Kallus, 2018). This debate date back to the inception of

25 experimentation with Smith (1918) and a good review of classic works including Kiefer and Wolfowitz (1959) is given by Steinberg and Hunter (1984, Chpater 3). Here, the strategy W is degenerate where P (W = w∗) = 1 for w∗, the MSE- optimal assignment. To find w∗ in practice is more difficult. One can formulate the procedure as a binary integer programming problem. If forced balance (a term defined in Section 1.2) is required, the partitioning problem is NP hard, meaning that there does not exist a known algorithm that can find the optimal allocation in polynomial time. There are approximations that run in polynomial time that are usually close enough for practical purposes, e.g. branch and bound (Land and Doig, 1960). ˆLR Tangentially, we also observe that if the model is linear, then βT is more ˆDM 2 2 2 efficient than βT as its variance is ≈ σz /n + (σz /n)Bx when using two terms in a geometric series approximation6 versus the variance of the simple estimator 2 2 σz /n + Bx. This will become important when we discuss our methodology in sec. 2 Once again, those that assume the population model prefer optimal deter- ministic design.

Appendix A.2. The Randomization Model Perspective In the randomization model (Rosenberger and Lachin, 2016, Chapter 6.3) also called the “Fisher model” or “Neyman model”, the source of the randomness is in the treatment assignments w. “The n subjects are the population of interest; they are not assumed to be randomly drawn from a superpopulation” (Lin, 2013, page 297). Table A.2 provides the expression of the estimator, its expectation, variance and mean squared error (MSE) for both the differences- in-means and linear regression estimators assuming only that w comes from a forced balance procedure. The expectations are taken over the distribution of W , as this is the source of randomness in the response and thus the expressions are conditional on z and x.

Table A.2: Estimator properties in the randomization model.

ˆDM ˆLR βT βT Finite Estimate [same as Table A.1] [same as Table A.1] √ p 4 3 Expectation βT βT + O(1/ n) E(Bx)) + O(E(Bx)) 1 >  2 2 1 >  2 Variance and MSE n2 (x + z) ΣW (x + z) 2E w BxBz + n2 z⊥ΣW z⊥ (1 + O(E(Bx)))

The response model of Equation 1 under the randomization model perspec- tive was studied by Kapelner et al. (2020) and the expressions in Table A.2 can be found therein. To derive these expressions, there was one additional assump- tion placed on W that we will make precise in Section 2: for any assignment w,

6 2 And thus is only appropriate when σz /n ≤ 1.

26 the assignment where the treatment group subjects are swapped for the con- ˆLR trol groups subjects, −w, is equally likely. Also note that for βT , no closed form expressions are available so they are approximated to the third order in a geometric series. The ΣW term denotes Var [W ], the variance-covariance matrix of the strat- egy that features ones along the diagonal and off-diagonals that gauge the covari- ance between subject i’s assignment and subject j’s assignment. z⊥ is defined to be the component of z orthogonal to x. It makes sense that this term is important in the error of the linear regression estimator as the linear regression can only adjust for the component of z it has access to via x. The most striking observation is that the optimal design strategy is not clear from the MSE expressions as they were in the population model perspective. Our response model was deliberately picked to be the most simplistic and our estimators were chosen to be the most popular. Further, if optimal design were to be defined by minimal MSE, it cannot be resolved as z is unknown, a practical problem addressed in Section 2. We note that there is finite sample bias in the linear regression estimator because the regression assumed a response model that omitted the additive z term. However, this bias is small as noted by Freedman (2008) and Lin (2013).

Appendix A.2.1. The Effect of Large Imbalances in the Observed Covariates Running the experiment under a particular unlucky assignment is destructive to both perspectives. This can be seen in our simple model of Equation 1 where the estimator features an additive Bx term. In the population model ˆDM 2 perspective, the MSE of βT suffers an additive penalty of Bx and the MSE ˆLR 2 −1 of βT suffers a multiplicative penalty of (1 − Bx) derived from the specific unlucky assignment (see Table A.1). In the randomization model perspective, ˆDM −2  2 the MSE of βT suffers an additive penalty of n E w Bx and the approximate ˆLR MSE of βT also suffers an additive penalty but it is more difficult to see as it  2 2 is buried in the quartic form E w BxBz (see Table A.2). These penalties are due to the presence of many unlucky assignments in W.

ˆDM Appendix A.3. The MSE for βT This section is heavily adapted from Kapelner et al. (2020, Appendices 6.2– 6.3). ˆDM We first show that the estimator is unbiased i.e. E w[βT | z, X; β] = βT . Throughout this appendix X is considered fixed. Using the model given by Equation 2:

1 [βˆDM | z, X; β] = [w> (β w + Xβ + z) | z, X; β] E w T nE w T 1 = [β w>w] + [w>Xβ | X; β] + [w>z | z] n E w T E w E w 1 = β [w>w] + [w>Xβ | X; β] + [w]>z n T E w E w E w

27 1 = β + [w>Xβ | X; β] + [w]>z T n E w E w > Pn 2 Since wi ∈ {−1, +1} then, w w = i=1 wi = n. The assumption that W is a mirror strategy (Assumption 2.1) implies that E w[w] = 0n and thus

 >  E w w Xβ | X, β = 0 (A.1) since each w cancels out with the summand with −w with shared probability. This unbiasedness implies that the MSE equals the variance,

ˆ ˆDM2 2 Varw[βT | z, X; β] = E w[βT | z, X; β] − βT > 2 2 = E[(w y/n) | z, X; β] − βT 1 h 2 i = w>(β w + Xβ + z) z, X; β − β2 n2 E T T 1 h 2 i = β w>w + w>(Xβ + z) z, X; β − β2 n2 E T T 1 h i = 2nβ w>(Xβ + z) + (w>(Xβ + z))2 z, X; β n2 E T where the last line follows from w>w = n, simplification and canceling out the 2  >  constant βT . By the same arguments of Equation A.1, E w w (x + z) | z, X; β = 0 leaving us with

1 h i = (w>(Xβ + z))2 z, X; β n2 E 1 h i = (Xβ + z)>ww>(Xβ + z) z, X; β n2 E 1 = (Xβ + z)>Σ (Xβ + z). n2 W

ˆDM Appendix A.4. The Expectation of the MSE for βT This section is heavily adapted from Kapelner et al. (2020, Appendix 6.4). We compute

 1  [ SE [βˆ | z, X; β]] = (Xβ + z)>Σ (Xβ + z) | x E z M w T E z n2 W 1 = (Xβ)>Σ Xβ + 2(Xβ)>Σ z + z>Σ z | X; β n2 E z W W W 1 2 1 = β>XT Σ Xβ + β>XT Σ [z | X] + z>Σ z n2 W n2 W E z n2 E z W and note that since E z[z | X] = 0n (by implication of Assumption 2.1), the second term is zero. The third term is the expectation of a quadratic form. By Petersen and Ped- ersen (2012, Equation 318) and assuming homoskedasticity (Assumption 2.3)

28 and the fact that since W is generalized multivariate bernoulli, this implies tr [ΣW ] = n we arrive at

σ2 1 [ SE [βˆ | z, x]] = z + β>X>Σ Xβ. E z M w T n n2 W

ˆLR Appendix A.5. The MSE for βT We let the overall design matrix of the OLS regression be X˜ := [w | X]. By standard OLS theory and Equation 2,

> −1  T −1 T  n w>X   w>y  βˆ = X˜ X˜ X˜ y = T Xw X>X X>y

ˆLR Note that the first entry of the estimator above is βT , what we care about. We use formula given in Petersen and Pedersen (2012, Section 3.2.6) to provide the top row of the inverse matrix above:

" > > −1 # ! 1 −w X(X X)  w>y  w>y − w>P y βˆLR = n−w>P w n−w>P w = T N/A N/A X>y n − w>P w 1 w>(I − P )y = n − w>P w where P := X(X>X)−1X>, i.e. the projection matrix onto the covariates and I is the n × n identity matrix. Now, replacing y with the model, we find:

w>(I − P )(β w + βX + z) βˆLR = T T n − w>P w w>(β w + βX + z) − w>P (β w + βX + z) = T T n − w>P w nβ + w>βX + w>z − β w>P w − w>βX − w>P z = T T n − w>P w β (n − w>P w) + w>z − w>P z = T n − w>P w w>(I − P )z = β + T n − w>P w > Bz − BX Ar = βT + > 1 − BX ABX 0 0 Bz − BX r = βT + 0> 0 1 − BX BX | {z } g where r = X>z/n, a column vector of correlation-like metrics of the p covariates > with z, denote BX = X w/n, a column vector of average differences in the

29 p covariates between treatment and control, A = n(X>X)−1, a symmetric 1 1 0 2 0 2 matrix, BX := A BX , r := A r and Bz is the same as in the univariate case, the average difference in z between treatment and control. Note that 0> 0 BX BX is the Mahalanobis distance between the average covariate values in the treatment group and the average covariate values in the control group. Bz −rBx We can show that g simplifies to 2 in the univariate case (the expres- 1−Bx sion in Table A.2). Note that this estimator is not a function of β! This is a convenient result of covariate adjustment. ˆLR ˆ 2 2 The MSEw[βT | z] is the expectation of (βT − βT ) = g . This expecta- tion is difficult to express in closed form. We instead take a geometric series approximation,

0 0 Bz − BX r 0 0  0 2  0 4 g = 0> 0 = Bz − BX r 1 + BX + O BX 1 − BX BX We can use asymptotic notation to expand g2 and then approximate it

2 2 0 02  0 2  0 4 g = Bz − BX r 1 + BX + O BX

 0 02 0 2 2   0 2 = Bz − BX r + 2 BX Bz 1 + O BX 0 02 0 2 2 ≈ Bz − BX r + 2 BX Bz

ˆ 2 Recall that our target is MSEw[βT | z] = E w[g ]. The expectation of the first term can be written as

h 0 02i h > 2i E w Bz − BX r = E w w (I − P )z  > >  = E w (z(I − P )) ww (I − P )z > = z (I − P )ΣW (I − P )z = z>Gz where G := (I − P )ΣW (I − P ). This term can alternatively be expressed > as z⊥ΣW z⊥ where z⊥ is defined to be the component of z orthogonal to the column space of X. The expectation of the second term is:

h 0 2 2i h 0 2 2i E w 2 BX Bz = 2E w BX Bz  1  = 2 B w>P wB E w z n z 2 = z>ww>P ww>z n3 E w 2 = z>Dz n3

30  > > where D := E w ww P ww , an expectation of a homogeneous quartic form that cannot be simplified further. Putting this all together,

1 2 SE [βˆLR | z] = z>Gz + z>Dz M w T n2 n3 = z>Rz

2 where the determining matrix of the quadratic form is R := G + n D.

ˆLR Appendix A.6. The Expectation of the MSE for βT We find the expectation over z as

h h ii 1 σ2 SE βˆLR | z = z>Rz = z tr [R] E z M w T n2 E z n2 where the equality comes from Assumptions 2.2 and 2.3 and an application of Petersen and Pedersen (2012, Equation 318).

2 tr [R] = tr [G] + tr [D] n 2 = tr [(I − P )Σ (I − P )] + tr  ww>P ww> W n E w 2 = tr [(I − P )Σ ] + tr ww>P ww> W nE w 2 = n − tr [P Σ ] + tr ww>ww>P  W nE w   >  = n − tr [P ΣW ] + 2E w tr ww P  >  = n − tr [P ΣW ] + 2E w w P w = n − tr [P ΣW ] + 2tr [P ΣW ]

= n + tr [P ΣW ]  −1   T  T = n + tr X X X X ΣW

 −1   T  T = n + tr X X X ΣW X

 − 1 − 1   T  2 T  T  2 = n + tr X X X ΣW X X X

1 h 1 T 1 i = n + tr A 2 X Σ XA 2 n W h > i = n + tr X⊥ΣW X⊥ .

− 1  T  2 Let X⊥ := X X X be the orthogonalization of X. Thus,

h h ii σ2 σ2 h i SE βˆLR | z = z + z tr X>Σ X E z M w T n n2 ⊥ W ⊥

31 where the term can more intuitively be expressed as: h i tr X>Σ X = x> Σ x + ... + x> Σ x ⊥ W ⊥ ⊥·1 W ⊥·1 ⊥·p W ⊥·p

h 2 i h 2 i = w B + ... + w B E x⊥·1 E x⊥·p meaning the sum of imbalances squared of the each of the orthogonalized di- mensions of the column space of X.

ˆDM Appendix A.7. The Standard Error of the MSE for βT This section is heavily adapted from Kapelner et al. (2020, Appendix 6.6). Section Appendix A.3 derived the quadratic form of the MSE. The vari- ance with respect to z can be calculated straightforwardly using Petersen and Pedersen (2012, eq. 319) when assuming that the fourth moment of z is finite (Assumption 2.4) and do not depend on X (Assumption 2.5) as

2 σ2   h ˆ i z 2 4 > T 2 > Varz MSEw[βT | z, x] = 4 nκz + 2 ||ΣW ||F + 2 β X ΣW Xβ + γz1n ΣW Xβ , n σz

 3 where κz is the excess kurtosis in Z and γz := E z and we prove in sec. Ap- > pendix A.8 that this last term γz1n ΣW Xβ is zero (regardless of skew) if we assume Appendix A.1 that all assignments are forced balance.

> Assumption Appendix A.1 (Forced Balance). For all w ∈ W, w 1n = 0.

Since the rerandomization strategy in the text employed BCRD as the base strategy, this assumption was implicit throughout the paper.

Appendix A.8. A Proof that the Last MSE Term is Zero This section is heavily adapted from Kapelner et al. (2020, Appendix 6.7). > > > We wish to demonstrate that 1n ΣW Xβ = β X ΣW 1n = 0. By the forced > > balance assumption (Appendix A.1), E w[1n w | X, β] = Varw[1n w | X, β] = 0 > > 2 since every w is balanced. Then, Varw[1n w | X, β] = E w[ 1n w | X, β] = > 1n ΣW 1n = 0. Pn > Note that ΣW = i=1 λivivi where the λ1 ≥ ... ≥ λn ≥ 0 and vi’s are its eigenvalues and eigenvectors respectively. Since ΣW is a variance-covariance ma- trix, it is symmetric implying that its eigenvalues are all non-negative. We can > Pn > 2 > 2 then write 1n ΣW 1n = i=1 λi 1n vi = 0. This means that λi vi 1n = 0 > for all i. In order for this to be true, for every i either λi = 0 or vi 1n = 0. Pn > We now examine just the term ΣW 1n which can be written as i=1 λivivi 1n. > For all i either λi or vi 1n is zero rendering the “middle” vi irrelevant. Thus > > > > ΣW 1n = 0n and β X ΣW 1n = β X 0n = 0.

32 ˆLR Appendix A.9. The Standard Error of the MSE for βT By Assumptions 2.2 and 2.3 and an application of Petersen and Pedersen (2012, Equation 319),

  22 n ! 1 σz X ar z>Rz = 2tr R2 + κ R2 V z n2 n4 z i,i i=1 where κz denotes excess kurtosis and Ri,i are the diagonal entries of R that we cannot simplify further. We evaluate the trace term below:

2  2 2 tr R = G + D n F  2   2  = tr G + D G + D n n  4 4  = tr G2 + GD + D2 n n2 4 4 = tr G2 + tr [GD] + tr D2 n n2 4 4 = tr [(I − P )Σ (I − P )(I − P )Σ (I − P )] + tr [(I − P )Σ (I − P )D] + tr D2 W W n W n2 4 4 = tr [(I − P )Σ (I − P )Σ ] + tr [(I − P )Σ (I − P )D] + tr D2 W W n W n2 4 4 = ||(I − P )Σ ||2 + tr [GD] + tr D2 . W F n n2

Putting this all together, we find that v h h ii 2 u   n ˆ σz u 2 4 4  2 X 2 Ez SEw βT | z, X = t2 ||(I − P )ΣW || + tr [GD] + tr D + κz R . S M n2 F n n2 i,i i=1

We now prove some bounds on some of these terms so we can interpret them more easily:

2  2 2 2 h > i tr D ≤ tr [D] = n tr X⊥ΣW X⊥ where the first inequality is a trace inequality for symmetric matrices and the second equality was shown in App. Appendix A.6. Also,

q q  2  2  2 h > i tr [GD] ≤ tr G tr D ≤ tr [D] tr G = ntr X⊥ΣW X⊥ ||(I − P )ΣW ||F where the first inequality is Cauchy-Schwartz, the second is the trace inequality used above and the equality is then by substitutions.

33 Appendix A.10. Estimating the Squared Frobenius Norm without Bias

Consider a p-dimensional r.v. W where E [W ] = 0n. Its variance-covariance  > 2 iid n × n matrix ΣW = E WW . We wish to estimate θ := ||ΣW ||F given ∼ samples w1,..., wS. The unbiased sample variance-covariance matrix estimator is:

S 1 X Σˆ = w w> W S s s s=1

2 ˆ If we use the naive plug-in estimator for θ i.e. θnaive := Σˆ W then by F Jensen’s inequality,

hˆ i E θnaive > θ unless W is degenerate. We now calculate the bias term below. We denote let ws,j be the jth entry in ws. We note the following identity:

n !2 S X X 2 X ws,jws,k = (ws,jws,k) + ws,jws,kws0,jws0,k s=1 s=1 s6=s0

Taking the expectation of both sides we find:

 S !2 S X X h 2i X E  ws,jws,k  = E (ws,jws,k) + E [ws,jws,kws0,jwis,k] s=1 s=1 s6=s0

 S !2 X h 2i 2 E  ws,jws,k  = SE (ws,jws,k) + S(S − 1) (E [wi,jwi,k]) s=1 | {z }

The underbraced term is the contribution to the expected Frobenius norm squared of the j, kth entry of ΣW which we call Fj,k. Isolating this, we ob- tain:

 2 PS  h 2i E s=1 ws,jws,k − SE (wi,jwi,k) F := ( [w w ])2 = j,k E s,j s,k S(S − 1)

Now using plugin estimators we define:

2 PS  PS 2 s=1 ws,jws,k − s=1 (ws,jws,k) Fˆ := j,k S(S − 1)

34 which is an unbiased estimator for Fj,k since each component is an unbiased estimator. Now, to get the total Frobenius norm estimator which we call θˆ, we sum over all entries:

n n  n n S !2 n n S  X X 1 X X X X X X 2 θˆ = Fˆ = w w − (w w ) j,k S(S − 1)  s,j s,k s,j s,k  j=1 k=1 j=1 k=1 s=1 j=1 k=1 s=1

PS ˆ Note that s=1 ws,jws,k is the j, kth entry of SΣW which means the double sum over its squared is just the Frobenius norm squared of SΣˆ W , i.e.

 n n S  1 X X X 2 θˆ = S2θˆ − (w w ) S(S − 1)  naive s,j s,k  j=1 k=1 s=1

This is as far as we can go for general r.v. W . Specifically for the case where n 2 W is an allocation strategy whose support is {−1, +1} , then (ws,jws,k) = 1 always. Thus we have:

1   S n2 θˆ = S2θˆ − n2S = θˆ − S(S − 1) naive S − 1 naive S − 1

Appendix A.11. Estimating the Squared Frobenius Norm without Bias with a constant multiple 2 Now we wish to do a different problem. We wish to estimate θ := ||(I − P )ΣW ||F iid given ∼ samples w1,..., wn. Note the following identity

2 2 2 ||ΣW ||F = ||(I − P )ΣW ||F + ||P ΣW ||F + 2 h(I − P )ΣW , P ΣW iF where

 >  h > i h(I − P )ΣW , P ΣW iF = tr ((I − P )ΣW ) P ΣW = tr ΣW (I − P )P ΣW = 0 so that

2 2 2 ||(I − P )ΣW ||F = ||ΣW ||F − ||P ΣW ||F and we have an unbiased expression for the first term on the right hand side. Now we work on the second term:

−1 2  −1  2 2  T  T  T  T > ||P ΣW ||F = X X X X ΣW = E X X X X ww F F which means the frobenius norm squared’s j, k entry contribution will be:

35   −1 2  T  T Fj,k = E xj X X X wwk

 −1 Letting A := XT X XT , this can be estimated naively (i.e. biased) via

S !2 1 X Fˆ = x Aw w j,k S j s s,k s=1 To develop a bias correction, we start with the following identity:

S !2 S X X 2 X xjAwsws,k = (xjAwsws,k) + (xjAwsws,k)(xjAws0 ws0,k) s=1 s=1 s6=s0

Taking an expectation

 S !2 " S #   X X 2 X E  xjAwsws,k  = E (xjAwsws,k) + E  (xjAwsws,k)(xjAws0 ws0,k) s=1 s=1 s6=s0

We then have:

" S # h i 1 X 2 S2Fˆ = S (x Aw w ) + S(S − 1)F E j,k E S j s s,k j,k s=1 Re-arranging, this gives us the unbiased estimator:

S S 1 X 2 Fˆ∗ = Fˆ − (x Aw w ) j,k S − 1 j,k S(S − 1) j s s,k s=1 Summing this over j and k, thus an unbiased estimator for the whole frobenius norm of the second term is

S 2 1 n n S ˆ 2 ˆ X X X 2 ||P ΣW ||F = P ΣW − (xjAwsws,k) S − 1 F S(S − 1) j=1 k=1 s=1

2 We notice that ws,k = 1, and hence can collect terms over k:

S 2 n n S ˆ 2 ˆ X X 2 ||P ΣW ||F = P ΣW − (xjAws) S − 1 F S(S − 1) j=1 s=1

The rightmost term can also be re-written as:

36 n S n " S # n n X X 2 n X 1 X n X (x Aw ) = P w wT P T = P Σˆ P T S(S − 1) j s S − 1 j S s s j S − 1 j W j j=1 s=1 j=1 s=1 j=1 n h i n h i = tr P Σˆ P = tr P Σˆ , S − 1 W S − 1 W where P j is the j’th row of P . We apply the bias correction only for S ≥ 30; otherwise, the variance of the estimator is too high for the bias correction to be meaningful.

Appendix B. Algorithms

Below are the algorithms for finding a∗ discussed in the main body of the manuscript.

Algorithm 1 This algorithm returns the optimal rerandomization design W∗ along with the optimal threshold a∗ and the value of the relative tail criterion 0 for the minimum design Q∗. The argument “TailMSE” takes one of the follow- ing functions: TailDistLR for the procedure providing an explicit distribution of Section 2.2.3, TailNormalLR for the procedure assuming normality of Sec- tion 2.2.1 and TailApproxLR for the procedure assuming the approximation of Section 2.2.2 (these functions are found in Algorithm 2). The “...” means other parameters that will be different for each of the tail functions. This procedure is implemented to be faster via caching intermediate values.

procedure OptimalRerandomizationDesign(X, B, Wbase, q, TailMSE, ...) n ←nrow(X) Bs ← {B(X, w): w ∈ Wbase} . Precompute all balances sort sort {Bs , Wbase} ← sort asc(Bs, Wbase) . Sort all balances from smallest to largest and use these sorted indices to sort Wbase 0 sf ← size(Wbase), s∗ ← NULL, Q∗ ← ∞ . Initialize search parameters for s = 1 . . . sf do sort Ws ← Wbase[1 : s] . The brackets indicate the sub-array operator Σ ← 1 P ww> W s w∈ s 0 W Q ← TailMSE(Ws, s, X, n, ΣW , q, ...) 0 0 if Q < Q∗ then 0 0 Q∗ ← Q s∗ ← s end if end for sort sort 0 0 return new HashMap(W∗ = Wbase[s∗ : sf ], a∗ = Bs [s∗], Q∗ = Q∗) end procedure

37 Algorithm 2 This algorithm provides the methods for computing the tail cri- ˆLR terion for βT .

function TailDistLR(Ws, s, X, n, ΣW , q, simZn, NZ ) . Strategy of Section 2.2.3 shared lr 0 Q s ← new array(NZ ) . Empty Array of size for nZ = 1 ...NZ do z ←simZn(n) . A user-provided function 0 > 2  Q s[nZ ] = z G + n D z end for return empirical quantile(Q0s, q) end function

function TailNormalLR(Ws, s, X, n, ΣW , q) . Strategy of Section 2.2.1 shared lr 2  λs ← eigen values of G + n D return Hall Buckley Eagleson Quantile Approx(λs, q) end function

function TailMSEApproxLR(Ws, s, X, n, ΣW , q, κz) . Strategy of Section 2.2.2 shared lr − 1  T  2 X⊥ ← X X X c ← normal quantile(q) return Equation 13 computed end function

shared lr{  −1 P ← X XT X XT

G ← (I − P )ΣW (I − P ) D ← 1 P ww>P ww> s w∈Ws }

38