Causal Inference for Multiple Non-Randomized Treatments using Fractional Factorial Designs ∗

Nicole E. Pashley (Rutgers University) and Marie-Abèle C. Bind (Massachusetts General Hospital)
March 17, 2021

Abstract

We explore a framework for addressing causal questions in an observational setting with multiple treatments. This setting involves attempting to approximate an experiment from observational data. With multiple treatments, this experiment would be a factorial design. However, certain treatment combinations may be so rare that there are no measured outcomes in the observed data corresponding to them. We propose to conceptualize a hypothetical fractional factorial experiment instead of a full factorial experiment and lay out a framework for analysis in this setting. We connect our design-based methods to standard regression methods. We finish by illustrating our approach using biomedical data from the 2003-2004 cycle of the National Health and Nutrition Examination Survey to estimate the effects of four common pesticides on body mass index.

Keywords: potential outcomes; interactions; joint effects; observational studies; multiple treatments; Neymanian inference

arXiv:1905.07596v4 [stat.ME] 15 Mar 2021

∗Email: [email protected]. The authors thank Zach Branson, Kristen Hunter, Kosuke Imai, Xinran Li, Luke Miratrix, and Alice Sommer for their comments and edits. They also thank Donald B. Rubin and Tirthankar Dasgupta for their insights and prior work that made this paper possible. Additionally, they thank members of Marie-Abèle Bind’s research lab and Luke Miratrix’s C.A.R.E.S. lab, as well as participants of STAT 300, the Harvard IQSS Workshop, and the Yale Quantitative Research Methods Workshop for their feedback on this project. Marie-Abèle Bind was supported by the John Harvard Distinguished Science Fellows Program within the FAS Division of Science of Harvard University and is supported by the Office of the Director, National Institutes of Health under Award Number DP5OD021412. Nicole Pashley was supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE1745303. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the National Science Foundation.

1 Introduction

What is the effect of exposing you to pesticide A, compared to no exposure, on your health? What about exposing you to pesticide B? Or exposing you to both pesticides at the same time? To answer these questions, we need to be able to assess the impact, including interactions, of multiple treatments. A factorial experiment involves random assignment of all possible treatment combinations to units and can be used to help understand these different effects.

There is much interest in assessing the effects of multiple treatments, as reflected by the recent growth in the literature regarding the use of factorial designs in causal inference (e.g., Branson et al., 2016; Dasgupta et al., 2015; Dong, 2015; Egami and Imai, 2019; Espinosa et al., 2016; Lu, 2016a,b; Lu and Deng, 2017; Mukerjee et al., 2018; Zhao et al., 2018). However, the literature primarily focuses on randomized experiments, and in some cases (likely including the motivating example) a randomized experiment is infeasible and we must instead rely on observational studies. In this paper, we develop a Neymanian framework to draw causal inferences in observational studies with multiple treatments.

Standard analysis methods for this setting tend to be model-based. In particular, regression models with interaction terms are commonly used in observational studies to estimate the effects of multiple treatments (Bobb et al., 2015; Oulhote et al., 2017; Patel et al., 2010; Valeri et al., 2017). For instance, Bobb et al. (2015) consider a Bayesian kernel machine regression for estimating the health effects of multi-pollutant mixtures. However, the use of regression without a careful design phase, in which one tries to uncover or approximate some underlying experiment, can lead to incorrect conclusions (e.g., see Rubin, 2008).

We thus view the problem of estimating causal effects with multiple treatments through an experimental design perspective. With a single binary treatment (or equivalently a single factor with two levels), conceptualizing observational studies as plausible treatment-control hypothetical randomized experiments has long been a common strategy (Rosenbaum, 2002; Rubin, 2008; Stuart, 2010; Bind and Rubin, 2019).
Approximation of an experiment is achieved via a conceptualization stage and a design stage. In the design stage, treatment groups are balanced with respect to covariates, in an attempt to replicate the balance achieved in a randomized experiment (Rubin, 2008; Imbens and Rubin, 2015; Bind and Rubin, 2019). Covariate balance is important not only for unconfoundedness (Rosenbaum, 2002, Chapter 3), but also to increase the plausibility of assumptions that allow the analysis stage to be implemented with limited model extrapolation (Imbens and Rubin, 2015, Chapter 12). For observational data with multiple contemporaneously applied treatments, a factorial design is a natural choice of experimental design.

There are some extensions of techniques for a single binary treatment to multiple treatments in the literature. For instance, Lopez and Gutman (2017) discuss matching for obtaining causal estimates in observational studies with multiple treatments, although they focus on a single factor with multiple levels, rather than multiple treatments that may be applied contemporaneously. Nilsson (2013) considers matching using the generalized propensity score (GPS) (Hirano and Imbens, 2004) in a $2^2$ factorial setting. However, further exploration of matching and other methods, such as weighting, that attempt to approximate an experiment in observational studies with factorial structure is still needed. We discuss obtaining covariate balance further in Section 5.2.

An important issue when using model-based approaches for multiple treatments in the observational setting, in addition to the usual concerns that we attempt to address in the design phase, is that we may have limited or no data available for some treatment combinations. If a single treatment combination has no measurements, then the recreation of a full factorial design is not possible. When there are only one or two observations for a certain treatment combination, utilizing a factorial design would rely heavily on those few individuals being representative.

Using design-based methods to address the issue of data limitations implies conceptualizing an experimental design that fits the observed data. We propose to embed an observational study into a hypothetical fractional factorial experiment, a design that uses only a subset of the treatment combinations in the full factorial design. We additionally discuss an alternative, more flexible design called the incomplete factorial design, which also uses a subset of the total treatment combinations.
With such data limitations, linear or additive regression models, often used in practice, have implicit assumptions resulting in estimators that are not always transparent, especially with respect to the implicit imputation of the missing potential outcomes. We discuss regression estimates in such settings in Section 4 and discuss how they connect to design-based estimates throughout the paper.

In summary, we discuss the estimation of the causal effects of multiple non-randomized treatments, in particular when we do not observe all treatment combinations. We make a number of contributions. Firstly, we build on the potential outcomes framework to consider causal effects with multiple treatments in observational settings. Secondly, we identify and explore two designs (and their subsequent analyses) that are useful when our observational study lacks data for some treatment combinations. Thirdly, we discuss implications of our estimation strategies in terms of what can be estimated, and compare these to what occurs with regression estimation. Finally, we identify challenges that still need to be addressed in this area.

The paper proceeds as follows: Section 2 reviews full factorial designs within the potential outcomes framework described in Dasgupta et al. (2015). Section 3 reviews extensions of this framework to fractional factorial designs and expands upon current inference results for variance and variance estimation. Section 4 explores the use of incomplete factorial designs. Section 5 examines how to embed an observational study into one of these experimental designs. Connections to regression-based methods are noted in each of the previously mentioned sections. Section 6 illustrates our method, and the challenges of working with observational data with multiple treatments, through an application examining the effects of four pesticides on body mass index (BMI) using data from the National Health and Nutrition Examination Survey (NHANES), which is conducted through the Centers for Disease Control and Prevention (CDC). Section 7 concludes.

2 Full factorial designs

2.1 Set up

We work in the Rubin Causal Model (Holland, 1986), also known as the potential outcomes framework (Splawa-Neyman, 1990; Rubin, 1974). Throughout this paper, we focus on two-level factorial and fractional factorial designs. We closely follow the potential outcomes framework for $2^K$ factorial designs proposed by Dasgupta et al. (2015). We start by reviewing the notation and basic setup for factorial experiments. Let there be $K$ two-level factors; that is, there are $K$ treatments (e.g., medications), each having two levels (e.g., receiving a certain medication or not receiving that medication) that are assigned in combination. This creates $2^K = J$ total possible treatment combinations.

Let $z_j$ denote the $j$th treatment combination. See Table 1 for an example where $K = 3$, which illustrates the notation, with treatment combinations listed in lexicographic order (the standard ordering, used here for consistency). Let $z_{j,k} \in \{-1, +1\}$ be the level of the $k$th factor in the $j$th treatment combination, so $z_j = (z_{j,1}, \ldots, z_{j,K})$. Let there be $n$ units in the sample. The potential outcome for unit $i$ under treatment combination $z_j$ is $Y_i(z_j)$, and the sample average potential outcome under treatment $z_j$ is $\bar{Y}(z_j) = \sum_{i=1}^{n} Y_i(z_j)/n$. The vector of average potential outcomes for all $2^K$ treatment combinations is $\bar{Y} = (\bar{Y}(z_1), \bar{Y}(z_2), \ldots, \bar{Y}(z_J))$.

Let $W_i(z_j) = 1$ if unit $i$ received treatment combination $z_j$ and $W_i(z_j) = 0$ otherwise. We assume the Stable Unit Treatment Value Assumption (SUTVA), that is, there is no interference and no different forms of treatment level (Rubin, 1980). Then, as in Dasgupta et al. (2015), $\sum_{j=1}^{J} W_i(z_j) = 1$ and the observed potential outcome for unit $i$ is

\[
Y_i^{obs} = \sum_{j=1}^{J} W_i(z_j) Y_i(z_j).
\]

Let there be a fixed number, $n_j$, of units randomly assigned to treatment combination $z_j$. That is, we are assuming a (possibly unbalanced) completely randomized design. The observed average outcome for treatment $z_j$, which serves as an estimator of $\bar{Y}(z_j)$, is

\[
\bar{Y}^{obs}(z_j) = \frac{1}{n_j} \sum_{i=1}^{n} W_i(z_j) Y_i(z_j) = \frac{1}{n_j} \sum_{i: W_i(z_j) = 1} Y_i^{obs}.
\]

Denote $\bar{Y}^{obs} = (\bar{Y}^{obs}(z_1), \bar{Y}^{obs}(z_2), \ldots, \bar{Y}^{obs}(z_J))$ as the vector of observed mean potential outcomes for all $2^K$ treatment combinations.

Treatment   Factor 1   Factor 2   Factor 3   Outcomes
$z_1$          -1         -1         -1      $\bar{Y}(z_1)$
$z_2$          -1         -1         +1      $\bar{Y}(z_2)$
$z_3$          -1         +1         -1      $\bar{Y}(z_3)$
$z_4$          -1         +1         +1      $\bar{Y}(z_4)$
$z_5$          +1         -1         -1      $\bar{Y}(z_5)$
$z_6$          +1         -1         +1      $\bar{Y}(z_6)$
$z_7$          +1         +1         -1      $\bar{Y}(z_7)$
$z_8$          +1         +1         +1      $\bar{Y}(z_8)$
Column       $g_1$      $g_2$      $g_3$     $\bar{Y}$

Table 1: Example of a $2^3$ factorial design.

Using the framework from Dasgupta et al. (2015), denote column $j$ of the design matrix as $g_j$, as illustrated in Table 1. Following that paper, we can also define the contrast vector for the two-factor interaction between factors $k$ and $k'$ as

\[
g_{k,k'} = g_k \circ g_{k'},
\]

where $\circ$ indicates the Hadamard (element-wise) product. Similarly, the contrast vector for three-factor interactions is

\[
g_{k,h,i} = g_k \circ g_h \circ g_i,
\]

and all higher order interaction contrast vectors can be calculated analogously.
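The construction of the design matrix and interaction contrasts can be sketched in a few lines of Python with NumPy. This is an illustrative sketch only; the helper names `full_factorial` and `contrast` are hypothetical and do not come from the paper or from any library.

```python
import itertools
import numpy as np

def full_factorial(K):
    """Return the 2^K treatment combinations z_j as rows, in lexicographic order."""
    return np.array(list(itertools.product([-1, 1], repeat=K)))

def contrast(Z, factors):
    """Contrast vector for the given 1-indexed factors.

    For a single factor this is the column g_k; for several factors it is
    the Hadamard (element-wise) product of their columns.
    """
    cols = [f - 1 for f in factors]
    return np.prod(Z[:, cols], axis=1)

Z = full_factorial(3)             # the 2^3 design of Table 1
g12 = contrast(Z, (1, 2))         # g_{1,2} = g_1 o g_2
g123 = contrast(Z, (1, 2, 3))     # g_{1,2,3} = g_1 o g_2 o g_3
print(Z)
print(g12, g123)
```

Using `itertools.product` with the levels ordered as `[-1, 1]` reproduces the lexicographic ordering of Table 1, so row index `j - 1` of `Z` corresponds to $z_j$.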

2.2 Estimands and estimators

Continuing to follow Dasgupta et al. (2015), define the finite population main causal effect for factor $k$ and the finite population interaction effect for factors $k$ and $k'$ as

\[
\tau(k) = \frac{1}{2^{K-1}} g_k^T \bar{Y} \quad \text{and} \quad \tau(k, k') = \frac{1}{2^{K-1}} g_{k,k'}^T \bar{Y},
\]

respectively, where by finite population we mean that we are only interested in inference for the units we have in the experiment. Higher-level interaction terms are defined analogously. We can similarly define the individual-level effects as

\[
\tau_i(k) = \frac{1}{2^{K-1}} g_k^T Y_i \quad \text{and} \quad \tau_i(k, k') = \frac{1}{2^{K-1}} g_{k,k'}^T Y_i,
\]

where $Y_i = (Y_i(z_1), \ldots, Y_i(z_J))^T$, so that each finite population effect is the average of the corresponding individual-level effects. We also define the average potential outcome across treatments as

\[
\tau(0) = \frac{\sum_{j=1}^{J} \bar{Y}(z_j)}{2^K} = \frac{1}{2^K} g_0^T \bar{Y},
\]

where $g_0$ is a vector of length $2^K$ of all +1's.

There is a correspondence between the potential outcomes and the causal effects. Consider a full factorial model matrix, $G$, that includes the mean and interactions and whose rows are the vectors $g_k^T$, where $k \in \{0; 1; \ldots; K; 1,2; \ldots; K-1,K; \ldots; 1,2,\ldots,K\}$. Based on the definitions of our estimands above, the matrix $G$ relates the mean potential outcomes and the factorial effects as follows:

\[
2^{-(K-1)}
\underbrace{\begin{pmatrix}
g_0^T \\ g_1^T \\ \vdots \\ g_K^T \\ g_{1,2}^T \\ \vdots \\ g_{K-1,K}^T \\ \vdots \\ g_{1,2,\ldots,K-1,K}^T
\end{pmatrix}}_{G}
\underbrace{\begin{pmatrix}
\bar{Y}(z_1) \\ \bar{Y}(z_2) \\ \vdots \\ \bar{Y}(z_J)
\end{pmatrix}}_{\bar{Y}}
=
\underbrace{\begin{pmatrix}
2\tau(0) \\ \tau(1) \\ \vdots \\ \tau(K) \\ \tau(1,2) \\ \vdots \\ \tau(K-1,K) \\ \vdots \\ \tau(1,2,\ldots,K-1,K)
\end{pmatrix}}_{\tau}
\]

Note that because of orthogonality, the inverse of $G$ is simply $G^T$ rescaled (i.e., $G^{-1} = \frac{1}{2^K} G^T$), as argued by Dasgupta et al. (2015) in the context of imputing potential outcomes under Fisher's sharp null hypothesis. The mean potential outcomes can be rewritten in terms of the factorial effects as

\[
\bar{Y} = \frac{1}{2} G^T \tau.
\]

Let the $j$th row of $G^T$ be denoted by $h_j$. The first entry of $h_j$ is +1, corresponding to the mean, the next $K$ entries are equal to the entries of $z_j$, and the remaining entries correspond to interactions, with the order given by the order of the rows of $G$. For instance, in the $2^3$ factorial design shown in Table 1, $h_1 = (+1, -1, -1, -1, +1, +1, +1, -1)$. We have

\[
\bar{Y}(z_j) = \frac{1}{2} h_j \tau.
\]

This representation will be useful in Sections 4 and 5.3 when considering what can be estimated when we do not observe all treatment combinations.

Let us now focus on estimation. The estimators for $\tau(k)$ and $\tau(k, k')$ are defined as

\[
\hat{\tau}(k) = \frac{1}{2^{K-1}} g_k^T \bar{Y}^{obs} \quad \text{and} \quad \hat{\tau}(k, k') = \frac{1}{2^{K-1}} g_{k,k'}^T \bar{Y}^{obs},
\]

respectively. Estimators for higher-level interaction terms and $\tau(0)$ are defined analogously by replacing $\bar{Y}$ by $\bar{Y}^{obs}$.

In Dasgupta et al. (2015), the variance of the factorial effect estimators was derived under a balanced design; the variance expression was extended to unbalanced designs in Lu (2016b), as follows:

\[
\mathrm{Var}(\hat{\tau}(k)) = \frac{1}{2^{2(K-1)}} \sum_{j=1}^{J} \frac{1}{n_j} S^2(z_j) - \frac{1}{n} S_k^2, \tag{1}
\]

where

\[
S^2(z_j) = \frac{1}{n-1} \sum_{i=1}^{n} \left( Y_i(z_j) - \bar{Y}(z_j) \right)^2 \quad \text{and} \quad S_k^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( \tau_i(k) - \tau(k) \right)^2.
\]

An expression for the covariance between two factorial effect estimators was provided by Dasgupta et al. (2015), which again can be extended to unbalanced factorial designs (Lu, 2016b):

\[
\mathrm{Cov}(\hat{\tau}(k), \hat{\tau}(k')) = \frac{1}{2^{2(K-1)}} \left[ \sum_{j: g_{kj} = g_{k'j}} \frac{1}{n_j} S^2(z_j) - \sum_{j: g_{kj} \neq g_{k'j}} \frac{1}{n_j} S^2(z_j) \right] - \frac{1}{n} S_{k,k'}^2, \tag{2}
\]

where $g_{kj}$ denotes the $j$th entry of $g_k$ and $S_{k,k'}^2 = \sum_{i=1}^{n} (\tau_i(k) - \tau(k)) (\tau_i(k') - \tau(k')) / (n - 1)$.
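The correspondence between mean potential outcomes and factorial effects can be checked numerically. The following is a minimal sketch for $K = 3$, assuming made-up values for the mean potential outcomes (the random draws are purely illustrative); it builds $G$ with rows ordered as in the matrix equation above and verifies the rescaled-inverse relation.

```python
import itertools
import numpy as np

K = 3
Z = np.array(list(itertools.product([-1, 1], repeat=K)))   # z_1, ..., z_8

# Effects in the row order of G: mean, main effects, then interactions.
effects = [c for r in range(K + 1) for c in itertools.combinations(range(K), r)]
G = np.array([
    np.prod(Z[:, list(e)], axis=1) if e else np.ones(2**K, dtype=int)
    for e in effects
])

rng = np.random.default_rng(0)
Ybar = rng.normal(size=2**K)          # hypothetical mean potential outcomes

# tau = (2*tau(0), tau(1), tau(2), tau(3), tau(1,2), tau(1,3), tau(2,3), tau(1,2,3))
tau = G @ Ybar / 2**(K - 1)

# Rows of G are mutually orthogonal, so G^{-1} = G^T / 2^K ...
orthogonal = np.array_equal(G @ G.T, 2**K * np.eye(2**K))
# ... and the mean potential outcomes are recovered as (1/2) G^T tau.
recovered = np.allclose(G.T @ tau / 2, Ybar)

h1 = G.T[0]                           # row of G^T for z_1
print(orthogonal, recovered, h1)
```

Each row `G.T[j - 1]` plays the role of $h_j$, so the same matrix can be reused to express any linear combination of mean potential outcomes in terms of factorial effects.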

2.3 Statistical inference

Among the various types of statistical analyses that could be performed for a factorial design (e.g., Neymanian, frequentist, Fisherian, and Bayesian), in this paper, we follow the Neymanian perspective. To do so, we first need good estimators for the design-based variance and covariance of the estimated causal effects. A conservative Neyman-style estimator for the variance was proposed by Dasgupta et al. (2015) and extended to the unbalanced case by Lu (2016b):

\[
\widehat{\mathrm{Var}}(\hat{\tau}(k)) = \frac{1}{2^{2(K-1)}} \sum_{j=1}^{J} \frac{1}{n_j} s^2(z_j), \tag{3}
\]

where

\[
s^2(z_j) = \frac{1}{n_j - 1} \sum_{i: W_i(z_j) = 1} \left( Y_i(z_j) - \bar{Y}^{obs}(z_j) \right)^2
\]

is the estimated variance of potential outcomes under treatment combination $z_j$. Dasgupta et al. (2015) discussed other variance estimators and their performance under different assumptions (e.g., strict additivity, compound symmetry). In that paper, the authors also provided a Neyman-style estimator for the covariance by again substituting $s^2(z_j)$ for $S^2(z_j)$. Unfortunately, this estimator is not guaranteed to be conservative because $S_{k,k'}^2$ may be positive or negative.

The same authors provided a Neyman confidence region for $\tau$, the vector of all $J - 1$ factorial effects. First, they define, in Equation 26,

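\[
T_n = \hat{\tau}^T \hat{\Sigma}_{\hat{\tau}}^{-1} \hat{\tau},
\]

where $\hat{\tau}$ is the vector of Neyman estimators of $\tau$ that we defined earlier and $\hat{\Sigma}_{\hat{\tau}}$ is the estimator of the covariance matrix of $\hat{\tau}$, $\Sigma_{\hat{\tau}}$. Then the $100(1 - \alpha)\%$ confidence region for $\tau$ (Equation 27 of Dasgupta et al. (2015)) is

\[
\{\hat{\tau} : p_{\alpha/2} \leq T_n \leq p_{1-\alpha/2}\},
\]

where $0 < \alpha < 1$ and $p_\alpha$ is the $\alpha$ quantile of the asymptotic distribution of $T_n$. In that paper, a Neyman-style confidence interval for individual effects was also provided as

\[
\hat{\tau}(k) \pm z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{\tau}(k))},
\]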

where the authors rely on a normal approximation for the distribution of $\hat{\tau}(k)$, typically assumed to hold asymptotically. See Proposition 1 in Li et al. (2020) for conditions under which the asymptotic normality holds in this setting. These asymptotic properties are also explored in Li and Ding (2017).

Note that a model that linearly regresses an outcome against the factors coded using contrast values -1 and +1 and all interactions between factors (but no other covariates) will result in the same point estimates for the factorial effects as presented here, divided by 2 (Dasgupta et al., 2015; Lu, 2016b). For a balanced design or when treatment effects are assumed to be additive, so that the variances $S^2(z_j)$ are all the same, the standard linear regression variance estimate, relying on homoskedasticity, will be the same as the Neymanian variance estimate (Dasgupta et al., 2015; Imbens and Rubin, 2015). However, this is not true for an unbalanced design. Samii and Aronow (2012) showed that the HC2 heteroskedasticity robust variance estimator (see MacKinnon and White, 1985, for more details) is the same as the Neymanian variance estimator presented here for a single treatment experiment. Lu (2016b) extended this finding to factorial designs, showing that this estimator is also the same as the Neymanian variance estimator in that case. Note that because the regression estimator differs by a factor of 2, this regression variance differs by a factor of 4.

Fisherian and Bayesian types of analyses are also possible. See Dasgupta et al. (2015) for additional discussion. In particular, they explore creating “Fisherian Fiducial” intervals (Fisher, 1930; Wang, 2000) for factorial effects. The basic idea is to invert our estimands so that we write our potential outcomes in terms of the estimands. Then, under a Fisher sharp null hypothesis, we can impute all missing potential outcomes and generate the randomization distribution.
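The stated relationship between the Neymanian estimates and a saturated regression with HC2 standard errors can be illustrated numerically. The sketch below uses simulated data for an unbalanced $2^2$ design; the group sizes, outcome distribution, and variable names are assumptions made for illustration, and the HC2 estimator is computed by hand with NumPy rather than through a regression package.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
K = 2
Z = np.array(list(itertools.product([-1, 1], repeat=K)))   # 4 treatment combinations
n_j = np.array([5, 8, 6, 11])                               # unbalanced sizes (made up)

# Simulated observed outcomes with group-specific means and spreads.
group = np.repeat(np.arange(len(n_j)), n_j)
y = rng.normal(loc=group.astype(float), scale=1.0 + group)

# Neymanian point estimate and conservative variance estimate (Equation 3)
# for the main effect of factor 1.
ybar = np.array([y[group == j].mean() for j in range(len(n_j))])
s2 = np.array([y[group == j].var(ddof=1) for j in range(len(n_j))])
tau_hat = Z[:, 0] @ ybar / 2**(K - 1)
var_hat = np.sum(s2 / n_j) / 2**(2 * (K - 1))

# Saturated regression on -1/+1 codes: intercept, both factors, interaction.
z1, z2 = Z[group, 0], Z[group, 1]
X = np.column_stack([np.ones(len(y)), z1, z2, z1 * z2])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y

# HC2: squared residuals are inflated by 1 / (1 - leverage).
resid = y - X @ beta
leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
meat = X.T @ (X * (resid**2 / (1 - leverage))[:, None])
hc2 = XtX_inv @ meat @ XtX_inv

print(np.isclose(beta[1], tau_hat / 2))       # coefficient is tau_hat / 2
print(np.isclose(hc2[1, 1], var_hat / 4))     # HC2 variance is var_hat / 4
```

In the saturated model each unit's leverage equals $1/n_j$ for its own treatment group, which is why the HC2 correction reproduces the $n_j - 1$ denominator in $s^2(z_j)$.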

3 Fractional factorial designs

3.1 Set up

There are situations in which a full $2^K$ factorial design experiment cannot be conducted or is not optimal. For instance, limited resources may mean there are not enough units to randomly assign to each of the $J$ $(= 2^K)$ treatment combinations. Or the full factorial design may not be the most efficient allocation of resources, for instance if the experimenter believes that higher order interactions are not significant (Wu and Hamada, 2000). Instead, an experimenter might implement a $2^{K-p} = J'$ fractional factorial design in which only $J'$ of the total $J$ treatment combinations are used. Here we give a brief overview of this design, but we recommend Wu and Hamada (2000) for a much more detailed review. We follow notation in Dasgupta et al. (2015) and Dasgupta and Rubin (2015).

To create a $2^{K-p}$ design, we can write out a full factorial design for the first $K - p$ factors, and fill in the $p$ columns for the remaining factors using multiplicative combinations of subsets of the previous contrast columns. We use a generator to choose the factors whose treatment levels we multiply together to get the treatment levels for the other factors (Wu and Hamada, 2000). For example, if we want to create a $2^{3-1}$ design using the generator 3 = 12, then we would start by writing out a $2^2$ design for factors 1 and 2, as in columns 1 and 2 of Table 2. Then we generate the third column, corresponding to treatment levels for factor 3, by multiplying together the contrast vectors for factors 1 and 2, as shown in column 3 of Table 2. Note that which two factors are used initially is irrelevant in this case because of symmetry. That is, because 3 = 12, we also have 1 = 23 and 2 = 13. However, this may not hold in general and one should choose which factors to use based on the final aliasing structure, which tells us which effects are confounded (discussed more below).

Treatment   Factor 1   Factor 2   Factor 3   Outcomes
$z_1^*$        -1         -1         +1      $\bar{Y}(z_1^*)$
$z_2^*$        -1         +1         -1      $\bar{Y}(z_2^*)$
$z_3^*$        +1         -1         -1      $\bar{Y}(z_3^*)$
$z_4^*$        +1         +1         +1      $\bar{Y}(z_4^*)$
Column      $g_1^*$    $g_2^*$    $g_3^*$    $\bar{Y}_*$

Table 2: Example of a $2^{3-1}$ factorial design.

Under this design, the contrast columns, $g_k^*$, are now shortened versions or subsets of the contrast columns of a full $2^3$ factorial design, $g_k$. The treatment combinations in the fractional design, $z_j^*$, are a subset of the full set of treatment combinations, $z_j$, for the $2^3$ design. Again, note that, referring to Table 2, $g_3^* = g_1^* \circ g_2^*$. The generator 3 = 12 indicates that the main effect of factor 3 is aliased with the two-factor interaction 12. Two effects being aliased means that we cannot distinguish the effects; they are confounded or combined in our estimators. The full aliasing in this design is as follows: I = 123, 3 = 12, 2 = 13, 1 = 23. Note that I corresponds to a vector of all +1's ($g_0$). The relation I = 123 is called the defining relation, which characterizes the aliasing and how to generate the rest of the columns.

We see that the main effects, as defined in Section 2.2, are aliased with the two-factor interactions. If the two-factor interactions are negligible, the $2^{3-1}$ design is a parsimonious design to estimate the main effects. That is, we can create unbiased estimators (reviewed in Section 3.2) for the main effects. The resulting design is also orthogonal (i.e., all pairs of columns are orthogonal) and balanced, which means that each $g_k^*$ has an equal number of +1's and -1's (Wu and Hamada, 2000). These properties simplify the aliasing structure.

We typically choose the generator or defining relation based on the maximum resolution criterion for the design. The resolution indicates the aliasing structure and the orders of effects with which the main effects are aliased. Resolution is defined as the word length (i.e., the number of factors) in the shortest word of the defining relation (see Wu and Hamada, 2000, for more details). In our example of a $2^{3-1}$ design, we only have one defining relation, I = 123, and the word has length 3. This means that the main effects are aliased with two-factor interactions. We can imagine an alternate aliasing structure in which some main effects are aliased with other main effects. The effect hierarchy principle assumes that lower-order effects are more significant than higher-order interaction effects (Wu and Hamada, 2000). Therefore, generally one chooses a fractional design where main effects, and some other lower-order interaction effects, are only aliased with higher-order terms. In particular, we may want the main effects and two-factor interactions to be clear. This assumption goes along with the assumption of effect sparsity, that the number of significant or important effects is small, as a justification of the use of fractional designs over the full factorial (Wu and Hamada, 2000).
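The generator-based construction and the resulting aliasing structure can be sketched as follows for the $2^{3-1}$ example. This is an illustrative sketch under the assumption that aliasing is detected by comparing shortened contrast columns; the helper `column` and the dictionary `aliases` are hypothetical names. Negative aliasing (a column equal to the negative of another) does not arise in this particular design, so it is not handled here.

```python
import itertools
import numpy as np

# Full 2^2 design for factors 1 and 2, then generate factor 3 via 3 = 12.
base = np.array(list(itertools.product([-1, 1], repeat=2)))
Z_frac = np.column_stack([base, base[:, 0] * base[:, 1]])   # rows are z*_1..z*_4

def column(Z, factors):
    """Shortened contrast column for 0-indexed factors; the empty tuple is I."""
    if not factors:
        return np.ones(len(Z), dtype=int)
    return np.prod(Z[:, list(factors)], axis=1)

# All effects of a 2^3 design: I, main effects, two- and three-factor interactions.
effects = [c for r in range(4) for c in itertools.combinations(range(3), r)]

# Effects whose columns coincide in the fraction are aliased with each other.
aliases = {}
for e in effects:
    key = tuple(int(v) for v in column(Z_frac, e))
    aliases.setdefault(key, []).append(e)

for group in aliases.values():
    # Print with 1-indexed factor labels, e.g. "1 = 23".
    print(" = ".join("".join(str(f + 1) for f in e) or "I" for e in group))
```

The same loop applies to other generators by changing how the extra columns of `Z_frac` are built, although designs with negative aliasing would also need to compare each column against the negated columns.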

3.2 Estimators

We now review estimators for the fractional design, which are similar to the full factorial case. We follow the framework laid out in Dasgupta et al. (2015) and Dasgupta and Rubin (2015). The estimator for $\tau(k)$ is defined as

\[
\hat{\tau}^*(k) = \frac{1}{2^{K-p-1}} g_k^{*T} \bar{Y}_*^{obs}.
\]

The estimator for $\tau(k, k')$ is defined as

\[
\hat{\tau}^*(k, k') = \frac{1}{2^{K-p-1}} g_{k,k'}^{*T} \bar{Y}_*^{obs}.
\]

Note that these estimators are no longer unbiased. Let $S$ be the set of all effects aliased with factorial effect $k$ as well as $k$ itself. The number of effects aliased with factor $k$ is $2^p - 1$ (Montgomery, 2017, Chapter 8). So, $S$ has $2^p$ elements. Factor $k$ may be aliased with the negative of a main effect or interaction. Let $S_{k,j}$ be the indicator for whether effect $j$ is negatively aliased with factor $k$ ($S_{k,j} = 1$) or positively aliased ($S_{k,j} = 0$). Then we find, using the orthogonality of the $g_k$ vectors, that

\[
E[\hat{\tau}^*(k)] = \frac{1}{2^{K-p-1}} g_k^{*T} \bar{Y}_*
= \sum_{j \in S} (-1)^{S_{k,j}} \tau(j)
= \tau(k) + \sum_{j \in S \setminus \{k\}} (-1)^{S_{k,j}} \tau(j).
\]

Hence by aliasing these effects, they get combined in our estimator. We see that the estimator for $\tau(k)$ is unbiased if the effects aliased with factorial effect $k$ are zero, and will be close to unbiased as long as the aliased effects are negligible, which may be justified by the effect hierarchy principle. When aliasing occurs such that main effects are aliased with low-order interaction terms, such as two-factor interactions, this assumption may be unrealistic. However, if we have a large number of factors and are only aliasing main effects with higher-order effects, this may be reasonable. For instance, in a $2^{6-1}$ fractional design, main effects can be aliased with five-factor interactions and two-factor interactions can be aliased with four-factor interactions, where both higher-order interactions may be assumed to be small.

Now we extend the variance and variance estimator expressions from Dasgupta et al. (2015) (see Section 2) to this setting. Recall $J' = 2^{K-p}$ and define $\tilde{\tau}(k) = E[\hat{\tau}^*(k)]$. Further define the analog of $S_k^2$,

\[
\tilde{S}_k^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( \tilde{\tau}_i(k) - \tilde{\tau}(k) \right)^2,
\]

which is the variation in our newly defined aliased effects, $\tilde{\tau}_i(k)$. Let $n_j^*$ be the number of units assigned to treatment combination $z_j^*$. Then the variance of the estimator $\hat{\tau}^*(k)$ is

\[
\mathrm{Var}(\hat{\tau}^*(k)) = \frac{1}{2^{2(K-p-1)}} \sum_{j=1}^{J'} \frac{1}{n_j^*} S^2(z_j^*) - \frac{1}{n} \tilde{S}_k^2. \tag{4}
\]

See Appendix A.2 for more details on this derivation. Similarly, we can obtain the covariance between two fractional factorial effect estimators. Here we define an altered version of $S_{k,k'}^2$:

\[
\tilde{S}_{k,k'}^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( \tilde{\tau}_i(k) - \tilde{\tau}(k) \right) \left( \tilde{\tau}_i(k') - \tilde{\tau}(k') \right). \tag{5}
\]

Then, the covariance of $\hat{\tau}^*(k)$ and $\hat{\tau}^*(k')$ is

\[
\mathrm{Cov}(\hat{\tau}^*(k), \hat{\tau}^*(k')) = \frac{1}{2^{2(K-p-1)}} \left[ \sum_{j: g_{kj}^* = g_{k'j}^*} \frac{1}{n_j^*} S^2(z_j^*) - \sum_{j: g_{kj}^* \neq g_{k'j}^*} \frac{1}{n_j^*} S^2(z_j^*) \right] - \frac{1}{n} \tilde{S}_{k,k'}^2. \tag{6}
\]

See Appendix A.3 for more details on this derivation. The variance expressions for fractional factorial designs are similar to the full factorial case, but are defined over aliased or grouped effects.
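The bias of the fractional factorial estimator due to aliasing can be verified numerically for the $2^{3-1}$ design with generator 3 = 12. The sketch below assumes arbitrary, randomly drawn mean potential outcomes (illustrative only) and checks that the expectation of the estimator for factor 1 equals $\tau(1) + \tau(23)$; the helper `tau` is a hypothetical name.

```python
import itertools
import numpy as np

K, p = 3, 1
Z = np.array(list(itertools.product([-1, 1], repeat=K)))   # full 2^3 design
rng = np.random.default_rng(2)
Ybar = rng.normal(size=2**K)          # hypothetical mean potential outcomes

def tau(factors):
    """Full-factorial effect for the given 0-indexed factors."""
    g = np.prod(Z[:, list(factors)], axis=1)
    return g @ Ybar / 2**(K - 1)

# Treatment combinations retained by the generator 3 = 12.
in_fraction = Z[:, 2] == Z[:, 0] * Z[:, 1]
Z_star, Ybar_star = Z[in_fraction], Ybar[in_fraction]

# Expectation of the fractional estimator for factor 1:
# (1 / 2^(K-p-1)) * g*_1^T Ybar_*.
expected_tau1_star = Z_star[:, 0] @ Ybar_star / 2**(K - p - 1)

print(expected_tau1_star, tau([0]) + tau([1, 2]))
```

Because the check holds for any choice of `Ybar`, it illustrates that the estimator is unbiased for $\tau(1)$ only when $\tau(23)$ is zero.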

3.3 Statistical inference

It is straightforward to extend the analysis for factorial designs in Dasgupta et al. (2015), reviewed in Section 2, to fractional factorial designs. We can again use a conservative Neyman-style variance estimator:

\[
\widehat{\mathrm{Var}}(\hat{\tau}^*(k)) = \frac{1}{2^{2(K-p-1)}} \sum_{j=1}^{J'} \frac{1}{n_j^*} s^2(z_j^*). \tag{7}
\]

This leads to

\[
E\left[ \widehat{\mathrm{Var}}(\hat{\tau}^*(k)) \right] = \frac{1}{2^{2(K-p-1)}} \sum_{j=1}^{J'} \frac{1}{n_j^*} S^2(z_j^*),
\]

and thus $\widehat{\mathrm{Var}}(\hat{\tau}^*(k))$ is a conservative estimator, and unbiased if and only if $\tilde{S}_k^2 = 0$, which would occur if the aliased effects $\tilde{\tau}_i(k)$ are constant. Note that this condition holds when all effects aliased with factor $k$ are constant additive effects.

We can build confidence regions and confidence intervals analogously to what was reviewed in Section 2.3. To be more rigorous in this section, we show how to use the results of Li and Ding (2017) to build Wald-type confidence regions. First, note that there is a new matrix $G^*$ that relates the treatment combinations in the fractional experiment to the unique treatment effect estimands and estimators defined for this experiment. For the $2^{3-1}$ example, recalling the aliasing structure which gives, for instance, $g_0^* = g_{123}^*$ and $\tilde{\tau}(2) = \tilde{\tau}(13)$, we have

\[
2^{-(K-p-1)}
\underbrace{\begin{pmatrix}
g_0^{*T} \\ g_1^{*T} \\ g_2^{*T} \\ g_3^{*T}
\end{pmatrix}}_{G^*}
\underbrace{\begin{pmatrix}
\bar{Y}(z_1^*) \\ \bar{Y}(z_2^*) \\ \bar{Y}(z_3^*) \\ \bar{Y}(z_4^*)
\end{pmatrix}}_{\bar{Y}_*}
=
\underbrace{\begin{pmatrix}
2\tilde{\tau}(0) \\ \tilde{\tau}(1) \\ \tilde{\tau}(2) \\ \tilde{\tau}(3)
\end{pmatrix}}_{\tilde{\tau}}.
\]

Define $\hat{\tau}^*$ as the $2^{K-p}$ vector of unique treatment effect estimators that corresponds to the estimands in $\tilde{\tau}$ (shown in the previous example).

Now we can define asymptotic results. Assume that all $S^2(z_j^*)$ and $\tilde{S}_{k,k'}^2$ have limiting values, that the $n_j^*/n$ have positive limiting values, and that

\[
\max_{1 \leq j \leq J'} \max_{1 \leq i \leq n} \left( Y_i(z_j^*) - \bar{Y}(z_j^*) \right)^2 / n \to 0.
\]

Then, according to Theorem 5 of Li and Ding (2017), $n \mathrm{var}(\hat{\tau}^*)$ has a limiting value, which, following their notation, we denote by $V$, and

\[
\sqrt{n} \left( \hat{\tau}^* - \tilde{\tau} \right) \xrightarrow{d} N(0, V).
\]

Further let $\hat{V} = \sum_{j=1}^{J'} n_j^* G_j^* s^2(z_j^*) G_j^{*T}$, where $G_j^*$ is the $j$th column of $G^*$. In addition to the previous assumptions, we require that the limit of $\sum_{j=1}^{J'} n_j^* G_j^* S^2(z_j^*) G_j^{*T}$ is nonsingular. Then under Proposition 3 of Li and Ding (2017) and following their notation, the Wald-type confidence region,

\[
\{\mu : (\hat{\tau}^* - \mu)^T \hat{V} (\hat{\tau}^* - \mu) \leq q_{J', 1-\alpha}\},
\]

where $q_{J', 1-\alpha}$ corresponds to the $1 - \alpha$ quantile of the $\chi^2_{J'}$ distribution, has at least $1 - \alpha$ asymptotic coverage. In these asymptotic results we assume that the number of treatments is constant as $n \to \infty$, but Li and Ding (2017) discuss the case where the number of treatment combinations grows as well.

As in the full factorial setting, linear regression yields the same point estimates (divided by 2) and the HC2 variance estimator yields the same variance estimate (divided by 4) as the Neymanian estimates. For proof, see Appendix B.

4 Incomplete factorial designs

In this section, we discuss an alternative experimental design to the fractional factorial design, called the incomplete factorial design, which uses a subset of data from a full factorial design but a different subset than the fractional design (Byar et al., 1993). We discuss aspects of incomplete factorial designs, as defined and discussed in Byar et al. (1993), but from a design-based and potential outcomes perspective. In particular, the estimators we discuss here will not use all of the available data. We therefore can consider each estimator to be associated with a particular “design” that corresponds to an experiment that randomizes units only to the treatment combinations used in the estimator. Different estimators may give non-zero weight to different treatment combinations, and so the hypothetical designs may be estimand-specific. Due to this inherent linking between the design and the estimators, we deviate from the previous sections' structure and discuss the design and corresponding estimators together in the next section.

4.1 Design and estimators

Let us assume we ran an experiment with $K$ binary treatments where, whether due to human error (e.g., neglecting to randomize units to a particular treatment combination) or lack of resources, some of the treatment combinations from the full $2^K$ factorial design were not included in the experiment. How should we analyze this? We could reduce the data to a fractional factorial design, which requires removing multiple treatment groups. For instance, we may have a factorial structure as in Table 3 but no outcome measurements for treatment combination $z_7$.

Treatment   Factor 1   Factor 2   Factor 3   Observed outcomes
$z_1$          -1         -1         -1      $\bar{Y}^{obs}(z_1)$
$z_2$          -1         -1         +1      $\bar{Y}^{obs}(z_2)$
$z_3$          -1         +1         -1      $\bar{Y}^{obs}(z_3)$
$z_4$          -1         +1         +1      $\bar{Y}^{obs}(z_4)$
$z_5$          +1         -1         -1      $\bar{Y}^{obs}(z_5)$
$z_6$          +1         -1         +1      $\bar{Y}^{obs}(z_6)$
$z_7$          +1         +1         -1      ?
$z_8$          +1         +1         +1      $\bar{Y}^{obs}(z_8)$
Column       $g_1$      $g_2$      $g_3$     $\bar{Y}^{obs}$

Table 3: Example of a $2^3$ factorial design with no observations for one treatment combination.

Let us focus on estimation of τ(1) first. If we recreate a fractional factorial design that aliases the main effects with the two-way interactions, then we estimate τ(1) + τ(23), which would involve using outcome data from units assigned to treatment combinations z2, z3, z5, and z8, but not from units assigned to treatment combinations z1, z4, and z6.

Instead, we might consider building an estimator using all of the treatment combinations except for z3, which has the same levels for factors 2 and 3 but the opposite level for factor 1 as combination z7. Thus, in some sense, removing z3 “balances” the remaining treatment combinations and is essentially the “naïve estimator” discussed in Byar et al. (1993). This strategy creates a different hypothetical experimental design with a different aliasing structure. In this case, we would be estimating

$$\dot{\tau}(1) = \frac{\bar{Y}(z_5) + \bar{Y}(z_6) + \bar{Y}(z_8)}{3} - \frac{\bar{Y}(z_1) + \bar{Y}(z_2) + \bar{Y}(z_4)}{3} = \frac{\tau(1 \mid F_2 = -1, F_3 = -1) + \tau(1 \mid F_2 = -1, F_3 = +1) + \tau(1 \mid F_2 = +1, F_3 = +1)}{3},$$

where we let $F_k$ denote the factor level of the kth factor so that $\tau(k \mid F_j = x, F_i = y)$ is the main effect of factor k conditional on level x of factor j and level y of factor i, as in Dasgupta et al. (2015). That is, we estimate the average of the conditional effects of factor 1 given the combinations (-1,-1), (-1,+1), and (+1,+1) for factors 2 and 3. To find the aliasing structure, we can refer to the matrix $G^T$ in Section 2.2 to rewrite the estimand above as

$$\dot{\tau}(1) = \frac{1}{3} \cdot \frac{1}{2}(h_8 + h_6 + h_5 - h_4 - h_2 - h_1)\tau = \tau(1) + \frac{-\tau(12) + \tau(13) + \tau(123)}{3}.$$

Now we have partially aliased the main effect for factor 1 with the two-factor interactions for factors 1 and 2 and factors 1 and 3, as well as the three-factor interaction, all divided by three. (The coefficients follow from the rows of $G^T$: the two-factor interaction between factors 2 and 3 cancels because each retained combination at the high level of factor 1 is paired with one at the low level that has the same levels of factors 2 and 3.) This is partial aliasing because the factors are neither fully aliased nor completely clear of each other (Wu and Hamada, 2000, Chapter 7). Whether this is preferable to aliasing the main effect with just the two-factor interaction between factors 2 and 3 depends entirely on subject-matter knowledge. For instance, if we know that the three-factor interaction is negligible, then we might expect this new estimator to have lower bias, as we are dividing both two-factor interactions by 3. However, even if we had knowledge that the two-factor interactions were of the same sign, we would not know the direction of the bias in this case without knowing the relative magnitudes of the two-factor interactions. When estimating the main effect for factor 2, we would naturally approximate a different design. In that case, using the same logic as before, we would remove treatment combination z5.
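The aliasing structure of any difference-in-means estimator of this kind can be checked mechanically from the rows of $G^T$. The following is a minimal sketch, not part of the original analysis; the helper names `h` and `alias_coefficients` are hypothetical, and it assumes the row and column ordering of $G^T$ for the $2^3$ design (rows z1 to z8 as in Table 3, columns I, 1, 2, 3, 12, 13, 23, 123) together with the relation $\bar{Y}(z_j) = \frac{1}{2} h_j \tau$.

```python
from fractions import Fraction
from itertools import combinations, product
from math import prod

K = 3
# z1..z8 in the order of Table 3 (factor 1 varies slowest).
combos = list(product([-1, +1], repeat=K))
# Effects in the assumed column order of G^T: I, 1, 2, 3, 12, 13, 23, 123.
effects = [s for r in range(K + 1) for s in combinations(range(K), r)]
labels = ["".join(str(f + 1) for f in s) or "I" for s in effects]

def h(j):
    """Row h_j of G^T for treatment combination z_j (1-based index)."""
    return [prod(combos[j - 1][f] for f in s) for s in effects]

def alias_coefficients(plus, minus):
    """Coefficient on each factorial effect for
    mean(Ybar(z_j), j in plus) - mean(Ybar(z_j), j in minus),
    using Ybar(z_j) = (1/2) h_j tau."""
    coef = [Fraction(0)] * len(effects)
    weights = [(j, Fraction(1, len(plus))) for j in plus]
    weights += [(j, Fraction(-1, len(minus))) for j in minus]
    for j, w in weights:
        coef = [c + w * sign / 2 for c, sign in zip(coef, h(j))]
    return {lab: c for lab, c in zip(labels, coef) if c != 0}

# Estimator that drops z3 (and the unobserved z7).
dropped_z3 = alias_coefficients(plus=[5, 6, 8], minus=[1, 2, 4])
# Half fraction z2, z3, z5, z8 for comparison.
half_fraction = alias_coefficients(plus=[5, 8], minus=[2, 3])

for name, coefs in [("drop z3", dropped_z3), ("half fraction", half_fraction)]:
    print(name, {lab: str(c) for lab, c in coefs.items()})
```

Exact fractions are used so that partial aliasing weights such as one third appear without rounding. The same function can be applied to any other choice of retained treatment combinations.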

As an alternative design, we may alias our main effects with the highest-order interaction possible, allowing for a different design for each main effect estimator. So when estimating the main effect of factor 1, we would use a design that aliases factor 1 with the three-factor interaction. This leads to a design using treatment combinations z1, z4, z5, and z8, for which we can build the following estimand:

$$\dot{\tau}(1) = \frac{\bar{Y}(z_5) + \bar{Y}(z_8)}{2} - \frac{\bar{Y}(z_1) + \bar{Y}(z_4)}{2} = \frac{1}{2} \cdot \frac{1}{2}(h_8 + h_5 - h_4 - h_1)\tau = \tau(1) + \tau(123).$$

Based on the hierarchy principle, an estimator with this aliasing structure should be a superior estimator to the original fractional factorial estimator mentioned because the three-factor interaction is more likely to be negligible than the two-factor interactions.

Denote by $\dot{g}_k$ the analog of $g_k$ with zeros where outcome data are missing or excluded in a given estimator for factor k. If there is a single treatment combination with no outcome measured, we can choose the aliasing such that factor k is aliased with the K-factor interaction. When more rows are missing, the pattern of missingness will dictate what aliasing structure is possible. For example, if we are missing two treatment combinations but they are missing from the same design that aliases factor k with the negative of the K-factor interaction, then we can still recreate the design that aliases factor k with the positive of the K-factor interaction. But if we are missing a row from each of these designs, neither option is possible and we must choose a different aliasing structure. If this method is continued for each factor, one ends up with a set of $\dot{g}_k$, each with zeros for different treatment combinations. Section 4.5 of Wu and Hamada (2000) gives a general strategy to design experiments while attempting to reduce aliasing for certain main effects. There are other designs we can construct, not considered here, such as nonregular designs that have partial aliasing (see Wu and Hamada, 2000, Chapter 7 for more details on nonregular designs).
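The feasibility check described above can be sketched in a few lines. This is an illustrative sketch only; the function name `highest_order_alias_designs` is hypothetical. It relies on the fact that aliasing factor k with plus or minus the K-factor interaction keeps exactly the treatment combinations for which the product of the other K − 1 factor levels equals +1 or −1, respectively.

```python
from itertools import product
from math import prod

def highest_order_alias_designs(K, k, missing):
    """For factor k (1-based), return the half fractions that alias factor k
    with +/- the K-factor interaction and avoid every missing combination.
    Keys are +1 / -1 (sign of the alias); values are the retained rows."""
    combos = list(product([-1, +1], repeat=K))
    feasible = {}
    for sign in (+1, -1):
        # Keep rows where the product of the other K-1 factor levels is `sign`.
        rows = [z for z in combos if prod(z[:k - 1] + z[k:]) == sign]
        if not any(z in missing for z in rows):
            feasible[sign] = rows
    return feasible

z7 = (+1, +1, -1)  # the unobserved combination in Table 3
print(highest_order_alias_designs(3, 1, {z7}))
# Missing one row from each candidate design leaves no feasible option.
print(highest_order_alias_designs(3, 1, {z7, (+1, +1, +1)}))
```

With only z7 missing, the design that aliases factor 1 with the positive three-factor interaction (rows z1, z4, z5, z8) remains available, matching the design used in the text.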

4.2 Statistical inference

More generally, denote our estimator for the kth factor under one of these alternative incomplete factorial designs as $\hat{\dot{\tau}}(k) = \dot{g}_k^T \bar{Y}^{obs}$. It is straightforward to extend the variance expression $\mathrm{Var}(\hat{\dot{\tau}}(k))$ and variance estimator $\widehat{\mathrm{Var}}(\hat{\dot{\tau}}(k))$, as well as the covariance of $\hat{\dot{\tau}}(k)$ and $\hat{\dot{\tau}}(k')$, from Section 3.3; in fact, it is easy to similarly extend these results to any linear combinations of interest on the $2^K$ treatment combinations. See Appendix C for the specification and derivation. However, terms for some treatment combinations will be set to zero in these expressions because they are present in one estimator but not the other.

If we run a regression with all interactions on a dataset with missing treatment combinations, it is ambiguous what design the resulting estimators correspond to and, as discussed above, multiple designs may be plausible to approximate the observational study. If we specify a regression of the outcome on the fully interacted treatments, but some treatment combinations are not present in the data, the regression will not be able to estimate all interactions. Not including a set of interactions implies that the model assumes that those interaction effects are zero, and we can assess how reasonable this assumption is. However, the specific aliasing structure between the effects included in the model and those dropped by the regression is not obvious from the usual computer output alone, though it can be discerned from the design matrix. Byar et al. (1993) and Byar et al. (1995) give more discussion of these types of estimators and Appendix C.3 has further discussion.
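To illustrate how the aliasing can be read off the design matrix, the sketch below considers the $2^3$ example with z7 unobserved. It is a hedged illustration, not the authors' procedure: it assumes ±1 coding of the fully interacted regressors, and it assumes that the software resolves the rank deficiency by dropping the three-factor interaction column (which column is dropped in practice depends on the software).

```python
from itertools import combinations, product
from math import prod

K = 3
combos = list(product([-1, +1], repeat=K))  # z1..z8 as in Table 3
effects = [s for r in range(K + 1) for s in combinations(range(K), r)]
labels = ["".join(str(f + 1) for f in s) or "I" for s in effects]

def row(z):
    """Fully interacted regressors (+/-1 coding) for combination z."""
    return [prod(z[f] for f in s) for s in effects]

missing = combos[6]  # z7 has no outcomes
observed = [row(z) for z in combos if z != missing]
v = row(missing)

# v is orthogonal to every observed row, so adding any multiple of v to the
# coefficient vector leaves all fitted cell means unchanged: the 8 coefficients
# are not separately identified from the 7 observed cells.
assert all(sum(a * b for a, b in zip(r, v)) == 0 for r in observed)

# Assumption: the 123 column is dropped. Each reported coefficient then equals
# its own coefficient plus alias_sign * (coefficient of 123).
drop = labels.index("123")
alias_sign = {lab: -v[i] * v[drop] for i, lab in enumerate(labels) if i != drop}
print(alias_sign)
```

The signs follow because the unobserved row spans the null space of the observed design matrix; a different dropped column would give a different, but equally mechanical, aliasing pattern.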

5 Embedding observational studies in fractional factorial designs

5.1 General issues

To address causality in a non-randomized study, it has been argued that one needs to conceptualize a hypothetical randomized experiment that could correspond to the observational data (Bind and Rubin, 2019; Rosenbaum, 2002; Rubin, 2008; Stuart, 2010). The hypothetical randomization is plausible if the treatment groups are “similar” with respect to variables (Rubin, 2007, 2008). In a setting with multiple treatments of interest, we recreate a hypothetical factorial randomized experiment. However, in observational studies, there may be no units that received certain treatment combinations. Therefore, our strategy is to recreate a hypothetical fractional factorial or incomplete factorial experiment instead of a full factorial experiment.

Focusing on the fractional factorial design, we must decide which fractional factorial design to recreate. This decision should be based on some criterion, such as the maximum resolution criterion. Typically, it is desirable to alias main effects with the highest order interactions, which usually means a small p, i.e., a small fraction of the total design is removed. In practice, we may not be able to control the aliasing structure. We must choose an aliasing structure for the $2^{K-p}$ fractional design such that the treatment combination(s) with no observations is not used. However, this strategy usually results in the removal of units assigned to treatment combinations that were present in the observational data set but not used in the design. That is, if only one treatment combination is not present in the data set, we could use a $2^{K-1}$ design, but then we are not using $2^{K-1} - 1$ treatment combinations for which we have data.

After a design is chosen, a strategy to balance covariates should be used to ensure that the units across treatment combinations are similar with respect to background covariates. We assume strong ignorability, that is, that conditional on measured covariates, the assignment mechanism is individualistic, probabilistic, and unconfounded (Rosenbaum and Rubin, 1983). Then we can obtain unbiased causal estimates by analyzing the data as if they arose from a hypothetical randomized experiment. Note that these assumptions apply to all treatment combinations in the final experimental design used. We must also assume that the treatment assignment is, at least hypothetically, manipulable such that all potential outcomes are well-defined.
This in turn ensures that our estimands of interest are defined. If this assumption on manipulation does not hold, we would need to consider a causal estimand that does not depend on unmeasurable or undefined potential outcomes. In particular, although our fractional factorial estimators do not use all treatment combinations, the estimand and aliasing structure both depend on the potential outcomes for the unobserved treatment combinations being well-defined. Thus, it is important that subject-matter knowledge guides us in deciding which covariates are informative about which units have well-defined potential outcomes under all treatment combinations. That is, only some sub-populations may have all potential outcomes in the experiment be well-defined, similar to some analyses for noncompliance where the estimand is only well-defined for compliers. For other groups where not all treatment combinations could theoretically be received, there still may be interesting estimands based on the well-defined potential outcomes, but we do not explore those specifically in this work. A similar argument must hold for the unconfoundedness assumption, which is often assumed when reasonable covariate balance is achieved (Imbens and Rubin, 2015).

Remark. Under some assumptions, the estimand based on only well-defined potential outcomes may be the same as the original. For instance, if we have a $2^2$ experiment and there is no interaction between factor 1 and factor 2, then $\tau(1) = \bar{Y}(+1, +1) - \bar{Y}(-1, +1) = \bar{Y}(+1, -1) - \bar{Y}(-1, -1)$. Hence, even if we only observed one level of factor 2 but both levels of factor 1, we could recover τ(1). However, τ(1) would not be of scientific interest if not all potential outcomes are defined.

Achieving covariate balance in multiple treatment groups in non-randomized studies can be non-trivial. We discuss this issue further in the context of hypothetical fractional factorial designs in Section 5.2. Due to challenges in obtaining full balance, a first step might be to only obtain covariate balance between two treatment groups, a task commonly done in the causal inference literature, and compare outcomes for these groups. For instance, we could estimate the difference in outcomes between the units that were assigned level +1 for all factors and the units that were assigned −1 for all factors. Under certain assumptions, testing the difference in these groups can act as a global test for whether any effects of interest are significant. We discuss the test and assumptions that are needed for this simple comparison to be meaningful in Section 5.3. Once we have decided to recreate a particular fractional factorial design and have obtained a data set with covariate balance, estimation and inference can follow in a similar way to Section 3 or Section 4. We discuss some of the relative benefits of using fractional factorial vs. incomplete factorial designs in Section 5.4.

5.2 Covariate balance

An important stage when estimating the causal effect of non-randomized treatments is the design phase (Rubin, 2007, 2008). At this stage, we attempt to obtain a subset of units for which we can assume unconfoundedness. That is, units for which $P(W_i \mid Y_i(z), X_i) = P(W_i \mid X_i)$, where $X_i$ is an m-dimensional vector of covariates (Imbens and Rubin, 2015). Matching strategies are often used to ensure no evidence of covariate imbalance between treatment groups, as reviewed by Stuart (2010). Note that matching often involves removing units and so the statistical analysis is generally performed on a subset of the original study population. This implies that the population of units we are doing inference with respect to may change after trimming units.

In settings with multiple treatments, matching can be difficult. There have been extensions of propensity score balancing to multiple treatments, most notably the generalized propensity score (GPS) (Hirano and Imbens, 2004), and more recently the covariate balancing propensity score (CBPS) (Imai and Ratkovic, 2014) was introduced and shown to extend to multiple treatments. Lopez and Gutman (2017) review techniques, including matching, for observational studies with multiple treatments and, although they do not explore factorial (full or fractional) designs, it may be straightforward to extend their methods to this design. Nilsson (2013) discusses matching in the $2^2$ design. Recent work by Bennett et al. (2020) uses template matching, which matches units to a “template” population of units, for multiple treatments, though again not with factorial designs. Therefore, methods for creating covariate balance specifically for factorial type designs require further exploration. In our data illustration in Section 6, we employ sequential trimming and checks on covariate balance, as discussed below.

Testing for covariate imbalance across multiple treatment groups can be done using multivariate analysis of variance (MANOVA), which uses the covariance between variables to test for mean differences across treatment groups, as used in Branson et al. (2016). Recall that the factorial design has J treatment combinations. Define the following H and E matrices (Coombs and Algina, 1996):

$$H = \sum_{k=1}^{J} n_k (\bar{X}_k - \bar{X})(\bar{X}_k - \bar{X})^T,$$

$$E = \sum_{k=1}^{J} (n_k - 1) S_k, \quad \text{where } S_k = \frac{1}{n_k - 1} \sum_{j=1}^{n_k} (X_{kj} - \bar{X}_k)(X_{kj} - \bar{X}_k)^T,$$

where $\bar{X}_k$ is the m-dimensional vector of mean covariate values for treatment group k, $\bar{X}$ is the m-dimensional vector of mean covariate values for all units, that is $\bar{X} = \sum_{k=1}^{J} \frac{n_k}{n} \bar{X}_k$, and $X_{kj}$ is the m-dimensional vector of covariates for the jth unit in treatment group k. Denote by $\theta_k$ the ordered eigenvalues of $HE^{-1}$, where $k \in \{1, ..., s\}$ and $s = \min(m, J - 1)$. Standard MANOVA statistics, which can be used to test covariate balance, are typically functions of the eigenvalues of $HE^{-1}$ (Coombs and Algina, 1996). We chose the Wilks’ statistic (Wilks, 1932),

$$\text{Wilks} = \frac{|E|}{|H + E|} = \prod_{k=1}^{s} \frac{1}{1 + \theta_k},$$

where $\theta_k$ corresponds to the kth eigenvalue of $HE^{-1}$. As discussed in Imai et al. (2008), a potential drawback of testing for covariate imbalance is that as we drop units we lose power to detect deviations from the null hypothesis of no difference in covariates between the treatment groups.

Another diagnostic for covariate balance is checking covariate overlap via plots and other visual summaries of the data. So-called “Love plots”, which show standardized differences in covariate means between two treatment groups before and after adjustment (Ahmed et al., 2006), are difficult to generalize directly because of the multitude of treatment groups and comparisons. However, plots of standardized means or of distributions may be helpful to detect imbalance.
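As a concrete illustration of the balance statistic, the following sketch computes Wilks' statistic in its standard determinant form, |E| / |H + E|, from raw covariate vectors. It is a minimal standard-library sketch with hypothetical function names and made-up toy data, not the code used for the analysis in Section 6; a full MANOVA test would additionally convert the statistic to a p-value.

```python
def det(a):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(map(float, r)) for r in a]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[p][c] == 0.0:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def wilks_lambda(groups):
    """groups: one list per treatment group, each holding m-dimensional
    covariate vectors. Returns |E| / |H + E|."""
    m = len(groups[0][0])
    n = sum(len(g) for g in groups)
    means = [[sum(x[i] for x in g) / len(g) for i in range(m)] for g in groups]
    grand = [sum(len(g) * mu[i] for g, mu in zip(groups, means)) / n for i in range(m)]
    H = [[0.0] * m for _ in range(m)]
    E = [[0.0] * m for _ in range(m)]
    for g, mu in zip(groups, means):
        for a in range(m):
            for b in range(m):
                # Between-group term: n_k (Xbar_k - Xbar)(Xbar_k - Xbar)^T
                H[a][b] += len(g) * (mu[a] - grand[a]) * (mu[b] - grand[b])
                # Within-group term: (n_k - 1) S_k
                E[a][b] += sum((x[a] - mu[a]) * (x[b] - mu[b]) for x in g)
    HE = [[H[a][b] + E[a][b] for b in range(m)] for a in range(m)]
    return det(E) / det(HE)

# Toy data (hypothetical): three treatment groups, two covariates each.
toy = [
    [[1.0, 0.0], [2.0, 1.0], [3.0, 1.0], [4.0, 3.0]],
    [[2.0, 1.0], [3.0, 3.0], [5.0, 2.0], [6.0, 5.0]],
    [[1.0, 2.0], [2.0, 2.0], [4.0, 4.0], [5.0, 3.0]],
]
print(wilks_lambda(toy))
```

Values near one indicate little between-group variation in the covariate means relative to the within-group variation; values near zero indicate strong imbalance.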

5.3 Initial test for significance of effects

As discussed in the previous section, achieving covariate balance for more than two treatment groups can be a challenge. Therefore, instead of attempting to achieve balance among all treatment groups, a simple first step might be to examine two carefully chosen treatment groups and attempt to balance these two groups only. Obtaining balance between two treatment groups has been well-studied in causal inference; see Stuart (2010) for a review of common matching methods. Once significant covariate imbalance can be ruled out, we can test whether the mean difference between these two groups is significantly different from zero. But what can we learn from this comparison about our factorial effects? We have from Section 2.2 that

$$\bar{Y}(z_j) = \frac{1}{2} h_j \tau.$$

So when we subtract two observed means, assuming that the observed means are unbiased estimates of the true means (i.e., we have randomization or strong ignorability), we are estimating

$$\bar{Y}(z_j) - \bar{Y}(z_{j'}) = \frac{1}{2}(h_j - h_{j'})\tau,$$

which is the sum of terms that are signed differently in $h_j$ and $h_{j'}$. As an example for a $2^3$ design, we have the following matrix for $G^T$:

$$G^T = \begin{pmatrix} h_1 \\ h_2 \\ h_3 \\ h_4 \\ h_5 \\ h_6 \\ h_7 \\ h_8 \end{pmatrix} = \begin{pmatrix}
+1 & -1 & -1 & -1 & +1 & +1 & +1 & -1 \\
+1 & -1 & -1 & +1 & +1 & -1 & -1 & +1 \\
+1 & -1 & +1 & -1 & -1 & +1 & -1 & +1 \\
+1 & -1 & +1 & +1 & -1 & -1 & +1 & -1 \\
+1 & +1 & -1 & -1 & -1 & -1 & +1 & +1 \\
+1 & +1 & -1 & +1 & -1 & +1 & -1 & -1 \\
+1 & +1 & +1 & -1 & +1 & -1 & -1 & -1 \\
+1 & +1 & +1 & +1 & +1 & +1 & +1 & +1
\end{pmatrix}.$$

Now consider the difference $\bar{Y}(z_8) - \bar{Y}(z_1) = \bar{Y}(+1, +1, +1) - \bar{Y}(-1, -1, -1)$, which yields

$$\bar{Y}(z_8) - \bar{Y}(z_1) = \frac{1}{2}(h_8 - h_1)\tau = \tau(1) + \tau(2) + \tau(3) + \tau(123).$$

Hence, testing whether the difference between $\bar{Y}(z_8)$ and $\bar{Y}(z_1)$ is zero is the same as testing whether τ(1) + τ(2) + τ(3) + τ(123) is zero. If it is reasonable to assume, based on subject matter knowledge, that all main effects are of the same sign and that the three-factor interaction is also of the same sign or negligible, then this global test of whether there are any treatment effects is relevant to the estimand of interest. According to the effect hierarchy principle (Wu and Hamada, 2000), the main effects should dominate the three-factor interaction. Thus, even if the interaction differs in sign, we would expect to see an effect under this assumption if there is one. If the global null hypothesis is not rejected, then we would move on to recreating the entire factorial design. If it is unclear whether the signs of the effects are the same (effects of opposing signs are also referred to as antagonistic effects), then this global test would be inappropriate because effects could still be different from zero but their sum could be (close to) zero.

If we are particularly interested in the causal effects involving the first factor, we can create a test specifically for those effects. For instance, in the $2^3$ setting, we might use the estimand $\bar{Y}(z_8) - \bar{Y}(z_4)$ as a proxy for the effect of the first factor, as follows:

$$\bar{Y}(z_8) - \bar{Y}(z_4) = \bar{Y}(+1, +1, +1) - \bar{Y}(-1, +1, +1) = \frac{1}{2}(h_8 - h_4)\tau = \tau(1) + \tau(12) + \tau(13) + \tau(123). \qquad (8)$$

In Equation 8, if all terms have the same sign and we find that the difference is significantly different from zero, then we would conclude that factor 1 has an effect, either on its own or through interactions with the other factors. A different choice of levels for the other factors, for instance comparing the mean potential outcome when one factor is high vs. low when all other factors are at the low level, would result in different signs for the interactions. The choice of which levels to compare should be based on subject matter knowledge and reasonable assumptions.

To reiterate, throughout this section we focused on the meaning of several estimands, which are easily estimated under a randomized experiment. In an observational study, we would first need to obtain balance for any treatment groups we would be using in our estimator.
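The two-group contrasts in this section can be enumerated directly from $G^T$. The sketch below is illustrative only; the function name `contrast` is hypothetical, and it assumes the $2^3$ ordering of treatment combinations and effects used above. It returns the signed factorial effects that make up a difference of two cell means.

```python
from itertools import combinations, product
from math import prod

K = 3
combos = list(product([-1, +1], repeat=K))  # z1..z8
effects = [s for r in range(K + 1) for s in combinations(range(K), r)]
labels = ["".join(str(f + 1) for f in s) or "I" for s in effects]

def contrast(j, j_prime):
    """Signed effects in Ybar(z_j) - Ybar(z_j') = (1/2)(h_j - h_j') tau."""
    hj = [prod(combos[j - 1][f] for f in s) for s in effects]
    hk = [prod(combos[j_prime - 1][f] for f in s) for s in effects]
    # Entries of h_j - h_j' are 0 or +/-2, so halving gives 0 or +/-1.
    return {lab: (a - b) // 2 for lab, a, b in zip(labels, hj, hk) if a != b}

print(contrast(8, 1))  # {'1': 1, '2': 1, '3': 1, '123': 1}
print(contrast(8, 4))  # {'1': 1, '12': 1, '13': 1, '123': 1}
print(contrast(5, 1))  # {'1': 1, '12': -1, '13': -1, '123': 1}
```

The last call compares high vs. low levels of factor 1 with the other factors held at their low level, which flips the signs of the two-factor interactions relative to Equation 8.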

5.4 Comparing designs

So far our discussion of designs has focused on randomized experiments. However, there may be added complications in the observational setting that make one design more desirable than another. For instance, consider using a different design for each estimator as in Section 4. In doing so, we are able to use more of the data than a fractional factorial design, but this can also incur a cost. If we have no outcome measurement for only one treatment combination, we would use all $2^K - 1$ treatment combinations if we did an analysis for all effects and used a different design for each. This approach would require us to either first obtain balance among all $2^K - 1$ treatment groups or to obtain balance among the treatment groups within each design separately. The former option may be difficult; as the number of treatment combinations grows, obtaining covariate balance across all treatment groups becomes increasingly difficult and may result in smaller and smaller sample sizes, especially if trimming is used. The latter option will make joint inferences more challenging because different units would be used in each analysis.

Therefore, although these incomplete factorial designs may reduce the bias of our estimators, the fractional factorial design in which we are using the same $2^{K-p}$ rows may be more attractive in terms of obtaining covariate balance. The fractional factorial design also has the benefit of being a classical experimental design with an aliasing structure that is easy to understand.

6 Data illustration

6.1 Data description

Here we give an illustration of the implementation of our methods using data on pesticide exposure and body mass index (BMI). We use the 2003-2004 cycle of the National Health and Nutrition Examination Survey (NHANES) collected by the Centers for Disease Control and Prevention (CDC). We access the data via the R (R Core Team, 2017) package RNHANES (Susmann, 2016). We focus on four organochlorine pesticides as factors, each measured via a blood serum test and then dichotomized based on whether it was above (+1) or below (-1) the detection limit, as given in the NHANES dataset. Organochlorine pesticides are persistent in the environment and adverse health effects have been reported by the CDC (2009), making them an interesting group of pesticides to study. To keep this illustration simple, we chose to use only four pesticides and those were chosen primarily based on data availability and exposure rates. That is, we did not use those pesticides that were so common (or rare) that virtually everyone (or no one) in the data set was above the detection limit. The following are the four pesticides that were chosen: beta-Hexachlorocyclohexane (beta-Hex), heptachlor epoxide (Hept Epox), mirex, and p,p’-DDT. Previous findings of an association between pesticide exposures and body mass index (BMI) (Buser et al., 2014; Ranjbar et al., 2015) led us to choose BMI as the outcome of interest. BMI is the ratio between weight and height-squared. We removed 271 units with missing values of pesticide and BMI observations; because this is an illustration and is not intended to draw causal conclusions, we drop those units for simplicity. We also decided to study a non-farmer population, as farmers are more likely to be exposed to pesticides than the general population and may also differ on other unobserved covariates that affect health outcomes. This is our first step to achieving covariate balance, leaving a dataset with 1,259 observations (see Figure 5 in the Appendix for more).

6.2 Design stage

To show the process of how estimands change as we adjust our design stage, we consider different “designs.” First we consider analyzing the data as if it came from a $2^4$ factorial hypothetical experiment, without adjusting the sample to balance for covariates. Second, we instead analyze the data as if it came from a fractional factorial $2^{4-1}$ hypothetical experiment to assess how estimates change when going from a factorial design to a fractional factorial design. Finally, we trim units to obtain a fractional factorial $2^{4-1}$ hypothetical experiment with covariate overlap and with no evidence of covariate imbalance with respect to gender, age, and smoking status, to see how estimates change when a true design phase with the aim of obtaining covariate balance is implemented. We use simple trimming as the focus of this paper is not to compare different methods of balancing for covariates, though exploration of different balancing methods in this setting would be a worthwhile future endeavor. Note that we trim units to obtain better balance on gender, age, and smoking status as these are our first tier of most important covariates. Then additional adjustment for our second tier of covariates, race and ethnicity and income, is done via linear regression, as described in the following section. We also found that even with trimming, gender and smoking status were imbalanced across the treatment groups so these were also adjusted for in the linear regression.

There are a few notes on the definitions of these covariates. From now on we will refer to the “race and ethnicity” covariate as simply ethnicity, as a shorthand and to make clear that this is a single variable in the NHANES dataset. The income variable is defined as annual household income. For categorizing individuals as smokers vs. non-smokers, we use the question “Have you smoked at least 100 cigarettes in your life time?” and we categorized “Yes” and “Don’t know” as smokers. Note that only one observation with value “Don’t know” was recorded.

6.3 Statistical analysis

We analyzed the three datasets described in the design stage using: 1) a multiple linear regression that regresses BMI on treatment factors; 2) a multiple linear regression that regresses BMI on treatment factors, as well as the following covariates, as factors: ethnicity, income, gender, and smoking status; 3) Fisher-randomization tests of the sharp null hypothesis of no treatment effects. Recall from Sections 2.3 and 3.3 that regression estimates when including all factors and interactions, but not covariates, correspond to the Neymanian estimates divided by two. Figures 19 and 20 in Appendix D.3.2 show that further adjustment for ethnicity and income is needed, even after balancing gender, age, and smoking status. Additional data descriptions and the full statistical analyses are available in Appendix D. Due to the right-skewed nature of the weight variable, BMI exhibits some degree of right-skewness. Therefore we use log-transformed BMI as the outcome in these analyses.
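The relationship between the saturated regression coefficients and the Neymanian estimates can be checked on a small example. The sketch below uses made-up outcomes for a $2^2$ design with unequal group sizes (all numbers are hypothetical), solves the least squares normal equations exactly with the standard library, and compares the coefficients to Neymanian estimates computed as $2^{-(K-1)} g_k^T \bar{Y}^{obs}$, the form assumed here from the earlier sections.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A square, nonsingular)."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[p] = M[p], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(n):
            if r != c:
                M[r] = [vr - M[r][c] * vc for vr, vc in zip(M[r], M[c])]
    return [M[r][n] for r in range(n)]

# Hypothetical 2^2 data: outcomes by (factor 1, factor 2) level.
data = {(-1, -1): [20, 22], (-1, +1): [25, 27, 29], (+1, -1): [30], (+1, +1): [31, 33]}

# Saturated regression: intercept, factor 1, factor 2, and their interaction.
X = [[1, z1, z2, z1 * z2] for (z1, z2), ys in data.items() for _ in ys]
y = [v for ys in data.values() for v in ys]
XtX = [[sum(r[a] * r[b] for r in X) for b in range(4)] for a in range(4)]
Xty = [sum(r[a] * v for r, v in zip(X, y)) for a in range(4)]
beta = solve(XtX, Xty)

# Neymanian estimates from the observed cell means (K = 2, so divide by 2).
ybar = {z: Fraction(sum(ys), len(ys)) for z, ys in data.items()}
contrasts = (lambda z: z[0], lambda z: z[1], lambda z: z[0] * z[1])
neyman = [sum(g(z) * ybar[z] for z in ybar) / 2 for g in contrasts]

print([str(b) for b in beta[1:]])   # regression coefficients
print([str(t) for t in neyman])     # Neymanian estimates
```

Because the saturated model reproduces the observed cell means exactly, the halving holds regardless of the group sizes; it no longer holds in general once covariates are added or interactions are omitted.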

6.3.1 Full factorial design

Table 4 provides the counts of observations for each treatment combination (zj). There were 1,259 units in this dataset, although, when adjusting for covariates in the regression, units with missing covariate values were removed, resulting in 1,183 units. We see that factor combination 10 (z10) has only one observation. Relying on only one observation for a treatment combination will lead to unstable estimates. Hence, Section 6.3.2 aims to avoid this issue by embedding the observational study in a fractional factorial hypothetical experiment. Nonetheless, we perform the analysis of the $2^4$ factorial hypothetical experiment in this section and will compare the results to the analysis of the fractional factorial hypothetical experiment. Because there is only one unit assigned to z10, it is not possible to estimate Neymanian variances here. See Appendix D.1 for full analysis results.

                      Factor Levels
        beta-Hex   Hept Epox   Mirex   p,p’-DDT   Number of Obs.
z1         +1         +1        +1        +1           426
z2*        -1         +1        +1        +1            12
z3*        +1         -1        +1        +1            70
z4         -1         -1        +1        +1            51
z5*        +1         +1        -1        +1           291
z6         -1         +1        -1        +1            25
z7         +1         -1        -1        +1            94
z8*        -1         -1        -1        +1            54
z9*        +1         +1        +1        -1            21
z10        -1         +1        +1        -1             1
z11        +1         -1        +1        -1            19
z12*       -1         -1        +1        -1            19
z13        +1         +1        -1        -1            42
z14*       -1         +1        -1        -1            19
z15*       +1         -1        -1        -1            37
z16        -1         -1        -1        -1            78
           g1         g2        g3        g4          1259

Table 4: Counts of observations for each treatment combination of the pesticides with farmers removed for the factorial design. Treatment combinations marked with * (shown in red in the original) are used when recreating a fractional factorial design. +1 refers to exposure to the pesticide, -1 refers to no detectable exposure to the pesticide.

6.3.2 Fractional ($2^{4-1}$) factorial design

Recalling that I corresponds to the intercept, in the $2^{4-1}$ fractional factorial design, instead of using I = 1234, we chose I = −1234 to exclude row 10 in Table 4, which has only a single observation. The dataset in this hypothetical experiment consists of 523 observations. However, when adjusting for covariates in the regression, units with missing covariate values were removed, resulting in 488 units. In this design, aliasing is as follows: I = −1234, 4 = −123, 3 = −124, 2 = −134, 1 = −234, 12 = −34, 13 = −24, 14 = −23. The main effects are aliased with the negative of the three-factor interactions and the two-factor interactions are aliased with each other with reversed signs. In order to identify main effects, we will assume that the three-factor interactions are negligible. In practice, researchers should assess whether this aliasing assumption is realistic. See Appendix D.2 for full analysis results.
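The choice of half fraction and its aliasing pattern can be reproduced from the counts in Table 4. The following sketch is illustrative only (the variable and function names are hypothetical): it selects the treatment combinations whose four levels multiply to −1, which is the fraction defined by I = −1234, and lists the alias of each effect as minus its complementary effect.

```python
from math import prod

# Counts from Table 4, keyed by (beta-Hex, Hept Epox, Mirex, p,p'-DDT) levels.
counts = {
    (+1, +1, +1, +1): 426, (-1, +1, +1, +1): 12, (+1, -1, +1, +1): 70, (-1, -1, +1, +1): 51,
    (+1, +1, -1, +1): 291, (-1, +1, -1, +1): 25, (+1, -1, -1, +1): 94, (-1, -1, -1, +1): 54,
    (+1, +1, +1, -1): 21, (-1, +1, +1, -1): 1, (+1, -1, +1, -1): 19, (-1, -1, +1, -1): 19,
    (+1, +1, -1, -1): 42, (-1, +1, -1, -1): 19, (+1, -1, -1, -1): 37, (-1, -1, -1, -1): 78,
}

# I = -1234 keeps the rows whose four factor levels multiply to -1.
fraction = {z: n for z, n in counts.items() if prod(z) == -1}
print(len(fraction), sum(fraction.values()))  # 8 523

def alias(effect, K=4):
    """Alias of an effect under I = -12...K: minus the complementary effect."""
    rest = "".join(str(f) for f in range(1, K + 1) if str(f) not in effect)
    return "-" + (rest or "I")

for e in ["1", "2", "3", "4", "12", "13", "14"]:
    print(e, "=", alias(e))
```

The single-observation combination z10 has levels (−1, +1, +1, −1), whose product is +1, so it falls in the excluded half.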

6.3.3 Fractional factorial design with no evidence of covariate imbalance

We examined the distributions across treatments of the following covariates: gender (recorded as male vs. female), smoking status (smoker vs. non-smoker, as defined earlier), and age at the time of survey (in years). Figure 1 shows covariate imbalance with respect to gender, smoking, and age in the fractional factorial design.

We used a rejection approach that sequentially pruned the observations of the fractional factorial dataset until we found no evidence of imbalance across exposure groups with respect to gender, smoking status, and age. To test for covariate balance across treatments, we perform a MANOVA using the Wilks’ statistic (Wilks, 1932), as defined in Section 5.2. Figure 2 shows the covariate distribution for gender, smoking status, and age after trimming. We see that gender and smoking status are still imbalanced even after trimming, and hence were adjusted for in the linear regression that includes covariates (this is true for all “designs”). The first dataset that resulted in no evidence of covariate imbalance consisted of 169 observations, and the new treatment counts are presented in Table 5. After removing units with missing covariate values for the regression that adjusts for income and ethnicity, we had 158 units. Note that gender, smoking status, and age are our first tier of covariates and ethnicity and income constitute the second tier. In practice, balancing a large set of covariates is not always possible. Therefore, choosing the most important covariates should be done using subject matter knowledge. We see here that the number of units has been drastically reduced in our attempt to achieve covariate balance, a major challenge in this setting. In fact, one of the treatment groups only has three observations, a very small number. See Appendix D.3 for full analysis results.

                      Factor Levels                    Number of Obs.
        beta-Hex   Hept Epox   Mirex   p,p’-DDT   Original   Trimmed
z2         -1         +1        +1        +1          12         6
z3         +1         -1        +1        +1          70        22
z5         +1         +1        -1        +1         291       102
z8         -1         -1        -1        +1          54        10
z9         +1         +1        +1        -1          21         8
z12        -1         -1        +1        -1          19        10
z14        -1         +1        -1        -1          19         3
z15        +1         -1        -1        -1          37         8
           g1*        g2*       g3*       g4*        523       169

Table 5: Counts of observations for each treatment combination of the pesticides with farmers removed for the fractional factorial design, before and after trimming. +1 refers to exposure to the pesticide, -1 refers to no detectable exposure to the pesticide.

6.4 Results comparison across different conceptualized experiments and statistical approaches

Figure 3 shows a comparison of the regression estimates of the main effects across designs and statistical analyses. To compare the different methods we present univariate analyses. That is, we utilize individual tests for each main effect rather than joint tests, for better illustration of the different methods. Recall that the standard Fisherian and Neymanian point estimates are the unadjusted regression estimates multiplied by two. The bars show two standard errors above and below the estimate, calculated by the usual ordinary least squares formula, as the Neymanian variance estimates are not available for the full factorial design. In practice, adjustment for multiple comparisons should be considered. All methods and designs seem to agree on the positive effect of heptachlor epoxide and the negative effect of mirex on BMI. Although the full factorial estimates generally agree with the estimates of the two fractional factorial designs, differences in estimates of beta-Hexachlorocyclohexane (Beta-Hex) and p,p’-DDT may be due to the aliasing of the three-factor interactions with the main effects. However, it is also plausible that we have reduced our data in the fractional design to a subset of individuals with different average main effects than in the full data set.

Figure 4 shows a comparison of the significance of these estimates. The Fisher p-value is the p-value for the test of no effects of any pesticides, based on effect estimates for a given pesticide, which is suggested as a screening stage in Espinosa et al. (2016). We obtained low p-values for the main effect of mirex on BMI across all methods and designs. However, the p-values disagree for the other pesticides, especially the p-values testing the effects of p,p’-DDT and beta-Hexachlorocyclohexane. Note that the HC2/Neyman p-value is the significance based on the HC2 variance estimate (or Neyman variance estimate, as we have shown this estimator to be equivalent in settings with no covariates) and the Normal approximation. This p-value was only calculated for the model without covariates and was unavailable for the full factorial model due to limited data.

6.5 Discussion of data illustration

We have performed a data illustration to show the benefits and challenges of using our method and of working with observational data with multiple treatments in general. Here we outline some of the problems that we ran into, which could be improved in future analyses, and also some of our findings.

First, it is important to note that simplifications were made in the statistical analyses to focus on illustrating how researchers can capitalize on fractional factorial designs to estimate the main and interactive effects of multiple treatments in observational studies. For instance, we are aware that multiply imputing the missing data would have been more appropriate to provide valid estimates and inferences in the original study population. However, for simplicity we focused on the complete-case observations. We additionally did not adjust for all important hypothetical covariates, such as diet. Another consideration is that we log-transformed BMI to address the fact that BMI is a ratio and so its distribution tends to have heavy tails. However, heavy-tailed distributions could have been considered instead. We reiterate that we use this data analysis as an illustration and do not claim to draw valid causal conclusions here.

There were also some major challenges in working with this data and our method, mostly related to sample size. Trimming helps mitigate bias but greatly reduced our sample size, potentially leading to decreased power and precision. There was also a great reduction in sample size from using the fractional factorial design, which entailed dropping treatment combinations. We could have used an incomplete factorial type design, but this may have made the covariate balancing even more difficult, as discussed previously.

For p,p’-DDT, the full factorial design resulted in a lower p-value than the other designs, which could be an issue of aliased interactions watering down the effect in the fractional design. It could also have occurred because the populations are different in these two designs. For beta-Hexachlorocyclohexane, the covariate-balance-adjusted fractional factorial design differs, leading to a lower p-value for the association of beta-Hexachlorocyclohexane with BMI compared to the other designs. We are more inclined to trust the balanced design as this should have reduced bias. This result may indicate that there was some confounding that made the effect appear less significant before trimming. In fact, the stark contrast between the unbalanced and balanced fractional designs suggests that confounding may be to blame. Alternatively, by trimming we may have reduced our sample to a subpopulation where beta-Hexachlorocyclohexane has a larger effect on BMI than in the rest of the population.

7 Discussion

In this paper, we have proposed to embed observational studies with multiple treatments in fractional factorial hypothetical experiments. This type of design is useful in settings with many treatments, especially when some treatment combinations have few or no observations and the aliasing assumptions are plausible. Once we recreate a factorial or fractional factorial experiment in the design phase, we can use standard methods, extended as in Sections 2 and 3, to estimate causal effects of interest.

We first reviewed the basic setup for factorial and fractional factorial designs. Our work includes extensions of some of the known factorial design results for variance and regression to the fractional factorial setting. We also explored the use of incomplete factorial designs in the design-based potential outcome framework. A main contribution of this paper consists of extending these ideas to observational studies. This includes discussion of tests that can be performed before the full analysis. We have also discussed covariate balance complications that may arise when dealing with multiple non-randomized treatments in practice.

We illustrated these methods on a data set with pesticide exposure and BMI. This exemplifies the uses of our new methodology and identifies challenges that occur when working with observational data with multiple treatments. It is important to note that, using the small subset of the NHANES dataset, we do not intend to provide policy recommendations on pesticide use. In the general population, organochlorine pesticide exposure primarily occurs through diet (excluding those with farm-related jobs), particularly eating foods such as dairy products and fatty fish (Centers for Disease Control and Prevention (CDC), 2009). Without further adjustments for diet, we are not able to disentangle the causal effects of diet and pesticides. For instance, in our study individuals are likely to have been exposed to mirex largely through fish consumption (Agency for Toxic Substances and Disease Registry [ATSDR], 1995). Further studies could investigate BMI differences between a group of fish consumers with high levels of mirex and a “similar” group of fish consumers with low levels of mirex, where similar is with respect to important confounding variables. Indeed, it could be that eating fish causes individuals to have both high levels of mirex and lower BMI.

We have given a short overview of factorial and fractional factorial designs, as well as some other designs, in the potential outcomes framework. However, there are many aspects of these designs and classic analysis techniques that we did not cover. For instance, there are many nonregular design types, such as Plackett-Burman designs, that we could have explored more. Practitioners may also use variable selection and the principle of effect heredity to select their model for estimating factorial effects, including via Bayesian variable selection. See Wu and Hamada (2000) for more details on these methods from a more classical experimental design perspective.

We see many avenues of future exploration connected to our approach. For instance, coupling fractional factorial designs with a Bayesian framework would provide more statistical tools and would potentially offer different methodology for dealing with missing data. Additionally, development of balancing techniques for factorial designs with many treatment combinations should be an area of future exploration. A particular challenge is that as we increase the number of treatment combinations, and therefore treatment groups, matching becomes more and more difficult due to the increased dimensionality, and weighting methods may produce unstable estimators. One direction could also be to choose the design of the observational study based upon the ability to balance different treatment combinations. Random allocation designs (Dempster, 1960, 1961), in which randomness of the design is incorporated, could also be utilized in this framework. Finally, we could explore other causal estimands that may be of interest in observational studies with multiple treatments, such as those in Egami and Imai (2019) and De la Cuesta et al. (2019).

References

Agency for Toxic Substances and Disease Registry (ATSDR). (1995). Public Health Statement for Mirex and Chlordecone. Atlanta, GA: U.S. Department of Health and Human Services, Public Health Service. https://www.atsdr.cdc.gov/phs/phs.asp?id=1189&tid=276.

Ahmed, A., Husain, A., Love, T. E., Gambassi, G., Dell’Italia, L. J., Francis, G. S., Gheorghiade, M., Allman, R. M., Meleth, S., and Bourge, R. C. (2006). Heart failure, chronic diuretic use, and increase in mortality and hospitalization: an observational study using propensity score methods. European Heart Journal, 27(12):1431–1439.

Bennett, M., Vielma, J. P., and Zubizarreta, J. R. (2020). Building representative matched samples with multi-valued treatments in large observational studies. Journal of Computational and Graphical Statistics, 29(4):744–757.

Bind, M.-A. C. and Rubin, D. B. (2019). Bridging observational studies and randomized experiments by embedding the former in the latter. Statistical Methods in Medical Research, 28(7):1958–1978.

Bobb, J. F., Valeri, L., Claus Henn, B., Christiani, D. C., Wright, R. O., Mazumdar, M., Godleski, J. J., and Coull, B. A. (2015). Bayesian kernel machine regression for estimating the health effects of multi-pollutant mixtures. Biostatistics, 16(3):493–508.

Branson, Z., Dasgupta, T., and Rubin, D. B. (2016). Improving covariate balance in $2^K$ factorial designs via rerandomization with an application to a New York City Department of Education High School Study. Ann. of Appl. Statist., 10(4):1958–1976.

Buser, M. C., Murray, H. E., and Scinicariello, F. (2014). Association of urinary phenols with increased body weight measures and obesity in children and adolescents. The Journal of Pediatrics, 165(4):744–749.

Byar, D. P., Freedman, L. S., and Herzberg, A. M. (1995). Identifying which sets of parameters are simultaneously estimable in an incomplete factorial design. J. R. Statist. Soc. D, 44(4):451–456.

Byar, D. P., Herzberg, A. M., and Tan, W.-Y. (1993). Incomplete factorial designs for randomized clinical trials. Statist. Med., 12(17):1629–1641.

Centers for Disease Control and Prevention (CDC) (2009). Fourth Report on Human Exposure to Environmental Chemicals. Atlanta, GA: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention. https://www.cdc.gov/exposurereport/.

Centers for Disease Control and Prevention (CDC). National Center for Health Statistics (NCHS) (2003-2004). National Health and Nutrition Examination Survey Data. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention. https://www.cdc.gov/nchs/nhanes/. Accessed Oct. 1, 2018.

Coombs, W. T. and Algina, J. (1996). New test statistics for MANOVA/descriptive discriminant analysis. Educational and Psychological Measurement, 56(3):382–402.

Dasgupta, T., Pillai, N. S., and Rubin, D. B. (2015). Causal inference from $2^K$ factorial designs by using potential outcomes. J. R. Statist. Soc. B, 77(4):727–753.

Dasgupta, T. and Rubin, D. B. (2015). Harvard University STAT 240: Matched Sampling and Study Design lecture notes and draft textbook, Fall 2015.

De la Cuesta, B., Egami, N., and Imai, K. (2019). Improving the external validity of conjoint analysis: The essential role of profile distribution. Political Analysis, pages 1–27.

Dempster, A. (1960). Random allocation designs I: On general classes of estimation methods. The Annals of Mathematical Statistics, 31(4):885–905.

Dempster, A. (1961). Random allocation designs II: Approximate theory for simple random allocation. The Annals of Mathematical Statistics, 32(2):387–405.

Dong, N. (2015). Using propensity score methods to approximate factorial experimental designs to analyze the relationship between two variables and an outcome. American Journal of Evaluation, 36(1):42–66.

Egami, N. and Imai, K. (2019). Causal interaction in factorial experiments: Application to conjoint analysis. J. Am. Statist. Ass., 114(526):529–540.

Espinosa, V., Dasgupta, T., and Rubin, D. B. (2016). A Bayesian perspective on the analysis of unreplicated factorial experiments using potential outcomes. Technometrics, 58(1):62–73.

Fisher, R. A. (1930). Inverse probability. In Proceedings of the Cambridge Philosophical Society, volume 26, pages 528–535. Cambridge University Press.

Hirano, K. and Imbens, G. W. (2004). The propensity score with continuous treatments. In Gelman, A. and Meng, X., editors, Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, chapter 7, pages 73–84. Hoboken, N.J.: Wiley.

Holland, P. W. (1986). Statistics and causal inference. J. Am. Statist. Ass., 81(396):945–960.

Imai, K., King, G., and Stuart, E. A. (2008). Misunderstandings between experimentalists and observationalists about causal inference. J. R. Statist. Soc. A, 171(2):481–502.

Imai, K. and Ratkovic, M. (2014). Covariate balancing propensity score. J. R. Statist. Soc. B, 76(1):243–263.

Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, New York, NY.

Li, X. and Ding, P. (2017). General forms of finite population central limit theorems with applications to causal inference. Journal of the American Statistical Association, 112(520):1759–1769.

Li, X., Ding, P., and Rubin, D. B. (2020). Rerandomization in $2^K$ factorial experiments. The Annals of Statistics, 48(1):43–63.

Lopez, M. J. and Gutman, R. (2017). Estimation of causal effects with multiple treatments: A review and new ideas. Statist. Sci., 32(3):432–454.

Lu, J. (2016a). Covariate adjustment in randomization-based causal inference for $2^K$ factorial designs. Statistics & Probability Letters, 119:11–20.

Lu, J. (2016b). On randomization-based and regression-based inferences for $2^K$ factorial designs. Statistics & Probability Letters, 112:72–78.

Lu, J. and Deng, A. (2017). On randomization-based causal inference for matched-pair factorial designs. Statistics & Probability Letters, 125:99–103.

MacKinnon, J. G. and White, H. (1985). Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties. Journal of Econometrics, 29(3):305–325.

Montgomery, D. C. (2017). Design and analysis of experiments. Wiley, New York, NY.

Mukerjee, R., Dasgupta, T., and Rubin, D. B. (2018). Using standard tools from finite population sampling to improve causal inference for complex experiments. J. Am. Statist. Ass., 113(522):868–881.

Nilsson, M. (2013). Causal inference in a $2^2$ factorial design using generalized propensity score. Master’s thesis, Uppsala University, Sweden.

Oulhote, Y., Bind, M.-A. C., Coull, B., Patel, C. J., and Grandjean, P. (2017). Combining ensemble learning techniques and G-computation to investigate chemical mixtures in environmental studies. bioRxiv.

Patel, C. J., Bhattacharya, J., and Butte, A. J. (2010). An Environment-Wide Association Study (EWAS) on type 2 diabetes mellitus. PLoS ONE, 5(5):e10746.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Ranjbar, M., Rotondi, M. A., Ardern, C. I., and Kuk, J. L. (2015). The influence of urinary concentrations of organophosphate metabolites on the relationship between BMI and cardiometabolic health risk. Journal of Obesity, 2015. Article ID 687914.

Rosenbaum, P. R. (2002). Observational Studies. Springer, New York, NY, 2nd edition.

Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688–701.

Rubin, D. B. (1980). Comments on “Randomization Analysis of Experimental Data: The Fisher Randomization Test” by D. Basu. J. Am. Statist. Ass., 75(371):591–593.

Rubin, D. B. (2007). The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials. Statist. Med., 26(1):20–36.

Rubin, D. B. (2008). For objective causal inference, design trumps analysis. Ann. Appl. Statist., 2(3):808–840.

Samii, C. and Aronow, P. M. (2012). On equivalencies between design-based and regression-based variance estimators for randomized experiments. Statistics & Probability Letters, 82(2):365–370.

Splawa-Neyman, J. (1923/1990). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Annals of Agricultural Sciences, pages 101–151. [Translated to English and edited by D. M. Dabrowska and T. P. Speed in Statistical Science 5 (1990) 463–480.]

Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical Science, 25(1):1.

Susmann, H. (2016). RNHANES: Facilitates Analysis of CDC NHANES Data. R package version 1.1.0.

Valeri, L., Mazumdar, M. M., Bobb, J. F., Claus Henn, B., Rodrigues, E., Sharif, O. I. A., Kile, M. L., Quamruzzaman, Q., Afroz, S., Golam, M., Amarasiriwardena, C., Bellinger, D. C., Christiani, D. C., Coull, B. A., and Wright, R. O. (2017). The joint effect of prenatal exposure to metal mixtures on neurodevelopmental outcomes at 20-40 months of age: Evidence from rural Bangladesh. Environmental Health Perspectives, 125(6). CID:067015.

Wang, Y. H. (2000). Fiducial intervals: What are they? The American Statistician, 54(2):105–111.

Wilks, S. S. (1932). Certain generalizations in the analysis of variance. Biometrika, 24(3/4):471–494.

Wu, C. F. J. and Hamada, M. S. (2000). Experiments: Planning, Analysis, and Parameter Design Optimization. Wiley, New York, NY.

Zhao, A., Ding, P., Mukerjee, R., and Dasgupta, T. (2018). Randomization-based causal inference from split-plot designs. Ann. Statist., 46(5):1876–1903.

Figures

[Figure 1 image: three panels plotting Gender (percent Female/Male), Smoker (percent No/Yes), and Age against Treatment Combination (2, 3, 5, 8, 9, 12, 14, 15).]

Figure 1: Comparing covariates across treatment combinations in the $2^{4-1}$ fractional factorial design. Text labels give number of observations per group. For bar plots, the y-axis gives the percent within each treatment combination for each category of the covariate. For age, individuals with “>= 85 years of age” were set to 85 on the graph. Note that all individuals older than 85 were dropped in the covariate balance stage.

[Figure 2 image: three panels plotting Gender (percent Female/Male), Smoker (percent No/Yes), and Age against Treatment Combination (2, 3, 5, 8, 9, 12, 14, 15), after trimming.]

Figure 2: Comparing covariates across treatment combinations in the $2^{4-1}$ fractional factorial design after trimming. Text labels give number of observations per group. For bar plots, the y-axis gives the percent within each treatment combination for each category of the covariate. For age, individuals with “>= 85 years of age” were set to 85 on the graph. Note that all individuals older than 85 were dropped in the covariate balance stage.

[Figure 3 image, titled “Comparing Estimates”: estimates on the log BMI scale (y-axis) by Factor (Beta-Hex, Hept Epox, Mirex, p,p-DDT), for each Analysis (Regression; Regression with Covariate Adjustment) and Design (Full Factorial; Fractional Factorial; Trimmed Fractional Factorial).]

Figure 3: Plots of estimates of factorial effects (associations), on the log BMI scale. Bars indicate two standard errors (using standard OLS standard error estimates) above and below the point estimate.

[Figure 4 image, titled “Comparing Univariate p-values of Estimates”: p-value (y-axis) by Factor (Beta-Hex, Hept Epox, Mirex, p,p-DDT), for each Analysis (Regression; Regression with Covariate Adjustment; Fisher; HC2/Neyman) and Design (Full Factorial; Fractional Factorial; Trimmed Fractional Factorial).]

Figure 4: Plots of p-values of factorial effect (association) estimates, which are compared in Figure 3.

A Variance derivations

A.1 Variance and covariance of observed mean potential outcomes

This section gives results for the building blocks necessary to obtain results such as Equation 1 and Equation 4. We start with the variance and covariance of the treatment indicators, which give results related to those in Lemma 1 and 2 of Dasgupta et al. (2015). For $i \neq k$,

\begin{align*}
\mathrm{Cov}\left(W_i(z_j), W_k(z_j)\right) &= E\left[W_i(z_j) W_k(z_j)\right] - E\left[W_i(z_j)\right] E\left[W_k(z_j)\right] \\
&= P\left(W_k(z_j) = 1 \mid W_i(z_j) = 1\right) P\left(W_i(z_j) = 1\right) - \frac{n_j^2}{n^2} \\
&= \frac{n_j - 1}{n - 1} \cdot \frac{n_j}{n} - \frac{n_j^2}{n^2} \\
&= \frac{n_j (n_j - n)}{n^2 (n - 1)}.
\end{align*}

For $i \neq k$ and $j \neq h$,

\begin{align*}
\mathrm{Cov}\left(W_i(z_j), W_k(z_h)\right) &= E\left[W_i(z_j) W_k(z_h)\right] - E\left[W_i(z_j)\right] E\left[W_k(z_h)\right] \\
&= P\left(W_k(z_h) = 1 \mid W_i(z_j) = 1\right) P\left(W_i(z_j) = 1\right) - \frac{n_j n_h}{n^2} \\
&= \frac{n_h}{n - 1} \cdot \frac{n_j}{n} - \frac{n_j n_h}{n^2} \\
&= \frac{n_j n_h}{n^2 (n - 1)}.
\end{align*}

These results can be used directly to get the variance of $\bar{Y}^{obs}(z_j)$ and the covariance of $\bar{Y}^{obs}(z_j)$ and $\bar{Y}^{obs}(z_h)$. See Lu (2016b) for proof.
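The two indicator covariances above can be checked by brute force. The following is a minimal sketch, assuming complete randomization with fixed group sizes; the function name `indicator_covariances` and the group sizes in the usage note are hypothetical and chosen only to keep the enumeration small.

```python
from fractions import Fraction
from itertools import permutations


def indicator_covariances(group_sizes):
    """Exact Cov(W_0(z_j), W_1(z_h)) for two distinct units (0 and 1),
    obtained by enumerating every equally likely assignment."""
    # One label per unit: group j appears n_j times.
    labels = [j for j, size in enumerate(group_sizes) for _ in range(size)]
    # Distinct arrangements of the labels are the possible assignments.
    assignments = set(permutations(labels))
    total = len(assignments)

    def expect(indicator):
        return Fraction(sum(indicator(a) for a in assignments), total)

    J = len(group_sizes)
    cov = {}
    for j in range(J):
        for h in range(J):
            e_joint = expect(lambda a: int(a[0] == j and a[1] == h))
            e_first = expect(lambda a: int(a[0] == j))
            e_second = expect(lambda a: int(a[1] == h))
            cov[(j, h)] = e_joint - e_first * e_second
    return cov
```

For example, with hypothetical group sizes `[2, 1, 3]` (so n = 6), `indicator_covariances([2, 1, 3])[(0, 0)]` can be compared with $n_j(n_j - n)/(n^2(n-1))$ and the entry `(0, 1)` with $n_j n_h/(n^2(n-1))$. Exact fractions are used so the comparison does not depend on floating-point tolerance.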

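A.2 Variance for fractional factorial design

This section gives details on the derivation of the variance of our estimators of factorial effects under a fractional factorial design, as given in Equation 4. The proofs here are similar to those given in Dasgupta et al. (2015) and Lu (2016b). We first break down $\tilde{S}_k^2$ in the same way that Dasgupta et al. (2015) and Lu (2016b) broke down $S_k^2$, to show that we obtain similar results for the fractional case as for the full factorial. Let $g^*_{kj}$ be the jth element of vector $g^*_k$.

\begin{align*}
\tilde{S}_k^2 &= \frac{1}{n-1} \sum_{i=1}^{n} \left(\tilde{\tau}_i(k) - \tilde{\tau}(k)\right)^2 \\
&= \frac{1}{n-1} 2^{-2(K-p-1)} \sum_{i=1}^{n} \left(g_k^{*T} Y_{*i} - g_k^{*T} \bar{Y}_*\right)^2 \\
&= \frac{1}{n-1} 2^{-2(K-p-1)} \sum_{i=1}^{n} \left(\sum_{j=1}^{J'} g^*_{kj} \left(Y_i(z_j^*) - \bar{Y}(z_j^*)\right)\right)^2 \\
&= 2^{-2(K-p-1)} \left[\sum_{j=1}^{J'} g^{*2}_{kj} S^2(z_j^*) + \sum_{j} \sum_{h \neq j} g^*_{kj} g^*_{kh} S^2(z_j^*, z_h^*)\right]
\end{align*}

Now we find the variance of $\hat{\tau}^*(k)$ in a proof analogous to that given for balanced factorial designs in Dasgupta et al. (2015) and for unbalanced factorial designs in Lu (2016b).

\begin{align*}
\mathrm{Var}\left(\hat{\tau}^*(k)\right) &= \frac{1}{2^{2(K-p-1)}} g_k^{*T} \mathrm{Var}\left(\bar{Y}_*^{obs}\right) g_k^* \\
&= \frac{1}{2^{2(K-p-1)}} \left[\sum_{j=1}^{J'} g^{*2}_{kj} \mathrm{Var}\left(\bar{Y}^{obs}(z_j^*)\right) + \sum_{j} \sum_{h \neq j} g^*_{kj} g^*_{kh} \mathrm{Cov}\left(\bar{Y}^{obs}(z_j^*), \bar{Y}^{obs}(z_h^*)\right)\right] \\
&= \frac{1}{2^{2(K-p-1)}} \left[\sum_{j=1}^{J'} \frac{n - n_j^*}{n n_j^*} S^2(z_j^*) - \frac{1}{n} \sum_{j} \sum_{h \neq j} g^*_{kj} g^*_{kh} S^2(z_j^*, z_h^*)\right] \\
&= \frac{1}{2^{2(K-p-1)}} \sum_{j=1}^{J'} \left(\frac{n - n_j^*}{n n_j^*} + \frac{1}{n}\right) S^2(z_j^*) - \frac{1}{n} \tilde{S}_k^2 \\
&= \frac{1}{2^{2(K-p-1)}} \sum_{j=1}^{J'} \frac{1}{n_j^*} S^2(z_j^*) - \frac{1}{n} \tilde{S}_k^2
\end{align*}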
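As a rough illustration of how the first term of this variance could be estimated from data, the sketch below plugs sample variances $s^2(z_j^*)$ into the sum and drops the $\tilde{S}_k^2$ term, which cannot be estimated from the observed data. The function name and arguments are hypothetical, and the toy numbers in the usage note are not from the NHANES analysis.

```python
import numpy as np


def neyman_effect_and_variance(y, cell, g, K, p):
    """Point estimate of a factorial effect and the estimable part of its
    variance in a 2^(K-p) design.

    y    : outcomes, one per unit
    cell : integer index (0..J'-1) of each unit's treatment combination
    g    : contrast vector of +1/-1 entries, one per treatment combination
    """
    y = np.asarray(y, dtype=float)
    cell = np.asarray(cell)
    g = np.asarray(g, dtype=float)
    J = len(g)
    # Per-cell sample sizes, means, and sample variances (divisor n_j - 1).
    n_j = np.array([(cell == j).sum() for j in range(J)])
    means = np.array([y[cell == j].mean() for j in range(J)])
    s2 = np.array([y[cell == j].var(ddof=1) for j in range(J)])
    scale = 2.0 ** (-(K - p - 1))
    tau_hat = scale * (g @ means)
    # Plug-in for the first term only; the S-tilde term is not estimable.
    var_hat = scale ** 2 * np.sum(s2 / n_j)
    return tau_hat, var_hat
```

For instance, with K = 3, p = 1, two units in each of four cells, outcomes `[1, 3, 2, 4, 5, 7, 6, 10]`, and contrast `[-1, -1, 1, 1]`, the cell means are 2, 3, 6, 8 and the sample variances are 2, 2, 2, 8, so the function returns an effect estimate of 4.5 and a variance estimate of 1.75. Each cell needs at least two units for the sample variance to be defined.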

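A.3 Covariance for fractional factorial design

In this section we find the covariance between two factorial effect estimates from a fractional factorial design. We again closely follow proofs from Dasgupta et al. (2015) and Lu (2016b), showing that the same processes work in the fractional factorial case. First we break down $\tilde{S}^2_{k,k'}$.

\begin{align*}
\tilde{S}^2_{k,k'} &= \frac{1}{n-1} \sum_{i=1}^{n} \left(\tilde{\tau}_i(k) - \tilde{\tau}(k)\right) \left(\tilde{\tau}_i(k') - \tilde{\tau}(k')\right) \\
&= \frac{1}{n-1} 2^{-2(K-p-1)} \sum_{i=1}^{n} \left(g_k^{*T} Y_{*i} - g_k^{*T} \bar{Y}_*\right) \left(g_{k'}^{*T} Y_{*i} - g_{k'}^{*T} \bar{Y}_*\right) \\
&= \frac{1}{n-1} 2^{-2(K-p-1)} \sum_{i=1}^{n} \left(\sum_{j=1}^{J'} g^*_{kj} \left(Y_i(z_j^*) - \bar{Y}(z_j^*)\right)\right) \left(\sum_{j=1}^{J'} g^*_{k'j} \left(Y_i(z_j^*) - \bar{Y}(z_j^*)\right)\right) \\
&= 2^{-2(K-p-1)} \left(\sum_{j=1}^{J'} g^*_{kj} g^*_{k'j} S^2(z_j^*) + \sum_{j} \sum_{h \neq j} g^*_{kj} g^*_{k'h} S^2(z_j^*, z_h^*)\right)
\end{align*}

\begin{align*}
\mathrm{Cov}\left(\hat{\tau}^*(k), \hat{\tau}^*(k')\right) &= 2^{-2(K-p-1)} g_k^{*T} \mathrm{Var}\left(\bar{Y}_*^{obs}\right) g_{k'}^* \\
&= \frac{1}{2^{2(K-p-1)}} \left[\sum_{j=1}^{J'} g^*_{kj} g^*_{k'j} \mathrm{Var}\left(\bar{Y}^{obs}(z_j^*)\right) + \sum_{j} \sum_{h \neq j} g^*_{kj} g^*_{k'h} \mathrm{Cov}\left(\bar{Y}^{obs}(z_j^*), \bar{Y}^{obs}(z_h^*)\right)\right] \\
&= \frac{1}{2^{2(K-p-1)}} \left[\sum_{j=1}^{J'} \frac{n - n_j^*}{n n_j^*} g^*_{kj} g^*_{k'j} S^2(z_j^*) - \frac{1}{n} \sum_{j} \sum_{h \neq j} g^*_{kj} g^*_{k'h} S^2(z_j^*, z_h^*)\right] \\
&= \frac{1}{2^{2(K-p-1)}} \left[\sum_{j=1}^{J'} \left(\frac{n - n_j^*}{n n_j^*} + \frac{1}{n}\right) g^*_{kj} g^*_{k'j} S^2(z_j^*)\right] - \frac{1}{n} \tilde{S}^2_{k,k'} \\
&= \frac{1}{2^{2(K-p-1)}} \left[\sum_{j=1}^{J'} \frac{1}{n_j^*} g^*_{kj} g^*_{k'j} S^2(z_j^*)\right] - \frac{1}{n} \tilde{S}^2_{k,k'} \\
&= \frac{1}{2^{2(K-p-1)}} \left[\sum_{j: g^*_{kj} = g^*_{k'j}} \frac{1}{n_j^*} S^2(z_j^*) - \sum_{j: g^*_{kj} \neq g^*_{k'j}} \frac{1}{n_j^*} S^2(z_j^*)\right] - \frac{1}{n} \tilde{S}^2_{k,k'}
\end{align*}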

B Relating linear regression estimators to Neyman estimators in the fractional factorial design

This section gives a brief overview of a proof that the linear regression point estimates are the same, up to a factor of one half, as the Neymanian estimates for the fractional factorial design. For proofs of these results for the full factorial design, see Lu (2016b). The linear regression coefficient estimate is $\hat{\beta} = (X^T X)^{-1} X^T Y^{obs}$, where X is an $n \times 2^{K-p}$ matrix whose columns correspond first to an intercept, then to the main effects, then to the second-order interactions not aliased with the main effects, and so on, such that no two columns are aliased. For instance, for the $2^{3-1}$ design laid out in Section 3, X would be a design matrix with a first column of 1's and the remaining columns corresponding to levels of the first, second, and third factors. No interactions would be included in this example because each of the two-factor interactions is aliased with a main effect and the three-factor interaction is aliased with the intercept, which are all already included in the model. Thus the design looks a bit like the columns $g_1^*$, $g_2^*$, and $g_3^*$ of Table 2, but with repeated rows for each of the units assigned to the same treatment combination. We need $(X^T X)^{-1} X^T X = I$. Denote $B = (X^T X)^{-1} X^T$.

\begin{align*}
(BX)_{ij} = b_{i\cdot} f_j
\end{align*}

where $b_{i\cdot}$ is the ith row of B and $f_j$ is the jth column of X. Using notation from Section 5.3, let $h_j^*$ be the jth row of $G^{*T}$, which is an expanded version of $z_j^*$ that includes elements for interactions and the intercept. Then $f_{ji}$, the ith element of $f_j$, is $h^*_{kj}$, the jth element of $h^*_k$, where k is the treatment combination for the ith individual. It must be true that $(BX)_{ii} = 1$ and $(BX)_{ij} = 0$ for $i \neq j$. Consider letting the ith row of B be $b_{i\cdot} = \frac{1}{2^{K-p}} (\tilde{n}^{-1} \circ f_i)^T$, where $\tilde{n}^{-1}$ is the vector whose ith entry is $\frac{1}{n_j^*}$, where j is the number of the treatment combination that the ith unit is assigned to ($j = \sum_{k=1}^{J'} k W_i(z_k)$). Then we have

\begin{align*}
(BX)_{kk} &= b_{k\cdot} f_k \\
&= \frac{1}{2^{K-p}} (\tilde{n}^{-1} \circ f_k)^T f_k \\
&= \frac{1}{2^{K-p}} (\tilde{n}^{-1})^T (f_k \circ f_k) \\
&= \frac{1}{2^{K-p}} \sum_{j=1}^{J'} \sum_{i: W_i(z_j^*) = 1} \frac{1}{n_j^*} \\
&= \frac{1}{J'} \sum_{j} 1 \\
&= 1.
\end{align*}

For $k \neq j$,

\begin{align*}
(BX)_{kj} &= b_{k\cdot} f_j \\
&= \frac{1}{2^{K-p}} (\tilde{n}^{-1} \circ f_k)^T f_j \\
&= \frac{1}{2^{K-p}} (\tilde{n}^{-1})^T (f_k \circ f_j) \\
&= \frac{1}{2^{K-p}} \sum_{s=1}^{J'} \frac{1}{n_s^*} \sum_{i: W_i(z_s^*) = 1} h^*_{sk} h^*_{sj} \\
&= \frac{1}{2^{K-p}} \left(\sum_{s: h^*_{sk} = h^*_{sj}} 1 - \sum_{s: h^*_{sk} \neq h^*_{sj}} 1\right) \\
&= 0.
\end{align*}

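So the kth row of B is $\frac{1}{2^{K-p}} (\tilde{n}^{-1} \circ f_k)^T$. This means that

\begin{align*}
\hat{\beta}_k &= \left[(X^T X)^{-1} X^T Y^{obs}\right]_k \\
&= \left(\frac{1}{2^{K-p}} \tilde{n}^{-1} \circ f_k\right)^T Y^{obs} \\
&= \frac{1}{2^{K-p}} \sum_{j=1}^{J'} \frac{1}{n_j^*} \sum_{i: W_i(z_j^*) = 1} h^*_{jk} Y_i(z_j^*) \\
&= \frac{1}{2^{K-p}} \sum_{j=1}^{J'} h^*_{jk} \bar{Y}^{obs}(z_j^*) \\
&= \frac{1}{2^{K-p}} g_k^{*T} \bar{Y}_*^{obs} \\
&= \frac{1}{2} \hat{\tau}^*(k).
\end{align*}

So, indeed, the linear regression estimates are one half of the factorial effect estimates.

Now let us consider the HC2 variance estimator. It has the form (MacKinnon and White, 1985)

\begin{align*}
\widehat{\mathrm{Var}}_{HC2}(\hat{\beta}) = (X^T X)^{-1} X^T \hat{\Omega} X (X^T X)^{-1}
\end{align*}

where $\hat{\Omega} = \mathrm{diag}\left(\frac{\hat{e}_i^2}{1 - h_{ii}}\right)$, with $\hat{e}_i$ being the residual for observation i and $h_{ii}$ being the iith value of the hat matrix, $X (X^T X)^{-1} X^T$. For discussion of this estimator in the single treatment case and the factorial case, see Samii and Aronow (2012) and Lu (2016b), respectively. We use similar ideas to those papers in the arguments below.

If unit i was assigned to treatment $z_k^*$, then the ith column of $(X^T X)^{-1} X^T$ is $b_{\cdot i} = \frac{1}{2^{K-p}} \frac{1}{n_k^*} h_k^{*T}$. We have

\begin{align*}
\left[X (X^T X)^{-1} X^T\right]_{ii} &= \frac{1}{2^{K-p}} \frac{1}{n_k^*} h_k^* h_k^{*T} \\
&= \frac{1}{2^{K-p}} \frac{1}{n_k^*} 2^{K-p} \\
&= \frac{1}{n_k^*}.
\end{align*}

So then $1 - h_{ii} = 1 - \frac{1}{n_k^*} = \frac{n_k^* - 1}{n_k^*}$. This in turn means that

\begin{align*}
\frac{\hat{e}_i^2}{1 - h_{ii}} = \frac{n_k^* \left(Y_i^{obs} - \bar{Y}^{obs}(z_k^*)\right)^2}{n_k^* - 1}.
\end{align*}

Now we can solve for the whole expression of $\widehat{\mathrm{Var}}_{HC2}(\hat{\beta})$. We focus on the diagonal entries.

\begin{align*}
\left[\widehat{\mathrm{Var}}_{HC2}(\hat{\beta})\right]_{kk} &= \left[(X^T X)^{-1} X^T\right]_{k\cdot} \mathrm{diag}\left(\frac{\hat{e}_j^2}{1 - h_{jj}}\right) \left[X (X^T X)^{-1}\right]_{\cdot k} \\
&= \left(\frac{1}{2^{K-p}} \tilde{n}^{-1} \circ f_k\right)^T \mathrm{diag}\left(\frac{\hat{e}_j^2}{1 - h_{jj}}\right) \left(\frac{1}{2^{K-p}} \tilde{n}^{-1} \circ f_k\right) \\
&= \left(\frac{1}{2^{K-p}} s_1 \circ f_k\right)^T \left(\frac{1}{2^{K-p}} \tilde{n}^{-1} \circ f_k\right) \\
&= \left(\frac{1}{2^{K-p}} s_1\right)^T \left(\frac{1}{2^{K-p}} \tilde{n}^{-1} \circ f_k \circ f_k\right) \\
&= \frac{1}{2^{2(K-p)}} (s_1)^T \tilde{n}^{-1} \\
&= \frac{1}{2^{2(K-p)}} \sum_{j=1}^{J'} \frac{1}{n_j^*} \sum_{i: W_i(z_j^*) = 1} \frac{\left(Y_i(z_j^*) - \bar{Y}^{obs}(z_j^*)\right)^2}{n_j^* - 1} \\
&= \frac{1}{2^{2(K-p)}} \sum_{j} \frac{1}{n_j^*} s^2(z_j^*)
\end{align*}

where $s_1$ is the vector whose ith entry, given unit i is assigned treatment combination $z_k^*$, is $\frac{\left(Y_i(z_k^*) - \bar{Y}^{obs}(z_k^*)\right)^2}{n_k^* - 1}$. Thus, we have that $\left[\widehat{\mathrm{Var}}_{HC2}(\hat{\beta})\right]_{kk}$ is 1/4 times the Neyman-style variance estimator.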
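The two equivalences in this appendix can be checked numerically. The sketch below is only an illustration under assumed inputs: a $2^{3-1}$ design with defining relation I = 123, hypothetical unbalanced cell sizes, and simulated outcomes. It computes the OLS coefficients and HC2 variances directly from the matrix formulas and compares them with the Neyman-style quantities.

```python
import numpy as np

rng = np.random.default_rng(0)

# 2^(3-1) design with I = 123: factor 3 is the product of factors 1 and 2.
cells = np.array([[-1, -1, +1],
                  [-1, +1, -1],
                  [+1, -1, -1],
                  [+1, +1, +1]])
n_per_cell = np.array([3, 5, 4, 6])           # hypothetical, unbalanced
rows = np.repeat(np.arange(4), n_per_cell)    # treatment combination of each unit
X = np.column_stack([np.ones(len(rows)), cells[rows]])  # intercept + main effects
y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(size=len(rows))  # simulated

# OLS coefficients: (X'X)^{-1} X'y.
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y

# HC2: (X'X)^{-1} X' diag(e_i^2 / (1 - h_ii)) X (X'X)^{-1}.
resid = y - X @ beta
leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
omega = resid ** 2 / (1.0 - leverage)
V_hc2 = XtX_inv @ (X.T * omega) @ X @ XtX_inv

# Neyman-style estimates built from cell means and cell sample variances.
means = np.array([y[rows == j].mean() for j in range(4)])
s2 = np.array([y[rows == j].var(ddof=1) for j in range(4)])
G = np.column_stack([np.ones(4), cells])      # contrast columns, intercept first
tau_hat = G.T @ means / 2.0                   # 2^{-(K-p-1)} = 1/2 here
var_neyman = np.sum(s2 / n_per_cell) / 4.0    # 2^{-2(K-p-1)} = 1/4 here

print(np.allclose(beta, tau_hat / 2.0))
print(np.allclose(np.diag(V_hc2), var_neyman / 4.0))
```

Both comparisons should print `True`: each coefficient is half the corresponding factorial effect estimate, and every diagonal HC2 entry is a quarter of the Neyman-style variance estimate. The leverages equal $1/n_k^*$, which is why the calculation requires at least two units per treatment combination.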

C Incomplete factorial designs

C.1 Inference

This section discusses alternative incomplete factorial designs that one might use. See Byar et al. (1993) for more discussion on incomplete factorial designs. This discussion also applies more generally to recreating fractional designs in observational settings where we are using a subset of treatment combinations present in the data. We follow the same outline as the proofs from Section A.2.

In these incomplete factorial designs, for each main effect or other estimand of interest, we define a new experimental design tailored to estimating that effect. When this is done, we then analyze the data as if the treatment groups within the new design are the only possible treatment groups.

First we need to introduce some notation. Let $\dot{g}_j$ be the same as $g_j$ but with zero elements corresponding to treatment combinations that are not included in the particular design in use. Let $2^m$ treatment groups be used in the design, with half assigned to the +1 level of factor one and half assigned to the −1 level of factor one. Let $\dot{\tau}(k) = 2^{-(m-1)} \dot{g}_k^T \bar{Y}$. Then we can do a breakdown of $\dot{S}_k^2$, defined in the first line below:

\begin{align*}
\dot{S}_k^2 &= \frac{1}{n-1} \sum_{i=1}^{n} \left(\dot{\tau}_i(k) - \dot{\tau}(k)\right)^2 \\
&= \frac{1}{n-1} 2^{-2(m-1)} \sum_{i=1}^{n} \left(\dot{g}_k^T Y_i - \dot{g}_k^T \bar{Y}\right)^2 \\
&= \frac{1}{n-1} 2^{-2(m-1)} \sum_{i=1}^{n} \left(\sum_{j=1}^{J} \dot{g}_{kj} \left(Y_i(z_j) - \bar{Y}(z_j)\right)\right)^2 \\
&= 2^{-2(m-1)} \left[\sum_{j=1}^{J} \dot{g}_{kj}^2 S^2(z_j) + \sum_{j} \sum_{h \neq j} \dot{g}_{kj} \dot{g}_{kh} S^2(z_j, z_h)\right].
\end{align*}

Then we have

\begin{align*}
\mathrm{Var}\left(\hat{\dot{\tau}}(k)\right) &= \frac{1}{2^{2(m-1)}} \dot{g}_k^T \mathrm{Var}\left(\bar{Y}^{obs}\right) \dot{g}_k \\
&= \frac{1}{2^{2(m-1)}} \left[\sum_{j=1}^{J} \dot{g}_{kj}^2 \mathrm{Var}\left(\bar{Y}^{obs}(z_j)\right) + \sum_{j} \sum_{h \neq j} \dot{g}_{kj} \dot{g}_{kh} \mathrm{Cov}\left(\bar{Y}^{obs}(z_j), \bar{Y}^{obs}(z_h)\right)\right] \\
&= \frac{1}{2^{2(m-1)}} \left[\sum_{j=1}^{J} \dot{g}_{kj}^2 \frac{n - n_j}{n n_j} S^2(z_j) - \frac{1}{n} \sum_{j} \sum_{h \neq j} \dot{g}_{kj} \dot{g}_{kh} S^2(z_j, z_h)\right] \\
&= \frac{1}{2^{2(m-1)}} \sum_{j=1}^{J} \dot{g}_{kj}^2 \left(\frac{n - n_j}{n n_j} + \frac{1}{n}\right) S^2(z_j) - \frac{1}{n} \dot{S}_k^2 \\
&= \frac{1}{2^{2(m-1)}} \sum_{j=1}^{J} \dot{g}_{kj}^2 \frac{1}{n_j} S^2(z_j) - \frac{1}{n} \dot{S}_k^2.
\end{align*}

An important note is that in terms of estimation of this variance, whether we analyze the data as if the treatment levels in this design are the only possible treatment combinations or we keep the assumption that units can be assigned to any possible treatment combination (which aids in the interpretation and inference), the variance estimator will be the same. This is because we can only estimate the first term in this expression, which only involves the specific treatment levels in this design.
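To make the zeroed-out contrast concrete, here is a minimal sketch of the point estimate and the estimable first term of the variance for an incomplete design. The function name, the argument layout (per-combination means, sample variances, and sizes), and the balance check are assumptions made for illustration rather than part of the derivation.

```python
import numpy as np


def incomplete_design_estimate(means, s2, n_j, g, included):
    """Effect estimate and estimable variance term using only the treatment
    combinations flagged in `included` (assumed to number 2^m, half at each
    level of the contrast g)."""
    g = np.asarray(g, dtype=float)
    included = np.asarray(included, dtype=bool)
    means = np.asarray(means, dtype=float)
    s2 = np.asarray(s2, dtype=float)
    n_j = np.asarray(n_j, dtype=float)

    # g-dot: the contrast with zeros for combinations left out of the design.
    g_dot = np.where(included, g, 0.0)
    if g_dot.sum() != 0:
        raise ValueError("included combinations must be balanced across levels")

    m = int(round(np.log2(included.sum())))
    if 2 ** m != included.sum():
        raise ValueError("number of included combinations must be a power of two")
    scale = 2.0 ** (-(m - 1))

    # Sum over included combinations only, so unobserved cells may hold NaN.
    idx = np.flatnonzero(included)
    tau_hat = scale * np.sum(g_dot[idx] * means[idx])
    var_hat = scale ** 2 * np.sum(g_dot[idx] ** 2 * s2[idx] / n_j[idx])
    return tau_hat, var_hat
```

As a usage example with three factors, the contrast for factor one is `[-1, -1, -1, -1, 1, 1, 1, 1]`; keeping only combinations 0, 1, 4, and 5 gives m = 2, so the estimate is one half of the signed sum of the four retained cell means.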

C.2 Variance of estimators for incomplete factorial designs

Consider comparing two designs. The first involves $J^*$ treatment groups and the second involves $\tilde{J} < J^*$ treatment groups. Let's assume that for all treatment groups $n_j = c$ and $S^2(z_j) = S^2$. That is, all treatment groups are the same size and effects are additive so that the variance of potential outcomes in each treatment group is the same. Then for the first design we have variance of

\begin{align*}
\frac{1}{J^{*2}} \sum_{j=1}^{J^*} \frac{1}{n_j} S^2(z_j) - \frac{1}{n} S_k^2 = \frac{1}{c J^*} S^2.
\end{align*}

For the second design we have

\begin{align*}
\frac{1}{\tilde{J}^2} \sum_{j=1}^{\tilde{J}} \frac{1}{n_j} S^2(z_j) - \frac{1}{n} S_k^2 = \frac{1}{c \tilde{J}} S^2.
\end{align*}

So in this setting the design with more treatment groups will have lower variance, which is intuitive. However, in general, if we do not have the additive treatment effect assumption, it is possible that the design with more treatment groups includes treatment groups that are much more variable, and therefore the variance of the estimator is actually larger.
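As a quick numerical check of this comparison, take hypothetical values $c = 10$, $S^2 = 4$, $J^* = 8$, and $\tilde{J} = 4$ (illustrative numbers only, not taken from the data). Under the equal-size and additivity assumptions above, the two variances are

\begin{align*}
\frac{1}{c J^*} S^2 = \frac{4}{10 \cdot 8} = 0.05, \qquad \frac{1}{c \tilde{J}} S^2 = \frac{4}{10 \cdot 4} = 0.1,
\end{align*}

so halving the number of treatment groups doubles the variance in this simplified setting.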

C.3 Regression with missing levels

This section discusses what will result from a standard regression interacting all factors when not all treatment combinations are observed in the data. If the dataset is missing m treatment combinations, then the regression will be able to estimate the first $2^K - m$ effects (including interactions) that are not aliased, and the rest will be removed due to collinearity of the matrix columns. Then for each effect, there will be some aliasing structure imposed, but the same “design” will not necessarily be used for each factor.

To explore this scenario more, let's take the specific example of three factors where we only observe five of the eight treatment combinations. For simplicity, let there be one observation for each treatment combination and let the model matrix be as follows:

\begin{align*}
X = \begin{pmatrix}
1 & -1 & -1 & +1 & +1 \\
1 & -1 & +1 & -1 & -1 \\
1 & +1 & -1 & -1 & -1 \\
1 & +1 & +1 & +1 & +1 \\
1 & -1 & -1 & -1 & +1
\end{pmatrix}.
\end{align*}

The first column corresponds to the intercept, the second through fourth columns give the levels for the three factors, and the fifth column corresponds to the interaction between the first and second factors. Note that the first four rows correspond to a $2^{3-1}$ design, so it is possible to recreate that design here. We have

\begin{align*}
(X^T X)^{-1} X^T = \begin{pmatrix}
0.25 & 0.25 & 0.25 & 0.25 & 0 \\
-0.25 & -0.25 & 0.25 & 0.25 & 0 \\
-0.25 & 0.25 & -0.25 & 0.25 & 0 \\
0.5 & 0 & 0 & 0 & -0.5 \\
-0.25 & -0.25 & -0.25 & 0.25 & 0.5
\end{pmatrix}.
\end{align*}

Recalling that $\hat{\beta} = (X^T X)^{-1} X^T Y^{obs}$, the first three rows correspond to the estimates we would get using the fractional factorial design with the defining relation I = 123, since they place zero weight on the fifth observation. The last two estimates, for factor 3 and the interaction between factors 1 and 2, have a different aliasing structure. In particular, the aliasing on factor 3 is similar to aliasing structures in Section 5.3 and we can find that factor 3 will be aliased with the two-factor interactions 13 and 23 as well as the three-factor interaction.
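The matrix above is easy to reproduce. The short sketch below rebuilds the five-row model matrix from this example and computes $(X^T X)^{-1} X^T$ with NumPy; the variable names are arbitrary.

```python
import numpy as np

# Columns: intercept, factor 1, factor 2, factor 3, interaction of factors 1 and 2.
X = np.array([[1, -1, -1, +1, +1],
              [1, -1, +1, -1, -1],
              [1, +1, -1, -1, -1],
              [1, +1, +1, +1, +1],
              [1, -1, -1, -1, +1]], dtype=float)

# Row k of B holds the weights that the k-th coefficient places on each observation.
B = np.linalg.inv(X.T @ X) @ X.T
print(np.round(B, 2))
```

The intercept and the coefficients for factors 1 and 2 place zero weight on the fifth observation, so they coincide with the fractional factorial estimates from the first four rows, while the remaining two coefficients use the fifth observation.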

D Data illustration

This section gives some additional descriptors for the data. Figure 5 shows that the proportion of farmers is different across the eight treatments of the hypothetical fractional factorial experiment defined in Section 6.3.2.

[Figure 5 image: stacked bar chart. x-axis: Treatment Combination (2, 3, 5, 8, 9, 12, 14, 15); y-axis: Percent; legend: Farming Job (No, Yes).]

Figure 5: Comparing number of farmers across factor levels in the $2^{4-1}$ fractional factorial design. Text labels give number of observations per group.

Figure 6 shows the correlations between the levels of the different pesticides. For each design, we explore regression (both saturated, i.e., with all interactions among factors, and unsaturated, i.e., without interactions among factors), regression with covariates (both saturated and unsaturated for the factors), and Fisher tests for significance of effects. When possible, we also include the HC2 estimate for the saturated model without covariates, where it is equivalent to the Neyman estimator. Note that all regressions use log BMI as the outcome. An important note is that individuals with missing values for factor levels, treatment levels, or covariates (where covariates are used) were removed. This removal likely changes the types of individuals in the analysis and therefore the generalizability of the results. However, because this is intended simply as an illustration of the methods and not as a full analysis from which to draw substantive conclusions, this simple approach suffices for our purposes.

D.1 Full factorial design

D.1.1 Full factorial: Regression analysis

We start by ignoring our limited data and using a full $2^4$ factorial approach. Table 6 shows an analysis with all factors and no interactions. Table 7 shows the saturated model with all interactions. Note that the individual who received treatment combination (−1, 1, 1, −1) has leverage 1 because they are the only individual with that treatment combination. Because of this data limitation, estimating variance using the HC2/Neyman variance estimator is not possible for the saturated case. Also note that in the saturated model, the variance estimates given in the linear model summary are all the same. This will be true of Neyman variance estimates too, since they are the same for each factorial effect estimator. Note the changes in significance and even a change in sign for the estimate of beta-Hex going from the model with just main effects to the saturated model.
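The leverage remark can be illustrated on a toy example. The following is a minimal sketch on simulated data, assuming NumPy; the $2^2$ layout, group sizes, and variable names are hypothetical and are not the NHANES data. In a saturated model each unit's leverage equals one over its group size, so a group with a single unit has leverage 1 and a residual of exactly zero, and the HC2 rescaling of squared residuals by $1/(1 - h_{ii})$ is then undefined.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2^2 layout: three groups with several units, one group with one unit.
group_sizes = {(-1, -1): 5, (-1, 1): 4, (1, -1): 6, (1, 1): 1}
rows = [(1, a, b, a * b) for (a, b), n in group_sizes.items() for _ in range(n)]
X = np.array(rows, dtype=float)   # saturated model: intercept, A, B, A:B
y = rng.normal(size=len(X))       # simulated outcomes, for illustration only

# Diagonal of the hat matrix X (X'X)^{-1} X'.
XtX_inv = np.linalg.inv(X.T @ X)
leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

residuals = y - X @ (XtX_inv @ X.T @ y)
singleton = len(X) - 1            # the lone unit in group (1, 1) is the last row

print("leverage of singleton unit:", round(float(leverage[singleton]), 6))
print("residual of singleton unit:", round(float(residuals[singleton]), 6))
# HC2 divides each squared residual by (1 - leverage); here that is 0 / 0.
```

Equivalently, the within-group sample variance that the Neyman variance estimator needs cannot be computed from a single observation.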

[Figure 6 image: upper-triangle correlation plot for Mirex, p,p'-DDT, Hept Epox, and beta-Hex, with a color scale from -1 to 1. Displayed correlations: 0.23, 0.17, 0.16, 0.34, 0.32, 0.48.]

Figure 6: Plot of correlations between pesticide levels.

Table 6: All-pesticide model

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     3.289       0.008  430.426     0.000
beta-Hex       -0.001       0.008   -0.067     0.946
Hept Epox       0.062       0.007    9.201     0.000
Mirex          -0.027       0.006   -4.709     0.000
p,p’-DDT        0.020       0.008    2.595     0.010

D.1.2 Full factorial: Regression analysis adjusting for covariates

This section gives the analysis of the full factorial design, adjusting for the covariates of income, ethnicity, gender, and smoking status as linear factors in the model. For simplicity, we remove all individuals who had missing values for income or ethnicity and assume that those values are missing at random. This resulted in 75 units being removed. In practice, one should instead use multiple imputation to account for the missing values. One individual refused to give income (the only such individual) and so was removed. This did not affect the analysis. The unit assigned to the unique treatment combination necessarily had leverage one in the saturated model. The baseline level for income in the analysis was “$0 to $4,999.” The baseline for ethnicity was “Mexican American.”

Table 7: Saturated model

                                   Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                           3.276       0.015  224.028     0.000
beta-Hex                              0.010       0.015    0.713     0.476
Hept Epox                             0.038       0.015    2.605     0.009
Mirex                                -0.050       0.015   -3.445     0.001
p,p’-DDT                              0.025       0.015    1.699     0.090
beta-Hex:Hept Epox                    0.012       0.015    0.830     0.407
beta-Hex:Mirex                        0.007       0.015    0.456     0.649
Hept Epox:Mirex                      -0.037       0.015   -2.515     0.012
beta-Hex:p,p’-DDT                    -0.008       0.015   -0.569     0.569
Hept Epox:p,p’-DDT                    0.011       0.015    0.779     0.436
Mirex:p,p’-DDT                        0.015       0.015    1.016     0.310
beta-Hex:Hept Epox:Mirex              0.026       0.015    1.790     0.074
beta-Hex:Hept Epox:p,p’-DDT           0.007       0.015    0.500     0.617
beta-Hex:Mirex:p,p’-DDT               0.003       0.015    0.179     0.858
Hept Epox:Mirex:p,p’-DDT              0.014       0.015    0.969     0.333
beta-Hex:Hept Epox:Mirex:p,p’-DDT    -0.004       0.015   -0.257     0.797

[Figure 7 image: four regression diagnostic panels (Residuals vs Fitted; Normal Q-Q; Scale-Location; Residuals vs Leverage).]

Figure 7: Basic diagnostics plot for the full model given in Table 6.

[Figure 8 image: four regression diagnostic panels (Residuals vs Fitted; Normal Q-Q; Scale-Location; Residuals vs Leverage).]

Figure 8: Basic diagnostics plot for the saturated model given in Table 7. Note that the individual with the unique treatment combination had leverage 1.

D.1.3 Full factorial: Fisherian analysis

We assume the sharp null hypothesis of zero individual factorial effects for all factors and interactions. Under this null, the imputed missing potential outcomes for different assignments are just the observed outcomes. We do the imputation, or effectively rearrange the assignment vector, 1000 times. As we can see from the plots, and as confirmed by calculation, only heptachlor epoxide (Hept Epox) and mirex appear to be significantly different from zero at the 0.05 level. We only examine the main effects here but could further consider interactions.
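The randomization procedure described in this subsection can be sketched as follows. This is a minimal illustration on simulated data, assuming NumPy. The function names, the difference-in-means test statistic, and the two-sided p-value are assumptions made for the example; the paper does not specify these details here, and the data below are not the NHANES data.

```python
import numpy as np

def main_effect(y, z):
    """Difference in mean outcome between the +1 and -1 levels of one factor."""
    return y[z == 1].mean() - y[z == -1].mean()

def fisher_p_values(y, Z, n_draws=1000, seed=0):
    """Two-sided randomization p-values for each column (factor) of Z."""
    rng = np.random.default_rng(seed)
    n_factors = Z.shape[1]
    observed = np.array([main_effect(y, Z[:, k]) for k in range(n_factors)])
    exceed = np.zeros(n_factors)
    for _ in range(n_draws):
        # Under the sharp null the outcomes are fixed; only the rows of the
        # assignment matrix are rearranged (treatment group sizes are preserved).
        Zp = Z[rng.permutation(len(y))]
        draws = np.array([main_effect(y, Zp[:, k]) for k in range(n_factors)])
        exceed += np.abs(draws) >= np.abs(observed)
    return observed, exceed / n_draws

# Simulated example: four two-level factors, only the second one has an effect.
rng = np.random.default_rng(1)
Z = rng.choice([-1, 1], size=(200, 4))
y = 3.3 + 0.2 * Z[:, 1] + rng.normal(scale=0.2, size=200)

observed, p_values = fisher_p_values(y, Z)
print("observed effects:", np.round(observed, 3))
print("randomization p-values:", p_values)
```

Permuting whole rows of the assignment matrix keeps the joint treatment combinations intact, which matches the idea of rearranging the assignment vector while holding outcomes fixed.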

Table 8: All-pesticide model

                                               Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                       3.302       0.047   70.165     0.000
beta-Hex                                          0.001       0.008    0.070     0.944
Hept Epox                                         0.064       0.007    9.413     0.000
Mirex                                            -0.031       0.006   -5.240     0.000
p,p’-DDT                                          0.019       0.008    2.452     0.014
Income:$ 5,000 to $ 9,999                        -0.005       0.051   -0.108     0.914
Income:$10,000 to $14,999                        -0.016       0.049   -0.332     0.740
Income:$15,000 to $19,999                         0.022       0.049    0.445     0.657
Income:$20,000 to $24,999                         0.011       0.048    0.228     0.820
Income:$25,000 to $34,999                        -0.014       0.047   -0.287     0.774
Income:$35,000 to $44,999                        -0.002       0.048   -0.037     0.971
Income:$45,000 to $54,999                         0.000       0.049    0.002     0.999
Income:$55,000 to $64,999                         0.000       0.050    0.009     0.993
Income:$65,000 to $74,999                         0.011       0.052    0.206     0.837
Income:$75,000 and Over                          -0.004       0.046   -0.091     0.928
Income:Don’t know                                -0.168       0.142   -1.188     0.235
Income:Over $20,000                               0.033       0.081    0.404     0.686
Ethnicity:Non-Hispanic Black                      0.068       0.019    3.612     0.000
Ethnicity:Non-Hispanic White                     -0.023       0.015   -1.514     0.130
Ethnicity:Other Hispanic                         -0.022       0.033   -0.678     0.498
Ethnicity:Other Race - Including Multi-Racial    -0.055       0.029   -1.912     0.056
Gender:Male                                      -0.005       0.012   -0.405     0.686
Smoker:Yes                                       -0.011       0.011   -0.930     0.353

Table 9: Saturated model

                                               Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                       3.291       0.049   67.583     0.000
beta-Hex                                          0.011       0.015    0.766     0.444
Hept Epox                                         0.044       0.015    3.047     0.002
Mirex                                            -0.054       0.015   -3.698     0.000
p,p’-DDT                                          0.025       0.014    1.700     0.089
beta-Hex:Hept Epox                                0.011       0.014    0.734     0.463
beta-Hex:Mirex                                    0.003       0.014    0.218     0.828
Hept Epox:Mirex                                  -0.039       0.015   -2.651     0.008
beta-Hex:p,p’-DDT                                -0.011       0.015   -0.724     0.469
Hept Epox:p,p’-DDT                                0.009       0.015    0.639     0.523
Mirex:p,p’-DDT                                    0.015       0.014    1.014     0.311
beta-Hex:Hept Epox:Mirex                          0.028       0.014    1.927     0.054
beta-Hex:Hept Epox:p,p’-DDT                       0.004       0.015    0.270     0.787
beta-Hex:Mirex:p,p’-DDT                           0.005       0.015    0.371     0.711
Hept Epox:Mirex:p,p’-DDT                          0.011       0.015    0.773     0.440
beta-Hex:Hept Epox:Mirex:p,p’-DDT                -0.002       0.014   -0.114     0.909
Income:$ 5,000 to $ 9,999                         0.003       0.050    0.060     0.952
Income:$10,000 to $14,999                        -0.016       0.048   -0.329     0.743
Income:$15,000 to $19,999                         0.028       0.049    0.576     0.565
Income:$20,000 to $24,999                         0.007       0.048    0.136     0.891
Income:$25,000 to $34,999                        -0.011       0.047   -0.232     0.816
Income:$35,000 to $44,999                         0.001       0.048    0.024     0.980
Income:$45,000 to $54,999                        -0.001       0.049   -0.011     0.991
Income:$55,000 to $64,999                         0.001       0.050    0.010     0.992
Income:$65,000 to $74,999                         0.010       0.052    0.196     0.844
Income:$75,000 and Over                          -0.003       0.046   -0.073     0.942
Income:Don’t know                                -0.159       0.141   -1.127     0.260
Income:Over $20,000                               0.039       0.081    0.488     0.626
Ethnicity:Non-Hispanic Black                      0.067       0.019    3.571     0.000
Ethnicity:Non-Hispanic White                     -0.023       0.015   -1.505     0.133
Ethnicity:Other Hispanic                         -0.026       0.033   -0.792     0.429
Ethnicity:Other Race - Including Multi-Racial    -0.057       0.029   -1.971     0.049
Gender:Male                                      -0.002       0.012   -0.133     0.894
Smoker:Yes                                       -0.014       0.011   -1.215     0.225

[Figure 9 image: four regression diagnostic panels (Residuals vs Fitted; Normal Q-Q; Scale-Location; Residuals vs Leverage).]

Figure 9: Basic diagnostics plot for the full model given in Table 8.

[Figure 10 image: four regression diagnostic panels (Residuals vs Fitted; Normal Q-Q; Scale-Location; Residuals vs Leverage).]

Figure 10: Basic diagnostics plot for the saturated model given in Table 9. Note that the individual with the unique treatment combination had leverage 1.

[Figure 11 image: four histograms (y-axis: Frequency) of the simulated tau estimates for Beta-Hex, Hept Epox, Mirex, and p,p'-DDT.]

Figure 11: Plots of simulated treatment effects estimated. Observed treatment effect estimates plotted in red.

D.2 Fractional factorial design

D.2.1 Fractional factorial: Regression analysis

This section gives the analysis of the fractional factorial design as is, using regression. Note that we again remove farmers, leaving 523 observations. Table 10 shows an analysis with all factors and no interactions. Table 11 shows the saturated model with all interactions. These results generally align with the full factorial analysis in terms of sign and significance of terms. Figures 12 and 13 show basic diagnostic plots for the model with all main effects and the saturated model. Note again that it makes sense that the standard error estimates are the same for all estimates in the saturated model because the same groups are used to calculate them.
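The equality of the standard errors in the saturated model can be seen with a short derivation. The notation here is introduced only for this sketch and is an assumption, not the paper's: there are $J$ observed treatment groups with sizes $n_j$ and sample variances $s_j^2$, $G$ is the $J \times J$ matrix of distinct $\pm 1$ rows of the model matrix $X$ with entries $g_{jk}$, $N = \mathrm{diag}(n_1, \dots, n_J)$, and the columns of $G$ are assumed orthogonal so that $G^T G = J I$ (as in a full factorial or a regular fraction).

\[
X^T X = G^T N G,
\qquad
\left(X^T X\right)^{-1} = \frac{1}{J^2}\, G^T N^{-1} G,
\qquad
\left[\left(X^T X\right)^{-1}\right]_{kk} = \frac{1}{J^2} \sum_{j=1}^{J} \frac{g_{jk}^2}{n_j} = \frac{1}{J^2} \sum_{j=1}^{J} \frac{1}{n_j}.
\]

Because $g_{jk}^2 = 1$ for every $j$ and $k$, the diagonal entry does not depend on $k$, so the usual regression standard error $\hat{\sigma} \sqrt{[(X^T X)^{-1}]_{kk}}$ is identical for every coefficient. The same cancellation applies to a variance estimate of the form

\[
\widehat{\mathrm{Var}}\big(\hat{\beta}_k\big) = \frac{1}{J^2} \sum_{j=1}^{J} g_{jk}^2\, \frac{s_j^2}{n_j} = \frac{1}{J^2} \sum_{j=1}^{J} \frac{s_j^2}{n_j},
\]

which uses every group in the same way for every effect.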

Table 10: All-pesticide model

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     3.283       0.013  257.366     0.000
beta-Hex        0.007       0.012    0.570     0.569
Hept Epox       0.046       0.011    4.232     0.000
Mirex          -0.053       0.011   -4.673     0.000
p,p’-DDT        0.007       0.011    0.578     0.563

Table 11: Saturated model

                    Estimate  Std. Error    HC2  t value  Pr(>|t|)
(Intercept)            3.279       0.013  0.013  249.551     0.000
beta-Hex              -0.004       0.013  0.013   -0.284     0.776
Hept Epox              0.035       0.013  0.013    2.699     0.007
Mirex                 -0.058       0.013  0.013   -4.389     0.000
p,p’-DDT              -0.001       0.013  0.013   -0.102     0.919
beta-Hex:Hept Epox    -0.003       0.013  0.013   -0.207     0.836
beta-Hex:Mirex        -0.005       0.013  0.013   -0.359     0.720
Hept Epox:Mirex       -0.028       0.013  0.013   -2.165     0.031

[Figure 12 image: four regression diagnostic panels (Residuals vs Fitted; Normal Q-Q; Scale-Location; Residuals vs Leverage).]

Figure 12: Basic diagnostics plot for the main-effects model given in Table 10.

D.2.2 Fractional factorial: Regression analysis adjusting for covariates

This section gives the analysis of the fractional factorial design, adjusting for the covariates of income, ethnicity, gender, and smoking status as linear factors in the model. We again removed units that were missing values for income or ethnicity, reducing the sample size by 34. One unit replied “Don’t know” for income, so this unit was removed. This did not affect the analysis.

[Figure 13 image: four regression diagnostic panels (Residuals vs Fitted; Normal Q-Q; Scale-Location; Residuals vs Leverage).]

Figure 13: Basic diagnostics plot for the saturated model given in Table 11.

D.2.3 Fractional factorial: Fisherian analysis

Once again, only heptachlor epoxide (Hept Epox) and mirex appear to be significantly different from zero at the 0.05 level.

Table 12: All-pesticide model

                                               Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                       3.332       0.074   44.845     0.000
beta-Hex                                          0.008       0.012    0.682     0.495
Hept Epox                                         0.047       0.011    4.304     0.000
Mirex                                            -0.052       0.012   -4.464     0.000
p,p’-DDT                                          0.006       0.011    0.487     0.626
Income:$ 5,000 to $ 9,999                        -0.079       0.083   -0.949     0.343
Income:$10,000 to $14,999                        -0.066       0.077   -0.867     0.387
Income:$15,000 to $19,999                        -0.060       0.078   -0.775     0.439
Income:$20,000 to $24,999                        -0.031       0.077   -0.398     0.691
Income:$25,000 to $34,999                        -0.027       0.076   -0.353     0.724
Income:$35,000 to $44,999                        -0.014       0.076   -0.181     0.856
Income:$45,000 to $54,999                        -0.036       0.076   -0.470     0.639
Income:$55,000 to $64,999                        -0.001       0.080   -0.017     0.986
Income:$65,000 to $74,999                        -0.017       0.080   -0.216     0.829
Income:$75,000 and Over                          -0.048       0.073   -0.660     0.510
Income:Over $20,000                              -0.022       0.118   -0.190     0.849
Ethnicity:Non-Hispanic Black                      0.064       0.030    2.149     0.032
Ethnicity:Non-Hispanic White                     -0.015       0.023   -0.667     0.505
Ethnicity:Other Hispanic                         -0.025       0.051   -0.502     0.616
Ethnicity:Other Race - Including Multi-Racial    -0.040       0.043   -0.930     0.353
Gender:Male                                      -0.000       0.019   -0.014     0.989
Smoker:Yes                                       -0.016       0.018   -0.888     0.375

Table 13: Saturated model

                                               Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                       3.331       0.074   44.901     0.000
beta-Hex                                         -0.001       0.014   -0.075     0.941
Hept Epox                                         0.037       0.013    2.794     0.005
Mirex                                            -0.057       0.014   -4.212     0.000
p,p’-DDT                                         -0.002       0.013   -0.113     0.910
beta-Hex:Hept Epox                               -0.005       0.013   -0.406     0.68
beta-Hex:Mirex                                   -0.007       0.013   -0.545     0.586
Hept Epox:Mirex                                  -0.027       0.013   -2.027     0.043
Income:$ 5,000 to $ 9,999                        -0.086       0.083   -1.035     0.301
Income:$10,000 to $14,999                        -0.076       0.077   -0.995     0.320
Income:$15,000 to $19,999                        -0.064       0.077   -0.831     0.406
Income:$20,000 to $24,999                        -0.045       0.077   -0.583     0.560
Income:$25,000 to $34,999                        -0.031       0.076   -0.407     0.684
Income:$35,000 to $44,999                        -0.023       0.076   -0.308     0.758
Income:$45,000 to $54,999                        -0.046       0.076   -0.606     0.545
Income:$55,000 to $64,999                        -0.013       0.080   -0.162     0.872
Income:$65,000 to $74,999                        -0.027       0.080   -0.336     0.737
Income:$75,000 and Over                          -0.056       0.073   -0.759     0.448
Income:Over $20,000                              -0.028       0.117   -0.238     0.812
Ethnicity:Non-Hispanic Black                      0.071       0.030    2.360     0.019
Ethnicity:Non-Hispanic White                     -0.010       0.023   -0.451     0.652
Ethnicity:Other Hispanic                         -0.025       0.051   -0.494     0.621
Ethnicity:Other Race - Including Multi-Racial    -0.041       0.042   -0.965     0.335
Gender:Male                                       0.003       0.019    0.139     0.889
Smoker:Yes                                       -0.016       0.018   -0.882     0.378

[Figure 14 image: four regression diagnostic panels (Residuals vs Fitted; Normal Q-Q; Scale-Location; Residuals vs Leverage).]

Figure 14: Basic diagnostics plot for the main-effects model given in Table 12.

[Figure 15 image: four regression diagnostic panels (Residuals vs Fitted; Normal Q-Q; Scale-Location; Residuals vs Leverage).]

Figure 15: Basic diagnostics plot for the saturated model given in Table 13.

[Figure 16 image: four histograms (y-axis: Frequency) of the simulated tau estimates for Beta-Hex, Hept Epox, Mirex, and p,p'-DDT.]

Figure 16: Plots of simulated treatment effects estimated. Observed treatment effect estimates plotted in red.

D.3 Fractional factorial with covariate adjustment

After trimming, 169 units are retained.

D.3.1 Fractional factorial with covariate adjustment: Regression analysis

This section gives the analysis of the fractional factorial design after trimming to attain balance, using regression. Table 14 shows an analysis with all factors and no interactions. Table 15 shows the saturated model with all interactions. Figures 17 and 18 show basic diagnostic plots for the model with all main effects and the saturated model.

Table 14: All-pesticide model

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     3.282       0.020  167.246     0.000
beta-Hex        0.041       0.020    2.053     0.042
Hept Epox       0.032       0.018    1.779     0.077
Mirex          -0.075       0.018   -4.127     0.000
p,p’-DDT       -0.006       0.020   -0.327     0.744

Table 15: Saturated model

                    Estimate  Std. Error  HC2 std  t value  Pr(>|t|)
(Intercept)            3.282       0.022    0.027  149.359     0.000
beta-Hex               0.038       0.022    0.027    1.719     0.088
Hept Epox              0.034       0.022    0.027    1.534     0.127
Mirex                 -0.077       0.022    0.027   -3.515     0.001
p,p’-DDT              -0.008       0.022    0.027   -0.383     0.703
beta-Hex:Hept Epox    -0.007       0.022    0.027   -0.299     0.765
beta-Hex:Mirex        -0.003       0.022    0.027   -0.115     0.908
Hept Epox:Mirex       -0.006       0.022    0.027   -0.264     0.792

D.3.2 Fractional factorial with covariate balance: Regression analysis adjusting for covariates

This section gives the analysis on the trimmed data set, adjusting for ethnicity, income, gender, and smoking status as linear factors in the regression. Figures 19 and 20 show the balance across treatment groups for income and ethnicity after trimming, indicating a further need to adjust. We see from Figure 2 that gender and smoking also require further adjustment, even after trimming. Units with missing values for income or ethnicity were removed, which reduced the sample size by 9. We found that two units had unique income values of “Over $20,000” and “$0 to $4,999.” These units were removed. This changes the baseline level for income in the analysis from “$0 to $4,999” to “$5,000 to $9,999.”

[Figure 17 image: four regression diagnostic panels (Residuals vs Fitted; Normal Q-Q; Scale-Location; Residuals vs Leverage).]

Figure 17: Basic diagnostics plot for the main-effects model given in Table 14.

[Figure 18 image: four regression diagnostic panels (Residuals vs Fitted; Normal Q-Q; Scale-Location; Residuals vs Leverage).]

Figure 18: Basic diagnostics plot for the saturated model given in Table 15.

[Figure 19 image: stacked bar chart. x-axis: Treatment Combination (2, 3, 5, 8, 9, 12, 14, 15); y-axis: Percent; legend: Ethnicity (Mexican American; Non-Hispanic Black; Non-Hispanic White; Other Hispanic; Other Race - Including Multi-Racial).]

Figure 19: Balance of ethnicity in the different treatment groups after matching.

[Figure 20 image: bar chart. x-axis: Treatment Combination (2, 3, 5, 8, 9, 12, 14, 15, NA); y-axis: Percent; legend: income brackets from “$ 0 to $ 4,999” through “$75,000 and Over.”]

Figure 20: Balance of income in the different treatment groups after matching.

D.3.3 Fractional factorial with covariate adjustment: Fisherian analysis

In this analysis, mirex is the only pesticide that appears to have a significant effect on BMI at the 0.05 level.

Table 16: All-pesticide model

                                               Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                       3.292       0.086   38.110     0.000
beta-Hex                                          0.037       0.021    1.759     0.081
Hept Epox                                         0.047       0.019    2.509     0.013
Mirex                                            -0.060       0.020   -3.086     0.002
p,p’-DDT                                         -0.008       0.021   -0.383     0.702
Income:$10,000 to $14,999                        -0.031       0.086   -0.360     0.720
Income:$15,000 to $19,999                        -0.003       0.083   -0.038     0.970
Income:$20,000 to $24,999                        -0.001       0.082   -0.014     0.989
Income:$25,000 to $34,999                        -0.014       0.085   -0.160     0.873
Income:$35,000 to $44,999                         0.069       0.088    0.777     0.439
Income:$45,000 to $54,999                         0.039       0.088    0.445     0.657
Income:$55,000 to $64,999                         0.059       0.087    0.675     0.501
Income:$65,000 to $74,999                        -0.048       0.083   -0.577     0.565
Income:$75,000 and Over                           0.015       0.075    0.202     0.840
Ethnicity:Non-Hispanic Black                      0.059       0.047    1.251     0.213
Ethnicity:Non-Hispanic White                      0.016       0.036    0.435     0.664
Ethnicity:Other Hispanic                          0.056       0.082    0.677     0.499
Ethnicity:Other Race - Including Multi-Racial    -0.082       0.074   -1.109     0.269
Gender:Male                                      -0.058       0.030   -1.952     0.053
Smoker:Yes                                       -0.003       0.029   -0.087     0.930

Table 17: Saturated model

                                               Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                       3.278       0.088   37.234     0.000
beta-Hex                                          0.033       0.023    1.459     0.147
Hept Epox                                         0.042       0.022    1.892     0.061
Mirex                                            -0.056       0.023   -2.428     0.017
p,p’-DDT                                         -0.011       0.024   -0.461     0.645
beta-Hex:Hept Epox                               -0.006       0.023   -0.276     0.783
beta-Hex:Mirex                                   -0.019       0.023   -0.844     0.400
Hept Epox:Mirex                                  -0.014       0.023   -0.600     0.549
Income:$10,000 to $14,999                        -0.035       0.088   -0.393     0.695
Income:$15,000 to $19,999                         0.001       0.084    0.015     0.988
Income:$20,000 to $24,999                         0.004       0.083    0.044     0.965
Income:$25,000 to $34,999                        -0.008       0.086   -0.088     0.930
Income:$35,000 to $44,999                         0.073       0.089    0.822     0.413
Income:$45,000 to $54,999                         0.037       0.089    0.411     0.682
Income:$55,000 to $64,999                         0.063       0.088    0.720     0.473
Income:$65,000 to $74,999                        -0.049       0.084   -0.587     0.558
Income:$75,000 and Over                           0.019       0.076    0.253     0.800
Ethnicity:Non-Hispanic Black                      0.072       0.049    1.492     0.138
Ethnicity:Non-Hispanic White                      0.022       0.037    0.598     0.551
Ethnicity:Other Hispanic                          0.055       0.083    0.665     0.507
Ethnicity:Other Race - Including Multi-Racial    -0.090       0.075   -1.198     0.233
Gender:Male                                      -0.055       0.030   -1.830     0.069
Smoker:Yes                                       -0.005       0.030   -0.155     0.877

[Figure 21 image: four regression diagnostic panels (Residuals vs Fitted; Normal Q-Q; Scale-Location; Residuals vs Leverage).]

Figure 21: Basic diagnostics plot for the model given in Table 16.

[Figure 22 image: four regression diagnostic panels (Residuals vs Fitted; Normal Q-Q; Scale-Location; Residuals vs Leverage).]

Figure 22: Basic diagnostics plot for the saturated model given in Table 17.

[Figure 23 image: four histograms (y-axis: Frequency) of the simulated tau estimates for Beta-Hex, Hept Epox, Mirex, and p,p'-DDT.]

Figure 23: Plots of simulated treatment effects estimated. Observed treatment effect estimates plotted in red.
