Observational Studies 5 (2019) 21-35 Submitted 7/19; Published 7/19

Assessing Treatment Effect Variation in Observational Studies: Results from a Challenge

Carlos Carvalho [email protected]
Department of Information, Risk and Operations Management, The University of Texas at Austin, Austin, TX 78712, USA

Avi Feller [email protected]
Goldman School of Public Policy, The University of California, Berkeley, Berkeley, CA 94720, USA

Jared Murray [email protected]
Department of Information, Risk and Operations Management, The University of Texas at Austin, Austin, TX 78712, USA

Spencer Woody [email protected]
Department of Statistics and Data Science, The University of Texas at Austin, Austin, TX 78712, USA

David Yeager [email protected]
Department of Psychology, The University of Texas at Austin, Austin, TX 78712, USA

Abstract

A growing number of methods aim to assess the challenging question of treatment effect variation in observational studies. This special section of Observational Studies reports the results of a workshop conducted at the 2018 Atlantic Causal Inference Conference designed to understand the similarities and differences across these methods. We invited eight groups of researchers to analyze a synthetic observational data set that was generated using a recent large-scale randomized trial in education. Overall, participants employed a diverse set of methods, ranging from matching and flexible outcome modeling to semiparametric estimation and ensemble approaches. While there was broad consensus on the topline estimate, there were also large differences in estimated treatment effect moderation. This highlights the fact that estimating varying treatment effects in observational studies is often more challenging than estimating the average treatment effect alone. We suggest several directions for future work arising from this workshop.

Keywords: Heterogeneous treatment effects, effect modification, average treatment effect

1. Introduction

Spurred by recent statistical advances and new sources of data, a growing number of methods aim to assess treatment effect variation in observational studies. This is an inherently

© 2019 Carlos Carvalho, Avi Feller, Jared Murray, Spencer Woody, and David Yeager.

challenging problem. Even estimating a single overall effect in non-randomized settings is difficult, let alone estimating how effects vary across units. As applied researchers begin to use these tools, it is therefore important to understand both how well these approaches work in practice and how they relate to each other.

This special section of Observational Studies reports the results of a workshop conducted at the 2018 Atlantic Causal Inference Conference designed to address these questions. Specifically, we invited researchers to analyze a common data set using their preferred approach for estimating varying treatment effects in observational settings. The data set, which we describe in detail in Section 2.2, was based on data from the National Study of Learning Mindsets, a large-scale randomized trial of a behavioral intervention (Yeager et al., 2019). Unlike the original study, the simulated dataset was constructed to include meaningful measured confounding, though the assumption of no unmeasured confounding still holds. We then asked participants to answer specific questions related to treatment effect variation in this simulated data set and to present these results at the ACIC workshop.

Workshop participants submitted a total of eight separate analyses. At a high level, all contributed analyses followed similar two-step procedures: (1) use a flexible approach to impute unit-level treatment effects; (2) find low-dimensional summaries of the estimates from the first step to answer the substantive questions of interest. The analyses, however, differed widely in their choice of flexible modeling, including matching, machine learning models, semi-parametric methods, and ensemble approaches. In the end, there was broad consensus on the topline estimate, but large differences in estimated treatment effect moderation. This underscores that estimating varying treatment effects in observational studies is more challenging than estimating the ATE alone.

Section 2 gives an overview of the data challenge and the proposed methods. Section 3 discusses the contributed analyses. Section 4 addresses common themes and highlights some directions for future research. Participants' analyses appear subsequently in this volume.

2. Overview of the data challenge

2.1 Background and problem setup

The basis for the data challenge is the National Study of Learning Mindsets (Yeager et al., 2019; Yeager, 2019), which several workshop organizers helped to design and analyze. NSLM is a large-scale randomized evaluation of a low-cost "nudge-like" intervention designed to instill students with a growth mindset. At a high level, a growth mindset is the belief that people can develop intelligence, as opposed to a fixed mindset, which views intelligence as an innate trait that is fixed from birth. NSLM assessed this intervention by randomizing students separately within 76 schools drawn from a national probability sample of U.S. public high schools. In addition to assessing the overall impact, the study was designed to measure impact variation, both across students and across schools. See Yeager et al. (2019) for additional discussion.

The goal in generating the synthetic data was to create an observational data set that emulated NSLM in key ways, including covariate distributions, data structures, and effect sizes, but that also introduced additional confounding not present in the original randomized trial. The final dataset included 10,000 students across 76 schools, with four student-level


Covariate   Description
S3          Student's self-reported expectations for success in the future, a proxy for prior achievement, measured prior to random assignment
C1          Categorical variable for student race/ethnicity
C2          Categorical variable for student identified gender
C3          Categorical variable for student first-generation status (i.e., first in family to go to college)
XC          School-level categorical variable for urbanicity of the school (i.e., rural, suburban, etc.)
X1          School-level mean of students' fixed mindsets, reported prior to random assignment
X2          School achievement level, as measured by test scores and college preparation for the previous four cohorts of students
X3          School racial/ethnic minority composition – i.e., percentage black, Latino, or Native American
X4          School poverty concentration – i.e., percentage of students who are from families whose incomes fall below the federal poverty line
X5          School size – total number of students in all four grade levels in the school

Table 1: Descriptions of available covariates in the ACIC workshop synthetic dataset

covariates and six school-level covariates shown in Table 1. These covariates were drawn from the original National Study but were slightly perturbed to satisfy privacy and other data restrictions (see Appendix A for details). For each student, we then generated a simulated outcome Y, representing a continuous measure of achievement, and a simulated binary treatment variable Z. We describe the data generating process in Section 2.2, with details in Appendix A.

Participants were presented with several objectives for their analysis:

1. To assess whether the mindset intervention is effective in improving student achievement

2. To assess two potential effect moderators of primary scientific interest: Pre-existing mindset norms (X1) and school level achievement (X2). In particular, participants were asked to evaluate two competing hypotheses about how X2 moderates the effect of the intervention: if it is an effect modifier, researchers hypothesize that either it is


largest in middle-achieving schools (a "Goldilocks effect") or is decreasing in school-level achievement.

3. To assess whether there are any other effect modifiers among the recorded variables.

We chose these objectives in part because they arose in the design and analysis of the original National Study. We intentionally kept the wording statistically vague, and did not map the objectives onto specific estimands in order to emulate part of our experience working with collaborators.

2.2 Overview of data generation

According to the model generating the synthetic data:

1. The mindset intervention has a relatively large, positive effect on average.

2. The impact varies across both pre-existing mindset norms (X1) and school-level achievement (X2). However, there is no "Goldilocks" effect present, at least when controlling for other variables (C1 and X1).

3. The impact also varies across race/ethnicity (C1).

Where possible we anchored the synthetic data generating process to the original data, borrowing the original covariate distribution (with slight perturbations) and using semi-parametric models fit to an immediate post-treatment outcome from the original study. Specifically, let wij denote the vector of the variables in Table 1 for student i in school j. We generated the data from the following model:

y^obs_ij = αj + µ(wij) + [τ(xj1, xj2, cij1) + γj] zij + εij,    (1)

where µ is an additive function obtained approximately by fitting a generalized additive model to the control arm of the original data; see Appendix A. We simulated αj ∼ N(0, 0.15²) and γj ∼ N(0, 0.105²) independently. We drew iid samples of εij by jittering and resampling the residuals from a model fit to the original data, scaling them to have standard deviation 0.5; the distribution of the error terms is displayed in Figure 3. Finally, we generated treatment effects from the following model:

τ(x1, x2, c1) = 0.228 + 0.05 · 1(x1 < 0.07) − 0.05 · 1(x2 < −0.69) − 0.08 · 1(c1 ∈ {1, 13, 14}),    (2)

where x1 is a measure of pre-existing mindset norms, x2 is school-level achievement, and c1 is a categorical race/ethnicity variable. All three appeared to be associated with treatment effect variation in preliminary data from the original National Study, although we adjusted the pattern for the synthetic dataset. Note that there is no additional idiosyncratic treatment effect variation and the control and treatment potential outcomes have the same (conditional) variance.

We generated confounding via a two-step process. First, we dropped observations from the treatment arm with probability 1 − Φ[−0.5 + 1.5µ(wij)], where Φ is the standard normal CDF. This simulates a scenario where students with high expected outcomes under control were more likely to receive the treatment, yielding naive treatment effect estimates


that are too high. Second, similar to the design in Hill (2011), we dropped select units from the treatment arm to induce a more complicated functional form for the confounding structure. Specifically, we dropped students from the treatment arm if they were above the 80th percentile on an additional covariate from the National Study that was not included in the synthetic dataset (and not used in generating the synthetic outcomes). In principle this has the potential to induce violations of overlap in the presence of very strong dependence, but overlap in the final dataset was quite good (see, e.g., Carnegie et al. and Johannsson for overlap checks).

As with any synthetic data simulation, the process involved an extensive number of modeling choices. Appendix A contains more details about the data generating process and discusses the principles we formulated prior to generating the data, including decisions about the relative magnitude of the average treatment effect, treatment effect heterogeneity, and residual variance.
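To make the description above concrete, here is a minimal R sketch of the outcome model (1), the CATE function (2), and the first (targeted-selection) confounding step. It is illustrative only: mu() stands in for the fitted GAM, eps for the jittered residuals, and the data frame dat (with columns x1, x2, c1, school coded 1, ..., J, and z) is hypothetical rather than the code actually used to build the challenge dataset.

tau <- function(x1, x2, c1) {
  0.228 +
    0.05 * (x1 < 0.07) -             # larger effects in schools with low fixed-mindset norms
    0.05 * (x2 < -0.69) -            # smaller effects in low-achieving schools
    0.08 * (c1 %in% c(1, 13, 14))    # smaller effects for three race/ethnicity categories
}
J     <- length(unique(dat$school))
alpha <- rnorm(J, 0, 0.15)           # school intercepts
gamma <- rnorm(J, 0, 0.105)          # school-level treatment effect shifts
eps   <- rnorm(nrow(dat), 0, 0.5)    # stand-in for the jittered/resampled residuals
dat$y <- alpha[dat$school] + mu(dat) +
  (tau(dat$x1, dat$x2, dat$c1) + gamma[dat$school]) * dat$z + eps
# Targeted selection: treated units kept with probability pnorm(-0.5 + 1.5 * mu)
keep <- ifelse(dat$z == 1, runif(nrow(dat)) <= pnorm(-0.5 + 1.5 * mu(dat)), TRUE)
dat  <- dat[keep, ]

The second confounding step (dropping treated units above the 80th percentile of a withheld covariate) is omitted here because that covariate does not appear in the released data.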

3. Overview of contributed analyses

3.1 Summary of contributed methods

Participants submitted eight separate analyses using a wide range of methods.1 At a high level, all of the contributed analyses followed similar two-step procedures: (1) use a flexible approach to impute student-level (or school-level) treatment effects; (2) find low-dimensional summaries of the flexible model to answer the substantive questions. Specifically, the proposed first-stage methods fall into three broad categories:

• Matching: Keller et al.; Keele and Pimentel; and Parikh et al.

• Outcome modeling: Carnegie et al.; Johannsson.

• Machine learning and semi-parametric estimation: Zhao & Panigrahi; Athey & Wager.

Künzel et al. proposed an ensemble approach that could incorporate any number of candidate estimators. Approaches for the second step range from graphical summaries and simple aggregation (e.g., averaging student-level impact estimates by school) to fitting regression or tree models to the estimated student-level impacts.
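As a concrete illustration of this second step, the following R sketch summarizes first-stage estimates with a shallow regression tree and with school-level averages. The data frame students and its column tau_hat are hypothetical stand-ins for whatever unit-level estimates a given first-stage model produced.

library(rpart)
# students: one row per student, covariates from Table 1, a school identifier, and
# tau_hat = first-stage estimate of that student's treatment effect (hypothetical)
summary_tree <- rpart(tau_hat ~ S3 + C1 + C2 + C3 + XC + X1 + X2 + X3 + X4 + X5,
                      data = students, maxdepth = 3)   # shallow tree as a low-dimensional summary
school_effects <- aggregate(tau_hat ~ schoolid, data = students, FUN = mean)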

3.2 Summary of findings

At a high level, most analyses resulted in remarkably similar estimates for the average treatment effect. Conversely, the proposed methods generally do not agree on treatment effect heterogeneity, though the presented results rarely contradict each other: while some methods discovered effect modifiers that others did not detect, no two methods, for example, found effects varying in opposite directions.

Objective 1: average effect. Table 2 shows the reported estimates of the Average Treatment Effect and corresponding 95% uncertainty intervals.2

1. Alejandro Schuler participated in the ACIC workshop but did not submit a written analysis.
2. The original prompt was intentionally vague on the target estimand. Most respondents reported estimates for an Average Treatment Effect. Keele & Pimentel specifically reported estimates for an Average Treatment Effect on the Treated. Athey & Wager reported an Average Treatment Effect for a population of sites. In the end, the true values are quite similar across these estimands in the synthetic data set. Finally, all uncertainty intervals are confidence intervals, except for Carnegie et al., who report credible intervals.
3. The true Sample Average Treatment Effect on the Treated was also about 0.24.


Author               ATE estimate (95% C.I.)
Athey & Wager        0.25 (0.21, 0.29)
Carnegie et al.      0.25 (0.23, 0.27)
Johannsson           0.27 (0.22, 0.31)
Keele & Pimentel     0.27 (0.25, 0.30)
Keller et al.        0.26 (0.22, 0.30)
Künzel et al.        0.25 (0.22, 0.27)
Parikh et al.        0.26 (0.25, 0.26)
Zhao & Panigrahi     0.26 (0.24, 0.28)

Table 2: Submitted estimates for the average treatment effect and corresponding 95% uncertainty intervals.

Overall the point estimates are quite similar, ranging from 0.25 to 0.27, compared to a true value of 0.24. The interval widths varied widely, however, from 0.01 for Parikh et al. to 0.09 for Johannsson. And while most intervals bracketed the true value, with the exceptions of Parikh et al. and Keele and Pimentel,3 we cannot assess the quality of these intervals based on a single realized synthetic data set. These results are consistent with our own experiences and anecdotal evidence suggesting that — in settings where the identifying assumptions hold, sample sizes are large, and the signal is reasonably strong — estimates of the overall effect are fairly robust to modeling or analytic choices. Inference may be another story, however. While all eight analyses point to large, positive effects overall, the variation in interval widths suggests differences in efficiency and/or frequentist validity of uncertainty intervals, or possibly differences in the target estimand; for example, Athey and Wager target the population school average treatment effect, while Keele and Pimentel estimate the sample ATT.

Objective 2: variation across pre-specified moderators. In contrast to responses concerning the overall effect, there is considerable disagreement about treatment effect variation across the two pre-specified moderators; see Table 3. All participants found at least some support for effect modification across baseline levels of mindset beliefs (X1) and generally agreed on the direction: higher baseline mindset beliefs are associated with lower treatment effects on average. However, there was some disagreement on whether this trend should be considered "statistically significant." As we discuss in Section 4, this is due in part to different standards of evidence as well as challenges in quantifying uncertainty for some approaches.

Results were more mixed for variation by school achievement level (X2). Half of the analyses found essentially no support for variation across X2. The remaining analyses were split as to the magnitude and strength of evidence for variation, though all were suggestive of an increasing trend.



Author             School mindset norms (X1)                                                                              School achievement level (X2)
Athey & Wager      Yes, though negative effect modification is insignificant after multiplicity adjustment.               No.
Carnegie et al.    Yes, decreasing trend in treatment effect.                                                             Yes, an increasing trend in treatment effect, though without evidence of a Goldilocks effect.
Johannsson         Yes, decreasing trend in treatment effect.                                                             No.
Keele & Pimentel   Yes, decreasing trend in treatment effect, though confidence intervals for quintiles overlap.          Moderate support; lowest quintile has lower treatment effect, though there is little separation among other quintiles.
Keller et al.      Moderate support from exploratory analysis.                                                            Moderate support for presence of a Goldilocks effect.
Künzel et al.      Yes, increasing trend in treatment effect.                                                             No, though there is cursory evidence in CATE plots.
Parikh et al.      Yes, increasing then decreasing trend in treatment effect.                                             Yes, exploratory support for Goldilocks effect.
Zhao & Panigrahi   Yes, though negative effect modification is insignificant after selection adjustment.                  No.

Table 3: Conclusions about Objective 2 concerning effect modification by X1 and X2

Objective 3: exploratory treatment effect variation. The contributed analyses also differed in their conclusions about additional effect modifiers. (Recall from Equation (2) that in the data generating process, X1, X2, and C1 were all "true" treatment effect modifiers, with treatment effects decreasing in X1, increasing in X2, and smaller for C1 = 1, 13, or 14.)

With the exception of Athey & Wager, who found no additional treatment effect variation, the remaining seven analyses all identified urbanicity (XC) as an effect modifier, although they varied somewhat in their assessment of the strength of evidence. Some authors also identified student self-reported expectations for future success (S3) as a possible effect modifier, generally with an increasing trend. These findings were interesting in part because neither XC nor S3 are "true" effect modifiers in the sense of appearing in the data generating process for τ in Equation (2), although XC is associated with τ in the population.

This highlights two important issues in assessing treatment effect variation in non-randomized studies. First, even in a randomized trial, estimating varying treatment effects is, in some sense, an inherently observational problem and is generally susceptible to Simpson's paradox (VanderWeele and Knol, 2011). For example, urbanicity is strongly related to the true effect modifiers X1 and X2: if the analysis does not condition on X1 and X2, the


Figure 1: Relationships between the Urbanicity variable (XC) and other quantities, from left to right: Base mindset norms (X1), school achievement (X2), and the true CATE. In the left two panels we see how XC correlates with X1 and X2. The dotted lines are the locations of the step for the CATE function τ in Equation (2) for each variable (X1 and X2). Note that, in addition to the sample correlation between X1/X2 and XC, observations with XC = 3 tended to have norms (X1) above the X1 cutpoint in τ and achievement (X2) below the X2 cutpoint in τ. In the last panel we see that marginally the true CATEs for observations with XC = 3 are notably lower than the other levels.

estimated treatment effects will vary across levels of XC; see Figure 1.4 Therefore, whether urbanicity should be considered a "true" effect modifier depends on whether the analysis conditions on X1, X2, and C1, and on whether we are estimating heterogeneity across sampled schools or in the population. The research question given to workshop participants was intentionally vague on this point, and different authors interpreted the question in different ways.

Second, disentangling differential confounding from treatment effect variation is inherently challenging in observational settings. Unlike XC, S3 has essentially no relationship with the true individual-level treatment effects unconditionally, but is a very important predictor of selection into treatment. In the analyses that found evidence for treatment effect heterogeneity in S3, the estimated pattern of heterogeneity lined up well with the patterns of confounding – higher estimated treatment effects in the upper two levels of S3 – suggesting that the estimated treatment effect variation may be due to residual confounding; see Figure 2. These results are noteworthy in part because the submitted analyses were able to more or less correctly estimate the overall treatment effect despite confounding, which underscores that estimating varying treatment effects in observational studies is more challenging than estimating the ATE.

4. The observed relationship between urbanicity and treatment is further amplified by the step function specification in Equation (2) as well as the particular draw of random slopes, γj , in the realized synthetic data set.


Figure 2: True CATE and estimated propensity score by level of expected success (S3). Unlike XC, we see relatively little variability in CATE by S3 (left panel). However, S3 was one of the most important confounders (the right panel above shows how the propensity score varies with S3; Figure 4 shows that S3 is also very predictive of the outcome).

4. Common themes and open questions

4.1 Making substantive questions statistically precise

As with all real-world analyses, there are important open-ended questions in terms of translating the substantive goals of the research study into corresponding statistical methods.

One important challenge is weighing evidence for pre-defined subgroups versus exploratory treatment effect variation. The proposed methods differ widely. At one extreme, Zhao and Panigrahi have different inferential procedures for pre-determined candidate subgroup effects (school-level fixed mindset X1 and achievement level X2, as specified in Question 2) versus "discovered" subgroup effects, as mentioned in Question 3. They argue that using the same data for both discovery and estimation of effect modification will generally lead to bias (Fithian et al., 2014), which motivates their post-selection inference procedure (Lee et al., 2016). At the other extreme, Carnegie et al. treat all variation the same, and conduct their investigation into treatment effect heterogeneity by leveraging posterior uncertainty from BART. Most approaches lie between these two extremes, such as using different sample splits to define subgroups versus estimating their impacts (Künzel et al.).

Another difference across methods is how submitted analyses addressed the multilevel structure of the data. Approaches included using "cluster-robust" random forests (Wager & Athey), bootstrapping and sample splitting at the school level (Johannsson; Künzel et al.), including fixed or random intercepts at the school level (Carnegie et al.), and considering matching both between and within schools (Keele & Pimentel). These approaches make different assumptions about patterns of treatment effect moderation as well as the estimand of interest. For example, Carnegie et al.'s choice of random intercepts per school relaxes the assumption of iid errors at the student level but does not allow the effect of treatment effect moderators to vary across schools. Alternatively, Athey & Wager use a cluster-robust approach and an estimator that targets a population of schools, rather than (as was often implicitly targeted by the other analyses) a population of students drawn from the 76 observed schools.
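To illustrate one of the school-level resampling schemes mentioned above, here is a minimal R sketch of a cluster (school-level) bootstrap around an average of unit-level effect estimates; students, schoolid, and tau_hat are hypothetical, and a full implementation would also refit the first-stage model within each replicate.

schools  <- unique(students$schoolid)
boot_ate <- replicate(2000, {
  draw <- sample(schools, length(schools), replace = TRUE)  # resample whole schools
  idx  <- unlist(lapply(draw, function(s) which(students$schoolid == s)))
  mean(students$tau_hat[idx])
})
c(estimate = mean(students$tau_hat), cluster_boot_se = sd(boot_ate))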


Third, as we mention in Section 3, the contributed analyses all use a two-step approach for assessing treatment effect variation, first fitting a flexible model for impacts and then finding a low-dimensional summary of that model. While this is a promising framework, the statistical properties of such procedures are not well understood. First, as Hahn et al. (2017) argue, this is generally valid from a Bayesian perspective, since it simply entails summarizing the posterior distribution after conditioning on the data only once. The story, however, is more complicated from a frequentist perspective. Chernozhukov et al. (2018) offer some promising directions in the randomized trial setting; see also the discussion in Athey & Wager and Zhao & Panigrahi. These ideas merit further exploration.

4.2 Tailoring methods for observational studies

The (simulated) data for this data challenge come from an observational study rather than a randomized trial. Therefore, appropriately accounting for confounding is a first-order concern. While all the analyses adjusted for confounding, the contributions varied in the level of emphasis put on the observational aspect of the data exercise, which creates additional challenges for estimating treatment effect variation. In particular, when a method only partially accounts for confounding, it is possible to mistake differential confounding for varying treatment effects. This appears to be driving some of the findings of S3 as an effect modifier, as we discuss in Section 3. Understanding this dimension will be critical for broader adoption of these methods.

First, it is common to check for global covariate balance when estimating the overall impact in an observational study. A natural extension to heterogeneous treatment effect estimation would be to also check for local covariate balance, such as separately by candidate subgroups (a minimal sketch appears at the end of this subsection). Even if we attain global covariate balance to a reasonable tolerance, imbalance within subgroups should give us pause. Such analyses were largely absent from the contributed papers, suggesting that the field should develop standards in this area.

Another element that is critical for observational studies is assessing the consequences when the unconfoundedness assumption fails, either globally or locally. Keele & Pimentel assess sensitivity within the matching framework; Carnegie et al. instead use a model-based approach, and Künzel et al. use a permutation approach. Sensitivity analysis for treatment effect variation remains an active research area, with some recent proposals within the minimax framework (Kallus and Zhou, 2018; Yadlowsky et al., 2018).
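A minimal sketch of such a local balance check is given below: standardized mean differences between treatment arms, computed within candidate subgroups (here, quintiles of school achievement X2). The data frame students and its columns are hypothetical placeholders, not code from any contributed analysis.

# Standardized mean difference of a covariate between treatment arms
smd <- function(x, z) {
  (mean(x[z == 1]) - mean(x[z == 0])) /
    sqrt((var(x[z == 1]) + var(x[z == 0])) / 2)
}
# Candidate subgroups: quintiles of school achievement (X2)
subgroup <- cut(students$X2,
                quantile(students$X2, probs = seq(0, 1, by = 0.2)),
                include.lowest = TRUE)
# Local balance on S3; repeat for each covariate of interest
sapply(split(seq_len(nrow(students)), subgroup),
       function(idx) smd(students$S3[idx], students$Z[idx]))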

5. Conclusion

Our goal for this workshop was to understand how different researchers would approach the kinds of questions we routinely face in our own applied work, as a complement to existing competitions like the ACIC data challenge (Dorie et al., 2019) where methods are formally evaluated based on their operating characteristics. We were fortunate to bring together a "methodologically diverse" panel and set them to analyze a single dataset, giving us a unique opportunity to compare perspectives. Our hope was that we would learn more about new methods while finding areas of overlap and points of difference that suggest new lines of research. We suggest a few directions for future work above, but fully expect this synthesis is the first word and not the last. We sincerely thank the workshop participants and the other contributors to this volume for making the workshop such a success.


Acknowledgments

We gratefully acknowledge support from the Center for Enterprise and Policy Analytics at the McCombs School of Business at the University of Texas at Austin. Preparation of this manuscript was supported by the William T. Grant Foundation, the National Science Foundation under grant number HRD 1761179, and by the National Institute of Child Health and Human Development (Grant No. R01HD084772-01 and Grant No. P2C-HD042849, to the Population Research Center [PRC] at The University of Texas at Austin). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Appendix A. Additional details for generating synthetic data

A.1 Principles for data generation

We had several goals for our data generating process. Since we had a single dataset to generate based on a scientific problem with which we were quite familiar, we had the luxury of thinking carefully about plausible structures for treatment effect heterogeneity and confounding. Our model for generating treatment effect heterogeneity was based on the following principles:

1. The average treatment effect should be relatively well-estimated by any reasonable procedure for confounding adjustment.

2. The variability in conditional average treatment effects (CATEs) should be relatively modest; here the covariate-dependent component of the CATEs ranged from 0.10 to 0.28, while the total CATEs (including school-level heterogeneity unexplained by the covariates) ranged from -0.08 to 0.53 and have a sample average of 0.24 and standard deviation 0.10. "Relative" here should be understood in terms of the marginal standard deviation of Y (about 0.6) and the standard deviation of the error term in the model we used to generate the data (0.5).

3. It should be possible to approximately recover the treatment effect heterogeneity at conventional levels of statistical significance given complete knowledge of the correct functional form. This is an obvious baseline. We also tried to ensure that it was possible to recover at least some aspects of treatment effect moderation using plausible methods in a fully exploratory fashion, or when the true set of treatment effect moderators was known.

4. There should be no additional unmeasured treatment effect moderation at the individual level, since this is inestimable. This is wholly unrealistic and was primarily a variance-reduction decision; to the extent that other simulation exercises are explicit about this point, they tend to simulate unmeasured treatment effect moderation as due to independent unmeasured variables (e.g. Dorie et al. (2019)). This is closer to reality of course, but in our context with a single dataset to be analyzed we saw little to be gained by injecting additional variability into the problem.


5. Unexplained treatment effect heterogeneity at the group level should be present at reasonable levels. We achieved this by simulating a random slope per school from a normal distribution with mean zero and standard deviation slightly larger than what we observed in models fit to the original data.

Our confounding structure was based on the following principles:

1. The confounding should be strong enough to matter, but not so strong as to be unrealistic and/or induce practical violations of overlap. In our context this meant a naive, unadjusted average treatment effect (ATE) estimate was about 25% higher than the estimate using a correctly specified model (roughly, 0.30 unadjusted versus 0.24 adjusted, a difference of multiple standard errors).

2. The confounding should be explicitly modeled, at least in part, rather than induced solely by randomly generated coefficients in outcome and selection models. In particular, we assumed that selection into treatment was partially based on expected outcomes. Here students with higher expected outcomes under control were more likely to be treated. Rather than violate conditional ignorability/unconfoundedness via a (perhaps more plausible) latent variable model, we made the selection model depend directly on µ0(w) := E(Y(0) | W = w), where we use W generically to denote the collection of potential effect moderators and confounders. Hahn et al. (2017) refer to this confounding structure as "targeted selection."

3. We entertained inducing confounding at the group level that was unexplained by group-level covariates, but decided against it. This was largely a pragmatic decision, as it was already surprisingly difficult to design a data generating process that met our other desiderata.

We wanted the covariate distribution to be realistic. We were interested in the practical effects of dependence in the covariates on estimating effect variation, and in how participants would interpret the research questions in light of this dependence. We took the covariates almost directly from the early Mindset study data. To satisfy privacy constraints and avoid premature disclosure of variables constructed by the Mindset team, the original covariates were slightly perturbed. We added noise to continuous variables, where the disturbances were sampled from multivariate normal distributions that preserved the covariance structure (marginally over categorical variables). Categorical variables were subject to low levels of data swapping. In both cases we preserved the original multilevel data structure.
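The following is an illustrative R sketch of this style of perturbation, not the exact procedure used: multivariate normal noise is added to the continuous school-level covariates and a small random fraction of a categorical variable is swapped. The scaling constant, swap rate, and object names are assumptions.

library(MASS)
cont  <- c("X1", "X2", "X3", "X4", "X5")
Xc    <- as.matrix(schools[, cont])
# Noise whose covariance is a scaled-down copy of the sample covariance
noise <- mvrnorm(nrow(Xc), mu = rep(0, length(cont)), Sigma = 0.05 * cov(Xc))
schools[, cont] <- Xc + noise
# Swap a small random fraction of a categorical student-level variable
swap <- runif(nrow(students)) < 0.02
students$C1[swap] <- sample(students$C1, sum(swap), replace = TRUE)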

A.2 Summary of original GAM estimate See Figure 3.

Appendix B. List of participants

In total there were nine participants in the workshop, all of whom gave presentations of their findings. Here we list the presenters by order of appearance, along with a brief description of their methodology:


Figure 3: Nonlinear terms in µ and error distribution (bottom right panel) for the data generating model.

• Stefan Wager (Stanford): Causal random forests and the R-Learner transformed outcome

• Alejandro Schuler (Stanford): Shallow CART fit to transformed outcome (R-Learner).

• Luke Keele (Penn) and Sam Pimentel (UC-Berkeley): Various matching-based approaches.

• Qingyuan Zhao (Penn): Regression on a transformed outcome and treatment with lasso regularization, with selection adjustment for inference about nonzero coefficients.

• Sören Künzel (UC-Berkeley): Ensemble method to detect candidate subgroups, using standard ATE estimators within these subgroups.

• Nicole Carnegie (Montana State) and Jennifer Hill (New York University): BART with and without random intercepts.

• Alexander Volfovsky (Duke): Matching after learning to stretch (MALTS), a matching method that infers a distance metric.

• Frederik Johannsson (MIT): Ridge regression, neural networks, and random forests.

• Bryan Keller (Columbia): Propensity score matching and CART summarization.

All presenters, with the exception of Schuler, submitted a written report (most with other coauthor(s)). These submissions are all included in this issue.


Parametric coefficients:
              Estimate   Std. Error   t value    Pr(>|t|)
(Intercept)  -0.279038     0.014847   -18.794    < 2e-16 ***
S3 == 2      -0.009130     0.014948    -0.611    0.541351
S3 == 3      -0.017494     0.013360    -1.309    0.190402
S3 == 4       0.046260     0.012716     3.638    0.000276 ***
S3 == 5       0.252158     0.012290    20.518    < 2e-16 ***
S3 == 6       0.614424     0.012315    49.893    < 2e-16 ***
S3 == 7       0.964031     0.012872    74.891    < 2e-16 ***
C1 == 2       0.011635     0.005728     2.031    0.042255 *
C1 == 3      -0.035327     0.012573    -2.810    0.004966 **
C1 == 4      -0.107783     0.004981   -21.638    < 2e-16 ***
C1 == 5       0.215135     0.008003    26.882    < 2e-16 ***
C1 == 6       0.001796     0.021055     0.085    0.932026
C1 == 7       0.034712     0.021602     1.607    0.108097
C1 == 8      -0.073001     0.010092    -7.233    4.96e-13 ***
C1 == 9      -0.138227     0.011785   -11.729    < 2e-16 ***
C1 == 10     -0.099958     0.010868    -9.197    < 2e-16 ***
C1 == 11     -0.084852     0.010809    -7.850    4.45e-15 ***
C1 == 12      0.049686     0.008453     5.878    4.25e-09 ***
C1 == 13     -0.100919     0.010573    -9.545    < 2e-16 ***
C1 == 14      0.064214     0.006531     9.832    < 2e-16 ***
C1 == 15     -0.019233     0.008204    -2.344    0.019078 *
C2           -0.155658     0.002546   -61.127    < 2e-16 ***
C3           -0.093056     0.002850   -32.655    < 2e-16 ***
XC            0.019555     0.002527     7.740    1.07e-14 ***
---

         edf    Ref.df        F    p-value
s(X1)  8.718     8.950   128.08    <2e-16 ***
s(X2)  8.920     8.987    25.08    <2e-16 ***
s(X3)  6.726     7.661    62.31    <2e-16 ***
s(X4)  7.827     8.554    19.73    <2e-16 ***
s(X5)  8.818     8.974    58.23    <2e-16 ***
---

Figure 4: GAM summary table of µ for the true expected outcomes under control; see Figure 3 for the forms of nonlinear terms in X1 through X5.


References

Chernozhukov, V., Demirer, M., Duflo, E., and Fernandez-Val, I. (2018). Generic machine learning inference on heterogenous treatment effects in randomized experiments. Technical report, National Bureau of Economic Research.

Dorie, V., Hill, J., Shalit, U., Scott, M., Cervone, D., et al. (2019). Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. Statistical Science, 34(1):43–68.

Fithian, W., Sun, D., and Taylor, J. (2014). Optimal Inference After Model Selection. arXiv e-prints, page arXiv:1410.2597.

Hahn, P. R., Murray, J. S., and Carvalho, C. (2017). Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. ArXiv e-prints.

Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240. Available from: https://doi.org/10.1198/jcgs.2010.08162.

Kallus, N. and Zhou, A. (2018). Confounding-robust policy improvement. In Advances in Neural Information Processing Systems, pages 9269–9279.

Lee, J. D., Sun, D. L., Sun, Y., and Taylor, J. E. (2016). Exact post-selection inference, with application to the lasso. Ann. Statist., 44(3):907–927. Available from: https://doi.org/10.1214/15-AOS1371.

VanderWeele, T. J. and Knol, M. J. (2011). Interpretation of subgroup analyses in randomized trials: heterogeneity versus secondary interventions. Annals of Internal Medicine, 154(10):680–683.

Yadlowsky, S., Namkoong, H., Basu, S., Duchi, J., and Tian, L. (2018). Bounds on the conditional and average treatment effect in the presence of unobserved confounders. arXiv preprint arXiv:1808.09521.

Yeager, D. S. (2019). The National Study of Learning Mindsets, [United States], 2015–2016. Inter-university Consortium for Political and Social Research [distributor].

Yeager, D. S., Hanselman, P., Walton, G. M., Murray, J. S., Crosnoe, R., Muller, C., Tipton, E., Schneider, B., Hulleman, C. S., Hinojosa, C. P., Paunesku, D., Romero, C., Flint, K., Roberts, A., Trott, J., Iachan, R., Buontempo, J., Yang, S. M., Carvalho, C. M., Hahn, P. R., Gopalan, M., Mhatre, P., Ferguson, R., Duckworth, A. L., and Dweck, C. S. (2019). A national experiment reveals where a growth mindset improves achievement. Nature. Available from: https://doi.org/10.1038/s41586-019-1466-y.

Observational Studies 5 (2019) 36-51 Submitted 7/19; Published 8/19

Estimating Treatment Effects with Causal Forests: An Application

Susan Athey [email protected]
Stanford Graduate School of Business, Stanford, CA-94305

Stefan Wager [email protected]
Stanford Graduate School of Business, Stanford, CA-94305

Abstract

We apply causal forests to a dataset derived from the National Study of Learning Mindsets, and discuss resulting practical and conceptual challenges. This note will appear in an upcoming issue of Observational Studies, Empirical Investigation of Methods for Heterogeneity, that compiles several analyses of the same dataset.

1. Methodology and Motivation

There has been considerable recent interest in methods for heterogeneous treatment effect estimation in observational studies (Athey and Imbens, 2016; Athey et al., 2019; Ding et al., 2016; Dorie et al., 2017; Hahn et al., 2017; Hill, 2011; Imai and Ratkovic, 2013; Künzel et al., 2017; Luedtke and van der Laan, 2016; Nie and Wager, 2017; Shalit et al., 2017; Su et al., 2009; Wager and Athey, 2018; Zhao et al., 2017). In order to help elucidate the drivers of successful approaches to treatment effect estimation, Carlos Carvalho, Jennifer Hill, Avi Feller and Jared Murray organized a workshop at the 2018 Atlantic Causal Inference Conference and asked several authors to analyze a shared dataset derived from the National Study of Learning Mindsets (Yeager et al., 2016). This note presents an analysis using causal forests (Athey et al., 2019; Wager and Athey, 2018); other approaches will be discussed in a forthcoming issue of Observational Studies with title "Empirical Investigation of Methods for Heterogeneity." All analyses are carried out using the R package grf, version 0.10.2 (Tibshirani et al., 2018; R Core Team, 2017). Full replication files are available at github.com/grf-labs/grf, in the directory experiments/acic18.

1.1 The National Study of Learning Mindsets

The National Study of Learning Mindsets is a randomized study conducted in U.S. public high schools, the purpose of which was to evaluate the impact of a nudge-like intervention designed to instill students with a growth mindset1 on student achievement. To protect student privacy, the present analysis is not based on data from the original study, but rather on data simulated from a model fit to the National Study dataset by the workshop organizers. The present analysis could serve as a pre-analysis plan to be applied to the original National Study dataset (Nosek et al., 2015). Our analysis is based on data from n = 10,391 children from a probability sample of J = 76 schools.2 For each child i = 1, ..., n, we observe a binary treatment indicator Zi, a real-valued outcome Yi, as well as 10 categorical or real-valued covariates described in Table 1. We expanded

1. According to the National Study, "A growth mindset is the belief that intelligence can be developed. Students with a growth mindset understand they can get smarter through hard work, the use of effective strategies, and help from others when needed. It is contrasted with a fixed mindset: the belief that intelligence is a fixed trait that is set in stone at birth."
2. Initially, 139 schools were recruited into the study using a stratified probability sampling method (Gopalan and Tipton, 2018). Of these 139 recruited schools, 76 agreed to participate in the study; then, students were individually randomized within the participating schools. In this note, we do not discuss potential bias from the non-randomized selection of 76 schools among the 139 recruited ones.

© 2019 Susan Athey and Stefan Wager.

S3   Student's self-reported expectations for success in the future, a proxy for prior achievement, measured prior to random assignment
C1   Categorical variable for student race/ethnicity
C2   Categorical variable for student identified gender
C3   Categorical variable for student first-generation status, i.e. first in family to go to college
XC   School-level categorical variable for urbanicity of the school, i.e. rural, suburban, etc.
X1   School-level mean of students' fixed mindsets, reported prior to random assignment
X2   School achievement level, as measured by test scores and college preparation for the previous 4 cohorts of students
X3   School racial/ethnic minority composition, i.e., percentage of student body that is Black, Latino, or Native American
X4   School poverty concentration, i.e., percentage of students who are from families whose incomes fall below the federal poverty line
X5   School size, i.e., total number of students in all four grade levels in the school
Y    Post-treatment outcome, a continuous measure of achievement
Z    Treatment, i.e., receipt of the intervention

Table 1: Definition of variables measured in the National Study of Learning Mindsets

out categorical random variables via one-hot encoding, thus resulting in covariates Xi ∈ R^p with p = 28. Given this data, the workshop organizers expressed particular interest in the three following questions:

1. Was the mindset intervention effective in improving student achievement?

2. Was the effect of the intervention moderated by school level achievement (X2) or pre-existing mindset norms (X1)? In particular there are two competing hypotheses about how X2 moderates the effect of the intervention: Either it is largest in middle-achieving schools (a "Goldilocks effect") or is decreasing in school-level achievement.

3. Do other covariates moderate treatment effects?

We define causal effects via the potential outcomes model (Imbens and Rubin, 2015): For each sample i, we posit potential outcomes Yi(0) and Yi(1) corresponding to the outcome we would have observed had we assigned control or treatment to the i-th sample, and assume that we observe Yi = Yi(Zi). The average treatment effect is then defined as τ = E[Yi(1) − Yi(0)], and the conditional average treatment effect function is τ(x) = E[Yi(1) − Yi(0) | Xi = x].

This dataset exhibits two methodological challenges. First, although the National Study itself was a randomized study, there seem to be some selection effects in the synthetic data used here. As seen in Figure 1, students with a higher expectation of success appear to be more likely to receive treatment. For this reason, we analyze the study as an observational rather than randomized study. In order to identify causal effects, we assume unconfoundedness, i.e., that treatment assignment is as good as random conditionally on covariates (Rosenbaum and Rubin, 1983)

{Yi(0), Yi(1)} ⊥⊥ Zi | Xi.    (1)



Figure 1: Visualizing estimated treatment propensities against student expectation of success.

To relax this assumption, one could try to find an instrument for treatment assignment (Angrist and Pischke, 2008), or conduct a sensitivity analysis for hidden confounding (Rosenbaum, 2002).

Second, the students in this study are not independently sampled; rather, they are all drawn from 76 randomly selected schools, and there appears to be considerable heterogeneity across schools. Such a situation could arise if there are unobserved school-level features that are important treatment effect modifiers; for example, some schools may have leadership teams who implemented the intervention better than others, or may have a student culture that is more receptive to the treatment. If we want our conclusions to generalize outside of the 76 schools we ran the experiment in, we must run an analysis that robustly accounts for the sampling variability of potentially unexplained school-level effects. Here, we take a conservative approach, and assume that the outcomes Yi of students within the same school may be arbitrarily correlated within a school (or "cluster"), and then apply cluster-robust analysis tools (Abadie et al., 2017).

The rest of this section presents a brief overview of causal forests, with an emphasis on how they address issues related to clustered observations and selection bias. Causal forests are an adaptation of the random forest algorithm of Breiman (2001) to the problem of heterogeneous treatment effect estimation. For simplicity, we start below by discussing how to make random forests cluster-robust in the classical case of non-parametric regression, where we observe pairs (Xi, Yi) and want to estimate µ(x) = E[Yi | Xi = x]. Then, in the next section, we review how forests can be used for treatment effect estimation in observational studies.

1.2 Cluster-Robust Random Forests

When observations are grouped in unevenly sized clusters, it is important to carefully define the underlying target of inference. For example, in our setting, do we want to fit a model that accurately reflects heterogeneity in our available sample of J = 76 schools, or one that we hope will generalize to students from other schools also? Should we give more weight in our analysis to schools from which we observe more students? Here, we assume that we want results that generalize beyond our J schools, and that we give each school equal weight; quantitatively, we want models that are accurate for predicting effects on a new student from a new school. Thus, if we only observed outcomes Yi for students with school


membership Ai ∈ {1, ..., J}, we would estimate the global mean as µ̂ with standard error σ̂, with

µ̂j = (1/nj) Σ_{i: Ai=j} Yi,    µ̂ = (1/J) Σ_{j=1}^{J} µ̂j,    σ̂² = (1/(J(J−1))) Σ_{j=1}^{J} (µ̂j − µ̂)²,    (2)

where nj denotes the number of students in school j. Our challenge is then to use random forests to bring covariates into an analysis of type (2). Formally, we seek to carry out a type of non-parametric random effects modeling, where each school is assumed to have some effect on the student's outcome, but we do not make assumptions about its distribution (in particular, we do not assume that school effects are Gaussian or additive).
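For concreteness, the covariate-free estimator in (2) takes only a few lines of base R; the data frame dat with columns y and school is a hypothetical placeholder.

mu_j <- tapply(dat$y, dat$school, mean)           # per-school means
J    <- length(mu_j)
mu   <- mean(mu_j)                                # each school weighted equally
se   <- sqrt(sum((mu_j - mu)^2) / (J * (J - 1)))  # between-school standard error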

At a high level, random forests make predictions as an average of B trees, as follows: (1) For each b = 1, ..., B, draw a subsample Sb ⊆ {1, ..., n}; (2) Grow a tree via recursive partitioning on each such subsample of the data; and (3) Make predictions

µ̂(x) = (1/B) Σ_{b=1}^{B} Σ_{i=1}^{n} Yi · 1({Xi ∈ Lb(x), i ∈ Sb}) / |{i : Xi ∈ Lb(x), i ∈ Sb}|,    (3)

where Lb(x) denotes the leaf of the b-th tree containing the training sample x. In the case of out-of-bag prediction, we estimate µ̂^(−i)(Xi) by only considering those trees b for which i ∉ Sb. This short description of forests of course leaves many details implicit. We refer to Biau and Scornet (2016) for a recent overview of random forests and note that, throughout, all our forests are "honest" in the sense of Wager and Athey (2018).

When working with clustered data, we adapt the random forest algorithm as follows. In step (1), rather than directly drawing a subsample of observations, we draw a subsample of clusters Jb ⊆ {1, ..., J}; then, we generate the set Sb by drawing k samples at random from each cluster j ∈ Jb.3 The other point where clustering matters is when we want to make out-of-bag predictions in step (3). Here, to account for potential correlations within each cluster, we only consider an observation i to be out-of-bag if its cluster was not drawn in step (1), i.e., if Ai ∉ Jb.

1.3 Causal Forests for Observational Studies

One promising avenue to heterogeneous treatment effect estimation starts from an early result of Robinson (1988) on inference in the partially linear model (Nie and Wager, 2017; Zhao et al., 2017). Write e(x) = P[Zi = 1 | Xi = x] for the propensity score and m(x) = E[Yi | Xi = x] for the expected outcome marginalizing over treatment. If the conditional average treatment effect function is constant, i.e., τ(x) = τ for all x ∈ X, then the following estimator is semiparametrically efficient for τ under unconfoundedness (1) (Chernozhukov et al., 2017; Robinson, 1988):

τ̂ = [ (1/n) Σ_{i=1}^{n} (Yi − m̂^(−i)(Xi)) (Zi − ê^(−i)(Xi)) ] / [ (1/n) Σ_{i=1}^{n} (Zi − ê^(−i)(Xi))² ],    (4)

assuming that m̂ and ê are o(n^(−1/4))-consistent for m and e respectively in root-mean-squared error, that the data is independent and identically distributed, and that we have overlap, i.e., that propensities e(x) are uniformly bounded away from 0 and 1. The (−i)-superscripts denote "out-of-bag" or "out-of-fold" predictions meaning that, e.g., Yi was not used to compute m̂^(−i)(Xi).
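As a rough illustration of how such an analysis can be run with grf (a sketch in the spirit of this note, not the authors' exact script), the following fits a cluster-robust causal forest and reports a doubly robust ATE estimate; X, Y, Z, and school.id are assumed to hold the one-hot-encoded covariates, outcome, treatment, and school identifiers.

library(grf)
cf <- causal_forest(X, Y, Z, clusters = school.id)  # school-level cluster-robust fit
tau.hat <- predict(cf)$predictions                  # out-of-bag CATE estimates
average_treatment_effect(cf)                        # ATE estimate and cluster-robust standard error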

3. If k ≤ nj for all j = 1, ..., J, then each cluster contributes the same number of observations to the forest as in (2). In grf, however, we also allow users to specify a value of k larger than the smaller nj ; and, in this case, for clusters with nj ≤ k, we simply use the whole cluster (without duplicates) every time j ∈ Jb. This latter option may be helpful in cases where there are some clusters with a very small number of observations, yet we want Sb to be reasonably large so that the tree-growing algorithm is stable.

Observational Studies 5 (2019) 52-70 Submitted 9/18; Published 8/19

Examining treatment effect heterogeneity using BART

Nicole Carnegie [email protected]
Montana State University, Bozeman, MT, USA

Vincent Dorie [email protected]
Columbia University, New York, NY, USA

Jennifer L. Hill [email protected]
New York University, New York, NY, USA

Keywords: Causal Inference, Bayesian Additive Regression Trees, Treatment Effect Modification, Group-structured Data

1. Methodology and Motivation

We were presented with the challenge of estimating causal effects using simulated data that was intended to roughly mirror “preliminary data extracted from the National Study of Learning Mindsets.” In particular, we were asked to address three research goals:

1. Was the mindset intervention effective in improving student achievement?

2. Researchers hypothesize that the effect of the intervention is moderated by school level achievement (X2) and pre-existing mindset norms (X1). In particular there are two competing hypotheses about how X2 moderates the effect of the intervention: Either it is largest in middle-achieving schools (a "Goldilocks effect") or is decreasing in school-level achievement.

3. Researchers also collected other covariates and are interested in exploring their possible role in moderating treatment effects.

We discuss our approach to these three research goals as well as a summary of our results.

1.1 Assumptions

Given that the simulated dataset was based on data from a large-scale randomized experiment, the Learning Mindsets study, we were hopeful that the simulated data satisfied ignorability for the research questions posed. To be conservative, we assumed that ignorability was conditional on the full set of observed covariates—that is, Y(0), Y(1) ⊥ Z | X (Rubin 1979). In the post-workshop analyses we examined the sensitivity of our estimates to violations of ignorability and were satisfied that it was not an unreasonable assumption. Given this ignorability assumption, any analysis would require appropriate conditioning on covariates to achieve unbiased estimates of E[Y(0) | X] and E[Y(1) | X]. We had two strategies for avoiding strong parametric assumptions. First, we checked that each variable

© 2019 Nicole Carnegie, Vincent Dorie, and Jennifer L. Hill.

we controlled for satisfied balance and overlap. This helped ensure that empirical counterfactuals existed for all observations. Second, we used a very flexible modeling strategy for estimating these conditional expectations.

At different stages in our analysis, we made several different types of modeling choices with respect to the grouped data structure. Each has its own set of assumptions. When we represented this structure through school-specific fixed effects, α, our ignorability assumption generalized to Y(0), Y(1) ⊥ Z | X, α. When we instead modeled school-level variation as varying intercepts—or "random effects"—we imposed the additional assumption that the random effects were uncorrelated with the (school-level aggregates of) covariates and treatment indicator. This would be violated if an unobserved school-level covariate was predictive of both school-level treatment rates and mean response.

We had no way of knowing whether SUTVA was satisfied. We performed analyses under the assumption that it was satisfied.

1.2 Choice of BART as the foundation of our approach

Without information about the true parametric form of the response surface, we opted for a method that flexibly fit the response surface. Recent evidence demonstrates the advantages of machine learning algorithms as an approach to causal effect estimation (for instance, Hill 2011; Dorie et al. 2018). Within this class of estimators, we prefer automated algorithms that have been integrated into Bayesian inferential frameworks. This combination allows for uncertainty quantification and is more flexible in accommodating several other complications such as grouped data structures and missing outcome data. One such modeling strategy, based on Bayesian Additive Regression Trees (BART; Chipman et al. 2007, 2010), already has a proven track record of superior performance in causal inference settings (Hill 2011; Hill et al. 2011; Hill and Su 2013; Dorie et al. 2016; Kern et al. 2016; Wendling et al. 2018). Here we briefly introduce BART and its uses for causal inference.

Bayesian Additive Regression Trees. The BART algorithm consists of a sum-of-trees model and a regularization prior. The prior avoids overfitting by specifying the number of trees, the prior distribution for the size of each tree, the shrinkage applied to the fit from each tree, and the degrees of freedom for the prior distribution for the residual variance. Interested readers can find more information on the model, prior, and fitting algorithms in Chipman et al. (2007, 2010). The key point is that BART can be used to flexibly fit even highly nonlinear response surfaces, which is consistent with our goal to fit E[Y(1) | X] − E[Y(0) | X] without making undue parametric assumptions.

BART for causal inference. It is straightforward to use BART to estimate the average treatment effect (ATE). First fit BART to the observed data (Y given Z and X). Next, make predictions for two datasets (Hill 2011): X is kept intact for both, but in one all treatment values are set to 0, and in the other they are all set to 1. This allows BART to draw from the posterior distribution for E[Y(0) | X] and E[Y(1) | X] for each person, implying we can also obtain draws from E[Y(1) − Y(0) | X] for each person. These posterior distributions for individual-level treatment effects can then be aggregated to obtain posterior distributions of average treatment effects either for the full dataset or any subset thereof.
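As a concrete illustration, the two-dataset prediction strategy can be sketched with the dbarts package roughly as follows. This is a minimal sketch under placeholder names (covs, z, y), not the authors' code.

```r
library(dbarts)

# Placeholders: covs = matrix of covariates, z = treatment indicator, y = outcome.
# Sketch of the two-dataset prediction strategy, not the authors' implementation.
x_train <- cbind(z = z, covs)
x_test  <- rbind(cbind(z = 0, covs),   # everyone assigned to control
                 cbind(z = 1, covs))   # everyone assigned to treatment

fit <- bart(x.train = x_train, y.train = y, x.test = x_test,
            ndpost = 1000, nskip = 500, verbose = FALSE)

n <- nrow(covs)
y0_draws <- fit$yhat.test[, 1:n]              # posterior draws of E[Y(0) | X]
y1_draws <- fit$yhat.test[, (n + 1):(2 * n)]  # posterior draws of E[Y(1) | X]

ite_draws <- y1_draws - y0_draws              # individual-level effect draws
ate_draws <- rowMeans(ite_draws)              # posterior draws of the sample ATE
c(mean(ate_draws), quantile(ate_draws, c(0.025, 0.975)))
```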

Adding the propensity score as a covariate. While the approach described in Hill (2011) has good properties across a variety of settings (Hill et al. 2011; Hill and Su 2013; Dorie et al. 2016; Kern et al. 2016; Wendling et al. 2018), recent work (Hahn et al. 2017) reveals situations where the performance of BART can be compromised due to regularization-induced confounding. While this is less of a concern in settings like the present one, in which covariates are outnumbered by observations and data are well-behaved, in general the "best practice" recommendation for using BART for causal inference is to guard against this potential source of bias. One approach to doing so, suggested by Hahn et al. (2017), is to include an estimate of the propensity score as a covariate. We used BART to fit a propensity score model (as described below) and included the estimate in response models.

Cross-validation to choose hyperparameters. BART tends to perform well using the default prior specification described by Chipman et al. (2007), but performance can sometimes be improved by choosing hyperparameters via cross-validation (Chipman et al. 2010). This is particularly important when using BART for non-continuous outcomes, a case for which off-the-shelf BART is currently not optimized (Dorie et al. 2016).1

Overlap. BART has certain advantages over propensity score approaches to evaluating overlap, which can be misled by covariates that are strongly predictive of the treatment but not strongly associated with the outcome. Therefore, in addition to checking overlap marginally for each covariate and for the propensity score, we also checked overlap using a BART-specific approach as in Hill and Su (2013); results are described in Section 3.

Causal inference with group structured data. The data have a multilevel structure. Treatment was assigned at the individual level, but students are grouped within schools. A primary goal was to decide whether and how to model this structure. During the workshop, we presented results from a fixed-effects specification. Ultimately, our preferred model for the response surface is the random-effects specification. However, robustness checks (below) reveal that this choice made little difference in overall results.2

1. Murray (2017) has derived models for a wide class of extensions to BART that are optimized for binary, count, and multiple-category outcomes; however, these are not yet available in shareable software.
2. We used fixed effects for the propensity score model, since random effects with a binary response are not yet implemented in dbarts, or elsewhere in R, to our knowledge.


Overview of preferred modeling strategy. Our preferred modeling strategy proceeds as follows: (1) fit a propensity score model with BART using all covariates and including school ID as a fixed effect, using cross-validation to choose hyperparameters (75 trees with k = 8); (2) fit a response model on the observed covariates and the estimated propensity score with BART, including schools as random effects, using cross-validation to choose hyperparameters (350 trees with k = 1.5). For both fits we run 4 chains with 1000 iterations each (in addition to 500 burn-in iterations). Given the symmetry of the posterior distributions of interest, we report credible intervals based on normal approximations.
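The following is a hedged sketch of this two-stage strategy using the basic bart() interface in dbarts; it is not the authors' code. For simplicity it keeps school as fixed-effect dummies in both stages (the authors' preferred response model instead used school-level random intercepts), and covs, school, z, and y are placeholders for the challenge data.

```r
library(dbarts)

# Placeholders: covs = matrix of student/school covariates, school = factor of
# school IDs, z = treatment indicator, y = outcome. Hyperparameter values follow
# the text above; this is an illustration, not the authors' implementation.
x_fixed <- cbind(covs, model.matrix(~ school - 1))

# Stage 1: propensity score model. With a binary outcome BART works on the
# probit scale, so fitted probabilities come from pnorm() of the latent draws.
ps_fit <- bart(x.train = x_fixed, y.train = z, ntree = 75, k = 8, verbose = FALSE)
pscore <- colMeans(pnorm(ps_fit$yhat.train))

# Stage 2: response surface on covariates plus the estimated propensity score;
# the school-level random intercepts of the preferred model are omitted here.
# Prediction then proceeds via the two-dataset trick shown earlier.
x_resp <- cbind(z = z, x_fixed, pscore = pscore)
x_test <- rbind(cbind(z = 0, x_fixed, pscore = pscore),
                cbind(z = 1, x_fixed, pscore = pscore))
resp_fit <- bart(x.train = x_resp, y.train = y, x.test = x_test,
                 ntree = 350, k = 1.5, ndpost = 1000, nskip = 500)
```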

2. Results from analyses run for the workshop

During the workshop we discussed assumptions and addressed the questions posed.

2.1 Checking assumptions

Our first step was to check balance and overlap of covariates between treatment groups. We checked each covariate individually as well as the propensity score and found overwhelming support for both balance and overlap (see Figures A1 and A2 in Supplemental Appendix). We revisit overlap in Section 3 with more sophisticated diagnostics.

2.2 Goal 1: Intervention effectiveness

We addressed this question by using BART in the manner described above with a focus on estimating an average causal effect and associated uncertainty interval. The posterior distribution of this effect is reasonably symmetric, so we reported only an effect estimate (posterior mean) of 0.248 with a 95% credible interval of (0.227, 0.270). By this measure, we deem the intervention to be effective on average. The choice of grouping adjustment makes little difference to the estimate of the overall ATE, leading to differences of less than 0.003 in the posterior mean and interval endpoints.

2.3 Goal 2: Moderation by specific covariates

We had several related strategies for exploring moderation. These capitalize on the fact that BART provides a posterior distribution of the causal effect for each observation. It is thus straightforward to examine the relationship between the expected effect for each person (represented by the mean of the corresponding posterior distribution) and any covariate of interest. We can do the same with respect to school-level effects and covariates. We present a few of the myriad methods for portraying these relationships.

The role of urbanicity. Before discussing our results for Goal 2, we address an important discovery made in our pursuit of that goal. While exploring the role of X1 ("fixed mindset") and X2 ("achievement") as moderators, we created scatterplots of individual treatment effect estimates (posterior means) versus X1 and X2. These plots revealed a group of schools with average treatment effects substantially lower than the rest, as displayed in the left-most plots of Figure 1.3 Fitting a regression tree for the individual effects given all covariates (using the rpart package in R) easily identified the five-category school-level covariate XC, or "urbanicity", as the culprit. Color-coding by urbanicity levels displays this visually. Posterior distributions of the treatment effect for each urbanicity level, displayed in the right-most plot of this figure, would have alerted us to this phenomenon as well. Of course, researchers do not typically check for moderation with respect to all covariates (and in fact are often discouraged from doing so out of fear of data snooping). Therefore, in the absence of a specific hypothesis about urbanicity, the substantial differences in treatment effects across its levels might have gone undetected with a more traditional test of moderation.
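A sketch of this regression-tree hunt for moderators, with hypothetical object names rather than the authors' code, could look like:

```r
library(rpart)

# Placeholders: ite_mean = posterior means of individual effects (e.g., from the
# earlier sketch), moderators = data frame of student- and school-level covariates.
tree_dat <- data.frame(ite_mean = ite_mean, moderators)
tree_fit <- rpart(ite_mean ~ ., data = tree_dat,
                  control = rpart.control(maxdepth = 3))
print(tree_fit)  # in the authors' analysis the dominant split was on urbanicity (XC)
```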

Figure 1: The role of urbanicity. Left and center: individual treatment effect estimates (posterior means) versus X1 (fixed mindset) and X2 (achievement). Right: posterior treatment effects by urbanicity level.

Moderation by X1 and X2. Given the distinctive role that urbanicity plays in predicting school-level treatment effects, we opted to subtract the variation due to urbanicity from the school-level treatment effects. This was accomplished by centering the posterior mean individual ATEs on the average individual ATE within urbanicity category before computing school-level averages. In practice, this choice would be made in conjunction with the applied researcher, since it subtly changes the nature of the research question. In essence, we are now examining whether treatment effects vary across schools with the same urbanicity rating that differ in their average level of fixed mindset or their achievement. For the workshop, we presented plots of the relationship between the school-level treatment effects and each of these potential moderators as lowess curves with uncertainty bounds, as in Figure 2. These provide weak evidence of moderation by X1, with a trend towards smaller effects for schools that had higher levels of fixed mindset. Similarly, there appears to be some evidence for a positive association between school achievement and treatment effect. However, we were not satisfied with using the default uncertainty bounds provided by ggplot for lowess. Our post-workshop analyses provide more clarity regarding these trends and associated uncertainty, but do not alter our overall conclusions.

3. Actually, first we created lowess plots of these relationships. These masked this phenomenon! This is a testimony to the power of plotting your data!
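A minimal sketch of the urbanicity adjustment described above, assuming placeholder vectors aligned to the students (not the authors' code):

```r
# Placeholders: ite_mean = posterior-mean individual effects, urbanicity = each
# student's school urbanicity category (XC), school = school ID for each student.
ite_adj    <- ite_mean - ave(ite_mean, urbanicity)  # center on urbanicity-category means
school_ate <- tapply(ite_adj, school, mean)         # urbanicity-adjusted school-level effects
```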


Figure 2: Lowess representations of X1 (left) and X2 (right) as moderators.

2.4 Goal 3: Moderation by other covariates

We presented moderation plots similar to those above for each of the continuous covariates (X3, X4, and X5); these are displayed in the Supplemental Appendix as Figure A3. We made more informative plots after the workshop; our conclusions did not change. For binary covariates, we assessed moderation using the posterior distribution of the difference in ATE between groups. For “first-generation status” (C3), we observe a mean difference of -0.025 with 95% credible interval (-0.018, 0.060); the treatment effect is slightly (but not significantly) lower for first-generation students. There is no evidence of a difference in treatment effects by gender (difference estimate: -0.0095, 95% CI: (-0.44, 0.32)). For multi-category factors, we began by simply examining side-by-side boxplots of the adjusted individual ATEs by level. As can be seen in Figure A4 in the Supplemental Appendix, there appears to be little evidence of a race effect. It is possible that there is a trend of increasing ATE as student expected success increases.

3. Post-workshop analyses

After the workshop we examined a few issues in more depth, as summarized here.

3.1 Re-examination of modeling choices

Our initial comparison of modeling strategies with regard to the grouping variable only considered differences in the overall ATE and corresponding uncertainty intervals. Given the focus on moderation, we are interested in understanding whether the individual ATE estimates varied much based on this choice. Figure A5 in the Supplemental Appendix presents scatter plots of individual ATEs across all pairs of the three choices. There is little difference in estimates excluding the school variable and including it as a fixed effect. However, modeling the grouped structure using a random effect creates a noticeably wider distribution of individual-level effects.

We also examined more closely the impact of including the propensity score by comparing our results to models without this feature. The estimate of the ATE in a model that excludes the propensity score from the covariate set is 0.249 with associated credible interval (0.228, 0.270). This is almost identical to that of our preferred analyses. The correlation between posterior means of individual effects between these analyses is 0.896. We provide a more detailed comparison in the Supplemental Appendix.

3.2 Revisiting Moderation

We redid some of our original moderation analyses for several reasons. First, we found a way of graphically displaying our uncertainty about the relationship trends that is easier to interpret. Second, we wanted to more explicitly test the "Goldilocks" hypothesis posed by the research team. The analyses reported here are net of the impact of urbanicity on the treatment effects. Figure A7 in the Supplemental Appendix displays similar results without adjustment for urbanicity. These relationships are so dominated by the urbanicity-specific treatment effect differentials that they nearly all demonstrate a "reverse Goldilocks" phenomenon. We start by discussing moderation by school-level achievement, X2, since the hypotheses regarding X2 are more complicated.

Moderation by X2. We explore the research questions about potential moderation by X2 in two ways. Our first approach partitions X2 into 3 subgroups using a recursive partitioning algorithm. Then treatment effects are averaged within subgroup to create draws from the posterior distribution of the average treatment effect for "low", "medium", and "high" values of X2. We can compare the "low" and "medium" subgroups or the "high" and "medium" subgroups by differencing the corresponding posterior distributions, as displayed on the left side of Figure 3. The posterior probability that the average treatment effect for the medium subgroup is greater than for the low subgroup is 99.9%. However, the effect for the medium subgroup is not likely to be greater than that for the "high" subgroup; indeed, we find a posterior probability of 97.8% that the high subgroup has a larger average effect. This analysis does not provide evidence for the Goldilocks effect. Second, we display on the right side of Figure 3 a plot of school-level average treatment effects (net of urbanicity-specific means) versus X2, with a sample of quadratic fits to the posterior draws to illustrate our uncertainty about this fit.
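A rough sketch of such a subgroup comparison is below; the cutpoints are illustrative rather than the fitted recursive partition, and all object names are placeholders.

```r
# Placeholders: ite_draws = matrix of posterior draws (rows) by students (columns)
# of urbanicity-adjusted individual effects; X2 = school achievement for each student.
bins <- cut(X2, breaks = c(-Inf, -0.5, 0.5, Inf),   # illustrative cutpoints only,
            labels = c("low", "medium", "high"))    # not the fitted partition
grp_draws <- sapply(levels(bins), function(g) rowMeans(ite_draws[, bins == g]))

mean(grp_draws[, "medium"] > grp_draws[, "low"])   # Pr(medium effect > low effect)
mean(grp_draws[, "medium"] > grp_draws[, "high"])  # Pr(medium effect > high effect)
```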


Figure 3: Left: Posterior distributions of differences in school average treatment effects between the medium and low X2 subgroups (top) and the high and medium subgroups (bottom). Right: Posterior distributions for school average treatment effects as a function of X2, after controlling for XC. Points are the posterior means of school average treatment effects and vertical lines show associated 95% posterior credible intervals. Curved lines show 30 samples from the posterior distribution of quadratic regressions fit to the school average effects (gray lines) and the posterior mean of all such regressions (black line).

While limited by its simplistic parametric form, examining the coefficient of the squared term offers a straightforward test for a rise-then-fall relationship. The posterior mean indicates a slight Goldilocks effect; however, the posterior uncertainty in the square-term coefficient is consistent with no effect. There is only an approximate 68.7% posterior probability of this term being negative. Even cursory visual inspection of Figure 3 discounts the hypothesis that higher school-level achievement is associated with smaller treatment effects.
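A sketch of this draw-by-draw quadratic summary, under placeholder names (not the authors' code):

```r
# Placeholders: school_ate_draws = matrix of posterior draws (rows) by schools
# (columns) of urbanicity-adjusted school average effects; X2_school = school-level
# achievement, one value per school.
quad_coefs <- apply(school_ate_draws, 1, function(d)
  coef(lm(d ~ X2_school + I(X2_school^2))))

# Posterior probability that the squared term is negative (a rise-then-fall shape).
mean(quad_coefs["I(X2_school^2)", ] < 0)
```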

Moderation by X1. The relationship between school-level treatment effects and fixed mindset, X1, is decreasing without strong evidence of quadratic curvature, as displayed in Figure 4. The probability that the linear part of this decreasing trend is less than zero is approximately 96.6%, and the probability that the quadratic part is less than zero is 75.8%.

Figure 4: Posterior distributions for school average treatment effects as a function of X1, after controlling for XC. All else is as described in Figure 3.

Moderation by the other school-level continuous covariates. In Figure 5 we display moderation plots similar to those in the previous section for the remaining continuous school-level variables: "minority composition" (X3), "poverty concentration" (X4), and "school size" (X5). The posterior probabilities that the linear terms are negative are approximately 84.0%, 73.0%, and 10.6% respectively, while the corresponding probabilities for the quadratic terms are 90.8%, 91.4%, and 6.3%. This provides some evidence of a Goldilocks effect for poverty concentration but nothing earth-shattering.


Figure 5: Posterior distributions for school average treatment effects as a function of X3, X4, and X5, net of XC. All else is as described in Figure 3.

Moderation by Student Expected Success: A closer look. We return to examine moderation by S3, student expected success, because our initial results provided some evidence for a moderated effect but we performed no formal tests.4 Figure 6 displays a line plot connecting the posterior means of the ordered categories along with corresponding credible intervals. The right plot tests whether levels 6 and 7 have larger effects than those below; there is moderate support (91% probability) for this hypothesis.

4. The Supplemental Appendix provides a somewhat similar reanalysis for the race variable.



Figure 6: Left: Means and 95% credible intervals of posterior distributions (impact of XC removed) for each level of ordered categorical variable S3 presented as a line plot. Right: Posterior distribution for the difference in mean effects between the two top levels of S3 and the rest. About 91% of this distribution lies above zero.

3.3 More formal checks of assumptions

BART has already been incorporated into diagnostic frameworks to examine the plausibility of two crucial causal assumptions: ignorability and overlap. Dorie et al. (2016) demonstrate how BART can be incorporated into a sensitivity analysis framework to help researchers understand under what conditions their results might be sensitive to unobserved confounding. This approach is available in the treatSens package on CRAN as the function treatSens.BART. The results from this sensitivity analysis are displayed in Figure A10 in the Supplemental Appendix. This plot reveals that the level of unobserved confounding would need to be extremely strong in order to remove the estimated positive effect. This sensitivity analysis strongly supports our assumption of ignorability.5

Our covariate-by-covariate examination of overlap in the previous section (results displayed in the Supplemental Appendix) provided strong evidence in support of marginal balance and overlap. In the Supplemental Appendix we present two additional looks at the issue. The first, displayed on the left side of Figure A9, is a scatter plot of the joint distribution of (estimated posterior means of) Y(0) and Y(1) for each observation for both treated (red) and control (blue) observations that suggests strong overlap. To investigate local overlap (as suggested in the workshop discussion) we calculated an overlap diagnostic recommended by Hill and Su (2013). For each person we calculate the ratio of the variance of the posterior distribution of their counterfactual outcome relative to the variance of the posterior distribution for their factual outcome. The distribution of these ratios is displayed in the right panel of Figure A9. We gauge the extremity of any such ratio relative to a Chi-squared distribution; the 10% cutoff would be about 2.7. No ratios even come close to this threshold, suggesting that overlap is present locally as well as marginally.

5. The treatSens package does not yet accommodate random effects, so it was run with fixed effects.
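In code, this variance-ratio diagnostic might be computed roughly as follows; the object names are placeholders, and the chi-squared(1) reference (whose upper 10% point is about 2.7, as noted above) is an assumption of this sketch rather than the authors' implementation.

```r
# Placeholders: y0_draws, y1_draws = posterior draws of E[Y(0) | X] and E[Y(1) | X]
# (draws in rows, people in columns); z = observed treatment indicator.
var0 <- apply(y0_draws, 2, var)
var1 <- apply(y1_draws, 2, var)

# Counterfactual-to-factual posterior variance ratio for each person.
ratio <- ifelse(z == 1, var0 / var1, var1 / var0)

cutoff <- qchisq(0.90, df = 1)  # about 2.7
mean(ratio > cutoff)            # fraction of people flagged for weak local overlap
```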

4. Discussion

Several features make BART a powerful tool for causal inference. The sum-of-trees model flexibly fits even highly non-linear response surfaces. The Bayesian inferential framework allows us to easily quantify our uncertainty not only about the average treatment effect and individual-level treatment effects but also any functions of the potential outcomes (all without re-using our data). The recent extensions that accommodate varying treatment effects extend the applicability of this tool to simple multilevel data structures.

We used BART to address the questions posed and found strong evidence of a large positive average effect of the intervention (the "effect size" is about .4 and the credible interval has near zero probability of covering 0). Urbanicity strongly moderates this treatment effect; therefore, we addressed the other questions after adjusting for this.6 Net of urbanicity, we find some evidence of moderation at the school level: the school-level mean of students' fixed mindsets, X1, is moderately negatively associated with the size of the effect; achievement, X2, is moderately positively associated. Of all student-level variables, there is most support for moderation by student expected success, S3.

Our results are predicated on satisfying several assumptions—ignorability, overlap, etc.—that in some situations can be heroic. The BART extensions that easily allow examination of the evidence for and implications of these assumptions add credibility to our analyses. We found strong support that these assumptions were satisfied.

References

Chipman, H., George, E., and McCulloch, R. (2007). Bayesian ensemble learning. In Schölkopf, B., Platt, J., and Hoffman, T., editors, Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA.

Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. Annals of Applied Statistics, 4(1):266–298.

Dorie, V., Carnegie, N. B., Harada, M., and Hill, J. (2016). A flexible, interpretable framework for assessing sensitivity to unmeasured confounding. Statistics in Medicine, 35(20):3453–70.

6. In a "real-life" situation where we could interact with the applied researcher, we might make a different choice based on their understanding of the theoretical questions of primary interest. However, we were making decisions in the absence of such an interaction.


Dorie, V., Hill, J., Shalit, U., Scott, M., and Cervone, D. (2018). Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. Statistical Science, accepted with discussion.

Hahn, P. R., Murray, J. S., and Carvalho, C. M. (2017). Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. ArXiv e-prints.

Hill, J. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240.

Hill, J. and Su, Y.-S. (2013). Assessing lack of common support in causal inference using Bayesian nonparametrics: Implications for evaluating the effect of breastfeeding on children's cognitive outcomes. Ann. Appl. Stat., 7(3):1386–1420. Available from: http://dx.doi.org/10.1214/13-AOAS630.

Hill, J. L., Weiss, C., and Zhai, F. (2011). Challenges with propensity score strategies in a high-dimensional setting and a potential alternative. Multivariate Behavioral Research, 46:477–513.

Kern, H. L., Stuart, E. A., Hill, J. L., and Green, D. P. (2016). Assessing methods for generalizing experimental impact estimates to target samples. Journal of Research on Educational Effectiveness, 9:103–127.

Murray, J. S. (2017). Log-Linear Bayesian Additive Regression Trees for Categorical and Count Responses. ArXiv e-prints.

Rubin, D. B. (1979). Using multivariate matched sampling and regression adjustment to control bias in observational studies. Journal of the American Statistical Association, 74:318–328.

Wendling, T., Jung, K., Callahan, A., Schuler, A., Shah, N., and Gallego, B. (2018). Comparing methods for estimation of heterogeneous treatment effects using observational data from health care databases. Statistics in Medicine.


Appendix A. Supplemental Appendix to Carnegie et al. Discussion

A.1 Additional workshop analyses

Overlap plots We examined the overlap and balance of each of the covariates marginally through a variety of plots; see Figure A1 and Figure A2. These demonstrated a high degree of both balance and overlap.

[Panels, left to right: C2 (gender), C3 (first generation), C1 (student race/ethnicity), S3 (student expected success), and XC (school urbanicity); each panel shows the percent of observations at each level for the control and treated groups.]

Figure A1: Overlap in student-level binary variables (left) and multiple-level categorical variables (right)

[Panels: distributions of X1 (fixed mindset), X2 (achievement), X3 (minority composition), X4 (poverty concentration), and X5 (school size) by treatment group.]

Figure A2: Overlap in school-level continuous variables


Effect modification by continuous variables: lowess plots For the workshop, we presented plots of the relationship between the school-level treatment effects and each of the potential moderators as lowess curves with uncertainty bounds. Figure A3 gives the resulting plots for school-level continuous covariates X3 through X5.

0.05

0.00

−0.05 School−level ATE School−level

X3: Minority composition X4: Poverty concentration X5: School size −1 0 1 2 −2 −1 0 1 2 3 −1 0 1 2

Figure A3: Lowess representations of X3 (left) through X5 (right) as moderators.

Effect modification by categorical variables: boxplots For categorical variables, we used simple side-by-side boxplots to evaluate potential effect modification. There was little evidence of an effect of race using this method, but some suggestion of an increasing effect with student expected success (S3). In particular, it appears that the treatment effect is larger for students whose expected success is greater than 5.


Figure A4: Boxplots of adjusted individual ATE by level of S3 (left) and C1 (right) as moderators.

A.2 Additional results from post-workshop analyses

The impact of group-level modeling strategies Figure A5 displays scatter plots of individual ATEs showing how they vary across adjustment methodologies: no adjustment (that is, including school ID as a continuous covariate so that BART is forced to make splits based on differences between contiguous categories), fixed effects, and random effects.

Figure A5: Scatterplots of individual ATE estimates across adjustment methodologies.

Impact of adding the propensity score as a covariate We discuss in the main text the high degree of correspondence between the individual-level treatment effect estimates produced using an estimation strategy that includes the estimated propensity score as a covariate versus one that excludes it. However, a scatter plot of the two sets of estimates (Figure A6) reveals an interesting feature: these estimates appear to come from a mixture of two subpopulations. When we predict the random effect estimates using a regression tree, S3 emerges as by far the strongest predictor, highlighting the pattern we saw in Figure A4.

Treatment Effect Modification when urbanicity has not been netted out. The plots in the main text display results examining associations between covariates and treatment effects. We felt it was also important to examine how different these results might be if we had decided not to net out urbanicity (XC). Figure A7 shows school average treatment effects as a function of X2 with levels of XC highlighted by color. Urbanicity category 4 is markedly below the others, and it complicates one of the primary objectives of this exercise: characterizing the moderating effect of X2. Not only are the hypotheses that X2 has a "Goldilocks" impact on the treatment effect, or that it steadily decreases effectiveness, ruled out; if urbanicity is not controlled for, one can reach an opposite conclusion, namely that the effect is smallest in the middle range. Consequently, all subsequent analyses control for urbanicity by subtracting out the level-average effects.

Moderation by Race: A closer look We revisited moderation by race after the workshop to implement some more formal tests. Figure A8 shows the racial average treatment effects after controlling for urbanicity (XC). It provides some evidence of racial moderation of the treatment effect; however, many of the effects are consistent across race categories.


Figure A6: Scatterplot of individual ATE estimates using random effects model with and without propensity score included as a predictor.

The highest and lowest race average treatment effects are estimated with a considerable degree of uncertainty due to their sample sizes; however, the posterior distribution of the difference between the highest and lowest racial averages (based on their posterior means) yields a borderline "statistically significant" difference. The distribution over this difference has 5.2% probability assigned to negative values, so that a one-sided posterior credible interval would just barely include 0. However, this contrast was chosen after looking at the plots and without a clear hypothesis about "race level 11" as a specific moderator. Therefore we see such analyses as exploratory.

Diagnostics that assess plausibility of the ignorability and overlap assumptions. Figure A9 displays evidence regarding the overlap based on BART output as described in the main text.

Sensitivity to unobserved confounding. We see that the amount of confounding necessary to substantively change our results would be quite extreme and certainly far exceeds the current levels of associations with observed covariates.


Figure A7: Posterior distributions for school average treatment effects as a function of, left-to-right/top-to-bottom, X1, X2, X3, X4, X5. The points are the posterior means in each school while the curved line is the posterior mean of quadratic regressions fit to the school average treatment effects.


Figure A8: Left: Posterior means and 95% credible intervals of average treatment effects for each race category after controlling for XC. Right: Distribution of posterior samples for the difference between the highest and lowest treatment effect races.


Figure A9: Left: Overlap across treatment (red) and control (blue) groups with regard to distribution of Y(0) and Y(1). Right: Distribution of variance ratios for counterfactual versus factual outcomes.

Figure A10: Sensitivity to unobserved confounding.

Observational Studies 5 (2019) 71-82 Submitted 7/19; Published 8/19

Machine Learning Analysis of Heterogeneity in the Effect of Student Mindset Interventions

Fredrik D. Johansson [email protected] Institute for Medical Engineering & Science Massachusetts Institute of Technology

Abstract We study heterogeneity in the effect of a mindset intervention on student-level performance through an observational dataset from the National Study of Learning Mindsets (NSLM). Our analysis uses machine learning (ML) to address the following associated problems: assessing treatment group overlap and covariate balance, imputing conditional average treatment effects, and interpreting imputed effects. By comparing several different model families we illustrate the flexibility of both off-the-shelf and purpose-built estimators. We find that the mindset intervention has a positive average effect of 0.26, 95%-CI [0.22, 0.30], and that heterogeneity in the range of [0.1, 0.4] is moderated by school-level achievement level, poverty concentration, urbanicity, and student prior expectations. Keywords: Machine learning, interpretability, counterfactual estimation

1. Methodology and Motivation

Machine learning (ML) has had widespread success in solving prediction problems in applications ranging from image and speech recognition (LeCun et al., 2015) to personalized medicine (Kononenko, 2001). This makes it an attractive tool also for studying heterogeneity in causal effects. In fact, ML excels at overcoming well-known limitations of traditional methods used to solve this task. For example, matching methods struggle to perform well when confounders and covariates are high-dimensional (Rubin and Thomas, 1996); generalized linear models are not flexible enough to discover variable interactions and non-linear trends; and propensity-based methods suffer from variance issues in estimation (Lee et al., 2011). In contrast, supervised machine learning has proven highly useful in discovering patterns in high-dimensional data (LeCun et al., 2015), approximating complex functions and trading off bias and variance (Swaminathan and Joachims, 2015).

In this observational study based on data from the National Study of Learning Mindsets (NSLM), we apply both off-the-shelf and purpose-built ML estimators to characterize heterogeneity in the effect of a student mindset intervention on future academic performance. In particular, we compare estimates of conditional average treatment effects based on linear models, random forests, gradient boosting and deep neural networks. Below, we introduce the problem of discovering treatment effect heterogeneity and describe our methodology.

1.1 Problem setup

We study the effect of a student mindset intervention based on observations of 10391 students in 76 schools from the NSLM study. The intervention, assigned at the student level, is represented by a binary variable Z ∈ {0, 1} and the performance outcome by a real-valued variable Y ∈ R. Students are observed through covariates S3, C1, C2, C3 and schools through covariates X1, ..., X5.1 For convenience, we let X = [S3, C1, C2, C3, X1, ..., X5]⊤ represent the full set of covariates of a student–school pair. We let (x_ij, z_i, y_i) denote the observation corresponding to a student i ∈ {1, ..., m} in a school j ∈ {1, ..., n}. As each student is enrolled in at most one school, we omit the index j in the sequel. Observed treatment groups G0 (control) and G1 (treated) are defined by G_z = {i ∈ {1, ..., m} : z_i = z}. The full dataset of observations is denoted D = {(x_1, z_1, y_1), ..., (x_m, z_m, y_m)}, and the density of all variables p(X, Z, Y).

We adopt the Neyman–Rubin causal model (Rubin, 2005), and denote by Y(0), Y(1) the potential outcomes corresponding to interventions Z = 0 and Z = 1, respectively. The goal of this study is to characterize heterogeneity in the treatment effect Y(1) − Y(0) across students and schools. As is well known, this effect is not identifiable without additional assumptions, as each student is observed in only one treatment group. Instead, we estimate the conditional average treatment effect (CATE) with respect to observed covariates X.

τ(x) = E[Y(1) − Y(0) | X = x]    (1)

CATE is identifiable from observational data under the standard assumptions of ignorability,

Y(1), Y(0) ⊥⊥ Z | X,

consistency, Y = Z·Y(1) + (1 − Z)·Y(0), and overlap (positivity),

∀x : p(Z = 0 | X = x) > 0 ⇔ p(Z = 1 | X = x) > 0.

The CATE conditioned on the full set of covariates X is the closest we get to estimating the treatment effect for an individual student. However, to design policies, it is rarely necessary to have this level of resolution. In later sections, we estimate conditional effects also with respect to subsets or functions of X, such as the average effect stratified by school achievement level. By first identifying τ(x) and then marginalizing it with respect to such functions, we adjust for confounding to the best of our ability.

1.2 Methodology overview

The flexibility of ML estimators creates both opportunities and challenges. For example, it is typical for the number of parameters of the best performing model on a task to exceed the number of available samples. This is made possible by mitigating overfitting (high variance) through appropriate regularization. Indeed, many models achieve the best fit only after having carefully set several tuning parameters that control the bias and variance trade-off. It is standard practice in ML to use sample splitting for this purpose. Here, we apply such a pipeline to CATE estimation, proceeding through the following steps.

1. Split the observed data into two partitions, a training set D_t and a validation set D_v, for parameter fitting and model selection respectively.


2. Fit estimators f0, f1 of the potential outcomes Y(0) and Y(1) to D_t and select tuning parameters based on held-out error on D_v.

3. Impute CATE, τ̂_i := f1(x_i) − f0(x_i), for every student in D and fit an interpretable model h(x) ∼ τ̂ to characterize treatment effect heterogeneity.

This pipeline allows us to find the best fitting (black-box) estimators possible in Step 2 without regard for the interpretability of their predictions. By fitting a simpler, more interpretable model to the imputed effects in Step 3, we may explain the predictions of the more complex model in terms of known quantities. This procedure is particularly well suited when the effect is a simpler function than the response, and it also allows us to control the granularity at which we study heterogeneity.

The data from NSLM has a multi-level nature; students (level 1) are grouped into schools (level 2) and each level is associated with its own set of covariates. The literature is rich with studies of causal effects in multi-level settings; see for example Gelman and Hill (2006). However, this work is primarily targeted towards studying the effects of high-level (e.g., school-level) interventions on lower-level subjects (e.g., students), and the increased uncertainty that comes with such an analysis. While interventions are assigned at the student level, it is important to note that only 76 values of school-level variables are observed, which introduces the risk of overfitting to these covariates specifically. We adjust for the multi-level nature of the data in sample splitting, bootstrapping, and the analysis of imputed effects. In the following sections, we describe each step of our methodology in detail.

1. The meanings of the different covariates are described in later sections of the manuscript.

Step 1. Sample splitting To enable unbiased estimation of prediction error and select tuning parameters, we divide the dataset into two parts, with 80% of the data used for the training set D_t and 20% for a validation set D_v. We partition the set of schools, rather than students, making sure that the entire student body of any one school appears only in either D_t or D_v. This is to mitigate overfitting to school-level covariates. As there are only 76 schools, random sampling may create sets that have very different characteristics. To mitigate this, we balance D_t and D_v by constructing a large number of splits uniformly at random and selecting the one that minimizes the Euclidean distance between the moments of the two sets. In particular, we compare the first and second order moments of all covariates. We increase the influence of the treatment variable Z by a factor 10 in this comparison to ensure that treatment groups are split evenly across D_t and D_v.
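A minimal sketch of such a balanced school-level split, under assumed placeholder names (school_stats is not defined in the paper):

```r
# Placeholders: school_stats = matrix with one row per school holding the summary
# statistics compared in the text (first and second moments of the covariates,
# with the treatment rate up-weighted by a factor of 10). Illustration only.
set.seed(1)
schools <- rownames(school_stats)
n_val   <- round(0.2 * length(schools))
best    <- list(gap = Inf)

for (r in 1:1000) {
  val <- sample(schools, n_val)
  gap <- sqrt(sum((colMeans(school_stats[schools %in% val, , drop = FALSE]) -
                   colMeans(school_stats[!schools %in% val, , drop = FALSE]))^2))
  if (gap < best$gap) best <- list(gap = gap, val = val)
}
validation_schools <- best$val
training_schools   <- setdiff(schools, best$val)
```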

Step 2. Estimation of potential outcomes The conditional average treatment effect is the difference between expected potential outcomes, here denoted µ0 and µ1. Under ignorability w.r.t. X (see above), we have that

µ_z(x) := E[Y(z) | X = x] = E[Y | X = x, Z = z]   for z ∈ {0, 1},

and thus τ(x) = µ1(x) − µ0(x). A straightforward route to estimating τ is to independently fit the conditionals E[Y | X = x, Z = z] for each value of z ∈ {0, 1} and compute their difference. This has recently been dubbed the T-learner approach to distinguish it from other learning paradigms (Künzel et al., 2017). Below, we briefly cover theory that motivates this method and point out some of its shortcomings. To study heterogeneity, we consider several T-learners as well as two alternative approaches described below. We approximate µ0, µ1 using hypotheses f0, f1 and measure their quality by the mean squared error. The group-conditional expected and empirical risks are defined as follows:

R_z(f_z) := E[(µ_z(x) − f_z(x))² | Z = z]   and   R̂_z(f_z) := (1/|G_z|) Σ_{i ∈ G_z} (f(x_i; θ) − y_i)²,    (2)

where the first is the expected and the second the empirical group-conditional risk.

We never observe µ_z directly, but learn from noisy observations y. Statistical learning theory helps resolve this issue by bounding the expected risk in terms of its empirical counterpart and a measure of function complexity (Vapnik, 1999). For hypotheses in a class F with a particular complexity measure C_F(δ, n) with logarithmic dependence on n (e.g., a function of the covering number), it holds with probability greater than 1 − δ that

∀ f_z ∈ F :   R_z(f_z) ≤ R̂_z(f_z) + C_F(δ, n)/√n − σ²_Y,    (3)

where σ²_Y is a bound on the expected variance in Y (see Johansson et al. (2018) for a full derivation). This class of bounds illustrates the bias–variance trade-off that is typical for machine learning and motivates the use of regularization to control model complexity. In our experiments, we consider several T-learner models that estimate each potential outcome independently using regularized empirical risk minimization, solving the following problem.

f_z = argmin_{f(·;θ) ∈ F}  R̂_z(f(x; θ)) + λ r(θ)    (4)

Here, f(x; θ) is a function parameterized by θ and r(θ) a regularizer of the model parameters, such as the ℓ1-norm (LASSO) or ℓ2-norm (ridge) penalties. In our analysis, we compare four commonly used off-the-shelf machine learning estimators: ridge regression, random forests, gradient boosting, and deep neural networks.

Sharing power between treatment groups. A drawback of T-learners is that no information is shared between estimators of different potential outcomes. In problems where the baseline response Y(0) is a more complex function than the effect τ itself, the T-learner is wasteful in terms of statistical power (Künzel et al., 2017; Nie and Wager, 2017). As an alternative, we apply the Treatment-Agnostic Representation (TARNet) neural network architecture of Shalit et al. (2017). TARNet estimates all potential outcomes {Y(z)} jointly as compositions f_z(x) := h_z(Φ(x)) of treatment-specific hypotheses h_z(Φ) and a treatment-agnostic representation Φ(x). Trained by minimizing the overall empirical risk, as described in (4), this choice of architecture encourages sharing of information across treatment groups in learning the average response, while capturing heterogeneity in treatment effects. For an illustration comparing T-learners and TARNet, see Figure 1.
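As a simple illustration of the T-learner recipe described above (here with random forests and placeholder names; the paper itself also compared ridge regression, gradient boosting, and neural networks):

```r
library(randomForest)

# Placeholders: x_train, z_train, y_train from the training split D_t;
# x_all holds the covariates of every student. Tuning parameters would be
# selected against the validation split D_v.
f0 <- randomForest(x = x_train[z_train == 0, ], y = y_train[z_train == 0])
f1 <- randomForest(x = x_train[z_train == 1, ], y = y_train[z_train == 1])

# Step 3 input: imputed CATE for each student.
tau_hat <- predict(f1, newdata = x_all) - predict(f0, newdata = x_all)
```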

Figure 1: Illustration of the T-learner estimator (left) and the Treatment-Agnostic Representation (TARNet) architecture (right) (Shalit et al., 2017). TARNet estimators learn representations of covariates Φ(x) that are shared between treatment groups and model the different potential outcomes Y(z) as functions of Φ. Counterfactual Regression (CFR) (Shalit et al., 2017) extends TARNet by regularizing Φ to encourage balance across different treatment groups.

Generalizing across treatment groups. The careful reader may have noticed that the population and empirical risks in equations (2)–(4) are defined with respect to the observed treatment assignments. To estimate CATE, we want our estimates of the potential outcomes to be accurate for the counterfactual assignment as well. In other words, we want the risk on the full cohort,

R(f_z) := E[(µ_z(x) − f_z(x))²],

to be small. When the treatment groups p(X | Z = 0) and p(X | Z = 1) are highly imbalanced, the expected risk within one treatment group may not be representative of the risk on the full population. This is another drawback of T-learner estimators, which do not adjust for this discrepancy. In recent work, Shalit et al. (2017) characterize the difference between R(f_z) and R_z(f_z) and bound the error in CATE estimates using a distance metric between treatment groups. In particular, they considered the integral probability metric (IPM) family of distances, defined with respect to a function family G and densities p, q as

IPM_G(p, q) := sup_{g ∈ G} | ∫ g(x) (p(x) − q(x)) dx |,

resulting in the following relation between population and treatment group risk:

R(f_z) ≤ R_z(f_z) + IPM_G(p(X | Z = 0), p(X | Z = 1)),    (5)

where the left-hand side is the population risk, the first term on the right is the treatment group risk, and the second is the treatment group imbalance, under appropriate assumptions. This bound inspired the estimator Counterfactual Regression (CFR), in which the TARNet architecture (see above) is trained to minimize the upper bound in (5) applied to the learned representation Φ, instead of the empirical risk. This encourages balance between treatment groups in the learned representation space. In our analysis, we apply CFR with G the family of functions in the reproducing-kernel Hilbert space defined by the Gaussian RBF kernel; the resulting IPM is known as the Maximum Mean Discrepancy (Gretton et al., 2012) and may be estimated efficiently from samples.
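For reference, a plug-in (V-statistic) estimate of the squared MMD between the two treatment groups under a Gaussian RBF kernel can be sketched as follows; x0, x1, and the bandwidth sigma are placeholders, and this is not the paper's implementation.

```r
# Placeholders: x0, x1 = numeric covariate matrices for control and treated units.
rbf_kernel <- function(a, b, sigma) {
  d2 <- outer(rowSums(a^2), rowSums(b^2), "+") - 2 * tcrossprod(a, b)
  exp(-d2 / (2 * sigma^2))
}

mmd2 <- function(x0, x1, sigma = 1) {
  mean(rbf_kernel(x0, x0, sigma)) + mean(rbf_kernel(x1, x1, sigma)) -
    2 * mean(rbf_kernel(x0, x1, sigma))
}
```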


Figure 2: Examination of overlap through covariate marginal distributions and a low-dimensional t-SNE projection of covariates (Maaten and Hinton, 2008). Marker color corresponds to treatment assignment Z. Best viewed in color.

Step 3. Characterization of heterogeneity in CATE

After fitting models f0, f1 for each potential outcome, the conditional average treatment effect is imputed for each student by τ̂_i = f1(x_i) − f0(x_i). Unlike with linear regressors, the predictions of most ML estimators are difficult to interpret directly through model parameters. For this reason, ML models are often considered black-box methods (Lipton, 2016). However, in the study of heterogeneity, it is crucial to characterize for which subjects the effect of an intervention is low and for which it is high. To accomplish this, we adopt the common practice of post-hoc interpretation—fitting a simpler, more interpretable model h ∈ H to the imputed effects {τ̂_i}.

In its very simplest form h(x_i) may be a function of a single attribute, such as the school size, effectively averaging over other attributes. This is usually a good way of discovering global trends in the data but will neglect meaningful interactions between variables, much like a linear model. As a more flexible alternative, we also fit decision tree models and inspect the learned trees. Trees of only two variables may be visualized directly in the covariate space, and larger trees in terms of their decision rules.

2. Workshop results

We present the first results of our analysis as shown in the workshop Empirical Investigation of Methods for Heterogeneity at the Atlantic Causal Inference Conference, 2018.

2.1 Covariate balance

First, we investigate the extent to which the overlap assumption holds true by comparing the covariate statistics of the treatment and control groups. In Figure 2, we visualize the marginal distributions of each covariate, as well as a 2D t-SNE projection of the entire covariate set (Maaten and Hinton, 2008). The observed difference between the marginal covariate distributions of the two treatment groups is very small. Also, the non-linear t-SNE projection reveals little difference between treatment groups.


Estimator   ATE                  Held-out R²
Naïve       0.30                 —
RR          0.26 [0.22, 0.29]    0.26 [0.17, 0.29]
RF          0.27 [0.23, 0.30]    0.25 [0.22, 0.29]
GB          0.26 [0.21, 0.30]    0.25 [0.20, 0.30]
NN          0.27 [0.17, 0.38]    0.14 [−0.08, 0.23]
TARNet      0.26 [0.23, 0.30]    0.27 [0.21, 0.31]
CFR         0.26 [0.22, 0.30]    0.27 [0.22, 0.31]

(a) ATE and held-out R² score with 95% school-level cluster bootstrap confidence intervals. (b) Histogram of CATE (CFR).

Figure 3: Comparison between the naïve estimator, T-learners, and representation learning methods (left). Heterogeneity in treatment effect estimated by CFR (right).

The less imbalance between treatment groups, the closer our problem is to standard supervised learning. Said differently, the density ratio p(Z = 1 | X)/p(Z = 0 | X) is close to 1.0 and the IPM distance between conditional distributions, see (5), is small. Hence, we expect neither propensity re-weighting nor balanced representations (e.g. CFR) to have a large effect on the results.

2.2 Estimation of potential outcomes

In our analysis, we compare four T-learners based on ridge regression (RR), random forests (RF), gradient boosting (GB), and neural networks (NN). In addition, we compare the representation-learning algorithms TARNet and Counterfactual Regression (CFR) (Shalit et al., 2017). For each estimator family, we fit models of both potential outcomes on the training set D_t and select tuning parameters based on the held-out R² score on the validation set D_v. To estimate uncertainty in model predictions, we perform school-level bootstrapping of the training set (Cameron et al., 2008), fitting each model to each bootstrap sample.2 In Figure 3(a), we give the estimate of the average treatment effect (ATE) from each model, the held-out R² score of the fit of factual outcomes, and 95% confidence intervals based on the empirical bootstrap. In addition, we give the naïve estimate of the ATE—the difference between observed average outcomes in the two treatment groups. We see that all methods produce very similar estimates of ATE and perform comparably in terms of R². As expected, based on the small covariate imbalance shown in the previous section, the regression adjusted estimates are close to the naïve estimate of the ATE. This likely also explains the small difference between TARNet and CFR, as even for moderate to large imbalance regularization, the empirical risk dominates the objective function. The performance of the neural network T-learner would likely be improved with a different choice of architecture or tuning parameters. This is consistent with Shalit et al. (2017), in which the TARNet architecture achieved half of the error of the T-learner on the IHDP benchmark.
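The school-level cluster bootstrap mentioned above can be sketched as follows; dat, schoolid, and fit_and_estimate_ate are placeholders for the analysis data and for whichever estimator is being refit, not the paper's code.

```r
# Placeholders: dat = analysis data frame with a schoolid column;
# fit_and_estimate_ate() refits one of the estimators above and returns its ATE.
ate_boot <- replicate(200, {
  boot_schools <- sample(unique(dat$schoolid), replace = TRUE)
  boot_rows    <- unlist(lapply(boot_schools, function(s) which(dat$schoolid == s)))
  fit_and_estimate_ate(dat[boot_rows, ])
})
quantile(ate_boot, c(0.025, 0.975))  # empirical 95% interval
```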

2. The bootstrap analysis was added after the workshop results, but is presented here for completeness.


Figure 4: Heterogeneity in causal effect estimated using counterfactual regression (CFR) stratified by different covariates. Bars indicate variation in point estimates across subjects.

2.3 Heterogeneity in causal effect

We examine further the CATE for each student imputed by the best fitting model. As CFR had a slight edge in R2 over T-learning estimators (although confidence intervals overlap) and has stronger theoretical justification, we analyze the effects imputed by CFR below. In Figure 3b, we visualize the distribution of imputed CATEs. We see that for almost all students, the effect is estimated to be positive, indicating an improvement in performance as an effect of the mindset intervention. Recall that the average treatment effect was estimated to be 0.26. Around 95% of students were estimated to have an effect in the range [0.05, 0.45]. For reference, the mean of the observed outcome was 0.10 and the standard deviation 0.64.

To discover drivers of heterogeneity, we fit a random forest estimator to the imputed effects and inspect the feature importance of each variable, that is, the frequency with which it is used to split the nodes of a tree. The five most important variables in the random forest were X1, X4, XC, X5, and S3. In Figure 4, we stratify the imputed CATE with respect to these variables, as well as X2, which is of interest to the study organizers. We see a strong trend that the effect of the intervention decreases with prior school-level student mindset, X1. The urbanicity of the school, XC, is a categorical variable for which category D appears to be associated with a substantially lower effect. In contrast, the effect of the intervention appears to increase with students' prior expectations, S3. One of the questions of the original study was whether there exists a "Goldilocks effect" for school achievement level X2, meaning that the intervention only has an effect for schools that are neither achieving too poorly nor too well. These results can neither confirm nor reject this hypothesis.
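As a rough sketch of this step (an assumption about the workflow, not the authors' exact implementation), the R code below fits a random forest to imputed unit-level effects and extracts importance scores; note that randomForest reports impurity- and permutation-based importance rather than the raw split frequency described above. The objects tau_hat and X are assumed to hold the imputed effects and the covariates.

# Sketch: rank candidate effect modifiers by random-forest importance when the
# forest is fit to the imputed unit-level effects.
# Assumes tau_hat (numeric vector of imputed CATEs) and a covariate data frame X.
library(randomForest)

rf_tau <- randomForest(x = X, y = tau_hat, ntree = 500, importance = TRUE)
imp <- importance(rf_tau)                      # %IncMSE and IncNodePurity columns
imp[order(imp[, "IncNodePurity"], decreasing = TRUE), , drop = FALSE]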


[Figure 5 panels: (a) CATE vs. X1 and X4 (school-level); (b) CATE vs. X1 and S3 (student-level).]

Figure 5: Interpretation of CATE estimates using regression trees fit to pairs of covariates. Each dot represents a single school (left) or student (right). The color represents the predicted CATE. Black lines correspond to leaf boundaries. Background color and numbers in boxes correspond to the average predicted CATE in that box. Best viewed in color.

3. Post-workshop analysis

Heterogeneity in treatment effect may be a non-linear or non-additive function of observed covariates. Such patterns remain hidden when analyzing the CATE as a function of a single variable at a time or when using linear models. To reveal richer patterns of heterogeneity, we fit highly regularized regression tree models and inspect their decision rules; a sketch of this step appears below. First, we consider combinations of only pairs of variables at a time. We note that for school-level variables, only 76 unique values exist, one for each school. To prevent overfitting to these variables, we require that each leaf in the regression tree contain samples from at least 10 schools. When student-level covariates are included, we require leaves to contain at least 1000 students. In Figure 5, we visualize trees fit to two distinct variable pairs. We note a very slight non-linear pattern in heterogeneity as a function of X1 (school-level student mindset) and X4 (school poverty concentration), and that X1 explains much of the variance observed at moderate values of X4 in Figure 4. We emphasize, however, that the sample size at the school level is small and that the observed patterns have high variance. In the right-hand figure, S3 (student's expectations) appears associated with a larger effect only if the average mindset of the school is sufficiently high. This pattern disappears when using a linear model. In the Appendix, we show a regression tree fit to the entire covariate set.
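The sketch below gives one way such regularized trees could be fit in R; it is an approximation under stated assumptions, enforcing the school-count constraint by fitting at the school level (one row per school, minbucket = 10) and the student-count constraint with minbucket = 1000. The data frames school_dat and student_dat, with imputed effects tau_bar and tau_hat, are assumed names.

# Sketch: heavily regularized regression trees on pairs of covariates.
library(rpart)

# School-level pair (X1, X4): one row per school, at least 10 schools per leaf.
tree_school <- rpart(tau_bar ~ X1 + X4, data = school_dat,
                     control = rpart.control(minbucket = 10, cp = 0.01))
print(tree_school)

# Student-level pair (X1, S3): require large leaves instead.
tree_student <- rpart(tau_hat ~ X1 + S3, data = student_dat,
                      control = rpart.control(minbucket = 1000, cp = 0.001))
print(tree_student)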

4. Discussion

Machine learning offers a broad range of tools for flexible function approximation and provides theoretical guarantees for statistical estimation under model misspecification. This makes it a suitable framework for estimating causal effects from non-linear, imbalanced, or high-dimensional observational data. The flexibility of machine learning comes at a price, however: many methods have tuning parameters that are challenging to set for causal estimation; models are often difficult to optimize globally; and the interpretability


of models suffers. While progress has been made independently on each of these problems, a standardized set of tools has yet to emerge. In the analysis of the NLSM data, machine learning appears well suited to the study of overlap, potential outcomes, and heterogeneity in imputed effects. However, the analysis also raises some methodological questions. The multi-level nature of the covariates is not accounted for in most off-the-shelf ML models, and regularization of models applied to multi-level data has been studied comparatively less than for single-level data. In addition, as pointed out by several authors (Künzel et al., 2017; Nie and Wager, 2017), the T-learner approach to causal effect estimation may suffer from compounding bias and from wasting statistical power. This may be one of the reasons we observe a slight advantage for representation learning methods such as TARNet and CFR.

References

Cameron, A. C., Gelbach, J. B., and Miller, D. L. (2008). Bootstrap-based improvements for inference with clustered errors. The Review of Economics and Statistics, 90(3):414–427.

Gelman, A. and Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773.

Johansson, F. D., Kallus, N., Shalit, U., and Sontag, D. (2018). Learning weighted representations for generalization across designs. arXiv:1802.08598.

Kononenko, I. (2001). Machine learning for medical diagnosis: history, state of the art and perspective. Artificial Intelligence in Medicine, 23(1):89–109.

Künzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2017). Meta-learners for estimating heterogeneous treatment effects using machine learning. arXiv:1706.03461.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.

Lee, B. K., Lessler, J., and Stuart, E. A. (2011). Weight trimming and propensity score weighting. PLoS ONE, 6(3):e18174.

Lipton, Z. C. (2016). The mythos of model interpretability. arXiv:1606.03490.

Maaten, L. v. d. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.

Nie, X. and Wager, S. (2017). Learning objectives for treatment effect estimation. arXiv:1712.04912.

Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, deci- sions. Journal of the American Statistical Association, 100(469):322–331.

Rubin, D. B. and Thomas, N. (1996). Matching using estimated propensity scores: relating theory to practice. Biometrics, pages 249–264.


Shalit, U., Johansson, F. D., and Sontag, D. (2017). Estimating individual treatment effect: generalization bounds and algorithms. In International Conference on Machine Learning, pages 3076–3085.

Swaminathan, A. and Joachims, T. (2015). Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814–823.

Vapnik, V. N. (1999). An overview of statistical learning theory. IEEE transactions on neural networks, 10(5):988–999.


Appendix A. Regression tree explanation of CATE


Figure 6: Visualization of a regression tree fit to the imputed CATE values based on the full covariate set.

Observational Studies 5 (2019) 83-92 Submitted 7/19; Published 8/19

Matching with attention to effect modification in a data challenge

Luke Keele [email protected] University of Pennsylvania 3400 Spruce St, Philadelphia, PA 19130 Samuel D. Pimentel [email protected] University of California, Berkeley 367 Evans Hall, Berkeley, CA 94720

Abstract For this data challenge, we implement a data analytic approach based on matching. First, we use matching to control for observed confounders. Next, we implement a test that combines results from subgroups of matched pairs formed using potential effect modifiers. This analysis, which can be conducted in either a planned or an exploratory manner, is designed both to detect effect modification in specific subgroups and to exploit any variation in effect size to make the study more robust to bias from a hidden confounder. We find that accounting for possible effect modification does make the study results less sensitive to hidden bias. However, we also find that an exploratory analysis that does not fully exploit outcome information fails to fully discover possible effect modifiers. Keywords: matching, effect modification, sensitivity analysis

1. Methodology and Motivation

1.1 Matching in observational studies

Matching is a flexible and intuitive tool for designing effective observational studies. In its simplest form, matching involves pairing each subject from a group receiving a treatment or exposure of interest to a control subject that appears similar on observed pre-treatment covariates. When observed covariates include all relevant confounders and sufficiently similar pairs are available, the resulting paired design may be analyzed as though it were a randomized study. Analyzing observational data using matched comparisons has many attractive features. For instance, problems with covariate overlap are quickly identified (Stuart 2010), the separation of design from analysis ensures valid inference even if multiple matched designs are initially considered (Rubin 2007), and the preservation of the original units of analysis permits helpful qualitative input from scientific collaborators (Rosenbaum 2002). For our purposes, two such features are particularly important. One is the natural way matched designs capture information about treatment effect variation. Specifically, each distinct matched pair difference is an estimate of the treatment effect at a particular point in multidimensional covariate space. A single matched design may permit both an average treatment effect estimate over the entire sample and a more granular study of treatment effect heterogeneity or, in the parlance of the matching literature, effect modification by

c 2019 Luke Keele and Samuel D. Pimentel.

pre-treatment covariates. The second key advantage of matching methods is the set of tools available for conducting sensitivity analysis after matching in order to assess the impact of possible violations of the strong ignorability, or “no unmeasured confounders,” assumption. Since some form of this uncheckable assumption underlies all analysis of observational data, effective methods for causal inference should be able to demonstrate some robustness to a violation of this assumption. Sensitivity analysis can provide just such a certificate, or identify situations in which results are very fragile with respect to failures of ignorability.

1.2 Identifying effect modification in matched designs

Effect modification occurs when the magnitude of the treatment effect varies with the value of one or more pre-treatment covariates, so that certain subgroups of matched pairs experience a different effect than others. A recent literature has studied methods for detecting and leveraging effect modification in matched observational studies (Hsu et al. 2013, 2015; Lee et al. 2018). These works identify two basic strategies for studying effect modification. In one strategy, a small set of subgroups, perhaps identified by one or two nominal variables, is specified a priori based on scientific considerations. In addition to analyzing the data for the entire match, investigators conduct inference separately within each of these subgroups. This strategy is very similar to standard subgroup analysis in social and medical applications. The second approach is to identify subgroups likely to experience effect modification in an adaptive way, from the data itself. Hsu et al. (2013) describe one way to do this by fitting a CART model to matched pair differences using a large set of pre-treatment covariates, and measuring treatment effects within the subgroups identified by the leaves of the resulting tree model. Importantly, these trees are fit not to the actual matched pair outcome differences but to their absolute values, or to the ranks of the absolute values. Using the signed outcomes to select subgroups for inference would violate the design principle of outcome blinding that is fundamental in matching approaches and would invalidate tests for treatment effects in the resulting groups. While the absolute values are an imperfect proxy for the actual matched pair differences, they preserve valid inference since randomization inference varies only the signs of these differences. In principle these methods may be applied to any matched design, whether or not it was constructed with the study of effect modification in mind. However, a subtle aspect of the approaches just described is that they rely on pairs exactly matched for the potential effect modifier of interest. For example, if gender is a possible effect modifier, an effect may be estimated within a group of pairs in which both subjects are women and another effect may be estimated in a group of pairs in which both subjects are men. But if the study also contains pairs that include both a woman and a man, these last pairs are not directly useful for studying effect modification. One way to understand this principle is to recognize that effect modification is the result of an interaction between a treatment variable and a pre-treatment covariate. Pairs matched exactly for the covariate remove the main effect of the pre-treatment covariate and allow study of the interaction, while outcome differences in pairs with non-identical covariates may be partly explained by the main effect of the covariate and partly explained by the interaction.


In practice, this requirement means that investigators with an interest in effect modification must take care to ensure that a high proportion of matched pairs share identical values of the potential effect modifiers when they construct the match. This requirement stands in contrast to recent recommendations and techniques for matching that focus on achieving marginal balance, a less stringent goal than exact matching (Zubizarreta 2012; Pimentel et al. 2015). Note that this requirement applies both when subgroups of interest are identified a priori and when the exploratory technique is used; in the latter case the CART model is fit only to the subset of pairs matched exactly on all the covariates used in the fit.

1.3 Impact of effect modification on inference and sensitivity to bias

While understanding effect modification for a given dataset is usually scientifically interesting in itself, another important benefit is reduced sensitivity to bias in testing for the overall presence or absence of a treatment effect in the entire sample. In particular, matched data are often analyzed by using randomization inference to test Fisher-style sharp null hypotheses of no treatment effect. Such hypotheses specify that the treatment has no effect for any individual anywhere in the study, in contrast to the more common Neyman framework, which specifies that an average effect is equal to zero but allows for nonzero individual effects. A rejection of the sharp null hypothesis of zero effect for a specific subgroup of the sample is also a rejection of the overall sharp null; note that this is not the case in a Neyman-style testing framework. If effect modification is structured so that certain small subgroups of the sample experience very large effects, a test that combines results from individual tests in several subgroups may be more powerful than a single test incorporating all the data. For example, Hsu et al. (2015) use a truncated product of p-values generated from tests in several subgroups.

More important yet, analyses that pay attention to effect modification may also reduce sensitivity to unmeasured bias. To make this claim more clear, we briefly review Rosenbaum's method of sensitivity analysis, described in greater detail in Rosenbaum (2005) and Rosenbaum (2002). Randomization inference in a matched design typically proceeds under the assumption that individuals in the same pair have the same true propensity to receive treatment; mathematically, their odds of treatment are assumed to be identical, allowing construction of a known and tractable randomization distribution. In sensitivity analysis, this assumption is relaxed so that the odds of treatment for paired individuals may differ by up to a multiplicative factor Γ. For any specific choice of Γ, a worst-case analysis may be conducted to obtain the largest possible p-value or the widest possible confidence interval that might have been achieved under these settings. Analysts often repeat the analysis for many Γ values and report a threshold value at which the hypothesis test ceases to reject or at which the confidence interval barely covers 0. For example, say we observe that the estimated p-value is 0.01. This p-value assumes Γ = 1, that is, the unobserved confounder does not change the odds of treatment within matched pairs. To perform a sensitivity analysis, we increase Γ until the worst-case p-value reaches or exceeds 0.05. If this occurs for a relatively small value of Γ, we conclude that a confounder with only a small effect on the treatment odds would change our conclusions. However, if the value of Γ is large when the bounds include zero, then we have greater confidence that a weak confounder would not change the conclusions of the statistical analysis.
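The sketch below illustrates such a Γ sweep in R using the rbounds package, which reports Rosenbaum bounds for the Wilcoxon signed-rank statistic; this is a stand-in for the Huber-Maritz M-test bounds used later in the paper, and the pair-aligned outcome vectors y_t and y_c are assumed names.

# Sketch: sensitivity-analysis sweep over Gamma for matched pair outcomes.
# rbounds::psens computes worst-case (upper-bound) p-values for the Wilcoxon
# signed-rank test at each Gamma; the threshold reported in a sensitivity
# analysis is the smallest Gamma at which the upper-bound p-value exceeds 0.05.
library(rbounds)

sens <- psens(x = y_t, y = y_c, Gamma = 3, GammaInc = 0.25)
print(sens)   # table of lower- and upper-bound p-values by Gamma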


When effect modification is present, hypothesis tests for different subgroups of a sample may vary not only in their degree of statistical significance but in their sensitivity to bias. Results from multiple sensitivity analyses may be combined using the truncated product of p-values just as standard test results are combined, resulting in an overall sensitivity analysis, and Hsu et al. (2013) demonstrate that in this setting the level of sensitivity to bias tends to be driven by the subgroup with the strongest or most robust effect. As such, combining the results of several subgroup analyses may produce more reliable inferences less sensitive to unmeasured confounding than the results of a single test for the entire sample.
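For concreteness, the truncated product multiplies only those subgroup p-values at or below a truncation point. The following self-contained R sketch approximates its null distribution by Monte Carlo simulation rather than the closed-form expression in Zaykin et al. (2002); the example p-values are purely illustrative.

# Sketch: combine subgroup p-values with the truncated product method.
truncated_product <- function(p, tau = 0.2, n_sim = 1e5) {
  stat <- function(pv) prod(pv[pv <= tau])          # empty product equals 1
  obs  <- stat(p)
  null <- replicate(n_sim, stat(runif(length(p))))  # null: independent Uniform(0,1)
  mean(null <= obs)                                 # combined p-value
}

# Example with five illustrative subgroup p-values:
truncated_product(c(0.001, 0.03, 0.20, 0.45, 0.70))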

1.4 Study design for the workshop data

We demonstrate both matching and a search for effect modification in a synthetic dataset constructed from the National Study of Learning Mindsets for a workshop at the 2018 Atlantic Causal Inference Conference, described earlier in this journal issue in Carvalho et al. (2019). We approached the workshop data by forming a matched pair design and conducting randomization inferences and sensitivity analyses designed to detect and leverage possible effect modification. We first used methods based on a priori identification of two important effect modifiers, and then conducted an exploratory analysis designed to identify additional potential modifiers. Given the framing of the workshop data as an observational study, a primary concern guiding the design was the possibility of unmeasured confounding between the treatment and control groups.

When constructing the matched design itself, we sought to match exactly as frequently as possible on variables judged to be likely effect modifiers. While it is in general a difficult task to match exactly on many variables, in this case the large majority of covariates, including the two pre-identified possible effect modifiers X1 and X2, were measured at the school level. Since, in addition, each school contained proportionally similar numbers of treated and control students, we chose to match students within schools, thus ensuring that all pairs in the study shared identical values for all school-level covariates. (However, as discussed below, we also conducted an across-schools match for comparative purposes.) Within schools, we matched students optimally on a robust Mahalanobis distance incorporating all student covariates. We also estimated a propensity score using all available covariates and imposed a matching caliper of 0.15 standard deviations of the estimated propensity score, requiring that all pairs formed contain individuals whose estimated propensity scores differ by no more than that amount. This ensures that, in the absence of unmeasured confounding, matched individuals have near-identical odds of treatment, an important assumption underlying the later randomization inferences. The caliper restriction prevented some treated units in the study from being matched, as described in the next section, but we retained a large proportion of the treated students. A sketch of this matching step appears at the end of this section.

To account for effect modification by X1 and X2, the a priori possible effect modifiers, we made two separate partitions of the matched pairs, one by quintiles of X1 and one by quintiles of X2. For each partition, we conduct a separate randomization test of the sharp null hypothesis within each quintile using a member of the family of Huber-Maritz M-tests chosen to be insensitive to unmeasured bias (Rosenbaum 2007). We then combine the five p-values from these tests using the truncated product of p-values (Zaykin et al. 2002). We also conduct a sensitivity analysis for this procedure, reporting the threshold level of unobserved


confounding Γ at which the combined p-value ceases to be significant. For comparative purposes, we also conduct a Huber-Maritz M-test and sensitivity analysis for the overall sample. We also conducted exploratory analyses to detect effect modification using CART models fit to absolute matched pair differences, as discussed below.
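The within-school match can be sketched in R roughly as follows; the function and argument names follow the optmatch package and the covariate names follow the challenge data, but the snippet is an illustration rather than our exact code.

# Sketch: optimal pair matching within schools on a robust Mahalanobis distance,
# with a propensity score caliper of 0.15 standard deviations.
# Assumes a data frame `dat` with treatment z, schoolid, and covariates.
library(optmatch)

ps_fit <- glm(z ~ S3 + C1 + C2 + C3 + X1 + X2 + X3 + X4 + X5 + XC,
              data = dat, family = binomial)

student_dist <- match_on(z ~ S3 + C1 + C2 + C3, data = dat,
                         method = "rank_mahalanobis")
ps_caliper   <- caliper(match_on(ps_fit, data = dat), width = 0.15)

# exactMatch restricts pairs to the same school, so every pair agrees exactly
# on all school-level covariates.
pairs <- pairmatch(student_dist + ps_caliper + exactMatch(z ~ schoolid, data = dat),
                   data = dat)
summary(pairs)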

2. Workshop Results

We first present a few details on the results of the two matches. First, there was very little observed bias in the data. To assess the quality of the match, we used a measure of standardized difference, which for a given variable is computed by taking the mean difference between matched units and dividing by the pooled standard deviation before matching (Silber et al. 2001; Rosenbaum and Rubin 1985; Cochran and Rubin 1973). In matching, we generally seek to make all standardized differences less than 0.10, that is, less than one-tenth of a standard deviation, which is often considered an acceptable discrepancy (Silber et al. 2001; Rosenbaum and Rubin 1985; Cochran and Rubin 1973; Rosenbaum 2010).

Table 1: Balance Statistics for the Unmatched Data and Two Matched Comparisons

Variable     Std. Diff. Unmatched   Std. Diff. Within School Match   Std. Diff. Across School Match
S3            0.13                   0.00                            -0.01
X1           -0.10                   0.00                             0.01
X2            0.06                   0.00                            -0.00
X3           -0.00                   0.00                             0.00
X4           -0.02                   0.00                            -0.00
X5            0.07                   0.00                            -0.01
C1 Cat. 1    -0.04                   0.00                             0.01
C1 Cat. 2    -0.00                  -0.02                            -0.02
C1 Cat. 3    -0.02                   0.03                             0.02
C1 Cat. 4     0.04                  -0.04                            -0.07
C1 Cat. 5     0.01                   0.03                             0.02
C1 Cat. 6     0.03                  -0.01                             0.00
C1 Cat. 7     0.02                   0.00                             0.03
C1 Cat. 8     0.02                   0.02                             0.04
C1 Cat. 9    -0.03                   0.02                             0.02
C1 Cat. 10    0.01                   0.02                             0.03
C1 Cat. 11   -0.01                   0.01                             0.05
C1 Cat. 12   -0.01                   0.02                             0.02
C1 Cat. 13   -0.05                   0.01                             0.01
C1 Cat. 14   -0.02                  -0.01                             0.01
C1 Cat. 15    0.01                   0.03                             0.04
C2            0.05                   0.01                             0.00
C3           -0.10                  -0.01                            -0.03
XC Cat. 1     0.01                   0.00                             0.00
XC Cat. 2    -0.03                   0.00                             0.00
XC Cat. 3     0.03                   0.00                             0.01
XC Cat. 4    -0.05                   0.00                            -0.00
XC Cat. 5     0.03                   0.00                            -0.00

Table 1 contains balance statistics for both the unmatched data and the two matches we implemented. First, it is clear that there is little overt bias in the unmatched data. For example, the largest standardized difference across treated and control groups before


matching was 0.13. The matching further improves balance: after matching, with one exception, all the standardized differences were 0.05 or less. For the within school match, the standardized differences are zero for all discretized school-level variables since, by design, each pair is exactly matched on these variables. Table 2 further illustrates the primary difference between the two matches. Neither match matches exactly on the student-level covariates, but the within school match matches exactly on all school-level covariates, which is a major advantage for studying effect modification by these covariates.

Table 2: Proportion of pairs matched exactly on each variable

Variable   Across School Match   Within School Match
S3         0.79                  0.78
C1         0.84                  0.82
C2         0.90                  0.87
C3         0.92                  0.91
XC         0.98                  1.00
X1         0.90                  1.00
X2         0.90                  1.00
X3         0.90                  1.00
X4         0.90                  1.00
X5         0.90                  1.00

Table 3 contains the outcome estimates. We estimated treatment effects and confidence intervals after matching by inverting Huber-Maritz M-tests and computing Hodges-Lehmann estimates as described in Rosenbaum (2002); for the unadjusted case we used a simple regression on the treatment indicator. The results from both matches are nearly identical, with treatment effect estimates of 0.26 and 0.27 for the across and within school matches, respectively. Moreover, consistent with the fact that there is very little overt bias, the unadjusted estimate does not differ much from the adjusted estimates. The sample sizes do vary somewhat as a result of the matching. In the unmatched data, there were 3,384 treated units and 7,007 controls. In the across school match, we matched 3,375 of the treated units one to one. In the within school match, we matched 3,079 of the treated units one to one. Some loss of sample size is often a by-product of exact matching, especially when combined with a caliper restriction.

Next, we explored the possibility of effect modification by X1. For this analysis, we focus only on results from the within school match. We stratified the matched pairs by the quintiles of X1 and estimated the overall treatment effect within each quantile, again by inverting randomization tests and computing Hodges-Lehmann estimates. Table 4 contains the results from this stratified analysis. The point estimates indicate quite clearly that the intervention appears to be more effective for students in schools with lower X1 scores. However, the confidence intervals overlap to an extent that these estimates cannot be differentiated statistically. Next, we repeated this analysis for the X2 covariate. Again, we stratified the matched pairs by the quintiles of the X2 variable and estimated the overall treatment effect within each quantile.


Table 3: Results from Outcome Analysis

                       Estimate   95% CI
Unadjusted             0.30       [0.28, 0.33]
Across School Match    0.26       [0.24, 0.29]
Within School Match    0.27       [0.25, 0.30]

Table 4: Treatment Effect Estimates Stratified by X1

Quantile     Point Estimate   95% CI
Quantile 1   0.33             [0.27, 0.38]
Quantile 2   0.27             [0.21, 0.34]
Quantile 3   0.31             [0.26, 0.36]
Quantile 4   0.22             [0.16, 0.28]
Quantile 5   0.21             [0.14, 0.27]

Table 5 contains the results from this stratified analysis. Here, the pattern is less clear; however, the treatment effect estimate in the lowest quantile is much lower. Again, the confidence intervals overlap to an extent that these estimates cannot be differentiated statistically.

Table 5: Treatment Effect Estimates Stratified by X2

Quantile     Point Estimate   95% CI
Quantile 1   0.17             [0.11, 0.24]
Quantile 2   0.30             [0.24, 0.36]
Quantile 3   0.29             [0.24, 0.34]
Quantile 4   0.33             [0.28, 0.39]
Quantile 5   0.24             [0.19, 0.30]

Next, we explored whether effect modification by these two covariates reduced sensitivity to bias from unobserved confounders. Table 6 contains the results from three different sensitivity analyses. In the first sensitivity analysis, we calculated the upper bound on the one-sided p-value for the unconditional treatment effect. Here, we find that Γ = 2.51. This implies that a hidden confounder would need to alter the odds of treatment within matched pairs by a factor of roughly 2.5 before the observed results could be fully explained under the null hypothesis. Next, using the methods outlined above, we conducted two additional sensitivity analyses that account for effect modification by the mindset (X1) and achievement (X2) variables. In both cases, we find the results are less sensitive to hidden bias: the threshold value of Γ rises to 2.87 for the X1 effect modifier and 2.83 for the X2 effect modifier.


Table 6: Sensitivity Analysis for Unconditional and Conditional Treatment Effect Estimates

     Unconditional   X1 Effect Modifier   X2 Effect Modifier
Γ    2.51            2.87                 2.83

To conduct a more exploratory analysis of effect modification, we calculated within-pair differences in outcome for each matched pair in our study, which we denote by τi. Following Hsu et al. (2013), we then regressed the rank of |τi| on possible effect modifiers using CART in order to identify subgroups with differential effects (without jeopardizing later inference by using the signs of the τi). We fit a CART model both to all covariates, school- and student-level, using only those pairs matched exactly on all variables, and to all school-level covariates using all pairs (since all pairs were matched exactly on school-level covariates). However, in neither case did the CART model make any splits after pruning using cross-validation, meaning that no subgroups for effect modification were identified. A sketch of this step appears below.
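A minimal R sketch of this outcome-blinded exploratory step, under the assumption that the matched pair differences are collected in a data frame pair_dat, is as follows; it follows Hsu et al. (2013) in spirit but is not our exact code.

# Sketch: regress the rank of |tau_i| (absolute matched-pair difference) on
# potential effect modifiers with CART, then prune by cross-validation.
library(rpart)

pair_dat$rank_abs_tau <- rank(abs(pair_dat$tau))

cart_fit <- rpart(rank_abs_tau ~ X1 + X2 + X3 + X4 + X5 + XC,
                  data = pair_dat, method = "anova")

# Prune at the cp minimizing cross-validated error; a tree with no splits after
# pruning is read as no evidence of effect modification.
cp_best <- cart_fit$cptable[which.min(cart_fit$cptable[, "xerror"]), "CP"]
prune(cart_fit, cp = cp_best)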

3. Post-workshop analysis

We conducted one additional analysis after the workshop. Specifically, we were curious as to why the CART regression did not detect any effect modification. We suspected this was due to the fact that we used the unsigned differences in treated and control outcomes in the CART model. As noted above, using the unsigned ranks is necessary so that the statistical tests on which the sensitivity analyses are based have the correct level. To explore what role this might have played in our analysis, we re-fit the CART regression to all matched pairs using the signed within-pair matched differences as outcomes and the school-level covariates as regressors. We pruned the fitted tree to reduce overfitting. The results are in Figure 1. Now the tree splits the data on both the X1 covariate and one of the indicators for the XC variable. Of course, a hypothesis test or sensitivity analysis based on this search procedure would violate the design principle by using the same outcome information both to identify subgroups and to conduct inference.

4. Discussion

In our analysis, we demonstrated how to explore effect modification through a sensitivity analysis. Under this approach, investigators can both test whether the magnitude of the treatment effect varies with a baseline covariate and explore whether effect modification makes the results less sensitive to bias from hidden confounders. This approach is quite flexible, since it can be applied to both planned and exploratory analyses of effect modification. That is, one can use flexible tree-based methods to search for subgroups where treatment effects may be larger. Our analysis did reveal some interesting tradeoffs that are required when the analysis is exploratory. To yield valid tests, the exploratory model must be blind to some outcome information. However, we showed that ignoring the full outcome distribution caused us to overlook some evidence of effect modification.


[Regression tree diagram: the root splits on mindset ≥ −1.2, with a further split on the urbanicity indicator urb2 ≥ 1 below it.]

Figure 1: The regression tree formed by fitting matched pair differences in outcomes on baseline covariates

We argue that this result has two implications. First, it may be valuable to clarify the conditions under which effect modification cannot be detected without signed data, as occurred here. This line of research could provide guidance about exactly when and how researchers pay a cost by blinding themselves to the signs of pair differences. Second, we suggest a two-step analysis of effect modification when groups are to be selected from the data. First, the CART procedure can be run using partial outcome information to preserve the validity of the formal confirmatory test. Second, a purely exploratory analysis can be conducted using the full outcome information. This allows recovery of strong signals in the data that may be hidden by the partially blinded analysis, even if formal inference cannot easily be obtained. Finally, a promising new direction for studying effect modification in matched sets, called the submax method, is presented in recent work by Lee et al. (2017). Simulations show that the submax method may have greater power to detect moderate effect modification than the CART-based approaches we describe here. While it is beyond the scope of our current investigation to adapt the submax framework to the problem in the data challenge, we recommend its further study and generalization.

References

Carlos Carvalho, Avi Feller, Jared Murray, Spencer Woody, and David Yeager. Assessing treatment effect variation in observational studies: Results from a data challenge. Observational Studies, 5:21–35, 2019.

William G. Cochran and Donald B. Rubin. Controlling bias in observational studies. Sankhya: The Indian Journal of Statistics, Series A, 35:417–446, December 1973.

Jesse Y Hsu, Dylan S Small, and Paul R Rosenbaum. Effect modification and design sensitivity in observational studies. Journal of the American Statistical Association, 108(501):135–148, 2013.



Jesse Y Hsu, José R Zubizarreta, Dylan S Small, and Paul R Rosenbaum. Strong control of the familywise error rate in observational studies that discover effect modification by exploratory methods. Biometrika, 102(4):767–782, 2015.

Kwonsang Lee, Dylan S Small, and Paul R Rosenbaum. A powerful approach to the study of moderate effect modification in observational studies. Biometrics, 2017.

Kwonsang Lee, Dylan S Small, Jesse Y Hsu, Jeffrey H Silber, and Paul R Rosenbaum. Discovering effect modification in an observational study of surgical mortality at hospitals with superior nursing. Journal of the Royal Statistical Society: Series A (Statistics in Society), 181(2):535–546, 2018.

Samuel D Pimentel, Rachel R Kelz, Jeffrey H Silber, and Paul R Rosenbaum. Large, sparse optimal matching with refined covariate balance in an observational study of the health outcomes produced by new surgeons. Journal of the American Statistical Association, 110(510):515–527, 2015.

Paul R. Rosenbaum. Observational Studies. Springer, New York, NY, 2nd edition, 2002.

Paul R. Rosenbaum. Sensitivity analysis in observational studies. In Brian S. Everitt and David C. Howell, editors, Encyclopedia of Statistics in Behavioral Science, volume 4, pages 1809–1814. John Wiley and Sons, Chichester, UK, 2005.

Paul R Rosenbaum. Sensitivity analysis for m-estimates, tests, and confidence intervals in matched observational studies. Biometrics, 63(2):456–464, 2007.

Paul R. Rosenbaum. Design of Observational Studies. Springer-Verlag, New York, 2010.

Paul R. Rosenbaum and Donald B. Rubin. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39(1):33–38, February 1985.

Donald B Rubin. The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials. Statistics in Medicine, 26(1):20–36, 2007.

Jeffrey H Silber, Paul R Rosenbaum, Martha E Trudeau, Orit Even-Shoshan, Wei Chen, Xuemei Zhang, and Rachel E Mosher. Multivariate matching and bias reduction in the surgical outcomes study. Medical Care, 39(10):1048–1064, 2001.

Elizabeth A Stuart. Matching methods for causal inference: A review and a look forward. Statistical Science, 25(1):1–21, 2010.

Dmitri V Zaykin, Lev A Zhivotovsky, Peter H Westfall, and Bruce S Weir. Truncated product method for combining p-values. Genetic Epidemiology: The Official Publication of the International Genetic Epidemiology Society, 22(2):170–185, 2002.

José R Zubizarreta. Using mixed integer programming for matching in an observational study of kidney failure after surgery. Journal of the American Statistical Association, 107(500):1360–1371, 2012.

Observational Studies 5 (2019) 93-104 Submitted 7/19; Published 8/19

Heterogeneous Subgroup Identification with Observational Data: A Case Study Based on the National Study of Learning Mindsets

Bryan Keller [email protected] Department of Human Development Teachers College, Columbia University New York, NY 10027, USA

Jianshen Chen [email protected] Learning, Evaluation, and Research College Board Yardley, PA 19607, USA

Tianyang Zhang [email protected] Department of Human Development Teachers College, Columbia University New York, NY 10027, USA

Abstract In this paper, we use a two-step approach for heterogeneous subgroup identification with a synthetic data set motivated by the National Study of Learning Mindsets. In the first step, optimal full propensity score matching is used to estimate stratum-specific treatment effects. In the second step, regression trees identify key subgroups based on covariates for which the treatment effect varies. In working with regression trees, we emphasize the role of the cost-complexity tuning parameter, selected through permutation-based Type I error rate studies, in justifying inferential decision-making, which we contrast with graphical and quantitative exploration for future study. Results indicate that the mindset intervention was effective, overall, in improving student achievement. While our exploratory analyses identified XC, C1, and X1 as potential effect modifiers worthy of further study, we find no statistically significant evidence of effect heterogeneity with the exception of urbanicity category XC = 3, but the finding is not robust to propensity score estimation method. Keywords: Heterogeneous Treatment Effect, Observational Studies, Propensity Score Matching, Regression Trees

1. Methodology and Motivation

1.1 Introduction

Despite the overwhelming focus on the overall average treatment effect (ATE) in the statistics and causal inference literatures, there are many scenarios for which the efficacy of a treatment may vary depending on unit background characteristics. Methods that target conditional average treatment effects can explain how pretreatment variables interact with

c 2019 Bryan Keller, Jianshen Chen and Tianyang Zhang.

treatment exposure to cause heterogeneity in treatment efficacy. The identification of such heterogeneity, to the extent that it exists, is of tremendous interest to stakeholders because it can provide insight into which types of participants are likely to be helped the most, helped the least, or even harmed by an intervention. In this paper, we begin with an overview of the synthetic data set generated for the Workshop for Empirical Investigation of Methods for Heterogeneity, a workshop that co-occurred with the 2018 Meeting of the Atlantic Causal Inference Conference in Pittsburgh, PA. We then describe our approach for heterogeneous subgroup identification based on propensity score matching and regression trees. We then discuss data analysis results presented at the workshop, followed by the results of further analyses conducted after the workshop. We conclude with some discussion.

1.2 The Data

The workshop data analyzed herein are synthetic, but were motivated by the National Study of Learning Mindsets, a randomized controlled trial of an intervention designed to encourage a growth mindset in high school students (Mindset Scholars Network, 2018). Approximately 10,000 cases, nested in 76 schools, were simulated to emulate an observational study based on four categorical student-level covariates and six numeric school-level covariates. The three research questions we were asked to address for the workshop are as follows:

1. Was the mindset intervention effective in improving student achievement?

2. X1 is a measure of the average fixed mindset rating for each school; X2 is a measure of school-level academic achievement; both were measured before the intervention. Researchers suspect either (a) the effect is largest in middle-achieving schools, or (b) the effect is decreasing in school-level achievement. Is there any evidence that X1 and/or X2 moderate the effect of the intervention on student-level academic achievement?

3. Is there evidence that any other covariates moderate the intervention effect?

1.3 Notation

Let Yi^1 and Yi^0 be the potential outcomes (Neyman, 1923; Rubin, 1974) under treatment (Zi = 1) and comparison (Zi = 0) conditions, respectively. The average treatment effect, or ATE, is defined as the average of individual treatment effects; that is, ATE = E[Yi^1 − Yi^0]. A conditional average treatment effect, or CATE, is defined as the average of individual treatment effects, given that a vector of covariates Xi1, Xi2, ..., Xip takes on particular values; that is, CATE = E[Yi^1 − Yi^0 | Xi1 = xi1, Xi2 = xi2, ..., Xip = xip]. The propensity score, ei(Xi) = pr(Zi = 1 | Xi), is the probability that unit i is assigned to (or selects) the treatment group, given the observed covariates. For identification of the ATE and CATE, propensity score analysis and other conditioning strategies rely on the strong ignorability assumption (Rosenbaum and Rubin, 1983), which specifies

1. ignorability: the potential outcomes are independent of the treatment assignment given observed covariates X; that is, {Y^0, Y^1} ⊥⊥ Z | X,

2. reliable measurement: observed covariates X have been reliably measured (Steiner et al., 2011), and


3. positivity: the propensity score for each unit lies strictly between zero and one; that is, 0 < ei(Xi) < 1 for all i.

The observed outcome for unit i, Yi, is defined via the potential outcomes and the treatment indicator as Yi = Zi Yi^1 + (1 − Zi) Yi^0.

1.4 Methodology

Our approach to heterogeneous subgroup identification is based on the fact that, under ignorability, X ⊥⊥ Z | e(X) (Rosenbaum and Rubin, 1983). That is, by conditioning on the propensity score, balance on the observed covariates across treated and comparison groups may be restored to what would have been expected in a randomized experiment; namely, covariate distributions are identical (in the limit) across groups. We use optimal full propensity score matching to stratify units into S strata, each of which contains at least one treated case and at least one comparison case. For each stratum s ∈ {1, ..., S}, the estimate of the stratum-specific treatment effect is calculated as the difference in sample averages, treated group minus comparison group. That is,

ATEhat_s = (1 / nTs) Σ_{i ∈ Ts} Yi − (1 / nCs) Σ_{i ∈ Cs} Yi,

where Ts and Cs are, respectively, the sets of indices of the treated and comparison cases in stratum s, and nTs and nCs respectively represent the cardinalities of Ts and Cs. Once stratum-specific treatment effect estimates have been calculated, we use those values as estimates of the individual treatment effect for each unit in the stratum. We then regress the vector of individual treatment effects on the set of predictors using a single regression tree. Any predictors identified by the regression tree as important, meaning that the regression tree split on those variables, are interpreted as evidence for effect heterogeneity on the variable or variables involved in the splits. A sketch of this two-step procedure appears below.
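A compact R sketch of the two-step procedure for a single school is given below; it assumes the optmatch and rpart packages and a data frame school_i with treatment z, outcome y, and covariates, and it is meant to convey the logic rather than reproduce our exact pipeline.

# Sketch: optimal full propensity score matching within one school, stratum-
# specific effect estimates, and a regression tree on the imputed effects.
library(optmatch)
library(rpart)

ps <- glm(z ~ S3 + C1 + C2 + C3, data = school_i, family = binomial)
strata <- fullmatch(match_on(ps, data = school_i), data = school_i)
school_i$stratum <- strata

# Stratum-specific ATE: difference in treated and comparison means within each set.
stratum_ate <- with(school_i,
  tapply(seq_along(y), stratum,
         function(idx) mean(y[idx][z[idx] == 1]) - mean(y[idx][z[idx] == 0])))

# Impute each unit's effect with its stratum's estimate, then regress the imputed
# effects on covariates with a single regression tree.
school_i$tau_hat <- stratum_ate[as.character(school_i$stratum)]
rpart(tau_hat ~ S3 + C1 + C2 + C3, data = school_i)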

1.4.1 Regression Trees

A regression tree is an algorithmic method invented by Breiman et al. (1984) that models the response surface for an outcome variable, Y, based on predictors, X1, ..., Xp, by iteratively splitting units into subgroups based on rectangular regions of predictor values. At each iteration, a split creates two subgroups, called nodes, and a node that has not been split is referred to as a terminal node. For unit i in terminal node t, where Nt represents the set of units in t, the tree-predicted value for unit i is simply the mean score on the outcome variable for all units in that node: Yhat_i = (1/|Nt|) Σ_{i ∈ Nt} Yi. The deviance for a tree T, dev(T) = Σ_i (Yi − Yhat_i)^2, is used as a cost function to determine the split point at each iteration. After considering all possible splits on all possible variables, the split that yields the largest decrease in deviance is selected. If left unchecked, regression trees would continue to split until each terminal node contained only one point. A commonly used approach to prevent this kind of overfitting is to add a term to the squared error that penalizes the number of terminal nodes, |T|, in tree T: C_cp(T) = dev(T) + cp |T|. This approach is referred to as cost-complexity


pruning, and is implemented in the rpart package (Therneau et al., 2015) in R (R Core Team, 2018), which we use to fit regression trees. The tuning parameter, cp, is analogous to the smoothing parameter in the lasso or regularized regression, and is typically selected through cross-validation.
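As a brief illustration of cost-complexity pruning with rpart (an illustrative sketch in which the data frame and variable names are assumptions), one can grow a large tree, inspect the cross-validated error over the nested cp sequence, and prune back at a chosen cp value:

# Sketch: grow a deep tree, inspect the cp table, and prune.
library(rpart)

fit <- rpart(tau_hat ~ S3 + C1 + C2 + C3 + X1 + X2 + X3 + X4 + X5 + XC,
             data = dat, method = "anova",
             control = rpart.control(cp = 0.001))   # grow a large tree first

printcp(fit)                        # cross-validated error for each candidate cp
pruned <- prune(fit, cp = 0.0033)   # prune back at a chosen cp value
pruned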

2. Workshop Results

In the synthetic workshop data, school sample sizes for the 76 schools ranged from 14 to 529, with a median of 111 and a mean of 136.7. Furthermore, the treatment was non-randomly assigned within schools, such that each school sample contained a proportion of treated cases that ranged from about 17% to about 45%. This design feature allowed us to estimate propensity scores and create matches within schools.1 As a result of within-school matching, all units were matched exactly on the five continuously measured school-level covariates, X1, ..., X5. For the workshop, we used two methods to estimate propensity scores: random forests (RF) and generalized boosted modeling (GBM). Both methods are based on regression trees and, therefore, algorithmically handle interactions and nonlinear relationships.

2.1 Research Question 1

To address the first question, we used standard propensity score methodology and simply took weighted averages of stratum-specific treatment effect estimates. The overall ATE was estimated to be 0.25 or 0.26, based on GBM or RF, respectively, for propensity score estimation. The distribution of estimated individual treatment effects, along with a vertical line denoting the average, is shown for the RF analysis in Figure 1. While the results suggest a positive treatment effect, we did not present standard errors, so we made no claims regarding evidence for an overall effect.

2.2 Research Question 2

We fit regression trees and varied the level of the complexity parameter to search for heterogeneity on X1 and X2. With analyses based on propensity scores estimated by RF and GBM, we noted, based on the regression tree output shown in Figure 2, that the treatment effect did appear to vary with X2 and X1. Figure 2 shows the results of a regression tree fit based on random forests with a complexity parameter of 0.0033. Note that at the root node, the overall ATE is estimated to be 0.26 based on 8910 cases. The first split was at X2 = −0.71, which led to conditional ATE estimates of CATE{X2 < −0.71} = 0.12 and CATE{X2 ≥ −0.71} = 0.29. The next split was also on X2, thereby modeling a quadratic relationship. In particular, we see that CATE{X2 ≥ 0.83} = 0.23 and that the estimate for the middle range of X2 was larger.

1. Note, however, that two schools, numbers 11 and 31, were dropped due to insufficient sample sizes of 21 and 14, respectively.


Figure 1: Average Treatment Effect Estimate by the Random Forest (RF) Method

Figure 2: Regression Tree Based on Regressing Individual Treatment Effect Estimates on Observed Covariates; Propensity Scores Estimated by Random Forests

Among schools with X2 values in the middle range between −0.71 and 0.83, the treatment was more effective for schools with fixed mindset scores lower than −0.37 at pretest. Although we did examine the results of ten-fold cross-validation for cp produced by the rpart package, we encountered multiple situations in which the cross-validated error rate continued to decrease without bound as the value of the tuning parameter decreased (i.e., favoring more and more complex tree structures; see Figure 3 for an example). The advice given in the rpart manual (Therneau et al., 2015) is that "A good choice of cp for pruning is often the leftmost value for which the mean lies below the horizontal line," where the "horizontal line" represents one standard error above the minimum value of the cross-validated error curve.


Figure 3: Cross-validated error based on output from package rpart; the horizontal line represents one standard error above the minimum value of the cross-validated error curve

Despite this rule of thumb, we often encountered tree solutions that were very volatile at the one SE mark. Thus, a limitation of the exploratory approach used for the workshop analyses is a lack of rationale for the selection of the cp value, which had the potential to drastically impact results.

2.3 Research Question 3

We noted that the student-level variables C1, a fifteen-category race variable, and XC, a five-category urbanicity variable, were identified in some of the RF and GBM regression tree fits, but did not discuss their roles in detail.

3. Post-Workshop Analysis

For the post-workshop analyses, we added main-effects logistic regression (LR) for propensity score estimation, in addition to RF and GBM. Propensity score strata based on optimal full matching were created in each school, as described above. The number of strata per school varied both with the school sample size and with the method of propensity score estimation. For propensity scores estimated by GBM, for example, the number of strata per school ranged from 4 for school 13 (n = 24) to 161 for school 62 (n = 529), with a mean of 36 and a median of 27; the numbers of strata based on LR and RF were similar. Furthermore, we ran a series of Type I error rate studies, using random permutation, to select cp values that yielded a 5% Type I error rate. Following Chen and Keller (2019), for each permutation we shuffled yoked outcome/treatment pairs while leaving covariate values fixed. Under this permutation scheme, the overall average treatment effect and the covariate marginal distributions and interrelationships remain unperturbed; meanwhile, any


dependence between covariates and individual treatment effects is destroyed, which provides recourse to the permutation null hypothesis of no effect (Rubin, 1980; Keller, 2012). A sketch of this permutation scheme is given below.

For research questions 2 and 3 in the post-workshop analyses, we distinguish between testing and exploration. We test for effect heterogeneity by using cp values that were found, through permutation, to hold the rate of false positives to the nominal 5% level; results based on these cp values are appropriate for inferential decision-making. We explore (a) graphically, by examining graphical depictions of key relationships, and (b) quantitatively, by examining variable importance ratings from random forest fits. Although these explorations are suitable for hypothesis generation for future study, they are not appropriate for inferential decision-making.
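The permutation calibration can be sketched roughly as follows; impute_effects is a hypothetical placeholder for the full matching-and-imputation pipeline, and the cp grid is illustrative.

# Sketch: estimate the Type I error rate of the tree-based test for a given cp by
# shuffling yoked (outcome, treatment) pairs against the fixed covariates.
library(rpart)

type1_rate <- function(dat, cp, n_perm = 200) {
  mean(replicate(n_perm, {
    perm <- sample(nrow(dat))
    dat_perm <- dat
    dat_perm[, c("y", "z")] <- dat[perm, c("y", "z")]   # covariates stay fixed
    dat_perm$tau_hat <- impute_effects(dat_perm)        # hypothetical helper
    tree <- rpart(tau_hat ~ . - y - z, data = dat_perm,
                  control = rpart.control(cp = cp))
    nrow(tree$frame) > 1                                # TRUE if the tree splits at all
  }))
}

# Choose the cp on a grid whose estimated Type I error rate is closest to 0.05:
# sapply(c(0.004, 0.006, 0.008, 0.010), function(cp) type1_rate(dat, cp))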

3.1 Research Question 1

We found that the desired nominal Type I error rate of approximately 5% was attained for GBM, RF, and LR, respectively, for cp values of 0.006, 0.008, and 0.006. The overall ATE estimates were hardly changed when using the cp values determined through permutation. For propensity score estimation via GBM, RF, and LR, respectively, the overall ATE estimates, with 95% nonparametric bootstrap confidence intervals (percentile method), were 0.25 (0.22, 0.30), 0.26 (0.22, 0.30), and 0.27 (0.24, 0.28).

3.2 Research Questions 2 & 3

3.2.1 Testing

For propensity scores estimated via LR, and with cp = 0.006, one split, on the indicator for XC = 3, was flagged. No splits were identified using propensity scores estimated via RF with cp = 0.008, nor via GBM with cp = 0.006. Thus, we found some evidence of effect heterogeneity based on XC, but the finding was not robust to the propensity score specification. There was no evidence of significant effect heterogeneity for any other covariates.

3.2.2 Exploration

In Figure 4 we plot curves to show the relationship between school-level estimates of the average treatment effect, on the vertical axis, and each school-level covariate. The notion that the intervention was more effective for schools in a "middle range" on X2 and with lower values on X1 is not inconsistent with the relationships shown in the first two panels of Figure 4. In Figure 5, because the student-level covariates are categorical, we use conditional boxplots to show how individual treatment effect estimates vary by category across the five student-level covariates. We note what appears to be considerable variability in both the median and the spread across levels of C1, the 15-category race variable. We also note a lower median for category XC = 3 as compared with the other categories of XC, a five-category urbanicity variable. Finally, we fit random forests using the vector of individual treatment effect estimates as the outcome and the school- and student-level variables as predictors to calculate variable importance ratings. Because these data constitute a mix of continuously and categorically measured predictor variables, and especially because several of the categorical variables have many categories, traditional random forest variable importance (Breiman, 2001) will result in biased importance by unjustly favoring variables with many categories (Strobl et al., 2007).


Figure 4: Average School-Level Treatment Effect as a Function of School-Level Covariates

Instead, we report variable importance from random forests based on conditional inference trees, as implemented in the R package party (Hothorn et al., 2006a), which produce unbiased importance values for multi-category predictors. Conditional inference trees differ from traditional recursive partitioning approaches in that splitting is based on p-values for linear test statistics derived by permutation theory. The p-values are associated with tests of null hypotheses of conditional independence between each predictor and the response, given the tree structure. At each step, these statistics are aggregated to form a global test of the null hypothesis. If the result of the global test is not significant, splitting stops; thus, tree pruning is not needed. If the result of the global test is significant, the p-values for individual predictors are ranked, and the next split occurs on the variable with the smallest p-value. By focusing on p-values, which are not affected by the scales of the predictor variables, fair comparisons may be made even for variables on different scales; see Hothorn et al. (2006b) for more details. After fitting random forests based on conditional inference trees, we find that variable XC is ranked as the most important predictor of variability in the individual treatment effect across all three propensity score estimation methods: GBM, RF, and LR. The average importance ranks across the three PS estimation methods identify XC, X1, and C1 as the three most important predictors, respectively. Notably, X2 is among the three least important predictors of effect heterogeneity according to the variable importance rankings.
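A minimal R sketch of this step, assuming the individual effect estimates tau_hat and a covariate data frame dat (with the categorical variables coded as factors), is given below.

# Sketch: unbiased variable importance from a conditional inference forest (party).
library(party)

dat$tau_hat <- tau_hat
cf <- cforest(tau_hat ~ S3 + C1 + C2 + C3 + X1 + X2 + X3 + X4 + X5 + XC,
              data = dat,
              controls = cforest_unbiased(ntree = 500, mtry = 3))

# Permutation importance; sorting gives the ranking summarized in Figure 6.
sort(varimp(cf), decreasing = TRUE)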

4. Discussion

We implemented a two-step approach to detect treatment effect heterogeneity characterized by (1) optimal full propensity score matching within schools to estimate individual (stratum-specific) treatment effects, followed by (2) fitting a regression tree of estimated individual treatment effects on covariates. In the analyses prepared for the workshop, we focused on the second research question by exploring the relationships between X1, X2, and estimated school-level treatment effects.


Figure 5: Individual Treatment Effect as a Function of Student-Level Predictors by Propen- sity Score Estimation Method; GBM = Generalized Boosted Modeling, RF = Random Forests, LR = Logistic Regression

Figure 6: Variable Importance Rankings from Random Forest Runs Regressing the Indi- vidual Treatment Effect Estimates on the Ten Predictors of Interest


For the post-workshop analyses, we further demarcated analyses by distinguishing between testing and exploration. In general, our analyses leaned heavily on the regression tree algorithm, which was used (a) in estimating propensity scores via random forests and boosted modeling, (b) to test for effect heterogeneity through regression tree analysis of individual treatment effect estimates, and (c) for additional exploration through conditional random forest variable importance. With respect to fitting regression trees, we noted that ten-fold cross-validation and the one SE rule of thumb, both typically used to select the cost-complexity pruning parameter, cp, are inconclusive with respect to the Type I error rate. Instead, we used a simple permutation approach to select cp values that yielded the desired Type I error rate and enabled testing. For the first research question, we found that the average intervention effect estimates from the different methods were all positive, with 95% bootstrap confidence intervals indicating that the mindset intervention was effective in improving student achievement. For the second and third research questions, we found evidence of heterogeneity based on membership in the third category of the urbanicity variable, but the finding was not robust to the propensity score estimation method. We found no other significant evidence of treatment effect modification. Based on exploratory analyses, if we were to plan a follow-up study to search for effect modification, we would recommend focusing on the student-level urbanicity variable, XC, the student-level race variable, C1, and the school-level fixed mindset rating variable, X1. We would not recommend prioritizing X2, the school-level achievement variable.

As noted by Feller and Holmes (2009), the assumptions required for identification of CATEs are identical to those required for the overall ATE (i.e., strong ignorability, no interference between units, and a single version of each treatment). We assume these key assumptions are met here. Furthermore, the usual recommended steps for the specification of the propensity score, including iterative respecification to achieve acceptable balance on observed covariates and an examination of overlap, are also important, but details are omitted because our focus is on heterogeneous subgroup identification. Finally, resampling approaches such as the jackknife and the bootstrap may be used to attain error bounds on ATEs and CATEs estimated via our two-step approach; however, care must be taken when using resampling techniques to estimate standard errors for estimators that involve matching (Abadie and Imbens, 2008; Austin and Small, 2014).

Acknowledgments

We thank Carlos Carvalho, Avi Feller, Jennifer Hill, and Jared Murray for organizing the workshop and inviting our submission. Jianshen Chen was employed by Educational Testing Service when this work was carried out.

References

Abadie, A. and Imbens, G. W. (2008). On the failure of the bootstrap for matching estimators. Econometrica, 76:1537–1557.


Austin, P. C. and Small, D. S. (2014). The use of bootstrapping when using propensity-score matching without replacement: a simulation study. Statistics in Medicine, 33:4306–4319.

Breiman, L. (2001). Random forests. Machine Learning, 45:5–32.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth and Brooks/Cole, Monterey, CA.

Chen, J. and Keller, B. (2019). Heterogeneous subgroup identification in observational studies. Journal of Research on Educational Effectiveness. Available from: https://doi.org/10.1080/19345747.2019.1615159.

Feller, A. and Holmes, C. (2009). Beyond toplines: Heterogeneous treatment effects in randomized experiments. Technical report, University of Oxford.

Hothorn, T., Buehlmann, P., Dudoit, S., Molinaro, A., and Van Der Laan, M. (2006a). Survival ensembles. Biostatistics, 7(3):355–373.

Hothorn, T., Hornik, K., and Zeileis, A. (2006b). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15:651–674.

Keller, B. (2012). Detecting treatment effects with small samples: The power of some tests under the randomization model. Psychometrika, 77:324–338.

Mindset Scholars Network (2018). National study of learning mindsets. Available from: http://mindsetscholarsnetwork.org.

Neyman, J. (1923). Sur les applications de la théorie des probabilités aux expériences agricoles: Essai des principes. In Roczniki Nauk Rolniczych, volume X, pages 1–51. In Polish; English translation by D. Dabrowska and T. Speed in Statistical Science, 5:465–472, 1990.

R Core Team (2018). R: A language and environment for statistical computing. Available from: http://www.R-project.org/.

Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70:41–55.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66:688–701.

Rubin, D. B. (1980). Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American Statistical Association, 75:591–593.

Steiner, P. M., Cook, T. D., and Shadish, W. R. (2011). On the importance of reliable covariate measurement in selection bias adjustments using propensity scores. Journal of Educational and Behavioral Statistics, pages 213–236.


Strobl, C., Boulesteix, A., Zeileis, A., and Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8:25.

Therneau, T., Atkinson, B., and Ripley, B. (2015). rpart: Recursive partitioning and regression trees. R package version 4.1-9. Available from: http://CRAN.R-project. org/package=rpart.

Observational Studies 5 (2019) 105-117 Submitted 7/19; Published 8/19

Causaltoolbox—Estimator Stability for Heterogeneous Treatment Effects

Sören R. Künzel∗ [email protected] Department of Statistics, University of California, Berkeley
Simon J. S. Walter∗ [email protected] Department of Statistics, University of California, Berkeley
Jasjeet S. Sekhon [email protected] Department of Political Science & Department of Statistics, University of California, Berkeley

Abstract

Estimating heterogeneous treatment effects has become increasingly important in many fields: for example, they are required to select a personalized treatment for a patient, which may be a life or death decision. Recently, a variety of procedures relying on different assumptions have been suggested for estimating heterogeneous treatment effects. Unfortunately, there are no compelling approaches that allow identification of the procedure whose assumptions hew closest to the process generating the data set under study, and researchers often select one arbitrarily. This approach risks making inferences that rely on incorrect assumptions and gives the experimenter too much scope for p-hacking. A single estimator will also tend to overlook patterns other estimators could have picked up. We believe that the conclusions of many published papers might change had a different estimator been chosen, and we suggest that practitioners should evaluate many estimators and assess their similarity when investigating heterogeneous treatment effects. We demonstrate this by applying 28 different estimation procedures to an emulated observational data set; this analysis shows that different estimation procedures may give starkly different estimates. We also provide an extensible R package which makes it straightforward for practitioners to follow our recommendations.

Keywords: Heterogeneous treatment effects, conditional average treatment effect, X-learner, joint estimation.

1. Introduction

Heterogeneous Treatment Effect (HTE) estimation is now a mainstay in many disciplines, including personalized medicine (Henderson et al., 2016; Powers et al., 2018), digital experimentation (Taddy et al., 2016), economics (Athey and Imbens, 2016), political science (Green and Kern, 2012), and statistics (Tian et al., 2014). Its prominence has been driven by a combination of the rise of big data, which permits the estimation of fine-grained heterogeneity, and recognition that many interventions have heterogeneous effects, suggesting that

∗. These authors contributed equally to this work.

c 2019 Sören R. Künzel, Simon J. S. Walter, and Jasjeet S. Sekhon.

much can be gained by targeting only the individuals likely to experience the most positive response. This increase in interest amongst applied researchers has been accompanied by a burgeoning methodological and theoretical literature: there are now many methods to characterize and estimate heterogeneity; some recent examples include Hill (2011); Athey and Imbens (2015); Künzel et al. (2017); Wager and Athey (2017a); Nie and Wager (2017). Many of these methods are accompanied by guarantees suggesting they possess desirable properties when specific assumptions are met; however, verifying these assumptions may be impossible in many applications, so practitioners are given little guidance for choosing the best estimator for a particular data set. As an alternative to verifying these assumptions, we suggest practitioners construct a large family of HTE estimators and consider their similarities and differences.

Treatment effect estimation contrasts with prediction, where researchers can use cross-validation (CV) or a validation set to compare the performance of different estimators or to combine them in an ensemble. This is infeasible for treatment effect estimation because of the fundamental problem of causal inference: we can never observe the treatment effect for any individual unit directly, so we have no source of truth to validate or cross-validate against. Partial progress has been made in addressing this problem; for example, Athey and Imbens (2015) suggest using the transformed outcome as the truth, a quantity equal in expectation to the individual treatment effect, and Künzel et al. (2017) suggest using matching to impute a quantity similar to the unobserved potential outcome. However, even if there were a reliable procedure for identifying the estimator with the best predictive performance, we maintain that using multiple estimates can still be superior, because the best performing method or ensemble of methods may perform well in some regions of the feature space and badly in others; using many estimates simultaneously may permit identification of this phenomenon. For example, researchers can construct a worst-case estimator that is equal to the most pessimistic point estimate for each point in the feature space, or they can consider stability (Yu, 2013) to assess whether one can trust estimates for a particular subset of units.

2. Methods

2.1 Study setting

The data set we analyzed was constructed for the Empirical Investigation of Methods for Heterogeneity Workshop at the 2018 Atlantic Causal Inference Conference. The organizers of the workshop (Carlos Carvalho, Jennifer Hill, Jared Murray, and Avi Feller) used the National Study of Learning Mindsets, a randomized controlled trial in a probability sample of U.S. public high schools, to simulate an observational study. The organizers did not disclose how the simulated observational data were derived from the experimental data because the workshop was intended to evaluate procedures for analyzing observational studies, where the mechanism of treatment assignment is not known a priori.

2.2 Measured variables

The outcome was a measure of student achievement; the treatment was the completion of online exercises designed to foster a learning mindset. Eleven covariates were available for


each student: four are specific to the student, describing self-reported expectations for success in the future, race, gender, and whether the student is the first in the family to go to college; the remaining seven variables describe the school the student is attending, measuring urbanicity, poverty concentration, racial composition, the number of pupils, average student performance, and the extent to which students at the school had fixed mindsets; an anonymized school id recorded which students went to the same school.

2.3 Notation and estimands

For each student, indexed by i, we observed a continuous outcome, Yi, a treatment indicator, Zi, that is 1 if the student was in the treatment group and 0 if she was in the control group, and a feature vector Xi. We adopt the notation of the Neyman-Rubin causal model: for each student we assume there exist two potential outcomes: if a student is assigned to treatment we observe the outcome Yi = Yi(1), and if the student is assigned to control we observe Yi = Yi(0). Our task was to assess whether the treatment was effective and, if so, whether the effect is heterogeneous. In particular, we are interested in discerning if there is a subset of units for which the treatment effect is particularly large or small. To assess whether the treatment is effective, we considered the average treatment effect,

$$ATE := E[Y_i(1) - Y_i(0)],$$

and to analyze the heterogeneity of the data, we considered average treatment effects for a selected subgroup S, $E[Y_i(1) - Y_i(0) \mid X_i \in S]$, and the Conditional Average Treatment Effect (CATE) function,

$$\tau(x) := E[Y_i(1) - Y_i(0) \mid X_i = x].$$

2.4 Estimating average effects

Wherever we computed the ATE or the ATE for some subset, we used four estimators. Three were based on the CausalGAM package of Glynn and Quinn (2017). This package uses generalized additive models to estimate the expected potential outcomes, $\hat{\mu}_0(x) := \hat{E}[Y_i(0) \mid X_i = x]$ and $\hat{\mu}_1(x) := \hat{E}[Y_i(1) \mid X_i = x]$, and the propensity score, $\hat{e}(x) := \hat{E}[Z_i \mid X_i = x]$. With these estimates we computed the inverse probability weighting (IPW) estimator,

$$\widehat{ATE}_{IPW} := \frac{1}{n} \sum_{i=1}^{n} \left( \frac{Y_i Z_i}{\hat{e}_i} - \frac{Y_i (1 - Z_i)}{1 - \hat{e}_i} \right),$$

the regression estimator,

$$\widehat{ATE}_{Reg} := \frac{1}{n} \sum_{i=1}^{n} \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) \right],$$

and the augmented inverse probability weighted (AIPW) estimator,

$$\widehat{ATE}_{AIPW} := \frac{1}{2n} \sum_{i=1}^{n} \left( \frac{[Y_i - \hat{\mu}_0(X_i)] Z_i}{\hat{e}_i} + \frac{[\hat{\mu}_1(X_i) - Y_i][1 - Z_i]}{1 - \hat{e}_i} \right),$$

where $\hat{e}_i := \hat{e}(X_i)$.
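As an illustration, these three plug-in estimators can be computed directly once fitted functions are available. The R sketch below is ours (not the CausalGAM interface) and assumes muhat0, muhat1, and ehat are functions returning the fitted outcome regressions and propensity score at the supplied covariates.

ate_estimators <- function(y, z, x, muhat0, muhat1, ehat) {
  m0 <- muhat0(x); m1 <- muhat1(x); e <- ehat(x)
  n  <- length(y)
  ate_ipw  <- mean(y * z / e - y * (1 - z) / (1 - e))   # IPW estimator
  ate_reg  <- mean(m1 - m0)                             # regression estimator
  # AIPW as written above: the average of two terms, each unbiased for the ATE
  ate_aipw <- sum((y - m0) * z / e + (m1 - y) * (1 - z) / (1 - e)) / (2 * n)
  c(IPW = ate_ipw, Regression = ate_reg, AIPW = ate_aipw)
}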


We also used the Matching package of Sekhon (2011) to construct a matching estimator for the ATE. Matches were required to attend the same school as the student to which they were matched and to be assigned to the opposite treatment status. Amongst possible matches satisfying these criteria, we selected the student minimizing the Mahalanobis distance on the four student-specific features.
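The within-school matching step can be sketched in base R as follows. This is an illustration under our own assumptions (a data frame df with columns school, z, and y, and student-level features in stud_cols, treated as numeric for simplicity), not the Matching package call the authors used.

matching_ate <- function(df, stud_cols) {
  S <- cov(df[, stud_cols])                       # covariance matrix for the Mahalanobis metric
  y_match <- vapply(seq_len(nrow(df)), function(i) {
    pool <- which(df$school == df$school[i] & df$z != df$z[i])  # same school, opposite arm
    d <- mahalanobis(df[pool, stud_cols], center = unlist(df[i, stud_cols]), cov = S)
    df$y[pool[which.min(d)]]                      # outcome of the closest eligible match
  }, numeric(1))
  # Impute each unit's missing potential outcome with its match and average the contrasts
  mean(ifelse(df$z == 1, df$y - y_match, y_match - df$y))
}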

2.5 Characterizing heterogeneous treatment effects

In any data set there might be some units where estimators significantly disagree; when this happens, we should not trust any estimate unless we understand why certain estimates are unreasonable for these units.1 Instead of simply reporting an estimate that is likely to be wrong, we should acknowledge that conclusions cannot be drawn and that more data or domain-specific knowledge is needed.

Figure 1 demonstrates this phenomenon arising in practice. It shows the estimated treatment effect for ten subjects corresponding to 28 CATE estimators (these estimates arise from the data analyzed in the remainder of this paper). Some of these estimators may have better generalization error than others. However, a reasonable analyst could have selected any one of them. We can see that for five units the estimators all fall in a tight cluster, but for the remaining units, the estimators disagree markedly. This may be due to those units being in regions with little overlap, where the estimators overcome data scarcity by pooling information in different ways.

Since an estimate of the entire CATE function is hard to interpret and drawing statistically significant conclusions based on it is difficult, we decided to use our estimate of the CATE function to identify large subgroups with markedly different average treatment effects and then form conclusions based on the differences in ATE for these subgroups. To ensure the treatment effect estimates for the selected subgroups were valid, we divided the data into an exploration set and an equally sized validation set. We used the exploration set to identify subsets which may behave differently. To do this, we trained all 28 CATE estimators on the exploration set and formulated hypotheses based on plots of the marginal CATE: for example, based on plots of the CATE estimates we might theorize that students in schools with more than 900 students have a much higher treatment effect than those in schools with fewer than 300 students. Next, we used the validation set to verify our findings by estimating the ATEs of each of the subgroups. The exploration and validation sets were constructed by randomly associating schools (not individual students) with each set, so students who attended the same school were never split between the exploration and validation set; a minimal sketch of such a split appears below. We adopted this procedure because it excludes the possibility of there being dependence between the exploration and validation sets if students who attend the same school influence each others' outcomes; it also mirrors the probability sampling approach used to construct the full sample, and it means that we can argue that the estimand captured by evaluating our hypotheses on the validation set is the estimand corresponding to the population from which all schools were drawn.
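A minimal R sketch of the school-level split (the column name schoolid and the 50/50 proportion are assumptions on our part):

split_by_school <- function(df, school_col = "schoolid", prop = 0.5, seed = 1) {
  set.seed(seed)
  schools <- unique(df[[school_col]])
  expl    <- sample(schools, size = floor(prop * length(schools)))  # assign schools, not students
  list(exploration = df[df[[school_col]] %in% expl, ],
       validation  = df[!df[[school_col]] %in% expl, ])
}
# CATE estimators are trained on the exploration set only; subgroup hypotheses
# formulated there are then checked by estimating subgroup ATEs on the validation set.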

1. A standard approach to capture uncertainty is to report the standard errors or confidence regions for a single estimator, and we recommend using this approach in addition to our approach. However, for CATE estimation, these methods can be misleading and cannot be trusted blindly. For example, in Appendix C of Künzel et al. (2017), the authors found that in regions without overlap, bootstrap confidence intervals were smaller than in regions with overlap. This suggests that estimates were more trustworthy in regions without overlap but, in fact, the opposite was true.



Figure 1: CATE estimation for ten units. For each unit, the CATE is estimated using 28 different estimators.

2.6 CATE estimators

We use several procedures to estimate the CATE; we give a brief overview of the procedures here, but interested readers should consult the referenced papers for a complete exposition. Many of the procedures can be classified as meta-learners: they modify base learners designed for standard non-causal prediction problems to estimate the CATE. This is advantageous because we can select a base learner that is designed to work well on the data we are analyzing.

1. The T-Learner is the most common meta-learner. Base learners are used to estimate the control and treatment response functions separately, $\hat{\mu}_1(x) := \hat{E}[Y_i(1) \mid X_i = x]$ and $\hat{\mu}_0(x) := \hat{E}[Y_i(0) \mid X_i = x]$. The CATE estimate is then the difference between these two estimates, $\hat{\tau}^T(x) := \hat{\mu}_1(x) - \hat{\mu}_0(x)$.

2. The S-Learner uses one base learner to estimate the joint outcome function, $\hat{\mu}(x, z) := \hat{E}[Y_i \mid X_i = x, Z_i = z]$. The predicted CATE is the difference between the predicted values when the treatment assignment indicator is changed from treatment to control, $\hat{\tau}^S(x) := \hat{\mu}(x, 1) - \hat{\mu}(x, 0)$.

3. The MO-Learner (Rubin and van der Laan, 2007; Walter et al., 2018) is a two-stage meta-learner. It first uses the base learners to estimate the propensity score, $\hat{e}(x) := \hat{E}[Z_i \mid X_i = x]$, and the control and treatment response functions. It then


defines the adjusted modified outcome as

$$R_i := \frac{Z_i - \hat{e}(x_i)}{\hat{e}(x_i)[1 - \hat{e}(x_i)]} \left( Y_i - \hat{\mu}_1(x_i)[1 - \hat{e}(x_i)] - \hat{\mu}_0(x_i)\hat{e}(x_i) \right).$$

An estimate of the CATE is obtained by using a base learner to estimate the conditional expectation of $R_i$ given $X_i$, $\hat{\tau}^{MO}(x) := \hat{E}[R_i \mid X_i = x]$.

4. The X-Learner (Künzel et al., 2017) also uses base learners to estimate the response functions and the propensity score. It then defines the imputed treatment effects for the treatment group and control group separately as $\tilde{D}_i^1 := Y_i(1) - \hat{\mu}_0(X_i)$ and $\tilde{D}_i^0 := \hat{\mu}_1(X_i) - Y_i(0)$. Two estimators for the CATE are obtained by using base learners to estimate the conditional expectations of the imputed treatment effects, $\hat{\tau}_1^X(x) := \hat{E}[\tilde{D}_i^1 \mid X_i = x]$ and $\hat{\tau}_0^X(x) := \hat{E}[\tilde{D}_i^0 \mid X_i = x]$. The final estimate is then a convex combination of these two estimators,

$$\hat{\tau}^X(x) := \hat{e}(x)\hat{\tau}_0^X(x) + (1 - \hat{e}(x))\hat{\tau}_1^X(x).$$

All of these meta-learners have different strengths and weaknesses. For example, the T-Learner performs particularly well when the control and treatment response functions are simpler than the CATE. The S-Learner performs particularly well when the expected treatment effect is mostly zero or constant. The X-Learner, on the other hand, has very desirable properties when either the treatment or control group is much larger than the other group. Note, however, that all of these meta-learners need a base learner to be fully defined; an illustrative sketch of the T- and X-Learner with a random forest base learner follows the list below. We believe that tree-based estimators perform well on mostly discrete and low-dimensional data sets. Therefore, we use the causalToolbox package (Künzel et al., 2018) that implements all of these estimators combined with RF and BART. Using two different tree estimators is desirable because CATE estimators based on BART perform very well when the data-generating process has some global structure (e.g., global sparsity or linearity), while random forests are better when the data has local structure that does not necessarily generalize to the entire space. However, to protect our analysis from biases caused by using tree-based approaches only, we also included methods based on neural networks. We followed Künzel et al. (2018) and implemented the S-, T- and X-Neural Network methods. We also included tree-based learners that are not meta-learners but that we believe would work well on this data set:

5. The causal forest algorithm (Wager and Athey, 2017b) is a generalization of the random forest algorithm that estimates the CATE directly. Similar to random forests, it is an ensemble of many tree estimators. Each of the tree estimators follows a greedy splitting strategy to generate leaves within which the CATE function is as homogeneous as possible. The final estimate from each tree for a unit with features x is the difference-in-means estimate of all units in the training set that fall in the same leaf as x.


6. The R-Learner (Nie and Wager, 2017) is a set of algorithms that use an approximation of the following optimization problem to estimate the CATE,

$$\hat{\tau}(\cdot) \in \arg\min_{\tau} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left[ \left( Y_i - \hat{\mu}^{(-i)}(X_i) \right) - \left( W_i - \hat{e}^{(-i)}(X_i) \right) \tau(X_i) \right]^2 + \Lambda_n(\tau(\cdot)) \right\}.$$

Here $\Lambda_n(\tau(\cdot))$ is a regularizer, and $\hat{\mu}^{(-i)}(x)$ and $\hat{e}^{(-i)}(x)$ are held-out predictions of $\mu(x) = E[Y_i \mid X_i = x]$ and the propensity score, $e(x)$, respectively. There are several versions of the R-Learner; we have decided to use one that is based on XGBoost (Chen and Guestrin, 2016) and one that is based on RF.
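The sketch below is our own illustration of how the T- and X-Learner definitions above translate into code, using the randomForest package as the base learner; it is not the causalToolbox implementation.

library(randomForest)

t_learner <- function(X, y, z, X_new) {
  mu1 <- randomForest(X[z == 1, , drop = FALSE], y[z == 1])   # treatment response function
  mu0 <- randomForest(X[z == 0, , drop = FALSE], y[z == 0])   # control response function
  predict(mu1, X_new) - predict(mu0, X_new)                   # tau^T(x)
}

x_learner <- function(X, y, z, X_new) {
  mu1 <- randomForest(X[z == 1, , drop = FALSE], y[z == 1])
  mu0 <- randomForest(X[z == 0, , drop = FALSE], y[z == 0])
  d1  <- y[z == 1] - predict(mu0, X[z == 1, , drop = FALSE])  # imputed effects, treated group
  d0  <- predict(mu1, X[z == 0, , drop = FALSE]) - y[z == 0]  # imputed effects, control group
  tau1 <- randomForest(X[z == 1, , drop = FALSE], d1)
  tau0 <- randomForest(X[z == 0, , drop = FALSE], d0)
  ps   <- randomForest(X, as.factor(z))                       # propensity score model
  ehat <- predict(ps, X_new, type = "prob")[, "1"]
  ehat * predict(tau0, X_new) + (1 - ehat) * predict(tau1, X_new)  # tau^X(x)
}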

Although we expected there would be school-level effects, and that both the expected performance of each student and the CATE would vary from school to school, it was not clear how to incorporate the school id. The two choices we considered were to include a categorical variable recording the school id, or to ignore it entirely. The former makes parameters associated with the six school-level features essentially uninterpretable because they cannot be identified separately from the school id; the latter may lead to less efficient estimates because we are denying our estimation procedure the use of all data that was available to us. Because we do not want our inference to depend on this decision, we fit each of our estimators twice: once including school id as a feature, and once excluding it. We considered 14 different CATE estimation procedures; since each procedure was applied twice, a total of 28 estimators were computed.

3. Workshop Results

Our sample consisted of about 10,000 students enrolled at 76 different schools. The intervention was applied to 33% of the students. Pre-treatment features were similar in the treatment and control groups, but some statistically significant differences were present. Specifically, a variable capturing self-reported expectations for success in the future had mean 5.22 (95% CI, 5.20-5.25) in the control group and mean 5.36 (5.33-5.40) in the treatment group. This meant students with higher expectations of achievement were more likely to be treated. We assessed whether overlap held by fitting a propensity score model; the propensity score estimate for every student in the study was between 0.15 and 0.46, so the overlap condition is likely to be satisfied.
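A crude version of this overlap check is a logistic propensity model followed by an inspection of the fitted scores; the data frame and column names below are placeholders, and the authors do not state which propensity model they fit.

ps_fit <- glm(Z ~ S3 + C1 + C2 + C3 + XC + X1 + X2 + X3 + X4 + X5,
              family = binomial, data = students)
range(fitted(ps_fit))   # fitted scores bounded well away from 0 and 1 suggest overlap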

3.1 Average treatment effects

The IPW, regression, and AIPW estimators yielded estimates identical up to two significant figures: 0.25, with a 95% bootstrap confidence interval of (0.22, 0.27). The matching estimator gave a similar ATE estimate of 0.26 with confidence interval (0.23, 0.28). The similarity of all the estimates we evaluated is reassuring, but we cannot exclude the possibility that the experiment is affected by an unobserved confounder that affects all estimators in a similar way. To address this, we characterize the extent of hidden bias required to explain away our conclusion. We conducted a sensitivity analysis for the matching estimator using the sensitivitymv package of Rosenbaum (2018). We found that


a permutation test for the matching estimator still finds a significant positive treatment effect provided the ratio of the odds of treatment assignment for the treated unit relative to the odds of treatment assignment for the control unit in each pair can be bounded by 0.40 and 2.52. This bound is not very large, and it is plausible that there exists an unobserved confounder that increases the treatment assignment probability for some unit by a factor of more than 2.52. More information about the treatment assignment mechanism would be required to conclude whether this extent of confounding exists.

3.2 Heterogeneous effects

The marginal distribution and partial dependence plots for the 28 CATE estimators as a function of school-level pre-existing mindset norms are shown on the left-hand side of Figure 2. There appears to be substantial heterogeneity present: students at schools with mindset norms lower than 0.15 may have a larger treatment effect than students at schools with higher mindset norms. However, the figure suggests the conclusion is not consistent across all of the 28 estimators. A similar analysis of the feature recording the school achievement level is shown on the right-hand side of the same figure. Again, we appear to find the existence of heterogeneity: students with a school achievement level near the middle of the range had the most positive response to treatment. On the basis of this figure, we identified thresholds of -0.8 and 1.1 for defining low, middle, and high achievement-level subgroups.


Figure 2: Marginal CATE and Partial Dependence Plot (PDP) of the CATE as a function of school-level pre-existing mindset norms and school achievement level.

We then used the validation set to construct ATE estimates for each of the subgroups. We found that students who attended schools where the measure of fixed mindsets was less than 0.15 had a higher treatment effect (0.31, 95% CI 0.26-0.35) than students where


the fixed mindset was higher (0.21, 0.17–0.26). Testing for equality of the ATE for these two groups yielded a p-value of 0.003. However, when we considered the subsets defined by school achievement level, the differences were not so pronounced. Students at the lowest-achieving schools had the smallest ATE estimate, 0.19 (0.10–0.32), while students at middle- and high-achieving schools had similar ATE estimates: 0.28 (0.24–0.32) and 0.24 (0.16–0.31), respectively. However, none of the pairwise differences between the three groups were significant.
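The subgroup comparisons above can be approximated with a simple large-sample test; the authors do not state the exact test they used, so the following is only an illustrative sketch based on reported subgroup estimates and standard errors.

subgroup_diff_test <- function(ate1, se1, ate2, se2) {
  z_stat <- (ate1 - ate2) / sqrt(se1^2 + se2^2)  # assumes independent subgroup estimates
  2 * pnorm(-abs(z_stat))                        # two-sided p-value
}
# e.g. subgroup_diff_test(0.31, se_low, 0.21, se_high) for the two mindset-norm groups,
# where se_low and se_high stand for the subgroup standard errors.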

4. Postworkshop results

During the workshop, other contributors found that the variable recording the urbanicity of the schools might explain some of the heterogeneity, so we investigate that finding here. The left-hand side of Figure 3 shows the CATE as a function of urbanicity, and the right-hand side of this figure shows the CATE as a function of the student's self-reported expectation of success. We formulated two hypotheses: students at schools with an urbanicity of 3 seemed to have a lower treatment effect than students at other schools; students with a self-reported evaluation of 4 might enjoy a higher treatment effect. These hypotheses were obtained using only the exploration set; to confirm or refute them we used the validation set. The validation set confirmed the hypothesis that students at schools with an urbanicity of 3 had a lower treatment effect (0.16, 0.08–0.24) compared to students at schools with a different urbanicity (0.28, 0.25–0.31); however, we could not reject the null hypothesis of no difference for the subsets identified by the self-reported evaluation measure. The urbanicity test yielded a p-value of 0.008 and the self-reported evaluation test yielded a p-value of 0.56.


Figure 3: Marginal CATE and PDP of Urbanicity and self-reported expectations.


5. Discussion

5.1 The importance of considering multiple estimators

The results of our analysis confirm that point estimates of the CATE can differ markedly depending on subtle modelling choices; so, we confirm that an analyst's discretion may be the deciding factor in whether and what kind of heterogeneity is found. As the methodological literature on heterogeneous treatment effect estimation continues to expand, this problem will become more serious. To facilitate applying many estimation procedures, we have authored an R package, causalToolbox, that provides a uniform interface for constructing many common heterogeneous treatment effect estimators. The design of the package makes it straightforward to add new estimators as they are proposed and gain currency.

Differences that arise in our estimation of the CATE function translate directly into suboptimal real-world applications of the treatment considered. To see this, we propose a thought experiment: suppose we wanted to determine the treatment for a particular student. A natural treatment rule is to allocate her to treatment if her estimated CATE exceeds a small positive threshold and to withhold treatment if it is below the threshold. A CATE estimator may be chosen on the basis of personal preference or prior experience, and it is likely that, for some experimental subjects, the choice of estimator will affect the estimated CATE to such an extent that it changes the treatment decision. This is particularly problematic in studies where analysts have a vested interest in a particular result and are working without a pre-analysis plan, as they should not have discretion to select a procedure that pushes the results in the direction they desire. On the other hand, if analysts consider a wide variety of estimators, as we recommend, and if most estimators agree for an individual, we can be confident that our decision for that individual is not a consequence of arbitrary modelling choices. Conversely, if some estimators predict a positive and some a negative response, we should reserve judgment for that unit until more conclusive data are available and admit that we do not know what the best treatment decision is.

5.2 Would we recommend the online exercises?

We find that the overall effect of the treatment is significant and positive. We were not able to identify a subgroup of units with a significant and negative treatment effect, and we would therefore recommend the treatment for every student. In addition, our sensitivity analysis suggested that our findings would still hold provided any unobserved confounder is not too strong. Nonetheless, we cannot exclude the possibility that there is a strong confounder, and we would need to better understand the assignment mechanism to eliminate this possibility. This issue deserves investigation because we saw that students who had higher expectations for success in the future were more likely to be in the treatment group.

Uncovering the heterogeneity in the CATE function proved to be difficult. We found that heterogeneity could be identified from school-level pre-existing mindset norms and urbanicity, but in general we had limited power to detect heterogeneous effects. For example, experts believe that the heterogeneity might be moderated by pre-existing mindset norms and school-level achievement. For both covariates, we see that most CATE estimators produce estimates that are consistent with this theory. Domain experts also believe that there could be a “Goldilocks effect” where middle-achieving schools have the largest treatment effect. We are not able


to verify this statistically, but we do observe that most CATE estimators describe such an effect.

Acknowledgments

We thank Carlos Carvalho, Jennifer Hill, Jared Murray, and Avi Feller for organizing the Empirical Investigation of Methods for Heterogeneity Workshop and for their valuable feedback. We gratefully acknowledge support from Office of Naval Research (ONR) grant N00014-15-1-2367.


References

Athey, S. and Imbens, G. W. (2015). Machine learning methods for estimating heterogeneous causal effects. stat, 1050(5).

Athey, S. and Imbens, G. W. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences of the United States of America, 113(27):7353–60.

Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 785–794, New York, NY, USA. ACM.

Glynn, A. and Quinn, K. (2017). CausalGAM: Estimation of Causal Effects with Generalized Additive Models. R package version 0.1-4.

Green, D. P. and Kern, H. L. (2012). Modeling heterogeneous treatment effects in survey experiments with Bayesian additive regression trees. Public Opinion Quarterly, 76(3):491–511.

Henderson, N. C., Louis, T. A., Wang, C., and Varadhan, R. (2016). Bayesian analysis of heterogeneous treatment effects for patient-centered outcomes research. Health Services and Outcomes Research Methodology, 16(4):213–233.

Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240.

Künzel, S., Sekhon, J., Bickel, P., and Yu, B. (2017). Meta-learners for estimating heterogeneous treatment effects using machine learning. arXiv preprint arXiv:1706.03461.

Künzel, S., Tang, A., Xie, L., Saarinen, T., Bickel, P., Yu, B., and Sekhon, J. (2018). causalToolbox: Toolbox for Causal Inference with emphasize on Heterogeneous Treatment Effect Estimator. R package version 0.0.1.000.

Künzel, S. R., Stadie, B. C., Vemuri, N., Ramakrishnan, V., Sekhon, J. S., and Abbeel, P. (2018). Transfer learning for estimating causal effects using neural networks. arXiv preprint arXiv:1808.07804.

Nie, X. and Wager, S. (2017). Learning objectives for treatment effect estimation. arXiv preprint arXiv:1712.04912.

Powers, S., Qian, J., Jung, K., Schuler, A., Shah, N. H., Hastie, T., and Tibshirani, R. (2018). Some methods for heterogeneous treatment effect estimation in high dimensions. Statistics in Medicine.

Rosenbaum, P. R. (2018). sensitivitymv: Sensitivity Analysis in Observational Studies. R package version 1.4.3.

Rubin, D. and van der Laan, M. J. (2007). A doubly robust censoring unbiased transformation. The International Journal of Biostatistics, 3(1).


Sekhon, J. S. (2011). Multivariate and propensity score matching software with automated balance optimization: The Matching package for R. Journal of Statistical Software, 42(7):1–52.

Taddy, M., Gardner, M., Chen, L., and Draper, D. (2016). A nonparametric Bayesian analysis of heterogeneous treatment effects in digital experimentation. Journal of Business & Economic Statistics, 34(4):661–672.

Tian, L., Alizadeh, A. A., Gentles, A. J., and Tibshirani, R. (2014). A simple method for estimating interactions between a treatment and a large number of covariates. Journal of the American Statistical Association, 109(508):1517–1532.

Wager, S. and Athey, S. (2017a). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association.

Wager, S. and Athey, S. (2017b). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, (just-accepted).

Walter, S., Sekhon, J., and Yu, B. (2018). Analyzing the modified outcome for heterogeneous treatment effect estimation. Unpublished manuscript.

Yu, B. (2013). Stability. Bernoulli, 19:1484–1500.

Observational Studies 5 (2019) 118-130 Submitted 7/19; Published 8/19

An Application of Matching After Learning To Stretch (MALTS) to the ACIC 2018 Causal Inference Challenge Data

Harsh Parikh [email protected] Department of Computer Science, Duke University, Durham, NC 27710, USA
Cynthia Rudin1 [email protected] Department of Computer Science, Department of Electrical and Computer Engineering, Department of Statistical Sciences, Duke University, Durham, NC 27710, USA
Alexander Volfovsky1 [email protected] Department of Statistical Sciences, Duke University, Durham, NC 27710, USA

Abstract

In the learning-to-match framework for causal inference, a parameterized distance metric is trained on a holdout train set so that the matching yields accurate estimated conditional average treatment effects. This way, the matching can be as accurate as other black box machine learning techniques for causal inference. We use a new learning-to-match algorithm called Matching-After-Learning-To-Stretch (MALTS) (Parikh et al., 2018) to study an observational dataset from the Atlantic Causal Inference Challenge. Other than providing estimates for (conditional) average treatment effects, the MALTS procedure allows practitioners to evaluate matched groups directly, understand where more data might need to be collected, and gain an understanding of when estimates can be trusted.1

Keywords: Matching Algorithm, Causal Inference, Nearest Neighbors

1. Introduction

Matching methods should place “similar” units into matched groups, but the question of whether two units are similar is much more complicated than it might seem. The choice of distance metric used to form the matches can have a large impact on causal conclusions. If we construct a high quality distance metric for matching, not only will we make correct causal conclusions, but we will also have interpretable matches that we can understand and troubleshoot. Examining the matches will allow us to locate possible sources of confounding, areas where treatment and control units do not overlap, or perhaps it will allow us to troubleshoot other problems with our data.

1. Denotes approximately equal contribution

c 2019 Harsh Parikh, Cynthia Rudin and Alexander Volfovsky.

Conversely, a low quality distance metric can lead to poor matches and incorrect conclusions. If the experimenter chooses a fixed distance metric beforehand, such as Euclidean distance (as is typically used in K-nearest neighbors) or logit distance (as in propensity score matching), then it would weigh distances between important covariates equally as compared to the distances between irrelevant covariates. Even with a few irrelevant covariates in the data, this could easily degrade the distance metric far enough that it would affect the treatment effect estimates (Wang et al., 2017). An alternative to Euclidean distance would be to ask experimenters to construct the distance metric using domain expertise, for instance, using a weighted Euclidean distance or by performing coarsened exact matching (Iacus et al., 2011). This, however, requires the user to construct a high-dimensional distance function. Humans are not naturally adept at constructing high-dimensional functions manually, and the high degree of freedom in choosing this metric exposes how easily this choice could go wrong.

Ideally, the distance metric should match units together so that matched groups yield accurate estimated conditional treatment effects. The learning-to-match framework proposed by Parikh et al. (2018) aims to do this. In this framework, the distance metric is trained on a holdout train set. Because the distances are trained, matched groups tend to have more accurate estimated treatment effects. Irrelevant covariates are automatically ignored as part of the learning process, and similarities along relevant covariates tend to be weighted more highly. The impact of arbitrary human choices is reduced, and we can quantitatively judge the quality of the distance metric prior to using it.

Matching After Learning to Stretch (MALTS) (Parikh et al., 2018) is a learning-to-match method that stretches important covariates and shrinks less relevant covariates. By stretching relevant covariates, it forces the distance to be more sensitive to small changes in these covariates. By shrinking less relevant covariates, the distance metric becomes less sensitive to their changes. The stretching and shrinking parameters in the distance metric are learned from a holdout train set, as a special case of learning the parameters of a diagonal Mahalanobis distance matrix. If the Mahalanobis distance is not forced to be diagonal, it would induce more general distances, including rotations, but in this work we consider only stretching and shrinking of individual covariates for interpretability. MALTS handles categorical covariates by exact matching on as many relevant categorical covariates as possible, using ideas from the FLAME algorithm of Wang et al. (2017).

We applied the new learning-to-match framework to the Empirical Investigation of Methods for Heterogeneity Workshop data from the 2018 Atlantic Causal Inference Conference (ACIC). The ACIC dataset emulates the data and intervention in the National Study of Learning Mindsets. It includes school identifiers, self-reported expectation for success in the future (S3), race (C1), gender (C2), first-generation status (C3), level of urbanicity (XC), school-level mean mindset (X1), school achievement level (X2), racial/ethnic composition of the school (X3), poverty concentration of the school (X4), and school size (X5). The outcome is a continuous measure of student achievement (Y), and the treatment is a mindset intervention indicated in the data by T.
We find that even though MALTS induces only simple stretching and shrinking of the covariates, it tends to produce results that are similar to black box machine learning methods for treatment effect estimation like BART (Chipman et al., 2010), Causal Forest (Wager and Athey, 2017) or CFR (Johansson et al., 2016). Moreover, it produces matched groups that we can display, critique, and troubleshoot before estimation takes place. The matched groups are not created post hoc to explain a black box; they are created as part of an interpretable, auditable process.

Briefly, our findings using the Matching After Learning to Stretch (MALTS) methodology highlight the importance of self-reported expectations for success in the future in the determination of the outcome for both the treatment and control sets jointly. We also observe hints of a Goldilocks effect for the covariates regarding school-level achievement. The average trend observed jointly on the self-reported expectation for success in the future and urbanicity shows that higher values of both are possibly correlated with higher treatment effects. We find interesting behavior for level 3 of urbanicity that is different from the other levels of urbanicity. Finally, we observed only small heterogeneity in treatment effects across gender and almost no heterogeneity for first-generation status. We present these results as an analysis pipeline: we first assess the quality of matched groups and then provide estimates that can account for that quality.

2. Methodology and Motivation

Matching After Learning to Stretch (MALTS) is a matching method that estimates conditional average treatment effects (CATEs), defined as the difference between the outcome under treatment, $Y^{(t)}$, and the outcome under control, $Y^{(c)}$, for a given x, i.e., $E[(Y^{(t)} - Y^{(c)}) \mid X = x]$. It uses K-nearest neighbors (KNN) to estimate the counterfactual outcomes at a given location in covariate space, with a learned distance metric (Parikh et al., 2018). Because the distance metric is learned from a training set, MALTS, and its discrete counterpart "Fast Large-scale Almost Matching Exactly Approach to Causal Inference" (FLAME), are part of the learning-to-match framework (Wang et al., 2017; Dieng et al., 2019). MALTS assumes that there is no unobserved confounding.

MALTS learns a distance metric L such that the following minimization problem is approximately solved on the normalized training set. The train set is normalized such that each covariate has zero sample mean and unit sample variance. The objective accounts for the aggregate error in the estimation of each $y_i$ as the average of the outcomes $y_k$ of its K nearest neighbors according to the learned distance metric:

$$L \in \arg\min_{L'} \sum_{S \in \{C, T\}} \sum_{i \in S} \left( y_i - \frac{1}{K} \sum_{k \in \mathrm{KNN}(y_i, L') \subseteq S} y_k \right)^2. \qquad (1)$$

The learned distance metric forces samples with similar outcomes in the training set to be closer in covariate space. In this implementation, MALTS' distance metric for KNN is parameterized by a matrix, termed L, that handles continuous and discrete variables differently. Let the subscript c indicate continuous variables and d indicate discrete variables; then the distance metric is

$$\mathrm{distance}_L(a, b) = d_{L_c}(a_c, b_c) + d_{L_d}(a_d, b_d), \quad \text{where } L = [L_c, L_d],$$
$$d_{L_c}(a_c, b_c) = \lVert L_c a_c - L_c b_c \rVert_2^2, \qquad d_{L_d}(a_d, b_d) = \sum_{j=0}^{n_d} (L_d^{j,j})^2 \, \mathbf{1}[a_d(j) \neq b_d(j)].$$

MALTS learns the distance metric parameter L using a non-gradient-based optimization method with the help of Python 3's scipy package (Jones et al., 2001).
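The authors' implementation is in Python (scipy); purely for illustration, the distance above can be transcribed as the following R function, where Lc is the stretch matrix for the continuous covariates and Ld a vector of per-covariate penalties for mismatches on the discrete covariates.

malts_distance <- function(a_cont, b_cont, a_disc, b_disc, Lc, Ld) {
  d_cont <- sum((Lc %*% a_cont - Lc %*% b_cont)^2)   # ||Lc a_c - Lc b_c||_2^2
  d_disc <- sum(Ld^2 * (a_disc != b_disc))           # sum_j (Ld_j)^2 1[a_d(j) != b_d(j)]
  d_cont + d_disc
}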


The learned distance metric is used for matching the units in the estimation set. In principle the proposed procedure allows for the full data to be used for estimation of CATEs. However, it might be preferable to keep the two sets completely separated. This separation allows the training to inform additional data collection (for example, if in the training set we find a few nearest neighbors who form very tight groups while the others form loose groups, then we might want to explore the space of the loose groups to see if additional data might lead to tighter matches). For each unit, we find both the k nearest neighbors in the control group ($KNN_c$) and the k nearest neighbors in the treatment group ($KNN_t$). We estimate the outcome under control as $\hat{y}_i^{(c)} = \frac{1}{K} \sum_{k \in KNN_c} y_k^{(c)}$ and the outcome under treatment as $\hat{y}_i^{(t)} = \frac{1}{K} \sum_{k \in KNN_t} y_k^{(t)}$. (The superscript (t) refers to units in the treated set and the superscript (c) refers to units in the control set.) Thus we estimate the conditional average treatment effect (CATE) as the expected difference between outcomes for units in the treated set and control set, conditional on the given covariates, using the relation

$$\hat{E}[Y^{(t)} - Y^{(c)} \mid X = x_i] = \hat{y}_i^{(t)} - \hat{y}_i^{(c)}.$$
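A hedged R sketch of this estimation step (the authors' implementation is in Python), assuming a learned pairwise distance function dist_fn(i, j) and K = 10:

estimate_cate_knn <- function(i, y, z, dist_fn, K = 10) {
  d <- vapply(seq_along(y), function(j) dist_fn(i, j), numeric(1))
  treated <- which(z == 1); control <- which(z == 0)
  knn_t <- treated[order(d[treated])][1:K]   # K nearest treated neighbors
  knn_c <- control[order(d[control])][1:K]   # K nearest control neighbors
  c(cate     = mean(y[knn_t]) - mean(y[knn_c]),   # y_hat^(t) - y_hat^(c)
    diameter = max(d[c(knn_t, knn_c)]))           # matched-group diameter, used for pruning below
}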

We are able to use MALTS to identify outliers or low-quality matches by examining the diameter of the matched group. The diameter provides an intuitive comparison between matched groups: matched groups that span a large portion of covariate space tend to be low-quality matches. If desired, one can prune the matched groups of large diameter. The diameter of the matched group $MG(x_i)$ for query unit $x_i$ is defined as $\mathrm{diameter}(x_i) = \max_{j \in MG(x_i)} \mathrm{distance}_L(x_i, x_j)$, where the matched group $MG(x_i)$ is formed by the union of $KNN_c(x_i)$ and $KNN_t(x_i)$. Instead of pruning, one can create an aggregate estimator by weighting each group inversely proportional to a function of its diameter.

3. Workshop Results

We presented results for a preliminary version of the matching algorithm during the ACIC challenge workshop. In this preliminary approach, we embedded the discrete covariates in Euclidean space and learned a single distance metric for the combination of continuous and embedded (discrete) covariates. This approach yielded interpretable matched groups, and we learned from this exercise that it is important to match on a student's self-reported expectations of future success (S3) and probably less important to match on gender and first-generation status. However, the embedding of the discrete variables is somewhat misleading within our distance metric learning framework. Specifically, it can make units appear artificially similar if they both belong to rare groups. To address this challenge in the post-workshop analysis, we considered a distance metric that explicitly penalizes non-exact matching on discrete covariates without requiring an a priori embedding of the covariates in Euclidean space, which we described above. We note that generalizations of this method (discussed in Parikh et al., 2018) allow for flexible data-driven embeddings of complex data that complement the distance metric learning problem. Distance metric learning for nearest neighbors has also been considered in non-causal settings (Goldberger et al., 2005).


4. Post Workshop Analysis

We estimate treatment effects by using the Matching-After-Learning-to-Stretch (MALTS) methodology to construct matched groups and evaluate conditional average treatment effects. Covariates in the ACIC dataset include a school identifier, self-reported expectation for success in the future (S3), race (C1), gender (C2), first-generation status (C3), level of urbanicity (XC), school-level mean mindset (X1), school achievement level (X2), racial/ethnic composition of the school (X3), poverty concentration of the school (X4), school size (X5), the outcome variable (Y), and the treatment indicator (T). For each unit we have continuous measurements X1, ..., X5, discrete covariates S3, C1, C2, C3 and XC, a continuous outcome variable Y, and a binary treatment indicator T.

4.1 Learning a Distance Metric and Match Group Quality Analysis

We use a distance metric of the form defined in Equation (2) below, where MALTS learns a scaling on the exact-matching distance for discrete covariates and a stretch on the Euclidean distance for the continuous covariates:

$$\begin{aligned} \mathrm{distance}_L(a, b) = {} & L_{1,1}^2 \mathbf{1}[S3(a) \neq S3(b)] + L_{2,2}^2 \mathbf{1}[C1(a) \neq C1(b)] + L_{3,3}^2 \mathbf{1}[C2(a) \neq C2(b)] \\ & + L_{4,4}^2 \mathbf{1}[C3(a) \neq C3(b)] + L_{5,5}^2 \mathbf{1}[XC(a) \neq XC(b)] + \sum_{j=1}^{5} L_{j+5,j+5}^2 (X_j(a) - X_j(b))^2. \end{aligned} \qquad (2)$$

MALTS requires a training set and an estimation set, and so we randomly partition the data into these two components: 5% is reserved for training, while the rest is used for estimation of conditional average treatment effects (CATEs). MALTS uses a non-gradient-based optimization method to find the optimal L that minimizes (or approximately minimizes) the quadratic loss in Equation (1). The optimum for one run is presented in Table 1, with each element of the vector describing the relevant diagonal entry. Exact matching on S3 appears to be important: as we can interpret from the learned L and the distance metric, a mismatch incurs a cost of at least 17.81 (equal to $L_{1,1}^2$); in comparison, being matched to an individual of the wrong race or ethnicity only costs 1.21 (equal to $L_{2,2}^2$). These costs translate directly to the quality of matches that we can find for each unit in our estimation set. This means S3 is important in determining the outcome for a control unit, $Y_c$, and the outcome for a treated unit, $Y_t$, jointly.

Table 1: Stretch Values: Diagonal entries of the L matrix learned using MALTS.

Covariate                        S3      C1      C2      C3      XC      X1      X2      X3      X4      X5
Index in L matrix               (1,1)   (2,2)   (3,3)   (4,4)   (5,5)   (6,6)   (7,7)   (8,8)   (9,9)  (10,10)
Corresponding L value            4.22    1.1     0.65    0.04    2.78    2.56    0.32    1.92    0.64    1.83
Corresponding L^2 value         17.808   1.210   0.422   0.002   7.728   6.554   0.102   3.686   0.409   3.349
Relative importance              100    26.02   15.39    0.89   65.87   60.52    7.52   45.52   15.22   43.24
  ((value / max value) * 100)

Figure 1 (a) shows that as the level of S3 increases, so does the value of Y . To evaluate whether we have sufficient data to make statements about conditional treatment effects for the different levels of S3 we consider the distribution of diameters of matched groups


Figure 1: Trends for S3 and Diameter: (a) Variation in values of Y for different levels of self-reported expectation of success (S3). (b) Variation in diameter of matched groups for different levels of self-reported expectation of success (S3). Levels 1 and 2 do not have high-quality matched groups. (c) Histogram of the diameter of matched groups in estimation set. Groups whose diameter is too large could potentially be omitted from the analysis.

for each of the levels of S3. Again, the diameter is defined as the maximal distance to one of the nearest neighbors of each unit. Figure 1 (b) demonstrates that individuals at Levels 1 and 2 of S3 are generally not well matched. While it is true that these are the lowest-frequency levels of S3, the matched-group diameter provides a deeper insight: we are likely to have bad matches if we fail to match exactly on them, while if we match exactly on these levels we appear to pay a large penalty for failing to match on other covariates. This is evidence of non-overlap, and in our analysis it is important to prune or down-weight such bad matched groups to avoid poor performance of the estimator. Figure 1 (c) shows the histogram of diameters of all matched groups for units in our estimation set. If we take the tightest 75% or 50% of the matched groups, then we need to prune groups with diameter greater than 2.79 or 1.21, respectively.

4.2 Heterogeneity of Treatment Effect

For different levels of pruning based on the diameter of matched groups, Figure 2 shows the variation of CATEs across the levels of S3. Analyzing Figure 2 (a), there appears to be an initial decrease and then an increase in average CATEs across the levels of S3; however, from the match quality analysis we know that matches for Levels 1 and 2 are not very reliable, so this non-monotonicity could be misleading. Figure 2 (b) shows the trend after we prune at the 75th percentile, where Levels 1 and 2 have been completely removed from the matched-group analysis. We see that the trend in the CATEs across levels of S3 dissipates. A similar analysis can also be performed for the 50th-percentile pruned dataset, as shown in Figure 2 (c).

Following an evaluation similar to that for S3, we observe that exact matching on urbanicity (XC) appears important. Since S3 and XC have many levels, we present the joint variability of CATEs in Figures 3 (a), (b) and (c) after no pruning on the diameter, pruning at the 75% level, and pruning at the 50% level. It is evident that pruning removes spurious matched groups, allowing us to evaluate underlying trends. One trend we can infer from the plot is that a low level of urbanicity (XC) and a high level of self-reported expectation


Figure 2: CATEs for different values of S3: Variation in predicted CATEs for different levels of self-reported expectation of success (S3) for (a) all the samples in estimation set, (b) for samples beneath the 75th percentile of the estimation set based on the diameter of the matched group, (c) for samples beneath the 50th percentile of the estimation set based on the diameter of the matched group.

Figure 3: Treatment Effects with respect to S3 and XC : Variation in predicted CATEs for different levels of self-reported expectation of success (S3) and urbanicity (XC ) for (a) all the samples in the estimation set, (b) for samples beneath the 75th percentile of the estimation set, (c) for samples beneath the 50th percentile of the estimation set based on the diameter of the matched group.

(S3) are associated, on average, with higher individual treatment effects. We also note that, marginally, urbanicity Level 3 exhibits substantially lower CATEs than the rest of the levels.

4.2.1 Who did we match?

Table 2 (a), (b), and (c) highlights three specific examples of a good, a bad, and an “ugly” matched group. The good matched group is tight, with diameter equal to zero. The bad matched group is interesting because for most of the units it is trying to force a match on S3, and hence not making good matches on other covariates. The ugly one is unable to find any sample similar to the query sample. During estimation, the latter two matched groups may be pruned, as the poor-quality matches are likely to produce low-quality estimates.

Table 2: Example Matched Groups: The Good, The Bad, The Ugly. (a) Example of a good matched group produced by MALTS, with diameter equal to zero (exact match on all covariates); for the query point, MALTS finds 10 nearest neighbors in the control group and 10 nearest neighbors in the treatment group, all of whom are exactly matched, and we report only the averaged outcomes over the 10 matches in each group. (b) Example of a bad matched group (diameter 19.1) and (c) an ugly matched group (diameter 21.5) produced by MALTS.


4.2.2 Evaluating the treatment effects

The average treatment effect (ATE), that is, the expected difference E[Y(t) − Y(c)] between the outcome under treatment, Y(t), and the outcome under control, Y(c), estimated above is not sensitive to pruning or weighting of the matched groups based on the diameter of the matched group. Table 3 illustrates this behavior over three different cases: two different levels of pruning and an exponential weighting of the matched groups. The ATE estimate always lies in the interval from 0.257 to 0.261.

Table 3: Average treatment effect (ATE) estimates on the full estimation set, the 75th-percentile estimation set, the 50th-percentile estimation set, and with each matched group in the estimation set weighted exponentially by $w_i = e^{-0.1 \times \mathrm{diameter}(x_i)}$.

                         Estimated ATE (µ̂)   Standard Error (σ̂/√n)
Unpruned                 0.2570               0.00253
75th Percentile          0.2604               0.00292
50th Percentile          0.2599               0.00353
Exponentially Weighted   0.2575               0.00219
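To make the pruning and exponential weighting schemes in Table 3 concrete, here is a minimal R sketch. It is purely illustrative: the data frame mg and its columns cate (a unit-level effect estimate) and diameter (the matched-group diameter) are assumed names, not part of the authors' MALTS implementation.

```r
## Illustrative sketch only; `mg`, `cate`, and `diameter` are hypothetical names.
prune_ate <- function(mg, q) {
  # keep only the tighter matched groups (diameter at or below the q-th quantile)
  keep <- mg$diameter <= quantile(mg$diameter, q)
  mean(mg$cate[keep])
}

weighted_ate <- function(mg) {
  # exponential down-weighting of loose matched groups: w_i = exp(-0.1 * diameter(x_i))
  w <- exp(-0.1 * mg$diameter)
  sum(w * mg$cate) / sum(w)
}

# prune_ate(mg, 0.75)   # ATE from matched groups below the 75th diameter percentile
# prune_ate(mg, 0.50)   # ... below the 50th percentile
# weighted_ate(mg)      # exponentially weighted ATE
```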

Following our analysis of matched groups, we investigate the "Goldilocks" hypothesis on the X2 variable: that the treatment effect is higher for schools with average achievement than for schools with low or high achievement. Figure 4 (a) shows that the match quality for low levels of school achievement is poor and, in fact, it might be hard to make meaningful statements about matched groups with diameter greater than 4. While eliminating these bad matched groups allows us to comment on X2, we note from Figure 4 (b) that this pruning removes all large and small values of X1 (mean fixed mindset). This means that we can primarily comment on schools with middle-mindset individuals and medium to high achievement levels. This is a slight restatement of the Goldilocks hypothesis, as it adds conditions on medium mindset. In Figure 5 (a) we observe a peak approximately at X2 = 1, with the effect decreasing as we move away in either direction, i.e., towards X2 = 2 or X2 = 0. This behavior is consistent for both the unpruned and pruned versions. The pruned version, however, is much smoother and shows that a Goldilocks effect is potentially present, but it may not be large.

We also analyze the variability in CATEs as a function of X1: we observe a sharp transition in trends for schools with middle-level mindset. For these middle-level mindset schools, the treatment effect decreases as the mindset increases. Evaluating X1 and X2 jointly, Figure 5 (c) shows that for high values of both X1 and X2 the variability in treatment effect appears to be large, but this can be due to a lack of data at the extremes; when instead considering middle values of both covariates, treatment effects are more stable.

Lastly, we evaluate the marginal treatment effect for different levels of the categorical variables for race (C1) and urbanicity (XC). As we can infer from Figure 6 (a), the average diameters for all levels are almost the same, so pruning diameters greater than 4 will still yield enough samples to analyze. The average CATEs for all other levels are similar to the ATE or slightly higher, while the average CATE for individuals at Level 3 is lower. This can potentially indicate an effect involving urbanicity at Level 3. Figure 7 shows the marginal treatment effect for gender (C2) and first-generation status (C3). There is little variability


Figure 4: Diameter variation with respect to (a) X2 and (b) X1. Variation in the diameter of matched groups for (a) different values of school-level achievement (X2) and (b) different values of mean fixed mindset (X1). The black curves show trend curves fitted with support vector regression.

Figure 5: X1, X2 versus CATEs: Variation in predicted CATEs for different values of (a) school-level achievement (X2), for all samples in the estimation set and for the samples remaining after pruning matched groups with diameter exceeding 4; both the blue and the orange curves are support vector regression fits with a Gaussian kernel to the CATE as a function of X2. (b) School-level students' fixed mindset (X1), for the unpruned estimation set; the black curve is a support vector regression fit with a Gaussian kernel to the CATE as a function of X1. (c) Contour plot of the variation in predicted CATEs for different values of school-level students' fixed mindset (X1) and school-level achievement (X2) for all matched groups.
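The trend curves in Figures 4 and 5 are described as support vector regression fits with a Gaussian kernel. Below is a minimal R sketch of how such a curve could be produced with the e1071 package; it assumes a hypothetical data frame est with columns X2 and cate and is not the authors' original plotting code.

```r
## Illustrative sketch; `est`, `X2`, and `cate` are assumed names.
library(e1071)

fit  <- svm(cate ~ X2, data = est, kernel = "radial")  # eps-regression with RBF (Gaussian) kernel
grid <- data.frame(X2 = seq(min(est$X2), max(est$X2), length.out = 200))
grid$trend <- predict(fit, newdata = grid)

plot(est$X2, est$cate, col = "grey", pch = 16,
     xlab = "X2 (school-level achievement)", ylab = "Predicted CATE")
lines(grid$X2, grid$trend, lwd = 2)                    # fitted trend curve
```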


Figure 6: Trends for XC : Variation in (a) diameter of matched groups, (b) predicted CATEs, for different levels of urbanicity (XC) after pruning groups with diameter more than 4.

Figure 7: Trends for C2 and C3: Variation in CATEs for (a) different levels of gender in the full estimation set, (b) different levels of gender in the estimation set pruned at diameters greater than 4, (c) different levels of first generation status in the full estimation set, (d) different levels of first generation status in the estimation set pruned at diameters greater than 4.

in average CATEs across the first-generation statuses, while the average CATE for gender Level 2 is, on average, higher than the average CATE for gender Level 1.


5. Conclusion and Discussion

In a nutshell, the average treatment effect estimated by MALTS is approximately 0.25, and some covariates moderate this effect. For urbanicity, XC, we see a different behavior specifically for Level 3, while for the other levels the estimates are fairly consistent. Since matching on S3 appears important, we probably should collect additional data for lower levels of self-reported expectations. Alternatively, it is possible that the scale for this variable simply needs to be recalibrated; perhaps Levels 1 and 2 are not meaningful. The school-level achievement covariate (X2) shows signs of a Goldilocks effect and should be tested further, while the covariate concerning the school-level mean of students' fixed mindset (X1) shows non-monotonicity in the middle region of its distribution. Finally, the treatment effect seems to vary slightly across genders and shows almost no variation with respect to first-generation status.

MALTS makes several important aspects of our analysis easier: understanding the importance of different covariates, uncovering the underlying trends in treatment effect heterogeneity, and interpreting those trends by examining matched groups. It also provides a measure of confidence in the predicted treatment effects based on the diameter of the matched groups. In certain situations, like the one shown in Figure 5, pruning poor-quality matches yields more stable results. We also observe that the ATE estimate from MALTS for this dataset seems to be fairly robust to analyst choices.

Acknowledgments

This work was supported by DHHS, PHS, NIH, and NIBIB under grant 1R01EB025021-01, and also by the Duke Energy Initiative.


Observational Studies 5 (2019) 131-140 Submitted 7/19; Published 8/19

Selective Inference for Effect Modification: An Empirical Investigation

Qingyuan Zhao [email protected] Department of Statistics The Wharton School, University of Pennsylvania Philadelphia, PA 19104, USA

Snigdha Panigrahi [email protected] Department of Statistics University of Michigan Ann Arbor, MI 48109, USA

Abstract We demonstrate a selective inferential approach for discovering and making confident conclusions about treatment effect heterogeneity. Our method consists of two stages. First, we use Robinson's transformation to eliminate confounding in the observational study. Next, we select a simple model for effect modification using lasso-regularized regression and then use recently developed tools in selective inference to make valid inferences for the discovered effect modifiers. We analyze the Mindset Study dataset provided by the workshop organizers and compare our approach with other benchmark methods. Keywords: Lasso, Semiparametric regression, Selective sampler, Variable selection.

1. Methodology and Motivation

1.1 Motivation

In the 2018 Atlantic Causal Inference Conference (ACIC 2018), we were kindly invited to participate in a workshop titled "Empirical Investigation of Methods for Heterogeneity". The workshop organizers provided an observational dataset simulated from the National Study of Learning Mindsets (Mindset Study hereafter) and tasked the participants to analyze how the treatment effect of the mindset intervention varies among students in the study. This workshop, in the words of the organizers, "is not intended to be a 'bake off' but rather an opportunity to understand the strengths and weaknesses of methods for addressing important scientific questions". More specifically, the organizers sought answers to the following three research questions about the Mindset Study:

Question 1: Is the intervention effective in improving student achievement?
Question 2: Do two hypothesized covariates (X1 and X2) moderate the treatment effect?
Question 3: Are there other covariates moderating the treatment effect?

In this report, we attempt to answer these questions using a method proposed in our earlier paper (Zhao et al., 2017), which neatly combines Robinson's transformation

c 2019 Qingyuan Zhao and Snigdha Panigrahi.

(Robinson, 1988) to remove confounding and the recently developed selective inference framework (Taylor and Tibshirani, 2015) to discover and make confident conclusions about effect modifiers (covariates moderating the treatment effect).

Effect modification, or treatment effect heterogeneity, is an old topic in statistics but has gained a lot of attention in recent years, possibly due to the increased complexity of empirical datasets and the development of new statistical learning methods that are much more powerful at discovering effect modification. Though the literature on this topic is massive, an executive summary must include three related but different formulations of this problem:

1. What is the optimal treatment assignment rule for future experimental units?

2. What is the conditional average treatment effect (CATE) as a function of the covariates?

3. What are the potential effect modifiers, and how certain are we about them?

See Zhao et al. (2017) for more discussion and references. It is obvious that the questions of the workshop organizers fall into the third category. In fact, we believe this is quite common in practice. Empirical researchers often want to use observational or experimental data to test existing scientific hypotheses about effect modification, generate new hypotheses, and gather information for intelligent decision making. However, prior to Zhao et al. (2017), the majority of the statistical methods in the third category focused on discovering potential effect modifiers, with little attention targeted towards providing statistical inference (such as confidence intervals for the discovered covariates). When the goal is to calibrate the strengths of effect modifiers in such problems, the researcher often relies on sample splitting, where some of the samples are used for discovery and the remaining samples are used for inference (Athey and Imbens, 2016). However, sample splitting does not optimally utilize the information in the discovery samples and often results in a loss of power. Instead, the selective inference framework described in this paper does not waste any data, as it leverages a conditional approach that only discards the information used in model selection (Lee et al., 2016; Fithian et al., 2014).

1.2 Main method

To introduce the methodology, let us first fix some notation. Let Y be the observed outcome (a continuous measure of academic achievement), Z be the binary intervention (0 for control and 1 for treated), and X = (X1, ..., Xp) be the covariates (p = 10 in the Mindset Study). Furthermore, denote by Y(0) and Y(1) the two potential outcomes, so that Y = Y(Z). We assume throughout the paper that there are no unmeasured confounders, i.e. Y(z) ⊥⊥ Z | X for z = 0, 1. Below we elaborate on the two-step method proposed in Zhao et al. (2017):

Step 1 (Robinson's transformation): Use machine learning methods to estimate µ_z(x) = E[Z | X = x] = P(Z = 1 | X = x) (the "propensity score") and µ_y(x) = E[Y | X = x]. Let the estimates be µ̂_y(x) and µ̂_z(x). In R, there are many off-the-shelf implementations available to learn µ_y(x) and µ_z(x) without any ex ante model specification. It is helpful to use an algorithm called "cross-fitting" in this step for the purpose of

132 Selective inference for effect modification: An empirical investigation

proving theoretical properties (Schick, 1986; Chernozhukov et al., 2018). Cross-fitting is only implemented for the post-workshop analysis. See Section 3.1 for more detail. Notice that it is straightforward to show (see Zhao et al., 2017) that the CATE τ(x) = E[Y (1) − Y (0)|X = x] satisfies

$$E[Y - \mu_y(X) \mid Z, X] = (Z - \mu_z(X))\,\tau(X) \qquad (1)$$
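A short derivation of (1) is sketched below for completeness; it is not spelled out in the text (see Zhao et al., 2017, for the formal argument) and relies only on the stated assumption Y(z) ⊥⊥ Z | X and on Y = Y(Z).

```latex
% Sketch: why E[Y - mu_y(X) | Z, X] = (Z - mu_z(X)) tau(X).
\begin{align*}
E[Y \mid Z, X] &= E[Y(0) \mid X] + Z\,\tau(X) && \text{(consistency and unconfoundedness)} \\
\mu_y(X) = E[Y \mid X] &= E[Y(0) \mid X] + \mu_z(X)\,\tau(X) && \text{(average the previous line over } Z \mid X\text{)} \\
E[Y - \mu_y(X) \mid Z, X] &= (Z - \mu_z(X))\,\tau(X) && \text{(subtract the two lines)}
\end{align*}
```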

Step 2 (Statistical inference): By approximating the CATE using a linear model, τ(x) ≈ β_0 + x^T β, equation (1) implies that

$$Y - \hat\mu_y(X) \approx (Z - \hat\mu_z(X))(\beta_0 + X^T\beta) + \text{approximation error} + \text{noise}.$$

This motivates us to treat Ỹ = Y − µ̂_y(X) as the (transformed) response and X̃ = (Z − µ̂_z(X))X as the (transformed) predictors. We can then use different specifications of τ(x) to answer the three questions posed by the workshop organizers:

Step 2.1 (answering Question 1): Model the CATE by just an intercept term: τ(x) ≈ β_0. In R, we can report the results of the linear regression lm(Ỹ ∼ Z̃), where Z̃ = Z − µ̂_z(X).

Step 2.2 (answering Question 2): Suppose X_M are the hypothesized effect modifiers (X1 and X2 in the Mindset Study). We can model the CATE by an intercept and X_M: τ(x) ≈ β_0 + X_M^T β_M. The coefficient β_M can be interpreted as the coefficient in the best linear approximation to the actual τ(x). More precisely, it is defined as (see Zhao et al., 2017):

$$(\beta_0, \beta_M) = \arg\min\; E_n\!\left[(Z - \mu_z(X))^2\,\big(\tau(X) - \beta_0 - X_M^T \beta_M\big)^2\right], \qquad (2)$$

where E_n stands for averaging over the n samples. In R, we can report the results of the linear regression lm(Ỹ ∼ Z̃ + Z̃:X_M).

Step 2.3 (answering Question 3): Use the lasso-regularized regression of Tibshirani (1996) (or potentially other automated variable selection methods) to select a subset of covariates M̂ ⊆ {1, 2, . . . , p}. More specifically, M̂ contains the positions of the non-zero entries of the solution to the following problem:

$$\underset{\beta_0,\,\beta}{\text{minimize}}\;\; E_n\!\left[\big(\tilde{Y} - \tilde{Z}(\beta_0 + \beta^T X)\big)^2\right] + \lambda \lVert \beta \rVert_1. \qquad (3)$$

Then we can use the existing selective inference methods to make inference about the linear submodel τ(x) ≈ β_0 + x_M̂^T β_M̂ that is selected using the data. The estimand (β_0, β_M̂) is defined in the same way as (2) by treating M̂ as fixed. The central idea behind the selective inferential methods is to base inference upon a conditional likelihood that truncates the usual (pre-selection) likelihood to the realizations of data that can lead to the same selection event. Lee et al. (2016) proposed the first method along this conditional perspective to overcome the bias encountered in inferring about a data-adaptive target. Assuming Gaussian noise in a linear regression setting, Lee et al. (2016) derived a pivotal statistic that

133 Zhao and Panigrahi

can be computed in closed form and has a truncated Gaussian law for a class of polyhedral selection rules including the lasso (3). An implementation of this method can be found in the selectiveInference R package (Tibshirani et al., 2017). In principle, more sophisticated selective inference methods can be used in this step too; we explore them in Section 3.

Compared to other methods, the approach outlined above has several appealing properties. First, the nuisance parameters, µ_z(x) and µ_y(x), are estimated by flexible machine learning methods. Because Robinson's transformation is used, each nuisance parameter only needs to be estimated at a rate faster than n^{-1/4} to ensure asymptotic validity of the non-selective or selective inference in Step 2 (Zhao et al., 2017). This echoes the suggestion of combining machine learning methods and doubly robust estimation by van der Laan and Rose (2011) and Chernozhukov et al. (2018). Second, all the scientific questions raised by the workshop organizers can be answered in the same manner; the data analyst only needs to change the specification of the model for τ(x). Third, when answering Question 3, an effective variable selection procedure (such as the lasso) can often find an interpretable model that includes most of the important effect modifiers. Selective inference can then provide valid statistical significance and confidence intervals for the selected effect modifiers. Lastly, the implementation of this procedure is straightforward by harnessing existing software for machine learning and selective inference. We refer the reader to Zhao et al. (2017) for a more detailed discussion of the strengths and weaknesses of our approach.

1.3 Alternative methods

To provide a more comprehensive picture of our selective inference approach (referred to as method "lasso" below), we decided, before seeing any real data in the Mindset Study, that we would also use four benchmark methods considered in the applied example in Zhao et al. (2017). These alternative methods are:

Method "naive": This method simply fits a linear model with all the treatment-by-covariate interactions (and, of course, all the main effects). In R, we can simply use lm(Y ∼ Z ∗ X), which is equivalent to lm(Y ∼ Z + X + Z : X). To investigate effect modification, we can just report results for the interactions. This method is called "naive" because the linear model may be misspecified and may be insufficient for removing confounding.

Method "marginal": After Robinson's transformation (Step 1 above), this method fits univariate linear regressions lm(Ỹ ∼ Z̃ + Z̃ : Xj) for j = 1, . . . , p. This is a special case of Step 2.2 with fixed model M = {j}.

Method "full": After Step 1, this method fits a full linear model lm(Ỹ ∼ Z̃ + Z̃ : X). This is a special case of Step 2.2 with fixed model M = {1, 2, . . . , p}.

Method "snooping": This method is similar to method "lasso" except for the very last step. Instead of selective inference, it directly reports the results of lm(Ỹ ∼ Z̃ + Z̃ : X_M̂), treating M̂ as given rather than learned from the data. This method is used as a straw man to illustrate that ignoring model selection (a.k.a. "data snooping") may lead to over-confident inference.


2. Workshop results

2.1 Implementation details

In our workshop analysis, we used the random forest (Breiman, 2001) to estimate the nuisance parameters in Step 1. In particular, we used the "honest" forest implementation in the grf package (Athey et al., 2018) with tune.parameters = TRUE (so some parameters will be tuned by cross-validation) and all other options set to their defaults. In Step 2, categorical covariates are transformed to dummy variables. For example, XC (with five levels: 0, 1, 2, 3, 4) is transformed to XC-1, XC-2, XC-3, XC-4. In Step 2.3, we used the theoretical value λ = 1.1 × E[‖X̃^T ε‖_∞] (Negahban et al., 2012) for model selection, where ε is the vector of noise in the outcome. In the real data analysis, λ is computed by simulating ε i.i.d. ∼ N(0, σ̂²), where σ̂² is the estimated noise level; see Lee et al. (2016). We then used the fixedLassoInf function in the selectiveInference package to make the selective inference. Details of the implementation can be found in the supplementary R markdown file.
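To make these choices concrete, the following is a minimal R sketch of the workflow described above. It is an illustration under assumed names (a data frame df with outcome Y, treatment Z, and a vector covariate_names of covariate columns), not the authors' supplementary code; in particular, the grf tuning argument has changed name across package versions.

```r
## Illustrative sketch of Steps 1-2.3; `df` and `covariate_names` are assumed names.
library(grf)                 # honest regression forests for the nuisance functions
library(glmnet)              # lasso for selecting candidate effect modifiers
library(selectiveInference)  # fixedLassoInf for post-selection inference

X <- model.matrix(~ . - 1, data = df[, covariate_names])  # dummy-code categoricals
Y <- df$Y
Z <- df$Z
n <- nrow(X)

## Step 1: estimate mu_y(x) = E[Y | X] and mu_z(x) = E[Z | X]
forest_y <- regression_forest(X, Y, tune.parameters = "all")  # TRUE in older grf versions
forest_z <- regression_forest(X, Z, tune.parameters = "all")
Y_tilde  <- Y - predict(forest_y)$predictions   # out-of-bag predictions
Z_tilde  <- Z - predict(forest_z)$predictions

## Step 2.1: overall (weighted) treatment effect
summary(lm(Y_tilde ~ Z_tilde))

## Step 2.2: pre-specified moderators X1 and X2
x1 <- X[, "X1"]; x2 <- X[, "X2"]
summary(lm(Y_tilde ~ Z_tilde + Z_tilde:x1 + Z_tilde:x2))

## Step 2.3: lasso selection on the interactions, then selective inference
X_int     <- Z_tilde * X                        # transformed predictors
sigma_hat <- estimateSigma(X_int, Y_tilde)$sigmahat
# Monte Carlo approximation of lambda = 1.1 * E[ || X_int' eps ||_inf ]
lam <- 1.1 * mean(replicate(200,
         max(abs(t(X_int) %*% rnorm(n, 0, sigma_hat)))))

gfit <- glmnet(X_int, Y_tilde, standardize = FALSE, thresh = 1e-10)
beta <- coef(gfit, s = lam / n, exact = TRUE, x = X_int, y = Y_tilde)[-1]
fixedLassoInf(X_int, Y_tilde, beta, lam, sigma = sigma_hat)
```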

2.2 Results

By simply specifying an intercept term for τ(x), as described in Step 2.1, the (weighted) average treatment effect is estimated to be 0.256 with confidence interval [0.235, 0.277] (this does not exactly estimate the average treatment effect because of the regression setup; see equation (2) above). Thus the mindset intervention is indeed effective.

Our results for effect modification are summarized in Figure 1. Notice that although all the methods are plotted in the same figure for ease of visualization, they may be fitting different linear approximations to τ(x), and the coefficients for the same covariate may have different meanings. Several covariates (X1, X5, XC-4) are significant using method "marginal" but non-significant using method "full", indicating that they may be correlated with the actual effect modifier(s). We find the full model difficult to interpret because it includes all the covariates. The lasso-regularized regression selects two covariates, X1 and XC-3, as potential effect modifiers, and the application of selective inference shows that XC-3 is statistically significant even after adjusting for the model selection. In contrast, the "snooping" inference that ignores the bias from model selection would incorrectly declare that X1 is also statistically significant.

To summarize, our workshop analysis suggests that X1 is possibly an effect modifier, but more data may be needed before a decisive conclusion can be made; X2 does not moderate the treatment effect; and XC-3 is an important effect modifier that the data support. In fact, with selective inference, we are able to estimate the strength of the effect modifier XC-3 through both interval and point estimates.

3. Post-workshop analysis

3.1 More advanced methods

A major objection to the polyhedral pivot in Lee et al. (2016) is that the selective confidence intervals are often excessively long. For example, in Figure 1 (method "lasso"), the confidence interval for X1 is very asymmetric: most of the confidence interval lies above 0 but the point estimate is in fact negative. A more radical example of this kind can be found


[Figure 1: residue of the plot is omitted. The figure has one panel per method (labeled Naive, Marginal, Full, Lasso, and a fifth panel labeled NA in the extracted text); the vertical axis is "Effect modification" and the horizontal axis lists the covariates S3, X1-X5, the C1, C2, C3 dummies, and the XC dummies.]

Figure 1: Workshop results: This figure plots the 95% confidence intervals of effect modification by the covariates (red solid intervals do not cover 0).

in Table 1 below. This problem is due to the ill behavior of the polyhedral pivot when the observed data lie close to the selection boundary. Such a phenomenon was observed in the original article by Lee et al. (2016). More recently, Kivaranovic and Leeb (2018) have proven that the expected length of a selective confidence interval constructed this way is infinite.

3.1.1 Randomized response

To mitigate this problem, Tian and Taylor (2018) proposed to randomize the response before model selection, thereby smoothing out the selection boundary. This also allows the statistician to reserve more information in the data during the selection stage, leading


to increased power in the inference stage. With a moderate amount of injected noise, the increase in inferential power also does not compromise the ability of model selection. In the effect modification problem, this "randomized lasso" algorithm can be directly applied in Step 2.3 by replacing (3) with the following optimization problem:

$$\underset{\beta_0,\,\beta}{\text{minimize}}\;\; E_n\!\left[\big(\tilde{Y} - \tilde{Z}(\beta_0 + \beta^T X)\big)^2\right] + \lambda \lVert \beta \rVert_1 - \omega^T \beta, \qquad (4)$$

where ω ∼ N(0, η² I_p) is the injected Gaussian noise. Note that the randomized lasso has two tuning parameters, one being the amount of ℓ1-penalty λ and the other being the amount of injected noise, which is measured by η². In our analysis below we use the same penalty λ as before and set η² = σ̂²/2.

The polyhedral lemma of Lee et al. (2016) no longer applies to the randomized lasso because selection now depends on both the data and the injected noise ω. To construct selective confidence intervals after the randomized lasso, Tian Harris et al. (2016) proposed to use Monte Carlo methods and developed a general selective sampler to sample realizations of data truncated to the randomized selection region. To obtain a point estimate of the coefficient, Panigrahi et al. (2016) and Panigrahi and Taylor (2018) introduced the "selection-adjusted" maximum likelihood estimate (selective MLE) that maximizes the conditional likelihood given the selection event. These latest selective inference methods are implemented as Python software available at https://github.com/selective-inference/Python-software.

3.1.2 Switching the target of selective inference

In our workshop analysis, the target of selective inference is the partial regression coefficient β_M̂ defined in (2). Alternatively, one might be interested in the full regression coefficient (β_{1,2,...,p})_M̂, which contains the entries of β_{1,2,...,p} that correspond to the selected covariates X_M̂. In other words, instead of targeting all the full regression coefficients as in method "full" above, this approach focuses only on certain selected entries. The selective inference framework in Lee et al. (2016) and Tian Harris et al. (2016) can be effortlessly applied to full regression coefficients because they, like partial regression coefficients, can be written as linear functions of the underlying parameters (in our case τ(x)).

3.1.3 Cross-fitting

Cross-fitting (Schick, 1986; Chernozhukov et al., 2018) is a general algorithm in semiparametric inference to eliminate the dependence of the nuisance parameter estimates on the corresponding data point (e.g. the dependence of µ̂_z(X_i) on Z_i). In our case, it simply amounts to splitting the data into two halves and estimating µ_z(X_i) and µ_y(X_i) in Step 1 using models trained on the half of the data that does not contain the i-th data point. We implemented this algorithm for our post-workshop analysis. Cross-fitting is useful for proving theoretical properties of the semiparametric estimator; in practice we rarely find that its use drastically changes the results.
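A minimal two-fold cross-fitting sketch in R is shown below, continuing the assumed names X, Y, Z from the earlier sketch; it is an illustration, not the authors' implementation.

```r
## Illustrative two-fold cross-fitting for the nuisance estimates (assumed names X, Y, Z).
library(grf)

set.seed(1)
fold <- sample(rep(1:2, length.out = nrow(X)))   # random split into two halves
mu_y_hat <- mu_z_hat <- numeric(nrow(X))

for (k in 1:2) {
  train <- fold != k
  test  <- fold == k
  # fit the nuisance models on the half that excludes the held-out points ...
  fy <- regression_forest(X[train, , drop = FALSE], Y[train])
  fz <- regression_forest(X[train, , drop = FALSE], Z[train])
  # ... and predict on the held-out half
  mu_y_hat[test] <- predict(fy, X[test, , drop = FALSE])$predictions
  mu_z_hat[test] <- predict(fz, X[test, , drop = FALSE])$predictions
}

Y_tilde <- Y - mu_y_hat   # Robinson-transformed response
Z_tilde <- Z - mu_z_hat   # Robinson-transformed treatment
```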

3.2 Results

Table 1 shows the post-workshop analysis results. There are in total four analyses, targeting partial or full coefficients and using the polyhedral pivot for the lasso or the selective sampler for


Table 1: Results of different selective inference methods in the Mindset Study dataset.

Target   Selective inference method    Covariate   Estimate   CI                  p-value
Partial  Lasso + polyhedral pivot      C1-11        0.223     [−∞, −2.164]        0.005
                                       XC-3        −0.141     [−0.156, ∞]         0.234
                                       X1          −0.025     [−0.042, 0.203]     0.736
         Randomized lasso + sampler    S3          −0.013     [−0.060, 0.017]     0.908
                                       C1-4        −0.000     [−0.106, 0.082]     0.675
                                       C1-11        0.121     [−0.339, 0.428]     0.400
                                       XC-3        −0.151     [−0.265, −0.045]    0.004
                                       X4           0.002     [−0.063, 0.046]     0.596
Full     Lasso + polyhedral pivot      C1-11        0.284     [−∞, −2.103]        0.005
                                       XC-3        −0.148     [−0.180, 4.257]     0.256
                                       X1          −0.031     [−4.345, 0.564]     0.872
         Randomized lasso + sampler    S3          −0.011     [−0.051, 0.016]     0.214
                                       C1-4         0.061     [−0.092, 0.180]     0.185
                                       C1-11        0.180     [−0.375, 0.505]     0.568
                                       XC-3        −0.139     [−0.293, 0.002]     0.052
                                       X4           0.011     [−0.046, 0.053]     0.819

randomized lasso. Thus the first analysis in Table 1 (lasso + polyhedral pivot) is the same as method "lasso" in Figure 1, except that we used cross-fitting here. The randomized lasso selects three more covariates in the post-workshop analysis; this is typically the case due to the injected noise. However, none of the selected covariates besides XC-3 is statistically significant in the post-selection inference, suggesting that they are probably not effect modifiers. The biggest advantage of using the randomized lasso and selective sampler is the shorter selective confidence intervals (CIs). For example, for XC-3 the CI is reduced from [−0.156, ∞] to [−0.265, −0.045]. A careful reader might have noticed that in the first row of Table 1, the naive point estimate for C1-11, obtained by regressing Y on the selected covariates (C1-11, XC-3, and X1), is not covered by the CI. This can happen if the data lie very close to the decision boundary; see Lee et al. (2016, Fig. 5). The selective MLE point estimates (for the randomized lasso) are always covered by the CIs and close to the centers of the CIs in Table 1. Switching the inferential target from partial coefficients to full coefficients does not seem to change the results by much. This is likely due to the lack of strong effect modifiers and the lack of dependence between the covariates. In the full model, the covariate XC-3 is not significant at level 0.05. One possible explanation is that using a selected model often adds power to the analysis when the data can be accurately described by a sparse model (as opposed to fitting a full model). These observations demonstrate the practical benefits of using the randomized lasso and the selective MLE.

4. Discussion

In this paper we have presented a comprehensive yet transparent approach based on Zhao et al. (2017) to analyze treatment effect heterogeneity in observational studies. The same procedure can be applied to randomized experiments as well, and Zhao et al. (2017) has shown that in this case it is sufficient to estimate µy consistently in order for the polyhedral


pivot to be asymptotically valid. The proposed procedure can be easily implemented using existing machine learning packages (to estimate µ_z and µ_y) and selective inference software. The R and Python code for our analyses are attached with this report.

We want to re-emphasize some points made in Zhao et al. (2017) about when selective inference is a good approach for analyzing effect modification. Compared to classical statistical analysis, the selective inference framework makes it possible to use the same data to generate new scientific questions and then answer them. This is not useful for inference of the average treatment effect, because that is a deterministic quantity independent of any model selection. However, selective inference can be tremendously useful for effect modification, especially when the analyst wants to discover effect modifiers using the data and make some confident conclusions about their effect sizes. We believe that this is indeed the motivation behind the workshop organizers' Question 3, making selective inference a very appealing choice for analyzing datasets like the Mindset Study. On the other hand, since part of the information in the data is reserved for post-selection inference, the selective inference framework is sub-optimal for making predictions (in our case, estimating τ(x)).

There is a long list of literature on estimating the optimal treatment regime or the CATE from data. This has become a hot topic recently due to the availability of flexible machine learning methods. We refer the reader to Zhao et al. (2017) for some references in this direction. When prediction accuracy is the foremost goal, these machine learning methods should be preferred to selective inference.

Berk et al. (2013) proposed an alternative post-selection procedure that constructs universally valid confidence intervals regardless of the model selection algorithm. However, this may be overly conservative when the selection algorithm is pre-specified by the data analyst (for example, the lasso with a fixed λ). Small (2018) discussed connections of this alternative approach to observational studies.

The application to effect modification also suggests new research directions for selective inference. For example, during the workshop several participants attempted to describe the effect modification using decision trees. Results presented in this way are easy to interpret and may have immediate implications for decision making. With the nodes and cutoffs selected in a data-adaptive fashion, this poses yet another post-selection inference problem. Reserving a hold-out data set for a confirmatory analysis of the effects may lead to a loss of power that can potentially be avoided with selective inference. Obtaining optimal inference after exploration via regression trees is an interesting direction for future work.

References Athey, S., Tibshirani, J., and Wager, S. (2018). Generalized random forests. Annals of Statistics, to appear.

Athey, S. C. and Imbens, G. W. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360.

Berk, R., Brown, L., Buja, A., Zhang, K., and Zhao, L. (2013). Valid post-selection inference. Annals of Statistics, 41(2):802–837.

Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.


Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68.

Fithian, W., Sun, D., and Taylor, J. (2014). Optimal inference after model selection. arXiv:1410.2597.

Kivaranovic, D. and Leeb, H. (2018). Expected length of post-model-selection confidence intervals conditional on polyhedral constraints. arXiv:1803.01665.

Lee, J. D., Sun, D. L., Sun, Y., and Taylor, J. E. (2016). Exact post-selection inference, with application to the lasso. Annals of Statistics, 44(3):907–927.

Negahban, S., Yu, B., Wainwright, M. J., and Ravikumar, P. K. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557.

Panigrahi, S. and Taylor, J. (2018). Scalable methods for Bayesian selective inference. Electronic Journal of Statistics, 12(2):2355–2400.

Panigrahi, S., Taylor, J., and Weinstein, A. (2016). Bayesian post-selection inference in the linear model. arXiv:1605.08824.

Robinson, P. M. (1988). Root-n-consistent semiparametric regression. Econometrica, 56(4):931–954.

Schick, A. (1986). On asymptotically efficient estimation in semiparametric models. Annals of Statistics, 14(3):1139–1151.

Small, D. S. (2018). Larry Brown: Remembrance and connections of his work to observational studies. Observational Studies, 4:250–259.

Taylor, J. and Tibshirani, R. J. (2015). Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629–7634.

Tian, X. and Taylor, J. (2018). Selective inference with a randomized response. Annals of Statistics, 46(2):679–710.

Tian Harris, X., Panigrahi, S., Markovic, J., Bi, N., and Taylor, J. (2016). Selective sampling after solving a convex problem. arXiv:1609.05609.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288.

Tibshirani, R., Tibshirani, R., Taylor, J., Loftus, J., and Reid, S. (2017). selectiveInference: Tools for Post-Selection Inference. R package version 1.2.4. Available from: https://CRAN.R-project.org/package=selectiveInference.

van der Laan, M. J. and Rose, S. (2011). Targeted Learning. Springer.

Zhao, Q., Small, D. S., and Ertefaie, A. (2017). Selective inference for effect modification via the lasso. arXiv:1705.08020.
