
Causal inference under outcome-based sampling with monotonicity assumptions

Sung Jae Jun* Department of Economics, Pennsylvania State University and Sokbae Lee Department of Economics, Columbia University and IFS

June 25, 2021

Abstract

We study causal inference under case-control and case-population sampling. For this purpose, we focus on the binary-outcome and binary-treatment case, where the parameters of interest are causal relative and attributable risk defined via the potential outcome framework. It is shown that strong ignorability is not always as powerful as it is under random sampling and that certain monotonicity assumptions yield comparable results in terms of sharp identified intervals. Specifically, the usual odds ratio is shown to be a sharp identified upper bound on causal relative risk under the monotone treatment response and monotone treatment selection assumptions. We then discuss averaging the conditional (log) odds ratio and propose an algorithm for semiparametrically efficient estimation when averaging is based only on the (conditional) distributions of the covariates that are identified in the data. We also offer algorithms for causal inference if the true population distribution of the covariates is desirable for aggregation. We show the usefulness of our approach by studying two empirical examples from social sciences: the benefit of attending private school for entering a prestigious university in Pakistan and the causal relationship between staying in school and getting involved with drug-trafficking gangs in Brazil.

arXiv:2004.08318v3 [econ.EM] 9 Aug 2021

Keywords: Relative risk; attributable risk; odds ratio; partial identification; semiparametric efficiency; sieve estimation.

*The authors would like to thank Guido Imbens, Chuck Manski, Francesca Molinari and seminar participants at Cemmap, Oxford, and Penn State for helpful comments and Leandro Carvalho and Rodrigo Soares for sharing their dataset and help. This work was supported in part by the European Research Council (ERC-2014-CoG-646917-ROMIA) and by the UK Economic and Social Research Council (ESRC) through research grant ES/P008909/1 to the CeMMAP.

1 Introduction

Random sampling from the population of interest provides a convenient setup for causal inference, but it may be overly expensive to implement in practice for various reasons. For instance, some events are so rare that they are unlikely to be observed at random and therefore they are likely to be severely under-represented in a random sample. Examples include cancer (Breslow & Day 1980), infant death (Currie & Neidell 2005), consumer bankruptcy (Domowitz & Sartain 1999), entering a highly prestigious university (Delavande & Zafar 2019), and drug trafficking (Carvalho & Soares 2016), among many others. The main focus of this paper is to study causal inference when only an outcome-based sample is available. Specifically, data are collected by observing individuals that are randomly drawn from those to whom the event of interest is already known to have occurred, while the comparison group data are obtained similarly by focusing on those to whom the event did not occur (hence, case-control design; see, e.g., Breslow (1996)), or simply from an auxiliary source on the entire population without knowing their case status (as a consequence, case-population design; see, e.g., Lancaster & Imbens (1996)). Both sampling designs can be analyzed in a unified outcome-based sampling framework called Bernoulli sampling (see, e.g., Breslow et al. (2000)). Therefore, our objective is to conduct causal inference under Bernoulli sampling. We focus on the binary-outcome and binary-treatment case while data are only observational, as opposed to experimental. In a lesser-known paper, Holland & Rubin (1988) adopt the potential outcome framework to illustrate how to use the assumption of strong ignorability (which implies conditional independence between potential outcomes and treatment given covariates) to identify the counterfactual odds ratio in case-control studies.
They then argue that the counterfactual odds ratio approximates the ratio of two potential-outcome probabilities (i.e., causal relative risk) under the rare-disease assumption. Their work is the starting point of this paper. We share their motivation that “retrospective studies are never randomized.”

However, strong ignorability is often considered too demanding for observational data; furthermore, in case-control studies, without knowledge of the true probability of the case in the population, strong ignorability without the rare-disease assumption is generally not sufficient to point identify the causal relative risk given the covariates. In this paper, we develop new identification results from the perspective of partial identification (see, e.g., Manski 1995, 2003, 2009, Tamer 2010, Molinari 2020, among others), without resorting to strong ignorability or the rare-disease assumption. In particular, we adopt the assumptions of monotone treatment response (Manski 1997, MTR hereafter) and monotone treatment selection (Manski & Pepper 2000, MTS hereafter) and show that those assumptions enable us to bound causal parameters in a meaningful way. The MTR and MTS assumptions as well as other related notions of monotonicity have been extensively used for causal inference. See e.g. Bhattacharya et al. (2008, 2012), Pearl (2009), VanderWeele & Robins (2009), Kuroki et al. (2010), Kreider et al. (2012), Jiang et al. (2014), Okumura & Usui (2014), Choi (2017), Kim et al. (2018), and Machado et al. (2019), among others. In particular, Kuroki et al. (2010) consider the MTR assumption to bound causal relative risk, which we will discuss further at the end of this section. We investigate partial identification of adjusted causal relative and attributable risk, that is, the ratio of two counterfactual proportions (conditional on covariates) and the difference between them. We show that strong ignorability is not always as powerful as it is under random sampling and that the MTR and MTS assumptions yield comparable results in terms of sharp identified intervals. Specifically, the odds ratio is shown to be a sharp upper bound on causal relative risk.
Therefore, the odds ratio plays a central role for causal inference under the MTR and MTS assumptions, which, unlike strong ignorability, allow for endogenous treatment assignment. We apply the same approach to causal attributable risk, although the sharp bounds are more complicated than the odds ratio.

In addition to new identification results, we develop easy-to-implement inference algorithms for causal relative and attributable risk. Instead of parametrizing the functional causal parameters, we propose to estimate a scalar parameter that is constructed by averaging them. Since the odds ratio is not only a popular measure of association in case-control studies but also a sharp upper bound on causal relative risk, we pay special attention to semiparametrically efficient estimation of the expectation of the log odds ratio using the conditional distributions of the covariates that are identified in the data. Efficient estimation of the (log) odds ratio is a classical topic, but most papers take a parametric approach with a focus on robustness: see, e.g., K. Chen (2001), Tchetgen Tchetgen et al. (2010), and Tchetgen Tchetgen (2013). Our approach of targeting the nonparametrically aggregated versions of the odds ratio, which can be efficiently estimated via sieves (e.g., X. Chen (2007)), offers a natural alternative. We derive the semiparametric efficiency bound for the aggregated log odds ratio and then propose an easy-to-implement sieve-based estimation algorithm. We also offer algorithms for causal inference if the true population distribution of the covariates is desirable for aggregation. Causal inference procedures are presented in detail in section 7, and an accompanying R package is available on the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=ciccr. We revisit the datasets studied by Delavande & Zafar (2019) and Carvalho & Soares (2016) to illustrate how our methods can be used in case-control and case-population studies, respectively. The first example uses data based on surveys of the students who entered a very selective university in Pakistan and those who did not, whereas the second example combines a survey of drug-trafficking gangs in Rio de Janeiro with the 2000 Brazilian Census. Therefore, they correspond to the case-control and case-population designs, respectively.
Using these datasets, we address new research questions that are not examined in the original papers: (i) the causal effect of attending private school on entering a very selective university and (ii) the causal impact of not staying in school on joining a criminal gang. In both of the examples, strong ignorability is implausible, but the MTR and MTS assumptions can be reasonably well defended. Our results show that attending private high school increases the chance of entering a prestigious university only by a factor of 1.21 at the maximum: therefore, there is no substantial causal benefit here. Also, our analysis of the Brazilian data shows that those who are not in school may have a higher chance of joining a criminal gang by a factor of 15 in the worst case.

We now discuss the relation to the existing literature on causal inference and outcome-based sampling. Månsson et al. (2007) point out that the propensity score method has only limited ability to control for potential confounding factors in case-control studies. Our method relies neither on strong ignorability nor on the propensity score. Rose (2011) and Van der Laan & Rose (2011) propose a method of causal inference based on the assumption that the true case probability is known from a prior study. We focus on the case of unknown case probability; however, it is straightforward to modify our inference procedures if the case probability is indeed known. Didelez & Evans (2018) provide an excellent and extensive survey on causal inference under case-control sampling, but no discussion of partial identification approaches can be found there. Therefore, taking an interval-identification approach to causal inference under outcome-based sampling appears to be a rather under-explored area. There are a couple of exceptions though. Kuroki et al. (2010) use linear and fractional programming techniques to bound the causal attributable and relative risk, respectively. It is noteworthy that they also consider the MTR assumption, so that their lower bound on causal relative risk is simply one, as in the current paper.
However, their work is distinct from ours in that their upper bound on causal relative risk is obtained by assuming that there is prior knowledge about the lower bound on the true case probability; therefore, their upper bound is generally different from the usual odds ratio. Gabriel et al. (2020) focus on causal attributable risk and obtain bounds in a variety of scenarios including outcome-dependent sampling with an instrumental variable. But they do not leverage any monotonicity assumption. Therefore, our work is complementary to these existing papers. To the best of our knowledge, it was unknown in the literature that the odds ratio is the sharp upper bound of the identified interval of the causal relative risk when the MTR and MTS assumptions are imposed.

The remaining part of the paper is organized as follows. In section 2 we formally present the setup, including the causal parameters of interest and the sampling scheme. Sections 3 to 5 focus on the causal relative risk and odds ratio to address identification, aggregation, and estimation issues. In section 6, we extend our results to the causal attributable risk. Section 7 covers how to carry out causal inference, and section 8 is for the applications. Throughout the paper, for generic random variables A and B, we will write fA and fA|B for the density of A and the conditional density of A given B.

2 Preliminaries

2.1 Causal parameters

Let (Y∗, T∗, X∗) be a random vector of a binary outcome, binary treatment, and covariates of a representative individual. Since we are interested in case-control studies, we assume that a random sample of (Y∗, T∗, X∗) is not available. Instead, we have a sample of (Y, T, X), where the distribution of (T, X) given Y is related to that of (T∗, X∗) given Y∗. The exact sampling schemes and related assumptions will be discussed in detail later; in this section we only introduce the parameters of interest. For the sake of causal inference, we use the usual potential outcome notation. So, Y∗(t) will be the potential outcome for T∗ = t, and Y∗ can be written as Y∗ = Y∗(1)T∗ + Y∗(0)(1 − T∗). The causal effect of the treatment can be measured by (conditional) relative risk, which is defined by

θ(x) := P{Y∗(1) = 1 | X∗ = x} / P{Y∗(0) = 1 | X∗ = x}. (1)

Therefore, our notation extends K. Chen (2001) and Xie et al. (2020) by adding an extra layer of potential outcomes. Alternatively, one may prefer measuring the causal effect of T∗ by the attributable risk, i.e.

θAR(x) := P{Y∗(1) = 1 | X∗ = x} − P{Y∗(0) = 1 | X∗ = x}. (2)

However, it is θ(x), not θAR(x), that turns out to be closely related to the odds ratio, which has been widely used as a measure of association in case-control studies. For this reason we focus on θ(x) and its relationship with the odds ratio first, after which we extend our approach to θAR(x) in section 6.
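As a purely numerical illustration (the probabilities below are hypothetical and not from the paper), the following Python sketch computes θ(x), θAR(x), and the counterfactual odds ratio for one covariate cell; note that the odds ratio weakly exceeds θ(x) whenever P{Y∗(1) = 1 | X∗ = x} ≥ P{Y∗(0) = 1 | X∗ = x}.

```python
# Illustrative sketch with hypothetical potential-outcome probabilities.
def causal_measures(p1, p0):
    """p1 = P{Y*(1)=1 | X*=x}, p0 = P{Y*(0)=1 | X*=x}."""
    theta = p1 / p0                                  # causal relative risk, eq. (1)
    theta_ar = p1 - p0                               # causal attributable risk, eq. (2)
    odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))   # counterfactual odds ratio
    return theta, theta_ar, odds_ratio

theta, theta_ar, oratio = causal_measures(p1=0.30, p0=0.10)
print(theta, theta_ar, oratio)   # ~3.0, ~0.2, ~3.86: the odds ratio exceeds theta
```

Here the odds ratio equals θ(x) multiplied by {1 − P(Y∗(0) = 1 | X∗ = x)}/{1 − P(Y∗(1) = 1 | X∗ = x)}, which is at least one when the treated potential-outcome probability is the larger of the two.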

2.2 Bernoulli sampling

As we mentioned earlier, we assume that a random sample of (Y∗, T∗, X∗) is unavailable. Instead the researcher has access to a random sample of (Y, T, X), where the distribution of (Y, T, X) is related to that of (Y∗, T∗, X∗) by Bernoulli sampling (e.g. Breslow et al. 2000), which we describe below. In Bernoulli sampling, the researcher first draws a Bernoulli variable Y from a pre-specified marginal distribution, after which she randomly draws (T, X) from some distribution Py if and only if Y = y; so, Y is an artificial device that decides which subpopulation we draw (T, X) from. If Py is the distribution of (T∗, X∗) conditional on Y∗ = y, then this is nothing but case-control sampling. Since h0 = P(Y = 1) is part of the sampling scheme, we will assume that it is known; if not, it can be easily estimated. Before we proceed, we make a common support assumption for simplification.

Assumption A (Common Support). The support of X∗ and that of X given Y = y for y = 0, 1 coincide; the common support will be denoted by X .

Assumption A may not be trivial in some applications. Suppose that Y∗ represents breast cancer and Y indicates whether we focus on those with cancer or not. Then, the joint support of gender and age depends highly on which subpopulation we consider; breast cancer patients are mostly women. However, in this case, using gender for extra stratification is appropriate; restricting ourselves to women and having both X and X∗ represent age, assumption A becomes easier to justify. Below we discuss two leading cases of Bernoulli sampling that we focus on throughout the paper.

Design 1 (Case-Control Sampling). P0 is the distribution of (T∗, X∗) given Y∗ = 0, while P1 is that of (T∗, X∗) given Y∗ = 1.

Design 2 (Case-Population Sampling). P0 represents the distribution of (T∗, X∗) of the entire population, while P1 is that of (T∗, X∗) conditional on Y∗ = 1.

Design 1 is arguably the most popular form of case-control studies, and design 2, which we call case-population sampling, is considered in Lancaster & Imbens (1996). The latter has been used to study drug trafficking (Carvalho & Soares 2016) and mass demonstrations (Rosenfeld 2017), among others. Note that the distribution of (T, X) is identified from the data, but that of (T∗, X∗) may not be. For instance, in design 1, we have fX(x) = fX∗|Y∗(x|1)h0 + fX∗|Y∗(x|0)(1 − h0) ≠ fX∗(x), unless h0 is the same as p0 := P(Y∗ = 1), i.e. the true probability of the case in the population. Further, fY,X(1, x) = fX∗|Y∗(x|1)h0 = fX∗(x)P(Y∗ = 1|X∗ = x)h0/p0, which yields the likelihood function studied in e.g. Manski & Lerman (1977). We emphasize that P(Y = 1|X = x) does not have a structural interpretation like P(Y∗ = 1|X∗ = x), where the latter is often formed via domain knowledge in a specific field.
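The density identity above can be checked numerically. The sketch below uses a hypothetical binary covariate (the numbers are ours, for illustration only): it constructs the retrospective densities by Bayes' rule and confirms that, under design 1, the observed covariate distribution fX is a mixture that differs from fX∗ because h0 ≠ p0.

```python
# Hypothetical binary covariate X* in {0, 1}; all numbers are illustrative.
f_Xstar = {0: 0.6, 1: 0.4}        # population distribution of X*
p_star  = {0: 0.02, 1: 0.10}      # P(Y* = 1 | X* = x), a rare outcome
p0 = sum(f_Xstar[x] * p_star[x] for x in f_Xstar)   # P(Y* = 1)

# Retrospective densities f_{X*|Y*}(x | y), obtained by Bayes' rule.
f_X_case    = {x: f_Xstar[x] * p_star[x] / p0 for x in f_Xstar}
f_X_control = {x: f_Xstar[x] * (1 - p_star[x]) / (1 - p0) for x in f_Xstar}

h0 = 0.5                          # sampling share of cases chosen by the researcher
f_X = {x: f_X_case[x] * h0 + f_X_control[x] * (1 - h0) for x in f_Xstar}
print(p0, f_X[1], f_Xstar[1])     # f_X differs from f_Xstar because h0 != p0
```

Setting h0 = p0 in the last step makes f_X coincide with fX∗, which is exactly the caveat in the text.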

3 Identification of causal relative risk

In this section we study identification of θ(x), for which we first introduce some notation. For p ∈ [0, 1], define r(x, p) by

r(x, p) := p(1 − h0)P(Y = 1 | X = x) / [p(1 − h0)P(Y = 1 | X = x) + h0(1 − p)P(Y = 0 | X = x)]  under design 1,

r(x, p) := {p(1 − h0)/h0} × P(Y = 1 | X = x)/P(Y = 0 | X = x)  under design 2.

Further, let Π(t | y, x) := P(T = t | Y = y, X = x) be the retrospective binary regression function. Both r and Π depend on the distribution of (Y, T, X) and hence on the sampling design, but our notation is not explicit about it for simplicity. Define

Γ(x, p) := [Π(1 | 1, x)/Π(0 | 1, x)] × [Π(0 | 0, x) + r(x, p){Π(0 | 1, x) − Π(0 | 0, x)}] / [Π(1 | 0, x) + r(x, p){Π(1 | 1, x) − Π(1 | 0, x)}],

where we will implicitly assume that Π(t | y, x) ≠ 0 for all t, y ∈ {0, 1} throughout. The functions r and Γ are all identified because they are functionals of the joint distribution of (Y, T, X). In both designs, we have r(x, p0) = p∗(x) and r(x, 0) = 0, where p0 := P(Y∗ = 1) and p∗(x) := P(Y∗ = 1 | X∗ = x); here, p0 is the only unidentified parameter in the sense that if p0 were known, then p∗(x) would be identified as well. Note that Γ(x, 0) is nothing but the usual odds ratio.
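For concreteness, here is a small Python sketch of r(x, p) under design 1 and of Γ(x, p), evaluated at hypothetical values of the retrospective regression Π; it confirms that Γ(x, 0) collapses to the usual odds ratio.

```python
def r_design1(p, h0, pY1x):
    """r(x, p) under design 1, with pY1x = P(Y = 1 | X = x) in the sample."""
    num = p * (1 - h0) * pY1x
    return num / (num + h0 * (1 - p) * (1 - pY1x))

def gamma(pi_11, pi_10, r):
    """Gamma(x, p) built from Pi(1|1,x), Pi(1|0,x) and r = r(x, p)."""
    pi_01, pi_00 = 1 - pi_11, 1 - pi_10
    return (pi_11 / pi_01) * (pi_00 + r * (pi_01 - pi_00)) / (pi_10 + r * (pi_11 - pi_10))

# With r = 0, Gamma(x, 0) is the usual odds ratio: here (0.7/0.3)/(0.1/0.9) = 21.
print(gamma(0.7, 0.1, r_design1(0.0, h0=0.5, pY1x=0.5)))
```

The values Π(1|1,x) = 0.7, Π(1|0,x) = 0.1, h0 = 0.5 are hypothetical inputs chosen only to exercise the formulas.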

3.1 Strong ignorability

Strong ignorability is a standard setup for causal inference, and it consists of the following two requirements.

Assumption B (Overlap). For all (t, x) ∈ {0, 1} × X ,

0 < P{Y∗(t) = 1 | X∗ = x} < 1 and 0 < P(T∗ = 1 | X∗ = x) < 1.

Assumption C (Unconfoundedness). For all (t, x) ∈ {0, 1} × X ,

P{Y∗(t) = 1 | T∗ = 1, X∗ = x} = P{Y∗(t) = 1 | T∗ = 0, X∗ = x}.

Assumption B has two requirements. The condition that 0 < P(T∗ = 1 | X∗ = x) < 1 is the usual overlap condition requiring that treatment be non-degenerate for all x ∈ X. The condition that 0 < P{Y∗(t) = 1 | X∗ = x} < 1 is needed here to avoid degenerate outcomes for all x ∈ X. Assumption C is the key condition for strong ignorability; if we control for the covariates X∗, then the treatment assignment T∗ is independent of the potential outcomes. Below is the main implication in our context.

Theorem 1 (Holland & Rubin (1988)). Suppose that assumptions A to C are satisfied.

Then, for all x ∈ X and under design d ∈ {1, 2}, we have θ(x) = Γ{x, (2 − d)p0}.

Theorem 1 is closely related to Holland & Rubin (1988). Specifically, Holland & Rubin (1988) show that under design 1, the odds ratio Γ(x, 0) is equal to the odds ratio in terms of the potential outcomes: i.e.

Γ(x, 0) = [P{Y∗(1) = 1 | X∗ = x} P{Y∗(0) = 0 | X∗ = x}] / [P{Y∗(0) = 1 | X∗ = x} P{Y∗(1) = 0 | X∗ = x}]

under strong ignorability. Theorem 1 reformulates this result in terms of causal relative risk, which has a clearer interpretation. It also extends the same idea to the case of design 2. Under design 2, θ(x) is point identified by the usual odds ratio, but the case of design 1 is a bit different because p0 is unidentified. However, this problem has typically been ignored because, by continuity, Γ(x, 0) approximates Γ(x, p0) when p0 is close to zero; the assumption of small p0 is known as the rare-disease assumption in epidemiology. However, we remark that the approximation error is generally unrestricted: i.e.

∂pΓ(x, p)|p↓0 = [Π(1 | 1, x){Π(1 | 0, x) − Π(1 | 1, x)} / {Π(0 | 1, x)Π(1 | 0, x)²}] × fX∗|Y∗(x | 1)/fX∗|Y∗(x | 0),

where Π(t | y, x) = P(T∗ = t | X∗ = x, Y∗ = y) in design 1. Therefore, p0 being close to zero restricts neither the sign nor the magnitude of ∂pΓ(x, p) in a neighborhood of p = 0. Indeed, if T∗ is a rare treatment among the non-case population, or if it is an overly common treatment among the case population, then the rare-disease approximation can be poor; e.g. smoking can be fairly rare within the healthy population while it is less so among the population with lung cancer. To be more concrete, consider the case with no covariates: if Π(1 | 0) = 0.1, Π(1 | 1) = 0.7, and p0 = 0.01, then Γ(0) − Γ(p0) ≈ 1.4. In other words, when there is no causal effect, we must have Γ(p0) = 1, but Γ(0) will be about 2.4. This problem has a quick solution though: since the function Γ(x, ·) itself is identified, we can directly tell how good or bad the rare-disease approximation will be. We formalize this idea by using the concept of partial identification.

Assumption D. There is a known value p̄ such that p0 ∈ [0, p̄].

The idea that the researcher can place an upper bound on an unidentified object has also been used in the context of robust estimation: see e.g. Horowitz & Manski (1995, 1997). Of course, p̄ = 1 reflects the case where the researcher has no prior information for p0.

Theorem 2. Suppose that assumptions A to D are satisfied. Under design 1, we have min{Γ(x, 0), Γ(x, p̄)} ≤ θ(x) ≤ max{Γ(x, 0), Γ(x, p̄)}, and the bounds are sharp.

Theorem 2 follows from theorem 1 because Γ(x, p) is monotonic in p ∈ [0, p̄]; i.e. it suffices to consider the two end points to obtain sharp bounds, where one of the end points is the odds ratio Γ(x, 0). There is a special case to consider. If Y∗(1) ≥ Y∗(0) almost surely (i.e. treatment is potentially beneficial but it cannot hurt), then Γ(x, ·) is a decreasing function and hence we have Γ(x, p̄) ≤ θ(x) ≤ Γ(x, 0) under design 1; the odds ratio represents the maximum causal relative risk that is consistent with what is observed under design 1. If there is no information for p0, then the lower bound is simply one. We note that design 2 provides an easier environment for causal inference. It seems ironic that design 2 was referred to as case-control sampling with contamination by Lancaster & Imbens (1996) but that the ‘contamination’ is in fact helpful for identification.

3.2 Beyond strong ignorability

Strong ignorability is a popular assumption for causal inference. But it does not deliver point identification under design 1, nor is it always satisfied in observational studies. We now consider alternative assumptions under which the treatment assignment is allowed to be endogenous while we can still obtain comparable bounds.

Assumption E (Monotone Treatment Response). Y∗(1) ≥ Y∗(0) almost surely.

Assumption F (Monotone Treatment Selection). For all t ∈ {0, 1} and x ∈ X ,

P{Y∗(t) = 1 | T∗ = 1, X∗ = x} ≥ P{Y∗(t) = 1 | T∗ = 0, X∗ = x}.

Assumption E was first proposed by Manski (1997), while assumption F was used by Manski & Pepper (2000). Assumption E says that treatment is potentially beneficial but it never hurts. For instance, if an individual does not earn high income with a college degree, then the person will not be highly paid without a college degree, either. Assumption F says that, other things being equal, those who chose a higher degree would be at least as likely to earn high incomes as those who did not, had their educational attainment been randomly assigned. So, in substance, the treatment decision chosen by an individual reveals the ‘type’ of the person; those who choose to obtain a higher degree are more motivated, and they would not be any less likely to earn high incomes than those who choose not to obtain a higher degree if they were randomly assigned to a different treatment status. Assumption F is trivially weaker than assumption C, and it allows individuals with ‘higher ability’ to self-select a higher degree.

Theorem 3. Suppose that assumptions A, B, E and F are satisfied. Then, under both designs 1 and 2, we have 1 ≤ θ(x) ≤ Γ(x, 0), where the bounds are sharp.

The sharp trivial lower bound follows from assumption E. The upper bound is a consequence of assumption F. It is also noteworthy that the bounds in theorem 3 require neither the rare-disease assumption nor assumption D.

Theorem 3 can be compared with theorems 1 and 2. Under design 2, strong ignorability is informative in that it ensures that the odds ratio point identifies the causal parameter θ(x). However, under design 1, we only obtain interval identification, whether we impose strong ignorability or monotonicity; the odds ratio will be the sharp upper bound on θ(x). The empirical content of assumptions E and F does not depend on which of the two Bernoulli sampling schemes is used. Therefore, with the causal parameter θ(x), or log{θ(x)}, in mind, the odds ratio Γ(x, 0), or its logarithm, is a natural estimand to focus on. If the sample available is from design 2, then it is relevant how the researcher views assumption C versus assumption F. However, if the sample is from design 1, i.e. case-control with no ‘contamination,’ then imposing assumption C is not much more useful than assumption F for learning the potential maximum benefit of the treatment. In what follows we will take theorem 3 as our basis for causal inference; using strong ignorability is either easier or similar. From here on we will write OR(x) := Γ(x, 0) to emphasize that this is the usual odds ratio conditional on X = x.

4 The odds ratio, heterogeneity, and aggregation

Since we are conditioning on a specific value of the covariate vector, theorem 3 is the most general approach to deal with potential heterogeneity in the treatment effect. However, log OR(x) is generally difficult to estimate with high precision when the dimension of X∗ is high.

To avoid the curse of dimensionality, it is popular in case-control studies to adopt logistic regression at the true population level, which also explains why the convention of taking the logarithm has become popular. One can alternatively parametrize the odds ratio function itself: see e.g. H. Chen (2007) and Tchetgen Tchetgen (2013). Parametric assumptions are convenient, but they are restrictive; log{OR(x)} is generally an unknown function of x that can be highly nonlinear. Instead of introducing any parametrization, we can obtain a robust summary measure by aggregating over the population distribution of the covariates, i.e.

ϑ̄ := E{log θ(X∗)}, where 0 ≤ ϑ̄ ≤ E{log OR(X∗)} (3)

and the bounds are sharp under the conditions of theorem 3. Aggregation can be done using a different weight function, but it seems most natural to use the true population distribution of the covariates. However, X∗ is not always observed and hence this type of aggregation is not always feasible. We start with feasible aggregation: for y = 0, 1, define

β(y) := ∫ log OR(x) dFX|Y(x | y), (4)

where FX|Y(· | y) is the distribution of X given Y = y under Bernoulli sampling. Then, β(y) is an identified object, and under design d ∈ {1, 2}, we have

E{log OR(X∗)} = β(0) + p0{β(1) − β(0)}(2 − d). (5)

Therefore, case-population sampling makes aggregation simpler as well.
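Under design 1, equation (5) is just the law of total expectation applied to the p0-mixture fX∗ = p0 fX∗|Y∗(·|1) + (1 − p0) fX∗|Y∗(·|0). The following sketch verifies it exactly on a hypothetical discrete covariate (all numbers are made up for illustration).

```python
# Equation (5) with d = 1, checked exactly on a hypothetical discrete covariate.
log_or = {0: 0.2, 1: 1.5}      # hypothetical log OR(x)
f_X_y1 = {0: 0.3, 1: 0.7}      # F_{X|Y}(. | 1): covariates among cases
f_X_y0 = {0: 0.8, 1: 0.2}      # F_{X|Y}(. | 0): covariates among controls
p0 = 0.05                      # true case probability P(Y* = 1)

beta1 = sum(f_X_y1[x] * log_or[x] for x in log_or)
beta0 = sum(f_X_y0[x] * log_or[x] for x in log_or)

# Under design 1, f_{X*} is the p0-mixture of the two retrospective densities.
f_Xstar = {x: p0 * f_X_y1[x] + (1 - p0) * f_X_y0[x] for x in log_or}
lhs = sum(f_Xstar[x] * log_or[x] for x in log_or)   # E{log OR(X*)}
rhs = beta0 + p0 * (beta1 - beta0)                  # right-hand side of (5), d = 1
print(lhs, rhs)   # identical
```

Setting d = 2 instead drops the p0 term, which is the sense in which case-population sampling makes aggregation simpler.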

5 Efficient estimation of β(y)

Before we study inferential issues, we discuss efficient estimation of β(y). For this purpose, we point out that the sample consists of independent and identically distributed copies of (Y, T, X); the sampling designs affect the distribution of (Y, T, X) and its relationship with that of (Y∗, T∗, X∗). All regularity conditions and results in this section will be presented in terms of the observed variables (Y, T, X); therefore, the regularity conditions will be testable in principle.

5.1 The efficient influence function

We first derive the semiparametric efficiency bound. We start with the following assumptions for regularity.

Assumption G (Bounded Probabilities). There is a constant ε > 0 such that for each y = 0, 1, ε ≤ P(T = 1|X, Y = y) ≤ 1 − ε and ε ≤ P(Y = 1|X) ≤ 1 − ε almost surely.

Assumption H (Regular Distribution). The distribution function FX|Y has a probability density fX|Y that satisfies 0 < fX|Y(x|y) < ∞ for all x ∈ X and y = 0, 1.

Assumption G is slightly stronger than what we need to derive the efficient influence function, but it is useful to ensure that all the population quantities given below are well defined without spelling out all the moment conditions. Assumption H focuses on the case where X is continuous, but this is only for the sake of notational simplicity. Under Bernoulli sampling, the likelihood of a single observation (Y, T, X) can be expressed as a mixture of two binary likelihoods, from which we first have the following lemma.

Lemma 1. Consider the Bernoulli sampling scheme of design 1 or design 2. The tangent space can be represented by the set of functions of the following form:

s(Y, T, X) = (1 − Y)[a0(X) + {T − P(T = 1|X, Y = 0)}b0(X)] + Y[a1(X) + {T − P(T = 1|X, Y = 1)}b1(X)],

where the functions ay and by are such that E{ay(X)|Y = y} = 0 and E{s²(Y, T, X)} < ∞ for each y = 0, 1.

By direct calculation, as in Hahn (1998), we can show that β(y) is pathwise differentiable along regular parametric submodels in the sense of Newey (1990, 1994). Further, it turns out that the pathwise derivative is an element of the tangent space presented in lemma 1, from which we can obtain the semiparametric efficiency bound for β(y). The following theorem presents this result. Let

w(X) := fX|Y(X|0)/fX|Y(X|1) = {h0/(1 − h0)} × P(Y = 0 | X)/P(Y = 1 | X), (6)

where the second equality holds by Bayes' rule. Further, for y = 0, 1, define

∆y(Y, T, X) := Y^y (1 − Y)^{1−y} {T − P(T = 1|X, Y = y)} / [P(T = 1|X, Y = y){1 − P(T = 1|X, Y = y)}].

Theorem 4. Suppose that assumptions A, G and H hold and that we have a sample by Bernoulli sampling. Then, for y = 0, 1, β(y) is pathwise differentiable and its pathwise derivative is given by

Fy(Y, T, X) = [Y^y (1 − Y)^{1−y} / {h0^y (1 − h0)^{1−y}}] {log OR(X) − β(y)} − ∆0(Y, T, X)/{(1 − h0)w(X)^y} + w(X)^{1−y} ∆1(Y, T, X)/h0.

Further, Fy is an element of the tangent space, and therefore, the semiparametric efficiency bound for β(y) is given by E{Fy²(Y, T, X)}. Theorem 4 implies that the asymptotic variance of a √n-consistent and asymptotically linear estimator of β(y) should be E{Fy²(Y, T, X)} by Theorem 2.1 of Newey

(1994). The first term that appears in Fy(Y, T, X) has mean zero, because

E[{Y^y (1 − Y)^{1−y} / (h0^y (1 − h0)^{1−y})} {log OR(X) − β(y)}] = E{log OR(X) − β(y) | Y = y} = 0. (7)

The other terms in Fy(Y, T, X) adjust for the effect of the first-step nonparametric estimation of log OR(X) via P(T = 1|X = x, Y = y).
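As a sanity check on the influence function in theorem 4, the following sketch enumerates a hypothetical discrete design (binary X with made-up retrospective probabilities, chosen by us for illustration) and verifies that Fy integrates to zero for y = 0, 1, as equation (7) and the mean-zero property of ∆0 and ∆1 require.

```python
from itertools import product
from math import log

h0 = 0.4                                            # P(Y = 1) chosen by the sampler
f_X = {1: {0: 0.3, 1: 0.7}, 0: {0: 0.8, 1: 0.2}}    # f_{X|Y}(x | y), hypothetical
pT = {(0, 0): 0.2, (0, 1): 0.5,                     # P(T = 1 | X = x, Y = y),
      (1, 0): 0.3, (1, 1): 0.8}                     # keyed by (y, x)

def log_or(x):
    odds = lambda y: pT[(y, x)] / (1.0 - pT[(y, x)])
    return log(odds(1) / odds(0))

beta = {y: sum(f_X[y][x] * log_or(x) for x in (0, 1)) for y in (0, 1)}
w = {x: f_X[0][x] / f_X[1][x] for x in (0, 1)}      # w(x) = f_{X|Y}(x|0)/f_{X|Y}(x|1)

def delta(y, Y, T, x):
    p = pT[(y, x)]
    return (Y ** y) * ((1 - Y) ** (1 - y)) * (T - p) / (p * (1 - p))

def F(y, Y, T, x):
    lead = (Y ** y) * ((1 - Y) ** (1 - y)) / (h0 ** y * (1 - h0) ** (1 - y))
    return (lead * (log_or(x) - beta[y])
            - delta(0, Y, T, x) / ((1 - h0) * w[x] ** y)
            + w[x] ** (1 - y) * delta(1, Y, T, x) / h0)

means = {}
for y in (0, 1):
    means[y] = sum((h0 if Y else 1 - h0) * f_X[Y][x]
                   * (pT[(Y, x)] if T else 1 - pT[(Y, x)])
                   * F(y, Y, T, x)
                   for Y, T, x in product((0, 1), (0, 1), (0, 1)))
print(means)   # both entries are numerically zero
```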

5.2 Estimation algorithms

Efficient estimators of β(y) for y = 0, 1 can be constructed in multiple ways. The most straightforward approach is to use equation (7); i.e. we base an estimator on the moment condition β(y) = E{Y^y (1 − Y)^{1−y} log OR(X)}/{h0^y (1 − h0)^{1−y}}, where we plug in a nonparametric estimator of OR(x). If the resulting estimator is √n-consistent and asymptotically linear, then its asymptotic variance will be given by E{Fy²(Y, T, X)}, as we commented earlier. Below we propose an algorithm that implements this idea, for which we use sieve logistic estimators for the first-step nonparametric estimation.

Suppose that we have an i.i.d. sample {(Yi, Ti, Xi) : i = 1, . . . , n}. Throughout the discussion we assume that h0 is known; otherwise, we can use ĥ := ∑_{i=1}^n Yi/n instead, which does not affect the first-order asymptotics, provided that the distribution of (T, X) given Y = y does not depend on h0. In order to estimate OR(x) nonparametrically in the first step, we consider infinite-dimensional (retrospective) logistic regression: i.e. for y = 0, 1,

P(T = 1 | X = x, Y = y) = exp{∑_{j=1}^∞ φj(x)µj,y} / [1 + exp{∑_{j=1}^∞ φj(x)µj,y}], (8)

where {φj : j = 1, 2, . . .} is a series of basis functions and {µj,y : j = 1, 2, . . .} is a series of unknown coefficients for each y = 0, 1. It then follows that log OR(x) = ∑_{j=1}^∞ φj(x)(µj,1 − µj,0), from which we obtain

β(y) ≈ ∑_{j=1}^{Jn} {∫_X φj(x) dFX|Y(x|y)} (µj,1 − µj,0), (9)

provided that Jn diverges as n increases. Equation (9) suggests the following two-step sieve estimation strategy:

i. In the first step, for each y = 0, 1, estimate {μ_{j,y} : j = 1, …, J_n} by

logistic regression of Ti on {φj(Xi) : j = 1, . . . , Jn} with the Yi = y sample.

ii. In the second step, construct a sample analog of equation (9): i.e.

\[
\widehat\beta(y) := \sum_{j=1}^{J_n} \left\{\int_{\mathcal X} \phi_j(x)\, d\widehat F_{X|Y}(x \mid y)\right\} (\widehat\mu_{j,1} - \widehat\mu_{j,0}), \tag{10}
\]

where µbj,y’s are sieve logit estimates from the first step and

\[
\int_{\mathcal X} \phi_j(x)\, d\widehat F_{X|Y}(x \mid y) = \frac{\sum_{i=1}^{n} Y_i^y (1-Y_i)^{1-y} \phi_j(X_i)}{\sum_{i=1}^{n} Y_i^y (1-Y_i)^{1-y}}.
\]

Since we use retrospective regression as shown in equation (8), we call the estimator defined in (10) the retrospective sieve logistic estimator of β(y), y = 0, 1. It can be computed using standard software for logistic regression, as described in algorithm 1.

Algorithm 1: Retrospective Sieve Logistic Estimator of β(1)

Input: {(Y_i, T_i, X_i) : i = 1, …, n}, a tuning parameter J_n and basis functions {φ_j(·) : j = 1, …, J_n}
Output: estimate of β(1) and its standard error

1. Construct {φ_1(X_i), …, φ_{J_n}(X_i) : i = 1, …, n}, where an intercept term is excluded from the φ_j's;
2. For each j = 1, …, J_n, compute the empirical mean of φ_j(X_i) using only the case sample (Y_i = 1) and construct the demeaned version, say ϕ_j(X_i), of φ_j(X_i);
3. Run a logistic regression of T_i on the following regressors: an intercept term, Y_i, ϕ_j(X_i), j = 1, …, J_n, and interactions between Y_i and ϕ_j(X_i), j = 1, …, J_n, using standard software;
4. Read off the estimated coefficient on Y_i and its standard error.

The procedure described in algorithm 1 achieves the first step by running a combined logistic regression of T_i on Y_i, the sieve basis terms and the interactions between Y_i and the sieve basis terms. This is first-order equivalent since Y_i is binary and full interaction terms are included. For the second step, instead of evaluating the right-hand side of equation (10) after the logistic regression, the φ_j(X_i)'s are demeaned first using only the case sample, so that the resulting coefficient on Y_i is first-order equivalent to the estimator defined in equation (10). The advantage of the formulation in algorithm 1 is that the standard error of β̂(1) can be read off directly from standard software without any further programming. It is straightforward to modify algorithm 1 for estimating β(0): one simply computes the empirical mean of φ_j(X_i) using only the control sample (Y_i = 0) in the demeaning step. It is not difficult to work out formal asymptotic properties of our proposed sieve estimator in view of the well-established literature on two-step sieve estimation (see, e.g., Ai & Chen 2003, 2012, Ackerberg et al. 2014, among many others). Since this is now well understood in the literature, we omit the details for brevity.
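To make the procedure concrete, here is a minimal numpy sketch of algorithm 1. The Newton–Raphson routine `logit_fit` stands in for the "standard software" in the text, and the basis matrix `phi` together with all function names are our own illustrative choices, not part of the paper.

```python
import numpy as np

def logit_fit(X, t, max_iter=50, tol=1e-10):
    """Logistic regression of t on X via Newton-Raphson; returns (coef, cov)."""
    b = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        H = X.T @ (X * (p * (1.0 - p))[:, None])   # observed information
        step = np.linalg.solve(H, X.T @ (t - p))
        b = b + step
        if np.max(np.abs(step)) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-X @ b))
    cov = np.linalg.inv(X.T @ (X * (p * (1.0 - p))[:, None]))
    return b, cov

def retrospective_sieve_beta1(y, t, phi):
    """Algorithm 1: estimate beta(1) and its standard error.

    phi: (n, J) matrix of sieve basis terms evaluated at X_i (no intercept).
    """
    # Step 2: demean the basis terms using only the case sample (Y_i = 1).
    varphi = phi - phi[y == 1].mean(axis=0)
    # Step 3: combined logistic regression of T on an intercept, Y, the
    # demeaned basis terms, and their interactions with Y.
    D = np.column_stack([np.ones(len(y)), y, varphi, y[:, None] * varphi])
    b, cov = logit_fit(D, t)
    # Step 4: read off the coefficient on Y and its standard error.
    return b[1], np.sqrt(cov[1, 1])
```

For a scalar covariate with a constant log odds ratio, the coefficient on Y_i recovers β(1) directly, and its standard error comes from the inverse Hessian, mirroring what packaged logistic-regression software would report.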

Appendix B reports the results of a small Monte Carlo experiment that illustrates the finite-sample performance of the proposed estimators of β(1) and β(0).

6 Identification of attributable risk

In this section we present identification results for the alternative causal parameter θ_AR(x) defined in equation (2), under the conditions we discussed in section 3. Define

\[
\Gamma_{\mathrm{AR}}(x, p) := \sum_{j=0}^{1} \frac{(-1)^{j+1}\, \Pi(j \mid 1, x)}{\Pi(j \mid 0, x) + r(x, p)\{\Pi(j \mid 1, x) - \Pi(j \mid 0, x)\}},
\]
where Π(t | y, x) and r(x, p) are defined in section 3. Note that Γ_AR(x, 0) is not the odds ratio but the difference of two probability ratios.

Theorem 5. Suppose that assumptions A to C are satisfied. Then, for all x ∈ X and under design d ∈ {1, 2}, we have θ_AR(x) = r(x, p_0)Γ_AR{x, (2 − d)p_0}.

Therefore, under strong ignorability, unlike θ(x), θ_AR(x) depends on p_0 in both designs 1 and 2. Further, the rare-disease approximation does not provide anything useful, because if p_0 ≈ 0, then r(x, p_0) ≈ r(x, 0) = 0 by continuity. However, the partial identification approach is still valid.

Theorem 6. Suppose that assumptions A to D are satisfied. For all x ∈ X and under design d ∈ {1, 2}, the sharp identifiable bounds are given by

\[
\min_{p \in [0, \bar p]} r(x, p)\, \Gamma_{\mathrm{AR}}\{x, (2-d)p\} \;\le\; \theta_{\mathrm{AR}}(x) \;\le\; \max_{p \in [0, \bar p]} r(x, p)\, \Gamma_{\mathrm{AR}}\{x, (2-d)p\}.
\]
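Numerically, these bounds are a one-dimensional optimization over p. The sketch below is our own code, with a hypothetical r(x, ·) and made-up treatment probabilities Π(1 | 1, x) and Π(1 | 0, x), purely to illustrate the grid evaluation.

```python
import numpy as np

def gamma_ar(pi1, pi0, r):
    """Gamma_AR evaluated with Pi(1|1,x) = pi1, Pi(1|0,x) = pi0 and a given r."""
    total = 0.0
    for j in (0, 1):
        p1j = pi1 if j == 1 else 1.0 - pi1   # Pi(j | 1, x)
        p0j = pi0 if j == 1 else 1.0 - pi0   # Pi(j | 0, x)
        total += (-1.0) ** (j + 1) * p1j / (p0j + r * (p1j - p0j))
    return total

def theta_ar_bounds(pi1, pi0, r_of_p, p_bar, d, n_grid=2001):
    """Sharp bounds of theorem 6: min/max over p in [0, p_bar] of
    r(x, p) * Gamma_AR{x, (2 - d) p}."""
    ps = np.linspace(0.0, p_bar, n_grid)
    vals = np.array([r_of_p(p) * gamma_ar(pi1, pi0, r_of_p((2 - d) * p))
                     for p in ps])
    return vals.min(), vals.max()
```

Since r(x, 0) = 0, the grid always contains the value zero, and under design 2 the function Γ_AR is evaluated at r(x, 0), consistent with the "difference of two probability ratios" remark above.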

Theorem 6 is a simple corollary of theorem 5. The sharp bounds for the aggregated parameter θ̄_AR := ∫_X θ_AR(x)ω(x)dx, for some weight function ω, can be obtained by taking the max/min after aggregation. Unlike the case of θ(x), strong ignorability does not lead to point identification under either design 1 or 2. What is similar, though, is that we obtain partial identification and that the bounds under strong ignorability can be shown to be quite comparable to those under the monotonicity assumptions.

Theorem 7. Suppose that assumptions A, B and D to F are satisfied. Then, for all x ∈ X and under design d ∈ {1, 2}, we have 0 ≤ θ_AR(x) ≤ max_{p ∈ [0, p̄]} r(x, p)Γ_AR{x, (2 − d)p}, and the bounds are sharp.

The sharp upper bound in theorem 7 is the same as the one presented in theorem 6. Therefore, replacing strong ignorability with assumptions E and F is not costly in terms of the sharp upper bound on θ_AR(x). Also, similarly to the case of θ(x), the trivial lower bound of 0 in theorem 7 is a consequence of assumption E.

7 Causal inference on relative and attributable risk

In this section we discuss how to carry out causal inference on relative risk (RR) and attributable risk (AR) under the MTR and MTS assumptions. In our discussion below, z(1 − α) denotes the 1 − α quantile of the standard normal distribution. First, we focus on RR, for which we consider the parameter of interest defined by exp(ϑ̄), where ϑ̄ := E{log θ(X*)}: the average of the logarithm of RR is less likely to be unduly affected by outliers than that of RR itself. See appendix A for further discussion. We propose to use equation (5) as a basis for causal inference. Specifically, suppose that MTR and MTS hold. Then, under design d ∈ {1, 2},

\[
1 \;\le\; \exp(\bar\vartheta) \;\le\; \exp\bigl[\beta(0) + p_0\{\beta(1) - \beta(0)\}(2 - d)\bigr]. \tag{11}
\]

So, causal inference under design 2 follows immediately from section 5: i.e., the 1 − α one-sided confidence interval for exp(ϑ̄) can be constructed as the interval between one and exp{β̂(0) + z(1 − α)ŝ(0)}, where ŝ(0) is the standard error of β̂(0). In the case of design 1, the right-hand side of equation (11) is identifiable only up to the unknown p_0. Therefore, we can conduct causal inference conditional on p_0, for which we use a confidence band for β̃(p) := pβ(1) + (1 − p)β(0) that is uniform in

p ∈ [0, 1]: i.e., let u(1 − α) := z(1 − α/2) max{ŝ(1), ŝ(0)}, where ŝ(j) is the standard error of β̂(j). It can then be shown that
\[
P\Bigl[\forall p \in [0, 1]:\ \exp\{\tilde\beta(p)\} \le \exp\{p\widehat\beta(1) + (1-p)\widehat\beta(0) + u(1-\alpha)\}\Bigr] \ge 1 - \alpha; \tag{12}
\]
this procedure can be modified if assumption D with 0 < p̄ < 1 is adopted. To see how equation (12) is obtained, let β̄(j) := β̂(j) − β(j) for j = 0, 1. We then notice that B := inf_{p ∈ [0,1]} {pβ̄(1) + (1 − p)β̄(0)} is equal to min{β̄(1), β̄(0)}, and therefore P{B ≤ −u(1 − α)} ≤ ∑_{j=0}^{1} P{β̄(j) ≤ −u(1 − α)} ≤ α/2 + α/2 = α. The asymptotic coverage rate in equation (12) is conservative. Achieving the asymptotically exact rate requires the limiting distribution of min{β̂(1) − β(1), β̂(0) − β(0)}, which is not normal and is not readily available from the estimation algorithm given in section 5.2.
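For illustration, the band in equation (12) takes a few lines to compute once β̂(1), β̂(0) and their standard errors are available; the function below and its inputs are a sketch of ours, not part of the paper's code.

```python
from statistics import NormalDist
import numpy as np

def rr_upper_band(beta1_hat, beta0_hat, se1, se0, alpha=0.05, n_grid=101):
    """One-sided uniform band of equation (12): an upper limit for
    exp{p beta(1) + (1 - p) beta(0)} holding jointly over p in [0, 1]."""
    # u(1 - alpha) = z(1 - alpha/2) * max of the two standard errors
    u = NormalDist().inv_cdf(1.0 - alpha / 2.0) * max(se1, se0)
    ps = np.linspace(0.0, 1.0, n_grid)
    return ps, np.exp(ps * beta1_hat + (1.0 - ps) * beta0_hat + u)
```

At p = 0 the band reduces to the design-2 interval endpoint exp{β̂(0) + u(1 − α)}, so the uniform band nests the pointwise construction up to the larger critical value.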

We now turn to causal inference on AR. We start by defining β_AR(p, y; d) := E[r(X, p)Γ_AR{X, (2 − d)p} | Y = y], where d ∈ {1, 2} indicates the underlying sampling design. Then, the identification analysis in section 6 shows that the strong ignorability assumptions lead to the following equality: under design d ∈ {1, 2},
\[
E\{\theta_{\mathrm{AR}}(X^*)\} = \beta_{\mathrm{AR}}(p_0, 0; d) + (2 - d)\{\beta_{\mathrm{AR}}(p_0, 1; d) - \beta_{\mathrm{AR}}(p_0, 0; d)\}, \tag{13}
\]
where β_AR(p, y; d) is an identified object for all (p, y) ∈ [0, 1] × {0, 1}. Further, if the strong ignorability assumptions are replaced with MTR and MTS, then the right-hand side of equation (13) becomes an upper bound, while the lower bound is simply zero. To appreciate the right-hand side of equation (13) (as the aggregated AR or its upper bound), we first consider the simpler case of design 2. Unlike the RR, even under design 2, it is not sufficient to estimate the retrospective regression. Specifically, β_AR(p, 0; 2) = pξ_CP, where the sample analog of ξ_CP is given by

\[
\widehat\xi_{\mathrm{CP}} := \Bigl(\sum_{i=1}^{n} Y_i\Bigr)^{-1} \sum_{i=1}^{n} (1 - Y_i)\, \frac{\widehat P(Y = 1 \mid X_i)}{\widehat P(Y = 0 \mid X_i)} \left\{\frac{\widehat\Pi(1 \mid 1, X_i)}{\widehat\Pi(1 \mid 0, X_i)} - \frac{\widehat\Pi(0 \mid 1, X_i)}{\widehat\Pi(0 \mid 0, X_i)}\right\},
\]

where P̂(Y = y | X_i) and Π̂(t | y, X_i) are parametric or sieve logit estimators; here, h_0 is replaced with the sample mean of the Y_i. In principle we can analytically work out the asymptotic standard error of ξ̂_CP, on which we can base our inference. Alternatively, we may use the bootstrap. For example, a (1 − α) confidence interval for E{θ_AR(X*)} under MTR and MTS can be obtained by using a (one-sided) uniform confidence band in p of the form p ∈ [0, 1] ↦ [0, p{ξ̂_CP + u_AR,CP(1 − α)}], where −u_AR,CP(1 − α) is the bootstrap α-quantile of ξ̂_CP − ξ_CP.

We make a few comments. The researcher may adopt assumption D by restricting the band to p ∈ [0, p̄]. Asymptotic validity of the nonparametric bootstrap in two-step semiparametric models is a well-studied topic (e.g., Chen et al. 2003). However, in practice, the naïve bootstrap may suffer from a finite-sample bias because ratios of probabilities are estimated in the first step. To mitigate this issue, we recommend using Efron (1982)'s bias-corrected one-sided percentile interval to obtain u_AR,CP(1 − α).
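Given fitted prospective probabilities P̂(Y = 1 | X_i) and retrospective probabilities Π̂(1 | y, X_i), the sample analog ξ̂_CP displayed above reduces to a weighted average over the population draws. The sketch below is ours and takes these fitted probabilities as given.

```python
import numpy as np

def xi_cp_hat(y, p_y1, pi_t1_y1, pi_t1_y0):
    """Sample analog of xi_CP: average, over the population draws (Y_i = 0),
    of the fitted odds P(Y=1|X_i)/P(Y=0|X_i) times the difference of the two
    probability ratios, normalised by the number of cases."""
    odds = p_y1 / (1.0 - p_y1)                       # P(Y=1|X_i) / P(Y=0|X_i)
    ratios = (pi_t1_y1 / pi_t1_y0
              - (1.0 - pi_t1_y1) / (1.0 - pi_t1_y0))  # Pi(1|1)/Pi(1|0) - Pi(0|1)/Pi(0|0)
    return np.sum((1 - y) * odds * ratios) / np.sum(y)
```

Only the Y_i = 0 observations contribute to the sum, while the normalisation uses the case count, matching the (∑ Y_i)^{-1} ∑ (1 − Y_i){·} structure of the display above.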

The case of design 1 can be handled by the same principle, although β_AR(p, y; 1) has a more complicated form; nonlinearity in p makes it difficult to obtain a uniform confidence band in p. We propose using one-sided pointwise confidence intervals based on Efron's bias-corrected percentile intervals in this case. Computational details regarding causal inference on attributable risk are given in algorithm 2. In describing the algorithm, we focus on design 1; in the case of design 2, we replace the upper bound estimates ÛB_AR(p) with those of pξ_CP, and the resulting confidence intervals will be uniform in p ∈ [0, 1].

Algorithm 2: Causal Inference on Attributable Risk Using Case-Control Samples

Input: {(Y_i, T_i, X_i) : i = 1, …, n}, the number B of bootstrap replications, the coverage probability 1 − α of the confidence interval, and the upper bound p̄ on the unknown true case probability
Output: point estimates ÛB_AR(p) of the upper bounds on causal attributable risk and the upper end points q*_{1−α}(p) of the one-sided pointwise bootstrap confidence intervals, for p ∈ [0, p̄]

1. Construct a grid P := {p_0, p_1, …, p_J} of [0, p̄], where 0 = p_0 < p_1 < ··· < p_J = p̄;
2. For each p ∈ P, evaluate the sample analog ÛB_AR(p) of UB_AR(p) := (1 − p)β_AR(p, 0; 1) + pβ_AR(p, 1; 1); in this step we need to compute retrospective estimates of Π(t | y, X_i) for t = 0, 1, y = 0, 1 as well as prospective estimates of P(Y = 1 | X_i), because β_AR(p, y; 1) depends on r(x, p), as its definition, provided right above equation (13), shows;
3. For each bootstrap replication b = 1, …, B, generate a bootstrap sample {(Y_i^{*,b}, T_i^{*,b}, X_i^{*,b}) : i = 1, …, n} and obtain a bootstrap estimate ÛB_AR^{*,b}(p) for each p ∈ P;
4. For each p ∈ P, compute

\[
\mu^*(p) := \frac{1}{B} \sum_{b=1}^{B} 1\Bigl\{\widehat{\mathrm{UB}}_{\mathrm{AR}}^{*,b}(p) \le \widehat{\mathrm{UB}}_{\mathrm{AR}}(p)\Bigr\},
\]

where 1{·} is the usual indicator function;
5. For each p ∈ P, obtain ν*(p) := Φ[Φ^{-1}(1 − α) + 2Φ^{-1}{μ*(p)}], where Φ(·) is the cumulative distribution function of the standard normal random variable;
6. For each p ∈ P, compute the ν*(p) empirical quantile of ÛB_AR^{*,b}(p), say q*_{1−α}(p), as the upper end point of the (1 − α) one-sided pointwise bootstrap confidence interval.
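Steps 4–6 of algorithm 2 (the bias-corrected one-sided percentile endpoint) amount to only a few lines for each grid point p; the sketch below is our own rendering, with hypothetical argument names.

```python
from statistics import NormalDist
import numpy as np

def bc_onesided_endpoint(ub_hat, ub_boot, alpha=0.05):
    """Steps 4-6 of algorithm 2 at one grid point: Efron's bias-corrected
    one-sided percentile endpoint from bootstrap draws ub_boot of the
    point estimate ub_hat."""
    nd = NormalDist()
    mu = np.mean(np.asarray(ub_boot) <= ub_hat)     # step 4: bootstrap CDF at the estimate
    mu = min(max(mu, 1e-6), 1.0 - 1e-6)             # keep inv_cdf finite at the boundary
    nu = nd.cdf(nd.inv_cdf(1.0 - alpha) + 2.0 * nd.inv_cdf(mu))  # step 5
    return float(np.quantile(ub_boot, nu))          # step 6: nu-quantile of the draws
```

When the bootstrap distribution is symmetric around the point estimate, μ*(p) ≈ 1/2 and the correction vanishes, so the endpoint is close to the plain 1 − α percentile; the correction matters precisely when the first-step ratio estimation shifts the bootstrap distribution.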

8 Examples: the two sampling designs

8.1 Case-control sampling: entering a very selective university

We consider quantifying the causal effect of attending private school on entering a very selective university using the Pakistan data collected by Delavande & Zafar (2019). This is survey data on male students who were already enrolled in different types of universities in Pakistan, all located in Islamabad/Rawalpindi and Lahore. Delavande & Zafar (2019) include two Western-style universities, one Islamic university, and four madrassas, but we focus on the two Western-style ones in our analysis: between the two universities, Delavande & Zafar (2019) call the more expensive, selective, and reputable university the “Very Selective University” (VSU) and the other simply the “Selective University” (SU). Therefore, we restrict the population of interest to those who entered either VSU or SU, and we define the binary outcome to be whether a student entered VSU. The binary treatment we consider is whether a student attended private school before university. Since the students in the sample were already enrolled in either VSU or SU at the time of the survey, we have a case-control sample, i.e., our design 1.

Table 1: University entrance and private school attendance

                Private school
University      T = 0    T = 1    Total
Y = 0 (SU)      151      332      483
Y = 1 (VSU)     51       155      206
Total           202      487      689
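The unadjusted odds ratio quoted in the text can be verified directly from the counts in table 1:

```python
# 2x2 counts from table 1 (rows: Y; columns: T)
n_su_t0, n_su_t1 = 151, 332    # Y = 0 (SU)
n_vsu_t0, n_vsu_t1 = 51, 155   # Y = 1 (VSU)

# OR = odds of private school among VSU entrants / odds among SU entrants
odds_ratio = (n_vsu_t1 / n_vsu_t0) / (n_su_t1 / n_su_t0)
print(round(odds_ratio, 2))  # 1.38
```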

Table 1 shows the likelihood of entering VSU by private school attendance before university; we will discuss controlling for covariates later. The empirical odds ratio is 1.38.

In this example, the unconfoundedness assumption is unlikely to hold, because those who attended private school before university are likely to have more resourceful parents. This concern may not completely disappear even if we control for parental income and wealth, because of the presence of unobserved parental abilities and resources that could affect their children's university choice. However, the MTR and MTS assumptions are still plausible: private school is probably no inferior an input to university preparation (hence, MTR), and those who actually chose to attend private school probably care about their future college choice no less than those who did not (hence, MTS). Then, the odds ratio of 1.38 can be interpreted as a sharp upper bound on causal relative risk; therefore, the effect of attending private school seems, at best, modest.

Now, we consider controlling for family background variables that are summarized in table 2. On average, parents of VSU students are more educated, earn higher incomes and have more wealth measured by the ownership of home, car, and other items. We control for the covariates reported in table 2, where parents' monthly income is included via cubic B-splines with three inner knots: so, we have a partially linear specification with a sieve approximation on parents' income. Table 3 reports estimation results for the aggregated log odds ratio within each of the case and the control samples: both β(y) and exp{β(y)} convey the same information, but exp{β(y)} is easier to interpret because it is comparable to the usual odds ratio in terms of scale. The fact that β̂(1) and β̂(0) are notably different suggests that the amount of heterogeneity among individuals may be substantial. The confidence intervals are computed based on the MTR and MTS, and we note that the intervals for exp{β(y)} contain the unconditional odds ratio of 1.38. We now consider the methods described in section 7, i.e., causal inference on the aggregated relative and attributable risk (RR and AR, respectively) in terms of the population distribution of the covariates.
We rely on the MTR and MTS assumptions to interpret our results as upper bounds. Also, we comment that we use a less demanding specification for AR, because the estimand for AR is much more complex and the bootstrap is used for causal inference. Specifically, we only include an indicator for at least one college-educated parent and parents' monthly income.

Table 2: Summary statistics of the case-control sample

                                             VSU      SU
Attended private school before university    0.75     0.69
At least one college-educated parent         0.89     0.67
Parents' monthly income                      193.38   95.89
Above median income                          0.93     0.80
Parents own home                             0.92     0.87
Parents own television                       0.91     0.84
Parents own cell phone                       0.92     0.80
Parents own computer                         0.84     0.70
Parents own car                              0.84     0.67
Sample size                                  206      483

Notes. Each entry shows the sample mean. Parents' monthly income is in thousands of rupees and all other variables are binary indicator variables. The case and control samples consist of those who entered a very selective university (VSU) and a selective university (SU), respectively.

Table 3: Estimation results of attending private school on entering VSU

                          (1)             (2)
                          Case (y = 1)    Control (y = 0)
                          VSU             SU
β(y)                      0.07            0.19
95% confidence interval   [0, 0.48]       [0, 0.63]
exp{β(y)}                 1.07            1.21
95% confidence interval   [1, 1.61]       [1, 1.89]

Note. Parental background is controlled for when fitting retrospective binary logistic regression models.

Figure 1: Causal inference on RR and AR: bounding the effects of attending private school on entering VSU

[Figure: the left panel plots the estimates of the upper bounds on relative risk with a 95% one-sided uniform confidence band; the right panel plots the estimates of the upper bounds on attributable risk with 95% one-sided pointwise confidence intervals. The horizontal axis of each panel is the unknown true case probability, running from 0 to 1.]

Note. The left panel shows the estimates and the 95% one-sided uniform confidence band for the upper bounds on relative risk, and the right panel the estimates and the 95% one-sided pointwise confidence intervals for the upper bounds on attributable risk, as functions of the unknown true case probability, i.e., P(Y* = 1).

The number of bootstrap replications was 10,000.

Figure 1 summarizes the results: the left (right) panel shows RR (AR). The case probability p_0 of entering VSU in the population is not identified in this dataset, but we can trace out the upper bounds as the value of p_0 varies between 0 and 1. Consider the left panel of figure 1, i.e., RR. If we take the point estimate at face value, attending private school increases the chance of entering VSU by a factor of at most 1.21. Even in terms of the confidence intervals, it seems highly unlikely that the impact is more than a factor of 2. The right panel of figure 1 shows AR. The graph shows an inverted U-shape, because r(x, p)Γ_AR(x, p) = 0 whenever p is either 0 or 1. The maximum point estimate of the upper bound is 0.044, while the maximum value of the confidence intervals is 0.153. Therefore, it seems highly unlikely that attending private school increases the chance of entering VSU by more than 15 percentage points. None of our results require strong ignorability or the rare-disease assumption. Our conclusion of a relatively small positive effect of attending private school on

entering VSU, if it exists at all, is reminiscent of existing results in labor economics finding that access to private schools has only modest effects on children's performance (see, e.g., Epple et al. 2017, MacLeod & Urquiola 2019).

8.2 Case-population sampling: joining a criminal gang

We revisit Carvalho & Soares (2016), who combine the 2000 Brazilian Census with a unique survey of drug-trafficking gangs in favelas (slums) of Rio de Janeiro; their dataset is therefore an example of case-population sampling, i.e., our design 2. In their study, they use the method of Lancaster & Imbens (1996) to estimate a model of selection into the gang using race, age, illiteracy, house ownership, and religiosity. They note that these five characteristics are likely to be predetermined, while years of schooling may be endogenous to entry, i.e., joining the gang may lead members to drop out of school. Indeed, 90 percent of gang members are not in school, whereas 46 percent of men aged 10–25 are not in school. They find that “younger individuals, from lower socioeconomic background (black, illiterate, and from poorer families) and with no religious affiliation are more likely to join drug-trafficking gangs.” Table 4 provides summary statistics of the sample. We regard currently not attending school as the treatment variable of interest. Unconfoundedness is not plausible because of the endogeneity of schooling that we mentioned earlier. However, not being in school may increase the chance of exposure to gang-related activities, and those who chose to be in school may be the ones who care about consequences no less than those who chose not to; hence MTR and MTS follow, respectively. Table 5 presents estimation results for β(y) and exp{β(y)}, for which we control for the same covariates as Carvalho & Soares (2016). Unlike table 3, y = 0 now corresponds to the entire population. Therefore, β(0) itself is the log odds ratio aggregated over the population, which is the sharp upper bound for the aggregation of the log causal relative risk, i.e., ϑ̄. The point estimate of β(0) is 2.71, and that of exp{β(0)} is 15.01, which suggests that the chance of those who are not in school

Table 4: Summary statistics of the case-population sample

                 (1)             (2)
                 Case            Population
                 Gang members    Men aged 10–25
Not in school    0.901           0.458
Black            0.269           0.142
Age              16.722          17.526
Illiterate       0.094           0.041
Owns house       0.735           0.832
No religion      0.426           0.237
Sample size      223             17,175

Notes. Each entry shows the sample mean. Age is in years and all other variables are binary indicator variables. The population consists of men aged 10–25 living in Rio's favelas.

joining a gang may be (up to) 15 times as large as that of those who are in school. Our discussion above can be supplemented by checking the causal AR. In figure 2, we trace out the upper bound on the AR as the value of p_0, i.e., the probability of the case (joining a gang) in the population, varies from zero to 0.15. At the three points p_0 ∈ {0.05, 0.10, 0.15} that Carvalho & Soares (2016) considered, the point estimates of the upper bound on the causal AR are 0.33, 0.66, and 0.99, respectively. The uniform confidence band is based on 1,000 bootstrap replications. Note that the confidence band is truncated at one, because the AR cannot be larger than one. Overall, our results are suggestive of potentially large impacts of keeping young men in school in order to discourage them from participating in criminal activities. Further research based on careful study designs would be necessary to reach a more definitive answer.

Table 5: Estimation results of currently not attending school on relative risk of joining a gang

                          (1)             (2)
                          Case            Population
                          Gang members    Men aged 10–25
β(y)                      2.90            2.71
95% confidence interval   [0, 3.36]       [0, 3.19]
exp{β(y)}                 18.10           15.01
95% confidence interval   [1, 28.90]      [1, 24.39]

Note. Race, age, illiteracy, house ownership, and religiosity are linearly controlled for when fitting retrospective binary logistic regression models.

Figure 2: Causal inference on attributable risk: bounding the effects of currently not attending school on joining a gang

[Figure: estimates of the upper bounds on attributable risk with a 95% one-sided uniform confidence band, plotted against the unknown true case probability on [0, 0.15]; the band is truncated at one.]

Note. The figure shows the estimates and the 95% one-sided uniform confidence band for the upper bounds on attributable risk, as functions of the unknown true case probability, i.e., P(Y∗ = 1).

References

Ackerberg, D., Chen, X., Hahn, J. & Liao, Z. (2014), ‘Asymptotic efficiency of semiparametric two-step GMM’, Review of Economic Studies 81(3), 919–943.

Ai, C. & Chen, X. (2003), ‘Efficient estimation of models with conditional moment restrictions containing unknown functions’, Econometrica 71(6), 1795–1843.

Ai, C. & Chen, X. (2012), ‘The semiparametric efficiency bound for models of sequential moment restrictions containing unknown functions’, Journal of Econometrics 170(2), 442–457.

Bhattacharya, J., Shaikh, A. M. & Vytlacil, E. (2008), ‘Treatment effect bounds under monotonicity assumptions: an application to Swan-Ganz catheterization’, American Economic Review: Papers and Proceedings 98(2), 351–56.

Bhattacharya, J., Shaikh, A. M. & Vytlacil, E. (2012), ‘Treatment effect bounds: An application to Swan-Ganz catheterization’, Journal of Econometrics 168(2), 223–243.

Breslow, N. E. (1996), ‘Statistics in epidemiology: the case-control study’, Journal of the American Statistical Association 91(433), 14–28.

Breslow, N. E. & Day, N. E. (1980), Statistical Methods in Cancer Research I. The Analysis of Case-Control Studies, Vol. 1, International Agency for Research on Cancer, Lyon, France.

Breslow, N. E., Robins, J. M. & Wellner, J. A. (2000), ‘On the semi-parametric efficiency of logistic regression under case-control sampling’, Bernoulli 6(3), 447–455.

Carvalho, L. S. & Soares, R. R. (2016), ‘Living on the edge: Youth entry, career and exit in drug-selling gangs’, Journal of Economic Behavior & Organization 121, 77–98.

Chen, H. (2007a), ‘A semiparametric odds ratio model for measuring association’, Biometrics 63(2), 413–421.

Chen, K. (2001), ‘Parametric models for response-biased sampling’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63(4), 775–789.

Chen, X. (2007b), Large sample sieve estimation of semi-nonparametric models, in J. J. Heckman & E. E. Leamer, eds, ‘Handbook of Econometrics’, Vol. 6, Elsevier, chapter 76, pp. 5549–5632.

Chen, X., Linton, O. & Van Keilegom, I. (2003), ‘Estimation of semiparametric models when the criterion function is not smooth’, Econometrica 71(5), 1591–1608.

Choi, D. (2017), ‘Estimation of monotone treatment effects in network experiments’, Journal of the American Statistical Association 112(519), 1147–1155.

Currie, J. & Neidell, M. (2005), ‘Air pollution and infant health: what can we learn from California’s recent experience?’, Quarterly Journal of Economics 120(3), 1003–1030.

Delavande, A. & Zafar, B. (2019), ‘University choice: The role of expected earnings, nonpecuniary outcomes, and financial constraints’, Journal of Political Economy 127(5), 2343–2393.

Didelez, V. & Evans, R. J. (2018), Causal inference from case-control studies, in Ø. Borgan, N. Breslow, N. Chatterjee, M. H. Gail, A. Scott & C. J. Wild, eds, ‘Handbook of Statistical Methods for Case-Control Studies’, CRC Press, Boca Raton, FL, chapter 6, pp. 87–115.

Domowitz, I. & Sartain, R. L. (1999), ‘Determinants of the consumer bankruptcy decision’, Journal of Finance 54(1), 403–420.

Efron, B. (1982), The jackknife, the bootstrap and other resampling plans, Society for Industrial and Applied Mathematics (SIAM).

Epple, D., Romano, R. E. & Urquiola, M. (2017), ‘School vouchers: A survey of the economics literature’, Journal of Economic Literature 55(2), 441–92.

Gabriel, E. E., Sachs, M. C. & Sjölander, A. (2020), ‘Causal bounds for outcome-dependent sampling in observational studies’, Journal of the American Statistical Association pp. 1–12.

Hahn, J. (1998), ‘On the role of the propensity score in efficient semiparametric estimation of average treatment effects’, Econometrica 66(2), 315–331.

Holland, P. W. & Rubin, D. B. (1988), ‘Causal inference in retrospective studies’, Evaluation Review 12(3), 203–231.

Horowitz, J. L. & Manski, C. F. (1995), ‘Identification and robustness with contaminated and corrupted data’, Econometrica 63, 281–302.

Horowitz, J. L. & Manski, C. F. (1997), ‘What can be learned about population parameters when the data are contaminated’, Handbook of Statistics 15, 439–466.

Jiang, Z., Chiba, Y. & VanderWeele, T. J. (2014), ‘Monotone confounding, monotone treatment selection and monotone treatment response’, Journal of Causal Inference 2(1), 1–12.

Kim, W., Kwon, K., Kwon, S. & Lee, S. (2018), ‘The identification power of smoothness assumptions in models with counterfactual outcomes’, Quantitative Economics 9(2), 617–642.

Kreider, B., Pepper, J. V., Gundersen, C. & Jolliffe, D. (2012), ‘Identifying the effects of SNAP (Food Stamps) on child health outcomes when participation is endogenous and misreported’, Journal of the American Statistical Association 107(499), 958–975.

Kuroki, M., Cai, Z. & Geng, Z. (2010), ‘Sharp bounds on causal effects in case-control and cohort studies’, Biometrika 97(1), 123–132.

Lancaster, T. & Imbens, G. (1996), ‘Case-control studies with contaminated controls’, Journal of Econometrics 71(1-2), 145–160.

Machado, C., Shaikh, A. M. & Vytlacil, E. J. (2019), ‘Instrumental variables and the sign of the average treatment effect’, Journal of Econometrics 212(2), 522–555.

MacLeod, W. B. & Urquiola, M. (2019), ‘Is education consumption or investment? Implications for school competition’, Annual Review of Economics 11(1), 563–589.

Manski, C. F. (1995), Identification problems in the social sciences, Harvard University Press.

Manski, C. F. (1997), ‘Monotone treatment response’, Econometrica 65(6), 1311–1334.

Manski, C. F. (2003), Partial identification of probability distributions, Springer-Verlag.

Manski, C. F. (2009), Identification for prediction and decision, Harvard University Press.

Manski, C. F. & Lerman, S. R. (1977), ‘The estimation of choice probabilities from choice based samples’, Econometrica 45(8), 1977–1988.

Manski, C. F. & Pepper, J. V. (2000), ‘Monotone instrumental variables: With an application to the returns to schooling’, Econometrica 68(4), 997–1010.

Månsson, R., Joffe, M. M., Sun, W. & Hennessy, S. (2007), ‘On the estimation and use of propensity scores in case-control and case-cohort studies’, American Journal of Epidemiology 166(3), 332–339.

Molinari, F. (2020), ‘Microeconometrics with partial identification’. arXiv:2004.11751 [econ.EM] https://arxiv.org/abs/2004.11751.

Newey, W. K. (1990), ‘Semiparametric efficiency bounds’, Journal of Applied Econometrics 5(2), 99–135.

Newey, W. K. (1994), ‘The asymptotic variance of semiparametric estimators’, Econometrica 62(6), 1349–1382.

Okumura, T. & Usui, E. (2014), ‘Concave-monotone treatment response and monotone treatment selection: With an application to the returns to schooling’, Quantitative Economics 5(1), 175–194.

Pearl, J. (2009), Causality, Cambridge University Press.

Rose, S. (2011), Causal inference for case-control studies, PhD thesis, Division of Biostatistics, University of California, Berkeley.

Rosenfeld, B. (2017), ‘Reevaluating the middle-class protest paradigm: A case-control study of democratic protest coalitions in Russia’, American Political Science Review 111(4), 637–652.

Tamer, E. (2010), ‘Partial identification in econometrics’, Annual Review of Economics 2(1), 167–195.

Tchetgen Tchetgen, E. J. (2013), ‘On a closed-form doubly robust estimator of the adjusted odds ratio for a binary exposure’, American Journal of Epidemiology 177(11), 1314–1316.

Tchetgen Tchetgen, E. J., Robins, J. M. & Rotnitzky, A. (2010), ‘On doubly robust estimation in a semiparametric odds ratio model’, Biometrika 97(1), 171–180.

Van der Laan, M. J. & Rose, S. (2011), Targeted learning: causal inference for observational and experimental data, Springer Science & Business Media.

VanderWeele, T. J. & Robins, J. M. (2009), ‘Properties of monotonic effects on directed acyclic graphs’, Journal of Machine Learning Research 10(24), 699–718.

Xie, J., Lin, Y., Yan, X. & Tang, N. (2020), ‘Category-adaptive variable screening for ultra-high dimensional heterogeneous categorical data’, Journal of the American Statistical Association 115(530), 747–760.

Appendices to “Causal inference under outcome-based sampling with monotonicity assumptions”

In appendices A and B we discuss aggregation of the odds ratio without taking the logarithm and present the results of a small Monte Carlo experiment. Appendix C presents the proofs.

A Averaging without taking the logarithm

In the main text we followed the convention of using the logarithm of the odds ratio, and we focused on an aggregated version of the logarithm of the odds ratio, i.e., β(y) = E[log{OR(X)} | Y = y] for y = 0, 1. As a result, the central causal parameter in the main text was the logarithm of relative risk. Alternatively, one may want to proceed without taking the logarithm, which does not change the substance of our results, though we now need to consider

ζ := E{θ(X∗)}, ζ(y) := ∫_X θ(x) dF_{X|Y}(x | y), κ(y) := ∫_X OR(x) dF_{X|Y}(x | y)

for y = 0, 1. Again, if the MTR and MTS conditions are satisfied, then we have

1 ≤ ζ(y) ≤ κ(y) (A.1)

under both designs 1 and 2, where the inequalities are sharp. Efficient estimation of κ(y) can be explored exactly in the same way as in section 5.2. Below we present the formula of the efficient influence function, which is an analog of theorem 4.

Theorem A.1. Suppose that assumptions A, G and H hold and that we have a sample obtained by Bernoulli sampling. Then, for y = 0, 1, κ(y) is pathwise differentiable and its pathwise derivative is given by

K_y(Y, T, X) = [Y^y (1 − Y)^{1−y} / {h_0^y (1 − h_0)^{1−y}}] {OR(X) − κ(y)} − OR(X) Δ_0(Y, T, X) / {(1 − h_0) w(X)^y} + OR(X) w(X)^{1−y} Δ_1(Y, T, X) / h_0.

Further, K_y is an element of the tangent space, and therefore, the semiparametric efficiency bound for κ(y) is given by E[{K_y(Y, T, X)}²].

We can construct efficient estimators of κ(y) and carry out causal inference on ζ by using the same idea as we described in section 5.2. We do not repeat all the details for brevity. In general we have E{log OR(X∗)} ≤ log E{OR(X∗)} by Jensen's inequality. Therefore, an average of the log odds ratio is less likely to be affected unduly by outliers than an average of the odds ratio itself. This is another merit of using the logarithm of the odds ratio, in addition to the usual advantage that it corresponds to the coefficient of the treatment variable when a parametric logistic model is used.
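As a quick numerical illustration of this Jensen gap, the sketch below compares the two aggregates for a hypothetical lognormal distribution of OR(X); the distribution is ours for illustration only, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical draws of the conditional odds ratio OR(X); any non-degenerate
# positive distribution exhibits the same Jensen gap.
odds_ratios = rng.lognormal(mean=0.5, sigma=1.0, size=100_000)

avg_log_or = np.log(odds_ratios).mean()   # analog of the aggregate E[log OR(X) | Y = y]
log_avg_or = np.log(odds_ratios.mean())   # analog of log kappa(y) = log E[OR(X) | Y = y]

# Jensen's inequality: E[log OR(X)] <= log E[OR(X)], strict unless OR(X) is constant.
assert avg_log_or <= log_avg_or
```

For this lognormal example the gap is substantial (about 0.5 versus 1.0), which illustrates how a few large odds ratios inflate the unlogged aggregate.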

B Monte Carlo Experiments

In this section we report the results of a small Monte Carlo experiment. A case-control sample is generated from

X | Y = y ∼ N(µ^{(y)}, Σ^{(y)}) and P(T = 1 | X = x, Y = y) = G(α_0^{(y)} + x⊤α_1^{(y)}),

where G(u) = exp(u)/{1 + exp(u)}, and α_0^{(y)}, α_1^{(y)}, µ^{(y)} and Σ^{(y)} are parameters that may depend on y = 0, 1. In simulations we focus on estimating β(y), which can now be expressed as β(y) = (α_0^{(1)} − α_0^{(0)}) + E(X | Y = y)⊤(α_1^{(1)} − α_1^{(0)}).

With the dimension of X equal to d_x = 5, the parameter values are specified as follows: µ^{(1)} = (1, . . . , 1)⊤, µ^{(0)} = (0, . . . , 0)⊤, and Σ^{(y)} = Σ for y = 0, 1, where the (j, k) element of Σ is Σ_{j,k} = ρ^{|j−k|} with ρ = 0.5; α_0^{(1)} = 0.5, α_1^{(1)} = (1, 1, 0, 0, 0)⊤, α_0^{(0)} = 0, and α_1^{(0)} = (0, 0, 1, 1, 0)⊤. In this design we have β(1) = β(0) = 0.5. In each Monte Carlo replication, we simulate 1,000 observations separately for the Y = 0 and Y = 1 samples (that is, the total sample size is 2,000 and ĥ = 0.5). There were 1,000 Monte Carlo replications.
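The data-generating process above can be sketched as follows; this is our own minimal reproduction of the design (function names such as `draw_case_control` are ours, not the authors' code), and it also verifies the closed form β(1) = β(0) = 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
dx, rho = 5, 0.5

# Sigma_{j,k} = rho^{|j-k|}; the same Sigma is used for y = 0 and y = 1 in this design.
Sigma = rho ** np.abs(np.subtract.outer(np.arange(dx), np.arange(dx)))
mu = {1: np.ones(dx), 0: np.zeros(dx)}
alpha0 = {1: 0.5, 0: 0.0}
alpha1 = {1: np.array([1.0, 1.0, 0.0, 0.0, 0.0]),
          0: np.array([0.0, 0.0, 1.0, 1.0, 0.0])}

def draw_case_control(n_per_class):
    """Draw (X, T) separately within each outcome class y, as in case-control sampling."""
    out = {}
    for y in (0, 1):
        X = rng.multivariate_normal(mu[y], Sigma, size=n_per_class)
        prob = 1.0 / (1.0 + np.exp(-(alpha0[y] + X @ alpha1[y])))  # G is logistic
        out[y] = (X, (rng.uniform(size=n_per_class) < prob).astype(int))
    return out

# Closed form: beta(y) = (alpha0[1] - alpha0[0]) + E(X | Y = y)'(alpha1[1] - alpha1[0]),
# which equals 0.5 for both y = 0 and y = 1 in this design.
beta = {y: (alpha0[1] - alpha0[0]) + mu[y] @ (alpha1[1] - alpha1[0]) for y in (0, 1)}
```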

Table A.1: Results of Monte Carlo Experiments

                    β(1)                  β(0)
             parametric   sieve    parametric   sieve
Mean Bias       0.011     0.070       0.005     0.046
Median Bias     0.012     0.086      -0.001     0.042
RMSE            0.057     0.167       0.033     0.067
Mean AD         0.191     0.330       0.145     0.206
Median AD       0.160     0.283       0.119     0.173
Cov. Prob.      0.944     0.962       0.952     0.962

Note: RMSE stands for the root mean squared error and AD refers to absolute deviation. Cov. Prob. is the coverage probability of the one-sided 95% confidence interval. The results are based on 1,000 Monte Carlo repetitions.

We consider two estimators: (i) a retrospective parametric logistic estimator that uses X as covariates and (ii) a retrospective sieve logistic estimator that uses the linear, quadratic and interaction terms of X as covariates (that is, 2d_x + d_x(d_x − 1)/2 = 20 covariates altogether). Table A.1 summarizes the results of the Monte Carlo experiments. Not surprisingly, the parametric estimator performs better for both β(1) and β(0). It shows almost no bias, with small root mean squared errors and absolute deviations, and its coverage probability is close to the nominal 95%. The sieve estimator exhibits some positive bias, but its performance is satisfactory overall.
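For concreteness, the sieve covariate construction can be sketched as below; this is our own illustration of the covariate count 2d_x + d_x(d_x − 1)/2 = 20, not the authors' code.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
dx = 5
X = rng.normal(size=(1000, dx))

# Linear, quadratic, and pairwise-interaction terms, as in the sieve logit estimator.
interactions = np.column_stack(
    [X[:, j] * X[:, k] for j, k in combinations(range(dx), 2)]
)
basis = np.column_stack([X, X**2, interactions])

# 2 * dx + dx * (dx - 1) / 2 covariates altogether: 20 for dx = 5.
assert basis.shape[1] == 2 * dx + dx * (dx - 1) // 2 == 20
```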

C Auxiliary results

Lemma A.1. We have r(x, p_0) = P(Y∗ = 1 | X∗ = x).

Proof. We focus on design 1; design 2 is similar but simpler. By the Bayes rule and the sampling design,

P(Y∗ = 1 | X∗ = x) = p_0 f_{X∗|Y∗}(x | 1) / {p_0 f_{X∗|Y∗}(x | 1) + (1 − p_0) f_{X∗|Y∗}(x | 0)}
                   = p_0 f_{X|Y}(x | 1) / {p_0 f_{X|Y}(x | 1) + (1 − p_0) f_{X|Y}(x | 0)}. (A.2)

Here, by the Bayes rule again, for y = 0, 1,

f_{X|Y}(x | y) = f_X(x) P(Y = y | X = x) / P(Y = y). (A.3)

Combining equations (A.2) and (A.3) yields the result.
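The combination step can be checked numerically: substituting (A.3) into (A.2) makes the marginal density f_X(x) cancel, so the case-control posterior depends on x only through P(Y = 1 | X = x). The values below are illustrative, not from the paper.

```python
# Illustrative values (ours): case share p0 under design 1, population P(Y = 1),
# and q = P(Y = 1 | X = x).
p0, P1, q = 0.3, 0.05, 0.12

posts = []
for fx in (0.1, 1.0, 7.3):              # arbitrary values of f_X(x)
    f1 = fx * q / P1                    # (A.3) with y = 1
    f0 = fx * (1 - q) / (1 - P1)        # (A.3) with y = 0
    posts.append(p0 * f1 / (p0 * f1 + (1 - p0) * f0))   # (A.2)

# f_X(x) cancels, so the posterior is identical for every value of fx.
assert max(posts) - min(posts) < 1e-12
```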

Lemma A.2. Consider design d ∈ {1, 2}. For t ∈ {0, 1}, we have

P(Y∗ = 1 | T∗ = t, X∗ = x) = r(x, p_0) Π(t | 1, x) / [Π(t | 0, x) + r{x, (2 − d)p_0}{Π(t | 1, x) − Π(t | 0, x)}].

Proof. First, consider design 1. By the Bayes rule and lemma A.1,

P(Y∗ = 1 | T∗ = t, X∗ = x) = r(x, p_0) Π(t | 1, x) / [r(x, p_0) Π(t | 1, x) + {1 − r(x, p_0)} Π(t | 0, x)].

Under design 2, the Bayes rule yields P(Y∗ = 1 | T∗ = t, X∗ = x) = r(x, p_0) Π(t | 1, x)/Π(t | 0, x), where we note that r(x, 0) = 0.
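A numerical sanity check of the unified display in lemma A.2, using arbitrary illustrative values (ours) for r(x, p_0), Π(t | 1, x) and Π(t | 0, x):

```python
def unified(r_p0, r_eval, pi1, pi0):
    """Lemma A.2 display: r_eval plays the role of r{x, (2 - d) p0},
    i.e. r(x, p0) under design 1 and r(x, 0) = 0 under design 2."""
    return r_p0 * pi1 / (pi0 + r_eval * (pi1 - pi0))

r_p0, pi1, pi0 = 0.3, 0.7, 0.4  # arbitrary values in (0, 1)

# Design 1: the denominator rearranges to r*pi1 + (1 - r)*pi0, the Bayes-rule form.
design1 = r_p0 * pi1 / (r_p0 * pi1 + (1 - r_p0) * pi0)
assert abs(unified(r_p0, r_p0, pi1, pi0) - design1) < 1e-12

# Design 2: r(x, 0) = 0, so the formula collapses to r(x, p0) * pi1 / pi0.
assert abs(unified(r_p0, 0.0, pi1, pi0) - r_p0 * pi1 / pi0) < 1e-12
```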

Lemma A.3. Under design d ∈ {1, 2}, we have

P(Y∗ = 1 | T∗ = 1, X∗ = x) / P(Y∗ = 1 | T∗ = 0, X∗ = x) = Γ{x, (2 − d)p_0}.

Proof. It directly follows from the definitions of Γ and r, and lemma A.2.

Lemma A.4. Suppose that assumption E holds. Then, for t = 0, 1 and for all x ∈ X,

(−1)^t [P{Y∗(t) = 1 | X∗ = x} − P(Y∗ = 1 | X∗ = x)] ≤ 0,

where the bounds are sharp.

Proof. Since the two inequalities are similar, we focus on the case of t = 1. Then,

assumption E implies P{Y∗(1) = 1, T∗ = 0 | X∗ = x} ≥ P{Y∗(0) = 1, T∗ = 0 | X∗ = x}, and therefore, P{Y∗(1) = 1 | X∗ = x} is bounded below by

P{Y∗(1) = 1, T∗ = 1 | X∗ = x} + P{Y∗(0) = 1, T∗ = 0 | X∗ = x}.

For sharpness, we know from assumption E that

P{Y∗(1) = 1, T∗ = 0 | X∗ = x} − P{Y∗(0) = 1, T∗ = 0 | X∗ = x} = P{Y∗(1) = 1, Y∗(0) = 0, T∗ = 0 | X∗ = x},

where the right-hand side is unrestricted between 0 and 1.

Lemma A.5. Suppose that assumption F holds. Then, for t = 0, 1 and for all x ∈ X,

(−1)^t [P{Y∗(t) = 1 | X∗ = x} − P(Y∗ = 1 | T∗ = t, X∗ = x)] ≥ 0, where the bounds are sharp.¹

Proof. Since the two inequalities are similar, we focus on the case of t = 1. First,

P{Y∗(1) = 1 | X∗ = x} = P(Y∗ = 1 | T∗ = 1, X∗ = x) P(T∗ = 1 | X∗ = x) + P{Y∗(1) = 1 | T∗ = 0, X∗ = x} P(T∗ = 0 | X∗ = x), (A.4)

where we note from assumption F that there exists some C_x ∈ [0, 1] such that

P(Y∗ = 1 | T∗ = 1, X∗ = x) = P{Y∗(1) = 1 | T∗ = 0, X∗ = x} + C_x. (A.5)

Combining equations (A.4) and (A.5) yields the first inequality in the lemma statement. Therefore,

P{Y∗(1) = 1 | X∗ = x} = P(Y∗ = 1 | T∗ = 1, X∗ = x) − C_x · P(T∗ = 0 | X∗ = x) ≤ P(Y∗ = 1 | T∗ = 1, X∗ = x). (A.6)

¹Furthermore, the proof shows that if 0 < P(T∗ = 1 | X∗ = x) < 1, these inequalities hold with equality if and only if assumption F is satisfied with equality.

Sharpness follows from the fact that C_x is not restricted except that it is between 0 and 1. Also, if P(T∗ = 0 | X∗ = x) > 0, then the last inequality in equation (A.6) holds with equality if and only if C_x = 0.

Lemma A.6. Suppose that assumptions E and F hold. Then, under design 1, for all x ∈ X, we have Γ(x, p_0) ≤ Γ(x, 0).

Proof. By using lemma A.3, we can write

Γ(x, 0) = Γ(x, p_0) · P(Y∗ = 0 | T∗ = 0, X∗ = x) / P(Y∗ = 0 | T∗ = 1, X∗ = x).

So, it suffices to show P(Y∗ = 1 | T∗ = 1, X∗ = x) ≥ P(Y∗ = 1 | T∗ = 0, X∗ = x), which directly follows from assumptions E and F.

D Proofs of the results in the main text

Proof of theorem 1: First, consider θ(x). By assumption C,

θ(x) = P(Y∗ = 1 | T∗ = 1, X∗ = x) / P(Y∗ = 1 | T∗ = 0, X∗ = x),

to which we apply lemma A.3.

Proof of theorem 2: We first show that Γ(x, p) is monotonic in p. Since ∂_p r(x, p) > 0, it suffices to consider the derivative of Γ̃_x(r) := {a_x + r(b_x − a_x)}/{1 − a_x + r(a_x − b_x)}, where a_x := Π(0 | 0, x) and b_x := Π(0 | 1, x). But direct calculation shows that

∂_r Γ̃_x(r) = (b_x − a_x) / {1 − a_x + r(a_x − b_x)}²,

which does not change sign as r varies. Therefore, Γ(x, p) is monotonic in p, and hence the assertion of the theorem immediately follows from theorem 1.

Remark: Consider design 1, and suppose that strong ignorability holds. Then,

b_x − a_x = P(T∗ = 0 | X∗ = x) [P{Y∗(0) = 1 | X∗ = x} − P(Y∗ = 1 | X∗ = x)] / [P(Y∗ = 1 | X∗ = x) P(Y∗ = 0 | X∗ = x)],

A-6 where

P{Y∗(0) = 1 | X∗ = x} − P(Y∗ = 1 | X∗ = x) = P{Y∗(0) = 1, T∗ = 1 | X∗ = x} − P{Y∗(1) = 1, T∗ = 1 | X∗ = x}.

Therefore, if Y∗(1) ≥ Y∗(0) almost surely, then we must have b_x − a_x ≤ 0, which means that Γ(x, p) is decreasing in p.

Proof of theorem 3: The sharp lower bounds on θ(x) under assumption E follow from lemma A.4. Further, from lemma A.5, we know that

θ(x) ≤ P(Y∗ = 1 | T∗ = 1, X∗ = x) / P(Y∗ = 1 | T∗ = 0, X∗ = x), (A.7)

where the bounds are sharp under assumption F. So, apply lemmas A.3 and A.6. Sharpness follows from continuity.

Proof of lemma 1: First, the likelihood of a single observation (Y, T, X) is given by

L(Y, T, X) = {(1 − h_0) P_0(T, X)}^{1−Y} {h_0 P_1(T, X)}^{Y}, (A.8)

where P_y(T, X) = f_{X|Y}(X | y) P(T = 1 | X, Y = y)^T {1 − P(T = 1 | X, Y = y)}^{1−T} for y = 0, 1. Let γ be the parameter indexing regular parametric submodels, where the true value will be denoted by γ_0. Then, the score evaluated at γ_0 is equal to

S(Y, T, X) = (1 − Y) [S_{X|Y}(X | 0) + {T − P(T = 1 | X, Y = 0)} ∂_γ P(T = 1 | X, Y = 0; γ_0) / (P(T = 1 | X, Y = 0){1 − P(T = 1 | X, Y = 0)})]
           + Y [S_{X|Y}(X | 1) + {T − P(T = 1 | X, Y = 1)} ∂_γ P(T = 1 | X, Y = 1; γ_0) / (P(T = 1 | X, Y = 1){1 − P(T = 1 | X, Y = 1)})], (A.9)

where S_{X|Y}(x | y) = ∂_γ log f_{X|Y}(x | y; γ_0) is restricted only by E{S_{X|Y}(X | y) | Y = y} = 0, while the derivatives ∂_γ P(T = 1 | X, Y = y; γ_0) are unrestricted.

Proof of theorem 4: For brevity, we focus on β := β(0); the proof for β(1) is analogous.

Let p_0(x) = P(T = 1 | X = x, Y = 0) and p_1(x) = P(T = 1 | X = x, Y = 1), and we have

β(γ) := ∫_X log[p_1(x; γ){1 − p_0(x; γ)} / ({1 − p_1(x; γ)} p_0(x; γ))] f_{X|Y}(x | 0; γ) dx,

where the ratio inside the logarithm is OR(x; γ), and γ represents regular parametric submodels such that γ_0 is the truth. Since

∂_γ OR(x; γ_0) = OR(x) A_1(x) − OR(x) A_0(x), where A_y(x) := ∂_γ p_y(x; γ_0) / [p_y(x){1 − p_y(x)}] for y = 0, 1, (A.10)

we obtain

∂_γ β(γ_0) = ∫_X [A_1(x) − A_0(x) + log{OR(x)} S_{X|Y}(x | 0)] f_{X|Y}(x | 0) dx. (A.11)

Now, we only need to verify the equality between E{F_0(Y, T, X) S(Y, T, X)} and the right-hand side of equation (A.11), where F_0(Y, T, X) and S(Y, T, X) are given in the theorem statement and (A.9), respectively. Note that F_0(Y, T, X) S(Y, T, X) is equal to

[(1 − Y)/(1 − h_0)] [log{OR(X)} − β − {T − p_0(X)} / (p_0(X){1 − p_0(X)})] [S_{X|Y}(X | 0) + {T − p_0(X)} A_0(X)]
+ (Y/h_0) {f_{X|Y}(X | 0)/f_{X|Y}(X | 1)} [{T − p_1(X)} / (p_1(X){1 − p_1(X)})] [S_{X|Y}(X | 1) + {T − p_1(X)} A_1(X)].

Here, taking expectations directly shows that E{F_0(Y, T, X) S(Y, T, X)} is equal to

E[log{OR(X)} S_{X|Y}(X | 0) − A_0(X) | Y = 0] + E[{f_{X|Y}(X | 0)/f_{X|Y}(X | 1)} A_1(X) | Y = 1],

which is equal to the right-hand side of (A.11) since

E[{f_{X|Y}(X | 0)/f_{X|Y}(X | 1)} A_1(X) | Y = 1] = E[A_1(X) | Y = 0].

Finally, it follows from lemma 1 that F_0 is an element of the tangent space.

Proof of theorem 5: Under strong ignorability, we have

θ_AR(x) = P(Y∗ = 1 | T∗ = 1, X∗ = x) − P(Y∗ = 1 | T∗ = 0, X∗ = x).

Therefore, the statement follows from lemma A.2 and the definition of Γ_AR.

Proof of theorem 6: It immediately follows from theorem 5.

Proof of theorem 7: The lower bound trivially follows by the same argument as in theorem 3. For the upper bound, lemma A.5 shows that

θ_AR(x) ≤ P(Y∗ = 1 | T∗ = 1, X∗ = x) − P(Y∗ = 1 | T∗ = 0, X∗ = x),

where the inequality is sharp under assumption F. Then, by lemma A.2 and the definition of Γ_AR, the right-hand side is equal to r(x, p_0) Γ_AR{x, (2 − d)p_0} under design d ∈ {1, 2}.

Proof of theorem A.1: It follows from the same arguments as in the proof of theorem 4. We omit the details.
