Causal Inference in Case-Control Studies
Total Page:16
File Type:pdf, Size:1020Kb
Causal inference under outcome-based sampling with monotonicity assumptions Sung Jae Jun* Department of Economics, Pennsylvania State University and Sokbae Lee Department of Economics, Columbia University and IFS June 25, 2021 Abstract We study causal inference under case-control and case-population sampling. For this purpose, we focus on the binary-outcome and binary-treatment case, where the parameters of interest are causal relative and attributable risk defined via the potential outcome framework. It is shown that strong ignorability is not always as powerful as it is under random sampling and that certain monotonic- ity assumptions yield comparable results in terms of sharp identified intervals. Specifically, the usual odds ratio is shown to be a sharp identified upper bound on causal relative risk under the monotone treatment response and monotone treatment selection assumptions. We then discuss averaging the conditional (log) odds ratio and propose an algorithm for semiparametrically efficient esti- mation when averaging is based only on the (conditional) distributions of the covariates that are identified in the data. We also offer algorithms for causal inference if the true population distribution of the covariates is desirable for ag- gregation. We show the usefulness of our approach by studying two empirical arXiv:2004.08318v3 [econ.EM] 9 Aug 2021 examples from social sciences: the benefit of attending private school for en- tering a prestigious university in Pakistan and the causal relationship between staying in school and getting involved with drug-trafficking gangs in Brazil. Keywords: Relative risk; attributable risk; odds ratio; partial identification; semipara- metric efficiency; sieve estimation. *The authors would like to thank Guido Imbens, Chuck Manski, Francesca Molinari and seminar participants at Cemmap, Oxford, and Penn State for helpful comments and Leandro Carvalho and Rodrigo Soares for sharing their dataset and help. This work was supported in part by the European Research Council (ERC-2014-CoG-646917-ROMIA) and by the UK Economic and Social Research Council (ESRC) through research grant (ES/P008909/1) to the CeMMAP. 1 1 Introduction Random sampling from the population of interest provides a convenient setup for causal inference, but it may be overly expensive to implement in practice for vari- ous reasons. For instance, some events are so rare that they are unlikely to be ob- served at random and therefore they are likely to be severely under-represented in a random sample. Examples include cancer (Breslow & Day 1980), infant death (Currie & Neidell 2005), consumer bankruptcy (Domowitz & Sartain 1999), enter- ing a highly prestigious university (Delavande & Zafar 2019), and drug trafficking (Carvalho & Soares 2016), among many others. The main focus of this paper is to study causal inference when only an outcome-based sample is available. Specif- ically, data are collected by observing individuals that are randomly drawn from those to whom the event of interest is already known to have occurred, while the comparison group data are obtained similarly by focusing on those to whom the event did not occur (hence, case-control design; see, e.g., Breslow(1996)), or simply from an auxiliary source on the entire population without knowing their case sta- tus (as a consequence, case-population design; see, e.g., Lancaster & Imbens(1996)). Both sampling designs can be analyzed in a unified outcome-based sampling frame- work called Bernoulli sampling (see, e.g., Breslow et al.(2000)). Therefore, our ob- jective is to conduct causal inference under Bernoulli sampling. We focus on the binary-outcome and binary-treatment case while data are only observational, as opposed to experimental. In a lesser-known paper, Holland & Rubin(1988) adopt the potential outcome framework to illustrate how to use the as- sumption of strong ignorability—which implies conditional independence between potential outcomes and treatment given covariates—to identify the counterfactual odds ratio in case-control studies. They then argue that the counterfactual odds ratio approximates the ratio of two potential-outcome probabilities (i.e., causal rela- tive risk) under the rare-disease assumption. Their work is the starting point of this paper. We share their motivation that “retrospective studies are never randomized.” 2 However, strong ignorability is often considered too demanding for observational data; furthermore, in case-control studies, without having the knowledge of the true probability of the case in the population, strong ignorability without the rare-disease assumption is generally not sufficient to point identify the causal relative risk given the covariates. In this paper, we develop new identification results from the perspec- tive of partial identification (see, e.g., Manski 1995, 2003, 2009, Tamer 2010, Molinari 2020, among others), without resorting to strong ignorability, nor to the rare-disease assumption. In particular, we adopt the assumptions of monotone treatment response (Man- ski 1997, MTR hereafter) and monotone treatment selection (Manski & Pepper 2000, MTS hereafter) and show that those assumptions enable us to bound causal pa- rameters in a meaningful way. The MTR and MTS assumptions as well as other related notions of monotonicity have been extensively used for causal inference. See e.g. Bhattacharya et al.(2008, 2012), Pearl(2009), VanderWeele & Robins(2009), Kuroki et al.(2010), Kreider et al.(2012), Jiang et al.(2014), Okumura & Usui(2014), Choi(2017), Kim et al.(2018), and Machado et al.(2019) among others. In particu- lar, Kuroki et al.(2010) consider the MTR assumption to bound causal relative risk, which we will discuss further at the end of this section. We investigate partial identification of adjusted causal relative and attributable risk, that is, the ratio of two counterfactual proportions (conditional on covariates) and the difference between them. We show that strong ignorability is not always as powerful as it is under random sampling and that the MTR and MTS assumptions yield comparable results in terms of sharp identified intervals. Specifically, the odds ratio is shown to be a sharp upper bound on causal relative risk. Therefore, the odds ratio plays a central role for causal inference under the MTR and MTS assumptions that, unlike strong ignorability, allow for endogenous treatment assignment. We apply the same approach to causal attributable risk, although the sharp bounds are more complicated than the odds ratio. In addition to new identification results, we develop easy-to-implement infer- 3 ence algorithms for causal relative and attributable risk. Instead of parametrizing the functional causal parameters, we propose to estimate a scalar parameter that is constructed by averaging them. Since the odds ratio is not only a popular measure of association in case-control studies but also it is a sharp upper bound on causal relative risk, we pay special attention to semiparametrically efficient estimation of the expectation of the log odds ratio using the conditional distributions of the co- variates that are identified in the data. Efficient estimation of the (log) odds ratio is a classical topic, but most papers take a parametric approach with a focus on robust- ness: see, e.g., K. Chen (2001), Tchetgen Tchetgen et al.(2010), and Tchetgen Tchetgen (2013). Our approach of targeting the nonparametrically aggregated versions of the odds ratio, which can be efficiently estimated via sieves (e.g., X. Chen (2007)), of- fers a natural alternative. We derive the semiparametric efficiency bound for the aggregated log odds ratio and then propose an easy-to-implement sieve-based es- timation algorithm. We also offer algorithms for causal inference if the true popu- lation distribution of the covariates is desirable for aggregation. Causal inference procedures are presented in detail in section7, and an accompanying R package is available on the Comprehensive R Archive Network (CRAN) at https://CRAN.R- project.org/package=ciccr. We revisit the datasets studied by Delavande & Zafar(2019) and Carvalho & Soares(2016) to illustrate how our methods can be used in case-control and case- population studies, respectively. The first example uses the data based on surveys of the students who entered a very selective university in Pakistan and those who did not, whereas the second example data combine a survey of drug-trafficking gangs in Rio de Janeiro with the 2000 Brazilian Census. Therefore, they correspond to the case-control and case-population designs, respectively. Using these datasets, we ad- dress new research questions that are not examined in the original papers: (i) the causal effect of attending private school on entering a very selective university and (ii) the causal impact of not staying in school on joining a criminal gang. In both of the examples, strong ignorability is implausible, but the MTR and MTS assump- 4 tions can be reasonably well defended. Our results show that attending private high school increases the chance of entering a prestigious university only by a factor of 1.21 at the maximum: therefore, there is no substantial causal benefit here. Also, our analysis on the Brazilian data shows that those who are not in school may have a higher chance of joining a criminal gang by a factor of 15 times in the worst case. We now discuss the relation to the existing literature on causal inference and outcome-based sampling.M ansson˚ et al.(2007) point out that the propensity score method has only limited ability to control for potential confounding factors in case- control studies. Our method does not rely on strong ignorability nor on the propen- sity score. Rose(2011) and Van der Laan & Rose(2011) propose a method of causal inference based on the assumption that the true case probability is known by a prior study. We focus on the case of unknown case probability; however, it is straightfor- ward to modify our inference procedures if the case probability is indeed known. Didelez & Evans(2018) provide an excellent and extensive survey on causal infer- ence under case-control sampling, but no discussion on partial identification ap- proaches can be found there.