
Causal Modeling with Stochastic Confounders

Thanh Vinh Vo 1, Pengfei Wei 1, Wicher Bergsma 2, Tze-Yun Leong 1. 1School of Computing, National University of Singapore; 2Department of Statistics, London School of Economics and Political Science. [email protected], [email protected], [email protected], [email protected]

Abstract

This work extends causal inference in temporal models with stochastic confounders. We propose a new approach to variational estimation of causal inference based on a representer theorem with a random input space. We estimate causal effects involving latent confounders that may be interdependent and time-varying from sequential, repeated measurements in an observational study. Our approach extends current work that assumes independent, non-temporal latent confounders, with potentially biased estimators. We introduce a simple yet elegant algorithm without parametric specification on model components. Our method avoids the need for expensive and careful parameterization in deploying complex models, such as the deep neural networks used in existing approaches, for causal inference and analysis. We demonstrate the effectiveness of our approach on various benchmark temporal datasets.

1 INTRODUCTION

Estimating the causal effects of an intervention or treatment on an outcome is an important problem in scientific investigations and real-world applications. For example: What is the effect of sleep deprivation on health outcomes? How would family socio-economic status affect career prospects? What is the impact of a disease outbreak on the regional stock markets? The existence of confounders, or confounding variables that affect both the treatment and the outcome, may produce bias in the estimation and hence complicate the causal inference process. Classical approaches to dealing with observed or visible confounders are the propensity score-based methods and their variants (Rubin, 2005). In cases with latent or hidden confounders, however, the treatment effects on the outcome cannot be directly estimated without further assumptions (Pearl, 2009; Louizos et al., 2017). For example, in crop planting, rainfall received in a field may affect both the fertilizer (since it may drain the fertilizer off the field) and the crop yield (through the water the plants receive). In this situation, rainfall is the hidden confounder as it may not be directly measurable in real life.

Recent studies in causal inference (Shalit et al., 2017; Louizos et al., 2017; Madras et al., 2019) mainly focus on static settings, i.e., the observational data are time-independent with independent and identically distributed (iid) noise. In many real-world applications, however, events change over time, e.g., each participant may receive an intervention multiple times and the timing of these interventions may differ across participants. In this case, the time-independent assumption does not hold, and causal inference would degenerate in models that fail to capture the nature of time-dependent data. In practice, temporal confounders such as seasonality and long-term trends can potentially contribute to bias. For example, the confounding soil fertility in crop planting may change over time, due to different reasons such as annual rainfall or soil erosion. Whenever soil fertility, which may or may not be directly measured, declines (or rises), it would possibly affect the future level of fertility. In real-life situations, there may be many such latent confounders that may not be interpretable. This motivates us to propose a causal modeling framework that captures these latent confounding variables that change over time, and a causal inference method that mitigates the stochastic confounding problem and reduces bias in estimating causal effects.

Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA. PMLR: Volume 130. Copyright 2021 by the author(s).

In this work, we introduce a framework to characterize the latent confounding effects over time in causal inference based on the structural causal model (SCM)

(Pearl, 2000). Inspired by recent work (e.g., Riegg, 2008; Louizos et al., 2017; Madras et al., 2019) that handles static, independent confounders, we relax this assumption by modeling the confounders as a stochastic process. This approach generalizes the independent setting to model confounders that have intricate patterns of interdependencies over time.

Many existing causal inference methods (e.g., Louizos et al., 2017; Shalit et al., 2017; Madras et al., 2019) exploit recent developments in deep neural networks. While effective, the performance of a neural network depends on many factors such as its structure (e.g., the number of layers, the number of nodes, or the activation function) or the optimization algorithm. Tuning a neural network is challenging; different conclusions may be drawn from different network settings. To overcome these challenges, we propose a nonparametric variational method to estimate the causal effects of interest using a kernel as a prior over the reproducing kernel Hilbert space (RKHS). Our main contributions are summarized as follows:

• We introduce a Causal Inference with Stochastic Confounders (CausalSC) method for temporal data that captures the interdependencies of confounders. This relaxes the independent confounders assumption in recent work (e.g., Louizos et al., 2017; Madras et al., 2019). Under this setting, we introduce the concepts of causal path effects and intervention effects, and derive approximation measures of these quantities.

• Our framework is robust and simple for accurately learning the relevant causal effects: learning causal effects quantifies how an outcome is expected to change if we vary the treatment level or intensity. Our algorithm requires no information about how these variables are parametrically related, in contrast to the need for parameterizing a neural network.

• We develop a nonparametric variational estimator by exploiting the kernel trick in our temporal causal framework. This estimator has a major advantage: complex non-linear functions can be used to modulate the SCM with estimated parameters that turn out to have analytical solutions. We empirically demonstrate the effectiveness of the proposed estimator.

2 BACKGROUND AND RELATED WORK

The structural causal model (SCM) (Pearl, 1995, 2000) is a general causal modeling framework that builds on several seminal works, including structural equation models (Goldberger, 1972, 1973; Duncan, 1975, 2014), the potential outcomes framework of Neyman (1923) and Rubin (1974), and graphical models (Pearl, 1988). The SCM consists of a triplet: a set of exogenous variables whose values are determined by factors outside the framework, a set of endogenous variables whose values are determined by factors within the framework, and a set of structural equations that express the value of each endogenous variable as a function of the values of the other (endogenous and exogenous) variables. Figure 1 (a) shows a causal graph with endogenous variables Y, Z, W. Here Y is the outcome variable, W is the intervention variable, and Z is the confounder variable. Exogenous variables are not affected by any other variables in the model and are not shown explicitly in the graph. Causal inference evaluates the effects of an intervention on the outcome, i.e., p(Y | do(W = w)), the distribution of the outcome Y induced by setting W to a specific value w. Our work focuses on the problem of estimating causal effects, which is different from the work on identifying causal structure (or causal discovery) (Spirtes et al., 2000), for which several approaches have been proposed, e.g., Tian and Pearl (2001); Peters et al. (2013, 2014); Jabbari et al. (2017); Huang et al. (2019).

Estimators with unobserved confounders. The following efforts take into account unobserved confounders in causal inference: Montgomery et al. (2000); Riegg (2008); de Luna et al. (2017); Kuroki and Pearl (2014); Louizos et al. (2017); Madras et al. (2019). Specifically, proxy variables are introduced to replace or infer the latent confounders. For example, the household income of a student is a confounder that affects the ability to afford private tuition and hence the academic performance; it may be difficult to obtain income information directly, so proxy variables such as zip code or education level are used instead. Figures 1 (b) and (c) present two causal graphs used by the recent causal inference algorithms in Louizos et al. (2017); Madras et al. (2019). The graphs contain latent confounder Z, proxy variable X, intervention W, outcome Y, and observed confounder S.

Estimators with observed confounders. Hill (2011); Shalit et al. (2017); Alaa and van der Schaar (2017); Yao et al. (2018); Yoon et al. (2018); Künzel et al. (2019); Oprescu et al. (2019); Nie and Wager (2020) follow the formalism of potential outcomes; these works do not take into account latent confounders but satisfy the strong ignorability assumption of Rosenbaum and Rubin (1983). Of interest is the inference mechanism of Alaa and van der Schaar (2017), where the authors model the counterfactuals as functions living in a reproducing kernel Hilbert space. In contrast, we work in a time series setting within a structural causal framework whose inference scheme directly benefits from the generalization of the empirical risk minimization framework.

Estimators for temporal data. Little attention has been paid to learning causal effects in non-iid settings (Guo et al., 2020; Bica et al., 2020b). Lu et al. (2018) formalize the Markov decision process under a causal perspective, with an emphasis mainly on sequential optimization; this work does not discuss the estimation of average treatment effects from observational data. Li and Bühlmann (2018) model potential outcomes for temporal data with observed confounders using state-space models; the models proposed are linear and quadratic regressions. Ning et al. (2019) propose causal estimates for temporal data with observed confounders using linear models and develop Bayesian inference methods. Bojinov and Shephard (2019) generalize the potential outcomes framework to temporal data and propose an inference method based on the Horvitz–Thompson estimator (Horvitz and Thompson, 1952); this method, however, does not consider the existence of unobserved confounders. Bica et al. (2020a,b) formalize potential outcomes with observed and unobserved confounders to estimate counterfactual outcomes of treatment plans for each individual, where the outcomes are modelled with recurrent neural networks. The observed data of each individual in these two works are temporal and the individuals are independent of each other; in other words, there is a collection of observed time series for each individual. Our work, on the other hand, has no notion of individuals. We instead observe one time series for each feature of the data, i.e., the outcome, treatment, covariates and observed confounders (covariates and observed confounders may have several features). Thus, the problem setup of our work is different from that of Bica et al. (2020a,b). Several efforts (e.g., Kamiński et al., 2001; Eichler, 2005, 2007, 2009; Eichler and Didelez, 2010; Bahadori and Liu, 2012) analyse causation for temporal data based on the notion of Granger causality (Granger, 1980). These works focus on discovering causal structures, which is different from our work that estimates causal effects over a period of time.

Figure 1: The SCM framework: approaches to modeling causality. (a) all variables are observed; (b) the confounder Z is latent and is inferred using the proxy variable X (Louizos et al., 2017); (c) there is an additional observed confounder S (Madras et al., 2019).

Figure 2: Causal inference with Stochastic Confounders (CausalSC) for temporal data.

3 OUR APPROACH

We introduce a Causal inference with Stochastic Confounders (CausalSC) framework based on the SCM, as illustrated in Figure 2. Following Frees et al. (2004), we evaluate the causal effects within a time interval denoted by t. We assume that the interval is large enough to cover the effects of the treatment on the outcome. This assumption is practical as many interval-censored datasets are recorded monthly or yearly.

The latent confounder z. In real-world applications, capturing all the potential confounders is not feasible, as some important confounders might not be observable. When there are unobserved or latent confounders, causal inference from observational data is even more challenging and, as discussed earlier, can result in a biased estimation. The increasing availability of large and rich datasets, however, enables proxy variables for unobserved confounders to be inferred from other observed and correlated variables. In practice, the exact nature and structure of the hidden or latent confounder z are unknown. We assume the structural equation of the latent confounder at time interval t to be

    z_t = f_z(z_{t-1}) + e_t,    (1)

where the exogenous variable e_t ~ N(0, σ_z² I_{d_z}), the function f_z : Z → F_z with Z the set containing z_t and F_z a Hilbert space, σ_z is a hyperparameter, and I_{d_z} denotes the identity matrix. The choice of this structural equation is reasonable because each dimension of z_t maps to a real value, which gives a wide range of possible values for z_t. The function f_z(·) controls the effect of the confounder at time interval t−1 on its future value at time interval t. With reference to an earlier example, the soil fertility in crop planting can be considered a confounder, and this quantity changes over time. When the fertility of the soil at time interval t−1 declines, possibly because of soil erosion, the fertility of the soil at time interval t may also decline. Furthermore, the Gaussian assumption imposed on the exogenous variable e_t makes subsequent calculations computationally tractable.
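The confounder process in Eq. (1) is easy to simulate directly. The following minimal Python sketch draws a path from z_t = f_z(z_{t-1}) + e_t; the particular f_z used here (a tanh of a random linear map) is an illustrative assumption for the example, not the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_confounder(T, d_z, sigma_z, f_z):
    """Draw a path of the latent confounder: z_t = f_z(z_{t-1}) + e_t,
    with exogenous noise e_t ~ N(0, sigma_z^2 I_{d_z}) and z_0 = 0."""
    z = np.zeros((T, d_z))
    for t in range(1, T):
        z[t] = f_z(z[t - 1]) + sigma_z * rng.standard_normal(d_z)
    return z

# Illustrative (hypothetical) f_z: a smooth nonlinear map that keeps the path stable.
A = 0.5 * rng.standard_normal((3, 3))
z = simulate_confounder(T=200, d_z=3, sigma_z=0.1, f_z=lambda v: np.tanh(A @ v))
print(z.shape)  # (200, 3)
```

Because f_z is contractive here, the simulated confounder fluctuates around zero rather than diverging, which keeps the downstream structural equations numerically well behaved.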

The observed variables y, w, x and s. The d_x-dimensional observed features at time interval t are denoted by x_t ∈ R^{d_x}. Similarly, the treatment variable at time interval t is given by w_t; y_t and s_t denote the outcome and the observed confounder at time interval t, respectively, and z_t denotes an unobserved confounder. Let Y, W, S, X be sets containing y_t, w_t, s_t, x_t, respectively, and let F_y, F_w, F_s, F_x be Hilbert spaces. Let f_y : Z × W × S → F_y, f_w : Z × S → F_w, f_s : S → F_s, f_x : Z × S → F_x. We postulate the following structural equations:

    s_t = f_s(s_{t-1}) + o_t,      y_t = f_y(w_t, z_t, s_t) + v_t,    (2)
    x_t = f_x(z_t, s_t) + r_t,     w_t = 1(ϕ(f_w(z_t, s_t)) ≥ u_t),   (3)

where the exogenous variables o_t ~ N(0, σ_s² I_{d_s}), r_t ~ N(0, σ_x² I_{d_x}), v_t ~ N(0, σ_y²) and u_t ~ U[0, 1]; 1(·) denotes the indicator function and ϕ(·) is the logistic function. The last structural equation implies that w_t is Bernoulli distributed given z_t and s_t, with p(w_t = 1 | z_t, s_t) = ϕ(f_w(z_t, s_t)). Similar reasoning holds for these assumptions in that they afford computational tractability in terms of the time series formulation. Our model assumes that the treatment and outcome at time interval t−1 are independent of the treatment and outcome at time interval t given the confounders. These assumptions are reasonable in real life: for example, in the context of agriculture, the crop yield (outcome) of the current crop season probably does not affect the crop yield in the next season, and similarly for the chosen fertilizer (treatment). However, they may be correlated through the latent confounders, e.g., the soil fertility of the land.

Learnable functions. With Eqs. (1)-(3), we need to learn f_i, where i ∈ A := {y, w, z, x, s}. Standard methods to model these functions include linear models or multi-layered neural networks. Selecting models for these functions relies on many problem-specific aspects such as the type of data (e.g., text, images), dataset size (e.g., hundreds, thousands, or millions of data points), and data dimensionality (high or low). We propose to model these functions through an augmented representer theorem, described in Section 4.

3.1 Causal Quantities of Interest

Temporal data capture the evolution of characteristics over time. Based on earlier work (Louizos et al., 2017; Pearl, 2009; Madras et al., 2019), we evaluate causal inference from temporal data by assuming that the confounders satisfy Eq. (1). The corresponding causal graph is shown in Figure 2. This relaxes the independence assumption in Louizos et al. (2017) and Madras et al. (2019). In particular, we aim to measure the causal effects of w_t on y_t given the covariate x_t, where x_t serves as the proxy variable to infer the latent confounder z_t. This formulation subsumes the earlier approaches (Louizos et al., 2017; Madras et al., 2019).

Considering multiple time intervals, we further denote the vector notations for T time intervals as follows: y = [y_1, ..., y_T]^⊤, w = [w_1, ..., w_T]^⊤, s = [s_1, ..., s_T]^⊤, x = [x_1, ..., x_T]^⊤, z = [z_1, ..., z_T]^⊤, z_0 = ∅, s_0 = ∅. We define fixed-time causal effects as the causal effects at a time interval t, and range-level causal effects as the average causal effects over a time range. The fixed-time and range-level causal effects can be estimated using Pearl's do-calculus (Pearl, 2009). We model the unobserved confounder processes using the latent variable z_t, inferred for each observation (x_t, s_t, w_t, y_t) at time interval t. The interval is assumed to be large enough to cover the effects of the treatment w_t on the outcome y_t; this assumption is practical as many interval-censored datasets are recorded monthly or yearly.

Definition 1. Let w_1 and w_2 be two treatment paths. The treatment effect path (or effect path) at t ∈ [1, 2, ..., T] is defined as

    EP_t := E[ y_t | do(w = w_1), x ] − E[ y_t | do(w = w_2), x ].    (4)

The effect path (EP) is the collection of causal effects at time intervals t ∈ [1, 2, ..., T].

Definition 2. Let EP be the effect path satisfying Definition 1, and let EP_t ∈ EP be the effect at time interval t. The average treatment effect (ATE) in [T_1, T_2] is defined as

    ATE := ( Σ_{t=T_1}^{T_2} EP_t ) / (T_2 − T_1 + 1).    (5)

The average treatment effect (ATE) quantifies the effects of a treatment path w_1 over an alternative treatment path w_2. This work focuses on evaluating causal effects with binary treatment. The key quantity in estimating the effect path and the average treatment effect is p(y | do(w), x), the distribution of y given x after setting the variable w by an intervention. Following Pearl's back-door adjustment (Pearl, 2009) and invoking the properties of d-separation, the causal effect of w on y given x with respect to the causal graph in Figure 2 is

    p(y | do(w), x) = ∫ p(y | w, z, s) p(z, s | x) ds dz.    (6)

The expression in Eq. (6) typically does not have an analytical solution when the distributions p(y | w, z, s) and p(z, s | x) are parameterized by more involved functions, e.g., a nonlinear function such as a multi-layered neural network. Thus, the empirical expectation of y is used as an approximation. To do so, we first draw samples of z and s from p(z, s | x), and then substitute these samples into p(y | w, z, s) to draw samples of y. The whole procedure is carried out using forward sampling techniques under specific forms of

p(y | w, z, s), p(z | x, s, w, y), p(y | s, x, w), p(w | s, x) and p(s | x). In the next section, we present approximations of these distributions.

4 ESTIMATING CAUSAL EFFECTS

Estimating causal effects requires systematic sampling from the following distributions: p(s | x), p(w | s, x), p(y | s, x, w), p(z | x, s, w, y) and p(y | w, z, s). This section presents approximations to these distributions.

4.1 The Posterior of Latent Confounders

Exact inference of z is intractable for many models, such as multi-layered neural networks. Hence, we infer z using variational inference, which approximates the true posterior p(z | x, s, w, y) by a variational posterior q(z | x, s, w, y). This approximation is obtained by minimizing the Kullback-Leibler divergence D_KL[ q(z | x, s, w, y) || p(z | x, s, w, y) ], which is equivalent to maximizing the evidence lower bound (ELBO) of the marginal likelihood:

    L = E_z[ log p(y, w, s, x | z) ] − Σ_{t=1}^T E_{z_{t−1}}[ D_KL( q(z_t | y_t, x_t, w_t, s_t) || p(z_t | z_{t−1}) ) ].    (7)

The expectations are taken with respect to the variational posterior q(z | x, s, w, y), and each term in the ELBO depends on the assumed distribution families presented in Section 3. We further assume that the variational posterior takes the form q(z_t | ·) = N(z_t | f_q(y_t, w_t, s_t, x_t), σ_q² I_{d_z}), where f_q : Y × W × S × X → F_z is a function parameterizing the designated distribution, σ_q is a hyperparameter, and I_{d_z} denotes the identity matrix. Our aim is to learn f_i where i ∈ A ← A ∪ {q}.

4.1.1 Inference of the Posterior

To formulate a regularized empirical risk, we draw L samples of z from the variational posterior using the reparameterization trick: z^l = [z_1^l, ..., z_T^l] with z_t^l = f_q(y_t, w_t, s_t, x_t) + σ_q ε_t^l and ε_t^l ~ N(0, I_{d_z}). By drawing L noise samples ε_t^1, ..., ε_t^L at each time interval t in advance, we obtain a complete dataset

    D = ∪_{l=1}^L ∪_{t=1}^T { (y_t, w_t, x_t, s_t, z_t^l) }.

At each time interval t, the dataset gives a tuple of the observed values y_t, w_t, x_t, s_t, and an expression of z_t^l = f_q(y_t, w_t, s_t, x_t) + σ_q ε_t^l. We state the following:

Lemma 1. Let κ_i be kernels and H_i their associated reproducing kernel Hilbert spaces (RKHS), where i ∈ A. Let L̂ be the empirical risk obtained from the negative ELBO, and consider minimizing the objective function

    J = L̂( {f_i}_{i∈A} ) + Σ_{i∈A} λ_i ||f_i||²_{H_i}    (8)

with respect to the functions f_i (i ∈ A), where λ_i ∈ R⁺. Then the minimizer of (8) has the form f_i = Σ_{l=1}^{T×L} κ_i(·, ν_l^i) β_l^i (∀ i ∈ A), where ν_l^i is the l-th input to function f_i, i.e., a subset of the l-th tuple of D, and the coefficients β_l^i are vectors in the Hilbert space F_i. This minimizer further admits the following solution: β^i = [β_1^i, ..., β_{TL}^i]^⊤ = ( Σ_{l=1}^L (K_i^l)^⊤ K_i^l + λ_i L K_i )^{-1} Σ_{l=1}^L (K_i^l)^⊤ ψ^i for i ∈ A \ {w, q}, where ψ^y = y, ψ^x = x, ψ^s = s and ψ^z = K_q β^q.

As mentioned in Section 3, we propose to model f_i using kernel methods. A natural way is to develop a slight modification of the classical representer theorem (Kimeldorf and Wahba, 1970; Schölkopf et al., 2001) so that it can be applied to the optimization of the ELBO. Eq. (8) is minimized with respect to the weights β^i and the hyperparameters of the kernels. The proof of Lemma 1 is provided in the Appendix.

Lemma 2. For any fixed β^q, the objective function J in Eq. (8) is convex with respect to β^i for all i ∈ A \ {q}.

Proof. We present a sketch of the proof here; details are deferred to the Appendix. It can be shown that the objective function is a combination of several components, including (β^i)^⊤ C β^i, c^⊤ β^i, −w^⊤ log ϕ(K_w^l β^w) and −(1 − w)^⊤ log ϕ(−K_w^l β^w), where i ∈ {y, s, x, z}, C is a positive semi-definite matrix and c is a vector. The first component is a quadratic form, so its second-order derivative with respect to β^i is a positive semi-definite matrix; hence it is convex. The second term is a linear function of β^i and thus convex. The two last terms come from cross-entropy loss functions whose inputs are linear combinations of the kernel function evaluated between each pair of data points, so these two terms are also convex. Consequently, J is convex with respect to β^i (i ∈ A \ {q}) because it is a linear combination of convex components.

Lemma 2 implies that at an iteration of the optimization at which β^q reaches its convex hull, the objective function J will reach its minimal point after a few more iterations. This is because the non-convexity of J is induced only by β^q. This result indicates that we should attempt different random initializations of β^q, rather than of the other parameters, when optimizing J, because β^i (i ∈ A \ {q}) always converges to its optimal point (conditioned on β^q).
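Structurally, the solution for β^i in Lemma 1 is a kernel-ridge-style linear solve. A minimal numpy sketch under illustrative assumptions: toy sizes, an RBF Gram matrix, and the L Gram matrices taken identical for brevity (in the actual algorithm they differ across the L reparameterized samples).

```python
import numpy as np

def solve_beta(K_list, K, psi, lam):
    """beta = (sum_l K_l^T K_l + lam * L * K)^{-1} sum_l K_l^T psi,
    mirroring the closed form stated in Lemma 1."""
    L = len(K_list)
    A = sum(Kl.T @ Kl for Kl in K_list) + lam * L * K
    b = sum(Kl.T @ psi for Kl in K_list)
    return np.linalg.solve(A, b)

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)            # RBF Gram matrix over the inputs
K_list = [K] * 5                 # L = 5 completed datasets (identical only for brevity)
psi = rng.standard_normal(50)    # target vector, e.g. psi^y = y
beta = solve_beta(K_list, K, psi, lam=1e-2)
print(beta.shape)  # (50,)
```

The regularizer lam * L * K makes the system matrix strictly positive definite for distinct inputs, so the solve is well posed without any iterative optimization for these components.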

4.2 The Auxiliary Distributions

The previous steps approximate the posterior p(z | ·) by the variational posterior q(z | ·) and estimate the density of p(y | w, z, s). This section outlines the approximation of p(y | s, x, w), p(w | s, x) and p(s | x). Denoting the corresponding approximations by p̃(y | s, x, w), p̃(w | s, x) and p̃(s | x), we estimate the parameters of these distributions directly using the classical representer theorem.

In the following, we briefly describe how to approximate p̃(w | s, x). The regularized empirical risk obtained from the negative log-likelihood of p̃(w | s, x) is

    J_w = Σ_{t=1}^T ℓ_Xent( w_t, g_w(w_{t-1}, s_t, x_t) ) + δ_w ||g_w||²_{V_w},

where ℓ_Xent(·, ·) is the cross-entropy loss function, δ_w ∈ R⁺, g_w : W × S × X → F_w, and V_w is the RKHS with kernel function τ_w(·, ·). By the classical representer theorem, the minimizer g_w has the form

    g_w = Σ_{j=1}^T α_j^w τ_w(·, [w_{j-1}, s_j, x_j]),

where α_j^w ∈ R is a parameter to be learned. It can be shown that J_w is a convex objective function because the input to the cross-entropy function is linear. The other distributions can be estimated in a similar fashion; the full description is provided in the Appendix.

5 EXPERIMENTS

Baselines and aims of experiments. In this section, we examine the performance of the proposed CausalSC in estimating causal effects from temporal data on both synthetic and real-world datasets. We compare with the following baselines. (1) The potential outcomes-based model for time series data by Bojinov and Shephard (2019), which uses the Horvitz–Thompson estimator to evaluate causal effects. The key factor of this method is the 'adapted propensity score'; we have implemented two versions of this score. The first one uses a fully connected neural network: we assume that p(w_t | w_{1:t-1}, y_{1:t-1}) = Bern(w_t | f(w_{t-1}, y_{t-1})), where f(w_{t-1}, y_{t-1}) is a neural network taking the observed values of w_{t-1}, y_{t-1} as input to predict w_t. The second one uses Long Short-Term Memory (LSTM) to estimate p(w_t | w_{1:t-1}, y_{1:t-1}). (2) The second baseline is CFRNet, a model for inferring treatment effects by Shalit et al. (2017). (3) The third baseline, CEVAE (Louizos et al., 2017), is a causal inference model based on variational auto-encoders. We use the publicly available code of CEVAE and CFRNet to train these models. (4) The last baseline is fairness through causal awareness (FCA) by Madras et al. (2019). This method is an extension of CEVAE that considers two types of confounders: observed and latent ones.

For the neural network setup of each baseline, we closely follow the architectures of Louizos et al. (2017) and Shalit et al. (2017). Unless otherwise stated, we utilize a fully connected network with the exponential linear unit (ELU) activation function (Clevert et al., 2016) and use the same number of nodes in each hidden layer. We fine-tune the network with 2, 4, 6 hidden layers and 50, 100, 150, 200, 250 nodes per layer. We also fine-tune the learning rate in {10^-1, 10^-2, 10^-3, 10^-4} and use Adamax (Kingma and Ba, 2015) for optimization.

Evaluation metrics. To evaluate the performance of each method, we report the absolute error of the ATE, ε_ATE := |ATE − ÂTE|, and the precision in estimating heterogeneous effects (PEHE) (Hill, 2011), ε_PEHE := Σ_{i=T_1}^{T_2} (EP_i − ÊP_i)² / (T_2 − T_1 + 1).

5.1 Illustration on Modeling with Stochastic Confounders

Before carrying out our main experiments, we first illustrate the importance of our proposed method in estimating causal effects for time series data with stochastic confounders. We consider a ground-truth causal model whose structural equations are as follows:

    s_t = a_0 + a_1 s_{t-1} + o_t,
    w_t = 1( ϕ(b_0 + b_1 s_t) ≥ u_t ),
    y_t = f_y(s_t, w_t) + v_t,

where 1(·) is the indicator function, s_0 = 0, o_t ~ N(0, 0.3²), u_t ~ U[0, 1], and v_t ~ N(0, 1). In this model, s_t, w_t, y_t are endogenous variables and o_t, u_t, v_t are exogenous variables. The functions f_i (i ∈ A \ {y}) in this model are linear. We consider three cases for the ground truth of the function f_y:

(1) Linear: f_y(s_t, w_t) = c_0 + c_1 s_t + c_2 w_t.
(2) Quadratic: f_y(s_t, w_t) = (c_0 + c_1 s_t + c_2 w_t)².
(3) Exponential: f_y(s_t, w_t) = exp(c_0 + c_1 s_t + c_2 w_t).

For all cases, we randomly choose the true parameters of the models as (a_0, a_1, b_0, b_1, c_0, c_1, c_2) = (0.70, 0.95, 0.20, 0.10, 0.70, 0.40, 1.70), and then sample three time series of length T = 1000 for s_t, w_t, y_t from the above distributions. We keep w_t, y_t, s_t as the observed data and use the following three inference models to estimate causal effects: Model-1: without confounders; Model-2: with iid confounders; Model-3: with stochastic confounders.

Table 1: The errors of the estimated treatment effects (lower is better).

                         Model-1                  Model-2                  Model-3
                         (without confounders)    (with iid confounders)   (CausalSC)
                         √PEHE      ε_ATE         √PEHE      ε_ATE         √PEHE      ε_ATE
    Linear outcome       0.06±.01   0.06±.01      0.05±.01   0.05±.01      0.05±.01   0.05±.01
    Quadratic outcome    2.53±.08   2.26±.06      1.78±.03   0.88±.03      0.33±.02   0.26±.02
    Exponential outcome  4.52±.12   3.36±.15      4.18±.20   2.41±.07      0.91±.03   0.78±.06

The errors reported in Table 1 show that the model with stochastic confounders outperforms the others as it fits the data well.
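The illustration model above can be reproduced in a few lines. A sketch for the linear-outcome case, assuming the treatment equation depends on the confounder s_t (consistent with the endogenous variables listed for this model; the extracted text is ambiguous on this symbol) and using the stated parameter values, under which the ground-truth effect of w_t on y_t is c_2 at every t:

```python
import numpy as np

rng = np.random.default_rng(2)
a0, a1, b0, b1 = 0.70, 0.95, 0.20, 0.10
c0, c1, c2 = 0.70, 0.40, 1.70
T = 1000

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

s = np.zeros(T)
w = np.zeros(T, dtype=int)
y = np.zeros(T)
for t in range(T):
    s_prev = s[t - 1] if t > 0 else 0.0                 # s_0 = 0
    s[t] = a0 + a1 * s_prev + 0.3 * rng.standard_normal()
    w[t] = int(logistic(b0 + b1 * s[t]) >= rng.uniform())
    y[t] = c0 + c1 * s[t] + c2 * w[t] + rng.standard_normal()  # linear f_y

# With a linear f_y, y_t(w=1) - y_t(w=0) = c2 = 1.70 for every t,
# which is the ground truth an estimator should recover.
```

For the quadratic and exponential cases, only the last line of the loop changes, and the per-interval effect is no longer constant, which is why the errors in Table 1 grow for those outcomes.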

In the following sections, we present our main experiments with more complicated functions f_i (i ∈ A).

Figure 3: The heatmap of α_j1 on each dataset.

5.2 Synthetic Experiments

Since obtaining the ground truth for evaluating causal inference methods is challenging, most of the recent methods are evaluated with synthetic or semi-synthetic datasets (Louizos et al., 2017). This set of experiments is conducted on four synthetic datasets and one benchmark dataset, where three of the synthetic datasets are time series with latent stochastic confounders and the other two datasets are static data with iid confounders.

Synthetic datasets. We simulate data for y_t, w_t, x_t, s_t and z_t from their corresponding distributions as in Eqs. (1)-(3), each with length T = 200. The ground-truth nonlinear functions f_i (i ∈ A \ {q}) with respect to the distributions of y_t, w_t, x_t, s_t, z_t are fully connected neural networks (refer to the Appendix for details of these functions). Using different numbers of hidden layers, i.e., 2, 4, and 6, we construct three synthetic datasets, namely TD2L, TD4L, and TD6L. For these three datasets, we sample the latent confounder variable z satisfying Eq. (1). We also construct another dataset, TD6L-iid, that uses 6 hidden layers but with an iid latent confounder variable z, i.e., z_t ⊥⊥ z_s, ∀t, s. Finally, we keep only y_t, w_t, x_t, s_t as observed data for training. Further details of the simulation are provided in the Appendix.

Benchmark dataset. The Infant Health and Development Program (IHDP) dataset (Hill, 2011) is a study of the impact of specialist visits on the cognitive development of children. The dataset has 747 entries, each with 25 covariates. The treatment group consists of children who received specialist visits, and the control group includes children who did not. For each child, a treated and a control outcome were simulated from the numerical scheme of the NPCI library (Dorie, 2016).

Results and discussion. Each of the experimental datasets has 10 replications. For each replication, we use the first 64% for training, the next 20% for validation, and the last 16% for testing. We examine three setups for our inference method: one with a kernel method to model the nonlinear functions, another with neural networks, and a third with an iid confounder setting and a kernel method to model the nonlinear functions. Specifically, 'CausalSC-Matérn kernel', 'CausalSC-RBF kernel' and 'CausalSC-RQ kernel' denote our proposed method with the representer theorem to model the nonlinear functions f_i (i ∈ A) using the Matérn kernel, the radial basis function kernel, and the rational quadratic kernel, respectively. CausalSC-NNjL denotes our method using neural networks to model these nonlinear functions, where j ∈ {2, 4, 6} is the number of hidden layers. 'iid confounder-Matérn kernel', 'iid confounder-RBF kernel' and 'iid confounder-RQ kernel' denote our proposed method with the representer theorem and independent confounders.

Table 2 reports the errors of each method, where significant results are highlighted in bold. We observe that the performance of our model is competitive on the first three datasets, since our framework is built for temporal data. This verifies the effectiveness of the proposed method in inferring causal effects for temporal data, especially with latent stochastic confounders. Moreover, the use of the representer theorem returns similar values of ATE with three different kernel functions. Our proposed method outperforms the other baselines on the first three datasets because we consider the time-dependency in the latent confounders, while the others do not take such a property into account. For datasets that respect the iid confounder assumption, our methods give results comparable with the other baselines (the last five lines in Table 2). This can be explained as follows. In our setup, z_t has the following form:

Table 2: Out-of-sample ATE error (ATE) of each method on different datasets (lower is better). Temporal Data (latent stochastic confounders) IID Data (iid confounders) Method TD2L TD4L TD6L TD6L-iid IHDP CausalSC-Mat´ernkernel 0.299±.069 0.193±.088 0.237±.056 0.412±.116 0.290±.076 CausalSC-RBF kernel 0.341±.078 0.289±.025 0.287±.067 0.397±.096 0.460±.130 CausalSC-RQ kernel 0.311±.067 0.255±.020 0.257±.063 0.342±.076 0.489±.170 CausalSC-NN2L 0.369±.080 0.395±.162 0.396±.169 0.390±.107 4.057±.109 CausalSC-NN4L 0.413±.092 0.309±.063 0.313±.060 0.398±.104 1.048±.441 CausalSC-NN6L 0.477±.102 0.325±.069 0.297±.070 0.388±.106 0.306±.063 iid confounder-Mat´ernkernel 0.519±.072 0.658±.098 0.762±.068 0.311±.092 0.217±.095 iid confounder-RBF kernel 0.632±.085 0.982±.120 0.884±.078 0.346±.075 0.277±.091 iid confounder-RQ kernel 0.640±.085 1.081±.103 0.785±.077 0.335±.076 0.283±.075 Bojinov and Shephard (2019) (FC) 1.477±.220 1.243±.219 1.180±.177 0.470±.081 0.529±.183 Bojinov and Shephard (2019) (LSTM) 1.316±.261 1.117±.232 1.179±.323 0.593±.102 0.613±.137 Shalit et al. (2017) (CFRNet) 0.780±.131 0.570±.098 0.701±.131 0.568±.091 0.424±.100 Louizos et al. (2017) (CEVAE) 1.166±.247 1.428±.709 0.705±.179 0.337±.093 0.232±.061 Madras et al. (2019) (FCA) 0.391±.078 1.065±.504 0.682±.080 0.393±.103 0.261±.063
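For intuition about how the three CausalSC kernel variants compare inputs, here is a minimal sketch of the Matérn, RBF, and rational quadratic kernel functions. The smoothness order (Matérn 3/2), lengthscales and other hyperparameters are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Radial basis function (squared exponential) kernel."""
    d2 = np.sum((a - b) ** 2)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def matern32_kernel(a, b, lengthscale=1.0):
    """Matern kernel with smoothness nu = 3/2 (one common choice)."""
    d = np.sqrt(np.sum((a - b) ** 2))
    s = np.sqrt(3.0) * d / lengthscale
    return (1.0 + s) * np.exp(-s)

def rq_kernel(a, b, lengthscale=1.0, alpha=1.0):
    """Rational quadratic kernel: a scale mixture of RBF kernels."""
    d2 = np.sum((a - b) ** 2)
    return (1.0 + 0.5 * d2 / (alpha * lengthscale ** 2)) ** (-alpha)

# All three kernels equal 1 at identical inputs and decay with distance.
z = np.array([0.3, -0.1])
zp = np.array([0.5, 0.2])
print(rbf_kernel(z, zp), matern32_kernel(z, zp), rq_kernel(z, zp))
```

Each function returns a similarity in (0, 1], which is what the weighted kernel expansions below aggregate.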

In our setup, z_t takes the form

z_t = α_0 + Σ_{j=1}^{T} Σ_{l=1}^{L} α_{jl} k_z(z_{t−1}, z_j^l) + ε_t,

where α_0 is a bias vector, α_{jl} is a weight vector, and ε_t is Gaussian noise (please refer to the Appendix for the specific expressions of α_0 and α_{jl}). The learned weights α_{jl} presented in Figure 3 (for l = 1) show that their values are around 0 on data with iid confounders, which breaks the connection from z_{t−1} to z_t and thus makes these two variables independent of each other.

Figure 4 presents the convergence of each method on the dataset TD6L over different lengths of the training set, T_train ∈ {5, 10, ..., 125}. In general, the more training data we have, the smaller the error of the estimated ATE. The figure reveals that our method starts to converge from around T_train = 45, which is faster than the others. Additionally, the estimated ATE of our method is stable, with a small error bar.

[Figure 4: Error of the estimated ATE over different lengths of the training set (lower is better). Methods shown: CausalSC-Matérn kernel, CausalSC-NN6L, POTS-FC, POTS-LSTM, CFRNet, CEVAE, FCA.]

5.3 Real Data: Gold–Oil Dataset

Data description. Gold and oil are among the most transactable commodities in real life. An increasing trend of the oil price would result in a rise of the gold price, since it may generate higher inflation that pushes up the demand for gold (Le and Chang, 2011; Šimáková, 2011). In this section, we examine the causal effects from the price of crude oil to that of gold. The dataset in this experiment consists of monthly prices of several commodities, including gold, crude oil, beef, chicken, cocoa beans, rice, shrimp, silver, sugar, gasoline, heating oil and natural gas, from May 1989 to May 2019. We consider the price of gold as the outcome y and the trend of the crude oil price as the treatment w. Specifically, we cast an increase of the crude oil price to 1 (w_t = 1) and a decrease to 0 (w_t = 0). We use the prices of gasoline, heating oil, natural gas, beef, chicken, cocoa beans, rice, shrimp, silver and sugar as proxy variables x.

To make the data suitable for our setting, the following preprocessing procedure is performed. Let ν_0, ν_1, ν_2, ..., ν_T be the time series of a commodity. We take the difference between two consecutive observations as input to our model, i.e., ∆ν_t = ν_t − ν_{t−1}. This preprocessing is applied to the prices of all commodities, so the inputs to our model are the differenced series ∆ν_1, ∆ν_2, ∆ν_3, ..., ∆ν_T. For the treatment w_t (t = 1, 2, ..., T), we further cast it to a binary value, i.e., w_t = 1(∆ν_t > 0), where 1(·) denotes the indicator function. By using the differenced series, we remove trends and seasonal effects (see, e.g., Commandeur and Koopman, 2007, Chapter 10; Jinka and Schwartz, 2016, Chapter 4). Hence, it is reasonable to assume that the preprocessed data satisfy our proposed causal graph, i.e., two consecutive differenced observations are independent given the latent confounders: ∆ν_t ⊥⊥ ∆ν_{t−1} | z_t.

Results and discussion. We evaluate the effect path and the ATE between two sequences of treatments.
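The differencing and binarization steps described above can be sketched with numpy; the price values and variable names here are illustrative, not taken from the actual dataset:

```python
import numpy as np

def preprocess(prices):
    """Difference a price series: delta_nu[t] = nu[t] - nu[t-1]."""
    prices = np.asarray(prices, dtype=float)
    return np.diff(prices)  # length T for a series nu_0, ..., nu_T

# Toy monthly crude-oil prices (illustrative numbers only).
crude_oil = [60.0, 62.5, 61.0, 61.0, 64.2]
delta = preprocess(crude_oil)   # differences: 2.5, -1.5, 0.0, 3.2

# Binary treatment: w_t = 1(delta_nu_t > 0).
w = (delta > 0).astype(int)     # 1 where the price rose, else 0
```

The same `preprocess` step would be applied to the outcome and proxy series, while only the treatment series is additionally thresholded.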

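As a minimal sketch of what comparing two treatment sequences involves, the effect path and its time-average could be computed as follows; the outcome arrays and the simple averaging are illustrative assumptions, not the paper's exact estimator:

```python
import numpy as np

def effect_path(y_w1, y_w2):
    """Per-time-step effect between outcomes under two treatment paths."""
    return np.asarray(y_w1, dtype=float) - np.asarray(y_w2, dtype=float)

def average_treatment_effect(y_w1, y_w2):
    """ATE: the effect path averaged over time."""
    return float(np.mean(effect_path(y_w1, y_w2)))

# Toy predicted outcome paths under two treatment sequences (illustrative).
y_w1 = [10.0, 11.0, 12.0, 13.0]
y_w2 = [9.0, 10.5, 11.0, 12.5]
ep = effect_path(y_w1, y_w2)                 # per-step effects
ate = average_treatment_effect(y_w1, y_w2)   # their time-average
```

In practice the two outcome paths would be predicted counterfactuals from the fitted model rather than observed series.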
The two treatment sequences are w_1 = [1, 1, ..., 1, 1]^⊤ (increasing crude oil prices) and w_2 = [0, 1, ..., 0, 1]^⊤ (alternating decreasing and increasing crude oil prices, i.e., constant on average). In Figure 5, Case (a) presents the estimated ATE of these two sequences of treatments. The estimated ATE using CausalSC is 4.8, which means that in a period when the price of crude oil increases, the average gold price increases by about 4.8. This quantity is equivalent to an increase of 0.77% in the gold price over the period (4.8 / {(Σ_{t=1}^{T} y_t^obs)/T} = 0.77%). To validate the 0.77% increase in the gold price, we compare it with the results reported in Šimáková (2011) (based on Granger causality), which show that a "percentage increase in oil price leads to a 0.64% increase in gold price". We note that our results give a similar order of magnitude; the slight difference may be attributed to our experimental data covering May 1989 to May 2019, while the analysis in Šimáková (2011) is on data from 1970 to 2010.

[Figure 5: Treatment effects on the gold–oil dataset. Case (a): ATE between two treatment paths [1, 1, ..., 1, 1]^⊤ and [0, 1, ..., 0, 1]^⊤. Case (b): ATE between two treatment paths [1, 0, ..., 1, 0]^⊤ and [0, 1, ..., 0, 1]^⊤.]

In Figure 5, Case (b), we present another experiment with w_1 = [1, 0, ..., 1, 0]^⊤ and w_2 = [0, 1, ..., 0, 1]^⊤. In this case, both treatment paths represent the alternating variation of crude oil prices: the former increases first and then decreases, while the latter does the opposite. The average treatment effect is therefore expected to be around 0. From Figure 5, Case (b), the estimated ATE by CausalSC is 0.0045, which is in line with this expectation. To check statistical significance, we performed a one-sample t-test on the EP (Definition 1) with a hypothesized population mean of 0. The p-value given by the t-test is 0.9931, which overwhelmingly fails to reject the null hypothesis that the ATE equals 0. This again verifies the reasonable performance of the proposed method.

6 CONCLUSION

We have developed a causal modeling framework, CausalSC, that admits observed and latent confounders as random processes, generalizing recent work where the confounders are assumed to be independent and identically distributed. We study the causal effects over time using variational inference in conjunction with an alternative form of the representer theorem with a random input space. Our algorithm supports causal inference from the observed outcomes, treatments, and covariates, without parametric specification of the components and their relations. This property is important for capturing real-life causal effects in SCMs, where non-linear functions are typically placed in the priors. Our setup admits non-linear functions modulating the SCM with estimated parameters that have analytical solutions. This approach compares favorably to recent techniques that model similar non-linear functions to estimate causal effects with neural networks, which usually involve extensive model tuning and architecture building.

One limitation of our framework is that the fixed amount of passing time (time-lag) is set to unity, as this leads to further simplifications in computing causal effects. Of practical interest is a more detailed empirical study on general time-lags.

Acknowledgements

We wish to thank Young Lee (Harvard University) for discussions related to this work. This work was partially supported by a Research Scholarship and an Academic Research Grant No. T1 251RES1827 from the Ministry of Education, Singapore.

References

Alaa, A. M. and van der Schaar, M. (2017). Bayesian inference of individualized treatment effects using multi-task Gaussian processes. In Advances in Neural Information Processing Systems, pages 3424–3432.

Bahadori, M. T. and Liu, Y. (2012). On causality inference in time series. In 2012 AAAI Fall Symposium Series.

Bica, I., Alaa, A. M., Jordon, J., and van der Schaar, M. (2020a). Estimating counterfactual treatment outcomes over time through adversarially balanced representations. In International Conference on Learning Representations.

Bica, I., Alaa, A. M., and van der Schaar, M. (2020b). Time series deconfounder: Estimating treatment effects over time in the presence of hidden confounders. In Proceedings of the 37th International Conference on Machine Learning.

Bojinov, I. and Shephard, N. (2019). Time series experiments and causal estimands: exact randomization tests and trading. Journal of the American Statistical Association, pages 1–36.

Clevert, D.-A., Unterthiner, T., and Hochreiter, S. (2016). Fast and accurate deep network learning by exponential linear units (ELUs). In Proceedings of the 4th International Conference on Learning Representations.

Commandeur, J. J. and Koopman, S. J. (2007). An introduction to state space time series analysis. Oxford University Press.

de Luna, X., Fowler, P., and Johansson, P. (2017). Proxy variables and nonparametric identification of causal effects. Economics Letters, 150:152–154.

Dorie, V. (2016). NPCI: Non-parametrics for causal inference. URL: https://github.com/vdorie/npci.

Duncan, O. D. (1975). Introduction to Structural Equation Models. New York: Academic Press.

Duncan, O. D. (2014). Introduction to structural equation models. Elsevier.

Eichler, M. (2005). A graphical approach for evaluating effective connectivity in neural systems. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1457):953–967.

Eichler, M. (2007). Granger causality and path diagrams for multivariate time series. Journal of Econometrics, 137(2):334–353.

Eichler, M. (2009). Causal inference from multivariate time series: What can be learned from Granger causality. In 13th International Congress on Logic, Methodology and Philosophy of Science.

Eichler, M. and Didelez, V. (2010). On Granger causality and the effect of interventions in time series. Lifetime Data Analysis, 16(1):3–32.

Frees, E. W. et al. (2004). Longitudinal and panel data: analysis and applications in the social sciences. Cambridge University Press.

Goldberger, A. S. (1972). Structural equation methods in the social sciences. Econometrica: Journal of the Econometric Society, pages 979–1001.

Goldberger, A. S. (1973). Structural equation models: An overview. Structural equation models in the social sciences, pages 1–18.

Granger, C. W. (1980). Testing for causality: a personal viewpoint. Journal of Economic Dynamics and Control, 2:329–352.

Guo, R., Cheng, L., Li, J., Hahn, P. R., and Liu, H. (2020). A survey of learning causality with data: Problems and methods. ACM Computing Surveys (CSUR), 53(4):1–37.

Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240.

Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685.

Huang, B., Zhang, K., Gong, M., and Glymour, C. (2019). Causal discovery and forecasting in nonstationary environments with state-space models. In Proceedings of the 36th International Conference on Machine Learning.

Jabbari, F., Ramsey, J., Spirtes, P., and Cooper, G. (2017). Discovery of causal models that contain latent variables through Bayesian scoring of independence constraints. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 142–157. Springer.

Jinka, P. and Schwartz, B. (2016). Anomaly Detection for Monitoring. O'Reilly Media, Incorporated.

Kamiński, M., Ding, M., Truccolo, W. A., and Bressler, S. L. (2001). Evaluating causal relations in neural systems: Granger causality, directed transfer function and statistical assessment of significance. Biological Cybernetics, 85(2):145–157.

Kimeldorf, G. S. and Wahba, G. (1970). A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics, 41(2):495–502.

Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Künzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10):4156–4165.

Kuroki, M. and Pearl, J. (2014). Measurement bias and effect restoration in causal inference. Biometrika, 101(2):423–437.

Le, T.-H. and Chang, Y. (2011). Oil and gold: correlation or causation? Technical report, Nanyang Technological University, School of Social Sciences, Economic Growth Centre.

Li, S. and Bühlmann, P. (2018). Estimating heterogeneous treatment effects in nonstationary time series with state-space models. arXiv preprint arXiv:1812.04063.

Louizos, C., Shalit, U., Mooij, J. M., Sontag, D., Zemel, R., and Welling, M. (2017). Causal effect inference with deep latent-variable models. In Advances in Neural Information Processing Systems, pages 6446–6456.

Lu, C., Schölkopf, B., and Hernández-Lobato, J. M. (2018). Deconfounding reinforcement learning in observational settings. arXiv preprint arXiv:1812.10576.

Madras, D., Creager, E., Pitassi, T., and Zemel, R. (2019). Fairness through causal awareness: Learning causal latent-variable models for biased data. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 349–358. ACM.

Montgomery, M. R., Gragnolati, M., Burke, K. A., and Paredes, E. (2000). Measuring living standards with proxy variables. Demography, 37(2):155–174.

Neyman, J. S. (1923). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. (Translated and edited by D. M. Dabrowska and T. P. Speed, Statistical Science (1990), 5, 465–480). Annals of Agricultural Sciences, 10:1–51.

Nie, X. and Wager, S. (2020). Quasi-oracle estimation of heterogeneous treatment effects. Biometrika.

Ning, B., Ghosal, S., Thomas, J., et al. (2019). Bayesian method for causal inference in spatially-correlated multivariate time series. Bayesian Analysis, 14(1):1–28.

Oprescu, M., Syrgkanis, V., and Wu, Z. S. (2019). Orthogonal random forest for causal inference. In International Conference on Machine Learning, pages 4932–4941. PMLR.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems. San Mateo, CA: Kaufmann, 23:33–34.

Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4):669–688.

Pearl, J. (2000). Causality: models, reasoning and inference, volume 29. Springer.

Pearl, J. (2009). Causal inference in statistics: An overview. Statistics Surveys, 3:96–146.

Peters, J., Janzing, D., and Schölkopf, B. (2013). Causal inference on time series using restricted structural equation models. In Advances in Neural Information Processing Systems, pages 154–162.

Peters, J., Mooij, J. M., Janzing, D., and Schölkopf, B. (2014). Causal discovery with continuous additive noise models. The Journal of Machine Learning Research, 15(1):2009–2053.

Riegg, S. K. (2008). Causal inference and omitted variable bias in financial aid research: Assessing solutions. The Review of Higher Education, 31(3):329–354.

Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688.

Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469):322–331.

Schölkopf, B., Herbrich, R., and Smola, A. J. (2001). A generalized representer theorem. In International Conference on Computational Learning Theory, pages 416–426. Springer.

Shalit, U., Johansson, F. D., and Sontag, D. (2017). Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning, pages 3076–3085.

Šimáková, J. (2011). Analysis of the relationship between oil and gold prices. Journal of Finance, 51(1):651–662.

Spirtes, P., Glymour, C. N., Scheines, R., and Heckerman, D. (2000). Causation, prediction, and search. MIT Press.

Tian, J. and Pearl, J. (2001). Causal discovery from changes. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, UAI '01, pages 512–521, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Yao, L., Li, S., Li, Y., Huai, M., Gao, J., and Zhang, A. (2018). Representation learning for treatment effect estimation from observational data. In Advances in Neural Information Processing Systems, pages 2633–2643.

Yoon, J., Jordon, J., and van der Schaar, M. (2018). GANITE: Estimation of individualized treatment effects using generative adversarial nets. In International Conference on Learning Representations.