We Are Not Your Real Parents: Telling Causal from Confounded Using MDL
David Kaltenpoth (Max Planck Institute for Informatics and Saarland University, Saarbrücken, Germany; [email protected])
Jilles Vreeken (Helmholtz Center for Information Security and Max Planck Institute for Informatics, Saarbrücken, Germany; [email protected])

arXiv:1901.06950v1 [cs.LG] 21 Jan 2019

Abstract

Given data over variables X_1, ..., X_m, Y we consider the problem of finding out whether X jointly causes Y or whether they are all confounded by an unobserved latent variable Z. To do so, we take an information-theoretic approach based on Kolmogorov complexity. In a nutshell, we follow the postulate that first encoding the true cause, and then the effects given that cause, results in a shorter description than any other encoding of the observed variables. The ideal score is not computable, and hence we have to approximate it. We propose to do so using the Minimum Description Length (MDL) principle. We compare the MDL scores under the model where X causes Y and the model where there exists a latent variable Z confounding both X and Y, and show that our scores are consistent. To find potential confounders we propose using latent factor modeling, in particular probabilistic PCA (PPCA).

Empirical evaluation on both synthetic and real-world data shows that our method, CoCa, performs very well—even when the true generating process of the data is far from the assumptions made by the models we use. Moreover, it is robust, as its accuracy goes hand in hand with its confidence.

1 Introduction

Causal inference from observational data, i.e. inferring cause and effect from data that was not collected through randomized controlled trials, is one of the most challenging and important problems in statistics [21]. One of the main assumptions in causal inference is that of causal sufficiency. That is, to make sensible statements on the causal relationship between two statistically dependent random variables X and Y, it is assumed that there exists no hidden confounder Z that causes both X and Y. In practice this assumption is often violated—we seldom know all factors that could be relevant, nor do we measure everything—and hence existing methods are prone to spurious inferences.

In this paper, we study the problem of inferring whether X and Y are causally related, or are more likely jointly caused by an unobserved confounding variable Z. To do so, we build upon the algorithmic Markov condition (AMC) [8]. This recent postulate states that the simplest—measured in terms of Kolmogorov complexity—factorization of the joint distribution coincides with the true causal model. Simply put, this means that if Z causes both X and Y, the complexity of the factorization according to this model, K(P(Z)) + K(P(X | Z)) + K(P(Y | Z)), will be lower than the complexity corresponding to the model where X causes Y, K(P(Z)) + K(P(X)) + K(P(Y | X)). As we obviously do not have access to P(Z), we propose to estimate it using latent factor modelling. Second, as Kolmogorov complexity is not computable, we use the Minimum Description Length (MDL) principle as a well-founded approach to approximate it from above. This is the method that we develop in this paper.
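To make this comparison concrete, the following sketch mimics the overall recipe on observed data: estimate a candidate confounder Ẑ with a latent factor model, then compare a two-part score for the causal model X → Y against one for the confounded model X ← Z → Y, and use the gap between the two scores as a confidence. This is a minimal illustration only, not the instantiation developed in this paper: it uses scikit-learn's PCA as a stand-in for full probabilistic PCA, plain Gaussian maximum-likelihood code lengths in place of the refined MDL scores of Sec. 3, and the function name coca_sketch as well as all parameter choices are ours.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression


def gaussian_nll(residuals):
    """Crude code-length surrogate: Gaussian negative log-likelihood of residuals (in nats)."""
    r = np.atleast_2d(residuals.T).T              # ensure shape (n, d)
    var = r.var(axis=0) + 1e-12                   # maximum-likelihood variance per dimension
    return 0.5 * r.shape[0] * np.sum(np.log(2 * np.pi * var) + 1)


def cond_nll(target, predictors=None):
    """-log p(target | predictors) under a linear-Gaussian fit; marginal code if predictors is None."""
    if predictors is None:
        resid = target - target.mean(axis=0)
    else:
        resid = target - LinearRegression().fit(predictors, target).predict(predictors)
    return gaussian_nll(resid)


def coca_sketch(X, y, k=1):
    """Confounded-or-causal decision in the spirit of CoCa (illustrative, not the paper's score)."""
    XY = np.hstack([X, y.reshape(-1, 1)])
    z_hat = PCA(n_components=k).fit_transform(XY)          # latent factor estimate of Z

    # Note: neither score charges for model parameters here; the MDL scores of Sec. 3 do.
    score_causal = cond_nll(X) + cond_nll(y, X)                              # X -> Y
    score_conf = cond_nll(z_hat) + cond_nll(X, z_hat) + cond_nll(y, z_hat)   # X <- Z -> Y

    verdict = "confounded" if score_conf < score_causal else "causal"
    confidence = abs(score_causal - score_conf)             # larger gap, more trustworthy call
    return verdict, confidence


# Toy usage: data generated with a genuine hidden confounder Z.
rng = np.random.default_rng(0)
Z = rng.normal(size=(2000, 1))
X = Z @ rng.normal(size=(1, 3)) + 0.3 * rng.normal(size=(2000, 3))
y = (Z @ rng.normal(size=(1, 1))).ravel() + 0.3 * rng.normal(size=2000)
print(coca_sketch(X, y, k=1))
```

The same two-part structure, first the (estimated) cause and then each observed variable given it, carries over once the Gaussian stand-ins are replaced by proper MDL codes.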
In particular, we consider the setting where we are given a sample over the joint distribution P(X, Y) of a continuous-valued univariate or multivariate random variable X = (X_1, ..., X_m) and a continuous-valued scalar Y. Although it has received little attention so far, we are not the first to study this problem. Recently, Janzing and Schölkopf [9, 10] showed how to measure the "structural strength of confounding" for linear models using resp. spectral analysis [9] and ICA [10]. Rather than implicitly measuring the significance, we explicitly model the hidden confounder Z via probabilistic PCA. While this means our approach is also linear in nature, it gives us the advantage that we can fairly compare the scores for the models X → Y and X ← Z → Y, allowing us to define a reliable confidence measure.

Through extensive empirical evaluation on synthetic and real-world data, we show that our method, CoCa, short for Confounded-or-Causal, performs well in practice. This includes settings where the modelling assumptions hold, but also adversarial settings where they do not. We show that CoCa beats both baselines as well as the recent proposals mentioned above. Importantly, we observe that our confidence score strongly correlates with accuracy. That is, for those cases where we observe a large difference between the scores for causal resp. confounded, we can trust CoCa to provide highly accurate inferences.

The main contributions of this paper are as follows, we
(a) extend the AMC with latent factor models, and propose to instantiate it via probabilistic PCA,
(b) define a consistent and easily computable MDL score to instantiate the framework in practice,
(c) provide extensive evaluation on synthetic and real data, including comparisons to the state of the art.

This paper is structured as usual. In Sec. 2 we introduce basic concepts of causal inference and hidden confounders. We formalize our information-theoretic approach to inferring causal or confounded in Sec. 3. We discuss related work in Sec. 4, and present the experiments in Sec. 5. Finally, we wrap up with discussion and conclusions in Sec. 6.

2 Causal Inference and Confounding

In this work, we consider the setting where we are given n samples from the joint distribution P(X, Y) over two statistically dependent continuous-valued random variables X and Y. We require Y to be a scalar, i.e. univariate, but allow X = (X_1, ..., X_m) to be of arbitrary dimensionality, i.e. univariate or multivariate. Our task is to determine whether it is more likely that X jointly cause Y, or that there exists an unobserved random variable Z = (Z_1, ..., Z_k) that is the cause of both X and Y. Before we detail our approach, we introduce some basic notions, and explain why the straightforward solution does not work.

2.1 Basic Setup It is impossible to do causal inference from observational data without making assumptions [21]. That is, we can only reason about what we should observe in the data if we were to change the causal model, if we assume (properties of) a causal model in the first place.

A core assumption in causal inference is that the data was drawn from a probabilistic graphical model, a causal directed acyclic graph (DAG). To have a fighting chance to recover this causal graph G from observational data, we have to make two further assumptions. The first, and in practice most troublesome, is that of causal sufficiency. This assumption is satisfied if we have measured all common causes of all measured variables. This is related to Reichenbach's principle of common cause [25], which states that if we find that two random variables X and Y are statistically dependent, denoted as X ⊥̸⊥ Y, there are three possible explanations. Either X causes Y, X → Y, or, the other way around, Y causes X, X ← Y, or there is a third variable Z that causes both X and Y, X ← Z → Y. In order to determine the latter case, we need to have measured Z.

The second additional assumption we have to make is that we can factorize the joint distribution over the measured variables as

  P(X_1, ..., X_m) = ∏_{i=1}^{m} P(X_i | PA_i) .

That is, we can write it as a product of the marginal distributions of each X_i conditioned on its true causal parents PA_i. This is referred to as the causal Markov condition, and implies that, under all of the above assumptions, we can have a hope of reconstructing the causal graph from a sample from the joint distribution.
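As a quick illustration (our example, not one from the paper): for a three-variable graph with edges X_1 → X_2, X_1 → X_3, and X_2 → X_3, the parent sets are PA_1 = ∅, PA_2 = {X_1}, and PA_3 = {X_1, X_2}, so the causal Markov condition yields

  P(X_1, X_2, X_3) = P(X_1) P(X_2 | X_1) P(X_3 | X_1, X_2) .

The confounded graph X ← Z → Y considered in this paper factorizes in the same way, as spelled out in Sec. 2.2 below.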
2.2 Crude Solutions That Do Not Work Based on the above, many methods have been proposed to infer causal relationships from a dataset. We give a high-level overview of the state of the art in Sec. 4. Here, we continue to discuss why it is difficult to determine whether a given pair X, Y is confounded or not, and in particular, why traditional approaches based on probability theory or (conditional) independence do not suffice.

To see this, let us first suppose that X causes Y, and there is no hidden confounder Z. We then have P(X, Y) = P(X) P(Y | X), and X ⊥̸⊥ Y. Now, let us suppose instead that Z causes X and Y, i.e. X ← Z → Y. Then we would have P(X, Y, Z) = P(Z) P(X | Z) P(Y | Z), and, importantly, while X ⊥⊥ Y | Z, we still observe X ⊥̸⊥ Y, and hence cannot determine causal or confounded on that alone.

Moreover, as we are only given a sample over P(X, Y) for which X ⊥̸⊥ Y holds, but know nothing about Z or P(Z), we cannot directly measure X ⊥⊥ Y | Z. A simple approach would be to see if we can generate a Ẑ such that X ⊥⊥ Y | Ẑ, for example through sampling or optimization. However, as we have to assign n values to Ẑ, this means we have n degrees of freedom, and it is easy to see that under these conditions it is always possible to generate a Ẑ that achieves this independence, even when there was no confounding Z.
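The degeneracy of this crude approach is easy to see with a small simulation (our illustration, not an experiment from the paper). Below, the data is generated such that X truly causes Y and no confounder exists; we then assign one free value of Ẑ per sample. A flexible regression of X and Y on Ẑ leaves exactly zero residuals, so any residual-based conditional-independence test would happily report X ⊥⊥ Y | Ẑ.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
y = 2.0 * x + 0.5 * rng.normal(size=n)        # ground truth: X -> Y, no confounder

print("marginal correlation:", round(float(pearsonr(x, y)[0]), 3))   # strongly dependent

# "Crude solution": one free value of Z-hat per sample, i.e. n degrees of freedom.
z_hat = np.arange(n, dtype=float).reshape(-1, 1)

# A flexible regressor explains X and Y perfectly from Z-hat ...
rx = x - KNeighborsRegressor(n_neighbors=1).fit(z_hat, x).predict(z_hat)
ry = y - KNeighborsRegressor(n_neighbors=1).fit(z_hat, y).predict(z_hat)

# ... so no dependence is left between the residuals: a residual-based test would
# conclude that X and Y are independent given Z-hat, although no confounder exists.
print("residuals identically zero:", bool(np.allclose(rx, 0) and np.allclose(ry, 0)))
```

This degenerate behaviour is exactly what motivates restricting Ẑ to a constrained latent factor model and charging for its complexity via MDL, as outlined in the introduction.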