Causal Inference on Discrete Data via Estimating Distance Correlations

Furui Liu, Laiwan Chan
Department of Computer Science and Engineering, The Chinese University of Hong Kong

arXiv:1803.07712v3 [stat.ML] 7 Aug 2018

Abstract

In this paper, we deal with the problem of inferring causal directions when the data are on a discrete domain. By considering the distribution of the cause P(X) and the conditional distribution mapping cause to effect P(Y|X) as independent random variables, we propose to infer the causal direction by comparing the distance correlation between P(X) and P(Y|X) with the distance correlation between P(Y) and P(X|Y). We infer "X causes Y" if the dependence coefficient between P(X) and P(Y|X) is smaller. Experiments are performed to show the performance of the proposed method.

Editor: Aapo Hyvärinen

1 Introduction

Inferring the causal direction between two variables from observational data has become a hot research topic. Additive noise models (ANMs) (Zhang and Hyvärinen, 2009a; Shimizu et al., 2006; Hoyer et al., 2009; Peters et al., 2011; Hoyer and Hyttinen, 2009; Zhang and Hyvärinen, 2009b; Shimizu et al., 2011; Hyvärinen and Smith, 2013; Hyvärinen et al., 2010; Hyttinen et al., 2012; Mooij et al., 2011) are preliminary attempts to solve this problem. They assume that the effect is governed by the cause plus an additive noise, and causal inference is done by finding the direction that admits such a model. Recently, under another view of exploiting the asymmetry between cause and effect, the linear trace method (LTr) (Zscheischler et al., 2011; Janzing et al., 2010) and information geometric causal inference (IGCI) (Janzing et al., 2012) have been proposed. Suppose X is the cause and Y is the effect. Based on the fact that the generating process of P(X) is independent of that of P(Y|X) (Janzing and Schölkopf, 2010; Lemeire and Janzing, 2013; Schölkopf et al., 2012), LTr suggests that the trace condition is fulfilled in the causal direction while violated in the anti-causal direction, and IGCI shows that the density of the cause and the log slope of the function transforming cause to effect are uncorrelated, while the density of the effect and the log slope of the inverse of the function are positively correlated. By assessing these so-called cause-effect asymmetries, one can determine the causal direction. Later, a kernel method using the framework of IGCI was developed to deal with high dimensional variables (Chen et al., 2014), and nonlinear extensions of the trace method were presented (Chen et al., 2013).

In some situations, the variables of interest are on discrete domains, and researchers have adapted additive noise models to discrete data for causal inference (Peters et al., 2010, 2011). Given observations of the variable pair, they perform regressions in both directions and test the independence between the residuals and the regressors. The direction that admits an additive noise model is inferred as the causal direction. However, ANMs may not be suitable for modeling discrete variables in some situations. For example, it is not natural to apply ANMs to categorical variables. Methods with wider applicability would be valuable for causal inference on discrete data.

Motivated by the postulate that the generating process of P(X) is independent of that of P(Y|X), we suggest that (P(x), P(Y|x)) is an observation of a pair of variables that are independent of each other (here P(x) = P(X = x), the probability at one specified point). To infer the causal direction, we calculate the dependence coefficients between P(X) and P(Y|X), and between P(Y) and P(X|Y). The direction that induces the smaller correlation is inferred as the causal direction. Without a functional causal model assumption, our method has wider applicability than traditional ANMs. Various experiments are conducted to demonstrate the performance of the proposed method.
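To make this decision rule concrete before the formal development in Section 3, the following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the joint distribution has already been estimated as a matrix with rows indexed by x and columns by y, that all marginal probabilities are strictly positive, and that the third-party dcor package is used as a stand-in estimator of distance correlation; the helper name infer_direction is hypothetical.

    import numpy as np
    import dcor  # assumed third-party package providing distance_correlation()

    def infer_direction(joint):
        """Compare the dependence of P(X) with P(Y|X) against that of P(Y) with P(X|Y)."""
        joint = np.asarray(joint, dtype=float)
        p_x = joint.sum(axis=1)                  # P(X), one entry per x
        p_y = joint.sum(axis=0)                  # P(Y), one entry per y
        p_y_given_x = joint / p_x[:, None]       # row x is the vector P(Y|x)
        p_x_given_y = (joint / p_y[None, :]).T   # row y is the vector P(X|y)

        # Rows are the realizations (P(x), P(Y|x)) and (P(y), P(X|y)).
        d_forward = dcor.distance_correlation(p_x, p_y_given_x)
        d_backward = dcor.distance_correlation(p_y, p_x_given_y)
        return "X causes Y" if d_forward < d_backward else "Y causes X"

For instance, infer_direction(np.array([[0.2, 0.1], [0.1, 0.6]])) returns the direction with the smaller coefficient; the paper's own procedure is given in Section 4.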
This paper is organized as follows. Section 2 defines the problem. Section 3 presents the causal inference principle. Section 4 gives the detailed causal inference method. Section 5 shows the experiments, and Section 6 concludes the paper.

2 Problem Description

Suppose we have a pair of observed variables X and Y with supports 𝒳 and 𝒴 respectively. X is the cause and Y is the effect, but we do not have prior knowledge about the causal direction. We assume that they are on a discrete domain (for continuous variables we can perform discretization) and that there are no latent confounders. We want to identify the causal direction ("X causes Y" or "Y causes X"). For clarity, we list the symbols that appear in the following sections in Table 1. Since we constrain the two variables to a finite range, P(X, Y) and P(Y|X) can be written as matrices, and P(Y|x) is a vector in the matrix P(Y|X). We use these representations in the rest of the paper.

Table 1: Lookup table

    Symbol      Description
    𝒳           support of variable X
    𝒴           support of variable Y
    P(X)        distribution of X
    P(x)        probability of X = x
    P(X, Y)     joint distribution of (X, Y)
    P(Y|X)      conditional distribution of Y given X
    P(Y|x)      conditional distribution of Y given X = x
    | · |       cardinality of a set

3 Causal Inference Principle

In this section, we present the principle that we use for causal inference on discrete data. We start with the basic idea.

3.1 Basic Idea

The basic idea is to consider (P(x), P(Y|x)) as a realization of a variable pair, where the two variables (one one-dimensional, the other high-dimensional) are independent of each other. See Figure 1 for an example. Figure 1 shows an example of P(X) and P(Y|X). Suppose |𝒳| = M and |𝒴| = L (here M = 7 and L = 8). P(X) is a vector in R^M and P(Y|X) is a matrix in R^{L×M}. The highlighted (red) bars form a pair (P(x), P(Y|x)). Consider P(X) and P(Y|X) as two independent random variables. The generating of P(X) and P(Y|X) is done by drawing realizations (P(x), P(Y|x)) at each possible value of x (shifting the red bars from right to left). We have |𝒳| realizations. We formalize this in Postulate 1.

Postulate 1. P(X) and P(Y|X) are both random variables taking realizations at different x. P(X) is independent of P(Y|X).

With this postulate in mind, one can seek properties induced by it that can be used for causal discovery. We discuss this in the next section.

[Figure 1: P(X) and P(Y|X) as random variables. The bar charts show the matrix P(Y|X), the vector P(X), and one highlighted pair (P(x), P(Y|x)).]
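As an illustration of Postulate 1 and Figure 1, the sketch below (an illustrative assumption about preprocessing, not code from the paper) estimates P(X) and P(Y|X) from a sample of discrete pairs; the helper name empirical_distributions is hypothetical.

    import numpy as np

    def empirical_distributions(x, y):
        """Estimate P(X) (length M) and P(Y|X) (shape L x M) from paired samples.

        Column m of the returned matrix is the realization P(Y | X = x_m),
        so together with p_x[m] it forms one pair (P(x), P(Y|x)).
        """
        x = np.asarray(x)
        y = np.asarray(y)
        xs, x_idx = np.unique(x, return_inverse=True)   # support of X, size M
        ys, y_idx = np.unique(y, return_inverse=True)   # support of Y, size L
        counts = np.zeros((len(ys), len(xs)))
        np.add.at(counts, (y_idx, x_idx), 1.0)          # L x M contingency table
        p_x = counts.sum(axis=0) / counts.sum()         # P(X)
        p_y_given_x = counts / counts.sum(axis=0)       # normalize each column
        return p_x, p_y_given_x

Each of the |𝒳| columns supplies one realization; a dependence measure between these realizations and the entries of P(X) is what the rest of this section develops.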
3.2 Distance Correlation

If we want to characterize the dependence between P(X) and P(Y|X), one possible measurement is the correlation. However, P(Y|X) is a high dimensional random vector, and adopting traditional dependence coefficients such as the Pearson correlation would introduce estimation bias when the sample size is not large. Moreover, it would be useful if independence between the variables corresponded to zero correlation, which is not true for traditional correlations. We therefore propose to use distance correlation (Székely et al., 2007) as the dependence measurement.

Distance correlation is a measurement of dependence between two random variables (one dimensional or high dimensional). Suppose we have two random variables (α, β) with characteristic functions f_α(t) and f_β(s) respectively, and joint characteristic function f_{α,β}(t, s). The distance covariance is defined as below.

Definition 1. The distance covariance C(α, β) between two random variables (α, β) is

    C²(α, β) = ‖f_{α,β}(t, s) − f_α(t) f_β(s)‖²    (1)

Here ‖·‖ refers to the weighted L2 norm, and the distance variance C²(α, α) is defined similarly (Székely et al., 2007). The distance correlation is then defined as follows.

Definition 2. The distance correlation D(α, β) is

    D(α, β) = C(α, β) / √(C(α, α) C(β, β))    (2)

and D(α, β) = 0 if C(α, α) = 0 or C(β, β) = 0.

This dependence measurement is a distance metric between the joint characteristic function and the product of the marginal characteristic functions. There are other ways to measure dependence, such as mutual information and kernel independence measurements (Gretton et al., 2005). However, mutual information is hard to estimate given a finite sample size, and kernel methods involve a few parameters (kernel functions, kernel widths) that are not easy to choose. We therefore use this metric in our paper.

We now discuss how to estimate the distance correlation empirically from data (Székely et al., 2007). Suppose we have n observations of the two random variables (α, β), written {(α_i, β_i)}_{i=1}^n. For variables α and β, we can construct

    a_{ij} = ‖α_i − α_j‖,   a_{i·} = (1/n) Σ_{j=1}^n a_{ij},   a_{·j} = (1/n) Σ_{i=1}^n a_{ij},   a_{··} = (1/n²) Σ_{i,j=1}^n a_{ij}    (3)

    b_{ij} = ‖β_i − β_j‖,   b_{i·} = (1/n) Σ_{j=1}^n b_{ij},   b_{·j} = (1/n) Σ_{i=1}^n b_{ij},   b_{··} = (1/n²) Σ_{i,j=1}^n b_{ij}    (4)

and then construct matrices A and B with entries

    A_{ij} = a_{ij} − a_{i·} − a_{·j} + a_{··}    (5)

    B_{ij} = b_{ij} − b_{i·} − b_{·j} + b_{··}    (6)

Then we can estimate the empirical distance covariance as follows (Székely et al., 2007).
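Following these definitions, here is a minimal sketch of the empirical estimator. The final plug-in step, averaging the entrywise products of A and B to obtain the squared distance covariance, is taken from Székely et al. (2007) rather than from the text above, and the function names are illustrative.

    import numpy as np

    def _double_centered_distances(samples):
        """Pairwise distances a_ij (or b_ij), double-centered as in equations (3)-(6)."""
        samples = np.asarray(samples, dtype=float)
        if samples.ndim == 1:
            samples = samples[:, None]                # n scalar observations as an (n, 1) array
        d = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=-1)
        return d - d.mean(axis=1, keepdims=True) - d.mean(axis=0, keepdims=True) + d.mean()

    def distance_correlation(alpha, beta):
        """Empirical distance correlation D(alpha, beta) from n paired observations."""
        A = _double_centered_distances(alpha)         # entries A_ij of equation (5)
        B = _double_centered_distances(beta)          # entries B_ij of equation (6)
        dcov2 = (A * B).mean()                        # empirical C^2(alpha, beta)
        dvar_a = np.sqrt((A * A).mean())              # empirical C(alpha, alpha)
        dvar_b = np.sqrt((B * B).mean())              # empirical C(beta, beta)
        if dvar_a == 0.0 or dvar_b == 0.0:
            return 0.0                                # Definition 2: D = 0 in this case
        return np.sqrt(dcov2) / np.sqrt(dvar_a * dvar_b)

Applied to the |𝒳| realizations constructed in Section 3.1 (one realization per row, e.g. distance_correlation(p_x, p_y_given_x.T)), this gives the coefficient for the direction from X to Y; repeating the construction with the roles of X and Y exchanged gives the coefficient for the reverse direction.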