
Kernel-based Conditional Independence Test and Application in Causal Discovery

Kun Zhang, Jonas Peters, Dominik Janzing, Bernhard Schölkopf
Max Planck Institute for Intelligent Systems, Spemannstr. 38, 72076 Tübingen, Germany

Abstract

Conditional independence testing is an important problem, especially in Bayesian network learning and causal discovery. Due to the curse of dimensionality, testing for conditional independence of continuous variables is particularly challenging. We propose a Kernel-based Conditional Independence test (KCI-test), by constructing an appropriate test statistic and deriving its asymptotic distribution under the null hypothesis of conditional independence. The proposed method is computationally efficient and easy to implement. Experimental results show that it outperforms other methods, especially when the conditioning set is large or the sample size is not very large, in which case other methods encounter difficulties.

1 Introduction

Statistical independence and conditional independence (CI) are important concepts in statistics, artificial intelligence, and related fields (Dawid, 1979). Let X, Y and Z denote sets of random variables. The CI between X and Y given Z, denoted by X ⊥⊥ Y | Z, reflects the fact that given the values of Z, further knowing the values of X (or Y) does not provide any additional information about Y (or X). Independence and CI play a central role in causal discovery and Bayesian network learning (Pearl, 2000; Spirtes et al., 2001; Koller and Friedman, 2009). Generally speaking, the CI relationship X ⊥⊥ Y | Z allows us to drop Y when constructing a probabilistic model for X with (Y, Z), which results in a parsimonious representation.

Testing for CI is much more difficult than testing for unconditional independence (Bergsma, 2004). For CI tests, traditional methods either focus on the discrete case, or impose simplifying assumptions to deal with the continuous case; in particular, the variables are often assumed to have linear relations with additive Gaussian errors. In that case, X ⊥⊥ Y | Z reduces to zero partial correlation or zero conditional correlation between X and Y given Z, which can be easily tested (for the links between partial correlation, conditional correlation, and CI, see Lawrance (1976)). However, nonlinearity and non-Gaussian noise are frequently encountered in practice, and hence this assumption can lead to incorrect conclusions.
Recently, practical methods have been proposed for testing CI for continuous variables without assuming a functional form between the variables or specific data distributions, which is the case we are concerned with in this paper. To our knowledge, the existing methods fall into four categories. The first category is based on explicit estimation of the conditional densities or their variants. For example, Su and White (2008) define the test statistic as some distance between the estimated conditional densities p(X|Y,Z) and p(X|Z), and Su and White (2007) exploit the difference between the characteristic functions of these conditional densities. The estimation of the conditional densities or related quantities is difficult, which deteriorates the testing performance, especially when the conditioning set Z is not small enough. Methods in the second category, such as Margaritis (2005) and Huang (2010), discretize the conditioning set Z into a set of bins, and transform CI to unconditional independence in each bin. Inevitably, due to the curse of dimensionality, as the conditioning set becomes larger, the required sample size increases dramatically. Methods in the third category, including Linton and Gozalo (1997) and Song (2009), provide slightly weaker tests than that for CI. For instance, the method proposed by Song (2009) tests whether one can find some (nonlinear) function h and parameters θ_0 such that X and Y are conditionally independent given a single index function λ_{θ_0}(Z) = h(Z^T θ_0) of Z. In general, this is different from the test for X ⊥⊥ Y | Z: to see this, consider the case where X and Y depend on two different but overlapping subsets of Z; even if X ⊥⊥ Y | Z, it is impossible to find a λ_{θ_0}(Z) given which X and Y are conditionally independent.

Fukumizu et al. (2004) give a general nonparametric characterization of CI using covariance operators in reproducing kernel Hilbert spaces (RKHS), which inspired a kernel-based measure of conditional dependence (see also Fukumizu et al. (2008)). However, the distribution of this measure under the CI hypothesis is unknown, and consequently it cannot directly serve as a CI test. To get a test, one has to combine this conditional dependence measure with local bootstrap or local permutation, which is used to determine the rejection region (Fukumizu et al., 2008; Tillman et al., 2009). This leads to the method in the fourth category, which we denote by CI_PERM. Like the methods in the second category, this approach requires a large sample size and tends to be unreliable when the number of conditioning variables increases.

In this paper we aim to develop a CI testing method which avoids the above drawbacks. In particular, based on appropriate characterizations of CI, we define a simple test statistic which can be easily calculated from the kernel matrices associated with X, Y, and Z, and we further derive its asymptotic distribution under the null hypothesis. We also provide ways to estimate this distribution, so that CI can finally be tested conveniently. This results in a Kernel-based Conditional Independence test (KCI-test). In this procedure we do not explicitly estimate the conditional or joint densities, nor do we discretize the conditioning variables. Our method is computationally appealing and is less sensitive to the dimensionality of Z than other methods. Our results contain unconditional independence testing (similar to Gretton et al. (2008)) as a special case.
2 Characterization of Independence and Conditional Independence

We introduce the following notational convention. Throughout this paper, X, Y, and Z are continuous random variables or sets of continuous random variables, with domains 𝒳, 𝒴, and 𝒵, respectively. Define a measurable, positive definite kernel k_X on 𝒳 and denote the corresponding RKHS by H_X. Similarly we define k_Y, H_Y, k_Z, and H_Z. In this paper we assume that all involved RKHS's are separable and square integrable. The probability law of X is denoted by P_X, and similarly for the joint probability laws such as P_XZ. The spaces of square integrable functions of X and of (X,Z) are denoted by L²_X and L²_XZ, respectively; e.g., L²_XZ = {g(X,Z) | E[g²] < ∞}. x = {x_1, ..., x_n} denotes the i.i.d. sample of X of size n. K_X is the kernel matrix of the sample x, and the corresponding centralized kernel matrix is K̃_X ≜ H K_X H, where H = I − (1/n)11^T, with I and 1 being the n × n identity matrix and the vector of 1's, respectively. By default we use the Gaussian kernel, i.e., the (i,j)th entry of K_X is k(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ_X²)), where σ_X is the kernel width. Similar notations are used for Y and Z.

The problem we consider here is to test for CI between sets of continuous variables X and Y given Z from their observed i.i.d. samples, without making specific assumptions on their distributions or the functional forms between them. X and Y are said to be conditionally independent given Z if and only if p_{X|Y,Z} = p_{X|Z} (or equivalently, p_{Y|X,Z} = p_{Y|Z}, or p_{XY|Z} = p_{X|Z} p_{Y|Z}). Therefore, a direct way to assess if X ⊥⊥ Y | Z is to estimate the densities involved and then evaluate whether the above equation is plausible. However, density estimation in high dimensions is a difficult problem: it is well known that in nonparametric joint or conditional density estimation, due to the curse of dimensionality, the number of data points required to achieve a given accuracy increases exponentially in the data dimension. Fortunately, conditional (in)dependence is just one particular property of the distributions; to test for it, it is possible to avoid explicitly estimating the densities.

There are other ways to characterize the CI relation that do not explicitly involve the densities or their variants, and they may result in more efficient methods for CI testing. Recently, a characterization of CI was given in terms of the cross-covariance operator Σ_YX on the RKHS (Fukumizu et al., 2004). For the random vector (X,Y) on 𝒳 × 𝒴, the cross-covariance operator from H_X to H_Y is defined by the relation

    ⟨g, Σ_YX f⟩ = E_XY[f(X)g(Y)] − E_X[f(X)] E_Y[g(Y)]

for all f ∈ H_X and g ∈ H_Y.

The conditional cross-covariance operator of (X,Y) given Z is further defined by

    Σ_{YX|Z} = Σ_{YX} − Σ_{YZ} Σ_{ZZ}^{-1} Σ_{ZX}.    (1)

Intuitively, one can interpret it as the partial covariance between {f(X), ∀f ∈ H_X} and {g(Y), ∀g ∈ H_Y} given {h(Z), ∀h ∈ H_Z}.¹ If characteristic kernels² are used, the conditional cross-covariance operator is related to the CI relation, as seen from the following lemma.

¹ If Σ_ZZ is not invertible, one should use the right inverse instead of the inverse; see Corollary 3 in Fukumizu et al. (2004).
² A kernel k_X is said to be characteristic if the condition E_{X∼P}[f(X)] = E_{X∼Q}[f(X)] (∀f ∈ H) implies P = Q, where P and Q are two probability distributions of X (Fukumizu et al., 2008). Hence, the notion "characteristic" was also called "probability-determining" in Fukumizu et al. (2004). Many popular kernels, such as the Gaussian one, are characteristic.

Lemma 1 [Characterization based on conditional cross-covariance operators (Fukumizu et al., 2008)]
Denote Ẍ ≜ (X,Z), k_Ẍ ≜ k_X k_Z, and by H_Ẍ the RKHS corresponding to k_Ẍ. Assume H_X ⊂ L²_X, H_Y ⊂ L²_Y, and H_Z ⊂ L²_Z. Further assume that k_Ẍ k_Y is a characteristic kernel on (𝒳 × 𝒵) × 𝒴, and that H_Z + R (the direct sum of the two RKHSs) is dense in L²(P_Z). Then

    Σ_{ẌY|Z} = 0  ⟺  X ⊥⊥ Y | Z.    (2)

Note that one can replace Σ_{ẌY|Z} with Σ_{ẌŸ|Z}, where Ÿ ≜ (Y,Z), in the above lemma. Alternatively, Daudin (1980) gives a characterization of CI by explicitly enforcing the uncorrelatedness of functions in suitable spaces, which may be intuitively more appealing. In particular, consider the constrained L² spaces

    E_XZ ≜ {f̃ ∈ L²_XZ | E(f̃|Z) = 0},
    E_YZ ≜ {g̃ ∈ L²_YZ | E(g̃|Z) = 0},
    E'_YZ ≜ {g̃' | g̃' = g'(Y) − E(g'|Z), g' ∈ L²_Y}.    (3)

They can be constructed from the corresponding L² spaces via nonlinear regression. For instance, for any function f ∈ L²_XZ, the corresponding function f̃ is given by

    f̃(Ẍ) = f(Ẍ) − E(f|Z) = f(Ẍ) − h*_f(Z),    (4)

where h*_f(Z) ∈ L²_Z is the regression function of f(Ẍ) on Z. One can then relate CI to uncorrelatedness in the following way.

Lemma 2 [Characterization based on partial association (Daudin, 1980)]
The following conditions are equivalent to each other: (i.) X ⊥⊥ Y | Z; (ii.) E(f̃g̃) = 0, ∀f̃ ∈ E_XZ and g̃ ∈ E_YZ; (iii.) E(f̃g) = 0, ∀f̃ ∈ E_XZ and g ∈ L²_YZ; (iv.) E(f̃g̃') = 0, ∀f̃ ∈ E_XZ and g̃' ∈ E'_YZ; (v.) E(f̃g') = 0, ∀f̃ ∈ E_XZ and g' ∈ L²_Y.

The above result can be considered as a generalization of the partial-correlation-based characterization of CI for Gaussian variables. Suppose that (X,Y,Z) is jointly Gaussian; then X ⊥⊥ Y | Z is equivalent to the vanishing of the partial correlation coefficient ρ_{XY·Z}. Here, intuitively speaking, condition (ii) means that any "residual" function of (X,Z) given Z is uncorrelated with that of (Y,Z) given Z. Note that E_XZ (resp. E_YZ) contains all functions of X and Z (resp. of Y and Z) that cannot be "explained" by Z, in the sense that any function f̃ ∈ E_XZ (resp. g̃ ∈ E_YZ) is uncorrelated with any function of Z (Daudin, 1980).

From the definition of the conditional cross-covariance operator (1), one can see the close relationship between the conditions in Lemma 1 and those in Lemma 2. However, Lemma 1 has practical advantages: in Lemma 2 one has to consider all functions in L², while Lemma 1 exploits the spaces corresponding to characteristic kernels, which might be much smaller. In fact, if we restrict the functions f and g' to the spaces H_Ẍ and H_Y, respectively, Lemma 2 reduces to Lemma 1. The above characterizations of CI motivated our statistic for testing X ⊥⊥ Y | Z, as presented below.
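As a small worked special case (ours, not taken from the paper), consider linear kernels and jointly Gaussian (X, Y, Z): the operators in (1) then reduce to ordinary covariance matrices, and the criterion recovers the classical partial-correlation characterization referred to above.

```latex
% Illustrative special case (not from the paper): linear kernels, jointly Gaussian (X, Y, Z).
% With f(X) = a^\top X and g(Y) = b^\top Y, the cross-covariance operators reduce to
% covariance matrices, so (1) becomes the usual partial covariance:
\Sigma_{YX|Z} \;=\; C_{YX} - C_{YZ}\, C_{ZZ}^{-1}\, C_{ZX},
% and, for one-dimensional Gaussian X and Y,
\Sigma_{YX|Z} = 0
\;\Longleftrightarrow\;
\rho_{XY\cdot Z} = 0
\;\Longleftrightarrow\;
X \perp\!\!\!\perp Y \mid Z .
```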
3 A Kernel-Based Conditional Independence Test

3.1 General results

As seen above, independence and CI can be characterized by uncorrelatedness between functions in certain spaces. We first give some general results on the asymptotic distributions of statistics defined in terms of the kernel matrices under the condition of such uncorrelatedness. Later these results will be used for testing for CI as well as for unconditional independence.

Suppose that we are given the i.i.d. samples x ≜ (x_1, ..., x_t, ..., x_n) and y ≜ (y_1, ..., y_t, ..., y_n) for X and Y, respectively. Suppose further that we have the eigenvalue decompositions (EVD) of the centralized kernel matrices K̃_X and K̃_Y, i.e., K̃_X = V_x Λ_x V_x^T and K̃_Y = V_y Λ_y V_y^T, where Λ_x and Λ_y are the diagonal matrices containing the non-negative eigenvalues λ_{x,i} and λ_{y,i}, respectively. Here, the eigenvalues are sorted in descending order, i.e., λ_{x,1} ≥ λ_{x,2} ≥ ... ≥ λ_{x,i} ≥ 0 and λ_{y,1} ≥ λ_{y,2} ≥ ... ≥ λ_{y,i} ≥ 0. Let ψ_x = [ψ_1(x), ..., ψ_n(x)] ≜ V_x Λ_x^{1/2} and φ_y = [φ_1(y), ..., φ_n(y)] ≜ V_y Λ_y^{1/2}; i.e., ψ_i(x) = √λ_{x,i} V_{x,i}, where V_{x,i} denotes the ith eigenvector of K̃_X.

On the other hand, consider the eigenvalues λ*_{X,i} and eigenfunctions u_{X,i} of the kernel k_X w.r.t. the probability measure with density p(x), i.e., λ*_{X,i} and u_{X,i} satisfy

    ∫ k_X(x, x') u_{X,i}(x) p(x) dx = λ*_{X,i} u_{X,i}(x').

Here we assume that the u_{X,i} have unit variance, i.e., E[u²_{X,i}(X)] = 1. We also sort the λ*_{X,i} in descending order. Similarly, we define λ*_{Y,i} and u_{Y,i} of k_Y. Define

    S_{ij} ≜ (1/√n) ψ_i(x)^T φ_j(y) = (1/√n) Σ_{t=1}^n ψ_i(x_t) φ_j(y_t),

with ψ_i(x_t) being the t-th component of the vector ψ_i(x). We then have the following results.

Theorem 3 Suppose that we are given arbitrary centred kernels k_X and k_Y with discrete eigenvalues and the corresponding RKHS's H_X and H_Y for sets of random variables X and Y, respectively. We have the following three statements.

1) Under the condition that f(X) and g(Y) are uncorrelated for all f ∈ H_X and g ∈ H_Y, for any L such that λ*_{X,L+1} ≠ λ*_{X,L} and λ*_{Y,L+1} ≠ λ*_{Y,L}, we have

    Σ_{i,j=1}^L S²_{ij}  →^d  Σ_{k=1}^{L²} λ̊*_k z²_k,  as n → ∞,    (5)

where the z_k are i.i.d. standard Gaussian variables (i.e., the z²_k are i.i.d. χ²_1-distributed variables), the λ̊*_k are the eigenvalues of E(ww^T), and w is the random vector obtained by stacking the L × L matrix N whose (i,j)th entry is N_{ij} = √(λ*_{X,i} λ*_{Y,j}) u_{X,i}(X) u_{Y,j}(Y).³

2) In particular, if X and Y are further independent, we have

    Σ_{i,j=1}^L S²_{ij}  →^d  Σ_{i,j=1}^L λ*_{X,i} λ*_{Y,j} · z²_{ij},  as n → ∞,    (6)

where the z²_{ij} are i.i.d. χ²_1-distributed variables.

3) The results (5) and (6) also hold if L = n → ∞.

³ Equivalently, one can consider the λ̊*_k as the eigenvalues of the tensor T with T_{ijkl} = E(N_{ij,t} N_{kl,t}).

All proofs are sketched in the Appendix. We note that

    Tr(K̃_X K̃_Y) = Tr(ψ_x ψ_x^T φ_y φ_y^T) = Tr(ψ_x^T φ_y φ_y^T ψ_x) = n Σ_{i,j=1}^n S²_{ij}.    (7)

Hence, the above theorem gives the asymptotic distribution of (1/n) Tr(K̃_X K̃_Y) under the condition that f(X) and g(Y) are always uncorrelated. Combined with the characterizations of (conditional) independence, this inspires the corresponding testing method. We further give the following remarks on Theorem 3. First, X and Y are not necessarily disjoint, and H_X and H_Y can be any RKHS's. Second, in practice the eigenvalues λ̊*_k are not known, and one needs to use the empirical ones instead, as discussed below.

3.2 Unconditional independence testing

As a direct consequence of the above theorem, we have the following result, which allows for kernel-based unconditional independence testing.

Theorem 4 [Independence test]
Under the null hypothesis that X and Y are statistically independent, the statistic

    T_UI ≜ (1/n) Tr(K̃_X K̃_Y)    (8)

has the same asymptotic distribution as

    Ť_UI ≜ (1/n²) Σ_{i,j=1}^n λ_{x,i} λ_{y,j} z²_{ij},    (9)

i.e., T_UI =^d Ť_UI as n → ∞.

This theorem inspires the following unconditional independence testing procedure. Given the samples x and y, one first calculates the centralized kernel matrices K̃_X and K̃_Y and their eigenvalues λ_{x,i} and λ_{y,i}, and then evaluates the statistic T_UI according to (8). Next, the empirical null distribution of Ť_UI under the null hypothesis can be simulated in the following way: one draws i.i.d. random samples of the χ²_1-distributed variables z²_{ij}, and then generates samples of Ť_UI according to (9). (Later we will give another way to approximate the asymptotic null distribution.) Finally, the p-value can be found by locating T_UI in the null distribution.

This unconditional independence testing method is closely related to the one based on the Hilbert-Schmidt independence criterion (HSIC) proposed by Gretton et al. (2008). Actually the defined statistics (our statistic T_UI and HSIC_b in Gretton et al. (2008)) are the same, but the asymptotic distributions are given in different forms. In our results the asymptotic distribution only involves the eigenvalues of the two regular kernel matrices K̃_X and K̃_Y, while in their results it depends on the eigenvalues of an order-four tensor, which are more difficult to calculate.
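To make the procedure above concrete, here is a minimal NumPy sketch of the unconditional test, i.e., the statistic (8) with a Monte Carlo null built from (9). It is our illustration, not the authors' released MATLAB implementation; the median-heuristic kernel width, the eigenvalue cutoff of 10^{-10}, and all function names are assumptions of the sketch.

```python
# Minimal sketch of the unconditional test of Sec. 3.2 (illustrative, not the
# released MATLAB code): statistic T_UI of Eq. (8), Monte Carlo null from Eq. (9).
import numpy as np

def centered_gaussian_kernel(x, width=None):
    """Centered Gaussian kernel matrix K~ = H K H; rows of x are observations."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    sq = np.sum(x ** 2, axis=1, keepdims=True)
    d2 = np.maximum(sq + sq.T - 2.0 * x @ x.T, 0.0)      # pairwise squared distances
    if width is None:                                     # median heuristic (cf. Sec. 3.5)
        width = np.median(np.sqrt(d2[d2 > 0]))
    K = np.exp(-d2 / (2.0 * width ** 2))
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n                   # H = I - (1/n) 1 1^T
    return H @ K @ H

def kci_uncond_test(x, y, n_draws=1000, seed=0):
    """Return (T_UI, Monte Carlo p-value) for H0: X independent of Y."""
    rng = np.random.default_rng(seed)
    Kx, Ky = centered_gaussian_kernel(x), centered_gaussian_kernel(y)
    n = Kx.shape[0]
    T = np.trace(Kx @ Ky) / n                             # Eq. (8)
    lx = np.linalg.eigvalsh(Kx)
    ly = np.linalg.eigvalsh(Ky)
    lx, ly = lx[lx > 1e-10], ly[ly > 1e-10]               # drop negligible eigenvalues
    prod = np.outer(lx, ly).ravel()                       # lambda_{x,i} * lambda_{y,j}
    z2 = rng.chisquare(1.0, size=(n_draws, prod.size))    # i.i.d. chi^2_1 draws
    null = z2 @ prod / n ** 2                             # Monte Carlo samples of Eq. (9)
    return T, float(np.mean(null >= T))
```

Calling kci_uncond_test(x, y) on paired samples thus returns the statistic of (8) together with a p-value obtained by locating it in the simulated null distribution.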

3.3 Conditional independence testing

Here we would like to make use of condition (iv) in Lemma 2 to test for CI, but the considered function spaces are f(Ẍ) ∈ H_Ẍ, g'(Y) ∈ H_Y, h*_f(Z) ∈ H_Z, and h*_{g'}(Z) ∈ H_Z, as in Lemma 1. The functions f̃(Ẍ) and g̃'(Y,Z) appearing in condition (iv), whose spaces are denoted by H_{Ẍ|Z} and H_{Y|Z}, respectively, can be constructed from the functions f, g', h*_f, and h*_{g'}.

Suppose that we already have the centralized kernel matrices K̃_Ẍ (the centralized kernel matrix of Ẍ = (X,Z)), K̃_Y, and K̃_Z on the samples x, y, and z. We use kernel ridge regression to estimate the regression function h*_f(Z) in (4), and one can easily see that ĥ*_f(z) = K̃_Z (K̃_Z + εI)^{-1} · f(ẍ), where ε is a small positive regularization parameter (Schölkopf and Smola, 2002). Consequently, f̃(ẍ) can be constructed as f̃(ẍ) = f(ẍ) − ĥ*_f(z) = R_Z · f(ẍ), where

    R_Z = I − K̃_Z (K̃_Z + εI)^{-1} = ε (K̃_Z + εI)^{-1}.    (10)

Based on the EVD K̃_Ẍ = V_ẍ Λ_ẍ V_ẍ^T, we can construct φ_ẍ = [φ_1(ẍ), ..., φ_n(ẍ)] ≜ V_ẍ Λ_ẍ^{1/2} as an empirical kernel map for ẍ. Correspondingly, an empirical kernel map of the space H_{Ẍ|Z} is given by φ̃_ẍ = R_Z φ_ẍ. Consequently, the centralized kernel matrix corresponding to the functions f̃(Ẍ) is

    K̃_{Ẍ|Z} = φ̃_ẍ φ̃_ẍ^T = R_Z K̃_Ẍ R_Z.    (11)

Similarly, the one corresponding to g̃' is

    K̃_{Y|Z} = R_Z K̃_Y R_Z.    (12)

Furthermore, let the EVDs of K̃_{Ẍ|Z} and K̃_{Y|Z} be K̃_{Ẍ|Z} = V_{ẍ|z} Λ_{ẍ|z} V_{ẍ|z}^T and K̃_{Y|Z} = V_{y|z} Λ_{y|z} V_{y|z}^T, respectively, where Λ_{ẍ|z} (resp. Λ_{y|z}) is the diagonal matrix containing the non-negative eigenvalues λ_{ẍ|z,i} (resp. λ_{y|z,i}). Let ψ_{ẍ|z} = [ψ_{ẍ|z,1}(ẍ), ..., ψ_{ẍ|z,n}(ẍ)] ≜ V_{ẍ|z} Λ_{ẍ|z}^{1/2} and φ_{y|z} = [φ_{y|z,1}(ÿ), ..., φ_{y|z,n}(ÿ)] ≜ V_{y|z} Λ_{y|z}^{1/2}. We then have the following result, on which the proposed KCI-test is based.

Proposition 5 [Conditional independence test]
Under the null hypothesis H_0 (X and Y are conditionally independent given Z), the statistic

    T_CI ≜ (1/n) Tr(K̃_{Ẍ|Z} K̃_{Y|Z})    (13)

has the same asymptotic distribution as

    Ť_CI ≜ (1/n) Σ_{k=1}^{n²} λ̊_k · z²_k,    (14)

where the λ̊_k are the eigenvalues of w̌ w̌^T and w̌ = [w̌_1, ..., w̌_n], with the vector w̌_t obtained by stacking M̌_t = [ψ_{ẍ|z,1}(ẍ_t), ..., ψ_{ẍ|z,n}(ẍ_t)]^T · [φ_{y|z,1}(ÿ_t), ..., φ_{y|z,n}(ÿ_t)].⁴

⁴ Note that, equivalently, the λ̊_k are the eigenvalues of w̌^T w̌; hence there are at most n non-zero values of λ̊_k.

Similarly to unconditional independence testing, we can perform the KCI-test by generating the approximate null distribution with Monte Carlo simulation. We first need to calculate K̃_{Ẍ|Z} according to (11), K̃_{Y|Z} according to (12), and their eigenvalues and eigenvectors. We then evaluate T_CI according to (13) and simulate the distribution of Ť_CI given by (14) by drawing i.i.d. χ²_1 samples and summing them up with weights λ̊_k. (For computational efficiency, in practice we drop all λ_{ẍ|z,i}, λ_{y|z,i}, and λ̊_k which are smaller than 10^{-5}.) Finally, the p-value is calculated as the probability of Ť_CI exceeding T_CI. Approximating the null distribution with a Gamma distribution, which is given next, avoids calculating the λ̊_k and simulating the null distribution, and is computationally more efficient.

3.4 Approximating the null distribution by a Gamma distribution

In addition to the simulation-based method to find the null distribution, as in Gretton et al. (2008), we provide approximations to the null distributions with a two-parameter Gamma distribution. The two parameters of the Gamma distribution are related to the mean and variance. In particular, under the null hypothesis that X and Y are independent (resp. conditionally independent given Z), the distribution of Ť_UI given by (9) (resp. of Ť_CI given by (14)) can be approximated by the Γ(k, θ) distribution:

    p(t) = t^{k−1} e^{−t/θ} / (θ^k Γ(k)),

where k = E²(Ť_UI)/Var(Ť_UI) and θ = Var(Ť_UI)/E(Ť_UI) in the unconditional case, and k = E²(Ť_CI)/Var(Ť_CI) and θ = Var(Ť_CI)/E(Ť_CI) in the conditional case. The means and variances of Ť_UI and Ť_CI on the given sample D are given in the following proposition.

Proposition 6 i. Under the null hypothesis that X and Y are independent, on the given sample D we have

    E(Ť_UI | D) = (1/n²) Tr(K̃_X) · Tr(K̃_Y),  and
    Var(Ť_UI | D) = (2/n⁴) Tr(K̃²_X) · Tr(K̃²_Y).

ii. Under the null hypothesis of X ⊥⊥ Y | Z, we have

    E(Ť_CI | D) = (1/n) Tr(w̌ w̌^T),  and
    Var(Ť_CI | D) = (2/n²) Tr[(w̌ w̌^T)²].
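The conditional statistic and its Gamma-approximated null can be sketched in the same style. The code below is a simplified reading of Sections 3.3 and 3.4, not the released implementation: it assumes the centered kernel matrices for Ẍ = (X,Z), Y, and Z have already been formed (e.g., with the helper from the previous sketch), uses a fixed ridge parameter ε in place of the GP-based tuning of Section 3.5, and evaluates the moments in Proposition 6(ii) through the entrywise product of the two centered matrices, which coincides with w̌^T w̌ under the definitions above.

```python
# Sketch of the conditional test of Secs. 3.3-3.4 (simplified, illustrative).
import numpy as np
from scipy.stats import gamma

def kci_cond_test(Kxz, Ky, Kz, eps=1e-3):
    """Return (T_CI, p-value) for H0: X _||_ Y | Z, given centered kernel matrices
    for (X, Z), Y and Z; the p-value uses the Gamma approximation of Sec. 3.4."""
    n = Kz.shape[0]
    Rz = eps * np.linalg.inv(Kz + eps * np.eye(n))        # Eq. (10)
    Kxz_z = Rz @ Kxz @ Rz                                 # Eq. (11)
    Ky_z = Rz @ Ky @ Rz                                   # Eq. (12)
    T = np.trace(Kxz_z @ Ky_z) / n                        # Eq. (13)

    # Proposition 6(ii): since w_check_t stacks the outer product of the two
    # empirical feature maps at sample t, w_check^T w_check equals the entrywise
    # product below and shares its non-zero eigenvalues with w_check w_check^T
    # (footnote 4).
    W = Kxz_z * Ky_z                                      # Hadamard product
    mean_T = np.trace(W) / n                              # E(T_CI-check | D)
    var_T = 2.0 * np.sum(W ** 2) / n ** 2                 # Var(T_CI-check | D)
    k, theta = mean_T ** 2 / var_T, var_T / mean_T        # Gamma(k, theta) parameters
    return T, float(gamma.sf(T, a=k, scale=theta))
```

Replacing the Gamma step by drawing weighted χ²_1 sums from the eigenvalues of W would give the Monte Carlo variant described in Section 3.3.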

3.5 Practical issues: on determination of the hyperparameters

In unconditional independence testing, the hyperparameters are the kernel widths used to construct the kernel matrices K̃_X and K̃_Y. We found that the performance is robust to these parameters within a certain range; in particular, as in Gretton et al. (2008), we use the median of the pairwise distances of the points.

In the KCI-test, the hyperparameters fall into three categories. The first one includes the kernel widths used to construct the kernel matrices K̃_Ẍ and K̃_Y, which are used later to form the matrices K̃_{Ẍ|Z} and K̃_{Y|Z} according to (11) and (12). The values of these kernel widths should be such that one can use the information in the data effectively. In our experiments we normalize all variables to unit variance, and use some empirical values for those kernel widths: they are set to 0.8 if the sample size n ≤ 200, to 0.3 if n > 1200, and to 0.5 otherwise. We found that this simple setting works well in all our experiments.

The other two categories contain the kernel widths for constructing K̃_Z and the regularization parameter ε, which are needed in calculating R_Z in (10). These parameters should be selected carefully, especially when the conditioning set Z is large. If they are too large, the corresponding regression functions h*_f and h*_{g'} may underfit, resulting in a large probability of Type I errors (where the CI hypothesis is incorrectly rejected). On the contrary, if they are too small, these regression functions may overfit, which may increase the probability of Type II errors (where the CI hypothesis is not rejected although being false). Moreover, if ε is too small, R_Z in (10) tends to vanish, resulting in very small values in K̃_{Ẍ|Z} and K̃_{Y|Z}, and then the performance may be deteriorated by rounding errors.

We found that when the dimensionality of Z is small (say, one or two variables), the proposed method works well even with some simple empirical settings (say, ε = 10^{-3}, and the kernel width for constructing K̃_Z equal to half of that for constructing K̃_Ẍ and K̃_Y). When Z contains many variables, to make the regression functions h*_f and h*_{g'} more flexible, such that they can properly capture the information about Z in f(Ẍ) ∈ H_Ẍ and g'(Y) ∈ H_Y, respectively, we use separate regularization parameters and kernel widths for them, denoted by {ε_f, σ_f} and {ε_{g'}, σ_{g'}}. To avoid both overfitting and underfitting, we extend the Gaussian process (GP) regression framework to the multi-output case, and learn these hyperparameters by maximizing the total marginal likelihood. Details are skipped. The MATLAB source code is available at http://people.tuebingen.mpg.de/kzhang/KCI-test.zip .

4 Experiments

We apply the proposed method, KCI-test, to both synthetic and real data to evaluate its practical performance and compare it with CI_PERM (Fukumizu et al., 2008). It is also used for causal discovery.

4.1 On the effect of the dimensionality of Z and the sample size

We examine, by simulation, how the probabilities of Type I and Type II errors of the KCI-test change with the size of the conditioning set Z (D = 1, 2, ..., 5) and the sample size (n = 200 and 400) in particular situations. We consider the following two cases.

In Case I, only one variable in Z, namely Z_1, is effective, i.e., the other conditioning variables are independent from X, Y, and Z_1. To see how well the derived asymptotic null distribution approximates the true one, we examined whether the probability of Type I errors is consistent with the significance level α that is specified in advance. We generated X and Y from Z_1 according to the post-nonlinear data-generating procedure (Zhang and Hyvärinen, 2009): they were constructed as G(F(Z_1) + E), where G and F are random mixtures of linear, cubic, and tanh functions and are different for X and Y, and E is independent across X and Y and has various distributions. Hence X ⊥⊥ Y | Z holds. In our simulations the Z_i were i.i.d. Gaussian.

A good test is expected to have a small probability of Type II errors. To see how large it is for the KCI-test, we also generated data which do not follow X ⊥⊥ Y | Z, by adding the same variable to the X and Y produced above. We increased the dimensionality of Z and the sample size n, and repeated the CI tests over 1000 random replications. Figure 1 (a,b) plots the resulting probability of Type I errors and that of Type II errors at the significance levels α = 0.01 and α = 0.05, respectively.

In Case II, all variables in the conditioning set Z are effective in generating X and Y. We first generated the independent variables Z_i, and then, similarly to Case I, to examine Type I errors, X and Y were generated as G(Σ_i F_i(Z_i) + E). To examine Type II errors, we further added the same variable to X and Y, such that they became conditionally dependent given Z. Figure 1 (c,d) gives the probabilities of Type I and Type II errors of the KCI-test obtained over 1000 random replications.
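For illustration, the following sketch mimics the Case I generating process described above; the particular function mixtures, noise laws, and helper names are our choices, not the exact settings used in the experiments.

```python
# Illustrative generator for Case I: X and Y are post-nonlinear functions of Z1
# only, so X _||_ Y | Z holds; setting dependent=True breaks the CI by adding a
# shared variable, as in the Type II error experiments.
import numpy as np

def random_mixture(rng):
    """A random mixture of linear, cubic and tanh functions."""
    a, b, c = rng.uniform(-1.0, 1.0, size=3)
    return lambda v: a * v + b * v ** 3 + c * np.tanh(v)

def generate_case1(n=200, dim_z=3, dependent=False, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, dim_z))          # i.i.d. Gaussian conditioning set
    Fx, Gx = random_mixture(rng), random_mixture(rng)
    Fy, Gy = random_mixture(rng), random_mixture(rng)
    ex = rng.standard_normal(n)                  # noises independent across X and Y
    ey = rng.uniform(-1.0, 1.0, n)
    x = Gx(Fx(z[:, 0]) + ex)                     # only Z1 is effective
    y = Gy(Fy(z[:, 0]) + ey)
    if dependent:
        common = rng.standard_normal(n)          # same variable added to X and Y
        x, y = x + common, y + common
    return x, y, z
```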



Figure 1: The probabilities of Type I and Type II errors obtained by simulations in various situations (dashed lines: n = 200; solid lines: n = 400). Top: Case I (only one variable in Z is effective for X and Y); bottom: Case II (all variables in Z are effective). Panels (a,c) show Type I errors and panels (b,d) Type II errors, plotted against D, for the simulated null distribution and the Gamma approximation at α = 0.01 and α = 0.05. Note that for a good testing method, the probability of Type I errors is close to the significance level α, and that of Type II errors is as small as possible.

Figure 2: Results of CI_PERM shown for comparison. (a) Probabilities of both Type I and II errors in Case II (dashed lines: n = 200; solid lines: n = 400). (b) Average CPU time (seconds, log scale) taken by KCI-test and CI_PERM, plotted against D. Note that for KCI-test, we include both the time for simulating the null distribution and that for the Gamma approximation.

One can see that, with the derived null distribution, the 1% and 5% quantiles are approximated very well for both sample sizes, since the resulting probabilities of Type I errors are very close to the significance levels. The Gamma approximation tends to produce slightly larger probabilities of Type I errors, meaning that the two-parameter Gamma approximation may have a slightly lighter tail than the true null distribution. With a fixed significance level α, as D increases, the probability of Type II errors always increases. This is intuitively reasonable: due to the finite-sample effect, as the conditioning set becomes larger and larger, X and Y tend to be considered as conditionally independent. On the other hand, as the sample size increases from 200 to 400, the probability of Type II errors quickly approaches zero.

We compared the KCI-test with CI_PERM (with the standard setting of 500 bootstrap samples) in terms of both types of errors and computational efficiency. For conciseness, we only report the probabilities of Type I and II errors of CI_PERM in Case II; see Figure 2 (a). One can see that even when D = 1, the probability of Type I errors is clearly larger than the corresponding significance level. Furthermore, it is very sensitive to D and n. As D becomes large, say, greater than 3, the probabilities of Type II errors increase rapidly to 1, i.e., the test almost always fails to reject the CI hypothesis when it is actually false. It seems that the sample size is too small for CI_PERM to give reliable results. Figure 2 (b) shows the average CPU time taken by the KCI-test and CI_PERM (note that it is in log scale). The KCI-test is computationally more efficient, especially when D is large. Because we use the GP regression framework to learn hyperparameters, on which the KCI-test spends most of its time, the computational load of the KCI-test is more sensitive to n.⁵ On the other hand, it is far less sensitive to D.⁶

⁵ As claimed in Sec. 3.5, when D is not high, say, D = 1 or 2, even fixed empirical values of the hyperparameters work very well. However, for consistency of the comparison, we always used GP for hyperparameter learning.

⁶ To see the price our test pays for generality, in another simulation we considered the linear Gaussian case and compared the KCI-test with the partial-correlation-based one. We found that for the particular problems we investigated, both methods give very similar Type I errors; partial correlation gives much smaller Type II errors than the KCI-test, which is natural since the KCI-test applies to far more general situations. Details are skipped due to space limitation.

4.2 Application in causal discovery

CI tests are frequently used in problems of causal inference: in those problems, one assumes that the true causal structure of n random variables X_1, ..., X_n can be represented by a directed acyclic graph (DAG) G. More specifically, the causal Markov condition assumes that the joint distribution satisfies all CIs that are imposed by the true causal graph (note that this is an assumption about the physical generating process of the data, not only about their distribution). So-called constraint-based methods like the PC algorithm (Spirtes et al., 2001) make the additional assumption of faithfulness (i.e., the joint distribution does not allow any CIs that are not entailed by the Markov condition) and recover the graph structure by exploiting the (conditional) independences that can be found in the data. Obviously, this is only possible up to Markov equivalence classes, which are sets of graphs that impose exactly the same independences and CIs. It is well known that small mistakes at the beginning of the algorithm (e.g., missing an independence relation) may lead to significant errors in the resulting DAG. Therefore the performance of those methods relies heavily on (conditional) independence testing methods.
For continuous data, the PC algorithm can be applied using partial correlation or mutual information. The former assumes linear relationships and Gaussian distributions, and the latter does not lead to a significance test. Sun et al. (2007) and Tillman et al. (2009) propose to use CI_PERM for CI testing in PC. Based on the promising results of the KCI-test, we propose to also apply it to causal inference.

4.2.1 Simulated data

We generated data from a random DAG G. In particular, we randomly chose whether an edge exists and sampled the functions that relate the variables from a Gaussian process prior. We sampled four random variables X_1, ..., X_4 and allowed arrows from X_i to X_j only for i < j. With probability 0.5 each possible arrow is either present or absent. If arrows exist, from X_1 and X_3 to X_4, say, we sample X_4 from a Gaussian process with mean function U_1 · X_1 + U_3 · X_3 (with U_1, U_3 ∼ 𝒰[−2, 2]) and a Gaussian kernel (with each dimension randomly weighted between 0.1 and 0.6) plus a noise kernel as the covariance function. For significance level 0.01 and sample sizes between 100 and 700 we simulated 100 DAGs, and checked how often the different methods infer the correct Markov equivalence class. Figure 3 shows how often PC based on the KCI-test, CI_PERM, or partial correlation recovered the correct Markov equivalence class. PC based on the KCI-test gives clearly the best results.
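To make the role of the CI test in such constraint-based search concrete, here is a heavily simplified, illustrative skeleton-search step in the spirit of the PC algorithm. It omits orientation rules, separating-set bookkeeping, and the other refinements of PC; ci_test is assumed to be any user-supplied function returning a p-value for X_i ⊥⊥ X_j | X_S, for instance a wrapper around the tests sketched earlier.

```python
# Simplified PC-style skeleton search (illustrative): start from a complete
# undirected graph and delete an edge i-j as soon as some conditioning set S of
# growing size makes the CI hypothesis non-rejected at level alpha.
from itertools import combinations

def pc_skeleton(n_vars, ci_test, data, alpha=0.01, max_cond=3):
    adj = {i: set(range(n_vars)) - {i} for i in range(n_vars)}
    for size in range(max_cond + 1):
        for i in range(n_vars):
            for j in sorted(adj[i]):
                if j < i:
                    continue                       # test each pair once
                others = sorted(adj[i] - {j})
                for S in combinations(others, size):
                    if ci_test(i, j, list(S), data) > alpha:
                        adj[i].discard(j)          # cannot reject CI: remove edge
                        adj[j].discard(i)
                        break
    return adj
```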

Figure 3: The chance that the correct Markov equivalence class was inferred with PC combined with different CI testing methods (y-axis: proportion of correct Markov equivalence classes; x-axis: sample size, 100 to 700). KCI-test outperforms CI_PERM and partial correlation.

4.2.2 Real data

We applied our method to the continuous variables in the Boston Housing data set, which is available at the UCI Repository (Asuncion and Newman, 2007). Due to the large number of variables we chose the significance level to be 0.001, as a rough way to correct for multiple testing. Figure 4 shows the results for PC using CI_PERM (PC_CIPERM) and the KCI-test (PC_KCI-test). For conciseness, we report them in the same figure: the red arrows are the ones inferred by PC_CIPERM and all solid lines show the result of PC_KCI-test. Ergo, red solid lines were found by both methods. Please refer to the data set for the explanation of the variables. Although one can argue about the ground truth for this data set, we regard it as promising that our method finds links between the number of rooms (RM) and the median value of houses (MED), and between non-retail business (IND) and nitric oxides concentration (NOX). The latter link is also missing in the result on these data given by Margaritis (2005); instead, their method gives some dubious links, like crime rate (CRI) to nitric oxides (NOX), for example.

Figure 4: Result of the PC algorithm applied to the continuous variables of the Boston Housing Data Set, shown as a graph over the variables MED, CRI, DIS, TAX, B, RM, LST, AGE, NOX, and IND (red lines: PC_CIPERM, solid lines: PC_KCI-test).

5 Conclusion

We proposed a novel method for conditional independence testing. It makes use of the characterization of conditional independence in terms of uncorrelatedness of functions in suitable reproducing kernel Hilbert spaces, and the proposed test statistic can be easily calculated from the kernel matrices. We derived its distribution under the null hypothesis of conditional independence. This distribution can either be generated by Monte Carlo simulation or approximated by a two-parameter Gamma distribution. Compared to discretization-based conditional independence testing methods, the proposed one exploits more complete information of the data, and compared to methods defining the test statistic in terms of estimated conditional densities, the calculation of the proposed statistic involves less random error. We applied the method to both simulated and real-world data, and the results suggest that the method outperforms existing techniques in both accuracy and speed.

Acknowledgement

KZ thanks Stefanie Jegelka for helpful discussions. DJ was supported by DFG (SPP 1395).

Appendix: Proofs

The following lemmas will be used in the proof of Theorem 3. Consider the eigenvalues λ_{X,i} and eigenvectors V_{x,i} of the kernel matrix K̃_X, and the eigenvalues λ*_{X,i} and normalized eigenfunctions u_{X,i} (which have unit variance) of the corresponding kernel k_X. Let x̌ be a fixed-size subsample of x. The following lemma is due to Theorems 3.4 and 3.5 of Baker (1977).

Lemma 7 i. (1/n) λ_{X,i} converges in probability to λ*_{X,i}.
ii. For the simple eigenvalue λ*_{X,i}, whose algebraic multiplicity is 1, √n V_{x,i}(x̌) converges in probability to u_{X,i}(x̌), where V_{x,i}(x̌) denotes the values of V_{x,i} corresponding to the subsample x̌.

Suppose that the eigenvalues λ*_{X,k}, λ*_{X,k+1}, ..., λ*_{X,k+r} (r ≥ 1) are the same and different from any other eigenvalue. The corresponding eigenfunctions are then non-unique. Denote by u⃗_{X,k:k+r} ≜ (u_{X,k}, u_{X,k+1}, ..., u_{X,k+r}) an arbitrary vector of the eigenfunctions corresponding to this eigenvalue with multiplicity r + 1. Let S_{X,k} be the space of u⃗_{X,k:k+r}.

Lemma 8 i. The distance from √n V_{x,k+q}(x̌) (0 ≤ q ≤ r) to the corresponding points in the space S_{X,k} converges to zero in the following sense:

    inf{ ‖√n V_{x,k+q}(x̌) − u'(x̌)‖ : u'(x) ∈ S_{X,k} } → 0,  as n → ∞.

ii. There exists an (r+1) × (r+1) orthogonal matrix P_n, which may depend on n, such that √n · P_n · [V_{x,k}(x̌), V_{x,k+1}(x̌), ..., V_{x,k+r}(x̌)]^T converges in probability to u⃗_{X,k:k+r}(x̌).

Item (i.) in the above lemma is a reformulation of Theorem 3.6 of Baker (1977), while item (ii.) is its straightforward consequence. The following lemma is a reformulation of Theorem 4.2 of Billingsley (1999).

Lemma 9 Let {A_{L,n}} be a double sequence of random variables with indices L and n, {B_L} and {C_n} sequences of random variables, and D a random variable. Assume that they are defined in a separable probability space. Suppose that, for each L, A_{L,n} →^d B_L as n → ∞, and that B_L →^d D as L → ∞. Suppose further that lim_{L→∞} lim sup_{n→∞} P({|A_{L,n} − C_n| ≥ ε}) = 0 for each positive ε. Then C_n →^d D as n → ∞.

Sketch of proof of Theorem 3. We first define an L × L block-diagonal matrix P_X as follows. For simple eigenvalues λ*_{X,i}, P_{X,ii} = 1, and all other entries in the ith row and column are zero. For eigenvalues with multiplicity r + 1, say λ*_{X,k}, λ*_{X,k+1}, ..., λ*_{X,k+r}, the corresponding main diagonal block from the kth row to the (k+r)th row of P_X is an orthogonal matrix. According to Lemmas 7 and 8 and the continuous mapping theorem (CMT) (Mann and Wald, 1943), which states that continuous functions are limit-preserving even if their arguments are sequences of random variables, there exists such a P_X, which may depend on n, such that P_X · [ψ_1(x_t), ..., ψ_L(x_t)]^T → [√λ*_{X,1} u_{X,1}(x_t), ..., √λ*_{X,L} u_{X,L}(x_t)]^T in probability as n → ∞. Similarly, we can define the orthogonal matrix P_Y such that P_Y · [φ_1(y_t), ..., φ_L(y_t)]^T → [√λ*_{Y,1} u_{Y,1}(y_t), ..., √λ*_{Y,L} u_{Y,L}(y_t)]^T in probability as n → ∞.

Let v_t be the random vector obtained by stacking the random L × L matrix P_X · M_t · P_Y^T, with M_{ij,t} = ψ_i(x_t) · φ_j(y_t) as the (i,j)th entry of M_t. One can see that

    (1/n) ‖Σ_{t=1}^n v_t‖² = (1/n) Tr( P_X (Σ_t M_t) P_Y^T · P_Y (Σ_t M_t)^T P_X^T ) = (1/n) Tr( (Σ_t M_t)(Σ_t M_t)^T ) = Σ_{i,j=1}^L S²_{ij}.    (15)

Again, according to the CMT, one can see that v_t converges in probability to w_t. Furthermore, according to the vector-valued CLT (Eicker, 1966), as the x_t and y_t are i.i.d., the vector (1/√n) Σ_{t=1}^n w_t converges in distribution to a multivariate normal distribution as n → ∞. Because of the CMT, (1/√n) Σ_{t=1}^n v_t then converges to the same normal distribution as n → ∞. As f(X) ∈ H_X and g(Y) ∈ H_Y are uncorrelated, we know that u_{X,i} and u_{Y,j} are uncorrelated, and consequently the mean of this normal distribution is E(w_t) = 0. The covariance is Σ = Cov((1/√n) Σ_{t=1}^n w_t) = Cov(w_t) = E(w_t w_t^T). Assume that we have the EVD Σ = V_w Λ_w V_w^T, where Λ_w is the diagonal matrix containing the non-negative eigenvalues λ̊*_k. Let v'_t ≜ V_w^T v_t. Clearly (1/√n) Σ_t v'_t follows N(0, Λ_w) asymptotically. That is,

    (1/n) ‖Σ_t v_t‖² = (1/n) ‖Σ_t v'_t‖²  →^d  Σ_{k=1}^{L²} λ̊*_k z²_k.    (16)

Combining (15) and (16) gives (5).

If X and Y are independent, for k ≠ i or l ≠ j, one can see that the non-diagonal entries of Σ are E[√(λ*_{X,i} λ*_{Y,j} λ*_{X,k} λ*_{Y,l}) u_{X,i}(x_t) u_{Y,j}(y_t) u_{X,k}(x_t) u_{Y,l}(y_t)] = √(λ*_{X,i} λ*_{Y,j} λ*_{X,k} λ*_{Y,l}) E[u_{X,i}(x_t) u_{X,k}(x_t)] E[u_{Y,j}(y_t) u_{Y,l}(y_t)] = 0. The diagonal entries of Σ are λ*_{X,i} λ*_{Y,j} · E[u²_{X,i}(x_t)] E[u²_{Y,j}(y_t)] = λ*_{X,i} λ*_{Y,j}, which are therefore also the eigenvalues of Σ. Substituting this result into (5), one obtains (6).

Finally, consider Lemma 9, and let A_{L,n} ≜ Σ_{i,j=1}^L S²_{ij}, B_L ≜ Σ_{k=1}^{L²} λ̊*_k z²_k, C_n ≜ Σ_{i,j=1}^n S²_{ij}, and D ≜ Σ_{k=1}^∞ λ̊*_k z²_k. One can then see that Σ_{i,j=1}^n S²_{ij} →^d Σ_{k=1}^∞ λ̊*_k z²_k as n → ∞. That is, (5) also holds as L = n → ∞. As a special case of (5), (6) also holds as L = n → ∞. □

Sketch of proof of Theorem 4. On the one hand, due to (7), we have T_UI = Σ_{i,j=1}^n S²_{ij}, which, according to (6), converges in distribution to Σ_{i,j=1}^∞ λ*_{X,i} λ*_{Y,j} · z²_{ij} as n → ∞. On the other hand, by extending the proof of Theorem 1 of Gretton et al. (2009), one can show that Σ_{i,j=1}^∞ ((1/n²) λ_{x,i} λ_{y,j} − λ*_{X,i} λ*_{Y,j}) z²_{ij} → 0 in probability as n → ∞. That is, Ť_UI converges in probability to Σ_{i,j=1}^∞ λ*_{X,i} λ*_{Y,j} z²_{ij} as n → ∞. Consequently, T_UI and Ť_UI have the same asymptotic distribution. □

Sketch of proof of Proposition 5. Here we let X, Y, K̃_X, and K̃_Y in Theorem 3 be X, (Y,Z), K̃_{Ẍ|Z}, and K̃_{Y|Z}, respectively. According to (7) and Theorem 3, one can see that T_CI →^d Σ_{k=1}^∞ λ̊*_k z²_k as n → ∞. Again, we extend the proof of Theorem 1 of Gretton et al. (2009) to show that Σ_{k=1}^∞ ((1/n) λ̊_k − λ̊*_k) z²_k → 0 as n → ∞, or that Σ_{k=1}^∞ (1/n) λ̊_k z²_k → Σ_{k=1}^∞ λ̊*_k z²_k in probability as n → ∞. The key step is to show that Σ_k |(1/n) λ̊_k − λ̊*_k| → 0 in probability as n → ∞. Details are skipped. □

Sketch of proof of Proposition 6. As the z²_{ij} follow the χ² distribution with one degree of freedom, we have E(z²_{ij}) = 1 and Var(z²_{ij}) = 2. According to (9), we have E(Ť_UI | D) = (1/n²) Σ_{i,j} λ_{x,i} λ_{y,j} = (1/n²) Σ_i λ_{x,i} Σ_j λ_{y,j} = (1/n²) Tr(K̃_X) · Tr(K̃_Y). Furthermore, bearing in mind that the z²_{ij} are independent across i and j, and recalling that Tr(K̃²_X) = Σ_i λ²_{x,i}, one can see that Var(Ť_UI | D) = (1/n⁴) Σ_{i,j} λ²_{x,i} λ²_{y,j} Var(z²_{ij}) = (2/n⁴) Σ_i λ²_{x,i} Σ_j λ²_{y,j} = (2/n⁴) Tr(K̃²_X) · Tr(K̃²_Y). Consequently (i) is true. Similarly, from (14), one can calculate the mean E(Ť_CI | D) and variance Var(Ť_CI | D), as given in (ii). □

References

A. Asuncion and D. J. Newman. UCI machine learning repository. http://archive.ics.uci.edu/ml/, 2007.

C. Baker. The Numerical Treatment of Integral Equations. Oxford University Press, 1977.

W. P. Bergsma. Testing conditional independence for continuous random variables, 2004. EURANDOM-report 2004-049.

P. Billingsley. Convergence of Probability Measures. John Wiley and Sons, 1999.

J. J. Daudin. Partial association measures and an application to qualitative regression. Biometrika, 67:581–590, 1980.

A. P. Dawid. Conditional independence in statistical theory. Journal of the Royal Statistical Society, Series B, 41:1–31, 1979.

F. Eicker. A multivariate central limit theorem for random linear vector forms. The Annals of Mathematical Statistics, 37:1825–1828, 1966.

K. Fukumizu, F. R. Bach, M. I. Jordan, and C. Williams. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. JMLR, 5, 2004.

K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In NIPS 20, pages 489–496, Cambridge, MA, 2008. MIT Press.

A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A kernel statistical test of independence. In NIPS 20, pages 585–592, Cambridge, MA, 2008.

A. Gretton, K. Fukumizu, Z. Harchaoui, and B. K. Sriperumbudur. A fast, consistent kernel two-sample test. In NIPS 23, pages 673–681, Cambridge, MA, 2009. MIT Press.

T. M. Huang. Testing conditional independence using maximal nonlinear conditional correlation. Ann. Statist., 38:2047–2091, 2010.

D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, MA, 2009.

A. J. Lawrance. On conditional and partial correlation. The American Statistician, 30:146–149, 1976.

O. Linton and P. Gozalo. Conditional independence restrictions: testing and estimation, 1997. Cowles Foundation Discussion Paper 1140.

H. B. Mann and A. Wald. On stochastic limit and order relationships. The Annals of Mathematical Statistics, 14:217–226, 1943.

D. Margaritis. Distribution-free learning of Bayesian network structure in continuous domains. In Proc. AAAI 2005, pages 825–830, Pittsburgh, PA, 2005.

J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, 2000.

B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

K. Song. Testing conditional independence via Rosenblatt transforms. Ann. Statist., 37:4011–4045, 2009.

P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd edition, 2001.

L. Su and H. White. A consistent characteristic function-based test for conditional independence. Journal of Econometrics, 141:807–834, 2007.

L. Su and H. White. A nonparametric Hellinger metric test for conditional independence. Econometric Theory, 24:829–864, 2008.

X. Sun, D. Janzing, B. Schölkopf, and K. Fukumizu. A kernel-based causal learning algorithm. In Proc. ICML 2007, pages 855–862. Omnipress, 2007.

R. Tillman, A. Gretton, and P. Spirtes. Nonlinear directed acyclic structure learning with weakly additive noise models. In NIPS 22, Vancouver, Canada, 2009.

K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In Proc. UAI 25, Montreal, Canada, 2009.