
Kernel-based Conditional Independence Test and Application in Causal Discovery

Kun Zhang, Jonas Peters, Dominik Janzing, Bernhard Schölkopf
Max Planck Institute for Intelligent Systems, Spemannstr. 38, 72076 Tübingen, Germany

Abstract

Conditional independence testing is an important problem, especially in Bayesian network learning and causal discovery. Due to the curse of dimensionality, testing for conditional independence of continuous variables is particularly challenging. We propose a Kernel-based Conditional Independence test (KCI-test), by constructing an appropriate test statistic and deriving its asymptotic distribution under the null hypothesis of conditional independence. The proposed method is computationally efficient and easy to implement. Experimental results show that it outperforms other methods, especially when the conditioning set is large or the sample size is not very large, in which case other methods encounter difficulties.

1 Introduction

Statistical independence and conditional independence (CI) are important concepts in statistics, artificial intelligence, and related fields (Dawid, 1979). Let X, Y and Z denote sets of random variables. The CI between X and Y given Z, denoted by X ⊥⊥ Y | Z, reflects the fact that given the values of Z, further knowing the values of X (or Y) does not provide any additional information about Y (or X). Independence and CI play a central role in causal discovery and Bayesian network learning (Pearl, 2000; Spirtes et al., 2001; Koller and Friedman, 2009). Generally speaking, the CI relationship X ⊥⊥ Y | Z allows us to drop Y when constructing a probabilistic model for X with (Y, Z), which results in a parsimonious representation.

Testing for CI is much more difficult than testing for unconditional independence (Bergsma, 2004). For CI tests, traditional methods either focus on the discrete case, or impose simplifying assumptions to deal with the continuous case; in particular, the variables are often assumed to have linear relations with additive Gaussian errors. In that case, X ⊥⊥ Y | Z reduces to zero partial correlation or zero conditional correlation between X and Y given Z, which can be easily tested (for the links between partial correlation, conditional correlation, and CI, see Lawrance (1976)). However, nonlinearity and non-Gaussian noise are frequently encountered in practice, and hence this assumption can lead to incorrect conclusions.
Recently, practical methods have been proposed for testing CI for continuous variables without assuming a functional form between the variables or specific data distributions, which is the case we are concerned with in this paper. To our knowledge, the existing methods fall into four categories. The first category is based on explicit estimation of the conditional densities or their variants. For example, Su and White (2008) define the test statistic as some distance between the estimated conditional densities p(X|Y,Z) and p(X|Z), and Su and White (2007) exploit the difference between the characteristic functions of these conditional densities. The estimation of the conditional densities or related quantities is difficult, which deteriorates the testing performance, especially when the conditioning set Z is not small enough. Methods in the second category, such as Margaritis (2005) and Huang (2010), discretize the conditioning set Z into a set of bins, and transform CI to unconditional independence in each bin. Inevitably, due to the curse of dimensionality, as the conditioning set becomes larger, the required sample size increases dramatically. Methods in the third category, including Linton and Gozalo (1997) and Song (2009), provide slightly weaker tests than that for CI. For instance, the method proposed by Song (2009) tests whether one can find some (nonlinear) function h and parameters θ_0 such that X and Y are conditionally independent given a single index function λ_{θ_0}(Z) = h(Z^T θ_0) of Z. In general, this is different from the test for X ⊥⊥ Y | Z: to see this, consider the case where X and Y depend on two different but overlapping subsets of Z; even if X ⊥⊥ Y | Z, it is impossible to find a λ_{θ_0}(Z) given which X and Y are conditionally independent.

Fukumizu et al. (2004) give a general nonparametric characterization of CI using covariance operators in reproducing kernel Hilbert spaces (RKHS), which inspired a kernel-based measure of conditional dependence (see also Fukumizu et al. (2008)). However, the distribution of this measure under the CI hypothesis is unknown, and consequently it cannot directly serve as a CI test. To get a test, one has to combine this conditional dependence measure with local bootstrap or local permutation, which is used to determine the rejection region (Fukumizu et al., 2008; Tillman et al., 2009). This leads to the method in the fourth category, which we denote by CI_PERM. Like the methods in the second category, this approach requires a large sample size and tends to be unreliable when the number of conditioning variables increases.

In this paper we aim to develop a CI testing method which avoids the above drawbacks. In particular, based on appropriate characterizations of CI, we define a simple test statistic which can be easily calculated from the kernel matrices associated with X, Y, and Z, and we further derive its asymptotic distribution under the null hypothesis. We also provide ways to estimate this distribution, so that CI can finally be tested conveniently. This results in a Kernel-based Conditional Independence test (KCI-test). In this procedure we do not explicitly estimate the conditional or joint densities, nor do we discretize the conditioning variables. Our method is computationally appealing and is less sensitive to the dimensionality of Z than other methods. Our results contain unconditional independence testing (similar to Gretton et al. (2008)) as a special case.
2 Characterization of Independence and Conditional Independence

We introduce the following notational convention. Throughout this paper, X, Y, and Z are continuous random variables or sets of continuous random variables, with domains 𝒳, 𝒴, and 𝒵, respectively. Define a measurable, positive definite kernel k_X on 𝒳 and denote the corresponding RKHS by H_X. Similarly we define k_Y, H_Y, k_Z, and H_Z. In this paper we assume that all involved RKHS's are separable and square integrable. The probability law of X is denoted by P_X, and similarly for the joint probability laws such as P_XZ. The spaces of square integrable functions of X and of (X,Z) are denoted by L²_X and L²_XZ, respectively; e.g., L²_XZ = {g(X,Z) | E[g²] < ∞}. x = {x_1, ..., x_n} denotes the i.i.d. sample of X of size n. K_X is the kernel matrix of the sample x, and the corresponding centralized kernel matrix is K̃_X ≜ H K_X H, where H = I − (1/n)11^T, with I and 1 being the n × n identity matrix and the vector of 1's, respectively. By default we use the Gaussian kernel, i.e., the (i,j)th entry of K_X is k(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ_X²)), where σ_X is the kernel width. Similar notations are used for Y and Z.

The problem we consider here is to test for CI between sets of continuous variables X and Y given Z from their observed i.i.d. samples, without making specific assumptions on their distributions or the functional forms between them. X and Y are said to be conditionally independent given Z if and only if p_{X|Y,Z} = p_{X|Z} (or equivalently, p_{Y|X,Z} = p_{Y|Z}, or p_{XY|Z} = p_{X|Z} p_{Y|Z}). Therefore, a direct way to assess if X ⊥⊥ Y | Z is to estimate the densities involved and then evaluate whether the above equation is plausible. However, density estimation in high dimensions is a difficult problem: it is well known that in nonparametric joint or conditional density estimation, due to the curse of dimensionality, the number of data points required to achieve a given accuracy increases exponentially in the data dimension. Fortunately, conditional (in)dependence is just one particular property of the distributions; to test for it, it is possible to avoid explicitly estimating the densities.

There are other ways to characterize the CI relation that do not explicitly involve the densities or their variants, and they may result in more efficient methods for CI testing. Recently, a characterization of CI was given in terms of the cross-covariance operator Σ_YX on the RKHS (Fukumizu et al., 2004). For the random vector (X,Y) on 𝒳 × 𝒴, the cross-covariance operator from H_X to H_Y is defined by the relation

    ⟨g, Σ_YX f⟩ = E_XY[f(X)g(Y)] − E_X[f(X)] E_Y[g(Y)]

for all f ∈ H_X and g ∈ H_Y.

The conditional cross-covariance operator of (X,Y) given Z is further defined by

    Σ_{YX|Z} = Σ_{YX} − Σ_{YZ} Σ_{ZZ}^{-1} Σ_{ZX}.    (1)

Intuitively, one can interpret it as the partial covariance between {f(X), ∀f ∈ H_X} and {g(Y), ∀g ∈ H_Y} given {h(Z), ∀h ∈ H_Z}.¹ If characteristic kernels² are used, the conditional cross-covariance operator is related to the CI relation, as seen from the following lemma.

¹ If Σ_ZZ is not invertible, one should use the right inverse instead of the inverse; see Corollary 3 in Fukumizu et al. (2004).
² A kernel k_X is said to be characteristic if the condition E_{X∼P}[f(X)] = E_{X∼Q}[f(X)] (∀f ∈ H) implies P = Q, where P and Q are two probability distributions of X (Fukumizu et al., 2008). Hence, the notion "characteristic" was also called "probability-determining" in Fukumizu et al. (2004). Many popular kernels, such as the Gaussian one, are characteristic.

Lemma 1 [Characterization based on conditional cross-covariance operators (Fukumizu et al., 2008)]
Denote Ẍ ≜ (X,Z), k_Ẍ ≜ k_X k_Z, and by H_Ẍ the RKHS corresponding to k_Ẍ. Assume H_X ⊂ L²_X, H_Y ⊂ L²_Y, and H_Z ⊂ L²_Z. Further assume that k_Ẍ k_Y is a characteristic kernel on (𝒳 × 𝒵) × 𝒴, and that H_Z + R (the direct sum of the two RKHSs) is dense in L²(P_Z). Then

    Σ_{ẌY|Z} = 0  ⟺  X ⊥⊥ Y | Z.    (2)

Note that one can replace Σ_{ẌY|Z} with Σ_{ẌŸ|Z}, where Ÿ ≜ (Y,Z), in the above lemma. Alternatively, Daudin (1980) gives a characterization of CI by explicitly enforcing the uncorrelatedness of functions in suitable spaces, which may be intuitively more appealing. In particular, consider the constrained L² spaces

    E_XZ ≜ {f̃ ∈ L²_XZ | E(f̃|Z) = 0},
    E_YZ ≜ {g̃ ∈ L²_YZ | E(g̃|Z) = 0},
    E'_YZ ≜ {g̃' | g̃' = g'(Y) − E(g'|Z), g' ∈ L²_Y}.    (3)

They can be constructed from the corresponding L² spaces via nonlinear regression. For instance, for any function f ∈ L²_XZ, the corresponding function f̃ is given by

    f̃(Ẍ) = f(Ẍ) − E(f|Z) = f(Ẍ) − h*_f(Z),    (4)

where h*_f(Z) ∈ L²_Z is the regression function of f(Ẍ) on Z. One can then relate CI to uncorrelatedness in the following way.

Lemma 2 [Characterization based on partial association (Daudin, 1980)]
The following conditions are equivalent to each other: (i.) X ⊥⊥ Y | Z; (ii.) E(f̃g̃) = 0, ∀f̃ ∈ E_XZ and g̃ ∈ E_YZ; (iii.) E(f̃g) = 0, ∀f̃ ∈ E_XZ and g ∈ L²_YZ; (iv.) E(f̃g̃') = 0, ∀f̃ ∈ E_XZ and g̃' ∈ E'_YZ; (v.) E(f̃g') = 0, ∀f̃ ∈ E_XZ and g' ∈ L²_Y.

The above result can be considered as a generalization of the partial-correlation-based characterization of CI for Gaussian variables. Suppose that (X,Y,Z) is jointly Gaussian; then X ⊥⊥ Y | Z is equivalent to the vanishing of the partial correlation coefficient ρ_{XY·Z}. Here, intuitively speaking, condition (ii) means that any "residual" function of (X,Z) given Z is uncorrelated with that of (Y,Z) given Z. Note that E_XZ (resp. E_YZ) contains all functions of X and Z (resp. of Y and Z) that cannot be "explained" by Z, in the sense that any function f̃ ∈ E_XZ (resp. g̃ ∈ E_YZ) is uncorrelated with any function of Z (Daudin, 1980).

From the definition of the conditional cross-covariance operator (1), one can see the close relationship between the conditions in Lemma 1 and those in Lemma 2. However, Lemma 1 has practical advantages: in Lemma 2 one has to consider all functions in L², while Lemma 1 exploits the spaces corresponding to characteristic kernels, which might be much smaller. In fact, if we restrict the functions f and g' to the spaces H_Ẍ and H_Y, respectively, Lemma 2 reduces to Lemma 1. The above characterizations of CI motivated our statistic for testing X ⊥⊥ Y | Z, as presented below.
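As a small worked special case (ours, not taken from the paper), consider linear kernels and jointly Gaussian (X, Y, Z): the operators in (1) then reduce to ordinary covariance matrices, and the criterion recovers the classical partial-correlation characterization referred to above.

```latex
% Illustrative special case (not from the paper): linear kernels, jointly Gaussian (X, Y, Z).
% With f(X) = a^\top X and g(Y) = b^\top Y, the cross-covariance operators reduce to
% covariance matrices, so (1) becomes the usual partial covariance:
\Sigma_{YX|Z} \;=\; C_{YX} - C_{YZ}\, C_{ZZ}^{-1}\, C_{ZX},
% and, for one-dimensional Gaussian X and Y,
\Sigma_{YX|Z} = 0
\;\Longleftrightarrow\;
\rho_{XY\cdot Z} = 0
\;\Longleftrightarrow\;
X \perp\!\!\!\perp Y \mid Z .
```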
3 A Kernel-Based Conditional Independence Test

3.1 General results

As seen above, independence and CI can be characterized by uncorrelatedness between functions in certain spaces. We first give some general results on the asymptotic distributions of statistics defined in terms of the kernel matrices under the condition of such uncorrelatedness. Later these results will be used for testing for CI as well as for unconditional independence.

Suppose that we are given the i.i.d. samples x ≜ (x_1, ..., x_t, ..., x_n) and y ≜ (y_1, ..., y_t, ..., y_n) for X and Y, respectively. Suppose further that we have the eigenvalue decompositions (EVD) of the centralized kernel matrices K̃_X and K̃_Y, i.e., K̃_X = V_x Λ_x V_x^T and K̃_Y = V_y Λ_y V_y^T, where Λ_x and Λ_y are the diagonal matrices containing the non-negative eigenvalues λ_{x,i} and λ_{y,i}, respectively. Here, the eigenvalues are sorted in descending order, i.e., λ_{x,1} ≥ λ_{x,2} ≥ ... ≥ λ_{x,i} ≥ 0 and λ_{y,1} ≥ λ_{y,2} ≥ ... ≥ λ_{y,i} ≥ 0. Let ψ_x = [ψ_1(x), ..., ψ_n(x)] ≜ V_x Λ_x^{1/2} and φ_y = [φ_1(y), ..., φ_n(y)] ≜ V_y Λ_y^{1/2}; i.e., ψ_i(x) = √λ_{x,i} V_{x,i}, where V_{x,i} denotes the ith eigenvector of K̃_X.

On the other hand, consider the eigenvalues λ*_{X,i} and eigenfunctions u_{X,i} of the kernel k_X w.r.t. the probability measure with density p(x), i.e., λ*_{X,i} and u_{X,i} satisfy

    ∫ k_X(x, x') u_{X,i}(x) p(x) dx = λ*_{X,i} u_{X,i}(x').

Here we assume that the u_{X,i} have unit variance, i.e., E[u²_{X,i}(X)] = 1. We also sort the λ*_{X,i} in descending order. Similarly, we define λ*_{Y,i} and u_{Y,i} of k_Y. Define

    S_{ij} ≜ (1/√n) ψ_i(x)^T φ_j(y) = (1/√n) Σ_{t=1}^n ψ_i(x_t) φ_j(y_t),

with ψ_i(x_t) being the t-th component of the vector ψ_i(x). We then have the following results.

Theorem 3 Suppose that we are given arbitrary centred kernels k_X and k_Y with discrete eigenvalues and the corresponding RKHS's H_X and H_Y for sets of random variables X and Y, respectively. We have the following three statements.

1) Under the condition that f(X) and g(Y) are uncorrelated for all f ∈ H_X and g ∈ H_Y, for any L such that λ*_{X,L+1} ≠ λ*_{X,L} and λ*_{Y,L+1} ≠ λ*_{Y,L}, we have

    Σ_{i,j=1}^L S²_{ij}  →^d  Σ_{k=1}^{L²} λ̊*_k z²_k,  as n → ∞,    (5)

where the z_k are i.i.d. standard Gaussian variables (i.e., the z²_k are i.i.d. χ²_1-distributed variables), the λ̊*_k are the eigenvalues of E(ww^T), and w is the random vector obtained by stacking the L × L matrix N whose (i,j)th entry is N_{ij} = √(λ*_{X,i} λ*_{Y,j}) u_{X,i}(X) u_{Y,j}(Y).³

2) In particular, if X and Y are further independent, we have

    Σ_{i,j=1}^L S²_{ij}  →^d  Σ_{i,j=1}^L λ*_{X,i} λ*_{Y,j} · z²_{ij},  as n → ∞,    (6)

where the z²_{ij} are i.i.d. χ²_1-distributed variables.

3) The results (5) and (6) also hold if L = n → ∞.

³ Equivalently, one can consider the λ̊*_k as the eigenvalues of the tensor T with T_{ijkl} = E(N_{ij,t} N_{kl,t}).

All proofs are sketched in the Appendix. We note that

    Tr(K̃_X K̃_Y) = Tr(ψ_x ψ_x^T φ_y φ_y^T) = Tr(ψ_x^T φ_y φ_y^T ψ_x) = n Σ_{i,j=1}^n S²_{ij}.    (7)

Hence, the above theorem gives the asymptotic distribution of (1/n) Tr(K̃_X K̃_Y) under the condition that f(X) and g(Y) are always uncorrelated. Combined with the characterizations of (conditional) independence, this inspires the corresponding testing method. We further give the following remarks on Theorem 3. First, X and Y are not necessarily disjoint, and H_X and H_Y can be any RKHS's. Second, in practice the eigenvalues λ̊*_k are not known, and one needs to use the empirical ones instead, as discussed below.

3.2 Unconditional independence testing

As a direct consequence of the above theorem, we have the following result, which allows for kernel-based unconditional independence testing.

Theorem 4 [Independence test]
Under the null hypothesis that X and Y are statistically independent, the statistic

    T_UI ≜ (1/n) Tr(K̃_X K̃_Y)    (8)

has the same asymptotic distribution as

    Ť_UI ≜ (1/n²) Σ_{i,j=1}^n λ_{x,i} λ_{y,j} z²_{ij},    (9)

i.e., T_UI =^d Ť_UI as n → ∞.

This theorem inspires the following unconditional independence testing procedure. Given the samples x and y, one first calculates the centralized kernel matrices K̃_X and K̃_Y and their eigenvalues λ_{x,i} and λ_{y,i}, and then evaluates the statistic T_UI according to (8). Next, the empirical null distribution of Ť_UI under the null hypothesis can be simulated in the following way: one draws i.i.d. random samples of the χ²_1-distributed variables z²_{ij}, and then generates samples of Ť_UI according to (9). (Later we will give another way to approximate the asymptotic null distribution.) Finally, the p-value can be found by locating T_UI in the null distribution.

This unconditional independence testing method is closely related to the one based on the Hilbert-Schmidt independence criterion (HSIC) proposed by Gretton et al. (2008). Actually the defined statistics (our statistic T_UI and HSIC_b in Gretton et al. (2008)) are the same, but the asymptotic distributions are given in different forms. In our results the asymptotic distribution only involves the eigenvalues of the two regular kernel matrices K̃_X and K̃_Y, while in their results it depends on the eigenvalues of an order-four tensor, which are more difficult to calculate.
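To make the procedure above concrete, here is a minimal NumPy sketch of the unconditional test, i.e., the statistic (8) with a Monte Carlo null built from (9). It is our illustration, not the authors' released MATLAB implementation; the median-heuristic kernel width, the eigenvalue cutoff of 10^{-10}, and all function names are assumptions of the sketch.

```python
# Minimal sketch of the unconditional test of Sec. 3.2 (illustrative, not the
# released MATLAB code): statistic T_UI of Eq. (8), Monte Carlo null from Eq. (9).
import numpy as np

def centered_gaussian_kernel(x, width=None):
    """Centered Gaussian kernel matrix K~ = H K H; rows of x are observations."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    sq = np.sum(x ** 2, axis=1, keepdims=True)
    d2 = np.maximum(sq + sq.T - 2.0 * x @ x.T, 0.0)      # pairwise squared distances
    if width is None:                                     # median heuristic (cf. Sec. 3.5)
        width = np.median(np.sqrt(d2[d2 > 0]))
    K = np.exp(-d2 / (2.0 * width ** 2))
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n                   # H = I - (1/n) 1 1^T
    return H @ K @ H

def kci_uncond_test(x, y, n_draws=1000, seed=0):
    """Return (T_UI, Monte Carlo p-value) for H0: X independent of Y."""
    rng = np.random.default_rng(seed)
    Kx, Ky = centered_gaussian_kernel(x), centered_gaussian_kernel(y)
    n = Kx.shape[0]
    T = np.trace(Kx @ Ky) / n                             # Eq. (8)
    lx = np.linalg.eigvalsh(Kx)
    ly = np.linalg.eigvalsh(Ky)
    lx, ly = lx[lx > 1e-10], ly[ly > 1e-10]               # drop negligible eigenvalues
    prod = np.outer(lx, ly).ravel()                       # lambda_{x,i} * lambda_{y,j}
    z2 = rng.chisquare(1.0, size=(n_draws, prod.size))    # i.i.d. chi^2_1 draws
    null = z2 @ prod / n ** 2                             # Monte Carlo samples of Eq. (9)
    return T, float(np.mean(null >= T))
```

Calling kci_uncond_test(x, y) on paired samples thus returns the statistic of (8) together with a p-value obtained by locating it in the simulated null distribution.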

3.3 Conditional independence testing

Here we would like to make use of condition (iv) in Lemma 2 to test for CI, but the considered function spaces are f(Ẍ) ∈ H_Ẍ, g'(Y) ∈ H_Y, h*_f(Z) ∈ H_Z, and h*_{g'}(Z) ∈ H_Z, as in Lemma 1. The functions f̃(Ẍ) and g̃'(Y,Z) appearing in condition (iv), whose spaces are denoted by H_{Ẍ|Z} and H_{Y|Z}, respectively, can be constructed from the functions f, g', h*_f, and h*_{g'}.

Suppose that we already have the centralized kernel matrices K̃_Ẍ (the centralized kernel matrix of Ẍ = (X,Z)), K̃_Y, and K̃_Z on the samples x, y, and z. We use kernel ridge regression to estimate the regression function h*_f(Z) in (4), and one can easily see that ĥ*_f(z) = K̃_Z (K̃_Z + εI)^{-1} · f(ẍ), where ε is a small positive regularization parameter (Schölkopf and Smola, 2002). Consequently, f̃(ẍ) can be constructed as f̃(ẍ) = f(ẍ) − ĥ*_f(z) = R_Z · f(ẍ), where

    R_Z = I − K̃_Z (K̃_Z + εI)^{-1} = ε (K̃_Z + εI)^{-1}.    (10)

Based on the EVD K̃_Ẍ = V_ẍ Λ_ẍ V_ẍ^T, we can construct φ_ẍ = [φ_1(ẍ), ..., φ_n(ẍ)] ≜ V_ẍ Λ_ẍ^{1/2} as an empirical kernel map for ẍ. Correspondingly, an empirical kernel map of the space H_{Ẍ|Z} is given by φ̃_ẍ = R_Z φ_ẍ. Consequently, the centralized kernel matrix corresponding to the functions f̃(Ẍ) is

    K̃_{Ẍ|Z} = φ̃_ẍ φ̃_ẍ^T = R_Z K̃_Ẍ R_Z.    (11)

Similarly, the one corresponding to g̃' is

    K̃_{Y|Z} = R_Z K̃_Y R_Z.    (12)

Furthermore, let the EVDs of K̃_{Ẍ|Z} and K̃_{Y|Z} be K̃_{Ẍ|Z} = V_{ẍ|z} Λ_{ẍ|z} V_{ẍ|z}^T and K̃_{Y|Z} = V_{y|z} Λ_{y|z} V_{y|z}^T, respectively, where Λ_{ẍ|z} (resp. Λ_{y|z}) is the diagonal matrix containing the non-negative eigenvalues λ_{ẍ|z,i} (resp. λ_{y|z,i}). Let ψ_{ẍ|z} = [ψ_{ẍ|z,1}(ẍ), ..., ψ_{ẍ|z,n}(ẍ)] ≜ V_{ẍ|z} Λ_{ẍ|z}^{1/2} and φ_{y|z} = [φ_{y|z,1}(ÿ), ..., φ_{y|z,n}(ÿ)] ≜ V_{y|z} Λ_{y|z}^{1/2}. We then have the following result, on which the proposed KCI-test is based.

Proposition 5 [Conditional independence test]
Under the null hypothesis H_0 (X and Y are conditionally independent given Z), the statistic

    T_CI ≜ (1/n) Tr(K̃_{Ẍ|Z} K̃_{Y|Z})    (13)

has the same asymptotic distribution as

    Ť_CI ≜ (1/n) Σ_{k=1}^{n²} λ̊_k · z²_k,    (14)

where the λ̊_k are the eigenvalues of w̌ w̌^T and w̌ = [w̌_1, ..., w̌_n], with the vector w̌_t obtained by stacking M̌_t = [ψ_{ẍ|z,1}(ẍ_t), ..., ψ_{ẍ|z,n}(ẍ_t)]^T · [φ_{y|z,1}(ÿ_t), ..., φ_{y|z,n}(ÿ_t)].⁴

⁴ Note that, equivalently, the λ̊_k are the eigenvalues of w̌^T w̌; hence there are at most n non-zero values of λ̊_k.

Similarly to unconditional independence testing, we can perform the KCI-test by generating the approximate null distribution with Monte Carlo simulation. We first need to calculate K̃_{Ẍ|Z} according to (11), K̃_{Y|Z} according to (12), and their eigenvalues and eigenvectors. We then evaluate T_CI according to (13) and simulate the distribution of Ť_CI given by (14) by drawing i.i.d. χ²_1 samples and summing them up with weights λ̊_k. (For computational efficiency, in practice we drop all λ_{ẍ|z,i}, λ_{y|z,i}, and λ̊_k which are smaller than 10^{-5}.) Finally, the p-value is calculated as the probability of Ť_CI exceeding T_CI. Approximating the null distribution with a Gamma distribution, which is given next, avoids calculating the λ̊_k and simulating the null distribution, and is computationally more efficient.

3.4 Approximating the null distribution by a Gamma distribution

In addition to the simulation-based method to find the null distribution, as in Gretton et al. (2008), we provide approximations to the null distributions with a two-parameter Gamma distribution. The two parameters of the Gamma distribution are related to the mean and variance. In particular, under the null hypothesis that X and Y are independent (resp. conditionally independent given Z), the distribution of Ť_UI given by (9) (resp. of Ť_CI given by (14)) can be approximated by the Γ(k, θ) distribution:

    p(t) = t^{k−1} e^{−t/θ} / (θ^k Γ(k)),

where k = E²(Ť_UI)/Var(Ť_UI) and θ = Var(Ť_UI)/E(Ť_UI) in the unconditional case, and k = E²(Ť_CI)/Var(Ť_CI) and θ = Var(Ť_CI)/E(Ť_CI) in the conditional case. The means and variances of Ť_UI and Ť_CI on the given sample D are given in the following proposition.

Proposition 6 i. Under the null hypothesis that X and Y are independent, on the given sample D we have

    E(Ť_UI | D) = (1/n²) Tr(K̃_X) · Tr(K̃_Y),  and
    Var(Ť_UI | D) = (2/n⁴) Tr(K̃²_X) · Tr(K̃²_Y).

ii. Under the null hypothesis of X ⊥⊥ Y | Z, we have

    E(Ť_CI | D) = (1/n) Tr(w̌ w̌^T),  and
    Var(Ť_CI | D) = (2/n²) Tr[(w̌ w̌^T)²].
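The conditional statistic and its Gamma-approximated null can be sketched in the same style. The code below is a simplified reading of Sections 3.3 and 3.4, not the released implementation: it assumes the centered kernel matrices for Ẍ = (X,Z), Y, and Z have already been formed (e.g., with the helper from the previous sketch), uses a fixed ridge parameter ε in place of the GP-based tuning of Section 3.5, and evaluates the moments in Proposition 6(ii) through the entrywise product of the two centered matrices, which coincides with w̌^T w̌ under the definitions above.

```python
# Sketch of the conditional test of Secs. 3.3-3.4 (simplified, illustrative).
import numpy as np
from scipy.stats import gamma

def kci_cond_test(Kxz, Ky, Kz, eps=1e-3):
    """Return (T_CI, p-value) for H0: X _||_ Y | Z, given centered kernel matrices
    for (X, Z), Y and Z; the p-value uses the Gamma approximation of Sec. 3.4."""
    n = Kz.shape[0]
    Rz = eps * np.linalg.inv(Kz + eps * np.eye(n))        # Eq. (10)
    Kxz_z = Rz @ Kxz @ Rz                                 # Eq. (11)
    Ky_z = Rz @ Ky @ Rz                                   # Eq. (12)
    T = np.trace(Kxz_z @ Ky_z) / n                        # Eq. (13)

    # Proposition 6(ii): since w_check_t stacks the outer product of the two
    # empirical feature maps at sample t, w_check^T w_check equals the entrywise
    # product below and shares its non-zero eigenvalues with w_check w_check^T
    # (footnote 4).
    W = Kxz_z * Ky_z                                      # Hadamard product
    mean_T = np.trace(W) / n                              # E(T_CI-check | D)
    var_T = 2.0 * np.sum(W ** 2) / n ** 2                 # Var(T_CI-check | D)
    k, theta = mean_T ** 2 / var_T, var_T / mean_T        # Gamma(k, theta) parameters
    return T, float(gamma.sf(T, a=k, scale=theta))
```

Replacing the Gamma step by drawing weighted χ²_1 sums from the eigenvalues of W would give the Monte Carlo variant described in Section 3.3.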

3.5 Practical issues: on determination of the hyperparameters

In unconditional independence testing, the hyperparameters are the kernel widths used to construct the kernel matrices K̃_X and K̃_Y. We found that the performance is robust to these parameters within a certain range; in particular, as in Gretton et al. (2008), we use the median of the pairwise distances of the points.

In the KCI-test, the hyperparameters fall into three categories. The first one includes the kernel widths used to construct the kernel matrices K̃_Ẍ and K̃_Y, which are used later to form the matrices K̃_{Ẍ|Z} and K̃_{Y|Z} according to (11) and (12). The values of these kernel widths should be such that one can use the information in the data effectively. In our experiments we normalize all variables to unit variance, and use some empirical values for those kernel widths: they are set to 0.8 if the sample size n ≤ 200, to 0.3 if n > 1200, and to 0.5 otherwise. We found that this simple setting works well in all our experiments.

The other two categories contain the kernel widths for constructing K̃_Z and the regularization parameter ε, which are needed in calculating R_Z in (10). These parameters should be selected carefully, especially when the conditioning set Z is large. If they are too large, the corresponding regression functions h*_f and h*_{g'} may underfit, resulting in a large probability of Type I errors (where the CI hypothesis is incorrectly rejected). On the contrary, if they are too small, these regression functions may overfit, which may increase the probability of Type II errors (where the CI hypothesis is not rejected although being false). Moreover, if ε is too small, R_Z in (10) tends to vanish, resulting in very small values in K̃_{Ẍ|Z} and K̃_{Y|Z}, and then the performance may be deteriorated by rounding errors.

We found that when the dimensionality of Z is small (say, one or two variables), the proposed method works well even with some simple empirical settings (say, ε = 10^{-3}, and the kernel width for constructing K̃_Z equal to half of that for constructing K̃_Ẍ and K̃_Y). When Z contains many variables, to make the regression functions h*_f and h*_{g'} more flexible, such that they can properly capture the information about Z in f(Ẍ) ∈ H_Ẍ and g'(Y) ∈ H_Y, respectively, we use separate regularization parameters and kernel widths for them, denoted by {ε_f, σ_f} and {ε_{g'}, σ_{g'}}. To avoid both overfitting and underfitting, we extend the Gaussian process (GP) regression framework to the multi-output case, and learn these hyperparameters by maximizing the total marginal likelihood. Details are skipped. The MATLAB source code is available at http://people.tuebingen.mpg.de/kzhang/KCI-test.zip .

4 Experiments

We apply the proposed method, KCI-test, to both synthetic and real data to evaluate its practical performance and compare it with CI_PERM (Fukumizu et al., 2008). It is also used for causal discovery.

4.1 On the effect of the dimensionality of Z and the sample size

We examine, by simulation, how the probabilities of Type I and Type II errors of the KCI-test change with the size of the conditioning set Z (D = 1, 2, ..., 5) and the sample size (n = 200 and 400) in particular situations. We consider the following two cases.

In Case I, only one variable in Z, namely Z_1, is effective, i.e., the other conditioning variables are independent from X, Y, and Z_1. To see how well the derived asymptotic null distribution approximates the true one, we examined whether the probability of Type I errors is consistent with the significance level α that is specified in advance. We generated X and Y from Z_1 according to the post-nonlinear data-generating procedure (Zhang and Hyvärinen, 2009): they were constructed as G(F(Z_1) + E), where G and F are random mixtures of linear, cubic, and tanh functions and are different for X and Y, and E is independent across X and Y and has various distributions. Hence X ⊥⊥ Y | Z holds. In our simulations the Z_i were i.i.d. Gaussian.

A good test is expected to have a small probability of Type II errors. To see how large it is for the KCI-test, we also generated data which do not follow X ⊥⊥ Y | Z, by adding the same variable to the X and Y produced above. We increased the dimensionality of Z and the sample size n, and repeated the CI tests over 1000 random replications. Figure 1 (a,b) plots the resulting probability of Type I errors and that of Type II errors at the significance levels α = 0.01 and α = 0.05, respectively.

In Case II, all variables in the conditioning set Z are effective in generating X and Y. We first generated the independent variables Z_i, and then, similarly to Case I, to examine Type I errors, X and Y were generated as G(Σ_i F_i(Z_i) + E). To examine Type II errors, we further added the same variable to X and Y, such that they became conditionally dependent given Z. Figure 1 (c,d) gives the probabilities of Type I and Type II errors of the KCI-test obtained over 1000 random replications.
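For illustration, the following sketch mimics the Case I generating process described above; the particular function mixtures, noise laws, and helper names are our choices, not the exact settings used in the experiments.

```python
# Illustrative generator for Case I: X and Y are post-nonlinear functions of Z1
# only, so X _||_ Y | Z holds; setting dependent=True breaks the CI by adding a
# shared variable, as in the Type II error experiments.
import numpy as np

def random_mixture(rng):
    """A random mixture of linear, cubic and tanh functions."""
    a, b, c = rng.uniform(-1.0, 1.0, size=3)
    return lambda v: a * v + b * v ** 3 + c * np.tanh(v)

def generate_case1(n=200, dim_z=3, dependent=False, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, dim_z))          # i.i.d. Gaussian conditioning set
    Fx, Gx = random_mixture(rng), random_mixture(rng)
    Fy, Gy = random_mixture(rng), random_mixture(rng)
    ex = rng.standard_normal(n)                  # noises independent across X and Y
    ey = rng.uniform(-1.0, 1.0, n)
    x = Gx(Fx(z[:, 0]) + ex)                     # only Z1 is effective
    y = Gy(Fy(z[:, 0]) + ey)
    if dependent:
        common = rng.standard_normal(n)          # same variable added to X and Y
        x, y = x + common, y + common
    return x, y, z
```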



Figure 1: The probabilities of Type I and Type II errors obtained by simulations in various situations (dashed lines: n = 200; solid lines: n = 400). Top: Case I (only one variable in Z is effective for X and Y); bottom: Case II (all variables in Z are effective). Panels (a,c) show Type I errors and panels (b,d) Type II errors, plotted against D, for the simulated null distribution and the Gamma approximation at α = 0.01 and α = 0.05. Note that for a good testing method, the probability of Type I errors is close to the significance level α, and that of Type II errors is as small as possible.

Figure 2: Results of CI_PERM shown for comparison. (a) Probabilities of both Type I and II errors in Case II (dashed lines: n = 200; solid lines: n = 400). (b) Average CPU time (seconds, log scale) taken by KCI-test and CI_PERM, plotted against D. Note that for KCI-test, we include both the time for simulating the null distribution and that for the Gamma approximation.

One can see that, with the derived null distribution, the 1% and 5% quantiles are approximated very well for both sample sizes, since the resulting probabilities of Type I errors are very close to the significance levels. The Gamma approximation tends to produce slightly larger probabilities of Type I errors, meaning that the two-parameter Gamma approximation may have a slightly lighter tail than the true null distribution. With a fixed significance level α, as D increases, the probability of Type II errors always increases. This is intuitively reasonable: due to the finite-sample effect, as the conditioning set becomes larger and larger, X and Y tend to be considered as conditionally independent. On the other hand, as the sample size increases from 200 to 400, the probability of Type II errors quickly approaches zero.

We compared the KCI-test with CI_PERM (with the standard setting of 500 bootstrap samples) in terms of both types of errors and computational efficiency. For conciseness, we only report the probabilities of Type I and II errors of CI_PERM in Case II; see Figure 2 (a). One can see that even when D = 1, the probability of Type I errors is clearly larger than the corresponding significance level. Furthermore, it is very sensitive to D and n. As D becomes large, say, greater than 3, the probabilities of Type II errors increase rapidly to 1, i.e., the test almost always fails to reject the CI hypothesis when it is actually false. It seems that the sample size is too small for CI_PERM to give reliable results. Figure 2 (b) shows the average CPU time taken by the KCI-test and CI_PERM (note that it is in log scale). The KCI-test is computationally more efficient, especially when D is large. Because we use the GP regression framework to learn hyperparameters, on which the KCI-test spends most of its time, the computational load of the KCI-test is more sensitive to n.⁵ On the other hand, it is far less sensitive to D.⁶

⁵ As claimed in Sec. 3.5, when D is not high, say, D = 1 or 2, even fixed empirical values of the hyperparameters work very well. However, for consistency of the comparison, we always used GP for hyperparameter learning.

⁶ To see the price our test pays for generality, in another simulation we considered the linear Gaussian case and compared the KCI-test with the partial-correlation-based one. We found that for the particular problems we investigated, both methods give very similar Type I errors; partial correlation gives much smaller Type II errors than the KCI-test, which is natural since the KCI-test applies to far more general situations. Details are skipped due to space limitation.

4.2 Application in causal discovery

CI tests are frequently used in problems of causal inference: in those problems, one assumes that the true causal structure of n random variables X_1, ..., X_n can be represented by a directed acyclic graph (DAG) G. More specifically, the causal Markov condition assumes that the joint distribution satisfies all CIs that are imposed by the true causal graph (note that this is an assumption about the physical generating process of the data, not only about their distribution). So-called constraint-based methods like the PC algorithm (Spirtes et al., 2001) make the additional assumption of faithfulness (i.e., the joint distribution does not allow any CIs that are not entailed by the Markov condition) and recover the graph structure by exploiting the (conditional) independences that can be found in the data. Obviously, this is only possible up to Markov equivalence classes, which are sets of graphs that impose exactly the same independences and CIs. It is well known that small mistakes at the beginning of the algorithm (e.g., missing an independence relation) may lead to significant errors in the resulting DAG. Therefore the performance of those methods relies heavily on (conditional) independence testing methods.
For continuous data, the PC algorithm can be applied using partial correlation or mutual information. The former assumes linear relationships and Gaussian distributions, and the latter does not lead to a significance test. Sun et al. (2007) and Tillman et al. (2009) propose to use CI_PERM for CI testing in PC. Based on the promising results of the KCI-test, we propose to also apply it to causal inference.

4.2.1 Simulated data

We generated data from a random DAG G. In particular, we randomly chose whether an edge exists and sampled the functions that relate the variables from a Gaussian process prior. We sampled four random variables X_1, ..., X_4 and allowed arrows from X_i to X_j only for i < j. With probability 0.5 each possible arrow is either present or absent. If arrows exist, from X_1 and X_3 to X_4, say, we sample X_4 from a Gaussian process with mean function U_1 · X_1 + U_3 · X_3 (with U_1, U_3 ∼ 𝒰[−2, 2]) and a Gaussian kernel (with each dimension randomly weighted between 0.1 and 0.6) plus a noise kernel as the covariance function. For significance level 0.01 and sample sizes between 100 and 700 we simulated 100 DAGs, and checked how often the different methods infer the correct Markov equivalence class. Figure 3 shows how often PC based on the KCI-test, CI_PERM, or partial correlation recovered the correct Markov equivalence class. PC based on the KCI-test gives clearly the best results.
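To make the role of the CI test in such constraint-based search concrete, here is a heavily simplified, illustrative skeleton-search step in the spirit of the PC algorithm. It omits orientation rules, separating-set bookkeeping, and the other refinements of PC; ci_test is assumed to be any user-supplied function returning a p-value for X_i ⊥⊥ X_j | X_S, for instance a wrapper around the tests sketched earlier.

```python
# Simplified PC-style skeleton search (illustrative): start from a complete
# undirected graph and delete an edge i-j as soon as some conditioning set S of
# growing size makes the CI hypothesis non-rejected at level alpha.
from itertools import combinations

def pc_skeleton(n_vars, ci_test, data, alpha=0.01, max_cond=3):
    adj = {i: set(range(n_vars)) - {i} for i in range(n_vars)}
    for size in range(max_cond + 1):
        for i in range(n_vars):
            for j in sorted(adj[i]):
                if j < i:
                    continue                       # test each pair once
                others = sorted(adj[i] - {j})
                for S in combinations(others, size):
                    if ci_test(i, j, list(S), data) > alpha:
                        adj[i].discard(j)          # cannot reject CI: remove edge
                        adj[j].discard(i)
                        break
    return adj
```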

Figure 3: The chance that the correct Markov equivalence class was inferred with PC combined with different CI testing methods (y-axis: proportion of correct Markov equivalence classes; x-axis: sample size, 100 to 700). KCI-test outperforms CI_PERM and partial correlation.

4.2.2 Real data

We applied our method to the continuous variables in the Boston Housing data set, which is available at the UCI Repository (Asuncion and Newman, 2007). Due to the large number of variables we chose the significance level to be 0.001, as a rough way to correct for multiple testing. Figure 4 shows the results for PC using CI_PERM (PC_CIPERM) and the KCI-test (PC_KCI-test). For conciseness, we report them in the same figure: the red arrows are the ones inferred by PC_CIPERM and all solid lines show the result of PC_KCI-test. Ergo, red solid lines were found by both methods. Please refer to the data set for the explanation of the variables. Although one can argue about the ground truth for this data set, we regard it as promising that our method finds links between the number of rooms (RM) and the median value of houses (MED), and between non-retail business (IND) and nitric oxides concentration (NOX). The latter link is also missing in the result on these data given by Margaritis (2005); instead, their method gives some dubious links, like crime rate (CRI) to nitric oxides (NOX), for example.

Figure 4: Result of the PC algorithm applied to the continuous variables of the Boston Housing Data Set, shown as a graph over the variables MED, CRI, DIS, TAX, B, RM, LST, AGE, NOX, and IND (red lines: PC_CIPERM, solid lines: PC_KCI-test).

5 Conclusion

We proposed a novel method for conditional independence testing. It makes use of the characterization of conditional independence in terms of uncorrelatedness of functions in suitable reproducing kernel Hilbert spaces, and the proposed test statistic can be easily calculated from the kernel matrices. We derived its distribution under the null hypothesis of conditional independence. This distribution can either be generated by Monte Carlo simulation or approximated by a two-parameter Gamma distribution. Compared to discretization-based conditional independence testing methods, the proposed one exploits more complete information of the data, and compared to methods defining the test statistic in terms of estimated conditional densities, the calculation of the proposed statistic involves less random error. We applied the method to both simulated and real-world data, and the results suggest that the method outperforms existing techniques in both accuracy and speed.

Acknowledgement

KZ thanks Stefanie Jegelka for helpful discussions. DJ was supported by DFG (SPP 1395).

Appendix: Proofs

The following lemmas will be used in the proof of Theorem 3. Consider the eigenvalues λ_{X,i} and eigenvectors V_{x,i} of the kernel matrix K̃_X, and the eigenvalues λ*_{X,i} and normalized eigenfunctions u_{X,i} (which have unit variance) of the corresponding kernel k_X. Let x̌ be a fixed-size subsample of x. The following lemma is due to Theorems 3.4 and 3.5 of Baker (1977).

Lemma 7 i. (1/n) λ_{X,i} converges in probability to λ*_{X,i}.
ii. For the simple eigenvalue λ*_{X,i}, whose algebraic multiplicity is 1, √n V_{x,i}(x̌) converges in probability to u_{X,i}(x̌), where V_{x,i}(x̌) denotes the values of V_{x,i} corresponding to the subsample x̌.

Suppose that the eigenvalues λ*_{X,k}, λ*_{X,k+1}, ..., λ*_{X,k+r} (r ≥ 1) are the same and different from any other eigenvalue. The corresponding eigenfunctions are then non-unique. Denote by u⃗_{X,k:k+r} ≜ (u_{X,k}, u_{X,k+1}, ..., u_{X,k+r}) an arbitrary vector of the eigenfunctions corresponding to this eigenvalue with multiplicity r + 1. Let S_{X,k} be the space of u⃗_{X,k:k+r}.

Lemma 8 i. The distance from √n V_{x,k+q}(x̌) (0 ≤ q ≤ r) to the corresponding points in the space S_{X,k} converges to zero in the following sense:

    inf{ ‖√n V_{x,k+q}(x̌) − u'(x̌)‖ : u'(x) ∈ S_{X,k} } → 0,  as n → ∞.

ii. There exists an (r+1) × (r+1) orthogonal matrix P_n, which may depend on n, such that √n · P_n · [V_{x,k}(x̌), V_{x,k+1}(x̌), ..., V_{x,k+r}(x̌)]^T converges in probability to u⃗_{X,k:k+r}(x̌).

Item (i.) in the above lemma is a reformulation of Theorem 3.6 of Baker (1977), while item (ii.) is its straightforward consequence. The following lemma is a reformulation of Theorem 4.2 of Billingsley (1999).

Lemma 9 Let {A_{L,n}} be a double sequence of random variables with indices L and n, {B_L} and {C_n} sequences of random variables, and D a random variable. Assume that they are defined in a separable probability space. Suppose that, for each L, A_{L,n} →^d B_L as n → ∞, and that B_L →^d D as L → ∞. Suppose further that lim_{L→∞} lim sup_{n→∞} P({|A_{L,n} − C_n| ≥ ε}) = 0 for each positive ε. Then C_n →^d D as n → ∞.

Sketch of proof of Theorem 3. We first define an L × L block-diagonal matrix P_X as follows. For simple eigenvalues λ*_{X,i}, P_{X,ii} = 1, and all other entries in the ith row and column are zero. For eigenvalues with multiplicity r + 1, say λ*_{X,k}, λ*_{X,k+1}, ..., λ*_{X,k+r}, the corresponding main diagonal block from the kth row to the (k+r)th row of P_X is an orthogonal matrix. According to Lemmas 7 and 8 and the continuous mapping theorem (CMT) (Mann and Wald, 1943), which states that continuous functions are limit-preserving even if their arguments are sequences of random variables, there exists such a P_X, which may depend on n, such that P_X · [ψ_1(x_t), ..., ψ_L(x_t)]^T → [√λ*_{X,1} u_{X,1}(x_t), ..., √λ*_{X,L} u_{X,L}(x_t)]^T in probability as n → ∞. Similarly, we can define the orthogonal matrix P_Y such that P_Y · [φ_1(y_t), ..., φ_L(y_t)]^T → [√λ*_{Y,1} u_{Y,1}(y_t), ..., √λ*_{Y,L} u_{Y,L}(y_t)]^T in probability as n → ∞.

Let v_t be the random vector obtained by stacking the random L × L matrix P_X · M_t · P_Y^T, with M_{ij,t} = ψ_i(x_t) · φ_j(y_t) as the (i,j)th entry of M_t. One can see that

    (1/n) ‖Σ_{t=1}^n v_t‖² = (1/n) Tr( P_X (Σ_t M_t) P_Y^T · P_Y (Σ_t M_t)^T P_X^T ) = (1/n) Tr( (Σ_t M_t)(Σ_t M_t)^T ) = Σ_{i,j=1}^L S²_{ij}.    (15)

Again, according to the CMT, one can see that v_t converges in probability to w_t. Furthermore, according to the vector-valued CLT (Eicker, 1966), as the x_t and y_t are i.i.d., the vector (1/√n) Σ_{t=1}^n w_t converges in distribution to a multivariate normal distribution as n → ∞. Because of the CMT, (1/√n) Σ_{t=1}^n v_t then converges to the same normal distribution as n → ∞. As f(X) ∈ H_X and g(Y) ∈ H_Y are uncorrelated, we know that u_{X,i} and u_{Y,j} are uncorrelated, and consequently the mean of this normal distribution is E(w_t) = 0. The covariance is Σ = Cov((1/√n) Σ_{t=1}^n w_t) = Cov(w_t) = E(w_t w_t^T). Assume that we have the EVD Σ = V_w Λ_w V_w^T, where Λ_w is the diagonal matrix containing the non-negative eigenvalues λ̊*_k. Let v'_t ≜ V_w^T v_t. Clearly (1/√n) Σ_t v'_t follows N(0, Λ_w) asymptotically. That is,

    (1/n) ‖Σ_t v_t‖² = (1/n) ‖Σ_t v'_t‖²  →^d  Σ_{k=1}^{L²} λ̊*_k z²_k.    (16)

Combining (15) and (16) gives (5).

If X and Y are independent, for k ≠ i or l ≠ j, one can see that the non-diagonal entries of Σ are E[√(λ*_{X,i} λ*_{Y,j} λ*_{X,k} λ*_{Y,l}) u_{X,i}(x_t) u_{Y,j}(y_t) u_{X,k}(x_t) u_{Y,l}(y_t)] = √(λ*_{X,i} λ*_{Y,j} λ*_{X,k} λ*_{Y,l}) E[u_{X,i}(x_t) u_{X,k}(x_t)] E[u_{Y,j}(y_t) u_{Y,l}(y_t)] = 0. The diagonal entries of Σ are λ*_{X,i} λ*_{Y,j} · E[u²_{X,i}(x_t)] E[u²_{Y,j}(y_t)] = λ*_{X,i} λ*_{Y,j}, which are therefore also the eigenvalues of Σ. Substituting this result into (5), one obtains (6).

Finally, consider Lemma 9, and let A_{L,n} ≜ Σ_{i,j=1}^L S²_{ij}, B_L ≜ Σ_{k=1}^{L²} λ̊*_k z²_k, C_n ≜ Σ_{i,j=1}^n S²_{ij}, and D ≜ Σ_{k=1}^∞ λ̊*_k z²_k. One can then see that Σ_{i,j=1}^n S²_{ij} →^d Σ_{k=1}^∞ λ̊*_k z²_k as n → ∞. That is, (5) also holds as L = n → ∞. As a special case of (5), (6) also holds as L = n → ∞. □

Sketch of proof of Theorem 4. On the one hand, due to (7), we have T_UI = Σ_{i,j=1}^n S²_{ij}, which, according to (6), converges in distribution to Σ_{i,j=1}^∞ λ*_{X,i} λ*_{Y,j} · z²_{ij} as n → ∞. On the other hand, by extending the proof of Theorem 1 of Gretton et al. (2009), one can show that Σ_{i,j=1}^∞ ((1/n²) λ_{x,i} λ_{y,j} − λ*_{X,i} λ*_{Y,j}) z²_{ij} → 0 in probability as n → ∞. That is, Ť_UI converges in probability to Σ_{i,j=1}^∞ λ*_{X,i} λ*_{Y,j} z²_{ij} as n → ∞. Consequently, T_UI and Ť_UI have the same asymptotic distribution. □

Sketch of proof of Proposition 5. Here we let X, Y, K̃_X, and K̃_Y in Theorem 3 be X, (Y,Z), K̃_{Ẍ|Z}, and K̃_{Y|Z}, respectively. According to (7) and Theorem 3, one can see that T_CI →^d Σ_{k=1}^∞ λ̊*_k z²_k as n → ∞. Again, we extend the proof of Theorem 1 of Gretton et al. (2009) to show that Σ_{k=1}^∞ ((1/n) λ̊_k − λ̊*_k) z²_k → 0 as n → ∞, or that Σ_{k=1}^∞ (1/n) λ̊_k z²_k → Σ_{k=1}^∞ λ̊*_k z²_k in probability as n → ∞. The key step is to show that Σ_k |(1/n) λ̊_k − λ̊*_k| → 0 in probability as n → ∞. Details are skipped. □

Sketch of proof of Proposition 6. As the z²_{ij} follow the χ² distribution with one degree of freedom, we have E(z²_{ij}) = 1 and Var(z²_{ij}) = 2. According to (9), we have E(Ť_UI | D) = (1/n²) Σ_{i,j} λ_{x,i} λ_{y,j} = (1/n²) Σ_i λ_{x,i} Σ_j λ_{y,j} = (1/n²) Tr(K̃_X) · Tr(K̃_Y). Furthermore, bearing in mind that the z²_{ij} are independent across i and j, and recalling that Tr(K̃²_X) = Σ_i λ²_{x,i}, one can see that Var(Ť_UI | D) = (1/n⁴) Σ_{i,j} λ²_{x,i} λ²_{y,j} Var(z²_{ij}) = (2/n⁴) Σ_i λ²_{x,i} Σ_j λ²_{y,j} = (2/n⁴) Tr(K̃²_X) · Tr(K̃²_Y). Consequently (i) is true. Similarly, from (14), one can calculate the mean E(Ť_CI | D) and variance Var(Ť_CI | D), as given in (ii). □

References

A. Asuncion and D. J. Newman. UCI machine learning repository. http://archive.ics.uci.edu/ml/, 2007.

C. Baker. The Numerical Treatment of Integral Equations. Oxford University Press, 1977.

W. P. Bergsma. Testing conditional independence for continuous random variables, 2004. EURANDOM-report 2004-049.

P. Billingsley. Convergence of Probability Measures. John Wiley and Sons, 1999.

J. J. Daudin. Partial association measures and an application to qualitative regression. Biometrika, 67:581–590, 1980.

A. P. Dawid. Conditional independence in statistical theory. Journal of the Royal Statistical Society, Series B, 41:1–31, 1979.

F. Eicker. A multivariate central limit theorem for random linear vector forms. The Annals of Mathematical Statistics, 37:1825–1828, 1966.

K. Fukumizu, F. R. Bach, M. I. Jordan, and C. Williams. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. JMLR, 5, 2004.

K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In NIPS 20, pages 489–496, Cambridge, MA, 2008. MIT Press.

A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A kernel statistical test of independence. In NIPS 20, pages 585–592, Cambridge, MA, 2008.

A. Gretton, K. Fukumizu, Z. Harchaoui, and B. K. Sriperumbudur. A fast, consistent kernel two-sample test. In NIPS 23, pages 673–681, Cambridge, MA, 2009. MIT Press.

T. M. Huang. Testing conditional independence using maximal nonlinear conditional correlation. Ann. Statist., 38:2047–2091, 2010.

D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, MA, 2009.

A. J. Lawrance. On conditional and partial correlation. The American Statistician, 30:146–149, 1976.

O. Linton and P. Gozalo. Conditional independence restrictions: testing and estimation, 1997. Cowles Foundation Discussion Paper 1140.

H. B. Mann and A. Wald. On stochastic limit and order relationships. The Annals of Mathematical Statistics, 14:217–226, 1943.

D. Margaritis. Distribution-free learning of Bayesian network structure in continuous domains. In Proc. AAAI 2005, pages 825–830, Pittsburgh, PA, 2005.

J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, 2000.

B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

K. Song. Testing conditional independence via Rosenblatt transforms. Ann. Statist., 37:4011–4045, 2009.

P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd edition, 2001.

L. Su and H. White. A consistent characteristic function-based test for conditional independence. Journal of Econometrics, 141:807–834, 2007.

L. Su and H. White. A nonparametric Hellinger metric test for conditional independence. Econometric Theory, 24:829–864, 2008.

X. Sun, D. Janzing, B. Schölkopf, and K. Fukumizu. A kernel-based causal learning algorithm. In Proc. ICML 2007, pages 855–862. Omnipress, 2007.

R. Tillman, A. Gretton, and P. Spirtes. Nonlinear directed acyclic structure learning with weakly additive noise models. In NIPS 22, Vancouver, Canada, 2009.

K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In Proc. UAI 25, Montreal, Canada, 2009.