
Identifying confounders using additive noise models

Dominik Janzing, Jonas Peters, Joris Mooij, Bernhard Schölkopf
MPI for Biol. Cybernetics, Spemannstr. 38, 72076 Tübingen, Germany

Abstract

We propose a method for inferring the existence of a latent common cause ("confounder") of two observed random variables. The method assumes that the two effects of the confounder are (possibly nonlinear) functions of the confounder plus independent, additive noise. We discuss under which conditions the model is identifiable (up to an arbitrary reparameterization of the confounder) from the joint distribution of the effects. We state and prove a theoretical result that provides evidence for the conjecture that the model is generically identifiable under suitable technical conditions. In addition, we propose a practical method to estimate the confounder from a finite i.i.d. sample of the effects and illustrate that the method works well on both simulated and real-world data.

1 Introduction

Since the pioneering work on causal inference methods (described for example in [Pearl, 2000] and [Spirtes et al., 1993]), much work has been done under the assumption that all relevant variables have been observed. An interesting, and possibly even more important, question is how to proceed if not all the relevant variables have been observed. In that case, dependencies between observed variables may also be explained by confounders; for instance, if a dependence between the incidence of storks and the birthrate is traced back to a common cause influencing both variables. In general, the difficulty not only lies in the fact that the values of the latent variables have not been observed, but also that the causal structure is unknown. In other words, it is in general not clear whether and how many confounders are needed to explain the data and which observed variables are directly caused by which confounder. Under the assumption of linear relationships between variables, confounders may be identified by means of Independent Component Analysis, as shown recently by Hoyer et al. [2008], if the distributions are non-Gaussian. Other results for the linear case are presented in Silva et al. [2006]. In this paper, we will not assume linear relationships, but try to tackle the more general, nonlinear case.

In the case of two variables without confounder, Hoyer et al. [2009] have argued that the causal inference task (surprisingly) becomes easier in case of nonlinear functional relationships. They have described a method to infer whether X → Y ("X causes Y") or Y → X from the joint distribution P(X, Y) of two real-valued random variables X and Y. They consider models where Y is a function f of X up to an additive noise term, i.e.,

Y = f(X) + N,    (1)

where N is an unobserved noise term that is statistically independent of X. They show in their paper that generic choices of functions f, distributions of X and distributions of N induce joint distributions on X, Y that do not admit such an additive noise model in the inverse direction, i.e., from Y to X. If P(X, Y) admits an additive model in one and only one direction, this direction can be interpreted as the true causal direction. We believe that the situation with a confounder between the two variables is similar in that respect: nonlinear functional relationships enlarge the class of models for which the causal structure is identifiable.

We now state explicitly which assumptions we make in the rest of this paper. First of all, we focus on the case of only two observed and one latent continuous random variables, all with values in R. We assume that there is no feedback, or in other words, the true causal structure is described by a DAG (directed acyclic graph). Also, we assume that selection effects are absent, that is, the data samples are drawn i.i.d. from the probability distribution described by the model.
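For illustration, the following minimal Python sketch (our own, not the authors' implementation) mimics the bivariate procedure of Hoyer et al. [2009] described above: it draws data from a model of the form (1), fits a flexible regression in both directions, and compares how strongly the residuals depend on the respective regressor. Kernel ridge regression and the simple biased HSIC statistic below are stand-ins for the regression method and the dependence measure that are discussed later in the paper; all parameter values are arbitrary.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def rbf_gram(z):
    """Gaussian kernel Gram matrix; bandwidth set by 2*sigma^2 = median squared distance."""
    d2 = (z[:, None] - z[None, :]) ** 2
    return np.exp(-d2 / np.median(d2[d2 > 0]))

def hsic(a, b):
    """Biased empirical HSIC; small values indicate (near) independence."""
    n = len(a)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(rbf_gram(a) @ H @ rbf_gram(b) @ H) / n ** 2

rng = np.random.default_rng(0)
n = 300
X = rng.uniform(-2, 2, n)
N = rng.uniform(-0.3, 0.3, n)                 # noise independent of X
Y = X ** 3 + X + N                            # model (1): Y = f(X) + N
X = (X - X.mean()) / X.std()                  # normalize both variables
Y = (Y - Y.mean()) / Y.std()

# flexible nonlinear regression in both directions (kernel ridge as a stand-in)
fwd = KernelRidge(kernel="rbf", alpha=0.1, gamma=1.0).fit(X[:, None], Y)
bwd = KernelRidge(kernel="rbf", alpha=0.1, gamma=1.0).fit(Y[:, None], X)
res_fwd = Y - fwd.predict(X[:, None])         # residual of Y given X
res_bwd = X - bwd.predict(Y[:, None])         # residual of X given Y

# generically, only the causal direction admits an (almost) independent residual
print("HSIC(X, residual of Y|X): %.2e" % hsic(X, res_fwd))
print("HSIC(Y, residual of X|Y): %.2e" % hsic(Y, res_bwd))
```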

Definition 1 Let X, Y and T be random variables taking values in R. We define a model for Confounders with Additive Noise (CAN) by

X = u(T) + NX
Y = v(T) + NY    (2)

with NX, NY, T jointly independent, where u, v : R → R are continuously differentiable functions and NX, NY are real-valued random variables.

The random variables NX and NY describe additive "noise", which one may think of as the net effect of all other causes which are not shared by X and Y. This model can be represented graphically by the DAG shown in Figure 1.

Figure 1: Directed acyclic graph and a scatter plot corresponding to a CAN model for two observed variables X and Y that are influenced by an unobserved confounder variable T.

Definition 2 We call two CAN models equivalent if they induce the same distributions of NX, NY and the same joint distribution of (u(T), v(T)).

This definition removes the ambiguity arising from unobservable reparameterizations of T. We further adopt the convention E(NX) = E(NY) = 0.

The method we propose here enables one to distinguish between (i) X → Y, (ii) Y → X, and (iii) X ← T → Y for the class of models defined in (2), and (iv) to detect that no CAN model fits the data (which includes, for example, generic instances of the case that X causes Y and in addition T confounds both X and Y). If NX = 0 a.s. ("almost surely") and u is invertible, the model reduces to the model in (1) by setting f := v ∘ u⁻¹. Given that we have observed a joint density on R² of two variables X, Y that admits a unique CAN model, the method we propose identifies this CAN model and therefore enables us to do causal inference by employing the following decision rule: we infer X → Y whenever NX is zero a.s. and u invertible, infer Y → X whenever NY is zero a.s. and v invertible, and infer X ← T → Y if neither of the alternatives hold, which corresponds in spirit with Reichenbach's principle of common cause [Reichenbach, 1956]. Note that the case of NX = NY = 0 a.s. and u and v invertible implies a deterministic relation between X and Y, which we exclude here.

In practical applications, however, we propose to prefer the causal hypothesis X → Y already if the variance of NX is small compared to the variance of NY (after we have normalized both X and Y to variance 1). To justify this, consider the case that X causes Y and the joint distribution admits a model (1), but by a slight measurement error, we observe X̃ instead of X, differing by a small additive noise term. Then P(X̃, Y) admits a proper CAN model because X is the latent common cause, but we infer X̃ → Y because, from a coarse-grained point of view, we should not distinguish between the quantity X and the measurement result X̃ if both variables almost coincide.

Finding the precise conditions under which the identification of CAN models is unique up to equivalence is a non-trivial problem: if u and v are linear and NX, NY, T are Gaussian, one obtains a whole family of models inducing the same bivariate Gaussian joint distribution. Other examples where the model is not uniquely identifiable are given in Hoyer et al. [2009]: any joint density which admits additive noise models from X to Y and also from Y to X is a special case of a non-identifiable CAN model.

The remainder of this paper is organized as follows. In the next section, we provide theoretical motivation for our belief that in the generic case, CAN models are uniquely identifiable. A practical algorithm for the task is proposed in Section 3. It builds on a combination of nonlinear dimensionality reduction and kernel dependence measures. Section 4 provides empirical results on synthetic and real world data, and Section 5 concludes the paper.

2 Theoretical motivation

Below, we prove a partial identifiability result for the special case that both u, v are invertible, where we consider the following limit: first, let the variances of the noise terms NX and NY be small compared to the curvature of the graph (u(t), v(t)); second, we assume that the curvature is non-vanishing nevertheless (ruling out the linear case); and third, that the density on the graph (u(t), v(t)) changes slowly compared to the variance of the noise.

Because of the assumption that u is invertible, we can reparameterize T such that the CAN model (2) simplifies to

X = T + NX   and   Y = v(T) + NY,

i.e., the joint probability density is given by

p(x, y, t) = q(x − t) r(y − v(t)) s(t),

where q, r, s are the densities of NX, NY and T, respectively.
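As a concrete illustration of this model class, the following short sketch (illustrative choices of v, noise laws and sample size; not taken from the paper) draws an i.i.d. sample from a CAN model in the reparameterized form X = T + NX, Y = v(T) + NY used above.

```python
import numpy as np

rng = np.random.default_rng(1)

def v(t):
    # an arbitrary smooth, strictly increasing (hence invertible) function
    return t + 0.1 * np.sin(2 * np.pi * t)

n = 500
T = rng.uniform(0.0, 1.0, n)             # confounder, density s
N_X = rng.normal(0.0, 0.05, n)           # additive noise, density q, E(N_X) = 0
N_Y = rng.normal(0.0, 0.05, n)           # additive noise, density r, E(N_Y) = 0

X = T + N_X                              # u = id after the reparameterization of T
Y = v(T) + N_Y

# only the pairs (X, Y) are observed; T, N_X and N_Y remain latent
observed = np.column_stack([X, Y])
```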

We will further assume that s is differentiable with bounded derivative and is nonzero at an infinite number of points. We also assume that the inverse function w := v⁻¹ of v is two times differentiable and w″ is bounded. We further assume that the density of NX is several times differentiable, that all its moments are finite and that E(|NX|^k)/k! → 0. This implies that the characteristic function of NX, and therefore its distribution, can be uniquely expressed in terms of its moments. We assume likewise for NY.

We will show that under these assumptions, one can estimate the function v from the conditional expectation E(X | Y = y) and that all moments of NX and NY can be approximated from higher order conditional moments E(X^k | Y = y), with vanishing approximation error in the limit.

Defining ry(t) := p(y, t) = r(y − v(t)) s(t), the conditional distribution of T given Y = y is given by ry(t)/p(y). For fixed y, we can locally (i.e., for t ≈ t0 := w(y)) approximate

ry(t) ≈ r(−v′(t0)(t − t0)) s(t0)

if s is almost constant around t0 and if v has small curvature at t0. Hence,

T | y ∼ w(y) + βy NY    (3)

approximately, where βy := −1/v′(t0) = −w′(y).

Figure 2: Sample from a distribution p(x, y) obtained by adding small noise in x- and y-direction to a distribution supported by a graph (t, v(t)). We show that for fixed y, X is approximately an affine function in the noise variables NX and NY.

We cannot observe T directly, but we can observe the noisy version X = T + NX of it (see also Figure 2). From (3), we conclude that the conditional distribution of X given Y = y is approximately given by

X | y ∼ w(y) + βy NY + NX    (4)

Using Ey(...) as shorthand for the conditional expectation E(... | Y = y), we conclude that

Ey(X) ≈ w(y),    (5)

because NX and NY are centralized. We can thus approximately determine the function v from observing all the conditional expectations Ey(X). After a shift by the estimated function w(y), the conditional distribution of X, given y, is approximately given by a convolution of NX with the scaled noise βy NY:

X − Ey(X) | y ∼ NX + βy NY

Since we can observe this convolution for different values βy, we can compute the moments of NX and NY using the following Lemma:

Lemma 1 Let Z and W be independent random variables for which all moments E(Z^k) and E(W^k) for k = 1, . . . , n exist. Then all these moments can be reconstructed by observing the n-th moments of Z + cjW for n + 1 different values c0, . . . , cn.

Proof: The n-th moments of Z + cjW are given by

E((Z + cjW)^n) = Σ_{k=0}^{n} (n choose k) cj^k E(Z^{n−k}) E(W^k).

The (n+1) × (n+1) matrix M given by the entries mjk := cj^k (n choose k) is invertible because it coincides with the Vandermonde matrix up to multiplying the k-th column with (n choose k). Hence we can compute all products E(Z^{n−k}) E(W^k) for k = 0, . . . , n by matrix inversion and obtain in particular E(Z^n) and E(W^n). Taking into account only a subset of points, we can obviously compute lower moments, too. □
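The proof of Lemma 1 translates directly into a small linear system. The sketch below (our own illustration, with arbitrary choices for Z, W and the constants cj) recovers the products E(Z^{n−k})E(W^k), and in particular E(Z^n) and E(W^n), from empirical n-th moments of Z + cjW.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(2)
n = 4                                        # order of the moments to recover
c = np.array([0.0, 0.5, 1.0, 1.5, 2.0])      # n + 1 distinct constants c_0, ..., c_n

# independent Z and W with finite moments (illustrative choices)
m = 500_000
Z = rng.uniform(-1.0, 1.0, m)
W = rng.normal(0.0, 0.3, m)

# observed n-th moments of Z + c_j * W
b = np.array([np.mean((Z + cj * W) ** n) for cj in c])

# matrix with entries m_{jk} = binom(n, k) * c_j^k (invertible for distinct c_j)
M = np.array([[comb(n, k) * cj ** k for k in range(n + 1)] for cj in c])

# solve M p = b for the products p_k = E(Z^{n-k}) E(W^k), k = 0, ..., n
p = np.linalg.solve(M, b)

print("recovered E(Z^4): %.4f   direct estimate: %.4f" % (p[0], np.mean(Z ** 4)))
print("recovered E(W^4): %.4f   direct estimate: %.4f" % (p[-1], np.mean(W ** 4)))
```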

We now choose n + 1 values y0, . . . , yn such that we obtain n + 1 different values βy0, . . . , βyn and apply Lemma 1 to identify the first n moments of NX and NY. For higher moments, the approximation will typically cause larger errors.

We now focus on the error made by the approximations above. We introduce the error terms

εn(y) := Ey((T − w(y))^n) − E(βy^n NY^n),

and hence obtain

Ey(X) = Ey(T) = w(y) + ε1(y).

Some calculations then yield the following exact relation between the moments of NX and NY:

Σ_{k=0}^{n} (n choose k) βy^k E(NX^{n−k}) E(NY^k)    (6)
  = Ey((X − Ey(X))^n)    (7)
  + Σ_{k=1}^{n} (n choose k) εk(y) Ey((X − Ey(X))^{n−k})    (8)
  − Σ_{k=0}^{n} (n choose k) εk(y) E(NX^{n−k})    (9)

Defining vectors b, c and d by letting their j-th component be the value of (7)–(9) for y = yj, respectively, and defining the vector q and the matrix M by

qk := E(NX^{n−k}) E(NY^k),   Mjk := βyj^k (n choose k),

we can write (6)–(9) for y = y0, . . . , yn in matrix notation as

M q = b + c + d

and therefore

q = M⁻¹ b + M⁻¹ (c + d).

The above approximation yielded q = M⁻¹ b. The remaining terms are bounded from above by ‖M⁻¹‖ (‖c‖ + ‖d‖). The following lemma (which is proved in the Appendix) shows that the errors εk(y) are small. This then implies that the error in q ≈ M⁻¹ b is also of order O(‖s′‖∞ + ‖w″‖∞).

Lemma 2 For any n and y such that s(w(y)) ≠ 0:

εn(y) = O(‖s′‖∞ + ‖w″‖∞),

where the terms involve moments of NY and positive powers of βy and of s(w(y)), which are all bounded for given y and n.

We now consider a sequence of distributions pℓ(x, y) obtained by scaling the graph (t, v(t)) up to the larger graph (ℓt, ℓv(t)), while keeping the distributions of the noise NX and NY fixed. Then we consider the conditional distributions of X, given Y = y for y = ℓy0, . . . , ℓyn, and can determine the moments of NX and NY up to an error that converges to zero for ℓ → ∞, as the following Theorem shows.

Theorem 1 Define a sequence of joint densities pℓ(x, y) by

pℓ(x, y) := ∫ q(x − ℓt) r(y − ℓv(t)) s(t) dt.    (10)

Let, as above, y0, . . . , yn be chosen such that all βy0, . . . , βyn are different. Then every E(NX^k) and E(NY^k) can be computed from the conditional moments E_{ℓyj}(X^m) for j = 0, . . . , n and m = 1, . . . , n, up to an error that vanishes for ℓ → ∞ under the assumptions made above.

Proof: We rewrite eq. (10) as

pℓ(x, y) = ∫ q(x − t̃ℓ) r(y − ṽℓ(t̃ℓ)) s̃ℓ(t̃ℓ) dt̃ℓ

with

t̃ℓ := ℓt,   ṽℓ(t̃ℓ) := ℓ v(t̃ℓ/ℓ),   s̃ℓ(t̃ℓ) := s(t̃ℓ/ℓ)/ℓ.

The function w̃ℓ is then defined as ṽℓ⁻¹, and one checks easily that ‖s̃ℓ′‖∞ and ‖w̃ℓ″‖∞ tend to zero for ℓ → ∞. Further, note that the βyj are invariant with respect to the scaling: β̃yj,ℓ = βyj. Hence, by Lemma 2, all εk(yj) converge to zero. □

To summarize, we have sketched a proof that the CAN model becomes identifiable in a particular limit. We expect that a stronger statement holds (under suitable technical conditions), but postpone the non-trivial problem of finding the right technical conditions and the corresponding proof to future work.

The analysis above also shows that it should be possible to identify a confounder by estimating the variances of the noises NX and NY and comparing their sizes (as discussed in the previous section). Indeed, in order to estimate the variance of the noise variable NX, one observes the conditional expectation E(X | Y = y) for three different values of y; if the conditional expectations are sufficiently different, one can apply Lemma 1 to calculate the moments of NX and NY up to second order. Assume we observe a dependence between X and Y (which are normalized to have variance 1). In case one of the noise variances is much smaller than the other (say Var(NX) ≪ Var(NY)), this would indicate a direct causal influence (X → Y in that case), but if both noise variances are large, this indicates the existence of a confounder.
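In the second-order case this reduces to a two-parameter linear problem: since NX and βyNY are independent and centered, E((X − Ey(X))² | Y = y) ≈ Var(NX) + βy² Var(NY), so observing this quantity for a few values of y with different slopes βy = −w′(y) allows one to solve for the two noise variances. A small numerical sketch of this idea follows; the slopes and conditional moments below are made-up placeholders for quantities that would in practice be estimated from data.

```python
import numpy as np

# Slopes beta_y = -w'(y) at three values of y (estimated, in practice, from the
# regression curve E(X | Y = y) ≈ w(y)), and the corresponding centered second
# conditional moments m_j = E((X - E_y(X))^2 | Y = y_j).  All numbers are made up
# for illustration and are consistent with Var(N_X) = Var(N_Y) = 0.01.
beta = np.array([-0.4, -1.0, -2.5])
m = np.array([0.0116, 0.0200, 0.0725])

# model: m_j ≈ Var(N_X) + beta_j^2 * Var(N_Y)   ->   linear least squares
A = np.column_stack([np.ones_like(beta), beta ** 2])
(var_nx, var_ny), *_ = np.linalg.lstsq(A, m, rcond=None)

print("estimated Var(N_X) = %.4f, Var(N_Y) = %.4f" % (var_nx, var_ny))
# comparable, non-negligible variances (relative to Var(X) = Var(Y) = 1) point
# towards a confounder; a very small ratio suggests a direct causal link
```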
3 Identifying Confounders from Data

In this section we propose an algorithm (ICAN) that is able to identify a confounder in CAN models. While we only addressed the low noise regime in the previous theoretical section, the practical method we propose here should work even for strong noise, although in that case more data points are needed.

Assume that X, Y are distributed according to the CAN model (2). We write s(t) := (u(t), v(t)) for the "true" curve of the confounder in R². A scatter plot of the samples (X, Y) (right panel of Figure 1, for example) suggests a simplistic method for detecting the confounder: for every curve s : [0, 1] → R², project the data points (Xk, Yk) onto this curve s such that the Euclidean distance is minimized: T̂k = argmin_{t∈[0,1]} ‖(Xk, Yk) − s(t)‖₂. From a set of all possible paths S now choose the ŝ that minimizes the global ℓ2 distance Σ_{k=1}^{n} ‖(Xk, Yk) − ŝ(T̂k)‖₂ (dimensionality reduction) and propose T̂k to be the confounder for Xk and Yk. This results in the estimated residuals (N̂X,k, N̂Y,k) = (Xk, Yk) − ŝ(T̂k). If the hypotheses T̂ ⊥⊥ N̂X, T̂ ⊥⊥ N̂Y, N̂X ⊥⊥ N̂Y cannot be rejected, propose that there is a confounder whose values are given by T̂k.
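Taken literally, this naive projection step could be sketched as follows (a toy illustration with an arbitrary fixed candidate curve and a brute-force grid search over t; the next paragraph explains why minimizing the ℓ2 distance alone is not sufficient).

```python
import numpy as np

rng = np.random.default_rng(3)

def s_hat(t):
    """A candidate path s(t) = (u(t), v(t)) in R^2, evaluated on an array t."""
    return np.column_stack([t, np.sin(3.0 * t)])

# toy observations (in practice these come from the data set at hand)
T_true = rng.uniform(0.0, 2.0, 200)
data = s_hat(T_true) + rng.normal(0.0, 0.05, (200, 2))

# project every point onto the curve: T_hat_k = argmin_t ||(X_k, Y_k) - s(t)||_2
grid = np.linspace(0.0, 2.0, 2001)
curve = s_hat(grid)                                          # (2001, 2)
d2 = ((data[:, None, :] - curve[None, :, :]) ** 2).sum(-1)   # (200, 2001)
T_hat = grid[np.argmin(d2, axis=1)]

residuals = data - s_hat(T_hat)                              # (N_X_hat, N_Y_hat)
global_l2 = np.linalg.norm(residuals, axis=1).sum()
print("global l2 distance of the projections: %.3f" % global_l2)
```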

This idea turns out to be too naive: even if the data have been generated according to the model (2), the procedure results in dependent residuals. As an example, consider a data set simulated from the following model:

X = 4 · ϕ₋₀.₁(T) + 4 · ϕ₁.₁(T) + NX
Y = 1 · ϕ₋₀.₁(T) − 1 · ϕ₁.₁(T) + NY,

where ϕµ is the probability density of a N(µ, 0.1²) distributed random variable and NX, NY ∼ U([−0.1, 0.1]) and T ∼ U([0, 1]) are jointly independent. We now minimize the global ℓ2 distance over the set of functions S = {s : si(t) = αi · ϕ₋₀.₁(t) + βi · ϕ₁.₁(t); i = 1, 2}. Since there are only four parameters to fit, the problem is numerically solvable and gives the following optimal solution: α₁ = 3.9216, β₁ = 4.0112, α₂ = 0.9776, β₂ = −0.9911. The ℓ2 projections T̂ result in a lower global ℓ2 distance (6.92) than the true values of T (11.87).

Figure 3: Left: a scatter plot of the data, true path s and projections (black and solid), estimated path ŝ and projections (red and dashed). Right: residuals plotted against each other and estimated confounder.

Figure 3 shows the true function s (black line), a scatter plot of (X, Y) (black circles) and the computed curve ŝ that minimizes the global ℓ2 distance (dashed red line). Additionally, for some data points projections onto s and ŝ are shown, too: black crosses correspond to the "true projections" (i.e., the points without noise) onto s and red crosses correspond to projections onto the estimated function ŝ minimizing the ℓ2 distance. The latter result in the proposed residuals, which are shown together with the estimated values of the confounder on the right side of Figure 3. Estimated residuals and confounder values clearly depend on each other. Also, independence tests like the Hilbert-Schmidt Independence Criterion (see below) reject the hypotheses of independence: p-values of 5×10⁻², 7×10⁻⁵ and 1×10⁻⁴ are computed corresponding to the right plots in Figure 3 from top to bottom. These dependencies often occur when the projections onto the curve ŝ are chosen to minimize the global ℓ2 distance, which can be seen as follows: in our example ∂s₁/∂t is small for T ≈ 0.5 or Y ≈ 0. Since the points are projected onto the curve orthogonally, the projection results in very small residuals N̂Y. This introduces a dependency between N̂Y and T̂. Dependency between the residuals N̂Y and N̂X can arise from regions where ŝ is approximately linear, like in the bottom right part of Figure 3: positive residuals NY lead to positive residuals NX and vice versa.

Summarizing, projecting the pairs (X, Y) onto the path (ŝ(t)) by minimizing the ℓ2 distance to the path is the wrong approach for our purpose. Instead, the data (X, Y) should be projected in a way that minimizes the dependence between residuals and confounder (N̂X, T̂ and N̂Y, T̂) and between the residuals themselves (N̂X, N̂Y).

Let DEP(W, Z) denote any non-negative dependence measure between random variables W and Z which is zero if and only if W and Z are independent (we later suggest to use the Hilbert-Schmidt Independence Criterion). In the example above we can use the red curve as an initial guess, but choose the projections by minimizing the sum of the three dependence measures instead of ℓ2 distances. In our example this indeed leads to residuals that fulfill the independence constraints (p-values of 1.00, 0.65, 0.80). For the general case, we propose Algorithm 1 as a method for identifying the hidden confounder T given an i.i.d. sample of (X, Y).

If a CAN model can be found, we interpret the outcome of our algorithm as X → Y if Var(N̂X)/Var(N̂Y) ≪ 1 and û is invertible, and as Y → X if Var(N̂Y)/Var(N̂X) ≪ 1 and v̂ is invertible. There is no mathematical rule that tells whether one should identify a variable X and its (possibly noisy) measurement X̃ or consider them as separate variables instead. Thus we cannot be more explicit about the threshold for the factor between Var(N̂X) and Var(N̂Y) that tells us when to accept X → Y or Y → X or X ← T → Y.

To implement the method we still need an algorithm for the initial dimensionality reduction, a dependence criterion DEP, a way to minimize it, and an algorithm for non-linear regression. Surely, many different choices are possible. We will now briefly justify our choices for the implementation.

Algorithm 1 Identifying Confounders using Additive Noise Models (ICAN)

1: Input: (X1, Y1), . . . , (Xn, Yn) (normalized)
2: Initialization:
3: Fit a curve ŝ to the data that minimizes the ℓ2 distance: ŝ := argmin_{s∈S} Σ_{k=1}^{n} dist(s, (Xk, Yk)).
4: repeat
5:   Projection:
6:   T̂ := argmin_T DEP(N̂X, N̂Y) + DEP(N̂X, T) + DEP(N̂Y, T) with (N̂X,k, N̂Y,k) = (Xk, Yk) − ŝ(Tk)
7:   if N̂X ⊥⊥ N̂Y and N̂X ⊥⊥ T̂ and N̂Y ⊥⊥ T̂ then
8:     Output: (T̂1, . . . , T̂n), û = ŝ1, v̂ = ŝ2, and Var(N̂X)/Var(N̂Y).
9:     Break.
10:  end if
11:  Regression:
12:  Estimate ŝ by regression (X, Y) = ŝ(T̂) + N̂. Set û = ŝ1, v̂ = ŝ2.
13: until K iterations
14: Output: Data cannot be fitted by a CAN model.

Initial Dimensionality Reduction

It is difficult to solve the optimization problem (line 3 of the algorithm) for a big function class S. Our approach thus separates the problem into two parts: we start with an initial guess for the projection values T̂k (this is chosen using an implementation of the Isomap algorithm [Tenenbaum et al., 2000] by van der Maaten [2007]) and then iterate between two steps. In the first step we keep the projection values T̂k fixed and choose a new function ŝ = (û, v̂), where û and v̂ are chosen by regression from X on T̂ and Y on T̂, respectively. To this end we used Gaussian Process Regression [Rasmussen and Williams, 2006], using the implementation of Rasmussen and Williams [2007], with hyperparameters set by maximizing the marginal likelihood. In the second step the curve is fixed and each data point (Xk, Yk) is projected to the nearest point of the curve: Tk is chosen such that ‖ŝ(Tk) − (Xk, Yk)‖₂ is minimized. We then perform the first step again. A similar iterative procedure for dimensionality reduction has been proposed by Hastie and Stuetzle [1989].

This initial step of the algorithm is used for stabilization. Although the true curve s may differ from the ℓ2 minimizer ŝ, the difference is not expected to be very large. Minimizing dependence criteria from the beginning often results in very bad fits.
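The alternating procedure just described can be sketched compactly. The snippet below is a simplified illustration, not the authors' code: it replaces Isomap by a crude initialization (position along the first principal component) and Gaussian process regression by polynomial least squares, but it follows the same regress-then-reproject iteration.

```python
import numpy as np

def fit_uv(T, X, Y, deg=5):
    """Regression step: fit u and v separately on the current T values
    (polynomial least squares stands in for Gaussian process regression)."""
    return np.polyfit(T, X, deg), np.polyfit(T, Y, deg)

def project(X, Y, pu, pv, grid):
    """Projection step: map every (X_k, Y_k) to the parameter of the nearest curve point."""
    cx, cy = np.polyval(pu, grid), np.polyval(pv, grid)
    d2 = (X[:, None] - cx[None, :]) ** 2 + (Y[:, None] - cy[None, :]) ** 2
    return grid[np.argmin(d2, axis=1)]

def initial_fit(X, Y, iters=10):
    # crude initial T: position along the first principal component
    # (the paper uses Isomap here; PCA keeps the sketch short)
    Z = np.column_stack([X - X.mean(), Y - Y.mean()])
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    T = Z @ vt[0]
    T = (T - T.min()) / (T.max() - T.min())          # rescale to [0, 1]
    grid = np.linspace(0.0, 1.0, 1001)
    for _ in range(iters):
        pu, pv = fit_uv(T, X, Y)                     # keep T fixed, update the curve
        T = project(X, Y, pu, pv, grid)              # keep the curve fixed, update T
    return T, pu, pv

# example usage with synthetic data
rng = np.random.default_rng(4)
T_true = rng.uniform(0, 1, 300)
X = np.sin(2 * T_true) + 0.03 * rng.standard_normal(300)
Y = T_true ** 2 + 0.03 * rng.standard_normal(300)
T_init, pu, pv = initial_fit(X, Y)
```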

Dependence Criterion and its Minimization

There are various choices of dependence criteria that can be used for the algorithm. Notice, however, that they should be able both to deal with continuous data and to detect non-linear dependencies. Since there is no canonical way of discretizing continuous variables, methods that work for discrete data (like a χ² test) are not suitable for our purpose. In our method we chose the Hilbert-Schmidt Independence Criterion (HSIC) [Gretton et al., 2005, 2008]. It can be defined as the distance between the joint distribution and the product of the marginal distributions, represented in a Reproducing Kernel Hilbert Space. For specific choices of the kernel (e.g., a Gaussian kernel) it has been shown that HSIC is zero if and only if the two random variables are independent. Furthermore, the distribution of HSIC under the hypothesis of independence can be approximated by a Gamma distribution [Kankainen, 1995]. Thus we can construct a statistical test for the null hypothesis of independence. In our experiments we used Gaussian kernels and chose their kernel sizes to be the median distances between the points [Schölkopf and Smola, 2002]: e.g. 2σ² = median{‖Xk − Xl‖² : k < l}. We will use the term HSIC for the value of the Hilbert-Schmidt norm and pHSIC for the corresponding p-value. For a small p-value (< 0.05, say) the hypothesis of independence is rejected.

For the projection step we now minimize HSIC(N̂X, N̂Y) + HSIC(N̂X, T̂) + HSIC(N̂Y, T̂) with respect to T̂. Note that at this part of the algorithm the function ŝ (and thus û and v̂) remains fixed and the residuals are computed according to N̂X = X − û(T̂) and N̂Y = Y − v̂(T̂). We used a standard optimization algorithm for this task (fminsearch in MatLab), initializing it with the values of T̂ obtained in the previous iteration. Instead of the sum of the three dependence criteria the maximum can be used, too. This is theoretically possible, but complicates the optimization problem since it introduces non-differentiability.

It should be mentioned that sometimes (not for all data sets though) a regularization for the T values may be needed. Even for dependent noise, very positive (or negative) values of T result in large residuals, which may be regarded as independent. In our implementation we used a heuristic and just performed 5000 iterations of the minimization algorithm, which proved to work well in practice.
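A compact version of this dependence criterion and of the projection step might look as follows. It is a sketch under our own simplifications: the function hsic below is the standard biased HSIC estimator with the median heuristic for the kernel width, the Gamma-approximation p-value is omitted, and scipy's Nelder-Mead optimizer plays the role of MATLAB's fminsearch.

```python
import numpy as np
from scipy.optimize import minimize

def rbf_gram(z):
    """Gaussian kernel Gram matrix with 2*sigma^2 = median squared distance."""
    d2 = (z[:, None] - z[None, :]) ** 2
    return np.exp(-d2 / np.median(d2[d2 > 0]))

def hsic(a, b):
    """Biased empirical HSIC (the dependence measure DEP used in Algorithm 1)."""
    n = len(a)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(rbf_gram(a) @ H @ rbf_gram(b) @ H) / n ** 2

def projection_step(X, Y, u_hat, v_hat, T_init, maxiter=5000):
    """Choose T minimizing HSIC(N_X, N_Y) + HSIC(N_X, T) + HSIC(N_Y, T),
    with the curve (u_hat, v_hat) kept fixed."""
    def objective(T):
        nx, ny = X - u_hat(T), Y - v_hat(T)
        return hsic(nx, ny) + hsic(nx, T) + hsic(ny, T)
    res = minimize(objective, T_init, method="Nelder-Mead",
                   options={"maxiter": maxiter, "xatol": 1e-4, "fatol": 1e-6})
    return res.x

# example usage with a known curve and a rough initial guess
rng = np.random.default_rng(5)
T_true = rng.uniform(0, 1, 60)
u_hat, v_hat = (lambda t: t), (lambda t: t + 0.1 * np.sin(2 * np.pi * t))
X = u_hat(T_true) + 0.02 * rng.standard_normal(60)
Y = v_hat(T_true) + 0.02 * rng.standard_normal(60)
T_hat = projection_step(X, Y, u_hat, v_hat, T_init=X.copy())
```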

Non-linear Regression

Here, again, we used Gaussian process regression for both variables separately. Since the confounder values T̂ are fixed, we can fit X = û(T̂) + N̂X and Y = v̂(T̂) + N̂Y to obtain ŝ = (û, v̂).
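For this regression step the paper relies on the GPML MATLAB code; an analogous sketch using scikit-learn (our choice of library and kernel; hyperparameters are optimized by maximizing the marginal likelihood during fitting) could look like this.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def regression_step(T_hat, X, Y):
    """Given fixed confounder values T_hat, re-estimate the curve s = (u, v)
    by Gaussian process regression of X on T_hat and of Y on T_hat."""
    kernel = 1.0 * RBF(length_scale=0.1) + WhiteKernel(noise_level=1e-3)
    gp_u = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(T_hat[:, None], X)
    gp_v = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(T_hat[:, None], Y)
    u_hat = lambda t: gp_u.predict(np.asarray(t)[:, None])   # expects array input
    v_hat = lambda t: gp_v.predict(np.asarray(t)[:, None])
    return u_hat, v_hat

# example usage: refit the curve after a projection step
rng = np.random.default_rng(6)
T_hat = np.sort(rng.uniform(0, 1, 100))
X = T_hat + 0.02 * rng.standard_normal(100)
Y = np.cos(2 * T_hat) + 0.02 * rng.standard_normal(100)
u_hat, v_hat = regression_step(T_hat, X, Y)
residual_X = X - u_hat(T_hat)          # N_X_hat for the next independence check
residual_Y = Y - v_hat(T_hat)          # N_Y_hat
```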

In the experiments this step was mostly not necessary: whenever the algorithm was able to find a solution with independent residuals, it did so in the first or second iteration after optimizing the projections according to the dependence measures. We still think that this step can be useful for difficult data sets, where the curve that minimizes the ℓ2 distance is very different from the ground truth.

4 Experiments

In this section we show that our method is able to detect confounders both in simulated and in real data.

4.1 Simulated data

Data set 1.
We show on a simulated data set that our algorithm finds the confounder if the data comes from the model assumed in (2). We simulated 200 data points from a curve whose components u and v consist of a random linear combination of Gaussian bumps each. The noise is uniformly distributed on [−0.035, 0.035]. Note that in contrast to the example given in Section 3 we are now doing the regression using Gaussian processes. The algorithm finds a curve and corresponding projections of the data points onto this curve, such that N̂X, N̂Y and T̂ are pairwise independent, which can be seen from the p-values pHSIC(N̂X, N̂Y) = 0.94, pHSIC(N̂X, T̂) = 0.78 and pHSIC(N̂Y, T̂) = 0.23. The top panel of Figure 4 shows the data and both true (black) and estimated (red) curve. The bottom panel shows estimated and true values of the confounder. Recall that the confounder can be estimated only up to an arbitrary reparameterization (e.g. t ↦ −t).

Figure 4: Data set 1. Top: true (black) and estimated (red) curve. Bottom: The estimated values of the confounder are plotted against the true values. Apart from the arbitrary reparameterization t ↦ −t the method inferred confounder values close to the true ones.

In this example the empirical joint distribution of (X, Y) does not allow a simple direct causal relationship between X and Y. It is obvious that the data cannot be explained by X = g(Y) + N with a noise N that is independent of Y. It turns out that also the model corresponding to the other direction X → Y can be rejected, since a regression of Y onto X leads to dependent residuals (pHSIC(X, Y − f̂(X)) = 0.0015).

Data set 2.
This data set is produced in the same way as data set 1, but this time using an invertible v and unequally scaled noises. We sampled NX ∼ U([−0.008, 0.008]) and NY ∼ U([−0.0015, 0]). We argued above that for finite sample sizes this case should rather be regarded as Y → X and not as an example with a hidden common cause. The algorithm again identifies a curve and projections such that the independence constraints are satisfied (pHSIC(N̂X, N̂Y) = 1.00, pHSIC(N̂X, T̂) = 1.00 and pHSIC(N̂Y, T̂) = 1.00, see Figure 5), and it is important to note that the different scales of the variances are reduced, but still noticeable (Var(N̂X)/Var(N̂Y) ≈ 5). In such a case we indeed interpret the outcome of our algorithm as "Y causes X".

Figure 5: Data set 2. Top: true (black) and estimated (red) curve. Others: Scatter plots of the fitted residuals against each other and against estimated values for the confounder. The fact that the noise NX has been sampled with a higher variance than NY can also be detected in the fitted residuals.

Since the variances of NX and NY differ significantly and the sample size is small, we can (as expected) even fit a direct causal relationship between X and Y: assuming the model

X = g(Y) + N    (11)

and fitting the function ĝ by Gaussian Process regression, for example, results in independent residuals: pHSIC(Y, X − ĝ(Y)) = 0.97. Thus we regard the model (11), and thus Y → X, to be true. This does not contradict the identifiability conjecture because the dependencies introduced by setting the noise N̂Y mistakenly to zero are not detectable at this sample size.

Data set 3.
We also simulated a data set for which the noise terms NX and NY clearly depend on T. Figure 6 shows a scatter plot of the data set, the outcome curve of the algorithm after K = 5 iterations (top) and a scatter plot between the estimated residuals N̂Y and confounder values T̂ (bottom). The method did not find a curve and corresponding projections for which the residuals were independent (pHSIC(N̂Y, T̂) = 0.00, for example), and thus results in "Data cannot be fitted by a CAN model". This makes sense since the data set was not simulated according to model (2).

Figure 6: Data set 3. To check whether our method does not always find a confounder we simulated a data set where the noise clearly depends on T. Indeed the algorithm does not find an independent solution and stops after K = 5 iterations. Top: true (black) and estimated (red) curve. Bottom: the estimated residuals clearly depend on the estimated confounder.

4.2 Real data

ASOS data.
The Automated Surface Observations Systems (ASOS) consists of several stations that automatically collect and transmit weather data every minute. We used 150 values for air pressure that were collected by stations KABE and KABI in January 2000 [NCDC, 2009]. We expect the time to be a confounder. As in the other experiments, a projection minimizing the ℓ2 distance would not be sufficient: after the initialization step we obtain p-values which reject independence (pHSIC(N̂X, N̂Y) = 0.00, pHSIC(N̂X, T̂) = 0.00, pHSIC(N̂Y, T̂) = 0.02). After the projection step minimizing the sum of HSICs the residuals are regarded as independent: pHSIC(N̂X, N̂Y) = 1.00, pHSIC(N̂X, T̂) = 1.00, pHSIC(N̂Y, T̂) = 0.16. Figures 7 and 8 show the results. The confounder has been successfully identified.

Figure 7: ASOS data. Top: scatter plot of the data, together with the estimated path ŝ (note that it is not interpolating between the data points). Bottom: ordering of the estimated confounder values against the true ordering. The true ordering is almost completely recovered.

Figure 8: ASOS data. Residuals plotted against each other and against the estimated confounder. The hypothesis of independence is not rejected, which means the method identified the confounder.

5 Conclusion and Future Work

We have proposed a method to identify the confounder of a model where two observable variables are functions of an unobserved confounder plus additive noise. We have provided a theoretical motivation for the method, and showed that the algorithm works on both simulated and real world data sets.

Our initial results are encouraging, and our theoretical motivation provides some insight into why the problem is solvable in the first place.

A complete identifiability result in the style of Hoyer et al. [2009], however, would clearly be desirable, along with further experimental evidence.

References

A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In ALT, pages 63–78. Springer-Verlag, 2005.

A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, pages 585–592, Cambridge, MA, 2008. MIT Press.

T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84:502–516, 1989.

P. O. Hoyer, S. Shimizu, A. J. Kerminen, and M. Palviainen. Estimation of causal effects using linear non-Gaussian causal models with hidden variables. Int. J. Approx. Reasoning, 49(2):362–378, 2008. doi: 10.1016/j.ijar.2008.02.006.

P. O. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21 (NIPS*2008), pages 689–696, 2009.

A. Kankainen. Consistent testing of total independence based on the empirical characteristic function. PhD Thesis, University of Jyväskylä, 1995.

NCDC. ASOS 1-minute data (DSI 6406/page 2 data), 2009. URL http://www.ncdc.noaa.gov/oa/climate/climatedata.html.

J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.

C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

C. E. Rasmussen and C. Williams. GPML code. http://www.gaussianprocess.org/gpml/code, 2007.

H. Reichenbach. The direction of time. University of Los Angeles Press, Berkeley, 1956.

B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.

R. Silva, R. Scheines, C. Glymour, and P. Spirtes. Learning the structure of linear latent variable models. Journal of Machine Learning Research, 7:191–246, 2006.

P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Springer-Verlag, 1993. (2nd ed. MIT Press 2000).

J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.

L. J. P. van der Maaten. An introduction to dimensionality reduction using matlab. Technical Report MICC-IKAT 07-07, Maastricht University, Maastricht, 2007.

Appendix: proof of Lemma 2

Fix y and set ty := w(y), sy := s(w(y)), βy := −w′(y). We compute:

p(y) Ey((T − ty)^n) = ∫ (t − ty)^n r(y − v(t)) s(t) dt.

We want to make the substitution ỹ = y − v(t), i.e., t = w(y − ỹ). Application of Taylor's theorem yields:

w(y − ỹ) − ty = βy ỹ + ỹ² ε,   |ε| ≤ ‖w″‖∞ / 2,
−w′(y − ỹ) = βy + η ỹ,   |η| ≤ ‖w″‖∞,
s(w(y − ỹ)) = sy + ζ (βy ỹ + ỹ² ε),   |ζ| ≤ ‖s′‖∞,

where we suppress the dependencies of ε, η, ζ on ỹ in the notation. Therefore:

∫ (t − ty)^n r(y − v(t)) s(t) dt = ∫ (βy ỹ + ỹ² ε)^n r(ỹ) (sy + ζ(βy ỹ + ỹ² ε)) (βy + η ỹ) dỹ.

The special case n = 0 yields

p(y) = ∫ r(ỹ) (sy + ζ(βy ỹ + ỹ² ε)) (βy + η ỹ) dỹ = sy βy + O(‖s′‖∞ + ‖w″‖∞).

For arbitrary n, we obtain

∫ (βy ỹ + ỹ² ε)^n r(ỹ) (sy + ζ(βy ỹ + ỹ² ε)) (βy + η ỹ) dỹ = sy βy E((βy NY)^n) + O(‖s′‖∞ + ‖w″‖∞).

The error terms contain moments of NY, which are all finite by assumption, and (positive) powers of sy and βy, which are also finite. Thus we conclude that

εn(y) = Ey((T − ty)^n) − E(βy^n NY^n) = O(‖s′‖∞ + ‖w″‖∞).  □