
Causal Inference on Discrete Data via Estimating Distance Correlations

Furui Liu and Laiwan Chan
Department of Computer Science and Engineering, The Chinese University of Hong Kong

Abstract

In this paper, we address the problem of inferring causal directions when the data are on a discrete domain. By considering the distribution of the cause P(X) and the conditional distribution mapping cause to effect P(Y|X) as independent random variables, we propose to infer the causal direction by comparing the distance correlation between P(X) and P(Y|X) with the distance correlation between P(Y) and P(X|Y). We infer "X causes Y" if the dependence coefficient between P(X) and P(Y|X) is the smaller of the two. Various experiments are performed to demonstrate the performance of the proposed method.

Editor: Aapo Hyvärinen

1 Introduction

Inferring the causal direction between two variables from observational data has become a hot research topic. Additive noise models (ANMs) (Zhang and Hyvärinen, 2009a; Shimizu et al., 2006; Hoyer et al., 2009; Peters et al., 2011; Hoyer and Hyttinen, 2009; Zhang and Hyvärinen, 2009b; Shimizu et al., 2011; Hyvärinen and Smith, 2013; Hyvärinen et al., 2010; Hyttinen et al., 2012; Mooij et al., 2011) are preliminary attempts to solve this problem. They assume that the effect is governed by the cause and an additive noise, and causal inference is done by finding the direction that admits such a model. Recently, under another view that exploits the asymmetry between cause and effect, the linear trace method (LTr) (Zscheischler et al., 2011; Janzing et al., 2010) and information geometric causal inference (IGCI) (Janzing et al., 2012) have been proposed. Suppose X is the cause and Y is the effect. Based on the postulate that the generation of P(X) is independent of that of P(Y|X) (Janzing and Schölkopf, 2010; Lemeire and Janzing, 2013; Schölkopf et al., 2012), LTr suggests that the trace condition is fulfilled in the causal direction while violated in the anti-causal direction, and IGCI shows that the density of the cause and the log slope of the function transforming cause to effect are uncorrelated, while the density of the effect and the log slope of the inverse function are positively correlated. By exploiting these so-called cause-effect asymmetries, one can determine the causal direction. A kernel method using the IGCI framework to deal with high dimensional variables was later developed (Chen et al., 2014), and a nonlinear extension of the trace method was presented (Chen et al., 2013).

In some situations, the variables of interest are on discrete domains, and researchers have adapted additive noise models to discrete data for causal inference (Peters et al., 2010, 2011). Given observations of the variable pair, they perform regressions in both directions and test the independence between the residuals and the regressors. The direction that admits an additive noise model is inferred as the causal direction. However, ANMs may not be suitable for modeling discrete variables in some situations; for example, it is not natural to model categorical variables with ANMs. Methods with wider applicability would be valuable for causal inference on discrete data.

Motivated by the postulate that the generation of P(X) is independent of that of P(Y|X), we suggest that the pair (P(x), P(Y|x)) is an observation of a pair of variables that are independent of each other (here P(x) = P(X = x), the probability at one specified point). To infer the causal direction, we calculate the dependence coefficients between P(X) and P(Y|X) and between P(Y) and P(X|Y). The direction that induces the smaller correlation is inferred as the causal direction. Without a functional assumption, our method has wider applicability than traditional ANMs. Various experiments are conducted to demonstrate the performance of the proposed method.

This paper is organized as follows. Section 2 defines the problem. Section 3 presents the causal inference principle. Section 4 gives the detailed causal inference method. Section 5 shows the experiments, and Section 6 concludes the paper.

2 Problem Description

Suppose we have two observed variables X and Y with supports X and Y respectively. X is the cause and Y is the effect, but we do not have prior knowledge about the causal direction. We assume that they are on a discrete domain (for continuous variables we can perform discretization) and that there are no latent confounders. We want to identify the causal direction ("X causes Y" or "Y causes X"). For clarity, we list the symbols that appear in the following sections in Table 1. Since we constrain the variables to a finite range, P(X, Y) and P(Y|X) can be written as matrices, and P(Y|x) is a vector in the matrix P(Y|X). We use these representations in the rest of the paper.

Table 1: Lookup table

  Symbol     Description
  X          support of variable X
  Y          support of variable Y
  P(X)       distribution of X
  P(x)       probability of X = x
  P(X, Y)    joint distribution of (X, Y)
  P(Y|X)     conditional distribution of Y given X
  P(Y|x)     conditional distribution of Y given X = x
  | · |      cardinality of a set

3 Causal Inference Principle

In this section, we present the principle we use for causal inference on discrete data, starting with the basic idea.

3.1 Basic Idea

The basic idea is to consider (P(x), P(Y|x)) as a realization of a variable pair, where the two variables (one is one dimensional and the other is high dimensional) are independent of each other. Figure 1 shows an example of P(X) and P(Y|X). Suppose |X| = M and |Y| = L (here M = 7 and L = 8). P(X) is a vector in R^M and P(Y|X) is a matrix in R^{L×M}. The highlighted grids (red bars) form a pair (P(x), P(Y|x)). Consider P(X) and P(Y|X) as two independent random variables. The generation of P(X) and P(Y|X) is done by drawing a realization (P(x), P(Y|x)) at each possible value of x (shifting the red bars from right to left), giving |X| realizations in total. We formalize this in Postulate 1.

Postulate 1. P(X) and P(Y|X) are both random variables taking realizations at different x. P(X) is independent of P(Y|X).

With this postulate in mind, one can look for properties induced by it that are useful for causal discovery. We discuss this in the next section.

[Figure 1: P(X) and P(Y|X) as random variables. The figure depicts the vector P(X) and the matrix P(Y|X), with one pair (P(x), P(Y|x)) highlighted.]

3.2 Distance Correlation

If we want to characterize the dependence between P(X) and P(Y|X), one measurement is the correlation. However, P(Y|X) is a high dimensional random vector, and adopting traditional dependence coefficients such as Pearson correlations would cause certain estimation bias when the sample size is not large. Moreover, it would be useful if independence between the variables corresponded to zero correlation, which is not true for traditional correlations. Here we propose to use distance correlation (Székely et al., 2007) as the dependence measurement. Distance correlation measures the dependence between two random variables (one dimensional or high dimensional). Suppose we have two random variables (α, β) with characteristic functions f_α(t) and f_β(s) respectively, and joint characteristic function f_{α,β}(t, s). The distance covariance is defined as below.

Definition 1. The distance covariance C(α, β) between two random variables (α, β) is

C²(α, β) = ‖f_{α,β}(t, s) − f_α(t) f_β(s)‖²    (1)

Here ‖·‖ denotes the weighted L2 norm, and C²(α, α) is defined similarly (Székely et al., 2007). The distance correlation is then defined as

Definition 2. The distance correlation D(α, β) is

D(α, β) = C(α, β) / √(C(α, α) C(β, β))    (2)

and D(α, β) = 0 if C(α, α) = 0 or C(β, β) = 0.

This dependence measurement is a distance metric between the joint characteristic function and the product of the marginal characteristic functions. There are other ways to measure dependence, such as mutual information and kernel independence measurements (Gretton et al., 2005). However, mutual information is hard to estimate given a finite sample size, and kernel methods involve a few parameters (kernel functions, kernel widths) that are not easy to choose. So we use this metric in our paper. We now discuss how to estimate the distance correlation empirically from data (Székely et al., 2007). Suppose we have n observations of two random variables (α, β), denoted {(α_i, β_i)}_{i=1}^n. For variables α and β, we construct

a_{ij} = ‖α_i − α_j‖,   a_{i·} = (1/n) Σ_{j=1}^{n} a_{ij},   a_{·j} = (1/n) Σ_{i=1}^{n} a_{ij},   a_{··} = (1/n²) Σ_{i,j=1}^{n} a_{ij}    (3)

b_{ij} = ‖β_i − β_j‖,   b_{i·} = (1/n) Σ_{j=1}^{n} b_{ij},   b_{·j} = (1/n) Σ_{i=1}^{n} b_{ij},   b_{··} = (1/n²) Σ_{i,j=1}^{n} b_{ij}    (4)

and then we construct matrices A and B, with entries

A_{ij} = a_{ij} − a_{i·} − a_{·j} + a_{··}    (5)

B_{ij} = b_{ij} − b_{i·} − b_{·j} + b_{··}    (6)

Then we can estimate the empirical distance covariance as follows (Székely et al., 2007).

Definition 3. The empirical distance covariance C_n(α, β) is

C_n(α, β) = (1/n) √( Σ_{i,j=1}^{n} A_{ij} B_{ij} )    (7)

We can estimate the empirical distance correlation from the empirical distance covariance. The distance correlation has the property that D(α, β) = 0 implies independence between α and β. We show in the next section how this helps to identify the causal direction.
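To make the estimator concrete, the following is a minimal sketch (not the authors' code) of the empirical distance correlation of Definitions 1-3, written in Python with NumPy; the function names are our own.

```python
import numpy as np

def _double_centered_distances(samples):
    # Pairwise Euclidean distances a_ij = ||x_i - x_j|| (Eqs. 3 and 4),
    # double-centered as in Eqs. (5) and (6).
    d = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=-1)
    return d - d.mean(axis=1, keepdims=True) - d.mean(axis=0, keepdims=True) + d.mean()

def distance_correlation(alpha, beta):
    # alpha, beta: arrays of shape (n,) or (n, d) holding n paired observations.
    alpha = np.asarray(alpha, dtype=float).reshape(len(alpha), -1)
    beta = np.asarray(beta, dtype=float).reshape(len(beta), -1)
    A = _double_centered_distances(alpha)
    B = _double_centered_distances(beta)
    n = A.shape[0]
    cov_ab = np.sqrt(max((A * B).sum(), 0.0)) / n   # C_n(alpha, beta), Eq. (7)
    var_a = np.sqrt((A * A).sum()) / n              # C_n(alpha, alpha)
    var_b = np.sqrt((B * B).sum()) / n              # C_n(beta, beta)
    if var_a * var_b == 0:
        return 0.0                                  # convention of Definition 2
    return cov_ab / np.sqrt(var_a * var_b)          # D(alpha, beta), Eq. (2)
```

The returned value follows Definition 2, normalizing the empirical distance covariance by the geometric mean of the two empirical distance variances.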

3.3 Inferring Causal Directions

In this section we discuss how to infer the causal direction. Suppose we have the joint distribution P(X, Y) of the variable pair. We can factorize it in two directions to obtain the pairs (P(X), P(Y|X)) and (P(Y), P(X|Y)). We define the dependence measurements of these pairs as below.

Definition 4. The dependence measurements are defined as

D_{X→Y} = D(P(X), P(Y|X))    (8)

D_{Y→X} = D(P(Y), P(X|Y))    (9)

If Postulate 1 is accepted, then in the causal direction the distance correlation between P(X) and P(Y|X) reaches the lower bound:

D_{X→Y} = 0    (10)

Since the distance correlation is nonnegative, in the anti-causal direction we have

D_{Y→X} ≥ 0    (11)

and now we get the causal inference principle

D_{Y→X} ≥ D_{X→Y}    (12)

Intuitively speaking, in the causal direction we obtain a smaller dependence coefficient between the marginal distribution and the conditional distribution than in the anti-causal direction. One thing worth noting is that the domain size should be reasonably large to obtain reliable estimates. We give the detailed causal inference method in the next section.

4 Causal Inference Method

In this section we give a causal inference method that identifies the causal direction by estimating the distance correlations. If X causes Y, then D_{X→Y} should be smaller than D_{Y→X}. However, estimating the coefficients from samples may induce random errors, so we introduce a threshold ε. The two coefficients are considered significantly different if their difference is larger than ε, in which case we decide the causal direction; otherwise we stay undecided. The inference method is detailed in Table 2.

From Table 2, one can see that our method identifies the cause and the effect by factorizing the joint distribution in two directions and comparing the dependence coefficients D_{X→Y} and D_{Y→X}. The direction with the smaller distance correlation (between the marginal distribution and the conditional distribution) is inferred as the causal direction. We name the method causal inference via estimating distance correlations (abbreviated as DC).

Next we analyze the computational cost of our method. Suppose the sample size is n. The time for constructing the matrix recording the joint distribution is O(n), and the times for calculating D_{X→Y} and D_{Y→X} are O(|X|²) and O(|Y|²) respectively. So the total time is O(n + |X|² + |Y|²). One can see that this method has low computational complexity (linear with respect to the sample size); this is verified in the experiments.

Table 2: Causal inference method

Algorithm 1: Causal inference via estimating distance correlations (DC)
Input: samples of the discrete variables X and Y, threshold ε
1. Construct the vector recording the distribution P(X) and the matrix recording the conditional distribution P(Y|X). Calculate D_{X→Y} = D(P(X), P(Y|X)).
2. Construct the vector recording the distribution P(Y) and the matrix recording the conditional distribution P(X|Y). Calculate D_{Y→X} = D(P(Y), P(X|Y)).
3. Decide the causal direction:
   If D_{Y→X} − D_{X→Y} > ε, output "X causes Y".
   If D_{X→Y} − D_{Y→X} > ε, output "Y causes X".
   Else, output "No decision made".
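As an illustration, here is a minimal sketch of Algorithm 1 that reuses the distance_correlation helper sketched in Section 3.2; the function name dc_direction and the default threshold are our own choices, not part of the paper.

```python
import numpy as np

def dc_direction(x, y, eps=0.05):
    # x, y: 1-D arrays of discrete observations of the same length.
    xs, ys = np.unique(x), np.unique(y)
    # Empirical joint distribution P(X, Y) as a |X| x |Y| matrix.
    joint = np.zeros((len(xs), len(ys)))
    for xi, yi in zip(x, y):
        joint[np.searchsorted(xs, xi), np.searchsorted(ys, yi)] += 1.0
    joint /= joint.sum()

    def dependence(j):
        # Marginal P(X) (one scalar realization per x) and the rows P(Y|x).
        marginal = j.sum(axis=1)
        keep = marginal > 0
        conditional = j[keep] / marginal[keep, None]
        return distance_correlation(marginal[keep], conditional)

    d_xy = dependence(joint)      # D_{X->Y}, step 1
    d_yx = dependence(joint.T)    # D_{Y->X}, step 2
    if d_yx - d_xy > eps:         # step 3
        return "X causes Y"
    if d_xy - d_yx > eps:
        return "Y causes X"
    return "No decision made"
```

Each call treats the |X| (or |Y|) pairs (P(x), P(Y|x)) as the realizations over which the distance correlation is computed, matching the view of Section 3.1.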

5 Experiments

In this section we test the performance of our method (DC). The compared algorithm is discrete regression (DR) (Peters et al., 2011). We perform experiments under various settings. Section 5.1 shows the performance of DC on identifying ANMs with different noise supports N. Section 5.2 presents the performance of DC when the distribution of the cause and the conditional distribution mapping cause to effect are randomly generated. Section 5.3 tests the efficiency of the algorithms. Section 5.4 discusses the choice of the threshold parameter ε. Section 5.5 shows the performance of DC at different decision rates. In Section 5.6, we apply DC to real world cause-effect pairs (with discretization) to show its capability in solving practical problems.

5.1 Additive Noise Models

We first evaluate the accuracy of DC on identifying ANMs with different ranges of the noise term N. The model is written as

Y = f(X) + N,   N ⊥⊥ X    (13)

The function f is constructed by a random mapping from X = {1, 2, ..., 30} to Y_0 = {1, 2, ..., 30}. Suppose the support of the noise is N. The noise domain N is chosen as one of the following:

1. N = {0, 1}.

2. N = {−1, 0, 1}.

3. N = {−2, −1, 0, 1, 2}.

4. N = {−3, −2, −1, 0, 1, 2, 3}.

The probability distribution of the cause X is chosen as follows: (1) randomly generate a vector (of length |X|) with each entry being an integer in [1, |X|/4]; (2) normalize it to unit sum. We generate the probability distribution of the noise N in the same way. In each trial, the algorithms are forced to make a decision. For each noise setting, we randomly generate 500 functions, so we have 500 additive noise models; the support Y of each model can differ due to the randomness of the mappings. We then sample 200, 300, 500, 1000, 2000 and 4000 points for each model, and apply DC and DR to the samples. A sketch of this data-generating procedure is given below. The accuracies of the algorithms are plotted in Figure 2.
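The following is a sketch of this data-generating procedure under the stated settings (the helper names are ours, not the authors'); it produces samples from one random additive noise model.

```python
import numpy as np

def random_pmf(size, rng):
    # Integer weights drawn from [1, size/4] (at least [1, 1]), normalized to unit sum.
    high = max(1, size // 4)
    w = rng.integers(1, high + 1, size=size)
    return w / w.sum()

def sample_anm(n, rng, noise_support=(-1, 0, 1)):
    x_support = np.arange(1, 31)                      # X = {1, ..., 30}
    noise_support = np.asarray(noise_support)
    f = rng.integers(1, 31, size=len(x_support))      # random mapping into {1, ..., 30}
    x = rng.choice(x_support, size=n, p=random_pmf(len(x_support), rng))
    noise = rng.choice(noise_support, size=n, p=random_pmf(len(noise_support), rng))
    return x, f[x - 1] + noise                        # Y = f(X) + N

# Example: x, y = sample_anm(2000, np.random.default_rng(0))
```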

[Figure 2: Accuracy of the algorithms (Distance Correlation and Discrete Regression) vs. sample size for different noise supports: (a) |N| = 2, (b) |N| = 3, (c) |N| = 5, (d) |N| = 7.]

From Figure 2, one can see that DR performs slightly better than DC when |N| = 2. For example, when the sample size is 4000, the accuracy of DR is 0.82 while that of DC is 0.78. We observe that ANMs with small |N| can sometimes yield small distance correlations in both directions; in these situations, the decision made by DC is close to a random guess. DC performs better than DR when |N| is larger, with accuracies of around 0.9 when the sample size is large. However, DC does not correctly identify all models, because the difference between the estimated distance correlations is sometimes small due to random estimation errors, which can lead to wrong decisions.

5.2 Models with Randomly Generated P (X) and P (Y |X)

We now test the algorithms on models with randomly generated P(X) and P(Y|X). Specifically, we generate P(X) using the method in Section 5.1. Then we generate |N|/4 distributions on Y as a reference set, and for each x ∈ X we set P(Y|x) by randomly taking one of the distributions in the reference set (see the sketch after the list below). We choose the domain size to be

1. (|X |, |Y|) = (12, 12).

2. (|X |, |Y|) = (15, 15).

3. (|X |, |Y|) = (18, 18).

4. (|X |, |Y|) = (20, 20).
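A sketch of this generating procedure, under our reading of the description (random_pmf is the helper from the Section 5.1 sketch, and the hypothetical parameter n_ref denotes the size of the reference set):

```python
import numpy as np

def sample_random_model(n, domain_size, n_ref, rng):
    # Random marginal P(X) and a reference set of distributions on Y.
    support = np.arange(domain_size)
    p_x = random_pmf(domain_size, rng)
    reference = [random_pmf(domain_size, rng) for _ in range(n_ref)]
    # Each conditional P(Y|x) is picked uniformly from the reference set.
    cond = np.stack([reference[rng.integers(len(reference))] for _ in support])
    x = rng.choice(support, size=n, p=p_x)
    y = np.array([rng.choice(domain_size, p=cond[xi]) for xi in x])
    return x, y

# Example: x, y = sample_random_model(4000, 15, n_ref=4, rng=np.random.default_rng(0))
```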

For each setting, we generate 500 models. For each model, we sample 200, 300, 500, 1000, 2000 and 4000 points, and apply DC and DR to them. The performance is shown in Figure 3. One can see that DR has unsatisfactory performance in these scenarios, because it often makes a random guess when the model does not satisfy an ANM in either direction. DC performs satisfactorily as the sample size grows large. This shows that DC works in these scenarios while DR does not.

5.3 On Efficiency

This section investigates the efficiency of the algorithms. We use the same experimental settings as in Section 5.2 (settings 2 and 4). For each sample size, we run the algorithms 100 times and record the total running time (in seconds). The records are shown in Figure 4, which shows that DC is more efficient than DR. For example, when (|X|, |Y|) = (15, 15) and the sample size is 1000, DR uses around 400 seconds to finish the experiments while DC uses only 2 seconds. This is because DR iteratively searches the whole domain to find a function that yields minimum dependence between residuals and regressors, which can be time-consuming in practice.

[Figure 3: Accuracy of the algorithms (Distance Correlation and Discrete Regression) vs. sample size for different domain sizes: (a) (|X|, |Y|) = (12, 12), (b) (15, 15), (c) (18, 18), (d) (20, 20).]

5.4 On Parameter ε

In the sections above, DC was forced to make a decision in each trial (ε = 0). In this section, we examine the influence of ε on the performance of DC, which may help to set its value in practice. We use the same experimental setting as in Section 5.2. The domain size is chosen as (|X|, |Y|) = (15, 15) and (|X|, |Y|) = (20, 20), and the sample size is fixed to 4000. We choose the parameter ε to be:

1. ε = 0.01.

2. ε = 0.05.

3. ε = 0.1.

For each setting, we generate 500 models and apply DC to them. The proportions of correctly identified, wrongly identified, and non-identified models are shown in Figure 5.

[Figure 4: Running time (seconds, log scale) of the algorithms vs. sample size for (a) (|X|, |Y|) = (15, 15) and (b) (|X|, |Y|) = (20, 20).]

[Figure 5: Proportions of correctly identified, wrongly identified, and non-identified models vs. threshold ε for (a) (|X|, |Y|) = (15, 15) and (b) (|X|, |Y|) = (20, 20).]

The proportion of non-identified models becomes large when the threshold is 0.1, because DC becomes conservative in this situation. The proportions of both correctly and wrongly identified models decrease as the threshold grows larger. However, the accuracy (the number of correctly identified models divided by the total number of identified models) increases. This is reasonable, since decisions made by DC under a higher threshold are more reliable. Based on the plotted results, the accuracy and decision rate of DC are acceptable when ε = 0.05, so we suggest 0.05 as a reasonable choice for the parameter.

5.5 On Decision Rates

In our algorithm, the threshold parameter ε controls the decision rate of DC: as ε increases from 0 to 1, the decision rate decreases from 100% to 0%. In this section, we study the influence of ε on the performance of DC through the decision rate. We follow the experimental setting in Section 5.2 with (|X|, |Y|) = (15, 15), fix the sample size to 5000, and vary the decision rate by changing ε from 0 to 1. The percentage of correct decisions versus the decision rate is plotted in Figure 6.

[Figure 6: Percentage of correct decisions vs. decision rate for DC.]

As expected, the percentage of correct decisions decreases as the decision rate increases. For example, the percentage of correct decisions is 100% when the decision rate is less than 20%, but drops to 77% when the decision rate is 100%. This is acceptable, since decisions are more reliable when the algorithm decides only under a higher ε.

5.6 On Real World Data

We apply DC to real world benchmark cause-effect pairs¹. The dataset contains records of 88 cause-effect pairs. We exclude the pairs numbered 17, 44, 45, 52, 53, 54, 55, 68, 71 and 75, which are either multivariate causal pairs or pairs that cannot be fitted into memory by our algorithm. We compare DC with DR (Peters et al., 2011), IGCI (Janzing et al., 2012), LiNGAM (Shimizu et al., 2006), ANM (Hoyer et al., 2009), PNL (Zhang and Hyvärinen, 2009b) and CURE (Sgouritsa et al., 2015). The last five methods can be directly applied to the data, while DC and DR are applied to discretized data. To make the variables discrete, we process them using the following method. For a variable

¹ Available at https://webdav.tuebingen.mpg.de/cause-effect/

X, if the maximum absolute value of all observations is less than 1, we process it by round(20 ∗ X); otherwise we process it by round(X)². A sketch of this rule is given below. For each pair, we generate 50 replicates by resampling, and then apply the algorithms to the causal pairs. The boxplots showing the accuracies are given in Figure 7.
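A sketch of this discretization rule (the function name is ours); the special handling of the stock-return pairs noted in footnote 2 is omitted.

```python
import numpy as np

def discretize(x):
    # Scale up variables whose observations all lie in (-1, 1) before rounding.
    x = np.asarray(x, dtype=float)
    if np.max(np.abs(x)) < 1:
        return np.round(20 * x).astype(int)
    return np.round(x).astype(int)
```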

[Figure 7: Accuracy of the algorithms (DC, DR, IGCI, LiNGAM, ANM, PNL, CURE) on 78 real world causal pairs.]

Determining causal directions on real world data is challenging, since the causal mechanisms are often complex and the data records can be noisy (Pearl, 2000; Spirtes et al., 2000). Nevertheless, Figure 7 shows that DC performs satisfactorily on this task: its average accuracy is around 72%, the highest among all algorithms. The ANM based methods (DR, LiNGAM, ANM) do not perform well, possibly because the assumptions of additive noise models restrict their applicability.

6 Conclusions

In this paper, we deal with the causal inference problem on discrete data. We consider the distribution of the cause and the conditional distribution mapping cause to effect as independent random variables, and propose to discover the causal direction by estimating and comparing distance correlations. Encouraging experimental results are reported, which shows that inferring the causal direction using the independence postulate is a promising research direction. In future work we will try to extend this method to deal with high dimensional data.

² For pairs 65, 66 and 67, which contain stock returns, we process the variables by round(100 ∗ X).

Acknowledgments

The authors want to thank the editor and the anonymous reviewers for helpful comments. The work described in this paper was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China.

References

Chen, Z., Zhang, K., and Chan, L. (2013). Nonlinear causal discovery for high dimensional data: A kernelized trace method. In Proceedings of the IEEE 13th International Conference on Data Mining (ICDM), pages 1003–1008. IEEE.

Chen, Z., Zhang, K., Chan, L., and Schölkopf, B. (2014). Causal discovery via reproducing kernel embeddings. Neural Computation, 26(7):1484–1517.

Gretton, A., Herbrich, R., Smola, A., Bousquet, O., and Schölkopf, B. (2005). Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075–2129.

Hoyer, P. O. and Hyttinen, A. (2009). Bayesian discovery of linear acyclic causal models. In Proceedings of the 25th International Conference on Uncertainty in Artificial Intelligence (UAI), pages 240–248. AUAI Press.

Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J. R., and Schölkopf, B. (2009). Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems (NIPS), pages 689–696.

Hyttinen, A., Eberhardt, F., and Hoyer, P. O. (2012). Learning linear cyclic causal models with latent variables. Journal of Machine Learning Research, 13(1):3387–3439.

Hyvärinen, A. and Smith, S. M. (2013). Pairwise likelihood ratios for estimation of non-Gaussian structural equation models. Journal of Machine Learning Research, 14:111–152.

Hyvärinen, A., Zhang, K., Shimizu, S., and Hoyer, P. O. (2010). Estimation of a structural vector autoregression model using non-Gaussianity. Journal of Machine Learning Research, 11:1709–1731.

Janzing, D., Hoyer, P., and Schölkopf, B. (2010). Telling cause from effect based on high-dimensional observations. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 479–486.

Janzing, D., Mooij, J., Zhang, K., Lemeire, J., Zscheischler, J., Daniušis, P., Steudel, B., and Schölkopf, B. (2012). Information-geometric approach to inferring causal directions. Artificial Intelligence, 182:1–31.

Janzing, D. and Schölkopf, B. (2010). Causal inference using the algorithmic Markov condition. IEEE Transactions on Information Theory, 56(10):5168–5194.

Lemeire, J. and Janzing, D. (2013). Replacing causal faithfulness with algorithmic independence of conditionals. Minds and Machines, 23(2):227–249.

Mooij, J. M., Janzing, D., Heskes, T., and Schölkopf, B. (2011). On causal discovery with cyclic additive noise models. In Advances in Neural Information Processing Systems (NIPS), pages 639–647.

Pearl, J. (2000). Causality: models, reasoning and inference. Cambridge University Press, Cambridge, UK.

Peters, J., Janzing, D., and Schölkopf, B. (2010). Identifying cause and effect on discrete data using additive noise models. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 597–604.

Peters, J., Janzing, D., and Schölkopf, B. (2011). Causal inference on discrete data using additive noise models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2436–2450.

Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. (2012). On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 1–8.

Sgouritsa, E., Janzing, D., Hennig, P., and Schölkopf, B. (2015). Inference of cause and effect with unsupervised inverse regression. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 847–855.

Shimizu, S., Hoyer, P. O., Hyvärinen, A., and Kerminen, A. (2006). A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003–2030.

Shimizu, S., Inazumi, T., Sogawa, Y., Hyvärinen, A., Kawahara, Y., Washio, T., Hoyer, P. O., and Bollen, K. (2011). DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12:1225–1248.

Spirtes, P., Glymour, C. N., and Scheines, R. (2000). Causation, prediction, and search, volume 81. MIT press.

Székely, G. J., Rizzo, M. L., Bakirov, N. K., et al. (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794.

Zhang, K. and Hyvärinen, A. (2009a). Causality discovery with additive disturbances: an information-theoretical perspective. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pages 570–585. Springer.

Zhang, K. and Hyvärinen, A. (2009b). On the identifiability of the post-nonlinear causal model. In Proceedings of the 25th International Conference on Uncertainty in Artificial Intelligence (UAI), pages 647–655. AUAI Press.

Zscheischler, J., Janzing, D., and Zhang, K. (2011). Testing whether linear equations are causal: A free probability theory approach. In Proceedings of the 27th International Conference on Uncertainty in Artificial Intelligence (UAI), pages 839–847. AUAI Press.
