Proceedings of the Twenty-First International FLAIRS Conference (2008)

Insensitivity of Constraint-Based Causal Discovery Algorithms to Violations of the Assumption of Multivariate Normality

Mark Voortman∗ and Marek J. Druzdzel∗†

∗Decision Systems Laboratory, School of Information Sciences and Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA 15260, USA
†Faculty of Computer Science, Białystok Technical University, Wiejska 45A, 15-351 Białystok, Poland
{voortman,marek}@sis.pitt.edu

Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Constraint-based causal discovery algorithms, such as the PC algorithm, rely on conditional independence tests and are otherwise independent of the actual distribution of the data. In case of continuous variables, the most popular conditional independence test used in practice is the partial correlation test, applicable to variables that are multivariate Normal. Many researchers assume multivariate Normal distributions when dealing with continuous variables, based on a common wisdom that minor departures from multivariate Normality do not have a too strong effect on the reliability of partial correlation tests. We subject this common wisdom to a test in the context of causal discovery and show that partial correlation tests are indeed quite insensitive to departures from multivariate Normality. They, therefore, provide conditional independence tests that are applicable to a wide range of multivariate continuous distributions.

Introduction

Causal discovery algorithms based on constraint-based search, such as SGS and PC (Spirtes, Glymour, & Scheines 2000), perform a search for an equivalence class of causal graphs that is identifiable from patterns of conditional independencies observed in the data. These algorithms depend on the distribution over the variables in question only indirectly, since they take a set of conditional independence statements as input, and will correctly identify the class of causal structures that are compatible with the observed data. As long as conditional independence between random variables can be established, the algorithms produce provably correct results. The main problem to overcome is finding conditional independence tests that are suitable for a given distribution.

While reliable independence tests exist for discrete data, there are no generally accepted tests for mixtures of discrete and continuous data, and even no tests that cover the general continuous case. One special case that can be tackled in practice is when the set of variables in question follows a multivariate Normal distribution. In that case, there is a well established test of conditional independence, notably partial correlation. If the partial correlation between variables X and Y conditional on a set of variables Z is zero, then X and Y are said to be conditionally independent.

Because the multivariate Normal case is tractable, it is tempting to assume this distribution in practice. Druzdzel & Glymour (1999), for example, use a data set obtained from the US News & World Report magazine to study causes of low student retention in US universities. Figure 1 presents histograms of all eight variables in that data set. It seems that few, if any, of the variables are Normally distributed, so the question that naturally arises is whether they are 'Normal enough' to yield correct results when the partial correlation tests are applied. Common wisdom says that partial correlation is fairly insensitive to minor departures from Normality. What constitutes minor departures, and when these departures are large enough to weaken the power of partial correlation, has, to our knowledge, not been tested systematically. In this paper, we describe a series of experiments that we conducted to test the sensitivity of the partial correlation test, and the resulting effect on a basic causal discovery algorithm, the PC algorithm, to departures from multivariate Normality. We will show empirically that the partial correlation test is very robust against departures from Normality and, thus, that in practice the PC algorithm yields rather robust results.

The PC Algorithm

The PC algorithm (Spirtes, Glymour, & Scheines 2000) makes four basic assumptions that must be satisfied to yield correct results:

1. The set of observed variables is causally sufficient. Causal sufficiency means that every common cause of two or more variables is contained in the data set. This assumption is easily relaxed (see, for example, Spirtes, Glymour, & Scheines (2000)), but this is not of much importance for the current paper.

2. All records in the data set are drawn from the same joint probability distribution.

3. The probability distribution P over the observed variables is faithful to a directed acyclic graph G of the causal structure. Faithfulness means that all and only the conditional independence relations found in P are entailed by the Markov condition applied to G. The Markov condition is satisfied if a node X in graph G is independent of all its non-descendants minus its parents, given its parents.

4. The statistical decisions required by the algorithms are correct for the population.

Figure 2: (a) The underlying directed acyclic graph. (b) The complete undirected graph. (c) Graph with zero-order conditional independencies removed. (d) Graph with second-order conditional independencies removed. (e) The partially rediscovered graph. (f) The fully rediscovered graph.

Figure 1: Marginal distributions of the 8 variables in the retention data set (Druzdzel & Glymour 1999).

It is the last assumption that we focus on in this paper.

The PC algorithm works as follows:

1. Start with a complete undirected graph G with vertices V.

2. For all ordered pairs ⟨X, Y⟩ that are adjacent in G, test if they are conditionally independent given a subset of Adjacencies(G, X) \ {Y}. We increase the cardinality of the subsets incrementally, starting with the empty set. If the conditional independence test is positive, we remove the undirected link and set Sepset(X, Y) and Sepset(Y, X) to the conditioning variables that made X and Y conditionally independent.

3. For each triple of vertices X, Y, Z, such that the pairs {X, Y} and {Y, Z} are adjacent in G but {X, Z} is not, orient X − Y − Z as X → Y ← Z if and only if Y is not in Sepset(X, Z).

4. Orient the remaining edges in such a way that no new conditional independencies and no cycles are introduced. If an edge could still be directed in two ways, leave it undirected.

We illustrate the PC algorithm by means of a simple example (after Druzdzel & Glymour (1999)). Suppose we obtained a data set that is generated by the causal structure in Figure 2a, and we want to rediscover this causal structure. In Step (1), we start out with a complete undirected graph, shown in Figure 2b. In Step (2), we remove an edge when two variables are conditionally independent given a subset of adjacent variables. The graph in Figure 2a implies two (conditional) independencies (denoted by ⊥), namely A ⊥ B and A ⊥ D | {B, C}, which leads to the graphs in Figures 2c and 2d, respectively. Step (3) is crucial, since it is in this step that we orient the causal arcs. In our example, we have the triplet A − C − B and C is not in Sepset(A, B), so we orient A → C and B → C in Figure 2e. In Step (4), we have to orient C → D, because otherwise A ⊥ D | {B, C} would not hold, and B → D to prevent a cycle. Figure 2f shows the final result. In this example, we are able to rediscover the complete causal structure, although this is not possible in general.

It is important to note the impact of an incorrect statistical decision. If, for example, our statistical test failed to find A ⊥ B, the resulting graph would be that of Figure 3. In general, failing to find just one conditional independence can have a severe impact on the output of the PC algorithm.

Figure 3: Resulting graph when the statistical test failed to find the independency A ⊥ B.
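To make the procedure concrete, here is a minimal Python sketch of the adjacency phase, Steps (1) and (2). The sketch is ours, not code from the paper; `indep_test` is a placeholder for any conditional independence oracle, such as the partial correlation test of the next section, and the orientation Steps (3) and (4) are omitted.

```python
import itertools

def pc_skeleton(variables, indep_test):
    """Steps (1) and (2) of the PC algorithm: start from a complete
    undirected graph and remove edges between conditionally independent
    pairs, conditioning on subsets of increasing cardinality."""
    # Step (1): complete undirected graph over the variables.
    adjacent = {v: set(variables) - {v} for v in variables}
    sepset = {}
    depth = 0
    # Stop once no node has enough neighbors to form a conditioning set.
    while any(len(adjacent[x] - {y}) >= depth
              for x in variables for y in adjacent[x]):
        for x in variables:
            for y in list(adjacent[x]):
                if y not in adjacent[x]:   # edge already removed
                    continue
                # Step (2): try conditioning sets of size `depth` drawn
                # from the current adjacencies of x (excluding y).
                for cond in itertools.combinations(sorted(adjacent[x] - {y}),
                                                   depth):
                    if indep_test(x, y, set(cond)):
                        adjacent[x].discard(y)
                        adjacent[y].discard(x)
                        # Record the separating set for the orientation step.
                        sepset[(x, y)] = sepset[(y, x)] = set(cond)
                        break
        depth += 1
    return adjacent, sepset
```

Given the four variables of Figure 2 and an oracle that answers "independent" exactly for A ⊥ B and A ⊥ D | {B, C}, this skeleton should recover the edges of Figure 2d.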

Partial Correlation Test

If P is a probability distribution linearly faithful to a graph G, then A and B are conditionally independent given C if and only if the partial correlation is zero, i.e., ρ_{AB.C} = 0. The partial correlation ρ_{AB.C} is the correlation between the residuals R_1 and R_2 resulting from the linear regression of A on C and of B on C, respectively.

Since the sample partial correlation will almost never be exactly zero, we have to resort to a statistical test to make a decision, and this is where the assumption of multivariate Normality shows up. Fisher's z-transform (Fisher 1915) assumes a multivariate Normal distribution and is given by:

    z(\hat{\rho}_{AB.C}) = \frac{1}{2}\sqrt{n - |C| - 3}\,\log\frac{1 + \hat{\rho}_{AB.C}}{1 - \hat{\rho}_{AB.C}},

where ρ̂_{AB.C} is the sample partial correlation, n is the number of samples, and |C| denotes the number of variables in C. Because z(ρ̂_{AB.C}) follows the standard Normal distribution, we can apply the z-test with H_0: ρ_{AB.C} = 0 and H_1: ρ_{AB.C} ≠ 0. We reject H_0 at a significance level α if

    |z(\hat{\rho}_{AB.C})| > \Phi^{-1}(1 - \alpha/2),

where Φ^{−1}(·) is the inverse cumulative distribution function of the standard Normal distribution, also known as the probit function.

One sufficient condition for a distribution to be multivariate Normal is the following. A random vector X = [X_1, ..., X_N]^T follows a multivariate Normal distribution if every linear combination Y = a_1 X_1 + ... + a_N X_N is Normally distributed. Because a sum of linearly dependent Normal variables is Normally distributed, a sufficient condition for X to be multivariate Normal is that the marginal distributions of the variables X_i in the data set are Normal and that the variables depend on each other only linearly.
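This test is straightforward to implement. The following sketch (our illustration in Python with NumPy and SciPy, not code from the paper) computes the partial correlation from the regression residuals and applies the decision rule above; with an empty conditioning set it reduces to a test of ordinary correlation.

```python
import numpy as np
from scipy.stats import norm

def partial_correlation(a, b, C):
    """Correlation between the residuals of regressing a on C and b on C.
    `a` and `b` are 1-D sample vectors; `C` is an (n, |C|) matrix."""
    design = np.column_stack([np.ones(len(a)), C])  # intercept + regressors
    resid_a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
    resid_b = b - design @ np.linalg.lstsq(design, b, rcond=None)[0]
    return np.corrcoef(resid_a, resid_b)[0, 1]

def fisher_z_test(a, b, C, alpha=0.05):
    """True iff H0 (zero partial correlation) is rejected at level alpha."""
    rho = partial_correlation(a, b, C)
    n, k = len(a), C.shape[1]
    z = 0.5 * np.sqrt(n - k - 3) * np.log((1 + rho) / (1 - rho))
    return abs(z) > norm.ppf(1 - alpha / 2)
```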
Experiments

In this section, we show a series of experiments that aim at testing how sensitive the partial correlation test is to violations of the assumption of multivariate Normality. The setup of all the experiments is similar and consists of three steps:

1. Generate a data set of 100 records from a joint probability distribution (the distribution is different in every experiment).

2. For each data set, perform a series of partial correlation tests with α = 0.05. For simplicity, we introduce only one conditioning variable in each test, unless explicitly stated otherwise. The null hypothesis is that the given variables are conditionally independent, and the alternative hypothesis is that they are not.

3. Repeat Steps (1) and (2) 1,000 times.

There are many ways for a distribution to be non-Normal and, therefore, finding out the effect of non-Normal distributions on partial correlation tests is somewhat like probing in the dark. To make this investigation systematic, we focused on the third and fourth central moments around the mean, skewness and kurtosis, respectively. Because both of these moments are zero for the Normal distribution, and most of the time non-zero for other distributions, they provide one measure of closeness to Normality.

In the first experiment, we focus on the third central moment using the Pearson IV distribution, because it is capable of keeping the kurtosis constant while changing the skewness. The second experiment is the converse, namely fixing the skewness and changing the kurtosis, using the Pearson VII distribution. The third experiment changes both moments at the same time using the Gamma distribution. As a last experiment, we investigate the effect of multiple conditioning variables on the partial correlation test.

In the experiments, we measure the effect of using non-Normal distributions by counting Type I and Type II errors. A Type I error is committed when the null hypothesis is falsely rejected and, conversely, a Type II error is committed when the null hypothesis is falsely accepted. The value of α is the probability that the statistical test will make a Type I error. The probability of a Type II error depends on α, the sample size, and the sensitivity of the data.
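As an illustration of this three-step setup (again our sketch, reusing fisher_z_test from above), the following estimates error frequencies for a three-variable common-cause structure, X → Y and X → Z; this is the graph of Figure 4a below, in which Y ⊥ Z | X holds and all other pairs are dependent.

```python
import numpy as np

def error_rates(sample_noise, trials=1000, n=100, alpha=0.05):
    """Steps (1)-(3) for the common-cause graph X -> Y, X -> Z.
    `sample_noise(n)` draws n i.i.d. disturbances from whichever
    distribution a given experiment uses."""
    type1 = type2 = 0
    for _ in range(trials):
        x = sample_noise(n)
        y = x + sample_noise(n)
        z = x + sample_noise(n)
        # Y and Z are independent given X: a rejection is a Type I error.
        if fisher_z_test(y, z, x.reshape(-1, 1), alpha):
            type1 += 1
        # X and Y are dependent given Z: an acceptance is a Type II error.
        if not fisher_z_test(x, y, z.reshape(-1, 1), alpha):
            type2 += 1
    return type1 / trials, type2 / trials

# Baseline with Normal disturbances; expect Type I close to alpha.
rng = np.random.default_rng(0)
print(error_rates(lambda n: rng.standard_normal(n)))
```

Each of the experiments below then amounts to substituting a different sample_noise.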

Manipulation of the Third Central Moment

We used the Pearson type IV distribution (Heinrich 2004) to manipulate the skewness while keeping the kurtosis constant. The Pearson type IV distribution is given by:

    f(x \mid m, \nu) = k\,(1 + x^2)^{-m} \exp(-\nu \tan^{-1} x),

where m > 1/2 and k is a normalization constant. We chose standard values for the location and scale parameters, because they do not influence the skewness and kurtosis. The skewness γ_1 and kurtosis γ_2 are given by:

    \gamma_1 = \frac{-4\nu}{r-2}\sqrt{\frac{r-1}{r^2+\nu^2}}, \qquad
    \gamma_2 = \frac{3(r-1)\left[(r+6)(r^2+\nu^2) - 8r^2\right]}{(r-2)(r-3)(r^2+\nu^2)},

where r = 2(m − 1), and m > 5/2 for the kurtosis to exist. The data were generated by the following joint distributions:

    X ∼ PearsonIV(r, ν)          X ∼ PearsonIV(r, ν)
    Y ∼ X + PearsonIV(r, ν)      Y ∼ PearsonIV(r, ν)
    Z ∼ X + PearsonIV(r, ν)      Z ∼ X + Y + PearsonIV(r, ν)

The graphs implied by these distributions are depicted in Figures 4a and 4b, respectively.

Figure 4: Graphs implied by the distributions used in the experiments. (a) Graph with a common cause. (b) Graph with a common effect. (c) Graph with multiple conditioning variables.

We used the following values for r and ν, chosen such that the kurtosis is kept constant:

    r    ν       Skewness   Kurtosis
    4    0        0         9
    5    2.40    −1.15      9
    6    4.90    −1.41      9
    7    9.04    −1.55      9
    8    19.60   −1.63      9

The skewness is bounded by the value of the kurtosis and the range of allowable parameter values in the Pearson IV family; therefore, we could only achieve a skewness of −1.63.
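As a quick numerical check of the moment formulas above (a convenience sketch of ours, not part of the original study), the table entries can be reproduced directly:

```python
import math

def pearson4_moments(r, nu):
    """Skewness and kurtosis of the Pearson IV distribution, r = 2(m - 1)."""
    gamma1 = -4 * nu / (r - 2) * math.sqrt((r - 1) / (r**2 + nu**2))
    gamma2 = (3 * (r - 1) * ((r + 6) * (r**2 + nu**2) - 8 * r**2)
              / ((r - 2) * (r - 3) * (r**2 + nu**2)))
    return gamma1, gamma2

# e.g. pearson4_moments(5, 2.40) -> (-1.15, 9.0), matching the table.
```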

For all generated data sets, we tested the correlation and partial correlation between all 3 · 2 = 6 possible combinations of variables. In this setting, the partial correlation test performed as expected, without any surprising results. The following are the results for the graph in Figure 4a:

    Skew     XY    XZ    YZ      XY|Z    XZ|Y   YZ|X
    0        0     0     0       0.002   0      0.043
    −1.15    0     0     0.003   0       0      0.052
    −1.41    0     0     0.001   0.001   0      0.061
    −1.55    0     0     0.001   0.001   0      0.047
    −1.63    0     0     0.003   0.001   0      0.044

The numbers in the table denote error frequencies. So, for example, when the skewness is −1.55, the error frequency of the partial correlation test of Y and Z given X is equal to 0.047. This means that about 5 percent of the time the partial correlation test led to an incorrect result, which is exactly what we would expect with α = 0.05. Also, when the variables were (conditionally) dependent, which is the case for all the other combinations of variables, the test made very few mistakes.

Similar results were obtained for the graph in Figure 4b:

    Skew     XY      XZ    YZ    XY|Z    XZ|Y    YZ|X
    0        0.054   0     0     0       0.005   0
    −1.15    0.045   0     0     0.002   0       0
    −1.41    0.042   0     0     0.004   0       0
    −1.55    0.068   0     0     0       0.004   0
    −1.63    0.054   0     0     0       0.005   0

Manipulation of the Fourth Central Moment

In the second experiment, we investigated the impact of changing the kurtosis of a distribution while keeping all other moments constant. To achieve this goal, we used the Pearson type VII distribution, whose density function is given by:

    f(x \mid a, m) = \frac{\Gamma(m)}{a\sqrt{\pi}\,\Gamma(m - 1/2)}
                     \left[1 + \left(\frac{x}{a}\right)^2\right]^{-m},

where a is a scale parameter and m is a shape parameter. We reparametrize with a = \sqrt{2 + 6/\gamma_2} and m = 5/2 + 3/\gamma_2, where γ_2 is the kurtosis. The data were generated by the following joint distributions:

    X ∼ PearsonVII(γ_2)          X ∼ PearsonVII(γ_2)
    Y ∼ X + PearsonVII(γ_2)      Y ∼ PearsonVII(γ_2)
    Z ∼ X + PearsonVII(γ_2)      Z ∼ X + Y + PearsonVII(γ_2)

The Pearson VII family of distributions is symmetric, so the skewness is always zero. γ_2 ranged from 1 to 20 with increments of 1. The partial correlation test performed similarly to the previous experiment, i.e., the departure from Normality did not affect the probability of a Type I error, and the test made very few Type II errors.

Manipulation of the Gamma Distribution

In the third experiment, we changed the skewness and kurtosis at the same time. For this we used the Gamma distribution, given by:

    f(x \mid k, \theta) = \frac{1}{\Gamma(k)\,\theta^k}\,x^{k-1} e^{-x/\theta},

where k > 0 is a shape parameter and θ > 0 is a scale parameter. The skewness is given by γ_1 = 2/√k and the kurtosis by γ_2 = 6/k. One important property of the Gamma distribution is that it approximates the Normal distribution as k goes to infinity. Therefore, we start our experiments with a large k and decrease its value to see the effect on the statistical tests. Figure 5 shows four different Gamma distributions that change from approximately Normal to very non-Normal. Note that decreasing k increases the skewness and kurtosis.

Figure 5: Gamma distributions for various values of the shape parameter k. Please note that for k < 1 the distribution becomes extremely skewed.

We performed this experiment in a setup similar to the previous experiments. The data were generated using the following systems of simultaneous equations:

    X ∼ Gamma(k, 1)          X ∼ Gamma(k, 1)
    Y ∼ X + Gamma(k, 1)      Y ∼ Gamma(k, 1)
    Z ∼ X + Gamma(k, 1)      Z ∼ X + Y + Gamma(k, 1)

Figures 4a and 4b show the causal structures implied by the above systems of equations. Please note that the marginal distributions of X, Y, and Z will be Gamma distributions, because a sum of independent random variables following Gamma distributions with the same scale parameter will itself follow a Gamma distribution.
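This experiment slots directly into the earlier setup. The sketch below (ours, reusing error_rates from the experimental-setup sketch) runs the common-cause variant with Gamma disturbances, for the shape values discussed in the next paragraph:

```python
import numpy as np

rng = np.random.default_rng(0)

def gamma_noise(k):
    """Gamma(k, 1) disturbances; skewness 2 / sqrt(k), kurtosis 6 / k."""
    return lambda n: rng.gamma(shape=k, scale=1.0, size=n)

# Reusing error_rates from the experimental-setup sketch:
for k in (1000, 100, 10, 1, 0.1, 0.01):
    t1, t2 = error_rates(gamma_noise(k))
    print(f"k = {k}: Type I {t1:.3f}, Type II {t2:.3f}")
```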

For k we chose the values 1,000, 100, 10, 1, 0.1, and 0.01. For these values, we have the following skewness and kurtosis:

    k        Skew    Kurt
    1,000    0.063   0.006
    100      0.2     0.06
    10       0.63    0.6
    1        2       6
    0.1      6.32    60
    0.01     20      600

Please note that samples of limited size will not actually match very high theoretical values of skewness and kurtosis, but their sample moments are nonetheless relatively higher than those of samples generated with lower theoretical values. Our experiments support this.

The partial correlation test turned out to be very robust in most of these cases. It made a Type I error around 5 percent of the time, as one would expect, and when k < 0.1, this percentage even approached zero. When we consider Type II errors, the test makes very few errors in most of the cases. Only for k < 0.1 does the test increase its frequency of Type II errors, but the number of errors never exceeds 35 percent. Also, please note that k < 0.1 yields a distribution that is extremely skewed (γ_1 > 6.32). Figure 6a shows a histogram of data generated from a Gamma(0.1, 1) distribution, where it is clear that the distribution is skewed to the right; Figure 6b shows the histogram of the logarithm of the samples.

Figure 6: (a) Histogram of 1,000 samples from a Gamma distribution with k = 0.1 and θ = 1. (b) Histogram of the logarithm of the samples in (a).

We also ran several trials in which we randomly selected different values of k. In this case, the tests again were very robust. Only when at least one of the k's was close to zero did the tests yield incorrect results.

Multiple Conditioning Variables

As a last experiment, we investigated the effect of having multiple, instead of only one, conditioning variables. Statistical tests do not work quite as well with multiple conditioning variables as they do with a single one. In this experiment, we wanted to find out whether this effect is different for a non-Normal distribution than for a Normal distribution. We generated the data by sampling from the following joint probability distribution:

    X ∼ Gamma(k_X, 1)
    Y_1 ∼ X + Gamma(k_1, 1)
    Y_2 ∼ X + Gamma(k_2, 1)
    ...
    Y_n ∼ X + Gamma(k_n, 1)
    Z ∼ Y_1 + Y_2 + ... + Y_n + Gamma(k_Z, 1)

All the k_i were drawn from a Uniform(0, 1) distribution. The graph implied by this joint distribution is depicted in Figure 4c.

We calculated the partial correlation between X and Z given {Y_1, ..., Y_{n−1}}, where we ranged n from 3 to 10. For comparison, we ran another trial with the same setup, but replaced the Gamma distributions by standard Normal distributions. The results are displayed in Figure 7.

Figure 7: Results for the network with multiple conditioning variables. The horizontal axis shows the number of conditioning variables, and the vertical axis shows the error frequency. Clearly, a smaller number of conditioning variables yields more accurate results.

Although the setting with the Gamma distributions does not perform as well as the standard Normal case, the difference is very small. Especially when we take into account that the Gamma distributions used in the experiment had k_i ∼ Uniform(0, 1) and are thus all extremely skewed to the right (see Figure 5), the result is surprising. Please note that 100 samples is quite a small number for so many conditioning variables. We ran the same experiment with 1,000 samples, and this resulted in no errors for both the Normal and the Gamma distributions.
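A sketch of a single trial of this experiment follows (our illustration, reusing fisher_z_test from the earlier sketch). Since conditioning on {Y_1, ..., Y_{n−1}} leaves X and Z dependent through Y_n, we count an accepted null hypothesis as an error when tallying frequencies like those in Figure 7; that reading of the error frequency is our assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def mcv_trial(n_vars, n=100, alpha=0.05, use_gamma=True):
    """One trial for the graph of Figure 4c with n_vars intermediate
    variables, testing X against Z given {Y_1, ..., Y_{n-1}}."""
    def noise():
        if use_gamma:
            return rng.gamma(rng.uniform(0, 1), 1.0, n)  # k_i ~ Uniform(0, 1)
        return rng.standard_normal(n)

    x = noise()
    ys = [x + noise() for _ in range(n_vars)]
    z = sum(ys) + noise()
    C = np.column_stack(ys[:-1])  # condition on all but the last Y
    return fisher_z_test(x, z, C, alpha)

# Error frequency for n = 3, ..., 10 (Gamma case), as in Figure 7:
for n_vars in range(3, 11):
    wrong = sum(not mcv_trial(n_vars) for _ in range(1000)) / 1000
    print(n_vars, wrong)
```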
Violations of Linearity

In the previous sections, we showed that partial correlation tests are quite robust against departures from the multivariate Normality assumption, although we have not tested the influence of departures from linearity of the interactions. It is well known that non-linearity can make the partial correlation test fail. To get an idea of how strong the non-linearity has to be for the partial correlation test to fail, we created the following example:

    X ∼ Uniform(−5, a)
    Y ∼ X² + Normal(0, 5)

The idea here is that changing a affects how non-linear the relation between X and Y is: increasing a changes the dependence of Y on X from an almost linear to a quadratic one. We sampled 100 data points for every a ranging from 0 to 5 with increments of 0.5, and performed correlation tests for each of these cases. The results are displayed in Figure 8. As the figure shows, the correlation tests only failed when a approached 5. The underlying reason is that the regression line becomes almost horizontal, so knowing X then gives almost no information about the value of Y. Obviously, non-linearity of interactions can have a large influence on the tests, and in this way the tests can be fooled.

Figure 8: (a) Scatterplot of one of the samples when a = 5. (b) a on the horizontal axis versus the error frequency on the vertical axis.
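This effect is easy to reproduce. The sketch below (ours, reusing fisher_z_test with an empty conditioning set, and treating the 5 in Normal(0, 5) as a standard deviation, which is our assumption) estimates the error frequency plotted in Figure 8b:

```python
import numpy as np

rng = np.random.default_rng(0)

# For each a, estimate how often the correlation test misses the
# (increasingly non-linear) dependence of Y on X.
empty = np.empty((100, 0))  # no conditioning variables
for a in np.arange(0, 5.5, 0.5):
    misses = 0
    for _ in range(1000):
        x = rng.uniform(-5, a, 100)
        y = x**2 + rng.normal(0, 5, 100)
        if not fisher_z_test(x, y, empty):  # dependence not detected
            misses += 1
    print(f"a = {a:.1f}: error frequency {misses / 1000:.3f}")
```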

Discussion

We have shown experimentally that the partial correlation test is quite robust against deviations from multivariate Normal distributions. All of our experiments support this claim. Only if the deviation from multivariate Normality is very large might the tests yield incorrect results. This means that assuming a multivariate Normal distribution is reasonable in practice, and one can inspect the histograms to check that the distributions are not too non-Normal. Note that we only used a sample size of 100; testing with a sample size of 1,000 gave only slightly better results. It is likely that there are cases not covered by us where the partial correlation test is quite sensitive to the distribution in use, e.g., a multimodal distribution. The rule of thumb seems to be that as long as the distributions are not too non-Normal, the statistical tests will lead to reasonable results, and so will the causal discovery algorithms. The problem of non-linear relationships in the data is harder to deal with, and we presented merely a simple example that visually shows a case in which the test goes wrong.

There are two lines of work that take an alternative approach by not assuming multivariate Normal distributions at all. In Shimizu et al. (2005), the opposite assumption is made, namely that all error terms are non-Normally distributed. This allows them to find the complete causal structure, while also assuming linearity and causal sufficiency, something that is not possible for Normal error terms. Of course, this brings up the empirical question of whether error terms are typically distributed Normally or non-Normally. The second approach does not make any distributional assumptions at all: Margaritis (2005) describes an approach that is able to perform conditional independence tests on data with any distribution. However, the practical applicability of the algorithm is still an open question.

Acknowledgements

We thank the anonymous reviewers for their insightful comments and Peter Spirtes for sharing with us his views on the problem and for useful literature pointers. This research was supported by the Air Force Office of Scientific Research grant FA9550–06–1–0243 and by Intel Research. All experimental data have been obtained using SMILE, a Bayesian inference engine developed at the Decision Systems Laboratory and available at http://genie.sis.pitt.edu/.

References

Druzdzel, M. J., and Glymour, C. 1999. Causal inferences from databases: Why universities lose students. In Glymour, C., and Cooper, G. F., eds., Computation, Causation, and Discovery, 521–539. Menlo Park, CA: AAAI Press.

Fisher, R. 1915. Frequency distribution of the values of the correlation coefficient in samples of an indefinitely large population. Biometrika 507–521.

Heinrich, J. 2004. A guide to the Pearson Type IV distribution. Cdf/memo//public/6820.

Margaritis, D. 2005. Distribution-free learning of Bayesian network structure in continuous domains. In Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI).
Shimizu, S.; Hyvärinen, A.; Kano, Y.; and Hoyer, P. O. 2005. Discovery of non-Gaussian linear causal models using ICA. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI-05), 525–53. Arlington, Virginia: AUAI Press.

Spirtes, P.; Glymour, C.; and Scheines, R. 2000. Causation, Prediction, and Search. New York: Springer Verlag, second edition.
