
A Kernel Conditional Independence Test for Relational Data

Sanghack Lee and Vasant Honavar Artificial Intelligence Research Laboratory College of Information Sciences and Technology The Pennsylvania State University, University Park, PA 16802 {sxl439, vhonavar}@ist.psu.edu

Abstract

Conditional independence (CI) tests play a central role in statistical inference, machine learning, and causal discovery. Most existing CI tests assume that the samples are independently and identically distributed (i.i.d.). However, this assumption often does not hold in the case of relational data. We define Relational Conditional Independence (RCI), a generalization of CI to the relational setting. We show how, under a set of structural assumptions, we can test for RCI by reducing the task of testing for RCI on non-i.i.d. data to the problem of testing for CI on several data sets, each of which consists of i.i.d. samples. We develop the Kernel Relational CI test (KRCIT), a nonparametric test, as a practical approach to testing for RCI by relaxing the structural assumptions used in our analysis of RCI. We describe results of experiments with synthetic relational data that show the benefits of KRCIT relative to traditional CI tests that don't account for the non-i.i.d. nature of relational data.

1 INTRODUCTION

Observational and experimental data represent systematic interactions among a set of random variables of interest. Conditional independence (CI) tests constitute essential tools for understanding such interactions. Random variables X and Y are said to be conditionally independent given Z, denoted by X ⊥⊥ Y | Z, if and only if the joint distribution P_{xyz} can be factorized as P_{x|z} P_{y|z} P_z. The notion of CI plays a central role in statistical inference (Dawid, 1979), probabilistic graphical models (Koller and Friedman, 2009), and causal discovery (Pearl, 2000; Spirtes et al., 2000). A variety of methods including, in particular, non-parametric methods (Fukumizu et al., 2008; Zhang et al., 2011; Doran et al., 2014; Lee and Honavar, 2017) have been developed to test for CI in settings where the parametric form of the underlying distribution is unknown but a measure of closeness between data samples can be defined, e.g., using a kernel function. However, these methods implicitly or explicitly assume that the data samples are independently and identically distributed (i.i.d.).

Many sources of real-world data, e.g., the WWW, citation networks, social networks, and biomolecular networks, exhibit a relational structure, wherein the data are naturally represented as collections of interlinked entities. In the resulting relational data, e.g., a citation network, the entities, e.g., authors, articles, and institutions, clearly do not constitute i.i.d. observations. Methods for learning causal models from relational data rely on oracles that can answer CI queries from such data (Maier et al., 2013; Lee and Honavar, 2016). Practical realizations of such algorithms will need to replace such oracles by CI tests against relational data. However, in the relational setting, with the exception of autocorrelated data, e.g., time series (Chwialkowski et al., 2014), where 'closeness' in time, space, or network is well-defined (Flaxman et al., 2016), effective ways to define and test for CI have been lacking. Any attempt to generalize the notion of CI to the relational setting needs to overcome several challenges: What are the relational counterparts of random variables? How can we define their marginal distributions?

Against this background, inspired by the notion of relational d-separation (Maier et al., 2013), which generalizes a graphical criterion for CI to a specific model of relational data, we (i) formalize Relational Conditional Independence (RCI), the relational counterpart of CI; (ii) examine the dependence and heterogeneity of relational variables in terms of the underlying relational structure; (iii) based on the preceding analyses, devise a Kernel Relational CI Test (KRCIT) that, to the best of our knowledge, offers the first practical method for testing for RCI; and (iv) describe results of experiments with synthetic relational data that show the benefits of KRCIT relative to traditional CI tests that don't account for the non-i.i.d. nature of relational data. RCI and KRCIT offer new ways to understand dependencies in relational data across a broad range of practical applications.

2 PRELIMINARIES

Figure 1: (Left) a small relational skeleton σ as an undirected graph of entities of three classes, Blue, Magenta, and Gray; and (Right) an example of a relational variable V where V(ℓ) refers to the multiset of attribute X of the set of Gray items forming a triangle with the given Blue item and a Magenta item. Hence, V(i) = {a.X}; V(j) = {c.X, d.X}; and V(k) = ∅.

We follow the notational conventions from statistics, graph theory, and relational models. We use a capital letter X to denote a random variable; 𝒳 to denote the range of X; and a lowercase letter x to denote the value of X. Calligraphic letters are also used to represent mathematical objects, e.g., graphs.

We define a labeled (directed or undirected) graph G = ⟨V, E, L⟩ where V denotes a set of vertices (or nodes) and E a set of edges. Each vertex is assigned a discrete label by a labeling function L : V ↦ Σ where Σ is a set of labels. We disallow self-loops. Given an undirected graph G, a connected component G′ is a vertex-induced subgraph of G where there exists a path between every pair of vertices in G′. We denote the set of all connected components in G by CC_G and the connected component containing v ∈ V by CC^G_v. Two labeled graphs G and G′ are said to be isomorphic, denoted by G ≅ G′, if there exists a bijective function f : V ↦ V′ such that ∀_{v∈V} L(v) = L′(f(v)) and ∀_{u,v∈V} (u, v) ∈ E ⇔ (f(u), f(v)) ∈ E′.

We use a simplified version of the Entity-Relationship (ER) model (Chen, 1976) to describe relational data (Friedman et al., 1999; Heckerman et al., 2007; Maier et al., 2013). A relational schema S = ⟨E, R, A⟩ describes the relational domain of interest with a set of entity classes E (e.g., person, student), relationship classes R (e.g., friend-of, son-of), and attribute classes A (e.g., gender, income). We refer to the union E ∪ R as the set of item classes I. In general, a relationship class R ∈ R can be n-ary where n ≥ 2 (e.g., contract is a set of ternary relationships involving products, buyers, and sellers). We use A(I) to denote an item class I's attribute classes, and A⁻¹(X) to denote the item class of X.

A relational skeleton σ ∈ Σ_S is an instantiation of a given relational schema, which can be viewed as an undirected bipartite graph. We denote by σ(I) the set of items of an item class I ∈ I. Given an item i ∈ σ(I), we use i.X to denote the item i's attribute of class X. Note that i.X is a random variable which takes a value i.x ∈ 𝒳. An edge (i, r) ∈ σ represents the participation of an entity i in a relationship r. For simplicity, we represent a skeleton whose relationship classes are binary as an undirected graph of entities. We use "relational structure" to mean the graphical structure of a relational skeleton, and "relational data" to mean a set of data that conform to a given relational schema. We use letters such as i and j to stand for integers or items (e.g., the letters i and j refer to two items in σ(I)).

3 CI TEST WITH RELATIONAL DATA

We define the notion of relational variables followed by the notion of Relational Conditional Independence (RCI). We provide both a theoretical characterization of RCI as well as a practical approach to testing for RCI.

Consider attribute classes (attributes for short) X, Y, and Z of a relational schema. In the absence of any relational structure, the "data" corresponding to instantiations of these random variables in a dataset can be naturally indexed so that (x_i, y_i, z_i) denotes the ith instance drawn from P_{xyz}. However, in the relational setting, there is no such natural index. Hence, we use a set of items of each item class to serve the role of an index. This indexing scheme generalizes the notion of the 'ith instance' in an i.i.d. setting to the notion of 'instantiated by item i' where i ∈ σ(I) for some I ∈ I. Note that different item classes provide different ways to index relational data.

Definition (Relational Variable). Let S = ⟨E, R, A⟩ be a relational schema and σ ∈ Σ_S be an arbitrary relational skeleton. A relational variable V is a function from σ(I) for some item class I ∈ I to a subset of {j.X | j ∈ σ(J)} for some attribute class X and item class J such that X ∈ A(J):

V : σ(I) ↦ 2^{j.X | j∈σ(J)}

where every i ∈ σ(I) is connected to each j ∈ σ(J) such that j.X ∈ V(i).

As a simple but concrete example, a relational variable V, 'smoking status of one's neighbors', is defined with I and J both being the 'Person' item class and X corresponding to the attribute class 'smoking', where each j ∈ σ(J) with j.X ∈ V(i) is i's neighbor for every person i ∈ σ(I). Given a person Arthur who has two neighbors Sally and Sean, V(arthur) = {sally.Smoke, sean.Smoke} results in a 2-dimensional random variable. In Fig. 1, we illustrate another example of relational variables involving three entity classes.

Understanding an item attribute, e.g., j.X, as a random variable, V(i) (or V_i for short) is a set of random variables. The 'value' of V_i is denoted by v_i = (j.x)_{j.X∈V_i} ∈ 𝒳^{|V_i|}, indexed with the item attributes in V_i. In the preceding example, v_arthur = {sally: False, sean: True}. Noting that a relational variable itself is not a random variable, we proceed to carefully define relational conditional independence. Without loss of generality, we consider the case where the conditioning is on a single relational variable although, in general, the conditioning can be on a set of relational variables.

Definition (Relational Conditional Independence). Let {U, V, W} be relational variables with a common domain, σ(I), defined on a relational schema S where σ ∈ Σ_S. Let the item attributes ⋃_{i∈σ(I)} U_i ∪ V_i ∪ W_i be random variables. Then, U and V are said to be independent of each other given W, denoted by (U ⊥⊥ V | W)_σ, if and only if

∀_{i∈σ(I)} P_{u_i v_i w_i} = P_{u_i|w_i} P_{v_i|w_i} P_{w_i}.

It is easy to see that this definition of relational CI (RCI) generalizes traditional CI where i.i.d. samples are drawn from P_{xyz}. Let U be "smoking status of oneself" and W be "smoking status of one's parents", e.g., U(arthur) = {arthur.Smoke}. We might ask (U ⊥⊥ V | W)_σ: "Is one's smoking status independent of one's neighbors' smoking status given one's parents' smoking status?" While, in the real world, answering such a question can be quite difficult because of the complexity and partial observability of interactions among people, the notion of relational CI, once operationalized, can help extract useful insights from relational data.
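To make the definitions concrete, here is a minimal sketch (not the authors' released code) of a skeleton and the 'smoking status of one's neighbors' relational variable, assuming the networkx library; the item names and the identifiers V and smoke are hypothetical.

```python
import networkx as nx

# A tiny relational skeleton: an undirected graph of 'Person' entities.
sigma = nx.Graph()
sigma.add_nodes_from(['arthur', 'sally', 'sean'], item_class='Person')
sigma.add_edges_from([('arthur', 'sally'), ('arthur', 'sean')])  # neighbor-of

# Attribute values i.Smoke, one random draw per item.
smoke = {'arthur': False, 'sally': False, 'sean': True}

def V(i):
    """Relational variable 'smoking status of i's neighbors': maps an item i
    in sigma(Person) to the set of item attributes {j.Smoke | j neighbor of i}."""
    return {(j, 'Smoke') for j in sigma.neighbors(i)}

v_arthur = {j: smoke[j] for j, _ in V('arthur')}  # the 'value' of V_arthur
print(sorted(V('arthur')), v_arthur)  # {sally.Smoke, sean.Smoke}, {sally: False, sean: True}
```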
3.1 RELATIONAL VARIABLES AND PARAMETER TYING

In the relational setting, assuming that each item attribute of the same attribute class is identically distributed (e.g., sally.Smoke =^d arthur.Smoke) would be tantamount to ignoring the relational structure of the data. On the other hand, if we were to let each item attribute share no commonality whatsoever with any other item attribute, checking for RCI becomes nearly impossible. Hence, we restrict our attention to the practically relevant setting where the item attributes of each attribute class share some commonality, e.g., the joint distribution of all item attributes can be modeled as a directed acyclic graph G of item attributes (Friedman et al., 1999) (n.b. pa^G and an^G denote values of parental and ancestral nodes in G, respectively):

P(v) = ∏_{X∈A} ∏_{i∈σ(A⁻¹(X))} P(i.x | pa^G(i.X))

where v represents the values of all item attributes in σ. For every two items of the same item class, e.g., i, j ∈ σ(I), P(i.X | pa^G(i.X)) = P(j.X | pa^G(j.X)) is often assumed if pa^G(i.X) and pa^G(j.X) are matched under a model-specific assumption, e.g., their averages are the same (Friedman et al., 1999) (i.e., the ratio of one's smoking neighbors to one's neighbors). This is called parameter-tying or templating (Koller, 1999) and is widely used in relational or temporal domains to capture time-invariant aspects of the domain. We relate parameter tying and item attributes to the homogeneity (in the sense of being identically distributed) and independence of random variables. Let G ≅_{i,j} G′ denote graph isomorphism between G and G′ subject to the constraint that vertex i ∈ G must be matched to j ∈ G′.

Proposition 1 (Identically Distributed Random Variables). Let G be a directed acyclic graph representing a conditional independence structure of item attributes where each item attribute, e.g., k.X, is labeled with its attribute class, e.g., X. Given that the parameters are tied, random variables i.X and j.X in G are identically distributed if G[an^G(i.X)] ≅_{i.X,j.X} G[an^G(j.X)].

Proposition 2 (Independent Random Variables). Two random variables i.X and j.X are independent if G[an^G(i.X)] ∩ G[an^G(j.X)] = ∅.

Proofs for both propositions directly follow from the Markov condition that a random variable is independent of its non-descendants given its parents in a DAG G. For an undirected graph G of item attributes (e.g., a Markov random field) labeled as above, CC^G_{i.X} ≅_{i.X,j.X} CC^G_{j.X} and CC^G_{i.X} ≠ CC^G_{j.X} would be a sufficient condition for i.X, j.X ~ P i.i.d. for some distribution P under the parameter-tying assumption where graph-isomorphic maximal cliques share the same parameters.

However, we have no access to the underlying CI structure G of item attributes. Hence, we deduce an i.i.d. condition through items on an observed skeleton σ:

Assumption 3. Let S = ⟨E, R, A⟩ be a relational schema and σ be a relational skeleton of S. Let i and j be items in σ and X ∈ A(I) of I ∈ I. Then, random variables i.X and j.X are independent and identically distributed if

(CC^σ_i ≅_{i,j} CC^σ_j) ∧ (CC^σ_i ≠ CC^σ_j).

Note that the condition is sufficient but not necessary for the random variables corresponding to the item attributes to be i.i.d. It is based on our understanding of how the parameter-tying assumption is realized in a given relational structure and determines the qualitative aspects (i.e., homogeneity and independence) of the random variables corresponding to the item attributes.

3.2 HANDLING NON-I.I.D. VARIABLES

It is possible that a relational structure induces dependent and non-identically distributed (i.e., heterogeneous) item attributes even when parameter tying is assumed. Hence, we cannot simply apply a traditional CI test for (U ⊥⊥ V | W)_σ on the flattened version of relational data {(u_i, v_i, w_i)}_{i∈σ(I)} where σ(I) is the common domain of U, V, and W.

Our solution to this problem is to perform CI tests by decomposing, with respect to a given RCI query, the set of items σ(I) into subsets of items such that each subset yields a set of i.i.d. observations under the above assumption. Consider a function id : σ(I) ↦ ℤ such that id(i) = id(j) only if (u_i, v_i, w_i) =^d (u_j, v_j, w_j) and (u_i, v_i, w_i) ⊥⊥ (u_j, v_j, w_j) (i.e., CC^σ_i ≅_{i,j} CC^σ_j and CC^σ_i ≠ CC^σ_j, given that U, V, and W are isomorphism-invariant; see the Appendix for the definition). Then, a traditional CI test, treating U, V, and W as random variables,

U ⊥⊥ V | W, id

will remove the bias introduced by the relational structure provided we have large enough samples per condition (i.e., a large number of CCs per isomorphism class). Such a naive solution, however, has severe limitations in practice: i) it is possible, in the worst case, that all items in relational data are connected; and ii) each connected component might be non-isomorphic to the others.
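As a rough illustration of the id(·) decomposition, the sketch below buckets items by the isomorphism class of their connected components, using networkx's Weisfeiler-Lehman graph hash as a practical proxy for graph isomorphism (isomorphic components always share a hash, while rare collisions may over-merge buckets). All function names are hypothetical, and this is only one way to realize Assumption 3.

```python
import networkx as nx
from collections import defaultdict

def assign_ids(sigma, items):
    """Group items so that, within a bucket, components are (hash-)isomorphic
    with the anchor items matched, and no two items share a component."""
    comp_of = {v: frozenset(c) for c in nx.connected_components(sigma) for v in c}
    buckets = defaultdict(list)
    for i in items:
        cc = sigma.subgraph(comp_of[i]).copy()
        # label nodes by item class, singling out the anchor i (cf. '*' label)
        for v in cc:
            cc.nodes[v]['lbl'] = '*' if v == i else cc.nodes[v]['item_class']
        buckets[nx.weisfeiler_lehman_graph_hash(cc, node_attr='lbl')].append(i)
    ids, used = {}, defaultdict(set)
    for bucket_id, members in enumerate(buckets.values()):
        for i in members:
            if comp_of[i] not in used[bucket_id]:  # enforce CC_i != CC_j
                used[bucket_id].add(comp_of[i])
                ids[i] = bucket_id
    return ids
```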
4 A KERNEL RELATIONAL CI TEST

To address the limitations noted above, we relax the requirements that the connected components be isomorphic and that items be partitioned into non-overlapping connected components. Recent progress in kernel-based nonparametric tests (e.g., two-sample tests or CI tests) allows us to utilize the notion of closeness between samples to test for homogeneity or conditional independence. We proceed to show how kernel-based conditional independence tests, originally introduced for testing independence of i.i.d. random variables from data, can be adapted to the relational setting by defining a novel kernel function for relational variables.

4.1 KERNEL FOR RELATIONAL VARIABLES

We provide kernels for relational variables that reflect our understanding of relational structure as in Section 3.1. Consider a relational variable U associated with the attribute class X. A kernel function for U, k_U, measures similarity between two instantiations of U where each instantiation, e.g., u_i ∈ 𝒳^{|U_i|}, consists of a set of item attributes. We illustrate our approach using the R-convolution kernel (Haussler, 1999), which computes the kernel over two sets as the sum of kernel values for every pair of elements from the two sets. Thus, we define k_U as

k_U(u_i, u_j; k^X_IA) = Σ_{a.X∈U_i} Σ_{b.X∈U_j} k^X_IA(a.X, b.X)    (1)

where the base kernel k^X_IA(a.X, b.X) measures the similarity between two item attributes. Based on the analysis in Section 3.1, U_i and U_j do not necessarily yield identically-distributed item attributes. Hence, we design k^X_IA(a.X, b.X) by taking both homogeneity and attribute values into consideration: k^X_IA is defined as a product kernel of the kernel for homogeneity k_σ and a kernel for attribute values k_x:

k^X_IA(a.X, b.X) = k_σ(a, b) k_x(a.x, b.x)

where the kernel for attribute values is typically defined using a standard kernel for the data type, for example, a Gaussian RBF kernel if 𝒳 ⊆ ℝ^d. We elaborate the kernel for homogeneity k_σ below.
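A minimal sketch of Eq. (1) under the product base kernel, assuming real-valued attributes and an RBF value kernel; the bandwidth gamma and the representation of instantiations as (item, value) pairs are our illustrative assumptions.

```python
import numpy as np

def k_U(u_i, u_j, k_sigma, gamma=5.0):
    """R-convolution kernel of Eq. (1): u_i, u_j are lists of (item, value)
    pairs, one per item attribute in the instantiation."""
    total = 0.0
    for a, ax in u_i:
        for b, bx in u_j:
            k_x = np.exp(-gamma * (ax - bx) ** 2)  # kernel on attribute values
            total += k_sigma(a, b) * k_x           # product base kernel k_IA
    return total
```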

Kernel for Homogeneity (Among Item Attributes)
We postulate that two random variables a.X and b.X will be similarly distributed if the corresponding items appear in similar contexts in σ, i.e., have similarly interconnected neighbors in σ. Therefore, the degree to which two item attributes a.X and b.X are identically distributed can be approximated by the similarity of the context of a and the context of b in σ.

For practicality, we use the h-hop neighbors of an item in σ to induce a context of the item, since a connected component, e.g., CC^σ_a, can be as large as the given relational skeleton σ. Thus, we design the kernel for homogeneity k_σ(a, b) as a graph kernel between two labeled graphs σ[ne^σ_h(a)] and σ[ne^σ_h(b)] where ne^σ_h(a) is the set of items in σ that are reachable in no more than h hops from item a in σ. Each item in a context, e.g., σ[ne^σ_h(a)], is labeled with its item class except the item a itself, which is assigned a special label allowing a graph kernel between σ[ne^σ_h(a)] and σ[ne^σ_h(b)] to match a and b. This reflects '≅_{i,j}', graph isomorphism with an additional constraint (Assumption 3).

We choose to exploit an existing graph kernel for labeled graphs. For example, a shortest-path kernel (Borgwardt and Kriegel, 2005) is given by

k_SP(G, G′) = Σ_{c,d∈V} Σ_{c′,d′∈V′} k_n(c, c′) · k_n(d, d′) · k_l(d_G(c, d), d_{G′}(c′, d′))

with a choice of base kernels on nodes (i.e., items), k_n, and on shortest path lengths, k_l, where d_G(c, d) is the shortest path length between c and d in G. We use the Dirac kernel for both k_n and k_l; that is, k_n is 1 if two items have the same label and 0 otherwise, and k_l is 1 if two lengths are the same and 0 otherwise.
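The sketch below spells out one possible implementation of k_σ: extract h-hop contexts with networkx and apply the shortest-path kernel with Dirac kernels, matching the special label '*' assigned to the anchor items. It is a direct (and deliberately slow) transcription of the formula, not an optimized implementation.

```python
import networkx as nx

def context(sigma, a, h=1):
    """h-hop context sigma[ne^h_sigma(a)] with item-class labels; the anchor a
    gets the special label '*' so the kernel can match anchors to anchors."""
    ctx = nx.ego_graph(sigma, a, radius=h)
    lbl = {v: sigma.nodes[v]['item_class'] for v in ctx}
    lbl[a] = '*'
    return ctx, lbl

def k_sigma(sigma, a, b, h=1):
    Ga, la = context(sigma, a, h)
    Gb, lb = context(sigma, b, h)
    da = dict(nx.all_pairs_shortest_path_length(Ga))
    db = dict(nx.all_pairs_shortest_path_length(Gb))
    total = 0
    for c in Ga:
        for d in da[c]:               # pairs reachable within the context
            for cp in Gb:
                for dp in db[cp]:
                    total += (la[c] == lb[cp]) * (la[d] == lb[dp]) \
                             * (da[c][d] == db[cp][dp])   # Dirac kernels
    return total
```

In practice, one would normalize this kernel by √(k_σ(a, a) k_σ(b, b)), as described in Section 5.1.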
Kernel for Homogeneity (Among Observations)
The use of contexts does not supplant the role of the indicator id. Hence, we introduce a new variable G to play the role of id without dealing with dependent observations. With G, the question of RCI, (U ⊥⊥ V | W)_σ, becomes that of traditional CI, U ⊥⊥ V | W, G, and, similarly, an unconditional query (U ⊥⊥ V)_σ becomes U ⊥⊥ V | G. We have already seen the kernel for relational variables, which considers both contexts and values. Taking the value part out of the definition of the kernel, we can obtain a kernel for homogeneity among observations. Since an observation consists of three (two if unconditional) relational variables, we use a product kernel. We define

k_G(i, j) = k_U(u_i, u_j; k_σ) k_V(v_i, v_j; k_σ) k_W(w_i, w_j; k_σ)    (2)

Note that while we have used the R-convolution kernel for relational variables and the shortest-path kernel as our graph kernel to illustrate our idea, the approach can accommodate other kernels, e.g., the optimal assignment kernel (Kriege et al., 2016) for relational variables and a Weisfeiler-Lehman kernel (Shervashidze et al., 2011) for graphs. In practice, the choice of kernel can be guided by knowledge of the domain.
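Putting Eq. (2) together, here is a sketch of the gram matrix of k_G over a flattened sample. We assume U, V, and W are functions returning the item sets of each instantiation (attribute values are ignored by construction, i.e., k_x is set to 1), and k_sigma is the homogeneity kernel sketched above; all names are hypothetical.

```python
import numpy as np

def gram_kG(items, sigma, U, V, W, k_sigma, h=1):
    """Gram matrix of Eq. (2) over the flattened sample indexed by `items`."""
    def k_rel(S_i, S_j):  # value-free R-convolution kernel over item sets
        return sum(k_sigma(sigma, a, b, h) for a in S_i for b in S_j)
    n = len(items)
    K = np.ones((n, n))
    for rel_var in (U, V, W):         # product kernel over U, V, and W
        for p in range(n):
            for q in range(n):
                K[p, q] *= k_rel(rel_var(items[p]), rel_var(items[q]))
    return K
```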

4.2 TREATING DEPENDENT OBSERVATIONS

Now we briefly discuss how we can handle dependent observations. In relational data, the dependencies among observations can arise for different reasons. In the previous section, we showed that two item attributes become dependent if they share the same ancestors (Proposition 2). However, in some settings, we can ignore some types of dependence among observations. For instance, consider a hidden Markov model with hidden variables X and observed variables Y where X_{t−1} ⊥⊥ Y_t | X_t and P(Y_t | X_t) and P(X_t | X_{t−1}) are time-invariant. Simply running a traditional CI test on a sample {(x_{t−1}, y_t, x_t)}_{t=2}^n would likely result in the null hypothesis not being rejected in spite of the correlations among observations, e.g., (X_{t−1}, Y_t, X_t) and (X_{t−2}, Y_{t−1}, X_{t−1}). Variables like X_t and X_{t−1} are naturally represented as (temporally) related variables, the dependencies among which can be broken by conditioning on an appropriate set of variables. In this regard, we treat 'dependent observations' as something to be resolved explicitly through conditionals instead of implicitly removed as in, e.g., a non-i.i.d. CI test for autocorrelated data (Flaxman et al., 2016).

4.3 VALIDITY OF FLATTENING APPROACH

We provide a sufficient condition under which a flattened sample of relational data conditioned on G correctly transforms an RCI query (U ⊥⊥ V | W)_σ into a traditional CI query U ⊥⊥ V | W, G. We consider the alternative hypothesis (U ⊥̸⊥ V | W)_σ to satisfy ∀_{i∈σ(I)} U_i ⊥̸⊥ V_i | W_i instead of just ∃_{i∈σ(I)} U_i ⊥̸⊥ V_i | W_i.

Condition 4. U_i, V_i ⊥⊥ U_j, V_j | W_i and (w_i, g_i) = (w_j, g_j) → (U_i, V_i) =^d (U_j, V_j) for every i ≠ j ∈ σ(I).

This condition simply makes the set of (U, V) samples into a set of i.i.d. samples where either U ⊥⊥ V or U ⊥̸⊥ V holds for each condition. In addition to the first condition, we provide a relaxed sufficient condition only for the null hypothesis.

Condition 5. V_i ⊥⊥ V_j | W_i and (w_i, g_i) = (w_j, g_j) → V_i =^d V_j for every i ≠ j ∈ σ(I).

This condition only makes V i.i.d. for each condition. However, it is sufficient to observe U ⊥⊥ V if ∀_{i∈σ(I)} U_i ⊥⊥ V_i | W_i. Otherwise, U ⊥̸⊥ V | W, G will hold in most cases unless the aggregation of (U_i, V_i) per condition makes such dependence vanish.

4.4 ALGORITHM

By supplying customized kernels for U, V, W, and G (Equations 1 and 2), a kernel CI test in an i.i.d. setting will decide whether to accept or reject the null hypothesis given a flattened sample Ω. We illustrate the pseudocode of the kernel relational conditional independence test, KRCIT, in Algorithm 1. If the given query is unconditional, so that W is undefined, then K_w = 1.

Algorithm 1 KRCIT
Input: σ: relational data; U, V, W: relational variables of item class I; k_U, k_V, k_W: kernels for U, V, W; k_σ: a kernel for subgraphs of items; CI: the base kernel-based CI test
1: Ω ← {(u_i, v_i, w_i)}_{i∈σ(I)}, which is (u, v, w)
2: k_G(·, ·) ← k_U(·, ·; k_σ) k_V(·, ·; k_σ) k_W(·, ·; k_σ)
3: K_u, K_v, K_w ← kernel matrices for u, v, and w
4: K_g ← kernel matrix for (u, v, w) with k_G
5: K_wg ← K_w ∘ K_g
6: return CI(K_u, K_v, K_wg)

We considered the following two kernel-based conditional independence tests as a base CI test for KRCIT: the Kernel CI Test (KCIT, Zhang et al., 2011), which uses the norm of the conditional cross-covariance operator in an RKHS (reproducing kernel Hilbert space); and the Self-Discrepancy CI Test (SDCIT, Lee and Honavar, 2017), which uses the RKHS distance between the given sample representing P_{xyz} and the sample modified to mimic P_{x|z} P_{y|z} P_z. The time complexity of KRCIT depends not only on the size of the flattened sample but also on the cardinality of the relational variables, the size of the subgraphs (i.e., hops), etc. If the cardinality of the relational variables and the number of hops used to specify contexts are fixed to small constants, computing the kernel matrices requires O(n²) time where n is the size of the flattened sample. Then, the time complexity of KRCIT with KCIT as a base CI test is O(n³) since that of KCIT is O(n³).
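A compressed sketch of Algorithm 1 follows, assuming the gram matrices have been precomputed and that ci_test is a base kernel CI test (e.g., an implementation of KCIT or SDCIT) that accepts gram matrices and returns a p-value; this is a sketch, not the released implementation.

```python
import numpy as np

def krcit(Ku, Kv, Kw, Kg, ci_test):
    """Ku, Kv, Kw: gram matrices of the flattened values (u, v, w);
    Kg: gram matrix of k_G over the same items (Eq. 2)."""
    Kwg = Kw * Kg if Kw is not None else Kg  # elementwise; unconditional: K_w = 1
    return ci_test(Ku, Kv, Kwg)              # p-value for U ⊥⊥ V | W, G
```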
5 EMPIRICAL EVALUATION

We report results of experiments that examine the performance of KRCIT in testing RCI using synthetic relational data for which we know the ground truth RCI. We used the RCM (Maier et al., 2013; Lee and Honavar, 2016), a generative model for relational data where a set of assumed causal relationships among relational variables specifies how the values of item attributes are generated given a relational skeleton.

5.1 METHODS

We compare the performance of KRCIT, traditional CI tests that do not account for the relational structure of the data, and an alternative RCI test that makes use of context through residualization (Flaxman et al., 2016), where regression is used to remove the dependence of a variable on a set of conditionals. For example, assume that one wants to test X ⊥⊥ Y where X and Y are two time series. By regressing each of X and Y on time T, one can obtain ε̂_{x|t} = X − Ê[X|T] and ε̂_{y|t} = Y − Ê[Y|T]. Then, the test becomes ε̂_{x|t} ⊥⊥ ε̂_{y|t} under a set of assumptions. In the case of RCI, one can residualize the values in the given relational skeleton to remove dependence on the 'context' of each attribute value. Formally, let X, Z ∈ ℝ^m be two random variables where we seek ε_{x|z} such that X = f(Z) + ε_{x|z}. We can train a Gaussian process regression, i.e., f ~ GP(0, k), by maximizing the total marginal likelihood. Then, ε̂_{x|z} = X − X̂ = (I + σ⁻²K*_z)⁻¹ X. Both K*_z and σ² are learned through Gaussian process regression employing, e.g., a Gaussian RBF kernel and a white noise kernel.
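A sketch of this residualization step using scikit-learn's Gaussian process regression — an assumption of convenience; any GP implementation with an RBF-plus-white-noise kernel and marginal-likelihood fitting would do.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def residualize(x, z):
    """x: (n,) response values; z: (n, m) conditioning features (e.g., contexts).
    Returns eps_hat_{x|z} = x - E[x | z] for use in a standard test like HSIC."""
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    gp.fit(z, x)                 # hyperparameters by maximizing marginal likelihood
    return x - gp.predict(z)     # residuals
```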

Assume that we only have access to kernel matrices K_x and K_z. Following Zhang et al. (2011), we can use the empirical kernel map for x, ψ_x = VΛ^{1/2}, where V and Λ are obtained through the eigendecomposition of the kernel matrix K_x = VΛV^⊤. Similarly, we can obtain ψ_z. Then, a Gaussian process regression with a linear kernel and a white noise kernel can be used to learn K*_z. In this case, we focus on the kernel matrix for residuals given by K_{x|z} = RK_xR where R = (I + σ⁻²K*_z)⁻¹. Following Flaxman et al. (2016), we use the expectation of K*_{x|z}, which is given by K_zR + RK_xR.

We list the methods compared in our experiments.

(Naive): We use the Hilbert-Schmidt Independence Criterion (HSIC, Gretton et al., 2005), which uses the eigenspectrum of covariance operators in an RKHS, for unconditional cases. Otherwise, either KCIT or SDCIT is used.

(Residualized): We residualize the values of a given relational skeleton based on contexts (i.e., replace values by their residuals). Then, the aforementioned naive tests are used. We append the prefix 'R-' to denote 'residualized'.

(Residual Kernel): Residuals are computed in RKHSs and the kernel for residuals is obtained as described above. Then, naive tests are used where computing the kernel for two values (e.g., an RBF kernel) is replaced by looking up the residual kernel matrix. We append the prefix 'RK-'.

(KRCIT): We use the postfix '-K' or '-SD' to denote KRCIT with KCIT or with SDCIT, respectively.

We implemented KRCIT and the other kernel CI tests in Python.¹ Throughout the experiments, we considered real-valued attributes, and an RBF kernel, e.g., k_x(x′, x″) = exp(−γ_x‖x′ − x″‖²), is used for each attribute, where each γ is chosen as (2σ²)⁻¹ with σ² = 0.1, the variance of the Gaussian noise we used in the data generating processes. Additionally, we normalized both the R-convolution kernel and the shortest-path kernel, i.e., k′(a, b) = k(a, b)/√(k(a, a) k(b, b)). We report how correctly the null hypothesis is rejected by a given test (power), measured by the Area Under the Power Curve (AUPC), which is the area under the cumulative density function of p-values, and how incorrectly the null hypothesis is rejected, measured by type-I error rates given α = 0.05.

¹Code is available online at https://github.com/sanghack81/KRCIT and https://github.com/sanghack81/SDCIT.
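For concreteness, the AUPC defined above reduces to a one-liner: the area under the empirical CDF of p-values on [0, 1] equals the mean of 1 − p. A sketch:

```python
import numpy as np

def aupc(p_values):
    """Area under the empirical CDF of p-values: integral_0^1 F(t) dt = mean(1 - p).
    Approximately 0.5 under a well-calibrated null; closer to 1 for a powerful
    test under the alternative."""
    p = np.asarray(p_values, dtype=float)
    return float(np.mean(1.0 - p))
```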

5.2 A SIMPLE EXAMPLE FOR VISUALIZATION

We start with a simple example that explains how and why KRCIT is better for relational CI testing than standard CI tests that assume i.i.d. data. For the sake of readability, we omit unnecessary details (detailed descriptions are provided in the Appendix). We considered a relational schema where attribute classes X and Y associate with entity classes 'circle' and 'square', respectively. There are other item classes and attribute classes in the schema.

Figure 2: Four connected components composing a relational skeleton.

A relational skeleton σ consists of CCs of four different structures, as shown in Fig. 2. We designed a generative model such that X is a function of the value in an adjacent 'rhombus' item, and Y is a function of the X in an adjacent circle and the value in an adjacent 'triangle' item. That is, we make sure that i.X and j.X are not identically distributed if i is adjacent to a rhombus item while j is not. The same idea also applies to the Ys.

We controlled the randomness of relationships between circles and squares: a non-random relationship means that the resulting relational skeleton consists only of circles adjacent to a rhombus connected to squares adjacent to a triangle (1st and 4th components in Fig. 2), while a randomized relationship yields a relational skeleton where all four components are balanced. We also controlled heterogeneity: the extent to which the distributions of (X, Y) of different components diverge from each other.

In the left column of Fig. 3, we visualize four relational data sets based on the combinations of underlying hypothesis and heterogeneity, where color codes correspond to the types of structure associated with the value (x, y) in Fig. 2. Utilizing permuted samples, we visualize how KRCIT (with SDCIT) and SDCIT, which is a permutation-based test, produce data consistent with the null hypothesis. KRCIT permutes Ys conditioning on G, which corresponds to 'color', while SDCIT simply shuffles Ys. The center and right columns correspond to the permuted samples under KRCIT and a naive test, respectively. When contexts more strongly correlate with values, we can more clearly observe the difference between the permuted samples by KRCIT and by SDCIT.

Figure 3: Comparisons of the given data (left) and two samples under the null hypothesis produced by KRCIT with SDCIT (center) and by SDCIT (right), respectively, with randomized relationships. (Rows, top to bottom: Null and Homogeneous, Alternative and Homogeneous, Null and Heterogeneous, Alternative and Heterogeneous.)

For each row, if the center plot is significantly different from its corresponding left plot, KRCIT would reject the null hypothesis. For example, KRCIT correctly rejects the null hypothesis for the samples from the alternative hypothesis (2nd and 4th rows). Interestingly, SDCIT also correctly rejects samples from the alternative hypothesis. In Fig. 4, we plot null samples when relational skeletons exhibit biased relationships. We can similarly observe a difference between the two tests. Fig. 5 illustrates the type-I errors (given α = 0.05) of a naive test (HSIC) and KRCIT-SD based on relational skeletons generated with various degrees of heterogeneity and non-randomness of relationships. KRCIT-SD is robust to heterogeneity and non-random relationships while HSIC is not. Note that residualization approaches utilizing contexts perform similar to KRCIT (see Appendix).

Figure 4: Comparisons with fully biased relationships. (Panels: Null and Heterogeneous (biased); Alternative and Heterogeneous (biased).)

Figure 5: Type-I errors (with 20 trials) of HSIC and KRCIT-SD, varying both non-randomness and heterogeneity.

0.8 HSIC 0.8 HSIC R-HSIC R-HSIC AUPC AUPC RK-HSIC RK-HSIC 0.6 0.6 KRCIT-K KRCIT-K KRCIT-SD KRCIT-SD 0.4 0.4 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 linear coefficient linear coefficient linear coefficient linear coefficient Biased + Homogeneous Randomized + Heterogeneous 1.0 Figure 6: AUPCs with varying dependency where re- lational skeletons are generated with homogeneous and 0.8 HSIC R-HSIC AUPC randomized relationships (left) and with heterogeneous RK-HSIC 0.6 KRCIT-K and biased relationships (right). KRCIT-SD 0.4 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 linear coefficient linear coefficient to KRCIT (see Appendix). Finally, we illustrate the changes of power of different Figure 7: AUPCs with hop=1 (top) and hop=2 (bottom) tests as the strength of dependency between X and Y under two different settings for relational skeletons is increased (Fig. 6). Tests that use contexts consistently estimates dependency without regard to the underlying robust to the choice of contexts and achieves high AUPC conditions while HSIC over-rejects samples from weak dependence in certain conditions where rejection comes as the dependence increases. partially from other than linear dependence (4th row in Fig. 3). In summary, whenever contexts provide suffi- 5.4 CONDITIONAL TESTS cient information to infer (non-)identically distributed observations, KRCIT is able to eliminate suspicious de- We investigated whether KRCIT would be able to dis- pendencies due to such heterogeneity. cover the causal structure of synthetic relational data generated from an RCM. Thus, we focus on testing 1) 5.3 MORE COMPLICATED STRUCTURES relational version of Markov condition, which is essen- tial to learn the undirected causal structure, and 2) con- We conduct a similar experiment but with more com- ditional dependence, which is critical to infer the orien- plicated and larger structures where the cardinality of tation of undirected causal relationships. We constructed relational variables is not one and observations are de- a set of relational skeletons of 3 entity classes and 3 re- pendent. Such dependence among observations, e.g., two lationship classes between every pair of entity classes. circles, is due to their sharing a common cause, e.g., We controlled for the maximum number of neighbors of a rhombus. In this experiment, we additionally inves- the same item class of an item (e.g., a circle item can tigate how different sizes of subgraphs as contexts af- have at most three square neighbors) and the number of fect the performance of RCI tests. Contexts based on entities per entity class. We generate relational skeletons 1-hop subgraphs contain sufficient information while 2- to exhibit correlation among different relationships (see hop subgraphs will include information about other vari- Appendix for details). Then, values are generated based able, e.g., a 2-hop subgraph of a circle includes triangle on two hypotheses. For the null hypothesis, we gener- and other circles connected to common rhombuses. ate values, roughly, in a causal order X → Z → Y or X ← Z → Y . For the alternative hypothesis, a relational For tests for the null hypothesis, we obtained similar data is generated based on a causal order X → Z ← Y . results as in the previous section. However, the kernel- We test (U ⊥⊥ V | W ) where U, V , and W associates based residualization approach (RK-HSIC) shows higher σ with X, Y , and Z, respectively. 
type-I errors than expected when larger contexts are em- ployed (see Appendix). KRCIT performed as desired even In Fig. 8, we plot AUPCs and type-I error rates under with larger contexts. For tests for alternative hypothe- different sizes of flattened sample, maximum cardinali- sis, we report AUPC in Fig. 7. With properly-sized con- ties of relationships, and the sizes of contexts in terms texts (hop=1), both residualization-based methods per- of hops. The left plots demonstrate that the power of all form well. However, they are sensitive to the choice of tests increases as larger sample is used (with max 3 re- contexts – the power of both R-HSIC and RK-HSIC drops. lationships and hop 1). However, KRCIT with KCIT as Both these KRCIT methods are relatively weaker than any a base CI test suffers high type-I error rates. The center other tests when dependence between relational variables plots depict the negative effect of more complex structure is not strong enough. However, KRCIT with SDCIT seems on both power and type-I errors. This implies the general 1.0 1.0 1.0

Figure 8: AUPCs and type-I error rates of various tests (KCIT, SDCIT, R-KCIT, R-SDCIT, RK-KCIT, RK-SDCIT, KRCIT-K, KRCIT-SD) with relational skeletons generated with different settings — number of entities per class (left), maximum number of relationships (center), and number of hops for contexts (right).

Unlike the previous experiments with unconditional RCI tests, it is surprising to observe that the power of the naive tests decreases as the underlying structure becomes more complex. Finally, contexts larger than necessary make some tests weaker and less calibrated (right plots). However, we observed that contexts with hop set to 4 become similar to one another. That is, the kernel matrix obtained by applying the shortest-path kernel to such contexts contains similar values and cannot clearly inform heterogeneity among random variables. Overall, the naive tests that do not account for relational structure performed well for the null hypothesis, yielding type-I error rates around 0.05, since, in our generative model, Z 'values' provide all the necessary information to infer Y. However, they showed weak power compared to the others in general. Both residualization approaches perform very well when proper contexts are employed. KRCIT performs very differently depending on the choice of base CI test. KCIT, which also uses residualization as an internal mechanism to handle conditionals, seems to have problems dealing with G, a conditioning variable playing the role of id. KRCIT with SDCIT is, in general, a good choice since it provides reasonable power with precise control of type-I error rates.

6 DISCUSSION

Maier et al. (2013) considered relational d-separation, a problem closely related to RCI. However, they relied on a traditional CI test by simply flattening and aggregating relational data (i.e., averaging) without incorporating structural information. As we have shown, such an approach biases the results of independence tests (Section 5.2). An independence test on two non-i.i.d. data sets is addressed by Zhang et al. (2009), who considered generalizing HSIC to the setting of structured and interdependent observations. However, their work focused on explicitly known CI structures which can be represented as undirected graphical models utilizing the factorization provided by the exponential family.

In this paper, we have defined CI in the relational setting and provided an effective approach to testing for RCI. Our definition makes use of a definition of a relational variable that subsumes the notions of slot chains in probabilistic relational models (Friedman et al., 1999), relational paths in relational causal models with bridge burning semantics (Maier et al., 2013) and path semantics (Lee and Honavar, 2016), and first-order expressions in DAPER (Heckerman et al., 2007).

7 SUMMARY AND FUTURE WORK

In this work, we defined relational conditional independence (RCI), the generalization of CI to a relational setting, in the language of the Entity-Relationship model. We proposed the kernel RCI test (KRCIT), the first practical and general design of an RCI test, which reduces the bias caused by an underlying relational structure. We empirically demonstrated the benefits of KRCIT compared to naive CI tests on simulated relational data.

Some directions for future work include: improving KRCIT by employing appropriately designed graph kernels and optimizing the kernel parameters; a more comprehensive experimental study of KRCIT using real-world relational data; investigating ways to incorporate network analysis before performing an RCI test to guide the design of kernels; and applying RCI to discover causal relationships in relational domains (Maier et al., 2013; Lee and Honavar, 2016).

Acknowledgements

The authors are grateful to UAI 2017 anonymous reviewers for their thorough reviews. This research was supported by the Edward Frymoyer Endowed Professorship, the Center for Big Data Analytics and Discovery Informatics at the Pennsylvania State University, and the Sudha Murty Distinguished Visiting Chair in Neurocomputing and Data Science at the Indian Institute of Science.

References

Borgwardt, K. M. and Kriegel, H.-P. (2005). Shortest-path kernels on graphs. In Proceedings of the Fifth IEEE International Conference on Data Mining, pages 74–81.

Chen, P. P.-S. (1976). The entity-relationship model – toward a unified view of data. ACM Transactions on Database Systems, 1(1):9–36.

Chwialkowski, K., Sejdinovic, D., and Gretton, A. (2014). A Wild Bootstrap for Degenerate Kernel Tests. In Advances in Neural Information Processing Systems 27, pages 3608–3616.

Dawid, A. P. (1979). Conditional Independence in Statistical Theory. Journal of the Royal Statistical Society. Series B (Methodological), 41(1):1–31.

Doran, G., Muandet, K., Zhang, K., and Schölkopf, B. (2014). A Permutation-Based Kernel Conditional Independence Test. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pages 132–141.

Flaxman, S. R., Neill, D. B., and Smola, A. J. (2016). Gaussian Processes for Independence Tests with Non-iid Data in Causal Inference. ACM Transactions on Intelligent Systems and Technology, 7(2):1–23.

Friedman, N., Getoor, L., Koller, D., and Pfeffer, A. (1999). Learning Probabilistic Relational Models. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pages 1300–1309.

Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. (2008). Kernel Measures of Conditional Dependence. In Advances in Neural Information Processing Systems 20, pages 489–496.

Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005). Measuring Statistical Dependence with Hilbert-Schmidt Norms. In Algorithmic Learning Theory. ALT 2005, pages 63–77.

Haussler, D. (1999). Convolution Kernels on Discrete Structures. Technical report, University of California, Santa Cruz.

Heckerman, D., Meek, C., and Koller, D. (2007). Probabilistic Entity-Relationship Models, PRMs, and Plate Models. In Getoor, L. and Taskar, B., editors, Introduction to Statistical Relational Learning, pages 201–238.

Koller, D. (1999). Probabilistic Relational Models. In Proceedings of the 9th International Workshop on Inductive Logic Programming, pages 3–13.

Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Kriege, N. M., Giscard, P.-L., and Wilson, R. C. (2016). On valid optimal assignment kernels and applications to graph classification. In Advances in Neural Information Processing Systems 29, pages 1623–1631.

Lee, S. and Honavar, V. (2016). A Characterization of Markov Equivalence Classes of Relational Causal Models under Path Semantics. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence, pages 387–396.

Lee, S. and Honavar, V. (2017). Self-Discrepancy Conditional Independence Test. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence. (to appear).

Maier, M., Marazopoulou, K., Arbour, D., and Jensen, D. (2013). A Sound and Complete Algorithm for Learning Causal Models from Relational Data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pages 371–380.

Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition.

Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K., and Borgwardt, K. M. (2011). Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12:2539–2561.

Spirtes, P., Glymour, C. N., and Scheines, R. (2000). Causation, Prediction, and Search. MIT Press, 2nd edition.

Zhang, K., Peters, J., Janzing, D., and Schölkopf, B. (2011). Kernel-based Conditional Independence Test and Application in Causal Discovery. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 804–813.

Zhang, X., Song, L., Gretton, A., and Smola, A. (2009). Kernel Measures of Independence for non-iid Data. In Advances in Neural Information Processing Systems 21, pages 1937–1944.

A ISOMORPHISM-INVARIANT RELATIONAL VARIABLE

We describe a desired property of a relational variable. This property ensures that the interpretation of a relational variable is consistent across graph-isomorphic connected components of any relational skeleton σ ∈ Σ_S. Let S = ⟨E, R, A⟩ be a relational schema and σ ∈ Σ_S be an arbitrary instantiation of the schema. Let CC^σ_a = ⟨V_a, E_a, L_a⟩ and CC^σ_b = ⟨V_b, E_b, L_b⟩ where each item is labeled with its item class. Let f_{a,b} be the set of mapping functions demonstrating CC^σ_a ≅_{a,b} CC^σ_b, that is, f_{a,b} = {f | ∀_{v∈V_a} L_a(v) = L_b(f(v)) ∧ ∀_{u,v∈V_a} (u, v) ∈ E_a ⇔ (f(u), f(v)) ∈ E_b ∧ f(a) = b}. Let U be a relational variable with domain σ(I) where I ∈ E ∪ R. If

∀_{f∈f_{a,b}} {f(i) | i.X ∈ U_a} = {j | j.X ∈ U_{f(a)}}

for any a, b ∈ σ(I) for any σ ∈ Σ_S, then U is said to be isomorphism-invariant.

B EXPERIMENTAL SETUP

We describe the experimental settings in the language of relational causal models (Maier et al., 2013). For simplicity, we represent a relational skeleton as an undirected graph of entities. Hence, (a, b) ∈ σ represents that the two entities a and b are, in fact, connected to a common relationship item in the relational skeleton σ.

B.1 SIMPLE EXPERIMENTS

Relational Schema
There are four entity classes A, B, C, and D, which associate with attribute classes X, Y, S, and T, respectively. There are three relationship classes: between A and B, A and C, and B and D. We used circle, square, rhombus, and triangle to refer to A, B, C, and D.

Relational Skeleton
We generate relational skeletons varying degrees of randomness from 0 to 1, where randomness 0 is referred to as 'biased'. Given randomness 0 ≤ p ≤ 1, we initially generate a fully biased relational skeleton σ, then randomize some of the edges between σ(A) and σ(B) to acquire a relational skeleton of the desired randomness p.

Let n be the number of entities for each of A and B (we set n = 200). There are n/2 entities for C and D, respectively. Hence, let σ(A) = {a_i}_{i=1}^n, σ(B) = {b_i}_{i=1}^n, σ(C) = {c_i}_{i=1}^{n/2}, and σ(D) = {d_i}_{i=1}^{n/2}. We connect C items to A items and D items to B items: {(a_i, c_i)}_{i=1}^{n/2} ⊂ σ and {(b_i, d_i)}_{i=1}^{n/2} ⊂ σ. Then, we "initially" connect a_i and b_i: {(a_i, b_i)}_{i=1}^n ⊂ σ. That is, by design, an A item having (or not having) a C neighbor is connected to a B item having (or not having) a D neighbor. Given the randomness p, we randomly pick np B items and shuffle their A neighbors.

Relational Causal Model
We use a linear Gaussian noise model with sum aggregators as follows:

∀_{c∈σ(C)} c.S = µ + ε_c
∀_{d∈σ(D)} d.T = µ + ε_d
∀_{a∈σ(A)} a.X = Σ_{c∈ne^σ(a)∩σ(C)} c.S + µ + ε_a
∀_{b∈σ(B)} b.Y = β · Σ_{a∈ne^σ(b)∩σ(A)} a.X + Σ_{d∈ne^σ(b)∩σ(D)} d.T + µ + ε_b

where every ε is an independent Gaussian noise with zero mean and variance 0.1². We control the correlation between connected X and Y by adjusting β, where β = 0 implies that X and Y are independently generated or, more precisely, [A].X and [A, R_AB, B].Y are independent. The pairs of X and Y values generated from this model can be understood as a mixture of four bivariate normal distributions, and µ controls the distance between the distributions. When µ = 0, all four distributions are centered at (0, 0). Although we described the four distributions having the same mean as 'homogeneous', they have different variances.

We test unconditional independence between [A].X and [A, R_AB, B].Y.
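A sketch of this generative process as we read it; the array-indexing conventions and the `partner` representation of the A–B relationship are our assumptions.

```python
import numpy as np

def generate_simple(n=200, p=0.0, beta=0.0, mu=0.3, seed=0):
    rng = np.random.default_rng(seed)
    eps = lambda size: rng.normal(0.0, 0.1, size)      # Gaussian noise, sd 0.1
    S = mu + eps(n // 2)                               # rhombus values c.S
    T = mu + eps(n // 2)                               # triangle values d.T
    X = mu + eps(n)
    X[: n // 2] += S                                   # a_1..a_{n/2} have a rhombus
    partner = np.arange(n)                             # biased skeleton: a_i -- b_i
    picked = rng.choice(n, size=int(round(n * p)), replace=False)
    partner[picked] = rng.permutation(partner[picked]) # shuffle A-neighbors of np B items
    Y = mu + eps(n)
    Y[: n // 2] += T                                   # b_1..b_{n/2} have a triangle
    Y += beta * X[partner]                             # b's Y depends on its A-neighbor's X
    return X[partner], Y          # flattened pairs ([A].X, [A, R_AB, B].Y)
```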
B.2 MORE COMPLICATED EXPERIMENTS

We only change how skeletons are generated. We use the same relational schema and relational causal model as shown above.

Relational Skeleton
Similarly, we generate n = 400 items for A and B and n/2 items for C and D. Then, each a_i ∈ σ(A) randomly chooses C neighbor(s) so that a_i has one C neighbor if 1 ≤ i ≤ n/3, two neighbors if n/3 < i ≤ 2n/3, and three neighbors if 2n/3 < i ≤ n. Similarly, each b_i ∈ σ(B) randomly chooses D neighbor(s). As in the previous setup, we initialize the relational skeleton with biased relationships between A and B, that is, {(a_i, b_i)}_{i=1}^n ⊂ σ. Then, we randomize the connections based on the randomness parameter. We further add n/2 random connections between σ(A) and σ(B).

This setup yields a more complicated structure than the previous setup since each of {a.X}_{a∈σ(A)} and {b.Y}_{b∈σ(B)} is made of dependent observations and A and B are in many-to-many relationships.

B.3 CONDITIONAL TESTS

Relational Schema
We have three entity classes A, B, and C, which associate with X, Y, and Z, respectively. There are binary relationship classes for each pair of entity classes, i.e., R_AB, R_AC, and R_BC. All cardinalities are 'many'; hence an entity can have many neighbors of another entity class.

Relational Skeleton
We control the maximum number of neighbors of the same kind. For example, an item of A can have at most k neighbors of B; in other words, ∀_{a_i∈σ(A)} |ne^σ(a_i) ∩ σ(B)| ≤ k. We similarly put restrictions between B and C and between A and C as well.

We construct relational skeletons where the relationships of all three classes (R_AB, R_BC, and R_AC) are correlated. To do so, we adopt the idea of latent space modeling. Given n, the number of entities per entity class, we generate n points in [0, 1]² ⊂ ℝ² for each entity class. Let φ(·) be the coordinate of an item. Let D^AB be a squared Euclidean distance matrix where D^AB_{i,j} = ‖φ(a_i) − φ(b_j)‖²₂. Then, a kernel matrix K^AB is given by K^AB_{i,j} = exp(−γ · D^AB_{i,j}) where we chose γ = 50. By normalization, we get a probability matrix P^AB = K^AB / (1^⊤ · K^AB · 1) to (approximately) model Pr((a_i, b_j) ∈ σ) ∝ P^AB_{i,j}. With this probability, we sample nk/2 edges to form a relational skeleton while satisfying the maximum number of neighbors k. For example, if we limit an item of A to have at most three neighbors of B, then there are, on average, 1.5 B neighbors per item of A. Edges between A and C and between B and C are similarly obtained.
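A sketch of this latent-space edge-sampling scheme as we read it; the rejection-sampling loop that enforces the degree cap k is our assumption about how the cap is satisfied.

```python
import numpy as np

def sample_edges(n=200, k=3, gamma=50.0, seed=0):
    rng = np.random.default_rng(seed)
    phi_a, phi_b = rng.random((n, 2)), rng.random((n, 2))  # latent positions in [0,1]^2
    D = ((phi_a[:, None, :] - phi_b[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * D)
    P = (K / K.sum()).ravel()            # P^{AB} = K^{AB} / (1' K^{AB} 1)
    edges, deg_a, deg_b = set(), np.zeros(n, int), np.zeros(n, int)
    while len(edges) < n * k // 2:       # draw nk/2 edges
        i, j = divmod(rng.choice(n * n, p=P), n)
        if (i, j) not in edges and deg_a[i] < k and deg_b[j] < k:
            edges.add((i, j))
            deg_a[i] += 1
            deg_b[j] += 1
    return edges
```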

Relational Causal Model
We consider three different models: two for conditional independence and one for conditional dependence. For testing the null hypothesis, we randomly choose one of the following two models:

∀_{a∈σ(A)} a.X = µ + ε_a
∀_{c∈σ(C)} c.Z = Σ_{a∈ne^σ(c)∩σ(A)} a.X + µ + ε_c
∀_{b∈σ(B)} b.Y = Σ_{c∈ne^σ(b)∩σ(C)} c.Z + µ + ε_b

or

∀_{c∈σ(C)} c.Z = µ + ε_c
∀_{a∈σ(A)} a.X = Σ_{c∈ne^σ(a)∩σ(C)} c.Z + µ + ε_a
∀_{b∈σ(B)} b.Y = Σ_{c∈ne^σ(b)∩σ(C)} c.Z + µ + ε_b.

For testing the alternative hypothesis, we use the following model where (roughly speaking) Z is a common effect of X and Y:

∀_{a∈σ(A)} a.X = µ + ε_a
∀_{b∈σ(B)} b.Y = µ + ε_b
∀_{c∈σ(C)} c.Z = Σ_{a∈ne^σ(c)∩σ(A)} a.X + Σ_{b∈ne^σ(c)∩σ(B)} b.Y + µ + ε_c.

In all experiments, we set µ = 0.3. We test

[B].Y ⊥⊥ [B, R_AB, A].X | [B, R_BC, C].Z

for the null hypothesis and

[C, R_AC, A].X ⊥⊥ [C, R_BC, B].Y | [C].Z

for the alternative hypothesis.

C TYPE I ERRORS FOR DIFFERENT METHODS

We illustrate type-I error plots for Sections 5.2 and 5.3.

heterogeneity heterogeneity R-HSIC R-HSIC R-HSIC non-randomness non-randomness non-randomness

heterogeneity heterogeneity heterogeneity RK-HSIC RK-HSIC RK-HSIC non-randomness non-randomness non-randomness

heterogeneity heterogeneity heterogeneity KRCIT-K KRCIT-K KRCIT-K non-randomness non-randomness non-randomness

heterogeneity heterogeneity heterogeneity KRCIT-SD KRCIT-SD KRCIT-SD non-randomness non-randomness non-randomness

heterogeneity heterogeneity heterogeneity Figure 10: Type-I errors of different methods with two Figure 9: Type-I errors of different methods for Section different contexts based on hop=1 (left) and hop=2 5.2. (right) for Section 5.3. HSIC does not use contexts.