A Kernel Conditional Independence Test for Relational Data
Total Page:16
File Type:pdf, Size:1020Kb
A Kernel Conditional Independence Test for Relational Data Sanghack Lee and Vasant Honavar Artificial Intelligence Research Laboratory College of Information Sciences and Technology The Pennsylvania State University, University Park, PA 16802 fsxl439, [email protected] Abstract methods including, in particular, non-parametric meth- ods (Fukumizu et al., 2008; Zhang et al., 2011; Doran et al., 2014; Lee and Honavar, 2017) have been devel- Conditional independence (CI) tests play a oped to test for CI in settings where the parametric form central role in statistical inference, machine of the underlying distribution is unknown but a measure learning, and causal discovery. Most existing of closeness between data samples can be defined, e.g., CI tests assume that the samples are indepen- using a kernel function. However, these methods implic- dently and identically distributed (i.i.d.). How- itly or explicitly assume that the data samples are inde- ever, this assumption often does not hold in the pendently and identically distributed (i.i.d.). case of relational data. We define Relational Conditional Independence (RCI), a generaliza- Many sources of real-world data, e.g., the WWW, cita- tion of CI to the relational setting. We show tion networks, social networks, biomolecular networks, how, under a set of structural assumptions, we exhibit a relational structure, wherein the data are nat- can test for RCI by reducing the task of test- urally represented as collections of interlinked entities. ing for RCI on non-i.i.d. data to the problem In the resulting relational data, e.g., a citation network, of testing for CI on several data sets each of the entities, e.g., authors, articles, and institutions, clearly which consists of i.i.d. samples. We develop do not constitute i.i.d. observations. Methods for learn- Kernel Relational CI test (KRCIT), a nonpara- ing causal models from relational data rely on oracles metric test as a practical approach to testing that can answer CI queries from such data (Maier et al., for RCI by relaxing the structural assumptions 2013; Lee and Honavar, 2016). Practical realizations of used in our analysis of RCI. We describe re- such algorithms will need to replace such oracles by CI sults of experiments with synthetic relational tests against relational data. However, in the relational data that show the benefits of KRCIT relative setting, with the exception of autocorrelated data, e.g., to traditional CI tests that don’t account for the time series (Chwialkowski et al., 2014), where ‘close- non-i.i.d. nature of relational data. ness’ in time, space, or network is well-defined (Flaxman et al., 2016), effective ways to define and test for CI have been lacking. Any attempt to generalize the notion of CI 1 INTRODUCTION to the relational setting needs to overcome several chal- lenges: What are relational counterparts of random vari- ables? How can we define their marginal distributions? Observational and experimental data represent system- atic interactions among a set of random variables of in- Against this background, inspired by the notion of rela- terest. Conditional independence (CI) tests constitute es- tional d-separation (Maier et al., 2013), which general- sential tools for understanding such interactions. Ran- izes a graphical criterion for CI to a specific model of dom variables X and Y are said to be conditionally in- relational data, we (i) Formalize Relational Conditional dependent given Z, denoted by X ?? Y j Z, if and Independence (RCI), the relational counterpart of CI. (ii) only if the joint distribution Pxyz can be factorized as Examine the dependence and heterogeneity of relational PxjzPyjzPz. The notion of CI plays a central role in variables in terms of the underlying relational structure. statistical inference (Dawid, 1979), probabilistic graph- (iii) Based on the preceding analyses, devise a Kernel ical models (Koller and Friedman, 2009), and causal dis- Relational CI Test (KRCIT) that, to the best of our knowl- covery (Pearl, 2000; Spirtes et al., 2000). A variety of edge, offers the first practical method for testing for RCI. (iv) Describe results of experiments with synthetic rela- k k tional data that show the benefits of KRCIT relative to tra- a e a ditional CI tests that don’t account for the non-i.i.d. na- j j i i ture of relational data. RCI and KRCIT offer new ways to understand dependencies in relational data across a broad d d range of practical applications. b c c 2 PRELIMINARIES Figure 1: (Left) a small relational skeleton σ as an undi- rected graph of entities of three classes, Blue, Magenta, We follow the notational conventions from statistics, and Gray; and (Right) an example of a relational vari- graph theory, and relational models. We use a capital let- able V where V (`) refers to the multiset of attribute X of a set of Gray items forming a triangle with the given ter X to denote a random variable; X to denote the range Blue item and a Magenta item. Hence, ( ) = f g; of X; and a lowercase letter x to denote the value of X. ` V i a:X Calligraphic letters are also used to represent mathemat- V (j) = fc:X; d:Xg; and V (k) = ;; ical objects, e.g., graphs. We define a labeled (directed or undirected) graph G = and “relational data” to mean a set of data that conform hV; E; Li where V denotes a set of vertices (or nodes) and to a given relational schema. We use letters such as i and E a set of edges. Each vertex is assigned a discrete label j to stand for integers or items (e.g., the letters i and j to by a labeling function L : V 7! Σ where Σ is a set of la- refer two items in σ (I)). bels. We disallow self-loops. Given an undirected graph 0 G, a connected component G is a vertex-induced sub- 3 CI TEST WITH RELATIONAL DATA graph of G where there exists a path between every pair of vertices in G0. We denote all connected components in We define the notion of relational variables followed G by CCG and a connected component containing v 2 V by the notion of Relational Conditional Independence by CCG. Two labeled graphs G and G0 are said to be iso- v (RCI). We provide both a theoretical characterization of morphic, denoted by G =∼ G0, if there exists a bijective 0 0 RCI as well as a practical approach to testing for RCI. function f : V 7! V such that 8v2VL(v) = L (f (v)) 0 and 8u;v2V (u; v) 2 E , (f (u) ; f (v)) 2 E . Consider attribute classes (attributes for short) X, Y , and Z of a relational schema. In the absence of any relational We use a simplified version of Entity-Relationship (ER) structure, the “data” corresponding to instantiations of model (Chen, 1976) to describe relational data (Fried- these random variables in a dataset can be naturally in- man et al., 1999; Heckerman et al., 2007; Maier et al., dexed so that (x ; y ; z ) denotes the ith instance drawn 2013). A relational schema S = hE; R; Ai describes the i i i from P . However, in the relational setting, there is no relational domain of interest with a set of entity classes xyz such natural index. Hence, we can use a set of items of E (e.g., person, student), relationship classes R (e.g., each item class to serve the role of an index. This index- friend-of, son-of ), and attribute classes A (e.g., gender, ing scheme generalizes the notion of the ‘ith instance’ in income). We refer to the union E [ R as the set of item an i.i.d. setting to the notion of ‘instantiated by item i’ classes I. In general, a relationship class R 2 R can be where i 2 σ (I) for some I 2 I. Note that different item n-ary where n ≥ 2 (e.g., contract is a set of ternary re- classes provide different ways to index relational data. lationships involving products, buyers, and sellers). We use A (I) to denote an item class I’s attribute classes, Definition (Relational Variable). Let S = hE; R; Ai be and A−1 (X) by to denote the item class of X. a relational schema and σ 2 ΣS be an arbitrary rela- tional skeleton. A relational variable V is a function A relational skeleton σ 2 ΣS is an instantiation of a from σ (I) for some item class I 2 I to a subset of given relational schema, which can be viewed as an undi- fj:X j j 2 σ (J)g for some attribute class X and item rected bipartite graph. We denote by σ (I) a set of items class J such that X 2 A (J): of an item class I 2 I. Given an item i 2 σ (I), we use i:X to denote the item i’s attribute class X. Note that i:X V : σ (I) 7! 2fj:Xjj2σ(J)g is a random variable which takes a value i:x 2 X. An edge (i; r) 2 σ represents the participation of an entity i where every i 2 σ (I) is connected to j 2 σ (J) such in a relationship r. For simplicity, we represent a skele- that j:X 2 V (i). ton, whose relationship classes are binary, as an undi- rected graph of entities. We use “relational structure” to As a simple but concrete example, a relational variable mean the the graphical structure of a relational skeleton, V , ‘smoking status of one’s neighbors’, is defined with I, J both being the ‘Person’ item class, X corresponding where the item attributes of each attribute class share to the attribute class ‘smoking’, where j 2 σ(J) where some commonality, e.g., the joint probability distribu- j:X 2 V (i) is i’s neighbor for every person i 2 σ(I).