IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 4, APRIL 2011

A Link Analysis Extension of Correspondence Analysis for Mining Relational Databases

Luh Yen, Marco Saerens, Member, IEEE, and François Fouss

Abstract: This work introduces a link analysis procedure for discovering relationships in a relational database or a graph, generalizing both simple and multiple correspondence analysis. It is based on a random walk model through the database defining a Markov chain having as many states as elements in the database. Suppose we are interested in analyzing the relationships between some elements (or records) contained in two different tables of the relational database. To this end, in a first step, a reduced, much smaller Markov chain containing only the elements of interest and preserving the main characteristics of the initial chain is extracted by stochastic complementation [41]. This reduced chain is then analyzed by jointly projecting the elements of interest in the diffusion map subspace [42] and visualizing the results. This two-step procedure reduces to simple correspondence analysis when only two tables are defined, and to multiple correspondence analysis when the database takes the form of a simple star schema. In addition, a kernel version of the diffusion map distance, generalizing the basic diffusion map distance to directed graphs, is introduced, and the links with spectral clustering are discussed. Several data sets are analyzed using the proposed methodology, showing the usefulness of the technique for extracting relationships in relational databases or graphs.

Index Terms: Graph mining, link analysis, kernel on a graph, diffusion map, correspondence analysis, dimensionality reduction, statistical relational learning.
L. Yen and M. Saerens are with the Information Systems Research Unit (ISYS/LSM) and Machine Learning Group (MLG), Université Catholique de Louvain (UCL), 1, place des Doyens, 1348 Louvain-La-Neuve, Belgium. E-mail: {luh.yen, marco.saerens}@ucLouvain.be.
F. Fouss is with the Management Department, LSM, Facultés Universitaires Catholiques de Mons (FUCaM), 151, Chaussée de Binche, 7000 Mons, Belgium. E-mail: [email protected].
Manuscript received 16 Jan. 2009; revised 14 Sept. 2009; accepted 29 Dec. 2009; published online 23 Aug. 2010. Recommended for acceptance by Z.-H. Zhou. IEEECS Log Number TKDE-2009-01-0023. Digital Object Identifier no. 10.1109/TKDE.2010.142.

1 INTRODUCTION

Traditional statistical, machine learning, pattern recognition, and data mining approaches (see, for example, [28]) usually assume a random sample of independent objects from a single relation. Many of these techniques extract knowledge from data (typically taken from relational databases) and almost always end up with the classical double-entry tabular format, containing features for a sample of the population. These features are then used to learn from the sample, provided that it is representative of the population as a whole. However, real-world data coming from many fields (such as the World Wide Web, marketing, social networks, or biology; see [16]) are often multirelational and interrelated. The work recently performed in statistical relational learning [22], which aims at working with such data sets, incorporates research topics such as link analysis [36], [63], web mining [1], [9], social network analysis [8], [66], and graph mining [11]. All these research fields intend to find and exploit links between objects in addition to features (as is also the case in the field of spatial statistics [13], [53]); these links can be of various types and involved in different kinds of relationships. The focus of the techniques has moved from the analysis of the features describing each instance belonging to the population of interest (attribute value analysis) to the analysis of the links existing between these instances (relational analysis), in addition to the features.

This paper proposes a link-analysis-based technique for discovering relationships between elements of a relational database or, more generally, a graph. More specifically, this work is based on a random walk through the database defining a Markov chain having as many states as elements in the database. Suppose, for instance, we are interested in analyzing the relationships between elements contained in two different tables of a relational database. To this end, a two-step procedure is developed. First, a much smaller, reduced Markov chain, containing only the elements of interest (typically the elements contained in the two tables) and preserving the main characteristics of the initial chain, is extracted by stochastic complementation [41]. An efficient algorithm for extracting the reduced Markov chain from the large, sparse Markov chain representing the database is proposed. Then, the reduced chain is analyzed by, for instance, projecting the states in the subspace spanned by the right eigenvectors of the transition matrix ([42], [43], [46], [47]; called the basic diffusion map in this paper), or by computing a kernel principal component analysis [54], [57] on a diffusion map kernel computed from the reduced graph, and visualizing the results. Indeed, a valid graph kernel based on the diffusion map distance, extending the basic diffusion map to directed graphs, is introduced.

The motivations for developing this two-step procedure are twofold. First, the computation would be cumbersome, if not impossible, when dealing with the complete database. Second, in many situations, the analyst is not interested in studying all the relationships between all elements of the database, but only a subset of them. Moreover, if the whole set of elements in the database is analyzed, the resulting mapping would be averaged out by the numerous relationships and elements we are not interested in; for instance, the principal axis would be completely different. It would therefore not exclusively reflect the relationships between the elements of interest. Reducing the Markov chain by stochastic complementation makes it possible to focus the analysis on the elements and relationships we are interested in.

Interestingly enough, when dealing with a bipartite graph (i.e., the database only contains two tables linked by one relation), stochastic complementation followed by a basic diffusion map is exactly equivalent to simple correspondence analysis. On the other hand, when dealing with a star-schema database (i.e., one central table linked to several tables by different relations), this two-step procedure reduces to multiple correspondence analysis. The proposed methodology therefore extends correspondence analysis to the analysis of a relational database.

In short, this paper has three main contributions:

- A two-step procedure for analyzing weighted graphs or relational databases is proposed.
- It is shown that the suggested procedure extends correspondence analysis.
- A kernel version of the diffusion map distance, applicable to directed graphs, is introduced.

The paper is organized as follows: Section 2 introduces the basic diffusion map distance and its natural kernel on a graph. Section 3 introduces some basic notions of stochastic complementation of a Markov chain. Section 4 presents the two-step procedure for analyzing the relationships between elements of different tables and establishes the equivalence between the proposed methodology and correspondence analysis in some special cases. Section 5 presents some illustrative examples involving several data sets, while [...]

[...] self-loops ($w_{ii} = 0$ for $i = 1, \ldots, n$) and that the graph has a single connected component; that is, any node can be reached from any other node. If the graph is not connected, there is no relationship at all between the different components, and the analysis has to be performed separately on each of them. It is therefore to be hoped that the graph modeling the relational database does not contain too many disconnected components; this can be considered a limitation of our method. Partitioning a graph into connected components from its adjacency matrix can be done in $O(n^2)$ (see, for instance, [56]). Based on the adjacency matrix $A$, the Laplacian matrix $L$ of the graph is defined in the usual manner: $L = D - A$, where $D = \mathrm{Diag}(a_{i.})$ is the generalized outdegree matrix with diagonal entries $d_{ii} = [D]_{ii} = a_{i.} = \sum_{j=1}^{n} a_{ij}$. The column vector $d = \mathrm{diag}(a_{i.})$ is simply the vector containing the outdegree of each node. Furthermore, the volume of the graph is defined as $v_g = \mathrm{vol}(G) = \sum_{i=1}^{n} d_{ii} = \sum_{i,j=1}^{n} a_{ij}$. Usually, we are dealing with symmetric adjacency matrices, in which case $L$ is symmetric and positive semidefinite (see, for instance, [10]).

From this graph, we define a natural random walk through the graph in the usual way by associating a state to each node and assigning a transition probability to each link. Thus, a random walker can jump from element to element, and each element therefore represents a state of the Markov chain describing the sequence of visited states. A random variable $s(t)$ contains the current state of the Markov chain at time step $t$: if the random walker is in state $i$ at time $t$, then $s(t) = i$. The random walk is defined by the following single-step transition probabilities of jumping from any state $i = s(t)$ to an adjacent state $j = s(t+1)$: $P(s(t+1) = j \mid s(t) = i) = a_{ij} / a_{i.} = p_{ij}$. The transition probabilities only depend on the current state and not on the past ones (first-order Markov chain). Since the graph is connected, the Markov chain is irreducible; that is, every state can be reached from any other state.
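The quantities defined above (outdegrees $a_{i.}$, Laplacian $L = D - A$, volume $\mathrm{vol}(G)$, and transition probabilities $p_{ij} = a_{ij}/a_{i.}$) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names and the 4-node adjacency matrix are hypothetical, plain Python lists are used only to keep the example self-contained, and the connectivity check assumes a symmetric adjacency matrix.

```python
from collections import deque


def out_degrees(A):
    """Row sums a_i. = sum_j a_ij (generalized outdegree of each node)."""
    return [sum(row) for row in A]


def laplacian(A):
    """Laplacian L = D - A, with D = Diag(a_i.)."""
    d = out_degrees(A)
    n = len(A)
    return [[(d[i] if i == j else 0.0) - A[i][j] for j in range(n)]
            for i in range(n)]


def volume(A):
    """vol(G) = sum_i d_ii = sum_{i,j} a_ij."""
    return sum(out_degrees(A))


def transition_matrix(A):
    """p_ij = a_ij / a_i. ; every node must have a positive outdegree."""
    d = out_degrees(A)
    if any(di <= 0 for di in d):
        raise ValueError("every node needs at least one outgoing link")
    return [[a / di for a in row] for row, di in zip(A, d)]


def is_connected(A):
    """Breadth-first search from node 0 over the adjacency matrix.

    Assumes a symmetric A (undirected graph). Scanning a full row per
    visited node gives O(n^2) work in total.
    """
    n = len(A)
    seen = {0}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for j in range(n):
            if A[i][j] > 0 and j not in seen:
                seen.add(j)
                queue.append(j)
    return len(seen) == n


# Hypothetical undirected weighted graph with 4 nodes and no self-loops.
A = [
    [0.0, 1.0, 2.0, 0.0],
    [1.0, 0.0, 0.0, 1.0],
    [2.0, 0.0, 0.0, 3.0],
    [0.0, 1.0, 3.0, 0.0],
]

d = out_degrees(A)        # outdegree vector d
L = laplacian(A)          # each row of L sums to zero
v_g = volume(A)           # total weight of all links
P = transition_matrix(A)  # each row of P sums to one
```

Each row of `P` is a probability distribution over the next state, which is what makes the random walk a first-order Markov chain; `is_connected(A)` checks the single-component assumption under which that chain is irreducible.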