Complex Network Comparison Using Random Walks∗

Complex Network Comparison Using Random Walks∗ Shan Lu Jieqi Kang Weibo Gong Department of Electrical and Department of Electrical and Department of Electrical and Computer Engineering Computer Engineering Computer Engineering University of Massachusetts University of Massachusetts University of Massachusetts Amherst Amherst Amherst [email protected] [email protected] [email protected] Don Towsley School of Computer Science University of Massachusetts Amherst [email protected] ABSTRACT diffusion are effective in testing the similarity of complex In this paper, we proposed a complex network comparison graphs and that such simulations provide plausible mech- method based on mathematical theory of diffusion over man- anisms for many brain activities. In this paper we apply ifolds using random walks over graphs. We show that our the random walk method to distinguish between complex method not only distinguishes between graphs with differ- networks, with experiments on graph comparison between ent degree distributions, but also different graphs with same graphs with different distributions and the graphs with the degree distributions. We compare the undirected power law same distributions but different connectivity structures. graphs generated by Barabasi-Albert (B-A) model and directed power law graphs generated by Krapivsky's model to Graph comparison is a challenging task since graph sizes in- the random graphs generated by Erdos-Renyi model. We crease extremely fast in diverse areas, such as social networks also compare power law graphs generated by four different (Facebook, Twitter), Web graphs (Google), knowledge net- generative models with the same degree distribution. works (Wikipedia), etc.. Many graph comparison methods have been proposed to quantitatively define the similarity Keywords between graphs. In [10] the authors summarize the existing methods into three categories: graph isomorphism, iterative random walk, complex network, power law graph, graph methods and feature extraction. The graph isomorphism comparison and iterative methods are not scalable and thus not effective for large networks. Feature extraction methods extrac- 1. INTRODUCTION t features like degree distribution, eigenvalues to compare. The asymptotic behavior of the heat content has been used These methods are closer in spirit to our method. Howev- as a tool to understand the geometry of a manifold domain er previously proposed features may not reflect the network [2, 16] or the connectivity structure of a graph [12, 13]. Heat connectivity structure very well. For example, in [13], the content, as the solution of the heat equation associated with authors give an example where two isospectral nonisometric the Laplacian operator, summarizes the heat diffusion in planar graphs can be distinguished by the heat content, de- the domain as a function of time for a given initial heat spite the fact they share the same set of eigenvalues. In [8], distribution. One property of the heat content method is the authors analysed the structural properties of graphs with that its asymptotic behavior as t ! 0 separates the heat the same degree distribution and found that different net- content curves of different structures. This enables one to works with the same degree distribution may have distinct develop fast algorithms for complex graphs comparison. In structural properties. In [15], the authors discussed meth- [6, 7] it was pointed out that Monte-Carlo simulations of ods for similarity testing in directed web graphs, including vertex ranking, sequence similarity and signature similarity, ∗ This work is supported in part by the United States among others. However, like most of the algorithms in [10], National Science Foundation Grant EFRI-0735974, CNS- they need to know the node correspondence, which is the 1065133 and CNS-1239102, and Army Research Office Con- mapping between the graphs' nodes. It is already a hard tract W911NF-08-1-0233 and W911NF-12-1-03. problem for many complex networks. Our algorithm exhibits the following features. First, our method summarizes graph structure into a single time function so as to facilitate similarity testing. Second, the behavior of this function around time t = 0 is the most important for the comparison. Practically we only need the beginning part of the heat content so that we can greatly reduce the computation time. Third, we use lazy random walk to estimate the heat content function, thereby avoid computing the eigenvalues and eigenvectors of the graph Laplacian while re- with initial condition taining the spectral information. Fourth, our algorithm only ( 1 if u 2 iD compares the connectivity structure and does not use node H0(u; u) = correspondence. Hence it avoids the need to identify a map- 0 else : ping between the graphs' nodes. Finally we note that our method is robust to minor changes in large graphs according Assuming the total number of vertices is N and the number to the interlacing theorem in [3]. With these features, our al- of interior vertices is n, Ht is an N × N matrix. Ht(u; v) gorithm is capable of handling very large complex networks. measures the amount of heat that flows from vertex u to Using experiments, we show that our algorithm perform- vertex v at time t. All heat that flows to the boundary s better in distinguishing networks comparing to the other vertices is absorbed. We label the interior vertices 1; ··· ; n feature extraction methods, such as eigenvalues and degree and the boundary vertices n + 1; ··· ;N. The normalized distributions. Laplacian L can be partitioned into four parts L L L = iD;iD @D;iD : The rest of the paper is organized as follows. In Section 2, LiD;@D L@D;@D we give the notations and review the concept of heat equation and heat content for graphs. In Section 3, we use the Since we are only interested in the heat remaining in the lazy random walk simulation method to estimate the heat interior domain, define the n × n matrix ht with content. In Section 4, the graph generative models used in h (u; v) = H (u; v) for u; v 2 iD: experiment part are introduced. Experiment settings and t t −LiD;iD t results are presented in Section 5. Section 6 summarizes the The solution to the heat equation is ht = e . For main results and discusses future work. convenience, we slightly abuse notation and still use Λ and Φ as the eigenvalue matrix and eigenvector matrix of LiD;iD. 2. HEAT EQUATION AND HEAT CONTENT We have −Λt −1 2.1 Notations ht = Φe Φ : Let G = (V; E) denote a graph with vertex set V and edge set E ⊆ V × V with adjacency matrix A = [auv]. auv = 1 For each entry of ht, we have if there is an edge from u to v; otherwise auv = 0. The n P X −λit out-degree matrix D = diag[du] with du = v auv. ht(u; v) = e φi(u)πi(v): (3) i=1 The graph Laplacian of a graph is defined as L = D −A and the normalized Laplacian is defined as [4] The heat content Q(t) is defined as the sum of all the entries in ht: L = D−1=2LD−1=2: X X Q(t) = ht(u; v) (4) −1 With the random walk Laplacian Lr = D L, we have the u v following relation between L and Lr n X X X −λit −1=2 1=2 = e φi(u)πi(v): (5) Lr = D LD : u v i=1 Without loss of generality, we assume that the Laplacian P P Letting αi = u v φi(u)πi(v) yields matrix L is diagonalizable and hence L is diagonalizable. Let λ ≤ λ ≤ · · · ≤ λ be the eigenvalues of L and n 1 2 n X −λit Q(t) = αie : (6) φi; i = 1; ··· ; n be the corresponding eigenvectors. With i=1 Λ = diag[λi] and Φ = [φ1; ··· ; φn] we can diagonalize L as The heat content can be viewed as the sum of exponen- L = ΦΛΦ−1; tially decaying functions with different rates and different −1 where Φ = [π1; π2; ··· ; πn]. strengths. The rates and strengths are determined by the graph Laplacian eigenvalues and eigenvectors, respectively. Furthermore we have To emphasize the larger eigenvalues more, we can use the −1=2 −1=2 −1 following derivatives of the heat content for comparison: Lr = (D Φ)Λ(D Φ) : (1) m L and L share the same set of eigenvalues, but the cor- X −λit r Q_ (t) = − αiλie ; responding eigenvectors are different. L is the normalized i=1 graph Laplacian used in the heat equation on a graph. We use the relationship between L and Lr to develop a random 3. RANDOM WALK METHODS FOR HEAT walk simulation method in the later section. CONTENT ESTIMATION 2.2 Heat equation and heat content Computing eigenvalues and eigenvectors of the Laplacian matrix needed for evaluating the heat content is very time Vertex set V is partitioned into two subsets, the set of all consuming for large complex networks. In this section, we interior nodes iD and the set of all boundary nodes @D. We use a discrete time random walk method to approximate the have V = iD [ @D. The heat equation associated with the heat content. normalized graph Laplacian is @Ht = −LH We consider a random walk where the random walker moves @t t (2) Ht(u; v) = 0 for u 2 @D; from vertex u to a neighboring vertex v with probability −1 auv=du. Define the transition matrix M = D A and the Barabasi-Albert (B-A) model lazy random walk transition matrix as The construction starts with m0 initial nodes. Each new node is connected to m(m ≤ m ) existing nodes with a M = (1 − δ)I + δM: 0 L probability proportional to the number of links that the ex- In other words, a random walker either moves to one of isting nodes already have.

Complex Network Comparison Using Random Walks∗

Modern Discrete Probability I

Markov Chain Monte Carlo Estimation of Exponential Random Graph Models

Exploring Healing Strategies for Random Boolean Networks

Percolation Threshold Results on Erdos-Rényi Graphs

Probability on Graphs Random Processes on Graphs and Lattices

3 a Few More Good Inequalities, Martingale Variety 1 3.1 From

Random Boolean Networks As a Toy Model for the Brain

An Introduction to Exponential Random Graph (P*) Models for Social Networks

A Review on Statistical Inference Methods for Discrete Markov Random Fields Julien Stoehr, Richard Everitt, Matthew T

A Review on Statistical Inference Methods for Discrete Markov Random Fields Julien Stoehr

The Random Graph

Critical Points for Random Boolean Networks James F