2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)

Network Similarity using Distribution of Distance Matrices

Ralucca Gera, Department of Applied Mathematics, Naval Postgraduate School, Monterey, CA 93943 USA, [email protected]
Ruriko Yoshida, Department of Operations Research, Naval Postgraduate School, Monterey, CA 93943 USA, [email protected]

Abstract: Decision makers use partial information about networks to guide their decisions, yet when they act, they act in the real network, or ground truth. Therefore, a way of comparing the partial information to the ground truth is required. We introduce a statistical measure that compares the network obtained from the partially observed information with the ground truth, and which can of course be applied to the comparison of any networks. As a first step, in the current research, we restrict ourselves to networks of the same size to introduce such a method, which can be generalized to networks of different sizes. We perform a mathematical analysis, and then apply our methodology to synthetic networks generated using five different generating models. We conclude with a statistical hypothesis test to decide whether two graphs are correlated or not.

IEEE/ACM ASONAM 2018, August 28-31, 2018, Barcelona, Spain
U.S. Government work not protected by U.S. copyright

I. INTRODUCTION

Understanding and exploring data in today's digital world is difficult due to the complex interconnections of the systems we study and the big data they generate. Network science is commonly used to describe and study such systems, by creating networks of interactions between people, objects, and abstract concepts. Consider an environment in which capturing the data is done progressively, such as assembling the social network of a person based on information gathered from a cellphone. The question we would like to answer is: at what point have we captured enough information about the social network (the partial information network) to know that it is representative of the real social network of the person (the ground truth)? That is, humans tend to base decisions on the partial information that is available, yet act in the ground truth. An accurate depiction of a network that is similar to the ground truth is important for a meaningful analysis.

Comparing two graphs has always been a problem of interest for Graph Theorists [1], and lately for Network Scientists. Researchers have used various metrics to compare two given graphs, such as degree distributions, density, clustering coefficient, average path length, Maximum Common Subgraph, number of spanning trees, Hamming Distance, etc. These metrics only compare the macro-level properties of the networks; the micro-level binding of the network is not considered.

Kashima and Inokuchi use a graph kernel method to classify graphs, where the graph kernel is defined using a random walk on the product graph of two graphs [2]. Borgwardt and Kriegel proposed graph kernels based on shortest paths and verified them by classifying protein graph models into functional classes [3]. Pržulj proposed a measure based on the graphlet distribution to compare two large networks, but the proposed measure is computationally costly [4]. Wilson and Zhu compared two graphs using spectra and showed that the Euclidean distance between the spectra of two graphs is highly correlated with their edit distance [5]. Zager and Verghese use the structural similarity of local neighborhoods of the nodes to compare different graphs [6]. Aliakbary et al. proposed a genetic-algorithm-based approach to compute the similarity measure between two networks of different sizes [7]. Davis et al. propose a random-walk-based method to compare two networks by comparing the explored percentages of nodes and edges at some iteration i while exploring the networks [8]. Crawford et al. introduced two metrics to compare the topologies of two networks using the eigenvalues of the adjacency matrix and the Laplacian matrix [9]. Roginski et al. used the neighbor matrix to compare graphs and showed that it captures eleven important characteristics of networks, such as degree, distances, closeness, and betweenness [10].

A pragmatic approach to isomorphism is to identify whether two graphs are almost the same, namely similar. In this research, we assume that two graphs are similar if they were created in a similar fashion by following the same rules. This would then extend to real networks by assuming that similar real-world networks became what they are through a similar process. This could include the case of comparing a depiction of a network to the ground truth itself, where the depiction is incomplete due to insufficient information or the impossibility of capturing the entire network. We applied our methodology to synthetic networks whose number of nodes we can control, allowing us to compare their topology rather than their sizes. As our experiment was very successful, we plan on extending our methodology to real-world networks having different node counts.

In this paper, we investigate the distribution of a distance metric between random graphs. We wish to compare graphs, or sets of graphs, to each other in a statistical framework and ask whether or not they are significantly different. More specifically, suppose $G_1$ and $G_2$ have the same number of nodes. Then we have the following hypotheses:

H0: two given graphs $G_1$ and $G_2$ do not have similar structure to each other,
H1: two given graphs have similar structure.

In terms of statistics, these hypotheses read as follows:

H0: two given graphs $G_1$ and $G_2$ are not correlated,
H1: two given graphs are correlated.

A key to conducting this hypothesis test is to understand the distribution of a distance measure $d(G_1, G_2)$ between $G_1$ and $G_2$, which can be used as a test statistic for the hypotheses above. If we understand the distribution of the test statistic, then we can conduct the hypothesis test or significance test. Thus here we investigate the distribution of $d(G_1, G_2)$ where $G_1$ and $G_2$ are sampled from some distribution, and where $d(G_1, G_2)$ is the path difference distance defined in Section III.

This paper is organized as follows: Section II introduces the motivation and the generating models used to create the synthetic networks. Section III explains the distance matrix, followed by the sensitivity analysis of $d(G_1, G_2)$ when we perturb $G_1$ by one edge. In Section III-C we compute the mean and variance of $d^2(G_1, G_2)$ when we sample $G_1$ and $G_2$ uniformly. The experimental results on synthetic graphs are discussed in Section IV. We then present our conclusions and further directions in Section V.

II. PROBLEM DEFINITION AND PRELIMINARIES

In the current research, we investigate a network similarity method to compare given networks based on limited information about a network, such as knowing just the network's topology.

The standard graph comparison is done through isomorphisms of graphs. However, deciding isomorphism is hard, and the complexity of its decision problem is still open. In addition, we would like to move away from the limitation of the binary answer: given two graphs, they are either isomorphic or nonisomorphic. We would like to determine whether the graphs are similar instead. We acknowledge that similarity is a very complex and vast problem. In this work, we propose a metric that recognizes similar graphs; we assume that two graphs are similar if they were created by the same algorithm. In our study, we use synthetic networks to create networks of the same type.

In network science, large-scale networks can be understood both through the analysis of real networks and by modeling them through synthetic networks. The analysis of these synthetic complex networks allows us to bring theory into place while mimicking reality. Next, we discuss five distinct synthetic models that we use for generating networks.

A. Erdős-Rényi (ER) Network

For $p \in [0, 1]$, $n \in \mathbb{N}$, and $N = \binom{n}{2}$, let $G_{p,n}$ be the probability space $(\Omega, \mathcal{F}, P)$ which is the product of the probability spaces $(\Omega_{ij}, \mathcal{F}_{ij}, P_{ij})$ for $i, j \in \{1, 2, \ldots, n\}$ such that $\Omega_{ij} = \{0, 1\}$ and $P_{ij}$ is the probability function of the Bernoulli distribution with parameter $p$. That is, for each pair of nodes $(i, j)$, we independently decide whether to assign an edge between node $i$ and node $j$ using a random variable $X \sim \mathrm{Bernoulli}(p)$: if $X = 1$, we assign an edge between node $i$ and node $j$; if $X = 0$, we do not. Thus for any graph $G$ in $G_{p,n}$, we have

$$P(G) = p^{e(G)} (1 - p)^{N - e(G)},$$

where $e(G)$ is the number of edges in $G$. This is called the Erdős-Rényi model of a random graph [11].

B. Barabási-Albert (BA) Network

The Barabási-Albert (BA) model is a basic scale-free graph model that grows through preferential attachment of new nodes based on the degree of the existing nodes in the network. The user determines the parameters for the number of nodes and the number of edges per incoming node, which will produce the power-law degree distribution. Formally, we obtain a Barabási-Albert model by starting with a small number $m_0$ of vertices, and at every time step a new vertex $y_i$ ($i \geq 1$) is added with $m < m_0$ edges. These new edges link the incoming vertex $y_i$ to $m$ different vertices that already exist in the network at the step $i$ when $y_i$ is added. To incorporate preferential attachment, we assume that the probability $P$ that a new vertex will be connected to a vertex $i$ depends on the degree $k_i$ of that vertex, namely $P = k_i / \sum_j k_j$.

After $t$ steps the model leads to a scale-free network with $t + m_0$ vertices and $mt$ edges. The preferential attachment process gives rise to the "rich become richer" phenomenon that leads to a power-law degree distribution [12].

C. Watts-Strogatz (WS) Network

The Watts-Strogatz model was one of the original models introduced to capture the small-world effect, while allowing the user to choose how much randomness and how much chaos to bring into the creation of the graph. The model starts with a one-dimensional lattice of $n$ nodes arranged in a circle, where each node is adjacent to the first $k$ neighbors to its left and the first $k$ neighbors to its right (just like Harary graphs [13] or the $k$-th power of an $n$-cycle, namely $C_n^k$). This part creates the structure of the small world. The chaos is then brought in by rewiring some of the edges: with a fixed probability $p$, end vertices of edges are picked and randomly rewired to different vertices, thus creating shorter connections between the people that were diametrically opposed in the existing lattice. Note that if $p = 1$ then one obtains a random graph, as each edge is rewired randomly, and if $p = 0$ one obtains the original structured lattice. Other values of $p$ allow for a combination of chaos and structure. This model was introduced by Watts and Strogatz [14].

D. PowerLaw-Cluster (PC) Network

In real-world social networks, people prefer to make connections with the friends of their friends. This increases the number of closed triads and the average clustering [15] in the network. Holme and Kim proposed the PowerLaw-Cluster generating model based on both preferential attachment and triad formation [16]. The network is generated as in the Barabási-Albert (BA) growth model, with an extra step: at each iteration a node is given a chance, with probability $p$, of making an edge to one of its neighbors' neighbors, thereby forming a triangle.

E. Community Structured (CS) Network

Community structure is a well-studied meso-scale structure in real-world networks. Here we use a very basic Community Structured generating model, where the nodes are divided into groups of the same size. The edges are placed uniformly at random between the nodes in the same group with probability $p_{in}$ and between the nodes of different groups with probability $p_{out}$ [17]. The probability of having an intra-community link is higher than that of an inter-community link.

III. THE PROPOSED METHODOLOGY

A. Distance Matrix

We propose to use the distance matrix (introduced in 1977

B. Sensitivity of the Path Difference

In this section we would like to see how sensitive the path difference $d(G_1, G_2)$ between two graphs $G_1, G_2 \in \mathcal{G}$ is to removing one edge $e \in G_1$.

Let $p_G(i, j)$ be the set of all shortest paths in $G \in \mathcal{G}$ from a node $i$ to a node $j$, and let $p_G(e) := \{(i, j) \mid e \in p \in p_G(i, j),\ i < j\}$. Let $P_e(G) := \{(i, j) \in p_G(e) \mid |p_G(i, j)| = 1\}$ for an edge $e \in G$ and $G \in \mathcal{G}$. Intuitively, for a fixed edge $e$ of $G \in \mathcal{G}$, $P_e(G)$ collects all pairs of nodes in $G$ whose shortest path is unique and goes through the edge $e$.

Theorem 1. Suppose $G \in \mathcal{G}$ is a connected graph with $n$ nodes. If we remove an edge $e$ in $G$, then

$$d(G, G - \{e\}) \geq \sqrt{|P_e(G)|}.$$

Proof. We compute $d = d(G, G - \{e\})$, where $d_{ij}(G)$ denotes the shortest-path distance between nodes $i$ and $j$ in $G$:

$$d = \sqrt{\sum_{i<j} \left(d_{ij}(G) - d_{ij}(G - \{e\})\right)^2}.$$

For every pair $(i, j) \in P_e(G)$ the unique shortest path uses $e$, so removing $e$ increases the distance between $i$ and $j$ by at least 1, while no term of the sum is negative. Hence $d^2 \geq |P_e(G)|$.

Let $\mathcal{G}_\nu(n)$ be the probability space $(\mathcal{G}, \mathcal{F}, \nu)$ which is the

$$= \sum_{G, G'} \frac{1}{g(n)^2} \sum_{i<j} \left[d_{ij}(G) - d_{ij}(G')\right]^2$$
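For very small graphs, a double sum of this kind can be evaluated by brute force. The sketch below is only an illustration under assumptions that the surviving text does not confirm: it takes the sample space to be all connected labeled graphs on n nodes, reads g(n) as the number of such graphs, and uses hop counts as the distances $d_{ij}$. The function names are hypothetical.

```python
from collections import deque
from itertools import combinations, product


def distance_matrix(n, edges):
    """All-pairs hop distances via BFS; returns None if the graph is disconnected."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = []
    for source in range(n):
        row = [None] * n
        row[source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if row[v] is None:
                    row[v] = row[u] + 1
                    queue.append(v)
        if any(x is None for x in row):
            return None  # some node is unreachable from `source`
        dist.append(row)
    return dist


def squared_path_difference(d1, d2):
    """Sum over pairs i < j of (d_ij(G) - d_ij(G'))^2."""
    n = len(d1)
    return sum((d1[i][j] - d2[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def connected_graph_distances(n):
    """Distance matrices of all connected labeled graphs on n nodes (assumed sample space)."""
    pairs = list(combinations(range(n), 2))
    matrices = []
    for mask in product([0, 1], repeat=len(pairs)):
        edges = [p for p, keep in zip(pairs, mask) if keep]
        d = distance_matrix(n, edges)
        if d is not None:
            matrices.append(d)
    return matrices


def mean_squared_distance(n):
    """Average of d^2(G, G') over all ordered pairs, each weighted 1 / g(n)^2."""
    mats = connected_graph_distances(n)
    g = len(mats)  # plays the role of g(n) under the assumption above
    total = sum(squared_path_difference(a, b) for a in mats for b in mats)
    return total / g ** 2
```

The enumeration grows as $2^{\binom{n}{2}}$, so this is only practical for roughly n ≤ 5; it is meant as a sanity check for closed-form expressions, not as a replacement for them.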

Theorem 4. Assume that $\mathcal{G}_\nu(n)$ is the probability space with the uniform distribution $\nu$ on $\mathcal{G}$. Let $\sigma^2(n)$ be the variance of $d^2$. Then

In this equation, two terms can be simplified as:

$$\sum_{G, G'} \left\{ \left[\sum_{i<j} d_{ij}^2(G)\right]^2 + \left[\sum_{i<j} d_{ij}^2(G')\right]^2 \right\}$$

Fig. 1: Histograms for z-scores under five different graph models (surviving panel titles: (d) PC Network, (e) CS Network). The black dashed lines are the significance levels α = 0.05.
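The two-sample significance test used in this section rejects the null hypothesis when a Kolmogorov-Smirnov statistic $D_{N,N'}$ exceeds $c(\alpha)\sqrt{(N + N')/(N \cdot N')}$. The following is a minimal sketch of that decision rule, not the authors' implementation. It assumes that each graph in a sample has already been reduced to one scalar value (how this is done is not specified here), and it replaces the tabulated $c(\alpha)$ of [19] with the standard large-sample approximation $c(\alpha) = \sqrt{-\tfrac{1}{2}\ln(\alpha/2)}$. All function names are hypothetical.

```python
import math


def ks_statistic(xs, ys):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        t = min(xs[i], ys[j])
        # Advance both samples past every observation equal to t (handles ties).
        while i < n and xs[i] == t:
            i += 1
        while j < m and ys[j] == t:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def critical_value(alpha, n, m):
    """c(alpha) * sqrt((n + m) / (n * m)), with c(alpha) from the asymptotic
    approximation (an assumption; the paper reads c(alpha) from tables)."""
    c = math.sqrt(-0.5 * math.log(alpha / 2))
    return c * math.sqrt((n + m) / (n * m))


def reject_null(xs, ys, alpha=0.05):
    """True if the statistic exceeds the critical value at level alpha."""
    return ks_statistic(xs, ys) > critical_value(alpha, len(xs), len(ys))
```

With two samples of size 100, as in the experiments, the threshold at α = 0.05 is about 0.192, so the test rejects once the two empirical distribution functions differ by more than roughly 19 percentage points at some value.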

and under the null hypothesis in (5), $D_{N,N'} \to 0$ as $N, N' \to \infty$.

Input: The two observed sets of arbitrary graphs $S_1 = \{G^1_1, G^1_2, \ldots, G^1_N\}$ and $S_2 = \{G^2_1, G^2_2, \ldots, G^2_{N'}\}$. The Type I error rate (significance level) α.
Output: Decision whether or not we reject the null hypothesis in (5).
1) Compute the test statistic $D_{N,N'}$ from the observations.
2) If $D_{N,N'} > c(\alpha) \sqrt{(N + N')/(N \cdot N')}$, where $c(\alpha)$ can be found for each level of α in [19], then we reject the null hypothesis. Otherwise we do not reject the null hypothesis.

We conducted a simulation study for this significance test on two samples. In this experiment we would like to see whether the number of nodes in the graphs affects the type I error rate (i.e., the false positive rate) and the type II error rate (i.e., the false negative rate) of the statistical test under each model used to generate random graphs. Under each model, we varied the number of nodes n = (5 · i) + 10 for i = 1, ..., 100. Then, we generated 100 graphs with a fixed number of nodes. We generated two such samples under each model with a fixed number of nodes.

First, we conducted tests for Type I errors. For the number of nodes n = (5 · i) + 10 for i = 1, ..., 100, we compared:
1) Two samples of size 100 generated under the Erdős-Rényi model (see Figure 2(a)).
2) Two samples of size 100 generated under the Barabási-Albert model (see Figure 2(b)).
3) Two samples of size 100 generated under the Watts-Strogatz model (see Figure 2(c)).
4) Two samples of size 100 generated under the PowerLaw-Cluster model (see Figure 2(d)).
5) Two samples of size 100 generated under the Community Structured model (see Figure 2(e)).

Fig. 2: P-values for the significance test on two samples. Panels: (a) ER Network, (b) BA Network, (c) WS Network, (d) PC Network, (e) CS Network. The black lines are the significance levels α = 0.05.

The Type I error rates at the significance level α = 0.05 in this study can be found in Table II.

Model                        Type I error rate
Erdős-Rényi model            0.09
Barabási-Albert model        0.04
Watts-Strogatz model         0.05
PowerLaw-Cluster model       0.06
Community Structured model   0.06

TABLE II: Type I error with the significance level α = 0.05.

Next, we conducted tests for Type II errors. We compared one sample of size 100 generated under one model and one sample of size 100 generated under a different model, for the number of nodes n = (5 · i) + 10 for i = 1, ..., 100. The results are shown in Figure 3.

The results show that the proposed hypothesis test can clearly differentiate two networks generated using two different generating processes, as shown in Figure 3(a-c,e,g-j). In Figure 3(d,f), the p-values are higher, and the proposed hypothesis test is not able to differentiate the models, for the following reasons. In the Community Structured model, the edges are placed within and between the communities uniformly at random with the given probabilities $p_{in}$ and $p_{out}$ respectively, as discussed in Section II-E. Thus the generating fashion of the Community Structured model is very similar to that of the Erdős-Rényi model, and therefore the similarity value is very high. The highest p-value for the Community Structured and Erdős-Rényi models is 0.443, which shows that the p-value captures the similarities between the models. Similarly, the PowerLaw-Cluster model generates the network using the preferential attachment model, and a triad edge is placed between a node and a neighbor of its neighbors with probability p. In the experiments, we have taken a small value for p, p = 0.1. Thus the PowerLaw-Cluster networks are generated in a very similar fashion to the Barabási-Albert networks for small network sizes, and the similarity is captured by the p-values, as shown in Figure 3(f). We further observe that as the network size increases, the p-values decrease and show the differences between the generating models.

We also record the biggest p-value for each pair of models; the results are shown in Table III.

Models       Biggest p-value    Models       Biggest p-value
ER vs. BA    0.069              ER vs. WS    5.956e−06
ER vs. PC    0.443              ER vs. CS    0.443
BA vs. WS    9.122e−09          BA vs. PC    0.894
BA vs. CS    1.220e−05          WS vs. PC    3.695e−09
WS vs. CS    2.849e−06          PC vs. CS    0.001

TABLE III: Biggest p-values for different models for type II errors.

Fig. 3: P-values for the significance test on two samples. Panels: (a) ER vs. BA Network, (b) ER vs. WS Network, (c) ER vs. PC Network, (d) ER vs. CS Network, (e) BA vs. WS Network, (f) BA vs. PC Network, (g) BA vs. CS Network, (h) WS vs. PC Network, (i) WS vs. CS Network, (j) PC vs. CS Network. The black lines are the significance levels α = 0.05.

Our experiment shows that the hypothesis test which we proposed controls the type I error rate under the significance level (Table II and Figure 2), and it seems that the type I error rate is not affected by the number of nodes. In addition, the analysis shows that our hypothesis test has fairly high power and that the power does not depend on the number of nodes (Figure 3).

V. CONCLUSIONS AND FURTHER STUDIES

In the current research, we introduced the methodology of identifying similar graphs using the path difference distance. We analyzed it theoretically by looking at the mean and standard deviation for uniformly generated Erdős-Rényi random graphs with n nodes. We tested the proposed hypotheses on Erdős-Rényi and four more generating models. We created multiple copies of the same size graph using all the different methods of obtaining synthetic models. First, we tested our hypothesis to capture the similarity between the networks generated using the same generating model. Our type I error analysis showed that the hypothesis test in Section IV-A and the hypothesis test in Section IV-B control the type I error rates under the significance level. Next, we tested our hypothesis of differentiating graphs that were created by different synthetic models, and our power analysis shows that the hypothesis test in Section IV-B has high power.

There has been much work to classify an observed network using supervised learning methods, such as classification trees and random forests, using graph statistics, such as mean distances and average degree (for example, [20] and [21]). It is not well understood how such statistics contribute to classification accuracy. In the classification process, we learn the behaviors of such graph statistics by generating random networks, and then we apply what we learned from random networks to a given data set. Chia suggests that the average distance contributes significantly to classifying a network, yet it is important to understand how a distance $d(G_1, G_2)$ between two networks $G_1, G_2$ varies.

This project opens up various future directions. In this work, we compare graphs that we control in size, but that have different generating processes. In a temporal view, the sizes

of real-world networks increase very fast, and thus researchers analyze samples of the ground truth networks. It will be beneficial to identify which collected samples capture the network-generating information.

ACKNOWLEDGMENTS

The authors thank the DoD for partially sponsoring the current research. R.Y. is supported by NSF Division of Mathematical Sciences: CDS&E-MSS program 1622369. The authors would also like to thank Akrati Saxena, IIT Ropar, India for the discussions and code implementations for the project.

REFERENCES

[1] Ronald C Read and Derek G Corneil. The graph isomorphism disease. Journal of Graph Theory, 1(4):339–363, 1977.
[2] Hisashi Kashima and Akihiro Inokuchi. Kernels for graph classification. In ICDM workshop on active mining, volume 2002, 2002.
[3] Karsten M Borgwardt and Hans-Peter Kriegel. Shortest-path kernels on graphs. In Data Mining, Fifth IEEE International Conference on, pages 8–pp. IEEE, 2005.
[4] Nataša Pržulj. Biological network comparison using graphlet degree distribution. Bioinformatics, 23(2):e177–e183, 2007.
[5] Richard C Wilson and Ping Zhu. A study of graph spectra for comparing graphs and trees. Pattern Recognition, 41(9):2833–2841, 2008.
[6] Laura A Zager and George C Verghese. Graph similarity scoring and matching. Applied Mathematics Letters, 21(1):86–94, 2008.
[7] Sadegh Aliakbary, Sadegh Motallebi, Sina Rashidian, Jafar Habibi, and Ali Movaghar. Distance metric learning for complex networks: Towards size-independent comparison of network structures. Chaos: An Interdisciplinary Journal of Nonlinear Science, 25(2):023111, 2015.
[8] Benjamin Davis, Ralucca Gera, Gary Lazzaro, Bing Yong Lim, and Erik C Rye. The marginal benefit of monitor placement on networks. In Complex Networks VII, pages 93–104. Springer, 2016.
[9] Brian Crawford, Ralucca Gera, Jeffrey House, Thomas Knuth, and Ryan Miller. Graph structure similarity using spectral graph theory. In International Workshop on Complex Networks and their Applications, pages 209–221. Springer, 2016.
[10] Jonathan W Roginski, Ralucca M Gera, and Eric C Rye. The neighbor matrix: generalizing the degree sequence. 2015.
[11] P. Erdős and A. Rényi. On random graphs I. Publicationes Mathematicae, 6:290–297, 1959.
[12] B. Bollobás and O. Riordan. Robustness and vulnerability of scale-free random graphs. Internet Math, 1(1):1–35, 2003.
[13] Frank Harary and Edgar M. Palmer. Graphical Enumeration. Academic Press, 1973.
[14] Duncan J Watts and Steven H Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442, 1998.
[15] Mark Newman. The structure and function of complex networks. SIAM Review, 45(2):167–256, 2003.
[16] Petter Holme and Beom Jun Kim. Growing scale-free networks with tunable clustering. Physical Review E, 65(2):026107, 2002.
[17] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010.
[18] W. W. Daniel. Kolmogorov–Smirnov one-sample test. In Applied Nonparametric Statistics, pages 319–330. PWS-Kent, Boston, USA, 2nd edition, 1990.
[19] E. S. Pearson and H. O. Hartley. Tables 54, 55. In Biometrika Tables for Statisticians 2, pages 117–123. Cambridge University Press, 1972.
[20] J.P. Canning, E.E. Ingram, S. Nowak-Wolff, A.M. Ortiz, N.K. Ahmed, R.A. Rossi, K.R.B. Schmitt, and S. Soundarajan. Network classification and categorization, 2017.
[21] X.L.P. Chia. Assessing the robustness of graph statistics for network analysis under incomplete information, 2018.