Arxiv:1108.1502V2 [Cs.SI] 10 Feb 2012 the Community Structure of a Network
Total Page:16
File Type:pdf, Size:1020Kb
Generalized Louvain method for community detection in large networks Pasquale De Meo∗, Emilio Ferrarax, Giacomo Fiumara∗, Alessandro Provetti∗,∗∗ ∗Dept. of Physics, Informatics Section. xDept. of Mathematics. University of Messina, Italy. ∗∗Oxford-Man Institute, University of Oxford, UK. fpdemeo, eferrara, gfiumara, [email protected] Abstract—In this paper we present a novel strategy to a novel measure of edge centrality, in order to rank all discover the community structure of (possibly, large) networks. the edges of the network with respect to their proclivity This approach is based on the well-know concept of network to propagate information through the network itself; iii) modularity optimization. To do so, our algorithm exploits a novel measure of edge centrality, based on the κ-paths. its computational cost is low, making it feasible even for This technique allows to efficiently compute a edge ranking large network analysis; iv) it is able to produce reliable in large networks in near linear time. Once the centrality results, even if compared with those calculated by using ranking is calculated, the algorithm computes the pairwise more complex techniques, when this is possible; in fact, proximity between nodes of the network. Finally, it discovers because of the computational constraints, the adoption of the community structure adopting a strategy inspired by the well-known state-of-the-art Louvain method (henceforth, LM), some existing techniques is not viable when considering efficiently maximizing the network modularity. The experi- large networks, and their application is only limited to small ments we carried out show that our algorithm outperforms case-studies. other techniques and slightly improves results of the original This paper is organized as follows: in the next Section we LM, providing reliable results. Another advantage is that its provide some background information about the community adoption is naturally extended even to unweighted networks, differently with respect to the LM. detection problem. Section III introduces the main objectives of this work and describes an intuitive sketch about the novel Keywords-complex networks; community structure strategy of community detection we propose. In Section IV the key concept of κ-path edge centrality is recalled, being it I. INTRODUCTION a novel and efficient strategy of ranking edges with respect The investigation of the community structure inside net- to their centrality in the network. All the pieces are glued works has acquired a great relevance during the last years, together in Section V. We describe our strategy to detect in particular in the context of Social Network Analysis the community structure, inspired by the well-known state- (SNA). This, also because of the unpredicted success of of-the-art LM [1], which is computationally suitable even Online Social Networks (OSNs). In fact, social phenomena when large networks are analyzed. Experiments that have such as Facebook and Twitter amongst others, glue together been carried out are discussed in Section VI. Finally, Section millions of users under a unique network whose features VII concludes, depicting some future directions of research. are a goldmine for Social Scientists. Several works are focused on the Social Network analysis of these OSNs; II. BACKGROUND others describe the strategies of analysis themselves. Several techniques to investigate the community structure In this paper we focus on the possible strategies of com- of networks have been proposed in literature during last munity detection. As to date, two paradigms exist to discover years. There exist numerous comprehensive surveys to this arXiv:1108.1502v2 [cs.SI] 10 Feb 2012 the community structure of a network. The former is based problem, such as [2], [3]. on the analysis of the global features of the network, for In its general formulation, the problem of finding commu- example its topology. These approaches are characterized by nities in a network is intended as a data clustering problem. high computational complexity and high quality results. The In fact, it could be solved assigning each node of the network latter paradigm relies on exploiting local information, for to a cluster, in a meaningful way. Two approaches have been example those acquirable by nodes and their neighborhoods. widely investigated, i) spectral clustering based techniques, The computational cost of these techniques is lower than and, ii) network modularity optimization strategies. The those exploiting global features, but the reliability decreases. former relies on the optimization of the process of cutting In this work, we propose a novel strategy to discover the graph representing the given network. The latter is based the inner community structure of a network. The main on the maximization of a benefit function, called network characteristics of our approach are the followings: i) it modularity. We briefly recall them, separately. exploits global information of the network, establishing The problem of minimizing the number of cuts in a which are the edges of the network that contribute to the given graph has been proved to be NP-hard. To do so, creation of the community structure; ii) to do so, it adopts different approximate techniques have been proposed. An example is by using the spectral clustering [4], exploiting betweenness centrality, that is itself very costly (even if the the eigenvectors of the Laplacian matrix of the network. most efficient algorithm [10] is adopted). We recall that the Laplacian matrix L of a given graph has Several variants of this strategy have been proposed during components Lij = kiδ(i; j)−Aij, where ki is the degree of the years, such as the fast clustering algorithm provided by a node i, δ(i; j) is the Kronecker delta (that is, δ(i; j) = 1 Clauset, Newman and Moore [11], that runs in O(n log n) on if and only if i = j) and Aij is the adjacency matrix sparse graphs; the extremal optimization method proposed representing the graph connections. Another approach relies by Duch and Arenas [12], based on a fast agglomerative on the strategy of the ratio cut partitioning [5], [6]. This approach, with O(n2 log n) time complexity; the Newman- is a function that, if minimized, allows the identification Leicht [13] mixture model based on statistical inferences; of large clusters with a minimum number of outgoing other maximization techniques by Newman [14] based on interconnections. The principal issue of spectral clustering eigenvectors and matrices. based techniques is that one has to know in advance the The state-of-the-art technique is called Louvain method number and the size of communities comprised in the given (LM) [1]. This strategy is based on local information and network. This makes this strategy unfeasible if the purpose is is well-suited for analyzing large weighted networks. It is to discover the unknown community structure of a network. based on the two simple steps: i) each node is assigned The strategy exploited in this paper adopts the second to a community chosen in order to maximize the network paradigm, the one relying on the concept of network modu- modularity Q; the gain derived from moving a node i into larity. It can be explained as follows: let consider a network, a community C can simply be calculated as [1] represented by means of a graph G = (V; E), partitioned into m communities; assuming l the number of edges P C P 2 P P 2 s +k +ki C i C^ C C^ ki ∆Q = − − − − (2) between nodes belonging to the s-th community and ds is 2m 2m 2m 2m 2m the sum of the degrees of the nodes in the s-th community, the network modularity Q is given by where P is the sum of the weights of the edges inside P C C, C^ is the sum of the weights of the edges incident m " 2# X ls ds to nodes in C, ki is the sum of the weights of the edges Q = − (1) C jEj 2jEj incident to node i, ki is the sum of the weights of the edges s=1 from i to nodes in C, m is the sum of the weights of all Intuitively, high values of Q implies high values of ls for the edges in the network; ii) the second step simply makes a each discovered community; thus, detected communities are new network consisting of nodes that are those communities dense within their structure and weakly coupled among each previously found. Then the process iterates until a significant other. Equation 1 reveals a possible maximization strategy: improvement of the network modularity is obtained. in order to increase the value of the first term (namely, the In this paper we present an efficient community detection coverage), the highest possible number of edges should fall algorithm which represents a generalization of the LM. in each given community, whereas the minimization of the In fact, it can be applied even on unweighted networks second term is obtained by dividing the network in several and, most importantly, it exploits both global and local communities with small total degrees. information. To make this possible, our strategy computes The problem of maximizing the network modularity has the pairwise distance between nodes of the network. To do been proved to be NP complete [7]. To this purpose, several so, edges are weighted by using a global feature which heuristic strategies to maximize the network modularity Q represents their aptitude to propagate information through have been proposed as to date. Probably, the most pop- the network. The edge weighting is based on the κ-path ular one is called Girvan-Newman strategy [8], [9]. This edge centrality, a novel measure whose calculation requires approach works in two steps, i) ranking edges by using the a near linear computational cost [15]. Thus, the partition of betweenness centrality as measure of importance; ii) deleting the network is obtained improving the LM.