2017 International Conference on Data Science and Advanced Analytics
Discovering Community Structure in Multilayer Networks
Soumajit Pramanik1, Raphael Tackx2, Anchit Navelkar1, Jean-Loup Guillaume3 and Bivas Mitra1 1Department of Computer Science & Engineering, IIT Kharagpur, India 2Sorbonne Universites,´ UPMC Universite´ Paris 06, CNRS, LIP6 UMR 7606, 4 place Jussieu 75005 Paris, France 3L3I, University of La Rochelle, France [email protected], [email protected], [email protected] [email protected], [email protected]
Abstract —Community detection in single layer, isolated net- Proximity Links works has been extensively studied in the past decade. However, Location/ many real-world systems can be naturally conceptualized as L2 L5 L1 Business multilayer networks which embed multiple types of nodes and L Layer L3 4 relations. In this paper, we propose algorithm for detecting communities in multilayer networks. The crux of the algorithm is Review Links based on the multilayer modularity index QM , developed in this U U paper. The proposed algorithm is parameter-free, scalable and U1 U3 5 7 User adaptable to complex network structures. More importantly, it Layer U U can simultaneously detect communities consisting of only single U2 4 6 type, as well as multiple types of nodes (and edges). We develop Friendship Links a methodology to create synthetic networks with benchmark multilayer communities. We evaluate the performance of the Fig. 1. A sample multilayer (Yelp) network. proposed community detection algorithm both in the controlled environment (with synthetic benchmark communities) and on the reveal complex interactions between multi-type nodes and empirical datasets (Yelp and Meetup datasets); in both cases, the heterogeneous links. They are also found to be beneficial for proposed algorithm outperforms the competing state-of-the-art algorithms. different data mining tasks such as context-sensitive search, prediction and recommendation [3] etc. Community detection Keywords -Multilayer Network, Modularity, Community Detec- in multilayer networks is challenging as the detected com- tion munities have possibility to contain only single or multiple types of nodes. Most of the recent endeavors concentrated I. INTRODUCTION on the multiplex networks [4], [5] where all layers share Communities are defined as groups of nodes that are more the identical set of nodes but may have multiple types of densely connected to each other than to the rest of the network. interactions. In multiplex network, some of the approaches The goal of the community detection algorithms, consequently, propose new quality metrics [4] to measure the goodness of the is to partition the networks into groups of nodes; large detected communities whereas a few other approaches utilize body of work exists on community detection in single and random walk [5] or frequent-pattern mining techniques [6] isolated network [1]. Recently, many real networks, including to obtain structurally similar components. In principle, most communication, social, infrastructural and biological ones, are of the aforementioned algorithms transform the problem to often represented as multilayer networks [2]. A multilayer the classical community detection in a monoplex network network is comprised of multiple interdependent networks, leveraging on the fact that in multiplex network, one-to-one where each network layer represents one aspect of interaction. cross layer links connect the copies of the same nodes in Moreover, the functionality of a node in one network layer is multiple layers. Unfortunately, the presence of heterogeneous dependent on the role of nodes in other layers. For instance, a nodes across multiple layers and cross layer dependency links location based social network (say, Yelp) can be represented as make the aforementioned solutions inadequate for multilayer a multilayer network (see Fig. 1) where in one layer customers networks. (visitors) are connected via social links and in the other layer Attempts have been made in bits and pieces to detect location nodes are connected through proximity links. The communities in multilayer networks; novel methodologies (coupling) link connecting a customer with a location node have been introduced such as Dirichlet process [7], tensor represents the visit of a customer to a location. factorization [3], subspace clustering [8], non-negative matrix Community detection in complex multilayer networks is an factorization [9] etc. However, most of these approaches suffer important research problem. The communities in multilayer from several limitations. First of all, some of the aforesaid networks help to identify functionally cohesive sub-units and algorithms only work on a specific type of multilayer networks
978-1-5090-5004-8/17 $31.00 © 2017 IEEE 611 DOI 10.1109/DSAA.2017.71 layer algorithms on the empirical datasets (Yelp and Meetup) L1 and demonstrate that they outperform the state of the art baselines in correctly discovering the communities (sec. VII).
II. REPRESENTATION &PROBLEM STATEMENT
L2 We start with formally representing the multilayer network
Multilayer Ground-Truth Single-Layer Ground- and defining the respective communities. Next, we state the Communities Truth Communities problem of detecting communities in multilayer network and Fig. 2. Network configurations with two different types of ground truth the key challenges. communities. A. Representation (say, star-type [7] etc.). Secondly, some of them are forced G =(G G ) to detect communities comprising only multiple types of We represent a multilayer network as a tuple U , B G = { : ∈{1 2 }} nodes [3], hence introducing bias. Third, the desired number where U Li i , ,...,M is a family of M G G = { : of communities are required to be fixed apriori for most of uni-partite graphs (called layers of ) and B Lij ∈{1 2 } = } them [3], [8], limiting their capability to discover the true i, j , ,...,M ,i j is the family of bipartite set of communities. Finally, a proper framework to generate graphs containing nodes from individual layers and the cross benchmark communities for generic multilayer networks is not layer interconnections among them. We denote each layer =( ) available in any of them. Li Vi,Ei where Vi and Ei are respectively the set Recent endeavors directed towards development of modular- of nodes and intra-layer edges present in Li. In the same ( ) ity index for heterogeneous networks. For instance, composite line, we can represent Lij as a triplet Vi,Vj,Eij where { ⊆{ × } : ∈{1 2 } = } modularity [10] calculates the modularity of a multi-relational Eij Vi Vj i, j , ,...,M ,i j is the network as the integration of modularities calculated for each set of coupling edges between nodes of layers Li & Lj. G single-relational subnetwork. However, due to the deficiency Definition: A community C in a multilayer network is (C C ) G in definition, the composite modularity can only produce defined as a cohesive module U , B of containing a subset communities with single type of nodes. On the other side, of nodes from one or more layers and all the edges having both C C modularity proposed in [11] in the context of gene-chemical endpoints incident on them. Mathematically, U and B can C = { C =( C C ): C ⊆ C = interaction network fails to conceive the role of coupling links be defined as U Li Vi ,Ei Vi Vi,Ei { ∩ ( C × C )} ∈{1 2 }} C = { C = in communities characterization. The detailed exploration of Ei Vi Vi ,i , ,...,M and B Lij ( C C C ): C ⊆ C ⊆ C = { ∩ ( C × prior art reveals the importance of the multilayer community Vi ,Vj ,Eij Vi Vi,Vj Vj,Eij Eij Vi C )} ∈{1 2 } = } detection algorithm which is free from (a) any external param- Vj ,i,j , ,...,M ,i j . G eter, say total number of communities (b) any bias towards Importantly, communities of a multilayer network can be communities with only single type or only multiple types of divided into two types (see Fig. 2) (a) cross layer communities |C |=Φ nodes. Developing a suitable modularity index should be the (containing multiple types of nodes) for which B ; first step towards this direction. (b) single layer communities (containing only single type of |C | =Φ In this paper, we propose a community detection algorithm nodes) for which B . for multilayer networks which is able to detect communities B. Problem Statement comprising both single type as well as multiple types of The problem of multilayer community detection algorithm nodes, depending on the network structure. First we represent G the multilayer network with proper notations and define the is to divide the network into a set of disjoint cohesive modules C1,C2,...,CK which is a cover of the nodes in problem of community detection (sec. II). Next, we develop G a methodology to construct synthetic multilayer network with such that each module Ci is comprised of a group of ground truth communities and evaluate it rigorously (sec. III). nodes densely connected inside & loosely connected outside The major contribution of this paper is to propose a modu- the community. The key challenges of this problem are two-fold - (a) dealing larity index QM for characterizing communities in multilayer networks. Subsequently, we develop the multilayer community with multilayer network which contains multiple types of links detection algorithms GN-Q and Louvain-Q incorporating (of different densities) & nodes and (b) detecting both cross M M layer & single layer communities simultaneously without any the modularity index QM . We present the convergence proof for both the proposed algorithms along with their complexity additional parameter. analysis (sec. IV). We first evaluate the performance of the III. DATASET proposed modularity as a community scoring metric (sec. V) A. Synthetic Dataset Generation and then tested the performance of the developed algorithms against the competing algorithms. Controlled experiments, In this section, we propose a methodology to generate performed on the synthetic network, exhibit the ability of benchmark multilayer networks with ground truth communi- the GN-QM and Louvain-QM algorithms to efficiently detect ties. The parameter α regulates the proportion of cross layer communities comprising both single type and multiple types vs. single layer communities in the benchmark. The network on nodes (sec. VI). Finally, we evaluate the proposed multi- contains M number of different layers where each layer Li
612 NMI NMI NMI NMI
p = 0.2 mu = 0.2 mu = 0.2 p = 0.2 CompMod p = 0.4 CompMod mu = 0.4 CompMod mu = 0.4 CompMod p = 0.4 MetaFac p = 0.6 MetaFac mu = 0.6 MetaFac mu = 0.6 MetaFac 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8 (a) mu, alpha = 0.4 (b) mu, alpha = 1.0 (c) p, alpha = 0.4 (d) p, alpha = 1.0
Fig. 3. Change of NMI values with μ and p for ‘CompMod’ and ‘MetaFac’ on 2-layer networks with 100 nodes in each layer, generated with maximum i degree kmax =10, average degree ki =6& coupling link density d =0.07. (a). contains Ni nodes (Ni = |Vi|) with average degree ki. The methodology contains the following three steps: 1) Experimental Setup: We implement the following state- Step 1. Single layer communities: First, we apply the LFR of-the-art multilayer community detection algorithms to detect benchmark algorithm [12] to generate communities at each the communities in step (b). (i) MetaFac [3]: This algorithm detects communities based layer Li with Ni nodes where both degree and community size distributions follow power law distribution with exponents on the tensor factorization and requires the number of com- munities to be specified apriori1. It detects communities in a γi and βi respectively. We fix the mixing parameter as μi to way such that each of them contains at least one node from construct Ci single layer communities in layer Li. Step 2. Cross layer communities: We combine the commu- every layer (i.e. only cross layer communities). nity xi ∈Ci of layer Li with community xj ∈Cj of layer Lj to (ii) CompMod [10]: This algorithm detects communities by create the cross layer community xij. Assuming |Ci| and |Cj| maximizing composite modularity, which is a combination of as the number of communities in layers Li and Lj respectively, the modularities of different subnetworks. |Cc| = min{|Ci| , |Cj|} denotes the maximum possible number We compute normalized mutual information (NMI) [13] of cross layer communities. We construct (|Cc|×α) cross layer index to compare the detected communities with the ground communities by randomly combining single-layer communi- truth communities (step (b)). NMI is a measure of similarity ties from both the layers Li & Lj respectively; notably each of communities, which attains a high value if the ground truth cross layer community xij may contain one or multiple single and detected communities exhibit good agreement. layer communities. 2) Evaluation: Finally, we accomplish the step (c) by Step 3. Coupling links: Finally, we create the coupling demonstrating that synthetic network possesses the desired links between the layers Li and Lj with density dij. Fraction p properties, in the following way. denotes the mixing parameter for the cross layer communities. Varying μ: In Fig. 3(a) and Fig. 3(b), we increase the We first distribute (Ni × Nj × dij) coupling links randomly mixing parameter μ (fraction of edges going out of single layer between the layers Li and Lj where each link has one end communities) which degrades the qualities of the communities. in Li and another end in Lj. Next, we rewire the coupling Clearly, as μ increases, NMI drops for both the state of the links such that p fraction of links stay inside the cross layer art algorithms, irrespective of α and p values. This points to communities and the remaining 1−p fraction of links connect the fact that with increasing μ, the ground truth communities different cross layer communities. degrade independent of the type of communities, which meets our expectation. B. Synthetic Dataset Evaluation Varying p: Similarly, Fig. 3(c) and Fig. 3(d) focuses on In the following, we evaluate the performance of the syn- p, which intuitively regulates the cohesiveness of the cross thetic network generation algorithm; we examine whether the layer communities. Evidently, the obtained NMI rises with generated networks behave consistently with our expectation. p for both the community detection algorithms. The slope is In order to accomplish the task, we adhere to the following ap- observed to be relatively steeper for the higher α, as it indicates proach - (a) we generate synthetic networks varying different the presence of more number of cross layer communities model parameters μ, α & p, which intrinsically regulate the (magnifying the effect of p). quality of the community structures present in the generated networks. (b) we apply state-of-the-art multilayer community IV. DEVELOPING MULTILAYER COMMUNITY DETECTION detection algorithms [3], [10] on this synthetic networks and ALGORITHM evaluate the quality of the detected communities with respect In this section, first we develop the modularity index QM to to the ground truth communities. (c) we conclude that the characterize the quality of multilayer communities. Next, we synthetic network possesses desired properties, only if the quality of the detected communities in step (b) is consistent 1For our experiment, we vary the input number of communities from two with the quality of ground truth communities specified in step to fifty and report the one exhibiting highest overlap with the ground truth.
613 show that simple adaptation of QM with classical methodolo- network and finally derive the multilayer modularity index gies may lead to the development of multilayer community QM . detection algorithms. 1) Cross Layer Communities: Any cross layer community C C is composed of three submodules - two intra layer (L1 , A. Desired properties of multilayer communities: Intuition C C L2 ) and one inter layer (L12). The vanilla null model proposed We start this section highlighting the desired properties 1 2 in [14] directly derives the Pij, Pij for intra layer submodules. of the communities in a multilayer network. As introduced The expected number of edges between any two nodes i and in section II, in multilayer networks, we observe two types j (with intra-layer degrees hi and hj respectively) within the of communities (see Fig. 2) (a) cross layer communities 1 =( ∗ ) 2 | | community C can be calculated as Pij hi hj / E1 (containing multiple types of nodes) and (b) single layer C 2 =( ∗ ) 2 | | (for submodule L1 ) and Pij hi hj / E2 (for sub- communities (containing only single type of nodes). module LC ). For inter layer or bipartite submodule LC , the =(C C ) G 2 12 For an ‘ideal’ cross layer community C U , B of probability of an edge between node i and j depends on |C | =Φ (where B ), the desired properties are the following, their respective coupling degrees c and c . The probability X1 : i j Property The group of nodes in each uni-partite and of having a coupling edge between i, j can be estimated as C C 12 bipartite layer (Li s and Lijs respectively) should be highly =( ∗ ) | | Pij ci cj / E12 (similar to [15]). The aforementioned cohesive. null models satisfy the desired requirements of PropertyX1 X2 : Property The (coupling) edges in the bipartite layer introduced in section IV-A. C ∈C C C Lij B should connect most of the nodes from Li (i.e. Vi ) C C Each cross layer community C is represented as with most of the nodes from L (i.e. V ). C C C C C j j {{L1 ,L2 }, {L12}} where L1 and L2 are the submodules Similarly, for an ‘ideal’ single layer community C = C with edges from E1 and E2 respectively and L12 is from E12; (CU CB) of G (where CB =Φ), the desired properties are , we substitute P in Eq. 1 to define its modularity as enumerated below, ij ⎡
PropertyS1 : The community should be highly cohesive 1 1 (hi ∗ hj) C = C ⊆ QC = ∀i, j ∈ C ⎣ A − within the layer Li to which it belongs (i.e. U Li Li). M 3 2 | | ij 2 | | 2 E1 E1 S : C C i,j∈V1 Property The nodes in Li (i.e. Vi ) should be very loosely connected with nodes of other layers Lj. 1 (ci ∗ cj) + Aij − (2) | 12| | 12| B. Multilayer Modularity Index E ∈ ∈ E i V1,j V2 ⎤
Following the aforesaid intuitions, we propose multilayer 1 ( ∗ ) G = {{ } { }} hi hj ⎦ modularity for a two-layer network L1,L2 , L12 + Aij − 2 |E2| 2 |E2| where L1 =(V1, E1)&L2 =(V2, E2) are the individual layers i,j∈V2 and L12 =(V1, V2, E12) is the bipartite graph connecting nodes In case of inter layer submodule, if node i in C is not 1 2 of layer L and L . connected with any other layer (hence, coupling degree ci We start with the basic notion of Newman-Girvan modu- zero), we use its intra layer degree hi as a proxy of ci larity [14], Q = 1 (A − P )δ(ψ ,ψ ) where m is 12 2m i,j ij ij i j for computing Pij . This allows us to penalize those nodes the total number of edges in the network, i, j ∈ V1 ∪ V2 are in cross layer community C which are only connected with any pair of nodes in the network, ψi indicates the community nodes within the same layer and not with the nodes in C of membership of node i and δ(ψ ,ψ ) is the Kronecker delta 12 i j different layer. This amendment in the null model Pij satisfies 1 = 2 function which is only if ψi ψj i.e. i and j belong to the the desired requirements of PropertyX . Therefore with this 0 same community (and otherwise). Aij represents the classi- modification, the Eq. 2 can be written as, G ⎡ cal adjacency matrix of . The penalty term Pij (often referred 1 1 ( ∗ ) as null model) is the expected probability of having an edge C = ∀ ∈ ⎣ − hi hj + QM i, j C Aij (3) between nodes i and j if edges are placed at random. Notably, 3 2 |E1| 2 |E1| i,j∈V1 in multilayer network the edges in one layer Li are intrinsically 1 (c i ∗ c j) different from another layer Lj, which gets reflected in the Aij − 2 |E1| +2|E2| + |E12| 2 |E1| +2|E2| + |E12| property of the corresponding layers as well (say diverse edge i∈V1,j∈V2 ⎤ densities across the layers). This observation motivates us to 1 ( ∗ ) hi hj ⎦ introduce layer-wise discriminating null models, and propose + Aij − 2 |E2| 2 |E2| the following multilayer modularity, i,j∈V2 1 QM = {(Aij − Pij)δ(ψi,ψj)} where (1) = 0 = 2m where for any node i, c i ci if ci > and c i hi otherwise. ⎧ ij 2) Single Layer Communities: In any single layer commu- ⎨ 1 ∈ & ∈ Pij if i V1 j V1 nity C, all the constituent nodes belong to either L1 or L2. The = 2 ∈ & ∈ 1 Pij Pij if i V2 j V2 null models can be directly derived as P =(h ∗ h )/2 |E1| ⎩ 12 ij i j P if (i ∈ V1 & j ∈ V2) or (i ∈ V2 & j ∈ V1) 2 =( ∗ ) 2 | | ij and Pij hi hj / E2 for layer L1 and L2 respectively 1 2 In the following, we compute the null model terms Pij, Pij from the vanilla null model [14]. These null models satisfy 12 S1 and Pij separately for both types of communities of multilayer the Property introduced in section IV-A. Hence, for each
614 C single layer community C represented as {L1 }, we substitute relationships between them, there would be L single layer and L − 1 bipartite modularity terms in the corresponding Q Pij in Eq. 1 to compute⎡ the modularity as, ⎤ M equation. 1 1 ( ∗ ) C = ∀ ∈ ⎣ − hi hj ⎦ QM i, j C Aij (4) 3 2 |E1| 2 |E1| C. Community detection algorithm i,j∈V1 We leverage on the single layer community detection al- Importantly, there can be many nodes in C which are con- gorithms Girvan-Newman [16] & Louvain [17] which detect nected to other layer nodes via coupling edges, violating communities by maximizing Girvan-Newman modularity [14]. PropertyS2 of desired single layer community. In order to penalize those nodes in a community C with coupling degree Algorithm 2: Louvain-QM ci,weaddci along with hi to estimate the null model. Input : A multilayer network G as defined in section IV-B Subsequently, the modularity of C becomes where V = V1 ∪ V2. C = ∀ ∈ Output: Maximum QM of G and detected communities. Q⎡ M i, j C ⎤ 1 while True do 1 1 ( + ) ∗ ( + ) 2 Place each node of G into a single community ⎣ hi ci hj cj ⎦ Aij − 3 Save QM for this decomposition 3 2 |E1| + |E12| 2 |E1| + |E12| i,j∈V1 4 while there are moved nodes do 5 foreach node n ∈ V do (5) 6 c = neighboring community of n maximizing QM Finally, combining both types of communities, the overall increase modularity of⎡ the network can be represented as 7 if c results in a strictly positive increase then 8 move n from its community to c 1 nC 1 ⎣ 9 end QM = ∀i, j ∈ Ck Aij − 3 2 |E1| + θC ∗|E12| 10 end =1 k i,j∈V1 k 11 end (h + θ ∗ c ) ∗ (h + θ ∗ c ) 1 12 if QM reached is higher than the initial QM then i Ck i j Ck j + 13 Display the partition found 2 |E1| + θC ∗|E12| 2 |E1| +2|E2| + |E12| k 14 Transform G into the network between communities ( ∗ ) c i c j 15 else Aij − 2 |E1| +2|E2| + |E12| 16 break i∈V1,j∈V2 17 end 1 + 18 end 2 | 2| + ∗| 12| 19 return E ⎤θCk E