Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16)

Modularity Based Community Detection with Deep Learning

Liang Yang^{1,2,3}, Xiaochun Cao^{1}, Dongxiao He^{3,*}, Chuan Wang^{1}, Xiao Wang^{4}, Weixiong Zhang^{5,6}

1 State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences
2 School of Information Engineering, Tianjin University of Commerce
3 School of Computer Science and Technology, Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin University
4 Department of Computer Science and Technology, Tsinghua University
5 College of Math and Computer Science, Institute for Systems Biology, Jianghan University
6 Department of Computer Science and Engineering, Washington University in St. Louis

{yangliang, caoxiaochun, wangchuan}@iie.ac.cn, [email protected], [email protected], [email protected]

*Corresponding author.

Abstract

Identification of module or community structures is important for characterizing and understanding complex systems. While designed with different objectives, i.e., stochastic models for regeneration and modularity maximization models for discrimination, both of these types of model look for a low-rank embedding to best represent and reconstruct network topology. However, the mapping through such an embedding is linear, whereas real networks have various nonlinear features, making these models less effective in practice. Inspired by the strong representation power of deep neural networks, we propose a novel nonlinear reconstruction method that adopts deep neural networks for representation. We then extend the method to a semi-supervised community detection algorithm by incorporating pairwise constraints among graph nodes. Extensive experimental results on synthetic and real networks show that the new methods are effective, outperforming most state-of-the-art methods for community detection.

1 Introduction

Real-world systems often appear as networks, e.g., social networks in Facebook, protein interaction networks, power grids and the Internet. Real-world networks often consist of functional units, which manifest in the form of network modules or communities: subnetworks whose nodes are more tightly connected with one another than with the rest of the network. Finding network communities is, therefore, critical for characterizing the organizational structures and understanding complex systems.

A great deal of effort has been devoted to developing network community finding methods, among which two are widely adopted and thus worth mentioning. The stochastic model [Psorakis et al., 2011] focuses on deriving generative models of networks. Such a model in essence maps a network to an embedding in a low-dimensional latent space. The mapping can be done by, e.g., nonnegative matrix factorization (NMF) [Wang et al., 2008]. Unlike the stochastic model, the modularity maximization model [Newman, 2006], as the name suggests, attempts to maximize a modularity function on network substructures. The optimization can be done by eigenvalue decomposition (EVD), which is equivalent to reconstructing a low-rank modularity matrix.

In short, while appearing in different forms, the stochastic model and the modularity maximization model share an essential commonality, i.e., mapping a network to a low-dimensional latent space embedding. Motivated by the strong discrimination power of the modularity maximization model and the relationship between maximizing modularity and reconstructing the modularity matrix through a low-dimensional embedding, we aim to seek a more effective reconstruction algorithm for modularity optimization. However, the embeddings adopted in the two popular models are linear, e.g., NMF in the stochastic model and EVD in the modularity maximization model. This is in sharp contrast to the fact that real-world networks are full of nonlinear properties, e.g., a relationship among nodes may not necessarily be linear. As a result, the representation power of these linear mapping based models is limited on real-world networks.

Neural networks, particularly those with deep structures, are well known to provide nonlinear low-dimensional representations [Bourlard and Kamp, 1988]. They have been successfully applied to complex problems and systems in practice, such as image classification, speech recognition, and playing the ancient strategic board game of Go. To the best of our knowledge, however, deep neural networks have not yet been successfully applied to community detection.

Taking advantage of the nonlinear representation power of deep neural networks, we propose in this paper a deep nonlinear reconstruction (DNR) algorithm for community detection.

It is known that information beyond network topology can greatly aid network community identification [Zhang, 2013; Yang et al., 2015]. Examples of such information include semantics on nodes, e.g., names and labels, and constraints on relationships among nodes, e.g., community membership constraints between adjacent nodes (pairwise constraints). In the current study, we extend our DNR method to a semi-supervised DNR (semi-DNR) algorithm that explicitly incorporates pairwise constraints among nodes to further improve community detection.

2 Reconstruction based Community Detection

We consider an undirected and unweighted graph $G = (V, E)$, where $V = \{v_1, v_2, \dots, v_N\}$ is the set of $N$ vertices and $E = \{e_{ij}\}$ is the set of edges, each of which connects two vertices in $V$. The adjacency matrix of $G$ is a nonnegative symmetric matrix $A = [a_{ij}] \in \mathbb{R}_+^{N \times N}$, where $a_{ij} = 1$ if there is an edge between vertices $i$ and $j$, $a_{ij} = 0$ otherwise, and $a_{ii} = 0$ for all $1 \le i \le N$. The degree of vertex $i$ is defined as $k_i = \sum_j a_{ij}$. The problem of community detection is to find $K$ modules or communities $\{V_i\}_{i=1}^K$ that are subgraphs whose vertices are more tightly connected with one another than with outside vertices. Here, we consider disjoint communities, i.e., $V_i \cap V_j = \emptyset$ for $i \ne j$.

2.1 Stochastic Model

In the stochastic model [Psorakis et al., 2011; He et al., 2015; Jin et al., 2015], $a_{ij}$ can be viewed as the probability that vertices $i$ and $j$ are connected. This probability can be further considered to be determined by the probabilities that vertices $i$ and $j$ generate edges belonging to the same community. We introduce latent variables $H = [h_{ik}] \in \mathbb{R}_+^{N \times K}$, with $h_{ik}$ representing the probability that node $i$ generates an edge belonging to community $k$. This latent variable also captures the probability that node $i$ belongs to community $k$, and each row of $H$ can be considered a community membership distribution of a vertex. The probability that vertices $i$ and $j$ are connected by a link belonging to community $k$ is then $h_{ik} h_{jk}$, and the probability that they are connected is

$$\hat{a}_{ij} = \sum_{k=1}^{K} h_{ik} h_{jk}.$$

As a result, the community detection problem can be formulated as a nonnegative matrix factorization $A \approx \hat{A} = HH^T$. The NMF-based community detection approaches [Psorakis et al., 2011] aim to find a nonnegative membership matrix $H$ to reconstruct the adjacency matrix $A$. There are two common objective (loss) functions to quantify the reconstruction error. The first is based on the square loss function [Wang et al., 2008; Zhang et al., 2007], which is equivalent to the square of the Frobenius norm of the difference between the two matrices,

$$\mathcal{L}_{LSE}(A, HH^T) = \|A - HH^T\|_F^2.$$

The second is based on the Kullback-Leibler divergence (KL-divergence) between the two matrices,

$$\mathcal{L}_{KL}(A, HH^T) = KL(A \,\|\, HH^T).$$

The index of the largest element in the $i$th row of $H$ indicates the community that node $i$ belongs to. There are many variations of the stochastic model, such as nonnegative matrix tri-factorization and the stochastic block model. Nearly all of these models can be intuitively viewed as finding new representations in a low-dimensional space that can best represent and reconstruct the original network.

2.2 Modularity Maximization Model

This model was introduced by Newman [Newman, 2006] to maximize a modularity function $Q$, which is defined as the difference between the number of edges within communities and the expected number of such edges over all pairs of vertices. For example, consider a network with two communities; then

$$Q = \frac{1}{4m} \sum_{ij} \Big( a_{ij} - \frac{k_i k_j}{2m} \Big) h_i h_j,$$

where $h_i$ equals 1 (or -1) if $i$ belongs to the first (or second) group, $\frac{k_i k_j}{2m}$ is the expected number of edges between vertices $i$ and $j$ if edges are placed randomly, $k_i$ is the degree of vertex $i$, and $m = \frac{1}{2}\sum_i k_i$ is the total number of edges in the network. By defining the modularity matrix $B = [b_{ij}] \in \mathbb{R}^{N \times N}$, whose elements are $b_{ij} = a_{ij} - \frac{k_i k_j}{2m}$, modularity $Q$ can be written as

$$Q = \frac{1}{4m} h^T B h, \qquad (1)$$

where $h = [h_i] \in \mathbb{R}^N$ is a community membership indicator vector. Maximizing Eq. (1) is NP-hard, for which many optimization algorithms have been proposed, such as extremal optimization [Duch and Arenas, 2005]. In practice, we can relax the problem by allowing the variable $h_i$ to take any real value with $h^T h = N$. To generalize Eq. (1) to $K > 2$ communities, we can define an indicator matrix $H = [h_{ij}] \in \mathbb{R}^{N \times K}$ and obtain

$$\mathrm{MOD}(H, B) = Q = \mathrm{Tr}(H^T B H), \quad \text{s.t. } \mathrm{Tr}(H^T H) = N,$$

where $\mathrm{Tr}(\cdot)$ is the trace of a matrix. Based on the Rayleigh quotient, the solution to this problem is given by the eigenvectors of the $K$ largest eigenvalues of the modularity matrix $B$. Each row of matrix $H$ can be regarded as a new representation of the corresponding vertex in the latent space, and clustering algorithms, such as k-means, can be used to classify the nodes into disjoint groups in the latent space.
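To make this linear pipeline concrete, the following is a minimal sketch (illustrative only, not the implementation used in our experiments; it assumes numpy and scikit-learn, and the helper names are ours) that builds the modularity matrix and clusters the rows of its top-$K$ eigenvector embedding:

```python
import numpy as np
from sklearn.cluster import KMeans

def modularity_matrix(A):
    """B_ij = a_ij - k_i * k_j / (2m), as defined above."""
    k = A.sum(axis=1)                    # vertex degrees k_i
    m = k.sum() / 2.0                    # total number of edges
    return A - np.outer(k, k) / (2.0 * m)

def spectral_communities(A, K):
    """Relaxed modularity maximization: embed each vertex as a row of the
    top-K eigenvectors of B, then recover K disjoint groups with k-means."""
    B = modularity_matrix(A)
    _, vecs = np.linalg.eigh(B)          # eigenvalues in ascending order
    H = vecs[:, -K:]                     # eigenvectors of the K largest eigenvalues
    return KMeans(n_clusters=K, n_init=10).fit_predict(H)

# Toy example: two 4-cliques joined by a single edge.
A = np.zeros((8, 8))
A[:4, :4] = 1.0
A[4:, 4:] = 1.0
np.fill_diagonal(A, 0.0)
A[0, 4] = A[4, 0] = 1.0
print(spectral_communities(A, K=2))      # e.g. [1 1 1 1 0 0 0 0]
```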

The Eckart-Young-Mirsky theorem [Eckart and Young, 1936] explores the relationship between such reconstruction and the singular value decomposition (SVD).

Theorem 1. [Eckart and Young, 1936] For a matrix $D \in \mathbb{R}^{m \times n}$ ($m \ge n$), let $D = U \Sigma V^T$ be the singular value decomposition of $D$ with $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_m)$, where $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_m$, and partition $U$, $\Sigma$ and $V$ as follows:

$$U = [U_1 \; U_2], \quad \Sigma = \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{bmatrix}, \quad V = [V_1 \; V_2],$$

where $\Sigma_1$ is $r \times r$, $U_1$ is $m \times r$ and $V_1$ is $n \times r$. Then the optimal solution to the following problem,

$$\mathop{\mathrm{argmin}}_{\hat{D} \in \mathbb{R}^{m \times n},\ \mathrm{rank}(\hat{D}) \le r} \|D - \hat{D}\|_F,$$

is $\hat{D}^* = U_1 \Sigma_1 V_1^T$, and

$$\|D - \hat{D}^*\|_F^2 = \sigma_{r+1}^2 + \dots + \sigma_m^2.$$

The above theorem means that the matrix reconstructed from the singular vectors corresponding to the $K$ largest singular values is the best rank-$K$ approximation to the original matrix under the Frobenius norm. Since the modularity matrix $B$ is symmetric, there exists an orthogonal decomposition $B = U \Lambda U^T$, where $\Lambda$ is a diagonal matrix with the eigenvalues of $B$ as its diagonal elements and $U^T U = I$. Thus, the matrix reconstructed from the eigenvectors corresponding to the $K$ largest eigenvalues is also the best rank-$K$ approximation to the input matrix $B$. Therefore, the modularity maximization problem can be regarded as reconstructing the modularity matrix using a low-rank approximation.
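This low-rank view is easy to verify numerically. The sketch below (a random graph, with sizes and rank $K$ chosen arbitrarily for illustration) reconstructs a modularity matrix from its $K$ leading eigenpairs; since $B$ is symmetric, its singular values are the absolute values of its eigenvalues, so the squared Frobenius error equals the sum of the squared discarded eigenvalues, as Theorem 1 predicts:

```python
import numpy as np

rng = np.random.default_rng(0)
A = (rng.random((30, 30)) < 0.2).astype(float)
A = np.triu(A, 1)
A = A + A.T                                   # symmetric adjacency, zero diagonal
k = A.sum(axis=1)
m = k.sum() / 2.0
B = A - np.outer(k, k) / (2.0 * m)            # modularity matrix

K = 4
vals, vecs = np.linalg.eigh(B)
order = np.argsort(np.abs(vals))[::-1]        # singular values of B are |eigenvalues|
U = vecs[:, order[:K]]
B_hat = U @ np.diag(vals[order[:K]]) @ U.T    # rank-K reconstruction of B

err = np.linalg.norm(B - B_hat, "fro") ** 2
tail = float(np.sum(vals[order[K:]] ** 2))    # squared discarded eigenvalues
print(np.isclose(err, tail))                  # True, matching Theorem 1
```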

Put together, the stochastic model and modularity maximization can be intuitively interpreted as finding low-dimensional representations that best reconstruct given network structures. It is important to note that both of them only reconstruct the original networks by linear reconstruction, using, e.g., NMF or SVD, and ignore nonlinear properties of the networks. It is unclear how NMF and SVD based approaches can be extended to accommodate nonlinear low-dimensional embeddings. We aim to overcome this difficulty by using deep neural networks, the focus of the current paper.

3 Deep Nonlinear Reconstruction Model

We now present a novel deep nonlinear reconstruction (DNR) model for community detection. We first introduce the Auto-Encoder, which is a key building block of the model, and then describe a stacked Auto-Encoder. While in the following discussion we focus on finding a nonlinear embedding that best reconstructs the modularity matrix $B$, as inspired by the SVD-based modularity maximization (Section 2.2), our method can be readily applied to other network input forms, such as adjacency and Laplacian matrices.

3.1 Reconstruction based on Auto-Encoder

The Auto-Encoder is a special neural network that is used to learn a new representation that can best approximate the original data [Bourlard and Kamp, 1988; Hinton and Zemel, 1994]. We adopt the modularity matrix $B = [b_{ij}] \in \mathbb{R}^{N \times N}$ as the input to the Auto-Encoder. Here, the elements of $B$ are $b_{ij} = a_{ij} - \frac{k_i k_j}{2m}$, and the $i$th column $b_i$ of $B$ represents vertex $i$. The Auto-Encoder consists of two key components: an encoder and a decoder. The encoder maps the original data $B$ to a low-dimensional embedding $H = [h_{ij}] \in \mathbb{R}^{d \times N}$, where $d < N$, via a nonlinear mapping

$$h_i = f(b_i) = s(W_H b_i + d_H), \qquad (2)$$

where $W_H \in \mathbb{R}^{d \times N}$ and $d_H \in \mathbb{R}^{d \times 1}$ are the parameters to be learned in the encoder, and $s(\cdot)$ is an element-wise nonlinear mapping such as the sigmoid function. The decoder maps the latent representation $H$ back into the original data space, i.e., it reconstructs the original data from the latent representation:

$$m_i = g(h_i) = s(W_M h_i + d_M),$$

where $W_M \in \mathbb{R}^{N \times d}$ and $d_M \in \mathbb{R}^{N \times 1}$ are the parameters to be learned in the decoder, and $g(\cdot)$ is another element-wise nonlinear mapping similar to $s(\cdot)$. The Auto-Encoder aims at learning a low-dimensional nonlinear representation $H$ that can best reconstruct the original data $B$, i.e., it minimizes the difference between the original data $B$ and the reconstructed data $M$ under the parameters $\theta = \{W_H, d_H, W_M, d_M\}$:

$$\hat{\theta} = \mathop{\mathrm{argmin}}_{\theta} L_\theta(B, M) = \mathop{\mathrm{argmin}}_{\theta} \sum_{i=1}^{N} L_\theta(b_i, m_i) = \mathop{\mathrm{argmin}}_{\theta} \sum_{i=1}^{N} L_\theta(b_i, g(f(b_i))), \qquad (3)$$

where $L_\theta(b_i, m_i)$ is a distance function that measures the reconstruction error. Here we adopt the Euclidean distance and the sigmoid cross-entropy distance as distance functions. The sigmoid cross-entropy distance maps the elements of $b_i = [b_{ji}] \in \mathbb{R}^{N \times 1}$ and $m_i = [m_{ji}] \in \mathbb{R}^{N \times 1}$ to $[0, 1]$ using the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$, and then computes the cross-entropy between them as

$$-\sum_{j=1}^{N} \big[ \sigma(m_{ji}) \log \sigma(b_{ji}) + (1 - \sigma(m_{ji})) \log (1 - \sigma(b_{ji})) \big].$$

After training the Auto-Encoder, $W_H$ and $d_H$ are obtained, and Eq. (2) can be used to generate the new representations for all vertices.
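The following is a minimal full-batch sketch of a single Auto-Encoder trained on $B$ with the Euclidean loss (the hidden size d, learning rate eta, iteration count and initialization are illustrative choices, not our experimental settings); its gradient steps anticipate the update rules derived in Section 3.2 below:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_autoencoder(B, d, eta=0.1, iters=5000, seed=0):
    """One Auto-Encoder on the modularity matrix B with Euclidean loss:
    encoder h_i = s(W_H b_i + d_H), decoder m_i = s(W_M h_i + d_M)."""
    rng = np.random.default_rng(seed)
    N = B.shape[0]
    W_H = 0.1 * rng.standard_normal((d, N))
    d_H = np.zeros((d, 1))
    W_M = 0.1 * rng.standard_normal((N, d))
    d_M = np.zeros((N, 1))
    for _ in range(iters):
        H = sigmoid(W_H @ B + d_H)        # encode every column b_i at once
        M = sigmoid(W_M @ H + d_M)        # decode
        # back-propagated errors; for the sigmoid, s'(z) = s(z)(1 - s(z))
        D_M = -(B - M) * M * (1.0 - M)
        D_H = (W_M.T @ D_M) * H * (1.0 - H)
        W_M -= eta * (D_M @ H.T) / N
        d_M -= eta * D_M.mean(axis=1, keepdims=True)
        W_H -= eta * (D_H @ B.T) / N
        d_H -= eta * D_H.mean(axis=1, keepdims=True)
    return H.T                            # one latent row per vertex

# Usage: H = train_autoencoder(B, d=16), then cluster the rows of H.
```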

3.2 Optimization

Eq. (3) can be solved by back-propagation with stochastic gradient descent. In each iteration, the parameters $\theta = \{W_H, d_H, W_M, d_M\}$ are updated as follows:

$$W_\alpha^{ji} = W_\alpha^{ji} - \eta \frac{\partial}{\partial W_\alpha^{ji}} L_\theta(B, M), \qquad d_\alpha^{j} = d_\alpha^{j} - \eta \frac{\partial}{\partial d_\alpha^{j}} L_\theta(B, M),$$

where $\alpha \in \{H, M\}$ and $\eta$ is the learning rate. By defining $z_\alpha = W_\alpha x + d_\alpha$, we have

$$\frac{\partial}{\partial W_\alpha^{ji}} L_\theta(X, g(f(X))) = \sum_{i=1}^{N} \frac{\partial}{\partial W_\alpha^{ji}} L_\theta(x_i, g(f(x_i))),$$

where each term follows from the chain rule through $\delta_\alpha^j = \frac{\partial}{\partial z_\alpha^j} L_\theta(x_i, g(f(x_i)))$, which denotes the contribution of a node to the overall reconstruction error. For the Euclidean distance based $L_\theta(B, M)$,

$$\delta_M^j = -(b_{ji} - m_{ji})\, s'(z_M^j), \qquad \delta_H^j = \Big( \sum_{i=1}^{N} W_M^{ij} \delta_M^i \Big) s'(z_H^j),$$

where $s'(x)$ is the derivative of $s(x)$.

3.3 Stacked Auto-Encoder

Recently, deep learning has been successful on various problems in many fields, such as image classification and semantic segmentation [Liu et al., 2015; Liang et al., 2015]. However, as the number of layers increases, the space of parameters grows exponentially, making optimization inefficient. A compromising strategy is to train the network layer by layer [Vincent et al., 2010].

To take advantage of a deep architecture, we stack a series of Auto-Encoders to form a DNR model. For a deep Auto-Encoder network, we train the first Auto-Encoder by reconstructing the original data, i.e., the modularity matrix $B$, and obtain a new representation $H^1 \in \mathbb{R}^{N \times t_1}$. We then train the $i$th Auto-Encoder by reconstructing the output of the $(i-1)$th Auto-Encoder and obtain a representation $H^i \in \mathbb{R}^{N \times t_i}$, where $t_i < t_{i-1}$.

4 Semi-supervised Deep Nonlinear Reconstruction

Prior knowledge of community memberships is often available in the form of pairwise constraints. We encode it in a constraint matrix $O = [o_{ij}] \in \mathbb{R}^{N \times N}$, where $o_{ij} = 1$ if nodes $i$ and $j$ are known to be in the same community, and $o_{ij} = 0$ otherwise. Thus, we can write the pairwise constraints as

$$\mathcal{R}_{LSE}(O, H) = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} o_{ij} \|h_i - h_j\|_2^2 = \mathrm{Tr}(H^T D H) - \mathrm{Tr}(H^T O H) = \mathrm{Tr}(H^T L H), \qquad (4)$$

where $\mathrm{Tr}(\cdot)$ is the trace of a matrix, $D = [d_{ij}] \in \mathbb{R}_+^{N \times N}$ is a diagonal matrix whose entries are the row sums of $O$, i.e., $d_{ii} = \sum_{j=1}^{N} o_{ij}$, and $L = D - O$ is the graph regularization matrix (Laplacian matrix) of the a priori information $O$. Similarly, the KL-divergence based constraints can be written as

$$\mathcal{R}_{KL}(O, H) = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} o_{ij} \big( D_{KL}(h_i \| h_j) + D_{KL}(h_j \| h_i) \big),$$

which takes into account the asymmetry of the KL-divergence by averaging $D_{KL}(h_i \| h_j)$ and $D_{KL}(h_j \| h_i)$. By minimizing $\mathcal{R}_{LSE}(O, H)$ or $\mathcal{R}_{KL}(O, H)$, we expect the new representations of two nodes $i$ and $j$ to be similar if we have information indicating that these two nodes belong to the same community, i.e., the corresponding element $o_{ij} = 1$. By incorporating the pairwise constraints in Eq. (4) into the reconstruction objective in Eq. (3), weighted by a balancing parameter $\lambda$, we obtain the semi-DNR model.
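As a small illustration of Eq. (4), the sketch below (with hypothetical helper names and toy data) builds the constraint Laplacian L = D - O from a list of must-link pairs and evaluates the regularizer Tr(H^T L H):

```python
import numpy as np

def constraint_laplacian(N, must_link):
    """Build O (o_ij = 1 for node pairs known to share a community), the
    diagonal matrix D with d_ii = sum_j o_ij, and return L = D - O."""
    O = np.zeros((N, N))
    for i, j in must_link:
        O[i, j] = O[j, i] = 1.0
    D = np.diag(O.sum(axis=1))
    return D - O

def r_lse(H, L):
    """Graph regularization R_LSE(O, H) = Tr(H^T L H); it is small when
    constrained node pairs have similar latent rows in H."""
    return float(np.trace(H.T @ L @ H))

# Toy usage: the total semi-DNR loss would add lambda * r_lse(H, L) to the
# reconstruction error of Eq. (3) (the balancing parameter of Section 5.3).
L = constraint_laplacian(5, [(0, 1), (1, 2)])
H = np.random.default_rng(0).random((5, 2))
print(r_lse(H, L))
```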

[Figure 1 plots the normalized mutual information of FN, FEC, Infomap, FUA, EO, SP, MABA and DNR on three benchmarks: (a) GN networks, against the number of inter-community edges per vertex Zout; (b) LFR networks with small communities (n=1000, cmin=10, cmax=50), against the mixing parameter µ; (c) LFR networks with large communities (n=1000, cmin=20, cmax=100), against the mixing parameter µ.]

Figure 1: Comparison of DNR with six competing methods on GN and LFR networks.

Table 1: Performance on real-world networks (the best performance is in bold and the second best performance is in italics in the original layout). N, M and K denote the numbers of nodes, edges and communities, respectively.

| Datasets | N | M | K | SP | EO | FN | FUA | DNR-L2 | DNR-CE |
|---|---|---|---|---|---|---|---|---|---|
| Karate [Zachary, 1977] | 34 | 78 | 2 | 1.000 | 0.587 | 0.692 | 0.587 | 1.000 | 1.000 |
| Dolphins [Lusseau and Newman, 2004] | 62 | 159 | 2 | 0.753 | 0.579 | 0.572 | 0.516 | 0.889 | 0.818 |
| Friendship6 [Xie et al., 2013] | 68 | 220 | 6 | 0.418 | 0.952 | 0.727 | 0.852 | 0.888 | 0.924 |
| Friendship7 [Xie et al., 2013] | 68 | 220 | 7 | 0.477 | 0.910 | 0.762 | 0.878 | 0.907 | 0.932 |
| Football [Girvan and Newman, 2002] | 115 | 613 | 12 | 0.334 | 0.885 | 0.698 | 0.890 | 0.927 | 0.914 |
| Polbooks [Newman, 2006] | 105 | 441 | 3 | 0.561 | 0.557 | 0.531 | 0.574 | 0.552 | 0.582 |
| Polblogs [Adamic and Glance, 2005] | 1,490 | 16,718 | 2 | 0.511 | 0.501 | 0.499 | 0.375 | 0.389 | 0.517 |
| Cora [Yang et al., 2009] | 2,708 | 5,429 | 7 | 0.295 | 0.441 | 0.459 | 0.260 | 0.463 | 0.421 |

5 Experiments

We compared DNR with several widely used community detection methods. Four of them are modularity-based: the extremal optimization (EO) algorithm [Duch and Arenas, 2005], the FUA algorithm [Blondel et al., 2008], the MABA algorithm [He et al., 2012] and the FN algorithm [Newman, 2004]. The remaining methods include the FEC algorithm [Yang et al., 2007] and the Infomap algorithm [Rosvall and Bergstrom, 2008].

To assess the effectiveness of using pairwise constraints, we compared semi-DNR with two existing semi-supervised community detection algorithms, ModLink [Zhang, 2013] and GraphNMF [Yang et al., 2015]. The former transforms pairwise constraints into information on network topology, modifies the adjacency matrix, and uses a conventional algorithm to find communities on the new label-refined networks. The latter incorporates the labels by constraining the nodes belonging to the same community to have similar membership representations. We adopted the normalized mutual information (NMI) as the performance measure.

5.1 Experiment Setup

The layer configurations of the deep neural networks for the different problems tested are shown in Table 2. The networks have at most 3 stacked Auto-Encoders, and the dimension of each latent space is less than that of its input and output spaces. For example, the stacked Auto-Encoder network for the LFR network consists of three Auto-Encoders, where the first is 1,000-512-1,000, the second 512-256-512 and the third 256-128-256. All Auto-Encoders were trained separately. We took a modularity matrix as the input to the first Auto-Encoder and trained it to minimize the reconstruction error, then took the embedding result as the input to the second Auto-Encoder, and so on. We set the training batch to the size of the network and ran at most 100,000 iterations. For each network we trained a DNR model with 10 random initializations and took the latent representations from the three Auto-Encoders for clustering. Here, we adopted k-means for clustering and returned the results with the maximum modularity.
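This layer-by-layer protocol can be sketched as follows; for brevity we use scikit-learn's MLPRegressor as a stand-in for a single Auto-Encoder (an assumption for illustration only; our experiments train the networks with the updates of Section 3.2):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def stacked_embedding(B, hidden_dims, seed=0):
    """Train one Auto-Encoder per entry of hidden_dims on the previous
    layer's output, e.g. hidden_dims=[512, 256, 128] for the LFR row of
    Table 2 (the 1,000-512-1,000, 512-256-512 and 256-128-256 stages)."""
    X = B
    for d in hidden_dims:
        ae = MLPRegressor(hidden_layer_sizes=(d,), activation="logistic",
                          max_iter=2000, random_state=seed)
        ae.fit(X, X)                      # reconstruct the input, as in Eq. (3)
        # hidden activations of the trained Auto-Encoder become the next input
        X = 1.0 / (1.0 + np.exp(-(X @ ae.coefs_[0] + ae.intercepts_[0])))
    return X                              # final latent rows, one per vertex

# Usage: cluster stacked_embedding(B, [512, 256, 128]) with k-means and keep
# the run with the maximum modularity, as described above.
```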

Table 2: Deep nonlinear reconstruction network settings.

| Datasets | N | Layers Configuration |
|---|---|---|
| Karate | 34 | 34-32-16 |
| Dolphins | 62 | 62-32-16 |
| Friendship6 | 68 | 68-32-16 |
| Friendship7 | 68 | 68-32-16 |
| Football | 115 | 115-64-32-16 |
| Polbooks | 105 | 105-64-32-16 |
| Polblogs | 1,490 | 1,490-256-128-64 |
| Cora | 2,708 | 2,708-512-256-128 |
| GN Network | 128 | 128-64-32-16 |
| LFR Network | 1,000 | 1,000-512-256-128 |

5.2 Community Detection Results

We considered two types of synthetic networks, Girvan-Newman (GN) networks [Girvan and Newman, 2002] and Lancichinetti-Fortunato-Radicchi (LFR) networks [Lancichinetti et al., 2008]. Each GN network consists of 128 nodes divided into 4 communities of 32 nodes each. Each node has on average 16 edges, among which Zout edges are inter-community edges. The results are shown in Figure 1(a). As shown, DNR outperforms all competing methods, especially when Zout > 6.

The LFR networks are more complicated than the GN networks. As suggested in [Lancichinetti et al., 2008], we set the number of nodes to 1,000, the average degree to 20, and the exponents of the vertex degree and community size distributions to -2 and -1, respectively, and varied the mixing parameter µ from 0.6 to 0.8.
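For illustration, a GN-style benchmark of this kind can be generated and scored as follows (networkx and scikit-learn are assumed here for convenience, and the eigenvector baseline is only a stand-in; this is not the tooling used in our experiments):

```python
import networkx as nx
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

Zout = 6
# Each vertex has 31 possible intra- and 96 possible inter-community partners,
# so p_in = (16 - Zout) / 31 and p_out = Zout / 96 give average degree 16.
G = nx.planted_partition_graph(4, 32, (16.0 - Zout) / 31.0, Zout / 96.0, seed=1)
truth = np.empty(128, dtype=int)
for c, nodes in enumerate(G.graph["partition"]):
    truth[list(nodes)] = c

A = nx.to_numpy_array(G)
k = A.sum(axis=1)
m = k.sum() / 2.0
B = A - np.outer(k, k) / (2.0 * m)            # modularity matrix input
H = np.linalg.eigh(B)[1][:, -4:]              # a linear baseline embedding
pred = KMeans(n_clusters=4, n_init=10).fit_predict(H)
print(normalized_mutual_info_score(truth, pred))   # the NMI score of Figure 1
```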

[Figure 2 plots the normalized mutual information of GraphNMF, ModLink, DNR-3layers, DNR-2layers and DNR-1layer against the percentage of pairwise labels on four settings: (a) GN network with Zout=7; (b) GN network with Zout=8; (c) LFR network with µ=0.75; (d) LFR network with µ=0.8.]

Figure 2: Comparison of semi-DNR and two competing methods on GN and LFR networks.

To fully evaluate the performance on networks with different community sizes, we generated two groups of networks: one with community sizes from 10 to 50 and the other from 20 to 100. The results on these two groups of networks are shown in Figures 1(b) and 1(c). As shown, DNR can successfully detect both small and large communities. It achieves the best performance when µ > 0.65 on the networks with small community sizes (middle panel of Figure 1) and when µ > 0.6 on the networks with large community sizes (right panel of Figure 1). While Infomap performs slightly better than DNR when µ is small, it completely fails when µ > 0.75.

In summary, the experimental results on synthetic networks showed that the deep nonlinear model is more effective on difficult networks with vague community structures and is competitive on easy networks with clear community structures. While the factors behind such a superb performance compared with other modularity-based methods remain to be further investigated, it may be partially attributed to the nonlinear structure of our new model, which helps to mitigate the resolution limit [Fortunato and Barthelemy, 2007] and extreme degeneracy [Good et al., 2010] problems, both of which modularity optimization often suffers from.

The widely used real networks listed in Table 1 were used for evaluation. As shown in previous work [Psorakis et al., 2011], no single unified loss function seems to exist that can successfully detect communities in all networks. In our experiments, we therefore adopted both the Euclidean distance and the sigmoid cross-entropy as error functions. We used the sigmoid cross-entropy instead of the KL-divergence because the former can be integrated with the sigmoid function for back-propagation. The results are shown in Table 1. Here, we compared DNR with other well-known modularity-based community detection methods. As shown in the table, DNR with the L2 norm (DNR-L2) and with the cross-entropy distance (DNR-CE) outperforms most of the competing modularity-based optimization algorithms.

5.3 Semi-supervised Community Detection Results

To evaluate the new semi-supervised deep nonlinear reconstruction (semi-DNR) method, we used networks on which it is difficult to find satisfactory community structures without label information. As shown in Figure 2, performance is mediocre on the GN network with Zout = 8 and the LFR networks with µ = 0.75 and 0.8. Besides, we chose the GN network with Zout = 7, where the methods without labels can also achieve better results, for comparison. Here, we set the balancing parameter λ = 1000. We also verified the effect of enforcing pairwise constraints on more than one layer, i.e., only the bottom layer, both the bottom and middle layers, and all three layers.

The results are shown in Figure 2. On the GN network with Zout = 7, all the methods obtain similar improvements from enforcing the same percentage of labels. In comparison, on the networks where unsupervised methods cannot obtain satisfactory results, semi-DNR achieves much better performance with the same number of labels. For example, with 20% pairwise constraints, the NMI of semi-DNR reaches 0.95, while GraphNMF and ModLink only achieve 0.76 and 0.79, respectively, on the LFR network with µ = 0.8 (rightmost panel in Figure 2). This means that semi-DNR uses pairwise constraints much more efficiently than the other methods. Furthermore, enforcing pairwise constraints on multiple layers yields similar improvements, which illustrates that semi-DNR can fully exploit pairwise constraints with graph regularization on a single layer.

6 Conclusion

In order to overcome the serious drawback of the linear low-rank embedding used by the widely adopted stochastic model and modularity maximization model for network community identification, we proposed a nonlinear model based on deep neural networks to gain representation power for large complex networks, developed an algorithm using the model for network community detection, and further extended this method to a semi-supervised deep nonlinear reconstruction algorithm by incorporating pairwise constraints. Extensive experimental results on synthetic and real networks illustrate that our new methods outperform the existing state-of-the-art methods for network community identification. In the future, we plan to study using the latent space embedding from DNR to make the model and methods more robust.

7 Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 61503281, 61502334, 61303110, 61422213), the "Strategic Priority Research Program" of the Chinese Academy of Sciences (XDA06010701), the Open Funding Project of the Tianjin Key Laboratory of Cognitive Computing and Application, the Foundation for the Young Scholars of Tianjin University of Commerce (150113), the Talent Development Program of Wuhan and the municipal government of Wuhan, Hubei, China (2014070504020241), and an internal research grant of Jianghan University, Wuhan, China, as well as by the United States National Institutes of Health (R01GM100364).

References

[Adamic and Glance, 2005] Lada A Adamic and Natalie Glance. The political blogosphere and the 2004 US election: divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery, pages 36–43. ACM, 2005.

[Blondel et al., 2008] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[Bourlard and Kamp, 1988] Hervé Bourlard and Yves Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4-5):291–294, 1988.

[Duch and Arenas, 2005] Jordi Duch and Alex Arenas. Community detection in complex networks using extremal optimization. Physical Review E, 72:027104, 2005.

[Eckart and Young, 1936] Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936.

[Fortunato and Barthelemy, 2007] Santo Fortunato and Marc Barthelemy. Resolution limit in community detection. Proceedings of the National Academy of Sciences, 104(1):36–41, 2007.

[Girvan and Newman, 2002] Michelle Girvan and Mark EJ Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002.

[Good et al., 2010] Benjamin H Good, Yves-Alexandre de Montjoye, and Aaron Clauset. Performance of modularity maximization in practical contexts. Physical Review E, 81(4):046106, 2010.

[He et al., 2012] Dongxiao He, Jie Liu, Bo Yang, Yuxiao Huang, Dayou Liu, and Di Jin. An ant-based algorithm with local optimization for community detection in large-scale networks. Advances in Complex Systems, 15(08):1250036, 2012.

[He et al., 2015] Dongxiao He, Dayou Liu, Di Jin, and Weixiong Zhang. A stochastic model for detecting heterogeneous link communities in complex networks. In AAAI Conference on Artificial Intelligence, 2015.

[Hinton and Zemel, 1994] Geoffrey E Hinton and Richard S Zemel. Autoencoders, minimum description length, and Helmholtz free energy. In Advances in Neural Information Processing Systems, 1994.

[Jin et al., 2015] Di Jin, Zheng Chen, Dongxiao He, and Weixiong Zhang. Modeling with node degree preservation can accurately find communities. In AAAI Conference on Artificial Intelligence, 2015.

[Lancichinetti et al., 2008] Andrea Lancichinetti, Santo Fortunato, and Filippo Radicchi. Benchmark graphs for testing community detection algorithms. Physical Review E, 78(4):046110, 2008.

[Liang et al., 2015] Xiaodan Liang, Si Liu, Xiaohui Shen, Jianchao Yang, Luoqi Liu, Jian Dong, Liang Lin, and Shuicheng Yan. Deep human parsing with active template regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(12):2402–2414, 2015.

[Liu et al., 2015] S. Liu, X. Liang, L. Liu, X. Shen, J. Yang, C. Xu, L. Lin, Xiaochun Cao, and S. Yan. Matching-CNN meets KNN: Quasi-parametric human parsing. In CVPR 2015, pages 1419–1427, June 2015.

[Lusseau and Newman, 2004] David Lusseau and Mark EJ Newman. Identifying the role that animals play in their social networks. Proceedings of the Royal Society of London. Series B: Biological Sciences, 271(Suppl 6):S477–S481, 2004.

[Newman, 2004] Mark EJ Newman. Fast algorithm for detecting community structure in networks. Physical Review E, 69(6):066133, 2004.

[Newman, 2006] Mark EJ Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582, 2006.

[Psorakis et al., 2011] Ioannis Psorakis, Stephen Roberts, Mark Ebden, and Ben Sheldon. Overlapping community detection using Bayesian non-negative matrix factorization. Physical Review E, 83(6):066114, 2011.

[Rosvall and Bergstrom, 2008] Martin Rosvall and Carl T Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.

[Vincent et al., 2010] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408, 2010.

[Wang et al., 2008] Rui-Sheng Wang, Shihua Zhang, Yong Wang, Xiang-Sun Zhang, and Luonan Chen. Clustering complex networks and biological networks by nonnegative matrix factorization with various similarity measures. Neurocomputing, 72(1):134–141, 2008.

[Xie et al., 2013] Jierui Xie, Stephen Kelley, and Boleslaw K Szymanski. Overlapping community detection in networks: The state-of-the-art and comparative study. ACM Computing Surveys, 45(4):43, 2013.

[Yang et al., 2007] Bo Yang, William K Cheung, and Jiming Liu. Community mining from signed social networks. IEEE Transactions on Knowledge and Data Engineering, 19(10):1333–1348, 2007.

[Yang et al., 2009] Tianbao Yang, Rong Jin, Yun Chi, and Shenghuo Zhu. Combining link and content for community detection: A discriminative approach. In Proceedings of the 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 927–936, 2009.

[Yang et al., 2015] Liang Yang, Xiaochun Cao, Di Jin, Xiao Wang, and Dan Meng. A unified semi-supervised community detection framework using latent space graph regularization. IEEE Transactions on Cybernetics, 45(11):2585–2598, 2015.

[Zachary, 1977] W Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33(4):452–473, 1977.

[Zhang et al., 2007] Shihua Zhang, Rui-Sheng Wang, and Xiang-Sun Zhang. Uncovering fuzzy community structure in complex networks. Physical Review E, 76(4):046103, 2007.

[Zhang, 2013] Zhong-Yuan Zhang. Community structure detection in complex networks with partial background information. EPL (Europhysics Letters), 101(4):48005, 2013.
