
Community Detection in Social Networks: An In-depth Benchmarking Study with a Procedure-Oriented Framework ∗

Meng Wang1, Chaokun Wang1, Jeffrey Xu Yu2, Jun Zhang1
1 Tsinghua University, Beijing 100084, China
2 The Chinese University of Hong Kong, Hong Kong, China
{meng-wang12, zhang-jun10}@mails.tsinghua.edu.cn, [email protected], [email protected]

ABSTRACT
Revealing the latent community structure, which is crucial to understanding the features of networks, is an important problem in network and graph analysis. During the last decade, many approaches have been proposed to solve this challenging problem in diverse ways, i.e. with different measures or data structures. Unfortunately, experimental reports on existing techniques fell short in validity and integrity, since many comparisons were not based on a unified code base or were merely discussed in theory.

We engage in an in-depth benchmarking study of community detection in social networks. We formulate a generalized community detection procedure and propose a procedure-oriented framework for benchmarking. This framework enables us to evaluate and compare various approaches to community detection systematically and thoroughly under identical experimental conditions. Upon that we can analyze and diagnose the inherent defects of existing approaches deeply, and further make effective improvements correspondingly.

We have re-implemented ten state-of-the-art representative algorithms upon this framework and make comprehensive evaluations of multiple aspects, including the efficiency evaluation, performance evaluations, sensitivity evaluations, etc. We discuss their merits and faults in depth, and draw a set of interesting take-away conclusions. In addition, we present how we can make diagnoses for these algorithms, resulting in significant improvements.

∗Corresponding author: Chaokun Wang.
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing [email protected]. Articles from this volume were invited to present their results at the 41st International Conference on Very Large Data Bases, August 31st - September 4th 2015, Kohala Coast, Hawaii.
Proceedings of the VLDB Endowment, Vol. 8, No. 10
Copyright 2015 VLDB Endowment 2150-8097/15/06.

1. INTRODUCTION
Intrinsic community structures are possessed by many real-world networks, e.g. biological data, communication networks and social graphs, to name but a few. Given a network, it is particularly interesting as well as challenging to detect the inherent and hidden communities. Communities, which have no quantitative definition, are also called clusters. They are usually considered as groups of nodes, in which intra-group connections are much denser than those inter-group ones. Just as many classic puzzles, community detection is intuitive at first sight but actually an intricate problem. Community detection [8] aims at grouping nodes in accordance with the relationships among them to form strongly linked subgraphs from the entire graph [26]. Since networks are usually modeled as graphs, detecting communities in multifarious networks is also known as the graph partition problem in modern graph theory [7, 2], as well as the graph clustering [1] or dense subgraph discovery problem [16] in the graph area. In the last decade, lots of solutions have emerged in the literature [5, 9, 19, 24, 14, 25, 12, 3, 29, 4, 11], trying to solve this problem from various perspectives.

The extensive research work has promoted the prosperity of the family of community detection approaches. However, it also raises a new difficulty: how to choose the most appropriate approach in specific scenarios, since many of the latest approaches have not been compared with each other upon unified platforms with the same datasets and uniform configurations. Given the huge diversity of the various approaches, it is usually not easy to analyze, compare and evaluate the extensive existing work. In this sense, a general benchmark for community detection is quite necessary and beneficial.

In this paper, we make a benchmarking study of community detection, which contains a universal procedure-oriented framework and a comprehensive evaluation system. Upon that, we are able to analyze, evaluate, diagnose and further improve the existing approaches thoroughly, and get interesting and credible conclusions.

1.1 Challenges
An in-depth benchmarking study for community detection is non-trivial and poses a set of unique challenges.

Firstly, considering the various existing approaches, the lack of a procedure-oriented framework for community detection makes it a puzzle to understand, compare and diagnose them. Since these approaches are of various categories, a universal framework of community detection is quite difficult to summarize and abstract.

Secondly, to make a fair comparison and build a general benchmark for evaluation, it is a necessity to re-implement different approaches of various categories based on a common code base. Actually, the re-implementation is really tough work.

Finally, when proposing a new approach, authors often validate their work via a limited set of metrics on which it performs well. In our benchmarking evaluation, we need a suite of metrics which can embody the full structural characteristics of communities, to evaluate the approaches as comprehensively and thoroughly as possible.

There exist two pieces of research work similar to ours. Yang et al. only investigated the performances of different metrics for communities with ground-truth [28]. Xie et al. made an evaluation of overlapping community detection [27]. However, they failed to present a universal framework. Instead, we conduct a systematic in-depth benchmarking study to solve the above challenges.

1.2 General Benchmark
In this paper, we have designed a benchmark for community detection.

As shown in Fig. 1, our benchmark consists of four core modules: (1) Setup, including a set of algorithms (Sec. 2.2), real-world and synthetic datasets (Sec. 6.1 and 6.2), unified parameter configurations, and a unified graph model converted from the datasets; (2) a generalized detection procedure with the common workflow of community detection (the details of the framework are introduced in Sec. 3; the procedure mappings of the algorithms to the framework are in Sec. 4); (3) diagnoses on these algorithms based on our framework, which provide targeted directions of improvement over the existing work (Sec. 5); and (4) a comprehensive evaluation system for community detection from different aspects (Sec. 6.3–6.11).

Figure 1: Benchmark for community detection
(The figure sketches the four modules: Setup — algorithms, datasets, configurations and the graph model; Detection Framework — Initialization with PreCompute and FirstAllocate, Transformation with Select, Fluctuate and Update, Construction with Collect, MultiLevelDraft and Extract; Diagnoses; and Evaluation — efficiency, effectiveness, accuracy, density, distribution, outliers, etc.)

The benchmark contains a universal framework which abstracts the key factors, phases and steps from community detection tasks, and makes it easy to implement many classical or latest approaches for comparison. Moreover, it consists of a comprehensive suite of widely-recognized metrics for evaluation of various concerned aspects, including the efficiency evaluation on time cost, performance evaluations on accuracy and effectiveness, sensitivity evaluations on network density and mixture, and additional evaluations on community distribution and outliers. By modularizing and separating key factors and steps, our framework allows us to study the strength and weakness of each algorithm thoroughly, and make diagnoses and targeted prescriptions for improvement. In this benchmark we provide a common code base with algorithms implemented in the same environment, and thus make the comparison more fair and credible.

1.3 Contributions
To the best of our knowledge, this benchmarking study is the first work which focuses on the in-depth analysis, evaluation and comparison of the extensive existing work on non-overlapping community detection techniques with a generalized framework. We make the following main contributions:

• We propose a novel procedure-oriented framework by formulating a generic workflow of community detection via abstracting and modularizing the key factors and steps.
• We review the family of community detection approaches, and re-implement ten state-of-the-art representative algorithms in a common code base (using standard C++) based on the framework by mapping them to their specifics.
• We make in-depth evaluations on these approaches based on our benchmark using both real-world and synthetic datasets. We draw a set of interesting take-away conclusions, and provide intuitive and brief ratings on the concerned algorithms.
• We also present how to make diagnoses for existing approaches, leading to significant performance improvements.

The remainder of this paper is organized as follows. We formulate the problem of community detection and sketch out the existing work in Sec. 2. In Sec. 3 we propose a universal framework for benchmarking in community detection, and then in Sec. 4 we map the existing approaches to the framework. Afterwards we make targeted diagnoses for these approaches based on the framework in Sec. 5. We evaluate these approaches with our benchmark and report the results and findings in Sec. 6, and conclude this study in Sec. 7.

2. PRELIMINARY AND BACKGROUND
As preliminaries, we first define the basic concepts and the problem of community detection, and then review the existing approaches.

2.1 Problem Definition
A social network is also referred to herein as a graph G = (V, E), where V = {v1, ···, vn} is the set of nodes denoting the individuals, and E ⊆ V × V is the set of undirected relationships denoting the social ties, |V| = n, |E| = m. Here we consider non-empty subsets of V of two kinds: closely connected groups of nodes (Coms), and a moderate number of disparate outliers (Outs). A community is also referred to as a cluster. Since community detection does not force each node to be grouped into a certain group, some nodes, which are far outside the detected communities, are allowed to stay independent of any group. We define them as outliers [13]. Non-overlapping communities are not confined to a graph partition, and clusters which usually incompletely cover the graph are more desirable.

DEFINITION (Community Detection). Generally, given a network G = (V, E) and mvs (minimal valid size, a general setting which means the size of a valid community should not be less than mvs), the community detection problem aims at finding the optimal community assignment R = (Coms, Outs) from G, s.t. (1) Coms = {V′1, ···, V′cn}, ∀ V′i, V′j ∈ Coms, V′i ⊆ V ∧ V′i ∩ V′j = ∅, and (2) ∪_{i=1}^{cn} V′i ∪ Outs = V, where cn is the total number of communities.

Herein the optimal assignment refers to closely connected groups of nodes (Coms) and a moderate number of disparate outliers (Outs). It is worth mentioning that outliers can be identified directly by the original algorithms, or be produced by disbanding the tiny groups whose sizes are less than the predefined threshold mvs. An intuitive example of the results of community detection is illustrated in Fig. 2. When mvs = 2 (a setting under which only singletons will be eliminated), there are three communities and one outlier: Coms = {{v1, ···, v6}, {v7, ···, v10}, {v11, ···, v14}} and Outs = {v15}.

Figure 2: An example of community detection

2.2 Detection Algorithms
Community detection has been studied unremittingly all these years, and a particularly large number of effective approaches have been proposed. In this study we focus on non-overlapping community detection, which aims at the fundamental problem of finding the definite group (community) that each node belongs to in the graph. We categorize the existing approaches according to the process of forming communities, as shown in Fig. 3, which covers most representatives from all the popular approaches proposed.

··· 11 { communities 3 v , | v , v V 12 E ∈ cn 0 graph , ⊆ v } V 13 where , , V ( @ , V v V × . , 14 i E 0 V as }} m ∈ ) , , , The first kind of approaches starts from the original graph and Categories Sketches Problem-Solving Perspectives Representatives Clustering decomposes the entire graph to local parts gradually, trying to sep- Entire Graph (Division) { [19], [23] } $ arate out communities from the entire graph. Communities Direct Partitioning { [22], [29] } Division algorithms in hierarchy clustering methods, such as Label Propagation { [17], [24] } Local Structures Community Leadership Expansion { [12] } Radicchi [23] and Spectral [19], gradually separate the entire net- Detection $ work into local parts by the edge clustering coefficient or the eigen- Communities Percolation { [14], [20] }

Hierarchy Clustering { [5], [18] } of matrix. Constructed Tree (Agglomeration) Direct partitioning methods separate the entire network into $ Matrix Blocking { [4] } Communities disjoint communities. The Scalable Community Detection (SCD) Skeleton Clustering { [3], [11] } algorithm [22] partitions the network by maximizing the weighted Figure 3: Categories of community detection approaches community clustering [21], a recently proposed metric of commu- nity. Maximal k-Mutual-Friends (M-KMF) [29] algorithm incre- makes it difficult to comparatively analyze these algorithms thor- mentally filters out the connections by the number of mutual friends oughly. For the sake of a better understanding of the underlying between nodes to let the communities spontaneously emerge. principles of community detection algorithms, we abstract two fun- Conversely, the second kind of approaches takes a bottom-up damental concepts, including the propinquity measure and the manner from local structures to the whole graph, and the commu- revelatory structure, which play critical in the community nities are formed during this process. detection task and can be used to distinguish different approaches. Label propagation methods start from local neighborhood to Definition 1. (Propinquity Measure). Given a subset M of the recognize communities automatically. The Label Propagation Al- elements (such as nodes, relationships or other specific structures) gorithm (LPA) [24] adopts an asynchronous update strategy where in a graph G, the propinquity measure of M, denoted as φ(M), is nodes join in groups under their neighbors’ choices. The HANP al- the measurement of the nearness of M by the inner-connections, gorithm [17] based on Hop Attenuation and Node Preference adopts and is the primary criterion to estimate the priority of the elements additional rules to ensure more stable and robust results. when they are transformed to make the communities emerge. 
Leadership expansion methods find communities according to local leader groups, since members always gather together around Definition 2. (Revelatory Structure). Given a graph G, the rev- some core nodes with high to form communities. The elatory structure corresponding to G, denoted as Π, is an assistant TopLeaders [12] algorithm gradually associates nodes to the near- structure derived from G, provides yet another way to organize the est leaders and locally reelects new leaders during each iteration. massive graph elements and enlightens us on the community struc- Clique percolation methods assume communities are constructed ture from the intertwined connections among them. by multiple adjacent cliques. Based on the original approach [20], The two concepts lay the basis for different community detec- the Sequential Clique Percolation (SCP) [14] algorithm sequen- tion algorithms. The former determines the tendency of grouping tially generates cliques to form connected communities. nodes to communities, while the latter records the gradual forma- The third kind of the approaches maintains a tree, which is a tion of communities and leads to a more effective detection process multi-level structure reorganized from the original graph, aiming at especially for approaches of the third category. finding communities corresponding to the branches of the tree. It should be noted that the specific definitions of the above con- Agglomeration algorithms in hierarchy clustering methods usu- cepts could be quite different in various approaches and thus we ally build an explicit hierarchical tree from small clusters to large only give a general definition here. The propinquity measure may ones. Based on the Newman Fast Greedy Algorithm (NFGA) [18], be the modularity [5], node [12], etc., and the revelatory Claust et al. 
proposed an agglomeration algorithm CNM [5] which structure may appear as the hierarchy tree [4, 18], G∗ graph [14] or starts from single nodes, maintains the change of modularity, and other particular structures. We will discuss how the two concepts iteratively generates the optimal level of the hierarchy structure. are defined specifically in different algorithms in Sec.4. Matrix Blocking technique can also be utilized in community detection by constructing a hierarchy tree to order nodes in a net- 3.2 The Generalized Procedure work. As a representative, the Matrix Blocking Dense Subgraph We formulate a generalized procedure of community detection Extract (MB-DSGE) algorithm [4] reorders the network, and ex- in this framework. As illustrated in the “Detection Framework” tracts dense subgraphs as communities. module in Fig.1, the procedure consists of three phases, including Skeleton clustering methods reveal dense connected clusters initialization, transformation and construction, and characterizes based on the skeleton of the original network, which is an efficient the generic workflow of community detection via a series of the way of finding communities. The SCOT+HintClus algorithm [3] key steps. The details of the procedure are shown in Alg.1. detects the hierarchical cluster boundaries of a network to extract Phase 1: Initialization the meaningful cluster tree. Inspired by this idea, the Graph Skele- In this phase, the graph elements need to get their initial propin- ton Clustering (gCluSkeleton) algorithm [11] projects the network 0 quity values and form a primary community assignment (Rtmp). As to its core-connected maximal spanning tree, and then detects the shown in Alg.1, after the propinquity measure ( φ) and the reve- optimized core-connected clusters on it. latory structure (Π) are defined (Line1), the procedure calculates initial propinquity values via PRECOMPUTE (Line3) and allocates 3. 
FRAMEWORK FOR BENCHMARKING the primary assignment via FIRSTALLOCATE (Line4). In this section, we present our procedure-oriented framework for Phase 2: Transformation benchmarking in community detection, which consists of two fun- In this phase, the inner structures and relations underlying the damental concepts abstracted from existing detection algorithms network elements are transformed and clarified iteratively, result- and a generalized procedure of community detection. ing in a set of intermediate detection results SRtmp . We abstract three key steps for each iteration, including SELECT,FLUCTUATE 3.1 Fundamental Concepts and UPDATE (Line6–8). First of all, in S ELECT, the candidate ele- Existing algorithms usually solve the community detection prob- ments CadT are picked out from the graph. After that, in FLUCTU- lem with various methods based on different assumptions. This ATE, the revelatory structure Π correlated with these elements will

1000 Algorithm 1 GENERICDETECTPROC Furthermore, the framework provides flexible combinations of Input: G(V,E), mvs and Tmax the optional steps, making itself adaptable to different kinds of al- Output: R(Coms,Outs) gorithms. Generally, all of the three categories of algorithms dis- 1: initialize φ and Π; cussed in Sec. 2.2 can be mapped to this framework. T 2: T ← 0, SRtmp ← /0, SR ← /0, Cad ← /0; The framework is light-weight. It unifies the input and output, 3: PRECOMPUTE(G,φ); T handles the overall detection process, and provides necessary inter- 4: Rtmp ← FIRSTALLOCATE(G,Π); faces. Algorithms can be integrated easily by defining the factors 5: while T! = Tmax&&!STABLE(Π)&&!OPTIMAL(φ) do T in Sec. 3.1 and implementing the functions in Sec. 3.2. Both the 6: Cad ← SELECT(G,Π); T T framework and algorithms are developed using standard C++. 7: Rtmp ← FLUCTUATE(Cad ,Π); T 8: UPDATE(INVOLVE(Cad ),φ); T 9: SRtmp ← SRtmp ∪ Rtmp; 4. IMPLEMENTATIONUPON FRAMEWORK 10: T++; The present section recaps ten of the state-of-the-art algorithms 11: end while for community detection and describes how they work under this 12: if S has multiple results then Rtmp framework. Fig.4 shows the overall procedure mapping of these 13: for each level ∈ Π do algorithms to the framework. Based on the specific perspectives, 14: S ← S ∪ COLLECT(S ); R R Rtmp we consider the following representatives: 15: end for 16: R ← MULTILEVELDRAFT(SR,φ or ψ); • Agglomeration algorithm CNM [5], and division algorithms 17: else if SRtmp has no obvious result then Radicchi [23] and Spectral [19] (hierarchy clustering, Sec. 4.1). 18: R ← EXTRACT(Π,φ or ψ); • M-KMF [29] (direct partitioning, Sec. 4.2). 19: else R is obtained in the iteration; • LPA [24] and HANP [17] (label propagation, Sec. 4.3). 20: end if • TopLeaders [12] (leadership expansion, Sec. 4.4). 21: R.Outs ← R.Outs ∪ ERASE(R.Coms,mvs); • SCP [14] (clique percolation, Sec. 4.5). 22: ORDER(R.Coms); 23: return R; • MB-DSGE [4] (matrix blocking, Sec. 
4.6). • gCluSkeleton [11] (skeleton clustering, Sec. 4.7). be transformed, i.e. being fluctuated to form a more apparent inter- 4.1 Hierarchy Clustering mediate result (RT ). Following these structural transformations, tmp The hierarchy clustering algorithms (CNM [5], Radicchi [23], in UPDATE, the propinquity of other involved elements will be re- Spectral [19], etc.) form communities in a multi-level structure computed and the next iteration starts. These three steps are con- progressively on the basis of the original graph. This process falls ducted iteratively until the iteration terminates. During the itera- into two types, i.e. agglomeration or division, depending on their tions, the latent communities can form in many ways, as the afore- construction order of the hierarchy structure. mentioned categories, i.e. from the whole graph to communities, The revelatory structure. Agglomeration algorithms usually main- from local structures to communities or from trees to communities. tain an explicit hierarchy tree as the revelatory structure , in which Three indicators are usually employed to terminate the iterative Π the leaves denote nodes of the graph and the branches combine procedure (as shown in Line5): (1) whether the iteration number nodes or groups at different levels. Actually, without constructing a reaches the fixed threshold T (which is usually chosen to ensure max hierarchy tree, most division algorithms, such as Radicchi [23] and an approximate convergence of an algorithm); (2) whether the reve- Spectral [19], start from the entire graph and split out the commu- latory structure Π is stable (STABLE(Π)); and (3) whether the value nities gradually. For the lack of space here, we only introduce the of the propinquity measure φ is optimal (OPTIMAL(φ)). According procedure mapping of the classical agglomeration algorithm CNM. to specific algorithms, these indicators may be chosen and designed The propinquity measure. 
CNM employs modularity [5] as the specifically in different approaches. propinquity measure φ to make a greedy choice, trying to optimize Phase 3: Construction the global modularity of the final community partition: φ(R)= ∑cn In the last phase, the final result R is constructed by refining the h  i i=1 Ii 2Ii+Oi − , where Ii indicates the total number of internal re- current intermediate results SRtmp . If SRtmp has multiple choices, m 2m COLLECT (Line 14) needs to gather the result at each level of Π. lationships within the community Ci, Oi the number of outgoing Then MULTILEVELDRAFT (Line 16) weighs the collected results relationships between nodes in Ci and any node outside. For CNM, and picks up the best-performing one. Usually, based on Π,EX- any node would be grouped into a community during the transfor- TRACT (Line 18) is employed to let the communities emerge. In mation, and thus the cn communities cover all nodes in the graph. most cases, EXTRACT or MULTILEVELDRAFT may also adopt φ The initialization phase. At the beginning, PRECOMPUTE ob- as a measure, nevertheless, sometimes another measure ψ can be tains an initial value of φ(G) and FIRSTALLOCATE takes each node adopted, e.g. density in [4]. Eventually, the invalid communities as a single tree (tiny mono-community with only one member) to 0 are removed by ERASE according to mvs, and R.Coms is sorted via form a primary forest (Rtmp). ORDER by the community size (Line 21–23). The transformation phase. In the iteration T,SELECT chooses two candidate trees (i.e. current communities) CadT , whose com- T−1 3.3 Highlights bination may lead to the maximum increase of φ(Rtmp ). Then, Composed of two fundamental concepts and a generalized pro- FLUCTUATE combines them to form a new tree (a new commu- T cedure, our framework is beneficial for understanding and analyz- nity of Rtmp). Afterwards UPDATE recalculates the new value of T ing the community detection approaches. 
The two concepts are the φ(Rtmp), and then the algorithm turns to the next iteration. basis for solving the problem of community detection, and differ- The construction phase. In some early methods, COLLECT needs ent implementations may lead to quite different performances, even to gather the result in each level of Π, and then MULTILEVEL- for the same approach (as illustrated in Sec. 5.1). The procedure is DRAFT selects the best one (with the maximum φ(R))[18]. The the modularization of the critical steps of this problem, and uncov- improvement in CNM lies in SELECT and UPDATE. It breaks the ers the generic detection workflow, making it easy to study various iteration once the latest combination cannot increase the modularity approaches deeply within the identical framework. any more (∆φ < 0 and OPTIMAL(φ(R)) is true). Consequently, the
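Read as code, the modularity sum above is straightforward to evaluate. The following self-contained C++ sketch is our illustration (not CNM's actual implementation from the benchmark's code base): it computes φ(R) for a given node-to-community map, the quantity whose greedy increase drives CNM's SELECT step:

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <set>
#include <utility>
#include <vector>

// phi(R) = sum_i [ I_i/m - ((2*I_i + O_i)/(2m))^2 ], where
// I_i = number of internal relationships of community i, and
// O_i = number of outgoing relationships of community i.
double modularity(const std::vector<std::pair<int, int>>& edges,
                  const std::map<int, int>& com) {
    const double m = static_cast<double>(edges.size());
    std::map<int, double> I, O;
    for (const auto& e : edges) {
        int cu = com.at(e.first), cv = com.at(e.second);
        if (cu == cv) I[cu] += 1.0;
        else { O[cu] += 1.0; O[cv] += 1.0; }
    }
    std::set<int> ids;  // every community id, even ones with I_i = 0
    for (const auto& kv : com) ids.insert(kv.second);
    double phi = 0.0;
    for (int c : ids) {
        double deg = 2.0 * I[c] + O[c];  // total degree of community c
        phi += I[c] / m - (deg / (2.0 * m)) * (deg / (2.0 * m));
    }
    return phi;
}
```

CNM itself never recomputes φ from scratch in each iteration; consistent with the description above, it maintains the change ∆φ for candidate merges and stops once every possible merge yields ∆φ < 0.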

1001 Algorithms PreCompute FirstAllocate Select Fluctuate Update Terminate Collect & MultiLevelDraft Extract CNM ↦ → ↷ → → ↬ ⇢ ↣ ↦ start step Radicchi ↦ → ↷ → → ↝ ⇢ ⇢ : Spectral ↦ → ↷ → → ↝ ⇢ ⇢ →: normal step LPA ⇢ ↦ ↷ → ⇢ ↺ ⇢ ↣ ↷: into the iteration HANP ⇢ ↦ ↷ → → ↺ ⇢ ↣ ↺: iterate to Tmax TopLeaders ↦ → ⇢ ↷ → ↝ ⇢ ↣ ↬: break when optimal SCP ⇢ ⇢ ↷ → ⇢ ↺ ⇢ ↣ ↝: break when stable M-KMF ↦ ⇢ ↷ → → ↬ ⇢ ↣ ↣: end step MB-DSGE ↦ → ↷ → ⇢ ↺ ⇢ ↣ ⇢: skipped step gCluSkeleton ↦ → ↷ → ⇢ ↺ ↣ ⇢ Figure 4: The overall procedure mapping of ten studied algorithms current forest makes up the result. Without a direct output during The construction phase. Empirically, label propagation algo- the iteration, CNM needs to take all the leaves in each individual rithms limit the number of iterations to a maximum value Tmax to tree as the members of a same community via EXTRACT. ensure that the label distribution achieves a steady status approxi- mately. At last, EXTRACT will classify nodes with same labels to 4.2 Direct Partitioning same communities. Partitioning a graph directly is an intuitive way to get the closely connected communities. We take M-KMF [29] as a representative 4.4 Leadership Expansion and present its mapping to our framework as follows. Leadership expansion methods (e.g., TopLeaders [12]) regard a The revelatory structure. In M-KMF the k-mutual-friends sub- community as a set of followers congregating around a potential graphs are taken as Π and the most appropriate ones get separated leader, and form communities by identifying promising leaders and out as communities. Each node v ∈ Π has the degree d(v) ≥ k and then iteratively assembling followers. each relationship belongs to at least k triangles. The revelatory structure. All leaders try to expand their own The propinquity measure. In M-KMF, the φ(r) of each relation- leader groups (i.e. Π) respectively to form the final communities. ship r is defined as the number of triangles TR(r) containing r. 
The initialization phase. Without an initial assignment of communities, M-KMF merely executes PRECOMPUTE to initialize the φ(r) of each relationship. Before that, it also applies a degree filter which pulls nodes with d(·) < k + 1 out as outliers.

The transformation phase. In each iteration T, SELECT picks out the relationships (Cad^T) for which φ(r) < k, and FLUCTUATE removes all these candidates from the current graph. Once a batch of relationships is erased, new disconnected parts (R^T_tmp) may arise. Hence, they are closer to the k-mutual-friends subgraphs we look for. Then, UPDATE copes with each relationship involved in these deletions and recomputes φ(r) for the next iteration.

The construction phase. The iteration ends precisely when there is no relationship left to be marked. At this moment, each living relationship belongs to k triangles, so that the number of mutual friends between each pair of connected nodes is maximal (OPTIMIZE(φ(r)) is true). To fetch the final communities, EXTRACT needs to traverse the current Π to get the connected components.

4.3 Label Propagation

Label propagation algorithms (LPA [24], HANP [17], etc.) identify communities via different labels spreading among neighbors.

The revelatory structure. Π herein is defined as the label distribution, where different labels stand for different communities.

The propinquity measure. LPA has no special propinquity measure, whereas HANP adopts the hop score and the preference of each node vi to improve its robustness: φ(vi) = S(i) · P(i)^ω, where ω is a regulatory factor. Please note that the preference P(i) can be d(i), and the hop score S(i) = max_{j∈NL(i)} S(j) − σ, where NL(i) is the set of neighbors having the same label as i and σ is the attenuation factor.

The initialization phase. In FIRSTALLOCATE, nodes are assigned unique labels indicating different mono-communities (R^0_tmp).

The transformation phase. In the iteration T, SELECT generates a random permutation of the nodes (Cad^T). For LPA, in FLUCTUATE, each node changes its label to the most frequent one among the labels of its neighbors, while for HANP the process is extended by introducing the hop score and the node preference into the label frequency. The new label distribution makes up the new temporary result R^T_tmp. Without φ, LPA needs no UPDATE, while HANP needs UPDATE to recompute the S(i) of φ(vi) w.r.t. σ, which controls how far the particular label of the newly fluctuated node can spread.

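The LPA transformation just described can be condensed into a few lines. The sketch below is our own minimal illustration (asynchronous updates over a random permutation, random tie-breaking, a simple iteration cap), not the benchmarked implementation:

```python
import random
from collections import Counter

def lpa(adj, t_max=6, seed=0):
    """Minimal label propagation: every node starts in its own
    mono-community, then repeatedly adopts the most frequent
    label among its neighbors (ties broken at random)."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}          # FIRSTALLOCATE: unique labels
    for _ in range(t_max):                # transformation iterations
        order = list(adj)
        rng.shuffle(order)                # SELECT: random permutation
        changed = False
        for v in order:                   # FLUCTUATE
            if not adj[v]:
                continue
            freq = Counter(labels[u] for u in adj[v])
            best = max(freq.values())
            top = [lab for lab, c in freq.items() if c == best]
            new = rng.choice(top)
            if new != labels[v]:
                labels[v], changed = new, True
        if not changed:                   # STABLE(Pi): no label moved
            break
    comms = {}                            # EXTRACT: group by final label
    for v, lab in labels.items():
        comms.setdefault(lab, set()).add(v)
    return list(comms.values())

# toy graph: two triangles joined by a single bridge edge
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
```

Because ties are broken randomly, repeated runs may split the bridge differently; the output is always a partition of the node set.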
The propinquity measure. The TopLeaders algorithm adopts the degree centrality φ(vi) = d(i) of a node vi to measure its leadership.

The initialization phase. Specifically, PRECOMPUTE first computes the global degree centrality of each node in the graph. Then, FIRSTALLOCATE elects the ld most pivotal nodes as the leaders by the FNiC strategy (Few Neighbors in Common, which assures that the neighborhoods of any two leaders have an intersection smaller than λmax) and forms the initial states of the groups (R^0_tmp).

The transformation phase. In each iteration T, TopLeaders skips SELECT and executes FLUCTUATE directly. In FLUCTUATE, each node explores its neighbors within l hops using the BFS strategy and counts the common neighbors with each leader. Once the number of common neighbors with any leader exceeds the threshold λmin before the BFS reaches the maximum depth dp, the node joins the corresponding group around that leader. Then, the leader group gets enlarged (R^T_tmp). Please notice that any node which has multiple choices and cannot make this decision until l reaches the upper bound will be regarded as a hub, i.e. a special outlier. Besides, any node whose common neighbors cannot reach λmin anyhow will be regarded as an outlier. In consequence of the expansion, UPDATE needs to recompute φ(vi) locally and tries to reelect new leaders.

The construction phase. The transformation process repeats until no new leader comes to power any more (STABLE(Π) is true). Once all the leaders win a second term, each current leader group can be obtained via EXTRACT as a final community.

4.5 Clique Percolation

In clique percolation algorithms, a clique is an atomic element in the graph. As a representative, SCP [14] detects communities sequentially based on the idea of clique percolation. Unlike other algorithms, SCP does not define an explicit propinquity measure.

The revelatory structure. SCP maintains the G* graph as the revelatory structure Π, in which the (κ−1)-cliques are taken as nodes. If two (κ−1)-cliques belong to a same κ-clique, they are connected in the G* graph. Generally, SCP assumes κ = 3 or 4. If κ = 3, for the endpoints of a relationship, the κ-cliques can be directly obtained from their common neighbors. If κ = 4, each pair of connected common neighbors can make up a 4-clique together with the two endpoints. In the experiments later, we will show the effect of κ.

The transformation phase. SCP goes directly into the transformation. In each iteration T, SELECT first gets the common neighbors of the endpoints of a relationship and forms the corresponding κ-cliques (Cad^T). Then, in FLUCTUATE, all inner (κ−1)-cliques are inserted into the G* graph (R^T_tmp). All the (κ−1)-cliques which belong to a same upper clique will be linked in this graph.

The construction phase. The iteration is nothing but a sequential traversal of the relationships and ends after dealing with all κ-cliques (Tmax = m). At last, EXTRACT hunts for all connected components in the G* graph and takes them as communities.

4.6 Matrix Blocking

The matrix blocking technique (e.g., MB-DSGE [4]) finds the dense subgraphs, i.e. the communities, based on a rearrangement of G.

The revelatory structure. The hierarchy tree is adopted as Π.

The propinquity measure. MB-DSGE utilizes the cosine similarity to measure the propinquity of two nodes: φ(vi, vj) = ⟨A(:,i), A(:,j)⟩ / (|A(:,i)| · |A(:,j)|), where A is the adjacency matrix of G.

The initialization phase. PRECOMPUTE first computes φ(vi, vj) for each pair of nodes (not only those which have connections). Then, the triples (i, j, φ(vi, vj)) are sorted in descending order according to the value of φ to form a queue CQ. In FIRSTALLOCATE, trees representing single nodes form the original forest (R^0_tmp).

The transformation phase. In the iteration T, SELECT pops the triple (i, j, φ(vi, vj)) with the maximal cosine similarity from CQ and finds the trees corresponding to nodes i and j (Cad^T). Then, FLUCTUATE combines the two trees to form a new branch (R^T_tmp) in Π if they have no common ancestor. Since CQ has been obtained beforehand, UPDATE is unnecessary in MB-DSGE.

The construction phase. The entire iteration ends once the whole tree is built (Tmax = n − 1). Since an obvious result cannot be obtained now, MB-DSGE constructs the final communities using another measure, density. EXTRACT first counts the wrapped-up nodes V(b) and relationships E(b) for each branch b of Π in a bottom-up way. Then, it traverses Π in a top-down way and computes the density ψ(b) = 2|E(b)| / (|V(b)|(|V(b)| − 1)) of each branch. Once the density of b is higher than the predefined threshold ρmin, b is pruned out and all the leaves within it turn out to be the tightly interconnected members of a community.

4.7 Skeleton Clustering

Skeleton clustering algorithms (e.g., gCluSkeleton [11]) try to find closely connected communities based on the skeleton of the original graph G to make the detection process more efficient.

The revelatory structure. The gCluSkeleton algorithm performs the skeleton-based clustering just on the core-connected maximal spanning tree (CCMST) of G. Therefore, the CCMST and the hierarchy tree together make up the revelatory structure Π.

The propinquity measure. The gCluSkeleton algorithm adopts the structural similarity (denoted as ε) to measure the propinquity of the endpoints of a relationship r: φ(rij) = |N(i) ∩ N(j)| / (√|N(i)| · √|N(j)|), where N is the self-contained neighborhood of a node.

The propinquity measure φ here owns a suite of derivatives: (1) core-similarity (CS), (2) reachability-similarity (RS), (3) core-connectivity-similarity (CCS), and (4) attachability-similarity (AS). For node i, CS(i) is the maximum structural similarity ε̂ such that node i has at least µ adjacent relationships whose structural similarities are greater than ε̂. RS(rij) = min{CS(i), φ(rij)} stands for the reachability from node j to i. CCS(rij) = min{RS(i, j), RS(j, i)} ensures the bidirectional reachability of the two nodes. AS(i) is the maximum RS(j, i) obtained over all neighbors of node i, and the corresponding neighbor j is called an attractor of i.

The initialization phase. In PRECOMPUTE, the CCS of all the relationships is computed, and meanwhile the triples (i, j, AS(i)) are initialized and stored in a priority queue AIQ. Taking CCS as the weight, FIRSTALLOCATE produces the CCMST of G and stores all unique CCS values of the CCMST in the ε-queue in descending order.

The transformation phase. In the iteration T, SELECT takes the first ε out of the ε-queue and filters all relationships in the CCMST with CCS < ε to reveal the current core-connected clusters (Cad^T). When necessary (µ > 2), SELECT fetches attractors from AIQ to enlarge the clusters. FLUCTUATE then takes these clusters as branches in the hierarchy tree and the clusters produced in the previous iteration as subbranches of them. Along with the decrease of ε, each previous branch definitely belongs to a current one.

The construction phase. Without a wise indicator of the termination state, the iteration goes on until all the branches form a whole tree (Tmax = n − 1, at most), when the ε-queue is empty. Now, COLLECT brings the homologous clusters w.r.t. the branches into buckets (SR) with different ε values (different levels in Π). MULTILEVELDRAFT then picks up the best result R using an extended modularity function ψ(R).

5. DIAGNOSES AND IMPROVEMENTS

With high abstraction and modularization of the fundamental factors and key steps of the general detection approaches, the proposed framework provides not only a testbed to compare various approaches with a fair benchmark, but also a tool for us to study the strengths and weaknesses of each algorithm. It allows us to conduct diagnoses for algorithms which do not perform well, and further to make targeted prescriptions for improvement. Due to the space limitation, here we discuss directions for diagnosing and improving the algorithms with two examples.

5.1 Diagnosing Key Factors

As analyzed earlier, the propinquity measure φ plays a critical role in many approaches to community detection. Different approaches usually adopt various implementations of φ, but not all of them are the best choices. Thus one promising direction for improvement is to look again at the propinquity measures inside existing approaches and seek more appropriate implementations.

We take MB-DSGE as an example. In Fig. 5(a) we illustrate how MB-DSGE works on the given network shown in Fig. 2. According to the current propinquity measure φ defined using cosine similarity, v9 is closer to v10, with whom it has two common neighbors, than to v7 and v8. Thus, in MB-DSGE, v9 would be merged with v10 first. However, without even a relationship between them, they are factually not so close. On the contrary, v9 is closer to the directly connected v7 and v8. Consequently, this may result in sparse communities with low density. Another possible drawback of the current definition lies in that individual outliers cannot be identified clearly, since they usually have few neighbors and may be prematurely merged (e.g. v15 in Fig. 5(a)).

[Figure 5: Comparison of the hierarchy trees and the results of MB-DSGE and MB-DSGE* (ρmin = 0.35): (a) MB-DSGE, (b) MB-DSGE*.]

We can improve MB-DSGE by replacing φ, leading to an improved algorithm, MB-DSGE*. Since the propinquity measure φ in MB-DSGE does not take the direct connections between two nodes into consideration, we can replace it with the corresponding φ in gCluSkeleton, i.e. the structural similarity, to correct this omission. Note that for MB-DSGE*, φ needs to be computed for each pair of nodes, while for gCluSkeleton it only needs to be computed for the endpoints of each relationship.

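The difference between the two propinquity measures can be made concrete. Below is a small sketch of our own (not the authors' code): the cosine similarity over adjacency columns used by MB-DSGE ignores the direct edge between i and j, while the structural similarity used by gCluSkeleton works on self-contained neighborhoods N(v) = {v} ∪ neighbors(v), so a direct relationship raises φ:

```python
import math

def cosine_phi(adj, i, j):
    """MB-DSGE-style phi: cosine similarity of the 0/1 adjacency
    columns, i.e. shared neighbors only; the edge (i, j) itself
    contributes nothing."""
    shared = len(adj[i] & adj[j])
    denom = math.sqrt(len(adj[i])) * math.sqrt(len(adj[j]))
    return shared / denom if denom else 0.0

def structural_phi(adj, i, j):
    """gCluSkeleton-style phi: structural similarity over
    self-contained neighborhoods, so a direct edge raises phi."""
    ni, nj = adj[i] | {i}, adj[j] | {j}
    return len(ni & nj) / (math.sqrt(len(ni)) * math.sqrt(len(nj)))

# tiny example: i and j are directly connected and share neighbor k
adj = {'i': {'j', 'k'}, 'j': {'i', 'k'}, 'k': {'i', 'j'}}
```

On this triangle the cosine measure gives 0.5 for the pair (i, j), while the structural similarity gives 1.0, illustrating why MB-DSGE* merges directly connected nodes earlier.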
The execution processes of MB-DSGE and MB-DSGE* on the network of Fig. 2 are visualized in Fig. 5 with the built hierarchy trees and the communities denoted with dotted boxes. Compared with Fig. 5(a), the new hierarchy tree in Fig. 5(b) seems more orderly, since MB-DSGE* generates a combination order {3,5,1,4,2,6} in accordance with the nearness of the nodes in the subgroup. MB-DSGE* produces three dense subgraphs, while MB-DSGE cannot differentiate the two groups in the right part of the graph. Consequently, the rationality of the result gets heightened a lot.

5.2 Diagnosing Key Steps

In Sec. 3.2, we abstract the main phases and key steps of general approaches to community detection. As presented in Sec. 4, algorithms can be implemented within our framework via step mapping. This enables us to examine the performance of each single step and analyze its influence on the whole algorithm in depth. Upon that, we can propose targeted improvements by adding a beneficial step, removing an unnecessary one, or modifying an inefficient one.

We illustrate this with the LPA algorithm as an example. In original LPA, no propinquity measure is defined. Thus there exist no important steps of evaluating and updating the propinquity of the elements in the network. In this way, the varying importance of nodes and relationships is neglected.

This inspires a direction to improve LPA. Nodes with different confidences should be treated differently during the label propagation process. Thus we introduce the activeness of a node, which counts how many times the node changes its label during the whole iteration. Intuitively, there is a negative correlation between the confidence of a node and its activeness, and thus we define the propinquity measure as φ(vi) = 1/Activeness(i)^ω. In the transformation phase, we modify the FLUCTUATE step so that nodes also choose their labels according to their neighbors' confidence, and add a critical step UPDATE to recompute the propinquity of the nodes. As demonstrated in the experiments later in Sec. 6.10, these specific diagnoses achieve great improvements over the original algorithms.

It should be noted that in this study we only take two examples for illustration due to the limit of space. More diagnoses may be conducted for other existing approaches based on our framework. For instance, instead of a trivial MULTILEVELDRAFT, a smart EXTRACT step may be preferred in gCluSkeleton. It may also make the iteration in TopLeaders more efficient to add an extra SELECT for choosing the candidates.

6. BENCHMARKING EVALUATION

In this section, we conduct in-depth evaluations of the community detection algorithms within our framework using the proposed benchmark, which covers efficiency, accuracy, effectiveness, density sensitivity, mixture sensitivity, outliers, community distribution and diagnosis effects. We introduce the datasets and parameter configurations in the benchmark first, then report our thorough evaluation methodology and results, and summarize our findings and rate the algorithms intuitively at last. All experiments are conducted on a computer running Windows Server 2008 with an Intel Xeon 2.0 GHz CPU and 256 GB RAM.

6.1 Datasets

Towards a better and more thorough evaluation of the existing approaches, in this study we adopt three categories of datasets.

In the first category, we consider two widely used small-scale real-world networks, Strike and Football [12], for which we can obtain the authentic communities. The statistics are shown in Table 1, where cn is the number of real communities.

Table 1: Small-scale real-world networks with ground-truth
Name      n    m    cn  Supp.
Strike    24   38   3   /
Football  180  788  11  61 outliers, 4 hubs

We also employ seven large-scale real-world networks without ground-truth from the Stanford Large Network Dataset Collection (http://snap.stanford.edu/data/), ranging from co-author networks (CA-HepPh, DBLP), co-purchasing networks (Amazon) and communication networks (Email-Enron) to friendship networks (BrightKit, Gowalla and YouTube). We report the statistics in Table 2, where ccavg is the average clustering coefficient and diam the network diameter. We cannot obtain the real communities of these networks, and thus we adopt widely recognized metrics to evaluate the performance in terms of the quality of the clusters instead of the accuracy.

Table 2: Large-scale real-world networks
Name         n          m          ccavg   diam
CA-HepPh     12,008     118,505    0.6115  13
Email-Enron  36,691     183,830    0.4970  11
BrightKit    58,228     214,078    0.1723  16
Gowalla      196,591    950,327    0.2367  14
DBLP         317,080    1,049,866  0.6324  21
Amazon       334,863    925,872    0.3967  44
YouTube      1,134,890  2,987,624  0.0808  20

Furthermore, we use the Lancichinetti-Fortunato-Radicchi (LFR) networks [15] with ground-truth. Tens of synthetic datasets are produced for the in-depth evaluations, and we only list some representatives in Table 3. Here d is the average degree of the nodes, dmax the maximum node degree, cmin the minimum community size, and cmax the maximum community size. In LFR, another critical parameter is the mixing parameter u (0.2 by default [11]), which indicates the proportion of relationships a node shares with other communities.

Table 3: Synthetic networks with ground-truth
Name    n        m          d   dmax  cmin  cmax
S_10K   10,000   50,302     10  50    20    100
S_100K  100,000  504,371    10  50    50    100
S_200K  200,000  953,230    10  100   60    200
S_500K  500,000  2,938,555  10  250   100   500

6.2 Configurations

In our benchmarking evaluation, all requisite parameters of the studied algorithms are set to the values recommended in their original papers, as shown in Table 4.

In particular, for Radicchi, we choose the weak option, i.e. the community definition Ω = weak (see Sec. 6.5), to avoid excessive outliers. For TopLeaders, the number of leaders ld needs to be provided manually or in advance by other means. For MB-DSGE, although the prevalent value of ρmin is 0.1, we may need to run the algorithm repeatedly for better results (e.g., 0.5 on Football), since a slightly inappropriate ρmin may lead to awful results as the scale of the network increases. More details about the effects of the parameters are presented in the following evaluations.

Table 4: Parameter setting
Algorithm  Value        Algorithm     Value
Radicchi   Ω=weak       TopLeaders    λmax=5, λmin=0, dp=2
LPA        Tmax=6       HANP          Tmax=20, ω=0.1, σ=0.05
SCP        κ ∈ {3, 4}   M-KMF         k ∈ {1, 2, 3}
MB-DSGE    ρmin=0.1     gCluSkeleton  µ ∈ {2, 3}

6.3 Efficiency

In this subsection, we discuss the efficiency of the concerned algorithms from both theoretical and experimental aspects.

Theoretical analyses. The time complexities of the ten algorithms in each phase are illustrated in Table 5. Most social networks are sparse, and we usually have O(m) ≈ O(n). The community structures become non-significant when m tends to n² [8].

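Several of the bounds in Table 5 are easiest to see in code. For instance, the transformation of M-KMF repeatedly deletes every relationship whose endpoints share fewer than k common neighbors. The sketch below is our own simplified version: it recomputes φ(r) from scratch each round for clarity, whereas the benchmarked implementation only updates the relationships affected by the deletions:

```python
def k_mutual_friends(adj, k):
    """Iteratively delete every edge whose endpoints share fewer
    than k common neighbors (phi(r) < k). What survives decomposes
    into k-mutual-friends subgraphs; isolated nodes drop out as
    outliers. Simplified: phi is recomputed each round."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    while True:
        doomed = [(u, v) for u in adj for v in adj[u]
                  if u < v and len(adj[u] & adj[v]) < k]
        if not doomed:                            # OPTIMIZE(phi(r)) holds
            break
        for u, v in doomed:                       # FLUCTUATE: erase batch
            adj[u].discard(v)
            adj[v].discard(u)
    return {v: ns for v, ns in adj.items() if ns}

# a 4-clique with a pendant edge hanging off node 3
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3},
       3: {0, 1, 2, 4}, 4: {3}}
```

With k = 2 the pendant edge (3, 4) is peeled away and node 4 becomes an outlier, while the 4-clique survives because every remaining edge lies in at least two triangles.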
[Figure 6: Time costs with node scale under different phases in the framework w.r.t. various algorithms (CNM, Radicchi, LPA, HANP, TopLeaders, SCP, M-KMF, MB-DSGE, gCluSkeleton; log-scale processing time vs. nodes ×10^5): (a) initialization, (b) transformation, (c) construction, (d) overall.]

Theoretically, many classical hierarchy clustering algorithms have high time overheads, such as Radicchi, Spectral and NFGA. However, CNM can greatly reduce the time complexity since it needs no MULTILEVELDRAFT after the transformation. Obviously, the label propagation algorithms traverse all nodes and their neighbors Tmax times, while TopLeaders executes this process for an uncertain time T. Moreover, TopLeaders has tough work in FIRSTALLOCATE to find the original leaders. The time cost of SCP depends on the quantity of non-repeating (κ−1)-cliques (Nc) in the network. For M-KMF, the involved elements are mostly relative to the average degree davg of the nodes in each iteration, and the worst case is that all relationships are removed. MB-DSGE spends O(m log m) time in PRECOMPUTE, and O(n log n) in tree building. For gCluSkeleton, Nε is the size of the ε-queue (i.e. Tmax), which equals n − 1 at most.

Table 5: Overview of time complexity
Algorithm     Initialization    Transformation   Construction
CNM           O(m + n)          O(m log n)       O(n)
Radicchi      O(m + n)          O(m⁴/n²)         —
Spectral      O(n²)             O(n² log n)      —
LPA & HANP    O(n)              O(Tmax(m + n))   O(n)
TopLeaders    O(n log n + ld²)  O(T(m + n))      O(n)
SCP           —                 O(Nc)            O(m + n)
M-KMF         O(m + n)          O(davg·m)        O(m + n)
MB-DSGE       O(m log m)        O(n log n)       O(n)
gCluSkeleton  O(m + n log n)    O(m log n)       O(Nε(m + n))

Experiments on synthetic networks. We evaluate the efficiency and scalability of these algorithms on synthetic networks whose scales vary from 100K nodes to 500K nodes (Fig. 6(a)–(d)). We find that Spectral cannot even handle S_100K due to an enormous consumption of 40 GB of RAM for a single-pass iteration, which is an unbearable space overhead in practice. Thus we only evaluate the other approaches according to the three phases of the workflow.

Fig. 6(a) shows the initialization time of six algorithms varying with the node scale (with a logarithmic Y axis). The time costs of LPA and HANP are much lower than the others and thus omitted for brevity, and SCP does not have an initialization phase. In accordance with the above analyses, TopLeaders spends the most time in PRECOMPUTE, followed by MB-DSGE and gCluSkeleton, both of which need to compute and sort the values of φ in this step.

The time costs of the transformation phase of each algorithm are shown in Fig. 6(b). M-KMF outperforms the others significantly. For SCP, the iteration performs slightly faster by utilizing the OpenMP API (http://openmp.org) to optimize the process in FLUCTUATE. TopLeaders can hardly get stabilized on large-scale datasets, and thus we relax the required proportion of two-term leaders to approximately 95%. Consequently, this tiny modification increases the efficiency significantly in our experiments with little influence on performance. CNM, MB-DSGE and gCluSkeleton cost much more time in this phase than the others. For CNM, the number of iterations is supposed to be O(log n), but actually it nearly reaches O(n) (e.g., the number of iterations is 99,404 on S_100K, which contains 100,000 nodes). Although the iteration of gCluSkeleton is similar to that of the Kruskal algorithm (O(m log n)), the high time cost may be attributed to the pairwise association of clusters in FLUCTUATE.

In the construction phase, except for gCluSkeleton, most algorithms complete this process within negligible time, as shown in Fig. 6(c), since EXTRACT usually only traverses Π to get the final results. gCluSkeleton costs much time due to the existence of the time-consuming COLLECT and MULTILEVELDRAFT. The overall time costs of these algorithms are shown in Fig. 6(d), almost the same as those in the transformation phase.

Experiments on real-world networks. We also conduct experiments on the efficiency of the algorithms on real-world networks. The overall time costs on the datasets CA-HepPh, Email-Enron, Amazon and DBLP are presented in Fig. 7. In general, the ranking of the processing time is consistent with that on the synthetic networks, and thus we do not discuss it in detail for brevity.

[Figure 7: Time overheads on real-world networks (CA-HepPh, Email-Enron, Amazon, DBLP).]

Influence of framework implementation. To evaluate the influence on efficiency of the framework standardization of the original algorithms, we compare their overall processing time in both the native implementation and the framework implementation on S_500K. The result of the comparison is shown in Fig. 8. Since our re-implementation based on the framework is just an encapsulation of the original algorithms with the generalized detection procedure, it has almost no effect on the efficiency of the algorithms.

[Figure 8: Comparison on the efficiency of framework implementation vs. native implementation.]

Remarks. Considering the overall time overheads, we find that M-KMF performs more efficiently than the others. Generally, these algorithms can be ordered w.r.t. efficiency as M-KMF > SCP > TopLeaders > Radicchi > (LPA, HANP) > (CNM, MB-DSGE, gCluSkeleton). The approaches themselves are efficient under the time complexity O(n log n), but the transformation is the most time-consuming phase.

6.4 Accuracy

In this subsection, we evaluate the accuracy of an algorithm on datasets with ground-truth by examining to what extent the communities delivered by the algorithm are consistent with the real ones.

Metrics. Cross Common Fraction (Fsame) compares each pair of communities, in which one comes from the detected result and the

[Figure 9: Accuracy and effectiveness of various algorithms on six metrics (accuracy: Cross Common Fraction (Fsame), Jaccard Index, Normalized Mutual Information (NMI), on Strike, Football, S_100K and S_200K; effectiveness: Clustering Coefficient (CC), Strength (S), modularity (Q), on BrightKit, Gowalla, DBLP and Amazon): (a) accuracy, (b) effectiveness.]

other comes from the real one, to find the maximal shared parts. Formally, it is defined as

Fsame = (1/2) ∑_{i=1..cn} max_j |Ci ∩ C′j| + (1/2) ∑_{j=1..cn′} max_i |Ci ∩ C′j|,   (1)

where cn and cn′ are the quantities of detected and real communities respectively, and Ci and C′j are the i-th detected community and the j-th real community respectively.

The Jaccard Index [10] is a widely used similarity measure. It is more sensitive, for it can overcome the little variance of the cross common fraction when nodes from several different communities in one result join together as a single community in another result [24]. It can be defined as Ps/(Ps + Ps1 + Ps2), where Ps stands for the number of node pairs which are classified into the same community in both results, Ps1 stands for the number of node pairs appearing in the same community in the algorithm-produced result but in different communities in the truth, and Ps2 vice versa.

Normalized Mutual Information (NMI) [6] is a popular criterion for evaluating the accuracy of community detection based on information theory. The score of NMI stands for the agreement of two results, and it is defined as

NMI = −2 ∑_{i,j} Nij log(Nij·Nt / (Ni·N·j)) / ( ∑_i Ni· log(Ni·/Nt) + ∑_j N·j log(N·j/Nt) ),   (2)

where N is the confusion matrix whose element Nij is the number of shared members between a detected community Ci and a real one C′j, Ni· and N·j are the sums over row i and column j respectively, and Nt = ∑i ∑j Nij.

Remarks. The scores of the above metrics are shown in Fig. 9(a). Based on the accuracy evaluation, we can find that SCP, M-KMF, MB-DSGE and gCluSkeleton perform outstandingly on networks containing many outliers, such as Football. The results of these algorithms show high stability, except for MB-DSGE, which may be easily influenced by the scale of nodes. Moreover, it is obvious that LPA and HANP produce eminently believable and accurate results on networks without any outliers, such as the synthetic networks.

6.5 Effectiveness

Most large-scale real-world networks have no ground-truth. In this situation, we evaluate the effectiveness of the algorithms by examining the quality of the detected communities via various metrics.

Metrics. The Clustering Coefficient (CC) focuses on the tendency of community members to cluster together. Thus, a high clustering coefficient means a high probability that the connections inside the detected communities are dense. The global clustering coefficient of an undirected network can be computed as

CC = (1/cn) ∑_{i=1..cn} (1/|Ci|) ∑_{v∈Ci} 2|{ets : vt, vs ∈ N(v) ∩ Ci, ets ∈ E}| / (d(v)(d(v) − 1)),   (3)

where N(v) consists of the neighbors of node v, d(v) is the degree of v, cn is the community quantity, and Ci is the i-th community.

Strength (S) measures the intensity of the detected communities. Evidence shows that a result usually seems valid if most members are in the same community as their neighbors. We inherit Radicchi's community definitions [23] to get the strength score of a result. Suppose that d(v)^in and d(v)^out stand for the degrees inside and outside the community C containing node v, respectively. If d(v)^in > d(v)^out for ∀v ∈ C, C is a strong community; if ∑v d(v)^in > ∑v d(v)^out, C is a weak community; otherwise C is invalid. To make the global evaluation, we define the strength as S = (1/cn) ∑_{i=1..cn} score(Ci). In this work, we let score(Ci) = 1 for a strong community, 0.5 for a weak one, and 0 for an invalid one.

Modularity (Q) is the most widespread quality function for community detection. It compares the result with a randomized one to indicate how reasonably the nodes are assigned into different groups. The definition of this metric has been given in Sec. 4.1.

Remarks. As illustrated in Fig. 9(b), almost all algorithms show highly consistent effectiveness on different datasets. Most algorithms get a satisfactory CC, so that the communities produced by these algorithms have high internal aggregation. Specifically, with a similar φ defined based on the number of triangles, Radicchi and M-KMF perform very well w.r.t. S, showing a strong preference for intra-group links over inter-group ones. However, their modularity scores are much worse than those of the others. Differently, CNM gets the best Q among all the algorithms, since it takes this metric as φ and achieves the optimal value at the end of the transformation. It is worth mentioning that without prior knowledge of the leader number, TopLeaders does not perform well any more. In general, the communities detected by HANP, LPA and CNM have higher qualities than those produced by the other algorithms.

6.6 Density Sensitivity

We are also interested in the sensitivity of the algorithms when they are applied to datasets with various network densities. In this experiment we generated synthetic networks on the basis of S_10K by gradually increasing its density via the average degree d (from 10 to 70, ∆d = 10; note that |E| = η|V| and d ≈ 2η).

The variations of the overall time costs are shown in Fig. 10. Except for SCP, CNM, LPA and HANP, the growing tendencies of the remaining algorithms are in line with those demonstrated in Fig. 6(d). We also studied the variations of effectiveness, accuracy and community quantity of the concerned algorithms on datasets with different densities, and present the results in Fig. 11. For space limitation, here we take CC and Fsame as representatives of effectiveness and accuracy, respectively, and similar tendencies are observed on

1006 105 CNM We evaluate the performance in terms of the effectiveness (mod- 104 Radicchi (s) LPA ularity score Q) and the accuracy (Jaccard index) of the concerned 3 10 HANP algorithms and illustrate the results in Fig. 12. The modularity Time

102 TopLeaders of most algorithms except gCluSkeleton declines linearly when u 10 SCP grows, while the Jaccard index of them shows different change ten- M‐KMF

Processing 1 dencies: CNM and gCluSkeleton deteriorate much faster than oth- MB‐DSGE ‐1 10 gCluSkeleton ers while LPA and HANP are more stable and insensitive to the 10 20 30 40 50 60 70 d mixture degree. Figure 10: Overall time costs of varying d Remarks. We can find that all the algorithms perform worse on datasets with higher mixture degrees, and both the effectiveness other metrics. With the increase of density, most algorithms ex- and accuracy would decrease when the communities overlap more cept SCP produce communities with higher CC values as expected. with each other and become less discriminative. In comparison, From the perspective of accuracy evaluated with F , Radicchi, same the label propagation algorithms are more stable and show better LPA and HANP not only outperform others significantly, but also capability of distinguishing communities from confusing mixtures. show more stable performance with less sensitivity on the density. On the contrary, the performances of SCP, M-KMF and MB-DSGE 6.8 Outliers decrease obviously when networks become denser. Furthermore, Focusing only on the produced clusters in the community de- we find the quantities of detected communities decrease consis- tection tasks, sometimes we may be hoodwinked and come to a tently with the increasing density. Since k =1 here, M-KMF is conclusion unilaterally. Because of the existence of outliers, we more likely to form large connected k-mutual-subgraphs so that the should also care about how many nodes are discriminated outside total number of communities drops rapidly. the groups meanwhile. As we discussed in Sec. 2.1, outliers can CC Avg. node degree d 0.8 0.6 be affirmed automatically by the original algorithms or the prede- 0.4 10 10 10 0.2 fined threshold mvs. In the former case, dense groups of nodes are 0 20 20 20 Fsame revealed from the network so that the rest parts are naturally re- 0.9 30 30 30 0.6 0.3 40 40 40 garded as outliers. 
In the latter case, a proper partition of the entire 0 Quantity 50 50 50 300 network is obtained so that the components whose sizes are less 200 60 60 60 100 70 70 70 than mvs will be disbanded, and the members of them are consid- 0 ered as outliers. In this subsection, we evaluate the extent to which these algorithms identify nodes as outliers, as shown in Fig. 13.

Figure 11: Tendency of different scores of varying d

Remarks. An increase in the density of networks usually leads to higher clustering coefficients, and for most algorithms it produces fewer detected communities. Moreover, since there are more links among nodes, fewer outliers are produced by these algorithms as the density increases. In general, Radicchi, LPA and HANP are superior to the others in both overall performance and stability. The growth in the density of a network has the strongest impact on both the efficiency and the performance of SCP.

6.7 Mixture Sensitivity
Mixing (or overlapping) structures are common in real-world social networks, which makes it more difficult to differentiate communities from the entire network. In this experiment, we study the performance and sensitivity of these algorithms on datasets with different mixture degrees.

For synthetic networks generated by LFR, the mixture degree depends on the mixing parameter u, as discussed in Sec. 6.1. A small value of u leads to large diversities among different communities, while a high value of u may produce overlapping communities. We generated synthetic datasets using LFR with 100K nodes and different settings of u from 0.1 to 0.5 (∆u = 0.1). Usually, it is meaningless to find communities in a network where more than half of the nodes are overlapped.

Figure 12: Effectiveness and accuracy of varying u ((a) Modularity (Q); (b) Jaccard Index)

6.8 Outliers
We test the ability of these algorithms to discover outliers on Football, which contains 65 labeled outliers. When mvs = 2, CNM, Radicchi, SCP, M-KMF, MB-DSGE and gCluSkeleton can identify outliers with an F1-Score higher than 0.90, while the others rarely consider outliers. We also conducted experiments on large-scale real-world networks without any prior knowledge about the exact outliers, and illustrate the proportion of outliers (OP = |Outs|/|V|) in Fig. 13. Here mvs is set to 3 since, on large-scale networks, we believe that only two people cannot form a community [28]. LPA and HANP are not included in the figure since they produce few outliers: once the label distribution tends to be steady, there are only a few tiny groups flagged with unique labels (2 on both Amazon and DBLP for HANP, 57 and 32 respectively for LPA). Similarly, the outliers found by CNM and TopLeaders are also quite few. Since MB-DSGE has a higher preference for locally dense connected groups, it produces more outliers. In particular, if we set κ to 4 for SCP, or increase k from 1 to 3 for M-KMF, more outliers will be produced due to the stricter limitation of Π. For gCluSkeleton, we set µ = 3 only on Amazon and DBLP to ensure better effectiveness of communities, which may lead to more outliers.

Figure 13: Proportion of produced outliers

Remarks. With the label propagation algorithms, we can hardly obtain outliers in a network. Compared with the other algorithms we studied, MB-DSGE generally produces the most outliers and has the lowest coverage ratio (1 − OP) of a network. The outliers produced by SCP, M-KMF and gCluSkeleton are too sensitive to the value of the key parameter, i.e. κ, k and µ, respectively. Moreover, we find that outliers are more easily revealed on datasets with a lower ccavg (e.g. only 0.1723 on BrightKit).
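The outlier proportion OP is straightforward to compute once a size threshold is fixed. The sketch below is our own illustration (the function name and the treatment of unassigned nodes are assumptions, not the benchmark's code): every detected group smaller than mvs, together with any node left unassigned, counts toward OP = |Outs|/|V|.

```python
def outlier_proportion(communities, num_nodes, min_valid_size=3):
    """OP = |Outs| / |V|: nodes in detected groups smaller than
    min_valid_size (the paper's mvs), plus unassigned nodes,
    are counted as outliers."""
    assigned = sum(len(c) for c in communities)
    small = sum(len(c) for c in communities if len(c) < min_valid_size)
    outliers = small + (num_nodes - assigned)
    return outliers / num_nodes

# 10 nodes: one community of 6, one pair (below mvs = 3), 2 unassigned.
op = outlier_proportion([[0, 1, 2, 3, 4, 5], [6, 7]], num_nodes=10)  # 0.4
```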

Table 6: Comparison between LPA and LPA*

Real-world networks:
            CA-HepPH                 Email-Enron              Gowalla                  DBLP
Alg.        CC     S      Q          CC     S      Q          CC     S      Q          CC     S      Q
LPA         0.7037 0.5985 0.4201     0.7340 0.7912 0.2983     0.7685 0.7902 0.3690     0.7317 0.6149 0.6930
LPA*        0.7227 0.6617 0.4449     0.7408 0.8118 0.4954     0.7751 0.8058 0.5303     0.7381 0.6404 0.7032

Synthetic networks:
            S 10K                    S 100K                   S 200K                   S 500K
Alg.        Fsame  Jaccard NMI      Fsame  Jaccard NMI      Fsame  Jaccard NMI      Fsame  Jaccard NMI
LPA         0.9681 0.9945  0.9992   0.9967 0.9917  0.9992   0.9935 0.9783  0.9969   0.9980 0.9944  0.9989
LPA*        0.9980 0.9940  0.9993   0.9979 0.9945  0.9995   0.9984 0.9956  0.9994   0.9992 0.9976  0.9996

Table 7: Comparison between MB-DSGE and MB-DSGE*

            Strike                                         Football
Alg.        CC     S      Q      Fsame  Jaccard NMI       CC     S      Q      Fsame  Jaccard NMI
MB-DSGE     0.7508 0.5000 0.2864 0.7708 0.4438  0.5777    0.4266 0.8000 0.5179 0.6139 0.8040  0.9767
MB-DSGE*    0.7747 0.8333 0.5474 0.9583 0.8208  0.8660    0.5097 0.6923 0.5408 0.6389 0.7154  1.0000

            CA-HepPH                 BrightKit                Gowalla                  DBLP
Alg.        CC     S      Q          CC     S      Q          CC     S      Q          CC     S      Q
MB-DSGE     0.5865 0.2834 0.3929     0.5443 0.1952 0.0961     0.5508 0.2165 0.1075     0.6303 0.3830 0.3964
MB-DSGE*    0.6223 0.3106 0.4437     0.6672 0.2715 0.1940     0.6269 0.2579 0.1883     0.6686 0.4413 0.5166
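The pair-counting Jaccard index [10] reported in these tables can be sketched as follows. This is a generic illustration of the measure (not the benchmark implementation) and assumes non-overlapping communities:

```python
from itertools import combinations

def pairwise_jaccard(partition_a, partition_b):
    """Jaccard index over node pairs: the fraction of co-clustered
    pairs shared by the two partitions, |A ∩ B| / |A ∪ B|."""
    def co_clustered_pairs(partition):
        pairs = set()
        for community in partition:
            pairs.update(combinations(sorted(community), 2))
        return pairs
    a = co_clustered_pairs(partition_a)
    b = co_clustered_pairs(partition_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# The two partitions agree only on the pairs (0,1) and (3,4).
j = pairwise_jaccard([[0, 1, 2], [3, 4]], [[0, 1], [2, 3, 4]])  # 1/3
```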

6.9 Distribution of Communities
The distribution of communities, i.e. the distribution of the quantities of communities with different sizes, is also a key indicator of the validity of an algorithm. We visualize the community distributions produced by the various algorithms on Amazon and DBLP in Fig. 14 (with double logarithmic coordinates) for comparison. We find that LPA and HANP produce regular power-law distributions, while CNM prefers more large communities, so its distributions are flatter and long-tailed. For SCP, when κ = 3, the distribution on Amazon is reasonable, while the distribution on DBLP is uneven due to an undivided large-scale community whose size is about 10^5. This may result from the small value of κ, which makes different κ-1-cliques merge more easily. The results of Radicchi and M-KMF are quite homogeneous due to the similar φ. As discussed in Sec. 6.8, undesirable excessive outliers will be found by M-KMF if we increase k. For MB-DSGE, even with an expected community density of 0.1, the results also show remarkable sparsity. Unlike M-KMF, MB-DSGE only takes the dense groups away, leaving the others as poor outliers. The community distribution generated by gCluSkeleton is a little messy, which may be caused by the lower differentiation of the values of φ, i.e. the values of the ε-queue, on large-scale networks. Finally, we find that a predefined ld makes the detection much more purposeful, and thus TopLeaders presents quite dissimilar distributions.

Remarks. The distributions produced by CNM, LPA and HANP are closest to a regular power-law distribution. TopLeaders has a community distribution different from the other algorithms due to the predefined leader number. On large-scale networks, for many algorithms such as Radicchi, M-KMF, SCP and gCluSkeleton, the existence of an undivided large community, which implies a lower separation degree, is actually an intractable problem to be solved.

6.10 Diagnosis Effects
Based on the two targeted diagnoses discussed in Sec. 5.1 and 5.2, we compare the modified algorithms with their original versions, i.e. MB-DSGE* vs. MB-DSGE and LPA* vs. LPA, to verify the effects of the improvements.

Table 6 shows the superiority of LPA* on both real-world and synthetic networks. Similarly, MB-DSGE* also leads to obvious performance improvements on real-world networks, as shown in Table 7. For lack of space, here we only present the results on part of the datasets we used. Based on this verification, we find that LPA* shows significant improvements over LPA, which had already outperformed the other algorithms on many datasets. MB-DSGE* also leads to better performance on most metrics, in both accuracy and effectiveness. In particular, we find that MB-DSGE* can significantly reduce the proportion of excessively produced outliers, as shown in Table 8. In sum, the diagnoses based on our framework can improve the performance in accuracy and effectiveness, and both modifications will not increase the processing time.
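The community size distribution plotted in Fig. 14 is simply a histogram of community sizes. A minimal sketch (our own helper, not part of the benchmark code):

```python
from collections import Counter

def community_size_distribution(communities):
    """Map each community size to the number of communities of that
    size -- the points plotted on the log-log axes of Fig. 14."""
    return Counter(len(c) for c in communities)

# Two communities of size 2, one of size 3, one singleton.
dist = community_size_distribution([[0, 1], [2, 3], [4, 5, 6], [7]])
```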

Table 8: Reduction of outlier proportion

Alg.        CA-HepPH  Gowalla  BrightKit  Amazon  DBLP
MB-DSGE     0.3218    0.7692   0.8841     0.4244  0.4105
MB-DSGE*    0.1220    0.4027   0.4882     0.1446  0.1665

Figure 14: Distributions of community size of each algorithm, quantity vs. community size on double logarithmic axes, on Amazon and DBLP ((a) LPA; (b) HANP; (c) CNM; (d) Radicchi; (e) SCP; (f) TopLeaders, with k = 2000 and k = 3000 on DBLP; (g) M-KMF, with k = 1 and k = 2 on Amazon; (h) MB-DSGE; (i) gCluSkeleton)

6.11 Summary
According to the above comprehensive evaluations covering multiple aspects of community detection, we find that each approach has its own merits and faults. The approaches vary in processing efficiency, and have different adaptabilities to different performance metrics as well as to different characteristics of datasets. To sum up, Table 9 gives an intuitive summary of the performance, adaptability and competitiveness of the studied approaches. We give general ratings from multiple critical aspects in this table, including processing efficiency (Effi.), accuracy (Accu.), effectiveness (Effe., in terms of CC, S and Q), the stability w.r.t. density sensitivity (Dens.) or mixture sensitivity (Mixt.), as well as to what extent an algorithm can avoid producing excessive outliers (Outl.).

From the perspective of efficiency, we see that M-KMF is the most efficient approach, while CNM, MB-DSGE and gCluSkeleton cost much more time than the others.
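As a point of reference for the label propagation family that dominates these ratings, a bare-bones LPA can be sketched in a few lines. This is a generic, deterministic simplification of our own, not the re-implementation evaluated in this study:

```python
from collections import Counter

def simple_lpa(adj, max_iters=100):
    """Minimal asynchronous label propagation: every node repeatedly
    adopts the most frequent label among its neighbors until no label
    changes. `adj` maps each node to its neighbor list; ties break
    toward the smallest label so the run is deterministic."""
    labels = {v: v for v in adj}
    for _ in range(max_iters):
        changed = False
        for v in adj:
            counts = Counter(labels[u] for u in adj[v])
            if not counts:
                continue  # isolated node keeps its own label
            best = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))[0]
            if best != labels[v]:
                labels[v] = best
                changed = True
        if not changed:
            break
    return labels

# Two separate triangles converge to two distinct labels.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1],
       3: [4, 5], 4: [3, 5], 5: [3, 4]}
labels = simple_lpa(adj)
```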

Table 9: A visualized summary of the studied approaches (more stars refer to better performance in the corresponding aspect)

                              Effe.
Alg.          Effi.   Accu.   CC      S       Q       Dens.   Mixt.   Outl.
CNM           ★★      ★★      ★★★★★   ★★★★    ★★★★★   ★★★★    ★       ★★★★
Radicchi      ★★★     ★★★★    ★★★★★   ★★★★★   ★       ★★★★    ★★      ★★★
LPA           ★★★     ★★★★★   ★★★★★   ★★★★    ★★★★    ★★★★★   ★★★     ★★★★★
HANP          ★★★     ★★★★★   ★★★★★   ★★★★    ★★★★    ★★★★★   ★★★     ★★★★★
TopLeaders    ★★★★    ★★★     ★★★★    ★★      ★★★★    ★★★     ★★      ★★★★
SCP           ★★★★    ★★★★    ★★★★    ★★      ★★      ★       ★★      ★★★
M-KMF         ★★★★★   ★★★★    ★★★★★   ★★★★★   ★       ★★      ★★      ★★★
MB-DSGE       ★★      ★★★     ★★★★    ★★      ★★      ★★      ★★      ★
gCluSkeleton  ★★      ★       ★★★★    ★       ★       ★★★     ★       ★★

Among the most important aspects, the accuracy evaluates the performance of algorithms using datasets with ground-truth. In this respect, LPA and HANP perform best in our study, showing the strongest capability of identifying authentic communities.

As for the effectiveness, which measures the intra- and inter-community structures of the detected communities with various metrics, we find that algorithms with different characteristics show different performances w.r.t. different metrics. CNM, Radicchi, LPA, HANP and M-KMF prefer to produce communities with high clustering coefficients (CC), Radicchi and M-KMF tend to form communities with high strength (S), while CNM is the best at finding communities with high modularity (Q).

These algorithms have different sensitivities to the density and the mixture degree of the datasets. Comparatively speaking, LPA and HANP are the most stable approaches, keeping excellent performance even when networks become much denser or the communities become highly mixed. Moreover, Radicchi and CNM are also less sensitive to varying network density.

Considering the tendency to produce outliers, MB-DSGE tends to pick out excessive outliers and performs better on networks with more outliers, while LPA and HANP hardly identify nodes as singletons and are more applicable on networks with few outliers. Besides, the community distributions produced by LPA, HANP and CNM are closest to the power-law distribution.

All in all, with an acceptable time cost, the label propagation algorithms are the more excellent and stable approaches, producing more accurate and valid communities with reasonable community distributions and fewer outliers under most circumstances.

7. CONCLUSIONS
In this paper, we conduct a comprehensive benchmarking study on approaches to community detection in social networks. Within the proposed benchmark, we formulate a generalized procedure-oriented framework, with high abstraction and nice modularization of the fundamental factors and critical steps of this problem. We have re-implemented ten state-of-the-art representative algorithms by mapping them to the proposed framework, and make in-depth evaluations of them based on our benchmark using both real-world and synthetic datasets. We discuss their merits and faults thoroughly, draw a set of interesting take-away conclusions, and provide intuitive ratings. In addition, we present how to make diagnoses for these algorithms based on our framework, and report significant improvements in the experimental study.

8. ACKNOWLEDGMENTS
This work was supported by the National Natural Science Foundation of China (No. 61170064, No. 61373023) and the National High Technology Research and Development Program of China (No. 2013AA013204).

9. REFERENCES
[1] C. C. Aggarwal and H. Wang. A survey of clustering algorithms for graph data. Managing and Mining Graph Data, pages 275–301. Springer, 2010.
[2] B. Bollobás. Modern Graph Theory, volume 184. Springer, 1998.
[3] D. Bortner and J. Han. Progressive clustering of networks using structure-connected order of traversal. ICDE, pages 653–656, 2010.
[4] J. Chen and Y. Saad. Dense subgraph extraction with application to community detection. TKDE, 24(7):1216–1230, 2012.
[5] A. Clauset, M. E. Newman, and C. Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.
[6] L. Danon, A. Díaz-Guilera, J. Duch, and A. Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.
[7] R. Diestel. Graph Theory, volume 173. Springer, 2000.
[8] S. Fortunato. Community detection in graphs. Physics Reports, 486(3):75–174, 2010.
[9] D. Gibson, R. Kumar, and A. Tomkins. Discovering large dense subgraphs in massive graphs. VLDB, pages 721–732, 2005.
[10] L. Hamers, Y. Hemeryck, G. Herweyers, M. Janssen, H. Keters, R. Rousseau, and A. Vanhoutte. Similarity measures in scientometric research: the Jaccard index versus Salton's cosine formula. Information Processing & Management, 25(3):315–318, 1989.
[11] J. Huang, H. Sun, Q. Song, H. Deng, and J. Han. Revealing density-based clustering structure from the core-connected tree of a network. TKDE, 25(8):1876–1889, 2013.
[12] R. R. Khorasgani, J. Chen, and O. R. Zaïane. Top leaders community detection approach in information networks. SNA-KDD, 2010.
[13] R. Kumar, J. Novak, and A. Tomkins. Structure and evolution of online social networks. Link Mining: Models, Algorithms, and Applications, pages 337–357. Springer, 2010.
[14] J. M. Kumpula, M. Kivelä, K. Kaski, and J. Saramäki. Sequential algorithm for fast clique percolation. Physical Review E, 78(2):026109, 2008.
[15] A. Lancichinetti, S. Fortunato, and F. Radicchi. Benchmark graphs for testing community detection algorithms. Physical Review E, 78(4):046110, 2008.
[16] V. E. Lee, N. Ruan, R. Jin, and C. Aggarwal. A survey of algorithms for dense subgraph discovery. Managing and Mining Graph Data, pages 303–336. Springer, 2010.
[17] I. X. Leung, P. Hui, P. Liò, and J. Crowcroft. Towards real-time community detection in large networks. Physical Review E, 79(6):066107, 2009.
[18] M. E. Newman. Fast algorithm for detecting community structure in networks. Physical Review E, 69(6):066133, 2004.
[19] M. E. Newman. Modularity and community structure in networks. PNAS, 103(23):8577–8582, 2006.
[20] G. Palla, I. Derényi, I. Farkas, and T. Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435(7043):814–818, 2005.
[21] A. Prat-Pérez, D. Dominguez-Sal, J. M. Brunat, and J.-L. Larriba-Pey. Shaping communities out of triangles. CIKM, pages 1677–1681, 2012.
[22] A. Prat-Pérez, D. Dominguez-Sal, and J.-L. Larriba-Pey. High quality, scalable and parallel community detection for large real graphs. WWW, pages 225–236, 2014.
[23] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi. Defining and identifying communities in networks. PNAS, 101(9):2658–2663, 2004.
[24] U. N. Raghavan, R. Albert, and S. Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.
[25] V. Satuluri and S. Parthasarathy. Scalable graph clustering using stochastic flows: applications to community discovery. SIGKDD, pages 737–746, 2009.
[26] L. Tang and H. Liu. Community detection and mining in social media. Synthesis Lectures on Data Mining and Knowledge Discovery, 2(1):1–137, 2010.
[27] J. Xie, S. Kelley, and B. K. Szymanski. Overlapping community detection in networks: The state-of-the-art and comparative study. ACM Computing Surveys, 45(4):43, 2013.
[28] J. Yang and J. Leskovec. Defining and evaluating network communities based on ground-truth. ICDM, pages 745–754, 2012.
[29] F. Zhao and A. K. Tung. Large scale cohesive subgraphs discovery for visual analysis. VLDB, pages 85–96, 2012.
