
Community Detection in Social Networks: An In-depth Benchmarking Study with a Procedure-Oriented Framework ∗

Meng Wang1, Chaokun Wang1, Jeffrey Xu Yu2, Jun Zhang1
1 Tsinghua University, Beijing 100084, China
2 The Chinese University of Hong Kong, Hong Kong, China
{meng-wang12, zhang-jun10}@mails.tsinghua.edu.cn, [email protected], [email protected]

ABSTRACT
Revealing the latent community structure, which is crucial to understanding the features of networks, is an important problem in network and graph analysis. During the last decade, many approaches have been proposed to solve this challenging problem in diverse ways, i.e. with different measures or data structures. Unfortunately, experimental reports on existing techniques fell short in validity and integrity, since many comparisons were not based on a unified code base or were merely discussed in theory.

We engage in an in-depth benchmarking study of community detection in social networks. We formulate a generalized community detection procedure and propose a procedure-oriented framework for benchmarking. This framework enables us to evaluate and compare various approaches to community detection systematically and thoroughly under identical experimental conditions. Upon that we can analyze and diagnose the inherent defects of existing approaches deeply, and further make effective improvements correspondingly.

We have re-implemented ten state-of-the-art representative algorithms upon this framework and make comprehensive evaluations of multiple aspects, including the efficiency evaluation, performance evaluations, sensitivity evaluations, etc. We discuss their merits and faults in depth, and draw a set of interesting take-away conclusions. In addition, we present how we can make diagnoses for these algorithms, resulting in significant improvements.

∗Corresponding author: Chaokun Wang.
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing [email protected]. Articles from this volume were invited to present their results at the 41st International Conference on Very Large Data Bases, August 31st - September 4th 2015, Kohala Coast, Hawaii.
Proceedings of the VLDB Endowment, Vol. 8, No. 10
Copyright 2015 VLDB Endowment 2150-8097/15/06.

1. INTRODUCTION
Intrinsic community structures are possessed by many real-world networks, e.g. biological data, communication networks and social graphs, to name but a few. Given a network, it is particularly interesting as well as challenging to detect the inherent and hidden communities. Communities, which have no quantitative definition, are also called clusters. They are usually considered as groups of nodes, in which intra-group connections are much denser than those inter-group ones. Just as many classic puzzles, community detection is intuitive at first sight but actually an intricate problem. Community detection [8] aims at grouping nodes in accordance with the relationships among them to form strongly linked subgraphs from the entire graph [26]. Since networks are usually modeled as graphs, detecting communities in multifarious networks is also known as the graph partition problem in modern graph theory [7, 2], as well as the graph clustering [1] or dense subgraph discovery problem [16] in the graph area. In the last decade, lots of solutions have emerged in the literature [5, 9, 19, 24, 14, 25, 12, 3, 29, 4, 11], trying to solve this problem from various perspectives.

The extensive research work has promoted the prosperity of the family of community detection approaches. However, it also raises a new difficulty: how to choose the most appropriate approach in specific scenarios, since many of the latest approaches have not been compared with each other upon unified platforms with the same datasets and uniform configurations. Given the huge diversity of the various approaches, it is usually not easy to analyze, compare and evaluate the extensive existing work. In this sense, a general benchmark for community detection is quite necessary and beneficial.

In this paper, we make a benchmarking study of community detection, which contains a universal procedure-oriented framework and a comprehensive evaluation system. Upon that, we are able to analyze, evaluate, diagnose and further improve the existing approaches thoroughly, and get interesting and credible conclusions.

1.1 Challenges
An in-depth benchmarking study for community detection is non-trivial and poses a set of unique challenges.

Firstly, considering the various existing approaches, the lack of a procedure-oriented framework for community detection makes it a puzzle to understand, compare and diagnose them. Since these approaches are of various categories, a universal framework of community detection is quite difficult to summarize and abstract.

Secondly, to make a fair comparison and build a general benchmark for evaluation, it is a necessity to re-implement different approaches of various categories based on a common code base. Actually, the re-implementation is really tough work.

Finally, when proposing a new approach, authors often validate their work via a limited set of metrics on which it performs well. In our benchmarking evaluation, we need a suite of metrics which can embody the full structural characteristics of communities, to evaluate the approaches as comprehensively and thoroughly as possible.

There exist two pieces of research work similar to ours. Yang et al. only investigated the performances of different metrics for communities with ground-truth [28]. Xie et al. made an evaluation of overlapping community detection [27]. However, they failed to present a universal framework. Instead, we conduct a systematic in-depth benchmarking study to solve the above challenges.

1.2 General Benchmark
In this paper, we have designed a benchmark for community detection.

As shown in Fig. 1, our benchmark consists of four core modules: (1) Setup, including a set of algorithms (Sec. 2.2), real-world and synthetic datasets (Sec. 6.1 and 6.2), unified parameter configurations, and a unified graph model converted from the datasets; (2) a generalized detection procedure with the common workflow of community detection (the details of the framework are introduced in Sec. 3; the procedure mappings of the algorithms to the framework are in Sec. 4); (3) diagnoses on these algorithms based on our framework, which provide targeted directions of improvement over the existing work (Sec. 5); and (4) a comprehensive evaluation system for community detection from different aspects (Sec. 6.3–6.11).

Figure 1: Benchmark for community detection
(The figure sketches the four modules: Setup — algorithms, datasets, configurations and the graph model; Detection Framework — Initialization with PreCompute and FirstAllocate, Transformation with Select, Fluctuate and Update, Construction with Collect, MultiLevelDraft and Extract; Diagnoses; and Evaluation — efficiency, effectiveness, accuracy, density, distribution, outliers, etc.)

The benchmark contains a universal framework which abstracts the key factors, phases and steps from community detection tasks, and makes it easy to implement many classical or latest approaches for comparison. Moreover, it consists of a comprehensive suite of widely-recognized metrics for evaluation of various concerned aspects, including the efficiency evaluation on time cost, performance evaluations on accuracy and effectiveness, sensitivity evaluations on network density and mixture, and additional evaluations on community distribution and outliers. By modularizing and separating key factors and steps, our framework allows us to study the strength and weakness of each algorithm thoroughly, and make diagnoses and targeted prescriptions for improvement. In this benchmark we provide a common code base with algorithms implemented in the same environment, and thus make the comparison more fair and credible.

1.3 Contributions
To the best of our knowledge, this benchmarking study is the first work which focuses on the in-depth analysis, evaluation and comparison of the extensive existing work on non-overlapping community detection techniques with a generalized framework. We make the following main contributions:

• We propose a novel procedure-oriented framework by formulating a generic workflow of community detection via abstracting and modularizing the key factors and steps.
• We review the family of community detection approaches, and re-implement ten state-of-the-art representative algorithms in a common code base (using standard C++) based on the framework by mapping them to their specifics.
• We make in-depth evaluations on these approaches based on our benchmark using both real-world and synthetic datasets. We draw a set of interesting take-away conclusions, and provide intuitive and brief ratings on the concerned algorithms.
• We also present how to make diagnoses for existing approaches, leading to significant performance improvements.

The remainder of this paper is organized as follows. We formulate the problem of community detection and sketch out the existing work in Sec. 2. In Sec. 3 we propose a universal framework for benchmarking in community detection, and then in Sec. 4 we map the existing approaches to the framework. Afterwards we make targeted diagnoses for these approaches based on the framework in Sec. 5. We evaluate these approaches with our benchmark and report the results and findings in Sec. 6, and conclude this study in Sec. 7.

2. PRELIMINARY AND BACKGROUND
As preliminaries, we first define the basic concepts and the problem of community detection, and then review the existing approaches.

2.1 Problem Definition
A social network is also referred to herein as a graph G = (V, E), where V = {v1, ···, vn} is the set of nodes denoting the individuals, and E ⊆ V × V is the set of undirected relationships denoting the social ties, |V| = n, |E| = m. Here we consider non-empty subsets of V of two kinds: closely connected groups of nodes (Coms), and a moderate number of disparate outliers (Outs). A community is also referred to as a cluster. Since community detection does not force each node to be grouped into a certain group, some nodes, which are far outside the detected communities, are allowed to stay independent of any group. We define them as outliers [13]. Non-overlapping communities are not confined to a graph partition, and clusters which usually incompletely cover the graph are more desirable.

DEFINITION (Community Detection). Generally, given a network G = (V, E) and mvs (minimal valid size, a general setting which means the size of a valid community should not be less than mvs), the community detection problem aims at finding the optimal community assignment R = (Coms, Outs) from G, s.t. (1) Coms = {V′1, ···, V′cn}, ∀ V′i, V′j ∈ Coms, V′i ⊆ V ∧ V′i ∩ V′j = ∅, and (2) ∪_{i=1}^{cn} V′i ∪ Outs = V, where cn is the total number of communities.

Herein the optimal assignment refers to closely connected groups of nodes (Coms) and a moderate number of disparate outliers (Outs). It is worth mentioning that outliers can be identified directly by the original algorithms, or be produced by disbanding the tiny groups whose sizes are less than the predefined threshold mvs. An intuitive example of the results of community detection is illustrated in Fig. 2. When mvs = 2 (a setting under which only singletons will be eliminated), there are three communities and one outlier: Coms = {{v1, ···, v6}, {v7, ···, v10}, {v11, ···, v14}} and Outs = {v15}.

Figure 2: An example of community detection

2.2 Detection Algorithms
Community detection has been studied unremittingly all these years, and a particularly large number of effective approaches have been proposed. In this study we focus on non-overlapping community detection, which aims at the fundamental problem of finding the definite group (community) that each node belongs to in the graph. We categorize the existing approaches according to the process of forming communities, as shown in Fig. 3, which covers most representatives from all the popular approaches proposed.

··· 11 { communities 3 v , | v , v V 12 E ∈ cn 0 graph , ⊆ v } V 13 where , , V ( @ , V v V × . , 14 i E 0 V as }} m ∈ ) , , , The first kind of approaches starts from the original graph and Categories Sketches Problem-Solving Perspectives Representatives Clustering decomposes the entire graph to local parts gradually, trying to sep- Entire Graph (Division) { [19], [23] } $ arate out communities from the entire graph. Communities Direct Partitioning { [22], [29] } Division algorithms in hierarchy clustering methods, such as Label Propagation { [17], [24] } Local Structures Community Leadership Expansion { [12] } Radicchi [23] and Spectral [19], gradually separate the entire net- Detection $ work into local parts by the edge clustering coefficient or the eigen- Communities Percolation { [14], [20] }

Hierarchy Clustering { [5], [18] } of matrix. Constructed Tree (Agglomeration) Direct partitioning methods separate the entire network into $ Matrix Blocking { [4] } Communities disjoint communities. The Scalable Community Detection (SCD) Skeleton Clustering { [3], [11] } algorithm [22] partitions the network by maximizing the weighted Figure 3: Categories of community detection approaches community clustering [21], a recently proposed metric of commu- nity. Maximal k-Mutual-Friends (M-KMF) [29] algorithm incre- makes it difficult to comparatively analyze these algorithms thor- mentally filters out the connections by the number of mutual friends oughly. For the sake of a better understanding of the underlying between nodes to let the communities spontaneously emerge. principles of community detection algorithms, we abstract two fun- Conversely, the second kind of approaches takes a bottom-up damental concepts, including the propinquity measure and the manner from local structures to the whole graph, and the commu- revelatory structure, which play critical in the community nities are formed during this process. detection task and can be used to distinguish different approaches. Label propagation methods start from local neighborhood to Definition 1. (Propinquity Measure). Given a subset M of the recognize communities automatically. The Label Propagation Al- elements (such as nodes, relationships or other specific structures) gorithm (LPA) [24] adopts an asynchronous update strategy where in a graph G, the propinquity measure of M, denoted as φ(M), is nodes join in groups under their neighbors’ choices. The HANP al- the measurement of the nearness of M by the inner-connections, gorithm [17] based on Hop Attenuation and Node Preference adopts and is the primary criterion to estimate the priority of the elements additional rules to ensure more stable and robust results. when they are transformed to make the communities emerge. 
Leadership expansion methods find communities according to local leader groups, since members always gather together around Definition 2. (Revelatory Structure). Given a graph G, the rev- some core nodes with high to form communities. The elatory structure corresponding to G, denoted as Π, is an assistant TopLeaders [12] algorithm gradually associates nodes to the near- structure derived from G, provides yet another way to organize the est leaders and locally reelects new leaders during each iteration. massive graph elements and enlightens us on the community struc- Clique percolation methods assume communities are constructed ture from the intertwined connections among them. by multiple adjacent cliques. Based on the original approach [20], The two concepts lay the basis for different community detec- the Sequential Clique Percolation (SCP) [14] algorithm sequen- tion algorithms. The former determines the tendency of grouping tially generates cliques to form connected communities. nodes to communities, while the latter records the gradual forma- The third kind of the approaches maintains a tree, which is a tion of communities and leads to a more effective detection process multi-level structure reorganized from the original graph, aiming at especially for approaches of the third category. finding communities corresponding to the branches of the tree. It should be noted that the specific definitions of the above con- Agglomeration algorithms in hierarchy clustering methods usu- cepts could be quite different in various approaches and thus we ally build an explicit hierarchical tree from small clusters to large only give a general definition here. The propinquity measure may ones. Based on the Newman Fast Greedy Algorithm (NFGA) [18], be the modularity [5], node [12], etc., and the revelatory Claust et al. 
proposed an agglomeration algorithm CNM [5] which structure may appear as the hierarchy tree [4, 18], G∗ graph [14] or starts from single nodes, maintains the change of modularity, and other particular structures. We will discuss how the two concepts iteratively generates the optimal level of the hierarchy structure. are defined specifically in different algorithms in Sec.4. Matrix Blocking technique can also be utilized in community detection by constructing a hierarchy tree to order nodes in a net- 3.2 The Generalized Procedure work. As a representative, the Matrix Blocking Dense Subgraph We formulate a generalized procedure of community detection Extract (MB-DSGE) algorithm [4] reorders the network, and ex- in this framework. As illustrated in the “Detection Framework” tracts dense subgraphs as communities. module in Fig.1, the procedure consists of three phases, including Skeleton clustering methods reveal dense connected clusters initialization, transformation and construction, and characterizes based on the skeleton of the original network, which is an efficient the generic workflow of community detection via a series of the way of finding communities. The SCOT+HintClus algorithm [3] key steps. The details of the procedure are shown in Alg.1. detects the hierarchical cluster boundaries of a network to extract Phase 1: Initialization the meaningful cluster tree. Inspired by this idea, the Graph Skele- In this phase, the graph elements need to get their initial propin- ton Clustering (gCluSkeleton) algorithm [11] projects the network 0 quity values and form a primary community assignment (Rtmp). As to its core-connected maximal spanning tree, and then detects the shown in Alg.1, after the propinquity measure ( φ) and the reve- optimized core-connected clusters on it. latory structure (Π) are defined (Line1), the procedure calculates initial propinquity values via PRECOMPUTE (Line3) and allocates 3. 
FRAMEWORK FOR BENCHMARKING the primary assignment via FIRSTALLOCATE (Line4). In this section, we present our procedure-oriented framework for Phase 2: Transformation benchmarking in community detection, which consists of two fun- In this phase, the inner structures and relations underlying the damental concepts abstracted from existing detection algorithms network elements are transformed and clarified iteratively, result- and a generalized procedure of community detection. ing in a set of intermediate detection results SRtmp . We abstract three key steps for each iteration, including SELECT,FLUCTUATE 3.1 Fundamental Concepts and UPDATE (Line6–8). First of all, in S ELECT, the candidate ele- Existing algorithms usually solve the community detection prob- ments CadT are picked out from the graph. After that, in FLUCTU- lem with various methods based on different assumptions. This ATE, the revelatory structure Π correlated with these elements will

1000 Algorithm 1 GENERICDETECTPROC Furthermore, the framework provides flexible combinations of Input: G(V,E), mvs and Tmax the optional steps, making itself adaptable to different kinds of al- Output: R(Coms,Outs) gorithms. Generally, all of the three categories of algorithms dis- 1: initialize φ and Π; cussed in Sec. 2.2 can be mapped to this framework. T 2: T ← 0, SRtmp ← /0, SR ← /0, Cad ← /0; The framework is light-weight. It unifies the input and output, 3: PRECOMPUTE(G,φ); T handles the overall detection process, and provides necessary inter- 4: Rtmp ← FIRSTALLOCATE(G,Π); faces. Algorithms can be integrated easily by defining the factors 5: while T! = Tmax&&!STABLE(Π)&&!OPTIMAL(φ) do T in Sec. 3.1 and implementing the functions in Sec. 3.2. Both the 6: Cad ← SELECT(G,Π); T T framework and algorithms are developed using standard C++. 7: Rtmp ← FLUCTUATE(Cad ,Π); T 8: UPDATE(INVOLVE(Cad ),φ); T 9: SRtmp ← SRtmp ∪ Rtmp; 4. IMPLEMENTATIONUPON FRAMEWORK 10: T++; The present section recaps ten of the state-of-the-art algorithms 11: end while for community detection and describes how they work under this 12: if S has multiple results then Rtmp framework. Fig.4 shows the overall procedure mapping of these 13: for each level ∈ Π do algorithms to the framework. Based on the specific perspectives, 14: S ← S ∪ COLLECT(S ); R R Rtmp we consider the following representatives: 15: end for 16: R ← MULTILEVELDRAFT(SR,φ or ψ); • Agglomeration algorithm CNM [5], and division algorithms 17: else if SRtmp has no obvious result then Radicchi [23] and Spectral [19] (hierarchy clustering, Sec. 4.1). 18: R ← EXTRACT(Π,φ or ψ); • M-KMF [29] (direct partitioning, Sec. 4.2). 19: else R is obtained in the iteration; • LPA [24] and HANP [17] (label propagation, Sec. 4.3). 20: end if • TopLeaders [12] (leadership expansion, Sec. 4.4). 21: R.Outs ← R.Outs ∪ ERASE(R.Coms,mvs); • SCP [14] (clique percolation, Sec. 4.5). 22: ORDER(R.Coms); 23: return R; • MB-DSGE [4] (matrix blocking, Sec. 
4.6). • gCluSkeleton [11] (skeleton clustering, Sec. 4.7). be transformed, i.e. being fluctuated to form a more apparent inter- 4.1 Hierarchy Clustering mediate result (RT ). Following these structural transformations, tmp The hierarchy clustering algorithms (CNM [5], Radicchi [23], in UPDATE, the propinquity of other involved elements will be re- Spectral [19], etc.) form communities in a multi-level structure computed and the next iteration starts. These three steps are con- progressively on the basis of the original graph. This process falls ducted iteratively until the iteration terminates. During the itera- into two types, i.e. agglomeration or division, depending on their tions, the latent communities can form in many ways, as the afore- construction order of the hierarchy structure. mentioned categories, i.e. from the whole graph to communities, The revelatory structure. Agglomeration algorithms usually main- from local structures to communities or from trees to communities. tain an explicit hierarchy tree as the revelatory structure , in which Three indicators are usually employed to terminate the iterative Π the leaves denote nodes of the graph and the branches combine procedure (as shown in Line5): (1) whether the iteration number nodes or groups at different levels. Actually, without constructing a reaches the fixed threshold T (which is usually chosen to ensure max hierarchy tree, most division algorithms, such as Radicchi [23] and an approximate convergence of an algorithm); (2) whether the reve- Spectral [19], start from the entire graph and split out the commu- latory structure Π is stable (STABLE(Π)); and (3) whether the value nities gradually. For the lack of space here, we only introduce the of the propinquity measure φ is optimal (OPTIMAL(φ)). According procedure mapping of the classical agglomeration algorithm CNM. to specific algorithms, these indicators may be chosen and designed The propinquity measure. 
CNM employs modularity [5] as the specifically in different approaches. propinquity measure φ to make a greedy choice, trying to optimize Phase 3: Construction the global modularity of the final community partition: φ(R)= ∑cn In the last phase, the final result R is constructed by refining the h  i i=1 Ii 2Ii+Oi − , where Ii indicates the total number of internal re- current intermediate results SRtmp . If SRtmp has multiple choices, m 2m COLLECT (Line 14) needs to gather the result at each level of Π. lationships within the community Ci, Oi the number of outgoing Then MULTILEVELDRAFT (Line 16) weighs the collected results relationships between nodes in Ci and any node outside. For CNM, and picks up the best-performing one. Usually, based on Π,EX- any node would be grouped into a community during the transfor- TRACT (Line 18) is employed to let the communities emerge. In mation, and thus the cn communities cover all nodes in the graph. most cases, EXTRACT or MULTILEVELDRAFT may also adopt φ The initialization phase. At the beginning, PRECOMPUTE ob- as a measure, nevertheless, sometimes another measure ψ can be tains an initial value of φ(G) and FIRSTALLOCATE takes each node adopted, e.g. density in [4]. Eventually, the invalid communities as a single tree (tiny mono-community with only one member) to 0 are removed by ERASE according to mvs, and R.Coms is sorted via form a primary forest (Rtmp). ORDER by the community size (Line 21–23). The transformation phase. In the iteration T,SELECT chooses two candidate trees (i.e. current communities) CadT , whose com- T−1 3.3 Highlights bination may lead to the maximum increase of φ(Rtmp ). Then, Composed of two fundamental concepts and a generalized pro- FLUCTUATE combines them to form a new tree (a new commu- T cedure, our framework is beneficial for understanding and analyz- nity of Rtmp). Afterwards UPDATE recalculates the new value of T ing the community detection approaches. 
The two concepts are the φ(Rtmp), and then the algorithm turns to the next iteration. basis for solving the problem of community detection, and differ- The construction phase. In some early methods, COLLECT needs ent implementations may lead to quite different performances, even to gather the result in each level of Π, and then MULTILEVEL- for the same approach (as illustrated in Sec. 5.1). The procedure is DRAFT selects the best one (with the maximum φ(R))[18]. The the modularization of the critical steps of this problem, and uncov- improvement in CNM lies in SELECT and UPDATE. It breaks the ers the generic detection workflow, making it easy to study various iteration once the latest combination cannot increase the modularity approaches deeply within the identical framework. any more (∆φ < 0 and OPTIMAL(φ(R)) is true). Consequently, the
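Read as code, the modularity sum above is straightforward to evaluate. The following self-contained C++ sketch is our illustration (not CNM's actual implementation from the benchmark's code base): it computes φ(R) for a given node-to-community map, the quantity whose greedy increase drives CNM's SELECT step:

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <set>
#include <utility>
#include <vector>

// phi(R) = sum_i [ I_i/m - ((2*I_i + O_i)/(2m))^2 ], where
// I_i = number of internal relationships of community i, and
// O_i = number of outgoing relationships of community i.
double modularity(const std::vector<std::pair<int, int>>& edges,
                  const std::map<int, int>& com) {
    const double m = static_cast<double>(edges.size());
    std::map<int, double> I, O;
    for (const auto& e : edges) {
        int cu = com.at(e.first), cv = com.at(e.second);
        if (cu == cv) I[cu] += 1.0;
        else { O[cu] += 1.0; O[cv] += 1.0; }
    }
    std::set<int> ids;  // every community id, even ones with I_i = 0
    for (const auto& kv : com) ids.insert(kv.second);
    double phi = 0.0;
    for (int c : ids) {
        double deg = 2.0 * I[c] + O[c];  // total degree of community c
        phi += I[c] / m - (deg / (2.0 * m)) * (deg / (2.0 * m));
    }
    return phi;
}
```

CNM itself never recomputes φ from scratch in each iteration; consistent with the description above, it maintains the change ∆φ for candidate merges and stops once every possible merge yields ∆φ < 0.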

1001 Algorithms PreCompute FirstAllocate Select Fluctuate Update Terminate Collect & MultiLevelDraft Extract CNM ↦ → ↷ → → ↬ ⇢ ↣ ↦ start step Radicchi ↦ → ↷ → → ↝ ⇢ ⇢ : Spectral ↦ → ↷ → → ↝ ⇢ ⇢ →: normal step LPA ⇢ ↦ ↷ → ⇢ ↺ ⇢ ↣ ↷: into the iteration HANP ⇢ ↦ ↷ → → ↺ ⇢ ↣ ↺: iterate to Tmax TopLeaders ↦ → ⇢ ↷ → ↝ ⇢ ↣ ↬: break when optimal SCP ⇢ ⇢ ↷ → ⇢ ↺ ⇢ ↣ ↝: break when stable M-KMF ↦ ⇢ ↷ → → ↬ ⇢ ↣ ↣: end step MB-DSGE ↦ → ↷ → ⇢ ↺ ⇢ ↣ ⇢: skipped step gCluSkeleton ↦ → ↷ → ⇢ ↺ ↣ ⇢ Figure 4: The overall procedure mapping of ten studied algorithms current forest makes up the result. Without a direct output during The construction phase. Empirically, label propagation algo- the iteration, CNM needs to take all the leaves in each individual rithms limit the number of iterations to a maximum value Tmax to tree as the members of a same community via EXTRACT. ensure that the label distribution achieves a steady status approxi- mately. At last, EXTRACT will classify nodes with same labels to 4.2 Direct Partitioning same communities. Partitioning a graph directly is an intuitive way to get the closely connected communities. We take M-KMF [29] as a representative 4.4 Leadership Expansion and present its mapping to our framework as follows. Leadership expansion methods (e.g., TopLeaders [12]) regard a The revelatory structure. In M-KMF the k-mutual-friends sub- community as a set of followers congregating around a potential graphs are taken as Π and the most appropriate ones get separated leader, and form communities by identifying promising leaders and out as communities. Each node v ∈ Π has the degree d(v) ≥ k and then iteratively assembling followers. each relationship belongs to at least k triangles. The revelatory structure. All leaders try to expand their own The propinquity measure. In M-KMF, the φ(r) of each relation- leader groups (i.e. Π) respectively to form the final communities. ship r is defined as the number of triangles TR(r) containing r. 
The initialization phase. Without an initial assignment of communities, M-KMF merely executes PRECOMPUTE to initialize the φ(r) of each relationship. Before that, it also applies a degree filter which pulls nodes with d(·) < k + 1 out as outliers.

The transformation phase. In each iteration T, SELECT picks out the relationships (Cad^T) for which φ(r) < k, and FLUCTUATE removes all these candidates from the current graph. Once a batch of relationships is erased, new disconnected parts (R^T_tmp) may arise. Hence, they are closer to the k-mutual-friends subgraphs we look for. Then, UPDATE copes with each relationship involved in these deletions and recomputes φ(r) for the next iteration.

The construction phase. The iteration ends precisely when there is no relationship left to be marked. At this moment, each living relationship belongs to k triangles, so that the number of mutual friends between each pair of connected nodes is maximal (OPTIMIZE(φ(r)) is true). To fetch the final communities, EXTRACT needs to traverse the current Π to get the connected components.

4.3 Label Propagation

Label propagation algorithms (LPA [24], HANP [17], etc.) identify communities via different labels spreading among neighbors.

The revelatory structure. Π herein is defined as the label distribution, where different labels stand for different communities.

The propinquity measure. LPA has no special propinquity measure, whereas HANP adopts the hop score and the preference of each node vi to improve its robustness: φ(vi) = S(i) · P(i)^ω, where ω is a regulatory factor. Please note that the preference P(i) can be d(i), and the hop score S(i) = max_{j∈NL(i)} S(j) − σ, where NL(i) is the set of neighbors having the same label as i and σ is the attenuation factor.

The initialization phase. In FIRSTALLOCATE, nodes are assigned unique labels indicating different mono-communities (R^0_tmp).

The transformation phase. In the iteration T, SELECT generates a random permutation of the nodes (Cad^T). For LPA, in FLUCTUATE, each node changes its label to the most frequent one among the labels of its neighbors, while for HANP the process is extended by introducing the hop score and the node preference into the label frequency. The new label distribution makes up the new temporary result R^T_tmp. Without φ, LPA needs no UPDATE, while HANP needs UPDATE to recompute the S(i) of φ(vi) w.r.t. σ, which controls how far the particular label of the newly fluctuated node can spread.

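The LPA transformation just described can be condensed into a few lines. The sketch below is our own minimal illustration (asynchronous updates over a random permutation, random tie-breaking, a simple iteration cap), not the benchmarked implementation:

```python
import random
from collections import Counter

def lpa(adj, t_max=6, seed=0):
    """Minimal label propagation: every node starts in its own
    mono-community, then repeatedly adopts the most frequent
    label among its neighbors (ties broken at random)."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}          # FIRSTALLOCATE: unique labels
    for _ in range(t_max):                # transformation iterations
        order = list(adj)
        rng.shuffle(order)                # SELECT: random permutation
        changed = False
        for v in order:                   # FLUCTUATE
            if not adj[v]:
                continue
            freq = Counter(labels[u] for u in adj[v])
            best = max(freq.values())
            top = [lab for lab, c in freq.items() if c == best]
            new = rng.choice(top)
            if new != labels[v]:
                labels[v], changed = new, True
        if not changed:                   # STABLE(Pi): no label moved
            break
    comms = {}                            # EXTRACT: group by final label
    for v, lab in labels.items():
        comms.setdefault(lab, set()).add(v)
    return list(comms.values())

# toy graph: two triangles joined by a single bridge edge
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
```

Because ties are broken randomly, repeated runs may split the bridge differently; the output is always a partition of the node set.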
The propinquity measure. The TopLeaders algorithm adopts the degree centrality φ(vi) = d(i) of a node vi to measure its leadership.

The initialization phase. Specifically, PRECOMPUTE first computes the global degree centrality of each node in the graph. Then, FIRSTALLOCATE elects the ld most pivotal nodes as the leaders by the FNiC strategy (Few Neighbors in Common, which assures that the neighborhoods of any two leaders have an intersection smaller than λmax) and forms the initial states of the groups (R^0_tmp).

The transformation phase. In each iteration T, TopLeaders skips SELECT and executes FLUCTUATE directly. In FLUCTUATE, each node explores its neighbors within l hops using the BFS strategy and counts the common neighbors with each leader. Once the number of common neighbors with any leader exceeds the threshold λmin before the BFS reaches the maximum depth dp, the node joins the corresponding group around that leader. Then, the leader group gets enlarged (R^T_tmp). Please notice that any node which has multiple choices and cannot make this decision until l reaches the upper bound will be regarded as a hub, i.e. a special outlier. Besides, any node whose common neighbors cannot reach λmin anyhow will be regarded as an outlier. In consequence of the expansion, UPDATE needs to recompute φ(vi) locally and tries to reelect new leaders.

The construction phase. The transformation process repeats until no new leader comes to power any more (STABLE(Π) is true). Once all the leaders win a second term, each current leader group can be obtained via EXTRACT as a final community.

4.5 Clique Percolation

In clique percolation algorithms, a clique is an atomic element in the graph. As a representative, SCP [14] detects communities sequentially based on the idea of clique percolation. Unlike other algorithms, SCP does not define an explicit propinquity measure.

The revelatory structure. SCP maintains the G* graph as the revelatory structure Π, in which the (κ−1)-cliques are taken as nodes. If two (κ−1)-cliques belong to a same κ-clique, they are connected in the G* graph. Generally, SCP assumes κ = 3 or 4. If κ = 3, for the endpoints of a relationship, the κ-cliques can be directly obtained from their common neighbors. If κ = 4, each pair of connected common neighbors can make up a 4-clique together with the two endpoints. In the experiments later, we will show the effect of κ.

The transformation phase. SCP goes directly into the transformation. In each iteration T, SELECT first gets the common neighbors of the endpoints of a relationship and forms the corresponding κ-cliques (Cad^T). Then, in FLUCTUATE, all inner (κ−1)-cliques are inserted into the G* graph (R^T_tmp). All the (κ−1)-cliques which belong to a same upper clique will be linked in this graph.

The construction phase. The iteration is nothing but a sequential traversal of the relationships and ends after dealing with all κ-cliques (Tmax = m). At last, EXTRACT hunts for all connected components in the G* graph and takes them as communities.

4.6 Matrix Blocking

The matrix blocking technique (e.g., MB-DSGE [4]) finds the dense subgraphs, i.e. the communities, based on a rearrangement of G.

The revelatory structure. The hierarchy tree is adopted as Π.

The propinquity measure. MB-DSGE utilizes the cosine similarity to measure the propinquity of two nodes: φ(vi, vj) = ⟨A(:,i), A(:,j)⟩ / (|A(:,i)| · |A(:,j)|), where A is the adjacency matrix of G.

The initialization phase. PRECOMPUTE first computes φ(vi, vj) for each pair of nodes (not only those which have connections). Then, the triples (i, j, φ(vi, vj)) are sorted in descending order according to the value of φ to form a queue CQ. In FIRSTALLOCATE, trees representing single nodes form the original forest (R^0_tmp).

The transformation phase. In the iteration T, SELECT pops the triple (i, j, φ(vi, vj)) with the maximal cosine similarity from CQ and finds the trees corresponding to nodes i and j (Cad^T). Then, FLUCTUATE combines the two trees to form a new branch (R^T_tmp) in Π if they have no common ancestor. Since CQ has been obtained beforehand, UPDATE is unnecessary in MB-DSGE.

The construction phase. The entire iteration ends once the whole tree is built (Tmax = n − 1). Since an obvious result cannot be obtained now, MB-DSGE constructs the final communities using another measure, density. EXTRACT first counts the wrapped-up nodes V(b) and relationships E(b) for each branch b of Π in a bottom-up way. Then, it traverses Π in a top-down way and computes the density ψ(b) = 2|E(b)| / (|V(b)|(|V(b)| − 1)) of each branch. Once the density of b is higher than the predefined threshold ρmin, b is pruned out and all the leaves within it turn out to be the tightly interconnected members of a community.

4.7 Skeleton Clustering

Skeleton clustering algorithms (e.g., gCluSkeleton [11]) try to find closely connected communities based on the skeleton of the original graph G to make the detection process more efficient.

The revelatory structure. The gCluSkeleton algorithm performs the skeleton-based clustering just on the core-connected maximal spanning tree (CCMST) of G. Therefore, the CCMST and the hierarchy tree together make up the revelatory structure Π.

The propinquity measure. The gCluSkeleton algorithm adopts the structural similarity (denoted as ε) to measure the propinquity of the endpoints of a relationship r: φ(rij) = |N(i) ∩ N(j)| / (√|N(i)| · √|N(j)|), where N is the self-contained neighborhood of a node.

The propinquity measure φ here owns a suite of derivatives: (1) core-similarity (CS), (2) reachability-similarity (RS), (3) core-connectivity-similarity (CCS), and (4) attachability-similarity (AS). For node i, CS(i) is the maximum structural similarity ε̂ such that node i has at least µ adjacent relationships whose structural similarities are greater than ε̂. RS(rij) = min{CS(i), φ(rij)} stands for the reachability from node j to i. CCS(rij) = min{RS(i, j), RS(j, i)} ensures the bidirectional reachability of the two nodes. AS(i) is the maximum RS(j, i) obtained over all neighbors of node i, and the corresponding neighbor j is called an attractor of i.

The initialization phase. In PRECOMPUTE, the CCS of all the relationships is computed, and meanwhile the triples (i, j, AS(i)) are initialized and stored in a priority queue AIQ. Taking CCS as the weight, FIRSTALLOCATE produces the CCMST of G and stores all unique CCS values of the CCMST in the ε-queue in descending order.

The transformation phase. In the iteration T, SELECT takes the first ε out of the ε-queue and filters all relationships in the CCMST with CCS < ε to reveal the current core-connected clusters (Cad^T). When necessary (µ > 2), SELECT fetches attractors from AIQ to enlarge the clusters. FLUCTUATE then takes these clusters as branches in the hierarchy tree and the clusters produced in the previous iteration as subbranches of them. Along with the decrease of ε, each previous branch definitely belongs to a current one.

The construction phase. Without a wise indicator of the termination state, the iteration goes on until all the branches form a whole tree (Tmax = n − 1, at most), when the ε-queue is empty. Now, COLLECT brings the homologous clusters w.r.t. the branches into buckets (SR) with different ε values (different levels in Π). MULTILEVELDRAFT then picks up the best result R using an extended modularity function ψ(R).

5. DIAGNOSES AND IMPROVEMENTS

With high abstraction and modularization of the fundamental factors and key steps of the general detection approaches, the proposed framework provides not only a testbed to compare various approaches with a fair benchmark, but also a tool for us to study the strengths and weaknesses of each algorithm. It allows us to conduct diagnoses for algorithms which do not perform well, and further to make targeted prescriptions for improvement. Due to the space limitation, here we discuss directions for diagnosing and improving the algorithms with two examples.

5.1 Diagnosing Key Factors

As analyzed earlier, the propinquity measure φ plays a critical role in many approaches to community detection. Different approaches usually adopt various implementations of φ, but not all of them are the best choices. Thus one promising direction for improvement is to look again at the propinquity measures inside existing approaches and seek more appropriate implementations.

We take MB-DSGE as an example. In Fig. 5(a) we illustrate how MB-DSGE works on the given network shown in Fig. 2. According to the current propinquity measure φ defined using cosine similarity, v9 is closer to v10, with whom it has two common neighbors, than to v7 and v8. Thus, in MB-DSGE, v9 would be merged with v10 first. However, without even a relationship between them, they are factually not so close. On the contrary, v9 is closer to the directly connected v7 and v8. Consequently, this may result in sparse communities with low density. Another possible drawback of the current definition lies in that individual outliers cannot be identified clearly, since they usually have few neighbors and may be prematurely merged (e.g. v15 in Fig. 5(a)).

[Figure 5: Comparison of the hierarchy trees and the results of MB-DSGE and MB-DSGE* (ρmin = 0.35): (a) MB-DSGE, (b) MB-DSGE*.]

We can improve MB-DSGE by replacing φ, leading to an improved algorithm, MB-DSGE*. Since the propinquity measure φ in MB-DSGE does not take the direct connections between two nodes into consideration, we can replace it with the corresponding φ in gCluSkeleton, i.e. the structural similarity, to correct this omission. Note that for MB-DSGE*, φ needs to be computed for each pair of nodes, while for gCluSkeleton it only needs to be computed for the endpoints of each relationship.

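The difference between the two propinquity measures can be made concrete. Below is a small sketch of our own (not the authors' code): the cosine similarity over adjacency columns used by MB-DSGE ignores the direct edge between i and j, while the structural similarity used by gCluSkeleton works on self-contained neighborhoods N(v) = {v} ∪ neighbors(v), so a direct relationship raises φ:

```python
import math

def cosine_phi(adj, i, j):
    """MB-DSGE-style phi: cosine similarity of the 0/1 adjacency
    columns, i.e. shared neighbors only; the edge (i, j) itself
    contributes nothing."""
    shared = len(adj[i] & adj[j])
    denom = math.sqrt(len(adj[i])) * math.sqrt(len(adj[j]))
    return shared / denom if denom else 0.0

def structural_phi(adj, i, j):
    """gCluSkeleton-style phi: structural similarity over
    self-contained neighborhoods, so a direct edge raises phi."""
    ni, nj = adj[i] | {i}, adj[j] | {j}
    return len(ni & nj) / (math.sqrt(len(ni)) * math.sqrt(len(nj)))

# tiny example: i and j are directly connected and share neighbor k
adj = {'i': {'j', 'k'}, 'j': {'i', 'k'}, 'k': {'i', 'j'}}
```

On this triangle the cosine measure gives 0.5 for the pair (i, j), while the structural similarity gives 1.0, illustrating why MB-DSGE* merges directly connected nodes earlier.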
The execution processes of MB-DSGE and MB-DSGE* on the network of Fig. 2 are visualized in Fig. 5 with the built hierarchy trees and the communities denoted with dotted boxes. Compared with Fig. 5(a), the new hierarchy tree in Fig. 5(b) seems more orderly, since MB-DSGE* generates a combination order {3,5,1,4,2,6} in accordance with the nearness of the nodes in the subgroup. MB-DSGE* produces three dense subgraphs, while MB-DSGE cannot differentiate the two groups in the right part of the graph. Consequently, the rationality of the result gets heightened a lot.

5.2 Diagnosing Key Steps

In Sec. 3.2, we abstract the main phases and key steps of general approaches to community detection. As presented in Sec. 4, algorithms can be implemented within our framework via step mapping. This enables us to examine the performance of each single step and analyze its influence on the whole algorithm in depth. Upon that, we can propose targeted improvements by adding a beneficial step, removing an unnecessary one, or modifying an inefficient one.

We illustrate this with the LPA algorithm as an example. In original LPA, no propinquity measure is defined. Thus there exist no important steps of evaluating and updating the propinquity of the elements in the network. In this way, the varying importance of nodes and relationships is neglected.

This inspires a direction to improve LPA. Nodes with different confidences should be treated differently during the label propagation process. Thus we introduce the activeness of a node, which counts how many times the node changes its label during the whole iteration. Intuitively, there is a negative correlation between the confidence of a node and its activeness, and thus we define the propinquity measure as φ(vi) = 1/Activeness(i)^ω. In the transformation phase, we modify the FLUCTUATE step so that nodes also choose their labels according to their neighbors' confidence, and add a critical step UPDATE to recompute the propinquity of the nodes. As demonstrated in the experiments later in Sec. 6.10, these specific diagnoses achieve great improvements over the original algorithms.

It should be noted that in this study we only take two examples for illustration due to the limit of space. More diagnoses may be conducted for other existing approaches based on our framework. For instance, instead of a trivial MULTILEVELDRAFT, a smart EXTRACT step may be preferred in gCluSkeleton. It may also make the iteration in TopLeaders more efficient to add an extra SELECT for choosing the candidates.

6. BENCHMARKING EVALUATION

In this section, we conduct in-depth evaluations of the community detection algorithms within our framework using the proposed benchmark, which covers efficiency, accuracy, effectiveness, density sensitivity, mixture sensitivity, outliers, community distribution and diagnosis effects. We introduce the datasets and parameter configurations in the benchmark first, then report our thorough evaluation methodology and results, and summarize our findings and rate the algorithms intuitively at last. All experiments are conducted on a computer running Windows Server 2008 with an Intel Xeon 2.0 GHz CPU and 256 GB RAM.

6.1 Datasets

Towards a better and more thorough evaluation of the existing approaches, in this study we adopt three categories of datasets.

In the first category, we consider two widely used small-scale real-world networks, Strike and Football [12], for which we can obtain the authentic communities. The statistics are shown in Table 1, where cn is the number of real communities.

Table 1: Small-scale real-world networks with ground-truth
Name      n    m    cn  Supp.
Strike    24   38   3   /
Football  180  788  11  61 outliers, 4 hubs

We also employ seven large-scale real-world networks without ground-truth from the Stanford Large Network Dataset Collection (http://snap.stanford.edu/data/), ranging from co-author networks (CA-HepPh, DBLP), co-purchasing networks (Amazon) and communication networks (Email-Enron) to friendship networks (BrightKit, Gowalla and YouTube). We report the statistics in Table 2, where ccavg is the average clustering coefficient and diam the network diameter. We cannot obtain the real communities of these networks, and thus we adopt widely recognized metrics to evaluate the performance in terms of the quality of the clusters instead of the accuracy.

Table 2: Large-scale real-world networks
Name         n          m          ccavg   diam
CA-HepPh     12,008     118,505    0.6115  13
Email-Enron  36,691     183,830    0.4970  11
BrightKit    58,228     214,078    0.1723  16
Gowalla      196,591    950,327    0.2367  14
DBLP         317,080    1,049,866  0.6324  21
Amazon       334,863    925,872    0.3967  44
YouTube      1,134,890  2,987,624  0.0808  20

Furthermore, we use the Lancichinetti-Fortunato-Radicchi (LFR) networks [15] with ground-truth. Tens of synthetic datasets are produced for the in-depth evaluations, and we only list some representatives in Table 3. Here d is the average degree of the nodes, dmax the maximum node degree, cmin the minimum community size, and cmax the maximum community size. In LFR, another critical parameter is the mixing parameter u (0.2 by default [11]), which indicates the proportion of relationships a node shares with other communities.

Table 3: Synthetic networks with ground-truth
Name    n        m          d   dmax  cmin  cmax
S_10K   10,000   50,302     10  50    20    100
S_100K  100,000  504,371    10  50    50    100
S_200K  200,000  953,230    10  100   60    200
S_500K  500,000  2,938,555  10  250   100   500

6.2 Configurations

In our benchmarking evaluation, all requisite parameters of the studied algorithms are set to the values recommended in their original papers, as shown in Table 4.

In particular, for Radicchi, we choose the weak option, i.e. the community definition Ω = weak (see Sec. 6.5), to avoid excessive outliers. For TopLeaders, the number of leaders ld needs to be provided manually or in advance by other means. For MB-DSGE, although the prevalent value of ρmin is 0.1, we may need to run the algorithm repeatedly for better results (e.g., 0.5 on Football), since a slightly inappropriate ρmin may lead to awful results as the scale of the network increases. More details about the effects of the parameters are presented in the following evaluations.

Table 4: Parameter setting
Algorithm  Value        Algorithm     Value
Radicchi   Ω=weak       TopLeaders    λmax=5, λmin=0, dp=2
LPA        Tmax=6       HANP          Tmax=20, ω=0.1, σ=0.05
SCP        κ ∈ {3, 4}   M-KMF         k ∈ {1, 2, 3}
MB-DSGE    ρmin=0.1     gCluSkeleton  µ ∈ {2, 3}

6.3 Efficiency

In this subsection, we discuss the efficiency of the concerned algorithms from both theoretical and experimental aspects.

Theoretical analyses. The time complexities of the ten algorithms in each phase are illustrated in Table 5. Most social networks are sparse, and we usually have O(m) ≈ O(n). The community structures become non-significant when m tends to n² [8].

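Several of the bounds in Table 5 are easiest to see in code. For instance, the transformation of M-KMF repeatedly deletes every relationship whose endpoints share fewer than k common neighbors. The sketch below is our own simplified version: it recomputes φ(r) from scratch each round for clarity, whereas the benchmarked implementation only updates the relationships affected by the deletions:

```python
def k_mutual_friends(adj, k):
    """Iteratively delete every edge whose endpoints share fewer
    than k common neighbors (phi(r) < k). What survives decomposes
    into k-mutual-friends subgraphs; isolated nodes drop out as
    outliers. Simplified: phi is recomputed each round."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    while True:
        doomed = [(u, v) for u in adj for v in adj[u]
                  if u < v and len(adj[u] & adj[v]) < k]
        if not doomed:                            # OPTIMIZE(phi(r)) holds
            break
        for u, v in doomed:                       # FLUCTUATE: erase batch
            adj[u].discard(v)
            adj[v].discard(u)
    return {v: ns for v, ns in adj.items() if ns}

# a 4-clique with a pendant edge hanging off node 3
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3},
       3: {0, 1, 2, 4}, 4: {3}}
```

With k = 2 the pendant edge (3, 4) is peeled away and node 4 becomes an outlier, while the 4-clique survives because every remaining edge lies in at least two triangles.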
[Figure 6: Time costs with node scale under different phases in the framework w.r.t. various algorithms (CNM, Radicchi, LPA, HANP, TopLeaders, SCP, M-KMF, MB-DSGE, gCluSkeleton; log-scale processing time vs. nodes ×10^5): (a) initialization, (b) transformation, (c) construction, (d) overall.]

Theoretically, many classical hierarchy clustering algorithms have high time overheads, such as Radicchi, Spectral and NFGA. However, CNM can greatly reduce the time complexity since it needs no MULTILEVELDRAFT after the transformation. Obviously, the label propagation algorithms traverse all nodes and their neighbors Tmax times, while TopLeaders executes this process for an uncertain time T. Moreover, TopLeaders has tough work in FIRSTALLOCATE to find the original leaders. The time cost of SCP depends on the quantity of non-repeating (κ−1)-cliques (Nc) in the network. For M-KMF, the involved elements are mostly relative to the average degree davg of the nodes in each iteration, and the worst case is that all relationships are removed. MB-DSGE spends O(m log m) time in PRECOMPUTE, and O(n log n) in tree building. For gCluSkeleton, Nε is the size of the ε-queue (i.e. Tmax), which equals n − 1 at most.

Table 5: Overview of time complexity
Algorithm     Initialization    Transformation   Construction
CNM           O(m + n)          O(m log n)       O(n)
Radicchi      O(m + n)          O(m⁴/n²)         —
Spectral      O(n²)             O(n² log n)      —
LPA & HANP    O(n)              O(Tmax(m + n))   O(n)
TopLeaders    O(n log n + ld²)  O(T(m + n))      O(n)
SCP           —                 O(Nc)            O(m + n)
M-KMF         O(m + n)          O(davg·m)        O(m + n)
MB-DSGE       O(m log m)        O(n log n)       O(n)
gCluSkeleton  O(m + n log n)    O(m log n)       O(Nε(m + n))

Experiments on synthetic networks. We evaluate the efficiency and scalability of these algorithms on synthetic networks whose scales vary from 100K nodes to 500K nodes (Fig. 6(a)–(d)). We find that Spectral cannot even handle S_100K due to an enormous consumption of 40 GB of RAM for a single-pass iteration, which is an unbearable space overhead in practice. Thus we only evaluate the other approaches according to the three phases of the workflow.

Fig. 6(a) shows the initialization time of six algorithms varying with the node scale (with a logarithmic Y axis). The time costs of LPA and HANP are much lower than the others and thus omitted for brevity, and SCP does not have an initialization phase. In accordance with the above analyses, TopLeaders spends the most time in PRECOMPUTE, followed by MB-DSGE and gCluSkeleton, both of which need to compute and sort the values of φ in this step.

The time costs of the transformation phase of each algorithm are shown in Fig. 6(b). M-KMF outperforms the others significantly. For SCP, the iteration performs slightly faster by utilizing the OpenMP API (http://openmp.org) to optimize the process in FLUCTUATE. TopLeaders can hardly get stabilized on large-scale datasets, and thus we relax the required proportion of two-term leaders to approximately 95%. Consequently, this tiny modification increases the efficiency significantly in our experiments with little influence on performance. CNM, MB-DSGE and gCluSkeleton cost much more time in this phase than the others. For CNM, the number of iterations is supposed to be O(log n), but actually it nearly reaches O(n) (e.g., the number of iterations is 99,404 on S_100K, which contains 100,000 nodes). Although the iteration of gCluSkeleton is similar to that of the Kruskal algorithm (O(m log n)), the high time cost may be attributed to the pairwise association of clusters in FLUCTUATE.

In the construction phase, except for gCluSkeleton, most algorithms complete this process within negligible time, as shown in Fig. 6(c), since EXTRACT usually only traverses Π to get the final results. gCluSkeleton costs much time due to the existence of the time-consuming COLLECT and MULTILEVELDRAFT. The overall time costs of these algorithms are shown in Fig. 6(d), almost the same as those in the transformation phase.

Experiments on real-world networks. We also conduct experiments on the efficiency of the algorithms on real-world networks. The overall time costs on the datasets CA-HepPh, Email-Enron, Amazon and DBLP are presented in Fig. 7. In general, the ranking of the processing time is consistent with that on the synthetic networks, and thus we do not discuss it in detail for brevity.

[Figure 7: Time overheads on real-world networks (CA-HepPh, Email-Enron, Amazon, DBLP).]

Influence of framework implementation. To evaluate the influence on efficiency of the framework standardization of the original algorithms, we compare their overall processing time in both the native implementation and the framework implementation on S_500K. The result of the comparison is shown in Fig. 8. Since our re-implementation based on the framework is just an encapsulation of the original algorithms with the generalized detection procedure, it has almost no effect on the efficiency of the algorithms.

[Figure 8: Comparison on the efficiency of framework implementation vs. native implementation.]

Remarks. Considering the overall time overheads, we find that M-KMF performs more efficiently than the others. Generally, these algorithms can be ordered w.r.t. efficiency as M-KMF > SCP > TopLeaders > Radicchi > (LPA, HANP) > (CNM, MB-DSGE, gCluSkeleton). The approaches themselves are efficient under the time complexity O(n log n), but the transformation is the most time-consuming phase.

6.4 Accuracy

In this subsection, we evaluate the accuracy of an algorithm on datasets with ground-truth by examining to what extent the communities delivered by the algorithm are consistent with the real ones.

Metrics. Cross Common Fraction (Fsame) compares each pair of communities, in which one comes from the detected result and the

[Figure 9: Accuracy and effectiveness of various algorithms on six metrics (accuracy: Cross Common Fraction (Fsame), Jaccard Index, Normalized Mutual Information (NMI), on Strike, Football, S_100K and S_200K; effectiveness: Clustering Coefficient (CC), Strength (S), modularity (Q), on BrightKit, Gowalla, DBLP and Amazon): (a) accuracy, (b) effectiveness.]

other comes from the real one, to find the maximal shared parts. Formally, it is defined as

Fsame = (1/2) ∑_{i=1..cn} max_j |Ci ∩ C′j| + (1/2) ∑_{j=1..cn′} max_i |Ci ∩ C′j|,   (1)

where cn and cn′ are the quantities of detected and real communities respectively, and Ci and C′j are the i-th detected community and the j-th real community respectively.

The Jaccard Index [10] is a widely used similarity measure. It is more sensitive, for it can overcome the little variance of the cross common fraction when nodes from several different communities in one result join together as a single community in another result [24]. It can be defined as Ps/(Ps + Ps1 + Ps2), where Ps stands for the number of node pairs which are classified into the same community in both results, Ps1 stands for the number of node pairs appearing in the same community in the algorithm-produced result but in different communities in the truth, and Ps2 vice versa.

Normalized Mutual Information (NMI) [6] is a popular criterion for evaluating the accuracy of community detection based on information theory. The score of NMI stands for the agreement of two results, and it is defined as

NMI = −2 ∑_{i,j} Nij log(Nij·Nt / (Ni·N·j)) / ( ∑_i Ni· log(Ni·/Nt) + ∑_j N·j log(N·j/Nt) ),   (2)

where N is the confusion matrix whose element Nij is the number of shared members between a detected community Ci and a real one C′j, Ni· and N·j are the sums over row i and column j respectively, and Nt = ∑i ∑j Nij.

Remarks. The scores of the above metrics are shown in Fig. 9(a). Based on the accuracy evaluation, we can find that SCP, M-KMF, MB-DSGE and gCluSkeleton perform outstandingly on networks containing many outliers, such as Football. The results of these algorithms show high stability, except for MB-DSGE, which may be easily influenced by the scale of nodes. Moreover, it is obvious that LPA and HANP produce eminently believable and accurate results on networks without any outliers, such as the synthetic networks.

6.5 Effectiveness

Most large-scale real-world networks have no ground-truth. In this situation, we evaluate the effectiveness of the algorithms by examining the quality of the detected communities via various metrics.

Metrics. The Clustering Coefficient (CC) focuses on the tendency of community members to cluster together. Thus, a high clustering coefficient means a high probability that the connections inside the detected communities are dense. The global clustering coefficient of an undirected network can be computed as

CC = (1/cn) ∑_{i=1..cn} (1/|Ci|) ∑_{v∈Ci} 2|{ets : vt, vs ∈ N(v) ∩ Ci, ets ∈ E}| / (d(v)(d(v) − 1)),   (3)

where N(v) consists of the neighbors of node v, d(v) is the degree of v, cn is the community quantity, and Ci is the i-th community.

Strength (S) measures the intensity of the detected communities. Evidence shows that a result usually seems valid if most members are in the same community as their neighbors. We inherit Radicchi's community definitions [23] to get the strength score of a result. Suppose that d(v)^in and d(v)^out stand for the degrees inside and outside the community C containing node v, respectively. If d(v)^in > d(v)^out for ∀v ∈ C, C is a strong community; if ∑v d(v)^in > ∑v d(v)^out, C is a weak community; otherwise C is invalid. To make the global evaluation, we define the strength as S = (1/cn) ∑_{i=1..cn} score(Ci). In this work, we let score(Ci) = 1 for a strong community, 0.5 for a weak one, and 0 for an invalid one.

Modularity (Q) is the most widespread quality function for community detection. It compares the result with a randomized one to indicate how reasonably the nodes are assigned into different groups. The definition of this metric has been given in Sec. 4.1.

Remarks. As illustrated in Fig. 9(b), almost all algorithms show highly consistent effectiveness on different datasets. Most algorithms get a satisfactory CC, so that the communities produced by these algorithms have high internal aggregation. Specifically, with a similar φ defined based on the number of triangles, Radicchi and M-KMF perform very well w.r.t. S, showing a strong preference for intra-group links over inter-group ones. However, their modularity scores are much worse than those of the others. Differently, CNM gets the best Q among all the algorithms, since it takes this metric as φ and achieves the optimal value at the end of the transformation. It is worth mentioning that without prior knowledge of the leader number, TopLeaders does not perform well any more. In general, the communities detected by HANP, LPA and CNM have higher qualities than those produced by the other algorithms.

6.6 Density Sensitivity

We are also interested in the sensitivity of the algorithms when they are applied to datasets with various network densities. In this experiment we generated synthetic networks on the basis of S_10K by gradually increasing its density via the average degree d (from 10 to 70, ∆d = 10; note that |E| = η|V| and d ≈ 2η).

The variations of the overall time costs are shown in Fig. 10. Except for SCP, CNM, LPA and HANP, the growing tendencies of the remaining algorithms are in line with those demonstrated in Fig. 6(d). We also studied the variations of effectiveness, accuracy and community quantity of the concerned algorithms on datasets with different densities, and present the results in Fig. 11. For space limitation, here we take CC and Fsame as representatives of effectiveness and accuracy, respectively, and similar tendencies are observed on

1006 105 CNM We evaluate the performance in terms of the effectiveness (mod- 104 Radicchi (s) LPA ularity score Q) and the accuracy (Jaccard index) of the concerned 3 10 HANP algorithms and illustrate the results in Fig. 12. The modularity Time

102 TopLeaders of most algorithms except gCluSkeleton declines linearly when u 10 SCP grows, while the Jaccard index of them shows different change ten- M‐KMF

Processing 1 dencies: CNM and gCluSkeleton deteriorate much faster than oth- MB‐DSGE ‐1 10 gCluSkeleton ers while LPA and HANP are more stable and insensitive to the 10 20 30 40 50 60 70 d mixture degree. Figure 10: Overall time costs of varying d Remarks. We can find that all the algorithms perform worse on datasets with higher mixture degrees, and both the effectiveness other metrics. With the increase of density, most algorithms ex- and accuracy would decrease when the communities overlap more cept SCP produce communities with higher CC values as expected. with each other and become less discriminative. In comparison, From the perspective of accuracy evaluated with F , Radicchi, same the label propagation algorithms are more stable and show better LPA and HANP not only outperform others significantly, but also capability of distinguishing communities from confusing mixtures. show more stable performance with less sensitivity on the density. On the contrary, the performances of SCP, M-KMF and MB-DSGE 6.8 Outliers decrease obviously when networks become denser. Furthermore, Focusing only on the produced clusters in the community de- we find the quantities of detected communities decrease consis- tection tasks, sometimes we may be hoodwinked and come to a tently with the increasing density. Since k =1 here, M-KMF is conclusion unilaterally. Because of the existence of outliers, we more likely to form large connected k-mutual-subgraphs so that the should also care about how many nodes are discriminated outside total number of communities drops rapidly. the groups meanwhile. As we discussed in Sec. 2.1, outliers can CC Avg. node degree d 0.8 0.6 be affirmed automatically by the original algorithms or the prede- 0.4 10 10 10 0.2 fined threshold mvs. In the former case, dense groups of nodes are 0 20 20 20 Fsame revealed from the network so that the rest parts are naturally re- 0.9 30 30 30 0.6 0.3 40 40 40 garded as outliers. 
In the latter case, a proper partition of the entire 0 Quantity 50 50 50 300 network is obtained so that the components whose sizes are less 200 60 60 60 100 70 70 70 than mvs will be disbanded, and the members of them are consid- 0 ered as outliers. In this subsection, we evaluate the extent to which these algorithms identify nodes as outliers, as shown in Fig. 13.

Figure 11: Tendency of different scores of varying d

Remarks. An increase in the density of networks usually leads to higher clustering coefficients, and for most algorithms it produces fewer detected communities. Moreover, since there are more links among nodes, fewer outliers are produced by these algorithms as the density increases. In general, Radicchi, LPA and HANP are superior to the others in both overall performance and stability. The growth in the density of a network has the strongest impact on both the efficiency and the performance of SCP.

6.7 Mixture Sensitivity
Mixing (or overlapping) structures are common in real-world social networks, which makes it more difficult to differentiate communities from the entire network. In this experiment, we study the performance and sensitivity of these algorithms on datasets with different mixture degrees.

For synthetic networks generated by LFR, the mixture degree depends on the mixing parameter u, as discussed in Sec. 6.1. A small value of u leads to large diversities among different communities, while a high value of u may produce overlapping communities. We generated synthetic datasets using LFR with 100K nodes and different settings of u from 0.1 to 0.5 (∆u = 0.1). Usually, it is meaningless to find communities in a network where more than half of the nodes are overlapped.

Figure 12: Effectiveness and accuracy of varying u ((a) Modularity (Q); (b) Jaccard Index)

6.8 Outliers
We test the ability of these algorithms to discover outliers on Football, which contains 65 labeled outliers. When mvs = 2, CNM, Radicchi, SCP, M-KMF, MB-DSGE and gCluSkeleton can identify outliers with an F1-Score higher than 0.90, while the others rarely consider outliers. We also conducted experiments on large-scale real-world networks without any prior knowledge about the exact outliers, and illustrate the proportion of outliers (OP = |Outs|/|V|) in Fig. 13. Here mvs is set to 3 since, on large-scale networks, we believe that only two people cannot form a community [28]. LPA and HANP are not included in the figure since they produce few outliers: once the label distribution tends to be steady, there are only a few tiny groups flagged with unique labels (2 on both Amazon and DBLP for HANP, 57 and 32 respectively for LPA). Similarly, the outliers found by CNM and TopLeaders are also quite few. Since MB-DSGE has a higher preference for locally dense connected groups, it produces more outliers. In particular, if we set κ to 4 for SCP, or increase k from 1 to 3 for M-KMF, more outliers will be produced due to the stricter limitation of Π. For gCluSkeleton, we set µ = 3 only on Amazon and DBLP to ensure better effectiveness of communities, which may lead to more outliers.

Figure 13: Proportion of produced outliers

Remarks. With the label propagation algorithms, we can hardly obtain outliers in a network. Compared with the other algorithms we studied, MB-DSGE generally produces the most outliers and has the lowest coverage ratio (1 − OP) of a network. The outliers produced by SCP, M-KMF and gCluSkeleton are too sensitive to the value of the key parameter, i.e. κ, k and µ, respectively. Moreover, we find that outliers are more easily revealed on datasets with a lower ccavg (e.g. only 0.1723 on BrightKit).
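The outlier proportion OP is straightforward to compute once a size threshold is fixed. The sketch below is our own illustration (the function name and the treatment of unassigned nodes are assumptions, not the benchmark's code): every detected group smaller than mvs, together with any node left unassigned, counts toward OP = |Outs|/|V|.

```python
def outlier_proportion(communities, num_nodes, min_valid_size=3):
    """OP = |Outs| / |V|: nodes in detected groups smaller than
    min_valid_size (the paper's mvs), plus unassigned nodes,
    are counted as outliers."""
    assigned = sum(len(c) for c in communities)
    small = sum(len(c) for c in communities if len(c) < min_valid_size)
    outliers = small + (num_nodes - assigned)
    return outliers / num_nodes

# 10 nodes: one community of 6, one pair (below mvs = 3), 2 unassigned.
op = outlier_proportion([[0, 1, 2, 3, 4, 5], [6, 7]], num_nodes=10)  # 0.4
```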

Table 6: Comparison between LPA and LPA*

Real-world networks:
            CA-HepPH                 Email-Enron              Gowalla                  DBLP
Alg.        CC     S      Q          CC     S      Q          CC     S      Q          CC     S      Q
LPA         0.7037 0.5985 0.4201     0.7340 0.7912 0.2983     0.7685 0.7902 0.3690     0.7317 0.6149 0.6930
LPA*        0.7227 0.6617 0.4449     0.7408 0.8118 0.4954     0.7751 0.8058 0.5303     0.7381 0.6404 0.7032

Synthetic networks:
            S 10K                    S 100K                   S 200K                   S 500K
Alg.        Fsame  Jaccard NMI      Fsame  Jaccard NMI      Fsame  Jaccard NMI      Fsame  Jaccard NMI
LPA         0.9681 0.9945  0.9992   0.9967 0.9917  0.9992   0.9935 0.9783  0.9969   0.9980 0.9944  0.9989
LPA*        0.9980 0.9940  0.9993   0.9979 0.9945  0.9995   0.9984 0.9956  0.9994   0.9992 0.9976  0.9996

Table 7: Comparison between MB-DSGE and MB-DSGE*

            Strike                                         Football
Alg.        CC     S      Q      Fsame  Jaccard NMI       CC     S      Q      Fsame  Jaccard NMI
MB-DSGE     0.7508 0.5000 0.2864 0.7708 0.4438  0.5777    0.4266 0.8000 0.5179 0.6139 0.8040  0.9767
MB-DSGE*    0.7747 0.8333 0.5474 0.9583 0.8208  0.8660    0.5097 0.6923 0.5408 0.6389 0.7154  1.0000

            CA-HepPH                 BrightKit                Gowalla                  DBLP
Alg.        CC     S      Q          CC     S      Q          CC     S      Q          CC     S      Q
MB-DSGE     0.5865 0.2834 0.3929     0.5443 0.1952 0.0961     0.5508 0.2165 0.1075     0.6303 0.3830 0.3964
MB-DSGE*    0.6223 0.3106 0.4437     0.6672 0.2715 0.1940     0.6269 0.2579 0.1883     0.6686 0.4413 0.5166
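The pair-counting Jaccard index [10] reported in these tables can be sketched as follows. This is a generic illustration of the measure (not the benchmark implementation) and assumes non-overlapping communities:

```python
from itertools import combinations

def pairwise_jaccard(partition_a, partition_b):
    """Jaccard index over node pairs: the fraction of co-clustered
    pairs shared by the two partitions, |A ∩ B| / |A ∪ B|."""
    def co_clustered_pairs(partition):
        pairs = set()
        for community in partition:
            pairs.update(combinations(sorted(community), 2))
        return pairs
    a = co_clustered_pairs(partition_a)
    b = co_clustered_pairs(partition_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# The two partitions agree only on the pairs (0,1) and (3,4).
j = pairwise_jaccard([[0, 1, 2], [3, 4]], [[0, 1], [2, 3, 4]])  # 1/3
```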

6.9 Distribution of Communities
The distribution of communities, i.e. the distribution of the quantities of communities with different sizes, is also a key indicator of the validity of an algorithm. We visualize the community distributions produced by the various algorithms on Amazon and DBLP in Fig. 14 (with double logarithmic coordinates) for comparison. We find that LPA and HANP produce regular power-law distributions, while CNM prefers more large communities, so its distributions are flatter and long-tailed. For SCP, when κ = 3, the distribution on Amazon is reasonable, while the distribution on DBLP is uneven due to an undivided large-scale community whose size is about 10^5. This may result from the small value of κ, which makes different κ-1-cliques merge more easily. The results of Radicchi and M-KMF are quite homogeneous due to the similar φ. As discussed in Sec. 6.8, undesirable excessive outliers will be found by M-KMF if we increase k. For MB-DSGE, even with an expected community density of 0.1, the results also show remarkable sparsity. Unlike M-KMF, MB-DSGE only takes the dense groups away, leaving the others as poor outliers. The community distribution generated by gCluSkeleton is a little messy, which may be caused by the lower differentiation of the values of φ, i.e. the values of the ε-queue, on large-scale networks. Finally, we find that a predefined ld makes the detection much more purposeful, and thus TopLeaders presents quite dissimilar distributions.

Remarks. The distributions produced by CNM, LPA and HANP are closest to a regular power-law distribution. TopLeaders has a community distribution different from the other algorithms due to the predefined leader number. On large-scale networks, for many algorithms such as Radicchi, M-KMF, SCP and gCluSkeleton, the existence of an undivided large community, which implies a lower separation degree, is actually an intractable problem to be solved.

6.10 Diagnosis Effects
Based on the two targeted diagnoses discussed in Sec. 5.1 and 5.2, we compare the modified algorithms with their original versions, i.e. MB-DSGE* vs. MB-DSGE and LPA* vs. LPA, to verify the effects of the improvements.

Table 6 shows the superiority of LPA* on both real-world and synthetic networks. Similarly, MB-DSGE* also leads to obvious performance improvements on real-world networks, as shown in Table 7. For lack of space, here we only present the results on part of the datasets we used. Based on this verification, we find that LPA* shows significant improvements over LPA, which had already outperformed the other algorithms on many datasets. MB-DSGE* also leads to better performance on most metrics, in both accuracy and effectiveness. In particular, we find that MB-DSGE* can significantly reduce the proportion of excessively produced outliers, as shown in Table 8. In sum, the diagnoses based on our framework can improve the performance in accuracy and effectiveness, and both modifications will not increase the processing time.
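The community size distribution plotted in Fig. 14 is simply a histogram of community sizes. A minimal sketch (our own helper, not part of the benchmark code):

```python
from collections import Counter

def community_size_distribution(communities):
    """Map each community size to the number of communities of that
    size -- the points plotted on the log-log axes of Fig. 14."""
    return Counter(len(c) for c in communities)

# Two communities of size 2, one of size 3, one singleton.
dist = community_size_distribution([[0, 1], [2, 3], [4, 5, 6], [7]])
```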

Table 8: Reduction of outlier proportion

Alg.        CA-HepPH  Gowalla  BrightKit  Amazon  DBLP
MB-DSGE     0.3218    0.7692   0.8841     0.4244  0.4105
MB-DSGE*    0.1220    0.4027   0.4882     0.1446  0.1665

Figure 14: Distributions of community size of each algorithm, quantity vs. community size on double logarithmic axes, on Amazon and DBLP ((a) LPA; (b) HANP; (c) CNM; (d) Radicchi; (e) SCP; (f) TopLeaders, with k = 2000 and k = 3000 on DBLP; (g) M-KMF, with k = 1 and k = 2 on Amazon; (h) MB-DSGE; (i) gCluSkeleton)

6.11 Summary
According to the above comprehensive evaluations covering multiple aspects of community detection, we find that each approach has its own merits and faults. The approaches vary in processing efficiency, and have different adaptabilities to different performance metrics as well as to different characteristics of datasets. To sum up, Table 9 gives an intuitive summary of the performance, adaptability and competitiveness of the studied approaches. We give general ratings from multiple critical aspects in this table, including processing efficiency (Effi.), accuracy (Accu.), effectiveness (Effe., in terms of CC, S and Q), the stability w.r.t. density sensitivity (Dens.) or mixture sensitivity (Mixt.), as well as to what extent an algorithm can avoid producing excessive outliers (Outl.).

From the perspective of efficiency, we see that M-KMF is the most efficient approach, while CNM, MB-DSGE and gCluSkeleton cost much more time than the others.
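As a point of reference for the label propagation family that dominates these ratings, a bare-bones LPA can be sketched in a few lines. This is a generic, deterministic simplification of our own, not the re-implementation evaluated in this study:

```python
from collections import Counter

def simple_lpa(adj, max_iters=100):
    """Minimal asynchronous label propagation: every node repeatedly
    adopts the most frequent label among its neighbors until no label
    changes. `adj` maps each node to its neighbor list; ties break
    toward the smallest label so the run is deterministic."""
    labels = {v: v for v in adj}
    for _ in range(max_iters):
        changed = False
        for v in adj:
            counts = Counter(labels[u] for u in adj[v])
            if not counts:
                continue  # isolated node keeps its own label
            best = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))[0]
            if best != labels[v]:
                labels[v] = best
                changed = True
        if not changed:
            break
    return labels

# Two separate triangles converge to two distinct labels.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1],
       3: [4, 5], 4: [3, 5], 5: [3, 4]}
labels = simple_lpa(adj)
```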

Table 9: A visualized summary of the studied approaches (more stars refer to better performance in the corresponding aspect)

                              Effe.
Alg.          Effi.   Accu.   CC      S       Q       Dens.   Mixt.   Outl.
CNM           ★★      ★★      ★★★★★   ★★★★    ★★★★★   ★★★★    ★       ★★★★
Radicchi      ★★★     ★★★★    ★★★★★   ★★★★★   ★       ★★★★    ★★      ★★★
LPA           ★★★     ★★★★★   ★★★★★   ★★★★    ★★★★    ★★★★★   ★★★     ★★★★★
HANP          ★★★     ★★★★★   ★★★★★   ★★★★    ★★★★    ★★★★★   ★★★     ★★★★★
TopLeaders    ★★★★    ★★★     ★★★★    ★★      ★★★★    ★★★     ★★      ★★★★
SCP           ★★★★    ★★★★    ★★★★    ★★      ★★      ★       ★★      ★★★
M-KMF         ★★★★★   ★★★★    ★★★★★   ★★★★★   ★       ★★      ★★      ★★★
MB-DSGE       ★★      ★★★     ★★★★    ★★      ★★      ★★      ★★      ★
gCluSkeleton  ★★      ★       ★★★★    ★       ★       ★★★     ★       ★★

Among the most important aspects, the accuracy evaluates the performance of algorithms using datasets with ground-truth. In this respect, LPA and HANP perform best in our study, showing the strongest capability of identifying authentic communities.

As for the effectiveness, which measures the intra- and inter-community structures of the detected communities with various metrics, we find that algorithms with different characteristics show different performances w.r.t. different metrics. CNM, Radicchi, LPA, HANP and M-KMF prefer to produce communities with high clustering coefficients (CC), Radicchi and M-KMF tend to form communities with high strength (S), while CNM is the best at finding communities with high modularity (Q).

These algorithms have different sensitivities to the density and the mixture degree of the datasets. Comparatively speaking, LPA and HANP are the most stable approaches, keeping excellent performance even when networks become much denser or the communities become highly mixed. Moreover, Radicchi and CNM are also less sensitive to varying network density.

Considering the tendency to produce outliers, MB-DSGE tends to pick out excessive outliers and performs better on networks with more outliers, while LPA and HANP hardly identify nodes as singletons and are more applicable on networks with few outliers. Besides, the community distributions produced by LPA, HANP and CNM are closest to the power-law distribution.

All in all, with an acceptable time cost, the label propagation algorithms are the more excellent and stable approaches, producing more accurate and valid communities with reasonable community distributions and fewer outliers under most circumstances.

7. CONCLUSIONS
In this paper, we conduct a comprehensive benchmarking study on approaches to community detection in social networks. Within the proposed benchmark, we formulate a generalized procedure-oriented framework, with high abstraction and nice modularization of the fundamental factors and critical steps of this problem. We have re-implemented ten state-of-the-art representative algorithms by mapping them to the proposed framework, and make in-depth evaluations of them based on our benchmark using both real-world and synthetic datasets. We discuss their merits and faults thoroughly, draw a set of interesting take-away conclusions, and provide intuitive ratings. In addition, we present how to make diagnoses for these algorithms based on our framework, and report significant improvements in the experimental study.

8. ACKNOWLEDGMENTS
This work was supported by the National Natural Science Foundation of China (No. 61170064, No. 61373023) and the National High Technology Research and Development Program of China (No. 2013AA013204).

9. REFERENCES
[1] C. C. Aggarwal and H. Wang. A survey of clustering algorithms for graph data. Managing and Mining Graph Data, pages 275–301. Springer, 2010.
[2] B. Bollobás. Modern Graph Theory, volume 184. Springer, 1998.
[3] D. Bortner and J. Han. Progressive clustering of networks using structure-connected order of traversal. ICDE, pages 653–656, 2010.
[4] J. Chen and Y. Saad. Dense subgraph extraction with application to community detection. TKDE, 24(7):1216–1230, 2012.
[5] A. Clauset, M. E. Newman, and C. Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.
[6] L. Danon, A. Díaz-Guilera, J. Duch, and A. Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.
[7] R. Diestel. Graph Theory, volume 173. Springer, 2000.
[8] S. Fortunato. Community detection in graphs. Physics Reports, 486(3):75–174, 2010.
[9] D. Gibson, R. Kumar, and A. Tomkins. Discovering large dense subgraphs in massive graphs. VLDB, pages 721–732, 2005.
[10] L. Hamers, Y. Hemeryck, G. Herweyers, M. Janssen, H. Keters, R. Rousseau, and A. Vanhoutte. Similarity measures in scientometric research: the Jaccard index versus Salton's cosine formula. Information Processing & Management, 25(3):315–318, 1989.
[11] J. Huang, H. Sun, Q. Song, H. Deng, and J. Han. Revealing density-based clustering structure from the core-connected tree of a network. TKDE, 25(8):1876–1889, 2013.
[12] R. R. Khorasgani, J. Chen, and O. R. Zaïane. Top leaders community detection approach in information networks. SNA-KDD, 2010.
[13] R. Kumar, J. Novak, and A. Tomkins. Structure and evolution of online social networks. Link Mining: Models, Algorithms, and Applications, pages 337–357. Springer, 2010.
[14] J. M. Kumpula, M. Kivelä, K. Kaski, and J. Saramäki. Sequential algorithm for fast clique percolation. Physical Review E, 78(2):026109, 2008.
[15] A. Lancichinetti, S. Fortunato, and F. Radicchi. Benchmark graphs for testing community detection algorithms. Physical Review E, 78(4):046110, 2008.
[16] V. E. Lee, N. Ruan, R. Jin, and C. Aggarwal. A survey of algorithms for dense subgraph discovery. Managing and Mining Graph Data, pages 303–336. Springer, 2010.
[17] I. X. Leung, P. Hui, P. Liò, and J. Crowcroft. Towards real-time community detection in large networks. Physical Review E, 79(6):066107, 2009.
[18] M. E. Newman. Fast algorithm for detecting community structure in networks. Physical Review E, 69(6):066133, 2004.
[19] M. E. Newman. Modularity and community structure in networks. PNAS, 103(23):8577–8582, 2006.
[20] G. Palla, I. Derényi, I. Farkas, and T. Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435(7043):814–818, 2005.
[21] A. Prat-Pérez, D. Dominguez-Sal, J. M. Brunat, and J.-L. Larriba-Pey. Shaping communities out of triangles. CIKM, pages 1677–1681, 2012.
[22] A. Prat-Pérez, D. Dominguez-Sal, and J.-L. Larriba-Pey. High quality, scalable and parallel community detection for large real graphs. WWW, pages 225–236, 2014.
[23] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi. Defining and identifying communities in networks. PNAS, 101(9):2658–2663, 2004.
[24] U. N. Raghavan, R. Albert, and S. Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.
[25] V. Satuluri and S. Parthasarathy. Scalable graph clustering using stochastic flows: applications to community discovery. SIGKDD, pages 737–746, 2009.
[26] L. Tang and H. Liu. Community detection and mining in social media. Synthesis Lectures on Data Mining and Knowledge Discovery, 2(1):1–137, 2010.
[27] J. Xie, S. Kelley, and B. K. Szymanski. Overlapping community detection in networks: The state-of-the-art and comparative study. ACM Computing Surveys, 45(4):43, 2013.
[28] J. Yang and J. Leskovec. Defining and evaluating network communities based on ground-truth. ICDM, pages 745–754, 2012.
[29] F. Zhao and A. K. Tung. Large scale cohesive subgraphs discovery for visual analysis. VLDB, pages 85–96, 2012.
