On Community Detection in Real-World Networks and the Importance of Degree Assortativity

On Community Detection in Real-World Networks and the Importance of Degree Assortativity Marek Ciglan Michal Laclavík Kjetil Nørvåg Institute of Informatics Institute of Informatics Dept. of Computer and Slovak Academy of Sciences Slovak Academy of Sciences Information Science Bratislava, Slovakia Bratislava, Slovakia Norwegian University of [email protected] [email protected] Science and Technology Trondheim, Norway [email protected] ABSTRACT of nodes, communities, that are densely connected among them- Graph clustering, often addressed as community detection, is a selves and have weaker connections to the other communities in prominent task in the domain of graph data mining with dozens the network. It is a task that should help to analyse large graphs of algorithms proposed in recent years. Community detection al- and identify significant structures within. A classical example is gorithms are commonly evaluated against artificially generated net- to analyse social networks in order to find social groups of users. works with planted communities. In this paper, we focus on sev- Community detection faces numerous challenges, the principal one eral popular community detection algorithms with low computa- is the lack of a consensus on the formal definition of a network tional complexity and with decent performance on the artificial community structure. The unclarity of the task definition results benchmarks, and we study their behaviour on the real-world net- in a significant number of community detection algorithms having works. The motivation is that there is a class of networks for which been proposed, using different quality definitions of a community the community detection methods fail do deliver good community structure or leaving the problem formulation in an ambiguous infor- structure. For example, when these community detection meth- mal description only. In this paper, we do not propose yet another ods were used to find clusters in the information network of DB- community detection algorithm, nor do we attempt to provide new Pedia, we observed that in the resulting community structure for formalization of the task. Rather, we study community detection all the used algorithms, the majority of the nodes belonged to a methods on real-world networks and the possibilities to improve small number of very large clusters. This result is contradictory their precision by pre-processing the network topology. to expectations, and a systematic study of the given algorithms be- Motivation. The motivation is that for a class of networks, the haviour were conducted. In this paper, we first study the relation- community detection techniques fail to deliver a good partition. For example, when using community detection techniques to anal- ship between different network properties and the type of commu- 1 nity structure unveiled by the given algorithms. Results indicate a yse the semantic network DBPedia , where nodes corresponds to statistically significant correlation between the network assortativ- the DBPedia concepts and edges denote a relation defined between ity coefficient and the size of top k largest clusters. We examine two concepts, our expectation was that the analysis would reveal the assortativity of ground-truth communities and show that assor- small clusters with semantically related concepts and entities. The tativity of a community structure can be very different from the as- clusters were expected to be, for example, similar to Wikipedia sortativity of the original network. We then examine the possibility categories (containing groups of Wikipedia articles handpicked by of weighting edges of a network with the aim to improve the com- human contributors and assigned to be a member of the category, munity detection outputs for networks with assortative community class). However, the detected structure contains a few very large structure. The evaluation shows that the proposed weighting can communities comprising the majority of the nodes. significantly improve the results of community detection methods Due to the size of the data sets, our choice of community detec- on networks with assortative community structure. tion methods for the network analysis was limited to a small fam- ily of fast community detection algorithms that are pseudolinear in the time complexity. We have analysed the link graph using La- Keywords bel Propagation [16] algorithm, a greedy modularity optimization community detection, network assortativity, edge weighting algorithm [2] and a community detection method with parametriz- able community size constraint (SCCD) [3]. The community struc- 1. INTRODUCTION ture produced by the label propagation algorithm had the largest community, with over 2.96 million of nodes. The SCCD method The goal of community detection in networks is to identify sets with default setting yielded a structure with 78% of the nodes in the 20 largest communities, and the greedy modularity optimization method produced partition with 88% of the nodes in the 20 Permission to make digital or hard copies of all or part of this work for largest clusters. The obtained results clearly did not match our ex- personal or classroom use is granted without fee provided that copies are pectations of the community structure of the DBPedia network. not made or distributed for profit or commercial advantage and that copies Overview of the study. We first analyse several large social and bear this notice and the full citation on the first page. To copy otherwise, to information networks with known ground-truth communities and republish, to post on servers or to redistribute to lists, requires prior specific compare the detected structure obtained by the three community permission and/or a fee. KDD ’13 1 Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00. Knowledge base derived from Wikipedia, http://dbpedia.org detection algorithms to the ground-truth clusters. On four of the 2.1 Community Detection analysed networks, the detected clusters are a decent approxima- The task of community detection is to, given an input network, tion of the ground-truth communities. For the rest of the networks, find the community structure of the network. A community struc- the similarity scores of the yielded partitions with the ground-truth ture is considered to be a collection of clusters of densely connected communities are low and at the same time, a few of the largest de- vertices that are less densely connected to other parts or communi- tected clusters contain the majority of the network nodes. ties in the network. The problem has been a very popular research To find out whether this is a generic behaviour of the algorithms topic and has been extensively studied in recent years; good evi- and what are the causes of such behaviour, we have applied the dence of the topic’s popularity is the overview paper by Fortunato algorithms on a wider range of different networks, including col- [6] with more than 450 references. Since its publication in 2010 laboration social networks, citation networks and web graphs. The (until Feb 2013), it has attracted more than 1200 citations accord- goal was to identify the correlation between the type of the de- ing to the Google Scholar service. Despite the significant research tected community structure and diverse properties of the analysed effort on this problem, there is no consensus on the formalization networks. The results reveal statistically significant correlations of of the task and authors often use different definitions of a com- the size of top largest clusters with two properties — network de- munity or even leave the notion of the community structure in an gree assortativity and number of nodes in the network. We then informal description. The most widely used approach is to focus study the assortativity of the ground-truth communities. Based on on maximizing the modularity measure (introduced by Newman the results of the latter, we examine the possibility to modify the and Girvan[14]) that compares how community-like is the partition network by means of edge weighting. The underlying idea is to ex- of the input network to a random network with the same degrees amine whether we can increase the precision of the community de- of vertices. Modularity is a quality function for estimation of how tection algorithms by the weighting functions for the networks with good the partition of a network is. The basic formulation of the assortative community structure. We describe the weighting heuris- community detection task expects as an output a partition of a net- tics and show empirically that it is a suitable approach in practice. work; that is, each node is a member of exactly one community. On our test data, the similarity of the detected community struc- Numerous variants of the problem have been studied, including de- ture to the ground-truth is significantly increased after applying the tection of overlapping communities (a vertex can belong to multiple weightings on network with assortative community structure. communities) (e.g., works by Gregory [7] and Zhang et al. [19] , The main contributions of the paper are: clustering of bipartite graphs (e.g., Papadimitriou et al. [15]) or detection of clusters exploiting additional information than network • We show correlation between degree assortativity coefficient structure (e.g., attributes on nodes/edges) (e.g., Yang et al. [18]). of a network and the type of community structure yielded by Community-detection algorithms are usually evaluated against community detection algorithms with linear time complexity. artificial benchmark graphs, where a community structure has been injected (e.g., [9]). The advantage is that the evaluator can tune • We show that assortativity of the community structure can the parameters of the generated network, the disadvantage is the differ from overall assortativity of the network. artificiality itself. The real-world networks with known community structure studied in the literature are usually small ones (e.g., • We propose edge weighting functions designed to decrease Zachary’s karate club (36 nodes) or Dolphin social network (62 the influence of edges connecting disassortative nodes.

Load more