On Community Detection in Real-World Networks and the Importance of Degree Assortativity
Marek Ciglan
Institute of Informatics, Slovak Academy of Sciences, Bratislava, Slovakia
[email protected]

Michal Laclavík
Institute of Informatics, Slovak Academy of Sciences, Bratislava, Slovakia
[email protected]

Kjetil Nørvåg
Dept. of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway
[email protected]

ABSTRACT
Graph clustering, often addressed as community detection, is a prominent task in the domain of graph data mining, with dozens of algorithms proposed in recent years. In this paper, we focus on several popular community detection algorithms with low computational complexity and decent performance on artificial benchmarks, and we study their behaviour on real-world networks. Motivated by the observation that there is a class of networks for which the community detection methods fail to deliver a good community structure, we examine the assortativity coefficient of ground-truth communities and show that the assortativity of a community structure can be very different from the assortativity of the original network. We then examine the possibility of exploiting the latter by weighting the edges of a network, with the aim of improving the community detection outputs for networks with assortative community structure. The evaluation shows that the proposed weighting can significantly improve the results of community detection methods on networks with assortative community structure.

Categories and Subject Descriptors
H.2.8 [Database Applications]: Data mining; E.1 [DATA STRUCTURES]: Graphs and networks

Keywords
community detection, network assortativity, edge weighting

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
KDD’13, August 11–14, 2013, Chicago, Illinois, USA.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
Copyright 2013 ACM 978-1-4503-2174-7/13/08 ...$15.00.

1. INTRODUCTION
The goal of community detection in networks is to identify sets of nodes, called communities, that are densely connected among themselves and have weaker connections to the other communities in the network. The task can help to analyse large graphs and identify significant structures within them; a classical example is the analysis of social networks in order to find social groups of users. Community detection faces numerous challenges, the principal one being the lack of a consensus on the formal definition of a network community structure. As a result of this ambiguity in the task definition, a significant number of community detection algorithms have been proposed, using different quality definitions of a community structure or leaving the problem formulation as an ambiguous informal description only. In this paper, we do not propose yet another community detection algorithm, nor do we attempt to provide a new formalization of the task. Rather, we study community detection methods on real-world networks and the possibility of improving their precision by pre-processing the network topology.

Motivation. The motivation is that, for a class of networks, the community detection techniques fail to deliver a good partitioning (in the rest of the paper, we use the term partitioning to refer to the result of a community detection algorithm; a partitioning is a set of detected communities). For example, when using community detection techniques to analyse the semantic network DBPedia¹, where nodes correspond to DBPedia concepts and edges denote a relation defined between two concepts, our expectation was that the analysis would reveal small clusters of semantically related concepts and entities. The clusters were expected to be, for example, similar to Wikipedia categories (groups of Wikipedia articles handpicked by human contributors and assigned as members of a category or class). However, the detected structure contains a few very large communities comprising the majority of the nodes.

¹ Knowledge base derived from Wikipedia, http://dbpedia.org

Due to the size of the data sets, our choice of community detection methods for the network analysis was limited to a small family of fast community detection algorithms with near-linear time complexity. We analysed the link graph using the label propagation algorithm [19], a greedy modularity optimization algorithm [3] and a community detection method with a parameterized community size constraint (SCCD) [4]. The community structure produced by the label propagation algorithm contained the largest community, with over 2.96 million nodes. The SCCD method with default settings yielded a structure with 78% of the nodes in the 20 largest communities, and the greedy modularity optimization method produced a partitioning with 88% of the nodes in the 20 largest clusters. The obtained results clearly did not match our expectations of the community structure of the DBPedia network.

Overview of the study. Motivated by the above observation, we analyse several large social and information networks with known ground-truth communities and compare the structure detected by the three community detection algorithms to the ground-truth clusters. On four of the analysed networks, the detected clusters are a decent approximation of the ground-truth communities. For the rest of the networks, the similarity scores of the yielded partitions with the ground-truth communities are low and, at the same time, a few of the largest detected clusters contain the majority of the network nodes.

We then study the assortativity of the analysed networks and the assortativity of their ground-truth communities. The interesting finding is that the assortativity coefficient of a network can differ significantly from the assortativity of its community structure. Based on this, we examine the possibility of modifying the network by means of edge weighting. The underlying idea is to examine whether weighting functions can increase the precision of the community detection algorithms on networks with assortative community structure. We describe the weighting heuristics and show empirically that they are a suitable approach in practice. On our test data, the similarity of the detected community structure to the ground truth is significantly increased after applying the weightings to networks with assortative community structure.

The main contributions of the paper are:
• We show that the assortativity of the community structure can differ from the overall assortativity of the network.
• We propose edge weighting functions designed to decrease the influence of edges connecting disassortative nodes. We show that such edge weighting on networks with assortative community structure can significantly increase the similarity of the communities identified by community detection methods to the ground-truth communities; we report an increase of 2 to 10 times compared to the baseline solution.

The organization of the rest of the paper is as follows. Section 2 [...]

[...] the notion of the community structure in an informal description. The most widely used approach is to focus on maximizing the modularity measure (introduced by Newman and Girvan [17]), which compares how community-like the partitioning of the input network is relative to a random network with the same vertex degrees. Modularity is a quality function for estimating how good the partitioning of a network is. The basic formulation of the community detection task expects a partitioning of a network as its output; that is, each node is a member of exactly one community. Numerous variants of the problem have been studied, including detection of overlapping communities, where a vertex can belong to multiple communities (e.g., works by Gregory [9] and Zhang et al. [22]), clustering of bipartite graphs (e.g., Papadimitriou et al. [18]), and detection of clusters that exploits information in addition to the network structure (e.g., attributes on nodes/edges, Yang et al. [21]).

Community detection algorithms are usually evaluated against artificial benchmark graphs into which a community structure has been injected (e.g., [11]). The advantage is that the evaluator can tune the parameters of the generated network; the disadvantage is the artificiality itself. The real-world networks with known community structure studied in the literature are usually small (e.g., Zachary’s karate club (34 nodes) or the Dolphin social network (62 nodes)), with few exceptions; e.g., in a recent study, Yang and Leskovec [20] identify the ground-truth communities for several large networks and study their properties. In this work, we reuse their data sets. Abrahao et al. [1] study structural properties of the ground-truth communities and compare them with the properties of clusters discovered by several community detection approaches. The finding is that the communities produced by different approaches are clearly separable