On Modularity Clustering

1 On Modularity Clustering Ulrik Brandes1, Daniel Delling2, Marco Gaertler2, Robert Gorke¨ 2, Martin Hoefer1, Zoran Nikoloski3, Dorothea Wagner2 Abstract— Modularity is a recently introduced quality measure approximation factor no better than two. In addition, our examples for graph clusterings. It has immediately received considerable indicate that the quality of greedy clusterings may heavily depend attention in several disciplines, and in particular in the complex on the tie-breaking strategy utilized. In fact, in the worst case, systems literature, although its properties are not well under- no approximation factor can be provided. These performance stood. We study the problem of finding clusterings with maximum modularity, thus providing theoretical foundations for past and studies are concluded by partitioning some previously considered present work based on this measure. More precisely, we prove networks optimally, which does yield further insight. the conjectured hardness of maximizing modularity both in the This paper is organized as follows. Section II shortly introduces general case and with the restriction to cuts, and give an Integer preliminaries, formulations of modularity, an ILP formulation of Linear Programming formulation. This is complemented by first the problem. Basic and counterintuitive properties of modularity insights into the behavior and performance of the commonly are observed in Sect. III. Our NP-completeness proofs are given applied greedy agglomerative approach. in Section IV, followed by an analysis of the greedy approach Index Terms— Graph Clustering, Graph Partitioning, Modu- in Section V. The theoretical investigation is extended by char- larity, Community Structure, Greedy Algorithm acterizations of the optimum clusterings for cliques and cycles in Section VI. Our work is concluded by revisiting examples I. INTRODUCTION from previous work in Section VII and a brief discussion in Graph clustering is a fundamental problem in the analysis of Section VIII. relational data. Studied for decades and applied to many settings, it is now popularly referred to as the problem of partitioning II. PRELIMINARIES networks into communities. In this line of research, a novel graph Throughout this paper, we will use the notation of [14]. More clustering index called modularity has been proposed recently [1]. precisely, we assume that G = (V; E) is an undirected connected The rapidly growing interest in this measure prompted a series graph with n := jV j vertices, m := jEj edges. Denote by C = of follow-up studies on various applications and possible adjust- fC1;:::;Ckg a partition of V . We call C a clustering of G and ments (see, e.g., [2], [3], [4], [5], [6]). Moreover, an array of the Ci, which are required to be non-empty, clusters; C is called heuristic algorithms has been proposed to optimize modularity. trivial if either k = 1 or k = n. We denote the set of all possible These are based on a greedy agglomeration [7], [8], on spectral clusterings of a graph G with A (G). In the following, we often division [9], [10], simulated annealing [11], [12], or extremal identify a cluster Ci with the induced subgraph of G, i. e., the optimization [13] to name but a few prominent examples. While graph G[Ci] := (Ci;E(Ci)), where E(Ci) := ffv; wg 2 E : Sk these studies often provide plausibility arguments in favor of the v; w 2 Cig. Then E(C) := i=1 E(Ci) is the set of intra-cluster resulting partitions, we know of only one attempt to characterize edges and E n E(C) the set of inter-cluster edges. The number of properties of clusterings with maximum modularity [2]. In partic- intra-cluster edges is denoted by m(C) and the number of inter- ular, none of the proposed algorithms has been shown to produce cluster edges by m(C). The set of edges that have one end-node optimal partitions with respect to modularity. in Ci and the other end-node in Cj is denoted by E(Ci;Cj). In this paper we study the problem of finding clusterings with maximum modularity, thus providing theoretical foundations for A. Definition of Modularity past and present work based on this measure. More precisely, we proof the conjectured hardness of maximizing modularity Modularity is a quality index for clusterings. Given a simple both in the general case and the restriction to cuts, and give an graph G = (V; E), we follow [1] and define the modularity q (C) integer linear programming formulation to facilitate optimization of a clustering C as without enumeration of all clusterings. Since the most commonly q (C) := employed heuristic to optimize modularity is based on greedy 2 3 P 0 !2 agglomeration, we investigate its worst-case behavior. In fact, we X jE(C)j jE(C)j + C02C jE(C; C )j 4 − 5 : (1) give a graph family for which the greedy approach yields an m 2m C2C This work was partially supported by the DFG under grants BR 2158/2- Note that C0 ranges over all clusters, so that edges in E(C) 3, WA 654/14-3, Research Training Group 1042 ”Explorative Analysis and Visualization of Large Information Spaces” and by EU under grant DELIS are counted twice in the squared expression. This is to adjust 0 0 (contract no. 001907). proportions, since edges in E(C; C ), C 6= C , are counted twice 1 Department of Computer & Information Science, University of Konstanz, as well, once for each ordering of the arguments. Note that we fbrandes,[email protected] can rewrite Equation (1) into the more convenient form 2 Faculty of Informatics, Universitat¨ Karlsruhe (TH), delling,gaertler,rgoerke,wagner @ira.uka.de " P 2# f g X jE(C)j deg(v) 3 Max-Planck Institute for Molecular Plant Physiology, Bioinformatics q (C) = − v2C : (2) m 2m Group, [email protected] C2C 2 This reveals an inherent trade-off: To maximize the first term, Corollary 3.2: Isolated nodes have no impact on modularity. many edges should be contained in clusters, whereas the mini- Corollary 3.2 directly follows from the fact that modularity mization of the second term is achieved by splitting the graph depends on edges and degrees, thus, an isolated node does not into many clusters with small total degrees each. Note that the contribute, regardless of its association to a cluster. Therefore, we first term jE(C)j=m is also known as coverage [14]. exclude isolated nodes from further consideration in this work, i. e., all nodes are assumed to be of degree greater than zero. B. Maximizing Modularity via Integer Linear Programming Lemma 3.3: A clustering with maximum modularity has no cluster that consists of a single node with degree 1. The problem of maximizing modularity can be cast into a Proof: Suppose for contradiction that there is a clustering C very simple and intuitive integer linear program (ILP). Given a with a cluster C = fvg and deg(v) = 1. Consider a cluster graph G = (V; E) with n := jV j nodes, we define n2 decision v C that contains the neighbor node u. Suppose there are a variables X 2 f0; 1g, one for every pair of nodes u; v 2 V . u uv number of m intra-cluster edges in C and m inter-cluster edges The key idea is that these variables can be interpreted as an i u e connecting C to other clusters. Together these clusters add equivalence relation (over V ) and thus form a clustering. In order u 2 to ensure consistency, we need the following constraints, which m (2m + me) + 1 i − i guarantee m 4m2 to q (C). Merging Cv with Cu results in a new contribution of reflexivity 8 u: Xuu = 1 ; 2 symmetry 8 u; v : Xuv = Xvu , and mi + 1 (2mi + me + 1) − 2 8X + X − 2 · X ≤ 1 m 4m < uv vw uw transitivity 8 u; v; w : Xuw + Xuv − 2 · Xvw ≤ 1 : The merge yields an increase of : Xvw + Xuw − 2 · Xuv ≤ 1 1 2mi + me − 2 > 0 The objective function of modularity then becomes m 2m in modularity, because mi + me ≤ m and me ≥ 1. This proves 1 X deg(u) deg(v) E − X ; the lemma. 2m uv 2m uv (u;v)2V 2 Lemma 3.4: There is always a clustering with maximum mod- ( ularity, in which each cluster consists of a connected subgraph. 1 , if (u; v) 2 E with Euv = : Proof: Consider for contradiction a clustering C with a 0 , otherwise cluster C of mi intra- and me inter-cluster edges that consists Note that this ILP can be simplified by pruning redundant of a set of more than one connected subgraph. The subgraphs in n n C G variables and constraints, leaving only 2 variables and 3 do not have to be disconnected in , they are only disconnected constraints. when we consider the edges E(C). Cluster C adds m (2m + m )2 i − i e III. FUNDAMENTAL OBSERVATIONS m 4m2 In the following, we identify basic structural properties that to q (C). Now suppose we create a new clustering C0 by splitting C clusterings with maximum modularity fulfill. We first focus on into two new clusters. Let one cluster Cv consist of the component the range of modularity, for which Lemma 3.1 gives the lower including node v, i.e. all nodes, which can be reached from a and upper bound. node v with a path running only through nodes of C, i.e. Cv = Lemma 3.1: Let G be an undirected and unweighted graph S1 i i i−1 i=1 Cv, where Cv = fw j 9(w; wi) 2 E(C) with wi 2 Cv g and C 2 A (G).

Load more