Network Clustering via Maximizing Modularity: Approximation Algorithms and Theoretical Limits

Thang N. Dinh∗‡, Xiang Li†, and My T. Thai†
∗Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284 USA
†Dept. of Comp. & Info. Sci. & Eng., University of Florida, Gainesville, FL 32611 USA
‡Corresponding author: Thang N. Dinh, email: [email protected]
arXiv:1602.01016v1 [cs.SI] 2 Feb 2016

Abstract: Many social networks and complex systems are found to be naturally divided into clusters of densely connected nodes, known as community structure (CS). Finding CS is one of the fundamental yet challenging topics in network science. One of the most popular classes of methods for this problem is to maximize Newman’s modularity. However, little is understood about how well we can approximate the maximum modularity, or about the implications of finding community structure with provable guarantees. In this paper, we definitively settle the approximability of modularity clustering, proving that approximating the problem within any (multiplicative) positive factor is intractable, unless P = NP. Yet we propose the first additive approximation algorithm for modularity clustering with a constant factor. Moreover, we provide a rigorous proof that a CS with modularity arbitrarily close to the maximum modularity $Q_{OPT}$ might bear no similarity to the optimal CS of maximum modularity. Thus, even when CSs with near-optimal modularity are found, other verification methods are needed to confirm the significance of the structure.

I. INTRODUCTION

Many complex systems of interest, such as the Internet and social and biological relations, can be represented as networks consisting of a set of nodes connected by edges. Research in a number of academic fields has uncovered unexpected structural properties of complex networks, including the small-world phenomenon [1], power-law degree distribution, and the existence of community structure (CS) [2], where nodes are naturally clustered into tightly connected modules, also known as communities, with only sparser connections between them. Finding this community structure is a fundamental but challenging problem in the study of network systems and has not yet been satisfactorily solved, despite the huge effort of a large interdisciplinary community of scientists working on it over the past years [3].

Newman-Girvan’s modularity, which measures the “strength” of a partition of a network into modules (also called communities or clusters) [2], has rapidly become an essential element of many community detection methods. Despite its known drawbacks [4], [5], modularity is by far the most used and best known quality function, particularly because of its successes in many social and biological networks [2] and its ability to auto-detect the optimal number of clusters [6], [7]. One can search for community structure by looking for the divisions of a network that have positive, and preferably large, values of the modularity. This is the underlying “assumption” for numerous optimization methods that find communities in the network via maximizing modularity (aka modularity clustering), as surveyed in [3]. However, little is understood about the complexity and approximability of modularity clustering besides its NP-completeness [8], [9] and APX-hardness [10]. The approximability of modularity clustering in general graphs remains an open question.

This paper focuses on understanding theoretical aspects of CSs with near-optimal modularity. Let $\mathcal{C}^*$ be a CS with maximum modularity value and let $Q_{OPT}$ be the modularity value of $\mathcal{C}^*$. Given $0 < \rho < 1$, polynomial-time algorithms that can find CSs with modularity at least $\rho Q_{OPT}$ are called (multiplicative) approximation algorithms, and $\rho$ is called the (multiplicative) approximation factor. Given the NP-completeness of modularity clustering, we are left with two choices: designing heuristics that provide no performance guarantee (like the vast majority of modularity clustering works) or designing approximation algorithms that can guarantee near-optimal modularity.

We seek the answers to the following questions: how well can we approximate the maximum modularity, i.e., for what values of $\rho$ do there exist $\rho$-approximation algorithms for modularity clustering? Moreover, do CSs with near-optimal modularity bear similarity to $\mathcal{C}^*$, the ultimate target of all modularity clustering algorithms? Our contributions (answers to the above questions) are as follows.

• We prove that there is no approximation algorithm with any factor $\rho > 0$ for modularity clustering, unless P = NP, therefore definitively settling the approximation complexity of the problem. We prove this intractability result for both weighted networks and unweighted networks (allowing multiple edges).
• On the bright side, we propose the first additive approximation algorithm that finds a community structure with modularity at least $Q_{OPT} - 2(1 - \kappa)$ for $\kappa = 0.766$. The proposed algorithm also provides better quality solutions compared to state-of-the-art modularity clustering methods.
• We provide a rigorous proof that CSs with near-optimal modularity might be completely different from $\mathcal{C}^*$, the CS with maximum modularity $Q_{OPT}$. This holds no matter how close the modularity value is to $Q_{OPT}$. Thus adopters of modularity clustering should carefully employ other verification methods even when they find CSs with modularity values that are extremely close to the optimal ones.

Related work. A vast number of methods to find community structure are surveyed in [3]. Brandes et al. prove the NP-completeness of modularity clustering, the first hardness result for this problem. The problem remains NP-hard even for trees [9]. DasGupta et al. show that modularity clustering is APX-hard, i.e., there exists a constant $c > 1$ such that there is no (multiplicative) $c$-approximation for modularity clustering unless P = NP [10]. In this paper, we show a much stronger result: the inapproximability holds for all $c > 1$.

Modularity has several known drawbacks. Fortunato and Barthelemy [4] have shown the resolution limit, i.e., modularity clustering methods fail to detect communities smaller than a certain scale; the resolution limit only appears when the network is substantially large [11]. Another drawback is modularity’s highly degenerate energy landscape [5], which may lead to very different partitions with equally high modularity. However, for small and medium networks of several thousand nodes, the Louvain method [12] to optimize modularity is among the best algorithms according to the LFR benchmark [11]. The method is also adopted in products such as LinkedIn InMap or Gephi.

While approximation algorithms for modularity clustering in special classes of graphs have been proposed for scale-free networks [13], [14] and $d$-regular graphs [10], no such algorithms for general graphs are known.

Organization. We present terminology in Section II. The inapproximability of modularity clustering in weighted and unweighted networks is presented in Section III. We present the first additive approximation algorithm for modularity clustering in Section IV. Section V illustrates that the optimality of modularity does not correlate with the similarity between the detected CS and the maximum modularity CS. Section VI presents computational results and we conclude in Section VII.

II. PRELIMINARIES

We consider a network represented as an undirected graph $G = (V, E)$ consisting of $n = |V|$ vertices and $m = |E|$ edges. The adjacency matrix of $G$ is denoted by $A = (A_{ij})$, where $A_{ij}$ is the weight of edge $(i, j)$ and $A_{ij} = 0$ if $(i, j) \notin E$. We also denote the (weighted) degree of vertex $i$, the total weight of the edges incident to $i$, by $\deg(i)$ or, in short, $d_i$.

Community structure (CS) is a division of the vertices in $V$ into a collection of disjoint subsets of vertices $\mathcal{C} = \{C_1, C_2, \ldots, C_l\}$ whose union is $V$. In particular, the number of communities $l$ is not known a priori. Each subset $C_i \subseteq V$ is called a community (or module), and we wish to have more edges connecting vertices in the same communities than edges that connect vertices in different communities.

matrix $\delta$ is defined as

$$\delta_{ij} = \begin{cases} 1, & \text{if } i \text{ and } j \text{ are in the same community,} \\ 0, & \text{otherwise.} \end{cases}$$

The modularity values can be either positive or negative, and it is believed that higher (positive) modularity values indicate stronger community structure. The modularity clustering problem asks to find a division which maximizes the modularity value.

Let $B$ be the modularity matrix [15] with entries

$$B_{ij} = A_{ij} - \frac{d_i d_j}{2M}.$$

We have

$$Q(\mathcal{C}) = \frac{1}{2M} \sum_{i,j} B_{ij} \delta_{ij}.$$

Alternatively, the modularity can also be defined as

$$Q(\mathcal{C}) = \sum_{t=1}^{l} \left( \frac{E(C_t)}{M} - \frac{\mathrm{vol}(C_t)^2}{4M^2} \right), \qquad (2)$$

where $E(C_t)$ is the total weight of the edges inside $C_t$ and $\mathrm{vol}(C_t) = \sum_{v \in C_t} d_v$ is the volume of $C_t$.

III. MULTIPLICATIVE APPROX. ALGORITHM

A major thrust in optimization is to develop approximation algorithms for which one can theoretically prove a performance bound. Designing approximation algorithms is, however, very challenging. Thus, it is desirable to know for what values of $\rho$ there exist $\rho$-approximation algorithms. This section gives a negative answer to the existence of approximation algorithms for modularity clustering with any (multiplicative) factor $\rho > 0$, unless P = NP.

We show the inapproximability result for weighted networks via a gap-producing reduction from the PARTITION problem in subsection III-A. Ignoring the weights does not make the problem any easier to approximate, as we shall show in subsection III-B that the same inapproximability holds for unweighted networks.

Our proofs for both cases use the fact that we can approximate modularity clustering if and only if we can approximate the problem of partitioning the network into two communities to maximize modularity. Then we show that the latter problem cannot be approximated within any finite factor.

A. Inapproximability in Weighted Graphs

Theorem 1: For any $\rho > 0$, there is no polynomial-time algorithm to find a community structure with a modularity