<<

1 Detecting Statistically Significant Communities

Zengyou He, Hao Liang, Zheng Chen, Can Zhao, Yan Liu

Abstract—Community detection is a key data analysis problem across different fields. During the past decades, numerous algorithms have been proposed to address this issue. However, most work on community detection does not address the issue of statistical significance. Although some research efforts have been made towards mining statistically significant communities, deriving an analytical solution of p-value for one community under the configuration model is still a challenging mission that remains unsolved. The configuration model is a widely used model in community detection, in which the of each node is preserved in the generated random networks. To partially fulfill this void, we present a tight upper bound on the p-value of a single community under the configuration model, which can be used for quantifying the statistical significance of each community analytically. Meanwhile, we present a local search method to detect statistically significant communities in an iterative manner. Experimental results demonstrate that our method is comparable with the competing methods on detecting statistically significant communities.

Index Terms—Community Detection, Random Graphs, Configuration Model, Statistical Significance. !

1 INTRODUCTION

ETWORKS are widely used for modeling the structure function that is able to always achieve the best performance N of complex systems in many fields, such as biology, in all possible scenarios. engineering, and social science. Within the networks, some However, most of these objective functions (metrics) do vertices with similar properties will form communities that not address the issue of statistical significance of commu- have more internal connections and less external links. nities. In other words, how to judge whether one com- Community detection is one of the most important tasks munity or a network partition is real or not based on in and data mining, which can reveal some rigorous statistical significance testing procedures. the and organization of network structures. The Such testing-based approaches provide many advantages detection of meaningful communities is of great importance over other metrics. First of all, the statistical significance in many real applications [1]. In these diverse domains, each of communities can be quantified in terms of the p-value, detected community may provide us with some interesting which is a universally understood measure between 0 and 1 yet unknown information. in different fields. As a result, the threshold for the p-value Due to the importance of the community detection prob- is easy to specify since it corresponds to the significance lem, numerous algorithms from different perspectives have level. In contrast, numerical values generated from other been proposed during the past decades [1], [2]. Although objective functions are generally data-dependent, which is existing community detection methods have very different hard for people to interpret and determine a universal procedures to detect community structures, these methods threshold across all data sets. Furthermore, the p-value is can be roughly classified into different categories accord- typically derived under a certain random graph model in ing to the objective function and the corresponding search a mathematically sound manner, while many other metrics procedure [3]. The objective function is used for evaluating may be just defined in an ad-hoc manner. the quality of candidate communities, which serves as a In many real applications, it is critical to find network guideline for the search procedure and is critical to the suc- communities that are statistically significant. For instance, arXiv:1806.05602v2 [cs.SI] 11 Aug 2019 cess of the entire community detection algorithm. Probably the detection of protein complexes from protein-protein the proposed by Newman and Girvan [4] is the interaction (PPI) networks corresponds to a special com- most popular objective function in the field of community munity detection problem [8]. However, the PPI networks detection. Based on modularity, some popular community derived from current high-throughput experimental tech- detection methods, such as Louvain [5] and FastGreedy [6], niques are very noisy, in which many spurious edges are have been proposed to search the communities in networks. present while some true edges are missing [9], [10]. More im- In addition, many other objective functions have been pro- portantly, some reported protein complexes (communities in posed to evaluate the goodness of communities as well, as PPI networks) need to be further validated by wet-lab exper- summarized in [7]. Overall, there is still no such an objective iments that are costly and time-consuming. Thus, in these types of applications, the quality of identified communities must be strictly controlled through rigourous hypothesis • Z. He is with School of Software, Dalian University of Technology, testing procedures in order to obtain really meaningful out- Dalian,China, and Key Laboratory for Ubiquitous Network and Service put (e.g. [11], [12], [13]). In fact, the issue of errors and noises Software of Liaoning Province, Dalian, China. E-mail: [email protected] in the network data has also be recognized in many other • H. Liang, Z. Chen and Y. Liu are with School of Software, Dalian fields such as social science [10]. Therefore, the detection of University of Technology, Dalian, China. statistically significant communities will be desirable as well • C. Zhao is with Institute of Information Engineering, CAS. in these fields. Furthermore, the p-value of each community 2

can be used as the metric in emerging applications such as used analytically for assessing the statistical significance network community prioritization [14]. of a single community. Then, we present a local search To address the problem of discovering statistically signif- method to detect statistically significant communities in an icant communities, some research efforts have been made iterative manner. Experimental results on both real data sets towards this direction (e.g. [15], [16], [17], [18], [19], [20], and the LFR benchmark data sets show that our method [21], [22]). In these methods, the first critical issue that is comparable with the competing methods in terms of must be solved is to define a metric to effectively assess different evaluation metrics. the statistical significance of communities. This is because The main contributions of this paper can be summarized the significance-based metric will be used as the objective as follows: function for finding statistically significant communities. Therefore, different measures have been presented for as- • We propose a new method for assessing the statis- sessing the statistical significance of communities, which can tical significance of one single community. To our be classified into two categories: the statistical significance knowledge, this is the first piece of work that de- of one network partition (e.g. [15], [16], [17], [18]) and the livers an analytical p-value (or its upper bound) for statistical significance of a single community (e.g. [19], [20], quantifying a community under the configuration [21], [22]). model. Unlike existing methods under the config- Although the metrics that evaluate a network partition uration model that quantify the statistical signifi- can provide a global view on the set of all generated cance through the membership probabilities of single communities, they generally cannot guarantee that every nodes, our method directly evaluates each candidate single community is statistically significant as well. In ad- community. dition, many different partitions of the same network may • We provide a systematic summarization and analysis lead to quite similar significance values, making it difficult on the existing methods for detecting statistically to determine which partition should be reported as the significant communities, which may serve as the final result. Furthermore, such partition-based significance foundation for the further investigation towards this metrics typically focus on the assessment of a set of non- direction. overlapping communities. Therefore, evaluating the statis- • We present a local search method to conduct the tical significance of each single community is more mean- community detection based on the proposed upper ingful in community detection. Meanwhile, these methods bound of p-value. Extensive empirical studies vali- can be also classified by the techniques for deriving the p- date the effectiveness of our method on community values: analytical methods (e.g. [15], [16], [19], [20], [21]) evaluation and detection. or sampling methods (e.g. [17], [18], [22]). The sampling The remaining parts of this paper are structured as method calculates an empirical p-value through generat- follows: Section 2 presents and discusses related work in ing a number of random graphs. Although the sampling a systematic manner. Section 3 introduces our definition method is easy to understand and implement, it is very on the statistical significance of a single community and time consuming since many random networks have to be the corresponding community detection method. Section 4 generated and analyzed. More importantly, the p-values of shows the experimental results and Section 5 concludes this the same community may be inconsistent in different runs paper. and the precision of the p-value is dependent on the number of random graphs. Therefore, the analytical method is much preferable to the sampling method since it can provide an 2 RELATED WORK analytical solution of the p-value. Unfortunately, how to assess the statistical significance of 2.1 Statistically Significant Community Detection a single community (sub-graph) analytically is a challenging 2.1.1 An overview of the categorization task due to the difficulty on calculating the probability of To quantify the statistical significance of communities, one finding a community from the random graphs generated needs to choose a random graph model that specifies how from a specific null model. As a result, only a few research the reference random graphs are generated. Therefore, the efforts have been made towards this direction [19], [20], random graph model can be used as a criterion for cate- [21], [23], [24], [25], [26]. The details of these methods will gorizing available significance-based community detection be provided in the related work of this paper. Here we methods. There are many random graph models in the just highlight the fact that the configuration model is the literature [27], such as Barabasi-Albert´ model, Watts and most widely used random graph model in the literature. Strogatz model, hierarchal model and so on. Here we Unfortunately, how to calculate the analytical p-value under mainly discuss three random graph models which have the configuration model still remains unsolved. Existing been widely exploited in the field of statistically significant significance-based metrics [19], [20], [21], [24] were built community detection: Erdos–R¨ enyi´ model, configuration on the probability that each node belongs to the commu- model and . nity under the configuration model. In other words, these methods did not evaluate the statistical significance that • Erdos–R¨ enyi´ (E-R) model has two closely related one single community will appear in random graphs in a variants: G(N,M) and G(N, p). In G(N,M) model, straightforward manner. a graph is chosen uniformly at random from the In this paper, we first propose a tight upper bound on set of all graphs with N nodes and M edges. In the p-value under the configuration model, which can be G(N, p) model, a graph is constructed by connecting 3

two vertices randomly and independently with the Surprise [15] is a measure that evaluates the distribu- probability p(0 < p < 1). tion of intra- and inter-community links with a cumulative • The configuration model generates a random graph hypergeometric function. Significance [28] is defined as the in which each node has a fixed degree. In other probability for the partition to be contained in a random words, the degree of each node will be the same in graph. The method in [29] first calculates the expected all random graphs. modularity under the E-R model, and then a partition is • Stochastic block model produces a random graph in claimed to be statistically significant if its modularity is which any two vertices are connected by an edge significantly higher than the expected modularity. with a probability that is determined by their com- Under the configuration model, Z-modularity [16] quan- munity memberships. That is, the information on the tifies the statistic rarity of a partition in terms of the fraction underlying communities are assumed to be known, of the number of edges within communities using the Z- and two vertices will be connected with a higher score. Meanwhile, CSV [30] is a synthetic index for assessing connection probability from the same community. the validity of a partition based on the concepts from the network enrichment analysis. In addition to the random graph model, another two In [31], a new objective function for community detection criteria can be used for categorizing related methods. One is proposed under the stochastic block model, which leads is the target of evaluation, and the other is the technique naturally to the development of a likelihood ratio test for for deriving the p-values. In fact, one can evaluate the determining if the detected communities are statistically quality of a partition of one network or assess the statistical significant. significance of a single community. Hence, the target of evaluation can be either the full partition of a network or one 2.1.3 Sampling methods for network partition single community. Regardless of the target of evaluation, In addition to the analytical methods, some existing meth- we can calculate the p-value using analytical methods or ods adopt the sampling techniques to evaluate the statistical sampling techniques. Sampling techniques are quite time- significance of a network partition. consuming since a large number of random graphs should The methods in [17], [18] and [32] all test “the similarity” be generated to derive the empirical null distribution of or “the difference” between the partition of the original test statistics. On the other hand, analytical methods try to network and the partition of randomly perturbed networks. derive an analytical solution of the p-value. For complex The method in [18] uses tools from functional data analysis random graph models, it is very challenging to obtain an to formulate a hypothesis testing problem that tests “the analytical p-value or even its tight bound. difference” between two curves of VI (Variation of Informa- Based on above three different criteria, existing methods tion), where one curve is generated from the partition of for detecting statistically significant communities are cate- the original network and another curve is derived from the gorized and summarized in Fig.1. partition of randomly perturbed networks. In contrast, the method in [32] just uses the mean of VI between the original Statistically Significant Community Detection partition and the partition on the perturbed network as the test statistic. Similarly, the method in [17] adopts the same

Single Full network perturbation strategy as [32] and introduces a new Community Partition index for measuring the similarity between partitions. In addition, some existing methods first use sampling Analytical Sampling Analytical Sampling methods to generate a null distribution under the given ran- Solution Method Solution Method dom graph model and then test the statistical significance of the target partition. The method in [33] uses the largest E-R Configuration Configuration Other E-R Configuration Stochasic Stochasic Other Model Model Configuration eigenvalue of the difference matrix between the affinity ma- Model Models Model Model Block Block Models Model Model Model trices of the network and its null model as the test statistic, where the empirical distribution of the largest eigenvalue

[13] C-Score[24] [11] [25] Surprise [15] [16] [31] [17] [35] [36] can be approximated with a Gamma distribution. In [34], a SiDeS [23] SSF[12] [22] [28] CSV[30] [18] set of random networks is generated for producing an em- ESSC [20] [29] [32] OSLOM [19] [33] pirical null distribution of partition modularities to calculate CCME [21] [34] [26] the Z-score for the partition on the original network. Meanwhile, there are also several methods [35], [36] Fig. 1. Existing methods under different criteria. Here “other models” which adopt the sampling techniques for finding statisti- refers to those non-standard random graph models that are proposed cally significant communities based on some concepts from for special scenarios or hard to be described in an explicit manner. For physics. instance, the Poisson random model [25] generated random graphs with a given expected degree sequence, where each of the two end-points of an edge is chosen among all vertices through a Poisson process. 2.1.4 Analytical methods for single community To assess the statistical significance of a single community analytically, two types of strategies have been adopted by 2.1.2 Analytical methods for network partition the existing methods. To evaluate the quality of one network partition statistically On one hand, methods such as C-Score [24], SSF [12], and analytically, several different methods have been pro- OSLOM [19], ESSC [20] and CCME [21] first calculate the posed in the literature. probability that “each node belongs to the community”. 4

Then, the statistical significance of one community can be composed of the SAME set of nodes derived from c and quantified based on the statistics of several “exceptional (2) it has more internal edges than c. The proposed method nodes” in the community that have the lower member- in [11] generalizes [22], in which a “better community” (1) ship probability. The key issue in these methods is how has the same size as c and (2) has better community quality to quantify the connection probability between each node value (this value can be generated from any quality func- and the candidate community. Furthermore, one interesting tion). Furthermore, [25] assesses the statistical significance observation is that all these methods adopt the configuration of a single community in the same way, while the random model as the null model. graphs are generated under the Poisson random model, and On the other hand, several methods [13], [23], [26] try the “better” community is defined to have larger Poisson to evaluate the statistical significance of one community di- discrepancy than the original SAME set of nodes. rectly under different random graph models. More precisely, the probability of finding a community from the random 2.2 Testing for and Community network that is “better” than the target community, i.e. Number the p-value of target community, is directly calculated and used for the purpose of significance assessment. The main Besides quantifying the statistical significance of communi- challenge is how to derive the analytical p-value or its upper ties, the significance testing problems with respect to the bound under a random graph model. community structure and community number are also very According to the evaluation strategy and the random important. Testing the community structure is to determine graph model, existing analytical methods for a single com- whether the community structure is present in the network. munity can be summarized as a tree shown in Fig.2. First of Furthermore, testing the community number is to identify all, most existing solutions try to quantify the statistical sig- the correct number of communities in a statistically sound nificance of one single community using “indirect methods” manner under the assumption that a community structure under the configuration model, and only several “direct is present. methods” are available in the literature. More importantly, To test the community structure, some statistical tests although the metric in [26] is derived under the configura- have been proposed in the literature, e.g., the test based tion model, it is a Z-score rather than an analytical p-value. on the relations between the observed frequencies of small In other words, how to define and calculate the analytical subgraphs [37], [38] and the test based on the probability p-value for one community under the configuration model distribution of eigenvalues of the normalized edge-weight remains unaddressed. In this paper, we will take a first step matrix [39]. towards this direction by providing a tight upper bound on The problem of determining the number of communities the analytical p-value under the configuration model. is widely investigated in the literature as well [40], [41], [42], [43], [44], [45], [46]. Here we just highlight the fact that Analytical Methods for these methods employ different techniques from different Single Community perspectives to test if the number of communities equals a given number. Direct Methods Indirect Methods

E-R Configuration Configuration 3 METHODS Model Model Model 3.1 Problem statement

[13] [26] C-Score[24] Given an undirected graph G(V,E), where V is the set of SiDeS[23] SSF[12] vertices and E is the set of edges, any sub-graph S ⊆ V ESSC [20] OSLOM [19] can be regarded as a candidate community. Due to the lack CCME [21] of a consensus on the formal definition of a community, different objective functions have been proposed to evaluate Fig. 2. Existing analytical methods for a single community under two the quality of S (e.g. density, modularity) [7]. Without loss of criteria. Direct methods evaluate the statistical significance of one com- generalization, here we use f(S) to denote such an objective munity in a straightforward manner, while indirect methods quantify the statistical significance of one community through the membership function and assume that S is more likely to be a true probabilities of single nodes. community if f(S) is larger. Then, S will be regarded as a community if f(S) is larger than a given threshold. Simi- larity, we can also define an objective function to quantify if 2.1.5 Sampling methods for single community a network partition/structure is true or not. Similarly, some existing methods also assess the statistical Here we focus on how to effectively quantify the sta- significance of a single community with sampling tech- tistical significance of a candidate community S in terms of niques. In both [11] and [22], a fixed number of random p-value under a given random graph model. More precisely, graphs are firstly generated under the configuration model under the null hypothesis that S is generated from a specific and then the p-value of one community c is defined as the random graph model, the p-value of S can be calculated as probability of finding “better communities” from the ran- |{f(Se) ≥ f(S)|Se ⊆ G,e Ge ∈ R}|/|R|, where R is the set of all dom graphs. The key difference between [11] and [22] lies in possible random graphs and each Se is an induced sub-graph how to define “better communities”. In [11], one community (with the same set of vertices of S) in the corresponding from the random graphs is said to be “better” if (1) it is random graph. 5

In this paper, the density function is used as the objective obtain a new permutation shown in Fig.1 (c1) by switching function to calculate the p-value. When the set of vertices the first two (and the last two) pairs of half-edges in the is fixed to be the set of nodes of S in each random sub- permutation P . The new permutation corresponds to the graph Se, the f(Se) function is equivalent to counting the same graph G0. That is, for a fixed random graph, the set of number of internal edges within Se. Meanwhile, we adopt pairs of half-edges are fixed as well. Each pair can appear the configuration model as the null model for generating at any place of |E| positions in the permutation. Therefore, random graphs, in which the pre-assigned degree of each there will be |E|! permutations of these pairs that generate node will be preserved in the random networks. the same random graph. Hence, (2|E|)! should be divided by |E|! in order to count the number of distinct random graphs. 3.2 The configuration model Secondly, the order of two half-edges in each generated Suppose the degree of each node i ∈ V is denoted by edge has no effect on the random graph as well. That is, di, then the degree of the graph G can be defined as if we switch the positions of the (2i + 1)th half-edge and P th D = i∈V di = 2|E|. Similarly, the degree of a sub-graph the (2i + 2) half-edge, we will obtain a new permutation P S can be calculated as Ds = i∈S di = 2Ein + Eout, where but it corresponds to the same random graph. For example, Ein is the number of edges within S and Eout is the number we can swap the positions of two half-edges in the 1st,4th of edges between S and V \S. and 5th pairs in the permutation P to generate a new Under the configuration model, a random graph can be permutation shown in Fig.1 (c2), which also corresponds to generated based on the following procedure. Firstly, each the random graph G0. In summary, for a fixed random graph node i has di half-edges that need to be connected with and a fixed order of |E| pairs of half-edges, we may have other half-edges to form edges. There are totally D half- 2|E| different permutations that lead to the same random |E| edges in G and Ds half-edges derived from the nodes in S. graph. Therefore, (2|E|)! should be divided by 2 as well. A new edge can be generated by connecting two half-edges Thirdly, the half-edges from the same node have no at random. We will obtain a random graph after |E| pairs of difference in the random graph. For example, the node 1 randomly selected half-edges have been connected. in Fig.1 will generate 3 half-edges since its degree is 3. For To obtain an analytical formula for the p-value of S, the given permutation P , these 3 half-edges are assigned to we need to know (1) the number of all possible random positions 1, 4 and 10. If we randomly permute these 3 half- graphs T and (2) the number of random sub-graphs that edges and distribute them to the same three positions, then have at least Ein edges. Note that the counting problem we obtain a new permutation in Fig.1 (c3) that will generate here is simpler than that of counting the number of two- the random graph G0 as well. This means that (2|E|)! should th way zero-one tables with fixed marginal sums (e.g., [47], be divided by each di!, where di is the degree of the i node [48]). This is because the random graphs generated from the in the graph. configuration model are allowed to contain self-loops and Based on the above observations, the total number of multiple edges between vertices even the given graph G is distinct random graphs T under the configuration model simple. As a result, we can obtain an analytical solution to can be calculated as: this seemingly difficult problem. The generation mechanism (2|E|)! of the configuration model can be formulated as an equiva- T = . (1) |E| Q|V | lent permutation-and-connection procedure [49]: |E|!2 i=1 di! (1) Generating a permutation of 2|E| half-edges. There Since there are |S| nodes in S, we may denote the degree will be (2|E|)! permutations in total. (S) of each node in S by d , where i is ranged from 1 to |S|. (2) For each permutation, we may obtain |E| edges of a i th Similarly, the degree of each node in V \S is denoted by random graph by connecting the (2i+1) half-edge and the (S) |V | th Q (2i + 2) half-edge sequentially, where i = 0, 1, .., |E| − 1. dj , where j is ranged from 1 to |V \S|. Then, i=1 di! in Q|S| (S) Q|S| (S) Fig.1 presents an example to illustrate such an alternative Equation (1) equals i=1 di ! j=1 dj ! random graph generating process. However, it is easy to see that the above procedure may generate many identical 3.3.2 The number of distinct random graphs with a “dense” random graphs. According to whether identical random random sub-graph graphs are allowed in graph counting, we have two different We have shown how to calculate the number of all distinct p-value calculation formulations, which will be presented in random graphs T , then the remaining challenge is to cal- details in section 3.3 and 3.4, respectively. culate the number of random graphs that have at least Ein edges within the sub-graph Se induced from the nodes of S. 3.3 The p-value based on distinct random graphs To fulfill this task, some variables have to be firstly 3.3.1 The number of distinct random graphs introduced. For each random graph Ge that have at least E S As shown in Fig.1, the permutation-and-connection proce- in edges within its sub-graph e, there must be at least 2E S dure will generate many identical random graphs due to the in half-edges derived from the nodes in that have been (Sin) following reasons. selected to form edges within Se. Let di be a random th Firstly, |E| pairs of adjacent half-edges in the permuta- variable that denotes the number of half-edges from the i tion compose the edge set of the random graph, but the node in S that are selected into the set of internal half-edges. (Sout) (S) (Sin) generated random graph is mainly determined by the edge di = di −di is the number of half-edges remained th P|S| (Sin) set rather than the order of these pairs. For example, we can for the i node. Then, i=1 di should be no less than 6

P 1 2 3 1 4 2 2 5 2 1

4 P 1 2 3 1 4 2 1 1 2 5 2 1 4 2 (c1) 3 1 1 2 4 2 2 1 2 5 2 3 3 5 (c2) 2 1 1 4 2 5 2 1 2 3 5

(c3) 1 2 3 1 4 2 2 5 2 1 Initial Graph G Random Graph G′ Generated from Permutation P

(a) (b) (c)

Fig. 3. (a) An initial graph G is composed of five nodes and five edges. The degree sequence for five nodes is: (d1 = 3, d2 = 4, d3 = 1, d4 = 0 1, d5 = 1). (b) A permutation P of 10 half-edges can generate a random graph G . (c) Other three different permutations can also generate the same random graph G0.

1

4 2 1 4 2 + 2 4 3 3 3 5 5 5 𝑺𝑺� ( ) ( )

1 𝒃𝒃𝟏𝟏 𝒃𝒃𝟐𝟐 ( ) ( ) ( ) 4 The Procedure for Generating When ( , , ) = 4,0,0 � 2 𝑆𝑆𝑖𝑖𝑖𝑖 𝑆𝑆𝑖𝑖𝑖𝑖 𝑆𝑆𝑖𝑖𝑖𝑖 𝑮𝑮 (b) 𝑮𝑮� 𝑑𝑑2 𝑑𝑑3 𝑑𝑑5 3 5 S 1

1 4 2 4 Initial Graph G 2 + 2 4 (a) 3 3 3 5 5 𝑺𝑺�5 ( ) ( )

( ) ( ) ( ) The𝒄𝒄𝟏𝟏 Procedure for Generating 𝒄𝒄𝟐𝟐When ( , , ) = 2,1,1 𝑮𝑮� (c) 𝑆𝑆𝑖𝑖𝑖𝑖 𝑆𝑆𝑖𝑖𝑖𝑖 𝑆𝑆𝑖𝑖𝑖𝑖 𝑮𝑮� 𝑑𝑑2 𝑑𝑑3 𝑑𝑑5

Fig. 4. (a) The same initial graph G as in Fig.1 (a) with the following characteristics: 5 nodes, 5 edges and the degree sequence is: (d1 = 3, d2 = 4, d3 = 1, d4 = 1, d5 = 1). Here the sub-graph S is composed of three nodes and two edges. (b) The random graph Ge that has three edges within (Sin) (Sin) (Sin) Se can be generated according to our procedure when (d2 , d3 , d5 ) = (4, 0, 0). (c) The same random graph Ge can be generated by our (Sin) (Sin) (Sin) procedure as well when (d2 , d3 , d5 ) = (2, 1, 1).

2Ein in a random graph with a denser sub-graph Se. So we show that it is not necessary to calculate this number. E (Sin) (Sin) can generate all random graphs that have at least in edges (2) Under a given fixed degree sequence (d1 , d2 within Se with the following procedure: (Sin) , .., d|S| ), we can first generate Ein internal edges within (S ) (S) in Se by randomly connecting 2Ein selected half-edges from (1) Select di half-edges from the set of di half- th Step (1). The number of different ways that generate these edges of the i node in S (1 6 i 6 |S|) with the P|S| (Sin) Ein edges corresponds to the number of random graphs constraint that 2Ein = i=1 di . For a fixed vector (Sin) (Sin) (Sin) (Sin) with the following parameters: |S| nodes, degree sequence (d1 , d2 , .., d|S| ), no matter how di half-edges (Sin) (Sin) (Sin) (S) (d1 , d2 , .., d|S| ) and Ein edges. Hence the number are selected from the set of d half-edges, the gener- i of ways to generate Ein edges is : ated set of 2Ein half-edges will be the same. Therefore, (S ) (S ) (S ) for a fixed vector (d in , d in , .., d in ), the number of (2Ein)! 1 2 |S| Zin = . (2) E Q|S| (Sin) ways of generating this vector should be 1 rather than Ein!2 in d ! (S) i=1 i Q|S| di  i=1 (Sin) . Then the question is reduced to calculate the di (3) For the remaining 2|E|−2Ein half-edges with the de- (Sin) (Sin) (Sin) (Sout) (Sout) (Sout) (S) (S) (S) number of distinct vectors (d1 , d2 , .., d ) such that (d , d , .., d , d , d , .., d ) |S| gree sequence 1 2 |S| 1 2 |S| , P|S| (Sin) 2Ein = i=1 di . In fact, this is an integer partition we can generate |E| − Ein edges by randomly connecting problem in number theory and combinatorics. Later, we will these half-edges. Similarly, the number of ways to generate 7

|E| − Ein edges can be calculated as the corresponding number of random graphs: S1 (2|E| − 2Ein)! Zout = . |S| (S ) |S| (S) (|E| − E )!2(|E|−Ein) Q d out ! Q d ! in i=1 i j=1 j S (3) 2 Then, it is obvious that random graphs generated by the above three steps will contain at least Ein edges within its sub-graph Se. However, the above procedure may generate identical random graphs under different degree sequences (S ) (S ) (S ) ((d in , d in , .., d in )). As shown in Fig.2, our objective is 1 2 |S| (1) (2) to generate random graphs that have at least 2 edges within ( Ds )( |E| ) 14 14 the sub-graph Se induced from the nodes of S. In step (1), we 2Ein Ein (12)( 6 ) −3 (1) pvalue(S1) ≤ 2|E| = 28 = 8.98 × 10 (Sin) (Sin) (Sin) ( ) ( ) (d , d , d ) 2Ein 12 may select different degree sequences 2 3 5 . Ds |E| 17 14 (S ) (S ) (S ) (2E )(E ) ( )( ) (d in , d in , d in ) = (4, 0, 0) pvalue(S ) ≤ in in = 12 6 = 0.61 For example, we select 2 3 5 in (2) 2 2|E| (28) (Sin) (Sin) (Sin) (2Ein) 12 Fig.2 (b1) and (d2 , d3 , d5 ) = (2, 1, 1) in Fig.2 (c1). In step (2) and step (3), we can obtain the same random Fig. 5. Examples for illustrating the calculation of the upper bound on the p-value. The first subgraph S1 is a of size 4, which has 6 internal graph Ge under these two different degree sequences. There- edges and 2 external edges. The second subgraoh S2 is composed of 5 fore, the number of distinct random graphs that have at least nodes, which has 6 internal edges and 5 external edges. Intuitively, S1 is more likely to be a true community than S2. Therefore, S1 should have Ein edges within the subgraph Se induced from the nodes P a lower p-value than S2. As expected, the upper bound on the p-value of S is no more than (Sin) (Sin) (Sin) (Zin ∗ Zout). of S1 is much smaller than that of S2 according to Equation (4) and (7). (d1 ,d2 ,..,d|S| ) Putting all together, we can obtain an upper bound on the p-value of S: graphs. Under this assumption, the total number of random T P graphs under the configuration model becomes: (Sin) (Sin) (Sin) (Zin ∗ Zout) (d1 ,d2 ,..,d|S| ) pvalue(S) ≤ (2|E|)! (2|E|)! T = . |E| Q|V | |E| (5) |E|!2 i=1 di! |E|!2 (S) P Q|S| di  |E|  ( (Sin) (Sin) (Sin) (S ) ) (d ,d ,..,d ) i=1 d in Ein Accordingly, random graphs that have at least Ein edges = 1 2 |S| i 2|E|  within the induced sub-graph Se can be generated by first 2Ein selecting 2Ein half-edges from Ds half-edges to generate Ds  |E|  2E E a sub-graph which has Ein internal edges, then randomly = in in . (4) 2|E|  connecting the remaining 2|E| − 2Ein half-edges. However, 2Ein the above procedure may also produce identical random In the last equation in (4), graphs (even the half-edges from the same node are as- d(S) sumed to be distinct) due to the same reason as we have P Q|S| i  Ds  ( (Sin) (Sin) (Sin) (S ) ) = because (d ,d ,..,d ) i=1 in 2Ein 1 2 |S| di discussed in section 3.3. Therefore, the number of random both the left and the right side of the equation corresponds graphs Z that have at least Ein edges within the sub-graph to the number of choosing 2Ein edges from Ds half-edges Se satisfies: under the assumption that half-edges from the same node are distinct. ! Moreover, this upper bound is tight since pvalue(S) = Ds (2Ein)! (2|E| − 2Ein)! Ds |E| Z ≤ . (6) Ein (|E|−Ein) (2Ein)(Ein) 2Ein (Ein)!2 (|E| − Ein)!2 2|E| when Ds = 2Ein. In this case, the sub-graph S (2Ein) is disconnected with all other vertices in G. In other words, Finally, we can obtain the same tight upper bound on the S is a connected of G. To generate a random p-value as in section 3.3: graph that has at least Ein edges within Se, each entry in (d(Sout), d(Sout), .., d(Sout)) D = 2E 1 2 |S| must be zero when s in. Ds  (2Ein)! (2|E|−2Ein)! 2E (E )!2Ein (|E|−Ein) As a result, no identical random graphs will be generated pvalue(S) ≤ in in (|E|−Ein)!2 since there will be no overlap between the edge sets in step (2|E|)! |E|!2|E| (2) and step (3). Ds  |E|  = 2Ein Ein . 2|E|  (7) 3.4 The p-value when all half-edges are distinct 2Ein We have shown how to derive an upper bound on the p- In Fig.5, we use two subgraphs as examples to illustrate value based on distinct random graphs, in which the half- how to calculate the upper bound of the p-value according edges from the same node have no difference in the random to Equation (4) and (7). As shown in Fig.5, a smaller upper graph. We can also assume that the half-edges from the bound of p-value will be assigned to the subgraph that is same node are distinguishable in the generation of random more likely to be a true community. 8

3.5 The DSC method smallest p-value is retained. The updated node set is used The proposed DSC algorithm for detecting statistically sig- as the seed set again to continue the local search procedure nificant communities is described in Algorithm 1. The input until the p-value cannot be further reduced. To accelerate of the algorithm is composed of a undirected graph G(V,E) the convergence speed, we impose an additional threshold and a significance level threshold, and the output is a parameter on the difference value between the logarithm of set of statistically significant communities. Note that the new p-value and that of the original p-value. If the difference upper bound in Equation 4 and 7 is used as the p-value value is less than the specified threshold, we will terminate in the algorithm. Meanwhile, we use the Stirling formula to the local search procedure. The proof on the convergence of approximate the factorial in the p-value calculation: the Algorithm 2 is given in Theorem 1. √ n n! ≈ 2πn( )n. (8) Algorithm 2 Search One Community. e Input: A node set ns. At the beginning of the algorithm, we initialize the Output: One statistically significant community sc. NodeList by using all nodes in G. Meanwhile, we choose 1: Initialization: nodeIndex ← −1, choice ← ADD, minp ← the node with the maximal clustering coefficient from the pvalue(ns). NodeList and its neighbors to form the seed community 2: for each node ∈ ns do 3: rp ← pvalue(ns − node). (Line 3∼4) for detecting a single statistically significant com- 4: if rp < minp then munity. The local clustering coefficient of a node quantifies 5: minp ← rp. how close its neighbors are to being a clique, so it can be 6: nodeIndex ← node. used to measure if the node and its neighbors tend to cluster 7: choice ← REMOVE. together. The clustering coefficient of a node can be defined 8: end if as: 9: end for 10: for each node∈ / ns and has a neighbor in ns do |{ejk : j, k ∈ Ni, ejk ∈ E}| 11: ra ← pvalue(ns + node). C = , (9) ra < minp i |N |(|N | − 1)/2 12: if then i i 13: minp ← ra. nodeIndex ← node where Ni is the set of neighbors of the node i and ejk 14: . choice ← ADD denotes the edge between two nodes j and k in the neigh- 15: . N 16: end if borhood i. Therefore, the local clustering coefficient of a 17: end for node is the proportion of the number of edges between 18: if nodeIndex = −1 then the nodes within its neighborhood divided by the maximal 19: sc ← ns number of edges that could exist between them. If the 20: return sc. degree of a node is 2 and its two neighbors have an edge, 21: else if choice = ADD then then its clustering coefficient will be 1. Hence, those nodes 22: ns.add(nodeIndex). 23: Search One Community(ns). with only two neighbors will not be used to generate the 24: else seed community even their clustering coefficients are 1. 25: ns.remove(nodeIndex). After the initialization, we detect a single community 26: Search One Community(ns). with a local search method and remove nodes in this com- 27: end if munity from NodeList. We repeat the initialization and local search steps until the NodeList is empty. Theorem 1. The Algorithm 2 can converge in a finite number of Algorithm 1 DSC Algorithm. iterations. Input: An undirect network G(V,E);A significance level pa- rameter α ∈ (0, 1). Output: A set of statistically significant communities SC. Proof. First, note that there are only a finite number sub- |V | 1: Initialization: NodeList ← V . graphs (2 − 1) in the network, which means that the 2: while NodeList ! = ∅ do number of possible sub-graphs is finite when searching a 3: s ← max(clusterCoefficient(NodeList)). statistically significant community. Second, each possible 4: ns ← s ∪ {t ∈ V |(s, t) ∈ E}. sub-graph appears at most once during the search proce- 5: sc ← Search One Community(ns). dure of Algorithm 2 since the p-value sequence is strictly 6: if |sc| > 2 and pvalue(sc) < α then 7: SC.add(sc). decreasing. Therefore, the result follows. 8: NodeList.remove(sc). 9: end if 10: end while The result of Theorem 1 guarantees the convergence 11: Return SC. of the Algorithm 2. Meanwhile, since one node can be distributed into different communities in our algorithm, Given a seed set ns, Search One Community(ns) re- there may be some overlapping nodes among different turns a single statistically significant community using a commuinities. If the overlap between two communities is local search procedure in Algorithm 2. In this procedure, we too high, it is reasonable to regard that one of the two try to include one node into ns or remove one node from communities is redundant. To solve the redundancy issue, |A∩B| ns to check if such operations can lead to smaller p-values. we merge two communities A and B if min(|A|,|B|) is larger The operation that can obtain a new node set that has the than a threshold. 9

4 EXPERIMENTS In addition, we also employ other five metrics in the performance comparison on real data sets: Purity, Rand In- We compare our method with three existing algorithms: dex (RI), Precision, Recall, and F-measure. For each detected OSLOM [19], ESSC [20] and CPM [50] on both real data sets community ω in Ω, we can find a ground-truth community and LFR benchmark data sets. We choose these methods k c from C such that these two communities have the largest based on the following considerations: CPM is one of the b number of overlapping nodes. The nodes in ω ∩ c can be most popular overlapping community detection methods, k b regarded as correctly detected nodes from ω . Then, Purity OSLOM and ESSC also detect statistically significant com- k is defined as the fraction of correctly detected nodes: munities under the configuration model. Meanwhile, the source codes or software packages of these three methods 1 X P urity(Ω,C) = max |ωk ∩ cj|. (14) are publicly available. We run OSLOM with their default N j parameter settings on undirect and unweighted networks, k and the significance level α in ESSC is specified to be 0.01. Based on the set of ground-truth communities C, two Meanwhile, we use the CFinder software which is the imple- nodes are said to have the same label if they are contained mentation of CPM. We choose its best result when the clique in the same community from C. Then, each node pair with size parameter is ranged from 3 to 5. Also, the significance respect to the set of detected communities has four possibil- level α in our method is specified to be 0.01, the overlap ities: (1) True Positive (TP): two nodes with the same label threshold is fixed to be 0.7, and the difference threshold with are allocated into the same community; (2) False Negative respect to the logarithm of p-value is specified to be 5. (FN): two nodes with the same label are distributed to To evaluate different community detection methods, different communities; (3) False Positive (FP): two nodes here we choose Overlapping Normalized Mutual Informa- with different labels are allocated into the same community; tion (ONMI) [51] as the major performance indicator. Let (4) True Negative (TN): two nodes with different labels are distributed to different communities. The Rand Index (RI) is Ω = {ω1, ω2, .., ωK } be the set of detected communities, defined as the percentage of correctly allocated node pairs: then the binary membership variable Xk (k = 1, 2, .., K) th can be used to indicate if one node belongs to the k TP + TN RI = . (15) community in Ω. The probability distribution of Xk is given TP + TN + FP + FN by P (X = 1) = |ω |/N and P (X = 0) = 1 − P (X = 1), k k k k Precision and Recall are defined as follows: where N is the number of nodes in the network. Similarly, the random variable Yl = 1 (l = 1, 2, .., J) if one node TP TP th P = ,R = . (16) belongs to the l community in C, where C = {c1, c2, .., cJ } TP + TN TP + FN represents the set of ground-truth communities. The prob- The F-measure is defined as the harmonic mean of ability distribution of Yl can be defined by P (Yl = 1) = precision and recall: |cl|/N and P (Yl = 0) = 1 − P (Yl = 1). Consequently, we P (X ,Y ) 2 × P × R can obtain the joint probability distribution k l in a F − measure = . (17) similar manner. Then, the conditional entropy of Xk given P + R Yl is defined as: 4.1 Real data sets In this section, we choose eight real data sets in the per- H(X |Y ) = H(X ,Y ) − H(Y ), (10) k l k l l formance comparison: Karate (karate) [52], Football (foot- where H(·) is the standard entropy function. ball) [53], Personal Facebook (personal) [20], Political blogs (polblogs) [54], Books about US politics (polbooks) [55], and As a result, the entropy of Xk with respect to all compo- Railway (railway) [56], DBLP collaboration network (dblp) nents of Y = {Y1,Y2, .., YJ } can be defined as: [57] and Amazon product co-purchasing network (amazon) [57]. The detailed statistics of these data sets are summa- H(Xk|Y ) = min H(Xk|Yl). (11) l∈{1,2..,J} rized in Table 1, where |V | and |E| respectively denote the number of the nodes and the number of the edges, hki X = The normalized conditional entropy of represents the average degree of the nodes, k represents {X ,X , .., X } Y max 1 2 K with respect to is defined as: the maximal degree of the nodes , |C| denotes the number of true communities in the network and |Smax|, |Smin| 1 X H(Xk|Y ) H(X|Y ) = . (12) respectively denote the maximal and minimal community |Ω| H(X ) k k size of true communities in the network. As can be seen from Table 1, the number of nodes is Note that we can define H(Y |X) in the same way. ranged from 34 to 300000, and the number of communities is Finally the ONMI for two community structures Ω and C ranged from 2 to 75000, covering a broad range of properties is given by: of real networks. Furthermore, the ground-truth communi- ties of the first 6 small networks have no overlapping nodes, ONMI(X|Y ) = 1 − [H(X|Y ) + H(Y |X)]/2. (13) while dblp and amazon have highly overlapping ground- truth communities. Note that the evaluation metrics except The greater the value of ONMI is, the better the detection ONMI are mostly used in the scenarios that the ground- results are. In the most extreme case, ONMI = 1 indicates truth communities have no overlapping nodes. Therefore that the set of reported communities are exactly the same as we only use ONMI as the performance indicator on the dblp the set of true communities. and amazon data sets. 10

TABLE 1 The characteristics of eight real data sets

Data sets |V | |E| hki kmax |C| |Smax| |Smin| karate 34 78 4.59 17 2 18 16 football 115 613 10.57 12 12 13 5 personal 561 8375 29.91 166 8 150 3 polblogs 1490 19090 27.32 351 2 758 732 polbooks 105 441 8.4 25 3 49 13 railway 301 1226 6.36 48 21 46 1 dblp 0.3M 1M 6.62 549 13K 7556 6 amazon 0.3M 0.9M 5.53 343 75K 53551 3

TABLE 2 The performance comparison of different algorithms on real data sets with non-overlapping ground-truth communities

Data sets Algorithm ONMI Purity Precision Recall RI F-measure |Cd| |Sdmax| |Sdmin| DSC 0.92 1 1 1 1 1 2 17 16 ESSC 0.92 1 1 1 1 1 2 16 16 karate OSLOM 0.92 0.97 0.94 0.94 0.94 0.94 2 19 16 CPM 0.87 0.71 0.53 0.62 0.62 0.57 3 25 3 DSC 0.85 0.96 0.92 0.94 0.98 0.93 12 14 4 ESSC 0.81 0.91 0.83 0.67 0.96 0.74 13 14 6 football OSLOM 0.82 0.92 0.84 0.92 0.98 0.88 11 15 6 CPM 0.76 0.92 0.86 0.77 0.98 0.81 13 13 4 DSC 0.32 0.73 0.58 0.83 0.83 0.68 14 172 5 ESSC 0.23 0.72 0.57 0.30 0.82 0.39 21 560 8 personal OSLOM 0.15 0.74 0.63 0.26 0.82 0.37 24 97 1 CPM 0.20 0.55 0.36 0.73 0.66 0.48 13 328 4 DSC 0.27 0.93 0.93 0.71 0.83 0.80 27 517 3 ESSC 0.19 0.69 0.54 0.60 0.53 0.57 5 1211 1 polblogs OSLOM 0.13 0.95 0.92 0.23 0.60 0.37 13 277 1 CPM 0.15 0.78 0.79 0.37 0.67 0.51 10 674 3 DSC 0.22 0.85 0.73 0.40 0.73 0.52 6 36 4 ESSC 0.21 0.55 0.40 0.58 0.48 0.47 8 104 5 polbooks OSLOM 0.45 0.85 0.79 0.82 0.84 0.81 3 53 11 CPM 0.33 0.89 0.82 0.63 0.63 0.72 6 36 4 DSC 0.23 0.56 0.31 0.30 0.90 0.30 29 48 3 ESSC 0.11 0.51 0.30 0.16 0.70 0.20 45 128 7 railway OSLOM 0.16 0.55 0.37 0.25 0.90 0.30 31 30 1 CPM 0.12 0.49 0.22 0.26 0.68 0.24 10 173 4

4.1.1 Performance Comparison allocated node pairs. Anyway, although our method cannot Table 2 presents the comparison result on the first six always achieve the best performance on all real data sets, real data sets whose ground-truth communities have no it outperforms the competing methods on most data sets in overlapping nodes, and Table 3 presents the comparison terms of ONMI, Recall, RI and F-measure. result in terms of ONMI on the dblp and amazon data sets. (2) For the karate data set, our method and ESSC can Meanwhile, Table 2 and Table 3 also record the number of achieve the perfect performance with respect to metrics communities (denoted by |Cd|), the maximal community such as RI and F-measure while ONMI is not 1. This is size (denoted by |Sdmax|) and the minimal community size because (a) our method do not report the communities (denoted by |Sdmin|) reported by each algorithm. whose sizes are smaller than 3 or whose p-values are no From the experimental results in Table 2 and Table 3, we less than the significance level, and ESSC also do not report have the following important observations and comments: some nodes in the network. (b) The five evaluation metrics (1) All algorithms can achieve good performance in except ONMI only use nodes in the reported communities terms of ONMI on the karate and football data sets since in the performance assessment, while ONMI also considers these two small networks have well-separated communities. the nodes that are not reported in the community detection However, the ONMI values of all algorithms on the re- results. maining six networks are very low because the community (3) For the large networks (dblp and amazon), our structure in these networks is weak. Meanwhile, it can also method outperformed ESSC and OSLOM in terms of ONMI. be observed that the ONMI value is small while the values However, we have to admit the fact that out method cannot of other five evaluation metrics are large on some networks. beat CPM on the dblp data set. This indicates that the perfor- This is mainly due to the difference in the way of defining mance of significance-based community detection methods the metrics: ONMI evaluates the consensus between the set on large networks should be further improved. of detected communities and the set of ground-truth com- (4) We also compare the characteristics of reported com- munities globally while the other five metrics are defined munities of different methods on different data sets. With based on local information such as the number of correctly respect to the number of communities, our method is more 11

accurate than other methods on the karate, football, railway TABLE 4 and amazon data sets. In regard to the maximal community Spearman correlation between p-value and other three scoring functions size and the minimal community size, no algorithm can be very accurate on most data sets. Overall, it can be RatioCut Modularity karate 1 1 -1 observed that these characteristics of reported communities football 0.9231 0.9231 -0.9371 are positively correlated with those performance indicators personal 0.8313 0.1667 -0.8743 such as ONMI. For instance, our method can achieve better polblogs 1 1 -1 performance than other methods on the football data set. polbooks 1 1 -0.5 Meanwhile, the number of communities reported by our railway 0.9532 0.2639 -0.9477 method is exactly the same as the ground-truth on this data dblp 0.4748 0.1371 -0.9640 amazon 0.6965 0.3988 -0.9979 set.

TABLE 3 The performance comparison of different algorithms on real data sets each network. The experimental results on the correlation with overlapping ground-truth communities relationship are presented in Table 4. p Data sets Algorithm ONMI |Cd| |Sdmax| |Sdmin| The -value is positively correlated with conductance DSC 0.14 46k 2325 3 and ratio , since small scores are assigned to true com- ESSC 0.08 13k 1220 3 munities in all these three functions. In contrast, modularity dblp OSLOM 0.11 18k 127 3 has a negative correlation with the p-value since it generates CPM 0.19 47k 2085 5 larger scores for real communities. DSC 0.21 29k 1185 3 From the experimental results of Table 4, we have the ESSC 0.17 26k 5290 3 amazon following observations. OSLOM 0.18 22k 859 3 Firstly, the absolute values of correlation coefficients in CPM 0.21 23k 240 5 Table 4 are all no less than 0.5 (except two values) for small networks. Meanwhile, more than 1/3 of these coefficients are either 1 or -1. For two large networks, 3 out of 6 absolute 4.1.2 Correlation with classical community evaluation func- correlation coefficients are larger than 0.5. This indicates that tions our p-value function has a good consensus with existing To further validate the effectiveness of our p-value function, classical scoring functions, which further validates the cor- we check the correlation between the p-value and three well- rectness and effectiveness of the proposed p-value function. known community scoring functions: conductance [58], ra- Secondly, the proposed p-value function is highly cor- tio cut [59] and modularity [4]. The conductance of a com- related with the conductance function on all six small net- munity S is defined as: works with correlation coefficients that are at least 0.8. On two large networks, the correlation coefficients are close to Eout Conductance(S) = , (18) 0.5. Meanwhile, the proposed p-value function is highly min(Ds,D − Ds) negatively correlated with the modularity measure on all networks. In contrast, the consensus with ratio cut is not so where the meanings of Eout, Ds and D have been given in good. This is because conductance and modularity consider Section 3.1. The ratio cut is defined as: both the internal links within the community and external Eout links outside the community in a manner that is similar RatioCut(S) = , (19) |S|(|V | − |S|) to our definition on the p-value. However, the ratio cut function only considers external links in its definition. where |S| denotes the number of nodes in the community Finally, the correlation relationship between the p-value S. The modularity of a single community is defined as: function and three scoring functions are different across dif- ferent data sets. This partially illustrates why different com-  2 Ein Ein + Eout munity detection methods exhibit different performance on Modularity(S) = + , (20) |E| 2|E| different networks. where |E| is the number of edges in the network. Since the ground-truth communities are known for the 4.2 LFR benchmark data sets eight real networks in Table 1, we use the set of true commu- The LFR model [60] can generate artificial networks which nities as the input to calculate the association between dif- have a planted community structure. The LFR benchmark ferent scoring functions. Suppose there are |C| ground-truth networks have heterogeneous distributions of degree communities for a given network, we calculate the p-value, and community size [60]. There is an important parameter conductance, ratio cut and modularity for each of these called mixing coefficient µ in LFR benchmark model. This |C| communities. That is, we can obtain a score vector of mixing coefficient represents the desired average proportion length |C| for each scoring function. If two scoring functions of connections between a node and the nodes outside its coincide with each other perfectly, the absolute correlation community. Clearly, small values of µ indicate that there is value between the two corresponding score vectors will be an obvious community structure in the generated network. 1. Based on this observation, we calculate the Spearman’s In particular, µ = 0 indicates that all links are within the rank correlation coefficient between the score vector of community and µ = 1 indicates that two nodes of each the p-value and that of other three scoring functions on edge belong to different communities. In addition, another 12

1 1

0.9 0.9

0.8 0.8

0.7 0.7

0.6 0.6

0.5 0.5 NMI NMI 0.4 0.4

0.3 0.3 DSC DSC 0.2 OSLOM 0.2 OSLOM CPM CPM 0.1 ESSC 0.1 ESSC

0 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 u u

(a) N=1000 Small Community (b) N=1000 Big Community

1 1

0.9 0.9

0.8 0.8

0.7 0.7

0.6 0.6

0.5 0.5 NMI NMI 0.4 0.4

0.3 0.3 DSC DSC 0.2 OSLOM 0.2 OSLOM CPM CPM 0.1 ESSC 0.1 ESSC

0 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 u u

(c) N=5000 Small Community (d) N=5000 Big Community

Fig. 6. The comparison of different methods in terms of ONMI on the LFR benchmark networks without overlapping communities.

two parameters are used in generating overlapping commu- has comparable performance with both classical community nities: on controls the number of overlapping nodes and om detection methods and other significance-based community specifies the number of communities that the overlapping detection methods on detecting non-overlapping communi- node belongs to. ties. Our method is tested on LFR benchmark data sets with To test the performance of different methods on net- different network sizes and community sizes. Following the works with overlapping communities, we still use the same experimental settings in OSLOM [19], we considered two parameters for network size and community size to generate network sizes: N = 1000 and N = 5000 and two commu- four networks by setting on = 10% ∗ N and om = 3 nity sizes: “small” community size in the range [10,50] and in the LFR model. The performance comparison results on “big” community size in the range [20,100]. the networks with overlapping communities are shown in Fig.6 presents the performance of different methods in Supplementary Fig.2. Here we also only use ONMI as the terms of ONMI on four LFR data sets without overlapping performance indicator for networks with overlapping com- communities. Fig.6 shows that our method can achieve the munities. Although our method cannot outperform OSLOM same level performance as other three competing algorithms when the network size is 1000 and community is big, when the mixing coefficient µ is not larger than 0.6. When it has better performance than the other two competing the mixing coefficient is bigger than 0.6, each planted “true algorithms. community” will not be a community even in a weak Meanwhile, we also generate networks with N = 5000, sense since it has more external links than internal links on = 10% ∗ N and µ = 0.3 to check the performance [61]. Therefore, all methods have a quick decline of ONMI. when om is varied from 2 to 6. The results of different Overall, the experimental results indicate that our method algorithms are shown in Supplementary Fig.3. The increase 13 of om will lead to the performance decline of all methods, [16] A. Miyauchi and Y. Kawase, “Z-score-based modularity for com- which indicates that it is more difficult to detect overlapping munity detection in networks,” PloS one, vol. 11, no. 1, p. e0147805, 2016. communities when the overlapping nodes belong to more [17] Y. Hu, Y. Nie, H. Yang, J. Cheng, Y. Fan, and Z. Di, “Measuring communities. Moreover, our method can still achieve good the significance of community structure in complex networks,” performance even when the om parameter is relatively Physical Review E, vol. 82, no. 6, p. 066106, 2010. large. [18] A. Carissimo, L. Cutillo, and I. De Feis, “Validation of community robustness,” Computational Statistics & Data Analysis, vol. 120, pp. 1–24, 2018. 5 CONCLUSION [19] A. Lancichinetti, F. Radicchi, J. J. Ramasco, and S. Fortunato, “Finding statistically significant communities in networks,” PloS To address the problem of detecting statistically significant one, vol. 6, no. 4, p. e18961, 2011. communities, we derive a tight upper bound for the p- [20] J. D. Wilson, S. Wang, P. J. Mucha, S. Bhamidi, A. B. Nobel et al., value of a single community under the configuration model. “A testing based extraction algorithm for identifying significant communities in networks,” The Annals of Applied Statistics, vol. 8, Meanwhile, we also provide a systematic summarization no. 3, pp. 1853–1891, 2014. and analysis on the existing methods for detecting the [21] J. Palowitch, S. Bhamidi, and A. B. Nobel, “Significance-based statistically significant communities. Based on the upper community detection in weighted networks,” Journal of Machine bound of p-value of a single community, we present a local Learning Research, vol. 18, no. 188, pp. 1–48, 2018. [22] S. Kojaku and N. Masuda, “A generalised significance test for search method to find statistically significant communities. individual communities in networks,” Scientific Reports, vol. 8, Experimental results show that our method is comparable no. 1, p. 7351, 2018. with the competing methods on detecting true communities. [23] M. Koyuturk,¨ W. Szpankowski, and A. Grama, “Assessing sig- nificance of connectivity and conservation in protein interaction networks,” Journal of Computational Biology, vol. 14, no. 6, pp. 747– ACKNOWLEDGMENTS 764, 2007. [24] A. Lancichinetti, F. Radicchi, and J. J. Ramasco, “Statistical signif- This work was partially supported by the Natural Science icance of communities in networks,” Physical Review E, vol. 81, Foundation of China under Grant No. 61572094 and the no. 4, p. 046110, 2010. Fundamental Research Funds for the Central Universities [25] B. Wang, J. M. Phillips, R. Schreiber, D. Wilkinson, N. Mishra, and R. Tarjan, “Spatial scan statistics for graph clustering,” in Proceed- (No.DUT2017TB02). ings of the 2008 SIAM International Conference on Data Mining, 2008, pp. 727–738. [26] A. Miyauchi and Y. Kawase, “What is a network community?: A REFERENCES novel quality function and detection algorithms,” in Proceedings of the 24th ACM International on Conference on Information and [1] S. Fortunato, “Community detection in graphs,” Physics Reports, Knowledge Management, 2015, pp. 1471–1480. vol. 486, no. 3, pp. 75–174, 2010. [27] R. Durrett, Random graph dynamics. Cambridge university press [2] S. Fortunato and D. Hric, “Community detection in networks: A Cambridge, 2007. user guide,” Physics Reports, vol. 659, pp. 1–44, 2016. [3] L. Duan, W. N. Street, Y. Liu, and H. Lu, “Community detection [28] V. A. Traag, G. Krings, and P. Van Dooren, “Significant scales in in graphs through correlation,” in Proceedings of the 20th ACM community structure,” Scientific Reports, vol. 3, 2013. SIGKDD International Conference on Knowledge Discovery and Data [29] J. Reichardt and S. Bornholdt, “When are networks truly modu- Mining, 2014, pp. 1376–1385. lar?” Physica D: Nonlinear Phenomena, vol. 224, no. 1-2, pp. 20–26, [4] M. E. Newman and M. Girvan, “Finding and evaluating commu- 2006. nity structure in networks,” Physical Review E, vol. 69, no. 2, p. [30] L. Cutillo and M. Signorelli, “An inferential procedure for 026113, 2004. community structure validation in networks,” arXiv preprint [5] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast arXiv:1710.06611, 2017. unfolding of communities in large networks,” Journal of statistical [31] M. B. Perry, G. V. Michaelson, and M. A. Ballard, “On the statis- mechanics: theory and experiment, vol. 2008, no. 10, p. P10008, 2008. tical detection of clusters in undirected networks,” Computational [6] M. E. Newman, “Fast algorithm for detecting community structure Statistics & Data Analysis, vol. 68, pp. 170–189, 2013. in networks,” Physical Review E, vol. 69, no. 6, p. 066133, 2004. [32] B. Karrer, E. Levina, and M. E. Newman, “Robustness of commu- [7] T. Chakraborty, A. Dalmia, A. Mukherjee, and N. Ganguly, “Met- nity structure in networks,” Physical Review E, vol. 77, no. 4, p. rics for community analysis: A survey,” ACM Computing Surveys 046119, 2008. (CSUR), vol. 50, no. 4, p. 54, 2017. [33] Y.-T. Chang, D. Pantazis, and R. M. Leahy, “Assessing statistical [8] S. S. Bhowmick and B. S. Seah, “Clustering and summarizing significance when partitioning large-scale brain networks,” in Prof. protein-protein interaction networks: A survey,” IEEE Transactions IEEE Int. Symposium on Biomedical Imaging (ISBI), 2012, pp. 1759– on Knowledge and Data Engineering, vol. 28, no. 3, pp. 638–658, 2015. 1762. [9] B. Teng, C. Zhao, X. Liu, and Z. He, “Network inference from [34] M. Sales-Pardo, R. Guimera, A. A. Moreira, and L. A. N. Amaral, ap-ms data: computational challenges and solutions,” Briefings in “Extracting the hierarchical organization of complex systems,” bioinformatics, vol. 16, no. 4, pp. 658–674, 2014. Proceedings of the National Academy of Sciences of the United States [10] M. Newman, “Network structure from rich but noisy data,” Nature of America, vol. 104, no. 39, pp. 15 224–15 229, 2007. Physics, vol. 14, no. 6, pp. 542–545, 2018. [35] P. Zhang and C. Moore, “Scalable detection of statistically sig- [11] V. Spirin and L. A. Mirny, “Protein complexes and functional mod- nificant communities and , using message passing for ules in molecular networks,” Proceedings of the National Academy of modularity,” Proceedings of the National Academy of Sciences of the Sciences of the United States of America, vol. 100, no. 21, pp. 12 123– United States of America, vol. 111, no. 51, pp. 18 144–18 149, 2014. 12 128, 2003. [12] Z. He, C. Zhao, H. Liang, B. Xu, and Q. Zou, “Protein complexes [36] C. P. Massen and J. P. Doye, “Thermodynamics of community identification with family-wise error rate control,” IEEE/ACM structure,” arXiv preprint cond-mat/0610077, 2006. transactions on computational biology and bioinformatics, 2019. [37] C. Gao and J. Lafferty, “Testing network structure using re- [13] Y. Su, C. Zhao, Z. Chen, B. Tian, and Z. He, “On the statistical lations between small subgraph probabilities,” arXiv preprint significance of protein complex,” Quantitative Biology, vol. 6, no. 4, arXiv:1704.06742, 2017. pp. 313–320, 2018. [38] ——, “Testing for global network structure using small subgraph [14] M. Zitnik, J. Leskovec et al., “Prioritizing network communities,” statistics,” arXiv preprint arXiv:1710.00862, 2017. Nature communications, vol. 9, no. 1, p. 2544, 2018. [39] T. Tokuda, “Statistical test for detecting community structure in [15] R. Aldecoa and I. Mar´ın, “Deciphering network community struc- real-valued edge-weighted graphs,” PloS one, vol. 13, no. 3, p. ture by surprise,” PloS one, vol. 6, no. 9, p. e24195, 2011. e0194079, 2018. 14

[40] K. Chen and J. Lei, “Network cross-validation for determining the number of communities in network data,” Journal of the American Statistical Association, pp. 1–11, 2017. [41] D. F. Saldana, Y. Yu, and Y. Feng, “How many communities are there?” Journal of Computational and Graphical Statistics, vol. 26, no. 1, pp. 171–181, 2017. [42] J. Lei et al., “A goodness-of-fit test for stochastic block models,” The Annals of Statistics, vol. 44, no. 1, pp. 401–424, 2016. [43] P. J. Bickel and P. Sarkar, “Hypothesis testing for automated community detection in networks,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 78, no. 1, pp. 253–273, 2016. [44] Y. Zhao, E. Levina, and J. Zhu, “Community extraction for social networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 108, no. 18, pp. 7321–7326, 2011. [45] A. F. McDaid, T. B. Murphy, N. Friel, and N. J. Hurley, “Improved bayesian inference for the stochastic block model with application to large networks,” Computational Statistics & Data Analysis, vol. 60, pp. 12–31, 2013. [46] A. Nobile and A. T. Fearnside, “Bayesian finite mixtures with an unknown number of components: The allocation sampler,” Statistics and Computing, vol. 17, no. 2, pp. 147–162, 2007. [47] Y. Chen, P. Diaconis, S. P. Holmes, and J. S. Liu, “Sequential monte carlo methods for statistical analysis of tables,” Journal of the American Statistical Association, vol. 100, no. 469, pp. 109–120, 2005. [48] B. D. McKAY, “Subgraphs of dense random graphs with specified degrees,” Combinatorics, Probability and Computing, vol. 20, no. 3, pp. 413–433, 2011. [49] F. Radicchi, A. Lancichinetti, and J. J. Ramasco, “Combinatorial approach to modularity,” Physical Review E, vol. 82, no. 2, p. 026102, 2010. [50] G. Palla, I. Derenyi,´ I. Farkas, and T. Vicsek, “Uncovering the overlapping community structure of complex networks in nature and society,” Nature, vol. 435, no. 7043, pp. 814–818, 2005. [51] A. Lancichinetti, S. Fortunato, and J. Kertesz,´ “Detecting the overlapping and hierarchical community structure in complex networks,” New Journal of Physics, vol. 11, no. 3, p. 033015, 2009. [52] W. W. Zachary, “An information flow model for conflict and fission in small groups,” Journal of anthropological research, vol. 33, no. 4, pp. 452–473, 1977. [53] M. Girvan and M. E. Newman, “Community structure in social and biological networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 12, pp. 7821– 7826, 2002. [54] L. A. Adamic and N. Glance, “The political blogosphere and the 2004 us election: divided they blog,” in Proceedings of the 3rd International Workshop on Link Discovery, 2005, pp. 36–43. [55] V. Krebs, “ analysis software & services for organi- zations, communities, and their consultants,” http://www.orgnet. com, 2013. [56] T. Chakraborty, S. Srinivasan, N. Ganguly, A. Mukherjee, and S. Bhowmick, “On the permanence of vertices in network com- munities,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 1396– 1405. [57] J. Yang and J. Leskovec, “Defining and evaluating network com- munities based on ground-truth,” Knowledge and Information Sys- tems, vol. 42, no. 1, pp. 181–213, 2015. [58] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, “Com- munity structure in large networks: Natural cluster sizes and the absence of large well-defined clusters,” Internet Mathematics, vol. 6, no. 1, pp. 29–123, 2009. [59] J. Leskovec, K. J. Lang, and M. Mahoney, “Empirical comparison of algorithms for network community detection,” in Proceedings of the 19th International Conference on World Wide Web, 2010, pp. 631–640. [60] A. Lancichinetti, S. Fortunato, and F. Radicchi, “Benchmark graphs for testing community detection algorithms,” Physical Review E, vol. 78, no. 4, p. 046110, 2008. [61] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi, “Defining and identifying communities in networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 9, pp. 2658–2663, 2004.