<<

11

FIG. 6: Six modules corresponding to the uppermost branched linear chain of modules depicted in Figure 4. Colors denote modules as defined by the network information bottleneck algorithm. Again the modules roughly correspond to institutional affiliations. Over 50% of the blue nodes have one or more affiliations with the institutions based in and around Chicago (Argonne National Laboratory, University of Illinois at Chicago and University of Notre Dame). 70% of the red nodes are in England, and 75% of the green nodes are in China, mostly at the Institute of Chemistry Chinese Academy of Sciences, and all of the cyan nodes are at the Center of Complex Systems Research in Illinois. Both the yellow and magenta modules are mostly affiliated with the University of Nebraska. ECS 289 / MAE 298 Feb 9, 2011

“Assortative mixing, graph partitioning,

FIG. 7: E. coli gene regulatory network. Lcommunityargest componen structure”t of the symmetric version of the E. coli genetic regulatory network. Colors denoted modules identified by NIB.

Diffusive distributions are but one general class of distributions on a network. A natural generalization of these ideas is to describe other distributions on a network for which a particular function, energy, or origin is known, and on which some particular of freedom (such as chemical concentration or genetic expression as a function of time) may be defined. Finally, we note that while the information bottleneck is a prescription for finding the highest-fidelity summary of a system at a given simplicity, algorithms for determining network are usually motivated by various definitions of normalized min-cuts [27, 28, 29, 30]. Our results, particularly for the synthetic graphs with prescribed modular structure, demonstrate that information implies edge modularity, an unexpected finding which motivates further numerical and analytic investigations in progress regarding this relationship.

Acknowledgments

It is a pleasure to acknowledge , Noam Slonim, Christina Leslie, Risi Kondor, Ilya Nemenman, and Susanne Still for many useful conversations on the information bottleneck and modularity. This work was supported Many networks naturally break up into groups (Other common features: broad-scale, small worlds, clustering)

• Modular networks / Community structures • Max- min-flow • Hierarchical networks • More generally, finding clusters of similar objects What can we learn from partitioning a network?

(Still an open question: tons of work on detecting communities, not so much work in interpreting what they mean.)

• Bottlenecks

• Social groups / collaboration networks

• US Congress: – Partisan / bi-partison voting patterns (both by party and region) – Central players

• Sometimes functional subgroups (some evidence in Gene Regulatory networks) Topics covered today • Modularity metric (many edges within a module, few between modules) • Kernighan-Lin algorithm • Spectral techniques: Graph Laplacian, Fiedler vector • Girvan-Neman ( based) and Newman’s additional work (spectral) (works well for directed networks (weighted edges is easy – )) • Approximation techniques for large networks (finding the optimal is NP-complete) • Dendrograms / hierachy • Overlapping communities • Noise from high degree nodes (includes inherent modularity) • Maximum likelihood (nodes with similar patterns of connectivity) • Applications: when do “communities” actually tell us anything useful? (Esp since methods are topology based not about functional connections (e.g. V8 engine)) Here: missing edges, intelligent guesses at missing node attributes ... Kernighan-Lin algorithm, 1970

• How well can you partition a set of connected nodes into two groups with minimal group-spanning edges? (Motivation: Circuit board design – minimize wires crossing between boards.) • Judge this using a quality metric about how well partitions into groups: minimum number of group-spanning edges. • Divide the vertices into two groups (at random or, better yet, with good guess) • Examine each pair (i, j) where i in Group1 and j in Group2. Swap the (i, j) pair that reduces the spanning edges (if there is one). And repeat excluding this pair. • Avoiding local minimum – 1) “simulated annealing” (accept bad moves with some probability, and decrease the probability over time); 2) If no pair reduces number of spanning edges, swap pair that increases it the least. • Dependence on initial condition / grouping. • Also number of groups and their sizes preserved. Spectral partitioning – graph bi-section with Fiedler vector

• Recall random walk state transition matrix (we started from A and row or column normalized it).

0 1 1/4 1/3 1/2 1/4 0 B 1/4 1/3 0 1/4 0 C B C B C M = B 1/4 0 1/2 0 0 C B C @ 1/4 1/3 0 1/4 1/2 A 0 0 0 1/4 1/2

• Consider instead the Graph Laplacian, L where: No self-loops allowed Lii = di (the degree of node i) 0 1 Lij = −Aij for i 6= j. 3 −1 −1 −1 0 B −1 2 0 −1 0 C (Recall −Aij = −1 if edge B C B C connects i and j, and 0 otherwise.) L = B −1 0 1 0 0 C B C @ −1 −1 0 3 −1 A 0 0 0 −1 1 Graph bisection • M, rwalk state transition matrix (Perron-Frobenious); largest eigenvalue λ = 1 (steady-state). The number of eigenvalues with λ = 1 equals number of disconnected components • L, graph Laplacian. – Related to diffusion on a network (see Newman book, sec 6.13.1) – All eigenvalues λi real and λi ≥ 0. – There will always be one λ = 0, call it λ1 (Note ~v1 = {1, 1, 1, ···}) – The number of λi’s equal to zero are the number of disconnected components. – If only one λ = 0, call the next largest eigenvector λ2, the Fiedler vector. – The eigenvector ~v2 will have elements equal −1 or +1 only. – Elements with +1 assigned to group 1, those with -1 to group 2 (left picture:)

Timescale, τ1 = 5314 to Timescale, τ2 = 1092 to Timescale, τ3 = 157 to

● ●● ● ●● ● ●● ● o o ● o o ● ● ● ● ● ● o● ● ● o● ● ● ● ● ● ● o o● o● o o o● o● ● o o● o● o o o● o● +++●+ +●+ + o o● o● o o● o● ● +● + ● ● ● ● ● ● ● o● o● o ● o o● o● o ● +● ●+●● +● ● ●● o ● ● ●● o 20 ++ +● ● 20 o o ● ● 20 o o o ● ● + ●●++ +● + o●● o o o● o o●● o o ● ●● +● +●+●+ ●o● o ● o● o● ●o● o ● ●+●+ o● +++ ● + + ● ● o ●o o● o● o ●o + + ● +● ++● + o● o o● o● o ● Y + +● ●●●● ● Y o● o●●o●● o● Y +● ●●●● ● ● +++●+ ● ooo● ● +++●+ ● o ● +++ ● o● ● o ● +++ o ●o● + ● ● +●● ● o● ●● + ● ● 10 o ++ 10 + + o 10 + + ++ ● ● ●+● ●+● o o● ● ● ●o + +● + +● 5 o● o ● 5 ● + ● 5 ● + ● ● o● ● ● o +● + ● ● ● o +● + ● ● ● o ●o o● o o●● ● ● ● o●● ● ● ● o●● ● o ● ● oo● o +● ++ ● oo● o +● ++ ● oo● o ● ●o o ●o●o● o● ●+●+ o ●o●o● o● ●+●+ ●o●o● o● 0 o o o 0 + o 0 + + o

0 5 10 20 0 5 10 20 0 5 10 20

X X X Aside: Some other important spectral properties

• Eigenvalues of Adjacency matrix, Laplacian, rwalk matrix are invariant under node relabeling.

• If two graphs are isomorphisms, they will have exactly the same eigenvalue spectrum.

• This is a necessary condition for isomorphism, but not sufficient – just because two graphs have same eigenvalues does not mean they are isomorphic – but it is strong evidence for isomorphisms (sometimes just first few eigenvalues tested is considered evidence)

• For more on graph isomorphisms, Fiedler vectors, graph embeddings, see http://cs-www.cs.yale.edu/homes/spielman/ From graph bisection to communities in networks

• M. Girvan and M. E. J. Newman, Community structure in social and biological networks, Proceedings of the National Academy of Sciences, 99 (2002), 78217826.

• The paper the launched “community structure”

• Identify the edge with highest betweeness centrality, this edge partitions the graph into two pieces.

• Rerun recursively on each sub-network.

• Note, this is continued bi-section.

• Very costly (need to calculate all shortest-paths between all pairs of nodes – recall definition of betweeness centrality). Relating this back to Kernighan-Lin and Fiedler

• Newman, Girvan, Finding and evaluating community structure in networks, Physical Review E, 69, 026113 (2004).

• Newman, Finding community structure in networks using the eigenvectors of matrices, Physical Review E, 74, 036104 (2006).

• Note the next slides come from a presentation that Mark Newman gave at the Institute for Pure and Applied Math, Workshop on “Random and Dynamic Graphs and Networks”, May 2007. There are many other interesting presentations there: http://www.ipam.ucla.edu/programs/rsws3/ Spectral partitioning is related to min-cut/max-flow

• Choose any pair of vertices (i, j) the minimum set of edges whose deletion places the two vertices in disconnected components of the graph, carries the maximum flow between the two vertices.

• Finding the optimal partitioning over all possible pairs of nodes is a difficult optimization problem that is in NP.

• Maximizing modularity also is NP: U. Brandes, D. Delling, M. Gaertler, R. Goerke, M. Hoefer, Z. Nikoloski, and D. Wagner, On modularity clustering, IEEE Transactions on Knowledge and Data Engineering, 20 (2008), 172188.

• So rather than finding THE optimal partitioning, we turn to approximation algorithms to find approximately optimal partitions. The

• V.D. Blondel, J.-L. Guillaume, R. Lambiotte and E. Lefebvre (2008). ”Fast unfolding of community in large networks”. J. Stat. Mech.: P10008

• http://sites.google.com/site/findcommunities/

• The method consists of two phases. First, it looks for ”small” communities by optimizing modularity in a local way. Second, it aggregates nodes of the same community and builds a new network whose nodes are the communities. These steps are repeated iteratively until a maximum of modularity is attained.

• This method is implemented NetworkX and others. Dendrograms

Can be built by aggregating (Louvain method) or recursing from the full graph. !

Vol 435|9 June 2005|doi:10.1038/nature03607 LETTERS

Uncovering the overlapping community structure of complex networks in nature and society

Gergely Palla1,2, Imre Dere´nyi2, Ille´s Farkas1 & Tama´s Vicsek1,2

Many complex systems in nature and society can be described in withOverlappingthe overlaps being their links. communities!The number of such links of com terms of networks capturing the intricate web of connections community a can be called its community degree, da : Finally, the 1–4 com a among the units they are made of .G.A key Palla,question I.is how Dernyi,to size I.sa Farkas,of any community andcan T.most Vicsek,naturally be Uncoveringdefined as the the overlapping interpret the global organization of such networks as the co- number of its nodes. To characterize the community structure of a existence of their structural subunitscommunity(communities) associated structurelarge ofnetwor complexk we introduc networkse the distributions in natureof these andfour basic society, Nature, 435 with more highly interconnected parts. Identifying these a priori quantities. In particular we focus on their cumulative distribution unknown building blocks (such as functionally(2005),related 814818.proteins5,6, industrial sectors7 and groups of people8,9) is crucial to the understanding of the structural and functional properties of networks. The existing deterministic methods used for large net- works find separated communities, whereas most of the actual networks are made of highly overlapping cohesive groups of nodes. Here we introduce an approach to analysing the main statistical features of the interwoven sets of overlapping commu- nities that makes a step towards uncovering the modular structure of complex systems. After defining a set of new characteristic quantities for the statistics of communities, we apply an efficient technique for exploring overlapping communities on a large scale. We find that overlaps are significant, and the distributions we introduce reveal universal features of networks. Our studies of collaboration, word-association and protein interaction graphs show that the web of communities has non-trivial correlations and specific scaling properties. Most real networks typically contain parts in which the nodes (units) are more highly connected to each other than to the rest of the network. The sets of such nodes are usually called clusters, communities, cohesive groups or modules8,10,11–13; they have no widely accepted, unique definition. In spite of this ambiguity, the presence of communities in networks is a signature of the hierarchical nature of complex systems5,14. The existing methods for finding communities in large networks are useful if the commu- nity structure is such that it can be interpreted in terms of separated sets of communities (see Fig. 1b and refs 10, 15, 16–18). However, most real networks are characterized by well-defined statistics of overlapping and nested communities. This can be illustrated by the numerous communities that each of us belongs to, including those related to our scientific activities or personal life (school, hobby, Figure 1 | Illustration of the concept of overlapping communities. a, The family) and so on, as shown in Fig. 1a. Furthermore, members of our black dot in the middle represents either of the authors of this paper, with communities have their own communities, resulting in an extremely several of his communities around. Zooming in on the scientific community complicated web of the communities themselves. This has long been demonstrates the nested and overlapping structure of the communities, and understood by sociologists19 but has never been studied system- depicting the cascades of communities starting from some members atically for large networks. Another, biological, example is that a exemplifies the interwoven structure of the network of communities. large fraction of proteins belong to several protein complexes b, Divisive and agglomerative methods grossly fail to identify the simultaneously20. communities when overlaps are significant. c, An example of overlapping k- communities at k 4. The yellow community overlaps the blue one In general, each node i of a network can be characterized by a in a single node, whereas it¼shares two nodes and a link with the green one. membership number m i, which is the number of communities that These overlapping regions are emphasized in red. Notice that any k-clique the node belongs to. In turn, any two communities a and b can share (complete subgraph of size k) can be reached only from the k-cliques of the ov sa;b nodes, which we define as the overlap size between these same community through a series of adjacent k-cliques. Two k-cliques are communities. Naturally, the communities also constitute a network, adjacent if they share k 2 1 nodes.

1Biological Physics Research Group of the Hungarian Academy of Sciences, Pa´zma´ny P. stny. 1A, H-1117 Budapest, Hungary. 2Department of Biological Physics, Eo¨tvo¨s University, Pa´zma´ny P. stny. 1A, H-1117 Budapest, Hungary. 814 ©!!""#!Nature Publishing Group! !

Vol 435|9 June 2005|doi:10.1038/nature03607 LETTERS

Uncovering the overlapping community structure of complex networks in nature and society

Gergely Palla1,2, Imre Dere´nyi2, Ille´s Farkas1 & Tama´s Vicsek1,2

Many complex systems in nature and society can be described in with the overlaps being their links. The number of such links of com terms of networks capturing the intricate web of connections community a can be called its community degree, da : Finally, the 1–4 com among the units they are made of . A key question is how to size sa of any community a can most naturally be defined as the interpret the global organization of such networks as the co- number of its nodes. To characterize the community structure of a existence of their structural subunits (communities) associated large network we introduce the distributions of these four basic with more highly interconnected parts. Identifying these a priori quantities. In particular we focus on their cumulative distribution unknown building blocks (such as functionally related proteins5,6, industrial sectors7 and groups of people8,9) is crucial to the understanding of the structural and functional properties of networks. The existing deterministic methods used for large net- works find separated communities, whereas most of the actual networks are made of highly overlapping cohesive groups of nodes. Here we introduce an approach to analysing the main statistical features of the interwoven sets of overlapping commu- nities that makes a step towards uncovering the modular structure of complex systems. After defining a set of new characteristic quantities for the statistics of communities, we apply an efficient technique for exploring overlapping communities on a large scale. We find that overlaps are significant, and the distributions we introduce reveal universal features of networks. Our studies of collaboration, word-association and protein interaction graphs show that the web of communities has non-trivial correlations and k-clique aggregation (Palla et al.) specific scaling properties. Most real networks typically contain parts in which the nodes (units) are more highly connected to each other than to the rest of the network. The sets of such nodes are usually called clusters, communities, cohesive groups or modules8,10,11–13; they have no widely accepted, unique definition. In spite of this ambiguity, the presence of communities in networks is a signature of the hierarchical nature of complex systems5,14. The existing methods for finding communities in large networks are useful if the commu- nity structure is such that it can be interpreted in terms of separated sets of communities (see Fig. 1b and refs 10, 15, 16–18). However, most real networks are characterized by well-defined statistics of overlapping and nested communities. This can be illustrated by the numerous communities that each of us belongs to, including those related to our scientific activities or personal life (school, hobby, Figure 1 | Illustration of the concept of overlapping communities. a, The family) and so on, as shown in Fig. 1a. Furthermore, members of our black•dotBuildin the upmiddle k-cliques.represents either k-cliquesof the authors with overlappingof this paper, with nodes are in overlapping communities have their own communities, resulting in an extremely several ofcommunitieshis communities around. Zooming in on the scientific community complicated web of the communities themselves. This has long been demonstrates the nested and overlapping structure of the communities, and understood by sociologists19 but has never been studied system- depicting the cascades of communities starting from some members atically for large networks. Another, biological, example is that a exemplifies• Whatthe interwoven size k tostructure choose?of the (Typicallynetwork of trycommu 2, 3,nities. 4, 5) large fraction of proteins belong to several protein complexes b, Divisive and agglomerative methods grossly fail to identify the simultaneously20. communities when overlaps are significant. c, An example of overlapping k-clique• Notcomm allunities nodesat k assigned4. The yellow tocommu communitiesnity overlaps the blue one In general, each node i of a network can be characterized by a in a single node, whereas it¼shares two nodes and a link with the green one. membership number m i, which is the number of communities that These overlapping regions are emphasized in red. Notice that any k-clique the node belongs to. In turn, any two communities a and b can share (complete subgraph of size k) can be reached only from the k-cliques of the ov sa;b nodes, which we define as the overlap size between these same community through a series of adjacent k-cliques. Two k-cliques are communities. Naturally, the communities also constitute a network, adjacent if they share k 2 1 nodes.

1Biological Physics Research Group of the Hungarian Academy of Sciences, Pa´zma´ny P. stny. 1A, H-1117 Budapest, Hungary. 2Department of Biological Physics, Eo¨tvo¨s University, Pa´zma´ny P. stny. 1A, H-1117 Budapest, Hungary. 814 ©!!""#!Nature Publishing Group! Aggregating/collapsing: treat k-cliques as nodes and recurse

H.-W. Shen, X. Cheng, K. Cai, and M.-B. Hu, Physica A 388, 1706 (2009).

• Find maximal k-clique, collapse into one node

• Repeat until all nodes now assigned to a clique Noise in community detection Haoran Wen, E. A. Leicht, and R. M. D’Souza, “Improving community detection in networks HAORAN WEN, E. A. LEICHT, AND RAISSA M. D’SOUZA PHYSICAL REVIEW E 00,006100(2011) by targeted node removal” Physical Review E 83, 016114, 2011.

networks but has now been extended to weighted and directed (a) (b) networks [24–26]. Recently several researchers have developed methods for detecting community structure when vertices may belong to multiple communities and communities may thus overlap [11–16]. However, in a network where some vertices belong to one-quarter or one-half of all communities detected (as we show later is the case for many networks) the question arises as to whether these vertices truly participate in all of the communities. Regardless, such vertices can act as violators obscuring underlying community structures. • High degree nodesFIG. 1. Community which participate structures of a with sample out network community (a) before and preference can (b) after violator removal, showing the effect of two violators out of obscure the behavior of ordered communities 40 nodes. Vertices are initially assigned to one of four communities at C. Empirical OSS Networks random (identified via node shape and shading). The two vertices with • In metabolic networks: Water, ATP,hydrogen (“currency” versus “commodity” metabolites), the largest degree are violators (hexagons) and do not preferentially Both software developer communication networks and M. Huss and P.connect Holme, to anyIET single Sys. Biol. community. 1, 280 The (2007). remaining edges fall within software dependency networks have attracted much research communities with 85% probability and between them with 15%. attention [27–29]. In this work we examine the developer • In email networksBoxes the illustrate boss the (and communities others) communicate detected using broadly the algorithm (for discussion in communication in Open networks from four OSS projects: Apache Source Software,Ref. [ see,4]. C. Bird, et al. ICSE (ACM Press, 2008) HTTP server, Python, PostgresSQL, and Perl. Each network is constructed by combining monthly mailing list archives of the • In software calland graphs, identity low-level of the violators. functions The (e.g. methods printf, malloc, we develop etc). herein respective OSS project into a cumulative view spanning several allow detection of vertices acting as violators without needing years [30–32]. In these networks, a is a developer, and aprioriknowledge or requiring that violators necessarily be weighted directed edges represent emails between developers. the largest degree vertices. The size of these projects range between 1000 and 2500 developers. Developers migrate in out of projects over time [32], making the email social networks noisier than the B. Community detection algorithms Apache callgraph network described next. As mentioned, There are numerous methods available to detect commu- project leaders can participate broadly across the network and nity structure in networks, with comprehensive reviews in introduce noise in community structures. Refs. [21–23]. Due to its simplicity and easy extension to We also study the code base for the Apache 2.0 HTTP directed and weighted networks, we choose as our base server, an OSS project written in the C programming language algorithm the spectral partitioning algorithm of Newman, (a procedural, rather than objected-oriented, software system). which calculates the leading eigenvector of a modularity Our data are composed of monthly snapshots of the functions matrix to identify a network’s modular structure [4]. However, and function calls over a four-year period. (For details on the our methods should easily generalize to alternate community- extraction procedure see Ref. [33].) The callgraph network finding algorithms. is built by considering each function as a vertex and each The chosen spectral partitioning algorithm determines the function call a directed edge. The Apache 2.0’s callgraph is number of communities via an optimization process. It divides extremely stable [34], with data from each snapshot yielding anetworkintocommunitiesbymaximizingaqualityof similar results; thus we report on a representative snapshot modularity function Q.HereQ measures the difference (10/1/2001). between the actual and expected number of edges falling within acommunityandisnormalizedtorangefromzerotoone. III. REMOVING NOISY VERTICES Specifically, We hypothesize that, for the software callgraph, low-level 1 ki kj functions that are called commonly by other functions across Q Aij δ(Gi ,Gj ),AE (1) = 2m − 2m the network disrupt community detection. This is analogous to i,j " # ! the empirical observations that project leaders in OSS projects, where Aij is the ij th element of the adjacency matrix, ki and who have high degree, and currency metabolites, such as H2O kj are the degrees of vertices i and j,respectively,δ(r,s)isthe and CO2,interferewithcommunitydetection[19,20]. The in- Kronecker¨ delta, Gi and Gj are the groups vertex i and j are degree of a function is the number of other functions it is called assigned to, and m is the number of edges in the network [4]. by; thus low-level functions have high in-degree. Note, for The goal of the algorithm is to assign vertices to groups Gi and Apache, that the largest in-degrees are an order of magnitude Gj so that Q in Eq. (1)ismaximized,withthismaximalQ larger than the largest out-degrees. (Reference [34] contains a representing how well a network breaks apart into subgroups. discussion of constraints giving rise to this disparity.) Q is not a property of a network; instead it is a measure of the In this section we show that degree-targeted node removal quality of a specific grouping of a network [10]. Finding the is not always satisfactory and that better results are obtained global optimal assignment is NP hard, so one turns to methods by targeting removal of nodes that cause the largest increase of approximation, some of which are stochastic. As a result, in Q. We also introduce the statistical technique of change one may not be able to ascertain a unique maximal value of point detection as a criteria to determine when to stop node Q for a given network. Q was originally defined for simple removal.

006100-2 So networks can break into communities but what does it all mean?

• Different algorithms can give different results: Trust the communities that result consistently across algorithms

• In social systems, communities seem to be topically related.

• In biological systems, some evidence that communities relate to function, some evidence they do not relate.

• In mechanical systems proper function seems to span communities. (See Dan Whitney slides)

• Communities help us do good visualization layout (Porter et al AMS Notices 2009; and visualization lecture upcoming by Muelder).