“Assortative Mixing, Graph Partitioning, Community Structure”

11 FIG. 6: Six modules corresponding to the uppermost branched linear chain of modules depicted in Figure 4. Colors denote modules as defined by the network information bottleneck algorithm. Again the modules roughly correspond to institutional affiliations. Over 50% of the blue nodes have one or more affiliations with the institutions based in and around Chicago (Argonne National Laboratory, University of Illinois at Chicago and University of Notre Dame). 70% of the red nodes are in England, and 75% of the green nodes are in China, mostly at the Institute of Chemistry Chinese Academy of Sciences, and all of the cyan nodes are at the Center of Complex Systems Research in Illinois. Both the yellow and magenta modules are mostly affiliated with the University of Nebraska. ECS 289 / MAE 298 Feb 9, 2011 “Assortative mixing, graph partitioning, FIG. 7: E. coli gene regulatory network. Lcommunityargest componen structure”t of the symmetric version of the E. coli genetic regulatory network. Colors denoted modules identified by NIB. Diffusive distributions are but one general class of distributions on a network. A natural generalization of these ideas is to describe other distributions on a network for which a particular function, energy, or origin is known, and on which some particular degree of freedom (such as chemical concentration or genetic expression as a function of time) may be defined. Finally, we note that while the information bottleneck is a prescription for finding the highest-fidelity summary of a system at a given simplicity, algorithms for determining network community structure are usually motivated by various definitions of normalized min-cuts [27, 28, 29, 30]. Our results, particularly for the synthetic graphs with prescribed modular structure, demonstrate that information modularity implies edge modularity, an unexpected finding which motivates further numerical and analytic investigations in progress regarding this relationship. Acknowledgments It is a pleasure to acknowledge Mark Newman, Noam Slonim, Christina Leslie, Risi Kondor, Ilya Nemenman, and Susanne Still for many useful conversations on the information bottleneck and modularity. This work was supported Many networks naturally break up into groups (Other common features: broad-scale, small worlds, clustering) • Modular networks / Community structures • Max-cut min-flow • Hierarchical networks • More generally, finding clusters of similar objects What can we learn from partitioning a network? (Still an open question: tons of work on detecting communities, not so much work in interpreting what they mean.) • Bottlenecks • Social groups / collaboration networks • US Congress: – Partisan / bi-partison voting patterns (both by party and region) – Central players • Sometimes functional subgroups (some evidence in Gene Regulatory networks) Topics covered today • Modularity metric (many edges within a module, few between modules) • Kernighan-Lin algorithm • Spectral techniques: Graph Laplacian, Fiedler vector • Girvan-Neman (centrality based) and Newman’s additional work (spectral) (works well for directed networks (weighted edges is easy – multigraph)) • Approximation techniques for large networks (finding the optimal is NP-complete) • Dendrograms / hierachy • Overlapping communities • Noise from high degree nodes (includes inherent modularity) • Maximum likelihood (nodes with similar patterns of connectivity) • Applications: when do “communities” actually tell us anything useful? (Esp since methods are topology based not about functional connections (e.g. V8 engine)) Here: missing edges, intelligent guesses at missing node attributes ... Kernighan-Lin algorithm, 1970 • How well can you partition a set of connected nodes into two groups with minimal group-spanning edges? (Motivation: Circuit board design – minimize wires crossing between boards.) • Judge this using a quality metric about how well partitions into groups: minimum number of group-spanning edges. • Divide the vertices into two groups (at random or, better yet, with good guess) • Examine each pair (i; j) where i in Group1 and j in Group2. Swap the (i; j) pair that reduces the spanning edges (if there is one). And repeat excluding this pair. • Avoiding local minimum – 1) “simulated annealing” (accept bad moves with some probability, and decrease the probability over time); 2) If no pair reduces number of spanning edges, swap pair that increases it the least. • Dependence on initial condition / grouping. • Also number of groups and their sizes preserved. Spectral partitioning – graph bi-section with Fiedler vector • Recall random walk state transition matrix (we started from Adjacency matrix A and row or column normalized it). 0 1 1=4 1=3 1=2 1=4 0 B 1=4 1=3 0 1=4 0 C B C B C M = B 1=4 0 1=2 0 0 C B C @ 1=4 1=3 0 1=4 1=2 A 0 0 0 1=4 1=2 • Consider instead the Graph Laplacian, L where: No self-loops allowed Lii = di (the degree of node i) 0 1 Lij = −Aij for i 6= j. 3 −1 −1 −1 0 B −1 2 0 −1 0 C (Recall −Aij = −1 if edge B C B C connects i and j, and 0 otherwise.) L = B −1 0 1 0 0 C B C @ −1 −1 0 3 −1 A 0 0 0 −1 1 Graph bisection • M, rwalk state transition matrix (Perron-Frobenious); largest eigenvalue λ = 1 (steady-state). The number of eigenvalues with λ = 1 equals number of disconnected components • L, graph Laplacian. – Related to diffusion on a network (see Newman book, sec 6.13.1) – All eigenvalues λi real and λi ≥ 0. – There will always be one λ = 0, call it λ1 (Note ~v1 = f1; 1; 1; · · · g) – The number of λi’s equal to zero are the number of disconnected components. – If only one λ = 0, call the next largest eigenvector λ2, the Fiedler vector. – The eigenvector ~v2 will have elements equal −1 or +1 only. – Elements with +1 assigned to group 1, those with -1 to group 2 (left picture:) Timescale, τ1 = 5314 to Timescale, τ2 = 1092 to Timescale, τ3 = 157 to ● ●● ● ●● ● ●● ● o o ● o o ● ● ● ● ● ● o● ● ● o● ● ● ● ● ● ● o o● o● o o o● o● ● o o● o● o o o● o● +++●+ +●+ + o o● o● o o● o● ● +● + ● ● ● ● ● ● ● o● o● o ● o o● o● o ● +● ●+●● +● ● ●● o ● ● ●● o 20 ++ +● ● 20 o o ● ● 20 o o o ● ● + ●●++ +● + o●● o o o● o o●● o o ● ●● +● +●+●+ ●o● o ● o● o● ●o● o ● ●+●+ o● +++ ● + + ● ● o ●o o● o● o ●o + + ● +● ++● + o● o o● o● o ● Y + +● ●●●● ● Y o● o●●o●● o● Y +● ●●●● ● ● +++●+ ● ooo● ● +++●+ ● o ● +++ ● o● ● o ● +++ o ●o● + ● ● +●● ● o● ●● + ● ● 10 o ++ 10 + + o 10 + + ++ ● ● ●+● ●+● o o● ● ● ●o + +● + +● 5 o● o ● 5 ● + ● 5 ● + ● ● o● ● ● o +● + ● ● ● o +● + ● ● ● o ●o o● o o●● ● ● ● o●● ● ● ● o●● ● o ● ● oo● o +● ++ ● oo● o +● ++ ● oo● o ● ●o o ●o●o● o● ●+●+ o ●o●o● o● ●+●+ ●o●o● o● 0 o o o 0 + o 0 + + o 0 5 10 20 0 5 10 20 0 5 10 20 X X X Aside: Some other important spectral properties • Eigenvalues of Adjacency matrix, Laplacian, rwalk matrix are invariant under node relabeling. • If two graphs are isomorphisms, they will have exactly the same eigenvalue spectrum. • This is a necessary condition for isomorphism, but not sufficient – just because two graphs have same eigenvalues does not mean they are isomorphic – but it is strong evidence for isomorphisms (sometimes just first few eigenvalues tested is considered evidence) • For more on graph isomorphisms, Fiedler vectors, graph embeddings, see http://cs-www.cs.yale.edu/homes/spielman/ From graph bisection to communities in networks • M. Girvan and M. E. J. Newman, Community structure in social and biological networks, Proceedings of the National Academy of Sciences, 99 (2002), 78217826. • The paper the launched “community structure” • Identify the edge with highest betweeness centrality, this edge partitions the graph into two pieces. • Rerun recursively on each sub-network. • Note, this is continued bi-section. • Very costly (need to calculate all shortest-paths between all pairs of nodes – recall definition of betweeness centrality). Relating this back to Kernighan-Lin and Fiedler • Newman, Girvan, Finding and evaluating community structure in networks, Physical Review E, 69, 026113 (2004). • Newman, Finding community structure in networks using the eigenvectors of matrices, Physical Review E, 74, 036104 (2006). • Note the next slides come from a presentation that Mark Newman gave at the Institute for Pure and Applied Math, Workshop on “Random and Dynamic Graphs and Networks”, May 2007. There are many other interesting presentations there: http://www.ipam.ucla.edu/programs/rsws3/ Spectral partitioning is related to min-cut/max-flow • Choose any pair of vertices (i; j) the minimum set of edges whose deletion places the two vertices in disconnected components of the graph, carries the maximum flow between the two vertices. • Finding the optimal partitioning over all possible pairs of nodes is a difficult optimization problem that is in NP. • Maximizing modularity also is NP: U. Brandes, D. Delling, M. Gaertler, R. Goerke, M. Hoefer, Z. Nikoloski, and D. Wagner, On modularity clustering, IEEE Transactions on Knowledge and Data Engineering, 20 (2008), 172188. • So rather than finding THE optimal partitioning, we turn to approximation algorithms to find approximately optimal partitions. The Louvain method • V.D. Blondel, J.-L. Guillaume, R. Lambiotte and E. Lefebvre (2008). ”Fast unfolding of community hierarchies in large networks”. J. Stat. Mech.: P10008 • http://sites.google.com/site/findcommunities/ • The method consists of two phases. First, it looks for ”small” communities by optimizing modularity in a local way. Second, it aggregates nodes of the same community and builds a new network whose nodes are the communities. These steps are repeated iteratively until a maximum of modularity is attained. • This method is implemented NetworkX and others. Dendrograms Can be built by aggregating (Louvain method) or recursing from the full graph. ! Vol 435|9 June 2005|doi:10.1038/nature03607 LETTERS Uncovering the overlapping community structure of complex networks in nature and society Gergely Palla1,2, Imre Dereńyi2, Ille´s Farkas1 & Tama´s Vicsek1,2 Many complex systems in nature and society can be described in withOverlappingthe overlaps being their links.

Load more