Community structure in complex networks

Santo Fortunato

Me in a nutshell…

Computational Social Science

Science of networks, computational social science, science of science

[Figure: log-log plot of the distribution of c/c0, one curve per field: Agricultural Economics & Policy; Allergy; Anesthesiology; Astronomy & Astrophysics; Biology; Computer Science, Cybernetics; Developmental Biology; Engineering, Aerospace; Hematology; Mathematics; Microbiology; Neuroimaging; Physics, Nuclear; Tropical Medicine]

Science of science

• Analysis and modeling
• Community structure

[Figure: example networks labeled Social, Information: WWW, Information: Citation]

Community structure

Communities: sets of tightly connected nodes
• People with common interests
• Scholars working in the same field
• Proteins with equal/similar functions
• Papers on the same/related topics
• …

Community detection

What for?
• Organization
• Node classification
• Missing links
• Effect on dynamics
• …

Difficult problem!

[Figure: adjacency matrix of a 15-vertex graph, shown under two different vertex orderings]

Difficult problem!

Ill-defined problem:
• What is a community/partition?
• What is a good community/partition?

Three basic questions

1) How to detect communities?
2) How to test community detection algorithms?
3) How to make partitions robust?

Acknowledgements

Alex Arenas, Marc Barthélemy, Ginestra Bianconi, Richard Darst, Alberto Fernández, Sergio Gómez, Clara Granell, Darko Hric, Jacopo Iacovacci, Janos Kertész, Mikko Kivelä, Andrea Lancichinetti, Vito Latora, Massimo Marchiori, Filippo Radicchi, José J. Ramasco, Jari Saramäki

How to detect communities?

Global optimization

Principle:
• Function Q(P) that assigns a score to each partition
• Best partition of the network -> partition corresponding to the maximum/minimum of Q(P)

Problem: Answer depends on the whole graph -> it changes if one considers portions of it or if it is incomplete

Modularity optimization

$$Q = \frac{1}{m} \sum_{c=1}^{n_c} \left( l_c - \frac{d_c^2}{4m} \right)$$

Here m is the number of links of the network, n_c the number of clusters, l_c the number of links inside cluster c, and d_c the sum of the degrees of the vertices of cluster c.

M. E. J. Newman, M. Girvan, Phys. Rev. E 69, 026113 (2004)

M. E. J. Newman, Phys. Rev. E 69, 066133 (2004)
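As a minimal sketch of how the modularity formula can be evaluated, the Python function below computes Q for an undirected, unweighted graph given as a list of links and a cluster label per vertex. The function name, the input format and the toy graph (two triangles joined by one link) are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Q = (1/m) * sum_c [ l_c - d_c^2 / (4m) ] for an undirected, unweighted graph.

    edges: list of (u, v) pairs, each link listed once.
    membership: membership[v] is the cluster label of vertex v.
    """
    m = len(edges)
    internal = defaultdict(int)  # l_c: links with both endpoints in cluster c
    degree = defaultdict(int)    # d_c: sum of the degrees of the vertices of c
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] - degree[c] ** 2 / (4 * m) for c in degree) / m


# Hypothetical toy graph: two triangles joined by a single link.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = [0, 0, 0, 1, 1, 1]
one_cluster = [0, 0, 0, 0, 0, 0]
q_two = modularity(toy_edges, two_clusters)
q_one = modularity(toy_edges, one_cluster)
```

Putting all vertices in a single cluster gives Q = 0, while splitting the toy graph into its two triangles gives a positive Q, which is why the split is preferred by the score.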

Goal: find the maximum of Q over all possible network partitions

Problem: NP-complete (Brandes et al., 2007)!

Resolution limit

$$Q = \frac{1}{m} \sum_{c=1}^{n_c} \left[ l_c - \frac{1}{4} \left( \frac{d_c}{\sqrt{m}} \right)^2 \right]$$

Modularity's scale: $\sqrt{m}$

Result: clusters smaller than this scale cannot be resolved!

Consequences

S. F. & M. Barthélemy, PNAS 104, 36-41 (2007)
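The effect can be checked numerically on a ring of cliques, a construction often used to illustrate the resolution limit. The sketch below is an illustration under stated assumptions: the helper names, the clique size of 5 and the ring lengths of 10 and 30 are chosen here for the example. It compares Q for the partition with one cluster per clique against the partition that merges adjacent cliques in pairs.

```python
from collections import defaultdict
from itertools import combinations


def modularity(edges, membership):
    """Q = (1/m) * sum_c [ l_c - d_c^2 / (4m) ], as in the formula above."""
    m = len(edges)
    internal, degree = defaultdict(int), defaultdict(int)
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] - degree[c] ** 2 / (4 * m) for c in degree) / m


def ring_of_cliques(n_cliques, clique_size):
    """Complete graphs arranged on a ring; consecutive cliques share one link."""
    edges = []
    for c in range(n_cliques):
        first = c * clique_size
        edges.extend(combinations(range(first, first + clique_size), 2))
        next_first = ((c + 1) % n_cliques) * clique_size
        edges.append((first + clique_size - 1, next_first))
    return edges


def compare(n_cliques, clique_size):
    """Return (Q of one cluster per clique, Q of adjacent cliques merged in pairs)."""
    edges = ring_of_cliques(n_cliques, clique_size)
    n = n_cliques * clique_size
    single = [v // clique_size for v in range(n)]
    pairs = [(v // clique_size) // 2 for v in range(n)]
    return modularity(edges, single), modularity(edges, pairs)


q_single_small, q_pairs_small = compare(10, 5)  # short ring: cliques are resolved
q_single_large, q_pairs_large = compare(30, 5)  # long ring: merged pairs score higher
```

With the same cliques, only the total number of links m changes between the two rings; once the ring is long enough, merging pairs of cliques yields the larger Q even though each clique is as cohesive as before.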

Local optimization

Principle:
• Communities are local structures
• Local exploration of the network, involving the subgraph and its neighborhood

Advantages:
• Absence of global scales -> no resolution limit
• One can analyze only parts of the network

Implementation:
• Function Q(C) that assigns a score to each subgraph
• Best cluster -> cluster corresponding to the maximum/minimum of Q(C) over the set of subgraphs including a seed node

Example: Local Fitness Method (LFM)

Fitness of cluster C:

$$f_C = \frac{k_{in}^C}{\left( k_{in}^C + k_{out}^C \right)^{\alpha}} = \frac{2 l_C}{d_C^{\alpha}}$$
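The fitness above can be turned into a simple seed-based search. The sketch below is a simplified illustration, not the full LFM procedure: it only adds the neighboring vertex that most increases the fitness and stops when no addition helps, whereas the published method involves further steps. The adjacency-dictionary format, the function names and the toy graph are assumptions made for this example.

```python
def fitness(adj, cluster, alpha=1.0):
    """f_C = k_in / (k_in + k_out)^alpha, with k_in = 2 * (links inside C)."""
    k_in = k_out = 0
    for u in cluster:
        for v in adj[u]:
            if v in cluster:
                k_in += 1   # every internal link is seen from both endpoints
            else:
                k_out += 1
    total = k_in + k_out
    return k_in / total ** alpha if total else 0.0


def grow_cluster(adj, seed, alpha=1.0):
    """Greedy growth from a seed: add the neighbor giving the largest fitness gain."""
    cluster = {seed}
    while True:
        frontier = {v for u in cluster for v in adj[u]} - cluster
        best, best_f = None, fitness(adj, cluster, alpha)
        for v in sorted(frontier):
            f = fitness(adj, cluster | {v}, alpha)
            if f > best_f:
                best, best_f = v, f
        if best is None:
            return cluster
        cluster.add(best)


# Hypothetical toy graph: two triangles (0,1,2) and (3,4,5) joined by the link 2-3.
toy_adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
found = grow_cluster(toy_adj, seed=0)
found_fitness = fitness(toy_adj, found)
```

Only the seed's neighborhood is explored, so the search never needs the whole graph; the parameter alpha plays the role of the exponent in the fitness formula.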

A. Lancichinetti, S. F., J. Kertész, New J. Phys. 11, 033015 (2009)

Local fitness method

Local optimization: OSLOM

Basics:
• LFM with fitness expressing the statistical significance of a cluster with respect to random fluctuations
• Statistical significance evaluated with Order Statistics

First multifunctional method:
• Link direction
• Link weight
• Overlapping clusters
•

A. Lancichinetti, F. Radicchi, J. J. Ramasco, S. F., PLoS One 6, e18961 (2011)

http://www.oslom.org/

How to test community detection algorithms?

Question: how to test clustering algorithms?
Answer: checking whether they are able to recover the known community structure of benchmark graphs

Planted l-partition model (Condon & Karp, 1999)

Ingredients:
1) p = probability that vertices of the same cluster are joined
2) q = probability that vertices of different clusters are joined

[Figure: block matrix with p in the diagonal blocks and q in the off-diagonal blocks]
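A generator for this kind of benchmark fits in a few lines. The sketch below assumes equal-sized groups and uses hypothetical parameter names (n_groups, group_size, seed); it joins each pair of vertices with probability p if they share a group and with probability q otherwise.

```python
import random
from itertools import combinations


def planted_partition(n_groups, group_size, p, q, seed=None):
    """Random graph with planted groups.

    Pairs in the same group are linked with probability p,
    pairs in different groups with probability q.
    Returns (edges, membership).
    """
    rng = random.Random(seed)
    n = n_groups * group_size
    membership = [v // group_size for v in range(n)]
    edges = [
        (u, v)
        for u, v in combinations(range(n), 2)
        if rng.random() < (p if membership[u] == membership[v] else q)
    ]
    return edges, membership


# Example with illustrative values: 4 groups of 32 vertices.
edges, membership = planted_partition(4, 32, p=0.5, q=0.05, seed=1)
internal = sum(membership[u] == membership[v] for u, v in edges)
external = len(edges) - internal
```

The planted membership list is the known answer against which the output of a clustering algorithm can be compared.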

Principle: if p > q the groups are communities

The LFR benchmark

Realistic feature: power law distributions of degree and community size

A. Lancichinetti, S. F., F. Radicchi, Phys. Rev. E 78, 046110 (2008)

https://sites.google.com/site/andrealancichinetti/files/
https://github.com/networkx/networkx/blob/master/networkx/algorithms/community/community_generators.py

The LFR benchmark

A comparative analysis

Author                 Label            Order
Girvan & Newman        GN               O(nm^2)
Clauset et al.         Clauset et al.   O(n log^2 n)
Blondel et al.         Blondel et al.   O(m)
Guimerà et al.         Sim. Ann.        parameter dependent
Radicchi et al.        Radicchi et al.  O(m^4/n^2)
Palla et al.           Cfinder          O(exp(n))
Van Dongen             MCL              O(nk^2), k < n parameter
Rosvall & Bergstrom    Infomod          parameter dependent
Rosvall & Bergstrom    Infomap          O(m)
Donetti & Muñoz        DM               O(n^3)
Newman & Leicht        EM               parameter dependent
Ronhovde & Nussinov    RN               O(n^β), β ~ 1

A. Lancichinetti, S. F., Phys. Rev. E 80, 056117 (2009)

[Figure: normalized mutual information versus mixing parameter µt on LFR benchmark graphs, one panel per method (Clauset et al., Blondel et al., MCL, Cfinder, Radicchi et al., Infomod, Infomap, Sim. ann., GN, DM, EM, RN); curves for N=1000, S; N=1000, B; N=5000, S; N=5000, B]
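The comparison is scored with the normalized mutual information between the planted partition and the one found by each algorithm. The sketch below shows one common normalization, 2 I(A,B) / (H(A) + H(B)), for partitions without overlaps; the function name and the list-of-labels input format are assumptions made for this example.

```python
from collections import Counter
from math import log


def nmi(part_a, part_b):
    """Normalized mutual information between two partitions of the same vertices.

    part_a[v] and part_b[v] are the cluster labels of vertex v.
    Returns 2 * I(A, B) / (H(A) + H(B)).
    """
    n = len(part_a)
    count_a, count_b = Counter(part_a), Counter(part_b)
    joint = Counter(zip(part_a, part_b))
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    mutual = sum(
        c / n * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    if h_a + h_b == 0:
        return 1.0  # both partitions consist of a single cluster
    return 2 * mutual / (h_a + h_b)


same_up_to_labels = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

The score depends only on which vertices are grouped together, not on the label names, so a perfect recovery scores 1 regardless of how the clusters are numbered.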

… and the winner is: Infomap

Consensus clustering

Problem: Stochastic (non-deterministic) methods yield many result partitions: which one shall one choose?

Solution: Searching for the partition which is most similar, on average, to the input partitions (median or consensus partition)

Difficult combinatorial optimization task: greedy solution (consensus matrix)

A. Lancichinetti, S. F., Sci. Rep. 2, 336 (2012).

Consensus matrix

Definition
§ Matrix D whose entry Dij is the fraction of input partitions in which vertices i and j were in the same cluster

Starting point: network G with n vertices, clustering method A.

1) Apply A on G nP times -> nP partitions
2) Compute the consensus matrix D: Dij is the number of partitions in which vertices i and j of G are assigned to the same cluster, divided by nP
3) All entries of D below a chosen threshold t are set to zero
4) Apply A on D nP times -> nP partitions
5) If the partitions are all equal, stop (the consensus matrix would be block-diagonal). Otherwise go back to 2.

A simple example
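The consensus procedure described before this point can be sketched in Python as follows. Everything specific here is an assumption for illustration: a small weighted label propagation routine stands in for the generic stochastic method A, and the names, default values (n_p, threshold, max_rounds) and toy graph are hypothetical.

```python
import random
from itertools import combinations


def label_propagation(weights, n, rng):
    """Stand-in for method A: asynchronous weighted label propagation."""
    adj = {v: {} for v in range(n)}
    for (u, v), w in weights.items():
        adj[u][v] = w
        adj[v][u] = w
    labels = list(range(n))
    for _ in range(100):  # safety cap on the number of sweeps
        changed = False
        order = list(range(n))
        rng.shuffle(order)
        for v in order:
            score = {}
            for u, w in adj[v].items():
                score[labels[u]] = score.get(labels[u], 0.0) + w
            if not score:
                continue
            top = max(score.values())
            if score.get(labels[v], 0.0) < top:  # keep the current label on ties
                labels[v] = rng.choice([l for l, s in score.items() if s == top])
                changed = True
        if not changed:
            break
    return labels


def canonical(labels):
    """Relabel clusters by first appearance so equal partitions compare equal."""
    seen = {}
    return tuple(seen.setdefault(l, len(seen)) for l in labels)


def consensus_matrix(partitions, n, threshold):
    """Sparse D: fraction of partitions joining i and j, dropped if below threshold."""
    d = {}
    for i, j in combinations(range(n), 2):
        together = sum(p[i] == p[j] for p in partitions) / len(partitions)
        if together > 0 and together >= threshold:
            d[(i, j)] = together
    return d


def consensus_clustering(edges, n, n_p=20, threshold=0.5, seed=0, max_rounds=50):
    rng = random.Random(seed)
    graph = {tuple(sorted(e)): 1.0 for e in edges}
    parts = [canonical(label_propagation(graph, n, rng)) for _ in range(n_p)]  # step 1
    for _ in range(max_rounds):
        d = consensus_matrix(parts, n, threshold)                              # steps 2-3
        parts = [canonical(label_propagation(d, n, rng)) for _ in range(n_p)]  # step 4
        if len(set(parts)) == 1:                                               # step 5
            break
    return list(parts[0])


# Hypothetical toy graph: two 5-cliques joined by the link 4-5.
toy = list(combinations(range(5), 2)) + list(combinations(range(5, 10), 2)) + [(4, 5)]
result = consensus_clustering(toy, n=10)
```

The loop is capped by max_rounds so that the sketch always terminates; if the partitions have not become identical by then, the first one is returned.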

[Figure: original graph and its consensus matrix, panels (I) to (IV); links are marked by weight: Dij = 1, Dij = 3/4, Dij = 2/4, Dij = 1/4]

Results

LFR benchmarks

[Figure: normalized mutual information versus mixing parameter µ on LFR benchmark graphs with (a) N=1000 and (b) N=5000; panels for Louvain, LPM, Clauset et al. and SA, each comparing the original method with its consensus version]

Consensus in dynamic networks
§ Succession of snapshots, corresponding to overlapping time windows of size Δt: [t0, t0+Δt], [t0+1, t0+1+Δt], …, [tm-Δt, tm]
§ Dij = number of times vertices i and j are clustered together, divided by the number of partitions corresponding to snapshots including both vertices

Tracking dynamic clusters: Ct -> Ct+1?

Strategy: computing the similarity of Ct with all clusters of the partition at time t+1, and picking the cluster with the highest value. Same procedure to find the “father” of cluster Ct+1

Criterion:
§ A and B are each other’s best match: A “survives” to time t+1
§ A and B are not each other’s best match: A “dies” at time t and B is considered as a new cluster.

Tracking dynamic clusters: the APS citation network
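The best-match criterion for following clusters from time t to time t+1 can be sketched as below. The Jaccard index is used here as the similarity measure, which is an assumption of this example, as are the function names and the toy clusters; ties are broken by taking the first best candidate.

```python
def jaccard(a, b):
    """Jaccard index of two non-empty vertex sets."""
    return len(a & b) / len(a | b)


def best_match(cluster, candidates):
    """Index of the candidate most similar to `cluster` (first one on ties)."""
    return max(range(len(candidates)), key=lambda k: jaccard(cluster, candidates[k]))


def track(clusters_t, clusters_t1):
    """Map each surviving cluster at time t to its continuation at time t+1.

    A cluster survives only if it and its best match at t+1 choose each other.
    Clusters at t missing from the result "die"; clusters at t+1 that are not
    a value of the result are treated as new.
    """
    survives = {}
    for i, a in enumerate(clusters_t):
        j = best_match(a, clusters_t1)
        if jaccard(a, clusters_t1[j]) == 0:
            continue  # no overlap with any cluster at t+1
        if best_match(clusters_t1[j], clusters_t) == i:
            survives[i] = j
    return survives


# Hypothetical example: one cluster grows, one shrinks, one disappears.
at_t = [{1, 2, 3}, {4, 5, 6}, {7, 8}]
at_t1 = [{1, 2, 3, 4}, {5, 6}, {9, 10}]
mapping = track(at_t, at_t1)
new_clusters = [j for j in range(len(at_t1)) if j not in mapping.values()]
```

In this example the first two clusters survive, the third one dies, and the last cluster at t+1 is flagged as new.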

[Figure: word clouds of the clusters tracked in the APS citation network for the periods 1970 - 1975, 1980 - 1985, 1990 - 1995, 2000 - 2005 and 2007 - 2008; terms shown include Granular, Packing, Percolation, Criticality, Sandpiles, Fractal, Synchronization, Oscillators, Community, Modularity, Epidemics, Small World, Scale Free, Satisfiability]

Summary

1) What is a community? No unique answer! Definition is system- and problem-dependent
2) Magic method? No such thing! Domain dependent methods?
3) Global optimization methods have important limits: local optimization looks more natural and promising
4) Consensus clustering useful technique to find robust partitions
5) Attention on validation

S. F., Phys. Rep. 486, 75-174 (2010)
S. F., D. Hric, Phys. Rep. 659, 1-44 (2016)