Data-driven Pattern Discovery using Network Science

Frank Takes

LIACS, Leiden University

D4N Meeting — January 27, 2017

Network Science — D4N Meeting — January 27, 2017 1 / 50

Data

Data: facts, measurements or text collected for reference or analysis (Oxford dictionary)
Unstructured data: data that does not fit a certain data structure (text, some numeric measurements)
Structured data: data that fits a certain data structure (table, tree, network, etc.)

Data → Network Science

Data
Data Analysis
Data Mining
Pattern Discovery
Data Science
Big Data
Network science: analyzing “big” structured data consisting of objects connected via certain relationships, in short: networks
Interest from: mathematics, computer science, physics, biology, economics, social sciences, ...

Networks

Network/graph: objects and relationships, $G = (V, E)$
Objects/entities/nodes/vertices: $|V| = n$
Relationships/ties/links/edges: $|E| = m$
Data attributes are annotations on the nodes and the edges
Enrich using labels, weights and multiple node and edge types
Examples: online social networks, scientific citation and collaboration networks, webgraphs, biological networks, communication networks, corporate networks
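The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the edge list, the node labels and the edge weights are hypothetical toy data, and the adjacency-set layout is one of several reasonable representations.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency-set representation from an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


# Hypothetical toy network (not taken from the slides).
edges = [("u", "v"), ("v", "w"), ("v", "x"), ("x", "y")]
graph = build_graph(edges)

n = len(graph)                                        # |V|: number of nodes
m = sum(len(nbrs) for nbrs in graph.values()) // 2    # |E|: each edge is stored twice

# Attributes are annotations on nodes and edges (hypothetical values).
node_labels = {"u": "person", "v": "person", "w": "firm", "x": "firm", "y": "person"}
edge_weights = {frozenset(edge): 1.0 for edge in edges}

print(n, m)
```

Note that a node without any edge would not appear in this edge-list-based construction; it would have to be added to the dictionary explicitly.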

One-mode labeled network

Source: http://web.stanford.edu/class/cs224w

Two-mode weighted network

Source: http://toreopsahl.com

LIACS collaboration network

Network science

Network science: understanding data by investigating interactions and relationships between individual data objects as a network
Networks are the central model of computation
Branch of data science focusing on network data
Method in complexity research
Complex systems approach: the behavior emerging from the network reveals patterns not visible when studying the individuals

Example: PPI network

1706 proteins, 6207 interactions

Visualization of PPI network

Real-world networks

Topological characteristics
1. Density
2. Degree: power law
3. Components: giant component
4. Distance: small world
5. Clustering coefficient
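Two of these characteristics, density and components, can be sketched as follows. This is a minimal illustration, assuming a simple undirected network (no self-loops, no parallel links), for which density is $2m / (n(n-1))$. Applying it to the PPI example (1706 proteins, 6207 interactions) also assumes that network is treated as simple and undirected. The `toy` graph is hypothetical.

```python
from collections import deque


def density(n, m):
    """Fraction of possible links that are present in a simple undirected network."""
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def connected_components(adj):
    """Return the connected components, largest first, using one BFS per component."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component = [start]
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    component.append(w)
                    queue.append(w)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Density of the PPI example, assuming it is a simple undirected network.
ppi_density = density(1706, 6207)

# Hypothetical toy network with three components; the largest plays the role
# of the "giant component".
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5}, 5: {4}, 6: set()}
components = connected_components(toy)
giant = components[0]

print(round(ppi_density, 4), len(components), sorted(giant))
```

Under these assumptions the PPI network is sparse: well under one percent of all possible protein pairs interact.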

Degree

Figures: an undirected network and a directed network, each on nodes u, v, w, x, y, z

Undirected networks: degree deg(v) = 5
Directed networks: indegree indeg(v) = 4, outdegree outdeg(v) = 3
Degree distribution: frequency of each degree value. Follows a power law distribution with a “fat tail”
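As a rough sketch, degree, indegree, outdegree and the degree distribution can be computed with simple counters. The edges of the two figures are not recoverable from the slide text, so the edge lists below are hypothetical: they were chosen only so that the counts for node v match the values stated above.

```python
from collections import Counter

# Hypothetical undirected edge list on nodes u, v, w, x, y, z.
undirected_edges = [
    ("v", "u"), ("v", "w"), ("v", "x"), ("v", "y"), ("v", "z"),
    ("u", "w"), ("y", "z"),
]

# Each undirected edge contributes one to the degree of both endpoints.
deg = Counter()
for a, b in undirected_edges:
    deg[a] += 1
    deg[b] += 1

# Hypothetical directed edge list (source, target) on the same nodes.
directed_edges = [
    ("u", "v"), ("w", "v"), ("y", "v"), ("z", "v"),
    ("v", "x"), ("v", "y"), ("v", "w"),
]
outdeg = Counter(source for source, _ in directed_edges)
indeg = Counter(target for _, target in directed_edges)

# Degree distribution: how many nodes have each degree value.
degree_distribution = Counter(deg.values())

print(deg["v"], indeg["v"], outdeg["v"])
print(sorted(degree_distribution.items()))
```

On a network this small the distribution says little; the power law with a “fat tail” only becomes visible on large real-world networks.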

Outdegree distribution

Indegree distribution

Giant component

Components in PPI network

Distance in PPI network

Topics in Network Science

Graph Representation and Structure
Network Modeling
Link Prediction
Spidering and Sampling
Centrality
Visualization Algorithms
Graph Compression
Community Detection
Diffusion
Contagion, Gossiping and Virality
Privacy, Anonymity and Ethics

Centrality

Given a social network, which person is most important?
What is the most important page on the web?
Which protein is most vital in a biological network?
Who is the most respected author in a scientific citation network?
What is the most crucial router in an internet topology network?

Degree centrality

Undirected graphs: degree centrality, measuring the number of adjacent nodes

$$C_d(v) = \frac{\deg(v)}{n - 1}$$

Directed graphs: indegree centrality and outdegree centrality
Local measure
$O(1)$ time to compute
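The formula for undirected graphs translates almost directly into code. The sketch below assumes an adjacency-set representation in which every node appears as a key; the star-shaped toy graph is hypothetical.

```python
def degree_centrality(adj):
    """Degree centrality C_d(v) = deg(v) / (n - 1) for an undirected graph."""
    n = len(adj)
    if n < 2:
        return {v: 0.0 for v in adj}
    return {v: len(neighbors) / (n - 1) for v, neighbors in adj.items()}


# Hypothetical star network: "c" is adjacent to every other node.
star = {"c": {"a", "b", "d"}, "a": {"c"}, "b": {"c"}, "d": {"c"}}
scores = degree_centrality(star)

print(scores)
```

With stored adjacency sets, each score is a single lookup and division, which is the sense in which the measure is local and takes constant time per node.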

Figure: Character co-occurrence network. Node size based on degree.

Closeness centrality

Closeness centrality: the average distance to each other node in the graph

$$C_c(v) = \frac{1}{n - 1} \sum_{w \in V} d(v, w)$$

where $d(v, w)$ is the length of a shortest path from $v$ to $w$
Global distance-based measure
Connected component(s)...
$O(mn)$ to compute: one BFS in $O(m)$ for each of the $n$ nodes
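A minimal sketch of this computation, one BFS per node, is given below. It follows the definition as stated on the slide (average distance, so smaller values indicate a more central node); many libraries report the reciprocal instead. Returning infinity for a node that cannot reach every other node is an assumption made here to handle disconnected networks, and the path-shaped toy graph is hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths from source to every reachable node (unweighted BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def closeness(adj):
    """Average distance from each node to all other nodes: one BFS per node."""
    n = len(adj)
    scores = {}
    for v in adj:
        dist = bfs_distances(adj, v)
        if len(dist) < n:
            # Assumption: some node is unreachable, so the average is undefined.
            scores[v] = float("inf")
        else:
            scores[v] = sum(dist.values()) / max(n - 1, 1)
    return scores


# Hypothetical path network a - b - c - d.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
scores = closeness(path)

print(scores)
```

The inner nodes b and c get a lower average distance than the endpoints a and d, matching the intuition that they sit closer to the rest of the network.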

Degree vs. closeness centrality

Figure: Node size based on degree, color based on closeness centrality.

Betweenness centrality

Betweenness centrality: measure the number of shortest paths that run through a node

$$C_b(u) = \sum_{\substack{v, w \in V \\ v \neq w,\ u \neq v,\ u \neq w}} \frac{\sigma_u(v, w)}{\sigma(v, w)}$$

$\sigma(v, w)$ is the number of shortest paths from $v$ to $w$
$\sigma_u(v, w)$ is the number of such shortest paths that run through $u$
Divide by largest value to normalize to $[0, 1]$
Global path-based measure
$O(2mn)$ time to compute (two “BFSes” for each node)

U. Brandes, “A faster algorithm for betweenness centrality”, Journal of Mathematical Sociology 25(2): 163–177, 2001
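The two passes per node can be sketched as follows for unweighted, undirected networks, in the style of the algorithm by Brandes cited above: a forward BFS that counts shortest paths, then a backward pass that accumulates dependencies. This is an illustrative sketch, not a reference implementation. The sum runs over ordered pairs (v, w), so each unordered pair is counted twice, and the path-shaped toy graph is hypothetical.

```python
from collections import deque


def betweenness(adj):
    """Unnormalized betweenness over ordered pairs (v, w), Brandes-style."""
    scores = {v: 0.0 for v in adj}
    for s in adj:
        # Pass 1: BFS from s, counting shortest paths (sigma) and predecessors.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Pass 2: walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                scores[w] += delta[w]
    return scores


def normalize(scores):
    """Divide by the largest value so that all scores lie in [0, 1]."""
    top = max(scores.values(), default=0.0)
    return {v: (x / top if top > 0 else 0.0) for v, x in scores.items()}


# Hypothetical path network a - b - c - d.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
raw = betweenness(path)
scaled = normalize(raw)

print(raw, scaled)
```

In the toy path, only the inner nodes b and c lie on shortest paths between other nodes, so the endpoints score zero.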

Degree vs. betweenness centrality

Figure: Node size based on degree, color based on betweenness centrality.

Centrality measures compared

Figure: Degree, closeness and betweenness centrality

Source: “Centrality” by Claudio Rocchini, Wikipedia File:Centrality.svg

Periodic table of centrality

Figure: Periodic Table of Network Centrality, by David Schoch (University of Konstanz). Each cell lists a centrality measure with its citation count and year, for example DC (Degree), BC (Betweenness), CC (Closeness), EC (Eigenvector), KS (Katz Status), PR (PageRank) and SC (Subgraph). Measures are grouped into categories: “traditional”, betweenness-like, closeness-like, path-based, spectral-based, specific network type and miscellaneous.

Community detection

Figure: Communities: node subsets connected more strongly with each other than with the rest of the network

Partitions vs. communities

J. Leskovec, Affiliation Network Models for Densely Overlapping Communities, MMDS 2012.

Modularity

Modularity: numerical value indicating the quality of a division of a network into communities
Community: subset of nodes for which the fraction of links inside the community is higher than expected in a random network
Modularity $Q \in [0, 1]$
Resolution parameter $r$ indicating how “tough” the algorithm should look for communities
Algorithms optimize the modularity score $Q$ given some $r$ (using hill climbing, heuristics, genetic algorithms and many more optimization techniques)

V.D. Blondel, J-L. Guillaume, R. Lambiotte and E. Lefebvre, Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment 10: P10008, 2008.
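The slides do not spell out a formula for $Q$. As a sketch, the code below scores a given division with the widely used Newman-Girvan form, $Q = \sum_c \left[ L_c / m - r \, (d_c / 2m)^2 \right]$, where $L_c$ is the number of links inside community $c$ and $d_c$ the sum of the degrees of its nodes. Both the choice of this formula and the way the resolution parameter `resolution` (the $r$ above) enters are assumptions. The code only evaluates a division; it does not search for the best one. The toy network of two triangles joined by one link is hypothetical.

```python
def modularity(adj, community_of, resolution=1.0):
    """Modularity of a division of an undirected network into communities."""
    m = sum(len(neighbors) for neighbors in adj.values()) / 2
    if m == 0:
        return 0.0
    internal = {}     # L_c: number of links inside each community
    degree_sum = {}   # d_c: sum of node degrees in each community
    for v, neighbors in adj.items():
        c = community_of[v]
        degree_sum[c] = degree_sum.get(c, 0) + len(neighbors)
        for w in neighbors:
            if community_of[w] == c:
                # Every internal link is seen from both endpoints.
                internal[c] = internal.get(c, 0.0) + 0.5
    return sum(
        internal.get(c, 0.0) / m - resolution * (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Hypothetical toy network: two triangles joined by the link 3 - 4.
adj = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
two_groups = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
one_group = {v: "A" for v in adj}

q_two = modularity(adj, two_groups)
q_one = modularity(adj, one_group)

print(round(q_two, 4), round(q_one, 4))
```

Splitting the toy network into its two triangles scores higher than lumping all nodes together, which is the kind of comparison an optimization algorithm repeats many times.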

Communities in PPI network

Example: Corporate networks

Nodes are firms
Links are board interlocks: two firms share a senior level director
1,068,409 firms
3,262,413 interlocks
Aggregation to city level
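One way to derive such a network is to start from director-firm board seats and link every pair of firms that share a director. The sketch below assumes this construction. The board seats and city assignments are hypothetical toy data, and summing interlock counts per pair of cities is only one possible reading of “aggregation to city level”.

```python
from collections import defaultdict
from itertools import combinations


def board_interlocks(board_seats):
    """Map each pair of firms to the number of directors they share."""
    firms_of = defaultdict(set)
    for director, firm in board_seats:
        firms_of[director].add(firm)
    interlocks = defaultdict(int)
    for firms in firms_of.values():
        for a, b in combinations(sorted(firms), 2):
            interlocks[(a, b)] += 1
    return dict(interlocks)


def aggregate_to_city(interlocks, city_of):
    """Sum interlock weights per pair of cities (assumed aggregation rule)."""
    city_links = defaultdict(int)
    for (a, b), weight in interlocks.items():
        pair = tuple(sorted((city_of[a], city_of[b])))
        city_links[pair] += weight
    return dict(city_links)


# Hypothetical board seats: (director, firm).
seats = [
    ("d1", "FirmA"), ("d1", "FirmB"),
    ("d2", "FirmB"), ("d2", "FirmC"),
    ("d3", "FirmA"), ("d3", "FirmB"),
]
# Hypothetical firm locations.
city_of = {"FirmA": "Amsterdam", "FirmB": "London", "FirmC": "London"}

firm_network = board_interlocks(seats)
city_network = aggregate_to_city(firm_network, city_of)

print(firm_network)
print(city_network)
```

Links between two firms in the same city show up as a pair with the same city twice; whether to keep or drop those depends on the question being asked.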

Corporate networks

Figure: 400,000 largest firms globally, plotted based on latitude/longitude.

Figure: Global corporate network: over 1,000,000 board interlocks.

Conclusions

Network science treats data as an annotated set of objects and relationships
The structure of the network provides new insights in the data
Centrality measures are able to identify prominent actors in the network solely based on its structure
Community detection algorithms reveal groups and clusters based on the network structure
Networks are everywhere!

The end

Thank You!

https://franktakes.nl

Data: U. Stelzl, U. Worm, M. Lalowski, C. Haenig, F. H. Brembeck, H. Goehler, M. Stroedicke, M. Zenkner, A. Schoenherr, S. Koeppen, J. Timm, S. Mintzlaff, C. Abraham, N. Bock, S. Kietzmann, A. Goedde, E. Toksöz, A. Droege, S. Krobitsch, B. Korn, W. Birchmeier, H. Lehrach, and E. E. Wanker. A human protein–protein interaction network: A resource for annotating the proteome. Cell 122:957–968, 2005.