Exploratory Analysis of Networks

James D. Wilson Data Institute Conference 2017

James D. Wilson (USF) Exploratory Analysis of Networks

Summarizing Networks

Goals

1. Provide one-number summaries of a network

2. Develop hypotheses about observed data

3. Motivate predictive models

4. Augment standard multivariate analyses

At first glance: Network Summary Statistics

Measures of Connectivity

Degree: number of edges incident to each node (popularity)

- For directed networks, we consider In- and Out-Degree

Geodesic distance: length of the shortest path between two nodes

Diameter: longest geodesic distance

Clustering coefficient: fraction of connected triples that form triangles

Reciprocity: fraction of reciprocated ties (directed graphs only)
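The connectivity measures above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the toy graph (two triangles joined by one bridge) and all function names are assumptions made for this example, and the graph is taken to be undirected and connected.

```python
from collections import deque

# Hypothetical toy graph: two triangles joined by the bridge 2-3.
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = {u: set() for u in range(6)}
for u, v in EDGES:
    adj[u].add(v)
    adj[v].add(u)


def degree(u):
    """Number of edges incident to node u."""
    return len(adj[u])


def geodesic_distances(source):
    """Breadth-first search: shortest path length from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter():
    """Longest geodesic distance over all pairs (assumes a connected graph)."""
    return max(max(geodesic_distances(u).values()) for u in adj)


def clustering_coefficient():
    """Closed connected triples divided by all connected triples."""
    closed = 0
    triples = 0
    for u in adj:
        nbrs = sorted(adj[u])
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                triples += 1  # the path nbrs[i] - u - nbrs[j]
                if nbrs[j] in adj[nbrs[i]]:
                    closed += 1  # the triple closes into a triangle
    return closed / triples if triples else 0.0


print(degree(2), geodesic_distances(0)[5], diameter(), clustering_coefficient())
```

On this toy graph the bridge endpoints have the largest degree, and the clustering coefficient drops below 1 only because of the open triples centered on those two nodes.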

Examples in Social Networks

At first glance: Network Summary Statistics

Measures of Nodal Influence

Centrality: how “central” is a node in the observed network?

- Degree: popularity of a node

- Eigenvector: a node's importance depends on the importance of its neighbors

- Betweenness: extent to which a node lies on the shortest paths between pairs of other nodes

- Closeness: mean distance from a node to all other vertices

Authorities: vertices with useful information on a topic

Hubs: vertices that point to authorities
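To make two of the centrality notions concrete, the sketch below computes closeness (as the mean geodesic distance, following the definition on this slide) and betweenness on a small undirected graph. The betweenness routine follows the well-known Brandes accumulation scheme; the toy graph and the function names are assumptions for illustration only.

```python
from collections import deque

# Hypothetical toy graph: two triangles joined by the bridge 2-3.
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = {u: set() for u in range(6)}
for u, v in EDGES:
    adj[u].add(v)
    adj[v].add(u)


def mean_distance(adj, source):
    """Closeness as defined on the slide: mean distance to all other reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    others = [d for node, d in dist.items() if node != source]
    return sum(others) / len(others)


def betweenness(adj):
    """Unnormalized betweenness for an undirected, unweighted graph (Brandes-style)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order = []                      # nodes in order of non-decreasing distance from s
        preds = {v: [] for v in adj}    # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}     # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):       # accumulate dependencies back toward s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}


bc = betweenness(adj)
print(bc[2], bc[0], mean_distance(adj, 2), mean_distance(adj, 0))
```

The two bridge endpoints carry every shortest path between the triangles, so they receive all of the betweenness, and they also have the smallest mean distance.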

Examples

Internet Search

PageRank Algorithm

- Popular web-search scoring and search algorithm (Google)

- Score based on eigenvector centrality

HITS Algorithm (Kleinberg, 1999)

- Hyperlink-induced topic search

- Uses hub and authority scores as basis for measuring importance of webpages
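As a rough illustration of the eigenvector idea behind PageRank, the sketch below runs a damped power iteration on a tiny directed graph. The four-page "web", the damping factor of 0.85, and the function name are assumptions chosen for this example; production implementations differ in many details.

```python
def pagerank(out_links, damping=0.85, tol=1e-12, max_iter=500):
    """Damped power iteration; out_links maps each page to the pages it links to."""
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # Pages without out-links spread their rank uniformly over all pages.
        dangling = sum(rank[u] for u in nodes if not out_links[u])
        new = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            for v in out_links[u]:
                new[v] += damping * rank[u] / len(out_links[u])
        change = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if change < tol:
            break
    return rank


# Hypothetical four-page web: a -> b, c; b -> c; c -> a; d -> c.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(web)
print(sorted(rank, key=rank.get, reverse=True))
```

Page `c` collects links from every other page and ends up with the highest score, while `d`, which nobody links to, keeps only the baseline share `(1 - damping) / n`.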

Example: Recommendation Systems

Figure: eye2data.blogspot.com

We want the individual (node) with the most influence

Example: Epidemics

Where did this outbreak begin?

Community Structure

Informally: Communities in a network are subgraphs $C_1, \dots, C_k \subseteq [n]$ such that

- Edge density within sets $C_i$ is large

- Edge density between sets $C_i$ is small
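The informal definition can be checked numerically by comparing edge densities. The sketch below assumes an undirected simple graph and uses a hypothetical toy example (two triangles joined by one edge); the helper names are illustrative, and `density_within` assumes the set has at least two nodes.

```python
from itertools import combinations

# Hypothetical toy graph: two triangles joined by the bridge 2-3.
edges = {frozenset(e) for e in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]}
communities = [{0, 1, 2}, {3, 4, 5}]


def density_within(C):
    """Observed edges inside C divided by the number of possible pairs in C."""
    pairs = list(combinations(sorted(C), 2))
    return sum(frozenset(p) in edges for p in pairs) / len(pairs)


def density_between(C1, C2):
    """Observed edges between C1 and C2 divided by the number of possible cross pairs."""
    present = sum(frozenset((u, v)) in edges for u in C1 for v in C2)
    return present / (len(C1) * len(C2))


within = [density_within(C) for C in communities]
between = density_between(*communities)
print(within, between)
```

Here both triangles are fully connected internally while only one of the nine possible cross pairs is an edge, which is the pattern the definition asks for.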

Community Structure

In the adjacency matrix, re-ordering the rows and columns according to community labels / modules will result in densely connected “blocks” along the diagonal

Networks with community structure are said to be assortative

Aims of Community Detection

Aim: Capture relevant structure of a complex system

Example 1: Facebook friendship networks

User friendships → Geographic location of user

Example 2: Human Connectome

Clusters of regions signify “functional regions”

Community Detection Approaches

In general, community detection (when well-defined mathematically) is NP-hard. Thus, identifying communities requires approximate algorithms

That being the case, there is no shortage of computational algorithms to identify communities

We will describe several key approaches - i.e., ways to define community structure. For each of these approaches there are many algorithms available (which we won’t detail here)

Several review articles exist, including one 100+ page review of algorithms

Key Community Detection Approaches

Min-cut

Identify the partition of vertices that cuts the fewest edges

Modularity

Partition that deviates most from the organization expected in a random network

Spectral

Focus on spectral properties of graph Laplacian

Stochastic Block Model

Model-based approach. Relies on maximum likelihood estimation

Extraction

Local, significance based algorithms

The Min-k-Cut Approach

Goal (Min-cut / Max-flow problem): Find the partition of vertices $\Pi = C_1 \cup \dots \cup C_k$ whose communities have the minimum number of edges between them (Goldberg and Tarjan, 1988)

The cut of two communities $C_1, C_2 \subset [n]$ is:

$$\mathrm{cut}(C_1, C_2) = \frac{1}{2} \sum_{i,j} A_{i,j} \, I(i \in C_1, j \in C_2)$$

We seek the partition:

$$\text{Min-}k\text{-Cut}(G) = \arg\min_{\Pi} \left( \sum_{\ell=1}^{k-1} \sum_{m=\ell+1}^{k} \mathrm{cut}(C_\ell, C_m) \right)$$
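The cut and the Min-k-Cut objective can be evaluated directly from an adjacency matrix. The sketch below follows the formulas on this slide, including the factor 1/2 (a constant that does not change which partition is the minimizer); the toy graph and function names are assumptions for illustration.

```python
def adjacency(n, edges):
    """Symmetric 0/1 adjacency matrix as a list of lists."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = A[v][u] = 1
    return A


def cut(A, C1, C2):
    """cut(C1, C2) = 1/2 * sum_{i,j} A[i][j] * I(i in C1, j in C2)."""
    return 0.5 * sum(A[i][j] for i in C1 for j in C2)


def k_cut_objective(A, partition):
    """Sum of cut(C_l, C_m) over all pairs l < m of communities in the partition."""
    k = len(partition)
    return sum(cut(A, partition[l], partition[m])
               for l in range(k - 1) for m in range(l + 1, k))


# Hypothetical toy graph: two triangles joined by the bridge 2-3.
A = adjacency(6, [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)])
natural = [{0, 1, 2}, {3, 4, 5}]
awkward = [{0, 1}, {2, 3}, {4, 5}]
print(k_cut_objective(A, natural), k_cut_objective(A, awkward))
```

The partition into the two triangles severs a single edge, whereas the three-way split severs four, so the objective prefers the former.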

Min-k-Cut

Question: What happens if we search for the Min-2-cut when there are nodes that are only connected to one other node?

Normalized Cut

Problem with Min-Cut: Tends to find many singleton communities! To address this, one can normalize the cut between two communities by their size (Ratio-Cut) or by their volume (Norm-Cut)

Normalized-Cut (Shi and Malik, 2000):

Define the volume of a collection $B \subset [n]$ as $\mathrm{vol}(B) = \sum_{i \in B} d_i$. Then,

$$\text{Min-Norm-}k\text{-Cut}(G) = \arg\min_{\Pi} \left( \sum_{\ell=1}^{k-1} \sum_{m=\ell+1}^{k} \left[ \frac{\mathrm{cut}(B_\ell, B_m)}{\mathrm{vol}(B_\ell)} + \frac{\mathrm{cut}(B_\ell, B_m)}{\mathrm{vol}(B_m)} \right] \right)$$
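The difference between the two criteria shows up even on a very small graph. The sketch below exhaustively searches all bipartitions ($k = 2$) of a hypothetical 7-node graph containing one pendant node, once under the plain cut and once under the normalized cut. Exhaustive search is only feasible for tiny graphs; the graph and helper names are assumptions for this example.

```python
from itertools import combinations

# Hypothetical graph: triangles {0,1,2} and {3,4,5} joined by two edges,
# plus a pendant node 6 attached only to node 0.
EDGES = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3), (1, 4), (0, 6)]
N = 7
A = [[0] * N for _ in range(N)]
for u, v in EDGES:
    A[u][v] = A[v][u] = 1
degree = [sum(row) for row in A]


def cut(B1, B2):
    """Cut as defined on the slides (with the factor 1/2)."""
    return 0.5 * sum(A[i][j] for i in B1 for j in B2)


def vol(B):
    """Volume of B: the sum of the degrees of its nodes."""
    return sum(degree[i] for i in B)


def norm_cut(B1, B2):
    c = cut(B1, B2)
    return c / vol(B1) + c / vol(B2)


def bipartitions(n):
    """Every split of {0..n-1} into two non-empty sets, listed once (node 0 in the first set)."""
    nodes = range(n)
    for size in range(1, n):
        for side in combinations(nodes, size):
            if 0 in side:
                yield set(side), set(nodes) - set(side)


min_cut_split = min(bipartitions(N), key=lambda p: cut(*p))
norm_cut_split = min(bipartitions(N), key=lambda p: norm_cut(*p))
print(min_cut_split)
print(norm_cut_split)
```

The plain cut isolates the pendant node because that severs only one edge, while dividing by the volumes penalizes such a lopsided split and recovers the two triangle-based groups instead.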

Normalized Cut

Addresses the issue of singleton communities but ...

Issue: When $k > 2$, finding the solution to the Norm-Cut is NP-hard.

Fortunately, an approximate solution can be found!

Connected Components

A connected component of an undirected graph is a collection of vertices $C \subseteq V$ such that

- There is a path between $u$ and $v$ for all $u, v \in C$

- There is no path between $u$ and $v$ for $u \in C$ and $v \in V \setminus C$

Note: Partitioning a network into its connected components is an “extreme” example of community detection. So we want to identify communities that are “like” disjoint connected components
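Connected components can be found with a standard breadth-first search. The sketch below is one straightforward way to do it; the example graph (a triangle, a single edge, and an isolated node) is hypothetical.

```python
from collections import deque


def connected_components(adj):
    """Return the connected components of an undirected graph as a list of node sets."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    component.add(v)
                    queue.append(v)
        components.append(component)
    return components


# Hypothetical graph: a triangle, one separate edge, and an isolated node.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4}, 4: {3}, 5: set()}
components = connected_components(adj)
print(components)
```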

Spectral Clustering and The Graph Laplacian

Define $D = \mathrm{diag}(d_1, \dots, d_n) \in \mathbb{R}^{n \times n}$, where $d_i$ is the degree of node $i$

Graph Laplacian $L$:

$$L = D - A$$

Normalized graph Laplacian $L_{norm}$:

$$L_{norm} = D^{-1} L = I - D^{-1} A$$

Key Property of the Graph Laplacian

Theorem 1.

Let $G$ be an undirected graph with non-negative weights and let $L_{norm}$ be its normalized graph Laplacian.

Let $k$ be the multiplicity of the eigenvalue 0 of $L_{norm}$. Then,

(1) $k$ is the number of connected components $C_1, \dots, C_k$ in $G$

(2) The eigenspace of 0 is spanned by the indicator vectors $\mathbf{1}_{C_i}$

Key Point: If $G$ consists of $k$ disjoint connected components, then we can perfectly identify the $k$ clusters using the eigenvectors of the $k$ smallest eigenvalues
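Part (2) of the theorem is easy to check by hand on a small example. The sketch below builds $L = D - A$ and $L_{norm} = I - D^{-1}A$ for a hypothetical graph made of two disjoint triangles and verifies that each component's indicator vector is mapped to zero. It does not compute eigenvalues, so it illustrates only one direction of the statement, and it assumes every node has positive degree.

```python
def laplacians(A):
    """Return (L, L_norm) with L = D - A and L_norm = D^{-1} L; requires all degrees > 0."""
    n = len(A)
    d = [sum(row) for row in A]
    L = [[(d[i] if i == j else 0) - A[i][j] for j in range(n)] for i in range(n)]
    L_norm = [[L[i][j] / d[i] for j in range(n)] for i in range(n)]
    return L, L_norm


def matvec(M, x):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


# Hypothetical graph with two connected components: triangles {0,1,2} and {3,4,5}.
n = 6
A = [[0] * n for _ in range(n)]
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    A[u][v] = A[v][u] = 1

L, L_norm = laplacians(A)
components = [{0, 1, 2}, {3, 4, 5}]
for C in components:
    indicator = [1.0 if i in C else 0.0 for i in range(n)]
    print(matvec(L_norm, indicator))  # numerically zero: eigenvector with eigenvalue 0
```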

Spectral Clustering

Algorithm

Input: Adjacency matrix $A \in \mathbb{R}_+^{n \times n}$, number of communities $k$

1. Calculate the normalized graph Laplacian $L_{norm}$

2. Compute $X$ = the $n \times k$ matrix of the $k$ smallest eigenvectors of $L_{norm}$

3. Cluster the rows of $X$ using k-means

Output: Clusters $C_1, \dots, C_k$
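The steps above can be sketched without any numerical library for the special case $k = 2$ on a connected graph. Several simplifications are assumed here and are not part of the slides: the second-smallest eigenvector of $L_{norm}$ is approximated by power iteration on the lazy random-walk matrix $(I + D^{-1}A)/2$ with the constant eigenvector deflated, and k-means on the rows of $X$ reduces to a one-dimensional 2-means because the first column of $X$ is constant. In practice one would call a proper eigen-solver and k-means routine instead.

```python
import random


def second_eigenvector(A, iters=500, seed=0):
    """Approximate the eigenvector of L_norm = I - D^{-1}A with the second-smallest eigenvalue."""
    n = len(A)
    d = [sum(row) for row in A]
    total = sum(d)
    rng = random.Random(seed)
    x = [rng.uniform(-1, 1) for _ in range(n)]
    for _ in range(iters):
        # Lazy random-walk step: same eigenvectors as L_norm, eigenvalues mapped into [0, 1].
        x = [0.5 * (x[i] + sum(A[i][j] * x[j] for j in range(n)) / d[i]) for i in range(n)]
        # Deflate the constant eigenvector (orthogonality holds in the degree-weighted inner product).
        shift = sum(d[i] * x[i] for i in range(n)) / total
        x = [xi - shift for xi in x]
        norm = sum(xi * xi for xi in x) ** 0.5
        x = [xi / norm for xi in x]
    return x


def two_means_1d(values, iters=100):
    """Lloyd's algorithm with two centers on a list of numbers."""
    c0, c1 = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in values]
        g0 = [v for v, lab in zip(values, labels) if lab == 0]
        g1 = [v for v, lab in zip(values, labels) if lab == 1]
        if not g1:  # all values identical
            break
        new0, new1 = sum(g0) / len(g0), sum(g1) / len(g1)
        if (new0, new1) == (c0, c1):
            break
        c0, c1 = new0, new1
    return labels


def spectral_bisection(A):
    return two_means_1d(second_eigenvector(A))


# Hypothetical toy graph: two triangles joined by the bridge 2-3.
n = 6
A = [[0] * n for _ in range(n)]
for u, v in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[u][v] = A[v][u] = 1

labels = spectral_bisection(A)
print(labels)
```

Which cluster is called 0 and which is called 1 depends on the sign of the computed eigenvector, so only the grouping is meaningful.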

Properties of Spectral Clustering

Requires a prespecified number of clusters k

Works perfectly in an ideal scenario

Requires the use of another clustering method (k-means)

The solution to a relaxed version of the normalized-cut problem

Reference (seriously, read this): Ulrike Von Luxburg "A tutorial on spectral clustering" (2006)

Stochastic Block Model (SBM)

Model-based approach to community detection

$G = (V = [n], E)$ with binary adjacency matrix $A$

Assumes that $G$ has $k$ blocks generated as follows:

1. Community labels $c = (c_1, \dots, c_n)$ generated at random:

$$c_1, \dots, c_n \overset{iid}{\sim} \mathrm{multinomial}(1, \pi = \{\pi_1, \dots, \pi_k\})$$

2. Conditional on $c$, the $A(u, v)$ are independent Bernoulli rvs with

$$E[A(u, v) \mid c] = P_{c_u, c_v}$$

Reference: Holland, et al. "Stochastic block models: first steps" (1983)
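The two-step generative description translates almost line by line into code. The sketch below assumes an undirected graph without self-loops (the slide does not specify either), and the function name and parameter values are hypothetical.

```python
import random


def sample_sbm(n, pi, P, seed=None):
    """Draw labels c and a symmetric 0/1 adjacency matrix A from a stochastic block model."""
    rng = random.Random(seed)
    num_blocks = len(pi)
    # Step 1: i.i.d. community labels with probabilities pi.
    c = rng.choices(range(num_blocks), weights=pi, k=n)
    # Step 2: independent Bernoulli edges with success probability P[c_u][c_v].
    A = [[0] * n for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < P[c[u]][c[v]]:
                A[u][v] = A[v][u] = 1
    return c, A


# Hypothetical parameters: two equally likely blocks, dense within and sparse between.
labels, A = sample_sbm(n=40, pi=[0.5, 0.5], P=[[0.6, 0.05], [0.05, 0.6]], seed=1)
num_edges = sum(map(sum, A)) // 2
print(labels[:10], num_edges)
```

Setting the diagonal of `P` well above the off-diagonal entries produces the assortative block pattern described earlier in the slides.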

Stochastic Block Model (SBM)

Observe $G = G_o$, calculate likelihood $\mathcal{L}(\Theta \mid G_o, k)$ with $\Theta = \{P, c\}$

Finding $c$ becomes an estimation problem:

$$\hat{\Theta} = \arg\max_{\Theta} \mathcal{L}(\Theta \mid G_o, k)$$

Requires approximate algorithms like MCMC or variational EM
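To see what is being maximized, the sketch below evaluates the Bernoulli log-likelihood of an observed graph for a given labeling $c$, plugging in the block-wise edge frequencies as the estimate of $P$. It only compares two candidate labelings; it is not one of the search algorithms (MCMC, variational EM) mentioned above. The toy graph, the clipping constant `eps`, and the function names are assumptions for illustration.

```python
import math


def mle_block_probs(A, c, k):
    """For fixed labels c, estimate P[a][b] as (edges between a and b) / (pairs between a and b)."""
    n = len(A)
    edges = [[0] * k for _ in range(k)]
    pairs = [[0] * k for _ in range(k)]
    for u in range(n):
        for v in range(u + 1, n):
            a, b = sorted((c[u], c[v]))
            pairs[a][b] += 1
            edges[a][b] += A[u][v]
    P = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(a, k):
            if pairs[a][b]:
                P[a][b] = P[b][a] = edges[a][b] / pairs[a][b]
    return P


def sbm_log_likelihood(A, c, P, eps=1e-12):
    """Log-likelihood of A given labels c and block probabilities P (undirected, no self-loops)."""
    n = len(A)
    ll = 0.0
    for u in range(n):
        for v in range(u + 1, n):
            p = min(max(P[c[u]][c[v]], eps), 1 - eps)  # clip to avoid log(0)
            ll += math.log(p) if A[u][v] else math.log(1 - p)
    return ll


# Hypothetical toy graph: two triangles joined by the bridge 2-3.
n = 6
A = [[0] * n for _ in range(n)]
for u, v in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[u][v] = A[v][u] = 1

good = [0, 0, 0, 1, 1, 1]
bad = [0, 1, 0, 1, 0, 1]
ll_good = sbm_log_likelihood(A, good, mle_block_probs(A, good, 2))
ll_bad = sbm_log_likelihood(A, bad, mle_block_probs(A, bad, 2))
print(ll_good, ll_bad)
```

The labeling that matches the two triangles attains the higher log-likelihood, which is the sense in which finding $c$ becomes an estimation problem.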

Issue: Approximate algorithms can be slow!

Modularity

Aim: find the partition of G whose communities contain the highest density of edges relative to the expected density of edges

Remarks: Requires a notion of what a random network looks like

The choice of a null network model affects resulting communities

This is the most widely adopted approach to community detection!

Reference: Mark E Newman "Modularity and community structure in networks" (2006)

Modularity

Graph $G = (V = [n], E)$, adjacency matrix $A = [A_{u,v}]$

Modularity: Measures the “significance” of partition $c$:

$$Q(c) = \frac{1}{2|E|} \sum_{u,v} \left( A(u, v) - \frac{d_u d_v}{2|E|} \right) I\{c_u = c_v\}$$

Measures the average departure of observed edge density from expected edge density
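The modularity formula can be evaluated directly as a double sum over node pairs. The sketch below is a literal transcription for an undirected graph stored as an adjacency matrix; the toy graph and function name are assumptions for illustration.

```python
def modularity(A, c):
    """Q(c) = 1/(2|E|) * sum_{u,v} (A[u][v] - d_u d_v / (2|E|)) * I{c_u == c_v}."""
    n = len(A)
    d = [sum(row) for row in A]
    two_m = sum(d)  # equals 2|E| for an undirected graph
    total = sum(A[u][v] - d[u] * d[v] / two_m
                for u in range(n) for v in range(n) if c[u] == c[v])
    return total / two_m


# Hypothetical toy graph: two triangles joined by the bridge 2-3.
n = 6
A = [[0] * n for _ in range(n)]
for u, v in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[u][v] = A[v][u] = 1

q_split = modularity(A, [0, 0, 0, 1, 1, 1])
q_single = modularity(A, [0, 0, 0, 0, 0, 0])
print(q_split, q_single)
```

Splitting the graph into its two triangles gives $Q = 5/14$, while placing every node in one community gives $Q = 0$, since the observed and expected edge counts then coincide.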

Modularity Maximization

Aim: Find the labels $c^* \in \{1, \dots, k\}^n$ that maximize modularity:

$$c^* = \arg\max_{c} \{Q(c)\}$$

NP-hard optimization problem
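For a graph with only a handful of nodes the maximization can be done by exhaustive search, which also makes the difficulty visible: the number of labelings is $k^n$. The sketch below is such a brute-force search (labels run from 0 to $k-1$ rather than 1 to $k$) and is not one of the approximate algorithms used in practice; the toy graph and function names are assumptions.

```python
from itertools import product


def modularity(A, c):
    """Newman modularity of labeling c for an undirected adjacency matrix A."""
    n = len(A)
    d = [sum(row) for row in A]
    two_m = sum(d)
    total = sum(A[u][v] - d[u] * d[v] / two_m
                for u in range(n) for v in range(n) if c[u] == c[v])
    return total / two_m


def maximize_modularity_exhaustive(A, k):
    """Try all k**n labelings; only feasible for very small n."""
    n = len(A)
    return max(product(range(k), repeat=n), key=lambda c: modularity(A, c))


# Hypothetical toy graph: two triangles joined by the bridge 2-3.
n = 6
A = [[0] * n for _ in range(n)]
for u, v in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[u][v] = A[v][u] = 1

best = maximize_modularity_exhaustive(A, k=2)
print(best, modularity(A, best))
```

With 6 nodes this evaluates 64 labelings; at 60 nodes and $k = 2$ it would already be more than $10^{18}$, which is why heuristics are used instead.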

Many approximate algorithms developed

Reference: Santo Fortunato, "Community detection in graphs" (2009). [100+ page review paper]

Community Extraction

Basic Idea:

- Identify communities $C_i \subseteq V$ one at a time via iterative search

- Remove/avoid $C_1, \dots, C_i$ when searching for $C_{i+1}$

Virtues:

- Possible to accommodate overlap

- Automatic selection of number of communities

- Parallelizable! Can easily scale to large networks.

Community Extraction Methods

Methods:

- OSLOM: Lancichinetti, et al. "Finding statistically significant communities in networks" (2011) – resampling based method

- Extraction: Zhao, et al. "Community extraction for social networks" (2011) – score-based residualizing

- ESSC: Wilson, et al. "A testing based extraction algorithm for identifying significant communities in networks" (2014) – hypothesis testing based extraction

Significance based Community Extraction

The ESSC Algorithm

Single Extraction

Given: Graph $G = ([n], E)$. Significance level $\alpha \in (0, 1)$

Input: Initial set $B_0 \subseteq [n]$

Loop: Until $B_{t+1} = B_t$

- For each $u \in [n]$, compute p-value $p(u : B_t)$

- Order the vertices of $G$ so that $p(u_1 : B_t) \le \dots \le p(u_n : B_t)$

- Let $k \ge 0$ be the largest integer such that $p(u_k : B_t) \le (k/n)\alpha$

- Let $B_{t+1} = \{u_1, \dots, u_k\}$ and increment $t := t + 1$

Return: Community $C = B_t$
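The loop above can be sketched as follows. The slides do not say how $p(u : B_t)$ is computed, so this example assumes a binomial upper-tail p-value: the number of neighbors of $u$ inside $B$ is compared with a Binomial($d_u$, $\mathrm{vol}(B)/\mathrm{vol}(G)$) reference. Treat that choice, the `max_iter` safeguard, the toy graph, and all names as assumptions of this sketch rather than as the published ESSC implementation (see the repository linked on the next slide for that).

```python
from math import comb


def binom_upper_tail(n, p, x):
    """P(Bin(n, p) >= x)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(x, n + 1))


def single_extraction(adj, B0, alpha=0.05, max_iter=100):
    """Iterate the update B_t -> B_{t+1} until a fixed point (or max_iter) is reached."""
    n = len(adj)
    deg = {u: len(adj[u]) for u in adj}
    total = sum(deg.values())
    B = set(B0)
    for _ in range(max_iter):
        p_B = sum(deg[v] for v in B) / total
        # Assumed p-value: is u more connected to B than expected by chance?
        pvals = {u: binom_upper_tail(deg[u], p_B, len(adj[u] & B)) for u in adj}
        order = sorted(adj, key=lambda u: pvals[u])
        # Largest k with p(u_k : B) <= (k / n) * alpha.
        k = 0
        for rank, u in enumerate(order, start=1):
            if pvals[u] <= rank / n * alpha:
                k = rank
        new_B = set(order[:k])
        if new_B == B:
            return B  # fixed point reached
        B = new_B
    return B  # no fixed point within max_iter; returned as-is


def two_cliques(size):
    """Hypothetical test graph: two cliques of the given size joined by a single edge."""
    adj = {u: set() for u in range(2 * size)}
    for block in (range(size), range(size, 2 * size)):
        for u in block:
            for v in block:
                if u != v:
                    adj[u].add(v)
    adj[size - 1].add(size)
    adj[size].add(size - 1)
    return adj


graph = two_cliques(10)
found = single_extraction(graph, B0={0} | graph[0])  # start from the neighborhood of node 0
print(sorted(found))
```

Started from the neighborhood of a node in the first clique, the update leaves the set unchanged, so that clique is returned as a fixed point. If no vertex passes the threshold the set becomes empty, which is also a fixed point of the update.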

The ESSC Algorithm

Global Search

- Repeat Single Extraction using vertex neighborhoods as initial sets

- Final collection: set of unique fixed points of the search

Code and Readme available at https://github.com/jdwilson4/ESSC

Community Detection in igraph

The igraph package contains several fast methods for identifying communities. Several of these (for example, Fast and Greedy and Louvain) seek the partition with the highest modularity; others, such as Infomap, Label Propagation, and Walktrap, optimize different criteria. These methods include:

- Infomap (cluster_infomap())

- Fast and Greedy (cluster_fast_greedy())

- Label Propagation (cluster_label_prop())

- Louvain (cluster_louvain())

- Spinglass (cluster_spinglass())

- Walktrap (cluster_walktrap())

Extensions to Multilayer Networks

Network model for multidimensional system

Unordered sequence of $m$ networks $G(m, n) = \{([n], E_\ell)\}_{\ell=1}^{m}$

- $G_\ell = ([n], E_\ell)$ describes the relational structure of layer $\ell$

- Each layer represents a relational type or sample

- Can incorporate layer dependencies (correlation, temporal dependence)

Multilayer Community Detection

Three general approaches: Aggregate, Separate, and Super Adjacency
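The slides name the three approaches without defining them, so the sketch below rests on common readings that should be treated as assumptions: "aggregate" sums the layers into one weighted network, "separate" analyzes each layer on its own, and "super adjacency" stacks the layers as diagonal blocks of one large matrix and couples the copies of each node across layers with a hypothetical weight `omega`. Any single-layer method could then be run on the resulting matrices.

```python
def aggregate(layers):
    """Sum the layer adjacency matrices entrywise into one weighted adjacency matrix."""
    n = len(layers[0])
    return [[sum(A[i][j] for A in layers) for j in range(n)] for i in range(n)]


def supra_adjacency(layers, omega=1.0):
    """Block matrix: layers on the diagonal, omega linking copies of the same node across layers."""
    m, n = len(layers), len(layers[0])
    S = [[0.0] * (m * n) for _ in range(m * n)]
    for l, A in enumerate(layers):
        for i in range(n):
            for j in range(n):
                S[l * n + i][l * n + j] = float(A[i][j])
    for l in range(m):
        for r in range(m):
            if l != r:
                for i in range(n):
                    S[l * n + i][r * n + i] = omega
    return S


# Hypothetical two-layer network on three nodes.
layer_1 = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
layer_2 = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
layers = [layer_1, layer_2]

combined = aggregate(layers)
supra = supra_adjacency(layers, omega=0.5)
print(combined)
print(len(supra), supra[0][3])
```

The "separate" approach needs no extra construction: it simply applies a single-layer method to `layer_1` and `layer_2` independently and compares the results.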

Big Idea: Identify densely connected collections of vertices that persist across layers

Important Considerations:

- What is a multilayer community?

- How to measure significance of a community?

- How to identify communities?

Now, onward to R!
