Exploratory Analysis of Networks
James D. Wilson (USF)
Data Institute Conference 2017

Summarizing Networks
Goals:
1. Provide one-number summaries of a network
2. Develop hypotheses about observed data
3. Motivate predictive models
4. Augment standard multivariate analyses

At first glance: Network Summary Statistics
Measures of Connectivity
- Degree: number of edges incident to each node (popularity). For directed networks, we consider in- and out-degree
- Geodesic distance: length of the shortest path between two nodes
- Diameter: longest geodesic distance
- Clustering coefficient: fraction of connected triples that close into triangles
- Reciprocity: fraction of reciprocated ties (directed graphs only)

Examples in Social Networks

At first glance: Network Summary Statistics (continued)
Measures of Nodal Influence / Centrality
Centrality: how "central" is a node in the observed network?
- Degree: popularity of a node
- Eigenvector: a node's own importance depends on the importance of its neighbors
- Betweenness: extent to which a node lies on the shortest paths between pairs of other nodes
- Closeness: mean distance from a node to all other vertices
- Authorities: vertices with useful information on a topic
- Hubs: vertices that point to authorities

Examples: Internet Search
PageRank Algorithm
- Popular web-search scoring and search algorithm (Google)
- Score based on eigenvector centrality
HITS Algorithm (Kleinberg, 1999)
- Hyperlink-induced topic search
- Uses hub and authority scores as the basis for measuring the importance of webpages

Example: Recommendation Systems
Figure source: eye2data.blogspot.com
Want the individual (node) with the most influence
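The connectivity and closeness summaries listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the author's code: the toy graph and the function names (`degrees`, `geodesics_from`, `diameter`, `clustering_coefficient`, `mean_distance`) are assumptions made for this example. The graph is taken to be undirected and connected.

```python
from collections import deque
from itertools import combinations

# Hypothetical toy graph: a triangle (0, 1, 2) with a tail 2 - 3 - 4.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
nodes = sorted({u for e in edges for u in e})
adj = {u: set() for u in nodes}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def degrees(adj):
    """Degree: number of edges incident to each node."""
    return {u: len(nbrs) for u, nbrs in adj.items()}

def geodesics_from(adj, source):
    """Geodesic (shortest-path) distances from source via breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(adj):
    """Diameter: longest geodesic distance (assumes a connected graph)."""
    return max(max(geodesics_from(adj, u).values()) for u in adj)

def clustering_coefficient(adj):
    """Fraction of connected triples (v - u - w) that close into a triangle."""
    closed = triples = 0
    for u in adj:
        for v, w in combinations(adj[u], 2):
            triples += 1
            if w in adj[v]:
                closed += 1
    return closed / triples if triples else 0.0

def mean_distance(adj, u):
    """Closeness as on the slide: mean distance from u to all other vertices."""
    dist = geodesics_from(adj, u)
    return sum(d for v, d in dist.items() if v != u) / (len(dist) - 1)

print("degrees:", degrees(adj))
print("diameter:", diameter(adj))
print("clustering coefficient:", clustering_coefficient(adj))
print("mean distances:", {u: mean_distance(adj, u) for u in nodes})
```

With closeness defined as a mean distance, smaller values indicate more central nodes; in this toy graph the node joining the triangle to the tail has the smallest mean distance.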
Example: Epidemics
Where did this outbreak begin?

Community Structure
Informally: communities in a network are subgraphs $C_1, \ldots, C_k \subseteq [n]$ such that
- Edge density within sets $C_i$ is large
- Edge density between different sets $C_i$ is small

Community Structure (continued)
- In the adjacency matrix, re-ordering the rows and columns according to community labels / modules will result in densely connected "blocks" along the diagonal
- Networks with community structure are said to be assortative

Aims of Community Detection
Aim: capture relevant structure of a complex system
- Example 1: Facebook friendship networks. User friendships → geographic location of user
- Example 2: Human connectome. Clustering of regions signifies "functional regions"

Community Detection Approaches
- In general, community detection (when well-defined mathematically) is NP-hard. Thus, identifying communities requires approximate algorithms
- That being the case, there is no shortage of computational algorithms to identify communities
- We will describe several key approaches, i.e., ways to define community structure. For each of these approaches there are many algorithms available (which we won't detail here)
- Several review articles and one 100+ page review on algorithms

Key Community Detection Approaches
- Min-cut: identify the cut of vertices that "cuts" the fewest edges
- Modularity: partition that deviates most from the organization in a random graph
- Spectral: focus on spectral properties of the graph Laplacian
- Stochastic Block Model: model-based approach; relies on maximum likelihood estimation
- Extraction: local, significance-based algorithms
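The informal definition of community structure given earlier (dense within, sparse between, diagonal blocks after re-ordering) can be made concrete with a small sketch. The toy network, its labels, and the helper names (`reorder_by_label`, `within_density`, `between_density`) are hypothetical choices for this illustration only.

```python
from itertools import combinations

# Hypothetical toy network with two planted communities, stored with
# interleaved node ids so the block structure is hidden in A.
labels = [0, 1, 0, 1, 0, 1]           # community label of nodes 0..5
edges = [(0, 2), (0, 4), (2, 4),       # within community 0
         (1, 3), (1, 5), (3, 5),       # within community 1
         (4, 5)]                       # single edge between communities

n = len(labels)
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u][v] = A[v][u] = 1

def reorder_by_label(A, labels):
    """Permute rows and columns of A so nodes with equal labels are adjacent."""
    order = sorted(range(len(labels)), key=lambda u: labels[u])
    return order, [[A[u][v] for v in order] for u in order]

def within_density(A, members):
    """Observed edges over possible pairs inside one set (needs >= 2 members)."""
    pairs = list(combinations(members, 2))
    return sum(A[u][v] for u, v in pairs) / len(pairs)

def between_density(A, members_a, members_b):
    """Observed edges over possible pairs between two disjoint sets."""
    total = sum(A[u][v] for u in members_a for v in members_b)
    return total / (len(members_a) * len(members_b))

order, A_sorted = reorder_by_label(A, labels)
community = {k: [u for u in range(n) if labels[u] == k] for k in sorted(set(labels))}

print("node order:", order)
for row in A_sorted:                   # dense blocks appear along the diagonal
    print(" ".join(str(x) for x in row))
print("within density, community 0:", within_density(A, community[0]))
print("between density:", between_density(A, community[0], community[1]))
```

In the re-ordered matrix the two 3 x 3 diagonal blocks are full (apart from the zero diagonal), while the off-diagonal blocks contain a single edge.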
The Min-k-Cut Approach
Goal (min-cut / max-flow problem): find the partition of vertices $\Pi = C_1 \cup \cdots \cup C_k$ whose communities have the minimum number of edges between them (Goldberg and Tarjan, 1988).

The cut of two communities $C_1, C_2 \subset [n]$ is:
$$\mathrm{cut}(C_1, C_2) = \frac{1}{2} \sum_{i,j} A_{i,j} \, I(i \in C_1, j \in C_2)$$

We seek the partition
$$\text{Min-}k\text{-Cut}(G) = \arg\min_{\Pi} \sum_{\ell=1}^{k-1} \sum_{m=\ell+1}^{k} \mathrm{cut}(C_\ell, C_m)$$

Min-k-Cut (continued)
Question: what happens if we search for the Min-2-cut when there are nodes that are only connected to one other node?

Normalized Cut
Problem with Min-Cut: tends to find many singleton communities!
To address this, one can normalize the cut between two communities by their size (Ratio-Cut) or by their volume (Norm-Cut).
Normalized-Cut (Shi and Malik, 2000): define the volume of a collection $B \subset [n]$ as $\mathrm{vol}(B) = \sum_{i \in B} d_i$, where $d_i$ is the degree of node $i$. Then,
$$\text{Min-Norm-}k\text{-Cut}(G) = \arg\min_{\Pi} \sum_{\ell=1}^{k-1} \sum_{m=\ell+1}^{k} \left( \frac{\mathrm{cut}(B_\ell, B_m)}{\mathrm{vol}(B_\ell)} + \frac{\mathrm{cut}(B_\ell, B_m)}{\mathrm{vol}(B_m)} \right)$$

Normalized Cut (continued)
- Addresses the issue of singleton communities, but ...
- Issue: when $k > 2$, finding the solution to the Norm-Cut is NP-hard
- Fortunately, an approximate solution can be found!

Connected Components
A connected component of an undirected graph is a collection of vertices $C \subseteq V$ such that
- There is a path between $u$ and $v$ for all $u, v \in C$
- There is no path between $u$ and $v$ for $u \in C$ and $v \in V \setminus C$
Note: partitioning a network into its connected components is an "extreme" example of community detection. So we want to identify communities that are "like" disjoint connected components.
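The singleton problem and its normalized-cut remedy can be checked by hand on a small example. The sketch below is illustrative only: the graph and the helper names (`build_adj`, `cut`, `vol`, `normalized_cut`, `connected_components`) are assumptions. Here `cut` counts each crossing edge once; a constant factor such as the 1/2 in the definition above does not change which partition attains the minimum.

```python
# Hypothetical toy graph: two triangles joined by two bridging edges,
# plus a pendant node 6 attached to node 5 by a single edge.
edges = [(0, 1), (0, 2), (1, 2),       # triangle {0, 1, 2}
         (3, 4), (3, 5), (4, 5),       # triangle {3, 4, 5}
         (2, 3), (1, 4),               # bridges between the triangles
         (5, 6)]                       # pendant node

def build_adj(nodes, edges):
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def cut(edges, B1, B2):
    """Number of edges with one endpoint in B1 and the other in B2."""
    return sum(1 for u, v in edges
               if (u in B1 and v in B2) or (u in B2 and v in B1))

def vol(adj, B):
    """Volume: sum of the degrees of the nodes in B."""
    return sum(len(adj[u]) for u in B)

def normalized_cut(edges, adj, B1, B2):
    c = cut(edges, B1, B2)
    return c / vol(adj, B1) + c / vol(adj, B2)

def connected_components(adj):
    """Connected components via depth-first search, as a list of node sets."""
    seen, components = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u not in component:
                component.add(u)
                stack.extend(adj[u] - component)
        seen |= component
        components.append(component)
    return components

nodes = range(7)
adj = build_adj(nodes, edges)
singleton = ({6}, {0, 1, 2, 3, 4, 5})
balanced = ({0, 1, 2}, {3, 4, 5, 6})

for name, (B1, B2) in [("singleton", singleton), ("balanced", balanced)]:
    print(name, "cut =", cut(edges, B1, B2),
          "normalized cut =", round(normalized_cut(edges, adj, B1, B2), 4))

# Deleting the edges that cross the balanced split leaves the two
# communities as disjoint connected components.
kept = [e for e in edges if cut([e], *balanced) == 0]
print(connected_components(build_adj(nodes, kept)))
```

The plain cut prefers splitting off the pendant node (one crossing edge versus two), whereas the normalized cut penalizes the tiny volume of the singleton and prefers the balanced split.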
Spectral Clustering and the Graph Laplacian
Define $D = \mathrm{diag}(d_1, \ldots, d_n) \in \mathbb{R}^{n \times n}$, where $d_i$ is the degree of node $i$.
Graph Laplacian $L$:
$$L = D - A$$
Normalized graph Laplacian $L_{norm}$:
$$L_{norm} = D^{-1} L = I - D^{-1} A$$

Key Property of the Graph Laplacian
Theorem 1. Let $G$ be an undirected graph with non-negative weights and let $L_{norm}$ be its normalized graph Laplacian. Let $k$ be the multiplicity of the eigenvalue 0 of $L_{norm}$. Then,
(1) $k$ is the number of connected components $C_1, \ldots, C_k$ in $G$
(2) The eigenspace of 0 is spanned by the indicator vectors $1_{C_i}$
Key Point: if $G$ is clustered into $k$ disjoint connected components, then we can perfectly identify the $k$ clusters using the eigenvectors of the $k$ smallest eigenvalues.

Spectral Clustering Algorithm
Input: adjacency matrix $A \in \mathbb{R}_+^{n \times n}$, number of communities $k$
1. Calculate the normalized graph Laplacian $L_{norm}$
2. Compute $X$, the $n \times k$ matrix whose columns are the eigenvectors of the $k$ smallest eigenvalues of $L_{norm}$
3. Cluster the rows of $X$ using k-means
Output: clusters $C_1, \ldots, C_k$

Properties of Spectral Clustering
- Requires a prespecified number of clusters $k$
- Works perfectly in an ideal scenario
- Requires the use of another clustering method (k-means)
- The solution to a relaxed version of the normalized-cut problem
Reference (seriously, read this): Ulrike von Luxburg, "A tutorial on spectral clustering" (2006)

Stochastic Block Model (SBM)
Model-based approach to community detection
$G = (V = [n], E)$ with binary adjacency matrix $A$
Assumes that $G$ has $k$ blocks generated as follows:
1. Community labels $c = (c_1, \ldots, c_n)$ generated at random:
$$c_1, \ldots, c_n \overset{iid}{\sim} \mathrm{multinomial}(1; \pi = \{\pi_1, \ldots, \pi_k\})$$
2. Conditional on $c$, the $A(u, v)$ are independent Bernoulli random variables with
$$E[A(u, v) \mid c] = P_{c_u, c_v}$$
Reference: Holland, et al. "Stochastic blockmodels: first steps" (1983)

Stochastic Block Model (SBM) (continued)
- Observe $G = G_o$, calculate the likelihood $\mathcal{L}(\Theta \mid G_o, k)$ with $\Theta = \{P, c\}$
- Finding $c$ becomes an estimation problem:
$$\hat{\Theta} = \arg\max_{\Theta} \mathcal{L}(\Theta \mid G_o, k)$$
- Requires approximate algorithms like MCMC or variational EM
- Issue: approximate algorithms can be slow!

Modularity
Aim: find the partition of $G$ whose communities contain the highest density of edges relative to the expected density of edges
Remarks:
- Requires a notion of what a random network looks like
- The choice of a null network model affects the resulting communities
- This is the most widely adopted approach to community detection!
Reference: Mark E. J. Newman, "Modularity and community structure in networks" (2006)

Modularity (continued)
Graph $G = (V = [n], E)$, adjacency matrix $A = [A_{u,v}]$
Modularity: measures the "significance" of partition $c$:
$$Q(c) = \frac{1}{2|E|} \sum_{u,v} \left( A(u, v) - \frac{d_u d_v}{2|E|} \right) I\{c_u = c_v\}$$
Measures the average departure of observed edge density from expected edge density

Modularity Maximization
Aim: find the labels $c^* \in \{1, \ldots, k\}^n$ that maximize modularity:
$$c^* = \arg\max_c \{Q(c)\}$$
- NP-hard optimization problem
- Many approximate algorithms developed
Reference: Santo Fortunato, "Community detection in graphs" (2009). [100+ page review paper]

Community Extraction
Basic idea:
- Identify communities $C_i \subseteq V$ one at a time via iterative search
- Remove/avoid $C_1, \ldots, C_i$ when searching for $C_{i+1}$
Virtues:
- Possible to accommodate overlap
- Automatic selection of the number of communities
- Parallelizable! Can easily scale to large networks.
Community Extraction Methods
- OSLOM: Lancichinetti, et al. "Finding statistically significant communities in networks" (2011), a resampling-based method
- Extraction: Zhao, et al. "Community extraction for social networks" (2011), score-based residualizing
- ESSC: Wilson, et al. "A testing based extraction algorithm for identifying significant communities in networks" (2014), hypothesis-testing-based extraction

Significance based Community Extraction