Network Analysis / Visualization Announcements
Total Page:16
File Type:pdf, Size:1020Kb
Network Analysis / Visualization Maneesh Agrawala Jessica Hullman CS 294-10: Visualization Fall 2014 Announcements 1 Final project Design new visualization method ■ Pose problem, Implement creative solution Deliverables ■ Implementation of solution ■ 8-12 page paper in format of conference paper submission ■ 1 or 2 design discussion presentations Schedule ■ Project proposal: 10/27 ■ Project presentation: 11/10, 11/12 ■ Final paper and presentation: TBD, likely 12/1-12/5 Grading ■ Groups of up to 3 people, graded individually ■ Clearly report responsibilities of each member Network Analysis & Visualization *Many slides adapted from E. Adar’s / L. Adamic’s Network Theory and Applications course slides. 2 Basic definition Networks (graphs) are collections of points joined by lines Edges can be directed / undirected Edges and nodes can have attributes associated with them 1 2 node 3 edge 4 Adjacency matrix 5 1 2 3 4 5 1 0 0 1 0 0 points lines 2 0 0 1 0 0 vertices edges, arcs 3 1 1 0 1 1 nodes links 4 0 0 1 0 1 actors ties, relations 5 0 0 1 1 0 Diseases http://diseasome.eu/ 3 Transportation http://www.lx97.com/maps/ Online social networks Friendster Heer, J., and boyd, d. Vizster: Visualizing Online Social Networks. InfoVis 2005. 4 Lombardi, M. ‘George W. Bush, Harken Energy and Jackson Stephens, ca 1979–90’ Actors and movies (bipartite) 5 Applications Tournaments Organization Charts Genealogy Text Biological Interactions (Genes, Proteins) Computer Networks Social Networks Simulation and Modeling Integrated Circuit Design 6 Characterizing networks What does it look like? Size? Density? Centralization? Clustering? Components? Cliques? Motifs? Avg. path length? … www.opte.org 7 Topics Network Analysis • Centrality / centralization • Community structure • Pattern identification • Models Tools for Network EDA Centrality 8 How far apart are things? Distance: shortest paths Shortest path (geodesic path) ■ The shortest sequence of links connecting two B nodes A ■ Not always unique C n A and C are connected by 2 shortest paths n A – E – B - C D E n A – E – D - C 9 Distance: shortest paths Shortest path from 2 to 3: 1 4 6 2 1 5 7 3 Diameter? Distance: shortest paths Shortest path from 2 to 3? Shortest path from each pair? 4 6 2 1 5 7 3 10 Most important node? Centrality Y Y X X outdegree indegree Y X X Y betweenness closeness 11 Degree centrality (undirected) C d(n ) A A D = i = i+ = ∑ ij j Normalized degree centrality d(i) CD (i) = N−1 12 When is degree not sufficient? ability to broker between groups likelihood that information originating anywhere in the network reaches you Betweenness Assuming nodes communicate using the most direct route, how many pairs of nodes have to pass information through target node? Y X Y Y X X 13 Betweenness: definition CB (i) = ∑ gjk (i) / gjk j,k≠i, j<k gjk = the number of geodesics connecting jk gjk(i) = the number that actor i is on. Normalization: ' CB (i) = CB (i )/[(n −1)(n − 2)/2] number of pairs of vertices excluding the vertex itself € Betweenness - examples non-normalized: A B C D E 14 Lada’s facebook network: nodes are sized by degree, and colored by betweenness (blue- >red è low->high). When are Cd, Cb not sufficient? likelihood that information originating anywhere in the network reaches you 15 Closeness: definition Being close to the center of the graph Closeness Centrality: −1 # N & Cc (i) = % ∑ d(i, j)( $% j=1, j≠i '( Normalized Closeness Centrality ' N −1 CC (i) = (CC (i)) / (N −1) = N ∑ d(i, j) j=1, j≠i Examples - closeness 16 Centrality in directed networks Prestige degree centrality ~ indegree centrality closeness ~ consider nodes from which target node can be reached influence range ~ nodes reachable from target node betweenness ~ consider directed shortest paths Straight-forward modifications to equations for non-directed graphs Characterizing• generally different centrality metrics will be posively correlated nodes • when they are not, there is likely something interesJng about the network • suggest possible topologies and node posions to fit each square Low Low Low Degree Closeness Betweenness High Degree Node embedded in Node's connections cluster that is far are redundant - from the rest of the communication network bypasses him/her High Closeness Node links to a Many paths likely to small number of be in network; node important/active is near many other nodes. people, but so are many others High Node’s few ties are Rare. Node Betweenness crucial for network monopolizes the ties flow from a small number of people to many others. 17 Centralization – how equal Variation in the centrality scores among the nodes Freeman’s general formula for centralization: maximum value in the network g * ∑ [CD (n ) − CD (i)] C = i=1 D [(N −1)(N − 2)] € Examples g * [CD (n ) − CD (ni )] C = ∑i=1 D [(N −1)(N − 2)] (5− 5)+ (5−1)× 5 C = =1 D (6 −1)(6 − 2) 18 Examples CD = 0.167 CD = 1.0 CD = 0.167 Financial networks 19 Baker and Faulkner Price fixing scams in switchgear, transformer, and turbine industries What network relaons facilitate illegal behavior/ conspiracy? The Social Organization of Conspiracy: Illegal Networks in the Heavy Electrical Equipment Industry, Wayne E. Baker, Robert R. Faulkner. American Sociological Review, Vol. 58, No. 6 (Dec., 1993), pp. 837-860. Theoretical predictions 20 Results Low information-processing conspiracies are decentralized, high information processing load leads to centralization. At the individual level, degree centrality predicts verdict. Community Structure 21 How dense is it? Max. possible edges: density = e/ emax ■ Directed: emax = n*(n-1) ■ Undirected: emax = n*(n-1)/2 Is everything connected? 22 Connected components - directed Strongly connected components ■ Each node in component can be reached from every other node in component by following directed links F n B C D E B G n A A C H n G H D n F E Weakly connected components ■ Each node can be reached from every other node by following links in either direction n A B C D E n G H F Broder et al. (1999) 23 Finding connected components 1 Digby’s Blog 21 JawaReport 2 James Walcott 11 21 22 Vodka Pundit 3 Pandagon 1 22 21 23 Roger L Simon 4 blog.johnkerry.com 33 24 Tim Blair 5 Oliver Willis 23 25 Andrew Sullivan 6 America Blog 4 2222 23 4 5 24 26 Instapundit 7 Crooked Timber 66 5 24 2727 27 Blogs for Bush 8 Daily Kos 7 25 28 LittleGreenFootballs 7 25 26 9 American Prospect 8 26 2828 29 Belmont Club 10 Eschaton 1010 29 9 1111 29 3030 30 Captain’s Quarters 11 Wonkette 9 31 Powerline 12 Talk Left 13 1212 32 Hugh Hewitt 13 32 13 Political Wire 31 32 33 INDC journal 14 Talking Points Memo 17 31 17 1515 1414 34 Real Clear Politics 15 Matthew Yglesias 1818 33 35 Winds of Change 3535 33 16 Washington Monthly 1616 3434 35 363636 Allahpundit 17 MyDD 37 Michelle Malkin 18 Juan Cole 1919 39 38 Wizbang 3838 39 19 Left Coaster 3737 38 39 Dean’s World 20 Bradford DeLong 40 Volokh 20 20 4040 Adamic, L., and Glance, N. The political blogosphere and the 2004 US election: Divided they blog. Proceedings of the 3rd international workshop on Link discovery, p.36-43, (2005) Community finding 24 Hierarchical clustering Process: ■ Calculate weights W for all pairs of vertices ■ Start: N disconnected vertices ■ Adding edges (one by one) between pairs in order of decreasing weight ■ Result: nested components Hierarchical clustering 25 Hierarchical clustering Betweenness clustering Girvan and Newman 2002 iterative algorithm: ■ Compute Cb of all edges ■ Remove edge i where Cb(i) == max(Cb) ■ Recalculate betweenness 26 Clustering coefficient Local clustering coefficient: number of closed triplets centered on i C = i number of connected triplets centered on i i i i Global clustering coefficient: Ci = 1/3 = 0.33 3* number of closed triplets CG = number of connected triplets CG= 3*1/5 = 0.6 Comparing to random graph ■ clustering coefficient ■ compare to a randomized version (conserving node degree) ■ degree distribution ■ assortativity ■ do high degree nodes connect to other high degree nodes? ■ average shortest path ■ motif profile 27 Pattern finding - motifs Define / search for a particular structure, e.g. complete triads W X Y Z Motifs can overlap in the network graph motif to be found motif matches http://mavisto.ipk-gatersleben.de/frequency_concepts.html 28 4 node subgraphs Motif detection 29 Tools Network EDA Structure Useful features: • Centralizaon • Associate aributes • Density • Node/graph level centrality • Clustering, components • Filter on stas;cs • Mo;fs • Examine distribu;ons • Comparison to models • Iden;fy components, clusters • Define and search for paerns • Create random graphs, calculate AGributes stas;cs • Nodes / links / communi;es • Map stas;cs to visual features (color, size, weight) • Track nodes and groups of interest • Zoom and pan in large graphs 30 Pajek Mrvar, A. Pajek - Program for Large Network Analysis. Connections 21(1998)2, 47-57 GUESS Adar, E. GUESS: The Graph Exploration System. ACM CHI 2006. 31 GUESS: plotting statistics GUESS – convex hulls 32 SocialAction Challenge: User directedness + number of stas;cal features leads to opportunis;c analysis in most tools Soluon: ■ Provide overview ■ Use aribute ranking and coordinated views ■ Aggregate networks, iden;fy communi;es ■ View bi-, tripar;te (etc.) networks separately ■ Access to matrix overview ■ Keep nodes in place Perer, A. and Shneiderman, B. Balancing systematic and flexible exploration of social networks. InfoVis 2006. SocialAction 33 SocialAction SocialAction 34 Other tools Gephi Prefuse TouchGraph GraphViz NodeXL NetLogo Simulating network models 35 Small world network Milgram (1967) ■ Mean path length in US social networks ■ ~ 6 hops separate any two people Small world networks Watts and Strogatz 1998 ■ a few random links in an otherwise structured graph make the network a small world regular lattice: small world: random graph: my friend’s friend is mostly structured all connections always my friend with a few random random connections 36 Defining small world phenomenon Pattern: ■ high clustering C C ■ low mean shortest path network >> random graph lnetwork ≈ ln(N) Examples ■ neural network of C.