Jure Leskovec
Computer Science Department, Stanford University
Includes joint work with Deepay Chakrabarti, Anirban Dasgupta, Christos Faloutsos, Jon Kleinberg, Kevin Lang and Michael Mahoney
Workshop on Complex Networks in Information & Knowledge Management, CIKM 2009

Interaction graph model of networks:
• Nodes represent “entities”
• Edges represent “interactions” between pairs of entities
Thinking of data in the form of a network makes the difference:
• Google realized web pages are connected
• Collective classification, targeting
Jure Leskovec, CNIKM '09

• Online social networks
• Collaboration networks
• Systems biology networks
• Communication networks
• Web graph & citation networks
[Figure labels: arXiv, DBLP, Internet]

Complex networks as phenomena, not just designed artifacts
What are the common patterns that emerge?
• Scale-free networks: many more hubs than expected
[Figure: Flickr social network, n = 584,207, m = 3,555,115]

Network data spans many orders of magnitude:
• 436-node network of email exchange over 3 months at a corporate research lab [Adamic-Adar, SocNets '03]
• 43,553-node network of email exchange over 2 years at a large university [Kossinets-Watts, Science '06]
• 4.4-million-node network of declared friendships on a blogging community [Backstrom et al., KDD '06]
• 240-million-node network of all IM communication over a month on Microsoft Instant Messenger [Leskovec-Horvitz, WWW '08]

[Leskovec et al. KDD '05]
How do network properties scale with the size of the network?
• Densification: average degree increases
• Shrinking diameter: path lengths get shorter
[Figure: Citations, E(t) versus N(t), a = 1.6]
[Figure: Citations, diameter versus time]

What do we hope to achieve by analyzing networks?
• Patterns and statistical properties of network data
• Build intuition, design principles and models
• Understand why networks are organized the way they are
• Design better tools/algorithms

[Backstrom et al. KDD '06]
In a social network, nodes explicitly declare group membership:
• Facebook groups, publication venue
Can think of groups as node colors
Gives insights into social dynamics:
• Recruits friends? Memberships spread along edges
• Doesn’t recruit? Memberships spread randomly
What factors influence a person’s decision to join a group?

[Backstrom et al. KDD '06]
Analogous to diffusion
Group memberships spread over the network:
• Red circles represent existing group members
• Yellow squares may join
Question:
• How does the probability of joining a group depend on the number of friends already in the group?

[Backstrom et al. KDD '06]
• LiveJournal: 1 million users, 250,000 groups
• DBLP: 400,000 papers, 100,000 authors, 2,000 conferences
Diminishing returns:
• Probability of joining increases with the number of friends in the group
• But the increases get smaller and smaller

[Backstrom et al. KDD '06]
Connectedness of friends:
• x and y have three friends in the group
• x’s friends are independent
• y’s friends are all connected
Who is more likely to join?

[Backstrom et al. KDD '06]
Competing sociological theories:
• Information argument [Granovetter '73]
• Social capital argument [Coleman '88]
Information argument:
• Unconnected friends give independent support
Social capital argument:
• Safety/trust advantage in having friends who know each other

[Backstrom et al. KDD '06]
LiveJournal: 1 million users, 250,000 groups
Social capital argument wins!
Probability of joining increases with the number of adjacent members.

What we have seen so far suggests that network groups are tightly connected
Network communities:
• Sets of nodes with lots of connections inside and few to the outside (the rest of the network)
[Figure label: Communities, clusters, groups, modules]

How to automatically find such densely connected groups of nodes?
Ideally such automatically detected clusters would then correspond to real groups
[Figure label: Communities, clusters, groups, modules]
For example:

Zachary’s Karate club network:
• Observe social ties and rivalries in a university karate club
• During his observation, conflicts led the group to split
• Split could be explained by a minimum cut in the network

Micro-markets in “query × advertiser” graph
[Figure labels: query, advertiser]

(1) Map part of the real world into a network
(2) Hypothesize the network contains clusters
(3) Formalize the idea of a cluster using an objective function
(4) Use an algorithm/heuristic to optimize the objective function
(5) Call the sets that the algorithm finds “clusters” and evaluate them by comparing them to the real world
What is the cluster structure of networks?
How does it scale with the network size?

What is the cluster structure of networks?
How does it scale from small to very large networks?
How to think about clusters in large networks?
Idea: Use approximation algorithms for NP-hard graph partitioning problems as experimental probes of network structure
[Figure labels: Physics collaborations; Part of a large social network]

(3) What is a good cluster?
• Many edges internally
• Few pointing outside
[Figure: node sets S and S’]
Formally, conductance/normalized cut:
Φ(S) = # edges cut / # edges inside
Small Φ(S) corresponds to good clusters
(4) What method to use to optimize the objective function? (Is there systematic bias?)

What is the “best” community of 5 nodes?
Score: Φ(S) = # edges cut / # edges inside
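The score Φ(S) = # edges cut / # edges inside can be computed directly from an edge list. The following is a minimal sketch, not the authors' code: the function name `cluster_score` and the toy graph are hypothetical, and the ratio follows the simplified form written on the slide (the usual textbook conductance normalizes by the volume of S, i.e. the sum of its node degrees, rather than by the internal edge count).

```python
def cluster_score(edges, nodes):
    """Slide-style score: (# edges cut) / (# edges inside) for the node set `nodes`.

    `edges` is an iterable of undirected (u, v) pairs. This is the simplified
    ratio from the slide, not the volume-normalized textbook conductance.
    """
    inside = set(nodes)
    cut = 0       # edges with exactly one endpoint in the set
    internal = 0  # edges with both endpoints in the set
    for u, v in edges:
        if u in inside and v in inside:
            internal += 1
        elif u in inside or v in inside:
            cut += 1
    if internal == 0:
        return float("inf")  # no internal edges: treat as the worst possible score
    return cut / internal


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
toy_edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]

print(cluster_score(toy_edges, {0, 1, 2}))  # 1 cut edge / 3 internal edges
print(cluster_score(toy_edges, {2, 3}))     # 4 cut edges / 1 internal edge
```

A triangle scores 1/3 while the pair straddling the bridge scores 4, which matches the intuition that small Φ(S) marks a good cluster.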
What is the “best” community of 5 nodes?
• Bad community: Φ = 5/6 ≈ 0.83
Score: Φ(S) = # edges cut / # edges inside

What is the “best” community of 5 nodes?
• Bad community: Φ = 5/7 ≈ 0.7
• Better community: Φ = 2/5 = 0.4
• Best community: Φ = 2/8 = 0.25
Score: Φ(S) = # edges cut / # edges inside

So far we have defined a measure that quantifies how cluster-like a set of nodes is
Now we want a measure of how pronounced the clusters are in the network overall

[Leskovec et al. WWW '08]
Define: Network community profile (NCP) plot
Plot the score of the best community of size k
• k = 5: Φ(5) = 0.25
• k = 7: Φ(7) = 0.18
[Figure: log Φ(k) versus community size, log k]

• A dot represents a set of k nodes
• Lower envelope gives the score of the best cluster of k nodes
[Figure: cluster score, log Φ(k), versus cluster size, log k]

[Leskovec et al. WWW '08]
Many algorithms to extract network communities:
• Spectral (quadratic approx): confuses “long paths” with “deep cuts”
• Multi-commodity flow (log(n) approx): difficulty with expanders
• Modularity optimization: popular heuristics
• Metis (multi-resolution heuristic): common in practice
• X+MQI: post-processing step, e.g., MQI applied to Metis
• Local Spectral: connected and tighter sets [Andersen-Chung '07]

Meshes, grids, dense random graphs:
[Figure panels: d-dimensional meshes; California road network]

[Leskovec et al. WWW '08]
Manifold learning dataset (Hands)

Zachary’s university karate club social network

[Leskovec et al.
WWW '08]
Collaborations between scientists in networks [Newman, 2005]
[Figure: conductance, log Φ(k), versus community size, log k]

[Ravasz-Barabasi '03] [Clauset-Moore-Newman '08]

Natural hypothesis about NCP:
• NCP of real networks slopes downward
• Slope of the NCP corresponds to the dimensionality of the network
What about large networks?
We examined more than 100 large networks

[Leskovec et al. WWW '08]
Typical example: General Relativity collaborations (n = 4,158, m = 13,422)

[Leskovec et al. WWW '08]
[Figure: NCP plot, Φ(k) (conductance) versus k (cluster size)]
• Better and better communities
• Best community has ~100 nodes
• Communities get worse and worse

Each successive edge inside the community costs more cut-edges
Each node has twice as many children
• Φ = 1/3 ≈ 0.33
• Φ = 2/4 = 0.5
• Φ = 8/6 ≈ 1.3
• Φ = 64/14 ≈ 4.6
[Figure: NCP plot]

Empirically we note that the best clusters (call them whiskers) are barely connected to the network
Best cluster: how does it scale with network size?
⇒ Core-periphery structure
[Figure: NCP plot]

[Leskovec et al. arXiv '09]
Whiskers in real networks are non-trivial (richer than trees)
[Figure: whiskers, with the edge to cut]

[Leskovec et al. arXiv '09]
Whiskers in real networks are larger than expected based on density and degree sequence

[Leskovec et al. arXiv '09]
Practically constant!
Each dot is a different network

[Leskovec et al. arXiv '09]
Nothing happens!
⇒ Nestedness of the core-periphery structure

Nested core-periphery:
• Denser and denser network core
• Core contains ~60% of nodes and ~80% of edges
• Small good communities

[Leskovec et al. arXiv '09]
What if we allow cuts that give disconnected communities?
• Bag of whiskers: compose communities out of whiskers
• How good “communities” do we get?

[Leskovec et al. arXiv '09]
[Figure: LiveJournal; curves: Rewired network, Bag-of-whiskers, Local spectral, Metis+MQI]

[Leskovec et al. arXiv '09]
Regularization properties: spectral embeddings stretch along directions in which the random walk mixes slowly.
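
The bag-of-whiskers idea can be sketched on a toy graph. This is an illustration under stated assumptions, not the authors' implementation: a whisker is taken here to be a maximal node set that removing a single edge detaches from the rest of a connected, simple graph (the smaller side of the cut), and the names `reachable`, `find_whiskers`, `score` and the toy graph are all hypothetical. The score is the slide's ratio # edges cut / # edges inside.

```python
from collections import deque


def reachable(adj, start, banned):
    """Nodes reachable from `start` without crossing the edge `banned` (a 2-node set)."""
    seen = {start}
    queue = deque([start])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if {x, y} == banned or y in seen:
                continue
            seen.add(y)
            queue.append(y)
    return seen


def find_whiskers(edges):
    """Maximal node sets detached by removing one edge (assumed whisker definition).

    Assumes a connected, simple, undirected graph. Brute force: one BFS per edge.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    n = len(adj)
    candidates = []
    for u, v in edges:
        side = reachable(adj, u, {u, v})
        if v in side:
            continue  # still connected without this edge, so it is not a bridge
        if 2 * len(side) > n:
            side = set(adj) - side  # keep the smaller side of the cut
        candidates.append(frozenset(side))
    # Drop candidates nested inside a larger candidate.
    return [c for c in candidates if not any(c < d for d in candidates)]


def score(edges, nodes):
    """Slide-style score: (# edges cut) / (# edges inside)."""
    inside = set(nodes)
    internal = sum(1 for u, v in edges if u in inside and v in inside)
    cut = sum(1 for u, v in edges if (u in inside) != (v in inside))
    return cut / internal if internal else float("inf")


# Hypothetical toy graph: a 4-clique core, a 2-node tail and a triangle,
# each attached to the core by a single edge.
core = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
tail = [(0, 4), (4, 5)]
triangle = [(1, 6), (6, 7), (7, 8), (6, 8)]
toy_edges = core + tail + triangle

whiskers = find_whiskers(toy_edges)
bag = set().union(*whiskers)          # a disconnected "community"
print(sorted(map(sorted, whiskers)))  # [[4, 5], [6, 7, 8]]
print(score(toy_edges, bag))          # 2 cut edges / 4 internal edges = 0.5
```

The bag is disconnected by construction, which is the point of the question on the slide: dropping the connectivity requirement makes it possible to assemble sets at sizes that no single whisker reaches.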