Social and Technological Network Analysis Lecture 4

Social and Technological Network Analysis Lecture 4: Community Detec=on and Overlapping Communi=es Dr. Cecilia Mascolo Communies • Weak =es (Lecture 2) seemed to bridge groups of =ghtly coupled nodes (communi=es) • How do we find these communi=es? In This Lecture • We will describe a Community Detec=on method based on betweenness centrality. • We will describe the concept of Modularity and Modularity Op=mizaon. • We will describe methods for overlapping community detec=on. What is a Community? Howtoautomaticallyfind suchdenselyconnected groupsofnodes? Ideallysuchautomatically detectedclusterswould thencorrespondtoreal groups Forexample: Communities,clusters, groups,modules 11/10/2009 JureLeskovec,StanfordCS322:NetworkAnalysis 4 Why do we want to find par==ons/communi=es? • Clustering web clients with similar interest or geographically near can improve performance • Customers with similar interests could be clustered to help recommendaon systems • Clusters in large graphs can be used to create data structures for efficient storage of graph data to handle queries or path searches • Study the relaonship/mediaon among nodes – Hierarchical organizaon study Example Zachary’s Karate club: 34 members of a club over 3 years. Edges: interac=on outside the club WWW: pages and hyperlinks Iden=ficaon of clusters can improve page ranking Remove weak =es • Local bridges connect weakly interac=ng parts of the network. • What if we have many bridges: which do we remove first? Or there might be no bridges. • Note: Without those bridges paths between nodes would be longer. Edge Betweenness • Edge Betweenness: the number of shortest paths between pairs of nodes that run along the edge. 7 * 7 Algorithm of Girvan-Newmann (PNAS 2002) • Calculate the betweenness of all edges • Cut the edge with highest betweenness • Recalculate edge betweenness How is the betweenness computed? • Calculate the shortest paths from node A – BFS search from A. – Determine number of shortest paths from A to each node. Calculang number of shortest paths Calculang flows When we get to a node X in the breadth-first search structure, working up from the bocom, we add up all the flows arriving from edges directly below X, plus 1 for the flow des=ned for X itself. We then divide this up over the edges leading upward from X, in propor=on to the number of shortest paths coming through each. Calculang Edge Betweenness • Build one of these graphs for each node in the graph • Sum the values on the edges on each graph to obtain the edge betweenness Edge: deleon When do we stop? • How do we know when to stop? • When X communi=es have been detected? • When the level of cohesion inside a community has reached Y? • There is no prescrip=ve way for every case • There are also many other ways of detec=ng communies. Modularity • Perhaps a good measure of when to stop is when for each community the “cohesion” within the community is higher than outside… • Q= (edges inside the community)- (expected number of edges inside the community for a random graph with same node degree distribu;on as the given network) Modularity on a randomized graph calculaon The expected number of a Ka edges in the randomized version of the graph where nodes are rewired: kakb 2m m is the number of edges of the graph = ½ sum(ki) Modularity (2) • Number of edges inside a community: 1 ∑Aa,bδ(ca,cb ) 2 a,b • Where: • Aa,b is 1 if there is an edge a->b, • δ(ca , cb) is the Kronecker Delta (1 if ca is equal to cb) Modularity (3) 1 1 kakb Q1= ∑Aa,bδ(ca,cb )− ∑ δ(ca,cb ) 2 a,b 2 a,b 2m 1 kakb Q1= ∑(Aa,b − )δ(ca,cb ) 2 a,b 2m 1 k k Q = (A − a b )δ(c ,c ) Frac=on of edges over ∑ a,b a b all edges m 2m a,b 2m Modularity (4) • Modularity ranges from -1 to 1. – It is posi=ve if the number of edges inside the group are more than the expected number. – Variaon from 0 indicate difference with random case. • Modularity can be used at each round of the Girvan- Newmann algorithm to check if it is =me to stop. However the complexity of this is O(m2n). • Why don’t we try to just maximize modularity? Modularity Op=mizaon • Finding the configuraon with maximum modularity in a graph is an NP complete problem. • However there are good approximaon algorithms. Fast Modularity • Start with a network of n communi=es of 1 node • Calculate ΔQ for all possible community pairs • Merge the pair of the largest increase in Q • Repeat (2)&(3) un=l one community remains • Cross cut the dendrogram where Q is maximum. • This runs in O((m + n)n). • A further op=mizaon runs in O(m d logn) [d depth of dendrogram]. • Most networks are sparse so m~n and d~log n Example of Dendrogram 3 of the value 0.601 for the best split reported in [5]. As the dendrogram reveals, the algorithm finds six communities. Some of them correspond to single conferences, but most correspond to two or more. The GN algorithm, by contrast, finds all eleven conferences, as well as accu- rately identifying independent teams that belong to no conference. Nonetheless, it is clear that the new algorithm is quite capable of picking out useful community structure from the network, and of course it is much the faster algorithm. On the author’s personal computer the 1 6 17 7 5 11 12 20 2 18 22 14 4 13 8 3 10 23 19 16 15 21 9 31 33 29 25 26 32 24 27 30 34 28 algorithm ran to completion in an unmeasureably small time—less than a hundredth of a second. The algorithm of Girvan and Newman took a little over a second. FIG. 2: Dendrogram of the communities found by our algo- Atimedifference of this magnitude will not present rithm in the “karate club” network of Zachary [5, 17]. The abigprobleminmostpracticalsituations,butperfor- shapes of the vertices represent the two groups into which the club split as the result of an internal dispute. mance rapidly becomes an issue when we look at larger networks; we expect the ratio of running times to increase with the number of vertices. Thus, for example, in applying our algorithm to the 1275-node network of not to continue using it—it appears to give the best jazz musician collaborations mentioned above, we found results. For systems too large to make use of this ap- that it runs to completion in about one second of CPU proach, however, our new algorithm gives useful com- time. The GN algorithm by contrast takes more than munity structure information with comparatively little three hours to reach very similar results. effort. As an example of an analysis made possible by the We have applied our algorithm to a variety of real- speed of the new algorithm, we have looked at a network world networks also. We have looked, for example, at of collaborations between physicists as documented by the “karate club” network studied in [5], which represents papers posted on the widely-used Physics E-print Archive friendships between 34 members of a club at a US univer- at arxiv.org.Thenetworkisanupdatedversionofthe sity, as recorded over a two-year period by Zachary [17]. one described in Ref. 13, in which scientists are consid- During the course of the study, the club split into two ered connected if they have coauthored one or more pa- groups as a result of a dispute within the organization, pers posted on the archive. We analyze only the largest and the members of one group left to start their own component of the network, which contains n =56276sci- club. In Fig. 2 we show the dendrogram derived by feed- entists in all branches of physics covered by the archive. ing the friendship network into our algorithm. The peak Since two vertices that are unconnected by any path are Q . modularity is =0381 and corresponds to a split into never put in the same community by our algorithm, the two groups of 17, as shown in the figure. The shapes of small fraction of vertices that are not part of the largest the vertices represent the alignments of the club mem- component can safely be assumed to be in separate com- bers following the split and, as we can see, the division munities in the sense of our algorithm. Our algorithm found by the algorithm corresponds almost perfectly to takes 42 minutes to find the full community structure. these alignments; only one vertex, number 10, is classified Our best estimates indicate that the GN algorithm would wrongly. The GN algorithm performs similarly on this take somewhere between three and five years to complete task, but not better—it also finds the split but classifies its version of the same calculation. one vertex wrongly (although a different one, vertex 3). The analysis reveals that the network in question con- In other tests, we find that our algorithm also success- sists of about 600 communities, with a high peak modu- fully detects the main two-way division of the dolphin larity of Q =0.713, indicating strong community struc- social network of Lusseau [6, 18], and the division be- ture in the physics world. Four of the communities found tween black and white musicians in the jazz network of are large, containing between them 77% of all the ver- Gleiser and Danon [11]. tices, while the others are small—see Fig. 4, left panel. As a demonstration of how our algorithm can some- The four large communities correspond closely to subject times miss some of the structure in a network, we take subareas: one to astrophysics, one to high-energy physics, another example from Ref.

Load more