2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum

Scalable Multi-threaded Community Detection in Social Networks

Jason Riedy and David A. Bader
College of Computing
Georgia Institute of Technology
Atlanta, GA, USA

Henning Meyerhenke
Institute of Theoretical Informatics
Karlsruhe Institute of Technology
Karlsruhe, Germany

Abstract—The volume of existing graph-structured data requires improved parallel tools and algorithms. Finding communities, smaller subgraphs densely connected within the subgraph than to the rest of the graph, plays a role both in developing new parallel algorithms as well as opening smaller portions of the data to current analysis tools. We improve performance of our parallel community detection algorithm by 20% on the massively multithreaded Cray XMT, evaluate its performance on the next-generation Cray XMT2, and extend its reach to Intel-based platforms with OpenMP. To our knowledge, not only is this the first massively parallel community detection algorithm, but it is also the only such algorithm that achieves excellent performance and good parallel scalability across all these platforms. Our implementation analyzes a moderate sized graph with 105 million vertices and 3.3 billion edges in around 500 seconds on a four processor, 80-logical-core Intel-based system and 1100 seconds on a 64-processor Cray XMT2.

I. INTRODUCTION

Graph-structured data inundates daily electronic life. Its volume outstrips the capabilities of nearly all analysis tools. The Facebook friendship network has over 800 million users each with an average of 130 connections [1]. Twitter boasts over 140 million new messages each day [2], and the NYSE processes over 300 million trades each month [3]. Applications of analysis range from database optimization to marketing to regulatory monitoring. Much of the structure within these data stores lies out of reach of current global graph analysis kernels.

One such useful analysis kernel finds smaller communities, subgraphs that locally optimize some connectivity criterion, within these massive graphs. These smaller communities can be analyzed more thoroughly or form the basis for multi-level algorithms. Previously, we introduced the first massively parallel algorithm for detecting communities in massive graphs [4]. Our implementation on the Cray XMT scaled to massive graphs but relied on platform-specific features. Here we both improve the algorithm and extend its reach to OpenMP systems. Our new algorithm scales well on two generations of Cray XMTs and on Intel-based server platforms.

Community detection is a graph clustering problem. There is no single, universally accepted definition of a community within a social network. One popular definition is that a community is a collection of vertices more strongly connected than would occur from random chance, leading to methods based on modularity [5]. Another definition [6] requires vertices to be more connected to others within the community than to those outside, either individually or in aggregate. This aggregate measure leads to minimizing the communities' conductance. We consider disjoint partitioning of a graph into connected communities guided by a local optimization criterion. Beyond obvious visualization applications, a disjoint partitioning applies usefully to classifying related genes by primary use [7] and also to simplifying large organizational structures [8] and metabolic pathways [9].

The next section briefly touches on related work, including our prior results. Section III reviews the high-level parallel agglomerative community detection algorithm. Section IV dives into details on the data structures and algorithms, mapping each to the Cray XMT and Intel-based OpenMP platforms' threading architectures. Section V shows that our current implementation achieves speed-ups on real-world data of up to 29.6× on a 64-processor Cray XMT2 and 13.7× on a four processor, 40-physical-core Intel-based platform. On the uk-2007-05 web crawl graph with 105 million vertices and 3.3 billion edges, our algorithm analyzes the community structure in around 500 seconds on a four processor, 80-logical-core Intel-based system and 1100 seconds on a 64-processor Cray XMT2.

II. RELATED WORK

Graph partitioning, graph clustering, and community detection are tightly related topics. A recent survey by Fortunato [10] covers many aspects of community detection with an emphasis on modularity maximization. Nearly all existing work of which we know is sequential, with a very recent notable exception for modularity maximization on GPUs [11]. Many algorithms target specific contraction edge scoring or vertex move mechanisms [12]. Our previous work [4] established the first parallel agglomerative algorithm for community detection and provided results on the Cray XMT. Prior modularity-maximizing algorithms sequentially maintain and update priority queues [13], and we replace the queue with a weighted graph matching. Here we improve our algorithm, update its termination criteria, and achieve scalable performance on Intel-based platforms.

Zhang et al. [14] recently proposed a parallel algorithm that identifies communities based on a custom metric rather than modularity. Gehweiler and Meyerhenke [15] proposed a distributed diffusive heuristic for implicit modularity-based graph clustering.

Work on sequential multilevel agglomerative algorithms like [16] focuses on edge scoring and local refinement. Our algorithm is agnostic towards edge scoring methods and can benefit from any problem-specific methods. Another related approach for graph clustering is due to Blondel et al. [17]. However, it does not use matchings and has not been designed with parallelism in mind. Incorporating refinement into our parallel algorithm is an area of active work. The parallel approach is similar to existing multilevel graph partitioning algorithms that use matchings for edge contractions [18], [19] but differs in optimization criteria and in not enforcing that the partitions must be of balanced size.

III. PARALLEL AGGLOMERATIVE COMMUNITY DETECTION

Agglomerative clustering algorithms begin by placing every input graph vertex within its own unique community. Then neighboring communities are merged to optimize an objective like maximizing modularity [5], [20], [21] (internal connectedness) or minimizing conductance (normalized edge cut) [22]. Here we summarize the algorithm and break it into primitive operations. Section IV then maps each primitive onto our target threaded platforms.

We consider maximizing metrics (without loss of generality) and target a local maximum rather than a global, possibly non-approximable, maximum. There are a wide variety of metrics for community detection [10]. We will not discuss the metrics in detail here; more details are in the references above and our earlier work [4].

Our algorithm maintains a community graph where every vertex represents a community, edges connect communities when they are neighbors in the input graph, and weights count the number of input graph edges either collapsed into a single community graph edge or contained within a community graph vertex. We currently do not require counting the vertices in each community, but such an extension is straight-forward.

From a high level, our algorithm repeats the following steps until reaching some termination criterion (a sketch of this loop appears after the list):

1) associate a score with each edge in the community graph, exiting if no edge has a positive score,
2) greedily compute a weighted maximal matching using those scores, and
3) contract matched communities into a new community graph.
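The following is a minimal C sketch of that outer loop. The function names (score_edges, heavy_matching, contract, keep_going) and the cgraph type are hypothetical stand-ins for the primitives described in the list, not the authors' actual interfaces.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical driver for the agglomerative loop sketched above.
 * The three primitives are assumed to be provided elsewhere. */
typedef struct cgraph cgraph;   /* community graph (opaque here) */

extern int    score_edges(cgraph *g, double *score);     /* step 1: returns count of positive scores */
extern size_t heavy_matching(cgraph *g, const double *score, int64_t *match); /* step 2 */
extern void   contract(cgraph *g, const int64_t *match);                      /* step 3 */
extern int    keep_going(const cgraph *g);  /* external criterion, e.g. coverage still below 0.5 */

void agglomerate(cgraph *g, double *score, int64_t *match)
{
  while (keep_going(g)) {
    if (score_edges(g, score) == 0)
      break;                    /* no positive score: local maximum reached */
    if (heavy_matching(g, score, match) == 0)
      break;                    /* nothing left to merge */
    contract(g, match);         /* build the next, smaller community graph */
  }
}
```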

Each step is a primitive parallel operation. The first step scores edges by how much the optimization metric would change if the two adjacent communities merge. Computing the change in modularity and conductance requires only the weight of the edge and the weight of the edge's adjacent communities. The change in conductance is negated to convert minimization into maximization.

The second step, a greedy approximately maximum weight maximal matching, selects pairs of neighboring communities where merging them will improve the community metric. The pairs are independent; a community appears at most once in the matching. Properties of the greedy algorithm guarantee that the matching's weight is within a factor of two of the maximum possible value [23]. Any positive-weight matching suffices for optimizing community metrics. Some community metrics, including modularity [24], form NP-complete optimization problems, so additional work improving the matching may not produce better results. Our approach follows existing parallel algorithms [25], [26]. Differences appear in mapping the matching algorithm to our data structures and platforms.

The final step contracts the community graph according to the matching. This contraction primitive requires the bulk of the time even though there is little computation. The impact of the intermediate data structure on improving multithreaded performance is explained in Section IV.

Termination occurs either when the algorithm finds a local maximum or according to external constraints. If no edge score is positive, no contraction increases the objective, and the algorithm terminates at a local maximum. In our experiments with modularity, our algorithm frequently assigns a single community per connected component, a useless local maximum. Real applications will impose additional constraints like a minimum number of communities or a maximum community size. Following the spirit of the 10th DIMACS Implementation Challenge rules [27], Section V's performance experiments terminate once at least half the initial graph's edges are contained within the communities, a coverage ≥ 0.5.

Assuming all edges are scored in a total of O(|E_c|) operations and some heavy weight maximal matching is computed in O(|E_c|) operations [23], where E_c is the edge set of the current community graph, each iteration of our algorithm's inner loop requires O(|E|) operations. As with other algorithms, the total operation count depends on the community growth rates. If our algorithm halts after K contraction phases, it runs in O(|E| · K) operations, where the number of edges in the original graph, |E|, bounds the number in any community graph. If the community graph is halved with each iteration, our algorithm requires O(|E| · log |V|) operations, where |V| is the number of vertices in the input graph. If the graph is a star, only two vertices are contracted per step and our algorithm requires O(|E| · |V|) operations. This matches experience with the sequential CNM algorithm [28] and similar parallel implementations [11].
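As a concrete illustration of the step-1 scoring, the hedged C sketch below computes the change in modularity from merging the two communities joined by an edge, under the standard Clauset–Newman–Moore convention. It uses only the quantities mentioned above (the edge weight, the two communities' total edge-end weights, and the graph's total edge weight); the function name and argument layout are ours, and the implementation's exact bookkeeping may differ.

```c
/* Change in modularity Q = sum_c (e_cc - a_c^2) when merging communities
 * i and j, using the common convention
 *   e_ij = w_ij  / (2 * W)   fraction of edge ends joining i and j
 *   a_i  = vol_i / (2 * W)   fraction of edge ends attached to community i
 * so that the merge changes Q by 2 * (e_ij - a_i * a_j).
 *   w_ij:          weight of the community-graph edge {i, j}
 *   vol_i, vol_j:  total weight of edge ends (volume) of each community
 *   W:             total edge weight of the input graph                  */
static double
delta_modularity(double w_ij, double vol_i, double vol_j, double W)
{
  const double e_ij = w_ij / (2.0 * W);
  const double a_i  = vol_i / (2.0 * W);
  const double a_j  = vol_j / (2.0 * W);
  return 2.0 * (e_ij - a_i * a_j);
}
```

A quick sanity check: for a single edge of weight 1 between two singleton communities (W = 1, both volumes 1), the function returns 0.5, matching the move from Q = -1/2 for the two singletons to Q = 0 after the merge.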

IV. MAPPING THE AGGLOMERATIVE ALGORITHM TO THREADED PLATFORMS

Our implementation targets two multithreaded programming environments, the Cray XMT [29] and OpenMP [30], both based on the C language. Both provide a flat, shared-memory view of data but differ in how they manage parallelism. However, both environments intend that ignoring the parallel directives produces correct, although sequential, C code. The Cray XMT environment focuses on implicit, automatic parallelism, while OpenMP requires explicit management.

The Cray XMT architecture tolerates high memory latencies from physically distributed memory using massive multithreading. There is no cache in the processors; all latency is handled by threading. Programmers do not directly control the threading but work through the compiler's automatic parallelization with occasional pragmas providing hints to the compiler. There are no explicit parallel regions. Threads are assumed to be plentiful and fast to create. Current XMT and XMT2 hardware supports over 100 hardware thread contexts per processor. Unique to the Cray XMT are full/empty bits on every 64-bit word of memory. A thread reading from a location marked empty blocks until the location is marked full, permitting very fine-grained synchronization amortized over the cost of memory access. The full/empty bits assist automatic parallelization of a wider variety of data-dependent loops.

The widely supported OpenMP industry standard provides more traditional, programmer-managed threading. Parallel regions are annotated explicitly through compiler pragmas. Every loop within a parallel region must be annotated as a work-sharing loop or else every thread will run the entire loop. OpenMP supplies a lock data type which must be allocated and managed separately from reading or writing the potentially locked memory. OpenMP also supports tasks and methods for interaction, but our algorithm does not require them.

A. Graph representation

We use the same core data structure as our earlier work [4] and represent a weighted, undirected graph with an array of triples (i, j, w) for edges between vertices i and j with i ≠ j. We accumulate repeated edges by adding their weights. The sums of weights for self-loops, i = j, are stored in a |V|-long array. To save space, we store each edge only once, similar to storing only one triangle of a symmetric matrix.

Unlike our earlier work, however, the array of triples is kept in buckets defined by the first index i, and we hash the order of i and j rather than storing the strictly lower triangle. If i and j are both even or both odd, then the indices are stored such that i < j; otherwise they are stored such that i > j. This scatters the edges associated with high-degree vertices across different source vertex buckets.

The buckets need not be sequential. We store beginning and ending indices into the edge array for each vertex. In a traditional compressed sparse matrix format, the entries adjacent to vertex i + 1 follow those adjacent to i. Permitting the buckets to separate reduces synchronization within graph contraction. We store both i and j for easy parallelization across the entire edge array. Because edges are stored only once, edge {i, j} could appear in the bucket for either i or j but not both.

A graph with |V| vertices and |E| non-self, unique edges requires space for 3|V| + 3|E| 64-bit integers plus a few additional scalars to store |V|, |E|, and other book-keeping data.

B. Scoring and matching

Each edge's score is an independent calculation for our metrics. An edge {i, j} requires its weight, the self-loop weights for i and j, and the total weight of the graph. Parallel computation of the scores is straight-forward, and we store the edge scores in an |E|-long array of 64-bit floating point data.

Computing the heavy maximal matching is less straight-forward. We repeatedly sweep across the vertices and find the best adjacent match until all vertices are either matched or have no potential matches. The algorithm is non-deterministic when run in parallel. Different executions on the same data may produce different maximal matchings.

Our earlier implementation iterated in parallel across all of the graph's edges on each sweep and relied heavily on the Cray XMT's full/empty bits for synchronization of the best match for each vertex. This produced frequent hot spots, memory words of high contention, but worked sufficiently well with nearly no programming effort. The hot spots crippled an explicitly locking OpenMP implementation of the same algorithm on Intel-based platforms.

We have updated the matching to maintain an array of currently unmatched vertices. We parallelize across that array, searching from each unmatched vertex u for its best unmatched neighbor, v. Once each unmatched vertex u finds its best current match, the vertex checks if the other side v (also unmatched) has a better match. We induce a total ordering by considering first the score and then the vertex indices. If the current vertex u's choice is better, it claims both sides using locks or full/empty bits to maintain consistency. Another pass across the unmatched vertex list checks if the claims succeeded. If not, and there was some unmatched neighbor, the vertex u remains on the list for another pass. At the end of all passes, the matching will be maximal. Strictly speaking this is not an O(|E|) algorithm, but the number of passes is small enough in social network graphs that it runs in effectively O(|E|) time.

If edge {i, j} dominates the scores adjacent to i and j, that edge will be found by one of the two vertices. The algorithm is equivalent to a different ordering of existing parallel algorithms [25], [26] and also produces a maximal matching with weight (total score) within a factor of two of the maximum.
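To make the two-sided claim concrete, here is a hedged OpenMP fragment showing one way a vertex u could claim the pair {u, v} once it has decided v is its best unmatched neighbor. It acquires the two per-vertex locks in index order so that two concurrent claimers cannot deadlock; it illustrates only the claim step (the score-based total ordering and the repeated passes are omitted) and is not necessarily the exact protocol in the released implementation.

```c
#include <omp.h>
#include <stdint.h>

/* match[x] < 0 means vertex x is still unmatched; lk[x] guards match[x].
 * Assumes u != v (a vertex never tries to match with itself). */
static int
try_claim_pair(int64_t u, int64_t v, int64_t *match, omp_lock_t *lk)
{
  const int64_t a = (u < v) ? u : v;   /* lock in a fixed (index) order   */
  const int64_t b = (u < v) ? v : u;   /* so two claimers cannot deadlock */
  int claimed = 0;

  omp_set_lock(&lk[a]);
  omp_set_lock(&lk[b]);
  if (match[u] < 0 && match[v] < 0) {  /* both ends still free? */
    match[u] = v;
    match[v] = u;
    claimed = 1;
  }
  omp_unset_lock(&lk[b]);
  omp_unset_lock(&lk[a]);
  return claimed;                      /* 0: keep u on the list for another pass */
}
```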

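For reference, a minimal C sketch of the bucketed edge-array layout from Section IV-A, which both the matching sweep above and the contraction below operate on. The struct and field names are ours, and the parity rule encodes our reading of the ordering hash; it is not the released code.

```c
#include <stdint.h>

/* Bucketed triple storage: each stored edge lives in some vertex's bucket,
 * delimited by [bucket_start[v], bucket_end[v]) into the parallel arrays
 * i[], j[], w[].  Self-loop weights are kept separately, giving the stated
 * 3|V| + 3|E| words of 64-bit storage. */
typedef struct {
  int64_t nv, ne;          /* |V| and |E| (non-self, unique edges)          */
  int64_t *bucket_start;   /* |V|-long: first slot of v's bucket            */
  int64_t *bucket_end;     /* |V|-long: one past the last slot              */
  int64_t *i, *j;          /* |E|-long endpoint arrays                      */
  int64_t *w;              /* |E|-long accumulated edge weights             */
  int64_t *self_w;         /* |V|-long self-loop (within-community) weights */
} bucketed_graph;

/* Order the endpoints so high-degree vertices do not own every edge:
 * same parity -> smaller index stored first, mixed parity -> larger first. */
static inline void
order_endpoints(int64_t *a, int64_t *b)
{
  const int same_parity = ((*a ^ *b) & 1) == 0;
  const int swap = same_parity ? (*a > *b) : (*a < *b);
  if (swap) { int64_t t = *a; *a = *b; *b = t; }
}
```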
Social networks often follow a power-law distribution of vertex degrees. The few high-degree vertices may have large adjacent edge buckets, and not iterating across the bucket in parallel may decrease performance. However, neither the Cray XMT nor OpenMP implementations currently support efficiently composing general, nested, light-weight parallel loops unless they fit one of a few textual patterns. Rather than trying to separate out the high-degree lists, we scatter the edges according to the graph representation's hashing. This appears sufficient for high performance in our experiments. Our improved matching's performance gains over our original method are marginal on the Cray XMT but drastic on Intel-based platforms using OpenMP. Scoring and matching together require |E| + 4|V| 64-bit integers plus an additional |V| locks on OpenMP platforms.

C. Graph contraction

Contracting the agglomerated community graph requires from 40% to 80% of the execution time. Our previous implementation was efficient on the Cray XMT but infeasible on OpenMP platforms. We use the bucketing method to avoid locking and improve performance for both platforms.

Our prior implementation used a technique due to John T. Feo where edges are associated to linked lists by a hash of the vertices. After relabeling an edge's vertices to their new vertex numbers, the associated linked list is searched for that edge. If it exists, the weights are added. If not, the edge is appended to the list. This needs only |E| + |V| additional storage but relies heavily on the Cray XMT's full/empty bits and ability to chase linked lists efficiently. The amount of locking and overhead in iterating over massive, dynamically changing linked lists rendered a similar implementation on Intel-based platforms using OpenMP infeasible.

Our new implementation uses an additional |E| space. After relabeling the vertex endpoints and re-ordering their storage according to the hashing, we roughly bucket sort by the first stored vertex in each edge. If a stored edge is (i, j; w), we place (j; w) into a bucket associated with vertex i but leave i implicitly defined by the bucket. Within each bucket, we sort by j and accumulate identical edges, shortening the bucket. The buckets then are copied back out into the original graph's storage, filling in the i values. This requires |V| + 1 + 2|E| storage, more than our original.

Because the buckets need not be stored contiguously in increasing vertex order, the bucketing and copying do not need to synchronize beyond an atomic fetch-and-add. Storing the buckets contiguously requires synchronizing on a prefix sum to compute bucket offsets. We have not timed the difference, but the technique is interesting.
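A hedged sketch of the bucket-scatter step described above: after relabeling, each edge is appended to its first endpoint's bucket, with an atomic fetch-and-add reserving a slot so no locks are needed. The array names are ours, the buckets are assumed to be pre-reserved (slot[] holding their base offsets), and the per-bucket sort and accumulation are only indicated in a comment; this shows the synchronization pattern, not the production code.

```c
#include <stdint.h>

/* Scatter relabeled edges into per-vertex buckets without locking.
 * head[v] counts how many edges have landed in v's bucket so far;
 * slot[v] is the base offset of v's pre-reserved bucket in bkt_j/bkt_w.
 * __sync_fetch_and_add is the GCC builtin; the Cray XMT compiler offers
 * an analogous int_fetch_add generic. */
static void
scatter_to_buckets(int64_t ne,
                   const int64_t *i, const int64_t *j, const int64_t *w,
                   const int64_t *slot, int64_t *head,
                   int64_t *bkt_j, int64_t *bkt_w)
{
  #pragma omp parallel for
  for (int64_t k = 0; k < ne; ++k) {
    const int64_t v   = i[k];                       /* bucket owner   */
    const int64_t off = slot[v] + __sync_fetch_and_add(&head[v], 1);
    bkt_j[off] = j[k];                              /* i is implicit  */
    bkt_w[off] = w[k];
  }
  /* Next (not shown): sort each bucket by j, accumulate duplicate j's,
   * then copy the shortened buckets back into the graph's edge arrays. */
}
```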

V. PARALLEL PERFORMANCE

We evaluate parallel performance on a wide range of threaded hardware including two generations of Cray XMT systems and three different Intel-based systems. We use two graphs, one real and one artificial, to demonstrate scaling and investigate performance properties. Each experiment is run three times to capture some of the variability in platforms and in our non-deterministic algorithm. We focus on multithreaded performance and leave evaluation of the resulting communities across many metrics to future work. Smaller graphs' resulting modularities appear reasonable compared with results from a different, sequential implementation in SNAP [31].

Our current implementation achieves speed-ups on artificial data of up to 24.8× on a 64-processor Cray XMT2 and 16.5× on a four processor, 40-physical-core Intel-based platform. Smaller real-world data achieves smaller speed-ups. Comparing similar runs of our current Cray XMT implementation with our earlier work shows around a 20% improvement. Our earlier OpenMP implementation executed too slowly to evaluate.

A. Evaluation platforms

The Cray XMT used for these experiments is located at Pacific Northwest National Lab and contains 1 TiB of 533 MHz DDR RAM and 128 Threadstorm processors running at 500 MHz. These 128 processors support over 12 000 hardware thread contexts.

TABLE I
PROCESSOR CHARACTERISTICS FOR OUR TEST PLATFORMS.

  Processor      | # proc. | Max. threads/proc. | Proc. speed
  Cray XMT       | 128     | 100                | 500 MHz
  Cray XMT2      | 64      | 102                | 500 MHz
  Intel E7-8870  | 4       | 20                 | 2.40 GHz
  Intel X5650    | 2       | 12                 | 2.66 GHz
  Intel X5570    | 2       | 8                  | 2.93 GHz

TABLE II
SIZES OF GRAPHS USED FOR PERFORMANCE EVALUATION.

  Graph            | |V|         | |E|           | Reference
  rmat-24-16       | 15 580 378  | 262 482 711   | [32], [33]
  soc-LiveJournal1 | 4 847 571   | 68 993 773    | [34]
  uk-2007-05       | 105 896 555 | 3 301 876 564 | [35]

The next generation Cray XMT2 is located at the Swiss National Supercomputing Centre (CSCS). Its 64 updated processors also run at 500 MHz but support four times the memory density for a total of 2 TiB on half the nodes of PNNL's XMT. The improvements also include additional memory bandwidth within a node, but exact specifications are not yet officially available.

The Intel-based server platforms are located at Georgia Tech. One has four ten-core Intel Xeon E7-8870 processors running at 2.40 GHz with 30 MiB of L3 cache per processor. The processors support Hyper-Threading, so the 40 physical cores appear as 80 logical cores. This server, mirasol, is ranked #17 in the November 2011 Graph 500 list and is equipped with 256 GiB of 1 067 MHz DDR3 RAM. Two additional servers have two six-core Intel Xeon X5650 processors running at 2.66 GHz with 12 MiB of L3 cache each and two four-core Intel Xeon X5570 processors at 2.93 GHz with 8 MiB L3 cache. They have 24 GiB and 48 GiB of 1 067 MHz DDR3 RAM, respectively. Both support two Hyper-Threads per core.

Note that the Cray XMT allocates entire processors, each with at least 100 threads, while the OpenMP platforms allocate individual threads which are mapped to cores. We show results per-processor on the XMT and per-thread for OpenMP. We run up to the number of physical Cray XMT processors or logical Intel cores. Intel cores are allocated in a round-robin fashion across sockets, then across physical cores, and finally across logical cores. Table I summarizes the processor counts, threads per processor, and speeds for our test platforms.

B. Test graphs

We evaluate on two moderate-sized graphs. Excessive single-processor runs on highly utilized resources are discouraged, rendering scaling studies using large graphs difficult. Table II shows the graphs' names and number of vertices and edges.

Our first graph is an artificial R-MAT [32], [33] graph derived by sampling from a perturbed Kronecker product. R-MAT graphs are scale-free and reflect many properties of real social networks but are known not to possess significant community structure [36]. We generate an R-MAT graph with parameters a = 0.55, b = c = 0.1, and d = 0.25 and extract the largest component. An R-MAT generator takes a scale s, here 24, and edge factor f, here 16, as input and generates a sequence of 2^s · f edges over 2^s vertices, including self-loops and repeated edges. We accumulate multiple edges within edge weights and then extract the largest connected component.

Our second graph is an anonymous snapshot of the LiveJournal friendship network from the Stanford Large Network Dataset Collection [34]. The graph is relatively small but is rich with community structures. There are no self-loops or multiple edges, and all weights initially are set to one.

The final, large graph was generated by a web crawl of English sites in 2007 [35]. This graph is too large for the smaller Intel-based platforms and triggers platform bugs on the Cray XMT, so we consider it only on the larger Intel E7-8870 and the updated Cray XMT2.
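For concreteness, a minimal sketch of how a single R-MAT edge can be drawn with the parameters above (a = 0.55, b = c = 0.1, d = 0.25 implied). This is only the textbook recursive quadrant selection without the perturbation mentioned above, it is not the generator used for the experiments, and rand()/RAND_MAX stands in for whatever random source that generator uses.

```c
#include <stdlib.h>
#include <stdint.h>

/* Draw one R-MAT edge for scale s (2^s vertices): at each of the s levels,
 * pick a quadrant of the adjacency matrix with probabilities a, b, c, and
 * d = 1 - a - b - c, and descend into it.  Self-loops and repeated edges
 * are allowed; they are later accumulated into weights. */
static void
rmat_edge(int scale, double a, double b, double c,
          int64_t *src, int64_t *dst)
{
  int64_t i = 0, j = 0;
  for (int level = 0; level < scale; ++level) {
    const double r = (double)rand() / RAND_MAX;
    const int64_t bit = INT64_C(1) << level;
    if (r < a)              { /* quadrant (0,0): neither bit set */ }
    else if (r < a + b)     { j |= bit; }            /* (0,1): column bit */
    else if (r < a + b + c) { i |= bit; }            /* (1,0): row bit    */
    else                    { i |= bit; j |= bit; }  /* (1,1): both bits  */
  }
  *src = i;
  *dst = j;
}

/* Example: scale 24 with edge factor 16 calls this 16 * (1 << 24) times. */
```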

C. Time and parallel speed-up

Figure 1 shows the execution time as a function of allocated OpenMP threads or Cray XMT processors, separated by platform and graph, and Table III scales the time by the number of edges to show the peak processing rate. Figure 2 translates the time into speed-up against the best single-thread or single-processor execution time. Figure 3 shows the larger uk-2007-05 graph's data. Note that the uk-2007-05 graph uses 32-bit integers for vertex labels on the Intel-based platform to fit in memory.

TABLE III
THE PEAK PROCESSING RATE ACHIEVED IN EDGES PER SECOND OF THE INPUT GRAPH OVER THE FASTEST TIME. OUR IMPLEMENTATION SCALES TO LARGE DATA SETS.

  Platform | soc-LiveJournal1 | rmat-24-16  | uk-2007-05
  X5570    | 3.89 × 10^6      | 1.83 × 10^6 |
  X5650    | 4.98 × 10^6      | 2.54 × 10^6 |
  E7-8870  | 6.90 × 10^6      | 5.86 × 10^6 | 6.54 × 10^6
  XMT      | 0.41 × 10^6      | 1.20 × 10^6 |
  XMT2     | 1.73 × 10^6      | 2.11 × 10^6 | 3.11 × 10^6

Comparing the Cray XMT to the Cray XMT2 shows substantial performance improvements in the new generation. While details are not publicly available, we suspect increased memory bandwidth related to wider memory paths and an upgrade from slow DDR memory are responsible. The variation in time on the Cray XMT2 appears related to finding different community structures and is a consequence of our non-deterministic algorithm. Monitoring execution shows that the XMT compiler under-allocates threads in portions of the code, leading to bursts of poor processor utilization. Explicitly annotating the loops with thread counts may increase performance.

Trading the earlier list-chasing implementation for our new algorithm with bursts of sequential access shows good performance on Intel-based platforms. Within the Intel-based platforms we see speed-ups past physical cores and into logical cores with Hyper-Threading. The 2.40 GHz E7-8870 achieves better single-core performance than the 2.93 GHz X5570 but lower performance than the 2.66 GHz X5650. The X5570 is an earlier generation of processor with a less advanced memory controller that supports fewer outstanding transactions. This data is insufficient to see if a single, slower E7-8870's additional cores can outperform the faster X5650's fewer cores. With twice the processors, the E7-8870-based system performs more than twice as quickly as the X5650 on the artificial R-MAT graph but only slightly faster on the real-world soc-LiveJournal1 graph.

The reproducible drop in performance on the X5650 at full processor count on soc-LiveJournal1 may be related to finding a different community structure. While still under investigation, there may be an unfortunate locking interaction within the matching phase. A similar but smaller issue occurs on the Cray XMT with soc-LiveJournal1 above 64 processors. Except for the one case on the X5650, the best performance on Intel-based platforms always occurred at full utilization. The smaller size of soc-LiveJournal1 provides insufficient parallelism for large processor counts on the XMTs, but the larger uk-2007-05 shows scaling similar to the artificial rmat-24-16.

VI. OBSERVATIONS

Our improved parallel agglomerative community detection algorithm demonstrates high performance, good parallel scalability, and good data scalability on both the Cray XMT series and more common Intel-based platforms using OpenMP. We achieved high performance by carefully choosing data structures and algorithms that map well to both environments. Reducing linked list walking, locking, and synchronization improves performance on both the Cray XMT and Intel-based architectures. Our implementation is publicly available at http://www.cc.gatech.edu/~jriedy/community-detection/.

Outside of the edge scoring, our algorithm relies on well-known primitives that exist for many execution models. Much of the algorithm can be expressed through sparse matrix operations, which may lead to explicitly distributed memory implementations through the Combinatorial BLAS [37] or possibly cloud-based implementations through environments like Pregel [38]. The performance trade-offs for graph algorithms between these different environments and architectures remain poorly understood.

Fig. 1. Execution time against allocated OpenMP threads or Cray XMT processors per platform and graph. The best single-processor and overall times are noted in the plot. The blue dashed line extrapolates perfect speed-up from the best single-processor time.

Fig. 2. Parallel speed-up relative to the best single-processor execution. The best achieved speed-up is noted on the plot. The dotted line denotes perfect speed-up matching the number of processors.

Fig. 3. Parallel speed-up relative to the best single-processor execution on the uk-2007-05 graph.

ACKNOWLEDGMENTS

This work was supported in part by the Pacific Northwest National Lab (PNNL) Center for Adaptive Supercomputing Software for Multi-Threaded Architectures (CASS-MT), NSF Grant CNS-0708307, and the Intel Labs Academic Research Office for the Parallel Algorithms for Non-Numeric Computing Program. We thank PNNL and the Swiss National Supercomputing Centre for providing access to Cray XMT systems. Attendees of the 10th DIMACS Implementation Challenge and anonymous reviewers from MTAAP and within Oracle provided helpful comments.

REFERENCES

[1] Facebook, Inc., “User statistics,” March 2011, http://www.facebook.com/press/info.php?statistics.
[2] Twitter, Inc., “Happy birthday Twitter!” March 2011, http://blog.twitter.com/2011/03/happy-birthday-twitter.html.
[3] NYSE Euronext, “Consolidated volume in NYSE listed issues, 2010 - current,” March 2011, http://www.nyxdata.com/nysedata/asp/factbook/viewer_edition.asp?mode=table&key=3139&category=3.
[4] E. J. Riedy, H. Meyerhenke, D. Ediger, and D. A. Bader, “Parallel community detection for massive graphs,” in Proc. 9th International Conference on Parallel Processing and Applied Mathematics, Torun, Poland, Sep. 2011.
[5] M. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Phys. Rev. E, vol. 69, no. 2, p. 026113, Feb. 2004.
[6] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi, “Defining and identifying communities in networks,” Proc. of the National Academy of Sciences, vol. 101, no. 9, p. 2658, 2004.
[7] D. M. Wilkinson and B. A. Huberman, “A method for finding communities of related genes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. Suppl 1, pp. 5241–5248, 2004.
[8] S. Lozano, J. Duch, and A. Arenas, “Analysis of large social datasets by community detection,” The European Physical Journal - Special Topics, vol. 143, pp. 257–259, 2007.
[9] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A.-L. Barabási, “Hierarchical organization of modularity in metabolic networks,” Science, vol. 297, no. 5586, pp. 1551–1555, 2002.
[10] S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3-5, pp. 75–174, 2010.



[11] B. O. F. Auer and R. H. Bisseling, “10th DIMACS implementation challenge: Graph coarsening and clustering on the GPU,” 2012, http://www.cc.gatech.edu/dimacs10/papers/[16]-gpucluster.pdf.
[12] R. Görke, A. Schumm, and D. Wagner, “Experiments on density-constrained graph clustering,” in Proc. Algorithm Engineering and Experiments (ALENEX12), 2012.
[13] A. Clauset, M. Newman, and C. Moore, “Finding community structure in very large networks,” Physical Review E, vol. 70, no. 6, p. 66111, 2004.
[14] Y. Zhang, J. Wang, Y. Wang, and L. Zhou, “Parallel community detection on large networks with propinquity dynamics,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’09. New York, NY, USA: ACM, 2009, pp. 997–1006.
[15] J. Gehweiler and H. Meyerhenke, “A distributed diffusive heuristic for clustering a virtual P2P supercomputer,” in Proc. 7th High-Performance Grid Computing Workshop (HGCW’10) in conjunction with 24th Intl. Parallel and Distributed Processing Symposium (IPDPS’10). IEEE Computer Society, 2010.
[16] A. Noack and R. Rotta, “Multi-level algorithms for modularity clustering,” in Experimental Algorithms, ser. Lecture Notes in Computer Science, J. Vahrenhold, Ed. Springer, 2009, vol. 5526, pp. 257–268.
[17] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities in large networks,” J. Stat. Mech., p. P10008, 2008.
[18] G. Karypis and V. Kumar, “Multilevel k-way partitioning scheme for irregular graphs,” Journal of Parallel and Distributed Computing, vol. 48, no. 1, pp. 96–129, 1998.
[19] M. Holtgrewe, P. Sanders, and C. Schulz, “Engineering a scalable high quality graph partitioner,” in Proc. 24th IEEE Intl. Symposium on Parallel and Distributed Processing (IPDPS’10), 2010, pp. 1–12.
[20] D. Bader and J. McCloskey, “Modularity and graph algorithms,” Sep. 2009, presented at UMBC.
[21] M. Newman, “Modularity and community structure in networks,” Proc. of the National Academy of Sciences, vol. 103, no. 23, pp. 8577–8582, 2006.
[22] R. Andersen and K. Lang, “Communities from seed sets,” in Proc. of the 15th Int’l Conf. on World Wide Web. ACM, 2006, p. 232.
[23] R. Preis, “Linear time 1/2-approximation algorithm for maximum weighted matching in general graphs,” in STACS 99, ser. Lecture Notes in Computer Science, C. Meinel and S. Tison, Eds. Springer Berlin / Heidelberg, 1999, vol. 1563, pp. 259–269.
[24] U. Brandes, D. Delling, M. Gaertler, R. Görke, M. Hoefer, Z. Nikoloski, and D. Wagner, “On modularity clustering,” IEEE Trans. Knowledge and Data Engineering, vol. 20, no. 2, pp. 172–188, 2008.
[25] J.-H. Hoepman, “Simple distributed weighted matchings,” CoRR, vol. cs.DC/0410047, 2004.
[26] F. Manne and R. Bisseling, “A parallel approximation algorithm for the weighted maximum matching problem,” in Parallel Processing and Applied Mathematics, ser. Lecture Notes in Computer Science, R. Wyrzykowski, J. Dongarra, K. Karczewski, and J. Wasniewski, Eds. Springer, 2008, vol. 4967, pp. 708–717.
[27] D. Bader, H. Meyerhenke, P. Sanders, and D. Wagner, “Competition rules and objective functions for the 10th DIMACS Implementation Challenge on graph partitioning and graph clustering,” Sep. 2011, http://www.cc.gatech.edu/dimacs10/data/dimacs10-rules.pdf.
[28] K. Wakita and T. Tsurumi, “Finding community structure in mega-scale social networks,” CoRR, vol. abs/cs/0702048, 2007.
[29] P. Konecny, “Introducing the Cray XMT,” in Proc. Cray User Group meeting (CUG 2007). Seattle, WA: CUG Proceedings, May 2007.
[30] OpenMP Application Program Interface; Version 3.0, OpenMP Architecture Review Board, May 2008.
[31] D. Bader and K. Madduri, “SNAP, small-world network analysis and partitioning: an open-source parallel graph framework for the exploration of large-scale networks,” in Proc. Int’l Parallel and Distributed Processing Symp. (IPDPS), 2008.
[32] D. Chakrabarti, Y. Zhan, and C. Faloutsos, “R-MAT: A recursive model for graph mining,” in Proc. 4th SIAM Intl. Conf. on Data Mining (SDM). Orlando, FL: SIAM, Apr. 2004.
[33] D. Bader, J. Gilbert, J. Kepner, D. Koester, E. Loh, K. Madduri, W. Mann, and T. Meuse, HPCS SSCA#2 Graph Analysis Benchmark Specifications v1.1, Jul. 2005.
[34] J. Leskovec, “Stanford large network dataset collection,” at http://snap.stanford.edu/data/, Oct. 2011.
[35] P. Boldi, B. Codenotti, M. Santini, and S. Vigna, “UbiCrawler: A scalable fully distributed web crawler,” Software: Practice & Experience, vol. 34, no. 8, pp. 711–726, 2004.
[36] C. Seshadhri, T. G. Kolda, and A. Pinar, “Community structure and scale-free collections of Erdős–Rényi graphs,” CoRR, vol. abs/1112.3644, 2011.
[37] A. Buluç and J. R. Gilbert, “The Combinatorial BLAS: design, implementation, and applications,” International Journal of High Performance Computing Applications, vol. 25, no. 4, pp. 496–509, 2011. [Online]. Available: http://hpc.sagepub.com/content/25/4/496.abstract
[38] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing,” in Proceedings of the 2010 International Conference on Management of Data, ser. SIGMOD ’10. New York, NY, USA: ACM, 2010, pp. 135–146.
