Incremental Streaming Graph Partitioning
Lisa Durbeck, Virginia Tech, Dept. of ECE, ldurbeck@vt.edu
Peter Athanas, Virginia Tech, Dept. of ECE, athanas@vt.edu

Abstract—Graph partitioning is an NP-hard problem whose efficient approximation has long been a subject of interest. The I/O bounds of contemporary computing environments favor incremental or streaming graph partitioning methods. Methods have sought a balance between latency, simplicity, accuracy, and memory size. In this paper, we apply an incremental approach to streaming partitioning that tracks changes with a lightweight proxy to trigger partitioning as the clustering error increases. We evaluate its performance on the DARPA/MIT Graph Challenge streaming stochastic block partition dataset, and find that it can dramatically reduce the invocation of partitioning, which can provide an order-of-magnitude speedup.

Index Terms—Computation (stat.CO); graph partition; graph sparsification; spectral graph theory; LOBPCG; spectral graph partitioning; stochastic blockmodels; incremental partitioning; community detection

I. INTRODUCTION

Graphs are a species of big data particularly suited to representing relationships between entities; for example, a graph of a social network describes friendships and kinship. A graph partition is a more compact encoding of the data in which the community structure of the graph is preserved, with a several-fold reduction in edges. Graph partition is among the NP-hard problems for which exact solutions are too expensive to obtain because no polynomial-time solution is known; instead, people have developed heuristics to find clusters that approximate the optimal solution sufficiently well.

Today's phenomenal data production rates make it desirable to have a simple, fast, and accurate graph partitioner along with convenient tools for graph analysis. The DARPA/MIT Graph Challenge (DGC) was envisioned as a way to accelerate progress in graph structure assessments such as graph isomorphism and graph partition [9]. One of its contributions is a set of graphs along with vertex group labels with which researchers can develop new algorithms and validate results. This also facilitates comparisons among competing approaches in terms of their computational and memory cost versus the accuracy of their results. Facilitating this further are a graph generator, a baseline algorithm and scoring functions against which researchers can benchmark their progress, and a code base in C++ or Python with which to begin. Within the larger DGC, which has comprised several challenges since 2017, this work focuses on the Streaming Graph Challenge (SGC), a challenge for graph partitioning that includes both static and streaming datasets of synthetic stochastic blockmodel-based (SBM) graphs.

This paper reports on efforts to advance progress further within the SGC by augmenting the fastest approach demonstrated to date with recent and complementary progress in incremental partitioning for time-varying graphs by Charisopoulos [2]. Partitioning is an expensive operation; we seek to reduce calls to the partitioner without impacting partition quality. The approach takes advantage of intrinsic properties of the graph to track graph edge insertions and deletions and to indicate when the set of updates may cause a change to the community structure of the graph. When it does not, we retain the group labels from the prior partition, conserving partition cycles while tracking the optimal group assignments within an error tolerance.

To better pinpoint the source of speedup, we derive an expression for the rate of consumption of edge updates by the enhanced partitioning pipeline and show that under certain conditions it can produce an impressive speedup over the work/time of the partitioner, especially as the number of batches of updates increases.
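To make the control flow concrete, the following is a minimal sketch of a proxy-triggered pipeline, not the paper's implementation: `cheap_proxy` is a hypothetical stand-in for the degree-based subspace-distance proxy of Charisopoulos [2], and the expensive partitioner call is elided to a counter.

```python
# Sketch of proxy-triggered incremental partitioning (illustrative only).
# `cheap_proxy` is a stand-in for the degree-based proxy of Charisopoulos [2];
# the real partitioner invocation is reduced to a counter.

def cheap_proxy(deg_at_last_partition, deg_now):
    """Fraction of total degree mass changed since the last full partition,
    a deliberately simple placeholder metric."""
    changed = sum(abs(deg_now.get(v, 0) - d)
                  for v, d in deg_at_last_partition.items())
    changed += sum(d for v, d in deg_now.items()
                   if v not in deg_at_last_partition)
    total = max(sum(deg_now.values()), 1)
    return changed / total

def stream_partition(batches, tol=0.25):
    """Consume batches of edge insertions; repartition only when the proxy
    exceeds `tol`. Returns the number of partitioner invocations."""
    deg_now, deg_ref = {}, {}
    calls = 0
    for batch in batches:
        for (u, v) in batch:                 # apply edge insertions
            deg_now[u] = deg_now.get(u, 0) + 1
            deg_now[v] = deg_now.get(v, 0) + 1
        if cheap_proxy(deg_ref, deg_now) > tol:
            calls += 1                       # invoke the (expensive) partitioner
            deg_ref = dict(deg_now)          # snapshot state at this partition
        # else: retain the group labels from the prior partition
    return calls
```

With many small batches relative to the tolerance, the partitioner runs far fewer times than once per batch, which is the source of the speedup quantified in this paper.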
We experimentally verify these estimates for SGC SBM graphs with known partition labels.

(In IEEE High Performance Extreme Computing Conference, HPEC 2020. © 2020 IEEE.)

The paper is organized as follows. The Background section outlines related work in graph partitioning motivating the approach used here, as described in the Approaches and Methods section. The Experiments and Results section gives specifics of the datasets and computing environment, and the performance of the approach relative to the baseline and to the most relevant related work. Discussion and Future Work summarizes the work and suggests how it could be extended.

II. BACKGROUND

Many common development frameworks still use the most rudimentary of the lightweight streaming partitioning techniques pioneered by Stanton and Kliot, such as hashing on node index [16]. Yet there are many potential benefits to approaches that trade some simplicity for greater computational or storage efficiency. A summary of approaches can be found in Fortunato [7].

Global approaches to graph partition rely on properties of the entire graph. The most common examples are spectral partitioning, which derives a partition from approximate eigenvectors of the adjacency matrix [8], and spectral clustering, which derives it using the eigendecomposition of the graph Laplacian matrix [1], [14]. These were first developed in the 1960s and 1970s and took their modern form in the 1990s. Their major drawback is that the necessary access to information about the full graph may not be possible in practice for very large graphs, a drawback addressed to some degree in the current work by the use of a matrix-free method by Knyazev that does not require explicit storage of the adjacency matrix [10].

The DGC baseline for streaming graph partition for stochastic block models (SBM) is a sequential implementation of the parallelizable Bayesian-statistics-based stochastic block partitioning method of Peixoto [9], [13]. The baseline partitioner is accurate in its group assignments, but slow: its time-order is O(E log² E), where E is the number of edges in the graph.

One of the winners in the 2017 SGC competition is a partitioner based on the Locally Optimal Block Preconditioned Conjugate Gradient method, or LOBPCG, by Zhuzhunashvili and Knyazev [17]. LOBPCG is a spectral method particularly suited to sparse matrices and sparse graphs like those of the DGC stochastic block challenges [17]. LOBPCG has a faster time-order than the baseline, O(N · k), where N is the number of nodes in the graph and k is the number of groups, or clusters.

We confirmed these earlier-reported time-orders for both the Peixoto and Knyazev methods in the course of our experiments with SGC static graphs of sizes ranging from 50 nodes to 5 million nodes that were released for the challenge in 2017 or 2020. Since LOBPCG's time-order is a significant improvement over the baseline and gives similar accuracies, at least for the low-block-size, low-variation graphs within SGC, we based our experiments on seeing whether these ideas improve the performance of LOBPCG further.

LOBPCG is an efficient spectral clustering method; other methods of spectral partitioning may be desirable in other situations. Spielman and Teng present a single-pass, local partitioning algorithm also based on spectral clustering, using the eigenvalues of the graph Laplacian [15]. Their method uses O(N + E) storage and O(E) time, where E is the number of edges of the graph and N the number of nodes. LOBPCG has a smaller time-order and space-order due to the advantage of being a matrix-free method that does not require storing the graph coefficient matrix explicitly; instead, it can access the matrix by evaluating matrix-vector products [10].

To our knowledge, incremental partitioning was first introduced by Ou and Ranka within the context of adaptive finite element meshing [12]. They observed that, for a class of problems such as adaptive finite element meshing, where continuous updates to the graph are occurring, the number of updates to the partition assignment at any given time is small relative to the size of the graph. They called this the incremental graph partitioning problem. They assigned partitions to graph updates using a simple local graph-distance-based metric.

A similar observation motivates Charisopoulos within the context of the evolution of time-varying SBM graphs [2]. They developed a proxy for the Davis–Kahan bound on the subspace distance within updates to a graph adjacency matrix that serves to characterize graph updates succinctly in terms of distance from a spectrum threshold, within an incremental approach to spectral estimation using subspace iteration for block Krylov methods. This provides a bound on the error of the subspace used to make the partition assignments. We use a relevant distance metric for SBMs that they derived and proved, which requires access to degree information from the prior partition and degree information within the current batch of graph updates [2].

Fig. 1. State machine representation of the algorithm as a process. The algorithm from a data view is encapsulated in Equation 5.

Our partitioning algorithm is depicted in Figure 1 as a state machine. It functions similarly to Algorithm 3.1 in Charisopoulos [2]. The system has two states and three transitions: first, an accumulation state in which the low-cost distance metric corresponding to Equation 6 is periodically checked, and second, a partitioning state in which the eigenvalues and eigenvectors are recalculated, after which the system returns to the accumulation state until the proxied subspace distance is again further than a given distance from the subspace generated in the prior partition cycle. In this way the partition label assignments remain within a bounded error distance of their true current values.

for all symmetric adjacency matrices [11]. We use this fact to set d = 1 and solve Equation 7 for α.

III. APPROACHES & METHODS

To define the partitioning problem in general: given a graph with no group labels associated with any node, the goal is to produce accurate group label assignments for each node in the graph. An additional assumption for the present work is that the problem is an incremental partition problem. In such a case, the graph is revealed to the algorithm over time; it does not possess all the information
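As a toy illustration of producing group labels spectrally while touching the matrix only through matrix-vector products (the matrix-free property noted in the Background), the sketch below bisects a tiny graph by the sign of an approximate Fiedler vector. It uses plain power iteration with deflation rather than LOBPCG itself, and all names and the example graph are illustrative, not from the paper.

```python
# Matrix-free spectral bisection sketch (power iteration with deflation,
# not LOBPCG): the Laplacian L = D - A is never stored; it is accessed only
# through matrix-vector products on an adjacency list.

def laplacian_matvec(adj, x):
    """Return y = L x using only the adjacency list `adj`."""
    return [len(adj[i]) * x[i] - sum(x[j] for j in adj[i])
            for i in range(len(adj))]

def fiedler_bisection(adj, iters=500):
    """Approximate the Fiedler vector of L and split nodes by its sign."""
    n = len(adj)
    # Shift so that c*I - L has nonnegative eigenvalues (lambda_max <= 2*max_deg);
    # its dominant non-constant eigenvector is then the Fiedler vector of L.
    c = 2 * max(len(nbrs) for nbrs in adj)
    v = [(-1.0) ** i * (i + 1) for i in range(n)]   # deterministic start vector
    for _ in range(iters):
        lx = laplacian_matvec(adj, v)
        v = [c * v[i] - lx[i] for i in range(n)]    # apply (c*I - L)
        mean = sum(v) / n
        v = [x - mean for x in v]                   # deflate the all-ones eigenvector
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return [0 if x < 0 else 1 for x in v]

# Two triangles {0,1,2} and {3,4,5} joined by the single bridge edge (2, 3).
adj = [[1, 2], [0, 2], [0, 1, 3], [2, 4, 5], [3, 5], [3, 4]]
labels = fiedler_bisection(adj)
```

The sign split recovers the two communities on either side of the bridge; an incremental scheme would reuse these labels across update batches until the proxy metric signals that the subspace has drifted too far.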