Parallel Breadth-First Search on Distributed Memory Systems

Aydın Buluç Kamesh Madduri Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA {ABuluc, KMadduri}@lbl.gov

ABSTRACT

Data-intensive, graph-based computations are pervasive in several scientific applications, and are known to be quite challenging to implement on distributed memory systems. In this work, we explore the design space of parallel algorithms for Breadth-First Search (BFS), a key subroutine in several graph algorithms. We present two highly-tuned parallel approaches for BFS on large parallel systems: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and a two-dimensional sparse matrix partitioning-based approach that mitigates parallel communication overhead. For both approaches, we also present hybrid versions with intra-node multithreading. Our novel hybrid two-dimensional algorithm reduces communication times by up to a factor of 3.5, relative to a common vertex-based approach. Our experimental study identifies execution regimes in which these approaches will be competitive, and we demonstrate extremely high performance on leading distributed-memory parallel systems. For instance, for a 40,000-core parallel execution on Hopper, an AMD Magny-Cours based system, we achieve a BFS performance rate of 17.8 billion edge visits per second on an undirected graph of 4.3 billion vertices and 68.7 billion edges with skewed degree distribution.

1. INTRODUCTION

The use of graph abstractions to analyze and understand social interaction data, complex engineered systems such as the power grid and the Internet, communication data such as email and phone networks, biological systems, and, in general, various forms of relational data has been gaining ever-increasing importance. Common graph-theoretic problems arising in these application areas include identifying and ranking important entities, detecting anomalous patterns or sudden changes in networks, finding tightly interconnected clusters of entities, and so on. The solutions to these problems typically involve classical algorithms for problems such as finding spanning trees, shortest paths, biconnected components, matchings, and flow-based computations in these graphs. To cater to the graph-theoretic analysis demands of emerging "big data" applications, it is essential that we speed up the underlying graph problems on current parallel systems.

We study the problem of traversing large graphs in this paper. A traversal refers to a systematic method of exploring all the vertices and edges in a graph. The ordering of vertices using a "breadth-first" search (BFS) is of particular interest in many graph problems. Theoretical analysis in the random access machine (RAM) model of computation indicates that the computational work performed by an efficient BFS algorithm scales linearly with the number of vertices and edges, and there are several well-known serial and parallel BFS algorithms (discussed in Section 2). However, efficient RAM algorithms do not easily translate into "good performance" on current computing platforms. This mismatch arises because current architectures lean towards efficient execution of regular computations with low memory footprints, and heavily penalize memory-intensive codes with irregular memory accesses. Graph traversal problems such as BFS are by definition predominantly memory access-bound, and these accesses are further dependent on the structure of the input graph, thereby making the algorithms "irregular".

The recently-created Graph 500 List [20] is an attempt to rank supercomputers based on their performance on data-intensive applications, and BFS is chosen as the first representative benchmark. Distributed memory architectures dominate the supercomputer market, and computational scientists have an excellent understanding of mapping numerical scientific applications to these systems. Measuring the sustained floating point execution rate (Flop/s) and comparing it with the theoretical system peak is a popular technique. In contrast, little is known about the design and parallel performance of data-intensive graph algorithms, and the programmability trade-offs involved. Thus, the current Graph 500 list does not yet accurately portray the data-crunching capabilities of parallel systems, and is just a reflection of the quality of various benchmark BFS implementations.

BFS on distributed-memory systems involves explicit communication between processors, and the distribution (or partitioning) of the graph among processors also impacts performance. We utilize a testbed of large-scale graphs with billions of vertices and edges to empirically evaluate the performance of our BFS algorithms. These graphs are all sparse, i.e., the number of edges m is just a constant factor times the number of vertices n. Further, the average path length in these graphs is a small constant value compared to the number of vertices, or is at most bounded by log n.

Algorithm 1 Serial BFS algorithm.
Input: G(V,E), source vertex s.
Output: d[1..n], where d[v] gives the length of the shortest path from s to v ∈ V.
 1: for all v ∈ V do
 2:    d[v] ← ∞
 3: d[s] ← 0, level ← 1, FS ← ∅, NS ← ∅
 4: push s → FS
 5: while FS ≠ ∅ do
 6:    for each u in FS do
 7:       for each neighbor v of u do
 8:          if d[v] = ∞ then
 9:             push v → NS
10:             d[v] ← level
11:    FS ← NS, NS ← ∅, level ← level + 1

Note that this algorithm is slightly different from the widely-used queue-based serial algorithm [14]. We can relax the FIFO ordering mandated by a queue at the cost of additional space utilization, but the work complexity in the RAM model is still O(m + n).

Our Contributions

We present two complementary approaches to distributed-memory BFS on graphs with skewed degree distribution. The first approach is a more traditional scheme using one-dimensional distributed adjacency arrays for representing the graph. The second method utilizes a sparse matrix representation of the graph and a two-dimensional partitioning among processors. The following are our major contributions:
• Our two-dimensional partitioning-based approach, coupled with intranode multithreading, reduces the communication overhead at high concurrencies by a factor of 3.5.
• Both our approaches include extensive intra-node multicore tuning and performance optimization. The single-node performance of our graph-based approach is comparable to, or exceeds, recent single-node results, on a variety of real-world and synthetic networks.
The hybrid schemes (distributed memory graph partition- 2.2 Parallel BFS: Prior Work ing and shared memory traversal parallelism) enable BFS up to 40,000 cores. Parallel algorithms for BFS date back to nearly three • To accurately quantify the memory access costs in BFS, decades [31, 32]. The classical parallel random access ma- we present a simple memory-reference centric performance chine (PRAM) approach to BFS is a straightforward exten- model. This model succinctly captures the differences be- sion of the serial algorithm presented in Algorithm1. The tween our two BFS strategies, and also provides insight graph traversal loops (lines 6 and 7) are executed in parallel into architectural trends that enable high-performance graph by multiple processing elements, and the distance update algorithms. and stack push steps (lines 8-10) are atomic. There is a bar- rier synchronization step once for each level, and thus the execution time in the PRAM model is O(D), where the D is 2. BREADTH-FIRST SEARCH OVERVIEW the diameter of the graph. Since the PRAM model does not weigh in synchronization costs, the asymptotic complexity 2.1 Preliminaries of work performed is identical to the serial algorithm. Given a distinguished “source vertex” s, Breadth-First The majority of the novel parallel implementations de- Search (BFS) systematically explores the graph G to dis- veloped for BFS follow the general structure of this “level- cover every vertex that is reachable from s. Let V and E synchronous” algorithm, but adapt the algorithm to better refer to the vertex and edge sets of G, whose cardinalities fit the underlying parallel architecture. In addition to keep- are n = |V | and m = |E|. We assume that the graph is un- ing the parallel work complexity close to O(m+n), the three weighted; equivalently, each edge e ∈ E is assigned a weight key optimization directions pursued are of unity. A path from vertex s to t is defined as a sequence of • ensuring that parallelization of the edge visit steps (lines edges hu , u i (edge directivity assumed to be u → u i i+1 i i+1 6, 7 in Algorithm1) is load-balanced, in case of directed graphs), 0 ≤ i < l, where u = s and 0 • mitigating synchronization costs due to atomic updates u = t. The length of a path is the sum of the weights of l and the barrier synchronization at the end of each level, edges. We use d(s, t) to denote the distance between vertices and s and t, or the length of the shortest path connecting s and • improving locality of memory references by modifying the t. BFS implies that all vertices at a distance k (or “level” graph layout and/or BFS data structures. k) from vertex s should be first “visited” before vertices at distance k + 1. The distance from s to each reachable ver- We discuss recent work on parallel BFS in this section, tex is typically the final output. In applications based on a and categorize them based on the parallel system they were breadth-first graph traversal, one might optionally perform designed for. auxiliary computations when visiting a vertex for the first Multithreaded systems: Bader and Madduri [4] present time. Additionally, a “breadth-first spanning tree” rooted a fine-grained parallelization of the above level-synchronous at s containing all the reachable vertices can also be main- algorithm for the Cray MTA-2, a massively multithreaded tained. shared memory parallel system. Their approach utilizes the Algorithm1 gives a serial algorithm for BFS. 
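As a concrete companion to Algorithm 1, the following is a minimal serial sketch in C++ (ours, not the authors' implementation). It assumes a CSR-style adjacency structure (offsets, adj) built elsewhere and uses two std::vector containers for the FS and NS frontiers.

#include <cstdint>
#include <limits>
#include <vector>

// Minimal serial level-synchronous BFS in the spirit of Algorithm 1.
// 'offsets'/'adj' form a CSR adjacency structure and are assumed inputs.
std::vector<int64_t> serial_bfs(const std::vector<int64_t>& offsets,
                                const std::vector<int64_t>& adj,
                                int64_t source) {
  const int64_t n = static_cast<int64_t>(offsets.size()) - 1;
  const int64_t INF = std::numeric_limits<int64_t>::max();
  std::vector<int64_t> d(n, INF);          // d[v] = BFS level of v
  std::vector<int64_t> FS, NS;             // current and next frontiers
  d[source] = 0;
  FS.push_back(source);
  int64_t level = 1;
  while (!FS.empty()) {
    for (int64_t u : FS) {
      for (int64_t e = offsets[u]; e < offsets[u + 1]; ++e) {
        int64_t v = adj[e];
        if (d[v] == INF) {                 // first visit: set distance, enqueue
          d[v] = level;
          NS.push_back(v);
        }
      }
    }
    FS.swap(NS);                           // NS becomes the new frontier
    NS.clear();
    ++level;
  }
  return d;
}

Swapping FS and NS at the end of each level reproduces the level-synchronous structure while keeping the O(m + n) work bound.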
The required support for fine-grained, low-overhead synchronization pro- breadth-first ordering of vertices is accomplished in this case vided on the MTA-2, and ensures that the graph traversal by using two stacks – FS and NS – for storing vertices at is load-balanced to run on thousands of hardware threads. the current level (or “frontier”) and the newly-visited set of The MTA-2 system is unique in that it relies completely on vertices (one hop away from the current level) respectively. hardware multithreading to hide memory latency, as there The number of iterations of the outer while loop (lines 5- are no data caches in this system. This feature also elim- 11) is bounded by the length of the longest shortest path inates the necessity of tedious locality-improvement opti- from s to any reachable vertex t. Note that this algorithm mizations to the BFS algorithm, and Bader and Madduri’s implementation achieves a very high system utilization on which is more amenable for code parallelization with the a 40-processor MTA-2 system. Mizell and Maschhoff [29] ++ run-time model. These three approaches use seem- discuss an improvement to the Bader-Madduri MTA-2 ap- ingly independent optimizations and different graph families proach, and present performance results for parallel BFS on to evaluate performance on, which makes it difficult to do a 128-processor Cray XMT system, a successor to the Cray a head-to-head comparison. Since our target architecture in MTA-2. this study are clusters of multicore nodes, we share some The current generation of GPGPUs are similar to the Cray similarities to these approaches. XMT systems in their reliance on large-scale multithreading Distributed memory systems: The general structure to hide memory latency. In addition, one needs to ensure of the level-synchronous approach holds in case of distributed that the threads perform regular and contiguous memory memory implementations as well, but fine-grained “visited” accesses to achieve high system utilization. This makes op- checks are replaced by edge aggregation-based strategies. timizing BFS for these architectures quite challenging, as With a distributed graph and a distributed d array, a pro- there is no work-efficient way to ensure coalesced accesses to cessor cannot tell whether a non-local vertex has been pre- the d array in the level synchronous algorithm. Harish and viously visited or not. So the common approach taken is Narayanan [22] discuss implementations of BFS and other to just accumulate all edges corresponding to non-local ver- graph algorithms on NVIDIA GPUs. Due to the comparably tices, and send them to the owner processor at the end of higher memory bandwidth offered by the GDDR memory, a local traversal. There is thus an additional all-to-all com- they show that the GPU implementations outperform BFS munication step at the end of each frontier expansion. In- approaches on the CPU for various low-diameter graph fami- terprocessor communication is considered a significant per- lies with tens of millions of vertices and edges. Luo et al. [27] formance bottleneck in prior work on distributed graph al- present an improvement to this work with a new hierarchical gorithms [11, 26]. 
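To make the "explicit partitioning and edge aggregation" strategy concrete, here is a small illustrative sketch (ours, not taken from any of the cited systems) of how a process owning a 1D block of vertices might bucket frontier adjacencies by owner before a single all-to-all exchange; the find_owner convention and buffer layout are assumptions made for illustration.

#include <cstdint>
#include <vector>

// Illustrative 1D edge-aggregation step: instead of fine-grained "visited"
// checks on remote vertices, append each discovered neighbor to a buffer
// addressed to its owner process; the buffers are exchanged with an
// all-to-all collective at the end of the level.
struct AggregationBuffers {
  std::vector<std::vector<int64_t>> send;  // one buffer per destination rank
  explicit AggregationBuffers(int nprocs) : send(nprocs) {}
};

inline int find_owner(int64_t v, int64_t n, int nprocs) {
  // Block distribution: vertex v belongs to rank v / ceil(n/p).
  int64_t block = (n + nprocs - 1) / nprocs;
  return static_cast<int>(v / block);
}

void aggregate_frontier_edges(const std::vector<int64_t>& frontier,
                              const std::vector<int64_t>& offsets,
                              const std::vector<int64_t>& adj,
                              int64_t n_global, int nprocs,
                              AggregationBuffers& bufs) {
  for (int64_t u : frontier) {
    for (int64_t e = offsets[u]; e < offsets[u + 1]; ++e) {
      int64_t v = adj[e];
      bufs.send[find_owner(v, n_global, nprocs)].push_back(v);
    }
  }
  // An MPI_Alltoallv over the flattened buffers would follow here; only the
  // owner of v then checks and updates its local distance entry.
}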
The relative costs of inter-processor com- data structure to store and access the frontier vertices, and munication and local computation depends on the quality of demonstrate that their algorithm is up to 10× faster than the graph partitioning and the topological characteristics of the Harish-Narayanan algorithm on recent NVIDIA GPUs the interconnection network. and low-diameter sparse graphs. You et al. [39] study BFS- The BFS implementation of Scarpazza et al. [33] for the like traversal optimization on GPUs and multicore CPUs in Cell/B.E. processor, while being a multicore implementa- the context of implementing an inference engine for a speech tion, shares a lot of similarities with the general “explicit recognition application. partitioning and edge aggregation” BFS strategy for dis- Multicore systems: There has been a spurt in recent tributed memory system. The implementation by Yoo et work on BFS approaches for multicore CPU systems. Cur- al. [38] for the BlueGene/L system is a notable distributed rent x86 multicore architectures, with 8 to 32-way core-level memory parallelization. The authors observe that a two- parallelism and 2-4 way simultaneous multithreading, are dimensional graph partitioning scheme would limit key col- much more amenable to coarse-grained load balancing in lective communication phases of the algorithms to at most √ comparison to the multithreaded architectures. Possible p processors, thus avoiding the expensive all-to-all com- p-way partitioning of vertices and/or replication of high- munication steps. This enables them to scale BFS to pro- contention data structures alleviates some of the synchro- cess concurrencies as high as 32,000 processors. However, nization overhead. However, due to the memory-intensive this implementation assumes that the graph families under nature of BFS, performance is still quite dependent on the exploration would have a regular degree distribution, and graph size, as well as the sizes and memory bandwidths computes bounds for inter-process communication message of the various levels of the cache hierarchy. Recent work buffers based on this assumption. Such large-scale scala- on parallelization of the queue-based algorithm by Agar- bility with or without 2D graph decomposition may not be wal et al. [1] notes a problem with scaling of atomic in- realizable for graphs with skewed degree distributions. Fur- trinsics on multi-socket Intel Nehalem systems. To mitigate thermore, the computation time increases dramatically (up this, they suggest a partitioning of vertices and correspond- to 10-fold) with increasing processor counts, under a weak ing edges among multiple sockets, and a combination of scaling regime. This implies that the sequential kernels and the fine-grained approach and the accumulation-based ap- data structures used in this study were not work-efficient. proach in edge traversal. In specific, the distance values As opposed to Yoo et al.’s work, we give details of the data (or the “visited” statuses of vertices in their work) of local structures and algorithms that are local to each processor in vertices are updated atomically, while non-local vertices are Section4. Cong et al. [12] study the design and implementa- held back to avoid coherence traffic due to cache line inval- tion of several graph algorithms using the partitioned global idations. They achieve very good scaling going from one to address space (PGAS) programming model. 
Recently, Ed- four sockets with this optimization, at the expense of intro- monds et al. [16] gave the first hybrid-parallel 1D BFS im- ducing an additional barrier synchronization for each BFS plementation that uses active messages. level. Xia and Prasanna [37] also explore synchronization- Software systems for large-scale distributed graph algo- reducing optimizations for BFS on Intel Nehalem multicore rithm design include the Parallel graph library (PBGL) systems. Their new contribution is a low-overhead “adap- [21] and the Pregel [28] framework. Both these systems tive barrier” at the end of each frontier expansion that ad- adopt a straightforward level-synchronous approach for BFS justs the number of threads participating in traversal based and related problems. Prior distributed graph algorithms on an estimate of work to be performed. They show sig- are predominantly designed for “shared-nothing” settings. nificant performance improvements over na¨ıve parallel BFS However, current systems offer a significant amount of par- implementations on dual-socket Nehalem systems. Leiser- allelism within a single processing node, with per-node mem- son and Schardl [25] explore a different optimization: they ory capacities increasing as well. Our paper focuses on graph replace the shared queue with a new “bag” data structure traversal algorithm design in such a scenario. We present these new parallel strategies and quantify the performance ing. Section 5.1 provides a rigorous analysis of both parallel benefits achieved in Section3. implementations. External memory algorithms: Random accesses to disk are extremely expensive, and so locality-improvement 3.1 BFS with 1D Partitioning optimizations are the key focus of external memory graph A natural way of distributing the vertices and edges of a algorithms. External memory graph algorithms build on graph on a distributed memory system is to let each proces- known I/O-optimal strategies for sorting and scanning. Ajwani sor own n/p vertices and all the outgoing edges from those and Meyer [2,3] discuss the state-of-the-art algorithms for vertices. We refer to this partitioning of the graph as ‘1D BFS and related graph traversal problems, and present per- partitioning’, as it translates to the one-dimensional decom- formance results on large-scale graphs from several families. position of the adjacency matrix corresponding to the graph. Recent work by Pierce et al. [30] investigates implementa- The general schematic of the level-synchronous parallel tions of semi-external BFS, shortest paths, and connected BFS algorithm can be modified to work in a distributed components. scenario with 1D partitioning as well. Algorithm2 gives Other Parallel BFS Algorithms: There are several the pseudo-code for BFS on a cluster of multicore or mul- alternate parallel algorithms to the level-synchronous ap- tithreaded processors. The distance array is distributed proach, but we are unaware of any recent, optimized im- among processes. Every process only maintains the status of plementations of these algorithms. The fastest-known al- vertices it owns, and so the traversal loop becomes an edge gorithm (in the PRAM complexity model) for BFS repeat- aggregation phase. We can utilize multithreading within a edly squares the adjacency matrix of the graph, where the process to enumerate the adjacencies. However, only the element-wise operations are in the min-plus semiring (see [17] owner process of a vertex can identify whether it is newly for a detailed discussion). 
This computes the BFS order- visited or not. Thus, all the adjacencies of the vertices in the ing of the vertices in O(log n) time in the EREW-PRAM current frontier need to be sent to their corresponding owner 3 model, but requires O(n ) processors for achieving these process, which happens in the All-to-all communication step bounds. This is perhaps too work-inefficient for travers- (line 21) of the algorithm. Note that the only -level ing large-scale graphs. The level synchronous approach is synchronization required is due to the barriers. The rest of also clearly inefficient for high-diameter graphs. A PRAM the steps such as buffer packing and unpacking can be per- algorithm designed by Ullman and Yannakakis [35], based formed by the threads in a data-parallel manner. The key on path-limited searches, is a possible alternative on shared- aspects to note in this algorithm, in comparison to the serial memory systems. However, it is far more complicated than level-synchronous algorithm (Algorithm1), is the extrane- the simple level-synchronous approach, and has not been ous computation (and communication) introduced due to empirically evaluated. The graph partitioning-based strate- the distributed graph scenario: creating the message buffers gies adopted by Ajwani and Meyer [3] in their external mem- of cumulative size O(m) and the All-to-all communication ory traversal of high-diameter graphs may possibly lend them- step. selves to efficient in-memory implementations as well. Other Related Work: Graph partitioning is intrinsic 3.2 BFS with 2D Partitioning to distributed memory graph algorithm design, as it helps We next describe a parallel BFS approach that directly bound inter-processor communication traffic. One can fur- works with the sparse adjacency matrix of the graph. Fac- ther relabel vertices based on partitioning or other heuris- toring out the underlying algebra, each BFS iteration is tics [13, 15], and this has the effect of improving memory computationally equivalent to a sparse matrix-sparse vec- reference locality and thus improve parallel scaling. tor multiplication (SpMSV). Let A denote the adjacency A sparse graph can analogously be viewed as a sparse matrix of the graph, represented in a sparse boolean for- matrix, and optimization strategies for linear algebra com- mat, xk denotes the kth frontier, represented as a sparse putations similar to BFS, such as sparse matrix-vector mul- vector with integer variables. It is easy to see that the ex- tiplication [36], may be translated to the realm of graph ploration of level k in BFS is algebraically equivalent to algorithms to improve BFS performance as well. Recent re- x ← AT ⊗ x Sxi (we will omit the transpose and search shows prospects of viewing graph algorithms as sparse k+1 k i=1 assume that the input is pre-transposed for the rest of this matrix operations [8,19]. Our work contributes to that area section). The syntax ⊗ denotes the matrix-vector multipli- by exploring the use of sparse-matrix sparse-vector multipli- cation operation on a special semiring, denotes element- cation for BFS for the first time. wise multiplication, and overline represents the complement The formulation of BFS that is common in combinatorial operation. In other words, v = 0 for v 6= 0 and v = 1 for optimization and artificial intelligence search applications [5, i i i v = 0. This algorithm becomes deterministic with the use of 24] is different from the focus of this paper. 
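In symbols, the level-k expansion described above, with ⊗ the semiring matrix-vector product, ⊙ element-wise multiplication, and the overline denoting the complement, can be written as

x_{k+1} \leftarrow \left( \mathbf{A}^{\mathsf{T}} \otimes x_k \right) \odot \overline{\bigcup_{i=1}^{k} x_i},

where the complemented union acts as a mask that filters out already-visited vertices.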
i (select,max)-semiring, because the parent is always chosen to be the vertex with the highest label. The algorithm does not have to store the previous frontiers explicitly as multiple 3. BREADTH-FIRST SEARCH ON Sxi sparse vectors. In practice, it keeps a dense parents = i=1 DISTRIBUTED MEMORY SYSTEMS array, which is more space efficient and easier to update. In this section, we briefly describe the high-level paral- Our sparse matrix approach uses the alternative 2D de- lelization strategy employed in our two distributed BFS schemes composition of the adjacency matrix of the graph. Consider with accompanying pseudo-code. Section4 provides more the simple checkerboard partitioning, where processors are details about the parallel implementation of these algorithms. logically organized on a square p = pr ×pc mesh, indexed by The algorithms are seemingly straightforward to implement, their row and column indices so that the (i, j)th processor is but eliciting high performance on current systems requires denoted by P (i, j). Edges and vertices (sub-matrices) are as- careful data structure choices and low-level performance tun- signed to processors according to a 2D block decomposition. Algorithm 2 Hybrid parallel BFS with vertex partitioning. Algorithm 3 Parallel 2D BFS algorithm. Input: G(V,E), source vertex s. Input: A: undirected graph represented by a boolean Output: d[1..n], where d[v] gives the length of the shortest sparse adjacency matrix, s: source vertex id. path from s to v ∈ V . Output: π: dense vector, where π[v] is the predecessor ver- 1: for all v ∈ V do tex on the shortest path from s to v, or −1 if v is un- 2: d[v] ← ∞ reachable. 3: level ← 1, FS ← φ, NS ← φ 1: procedure BFS 2D(A, s) 2: f(s) ← s 4: ops ← find owner(s) 3: for all processors P (i, j) in parallel do 5: if ops = rank then 6: push s → FS 4: while f 6= ∅ do 7: d[s] ← 0 5: TransposeVector(fij ) 6: f ← (f ,P (:, j)) 8: for 0 ≤ j < p do i Allgatherv ij 7: ti ← Aij ⊗ fi 9: SendBufj ← φ . p shared message buffers 8: tij ← Alltoallv(ti,P (i, :)) 10: RecvBufj ← φ . for MPI communication 9: tij ← tij πij 11: tBufij ← φ . thread-local stack for thread i 10: πij ← πij + tij 12: while FS 6= φ do 11: fij ← tij 13: for each u in FS in parallel do 14: for each neighbor v of u do x 15: pv ← find owner(v) 1,1 A A A x1,2 x1 16: push v → tBufipv 1,1 1,2 1,3 x 17: Thread Barrier 1,3 ! 18: for 0 ≤ j < p do x ! ! 2,1 19: Merge thread-local tBufij ’s in parallel, 5 x x2 ! !A 2,1 A!2,2 A2,3 2,2 form SendBufj ! x2,3 20: Thread Barrier ! x 21: All-to-all collective step with the master thread: ! 3,1 8 ! A A x3,2 x3 Send data in SendBuf, aggregate ! !A 3,1 !3, 2 3,3 ! newly-visited vertices into RecvBuf x3,3 22: Thread Barrier ! 23: for each u in RecvBuf in parallel do ! ! 24: if d[u] = ∞ then ! ! ! Figure 1: 2D vector distribution! illustrated by in- 25: d[u] ← level terleaving it with the matrix distribution. 26: push u → NSi 27: Thread Barrier operation is simply a pairwise exchange between P (i, j) and S 28: FS ← NSi . thread-parallel P (j, i). In the more general case, it involves an all-to-all 29: Thread Barrier exchange among processor groups of size pr + pc. Allgather(fij ,P (:, j)) syntax denotes that subvectors Each node stores a sub-matrix of dimensions (n/pr)×(n/pc) fij for j = 1, .., pr is accumulated at all the processors on in its local memory. the jth processor column. 
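For illustration, the local SpMSV step of Algorithm 3 (t_i ← A_ij ⊗ f_i) can be sketched as follows. This is our simplified rendering over a (select, max)-style semiring, assuming a CSC-like local block rather than the DCSC structure described in Section 4.1.

#include <cstdint>
#include <unordered_map>
#include <vector>

// Sparse frontier entry: 'idx' is a local vertex index, 'parent' the proposed
// parent vertex (global id) in the BFS tree.
struct SparseEntry { int64_t idx; int64_t parent; };

// Illustrative local SpMSV over a (select, max)-style semiring:
// semiring "multiply" selects the frontier's parent value, semiring "add"
// takes the maximum, which makes the result deterministic.
std::vector<SparseEntry> local_spmsv(const std::vector<int64_t>& col_ptr,
                                     const std::vector<int64_t>& row_ind,
                                     const std::vector<SparseEntry>& frontier) {
  std::unordered_map<int64_t, int64_t> acc;          // row -> best parent so far
  for (const SparseEntry& f : frontier) {
    for (int64_t k = col_ptr[f.idx]; k < col_ptr[f.idx + 1]; ++k) {
      int64_t row = row_ind[k];
      auto it = acc.find(row);
      if (it == acc.end()) acc.emplace(row, f.parent);
      else if (f.parent > it->second) it->second = f.parent;
    }
  }
  std::vector<SparseEntry> t;
  t.reserve(acc.size());
  for (const auto& kv : acc) t.push_back({kv.first, kv.second});
  return t;                                           // later masked and merged
}

In Algorithm 3 the result is subsequently masked element-wise by the already-assigned parents (line 9) before π and f are updated.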
Similarly, Alltoallv(ti,P (i, :)) Algorithm3 gives the high-level pseudocode of our parallel denotes that each processor scatters the corresponding piece algorithm for BFS on 2D-distributed graphs. This algorithm of its intermediate vector ti to its owner, along the ith pro- implicitly computes the “breadth-first spanning tree” by re- cessor row. turning a dense parents array. The inner loop block starting There are two distinct communication phases in a BFS al- in line4 performs a single level traversal. All vectors and the gorithm with 2D partitioning: a pre-computation “expand” input matrix are 2D-distributed as illustrated in Figure1. f, phase over the processor column (pr processes), and a post- which is initially an empty sparse vector, represents the cur- computation “fold” phase over the processor row (pc pro- rent frontier. t is an sparse vector that holds the temporary cesses). parent information for that iteration only. For a distributed In 1D decomposition, the partitioning of frontier vectors vector v, the syntax vij denotes the local n/p sized piece naturally follows the vertices. In 2D decomposition, how- of the vector owned by the P (i, j)th processor. The syntax ever, vertex ownerships are more flexible. One practice is vi denotes the hypothetical n/pr or n/pc sized piece of the to distribute the vector entries only over one processor di- vector collectively owned by all the processors along the ith mension (pr or pc)[23], for instance the diagonal proces- processor row P (i, :) or column P (:, i). sors if using a square grid (pr = pc), or the first processors Each computational step can be efficiently parallelized of each processor row. This approach is mostly adequate with multithreading. The multithreading of the SpMSV op- for sparse matrix-dense vector multiplication (SpMV), since eration in line7 naturally follows the splitting of the local no local computation is necessary for the “fold” phase after sparse matrix data structure row-wise to t pieces. The vector the reduce step, to which all the processors contribute. For operations in lines9–10 is parallelized simply by exploiting SpMSV, however, distributing the vector to only a subset of loop-level parallelism. processors causes severe imbalance as we show in Section 4.3. TransposeVector redistributes the vector so that the A more scalable and storage-efficient approach is to let subvector owned by the ith processor row is now owned by √ each processor have approximately the same number of ver- the ith processor column. In the case of pr = pc = p, the tices. In this scheme, which we call the “2D vector dis- tribution” in contrast to the “1D vector distribution”, we forms (multicore systems with modest levels of thread-level still respect the two-dimensional processor grid. In other parallelism). Consider pushes of newly-visited vertices to words, the 2D vector distribution matches the matrix distri- the stack NS. A shared stack for all threads would involve bution. Each processor row (except the last) is responsible thread contention for every insertion. We use thread-local for t = bn/prc elements. The last processor row gets the stacks (indicated by NSi in the algorithm) for storing these remaining n − bn/prc(pr − 1) elements Within the proces- vertices, and merging them at the end of each iteration to sor row, each processor (except the last) is responsible for form FS, the frontier stack. Note that the total number l = bt/pcc elements. 
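The 2D vector distribution described above reduces to a few lines of index arithmetic; the helper below is an illustrative sketch (not the Combinatorial BLAS interface) that computes the subvector range owned by processor P(i, j), letting the last processor row and the last processor of each row absorb the remainders.

#include <cstdint>
#include <utility>

// Returns the [first, last) global indices of the frontier subvector owned by
// processor P(i, j) on a pr x pc grid: each processor row owns t = floor(n/pr)
// elements (the last row takes the remainder), and within a row each processor
// owns l = floor(t/pc) of them (the last processor takes the remainder).
std::pair<int64_t, int64_t> owned_range_2d(int64_t n, int pr, int pc, int i, int j) {
  const int64_t t = n / pr;                        // per-row share
  const int64_t row_begin = static_cast<int64_t>(i) * t;
  const int64_t row_len   = (i == pr - 1) ? n - row_begin : t;
  const int64_t l = row_len / pc;                  // per-processor share in the row
  const int64_t begin = row_begin + static_cast<int64_t>(j) * l;
  const int64_t end   = (j == pc - 1) ? row_begin + row_len : begin + l;
  return {begin, end};
}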
In that sense, all the processors on of queue pushes is bounded by n, the number of vertices the ith processor row contribute to the storage of the ver- in the graph. Hence, the cumulative memory requirement tices that are numbered from vipr +1 to v(i+1)pr . The only for these stacks is bounded, and the additional computation downside of the this approach is that the “expand” phase be- performed due to merging would be O(n). Our choice is comes an all-gather operation (within the processor column) different from the approaches taken in prior work (such as instead of the cheaper broadcast. specialized set data structures [25] or a shared queue with atomic increments [1]). For multithreaded experiments con- 4. IMPLEMENTATION DETAILS ducted in this study, we found that our choice does not limit performance, and the copying step constitutes a minor over- 4.1 Graph Representation head, less than 3% of the execution time. Next, consider the distance checks (lines 24-25) and up- For the graph-based BFS implementation, we use a ‘com- dates in Algorithm2. This is typically made atomic to en- pressed sparse row’ (CSR)-like representation for storing ad- sure that a new vertex is added only once to the stack NS. jacencies. All adjacencies of a vertex are sorted and com- We note that the BFS algorithm is still correct (but the out- pactly stored in a contiguous chunk of memory, with ad- put is non-deterministic) even if a vertex is added multiple jacencies of vertex i + 1 next to the adjacencies of i. For times, as the distance value is guaranteed to be written cor- directed graphs, we store only edges going out of vertices. rectly after the thread barrier and memory fence at the end Each edge (rather the adjacency) is stored twice in case of of a level of exploration. further ensures undirected graphs. An array of size n + 1 stores the start that the correct value may propagate to other cores once of each contiguous vertex adjacency block. We use 64-bit it is updated. We observe that we actually perform a very integers to represent vertex identifiers. This representation small percentage of additional insertions (less than 0.5%) for is space-efficient in the sense that the aggregate storage for all the graphs we experimented with at six-way threading. the distributed data structure is on the same order as the This lets us avert the issue of non-scaling atomics across storage that would be needed to store the same data struc- multi-socket configurations [1]. This optimization was also ture serially on a machine with large enough memory. Since considered by Leiserson et al. [25] (termed “benign races”) our graph is static, linked data structures such as adjacency for insertions to their bag data structure. lists would incur more cache misses without providing any For the 2D algorithm, the computation time is dominated additional benefits. by the sequential SpMSV operation in line7 of Algorithm3. On the contrary, a CSR-like representation is too waste- This corresponds to selection, scaling and finally merging ful for storing sub-matrices after 2D partitioning. The ag- columns of the local adjacency matrix that are indexed by gregate memory required to locally store each submatrix in √ the nonzeros in the sparse vector. Computationally, we form CSR format is O(n p + m), while storing the whole matrix the union S A (:, k) for all k where f (k) exists. in CSR format would only take O(n + m). Consequently, ij i We explored multiple methods of forming this union. 
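One concrete way of forming this union, sketched below in simplified form (ours), is the sparse-accumulator style: a dense "occupied" mask plus an explicit list of touched indices, corresponding to the second of the two options discussed next.

#include <cstdint>
#include <vector>

// SPA-style union of adjacency columns A(:, k) for every k present in the
// sparse frontier: a dense occupancy bitmap plus an explicit index list,
// so the result can be reset in time proportional to its size.
std::vector<int64_t> column_union_spa(const std::vector<int64_t>& col_ptr,
                                      const std::vector<int64_t>& row_ind,
                                      const std::vector<int64_t>& frontier_cols,
                                      int64_t local_rows) {
  std::vector<char> occupied(local_rows, 0);   // dense mask (reusable across levels)
  std::vector<int64_t> touched;                // indices of set bits, i.e. the union
  for (int64_t k : frontier_cols) {
    for (int64_t p = col_ptr[k]; p < col_ptr[k + 1]; ++p) {
      int64_t r = row_ind[p];
      if (!occupied[r]) {
        occupied[r] = 1;
        touched.push_back(r);
      }
    }
  }
  // The indices are produced unsorted; an explicit sort at the end of the
  // iteration is needed if a sorted sparse vector is required.
  return touched;
}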
The a strictly O(m) data structure (possibly with fast index- first option is to use a priority-queue of size nnz(f ) and per- ing support) is required. One such data structure, doubly- i form a unbalanced multiway merging. While this option has compressed sparse columns (DCSC), has been introduced the advantage of being memory-efficient and automatically before [7] for hypersparse matrices that arise after 2D de- creating a sorted output, the extra logarithmic factor hurts composition. DCSC for BFS consists of an array IR of row the performance at small concurrencies, even after using a ids (size m), which is indexed by two parallel arrays of col- highly optimized cache-efficient heap. The cumulative re- umn pointers (CP) and column ids (JC). The size of these quirement for these heaps are O(m). The second option is parallel arrays are on the order of the number of columns to use a sparse accumulator (SPA) [18] which is composed that has at least one nonzero (nzc) in them. of a dense vector of values, a bit mask representing the “oc- For the hybrid 2D algorithm, we split the node local ma- cupied” flags, and a list that keeps the indices of existing trix rowwise to t (number of threads) pieces. Each thread elements in the output vector. The SPA approach proved local n/(p t)×n/p sparse matrix is stored in DCSC format. r c to be faster for lower concurrencies, although it has disad- A compact representation of the frontier vector is also im- vantages such as increasing the memory footprint due to the portant. It should be represented in a sparse format, where temporary dense vectors, and having to explicitly sort the only the indices of the non-zeros are stored. We use a stack indices at the end of the iteration. Our microbenchmarks [9] in the 1D implementation and a sorted sparse vector in the revealed a transition point around 10000 cores (for flat MPI 2D implementation. Any extra data that are piggybacked to version) after which the priority-queue method is more ef- the frontier vectors adversely affect the performance, since ficient, both in terms of speed and memory footprint. Our the communication volume of the BFS benchmark is directly final algorithm is therefore a polyalgorithm depending on proportional to the size of this vector. the concurrency. 4.2 Shared memory computation There are two potential bottlenecks to multithreaded par- 4.3 Distributed-memory parallelism allel scaling in Algorithm2 on our target architectural plat- We use the MPI message-passing library to express the inter-node communication steps. In particular, we exten- Using synthetic benchmarks, the values of α and β de- sively utilize the collective communication routines Alltoallv, fined above can be calculated offline for a particular parallel Allreduce, and Allgatherv. system and software configuration. The programming model Our 2D implementation relies on the linear-algebraic prim- employed, the messaging implementation used, the compiler itives of the Combinatorial BLAS framework [8], with cer- optimizations employed are some software factors that de- tain BFS-specific optimizations enabled. termine the various α and β values. We chose to distribute the vectors over all processors in- stead of only a subset (the diagonal processors in this case) 5.1 Analysis of the 1D Algorithm of processors. In SpMSV, the accumulation of sparse con- Consider the locality characteristics of memory references tributions requires the diagonal processor to go through an in the level-synchronous BFS algorithm. 
Memory traffic additional local merging phase, during which all other pro- comprises touching every edge once (i.e., accesses to the ad- cessors on the processor row sit idle, causing severe load im- jacency arrays, cumulatively m), reads and writes from/to balance [9]. Distributing the vectors over all processors (2D the frontier stacks (n), distance array checks (m irregular vector distribution) remedies this problem and we observe accesses to an array of size n) and writes (n accesses to d). almost no load imbalance in that case. The complexity of the 1D BFS algorithm in our model is thus 4.4 Load-balancing traversal mβL (cumulative adjacency accesses) + nαn (accesses to ad- jacency array pointers) + mαn (distance checks/writes). We achieve a reasonable load-balanced graph traversal by In the parallel case with 1D vertex and edge partition- randomly shuffling all the vertex identifiers prior to parti- ing, the number of local vertices nloc is approximately n/p tioning. This leads to each process getting roughly the same and the number of local edges is m/p. The local memory number of vertices and edges, regardless of the degree distri- m n m reference cost is given by p βL + p αL,n/p + p αL,n/p. The bution. An identical strategy is also employed in the Graph distance array checks thus constitute the substantial fraction 500 BFS benchmark. The downside to this is that the edge of the execution time, since the αL,n/p term is significantly cut can be potentially as high as an average random bal- higher than the βL term. One benefit of the distributed ap- anced cut, which can be O(m) for several random graph proach is the array size for random accesses reduces from n families [34]. to n/p, and so the cache working set of the algorithm is sub- stantially lower. Multithreading within a node (say, t-way 5. ALGORITHM ANALYSIS threading) has the effect of reducing the number of processes The RAM and PRAM models capture asymptotic work and increasing the increasing the process-local vertex count and parallel execution time complexity with extremely sim- by a factor of t. ple underlying assumptions of the machine model, and are The remote memory access costs are given by the All-to-all inadequate to analyze and contrast parallel BFS algorith- step, which involves a cumulative data volume of m(p−1)/p mic variants on current parallel systems. For instance, the words sent on the network. For a random graph with a uniform degree distribution, each process would send ev- PRAM asymptotic time complexity for a level-synchronous 2 parallel BFS is O(D) (where D is the diameter of the graph), ery other process roughly m/p words. This value is typ- and the work complexity is O(m + n). These terms do not ically large enough that the bandwidth component domi- provide any realistic estimate of performance on current par- nates over the latency term. Since we randomly shuffle the allel systems. vertex identifiers prior to execution of BFS, these commu- We propose a simple linear model to capture the cost nication bounds hold true in case of the synthetic random networks we experimented with in this paper. Thus, the of regular (unit stride or fixed-stride) and irregular mem- m ory references to various levels of the memory hierarchy, per-node communication cost is given by pαN + p βN,a2a(p). as well as to succinctly express inter-processor MPI com- βN,a2a(p) is a function of the processor count, and several munication costs. 
We use the terms α and β to account factors, including the interconnection topology, node injec- for the latency of memory accesses and the transfer time tion bandwidth, the routing protocol, network contention, per memory word (i.e., inverse of bandwidth) respectively. etc. determine the sustained per-node bandwidth. For in- Further, we use αL to indicate memory references to local stance, if nodes are connected in a 3D torus, it can be shown 2/3 memory, and αN to denote message latency over the net- that bisection bandwidth scales as p . Assuming all-to- work (remote memory accesses). The bandwidth terms can all communication scales similarly, the communication cost m 1/3 also be similarly defined. To account for the differing access can be revised to pαN + p p βN,a2a. If processors are con- m times to various levels of the memory hierarchy, we addi- nected via a ring, then pαN + p pβN,a2a would be an estimate tionally qualify the α term to indicate the size of the data for the all-to-all communication cost, essentially meaning no structure (in memory words) that is being accessed. αL,x, parallel . for instance, would indicate the latency of memory access to a memory word in a logically-contiguous memory chunk 5.2 Analysis of the 2D Algorithm of size x words. Similarly, to differentiate between various Consider the general 2D processor grid of pr ×pc. The size inter-node collective patterns and algorithms, we qualify the of the local adjacency matrix is n/pr ×n/pc. The number of network bandwidth terms with the communication pattern. memory references is the same as the 1D case, cumulatively For instance, βN,p2p would indicate the sustained memory over all processors. However, the cache working set is big- bandwidth for point-to-point communication, βN,a2a would ger, because the sizes of the local input (frontier) and output indicate the sustained memory bandwidth per node in case vectors are n/pr and n/pc, respectively. The local mem- m n m of an all-to-all communication scenario, and βN,ag would ory reference cost is given by p βL + p αL,n/pc + p αL,n/pr . indicate the sustained memory bandwidth per node for an The higher number of cache misses associated with larger allgather operation. working sets is perhaps the primary reason for the relatively higher computation costs of the 2D algorithm. Most of the costs due to remote memory accesses is con- centrated in two operations. The expand phase is character- ized by an Allgatherv operation over the processor column (of size pr) and the fold phase is characterized by an All- toallv operation over the processor row (of size pc). The aggregate input to the Allgatherv step is O(n) over all iterations. However, each processor receives a 1/pc por- tion of it, meaning that frontier subvector gets replicated along the processor column. Hence, the per node commu- n nication cost is prαN + βN,ag(pr). This replication can pc be partially avoided by performing an extra round of com- munication where each processor individually examines its columns and broadcasts the indices of its nonempty columns. However, this extra step does not decrease the asymptotic complexity for graphs that do not have good separators. The aggregate input to the Alltoallv step can be as high as O(m), although the number is lower in practice due to in-node aggregation of newly discovered vertices that takes place before the communication. Since each processor re- ceives only a 1/p portion of this data, the remote costs due m to this step are at most pcαN + p βN,a2a(pc). 
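Gathering the terms derived in Sections 5.1 and 5.2, one way to summarize the per-node cost estimates in the α/β notation of this section (the fold-phase term being an upper bound) is

T_{1D} \approx \frac{m}{p}\beta_L + \frac{n}{p}\alpha_{L,n/p} + \frac{m}{p}\alpha_{L,n/p} + p\,\alpha_N + \frac{m}{p}\,\beta_{N,a2a}(p)

T_{2D} \approx \frac{m}{p}\beta_L + \frac{n}{p}\alpha_{L,n/p_c} + \frac{m}{p}\alpha_{L,n/p_r} + p_r\,\alpha_N + \frac{n}{p_c}\,\beta_{N,ag}(p_r) + p_c\,\alpha_N + \frac{m}{p}\,\beta_{N,a2a}(p_c).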
We see that for large p, the expand phase is likely to be more expensive than the fold phase. Our analysis success- fully captures that the relatively lower communication costs of the 2D algorithm by representing β as a function of N,x Figure 2: BFS ‘strong scaling’ results on Franklin the processor count. for R-MAT graphs: Performance rate achieved (in GTEPS) on increasing the number of processors 6. EXPERIMENTAL STUDIES (top), and the corresponding average inter-node We have extensively evaluated the performance of our two MPI communication time (in seconds, bottom). parallel BFS schemes. We experiment with two parallel pro- gramming models: “Flat MPI” with one process per core, component, compute the mean search time (harmonic mean and a hybrid implementation with one or more MPI pro- of TEPS) using at least 16 randomly-chosen sources vertices cesses within a node. We use synthetic graphs based on for each benchmark graph, and normalize the time by the the R-MAT random graph model [10], as well as a large- cumulative number of edges visited to get the TEPS rate. As scale real world graph that represents a web crawl of the suggested by the Graph 500 benchmark, we first symmetrize UK domain [6] (uk-union) that has 133 million vertices and the input to model undirected graphs. For TEPS calcula- 5.5 billion edges. The R-MAT generator creates networks tion, we only count the number of edges in the original di- with skewed degree distributions and a very low graph di- rected graph, despite visiting symmetric edges as well. The ameter. We set the R-MAT parameters a, b, c, and d to performance rate captures both the communication time and 0.59, 0.19, 0.19, 0.05 respectively. These parameters are iden- the computation time. For R-MAT graphs, the default edge tical to the ones used for generating synthetic instances in count to vertex ratio is set to 16 (which again corresponds the Graph 500 BFS benchmark. R-MAT graphs make for to the Graph 500 default setting), but we also vary the ratio interesting test instances: traversal load-balancing is non- of edge to vertex counts in some of our experiments. trivial due to the skewed degree distribution, the graphs We collect performance results on two large systems: ‘Hop- lack good separators, and common vertex relabeling strate- per’, a 6392-node Cray XE6 and ‘Franklin’, a 9660-node gies are also expected to have a minimal effect on cache Cray XT4. Both supercomputers are located at NERSC, performance. The diameter of the uk-union graph is signifi- Lawrence Berkeley National Laboratory. Each XT4 node cantly higher (≈ 140) than R-MAT’s (less than 10), allowing contains a quad-core 2.3 GHz AMD Opteron processor, which us to access the sensitivity of our algorithms with respect to is tightly integrated to the XT4 interconnect via a Cray the number of synchronizations. We use undirected graphs SeaStar2 ASIC through a HyperTransport (HT) 2 interface for all our experiments, but the BFS approaches can work capable of 6.4 GB/s. The SeaStar routing chips are inter- with directed graphs as well. connected in a 3D torus topology, and each link is capable To compare performance across multiple systems using a of 7.6 GB/s peak bidirectional bandwidth. Each XE6 node rate analogous to the commonly-used floating point opera- contains two twelve-core 2.1 GHz AMD Opteron processors, tions/second, we normalize the serial and parallel execution integrated to the Cray Gemini interconnect through HT 3 times by the number of edges visited in a BFS traversal interfaces. 
Each Gemini chip is capable of 9.8 GB/s band- and present a ‘Traversed Edges Per Second’ (TEPS) rate. width. Two XE6 nodes share a Gemini chip as opposed to For a graph with a single connected component (or one the one-to-one relationship on XT4. The effective bisection strongly connected component in case of directed networks), bandwidth of XE6 is slightly lower (ranging from 1-20%, the baseline BFS algorithm would visit every edge twice depending on the bisection dimension) than XT4’s. Each (once in case of directed graphs). We only consider traver- twelve-core ‘MagnyCours’ die is essentially composed of two sal execution times from vertices that appear in the large 6-core NUMA nodes. More information on the architectures Figure 4: BFS ‘weak scaling’ results on Franklin for Graph 500 R-MAT graphs: Mean search time (left) and MPI communication time (right) on fixed prob- lem size per core (each core has ≈ 17M edges). In both plots, lower is better.
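The TEPS bookkeeping described above is straightforward to express in code; the helper below is an illustrative sketch (ours) that converts per-source search times into TEPS rates and aggregates them with a harmonic mean, counting only the edges of the original directed graph as described in the text.

#include <cstddef>
#include <vector>

// Convert per-source BFS times into TEPS rates and aggregate them with the
// harmonic mean, following the Graph 500 convention: edges_visited counts
// only edges of the original directed graph, even though symmetric edges
// are also traversed.
double harmonic_mean_teps(const std::vector<double>& search_times_sec,
                          const std::vector<double>& edges_visited) {
  double sum_inverse = 0.0;
  for (std::size_t i = 0; i < search_times_sec.size(); ++i) {
    double teps = edges_visited[i] / search_times_sec[i];
    sum_inverse += 1.0 / teps;
  }
  return static_cast<double>(search_times_sec.size()) / sum_inverse;
}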

[Figure 5 legend: 1D Flat MPI, 1D Hybrid, 2D Flat MPI, 2D Hybrid.]
4 Figure 3: BFS ‘strong scaling’ results on Hop- per for R-MAT graphs: Performance rate achieved 2 (in GTEPS) on increasing the number of proces- Performance Rate (GTEPS) 0 sors (top), and the corresponding average inter-node SCALE 31, degree 4 SCALE 29, degree 16 SCALE 27, degree 64 MPI communication time (in seconds, bottom). can be found in the extended version [9]. Figure 5: BFS GTEPS performance rate achieved We used the GNU C/C++ compilers (v4.5) for compiling on 4096 cores of Franklin by varying the average both implementations. The 1D versions are implemented in vertex degree for R-MAT graphs. C, whereas the 2D codes are implemented in C++. We use In contrast to Franklin results, the 2D algorithms perform Cray’s MPI implementation, which is based on MPICH2. better than their 1D counterparts. The more sophisticated For intra-node threading, we use the GNU OpenMP library. Magny-Cours chips of Hopper are clearly faster in integer Figure2 shows ‘strong scaling’ of our algorithms’ perfor- calculations, while the overall bisection bandwidth has not mance (higher is better) on Franklin. We see that the flat kept pace. We did not execute the flat 1D algorithm on 1D algorithms are about 1.5 − 1.8× faster than the 2D al- 40K cores as the communication times already started to gorithms on this architecture. The 1D hybrid algorithm, increase when going from 10K to 20K cores, consuming more albeit slower than the flat 1D algorithm for smaller con- than 90% of the overall execution time. In contrast, the currencies, starts to perform significantly faster for larger percentage of time spent in communication for the 2D hybrid concurrencies. We attribute this effect partially to bisection algorithm was less than 50% on 20K cores. bandwidth saturation, and partially to the saturation of the “Weak scaling” performance results on Franklin are shown network interface card when using more cores (leading to in Figure4, where we fix the edges per processor to a con- more outstanding communication requests) per node. The stant value. To be consistent with prior literature, we present 2D hybrid algorithm tends to outperform the flat 2D algo- weak scaling results in terms of the time it takes to complete rithm, but can not compete with the 1D algorithms on this the BFS iterations, with ideal curve being a flat line. Inter- architecture as it spends significantly more time in compu- estingly, in this regime, the flat 1D algorithm performs bet- tation. This is due to relatively larger cache working sizes, ter than the hybrid 1D algorithm, both in terms of overall as captured by our model in Section5. performance and communication costs. The 2D algorithms, The communication costs, however, tell a different story although performing much less communication than their about the relative competitiveness of our algorithms. 2D 1D counterparts, come later in terms of overall performance algorithms consistently spend less time (30-60% for scale 32) on this architecture, due to their higher computation over- in communication, compared to their relative 1D algorithms. heads. This is also expected by our analysis, as smaller number Figure5 shows the sensitivity of our algorithms to varying of participating processors in collective operations tend to graph densities. In this experiment, we kept the number result in faster communication times, with the same amount of edges per processor constant by varying the number of of data. The hybrid 1D algorithm catches up with the flat vertices as the average degree varies. 
The trend is obvious 2D algorithm at large concurrencies for the smaller (scale in that the performance margin between the 1D algorithms 29) dataset, but still lags behind the hybrid 2D algorithm. and the 2D algorithms increases in favor of the 1D algorithm Strong scaling results on Hopper are shown in Figure3. as the graph gets sparser. The empirical data supports our 9 Computa. Table 1: Performance comparison with PBGL on 8 Communi. Carver. The reported numbers are in MTEPS for 7 R-MAT graphs with the same parameters as before. 6 In all cases, the graphs are undirected and edges are 5 permuted for load balance. 4 3 Problem Size Core count Code 2 Scale 22 Scale 24

Table 1 (data):

Core count   Code      Scale 22   Scale 24
128          PBGL      25.9       39.4
128          Flat 2D   266.5      567.4
256          PBGL      22.4       37.5
256          Flat 2D   349.8      603.6

[Figure 6 legend: 2D Flat, 2D Hybrid; y-axis: Mean Search Time (secs); x-axis: 500, 1000, 2000, 4000 cores.]

Figure 6: Running times of the 2D algorithms on 7. CONCLUSIONS AND FUTURE WORK the uk-union data set on Hopper (lower is better). In this paper, we present a design-space exploration of The numbers on the x-axis is the number of cores. distributed-memory parallel BFS, discussing two fast“hybrid- The running times translate into a maximum of 3 parallel” approaches for large-scale graphs. Our experimen- GTEPS performance, achieving a 4× speedup when tal study encompasses performance analysis on several large- going from 500 to 4000 cores scale synthetic random graphs that are also used in the analysis in Section5, which stated that the 2D algorithm recently announced Graph 500 benchmark. The absolute performance was limited by the local memory accesses to its performance numbers we achieve on the large-scale parallel relatively larger vectors. For a fixed number of edges, the systems Hopper and Franklin at NERSC are significantly matrix dimensions (hence the length of intermediate vectors) higher than prior work. The performance results, coupled shrink as the graph gets denser, partially nullifying the cost with our analysis of communication and memory access costs of local cache misses. of the two algorithms, challenges conventional wisdom that We show the performance of our 2D algorithms on the fine-grained communication is inherent in parallel graph al- real uk-union data set in Figure6. We see that communi- gorithms and necessary for achieving high performance [26]. cation takes a very small fraction of the overall execution We list below optimizations that we intend to explore in time, even on 4K cores. This is a notable result because future work, and some open questions related to design of the uk-union dataset has a relatively high-diameter and the distributed-memory graph algorithms. BFS takes approximately 140 iterations to complete. Since Exploiting symmetry in undirected graphs. If the communication is not the most important factor, the hybrid graph is undirected, then one can save 50% space by storing algorithm is slower than flat MPI, as it has more intra-node only the upper (or lower) triangle of the sparse adjacency parallelization overheads. matrix, effectively doubling the size of the maximum prob- To compare our approaches with prior and reference dis- lem that can be solved in-memory on a particular system. tributed memory implementations, we experimented with The algorithmic modifications needed to save a comparable the Parallel Boost Graph Library’s (PBGL) BFS implemen- amount in communication costs for BFS iterations is not tation [21] (Version 1.45 of the Boost library) and the ref- well-studied. erence MPI implementation (Version 2.1) of the Graph 500 Exploring alternate programming models. Partitioned benchmark [20]. On Franklin, our Flat 1D code is 2.72×, global address space (PGAS) languages can potentially sim- 3.43×, and 4.13× faster than the non-replicated reference plify expression of graph algorithms, as inter-processor com- MPI code on 512, 1024, and 2048 cores, respectively. munication is implicit. In future work, we will investigate Since PBGL failed to compile on the Cray machines, we whether our two new BFS approaches are amenable to ex- ran comparison tests on Carver, an IBM iDataPlex system pression using PGAS languages, and whether they can de- with 400 compute nodes, each node having two quad-core liver comparable performance. Intel Nehalem processors. 
Exploring alternate programming models. Partitioned global address space (PGAS) languages can potentially simplify the expression of graph algorithms, as inter-processor communication is implicit. In future work, we will investigate whether our two new BFS approaches are amenable to expression using PGAS languages, and whether they can deliver comparable performance.

Reducing inter-processor communication volume with graph partitioning. An alternative to randomization of vertex identifiers is to use hypergraph partitioning software to reduce communication. Although hypergraphs are capable of accurately modeling the communication costs of sparse matrix-dense vector multiplication, the SpMSV case has not been studied; it is potentially harder because the sparsity pattern of the frontier vector changes over the BFS iterations.
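To make the SpMSV view concrete, the sequential sketch below (ours; the CSR structure and the bfs_level function are illustrative names, and the actual distributed algorithm additionally partitions the matrix and both vectors) performs one BFS level as a sparse matrix-sparse vector product over a "select parent" semiring, masked by the unvisited vertices. The nonzero pattern of the frontier argument is different at every level, so a partition that balances the dense-vector product need not balance these products.

    // One level-synchronous BFS step as a masked sparse matrix-sparse vector
    // product. 'parent[v] == -1' marks unvisited vertices; the caller seeds the
    // search with parent[root] = root and frontier = {root}.
    #include <cstdint>
    #include <vector>

    struct CSR {
        int64_t n;
        std::vector<int64_t> rowptr;   // size n+1; row u lists the neighbors of u
        std::vector<int64_t> colidx;
    };

    // Returns the next frontier as a sparse vector given by its index set.
    std::vector<int64_t> bfs_level(const CSR& A,
                                   const std::vector<int64_t>& frontier,
                                   std::vector<int64_t>& parent) {
        std::vector<int64_t> next;
        for (int64_t u : frontier) {                    // nonzeros of the input vector
            for (int64_t k = A.rowptr[u]; k < A.rowptr[u + 1]; ++k) {
                int64_t v = A.colidx[k];                // nonzero A(u,v)
                if (parent[v] == -1) {                  // mask: only unvisited entries
                    parent[v] = u;                      // semiring "multiply": select parent
                    next.push_back(v);
                }
            }
        }
        return next;
    }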
Interprocessor collective communication optimization. We conclude that even after alleviating the communication costs, the performance of distributed-memory parallel BFS is heavily dependent on the inter-processor collective communication routines All-to-all and Allgather. Understanding the bottlenecks in these routines at high process concurrencies, and designing network topology-aware collective algorithms, is an interesting avenue for future research.
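As a point of reference for where this time goes, the sketch below (ours, simplified; the exchange_frontier routine is a hypothetical name, and the Allgather used along processor rows and columns in the 2D variant is omitted) shows the all-to-all exchange that a 1D level-synchronous BFS issues at every level: a count exchange with MPI_Alltoall, followed by the newly discovered vertex identifiers with MPI_Alltoallv.

    // Per-level frontier exchange in a 1D level-synchronous BFS.
    // 'outgoing[p]' holds the vertices this rank discovered that are owned by rank p.
    // Returns the identifiers received from all ranks (the local part of the next frontier).
    #include <mpi.h>
    #include <algorithm>
    #include <numeric>
    #include <vector>

    std::vector<long long> exchange_frontier(
            const std::vector<std::vector<long long>>& outgoing, MPI_Comm comm) {
        int nprocs;
        MPI_Comm_size(comm, &nprocs);

        // Exchange per-destination counts first.
        std::vector<int> sendcnt(nprocs), recvcnt(nprocs);
        for (int p = 0; p < nprocs; ++p)
            sendcnt[p] = static_cast<int>(outgoing[p].size());
        MPI_Alltoall(sendcnt.data(), 1, MPI_INT, recvcnt.data(), 1, MPI_INT, comm);

        // Exclusive prefix sums give the send/receive displacements.
        std::vector<int> sdispls(nprocs, 0), rdispls(nprocs, 0);
        std::partial_sum(sendcnt.begin(), sendcnt.end() - 1, sdispls.begin() + 1);
        std::partial_sum(recvcnt.begin(), recvcnt.end() - 1, rdispls.begin() + 1);

        // Pack and exchange the vertex identifiers.
        std::vector<long long> sendbuf(sdispls[nprocs - 1] + sendcnt[nprocs - 1]);
        for (int p = 0; p < nprocs; ++p)
            std::copy(outgoing[p].begin(), outgoing[p].end(), sendbuf.begin() + sdispls[p]);
        std::vector<long long> recvbuf(rdispls[nprocs - 1] + recvcnt[nprocs - 1]);
        MPI_Alltoallv(sendbuf.data(), sendcnt.data(), sdispls.data(), MPI_LONG_LONG,
                      recvbuf.data(), recvcnt.data(), rdispls.data(), MPI_LONG_LONG, comm);
        return recvbuf;
    }

It is the scaling of exactly these calls at tens of thousands of processes that the future work outlined above targets.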
Acknowledgments

This work was supported by the Director, Office of Science, U.S. Department of Energy under Contract No. DE-AC02-05CH11231. Discussions with John R. Gilbert, Steve Reinhardt, and Adam Lugowski greatly improved our understanding of casting BFS iterations into sparse linear algebra. John Shalf and Nick Wright provided generous technical and moral support during the project.

8. REFERENCES

[1] V. Agarwal, F. Petrini, D. Pasetto, and D.A. Bader. Scalable graph exploration on multicore processors. In Proc. ACM/IEEE Conference on Supercomputing (SC10), November 2010.
[2] D. Ajwani, R. Dementiev, and U. Meyer. A computational study of external-memory BFS algorithms. In Proc. 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '06), pages 601-610, January 2006.
[3] D. Ajwani and U. Meyer. Design and engineering of external memory traversal algorithms for general graphs. In J. Lerner, D. Wagner, and K.A. Zweig, editors, Algorithmics of Large and Complex Networks: Design, Analysis, and Simulation, pages 1-33. Springer, 2009.
[4] D.A. Bader and K. Madduri. Designing multithreaded algorithms for breadth-first search and st-connectivity on the Cray MTA-2. In Proc. 35th Int'l. Conf. on Parallel Processing (ICPP 2006), pages 523-530, August 2006.
[5] J. Barnat, L. Brim, and J. Chaloupka. Parallel breadth-first search LTL model-checking. In Proc. 18th IEEE Int'l. Conf. on Automated Software Engineering, pages 106-115, October 2003.
[6] P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. In Proc. 13th Int'l. World Wide Web Conference (WWW 2004), pages 595-601, 2004.
[7] A. Buluç and J.R. Gilbert. On the representation and multiplication of hypersparse matrices. In Proc. Int'l Parallel and Distributed Processing Symp. (IPDPS 2008), pages 1-11. IEEE Computer Society, 2008.
[8] A. Buluç and J.R. Gilbert. The Combinatorial BLAS: Design, implementation, and applications. The International Journal of High Performance Computing Applications, Online first, 2011.
[9] A. Buluç and K. Madduri. Parallel breadth-first search on distributed memory systems. Technical Report LBNL-4769E, Lawrence Berkeley National Laboratory, 2011.
[10] D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A recursive model for graph mining. In Proc. 4th SIAM Intl. Conf. on Data Mining (SDM), Orlando, FL, April 2004. SIAM.
[11] A. Chan, F. Dehne, and R. Taylor. CGMGRAPH/CGMLIB: Implementing and testing CGM graph algorithms on PC clusters and shared memory machines. Int'l. Journal of High Performance Comput. Appl., 19(1):81-97, 2005.
[12] G. Cong, G. Almasi, and V. Saraswat. Fast PGAS implementation of distributed graph algorithms. In Proc. ACM/IEEE Conference on Supercomputing (SC10), November 2010.
[13] G. Cong and K. Makarychev. Improving memory access locality for large-scale graph analysis applications. In Proc. 22nd Intl. Parallel and Distributed Computing and Communication Systems (PDCCS 2009), pages 121-127, September 2009.
[14] T.H. Cormen, C.E. Leiserson, and R.L. Rivest. Introduction to Algorithms. MIT Press, Inc., Cambridge, MA, 1990.
[15] E. Cuthill and J. McKee. Reducing the bandwidth of sparse symmetric matrices. In Proc. 24th ACM Annual Conf./Annual Meeting, pages 157-172, 1969.
[16] N. Edmonds, J. Willcock, T. Hoefler, and A. Lumsdaine. Design of a large-scale hybrid-parallel graph library. In International Conference on High Performance Computing, Student Research Symposium, Goa, India, December 2010. To appear.
[17] H. Gazit and G.L. Miller. An improved parallel algorithm that computes the BFS numbering of a directed graph. Information Processing Letters, 28(2):61-65, 1988.
[18] J.R. Gilbert, C. Moler, and R. Schreiber. Sparse matrices in Matlab: Design and implementation. SIAM Journal of Matrix Analysis and Applications, 13(1):333-356, 1992.
[19] J.R. Gilbert, S. Reinhardt, and V.B. Shah. A unified framework for numerical and combinatorial computing. Computing in Science and Engineering, 10(2):20-25, 2008.
[20] The Graph 500 List. http://www.graph500.org, last accessed April 2011.
[21] D. Gregor and A. Lumsdaine. Lifting sequential graph algorithms for distributed-memory parallel computation. In Proc. 20th ACM SIGPLAN Conf. on Object Oriented Programming, Systems, Languages, and Applications (OOPSLA), pages 423-437, October 2005.
[22] P. Harish and P.J. Narayanan. Accelerating large graph algorithms on the GPU using CUDA. In Proc. 14th Int'l. Conf. on High-Performance Computing (HiPC), pages 197-208, December 2007.
[23] B. Hendrickson, R.W. Leland, and S. Plimpton. An efficient parallel algorithm for matrix-vector multiplication. International Journal of High Speed Computing, 7(1):73-88, 1995.
[24] R.E. Korf and P. Schultze. Large-scale parallel breadth-first search. In Proc. 20th National Conf. on Artificial Intelligence (AAAI'05), pages 1380-1385, July 2005.
[25] C.E. Leiserson and T.B. Schardl. A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers). In Proc. 22nd ACM Symp. on Parallelism in Algorithms and Architectures (SPAA '10), pages 303-314, June 2010.
[26] A. Lumsdaine, D. Gregor, B. Hendrickson, and J.W. Berry. Challenges in parallel graph processing. Parallel Processing Letters, 17:5-20, 2007.
[27] L. Luo, M. Wong, and W.-m. Hwu. An effective GPU implementation of breadth-first search. In Proc. 47th Design Automation Conference (DAC '10), pages 52-55, June 2010.
[28] G. Malewicz, M.H. Austern, A.J.C. Bik, J.C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In Proc. Int'l. Conf. on Management of Data (SIGMOD '10), pages 135-146, June 2010.
[29] D. Mizell and K. Maschhoff. Early experiences with large-scale XMT systems. In Proc. Workshop on Multithreaded Architectures and Applications (MTAAP'09), May 2009.
[30] R. Pearce, M. Gokhale, and N.M. Amato. Multithreaded asynchronous graph traversal for in-memory and semi-external memory. In Proc. 2010 ACM/IEEE Int'l. Conf. for High Performance Computing, Networking, Storage and Analysis (SC'10), pages 1-11, 2010.
[31] M.J. Quinn and N. Deo. Parallel graph algorithms. ACM Comput. Surv., 16(3):319-348, 1984.
[32] A.E. Reghbati and D.G. Corneil. Parallel computations in graph theory. SIAM Journal of Computing, 2(2):230-237, 1978.
[33] D.P. Scarpazza, O. Villa, and F. Petrini. Efficient breadth-first search on the Cell/BE processor. IEEE Transactions on Parallel and Distributed Systems, 19(10):1381-1395, 2008.
[34] G.R. Schreiber and O.C. Martin. Cut size statistics of graph bisection heuristics. SIAM Journal on Optimization, 10(1):231-251, 1999.
[35] J. Ullman and M. Yannakakis. High-probability parallel transitive closure algorithms. In Proc. 2nd Annual Symp. on Parallel Algorithms and Architectures (SPAA 1990), pages 200-209, July 1990.
[36] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Computing, 35(3):178-194, 2009.
[37] Y. Xia and V.K. Prasanna. Topologically adaptive parallel breadth-first search on multicore processors. In Proc. 21st Int'l. Conf. on Parallel and Distributed Computing Systems (PDCS'09), November 2009.
[38] A. Yoo, E. Chow, K. Henderson, W. McLendon, B. Hendrickson, and Ü.V. Çatalyürek. A scalable distributed parallel breadth-first search algorithm on BlueGene/L. In Proc. ACM/IEEE Conf. on High Performance Computing (SC2005), November 2005.
[39] K. You, J. Chong, Y. Yi, E. Gonina, C. Hughes, Y.-K. Chen, W. Sung, and K. Keutzer. Parallel scalability in speech recognition: Inference engine in large vocabulary continuous speech recognition. IEEE Signal Processing Magazine, 26(6):124-135, 2009.