Scalable SIMD-Efficient Graph Processing on Gpus

2015 International Conference on Parallel Architecture and Compilation Scalable SIMD-Efficient Graph Processing on GPUs Farzad Khorasani Rajiv Gupta Laxmi N. Bhuyan Computer Science and Engineering Department University of California Riverside, CA, USA {fkhor001, gupta, bhuyan}@cs.ucr.edu Abstract—The vast computing power of GPUs makes them In this paper we present techniques that maximize the an attractive platform for accelerating large scale data parallel scalability and performance of vertex-centric graph process- computations such as popular graph processing applications. ing on multi-GPU systems by fully exploiting the available However, the inherent irregularity and large sizes of real- world power law graphs makes effective use of GPUs a resources as follows: major challenge. In this paper we develop techniques that SIMD hardware – The irregular nature of power law greatly enhance the performance and scalability of vertex- graphs makes it difficult to balance load across threads centric graph processing on GPUs. First, we present Warp leading to underutilization of SIMD resources. We address Segmentation, a novel method that greatly enhances GPU the device underutilization problem of a GPU by developing device utilization by dynamically assigning appropriate number of SIMD threads to process a vertex with irregular-sized Warp Segmentation that dynamically assigns appropriate neighbors while employing compact CSR representation to number of SIMD threads to process a vertex with irregular- maximize the graph size that can be kept inside the GPU sized neighbors. Our experiments show that the warp exe- global memory. Prior works can either maximize graph sizes cution efficiency of warp segmentation exceeds 70% while (VWC [11] uses the CSR representation) or device utilization for the well known VWC [11] technique it is around 40%. (e.g., CuSha [13] uses the CW representation; however, CW is roughly 2.5x the size of CSR). Second, we further scale GPU global memory – For scaling performance to large graph processing to make use of multiple GPUs while proposing graphs, they must be held in the global memory of the GPUs. Vertex Refinement to address the challenge of judiciously using To maximize the graph sizes that can be held in global the limited bandwidth available for transferring data between memories, a compact graph representation must be used. GPUs via the PCIe bus. Vertex refinement employs parallel Therefore Warp Segmentation makes use of the compact binary prefix sum to dynamically collect only the updated boundary vertices inside GPUs’ outbox buffers for dramatically CSR representation. To tolerate the long latency of non- reducing inter-GPU data transfer volume. Whereas existing coalesced memory accesses that arise while accessing the multi-GPU techniques (Medusa [31], TOTEM [7]) perform neighbors of a vertex in CSR, warp segmentation keeps the high degree of wasteful vertex transfers. On a single GPU, GPU cores busy by scheduling other useful operations that our framework delivers average speedups of 1.29x to 2.80x compute the segment size and lane’s intra-segment index. over VWC. When scaled to multiple GPUs, our framework achieves up to 2.71x performance improvement compared to Inter-GPU communication bandwidth – Since large inter-GPU vertex communication schemes used by other multi- graphs must be distributed across the global memories of GPU techniques (i.e., Medusa, TOTEM). multiple GPUs, processing at each GPU requires values of Keywords-Graphs; Power Law Graphs; GPU; Irregular neighboring vertices that reside on other GPUs. Here we Computations; Scalability; Multi-GPU; must make judicious use of the limited bandwidth available for transferring data between GPUs via the PCIe bus. We introduce an approach based upon parallel binary prefix I. INTRODUCTION sum that dynamically limits the inter-GPU transfers to only Due to their ability to represent relationships between include updated vertices. Existing multi-GPU techniques entities, graphs have become the building blocks of many perform high degree of wasteful vertex value transfers. high performance data analysis algorithms. A wide variety Our solution maximizes the graph sizes for which high of graph algorithms can be expressed in an iterative form – performance can be achieved by fully utilizing GPU re- during each iteration vertices update their state based upon sources of SIMD hardware, memory, and bandwidth. states of neighbors connected by edges using a computation Let us briefly consider the related works and see how our procedure until the graph state becomes stable. The inherent approach overcomes their drawbacks. First, we consider the data parallelism in an iterative graph algorithm makes many- prominent single GPU techniques for vertex-centric graph core processors with underlying SIMD hardware such as processing, namely VWC [11] and CuSha [13]. Virtual-Warp GPUs an attractive platform for accelerating the algorithms. Centric (VWC) [11] is the state-of-the-art method that uses However, efficient mapping of real-world power law graphs the compact CSR representation and is inevitably prone to with irregularities to symmetric GPU architecture is a chal- load imbalance when processing real-world graphs due to lenging task [19]. high variation in degrees of vertices. When the size of the 1089-795X/15 $31.00 © 2015 IEEE 39 DOI 10.1109/PACT.2015.15 virtual warp is less than the number of neighbors for a if it is marked and refined by the online stage. Thus, vertex, the virtual warp needs to iterate over the neighbors Vertex Refinement eliminates the communication overhead forcing other virtual warps within the warp that are assigned and provides higher multi-GPU performance. to vertices with fewer neighbors to stay inactive. When the The key contributions of this work are: size of the virtual warp is greater than the the size of the • We introduce Warp Segmentation (WS), a novel tech- neighbors for a vertex, a great portion of the virtual warp nique for compact graph representations that overcomes is disabled. Both cases lead to underutilization of SIMD SIMD underutilization during graph processing. On resources and poor warp execution efficiency. In addition, average, WS outperforms VWC by 1.29x−2.80x. discovering the best virtual warp size for every graph and • We introduce Vertex Refinement that enables effective every expressed graph algorithm requires multiple tries. scaling of the graph processing procedure to multiple CuSha [13] addresses the drawbacks of VWC, namely warp GPUs. It efficiently filters updated vertices of a GPU execution inefficiencies and non-coalesced accesses, but at on-the-fly via parallel binary prefix sum and provides the cost of using G-Shards and CW graph representations exclusive speedup of up to 2.71x over other multi-GPU which are 2x-2.5x larger than the CSR representation due vertex communication schemes. to vertex replication. In contrast, Warp Segmentation uses • We implemented a framework that embeds above items the compact CSR representation while delivering high SIMD and enables users to easily express desired iterative hardware utilization. In warp segmentation the neighbors graph algorithm and execute it on one or multiple of warp-assigned vertices are grouped into segments. Warp CUDA-enabled GPUs. lanes then get assigned to these neighbors and recognize The remainder of the paper is organized as follows. In their position inside the segment and the segment size by Section II, we present Warp Segmentation and in Section III first performing a quick binary search on the fast shared we introduce the framework interface. Then in Section IV, memory content and then comparing their edge index with we describe efficient scaling of our framework to multiple corresponding neighbor indices. When the segment size GPUs via Vertex Refinement. In Section V, we present the and the position in the segment are known for the lanes, experimental evaluation. Sections VI and VII give related user-defined reduction can be efficiently performed between work and conclusion respectively. neighbors of a vertex without introducing any intra-warp load imbalance. When processing on a single device, our II. WARP SEGMENTATION FOR SIMD-EFFICIENCY framework employs asynchronous parallelism paradigm that Here we present Warp Segmentation (WS) that eliminates allows simultaneous program execution and data transfer in intra-warp load imbalance and enhances execution efficiency an iteration. It also permits visibility of the updated neighbor for processing a graph in CSR form. CSR is a compact vertex content, enabling a faster convergence. form suitable for representing large and sparse graphs in a Next let us consider the related works on multi-GPU graph minimum space. Due to its space-efficiency, CSR is a good processing [31] [7]. Given a partitioning of a graph across choice to hold large graphs inside the limited GPU memory. multiple GPUs, these techniques underestimate the impor- As Figure 1 shows, CSR consists of 4 arrays: tance of efficient inter-device communication and do not ith effectively utilize the PCIe bandwidth. This is a significant • VertexValues holds the content of the vertex in its ith problem because the PCIe bus, as the path to communicate element. data from one GPU to other GPUs, is tens of times slower • NbrVertexIndices holds the indices for a vertex’s neigh- than GPU global memory. Previous multi-GPU techniques bors in a contiguous fashion. either copy the whole vertex set belonging to one GPU to • NbrIndices holds a prefix sum of the number of neigh- ith other GPUs at every iteration [31], or they identify boundary bors for vertices. The vertex’s neighbors inside NbrIndices i vertices in a pre-processing stage and make GPUs exchange NbrVertexIndices start at [ ] and end before NbrIndices i these subsets of vertices in every iteration [7] [8]. In both [ +1]. approaches, a great number of vertices that are exchanged • EdgeValues holds the edge values corresponding to the between devices is redundant. In contrast, we propose Vertex neighbors inside NbrVertexIndices. Refinement, a new strategy that enables our framework To motivate the need for WS, we first describe the to efficiently scale to multiple GPUs.

Scalable SIMD-Efficient Graph Processing on Gpus

Distributed Algorithms with Theoretic Scalability Analysis of Radial and Looped Load ﬂows for Power Distribution Systems

Scalability and Performance Management of Internet Applications in the Cloud

Scalability and Optimization Strategies for GPU Enhanced Neural Networks (Genn)

Parallel System Performance: Evaluation & Scalability

Multiprocessing and Scalability

Pascal Viewer: a Tool for the Visualization of Parallel Scalability Trends

Computer Architecture: Parallel Processing Basics

Parallel Programming with Openmp

Challenges for the Message Passing Interface in the Petaflops Era

Computing at Massive Scale: Scalability and Dependability Challenges

CUDA Basics Murphy Stein New York University

Evaluation of a Multithreaded Architecture for Cellular Computing