Exploring the Limits of GPUs With Parallel Graph Algorithms

Kumanan Yogaratnam
kyogarat@connect.carleton.ca

A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree of

Master of Computer Science

Carleton University Ottawa, Ontario, Canada

© 2010 by Kumanan Yogaratnam

December 19, 2010

Abstract

In this thesis we explore the limits of graphics processors (GPUs) for general purpose computing by studying problems that require highly irregular data access patterns: parallel graph algorithms for list ranking and connected components. Such graph problems represent a worst case scenario for coalescing parallel memory accesses on GPUs, which is critical for good GPU performance. We identify and explore the impact of irregular memory access on the CUDA GPU platform, specifically looking at the memory bandwidth impact. Our experimental study also indicates that PRAM algorithms are a good starting point for developing efficient parallel GPU methods, though they require modifications, often non-trivial, to ensure good GPU performance. We identify some of the pitfalls of the CUDA GPU platform, particularly when implementing parallel graph algorithms with irregular memory access, and discuss how to adapt such graph algorithms for parallel GPU computation. We point out that the study of parallel graph algorithms for GPUs is of wider interest for discrete and combinatorial problems in general because many of these problems require similar irregular data access patterns.

Acknowledgements

This thesis would not exist without the support, guidance and encouragement of my thesis supervisor Dr. Frank Dehne, to whom I remain forever grateful. At least as importantly, I wish to express heartfelt thanks to my wife Trinh, without whose patience and support my attempt to complete my Master's degree while continuing to hold onto my demanding job simply would not have been successful.

Contents

Abstract
Acknowledgements
Contents
List of Algorithms
List of Tables
List of Figures
List of Listings

1 Introduction
  1.1 Parallel Graph Algorithms on GPUs
  1.2 PRAM vs. GPU
  1.3 Summary Of Results

2 Review of Parallel Computation Systems
  2.1 Overview
    2.1.1 Definitions
    2.1.2 Multi-core Multi-processor CPU Systems
    2.1.3 Multi-core GPU Systems
  2.2 nVIDIA Tesla Architecture and CUDA Programming Model
  2.3 Applying PRAM Algorithms to GPUs
    2.3.1 SIMD and Data Parallelism
    2.3.2 Concurrent Memory Access
    2.3.3 Sensitivity to Overhead


3 Review of GPU/CUDA Programming Concepts
  3.1 Memory
    3.1.1 GPU Memory
    3.1.2 Host Memory
  3.2 Kernels
    3.2.1 Kernels using Shared Memory
    3.2.2 Executing Kernels in Parallel
  3.3 Threading
  3.4 Branching
  3.5 Synchronization
  3.6 Memory Access Optimization
    3.6.1 Coalescing Global Memory Access
    3.6.2 Coalescing and Shared Memory
  3.7 Memory Write Concurrency
  3.8 Striding & Partitioning

4 Review of Studied Graph Algorithms
  4.1 Parallel List Ranking
    4.1.1 Definition
    4.1.2 Sequential List Ranking
    4.1.3 Parallel Pointer Jumping
    4.1.4 Parallel Random Splitter List Ranking
  4.2 Parallel Connected Component Counting
    4.2.1 Definition
    4.2.2 Sequentially Counting Connected Components
    4.2.3 Parallel Counting of Connected Components
    4.2.4 Shiloach and Vishkin's PRAM Connected Components

5 Parallel List Ranking and CUDA
  5.1 Implementation
    5.1.1 Sequential CPU List Ranking
    5.1.2 Parallel CUDA Wyllie's List Ranking
    5.1.3 Parallel CUDA Random Splitter List Ranking
    5.1.4 Parallel CPU Random Splitter List Ranking
  5.2 Run Time Comparisons
    5.2.1 Comparing All Implementations
    5.2.2 Comparing Scaling Character of the CUDA Implementations
    5.2.3 Comparing Relative Speedup of the CUDA Implementations

    5.2.4 Comparing Relative Speedup of CUDA Random Splitter Implementation Kernels
    5.2.5 Comparing Performance of CUDA Random Splitter Implementation Kernels
    5.2.6 The Inflection in the CUDA Random Splitter Implementation
    5.2.7 Memory Access Performance of CUDA Random Splitter Implementation
  5.3 Conclusions

6 Connected Component Counting and CUDA
  6.1 Implementation Details
    6.1.1 Sequential CPU Connected Component Counting
    6.1.2 Parallel CUDA Shiloach-Vishkin's Connected Component Counting
    6.1.3 Parallel CPU Shiloach-Vishkin's Connected Component Counting
  6.2 Run Time Comparison For Different Types Of Graphs: Lists, Trees, and Random Graphs
    6.2.1 Comparing All Connected Component Counting Implementations
    6.2.2 Comparing Relative Speedup of the CUDA Implementations
  6.3 Conclusions

7 Conclusion
  7.1 Summary Of Results
  7.2 Future Work

A Experimental Setup and Data Generation

Bibliography

List of Algorithms

4.1 Serial List Ranking
4.2 Parallel List Ranking by Pointer Jumping with p < n Processors
4.3 Parallel Random Splitter List Ranking
4.4 Serial Connected Components Counting by Labeling
4.5 Shiloach and Vishkin's Algorithm for p < n using Striding

List of Tables

3.1 CUDA 1.2 Global Memory Segment & Transaction Size
5.1 CUDA Random Splitter Implementations Read/Write Bits & Transactions per Kernel
6.1 Comparing the No. of Iterations in the Parallel Connected Components Implementations for Different Graph Types
A.1 All the List Ranking Experiments
A.2 All the Connected Components Experiments (x 4 graph types)

List of Figures

2.1 Computer Organization Basics
2.2 Multi-Core Computer Organization Basics
2.3 Multi-core Graphics Processors
2.4 GPU Interleaving Threads Within Single SM
2.5 nVIDIA Tesla Architecture (from [12])
2.6 CUDA Programming Model (from [12])
3.1 Overview of CUDA GPU
3.2 Overview of CUDA GPU and Memory Locality
3.3 Examples of Coalesced Memory Access Transactions
3.4 Example of Unavoidable Non-coalesced Memory Access
3.5 Shared Memory Banks and Address Assignment
3.6 Example of Striding Access vs Partitioned Access
4.1 Example of List Ranking
4.2 Example of List Ranking by Pointer Jumping
4.3 An example of a graph with 3 connected components
4.4 Vertex representation of a sample graph
4.5 Edge Representation of a Graph
4.6 Components as Stars (i.e. Trees with height 1)
5.1 Comparing All List-Ranking Implementations
5.2 Comparing Scaling Character of the CUDA-based List Ranking Implementations
5.3 Relative Speed-up Over Multiple CUDA Thread-Blocks
5.4 Relative Speed-up of Random Splitter List Ranking Kernels over Multiple CUDA Thread-Blocks
5.5 Performance of Random Splitter List Ranking Kernels
5.6 Inflection in Parallel Random Splitter List Ranking for CUDA
5.7 Impact of Amount of Memory Accessed vs Number of Access Transactions on Random Splitter List Ranking Kernel RS3
5.8 Comparing Memory Access Throughput of Random Splitter List Ranking Kernels RS3 and RS5
5.9 Comparing Memory Access Throughput of Random Splitter List Ranking Kernels RS5 with splitter-access coalescing modified RS5
6.1 Run Time Comparison of the Connected Component Counting Implementations for List and Tree Graphs
6.2 Run Time Comparison of the Connected Component Counting Implementations for Random Graphs
6.3 No. of Shiloach-Vishkin Iterations for Different Graph Types in CUDA
6.4 Comparing Relative Speedup of Connected Component Counting CUDA Implementation
6.5 Kernel Run-time Comparison of Shiloach-Vishkin CUDA Implementation, by Graph Types

List of Listings

5.1 Sequential List Ranking Implementation
5.2 Parallel Pointer Jumping List Ranking Initialization Kernel, in C/CUDA
5.3 Parallel Pointer Jumping List Ranking Kernel, in C/CUDA
5.4 Parallel Pointer Jumping List Ranking Rank Extraction Kernel, in C/CUDA
5.5 Host invoking the Pointer Jumping kernels, in C/CUDA
5.6 Host invoking the Random Splitter kernels, in C/CUDA
5.7 Parallel Random Splitter Selection Kernel, in C/CUDA
5.8 Parallel Sub-List Counting Kernel, in C/CUDA
5.9 Parallel Pointer Jumping List Ranking with One Kernel, in C/CUDA
5.10 Parallel Rank Aggregation Kernel, in C/CUDA
5.11 Parallel Random Splitter List Ranking, Multi-threaded C Implementation
5.12 Parallel Rank Aggregation Kernel, Modified to use Coalesced Splitter Rank Access, in C/CUDA
6.1 Sequential List Ranking Implementation - Main
6.2 Parallel Shiloach-Vishkin Connected Components Multi-Kernel Outline, in C/CUDA
6.3 Parallel Shiloach-Vishkin Initialization Kernel, in C/CUDA
6.4 Parallel Shiloach-Vishkin Short-Cut And Mark Kernel, in C/CUDA
6.5 Parallel Shiloach-Vishkin Edge Hooking Kernel, in C/CUDA
6.6 Shiloach-Vishkin Stagnant Root Hooking Kernel, in C/CUDA
6.7 Shiloach-Vishkin Short-cut And Check Work Kernel, in C/CUDA
6.8 Shiloach-Vishkin Root Counting Kernel, in C/CUDA
6.9 Parallel Random Splitter List Ranking, Multi-threaded C Implementation

Chapter 1

Introduction

Modern graphics processors (GPUs) have evolved into highly parallel and fully programmable architectures. Current many-core GPUs can contain hundreds of processor cores and can have an astounding peak performance of up to 1 TFLOP. However, GPUs are known to be hard to program. Since coalescing of parallel memory accesses is a critical requirement for maximum performance on GPUs, problems that require irregular data accesses are known to be particularly challenging. Current general purpose (i.e. non-graphics) GPU applications therefore typically concentrate on problems that can be solved using fixed and/or regular data access patterns, such as image processing, linear algebra, physics simulation, signal processing and scientific computing (see e.g. [11]). In this thesis, we explore the limits of GPU computing for problems that require irregular data access patterns through an experimental study of parallel graph algorithms on GPUs. Here, we consider list ranking and connected component computation. Graph problems represent a worst case scenario for coalescing parallel memory accesses on GPUs, and the question of how well parallel graph algorithms can do on GPUs is of wider interest for discrete and combinatorial problems in general because many of these problems require similar irregular data access patterns. We also study how the significant body of scientific literature on PRAM graph algorithms can be leveraged to obtain efficient parallel GPU methods.

1.1 Parallel Graph Algorithms on GPUs

In this thesis, we will focus on nVIDIA's unified graphics and computing platform for GPUs known as the Tesla architecture framework [12] and the associated CUDA programming model [3]. However, our discussion also applies to AMD/ATI's Stream Computing model [2] and in general to GPUs that follow the OpenCL standard [5, 4]. For our experimental study, we used an nVIDIA GeForce 260 with 216 processor cores at 2.1 GHz and 896MB memory.


A schematic diagram of the Tesla unified GPU architecture is shown in Figure 2.5. A Tesla GPU consists of an array of streaming multiprocessors (SMs), each with eight processor cores. The nVIDIA GeForce 260 has 27 SMs for a total of 216 processor cores. These cores are connected to 896MB of global DRAM memory through an interconnection network. The global DRAM memory is arranged in independent memory partitions. The interconnection network routes the read/write memory requests from the processor cores to the respective memory partitions, and the results back to the cores. Each memory partition has its own queue for memory requests and arbitrates among the incoming read/write requests, seeking to maximize DRAM transfer efficiency by grouping read/write accesses to neighboring memory locations. Memory latency is optimized when parallel read/write operations can be grouped into a minimum number of arrays of contiguous memory locations. GPUs are optimized for streaming data access or fixed pattern data access such as matrix based operations in scientific computing (e.g. parallel BLAS [11]). In addition, their 1 TFLOP peak performance needs to be matched with a massive need for floating point operations, such as coordinate transformations in graphics applications or floating point calculations in scientific computing [12]. Parallel graph algorithms have neither of those two properties. The destinations of pointers (graph edges) that need to be followed are usually by definition irregular and not known in advance. The most basic parallel graph operation, following multiple links in a graph, creates in general a highly irregular data access pattern. In addition, many graph problems have no need at all for floating point operations. The question of how well parallel graph algorithms can do in such a challenging environment is of wider interest for parallel discrete and combinatorial algorithms in general because many of them have similar properties.

1.2 PRAM vs. GPU

The PRAM model is a widely used theoretical model for parallel algorithm design which has been studied for several decades, resulting in a rich framework for parallel discrete and combinatorial algorithms including many parallel graph algorithms (see e.g. [16]). A PRAM is defined as a collection of synchronous processors executing in parallel on a single (unbounded) shared memory. PRAMs and GPUs are similar in that modern GPUs support large numbers of parallel threads that work concurrently on a single shared memory. In fact, modern GPUs with 200+ cores require large numbers of threads to optimize latency hiding. An nVIDIA GeForce 260 has a hardware thread scheduler that is built to manage tens of thousands and even millions of concurrent threads. The PRAM version most closely related to GPUs is the CRCW-PRAM supporting concurrent reads and concurrent writes. Tesla GPUs support concurrent write requests which are aggregated at each memory partition's memory request queue (using the "arbitrary" model where one of the writes is executed but it is undetermined which one) [12]. In fact, concurrent read/write accesses are very efficient on GPUs because they nicely coalesce. However, GPUs and PRAMs differ in some important ways. First, as outlined above, parallel memory requests coming from multiple processor cores of a GPU need to be coalesced into arrays of contiguous memory locations [12]. On a PRAM, any set of parallel memory accesses can be executed in O(1) time, regardless of the access pattern. Second, as mentioned above, the cores of Tesla GPUs are grouped into streaming multiprocessors (SMs) consisting of eight processor cores each. The cores within an SM share various components, including the instruction decoder. Therefore, parallel algorithms for GPUs need to operate SIMD style. When parallel threads executed on the same SM (and within the same warp, see Chapter 3) encounter a conditional branch such as an IF-THEN-ELSE statement, some threads may want to execute the code associated with the "true" condition and some threads may want to execute the code associated with the "false" condition. Since the shared instruction decoder can only handle one branch at a time, different threads cannot execute different branches concurrently and they have to be executed in sequence, leading to performance degradation. This leads to a need for SIMD style, data parallel programming of GPUs, which is more similar to programming classical vector processors. Unfortunately, data parallel solutions for highly irregular problems are known to be challenging. We are particularly interested in how efficient parallel graph algorithms and implementations can be obtained by starting from the respective PRAM algorithms. Which parts of PRAM algorithms can be transferred to GPUs and which parts need to be modified, and how?

1.3 Summary Of Results

Our experimental study on parallel list ranking and connected component algorithms for GPUs indicates that the PRAM algorithms are a good starting point for developing GPU methods. However, they require non-trivial modifications to ensure good GPU performance. It is critical for the efficiency of GPU methods that parallel data access coalescing is maximized and that the number of conditional branching points in the algorithm is minimized. While the number of parallel threads that can be executed concurrently on a GPU is large, it is still significantly smaller than the number of PRAM processors. It is important for the efficiency of GPU methods to appropriately assign groups of PRAM processors to GPU threads and choose an appropriate data layout such that GPU threads access data in a pattern referred to as striding (see Section 3.8 for a precise definition). Another important difference between the PRAM and GPUs is that PRAM methods assume zero synchronization overhead while this is not the case for GPUs. Therefore, the number of actually necessary and implemented synchronization points for the GPU implementation needs to be minimized to ensure good performance. The observed GPU performance for parallel list ranking and connected components appears to be very sensitive to the total work of the underlying PRAM method. GPU performance also appears to be more sensitive to the constants in the time complexity of the algorithm than parallel implementations for standard multi-core CPUs. This is because GPUs support so many more threads than multi-core CPUs, each with much less work than a thread for the corresponding multi-core CPU method, that the constants in the time complexity have a much larger relative impact.

For the list ranking problem, we start with a GPU implementation of the straightforward pointer jumping algorithm for the PRAM, also known as Wyllie's Algorithm [16, p. 64] [23]. This algorithm does provide some limited speedup but, similar to a preliminary result in [14], requires O(n log n) work. GPU performance appears to be very sensitive to the total work, and while Cole and Vishkin's deterministic coin tossing method [6] provides an optimal PRAM list ranking method with O(n) work, it is so complicated and has such high constant factors that its performance gain would be negated by its complexity. Many other parallel list ranking algorithms have been studied in the literature (see e.g. [19, 18, 9, 16, 15]). However, the parallel random splitter PRAM method proposed by Reid-Miller and then adapted for the Cray C90 [15] appears to be a good starting point for an efficient parallel GPU list ranking method (see Algorithm 4.3) as it ensures linear total work with low overhead. We observed that the parts of our code with irregular access patterns (following the list pointers) were the dominating parts with respect to the run time of the entire method. Minimizing the number of data accesses for these parts, for example through packing of variables and caching in GPU registers, was crucial for performance. Reid-Miller's parallel random splitter method is a randomized PRAM algorithm and we observed that the large number of threads that can be executed concurrently on a GPU is very helpful to efficiently implement such methods. The GPU's hardware scheduler was very helpful to make the implementation surprisingly efficient even when the random selection of splitter elements created considerable fluctuations in sub-list lengths.
We also observed an interesting inflection point in the running time curve (as a function of list size), where the irregular data access pattern and data access volume start to push the limits of the GPU's on-chip interconnection network. Michael Garland at nVIDIA [10] has used our code to reproduce this effect on their machines and pointed out that this is the first time they have seen such an inflection point. For the connected component problem, we implemented a GPU adaptation of Shiloach and Vishkin's CRCW-PRAM algorithm [17] following the same guidelines outlined above. Despite the fact that Shiloach and Vishkin's CRCW-PRAM algorithm requires O((n + m) log n) work for n vertices and m edges, while the sequential method requires only linear work, our parallel GPU implementation was significantly faster than the sequential implementation on a standard sequential CPU. We also analyzed some interesting performance variations of the GPU algorithm when executed for different types of graphs. A technical report [7] of our two results was posted on 24 February 2010. Our result for parallel list ranking was independently discovered and published by Wei and JaJa [22] in April 2010. Our result for parallel computation of connected components in a graph was submitted to IPDPS on 1 October 2010.

Chapter 2

Review of Parallel Computation Systems

2.1 Overview

Figure 2.1: Computer Organization Basics. (a) Single Processor Computer, (b) Networked Multi-computer, (c) Shared Memory Multi-processor Computer.

Parallel computation can be performed in two fundamental ways.

1. Distributed memory systems, where multiple processors each have their own memory and the data they are processing is distributed among them. Examples include cluster computing systems; see Figure 2.1 (b).

2. Shared memory systems, where multiple processors have access to a common high-speed memory in which the data they are processing is stored. A common example is the average PC with multiple processors as shown in Figure 2.1 (c), as well as the processors with multiple cores on the same unit as seen in Figure 2.2.

This study is primarily interested in shared memory systems, and particularly in multi-core processors and their use with graph theory algorithms.


2.1.1 Definitions

Information in this chapter was obtained from the following sources: [20, 16, 12].

CPU Central Processing Unit; refers to the processing unit in computers. In this study, the CPU used is a multi-core Intel x86-based processor.

GPU Graphics Processing Unit; refers to the processing unit in today's advanced graphics cards that are added to computers to accelerate graphics performance. Generally speaking, the term is also used to refer to the graphics card itself. In this study, the GPU used is an nVIDIA GTX 260 graphics card.

Core Another term for the processing unit; commonly used to refer to multiple processing units in a multi-core CPU or GPU.

Multi-core Processor A single chip processing unit made up of multiple "cores" or embedded processing units that share the system's main memory.

Thread Refers to a single path of execution. Unlike a process, which has its own distinct memory space, a thread shares the memory space of its parent process, yet otherwise executes with its own program counter and registers.

SISD Single Instruction, Single Data. The traditional computer, as in Figure 2.1 (a), is a SISD machine, with the CPU processing a single instruction stream and operating on a single element of data at a time.

MIMD Multiple Instruction, Multiple Data. A collection of SISD machines together effectively form a MIMD machine, where each CPU processes its own instruction stream and its own data stream, in parallel and independently of the other CPUs. Both multi-core and multi-processor CPU systems (see Section 2.1.2), as well as distributed parallel systems like clusters, are fundamentally MIMD systems.

SIMD Single Instruction, Multiple Data. A SIMD machine processes multiple data sets in parallel, executing the same instruction for different data. A common SIMD architecture is the vector processor used by Cray and other super-computer manufacturers.

Data-Parallel An algorithm or computing system is considered data-parallel when it executes the same program in parallel using different data. A SIMD execution is inherently data-parallel; however, MIMD systems can be used to execute data-parallel programs as well.

PRAM Parallel Random Access Machine. A widely used theoretical model for parallel computation, a PRAM is a collection of synchronous processors executing in parallel and communicating via shared memory. By definition, each processor is a sequential processor and the system as a whole executes MIMD. There are different PRAM models that vary by how concurrent memory access is resolved.

PRAM-EREW Exclusive Read, Exclusive Write PRAM model, where only a single read or write operation is allowed at any given time, and thus an algorithm needs to ensure that only one processor is reading or writing at any given time.

PRAM-CRCW Concurrent Read, Concurrent Write PRAM model, where concurrent reads and writes are allowed. When multiple processors write to the same location only one of the writes, arbitrarily, will succeed.

PRAM-CREW Concurrent Read, Exclusive Write PRAM model, where concurrent reads are allowed but an algorithm needs to ensure that only one processor is writing at any given time.

PRAM-ERCW Exclusive Read, Concurrent Write PRAM model, where concurrent writes are allowed while only one processor is allowed to read memory at any given time. When multiple processors write to the same location only one of the writes, arbitrarily, will succeed.

SIMT Single Instruction, Multiple Thread, a model advanced by nVIDIA as part of their CUDA architecture. It describes an execution model where a single instruction processing multiple data is executed by multiple threads, with execution of the threads interleaved over a smaller number of physical processors. (More detail is covered in Section 2.1.3.)

FLOP Floating-point Operation; used as a unit of measurement when describing the floating-point processing capacity of a processor.

TFLOP Tera (10^12) FLOPs.

2.1.2 Multi-core Multi-processor CPU Systems

Many off-the-shelf CPUs today are made up of multiple cores (or processing units). As shown in Figure 2.2, a multi-core processor is a single processor that is made up of multiple cores which effectively act as a multi-processor computer on a single chip. Each core appears and operates as a fully functional processing unit, and typical operating systems mask the hardware details by mapping each thread created by an application to one of the cores. Current multi-core CPUs have no more than 8 cores each (though the technology is advancing rapidly), and they in turn can be combined by placing multiple CPUs on a single mother-board to provide up to 32 cores in a single system. A shared memory multi-core multi-processor computer satisfies the very basic definition of the PRAM-CRCW model, i.e. multiple processors executing MIMD in parallel and using shared memory with support for concurrent reads and writes. The cores, however, are not synchronous: each core can be executing a completely different instruction, on any location in memory, at any given time, and no two instructions necessarily start or end at the same time.

Figure 2.2: Multi-Core Computer Organization Basics. (a) Single Processor Multi-core Computer, (b) Multi-processor Multi-core Computer.

2.1.3 Multi-core GPU Systems

GPUs are co-processors that are installed within a computer alongside a CPU and main memory, and their purpose is to provide specialized graphics functionality and accelerated performance. Modern GPUs have evolved into high-performance data-parallel multi-core systems, also known as stream processors, that can execute complex data processing algorithms in parallel using thousands, if not millions, of concurrent threads. There are two major vendors of multi-core GPUs, nVIDIA and ATI, both of whom have graphics cards that can be easily added to a standard Intel PC and accessed via their respective SDKs. A stream computing GPU is typically made up of multiple Streaming Multi-processors (SMs) which are in turn made up of multiple Streaming Processors (SPs) or cores (see Figure 2.3).

Figure 2.3: Multi-core Graphics Processors. A shared memory stream computing graphics processor consists of multiple Streaming Multi-processors (SMs), each with multiple Streaming Processors (cores) and its own shared memory, connected to the device memory.

A shared memory multi-core stream computing GPU satisfies some aspects of the definition of the PRAM-CRCW model, i.e. multiple processors executing in parallel and using shared memory with support for concurrent reads and writes. GPUs are partially synchronous in that within an SM the cores execute SIMD, while each SM may be assigned different work.

Figure 2.4: GPU Interleaving Threads Within Single SM.

Programs are written using as many threads as needed, with algorithms allowed and encouraged to use millions of threads. The underlying hardware distributes the threads over multiple SMs such that the threads are divided into blocks (or groups) of threads and each block (group) is assigned to an SM to execute. Typically the number of threads per block (group) will exceed the number of SPs in an SM, which allows the hardware to interleave the instruction execution (see Figure 2.4). Within a single SM, each instruction is executed by all threads before moving on to the next instruction, a model that has been named SIMT. The interleaving has a desirable side-effect of hiding memory access latency when the instruction being processed is reading or writing memory, since memory access operations typically take longer than the time to execute the instruction itself. By executing more threads than there are cores in an SM, each SM is able to map threads to processor cores in an efficient manner. Each set of threads executes the same instruction on each core in parallel and can be given parallel memory access for the term of the computation. As each thread set finishes, a new set is started on the physical cores until all threads have completed execution. Finally, the SM can move on to the next instruction. While the SM does need to make sure that the memory accesses initiated by the first set of threads have finished before moving on, with more threads than cores the time taken with the other threads more often than not ensures that the memory accesses are finished by the time the SM returns to the same set of threads. This is called latency hiding and is critical in providing the maximum parallel performance for the system. Since GPUs are fundamentally graphics accelerators, they can be programmed using a variety of SDKs (e.g. OpenGL, DirectX, PhysX) to take advantage of their various graphics capabilities. As these GPUs have become more and more powerful they have also become more capable of general purpose computing. nVIDIA promoted the first open SDK, called the CUDA (Compute Unified Device Architecture) SDK [3], which includes a C/C++ compiler. Later ATI (now AMD) released its own SDK called the ATI Stream SDK [2]. Recently a standard has emerged called OpenCL [5, 4] which provides a common SDK that can be used independently of the underlying GPU hardware. This study uses the nVIDIA CUDA platform exclusively as its GPU platform. At the time of this work, it was difficult to access ATI and OpenCL development environments as well as compliant hardware. The nVIDIA CUDA platform had been available longer and provided a stable, and well-researched, platform for this study.

2.2 nVIDIA Tesla Architecture and CUDA Programming Model

Modern graphics processors (GPUs) have evolved into highly parallel and fully programmable architectures. The latest GPUs from nVIDIA and AMD contain hundreds of processor cores and have an astounding peak performance of around 1 TFLOP. For this experimental study, the nVIDIA GeForce 260 Graphics Processing Unit (GPU) with 216 cores running at 2.1 GHz and 896MB of on-board memory was used. The nVIDIA GeForce 260 is based on the Tesla architecture [12] shown in Figure 2.5, which is nVIDIA's unified graphics and computing platform. A Tesla GPU consists of an array of streaming multiprocessors (SMs), each with eight processor cores. The nVIDIA GeForce 260 has 27 SMs for a total of 216 cores. These cores are connected to the 896MB of on-board memory through an inter-connection network that allows all the SMs to have shared access to the device (aka global) memory. The on-board memory is arranged in independent memory partitions, and the inter-connection network routes the read/write memory requests from the processor cores to the respective memory partitions and back. Each memory partition has its own queue for memory requests [12, p. 49] and arbitrates among the incoming read/write requests, seeking to maximize memory transfer efficiency, which favours grouping of read/write accesses to neighboring memory locations. It is important to note that memory latency is lowest with parallel read/write operations that can be grouped into arrays of contiguous memory locations.

Figure 2.5: nVIDIA Tesla Architecture (from [12]).

Programming the Tesla architecture for general purpose computing is accomplished via the CUDA programming model [1], details of which are explored in Chapter 3. The nVIDIA CUDA GPU platform is made up of multiple SMs, each with multiple (typically 8) cores (SPs). CUDA programs are written using as many threads as needed, with algorithms allowed and encouraged to use millions of threads. The program is aware of the hardware in so much as the CUDA programming model exposes a notion of thread blocks and threads (see Figure 2.6). Synchronization is available for all threads within a block as well as via barrier synchronization across blocks. The cores themselves execute in SIMD fashion, though that is expensive when code branches occur. Shared memory access is easy and fast and the latency of memory access is hidden by the multiple threads, though that too can be expensive if not managed carefully. Memory latency is further optimized when parallel read/write operations can be grouped into a minimum number of arrays of contiguous memory locations. GPUs are optimized for streaming data access or fixed-pattern data access such as matrix-based operations in scientific computing.

Figure 2.6: CUDA Programming Model (from [12]). (a) A thread with per-thread local memory, (b) a cooperative thread array (thread block) with per-CTA shared memory, (c) grids of thread blocks separated by inter-grid synchronization barriers.

2.3 Applying PRAM Algorithms to GPUs

PRAMs and GPUs are similar in that modern GPUs support large numbers of parallel threads that work concurrently on a single shared memory. With the large number of cores in modern GPUs, a large number of threads is needed to maximize performance and hide memory access latency. An nVIDIA GeForce 260 has a hardware thread scheduler that is built to manage millions of concurrent threads.

2.3.1 SIMD and Data Parallelism

The cores within a CUDA SM share various components, including the instruction decoder. Therefore, parallel algorithms for GPUs need to operate SIMD style. When parallel threads executing on the same SM (and within the same warp, see Section 3.3) encounter a conditional branch such as an IF-THEN-ELSE statement, some threads may want to execute the code associated with the "true" condition and some threads may want to execute the code associated with the "false" condition. Since the shared instruction decoder can only handle one branch at a time, different threads cannot execute different branches concurrently and they have to be executed in sequence, leading to performance degradation. This leads to a need for SIMD style, data-parallel programming for GPUs. CUDA GPUs require data parallelism within groups of threads, which makes their programming similar to that of classical vector processors. It is well known that efficient data parallelism is relatively easy to obtain for regular problems where data movements are known a priori, such as many matrix operations. Graph algorithms, on the other hand, typically have highly irregular data movements which are not known a priori.

2.3.2 Concurrent Memory Access

The PRAM model most closely related to GPUs is the CRCW-PRAM supporting concurrent reads and concurrent writes. nVIDIA CUDA GPUs support concurrent read and write requests, which are aggregated at each memory partition's memory request queue; concurrent writes to the same location are resolved so that one of the writes succeeds. In addition, concurrent read/write accesses are very efficient on GPUs because of the latency hiding achieved by the SMs when they interleave thread execution (see Figure 2.4). The PRAM model thus provides a platform that maps well to the world of CUDA.

• Algorithms that run correctly under the CRCW assumption that one of all concurrent writes to the same location will succeed can be implemented on a CUDA platform (and in fact any multi-core platform) as long as the algorithm can be modified to use fewer parallel processors (threads) than the number of elements in the data being processed.

• Since the CREW, EREW and ERCW models of PRAM are subsets of the CRCW access model, any PRAM algorithms that do not rely on concurrent writing or reading are also promising candidates for conversion to multi-core implementations; however, they may not perform optimally because of the extra work they are likely to do to address the non-concurrent read/write restrictions.

While on a PRAM any set of parallel memory accesses can be executed in O(1) time regardless of the access pattern, implementing a PRAM algorithm for a GPU must take into account the memory access time overhead as well as the impact of the access pattern. Parallel memory requests coming from multiple processor cores of a CUDA GPU can be coalesced into arrays of contiguous memory locations [12]. However, that requires either ordered memory access or careful design of an algorithm processing irregularly organized data structures like graphs.
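For illustration, the following minimal C/CUDA sketch (not part of the thesis implementation; the kernel and variable names are placeholders) shows the "arbitrary winner" write semantics described above: every thread writes its own index to the same global location, exactly one of the concurrent writes survives, and which one is undefined, which is precisely the behaviour a CRCW-arbitrary PRAM algorithm must tolerate.

#include <stdio.h>
#include <cuda_runtime.h>

// Every thread writes its own id to the same global location. The hardware
// serializes the conflicting writes and one arbitrary thread "wins",
// mirroring the CRCW "arbitrary" PRAM write model.
__global__ void arbitraryWriteKernel (int *winner)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    *winner = tid;                  // concurrent write; any one value may remain
}

int main (void)
{
    int winner_h = -1;
    int *winner_d = NULL;
    cudaMalloc ((void **) &winner_d, sizeof(int));
    arbitraryWriteKernel <<<4, 64>>> (winner_d);
    cudaMemcpy (&winner_h, winner_d, sizeof(int), cudaMemcpyDeviceToHost);
    printf ("winning thread id: %d\n", winner_h);   // some id in [0, 255]
    cudaFree (winner_d);
    return 0;
}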

2.3.3 Sensitivity to Overhead

More so than sequential or MIMD multi-core systems, CUDA SIMT implementations are extremely sensitive to the constants hidden inside many theoretically optimal parallel algorithms. These algorithms often first manipulate the data in order to return much faster results during their main execution. This initial manipulation adds a fixed overhead cost, but it is usually substantially smaller than the time required to actually process a large number of data elements. From a theoretical point of view, this extra overhead work is hidden in the asymptotic notation of an algorithm's performance since it is independent of the size of the data. The approach is perfectly reasonable as long as the overhead stays small; however, with parallel algorithms, the overhead applies at every processor doing work. When implementing a parallel algorithm on a GPU using a very large number of processors and threads, each handling a much smaller amount of data, the small overhead becomes large relative to the work done per processor. Given n elements of data processed in parallel by p threads, the optimal work per processor is a function of n/p. If an algorithm is considered optimal with a constant overhead of k per processor, where the constant is substantially smaller than the size of the data, it is clear that when the number of processors is small the total overhead across the processors (i.e. p · k) is small as well. However, as the number of threads increases substantially, in the order of millions, suddenly the total overhead across the processors becomes quite high. With CUDA, since the best performance is obtained by running with lots of threads, i.e. a very high value for p, the overhead will be amplified that much more as every thread does the extra work. While this does not take away the optimality of the work, it does mean that the run-time performance will be poor.
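As a rough, hypothetical illustration of this effect (the cost model and numbers below are assumptions introduced here, not measurements from this thesis), suppose each thread performs c units of useful work per element plus a fixed start-up overhead of k units:

    T_thread(n, p) ≈ c · (n/p) + k,   so the total overhead across all threads is p · k.

With n = 2^24 elements and k equivalent to 100 elements of work, p = 256 threads add overhead equal to about 0.15% of the useful work, whereas p = 10^6 threads add roughly six times the useful work, even though the asymptotic work bound is unchanged.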

Chapter 3

Review of GPU/CUDA Programming Concepts

Figure 3.1: Overview of CUDA GPU.

The GPU is installed as an add-on within the host computer. Software is developed for NVIDIA GPU cards via the CUDA SDK [3] provided by NVIDIA. Applications developed via the CUDA SDK run code on the CPU which then invokes parallel sub-routines called kernels that are executed on the GPU itself. Since CUDA programs need to be aware of some of the fundamentals of the hardware design, the following facts are important before proceeding to the rest of this chapter.

• As discussed in Section 2.2 and shown by Figure 3.1, an NVIDIA GPU consists of an array of streaming multi-processors (SMs).

• Each SM is made up of eight streaming processors (SPs or cores).

• The GTX 260 GPU card is made up of 27 SMs; thus a total of 216 cores are available for processing.

• Each kernel is executed in parallel using a number of threads grouped into thread blocks.

• Each thread block is assigned to an SM and thus all threads in a block execute using the SM's eight cores.

• The execution is grouped into warps consisting of 32 threads each.

• Each warp is executed SIMD style.

• Synchronization is available in two ways: one method for all threads within a block and another method for synchronization across different thread blocks.

• There is no stack in the GPU architecture and hence there is no recursion supported for the threads within CUDA kernels.

3.1 Memory

3.1.1 GPU Memory

Figure 3.2: Overview of CUDA GPU and Memory Locality (registers and local memory per thread, shared memory per SM, global memory on the GPU card, host memory on the CPU side).

The GPU's on-board memory is divided into the following types1 that are applicable for general purpose computing:

• Global memory, also referred to as device memory, is located off the GPU chip and on the GPU card, similar to memory on a motherboard for regular CPUs. All global memory locations are accessible by all cores of the GPU through the on-chip inter-connection network that routes and schedules all accesses to global memory. Global memory is arranged in independent memory partitions. Each memory partition has its own queue for memory requests and arbitrates among the incoming read/write requests, seeking to maximize memory transfer efficiency by grouping read/write accesses to neighboring memory locations.

• Shared memory is on-chip memory local to each SM and accessible by all the cores (and thus all the threads) of the SM. Shared memory in one SM cannot be accessed by cores (and thus threads) in another SM. While access to shared memory is as fast as accessing registers (with some caveats discussed in Section 3.6.2), each thread block is limited to 16KB of shared memory.

• Register memory is on-chip memory local to each SP and thus local to each thread. While register access is extremely fast, there is a limited amount available per thread. Registers are typically used for parameter passing and local variables in kernels.

• Local memory is like global memory in that it is on-board and thus slow to access; however, like register memory, it is local to each thread and is used as a fall-back when a kernel runs out of register memory. A caution: when a kernel uses large structures to create local variables, the compiler is likely to move such local variables into local memory, resulting in an impact to kernel performance.

Global memory provides the most flexibility as any thread can access any area of the global memory; while it is the most abundant source of memory, it is not cached and access times are much slower than using shared and register memory. While shared memory and register memory access is extremely fast, only a small amount of each is available. This difference in performance can prove invaluable when optimizing an algorithm for the GPU, and the following steps provide the guidelines to keep in mind when programming the GPU:

• Minimize the use of global memory. Reading once from and writing once to global memory is unavoidable.

1 There are additional types of memory, including constant memory and texture memory, which are not relevant to this study and thus not explained here. Further details can be found in chapters 2 and 5 of [3].

• If the algorithm lends itself to data partitioning then take advantage of shared memory by reading small partitions of data into shared memory and doing all the work on the partition within thread blocks.

• If the algorithm, or the data it works on, isn't amenable to caching, then focus on minimizing repeated access to global memory. Judicious use of registers can significantly reduce the number of global memory accesses (see the sketch after this list).

• Avoid using structures larger than 64 bytes in size where possible.
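As an illustration of the register guideline above, the following minimal C/CUDA sketch (not taken from the thesis; names are placeholders) reads a frequently reused global value into a register once instead of re-reading it from global memory for every element:

// Cache the shared scale factor in a register once, rather than re-reading
// it from global memory on every use; each thread then performs exactly one
// global read and one global write for its own element.
__global__ void scaleKernel (const float *in, float *out, const float *scale, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float s = *scale;               // register copy of a value used by every thread
    if (i < n)
        out[i] = in[i] * s;
}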

3.1.2 Host Memory

The host computer's memory is not directly accessible by a kernel executing within the GPU; similarly, the program executing on the host CPU cannot directly access any of the device memory on the GPU. The CUDA SDK provides specific APIs that are used to allocate and free global memory on the GPU, as well as to copy data to and from host memory and the GPU's global memory. When an application allocates memory on the GPU, it is given a pointer that is invalid for use with the host CPU. It is the application's responsibility to remember which pointers refer to GPU memory and which point to host memory. A naming convention is used whereby a pointer variable name uses a different suffix as follows:

int *data_h = (int *) malloc (...);        // host memory
int *data_d = NULL;
cudaMalloc ((void **) &data_d, ...);       // device global memory
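For completeness, a sketch of the host-side round trip using these calls (assumed to run inside a host function; N is a placeholder size and error checking is omitted):

int *data_h = (int *) malloc (N * sizeof(int));                 // host memory
int *data_d = NULL;
cudaMalloc ((void **) &data_d, N * sizeof(int));                // device global memory

cudaMemcpy (data_d, data_h, N * sizeof(int), cudaMemcpyHostToDevice);   // host -> device
// ... invoke one or more kernels on data_d here ...
cudaMemcpy (data_h, data_d, N * sizeof(int), cudaMemcpyDeviceToHost);   // device -> host

cudaFree (data_d);
free (data_h);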

3.2 Kernels

A sub-routine that is executed on the GPU is called a kernel and is written as a special C function using one of two keywords:

__global__ A prefix used to define a kernel that can be invoked by the main program running on the host CPU.

__device__ A prefix used to define a kernel that can only be invoked by another kernel while it is executing on the GPU.

All kernels must obey the following rules when they execute:

• A kernel can access CUDA registers, shared memory and global memory.

• Any pointer provided to a kernel during invocation must point to CUDA global memory.

• A kernel cannot access host memory2.

• A kernel cannot use recursion since there is no stack; all local variables are stored in CUDA's registers.

• A kernel is executed using a fixed number of threads which is specified by the host program when it invokes the kernel.

When programming in ANSI C, a global kernel is defined as a C function with a CUDA-defined prefix, as shown below.

// host-invokable kernel
__global__ void myKernel (...) {
    ...
}

// kernel-invokable kernel
__device__ void mySubKernel (...) {
    ...
}

The host program invokes the kernel by specifying the following values:

• The number of thread blocks to use, as many as desired with no limit imposed.

• The number of threads to use per thread block, up to a maximum of 768.

• The parameters to pass the kernel.

myKernel <<<NUMBLOCKS, BLOCKSIZE>>> (p1, p2, ...);
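Putting the pieces together, a minimal hypothetical kernel and its invocation (not code from this thesis) might look as follows; each thread computes its global index from its block and thread coordinates and processes one element:

// Each thread increments one element of the array in global memory.
__global__ void incrementKernel (int *data_d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard against surplus threads
        data_d[i] += 1;
}

// Host side: launch enough blocks of BLOCKSIZE threads to cover n elements.
int numBlocks = (n + BLOCKSIZE - 1) / BLOCKSIZE;
incrementKernel <<<numBlocks, BLOCKSIZE>>> (data_d, n);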

3.2.1 Kernels using Shared Memory

When using shared memory, the kernel is defined with a shared memory declaration using the CUDA-specific prefix __shared__ inside the kernel. The following is an example of a kernel where the size of the shared memory is known at compile time.

__global__ void myShMemKernel (data_t *gdata) {
    __shared__ data_t shdata [BLOCKSIZE];
    ...
}

2 There is a way to access host memory from within a kernel via page-locked memory mapping [3, p. 32]; however, it has restrictions that make it unsuitable for the work done as part of this study.

The above kernel processes well-defined data elements, and since shared memory can only be initialized by the kernel, we need to pass in the data via global memory and have the kernel copy what it needs into the shared memory. The shared memory in a single SM is shared between all the threads of the thread block executed by the SM, and as a result the space must be divided among the threads. Since the size of the shared memory is computed at compile time, invoking the kernel is no different than one that doesn't use shared memory. Sometimes it is not possible to compute the size of the shared memory at compile time. In that case it is possible to dynamically allocate the shared memory by defining it within the kernel using the additional prefix extern, and then passing in the size of the shared memory when invoking the kernel.

__global__ void myShDynKernel (data_t *gdata) {
    extern __shared__ data_t shdata [];
    ...

}

myShDynKernel <<<NUMBLOCKS, BLOCKSIZE, SHMEMBYTES>>> (...);   // SHMEMBYTES: shared memory size in bytes
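As a concrete, hypothetical illustration of this pattern (not code from this thesis), the kernel below stages one element per thread in the block's shared memory, synchronizes, and then has each thread read an element written by a different thread from the fast on-chip copy:

// Reverses each BLOCKSIZE-sized block of the input; assumes the array length
// is a multiple of BLOCKSIZE.
__global__ void reverseBlockKernel (const int *in, int *out)
{
    __shared__ int tile [BLOCKSIZE];
    int base = blockIdx.x * blockDim.x;

    tile [threadIdx.x] = in [base + threadIdx.x];    // stage one element per thread
    __syncthreads ();                                // wait until the whole tile is loaded

    // Each thread writes another thread's element, reversing the block.
    out [base + threadIdx.x] = tile [blockDim.x - 1 - threadIdx.x];
}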

3.2.2 Executing Kernels in Parallel

By default the CUDA SDK associates each program with a single CUDA "stream" such that each kernel invocation is associated with a stream, and only one kernel is allowed to execute at any given time per stream. The SDK provides APIs that allow an application to initialize multiple streams and thus run kernels in parallel. The use of parallel streams is particularly interesting as a way of processing the same, or overlapping, data with different algorithms at the same time. It is also a way for algorithms that require a small number of threads or thread blocks to take advantage of all the SMs in a GPU by parallelizing more work, thus taking advantage of the latency hiding to improve total performance. Since the graph algorithms studied in this thesis process more data using more thread blocks than there are SMs in the GPUs, they neither require nor use multiple streams; however, it is an interesting aspect of CUDA that may be leveraged with the appropriate algorithms.
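A minimal sketch of the stream API usage described above (kernelA, kernelB and the device pointers are placeholders, not code from this thesis):

cudaStream_t s0, s1;
cudaStreamCreate (&s0);
cudaStreamCreate (&s1);

// Kernels queued on different streams may overlap on hardware that supports it.
kernelA <<<NUMBLOCKS, BLOCKSIZE, 0, s0>>> (dataA_d);
kernelB <<<NUMBLOCKS, BLOCKSIZE, 0, s1>>> (dataB_d);

cudaStreamSynchronize (s0);       // wait for the work queued on each stream
cudaStreamSynchronize (s1);
cudaStreamDestroy (s0);
cudaStreamDestroy (s1);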

3.3 Threading

A GPU application does not generally need to be aware of the number of physical cores. It can create thousands or millions of threads, as needed, and is in fact encouraged to do so: an application needs to use substantially more threads than there are physical cores to take advantage of the performance gains offered by latency hiding (as discussed in Section 2.1.3). When a kernel is invoked, the number of thread blocks and threads per block specified are multiplied to identify the total number of threads, which are then assigned by thread blocks to the SMs on the GPU. An SM executes a thread block by breaking it into groups of 32 threads called warps and executing them in parallel using its eight cores. More precisely, the SM performs context switching between the different warps. This allows the SM to hide the latency of memory access operations performed by the threads and provides increased computational performance. When a warp is being executed, the eight cores also perform context switching between the warp's 32 threads. The eight cores of an SM share various hardware components, including the instruction decoder. Therefore, the threads of a warp are executed in SIMT mode, which is a slightly more flexible version of the standard SIMD mode. The active threads of a warp all need to execute the same instruction as in SIMD mode while operating on their own data.
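Inside a kernel, the warp a thread belongs to and its lane within that warp follow directly from the thread index; a small hypothetical sketch of the arithmetic implied above:

__global__ void warpInfoKernel (int *warpIdOut, int *laneIdOut)
{
    const int WARP_SIZE = 32;
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    int warp = threadIdx.x / WARP_SIZE;                // warp number within the block
    int lane = threadIdx.x % WARP_SIZE;                // position within the warp
    warpIdOut[tid] = warp;
    laneIdOut[tid] = lane;
}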

3.4 Branching

Parallel execution can break down when the threads encounter a conditional branch such as an IF-THEN-ELSE statement. Depending on their data, some threads may want to execute the code associated with the "true" condition and some threads may want to execute the code associated with the "false" condition. Since the shared instruction decoder can only handle one branch at a time, different threads cannot execute different branches concurrently and they have to be executed in sequence, leading to performance degradation. The SMs provide a small improvement through an instruction cache that is shared by the eight cores, which allows for a "small" deviation between the instructions carried out by the different cores. For example, if an IF-THEN-ELSE statement is short enough that both conditional branches fit into the instruction cache, then both branches can be executed fully in parallel. However, a poorly designed algorithm with many or large conditional branches can result in serial execution and very low performance.
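To illustrate, the two hypothetical fragments below compute the same result. In the first, threads of a warp whose elements differ in sign take different branches and are serialized; the second expresses the choice arithmetically so that all threads follow one instruction path (in practice the compiler emits a predicated select for such short conditionals):

// Divergent: threads in the same warp may take different branches.
if (x[i] > 0)
    y[i] = x[i] * a;
else
    y[i] = x[i] * b;

// Branch-free alternative: every thread executes the same instruction sequence.
float coeff = (x[i] > 0) ? a : b;
y[i] = x[i] * coeff;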

3.5 Thread Synchronization

The PRAM model assumes full synchronization at the level of individual steps and does not account for synchronization overhead; i.e. synchronization has no cost. On a GPU this is not the case: CUDA supports two very specific types of synchronization, each with its own constraints and costs. Threads within a thread block can be synchronized by calling __syncthreads() from inside the kernel. Since threads in different thread blocks cannot be synchronized within the same kernel, this approach is useful for algorithms where the thread blocks are independent of each other and do not need to share or process arbitrary locations of global memory while they work. Barrier synchronization across all threads and thread blocks can be achieved by breaking an algorithm into a sequence of different kernels and invoking them in sequence from the host application. When two kernels are invoked in sequence, all threads working on the first kernel must complete their work before the second kernel can be started. The key to this approach is to identify the minimum number of kernels required to implement an algorithm, as kernel invocation can affect performance if abused.
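A sketch of the second, kernel-level barrier (phaseOneKernel and phaseTwoKernel are placeholder names, not thesis kernels): kernels launched into the same default stream execute in order, so the second launch only begins once every thread of the first kernel has finished.

// Phase 1: all thread blocks must finish before phase 2 may read its results.
phaseOneKernel <<<NUMBLOCKS, BLOCKSIZE>>> (data_d, n);

// Implicit barrier: this launch starts only after phaseOneKernel has completed.
phaseTwoKernel <<<NUMBLOCKS, BLOCKSIZE>>> (data_d, n);

// Block the host until all queued GPU work has completed.
cudaThreadSynchronize ();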

3.6 Memory Access Optimization

While the shared memory of each SM is restricted to the threads of the thread block executing on it and is limited in size (16 KB per SM on the hardware considered here), it is much faster to use than global memory; shared memory access can be as fast as using registers, and exploiting this is a trade-off that needs to be made during the design of the algorithm. The use of shared memory is ideal for algorithms, such as sorting, that are able to break up the input into small pieces and then perform most of the work on those small pieces. However, not all algorithms can be designed to break their work into well-defined pieces that fit into the limited shared memory; in particular, graph algorithms that work on graphs with links distributed arbitrarily throughout the whole graph are not suitable for such data partitioning. Another reason for using global memory is to move the data from the host memory into the GPU for processing. This means that even an algorithm that is able to use shared memory effectively still needs to load its data from global memory and then write its results back into global memory so that they can later be copied to the host memory. Thus at least one read and one write operation must be performed from and to global memory, and this alone can dominate the computational time of an algorithm.
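The following fragment sketches the usual pattern (hypothetical names, not one of the thesis kernels): a thread block stages a tile of its input in shared memory, works on the tile, and writes the result back, so that the slow global memory is touched only once for reading and once for writing each element.

__global__ void tileExample (const int *in, int *out, int n)
{
    __shared__ int tile [256];                     // one tile per thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile [threadIdx.x] = in [i];        // one coalesced read from global memory
    __syncthreads ();                              // the whole block waits for the tile
    // ... work on tile [] in fast shared memory ...
    if (i < n) out [i] = tile [threadIdx.x];       // one coalesced write back
}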

3.6.1 Coalescing Global Memory Access

To assist with the optimization of global memory access the GPU hardware supports the following notions when accessing global memory:

Memory Access Transactions The GPU hardware has a specialized memory access network (called the inter-connection network) that connects all the SMs to all the global memory modules. When a kernel accesses global memory, the underlying hardware issues memory access transactions to the inter-connection network. Each transaction is then dispatched to the appropriate memory module for processing, be it a read or a write.

Bulk Transactions Each memory access transaction can read or write more than one unit of a data type. In fact each transaction can process up to 128 contiguous bytes of data at a time from the same memory module.

Coalescing Multiple concurrent memory accesses by different threads can be combined into a single memory access transaction, thus avoiding the overhead of issuing and waiting for multiple independent memory accesses.

Figure 3.3: Examples of Coalesced Memory Access Transactions. (a) One coalesced transaction with 100% utilisation; (b) two coalesced transactions with 50% utilisation each; (c) two coalesced transactions with 50% and 37.5% utilisation. In each case the figure shows parallel threads accessing consecutive fixed-size memory locations.

When an application uses global memory, a critical requirement for obtaining good performance is to coalesce memory accesses performed concurrently by different threads. The goal is to combine multiple global memory access requests executed concurrently by multiple threads into one single bulk transaction for one of the independent memory partitions of global memory. The performance improvement that can be gained through coalescing of memory accesses can be substantial [3, 21]. When threads are executing in parallel, if each accesses memory at a consecutive address then the GPU is able to coalesce those accesses into a smaller number of direct memory accesses, thus taking advantage of the memory access hardware's capacity to execute 64-byte or 128-byte memory access transactions. This relies on the correct alignment of memory access in the algorithm (see [3, p.81-88] for a detailed technical treatment). To take advantage of this, each thread should access consecutive memory addresses in order relative to the thread index. This is illustrated by example (a) in Figure 3.3 where all the threads access consecutive elements from the global memory, resulting in a single coalesced memory access transaction rather than 8 separate memory access transactions. As of CUDA 1.2, the GPU is capable of coalescing all requests made by a half-warp's threads that fall within the same segment (i.e. contiguous block of memory), where the size of the segment is governed by the size of the memory access. This means that the order of access matters less than the amount of memory accessed concurrently, and overlapping accesses are allowed as long as they fall within the segment. This is illustrated by examples (b) and (c) in Figure 3.3 where the threads access non-consecutive elements from the global memory, producing two transactions; even though the two transactions together access 16 memory locations to retrieve data for 8 threads, this is still more efficient than 8 separate memory access transactions accessing less memory in total.

Data Size          Segment Size    Min. Transaction Size
1 byte (8 bits)    32B             16B
2 bytes (16 bits)  64B             32B
4 bytes (32 bits)  128B            64B
8 bytes (64 bits)  128B            64B

Table 3.1: CUDA 1.2 Global Memory Segment & Transaction Size

Table 3.1 shows the segment sizes by data size. For example, if 16 data accesses of 2 bytes each fit exactly into a 32-byte memory segment then this creates one memory transaction of 32 bytes. If those 16 data accesses are not adjacent but fit into a 64-byte memory segment then this creates one memory transaction of 64 bytes. If the accesses fall into multiple segments then this creates multiple memory transactions; this is not desirable and should be avoided where possible to maximize performance. With graph manipulation algorithms it is not always possible to avoid non-coalesced memory access. Figure 3.4 shows an example of such a memory access where the data being processed is sufficiently random, with links distributed in a non-consecutive manner. In the first step, each thread T_i reads a value from the array of indices based on the thread's ID (i.e. X_i = a[i]). As the thread IDs are consecutive, the memory access is easily reduced to a single transaction. In the second step, the value initially read by each thread is used as a new index into the array (i.e. Y_i = a[X_i]), resulting in read operations that are neither consecutive nor within the bounds of a reasonable segment size. As a result, each new array access has to be issued as a separate memory transaction.

Figure 3.4: Example of Unavoidable Non-coalesced Memory Access. (The figure shows the first access X_i = a[i] producing one coalesced transaction, and the second access Y_i = a[X_i] producing four non-coalesced transactions.)
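The two access patterns of Figure 3.4 can be written as the following illustrative kernel (hypothetical names, a sketch for this discussion only): the first read a[i] is indexed by the thread ID and coalesces into a few transactions, while the second read a[x] is indexed by data and, for a randomly linked list, generally produces one transaction per thread.

__global__ void gatherExample (const int *a, int *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int x = a [i];    // consecutive indices: coalesced
        y [i] = a [x];    // data-dependent indices: generally non-coalesced
    }
}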

3.6.2 Coalescing and Shared Memory

Figure 3.5: Shared Memory Banks and Address Assignment. (The figure shows consecutive shared memory addresses assigned round-robin to the shared memory banks.)

Shared memory is organized differently: consecutive 32-bit locations from the kernel's point of view are assigned to different memory banks (partitions), and optimal performance is achieved when concurrent access does not cause a bank conflict (see [3, p.89-96] for a detailed technical treatment). This is illustrated by Figure 3.5, where consecutive addresses 0,4,8 are mapped to bank 0, addresses 1,5,9 are mapped to bank 1, and so on. Therefore when four threads consecutively access addresses 0,1,2,3 each of them will actually read concurrently from a different shared memory bank, but when they access addresses 0,4,8,12 there will be a bank conflict as all threads attempt to read from the same shared memory bank. This interleaved address mapping of shared memory means that the most common method of accessing data from shared memory, using the thread ID as the index, produces zero conflicts, as long as the size of each data access is 32 bits. The main difference relative to global memory access is that the size of the data element matters with shared memory; thus sequentially accessing 32-bit integers is ideal for performance, while doing the same with 8-bit or 16-bit values is not. If four threads read four consecutive 8-bit characters in shared memory they will cause a bank conflict, as all four characters are in the same 32-bit value in one bank. No conflict occurs, however, if each thread accesses 8-bit characters that are 32 bits apart. This case can be expanded further in that memory accesses do not cause conflicts if they are spread apart in such a way as not to result in concurrent access to the same bank. Naturally, when all threads access the same location concurrently, no conflict occurs (see [3, p. 91]).
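The following fragment (again only an illustration with hypothetical names) shows both cases for 32-bit data: indexing shared memory by the thread ID spreads consecutive words across the banks, while indexing with a stride equal to the number of banks (16 on the hardware discussed here) maps the threads of a half-warp onto the same bank and serializes their accesses.

__global__ void bankExample (int *out, int n)
{
    __shared__ int s [512];
    int t = threadIdx.x;
    int g = blockIdx.x * blockDim.x + t;
    s [t] = t;
    __syncthreads ();
    int a = s [t];                // consecutive 32-bit words: conflict-free
    int b = s [(t * 16) % 512];   // stride of 16 words: a half-warp hits one bank
    if (g < n) out [g] = a + b;
}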

3.7 Memory Write Concurrency

CUDA supports concurrent write attempts to global memory without causing any failure in execution. The entire global memory is accessible by all cores of the GPU through the on-chip interconnection network that routes all memory requests to the respective memory modules. Each memory partition has its own queue for memory requests and arbitrates among the incoming read and write requests. When writing to and reading from the same location the following points apply:

• Concurrent writes to the same location are executed in unknown order resulting in one of them succeeding while the others are ignored without errors.

• Concurrent reads from the same location are of course optimal as a result of the coalescing of memory access discussed in section 3.6.1.

On a cautionary note, mixing concurrent reads and writes to the same memory location must be handled carefully, especially when multiple thread blocks are involved. It is important to ensure that all threads are synchronized between reading and writing to the same location unless it can be proven that the arbitrary access will not compromise the outcome (as in the Shiloach-Vishkin algorithm for computing connected components [17], which is discussed in greater detail in Chapter 6).
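As a small illustration of the first point above (a hypothetical kernel, not thesis code), every thread below writes its own index to the same global memory location; exactly one of the concurrent writes survives, which one is undefined, and no thread observes an error.

__global__ void concurrentWriteExample (int *winner)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    *winner = i;    // all threads write the same address; one arbitrary value is kept
}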

3.8 Striding & Partitioning

The PRAM model typically assumes one parallel processor per data item. This is not always possible even though GPUs allow an application to allocate millions of concurrent threads. Current nVIDIA Tesla architectures provide up to 4 GB of global memory; assuming that 1 billion data elements could be loaded into that memory, processing each element with its own thread would mean allocating a billion threads, which is not possible with today's hardware. At the same time, even though the hardware thread scheduler is very efficient, there is a cost, however small, associated with thread scheduling, and there is a point at which too many threads hamper performance. There are also algorithmic reasons for having fewer threads than data items (as in the case of the Parallel Random Splitter List Ranking algorithm discussed in greater detail in Chapter 5).

(a) Striding Memory Access (b) Partitioned Memory Access

Figure 3.6: Example of Striding Access vs Partitioned Access

Since many PRAM algorithms do make the one-thread-per-data-element assumption, we need a way to modify such an algorithm when implementing it to execute on a GPU, so that a smaller, fixed number of threads is able to process an arbitrarily large amount of data limited only by the amount of available memory. Consider n data items in an array A[0], ..., A[n - 1] that need to be accessed by p threads T_0, ..., T_{p-1}. There are two implementation strategies to choose from when there are more data items than threads (i.e. n > p):

1. Striding, where each thread accesses every p-th element in the array: i.e. thread T_i accesses data item A[i + s*p] in iteration s, for 0 <= s < n/p and 0 <= i <= p - 1, as illustrated by Figure 3.6 (a).

2. Partitioning, where each thread accesses consecutive elements in a partition of the data assigned to the thread: i.e. thread T_i accesses data item A[i*(n/p) + s] in iteration s, for 0 <= s < n/p and 0 <= i <= p - 1, as illustrated by Figure 3.6 (b).

Depending on the multi-processor system's design and its memory access caching and optimization support, one of the two strategies must be chosen for optimal run-time performance.

On an Intel processor where there are multiple cores executing in MIMD mode with multiple caching levels for memory access, data partitioning provides the best performance by making sure that each processor gets its own chunk of data to access in sequence without needing to refill the cache. On a GPU where the SMs contain multiple SIMD cores, where memory access is not cached and the system favours coalesced memory accesses across parallel warps and half-warps, striding provides the best performance (see section 3.6.1). The following pseudo-code illustrates how striding is implemented, where all p < n threads run in parallel and each thread, starting at thread index i, works on data elements j = {i,i+p,i + 2p,...}.

j = i
while j < n:
    -- do work with data at index j
    -- stride by p to next element
    j = j + p

The following pseudo-code illustrates how partitioning is implemented, where all p < n threads run in parallel and each thread, starting at index i * (n/p), works on data elements j = {j, j+1, j+2, ...}.

j = i * (n/p)
lim = j + (n/p)
while j < lim:
    -- do work with data at index j
    -- next consecutive element
    j = j + 1
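In CUDA terms the two strategies look roughly as follows (an illustrative kernel pair with hypothetical names, assuming for simplicity that p divides n; the thesis kernels in Chapter 5 use the strided form). The global thread index plays the role of i and the total number of threads launched plays the role of p.

__global__ void stridedAccess (int *A, int n, int p)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int j = i; j < n; j += p)      // thread i touches A[i], A[i+p], A[i+2p], ...
        A [j] += 1;
}

__global__ void partitionedAccess (int *A, int n, int p)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int lo = i * (n / p);
    int hi = lo + (n / p);
    for (int j = lo; j < hi; j++)       // thread i touches its own contiguous chunk
        A [j] += 1;
}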

Chapter 4

Review of Studied Graph Algorithms

The following graph algorithms, considered fundamental to other graph applications, were implemented and analyzed as part of this thesis.

• Parallel List Ranking

• Parallel Connected Component Counting

In this chapter we present the basics behind the two problems as well as the algorithms used to solve them on parallel shared-memory systems.

4.1 Parallel List Ranking

A basic operation required by nearly all (parallel or sequential) graph algorithms is to traverse a linked list of edges/pointers. In this section, we will start with the classical parallel list ranking problem (see e.g. [16, p. 80]) and study how to convert well-known parallel algorithms for list ranking into efficient GPU implementations by following the guidelines outlined in Chapter 3.


4.1.1 Definition

Figure 4.1: Example of List Ranking. (The figure shows a linked list drawn in index order and in rank order, with the vertices receiving ranks 5, 4, 3, 2, 1, 0.)

Consider a linked list of n vertices represented by an array succ[0 ... n - 1] where each slot succ[i] points to the next element in the linked list. The first element of the linked list is succ[0] and the last element at index x points back to itself, such that succ[x] = x. The rank of a list vertex is the measure of its distance from the last vertex in the list. The ranks of all list vertices are reported as an array rank[0..n - 1]. Figure 4.1 illustrates list ranking by showing an example of a linked list in index order (i.e. the order of the list array succ) and the list in rank order (i.e. the order obtained by following the links).

4.1.2 Sequential List Ranking

List ranking sequentially on a single processor system is trivial: the algorithm, as listed in Algorithm 4.1, starts at the head of the list and follows the links, assigning each vertex in the list its appropriate rank, starting with n - 1 for the element at index 0 and decrementing by 1 for each hop. Since the algorithm visits each vertex once, it does O (n) work in O (n) time.

Algorithm 4.1 Serial List Ranking

i = 0
r = n - 1
while (succ [i] != i):
    rank [i] = r
    i = succ [i]
    r = r - 1
rank [i] = 0

Given the amount of work that is done by the sequential algorithm, a parallel algorithm should ideally do no more than O (n) work regardless of the number of processors, and do the work in O (n/p) time or better with p processors.

4.1.3 Parallel Pointer Jumping

The simplest parallel algorithm, proposed by Wylie [23, 16, p. 64], relies on pointer jumping, also known as short-cutting. The algorithm relies on a temporary working array last [0 ... n - 1] that is used to represent the links from each vertex to the last vertex in the linked list. There are two broad parts to this algorithm:

1. Initialization, where in parallel for each vertex at index i in the list, its rank rank [i] is initialized to 1, or zero in the case of the last vertex, and its link to the last vertex last [i] is initialized to its immediate successor.

2. Pointer jumping (as shown by steps 1 through 3 in Figure 4.2), where in parallel for each vertex at index i in the list ...

(a) The current rank value is added to the rank of the current last vertex: rank [i] = rank [i] + rank [last [i]]. (b) The current last link is updated to jump to its successor's last link: last [i] = last [last [i]].

Figure 4.2: Example of List Ranking by Pointer Jumping. (The figure shows the initialization, where last [i] = succ [i] and rank [i] = 1 (or 0 for the last vertex), followed by pointer-jumping steps 1 through 3, which double the jump distance and accumulate the ranks.)

The idea behind the algorithm is to execute multiple iterations where, in each iteration, for each vertex in parallel, the distance to the last vertex in the list is counted by jumping in hops of increasing length. At each iteration s, each vertex jumps over at most 2^s successors, thus taking short-cuts over large numbers of vertices. The first vertex in the list has the furthest to jump, with n hops to get to the last vertex. Since the pointer jumping doubles in distance at each iteration, at most log2 n iterations are needed. Thus the whole algorithm does O (n log2 n) work; and if we have p parallel processors and an ideal system with no overhead it should run in O ((n log2 n)/p) time.

Algorithm 4.2 Parallel List Ranking by Pointer Jumping with p <= n Processors

Kernel PJ1 using p <= n processors (processor index i = 0 ... p-1):
    for j = i; j < n; j = j + p:                  -- striding
        rank [j] = (succ [j] != j)                -- 1, or 0 for the last vertex
        last [j] = succ [j]

Kernel PJ2 using p <= n processors (processor index i = 0 ... p-1):
    for j = i; j < n; j = j + p:                  -- striding
        (r1, l1) = (rank [j], last [j])           -- two reads, together
        (r2, l2) = (rank [l1], last [l1])         -- two reads, together
        (rank [j], last [j]) = (r1 + r2, l2)      -- two writes, together

Host:
    invoke PJ1
    repeat log (n) times: invoke PJ2

For real world problems, the size of the lists and graphs being processed will always exceed the number of processors available, i.e. p << n. Algorithm 4.2 shows the parallel pointer jumping algorithm adapted for GPU execution as follows:

• Two kernels are defined that represent the two broad steps of the algorithm.

• Under the assumption that p < n, both kernels employ striding (see section 3.8) to process more data than there are processors. By using p processors and executing kernel PJ1 first, followed by repeatedly calling PJ2 up to log2 n times, each processor effectively strides over n/p elements per invocation, and the algorithm as a whole does O (n log2 n) work.

• Kernel PJ2 implements the pointer-jumping step by breaking it into 3 atomic operations. The algorithm shows each line with either two reads or two writes and requires each line to operate atomically (i.e. both reads or both writes happen together, without any chance of interleaving between the two operations in each line). This is important since the GPU provides ordering guarantees only within a single thread-block and not across multiple thread-blocks. The actual implementation of the algorithm will need to take appropriate measures to ensure that the two operations in each line execute atomically when more than one thread-block is used.

While it is likely that on a parallel system the algorithm will execute faster than the sequential one given enough processors, this algorithm does log2 n times more work than the sequential one.

4.1.4 Parallel Random Splitter List Ranking

The work of the parallel list ranking algorithm can be improved if, instead of ranking the entire list using pointer jumping, we split the list into r < n sub-lists and use pointer jumping to rank a smaller list of r nodes instead. This is very similar to Reid-Miller's algorithm for parallel list ranking on a Cray C90 [15], and performs the list ranking as outlined below:

1. Randomly divide the list into r sub-lists by randomly choosing r splitter nodes from the n vertices, creating a linked list of splitters of length r made up of the first vertex of each sub-list.

2. Create a splitter rank array of length r, with the rank of each splitter initialized to the length of its sub-list.

3. Use parallel pointer jumping (see section 4.1.3) with r processors to rank the splitter list, using the pre-initialized splitter ranks from step 2. This will yield the ranks of the splitters relative to their positions in the input list of length n.

4. In parallel, using as many processors as desired, stride over the input list and compute each vertex's rank based on the rank of its sub-list.

Algorithm 4.3 shows the GPU adaptation of the outlined steps. Kernels RS1 and RS2 together initialize and select the random splitters, while kernel RS3 counts the sub-list lengths and constructs the linked list of splitters. Kernel RS4 uses a simplified implementation of Wylie's algorithm to rank the list of splitters, and then kernel RS5 aggregates the local ranks of the nodes in each sub-list with the final rank of each of their splitters. The kernels RS1, RS3 and RS5 process n elements and thus do O (n) work each. The kernel RS2 does O (r) work and RS4 (pointer jumping) does O (r log2 r) work, so the algorithm does O (n + r log2 r) work as a whole. However, if we can pick r's value such that r log2 r < n then kernel RS4 does O (n) work at most, and thus the whole algorithm can be shown to do O (n) work!

Algorithm 4.3 Parallel Random Splitter List Ranking

Kernel RS1 using p <= n processors (processor index i = 0 ... p-1):
    j = i    -- initialize owner array
    while j < n:
        owner [j] = -1
        j = j + p

Kernel RS2 using p = r processors (processor index i = 0 ... p-1):
    -- select a random splitter from this processor's partition of the list
    splitter = random (i*(n/p), ((i+1)*(n/p)) - 1)
    owner [splitter] = i
    rank [splitter] = 0

Kernel RS3 using p = r processors (processor index i = 0 ... p-1):
    dist = 1    -- walk the sub-list, marking and counting
    prev = splitter
    j = succ [splitter]
    while (owner [j] = -1) and (prev != j):
        rank [j] = dist
        owner [j] = i
        prev = j
        j = succ [j]
        dist = dist + 1
    if (prev == j) dist = dist - 1
    -- prepare the splitter linked list
    spsucc [i] = owner [j]
    sprank [i] = dist

Kernel RS4 using p <= r processors (processor index i = 0 ... p-1):
    -- rank the splitter list using pointer jumping
    repeat log (r) times:
        j = i
        while (j < r):
            sprank [j] = sprank [j] + sprank [spsucc [j]]
            spsucc [j] = spsucc [spsucc [j]]
            j = j + p

Kernel RS5 using p <= n processors (processor index i = 0 ... p-1):
    j = i    -- aggregate the local ranks with the splitter ranks
    while j < n:
        rank [j] = sprank [owner [j]] - rank [j]
        j = j + p

For n = 1000000 we can use up to r = 62700, i.e. r log2 r = 999198 < n, and maintain O (n) work. As we increase n to more realistic input sizes, the upper limit for r that maintains r log2 r < n increases as well.
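A simple host-side helper can pick the largest such r for a given n; the sketch below is our own illustration (not part of the thesis code) and simply scans upward until r log2 r would exceed n.

#include <math.h>

/* Largest r such that r * log2(r) <= n (illustrative helper only). */
static long maxSplitters (long n)
{
    long r = 2;
    while ((double)(r + 1) * log2 ((double)(r + 1)) <= (double) n)
        r = r + 1;
    return r;
}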

Example of Random Splitter List Ranking in Parallel

Algorithm 4.3 on the previous page is made up of five kernels which are invoked in sequence. The first kernel (RS1) is executed using as many threads as possible and initializes the owner to -1 for each vertex. The second kernel (RS2) is executed using r threads, where each thread selects one random vertex from the list as its splitter. It is necessary to ensure that no two splitters are the same, the details of which are discussed in the implementation of this algorithm in chapter 5 on page 46. Since r threads are used, the thread index i = {0 ... r - 1} represents the index of the splitter in the list of splitters. The owner of each splitter is initialized to its thread index, and the rank of each splitter is initialized to zero. (See below for the outcome of this kernel for a simple example list.)

[Figure: the example list after kernel RS2 picks the splitters: each chosen splitter's owner is set to its thread index and its rank to 0, while all other vertices still have owner -1.]

The third kernel (RS3) is also executed using r threads, where each thread takes its splitter and walks along its sub-list, counting the length of the sub-list and marking the relative distance of each vertex from the splitter as its rank. Thus for a splitter at index j, its immediate successor succ [j] has rank 1, the next one rank 2, and so on.

[Figure: the example list after kernel RS3 computes local distances: each vertex's owner is set to its splitter's thread index and its rank to its distance from that splitter.]

Once the distances have been computed, the third kernel establishes the linked list of splitters (spsucc) and the initial splitter ranks (sprank) based on the lengths of the splitters' sub-lists.

[Figure: the extracted splitter linked list, with spsucc linking the splitters and sprank initialized to the sub-list lengths (2 and 3 in the example).]

The fourth kernel (RS4) is a specialized implementation of the pointer jumping algorithm's second kernel (PJ2). While the kernel in Algorithm 4.2 on page 33 relies on a temporary last [] array, this kernel uses the splitter list spsucc [] array instead, modifying it in the same way until each splitter points to the last splitter in the splitter list. Note that this kernel implements the log2 r iterations within the kernel, which imposes the restriction that the kernel must be executed using a single thread-block, with at most 768 threads (see chapter 3 on page 16).

[Figure: the splitter list after kernel RS4 ranks it by pointer jumping.]

The fifth kernel (RS5) is executed using as many threads as possible, where each thread strides over the elements of the list computing the rank of each vertex. For any vertex at index j we can at this point establish its splitter via s = owner [j], its distance from its splitter via rank [j], and its splitter's rank via sprank [s]. The rank of the vertex is then computed simply as rank [j] = sprank [owner [j]] - rank [j], i.e. the rank of the splitter less the distance of the vertex from its splitter.

[Figure: kernel RS5 combines the splitter ranks with the local distances; e.g. a vertex with splitter rank 5 and local distance 1 receives final rank 5 - 1 = 4, and one with splitter rank 3 and local distance 3 receives rank 3 - 3 = 0.]

4.2 Parallel Connected Component Counting

Identifying and counting connected components within a graph is a common requirement of many graph processing applications. In this section we will start with a well-known PRAM algorithm by Shiloach and Vishkin for computing the connected components of a graph in parallel and study how to convert it into an efficient GPU implementation.

4.2.1 Definition

According to Diestel [8, p. 10, 11], a non-empty graph G is called connected when any two of its vertices are linked by a path in G. A maximal connected sub-graph of G is called a component of G. As shown by the example in Figure 4.3, one graph can be made up of one or more connected components. Trees and lists are examples of graphs that have one connected component, viz. themselves, while more complex graphs with different properties can have multiple connected components.

Figure 4.3: An example of a graph with 3 connected components

There are two ways to represent a graph as a data structure:

• In a vertex-representation, a graph is stored as a list of vertices, V [0 ... n - 1], where each vertex holds a list of the neighbouring vertices that it is connected to. Given an undirected graph, when an edge connects two vertices, each vertex will contain the other vertex in its list of neighbours. Thus a vertex-representation is inherently directional.

• In an edge-representation, a graph is stored as a list of edges, E [0 ... m - 1], where each edge E [i] = (a, b) is a pair of vertex indices. An edge graph can be either undirected or directed, i.e. edge (a, b) can either mean that a and b point to each other or that only a points to b. In this study, edges in this representation are assumed to be directed unless explicitly stated otherwise.

Linear graphs such as lists and trees always contain fewer edges than they have vertices, i.e. m = n - 1, while complex graphs can contain up to m = O (n^2) edges when every vertex is connected to every other vertex.

4.2.2 Sequentially Counting Connected Components

Sequentially counting or identifying the connected components of a graph is trivial: given a vertex-representation (see Figure 4.4), visit each vertex in the graph and recursively label all vertices connected to it, assigning a unique label only when an unlabeled vertex is visited. Algorithm 4.4 on the next page shows two sub-routines:

• countConnectedComponents() initializes an array of labels of length n to zero for each vertex, after which it starts with an initial label of zero and visits each vertex in the graph. When it encounters an unlabeled vertex, it increases the label number and recursively labels the connected component by calling the next sub-routine.

• recurseLabel() starts with the supplied vertex and labels each adjoining vertex with the same label. It stops when it encounters a vertex that it has already labeled, thus handling circular paths in the graphs.

Figure 4.4: Vertex representation of a sample graph. (The figure shows the graph's vertices and bi-directional edges, together with each vertex's list of neighbouring vertices.)

The sequential algorithm visits each vertex at least twice and each edge at most once, thus doing O (n + m) work for a graph with n vertices and m edges in order to obtain a count of the components within the graph.

Algorithm 4.4 Serial Connected Components Counting by Labeling

function countConnectedComponents (G):
    l = 0
    for each v in G:
        labels [v] = 0
    for each v in G:
        if labels [v] = 0 then:
            l = l + 1
            call recurseLabel (v, labels, l)
    return l

function recurseLabel (v, labels, l):
    if (labels [v] != l) then:
        labels [v] = l
        for each c in children [v]:
            call recurseLabel (c, labels, l)

Note that this algorithm relies on a vertex-representation of the graph while the parallel algorithm discussed later in this section relies on an edge-representation (see Figure 4.5). Since this study is focused on the parallel algorithms, the input graphs are expected as edge graphs, thus requiring conversion for use with the sequential algorithm. Converting an edge-representation into a vertex-representation can be done with O (n + m) work by walking the edge list and updating the two vertices of each edge.

Figure 4.5: Edge Representation of a Graph. (The figure shows the sample graph stored as a list of directed edges E [i] = (a, b), such that E [i, a] → E [i, b].)

4.2.3 Parallel Counting of Connected Components

As with the parallel list ranking problem, the preference is to implement the parallel algorithm such that it does no more work than the sequential algorithm. However, the recursive linear scanning model of the sequential algorithm cannot be efficiently parallelized, and the best known alternate approaches do O (n log n) work. Chapter 2.2 of [16] presents the general techniques for parallelizing the computation of connected components.

Figure 4.6: Components as Stars (i.e. Trees with height 1)

The strategy is to repeatedly partition the graph's vertices into rooted trees: starting with each vertex as a tree of size one, the trees are combined over multiple iterations into larger trees until each component in the graph forms a tree separate from the other components. The trees are generated as an overlay on the graph; i.e. the same vertices are used but the original edges are not modified. It is best to imagine the original graph G (V, E) as the source of a new graph (a forest of trees) G' (V, E') that uses the same vertices but is effectively a different graph with different edges.

Short-cutting, the same pointer-jumping concept used in Wylie's parallel list ranking algorithm (see section 4.1.3), is used to shorten the trees so that each component's tree ends up with a height of 1. Short-cutting works entirely in G'. Hooking is used to join two trees in G' that have a common parent in G, building trees even as short-cutting reduces their depth. The parallel algorithm's goal is to use the connectivity information in the original graph G to construct the tree graph G' such that each connected component becomes one of the trees in G'. In addition, each tree has a height of 1 and is also known as a star graph (see Figure 4.6 which shows the edges in G as grey lines and the edges in G' as black curved arrows). This helps to quickly count and identify the components in G' since every vertex in a component will have the same parent.

4.2.4 Shiloach and Vishkin's PRAM Connected Components Algorithm

Given a graph with n vertices and m edges, the PRAM algorithm by Shiloach and Vishkin [17] computes the connected components by doing O ((n + m) log n) work. The algorithm requires any two connected vertices to be connected by an edge in both directions, and thus expects there to be 2m directed edges. It also requires the availability of n + 2m processors, using which the algorithm performs O (n + 2m) parallel operations during each of its log n iterations. Each iteration is made up of several steps, and while each step is executed in parallel, barrier synchronization is needed between the steps to ensure that each step completes before the next one is started. Analyzing the original PRAM algorithm with the intent of modifying it to execute on a CUDA GPU reveals some changes that can be made to the above assumptions. Within each iteration, each of the steps either works on the n vertices or the 2m directed edges but never on both at the same time. Since CUDA allows the threads to be allocated dynamically, the modified algorithm can be executed with at most 2m threads (processors). The PRAM algorithm depends on support for the CRCW model, which allows a CUDA implementation to take advantage of striding (see section 3.8). Thus the modified implementation can run with any number of threads, including p < min (m, n). Each of the steps in the original PRAM algorithm has been converted into a kernel as shown in the CUDA-based Algorithm 4.5. Using striding in each kernel, during each iteration every one of the p threads does either n/p or 2m/p work (rather than the PRAM expectation of each thread doing 1 unit of work per iteration).

Algorithm 4.5 Shiloach and Vishkin's Algorithm for p < n using Striding

Kernel SV0 using p <= n processors (processor index i = 0 ... p-1):
    j = i    -- initialize
    while j < n:
        D(0) [j] = j
        Q [j] = 0
        j = j + p

s' = s = 1
while s = s':

    Kernel SV1 using p <= n processors (processor index i = 0 ... p-1):
        j = i    -- step 1: Short-cut & Check/Mark
        while j < n:
            D(s) [j] = D(s-1) [D(s-1) [j]]
            if D(s) [j] != D(s-1) [j] then:
                Q [D(s) [j]] = s
            j = j + p

    Kernel SV2 using p <= n processors (processor index i = 0 ... p-1):
        j = i    -- step 2: Hook edges
        while j < 2m:
            (a, b) = edge j
            if D(s) [a] = D(s-1) [a] and D(s) [b] < D(s) [a] then:
                D(s) [D(s) [a]] = D(s) [b]
                Q [D(s) [b]] = s
            j = j + p

    Kernel SV3 using p <= n processors (processor index i = 0 ... p-1):
        j = i    -- step 3: Hook stagnant roots
        while j < 2m:
            (a, b) = edge j
            if Q [D(s) [a]] < s and D(s) [a] = D(s) [D(s) [a]] and D(s) [a] != D(s) [b] then:
                D(s) [D(s) [a]] = D(s) [b]
            j = j + p

    Kernel SV4 using p <= n processors (processor index i = 0 ... p-1):
        j = i    -- step 4: Short-cut again
        while j < n:
            D(s) [j] = D(s) [D(s) [j]]
            j = j + p

    Kernel SV5 using p <= n processors (processor index i = 0 ... p-1):
        w = 0, j = i    -- step 5: Check for more work
        while j < n:
            if Q [j] = s then w = w + 1
            j = j + p
        if w > 0 then s' = s' + 1

    s = s + 1
end

The goal of the algorithm, as described by the general outline in section 4.2.3, is to generate a star graph for each component such that every vertex in a component has a pointer to one root vertex. While in principle the algorithm builds a forest of trees, instead of tracking the children of vertices the algorithm tracks the parent of each vertex. Towards this end the algorithm maintains two arrays, D(s)[0 ... n - 1] and D(s-1)[0 ... n - 1], where s is the current iteration. D(0) is simply the result of the initialization step of the algorithm (kernel SV0), after which the algorithm starts at s = 1 and repeats until it has no more work to do. To help determine whether the algorithm is finished or if it should continue for another iteration, the algorithm maintains an array Q[0 ... n - 1] which is used to track, for each vertex, the last iteration in which some change was made to the tree as a result of that vertex's state.

Example of Computing Connected Components in Parallel

The initialization step (kernel SV0) sets up D(0) such that each vertex is a tree with itself as its parent. The figure below shows the initialization of D(0) for the graph shown in Figure 4.4.

[Figure: state after kernel SV0 (s = 0): D(0) = [0 1 2 3 4 5 6 7 8], Q = [0 0 0 0 0 0 0 0 0].]

The first step (kernel SV1) within each iteration attempts to shorten the trees, then checks whether the current parent in D(s) differs from the parent in D(s-1) before it updates Q.

[Figure: state after kernel SV1 at s = 1: D(1) = [0 1 2 3 4 5 6 7 8], Q = [0 0 0 0 0 0 0 0 0] (no change yet).]

The next step (kernel SV2) examines the edges of the original graph and tries to hook the tree of one end-point vertex to the tree of the other. Since the graph has directional edges and other edges may share one of the vertices, multiple processors may be hooking the same vertex's tree in parallel. The concurrent-write support ensures that only one will succeed, and the algorithm ensures that unsuccessful hookings do not matter.

[Figure: state after kernel SV2 at s = 1: D(1) = [0 0 2 2 1 2 3 4 6], Q = [0 1 0 1 1 1 1 1 1]. Kernel SV3 at s = 1 makes no change.]

The following step (kernel SV3) also examines the edges of the original graph, in this case looking to connect trees that are stagnant [17, p.59], where a stagnant tree is one that has not changed as a result of the earlier steps of this iteration. One more short-cutting is performed in the next step (kernel SV4), not necessarily because it is required for the correctness of the algorithm but because it helps reduce the number of iterations significantly.

[Figure: state after kernel SV4 at s = 1: D(1) = [0 0 2 2 0 2 2 1 3], Q = [0 1 0 1 1 1 1 1 1]. Kernel SV5 finds w = 7, so s' = 2.]

The last step (kernel SV5) checks Q at the end of each iteration to determine if any work was done during the current iteration. If no work was done, then no more iterations are performed. The diagrams above illustrate the changes for step s = 1. Since work was done hooking some of the vertices, the algorithm proceeds to the next iteration, the results of which are shown below. For the sample graph, the short-cutting by kernel SV1 finishes the work and we can see the final star graphs with two roots representing the two components.

[Figure: state after kernel SV1 at s = 2: D(2) = [0 0 2 2 0 2 2 0 2], Q = [0 1 0 1 1 1 1 2 2]. Kernels SV2, SV3 and SV4 make no change; kernel SV5 finds w = 2, so s' = 3.]

Since some work was done in this iteration, there is one more iteration that will be executed (not shown) in which each kernel does no further work, which then causes the algorithm to terminate.

Chapter 5

Parallel List Ranking and CUDA

We begin our study of the implementation of parallel graph algorithms for a GPU with a basic operation required by nearly all (parallel or sequential) graph algorithms: traversing a linked list of edges. Consider a linked list of length n represented by an array succ[0..n - 1] where each succ[i] points to the next element in the linked list. The first element of the linked list is succ[0] and the last element at index j has the property succ[j] = j. The ranks of all list elements (distances to the last element of the list) are reported as an array rank[0..n - 1], and computing the ranking of all the nodes in a list requires a traversal of the entire list from the head of the list to its tail by following all the links. In this chapter we present the details of the different list ranking implementations listed below, following which we examine the results comparing the performance differences between the algorithms. In addition to the sequential algorithm, we will study a GPU implementation (see section 4.1.3 for a review) of the straightforward pointer jumping algorithm for the PRAM, also known as Wylie's Algorithm [16, p. 64] [23]. This algorithm performs well with the GPU but, similar to a preliminary result in [14], suffers from the fact that it requires O (n log n) work. We then look at our implementation (see section 4.1.4 for a review) of an adaptation of the parallel random splitter algorithm for the PRAM made by Reid-Miller for parallel list ranking on a Cray C90 [15]. The Cray C90 is a vector processor and shares some features with GPUs such as SIMD-style data parallelism. We implement this parallel algorithm on the CPU as well to study how well it fares, unlike the sequential algorithm, using all the cores in a CPU.


5.1 Implementation

5.1.1 Sequential CPU List Ranking

The sequential list ranking implementation is very simple: since the number of elements is known, the implementation walks the list from the first element to the last following each successor, counting down from n to determine each element's rank (see Listing 5.1).

Listing 5.1 Sequential List Ranking Implementation

void SQ_LR_doListRanking (
    int32_t *succ_h, int32_t n, int32_t *rank_h)
{
    int32_t cur = 0;
    do {
        rank_h [cur] = --n;        // ranks count down from n - 1
        cur = succ_h [cur];
    } while (succ_h [cur] != cur);
    rank_h [cur] = 0;              // the last element has rank 0
}

5.1.2 Parallel CUDA Wylie's List Ranking

The CUDA-based implementation of Wylie's list ranking algorithm is made up of three kernels:

• The initialization kernel LR_PJ_Init() which sets up the preliminary values for the data structure that is used to compute the rank of each element and track the pointer jumps (see Listing 5.2).

• The pointer jumping kernel LR_PJ_PtrJumping() which performs one pointer-jumping iteration, in which each element updates its short-cut pointer. This kernel also tracks whether any work was done, to help the host know when to stop iterating (see Listing 5.3).

• The data extraction kernel LR_PJ_ExtractRank() which extracts the rank out of the packed data structure that is used to optimize the performance of the pointer jumping (see Listing 5.4).

Implementation Details of the Initialization Kernel

Listing 5.2 Parallel Pointer Jumping List Ranking Initialization Kernel, in C/CUDA

__global__ void LR_PJ_Init (
    int32_t *succ, uint64_t *ranklast, int32_t n, int32_t p)
{
    ranklast_t r;
    int32_t ix = blockIdx.x * blockDim.x + threadIdx.x;

    while (ix < n) {
        r.per.rank = (succ [ix] != ix);
        r.per.last = ((succ [ix] + 1) * r.per.rank) - 1;
        ranklast [ix] = r.combo;   // single 64-bit write
        ix += p;                   // stride by p to next node
    }
}

The algorithm expects n > p elements in the list, and thus uses striding to take advantage of memory access coalescing. The original algorithm updates two arrays during the pointer jumping: the last array and the rank array. This implementation however uses a packed 64-bit array that encodes the rank and last pointer. The packing and unpacking are done using C macros (see below) that make the code more readable by hiding the bitwise arithmetic involved.

#define packx(last,rank) (\
    (((uint64_t) (rank)) << 32) | /* upper 32 bits */ \
    ((uint64_t) (last)))          /* lower 32 bits */
#define xrank(p) ((uint32_t) ((p) >> 32))
#define xlast(p) ((uint32_t) ((p) & 0xffffffff))

The kernels themselves use the packed data type only as a register, and use it to manipulate the values in register memory. Each value is read once from, and written once to, global memory as a single 64-bit operation.
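The listings refer to a ranklast_t type that is not shown in the extracted code; a plausible definition, consistent with how r.combo and r.per are used and with the packx macro above, is the union sketched below (an assumption on our part, not the thesis's exact declaration).

typedef union {
    uint64_t combo;        // the packed 64-bit value, read and written as one transaction
    struct {
        int32_t last;      // low 32 bits: index of the current "last" pointer
        int32_t rank;      // high 32 bits: the rank accumulated so far
    } per;                 // (field order assumes the little-endian layout used by the GPU)
} ranklast_t;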

Implementation Details of the Pointer Jumping Kernel

As discussed in section 3.6.1, on a CUDA 1.2 or better system up to 128 bytes can be read or written as a single transaction. There are two benefits of packing the two values into a single 64-bit value. First, the use of 64 bits to encode two 32-bit values reduces the total number of memory transactions: while the amount of memory read and written doesn't change, with two arrays there are twice the number of transactions on the memory access network.

Second, the algorithm processes the rank and the last pointer together, and by using a packed value the implementation can atomically read and write 64-bit values. This allows us to ensure data integrity when running the kernel over multiple thread-blocks.

Listing 5.3 Parallel Pointer Jumping List Ranking Kernel, in C/CUDA

__global__ void LR_PJ_PtrJumping (
    int32_t *succ, uint64_t *ranklast, int32_t *work,
    int32_t n, int32_t p)
{
    ranklast_t r, r2;
    int32_t w = 0, ix = blockIdx.x * blockDim.x + threadIdx.x;
    while (ix < n) {
        r.combo = ranklast [ix];
        if (r.per.last > -1) {
            r2.combo = ranklast [r.per.last];  // read 64 bits
            r.per.rank += r2.per.rank;
            r.per.last = r2.per.last;
            ranklast [ix] = r.combo;           // write 64 bits
            w++;                               // count work done
        }
        ix += p;                               // stride by p to next node
    }
    if (w > 0) *work = 1;                      // if any work done, mark
}

Implementation Details of the List Rank Extraction Kernel

Listing 5.4 Parallel Pointer Jumping List Ranking Rank Extraction Kernel, in C/CUDA

__global__ void LR_PJ_ExtractRank (
    uint64_t *ranklast, int32_t *rank, int32_t n, int32_t p)
{
    ranklast_t r;
    int32_t ix = blockIdx.x * blockDim.x + threadIdx.x;
    while (ix < n) {
        r.combo = ranklast [ix];
        rank [ix] = r.per.rank;
        ix += p;                               // stride by p to next node
    }
}

Packing the rank and last pointer data in device memory means that the rank data has to be extracted before the host can use it. The extraction is done by a separate kernel which, in parallel, extracts the rank from each packed value.

The pointer jumping kernel needs to be invoked iteratively up to lg (n) times by the host application. Since we know that often fewer iterations are needed to complete the ranking, the pointer jumping kernel counts the number of short-cuts it performs. Before exiting, it writes 1 to a common address if, and only if, any pointer jumping was done. With the implementation sub-divided into multiple kernels, additional code is needed to invoke them properly. Listing 5.5 illustrates how the kernels are invoked by the host application.

Implementation Details of the CPU Host Function

Listing 5.5 Host invoking the Pointer Jumping kernels, in C/CUDA

void LR_PJ_host (...) {
    cudaMemcpy (..., cudaMemcpyHostToDevice);
    LR_PJ_Init <<< q, 256 >>> (...);
    do {
        cudaMemset (work_d, 0, ...);
        LR_PJ_PtrJumping <<< q, 256 >>> (..., work_d, ...);
        cudaMemcpy (&work_h, work_d, ..., cudaMemcpyDeviceToHost);
    } while (work_h > 0);
    LR_PJ_ExtractRank <<< q, 256 >>> (...);
    cudaMemcpy (..., cudaMemcpyDeviceToHost);
}

The parameter q represents the number of thread blocks used to execute the kernel, each block using 256 threads. While CUDA allows up to 768 threads per block, our experiments showed that using more than 256 threads did not yield any improvement in performance, and often resulted in a reduction in performance.
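The extracted listing does not show how q is computed; one plausible choice (an assumption on our part, with hypothetical device-pointer names succ_d and ranklast_d) is to size the grid from n and the block size, as sketched below. Since the kernels stride, q does not strictly have to cover all n elements, and for very large n it would be capped at the hardware's grid-size limit.

void launchInit (int32_t *succ_d, uint64_t *ranklast_d, int32_t n)
{
    int threadsPerBlock = 256;
    int q = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up to cover n elements
    int p = q * threadsPerBlock;                          // total number of threads launched
    LR_PJ_Init <<< q, threadsPerBlock >>> (succ_d, ranklast_d, n, p);
}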

5.1.3 Parallel CUDA Random Splitter List Ranking

The CUDA-based implementation of Random Splitter list ranking algorithm is made up of four kernels:

• The initialization kernel CU_RS_LR_64b_PickRandomSplitters() which sets up the random splitters (see Listing 5.7).

• The sub-list counting kernel CU_RS_LR_64b_CountSplitterSubLists() which, for each splitter, walks its sub-list and computes a local ranking (see Listing 5.8), along the way constructing a linked list of the splitters.

• The simplified pointer jumping kernel CU_PJ_LR_Simple() which is used to rank the splitters, using their local ranks to compute their absolute ranking (see Listing 5.9).

• The rank aggregation kernel CU_RS_LR_64b_AggregateLocalRanks() which combines the local rank of each node with the absolute rank of its splitter to compute the rank of each node (see Listing 5.10).

Implementation Details of the CPU Host Function

Listing 5.6 Host invoking the Random Splitter kernels, in C/CUDA

void CU_RS_LR_64b_host (...) {
    // copy linked list to device memory
    cudaMemcpy (..., cudaMemcpyHostToDevice);
    // initialize owner array
    cudaMemset (rankowner_d, 0xff, n * sizeof (uint64_t));    // RS1
    CU_RS_LR_64b_PickRandomSplitters <<<...>>> (...);         // RS2
    CU_RS_LR_64b_CountSplitterSubLists <<<...>>> (...);       // RS3
    // rank the ruler list using parallel Wylie
    CU_PJ_LR_Simple <<<1,256>>> (...);                        // RS4
    // use as many threads as possible to aggregate; i.e. RS5
    CU_RS_LR_64b_AggregateLocalRank <<<4000000/256,256>>> (...);
    // copy ranks to host memory
    cudaMemcpy (..., cudaMemcpyDeviceToHost);
}

Listing 5.6 illustrates the host function that invokes each of the kernels. The implementation relies on several working data structures in addition to the successor array succ and the rank array rank used to store the final ranks of each of the nodes. The array owner [0 ... n - 1], whose entries take values in {0 ... r - 1}, is used to determine the index of the splitter to whose sub-list each vertex in the input list belongs. The array spsucc [0 ... r - 1] is used to store the linked list of splitters, and the array sprank [0 ... r - 1] is used to store the ranks of the splitters in the splitter list. The implementation details shown here represent an implementation that uses a packed 64-bit array, like the one used by the CUDA Pointer Jumping implementation shown earlier. While in this case the owner [] and rank [] arrays are combined, the reason is the same: to combine two transactions into one to improve performance. The kernels use the following C macros which hide the bitwise arithmetic associated with packing and unpacking the owner and rank components. Initializing the packed array is done using CUDA's API for clearing memory (viz. cudaMemset()) in parallel rather than writing our own kernel to do the same work.

#define packx(owner,rank) ( \
    (((uint64_t) (rank)) << 32) | \
    ((uint64_t) (owner)))
#define xrank(p) ((uint32_t) ((p) >> 32))        // high 32 bits
#define xownr(p) ((uint32_t) ((p) & 0xffffffff)) // low 32 bits

We also implemented a 48-bit version where the owner and rank fields are not packed, instead tracking two separate arrays: owner with 16-bit values and rank with 32-bit values. This alternate implementation (not listed here) reduces the total amount of device memory accessed during the execution of the algorithm by 25% while doubling the number of access transactions. When we analyze the performance results of the different list ranking algorithms in section 5.2, we include the results from both implementations.

Implementation Details of the Random Splitter Selection Kernel

Listing 5.7 Parallel Random Splitter Selection Kernel, in C/CUDA

__global__ void CU_RS_LR_64b_PickRandomSplitters (
    uint32_t seed,
    int32_t *splitters,    // p-length array of splitters
    uint64_t *rankowner,   // n-length packed rank/owner array
    int32_t n, int32_t p)
{
    int32_t pIx, workN, workIx0, rem, sel;
    RAND_t rng;            // structure for RNG
    pIx = blockIdx.x * blockDim.x + threadIdx.x;
    if (pIx < p) {
        // partition list evenly across threads
        rem = n % p;
        workN = n / p;              // start with division of work
        workIx0 = pIx * workN;      // and use that to index start
        workN += (pIx < rem);       // add 1 when pIx < rem
        workIx0 += min (pIx, rem);  // add lesser of pIx or rem
        // pick random node as local "splitter", unless first node
        if (workIx0 == 0)
            sel = 0;
        else {
            // initialize RNG and generate a single random value
            DCU_RandInit (seed, pIx, &rng);
            sel = workIx0 + (DCU_RandNext (&rng) * (workN - 1));
        }
        splitters [pIx] = sel;
        rankowner [sel] = packx (pIx, 0);  // remember splitter
    }
}

The splitter selection kernel (see Listing 5.7) uses a random number generator with a very high period that we implemented for both the CPU and GPU platforms (see Appendix A for details) to select a single splitter from each thread's evenly partitioned array of links.

Given n elements in the array, each thread gets a partition of n/p sequential elements from which it selects a random index. Since only a single random number is needed per thread, the host supplies a global seed which is then altered by each thread using its thread index to ensure that each thread gets a distinct seed from that of any other thread. The RNG provides sufficient guarantees of randomness for values generated using different seeds, so this approach, in effect, allowed us to generate sufficiently random numbers in parallel. The splitters array is used to note the splitter index for each thread, which is then used by the other kernels to identify the splitter associated with each thread. Each splitter's local rank is set to zero and its owner is set to the thread index. Since there is a splitter per thread we can use the thread index as the splitter index and thus the owner index.

Implementation Details of the Sub-List Counting Kernel

Listing 5.8 Parallel Sub-List Counting Kernel, in C/CUDA

__global__ void CU_RS_LR_64b_CountSplitterSubLists (
    uint32_t *succ, int32_t *splitters, uint64_t *rankowner,
    int32_t *splitter_succ_p, int32_t *splitter_rank_p,
    int32_t n, int32_t p)
{
    int32_t pIx, splitter, j, prevj;
    uint32_t dist;
    uint64_t mk;
    pIx = blockIdx.x * blockDim.x + threadIdx.x;
    if (pIx < p) {
        // identify the splitter, then walk and own its sub-list
        splitter = splitters [pIx];
        dist = 1;
        prevj = splitter;
        j = succ [splitter];
        while ((mk = rankowner [j]) == OWNERZERO) {
            rankowner [j] = packx (pIx, dist);
            j = succ [prevj = j];
            dist ++;
        }
        // reduce dist by 1 if we've encountered the last node
        dist -= (prevj == j);
        // set up the p-length splitter linked list to rank first
        splitter_succ_p [pIx] = xownr (mk);  // the next splitter
        splitter_rank_p [pIx] = dist;        // and distance to it
    }
}

The most labour-intensive part of the algorithm is walking each splitter's sub-list and locally ranking each element in that sub-list. The kernel shown in Listing 5.8, when executed by each thread, uses the thread index to identify the splitter, following which each thread sequentially ranks the sub-list by walking along the successors starting at the splitter. During initialization, every element's owner was set to a special value of 0xffffffff. During splitter selection, every splitter's owner was set to the appropriate thread index, a value much smaller than the special value. Thus each thread knows whether or not a successor already belongs to a splitter's sub-list by checking the element's owner value. As the kernel walks each splitter's sub-list, it counts the distance from the splitter, sets the element's rank value, and sets the owner value to the thread index. Thus when the kernel has finished, each splitter's sub-list has been ranked relative to the splitter and every sub-list element knows its owner. At this point each thread stores the number of elements in its sub-list in splitter_rank_p and the owner of the first element that belongs to another splitter as its successor in splitter_succ_p. This effectively constructs a linked list of the splitters, with each splitter's sub-list length as an initial rank for each splitter. This kernel is labour-intensive not because it does O (n) work but because it results in a very large number of un-coalesced memory transactions. In a completely randomly linked list, every element's successor is at a random index in the overall successor array, and when p threads access p elements in parallel and then select their successors, the threads in a particular block are very unlikely to access elements at indices near enough to each other to take advantage of coalescing. Thus with high probability every thread's access results in a unique transaction, in turn resulting in O (n) discrete memory access transactions. As discussed in section 3.6.1, CUDA performance is maximal when memory access is coalesced.

Implementation Details of the Simple Pointer-Jumping List Ranking Kernel

Listing 5.9 shows a single-kernel implementation of Wylie's pointer-jumping list ranking algorithm. Unlike the parallel multi-thread-block implementation that we looked at in section 5.1.2, this one uses a simplified version of parallel pointer jumping that can be executed using only one thread-block, and as a result is suitable for ranking small linked lists. Since there are as many splitters as threads in the Random Splitter-based list ranking algorithm, there are thousands of threads. Recall that we select p in such a way that O (p log p) < O (n). Thus the linked list of splitters, with its p elements where p is quite small relative to n, is large enough to warrant a CUDA-based ranking algorithm but too small to justify using the more complex multi-thread-block pointer jumping algorithm, both to maximize the use of the GPU and to avoid copying the splitters to and from host memory. Since this step needs to rank the linked list of splitters with the caveat that we start with each splitter's sub-list length as the initial rank value instead of zero, this pointer jumping implementation is slightly specialized. It expects a boolean userank parameter that it uses

to determine whether to use, or initialize to zero, the initial values in rank. When ranking the linked list of splitters, the kernel is invoked with a true value to request that it use the incoming values as the initial ranks. Thus when the splitter ranking finishes, each splitter's rank is in fact its final rank relative to the original n-element linked list.

Listing 5.9 Parallel Pointer Jumping List Ranking with One Kernel, in C/CUDA

__global__ void CU_PJ_LR_Simple (
    int32_t *succ, int32_t *rank, int32_t *last, int32_t n,
    int32_t p, bool userank)
{
    int32_t plx, i, j, dist, more;
    if (blockIdx.x == 0) {  // ensure only one thread-block is used
        i = plx = blockIdx.x * blockDim.x + threadIdx.x;
        for (; i < n; i += p) {
            j = succ [i];
            dist = (j != i);
            last [i] = ((j + 1) * dist) - 1;
            if (!userank) rank [i] = dist;
        }
        __syncthreads ();  // synchronize after init
        do {
            more = 0;
            for (i = plx; i < n; i += p) {
                j = last [i];
                if (j >= 0) {
                    more ++;
                    dist = rank [i] + rank [j];  // read phase
                    j = last [j];
                    __syncthreads ();  // sync before write
                    last [i] = j;
                    rank [i] = dist;
                }
                __syncthreads ();  // sync before next read
            }
        } while (more > 0);
    }
}
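To show how this kernel fits into step RS4 of the random splitter algorithm, the following sketch launches it on the p-element splitter list produced by Listing 5.8. The device-pointer names, the scratch array last_d and the choice of 256 threads are assumptions made only for illustration; only the single thread-block and the userank = true argument follow from the discussion above.

// Sketch of a host-side launch for ranking the splitter list (RS4).
// splitter_succ_p_d and splitter_rank_p_d are the device arrays filled by
// Listing 5.8; last_d is assumed to be a p-element scratch array.
int threads = 256;   // assumed thread count; the kernel uses one block only
CU_PJ_LR_Simple<<<1, threads>>>(splitter_succ_p_d, splitter_rank_p_d,
                                last_d, p /* list length */, threads,
                                true /* userank: keep sub-list lengths */);
cudaDeviceSynchronize();   // wait for the splitter ranks to be final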

Implementation Details of the Rank Aggregation Kernel

The last kernel (see Listing 5.10) is an example of how coalesced memory access can significantly improve performance in a CUDA system. At this point each splitter's global rank within the linked list has been established. Since every element knows its owner, the aggregation kernel simply strides over the successor array, reading the rankowner array for each element and using the splitter's global rank to compute each element's global rank based on the element's local rank relative to the splitter. Since this kernel does not follow the successor links, memory access is maximally coalesced: p threads, over n/p iterations, concurrently access p sequential elements in each iteration.

Listing 5.10 Parallel Rank Aggregation Kernel, in C/CUDA

__global__ void CU_RS_LR_64b_AggregateLocalRanks (
    uint64_t *rankowner,       // n-length array of packed owner/rank values
    uint32_t *rank,            // n-length array of final ranks
    int32_t *splitter_rank_p,  // r-length array of known splitter ranks
    int32_t n, int32_t p)
{
    uint64_t mk;
    int32_t plx = blockIdx.x * blockDim.x + threadIdx.x;
    while (plx < n) {
        mk = rankowner [plx];
        rank [plx] = splitter_rank_p [xownr (mk)] - xrank (mk);
        plx += p;
    }
}
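For completeness, a sketch of how this striding kernel might be launched from the host follows. The grid size q = 32 and the 256 threads per block match the configurations used in the experiments later in this chapter, but the launch parameters and the _d device-pointer names are assumptions.

// Sketch: launching RS5 with p = q * 256 striding threads over n elements.
int q = 32, tpb = 256;
int p = q * tpb;   // total number of CUDA threads used for striding
CU_RS_LR_64b_AggregateLocalRanks<<<q, tpb>>>(rankowner_d, rank_d,
                                             splitter_rank_p_d, n, p);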

Each kernel reads and writes different amounts of data, and the amounts also differ between the 48-bit and 64-bit implementations of the Random Splitter-based list ranking algorithm for CUDA. The two implementations also use different numbers of memory access transactions to do their work. Table 5.1 shows the number of bits read and written per relevant kernel as well as the number of transactions used to read and write per relevant kernel; this information is referenced later in this chapter when the results of the experiments using these implementations are analyzed.

                                       64-bit Impl                  48-bit Impl
                               Rbits  Rtr  Wbits  Wtr       Rbits  Rtr  Wbits  Wtr
RS2 Pick Random Splitters          0    0    96r    r           0    0    48r    r
RS3 Count Splitter Sub-Lists     96n   2n    64n    n         48n   2n    48n   2n
RS5 Aggregate Local Ranks        96n   2n    32n    n         80n   3n    32n    n

Table 5.1: CUDA Random Splitter Implementations, Read/Write Bits & Transactions per Kernel
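As a concrete illustration of what these totals mean at the scale of the experiments, consider the largest lists used later in this chapter, n = 64M elements (taking M = 10^6 and GB = 10^9 bytes for this rough estimate). For kernel RS3, the 64-bit implementation reads and writes 96n + 64n = 160n bits, i.e. 160 x 64 x 10^6 / 8 bytes ≈ 1.28 GB, using 3n ≈ 192 million transactions, while the 48-bit implementation accesses 48n + 48n = 96n bits ≈ 0.77 GB but needs 4n ≈ 256 million transactions. These are the magnitudes that appear on the axes of Figure 5.7.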

5.1.4 Parallel CPU Random Splitter List Ranking

To assist and simplify the multi-threaded CPU implementations of parallel algorithms, a generic concurrent API was defined and implemented using the POSIX-based threading library (viz. pthreads) for multi-threading, as outlined below; a minimal usage sketch follows the list.

• A parallel execution context is defined by the opaque handle type HPAR.

• A context based on a specific number of threads can be created by calling the function bool par_create(int nthreads, HPAR *hpar). On success it returns true and stores the context handle into the supplied pointer.

• A parallel function can be executed concurrently on all the threads associated with a context by calling bool par_for_p(HPAR hpar, void *data, par_for_fun_t func). This function executes the supplied function on each thread concurrently, returning only when all the threads have finished their work.

• Barrier synchronization can be achieved CUDA-style by breaking up a parallel implementation into separate kernel functions and invoking each one successively using par_for_p().

• Barrier synchronization can also be achieved within the execution of the parallel function by calling void par_for_p_sync(HPAR hpar). This function causes all threads to reach the barrier before proceeding past the call.

• When finished, the application can clean up the context and release the underlying resources by calling void par_destroy(HPAR hpar).
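The following sketch shows how these calls fit together. The header name par.h and the exact signature of par_for_fun_t are assumptions (inferred from Listings 5.11 and 6.9, where each thread receives its index, the thread count and a data pointer); only the functions listed above are taken from the API description.

#include <stdio.h>
#include "par.h"            /* assumed header exposing HPAR and the API above */

typedef struct { HPAR hpar; } ctx_t;   /* user data handed to every thread    */

/* Assumed signature for par_for_fun_t: thread index, thread count, data.     */
static void hello_fun (int plx, int p, void *data)
{
    ctx_t *ctx = (ctx_t *) data;
    printf ("thread %d of %d: phase 1\n", plx, p);
    par_for_p_sync (ctx->hpar);        /* barrier: all threads meet here       */
    printf ("thread %d of %d: phase 2\n", plx, p);
}

int main (void)
{
    HPAR hpar;
    ctx_t ctx;
    if (!par_create (4, &hpar))        /* context with 4 worker threads        */
        return 1;
    ctx.hpar = hpar;
    par_for_p (hpar, &ctx, hello_fun); /* run hello_fun on all 4 threads       */
    par_destroy (hpar);                /* release the underlying resources     */
    return 0;
}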

The CPU multi-threaded implementation is shown in Listing 5.11. While it continues to use the same steps that were implemented as CUDA kernels, the use of CPU multi-threading results in several differences in the implementation details. Since the concurrency API mentioned earlier allows us to do barrier synchronization from within the threads, the implementation is written as a single parallel function (unlike the CUDA implementation). The five kernels defined in Algorithm 4.3 on page 35 are implemented within this single function, separated by calls to par_for_p_sync() to ensure synchronization. There is no barrier synchronization between RS1 and RS2 in the CPU multi-threaded implementation. This is correct because each thread's selected ruler index lies within the portion of the owner array initialized by that thread, and picking the random ruler does not rely on the initialization of the other threads' partitions of the data. In fact the same is true for the CUDA implementation; however, we chose to use a CUDA-specific API to initialize the data and as a result gained an implicit barrier between the two steps. If we had chosen to perform the initialization within the random splitter selection kernel, we would have exactly the same situation as in the CPU implementation.

Listing 5.11 Parallel Random Splitter List Ranking, Multi-threaded C Implementation.

void MTH_RS_LR_par_fun (int plx, int p, CALL_t *call)
{
    register int32_t *succ = call->succ;
    register uint16_t *rsucc = call->ruler_succ, *owner = call->owner;
    register int32_t *rrank = call->ruler_rank, *rank = call->rank;
    int32_t workN, workIx0, rem, ruler;
    /* compute the partitions for each thread */
    rem = call->n % p;
    workN = call->n / p;          // start with simple division of work
    workIx0 = plx * workN;        // and use that to index start of work load
    workN += (plx < rem);         // add 1 to each node whose plx < rem
    workIx0 += min (plx, rem);    // add lesser of plx or rem to start
    /* RS1, RS2 - Init & Pick Random Rulers */
    memset (owner + workIx0, 0xffff, workN * sizeof (uint16_t));
    ruler = (workIx0 == 0) ? 0
          : workIx0 + (call->th [plx].urnd * (workN - 1));
    owner [ruler] = plx;
    par_for_p_sync (call->hpar);  // BARRIER
    /* RS3 - Count Rulers' Sub-Lists */
    int32_t dist = 1, prevj = ruler, j = succ [ruler];
    rank [ruler] = 0;             // since separate array, ruler's rank = 0
    while (owner [j] == 0xffff) {
        owner [j] = plx;
        rank [j] = dist ++;
        j = succ [prevj = j];
    }
    dist -= (prevj == j);         // reduce dist by 1 if we see the last node
    // set up the p-length ruler linked list, which we need to rank first
    rsucc [plx] = owner [j];      // plx of the next ruler
    rrank [plx] = dist;           // and distance to ruler
    par_for_p_sync (call->hpar);  // BARRIER
    /* RS4 - Rank the Ruler List */
    if (plx == 0) {               // use seq ranking with one thread
        register int32_t tmp, sum = call->n - 1;
        j = 0;
        while (rsucc [j] != j) {
            tmp = rrank [j];
            rrank [j] = sum;
            sum -= tmp;
            j = rsucc [j];
        }
        rrank [j] = sum;
    }
    par_for_p_sync (call->hpar);  // BARRIER
    /* RS5 - Aggregate Local Ranks with Rulers */
    for (j = workIx0; j < workIx0 + workN; j++)
        rank [j] = rrank [owner [j]] - rank [j];
}

This implementation uses a 16-bit owner array, which limits the number of threads that can be used to run it to 65536. Since the number of cores in a multi-core CPU is typically well below 100, this does not limit the usefulness of the implementation. Furthermore, as discussed in section 4.1.4 on page 34, the algorithm works best with substantially fewer splitters (and thus threads) than elements in the list. Under the same restriction that the number of cores available in a multi-core CPU is quite small, the implementation uses sequential list ranking to rank the list of rulers (see RS4 in the implementation); the size of the ruler list does not warrant the overhead of parallel pointer jumping.

5.2 Run Time Comparisons

The CPU implementations were executed on a system with a quad-core CPU while the CUDA implementations were executed using a nVIDIA GTX 260 GPU installed on a quad-core CPU host system (see Appendix A on page 97 for details).

5.2.1 Comparing All Implementations

Figure 5.1: Comparing All List-Ranking Implementations (run time in ms vs. list size; curves: GPU Parallel Random Splitters 48-bit I/O, GPU Parallel Random Splitters 64-bit I/O, GPU Wylie, CPU/Mth Parallel Random Splitters, CPU Serial).

The graph in Figure 5.1 shows a comparison of the runtime performance of all the list ranking implementations described earlier in section 5.1. By comparing the curves for the different implementations we can see that the CUDA implementations out-perform the CPU implementations by a significant margin. This is especially relevant since even the parallel pointer jumping list ranking implementation in CUDA, which does O (n log n) work, is substantially faster than the best sequential list ranking CPU implementation, which does O (n) work. This is a clear demonstration of the advantage of the GPU's parallel computation.

The margin of improvement between the single-core sequential CPU implementation and the 4-core parallel CPU implementation is quite small, less than 25%. Since both implementations perform O (n) work, any gains from using the extra CPU cores appear to have been negated by the overhead of dealing with the randomized memory accesses of the experimental data and the resulting memory cache conflicts within the cores. The large error bars on the parallel CPU implementation's curve are indicative of the variability in performance between the different same-size lists.

Among the CUDA implementations, as expected, the random splitter-based list ranking implementations, both the 64-bit and 48-bit versions, clearly out-perform the pointer jumping algorithm. This is expected, as the random-splitter-based algorithm does about as much work in its entirety as the first iteration of the pointer-jumping algorithm.

5.2.2 Comparing Scaling Character of the CUDA Implementations

The graph in Figure 5.2 looks at the CUDA-based list ranking algorithms only, expressing the run-time performance per linked list element. The x-axis represents the size of the linked lists, n, used for this experiment; the y-axis is time(n)/n. The curves illustrate some important points about the performance of the CUDA algorithms. We know from the algorithm itself that the pointer jumping implementation does O (n log n) work. This is clearly shown by its curve, which exhibits a logarithmic scaling character when we look at the run-time per element, i.e. time(n)/n = O (n log n) / n ≈ O (log n). Similarly, as expected from the random splitter based implementations' O (n) work-load, the curves for both of them show a flat scaling character, i.e. time(n)/n = O (n) / n ≈ O (1), indicative of unit work per element regardless of the size of the list. The random splitter based implementations show a curious inflection in their run time curves as the data size increases, which is analyzed in greater detail in section 5.2.6.

Figure 5.2: Comparing Scaling Character of the CUDA-based List Ranking Implementations (run time per element vs. n in millions; curves: GPU Parallel Random Splitters 48-bit I/O, GPU Parallel Random Splitters 64-bit I/O, GPU Wylie, Rehman et al GPU List Ranking).

In addition to our three CUDA implementation curves, the graph includes a curve based on published data [14] by Rehman et al, who claim to have an optimal parallel list ranking algorithm for CUDA. Since we have not been able to obtain details regarding their implementation, we have not implemented their algorithm and instead use their published results in the graph. Their results include data points on run-time performance for a subset of the linked list sizes used in our experiments. As shown in the graph, while their algorithm does appear to out-perform all three of our CUDA implementations for n < 4M elements, with larger linked lists its performance exhibits a logarithmic scaling character, indicating that their recursive algorithm is in fact an O (n log n) algorithm. Clearly the random splitter-based list ranking algorithm is the best performer.

5.2.3 Comparing Relative Speedup of the CUDA Implementations

The graph in Figure 5.3 looks at the CUDA-based list ranking implementations only, expressing the relative speedup of the three CUDA implementations as we increase the number of thread-blocks used to rank lists of size n = 64M. The x-axis represents the number of thread-blocks, q, used for this experiment; the y-axis is the speedup relative to the run-time performance when q = 1. Thus a speedup of 1 represents the performance when q = 1.

Figure 5.3: Relative Speed-up Over Multiple CUDA Thread-Blocks (speedup vs. q, blocks of 256 threads).

The curves are used to study the speedup that can be achieved in a CUDA system when using an increasing number of thread-blocks. As described in section 3.3, in a CUDA system each thread-block is executed by one SM regardless of the number of threads per block. Thus increasing the number of thread-blocks implies increasing use of SMs, which in turn should be reflected in the speedup curve. The curves show that regardless of the number of thread-blocks used, and regardless of the algorithm we've implemented, the GTX 260 GPU cannot offer more than a 4x speedup relative to q = 1 when ranking a randomized linked list. The reduced speedup is a direct result of the overhead of accessing the randomly organized links in the linked lists, the type of work that dominates all three implementations.

CUDA focuses on hiding memory access latency, and this shows when we compare the 64-bit and 48-bit implementations of the random splitter-based algorithm. The 48-bit implementation, which accesses fewer bytes of memory but requires more access transactions due to the separate owner and rank arrays, sees better speedup than the 64-bit implementation, which accesses more memory but uses fewer transactions. This is not as counter-intuitive as it seems. The separate owner and rank values take up less space individually, and thus a single coalesced operation yields more elements (e.g. twice as many 32-bit values fit in a coalesced transaction as 64-bit values). In addition, the extra transactions that result from not packing the two values allow CUDA to perform more latency hiding between transactions. It is worth noting that the worst speedup is for the pointer-jumping implementation, which accesses substantially more memory and generates a much higher number of transactions than either of the other two implementations.

5.2.4 Comparing Relative Speedup of CUDA Random Splitter Implementation Kernels

Figure 5.4: Relative Speed-up of Random Splitter List Ranking Kernels over Multiple CUDA Thread-Blocks (speedup vs. q, blocks of 256 threads; curves: CU_RS_LR_64b_PickRandomSplitters (RS2), CU_RS_LR_64b_CountSplitterSubLists (RS3), CU_PJ_LR_Simple (RS4), CU_RS_LR_64b_AggregateLocalRanks (RS5)).

The graph in Figure 5.4 looks at the relative speedup of the individual kernels in the CUDA Random Splitter-based list-ranking implementation. When we looked at the relative speedup across the CUDA list-ranking algorithms (see section 5.2.3) it was apparent that overall speedup did not reflect the increase in the number of thread-blocks (and thus physical SMs). To better understand the distribution of speedup within the implementation, we look at the performance of the kernels separately.

• RS2, the random splitter selection kernel (see Listing 5.7), shows no relative speedup increase with q, which is expected since each of the p threads (acting for each of the p splitters) does O (1) work. Even as p increases with q (remember that we use 256 threads for each thread-block, thus p = 256q), every thread does the same amount of work in parallel.

• RS3, the per-splitter sub-list counting kernel (see Listing 5.8), shows a relative speedup curve that matches that of the overall algorithm. As we've discussed before, this kernel does the most time-consuming work and as a result it dominates the performance of the entire algorithm.

• RS4, the splitter-list ranking kernel (see Listing 5.9), shows a declining speedup that reflects the nature of the pointer jumping algorithm as affected by the increasing number p of splitters that need to be ranked. This kernel is run using one thread-block regardless of q, and thus the relative speedup curve reflects the logarithmic factor in the O (p lg p) work done by the kernel.

5.2.5 Comparing Performance of CUDA Random Splitter Implementation Kernels

The two graphs in Figure 5.5 show the performance of the individual kernels in the Random Splitter-based list-ranking implementations. These graphs are based on an experiment where we increase the list size n while always using the same number of thread-blocks, q = 32. By looking at these curves next to each other we see an interesting difference in the performance of kernel RS2 between the two implementations. The better performance of the 48-bit implementation's RS2 kernel is the result of how much data is written by the kernel. In section 5.1.3 we pointed out that the 48-bit implementation uses two separate arrays, owner[] and rank[], whose elements are 16 bits and 32 bits wide respectively, while the 64-bit implementation uses a single rankowner[] array with 64-bit wide elements. Since the purpose of the RS2 kernel is to identify the splitters and update their owner values, one implementation only needs to update the 16-bit owner[] array while the other needs to write 64-bit packed values into the rankowner[] array.

As discussed earlier, kernel RS3 (see Listing 5.8) is the most labour-intensive and dominates the run-time of the algorithm. There are two differences between the 48-bit and 64-bit implementations of kernel RS3:

• In the 48-bit implementation, one read operation is needed from the 16-bit owner[], another read from the 32-bit succ[], and two write operations are needed to the 16-bit owner[] and 32-bit rank[]. Thus 96 bits are read and written using 4 operations for each of the n elements in the list.

Figure 5.5: Performance of Random Splitter List Ranking Kernels. Two panels, 64-bit Implementation and 48-bit Implementation, plotting run time vs. n (millions) for the kernels PickRandomSplitters (RS2), CountSplitterSubLists (RS3), CU_PJ_LR_Simple (RS4) and AggregateLocalRanks (RS5).

• In the 64-bit implementation, one 64-bit read operation is needed from rankowner[], another read from the 32-bit succ[], and one write operation is needed to rankowner[]. Thus 160 bits are read and written using 3 operations for each of the n elements in the list.

Kernel RS4, the simplified pointer jumping-based ranking of the splitter list, is identical in performance between the two implementations. This is as expected since there is no difference in the data size or the number of splitters between the two implementations. While it is difficult to tell by looking at the curves, kernel RS5 is very slightly faster in the 64-bit implementation. This negligible difference is the result of a trade-off between different overheads in the two implementations. The 48-bit implementation needs to read both the owner[] and rank[] arrays, thus accessing 48 bits using 2 transactions. The 64-bit implementation needs to read the rankowner[] array once, thus accessing 64 bits but using 1 transaction. The extra 16 bits do not make enough of a difference compared to the impact of the reduced number of transactions.

5.2.6 The Inflection in the CUDA Random Splitter Implementation

Figure 5.6: Inflection in Parallel Random Splitter List Ranking for CUDA. Two panels, Overall Performance and Kernel RS3 Performance, plotting run time vs. n (millions, 32-64M) for the 48-bit and 64-bit CUDA random splitter implementations.

The graphs in Figure 5.6 zoom in on the performance curves of the two random splitter-based implementations (i.e. 48-bit and 64-bit) for CUDA. Side by side we have the overall performance curves and the specific RS3 kernel performance curves. From our study of the performance differences between all the kernels of the two implementations (see section 5.2.5) we noted that only kernels RS2 and RS3 showed notable differences between the two implementations, and that only kernel RS3 showed an inflection in its performance. It can also be seen clearly when looking at Figure 5.5 that only kernel RS3 exhibits an inflection in both implementations.

Figure 5.7: Impact of Amount of Memory Accessed vs Number of Access Transactions on Random Splitter List Ranking Kernel RS3. Four panels for the 48-bit and 64-bit CountSplitterSubLists kernels: number of transactions (millions) and time per transaction versus list size (millions), and memory accessed (GB) and time per GB versus list size (millions).

RS3 is the only kernel that examines every element in the list by walking along the successor links. Since these experiments were run using randomly distributed linked lists, we can assume that each element access will, with very high probability, result in a separate un-coalesced transaction. Thus for the n elements in the list, every access to read or write the owner, rank and successor arrays results in an individual transaction for each array index. Earlier we noted that the 48-bit implementation uses 4n operations to access 96n bits while the 64-bit implementation uses 3n operations to access 160n bits. With the lack of coalescing this means that every operation is in fact a single transaction in the CUDA memory access network.

Figure 5.7 shows an analysis of the inflection by memory access and transaction count, separated out to better illustrate the different effects of both on the two implementations. The top row of graphs looks at the number of transactions and the time per transaction as the number of transactions increases with list size. The bottom row looks at the amount of memory accessed (in GB) and the time per GB as the memory accessed increases with list size. From a transactions point of view, the 64-bit implementation's run-time inflection occurs while processing fewer transactions than the 48-bit implementation. From a memory point of view, however, the 64-bit implementation's run-time inflection occurs while accessing more memory. The graphs show that the difference between the two implementations' time-per-transaction prior to the inflection is quite small while the difference between the time-per-GB accessed is quite large, reinforcing the benefit of accessing more memory with fewer transactions. Since the kernel RS3 implementations are otherwise identical, two conclusions can be made about the impact of memory access on performance.

• Accessing more bits has a penalty, which we can see in the kernel RS3 graph (right-side) where, after the 48-bit implementation encounters the inflection, the 64-bit implementation is only slightly slower.

• Using more access operations, and consequently more transactions, has a much bigger penalty than accessing more bits, which can be seen in the same graph where, before the 48-bit inflection, the 64-bit kernel implementation substantially outperforms the 48-bit implementation.

5.2.7 Memory Access Performance of CUDA Random Splitter Implementation

The graphs in Figure 5.8 show the memory access rate (in GB per second) for the two kernels that walk the entire list: kernels RS3 and RS5. We compare the two because RS3 results in a transaction per memory access whereas RS5 is designed to maximize coalescing (which is borne out by its performing substantially faster than RS3). The y-axes represent memory access throughput in GB/s for increasing list sizes. From Table 5.1 we know that the 48-bit implementation's RS3 accesses 96n bits (12n bytes) of memory while the 64-bit implementation accesses 160n bits (20n bytes), and that the 48-bit implementation's RS5 accesses 112n bits (14n bytes) while the 64-bit implementation accesses 128n bits (16n bytes). The y-axis values are computed by dividing the total memory accessed by each kernel by the time spent.

Figure 5.8: Comparing Memory Access Throughput of Random Splitter List Ranking Kernels RS3 and RS5. Two panels, Kernel RS3 Implementation and Kernel RS5 Implementation, plotting throughput (GB/s) vs. n (millions) for the 48-bit and 64-bit kernels.

The 64-bit implementation's RS3 kernel consistently processes just under 4 GB/s of memory accesses until the inflection point, while the 48-bit kernel processes just under 2 GB/s. The RS5 kernels, however, show nearly an order of magnitude increase in throughput relative to their RS3 counterparts, with the 64-bit kernel consistently processing ~31 GB/s and the 48-bit kernel ~27 GB/s. We know from the specification information for the NVIDIA GTX 260¹ that the GPU has a memory access bandwidth of 111 GB/s. Clearly, none of the kernels perform close to the expected bandwidth of the GPU.

According to Volkov and Demmel [21, sec. 3.5], the NVIDIA GPUs that they analyzed achieve approximately ten times less than the bandwidth capacity when copying memory with a stride > 10, and approximately a hundred times less with a stride in the order of 1000. Our experiments rank linked lists with millions of elements and a random link distribution; thus each element's neighbour is, with high probability, millions of slots away in the succ[] array. While Volkov and Demmel's measurements used even strides between elements, our linked list has the added complexity of an unpredictable distance from each element to its neighbour, providing even less opportunity for optimal access. This easily explains why the kernel RS3 implementations' memory access performance is 100 times less than the expected 111 GB/s for the GTX 260.

Explaining why the kernel RS5 implementations also perform well below the expected bandwidth requires a deeper look at the algorithm. The clear advantage that RS5 has over RS3 is that it maximizes coalesced memory access both when reading each element's owner and local rank and when writing each element's global rank. To compute the global rank, however, it needs to use each element's owner index to access the splitter's rank. While there are only r splitter ranks, the owner indices encountered by consecutive threads are effectively random, so these reads cannot be coalesced; the modified kernel in Listing 5.12 was used to measure the cost of exactly this access pattern (see Figure 5.9).

¹ http://www.nvidia.com/object/product_geforce_gtx_260_us.html

Figure 5.9: Comparing Memory Access Throughput of Random Splitter List Ranking Kernel RS5 with the splitter-access-coalescing modified RS5. Two panels, Kernel RS5 Implementation and Kernel RS5 Coalesced (Incorrect) Splitter Implementation, plotting throughput (GB/s) vs. n (millions) for the 48-bit and 64-bit kernels.

Listing 5.12 Parallel Rank Aggregation Kernel, Modified to use Coalesced Splitter Rank Access, in C/CUDA

__global__ void CU_RS_LR_64b_AggregateLocalRanks_CoalsdSplitRanks (
    uint64_t *rankowner, uint32_t *rank, int32_t *splitter_rank_p,
    int32_t n, int32_t p)
{
    int32_t plx, tIx;  // processor (thread) index
    uint64_t mk;

    tIx = plx = blockIdx.x * blockDim.x + threadIdx.x;
    while (plx < n) {
        mk = rankowner [plx];
        // coalesced splitter access; INTENTIONALLY WRONG
        rank [plx] = splitter_rank_p [tIx] + xownr (mk) + xrank (mk);
        plx += p;
    }
}

5.3 Conclusions

It is clearly possible to implement a parallel list ranking algorithm that does O (n) work using the CUDA platform. The abilities of the CUDA platform to provide a multi-processing environment where millions of threads can be mapped to hundreds of physical cores, to coalesce parallel memory accesses at sequential addresses, and to hide memory access latency make it an ideal platform for gaining the benefits of highly parallel algorithms.

Parallel list ranking using random splitters implemented for CUDA definitely outperforms (see Figures 5.1 and 5.2) the other algorithms implemented for CUDA (i.e. Wylie's in 5.1.2 and Rehman et al's in [14]) as well as the sequential algorithm on a CPU (see 5.1.1) and the random splitter-based algorithm using multiple CPU threads (see 5.1.4).

The experiments run using randomly distributed linked lists, the worst case data for these algorithms, show that not being able to leverage the CUDA platform's ability to coalesce memory accesses can have a substantial negative impact on performance compared to when coalescing can be used (see Figure 5.8). As a result we see that the GPU is definitely under-utilized when kernel RS3 spends most of its time accessing memory. Any increase in memory access performance will have a direct and meaningful impact on the performance of this kernel and thus of the whole algorithm. When accessing a lot of memory in such a way that coalescing cannot be leveraged, we also find that, with enough data, the performance of the GPU degrades with a clear inflection point in the performance curves (see 5.2.6). Reducing the number of transactions, even at the cost of increasing the amount of memory accessed, allows us to defer the point at which performance starts to degrade (see Figure 5.7). Clearly there is a limit to the number of memory access transactions that can be pending in the system before performance is impacted.

List ranking is one of the simplest problems that exhibit highly irregular data access patterns. Yet even with an optimal algorithm implemented on a parallel computing platform with many capabilities designed to maximize parallel memory access and computational performance, and on which the algorithm outperforms other implementations by a substantial margin (nearly 10x faster than the multi-threaded implementation on a CPU), we find that the platform's potential is under-utilized due to the memory access transaction overload.

Chapter 6

Connected Component Counting and CUDA

We now turn our attention to implementing and analyzing parallel graph connected component computation on a GPU. Consider a graph G with n vertices and m edges represented as a list of edges E such that each edge E [i] = (a, b) is composed of two vertices a and b. Given such an edge representation we want to identify and count the connected components of the graph. In this chapter we present the details of the implementation of the different connected components algorithms listed below, after which we examine the results comparing the performance differences between the algorithms. In addition to the sequential implementation, we study our GPU implementation (see section 4.2.3 for a review) of the CRCW-PRAM algorithm by Shiloach and Vishkin [17, 16], which computes connected components in O (log n) time with O ((n + m) log n) work using O (n + m) processors (see section 4.2.4). The algorithm assumes that PRAM concurrent writes are implemented using the "arbitrary" model. Algorithm 4.5 shows an outline of our GPU implementation of Shiloach and Vishkin's CRCW-PRAM algorithm. We also implement our adaptation for the CPU using multi-threading, so that we can compare how well it performs using all the cores of a CPU, unlike the sequential implementation which can only use one core.

6.1 Implementation Details

6.1.1 Sequential CPU Connected Component Counting

The sequential implementation of counting connected components is implemented recursively. For each vertex we recursively traverse the links, marking each visited node, and we assign a new label each time we start the traversal at an unvisited vertex. This implementation, however, relies on a vertex-list-based representation of the graph, while all the parallel algorithms discussed here process edge-list-based graphs. As a result we need an additional routine that converts a graph from an edge-list into a vertex-list for use with the sequential implementation. Listing 6.1 shows the host function, which first invokes the conversion and then starts the sequential counting using a recursive labeling function.

Listing 6.1 Sequential Connected Component Counting Implementation - Main

uint32_t SQ_CCC_countComponents (edge_graph_t *eG) {
    vertex_graph_t vG;
    uint32_t label, i;

    convertEdgeG (eG, &vG);                      // build vertex-list representation
    // for each vertex, depth first scan and label
    for (label = i = 0; i < vG.numV; i++)
        if (vG.V [i].cclabel == 0)               // unlabelled; so new component
            recurseLabel (&vG.V [i], ++label);
    return label;
}

static void recurseLabel (vnode_t *vnode, const uint32_t label) {
    if (vnode->cclabel != label) {               // if not visited
        vnode->cclabel = label;                  // label self
        for (int i = 0; i < vnode->degree; i++)  // visit neighbours
            recurseLabel (vnode->edges [i], label);
    }
}

We note here that when comparing the runtime performance we include both the conversion time and the recursive labeling. This is done because the parallel algorithms all operate on an edge-list representation of the graph, and we want a fair comparison in which all implementations take an edge-list as input and thus any processing of that edge-list is included. This has the net effect of changing the O (n) vertex-representation-based sequential labeling algorithm into, at best, an O (n + m) algorithm due to the overhead of processing the m edges to generate the vertex representation.

6.1.2 Parallel CUDA Shiloach-Vishkin's Connected Component Counting

Algorithm 4.5 shows an outline of our GPU implementation of Shiloach and Vishkin's CRCW-PRAM algorithm. We start by outlining some of the important aspects of our GPU implementation, after which the CUDA code segments for the kernels listed in Algorithm 4.5 are presented and analyzed, including an additional kernel for counting the reported components through a parallel sum operation.

Use Of Striding And Concurrent Writes

The PRAM algorithm requires m + n processors, one for each edge and vertex. As discussed in Section 3.8, a GPU can execute a large number of concurrent threads, but not as many as one thread per data item. In order to improve performance, our CUDA code implements the striding access pattern outlined in Section 3.8 using p < min (m, n) threads. In addition, Shiloach and Vishkin's CRCW-PRAM algorithm requires concurrent write operations, which is in principle no problem for a GPU as discussed in Section 3.7. However, when both striding and concurrent writes are implemented together, special care must be taken to ensure that the correct semantics are maintained and race conditions avoided. Step 5 of Shiloach and Vishkin's algorithm ([17, p. 60]) checks whether the previous execution of Steps 1 through 4 resulted in any changes. In our GPU implementation, this too is implemented through a combination of striding and concurrent writes. Each thread first checks whether any of its items have changed, and then all those threads that detected a change attempt to update a global variable w, thus implementing a parallel "OR" through a concurrent write.

Memory Access Optimization

As outlined in Section 3.6.1, the GPU's hardware optimizes the use of bandwidth to global memory through coalescing of memory accesses. Our GPU implementation of Shiloach and Vishkin's CRCW-PRAM algorithm saves memory bandwidth through pre-fetching to local registers. This ensures that within each kernel the same global memory address is accessed only once, which matters because all the working data structures are by necessity maintained in device (global) memory (the slowest to access), and we want to avoid accessing them any more than absolutely necessary. The implementations of kernels SV1, SV2 and SV3 apply this optimization.

Implementation Details of the Host Function

The CUDA-based implementation of the Shiloach-Vishkin parallel algorithm is made up of multiple kernels which must be invoked by a single host function running on the CPU. Listing 6.2 illustrates the implementation of this host function.

Listing 6.2 Parallel Shiloach-Vishkin Connected Components Multi-Kernel Outline, in C/CUDA

uint32_t CU_SV_CCC_host (edge_graph_t *eG) {
    // allocate device memory
    // copy edge-list to device memory
    cudaMemcpy (..., cudaMemcpyHostToDevice);
    // initialize the D and Q arrays in device memory
    s = 0;
    CU_SV_CCC_InitOnce <<<...>>> (...);                   // SV0
    // multiple <= lg(n) iterations until work is done
    do {
        s = s + 1;                                        // level up
        CU_SV_CCC_ShortCutAndMark <<<...>>> (...);        // SV1
        CU_SV_CCC_HookEdges <<<...>>> (...);              // SV2
        CU_SV_CCC_HookStagnantRoots <<<...>>> (...);      // SV3
        work_done = 0;
        CU_SV_CCC_ShortCutAndCheckWork <<<...>>> (...);   // SV4, SV5
        // extract work done value from device memory
        cudaMemcpy (&work_done, work_d, sizeof (uint32_t),
                    cudaMemcpyDeviceToHost);
        // swap Ds_1 and Ds for next iteration
        vertex_t *tmpD = Ds_1; Ds_1 = Ds; Ds = tmpD;
    } while (work_done > 0);
    // count the roots identified, in parallel
    CU_SV_CCC_CountRoots <<< 1, 512, 512 * sizeof (uint32_t) >>> (...);
    // extract sum value from device memory
    cudaMemcpy (&count, ..., cudaMemcpyDeviceToHost);
    // free device memory
    return count;
}

Unlike the parallel list ranking implementation, the Shiloach-Vishkin algorithm performs O (log n) iterations, in each of which it repeatedly performs steps 1 through 5. The variable s represents the current step (iteration), and the sub-steps, implemented as separate CUDA kernels, are repeatedly executed until we find an iteration at the end of which no change has occurred and thus no further work is needed. The algorithm (see section 4.2.4) relies on several temporary data structures: two arrays Ds and Ds-1 that represent the state of the pointers used to construct the star graphs rooted at each component, as well as a third array Q used to track when changes have been made in Ds for a particular vertex of the graph. While in theory every iteration gets its own Ds array, the algorithm at any given time relies only on Ds and Ds-1. As a result the implementation allocates memory for only two arrays and at the end of each iteration swaps their assignment between Ds and Ds-1.

An important difference between the proposed execution of the original PRAM algorithm and our GPU implementation has to do with how the processors (i.e. CUDA threads) are allocated to the work. Shiloach and Vishkin require n processors for the steps that process all the vertices in parallel and m processors for the steps that process all the edges. Since each step must occur in order, theoretically, their algorithm requires at most m processors. Looking at each of the kernels, however, we can see that each processor, whether processing a vertex or an edge, relies only on the state left by the previous steps. This allows us to use a fixed number p < max (n, m) of processors (i.e. CUDA threads) and modify the kernels to use striding to distribute the work. Thus each kernel, regardless of whether it is processing vertices or edges, always uses p threads; and when p < n and/or p < m, each thread strides over several vertices and/or edges.

Implementation Details of the Initialization Kernel

Listing 6.3 Parallel Shiloach-Vishkin Initialization Kernel, in C/CUDA

__global__ void CU_SV_CCC_InitOnce (
    vertex_t *Ds, step_t *Q, vertex_t nV, uint32_t p)
{
    vertex_t vi = blockIdx.x * blockDim.x + threadIdx.x;
    while (vi < nV) {
        Ds [vi] = vi;
        Q [vi] = 0;
        vi += p;  // stride
    }
}

The initialization kernel (see Listing 6.3) executes once before the iterations begin, initializing D0 such that each vertex points to itself and initializing Q such that each vertex indicates no work done since level s = 0.

Implementation Details of the First Short-Cutting Kernel

The first short-cutting kernel (see Listing 6.4) looks at every vertex and tries to move its pointer one jump closer to the root of the star-graph the algorithm is trying to build. Of course, when s = 1 all the vertices point to themselves and the first attempt achieves little; however, as we progress through the remaining steps the pointers change as edge-hooking occurs (see section 4.2.4 for an example), allowing subsequent short-cutting attempts to make significant progress.

Listing 6.4 Parallel Shiloach-Vishkin Short-Cut And Mark Kernel, in C/CUDA

__global__ void CU_SV_CCC_ShortCutAndMark (
    vertex_t *Ds, vertex_t *Ds_1, step_t *Q, step_t s,
    vertex_t nV, uint32_t p)
{
    vertex_t vi, Ds_vi, Ds_1_vi;
    vi = blockIdx.x * blockDim.x + threadIdx.x;
    while (vi < nV) {
        Ds [vi] = Ds_vi = Ds_1 [Ds_1_vi = Ds_1 [vi]];  // short-cut
        if (Ds_vi != Ds_1_vi)   // check ...
            Q [Ds_vi] = s;      // and mark if changed
        vi += p;                // stride
    }
}

The kernel also checks whether each short-cutting attempt has had an effect and updates Q for the vertex to the current value of s if there has been a change. This information is used at the end of the iteration, by checking each vertex, to determine whether another iteration is needed. Here we also see the first explicit use of local variables to optimize memory access, so that values from Ds and Ds-1 are never read more than once from global memory. The CUDA compiler does not perform any memory access optimization for repeated accesses via the same pointer, leaving it to the implementor to cache an access when the same value from global memory is used multiple times. Since local variables in CUDA are actually allocated from register memory, pre-loading Ds [vi] and Ds-1 [vi] into the local variables Ds_vi and Ds_1_vi substantially improves runtime performance. This pattern can be seen in the other kernels as well.
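To make the effect of this register caching concrete, the following naive variant of the short-cut step is shown purely for contrast; it is not taken from the thesis code, and the step_t typedef is an assumption. Unlike Listing 6.4, it re-reads the same global addresses several times per element.

#include <stdint.h>
typedef uint32_t vertex_t;
typedef uint32_t step_t;   /* assumed; the thesis does not show this typedef */

// Illustration only: Ds_1[vi] and the short-cut target Ds_1[Ds_1[vi]] are
// each fetched from global memory repeatedly instead of being cached in
// registers as in Listing 6.4.
__global__ void shortcut_naive (vertex_t *Ds, vertex_t *Ds_1, step_t *Q,
                                step_t s, vertex_t nV, uint32_t p)
{
    vertex_t vi = blockIdx.x * blockDim.x + threadIdx.x;
    while (vi < nV) {
        Ds [vi] = Ds_1 [Ds_1 [vi]];                 // short-cut
        if (Ds_1 [Ds_1 [vi]] != Ds_1 [vi])          // re-reads global memory
            Q [Ds_1 [Ds_1 [vi]]] = s;               // and again for the mark
        vi += p;                                    // stride
    }
}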

Implementation Details of the Edge-Hooking Kernel

Since this algorithm processes an edge-list representation of the graph we represent the entire graph as an array of edges, where each edge is a 64-bit value composed of two 32-bit values that each represent a vertex, as shown in the code fragment below.

typedef uint32_t vertex_t;
typedef uint64_t edge_t;
#define MKEDGE(v1,v2) ((((edge_t) v1) << 32) | \
                       (((edge_t) v2) & 0xffffffff))
#define EDGEV1(e) ((vertex_t) (((edge_t) e) >> 32))
#define EDGEV2(e) ((vertex_t) (((edge_t) e) & 0xffffffff))

The macro MKEDGE(v1, v2) is used to pack the two vertices into a single 64-bit edge, and the macros EDGEV1(e) and EDGEV2(e) are used to extract the vertices from the packed edge. This packing strategy is motivated by the same performance goals that led us to consider and compare the packed (64-bit) and unpacked (48-bit) implementations of the parallel list ranking algorithm.
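A small usage example of these macros follows, continuing the code fragment above; the vertex values are arbitrary and chosen only for illustration.

edge_t e = MKEDGE (5, 9);   // pack vertices 5 and 9 into one 64-bit edge
vertex_t a = EDGEV1 (e);    // a == 5, the first endpoint
vertex_t b = EDGEV2 (e);    // b == 9, the second endpoint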

Listing 6.5 Parallel Shiloach-Vishkin Edge Hooking Kernel, in C/CUDA

__global__ void CU_SV_CCC_HookEdges (
    edge_t *E, vertex_t *Ds, vertex_t *Ds_1, step_t *Q, step_t s,
    edge_t nE, uint32_t p)
{
    edge_t e;
    vertex_t v1, v2, Ds_v1, Ds_v2;
    index_t ei = blockIdx.x * blockDim.x + threadIdx.x;
    while (ei < nE) {
        e = E [ei];
        v1 = EDGEV1 (e);  v2 = EDGEV2 (e);
        Ds_v1 = Ds [v1];  Ds_v2 = Ds [v2];
        if ((Ds_v2 < Ds_v1) && (Ds_v1 == Ds_1 [v1])) {
            Ds [Ds_v1] = Ds_v2;
            Q [Ds_v2] = s;  // mark change
        }
        ei += p;  // stride
    }
}

In the Edge Hooking kernel (see Listing 6.5) we are thus able to use a single 64-bit access operation to load each edge, reducing the total number of transactions. Note that in this kernel, if the two conditional checks succeed, we update both Ds and Q. This reflects the actual hooking operation as well as noting that a change has occurred in this iteration.

Implementation Details of the Stagnant Root Hooking Kernel

Both the edge hooking step and the stagnant root hooking step (see Listing 6.6) rely on the CRCW property of a PRAM. Shiloach and Vishkin require that all the processors considering an edge (j, k) may (under the right conditions) try to update Ds (Ds (j)) simultaneously; however, only one of them should succeed and it does not matter which [17, p. 61].

Listing 6.6 Shiloach-Vishkin Stagnant Root Hooking Kernel, in C/CUDA

__global__ void CU_SV_CCC_HookStagnantRoots (
    edge_t *E, vertex_t *Ds, step_t *Q, step_t s,
    edge_t nE, uint32_t p)
{
    edge_t e;
    vertex_t v1, v2, Ds_v1, Ds_v2;
    index_t ei = blockIdx.x * blockDim.x + threadIdx.x;
    while (ei < nE) {
        e = E [ei];
        v1 = EDGEV1 (e);  v2 = EDGEV2 (e);
        Ds_v1 = Ds [v1];  Ds_v2 = Ds [v2];
        if ((Ds_v1 != Ds_v2) && (Ds_v1 == Ds [Ds_v1])
            && (Q [Ds_v1] < s))
            Ds [Ds_v1] = Ds_v2;
        ei += p;  // stride
    }
}

The CUDA GPU platform provides exactly this guarantee (see section 3.7): when multiple concurrent writes occur to the same address, exactly one of them will succeed. This allows us to make two implementation decisions. First, when p threads execute in parallel and more than one of them writes to the same address, the GPU will ensure that only one write succeeds, so no special implementation tricks are needed. Secondly, the use of striding with p < m threads does not change this behaviour: within each stride the competing writes to a given address still resolve to exactly one winner, and since the algorithm does not care which write succeeds, correctness is preserved.

Implementation Details of the Short-Cutting and Work Counting Kernel

The second short-cut operation, in step 4 of the algorithm (see Listing 6.7), is not required for the correct operation of the algorithm; however, omitting it may double the number of iterations before the counting completes [17, p. 61]. Since the GPU is very sensitive to any extra work, especially work that involves accessing slow global memory, every extra iteration can have a significant impact on execution time. Thus implementing this step is necessary to optimize the run-time of the implementation. In the PRAM algorithm, steps 4 and 5 are executed separately. When examining the operations, however, it is clear that step 5 (which examines Q for each element to determine whether work was done during that iteration) does not depend on the modifications made to Ds by step 4, and that Q is updated (if at all) only in steps 1 and 2 of each iteration. Thus by the time we are ready to execute step 4, we can do the work of both steps within the same stride of one combined kernel, shown in Listing 6.7. This not only simplifies the implementation, it also helps its run-time, since reducing kernel calls is always a positive optimization.

Listing 6.7 Shiloach-Vishkin Short-cut And Check Work Kernel, in C/CUDA

__global__ void CU_SV_CCC_ShortCutAndCheckWork (
    vertex_t *Ds, step_t *Q, step_t s, uint32_t *work,
    vertex_t nV, uint32_t p)
{
    uint32_t w = 0;
    vertex_t vi = blockIdx.x * blockDim.x + threadIdx.x;
    while (vi < nV) {
        Ds [vi] = Ds [Ds [vi]];  // short-cut
        w += (Q [vi] == s);      // count up work
        vi += p;                 // stride
    }
    if (w > 0) *work = 1;        // parallel OR
}

At the end of each iteration, the loop's condition checks whether any work was done. The algorithm states this in a very simple way: two variables s and s' are maintained, and while s is always incremented, s' is only incremented if work was done during the iteration. The CUDA implementation of step 5 works differently but achieves the same objective. Every vertex is checked for work and a counter is updated by each thread. When a thread has finished checking its vertices, it checks whether its sum of work is non-zero and, only if it is non-zero, writes a 1 to a single address, effectively performing a parallel OR. Here again we rely on the CRCW guarantee that one write will succeed. Since we initialize the address to zero before invoking the kernel, and we only write a value of 1 if Q indicates that Ds has changed for one or more of the slots considered by a thread, we know that even if only one thread detects a change and writes the value, the address will definitely be updated; and if no work was done, no write will be attempted at all.

Implementation Details of the Root Counting Kernel

When we exit the loop after the last iteration, in which no work is done, we have completed constructing a star-graph for each component in the graph. In the pointer array Ds every vertex either points to the root of its star-tree or, if it is itself the root of the star, points to itself. Since we are also interested in counting the number of components, one more step is needed in the implementation.

Listing 6.8 Shiloach-Vishkin Root Counting Kernel, in C/CUDA

__global__ void CU_SV_CCC_CountRoots (
    vertex_t *Ds, vertex_t nV, uint32_t p, uint32_t *sum)
{
    extern __shared__ uint32_t shsums [];  // shared memory
    uint32_t locsum, work;
    vertex_t ix = threadIdx.x;
    if (blockIdx.x == 0) {  // only one block allowed to run kernel
        for (locsum = 0; ix < nV; ix += p)  // stride
            locsum += (Ds [ix] == ix);      // add 1 if root
        shsums [ix = threadIdx.x] = locsum; // store local sums
        for (work = p/2; work > 1; work /= 2) {  // sum by reduction
            if (ix < work) {
                locsum = shsums [ix] + shsums [ix + work];
                __syncthreads ();  // sync to finish reads
                shsums [ix] = locsum;
                __syncthreads ();  // sync to finish writes
            }
        }
        if (ix == 0)  // final sum, by thread 0 only
            *sum = locsum + shsums [1];
    }
}

To implement the root counting kernel (see Listing 6.8) we customize a parallel reduction-based summation algorithm that executes in a single thread-block. It computes the sum of an array of numbers by first computing p partial sums, each over n/p elements, in parallel using p threads. Then, over lg(p) iterations, iteration j uses p/2^j threads to compute p/2^(j+1) sums. Eventually, when 2^(j+1) = p, one thread computes the last pair-wise sum and yields the total root count. We chose to implement the simpler single-thread-block summation kernel¹ given the very low run-time cost of the root-counting step compared to the overall execution of the implementation.
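As a small worked example (with an artificially small p = 8 threads, purely for illustration): after the striding phase, shsums[0..7] hold the eight per-thread counts. The first reduction pass (work = 4) leaves four partial sums in shsums[0..3], the second pass (work = 2) leaves two in shsums[0..1], and thread 0 then adds its own running value (which equals shsums[0]) to shsums[1] to produce the final root count.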

We know from the algorithm that, when it completes, the pointer overlay Ds will contain a link for every vertex such that if a vertex is the root of a component it points to itself, i.e. Ds [i] = i for a vertex i that is the root of a star-graph. When programming in the C language, a comparison always produces a value of zero (failure) or one (success). We take advantage of this to compute the local sum for each thread without any if-then branching.

¹ A detailed examination of multi-thread-block summation by reduction is available at the nVIDIA developer zone for CUDA: http://developer.download.nvidia.com/compute/cuda/1_1/Website/Data-Parallel_Algorithms.html#reduction

6.1.3 Parallel CPU Shiloach-Vishkin's Connected Component Counting

The multi-threaded implementation of the Shiloach-Vishkin algorithm for execution on a CPU is shown in Listing 6.9. Like the multi-threaded parallel list ranking implementation (see section 5.1.4), this implementation uses the generic concurrent API that was developed for this project. The following points describe the key characteristics of the implementation as well as some of the differences relative to the algorithm.

Using the concurrency API's barrier synchronization from within the threads, the implementation is written as a single parallel function, with each of the kernels embedded within it and separated by a call to par_for_p_sync() to ensure synchronization. Unlike the CUDA implementation, loops within each "kernel" of the CPU implementation use partitioning instead of striding (see section 3.8 and the sketch below). This takes advantage of the way memory access caching works in CPUs, where the best performance is obtained when each core accesses sequential memory over its individual loop. There is no notion of thread-blocks when multi-threading on a CPU, so when we need to perform sequential operations we restrict the work to the thread with id 0; all the other threads skip the work and wait at the next barrier. As a result we are able to perform the iteration management within the threads (unlike CUDA, where the iterations are managed in the host CPU function), always relying on thread 0 to set up the decision variable which every thread then uses. The C compiler for the CPU performs memory access optimizations during compilation, and as a result we do not need to explicitly use local variables to cache accesses to the working data within loops.

Root counting is implemented slightly differently compared to the CUDA implementation. In parallel we still compute the partial sums, such that when the parallel function ends we have an array of sums with one value from each thread. The host function then sequentially adds up the values to determine the final sum. Since a CPU does not offer more than a handful of cores (compared to the millions of threads possible with a GPU), we only have a handful of local sums to combine, and performing this part of the root counting sequentially has no impact on the run-time of the overall CPU implementation.
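The difference between the two work-distribution patterns can be sketched as follows. This is an illustration only; the function and type names are invented, and the partition arithmetic mirrors the one used in Listing 6.9 (last thread takes the remainder).

/* Illustrative sketch: striding vs. partitioning work distribution. */
typedef unsigned int index_t;

/* Striding (CUDA kernels): thread plx of p touches plx, plx+p, plx+2p, ... */
static void process_striding (index_t plx, index_t p, index_t n)
{
    for (index_t i = plx; i < n; i += p) {
        /* ... process element i ... */
    }
}

/* Partitioning (CPU threads): thread plx touches one contiguous block. */
static void process_partitioned (index_t plx, index_t p, index_t n)
{
    index_t size  = n / p;
    index_t start = plx * size;
    if (plx == p - 1)               /* last thread also takes the remainder */
        size = n - start;
    for (index_t i = start; i < start + size; i++) {
        /* ... process element i ... */
    }
}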

Listing 6.9 Parallel Shiloach-Vishkin Connected Component Counting, Multi-threaded C Implementation.

 1  void MTH_SV_CCC_par_fun (int i, int p, CALL_t *call) {
 2    graph_t *G = call->G;
 3    index_t vsz, esz, vi0, ei0, vi, ei, work;
 4    vertex_t *tmpD, *Ds = G->Ds, *Ds_1 = G->Ds_1;
 5    // compute partition index and size...
 6    vsz = G->eG.numV / p;  vi0 = i * vsz;
 7    if (i == p-1) vsz = G->eG.numV - ((p-1) * vsz);
 8    esz = G->eG.numE / p;  ei0 = i * esz;
 9    if (i == p-1) esz = G->eG.numE - ((p-1) * esz);
10    for (vi = vi0; vi < vi0 + vsz; vi++) {      // SV0: init Q and Ds_1
11      G->Q[vi] = 0;  Ds_1[vi] = vi;
12    }
13    par_for_p_sync (call->hpar);                // barrier sync
14    while (1) {
15      for (vi = vi0; vi < vi0 + vsz; vi++) {    // SV1: short-cut, check, mark
16        Ds[vi] = Ds_1[Ds_1[vi]];
17        if (Ds[vi] != Ds_1[vi]) G->Q[Ds[vi]] = call->s;
18      }
19      par_for_p_sync (call->hpar);              // barrier sync
20      for (ei = ei0; ei < ei0 + esz; ei++) {    // SV2: hook edges
21        edge_t e = G->eG.E[ei];
22        vertex_t v1 = EDGEV1 (e), v2 = EDGEV2 (e);
23        if ((Ds[v2] < Ds[v1]) && (Ds[v1] == Ds_1[v1])) {
24          Ds[Ds[v1]] = Ds[v2];
25          G->Q[Ds[v2]] = call->s;
26        }
27      }
28      par_for_p_sync (call->hpar);              // barrier sync
29      for (ei = ei0; ei < ei0 + esz; ei++) {    // SV3: hook stagnant roots
30        edge_t e = G->eG.E[ei];
31        vertex_t v1 = EDGEV1 (e), v2 = EDGEV2 (e);
32        if ((Ds[v1] != Ds[v2]) && (Ds[v1] == Ds[Ds[v1]])
33            && (G->Q[Ds[v1]] < call->s))
34          Ds[Ds[v1]] = Ds[v2];
35      }
36      par_for_p_sync (call->hpar);              // barrier sync
37      for (work = 0, vi = vi0; vi < vi0 + vsz; vi++) {
38        Ds[vi] = Ds[Ds[vi]];                    // SV4: short-cut, again
39        work += (G->Q[vi] == call->s);          // SV5: count work done
40      }
41      call->work[i] = work;
42      par_for_p_sync (call->hpar);              // barrier sync
43      /* decide to iterate or not. */
44      tmpD = Ds;  Ds = Ds_1;  Ds_1 = tmpD;      // swap Ds and Ds_1
45      if (i == 0) {                             // one thread sums the work of all threads
46        for (work = 0, vi = 0; vi < p; vi++) work += call->work[vi];
47        call->work[0] = work;  call->s++;
48      }
49      par_for_p_sync (call->hpar);              // barrier sync
50      if (call->work[0] == 0) break;            // done if no more work.
51    }
52    uint32_t locsum = 0;                        // count the roots
53    for (vi = vi0; vi < vi0 + vsz; vi++) { locsum += (Ds_1[vi] == vi); }
54    call->work[i] = locsum;
55  }

6.2 Run Time Comparison For Different Types Of Graphs: Lists, Trees, and Random Graphs

The CPU implementations were executed on a system with a quad-core CPU, while the CUDA implementations were executed using an nVIDIA GTX 260 GPU installed in a quad-core CPU host system (see Appendix A for details). The run-time performance of the connected component counting implementations was analyzed by running experiments using four different types of graphs.

1. Graph made up of one or more linked lists.

2. Graph made up of one or more fixed-degree trees.

3. Graph made up of one or more randomly generated sub-graphs, with edge density δ = 0.001.

4. Graph made up of one or more randomly generated sub-graphs, with edge density δ = 0.01.

With respect to the random graphs, edge density refers to the number of edges m generated relative to the number of vertices n. For lists and trees, since n ≈ m/2, we simply specify the number of desired vertices to generate the graphs. However, for random graphs we compute n = √(m/(2δ)), where δ ∈ [0, 1] specifies the edge density, such that δ = 1 generates a complete graph.
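For instance, a small helper that evaluates this relation directly reproduces the vertex counts quoted in the next subsection. The function name is illustrative and not part of the thesis code.

    #include <math.h>

    /* Illustrative helper: number of vertices for a random graph with m edges
     * and edge density delta, i.e. n = sqrt(m / (2 * delta)). */
    static unsigned long vertices_for_density (unsigned long m, double delta)
    {
        return (unsigned long) sqrt ((double) m / (2.0 * delta));
    }

    /* Examples: vertices_for_density (64000000, 0.001) is approximately 178885,
     * and vertices_for_density (64000000, 0.01) is approximately 56568. */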

6.2.1 Comparing All Connected Component Counting Implementations

The graphs in Figure 6.1 and Figure 6.2 show a comparison of the run-time performance between the sequential CPU algorithm, the multi-threaded CPU version of Shiloach-Vishkin, and the CUDA version of Shiloach-Vishkin. In the case of the CUDA version, the implementation was run using q = 16 and q = 32 thread blocks. While we can clearly see that the CUDA implementations outperform the CPU implementations in all cases, we can also see that the difference is much more pronounced for list/tree graphs than for random graphs. One reason for the difference between the list/tree graphs and the random graphs is the edge distribution. As mentioned earlier, with list and tree graphs n ≈ m/2, while for random graphs n = √(m/(2δ)) where δ is the edge density. Right away we can see that for m = 64M edges the list and tree graphs have n ≈ 32M vertices while the random graphs have n ≈ 178K and n ≈ 56K vertices for δ = 0.001 and δ = 0.01 respectively.

Figure 6.1: Run Time Comparison of the Connected Component Counting Implementations for List and Tree Graphs. (Two panels: (a) List Graph, (b) Tree Graph; run time in ms versus m, millions of edges, for CPU Sequential CC, CPU Multi-thread ShVish CC with p = 4, and CUDA ShVish CC with q = 16 and q = 32.)

Figure 6.2: Run Time Comparison of the Connected Component Counting Implementations for Random Graphs. (Two panels: (a) Random Graph with density 0.001, (b) Random Graph with density 0.01; run time in ms versus m, millions of edges, for CPU Sequential CC, CPU Multi-thread ShVish CC with p = 4, and CUDA ShVish CC with q = 16 and q = 32.)

Thus random graphs exhibit a much higher edge density, and in general a much smaller diameter, relative to the list/tree graphs. Given the same number of total edges, the much smaller number of vertices in the random graphs significantly reduces the work done by the kernels that process every vertex. Both the sequential and parallel algorithms benefit from the extra edges in the random graphs, where each node has a significantly higher number of direct links to neighbours in the graph. In the sequential algorithm this allows all the vertices to be labelled without necessarily traversing every edge in the graph; in fact, regardless of density, it does O(n) work. In the parallel algorithms, the extra edges allow the pointer jumping to produce the star graphs in fewer iterations (shown by Table 6.1) for random graphs, even though the algorithm, by design, continues to do O(n + m) work in each iteration.

Graph Type                  CPU SV (p = 4)   CUDA SV (q = 16)   CUDA SV (q = 32)
List Graph                         6                19                 18
Tree Graph                        19                25                 21
Random Graph δ = 0.001             5                 6                  6
Random Graph δ = 0.01              5                 5                  5

Table 6.1: Comparing the No. of Iterations in the Parallel Connected Components Implementations for Different Graph Types.

Figure 6.3: No. of Shiloach-Vishkin Iterations for Different Graph Types in CUDA. (Number of rounds s plotted against degree for tree graphs and density for random graphs.)

In Table 6.1 the CPU-based multi-threaded implementation uses a small number of iterations for list graphs. Our hypothesis is that the small number of parallel threads and the simple linked list structure of the sub-graphs result in pointer jumping that collapses rapidly and with very little concurrent-write interference. When we performed a quick experiment using more threads to execute the CPU implementation we noted an increase in the number of iterations, reinforcing our hypothesis. More detailed analysis is needed in the future to better understand this behaviour. Figure 6.3 shows the actual number of CUDA iterations for various input graphs: list graphs (degree k = 1), tree graphs with degree k = 2, 3, ..., 20, and random graphs with density δ = 0.001 and δ = 0.01, all of them with m = 8,000,000 edges. We see a clear pattern where the number of iterations in the parallel algorithm execution with the GPU is much smaller for the random graphs than for list/tree graphs. This reinforces the point that an increase in edge density reduces the effort and time needed to identify the components in the parallel implementation.

6.2.2 Comparing Relative Speedup of the CUDA Implementations

Figure 6.4: Comparing Relative Speedup of Connected Component Counting CUDA Implementation. (Relative speedup of the CUDA ShVish CC implementation versus q, the number of CUDA thread blocks, for list, tree, and random graphs at 0.1% and 1.0% edge density.)

Figure 6.4 shows in more detail the performance of our GPU implementation in terms of relative speedup as a function of the number of thread blocks. This provides further evidence that the denser random graphs are processed more quickly than the sparse lists and trees. We can also see that there is a marked difference in the relative speedup when computing the connected components for list graphs versus tree graphs, with lists yielding a better speedup as we increase the number of thread-blocks. All the speedup curves show no further improvement once we use more than approximately 25 thread blocks. We noted similar behaviour when analyzing the parallel list ranking performance with the GPU, where a similar limit can be correlated to the 27 physical SMs available on our nVIDIA GeForce 260 GPU. The effect on individual kernel performance is further illustrated by Figure 6.5. Complementing the graph in Figure 6.3, the diagram shows the time per iteration spent in each kernel (or part of a kernel). The time spent in Kernels SV2 and SV3 dominates the time for the entire Algorithm 4.5. These kernels, which examine every edge in the graph, appear to perform the most work per iteration for trees, less work per iteration for lists, and the least work per iteration for random graphs.

Figure 6.5: Kernel Run-time Comparison of Shiloach-Vishkin CUDA Implementation, by Graph Types. (Per-kernel run time plotted against degree for tree graphs and density for random graphs.)

While lists and trees, when processed by the CUDA implementation, require approximately the same number of iterations, why does it take less time per iteration for lists? Observe that, inside Kernels SV2 and SV3 of Algorithm 4.5, for each edge there are 2 and 3 conditions respectively that must all succeed to enter the if-block where Ds and Q are modified. The failure of any one condition implies that no global memory write operation is performed for that edge in those kernels. With lists the conditions fail more often than with trees, thus allowing Kernels SV2 and SV3 to do less work and therefore require less time to complete. The total run time is thus the product of the number of iterations and the time per iteration, which explains why the algorithm performs better for lists than for trees, and best for random graphs.
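For reference, a CUDA rendering of the SV2 hook step might look as follows. This is a sketch, not the thesis's actual kernel: the kernel name and the types chosen for Q and s are assumptions, while the guarded-write logic mirrors lines 20-27 of the CPU code in Listing 6.9 and EDGEV1/EDGEV2 are the edge-unpacking macros used there.

    __global__ void sv2_hook_sketch (edge_t *E, index_t m,
                                     vertex_t *Ds, vertex_t *Ds_1,
                                     vertex_t *Q, vertex_t s)
    {
        index_t i      = blockIdx.x * blockDim.x + threadIdx.x;
        index_t stride = gridDim.x * blockDim.x;
        for (index_t ei = i; ei < m; ei += stride) {          // strided pass over the edges
            vertex_t v1 = EDGEV1 (E[ei]), v2 = EDGEV2 (E[ei]);
            // Both conditions must hold before any global write happens for this edge.
            if ((Ds[v2] < Ds[v1]) && (Ds[v1] == Ds_1[v1])) {
                Ds[Ds[v1]] = Ds[v2];                          // hook v1's root under v2's root
                Q[Ds[v2]]  = s;                               // record that work was done this round
            }
        }
    }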

6.3 Conclusions

In general, our GPU implementation for parallel connected component computation seems to be successful. Despite the highly irregular data access patterns which complicate memory access coalescing, our GPU implementation of Shiloach and Vishkin's algorithm appears to be considerably faster than the sequential method. Note that the sequential method requires only linear work while Shiloach and Vishkin's algorithm requires O((n + m) log n) work. Another important observation is that performance is different for different types of graphs. Random graphs are processed more quickly than lists, and lists are processed faster than trees. All speedup curves show no further improvement for more than 25 thread blocks, which reflects the number of physical SMs (27) available on our nVIDIA GeForce 260 GPU. Kernels SV0, SV1, SV4 and SV5 of Algorithm 4.5 process vertices and perform O(n) work, while Kernels SV2 and SV3 process edges and perform O(m) work. As shown in [17], the algorithm will iterate at most log_{3/2} n + 2 rounds, but the actual number of rounds will differ according to the actual graph. The number of rounds is in general smaller for random graphs than for trees, which explains the better performance of the denser random graphs compared to the sparser lists and trees.
Chapter 7

Conclusion

7.1 Summary Of Results

Our experimental study on parallel list ranking and connected component algorithms for GPUs indicated that the PRAM algorithms are a good starting point for developing GPU-based implementations. While they did require non-trivial modifications to ensure good GPU performance, by understanding the characteristics and limitations of the CUDA GPU platform it was possible to implement the original algorithms without impacting their complexity. It was critical for the efficiency of the GPU methods that parallel data access coalescing was maximized. Every step in the algorithms was carefully studied to ensure that the implementation kernels were able to take advantage of memory coalescing where possible. While chasing links in a graph inevitably led to irregular memory access, by using striding within the kernels we were able to ensure that no opportunity for regular access was wasted. A more difficult problem was to reduce the number of conditional branching points in the algorithms. As discussed, minimizing conditions is critical for good performance; however, replacing conditions is not always possible without impacting either the correctness of the algorithm or its performance through unnecessary extra work. For example, with parallel list ranking we were able to convert a condition into a simple boolean operation, while eliminating any of the conditions in the connected components implementation would have led to more writes to global memory than with the conditions. While the number of parallel threads that can be executed concurrently on a GPU is large, it is still significantly smaller than the number of PRAM processors needed by the original algorithms. Since the PRAM is a theoretical model with very specific assumptions regarding performance, synchronization and atomicity, the algorithms designed for


PRAM tend to require n processors to process n data elements. In fact the Shiloach-Vishkin algorithm for connected components explicitly requires m + n processors, since some steps process n vertices and others process m edges. However, regardless of the availability of millions of threads, the problem inputs will always have more data elements than there are processors. It is therefore important for the efficiency of GPU implementations to appropriately assign groups of PRAM processors to GPU threads. Fortunately, by preserving the concurrency and atomicity requirements of the algorithms in the multi-threaded implementations, we were able to execute them using p < n threads on the GPU, and in fact using p ≪ n.

This revealed that the inflection point occurs only in the kernel that was unable to take advantage of coalescing, which as a result could generate up to n separate transactions to access n elements. This showed that the absolute worst case scenario for irregular memory access does in fact push the GPU beyond its capabilities, resulting in a definite and drastic impact on performance.

7.2 Future Work

While we did not observe an inflection in the performance of the pointer-jumping-based parallel list ranking algorithm, it is important to note that this does not mean there was no inflection. While the algorithm does O(log n) iterations, each with O(n) work, the pointer jumping results in a natural coalescing of memory access such that in each iteration i the n irregular memory accesses actually touch only n/2^i distinct memory addresses. However, during the first iteration, when i = 0, there are n irregular accesses which should result in n separate transactions. It would be interesting to extract the run-time of each iteration and (a) determine if the expected inflection occurs when i = 0, and (b) observe the impact on the inflection for subsequent iterations.

The available memory on the GTX 260 (< 1 GB) imposed a restriction on the size of the graphs that could be used in our experiments. As a result, the performance results did not make clear the characteristic shape of the performance degradation following the inflection point. It would be interesting to run the experiments using larger data sets on a GPU with more memory to (a) identify this characteristic and (b) determine if the performance actually degrades to such a degree that it intersects with the pointer-jumping-based list ranking algorithm.

In section 5.2.7 we analyzed the memory bandwidth of the rank aggregation kernel. While the GPU hardware is capable of memory access throughput of 110 GBps, our implementation was unable to exceed 31 GBps. On one hand, we were able to show that this loss of throughput was a direct result of the un-coalesced access of the splitter ranks array; the intentionally incorrect test kernel showed that forcing coalesced access brought the throughput close to 95 GBps. It would be interesting to determine whether the throughput of the correct implementation can be increased any further, perhaps through the use of loop-unrolling and other optimizations. There is room even for the modified, yet incorrect, kernel to reach the desired 110 GBps, and this may indicate some additional gain possible for the rank aggregation kernel.

As mentioned in section 6.1.1, the sequential algorithm for identifying the connected components requires a vertex-representation while the parallel algorithms all operate on an edge-representation of the graph. The conversion process used in our sequential experiments adds a substantial overhead to the performance of the sequential algorithm. If we simply remove the overhead of the conversion, the comparison becomes unfair as the inputs are quite different. However, when we do remove it, we suspect that the sequential implementation would substantially close the performance gap with the parallel GPU implementation for list and tree graphs, and very likely outperform it for the denser random graphs. It would be interesting to (a) identify whether an edge-representation-based sequential algorithm that does O(m + n) work at worst is possible, and (b) compare such an implementation with the parallel GPU implementations.

For this thesis we explored two fundamental graph theory problems: list ranking and computing connected components. Having shown that the parallel algorithms to solve these problems can not only be adapted for execution using modern general-purpose GPUs but also outperform sequential and parallel implementations that use the CPU, the natural next step is to expand the study to look at other graph theory algorithms. The PRAM literature is a treasure trove of parallel algorithms that until now have simply not been practical. For example, Reid-Miller's parallel list-ranking algorithm was originally implemented on a Cray C90 [15], while today we have been able to implement it on a many-core GPU that can be plugged into any desktop computer. Clearly it would be interesting future research to at the very least attempt to implement these PRAM algorithms and take advantage of their promise of better performance through parallelism.
Appendix A

Experimental Setup and Data Generation

System

The parallel list ranking and connected components experiments were executed on two systems.

• The serial and multi-thread (for PC) experiments were executed on a PC with Intel Core 2 Quad CPU Q6600 @ 2.40GHz with 4 cores and 8GB of RAM. Clearly the serial algorithm only uses one of the cores, while the multi-thread algorithm uses all the cores.

• The CUDA experiments were executed on an identical PC that has an NVIDIA GeForce GTX 260 card installed. The GTX 260 has 27 SMs with a total of 216 cores, approximately 900MB of built-in global memory, a warp size of 32, 16KB of shared memory per thread block, and 16384 registers per block.

All the timings were measured by the experiments through the use of well-defined timing functions, as described below. Each experiment was run multiple times; both averages and standard deviations were calculated to build the graphs.

Structure of Each Experiment

Each experiment was written using ANSI C using the following general pattern:

1. Include header files.

2. Define and implement CUDA or pthread kernels


3. Main routine

(a) Process experiment parameters supplied via command line
(b) Load data file to process for the experiment (e.g. linked lists for List Ranking, graphs for Connected Components)
(c) Copy data from host memory into CUDA global memory
(d) Start timers
(e) Invoke the CUDA or pthread kernels
(f) Stop timers
(g) Copy results from CUDA global memory into host memory
(h) Verify correctness of results automatically (when possible)
(i) Send experiment details and timing results to standard output

It is important to note that the performance measurement was done on the computation work and does not include the cost of copying memory to and from CUDA global memory. The intent was to measure the computation time; as well, it is expected that in the near future the cost of moving data between host and device memories will be replaced by shared memory addressing techniques, a form of which is already supported by CUDA devices in a limited way today. The measured computation time does include the time taken to initialize any memory that is part of the parallel algorithm; i.e. if the algorithm itself requires a particular initial state beyond the input data, then the work done to establish it (usually in parallel) is included in the timing.

Measuring time

The CUDA APIs provide a well-defined way to measure the elapsed time by setting up an "event" and measuring the start and end time of that event. The following fragment illustrates the code used to set up such a measurement. The elapsed time is returned in milliseconds.

    cudaEvent_t start, stop;
    float elapsedTime;

    cudaEventCreate (&start);
    cudaEventCreate (&stop);

    // start timing
    cudaEventRecord (start);

    // invoke kernels here so we can measure the time

    // stop the timing
    cudaEventRecord (stop);
    cudaEventSynchronize (stop);

    cudaEventElapsedTime (&elapsedTime, start, stop);

With non-CUDA experiments, the time measurement was done differently, using a BSD C function to get the time of day with very high precision. The following convenience routine was implemented to help simplify the timing.

    float getTimeNowMillis (void)
    {
        static struct timeval first;
        static bool first_set = false;
        if (!first_set) {
            first_set = true;
            (void) gettimeofday (&first, NULL);
        }
        struct timeval now;
        (void) gettimeofday (&now, NULL);
        return ((now.tv_sec - first.tv_sec) * 1000.0f)
             + ((now.tv_usec - first.tv_usec) / 1000.0f);
    }

Using the above routine, the following pattern was used to measure the elapsed time of an experiment run without CUDA. A serial experiment was run by simply invoking the function implementing the experiment; a pthread-based parallel experiment was run by setting up the threads before starting the timing, and then starting the threads and blocking on their completion.

    float start, stop;
    float elapsedTime;

    // start timing
    start = getTimeNowMillis ();

    // invoke the kernels

    // stop timing
    stop = getTimeNowMillis ();
    elapsedTime = stop - start;

Generating Random Numbers

A typical C program can use existing pseudo-random number generators such as rand() or hardware random number generators via /dev/random on Linux platforms to obtain random numbers. While historically the implementation of rand() has been problematic, today's rand() implementations on Linux are high-quality linear congruential generators with a period of at least 2^32, making them suitable when the number of random numbers needed is much less than 2^32. The randomized list and graph data generators, which are used to generate lists and graphs with n ≪ 2^32 elements, are run on a Linux PC using the rand() function seeded via a command line parameter. However, none of these methods apply when executing code inside a CUDA kernel, since none of the standard C functions can be called from within a CUDA function. Since random numbers are needed in exactly one kernel in the Random Splitter List Ranking implementation, we use an algorithm called KISS by Marsaglia and Zaman [13] that has a very high period of 2^123 while using straightforward 64-bit integer operations that make it highly performant on a CUDA system. The KISS algorithm has also been implemented for use with the list and graph generators and can be used via a command line option. The following is the C-language implementation of the algorithm for CUDA.

    typedef struct {
        uint32_t x, y, z;
        uint32_t carry;
    } RAND_t;

    __device__ void DCU_RandInit (uint32_t seed, uint32_t plx, RAND_t *kiss)
    {
        uint32_t pseed = seed + (plx + (plx << 16));
        kiss->x     = (pseed << 1) | (pseed >> 31);   // rotate left by 1 bit
        kiss->y     = (pseed << 2) | (pseed >> 30);   // rotate left by 2 bits
        kiss->z     = (pseed << 4) | (pseed >> 28);   // rotate left by 4 bits
        kiss->carry = (pseed << 8) | (pseed >> 24);   // rotate left by 8 bits
    }

    __device__ uint32_t DCU_RandNextU32 (RAND_t *kiss)
    {
        uint64_t t;
        uint64_t a = 698769069LL;
        kiss->x = 69069 * kiss->x + 12345;
        kiss->y ^= (kiss->y << 13);
        kiss->y ^= (kiss->y >> 17);
        kiss->y ^= (kiss->y << 5);
        t = a * kiss->z + kiss->carry;
        kiss->carry = (t >> 32);
        return kiss->x + kiss->y + (kiss->z = t);
    }

    __device__ float DCU_RandNext (RAND_t *kiss)
    {
        return ((float) DCU_RandNextU32 (kiss)) / ((float) 4294967295);
    }
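As a usage illustration, each CUDA thread can seed its own RAND_t from the experiment seed plus its global thread index and then draw values independently. Only DCU_RandInit and DCU_RandNextU32 above are from the implementation; the kernel name, the out array, and the index_t typedef (taken from the other listings) are assumptions.

    __global__ void draw_random_indices_sketch (uint32_t seed, index_t n, index_t *out)
    {
        index_t plx = blockIdx.x * blockDim.x + threadIdx.x;     // global thread index
        RAND_t kiss;
        DCU_RandInit (seed, (uint32_t) plx, &kiss);              // per-thread generator state
        if (plx < n)
            out[plx] = (index_t)(DCU_RandNextU32 (&kiss) % (uint32_t) n);  // index in [0, n), ignoring modulo bias
    }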

Generating the Experimental Data

Fairly early in the process of implementing the experiments we decided to generate the data to be used with the experiments and then run the experiments by loading the data files. On one hand, this was motivated by the very large data sets and the time it took to generate them for each experiment. At the same time, we ran into stability issues with the CUDA GPU which led to crashed experiments that required re-running, and we wanted to minimize the time spent doing that. This decision was particularly relevant when implementing the connected components algorithms, where the graph generation itself became very time-consuming for the data sizes we were using.

Randomized List Generator for List Ranking

A random list generator is invoked as genList(n, seed), where n is the size of the linked list and seed specifies the seed to be used with the random number generator. The list is constructed by starting at index 0 and randomly selecting the successive links from the n slots; this produces a properly randomized linked list across all n nodes, the worst case linked list for parallel list ranking.
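One way to realize this construction is sketched below, under two assumptions that the thesis's actual genList may not share: the list is encoded as a successor array next[], and the last node points to itself as a terminator. The function name is illustrative; the shuffle used is a standard Fisher-Yates shuffle.

    #include <stdlib.h>

    /* Sketch of a randomized list builder: next[v] is the successor of node v. */
    void gen_list_sketch (long n, unsigned seed, long *next)
    {
        long *perm = malloc ((n - 1) * sizeof *perm);
        srand (seed);
        for (long i = 0; i < n - 1; i++) perm[i] = i + 1;      // slots 1..n-1
        for (long i = n - 2; i > 0; i--) {                     // Fisher-Yates shuffle
            long j = rand () % (i + 1);
            long t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        long cur = 0;                                          // the list starts at node 0
        for (long i = 0; i < n - 1; i++) {
            next[cur] = perm[i];                               // link to a random unused slot
            cur = perm[i];
        }
        next[cur] = cur;                                       // assumed terminator convention
        free (perm);
    }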

Randomized Graph Generator for Counting Connected Components

A random graph generator is used to generate the different types of graphs, as follows.

1. A randomized list graph generator, invoked as genListGraph(n, count, seed), where n is the total number of vertices in the generated graph, count specifies the number of lists within the graph, and seed specifies the seed to be used with the random number generator. This function generates count sub-graphs where each sub-graph is a linked list made up of approximately n/count nodes randomly selected across the n vertices; thus each list is randomly and evenly distributed.

2. A randomized tree graph generator, invoked as genTreeGraph(n, count, k, seed), where count is the number of trees within the graph and k specifies the maximum degree per node in each tree in the graph. This function generates count tree sub-graphs where each tree is made up of approximately n/count nodes randomly selected across the n vertices. To ensure maximum randomness in the tree node distribution and size, the generator randomly selects count root nodes and adds them to a parent list P. Then, until |P| = n, it (a) randomly picks a node x ∈ P with less than k children, (b) randomly picks a node y ∉ P from the remaining n - |P| nodes, and (c) makes y a child of x, after which y is added to P. (A sketch of this tree-growing loop is given after this list.)

3. An edge-density-based random graph generator, invoked as genRandomGraph(m, δ, count, seed), where m specifies the desired number of total edges in the graph, δ specifies the edge density which is used to compute the number of vertices in the graph n = √(m/(2δ)), count specifies the number of sub-graphs, and seed specifies the seed to be used with the random number generator. This function generates count sub-graphs by individually generating random graphs made up of n/count vertices and m/count edges each.
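A sketch of the tree-growing loop described in item 2 follows. The array names, the parent-array output, and the use of rand() are assumptions made for illustration; the thesis's genTreeGraph may differ in these details.

    #include <stdlib.h>

    /* Sketch: grow `count` trees over n vertices, each vertex having at most k children. */
    void grow_trees_sketch (long n, long count, int k, unsigned seed, long *parent)
    {
        long *P         = malloc (n * sizeof *P);      // vertices already attached (the list P)
        int  *child_cnt = calloc (n, sizeof *child_cnt);
        char *inP       = calloc (n, 1);
        long  psz       = 0;
        srand (seed);

        for (long c = 0; c < count; c++) {             // pick `count` distinct root nodes
            long r;
            do { r = rand () % n; } while (inP[r]);
            inP[r] = 1;  parent[r] = r;  P[psz++] = r;
        }
        while (psz < n) {
            long x, y;
            do { x = P[rand () % psz]; } while (child_cnt[x] >= k);  // (a) parent with a free slot
            do { y = rand () % n; } while (inP[y]);                  // (b) vertex not yet in P
            parent[y] = x;  child_cnt[x]++;                          // (c) attach y under x
            inP[y] = 1;  P[psz++] = y;
        }
        free (P);  free (child_cnt);  free (inP);
    }

This rejection-sampling form is quadratic in the worst case; it is meant only to make steps (a) to (c) concrete, not to reflect the performance of the actual generator.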

Automating the Experiments

The individual experiments were written as stand-alone C programs that, when executed, would read one data file, perform the experiment on the data, measure the time taken, verify correctness of the experiment and print out a single row of results. For example:

GPURULER64 18976 1 64000000 256 2196.485107 0 59.310719 2082.887695 0.031968 59.310719

The above is a single RAW result from a single execution of the CUDA GPU Random Splitter List Ranking experiment, using a linked list with 64 million elements generated using a seed of 18976, invoked with 1 thread block of 256 threads. The rest of the numbers are timing results showing the total time as well as a detailed break-down of the time taken by the various kernels that make up the experiment. The overall experiment is made up of multiple such executions using different parameters, the results of which are then averaged and aggregated to produce the final results analyzed in the earlier chapters. Table A.1 shows the various list ranking experiments that were run and the total number of executions per experiment. Table A.2 shows the various connected components experiments that were run for each of the four types of graphs; the table lists the experiments only, and each one was run for the four different graph types with 11 different sub-graph sizes per type, which is taken into account in the total number of runs.

Table A.1: All the List Ranking Experiments

Purpose       Algorithm                           N range    p range        Count   Reps   Runs
Both          CPU Multi-thread Random Splitter    1M..64M    2..16          3       3      3072
Speed-up      GPU Random Splitter 64bit           64M        256..12288     3       3      144
Scalability   GPU Random Splitter 64bit           1M..64M    8192           3       3      192
Speed-up      GPU Random Splitter 48bit           64M        256..12288     3       3      144
Scalability   GPU Random Splitter 48bit           1M..64M    8192           3       3      192
Speed-up      GPU Pointer Jumping                 64M        256..32768     3       n/a    384
Scalability   GPU Pointer Jumping                 1M..64M    8192           3       n/a    192
Baseline      CPU Serial Ranking                  1M..64M    1              3       n/a    192

Table A.2: All the Connected Components Experiments (x 4 graph types)

Purpose       Algorithm                             M range    p range         Count   Types    Runs
Baseline      CPU Serial Vertex-Based               1M..64M    1               3       4 x 11   8448
Scalability   CPU Multi-thread Shiloach-Vishkin     1M..64M    4               3       4 x 11   8448
Scalability   CPU Multi-thread Shiloach-Vishkin     1M..64M    16              3       4 x 11   8448
Speed-up      CPU Multi-thread Shiloach-Vishkin     64M        2..16           3       4 x 11   2112
Scalability   GPU Shiloach-Vishkin                  1M..64M    65536           3       4 x 11   8448
Speed-up      GPU Shiloach-Vishkin                  1M..64M    1048576         3       4 x 11   8448
Speed-up      GPU Shiloach-Vishkin                  64M        1024..16384     3       4 x 11   8052

To automate the execution of these experiments we selected the Bean Shell¹ scripting language. Bean Shell is an interpretive variant of the Java² programming language, implemented using Java. It is readily available on all platforms and provides a rich yet easy to use scripting language that is easier to maintain than traditional Unix-derived scripting languages such as Bash, Perl and Python. A common utilities script was written for each problem, i.e. one for List Ranking and one for Connected Components, and then individual scripts were created for each algorithm. The scripts themselves were configurable via command-line parameters to control the N, M and p variables, and the results were gathered in two forms:

• RAW results from each run were collected and stored. This was important to deal with the GPU crashes that, while infrequent, occurred often enough that the raw data was valuable in helping us identify at which point to re-start an experiment.

• Aggregated and averaged results, which were generated by a separate script that would be run to process the RAW data and perform the averaging and standard deviation calculations.

¹See http://www.beanshell.org/ for more information on Bean Shell scripting.
²See http://www.java.com for more information on the Java programming language.

The aggregated results were then converted by hand into plotting data files for use with gnuplot³ to produce the graphs presented and analyzed in the earlier chapters.

³See http://www.gnuplot.info/ for more information on the extremely awesome gnuplot graphing tool.
Bibliography

[1] NVIDIA CUDA Developer Zone. Available from: http://developer.nvidia.com/object/gpucomputing.html.

[2] ATI Stream Computing User Guide. AMD, Inc., 2009.

[3] NVIDIA CUDA Programming Guide. NVIDIA Corporation, May 2009.

[4] OpenCL and the ATI Stream SDK v2.0. AMD, Inc., 2009.

[5] The OpenCL Specification 1.0. Khronos OpenCL Working Group, 2009.

[6] Richard Cole and Uzi Vishkin. Deterministic coin tossing with applications to optimal parallel list ranking. Inform. and Control, 70(1):32-53, 1986. Available from: http://dx.doi.org/10.1016/S0019-9958(86)80023-7, doi:10.1016/S0019-9958(86)80023-7.

[7] Frank Dehne and Kumanan Yogaratnam. Exploring the limits of GPUs with parallel graph algorithms. CoRR, abs/1002.4482, 2010. Available from: http://arxiv.org/abs/1002.4482.

[8] Reinhard Diestel. Graph Theory (Graduate Texts in Mathematics). Springer, August 2005.

[9] F. Dehne and S. Song. Randomized parallel list ranking for distributed memory multiprocessors. Int. Journal of Parallel Programming, 25(1):1-16, 1997.

[10] Michael Garland, private communication. NVIDIA Corporation, 2009.

[11] GPGPU.ORG. General purpose computation on graphics hardware. Available from: http://gpgpu.org/tag/papers.

[12] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, 28(2):39-55, March 2008. doi:10.1109/MM.2008.31.


[13] G. Marsaglia and A. Zaman. The KISS generator. Technical report, Department of Statistics, University of Florida, 1993.

[14] M. Suhail Rehman, Kishore Kothapalli, and P. J. Narayanan. Fast and scalable list ranking on the GPU. In ICS '09: Proceedings of the 23rd international conference on Supercomputing, pages 235-243, New York, NY, USA, 2009. ACM. doi:10.1145/1542275.1542311.

[15] Margaret Reid-Miller. List ranking and list scan on the CRAY C90. In Journal of Computer and System Sciences, volume 53, pages 344-356, 1996.

[16] John H. Reif, editor. Synthesis of Parallel Algorithms. Morgan Kaufman Publishers Inc., 1993.

[17] Yossi Shiloach and Uzi Vishkin. An O(log n) parallel connectivity algorithm. In Journal of Algorithms, volume 3, pages 57-67, 1982.

[18] J.F. Sibeyn, F. Guillaume, and T. Seidel. Practical parallel list ranking. In IRREGULAR'97, volume 1253, pages 25-36. Springer LNCS, 1997.

[19] Jop F. Sibeyn. Better trade-offs for parallel list ranking. In ACM SPAA, pages 221-230, 1997.

[20] Andrew S. Tanenbaum. Structured Computer Organization. Prentice Hall PTR, 1990.

[21] Vasily Volkov and James W. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing, SC '08, pages 31:1-31:11, Piscataway, NJ, USA, 2008. IEEE Press. Available from: http://portal.acm.org/citation.cfm?id=1413370.1413402.

[22] Zheng Wei and J. JaJa. Optimization of linked list prefix computations on multi-threaded GPUs using CUDA. In 2010 IEEE International Symposium on Parallel and Distributed Processing (IPDPS), 2010.

[23] J. C. Wyllie. The Complexity of Parallel Computations. PhD thesis, Cornell University, 1979.