AN EFFICIENT CONNECTED COMPONENTS ALGORITHM

FOR MASSIVELY-PARALLEL DEVICES

by

Jayadharini Jaiganesh

A thesis submitted to the Graduate Council of Texas State University in partial fulfillment of the requirements for the degree of Master of Science with a Major in Computer Science May 2017

Committee Members:

Martin Burtscher, Chair

Apan Qasem

Vangelis Metsis

COPYRIGHT

by

Jayadharini Jaiganesh

2017

FAIR USE AND AUTHOR’S PERMISSION STATEMENT

Fair Use

This work is protected by the Copyright Laws of the United States (Public Law 94-553, section 107). Consistent with fair use as defined in the Copyright Laws, brief quotations from this material are allowed with proper acknowledgement. Use of this material for financial gain without the author’s express written permission is not allowed.

Duplication Permission

As the copyright holder of this work I, Jayadharini Jaiganesh, authorize duplication of this work, in whole or in part, for educational or scholarly purposes only.

DEDICATION

To my mentor, thank you for your guidance and motivation. To my loving husband for his unwavering support.

ACKNOWLEDGEMENTS

This project is supported by the National Science Foundation (NSF grant #1406304) and equipment donations from Nvidia.


TABLE OF CONTENTS

Page

ACKNOWLEDGEMENTS ...... v

LIST OF FIGURES ...... viii

LIST OF ABBREVIATIONS ...... x

ABSTRACT ...... xi

CHAPTER

1. INTRODUCTION ...... 1

1.1 Connected Components ...... 1
1.2 General Serial Connected Components Algorithm ...... 2
1.3 General Parallel Connected Components Algorithm ...... 4
1.4 Contributions ...... 4
1.5 Outline ...... 5

2. BACKGROUND STUDY ...... 7

2.1 Parallel Implementations of CC ...... 7

3. RELATED WORK ...... 15

4. ECL ALGORITHM AND IMPLEMENTATIONS ...... 17

4.1 ECL Base Algorithm ...... 17
4.2 ECL - GPU Implementation ...... 20
4.3 ECL - Parallel CPU Implementation ...... 23
4.4 ECL - Serial CPU Implementation ...... 23
4.5 ECLaf - Atomic-Free CC Implementation ...... 23

5. EVALUATION METHODOLOGY ...... 26

5.1 GPU and CPU Machines ...... 26
5.2 Input Graphs Format and Specifications ...... 27


5.3 Compiler Information ...... 30

6. RESULTS ...... 31

6.1 Configurations ...... 31
6.2 Comparison with Parallel GPU Benchmarks ...... 31
6.3 Comparison with Parallel CPU Benchmarks ...... 34
6.4 Comparison with Serial CPU Benchmarks ...... 36
6.5 Performance Comparison across different Configurations ...... 38
6.6 Performance Comparison - Atomic-Free Implementations ...... 41

7. SUMMARY ...... 45

7.1 Summary ...... 45
7.2 Future Work ...... 46

APPENDIX SECTION ...... 47

LITERATURE CITED ...... 54


LIST OF FIGURES

Figure Page

1.1. Connected Components in a Graph ...... 2

2.1. Graph before Hooking ...... 7

2.2. Graph after Hooking ...... 8

2.3. Graph after Pointer Jumping ...... 8

4.1. Double-sided worklist ...... 21

5.1. Graph ...... 29

5.2. Compressed Adjacency List Representation ...... 30

6.1. Parallel GPU Benchmarks Slowdowns relative to ECL on the Titan X ...... 32

6.2. Parallel GPU Benchmarks Slowdowns relative to ECL on the K40 ...... 33

6.3. Parallel CPU Benchmarks Slowdowns relative to ECL on the Zurich ...... 34

6.4. Parallel CPU Benchmarks Slowdowns relative to ECL on the Denver ...... 35

6.5. Serial CPU Benchmarks Slowdowns relative to ECL on the Zurich ...... 36

6.6. Serial CPU Benchmarks Slowdowns relative to ECL on the Denver ...... 37

6.7. Geometric-Mean Slowdowns of CC Implementations on different systems...... 38

6.8. Throughput (Edges/s) of various CC Implementations on different systems ...... 39

6.9. Throughput (Nodes/s) of various CC Implementations on different systems ...... 40

6.10. Parallel GPU Benchmarks Slowdowns relative to ECLaf on the Titan X ...... 41

6.11. Parallel GPU Benchmarks Slowdowns relative to ECLaf on the K40 ...... 42

6.12. Parallel CPU Benchmarks Slowdowns relative to ECLaf on the Zurich ...... 43


6.13. Parallel CPU Benchmarks Slowdowns relative to ECLaf on the Denver ...... 44


LIST OF ABBREVIATIONS

Abbreviation Description

BFS Breadth First Search

BFSCC Breadth First Search for Connected Components

CC Connected Components

DFS Depth First Search

GPU Graphics Processing Unit

LSG Lonestar GPU

LS Lonestar CPU


ABSTRACT

Massively-parallel devices such as GPUs are best suited for accelerating regular algorithms. Since the memory access patterns and control flow of irregular algorithms are data dependent, such programs are more difficult to parallelize in general, and a direct parallelization may not yield good performance, on GPUs in particular. However, by carefully studying the underlying problem, it may be possible to derive new algorithms that are more suitable for massively-parallel accelerators. This thesis studies and analyzes one such irregular problem, Connected Components (CC), and proposes an efficient algorithm, called ECL, which is faster than the existing CC algorithms on most tested inputs. Though atomic operations are fast, they can represent a bottleneck because they execute serially and might hinder performance on future parallel devices. This thesis therefore also proposes a synchronous, atomic-free algorithm, called ECLaf, whose performance is comparable to the fastest existing CC algorithms.


CHAPTER 1

INTRODUCTION

Finding the Connected Components (CC) is a key preprocessing step in many graph algorithms. It is used in real-world applications such as navigation, image segmentation, and in the medical field. Thus, a faster connected components algorithm and implementation have the potential to improve many important graph processing codes.

1.1 Connected Components

For an undirected graph G = (V, E), where V is the set of vertices and E is the set of edges, a connected component C is a subset of V such that all the vertices belonging to C are reachable from any vertex in C, and there are no edges between vertices belonging to different components. The connected components problem is to find the number of such components present in a given graph, assign a unique ID to each component, and label each vertex in the graph with its component ID. Figure 1.1 shows a graph with multiple connected components.

There are several variants of connected components. A Strongly Connected Component of a directed graph is a maximal set of vertices such that every pair of vertices in the set is reachable from each other. Bi-Connected Components are components that remain connected after removing any single vertex. A directed graph is Weakly Connected [7] if replacing all its directed edges with undirected edges produces a connected (undirected) graph.


Figure 1.1 Connected Components in a Graph

1.2 General Serial Connected Components Algorithm

There are two main serial approaches for Connected Components - Depth-First Search [15] and the Union-Find algorithm [5] using a disjoint-set data structure.

1.2.1 Union-Find Algorithm

This algorithm initially iterates over all the vertices and places each of them in a separate disjoint set and uses the vertex ID as the representative (label) of the set. Then, it loops over each edge (u, v) combining the sets of vertex u and vertex v. Thus, the algorithm returns a collection of sets where each set represents a connected component in the graph. Algorithm 1 depicts the union-find algorithm.

Algorithm 1. Union-Find algorithm

1. procedure: Union-Find CC (V, E)

2. for each vertex v in V

3. Add vertex to a disjoint set Sv

4. Assign vertex v as the representative of the set Sv


5. end for

6. for each edge (u, v) in E

7. if Representative(Su) != Representative(Sv)

8. Merge sets Su and Sv

9. Choose Representative of Su or Sv as new representative

10. end for
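For concreteness, the following is a minimal serial C sketch of this union-find approach, always keeping the smaller vertex ID as the representative and using path halving in the find step. The function and variable names are illustrative and not taken from any of the codes evaluated in this thesis.

```c
#include <stdlib.h>

/* Illustrative union-find CC sketch: parent[] acts as the disjoint-set array;
 * find() chases the chain to the representative with path halving. */
static int find(int *parent, int v) {
    while (parent[v] != v) {
        parent[v] = parent[parent[v]];   /* path halving: skip one level */
        v = parent[v];
    }
    return v;
}

/* edges holds 2*num_edges vertex IDs: (u0, v0, u1, v1, ...).
 * Returns an array in which each vertex points to its component representative. */
int *union_find_cc(int num_vertices, int num_edges, const int *edges) {
    int *parent = malloc(num_vertices * sizeof(int));
    for (int v = 0; v < num_vertices; v++) parent[v] = v;   /* one set per vertex */
    for (int e = 0; e < num_edges; e++) {
        int ru = find(parent, edges[2 * e]);
        int rv = find(parent, edges[2 * e + 1]);
        if (ru != rv) {                                     /* merge, keep the smaller ID */
            if (ru < rv) parent[rv] = ru; else parent[ru] = rv;
        }
    }
    for (int v = 0; v < num_vertices; v++)                  /* flatten to direct representatives */
        parent[v] = find(parent, v);
    return parent;
}
```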

1.2.2 Depth-First Search

Hopcroft and Tarjan [15] proposed the approach of using depth-first search to compute connected components, which is shown in Algorithm 2. It involves iterating over all the vertices and performing depth-first search recursively to find the connected components. It maintains a Boolean array “visited”, which denotes whether a vertex has been visited or not. The algorithm loops over each vertex and performs DFS only on those vertices that have not yet been visited. During the depth-first search, it visits the neighbors of that vertex and sets their visited array element to true. In this way, all the vertices that are connected will be visited in a single recursive search. As this algorithm visits each vertex and each edge once, it takes O(|V| + |E|) time, i.e., linear time in the size of the graph.

Algorithm 2. DFS for CC

1. procedure: DFS-CC (V, E)

2. for each vertex v in V

3. if visited[v] is false

4. visited[v] ← true


5. increment number of connected components

6. Dfs(v)

7. end if

8. end for
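The sketch below shows the same idea in C over the compressed adjacency list format described in Chapter 5, but with an explicit stack instead of recursion so it also works on very large graphs. It is an illustrative rendition of Algorithm 2, not code from the literature.

```c
#include <stdlib.h>

/* Illustrative DFS-based CC over a CSR graph (nidx/nlist as in Chapter 5).
 * Fills comp[v] with a component ID and returns the number of components. */
int dfs_cc(int nv, const int *nidx, const int *nlist, int *comp) {
    int *stack = malloc(nv * sizeof(int));   /* each vertex is pushed at most once */
    for (int v = 0; v < nv; v++) comp[v] = -1;            /* -1 means "not visited" */
    int num_cc = 0;
    for (int v = 0; v < nv; v++) {
        if (comp[v] != -1) continue;                      /* already labeled */
        int top = 0;
        stack[top++] = v;
        comp[v] = num_cc;                                 /* new component seeded at v */
        while (top > 0) {                                 /* iterative depth-first search */
            int u = stack[--top];
            for (int i = nidx[u]; i < nidx[u + 1]; i++) {
                int w = nlist[i];
                if (comp[w] == -1) { comp[w] = num_cc; stack[top++] = w; }
            }
        }
        num_cc++;
    }
    free(stack);
    return num_cc;
}
```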

1.3 General Parallel Connected Components Algorithm

The general approach used to determine connected components in parallel is as follows. Each vertex has a label to hold the component ID to which it belongs. Initially, this label is set to the vertex ID, that is, each vertex is considered a separate component. Then each vertex and its neighbors are iteratively processed, in parallel, to find the connected components. In each step, the labels are updated until all the vertices in a connected component have the same label. Often, the ID of the minimum vertex in each component is chosen as the component ID to guarantee uniqueness. This general approach is common to several of the algorithms discussed below.

1.4 Contributions

This thesis proposes ECL, an efficient Connected Components algorithm. I implemented this algorithm in parallel for GPUs and CPUs using CUDA and OpenMP, respectively, as well as in serial C code. In my algorithm, each vertex asynchronously processes its neighbors to find the ID of the minimum vertex in that component. All vertices follow a path through their neighbors, neighbors’ neighbors, and so on, until the vertex with the lowest ID is found, which is then used to label all vertices on the path. My implementation is lock-free and uses a novel termination criterion to stop the computation at a vertex as soon as possible. Moreover, to improve the load balance, a worklist is maintained for high-degree vertices to delay their processing.

Atomic operations play a significant role in parallel codes as they help prevent data races. Though atomic operations are fast, they can become a bottleneck in parallel code because they serialize the threads and thus affect performance. This thesis also proposes an atomic-free implementation of the CC algorithm, ECLaf, in CUDA and OpenMP, which is as fast as the fastest existing approaches.

I tested all the implementations on 18 real-world and synthetic graphs of varying sizes, including road maps, RMAT and Kronecker graphs, Internet topology graphs, citation graphs, web-link graphs, etc. I compared both of my CUDA implementations with the best pre-existing algorithms on two different GPUs - K40 and Titan X. On average, ECL is about 1.7x faster than the existing fastest GPU algorithm.

I implemented the parallel and serial CPU codes using OpenMP and C, respectively, and compared them with corresponding programs from the literature. The parallel CPU implementation of ECL is about 1.6x faster than the existing fastest parallel CPU algorithm. I implemented the atomic-free algorithm ECLaf in OpenMP as well, and it is 1.2x faster than the fastest parallel CPU algorithm from the literature. The serial implementation of ECL is 5x faster than the fastest preexisting serial CC algorithm.

1.5 Outline

The rest of this thesis is organized as follows: Chapter 2 discusses various implementations of connected components. Chapter 3 includes related work on connected components. Chapter 4 describes my CC implementation under various configurations as well as an atomic-free implementation of CC. Chapter 5 explains the various environments on which CC is tested, the inputs, and the evaluation methods. Chapter 6 presents a comparison of my CC implementation with other benchmarks and analyzes the performance. Chapter 7 concludes with a summary and future work.


CHAPTER 2

BACKGROUND STUDY

2.1 Parallel Implementations of CC

A straightforward approach to implement CC is to mark each vertex with a unique label and to propagate the vertex labels through neighboring vertices until all the vertices in the same component share the same label. This is called label propagation.

Shiloach and Vishkin’s [1] approach computes connected components in two major steps: Hooking and Pointer Jumping.

The Hooking operation works on edges. The vertices on either side of an edge belong to the same component. For a given edge (u, v), the Hooking operation checks whether the two vertices are labelled with the same component ID. If not, the higher of the two vertex IDs is “overwritten” with the lower ID. This is achieved by making the higher representative point to the lower ID. Figure 2.1 shows an input graph where vertices 8 and 14 have different component IDs. After hooking, vertices 8 and 14 are assigned the same component ID, as shown in Figure 2.2.

Figure 2.1 Graph before Hooking


Figure 2.2 Graph after Hooking

Given a vertex u, Pointer Jumping reduces the depth of the tree (that u belongs to) by one by replacing its ID with its parent’s ID. Figure 2.2 shows a tree before Pointer Jumping. After the algorithm terminates, the tree from Figure 2.2 has been reduced to the single-level tree shown in Figure 2.3, which indicates that all the vertices belong to the same component with vertex 4 as their label.

Initially, the algorithm considers each vertex as a separate tree (or component labelled by its own ID). During each iteration, it performs Hooking and Pointer Jumping until all the multi-level trees are reduced to one-level trees (stars). Researchers have published different parallel algorithms for CC based on this approach, which use different names for these steps. This algorithm does not use any additional memory space for its computation.

Figure 2.3 Graph after Pointer Jumping


Soman et al. [2] proposed a variant of Shiloach-Vishkin’s [1] approach by introducing Multiple Pointer Jumping. The algorithm defines an array “Label” to hold the component ID and initializes it with the vertex ID. Then, it iterates over the hook kernel until all the vertices in the same component are connected. It iteratively performs Pointer Jumping to convert the multi-level trees into single-level trees (stars), thus setting all the vertices in the same component to the same ID. Algorithm 3 shows this approach.

Algorithms 4 and 5 show the Hook and Multiple Pointer Jumping kernels.

Algorithm 3. Overall Steps – Soman’s Algorithm

1. procedure: Soman’s CC (V, E)

2. for each vertex v in V

3. Initialize Label[v] ← v

4. repeat

5. for each edge (u, v)

6. Hook (u, v)

7. until Hook performs no change in Label []

8. repeat

9. multiple-pointer jump ()

10. until Jump performs no change in Label []

Algorithm 4. Hook Kernel in Soman’s Algorithm

1. procedure: Soman’s Hook (u, v)

2. if (edge is unmarked && Label[u] != Label[v])

3. min ← min (Label[u], Label[v])

4. max ← max (Label[u], Label[v])

5. if iteration is even

6. Label[min] = max

7. else

8. Label[max] = min

9. end if

10. else

11. Mark edge (u, v) // edge hiding

12. end if

Algorithm 5. Multiple Pointer Jumping Kernel in Soman’s Algorithm

1. procedure: Multiple Pointer Jumping (u, Label [])

2. repeat

3. Label[u] ← Label [Label [u]]

4. u ← Label[u]

5. until u is a root
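The following serial C sketch illustrates how hooking and pointer jumping interact. In the real implementations, the hook step runs as a parallel kernel over the edges and is simply repeated until no label changes; here each endpoint's label is chased to its root for clarity. The code is only meant to clarify the mechanism; names and structure are illustrative.

```c
#include <stdbool.h>

/* Illustrative serial rendition of hook + pointer jumping (Algorithms 3-5).
 * edges holds 2*ne endpoints (u0, v0, u1, v1, ...); label[] plays the role of Label[]. */
void hook_and_jump_cc(int nv, int ne, const int *edges, int *label) {
    for (int v = 0; v < nv; v++) label[v] = v;        /* every vertex is its own tree */
    bool changed = true;
    while (changed) {
        changed = false;
        for (int e = 0; e < ne; e++) {                /* hooking over all edges */
            int ru = label[edges[2 * e]];
            while (ru != label[ru]) ru = label[ru];   /* root of u's tree */
            int rv = label[edges[2 * e + 1]];
            while (rv != label[rv]) rv = label[rv];   /* root of v's tree */
            if (ru != rv) {                           /* hook larger root under smaller root */
                if (ru < rv) label[rv] = ru; else label[ru] = rv;
                changed = true;
            }
        }
        for (int v = 0; v < nv; v++)                  /* pointer jumping: flatten trees to stars */
            while (label[v] != label[label[v]]) label[v] = label[label[v]];
    }
}
```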

Sutton et al. [8] proposed a work-efficient connected components algorithm based on Soman’s work. The algorithm, presented in Algorithm 6, splits the graph into 2|E|/|V| edge-list segments, where |V| denotes the number of vertices and |E| denotes the number of edges in the graph, and iteratively performs Hook and Pointer Jumping operations on each of the segments. It performs an atomic Hook and a single multi pointer jumping pass for each vertex.


Algorithm 6. Sutton et al’s CC Algorithm

1. procedure: Sutton’s CC (V, E)

2. for each vertex v in V

3. Initialize Label[v] ← v

4. for each segment in the edge-list

5. repeat

6. for all edges (u, v) in the segment

7. AtomicHook (u, v)

8. end for

9. until Hook performs no change in Label []

10. for all v in V

11. multi-pointer jump ()

12. end for

13. end for

Gunrock’s [3] connected components algorithm is again a variant of Soman’s approach. It involves iterating over Hooking and pointer jumping until all the vertices in the same component have the same component ID. However, instead of processing all the vertices and edges in each iteration, this approach tries to reduce the workload by using a filter operator. After Hooking, the filter operator removes an edge if both the end vertices have the same component ID. Similarly, after pointer jumping, it removes a vertex whose own vertex ID is equal to the component ID. In this way, it reduces the workload after each iteration.


Ligra [4] is a graph processing framework for shared-memory parallel/multicore machines. Ligra provides two implementations for CC - Components and BFSCC. For an undirected graph G = (V, E), the Components algorithm, outlined in Algorithm 7, maintains two arrays “ID” and “prevID” of size |V|, initialized such that ID[i] = i and prevID[i] = i. It iterates over the vertices, updating each vertex’s ID with the minimum ID of its neighbors, until all the vertices in the same component have the same (minimum) ID. It reduces the workload in each iteration by only processing those vertices whose ID has changed in the previous iteration. The algorithm detects such changes by comparing prevID with ID.

Algorithm 7. Components - Ligra Algorithm

1. procedure: Ligra Component’s CC (V, E)

2. frontier = {0, …, |V|-1}

3. for each vertex v in V

4. ID[v] ← v

5. prevID[v] ← v

6. end for

7. repeat

8. for each vertex v in V

9. prevID[v] ← ID[v]

10. end for

11. for each vertex v in V

12. ID[v] ← min ID of its neighbors

13. if (ID[v] == prevID[v])


14. remove v from Frontier

15. end if

16. end for

17. until Frontier is empty

Ligra’s BFSCC algorithm uses Breadth-First Search in parallel to compute CC. It maintains an array “Parent” of size |V| to hold the component ID of each vertex and initializes it with -1 to denote that a vertex has not yet been processed. The algorithm iterates over all the vertices and, in each iteration, it maintains a worklist that contains the vertices to be processed. It carries out Breadth-First Search on the vertices in the worklist, and all the vertices reachable through the search are again pushed onto the worklist. This continues until there are no more reachable vertices and all the vertices are processed. It labels all the vertices processed in the same iteration with the same ID. In this way, the algorithm computes the connected components. Ligra uses two simple routines, VertexMap and EdgeMap, for mapping over vertices and edges, respectively.

Algorithm 8. BFSCC - Ligra Algorithm

1. procedure: Ligra BFSCC’s CC (V, E)

2. Worklist = {}

3. for each vertex v in V

4. Parent[v] ← -1

5. for each vertex v in V

6. if (Parent[v] == -1)

7. push v to Worklist

8. repeat

9. v ← pop from Worklist

10. BFS(v)

11. push all vertices reachable from v to Worklist

12. until Worklist is empty

13. end if

14. end for

Ligra+ [5] is a variant of Ligra that is based on a compressed graph representation. It uses an encoder program to compress the input graphs with a difference encoding scheme. This difference encoding scheme targets the adjacency list of a vertex and encodes the difference between the edges of each vertex using variable length codes.

The encoder compresses each integer into k-bit blocks, and each block has a continue bit to indicate whether the next block is also used to encode the integer x. The integer x is simply written in binary representation across the blocks: if one block cannot fit all the bits, the encoder continues in the following blocks by setting the continue bit. Decoding reverses this process, reading blocks until one with a cleared continue bit is reached and converting the binary representation back to the original value.
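The sketch below illustrates the continue-bit idea in C with 8-bit blocks (7 payload bits plus a continue bit, i.e., a varint-style code). It is meant only to clarify the scheme; Ligra+'s actual encoder supports other block sizes and applies the code to the differences between consecutive neighbor IDs rather than to raw values.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative continue-bit encoding with 8-bit blocks: 7 payload bits per block,
 * top bit set when another block follows (not Ligra+'s exact format). */
size_t encode_uint(uint32_t x, uint8_t *out) {
    size_t n = 0;
    do {
        uint8_t block = x & 0x7F;            /* low 7 bits of the value */
        x >>= 7;
        if (x != 0) block |= 0x80;           /* continue bit: more blocks follow */
        out[n++] = block;
    } while (x != 0);
    return n;                                /* number of blocks written */
}

uint32_t decode_uint(const uint8_t *in, size_t *pos) {
    uint32_t x = 0;
    int shift = 0;
    uint8_t block;
    do {
        block = in[(*pos)++];
        x |= (uint32_t)(block & 0x7F) << shift;   /* reassemble payload bits */
        shift += 7;
    } while (block & 0x80);                       /* stop when continue bit is clear */
    return x;
}
```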

CRONO [12] is a benchmark suite for graph algorithms on shared-memory multicore machines. CRONO’s CC algorithm is implemented using pthreads and maintains an array to store each vertex’s label (component ID). It then loops over all the vertices, iteratively updating their labels based on the connectivity.


CHAPTER 3

RELATED WORK

Hopcroft and Tarjan’s [15] approach was one of the first algorithms devised to compute connected components serially. It is simple and was already described as “well known” at the time. It runs in linear time as it visits each vertex and each edge once. Later algorithms were devised based on this work. Various graph processing libraries such as Boost [9], Lemon [14], igraph [13], etc. include serial code to compute CC.

Shiloach and Vishkin [1] proposed a different approach involving Hooking and Pointer Jumping to compute CC in parallel. Their algorithm is suitable for GPUs. Soman et al. [2] adapted this algorithm by modifying the pointer jumping into multiple pointer jumping. Though the number of iterations taken by their approach is the same as in Shiloach-Vishkin’s algorithm, multiple pointer jumping drastically reduces the number of reads and writes to memory, thus improving the performance.

Sutton et al. [8] proposed a variant of Soman’s algorithm by introducing an Atomic Hook and a single Multi Pointer Jump. By using atomics, it locks one of the vertices of an edge until the hook operation succeeds, which eliminates the CPU-side convergence loop. Instead of multiple calls to the pointer jumping kernel, it works with a single call to the multi pointer jump kernel, thus reducing the GPU-CPU communication overhead. Further, this algorithm splits the input graph into edge-list segments, which minimizes the number of atomic operations in the next Hook.


Ligra’s connected components algorithm is simple and its BFSCC algorithm is based on its own BFS algorithm. The authors claim that it is efficient and scalable.

Ligra+ is an optimized version of Ligra. As it works on compressed graphs, it requires a smaller memory footprint, making it possible to fit larger graphs into the available memory. It is faster than Ligra when using the fast compression scheme.

CRONO [12] does not scale well to large inputs. It allocates four arrays of size |V| and two two-dimensional arrays of size (|V| * Degree_max), where Degree_max indicates the graph’s maximum vertex degree. As a result, it runs out of memory for larger graphs. In fact, the graphs that they used are relatively small. Their largest graph is about 15 times smaller than my largest input graph.


CHAPTER 4

ECL ALGORITHM AND IMPLEMENTATIONS

This section discusses my ECL algorithm, various optimizations to improve its performance, and its implementation on various platforms (GPU, Parallel and Serial CPU).

4.1 ECL Base Algorithm

The ECL CC Algorithm is simple and straightforward. It uses three main functions (kernels in CUDA) - “init”, “compute” and “flatten”. Algorithm 9 shows the overall layout. Each vertex has a label to hold the component ID to which it belongs. The algorithm chooses the ID of the minimum vertex in each component as the component ID to guarantee uniqueness of the labels. Most of the existing CC algorithms initialize the label of each vertex with the vertex ID. The ECL code optimizes the init function by initializing the label of each vertex with the first smaller neighbor ID it encounters. This improves performance as it moves the computation towards the minimum vertex ID in each component faster.

The compute function visits each vertex and processes each edge of that vertex so that both ends of the edge have the same ID. It uses a representative function on each vertex and its neighbors. The representative function follows the path from a vertex to the vertex its ID points to, then to that vertex’s ID, and so on, until it finds the end of the chain, i.e., a vertex whose ID points to itself, which is then assigned to the starting vertex. This is a variant of pointer jumping as it updates the label of each of the visited vertices with a better (smaller) label.


It has a special “if” condition to make sure that each edge is considered only once, i.e., in one direction but not the other.

After the compute function, all the vertices are labelled either directly or indirectly with the ID of the lowest vertex in each of the components. The flatten function, a form of pointer jumping, visits all vertices and updates the label so that it represents the component ID directly.

Since CC is an irregular graph application, iteratively processing all vertices by assigning them to separate threads would result in load imbalance as the vertex degrees vary to a great extent. The CUDA implementation of CC uses a double-sided worklist to efficiently distribute work to each thread.

Algorithm 9. ECL CC Overall Algorithm

1. procedure: ECL CC (V, E)

2. Init (V, nstat)

3. Compute (V, E, nstat)

4. Flatten (V, nstat)

Algorithm 10. ECL CC Algorithm - Init Function

1. procedure: Init (V, nstat)

2. nstat = {0, …, |V|-1}

3. for each vertex v in V

4. nstat[v] ← first neighbor smaller than v, if any

Algorithm 11. ECL CC Algorithm - Compute Function

1. procedure: Compute (V, E, nstat)


2. for each v in V

3. vstat ← representative (v, nstat)

4. for each edge (u, v) in E

5. if (v > u)

6. ostat ← representative (u, nstat)

7. if (vstat < ostat)

8. nstat[ostat] ← vstat

9. else

10. nstat[vstat] ← ostat

11. end if

12. end if

13. end for

14. end for

Algorithm 12. ECL CC Algorithm - Flatten Function

1. procedure: Flatten (V, nstat)

2. for each vertex v in V

3. vstat ← nstat[v]

4. while (vstat > nstat[vstat])

5. vstat ← nstat[vstat]

6. end while

7. nstat[v] ← vstat

8. end for


Algorithm 13. ECL CC Algorithm - Representative Function

1. procedure: Representative (v, nstat)

2. curr ← nstat[v]

3. if (curr != v)

4. prev ← v

5. next ← nstat[curr]

6. while (curr > next)

7. nstat[prev] ← next

8. prev ← curr

9. curr ← next

10. next ← nstat[curr]

11. end while

12. end if

13. return curr

In addition to the two arrays (neighborlist & neighborindex) used for representing the graph, the algorithm maintains an array ‘nstat’ of size |V|, which stores the component ID of each of the vertices. The CUDA implementations maintain another array “worklist” of size |V| to store the vertices.
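To make the interplay of these functions concrete, the following is a serial C sketch of Algorithms 9 through 13 operating on the CSR arrays (nidx, nlist) and the nstat label array. It is close in spirit to the serial CPU version described in Section 4.4; the CUDA and OpenMP versions add worklists and atomic operations on top of this skeleton, and the exact code here is illustrative.

```c
/* Serial C sketch of the ECL base algorithm (Algorithms 9-13); illustrative only. */

static int representative(int v, int *nstat) {
    int curr = nstat[v];
    if (curr != v) {                        /* follow the chain of labels */
        int prev = v, next = nstat[curr];
        while (curr > next) {               /* shortcut entries along the way */
            nstat[prev] = next;
            prev = curr;
            curr = next;
            next = nstat[curr];
        }
    }
    return curr;                            /* end of the chain: a self-pointing label */
}

void ecl_cc_serial(int nv, const int *nidx, const int *nlist, int *nstat) {
    /* init: label each vertex with its first smaller neighbor, if any */
    for (int v = 0; v < nv; v++) {
        int lab = v;
        for (int i = nidx[v]; i < nidx[v + 1]; i++)
            if (nlist[i] < v) { lab = nlist[i]; break; }
        nstat[v] = lab;
    }
    /* compute: hook the representatives of the two ends of every edge */
    for (int v = 0; v < nv; v++) {
        int vstat = representative(v, nstat);
        for (int i = nidx[v]; i < nidx[v + 1]; i++) {
            int u = nlist[i];
            if (v > u) {                    /* consider each edge in one direction only */
                int ostat = representative(u, nstat);
                if (vstat < ostat) nstat[ostat] = vstat;
                else { nstat[vstat] = ostat; vstat = ostat; }
            }
        }
    }
    /* flatten: make every label point directly at its component ID */
    for (int v = 0; v < nv; v++) {
        int vstat = nstat[v];
        while (vstat > nstat[vstat]) vstat = nstat[vstat];
        nstat[v] = vstat;
    }
}
```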

4.2 ECL - GPU Implementation

I implemented the ECL CC Algorithm in CUDA for GPU devices. It contains GPU-specific optimizations to improve the performance. It consists of three main kernels - init, compute1, and flatten. It also has two more compute kernels (compute2 & compute3) to process the vertices in the worklist.


As described above, the init kernel initializes the vertex labels with the first neighbor, if any, whose ID is smaller than the vertex ID. The algorithm divides the processing of the vertices among the three compute kernels. It maintains a double-sided worklist, and the compute1 kernel fills this worklist with vertices on both sides based on their edge degree, as shown in Figure 4.1. The two threshold values (16 and 352) were identified by trial and error. The compute1 kernel pushes vertices with degree greater than 16 and less than or equal to 352 to the front of the worklist and vertices with degree greater than 352 to the rear of the worklist, which are later read by the other compute kernels. The compute1 kernel itself processes the vertices with degree of at most 16, i.e., all the threads in the compute1 kernel process low-degree vertices and no thread has to wait long for other threads to finish their computation, resulting in better load balancing. Each thread processes the assigned vertex and loops over each neighbor, updating the vertex labels, until there is no change in the vertex’s label or its neighbors’ labels.

[Front of worklist: vertices with 16 < d(v) ≤ 352; rear of worklist: vertices with d(v) > 352]

Figure 4.1 Double-sided worklist
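The degree-based partitioning performed by compute1 can be sketched in plain C as follows. The thresholds 16 and 352 are the ones stated above; the CUDA kernel advances the front and rear indices with atomic operations instead of the plain increments used in this serial, illustrative sketch.

```c
/* Illustrative partitioning of vertices into a double-sided worklist by degree.
 * Medium-degree vertices go to the front (later processed per warp), very
 * high-degree vertices to the rear (later processed per thread block). */
void fill_worklist(int nv, const int *nidx, int *worklist, int *front_cnt, int *rear_cnt) {
    int front = 0;          /* next free slot at the front */
    int rear = nv - 1;      /* next free slot at the rear */
    for (int v = 0; v < nv; v++) {
        int degree = nidx[v + 1] - nidx[v];
        if (degree > 352) {
            worklist[rear--] = v;        /* very high degree: rear of the worklist */
        } else if (degree > 16) {
            worklist[front++] = v;       /* medium degree: front of the worklist */
        }
        /* vertices with degree <= 16 are processed directly by compute1 */
    }
    *front_cnt = front;                  /* number of front entries */
    *rear_cnt = (nv - 1) - rear;         /* number of rear entries */
}
```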

The compute2 kernel reads vertices from the front of the worklist and utilizes warp-level parallelism in the GPU, i.e., each warp processes a single vertex. A warp in a GPU is a set of 32 contiguous threads. All the threads within a warp execute the same instruction in the same clock cycle, and they can exchange data with each other using shuffle machine instructions without explicit synchronization. In the compute2 kernel, each warp reads from the front of the worklist, and all the threads within the warp share the work of processing that vertex’s neighbors using a cyclic assignment.

The compute3 kernel reads vertices from the rear of the worklist and utilizes block-level parallelism in the GPU. A block is a group of threads. Each block can hold up to 1024 threads; the programmer decides the actual number of threads per block, and I use 256 threads per block. All the threads in a block have access to shared memory, which enables fast data exchange. It requires explicit synchronization, which can be achieved with the CUDA instruction __syncthreads(). The compute3 kernel uses block-level parallelism as it processes vertices with large numbers of neighbors. It assigns each thread block to process a single vertex, and all the threads within the block share the work of processing the neighbors and make sure that the two ends of each edge have the same component ID.

After the compute kernels are done, all the vertex labels directly or indirectly point to the ID of the lowest vertex in each connected component. The flatten kernel updates all the vertex labels so that they directly refer to the component ID.

ECL uses atomic operations to update the vertex labels in the compute kernels as more than one thread might be updating a vertex’s label. Moreover, it uses atomic operations to insert vertices in the worklist and to read them from the worklists.

In terms of space complexity, ECL uses two arrays, “neighborlist” and “neighborindex”, of size |E| and |V| + 1, respectively, to represent the input graph. It uses a double-sided worklist of size |V| for load balancing. Overall, ECL’s space complexity is O(|V| + |E|).


4.3 ECL - Parallel CPU Implementation

I implemented the ECL CC algorithm in OpenMP for multicore CPUs. It has the same three functions as mentioned in Algorithm 9. The init and compute functions use a “guided” schedule to allocate vertices to threads. The algorithm assigns a vertex to each thread. Each thread processes its vertex and loops over each neighbor until there is no change in their labels. To avoid data races, this implementation uses a GCC-specific atomic compare-and-swap operation (__sync_val_compare_and_swap) to update the vertex labels.
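A hedged sketch of this OpenMP compute step is shown below. It reuses the representative() routine from the serial sketch in Section 4.1 and retries the compare-and-swap when another thread has re-rooted one of the trees; the retry structure and names are illustrative rather than the exact thesis code.

```c
/* representative() is the chain-following routine from the earlier serial sketch. */
int representative(int v, int *nstat);

/* Illustrative OpenMP version of the compute step: a guided schedule spreads the
 * irregular per-vertex work, and __sync_val_compare_and_swap hooks one root under
 * another without losing concurrent updates. */
void ecl_compute_omp(int nv, const int *nidx, const int *nlist, int *nstat) {
    #pragma omp parallel for schedule(guided)
    for (int v = 0; v < nv; v++) {
        int vstat = representative(v, nstat);
        for (int i = nidx[v]; i < nidx[v + 1]; i++) {
            int u = nlist[i];
            if (v > u) {
                int ostat = representative(u, nstat);
                while (vstat != ostat) {              /* retry until the hook sticks */
                    if (vstat < ostat) {
                        int old = __sync_val_compare_and_swap(&nstat[ostat], ostat, vstat);
                        if (old == ostat) break;      /* hooked ostat under vstat */
                        ostat = old;                  /* ostat was re-rooted; retry */
                    } else {
                        int old = __sync_val_compare_and_swap(&nstat[vstat], vstat, ostat);
                        if (old == vstat) { vstat = ostat; break; }
                        vstat = old;                  /* vstat was re-rooted; retry */
                    }
                }
            }
        }
    }
}
```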

4.4 ECL - Serial CPU Implementation

I implemented the ECL CC algorithm in C for serial CPU devices. It is straightforward and does not involve any worklists or atomic operations as there are no load balancing issues or data races in the serial code.

4.5 ECLaf – Atomic-Free CC Implementation

The ECL algorithm uses atomic operations to prevent data races, as there are situations where multiple threads try to access the same memory location. These operations prevent data races by ensuring that no other thread can access a given memory location until the operation is done. However, they can be a significant bottleneck as the atomic operations are performed serially and any thread that tries to access that memory location must wait until the atomic operation is done. In fact, future systems may be so massively parallel that implementing fast atomics will be difficult.

I propose an atomic-free CC algorithm, ECLaf, for parallel codes, which loops over the compute kernel(s) to avoid atomic operations. Simply eliminating the atomic operations would allow data races, which in turn could result in incorrect computations. To compensate, this algorithm repeatedly calls the compute kernels until the result no longer changes. A variable ‘reiterate’ decides whether to loop over the compute kernel again or not. Algorithms 14 and 15 show the pseudocode of the general ECLaf algorithm and its compute function, respectively. Similar to the ECL GPU implementation, ECLaf’s GPU implementation groups vertices based on their edge degree and uses three compute kernels for load balancing. Each compute kernel has its own ‘reiterate’ variable, which is set whenever a vertex’s label or one of its neighbors’ labels is updated. The algorithm reiterates the compute kernels until no vertex label changes during a pass.

Algorithm 14. General ECLaf - Overall Algorithm

1. procedure: ECLaf CC (V, E)

2. Init (V, nstat)

3. reiterate ← 1

4. do

5. if reiterate

6. reiterate ← 0

7. Compute (V, E, nstat, &reiterate)

8. end if

9. while (reiterate)

10. Flatten (V, nstat)

Algorithm 15. ECLaf CC Algorithm - Compute Function

1. procedure: Compute (V, E, nstat, *reiterate)

2. for each vertex v in V

3. vstat ← representative (v, nstat)

4. for each edge (u, v) in E

5. if (v > u)

6. ostat ← representative (u, nstat)

7. if (vstat != ostat)

8. *reiterate ← 1

9. if (vstat < ostat)

10. nstat[ostat] ← vstat

11. else

12. nstat[vstat] ← ostat

13. end if

14. end if

15. end if

16. end for

17. end for

I implemented ECLaf in OpenMP for parallel CPU devices. As discussed above, by repeatedly calling the compute function, it eliminates the GCC-specific atomic operation (__sync_val_compare_and_swap). It uses a “guided” schedule to dynamically allocate work to each thread, and the chunk size shrinks as the program runs to minimize load imbalance. Since the CPU implementation does not face significant load-balancing issues, it uses only a single compute function, whose own ‘reiterate’ variable decides whether to loop over the compute function again.
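The following OpenMP sketch illustrates the atomic-free scheme of Algorithms 14 and 15: the compute pass performs plain (non-atomic) label writes and sets a shared reiterate flag whenever it hooks two trees, and the surrounding loop repeats the pass until the flag stays clear. It again reuses representative() from the earlier serial sketch and is illustrative only.

```c
/* representative() is the chain-following routine from the earlier serial sketch. */
int representative(int v, int *nstat);

/* Illustrative atomic-free OpenMP compute loop (Algorithms 14 and 15): no CAS is
 * used; lost updates are tolerated because the pass repeats until nothing changes. */
void eclaf_cc_omp(int nv, const int *nidx, const int *nlist, int *nstat) {
    int reiterate = 1;
    while (reiterate) {
        reiterate = 0;
        #pragma omp parallel for schedule(guided)
        for (int v = 0; v < nv; v++) {
            int vstat = representative(v, nstat);
            for (int i = nidx[v]; i < nidx[v + 1]; i++) {
                int u = nlist[i];
                if (v > u) {
                    int ostat = representative(u, nstat);
                    if (vstat != ostat) {
                        reiterate = 1;   /* benign race: every writer stores 1 */
                        if (vstat < ostat) nstat[ostat] = vstat;
                        else { nstat[vstat] = ostat; vstat = ostat; }
                    }
                }
            }
        }
    }
}
```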


CHAPTER 5

EVALUATION METHODOLOGY

5.1 GPU and CPU Machines

In this thesis, I have implemented my CC algorithm in parallel both for GPUs and CPUs in CUDA and OpenMP, respectively, and as a serial C code. I tested the parallel CUDA implementations and related benchmarks on two GPUs, a Titan X and a K40.

The first is a GeForce GTX Titan X, which is based on the Maxwell architecture. The second is a Tesla K40c, which is based on the Kepler architecture. The Titan X has 3072 processing elements distributed over 24 multiprocessors that can hold the contexts of 49,152 threads. Each multiprocessor has 96 kB of shared memory and 48 kB of L1/texture cache. The 24 multiprocessors share a 2 MB L2 cache as well as 12 GB of global memory with a theoretical peak bandwidth of 336 GB/s. I use the default clock frequencies of 1.1 GHz for the processing elements and 3.5 GHz for the GDDR5 memory.

The K40 has 2880 processing elements distributed over 15 multiprocessors that can hold the contexts of 30,720 threads. Each multiprocessor has 48 kB of texture memory as well as 64 kB of cache that is split between the shared memory and the L1 data cache. The 15 multiprocessors share a 1.5 MB L2 cache as well as 12 GB of global memory with a peak bandwidth of 288 GB/s. I disabled ECC protection of the main memory and used the default clock frequencies of 745 MHz for the processing elements and 3 GHz for the GDDR5 memory. Both GPUs are plugged into 16x PCIe 3.0 slots in the same system, which has dual 10-core Xeon E5-2687W v3 CPUs running at 3.1 GHz. The host memory size is 128 GB and the operating system is Fedora release 23.

I test the parallel CPU implementations on two machines. Machine 1 uses an Intel(R) Xeon(R) CPU E5-2687W based x86_64 architecture, clocked at 3.10 GHz. It has 2 sockets and 10 cores per socket. As it supports hyper-threading, it can run 40 threads simultaneously. It has a 32 kB L1 data cache and a 256 kB L2 cache. Machine 2 uses an Intel(R) Xeon(R) CPU X5690 based x86_64 architecture, clocked at 3.47 GHz. It has 2 sockets, 6 cores per socket and does not support hyper-threading, meaning it can run 12 threads simultaneously. It also has a 32 kB L1 data cache and a 256 kB L2 cache.

5.2 Input Graphs Format and Specifications

I use 18 graphs as inputs, which include road maps (europe_osm, USA-road-d.NY and USA-road-d.USA), a grid (2d-2e20.sym), a random graph (r4-2e23.sym), RMAT graphs (rmat16.sym and rmat22.sym), a synthetic Kronecker graph from the Graph500 (kron_g500-logn21), a product co-purchasing graph (amazon0601), Internet topology graphs (as-skitter and internet), publication citation graphs (citationCiteseer and coPapersDBLP), a patent citation graph (cit-Patents), a Delaunay triangulation (delaunay_n24), an online journal community graph (soc-livejournal), and web-link graphs (in-2004 and uk-2002). The Center for Discrete Mathematics and Theoretical Computer Science at the University of Rome [16], the Galois framework [17], the Stanford Network Analysis Platform [18], and the University of Florida Sparse Matrix Collection [19] provided these graphs. The graph sizes vary from 65K vertices and 387K edges for the smallest graphs to 18M vertices and 523M edges for the largest graph. Table 5.1 shows the graph sizes and other pertinent information.

Table 5.1 Input Graph Information

S. No   Graph Name   No. of Vertices   No. of Edges   Min Degree   Max Degree   Avg Degree   No. of CC

1 2d-2e20 1,048,576 4,190,208 2 4 3 1

2 amazon0601 403,394 4,886,816 1 2752 12 7

3 as-skitter 1,696,415 2,219,059 1 35455 1 756

4 citationCiteseer 268,495 2,313,294 1 1318 8 1

5 cit-Patents 3,774,768 33,037,894 1 793 8 3,627

6 coPapersDBLP 540,486 30,491,458 1 3299 56 1

7 delaunay_n24 16,777,216 100,663,202 3 26 5 1

8 europe_osm 50,912,018 108,109,320 1 13 2 1

9 in-2004 1,382,908 27,182,946 0 21869 19 134

10 internet 124,651 387,240 1 151 3 1

11 kron_g500-logn21 2,097,152 182,081,864 0 213904 86 553,159

12 r4-2e23 8,388,608 67,108,846 2 26 7 1

13 rmat16 65,536 967,866 0 569 14 3,900

14 rmat22 4,194,304 65,660,814 0 3687 15 428,640

15 soc-livejournal 4,847,571 85,702,474 0 20333 17 1,876

16 uk-2002 18,520,486 523,574,516 0 194955 28 38,359

17 USA-NY 264,346 730,100 1 8 2 1

18 USA-USA 23,947,347 57,708,624 1 9 2 1


I use the Compressed Adjacency List format to represent all the graphs, which is quite space efficient yet still allows the neighbors of a vertex to be identified quickly. It is based on two arrays: “neighborlist” of size |E| is simply the concatenation of all the adjacency lists, and “neighborindex” of size |V| + 1 stores the starting point of each adjacency list. It needs one extra element to indicate the end of the last adjacency list. “neighborindex” helps to find the number of edges of each vertex, and “neighborlist” indexed by “neighborindex” can be used to iterate over the edges. For example, Figure 5.1 shows a graph with 4 vertices and 7 edges, and Figure 5.2 shows the corresponding Compressed Adjacency List format.

Figure 5.1 Graph

Adjacency Lists

A: B, C

B: C, D

C: B

D: A, B


Combined Adjacency list, L = B, C, C, D, B, A, B

Figure 5.2 Compressed Adjacency List Representation
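The small C program below hard-codes this example (mapping A, B, C, D to 0, 1, 2, 3) to show how the two arrays are traversed; the neighborindex values 0, 2, 4, 5, 7 follow directly from the adjacency lists shown above.

```c
#include <stdio.h>

/* The compressed adjacency list of the graph in Figures 5.1/5.2,
 * with A=0, B=1, C=2, D=3 (vertex labels mapped to integers). */
int main(void) {
    const int neighborindex[] = {0, 2, 4, 5, 7};          /* start of each adjacency list */
    const int neighborlist[]  = {1, 2, 2, 3, 1, 0, 1};    /* B,C | C,D | B | A,B */
    const char *name = "ABCD";
    const int nv = 4;

    for (int v = 0; v < nv; v++) {
        /* degree of v = neighborindex[v+1] - neighborindex[v] */
        printf("%c (degree %d):", name[v], neighborindex[v + 1] - neighborindex[v]);
        for (int i = neighborindex[v]; i < neighborindex[v + 1]; i++)
            printf(" %c", name[neighborlist[i]]);
        printf("\n");
    }
    return 0;
}
```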

5.3 Compiler Information

I use nvcc 8.0 for compiling my CUDA CC implementation and for the other CUDA-based benchmarks. I compile all the CPU codes using g++ 5.3.1. In all cases, I use the -O3 optimization flag.


CHAPTER 6

RESULTS

6.1 Configurations

I evaluate my CC CUDA implementations (including the atomic-free version) on two different GPUs. I run all the GPU benchmarks on both GPUs using the same compiler flags. I compare performance in terms of slowdown with respect to my CC implementation. I follow a similar procedure for the performance comparison of the parallel and serial CPU codes.

6.2 Comparison with Parallel GPU Benchmarks

This subsection compares my CC implementation (ECL) with other GPU benchmarks on the Titan X and K40 GPUs. I run all codes on all 18 input graphs and normalize their runtimes. I fix the ECL values at 1.0 and calculate slowdowns for the other benchmarks with respect to ECL. Figure 6.1 shows the slowdowns of all GPU benchmarks on the Titan X. Figure 6.2 shows the slowdowns on the K40. Bars in the charts that are higher than 1.0 indicate codes that are slower than ECL, and there is a reference line in the charts for easier comparison. Average (geometric-mean) slowdowns are shown for all benchmarks in the charts.
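The normalization just described amounts to dividing each benchmark's runtime by the corresponding ECL runtime and summarizing the per-input slowdowns with a geometric mean, as in the small C helper below (illustrative only).

```c
#include <math.h>

/* Slowdown of a benchmark on input i is runtime[i] / ecl_runtime[i]; the summary
 * value reported in the charts is the geometric mean of these per-input slowdowns. */
double geometric_mean_slowdown(const double *runtime, const double *ecl_runtime, int n) {
    double log_sum = 0.0;
    for (int i = 0; i < n; i++)
        log_sum += log(runtime[i] / ecl_runtime[i]);   /* accumulate in log space */
    return exp(log_sum / n);
}
```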


Figure 6.1 Parallel GPU Benchmarks Slowdowns relative to ECL on the Titan X

Figure 6.1 shows the slowdowns in GPU benchmarks with respect to ECL on the Titan X. ECL is the fastest on 16 input graphs and Groute is 1.3x faster than ECL on the remaining two input graphs. On average, ECL is 1.8x faster than Groute, 4x faster than Soman’s, 6.4x faster than LSG and 8.4x faster than Gunrock.


Figure 6.2 Parallel GPU Benchmarks Slowdowns relative to ECL on the K40
Note: Bars exceeding a height of 30 are cut off.

Figure 6.2 shows the slowdowns in GPU benchmarks with respect to ECL on the K40. ECL is the fastest on 14 input graphs and Groute is 1.4x faster than ECL on the remaining four input graphs. On average, ECL is 1.6x faster than Groute, 4.3x faster than Soman’s, 5.8x faster than LSG and 9x faster than Gunrock.


6.3 Comparison with Parallel CPU Benchmarks

Figure 6.3 Parallel CPU Benchmarks Slowdowns relative to ECL on the Zurich
Note: Bars exceeding a height of 50 are cut off.

Figure 6.3 shows the slowdowns in Parallel CPU benchmarks with respect to ECL on Zurich. ECL is 4.2x faster than the existing fastest benchmark, Ligra+ BFSCC, on more than half of the input graphs and it beats the other benchmarks on overall performance. On average, ECL is 1.4x faster than Ligra+’s BFSCC, 2.2x faster than Ligra+’s Components, 3.4x faster than CRONO, 4.7x faster than Asynchronous LS, 9.5x faster than Synchronous LS and 13.6x faster than Blocked Asynchronous LS. CRONO does not support 5 input graphs (as-skitter, in-2004, kron_g500-logn21, soc-livejournal & uk-2002) as it runs out of memory.


Figure 6.4 Parallel CPU Benchmarks Slowdowns relative to ECL on the Denver
Note: Bars exceeding a height of 50 are cut off.

Figure 6.4 shows the slowdowns in Parallel CPU benchmarks with respect to ECL on Denver. ECL is faster than the existing fastest benchmark, Ligra+ BFSCC, on more than half of the input graphs by 3.2x and it beats the other benchmarks on overall performance. On average, ECL is 1.7x faster than Ligra+’s BFSCC, 7.1x faster than Ligra+’s Components, 6.8x faster than CRONO, 22.8x faster than Asynchronous LS, 35.2x faster than Synchronous LS and 47.1x faster than Blocked Asynchronous LS.


6.4 Comparison with Serial CPU Benchmarks

Figure 6.5 Serial CPU Benchmarks Slowdowns relative to ECL on the Zurich

Figure 6.5 shows the slowdowns in Serial benchmarks with respect to ECL on Zurich. ECL is 3.1x faster than LS Serial on 16 input graphs. On overall comparison, ECL is 2.6x faster than LS Serial, 5.2x faster than Boost, 6.7x faster than igraph and 9.1x faster than Lemon.


Figure 6.6 Serial CPU Benchmarks Slowdowns relative to ECL on the Denver

Figure 6.6 shows the slowdowns in Serial benchmarks with respect to ECL on Denver. ECL is the fastest on all the input graphs. Overall, ECL is 5.3x faster than Boost, 7.9x faster than igraph, 8.1x faster than LS Serial and 11x faster than Lemon.


6.5 Performance Comparison across different Configurations

Figure 6.7 Geometric-Mean Slowdowns of CC Implementations on different systems

Figure 6.7 compares the average slowdowns for all the benchmarks across different configurations - GPU, Parallel and Serial CPU. I calculate the slowdowns for all the benchmarks with respect to the GPU ECL runtimes on the Titan X. The chart considers the runtimes of the parallel and serial CPU benchmarks run on Zurich. As the chart shows, ECL GPU is the fastest of all the tested benchmarks. As expected, the GPU benchmarks are faster than the CPU codes. Interestingly, some of the parallel CPU codes are slower than the serial codes. In particular, all the parallel LS codes are slower than their respective serial implementations.


Figure 6.8 Throughput (Edges/s) of various CC Implementations on different systems

Figure 6.8 shows the average throughput for all the benchmarks across different configurations - Titan X GPU, Parallel and Serial CPU on Zurich. I calculate throughput in terms of the number of edges processed per second for all the benchmarks. As seen in the above chart, the ECL GPU code has the highest throughput of 5.295 billion edges per second. Most of the GPU benchmarks have higher throughput than the others. It is interesting to see that some of the serial codes have higher throughput than the parallel CPU codes.


Figure 6.9 Throughput (Nodes/s) of various CC Implementations on different systems

Figure 6.9 shows the average throughput, calculated in terms of nodes processed per second, for all the benchmarks across different configurations - Titan X GPU, Parallel and Serial CPU on Zurich. As seen in the above chart, the ECL GPU code has the highest throughput of 650 million nodes per second. Most of the GPU benchmarks have higher throughput than the parallel and serial CPU codes.


6.6 Performance Comparison - Atomic-Free Implementations

I implemented ECLaf, the atomic-free CC algorithm, both in CUDA and OpenMP and compared its performance with the parallel GPU and CPU benchmarks.

Figure 6.10 Parallel GPU Benchmarks Slowdowns relative to ECLaf on the Titan X

Figure 6.10 shows the slowdowns in GPU benchmarks with respect to ECLaf on the Titan X. Groute is 1.1x faster than ECLaf on average and is faster on more than half of the input graphs. On the remaining graphs, ECLaf is 1.9x faster than Groute. On overall comparison, it is 2x faster than Soman’s, 3.2x faster than LSG and 4.3x faster than Gunrock.


Figure 6.11 Parallel GPU Benchmarks Slowdowns relative to ECLaf on the K40

Figure 6.11 shows the slowdowns in GPU benchmarks with respect to ECLaf on the K40. Though Groute is 1.2x faster than ECLaf in overall performance, ECLaf beats Groute on half of the input graphs by an average of 1.4x. On overall comparison, ECLaf is 2.2x faster than Soman’s, 2.9x faster than LSG and 5.7x faster than Gunrock.


Figure 6.12 Parallel CPU Benchmarks Slowdowns relative to ECLaf on the Zurich
Note: Bars exceeding a height of 30 are cut off.

Figure 6.12 shows the slowdowns in Parallel CPU benchmarks with respect to ECLaf on Zurich. ECLaf is faster than Ligra+ BFSCC on half of the input graphs by 4.2x. On overall comparison, it is 1.1x faster than Ligra+ BFSCC, 1.7x faster than Ligra+ Components, 2.62x faster than CRONO, 3.5x faster than Asynchronous LS, 7.1x faster than Synchronous LS and 10.2x faster than Blocked Asynchronous LS.


Figure 6.13 Parallel CPU Benchmarks Slowdowns relative to ECLaf on the Denver
Note: Bars exceeding a height of 40 are cut off.

Figure 6.13 shows the slowdowns in Parallel CPU benchmarks with respect to ECLaf on Denver. ECLaf is faster than the existing fastest CPU benchmark, Ligra+ BFSCC, on more than half of the input graphs by 2.7x and it beats the other benchmarks on overall performance. On average, ECLaf is 1.4x faster than Ligra+’s BFSCC, 5.7x faster than Ligra+’s Components, 6.1x faster than CRONO, 18.2x faster than Asynchronous LS, 28.1x faster than Synchronous LS and 37.6x faster than Blocked Asynchronous LS.


CHAPTER 7

SUMMARY

7.1 Summary

Computing the connected components is an important graph application used in various fields. This thesis proposes two efficient algorithms for connected components - ECL and ECLaf. ECL is an asynchronous algorithm suitable for massively-parallel devices. My CUDA implementation of ECL includes optimizations to improve load balance and uses a variant of pointer jumping to speed up the calculation of the connected components. I also ported the ECL algorithm to parallel and serial CPU devices, where I implemented it in OpenMP and C, respectively.

Atomic operations play a significant role in many algorithms specific to parallel devices, as they prevent data races. However, these operations incur overhead as they briefly serialize the threads that execute atomic operations. ECLaf is an atomic-free synchronous implementation, which reiterates over the compute kernel(s) to solve the connected components problem. I implemented it for GPUs and parallel CPU devices. I tested both algorithms on 18 input graphs and compared their performance with corresponding programs from the literature. ECL shows a significant improvement in performance and, on average, it outperforms the existing fastest GPU algorithm by 1.7x. On CPUs, it is 1.6x faster than the fastest parallel CPU algorithm from the literature. The serial implementation is 5x faster than the fastest preexisting serial CC algorithm. Though ECLaf is atomic-free and therefore needs to perform multiple rounds of computation, its CUDA implementation shows comparable performance to the fastest GPU algorithm, and its OpenMP implementation is 1.2x faster than the fastest parallel CPU algorithm.

7.2 Future Work

In the future, researchers could evaluate the proposed connected components algorithms and optimize them for other architectures such as AMD CPUs and GPUs as well as ARM CPUs. Researchers could also extend these algorithms to multi-GPU and/or Xeon-Phi-based systems. It might also be interesting to study the energy efficiency in addition to the runtime and to add optimizations to improve that aspect.


APPENDIX SECTION

This section contains the runtimes for all the ECL codes and their corresponding benchmarks.


Table A.1 Parallel GPU Code Runtimes on the Titan X

Graph              ECL      ECLaf    Groute   Gunrock  LSG      Soman    Cusp
2d-2e20            0.0012   0.0020   0.0019   0.0122   0.0116   0.0069   0.0278
amazon0601         0.0009   0.0017   0.0009   0.0031   0.0065   0.0015   0.0513
as-skitter         0.0025   0.0025   0.0020   0.0081   0.0054   0.0037   4.4829
citationCiteseer   0.0006   0.0014   0.0042   0.0044   0.0069   0.0021   0.0110
cit-Patents        0.0146   0.0322   0.0586   0.1046   0.0709   0.0676   21.5052
coPapersDBLP       0.0025   0.0043   0.0028   0.0201   0.0119   0.0073   0.0182
delaunay_n24       0.0146   0.0225   0.0221   0.1706   0.0622   0.0667   0.0656
europe_osm         0.0290   0.0516   0.0217   0.2569   0.0995   0.1168   0.2388
in-2004            0.0026   0.0058   0.0073   0.0757   0.0154   0.0121   0.8626
internet           0.0002   0.0006   0.0003   0.0026   0.0052   0.0016   0.0091
kron_g500-logn21   0.0209   0.0376   0.0246   0.1918   0.1410   0.1345   2164.2301
r4-2e23            0.0216   0.0216   0.0238   0.0676   0.0546   0.0538   0.0503
rmat16             0.0003   0.0007   0.0017   0.0018   0.0041   0.0011   19.5460
rmat22             0.0205   0.0415   0.1261   0.1299   0.1095   0.1024   1700.5245
soc-livejournal    0.0127   0.0229   0.0140   0.0604   0.0431   0.0413   12.5244
uk-2002            0.0458   0.1282   0.1085   1.3891   0.2438   0.2297   494.1125
USA-NY             0.0002   0.0006   0.0003   0.0027   0.0076   0.0016   0.0133
USA-USA            0.0141   0.0374   0.0153   0.1680   0.0604   0.0671   0.0697


Table A.2 Parallel GPU Code Runtimes on the K40

Graph              ECL      ECLaf    Groute   Gunrock  LSG      Soman    Cusp
2d-2e20            0.0022   0.0034   0.0040   0.0237   0.0160   0.0125   0.0329
amazon0601         0.0011   0.0025   0.0012   0.0066   0.0072   0.0028   0.0377
as-skitter         0.0041   0.0046   0.0031   0.0176   0.0079   0.0070   5.0367
citationCiteseer   0.0009   0.0020   0.0032   0.0083   0.0073   0.0039   0.0105
cit-Patents        0.0194   0.0458   0.0647   0.2150   0.0904   0.0848   23.8706
coPapersDBLP       0.0045   0.0074   0.0040   0.0484   0.0183   0.0141   0.0261
delaunay_n24       0.0210   0.0309   0.0324   0.4026   0.1210   0.1230   0.1184
europe_osm         0.0442   0.0783   0.0349   0.5846   0.1645   0.2134   0.3267
in-2004            0.0046   0.0096   0.0123   0.2339   0.0222   0.0192   0.9114
internet           0.0003   0.0009   0.0004   0.0053   0.0056   0.0022   0.0081
kron_g500-logn21   0.0401   0.0749   0.0436   0.3511   0.2452   0.2216   2743.4945
r4-2e23            0.0298   0.0298   0.0375   0.0931   0.0667   0.0672   0.0731
rmat16             0.0004   0.0010   0.0018   0.0029   0.0044   0.0019   21.5828
rmat22             0.0270   0.0553   0.1122   0.2141   0.1505   0.1457   1941.3319
soc-livejournal    0.0201   0.0350   0.0186   0.1134   0.0698   0.0678   13.9202
uk-2002            0.0835   0.2376   0.2129   5.8692   0.4801   0.4668   580.6068
USA-NY             0.0004   0.0009   0.0004   0.0046   0.0096   0.0024   0.0146
USA-USA            0.0238   0.0573   0.0275   0.3624   0.1111   0.1328   0.1151


Table A.3 Parallel CPU Code Runtimes on Zurich

Graph              ECL      Ligra+ 1  Ligra+ 2  CRONO    LS_1     LS_2     LS_3
2d-2e20            0.0489   0.0888    0.2628    0.1607   0.2407   0.2967   0.2967
amazon0601         0.0477   0.0053    0.0116    0.1100   0.1874   0.1573   0.1573
as-skitter         0.0637   0.0756    0.1126    NA       1.2143   0.6879   0.6879
citationCiteseer   0.0550   0.0026    0.0064    0.1097   0.0963   0.0901   0.0901
cit-Patents        0.1108   0.3334    0.3191    0.4839   2.9170   1.8954   1.8954
coPapersDBLP       0.0731   0.0062    0.0471    0.1752   1.4819   0.3461   0.3461
delaunay_n24       0.1373   0.2021    3.8830    1.5323   4.3657   3.4851   3.4851
europe_osm         0.1787   7.0460    26.5400   1.5742   4.5203   7.9685   7.9685
in-2004            0.0525   0.0465    0.0410    NA       0.9930   0.3160   0.3160
internet           0.0377   0.0024    0.0030    0.0431   0.0158   0.0172   0.0172
kron_g500-logn21   0.1173   6.1170    0.2280    NA       14.5817  3.7193   3.7193
r4-2e23            0.1198   0.0590    0.2113    0.3631   5.2150   7.9287   7.9287
rmat16             0.0387   0.0512    0.0018    0.0341   0.0383   0.0232   0.0232
rmat22             0.0830   6.3520    0.2745    0.5992   6.9970   3.9790   3.9790
soc-livejournal    0.0890   0.1231    0.5110    NA       6.6913   4.3803   4.3803
uk-2002            0.1655   0.7545    0.7275    NA       29.7646  6.8136   6.8136
USA-NY             0.0278   0.0292    0.0812    0.0741   0.0216   0.0305   0.0305
USA-USA            0.1170   0.6957    43.1600   1.4374   2.7470   3.6081   3.6081

Ligra+ 1 - Ligra+ BFSCC
Ligra+ 2 - Ligra+ Components
LS_1 - LS Blocked Asynchronous
LS_2 - LS Synchronous
LS_3 - LS Asynchronous


Table A.4 Parallel CPU Code Runtimes on Denver

Graph              ECL      Ligra+ 1  Ligra+ 2  CRONO    LS_1     LS_2     LS_3
2d-2e20            0.0273   0.0576    0.6800    0.1993   0.7829   0.8382   0.5053
amazon0601         0.0182   0.0088    0.0349    0.0780   0.4623   0.3530   0.1667
as-skitter         0.0262   0.0529    0.1410    NA       2.5565   1.6440   1.4935
citationCiteseer   0.0175   0.0043    0.0277    0.1015   0.2275   0.1917   0.1074
cit-Patents        0.1564   0.2160    0.8130    0.5934   6.3997   4.7877   3.4271
coPapersDBLP       0.0329   0.0117    0.0597    0.1262   3.2563   1.0859   0.7203
delaunay_n24       0.1545   0.2260    4.4100    1.1776   7.7839   6.7026   4.9345
europe_osm         0.2874   0.9260    30.8000   2.0876   10.1316  15.8009  8.1270
in-2004            0.0471   0.0442    0.0826    NA       1.9132   0.9191   0.6771
internet           0.0132   0.0026    0.0067    0.0676   0.0329   0.0360   0.0243
kron_g500-logn21   0.2179   2.2300    0.4920    NA       29.1799  10.8242  9.6110
r4-2e23            0.1340   0.1310    0.5460    0.4207   9.9513   13.8737  6.3252
rmat16             0.0017   0.0194    0.0047    0.0451   0.0694   0.0547   0.0394
rmat22             0.2718   2.1100    0.7790    NA       15.4968  9.3222   6.5073
soc-livejournal    0.1756   0.1070    0.9720    NA       14.3023  9.3904   4.7901
uk-2002            0.4532   0.5270    1.4100    NA       55.7538  17.1402  12.0359
USA-NY             0.0101   0.0141    0.1120    0.0932   0.0541   0.0666   0.0433
USA-USA            0.2399   0.3980    70.2000   1.3166   6.3369   7.0886   4.9382

Ligra+ 1 - Ligra+ BFSCC
Ligra+ 2 - Ligra+ Components
LS_1 - LS Blocked Asynchronous
LS_2 - LS Synchronous
LS_3 - LS Asynchronous


Table A.5 Serial CPU Code Runtimes on Zurich

Graph              ECL      LS Serial  Boost    Lemon     igraph
2d-2e20            0.0427   0.1179     0.2813   0.2687    0.3192
amazon0601         0.0175   0.0469     0.1546   0.3517    0.1741
as-skitter         0.0562   0.6048     0.5116   1.4873    0.4849
citationCiteseer   0.0244   0.0315     0.0795   0.1722    0.1040
cit-Patents        0.2688   0.9639     1.9611   3.7353    2.3449
coPapersDBLP       0.0666   0.2025     0.3595   1.2438    0.4880
delaunay_n24       0.5100   1.4646     2.9137   2.5452    3.5161
europe_osm         0.8898   2.0601     5.8685   6.7573    10.8726
in-2004            0.0818   0.1675     0.2610   0.6426    0.3924
internet           0.0098   0.0057     0.0155   0.0087    0.0140
kron_g500-logn21   0.4472   2.9119     2.4865   17.9984   6.4585
r4-2e23            0.5136   2.3232     3.4911   8.3909    5.9166
rmat16             0.0072   0.0120     0.0241   0.0385    0.0243
rmat22             0.4585   2.0729     2.8357   7.9761    3.7217
soc-livejournal    0.4050   1.4531     2.4589   8.3611    3.6557
uk-2002            1.0040   3.7731     5.7324   15.7312   13.7288
USA-NY             0.0133   0.0082     0.0422   0.0176    0.0303
USA-USA            0.5993   1.2499     3.8238   2.4520    3.7154


Table A.6 Serial CPU Code Runtimes on Denver

Graph              ECL      LS Serial  Boost    Lemon     igraph
2d-2e20            0.0762   0.1141     0.2789   0.3010    0.3137
amazon0601         0.0312   0.0465     0.1285   0.3588    0.1760
as-skitter         0.0804   0.5945     0.4813   1.5809    0.4887
citationCiteseer   0.0191   0.0317     0.0857   0.1829    0.1043
cit-Patents        0.2095   0.9381     1.9496   3.8537    2.3879
coPapersDBLP       0.0659   0.2021     0.3675   1.2395    0.4811
delaunay_n24       0.5203   1.5231     2.7283   3.1990    3.4651
europe_osm         0.8984   2.0191     5.5534   8.1286    10.8305
in-2004            0.0550   0.1682     0.3324   0.6387    0.3887
internet           0.0100   0.0057     0.0149   0.0132    0.0224
kron_g500-logn21   0.4738   2.8016     2.5766   17.8190   6.7094
r4-2e23            0.4019   2.2554     3.4077   8.8993    6.0508
rmat16             0.0028   0.0121     0.0218   0.0689    0.0304
rmat22             0.2987   2.0697     2.8030   8.2363    3.9400
soc-livejournal    0.2449   1.4759     2.4893   8.2123    3.6784
uk-2002            1.0063   3.5618     5.4292   15.9495   8.5397
USA-NY             0.0118   0.0082     0.0420   0.0290    0.0305
USA-USA            0.5832   1.3121     3.6517   3.2654    3.8020


LITERATURE CITED

[1] Shiloach, Yossi, and Uzi Vishkin. “An O(log n) parallel connectivity algorithm.” Journal of Algorithms 3.1 (1982): 57-67.

[2] Soman, Kishore, and Narayanan. “A fast GPU algorithm for graph connectivity.” (2010).

[3] Wang, Yangzihao, et al. “Gunrock: A high-performance graph processing library on the GPU.” Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 2016.

[4] Shun, Julian, and Guy E. Blelloch. “Ligra: a lightweight graph processing framework for shared memory.” ACM SIGPLAN Notices. Vol. 48. No. 8. ACM, 2013.

[5] Shun, Julian, Laxman Dhulipala, and Guy E. Blelloch. “Smaller and faster: Parallel processing of compressed graphs with Ligra+.” Data Compression Conference (DCC), 2015. IEEE, 2015.

[6] Cormen, Thomas H. Introduction to Algorithms. MIT Press, 2009.

[7] Connectivity - https://en.wikipedia.org/wiki/Connectivity_(graph_theory)

[8] Sutton, Michael, et al. “Adaptive Work-Efficient Connected Components on the GPU.” arXiv preprint arXiv:1612.01178 (2016).

[9] Boost: http://www.boost.org/

[10] CUSP: http://cusplibrary.github.io/index.html

[11] Burtscher, Martin, Rupesh Nasre, and Keshav Pingali. “A quantitative study of irregular programs on GPUs.” Workload Characterization (IISWC), 2012 IEEE International Symposium on. IEEE, 2012.

[12] Ahmad, Masab, et al. “CRONO: A benchmark suite for multithreaded graph algorithms executing on futuristic multicores.” Workload Characterization (IISWC), 2015 IEEE International Symposium on. IEEE, 2015.

[13] igraph: https://github.com/igraph

[14] Lemon: http://lemon.cs.elte.hu/pub/tutorial/index.html