Ph.D. Dissertation
Parallel Algorithms and Dynamic Data Structures on the Graphics Processing Unit: a warp-centric approach

By SAMAN ASHKIANI

DISSERTATION

Submitted in partial satisfaction of the requirements for the degree of DOCTOR OF PHILOSOPHY in Electrical and Computer Engineering in the OFFICE OF GRADUATE STUDIES of the UNIVERSITY OF CALIFORNIA, DAVIS

Approved:
John D. Owens, Chair
Nina Amenta
Venkatesh Akella

Committee in Charge
2017

To my parents, Mina and Farshad; and my brother, Soheil.

CONTENTS

List of Figures ... viii
List of Tables ... ix
Abstract ... x
Acknowledgments ... xii

1 Introduction ... 1
1.1 Overview ... 1
1.1.1 Our contributions ... 6
1.2 Preliminaries: the Graphics Processing Unit ... 8
1.2.1 CUDA Terminology ... 8
1.2.2 GPU Memory Hierarchy ... 9
1.2.3 Warp-wide communications ... 10
1.2.4 CUDA Built-in intrinsics ... 11

2 Parallel Approaches to the String Matching Problem on the GPU ... 12
2.1 Introduction ... 12
2.2 Prior work ... 15
2.3 Preliminaries ... 16
2.3.1 Serial Rabin-Karp ... 16
2.3.2 Cooperative Rabin-Karp ... 17
2.4 Divide-and-Conquer Strategy ... 19
2.4.1 Theoretical analysis with finite processors ... 20
2.4.2 Theoretical Conclusions ... 21
2.5 Summary of Results ... 22
2.6 Implementation details ... 24
2.6.1 Cooperative RK ... 24
2.6.2 Divide-and-Conquer RK ... 24
2.6.3 Hybrid RK ... 25
2.7 Expanding beyond the simple RK ... 26
2.7.1 General alphabets ... 26
2.7.2 Two-stage matching ... 27
2.8 Performance Evaluation ... 29
2.8.1 Algorithm behavior in problem domains ... 31
2.8.2 Binary sequential algorithms ... 32
2.8.3 General sequential algorithms ... 33
2.8.4 Dependency on alphabet size ... 34
2.8.5 Real-world scenarios ... 35
2.8.6 False positives ... 37
2.9 Conclusion ... 38

3 GPU Multisplit ... 40
3.1 Introduction ... 40
3.2 Related Work and Background ... 43
3.2.1 Parallel primitive background ... 43
3.2.2 Multisplit and Histograms ... 44
3.3 Multisplit and Common Approaches ... 45
3.3.1 The multisplit primitive ... 45
3.3.2 Iterative and Recursive scan-based splits ... 46
3.3.3 Radix sort ... 47
3.3.4 Reduced-bit sort ... 48
3.4 Algorithm Overview ... 49
3.4.1 Our parallel model ... 50
3.4.2 Multisplit requires a global computation ... 50
3.4.3 Dividing multisplit into subproblems ... 51
3.4.4 Hierarchical approach toward multisplit localization ... 53
3.4.5 Direct solve: Local offset computation ... 55
3.4.6 Our multisplit algorithm ... 56
3.4.7 Reordering elements for better locality ... 57
3.5 Implementation Details ... 58
3.5.1 GPU memory and computational hierarchies ... 58
3.5.2 Our proposed multisplit algorithms ... 59
3.5.3 Localization and structure of our multisplit ... 60
3.5.4 Ballot-based voting ... 63
3.5.5 Computing Histograms and Local Offsets ... 65
3.5.6 Reordering for better locality ... 70
3.5.7 More buckets than the warp width ... 74
3.5.8 Privatization ... 75
3.6 Performance Evaluation ... 76
3.6.1 Common approaches and performance references ... 78
3.6.2 Performance versus number of buckets: m ≤ 256 ... 79
3.6.3 Performance for more than 256 buckets ... 86
3.6.4 Initial key distribution over buckets ... 87
3.7 Multisplit, a useful building block ... 90
3.7.1 Building a radix sort ... 90
3.7.2 The Single Source Shortest Path problem ... 95
3.7.3 GPU Histogram ... 98
3.8 Conclusion ... 102

4 GPU LSM: A Dynamic Dictionary Data Structure for the GPU ... 103
4.1 Introduction ... 103
4.1.1 Dynamic data structures for the GPU: the challenge ... 104
4.1.2 Dictionaries ... 105
4.2 Background and Previous Work ... 106
4.3 The GPU LSM ... 108
4.3.1 Batch Operation Semantics ... 109
4.3.2 Insertion ... 110
4.3.3 Deletion ... 112
4.3.4 Lookup ... 112
4.3.5 Count and Range queries ... 112
4.3.6 Cleanup ... 113
4.4 Implementation details ... 114
4.4.1 Insertion and Deletion ... 114
4.4.2 Lookup queries ... 115
4.4.3 Count queries ... 116
4.4.4 Range queries ... 117
4.4.5 Cleanup ... 117
4.5 Performance Evaluation ... 118
4.5.1 What does the GPU LSM provide? ... 118
4.5.2 Insertions and deletions ... 120
4.5.3 Queries: Lookup, count and range operations ... 123
4.5.4 Cleanup and its relation with queries ... 125
4.6 Conclusion ... 127

5 Slab Hash: A Dynamic Hash Table for the GPU ... 128
5.1 Introduction ... 128
5.2 Background & Related Work ... 130
5.3 Design description ... 131
5.3.1 Slab list ... 132
5.3.2 Supported Operations in Slab Lists ... 134
5.3.3 Slab Hash: A Dynamic Hash Table ... 135
5.4 Implementation details ... 136
5.4.1 Our warp-cooperative work sharing strategy ... 136
5.4.2 Choice of parameters ... 137
5.4.3 Operation details ... 138
5.5 Dynamic memory allocation ... 141
5.6 Performance Evaluation ... 144
5.6.1 Bulk benchmarks (static methods) ... 144
5.6.2 Incremental insertion ... 147
5.6.3 Concurrent benchmarks (dynamic methods) ... 148
5.7 Conclusion ... 150

6 Conclusion ... 151

References ... 155

LIST OF FIGURES

1.1 Work assignment and processing: schematic demonstration ... 4
2.1 Schematic view of all four proposed string matching approaches ... 20
2.2 Summary of results: Running time vs. pattern size ... 23
2.3 Running time vs. pattern and text length, and superior algorithms vs. parameters ... 30
2.4 Running time vs. pattern size, with a fixed text size ... 33
3.1 Multisplit examples ... 47
3.2 An example for our localization terminology ... 54
3.3 Hierarchical localization ... 56
3.4 Different localizations for DMS, WMS, and BMS, shown schematically ... 62
3.5 Key distributions for different multisplit methods and different numbers of buckets ... 71
3.6 Running time vs. number of buckets for all proposed methods ... 82
3.7 Achieved speedup against regular radix sort ... 83
3.8 Running time vs. number of buckets for various input distributions ... 89
4.1 Insertion example in GPU LSM ... 111
4.2 Batch insertion time, and effective insertion rate, for GPU LSM ... 124
5.1 Regular linked list and the slab list ... 133
5.2 Pseudo-code for search (SEARCH) and insert (REPLACE) operations in slab hash ... 140
5.3 Memory layout for SlabAlloc ... 142
5.4 Performance of the slab hash and cuckoo hashing versus memory efficiency ... 145
5.5 Performance versus total number of stored elements ... 147
5.6 Incremental batch update for the slab hash, and bulk build for the cuckoo hashing ... 148
5.7 Concurrent benchmark for the slab hash ... 150

LIST OF TABLES

1.1 High-level aspects of work assignment and processing approaches ... 3
2.1 Step-work complexity for our methods with p processors (p1 groups with p2 processors in each) ... 22
2.2 Processing rate for binary alphabets ... 32
2.3 Processing rate for alphabets with 256 characters ... 34
2.4 Processing rate for the DRK-2S method, with different alphabet sizes ... 35
2.5 Running time vs. pattern sizes, on a sample text from Wikipedia ... 36
2.6 Running time vs. pattern sizes, on the E. coli genome ... 37
2.7 Running time vs. pattern sizes, on the Homo sapiens protein sequence ... 38
3.1 Size