PARALLEL SORTING and MOTIF SEARCH by Shibdas Bandyopadhyay August 2012
Total Page:16
File Type:pdf, Size:1020Kb
PARALLEL SORTING AND MOTIF SEARCH By SHIBDAS BANDYOPADHYAY A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2012 c 2012 Shibdas Bandyopadhyay 2 I dedicate this to my family. 3 ACKNOWLEDGMENTS This work would not have been possible without continuous guidance of my advisor Dr. Sartaj Sahni. His ideas and pertinent questions inspired me to explore different ways to solve problems. I’m also grateful to my parents for encouraging and supporting me during all these years. 4 TABLE OF CONTENTS page ACKNOWLEDGMENTS .................................. 4 LIST OF TABLES ...................................... 8 LIST OF FIGURES ..................................... 9 ABSTRACT ......................................... 12 CHAPTER 1 MULTI-CORE ARCHITECTURES ......................... 13 1.1 The Cell Broadband Engine .......................... 13 1.2 Graphics Processing Units .......................... 16 2 SORTING ON THE CELL BROADBAND ENGINE ................ 20 2.1 High-level Strategies For Sorting ....................... 20 2.2 SPU Vector and Memory Operations ..................... 21 2.3 Sorting Numbers ................................ 23 2.3.1 Single SPU Sort ............................ 23 2.3.1.1 Shellsort variants ...................... 24 2.3.1.2 Merge sort .......................... 28 2.3.1.3 Comparison of single SPU sorting algorithms ....... 37 2.3.2 Hierarchical Sort ............................ 38 2.4 Sorting Records ................................ 41 2.4.1 Record Layouts ............................. 41 2.4.2 High-level Strategies for Sorting Records ............... 42 2.4.3 Single SPU Record Sorting ...................... 44 2.4.4 Hierarchical Sorting for Records - 4way Merge ........... 45 2.4.5 Comparison of Record Sorting Algorithms .............. 49 2.4.5.1 Run times for ByField layout ................ 50 2.4.5.2 Run times for ByRecord layout ............... 52 2.4.5.3 Cross layout comparison .................. 54 2.5 Summary .................................... 55 3 SORTING ON A GRAPHICS PROCESSING UNIT (GPU) ............ 59 3.1 Sorting Numbers on GPUs .......................... 59 3.1.1 GPU Sample Sort ........................... 61 3.1.1.1 Step 1–Splitter selection .................. 62 3.1.1.2 Step 2–Finding buckets ................... 63 3.1.1.3 Step 3–Prefix sum ...................... 63 3.1.1.4 Step 4–Placing elements into buckets ........... 63 3.1.2 Warpsort ................................ 64 5 3.1.2.1 Step 1–Bitonic sort by warps ................ 64 3.1.2.2 Step 2–Bitonic merge by warps ............... 65 3.1.2.3 Step 3–Splitting long sequences .............. 65 3.1.2.4 Step 4–Final merge by warps ................ 66 3.1.3 SRTS Radix Sort ............................ 66 3.1.3.1 Step 1–Bottom level reduce ................. 68 3.1.3.2 Step 2–Top level scan .................... 69 3.1.3.3 Step 3–Bottom level scan .................. 69 3.1.4 SDK Radix Sort Algorithm ....................... 70 3.1.4.1 Step 1–Sorting tiles ..................... 71 3.1.4.2 Step 2–Calculating histogram ................ 73 3.1.4.3 Step 3–Prefix sum of histogram .............. 73 3.1.4.4 Step 4–Rearrangement ................... 74 3.1.5 GPU Radix Sort(GRS) ......................... 75 3.1.5.1 Step 1–Histogram and Ranks ................ 76 3.1.5.2 Step 2–Prefix sum of tile histograms ............ 79 3.1.5.3 Step 3–Positioning numbers in a tile ............ 80 3.1.6 Comparison of Number Sorting Algorithms ............. 81 3.2 Sorting Records on GPUs ........................... 82 3.2.1 Record Layouts ............................. 82 3.2.2 High level Strategies for Sorting Records ............... 83 3.2.3 Sample Sort for Sorting Records ................... 84 3.2.4 SRTS for Sorting Records ....................... 87 3.2.5 GRS for Sorting Records ........................ 89 3.3 Experimental Results ............................. 91 3.3.1 Run Times for ByField Layout ..................... 92 3.3.2 Run Times for Hybrid Layout ...................... 93 3.4 Summary .................................... 95 4 MOTIF SEARCH ................................... 97 4.1 Introduction ................................... 97 4.2 PMS5 ...................................... 100 4.2.1 Notations and Definitions ....................... 100 4.2.2 PMS5 Overview ............................. 100 4.2.3 Computing Bd (x, y, z) ......................... 101 4.2.4 Intersection of Qs ............................ 104 4.3 PMS6 ...................................... 105 4.3.1 Overview ................................ 105 4.3.2 Computing Bd (C(n1, ··· , n5)) ..................... 105 4.3.3 The Data Structure Q ......................... 109 4.3.4 Complexity ............................... 110 4.4 Experimental Results ............................. 110 4.4.1 Preprocessing .............................. 111 4.4.2 Computing Bd .............................. 112 6 4.4.3 Total Time ................................ 112 4.4.4 Total Time Plus Preprocessing Time ................. 113 4.5 PMS6MC .................................... 113 4.5.1 Overview ................................ 113 4.5.2 Outer-Level Parallelism ......................... 114 4.5.3 Inner-Level Parallelization ....................... 114 4.5.3.1 Finding equivalence classes in parallel .......... 115 4.5.3.2 Computing Bd (C) in parallel ................ 116 4.5.3.3 Processing Q in parallel ................... 117 4.5.3.4 The outputMotifs routine .................. 119 4.6 Experimental Results ............................. 120 4.6.1 PMS6MC Implementation ....................... 121 4.6.2 PMS6 and PMS6MC .......................... 121 4.6.3 PMS6MC and Other Parallel Algorithms ............... 122 4.7 Summary .................................... 123 5 CONCLUSION AND FUTURE WORK ....................... 125 REFERENCES ....................................... 127 BIOGRAPHICAL SKETCH ................................ 132 7 LIST OF TABLES Table page 2-1 Comparison of various SPU sorting algorithms .................. 39 4-1 Preprocessing times for PMS5 and PMS6 ..................... 112 4-2 Total storage for PMS5 and PMS6 in MegaBytes(MB) .............. 112 4-3 Time taken to compute Bd (C(n1, ··· , n5)) and Bd (x, y, z) ............ 112 4-4 Total run rime of PMS5 and PMS6 ......................... 113 4-5 Total + preprocessing time by PMS5 and PMS6 ................. 113 4-6 Degree of inner and outer level parallelism for PMS6MC ............ 121 4-7 Run times for PMS6 and PMS6MC ........................ 122 4-8 Run times for PMS5 and PMSPrune ........................ 122 4-9 Total run time of different PMS Algorithms .................... 123 4-10 Comparing mSPELLER and gSPELLER with PMS6MC ............. 123 8 LIST OF FIGURES Figure page 1-1 Architecture of the Cell Broadband Engine ..................... 15 1-2 NVIDIA’s Tesla GPU ................................. 16 1-3 Cuda Programming Model .............................. 17 2-1 Comparisons for mySpu cmpswap skew ...................... 22 2-2 Comb sort ...................................... 26 2-3 Single SPU column-major AA-sort ......................... 26 2-4 Collecting numbers from a 4 × 4 Matrix ...................... 27 2-5 Column-major to row-major ............................. 28 2-6 Column-major brick sort ............................... 29 2-7 Column-major shaker sort ............................. 29 2-8 Merge sort example ................................. 31 2-9 Phase 2 of merge sort ............................... 32 2-10 Phase 3 counters .................................. 33 2-11 Phase 3 of merge sort ............................... 34 2-12 Phase 4 counters .................................. 37 2-13 Phase 4 of merge sort with counters in different columns ............ 38 2-14 Plot of average time to sort 4-byte integers .................... 40 2-15 SIMD odd-even merge of two vectors ....................... 41 2-16 SIMD 2-way merge of 2 vectors v1 and v2 .................... 42 2-17 Plot of average time to sort 1 to 67 million integers ................ 43 2-18 The findmin operation for records ......................... 44 2-19 The fields select operation in ByField layout .................... 44 2-20 The fields select operation in ByRecord layout ................... 45 2-21 Shuffling two records in ByField layout ....................... 45 2-22 Shuffling two records in ByRecord layout ..................... 45 9 2-23 4-way merge ..................................... 46 2-24 4-way merge ..................................... 48 2-25 SIMD 2-way merge of 2 vectors v1 and v2 .................... 49 2-26 2-way sorts (ByField) ................................. 51 2-27 4-way sorts (ByField) ................................. 52 2-28 2-way and 4-way sorts (ByField), 1M records ................... 53 2-29 4-way sorts (ByField), 1M records ......................... 54 2-30 2-way sorts (ByRecord) ............................... 55 2-31 4-way sorts (ByRecord) ............................... 56 2-32 2-way and 4-way sorts (ByRecord), 1M records .................. 57 2-33 4-way sorts (ByRecord), 1M records ........................ 58 2-34 4-way sorts using the best algorithm for different layouts ............ 58 3-1 Serial Sample Sort ................................. 62 3-2 An Iteration of GPU Sample Sort .......................... 63 3-3 Bitonic Merge Sort of 8 Elements ......................... 64 3-4 Warpsort Steps ................................... 66 3-5