Efficient Algorithms for a Mesh-Connected Computer With
Total Page:16
File Type:pdf, Size:1020Kb
Efficient Algorithms for a Mesh-Connected Computer with Additional Global Bandwidth by Yujie An A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Science and Engineering) in the University of Michigan 2019 Doctoral Committee: Professor Quentin F. Stout, Chair Associate Professor Kevin J. Compton Professor Seth Pettie Professor Martin J. Strauss Yujie An [email protected] ORCID iD: 0000-0002-2038-8992 ©Yujie An 2019 Acknowledgments My researches are done with the help of many people. First I would like to thank my advisor, Quentin F. Stout, who introduced me to most of the topics discussed here, and helped me throughout my Ph.D. career at University of Michigan. I would also like to thank my thesis committee members, Kevin J. Compton, Seth Pettie and Martin J. Strauss, for their invaluable advice during the dissertation process. To my parents, thank you very much for your long-term support. All my achievements cannot be accomplished without your loves. To all my best friends, thank you very much for keeping me happy during my life. ii TABLE OF CONTENTS Acknowledgments ................................... ii List of Figures ..................................... v List of Abbreviations ................................. vii Abstract ......................................... viii Chapter 1 Introduction ..................................... 1 2 Preliminaries .................................... 3 2.1 Our Model..................................3 2.2 Related Work................................5 3 Basic Operations .................................. 8 3.1 Broadcast, Reduction and Scan.......................8 3.2 Rotation................................... 11 3.3 Sparse Random Access Read/Write..................... 12 3.4 Optimality.................................. 17 4 Sorting and Selecting ................................ 18 4.1 Sort..................................... 18 4.1.1 Sort in the Worst Case....................... 18 4.1.2 Sort when Only k Values are Out of Order............. 22 4.1.3 Splitter Sort............................. 27 4.1.4 Chain Rank............................. 29 4.2 Selection and Finding the Median..................... 32 4.2.1 Worst Case............................. 32 4.2.2 Expected Case........................... 34 4.3 Label Counting............................... 36 5 Fast Fourier Transform ............................... 38 5.1 Fast Fourier Transform........................... 38 5.2 Pattern Matching.............................. 43 6 Graph Algorithms .................................. 47 6.1 Graph Algorithms with Matrix Input.................... 47 iii 6.2 Graph Algorithms with Edge input..................... 58 6.3 Graph Algorithms on a CR MCC...................... 61 7 Computational Geometry .............................. 72 7.1 Convexity.................................. 72 7.2 Nearest Neighbor Problems......................... 75 7.3 Line Segment Intersection......................... 87 8 Coarse-Grained Algorithms ............................ 92 8.1 Broadcast, Reduction and Scan....................... 92 8.2 Sort..................................... 93 8.3 Minimum Spanning Tree.......................... 99 8.4 Convex Hull................................. 104 9 Conclusion ...................................... 106 9.1 Future Work................................. 107 Bibliography ...................................... 109 iv LIST OF FIGURES 2.1 Space-filling curve ordering...........................4 2.2 Mesh with additional global bandwidth.....................4 2.3 Mesh augmented by global communication power...............6 3.1 Segmented broadcast............................... 10 3.2 Data requesting operation with n = 256 and = 1=2; after step 3, channel n =2 saves the range of channels that saves pi in range [n=2; 3n=4) ....... 13 4.1 Sort using multiple splitters........................... 19 4.2 Sort when inputs are 1 quasi-sorted....................... 23 4.3 Determine 2k values to take out......................... 26 4.4 Splitter sort: put points into non-overlapping regions.............. 28 4.5 Chain rank.................................... 30 0 1 7 5.1 The values of !8;!8;:::;!8 in the complex plane............... 39 5.2 Directions of convolution and string matching.................. 44 5.3 Image pattern matching: set blank areas in the pattern to be wildcard to count the red regions, and white color otherwise; green region is an example of an invalid matching................................. 45 6.1 Creating a linear chain from a tree........................ 49 6.2 Game tree..................................... 50 6.3 A game tree that is hard to evaluate....................... 51 6.4 Bridges in G, the number on the node represents its pre-order numbering... 53 6.5 Articulation points in G, the number on the node represents its pre-order num- bering....................................... 54 6.6 Biconnected component: graph transformation................. 56 7.1 A convex hull example.............................. 73 7.2 Line L separates S1 and S2 ............................ 74 7.3 Nearest neighbor in a corner, p and q may have closer points outside the current vertical and horizontal slabs because currently d(p) > pc and d(q) > qc; this condition will be true for at most two points for each corner in each block... 75 7.4 Voronoi partition................................. 77 7.5 All squares that possibly contain a global closest dissimilar point from α ... 79 7.6 Modified version of Voronoi partition: intervals below with the same label must be ignored.................................. 82 v 7.7 New intervals generated: a points and b intervals can generate at most a + b intervals, intervals without labels above can have any label below....... 83 7.8 Updates in the Voronoi partitioning....................... 85 7.9 More than two points are closest to a point on L: C must be in between A and B, and CD must go through O; but in this case CD cannot be the optimal answer, because both AD and BD are shorter than it.............. 87 7.10 Sort line segments using partial order sort, bold lines are in S1 and others are in S2; only one intersection detected is I, and intersections between two line segments in S2 cannot be detected at this stage................. 88 7.11 Two visibility problems............................. 90 7.12 Blocking visibility................................ 90 8.1 Destination ordering............................... 93 8.2 Shuffle the array................................. 95 vi LIST OF ABBREVIATIONS VLSI Very Large Scale Integration PRAM Parallel Random Access Machine ER Exclusive Read EW Exclusive Write CR Concurrent Read CW Concurrent Write ADT Abstract Data Type SIMD Single Instruction Multiple Data RAR Random Access Read RAW Random Access Write SRAR Sparse Random Access Read SRAW Sparse Random Access Write MST Minimum Spanning Tree MSF Minimum Spanning Forest DFT discrete Fourier transform FFT fast Fourier transform KMP Knuth-Morris-Pratt HPC High Performance Computing vii ABSTRACT This thesis shows that adding additional global bandwidths to a mesh-connected computer can greatly improve the performance. The goal of this project is to design algorithms for mesh-connected computers augmented with limited global bandwidth, so that we can further enhance our understanding of the parallel/serial nature of the problems on evolving parallel architectures. We do this by first solving several problems associated with fundamental data movement, then sum- marize ways to resolve different situations one may observe in data movement in parallel computing. This can help us to understand whether the problem is easily parallelizable on different parallel models. We give efficient algorithms to solve several fundamental problems, which include sorting, counting, fast Fourier trans- form, finding a minimum spanning tree, finding a convex hull, etc. We show that adding a small amount of global bandwidth makes a practical design that com- bines aspects of mesh and fully connected models to achieve the benefits of each. Most of the algorithms are optimal. For future work, we believe that algorithms with peak-power constrains can make our model well adapted to the recent archi- tectures in high performance computing. viii CHAPTER 1 Introduction In recent years, advances in Very Large Scale Integration (VLSI) technology have provided increasingly powerful cores consist of many thousands and potentially millions of proces- sors. The layouts of cores and processors are very different from machine to machine, and they are changing very rapidly. For example, the PSC Bridges supercomputer has 28 cores on each node [1], and communication between nodes are expensive. This is not true for the same supercomputer system before 2016, not to say many other supercomputers. From a programmer’s prospective, it would be ideal to write programs for a Parallel Random Access Machine (PRAM). In a PRAM, all processors have a small local mem- ory but can communicate with a large shared memory in constant time. The power of such global communication can significantly reduce the program’s complexity, though al- gorithms can still be extremely hard to implement. It is a quite powerful theoretical model that is capable of solving many problems in poly-logarithmic time (i.e., O(logk n) time for some constant k)[2,3,4,5]. However, the PRAM is not practical and cannot be scaled to large numbers of processors. Further, many of the optimal (in O-notation) algorithms