Computing Included and Excluded Sums Using Parallel Prefix

by Sean Fraser

S.B., Massachusetts Institute of Technology (2019)

Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY

May 2020

© Massachusetts Institute of Technology 2020. All rights reserved.

Author...... Department of Electrical Engineering and Computer Science May 18, 2020

Certified by...... Charles E. Leiserson Professor of Computer Science and Engineering Thesis Supervisor

Accepted by...... Katrina LaCurts Chair, Master of Engineering Thesis Committee

Submitted to the Department of Electrical Engineering and Computer Science on May 18, 2020, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

Many scientific computing applications involve reducing elements in overlapping subregions of a multidimensional array. For example, the integral image problem from image processing requires finding the sum of elements in arbitrary axis-aligned subregions of an image. Furthermore, the fast multipole method, a widely used kernel in particle simulations, relies on reducing regions outside of a bounding box in a multidimensional array to a representative multipole expansion for certain interactions. We abstract away the application domains and define the underlying included and excluded sums problems of reducing regions inside and outside (respectively) of an axis-aligned bounding box in a multidimensional array. In this thesis, we present the dimension-reduction excluded-sums (DRES) algorithm, an asymptotically improved algorithm for the excluded sums problem in arbitrary dimensions, and compare it with the state-of-the-art algorithm by Demaine et al. The DRES algorithm reduces the work from exponential to linear in the number of dimensions. Along the way, we present a linear-time algorithm for the included sums problem and show how to use it in the DRES algorithm. At the core of these algorithms are in-place prefix and suffix sums. Furthermore, applications that involve included and excluded sums require both high performance and numerical accuracy in practice. Since standard methods for prefix sums on general-purpose multicores usually suffer from either poor performance or low accuracy, we present an algorithm called the block-hybrid (BH) algorithm for parallel prefix sums to take advantage of data-level and task-level parallelism. The BH algorithm is competitive on large inputs, up to 2.5× faster on inputs that fit in cache, and 8.4× more accurate compared to state-of-the-art CPU parallel prefix implementations. Furthermore, a BH algorithm variant achieves at least a 1.5× improvement over a state-of-the-art GPU prefix sum implementation on a performance-per-cost ratio (using Amazon Web Services' pricing). Much of this thesis represents joint work with Helen Xu and Professor Charles Leiserson.

Thesis Supervisor: Charles E. Leiserson Title: Professor of Computer Science and Engineering

Acknowledgments

First and foremost, I would like to thank my advisor Charles Leiserson for his advice and support throughout the course of this thesis. His positive enthusiasm, vast technical knowledge, and fascination with seemingly simple problems that have emergent complexity have been nothing short of inspiring. Additionally, this work would not have been possible without Helen Xu, who has spent a considerable amount of time guiding my research, discussing new avenues to pursue, and even writing parts of this thesis. I have learnt a great deal from both Charles Leiserson and Helen Xu, and I consider myself very fortunate to have such great collaborators and mentors. This thesis is derived from a project done in collaboration with both of them. I would also like to thank my academic advisor, Dennis Freeman, for his support during my MIT career. Furthermore, I am grateful to the entire Supertech Research Group and to TB Schardl for their discussions over the past year. Thanks to Guy Blelloch, Julian Shun and Yan Gu for providing a technical benchmark necessary for this work. I would also like to acknowledge MIT Supercloud for allowing me to run experiments on their compute cluster. I am extremely grateful to my friends at MIT, who have made my time here especially memorable. Thank you to my parents, my brother, and my girlfriend for providing unconditional support. This thesis and my journey at MIT would not have been possible without you.

This research was sponsored in part by NSF Grant 1533644, and in part by the United States Air Force Research Laboratory and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

Contents

1 Introduction

2 Preliminaries

3 Included and Excluded Sums
3.1 Tensor Preliminaries
3.2 Problem Definitions
3.3 INCSUM Algorithm
3.4 Excluded Sums (DRES) Algorithm
3.5 Applications

4 Prefix Sums
4.1 Previous Work
4.2 Vectorization
4.3 Accurate Floating Point
4.4 Ordering of Computation
4.5 Block-Hybrid Algorithm
4.6 Experimental Evaluation
4.6.1 Performance
4.6.2 Accuracy
4.7 GPU Comparison

5 Conclusions and Future Work

A Implementation Details
A.1 Implementation of Prefix Sum Vectorization

List of Figures

1-1 An example of the included and excluded sums in two dimensions for one box. The grey represents the region of interest to be reduced. Typically, the included and excluded sums problems require computing the reduction for all such regions (only one is depicted in the figure for illustration).

1-2 An example of the included and excluded sums problems in two dimensions with a box size of 3 × 3 and the max operator. The dotted box represents an example of a 3 × 3 box in the matrix, but the included and excluded sums problem computes the inclusion or exclusion regions for all 3 × 3 boxes.

1-3 An example of the corners algorithm for one box in two dimensions. The grey regions represent excluded regions computed via prefix and suffix sums.

1-4 Performance and accuracy of the naive sequential prefix sum algorithm compared to an unoptimized parallel version on a 40-core machine. Clearly, there is no speedup for the parallel version, and in fact, it is even slower. For the accuracy comparison, we define the error as the root mean square relative error over all outputs of the prefix sum array compared to a higher-precision reference at the equivalent index. The input is an array of $2^{15}$ single-precision floating point numbers, according to a distribution in the legend. The options are random numbers drawn from Unif(0, 1), Exp(1), or Normal(0, 1). On these inputs, the parallel version is on average around 17× more accurate.

1-5 Three highlight matrices of the trade-offs between performance and accuracy. The input size is either $2^{15}$ (left), $2^{22}$ (center), or $2^{27}$ (right) floats, and tests are run on a 40-core machine. Accuracy is measured as RMSE and plotted on a logarithmic scale, where RMSE is defined in Section 4.3, for random numbers uniformly distributed between 0 and 1. Performance (or observed parallelism) is measured as $T_1/T_{40h}$, where $T_{40h}$ is the parallel time on the Supercloud 40-core Machine with 2-way hyper-threading, and $T_1$ is the serial time of the naive sequential prefix sum on the same machine.

3-1 Pseudocode for the included sum in one dimension.

3-2 Computing the included sum in one dimension in linear time.

3-3 Example of computing the included sum in one dimension with 푁 = 8, 푘 = 4.

3-4 Computing the included sum in two dimensions.

3-5 Ranged prefix along row.

3-6 Ranged suffix along row.

3-7 Computes the included sum along a given row.

3-8 Computes the included sum along a given dimension.

3-9 Computes the included sum for the entire tensor.

3-10 An example of decomposing the excluded sum into disjoint regions in two dimensions. The red box denotes the points to exclude.

3-11 Steps for computing the excluded sum in two dimensions with included sums on prefix and suffix sums.

3-12 Prefix sum along row.

3-13 Adding in the contribution.

3-14 Prefix sum along row.

10 4-1 Hillis and Steele’s data-parallel prefix sum algorithm on an 8-input array using index notation. The individual boxes represent individual elements of the array and the notation 푗 : 푘 contains elements at index 푗 to index 푘 inclusive with the operator (here assumed to be ), denoted by ⊕. The lines from two boxes means those two elements are added where they meet the operator. The arrow corresponds to propagating the old element at that index (no operation)...... 55

4-2 The balanced -based algorithm described by Blelloch [5] and [25] consisting of the upsweep (first four rows) followed by the downsweep (last3 rows). The individual boxes represent individual elements of the array and the notation 푖 : 푗 contains elements at index 푖 to index 푗 added together. The lines coming from two boxes means those two elements are added to where the line meets the new box. The bolded-outline boxes indicate when that element in the array in is ‘finished’, i.e. it has the correct prefix sum at the index at that point. The arrows from Figure 4-1 are omitted but they have the same effect...... 57

4-3 The algorithm of PBBSLIB (scan_inplace) [31]. It first performs a parallel reduce operation on each block to a temporary array. After running an ex- clusive scan on that temporary array, we then run in-place prefix sums, using the value from the temporary array as the offset, for each block 퐵, in parallel. The symbols are described by the key on the right...... 58

4-4 An illustration of Prefix-Sum-Vec, for an 4-unit section of input (1, 2, 3, 4), with desired output (1,3,6,10). The offset is kept track of while iterating over the entire array in a forward pass in blocks of 4 elements. Here, 푉 = 4 is our vector width (4 floats, or 128 bits). However, this can be 256-bit vectors for example. The central idea is that it performs Hillis’ algorithm exclusively through vector operations (cheap vector adds, shifts, and shuffles etc), maximizing data level parallelism. The subscripted arrow means a shift by that number...... 60

11 4-5 The red boxed numbers by the lines indicate the order in which those sums are computed (increasing order). Left: the ordering from the algorithm discussed by Blelloch [5] / Work-Efficient Parallel Prefix, analogous to a level order traversal of a binary tree. Right: the ordering from a circuit described by a recursive version from Ladner and Fischer [25]. The upsweep phase can be described as a postorder traversal, and the downsweep as as preorder traversal. These depth-first orderings are more suited to cache-locality, because onlarge arrays, their accesses are grouped together and more cache-efficient. Blelloch provides a proof in [6]...... 66

4-6 An illustration of the Block Hybrid algorithm on a fabricated input array of size 40. In the first phase, in-place prefix sums / inclusive scans (using the intrinsics based vectorization algorithm) are performed on each block of size 퐵 (here, 퐵 = 8). The last value of each block is also copied to a temporary array. In the second phase, an in-place prefix sum in the form of a recursive work efficient parallel prefix is performed on the temporary array, andfinally the corresponding values are now broadcasted to all elements of all the blocks in parallel...... 68

4-7 A highlight matrix of the trade-offs between performance and accuracy. The input size is 222 floats, and tests are run on a 40-core machine. This in- put size is where the differences are most pronounced - for smaller inputs, the serial algorithms perform better, and for larger inputs, PBBSLIB and block_hybrid_reduce both have similar performance due to reaching the 1 maximum memory bandwidth. Accuracy is measured as RMSE and plot- ted on a logarithmic scale, where RMSE is defined in Section 4.3, for the random numbers uniformly distributed between 0 and 1. Performance (or ob- 푇1 served parallelism) is measured as , where 푇40ℎ is the parallel time on the 푇40ℎ Supercloud 40-core Machine with 2-way hyper-threading, and 푇1 is the serial time of the naive sequential prefix sum on the same machine. Our algorithms are sequential intrinsics, block hybrid, and block hybrid reduce.... 69

12 4-8 A logarithmic-scaled runtime graph compared all the algorithms discussed so far. It is worth noting the consistent performance of the block hybrid based algorithms, across all input sizes. This was measured on the Supercloud 40- core machine...... 70 4-9 A linearly scaled runtime graph where the y-axis is the runtime divided by the size of the array (to get a normalized per unit array size) runtime. This shows that the PBBSLIB algorithm is inefficient for small inputs, while the other algorithms scale almost linearly, until the input gets large enough, when the parallel algorithms scale better. This was measured on the Supercloud 40-core machine...... 71

푇1 4-10 Speedup (measured as 푇푃 ) versus input size (log scale) on two different parallel machines. As we can see as 푛 increases, we do gain some speedup, however we eventually hit a bottleneck which is the maximum memory bandwidth of the machine, since the calculation required in a prefix sum is small compared to the runtime associated with reading and writing memory...... 72 4-11 A bar chart comparing the root mean squared relative error of the different algorithms, compared to a prefix sum algorithm that uses a much higher precision (100) where inputs are 215 random single precision floating point numbers drawn from a uniform distribution between 0 and 1...... 73 4-12 A bar chart comparing the root mean squared relative error of the different algorithms, compared to a prefix sum algorithm that uses a much higher precision (100) where inputs are 215 random single precision floating point numbers drawn from the exponential distribution with lambda = 1...... 74 4-13 A bar chart comparing the root mean squared relative error of the different algorithms, compared to a prefix sum algorithm that uses a much higher precision (100) where inputs are 215 random single precision floating point numbers drawn from the standard normal distribution...... 75

List of Tables

4.1 Amazon EC2 Spot and On Demand Pricing Comparison.

4.2 Runtime comparison against the state-of-the-art GPU implementation on 1 Volta GPU versus the affordable 2-core AWS machine.

Chapter 1

Introduction

Many scientific computing applications require reducing many (potentially overlapping) regions of a tensor, or multidimensional array, to a single value for each region quickly and accurately. In this thesis, we explore the "included and excluded sums problems", which underlie applications that require reducing regions of a tensor to a single value using a binary associative operator. The problems are called "sums" for ease of presentation, but the general problem statements (and therefore algorithms to solve the problems) apply to any binary associative operator, not just addition. For simplicity, we will sketch the problems in two dimensions in the introduction but will formalize and provide algorithms for arbitrary dimensions in the remainder of the thesis.

At a high level, the included and excluded sums problems require computing reductions over many different (but possibly overlapping) regions in a matrix (one corresponding to each entry in the matrix). The problems take as input a matrix and a "box size" (or side lengths defining a rectangular region, or "box"). Given a box size, each location in the matrix defines a spatial box of that size. An algorithm for included sums outputs another matrix of the same size as the input matrix where each entry is the reduction of all elements contained in the box for that entry. The "excluded sums problem" is the inverse of the included sums problem: each entry in the output matrix is the reduction of all elements outside of the corresponding box. As we will see, a solution for included sums does not directly translate into a solution for excluded sums for general operators.

Figure 1-1 (left panel: Included Sum; right panel: Excluded Sum): An example of the included and excluded sums in two dimensions for one box. The grey represents the region of interest to be reduced. Typically, the included and excluded sums problems require computing the reduction for all such regions (only one is depicted in the figure for illustration).

Algorithms for included and excluded sums must compute the reduction for the included or excluded region for all points in the matrix. If we were only interested in one box (or one excluded region), a single pass over the matrix would suffice. If there are 푁 points in the matrix and the box size is 푅, naively summing elements in each box for the included sum takes 푂(푁푅) work (푂(푁(푁 − 푅)) for the naive excluded sum) and involves many repeated computations. An example of one entry of the included and excluded sum can be found in Figure 1-1.

The included and excluded sums problems appear in a variety of multidimensional computations. For example, the integral image problem (or summed-area table) [8, 12] pre-processes an image to answer queries for the sum of elements in arbitrary rectangular subregions of a matrix in constant time. The integral image has applications in real-time image processing and filtering [19]. The fast multipole method (FMM) is a widely-used numerical approximation for the calculation of long-ranged forces in various 푁-particle simulations [3, 17]. The essence of the FMM is a reduction of a neighboring subregion's particles, excluding particles too close, to a multipole to allow for fewer pairwise calculations [9, 13].
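For concreteness, the following is a minimal C++ sketch (with hypothetical names; it is an illustration, not code from this thesis) of the naive one-dimensional included sum described above, which recomputes each window from scratch for 푂(푁푘) total work with window length 푘:

#include <algorithm>
#include <cstddef>
#include <vector>

// Naive included sum in one dimension: each output entry sums its own
// window of length k from scratch, for O(Nk) total work.
std::vector<double> naive_incsum_1d(const std::vector<double>& a, std::size_t k) {
    std::size_t n = a.size();
    std::vector<double> b(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i; j < std::min(i + k, n); ++j)
            b[i] += a[j];  // window [i, i + k), clamped to the array
    return b;
}

The algorithms in Chapter 3 remove this redundancy by sharing prefix and suffix sums across overlapping windows.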

Summation Without Inverse

An obvious approach, first solving the included sums problem and then subtracting the result from the reduction of the entire tensor, fails for functions of interest which might represent singularities (e.g., in physics 푁-particle simulations) [14] or for operators without an inverse (such as max). Even for functions with an inverse, subtracting floating-point values suffers from catastrophic

cancellation [14, 38]. Therefore, we require algorithms without subtraction for the excluded sums problem. More generally, algorithms for the included and excluded sums problem should apply to any associative operator (with or without an inverse). In Figure 1-2, we present an example of the included and excluded sums problem with the max operator (which does not have an inverse).

Input:              Inclusion (max):    Exclusion (max):
 1  3  6  2  5      15 15 17 17 17      18 18 18 18 18
10  9  1  8 17      18 15 17 17 17      17 18 18 18 18
 5 11 15  3  2      18 16 16 12  9      17 18 18 18 18
18  4  2 12  9      18 16 16 12  9      17 18 18 18 18
 6  2 16  7  8      16 16 16  8  8      18 18 18 18 18

Figure 1-2: An example of the included and excluded sums problems in two dimensions with a box size of 3 × 3 and the max operator. The dotted box represents an example of a 3 × 3 box in the matrix, but the included and excluded sums problem computes the inclusion or exclusion regions for all 3 × 3 boxes.
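To see in one dimension how exclusion can be computed without an inverse, note that the region outside a window splits into a part strictly before it and a part strictly after it, each coverable by a running prefix or suffix reduction. Below is a hedged C++ sketch with max as the operator (illustrative names; the general multidimensional construction is the subject of the corners algorithm and Chapter 3):

#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Subtraction-free excluded "sum" under max in one dimension: combine the
// prefix maximum ending just before each window with the suffix maximum
// starting just after it. No inverse of max is needed.
std::vector<double> excluded_max_1d(const std::vector<double>& a, std::size_t k) {
    const double id = -std::numeric_limits<double>::infinity();  // identity for max
    std::size_t n = a.size();
    std::vector<double> pre(n), suf(n), out(n, id);
    for (std::size_t i = 0; i < n; ++i)
        pre[i] = std::max(a[i], i > 0 ? pre[i - 1] : id);
    for (std::size_t i = n; i-- > 0; )
        suf[i] = std::max(a[i], i + 1 < n ? suf[i + 1] : id);
    for (std::size_t i = 0; i < n; ++i) {  // excluded window occupies [i, i + k)
        if (i > 0)     out[i] = std::max(out[i], pre[i - 1]);
        if (i + k < n) out[i] = std::max(out[i], suf[i + k]);
    }
    return out;
}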

Corners Algorithm for Excluded Sums

Since naive algorithms for reducing regions of a tensor may waste work by recomputing reductions for overlapping regions, researchers have proposed algorithms that reuse the work of reducing regions of a tensor. Demaine et al. [14] introduced an algorithm to compute the excluded sums in arbitrary dimensions without subtraction, which we will call the corners algorithm. To our knowledge, the corners algorithm is the fastest algorithm for the excluded sums problem. Given a 푑-dimensional tensor and a length-푑 vector of box side lengths, the corners algorithm computes the excluded sum for all 푁 boxes of that size. At a high level, the corners algorithm decomposes the excluded region for each box into $2^d$ disjoint regions (one corresponding to each corner of that box, or to each vertex in 푑 dimensions) such that summing all of the points in the $2^d$ regions exactly matches the excluded region. The algorithm heavily depends on prefix and suffix sums to compute the reduction of points in

each of the disjoint regions. The original article that proposed the corners algorithm does not include a formal analysis of its runtime or space usage in arbitrary dimensions. Given a 푑-dimensional tensor of 푁 points, the corners algorithm takes $\Omega(2^d N)$ work to compute the excluded sum in the best case because there are $2^d$ corners and each one requires Ω(푁) work to account for the contribution to each excluded box. The bound is tight: given arbitrary extra space, the corners algorithm takes $O(2^d N)$ work. An in-place (using space sublinear in the input size) corners algorithm takes $O(2^d d N)$ work.

Figure 1-3: An example of the corners algorithm for one box in two dimensions. The grey regions represent excluded regions computed via prefix and suffix sums.

To our knowledge, the best lower bound for the work and space usage for any algorithm for excluded sums is Ω(푁) because any algorithm has to read in the entire input tensor (of size 푁) at least once and output a tensor of size Ω(푁).

Prefix and Suffix Sums

As we will formalize in Chapter 3, an efficient implementation of either the corners algorithm or the DRES algorithm requires a fast prefix and suffix sum implementation since prefix and suffix sums are core subroutines in both algorithms. For simplicity, we will discuss prefix sums for the rest of this thesis, but all of the techniques that we explore for prefix sums also apply to suffix sums. The prefix sum (also known as scan) [5] operation is one of the most fundamental building blocks of parallel algorithms. Prefix sums are so ubiquitous that they have been included as primitives in some languages such as C++ [11], and more recently have been considered

as a primitive for GPU computations in CUDA [18]. Furthermore, parallel prefix sums (or, analogously, a pairwise summation) achieve high numerical accuracy on finite-precision floating-point numbers [21] compared to a naive sequential reduction [20]. Therefore, parallel prefix sums are well-suited to scientific computing applications such as the FMM, which optimize for both performance and accuracy.

Figure 1-4 (panels: (a) Performance, (b) Accuracy): Performance and accuracy of the naive sequential prefix sum algorithm compared to an unoptimized parallel version on a 40-core machine. Clearly, there is no speedup for the parallel version, and in fact, it is even slower. For the accuracy comparison, we define the error as the root mean square relative error over all outputs of the prefix sum array compared to a higher-precision reference at the equivalent index. The input is an array of $2^{15}$ single-precision floating point numbers, drawn according to a distribution in the legend: random numbers from Unif(0, 1), Exp(1), or Normal(0, 1). On these inputs, the parallel version is on average around 17× more accurate.

Although the canonical parallel prefix sum should be faster in theory, traditional implementations on general-purpose multicores exhibit several performance issues. As shown in Figure 1-4, an unoptimized parallel prefix sum is slower than the simplest serial algorithm.
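For reference, the "simplest serial algorithm" compared in Figure 1-4 is just a single streaming pass with a running total; a sketch (assuming an in-place update over a float array):

#include <cstddef>

// Naive sequential prefix sum: one forward pass with a running total. The
// streaming access pattern is prefetch-friendly, but the n - 1 dependent
// additions give the O(n) worst-case error constant discussed in Chapter 2.
void serial_prefix_sum(float* a, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i)
        a[i] += a[i - 1];
}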

The parallel prefix sum is recursive, which generates function call overhead. The parallel prefix sum is work-efficient but has a constant factor of extra work when compared to the serial version [5]. Furthermore, there is additional overhead in parallelization due to scheduling [27]. Parallel implementations without careful coarsening may also run into cache-line conflicts and contention [15, 36]. The serial algorithm takes advantage of prefetching [34] because it just requires a straightforward pass through the input, while the parallel algorithm accesses elements out of order in a tree-like traversal [5]. Finally, as we will see in Chapter 4,

the traditional parallel algorithm does not take advantage of data-level parallelism via vectorization. An efficient implementation of parallel prefix sum requires careful parallelization and optimization to take advantage of task-level and data-level parallelism effectively. The highlights of our results for prefix sum can be summarized by Figure 1-5, which shows the trade-offs between performance and accuracy for various prefix sum algorithms at three input sizes: $2^{15}$, $2^{22}$, and $2^{27}$. These input sizes highlight the trends: for smaller inputs, the serial algorithms perform better, and for larger inputs, PBBSLIB and block_hybrid_reduce both have similar performance due to reaching the maximum memory bandwidth. The differences are most pronounced for our algorithms in the middle-sized plot. The full plots in Chapter 4 have the details. The top right corner of each matrix represents algorithms that are highly performant and accurate. Our algorithms are sequential intrinsics (single-threaded), block_hybrid, and block_hybrid_reduce.

Figure 1-5: Three highlight matrices of the trade-offs between performance and accuracy. The input size is either $2^{15}$ (left), $2^{22}$ (center), or $2^{27}$ (right) floats, and tests are run on a 40-core machine. Accuracy is measured as RMSE and plotted on a logarithmic scale, where RMSE is defined in Section 4.3, for random numbers uniformly distributed between 0 and 1. Performance (or observed parallelism) is measured as $T_1/T_{40h}$, where $T_{40h}$ is the parallel time on the Supercloud 40-core Machine with 2-way hyper-threading, and $T_1$ is the serial time of the naive sequential prefix sum on the same machine.

Contributions

Our contributions can be divided into two main categories: results for included and excluded sums, and results for prefix sums.

22 Our main contribution for excluded sums is the dimension-reduction excluded-sums (DRES) algorithm, an asymptotically improved algorithm for the excluded sums problem in arbitrary dimensions. Along the way, we present an efficient algorithm for included sums called the INCSUM algorithm. We measure multithreaded algorithms in terms of their work and span, or longest chain of sequential dependencies [10, Chapter 27]. Chapter 2 formalizes the dynamic multithreading model. The DRES and INCSUM algorithms take as input the following parameters:

• a 푑-dimensional tensor 풜 with 푁 entries,

• the size of the excluded region (the “box size”), and

• a binary associative operator ⊕.

Both algorithms output another 푑-dimensional tensor ℬ with 푁 entries where each entry is the excluded or included sum (respectively) corresponding to that point in the tensor under the operator ⊕. Our contributions towards excluded and included sums are as follows:

• The dimension-reduction excluded-sums (DRES) algorithm, an asymptotically im- proved algorithm for the excluded sums problem in arbitrary dimensions.

• Theorems showing that DRES computes the excluded sum in 푂(푑푁) work and in $O\left(d^2 \sum_{i=1}^{d} \lg n_i\right)$ span for a 푑-dimensional tensor with 푁 points where each dimension 푖 = 1, . . . , 푑 has length 푛푖.

• An implementation of DRES in C++, with an experimental evaluation on the horizon.

Our results for prefix sums are as follows:

• The block-hybrid (BH) algorithm for prefix sums, a data-parallel and task-parallel algorithm for prefix sums, and a less accurate, faster variant.

• A proof of the observation that reducing the span of a prefix sum algorithm improves both its parallelism and worst case error bound in floating point arithmetic.

23 • An implementation of the block-hybrid algorithm in C++ / Cilk [7].

• An experimental comparison of prefix sum algorithms that shows that the BH algorithm is up to 2.5× faster on inputs that fit in cache and 8.4× more accurate than state-of-the-art CPU parallel prefix sum algorithms. We also present a variation of BH that is faster on larger inputs and 2.6× more accurate compared to the same benchmarks, in addition to being up to 2.5× faster on inputs that fit in cache.

• An evaluation of prefix sums on a CPU versus a GPU that shows the BH algorithm variant on a CPU is at least 1.5× more cost-efficient than the state-of-the-art GPU implementation (using AWS pricing [2]).

Map

The rest of this thesis is organized as follows. In Chapter 2, we review background on prefix sums, cost models for algorithm analysis in the remainder of the thesis, and our experimental setup. We present and analyze algorithms for excluded and included sums in Chapter 3. We introduce the block-hybrid algorithm and two variants for prefix sums, and conduct an experimental study of prefix sum algorithms in Chapter 4. Finally, we provide closing remarks in Chapter 5.

Chapter 2

Preliminaries

This section reviews the prefix sum primitive as well as models for dynamic multithreading and numerical accuracy that we will use to analyze algorithms in this thesis. The included sums, excluded sums, and prefix sums software are implemented in Cilk [7, 24], which is a linguistic extension to C++ [33]. Therefore, we describe Cilk-like pseudocode and use the model of multithreading underlying Cilk, although the pseudocode can be applied to any arbitrary fork-join parallelism model.

Prefix and Suffix Sums

We first review the all-prefix-sums operation [5] (or scan) as a primitive operation that we will use throughout this thesis (note that this is an inclusive scan).

Definition 1 (All-prefix-sums Operation) The all-prefix-sums operation takes a binary associative operator ⊕ (for example, addition, multiplication, minimum or maximum), and an ordered set of 푛 elements

$$[a_0, a_1, \ldots, a_{n-1}]$$

and returns the ordered set

$$[a_0,\ (a_0 \oplus a_1),\ \ldots,\ (a_0 \oplus a_1 \oplus \cdots \oplus a_{n-1})].$$

25 Example. If ⊕ is addition, then the all-prefix-sums operation on the ordered set

[3 1 7 0 4 1 6 3]

would return [3 4 11 11 15 16 22 25].
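For illustration, the same example can be expressed with C++17's standard scan primitive (a sketch; the implementations in this thesis use their own scans rather than the standard library):

#include <functional>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    // All-prefix-sums with ⊕ = addition; std::inclusive_scan accepts any
    // binary associative operator, matching Definition 1.
    std::vector<int> a = {3, 1, 7, 0, 4, 1, 6, 3};
    std::vector<int> out(a.size());
    std::inclusive_scan(a.begin(), a.end(), out.begin(), std::plus<>{});
    for (int v : out) std::cout << v << ' ';  // prints 3 4 11 11 15 16 22 25
}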

The suffix sum is the reverse of the prefix sum:

Definition 2 (All-suffix-sums Operation) The all-suffix-sums operation takes a binary associative operator ⊕ (for example, addition, multiplication, minimum or maximum), and an ordered set of 푛 elements

$$[a_0, a_1, \ldots, a_{n-1}]$$

and returns the ordered set

$$[(a_0 \oplus a_1 \oplus \cdots \oplus a_{n-1}),\ (a_1 \oplus \cdots \oplus a_{n-1}),\ \ldots,\ a_{n-1}].$$

Additionally, Reduce, which we will see in Chapter 4, takes the same arguments as prefix sum, but only returns the single element $a_0 \oplus a_1 \oplus \cdots \oplus a_{n-1}$ (the last element).
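Both variants reduce to the same primitive; a sketch (again using the standard library purely for illustration): scanning the reversed sequence yields the suffix sums, and Reduce is a single fold.

#include <functional>
#include <numeric>
#include <vector>

// All-suffix-sums: an inclusive scan over the reversed input, written
// through a reverse iterator, gives out[i] = a[i] ⊕ a[i+1] ⊕ ... ⊕ a[n-1].
std::vector<int> suffix_sums(const std::vector<int>& a) {
    std::vector<int> out(a.size());
    std::inclusive_scan(a.rbegin(), a.rend(), out.rbegin(), std::plus<>{});
    return out;
}

// Reduce: the single element a0 ⊕ a1 ⊕ ... ⊕ a(n-1).
int reduce_all(const std::vector<int>& a) {
    return std::reduce(a.begin(), a.end(), 0, std::plus<>{});
}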

Analysis of Multithreaded Algorithms

The linguistic model for multithreaded pseudocode from [10, Chapter 27] follows MIT Cilk [24]. It augments serial code with three keywords: spawn, sync, and parallel, the last of which can be implemented with the first two. The spawn keyword before a function call creates nested parallelism. The parent function executes a spawn and can execute in parallel with the spawned child subroutine. Therefore, the code that immediately follows the spawn may execute in parallel with the child, rather than waiting for it to complete as in a serial function call. A parent function cannot safely use the values returned by its children until after a sync statement, which causes the function to wait until all of its spawned children complete before proceeding to the code after the sync. Every function also implicitly syncs before it returns, preventing orphaning.

The spawn and sync keywords denote logical parallelism in a computation, but do not require parts of a computation to run in parallel. At runtime, a scheduler determines which subroutines run concurrently by assigning them to different cores in a multicore machine. Cilk has a runtime system that implements a provably efficient work-stealing scheduler [7]. Parallel loops can be expressed by preceding an ordinary for with the keyword parallel, which indicates that all iterations of the loop may run in parallel. Parallel loops can be implemented by parallel divide-and-conquer recursion using spawn and sync. We use the dynamic multithreading model to measure the asymptotic costs of multithreaded algorithms in terms of their work and span [10, Chapter 27]. The work is the total time to execute the entire algorithm on one processor. The span (sometimes called the critical-path length or computational depth) is the longest serial chain of dependencies in the computation (or the runtime in instructions on an infinite number of processors). For example, a parallel for loop over 푁 iterations of 푂(1) work per iteration has 푂(log 푁) span in the dynamic multithreading model (in this thesis, log is always base 2).
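As a concrete (illustrative) rendering of these keywords in C++, OpenCilk-style code uses cilk_spawn, cilk_sync, and cilk_for; the functions below are examples, not code from this thesis:

#include <cilk/cilk.h>

// A parallel loop: N iterations of O(1) work each, for O(N) work and
// O(log N) span under the dynamic multithreading model.
void scale_all(double* a, long n, double c) {
    cilk_for (long i = 0; i < n; ++i)
        a[i] *= c;
}

// Nested fork-join parallelism with spawn and sync.
long fib(long n) {
    if (n < 2) return n;
    long x = cilk_spawn fib(n - 1);  // child may run in parallel with parent
    long y = fib(n - 2);             // parent continues without waiting
    cilk_sync;                       // wait for all spawned children
    return x + y;
}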

Accuracy Model

Sums of floating point numbers in scientific computing are ubiquitous and require careful consideration of the accumulation of roundoff errors. Because floating-point addition is non-associative on computers, standard compilers such as Clang and GCC are not allowed to reorder operations in the summation of an arbitrary sequence of floating point operations without breaking IEEE or ISO guarantees (this restriction can be lifted with certain compiler flags, but the results are no longer guaranteed). The effect would be different summation results with different degrees of error when dissimilar-sized numbers are added together. There has been plenty of research carried out on numerical stability, most notably by Higham [20, 21]. As we find out analytically and experimentally, the order in which we sum an arbitrary sequence of floating point numbers has a great effect on the accuracy of a result, depending on the input values. Moreover, if we can understand how to best mitigate floating point roundoff error and achieve high accuracy in our summations, we can use that understanding in designing serial and parallel algorithms that achieve good compromises of performance and accuracy.

27 We use a standard accuracy model for floating point arithmetic, as described by Higham [20] and Goldberg [16].

$$fl(x \,\mathrm{op}\, y) = (x \,\mathrm{op}\, y)(1 + \delta), \qquad |\delta| \le u, \qquad \mathrm{op} \in \{+, -, \times, /\} \tag{2.1}$$

where 훿 is a small error associated with the floating point representation of the calculation after correct rounding, and 푢 is the unit roundoff (or machine precision), which for single-precision floating point is $5.96 \times 10^{-8}$ and for doubles is $1.11 \times 10^{-16}$. We are also assuming the use of a guard digit, which is standard. For more details on floating point arithmetic, see [16] and the rest of [20]. We use the following summarized results from Higham's work on the accuracy of floating point summation to guide our accuracy analysis and evaluation:

Worst Case Error Bounds. In summation, we are evaluating an expression of the form $S_n = \sum_{i=0}^{n-1} x_i$, where $x_0, \ldots, x_{n-1}$ are real numbers. As we will elucidate in Chapter 4, Higham [20] shows that the worst case error bound for summing an array of numbers in the naive way (a running total from left to right) has an error constant of 푂(푛), i.e., the length of the longest chain of dependent additions. Further, a pairwise summation, which we will describe in Chapter 4 and analogize to a parallel-prefix computation, has a worst case error bound constant of 푂(log 푛). We will apply these error bounds to our prefix sum algorithms. Higham notes that when 푛 is very large, pairwise summation is an efficient compromise between performance and accuracy.
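A minimal sketch of pairwise summation (recursive halving; the midpoint split is the usual choice) that realizes the 푂(log 푛) error constant:

#include <cstddef>

// Pairwise (cascade) summation: splitting recursively halves the depth of
// dependent additions, so the longest chain has length O(log n) rather than
// the O(n) chain of a left-to-right running total. Requires n >= 1.
float pairwise_sum(const float* x, std::size_t n) {
    if (n == 1) return x[0];
    std::size_t half = n / 2;
    return pairwise_sum(x, half) + pairwise_sum(x + half, n - half);
}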

Experimental Analysis and Results. Rounding error bounds tend to be very pessimistic, since they are worst case bounds. Higham mentions that it is also important to experimentally evaluate the accuracy of summations. We use several ideas from his methodologies: using higher precision numbers as a reference point, using random numbers from uniform, exponential, and normal distributions as input, and the Kahan summation algorithm as a benchmark [16]. Further, we use his results on statistical estimates of accuracy on pairwise summation to affirm its performance. Our decision to use root mean square relative error to compare the results is also inspired by his discussion.

Experimental Setup

This section summarizes the shared-memory multicore machines, compilers, and software used for all experimental evaluation in this thesis. The following are the reference names that will be used throughout.

AWS 18-core Machine (c4.8xlarge) An 18-core machine with 2 × Intel Xeon CPU E5-2666 v3 (Haswell) @ 2.90GHz processors, each with 9 cores per processor, with 2-way hyperthreading. Each processor has a 1600MHz bus and a 25MB L3 cache. Each core has a 256KB L2 cache, a 32KB L1 data cache, and a 32KB L1 instruction cache. The cache line size is 64B. There is a total of 60GB DRAM on the machine. The theoretical maximum memory bandwidth is 51.2 GB/second. The cores have AVX2 (and earlier AVX, SSE) instruction set extensions.

Supercloud 40-core Machine A 40-core machine with 2 × Intel Xeon Gold 6248 @ 2.50GHz processors, each with 20 cores per processor, with 2-way hyperthreading. Each processor has a 2993MHz bus and a 27.5MB L3 cache. Each core has a 1MB L2 cache, a 32KB L1 data cache, and a 32KB L1 instruction cache. The cache line size is 64B. There is a total of 384GB DRAM on the machine. The theoretical maximum memory bandwidth is 93.4GB/second. The cores have AVX512 (and earlier AVX2, AVX, SSE) instruction set extensions.

Affordable AWS Machine (c5.xlarge) A 2-core machine with 1 × Intel Xeon Platinum 8124M CPU @ 3.00GHz processor, with 2 cores per processor and 2-way hyperthreading. The processor has a 25MB L3 cache. Each core has a 1MB L2 cache, a 32KB L1 data cache, and a 32KB L1 instruction cache. The cache line size is 64B. There is a total of 8GB DRAM on the machine. The cores have AVX512 (and earlier AVX2, AVX, SSE) instruction set extensions.

Volta GPU The Supercloud 40-core machine also has 2 NVIDIA Volta V100 GPUs attached. The GPU has 32GB RAM. When run with just 1 GPU, we note that this is the same as the p3.2xlarge Amazon EC2 instance [1], and use this for pricing comparisons against the Affordable AWS Machine.

Compiler As stated earlier, all the code is implemented in C++ using Cilk for fork-join parallelism. The compiler used is an extended version of Clang version 8, called Tapir/LLVM

[28]. The -O3 and -march=native compiler flags are used throughout the experiments for maximum performance.

Boost The Boost C++ Library is used to carry out the accuracy experiments regarding floating point summation. In particular, Boost multi-precision floating point numbers [26] are used, providing 100 decimal digits of precision. The experiments can then calculate summation results much closer to the true real value, and these serve as the baseline against which we compare the relative error of the different methods of summation. Higham [20] uses a similar technique of comparing to a higher precision reference point.

Code The code will be available at https://github.com/seanfraser/thesis at the time of publication, expected by the end of May 2020.
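As an illustration of this accuracy methodology (a sketch under stated assumptions, not the exact benchmark code; names are illustrative, and inputs are assumed positive so partial sums are nonzero):

#include <boost/multiprecision/cpp_dec_float.hpp>
#include <cmath>
#include <cstddef>
#include <vector>

using big = boost::multiprecision::cpp_dec_float_100;  // 100 decimal digits

// Root mean square relative error of a single-precision prefix sum `result`
// against a 100-digit reference prefix sum computed alongside it.
double rmse_vs_reference(const std::vector<float>& input,
                         const std::vector<float>& result) {
    big ref = 0, acc = 0;
    for (std::size_t i = 0; i < input.size(); ++i) {
        ref += big(input[i]);                    // high-precision prefix sum
        big rel = (big(result[i]) - ref) / ref;  // relative error at index i
        acc += rel * rel;
    }
    return std::sqrt((acc / big(input.size())).convert_to<double>());
}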

Chapter 3

Included and Excluded Sums

Outline. In this chapter, we present algorithms for included and excluded sums in arbitrary dimensions. We begin with necessary preliminaries for indexing and defining ranges in tensors to understand the later algorithms. We conclude with two potential applications for included and excluded sums, and their relation to prefix sums.

3.1 Tensor Preliminaries

We discuss order-푑 tensors in a particular orthogonal basis. That is, tensors are 푑-dimensional arrays of elements over some field ℱ, usually the real or complex numbers. We denote tensors by capital script letters 풜 and vectors by lowercase boldface letters a.

Definition 3 (Index domain) An index domain 푈 is the cross product $U = U_1 \times U_2 \times \cdots \times U_d$ where $d \ge 1$ and for $i = 1, 2, \ldots, d$, we have $U_i = \{0, 1, \ldots, n_i - 1\}$ where $n_i \ge 1$. That is, dimensions are 1-indexed and coordinates start at 0.

A tensor 풜 is a mapping 풜 : 푈 → F for some index domain 푈 and for some field F. The length of dimension 푖 of tensor 풜 is 푛푖.

We can denote a particular element of an index domain as an index $\mathbf{x} = (x_1, x_2, \ldots, x_d)$ where for all $i = 1, 2, \ldots, d$, $0 \le x_i \le n_i - 1$. Sometimes for simplicity of notation, we will use 푛푖 in place of 푛푖 − 1 to denote the end of a row.

31 We use 푥 : 푥′ to denote a range of indices along a particular dimension and : without bounds to indicate all elements along a particular dimension. For example, the middle 푛/2 columns of an 푛 × 푛 matrix 풜 would be written 풜(:, 푛/4 : 3푛/4). A range of indices defines a subtensor. A tensor defines a value at each index (i.e. 풜[x] ∈ F). The value of a tensor at a range of indices is the sum (or reduction) of the values of the tensor at each index in the range. Next, we introduce box notation, which we will use to formally define the included and excluded sums problems.

Definition 4 (Box) A box of an index domain 푈 cornered at $\mathbf{x} = (x_1, \ldots, x_d) \in U$ and $\mathbf{x}' = (x'_1, \ldots, x'_d) \in U$, denoted $B = (x_1 : x'_1, x_2 : x'_2, \ldots, x_d : x'_d) \subseteq U$, is the region $\{\mathbf{y} = (y_1, y_2, \ldots, y_d) \in U : x_1 \le y_1 < x'_1, x_2 \le y_2 < x'_2, \ldots, x_d \le y_d < x'_d\}$. For all $i = 1, 2, \ldots, d$, $x_i < x'_i$.

Definition 5 (Box-side lengths) Given an index domain 푈 and a box $B = (x_1 : x'_1, x_2 : x'_2, \ldots, x_d : x'_d) \subseteq U$, the box-side lengths are denoted $\ell_B = (x'_1 - x_1, x'_2 - x_2, \ldots, x'_d - x_d)$, or the length of the box in each dimension.

3.2 Problem Definitions

In this section we describe the included and excluded sums problems. Throughout this section, let 풜 : 푈 → F be a 푑-dimensional tensor and k = (푘1, . . . , 푘푑) be a vector of box-side lengths. For simplicity in the pseudocode, we will assume 푛푖 mod 푘푖 = 0 for all 푖 = 1, 2, . . . , 푑. In implementations, the input can either be padded with the identity to make this true, or extra code can be added to deal with unaligned boxes.

Problem 1 (Included Sums Problem) An algorithm for the included sums problem takes as input a tensor 풜 and a vector of box-side lengths k. It outputs a new tensor ℬ : 푈 → F such that every index $\mathbf{x} = (x_1, \ldots, x_d)$ of ℬ maps to the sum of elements at all indices inside the box cornered at x, namely $(x_1 : x_1 + k_1, \ldots, x_d : x_d + k_d)$.

32 For example, the included sums problem in one dimension is the problem of finding the sum of each run of length 푘 in an array of length 푛. In two dimensions, the included sums

problem is finding the sum of all elements in every 푘1 × 푘2 box in an 푛1 × 푛2 matrix. The excluded sums problem definition is the inverse of the included sums: every index in the output tensor ℬ maps to the sum of elements at all indices outside of the box cornered at that index.

3.3 INCSUM Algorithm

First, we will present a linear-time algorithm INCSUM to solve the included sums problem in one dimension and demonstrate how to extend it to arbitrary dimensions.

Included Sums in One Dimension

Let 풜 be a list of length 푁 and 푘 be the (scalar) length of the box. For simplicity, assume 푁 mod 푘 = 0 (we can pad the list length). At a high level, the incsum_1D algorithm generates two intermediate lists $A_p, A_s$ of length 푁 each and does 푁/푘 prefix and suffix sums of length 푘 each, respectively. By construction, for 푥 = 0, 1, . . . , 푁 − 1, $\mathcal{A}_s[x] = \mathcal{A}\big[x : \lceil (x+1)/k \rceil \cdot k\big]$ and $\mathcal{A}_p[x] = \mathcal{A}\big[\lfloor x/k \rfloor \cdot k : x + 1\big]$. We begin with pseudocode for the included sum in one dimension in Figure 3-1. We then illustrate the ranged prefix and suffix sums in Figure 3-2 and present a worked example in Figure 3-3. The algorithm in Figure 3-1 is clearly linear in the number of elements in the list since the total number of loop iterations is 2푁. As mentioned in the introduction, a naive algorithm for included sums takes 푂(푁푘) work where 푁 is the number of elements and 푘 is the box length. We will now show that incsum_1D computes the included sum.

Lemma 1 incsum_1D solves the included sums problem in one dimension.

Proof. Consider a one-dimensional list 풜 with 푁 elements and box length 푘. We will show that for each 푥 = 0, 1, . . . , 푁 − 1, ℬ(푥) contains the desired sum. For 푥 mod 푘 = 0, this holds by construction. For all other 푥, the previously defined prefix and suffix sum give

incsum_1D(퐴, 푁, 푘)
 1  ◁ Input: list 퐴 of size 푁, included sum length 푘
 2  ◁ Output: list 퐵 of size 푁 where each entry 퐵[푖] = 퐴[푖 : 푖 + 푘] for 푖 = 0, 1, . . . , 푁 − 1.
 3  let 퐵 be a new list of size 푁
 4  퐴푝 ← 퐴, 퐴푠 ← 퐴
 5  ◁ 푘-wise prefix and suffix sums within each block of size 푘
 6  parallel for 푖 ← 0 to 푁/푘 − 1
 7      for 푗 ← 1 to 푘 − 1
 8          퐴푝[푖푘 + 푗] += 퐴푝[푖푘 + 푗 − 1]
 9      for 푗 ← 푘 − 2 downto 0
10          퐴푠[푖푘 + 푗] += 퐴푠[푖푘 + 푗 + 1]
11  parallel for 푖 ← 0 to 푁 − 1
12      if 푖 mod 푘 == 0
13          퐵[푖] = 퐴푠[푖]
14      else
15          퐵[푖] = 퐴푠[푖] + 퐴푝[푖 + 푘 − 1]
16  return 퐵

Figure 3-1: Pseudocode for the included sum in one dimension.
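A direct C++ rendering of Figure 3-1 may make the index arithmetic easier to follow (a serial sketch; the parallel for loops would become parallel loops in the Cilk setting, and windows that would run past the end of the array are truncated, corresponding to padding with the identity):

#include <cstddef>
#include <vector>

// incsum_1D: k-wise prefix sums (A_p) and suffix sums (A_s) per block of
// size k, then B[i] = A_s[i] (+ A_p[i + k - 1] when the window straddles a
// block boundary). Assumes n mod k == 0.
std::vector<double> incsum_1d(const std::vector<double>& a, std::size_t k) {
    std::size_t n = a.size();
    std::vector<double> ap(a), as(a), b(n);
    for (std::size_t i = 0; i < n / k; ++i) {   // one block of size k at a time
        for (std::size_t j = 1; j < k; ++j)     // k-wise prefix sums
            ap[i * k + j] += ap[i * k + j - 1];
        for (std::size_t j = k - 1; j-- > 0; )  // k-wise suffix sums
            as[i * k + j] += as[i * k + j + 1];
    }
    for (std::size_t i = 0; i < n; ++i)
        b[i] = (i % k == 0 || i + k - 1 >= n)   // suffix alone covers the window
                   ? as[i]
                   : as[i] + ap[i + k - 1];     // suffix piece + prefix piece
    return b;
}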


Figure 3-2: Computing the included sum in one dimension in linear time.

the desired result. Recall that $\mathcal{B}[x] = \mathcal{A}_p[x+k-1] + \mathcal{A}_s[x]$, $\mathcal{A}_s[x] = \mathcal{A}\big[x : \lceil (x+1)/k \rceil \cdot k\big]$, and $\mathcal{A}_p[x+k-1] = \mathcal{A}\big[\lfloor (x+k-1)/k \rfloor \cdot k : x+k\big]$. Also note that $\lfloor (x+k-1)/k \rfloor = \lceil (x+1)/k \rceil$ for all $x \bmod k \ne 0$.

Therefore,

$$\mathcal{B}[x] = \mathcal{A}_s[x] + \mathcal{A}_p[x+k-1] = \mathcal{A}\left[x : \left\lceil \tfrac{x+1}{k} \right\rceil \cdot k\right] + \mathcal{A}\left[\left\lfloor \tfrac{x+k-1}{k} \right\rfloor \cdot k : x+k\right] = \mathcal{A}[x : x+k],$$

which is exactly the desired sum.

Figure 3-3: Example of computing the included sum in one dimension with 푁 = 8, 푘 = 4.

Generalizing to Arbitrary Dimensions

Next, we demonstrate how to extend the one-dimensional INCSUM algorithm to solve the included sums problem for a 푑-dimensional tensor of 푁 points in 푂(푑푁) work. At a high level, we apply the same one-dimensional INCSUM algorithm along every row of each dimension. For example, Figure 3-4 shows how to use the result of the included sum along one dimension of a matrix to find the included sum of a two-dimensional box.

Figure 3-4: Computing the included sum in two dimensions (INCSUM in the second dimension, then INCSUM in the first dimension on the result).
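A sketch of this dimension-by-dimension composition in two dimensions, reusing the incsum_1d sketch above (illustrative; a real implementation would operate in place on a flat row-major array):

#include <cstddef>
#include <vector>

// Included sums for k1 x k2 boxes: apply one-dimensional INCSUM along every
// row, then along every column of the result, as in Figure 3-4.
void incsum_2d(std::vector<std::vector<double>>& m,
               std::size_t k1, std::size_t k2) {
    for (auto& row : m)                       // dimension 2: all rows
        row = incsum_1d(row, k2);
    std::size_t rows = m.size(), cols = m[0].size();
    for (std::size_t c = 0; c < cols; ++c) {  // dimension 1: all columns
        std::vector<double> col(rows);
        for (std::size_t r = 0; r < rows; ++r) col[r] = m[r][c];
        col = incsum_1d(col, k1);
        for (std::size_t r = 0; r < rows; ++r) m[r][c] = col[r];
    }
}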

Subroutines incsum_prefix_along_row and incsum_suffix_along_row in Figures 3-5 and 3-6 (respectively) compute the ranged prefix and suffix sum that later combine to form the included sum along an arbitrary row of the tensor in higher dimensions.

incsum_prefix_along_row(퐴푝, 푖, 푗, x = (푥1, . . . , 푥푑), k = (푘1, . . . , 푘푑))
 1  ◁ Input: Tensor 퐴푝 (푑 dimensions, side lengths (푛1, . . . , 푛푑)), dimensions reduced up to 푖,
 2  ◁ dimensions to take points up to 푗 (푗 ≥ 푖), row defined by index x, and box-lengths k.
 3  ◁ Output: Modify 퐴푝 to do a ranged prefix sum along dimension 푗 + 1
 4  ◁ while fixing dimensions up to 푖.
 5
 6  ◁ Divide up the row into 푛푗+1/푘푗+1 pieces of size 푘푗+1 each
 7  parallel for 푙 ← 0 to 푛푗+1/푘푗+1 − 1
 8      ◁ 푘-wise prefix sum along the row
 9      for 푚 ← 1 to 푘푗+1 − 1
10          퐴푝[푛1, . . . , 푛푖, 푥푖+1, . . . , 푥푗, 푙푘푗+1 + 푚, 푥푗+2, . . . , 푥푑] +=
11              퐴푝[푛1, . . . , 푛푖, 푥푖+1, . . . , 푥푗, 푙푘푗+1 + 푚 − 1, 푥푗+2, . . . , 푥푑]

(In each index above, the first 푖 coordinates are 푛1, . . . , 푛푖, the next 푗 − 푖 are 푥푖+1, . . . , 푥푗, and the last 푑 − 푗 − 1 are 푥푗+2, . . . , 푥푑.)

Figure 3-5: Ranged prefix along row.

incsum_suffix_along_row(퐴푠, 푖, 푗, x = (푥1, . . . , 푥푑), k = (푘1, . . . , 푘푑))
 1  ◁ Input: Tensor 퐴푠 (푑 dimensions, side lengths (푛1, . . . , 푛푑)), dimensions reduced up to 푖,
 2  ◁ dimensions to take points up to 푗 (푗 ≥ 푖), row defined by index x, and box-lengths k.
 3  ◁ Output: Modify 퐴푠 to do a ranged suffix sum along dimension 푗 + 1
 4  ◁ while fixing dimensions up to 푖.
 5
 6  ◁ Divide up the row into 푛푗+1/푘푗+1 pieces of size 푘푗+1 each
 7  parallel for 푙 ← 0 to 푛푗+1/푘푗+1 − 1
 8      ◁ 푘-wise suffix sum along the row (coordinate grouping as in Figure 3-5)
 9      for 푚 ← 푘푗+1 − 2 downto 0
10          퐴푠[푛1, . . . , 푛푖, 푥푖+1, . . . , 푥푗, 푙푘푗+1 + 푚, 푥푗+2, . . . , 푥푑] +=
11              퐴푠[푛1, . . . , 푛푖, 푥푖+1, . . . , 푥푗, 푙푘푗+1 + 푚 + 1, 푥푗+2, . . . , 푥푑]

Figure 3-6: Ranged suffix along row.

The function incsum_result_along_row in Figure 3-7 combines the ranged prefix and suffix sums along an arbitrary dimension to compute the included sum in a row of the tensor.

incsum_result_along_row(퐴, 푖, 푗, x = (푥1, . . . , 푥푑), k = (푘1, . . . , 푘푑), 퐴푝, 퐴푠)
 1  ◁ Input: Tensor 퐴 (푑 dimensions, side lengths (푛1, . . . , 푛푑)) to write the output,
 2  ◁ dimensions reduced up to 푖, dimensions to take points up to 푗 (푗 ≥ 푖),
 3  ◁ row defined by index x, box-lengths k, ranged prefix and suffix tensors 퐴푝, 퐴푠.
 4  ◁ Output: Modify 퐴 with the included sum along the specified row in dimension 푗 + 1.
 5
 6  parallel for ℓ ← 0 to 푛푗+1 − 1
 7      ◁ If on a boundary, just assign the beginning of the 푘-wise suffix
 8      ◁ (equal to the sum of 푘 elements in this window)
 9      if ℓ mod 푘푗+1 == 0
10          퐴[푛1, . . . , 푛푖, 푥푖+1, . . . , 푥푗, ℓ, 푥푗+2, . . . , 푥푑] =
11              퐴푠[푛1, . . . , 푛푖, 푥푖+1, . . . , 푥푗, ℓ, 푥푗+2, . . . , 푥푑]
12      else
13          ◁ Otherwise, add in the relevant prefix and suffix
14          퐴[푛1, . . . , 푛푖, 푥푖+1, . . . , 푥푗, ℓ, 푥푗+2, . . . , 푥푑] =
15              퐴푠[푛1, . . . , 푛푖, 푥푖+1, . . . , 푥푗, ℓ, 푥푗+2, . . . , 푥푑]
16              + 퐴푝[푛1, . . . , 푛푖, 푥푖+1, . . . , 푥푗, ℓ + 푘푗+1 − 1, 푥푗+2, . . . , 푥푑]

Figure 3-7: Computes the included sum along a given row.

The function incsum_along_dim in Figure 3-8 computes the included sum for a specific dimension of the tensor along all rows. Finally, incsum in Figure 3-9 computes the full included sum along all dimensions.

incsum_along_dim(퐴, 푖, 푗, k = (푘1, . . . , 푘푑))
 1  ◁ Input: Tensor 퐴 (푑 dimensions, side lengths (푛1, . . . , 푛푑)) to write the output,
 2  ◁ dimensions reduced up to 푖, dimensions to take points up to 푗 (푗 ≥ 푖),
 3  ◁ box-lengths k.
 4  ◁ Output: Modify 퐴 with the included sum in dimension 푗.
 5
 6  ◁ Save into temporaries to not overwrite the input
 7  퐴푝 ← 퐴, 퐴푠 ← 퐴
 8  ◁ Iterate through coordinates by varying the coordinates in dimensions 푖 + 1, . . . , 푑
 9  ◁ (the slot marked _ is the row being summed) while fixing the first 푖 dimensions
10  parallel for {x = (푥1, . . . , 푥푑) ∈ (푛1, . . . , 푛푖, :, . . . , :, _, :, . . . , :)}
11      spawn incsum_prefix_along_row(퐴푝, 푖, 푗, x, k)
12      incsum_suffix_along_row(퐴푠, 푖, 푗, x, k)
13      sync
14      ◁ Calculate the included sum into the output
15      incsum_result_along_row(퐴, 푖, 푗, x, k, 퐴푝, 퐴푠)

Figure 3-8: Computes the included sum along a given dimension.

incsum(퐴, k = (푘1, . . . , 푘푑))
 1  ◁ Input: Tensor 퐴 (푑 dimensions), box-lengths k.
 2  ◁ Output: Modify 퐴 with the included sum in all dimensions.
 3
 4  for 푗 ← 1 to 푑
 5      ◁ The included sum along dimension 푗 overwrites the input
 6      incsum_along_dim(퐴, 0, 푗, k)
 7  ◁ The included sums in all dimensions are in 퐴.

Figure 3-9: Computes the included sum for the entire tensor.

Lemma 2 (Work of Included Sum) incsum_along_dim(퐴, 푖, 푗) has work

$$O\left(\prod_{\ell=i+1}^{d} n_\ell\right).$$

Proof. The loop over points in incsum_along_dim has $\left(\prod_{\ell=i+1}^{d} n_\ell\right) / n_{j+1}$ iterations over dimensions $i+1, \ldots, j-1, j, j+2, \ldots, d$. Each call to incsum_prefix_along_row, incsum_suffix_along_row, and incsum_result_along_row has work $O(n_{j+1})$, for total work $O\left(\prod_{\ell=i+1}^{d} n_\ell\right)$.

Corollary 3 Given a 푑-dimensional tensor with 푁 points, INCSUM has work 푂(푑푁).

Lemma 4 (Span of Included Sum) incsum_along_dim(퐴, 푖, 푗) has span

$$O(\log n_{j+1}) + O\left(\sum_{\ell=i+1}^{d} \log(n_\ell)\right).$$

Proof. The loop over points in incsum_along_dim has $\prod_{\ell=i+1, \ell \ne j+1}^{d} n_\ell$ iterations, which can be done in parallel, for span

$$O\left(\log\left(\prod_{\ell=i+1, \ell \ne j+1}^{d} n_\ell\right)\right) = O\left(\sum_{\ell=i+1, \ell \ne j+1}^{d} \log(n_\ell)\right).$$

The subroutines incsum_prefix_along_row and incsum_suffix_along_row are logically parallel and have equal span, so we will just analyze the ranged prefix. The outer loop of incsum_prefix_along_row can be parallelized, and the inner loop can be replaced with a log-span parallel prefix as we will discuss in Chapter 4. Therefore, the span of the ranged prefix for each row along dimension 푗 + 1 is

푂(log(푛푗+1/푘푗+1) + log 푘푗+1) = 푂(log 푛푗+1) .

Finally, calculating the contribution to the output for each of the outer loop iterations

39 can be parallelized, for span 푂(log 푛푗+1). Therefore, the total span is

$$O\left(\sum_{\ell=i+1, \ell \ne j+1}^{d} \log(n_\ell) + \log n_{j+1} + \log n_{j+1}\right) = O(\log n_{j+1}) + O\left(\sum_{\ell=i+1}^{d} \log(n_\ell)\right).$$

Corollary 5 Given a 푑-dimensional tensor with 푁 points, INCSUM has span

$$O\left(d \sum_{\ell=1}^{d} \log(n_\ell)\right).$$

Observation 1 (Included Sum Computation) After incsum_along_dim(퐴, 푖, 푗), let 퐴′ be the output and 퐴 be the input. For every valid index $(x_{i+1}, \ldots, x_d)$,

$$A'[\,n_1, \ldots, n_i, x_{i+1}, \ldots, x_d\,] = A[\,n_1, \ldots, n_i, x_{i+1}, \ldots, x_j, x_{j+1} : x_{j+1} + k_{j+1}, x_{j+2}, \ldots, x_d\,].$$

Lemma 6 Applying one-dimensional INCSUM along 푖 dimensions of a 푑-dimensional tensor solves the included sums problem up to 푖 dimensions.

Proof. By induction.

Base case: We have proved the one-dimensional case in Lemma 1.

Inductive Hypothesis: Let 퐵푖 be the result of 푖 iterations of the above INCSUM algorithm. That is, we do 퐼푁퐶푆푈푀 along dimensions 1, . . . , 푖. Suppose that ℬ푖 is the included sum in 푖 dimensions.

Inductive Step: We will show that 퐵푖+1 = 퐼푁퐶푆푈푀(퐵푖, 0, 푖 + 1, k) is the included sum of 푖 + 1 dimensions. By the induction hypothesis, each element of 퐵푖 is the included sum in 푖 dimensions. Consider any valid coordinate x = (푥1, . . . , 푥푑). The value in 퐵푖+1 at

40 coordinate x is

$$B_i[\mathbf{x}] = \mathcal{A}[\,\underbrace{x_1 : x_1 + k_1, \ldots, x_i : x_i + k_i}_{i},\, \underbrace{x_{i+1}, \ldots, x_d}_{d-i}\,]$$

$$\begin{aligned}
B_{i+1}[\mathbf{x}] ={}& \mathcal{A}\Big[x_1 : x_1 + k_1, \ldots, x_i : x_i + k_i,\, x_{i+1} : \Big\lceil \tfrac{x_{i+1}+1}{k_{i+1}} \Big\rceil \cdot k_{i+1},\, x_{i+2}, \ldots, x_d\Big] \\
&+ \mathcal{A}\Big[x_1 : x_1 + k_1, \ldots, x_i : x_i + k_i,\, \Big\lfloor \tfrac{x_{i+1}+k_{i+1}-1}{k_{i+1}} \Big\rfloor \cdot k_{i+1} : x_{i+1} + k_{i+1},\, x_{i+2}, \ldots, x_d\Big] \\
={}& \mathcal{A}[\,x_1 : x_1 + k_1, \ldots, x_{i+1} : x_{i+1} + k_{i+1},\, x_{i+2}, \ldots, x_d\,].
\end{aligned}$$

Corollary 7 Applying INCSUM along 푑 dimensions of a 푑-dimensional tensor solves the included sums problem.

3.4 Excluded Sums (DRES) Algorithm

The remainder of this chapter presents the dimension-reduction excluded-sums (DRES) algorithm. Before showing how to compute the excluded sum, we will formulate the points in the excluded sum in terms of the "box complement" and partition the region of interest into disjoint sets of points. For the remainder of this section, we will refer to an index domain 푈 and a box 퐵 cornered at $\mathbf{x} = (x_1, \ldots, x_d) \in U$ and $\mathbf{x}' = (x'_1, \ldots, x'_d) \in U$ where for all $i = 1, 2, \ldots, d$, $x_i < x'_i$.

We use k = (푘1, 푘2, . . . , 푘푑) to denote the size of the box in each dimension.

Box Complements and Excluded Sums

We introduce notation for taking the complement of a point (or range of points) over some (not necessarily all) dimensions.

Definition 6 (Box Complement) The 푖-complement of a box 퐵 is denoted $C_i(B) = \{\mathbf{y} = (y_1, \ldots, y_d) : \text{there exists } j \le i \text{ such that } y_j < x_j \text{ or } y_j \ge x'_j, \text{ and for all } m > i,\ x_m \le y_m < x'_m\}$.

41 For example, we can use the box complement 퐶푑(퐵) to denote the set of points outside

of a box 퐵. The sum of all points in the set 퐶푑(퐵) is exactly the excluded sum. We will now show how to recursively partition the box complement into disjoint sets of points, which we will later sum in the DRES algorithm.

Theorem 8 (Recursive Box Complement) Let $\mathbf{x} = (x_1, \ldots, x_d) \in U$ and $\mathbf{x}' = (x'_1, \ldots, x'_d) \in U$ be points such that for all $i = 1, 2, \ldots, d$, $x_i < x'_i$, and let box $B = (x_1 : x'_1, \ldots, x_d : x'_d)$. The 푖-complement of 퐵 can be expressed recursively with the (푖 − 1)-complement of 퐵 as follows:

$$C_i(B) = (\underbrace{:, \ldots, :}_{i-1},\ {:}\,x_i,\ \underbrace{x_{i+1} : x'_{i+1}, \ldots, x_d : x'_d}_{d-i}) \;\cup\; (\underbrace{:, \ldots, :}_{i-1},\ x'_i\,{:},\ \underbrace{x_{i+1} : x'_{i+1}, \ldots, x_d : x'_d}_{d-i}) \;\cup\; C_{i-1}(B).$$

Note that 퐶0(퐵) = ∅.

Proof. Let 퐴푖 be the right hand side of Equation 8. We will show that for any 푦 =

(푦1, 푦2, . . . , 푦푑) ∈ 푈, 푦 ∈ 퐴푖 if and only if 푦 ∈ 퐶푖(퐵).

First, we will show 푦 ∈ 퐴푖 implies 푦 ∈ 퐶푖(퐵) by case analysis of the three main terms in

퐴푖.

Case 1: $y \in (\underbrace{:, \ldots, :}_{i-1},\ {:}\,x_i,\ \underbrace{x_{i+1} : x'_{i+1}, \ldots, x_d : x'_d}_{d-i})$.

By Definition 6, $z \in C_i(B)$ if there exists some $j \le i$ where $z_j < x_j$ or $z_j \ge x'_j$, and for all $m > i$, $x_m \le z_m < x'_m$. Therefore, $y \in C_i(B)$ because there exists some $j \le i$ (in this case $j = i$) such that $y_j < x_j$, and for all $m > i$, $x_m \le y_m < x'_m$.

Case 2: $y \in (\underbrace{:, \ldots, :}_{i-1},\ x'_i\,{:},\ \underbrace{x_{i+1} : x'_{i+1}, \ldots, x_d : x'_d}_{d-i})$.

By Definition 6, $z \in C_i(B)$ if there exists some $j \le i$ where $z_j < x_j$ or $z_j \ge x'_j$, and for all $m > i$, $x_m \le z_m < x'_m$. Therefore, $y \in C_i(B)$ because there exists some $j \le i$ (in this case $j = i$) such that $y_j \ge x'_j$, and for all $m > i$, $x_m \le y_m < x'_m$.

Case 3: 푦 ∈ 퐶푖−1(퐵).
By Definition 6, 푦 ∈ 퐶푖−1(퐵) implies that there exists 푗 in the range 1 ≤ 푗 ≤ 푖 − 1 such that $y_j < x_j$ or $y_j \ge x'_j$, and that for all 푚 ≥ 푖, $x_m \le y_m < x'_m$. Therefore, 푦 ∈ 퐶푖−1(퐵) implies 푦 ∈ 퐶푖(퐵), since there exists some 푗 ≤ 푖 (in this case, 푗 < 푖) where $y_j < x_j$ or $y_j \ge x'_j$, and for all 푚 > 푖, $x_m \le y_m < x'_m$.

Next, we will show that 푦 ∈ 퐶푖(퐵) implies 푦 ∈ 퐴푖 by case analysis on 퐶푖(퐵). Let 푗 ≤ 푖 be the highest dimension at which 푦 is “out of range,” that is, where $y_j < x_j$ or $y_j \ge x'_j$ (note that there may be multiple values 푗 ≤ 푖 where 푦 is out of range). We define the three main terms of 퐴푖 as $A_{i,1} = (\underbrace{:, \ldots, :}_{i-1},\; : x_i,\; \underbrace{x_{i+1} : x'_{i+1}, \ldots, x_d : x'_d}_{d-i})$, $A_{i,2} = (\underbrace{:, \ldots, :}_{i-1},\; x'_i :,\; \underbrace{x_{i+1} : x'_{i+1}, \ldots, x_d : x'_d}_{d-i})$, and $A_{i,3} = C_{i-1}(B)$.

Case 3a: 푦 ∈ 퐶푖(퐵) and 푗 = 푖.
Since 푦 ∈ 퐶푖(퐵) and 푗 = 푖, either $y_i < x_i$ or $y_i \ge x'_i$, and for all 푚 > 푖, $x_m \le y_m < x'_m$. By construction, if $y_i < x_i$, then 푦 ∈ 퐴푖,1. Similarly, if $y_i \ge x'_i$, then 푦 ∈ 퐴푖,2.

Case 3b: 푦 ∈ 퐶푖(퐵) and 푗 < 푖.
By Definition 6 and the choice of 푗 as the highest out-of-range dimension, there exists 푗 ≤ 푖 − 1 such that $y_j < x_j$ or $y_j \ge x'_j$, and for all 푚 ≥ 푖, $x_m \le y_m < x'_m$. These are exactly the conditions for membership in 퐶푖−1(퐵). Therefore, 푦 ∈ 퐶푖(퐵) with 푗 < 푖 implies that 푦 ∈ 퐶푖−1(퐵) = 퐴푖,3.

We have shown that 푦 ∈ 퐶푖(퐵) if and only if 푦 ∈ 퐴푖, so 퐶푖(퐵) = 퐴푖. Now we will show how to use the box complement to find the set of points in an excluded sum.

Corollary 9 (Recursive Excluded Sum)

$$C_d(B) = (\underbrace{:, \ldots, :}_{d-1},\; : x_d) \,\cup\, (\underbrace{:, \ldots, :}_{d-1},\; x_d + k_d :) \,\cup\, C_{d-1}(B).$$

Corollary 10 (Recursive Excluded Sum Components) The excluded sum can be represented as the union of 2푑 disjoint sets of points.

$$C_d(B) = \bigcup_{i=1}^{d} \Big[ (\underbrace{:, \ldots, :}_{i-1},\; : x_i,\; \underbrace{x_{i+1} : x'_{i+1}, \ldots, x_d : x'_d}_{d-i}) \,\cup\, (\underbrace{:, \ldots, :}_{i-1},\; x_i + k_i :,\; \underbrace{x_{i+1} : x'_{i+1}, \ldots, x_d : x'_d}_{d-i}) \Big].$$

Figure 3-10 illustrates an example of the disjoint regions of the recursive box complement in two dimensions.

Figure 3-10: An example of decomposing the excluded sum into disjoint regions (labeled 1–4) in two dimensions. The red box denotes the points to exclude.
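To make the problem statement concrete, here is a brute-force two-dimensional reference implementation of the excluded sum (names and layout are our own, for illustration). It is useful as a correctness check for the decomposition above, but it performs far more than constant work per output, which is exactly the cost that the DRES algorithm's 푂(푑푁) bound avoids.

#include <stddef.h>

/* Brute-force 2-D excluded sum over a row-major n1-by-n2 array A:
 * for every corner (x, y) of a k1-by-k2 box, sum all elements outside it.
 * A reference implementation for testing only. */
void excluded_sum_2d_naive(const double *A, double *B,
                           size_t n1, size_t n2, size_t k1, size_t k2) {
    for (size_t x = 0; x < n1; x++) {
        for (size_t y = 0; y < n2; y++) {
            double s = 0.0;
            for (size_t i = 0; i < n1; i++) {
                for (size_t j = 0; j < n2; j++) {
                    int inside = i >= x && i < x + k1 && j >= y && j < y + k2;
                    if (!inside)
                        s += A[i * n2 + j];
                }
            }
            B[x * n2 + y] = s;
        }
    }
}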

Dimension-reduction Excluded Sums

Suppose that we want to find the excluded sum over all points in the index domain 푈. We denote the size of 푈 by 푁 = 푛1 · 푛2 · . . . · 푛푑. We will now show how to use INCSUM as a subroutine to find the excluded sums in 푂(푑푁) operations and 푂(푁) space. At a high level, our DRES algorithm proceeds by dimension reduction. At each step 푖 of the reduction, we add two of the components from Corollary 10 to the resulting tensor: we construct a tensor of prefix and suffix sums along 푖 dimensions, and do INCSUM along the remaining 푑 − 푖 dimensions (if any remain). Figure 3-11 presents an example of the dimension reduction in two dimensions. We will begin with the important subroutines in the excluded sums algorithm. First, we define prefix/suffix and included sums along a particular dimension, which are key subroutines in our excluded sums algorithm. We have already presented the included sum subroutine in Section 3.3. Finally, we specify the excluded sums algorithm. Along the way, we will show that the DRES algorithm runs in 푂(푑푁) time and computes the excluded sum.

Prefix Sums

We first specify a subroutine for doing prefix sums along an arbitrary dimension and analyze its work and span. The suffix sum is the same but with a suffix rather than a prefix sum along each row, so we omit its pseudocode and analysis.

Figure 3-11: Steps for computing the excluded sum in two dimensions with included sums on prefix and suffix sums: (1) DRES in the first dimension, with a prefix along each row and INCSUM along each column; (2) DRES in the first dimension, with a suffix along each row and INCSUM along each column; (3) DRES in the second dimension with a prefix; (4) DRES in the second dimension with a suffix.

Lemma 11 (Work of prefix sum) prefix_along_dim(풜, 푖) has work $O\!\left(\prod_{j=i+1}^{d} n_j\right)$.

Proof. The outer loop over dimensions 푖 + 2, . . . , 푑 has $O\!\left(\max\!\left(1, \prod_{j=i+2}^{d} n_j\right)\right)$ iterations, each with $O(n_{i+1})$ work for the inner prefix sum. Therefore, the total work is $O\!\left(\prod_{j=i+1}^{d} n_j\right)$.

Lemma 12 (Span of prefix sum) prefix_along_dim(풜, 푖) has span $O\!\left(\sum_{j=i+1}^{d} \log n_j\right)$.

prefix_along_dim(풜, 푖)

1  Input: tensor 풜 (푑 dimensions, side lengths (푛1, . . . , 푛푑)),
2         dimension index 푖 to do the prefix sum along.
3  Output: modify 풜 to hold the prefix sum along dimension 푖 + 1,
4         fixing dimensions up to 푖.
5  ◁ Iterate through coordinates by varying the coordinates in dimensions 푖 + 2, . . . , 푑
6  ◁ while fixing the first 푖 dimensions.
7  ◁ The blank means dimension 푖 + 1 is not iterated over in the outer loop.
8  parallel for {(푥1, . . . , 푥푑) ∈ (푛1, . . . , 푛푖, _, :, . . . , :)}  ◁ 푖 fixed dimensions, then 푑 − 푖 − 1 iterated dimensions
9      ◁ Prefix sum along the row (can be replaced with a parallel prefix)
10     for ℓ ← 1 to 푛푖+1
11         퐴[푛1, . . . , 푛푖, ℓ, 푥푖+2, . . . , 푥푑] += 퐴[푛1, . . . , 푛푖, ℓ − 1, 푥푖+2, . . . , 푥푑]

Figure 3-12: Prefix sum along a row.

Proof. prefix_along_dim fixes the first 푖 dimensions (1 ≤ 푖 < 푑) and does the prefix sum along dimension 푖 + 1 for each of the rows spanned by the remaining 푑 − 푖 − 1 dimensions. The span of parallelizing over that many rows is $O\!\left(\max\!\left(1, \log \prod_{j=i+2}^{d} n_j\right)\right) = O\!\left(\max\!\left(1, \sum_{j=i+2}^{d} \log n_j\right)\right)$. As we will see in Chapter 4, the span of each prefix sum is $O(\log n_{i+1})$, so the total span is $O\!\left(\sum_{j=i+1}^{d} \log n_j\right)$.

Observation 2 (Prefix sum computation) After prefix_along_dim(풜, 푖), for all 푗 = 0, 1, . . . , 푛푖+1 − 1,

$$A[n_1, \ldots, n_i,\; j,\; x_{i+2}, \ldots, x_d] = A[n_1, \ldots, n_i,\; : j + 1,\; x_{i+2}, \ldots, x_d].$$
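For concreteness, here is a minimal serial C sketch of this subroutine for a row-major tensor (our own illustrative code, not the thesis's implementation; the loop over rows would be a parallel for in the Cilk model).

#include <stddef.h>

/* Prefix sum along dimension `dim` (0-indexed) of a row-major d-dimensional
 * tensor A with side lengths n[0..d-1]. */
void prefix_along_dim(double *A, const size_t *n, size_t d, size_t dim) {
    /* stride between consecutive elements along `dim` in row-major layout */
    size_t stride = 1;
    for (size_t j = dim + 1; j < d; j++) stride *= n[j];
    size_t total = 1;
    for (size_t j = 0; j < d; j++) total *= n[j];
    size_t run = stride * n[dim]; /* extent of one full "row" along `dim` */
    /* iterate over every row along `dim` (parallelizable across rows) */
    for (size_t base = 0; base < total; base += run)
        for (size_t off = 0; off < stride; off++)
            for (size_t l = 1; l < n[dim]; l++)
                A[base + off + l * stride] += A[base + off + (l - 1) * stride];
}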

Add Contribution

Next, we will specify how to add the contribution from each dimension reduction step with a pass through the tensor.

Lemma 13 (Work of Adding Contribution) add_contribution(풜, ℬ, i, offset) has work 푂(푁).

add_contribution(풜, ℬ, 푖, offset)

1  Input: input tensor 풜, output tensor ℬ, fixing dimensions up to 푖.
2  Output: for all coordinates in ℬ, add the relevant contribution from 풜.
3  for {(푥1, . . . , 푥푑) ∈ (:, . . . , :)}
4      if 푥푖+1 + offset ≤ 푛푖+1
5          ℬ[푥1, . . . , 푥푑] += 풜[푛1, . . . , 푛푖, 푥푖+1 + offset, 푥푖+2, . . . , 푥푑]

Figure 3-13: Adding in the contribution.

Lemma 14 (Span of Adding Contribution) add_contribution(풜, ℬ, 푖, offset) has span $O\!\left(\sum_{j=1}^{d} \log n_j\right)$.

Excluded Sum

Theorem 15 (Work of excluded sum) Excluded-Sum(풜, ℬ, k) has work 푂(푑푁) if 푛푖 ≥ 2 for all 푖 = 1, 2, . . . , 푑.

Proof. We will analyze the prefix step (the suffix step is symmetric and logically parallel, so the work and span are asymptotically the same). Each dimension-reduction step parametrized by 푖 takes 1 prefix step and 푑 − 푖 − 1 included-sum calls, each of which has $O\!\left(\prod_{j=i+1}^{d} n_j\right)$ work. Furthermore, there is an additional 푂(푁) to add in the contribution. The total work over 푑 steps is therefore

$$\sum_{i=0}^{d-1} \left( (d-i) \prod_{j=i+1}^{d} n_j + N \right).$$

Adding in the contribution is clearly 푂(푑푁) in total, so we focus on bounding the work of the actual computation, which is $\sum_{i=0}^{d-1} (d-i) \prod_{j=i+1}^{d} n_j$. Since every $n_j \ge 2$, we have $\prod_{j=1}^{i} n_j \ge 2^i$, and therefore $\prod_{j=i+1}^{d} n_j \le N/2^i$.

Excluded_Sum(풜, ℬ, k)

1  Input: tensor 풜 of 푑 dimensions and side lengths (푛1, . . . , 푛푑),
2         output tensor ℬ, side lengths of the excluded box k = (푘1, . . . , 푘푑), 푘푖 ≤ 푛푖 for all 푖 = 1, 2, . . . , 푑.
3  Output: tensor ℬ with size and dimensions matching 풜, containing the excluded sum.
4         For all coordinates x = (푥1, . . . , 푥푑), let 퐵x,k be the box cornered at coordinate x
5         with side lengths k; ℬ[x] is the sum of 풜 over 퐶푑(퐵x,k).
6  풜푝 ← 풜, 풜푠 ← 풜  ◁ Prefix and suffix temporaries
7  for 푖 ← 0 to 푑 − 1
8      ◁ PREFIX STEP
9      ◁ 풜푝 should hold prefixes up to dimension 푖 − 1.
10     풜 ← 풜푝
11     ◁ Do the prefix sum along dimension 푖 + 1, assuming all previous dimensions have been done
12     prefix_along_dim(풜, 푖)
13     ◁ Save the prefix up to dimension 푖 in 풜푝
14     풜푝 ← 풜
15     for 푗 ← 푖 + 2 to 푑
16         ◁ Do the included sum on dimensions [푖 + 2, 푑]
17         incsum_along_dim(풜, 푖, 푗, 푘푗)
18     ◁ Add into the result
19     add_contribution(풜, ℬ, 푖, 0)
20
21     ◁ SUFFIX STEP
22     ◁ 풜푠 should hold suffixes up to dimension 푖 − 1.
23     풜 ← 풜푠
24     ◁ Do the suffix sum along dimension 푖 + 1, assuming all previous dimensions have been done
25     suffix_along_dim(풜, 푖)
26     ◁ Save the suffix up to dimension 푖 in 풜푠
27     풜푠 ← 풜
28     for 푗 ← 푖 + 2 to 푑
29         ◁ Do the included sum on dimensions [푖 + 2, 푑]
30         incsum_along_dim(풜, 푖, 푗, 푘푗)
31     ◁ Add into the result
32     add_contribution(풜, ℬ, 푖, 푘푖)

Figure 3-14: The excluded-sums algorithm.

To bound the work of the computation,

$$\sum_{i=0}^{d-1} (d-i) \prod_{j=i+1}^{d} n_j \;\le\; \sum_{i=0}^{d-1} (d-i)\,\frac{N}{2^i} \;=\; 2\left(d + 2^{-d} - 1\right)N \;=\; O(dN).$$

Therefore, the work of the excluded sum is 푂(푑푁).

Theorem 16 (Span of Excluded Sum) Excluded-Sum(풜, ℬ, k) has span $O(d^2 S_c)$, where $S_c = O\!\left(\sum_{j=1}^{d} \log n_j\right)$ (the span of adding in the contribution).

Proof. The span of each excluded-sum step 푖 is the sum of the spans of the prefix sum along dimension 푖, of the 푑 − 푖 − 1 included-sum steps with dimensions up to 푖 fixed, and of adding into the contribution. Let $S_c = O\!\left(\sum_{j=1}^{d} \log n_j\right)$ be the span of adding in the contribution. That is, the span is

$$\sum_{i=0}^{d-1} \Bigg[ \underbrace{\sum_{j=i+1}^{d} \log n_j}_{\text{prefix}} + \underbrace{\sum_{j=i+2}^{d} \left( \log n_j + \sum_{\ell=i+1}^{d} \log n_\ell \right)}_{\text{included sum}} + \underbrace{S_c}_{\text{add contrib}} \Bigg].$$

At each dimension reduction step, the span of the prefix sum is bounded above by 푆푐. Furthermore, the span of the included sum is bounded above by

$$\sum_{j=i+2}^{d} \left( \log n_j + \sum_{\ell=i+1}^{d} \log n_\ell \right) = \sum_{j=i+2}^{d} \log n_j + \sum_{j=i+2}^{d} \sum_{\ell=i+1}^{d} \log n_\ell \;\le\; S_c + (d-i-1)S_c \;=\; O((d-i)S_c).$$

Substituting back into the expression for the total span of the excluded sum,

$$O\!\left( \sum_{i=0}^{d-1} \big( S_c + (d-i)S_c + S_c \big) \right) = O\!\left( \sum_{i=0}^{d-1} (d-i+2)\,S_c \right) = O(d^2 S_c).$$

Correctness

Theorem 17 (Correctness) Excluded-Sum(풜, ℬ, k) computes the excluded sum at each coordinate.

Proof. After each dimension-reduction step 푖, by construction, the prefix-sum step gives

$$A[x_1, \ldots, x_d] = A[\underbrace{:, \ldots, :}_{i},\; : x_{i+1},\; \underbrace{x_{i+2} : x_{i+2} + k_{i+2}, \ldots, x_d : x_d + k_d}_{d-i-1}].$$

Similarly, the suffix-sum step gives

$$A[x_1, \ldots, x_d] = A[\underbrace{:, \ldots, :}_{i},\; x_{i+1} + k_{i+1} :,\; \underbrace{x_{i+2} : x_{i+2} + k_{i+2}, \ldots, x_d : x_d + k_d}_{d-i-1}].$$

These two are the exact regions defined by Corollary 10.

3.5 Applications

In this section, we briefly discuss two potential applications of included and excluded sums and their connection with prefix sums.

Integral Image

The integral image, also known as a summed-area table, is a two-dimensional table generated from an input image. Oftentimes we care about images that have floating-point values, i.e., a mapping from pixels to real numbers. Each entry in the table stores the cumulative sum of all pixels to the left of and above it, including itself. This enables the rapid calculation of summations over image subregions: a subregion summation can be computed in constant time as a linear combination of only four entries of the integral image, regardless of the size of the subregion. The use of integral images was popularized by the Viola-Jones algorithm [35], and they can also be used for image thresholding [8]. If we instead formulate this problem as a problem of included sums, there is one constraint: the box size has to be fixed as part of the input to the problem. However, we note that with the included sums algorithm, if we choose a box in two dimensions with side lengths k = (푘1, 푘2), then any box of those side lengths can be computed in constant time. Additionally, any box with side lengths greater than that can be computed in time 푂(푐), where 푐 is a constant such that the longest side of the query box is less than 푐푘1 and 푐푘2, by using the resulting output of the included sum algorithm. Although this formulation does have its drawbacks, it provides one key advantage: it does not use subtraction. For large images, the traditional integral-image approach requires subtracting other subregions to compute the result, which can often cause large catastrophic cancellations in floating point.
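For contrast, here is a minimal C sketch of the classical summed-area table and its four-corner query (our own illustrative code; names and layout are assumptions). The subtractions in box_sum are exactly the source of the cancellation discussed above, which the included-sums formulation avoids.

/* Build a summed-area table: S[y][x] = sum of I over rows 0..y, cols 0..x.
 * Both arrays are flattened row-major with dimensions h-by-w. */
void build_sat(const float *I, double *S, int h, int w) {
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            S[y * w + x] = I[y * w + x]
                         + (x > 0 ? S[y * w + x - 1] : 0.0)
                         + (y > 0 ? S[(y - 1) * w + x] : 0.0)
                         - (x > 0 && y > 0 ? S[(y - 1) * w + x - 1] : 0.0);
}

/* Sum over the inclusive box [y0..y1] x [x0..x1] from four table lookups.
 * The subtractions here can cancel catastrophically for large images. */
double box_sum(const double *S, int w, int y0, int x0, int y1, int x1) {
    double s = S[y1 * w + x1];
    if (x0 > 0)           s -= S[y1 * w + x0 - 1];
    if (y0 > 0)           s -= S[(y0 - 1) * w + x1];
    if (x0 > 0 && y0 > 0) s += S[(y0 - 1) * w + x0 - 1];
    return s;
}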

Fast Multipole Method

The fast multipole method (FMM) is an algorithm designed to speed up the calculation of long-ranged forces in the 푛-body problem [17] [4]. The basic problem is to compute the mutual interaction of a large number of particles. The essence of multipole methods is the approximation of the net effect of a large number of individual interactions with distant particles as a single interaction. Demaine et al. [14] claim that, barring the complexities of the fast multipole method, its inner core is a computation of excluded sums:

$$\sum_{j \ne i} x_j \qquad \text{and} \qquad \sum_{|j-i| > 1} x_j.$$

More precisely, we are adding finite-precision floating point numbers $x_j$ and excluding a neighborhood of $x_i$; but in the FMM, the $x_j$ become representations of functions that are accurate only at some distance from point 푖. This key insight recognizes the core problem of excluded sums as we have defined it in this chapter: excluding a specified region in a multidimensional tensor and summing all the elements outside of that region, for many different positions. The goal is a stable computation of all 푁 excluded sums with 푂(1) work per sum.

Both of these applications require included and excluded sums in some form. We have seen from the specification of these algorithms that they rely heavily on computing prefix and suffix sums along an entire dimension of a tensor. Since the suffix sum is effectively the reverse of the prefix sum, most of the techniques that we explore for prefix sums also apply to suffix sums. Hence, given the accuracy and performance considerations, we now turn our attention to optimizing the core subroutine: the prefix sum.

Chapter 4

Prefix Sums

The prefix sum (also known as scan [5]) is one of the most fundamental building blocks of parallel algorithms. Its applications include implementing stream compaction (deleting marked elements from an array) in parallel, comparing strings in lexicographical order, and solving recurrence relations. As Blelloch [5] notes, in addition to being a useful building block, the prefix sum operation is a “good example of a computation that seems inherently sequential, but for which there is an efficient parallel algorithm”. As we have seen in Chapter 3, prefix sums (in addition to suffix sums) are also heavily used subroutines in included and excluded sums. Therefore, it is important to understand optimizations to the prefix sum operation, both in terms of performance and numerical accuracy.

Outline. In this chapter, we first conduct a review of previous work on parallel prefix algorithms. We also review how floating-point accuracy depends on the order in which we sum the numbers of a prefix sum. We then discuss opportunities for optimization, both in terms of performance and accuracy. After presenting our findings in the form of a new algorithm, we evaluate its performance and accuracy against a variety of benchmarks.

We have defined the All-prefix-sums operation (and suffix sums) formally in Chapter 1. As a quick review, the prefix sums operation using addition (which we will hereafter refer to as prefix sum, or scan, unless otherwise noted) takes as input an array 퐴 of 푛 elements and outputs an array 푃푆(퐴) (either in place or in another array) such that, for all 푖 = 0, 1, . . . , 푛 − 1,

$$PS(A)[i] = \sum_{j=0}^{i} A[j].$$

In this chapter, we focus only on addition, but as we have seen before, the definition of a prefix sum takes any binary associative operator. Hence, many, but not all, of the following algorithms can be applied to different operators such as multiplication or maximum. The exclusive scan is the inclusive scan shifted right by one element, losing the last element, with the identity (0 for addition) at the first index. From here on, unless explicitly noted otherwise, scan or prefix sum refers to the inclusive version. Additionally, we are interested in prefix sums (or scans) that are in place, or that use an amount of space sublinear in the input size. In-place algorithms not only reduce memory usage, but can also improve locality and performance by reducing data movement. These factors are important when we operate on large datasets. The naive serial implementation of the all-prefix-sums (scan) operation is trivial and sequential: we simply keep a running total of the sum of the elements up to and including a particular index in the array, updating the total as we loop through the array and writing the updated total to the output at each index as we go along. Pseudocode for this operation is as follows (where 퐴 is the array of elements, and 푛 is the number of elements in the array):

Sequential-Prefix-Sum(퐴, 푛)

1  푠 ← 0  ◁ 0 is the identity for our operator (+)
2  for 푖 ← 0 to 푛 − 1
3      푠 ← 푠 + 퐴[푖]
4      퐴[푖] ← 푠

When designing parallel algorithms, we would like them to be work-efficient. This means they do no more additions or work than the sequential version, which is 푂(푛) and in-place.

4.1 Previous Work

Hillis and Steele [22] present a data-parallel algorithm, which has been demonstrated on GPUs by Horn [23]. The algorithm assumes that there are as many processors as data elements, and it is the most highly parallel (shortest-span) parallel prefix algorithm. Pseudocode for the algorithm is below (as well as a diagram):

Hillis-Prefix-Sum(퐴, 푛)

1  for 푖 ← 0 to ⌈log2 푛⌉ − 1
2      parallel for 푗 ← 0 to 푛 − 1
3          if 푗 ≥ 2^푖
4              퐴[푗] ← 퐴[푗 − 2^푖] + 퐴[푗]
5          else
6              퐴[푗] ← 퐴[푗]

Figure 4-1: Hillis and Steele’s data-parallel prefix sum algorithm on an 8-input array using index notation. The individual boxes represent individual elements of the array, and the notation 푗 : 푘 contains the elements at index 푗 to index 푘, inclusive, combined with the operator (here assumed to be addition), denoted by ⊕. The lines from two boxes mean those two elements are added where they meet the operator. The arrow corresponds to propagating the old element at that index (no operation).
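As a reference point, the following is a serial C port of this pseudocode (our own sketch, not the thesis's benchmark code). The pseudocode's in-place parallel update is only safe in a synchronous, lockstep model; the usual fix on real hardware, used below, is to double-buffer each round.

#include <stdlib.h>
#include <string.h>

/* Hillis-Steele scan: O(n log n) additions, but only ceil(log2 n) rounds
 * when the inner loop runs in parallel. Double-buffered to avoid races. */
void hillis_steele_scan(float *a, size_t n) {
    float *tmp = malloc(n * sizeof(float));
    for (size_t stride = 1; stride < n; stride *= 2) {
        for (size_t j = 0; j < n; j++)      /* parallel for in the model */
            tmp[j] = (j >= stride) ? a[j - stride] + a[j] : a[j];
        memcpy(a, tmp, n * sizeof(float));  /* publish this round's values */
    }
    free(tmp);
}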

As Harris [18] shows, the algorithm performs $\sum_{d=0}^{\lceil \log_2 n \rceil - 1} (n - 2^d) = O(n \log n)$ addition operations, which, compared to the serial implementation's 푂(푛) work, makes it clearly not work-efficient. A more work-efficient algorithm is described by Blelloch [5], which is the same as Ladner and Fischer's algorithm [25], barring a slight difference in the ordering of computation. Both maintain 푂(푛) work and $O(\log^2 n)$ span under the Cilk model of computation, as explained in Chapter 2. Both use a balanced binary tree as an algorithmic pattern that gives rise to parallelism. A binary tree with 푛 leaves has log 푛 depth, and if we perform one add per node in the tree then we still have 푂(푛) total adds.

A tree data structure is not actually built; instead, the tree is just used to determine what each thread does at each step of the traversal. In this algorithm, the operations are performed in place on an array in shared memory. Pseudocode for an iterative variant of the algorithm, inspired by [25] and [5], is below. For simplicity, this algorithm only works for arrays whose sizes are powers of 2, but it can be altered to work for inputs of arbitrary size (see [25]). The algorithm is sometimes instead formulated in a recursive, divide-and-conquer form.

Upsweep(퐴, 푛)

1  for 푑 ← 0 to log2 푛 − 1
2      parallel for 푖 ← 0 to 푛 − 1 by 2^(푑+1)
3          퐴[푖 + 2^(푑+1) − 1] ← 퐴[푖 + 2^푑 − 1] + 퐴[푖 + 2^(푑+1) − 1]

Downsweep(퐴, 푛)

1  for 푑 ← log2 푛 − 2 downto 0
2      parallel for 푖 ← 2^(푑+1) − 1 to 푛 − 2 by 2^(푑+1)
3          퐴[푖 + 2^푑] ← 퐴[푖] + 퐴[푖 + 2^푑]

Work-Efficient Parallel Prefix(퐴, 푛)

1  Upsweep(퐴, 푛)
2  Downsweep(퐴, 푛)

In the upsweep phase, we traverse the tree from the leaves to the root, computing the partial sums at internal nodes of the tree, as Figure 4-2 shows. After this phase, the root node (i.e., the last element in the array) holds the sum of all the elements in the array. In the downsweep phase, the tree is traversed from the root back down, building the scan in place from the partial sums computed in the upsweep phase, as Figure 4-2 also shows. After a complete downsweep, each vertex of the tree contains the sum of all the leaf values that precede it [5]. In combination, this algorithm performs 푂(푛) operations in the serial case and, under the Cilk model of computation, has a span of $O(\log^2 n)$ because of the span required to control the parallel loop.

Figure 4-2: The balanced binary tree-based algorithm described by Blelloch [5] and [25], consisting of the upsweep (first four rows) followed by the downsweep (last three rows). The individual boxes represent individual elements of the array, and the notation 푖 : 푗 denotes the elements at index 푖 to index 푗 added together. The lines coming from two boxes mean those two elements are added where the line meets the new box. The bolded-outline boxes indicate when that element of the array in shared memory is “finished,” i.e., it has the correct prefix sum for that index. The arrows from Figure 4-1 are omitted, but they have the same effect.

However, despite the theoretical improvement and work efficiency, this algorithm does not perform well in practice, as we saw in Figure 1-4 (Chapter 1), due to its memory access patterns when 푛 is much larger than the number of processors and its lack of coarsening¹. Caching conflicts with high contention (for example, false sharing) occur when multiple processors access elements of the array that are close together, and they slow the algorithm down severely. A similar phenomenon occurs with shared memory banks when this algorithm is implemented on a GPU in CUDA [18]. The best-performing prefix sum implementation on a GPU devises a solution to this problem in [18].

¹By coarsening, we are referring to the technique of switching from a parallel to a sequential algorithm when the problem size falls below a certain threshold, so as to reduce the impact of excess parallelism and overhead on work efficiency. This is sometimes also known as granularity control.

Lastly, to the best of our knowledge, the best-performing implementation of prefix sum on a CPU is in the Problem Based Benchmark Library (PBBSLIB) [31]. The algorithm (visually summarized in Figure 4-3) begins by dividing the input into blocks of size 퐵 = 1024. In parallel, on individual blocks, a reduce operation (defined in Chapter 2) is performed to get the total sum of each block, which is written to a temporary array. A serial exclusive scan is performed on the temporary array of block sums, of size ⌈푛/퐵⌉. The results of this exclusive scan are then used as offsets for prefix sums in each of the blocks in parallel. This algorithm is work-efficient: for each element of the input, it performs two reads and one write (disregarding the temporary array, which is much smaller than the input): one read for the partial sums in the first phase, and one read and one write for the prefix sums in the last phase. It also performs well in practice for very large arrays that do not fit in cache at all, where it becomes bottlenecked by the memory bandwidth (which is the optimal behavior for large arrays that do not fit in cache). PBBSLIB's prefix sum algorithm has 푂(푛) work and $O\!\left(B + \log\frac{n}{B} + \frac{n}{B}\right) = O(n/B)$ span.

Figure 4-3: The algorithm of PBBSLIB (scan_inplace) [31]. It first performs a parallel reduce operation on each block into a temporary array. After running an exclusive scan on that temporary array, we then run in-place prefix sums, using the value from the temporary array as the offset for each block of size 퐵, in parallel. The symbols are described by the key on the right.
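To make this structure concrete, here is a minimal serial C sketch of the blocked scan pattern (our own illustrative code, not PBBSLIB's; the first and third loops over blocks would be parallel for loops).

#include <stddef.h>
#include <stdlib.h>

#define BLOCK 1024

/* Blocked in-place inclusive scan in the style of scan_inplace. */
void blocked_scan(float *a, size_t n) {
    size_t nblocks = (n + BLOCK - 1) / BLOCK;
    float *sums = malloc(nblocks * sizeof(float));
    /* Phase 1: reduce each block to its total (parallel over blocks). */
    for (size_t b = 0; b < nblocks; b++) {
        size_t end = ((b + 1) * BLOCK < n) ? (b + 1) * BLOCK : n;
        float s = 0.0f;
        for (size_t i = b * BLOCK; i < end; i++) s += a[i];
        sums[b] = s;
    }
    /* Phase 2: serial exclusive scan over the block totals. */
    float acc = 0.0f;
    for (size_t b = 0; b < nblocks; b++) {
        float t = sums[b];
        sums[b] = acc;
        acc += t;
    }
    /* Phase 3: per-block inclusive scan seeded with the block's offset
     * (parallel over blocks). */
    for (size_t b = 0; b < nblocks; b++) {
        size_t end = ((b + 1) * BLOCK < n) ? (b + 1) * BLOCK : n;
        float run = sums[b];
        for (size_t i = b * BLOCK; i < end; i++) {
            run += a[i];
            a[i] = run;
        }
    }
    free(sums);
}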

There are also other algorithms [37] [30] [11] that improve upon the naive parallelization but are worse in practice than the PBBSLIB algorithm, especially for large inputs. The C++ STL implementation, to the best of our knowledge, performs a coarsened version of an algorithm similar to Work-Efficient Parallel Prefix. We will also compare against state-of-the-art implementations on an NVIDIA GPU [18] [29] at the end of this chapter, on a cost-efficiency metric.

4.2 Vectorization

Our first optimization is the vectorization of the prefix sum algorithm: Prefix-Sum-Vec. Noting that the sequential algorithm performs relatively well due to prefetching and its cache behavior, we try to extend its beneficial properties to a vectorized algorithm. We can apply Single Instruction, Multiple Data (SIMD), or vectorization, techniques to exploit data-level parallelism. In particular, SIMD approaches are well suited to Hillis and Steele's algorithm [22], since it has high data-level parallelism. However, due to the backwards dependencies of the sequential prefix sum algorithm, the compiler is unable to vectorize it effectively to utilize the high data-level parallelism inherent in a prefix sum (refer to Appendix A to see the generated assembly). If we take the sequential prefix sum algorithm and instead process the array in chunks of a vector width 푉 at a time (e.g., for Intel AVX2 this is a 256-bit width, but it can be any of Intel's supported vector widths: 128-bit SSE4, 256-bit AVX2, or 512-bit AVX512), we can perform a vectorized version of Hillis and Steele's algorithm within the vector to maximize data-level parallelism, and then propagate the last element of the vector to the next vector chunk as an offset. We can implement the algorithm on a vector using Intel intrinsics, which are C-style functions that provide access to Intel vector instructions without the need to write assembly code (and which are in general necessary when the compiler cannot auto-vectorize). An illustration of the algorithm is given in Figure 4-4. The code for an SSE (128-bit) implementation of Prefix-Sum-Vec for type float (32 bits wide in C/C++) is given in Appendix A. In this example, for 4 consecutive elements, Prefix-Sum-Vec performs 3 vector adds, 2 vector bit shifts, 1 vector shuffle, 1 vector load, and 1 vector store. Comparatively, in Sequential-Prefix-Sum, for the same 4 elements we perform 4 adds, 4 loads, and 4 stores. Comparing the throughputs of these operations from [32], the vectorized version has higher total throughput, since vector operations are generally just as cheap.

Figure 4-4: An illustration of Prefix-Sum-Vec for a 4-element section of input (1, 2, 3, 4), with desired output (1, 3, 6, 10). The offset is kept track of while iterating over the entire array in a forward pass in blocks of 4 elements. Here, 푉 = 4 is our vector width (4 floats, or 128 bits), but it could be a 256-bit vector, for example. The central idea is that it performs Hillis' algorithm exclusively through vector operations (cheap vector adds, shifts, shuffles, etc.), maximizing data-level parallelism. The subscripted arrow means a shift by that number.

The algorithm proceeds by processing the array in blocks of 4 at a time, propagating the offset from the last element of each vector to the next vector chunk. If the array size is not a multiple of 4, we can either pad with zeroes or perform the last ≤ 3 adds serially.

The asymptotic work and span of this algorithm are 푂(푛), but if we parameterize by the vector width 푉, it performs $O\!\left(\frac{n}{V} \log V\right)$ vector operations. While the asymptotic work is the same as the serial version's 푂(푛), this optimization alone speeds up the algorithm compared to Sequential-Prefix-Sum by up to 2× consistently over all input sizes, as we will see in Section 4.6 of this chapter. So despite its lesser portability (requiring a different implementation for each type), its performance in practice is not to be overlooked, especially considering that the algorithm is entirely serial and only needs a single core. Its serial nature makes it very effective on a performance-per-cost ratio, as discussed in Section 4.7.

4.3 Accurate Floating Point Summation

Recall from Figure 1-4 that despite its poor performance, Work-Efficient Parallel Prefix achieves comparatively good accuracy. We now take a more detailed look at accuracy considerations for the prefix sum algorithms we have shown so far. Our goal is to understand how the ordering of a summation affects the worst-case error bounds for the result. We first have to define the error metric we are using to evaluate our prefix sum algorithms. In a prefix sum, our task is to evaluate sums of the form $S_k = \sum_{i=0}^{k} x_i$ for all values of 푘 = 0, . . . , 푛 − 1, where $x_0, \ldots, x_{n-1}$ are real numbers. In general, as Higham notes in [20], each different ordering of the $x_i$ will give a different computed sum $\hat{S}_k$ in floating point arithmetic. We are interested in determining how this ordering affects the error

$$E_k = \hat{S}_k - S_k. \tag{4.1}$$

One metric we can use is the relative error compared to a reference point at every index of the output array. For the experiments at the end of this chapter, we use the root mean square relative error compared to a much higher precision reference point over all indices. For a prefix sum on an array of 푛 elements, this is defined as:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{k=0}^{n-1} E_k^2}{n}} \tag{4.2}$$

where $\hat{S}_k$ is the computed value of the prefix sum in floating point arithmetic at index 푘, and $S_k$ is the reference point (which can be real-valued in theory, or a much higher-precision floating point value used as a reference). However, sometimes other metrics make sense, such as the maximum error across all indices:

$$\text{Max Error} = \max_{k \in \{0, \ldots, n-1\}} |E_k|. \tag{4.3}$$

This maximum error metric is useful for getting the worst-case error bound of a prefix sum. If we can bound the worst case of each error $|E_k|$, then the overall worst-case error bound of the prefix sum is the maximum of those individual bounds. Higham [20] proves worst-case bounds on the summation of 푛 floating-point numbers. We recreate the core lemmas of his work, and then see how they apply to different prefix sum algorithms.

Recall the accuracy model for floating point arithmetic, as we defined in Chapter 2:

$$fl(x \;\text{op}\; y) = (x \;\text{op}\; y)(1 + \delta), \quad |\delta| \le u, \quad \text{op} = +, -, *, / \tag{4.4}$$

where 훿 is a small relative error associated with the floating point representation of the calculation after correct rounding, and 푢 is the machine precision, satisfying 푛푢 ≤ 1.

We first consider sequential summation for computing $S_k$, which is analogous to Sequential-Prefix-Sum at each output index 푘, for all 푛 indices in the array. In other words, sequential summation adds together 푛 numbers simply by keeping a running total while iterating through the array from index 0 to 푛 − 1.

Lemma 18 When using sequential summation to compute $S_k$, it holds that

$$|E_k| \le \gamma_k \sum_{i=0}^{k} |x_i|$$

for all 푘 = 1, . . . , 푛 − 1, where $\gamma_k = \frac{ku}{1 - ku}$. As a consequence,

$$|\text{Max Error}| \le \gamma_{n-1} \sum_{i=0}^{n-1} |x_i| = (n-1)u \sum_{i=0}^{n-1} |x_i| + O(u^2).$$

Proof. By directly applying the model in (4.4) to a sequential summation of $S_k = \sum_{i=0}^{k} x_i$ in floating point arithmetic, we have

$$\hat{S}_k = fl(\hat{S}_{k-1} + x_k) = (\hat{S}_{k-1} + x_k)(1 + \delta_k) \quad \text{for some } |\delta_k| \le u, \;\; \forall\, k = 1, \ldots, n-1.$$

Notice that in the expression above, $x_0$ and $x_1$ appear in exactly 푘 additions, and each term $x_i$ for 푖 = 2, . . . , 푘 takes part in exactly 푘 − 푖 + 1 additions. We then have

$$\hat{S}_k = (x_0 + x_1) \prod_{j=1}^{k} (1 + \delta_j) + \sum_{i=2}^{k} x_i \prod_{j=i}^{k} (1 + \delta_j) \tag{4.5}$$

$$\implies \hat{S}_k - S_k = E_k = (x_0 + x_1)\left( \prod_{j=1}^{k} (1 + \delta_j) - 1 \right) + \sum_{i=2}^{k} x_i \left( \prod_{j=i}^{k} (1 + \delta_j) - 1 \right) \tag{4.6}$$

$$\implies |E_k| \le \sum_{i=0}^{1} |x_i| \cdot \left| \prod_{j=1}^{k} (1 + \delta_j) - 1 \right| + \sum_{i=2}^{k} |x_i| \cdot \left| \prod_{j=i}^{k} (1 + \delta_j) - 1 \right| \tag{4.7}$$

The product terms can be reduced using the following result:

$$\left| \prod_{j=1}^{k} (1 + \delta_j) - 1 \right| = \left| \sum_{1 \le i_1 \le k} \delta_{i_1} + \sum_{1 \le i_1 < i_2 \le k} \delta_{i_1} \delta_{i_2} + \cdots + \sum_{1 \le i_1 < \cdots < i_k \le k} \delta_{i_1} \delta_{i_2} \cdots \delta_{i_k} \right| \le \sum_{j=1}^{k} \binom{k}{j} u^j \le \sum_{j=1}^{k} (ku)^j \le \sum_{j=1}^{\infty} (ku)^j = \gamma_k. \tag{4.8}$$

And therefore (4.7) becomes

$$|E_k| \le \gamma_k \sum_{i=0}^{1} |x_i| + \sum_{i=2}^{k} \gamma_{k-i+1} |x_i| \le \gamma_k \sum_{i=0}^{k} |x_i|,$$

as desired.

We note here that the error constant is proportional to 푘 for each $|E_k|$, so the worst-case error bound on Sequential-Prefix-Sum is proportional to 푛.

Pairwise summation is defined [21] such that the $x_i$ are summed in pairs according to $y_i = x_{2i} + x_{2i+1}$ for $i = 0 : \lfloor \frac{n-1}{2} \rfloor$ (with $y_{\lfloor n/2 \rfloor} = x_{n-1}$ if 푛 is odd). The pairwise summation is then repeated recursively on the $y_i$, $i = 0 : \lfloor n/2 \rfloor$, and takes $\lceil \log_2 n \rceil$ stages. For a brief example with 푛 = 4 elements, pairwise summation gives $S_4 = ((x_1 + x_2) + (x_3 + x_4))$, as opposed to naive sequential summation, which gives $S_4 = (((x_1 + x_2) + x_3) + x_4)$.
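A recursive C sketch of pairwise summation (our own illustrative code) makes the logarithmic addition depth visible:

#include <stddef.h>

/* Pairwise (recursive) summation of x[0..n-1], for n >= 1. The recursion
 * depth, and hence the longest chain of additions, is ceil(log2 n),
 * which yields the logarithmic error constant of Lemma 19 below. */
double pairwise_sum(const double *x, size_t n) {
    if (n == 1)
        return x[0];
    size_t half = n / 2;
    return pairwise_sum(x, half) + pairwise_sum(x + half, n - half);
}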

Lemma 19 When using pairwise summation to compute $S_k$, it holds that

$$|E_k| \le \gamma_{\lceil \log_2(k+1) \rceil} \sum_{i=0}^{k} |x_i|$$

for all 푘 = 1, . . . , 푛 − 1, and therefore

$$|\text{Max Error}| \le \gamma_{\lceil \log_2 n \rceil} \sum_{i=0}^{n-1} |x_i| \le \lceil \log_2 n \rceil u \sum_{i=0}^{n-1} |x_i| + O(u^2).$$

Proof. For pairwise summation, each term $x_i$ takes part in either $\lceil \log_2(k+1) \rceil - 1$ or $\lceil \log_2(k+1) \rceil$ additions in the computation of $\hat{S}_k$. Let $A_k$ be the set of indices 푖 such that $x_i$ appears in $\lceil \log_2(k+1) \rceil - 1$ additions, with corresponding error terms $\delta_1^i, \ldots, \delta_{\lceil \log_2(k+1) \rceil - 1}^i$, and let $B_k$ be the set of indices 푖 such that $x_i$ appears in $\lceil \log_2(k+1) \rceil$ additions, with corresponding error terms $\delta_1^i, \ldots, \delta_{\lceil \log_2(k+1) \rceil}^i$. Then, analogously to (4.5), we have

$$\hat{S}_k = \sum_{i \in A_k} x_i \prod_{j=1}^{\lceil \log_2(k+1) \rceil - 1} (1 + \delta_j^i) + \sum_{i \in B_k} x_i \prod_{j=1}^{\lceil \log_2(k+1) \rceil} (1 + \delta_j^i) \tag{4.9}$$

$$\implies \hat{S}_k - S_k = E_k = \sum_{i \in A_k} x_i \left( \prod_{j=1}^{\lceil \log_2(k+1) \rceil - 1} (1 + \delta_j^i) - 1 \right) + \sum_{i \in B_k} x_i \left( \prod_{j=1}^{\lceil \log_2(k+1) \rceil} (1 + \delta_j^i) - 1 \right) \tag{4.10}$$

$$\implies |E_k| \le \gamma_{\lceil \log_2(k+1) \rceil - 1} \sum_{i \in A_k} |x_i| + \gamma_{\lceil \log_2(k+1) \rceil} \sum_{i \in B_k} |x_i| \le \gamma_{\lceil \log_2(k+1) \rceil} \sum_{i=0}^{k} |x_i| \tag{4.11}$$

where the first inequality in (4.11) follows from the derivations in (4.8).

The bound attained in Lemma 19 is better than the one obtained in Lemma 18, as it is proportional to log 푛 instead of to 푛, and as shown in [20], both bounds are attainable in the worst case.

We note that the key factor that gives pairwise summation a better worst-case bound than sequential summation is that its maximum number of addition stages, or the longest chain of additions performed, is 푂(log 푛). This is analogous to the notion of span, defined in Chapter 2 in the context of parallelism. When designing parallel algorithms, one usually aims to produce low-span algorithms in order to maximize parallelism. This brings us to a key observation on accuracy and parallelism for summation:

Observation 3 Reducing the span of a prefix sum algorithm improves both its parallelism and worst case error bound in floating point arithmetic.

For Work-Efficient Parallel Prefix, we note that the longest chain of additions after both the upsweep and downsweep steps (or, equivalently, the number of stages) is 2 log 푛. Therefore, its |Max Error| is proportional to 2 log 푛. One thing to keep in mind, however, is that these are just worst-case bounds, and there is a lot of freedom in what the actual error might be. Rigorous experiments that use the RMSE as a metric are performed in Section 4.6. Nevertheless, the bounds provide good guidance for designing accurate parallel prefix sum algorithms, especially when 푛 is large. Lastly, we note that there are more accurate summation algorithms, such as Kahan summation [21] or compensated summation, that attempt to counteract the roundoff error associated with floating point arithmetic. However, they are often an order of magnitude slower than work-efficient algorithms. We will also benchmark against Kahan summation experimentally.
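For reference, here is a minimal C sketch of Kahan (compensated) summation adapted to a running prefix sum (our own illustrative code, not the benchmark implementation). Note that aggressive flags such as -ffast-math can legally optimize the compensation away, so such code must be compiled carefully.

/* Kahan (compensated) prefix sum: the correction term c captures the
 * low-order bits lost when adding a small element to a large running sum. */
void kahan_prefix_sum(float *a, int n) {
    float sum = 0.0f, c = 0.0f;
    for (int i = 0; i < n; i++) {
        float y = a[i] - c;   /* apply the correction from the previous step */
        float t = sum + y;    /* low-order bits of y may be lost here */
        c = (t - sum) - y;    /* recover the lost bits (algebraically zero) */
        sum = t;
        a[i] = sum;
    }
}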

4.4 Ordering of Computation

There is a subtlety among variations of the work-efficient parallel prefix sum algorithm. All variations use a balanced binary tree in some way for an upsweep, and then use the computed partial sums for a downsweep. However, one key point of difference is the order in which the computations are done (i.e., how the balanced binary tree is traversed). The algorithm discussed above by Blelloch [5] performs a traversal similar to a level-order tree traversal, except from the leaves to the root on the upsweep. It then performs a level-order traversal on the downsweep. While there is no difference in parallelizability, one disadvantage of this approach is that when 푛 is larger than the cache size, it does not optimize for cache locality at all. However, a recursive (depth-first) traversal of the binary tree on the upsweep and downsweep is better suited to optimizing for cache locality, because it deals with array elements that are closer together first. Blelloch [6] proves that the recursive orderings (postorder, preorder, and inorder) all have optimal cache complexity of $O(\lceil n/L \rceil)$, where 퐿 is the cache-line size, but the level-order traversal does not: it has cache complexity 푂(푛). The differences in the ordering of computation are highlighted in Figure 4-5.

(a) Level order traversal (b) Depth-first traversals (postorder, preorder)

Figure 4-5: The red boxed numbers by the lines indicate the order in which those sums are computed (in increasing order). Left: the ordering from the algorithm discussed by Blelloch [5] / Work-Efficient Parallel Prefix, analogous to a level-order traversal of a binary tree. Right: the ordering from a circuit described by a recursive version from Ladner and Fischer [25]. The upsweep phase can be described as a postorder traversal, and the downsweep as a preorder traversal. These depth-first orderings are better suited to cache locality because, on large arrays, their accesses are grouped together and more cache-efficient. Blelloch provides a proof in [6].
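To illustrate the depth-first ordering, here is a serial C sketch of one recursive upsweep/downsweep formulation (our own illustrative code; the thesis's Recursive-Upsweep and Recursive-Downsweep may differ in details). The two recursive calls in each function are independent, so they can be spawned in parallel, and the recursion visits nearby array elements together, which is the source of the cache-friendliness discussed above.

/* Postorder upsweep: after the call, a[hi] holds the sum of a[lo..hi],
 * and each subtree's root position holds that subtree's sum. */
static void upsweep_rec(float *a, int lo, int hi) {
    if (lo == hi) return;
    int mid = lo + (hi - lo) / 2;
    upsweep_rec(a, lo, mid);       /* left subtree first */
    upsweep_rec(a, mid + 1, hi);   /* then right subtree */
    a[hi] += a[mid];               /* combine at the root (postorder) */
}

/* Preorder downsweep: `left` is the sum of all elements before a[lo];
 * `hi_done` marks that a[hi] was already finalized by an ancestor. */
static void downsweep_rec(float *a, int lo, int hi, float left, int hi_done) {
    if (!hi_done) a[hi] += left;   /* a[hi] held the subtree sum; finalize it */
    if (lo == hi) return;
    int mid = lo + (hi - lo) / 2;
    float left_sum = a[mid];       /* subtree sum of a[lo..mid] from the upsweep */
    downsweep_rec(a, lo, mid, left, 0);          /* the two calls are independent */
    downsweep_rec(a, mid + 1, hi, left + left_sum, 1);
}

/* In-place inclusive scan via the two depth-first passes. */
void scan_rec(float *a, int n) {
    if (n <= 0) return;
    upsweep_rec(a, 0, n - 1);
    downsweep_rec(a, 0, n - 1, 0.0f, 0);
}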

4.5 Block-Hybrid Algorithm

Having now understood the advantages of the data-level parallelism of vectorization, the accuracy of tree-structured summation, and the subtleties of the ordering of computation, we present a hybrid in-place algorithm called Block-Hybrid. Block-Hybrid begins by splitting the input array 퐴 into ⌈푛/퐵⌉ blocks of size 퐵. For input arrays smaller than 퐵, we simply run Prefix-Sum-Vec on the entire array. For each of the ⌈푛/퐵⌉ blocks, in parallel, we compute in-place prefix sums using the Prefix-Sum-Vec subroutine, writing the results back to the original array. We copy the last element of each block (the sum of that block) to a temporary array 푇. On the temporary array, in place, we perform a recursive, coarsened, work-efficient parallel prefix sum that achieves a postorder traversal for the upsweep and a preorder traversal for the downsweep. At index 푖 of the temporary array, we now have the offset that needs to be added to every element of block 푖 + 1 in the original array, for 푖 = 0, . . . , ⌈푛/퐵⌉ − 2. We can perform this operation in parallel, broadcasting to each block its respective offset. The algorithm is described in detail in Figure 4-6, and pseudocode is given below.

Block-Hybrid(퐴, 푛)

1  if 푛 < 퐵
2      return Prefix-Sum-Vec(퐴, 푛)
3  parallel for 푖 ← 0 to 푛 − 1 by 퐵
4      Prefix-Sum-Vec(퐴 + 푖, 퐵)
5      푇[푖/퐵] ← 퐴[푖 + 퐵 − 1]  ◁ copy the last element (the block sum) to a temporary array
6  Recursive-Upsweep(푇, ⌈푛/퐵⌉)
7  Recursive-Downsweep(푇, ⌈푛/퐵⌉)
8  parallel for 푖 ← 퐵 to 푛 − 1 by 퐵
9      for 푗 ← 0 to 퐵 − 1
10         퐴[푖 + 푗] += 푇[푖/퐵 − 1]  ◁ add the block's offset to all of its elements

Prefix-Sum-Vec is described above, and Recursive-Upsweep and Recursive-Downsweep are recursive formulations of Work-Efficient Parallel Prefix that achieve a depth-first ordering, as described in Section 4.4. Block-Hybrid, like PBBSLIB's algorithm, reduces the overhead of parallelism and avoids contention on nearby memory by coarsening to operate on blocks of sufficient size instead of on individual elements. The work of Block-Hybrid is 푂(푛), and the span is $O\!\left(\log\frac{n}{B} + B + \log^2\frac{n}{B} + \log\frac{n}{B} + B\right) = O(\log^2 n)$, assuming 푛 ≫ 퐵. Thus, we expect it to achieve the same worst-case accuracy error bound as Work-Efficient Parallel Prefix. In particular, we note that the maximum number of additions that any index of the output array goes through is $O\!\left(B + 2\log\frac{n}{B} + 1\right)$, which improves the accuracy significantly compared to the naive, purely vectorized, and PBBSLIB algorithms, which may all go through 푂(푛) additions for an index in the worst case. 퐵 is a tunable parameter, which we found in practice to be effective at 퐵 = 1024, the same as PBBSLIB [31]. Lastly, there is a variant, called Block-Hybrid-Reduce, which optimizes for more performance in practice at the cost of accuracy. It is the same algorithm as PBBSLIB's, except that it uses Prefix-Sum-Vec as its serial implementation of the inclusive scan and performs a vectorized version of the reduce operation as well. It outperforms or is on par with PBBSLIB for all input sizes, making it, to our knowledge, the fastest prefix sum algorithm in practice on a CPU.

Figure 4-6: An illustration of the Block-Hybrid algorithm on a fabricated input array of size 40. In the first phase, in-place prefix sums / inclusive scans (using the intrinsics-based vectorization algorithm) are performed on each block of size 퐵 (here, 퐵 = 8). The last value of each block is also copied to a temporary array. In the second phase, an in-place prefix sum in the form of a recursive work-efficient parallel prefix is performed on the temporary array, and finally the corresponding values are broadcast to all elements of all the blocks in parallel.

4.6 Experimental Evaluation

We perform a variety of experiments to evaluate the different prefix algorithms in terms of both accuracy and performance. Our experimental setup is described in Chapter 2. The highlights of our results for prefix sum can be summarized by the matrix in Figure 4-7, which shows the trade-offs between performance and accuracy for various prefix sum algorithms at a particular input size, $2^{22}$. This input size is where the differences are most pronounced: for smaller inputs, the serial algorithms perform better, and for larger inputs, PBBSLIB and block_hybrid_reduce both have similar performance due to reaching the maximum memory bandwidth.

Figure 4-7: A highlight matrix of the trade-offs between performance and accuracy. The input size is $2^{22}$ floats, and tests are run on a 40-core machine. This input size is where the differences are most pronounced: for smaller inputs, the serial algorithms perform better, and for larger inputs, PBBSLIB and block_hybrid_reduce both have similar performance due to reaching the maximum memory bandwidth. Accuracy is measured as $\frac{1}{\mathrm{RMSE}}$ and plotted on a logarithmic scale, where RMSE is defined in Section 4.3, for random numbers uniformly distributed between 0 and 1. Performance (or observed parallelism) is measured as $\frac{T_1}{T_{40h}}$, where $T_{40h}$ is the parallel time on the Supercloud 40-core machine with 2-way hyper-threading, and $T_1$ is the serial time of the naive sequential prefix sum on the same machine. Our algorithms are sequential intrinsics, block hybrid, and block hybrid reduce.

4.6.1 Performance

For performance testing, runtimes are taken as the average of 5 runs using a high-resolution clock, since the Cilk runtime system is non-deterministic. The input size is varied from $2^{10}$ to $2^{28}$. Figure 4-8 summarizes the runtimes on prefix sum input arrays ranging in size from $2^{10}$ to $2^{28}$ on a logarithmic scale. The Work-Efficient Parallel Prefix algorithm is omitted from the runtime comparison since, as we saw in Figure 1-4, it is even slower than the sequential scan. Prefix-Sum-Vec is labelled as sequential intrinsics, and it is important to note how well it performs, especially for smaller inputs; this algorithm should likely be used for small inputs unless accuracy is a concern. PBBSLIB suffers from worse performance when the input fits in cache, since it is primarily designed for very large inputs that cannot fit in any level of the cache hierarchy. Block-Hybrid performs well for nearly all input sizes. At very large inputs, Block-Hybrid-Reduce performs better, at which point it is on par with PBBSLIB because the memory bandwidth of the machine becomes the bottleneck. Lastly, note that the C++ STL implementation std::inclusive_scan barely improves on the sequential version.

Figure 4-8: A logarithmically scaled runtime graph comparing all the algorithms discussed so far. It is worth noting the consistent performance of the block-hybrid-based algorithms across all input sizes. This was measured on the Supercloud 40-core machine.

Figure 4-9 shows the average time taken by each prefix sum algorithm, divided by the input size 푛, as 푛 increases. As can be seen, the vectorized intrinsics algorithm remains quite efficient for smaller inputs, until the input gets sufficiently large that the parallel algorithms begin scaling well.

Figure 4-9: A linearly scaled runtime graph where the y-axis is the runtime divided by the size of the array (to get a normalized, per-element runtime). This shows that the PBBSLIB algorithm is inefficient for small inputs, while the other algorithms scale almost linearly, until the input gets large enough that the parallel algorithms scale better. This was measured on the Supercloud 40-core machine.

Lastly, Figure 4-10 shows the speedup obtained for the different algorithms. Speedup is defined as $\frac{T_1}{T_P}$, where $T_P$ is the time run on 푃 processors. The graphs show the speedup for two different machines: the AWS 18-core machine (36 CPUs with hyperthreading) and the Supercloud 40-core machine (80 CPUs with hyperthreading). Overall, Block-Hybrid and Block-Hybrid-Reduce outperform the best known benchmarks for prefix sum (whether the C++ STL or PBBSLIB) for input sizes that fit in any level of the cache hierarchy (in these experiments, on the order of 10 million elements or fewer). For very large inputs that do not fit in the last level of the cache hierarchy, Block-Hybrid-Reduce outperforms or stays competitive with PBBSLIB, the best benchmark for this range on a CPU. Prefix-Sum-Vec consistently outperforms the sequential prefix sum (and parallel prefix implementations, if the input fits in cache) by a factor of around 2.5×, all while only requiring a single core.

71 (a) AWS 18-core Machine (b) Supercloud 40-core Machine

Figure 4-10: Speedup (measured as $\frac{T_1}{T_P}$) versus input size (log scale) on two different parallel machines. As 푛 increases, we do gain some speedup; however, we eventually hit a bottleneck, namely the maximum memory bandwidth of the machine, since the calculation required in a prefix sum is small compared to the runtime associated with reading and writing memory.

4.6.2 Accuracy

For accuracy testing, the root mean square error, defined above in Section 4.3, is used to evaluate the different prefix sum algorithms. Different input sizes are tested, from $2^{10}$ to $2^{25}$. Once the input gets sufficiently large, the results are very similar, so the graphs only show the accuracy on an input of size $2^{15}$. IEEE 754 single-precision 32-bit C floats are used, although the results are the same for double-precision floating point numbers, provided overflow does not occur. The reference value is a 100-digit-precision floating point value, enabled via Boost, and the reference prefix sum is calculated sequentially. It is worth remembering that the machine precision 푢 for single-precision floating point is $5.96 \times 10^{-8}$. The 3 graphs below are tested on an input of $2^{15}$ random single-precision floating point numbers. The Mersenne Twister 19937 generator is used to generate the random numbers according to a specified probability distribution. Each graph represents a different input probability distribution. The first, Unif(0, 1), is the uniform distribution between 0 and 1. The second, Exp(1), is the exponential distribution with parameter 1. The third is the standard normal distribution. These inputs were specifically chosen since Higham uses a similar methodology in his experiments [20]. Further, no single summation method can be regarded as superior on the sole basis of accuracy, since for each method the error can vary greatly depending on the input data, within the freedom of the worst-case bounds. However, when 푛 gets sufficiently large, the bounds become more apparent. While one can always construct input sequences that favor particular prefix sum algorithms, these three random input distributions were chosen to be general but also robust to real-world input. The standard normal distribution includes about half of its numbers with the opposite sign, so the errors are larger due to more cancellation. While we could choose specific real-world data suited to one application to highlight the differences among algorithms, these input distributions aim to highlight the differences among algorithms that apply to a much wider range of real-world data.

Figure 4-11: A bar chart comparing the root mean squared relative error of the different algorithms against a prefix sum that uses much higher precision (100 digits), where the inputs are $2^{15}$ random single-precision floating point numbers drawn from the uniform distribution between 0 and 1.

As shown in Figures 4-11, 4-12, and 4-13, the consistent winner according to these benchmarks is the Kahan summation algorithm. However, on this input size, it also runs at least 10× slower than the Block-Hybrid algorithm, and it is similarly slow on other input sizes. The Block-Hybrid algorithm achieves accuracy just as good as Work-Efficient Parallel Prefix, as expected from the worst-case error bounds, and is comparable to the Kahan summation algorithm. It is 8.4× more accurate over these input ranges compared to PBBSLIB, and 20.2× more accurate than the sequential prefix sum (taking an average of the 3 accuracy differences across the input distributions). By the same metrics, Block-Hybrid-Reduce is 2.6× more accurate than PBBSLIB, and 6.3× more accurate than the sequential prefix sum.

Figure 4-12: A bar chart comparing the root mean squared relative error of the different algorithms against a prefix sum that uses much higher precision (100 digits), where the inputs are $2^{15}$ random single-precision floating point numbers drawn from the exponential distribution with 휆 = 1.

4.7 GPU Comparison

Since there has been a lot of previous work on efficient GPU implementations [29] [18] [23], we also compare the best prefix sum implementations on a CPU with them on a performance-per-cost ratio. Specifically, we define performance as the inverse of runtime, and the performance-per-cost ratio therefore as $\frac{1}{\text{runtime} \times \text{cost}}$, where the cost is retrieved from Amazon EC2 Spot Instances Pricing [2]. The specifications of the two environmental setups we compare are described in Chapter 2. Specifically, we compare the affordable AWS machine (c5.xlarge) and the Volta GPU machine. As of May 2020 [2] [1], their on-demand and spot pricing is shown in Table 4.1.

Figure 4-13: A bar chart comparing the root mean squared relative error of the different algorithms against a prefix sum that uses much higher precision (100 digits), where the inputs are $2^{15}$ random single-precision floating point numbers drawn from the standard normal distribution.

Machine                               Cost per Hour (Spot)   Cost per Hour (On Demand)
c5.xlarge (Affordable AWS Machine)    $0.0396                $0.17
p3.2xlarge (Volta GPU)                $0.918                 $3.06

Table 4.1: Amazon EC2 spot and on-demand pricing comparison.

For performance measurements, we took the sample measurements in Table 4.2 using the same benchmark suite and methodology as in Section 4.6.1. The GPU implementation we used is the state-of-the-art implementation specified by the CUDPP library in [29] and [18], run on the Volta GPU machine specified in the experimental setup. The CPU runtimes use the same benchmarks as in Section 4.6.1, except that they are run on the affordable AWS machine.

Runtime (ms) on input size 푛
Implementation          2^10      2^13      2^16     2^19    2^23    2^27
sequential scan         0.00124   0.00966   0.0772   0.621   10.0    161
sequential intrinsics   6.40e-4   0.00425   0.0339   0.278   5.04    83.0
block_hybrid            0.00518   0.00845   0.0586   0.385   3.19    55.4
block_hybrid_reduce     0.00377   0.00815   0.0549   0.322   2.96    48.5
GPU CUDPP               0.00819   0.0256    0.0271   0.141   0.523   4.14

Table 4.2: Runtime comparison against the state-of-the-art GPU implementation on 1 Volta GPU versus the affordable 2-core AWS machine.

Across these inputs, we can see that the GPU only makes sense when the input is very large (in fact, the CPU implementations are faster at smaller input sizes). In the best case for the GPU, it is around 11.7× faster than the best CPU implementation (block_hybrid_reduce). At all smaller inputs, it is less than 10× faster (and the improvement decreases as 푛 gets smaller). Hence, the best comparison for the GPU on a performance-per-cost ratio is at input size $2^{27}$, which gives a ratio of $\frac{1}{4.14 \times 0.918} = 0.26$ for spot pricing and $\frac{1}{4.14 \times 3.06} = 0.079$ for on-demand pricing. The best CPU implementation has a ratio of $\frac{1}{48.5 \times 0.0396} = 0.52$ for spot pricing and $\frac{1}{48.5 \times 0.17} = 0.12$ for on-demand pricing. So on this ratio metric, the affordable CPU is at least 2× better for spot pricing and 1.5× better for on-demand pricing. Hence, despite GPUs' massive computational power and parallelism, on a comparison that includes cost, these results show that CPU computing is strongly worth considering, since it outperforms the GPU on a performance-per-cost ratio by at least 1.5× unless the input size is extremely large (larger than $2^{27}$). Furthermore, for more complicated applications that use prefix sum as a subroutine, the flexibility of a general-purpose CPU provides considerable advantages.

Chapter 5

Conclusions and Future Work

This thesis presents how we can abstract away the application domains of scientific computing applications, such as the fast multipole method (FMM) and the integral image, which involve reducing elements in overlapping subregions of a multidimensional array, and instead focus on the underlying computation. More precisely, we formulate problems called included and excluded sums: given a 푑-dimensional array of numbers and an axis-aligned subregion (subarray) of interest, compute the total sum of all elements included in the region or the total sum of all elements outside of the region, respectively, for all positions of the region of interest. In our findings, we present an asymptotically improved algorithm called DRES for the excluded sums problem in arbitrary dimensions, and we show that it reduces the work from exponential to linear in the number of dimensions compared to the previous state of the art. While we have implemented DRES, a rigorous experimental evaluation remains to be done for a variety of applications that scale with the number of dimensions. It remains to be shown exactly how well DRES scales in practice, and whether the asymptotic analysis in the number of dimensions is borne out, for a variety of applications.

77 the literature on parallel prefix computation and observing opportunities for optimization, we present the block-hybrid algorithm, which achieves a strong compromise between high performance and floating point summation accuracy, outperforming the state-of-the-art CPU implementation over several practical input sizes. We also note the reality of the vectorized implementation: Intel intrinsics vectorization is powerful for CPU prefix sums since the compiler cannot capitalize on the potential 2× consistent speedup on a single core. Lastly, by comparing our state-of-the-art prefix sum implementations on a CPU with the respective GPU implementations, we make a case for cost-efficient prefix sum computation on general commodity multicores. Despite the high data-level parallelism inherent in a prefix sum well- suited to GPUs, with performant algorithms such as block-hybrid, it may make sense to use multicore software over GPUs unless the input is extremely large.

Appendix A

Implementation Details

A.1 Implementation of Prefix Sum Vectorization

The following is a C intrinsics implementation of Prefix-Sum-Vec for float. Different vector widths can be used (256-bit with AVX2, for example: 8 × 32-bit floats), but this proved to have just as good performance. See the Intel Intrinsics Guide [32] for a specification of the intrinsic functions. The following code compiles with GCC or Clang on any Intel processor that has Streaming SIMD Extensions (SSE4.2) or more recent (AVX and newer). Also see the associated assembly generated by x86-64 clang 9.0.0 with -O3 and -ffast-math.

1  #include <immintrin.h>
2
3  static inline __m128 scan_SSE(__m128 x) {
4    // shift (replacing with 0) vector left by 4 bytes (1 float) and add to itself
5    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
6    // shift (replacing with 0) vector left by 8 bytes (2 floats) and add to itself
7    x = _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
8    return x;
9  }
10
11 static inline void prefix_sum_SSE(float* a, const int n) {
12   // set initial offset to identity [0, 0, 0, 0]
13   __m128 offset = _mm_setzero_ps();
14   size_t i = 0;
15   for (; i + 4 <= (size_t) n; i += 4) {
16     // load 4 elements at once
17     __m128 x = _mm_load_ps(&a[i]);
18     __m128 out = scan_SSE(x);
19     // add to offset
20     out = _mm_add_ps(out, offset);
21     // write back
22     _mm_store_ps(&a[i], out);
23     // shuffle last element to all lanes of vector, for next offset
24     offset = _mm_shuffle_ps(out, out, _MM_SHUFFLE(3, 3, 3, 3));
25   }
26   // deal with case if n is not divisible by 4
27   float last_offset = _mm_cvtss_f32(offset);
28   for (; i < (size_t) n; ++i) {
29     last_offset += a[i];
30     a[i] = last_offset;
31   }
32 }

Listing A.1: C intrinsics implementation of Prefix-Sum-Vec with vector width 4 for float
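As a usage note, the listing above uses the aligned load and store intrinsics _mm_load_ps and _mm_store_ps, so the caller must supply a 16-byte-aligned array. The following minimal driver is our sketch (not part of the thesis); it uses C11's aligned_alloc, whose size argument must be a multiple of the alignment.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int n = 10;
    // round the allocation up to a multiple of 4 floats (16 bytes),
    // since aligned_alloc requires size to be a multiple of the alignment
    float *a = aligned_alloc(16, ((n + 3) / 4) * 4 * sizeof(float));
    if (a == NULL) return 1;
    for (int i = 0; i < n; i++)
        a[i] = 1.0f;              // all ones, so the prefix sums are 1, 2, ..., n
    prefix_sum_SSE(a, n);
    for (int i = 0; i < n; i++)
        printf("%g ", a[i]);      // prints: 1 2 3 4 5 6 7 8 9 10
    printf("\n");
    free(a);
    return 0;
}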

.LBB0_3:                                # =>This Inner Loop Header: Depth=1
        movdqa  15872(%rsp,%rax,4), %xmm1
        movdqa  15888(%rsp,%rax,4), %xmm2
        movdqa  %xmm1, %xmm3
        pslldq  $4, %xmm3               # xmm3 = zero,zero,zero,zero,xmm3[0,1,2,3,4,5,6,7,8,9,10,11]
        addps   %xmm1, %xmm3
        xorps   %xmm1, %xmm1
        movlhps %xmm3, %xmm1            # xmm1 = xmm1[0],xmm3[0]
        addps   %xmm3, %xmm1
        addps   %xmm0, %xmm1
        movaps  %xmm1, 15872(%rsp,%rax,4)
        shufps  $255, %xmm1, %xmm1      # xmm1 = xmm1[3,3,3,3]
        movdqa  %xmm2, %xmm3
        pslldq  $4, %xmm3               # xmm3 = zero,zero,zero,zero,xmm3[0,1,2,3,4,5,6,7,8,9,10,11]
        addps   %xmm2, %xmm3
        xorps   %xmm0, %xmm0
        movlhps %xmm3, %xmm0            # xmm0 = xmm0[0],xmm3[0]
        addps   %xmm3, %xmm0
        addps   %xmm1, %xmm0
        movaps  %xmm0, 15888(%rsp,%rax,4)
        shufps  $255, %xmm0, %xmm0      # xmm0 = xmm0[3,3,3,3]
        addq    $8, %rax
        cmpq    $4000, %rax             # imm = 0xFA0
        jb      .LBB0_3

Listing A.2: Assembly generated by x86-64 clang 8.0.0 with -O3 and -ffast-math for a toy example input. The assembly corresponds to the inner loop starting on line 17 of the C intrinsics implementation above. Note the effective use of SSE instructions.

1  static inline void sequential_scan(float* A, const int n) {
2    float sum = 0.0f;
3    for (size_t i = 0; i < (size_t) n; i++) {  // start at 0 so A[0] is included in the scan
4      sum += A[i];
5      A[i] = sum;
6    }
7  }
8
9  .LBB0_5:                      # =>This Inner Loop Header: Depth=1
10   addss -144(%rsp,%rax,4), %xmm0
11   movss %xmm0, -144(%rsp,%rax,4)
12   addss -140(%rsp,%rax,4), %xmm0
13   movss %xmm0, -140(%rsp,%rax,4)
14   addss -136(%rsp,%rax,4), %xmm0
15   movss %xmm0, -136(%rsp,%rax,4)
16   addss -132(%rsp,%rax,4), %xmm0
17   movss %xmm0, -132(%rsp,%rax,4)
18   addss -128(%rsp,%rax,4), %xmm0
19   movss %xmm0, -128(%rsp,%rax,4)
20   addq $5, %rax
21   cmpq $4004, %rax             # imm = 0xFA4
22   jne .LBB0_5

Listing A.3: An implementation of a naive sequential scan and the assembly generated for it by x86-64 clang 8.0.0 with -O3 and -ffast-math for a toy example input. The assembly corresponds to the inner loop starting at line 4. Note the ineffective use of SSE instructions, which fails to capture the high data-level parallelism.
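For completeness, the two scans can be checked against each other with a small harness such as the following sketch (ours, not from the thesis). Because the vectorized and sequential scans associate their additions differently, the outputs may differ by small rounding errors, which is exactly the kind of floating-point behavior analyzed earlier in the thesis.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int n = 1000;
    // prefix_sum_SSE needs 16-byte alignment; 1024 floats = 4096 bytes
    float *a = aligned_alloc(16, 1024 * sizeof(float));
    float *b = malloc(n * sizeof(float));
    if (a == NULL || b == NULL) return 1;
    for (int i = 0; i < n; i++)
        a[i] = b[i] = (float) rand() / (float) RAND_MAX;
    prefix_sum_SSE(a, n);     // vectorized scan (Listing A.1)
    sequential_scan(b, n);    // naive scan (Listing A.3)
    // compare with a tolerance rather than expecting bit-identical results
    float max_diff = 0.0f;
    for (int i = 0; i < n; i++) {
        float d = fabsf(a[i] - b[i]);
        if (d > max_diff) max_diff = d;
    }
    printf("max |vectorized - sequential| = %g\n", max_diff);
    free(a);
    free(b);
    return 0;
}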

Bibliography

[1] Amazon. Amazon EC2 Pricing. https://aws.amazon.com/ec2/pricing/on-demand/, 2020. Accessed: 2020-05-11.

[2] Amazon. Amazon EC2 Spot Instances Pricing. https://web.archive.org/web/20200513005845/https://aws.amazon.com/ec2/spot/pricing/, 2020. Accessed: 2020-05-13.

[3] Rick Beatson and Leslie Greengard. A short course on fast multipole methods. Wavelets, multilevel methods and elliptic PDEs, pages 1–37, 1997.

[4] Rick Beatson and Leslie Greengard. A short course on fast multipole methods. In Wavelets, Multilevel Methods and Elliptic PDEs, pages 1–37. Oxford University Press, 1997.

[5] Guy E. Blelloch. Prefix sums and their applications. Technical Report CMU-CS-90-190, School of Computer Science, Carnegie Mellon University, November 1990.

[6] Guy E. Blelloch, Phillip B. Gibbons, and Harsha Vardhan Simhadri. Low depth cache-oblivious algorithms. In Proceedings of the Twenty-Second Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’10, pages 189–199, New York, NY, USA, 2010. Association for Computing Machinery.

[7] Robert D Blumofe, Christopher F Joerg, Bradley C Kuszmaul, Charles E Leiserson, Keith H Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55–69, 1996.

[8] Derek Bradley and Gerhard Roth. Adaptive thresholding using the integral image. Journal of Graphics Tools, 12(2):13–21, 2007.

[9] Ronald Coifman, Vladimir Rokhlin, and Stephen Wandzura. The fast multipole method for the wave equation: A pedestrian prescription. IEEE Antennas and Propagation Magazine, 35(3):7–12, 1993.

[10] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, 3rd edition, 2009.

[11] cppreference. std::inclusive_scan. https://en.cppreference.com/w/cpp/algorithm/inclusive_scan, 2020. Accessed: 2020-05-06.

[12] Franklin C Crow. Summed-area tables for texture mapping. In Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques, pages 207–212, 1984.

[13] Eric Darve. The fast multipole method: numerical implementation. Journal of Computational Physics, 160(1):195–240, 2000.

[14] E. D. Demaine, M. L. Demaine, A. Edelman, C. E. Leiserson, and P. Persson. Building blocks and excluded sums. SIAM News, 38(4):1–5, 2005.

[15] Alexandra Fedorova, Margo Seltzer, and Michael D Smith. Improving performance isolation on chip multiprocessors via an operating system scheduler. In 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007), pages 25–38. IEEE, 2007.

[16] David Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Comput. Surv., 23(1):5–48, March 1991.

[17] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. J. Comput. Phys., 135(2):280–292, August 1997.

[18] Mark Harris, Shubhabrata Sengupta, and John Owens. Parallel prefix sum (scan) with CUDA. In GPU Gems 3, chapter 39. Addison-Wesley, August 2007.

[19] Justin Hensley, Thorsten Scheuermann, Greg Coombe, Montek Singh, and Anselmo Lastra. Fast summed-area table generation and its applications. In Computer Graphics Forum, volume 24, pages 547–555. Wiley Online Library, 2005.

[20] Nicholas J. Higham. The accuracy of floating point summation. SIAM J. Scientific Computing, 14:783–799, 1993.

[21] Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. Society for In- dustrial and Applied Mathematics, Philadelphia, PA, USA, 2nd edition, 2002.

[22] W. Daniel Hillis and Guy L. Steele, Jr. Data parallel algorithms. Commun. ACM, 29(12):1170–1183, December 1986.

[23] D. Horn. Stream reduction operations for GPGPU applications. In GPU Gems 2, January 2005.

[24] Cilk Hub. Programming in Cilk. http://cilk.mit.edu/programming/, 2020. Accessed: 2020-05-09.

[25] Richard E. Ladner and Michael J. Fischer. Parallel prefix computation. Journal of the ACM, 27(4):831–838, 1980.

[26] Boost Organization. Boost C++ libraries: Multiprecision. https://www.boost.org/doc/libs/1_66_0/libs/multiprecision/doc/html/boost_multiprecision/tut/floats/cpp_bin_float.html, 2020. Accessed: 2020-05-09.

[27] Artur Podobas, Mats Brorsson, and Karl-Filip Faxén. A comparison of some recent task-based parallel programming models. In 3rd Workshop on Programmability Issues for Multi-Core Computers, 2010.

[28] Tao B. Schardl, William S. Moses, and Charles E. Leiserson. Tapir: Embedding fork-join parallelism into LLVM’s intermediate representation. SIGPLAN Not., 52(8):249–265, January 2017.

[29] Shubhabrata Sengupta, Mark Harris, Michael Garland, and John Owens. Efficient Parallel Scan Algorithms for GPUs, pages 413–442. January 2011.

[30] Shubhabrata Sengupta, Aaron Lefohn, and John Owens. A work-efficient step-efficient prefix sum algorithm. May 2006.

[31] Julian Shun, Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Aapo Kyrola, Harsha Vardhan Simhadri, and Kanat Tangwongsan. Brief announcement: The problem based benchmark suite. In Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’12, pages 68–70, New York, NY, USA, 2012. Association for Computing Machinery.

[32] Intel Software. Intel intrinsics guide. https://software.intel.com/sites/landingpage/IntrinsicsGuide/, 2020. Accessed: 2020-05-09.

[33] Bjarne Stroustrup. The C++ programming language. Pearson Education, 2013.

[34] Steven P Vanderwiel and David J Lilja. Data prefetch mechanisms. ACM Computing Surveys (CSUR), 32(2):174–199, 2000.

[35] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, volume 1, pages I–I, 2001.

[36] Chi Xu, Xi Chen, Robert P Dick, and Zhuoqing Morley Mao. Cache contention and application performance prediction for multi-core systems. In 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), pages 76–86. IEEE, 2010.

[37] Nan Zhang. A novel parallel prefix sum algorithm and its implementation on multi- core platforms. In 2010 2nd International Conference on Computer Engineering and Technology, volume 2, pages V2–66. IEEE, 2010.

[38] Gernot Ziegler. Summed area computation using ripmap of partial sums, 2012. GPU Technology Conference (talk).
