Computing Included and Excluded Sums Using Parallel Prefix

by Sean Fraser

S.B., Massachusetts Institute of Technology (2019)

Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY

May 2020

© Massachusetts Institute of Technology 2020. All rights reserved.

Author...... Department of Electrical Engineering and Computer Science May 18, 2020

Certified by...... Charles E. Leiserson Professor of Computer Science and Engineering Thesis Supervisor

Accepted by...... Katrina LaCurts Chair, Master of Engineering Thesis Committee

Submitted to the Department of Electrical Engineering and Computer Science on May 18, 2020, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

Many scientific computing applications involve reducing elements in overlapping subregions of a multidimensional array. For example, the integral image problem from image processing requires finding the sum of elements in arbitrary axis-aligned subregions of an image. Furthermore, the fast multipole method, a widely used kernel in particle simulations, relies on reducing regions outside of a bounding box in a multidimensional array to a representative multipole expansion for certain interactions. We abstract away the application domains and define the underlying included and excluded sums problems of reducing regions inside and outside (respectively) of an axis-aligned bounding box in a multidimensional array. In this thesis, we present the dimension-reduction excluded-sums (DRES) algorithm, an asymptotically improved algorithm for the excluded sums problem in arbitrary dimensions, and compare it with the state-of-the-art algorithm by Demaine et al. The DRES algorithm reduces the work from exponential to linear in the number of dimensions. Along the way, we present a linear-time algorithm for the included sums problem and show how to use it in the DRES algorithm. At the core of these algorithms are in-place prefix and suffix sums. Furthermore, applications that involve included and excluded sums require both high performance and numerical accuracy in practice. Since standard methods for prefix sums on general-purpose multicores usually suffer from either poor performance or low accuracy, we present an algorithm called the block-hybrid (BH) algorithm for parallel prefix sums to take advantage of data-level and task-level parallelism. The BH algorithm is competitive on large inputs, up to 2.5× faster on inputs that fit in cache, and 8.4× more accurate compared to state-of-the-art CPU parallel prefix implementations. Furthermore, a BH algorithm variant achieves at least a 1.5× improvement over a state-of-the-art GPU prefix sum implementation on a performance-per-cost ratio (using Amazon Web Services' pricing). Much of this thesis represents joint work with Helen Xu and Professor Charles Leiserson.

Thesis Supervisor: Charles E. Leiserson Title: Professor of Computer Science and Engineering

Acknowledgments

First and foremost, I would like to thank my advisor Charles Leiserson for his advice and support throughout the course of this thesis. His positive enthusiasm, vast technical knowledge, and fascination with seemingly simple problems that have emergent complexity have been nothing short of inspiring. Additionally, this work would not have been possible without Helen Xu, who has spent a considerable amount of time guiding my research, discussing new avenues to pursue, and even writing parts of this thesis. I have learnt a great deal from both Charles Leiserson and Helen Xu, and I consider myself very fortunate to have such great collaborators and mentors. This thesis is derived from a project done in collaboration with both of them. I would also like to thank my academic advisor, Dennis Freeman, for his support during my MIT career. Furthermore, I am grateful to the entire Supertech Research Group and to TB Schardl for their discussions over the past year. Thanks to Guy Blelloch, Julian Shun and Yan Gu for providing a technical benchmark necessary for this work. I would also like to acknowledge MIT Supercloud for allowing me to run experiments on their compute cluster. I am extremely grateful to my friends at MIT, who have made my time here especially memorable. Thank you to my parents, my brother, and my girlfriend for providing unconditional support. This thesis and my journey at MIT would not have been possible without you.

This research was sponsored in part by NSF Grant 1533644, and in part by the United States Air Force Research Laboratory and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

Contents

1 Introduction

2 Preliminaries

3 Included and Excluded Sums
3.1 Tensor Preliminaries
3.2 Problem Definitions
3.3 INCSUM Algorithm
3.4 Excluded Sums (DRES) Algorithm
3.5 Applications

4 Prefix Sums
4.1 Previous Work
4.2 Vectorization
4.3 Accurate Floating Point
4.4 Ordering of Computation
4.5 Block-Hybrid Algorithm
4.6 Experimental Evaluation
4.6.1 Performance
4.6.2 Accuracy
4.7 GPU Comparison

5 Conclusions and Future Work

A Implementation Details
A.1 Implementation of Prefix Sum Vectorization

List of Figures

1-1 An example of the included and excluded sums in two dimensions for one box. The grey represents the region of interest to be reduced. Typically, the included and excluded sums problems require computing the reduction for all such regions (only one is depicted in the figure for illustration).

1-2 An example of the included and excluded sums problems in two dimensions with a box size of 3 × 3 and the max operator. The dotted box represents an example of a 3 × 3 box in the matrix, but the included and excluded sums problem computes the inclusion or exclusion regions for all 3 × 3 boxes.

1-3 An example of the corners algorithm for one box in two dimensions. The grey regions represent excluded regions computed via prefix and suffix sums.

1-4 Performance and accuracy of the naive sequential prefix sum algorithm compared to an unoptimized parallel version on a 40-core machine. Clearly, there is no speedup for the parallel version, and in fact, it is even slower. For the accuracy comparison, we define the error as the root mean square relative error over all outputs of the prefix sum array compared to a higher-precision reference at the equivalent index. The input is an array of $2^{15}$ single-precision floating point numbers, according to a distribution in the legend. The options are random numbers drawn from Unif(0, 1), Exp(1), or Normal(0, 1). On these inputs, the parallel version is on average around 17× more accurate.

1-5 Three highlight matrices of the trade-offs between performance and accuracy. The input size is either $2^{15}$ (left), $2^{22}$ (center), or $2^{27}$ (right) floats, and tests are run on a 40-core machine. Accuracy is measured as RMSE and plotted on a logarithmic scale, where RMSE is defined in Section 4.3, for random numbers uniformly distributed between 0 and 1. Performance (or observed parallelism) is measured as $T_1/T_{40h}$, where $T_{40h}$ is the parallel time on the Supercloud 40-core Machine with 2-way hyper-threading, and $T_1$ is the serial time of the naive sequential prefix sum on the same machine.

3-1 Pseudocode for the included sum in one dimension.

3-2 Computing the included sum in one dimension in linear time.

3-3 Example of computing the included sum in one dimension with 푁 = 8, 푘 = 4.

3-4 Computing the included sum in two dimensions.

3-5 Ranged prefix along row.

3-6 Ranged suffix along row.

3-7 Computes the included sum along a given row.

3-8 Computes the included sum along a given dimension.

3-9 Computes the included sum for the entire tensor.

3-10 An example of decomposing the excluded sum into disjoint regions in two dimensions. The red box denotes the points to exclude.

3-11 Steps for computing the excluded sum in two dimensions with included sums on prefix and suffix sums.

3-12 Prefix sum along row.

3-13 Adding in the contribution.

3-14 Prefix sum along row.

10 4-1 Hillis and Steele’s data-parallel prefix sum algorithm on an 8-input array using index notation. The individual boxes represent individual elements of the array and the notation 푗 : 푘 contains elements at index 푗 to index 푘 inclusive with the operator (here assumed to be ), denoted by ⊕. The lines from two boxes means those two elements are added where they meet the operator. The arrow corresponds to propagating the old element at that index (no operation)...... 55

4-2 The balanced -based algorithm described by Blelloch [5] and [25] consisting of the upsweep (first four rows) followed by the downsweep (last3 rows). The individual boxes represent individual elements of the array and the notation 푖 : 푗 contains elements at index 푖 to index 푗 added together. The lines coming from two boxes means those two elements are added to where the line meets the new box. The bolded-outline boxes indicate when that element in the array in is ‘finished’, i.e. it has the correct prefix sum at the index at that point. The arrows from Figure 4-1 are omitted but they have the same effect...... 57

4-3 The algorithm of PBBSLIB (scan_inplace) [31]. It first performs a parallel reduce operation on each block to a temporary array. After running an ex- clusive scan on that temporary array, we then run in-place prefix sums, using the value from the temporary array as the offset, for each block 퐵, in parallel. The symbols are described by the key on the right...... 58

4-4 An illustration of Prefix-Sum-Vec, for an 4-unit section of input (1, 2, 3, 4), with desired output (1,3,6,10). The offset is kept track of while iterating over the entire array in a forward pass in blocks of 4 elements. Here, 푉 = 4 is our vector width (4 floats, or 128 bits). However, this can be 256-bit vectors for example. The central idea is that it performs Hillis’ algorithm exclusively through vector operations (cheap vector adds, shifts, and shuffles etc), maximizing data level parallelism. The subscripted arrow means a shift by that number...... 60

11 4-5 The red boxed numbers by the lines indicate the order in which those sums are computed (increasing order). Left: the ordering from the algorithm discussed by Blelloch [5] / Work-Efficient Parallel Prefix, analogous to a level order traversal of a binary tree. Right: the ordering from a circuit described by a recursive version from Ladner and Fischer [25]. The upsweep phase can be described as a postorder traversal, and the downsweep as as preorder traversal. These depth-first orderings are more suited to cache-locality, because onlarge arrays, their accesses are grouped together and more cache-efficient. Blelloch provides a proof in [6]...... 66

4-6 An illustration of the Block Hybrid algorithm on a fabricated input array of size 40. In the first phase, in-place prefix sums / inclusive scans (using the intrinsics based vectorization algorithm) are performed on each block of size 퐵 (here, 퐵 = 8). The last value of each block is also copied to a temporary array. In the second phase, an in-place prefix sum in the form of a recursive work efficient parallel prefix is performed on the temporary array, andfinally the corresponding values are now broadcasted to all elements of all the blocks in parallel...... 68

4-7 A highlight matrix of the trade-offs between performance and accuracy. The input size is 222 floats, and tests are run on a 40-core machine. This in- put size is where the differences are most pronounced - for smaller inputs, the serial algorithms perform better, and for larger inputs, PBBSLIB and block_hybrid_reduce both have similar performance due to reaching the 1 maximum memory bandwidth. Accuracy is measured as RMSE and plot- ted on a logarithmic scale, where RMSE is defined in Section 4.3, for the random numbers uniformly distributed between 0 and 1. Performance (or ob- 푇1 served parallelism) is measured as , where 푇40ℎ is the parallel time on the 푇40ℎ Supercloud 40-core Machine with 2-way hyper-threading, and 푇1 is the serial time of the naive sequential prefix sum on the same machine. Our algorithms are sequential intrinsics, block hybrid, and block hybrid reduce.... 69

12 4-8 A logarithmic-scaled runtime graph compared all the algorithms discussed so far. It is worth noting the consistent performance of the block hybrid based algorithms, across all input sizes. This was measured on the Supercloud 40- core machine...... 70 4-9 A linearly scaled runtime graph where the y-axis is the runtime divided by the size of the array (to get a normalized per unit array size) runtime. This shows that the PBBSLIB algorithm is inefficient for small inputs, while the other algorithms scale almost linearly, until the input gets large enough, when the parallel algorithms scale better. This was measured on the Supercloud 40-core machine...... 71

푇1 4-10 Speedup (measured as 푇푃 ) versus input size (log scale) on two different parallel machines. As we can see as 푛 increases, we do gain some speedup, however we eventually hit a bottleneck which is the maximum memory bandwidth of the machine, since the calculation required in a prefix sum is small compared to the runtime associated with reading and writing memory...... 72 4-11 A bar chart comparing the root mean squared relative error of the different algorithms, compared to a prefix sum algorithm that uses a much higher precision (100) where inputs are 215 random single precision floating point numbers drawn from a uniform distribution between 0 and 1...... 73 4-12 A bar chart comparing the root mean squared relative error of the different algorithms, compared to a prefix sum algorithm that uses a much higher precision (100) where inputs are 215 random single precision floating point numbers drawn from the exponential distribution with lambda = 1...... 74 4-13 A bar chart comparing the root mean squared relative error of the different algorithms, compared to a prefix sum algorithm that uses a much higher precision (100) where inputs are 215 random single precision floating point numbers drawn from the standard normal distribution...... 75

List of Tables

4.1 Amazon EC2 Spot and On Demand Pricing Comparison.

4.2 Runtime comparison against the state-of-the-art GPU implementation on 1 Volta GPU versus the affordable 2-core AWS machine.

Chapter 1

Introduction

Many scientific computing applications require reducing many (potentially overlapping) regions of a tensor, or multidimensional array, to a single value for each region quickly and accurately. In this thesis, we explore the "included and excluded sums problems", which underlie applications that require reducing regions of a tensor to a single value using a binary associative operator. The problems are called "sums" for ease of presentation, but the general problem statements (and therefore algorithms to solve the problems) apply to any binary associative operator, not just addition. For simplicity, we will sketch the problems in two dimensions in the introduction but will formalize and provide algorithms for arbitrary dimensions in the remainder of the thesis.

At a high level, the included and excluded sums problems require computing reductions over many different (but possibly overlapping) regions in a matrix (one corresponding to each entry in the matrix). The problems take as input a matrix and a "box size" (or side lengths defining a rectangular region, or "box"). Given a box size, each location in the matrix defines a spatial box of that size. An algorithm for included sums outputs another matrix of the same size as the input matrix where each entry is the reduction of all elements contained in the box for that entry. The "excluded sums problem" is the inverse of the included sums problem: each entry in the output matrix is the reduction of all elements outside of the corresponding box. As we will see, a solution for included sums does not directly translate into a solution for excluded sums for general operators.

Figure 1-1 (left panel: Included Sum; right panel: Excluded Sum): An example of the included and excluded sums in two dimensions for one box. The grey represents the region of interest to be reduced. Typically, the included and excluded sums problems require computing the reduction for all such regions (only one is depicted in the figure for illustration).

Algorithms for included and excluded sums must compute the reduction for the included or excluded region for all points in the matrix. If we were only interested in one box (or one excluded region), a single pass over the matrix would suffice. If there are 푁 points in the matrix and the box size is 푅, naively summing elements in each box for the included sum takes 푂(푁푅) work (푂(푁(푁 − 푅)) for the naive excluded sum) and involves many repeated computations. An example of one entry of the included and excluded sum can be found in Figure 1-1.

The included and excluded sums problems appear in a variety of multidimensional computations. For example, the integral image problem (or summed-area table) [8, 12] pre-processes an image to answer queries for the sum of elements in arbitrary rectangular subregions of a matrix in constant time. The integral image has applications in real-time image processing and filtering [19]. The fast multipole method (FMM) is a widely-used numerical approximation for the calculation of long-ranged forces in various 푁-particle simulations [3, 17]. The essence of the FMM is a reduction of a neighboring subregion's particles, excluding particles too close, to a multipole to allow for fewer pairwise calculations [9, 13].
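For concreteness, the following is a minimal C++ sketch (with hypothetical names; it is an illustration, not code from this thesis) of the naive one-dimensional included sum described above, which recomputes each window from scratch for 푂(푁푘) total work with window length 푘:

#include <algorithm>
#include <cstddef>
#include <vector>

// Naive included sum in one dimension: each output entry sums its own
// window of length k from scratch, for O(Nk) total work.
std::vector<double> naive_incsum_1d(const std::vector<double>& a, std::size_t k) {
    std::size_t n = a.size();
    std::vector<double> b(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i; j < std::min(i + k, n); ++j)
            b[i] += a[j];  // window [i, i + k), clamped to the array
    return b;
}

The algorithms in Chapter 3 remove this redundancy by sharing prefix and suffix sums across overlapping windows.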

Summation Without Inverse

An obvious approach, first solving the included sums problem and then subtracting the result from the reduction of the entire tensor, fails for functions of interest which might represent singularities (e.g., in physics 푁-particle simulations) [14] or for operators without an inverse (such as max). Even for functions with an inverse, subtracting floating-point values suffers from catastrophic

cancellation [14, 38]. Therefore, we require algorithms without subtraction for the excluded sums problem. More generally, algorithms for the included and excluded sums problem should apply to any associative operator (with or without an inverse). In Figure 1-2, we present an example of the included and excluded sums problem with the max operator (which does not have an inverse).

Input:              Inclusion (max):    Exclusion (max):
 1  3  6  2  5      15 15 17 17 17      18 18 18 18 18
10  9  1  8 17      18 15 17 17 17      17 18 18 18 18
 5 11 15  3  2      18 16 16 12  9      17 18 18 18 18
18  4  2 12  9      18 16 16 12  9      17 18 18 18 18
 6  2 16  7  8      16 16 16  8  8      18 18 18 18 18

Figure 1-2: An example of the included and excluded sums problems in two dimensions with a box size of 3 × 3 and the max operator. The dotted box represents an example of a 3 × 3 box in the matrix, but the included and excluded sums problem computes the inclusion or exclusion regions for all 3 × 3 boxes.
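To see in one dimension how exclusion can be computed without an inverse, note that the region outside a window splits into a part strictly before it and a part strictly after it, each coverable by a running prefix or suffix reduction. Below is a hedged C++ sketch with max as the operator (illustrative names; the general multidimensional construction is the subject of the corners algorithm and Chapter 3):

#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Subtraction-free excluded "sum" under max in one dimension: combine the
// prefix maximum ending just before each window with the suffix maximum
// starting just after it. No inverse of max is needed.
std::vector<double> excluded_max_1d(const std::vector<double>& a, std::size_t k) {
    const double id = -std::numeric_limits<double>::infinity();  // identity for max
    std::size_t n = a.size();
    std::vector<double> pre(n), suf(n), out(n, id);
    for (std::size_t i = 0; i < n; ++i)
        pre[i] = std::max(a[i], i > 0 ? pre[i - 1] : id);
    for (std::size_t i = n; i-- > 0; )
        suf[i] = std::max(a[i], i + 1 < n ? suf[i + 1] : id);
    for (std::size_t i = 0; i < n; ++i) {  // excluded window occupies [i, i + k)
        if (i > 0)     out[i] = std::max(out[i], pre[i - 1]);
        if (i + k < n) out[i] = std::max(out[i], suf[i + k]);
    }
    return out;
}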

Corners Algorithm for Excluded Sums

Since naive algorithms for reducing regions of a tensor may waste work by recomputing reductions for overlapping regions, researchers have proposed algorithms that reuse the work of reducing regions of a tensor. Demaine et al. [14] introduced an algorithm to compute the excluded sums in arbitrary dimensions without subtraction, which we will call the corners algorithm. To our knowledge, the corners algorithm is the fastest algorithm for the excluded sums problem. Given a 푑-dimensional tensor and a length-푑 vector of box side lengths, the corners algorithm computes the excluded sum for all 푁 boxes of that size. At a high level, the corners algorithm decomposes the excluded region for each box into $2^d$ disjoint regions (one corresponding to each corner of that box, or to each vertex in 푑 dimensions) such that summing all of the points in the $2^d$ regions exactly matches the excluded region. The algorithm heavily depends on prefix and suffix sums to compute the reduction of points in

each of the disjoint regions. The original article that proposed the corners algorithm does not include a formal analysis of its runtime or space usage in arbitrary dimensions. Given a 푑-dimensional tensor of 푁 points, the corners algorithm takes $\Omega(2^d N)$ work to compute the excluded sum in the best case because there are $2^d$ corners and each one requires Ω(푁) work to account for the contribution to each excluded box. The bound is tight: given arbitrary extra space, the corners algorithm takes $O(2^d N)$ work. An in-place (using space sublinear in the input size) corners algorithm takes $O(2^d d N)$ work.

Figure 1-3: An example of the corners algorithm for one box in two dimensions. The grey regions represent excluded regions computed via prefix and suffix sums.

To our knowledge, the best lower bound for the work and space usage for any algorithm for excluded sums is Ω(푁) because any algorithm has to read in the entire input tensor (of size 푁) at least once and output a tensor of size Ω(푁).

Prefix and Suffix Sums

As we will formalize in Chapter 3, an efficient implementation of either the corners algorithm or the DRES algorithm requires a fast prefix and suffix sum implementation since prefix and suffix sums are core subroutines in both algorithms. For simplicity, we will discuss prefix sums for the rest of this thesis, but all of the techniques that we explore for prefix sums also apply to suffix sums. The prefix sum (also known as scan) [5] operation is one of the most fundamental building blocks of parallel algorithms. Prefix sums are so ubiquitous that they have been included as primitives in some languages such as C++ [11], and more recently have been considered

as a primitive for GPU computations in CUDA [18]. Furthermore, parallel prefix sums (or, analogously, a pairwise summation) achieve high numerical accuracy on finite-precision floating-point numbers [21] compared to a naive sequential reduction [20]. Therefore, parallel prefix sums are well-suited to scientific computing applications such as the FMM, which optimize for both performance and accuracy.

Figure 1-4 (panels: (a) Performance, (b) Accuracy): Performance and accuracy of the naive sequential prefix sum algorithm compared to an unoptimized parallel version on a 40-core machine. Clearly, there is no speedup for the parallel version, and in fact, it is even slower. For the accuracy comparison, we define the error as the root mean square relative error over all outputs of the prefix sum array compared to a higher-precision reference at the equivalent index. The input is an array of $2^{15}$ single-precision floating point numbers, drawn according to a distribution in the legend: random numbers from Unif(0, 1), Exp(1), or Normal(0, 1). On these inputs, the parallel version is on average around 17× more accurate.

Although the canonical parallel prefix sum should be faster in theory, traditional implementations on general-purpose multicores exhibit several performance issues. As shown in Figure 1-4, an unoptimized parallel prefix sum is slower than the simplest serial algorithm.
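For reference, the "simplest serial algorithm" compared in Figure 1-4 is just a single streaming pass with a running total; a sketch (assuming an in-place update over a float array):

#include <cstddef>

// Naive sequential prefix sum: one forward pass with a running total. The
// streaming access pattern is prefetch-friendly, but the n - 1 dependent
// additions give the O(n) worst-case error constant discussed in Chapter 2.
void serial_prefix_sum(float* a, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i)
        a[i] += a[i - 1];
}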

The parallel prefix sum is recursive, which generates function call overhead. The parallel prefix sum is work-efficient but has a constant factor of extra work when compared to the serial version [5]. Furthermore, there is additional overhead in parallelization due to scheduling [27]. Parallel implementations without careful coarsening may also run into cache-line conflicts and contention [15, 36]. The serial algorithm takes advantage of prefetching [34] because it just requires a straightforward pass through the input, while the parallel algorithm accesses elements out of order in a tree-like traversal [5]. Finally, as we will see in Chapter 4,

the traditional parallel algorithm does not take advantage of data-level parallelism via vectorization. An efficient implementation of parallel prefix sum requires careful parallelization and optimization to take advantage of task-level and data-level parallelism effectively. The highlights of our results for prefix sum can be summarized by Figure 1-5, which shows the trade-offs between performance and accuracy for various prefix sum algorithms at three input sizes: $2^{15}$, $2^{22}$, and $2^{27}$. These input sizes highlight the trends: for smaller inputs, the serial algorithms perform better, and for larger inputs, PBBSLIB and block_hybrid_reduce both have similar performance due to reaching the maximum memory bandwidth. The differences are most pronounced for our algorithms in the middle-sized plot. The full plots in Chapter 4 have the details. The top right corner of each matrix represents algorithms that are highly performant and accurate. Our algorithms are sequential intrinsics (single-threaded), block_hybrid, and block_hybrid_reduce.

Figure 1-5: Three highlight matrices of the trade-offs between performance and accuracy. The input size is either $2^{15}$ (left), $2^{22}$ (center), or $2^{27}$ (right) floats, and tests are run on a 40-core machine. Accuracy is measured as RMSE and plotted on a logarithmic scale, where RMSE is defined in Section 4.3, for random numbers uniformly distributed between 0 and 1. Performance (or observed parallelism) is measured as $T_1/T_{40h}$, where $T_{40h}$ is the parallel time on the Supercloud 40-core Machine with 2-way hyper-threading, and $T_1$ is the serial time of the naive sequential prefix sum on the same machine.

Contributions

Our contributions can be divided into two main categories: results for included and excluded sums, and results for prefix sums.

22 Our main contribution for excluded sums is the dimension-reduction excluded-sums (DRES) algorithm, an asymptotically improved algorithm for the excluded sums problem in arbitrary dimensions. Along the way, we present an efficient algorithm for included sums called the INCSUM algorithm. We measure multithreaded algorithms in terms of their work and span, or longest chain of sequential dependencies [10, Chapter 27]. Chapter 2 formalizes the dynamic multithreading model. The DRES and INCSUM algorithms take as input the following parameters:

• a 푑-dimensional tensor 풜 with 푁 entries,

• the size of the excluded region (the “box size”), and

• a binary associative operator ⊕.

Both algorithms output another 푑-dimensional tensor ℬ with 푁 entries where each entry is the excluded or included sum (respectively) corresponding to that point in the tensor under the operator ⊕. Our contributions towards excluded and included sums are as follows:

• The dimension-reduction excluded-sums (DRES) algorithm, an asymptotically im- proved algorithm for the excluded sums problem in arbitrary dimensions.

• Theorems showing that DRES computes the excluded sum in 푂(푑푁) work and in $O\left(d^2 \sum_{i=1}^{d} \lg n_i\right)$ span for a 푑-dimensional tensor with 푁 points where each dimension 푖 = 1, . . . , 푑 has length 푛푖.

• An implementation of DRES in C++, with an experimental evaluation on the horizon.

Our results for prefix sums are as follows:

• The block-hybrid (BH) algorithm for prefix sums, a data-parallel and task-parallel algorithm for prefix sums, and a less accurate, faster variant.

• A proof of the observation that reducing the span of a prefix sum algorithm improves both its parallelism and worst case error bound in floating point arithmetic.

23 • An implementation of the block-hybrid algorithm in C++ / Cilk [7].

• An experimental comparison of prefix sum algorithms that shows that the BH algorithm is up to 2.5× faster on inputs that fit in cache and 8.4× more accurate than state-of-the-art CPU parallel prefix sum algorithms. We also present a variation of BH that is faster on larger inputs and 2.6× more accurate compared to the same benchmarks, in addition to being up to 2.5× faster on inputs that fit in cache.

• An evaluation of prefix sums on a CPU versus a GPU that shows the BH algorithm variant on a CPU is at least 1.5× more cost-efficient than the state-of-the-art GPU implementation (using AWS pricing [2]).

Map

The rest of this thesis is organized as follows. In Chapter 2, we review background on prefix sums, cost models for algorithm analysis in the remainder of the thesis, and our experimental setup. We present and analyze algorithms for excluded and included sums in Chapter 3. We introduce the block-hybrid algorithm and two variants for prefix sums, and conduct an experimental study of prefix sum algorithms in Chapter 4. Finally, we provide closing remarks in Chapter 5.

Chapter 2

Preliminaries

This section reviews the prefix sum primitive as well as models for dynamic multithreading and numerical accuracy that we will use to analyze algorithms in this thesis. The included sums, excluded sums, and prefix sums software are implemented in Cilk [7, 24], which is a linguistic extension to C++ [33]. Therefore, we describe Cilk-like pseudocode and use the model of multithreading underlying Cilk, although the pseudocode can be applied to any arbitrary fork-join parallelism model.

Prefix and Suffix Sums

We first review the all-prefix-sums operation [5] (or scan) as a primitive operation that we will use throughout this thesis (note that this is an inclusive scan).

Definition 1 (All-prefix-sums Operation) The all-prefix-sums operation takes a binary associative operator ⊕ (for example, addition, multiplication, minimum or maximum), and an ordered set of 푛 elements

$$[a_0, a_1, \ldots, a_{n-1}]$$

and returns the ordered set

$$[a_0,\ (a_0 \oplus a_1),\ \ldots,\ (a_0 \oplus a_1 \oplus \cdots \oplus a_{n-1})].$$

25 Example. If ⊕ is addition, then the all-prefix-sums operation on the ordered set

[3 1 7 0 4 1 6 3]

would return [3 4 11 11 15 16 22 25].
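For illustration, the same example can be expressed with C++17's standard scan primitive (a sketch; the implementations in this thesis use their own scans rather than the standard library):

#include <functional>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    // All-prefix-sums with ⊕ = addition; std::inclusive_scan accepts any
    // binary associative operator, matching Definition 1.
    std::vector<int> a = {3, 1, 7, 0, 4, 1, 6, 3};
    std::vector<int> out(a.size());
    std::inclusive_scan(a.begin(), a.end(), out.begin(), std::plus<>{});
    for (int v : out) std::cout << v << ' ';  // prints 3 4 11 11 15 16 22 25
}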

The suffix sum is the reverse of the prefix sum:

Definition 2 (All-suffix-sums Operation) The all-suffix-sums operation takes a binary associative operator ⊕ (for example, addition, multiplication, minimum or maximum), and an ordered set of 푛 elements

$$[a_0, a_1, \ldots, a_{n-1}]$$

and returns the ordered set

$$[(a_0 \oplus a_1 \oplus \cdots \oplus a_{n-1}),\ (a_1 \oplus \cdots \oplus a_{n-1}),\ \ldots,\ a_{n-1}].$$

Additionally, Reduce, which we will see in Chapter 4, takes the same arguments as prefix sum, but only returns the single element $a_0 \oplus a_1 \oplus \cdots \oplus a_{n-1}$ (the last element).
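Both variants reduce to the same primitive; a sketch (again using the standard library purely for illustration): scanning the reversed sequence yields the suffix sums, and Reduce is a single fold.

#include <functional>
#include <numeric>
#include <vector>

// All-suffix-sums: an inclusive scan over the reversed input, written
// through a reverse iterator, gives out[i] = a[i] ⊕ a[i+1] ⊕ ... ⊕ a[n-1].
std::vector<int> suffix_sums(const std::vector<int>& a) {
    std::vector<int> out(a.size());
    std::inclusive_scan(a.rbegin(), a.rend(), out.rbegin(), std::plus<>{});
    return out;
}

// Reduce: the single element a0 ⊕ a1 ⊕ ... ⊕ a(n-1).
int reduce_all(const std::vector<int>& a) {
    return std::reduce(a.begin(), a.end(), 0, std::plus<>{});
}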

Analysis of Multithreaded Algorithms

The linguistic model for multithreaded pseudocode from [10, Chapter 27] follows MIT Cilk [24]. It augments serial code with three keywords: spawn, sync, and parallel, the last of which can be implemented with the first two. The spawn keyword before a function call creates nested parallelism. The parent function executes a spawn and can execute in parallel with the spawned child subroutine. Therefore, the code that immediately follows the spawn may execute in parallel with the child, rather than waiting for it to complete as in a serial function call. A parent function cannot safely use the values returned by its children until after a sync statement, which causes the function to wait until all of its spawned children complete before proceeding to the code after the sync. Every function also implicitly syncs before it returns, preventing orphaning.

The spawn and sync keywords denote logical parallelism in a computation, but do not require parts of a computation to run in parallel. At runtime, a scheduler determines which subroutines run concurrently by assigning them to different cores in a multicore machine. Cilk has a runtime system that implements a provably efficient work-stealing scheduler [7]. Parallel loops can be expressed by preceding an ordinary for with the keyword parallel, which indicates that all iterations of the loop may run in parallel. Parallel loops can be implemented by parallel divide-and-conquer recursion using spawn and sync. We use the dynamic multithreading model to measure the asymptotic costs of multithreaded algorithms in terms of their work and span [10, Chapter 27]. The work is the total time to execute the entire algorithm on one processor. The span (sometimes called the critical-path length or computational depth) is the longest serial chain of dependencies in the computation (or the runtime in instructions on an infinite number of processors). For example, a parallel for loop over 푁 iterations of 푂(1) work per iteration has 푂(log 푁) span in the dynamic multithreading model (in this thesis, log is always base 2).
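As a concrete (illustrative) rendering of these keywords in C++, OpenCilk-style code uses cilk_spawn, cilk_sync, and cilk_for; the functions below are examples, not code from this thesis:

#include <cilk/cilk.h>

// A parallel loop: N iterations of O(1) work each, for O(N) work and
// O(log N) span under the dynamic multithreading model.
void scale_all(double* a, long n, double c) {
    cilk_for (long i = 0; i < n; ++i)
        a[i] *= c;
}

// Nested fork-join parallelism with spawn and sync.
long fib(long n) {
    if (n < 2) return n;
    long x = cilk_spawn fib(n - 1);  // child may run in parallel with parent
    long y = fib(n - 2);             // parent continues without waiting
    cilk_sync;                       // wait for all spawned children
    return x + y;
}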

Accuracy Model

Sums of floating point numbers in scientific computing are ubiquitous and require careful consideration of the accumulation of roundoff errors. Because floating-point addition is non-associative on computers, standard compilers such as Clang and GCC are not allowed to reorder operations in the summation of an arbitrary sequence of floating point operations without breaking IEEE or ISO guarantees (this restriction can be lifted with certain compiler flags, but the results are no longer guaranteed). The effect would be different summation results with different degrees of error when dissimilar-sized numbers are added together. There has been plenty of research carried out on numerical stability, most notably by Higham [20, 21]. As we find out analytically and experimentally, the order in which we sum an arbitrary sequence of floating point numbers has a great effect on the accuracy of a result, depending on the input values. Moreover, if we can understand how to best mitigate floating point roundoff error and achieve high accuracy in our summations, we can use that understanding in designing serial and parallel algorithms that achieve good compromises of performance and accuracy.

27 We use a standard accuracy model for floating point arithmetic, as described by Higham [20] and Goldberg [16].

$$fl(x \,\mathrm{op}\, y) = (x \,\mathrm{op}\, y)(1 + \delta), \qquad |\delta| \le u, \qquad \mathrm{op} \in \{+, -, \times, /\} \tag{2.1}$$

where 훿 is a small error associated with the floating point representation of the calculation after correct rounding, and 푢 is the unit roundoff (or machine precision), which for single-precision floating point is $5.96 \times 10^{-8}$ and for doubles is $1.11 \times 10^{-16}$. We are also assuming the use of a guard digit, which is standard. For more details on floating point arithmetic, see [16] and the rest of [20]. We use the following summarized results from Higham's work on the accuracy of floating point summation to guide our accuracy analysis and evaluation:

Worst Case Error Bounds. In summation, we are evaluating an expression of the form $S_n = \sum_{i=0}^{n-1} x_i$, where $x_0, \ldots, x_{n-1}$ are real numbers. As we will elucidate in Chapter 4, Higham [20] shows that the worst case error bound for summing an array of numbers in the naive way (a running total from left to right) has an error constant of 푂(푛), i.e., the length of the longest chain of dependent additions. Further, a pairwise summation, which we will describe in Chapter 4 and analogize to a parallel-prefix computation, has a worst case error bound constant of 푂(log 푛). We will apply these error bounds to our prefix sum algorithms. Higham notes that when 푛 is very large, pairwise summation is an efficient compromise between performance and accuracy.
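A minimal sketch of pairwise summation (recursive halving; the midpoint split is the usual choice) that realizes the 푂(log 푛) error constant:

#include <cstddef>

// Pairwise (cascade) summation: splitting recursively halves the depth of
// dependent additions, so the longest chain has length O(log n) rather than
// the O(n) chain of a left-to-right running total. Requires n >= 1.
float pairwise_sum(const float* x, std::size_t n) {
    if (n == 1) return x[0];
    std::size_t half = n / 2;
    return pairwise_sum(x, half) + pairwise_sum(x + half, n - half);
}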

Experimental Analysis and Results. Rounding error bounds tend to be very pessimistic, since they are worst case bounds. Higham mentions that it is also important to experimentally evaluate the accuracy of summations. We use several ideas from his methodologies: using higher precision numbers as a reference point, using random numbers from uniform, exponential, and normal distributions as input, and the Kahan summation algorithm as a benchmark [16]. Further, we use his results on statistical estimates of accuracy on pairwise summation to affirm its performance. Our decision to use root mean square relative error to compare the results is also inspired by his discussion.

Experimental Setup

This section summarizes the shared-memory multicore machines, compilers, and software used for all experimental evaluation in this thesis. The following are the reference names that will be used throughout.

AWS 18-core Machine (c4.8xlarge) An 18-core machine with 2 × Intel Xeon CPU E5-2666 v3 (Haswell) @ 2.90GHz processors, each with 9 cores per processor, with 2-way hyperthreading. Each processor has a 1600MHz bus and a 25MB L3 cache. Each core has a 256KB L2 cache, a 32KB L1 data cache, and a 32KB L1 instruction cache. The cache line size is 64B. There is a total of 60GB DRAM on the machine. The theoretical maximum memory bandwidth is 51.2 GB/second. The cores have AVX2 (and earlier AVX, SSE) instruction set extensions.

Supercloud 40-core Machine A 40-core machine with 2 × Intel Xeon Gold 6248 @ 2.50GHz processors, each with 20 cores per processor, with 2-way hyperthreading. Each processor has a 2993MHz bus and a 27.5MB L3 cache. Each core has a 1MB L2 cache, a 32KB L1 data cache, and a 32KB L1 instruction cache. The cache line size is 64B. There is a total of 384GB DRAM on the machine. The theoretical maximum memory bandwidth is 93.4GB/second. The cores have AVX512 (and earlier AVX2, AVX, SSE) instruction set extensions.

Affordable AWS Machine (c5.xlarge) A 2-core machine with 1 × Intel Xeon Platinum 8124M CPU @ 3.00GHz processor, with 2 cores per processor and 2-way hyperthreading. The processor has a 25MB L3 cache. Each core has a 1MB L2 cache, a 32KB L1 data cache, and a 32KB L1 instruction cache. The cache line size is 64B. There is a total of 8GB DRAM on the machine. The cores have AVX512 (and earlier AVX2, AVX, SSE) instruction set extensions.

Volta GPU The Supercloud 40-core machine also has 2 NVIDIA Volta V100 GPUs attached. The GPU has 32GB RAM. When run with just 1 GPU, we note that this is the same as the p3.2xlarge Amazon EC2 instance [1], and use this for pricing comparisons against the Affordable AWS Machine.

Compiler As stated earlier, all the code is implemented in C++ using Cilk for fork-join parallelism. The compiler used is an extended version of Clang version 8, called Tapir/LLVM

[28]. The -O3 and -march=native compiler flags are used throughout the experiments for maximum performance.

Boost The Boost C++ Library is used to carry out the accuracy experiments regarding floating point summation. In particular, Boost multi-precision floating point numbers [26] are used, providing 100 decimal digits of precision. The experiments can then calculate summation results much closer to the true real value, and these serve as the baseline against which we compare the relative error of the different methods of summation. Higham [20] uses a similar technique of comparing to a higher precision reference point.

Code The code will be available at https://github.com/seanfraser/thesis at the time of publication, expected by the end of May 2020.
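As an illustration of this accuracy methodology (a sketch under stated assumptions, not the exact benchmark code; names are illustrative, and inputs are assumed positive so partial sums are nonzero):

#include <boost/multiprecision/cpp_dec_float.hpp>
#include <cmath>
#include <cstddef>
#include <vector>

using big = boost::multiprecision::cpp_dec_float_100;  // 100 decimal digits

// Root mean square relative error of a single-precision prefix sum `result`
// against a 100-digit reference prefix sum computed alongside it.
double rmse_vs_reference(const std::vector<float>& input,
                         const std::vector<float>& result) {
    big ref = 0, acc = 0;
    for (std::size_t i = 0; i < input.size(); ++i) {
        ref += big(input[i]);                    // high-precision prefix sum
        big rel = (big(result[i]) - ref) / ref;  // relative error at index i
        acc += rel * rel;
    }
    return std::sqrt((acc / big(input.size())).convert_to<double>());
}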

Chapter 3

Included and Excluded Sums

Outline. In this chapter, we present algorithms for included and excluded sums in arbitrary dimensions. We begin with necessary preliminaries for indexing and defining ranges in tensors to understand the later algorithms. We conclude with two potential applications for included and excluded sums, and their relation to prefix sums.

3.1 Tensor Preliminaries

We discuss order-푑 tensors in a particular orthogonal basis. That is, tensors are 푑-dimensional arrays of elements over some field ℱ, usually the real or complex numbers. We denote tensors by capital script letters 풜 and vectors by lowercase boldface letters a.

Definition 3 (Index domain) An index domain 푈 is the cross product $U = U_1 \times U_2 \times \cdots \times U_d$ where $d \ge 1$ and for $i = 1, 2, \ldots, d$, we have $U_i = \{0, 1, \ldots, n_i - 1\}$ where $n_i \ge 1$. That is, dimensions are 1-indexed and coordinates start at 0.

A tensor 풜 is a mapping 풜 : 푈 → F for some index domain 푈 and for some field F. The length of dimension 푖 of tensor 풜 is 푛푖.

We can denote a particular element of an index domain as an index $\mathbf{x} = (x_1, x_2, \ldots, x_d)$ where for all $i = 1, 2, \ldots, d$, $0 \le x_i \le n_i - 1$. Sometimes for simplicity of notation, we will use 푛푖 in place of 푛푖 − 1 to denote the end of a row.

31 We use 푥 : 푥′ to denote a range of indices along a particular dimension and : without bounds to indicate all elements along a particular dimension. For example, the middle 푛/2 columns of an 푛 × 푛 matrix 풜 would be written 풜(:, 푛/4 : 3푛/4). A range of indices defines a subtensor. A tensor defines a value at each index (i.e. 풜[x] ∈ F). The value of a tensor at a range of indices is the sum (or reduction) of the values of the tensor at each index in the range. Next, we introduce box notation, which we will use to formally define the included and excluded sums problems.

Definition 4 (Box) A box of an index domain 푈 cornered at $\mathbf{x} = (x_1, \ldots, x_d) \in U$ and $\mathbf{x}' = (x'_1, \ldots, x'_d) \in U$, denoted $B = (x_1 : x'_1, x_2 : x'_2, \ldots, x_d : x'_d) \subseteq U$, is the region $\{\mathbf{y} = (y_1, y_2, \ldots, y_d) \in U : x_1 \le y_1 < x'_1, x_2 \le y_2 < x'_2, \ldots, x_d \le y_d < x'_d\}$. For all $i = 1, 2, \ldots, d$, $x_i < x'_i$.

Definition 5 (Box-side lengths) Given an index domain 푈 and a box $B = (x_1 : x'_1, x_2 : x'_2, \ldots, x_d : x'_d) \subseteq U$, the box-side lengths are denoted $\ell_B = (x'_1 - x_1, x'_2 - x_2, \ldots, x'_d - x_d)$, or the length of the box in each dimension.

3.2 Problem Definitions

In this section we describe the included and excluded sums problems. Throughout this section, let 풜 : 푈 → F be a 푑-dimensional tensor and k = (푘1, . . . , 푘푑) be a vector of box-side lengths. For simplicity in the pseudocode, we will assume 푛푖 mod 푘푖 = 0 for all 푖 = 1, 2, . . . , 푑. In implementations, the input can either be padded with the identity to make this true, or extra code can be added to deal with unaligned boxes.

Problem 1 (Included Sums Problem) An algorithm for the included sums problem takes as input a tensor 풜 and a vector of box-side lengths k. It outputs a new tensor ℬ : 푈 → F such that every index $\mathbf{x} = (x_1, \ldots, x_d)$ of ℬ maps to the sum of elements at all indices inside the box cornered at x, namely $(x_1 : x_1 + k_1, \ldots, x_d : x_d + k_d)$.

32 For example, the included sums problem in one dimension is the problem of finding the sum of each run of length 푘 in an array of length 푛. In two dimensions, the included sums

problem is finding the sum of all elements in every 푘1 × 푘2 box in an 푛1 × 푛2 matrix. The excluded sums problem definition is the inverse of the included sums: every index in the output tensor ℬ maps to the sum of elements at all indices outside of the box cornered at that index.

3.3 INCSUM Algorithm

First, we will present a linear-time algorithm INCSUM to solve the included sums problem in one dimension and demonstrate how to extend it to arbitrary dimensions.

Included Sums in One Dimension

Let 풜 be a list of length 푁 and 푘 be the (scalar) length of the box. For simplicity, assume 푁 mod 푘 = 0 (we can pad the list length). At a high level, the incsum_1D algorithm generates two intermediate lists $A_p, A_s$ of length 푁 each and does 푁/푘 prefix and suffix sums of length 푘 each, respectively. By construction, for 푥 = 0, 1, . . . , 푁 − 1, $\mathcal{A}_s[x] = \mathcal{A}\big[x : \lceil (x+1)/k \rceil \cdot k\big]$ and $\mathcal{A}_p[x] = \mathcal{A}\big[\lfloor x/k \rfloor \cdot k : x + 1\big]$. We begin with pseudocode for the included sum in one dimension in Figure 3-1. We then illustrate the ranged prefix and suffix sums in Figure 3-2 and present a worked example in Figure 3-3. The algorithm in Figure 3-1 is clearly linear in the number of elements in the list since the total number of loop iterations is 2푁. As mentioned in the introduction, a naive algorithm for included sums takes 푂(푁푘) work where 푁 is the number of elements and 푘 is the box length. We will now show that incsum_1D computes the included sum.

Lemma 1 incsum_1D solves the included sums problem in one dimension.

Proof. Consider a one-dimensional list 풜 with 푁 elements and box length 푘. We will show that for each 푥 = 0, 1, . . . , 푁 − 1, ℬ(푥) contains the desired sum. For 푥 mod 푘 = 0, this holds by construction. For all other 푥, the previously defined prefix and suffix sum give

incsum_1D(퐴, 푁, 푘)
 1  ◁ Input: list 퐴 of size 푁, included sum length 푘
 2  ◁ Output: list 퐵 of size 푁 where each entry 퐵[푖] = 퐴[푖 : 푖 + 푘] for 푖 = 0, 1, . . . , 푁 − 1.
 3  let 퐵 be a new list of size 푁
 4  퐴푝 ← 퐴, 퐴푠 ← 퐴
 5  ◁ 푘-wise prefix and suffix sums within each block of size 푘
 6  parallel for 푖 ← 0 to 푁/푘 − 1
 7      for 푗 ← 1 to 푘 − 1
 8          퐴푝[푖푘 + 푗] += 퐴푝[푖푘 + 푗 − 1]
 9      for 푗 ← 푘 − 2 downto 0
10          퐴푠[푖푘 + 푗] += 퐴푠[푖푘 + 푗 + 1]
11  parallel for 푖 ← 0 to 푁 − 1
12      if 푖 mod 푘 == 0
13          퐵[푖] = 퐴푠[푖]
14      else
15          퐵[푖] = 퐴푠[푖] + 퐴푝[푖 + 푘 − 1]
16  return 퐵

Figure 3-1: Pseudocode for the included sum in one dimension.
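A direct C++ rendering of Figure 3-1 may make the index arithmetic easier to follow (a serial sketch; the parallel for loops would become parallel loops in the Cilk setting, and windows that would run past the end of the array are truncated, corresponding to padding with the identity):

#include <cstddef>
#include <vector>

// incsum_1D: k-wise prefix sums (A_p) and suffix sums (A_s) per block of
// size k, then B[i] = A_s[i] (+ A_p[i + k - 1] when the window straddles a
// block boundary). Assumes n mod k == 0.
std::vector<double> incsum_1d(const std::vector<double>& a, std::size_t k) {
    std::size_t n = a.size();
    std::vector<double> ap(a), as(a), b(n);
    for (std::size_t i = 0; i < n / k; ++i) {   // one block of size k at a time
        for (std::size_t j = 1; j < k; ++j)     // k-wise prefix sums
            ap[i * k + j] += ap[i * k + j - 1];
        for (std::size_t j = k - 1; j-- > 0; )  // k-wise suffix sums
            as[i * k + j] += as[i * k + j + 1];
    }
    for (std::size_t i = 0; i < n; ++i)
        b[i] = (i % k == 0 || i + k - 1 >= n)   // suffix alone covers the window
                   ? as[i]
                   : as[i] + ap[i + k - 1];     // suffix piece + prefix piece
    return b;
}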


Figure 3-2: Computing the included sum in one dimension in linear time.

the desired result. Recall that $\mathcal{B}[x] = \mathcal{A}_p[x+k-1] + \mathcal{A}_s[x]$, $\mathcal{A}_s[x] = \mathcal{A}\big[x : \lceil (x+1)/k \rceil \cdot k\big]$, and $\mathcal{A}_p[x+k-1] = \mathcal{A}\big[\lfloor (x+k-1)/k \rfloor \cdot k : x+k\big]$. Also note that $\lfloor (x+k-1)/k \rfloor = \lceil (x+1)/k \rceil$ for all $x \bmod k \ne 0$.

Therefore,

$$\mathcal{B}[x] = \mathcal{A}_s[x] + \mathcal{A}_p[x+k-1] = \mathcal{A}\left[x : \left\lceil \tfrac{x+1}{k} \right\rceil \cdot k\right] + \mathcal{A}\left[\left\lfloor \tfrac{x+k-1}{k} \right\rfloor \cdot k : x+k\right] = \mathcal{A}[x : x+k],$$

which is exactly the desired sum.

Figure 3-3: Example of computing the included sum in one dimension with 푁 = 8, 푘 = 4.

Generalizing to Arbitrary Dimensions

Next, we demonstrate how to extend the one-dimensional INCSUM algorithm to solve the included sums problem for a 푑-dimensional tensor of 푁 points in 푂(푑푁) work. At a high level, we apply the same one-dimensional INCSUM algorithm along every row of each dimension. For example, Figure 3-4 shows how to use the result of the included sum along one dimension of a matrix to find the included sum of a two-dimensional box.

Figure 3-4: Computing the included sum in two dimensions (INCSUM in the second dimension, then INCSUM in the first dimension on the result).
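A sketch of this dimension-by-dimension composition in two dimensions, reusing the incsum_1d sketch above (illustrative; a real implementation would operate in place on a flat row-major array):

#include <cstddef>
#include <vector>

// Included sums for k1 x k2 boxes: apply one-dimensional INCSUM along every
// row, then along every column of the result, as in Figure 3-4.
void incsum_2d(std::vector<std::vector<double>>& m,
               std::size_t k1, std::size_t k2) {
    for (auto& row : m)                       // dimension 2: all rows
        row = incsum_1d(row, k2);
    std::size_t rows = m.size(), cols = m[0].size();
    for (std::size_t c = 0; c < cols; ++c) {  // dimension 1: all columns
        std::vector<double> col(rows);
        for (std::size_t r = 0; r < rows; ++r) col[r] = m[r][c];
        col = incsum_1d(col, k1);
        for (std::size_t r = 0; r < rows; ++r) m[r][c] = col[r];
    }
}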

Subroutines incsum_prefix_along_row and incsum_suffix_along_row in Figures 3-5 and 3-6 (respectively) compute the ranged prefix and suffix sum that later combine to form the included sum along an arbitrary row of the tensor in higher dimensions.

incsum_prefix_along_row(퐴푝, 푖, 푗, x = (푥1, . . . , 푥푑), k = (푘1, . . . , 푘푑))
 1  ◁ Input: Tensor 퐴푝 (푑 dimensions, side lengths (푛1, . . . , 푛푑)), dimensions reduced up to 푖,
 2  ◁ dimensions to take points up to 푗 (푗 ≥ 푖), row defined by index x, and box-lengths k.
 3  ◁ Output: Modify 퐴푝 to do a ranged prefix sum along dimension 푗 + 1
 4  ◁ while fixing dimensions up to 푖.
 5
 6  ◁ Divide up the row into 푛푗+1/푘푗+1 pieces of size 푘푗+1 each
 7  parallel for 푙 ← 0 to 푛푗+1/푘푗+1 − 1
 8      ◁ 푘-wise prefix sum along the row
 9      for 푚 ← 1 to 푘푗+1 − 1
10          퐴푝[푛1, . . . , 푛푖, 푥푖+1, . . . , 푥푗, 푙푘푗+1 + 푚, 푥푗+2, . . . , 푥푑] +=
11              퐴푝[푛1, . . . , 푛푖, 푥푖+1, . . . , 푥푗, 푙푘푗+1 + 푚 − 1, 푥푗+2, . . . , 푥푑]

(In each index above, the first 푖 coordinates are 푛1, . . . , 푛푖, the next 푗 − 푖 are 푥푖+1, . . . , 푥푗, and the last 푑 − 푗 − 1 are 푥푗+2, . . . , 푥푑.)

Figure 3-5: Ranged prefix along row.

incsum_suffix_along_row(퐴푠, 푖, 푗, x = (푥1, . . . , 푥푑), k = (푘1, . . . , 푘푑))
 1  ◁ Input: Tensor 퐴푠 (푑 dimensions, side lengths (푛1, . . . , 푛푑)), dimensions reduced up to 푖,
 2  ◁ dimensions to take points up to 푗 (푗 ≥ 푖), row defined by index x, and box-lengths k.
 3  ◁ Output: Modify 퐴푠 to do a ranged suffix sum along dimension 푗 + 1
 4  ◁ while fixing dimensions up to 푖.
 5
 6  ◁ Divide up the row into 푛푗+1/푘푗+1 pieces of size 푘푗+1 each
 7  parallel for 푙 ← 0 to 푛푗+1/푘푗+1 − 1
 8      ◁ 푘-wise suffix sum along the row (coordinate grouping as in Figure 3-5)
 9      for 푚 ← 푘푗+1 − 2 downto 0
10          퐴푠[푛1, . . . , 푛푖, 푥푖+1, . . . , 푥푗, 푙푘푗+1 + 푚, 푥푗+2, . . . , 푥푑] +=
11              퐴푠[푛1, . . . , 푛푖, 푥푖+1, . . . , 푥푗, 푙푘푗+1 + 푚 + 1, 푥푗+2, . . . , 푥푑]

Figure 3-6: Ranged suffix along row.

The function incsum_result_along_row in Figure 3-7 combines the ranged prefix and suffix sums along an arbitrary dimension to compute the included sum in a row of the tensor.

incsum_result_along_row(퐴, 푖, 푗, x = (푥1, . . . , 푥푑), k = (푘1, . . . , 푘푑), 퐴푝, 퐴푠)
 1  ◁ Input: Tensor 퐴 (푑 dimensions, side lengths (푛1, . . . , 푛푑)) to write the output,
 2  ◁ dimensions reduced up to 푖, dimensions to take points up to 푗 (푗 ≥ 푖),
 3  ◁ row defined by index x, box-lengths k, ranged prefix and suffix tensors 퐴푝, 퐴푠.
 4  ◁ Output: Modify 퐴 with the included sum along the specified row in dimension 푗 + 1.
 5
 6  parallel for ℓ ← 0 to 푛푗+1 − 1
 7      ◁ If on a boundary, just assign the beginning of the 푘-wise suffix
 8      ◁ (equal to the sum of 푘 elements in this window)
 9      if ℓ mod 푘푗+1 == 0
10          퐴[푛1, . . . , 푛푖, 푥푖+1, . . . , 푥푗, ℓ, 푥푗+2, . . . , 푥푑] =
11              퐴푠[푛1, . . . , 푛푖, 푥푖+1, . . . , 푥푗, ℓ, 푥푗+2, . . . , 푥푑]
12      else
13          ◁ Otherwise, add in the relevant prefix and suffix
14          퐴[푛1, . . . , 푛푖, 푥푖+1, . . . , 푥푗, ℓ, 푥푗+2, . . . , 푥푑] =
15              퐴푠[푛1, . . . , 푛푖, 푥푖+1, . . . , 푥푗, ℓ, 푥푗+2, . . . , 푥푑]
16              + 퐴푝[푛1, . . . , 푛푖, 푥푖+1, . . . , 푥푗, ℓ + 푘푗+1 − 1, 푥푗+2, . . . , 푥푑]

Figure 3-7: Computes the included sum along a given row.

The function incsum_along_dim in Figure 3-8 computes the included sum for a specific dimension of the tensor along all rows. Finally, incsum in Figure 3-9 computes the full included sum along all dimensions.

incsum_along_dim(퐴, 푖, 푗, k = (푘1, . . . , 푘푑))
 1  ◁ Input: Tensor 퐴 (푑 dimensions, side lengths (푛1, . . . , 푛푑)) to write the output,
 2  ◁ dimensions reduced up to 푖, dimensions to take points up to 푗 (푗 ≥ 푖),
 3  ◁ box-lengths k.
 4  ◁ Output: Modify 퐴 with the included sum in dimension 푗.
 5
 6  ◁ Save into temporaries to not overwrite the input
 7  퐴푝 ← 퐴, 퐴푠 ← 퐴
 8  ◁ Iterate through coordinates by varying the coordinates in dimensions 푖 + 1, . . . , 푑
 9  ◁ (the slot marked _ is the row being summed) while fixing the first 푖 dimensions
10  parallel for {x = (푥1, . . . , 푥푑) ∈ (푛1, . . . , 푛푖, :, . . . , :, _, :, . . . , :)}
11      spawn incsum_prefix_along_row(퐴푝, 푖, 푗, x, k)
12      incsum_suffix_along_row(퐴푠, 푖, 푗, x, k)
13      sync
14      ◁ Calculate the included sum into the output
15      incsum_result_along_row(퐴, 푖, 푗, x, k, 퐴푝, 퐴푠)

Figure 3-8: Computes the included sum along a given dimension.

incsum(퐴, k = (푘1, . . . , 푘푑))
 1  ◁ Input: Tensor 퐴 (푑 dimensions), box-lengths k.
 2  ◁ Output: Modify 퐴 with the included sum in all dimensions.
 3
 4  for 푗 ← 1 to 푑
 5      ◁ The included sum along dimension 푗 overwrites the input
 6      incsum_along_dim(퐴, 0, 푗, k)
 7  ◁ The included sums in all dimensions are in 퐴.

Figure 3-9: Computes the included sum for the entire tensor.

Lemma 2 (Work of Included Sum) incsum_along_dim(퐴, 푖, 푗) has work

$$O\left(\prod_{\ell=i+1}^{d} n_\ell\right).$$

Proof. The loop over points in incsum_along_dim has $\left(\prod_{\ell=i+1}^{d} n_\ell\right) / n_{j+1}$ iterations over dimensions $i+1, \ldots, j-1, j, j+2, \ldots, d$. Each call to incsum_prefix_along_row, incsum_suffix_along_row, and incsum_result_along_row has work $O(n_{j+1})$, for total work $O\left(\prod_{\ell=i+1}^{d} n_\ell\right)$.

Corollary 3 Given a 푑-dimensional tensor with 푁 points, INCSUM has work 푂(푑푁).

Lemma 4 (Span of Included Sum) incsum_along_dim(퐴, 푖, 푗) has span

$$O(\log n_{j+1}) + O\left(\sum_{\ell=i+1}^{d} \log(n_\ell)\right).$$

Proof. The loop over points in incsum_along_dim has $\prod_{\ell=i+1, \ell \ne j+1}^{d} n_\ell$ iterations, which can be done in parallel, for span

$$O\left(\log\left(\prod_{\ell=i+1, \ell \ne j+1}^{d} n_\ell\right)\right) = O\left(\sum_{\ell=i+1, \ell \ne j+1}^{d} \log(n_\ell)\right).$$

The subroutines incsum_prefix_along_row and incsum_suffix_along_row are logically parallel and have equal span, so we will just analyze the ranged prefix. The outer loop of incsum_prefix_along_row can be parallelized, and the inner loop can be replaced with a log-span parallel prefix as we will discuss in Chapter 4. Therefore, the span of the ranged prefix for each row along dimension 푗 + 1 is

푂(log(푛푗+1/푘푗+1) + log 푘푗+1) = 푂(log 푛푗+1) .

Finally, calculating the contribution to the output for each of the outer loop iterations

39 can be parallelized, for span 푂(log 푛푗+1). Therefore, the total span is

$$O\left(\sum_{\ell=i+1, \ell \ne j+1}^{d} \log(n_\ell) + \log n_{j+1} + \log n_{j+1}\right) = O(\log n_{j+1}) + O\left(\sum_{\ell=i+1}^{d} \log(n_\ell)\right).$$

Corollary 5 Given a 푑-dimensional tensor with 푁 points, INCSUM has span

$$O\left(d \sum_{\ell=1}^{d} \log(n_\ell)\right).$$

Observation 1 (Included Sum Computation) After incsum_along_dim(퐴, 푖, 푗), let 퐴′ be the output and 퐴 be the input. For every valid index $(x_{i+1}, \ldots, x_d)$,

$$A'[\,n_1, \ldots, n_i, x_{i+1}, \ldots, x_d\,] = A[\,n_1, \ldots, n_i, x_{i+1}, \ldots, x_j, x_{j+1} : x_{j+1} + k_{j+1}, x_{j+2}, \ldots, x_d\,].$$

Lemma 6 Applying one-dimensional INCSUM along 푖 dimensions of a 푑-dimensional tensor solves the included sums problem up to 푖 dimensions.

Proof. By induction.

Base case: We have proved the one-dimensional case in Lemma 1.

Inductive Hypothesis: Let 퐵푖 be the result of 푖 iterations of the above INCSUM algorithm. That is, we do 퐼푁퐶푆푈푀 along dimensions 1, . . . , 푖. Suppose that ℬ푖 is the included sum in 푖 dimensions.

Inductive Step: We will show that 퐵푖+1 = 퐼푁퐶푆푈푀(퐵푖, 0, 푖 + 1, k) is the included sum of 푖 + 1 dimensions. By the induction hypothesis, each element of 퐵푖 is the included sum in 푖 dimensions. Consider any valid coordinate x = (푥1, . . . , 푥푑). The value in 퐵푖+1 at

40 coordinate x is

$$B_i[\mathbf{x}] = \mathcal{A}[\,\underbrace{x_1 : x_1 + k_1, \ldots, x_i : x_i + k_i}_{i},\, \underbrace{x_{i+1}, \ldots, x_d}_{d-i}\,]$$

$$\begin{aligned}
B_{i+1}[\mathbf{x}] ={}& \mathcal{A}\Big[x_1 : x_1 + k_1, \ldots, x_i : x_i + k_i,\, x_{i+1} : \Big\lceil \tfrac{x_{i+1}+1}{k_{i+1}} \Big\rceil \cdot k_{i+1},\, x_{i+2}, \ldots, x_d\Big] \\
&+ \mathcal{A}\Big[x_1 : x_1 + k_1, \ldots, x_i : x_i + k_i,\, \Big\lfloor \tfrac{x_{i+1}+k_{i+1}-1}{k_{i+1}} \Big\rfloor \cdot k_{i+1} : x_{i+1} + k_{i+1},\, x_{i+2}, \ldots, x_d\Big] \\
={}& \mathcal{A}[\,x_1 : x_1 + k_1, \ldots, x_{i+1} : x_{i+1} + k_{i+1},\, x_{i+2}, \ldots, x_d\,].
\end{aligned}$$

Corollary 7 Applying INCSUM along 푑 dimensions of a 푑-dimensional tensor solves the included sums problem.

3.4 Excluded Sums (DRES) Algorithm

The remainder of this chapter presents the dimension-reduction excluded-sums (DRES) algorithm. Before showing how to compute the excluded sum, we will formulate the points in the excluded sum in terms of the "box complement" and partition the region of interest into disjoint sets of points. For the remainder of this section, we will refer to an index domain 푈 and a box 퐵 cornered at $\mathbf{x} = (x_1, \ldots, x_d) \in U$ and $\mathbf{x}' = (x'_1, \ldots, x'_d) \in U$ where for all $i = 1, 2, \ldots, d$, $x_i < x'_i$.

We use k = (푘1, 푘2, . . . , 푘푑) to denote the size of the box in each dimension.

Box Complements and Excluded Sums

We introduce notation for taking the complement of a point (or range of points) over some (not necessarily all) dimensions.

Definition 6 (Box Complement) The 푖-complement of a box 퐵 is denoted $C_i(B) = \{\mathbf{y} = (y_1, \ldots, y_d) : \text{there exists } j \le i \text{ such that } y_j < x_j \text{ or } y_j \ge x'_j, \text{ and for all } m > i,\ x_m \le y_m < x'_m\}$.

41 For example, we can use the box complement 퐶푑(퐵) to denote the set of points outside

of a box 퐵. The sum of all points in the set 퐶푑(퐵) is exactly the excluded sum. We will now show how to recursively partition the box complement into disjoint sets of points, which we will later sum in the DRES algorithm.

Theorem 8 (Recursive Box Complement) Let $\mathbf{x} = (x_1, \ldots, x_d) \in U$ and $\mathbf{x}' = (x'_1, \ldots, x'_d) \in U$ be points such that for all $i = 1, 2, \ldots, d$, $x_i < x'_i$, and let box $B = (x_1 : x'_1, \ldots, x_d : x'_d)$. The 푖-complement of 퐵 can be expressed recursively with the (푖 − 1)-complement of 퐵 as follows:

$$C_i(B) = (\underbrace{:, \ldots, :}_{i-1},\ {:}\,x_i,\ \underbrace{x_{i+1} : x'_{i+1}, \ldots, x_d : x'_d}_{d-i}) \;\cup\; (\underbrace{:, \ldots, :}_{i-1},\ x'_i\,{:},\ \underbrace{x_{i+1} : x'_{i+1}, \ldots, x_d : x'_d}_{d-i}) \;\cup\; C_{i-1}(B).$$

Note that 퐶0(퐵) = ∅.

Proof. Let 퐴푖 be the right hand side of Equation 8. We will show that for any 푦 =

(푦1, 푦2, . . . , 푦푑) ∈ 푈, 푦 ∈ 퐴푖 if and only if 푦 ∈ 퐶푖(퐵).

First, we will show 푦 ∈ 퐴푖 implies 푦 ∈ 퐶푖(퐵) by case analysis of the three main terms in

퐴푖.

Case 1: $y \in (\underbrace{:, \ldots, :}_{i-1},\ {:}\,x_i,\ \underbrace{x_{i+1} : x'_{i+1}, \ldots, x_d : x'_d}_{d-i})$.

By Definition 6, $z \in C_i(B)$ if there exists some $j \le i$ where $z_j < x_j$ or $z_j \ge x'_j$, and for all $m > i$, $x_m \le z_m < x'_m$. Therefore, $y \in C_i(B)$ because there exists some $j \le i$ (in this case $j = i$) such that $y_j < x_j$, and for all $m > i$, $x_m \le y_m < x'_m$.

Case 2: $y \in (\underbrace{:, \ldots, :}_{i-1},\ x'_i\,{:},\ \underbrace{x_{i+1} : x'_{i+1}, \ldots, x_d : x'_d}_{d-i})$.

By Definition 6, $z \in C_i(B)$ if there exists some $j \le i$ where $z_j < x_j$ or $z_j \ge x'_j$, and for all $m > i$, $x_m \le z_m < x'_m$. Therefore, $y \in C_i(B)$ because there exists some $j \le i$ (in this case $j = i$) such that $y_j \ge x'_j$, and for all $m > i$, $x_m \le y_m < x'_m$.

Case 3: 푦 ∈ 퐶푖−1(퐵).
By Definition 6, 푦 ∈ 퐶푖−1(퐵) implies that there exists 푗 in the range 1 ≤ 푗 ≤ 푖 − 1 such that $y_j < x_j$ or $y_j \ge x'_j$, and that for all 푚 ≥ 푖, $x_m \le y_m < x'_m$. Therefore, 푦 ∈ 퐶푖−1(퐵) implies 푦 ∈ 퐶푖(퐵), since there exists some 푗 ≤ 푖 (in this case, 푗 < 푖) where $y_j < x_j$ or $y_j \ge x'_j$, and for all 푚 > 푖, $x_m \le y_m < x'_m$.

Next, we will show that 푦 ∈ 퐶푖(퐵) implies 푦 ∈ 퐴푖 by case analysis on 퐶푖(퐵). Let 푗 ≤ 푖 be the highest dimension at which 푦 is “out of range,” that is, where $y_j < x_j$ or $y_j \ge x'_j$ (note that there may be multiple values 푗 ≤ 푖 where 푦 is out of range). We define the three main terms of 퐴푖 as $A_{i,1} = (\underbrace{:, \ldots, :}_{i-1},\; : x_i,\; \underbrace{x_{i+1} : x'_{i+1}, \ldots, x_d : x'_d}_{d-i})$, $A_{i,2} = (\underbrace{:, \ldots, :}_{i-1},\; x'_i :,\; \underbrace{x_{i+1} : x'_{i+1}, \ldots, x_d : x'_d}_{d-i})$, and $A_{i,3} = C_{i-1}(B)$.

Case 3a: 푦 ∈ 퐶푖(퐵) and 푗 = 푖.
Since 푦 ∈ 퐶푖(퐵) and 푗 = 푖, either $y_i < x_i$ or $y_i \ge x'_i$, and for all 푚 > 푖, $x_m \le y_m < x'_m$. By construction, if $y_i < x_i$, then 푦 ∈ 퐴푖,1. Similarly, if $y_i \ge x'_i$, then 푦 ∈ 퐴푖,2.

Case 3b: 푦 ∈ 퐶푖(퐵) and 푗 < 푖.
By Definition 6 and the choice of 푗 as the highest out-of-range dimension, there exists 푗 ≤ 푖 − 1 such that $y_j < x_j$ or $y_j \ge x'_j$, and for all 푚 ≥ 푖, $x_m \le y_m < x'_m$. These are exactly the conditions for membership in 퐶푖−1(퐵). Therefore, 푦 ∈ 퐶푖(퐵) with 푗 < 푖 implies that 푦 ∈ 퐶푖−1(퐵) = 퐴푖,3.

We have shown that 푦 ∈ 퐶푖(퐵) if and only if 푦 ∈ 퐴푖, so 퐶푖(퐵) = 퐴푖. Now we will show how to use the box complement to find the set of points in an excluded sum.

Corollary 9 (Recursive Excluded Sum)

$$C_d(B) = (\underbrace{:, \ldots, :}_{d-1},\; : x_d) \,\cup\, (\underbrace{:, \ldots, :}_{d-1},\; x_d + k_d :) \,\cup\, C_{d-1}(B).$$

Corollary 10 (Recursive Excluded Sum Components) The excluded sum can be represented as the union of 2푑 disjoint sets of points.

$$C_d(B) = \bigcup_{i=1}^{d} \Big[ (\underbrace{:, \ldots, :}_{i-1},\; : x_i,\; \underbrace{x_{i+1} : x'_{i+1}, \ldots, x_d : x'_d}_{d-i}) \,\cup\, (\underbrace{:, \ldots, :}_{i-1},\; x_i + k_i :,\; \underbrace{x_{i+1} : x'_{i+1}, \ldots, x_d : x'_d}_{d-i}) \Big].$$

Figure 3-10 illustrates an example of the disjoint regions of the recursive box complement in two dimensions.

Figure 3-10: An example of decomposing the excluded sum into disjoint regions (labeled 1–4) in two dimensions. The red box denotes the points to exclude.
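To make the problem statement concrete, here is a brute-force two-dimensional reference implementation of the excluded sum (names and layout are our own, for illustration). It is useful as a correctness check for the decomposition above, but it performs far more than constant work per output, which is exactly the cost that the DRES algorithm's 푂(푑푁) bound avoids.

#include <stddef.h>

/* Brute-force 2-D excluded sum over a row-major n1-by-n2 array A:
 * for every corner (x, y) of a k1-by-k2 box, sum all elements outside it.
 * A reference implementation for testing only. */
void excluded_sum_2d_naive(const double *A, double *B,
                           size_t n1, size_t n2, size_t k1, size_t k2) {
    for (size_t x = 0; x < n1; x++) {
        for (size_t y = 0; y < n2; y++) {
            double s = 0.0;
            for (size_t i = 0; i < n1; i++) {
                for (size_t j = 0; j < n2; j++) {
                    int inside = i >= x && i < x + k1 && j >= y && j < y + k2;
                    if (!inside)
                        s += A[i * n2 + j];
                }
            }
            B[x * n2 + y] = s;
        }
    }
}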

Dimension-reduction Excluded Sums

Suppose that we want to find the excluded sum over all points in the index domain 푈. We denote the size of 푈 by 푁 = 푛1 · 푛2 · . . . · 푛푑. We will now show how to use INCSUM as a subroutine to find the excluded sums in 푂(푑푁) operations and 푂(푁) space. At a high level, our DRES algorithm proceeds by dimension reduction. At each step 푖 of the reduction, we add two of the components from Corollary 10 to the resulting tensor: we construct a tensor of prefix and suffix sums along 푖 dimensions, and do INCSUM along the remaining 푑 − 푖 dimensions (if any remain). Figure 3-11 presents an example of the dimension reduction in two dimensions. We will begin with the important subroutines in the excluded sums algorithm. First, we define prefix/suffix and included sums along a particular dimension, which are key subroutines in our excluded sums algorithm. We have already presented the included sum subroutine in Section 3.3. Finally, we specify the excluded sums algorithm. Along the way, we will show that the DRES algorithm runs in 푂(푑푁) time and computes the excluded sum.

Prefix Sums

We first specify a subroutine for doing prefix sums along an arbitrary dimension and analyze its work and span. The suffix sum is the same but with a suffix rather than a prefix sum along each row, so we omit its pseudocode and analysis.

Figure 3-11: Steps for computing the excluded sum in two dimensions with included sums on prefix and suffix sums: (1) DRES in the first dimension, with a prefix along each row and INCSUM along each column; (2) DRES in the first dimension, with a suffix along each row and INCSUM along each column; (3) DRES in the second dimension with a prefix; (4) DRES in the second dimension with a suffix.

Lemma 11 (Work of prefix sum) prefix_along_dim(풜, 푖) has work $O\!\left(\prod_{j=i+1}^{d} n_j\right)$.

Proof. The outer loop over dimensions 푖 + 2, . . . , 푑 has $O\!\left(\max\!\left(1, \prod_{j=i+2}^{d} n_j\right)\right)$ iterations, each with $O(n_{i+1})$ work for the inner prefix sum. Therefore, the total work is $O\!\left(\prod_{j=i+1}^{d} n_j\right)$.

Lemma 12 (Span of prefix sum) prefix_along_dim(풜, 푖) has span $O\!\left(\sum_{j=i+1}^{d} \log n_j\right)$.

prefix_along_dim(풜, 푖)

1  Input: tensor 풜 (푑 dimensions, side lengths (푛1, . . . , 푛푑)),
2         dimension index 푖 to do the prefix sum along.
3  Output: modify 풜 to hold the prefix sum along dimension 푖 + 1,
4         fixing dimensions up to 푖.
5  ◁ Iterate through coordinates by varying the coordinates in dimensions 푖 + 2, . . . , 푑
6  ◁ while fixing the first 푖 dimensions.
7  ◁ The blank means dimension 푖 + 1 is not iterated over in the outer loop.
8  parallel for {(푥1, . . . , 푥푑) ∈ (푛1, . . . , 푛푖, _, :, . . . , :)}  ◁ 푖 fixed dimensions, then 푑 − 푖 − 1 iterated dimensions
9      ◁ Prefix sum along the row (can be replaced with a parallel prefix)
10     for ℓ ← 1 to 푛푖+1
11         퐴[푛1, . . . , 푛푖, ℓ, 푥푖+2, . . . , 푥푑] += 퐴[푛1, . . . , 푛푖, ℓ − 1, 푥푖+2, . . . , 푥푑]

Figure 3-12: Prefix sum along a row.

Proof. prefix_along_dim fixes the first 푖 dimensions (1 ≤ 푖 < 푑) and does the prefix sum along dimension 푖 + 1 for each of the rows spanned by the remaining 푑 − 푖 − 1 dimensions. The span of parallelizing over that many rows is $O\!\left(\max\!\left(1, \log \prod_{j=i+2}^{d} n_j\right)\right) = O\!\left(\max\!\left(1, \sum_{j=i+2}^{d} \log n_j\right)\right)$. As we will see in Chapter 4, the span of each prefix sum is $O(\log n_{i+1})$, so the total span is $O\!\left(\sum_{j=i+1}^{d} \log n_j\right)$.

Observation 2 (Prefix sum computation) After prefix_along_dim(풜, 푖), for all 푗 = 0, 1, . . . , 푛푖+1 − 1,

$$A[n_1, \ldots, n_i,\; j,\; x_{i+2}, \ldots, x_d] = A[n_1, \ldots, n_i,\; : j + 1,\; x_{i+2}, \ldots, x_d].$$
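For concreteness, here is a minimal serial C sketch of this subroutine for a row-major tensor (our own illustrative code, not the thesis's implementation; the loop over rows would be a parallel for in the Cilk model).

#include <stddef.h>

/* Prefix sum along dimension `dim` (0-indexed) of a row-major d-dimensional
 * tensor A with side lengths n[0..d-1]. */
void prefix_along_dim(double *A, const size_t *n, size_t d, size_t dim) {
    /* stride between consecutive elements along `dim` in row-major layout */
    size_t stride = 1;
    for (size_t j = dim + 1; j < d; j++) stride *= n[j];
    size_t total = 1;
    for (size_t j = 0; j < d; j++) total *= n[j];
    size_t run = stride * n[dim]; /* extent of one full "row" along `dim` */
    /* iterate over every row along `dim` (parallelizable across rows) */
    for (size_t base = 0; base < total; base += run)
        for (size_t off = 0; off < stride; off++)
            for (size_t l = 1; l < n[dim]; l++)
                A[base + off + l * stride] += A[base + off + (l - 1) * stride];
}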

Add Contribution

Next, we will specify how to add the contribution from each dimension reduction step with a pass through the tensor.

Lemma 13 (Work of Adding Contribution) add_contribution(풜, ℬ, i, offset) has work 푂(푁).

add_contribution(풜, ℬ, 푖, offset)

1  Input: input tensor 풜, output tensor ℬ, fixing dimensions up to 푖.
2  Output: for all coordinates in ℬ, add the relevant contribution from 풜.
3  for {(푥1, . . . , 푥푑) ∈ (:, . . . , :)}
4      if 푥푖+1 + offset ≤ 푛푖+1
5          ℬ[푥1, . . . , 푥푑] += 풜[푛1, . . . , 푛푖, 푥푖+1 + offset, 푥푖+2, . . . , 푥푑]

Figure 3-13: Adding in the contribution.

Lemma 14 (Span of Adding Contribution) add_contribution(풜, ℬ, 푖, offset) has span $O\!\left(\sum_{j=1}^{d} \log n_j\right)$.

Excluded Sum

Theorem 15 (Work of excluded sum) Excluded-Sum(풜, ℬ, k) has work 푂(푑푁) if 푛푖 ≥ 2 for all 푖 = 1, 2, . . . , 푑.

Proof. We will analyze the prefix step (the suffix step is symmetric and logically parallel, so the work and span are asymptotically the same). Each dimension-reduction step parametrized by 푖 takes 1 prefix step and 푑 − 푖 − 1 included-sum calls, each of which has $O\!\left(\prod_{j=i+1}^{d} n_j\right)$ work. Furthermore, there is an additional 푂(푁) to add in the contribution. The total work over 푑 steps is therefore

$$\sum_{i=0}^{d-1} \left( (d-i) \prod_{j=i+1}^{d} n_j + N \right).$$

Adding in the contribution is clearly 푂(푑푁) in total, so we focus on bounding the work of the actual computation, which is $\sum_{i=0}^{d-1} (d-i) \prod_{j=i+1}^{d} n_j$. Since every $n_j \ge 2$, we have $\prod_{j=1}^{i} n_j \ge 2^i$, and therefore $\prod_{j=i+1}^{d} n_j \le N/2^i$.

Excluded_Sum(풜, ℬ, k)

1  Input: tensor 풜 of 푑 dimensions and side lengths (푛1, . . . , 푛푑),
2         output tensor ℬ, side lengths of the excluded box k = (푘1, . . . , 푘푑), 푘푖 ≤ 푛푖 for all 푖 = 1, 2, . . . , 푑.
3  Output: tensor ℬ with size and dimensions matching 풜, containing the excluded sum.
4         For all coordinates x = (푥1, . . . , 푥푑), let 퐵x,k be the box cornered at coordinate x
5         with side lengths k; ℬ[x] is the sum of 풜 over 퐶푑(퐵x,k).
6  풜푝 ← 풜, 풜푠 ← 풜  ◁ Prefix and suffix temporaries
7  for 푖 ← 0 to 푑 − 1
8      ◁ PREFIX STEP
9      ◁ 풜푝 should hold prefixes up to dimension 푖 − 1.
10     풜 ← 풜푝
11     ◁ Do the prefix sum along dimension 푖 + 1, assuming all previous dimensions have been done
12     prefix_along_dim(풜, 푖)
13     ◁ Save the prefix up to dimension 푖 in 풜푝
14     풜푝 ← 풜
15     for 푗 ← 푖 + 2 to 푑
16         ◁ Do the included sum on dimensions [푖 + 2, 푑]
17         incsum_along_dim(풜, 푖, 푗, 푘푗)
18     ◁ Add into the result
19     add_contribution(풜, ℬ, 푖, 0)
20
21     ◁ SUFFIX STEP
22     ◁ 풜푠 should hold suffixes up to dimension 푖 − 1.
23     풜 ← 풜푠
24     ◁ Do the suffix sum along dimension 푖 + 1, assuming all previous dimensions have been done
25     suffix_along_dim(풜, 푖)
26     ◁ Save the suffix up to dimension 푖 in 풜푠
27     풜푠 ← 풜
28     for 푗 ← 푖 + 2 to 푑
29         ◁ Do the included sum on dimensions [푖 + 2, 푑]
30         incsum_along_dim(풜, 푖, 푗, 푘푗)
31     ◁ Add into the result
32     add_contribution(풜, ℬ, 푖, 푘푖)

Figure 3-14: The excluded-sums algorithm.

To bound the work of the computation,

$$\sum_{i=0}^{d-1} (d-i) \prod_{j=i+1}^{d} n_j \;\le\; \sum_{i=0}^{d-1} (d-i)\,\frac{N}{2^i} \;=\; 2\left(d + 2^{-d} - 1\right)N \;=\; O(dN).$$

Therefore, the work of the excluded sum is 푂(푑푁).

Theorem 16 (Span of Excluded Sum) Excluded-Sum(풜, ℬ, k) has span $O(d^2 S_c)$, where $S_c = O\!\left(\sum_{j=1}^{d} \log n_j\right)$ (the span of adding in the contribution).

Proof. The span of each excluded-sum step 푖 is the sum of the spans of the prefix sum along dimension 푖, of the 푑 − 푖 − 1 included-sum steps with dimensions up to 푖 fixed, and of adding into the contribution. Let $S_c = O\!\left(\sum_{j=1}^{d} \log n_j\right)$ be the span of adding in the contribution. That is, the span is

$$\sum_{i=0}^{d-1} \Bigg[ \underbrace{\sum_{j=i+1}^{d} \log n_j}_{\text{prefix}} + \underbrace{\sum_{j=i+2}^{d} \left( \log n_j + \sum_{\ell=i+1}^{d} \log n_\ell \right)}_{\text{included sum}} + \underbrace{S_c}_{\text{add contrib}} \Bigg].$$

At each dimension reduction step, the span of the prefix sum is bounded above by 푆푐. Furthermore, the span of the included sum is bounded above by

$$\sum_{j=i+2}^{d} \left( \log n_j + \sum_{\ell=i+1}^{d} \log n_\ell \right) = \sum_{j=i+2}^{d} \log n_j + \sum_{j=i+2}^{d} \sum_{\ell=i+1}^{d} \log n_\ell \;\le\; S_c + (d-i-1)S_c \;=\; O((d-i)S_c).$$

Substituting back into the expression for the total span of the excluded sum,

$$O\!\left( \sum_{i=0}^{d-1} \big( S_c + (d-i)S_c + S_c \big) \right) = O\!\left( \sum_{i=0}^{d-1} (d-i+2)\,S_c \right) = O(d^2 S_c).$$

Correctness

Theorem 17 (Correctness) Excluded-Sum(풜, ℬ, k) computes the excluded sum at each coordinate.

Proof. After each dimension-reduction step 푖, by construction, the prefix-sum step gives

$$A[x_1, \ldots, x_d] = A[\underbrace{:, \ldots, :}_{i},\; : x_{i+1},\; \underbrace{x_{i+2} : x_{i+2} + k_{i+2}, \ldots, x_d : x_d + k_d}_{d-i-1}].$$

Similarly, the suffix-sum step gives

$$A[x_1, \ldots, x_d] = A[\underbrace{:, \ldots, :}_{i},\; x_{i+1} + k_{i+1} :,\; \underbrace{x_{i+2} : x_{i+2} + k_{i+2}, \ldots, x_d : x_d + k_d}_{d-i-1}].$$

These two are the exact regions defined by Corollary 10.

3.5 Applications

In this section, we briefly discuss two potential applications of included and excluded sums and their connection with prefix sums.

Integral Image

The integral image, also known as a summed-area table, is a two-dimensional table generated from an input image. Oftentimes we care about images that have floating-point values, i.e., a mapping from pixels to real numbers. Each entry in the table stores the cumulative sum of all pixels to the left of and above it, including itself. This enables the rapid calculation of summations over image subregions: a subregion summation can be computed in constant time as a linear combination of only four entries of the integral image, regardless of the size of the subregion. The use of integral images was popularized by the Viola-Jones algorithm [35], and they can also be used for image thresholding [8]. If we instead formulate this problem as a problem of included sums, there is one constraint: the box size has to be fixed as part of the input to the problem. However, we note that with the included sums algorithm, if we choose a box in two dimensions with side lengths k = (푘1, 푘2), then any box of those side lengths can be computed in constant time. Additionally, any box with side lengths greater than that can be computed in time 푂(푐), where 푐 is a constant such that the longest side of the query box is less than 푐푘1 and 푐푘2, by using the resulting output of the included sum algorithm. Although this formulation does have its drawbacks, it provides one key advantage: it does not use subtraction. For large images, the traditional integral-image approach requires subtracting other subregions to compute the result, which can often cause large catastrophic cancellations in floating point.
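For contrast, here is a minimal C sketch of the classical summed-area table and its four-corner query (our own illustrative code; names and layout are assumptions). The subtractions in box_sum are exactly the source of the cancellation discussed above, which the included-sums formulation avoids.

/* Build a summed-area table: S[y][x] = sum of I over rows 0..y, cols 0..x.
 * Both arrays are flattened row-major with dimensions h-by-w. */
void build_sat(const float *I, double *S, int h, int w) {
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            S[y * w + x] = I[y * w + x]
                         + (x > 0 ? S[y * w + x - 1] : 0.0)
                         + (y > 0 ? S[(y - 1) * w + x] : 0.0)
                         - (x > 0 && y > 0 ? S[(y - 1) * w + x - 1] : 0.0);
}

/* Sum over the inclusive box [y0..y1] x [x0..x1] from four table lookups.
 * The subtractions here can cancel catastrophically for large images. */
double box_sum(const double *S, int w, int y0, int x0, int y1, int x1) {
    double s = S[y1 * w + x1];
    if (x0 > 0)           s -= S[y1 * w + x0 - 1];
    if (y0 > 0)           s -= S[(y0 - 1) * w + x1];
    if (x0 > 0 && y0 > 0) s += S[(y0 - 1) * w + x0 - 1];
    return s;
}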

Fast Multipole Method

The fast multipole method (FMM) is an algorithm designed to speed up the calculation of long-ranged forces in the 푛-body problem [17] [4]. The basic problem is to compute the mutual interaction of a large number of particles. The essence of multipole methods is the approximation of the net effect of a large number of individual interactions with distant particles as a single interaction. Demaine et al. [14] claim that, barring the complexities of the fast multipole method, its inner core is a computation of excluded sums:

$$\sum_{j \ne i} x_j \qquad \text{and} \qquad \sum_{|j-i| > 1} x_j.$$

More precisely, we are adding finite-precision floating point numbers $x_j$ and excluding a neighborhood of $x_i$; but in the FMM, the $x_j$ become representations of functions that are accurate only at some distance from point 푖. This key insight recognizes the core problem of excluded sums as we have defined it in this chapter: excluding a specified region in a multidimensional tensor and summing all the elements outside of that region, for many different positions. The goal is a stable computation of all 푁 excluded sums with 푂(1) work per sum.

Both of these applications require included and excluded sums in some form. We have seen from the specification of these algorithms that they rely heavily on computing prefix and suffix sums along an entire dimension of a tensor. Since the suffix sum is effectively the reverse of the prefix sum, most of the techniques that we explore for prefix sums also apply to suffix sums. Hence, given the accuracy and performance considerations, we now turn our attention to optimizing the core subroutine: the prefix sum.

Chapter 4

Prefix Sums

The prefix sum (also known as scan [5]) is one of the most fundamental building blocks of parallel algorithms. Its applications include implementing stream compaction (deleting marked elements from an array) in parallel, comparing strings in lexicographical order, and solving recurrence relations. As Blelloch [5] notes, in addition to being a useful building block, the prefix sum operation is a “good example of a computation that seems inherently sequential, but for which there is an efficient parallel algorithm”. As we have seen in Chapter 3, prefix sums (in addition to suffix sums) are also heavily used subroutines in included and excluded sums. Therefore, it is important to understand optimizations to the prefix sum operation, both in terms of performance and numerical accuracy.

Outline. In this chapter, we first conduct a review of previous work on parallel prefix algorithms. We also review how floating-point accuracy depends on the order in which we sum the numbers of a prefix sum. We then discuss opportunities for optimization, both in terms of performance and accuracy. After presenting our findings in the form of a new algorithm, we evaluate its performance and accuracy against a variety of benchmarks.

We have defined the All-prefix-sums operation (and suffix sums) formally in Chapter 1. As a quick review, the prefix sums operation using addition (which we will hereafter refer to as prefix sum, or scan, unless otherwise noted) takes as input an array 퐴 of 푛 elements and outputs an array 푃푆(퐴) (either in place or in another array) such that, for all 푖 = 0, 1, . . . , 푛 − 1,

$$PS(A)[i] = \sum_{j=0}^{i} A[j].$$

In this chapter, we focus only on addition, but as we have seen before, the definition of a prefix sum takes any binary associative operator. Hence, many, but not all, of the following algorithms can be applied to different operators such as multiplication or maximum. The exclusive scan is the inclusive scan shifted right by one element, losing the last element, with the identity (0 for addition) at the first index. From here on, unless explicitly noted otherwise, scan or prefix sum refers to the inclusive version. Additionally, we are interested in prefix sums (or scans) that are in place, or that use an amount of space sublinear in the input size. In-place algorithms not only reduce memory usage, but can also improve locality and performance by reducing data movement. These factors are important when we operate on large datasets. The naive serial implementation of the all-prefix-sums (scan) operation is trivial and sequential: we simply keep a running total of the sum of the elements up to and including a particular index in the array, updating the total as we loop through the array and writing the updated total to the output at each index as we go along. Pseudocode for this operation is as follows (where 퐴 is the array of elements, and 푛 is the number of elements in the array):

Sequential-Prefix-Sum(퐴, 푛)

1  푠 ← 0  ◁ 0 is the identity for our operator (+)
2  for 푖 ← 0 to 푛 − 1
3      푠 ← 푠 + 퐴[푖]
4      퐴[푖] ← 푠

When designing parallel algorithms, we would like them to be work-efficient. This means they do no more additions or work than the sequential version, which is 푂(푛) and in-place.

4.1 Previous Work

Hillis and Steele [22] present a data-parallel algorithm, which has been demonstrated on GPUs by Horn [23]. The algorithm assumes that there are as many processors as data elements, and it is the most highly parallel (shortest-span) parallel prefix algorithm. Pseudocode for the algorithm is below (as well as a diagram):

Hillis-Prefix-Sum(퐴, 푛)

1  for 푖 ← 0 to ⌈log2 푛⌉ − 1
2      parallel for 푗 ← 0 to 푛 − 1
3          if 푗 ≥ 2^푖
4              퐴[푗] ← 퐴[푗 − 2^푖] + 퐴[푗]
5          else
6              퐴[푗] ← 퐴[푗]

Figure 4-1: Hillis and Steele’s data-parallel prefix sum algorithm on an 8-input array using index notation. The individual boxes represent individual elements of the array, and the notation 푗 : 푘 contains the elements at index 푗 to index 푘, inclusive, combined with the operator (here assumed to be addition), denoted by ⊕. The lines from two boxes mean those two elements are added where they meet the operator. The arrow corresponds to propagating the old element at that index (no operation).
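As a reference point, the following is a serial C port of this pseudocode (our own sketch, not the thesis's benchmark code). The pseudocode's in-place parallel update is only safe in a synchronous, lockstep model; the usual fix on real hardware, used below, is to double-buffer each round.

#include <stdlib.h>
#include <string.h>

/* Hillis-Steele scan: O(n log n) additions, but only ceil(log2 n) rounds
 * when the inner loop runs in parallel. Double-buffered to avoid races. */
void hillis_steele_scan(float *a, size_t n) {
    float *tmp = malloc(n * sizeof(float));
    for (size_t stride = 1; stride < n; stride *= 2) {
        for (size_t j = 0; j < n; j++)      /* parallel for in the model */
            tmp[j] = (j >= stride) ? a[j - stride] + a[j] : a[j];
        memcpy(a, tmp, n * sizeof(float));  /* publish this round's values */
    }
    free(tmp);
}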

As Harris [18] shows, the algorithm performs $\sum_{d=0}^{\lceil \log_2 n \rceil - 1} (n - 2^d) = O(n \log n)$ addition operations, which, compared to the serial implementation's 푂(푛) work, makes it clearly not work-efficient. A more work-efficient algorithm is described by Blelloch [5], which is the same as Ladner and Fischer's algorithm [25], barring a slight difference in the ordering of computation. Both maintain 푂(푛) work and $O(\log^2 n)$ span under the Cilk model of computation, as explained in Chapter 2. Both use a balanced binary tree as an algorithmic pattern that gives rise to parallelism. A binary tree with 푛 leaves has log 푛 depth, and if we perform one add per node in the tree then we still have 푂(푛) total adds.

A tree data structure is not actually built; instead, the tree is just used to determine what each thread does at each step of the traversal. In this algorithm, the operations are performed in place on an array in shared memory. Pseudocode for an iterative variant of the algorithm, inspired by [25] and [5], is below. For simplicity, this algorithm only works for arrays whose sizes are powers of 2, but it can be altered to work for inputs of arbitrary size (see [25]). The algorithm is sometimes instead formulated in a recursive, divide-and-conquer form.

Upsweep(퐴, 푛)

1  for 푑 ← 0 to log2 푛 − 1
2      parallel for 푖 ← 0 to 푛 − 1 by 2^(푑+1)
3          퐴[푖 + 2^(푑+1) − 1] ← 퐴[푖 + 2^푑 − 1] + 퐴[푖 + 2^(푑+1) − 1]

Downsweep(퐴, 푛)

1  for 푑 ← log2 푛 − 2 downto 0
2      parallel for 푖 ← 2^(푑+1) − 1 to 푛 − 2 by 2^(푑+1)
3          퐴[푖 + 2^푑] ← 퐴[푖] + 퐴[푖 + 2^푑]

Work-Efficient Parallel Prefix(퐴, 푛)

1  Upsweep(퐴, 푛)
2  Downsweep(퐴, 푛)

In the upsweep phase, we traverse the tree from the leaves to the root, computing the partial sums at internal nodes of the tree, as Figure 4-2 shows. After this phase, the root node (i.e., the last element in the array) holds the sum of all the elements in the array. In the downsweep phase, the tree is traversed from the root back down, building the scan in place from the partial sums computed in the upsweep phase, as Figure 4-2 also shows. After a complete downsweep, each vertex of the tree contains the sum of all the leaf values that precede it [5]. In combination, this algorithm performs 푂(푛) operations in the serial case and, under the Cilk model of computation, has a span of $O(\log^2 n)$ because of the span required to control the parallel loop.

Figure 4-2: The balanced binary tree-based algorithm described by Blelloch [5] and [25], consisting of the upsweep (first four rows) followed by the downsweep (last three rows). The individual boxes represent individual elements of the array, and the notation 푖 : 푗 denotes the elements at index 푖 to index 푗 added together. The lines coming from two boxes mean those two elements are added where the line meets the new box. The bolded-outline boxes indicate when that element of the array in shared memory is “finished,” i.e., it has the correct prefix sum for that index. The arrows from Figure 4-1 are omitted, but they have the same effect.

However, despite the theoretical improvement and work efficiency, this algorithm does not perform well in practice, as we saw in Figure 1-4 (Chapter 1), due to its memory access patterns when 푛 is much larger than the number of processors and its lack of coarsening¹. Caching conflicts with high contention (for example, false sharing) occur when multiple processors access elements of the array that are close together, and they slow the algorithm down severely. A similar phenomenon occurs with shared memory banks when this algorithm is implemented on a GPU in CUDA [18]. The best-performing prefix sum implementation on a GPU devises a solution to this problem in [18].

¹By coarsening, we are referring to the technique of switching from a parallel to a sequential algorithm when the problem size falls below a certain threshold, so as to reduce the impact of excess parallelism and overhead on work efficiency. This is sometimes also known as granularity control.

Lastly, to the best of our knowledge, the best-performing implementation of prefix sum on a CPU is in the Problem Based Benchmark Library (PBBSLIB) [31]. The algorithm (visually summarized in Figure 4-3) begins by dividing the input into blocks of size 퐵 = 1024. In parallel, on individual blocks, a reduce operation (defined in Chapter 2) is performed to get the total sum of each block, which is written to a temporary array. A serial exclusive scan is performed on the temporary array of block sums, of size ⌈푛/퐵⌉. The results of this exclusive scan are then used as offsets for prefix sums in each of the blocks in parallel. This algorithm is work-efficient: for each element of the input, it performs two reads and one write (disregarding the temporary array, which is much smaller than the input): one read for the partial sums in the first phase, and one read and one write for the prefix sums in the last phase. It also performs well in practice for very large arrays that do not fit in cache at all, where it becomes bottlenecked by the memory bandwidth (which is the optimal behavior for large arrays that do not fit in cache). PBBSLIB's prefix sum algorithm has 푂(푛) work and $O\!\left(B + \log\frac{n}{B} + \frac{n}{B}\right) = O(n/B)$ span.

Figure 4-3: The algorithm of PBBSLIB (scan_inplace) [31]. It first performs a parallel reduce operation on each block into a temporary array. After running an exclusive scan on that temporary array, we then run in-place prefix sums, using the value from the temporary array as the offset for each block of size 퐵, in parallel. The symbols are described by the key on the right.
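To make this structure concrete, here is a minimal serial C sketch of the blocked scan pattern (our own illustrative code, not PBBSLIB's; the first and third loops over blocks would be parallel for loops).

#include <stddef.h>
#include <stdlib.h>

#define BLOCK 1024

/* Blocked in-place inclusive scan in the style of scan_inplace. */
void blocked_scan(float *a, size_t n) {
    size_t nblocks = (n + BLOCK - 1) / BLOCK;
    float *sums = malloc(nblocks * sizeof(float));
    /* Phase 1: reduce each block to its total (parallel over blocks). */
    for (size_t b = 0; b < nblocks; b++) {
        size_t end = ((b + 1) * BLOCK < n) ? (b + 1) * BLOCK : n;
        float s = 0.0f;
        for (size_t i = b * BLOCK; i < end; i++) s += a[i];
        sums[b] = s;
    }
    /* Phase 2: serial exclusive scan over the block totals. */
    float acc = 0.0f;
    for (size_t b = 0; b < nblocks; b++) {
        float t = sums[b];
        sums[b] = acc;
        acc += t;
    }
    /* Phase 3: per-block inclusive scan seeded with the block's offset
     * (parallel over blocks). */
    for (size_t b = 0; b < nblocks; b++) {
        size_t end = ((b + 1) * BLOCK < n) ? (b + 1) * BLOCK : n;
        float run = sums[b];
        for (size_t i = b * BLOCK; i < end; i++) {
            run += a[i];
            a[i] = run;
        }
    }
    free(sums);
}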

There are also other algorithms [37] [30] [11] that improve upon the naive parallelization but are worse in practice than the PBBSLIB algorithm, especially for large inputs. The C++ STL implementation, to the best of our knowledge, performs a coarsened version of an algorithm similar to Work-Efficient Parallel Prefix. We will also compare against state-of-the-art implementations on an NVIDIA GPU [18] [29] at the end of this chapter, on a cost-efficiency metric.

4.2 Vectorization

Our first optimization is the vectorization of the prefix sum algorithm: Prefix-Sum-Vec. Noting that the sequential algorithm performs relatively well due to prefetching and its cache behavior, we try to extend its beneficial properties to a vectorized algorithm. We can apply Single Instruction, Multiple Data (SIMD), or vectorization, techniques to exploit data-level parallelism. In particular, SIMD approaches are well suited to Hillis and Steele's algorithm [22], since it has high data-level parallelism. However, due to the backwards dependencies of the sequential prefix sum algorithm, the compiler is unable to vectorize it effectively to utilize the high data-level parallelism inherent in a prefix sum (refer to Appendix A to see the generated assembly). If we take the sequential prefix sum algorithm and instead process the array in chunks of a vector width 푉 at a time (e.g., for Intel AVX2 this is a 256-bit width, but it can be any of Intel's supported vector widths: 128-bit SSE4, 256-bit AVX2, or 512-bit AVX512), we can perform a vectorized version of Hillis and Steele's algorithm within the vector to maximize data-level parallelism, and then propagate the last element of the vector to the next vector chunk as an offset. We can implement the algorithm on a vector using Intel intrinsics, which are C-style functions that provide access to Intel vector instructions without the need to write assembly code (and which are in general necessary when the compiler cannot auto-vectorize). An illustration of the algorithm is given in Figure 4-4. The code for an SSE (128-bit) implementation of Prefix-Sum-Vec for type float (32 bits wide in C/C++) is given in Appendix A. In this example, for 4 consecutive elements, Prefix-Sum-Vec performs 3 vector adds, 2 vector bit shifts, 1 vector shuffle, 1 vector load, and 1 vector store. Comparatively, in Sequential-Prefix-Sum, for the same 4 elements we perform 4 adds, 4 loads, and 4 stores. Comparing the throughputs of these operations from [32], the vectorized version has higher total throughput, since vector operations are generally just as cheap.

Figure 4-4: An illustration of Prefix-Sum-Vec for a 4-element section of input (1, 2, 3, 4), with desired output (1, 3, 6, 10). The offset is kept track of while iterating over the entire array in a forward pass in blocks of 4 elements. Here, 푉 = 4 is our vector width (4 floats, or 128 bits), but it could be a 256-bit vector, for example. The central idea is that it performs Hillis' algorithm exclusively through vector operations (cheap vector adds, shifts, shuffles, etc.), maximizing data-level parallelism. The subscripted arrow means a shift by that number.

The algorithm proceeds by processing the array in blocks of 4 at a time, propagating the offset from the last element of each vector to the next vector chunk. If the array size is not a multiple of 4, we can either pad with zeroes or perform the last ≤ 3 adds serially.

The asymptotic work and span of this algorithm are 푂(푛), but if we parameterize by the vector width 푉, it performs $O\!\left(\frac{n}{V} \log V\right)$ vector operations. While the asymptotic work is the same as the serial version's 푂(푛), this optimization alone speeds up the algorithm compared to Sequential-Prefix-Sum by up to 2× consistently over all input sizes, as we will see in Section 4.6 of this chapter. So despite its lesser portability (requiring a different implementation for each type), its performance in practice is not to be overlooked, especially considering that the algorithm is entirely serial and only needs a single core. Its serial nature makes it very effective on a performance-per-cost ratio, as discussed in Section 4.7.

4.3 Accurate Floating Point Summation

Recall from Figure 1-4 that despite its poor performance, Work-Efficient Parallel Prefix achieves comparatively good accuracy. We now take a more detailed look at accuracy considerations for the prefix sum algorithms we have shown so far. Our goal is to understand how the ordering of a summation affects the worst-case error bounds for the result. We first have to define the error metric we are using to evaluate our prefix sum algorithms. In a prefix sum, our task is to evaluate sums of the form $S_k = \sum_{i=0}^{k} x_i$ for all values of 푘 = 0, . . . , 푛 − 1, where $x_0, \ldots, x_{n-1}$ are real numbers. In general, as Higham notes in [20], each different ordering of the $x_i$ will give a different computed sum $\hat{S}_k$ in floating point arithmetic. We are interested in determining how this ordering affects the error

$$E_k = \hat{S}_k - S_k. \tag{4.1}$$

One metric we can use is the relative error compared to a reference point at every index of the output array. For the experiments at the end of this chapter, we use the root mean square relative error compared to a much higher precision reference point over all indices. For a prefix sum on an array of 푛 elements, this is defined as:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{k=0}^{n-1} E_k^2}{n}} \tag{4.2}$$

where $\hat{S}_k$ is the computed value of the prefix sum in floating point arithmetic at index 푘, and $S_k$ is the reference point (which can be real-valued in theory, or a much higher-precision floating point value used as a reference). However, sometimes other metrics make sense, such as the maximum error across all indices:

$$\text{Max Error} = \max_{k \in \{0, \ldots, n-1\}} |E_k|. \tag{4.3}$$

This maximum error metric is useful for getting the worst-case error bound of a prefix sum. If we can bound the worst case of each error $|E_k|$, then the overall worst-case error bound of the prefix sum is the maximum of those individual bounds. Higham [20] proves worst-case bounds on the summation of 푛 floating-point numbers. We recreate the core lemmas of his work, and then see how they apply to different prefix sum algorithms.

Recall the accuracy model for floating point arithmetic, as we defined in Chapter 2:

$$fl(x \;\text{op}\; y) = (x \;\text{op}\; y)(1 + \delta), \quad |\delta| \le u, \quad \text{op} = +, -, *, / \tag{4.4}$$

where 훿 is a small relative error associated with the floating point representation of the calculation after correct rounding, and 푢 is the machine precision, satisfying 푛푢 ≤ 1.

We first consider sequential summation for computing $S_k$, which is analogous to Sequential-Prefix-Sum at each output index 푘, for all 푛 indices in the array. In other words, sequential summation adds together 푛 numbers simply by keeping a running total while iterating through the array from index 0 to 푛 − 1.

Lemma 18 When using sequential summation to compute $S_k$, it holds that

$$|E_k| \le \gamma_k \sum_{i=0}^{k} |x_i|$$

for all 푘 = 1, . . . , 푛 − 1, where $\gamma_k = \frac{ku}{1 - ku}$. As a consequence,

$$|\text{Max Error}| \le \gamma_{n-1} \sum_{i=0}^{n-1} |x_i| = (n-1)u \sum_{i=0}^{n-1} |x_i| + O(u^2).$$

Proof. By directly applying the model in (4.4) to a sequential summation of $S_k = \sum_{i=0}^{k} x_i$ in floating point arithmetic, we have

$$\hat{S}_k = fl(\hat{S}_{k-1} + x_k) = (\hat{S}_{k-1} + x_k)(1 + \delta_k) \quad \text{for some } |\delta_k| \le u, \;\; \forall\, k = 1, \ldots, n-1.$$

Notice that in the expression above, $x_0$ and $x_1$ appear in exactly 푘 additions, and each term $x_i$ for 푖 = 2, . . . , 푘 takes part in exactly 푘 − 푖 + 1 additions. We then have

$$\hat{S}_k = (x_0 + x_1) \prod_{j=1}^{k} (1 + \delta_j) + \sum_{i=2}^{k} x_i \prod_{j=i}^{k} (1 + \delta_j) \tag{4.5}$$

$$\implies \hat{S}_k - S_k = E_k = (x_0 + x_1)\left( \prod_{j=1}^{k} (1 + \delta_j) - 1 \right) + \sum_{i=2}^{k} x_i \left( \prod_{j=i}^{k} (1 + \delta_j) - 1 \right) \tag{4.6}$$

$$\implies |E_k| \le \sum_{i=0}^{1} |x_i| \cdot \left| \prod_{j=1}^{k} (1 + \delta_j) - 1 \right| + \sum_{i=2}^{k} |x_i| \cdot \left| \prod_{j=i}^{k} (1 + \delta_j) - 1 \right| \tag{4.7}$$

The product terms can be reduced using the following result:

$$\left| \prod_{j=1}^{k} (1 + \delta_j) - 1 \right| = \left| \sum_{1 \le i_1 \le k} \delta_{i_1} + \sum_{1 \le i_1 < i_2 \le k} \delta_{i_1} \delta_{i_2} + \cdots + \sum_{1 \le i_1 < \cdots < i_k \le k} \delta_{i_1} \delta_{i_2} \cdots \delta_{i_k} \right| \le \sum_{j=1}^{k} \binom{k}{j} u^j \le \sum_{j=1}^{k} (ku)^j \le \sum_{j=1}^{\infty} (ku)^j = \gamma_k. \tag{4.8}$$

And therefore (4.7) becomes

$$|E_k| \le \gamma_k \sum_{i=0}^{1} |x_i| + \sum_{i=2}^{k} \gamma_{k-i+1} |x_i| \le \gamma_k \sum_{i=0}^{k} |x_i|,$$

as desired.

We note here that the error constant is proportional to 푘 for each $|E_k|$, so the worst-case error bound on Sequential-Prefix-Sum is proportional to 푛.

Pairwise summation is defined [21] such that the $x_i$ are summed in pairs according to $y_i = x_{2i} + x_{2i+1}$ for $i = 0 : \lfloor \frac{n-1}{2} \rfloor$ (with $y_{\lfloor n/2 \rfloor} = x_{n-1}$ if 푛 is odd). The pairwise summation is then repeated recursively on the $y_i$, $i = 0 : \lfloor n/2 \rfloor$, and takes $\lceil \log_2 n \rceil$ stages. For a brief example with 푛 = 4 elements, pairwise summation gives $S_4 = ((x_1 + x_2) + (x_3 + x_4))$, as opposed to naive sequential summation, which gives $S_4 = (((x_1 + x_2) + x_3) + x_4)$.
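A recursive C sketch of pairwise summation (our own illustrative code) makes the logarithmic addition depth visible:

#include <stddef.h>

/* Pairwise (recursive) summation of x[0..n-1], for n >= 1. The recursion
 * depth, and hence the longest chain of additions, is ceil(log2 n),
 * which yields the logarithmic error constant of Lemma 19 below. */
double pairwise_sum(const double *x, size_t n) {
    if (n == 1)
        return x[0];
    size_t half = n / 2;
    return pairwise_sum(x, half) + pairwise_sum(x + half, n - half);
}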

Lemma 19 When using pairwise summation to compute $S_k$, it holds that

$$|E_k| \le \gamma_{\lceil \log_2(k+1) \rceil} \sum_{i=0}^{k} |x_i|$$

for all 푘 = 1, . . . , 푛 − 1, and therefore

$$|\text{Max Error}| \le \gamma_{\lceil \log_2 n \rceil} \sum_{i=0}^{n-1} |x_i| \le \lceil \log_2 n \rceil u \sum_{i=0}^{n-1} |x_i| + O(u^2).$$

Proof. For pairwise summation, each term $x_i$ takes part in either $\lceil \log_2(k+1) \rceil - 1$ or $\lceil \log_2(k+1) \rceil$ additions in the computation of $\hat{S}_k$. Let $A_k$ be the set of indices 푖 such that $x_i$ appears in $\lceil \log_2(k+1) \rceil - 1$ additions, with corresponding error terms $\delta_1^i, \ldots, \delta_{\lceil \log_2(k+1) \rceil - 1}^i$, and let $B_k$ be the set of indices 푖 such that $x_i$ appears in $\lceil \log_2(k+1) \rceil$ additions, with corresponding error terms $\delta_1^i, \ldots, \delta_{\lceil \log_2(k+1) \rceil}^i$. Then, analogously to (4.5), we have

$$\hat{S}_k = \sum_{i \in A_k} x_i \prod_{j=1}^{\lceil \log_2(k+1) \rceil - 1} (1 + \delta_j^i) + \sum_{i \in B_k} x_i \prod_{j=1}^{\lceil \log_2(k+1) \rceil} (1 + \delta_j^i) \tag{4.9}$$

$$\implies \hat{S}_k - S_k = E_k = \sum_{i \in A_k} x_i \left( \prod_{j=1}^{\lceil \log_2(k+1) \rceil - 1} (1 + \delta_j^i) - 1 \right) + \sum_{i \in B_k} x_i \left( \prod_{j=1}^{\lceil \log_2(k+1) \rceil} (1 + \delta_j^i) - 1 \right) \tag{4.10}$$

$$\implies |E_k| \le \gamma_{\lceil \log_2(k+1) \rceil - 1} \sum_{i \in A_k} |x_i| + \gamma_{\lceil \log_2(k+1) \rceil} \sum_{i \in B_k} |x_i| \le \gamma_{\lceil \log_2(k+1) \rceil} \sum_{i=0}^{k} |x_i| \tag{4.11}$$

where the first inequality in (4.11) follows from the derivations in (4.8).

The bound attained in Lemma 19 is better than the one obtained in Lemma 18, as it is proportional to log 푛 instead of to 푛, and as shown in [20], both bounds are attainable in the worst case.

We note that the key factor that gives pairwise summation a better worst-case bound than sequential summation is that its maximum number of addition stages, or the longest chain of additions performed, is 푂(log 푛). This is analogous to the notion of span, defined in Chapter 2 in the context of parallelism. When designing parallel algorithms, one usually aims to produce low-span algorithms in order to maximize parallelism. This brings us to a key observation on accuracy and parallelism for summation:

Observation 3 Reducing the span of a prefix sum algorithm improves both its parallelism and worst case error bound in floating point arithmetic.

For Work-Efficient Parallel Prefix, we note that the longest chain of additions after both the upsweep and downsweep steps (or, equivalently, the number of stages) is 2 log 푛. Therefore, its |Max Error| is proportional to 2 log 푛. One thing to keep in mind, however, is that these are just worst-case bounds, and there is a lot of freedom in what the actual error might be. Rigorous experiments that use the RMSE as a metric are performed in Section 4.6. Nevertheless, the bounds provide good guidance for designing accurate parallel prefix sum algorithms, especially when 푛 is large. Lastly, we note that there are more accurate summation algorithms, such as Kahan summation [21] or compensated summation, that attempt to counteract the roundoff error associated with floating point arithmetic. However, they are often an order of magnitude slower than work-efficient algorithms. We will also benchmark against Kahan summation experimentally.
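For reference, here is a minimal C sketch of Kahan (compensated) summation adapted to a running prefix sum (our own illustrative code, not the benchmark implementation). Note that aggressive flags such as -ffast-math can legally optimize the compensation away, so such code must be compiled carefully.

/* Kahan (compensated) prefix sum: the correction term c captures the
 * low-order bits lost when adding a small element to a large running sum. */
void kahan_prefix_sum(float *a, int n) {
    float sum = 0.0f, c = 0.0f;
    for (int i = 0; i < n; i++) {
        float y = a[i] - c;   /* apply the correction from the previous step */
        float t = sum + y;    /* low-order bits of y may be lost here */
        c = (t - sum) - y;    /* recover the lost bits (algebraically zero) */
        sum = t;
        a[i] = sum;
    }
}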

4.4 Ordering of Computation

There is a subtlety among variations of the work-efficient parallel prefix sum algorithm. All variations use a balanced binary tree in some way for an upsweep, and then use the computed partial sums for a downsweep. However, one key point of difference is the order in which the computations are done (i.e., how the balanced binary tree is traversed). The algorithm discussed above by Blelloch [5] performs a traversal similar to a level-order tree traversal, except from the leaves to the root on the upsweep. It then performs a level-order traversal on the downsweep. While there is no difference in parallelizability, one disadvantage of this approach is that when 푛 is larger than the cache size, it does not optimize for cache locality at all. However, a recursive (depth-first) traversal of the binary tree on the upsweep and downsweep is better suited to optimizing for cache locality, because it deals with array elements that are closer together first. Blelloch [6] proves that the recursive orderings (postorder, preorder, and inorder) all have optimal cache complexity of $O(\lceil n/L \rceil)$, where 퐿 is the cache-line size, but the level-order traversal does not: it has cache complexity 푂(푛). The differences in the ordering of computation are highlighted in Figure 4-5.

(a) Level order traversal (b) Depth-first traversals (postorder, preorder)

Figure 4-5: The red boxed numbers by the lines indicate the order in which those sums are computed (in increasing order). Left: the ordering from the algorithm discussed by Blelloch [5] / Work-Efficient Parallel Prefix, analogous to a level-order traversal of a binary tree. Right: the ordering from a circuit described by a recursive version from Ladner and Fischer [25]. The upsweep phase can be described as a postorder traversal, and the downsweep as a preorder traversal. These depth-first orderings are better suited to cache locality because, on large arrays, their accesses are grouped together and more cache-efficient. Blelloch provides a proof in [6].
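To illustrate the depth-first ordering, here is a serial C sketch of one recursive upsweep/downsweep formulation (our own illustrative code; the thesis's Recursive-Upsweep and Recursive-Downsweep may differ in details). The two recursive calls in each function are independent, so they can be spawned in parallel, and the recursion visits nearby array elements together, which is the source of the cache-friendliness discussed above.

/* Postorder upsweep: after the call, a[hi] holds the sum of a[lo..hi],
 * and each subtree's root position holds that subtree's sum. */
static void upsweep_rec(float *a, int lo, int hi) {
    if (lo == hi) return;
    int mid = lo + (hi - lo) / 2;
    upsweep_rec(a, lo, mid);       /* left subtree first */
    upsweep_rec(a, mid + 1, hi);   /* then right subtree */
    a[hi] += a[mid];               /* combine at the root (postorder) */
}

/* Preorder downsweep: `left` is the sum of all elements before a[lo];
 * `hi_done` marks that a[hi] was already finalized by an ancestor. */
static void downsweep_rec(float *a, int lo, int hi, float left, int hi_done) {
    if (!hi_done) a[hi] += left;   /* a[hi] held the subtree sum; finalize it */
    if (lo == hi) return;
    int mid = lo + (hi - lo) / 2;
    float left_sum = a[mid];       /* subtree sum of a[lo..mid] from the upsweep */
    downsweep_rec(a, lo, mid, left, 0);          /* the two calls are independent */
    downsweep_rec(a, mid + 1, hi, left + left_sum, 1);
}

/* In-place inclusive scan via the two depth-first passes. */
void scan_rec(float *a, int n) {
    if (n <= 0) return;
    upsweep_rec(a, 0, n - 1);
    downsweep_rec(a, 0, n - 1, 0.0f, 0);
}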

4.5 Block-Hybrid Algorithm

Having now understood the advantages of the data-level parallelism of vectorization, the accuracy of tree-structured summation, and the subtleties of the ordering of computation, we present a hybrid in-place algorithm called Block-Hybrid. Block-Hybrid begins by splitting the input array 퐴 into ⌈푛/퐵⌉ blocks of size 퐵. For input arrays smaller than 퐵, we simply run Prefix-Sum-Vec on the entire array. For each of the ⌈푛/퐵⌉ blocks, in parallel, we compute in-place prefix sums using the Prefix-Sum-Vec subroutine, writing the results back to the original array. We copy the last element of each block (the sum of that block) to a temporary array 푇. On the temporary array, in place, we perform a recursive, coarsened, work-efficient parallel prefix sum that achieves a postorder traversal for the upsweep and a preorder traversal for the downsweep. At index 푖 of the temporary array, we now have the offset that needs to be added to every element of block 푖 + 1 in the original array, for 푖 = 0, . . . , ⌈푛/퐵⌉ − 2. We can perform this operation in parallel, broadcasting to each block its respective offset. The algorithm is described in detail in Figure 4-6, and pseudocode is given below.

Block-Hybrid(퐴, 푛)

1  if 푛 < 퐵
2      return Prefix-Sum-Vec(퐴, 푛)
3  parallel for 푖 ← 0 to 푛 − 1 by 퐵
4      Prefix-Sum-Vec(퐴 + 푖, 퐵)
5      푇[푖/퐵] ← 퐴[푖 + 퐵 − 1]  ◁ copy the last element (the block sum) to a temporary array
6  Recursive-Upsweep(푇, ⌈푛/퐵⌉)
7  Recursive-Downsweep(푇, ⌈푛/퐵⌉)
8  parallel for 푖 ← 퐵 to 푛 − 1 by 퐵
9      for 푗 ← 0 to 퐵 − 1
10         퐴[푖 + 푗] += 푇[푖/퐵 − 1]  ◁ add the block's offset to all of its elements

Prefix-Sum-Vec is described above, and Recursive-Upsweep and Recursive-Downsweep are recursive formulations of Work-Efficient Parallel Prefix that achieve a depth-first ordering, as described in Section 4.4. Block-Hybrid, like PBBSLIB's algorithm, reduces the overhead of parallelism and avoids contention on nearby memory by coarsening to operate on blocks of sufficient size instead of on individual elements. The work of Block-Hybrid is 푂(푛), and the span is $O\!\left(\log\frac{n}{B} + B + \log^2\frac{n}{B} + \log\frac{n}{B} + B\right) = O(\log^2 n)$, assuming 푛 ≫ 퐵. Thus, we expect it to achieve the same worst-case accuracy error bound as Work-Efficient Parallel Prefix. In particular, we note that the maximum number of additions that any index of the output array goes through is $O\!\left(B + 2\log\frac{n}{B} + 1\right)$, which improves the accuracy significantly compared to the naive, purely vectorized, and PBBSLIB algorithms, which may all go through 푂(푛) additions for an index in the worst case. 퐵 is a tunable parameter, which we found in practice to be effective at 퐵 = 1024, the same as PBBSLIB [31]. Lastly, there is a variant, called Block-Hybrid-Reduce, which optimizes for more performance in practice at the cost of accuracy. It is the same algorithm as PBBSLIB's, except that it uses Prefix-Sum-Vec as its serial implementation of the inclusive scan and performs a vectorized version of the reduce operation as well. It outperforms or is on par with PBBSLIB for all input sizes, making it, to our knowledge, the fastest prefix sum algorithm in practice on a CPU.

Figure 4-6: An illustration of the Block-Hybrid algorithm on a fabricated input array of size 40. In the first phase, in-place prefix sums / inclusive scans (using the intrinsics-based vectorization algorithm) are performed on each block of size 퐵 (here, 퐵 = 8). The last value of each block is also copied to a temporary array. In the second phase, an in-place prefix sum in the form of a recursive work-efficient parallel prefix is performed on the temporary array, and finally the corresponding values are broadcast to all elements of all the blocks in parallel.

4.6 Experimental Evaluation

We perform a variety of experiments to evaluate the different prefix algorithms in terms of both accuracy and performance. Our experimental setup is described in Chapter 2. The highlights of our results for prefix sum can be summarized by the matrix in Figure 4-7, which shows the trade-offs between performance and accuracy for various prefix sum algorithms at a particular input size, $2^{22}$. This input size is where the differences are most pronounced: for smaller inputs, the serial algorithms perform better, and for larger inputs, PBBSLIB and block_hybrid_reduce both have similar performance due to reaching the maximum memory bandwidth.

Figure 4-7: A highlight matrix of the trade-offs between performance and accuracy. The input size is $2^{22}$ floats, and tests are run on a 40-core machine. This input size is where the differences are most pronounced: for smaller inputs, the serial algorithms perform better, and for larger inputs, PBBSLIB and block_hybrid_reduce both have similar performance due to reaching the maximum memory bandwidth. Accuracy is measured as $\frac{1}{\mathrm{RMSE}}$ and plotted on a logarithmic scale, where RMSE is defined in Section 4.3, for random numbers uniformly distributed between 0 and 1. Performance (or observed parallelism) is measured as $\frac{T_1}{T_{40h}}$, where $T_{40h}$ is the parallel time on the Supercloud 40-core machine with 2-way hyper-threading, and $T_1$ is the serial time of the naive sequential prefix sum on the same machine. Our algorithms are sequential intrinsics, block hybrid, and block hybrid reduce.

4.6.1 Performance

For performance testing, runtimes are taken as the average of 5 runs using a high-resolution clock, since the Cilk runtime system is non-deterministic. The input size is varied from $2^{10}$ to $2^{28}$. Figure 4-8 summarizes the runtimes on prefix sum input arrays ranging in size from $2^{10}$ to $2^{28}$ on a logarithmic scale. The Work-Efficient Parallel Prefix algorithm is omitted from the runtime comparison since, as we saw in Figure 1-4, it is even slower than the sequential scan. Prefix-Sum-Vec is labelled as sequential intrinsics, and it is important to note how well it performs, especially for smaller inputs; this algorithm should likely be used for small inputs unless accuracy is a concern. PBBSLIB suffers from worse performance when the input fits in cache, since it is primarily designed for very large inputs that cannot fit in any level of the cache hierarchy. Block-Hybrid performs well for nearly all input sizes. At very large inputs, Block-Hybrid-Reduce performs better, at which point it is on par with PBBSLIB because the memory bandwidth of the machine becomes the bottleneck. Lastly, note that the C++ STL implementation std::inclusive_scan barely improves on the sequential version.

Figure 4-8: A logarithmically scaled runtime graph comparing all the algorithms discussed so far. It is worth noting the consistent performance of the block-hybrid-based algorithms across all input sizes. This was measured on the Supercloud 40-core machine.

Figure 4-9 shows the average time taken by each prefix sum algorithm, divided by the input size 푛, as 푛 increases. As can be seen, the vectorized intrinsics algorithm remains quite efficient for smaller inputs, until the input gets sufficiently large that the parallel algorithms begin scaling well.

Figure 4-9: A linearly scaled runtime graph where the y-axis is the runtime divided by the size of the array (to get a normalized, per-element runtime). This shows that the PBBSLIB algorithm is inefficient for small inputs, while the other algorithms scale almost linearly, until the input gets large enough that the parallel algorithms scale better. This was measured on the Supercloud 40-core machine.

Lastly, Figure 4-10 shows the speedup obtained for the different algorithms. Speedup is defined as $\frac{T_1}{T_P}$, where $T_P$ is the time run on 푃 processors. The graphs show the speedup for two different machines: the AWS 18-core machine (36 CPUs with hyperthreading) and the Supercloud 40-core machine (80 CPUs with hyperthreading). Overall, Block-Hybrid and Block-Hybrid-Reduce outperform the best known benchmarks for prefix sum (whether the C++ STL or PBBSLIB) for input sizes that fit in any level of the cache hierarchy (in these experiments, on the order of 10 million elements or fewer). For very large inputs that do not fit in the last level of the cache hierarchy, Block-Hybrid-Reduce outperforms or stays competitive with PBBSLIB, the best benchmark for this range on a CPU. Prefix-Sum-Vec consistently outperforms the sequential prefix sum (and parallel prefix implementations, if the input fits in cache) by a factor of around 2.5×, all while only requiring a single core.

71 (a) AWS 18-core Machine (b) Supercloud 40-core Machine

Figure 4-10: Speedup (measured as $\frac{T_1}{T_P}$) versus input size (log scale) on two different parallel machines. As 푛 increases, we do gain some speedup; however, we eventually hit a bottleneck, namely the maximum memory bandwidth of the machine, since the calculation required in a prefix sum is small compared to the runtime associated with reading and writing memory.

4.6.2 Accuracy

For accuracy testing, the root mean square error, defined above in Section 4.3, is used to evaluate the different prefix sum algorithms. Different input sizes are tested, from $2^{10}$ to $2^{25}$. Once the input gets sufficiently large, the results are very similar, so the graphs only show the accuracy on an input of size $2^{15}$. IEEE 754 single-precision 32-bit C floats are used, although the results are the same for double-precision floating point numbers, provided overflow does not occur. The reference value is a 100-digit-precision floating point value, enabled via Boost, and the reference prefix sum is calculated sequentially. It is worth remembering that the machine precision 푢 for single-precision floating point is $5.96 \times 10^{-8}$. The 3 graphs below are tested on an input of $2^{15}$ random single-precision floating point numbers. The Mersenne Twister 19937 generator is used to generate the random numbers according to a specified probability distribution. Each graph represents a different input probability distribution. The first, Unif(0, 1), is the uniform distribution between 0 and 1. The second, Exp(1), is the exponential distribution with parameter 1. The third is the standard normal distribution. These inputs were specifically chosen since Higham uses a similar methodology in his experiments [20]. Further, no single summation method can be regarded as superior on the sole basis of accuracy, since for each method the error can vary greatly depending on the input data, within the freedom of the worst-case bounds. However, when 푛 gets sufficiently large, the bounds become more apparent. While one can always construct input sequences that favor particular prefix sum algorithms, these three random input distributions were chosen to be general but also robust to real-world input. The standard normal distribution includes about half of its numbers with the opposite sign, so the errors are larger due to more cancellation. While we could choose specific real-world data suited to one application to highlight the differences among algorithms, these input distributions aim to highlight the differences among algorithms that apply to a much wider range of real-world data.

Figure 4-11: A bar chart comparing the root mean squared relative error of the different algorithms against a prefix sum that uses much higher precision (100 digits), where the inputs are $2^{15}$ random single-precision floating point numbers drawn from the uniform distribution between 0 and 1.

As shown in Figures 4-11, 4-12, and 4-13, the consistent winner according to these benchmarks is the Kahan summation algorithm. However, on this input size, it also runs at least 10× slower than the Block-Hybrid algorithm, and it is similarly slow on other input sizes. The Block-Hybrid algorithm achieves accuracy just as good as Work-Efficient Parallel Prefix, as expected from the worst-case error bounds, and is comparable to the Kahan summation algorithm. It is 8.4× more accurate over these input ranges compared to PBBSLIB, and 20.2× more accurate than the sequential prefix sum (taking an average of the 3 accuracy differences across the input distributions). By the same metrics, Block-Hybrid-Reduce is 2.6× more accurate than PBBSLIB, and 6.3× more accurate than the sequential prefix sum.

Figure 4-12: A bar chart comparing the root mean squared relative error of the different algorithms against a prefix sum that uses much higher precision (100 digits), where the inputs are $2^{15}$ random single-precision floating point numbers drawn from the exponential distribution with 휆 = 1.

4.7 GPU Comparison

Since there has been a lot of previous work on efficient GPU implementations [29] [18] [23], we also compare the best prefix sum implementations on a CPU with them on a performance-per-cost ratio. Specifically, we define performance as the inverse of runtime, and the performance-per-cost ratio therefore as $\frac{1}{\text{runtime} \times \text{cost}}$, where the cost is retrieved from Amazon EC2 Spot Instances Pricing [2]. The specifications of the two environmental setups we compare are described in Chapter 2. Specifically, we compare the affordable AWS machine (c5.xlarge) and the Volta GPU machine. As of May 2020 [2] [1], their on-demand and spot pricing is shown in Table 4.1.

Figure 4-13: A bar chart comparing the root mean squared relative error of the different algorithms against a prefix sum that uses much higher precision (100 digits), where the inputs are $2^{15}$ random single-precision floating point numbers drawn from the standard normal distribution.

Machine                               Cost per Hour (Spot)   Cost per Hour (On Demand)
c5.xlarge (Affordable AWS Machine)    $0.0396                $0.17
p3.2xlarge (Volta GPU)                $0.918                 $3.06

Table 4.1: Amazon EC2 spot and on-demand pricing comparison.

For performance measurements, we took the sample measurements in Table 4.2 using the same benchmark suite and methodology as in Section 4.6.1. The GPU implementation we used is the state-of-the-art implementation specified by the CUDPP library in [29] and [18], run on the Volta GPU machine specified in the experimental setup. The CPU runtimes use the same benchmarks as in Section 4.6.1, except that they are run on the affordable AWS machine.

Runtime (ms) on input size 푛
Implementation          2^10      2^13      2^16     2^19    2^23    2^27
sequential scan         0.00124   0.00966   0.0772   0.621   10.0    161
sequential intrinsics   6.40e-4   0.00425   0.0339   0.278   5.04    83.0
block_hybrid            0.00518   0.00845   0.0586   0.385   3.19    55.4
block_hybrid_reduce     0.00377   0.00815   0.0549   0.322   2.96    48.5
GPU CUDPP               0.00819   0.0256    0.0271   0.141   0.523   4.14

Table 4.2: Runtime comparison against the state-of-the-art GPU implementation on 1 Volta GPU versus the affordable 2-core AWS machine.

Across these inputs, we can see that the GPU only makes sense when the input is very large (in fact, the CPU implementations are faster at smaller input sizes). In the best case for the GPU, it is around 11.7× faster than the best CPU implementation (block_hybrid_reduce). At all smaller inputs, it is less than 10× faster (and the improvement decreases as 푛 gets smaller). Hence, the best comparison for the GPU on a performance-per-cost ratio is at input size $2^{27}$, which gives a ratio of $\frac{1}{4.14 \times 0.918} = 0.26$ for spot pricing and $\frac{1}{4.14 \times 3.06} = 0.079$ for on-demand pricing. The best CPU implementation has a ratio of $\frac{1}{48.5 \times 0.0396} = 0.52$ for spot pricing and $\frac{1}{48.5 \times 0.17} = 0.12$ for on-demand pricing. So on this ratio metric, the affordable CPU is at least 2× better for spot pricing and 1.5× better for on-demand pricing. Hence, despite GPUs' massive computational power and parallelism, on a comparison that includes cost, these results show that CPU computing is strongly worth considering, since it outperforms the GPU on a performance-per-cost ratio by at least 1.5× unless the input size is extremely large (larger than $2^{27}$). Furthermore, for more complicated applications that use prefix sum as a subroutine, the flexibility of a general-purpose CPU provides considerable advantages.

Chapter 5

Conclusions and Future Work

This thesis presents how we can abstract away the application domains of scientific computing applications, such as the fast multipole method (FMM) and the integral image, which involve reducing elements in overlapping subregions of a multidimensional array, and instead focus on the underlying computation. More precisely, we formulate problems called included and excluded sums: given a 푑-dimensional array of numbers and an axis-aligned subregion (subarray) of interest, compute the total sum of all elements included in the region or the total sum of all elements outside of the region, respectively, for all positions of the region of interest. In our findings, we present an asymptotically improved algorithm called DRES for the excluded sums problem in arbitrary dimensions, and we show that it reduces the work from exponential to linear in the number of dimensions compared to the previous state of the art. While we have implemented DRES, a rigorous experimental evaluation remains to be done for a variety of applications that scale with the number of dimensions. It remains to be shown exactly how well DRES scales in practice, and whether the asymptotic analysis in the number of dimensions is borne out, for a variety of applications.

77 the literature on parallel prefix computation and observing opportunities for optimization, we present the block-hybrid algorithm, which achieves a strong compromise between high performance and floating point summation accuracy, outperforming the state-of-the-art CPU implementation over several practical input sizes. We also note the reality of the vectorized implementation: Intel intrinsics vectorization is powerful for CPU prefix sums since the compiler cannot capitalize on the potential 2× consistent speedup on a single core. Lastly, by comparing our state-of-the-art prefix sum implementations on a CPU with the respective GPU implementations, we make a case for cost-efficient prefix sum computation on general commodity multicores. Despite the high data-level parallelism inherent in a prefix sum well- suited to GPUs, with performant algorithms such as block-hybrid, it may make sense to use multicore software over GPUs unless the input is extremely large.

Appendix A

Implementation Details

A.1 Implementation of Prefix Sum Vectorization

The following is a C intrinsics implementation of Prefix-Sum-Vec for float. Different vector widths can be used (256-bit with AVX2, for example: 8 × 32-bit floats), but this proved to have just as good performance. See the Intel Intrinsics Guide [32] for a specification of the intrinsic functions. The following code compiles with GCC or Clang on any Intel processor that has Streaming SIMD Extensions (SSE4.2) or more recent (AVX and newer). Also see the associated assembly generated by x86-64 clang 9.0.0 with -O3 and -ffast-math.

1  #include <immintrin.h>
2
3  static inline __m128 scan_SSE(__m128 x) {
4    // shift (replacing with 0) vector left by 4 bytes (1 float) and add to itself
5    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
6    // shift (replacing with 0) vector left by 8 bytes (2 floats) and add to itself
7    x = _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
8    return x;
9  }
10
11 static inline void prefix_sum_SSE(float* a, const int n) {
12   // set initial offset to identity [0, 0, 0, 0]
13   __m128 offset = _mm_setzero_ps();
14   size_t i = 0;
15   for (; i + 4 <= (size_t) n; i += 4) {
16     // load 4 elements at once
17     __m128 x = _mm_load_ps(&a[i]);
18     __m128 out = scan_SSE(x);
19     // add to offset
20     out = _mm_add_ps(out, offset);
21     // write back
22     _mm_store_ps(&a[i], out);
23     // shuffle last element to all lanes of vector, for next offset
24     offset = _mm_shuffle_ps(out, out, _MM_SHUFFLE(3, 3, 3, 3));
25   }
26   // deal with case if n is not divisible by 4
27   float last_offset = _mm_cvtss_f32(offset);
28   for (; i < (size_t) n; ++i) {
29     last_offset += a[i];
30     a[i] = last_offset;
31   }
32 }

Listing A.1: C intrinsics implementation of Prefix-Sum-Vec with vector width 4 for float
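As a usage note, the listing above uses the aligned load and store intrinsics _mm_load_ps and _mm_store_ps, so the caller must supply a 16-byte-aligned array. The following minimal driver is our sketch (not part of the thesis); it uses C11's aligned_alloc, whose size argument must be a multiple of the alignment.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int n = 10;
    // round the allocation up to a multiple of 4 floats (16 bytes),
    // since aligned_alloc requires size to be a multiple of the alignment
    float *a = aligned_alloc(16, ((n + 3) / 4) * 4 * sizeof(float));
    if (a == NULL) return 1;
    for (int i = 0; i < n; i++)
        a[i] = 1.0f;              // all ones, so the prefix sums are 1, 2, ..., n
    prefix_sum_SSE(a, n);
    for (int i = 0; i < n; i++)
        printf("%g ", a[i]);      // prints: 1 2 3 4 5 6 7 8 9 10
    printf("\n");
    free(a);
    return 0;
}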

.LBB0_3:                                # =>This Inner Loop Header: Depth=1
        movdqa  15872(%rsp,%rax,4), %xmm1
        movdqa  15888(%rsp,%rax,4), %xmm2
        movdqa  %xmm1, %xmm3
        pslldq  $4, %xmm3               # xmm3 = zero,zero,zero,zero,xmm3[0,1,2,3,4,5,6,7,8,9,10,11]
        addps   %xmm1, %xmm3
        xorps   %xmm1, %xmm1
        movlhps %xmm3, %xmm1            # xmm1 = xmm1[0],xmm3[0]
        addps   %xmm3, %xmm1
        addps   %xmm0, %xmm1
        movaps  %xmm1, 15872(%rsp,%rax,4)
        shufps  $255, %xmm1, %xmm1      # xmm1 = xmm1[3,3,3,3]
        movdqa  %xmm2, %xmm3
        pslldq  $4, %xmm3               # xmm3 = zero,zero,zero,zero,xmm3[0,1,2,3,4,5,6,7,8,9,10,11]
        addps   %xmm2, %xmm3
        xorps   %xmm0, %xmm0
        movlhps %xmm3, %xmm0            # xmm0 = xmm0[0],xmm3[0]
        addps   %xmm3, %xmm0
        addps   %xmm1, %xmm0
        movaps  %xmm0, 15888(%rsp,%rax,4)
        shufps  $255, %xmm0, %xmm0      # xmm0 = xmm0[3,3,3,3]
        addq    $8, %rax
        cmpq    $4000, %rax             # imm = 0xFA0
        jb      .LBB0_3

Listing A.2: Assembly generated by x86-64 clang 8.0.0 with -O3 and -ffast-math for a toy example input. The assembly corresponds to the inner loop starting on line 17 of the C intrinsics implementation above. Note the effective use of SSE instructions.

1  static inline void sequential_scan(float* A, const int n) {
2    float sum = 0.0f;
3    for (size_t i = 0; i < (size_t) n; i++) {  // start at 0 so A[0] is included in the scan
4      sum += A[i];
5      A[i] = sum;
6    }
7  }
8
9  .LBB0_5:                      # =>This Inner Loop Header: Depth=1
10   addss -144(%rsp,%rax,4), %xmm0
11   movss %xmm0, -144(%rsp,%rax,4)
12   addss -140(%rsp,%rax,4), %xmm0
13   movss %xmm0, -140(%rsp,%rax,4)
14   addss -136(%rsp,%rax,4), %xmm0
15   movss %xmm0, -136(%rsp,%rax,4)
16   addss -132(%rsp,%rax,4), %xmm0
17   movss %xmm0, -132(%rsp,%rax,4)
18   addss -128(%rsp,%rax,4), %xmm0
19   movss %xmm0, -128(%rsp,%rax,4)
20   addq $5, %rax
21   cmpq $4004, %rax             # imm = 0xFA4
22   jne .LBB0_5

Listing A.3: An implementation of a naive sequential scan and the assembly generated for it by x86-64 clang 8.0.0 with -O3 and -ffast-math for a toy example input. The assembly corresponds to the inner loop starting at line 4. Note the ineffective use of SSE instructions, which fails to capture the high data-level parallelism.
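For completeness, the two scans can be checked against each other with a small harness such as the following sketch (ours, not from the thesis). Because the vectorized and sequential scans associate their additions differently, the outputs may differ by small rounding errors, which is exactly the kind of floating-point behavior analyzed earlier in the thesis.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int n = 1000;
    // prefix_sum_SSE needs 16-byte alignment; 1024 floats = 4096 bytes
    float *a = aligned_alloc(16, 1024 * sizeof(float));
    float *b = malloc(n * sizeof(float));
    if (a == NULL || b == NULL) return 1;
    for (int i = 0; i < n; i++)
        a[i] = b[i] = (float) rand() / (float) RAND_MAX;
    prefix_sum_SSE(a, n);     // vectorized scan (Listing A.1)
    sequential_scan(b, n);    // naive scan (Listing A.3)
    // compare with a tolerance rather than expecting bit-identical results
    float max_diff = 0.0f;
    for (int i = 0; i < n; i++) {
        float d = fabsf(a[i] - b[i]);
        if (d > max_diff) max_diff = d;
    }
    printf("max |vectorized - sequential| = %g\n", max_diff);
    free(a);
    free(b);
    return 0;
}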

Bibliography

[1] Amazon. Amazon EC2 Pricing. https://aws.amazon.com/ec2/pricing/on-demand/, 2020. Accessed: 2020-05-11.

[2] Amazon. Amazon EC2 Spot Instances Pricing. https://web.archive.org/web/20200513005845/https://aws.amazon.com/ec2/spot/pricing/, 2020. Accessed: 2020-05-13.

[3] Rick Beatson and Leslie Greengard. A short course on fast multipole methods. Wavelets, multilevel methods and elliptic PDEs, pages 1–37, 1997.

[4] Rick Beatson and Leslie Greengard. A short course on fast multipole methods. In Wavelets, Multilevel Methods and Elliptic PDEs, pages 1–37. Oxford University Press, 1997.

[5] Guy E. Blelloch. Prefix sums and their applications. Technical Report CMU-CS-90-190, School of Computer Science, Carnegie Mellon University, November 1990.

[6] Guy E. Blelloch, Phillip B. Gibbons, and Harsha Vardhan Simhadri. Low depth cache-oblivious algorithms. In Proceedings of the Twenty-Second Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’10, pages 189–199, New York, NY, USA, 2010. Association for Computing Machinery.

[7] Robert D Blumofe, Christopher F Joerg, Bradley C Kuszmaul, Charles E Leiserson, Keith H Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55–69, 1996.

[8] Derek Bradley and Gerhard Roth. Adaptive thresholding using the integral image. Journal of Graphics Tools, 12(2):13–21, 2007.

[9] Ronald Coifman, Vladimir Rokhlin, and Stephen Wandzura. The fast multipole method for the wave equation: A pedestrian prescription. IEEE Antennas and Propagation Magazine, 35(3):7–12, 1993.

[10] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, 3rd edition, 2009.

[11] cppreference. std::inclusive_scan. https://en.cppreference.com/w/cpp/algorithm/inclusive_scan, 2020. Accessed: 2020-05-06.

[12] Franklin C Crow. Summed-area tables for texture mapping. In Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques, pages 207–212, 1984.

[13] Eric Darve. The fast multipole method: numerical implementation. Journal of Computational Physics, 160(1):195–240, 2000.

[14] E. D. Demaine, M. L. Demaine, A. Edelman, C. E. Leiserson, and P. Persson. Building blocks and excluded sums. SIAM News, 38(4):1–5, 2005.

[15] Alexandra Fedorova, Margo Seltzer, and Michael D Smith. Improving performance isolation on chip multiprocessors via an operating system scheduler. In 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007), pages 25–38. IEEE, 2007.

[16] David Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Comput. Surv., 23(1):5–48, March 1991.

[17] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. J. Comput. Phys., 135(2):280–292, August 1997.

[18] Mark Harris, Shubhabrata Sengupta, and John Owens. Parallel prefix sum (scan) with CUDA. In GPU Gems 3, chapter 39. Addison-Wesley, August 2007.

[19] Justin Hensley, Thorsten Scheuermann, Greg Coombe, Montek Singh, and Anselmo Lastra. Fast summed-area table generation and its applications. In Computer Graphics Forum, volume 24, pages 547–555. Wiley Online Library, 2005.

[20] Nicholas J. Higham. The accuracy of floating point summation. SIAM J. Scientific Computing, 14:783–799, 1993.

[21] Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. Society for In- dustrial and Applied Mathematics, Philadelphia, PA, USA, 2nd edition, 2002.

[22] W. Daniel Hillis and Guy L. Steele, Jr. Data parallel algorithms. Commun. ACM, 29(12):1170–1183, December 1986.

[23] D. Horn. Stream reduction operations for GPGPU applications. In GPU Gems 2, January 2005.

[24] Cilk Hub. Programming in Cilk. http://cilk.mit.edu/programming/, 2020. Accessed: 2020-05-09.

[25] Richard E. Ladner and Michael J. Fischer. Parallel prefix computation. Journal of the ACM, 27(4):831–838, 1980.

[26] Boost Organization. Boost C++ libraries: Multiprecision. https://www.boost.org/doc/libs/1_66_0/libs/multiprecision/doc/html/boost_multiprecision/tut/floats/cpp_bin_float.html, 2020. Accessed: 2020-05-09.

[27] Artur Podobas, Mats Brorsson, and Karl-Filip Faxén. A comparison of some recent task-based parallel programming models. In 3rd Workshop on Programmability Issues for Multi-Core Computers, 2010.

[28] Tao B. Schardl, William S. Moses, and Charles E. Leiserson. Tapir: Embedding fork-join parallelism into LLVM’s intermediate representation. SIGPLAN Not., 52(8):249–265, January 2017.

[29] Shubhabrata Sengupta, Mark Harris, Michael Garland, and John Owens. Efficient Parallel Scan Algorithms for GPUs, pages 413–442. January 2011.

[30] Shubhabrata Sengupta, Aaron Lefohn, and John Owens. A work-efficient step-efficient prefix sum algorithm. May 2006.

[31] Julian Shun, Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Aapo Kyrola, Harsha Vardhan Simhadri, and Kanat Tangwongsan. Brief announcement: The problem based benchmark suite. In Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’12, pages 68–70, New York, NY, USA, 2012. Association for Computing Machinery.

[32] Intel Software. Intel intrinsics guide. https://software.intel.com/sites/landingpage/IntrinsicsGuide/, 2020. Accessed: 2020-05-09.

[33] Bjarne Stroustrup. The C++ programming language. Pearson Education, 2013.

[34] Steven P Vanderwiel and David J Lilja. Data prefetch mechanisms. ACM Computing Surveys (CSUR), 32(2):174–199, 2000.

[35] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, volume 1, pages I–I, 2001.

[36] Chi Xu, Xi Chen, Robert P Dick, and Zhuoqing Morley Mao. Cache contention and application performance prediction for multi-core systems. In 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), pages 76–86. IEEE, 2010.

[37] Nan Zhang. A novel parallel prefix sum algorithm and its implementation on multi- core platforms. In 2010 2nd International Conference on Computer Engineering and Technology, volume 2, pages V2–66. IEEE, 2010.

[38] Gernot Ziegler. Summed area computation using ripmap of partial sums, 2012. GPU Technology Conference (talk).
