Multidimensional Included and Excluded Sums

Helen Xu∗ Sean Fraser∗ Charles E. Leiserson∗

arXiv:2106.00124v1 [cs.DS] 31 May 2021

Abstract

This paper presents algorithms for the included-sums and excluded-sums problems used by scientific computing applications such as the fast multipole method. These problems are defined in terms of a d-dimensional array of N elements and a binary associative operator ⊕ on the elements. The included-sum problem requires that the elements within overlapping boxes cornered at each element within the array be reduced using ⊕. The excluded-sum problem reduces the elements outside each box. The weak versions of these problems assume that the operator ⊕ has an inverse ⊖, whereas the strong versions do not require this assumption. In addition to studying existing algorithms that solve these problems, we introduce three new algorithms.

The bidirectional box-sum (BDBS) algorithm solves the strong included-sums problem in Θ(dN) time, asymptotically beating the classical summed-area table (SAT) algorithm, which runs in Θ(2^d N) time and which only solves the weak version of the problem. Empirically, the BDBS algorithm outperforms the SAT algorithm in higher dimensions by up to 17.1×.

The box-complement algorithm can solve the strong excluded-sums problem in Θ(dN) time, asymptotically beating the state-of-the-art corners algorithm by Demaine et al., which runs in Ω(2^d N) time. In 3 dimensions the box-complement algorithm empirically outperforms the corners algorithm by about 1.4× given similar amounts of space.

The weak excluded-sums problem can be solved in Θ(dN) time by the bidirectional box-sum complement (BDBSC) algorithm, which is a trivial extension of the BDBS algorithm. Given an operator inverse ⊖, BDBSC can beat box-complement by up to a factor of 4.

∗ Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology. Email: {hjxu, sfraser, cel}@mit.edu.

Figure 1: An illustration of included and excluded sums in 2 dimensions on an n1 × n2 matrix using a (k1, k2)-box. (a) For a coordinate (x1, x2) of the matrix, the included-sums problem requires all the points in the k1 × k2 box cornered at (x1, x2), shown as a grey rectangle, to be reduced using a binary associative operator ⊕. The included-sums problem requires that this reduction be performed at every coordinate of the matrix, not just at a single coordinate as is shown in the figure. (b) A similar illustration for excluded sums, which reduces the points outside the box.

1 Introduction

Many scientific computing applications require reducing many (potentially overlapping) regions of a tensor, or multidimensional array, to a single value for each region quickly and accurately. For example, the integral-image problem (or summed-area table) [7, 11] preprocesses an image to answer queries for the sum of elements in arbitrary rectangular subregions of a matrix in constant time. The integral image has applications in real-time image processing and filtering [18].

The fast multipole method (FMM) is a widely used numerical approximation for the calculation of long-ranged forces in various N-particle simulations [2, 16]. The essence of the FMM is a reduction of a neighboring subregion's elements, excluding elements too close, using a multipole expansion to allow for fewer pairwise calculations [9, 12]. Specifically, the multipole-to-local expansion in the FMM adds relevant expansions outside some close neighborhood but inside some larger bounding region for each element [2, 28]. High-dimensional applications include the FMM for particle simulations in 3D space [8, 17] and direct problems in higher dimensions [25].

These problems give rise to the excluded-sums problem [13], which underlies applications that require reducing regions of a tensor to a single value using a binary associative operator. For example, the excluded-sums problem corresponds to the translation of the local expansion coefficients within each box in the FMM [16]. The problems are called "sums" for ease of presentation, but the general problem statements (and therefore algorithms to solve the problems) apply to any context involving a monoid (S, ⊕, e), where S is a set of values, ⊕
Copyright © 2021 by SIAM. Unauthorized reproduction of this article is prohibited.

is a binary associative operator defined on S, and e ∈ S is the identity for ⊕.

Although the excluded-sums problem is particularly challenging and meaningful for multidimensional tensors, let us start by considering the problem in only 2 dimensions. And, to understand the excluded-sums problem, it helps to understand the included-sums problem as well. Figure 1 illustrates included and excluded sums in 2 dimensions, and Figure 2 provides examples using ordinary addition as the ⊕ operator. We have an n1 × n2 matrix 풜 of elements over a monoid (S, ⊕, e). We also are given a "box size" k = (k1, k2) such that k1 ≤ n1 and k2 ≤ n2. The included sum at a coordinate (x1, x2), as shown in Figure 1(a), involves reducing — accumulating using ⊕ — all the elements of 풜 inside the k-box cornered at (x1, x2), that is,

    ⊕_{y1 = x1}^{x1 + k1 − 1}  ⊕_{y2 = x2}^{x2 + k2 − 1}  풜[y1, y2] ,

where if a coordinate goes out of range, we assume that its value is the identity e. The included-sums problem computes the included sum for all coordinates of 풜, which can be straightforwardly accomplished with four nested loops in Θ(n1 n2 k1 k2) time. Similarly, the excluded sum at a coordinate, as shown in Figure 1(b), reduces all the elements of 풜 outside the k-box cornered at (x1, x2). The excluded-sums problem computes the excluded sum for all coordinates of 풜, which can be straightforwardly accomplished in Θ(n1 n2 (n1 n2 − k1 k2)) time. We shall see much better algorithms for both problems.

(a) input          (b) included sums       (c) excluded sums
1 3 6 2 5          16 19 10 10  7          75 72 81 81 84
3 9 1 1 2          18 16 10  8  4          73 75 81 83 87
5 1 5 3 2          13 11 10 14 11          78 80 81 77 80
4 3 2 0 9          15  8 10 24 17          76 83 81 67 74
6 2 1 7 8           8  3  8 15  8          83 78 83 76 83

Figure 2: Examples of the included- and excluded-sums problems on an input matrix in 2 dimensions with box size (2, 2) using the + operator. (a) The input matrix. The square shows the box cornered at (3, 3). (b) The solution for the included-sums problem with the + operator. The highlighted square contains the included sum for the box in (a). The included-sums problem requires computing the included sum for every element in the input matrix. (c) A similar example for excluded sums. The highlighted square contains the excluded sum for the box in (a).

Excluded Sums and Operator Inverse  One way to solve the excluded-sums problem is to solve the included-sums problem and then use the inverse ⊖ of the ⊕ operator to "subtract" out the results from the reduction of the entire tensor. This approach fails for operators without an inverse, however, such as the maximum operator max. As another example, the FMM involves solving the excluded-sums problem over a domain of functions which cannot be "subtracted," because the functions exhibit singularities [13]. Even for simpler domains, using the inverse (if it exists) may have unintended consequences. For example, subtracting finite-precision floating-point values can suffer from catastrophic cancellation [13, 30] and high round-off error [19]. Some contexts may permit the use of an inverse, but others may not.

Consequently, we refine the included- and excluded-sums problems into weak and strong versions. The weak version requires an operator inverse, while the strong version does not. Any algorithm for the included-sums problem trivially solves the weak excluded-sums problem, and any algorithm for the strong excluded-sums problem trivially solves the weak excluded-sums problem. This paper presents efficient algorithms for both the weak and strong excluded-sums problems.

Summed-Area Table for Weak Excluded Sums  The summed-area table (SAT) algorithm uses the classical summed-area table method [7, 11, 29] to solve the weak included-sums problem on a d-dimensional tensor 풜 having N elements in O(2^d N) time. This algorithm precomputes prefix sums along each dimension of 풜 and uses inclusion-exclusion to "add" and "subtract" prefixes to find the included sum for arbitrary boxes. The SAT algorithm cannot be used to solve the strong included-sums problem, however, because it requires an operator inverse. The summed-area table algorithm can easily be extended to an algorithm for weak excluded sums by totaling the entire tensor and subtracting the solution to weak included sums. We will call this algorithm the SAT complement (SATC) algorithm.

Corners Algorithm for Strong Excluded Sums  The naive algorithm for strong excluded sums that just sums up the area of interest for each element runs in O(N^2) time in the worst case, because it wastes work by recomputing reductions for overlapping regions. To avoid recomputing sums, Demaine et al. [13] introduced an algorithm that solves the strong excluded-sums problem in arbitrary dimensions, which we will call the corners algorithm.

At a high level, the corners algorithm partitions the excluded region for each box into 2d disjoint regions that each share a distinct vertex of the box, while collectively filling the entire tensor, excluding the box. The algorithm depends heavily on prefix and suffix sums to compute the reduction of elements in each of the disjoint regions.
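To make the 2-dimensional definitions concrete, the following is a minimal C++ sketch of the naive approach (our illustration, not the paper's implementation; the helper names `included_sums` and `excluded_sums` are ours). It reproduces Figure 2 with the + operator, whose identity is 0 and whose inverse permits the weak "total minus included" approach:

```cpp
#include <algorithm>
#include <vector>

// Naive included and excluded sums in 2D under +, following Section 1:
// the k-box cornered at (x1, x2) covers [x1, x1 + k1) x [x2, x2 + k2),
// and out-of-range coordinates contribute the identity (0 for +).
using Matrix = std::vector<std::vector<long>>;

Matrix included_sums(const Matrix& a, int k1, int k2) {
    int n1 = a.size(), n2 = a[0].size();
    Matrix out(n1, std::vector<long>(n2, 0));
    for (int x1 = 0; x1 < n1; ++x1)
        for (int x2 = 0; x2 < n2; ++x2)
            for (int y1 = x1; y1 < std::min(x1 + k1, n1); ++y1)
                for (int y2 = x2; y2 < std::min(x2 + k2, n2); ++y2)
                    out[x1][x2] += a[y1][y2];   // four nested loops: Θ(n1 n2 k1 k2)
    return out;
}

// For an invertible operator such as +, the excluded sum is simply the
// total of the tensor minus the included sum (the weak approach).
Matrix excluded_sums(const Matrix& a, int k1, int k2) {
    long total = 0;
    for (const auto& row : a)
        for (long v : row) total += v;
    Matrix out = included_sums(a, k1, k2);
    for (auto& row : out)
        for (long& v : row) v = total - v;
    return out;
}
```

On the 5×5 input of Figure 2 with box size (2, 2), `included_sums` yields the matrix of Figure 2(b) and `excluded_sums` the matrix of Figure 2(c). The rest of the paper is about avoiding both the Θ(n1 n2 k1 k2) work and the reliance on an inverse.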

Since the original article that proposed the corners algorithm does not include a formal analysis of its runtime or space usage in arbitrary dimensions, we present one in Appendix A. Given a d-dimensional tensor of N elements, the corners algorithm takes Ω(2^d N) time to compute the excluded sum in the best case, because there are 2^d corners and each one requires Ω(N) time to add its contribution to each excluded box. As we'll see, the bound is tight: given Θ(dN) space, the corners algorithm takes Θ(2^d N) time. With Θ(N) space, the corners algorithm takes Θ(d 2^d N) time.

Contributions  This paper presents algorithms for included and strong excluded sums in arbitrary dimensions that improve the runtime from exponential to linear in the number of dimensions. For strong included sums, we introduce the bidirectional box-sum (BDBS) algorithm, which uses prefix and suffix sums to compute the included sum efficiently. The BDBS algorithm can easily be extended into an algorithm for weak excluded sums, which we will call the bidirectional box-sum complement (BDBSC) algorithm. For strong excluded sums, the main insight in this paper is the formulation of the excluded sums in terms of the "box complement," on which the box-complement algorithm is based. Table 1 summarizes all algorithms considered in this paper.

Figure 3 illustrates the performance and space usage of the box-complement algorithm and variants of the 3D corners algorithm. Since the paper that introduced the corners algorithm stopped short of a general construction in higher dimensions, the 3D case is the highest dimensionality for which we have implementations of the box-complement and corners algorithms. The 3D case is of interest because applications such as the FMM often arise in three dimensions [8, 17]. We find that the box-complement algorithm outperforms the corners algorithm by about 1.4× when given similar amounts of space, though the corners algorithm with twice the space of box-complement is 2× faster. The box-complement algorithm uses a fixed (constant) factor of extra space, while the corners algorithm can use a variable amount of space. We found that the performance of the corners algorithm depends heavily on its space usage. We use Corners(c) to denote the implementation of the corners algorithm that uses a factor of c in space to store leaves in the computation tree and gather the results into the output. Furthermore, we also explored a variant of the corners algorithm in Appendix A, called Corners Spine, which uses extra space to store the spine of the computation tree and asymptotically reduce the runtime.

Figure 3: Space and time per element of the corners and box-complement algorithms in 3 dimensions. We use Corners(c) and Corners Spine to denote variants of the corners algorithm with extra space. We set the number of elements N = 681472 = 88^3 and the box lengths k1 = k2 = k3 = 4 (for K = 64).

Figure 4: Time per element of algorithms for excluded sums in arbitrary dimensions. The number of elements N of the tensor was in the range [2097152, 134217728] (selected to be an exact power of the number of dimensions). For each number of dimensions d, we set the box volume K = 8^d.

Figure 4 demonstrates how algorithms for weak excluded sums scale with dimension. We omit the corners algorithm because the original paper stopped short of a construction of how to find the corners in higher dimensions. We also omit an evaluation of included-sums algorithms because the relative performance of all algorithms would be the same. The naive and summed-area-table algorithms perform well in lower dimensions but exhibit crossover points (at 3 and 6 dimensions, respectively) because their runtimes grow exponentially with dimension. In contrast, the BDBS and box-complement algorithms scale linearly in the number

Algorithm | Source | Time | Space | Included or Excluded? | Strong or Weak?
Naive included sum | [This work] | Θ(KN) | Θ(N) | Included | Strong
Naive included sum complement | [This work] | Θ(KN) | Θ(N) | Excluded | Weak
Naive excluded sums | [This work] | Θ(N^2) | Θ(N) | Excluded | Strong
Summed-area table (SAT) | [11, 29] | Θ(2^d N) | Θ(N) | Included | Weak
Summed-area table complement (SATC) | [11, 29] | Θ(2^d N) | Θ(N) | Excluded | Weak
Corners(c) | [13] | Θ((d + 1/c) 2^d N) | Θ(cN) | Excluded | Strong
Corners Spine(c) | [13] | Θ((2^(c+1) + 2^d (d − c) + 2^d) N) | Θ(cN) | Excluded | Strong
Bidirectional box sum (BDBS) | [This work] | Θ(dN) | Θ(N) | Included | Strong
Bidirectional box sum complement (BDBSC) | [This work] | Θ(dN) | Θ(N) | Excluded | Weak
Box-complement | [This work] | Θ(dN) | Θ(N) | Excluded | Strong

Table 1: A summary of all algorithms for excluded sums in this paper. All algorithms take as input a d-dimensional tensor of N elements. We include the runtime, space usage, whether an algorithm solves the included- or excluded-sums problem, and whether it solves the strong or weak version of the problem. We use K to denote the volume of the box (in the runtime of the naive algorithms). The corners algorithm takes a parameter c of extra space that it uses to improve its runtime.

of dimensions and outperform the summed-area-table method by at least 1.3× after 6 dimensions. The BDBS algorithm demonstrates the advantage of solving the weak problem, if you can, because it is always faster than the box-complement algorithm, which doesn't exploit an operator inverse. Both algorithms introduced in this paper outperform existing methods in higher dimensions, however.

To be specific, our contributions are as follows:

• the bidirectional box-sum (BDBS) algorithm for strong included sums;
• the bidirectional box-sum complement (BDBSC) algorithm for weak excluded sums;
• the box-complement algorithm for strong excluded sums;
• theorems showing that, for a d-dimensional tensor of size N, these algorithms all run in Θ(dN) time and Θ(N) space;
• implementations of these algorithms in C++; and
• empirical evaluations showing that the box-complement algorithm outperforms the corners algorithm in 3D given similar space and that both the BDBSC algorithm and the box-complement algorithm outperform the SATC algorithm in higher dimensions.

Outline  The rest of the paper is organized as follows. Section 2 provides necessary preliminaries and notation to understand the algorithms and proofs. Section 3 presents an efficient algorithm to solve the included-sums problem, which will be used as a key subroutine in the box-complement algorithm. Section 4 formulates the excluded sum as the "box complement," and Section 5 describes and analyzes the resulting box-complement algorithm. Section 6 presents an empirical evaluation of algorithms for excluded sums. Finally, we provide concluding remarks in Section 7.

2 Preliminaries

This section reviews tensor preliminaries used to describe algorithms in later sections. It also formalizes the included- and excluded-sums problems in terms of tensor notation. Finally, it describes the prefix- and suffix-sums primitive underlying the main algorithms in this paper.

Tensor Preliminaries  We first introduce the coordinate and tensor notation we use to explain our algorithms and why they work. At a high level, tensors are d-dimensional arrays of elements over some monoid (S, ⊕, e). In this paper, tensors are represented by capital script letters (e.g., 풜) and vectors are represented by lowercase boldface letters (e.g., a).

We shall use the following terminology. A d-dimensional coordinate domain U is the cross product U = U1 × U2 × ... × Ud, where Ui = {1, 2, ..., ni} for ni ≥ 1. The size of U is n1 n2 ··· nd. Given a coordinate domain U and a monoid (S, ⊕, e) as defined in Section 1, a tensor 풜 can be viewed for our purposes as a mapping 풜 : U → S. That is, a tensor maps a coordinate x ∈ U to an element 풜[x] ∈ S. The size of a tensor is the size of its coordinate domain. We omit the coordinate domain U and the monoid (S, ⊕, e) when they are clear from context.

We use Python-like colon notation x : x′, where x ≤ x′, to denote the half-open interval [x, x′) of coordinates along a particular dimension. If x : x′ would extend outside of [1, n], where n is the maximum coordinate, it denotes only the coordinates actually in the interval, that is, the interval max{1, x} : min{n + 1, x′}.
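The clamping convention can be made explicit with a small C++ helper (our illustration; `Interval` and `clamp_interval` are hypothetical names, not from the paper):

```cpp
#include <algorithm>

// The paper's colon notation x:x' denotes the half-open interval [x, x')
// of 1-based coordinates, clamped to the coordinate range [1, n] as
// max{1, x} : min{n + 1, x'}.
struct Interval {
    long lo, hi;  // half-open: contains lo, lo + 1, ..., hi - 1
};

Interval clamp_interval(long x, long xp, long n) {
    return Interval{ std::max(1L, x), std::min(n + 1, xp) };
}
```

For instance, with n = 10, the interval −2 : 4 clamps to 1 : 4, and 8 : 15 clamps to 8 : 11, matching the convention that out-of-range coordinates simply drop out (equivalently, contribute the identity e).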

If the lower bound is missing, as in : x′, we interpret the interval as 1 : x′, and similarly, if the upper bound is missing, as in x : , it denotes the interval [x, n]. If both bounds are missing, as in : , we interpret the interval as the whole coordinate range [1, n].

We can use colon notation when indexing a tensor 풜 to define subtensors, or boxes. For example, 풜[3 : 5, 4 : 6] denotes the elements of 풜 at coordinates (3, 4), (3, 5), (4, 4), (4, 5). For full generality, a box B cornered at coordinates x = (x1, x2, ..., xd) and x′ = (x′1, x′2, ..., x′d), where xi < x′i for all i = 1, 2, ..., d, is the box (x1 : x′1, x2 : x′2, ..., xd : x′d). Given a box size k = (k1, ..., kd), a k-box cornered at coordinate x is the box cornered at x and x′ = (x1 + k1, x2 + k2, ..., xd + kd). A (tensor) row is a box with a single value in each coordinate position in the colon notation, except for one position, which includes that entire dimension. For example, if x = (x1, x2, ..., xd) is a coordinate of a tensor 풜, then 풜[x1, x2, ..., x_{i−1}, : , x_{i+1}, x_{i+2}, ..., xd] denotes a row along dimension i.

The colon notation can be combined with the reduction operator ⊕ to indicate the reduction of all elements in a subtensor:

    ⊕ 풜[x1 : x′1, x2 : x′2, ..., xd : x′d]
        = ⊕_{y1 ∈ [x1, x′1)} ⊕_{y2 ∈ [x2, x′2)} ··· ⊕_{yd ∈ [xd, x′d)} 풜[y1, y2, ..., yd] .

Problem Definitions  We can now formalize the included- and excluded-sums problems from Section 1.

Definition 1. (Included and Excluded Sums)  An algorithm for the included-sums problem takes as input a d-dimensional tensor 풜 : U → S with size N and a box size k = (k1, k2, ..., kd). It produces a new tensor 풜′ : U → S such that every output element 풜′[x] holds the reduction under ⊕ of the elements within the k-box of 풜 cornered at x. An algorithm for the excluded-sums problem is defined similarly, except that the reduction is of the elements outside the k-box cornered at x.

In other words, an included-sums algorithm computes, for all x = (x1, x2, ..., xd) ∈ U, the value 풜′[x] = ⊕ 풜[x1 : x1 + k1, x2 : x2 + k2, ..., xd : xd + kd]. It's messier to write the output of an excluded-sums problem using colon notation, but fortunately, our proofs do not rely on it.

As we have noted in Section 1, there are weak and strong versions of both problems, which allow and do not allow an operator inverse, respectively.

Prefix and Suffix Sums  The prefix-sums operation [4] takes an array a = (a1, a2, ..., an) of n elements and returns the "running sum" b = (b1, b2, ..., bn), where

(2.1)    b_k = { a_1              if k = 1,
               { a_k ⊕ b_{k−1}    if k > 1.

Let Prefix denote the algorithm that directly implements the recursion in Equation (2.1). Given an array a and indices start ≤ end, the function Prefix(a, start, end) computes the prefix sum in the range [start, end] of a in O(end − start) time. Similarly, the suffix-sums operation is the reverse of the prefix sum and computes the sum right-to-left rather than left-to-right. Let Suffix(a, start, end) be the corresponding algorithm for suffix sums.

3 Included Sums

This section presents the bidirectional box-sum (BDBS) algorithm to compute the included sum along an arbitrary dimension, which is used as a main subroutine in the box-complement algorithm for excluded sums. As a warm-up, we will first describe how to solve the included-sums problem in one dimension and then extend the technique to higher dimensions. We include the one-dimensional case for clarity, but the main focus of this paper is the multidimensional case. We sketch the key subroutines in higher dimensions and omit their pseudocode and proofs from this paper because they straightforwardly extend the computation from 1 dimension; the full version of the paper includes all the pseudocode and omitted proofs for BDBS in 1D.

Included Sums in 1D  Before investigating the included sums in higher dimensions, let us first turn our attention to the 1D case for ease of understanding. Figure 13 presents an algorithm BDBS-1D which takes as input a list A of length N and a (scalar) box size k and outputs a list A′ of corresponding included sums.¹ At a high level, the BDBS-1D algorithm generates two intermediate lists A_p and A_s, each of length N, and performs N/k prefix and suffix sums of length k on each intermediate list. By construction, for x = 1, 2, ..., N, we have A_p[x] = ⊕ A[k⌊(x − 1)/k⌋ + 1 : x + 1] and A_s[x] = ⊕ A[x : k⌈x/k⌉ + 1]. Finally, BDBS-1D uses A_p and A_s to compute the included sum of size k for each coordinate in one pass. Figure 5 illustrates the ranged prefix and suffix sums in BDBS-1D, and Figure 6 presents a concrete example of the computation.

¹ For simplicity in the algorithm descriptions and pseudocode, we assume that ni mod ki = 0 for all dimensions i = 1, 2, ..., d. In implementations, the input can either be padded with the identity to make this assumption hold, or extra code can be added to deal with unaligned boxes.

Figure 5: An illustration of the computation in the bidirectional box-sum algorithm. The arrows represent prefix and suffix sums in runs of size k, and the shaded region represents the prefix and suffix components of the region of size k outlined by the dotted lines.

Position   1   2   3   4   5   6   7   8
A          2   5   3   1   6   3   9   0
A_p        2   7  10  11   6   9  18  18
A_s       11   9   4   1  18  12   9   0
A′        11  15  13  19  18  12   9   0

Figure 6: An example of computing the 1D included sum using the bidirectional box-sum algorithm, where N = 8 and k = 4. The input array is A, the k-wise prefix and suffix sums are stored in A_p and A_s, respectively, and the output is in A′.

BDBS-1D solves the included-sums problem on an array of size N in Θ(N) time and Θ(N) space. First, it uses two temporary arrays to compute the prefix and suffix sums as illustrated in Figure 5 in Θ(N) time. It then makes one more pass through the data to compute the included sum, requiring Θ(N) time. Figure 13 in Appendix B contains the full pseudocode for BDBS-1D.

Generalizing to Arbitrary Dimensions  The main focus of this work is multidimensional included and excluded sums. Computing the included sum along an arbitrary dimension is almost exactly the same as computing it along 1 dimension in terms of the underlying ranged prefix and suffix sums. We sketch an algorithm BDBS that generalizes BDBS-1D to arbitrary dimensions.

Let 풜 be a d-dimensional tensor with N elements and let k be a box size. The BDBS algorithm computes the included sum along dimensions i = 1, 2, ..., d in turn. After performing the included-sum computation along dimensions 1, 2, ..., i, every coordinate in the output 풜_i contains the included sum in each dimension up to i:

    풜_i[x1, x2, ..., xd] = ⊕ 풜[x1 : x1 + k1, ..., xi : xi + ki, x_{i+1}, ..., xd].

Overall, BDBS computes the full included sum of a tensor with N elements in Θ(dN) time and Θ(N) space by performing the included sum along each dimension in turn.

Although we cannot directly use BDBS to solve the strong excluded-sums problem, the next sections demonstrate how to use the BDBS technique as a key subroutine in the box-complement algorithm for strong excluded sums.

4 Excluded Sums and the Box Complement

The main insight in this section is the formulation of the excluded sum as the recursive "box complement." We show how to partition the excluded region into 2d non-overlapping parts in d dimensions. This decomposition of the excluded region underlies the box-complement algorithm for strong excluded sums in the next section.

First, let's see how the formulation of the "box complement" relates to the excluded sum. At a high level, given a box B, a coordinate x is in the "i-complement" of B if and only if x is "out of range" in some dimension j ≤ i and "in the range" for all dimensions greater than i.

Definition 2. (Box Complement)  Given a d-dimensional coordinate domain U and a dimension i ∈ {1, 2, ..., d}, the i-complement of a box B cornered at coordinates x = (x1, ..., xd) and x′ = (x′1, ..., x′d) is the set

    C_i(B) = {(y1, ..., yd) ∈ U : there exists j ∈ [1, i] such that y_j ∉ [x_j, x′_j),
              and for all m ∈ [i + 1, d], y_m ∈ [x_m, x′_m)}.

Given a box B, the reduction of all elements at coordinates in C_d(B) is exactly the excluded sum with respect to B. The box complement recursively partitions an excluded region into disjoint sets of coordinates.

Theorem 4.1. (Recursive Box-Complement)  Let B be a box cornered at coordinates x = (x1, ..., xd) and x′ = (x′1, ..., x′d) in some coordinate domain U. The i-complement of B can be expressed recursively in terms of the (i − 1)-complement of B as follows:

    C_i(B) = ( : , ..., : , : x_i , x_{i+1} : x′_{i+1}, ..., x_d : x′_d )
           ∪ ( : , ..., : , x′_i : , x_{i+1} : x′_{i+1}, ..., x_d : x′_d )
           ∪ C_{i−1}(B),

in which each of the first two terms has i − 1 unconstrained ( : ) positions followed by d − i in-range positions, and

where C_0(B) = ∅.

Proof. For simplicity of notation, let RHS_i(B) be the right-hand side of the equation in the statement of Theorem 4.1. Let y = (y1, ..., yd) be a coordinate. In order to show the equality, we will show that y ∈ C_i(B) if and only if y ∈ RHS_i(B).

Forward direction: y ∈ C_i(B) → y ∈ RHS_i(B). We proceed by case analysis when y ∈ C_i(B). Let j ≤ i be the highest dimension at which y is "out of range," that is, where y_j < x_j or y_j ≥ x′_j.

Case 1: j = i. Definition 2 and j = i imply that either y_i < x_i or y_i ≥ x′_i, and that x_m ≤ y_m < x′_m for all m > i. By definition, y_i < x_i implies y ∈ ( : , ..., : , : x_i , x_{i+1} : x′_{i+1}, ..., x_d : x′_d ). Similarly, y_i ≥ x′_i implies y ∈ ( : , ..., : , x′_i : , x_{i+1} : x′_{i+1}, ..., x_d : x′_d ). These are exactly the first two terms in RHS_i(B).

Case 2: j < i. Definition 2 and j < i imply that y ∈ C_{i−1}(B).

Backwards direction: y ∈ RHS_i(B) → y ∈ C_i(B). We again proceed by case analysis.

Case 1: y ∈ ( : , ..., : , : x_i , x_{i+1} : x′_{i+1}, ..., x_d : x′_d ) or y ∈ ( : , ..., : , x′_i : , x_{i+1} : x′_{i+1}, ..., x_d : x′_d ). Definition 2 implies y ∈ C_i(B) because there exists some j ≤ i (in this case, j = i) such that y_j < x_j or y_j ≥ x′_j, and x_m ≤ y_m < x′_m for all m > i.

Case 2: y ∈ C_{i−1}(B). Definition 2 implies that there exists j in the range 1 ≤ j ≤ i − 1 such that y_j < x_j or y_j ≥ x′_j and that for all m ≥ i, we have x_m ≤ y_m < x′_m. Therefore, y ∈ C_{i−1}(B) implies y ∈ C_i(B), since there exists some j ≤ i (in this case, j < i) where y_j < x_j or y_j ≥ x′_j, and x_m ≤ y_m < x′_m for all m > i.

Therefore, C_i(B) can be recursively expressed as RHS_i(B). ∎

In general, unrolling the recursion in Theorem 4.1 yields 2d disjoint partitions that exactly comprise the excluded sum with respect to a box.

Corollary 4.1. (Excluded-Sum Components)  The excluded sum can be represented as the union of 2d disjoint sets of coordinates as follows:

    C_d(B) = ⋃_{i=1}^{d} ( ( : , ..., : , : x_i , x_{i+1} : x′_{i+1}, ..., x_d : x′_d )
                         ∪ ( : , ..., : , x_i + k_i : , x_{i+1} : x′_{i+1}, ..., x_d : x′_d ) ),

where in the term for index i the first i − 1 positions are the full range ( : ).

We use the box-complement formulation in the next section to efficiently compute the excluded sums on a tensor by reducing over disjoint regions of the tensor.

5 Box-Complement Algorithm

This section describes and analyzes the box-complement algorithm for strong excluded sums, which efficiently implements the dimension reduction in Section 4. The box-complement algorithm relies heavily on prefix, suffix, and included sums as described in Sections 2 and 3. Given a d-dimensional tensor 풜 of size N and a box size k, the box-complement algorithm solves the excluded-sums problem with respect to k for coordinates in 풜 in Θ(dN) time and Θ(N) space. Appendix C contains all omitted pseudocode and proofs for the serial box-complement algorithm.

Algorithm Sketch  At a high level, the box-complement algorithm proceeds by dimension reduction. That is, the algorithm takes d dimension-reduction steps, where each step adds two of the components from Corollary 4.1 to each element in the output tensor. In the ith dimension-reduction step, the box-complement algorithm computes the i-complement of B (Definition 2) for all coordinates in the tensor by performing a prefix and suffix sum along the ith dimension and then performing the BDBS technique along the remaining d − i dimensions. After the ith dimension-reduction step, the box-complement algorithm operates on a tensor of d − i dimensions because i dimensions have been reduced so far via prefix sums. Figure 7 presents an example of the dimension reduction in 2 dimensions, and Figure 8 illustrates the recursive box-complement in 3 dimensions.

Prefix and Suffix Sums  In the ith dimension-reduction step, the box-complement algorithm uses prefix and suffix sums along the ith dimension to reduce the elements "out of range" along the ith dimension in the i-complement. That is, given a tensor 풜 of size N = n1 · n2 ··· nd and a number i < d of dimensions reduced so far, we define a subroutine Prefix-Along-Dim(풜, i) that fixes the first i dimensions at n1, ..., ni (respectively) and then computes the prefix sum along dimension i + 1 for all remaining rows in dimensions i + 2, ..., d. The pseudocode for Prefix-Along-Dim(풜, i) can be found in Figure 14 in Appendix C, and the proof that it incurs O(∏_{j=i+1}^{d} n_j) time can be found in Appendix C.

The subroutine Prefix-Along-Dim computes the reduction of elements "out of range" along dimension i. That is, after Prefix-Along-Dim(풜, i), for each coordinate x_{i+1} = 1, 2, ..., n_{i+1} along dimension i + 1, every

Figure 7: Steps for computing the excluded sum in 2 dimensions with included sums on prefix and suffix sums. The steps are labeled in the order they are computed. The 1-complement (a) prefix and (b) suffix steps perform a prefix and suffix sum along dimension 1 and an included sum along dimension 2. The numbers in (a), (b) represent the order of subroutines in those steps. The 2-complement (c) prefix and (d) suffix steps perform a prefix and suffix sum on the reduced array, denoted by the blue rectangle, from step (a). The red box denotes the excluded region, and solid lines with arrows denote prefix or suffix sums along a row or column. The long dashed line represents the included sum along each column.

coordinate in the (dimension-reduced) output 풜′ contains the prefix up to that coordinate in dimension i + 1:

    풜′[n_1, ..., n_i, x_{i+1}, x_{i+2}, ..., x_d] = ⨁ 풜[n_1, ..., n_i, : x_{i+1} + 1, x_{i+2}, ..., x_d].

Since the similar subroutine Suffix-Along-Dim has almost exactly the same analysis and structure, we omit its discussion.

BDBS-Along-Dim computes the included sum for each row along a specified dimension after dimension reduction. Let 풜 be a d-dimensional tensor, k be a box size, i be the number of reduced dimensions so far, and j > i be the dimension to compute the included sum along. BDBS-Along-Dim(풜, k, i, j) computes the included sum along the j-th dimension for all rows (n_1, ..., n_i, :, ..., :). That is, for each coordinate x = (n_1, ..., n_i, x_{i+1}, ..., x_d), the output tensor 풜′ contains the included sum along dimension j:

    풜′[x] = ⨁ 풜[n_1, ..., n_i, x_{i+1}, ..., x_j, x_{j+1} : x_{j+1} + k_{j+1}, x_{j+2}, ..., x_d].

BDBS-Along-Dim(풜, k, i, j) takes Θ(∏_{ℓ=i+1}^d n_ℓ) time because it iterates over (∏_{ℓ=i+1}^d n_ℓ)/n_{j+1} rows and runs in Θ(n_{j+1}) time per row. It takes Θ(N) space using the same technique as BDBS-1D.

Figure 8: An example of the recursive box-complement in 3 dimensions with dimensions labeled 1, 2, 3. The subfigures (a), (b), and (c) illustrate the 1-, 2-, and 3-complement, respectively: the full prefix/suffix runs along dimensions 1; 1, 2; and 1, 2, 3, and the included sums along dimensions 2, 3; 3; and none, respectively. The blue region represents the coordinates inside the box, and the regions outlined by dotted lines represent the partitions defined by Corollary 4.1. For each partition, the face against the edge of the tensor is highlighted in green.

Adding in the Contribution. Each dimension-reduction step must add its respective contribution to each element in the output. Given an input tensor 풜 and output tensor 풜′, both of size N, the function Add-Contribution takes Θ(N) time to add in the contribution with a pass through the tensors. The full pseudocode can be found in Figure 15 in Appendix C.
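For intuition, the effect of an included sum along one dimension of a tensor (what BDBS-Along-Dim produces) can be sketched in a few lines of Python. This illustrative helper is our own (0-indexed, with + standing in for ⊕); for clarity it uses a direct sliding-window sum rather than the Θ(N) blockwise prefix/suffix technique, and it truncates windows at the boundary.

```python
import numpy as np

def included_sum_along_axis(A, k, axis):
    """Windowed (included) sum of length k along one axis of a tensor:
    out[..., x, ...] = sum of A over coordinates x, x+1, ..., x+k-1
    along `axis`, truncated at the boundary."""
    A = np.moveaxis(A, axis, -1)        # bring the target axis last
    n = A.shape[-1]
    out = np.zeros_like(A)
    for x in range(n):
        out[..., x] = A[..., x:min(x + k, n)].sum(axis=-1)
    return np.moveaxis(out, -1, axis)
```

For example, with A = [[0, 1, 2], [3, 4, 5]] and k = 2 along axis 1, each entry becomes the sum of itself and its right neighbor (the last column is a truncated window).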

Included Sums. In the i-th dimension-reduction step, the box-complement algorithm uses the BDBS technique along the remaining dimensions to reduce the elements "in range" in the i-complement. That is, given a tensor 풜 of size N = n_1 n_2 ⋯ n_d and a number i < d of dimensions reduced so far, we define a subroutine BDBS-Along-Dim.

Putting It All Together. Finally, we will see how to use the previously defined subroutines to describe and analyze the box-complement algorithm for excluded sums. Figure 9 presents pseudocode for the box-complement algorithm. Each dimension-reduction step has a corresponding prefix and suffix step to add in the two components in the recursive box-complement. Given an input tensor 풜 of size N, the box-complement algorithm takes Θ(N) space because all of its subroutines use at most a constant number of temporaries of size N, as seen in Figure 9.

Box-Complement(풜, k)
    // Input: Tensor 풜 with d dimensions, box size k.
    // Output: Tensor 풜′ with size and dimensions matching 풜, containing the excluded sum.
    init 풜′ with the same size as 풜
    풜_p ← 풜; 풜_s ← 풜
    for i ← 1 to d                       // Current dimension-reduction step
        // 풜_p saved from the previous dimension-reduction step, reduced up to dimension i − 1
        풜_s ← 풜_p                        // Save input to suffix step
        // PREFIX STEP
        Prefix-Along-Dim along dimension i on 풜_p    // Reduced up to i dimensions
        풜 ← 풜_p                          // Save for next round
        for j ← i + 1 to d               // Do included sum on dimensions [i + 1, d]
            BDBS-Along-Dim on dimension j in 풜_p     // 풜_p reduced up to i dimensions
        Add-Contribution from 풜_p into 풜′            // Add into result
        // SUFFIX STEP
        Suffix-Along-Dim along dimension i in 풜_s    // Do suffix sum along dimension i
        for j ← i + 1 to d               // Do included sum on dimensions [i + 1, d]
            BDBS-Along-Dim on dimension j in 풜_s     // 풜_s reduced up to i dimensions
        Add-Contribution from 풜_s into 풜′            // Add into result
    return 풜′

Figure 9: Pseudocode for the box-complement algorithm. For ease of presentation, we omit the exact parameters to the subroutines and describe their function in the algorithm. The pseudocode with parameters can be found in Figure 16.

Given a tensor 풜 as input, the box-complement algorithm solves the excluded-sums problem by computing the recursive box-complement components from Corollary 4.1. By construction, for dimension i ∈ [1, d], the prefix-sum part of the i-th dimension-reduction step outputs a tensor 풜_p such that for all coordinates x = (x_1, ..., x_d), we have

    풜_p[x_1, ..., x_d] = ⨁ 풜[:, ..., :, : x_{i+1}, x_{i+2} : x_{i+2} + k_{i+2}, ..., x_d : x_d + k_d],

where the first i coordinates are unconstrained. Similarly, the suffix-sum step constructs a tensor 풜_s such that for all x,

    풜_s[x_1, ..., x_d] = ⨁ 풜[:, ..., :, x_{i+1} + k_{i+1} :, x_{i+2} : x_{i+2} + k_{i+2}, ..., x_d : x_d + k_d].

We can now analyze the performance of the box-complement algorithm.

Theorem 5.1. (Time of Box-Complement) Given a d-dimensional tensor 풜 of size N = n_1 · n_2 ⋯ n_d, Box-Complement solves the excluded-sums problem in Θ(dN) time.

Proof. We analyze the prefix step (since the suffix step is symmetric, it has the same running time). Let i ∈ {1, ..., d} denote a dimension. The i-th dimension-reduction step in Box-Complement involves 1 prefix step and (d − i) included-sum calls, each of which takes O(∏_{j=i}^d n_j) time. Furthermore, adding in the contribution at each dimension-reduction step takes Θ(N) time. The total time over d steps is therefore Θ( ∑_{i=1}^d ( (d − i + 1) ∏_{j=i}^d n_j + N ) ). Adding in the contribution is clearly Θ(dN) in total.

Next, we bound the runtime of the prefix and included sums. In each dimension-reduction step, reducing the number of dimensions of interest exponentially decreases the size of the considered tensor. That is, dimension reduction exponentially reduces the size of the input: ∏_{j=i}^d n_j ≤ N/2^{i−1}. The total time required to compute the box-complement components is therefore

    ∑_{i=1}^d (d − i + 1) ∏_{j=i}^d n_j ≤ ∑_{i=1}^d (d − i + 1) N/2^{i−1} ≤ 2(d + 2^{−d} − 1)N = Θ(dN).

Therefore, the total time of Box-Complement is Θ(dN).

6 Experimental Evaluation

This section presents an empirical evaluation of strong and weak excluded-sums algorithms. In 3 dimensions, we compare strong excluded-sums algorithms: specifically, we evaluate the box-complement algorithm and variants of the corners algorithm and find that the box-complement outperforms the corners algorithm given

similar space. Furthermore, we compare weak excluded-sums algorithms in higher dimensions. Lastly, to simulate a reduction operator more expensive than numeric addition, we compare the box-complement algorithm and variants of the corners algorithm using an artificial slowdown.

Experimental Setup. We implemented all algorithms in C++. We used the Tapir/LLVM [27] branch of the LLVM [23, 24] compiler (version 9) with the -O3, -march=native, and -flto flags. All experiments were run on an 8-core 2-way hyperthreaded Intel Xeon CPU E5-2666 v3 @ 2.90GHz with 30GB of memory from AWS [1]. For each test, we took the median of 3 trials. To gather empirical data about space usage, we interposed malloc and free. The theoretical space usage of the different algorithms can be found in Table 1.

Strong Excluded Sums in 3D. Figure 3 summarizes the results of our evaluation of the box-complement and corners algorithms in 3 dimensions with a box length of k_1 = k_2 = k_3 = 4 (for a total box volume of K = 64) and number of elements N = 681472. We tested with varying N but found that the time and space per element were flat (full results in Appendix D). We found that the box-complement algorithm outperforms the corners algorithm by about 1.4× when given similar amounts of space, though the corners algorithm with 2× the space of the box-complement algorithm was 2× faster.

We explored two different methods of using extra space in the corners algorithm based on the computation tree of prefixes and suffixes: (1) storing the spine of the computation tree to asymptotically reduce the running time, and (2) storing the leaves of the computation tree to reduce passes through the output. Although storing the leaves does not asymptotically affect the behavior of the corners algorithm, we found that reducing the number of passes through the output has significant effects on empirical performance. Storing the spine did not improve performance, because the runtime is dominated by the number of passes through the output.

Weak Excluded Sums in Higher Dimensions. Figure 4 presents the results of our evaluation of weak excluded-sum algorithms in higher dimensions. For all dimensions i = 1, 2, ..., d, we set the box length k_i = 8 and chose a number of elements N to be a perfect power of dimension i. Table 1 presents the asymptotic runtime of the different excluded-sum algorithms.

The weak naive algorithm for excluded sums with nested loops outperforms all of the other algorithms up to 2 dimensions because its runtime depends on the box volume, which is low in smaller dimensions. Since its runtime grows exponentially with the box length, however, we limited it to 5 dimensions.

The summed-area table algorithm outperforms the BDBS and box-complement algorithms up to 6 dimensions, but its runtime scales exponentially in the number of dimensions.

Finally, the BDBS and box-complement algorithms scale linearly in the number of dimensions and outperform both naive and summed-area table methods in higher dimensions. Specifically, the box-complement algorithm outperforms the summed-area table algorithm by between 1.3× and 4× after 6 dimensions. The BDBS algorithm demonstrates an advantage to having an inverse: it outperforms the box-complement algorithm by 1.1× to 4×. Therefore, the BDBS algorithm dominates the box-complement algorithm for weak excluded sums.

Excluded Sums with Different Operators. Most of our experiments used numeric addition for the ⊕ operator. Because some applications, such as FMM, involve much more costly ⊕ operators, we studied how the excluded-sum algorithms scale with the cost of ⊕. To do so, we added a tunable slowdown to the invocation of ⊕ in the algorithms. Specifically, the algorithms call an unoptimized implementation of the standard recursive Fibonacci computation. By varying the argument to the Fibonacci function, we can simulate ⊕ operators that take different amounts of time.

Figure 10: The scalability of excluded-sum algorithms as a function of the cost of operator ⊕ on a 3D domain of N = 4096 elements. The horizontal axis is the time in nanoseconds to execute ⊕. The vertical axis represents the time per element of the given algorithm divided by the time for ⊕. We inflated the time of ⊕ using increasingly large arguments to the standard recursive implementation of a Fibonacci computation.

Figure 10 summarizes our findings. We ran the algorithms on a 3D domain of N = 4096 elements. (Although this domain may seem small, Appendix D shows that the results are relatively insensitive to domain size.) For inexpensive ⊕ operators, the box-complement algorithm is the second fastest, but as the cost of ⊕ increases, the box-complement algorithm dominates. The reason for this outcome is that box-complement performs approximately 12 ⊕ operations per element in 3D, whereas the most efficient corners algorithm performs about 22 ⊕ operations. As ⊕ becomes more costly, the time spent executing ⊕ dominates the other bookkeeping overhead.
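The tunable ⊕ slowdown can be sketched in a few lines; this Python sketch is our own (the actual experimental harness is in C++, and the function names are ours), but it follows the scheme described above: each invocation of ⊕ first runs an unoptimized recursive Fibonacci whose argument controls the cost.

```python
def fib(m):
    """Deliberately unoptimized recursive Fibonacci, used only to burn time."""
    return m if m < 2 else fib(m - 1) + fib(m - 2)

def make_slow_add(m):
    """Return a ⊕ operator whose cost grows with m: each call evaluates
    fib(m) before performing an ordinary addition."""
    def op(a, b):
        fib(m)          # cost knob: larger m means a slower operator
        return a + b
    return op
```

Varying m sweeps the per-operation cost while leaving the algorithms' ⊕-counts unchanged, which is what makes the operation-count comparison above visible empirically.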

7 Conclusion

In this paper, we introduced the box-complement algorithm for the excluded-sums problem, which improves the running time of the state-of-the-art corners algorithm from Ω(2^d N) to Θ(dN). The space usage of the box-complement algorithm is independent of the number of dimensions, while the corners algorithm may use space dependent on the number of dimensions to achieve its running-time lower bound. The three new algorithms from this paper parallelize straightforwardly. In the work/span model [10], all three algorithms are work-efficient, achieving Θ(dN) work. The BDBS and BDBSC algorithms achieve Θ(d log N) span, and the box-complement algorithm achieves Θ(d^2 log N) span.

Acknowledgments

The idea behind the box-complement algorithm was originally conceived jointly with Erik Demaine many years ago, and we thank him for helpful discussions. Research was sponsored by the United States Air Force Research Laboratory and the United States Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

References

[1] Amazon. Amazon Web Services. https://aws.amazon.com/, 2021.
[2] Rick Beatson and Leslie Greengard. A short course on fast multipole methods. Wavelets, Multilevel Methods and Elliptic PDEs, pages 1–37, 1997.
[3] Pierre Blanchard, Nicholas J. Higham, and Theo Mary. A class of fast and accurate summation algorithms. SIAM Journal on Scientific Computing, 42(3):A1541–A1557, 2020.
[4] Guy E. Blelloch. Prefix sums and their applications. Technical Report CMU-CS-90-190, School of Computer Science, Carnegie Mellon University, November 1990.
[5] Guy E. Blelloch, Daniel Anderson, and Laxman Dhulipala. ParlayLib: a toolkit for parallel algorithms on shared-memory multicore machines. Pages 507–509, 2020.
[6] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55–69, 1996.
[7] Derek Bradley and Gerhard Roth. Adaptive thresholding using the integral image. Journal of Graphics Tools, 12(2):13–21, 2007.
[8] Hongwei Cheng, William Y. Crutchfield, Zydrunas Gimbutas, Leslie F. Greengard, J. Frank Ethridge, Jingfang Huang, Vladimir Rokhlin, Norman Yarvin, and Junsheng Zhao. A wideband fast multipole method for the Helmholtz equation in three dimensions. Journal of Computational Physics, 216(1):300–325, 2006.
[9] Ronald Coifman, Vladimir Rokhlin, and Stephen Wandzura. The fast multipole method for the wave equation: A pedestrian prescription. IEEE Antennas and Propagation Magazine, 35(3):7–12, 1993.
[10] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, 3rd edition, 2009.
[11] Franklin C. Crow. Summed-area tables for texture mapping. Pages 207–212, 1984.
[12] Eric Darve. The fast multipole method: numerical implementation. Journal of Computational Physics, 160(1):195–240, 2000.
[13] E. D. Demaine, M. L. Demaine, A. Edelman, C. E. Leiserson, and P. Persson. Building blocks and excluded sums. SIAM News, 38(4):1–5, 2005.
[14] S. Fraser, H. Xu, and C. E. Leiserson. Work-efficient parallel algorithms for accurate floating-point prefix sums. In 2020 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–7, 2020.
[15] Sean Fraser. Computing included and excluded sums using parallel prefix. Master's thesis, Massachusetts Institute of Technology, 2020.
[16] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. J. Comput. Phys., 135(2):280–292, August 1997.
[17] Nail A. Gumerov and Ramani Duraiswami. Fast Multipole Methods for the Helmholtz Equation in Three Dimensions. Elsevier, 2005.
[18] Justin Hensley, Thorsten Scheuermann, Greg Coombe, Montek Singh, and Anselmo Lastra. Fast summed-area table generation and its applications. In Computer Graphics Forum, volume 24, pages 547–555. Wiley Online Library, 2005.
[19] Nicholas J. Higham. The accuracy of floating point summation. SIAM J. Scientific Computing, 14:783–799, 1993.
[20] Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2nd edition, 2002.
[21] Intel Corporation. Intel Cilk Plus Language Specification, 2010. Document Number: 324396-001US. Available from http://software.intel.com/sites/products/cilk-plus/cilk_plus_language_specification.pdf.
[22] Peter M. Kogge and Harold S. Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Transactions on Computers, 100(8):786–793, 1973.
[23] Chris Lattner. LLVM: An infrastructure for multi-stage optimization. Master's thesis, Computer Science Dept., University of Illinois at Urbana-Champaign, Urbana, IL, December 2002.
[24] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In International Symposium on Code Generation and Optimization (CGO), page 75, Palo Alto, California, March 2004.
[25] William B. March and George Biros. Far-field compression for fast kernel summation methods in high dimensions. Applied and Computational Harmonic Analysis, 43(1):39–75, 2017.
[26] Tao B. Schardl, William S. Moses, and Charles E. Leiserson. Tapir: Embedding fork-join parallelism into LLVM's intermediate representation. In ACM SIGPLAN Notices, volume 52, pages 249–265. ACM, 2017.
[27] Tao B. Schardl, William S. Moses, and Charles E. Leiserson. Tapir: Embedding recursive fork-join parallelism into LLVM's intermediate representation. ACM Transactions on Parallel Computing (TOPC), 6(4):1–33, 2019.
[28] Xiaobai Sun and Nikos P. Pitsianis. A matrix version of the fast multipole method. SIAM Review, 43(2):289–300, 2001.
[29] Ernesto Tapia. A note on the computation of high-dimensional integral images. Pattern Recognition Letters, 32(2):197–201, 2011.
[30] Gernot Ziegler. Summed area computation using ripmap of partial sums. GPU Technology Conference (talk), 2012.

A Analysis of Corners Algorithm

This section presents an analysis of the time and space usage of the corners algorithm [13] for the excluded-sums problem. The original article that proposed the corners algorithm did not include an analysis of its performance. As we will see, the runtime of the corners algorithm is a function of the space it is allowed.

Algorithm Description. Given a d-dimensional tensor 풜 of size N and a box B, the corners algorithm partitions the excluded region C_d(B) into 2^d disjoint regions corresponding to the corners of the box. Each excluded sum is the sum of the reductions of each of the corresponding 2^d regions. The corners algorithm computes the reduction of each partition with a combination of prefix and suffix sums over the entire tensor and saves work by reusing prefixes and suffixes in overlapping regions. Figure 11 illustrates an example of the corners algorithm.

Figure 11: An example of the corners algorithm in 2 dimensions on an n_1 × n_2 matrix using a (k_1, k_2)-box cornered at (x_1, x_2). The grey regions represent excluded regions computed via prefix and suffix sums, and the black boxes correspond to the corner of each region with the relevant contribution. The labels PP, PS, SP, SS represent the combination of prefixes and suffixes corresponding to each vertex.

We can represent each length-d combination of prefixes and suffixes as a length-d binary string where a 0 or 1 in the i-th position corresponds to a prefix or suffix (respectively) at depth i. As illustrated in Figure 12, the corners algorithm defines a computation tree where each node represents a combination of prefixes and suffixes, and each edge from depth i − 1 to i represents a full prefix or suffix along dimension i. The total height of this computation tree is d, so there are 2^d leaves.

Figure 12: The dependency tree of computations in the corners algorithm. P and S represent full-tensor prefix and suffix sums, respectively. Each leaf is a string of length d that denotes a combination of prefix and suffix sums along the entire tensor.

Analysis. The most naive implementation of the corners algorithm that computes every root-to-leaf path without reusing computation between paths takes Θ(N) space, but Θ(dN) time per leaf, for total time Θ(d 2^d N). We will see how to use extra space to reuse computation between paths and reduce the total time.

Theorem A.1. (Time/Space Tradeoff) Given a multiplicative space allowance c such that 1 ≤ c ≤ d, the corners algorithm solves the excluded-sums problem in Θ((2^{c+1} + 2^d (d − c) + 2^d) N) time if it is allowed Θ(cN) space.

Proof. The corners algorithm must traverse the entire computation tree in order to compute all of the leaves. If it follows a depth-first traversal of the tree, one possible use of the Θ(cN) extra allowed space is to keep the intermediate combinations of prefixes and suffixes at the first c internal nodes along the current root-to-leaf path in the traversal. We will analyze this scheme in terms of (1) the amount of time that each leaf requires independently, and (2) the total shared work between leaves. The total time of the algorithm is the sum of these two components.

Independent work: For each leaf, if the first c prefixes and suffixes have been computed along its root-to-leaf path, there are an additional (d − c) prefix and suffix computations required to compute that leaf. Therefore, each leaf takes Θ((d − c)N) additional time outside of the shared computation, for a total of Θ(2^d (d − c) N) time.

Shared work: The remaining time of the algorithm is the amount of time it takes to compute the higher levels of the tree up to depth c given a c factor in space. Given a node v at depth c with position i such that 1 ≤ i < c, the amount of time it takes to compute the intermediate sums along the root-to-leaf path to v depends on the difference in the bit representation between i and i − 1. Specifically, if i and i − 1 differ in b bits, it takes bN

13 Copyright © 2021 by SIAM Unauthorized reproduction of this article is prohibited additional time to store the intermediate sums for node 푖 at depth 푐. In general, the number of nodes that differ in 푏 ∈ {1, 2, . . . , 푐} positions at depth 푐 is 2푐−푏. Therefore, the total time of computing the intermediate sums is

    ∑_{b=1}^{c} N b 2^{c−b} ≈ 2^{c+1} N = Θ(2^c N).
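The approximation follows from the standard geometric-series identity ∑_{b≥1} b/2^b = 2; a short derivation (our presentation):

```latex
\sum_{b=1}^{c} b\,2^{c-b}
  \;=\; 2^{c}\sum_{b=1}^{c}\frac{b}{2^{b}}
  \;<\; 2^{c}\sum_{b=1}^{\infty}\frac{b}{2^{b}}
  \;=\; 2^{c}\cdot 2
  \;=\; 2^{c+1}.
```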

Putting it together: Each leaf also requires Θ(N) time to add in the contribution. Therefore, the total time is

    Θ( 2^c N + 2^d (d − c) N + 2^d N ),

where the three terms correspond to the shared work, the independent work, and adding in the contribution, respectively.

The time of the corners algorithm is lower bounded by Ω(2^d N) and minimized when c = Θ(d). Given Θ(N) space, the corners algorithm solves the excluded-sums problem in O(d 2^d N) time. Given Θ(dN) space, the corners algorithm solves the excluded-sums problem in O(2^d N) time.
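The 2^d leaves of the computation tree in Figure 12 can be enumerated as the length-d strings over {P, S}; a small Python sketch (our own naming):

```python
from itertools import product

def corner_labels(d):
    """The 2^d leaves of the corners computation tree: every length-d
    choice of P (prefix) or S (suffix), one letter per dimension."""
    return ["".join(choice) for choice in product("PS", repeat=d)]
```

For example, corner_labels(2) yields the four labels PP, PS, SP, SS of Figure 11, and the list doubles in length with each added dimension, which is the source of the Ω(2^d N) lower bound above.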

B Included Sums Appendix

BDBS-1D(A, N, k)
    // Input: List A of size N and included-sum length k.
    // Output: List A′ of size N where each entry A′[i] = ⨁ A[i : i + k] for i = 1, 2, ..., N.
    allocate A′ with N slots
    A_p ← A; A_s ← A
    for i ← 1 to N/k
        Prefix(A_p, (i − 1)k + 1, ik)      // k-wise prefix sum along A_p
        Suffix(A_s, (i − 1)k + 1, ik)      // k-wise suffix sum along A_s
    for i ← 1 to N                          // Combine into result
        if i mod k = 1
            A′[i] ← A_s[i]
        else
            A′[i] ← A_s[i] ⊕ A_p[i + k − 1]
    return A′

Figure 13: Pseudocode for the 1D included sum.

Lemma B.1. (Correctness in 1D) BDBS-1D solves the included-sums problem in 1 dimension.

Proof. Consider a list A with N elements and box length k. We will show that for each x = 1, 2, ..., N, the output A′[x] contains the desired sum. For x mod k = 1, this holds by construction. For all other x, the previously defined prefix and suffix sums give the desired result. Recall that A′[x] = A_p[x + k − 1] + A_s[x], that A_s[x] = A[x : ⌈(x + 1)/k⌉ · k], and that A_p[x + k − 1] = A[⌊(x + k − 1)/k⌋ · k : x + k]. Also note that for all x mod k ≠ 1, ⌊(x + k − 1)/k⌋ = ⌈(x + 1)/k⌉. Therefore,

    A′[x] = A_p[x + k − 1] + A_s[x]
          = A[x : ⌈(x + 1)/k⌉ · k] + A[⌊(x + k − 1)/k⌋ · k : x + k]
          = A[x : x + k],

which is exactly the desired sum.

Lemma B.2. (Time and Space in 1D) Given an input array A of size N and box length k, BDBS-1D takes Θ(N) time and Θ(N) space.

Proof. The total time of the prefix and suffix sums is O(N), and the loop that aggregates the result into A′ has N iterations of O(1) time each. Therefore, the total time of BDBS-1D is Θ(N). Furthermore, BDBS-1D uses two temporary arrays of size N each for the prefix and suffix, for total space Θ(N).
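A runnable Python sketch of the BDBS-1D technique (our own, 0-indexed, with + standing in for ⊕ and windows truncated at the end of the array; the paper's 1-indexed pseudocode is the reference version):

```python
def bdbs_1d(A, k):
    """Windowed (included) sums out[i] = A[i] + ... + A[i+k-1] without
    using an inverse operator, via blockwise prefix and suffix sums.
    Assumes len(A) is a multiple of k; windows are truncated at the end."""
    n = len(A)
    assert n % k == 0
    Ap = A[:]                                # blockwise prefix sums
    As = A[:]                                # blockwise suffix sums
    for b in range(0, n, k):
        for j in range(b + 1, b + k):
            Ap[j] = Ap[j - 1] + Ap[j]        # prefix within block
        for j in range(b + k - 2, b - 1, -1):
            As[j] = As[j] + As[j + 1]        # suffix within block
    out = [0] * n
    for i in range(n):
        if i % k == 0 or i + k - 1 >= n:
            out[i] = As[i]                   # window lies within one block
        else:
            out[i] = As[i] + Ap[i + k - 1]   # tail of this block + head of the next
    return out
```

Each element is touched a constant number of times, matching Lemma B.2's Θ(N) bound, and only associativity of + is used.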

C Excluded Sums Appendix

Prefix-Along-Dim(풜, i)
    // Input: Tensor 풜 (d dimensions, side lengths (n_1, ..., n_d)), dimension i to do the prefix sum along.
    // Output: Modify 풜 to do the prefix sum along dimension i + 1, fixing dimensions up to i.
    // Iterate through coordinates by varying coordinates in dimensions i + 2, ..., d
    // while fixing the first i dimensions. Blanks mean they are not iterated over in the outer loop.
    for {x = (x_1, ..., x_d) ∈ (n_1, ..., n_i, _, :, ..., :)}
        // Prefix sum along the row (can be replaced with a parallel prefix)
        for ℓ ← 2 to n_{i+1}
            풜[n_1, ..., n_i, ℓ, x_{i+2}, ..., x_d] ⊕= 풜[n_1, ..., n_i, ℓ − 1, x_{i+2}, ..., x_d]

Figure 14: Prefix sum along all rows along a dimension with initial dimensions fixed.

The suffix sum along a dimension is almost exactly the same, so we omit it.

Add-Contribution(풜, ℬ, i, offset)
    // Input: Input tensor 풜, output tensor ℬ, fixing dimensions up to i.
    // Output: For all coordinates in ℬ, add the relevant contribution from 풜.
    for {(x_1, ..., x_d) ∈ (:, ..., :)}
        if x_{i+1} + offset ≤ n_{i+1}
            ℬ[x_1, ..., x_d] ⊕= 풜[n_1, ..., n_i, x_{i+1} + offset, x_{i+2}, ..., x_d]

Figure 15: Adding in the contribution.

Lemma C.1. (Time of Prefix Sum) Prefix-Along-Dim(풜, i) takes O(∏_{j=i+1}^d n_j) time.

Proof. The outer loop over dimensions i + 2, ..., d has max(1, ∏_{j=i+2}^d n_j) iterations, each with Θ(n_{i+1}) work for the inner prefix sum. Therefore, the total time is O(∏_{j=i+1}^d n_j).
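The effect of Prefix-Along-Dim can be sketched with NumPy (our own naming, 0-indexed; the fixed leading dimensions are pinned to their last coordinate n_1, ..., n_i, and + stands in for ⊕):

```python
import numpy as np

def prefix_along_dim(A, i):
    """Sketch of Prefix-Along-Dim(A, i): fix the first i dimensions at
    their last coordinate and take a running (prefix) sum along the next
    dimension for every remaining row. Modifies A in place."""
    sub = A[(-1,) * i]                 # view of shape (n_{i+1}, ..., n_d)
    sub[...] = np.cumsum(sub, axis=0)  # prefix sum along dimension i + 1
    return A
```

For a 2 × 2 × 2 tensor and i = 1, only the slab at the last coordinate of dimension 1 is rewritten, with running sums taken along dimension 2 — a constant amount of work per touched element, consistent with Lemma C.1.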

Lemma C.2. (Adding Contribution) Add-Contribution takes Θ(푁) time.

// Input: Tensor 풜 of d dimensions and side lengths (n_1, ..., n_d), output tensor ℬ,
//        side lengths of the excluded box k = (k_1, ..., k_d) with k_i ≤ n_i for all i = 1, 2, ..., d.
// Output: Tensor ℬ with size and dimensions matching 풜 containing the excluded sum.
풜′ ← 풜; 풜_p ← 풜; 풜_s ← 풜                  // Prefix and suffix temporaries
for i ← 1 to d                               // Current dimension-reduction step
    // PREFIX STEP
    // At this point, 풜_p should hold prefixes up to dimension i − 1.
    풜′ ← 풜_p
    풜_s ← 풜_p                               // Save the input to the suffix step
    Prefix-Along-Dim(풜′, i − 1)              // Do prefix sum along dimension i
    풜_p ← 풜′                                // Save prefix up to dimension i in 풜_p
    for j ← i + 1 to d                       // Do included sum on dimensions [i + 1, d]
        BDBS-Along-Dim(풜′, i − 1, j, k)
    Add-Contribution(풜′, ℬ, i, −1)           // Add into result
    // SUFFIX STEP
    풜′ ← 풜_s                                // Start with the prefix up until dimension i − 1
    Suffix-Along-Dim(풜′, i − 1)              // Do suffix sum along dimension i
    for j ← i + 1 to d                       // Do included sum on dimensions [i + 1, d]
        BDBS-Along-Dim(풜′, i − 1, j, k)
    Add-Contribution(풜′, ℬ, i − 1, k_i)      // Add into result

Figure 16: Pseudocode for the box-complement algorithm with parameters filled in. For the 푖th dimension-reduction step, the copy of temporaries only needs to copy the last 푑 − 푖 + 1 dimensions due to the dimension reduction.

D Additional experimental data

The data in this appendix was generated with the experimental setup described in Section 6.

Figure 18: Space per element of algorithms for strong excluded sums in 3D.

Figure 17: Time per element of algorithms for strong excluded sums in 3D.

Figure 19: Space and time per element of the corners and box-complement algorithms in 3 dimensions, with an artificial slowdown added to each numeric addition (or ⊕) that dominates the runtime.
