Designing Efficient Parallel Prefix Sum Algorithms for GPUs

Gabriele Capannini
Information Science and Technology Institute of the National Research Council, Pisa, Italy
Email: [email protected]

Abstract—This paper presents a novel and efficient method to compute one of the simplest and most useful building blocks for parallel algorithms: the parallel prefix sum operation. Besides its practical relevance, the problem is of further interest for parallel-computation theory. We first describe step-by-step how the prefix sum operation is computed in parallel on GPUs. Next we propose a more efficient technique specifically developed for modern graphics processors and similar architectures. Our technique performs the computation in a way that minimizes both memory conflicts and memory usage. Finally we evaluate, theoretically and empirically, all the considered solutions in terms of efficiency, space complexity, and computational time. The theoretical analysis relies on K-model, a computational model we proposed in a previous work. Concerning the experiments, the results show that the proposed solution obtains better performance than the existing ones.

I. INTRODUCTION

Nowadays most processing unit architectures are made up of several cores. We omit the discussion about the correctness of using the same term core for all these architectures, and focus on the relevance of the prefix sum operation¹ as a building block to effectively exploit the parallelism. Given a list of elements, after scan execution each position in the resulting list contains the sum of the items in the operand list up to its index. In [1], Blelloch defines prefix sum as follows:

Definition 1: Let ⊕ be a binary associative operator with identity I. Given an ordered set of n elements

  [ a_0, a_1, ..., a_{n−1} ]

the exclusive scan operation returns the ordered set

  [ I, a_0, (a_0 ⊕ a_1), ..., (a_0 ⊕ ... ⊕ a_{n−2}) ]

¹ Also known as scan, prefix reduction, or partial sum.
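As a concrete point of reference for the parallel variants analyzed below, the following C fragment is a minimal sequential implementation of Definition 1, assuming ⊕ is integer addition with identity 0 (the function name is ours):

    /* Sequential exclusive scan with operator + and identity 0.
     * out[i] = a[0] + ... + a[i-1], with out[0] = 0. */
    void exclusive_scan(const int *a, int *out, int n)
    {
        int acc = 0;                 /* the identity I of the operator */
        for (int i = 0; i < n; i++) {
            out[i] = acc;            /* sum of a[0..i-1] */
            acc += a[i];             /* fold in the i-th element */
        }
    }

This is the O(n) sequential bound against which the parallel algorithms discussed below are measured.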
In the literature, many application fields exploit scan to merge the partial results obtained by different computations. In particular, general purpose GPU algorithms largely apply this operation as an atomic step of larger solutions. For example, some of the fastest sorting algorithms (e.g. Radixsort [2], [3], and others) perform the scan operation many times to reorder the values belonging to different blocks. Ding et al. [4] use prefix sum to implement their PForDelta decoder for accessing postings lists, and a number of other applications refer to this operation [5], [6].

Concerning the computational complexity on a sequential processor, the prefix sum operation is linear in the size of the input list, whereas when the computation is performed in parallel the complexity becomes logarithmic [7], [8], [9]. On the one hand, the sequential solution is easier to implement and works in place; as a consequence, if the available degree of parallelism is low, approaches based on the sequential solution can still be applied without major performance loss. On the other hand, as the degree of parallelism grows, an efficient prefix sum computation becomes a key aspect of the overall performance. This is the case of Graphics Processing Units (GPUs), Sony's Cell BE processor, and next-generation CPUs. They consist of an array of single-instruction multiple-data (SIMD) processors and allow a huge number of arithmetic operations to be executed at the same time. Related to SIMD is stream processing [10], a programming paradigm that allows some applications to easily use multiple computational units by restricting the parallel synchronization and communication that can be performed among those units: given a set of data (stream), a sequence of operations (kernel) is computed independently on each stream element. Furthermore, due to the performance constraints related to the complex memory hierarchy, the design of efficient algorithms for these processors requires more effort than for standard CPUs.

An important contribution of this paper is to provide a comparison between the main existing solutions for computing prefix sum and our original access pattern. In particular, for each solution, we provide both theoretical and experimental analysis. We describe and justify the main existing algorithms within K-model [11], a computational model defined for novel massively parallel computing architectures such as GPUs and similar processors. The theoretical results point out that our approach efficiently satisfies all the performance constraints featured by K-model, and show the benefits and the limits of each considered solution. Finally, the experimental results validate the theoretical ones by showing that our access schema is up to 25% faster than the other considered solutions.

The paper is organized as follows: Section II resumes the features and the performance constraints characterizing the target architectures. Section III describes K-model with reference to the architectural details previously introduced. Section IV analyses the solutions that can be considered suitable for GPUs or for computational models exposing performance constraints similar to the main ones considered by K-model (e.g. QRQW PRAM [12]). In Section V we provide some general considerations to define a metric for the evaluation of tree-based solutions. In Section VI our access schema is defined. Finally, Section VII and Section VIII show, respectively, the theoretical analysis and the experimental results of the different solutions.
II. GPU ARCHITECTURAL DETAILS

Modern processors, in particular GPUs, can compute hundreds of arithmetical operations at the same time. The key features shared by these architectures are the integration of several computing cores on a single chip, grouped into SIMD processors [13].

In this paper we focus on GPUs. These devices are built around a scalable array of SIMD processors, each of which comprises: one instruction unit, a collection of scalar processors, a set of private registers, and a shared data-memory. In more detail, the shared data-memory is divided into independent modules, which can be accessed concurrently. If an instruction issues two memory requests falling in the same shared data-memory module, there is a conflict, and the accesses have to be serialized, penalizing the latency of the instruction. Each device is also equipped with an external memory that is off-chip and can be addressed by all SIMD processors during the computation. To access external memory data, a processor performs a set of memory transactions. Each transaction transfers the content of a fixed-size segment of contiguous memory addresses, and its latency is constant for any given data size to transfer. Finally, strictly related to SIMD processors, the serial control programming is the last aspect to manage. On the one hand, this architectural solution multiplies the FLOPS over chip-area ratio; on the other hand, algorithms have to synchronize the instruction flows performed by the computational elements in order to obtain the maximum efficiency.

The architectural organization of other types of common processors is similar to that of GPUs. For example, the Cell BE architecture exposes on the same chip both a CPU and a set of SIMD processors (the SPEs), which are equipped with a local memory. In this case, all the processors share the system main memory, which plays the same role as the GPU external memory.

III. K-MODEL

In [11] we aim to find a computational model able to capture all the features described in Section II. The following table points out the features (i.e. those marked with X) that can be properly modeled within each existing model.

  Model name             serial control   shared data-memory   external memory
  PRAM [14]                    –                  –                   –
  V-RAM [15]                   X                  –                   –
  (P)BT [16]                   –                  –                   X
  QRQW PRAM [12]               –                  X                   –
  Cache Oblivious [17]         –                  –                   X

In [11] we proposed K-model as a viable solution that characterizes all the previous aspects, so as to overcome the lack of accuracy of the existing computational models.

[Figure 1. Representation of the target architecture: the instruction unit IU drives the scalar units E[1]...E[k]; the internal memory IM is split into bank[1]...bank[k] (S/k locations each), fed by the queues Q[1]...Q[k]; the external memory EM completes the hierarchy.]

The core of K-model is the evaluation of the "parallel work" performed by a SIMD processor on each stream element, disregarding the number of processors composing the real processor array. For this reason the abstract machine related to K-model (shown in Figure 1) restricts the processor array to a single SIMD processor.

The memory hierarchy consists of an external memory EM and of an internal memory IM made up of S locations equally divided into k modules accessible in parallel. The processor is composed of an array of k scalar execution units E[] linked to an instruction unit IU that supplies the instructions to the scalar units. This forces the scalar units to work in a synchronized manner, because only one instruction is issued at each processing step. The evaluation of each instruction issued by IU is based on two criteria: latency and length. Length is simply the number of scalar units that must run an instruction issued by IU, thus it lies in the range 1÷k. Latency is measured in terms of the type of instruction issued. If the instruction accesses one of the k modules composing IM, then its latency is proportional to the contention degree it generates. This means that the latency of a data-access instruction can be correctly evaluated as the maximum number of requests incoming to one of the k memory banks via the queues Q[]. Instead, the latency of arithmetical instructions is always equal to 1.
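To make the latency criterion concrete, the following C fragment (our illustration, not part of the formal definition of K-model) evaluates the latency of a single data-access instruction: it fills the k queues with the addresses issued by the scalar units and returns the maximum queue load, assuming the interleaved allocation in which the bank of an address is its value modulo k, consistent with Section VI:

    /* K-model latency of one data-access instruction: the maximum
     * number of requests queued at a single one of the k banks. */
    int access_latency(const unsigned *addr, int len, int k)
    {
        int q[64] = {0};                /* queues Q[1..k]; assumes k <= 64  */
        int latency = 0;
        for (int e = 0; e < len; e++) { /* len <= k requesting scalar units */
            int bank = addr[e] % k;     /* interleaved bank mapping         */
            if (++q[bank] > latency)
                latency = q[bank];
        }
        return latency;                 /* 1 = conflict-free, k = worst case */
    }

For example, with k units accessing addresses 0, 1, ..., k − 1 the function returns 1, whereas with addresses 0, k, 2k, ... it returns k, i.e. the accesses are fully serialized.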
Given an algorithm, the functions T() and W() evaluate the amount of work performed inside the processor in terms of latency and length of the instructions, as previously defined. In particular, T() represents the latency per instruction call, and W() the length induced by each instruction, both summed over the instructions issued. Let us note that T() can be thought of as the parallel complexity assuming a collection of k scalar processors, whereas W() can be thought of as the complexity assuming a serial RAM. Regarding EM, its latency is evaluated separately. The peculiarity of the external memory is the coalescing of the accesses: whenever the scalar units access a set of addresses, the architecture performs one memory transaction for each segment of length k to address. Since the time required to execute a transaction is constant, the function G() measures the number of memory transactions issued to access the external memory.

IV. RELATED WORK

Due to its relevance, scan has been deeply studied by researchers aiming to devise faster and more efficient solutions. However, only few of these are relevant for our purpose, namely the studies conducted for general purpose computation on GPUs. Furthermore, we attempted to include in our study also solutions proposed for the QRQW PRAM [12], because the behavior of its memory equals that of IM, which is the performance constraint mainly taken into account in this paper. Unfortunately, to the best of our knowledge, none has been found.

Horn et al. [7] propose a solution derived from the algorithm described in [9]. This solution, represented in Algorithm 1, performs O(n log n) operations. Since a sequential scan performs O(n) sums, the log n extra factor can have a significant effect on performance, hence the solution is considered work-inefficient. Note that the algorithm in [9] is designed for the Connection Machine architecture [18], which is different from ours, with an execution time dominated by the number of performed steps rather than by the work complexity. From this point of view, i.e. considering the number of loops performed at line 1 of Algorithm 1, the solution has a step-complexity equal to O(log n).

Algorithm 1 Work-inefficient scan.
Require: n-size array A[]
 1: for d ← 1 to log2(n) do
 2:   for all k in parallel do
 3:     if 2^{d−1} ≤ k < n then
 4:       A[k] ← A[k] + A[k − 2^{d−1}]
 5:     end if
 6:   end for
 7: end for

Harris et al. [8] proposed a work-efficient scan algorithm that avoids the extra factor of log n. Their idea consists in two visits of the balanced binary tree built on the input array: the first one proceeds from the leaves to the root, the second starts at the root and ends at the leaves, see Figure 2. Since a binary tree with n leaves has log n levels and 2^d nodes at the generic level d ∈ [0, log n − 1], the overall number of operations to perform is O(n). In particular, in the first phase the algorithm traverses the tree from the leaves to the root computing the partial results at the internal nodes, see lines 1÷6 of Algorithm 2. After that, it traverses the tree starting at the root to build the final results in place, exploiting the previous partial sums, see lines 8÷15 of Algorithm 2.

[Figure 2. Types of tree-traversal performed by Algorithm 2: (a) bottom-up and (b) top-down.]

Algorithm 2 Work-efficient scan.
Require: n-size array A[]
 1: for d ← 0 to log2(n) − 1 do
 2:   for k ← 0 to n − 1 by 2^{d+1} in parallel do
 3:     i ← k + 2^d − 1
 4:     A[i + 2^d] ← A[i + 2^d] + A[i]
 5:   end for
 6: end for
 7: A[n − 1] ← 0
 8: for d ← log2(n) − 1 down to 0 do
 9:   for k ← 0 to n − 1 by 2^{d+1} in parallel do
10:     i ← k + 2^d − 1
11:     temp ← A[i]
12:     A[i] ← A[i + 2^d]
13:     A[i + 2^d] ← A[i + 2^d] + temp
14:   end for
15: end for

Despite its work-efficiency, Algorithm 2 is not yet fully efficient with respect to the metrics defined in Section III, i.e. the function T(). Due to the data access pattern used, the solution tends to increase the latency for accessing the shared data-memory IM. In fact, considering a generic step, different scalar processors request access to elements belonging to the same memory bank. As a consequence, most of the IM accesses are serialized, which increases the time required to perform an instruction.
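To make the pattern concrete, the following sequential C rendition of Algorithm 2 (written for this presentation; the GPU version executes each inner loop in parallel) exposes the strided indices that cause the conflicts:

    /* Work-efficient exclusive scan (Algorithm 2), n a power of two. */
    void scan_work_efficient(int *A, int n)
    {
        for (int d2 = 1; d2 < n; d2 <<= 1)          /* d2 = 2^d, lines 1-6  */
            for (int k = 0; k < n; k += 2 * d2)
                A[k + 2*d2 - 1] += A[k + d2 - 1];   /* stride 2^d: conflicts */

        A[n - 1] = 0;                               /* line 7: clear root    */

        for (int d2 = n / 2; d2 >= 1; d2 >>= 1)     /* lines 8-15            */
            for (int k = 0; k < n; k += 2 * d2) {
                int i = k + d2 - 1;
                int tmp = A[i];
                A[i] = A[i + d2];
                A[i + d2] += tmp;
            }
    }

For instance, scan_work_efficient applied to {1, 2, 3, 4} produces {0, 1, 3, 6}, the exclusive scan of the input.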
To avoid most memory bank conflicts, Harris et al. insert some empty positions in the memory representation of the input list. Specifically, their solution stores an input list element a_i at the following position of the array A[]:

  a_i ↦ A[ i + ⌊i/k⌋ ]      (1)

In this way the authors avoid the "periodicity" with which the same subset of memory banks is accessed in Algorithm 2. On the one hand, this technique balances the workload of the memory banks. On the other hand, the memory space required for representing the input grows by an extra factor equal to n/k. As proof, let us consider that an n-size list {a_0, ..., a_{n−1}} is represented by the array A as follows: a_0 ↦ A[0] and a_{n−1} ↦ A[n − 1 + ⌊(n−1)/k⌋]. So the required space is n + ⌊(n−1)/k⌋ = O(n + n/k), which corresponds to an extra factor of O(n/k). Last but not least, this solution does not scale with the size of the input list because, as shown in Section V, the technique is effective only on lists of up to k² elements.
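In C, the remapping of Equation (1) and its space overhead can be sketched as follows (the helper names are ours):

    /* Padding of Equation (1): element i lands at i + i/k, so every run
     * of k consecutive elements is followed by one empty slot. */
    unsigned pad_index(unsigned i, unsigned k)  { return i + i / k; }

    /* Space needed for n elements: the last index is pad_index(n-1, k),
     * hence n + (n-1)/k entries, i.e. an O(n/k) overhead. */
    unsigned padded_size(unsigned n, unsigned k) { return n + (n - 1) / k; }

With k = 16 banks, for example, a 2048-element list occupies 2048 + 127 = 2175 slots.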
V. PROBLEM ANALYSIS

The previous section emphasizes the weak points of Algorithm 2. In particular, such algorithm consists in traversing twice the binary tree built on the input list. Before introducing our pattern, let us point out two aspects that permit to define a "metric" for the analysis of the different solutions:

(i) Let us consider the sets of accesses made in each phase of Algorithm 2: the first phase consists of a bottom-up tree-traversal, see Figure 2a; the second phase consists of the same procedure performed in a top-down manner, see Figure 2b. The two sets of accesses are equivalent; the only difference is the reversed order adopted to traverse the tree. So, by defining an optimized access pattern for the first visit, we have a method to efficiently perform the second visit too. In other words, if a pattern generates a conflict in the shared memory by addressing a pair of elements in the first visit, the same conflict will occur, for the same pair, in the second visit too. As a consequence, we can reduce the analysis of an access pattern to one phase: in our case we chose the bottom-up tree-traversal.

(ii) The second point concerns the operations performed on the shared data-memory, i.e. load and store. By induction on the tree levels: the nodes storing the results of the operations performed at level i correspond to the operands to add at level i + 1. Thus, by reducing the number of conflicts generated to store the results at a generic level of the tree, we automatically reduce the conflicts generated to read the operands at the next level of the visit. This second aspect permits us to focus our analysis on only one type of operation (i.e. store) instead of considering both stores and loads.

Resuming, in the rest of the paper we face the problem of defining a data access pattern able to equally spread sequences of k store operations over k different memory banks.

VI. OUR DATA ACCESS PATTERN

This section describes our data access pattern. Its main advantages are: (i) our solution is optimal in space², namely it needs O(1) extra space; (ii) the proposed schema perfectly balances the workload among the internal memory modules for any given input size; (iii) the solution is based on the tree-based algorithm and inherits its work efficiency.

² This result is even more relevant considering that real architectures exploit small-sized shared data-memories.

In order to reduce the memory contention, our solution redirects the sum results so as to refer to distinct memory modules, instead of applying a function like (1) to stretch the memory representation of the input list. Our schema works by considering subsets of k consecutive store operations. As shown in Figure 3, given a sequence of k sums to calculate, the algorithm stores the result of each of the first k/2 sums by replacing the value of the operand with the smallest index (denoted as left-sum), and it stores the results of the remaining k/2 sums at the position of the operand with the greatest index (denoted as right-sum). Then we iteratively apply this pattern to each k-size sequence at all levels of the tree. When the tree-traversal is ending, the number of nodes belonging to a tree level becomes smaller than k and the algorithm performs only left-sums.

[Figure 3. Representation of our pattern applied to a tree with n = 16 leaves allocated in a memory of k = 4 banks.]

In this manner, each sequence of k operations addresses k distinct memory banks. From the K-model point of view, this means eliminating the contention on IM. Equation (2) defines the approach depicted in Figure 3, namely it discriminates between left-sums and right-sums. Basically, this function consists in permuting the set of memory banks related to a sequence of sums to compute:

  LEFTRIGHT_h(x, m) = RST_h(x · 2^m) + ROT_h^m(x)      (2)

where the first addend is the base and the second the offset of the resulting index. Let k be a power of 2. Given an element at position i, the least significant log k bits of i identify the memory bank where the element is stored. Equation (2) consists in applying a left circular-shift to the least significant log k bits of the element positions involved in the computation. In particular, ROT is applied to a sequence of 2·k addresses referring to k pairs of operands. If the addresses in the initial sequence are uniformly spread over k distinct memory banks, we obtain a new sequence of k positions belonging to k distinct memory banks, arranged as Figure 3 shows. Since at the first step of the tree traversal the operands are already equally spread among the k banks, no conflicts are generated in accessing them. Therefore, by applying Equation (2) we maintain the desired property, i.e. we avoid the conflicts on the memory banks during the execution of Algorithm 2.

In particular, ROT_h^m(x) stands for computing m times the left circular-shift of the h least significant bits of the binary representation of x. For example, ROT_3^1(12) = ROT_3^1(1100) = ROT_3^1(100) = 001 = 1, and ROT_3^2(2) = 001 = 1. The RST_h(x) operator resets the h least significant bits of the binary representation of x; for example, RST_3(13) = RST_3(1101) = 1000 = 8. Resuming, an index resulting from Equation (2) is composed of two distinct parts: the first part, referred to as base, specifies the most significant part of the address; the least significant log k bits, referred to as offset, are obtained by applying the circular-shift. The related pseudo-code is shown in Algorithm 3.

Algorithm 3 Conflict-free bottom-up tree visit.
 1: for d ← 0 to log2(n) − 1 do
 2:   for i ← 0 to n/2^{d+1} − 1 in parallel do
 3:     x ← A[ LEFTRIGHT_{log k}(2i, d) ]
 4:     y ← A[ LEFTRIGHT_{log k}(2i + 1, d) ]
 5:     A[ LEFTRIGHT_{log k}(i, d + 1) ] ← x + y
 6:   end for
 7: end for

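As a worked instance of Equation (2), take n = 16 and k = 4 (hence h = log k = 2), the same parameters as Figure 3. At the first level (d = 0) the result of the i-th sum is stored at LEFTRIGHT_2(i, 1) = RST_2(2i) + ROT_2^1(i); for i = 0, ..., 7 this yields the addresses 0, 2, 5, 7, 8, 10, 13, 15, whose banks (address mod 4) are 0, 2, 1, 3, 0, 2, 1, 3. Each group of k consecutive stores thus touches k distinct banks, and within each group the first k/2 results overwrite the operand with the smallest index (left-sums) while the remaining k/2 overwrite the operand with the greatest index (right-sums).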
This section ends with some notes about the implementation of our access function on the tested GPUs. The SDK [19] provided with the tested devices consists in an extension of the C language. However, neither C nor C++ have a statement for the circular-shift of a binary word. Nevertheless, ROT_h^m(x) can be emulated via:

 1: x &= (1 << h) - 1;
 2: return ((x << m) | (x >> (h - m))) & ((1 << h) - 1);

and the operation RST_h(x) can be emulated via:

 1: return x & (0xFFFFFFFF << h);

Since the rotation is periodic with period h, shift amounts larger than h can first be reduced modulo h.

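Putting the two emulations together, the whole index transformation of Equation (2) can be sketched in plain C as follows (the function names and the reduction of the shift amount modulo h are our choices):

    /* Left circular-shift of the h least significant bits of x, m times. */
    static unsigned rot(unsigned x, unsigned h, unsigned m)
    {
        m %= h;                                 /* rotation has period h */
        x &= (1u << h) - 1;
        if (m == 0)
            return x;
        return ((x << m) | (x >> (h - m))) & ((1u << h) - 1);
    }

    /* Reset the h least significant bits of x. */
    static unsigned rst(unsigned x, unsigned h)
    {
        return x & (0xFFFFFFFFu << h);
    }

    /* Equation (2): address of node x at level m, with h = log k. */
    unsigned leftright(unsigned x, unsigned m, unsigned h)
    {
        return rst(x << m, h) + rot(x, h, m);   /* base + offset */
    }

In Algorithm 3, the operands of the i-th sum at level d are then read from leftright(2*i, d, h) and leftright(2*i+1, d, h), and the result is written to leftright(i, d+1, h).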
VII. THEORETICAL EVALUATION

In this section we evaluate the solutions presented in Section IV and Section VI by deriving the complexities discussed in Section III, i.e. G(), W() and T(). Firstly we show the complexities of the three tree-based solutions, which differ only in the data access pattern: the naïve version represented by Algorithm 2, the version exploiting padding, and the one based on our novel schema. These solutions are referred to in the remainder of the paper as NoPadding, Padding and LeftRight, respectively. Concerning T(), the analysis of the different solutions is conducted with the metric introduced in Section V, i.e. taking into account the number of conflicts generated by storing the results of the operations performed during the bottom-up tree traversal. Then we analyze the method proposed for the Connection Machine architecture and adapted to GPUs in [7]; this is referred to as CM.

Concerning G(n, k), which measures the efficiency of the external-memory data transfers, it equals n/k for all the referred methods. In fact, the allocation in the external memory does not need any particular arrangement because it is already optimal: since the input elements are coalesced in a vector, each memory transaction fully exploits the communication bandwidth by transferring k elements at a time.

Concerning W(n), let us observe that the number of operations performed by the tree-based solutions does not change by varying the access pattern; as already pointed out, these methods differ only in the schema applied for mapping the operands into the internal data-memory IM. Analyzing lines 1÷6 of Algorithm 2 and Figure 2a, it can be seen that the number of nodes to compute halves at each step. Let n denote the input size and, without loss of generality, suppose n is a power of 2. The overall number of operations performed to traverse the tree is given by (3), and it corresponds to W() too:

  W(n) = Σ_{i=1}^{log n} n/2^i = n − 1 = O(n)      (3)

To define the efficiency of the solutions, let ℓ denote the number of instructions issued by IU, defined as follows:

  ℓ_{n,k} = Σ_{i=log k}^{log n − 1} 2^i/k + log k = (n − k)/k + log k      (4)

The first addend, denoted ℓ₁, counts the instructions issued for the generic tree level d ≥ log k: since such a level consists of 2^d nodes, IU issues 2^d/k instructions for it. In the last log k steps of the tree visit the efficiency halves together with the number of sums to perform; nevertheless IU issues one instruction per step, which gives the second addend ℓ₂ = log k.

With both W(n) and ℓ we can evaluate the efficiency of a solution as W(n) divided by k·ℓ, see Equation (5). Without loss of generality, we suppose the parameter k, inherent to K-model, is a power of 2 and k < n.

  p̄ = W(n)/(k · ℓ_{n,k}) = (n − 1)/(k · log k + n − k)      (5)

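For concreteness, here are Equations (4) and (5) instantiated with n = 2048 and k = 16, i.e. the maximum block size and the bank count used in the experiments of Section VIII: ℓ = (2048 − 16)/16 + log 16 = 127 + 4 = 131 issued instructions, and p̄ = 2047/(16 · 131) = 2047/2096 ≈ 0.98, meaning that in the conflict-free case almost all of the k scalar units are busy during almost every issued instruction.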
Concerning the CM method, as argued in Section IV, the number of operations performed by Algorithm 1 is O(n log n) for any n-length input list; therefore W() equals O(n log n).

As the last indicator we consider T(), which denotes the time required to compute a stream element. It depends on the number of conflicts generated by addressing the internal memory IM, so we analyze one solution at a time.

NoPadding: Let n be the input size and d ∈ [1, log n] indicate a step of the loop represented at lines 1÷6 of Algorithm 2. The number of operations to compute at each step is n/2^d, and the displacement between the positions storing two consecutive results is 2^d; as a consequence, the involved queues Q[] are those whose index is a multiple of 2^d modulo k. At each step the number of operations to perform halves and the maximum queue size doubles, hence the time to perform a step is unvaried: the workload halves together with the number of working memory modules. Since k requests are processed at a time, the time needed to perform each of the first log k steps is n/k. Only in the last log n − log k steps the maximum queue size does not double, because the number of adds becomes too small. Finally, T_p̄(n, k) is:

  T_p̄(n, k) = Σ_{d=1}^{log k} (n/2^d)·(2^d/k) + Σ_{d=log k+1}^{log n} n/2^d
            = log k · (n/k) + Σ_{d=1}^{log n − log k} 2^{d−1}
            = n · (log k + 1)/k − 1

Padding: This method is fully efficient until the displacement applied to the indexes is smaller than the number of memory modules, i.e. k in K-model; for greater values, the function (1) generates the same conflicts in IM as the NoPadding algorithm does. In particular, let us consider a tree traversal performed on an array of more than 2·k elements. Due to the padding, the memory representations of two consecutive segments of k elements are divided by an empty entry. At the end of the traversal performed on such segments, namely after log k steps, the technique is no longer effective; as a consequence, the computation of the remaining levels generates the same contention degree as the NoPadding method. Hence we get the peak of IM bandwidth while computing the first log k steps, because all operations are equally spread over the memory banks, see (6). Note that such formula is straightforwardly derived from Equation (3): when the contention is null, the latency of the overall accesses corresponds to W()/k.

  f_p(n, k) = (1/k) · Σ_{i=1}^{log k} n/2^i      (6)

For the remaining steps the complexity function is the same as in the NoPadding case, i.e. T_p̄(). The complexity of Padding therefore corresponds to the following expression, obtained by merging f_p() and T_p̄():

  T_p(n, k) = f_p(n, k) + T_p̄(n/k, k)

LeftRight: As argued in Section VI, this memory access pattern was developed to avoid all memory bank conflicts, so T() is:

  T_lr(n, k) = log k + Σ_{i=log k}^{log n − 1} 2^i/k = log k + (n − k)/k

This result corresponds to Equation (4), i.e. the number of instructions issued by IU, and it is the result we expected: when the set of accesses corresponding to an instruction does not generate any memory conflict, the parallel complexity computed in K-model is asymptotically equal to the number of instructions issued, namely ℓ.
CM: The solution performs log n steps for a given n-size array. Let d ∈ [0, log n − 1] denote a step of Algorithm 1; the method performs n − 2^d operations per step. Resuming, the overall amount of sums, corresponding to W(), is:

  W_cm(n) = Σ_{d=0}^{log n − 1} (n − 2^d) = n · log n − n + 1

In K-model, this solution splits the set of operations performed at each step into sessions of k operations. Each session can be implemented so as to avoid all conflicts on the memory modules: in practice, we temporarily store the sums of the operands loaded from IM into a local variable and, when all the sums have been computed, we proceed by writing the results back into IM. As a consequence of that:

  T_cm(n, k) = W_cm(n)/k

Finally, the following table summarizes the space complexity of each considered algorithm. Three of them perform scan in linear total space, i.e. with O(1) extra space, and only Padding requires an extra factor, for the reasons discussed in Section IV.

  Pattern Name   Space Complexity
  CM             O(1)
  NoPadding      O(1)
  Padding        O(n/k)
  LeftRight      O(1)

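The closed forms derived in this section can also be compared numerically. The following short C program (ours, purely illustrative, and valid under the stated assumptions k < n and n/k ≥ k) evaluates the four T() functions for k = 16 and growing n, anticipating the comparison plotted in Figure 4:

    #include <math.h>
    #include <stdio.h>

    /* Closed forms of Section VII; n and k are powers of two. */
    static double t_nopad(double n, double k) { return n*(log2(k)+1)/k - 1; }
    static double t_lr(double n, double k)    { return log2(k) + (n-k)/k; }
    static double t_cm(double n, double k)    { return (n*log2(n) - n + 1)/k; }
    static double t_pad(double n, double k)   /* f_p(n,k) + T_nopad(n/k,k) */
    {
        double fp = (n / k) * (1.0 - 1.0 / k); /* (1/k) * sum n/2^i */
        return fp + t_nopad(n / k, k);
    }

    int main(void)
    {
        const double k = 16;
        for (double n = 512; n <= 1048576; n *= 4)
            printf("n=%8.0f NoPadding=%10.0f Padding=%10.0f "
                   "LeftRight=%10.0f CM=%12.0f\n",
                   n, t_nopad(n, k), t_pad(n, k), t_lr(n, k), t_cm(n, k));
        return 0;
    }

Already at n = 2048 the program reports 639 (NoPadding), 159 (Padding), 131 (LeftRight) and about 1280 (CM), consistent with the analysis above.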
VIII. EXPERIMENTS

The experiments have been conducted using CUDA SDK 3.1 on an NVIDIA GTX 275 video device. Due to the limited size of the on-chip shared data-memory, and considering the standard 32-bit word size, each stream element can store at most 2048 integers. The conducted experiments aim at measuring the time required to perform the scan operation, for all the solutions, in two different cases: input lists fitting in a single stream block, and large arrays spanning many blocks.

Figure 4 is the log-log plot of the T(n, k=16) functions computed for all the solutions taken into account. The parameter k of K-model has been set to 16 because it is a realistic value corresponding to the hardware used for the tests, which makes the theoretical and the experimental results comparable. The figure points out that, as long as Padding performs conflict-free accesses, its performance is comparable to that of the LeftRight solution. However, the padding technique does not scale well to larger problem sizes: it is effective only while the first log k steps are performed, namely on trees of at most k² elements, and for greater sizes most of the time is wasted due to the serialization of the accesses. CM is the lowest performing solution for all problem sizes.

[Figure 4. Expected behavior of the different solutions for k = 16, a realistic value related to the hardware used for the tests (log-log plot).]

We first tested the different approaches by computing an input list that can be stored in only one stream block, varying the block size in the range allowed by the architecture, up to the maximum of 2048 integers. Moreover, to obtain stable results, each test consists of a stream of 10752 blocks; see Figure 5. The goal of this test is to compare all the approaches on a single SIMD processor, i.e. in the way scan is exclusively used in [4], [5].

[Figure 5. Elapsed time for scan on stream blocks of varying size (block sizes from 2^5 to 2^11, times in msecs).]

In order to compare the different approaches also on the computation of large arrays of arbitrary dimensions (e.g. the large lists scanned in sorting algorithms [2], [3]), we extend the experiments as follows. The basic idea is simple: we divide the large array into blocks, each of which can be scanned within a single stream element, we scan the blocks, and we write the total sum of each block into another array of block-sums. Then we scan the block-sums, generating an array of partial results that are used to "update" all the elements in their respective blocks. In more detail, let n be the number of elements of the input list and b the number of elements processed in a block; we allocate n/b blocks of b elements each. We use the first part of the scan algorithm (i.e. lines 1÷6 of Algorithm 2) to compute each block independently, storing the resulting scans into sequential locations of the output array. We make one minor modification to the scan algorithm: before resetting the last element of block j (i.e. line 7 of Algorithm 2), we store that value, the total sum of block j, into an auxiliary array partial[]. Then we scan partial[] in the same manner, and finally we use partial[j] to initialize the computation of block j of the output, using a kernel performing the second part of the scan (i.e. lines 8÷15 of Algorithm 2). A sketch of this decomposition is given below.
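The following sequential C sketch reproduces the decomposition just described; in the actual experiments each loop over blocks corresponds to a kernel launched over the stream of blocks, and scan_large with its helpers up_sweep and down_sweep are our names for the phases of Algorithm 2, not the paper's:

    /* Large-array scan: n elements as n/b blocks of b each (powers of two). */
    static void up_sweep(int *A, int b)          /* lines 1-6 of Algorithm 2  */
    {
        for (int d2 = 1; d2 < b; d2 <<= 1)
            for (int j = 0; j < b; j += 2 * d2)
                A[j + 2*d2 - 1] += A[j + d2 - 1];
    }

    static void down_sweep(int *A, int b)        /* lines 8-15 of Algorithm 2 */
    {
        for (int d2 = b / 2; d2 >= 1; d2 >>= 1)
            for (int j = 0; j < b; j += 2 * d2) {
                int i = j + d2 - 1, t = A[i];
                A[i] = A[i + d2];
                A[i + d2] += t;
            }
    }

    void scan_large(int *A, int *partial, int n, int b)
    {
        int nblocks = n / b;

        /* Up-sweep each block; its total sum ends up in its last cell. */
        for (int j = 0; j < nblocks; j++) {
            up_sweep(&A[j * b], b);
            partial[j] = A[j * b + b - 1];       /* total sum of block j */
        }

        /* Exclusive scan of the block sums (sequential in this sketch). */
        int acc = 0;
        for (int j = 0; j < nblocks; j++) {
            int t = partial[j];
            partial[j] = acc;
            acc += t;
        }

        /* Seed each block with its scanned block sum instead of the 0
         * written by line 7 of Algorithm 2, then down-sweep. */
        for (int j = 0; j < nblocks; j++) {
            A[j * b + b - 1] = partial[j];
            down_sweep(&A[j * b], b);
        }
    }

Seeding the root of each block with partial[j] makes the down-sweep produce partial[j] plus the local exclusive scan, i.e. the global exclusive scan of the whole array.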
The results shown in Figure 5 are aligned with the considerations made at the end of Section VII: our solution has proved to be faster than its competitors. In particular, on an input of 2048 items, our solution is 25% faster than Padding, which was considered the fastest solution for GPUs. This is mainly due to our novel schema, which is able to fully exploit the parallelism of the architecture; from the K-model point of view, this means reducing T() with respect to the other methods.

In Figure 6, which concerns the computation of large arrays, the benefits of applying our solution are reduced but still permit to outperform the competitors, i.e. we are 18% faster. The main reason is the increased number of communications between the processor array and the off-chip memory, which dominates the computational time.

[Figure 6. Elapsed time for exclusive scan on large arrays of varying size (input sizes from 2^19 to 2^25, times in msecs).]

IX. CONCLUSION

The scan operation is a simple and powerful parallel primitive with a broad range of applications. This paper uses K-model as a tool for the analysis of the efficiency and the complexity of algorithms performing the prefix sum operation on GPUs. Driven by the definitions of the K-model complexities, we have defined an access pattern able to minimize the measures inherent to the internal computation performed on such model. Moreover, the newly proposed technique can be exported to computational models that share some of the K-model features, e.g. QRQW PRAM [12]. Experiments confirm the theoretical results, reducing the time for computing each stream element by up to 25% with respect to the other methods existing in the literature.

ACKNOWLEDGEMENTS

This research is part of the activity of the Action IC0805: Open European Network for High Performance Computing on Complex Environments.
REFERENCES

[1] G. Blelloch, "Prefix sums and their applications," Synthesis of Parallel Algorithms, pp. 35–60, 1990.
[2] N. Satish, M. Harris, and M. Garland, "Designing efficient sorting algorithms for manycore GPUs," in Proc. IEEE International Parallel and Distributed Processing Symposium, 2009, pp. 1–10.
[3] D. Cederman and P. Tsigas, "A practical quicksort algorithm for graphics processors," in ESA '08. Berlin, Heidelberg: Springer-Verlag, 2008.
[4] S. Ding, J. He, H. Yan, and T. Suel, "Using graphics processors for high performance IR query processing," in WWW '09: Proceedings of the 18th International Conference on World Wide Web. ACM, 2009, pp. 421–430. [Online]. Available: http://dx.doi.org/10.1145/1526709.1526766
[5] J. Hensley, T. Scheuermann, G. Coombe, M. Singh, and A. Lastra, "Fast summed-area table generation and its applications," Computer Graphics Forum, vol. 24, pp. 547–555, 2005.
[6] V. Vineet, P. Harish, and S. Patidar, "Fast minimum spanning tree for large graphs on the GPU," 2005.
[7] D. Horn, "Stream reduction operations for GPGPU applications," in GPU Gems 2. Addison-Wesley, 2005, pp. 573–589.
[8] M. Harris, S. Sengupta, Y. Zhang, and J. D. Owens, "Scan primitives for GPU computing," in Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware. Aire-la-Ville, Switzerland: Eurographics Association, 2007, pp. 97–106. [Online]. Available: http://portal.acm.org/citation.cfm?id=1280094.1280110
[9] W. D. Hillis and G. L. Steele, Jr., "Data parallel algorithms," Commun. ACM, vol. 29, no. 12, pp. 1170–1183, 1986.
[10] B. Khailany, W. J. Dally, U. J. Kapasi, and P. Mattson, "Imagine: Media processing with streams," IEEE Micro, vol. 21, no. 2, 2001.
[11] G. Capannini, R. Baraglia, and F. Silvestri, "K-model: A new computational model for stream processors," in 12th International Conference on High Performance Computing and Communications (HPCC10), Melbourne, Australia, September 2010.
[12] P. B. Gibbons, Y. Matias, and V. Ramachandran, "The QRQW PRAM: accounting for contention in parallel algorithms," in SODA '94. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 1994.
[13] V. Kumar, Introduction to Parallel Computing. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2002.
[14] S. Fortune and J. Wyllie, "Parallelism in random access machines," in STOC '78. New York, NY, USA: ACM, 1978.
[15] G. E. Blelloch, Vector Models for Data-Parallel Computing. Cambridge, MA, USA: MIT Press, 1990.
[16] A. Aggarwal, A. K. Chandra, and M. Snir, "Hierarchical memory with block transfer," in SFCS '87. Washington, DC, USA: IEEE Computer Society, 1987.
[17] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran, "Cache-oblivious algorithms," in FOCS, 1999.
[18] L. W. Tucker and G. G. Robertson, "Architecture and applications of the Connection Machine," IEEE Computer, vol. 21, pp. 26–38, 1988.
[19] NVIDIA Corporation, "Programming guide 2.0." [Online]. Available: http://www.nvidia.com/object/cuda_develop.html