Encyclopedia of Parallel Computing “00101” — 2011/4/16 — 13:39 — Page 1 — #2

A

Adaptive Bitonic Sorting

Gabriel Zachmann
Clausthal University, Clausthal-Zellerfeld, Germany

Definition

Adaptive bitonic sorting is a sorting algorithm suitable for implementation on EREW parallel architectures. Similar to bitonic sorting, it is based on merging, which is recursively applied to obtain a sorted sequence. In contrast to bitonic sorting, it is data-dependent. Adaptive bitonic merging can be performed in O(n/p) parallel time, p being the number of processors, and executes only O(n) operations in total. Consequently, adaptive bitonic sorting can be performed in O(n log n / p) time, which is optimal. So, one of its advantages is that it executes a factor of O(log n) fewer operations than bitonic sorting. Another advantage is that it can be implemented efficiently on modern GPUs.

Discussion

Introduction

This chapter describes a parallel sorting algorithm, adaptive bitonic sorting [], that offers the following benefits:

● It needs only the optimal total number of comparison/exchange operations, O(n log n).
● The hidden constant in the asymptotic number of operations is less than in other optimal parallel sorting methods.
● It can be implemented in a highly parallel manner on modern architectures, such as a streaming architecture (GPUs), even without any scatter operations, that is, without random-access writes.

One of the main differences between "regular" bitonic sorting and adaptive bitonic sorting is that regular bitonic sorting is data-independent, while adaptive bitonic sorting is data-dependent (hence the name). As a consequence, adaptive bitonic sorting cannot be implemented as a sorting network, but only on architectures that offer some kind of flow control. Nonetheless, it is convenient to derive the method of adaptive bitonic sorting from bitonic sorting.

Sorting networks have a long history in computer science research (see the comprehensive survey []). One reason is that sorting networks are a convenient way to describe parallel sorting algorithms on CREW-PRAMs or even EREW-PRAMs (which are also called PRACs, for "parallel random access computer").

In the following, let n denote the number of keys to be sorted, and p the number of processors. For the sake of clarity, n will always be assumed to be a power of 2. (In their original paper [], Bilardi and Nicolau have described how to modify the algorithms such that they can handle arbitrary numbers of keys, but these technical details will be omitted in this article.)

The first to present a sorting network with optimal asymptotic complexity were Ajtai, Komlós, and Szemerédi []. Also, Cole [] presented an optimal parallel approach, for the CREW-PRAM as well as for the EREW-PRAM. However, it has been shown that neither is fast in practice for reasonable numbers of keys [, ]. In contrast, adaptive bitonic sorting requires less than 2n log n comparisons in total, independent of the number of processors. On p processors, it can be implemented in O(n log n / p) time, for p ≤ n / log n. Even with a small number of processors it is efficient in practice: in its original implementation, the sequential version of the algorithm was slower than the best sequential algorithms only by a small constant factor (for the sequence lengths tested) [].

David Padua (ed.), Encyclopedia of Parallel Computing, © Springer Science+Business Media LLC


Fundamental Properties

One of the fundamental concepts in this context is the notion of a bitonic sequence.

Definition 1 (Bitonic sequence)  Let a = (a_0, ..., a_(n−1)) be a sequence of numbers. Then, a is bitonic, iff it monotonically increases and then monotonically decreases, or if it can be cyclically shifted (i.e., rotated) to become monotonically increasing and then monotonically decreasing.

As a consequence, all index arithmetic is understood modulo n, that is, index i + k ≡ (i + k) mod n, unless otherwise noted, so indices range from 0 through n − 1.

As mentioned above, adaptive bitonic sorting can be regarded as a variant of bitonic sorting. In order to capture the notion of "rotational invariance" (in some sense) of bitonic sequences, it is convenient to define the following rotation operator.

Definition 2 (Rotation)  Let a = (a_0, ..., a_(n−1)) and j ∈ N. We define a rotation as an operator R_j on the sequence a:

  R_j a = (a_j, a_(j+1), ..., a_(j+n−1))

Figure 1 shows some examples of bitonic sequences. In the following, any reasoning about bitonic sequences will be easier to understand if one considers them as being arranged in a circle or on a cylinder: then, there are only two inflection points around the circle. This is justified by Definition 1. Figure 2 depicts an example in this manner.

The rotation operation is performed by the network shown in Fig. 4. Such networks are comprised of elementary comparators (see Fig. 3). Two other operators are convenient to describe sorting.
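The two definitions above can be made concrete in a few lines of code. The following is a small Python sketch (not from the original article; the helper names are ours) that implements the rotation operator R_j and tests bitonicity directly from Definition 1:

```python
# Sketch: a direct check of Definition 1.
# A sequence is bitonic iff some rotation of it is increasing-then-decreasing.

def rotate(a, j):
    """The rotation operator R_j: R_j a = (a_j, a_{j+1}, ..., a_{j+n-1}), indices mod n."""
    n = len(a)
    return [a[(j + i) % n] for i in range(n)]

def is_up_down(a):
    """True iff a monotonically increases and then monotonically decreases."""
    n = len(a)
    i = 0
    while i + 1 < n and a[i] <= a[i + 1]:
        i += 1
    while i + 1 < n and a[i] >= a[i + 1]:
        i += 1
    return i == n - 1

def is_bitonic(a):
    """True iff some cyclic shift of a is increasing-then-decreasing."""
    return any(is_up_down(rotate(a, j)) for j in range(len(a)))

print(is_bitonic([2, 5, 9, 6, 3, 1]))   # True: rises then falls
print(is_bitonic([6, 3, 1, 2, 5, 9]))   # True: a rotation of the above
print(is_bitonic([1, 4, 2, 5, 3, 6]))   # False: more than two inflection points
```

The quadratic check suffices for illustration; it mirrors the "circle with two inflection points" view directly.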


Adaptive Bitonic Sorting. Fig. 1  Three examples of sequences that are bitonic. Obviously, the mirrored sequences (either way) are bitonic, too


Adaptive Bitonic Sorting. Fig. 2  Left: according to their definition, bitonic sequences can be regarded as lying on a cylinder or as being arranged in a circle. As such, they consist of one monotonically increasing and one monotonically decreasing part. Middle: in this point of view, the network that performs the L and U operators (see Fig. 5) can be visualized as a wheel of "spokes." Right: visualization of the effect of the L and U operators; the blue plane represents the median


Definition 3 (Half-cleaner)  Let a = (a_0, ..., a_(n−1)). Define

  La = (min(a_0, a_(n/2)), ..., min(a_(n/2−1), a_(n−1))),
  Ua = (max(a_0, a_(n/2)), ..., max(a_(n/2−1), a_(n−1))).

In [], a network that performs these operations together is called a half-cleaner (see Fig. 5). Such networks are comprised of elementary comparator/exchange elements (see Fig. 3).

Adaptive Bitonic Sorting. Fig. 3  Comparator/exchange elements

Adaptive Bitonic Sorting. Fig. 4  A network that performs the rotation operator

Adaptive Bitonic Sorting. Fig. 5  A network that performs the L and U operators

It is easy to see that, for any j and a,

  La = R_(−j mod n/2) L R_j a,   (1)

and

  Ua = R_(−j mod n/2) U R_j a.   (2)

This is the reason why the cylinder metaphor is valid. The proof needs to consider only two cases: j = 0 and 0 < j < n. In the former case, Eq. (1) can be verified trivially. In the latter case, Eq. (1) becomes

  L R_j a = (min(a_j, a_(j+n/2)), ..., min(a_(n/2−1), a_(n−1)), ..., min(a_(j−1), a_(j−1+n/2))) = R_(j mod n/2) La.

Thus, with the cylinder metaphor, the L and U operators basically do the following: cut the cylinder with circumference n at any point, roll it around a cylinder with circumference n/2, and perform position-wise the min and max operators, respectively. Some examples are shown in Fig. 6.

The following theorem states some important properties of the L and U operators.

Theorem 1  Given a bitonic sequence a,

  max{La} ≤ min{Ua}.

Moreover, La and Ua are bitonic, too.

In other words, each element of La is less than or equal to each element of Ua. This theorem is the basis for the construction of the bitonic sorter []. The first step is to devise a bitonic merger (BM). We denote a BM that takes as input bitonic sequences of length n by BM_n. A BM is recursively defined as follows:

  BM_n(a) = (BM_(n/2)(La), BM_(n/2)(Ua)).

The base case is, of course, a two-key sequence, which is handled by a single comparator. A BM can easily be represented as a network, as shown in Fig. 7. Given a bitonic sequence a of length n, one can show that

  BM_n(a) = Sorted(a).   (3)

It should be obvious that the sorting direction can be changed simply by swapping the direction of the elementary comparators.

Coming back to the metaphor of the cylinder, the first stage of the bitonic merger in Fig. 7 can be visualized as n/2 comparators, each one connecting an element of the cylinder with the opposite one, somewhat like spokes in a wheel. Note that here, while the cylinder can rotate freely, the "spokes" must remain fixed.
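The half-cleaner and the recursive bitonic merger can be sketched as follows (a sequential Python illustration, not from the article; the function names are ours, and n is assumed to be a power of 2):

```python
# Sketch of the half-cleaner (L and U operators) and the recursive
# bitonic merger BM_n(a) = (BM_{n/2}(La), BM_{n/2}(Ua)).

def half_cleaner(a):
    """Return (La, Ua): position-wise min/max of a_i and a_{i+n/2}."""
    h = len(a) // 2
    la = [min(a[i], a[i + h]) for i in range(h)]
    ua = [max(a[i], a[i + h]) for i in range(h)]
    return la, ua

def bitonic_merge(a):
    """Sort a bitonic sequence with the data-independent (n/2) log n comparisons."""
    if len(a) == 1:
        return a
    la, ua = half_cleaner(a)     # Theorem 1: max(La) <= min(Ua), both bitonic
    return bitonic_merge(la) + bitonic_merge(ua)

b = [3, 7, 12, 14, 11, 8, 4, 1]  # bitonic: increasing then decreasing
print(bitonic_merge(b))          # [1, 3, 4, 7, 8, 11, 12, 14]
```

Note that the comparisons performed here do not depend on the data at all, which is exactly the redundancy that adaptive bitonic merging removes.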



Adaptive Bitonic Sorting. Fig. 6  Examples of the result of the L and U operators. Conceptually, these operators fold the bitonic sequence (black), such that the part from indices n/2 through n − 1 (light gray) is shifted into the range 0 through n/2 − 1; then, L and U yield the lower (dark gray) and upper (medium gray) hull, respectively


Adaptive Bitonic Sorting. Fig. 7  Schematic, recursive diagram of a network that performs bitonic merging

From a bitonic merger, it is straightforward to derive a bitonic sorter, BS_n, that takes an unsorted sequence and produces a sorted sequence, either up or down. Like the BM, it is defined recursively, consisting of two smaller bitonic sorters and a bitonic merger (see Fig. 8). Again, the base case is the two-key sequence.

Analysis of the Number of Operations of Bitonic Sorting

Since a bitonic sorter basically consists of a number of bitonic mergers, it suffices to look at the total number of comparisons of the latter. The total number of comparators, C(n), in the bitonic merger BM_n is given by

  C(n) = 2 C(n/2) + n/2, with C(2) = 1,

which amounts to

  C(n) = (n/2) log n.

As a consequence, the bitonic sorter consists of O(n log^2 n) comparators.

Clearly, there is some redundancy in such a network, since n − 1 comparisons are sufficient to merge two sorted sequences. The reason is that the comparisons performed by the bitonic merger are data-independent.

Derivation of Adaptive Bitonic Merging

The algorithm for adaptive bitonic sorting is based on the following theorem.

Theorem 2  Let a be a bitonic sequence. Then, there is an index q such that

  La = (a_q, ..., a_(q+n/2−1)),   (4)
  Ua = (a_(q+n/2), ..., a_(q+n−1)).   (5)

(Remember that index arithmetic is always modulo n.)



Adaptive Bitonic Sorting. Fig. 8  Schematic, recursive diagram of a bitonic sorting network
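The recursive structure of the bitonic sorter can be sketched in a few lines of sequential Python (ours, not from the article; n is assumed to be a power of 2): each half is sorted, one of them in the opposite direction, so that their concatenation is bitonic and can be handed to the bitonic merger.

```python
# Sketch of the bitonic sorter BS_n: sort each half in opposite directions,
# then merge the resulting bitonic sequence (nonadaptive merge shown here).

def half_cleaner(a):
    h = len(a) // 2
    return ([min(a[i], a[i + h]) for i in range(h)],
            [max(a[i], a[i + h]) for i in range(h)])

def bitonic_merge(a):
    if len(a) == 1:
        return a
    la, ua = half_cleaner(a)
    return bitonic_merge(la) + bitonic_merge(ua)

def bitonic_sort(a):
    """BS_n: recursively sort the halves, flip one, then bitonic-merge."""
    if len(a) == 1:
        return a
    h = len(a) // 2
    up = bitonic_sort(a[:h])
    down = bitonic_sort(a[h:])[::-1]   # reversing a sorted list flips its direction
    return bitonic_merge(up + down)

print(bitonic_sort([5, 1, 7, 3, 2, 8, 6, 4]))   # [1, 2, 3, 4, 5, 6, 7, 8]
```

The helpers are repeated here only to keep the snippet self-contained.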

The following outline of the proof assumes, for the sake of simplicity, that all elements in a are distinct. Let m be the median of all a_i, that is, n/2 elements of a are less than or equal to m, and n/2 elements are larger. Because of Theorem 1,

  max{La} ≤ m < min{Ua}.

Employing the cylinder metaphor again, the median m can be visualized as a horizontal plane z = m that cuts the cylinder. Since a is bitonic, this plane cuts the sequence in exactly two places, that is, it partitions the sequence into two contiguous halves. (Actually, any horizontal plane, i.e., any percentile, partitions a bitonic sequence into two contiguous halves.) Since m is the median, each half must have length n/2. The indices where the cut happens are q and q + n/2. Figure 9 shows an example (in one dimension).

Adaptive Bitonic Sorting. Fig. 9  Visualization for the proof of Theorem 2

The following theorem is the final keystone for the adaptive bitonic sorting algorithm.

Theorem 3  Any bitonic sequence a can be partitioned into four subsequences a^1, a^2, a^3, a^4 such that either

  (La, Ua) = (a^1, a^4, a^3, a^2)   (6)

or

  (La, Ua) = (a^3, a^2, a^1, a^4).   (7)

Furthermore,

  |a^1| + |a^2| = |a^3| + |a^4| = n/2,   (8)
  |a^1| = |a^3|,   (9)

and

  |a^2| = |a^4|,   (10)

where |a| denotes the length of sequence a.

Figure 10 illustrates this theorem by an example. This theorem can be proven fairly easily, too: the length of the subsequences is just q and n/2 − q, where q is the same index as in Theorem 2. Assuming that max{a^1} < m < min{a^3}, nothing will change between those two subsequences (see Fig. 9). However, in that case min{a^2} > m > max{a^4}; therefore, by swapping a^2 and a^4 (which have equal length), the bounds max{(a^1, a^4)} < m < min{(a^3, a^2)} are obtained. The other case can be handled analogously.
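Theorem 3 can be checked numerically. The following Python sketch (ours, not from the article) searches for the partition boundary |a^1| and reports which pair of subsequences was exchanged by the half-cleaner:

```python
# Numerical sketch of Theorem 3: the half-cleaner output (La, Ua) arises from a
# by exchanging two equal-length subsequences, either a2 with a4 or a1 with a3.

def half_cleaner(a):
    h = len(a) // 2
    return ([min(a[i], a[i + h]) for i in range(h)],
            [max(a[i], a[i + h]) for i in range(h)])

a = [4, 9, 15, 11, 7, 5, 2, 3]                  # bitonic
la, ua = half_cleaner(a)
n, h = len(a), len(a) // 2
for t in range(h + 1):                          # t = |a1| = |a3|
    a1, a2, a3, a4 = a[:t], a[t:h], a[h:h + t], a[h + t:]
    if (la, ua) == (a1 + a4, a3 + a2):
        print("a2 and a4 exchanged, |a1| =", t)
    if (la, ua) == (a3 + a2, a1 + a4):
        print("a1 and a3 exchanged, |a1| =", t)
```

For this example the search reports exactly one boundary, with a^2 and a^4 exchanged.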



Adaptive Bitonic Sorting. Fig. 10  Example illustrating Theorem 3

Remember that there are n/2 comparator-and-exchange elements, each of which compares a_i and a_(i+n/2). They will perform exactly this exchange of subsequences, without ever looking at the data.

Now, the idea of adaptive bitonic sorting is to find the subsequences, that is, to find the index q that marks the border between the subsequences. Once q is found, one can (conceptually) swap the subsequences, instead of performing comparisons unconditionally. Finding q can be done simply by binary search driven by comparisons of the form (a_i, a_(i+n/2)). Overall, instead of performing n/2 comparisons in the first stage of the bitonic merger (see Fig. 7), the adaptive bitonic merger performs log n comparisons in its first stage (although this stage is no longer representable by a network).

Let C(n) be the total number of comparisons performed by adaptive bitonic merging, in the worst case. Then

  C(n) = 2 C(n/2) + log n = sum over i = 0, ..., log n − 1 of 2^i log(n/2^i),

with C(2) = 1. This amounts to

  C(n) = 2n − log n − 2.

The only question that remains is how to achieve the data rearrangement, that is, the swapping of the subsequences a^2 and a^4, or a^1 and a^3, respectively, without sacrificing the worst-case performance of O(n). This can be done by storing the keys in a perfectly balanced tree (assuming n = 2^k), the so-called bitonic tree. (The tree can, of course, store only 2^k − 1 keys, so the n-th key is simply stored separately.) This tree is very similar to a search tree, which stores a monotonically increasing sequence: when traversed in-order, the bitonic tree produces a sequence that lists the keys such that there are exactly two inflection points (when regarded as a circular list).

Instead of actually copying elements of the sequence in order to achieve the exchange of subsequences, the adaptive bitonic merging algorithm swaps O(log n) pointers in the bitonic tree. The recursion then works on the two subtrees. With this technique, the overall number of operations of adaptive bitonic merging is O(n). Details can be found in [].

Clearly, the adaptive bitonic sorting algorithm needs O(n log n) operations in total, because it consists of log n many complete merge stages (see Fig. 8).
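The binary-search idea and the resulting comparison count can be illustrated with an array-based Python sketch (ours, not from the article). Note that an array version still moves O(n) elements per stage, which is precisely the copying that the bitonic tree avoids, so this sketch only demonstrates the comparison count:

```python
# Array-based sketch of adaptive bitonic merging: find the swap boundary by
# binary search instead of comparing all n/2 pairs unconditionally.

count = [0]  # total comparisons performed

def gt(a, i, j):
    count[0] += 1
    return a[i] > a[j]

def first_true(pred, lo, hi):
    """Smallest i in [lo, hi] with pred(i) True (pred monotone, pred(hi) True)."""
    while lo < hi:
        mid = (lo + hi) // 2
        if pred(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

def adaptive_merge(a, lo=0, n=None):
    """Sort the bitonic block a[lo:lo+n] in place; O(log n) comparisons per stage."""
    if n is None:
        n = len(a)
    if n == 1:
        return
    h = n // 2
    # The predicate "a[i] > a[i+h]" flips at most once over i = 0..h-1;
    # comparing a[h-1] with a[n-1] distinguishes the two cases of Theorem 3.
    if gt(a, lo + h - 1, lo + n - 1):
        # swap region is a suffix of the comparator positions
        t = first_true(lambda i: gt(a, lo + i, lo + i + h), 0, h - 1)
        swap_range = range(t, h)
    else:
        # swap region is a (possibly empty) prefix
        t = first_true(lambda i: not gt(a, lo + i, lo + i + h), 0, h - 1)
        swap_range = range(0, t)
    for i in swap_range:
        a[lo + i], a[lo + i + h] = a[lo + i + h], a[lo + i]
    adaptive_merge(a, lo, h)       # La
    adaptive_merge(a, lo + h, h)   # Ua

a = [3, 7, 12, 14, 11, 8, 4, 1]
adaptive_merge(a)
print(a)          # [1, 3, 4, 7, 8, 11, 12, 14]
print(count[0])   # grows as ~2n, versus (n/2) log n = 12 for the network at n = 8
```

The counter illustrates the C(n) = 2C(n/2) + log n recurrence from the text; the gain over (n/2) log n becomes substantial for larger n.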


It should also be fairly obvious that the adaptive bitonic sorter performs an (adaptive) subset of the comparisons that are executed by the (nonadaptive) bitonic sorter.

The Parallel Algorithm

So far, the discussion assumed a sequential implementation. Obviously, the algorithm for adaptive bitonic merging can be implemented on a parallel architecture, just like the bitonic merger, by executing recursive calls on the same level in parallel.

Unfortunately, a naïve implementation would require O(log^2 n) steps in the worst case, since there are log n levels. The bitonic merger achieves O(log n) parallel time, because all pairwise comparisons within one stage can be performed in parallel. But this is not straightforward to achieve for the log n comparisons of the binary-search method in adaptive bitonic merging, which are inherently sequential.

However, a careful analysis of the data dependencies between comparisons of successive stages reveals that the execution of different stages can be partially overlapped []. As La and Ua are being constructed in one stage by moving down the tree in parallel, layer by layer (occasionally swapping pointers), this process can be started for the next stage, which begins one layer beneath the one where the previous stage began, before the first stage has finished, provided the first stage has progressed "far enough" in the tree. Here, "far enough" means exactly two layers ahead.

This leads to a parallel version of the adaptive bitonic merge that executes in time O(n/p) for p ∈ O(n / log n), that is, it can be executed in O(log n) parallel time. Furthermore, the data that needs to be communicated between processors (either via memory or via communication channels) is in O(p).

It is straightforward to apply the classical sorting-by-merging approach here (see Fig. 8), which yields the adaptive bitonic sorting algorithm. This can be implemented on an EREW machine with p processors in O(n log n / p) time, for p ∈ O(n / log n).

A GPU Implementation

Because adaptive bitonic sorting has excellent scalability (the number of processors, p, can go up to n / log n) and the amount of inter-process communication is fairly low (only O(p)), it is perfectly suitable for implementation on stream processing architectures. In addition, although it was designed for a random-access architecture, adaptive bitonic sorting can be adapted to a stream processor, which (in general) does not have the ability of random-access writes. Finally, it can be implemented on a GPU such that there are only O(log^2 n) passes (by utilizing O(n / log n) (conceptual) processors), which is very important, since the number of passes is one of the main limiting factors on GPUs.

This section provides more details on the implementation on a GPU, called "GPU-ABiSort" [, ].

Algorithm 1: Adaptive construction of La and Ua (one stage of adaptive bitonic merging)

  input:  Bitonic tree, with root node r and extra node e, representing bitonic sequence a
  output: La in the left subtree of r plus root r, and Ua in the right subtree of r plus extra node e

  // phase 0: determine case
  if value(r) < value(e) then
      case = 1
  else
      case = 2
      swap value(r) and value(e)
  (p, q) = (left(r), right(r))
  for i = 1, ..., log n − 1 do    // phase i
      test = (value(p) > value(q))
      if test == true then
          swap values of p and q
          if case == 1 then
              swap the pointers left(p) and left(q)
          else
              swap the pointers right(p) and right(q)
      if (case == 1 and test == false) or (case == 2 and test == true) then
          (p, q) = (left(p), left(q))
      else
          (p, q) = (right(p), right(q))
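Algorithm 1, together with the recursive merging it enables, can be exercised in a short sequential Python sketch (ours; the class and helper names are assumptions, n = 2^k keys, increasing sorting direction):

```python
# Sketch of the bitonic tree and one La/Ua construction stage per the
# case-based algorithm, plus the recursive merge built on top of it.

class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def build_tree(seq):
    """Bitonic tree: seq[0:n-1] in a balanced tree (in-order), seq[n-1] in the extra node."""
    def build(lo, hi):
        if lo == hi:
            return None
        mid = (lo + hi) // 2
        return Node(seq[mid], build(lo, mid), build(mid + 1, hi))
    return build(0, len(seq) - 1), Node(seq[-1])

def construct_la_ua(r, e, height):
    """One stage: afterwards La is in left(r) plus r, Ua is in right(r) plus e."""
    if r.value < e.value:
        case = 1
    else:
        case = 2
        r.value, e.value = e.value, r.value
    p, q = r.left, r.right
    for _ in range(height):               # phases 1 .. log(n) - 1
        test = p.value > q.value
        if test:
            p.value, q.value = q.value, p.value
            if case == 1:
                p.left, q.left = q.left, p.left
            else:
                p.right, q.right = q.right, p.right
        if (case == 1 and not test) or (case == 2 and test):
            p, q = p.left, q.left
        else:
            p, q = p.right, q.right

def merge(r, e, height):
    """Recursive adaptive bitonic merging on the tree (sorts a bitonic sequence)."""
    if r is None:
        return
    construct_la_ua(r, e, height)
    merge(r.left, r, height - 1)
    merge(r.right, e, height - 1)

def in_order(r, out):
    if r:
        in_order(r.left, out)
        out.append(r.value)
        in_order(r.right, out)

a = [3, 7, 12, 14, 11, 8, 4, 1]           # bitonic input, n = 8
root, extra = build_tree(a)
merge(root, extra, 2)                     # height = log2(n) - 1
out = []
in_order(root, out)
print(out + [extra.value])                # [1, 3, 4, 7, 8, 11, 12, 14]
```

Only value and pointer swaps touch the tree, O(log n) of them per stage, which is the source of the O(n) merge bound.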


Algorithm :Mergingabitonicsequencetoobtaina Algorithm :Simpli8edadaptiveconstructionofLa sorted sequence and Ua input :Bitonictree,withrootnoder and extra input :Bitonictree,withrootnoder and extra node e,representingbitonicsequencea node e,representingbitonicsequencea output:Sortedtree(producessort(a) when output: La in the le:subtree of r plus root r,andUa traversed in-order) in the right subtree of r plus extra node e construct La and Ua in the bitonic tree by  // phase  call merging recursively with left(r) as root and r if value(r) value(e) then value(r) value(e) as extra node swap and > left(r) right(r) call merging recursively with right(r) as root and swap pointers and p q left(r) right(r) e as extra node ( , )=( , ) for i , . . . , log n  do // phase i = − if value(p) value(q) then  increasing sorting direction, and it is thus not explicitely swap value(p) and value(q)  speci8ed. As noted above, the sorting direction must swap pointers> left(p) and left(q)  be reversed in the right branch of the recursion in the ( p, q )=(right(p) , right(q) )  bitonic sorter, which basically amounts to reversing the else  comparison direction of the values of the keys, that is, ( p, q )=(left(p) , left(q) )  compare for instead of in .  As noted above, the bitonic tree stores the sequence < >  a,...,an  in in-order, and the key an  is stored in “payload” data anyway, so the additional memory over-   the extra node− .Asmentionedabove,analgorithmthat− ( )   constructs La, Ua from a can traverse this bitonic tree head incurred by the child pointers is at most a factor   and swap pointers as necessary. 6e index q,whichis ..) ( )  mentioned in the proof for 6eorem ,isonlydeter-  mined implicitly. 6e two di7erent cases that are men- Outline of the Implementation   tioned in 6eorem  and Eqs. 
 and  can be distin- As explained above, on each recursion level j   n  guished simply by comparing elements a   and an . , . . . , log n of the adaptive bitonic sorting algorithm, log n j  j  =  6is leads to .Notethattherootofthebitonic− −  bitonic trees, each consisting of  nodes,  ( ) log n j j  n − + −  tree stores element a   and the extra node stores an . have to be merged into  bitonic trees of  nodes. −  Applying this recursively− yields .Notethatthebitonic− 6e merge is performed in j stages. In each stage k   tree needs to be constructed only once at the beginning ,..., j , the construction of La and Ua is executed on  k log n j k =  during setup time.  subtrees. 6erefore,   instances of the La / Ua  −  Because branches are very costly on GPUs, one construction algorithm can− be executed in parallel dur-  ⋅  should avoid as many conditionals in the inner loops ing that stage. On a stream architecture, this potential   as possible. Here, one can exploit the fact that Rn a parallelism can be exposed by allocating a stream con-  log n j k  n n  a  ,...,an , a,...,a   is bitonic, provided " a is sisting of  elements and executing a so-called = − +  bitonic too.− 6is operation− basicallyCorrected amounts to swap- kernel on each Proof element.  ' (  ping the two pointers left(root) and right(root).6e 6e La / Ua construction algorithm consists of j k   simpli8ed construction of La and Ua is presented in . phases, where each phase reads and modi8es a pair  −  (Obviously, the simpli8ed algorithm now really needs of nodes, p, q ,ofabitonictree.Assumethataker-   trees with pointers, whereas Bilardi’s original bitonic nel implementation performs the operation of a single  ( )  tree could be implemented pointer-less (since it is a phase of this algorithm. (How such a kernel implemen-   complete tree). 
However, in a real-world implementa- tation is realized without random-access writes will be   tion, the keys to be sorted must carry pointers to some described below.) 6e temporary data that have to be  Encyclopedia of Parallel Computing “00101” — 2011/4/16 — 13:39 — Page 9 — #10


preserved from one phase of the algorithm to the next are just two node pointers (p and q) per kernel instance. Thus, each element of the allocated stream consists of exactly these two node pointers. When the kernel is invoked on that stream, each kernel instance reads a pair of node pointers, (p, q), from the stream, performs one phase of the La/Ua construction algorithm, and finally writes the updated pair of node pointers (p, q) back to the stream.

Complexity

A simple implementation on the GPU would need O(log^2 n) phases (or "passes" in GPU parlance) per merge, which amounts to O(log^3 n) passes in total for adaptive bitonic sorting. This is already very fast in practice. However, the optimal complexity of O(log^2 n) passes can be achieved exactly as described in the original work [], that is, phase i of a stage k can be executed immediately after phase i + 2 of stage k − 1 has finished. Therefore, the execution of a new stage can start at every other step of the algorithm. The only difference from the simple implementation is that kernels now must write to only parts of the output stream, because other parts are still in use.

Eliminating Random-Access Writes

Since GPUs do not support random-access writes (at least, for almost all practical purposes, random-access writes would kill any performance gained by the parallelism), the kernel has to be implemented so that it modifies node pairs (p, q) of the bitonic tree without random-access writes. This means that it can output node pairs from the kernel only via a linear stream write. But this way, it cannot write a modified node pair back to the original location from where it was read. In addition, it cannot simply take an input stream (containing a bitonic tree) and produce another output stream (containing the modified bitonic tree), because then it would have to process the nodes in the same order as they are stored in memory, whereas the adaptive bitonic merge processes them in a random, data-dependent order.

Fortunately, the bitonic tree is a linked data structure where all nodes are directly or indirectly linked to the root (except for the extra node). This allows us to change the location of nodes in memory during the merge algorithm, as long as the child pointers of their respective parent nodes are updated (and the root and extra node of the bitonic tree are kept at well-defined memory locations). This means that for each node that is modified, its parent node has to be modified as well, in order to update its child pointers.

Notice that Algorithm 3 basically traverses the bitonic tree down along a path, changing some of the nodes as necessary. The strategy is simple: output every node visited along this path to a stream. Since the data layout is fixed and predetermined, the kernel can store the index of the children with the node as it is being written to the output stream. One child address remains the same anyway, while the other is determined while the kernel is still executing for the current node. Figure 11 demonstrates the operation of the stream program using the described stream output technique.

GPU-Specific Details

For the input and output streams, it is best to apply the ping-pong technique commonly used in GPU programming: allocate two such streams, and alternatingly use one of them as the input and the other one as the output stream.

Preconditioning the Input

For merge-based sorting on a PRAM architecture (and assuming p < n), it is a common technique to sort, in a first step, p blocks of n/p values locally, that is, each processor sorts n/p values using a standard sequential algorithm. The same technique can be applied here by implementing such a local sort as a kernel program. However, since there is no random write access to non-temporary memory from a kernel, the number of values that can be sorted locally by a kernel is restricted by the number of temporary registers.

On recent GPUs, the maximum output data size of a kernel is fairly small. Since usually the input consists of key/pointer pairs, the method starts with a local sort of a small, fixed number of key/pointer pairs per kernel. For such small numbers of keys, a simple algorithm with asymptotic complexity O(n^2) performs faster than asymptotically optimal algorithms.

After the local sort, a further stream operation converts the resulting sorted subsequences pairwise to bitonic trees. Thereafter, the GPU-ABiSort approach can be applied as described above, starting at the corresponding recursion level j.




Adaptive Bitonic Sorting. Fig. 11  To execute several instances of the adaptive La/Ua construction algorithm in parallel, where each instance operates on a bitonic tree of 7 nodes, three phases are required. This figure illustrates the operation of these three phases. On the left, the node pointers contained in the input stream are shown, as well as the comparisons performed by the kernel program. On the right, the node pointers written to the output stream are shown, as well as the modifications of the child pointers and node values performed by the kernel program according to Algorithm 3

The Last Stage of Each Merge

Adaptive bitonic merging, being a recursive procedure, eventually merges small subsequences. For such small subsequences, it is better to use a (nonadaptive) bitonic merge implementation that can be executed in a single pass over the whole stream.

Timings

The following experiments were done on arrays consisting of key/pointer pairs, where the key is a uniformly distributed random 32-bit floating-point value and the pointer a 4-byte address. Since one can assume (without loss of generality) that all pointers in the given array are unique, these can be used as secondary sort keys for the adaptive bitonic merge.

The experiments described in the following compare the implementation of GPU-ABiSort of [, ] with sorting on the CPU using the C++ STL sort function (an optimized quicksort implementation), as well as with the (nonadaptive) bitonic sorting network implementation on the GPU by Govindaraju et al., called GPUSort [].

Contrary to the CPU STL sort, the timings of GPU-ABiSort do not depend very much on the data to be sorted, because the total number of comparisons performed by adaptive bitonic sorting is not data-dependent.


Adaptive Bitonic Sorting. Table 1  Timings of the three sorting implementations

  n         | CPU sort   | GPUSort | GPU-ABiSort
  32,768    | 9–11 ms    | 4 ms    | 5 ms
  65,536    | 19–24 ms   | 8 ms    | 8 ms
  131,072   | 46–52 ms   | 18 ms   | 16 ms
  262,144   | 98–109 ms  | 38 ms   | 31 ms
  524,288   | 203–226 ms | 80 ms   | 65 ms
  1,048,576 | 418–477 ms | 173 ms  | 135 ms

Adaptive Bitonic Sorting. Fig.  Timings on a GeForce system. (There are two curves for the CPU sort, so as to visualize that its running time is somewhat data-dependent)

 Table shows the results of timings performed on same GPU by a factor .,and itisafactorfaster than   a PCI Express bus PC system with an AMD Athlon- astandardCPUsortingimplementation(STL).    + CPU and an NVIDIA GeForce  GTX  GPU with  MB memory. Obviously, the speedup Related Entries   of GPU-ABiSort compared to CPU sorting is .–. !AKS Network   for n .Furthermore,uptothemaximumtested !Bitonic Sort   sequence length n  ( ,,), GPU-ABiSort is ≥ !Lock-Free Algorithms   up to .times faster than GPUSort, and this speedup is = = !Scalability   increasing with the sequence length n,asexpected. !Speedup   6e timings of the GPU approaches assume that the  input data is already stored in GPU memory. When  embedding the GPU-based sorting into an otherwise Bibliographic Notes and Further   purely CPU-based application, the input data has to be Reading   transferred from CPU to GPU memory, and a:erwards As mentioned in the introduction, this line of research   the output data has to be transferred back to CPU mem- began with the seminal work of Batcher []inthe  ory. However, the overhead of this transfer is usually late s, who described parallel sorting as a network.   negligible compared to the achieved sorting speedup: Research of parallel sorting algorithms was reinvigo-   according to measurements by [], the transfer of one rated in the s, where a number of theoretical ques-   million key/pointer pairs from CPU to GPU and back tions have been settled [, , , , , ].   takes in total roughly ms on a PCI Express bus PC. Another wave of research on parallel sorting ensued  from the advent of a7ordable, massively parallel archi-  tectures, namely, GPUs, which are, more precisely,  streaming architectures. 
6is spurred the development   Conclusion of a number of practical implementations [, –, ,   Adaptive bitonic sorting is not only appealing from a , ].   theoretical point of view, but also fromCorrected a practical one. Proof  Unlike other parallel sorting algorithms that exhibit   optimal asymptotic complexity too, adaptive bitonic Bibliography  sorting o7ers low hidden constants in its asymptotic . Ajtai M, Komlós J, Szemerédi J () An O(n log n) sorting net-  work. In: Proceedings of the 8:eenth annual ACM symposium   complexity and can be implemented on parallel archi- on theory of computing (STOC ’), New York, NY, pp –   tectures by a reasonably experienced programmer. 6e . Akl SG () Parallel sorting algorithms. Academic, Orlando, FL   practical implementation of it on a GPU outperforms . Azar Y, Vishkin U () Tight comparison bounds on the com-   the implementation of simple bitonic sorting on the plexity of parallel sorting. SIAM J Comput ():–  Encyclopedia of Parallel Computing “00101” — 2011/4/16 — 13:39 — Page 12 — #13


4. Batcher KE (1968) Sorting networks and their applications. In: Proceedings of the 1968 Spring joint computer conference (SJCC), Atlantic City, NJ, vol 32, pp 307–314
5. Bilardi G, Nicolau A (1989) Adaptive bitonic sorting: an optimal parallel algorithm for shared-memory machines. SIAM J Comput 18(2):216–228
6. Cole R (1988) Parallel merge sort. SIAM J Comput 17(4):770–785; see correction in SIAM J Comput 22
7. Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms, 3rd edn. MIT Press, Cambridge, MA
8. Gibbons A, Rytter W (1988) Efficient parallel algorithms. Cambridge University Press, Cambridge, England
9. Govindaraju NK, Gray J, Kumar R, Manocha D (2006) GPUTeraSort: high performance graphics coprocessor sorting for large database management. Technical Report MSR-TR-2005-183, Microsoft Research (MSR). In: Proceedings of the ACM SIGMOD conference, Chicago, IL
10. Govindaraju NK, Raghuvanshi N, Henson M, Manocha D (2005) A cache-efficient sorting algorithm for database and data mining computations using graphics processors. Technical report, University of North Carolina, Chapel Hill
11. Greß A, Zachmann G (2006) GPU-ABiSort: optimal parallel sorting on stream architectures. In: Proceedings of the 20th IEEE international parallel and distributed processing symposium (IPDPS), Rhodes Island, Greece
12. Greß A, Zachmann G (2006) GPU-ABiSort: optimal parallel sorting on stream architectures. Technical report, IfI, TU Clausthal, Computer Science Department, Clausthal-Zellerfeld, Germany
13. Kipfer P, Westermann R (2005) Improved GPU sorting. In: Pharr M (ed) GPU Gems 2: programming techniques for high-performance graphics and general-purpose computation. Addison-Wesley, Reading, MA
14. Leighton T (1984) Tight bounds on the complexity of parallel sorting. In: STOC '84: Proceedings of the sixteenth annual ACM symposium on theory of computing, ACM, New York, NY, USA
15. Natvig L (1990) Logarithmic time cost optimal parallel sorting is not yet fast in practice! In: Proceedings of Supercomputing '90, New York, NY
16. Purcell TJ, Donner C, Cammarano M, Jensen HW, Hanrahan P (2003) Photon mapping on programmable graphics hardware. In: Proceedings of the ACM SIGGRAPH/Eurographics conference on graphics hardware (EGGH '03), ACM, New York
17. Satish N, Harris M, Garland M (2009) Designing efficient sorting algorithms for manycore GPUs. In: Proceedings of the 2009 IEEE international symposium on parallel and distributed processing (IPDPS), IEEE Computer Society, Washington, DC, USA
18. Schnorr CP, Shamir A (1986) An optimal sorting algorithm for mesh connected computers. In: Proceedings of the eighteenth annual ACM symposium on theory of computing (STOC), ACM, New York, NY, USA
19. Sintorn E, Assarsson U (2008) Fast parallel GPU-sorting using a hybrid algorithm. J Parallel Distrib Comput 68(10):1381–1388