Improved Parallel Integer Sorting without Concurrent Writing

Susanne Albers†‡ and Torben Hagerup†

A preliminary version of this paper was presented at the 3rd Annual ACM-SIAM Symposium on
Discrete Algorithms (SODA) in Orlando, Florida, in January.

† Max-Planck-Institut für Informatik, Saarbrücken, Germany. Supported in part by the Deutsche
Forschungsgemeinschaft, SFB TP B, VLSI Entwurfsmethoden und Parallelität, and in part by the
ESPRIT Basic Research Actions Program of the EU under the contracts for projects ALCOM and
ALCOM II.

‡ Part of this work was done while the author was a student at the Graduiertenkolleg
Informatik, Universität des Saarlandes, Saarbrücken, Germany, and supported by a graduate
fellowship of the Deutsche Forschungsgemeinschaft.

Proposed running head: Improved parallel integer sorting

Contact author:

Susanne Albers
Max-Planck-Institut für Informatik
Im Stadtwald
Saarbrücken
Germany

Abstract

We show that $n$ integers in the range $1..n$ can be sorted stably on an EREW PRAM using
$O(t)$ time and $O(n(\sqrt{\log n \log\log n} + (\log n)^2/t))$ operations, for arbitrary
given $t \ge \log n \log\log n$, and on a CREW PRAM using $O(t)$ time and
$O(n(\sqrt{\log n} + \log n/2^{t/\log n}))$ operations, for arbitrary given $t \ge \log n$.
In addition, we are able to sort $n$ arbitrary integers on a randomized CREW PRAM within the
same resource bounds with high probability. In each case our algorithm is a factor of almost
$\Theta(\sqrt{\log n})$ closer to optimality than all previous algorithms for the stated
problem in the stated model, and our third result matches the operation count of the best
previous sequential algorithm. We also show that $n$ integers in the range $1..m$ can be
sorted in $O((\log n)^2)$ time with $O(n)$ operations on an EREW PRAM using a nonstandard
word length of $O(\log n \log\log n \log m)$ bits, thereby greatly improving the upper bound
on the word length necessary to sort integers with a linear time-processor product, even
sequentially. Our algorithms were inspired by, and in one case directly use, the fusion trees
of Fredman and Willard.

1 Introduction

A parallel algorithm is judged primarily by its speed and its efficiency. Concerning speed, a
widely accepted criterion is that it is desirable for a parallel algorithm to have a running
time that is polylogarithmic in the size of the input. The efficiency of a parallel algorithm
is evaluated by comparing its time-processor product, i.e., the total number of operations
executed, with the running time of an optimal sequential algorithm for the problem under
consideration, the parallel algorithm having optimal speedup or simply being optimal if the
two agree to within a constant factor. A point of view often put forward and forcefully
expressed in Kruskal et al. (b) is that the efficiency of a parallel algorithm, at least given
present-day technological constraints, is far more important than its raw speed, the reason
being that in all probability, the algorithm must be slowed down to meet a smaller number of
available processors anyway.

The problem of integer sorting on a PRAM has been studied intensively. While previous
research has concentrated on algorithms for the CRCW PRAM (Rajasekaran and Reif; Rajasekaran
and Sen; Bhatt et al.; Matias and Vishkin a, b; Raman a, b; Hagerup; Gil et al.; Hagerup and
Raman; Bast and Hagerup; Goodrich et al.), in this paper we are interested in getting the
most out of the weaker EREW and CREW PRAM models. Consider the problem of sorting $n$
integers in the range $1..m$. In view of sequential radix sorting, which works in linear time
if $m = n^{O(1)}$, a parallel algorithm for this most interesting range of $m$ is optimal only
if its time-processor product is $O(n)$. Kruskal et al. (a) showed that for $m \ge n$ and
$p \le n/2$, the problem can be solved using $p$ processors in time
$O\bigl(\frac{n}{p} \cdot \frac{\log m}{\log(n/p)}\bigr)$, i.e., with a time-processor product
of $O\bigl(\frac{n \log m}{\log(n/p)}\bigr)$. The space requirements of the algorithm are
$\Theta(p n^{\epsilon} + n)$, for arbitrary fixed $\epsilon > 0$, which makes the algorithm
impractical for values of $p$ close to $n$. Algorithms that are not afflicted by this problem
were described by Cole and Vishkin (in a remark following a theorem) and by Wagner and Han.
For $n/m \le p \le n/\log n$, they use $O(n)$ space and also sort in
$O\bigl(\frac{n}{p} \cdot \frac{\log m}{\log(n/p)}\bigr)$ time with $p$ processors. All three
algorithms are optimal if $m = (n/p)^{O(1)}$. However, let us now focus on the probably most
interesting case of $m = n^{O(1)}$ combined with running times of $(\log n)^{O(1)}$. Since the
running time is at least $n/p$, it is easy to see that the time-processor product of the
algorithms above in this case is bounded only by $O(n \log m/\log\log n)$, i.e., the integer
sorting algorithms are more efficient than algorithms for general (comparison-based) sorting,
which can be done in $O(\log n)$ time using $O(n \log n)$ operations (Ajtai et al.; Cole), by
a factor of at most $\Theta(\log\log n)$, to be compared with the potential maximum gain of
$\Theta(\log n)$. This is true even of a more recent EREW PRAM algorithm by Rajasekaran and
Sen that sorts $n$ integers in the range $1..n$ stably in $O(\log n \log\log n)$ time using
$O(n \log n/\log\log n)$ operations and $O(n)$ space.

We describe an EREW PRAM algorithm for sorting $n$ integers in the range $1..n$ stably that
exhibits a tradeoff between speed and efficiency. The minimum running time is
$\Theta(\log n \log\log n)$, and for this running time the algorithm essentially coincides
with that of Rajasekaran and Sen. Allowing more time, however, we can sort with fewer
operations, down to a minimum of $\Theta(n\sqrt{\log n \log\log n})$, reached for a running
time of $\Theta((\log n)^{3/2}/\sqrt{\log\log n})$. In general, for any given
$t \ge \log n \log\log n$, the algorithm can sort in $O(t)$ time using
$O(n(\sqrt{\log n \log\log n} + (\log n)^2/t))$ operations. Run at the slowest point of its
tradeoff curve, the algorithm is more efficient than the algorithms discussed above by a
factor of $\Theta(\sqrt{\log n}/(\log\log n)^{3/2})$.
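As a quick check of the EREW tradeoff, the following worked calculation locates the two ends of the curve. It is our own elaboration and uses only the operation bound $O(n(\sqrt{\log n \log\log n} + (\log n)^2/t))$ stated in the text; the name $W(t)$ for that bound is introduced here purely for illustration.

```latex
\begin{align*}
W(t) &= n\Bigl(\sqrt{\log n\,\log\log n} + \frac{(\log n)^2}{t}\Bigr),\\[2pt]
W(\log n\,\log\log n) &= \Theta\Bigl(\frac{n\log n}{\log\log n}\Bigr)
  && \text{(fastest point: the second term dominates)},\\[2pt]
\frac{(\log n)^2}{t} = \sqrt{\log n\,\log\log n}
  &\iff t = \frac{(\log n)^{3/2}}{\sqrt{\log\log n}}
  && \text{(the two terms balance)},\\[2pt]
W\Bigl(\frac{(\log n)^{3/2}}{\sqrt{\log\log n}}\Bigr)
  &= \Theta\bigl(n\sqrt{\log n\,\log\log n}\bigr).
\end{align*}
```

Beyond the balance point the first term dominates, so allowing still more time no longer reduces the number of operations.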

On the CREW PRAM, we obtain a much steeper tradeoff: For all $t \ge \log n$, our algorithm
sorts $n$ integers in the range $1..n$ in $O(t)$ time using
$O(n(\sqrt{\log n} + \log n/2^{t/\log n}))$ operations; the minimum number of operations of
$\Theta(n\sqrt{\log n})$ is reached for a running time of $\Theta(\log n \log\log n)$.

We also consider the problem of sorting integers of arbitrary size on the CREW PRAM and
describe a reduction of this problem to that of sorting $n$ integers in the range $1..n$. The
reduction is randomized and uses $O(\log n)$ time and $O(n\sqrt{\log n})$ operations with
high probability. It is based on the fusion trees of Fredman and Willard and was discovered
independently by Raman (a).

The algorithms above all abide by the standard convention that the word length available to
sort $n$ integers in the range $1..m$ is $\Theta(\log(n+m))$ bits, i.e., that unit-time
operations on integers of size $(n+m)^{O(1)}$ are provided, but that all integers manipulated
must be of size $(n+m)^{O(1)}$. A number of papers have explored the implications of allowing
a larger word length. Paul and Simon and Kirkpatrick and Reisch demonstrated that with no
restrictions on the word length at all, arbitrary integers can be sorted in linear sequential
time. Hagerup and Shen showed that in fact a word length of about $O(n \log n \log m)$ bits
suffices to sort $n$ integers in the range $1..m$ in $O(n)$ sequential time or in $O(\log n)$
time on an EREW PRAM with $O(n/\log n)$ processors. The practical value of such results is
doubtful because of the unrealistic assumptions: Hardly any computer has a word length
comparable to typical input sizes. We show that $n$ integers in the range $1..m$ can be
sorted with a linear time-processor product in $O((\log n)^2)$ time on an EREW PRAM with a
word length of $O(\log n \log\log n \log m)$ bits. At the price of a moderate increase in the
running time, this greatly improves the known upper bound on the word length necessary to
sort integers with a linear time-processor product, even sequentially.

Another, perhaps more illuminating, perspective is obtained by noting that providing a PRAM
with a large word length amounts to adding more parallelism of a more restrictive kind. A
single register of many bits is akin to a whole SIMD machine, but without the important
ability to let individual processors not participate in the current step. As suggested by
this view, using a word length that allows $k$ integers to be stored in each word potentially
could decrease the running time by a factor of up to $\Theta(k)$. What our algorithm actually
achieves is a reduction in the running time by a factor of $\Theta(k/\log k)$ (curiously, it
does so by calling a sequential version of an originally parallel algorithm as a subroutine).
This can be interpreted as saying that if one happens to sort integers of which several will
fit in one word, then advantage can be taken of this fact. Our result also offers evidence
that algorithms using a nonstandard word length should not hastily be discarded as unfeasible
and beyond practical relevance.

Following the conference publication of the results reported here, our work was shown to have
unexpected consequences in a purely sequential setting. In particular, one of our algorithms
is an essential ingredient in an algorithm of Andersson et al. that sorts $n$ $w$-bit integers
in $O(n \log\log n)$ time on a $w$-bit RAM, for arbitrary $w \ge \log n$. Other papers
building directly or indirectly on techniques and concepts introduced here include Andersson
and Thorup.

2 Preliminaries

A PRAM is a synchronous parallel machine consisting of numbered processors and a global
memory accessible to all processors. An EREW PRAM disallows concurrent access to a memory
cell by more than one processor, a CREW PRAM allows concurrent reading, but not concurrent
writing, and a CRCW PRAM allows both concurrent reading and concurrent writing. We assume an
instruction set that includes addition, subtraction, comparison, unrestricted shift as well
as the bitwise Boolean operations and and or. The shift operation takes two integer operands
$x$ and $i$ and produces $x \uparrow i = \lfloor x \cdot 2^i \rfloor$.

Our algorithms frequently need to compute quantities that involve logarithms and square
roots of given integers such as $m$. Since we have not assumed machine instructions for
multiplication and division, let alone extraction of logarithms and square roots, it is not
obvious how to carry out the computation. In general, however, it suffices for our purposes
to compute the quantities of interest approximately, namely up to a constant factor, which
can be done using negligible resources: Shift operations provide exponentiation, logarithms
can be extracted by exponentiating all candidate values in parallel and picking the right
one, and approximate multiplication, division and extraction of square roots reduce to
extraction of logarithms, followed by addition, subtraction or halving (i.e., right shift by
one bit), followed by exponentiation. On some occasions, we need more accurate calculations,
but the numbers involved are small, and we can implement multiplication by repeated addition,
etc. The details will be left to the reader.

Two basic operations that we shall need repeatedly are prefix summation and segmented
broadcasting. The prefix summation problem of size $n$ takes as input an associative
operation $\circ$ over a semigroup $S$ as well as $n$ elements $x_1, \ldots, x_n$ of $S$, and
the task is to compute $x_1, x_1 \circ x_2, \ldots, x_1 \circ x_2 \circ \cdots \circ x_n$. The
segmented broadcasting problem of size $n$ is, given an array $A[1..n]$ and a bit sequence
$b_1, \ldots, b_n$ with $b_1 = 1$, to store the value of
$A[\max\{j : 1 \le j \le i \text{ and } b_j = 1\}]$ in $A[i]$, for $i = 1, \ldots, n$. It is
well-known that whenever the associative operation can be applied in constant time by a
single processor, prefix summation problems of size $n$ can be solved on an EREW PRAM using
$O(\log n)$ time and $O(n)$ operations (see, e.g., JáJá). Segmented broadcasting problems of
size $n$ can be solved within the same resource bounds by casting them as prefix summation
problems; the details can be found, e.g., in Hagerup and Rüb.
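The reduction of segmented broadcasting to prefix summation can be illustrated with a small sequential sketch. It is not a PRAM implementation, and the function names and the (value, flag) pairing are our own choices for illustration; the point is only that "keep the rightmost flagged value" is an associative operation, so any prefix-summation routine can be reused.

```python
from itertools import accumulate


def prefix_sums(xs, op):
    """Return x1, x1 op x2, ..., x1 op ... op xn for an associative op."""
    return list(accumulate(xs, op))


def segmented_broadcast(a, b):
    """Return the array in which position i holds a[j], where j is the
    largest index <= i with b[j] == 1 (b[0] must be 1)."""
    assert b and b[0] == 1

    def op(left, right):
        # Elements are (value, flag) pairs. The right value wins if its
        # flag is set; otherwise the left pair is carried along unchanged.
        return right if right[1] else left

    return [value for value, _ in prefix_sums(list(zip(a, b)), op)]
```

For instance, with `a = [5, 9, 7, 3, 8]` and `b = [1, 0, 0, 1, 0]`, the first segment is filled with 5 and the second with 3.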

It can be shown that if an algorithm consists of $r$ parts such that the $i$th part can be
executed in time $t_i$ using $q_i$ operations, for $i = 1, \ldots, r$, then the whole
algorithm can be executed in time $O(t_1 + \cdots + t_r)$ using $O(q_1 + \cdots + q_r)$
operations, i.e., time and number of operations are simultaneously additive. We make
extensive and implicit use of this observation.

We distinguish between sorting and ranking. To sort a sequence of records, each with a key
drawn from a totally ordered domain, is to rearrange the records such that in the resulting
sequence the keys occur in nondecreasing order. The sorting is stable if nonidentical records
with the same key occur in the same order in the output sequence as in the input sequence,
i.e., if there are no unnecessary interchanges. To rank a sequence $R_1, \ldots, R_n$ of
records as above (stably) is to compute a permutation $i_1, \ldots, i_n$ of $1, \ldots, n$
such that $R_{i_1}, \ldots, R_{i_n}$ is a possible result of sorting $R_1, \ldots, R_n$
(stably).

For records that consist of just a key with no associated information, the stability
requirement adds nothing to the sorting problem; it is vacuously satisfied. In such cases we
follow the tradition of using the term "stable sorting" in the sense of "stable ranking".
Note that any algorithm for (stable) ranking on a PRAM implies an algorithm for (stable)
sorting with the same resource bounds, up to a constant factor, since rearranging a sequence
of $n$ records according to a given permutation can be done in constant time using $O(n)$
operations. On the other hand, an algorithm for stable ranking can be derived from any
sorting algorithm that can be modified as follows: Let the sequence of records to be sorted
be $R_1, \ldots, R_n$ and let $x_i$ be the key of $R_i$, for $i = 1, \ldots, n$. The modified
algorithm uses as the key of $R_i$ the pair $(x_i, i)$, for $i = 1, \ldots, n$, and orders
these pairs lexicographically, i.e., $(x_i, i) < (x_j, j) \iff x_i < x_j$ or ($x_i = x_j$ and
$i < j$). We shall refer to this process as stabilizing the original algorithm.
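Stabilizing can be sketched in a few lines. The sketch below is illustrative only: `heap_sort` stands in for an arbitrary sorting routine that is not stable by itself, and `stable_rank` is a hypothetical name. Python tuples compare lexicographically, which matches the order on pairs (x_i, i) described above; indices are 0-based here.

```python
import heapq


def heap_sort(items):
    """A sorting routine that is not stable in general (stand-in example)."""
    heap = list(items)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]


def stable_rank(keys, sort=heap_sort):
    """Return 0-based indices i_1, ..., i_n such that reading the keys in
    this order is the stable sort of `keys`."""
    # Extend every key x_i to the pair (x_i, i); pairs are pairwise
    # distinct, so any correct sorting routine orders them uniquely.
    pairs = [(x, i) for i, x in enumerate(keys)]
    return [i for _, i in sort(pairs)]
```

Because all extended keys are distinct, the result no longer depends on how the underlying routine treats equal keys.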

Given an element $x$ of a totally ordered domain $U$ and a finite sequence $L$ of elements of
$U$, the rank of $x$ in $L$ is the number of elements in $L$ no larger than $x$.

Our basic sorting algorithm is nonconservative, following the terminology of Kirkpatrick and
Reisch, i.e., the word length allowed for sorting $n$ integers in the range $1..m$ is not
limited to $O(\log(n+m))$ bits. More precisely, we will use a word length of
$\Theta(kl + \log n)$ bits, where $k$ is a power of 2 with $2 \le k \le n$ and $l$ is an
integer with $l \ge \lceil \log(m+k) \rceil + 2$. This word length enables us to store more
than one number in a word, which is essential for our technique.

Throughout the paper we number the bit positions of words and integers from right to left,
with the least significant bit occupying position 0. We assume a word length of at least
$2kl$ bits and partition the rightmost $2kl$ bits into $2k$ fields of $l$ bits each. The
fields are numbered from right to left, starting at 0, and the number of a field is also
called its address. The most significant bit of each field, called the test bit, is usually
set to 0. The remaining bits of the field can represent a nonnegative integer coded in binary
and called an entry. A word normally stores $k$ entries $x_1, \ldots, x_k$ in its rightmost
$k$ fields and thus has the following structure:

\[
\underbrace{0}_{\text{field } 2k-1} \cdots \underbrace{0}_{\text{field } k}\;
\underbrace{x_k}_{\text{field } k-1} \cdots \underbrace{x_1}_{\text{field } 0}
\]

The fields $k, \ldots, 2k-1$ serve as temporary storage. Two fields in distinct words
correspond if they have the same address, and two entries correspond if they are stored in
corresponding fields. For $h \ge 1$, a word that contains the test bit $b_i$ and the entry
$x_i$ in its field number $i - 1$, for $i = 1, \ldots, h$, and whose remaining bits are all 0
will be denoted $[b_h : x_h, \ldots, b_1 : x_1]$. If $b_1 = \cdots = b_h = 0$, we simplify
this to $[x_h, \ldots, x_1]$. By a sorted word we mean a word of the form
$[x_k, \ldots, x_1]$, where $x_1 \le x_2 \le \cdots \le x_k$. A sequence $x_1, \ldots, x_n$
of $n$ integers of at most $l - 1$ bits each is said to be given in the word representation
(with parameters $k$ and $l$) if it is given in the form of the words
$[x_k, \ldots, x_1], [x_{2k}, \ldots, x_{k+1}], \ldots,
[x_n, \ldots, x_{(\lceil n/k \rceil - 1)k + 1}]$.
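To make the word representation concrete, here is a minimal sketch that packs a sequence into words of k fields of l bits each and reads entries back. Python integers play the role of machine words; the function names `pack` and `unpack` are our own and not part of the paper.

```python
def pack(xs, k, l):
    """Pack the sequence xs into words with k entries each.

    Entry number i of a word (counting from 0) is stored in field i, so
    the first entry of every word ends up rightmost. Each entry must fit
    in l - 1 bits so that the test bit of its field stays 0.
    """
    words = []
    for start in range(0, len(xs), k):
        word = 0
        for field, x in enumerate(xs[start:start + k]):
            assert 0 <= x < (1 << (l - 1)), "entry does not fit in l - 1 bits"
            word |= x << (field * l)
        words.append(word)
    return words


def unpack(word, count, l):
    """Return the entries stored in fields 0, ..., count - 1 of a word."""
    entry_mask = (1 << (l - 1)) - 1
    return [(word >> (field * l)) & entry_mask for field in range(count)]
```

With `k = 2` and `l = 4`, the sequence 1, 2, 3, 4, 5 occupies three words, the last of which holds a single entry.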

The remainder of the paper is structured as follows: We first study algorithms for the EREW
PRAM. Section 3 tackles the humble but nontrivial task of merging the sequences represented
by two sorted words. Based on this, Section 4 develops a nonconservative parallel merge
sorting algorithm, from which the conservative algorithm that represents our main result for
the EREW PRAM is derived in Section 5. Section 6 describes our algorithms for the CREW PRAM.

3 Merging two words

In this section we describe how a single processor can merge two sorted words in $O(\log k)$
time. We use the bitonic sorting algorithm of Batcher (alternatively, see Cormen et al.) and
need only describe how bitonic sorting can be implemented to work on sorted words.

A sequence of integers is bitonic if it is the concatenation of a nondecreasing sequence and
a nonincreasing sequence, or if it can be obtained from such a sequence via a cyclic shift.
E.g., the sequence 3, 5, 6, 4, 1, 2 is bitonic, but 1, 3, 2, 4 is not. Batcher's bitonic
sorting algorithm takes as input a bitonic sequence $x_1, \ldots, x_h$, where $h$ is a power
of 2. The algorithm simply returns the one-element input sequence if $h = 1$, and otherwise
executes the following steps:

(1) For $i = 1, \ldots, h/2$, let $m_i = \min\{x_i, x_{i+h/2}\}$ and
$M_i = \max\{x_i, x_{i+h/2}\}$.

(2) Recursively sort $m_1, \ldots, m_{h/2}$, the sequence of minima, and
$M_1, \ldots, M_{h/2}$, the sequence of maxima, and return the sequence consisting of the
sorted sequence of minima followed by the sorted sequence of maxima.

Although we shall not here demonstrate the correctness of the algorithm, we note that all
that is required is a proof that $m_1, \ldots, m_{h/2}$ and $M_1, \ldots, M_{h/2}$ are
bitonic (so that the recursive application is permissible), and that $m_i \le M_j$ for all
$i, j \in \{1, \ldots, h/2\}$ (so that the concatenation in Step (2) indeed produces a sorted
sequence).
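The recursive description of bitonic sorting translates almost literally into code. The sketch below works on ordinary Python lists rather than on packed words and is meant only to illustrate the two steps (split into minima and maxima, then recurse); the function name is our own.

```python
def bitonic_sort(xs):
    """Sort a bitonic sequence whose length h is a power of 2."""
    h = len(xs)
    assert h >= 1 and h & (h - 1) == 0, "length must be a power of 2"
    if h == 1:
        return list(xs)
    half = h // 2
    # Step 1: compare elements that are h/2 positions apart.
    minima = [min(xs[i], xs[i + half]) for i in range(half)]
    maxima = [max(xs[i], xs[i + half]) for i in range(half)]
    # Step 2: both halves are again bitonic; sort them recursively and
    # output the sorted minima followed by the sorted maxima.
    return bitonic_sort(minima) + bitonic_sort(maxima)
```

The input must really be bitonic: the routine performs only the fixed comparisons of the bitonic network and does not sort arbitrary sequences.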

Our implementation works in $\log k + 1$ stages, numbered $\log k, \log k - 1, \ldots, 0$,
where a stage corresponds to a recursive level in the description above. Note that the stages
are numbered backwards, which gives a more natural description of the single stages. In the
beginning of Stage $t$, for $t = \log k, \ldots, 0$, there is a single word $Z$ containing
$2^{\log k - t}$ bitonic sequences of length $2^{t+1}$ each, each of which is stored in
$2^{t+1}$ consecutive fields of $Z$. Furthermore, if $z$ is an element of a sequence stored
to the left of a different sequence containing an element $z'$, then $z \ge z'$. Stage $t$,
for $t = \log k, \ldots, 0$, carries out the algorithm above on each of the $2^{\log k - t}$
bitonic sequences, i.e., each sequence of $2^{t+1}$ elements is split into a sequence of
$2^t$ minima and a sequence of $2^t$ maxima, and these are stored next to each other in the
$2^{t+1}$ fields previously occupied by their parent sequence, the sequence of maxima to the
left of the sequence of minima.

Because of the close connection to the recursive description, the correctness of this
implementation is obvious. In order to use the algorithm for merging, as opposed to bitonic
sorting, we introduce a preprocessing step executed before the first phase. Given words
$X = [x_k, \ldots, x_1]$ and $Y = [y_k, \ldots, y_1]$, the preprocessing step produces the
single word $Z = [y_1, \ldots, y_k, x_k, \ldots, x_1]$. The important thing to note is that
if $x_1, \ldots, x_k$ and $y_1, \ldots, y_k$ are sorted, then
$x_1, \ldots, x_k, y_k, \ldots, y_1$ is bitonic.

k k

We now describe in full detail the single steps of the merging algorithm. We repeatedly use
the two constants

    K_1 = [1:0, 1:0, ..., 1:0]   (2k fields)

and

    K_2 = [2k-1, 2k-2, ..., 2, 1, 0].

It is easy to compute $K_1$ and $K_2$ in $O(\log k)$ time. This is accomplished by the
program fragment below.

    K_1 := 1 ↑ (l - 1);
    for t := 0 to log k do
      (* K_1 = [1:0, ..., 1:0] with 2^t fields *)
      K_1 := K_1 or (K_1 ↑ 2^t l);
    K_2 := (K_1 ↑ (-l + 1)) - 1;
    for t := 0 to log k do
      (* K_2 = [2^t, 2^t, ..., 2^t, 2^t - 1, ..., 2, 1, 0] with 2k fields *)
      K_2 := K_2 + ((K_2 ↑ 2^t l) and CopyTestBit(K_1));

The function CopyTestBit, used in the last line above to remove spurious bits by truncating
the shifted copy of $K_2$ after field number $2k - 1$, takes as argument a word with the
rightmost $l - 1$ bits in each field and all bits to the left of field number $2k - 1$ set to
0 and returns the word obtained from the argument word by copying the value of each test bit
to all other bits within the same field and subsequently setting all test bits to 0. It can
be defined as follows:

    CopyTestBit(A) = A - (A ↑ (-l + 1)).
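The doubling computation of the two constants can be tried out directly with Python integers as words. This is a sketch under our reading of the (partly garbled) program fragment: K_1 is taken to hold a set test bit in each of the 2k fields, and K_2 to hold the address i in field i. The helper names `shift`, `copy_test_bit` and `constants` are ours.

```python
def shift(x, i):
    """The unrestricted shift: floor(x * 2**i), for positive or negative i."""
    return x << i if i >= 0 else x >> -i


def copy_test_bit(a, l):
    """Copy each test bit to the l - 1 lower bits of its field and clear
    the test bit. `a` may contain set bits only at test-bit positions."""
    return a - shift(a, -(l - 1))


def constants(k, l):
    """Compute K_1 and K_2 with O(log k) word operations (k a power of 2)."""
    log_k = k.bit_length() - 1
    k1 = shift(1, l - 1)                     # test bit of field 0
    for t in range(log_k + 1):
        k1 |= shift(k1, (1 << t) * l)        # doubles the number of fields
    k2 = shift(k1, -(l - 1)) - 1             # fields hold 1, ..., 1, 0
    for t in range(log_k + 1):
        # Adding a copy shifted by 2^t fields turns min(i, 2^t) into
        # min(i, 2^(t+1)) in field i; the mask cuts off overflow fields.
        k2 += shift(k2, (1 << t) * l) & copy_test_bit(k1, l)
    return k1, k2
```

For `k = 2` and `l = 4` the words have four fields of four bits, so the two constants are conveniently read off in hexadecimal as `0x8888` and `0x3210`.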

We next describe in detail the preprocessing step of our merging algorithm. The central
problem is, given a word $Y = [y_k, \ldots, y_1]$, to produce the word

\[ [y_1, \ldots, y_k, \underbrace{0, \ldots, 0}_{k \text{ fields}}]. \]

In other words, the task is to reverse a sequence of $2k$ numbers. First observe that if we
identify addresses of fields with their binary representation as strings of $\log k + 1$ bits
and denote by $\bar{a}$ the address obtained from the address $a$ by complementing each of
its $\log k + 1$ bits, then the entry in the field with address $a$, for
$a = 0, \ldots, 2k - 1$, is to be moved to address $\bar{a}$. To see this, note that the
entry under consideration is to be moved to address $2k - 1 - a$. Since $2k$ is a power of 2,
$2k - 1 - a = \bar{a}$. By this observation, the preprocessing can be implemented by stages
numbered $0, \ldots, \log k$, executed in any order, where Stage $t$, for
$t = 0, \ldots, \log k$, swaps each pair of entries whose addresses differ precisely in bit
number $t$. In order to implement Stage $t$, we use a mask $M$ to separate those entries
whose addresses have a 1 in bit number $t$ from the remaining entries, shifting the former
right by $2^t$ fields and shifting the latter left by $2^t$ fields. The complete
preprocessing can be programmed as follows:

    for t := 0 to log k do
    begin (* Complement bit t *)
      (* Mask away fields with 0 in position t of their address *)
      M := CopyTestBit((K_2 ↑ (l - 1 - t)) and K_1);
      Y := ((Y and M) ↑ (-2^t l)) or
           ((Y - (Y and M)) ↑ 2^t l);
    end;
    Z := X or Y;
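The reversal by address-bit complementation can be checked on small words. The sketch below follows our reading of the preprocessing loop, with stages 0, ..., log k so that the k entries of Y end up reversed in fields k, ..., 2k - 1. For brevity the constants are built by plain loops here instead of the doubling scheme; all names are illustrative.

```python
def reverse_fields(y, k, l):
    """Move the entry in field a of y to field 2k - 1 - a, for all a.

    y holds k entries in fields 0..k-1 (k a power of 2, l bits per field).
    """
    log_k = k.bit_length() - 1
    # K_1: test bit set in each of the 2k fields; K_2: field i holds i.
    k1 = sum(1 << (i * l + l - 1) for i in range(2 * k))
    k2 = sum(i << (i * l) for i in range(2 * k))
    for t in range(log_k + 1):
        # Bring bit t of every address to the test-bit position, then
        # spread it over the field (CopyTestBit) to obtain the mask.
        test = (k2 << (l - 1 - t)) & k1
        mask = test - (test >> (l - 1))
        distance = (1 << t) * l
        # Entries at addresses with bit t set move right by 2^t fields,
        # all others move left by 2^t fields.
        y = ((y & mask) >> distance) | ((y - (y & mask)) << distance)
    return y
```

With `k = 2` and `l = 4`, the word `0x31` (entries 1 and 3 in fields 0 and 1) becomes `0x1300`, i.e., the two entries appear in reverse order in fields 3 and 2.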

Let us now turn to the bitonic sorting itself, one stage of which is illustrated in Fig. 1.
In Stage $t$ of the bitonic sorting, for $t = \log k, \ldots, 0$, we first extract the
entries in those fields whose addresses have a 1 in bit position $t$ and compute a word $A$
containing these entries, moved right by $2^t$ fields, and a word $B$ containing the
remaining entries in their original positions. Numbers that are to be compared are now in
corresponding positions in $A$ and $B$. We carry out all the comparisons simultaneously in a
way pioneered by Paul and Simon: All the test bits in $A$ are set to 1, the test bits in $B$
are left at 0, and $B$ is subtracted from $A$. Now the test bit in a particular position
"survives" if and only if the entry in $A$ in that position is at least as large as the
corresponding entry in $B$. Using the resulting sequence of test bits, it is easy to create a
mask $M$ that separates the minima ((B and M) or (A - (A and M))) from the maxima
((A and M) or (B - (B and M))) and to move the latter left by $2^t$ fields. The complete
program for the bitonic sorting follows.

    for t := log k downto 0 do
    begin
      (* Mask away fields with 0 in position t of their address *)
      M := CopyTestBit((K_2 ↑ (l - 1 - t)) and K_1);
      A := (Z and M) ↑ (-2^t l);
      B := Z - (Z and M);
      M := CopyTestBit(((A or K_1) - B) and K_1);
      Z := (B and M) or
           (A - (A and M)) or
           ((A and M) ↑ 2^t l) or
           ((B - (B and M)) ↑ 2^t l);
    end;

This completes the description of the algorithm to merge two sorted words. If the input
numbers $x_1, \ldots, x_k$ and $y_1, \ldots, y_k$ are drawn from the range $1..m$, a field
must be able to store nonnegative integers up to $\max\{m, 2k - 1\}$, in addition to the test
bit; a field length of $\lceil \log(m+k) \rceil + 2$ bits therefore suffices. The running
time of the algorithm clearly is $O(\log k)$.
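Putting the pieces together, the following sketch merges two sorted words using only shifts, subtraction and bitwise operations on Python integers. It reflects our reconstruction of the preprocessing and bitonic-sorting programs; it assumes k is a power of 2, that every entry fits in l - 1 bits and that l - 1 > log k, and it builds the two constants by plain loops for brevity. The name `merge_words` is our own.

```python
def merge_words(x, y, k, l):
    """Merge sorted words x and y (k entries each, smallest rightmost)
    into one word holding the 2k entries in sorted order."""
    log_k = k.bit_length() - 1
    k1 = sum(1 << (i * l + l - 1) for i in range(2 * k))   # all test bits
    k2 = sum(i << (i * l) for i in range(2 * k))           # field addresses

    def copy_test_bit(a):
        return a - (a >> (l - 1))

    def address_mask(t):
        # Ones in the l - 1 low bits of fields whose address has bit t set.
        return copy_test_bit((k2 << (l - 1 - t)) & k1)

    # Preprocessing: reverse y into fields k..2k-1 and combine with x,
    # so that the 2k entries form a bitonic sequence.
    for t in range(log_k + 1):
        m = address_mask(t)
        d = (1 << t) * l
        y = ((y & m) >> d) | ((y - (y & m)) << d)
    z = x | y

    # Bitonic sorting, stages log k, ..., 0.
    for t in range(log_k, -1, -1):
        m = address_mask(t)
        d = (1 << t) * l
        a = (z & m) >> d          # entries from addresses with bit t = 1
        b = z - (z & m)           # entries from addresses with bit t = 0
        # Test bits set in a, clear in b: a test bit survives the
        # subtraction exactly where the entry of a is >= that of b.
        m = copy_test_bit(((a | k1) - b) & k1)
        minima = (b & m) | (a - (a & m))
        maxima = (a & m) | (b - (b & m))
        z = minima | (maxima << d)
    return z
```

Each of the two loops runs log k + 1 times with a constant number of word operations per iteration, in line with the O(log k) running time stated above.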

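Fig. 1. One stage of bitonic sorting. (The figure shows the words A and B and the four
terms (B and M), (A - (A and M)), ((A and M) ↑ 2^t l) and ((B - (B and M)) ↑ 2^t l) that are
combined by or.)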

4 Nonconservative sorting on the EREW PRAM

Given the ability to merge two sorted words, it is easy to develop a procedure for merging
two sorted sequences given in word representation. Let $X$ and $Y$ be such sequences, each
comprising $r$ words. Although the sequences $X$ and $Y$ may contain repeated values, we will
consider the elements of $X$ and $Y$ to be pairwise distinct, and we impose a total order on
these elements by declaring an element $z$ of $X$ or $Y$ to be smaller than another element
$z'$ of $X$ or $Y$ exactly if its value is smaller, or if it precedes $z'$ in the same input
sequence, or if $z$ and $z'$ have the same value, $z$ belongs to $X$, and $z'$ belongs to $Y$
(i.e., ties are broken by considering elements in $X$ to be smaller). Define a representative
as the first (smallest, rightmost) entry in a word of either $X$ or $Y$ and begin by
extracting the representative of each word of $X$ ($Y$, respectively) to form a sorted
sequence $x_1, \ldots, x_r$ ($y_1, \ldots, y_r$, respectively), stored in the usual fashion,
i.e., one element to a word. Merge $x_1, \ldots, x_r$ and $y_1, \ldots, y_r$, according to
the total order defined above, by means of a standard algorithm such as the one described by
Hagerup and Rüb, which uses $O(\log r)$ time and $O(r)$ operations, and let
$z_1, \ldots, z_{2r}$ be the resulting sequence.

For $i = 1, \ldots, 2r$, associate a processor with $z_i$. The task of this processor is to
merge the subsequences of $X$ and $Y$ comprising those input numbers $z$ with
$z_i \le z < z_{i+1}$ (take $z_{2r+1} = \infty$). One of these subsequences is part of the
sequence stored in the word containing $z_i$, while the second subsequence is part of the
sequence stored in the word containing $z_j$, where $j$ is the maximal integer with
$1 \le j < i$ such that $z_i$ and $z_j$ do not belong to the same input sequence; if there is
no such $j$, the second subsequence is empty. Segmented broadcasting using $O(\log r)$ time
and $O(r)$ operations allows each processor associated with an element of
$z_1, \ldots, z_{2r}$ to obtain copies of the (at most) two words containing the subsequences
that it is to merge, after which the processor can use the algorithm described in the
previous section to merge its words in $O(\log k)$ time. For $i = 1, \ldots, 2r$, the
processor associated with $z_i$ then locates $z_i$ and $z_{i+1}$ in the resulting sequence by
means of binary search, which allows it to remove those input numbers $z$ that do not satisfy
$z_i \le z < z_{i+1}$.

Finally the pieces of the output produced by different processors must be assembled
appropriately in an output sequence. By computing the number of smaller input numbers, each
processor can determine the precise location in the output of its piece, which spreads over
at most three (in fact, two) output words. The processors then write their pieces to the
appropriate output words. In order to avoid write conflicts, this is done in three phases,
devoted to processors writing the first part of a word, the last part of a word, and neither
(i.e., a middle part), respectively. To see that indeed no write conflict can occur in the
third phase, observe that if two or more processors were to write to the same word in the
third phase, at least one of them would contribute a representative coming from the same
input sequence as the representative contributed by the processor writing the last part of
the word (in the second phase); this is impossible, since no output word can contain two or
more representatives from the same input sequence. We have proved:

Lemma 1. For all given integers $n$, $m$, $k$ and $l$, where $k$ is a power of 2 with
$2 \le k \le n$, $m$ is positive and $l \ge \lceil \log(m+k) \rceil + 2$, two sorted
sequences of $n$ integers in the range $1..m$ each, given in the word representation with
parameters $k$ and $l$, can be merged on an EREW PRAM using $O(\log n)$ time,
$O((n/k) \log k)$ operations, $O(n/k)$ space and a word length of $O(kl + \log n)$ bits.

Theorem 1. For all given integers $n$, $k$ and $m$ with $2 \le k \le n$ and $m$ such that
$\lfloor \log m \rfloor$ is known, $n$ integers in the range $1..m$ can be sorted on an EREW
PRAM using $O((\log n)^2)$ time, $O((n/k) \log k \log n + n)$ operations, $O(n)$ space and a
word length of $O(k \log m + \log n)$ bits.

Proof. Assume that $n$ and $k$ are powers of 2. For $m \le (\log n)^2$, the result is implied
by the standard integer sorting results mentioned in the introduction. Furthermore, if
$k \ge (\log n)^2$, we can replace $k$ by $(\log n)^2$ without weakening the claim. Assume
therefore that $k \le m$, in which case an integer $l$ with
$l \ge \lceil \log(m+k) \rceil + 2$, but $l = O(\log m)$, can be computed in constant time,
using the given value of $\lfloor \log m \rfloor$. We use the word representation with
parameters $k$ and $l$ and sort by repeatedly merging pairs of ever longer sorted sequences
in a tree-like fashion, as in sequential merge sorting. During a first phase of the merging,
consisting of $\log k$ merging rounds, the number of input numbers per word doubles in each
round, altogether increasing from 1 to $k$. The remaining $O(\log n)$ rounds operate with $k$
input numbers per word throughout. Both phases of the merging are executed according to
Lemma 1. For $i = 1, \ldots, \log k$, the $i$th merging round inputs $n/2^{i-1}$ words, each
containing a sorted sequence of $2^{i-1}$ input numbers, and merges these words in pairs,
storing the result in $n/2^i$ words of $2^i$ input numbers each. By Lemma 1, the cost of the
first phase of the merging, in terms of operations, is
$O\bigl(\sum_{i=1}^{\log k} i\,n/2^{i}\bigr) = O(n)$. The cost of the second phase is
$O((n/k) \log k \log n)$, and the total time needed is $O((\log n)^2)$. □
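The claim that the first phase costs only $O(n)$ operations rests on a geometric-type series. As a short check (our own elaboration, assuming that round $i$ merges $n/2^i$ pairs of words at a cost of $O(i)$ operations per pair):

```latex
\[
\sum_{i=1}^{\log k} \frac{i\,n}{2^{i}}
\;\le\; n \sum_{i=1}^{\infty} \frac{i}{2^{i}}
\;=\; 2n .
\]
```

The identity $\sum_{i \ge 1} i/2^i = 2$ follows from differentiating the geometric series $\sum_{i \ge 0} x^i = 1/(1-x)$ and evaluating at $x = 1/2$.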

Corollary 1. For all integers $n, m \ge 4$, if $\lfloor \log m \rfloor$ is known, then $n$
integers in the range $1..m$ can be sorted with a linear time-processor product in
$O((\log n)^2)$ time on an EREW PRAM with a word length of $O(\log n \log\log n \log m)$
bits.
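The corollary appears to follow from the preceding theorem by a suitable choice of $k$. As a sketch (our own calculation, assuming the theorem's bounds of $O((n/k)\log k \log n + n)$ operations and $O(k \log m + \log n)$ bits, and taking $k$ to be a power of 2 with $k = \Theta(\log n \log\log n)$):

```latex
\begin{align*}
\frac{n}{k}\,\log k\,\log n
  &= O\Bigl(\frac{n}{\log n\,\log\log n}\cdot \log\log n \cdot \log n\Bigr)
   = O(n),\\[2pt]
k\log m + \log n
  &= O(\log n\,\log\log n\,\log m).
\end{align*}
```

Here $\log k = O(\log\log n)$, so the operation count is linear while the word length matches the bound stated in the corollary.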

5 Conservative sorting on the EREW PRAM

The sorting algorithm in the previous section is inherently nonconservative, in the sense
that in most interesting cases, the algorithm can be applied only with a word length
exceeding $\Theta(\log(n+m))$ bits. In this section we use a cascade of several different
simple reductions to derive a conservative sorting algorithm. In general terms, a reduction
allows us to replace an original sorting problem by a collection of smaller sorting problems,
all of which are eventually solved using the algorithm of Theorem 1.

The first reduction is the well-known radix sorting, which we briefly describe. Suppose that
we are to sort records whose keys are tuples of $w$ components stably with respect to the
lexicographical order on these tuples. Having at our disposal an algorithm that can sort
stably with respect to single components, we can carry out the sorting in $w$ successive
phases. In the first phase, the records are sorted with respect to their least significant
components, in the second phase they are sorted with respect to the second-least significant
components, etc. Since the sorting in each phase is stable (so that it does not upset the
order established by previous phases), after the $w$th phase the records will be stably
sorted, as desired. We shall apply radix sorting to records with keys that are not actually
tuples of components, but nonnegative integers that can be viewed as tuples by considering
some fixed number of bits in their binary representation as one component.
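The principle of radix sorting is easy to demonstrate sequentially. In the sketch below, Python's built-in `sorted` (which is guaranteed to be stable) stands in for the stable single-component sorting algorithm; the function name and parameters are illustrative and not taken from the paper.

```python
def radix_sort(records, key, w, bits):
    """Stably sort records by key(r), a nonnegative integer below
    2**(w * bits), viewed as a w-tuple of components of `bits` bits."""
    component_mask = (1 << bits) - 1
    for phase in range(w):                  # least significant component first
        offset = phase * bits

        def component(record):
            # Extracting a component needs only a shift and a bitwise and.
            return (key(record) >> offset) & component_mask

        # Each phase must be stable so that it does not upset the order
        # established by the previous phases.
        records = sorted(records, key=component)
    return records
```

Because the components are groups of bits, no multiplication or division is needed to access them, which is the reason the text requires the component range to be a power of 2.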

For the problem of sorting $n$ integers in the range $1..m$, define $n$ as the size and $m$
as the height of the sorting problem. For arbitrary positive integers $n$, $m$ and $w$, where
$m$ is a power of 2, radix sorting can be seen to allow us to reduce an original sorting
problem of size $n$ and height $m^w$ to $w$ sorting problems of size $n$ and height $m$ each
that must be solved one after the other (because the input of one problem depends on the
output of the previous problem). The requirement that $m$ should be a power of 2 ensures
that when input numbers are viewed as $w$-tuples of integers in the range $1..m$, any given
component of a tuple can be accessed in constant time (recall that we do not assume the
availability of unit-time multiplication).

Our second reduction, which was also used by Rajasekaran and Sen and which we call group
sorting, allows us to reduce a problem of sorting integers in a sublinear range to several
problems of sorting integers in a linear range. More precisely, if $n$ and $m$ are powers of
2 with $2 \le m \le n$, we can, spending $O(\log n)$ time and $O(n)$ operations, reduce a
sorting problem of size $n$ and height $m$ to $n/m$ sorting problems of size and height $m$
each that can be executed in parallel. The method is as follows: Divide the given input
numbers into $r = n/m$ groups of $m$ numbers each and sort each group. Then, for
$i = 1, \ldots, m$ and $j = 1, \ldots, r$, determine the number $n_{i,j}$ of occurrences of
the value $i$ in the $j$th group. Because each group is sorted, this can be done in constant
time using $O(rm) = O(n)$ operations. Next compute the sequence
$N_{1,1}, \ldots, N_{1,r}, N_{2,1}, \ldots, N_{2,r}, \ldots, N_{m,1}, \ldots, N_{m,r}$ of
prefix sums of the sequence
$n_{1,1}, \ldots, n_{1,r}, n_{2,1}, \ldots, n_{2,r}, \ldots, n_{m,1}, \ldots, n_{m,r}$ (the
associative operation being usual addition), which takes $O(\log n)$ time and uses $O(n)$
operations. For $i = 1, \ldots, m$ and $j = 1, \ldots, r$, the last occurrence of $i$ in the
(sorted) $j$th group, if any, can now compute its position in the output sequence simply as
$N_{i,j}$, after which all remaining output positions can be assigned using segmented
broadcasting, an occurrence of $i$ in the $j$th group $d$ positions before the last
occurrence of $i$ in the $j$th group being placed at output position $N_{i,j} - d$.
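A sequential simulation of group sorting may help to see how the prefix sums determine the output positions. The sketch assumes only that m divides n (the text additionally takes n and m to be powers of 2), uses `sorted` in place of the per-group sorting algorithm, and replaces the segmented-broadcasting step by a direct computation of the offset d; all names are illustrative.

```python
from bisect import bisect_right
from itertools import accumulate


def group_sort(xs, m):
    """Sort n integers in the range 1..m via n/m groups of m numbers."""
    n = len(xs)
    assert n % m == 0 and all(1 <= x <= m for x in xs)
    r = n // m
    groups = [sorted(xs[j * m:(j + 1) * m]) for j in range(r)]
    # counts in the order n_{1,1}, ..., n_{1,r}, n_{2,1}, ..., n_{m,r}
    counts = [g.count(i) for i in range(1, m + 1) for g in groups]
    prefix = list(accumulate(counts))        # the sequence of N_{i,j}
    out = [None] * n
    for j, g in enumerate(groups):
        for p, i in enumerate(g):
            last = bisect_right(g, i) - 1    # last occurrence of i in g
            d = last - p                     # distance to that occurrence
            # 1-based output position N_{i,j} - d, stored 0-based
            out[prefix[(i - 1) * r + j] - d - 1] = i
    return out
```

Since occurrences of the same value are placed group by group and in their order within each group, the placement preserves the stability of the per-group sorts.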

ij

We now describe an algorithm to sort $n$ integers in the range $1..n$ with a word length of $O(\varphi)$ bits, where $\varphi \ge \log n$. Our approach is to sort numbers in a restricted range $1..2^s$, where $s$ is a positive integer, and to apply the principle of radix sorting to sort numbers in the larger range $1..n$ in $O(\log n/s)$ phases, $s$ bits at a time. Within each phase, $n$ numbers in the range $1..2^s$ are sorted using the method of group sorting, where the algorithm of Theorem is applied as the basic algorithm to sort each group. Hence each radix sort phase takes $O(s^2 + \log n)$ time and uses $O((n/k)\, s \log k + n)$ operations. Employed in a straightforward manner, the algorithm of Theorem is able to sort only $s$-bit keys. Extending each key by another $s$ bits and stabilizing the algorithm as described in Section, however, we can assume that the algorithm sorts the full input numbers (in the range $1..n$) stably by their $s$-bit keys within each group, a necessary prerequisite for the use of radix sorting.
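The role of stability in the radix sorting phases can be seen in a small sequential sketch. The function name `radix_sort` and its parameters are hypothetical, and Python's built-in stable `sorted` stands in for the stable per-phase sorting by $s$-bit keys; this is not the parallel algorithm itself.

```python
def radix_sort(values, s, bits):
    """Sort non-negative integers of at most `bits` bits, s bits at a time (LSD radix sort)."""
    mask = (1 << s) - 1
    for shift in range(0, bits, s):
        # Each phase sorts the full numbers stably by one s-bit digit;
        # stability preserves the order established by the earlier, less significant phases.
        values = sorted(values, key=lambda x: (x >> shift) & mask)
    return values
```

The loop runs $\lceil \mathit{bits}/s \rceil$ times, which corresponds to the $O(\log n/s)$ phases mentioned in the text when the numbers have $O(\log n)$ bits.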

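Lemma For all given integers $n$, $s \ge 1$ and $\varphi \ge \log n$, $n$ integers in the range $1..n$ can be sorted stably on an EREW PRAM using

\[ O\!\left(\log n\left(s + \frac{\log n}{s}\right)\right) \text{ time,} \]

\[ O\!\left(n \log n\left(\frac{s\log\varphi}{\varphi} + \frac{1}{s}\right) + n\right) \text{ operations,} \]

$O(n)$ space and a word length of $O(\varphi)$ bits.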

Proof. If $s > \lfloor \log n \rfloor$, we can replace $s$ by $\lfloor \log n \rfloor$, and if $\varphi > s^3$, we can replace $\varphi$ by $s^3$, in each case without weakening the claim. Assume therefore that $s \le \log n$, in which case we can use the algorithm sketched above, and that $\varphi \le s^3$. Choose $k$ as an integer with $2 \le k \le n$ such that $k = \Theta(\varphi/s)$, which causes the necessary word length to be $O(ks + \log n) = O(\varphi)$ bits,

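as required. The time needed by the algorithm is $O((\log n/s)(s^2 + \log n))$, as claimed, and the number of operations is

\[ O\!\left(\frac{\log n}{s}\left(\frac{n s \log k}{k} + n\right)\right) = O\!\left(n \log n\left(\frac{\log k}{k} + \frac{1}{s}\right)\right) = O\!\left(n \log n\left(\frac{s\log\varphi}{\varphi} + \frac{1}{s}\right)\right). \qquad \Box \]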

We obtain our best conservative sorting algorithm by combining Lemma with a reduction due to Rajasekaran and Sen, who showed that when $n$ is a perfect square, a sorting problem of size and height $n$ reduces to two batches of sorting problems, each batch consisting of $\sqrt{n}$ problems of size and height $\sqrt{n}$. The sorting problems comprising a batch can be executed in parallel, whereas the first batch must be completed before the processing of the second batch can start, and the reduction itself uses $O(\log n)$ time and $O(n)$ operations. The reduction is simple: Using the principle of radix sorting, the original sorting problem of size and height $n$ is reduced to two sorting problems of size $n$ and height $\sqrt{n}$ each, each of which in turn is reduced, using the principle of group sorting, to a collection of $\sqrt{n}$ sorting problems of size and height $\sqrt{n}$.

In general, we cannot assume that $n$ is a perfect square, and the reduction above needs unit-time division, which is not part of our instruction repertoire. Without loss of generality, however, we can assume that $n$ is a power of 2, and we can modify the argument to show that in this case a sorting problem of size and height $n$ reduces, using $O(\log n)$ time and $O(n)$ operations, to two batches of sorting problems, each batch comprising sorting problems of total size $n$ and individual size and height $2^{\lceil \log n/2 \rceil}$. Iterating this reduction $i \ge 1$ times yields a procedure that reduces a sorting problem of size and height $n$ to $2^i$ batches of sorting problems, each batch comprising sorting problems of total size $n$ and individual size and height $2^{\lceil \log n/2^i \rceil}$. The reduction itself uses $O(i \log n)$ time and $O(2^i n)$ operations; the latter quantity is always dominated by the number of operations needed to solve the subproblems resulting from the reduction, so that we need not account for it in the following. Our plan is to solve the small subproblems generated by the reduction using the algorithm of Lemma, with $i$ chosen to make the total running time come out at a prespecified value. This gives rise to a tradeoff between running time and work: Increasing $i$ has the effect of lowering the running time, but raising the total number of operations.

In the case $m > n$ we appeal to yet another reduction. Suppose that we are given a sorting problem of size $n$ and height $m > n$, where $n$ is a power of 2, and view each input number as written in the positional system with basis $n$ as a sequence of $w = \Theta(\log m/\log n)$ digits. As stated earlier, radix sorting can be thought of as reducing the original problem of size $n$ and height $m$ to $w$ sorting problems of size and height $n$ each that must be solved one after the other. Vaidyanathan et al. gave a different reduction that also results in a collection of $O(w)$ sorting problems of size and height $n$ each. The difference is that the sorting problems resulting from the reduction of Vaidyanathan et al. can be partitioned into $O(\log w)$ batches

such that all subproblems in the same batch can be solved in parallel. Moreover, the number of subproblems in the $i$th batch, for $i = 1, 2, \ldots$, is at most $2^{-i/3} w$. The reduction works as

follows: Suppose that we sort the input numbers just by their two most significant digits (to basis $n$). This is one sorting problem of size $n$ and height $n^2$ or, using radix sorting, two sorting problems of size and height $n$ each. After the sorting we can replace each pair of first two digits by its rank in the sorted sequence of pairs without changing the relative order of the input numbers. The rank information can be obtained via segmented broadcasting within resource bounds dominated by those needed for the sorting, and since the ranks are integers in the range $1..n$, we have succeeded in replacing two digit positions by just one digit position. We can do the same in parallel for the 3rd and 4th digit position, for the 5th and 6th position, and so on; if the total number of digit positions is odd, we sort the last digit position together with the two positions preceding it. This involves a total of $w$ sorting problems of size and height $n$, divided into three batches (one for each radix sorting phase), and it reduces the number of digit positions to at most $w/2$. We now simply proceed in the same way until the input numbers have been sorted by a single remaining digit.
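The digit-pairing reduction just described can be sketched sequentially as follows. This is a hedged illustration, not the parallel procedure: the function name `sort_by_digit_pairing` is hypothetical, a dictionary of dense ranks stands in for sorting plus segmented broadcasting, and the sketch assumes at most $n$ input numbers, each smaller than $n^w$.

```python
def sort_by_digit_pairing(numbers, n, w):
    """Sort numbers < n**w by repeatedly replacing adjacent digit pairs (base n) by ranks."""
    assert 0 < len(numbers) <= n and all(0 <= x < n ** w for x in numbers)
    # Base-n digits of every number, most significant digit first.
    digits = [[(x // n ** (w - 1 - p)) % n for p in range(w)] for x in numbers]
    rounds = 0
    while len(digits[0]) > 1:
        width = len(digits[0])
        blocks, p = [], 0
        while p < width:
            # Pair up positions; an odd number of positions ends with one block of three.
            size = 3 if width - p == 3 else 2
            blocks.append((p, p + size))
            p += size
        new_digits = [[] for _ in numbers]
        for lo, hi in blocks:
            keys = [tuple(d[lo:hi]) for d in digits]
            # Dense rank among the distinct blocks; at most len(numbers) <= n values,
            # so every rank is again a valid base-n digit.
            rank = {t: k for k, t in enumerate(sorted(set(keys)))}
            for row, t in zip(new_digits, keys):
                row.append(rank[t])
        digits = new_digits
        rounds += 1
    order = sorted(range(len(numbers)), key=lambda idx: digits[idx][0])
    return [numbers[idx] for idx in order], rounds
```

Each round roughly halves the number of digit positions, so the number of rounds grows only logarithmically in $w$, whereas plain radix sorting needs $w$ successive phases.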

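Theorem Let $n, m$ be integers and take $h = \min\{n, m\}$. Then, for all given integers $t \ge \log n + \log h \log\log h \log(2 + \log m/\log n)$ and $\varphi \ge \log(n + m)$, $n$ integers in the range $1..m$ can be sorted stably on an EREW PRAM using $O(t)$ time,

\[ O\!\left(n\left(\frac{\log h \log m}{t} + \log m\sqrt{\frac{\log\varphi}{\varphi}} + 1\right)\right) \]

operations, $O(n)$ space and a word length of $O(\varphi)$ bits.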

blog log mc

Pro of Assume that log m since otherwise we can replace by without

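weakening the claim. Assume further that $n$ and $m$ are powers of 2 and consider first the case $m \le n$. Using the principle of group sorting and spending $O(\log n)$ time and $O(n)$ operations, we can reduce the given problem to a batch of sorting problems of total size $n$ and individual size and height $m$. We then compute positive integers $\tau$ and $i$ with $\tau = \Theta(\min\{t, \log m\sqrt{\varphi/\log\varphi}\})$ and $2^i = \Theta((\log m)^2/\tau)$, carry out the $i$-level reduction described above and solve the resulting subproblems using the algorithm of Lemma with $s = \Theta(\tau/\log m)$. What remains is to analyze the time and the number of operations needed. First note that

\[ \lceil \log m/2^i \rceil = O(\log m/2^i) = O(\tau/\log m) \]

and that

\[ 2^i = O((\log m)^2/\tau). \]

Since $i = O(\log\log m)$, the time needed for the $i$-level reduction is $O(i \log m) = O(t)$, and the time needed for the processing of $2^i$ batches of subproblems of size $2^{\lceil \log m/2^i \rceil}$ each is

\[ O\!\left(2^i \lceil \log m/2^i \rceil\left(s + \frac{\lceil \log m/2^i \rceil}{s}\right)\right) = O\!\left(\log m \cdot \frac{\tau}{\log m}\right) = O(\tau) = O(t). \]

The number of operations needed for the processing of $2^i$ batches of subproblems, each batch consisting of subproblems of total size $n$, is

\[ O\!\left(2^i\left(n \lceil \log m/2^i \rceil\left(\frac{s\log\varphi}{\varphi} + \frac{1}{s}\right) + n\right)\right) \]
\[ = O\!\left(n \log m\left(\frac{\tau\log\varphi}{\varphi\log m} + \frac{\log m}{\tau}\right) + 2^i n\right) \]
\[ = O\!\left(n\left(\frac{(\log m)^2}{\tau} + \frac{\tau\log\varphi}{\varphi}\right)\right) \]
\[ = O\!\left(n \log m\left(\frac{\log m}{\tau} + \sqrt{\frac{\log\varphi}{\varphi}}\right)\right) \]
\[ = O\!\left(n\left(\frac{\log h \log m}{t} + \log m\sqrt{\frac{\log\varphi}{\varphi}}\right)\right). \]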

This proves Theorem in the case $m \le n$. Consider now the case $m > n$ and suppose first that $\min\{\lfloor \varphi/\log\varphi \rfloor, t\} < (\log n)^2$. We then use the reduction of Vaidyanathan et al. to reduce the original problem of size $n$ and height $m$ to $O(\log w)$ batches of subproblems of size and height $n$ each, where $w = \Theta(\log m/\log n)$, such that the $i$th batch comprises at most $2^{-i/3} w$ subproblems. Each of the resulting subproblems is solved using the algorithm of Theorem for the case $m = n$. The minimum time needed is $O(\log n \log\log n \log w) = O(\log h \log\log h \log w)$, as claimed. In order to achieve a total time of $\Theta(t)$, we spend $\Theta(t)$ time on each of the six first batches, and from then on we spend only half as much time on each group of six successive batches as on the previous group. This clearly does indeed yield an overall time bound of $O(t)$. Since the number of subproblems in each group of six successive batches is at most $1/4$ of the number of subproblems in the previous group, while the available time is half that available to the previous group, it can be seen that the total number of operations executed is within a constant factor of the number of operations needed by the first batch, i.e., the total number of operations is

\[ O\!\left(\frac{\log m}{\log n}\, n\left(\frac{(\log n)^2}{t} + \log n\sqrt{\frac{\log\varphi}{\varphi}} + 1\right)\right) \]
\[ = O\!\left(n\left(\frac{\log n \log m}{t} + \log m\sqrt{\frac{\log\varphi}{\varphi}} + \frac{\log m}{\log n}\right)\right) \]
\[ = O\!\left(n\left(\frac{\log h \log m}{t} + \log m\sqrt{\frac{\log\varphi}{\varphi}}\right)\right), \]

where the last transformation uses the upper bounds on $\varphi$ and $t$ assumed above.

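On the other hand, if $\min\{\lfloor \varphi/\log\varphi \rfloor, t\} \ge (\log n)^2$, we use the algorithm of Theorem with $k = \Theta(\varphi/\log m)$. The time needed is $O((\log n)^2) = O(t)$, the word length is $O(\varphi)$, and the number of operations is

\[ O\!\left(\frac{n}{k}\log k \log n + n\right) = O\!\left(\frac{n \log m}{\varphi}\log\varphi \log n + n\right) \]
\[ = O\!\left(n \log m\sqrt{\frac{\log\varphi}{\varphi}}\cdot\sqrt{\frac{\log\varphi}{\varphi}}\,\log n + n\right) = O\!\left(n \log m\sqrt{\frac{\log\varphi}{\varphi}} + n\right). \qquad \Box \]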

Theorem exhibits a general tradeoff between time, operations and word length. Specializing to the case of conservative sorting of integers in a linear range, we obtain

Corollary For all given integers $n$ and $t \ge \log n \log\log n$, $n$ integers in the range $1..n$ can be sorted stably on an EREW PRAM using $O(t)$ time, $O(n(\sqrt{\log n \log\log n} + (\log n)^2/t))$ operations, $O(n)$ space and a standard word length of $O(\log n)$ bits.
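As a quick consistency check, the corollary can be traced back to the general tradeoff. The lines below are only a sketch under an assumption: they take the operation bound of the theorem to have the form $O(n(\log h \log m/t + \log m\sqrt{\log\varphi/\varphi} + 1))$, with $\varphi$ denoting the word length, and then substitute $m = n$, $h = n$ and $\varphi = \Theta(\log n)$.

```latex
\begin{align*}
\log m\,\sqrt{\frac{\log\varphi}{\varphi}}
  &= \log n\,\sqrt{\frac{\Theta(\log\log n)}{\Theta(\log n)}}
   = \Theta\bigl(\sqrt{\log n\,\log\log n}\bigr),\\
\frac{\log h\,\log m}{t}
  &= \frac{(\log n)^2}{t},\\
\log n + \log h\,\log\log h\,\log\bigl(2 + \tfrac{\log m}{\log n}\bigr)
  &= \log n + \log n\,\log\log n\,\log 3
   = \Theta(\log n\,\log\log n).
\end{align*}
```

The first two lines give the operation count $O(n(\sqrt{\log n \log\log n} + (\log n)^2/t))$, and the third shows why the admissible range of $t$ starts at $\Theta(\log n \log\log n)$.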

Another interesting consequence of Theorem is given in Corollary below. A result corresponding to Corollary was previously known only for $m = (\log n)^{O(1)}$, even for the stronger CRCW PRAM.

Corollary For all integers $n$ and $m$, if $m = 2^{O(\sqrt{\log n/\log\log n})}$, then $n$ integers in the range $1..m$ can be sorted stably on an EREW PRAM using $O(\log n)$ time, $O(n)$ operations, $O(n)$ space and a standard word length of $O(\log n)$ bits.

Algorithms for the CREW PRAM

If concurrent reading is allowed, we can use table lookup to merge two sorted words in time $O(1)$, rather than $O(\log k)$. This possibility was pointed out to us by Richard Anderson. A table that maps two arbitrary sorted words given in the word representation with parameters $k$ and $l$ to the words obtained by merging these has at most $2^{2kl}$ entries, each of which can be computed sequentially in $O(\log k)$ time using the algorithm of Section. Similarly, a table that maps each pair consisting of a sorted word and an $l$-bit integer to the rank of the given integer in the given word contains at most $2^{(k+1)l}$ entries, each of which can be computed in $O(\log k)$ time by means of binary search. These tables can therefore be constructed using $O(\log k)$ time and $O(2^{2kl} \log k)$ operations. In the following we wish to distinguish between the resources needed to construct such tables and those consumed by merging or sorting proper, the reason being that if an original problem is reduced to a collection of subproblems with the same parameters (size, height, etc.), then all subproblems can be solved using the same tables, so that the cost of constructing the tables can be amortized over the subproblems. We choose to use the term preprocessing to denote the resources needed to construct and store tables that depend only on the size parameters of the input problem, but not on the actual numbers to be merged or sorted.
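The idea of trading preprocessing for constant-time merging can be illustrated with a tiny lookup table. The sketch below makes simplifying assumptions that are not in the text: a sorted word is modelled as a sorted tuple of $k$ values of $l$ bits each rather than as a packed machine word, and the name `build_merge_table` is hypothetical.

```python
from heapq import merge
from itertools import combinations_with_replacement

def build_merge_table(k, l):
    """Precompute the merge of every pair of sorted k-tuples of l-bit values."""
    # combinations_with_replacement yields exactly the nondecreasing k-tuples.
    words = list(combinations_with_replacement(range(1 << l), k))
    return {(a, b): tuple(merge(a, b)) for a in words for b in words}

# Preprocessing depends only on the parameters k and l, not on the data to be merged.
table = build_merge_table(k=2, l=2)
merged = table[((0, 3), (1, 2))]   # a single lookup replaces the O(log k)-time merge
```

The table is built once and can then be shared by all subproblems with the same parameters, which is the reason for accounting for its cost separately as preprocessing.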

Lemma For all given integers $n$, $m$, $k$ and $l$, where $k$ is a power of 2 with $2 \le k \le n$ and $l \ge \lceil \log(m + k) \rceil$, and for all fixed $\epsilon > 0$, two sorted sequences of $n$ integers in the range $1..m$ each, given in the word representation with parameters $k$ and $l$, can be merged on a CREW PRAM using $O(\log n)$ preprocessing time, $O(2^{2kl} \log k + n)$ preprocessing operations and space, $O(\log\log\log m)$ time, $O(n/k)$ operations, $O(n m^{\epsilon}/k)$ space and a word length of $O(kl + \log n)$ bits. Moreover, if $m = O(n/(k \log n))$, the merging time and space can be reduced to $O(1)$ and $O(n)$, respectively.

Proof. We use an algorithm similar to that of Lemma, the main item of interest being how to merge the two sequences of $O(n/k)$ representatives. By a process of repeated squaring executed during the preprocessing, we can either compute $\lfloor \log\log\log m \rfloor$ or determine that $m \ge 2^n$, in both cases without ever creating an integer of more than $\log(n + m)$ bits. If $m \ge 2^n$, the representatives can be merged in $O(\log\log n) = O(\log\log\log m)$ time with the algorithm of Kruskal. Otherwise the representatives can be merged in $O(\log\log\log m)$ time with the algorithm of Berkman and Vishkin, which is responsible for the superlinear space requirements. If $m = O(n/(k \log n))$, finally, the representatives can be merged in constant time and linear space, as noted by Chaudhuri and Hagerup; the latter result assumes the availability of certain integer values that can be computed during the preprocessing. In each case the number of operations needed is $O(n/k)$.

When the representatives have been merged, each processor associated with a representative can access the two words that it is to merge directly, without resorting to segmented broadcasting. The merging itself and the removal of input numbers outside of the interval of interest is done in constant time using table lookup, the relevant tables having been constructed during the preprocessing as described above. The remainder of the computation works in constant time using $O(n/k)$ operations even on the EREW PRAM. $\Box$

The remaining development for the CREW PRAM parallels that for the EREW PRAM, for which reason we provide a somewhat terse description. We refrain from developing a series of nonconservative CREW PRAM algorithms that employ fast merging, but not table lookup.

Theorem For all given integers $n$, $k$ and $m$ with $2 \le k \le n$ such that $\lfloor \log m \rfloor$ is known, and for all fixed $\epsilon > 0$, $n$ integers in the range $1..m$ can be sorted on a CREW PRAM using $O(\log n)$ preprocessing time, $2^{O(k \log m)} + O(n)$ preprocessing operations and space, $O(\log n \log\log\log m)$ time, $O((n/k) \log n + n)$ operations, $O(n m^{\epsilon})$ space and a word length of $O(k \log m + \log n)$ bits. Moreover, if $m = O(n)$, the time and space bounds for the sorting can be reduced to $O(\log n)$ and $O(n)$, respectively.

Pro of We can assume that n and k are p owers of and that k log n m so that an

integer l dlog m k e with l O log m can b e computed in constant time We use

the word representation with parameters $k$ and $l$ and sort using repeated merging as in the algorithm of Theorem. Each round of merging is executed using the algorithm of Lemma, and the complete sorting can be carried out in $O(\log n \log\log\log m)$ time using $O((n/k) \log n)$ operations.

If $m = O(n)$, we can use radix sorting to replace the original sorting problem of size $n$ and height $m$ by two sorting problems of size $n$ and height $O(n/(k \log m))$ each. $\Box$

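Lemma For all given integers $n$, $s \ge 1$ and $\varphi \ge \log n$, $n$ integers in the range $1..n$ can be sorted stably on a CREW PRAM using $O(\log n)$ preprocessing time, $2^{O(\varphi)}$ preprocessing operations and space,

\[ O\!\left(\log n\left(1 + \frac{\log n}{s}\right)\right) \text{ time,} \]

\[ O\!\left(n \log n\left(\frac{s}{\varphi} + \frac{1}{s}\right) + n\right) \text{ operations,} \]

$O(n)$ space and a word length of $O(\varphi)$ bits.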

Proof. We can assume that $s \le \log n$ and that $\varphi \le s^2$. We use radix sorting and group sorting to reduce the original problem to $O(\log n/s)$ batches, each comprising sorting problems of total size $n$ and individual size and height $2^s$. We solve each subproblem using the algorithm of Theorem, with $k$ chosen as an integer with $2 \le k \le n$ such that $k = \Theta(\varphi/s)$. The total time needed is $O((\log n/s) \log n)$, and the number of operations is

\[ O\!\left(\frac{\log n}{s}\left(\frac{n s}{k} + n\right)\right) = O\!\left(n \log n\left(\frac{s}{\varphi} + \frac{1}{s}\right)\right). \qquad \Box \]

The theorem below is our main result for the CREW PRAM. Since it describes a stand-alone algorithm, the cost of table construction is not indicated separately, but incorporated into the overall resource bounds. We fix the word length at the standard $O(\log(n + m))$ bits, since otherwise the preprocessing cost would be so high as to make the result uninteresting.

Theorem Let $n, m$ be integers and take $h = \min\{n, m\}$ and $t_0 = \log h \log(2 + \log m/\log n)$. Then, for all given integers $t \ge t_0 + \log n$, $n$ integers in the range $1..m$ can be sorted stably on a CREW PRAM using $O(t)$ time,

\[ O\!\left(n\left(\frac{\log m}{\sqrt{\log n}} + \frac{\log h \log m}{2^{t/t_0}\log n} + 1\right)\right) \]

operations, $O(n)$ space and a standard word length of $O(\log(n + m))$ bits.

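Proof. Assume that $n$ and $m$ are powers of 2 and consider first the case $m \le n$. Using group sorting, we begin by reducing the given problem to a batch of sorting problems of total size $n$ and individual size and height $m$. We then compute positive integers $a$, $s$ and $i$ with $a \ge t/\log m$, but $a = O(t/\log m)$, $s = \Theta(\sqrt{\log n} + \log m/2^a)$ and $2^i = \Theta(\log m/s)$, use the $i$-level reduction of Rajasekaran and Sen and solve all the resulting subproblems using the algorithm of Lemma.

Recall that, used with a word length of $\varphi$, the algorithm of Lemma needs $2^{O(\varphi)}$ preprocessing operations and space. We choose $\varphi = \Theta(\log n)$ sufficiently small to make the preprocessing cost $O(n)$. For this we must ensure that the algorithm of Lemma is applied to inputs of size at most $2^{\varphi}$. But since the input size is $2^{\lceil \log m/2^i \rceil} \le 2^{\lceil \log n/2^i \rceil}$, this is simply a matter of always choosing $i$ larger than a fixed constant.

Since $2^i = O(2^a)$, the time needed for the $i$-level reduction is $O(i \log m) = O(a \log m) = O(t)$. Furthermore, since $\lceil \log m/2^i \rceil = O(s)$, the time needed for the processing of $2^i$ batches of subproblems of size $2^{\lceil \log m/2^i \rceil}$ each is

\[ O\!\left(2^i \lceil \log m/2^i \rceil\left(1 + \frac{\lceil \log m/2^i \rceil}{s}\right)\right) = O(2^i s) = O(\log m) = O(t). \]

The number of operations needed for the processing of $2^i$ batches of subproblems, each batch consisting of subproblems of total size $n$, is

\[ O\!\left(2^i n \lceil \log m/2^i \rceil\left(\frac{s}{\log n} + \frac{1}{s}\right) + 2^i n\right) \]
\[ = O\!\left(n \log m\left(\frac{s}{\log n} + \frac{1}{s}\right) + n\right) \]
\[ = O\!\left(n \log m\left(\frac{\sqrt{\log n} + \log m/2^a}{\log n} + \frac{1}{\sqrt{\log n}}\right) + n\right) \]
\[ = O\!\left(n \log m\left(\frac{1}{\sqrt{\log n}} + \frac{\log m}{2^{a}\log n}\right) + n\right) \]
\[ = O\!\left(n\left(\frac{\log m}{\sqrt{\log n}} + \frac{\log h \log m}{2^{t/t_0}\log n} + 1\right)\right). \]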

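This proves Theorem in the case $m \le n$. In the case $m > n$ we use the reduction of Vaidyanathan et al. to reduce the original problem to $O(\log(2 + \log m/\log n))$ batches of subproblems of size and height $n$ each and solve each subproblem using the algorithm of Theorem for the case $m = n$, with $\Theta(t \log n/t_0)$ time allotted to each batch. This gives a total time of $O(t)$, and the total number of operations needed is

\[ O\!\left(\frac{\log m}{\log n}\, n\left(\sqrt{\log n} + \frac{\log n}{2^{t/t_0}}\right)\right) \]
\[ = O\!\left(n \log m\left(\frac{1}{\sqrt{\log n}} + \frac{1}{2^{t/t_0}}\right)\right) \]
\[ = O\!\left(n\left(\frac{\log m}{\sqrt{\log n}} + \frac{\log h \log m}{2^{t/t_0}\log n}\right)\right). \qquad \Box \]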

Corollary For all given integers $n$ and $t \ge \log n$, $n$ integers in the range $1..n$ can be sorted on a CREW PRAM using $O(t)$ time, $O(n(\sqrt{\log n} + \log n/2^{t/\log n}))$ operations, $O(n)$ space and a standard word length of $O(\log n)$ bits.

It is instructive to compare the tradeoff of Corollary with that of the algorithm of Kruskal et al. (a). Put in a form analogous to that of Corollary, the result of Kruskal et al. states that for all $t \ge \log n$, $n$ integers in the range $1..n$ can be sorted in $O(t)$ time with a time-processor product of $O(n \log n/\log(2t/\log n) + n)$. Our algorithm and that of Kruskal et al. therefore pair the same minimum time of $\Theta(\log n)$ with the same operation count of $\Theta(n \log n)$, i.e., no savings relative to comparison-based sorting. Allowing more time decreases the number of operations in both cases, but the number of operations of our algorithm decreases doubly-exponentially faster than the corresponding quantity for the algorithm of Kruskal et al. and reaches its minimum of $\Theta(n\sqrt{\log n})$ already for $t = \Theta(\log n \log\log n)$. If we allow still more time, our algorithm does not become any cheaper, and the algorithm of Kruskal et al. catches up for $t = 2^{\Theta(\sqrt{\log n})}$ and is more efficient from that point on.

Placed in a similar context, our EREW PRAM sorting algorithm of Corollary exhibits a tradeoff that is intermediate between the two tradeoffs discussed above. As the time increases, the number of operations decreases exponentially faster than for the algorithm of Kruskal et al.

Corollary For all integers $n$ and $m$, if $m = 2^{O(\sqrt{\log n})}$, then $n$ integers in the range $1..m$ can be sorted stably on a CREW PRAM using $O(\log n)$ time, $O(n)$ operations, $O(n)$ space and a standard word length of $O(\log n)$ bits.

Allowing randomization and assuming the availability of unit-time multiplication and integer division, we can extend the time and processor bounds of Corollary to integers drawn from an arbitrary range. Suppose that the word length is $\varphi$ bits, where $\varphi \ge \log n$, so that the numbers to be sorted come from the range $0..2^{\varphi} - 1$. Following Fredman and Willard, we assume the availability of a fixed number of constants that depend only on $\varphi$. Using an approach very similar to ours, Raman (a) showed that $n$ integers of arbitrary size can be sorted on a CREW PRAM in $O(\log n \log\log n)$ time using $O(n \log n/\log\log n)$ operations with high probability.

The basic idea is very simple. First we choose a random sample $V$ from the set of input elements, sort $V$ by standard means and determine the rank in $V$ of each input element. We then use the algorithm of Corollary to sort the input elements by their ranks and finally sort each group of elements with a common rank, again by a standard sorting algorithm. If each group is relatively small, the last step is not too expensive. The other critical step is the computation of the rank in $V$ of each input element. An obvious way to execute this step is to store the elements in $V$ in sorted order in a search tree $T$, and then to carry out $n$ independent searches in $T$, each using a different input element as its key. Since we aim for an operation count of $O(n\sqrt{\log n})$, however, $T$ cannot be a standard balanced binary tree, and we have to resort to more sophisticated data structures, namely the fusion tree of Fredman and Willard and the structure of van Emde Boas. Building on an approach outlined in Section of Fredman and Willard, we use a fusion tree that is a complete $d$-ary tree, where $d \ge 2$ and $d = 2^{\Theta(\sqrt{\log n})}$, with the elements of the sample $V$ stored in sorted order in its leaves. The distinguishing property of a fusion tree is that in spite of its high node degree, a search can proceed from a parent node to the correct child node in constant time. Since the depth of our fusion tree is $O(\sqrt{\log n})$, it allows the rank of all input elements to be determined using $O(n\sqrt{\log n})$ operations, which precisely matches what we have to pay for sorting the input elements by their ranks.
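The sample-rank-refine scheme described above can be sketched sequentially as follows. The sketch is a stand-in under stated assumptions: binary search via `bisect` replaces the fusion tree or vEB structure, bucketing by rank replaces the integer sorting of Corollary, Python's `sorted` replaces the standard sorting algorithms, and the names `sample_sort`, `sample_size` and `rng` are hypothetical.

```python
import bisect
import random

def sample_sort(xs, sample_size, rng):
    """Sort xs by ranking every element in a random sample and refining each rank class."""
    # Draw the sample with replacement and sort it by standard means.
    sample = sorted(rng.choice(xs) for _ in range(sample_size))
    # Rank of x in the sample: the number of sample elements that are <= x.
    ranks = [bisect.bisect_right(sample, x) for x in xs]
    # Group the input by rank; ranks are small integers in 0..sample_size.
    buckets = [[] for _ in range(sample_size + 1)]
    for x, rank in zip(xs, ranks):
        buckets[rank].append(x)
    # The rank is monotone in x, so sorting each group separately finishes the job.
    out = []
    for bucket in buckets:
        out.extend(sorted(bucket))
    return out
```

The output is correct for every outcome of the random choices; randomness only influences how large the groups become, and hence the cost of the final step.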

The fusion tree makes crucial use of very large integers. Specifically, the constraint is that we must be able to represent numbers of $d^{O(1)}$ bits in a constant number of words. It follows that we can use the fusion tree if $\varphi \ge 2^{\lfloor\sqrt{\log n}\rfloor}$. If $\varphi < 2^{\lfloor\sqrt{\log n}\rfloor}$, we replace the fusion tree as our search structure by a van Emde Boas (vEB) structure (van Emde Boas). The latter supports sequential search operations in $O(\log\varphi)$ time. Since we use a vEB structure only if $\varphi < 2^{\sqrt{\log n}}$, this is again sufficient for our purpose. The main outstanding difficulty is the construction of fusion trees and vEB structures, for which we use randomized algorithms. When discussing randomized algorithms below, we always intend these to fail gracefully, i.e., if they cannot carry out the task for which they were designed, they report failure without causing concurrent writing or other error conditions.

A fusion tree no de contains various tables that can b e constructed sequentially in O d time

The sequential algorithm can b e parallelized to yield a randomized algorithm that constructs a

O O

fusion tree no de with probability at least and uses O log d time d op erations and d

space we present the details in an app endix Letting this algorithm b e executed in parallel

by dlog v e indep endent pro cessor teams for v we can reduce the probability that a

fusion tree no de is not constructed correctly to at most v Hence the entire fusion tree for

a sorted set of v elements which has fewer than v no des can b e constructed in O log v time

O O

using d v log v op erations and d v log v space with probability at least As concerns the

construction of a vEB structure for a sorted set of v elements Raman a gives a randomized

algorithm for this task that uses O log v time O v op erations and O v space and works with

probability at least in fact with a much higher probability We now give the remaining

details of the complete sorting algorithm

Theorem For every xed integer and for al l given integers n and t log n n ar

p

tlog n

log n log n bitrary integers can be sorted on a CREW PRAM using O t time O n

p

log n

operations and O n space with probability at least

Proof. Let $x_1, \ldots, x_n$ be the input elements, assume these to be pairwise distinct and let $\beta$ be an integer constant whose value will be fixed below. Execute the following steps:

1. Draw $v = \lceil n/2^{\beta\lfloor\sqrt{\log n}\rfloor} \rceil$ independent integers $i_1, \ldots, i_v$ from the uniform distribution over $\{1, \ldots, n\}$. Construct the set $V = \{x_{i_1}, \ldots, x_{i_v}\}$ and sort it.

p

log n c b

let T b e a fusion tree Construct a search structure T for the set V If

otherwise let T b e a vEB structure In order to obtain T with suciently high probability

p

d log n e

carry out indep endent attempts to construct T each attempt succeeding

with probability at least By the discussion b efore the statement of Theorem this

p p

log n e O O log n d

d v log v v op erations can b e done in O log n time using

and space For chosen suciently large the b ound on op erations and space is O n and

p

log n

the probability that T is not constructed correctly is at most

3. Use $T$ to compute the rank of $x_j$ in $V$, for $j = 1, \ldots, n$. This uses $O(\sqrt{\log n})$ time and $O(n\sqrt{\log n})$ operations.

4. Use the algorithm of Corollary to sort the input elements $x_1, \ldots, x_n$ with respect to their ranks. This uses $O(t)$ time and $O(n(\sqrt{\log n} + \log n/2^{t/\log n}))$ operations. For $i = 0, \ldots, v$, let $X_i$ be the set of those input elements whose rank in $V$ is $i$. Step 4 moves the elements in $X_i$ to consecutive positions, for $i = 0, \ldots, v$.

5. Sort each of $X_0, \ldots, X_v$ using, e.g., Cole's merge sorting algorithm (Cole). If $M = \max\{|X_i| : 0 \le i \le v\}$, this takes $O(\log n)$ time and $O(\sum_{i=0}^{v} |X_i| \log(|X_i| + 1)) = O(n \log M)$ operations.

The resources used by Step are negligible Hence all that remains is to show that with

p p

suciently high probability log M O log n But if log M u where u log n

u

the sampling in Step misses at least consecutive elements in the sorted sequence of input

elements the probability of which is at most

v

u

u

v n

n n e

n

p p p

log n log n log n

n e n e

p

log n

For chosen suciently large the latter probability and the failure probability of

p

log n

2 in Step add up to at most

For $t \ge \log n \log\log n$, the algorithm of Theorem exhibits optimal speedup relative to the sequential randomized algorithm described by Fredman and Willard.

Acknowledgment

We are grateful to Christine Rüb and Richard Anderson, whose suggestions allowed us to improve our results and to simplify the exposition.

References

Ajtai, M., Komlós, J., and Szemerédi, E., An O(n log n) sorting network, in Proceedings, Annual ACM Symposium on Theory of Computing.

Andersson, A., Sublogarithmic searching without multiplications, in Proceedings, Annual Symposium on Foundations of Computer Science.

Andersson, A., Hagerup, T., Nilsson, S., and Raman, R., Sorting in linear time?, in Proceedings, Annual ACM Symposium on the Theory of Computing.

Bast, H., and Hagerup, T., Fast parallel space allocation, estimation and integer sorting, Inform. and Comput.

Batcher, K. E., Sorting networks and their applications, in Proceedings, AFIPS Spring Joint Computer Conference.

Berkman, O., and Vishkin, U., On parallel integer merging, Inform. and Comput.

Bhatt, P. C. P., Diks, K., Hagerup, T., Prasad, V. C., Radzik, T., and Saxena, S., Improved deterministic parallel integer sorting, Inform. and Comput.

Chaudhuri, S., and Hagerup, T., Prefix graphs and their applications, in Proceedings, International Workshop on Graph-Theoretic Concepts in Computer Science, Lecture Notes in Computer Science, Springer-Verlag, Berlin.

Cole, R., Parallel merge sort, SIAM J. Comput.

Cole, R., and Vishkin, U., Deterministic coin tossing with applications to optimal parallel list ranking, Inform. and Control.

Cormen, T. H., Leiserson, C. E., and Rivest, R. L., Introduction to Algorithms, The MIT Press, Cambridge, Mass.

Fredman, M. L., and Willard, D. E., Surpassing the information theoretic bound with fusion trees, J. Comput. System Sci.

Gil, J., Matias, Y., and Vishkin, U., Towards a theory of nearly constant time parallel algorithms, in Proceedings, Annual Symposium on Foundations of Computer Science.

Goodrich, M. T., Matias, Y., and Vishkin, U., Approximate parallel prefix computation and its applications, in Proceedings, International Parallel Processing Symposium.

Goodrich, M. T., Matias, Y., and Vishkin, U., Optimal parallel approximation for prefix sums and integer sorting, in Proceedings, Annual ACM-SIAM Symposium on Discrete Algorithms.

Hagerup, T., Constant-time parallel integer sorting, in Proceedings, Annual ACM Symposium on Theory of Computing.

Hagerup, T., and Raman, R., Waste makes haste: Tight bounds for loose parallel sorting, in Proceedings, Annual Symposium on Foundations of Computer Science.

Hagerup, T., and Raman, R., Fast deterministic approximate and exact parallel sorting, in Proceedings, Annual ACM Symposium on Parallel Algorithms and Architectures.

Hagerup, T., and Rüb, C., Optimal merging and sorting on the EREW PRAM, Inform. Process. Lett.

Hagerup, T., and Shen, H., Improved nonconservative sequential and parallel integer sorting, Inform. Process. Lett.

JáJá, J., An Introduction to Parallel Algorithms, Addison-Wesley, Reading, Mass.

Kirkpatrick, D., and Reisch, S., Upper bounds for sorting integers on random access machines, Theoret. Comput. Sci.

Kruskal, C. P., Searching, merging, and sorting in parallel computation, IEEE Trans. Comput.

Kruskal, C. P., Rudolph, L., and Snir, M. (a), Efficient parallel algorithms for graph problems, Algorithmica.

Kruskal, C. P., Rudolph, L., and Snir, M. (b), A complexity theory of efficient parallel algorithms, Theoret. Comput. Sci.

Matias, Y., and Vishkin, U. (a), On parallel hashing and integer sorting, J. Algorithms.

Matias, Y., and Vishkin, U. (b), Converting high probability into nearly-constant time, with applications to parallel hashing, in Proceedings, Annual ACM Symposium on Theory of Computing.

Paul, W. J., and Simon, J., Decision trees and random access machines, in Proceedings, International Symposium on Logic and Algorithmic, Zürich.

Rajasekaran, S., and Reif, J. H., Optimal and sublogarithmic time randomized parallel sorting algorithms, SIAM J. Comput.

Rajasekaran, S., and Sen, S., On parallel integer sorting, Acta Inform.

Raman, R., The power of collision: Randomized parallel algorithms for chaining and integer sorting, in Proceedings, Conference on Foundations of Software Technology and Theoretical Computer Science, Lecture Notes in Computer Science, Springer-Verlag, Berlin.

Raman, R. (a), The power of collision: Randomized parallel algorithms for chaining and integer sorting, Tech. Rep., Computer Science Department, University of Rochester, New York, March (revised January).

Raman, R. (b), Optimal sub-logarithmic time integer sorting on the CRCW PRAM, Tech. Rep., Computer Science Department, University of Rochester, New York, January.

Thorup, M., On RAM priority queues, in Proceedings, Annual ACM-SIAM Symposium on Discrete Algorithms.

Vaidyanathan, R., Hartmann, C. R. P., and Varshney, P. K., Towards optimal parallel radix sorting, in Proceedings, International Parallel Processing Symposium.

van Emde Boas, P., Preserving order in a forest in less than logarithmic time and linear space, Inform. Process. Lett.

Wagner, R. A., and Han, Y., Parallel algorithms for bucket sorting and the data dependent prefix problem, in Proceedings, International Conference on Parallel Processing.

Appendix: Constructing a fusion-tree node in parallel

In this app endix we show that if and d are p ositive integers with d then a fusion

tree no de for d given integers y y drawn from the set U f g can

d

O

b e constructed on a CREW PRAM with a word length of bits using O log d time d

O

op erations and d space with probability at least Our algorithm is a straightforward

parallelization of the corresp onding sequential algorithm of Fredman and Willard

For any integer $x$ and any finite set $S$ of integers, denote by $\mathrm{rank}(x, S)$ the rank of $x$ in $S$, and let $Y = \{y_1, \ldots, y_{d-1}\}$. Recall that the purpose of the fusion-tree node is to enable $\mathrm{rank}(x, Y)$ to be determined in constant time by a single processor, for arbitrary given $x \in U$. If $y_1, \ldots, y_{d-1}$ are numbers of just $O(\varphi/d)$ bits, this can be done essentially as described in Section: Create a word $B$ of $O(\varphi)$ bits containing $y_1, \ldots, y_{d-1}$ in separate fields, with test bits set to zero, and another word $A$ containing $d - 1$ copies of $x$, with test bits set to one, then subtract $B$ from $A$ and clear all bit positions except those of the test bits in the resulting word $C$. The test bits of $C$ can be added and their sum placed in a single field by means of a suitable multiplication; this number is $\mathrm{rank}(x, Y)$. In the following we will reduce the general rank computation to two such rank computations in sequences of small integers.
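The word-level rank computation just described can be imitated with Python's arbitrary-precision integers. The following is a minimal sketch under assumptions that go beyond the text: each field has $b$ data bits plus one test bit on top, the rank counts the keys that are at most $x$, and the names `packed_rank`, `ones` and `test_mask` are hypothetical.

```python
def packed_rank(x, ys, b):
    """Return the number of keys y in ys with y <= x, using word-parallel arithmetic."""
    k = len(ys)
    assert k >= 1 and 0 <= x < (1 << b) and all(0 <= y < (1 << b) for y in ys)
    assert k < (1 << (b + 1)), "the count must fit into one field"
    width = b + 1                                    # b data bits and one test bit
    ones = sum(1 << (i * width) for i in range(k))   # a 1 in the lowest bit of every field
    B = sum(y << (i * width) for i, y in enumerate(ys))  # keys, test bits zero
    A = (x | (1 << b)) * ones                        # k copies of x, test bits one
    test_mask = ones << b
    # Field i of A - B equals 2^b + x - y_i, so its test bit survives iff x >= y_i;
    # no borrow crosses a field boundary because every field of A is at least 2^b.
    C = (A - B) & test_mask
    # Multiplying by `ones` accumulates all test bits in field number k - 1.
    total = ((C >> b) * ones) >> ((k - 1) * width)
    return total & ((1 << width) - 1)
```

On a machine with word length $\varphi$ the same steps take constant time as long as all $k$ fields fit into a constant number of words, which is why the keys must be short.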

For arbitrary $x, y \in U$, define $\mathrm{msb}(x, y)$ as […] if $x = y$, and otherwise as the largest number of a bit position in which $x$ and $y$ differ. Fredman and Willard demonstrated that $\mathrm{msb}(x, y)$ can be determined in constant time by a single processor for arbitrary given $x, y \in U$. Without loss of generality assume that $y_1 < \cdots < y_{d-1}$. In order to avoid special cases, we introduce the two additional keys $y_0 =$ […] and $y_d =$ […]. Let $P = \{\mathrm{msb}(y_{i-1}, y_i) \mid 1 \le i \le d\}$ and write $P = \{p_1, \ldots, p_r\}$ with $p_1 < \cdots < p_r$; clearly $r \le d$. Define $f : U \to \{0, \ldots, 2^r - 1\}$ as the function that extracts the bit positions in $P$ and packs them tightly, i.e., bit number $p_i$ in $y$ becomes the $i$th least significant bit of $f(y)$, for $i = 1, \ldots, r$ and for all $y \in U$, while all higher bits of $f(y)$ are zero. It is important to note that $f(y_1) < \cdots < f(y_{d-1})$. We write $f(Y)$ for $\{f(y_1), \ldots, f(y_{d-1})\}$.
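As a small illustration of these definitions, the following sketch computes $\mathrm{msb}$, the position set $P$ and the tightly packed image $f(y)$ by plain loops. It makes no attempt at constant time. The names `msb`, `distinguishing_positions` and `compress` are hypothetical. The sketch assumes that bit positions are numbered from 0 and that $\mathrm{msb}(x, x) = -1$, and it omits the two additional keys $y_0$ and $y_d$.

```python
def msb(x, y):
    """Largest bit position where x and y differ (assumed -1 if x == y)."""
    return (x ^ y).bit_length() - 1


def distinguishing_positions(keys):
    """The set P: msb of every pair of neighbours in sorted order."""
    ys = sorted(keys)
    return sorted({msb(a, b) for a, b in zip(ys, ys[1:])})


def compress(y, P):
    """f(y): extract the bits of y at the positions in P, packed tightly."""
    return sum(((y >> p) & 1) << i for i, p in enumerate(P))


keys = [0b0010, 0b0101, 0b0110, 0b1100]
P = distinguishing_positions(keys)
images = [compress(y, P) for y in sorted(keys)]
# The images keep the relative order of the keys, as noted in the text.
assert all(u < v for u, v in zip(images, images[1:]))
```

The loop in `compress` takes time proportional to $r$, which is exactly the cost that the remainder of the appendix works to avoid.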

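Let $\Delta = \{0, \ldots, d\} \times \{0, \ldots, r\} \times \{0, 1\}$ and define a function $\phi : U \to \Delta$ as follows: For $x \in U$, let $i = \mathrm{rank}(f(x), f(Y))$ and choose $j \in \{i, i + 1\}$ to minimize $\mathrm{msb}(f(x), f(y_j))$, resolving ties by taking $j = i$. Note that this means that among all elements of $f(Y \cup \{y_0, y_d\})$, $f(y_j)$ is one with a longest prefix in common with $f(x)$ (the longest-prefix property). Furthermore, take $l = \mathrm{rank}(\mathrm{msb}(x, y_j), P)$ and let $a = 1$ if $x \ge y_j$, $a = 0$ otherwise. Then $\phi(x) = (j, l, a)$.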

Fredman and Willard showed that for $x \in U$, $\mathrm{rank}(x, Y)$ depends only on $\phi(x)$. This is easy to see by imagining a digital search tree $T$ for the bit strings $y_0, \ldots, y_d$. The root-to-leaf path in $T$ corresponding to $y_j$ is a continuation of the path taken by a usual search in $T$ for $x$, and $P$ is the set of heights of children of nodes in $T$ with two children, so that $l$ determines the maximal path in $T$ of nodes with at most one child on which the search for $x$ terminates. As a consequence, once $\phi(x)$ is known, $\mathrm{rank}(x, Y)$ can be determined in constant time as $R[\phi(x)]$, where $R$ is a precomputed table with $2(d + 1)(r + 1)$ entries.
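The claim that the rank depends only on the triple $(j, l, a)$ can be checked by brute force on a small universe. The sketch below is a sequential illustration, not the parallel construction. It assumes 0-based key indices, no additional keys $y_0$ and $y_d$ (a missing neighbour is simply skipped), $\mathrm{msb}(x, x) = -1$, and a rank that counts keys no larger than $x$. All function names are hypothetical, and the table is filled by enumerating every $x$ rather than by a small test set.

```python
from bisect import bisect_right


def msb(x, y):
    """Largest differing bit position (assumed -1 if x == y)."""
    return (x ^ y).bit_length() - 1


def build(keys):
    """Return the sorted keys and the sorted position set P."""
    ys = sorted(set(keys))
    P = sorted({msb(a, b) for a, b in zip(ys, ys[1:])})
    return ys, P


def compress(y, P):
    """Tightly packed bits of y at the positions in P."""
    return sum(((y >> p) & 1) << i for i, p in enumerate(P))


def phi(x, ys, P):
    """The triple (j, l, a) for x, with 0-based key index j."""
    fy = [compress(y, P) for y in ys]          # strictly increasing
    fx = compress(x, P)
    i = bisect_right(fy, fx)                   # rank of f(x) in f(Y)
    # The two neighbours of f(x) in f(Y); one may be missing at the ends.
    candidates = [c for c in (i - 1, i) if 0 <= c < len(ys)]
    j = min(candidates, key=lambda c: msb(fx, fy[c]))
    l = bisect_right(P, msb(x, ys[j]))         # rank of msb(x, y_j) in P
    a = 1 if x >= ys[j] else 0
    return j, l, a


def build_table(ys, P, bits):
    """Fill R by enumerating the universe, checking that rank is a
    function of phi(x) alone."""
    R = {}
    for x in range(1 << bits):
        rank = bisect_right(ys, x)
        assert R.setdefault(phi(x, ys, P), rank) == rank
    return R


ys, P = build([2, 5, 6, 12])
R = build_table(ys, P, 4)
```

Enumerating all of $U$ is only feasible for toy word lengths, which is why a small set of test values that still reaches every used table entry matters for the actual construction.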

Since both images under $f$ and bit positions are sufficiently small integers, the two rank computations in the algorithm implicit in the definition of $\phi$ can be carried out in constant sequential time as described earlier in the appendix. As a consequence, we are left with two problems: how to construct the table $R$, and how to evaluate $f$ in constant time.

As $\mathrm{rank}(x, Y)$ can be determined in $O(\log d)$ sequential time for any given $x \in U$, it suffices for the construction of $R$ to provide a set $X \subseteq U$ of $d^{O(1)}$ test values that exercise the whole table, i.e., such that $\phi(X) = \phi(U)$. This is easy: Take $p_0 =$ […] and $p_{r+1} =$ […] and define $x_{j,l,a}$, for all $(j, l, a) \in \Delta$, as the integer obtained from $y_j$ by complementing the highest-numbered bit whose number is smaller than $p_{l+1}$ and whose value is $1 - a$; if there is no such bit, take $x_{j,l,a} = y_j$. We will show that if $(j, l, a) \in \phi(U)$, then $\phi(x_{j,l,a}) = (j, l, a)$, which proves that $X = \{x_{j,l,a} \mid (j, l, a) \in \Delta\}$ is a suitable test set. It may be helpful to visualize the following arguments as they apply to the digital search tree mentioned above.

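Fix $(j, l, a) \in \Delta$ and let $S = \{x \in U \mid p_l \le \mathrm{msb}(x, y_j) < p_{l+1} \text{ and } x \ge y_j \Leftrightarrow a = 1\}$. Elements of $S$ are the only candidates for being mapped to $(j, l, a)$ by $\phi$.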

Suppose first that $\mathrm{msb}(x_{j,l,a}, y_j) > p_l$. Then $f(x_{j,l,a}) = f(y_j)$, so that choosing $i = j$ clearly achieves the unique minimum of $\{\mathrm{msb}(f(x_{j,l,a}), f(y_i))\}$. By the longest-prefix property and the fact that $x_{j,l,a} \in S$, it now follows that $\phi(x_{j,l,a}) = (j, l, a)$.

If $\mathrm{msb}(x_{j,l,a}, y_j) < p_l$, it is easy to see that $S = \emptyset$, so that $(j, l, a) \notin \phi(U)$.

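If $\mathrm{msb}(x_{j,l,a}, y_j) = p_l$, finally, it can be seen that $\mathrm{msb}(x, y_j) = p_l$ for all $x \in S$. Consider two cases: If $\mathrm{msb}(x_{j,l,a}, y_i) < p_l$ for some $i \in \{0, \ldots, d\}$, then $\mathrm{msb}(f(x), f(y_i)) < \mathrm{msb}(f(x), f(y_j))$ for all $x \in S$. Then, by the longest-prefix property, no $x \in S$ is mapped to $(j, l, a)$, i.e., $(j, l, a) \notin \phi(U)$. If $\mathrm{msb}(x_{j,l,a}, y_i) \ge p_l$ for all $i \in \{0, \ldots, d\}$, on the other hand, $\phi(x) = \phi(x_{j,l,a})$ for all $x \in S$, so that $\phi(x_{j,l,a}) = (j, l, a)$ if $(j, l, a) \in \phi(U)$. Summing up, the useful part of $R$ can be constructed in $O(\log d)$ time with $d^{[…]}$ processors.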

We actually do not know how to evaluate $f$ efficiently, and Fredman and Willard employ a different function $g$ that still extracts the bit positions in $P$, but packs them less tightly. More precisely, for nonnegative integers $q_1, \ldots, q_r$ of size $O(\lambda)$, to be determined below, bit number $p_i$ in $y$ is bit number $p_i + q_i$ in $g(y)$, for $i = 1, \ldots, r$ and for all $y \in U$, while all other bits in $g(y)$ are zero. The integers $q_1, \ldots, q_r$ will be chosen to satisfy the following conditions:

(1) $p_1 + q_1 < p_2 + q_2 < \cdots < p_r + q_r$;

(2) $(p_r + q_r) - (p_1 + q_1) \le$ […];

(3) The $r^2$ sums $p_i + q_j$, where $1 \le i, j \le r$, are all distinct.

i j

Condition (1) ensures that $\mathrm{rank}(g(x), g(Y)) = \mathrm{rank}(f(x), f(Y))$ and that minimizing $\mathrm{msb}(g(x), g(y_j))$ is equivalent to minimizing $\mathrm{msb}(f(x), f(y_j))$, for all $x \in U$, so that substituting $g$ for $f$ leaves the algorithm correct. Condition (2) ensures that images under $g$ are still sufficiently small, following a fixed right shift by $p_1 + q_1$ bit positions, to allow constant-time computation of $\mathrm{rank}(g(x), g(Y))$, and Condition (3) implies that $g$ can be implemented as a multiplication by $\sum_{i=1}^{r} 2^{q_i}$ followed by the application of a bit mask that clears all bits outside of the positions $p_1 + q_1, \ldots, p_r + q_r$ of interest.
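The effect of Conditions (1) and (3) can be seen in a short sketch of $g$ as one multiplication and one mask. Two points are assumptions of the sketch rather than statements from the text. First, the offsets are found by a simple greedy search in the hypothetical helper `choose_offsets`, which is neither the deterministic procedure of Fredman and Willard nor the randomized one of this appendix, and which ignores the size bound of Condition (2). Second, `spread` first masks $y$ down to the positions in $P$ so that unrelated bits cannot collide in the product.

```python
def choose_offsets(P):
    """Greedily pick q_1..q_r so that p_i + q_i increases (Condition 1)
    and all sums p_i + q_j are distinct (Condition 3)."""
    qs, sums, last = [], set(), -1
    for p in P:
        q = max(0, last + 1 - p)        # keeps p + q above the previous target
        while any((pa + q) in sums for pa in P):
            q += 1                      # skip offsets that would collide
        qs.append(q)
        sums.update(pa + q for pa in P)
        last = p + q
    return qs


def spread(y, P, qs):
    """g(y): bit p_i of y moved to bit p_i + q_i, all other bits zero."""
    in_mask = sum(1 << p for p in P)
    multiplier = sum(1 << q for q in qs)
    out_mask = sum(1 << (p + q) for p, q in zip(P, qs))
    # Each selected bit p_i lands on every position p_i + q_j; the sums are
    # distinct, so no two copies overlap and no carries occur.
    return ((y & in_mask) * multiplier) & out_mask


P = [0, 3, 4, 7]
qs = choose_offsets(P)
targets = [p + q for p, q in zip(P, qs)]
```

Because the target positions `targets` increase with $i$, comparing two images under `spread` gives the same outcome as comparing the tightly packed images, which is what Condition (1) is needed for.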

Fredman and Willard described a deterministic procedure for computing $q_1, \ldots, q_r$. We obtain $q_1, \ldots, q_r$ through a randomized but faster procedure that essentially amounts to choosing $q_1, \ldots, q_r$ at random. More precisely, choose $Z_1, \ldots, Z_r$ independently from the uniform distribution over $\{$[…]$\}$ and take $q_i =$ […], for $i = 1, \ldots, r$. It is easy to see that $q_1, \ldots, q_r$ are nonnegative and that Conditions (1) and (2) are satisfied. Condition (3) may be violated, but we can check this in $O(\log d)$ time with $d^{[…]}$ processors. For fixed $i, j, k, l$ with $1 \le i, j, k, l \le r$ and $(i, j) \ne (k, l)$, the condition $p_i + q_j \ne p_k + q_l$ is violated with probability at most […], so that altogether Condition (3) is violated with probability at most […].

[Figure: One stage of bitonic sorting.]