Integer Sorting in $O(n\sqrt{\log\log n})$ Expected Time and Linear Space

Yijie Han
School of Interdisciplinary Computing and Engineering
University of Missouri at Kansas City
5100 Rockhill Road
Kansas City, MO 64110
[email protected]
http://welcome.to/yijiehan

Mikkel Thorup
AT&T Labs—Research
Shannon Laboratory
180 Park Avenue
Florham Park, NJ 07932
[email protected]

Abstract

We present a randomized algorithm sorting $n$ integers in $O(n\sqrt{\log\log n})$ expected time and linear space. This improves the previous $O(n\log\log n)$ bound by Andersson et al. from STOC'95.

As an immediate consequence, if the integers are bounded by $U$, we can sort them in $O(n\sqrt{\log\log U})$ expected time. This is the first improvement over the $O(n\log\log U)$ bound obtained with van Emde Boas' data structure from FOCS'75.

At the heart of our construction is a technical deterministic lemma of independent interest; namely, that we split $n$ integers into subsets of size at most $\sqrt{n}$ in linear time and space. This also implies improved bounds for deterministic string sorting and integer sorting without multiplication.

(Supported in part by University of Missouri Faculty Research Grant K211942.)

1 Introduction

Integer sorting has always been an important task in connection with the digital computer. A classic example is the folklore radix sort algorithm, which according to Knuth [28] is referenced as far back as 1929 by Comrie in a document describing punched-card equipment [14].

Whereas radix sort works in linear time for $O(\log n)$-bit integers, it was not until 1990 that Fredman and Willard [17] beat the $\Omega(n\log n)$ comparison-based sorting lower bound for the case of arbitrary single word integers. The word-size $W$ is determined by the processor. We assume $W \ge \log n$ so that we can address the $n$ different integers, but in principle, $W$ can be arbitrarily large compared with $n$. An equivalent formulation of our assumptions is that we only assume constant time operations on integers polynomial in the sum of the input integers. The assumption that each integer fits in a machine word implies that integers can be operated on with single instructions. A similar assumption is made for comparison based sorting in that an $O(n\log n)$ time bound requires constant time comparisons. However, for integer sorting, besides comparisons, we can use all the other instructions available on integers in standard imperative programming languages such as C [26]. It is important that the word-size $W$ is finite, for otherwise, we can sort in linear time by a clever coding of a huge vector-processor dealing with $n$ input integers at a time [30, 27].

Concretely, Fredman and Willard [17] showed that we can sort deterministically in $O(n\log n/\log\log n)$ time and linear space, and randomized in $O(n\sqrt{\log n})$ expected time and linear space. The randomized bound can also be achieved deterministically using space unbounded in terms of $n$ (of the form $2^{\varepsilon W}$ for constant $\varepsilon > 0$), but here we focus on bounds with space bounded in terms of $n$.

In 1995, Andersson et al. [5] improved Fredman and Willard's $O(n\sqrt{\log n})$ expected time for integer sorting to $O(n\log\log n)$ expected time. Both of these bounds use linear space. A similar result was found independently by Han and Shen [23].

The above mentioned bounds are the best unrestricted bounds in the sense that no better bounds are known even if, besides the randomization, we have unlimited space and are free to define our own operations on words.

The above results have provided great inspiration for many researchers trying to improve them in various ways. For example, there has been work on avoiding randomization [4, 21, 22, 31, 33], and there has been work on avoiding multiplication [3, 6, 36]. Also, there has been lots of work on more dynamic versions of the problem, such as priority queues [13, 18, 31, 33, 34] and searching [3, 4, 8, 9].
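Radix sort's linear-time behavior for short keys, as invoked above, is easy to make concrete. The sketch below is our own illustration (not code from the paper): it sorts 32-bit keys with four stable counting-sort passes on 8-bit digits, so for any fixed key width the total work is $O(n)$.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Sort n 32-bit keys with four stable counting-sort passes,
   one per 8-bit digit: O(n) time for this fixed key width. */
static void radix_sort_u32(uint32_t *a, size_t n)
{
    uint32_t *tmp = malloc(n * sizeof *tmp);
    if (!tmp) abort();
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[257] = {0};
        for (size_t i = 0; i < n; i++)   /* histogram of digit values */
            count[((a[i] >> shift) & 0xFF) + 1]++;
        for (int d = 0; d < 256; d++)    /* prefix sums -> bucket starts */
            count[d + 1] += count[d];
        for (size_t i = 0; i < n; i++)   /* stable scatter into buckets */
            tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
}
```

With $b$-bit digits, a $w$-bit key needs $w/b$ passes; for $O(\log n)$-bit keys and digits of $\Theta(\log n)$ bits, a constant number of passes suffices, which is the linear-time claim above.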

Proceedings of the 43 rd Annual IEEE Symposium on Foundations of Computer Science (FOCS’02) 0272-5428/02 $17.00 © 2002 IEEE

The $O(n\log\log n)$ algorithm of Andersson et al. from 1995 [5] is very simple, and the fact that it has sustained so much interest has led many researchers to think that this could be the complexity for integer sorting, like $\Theta(n\log n)$ is the complexity of comparison based sorting. However, in this paper, we improve the $O(n\log\log n)$ expected time to $O(n\sqrt{\log\log n})$ expected time:

Theorem 1 There is a randomized algorithm sorting $n$ integers, each stored in one word, in $O(n\sqrt{\log\log n})$ expected time and linear space.

We leave open the problems of getting a corresponding deterministic algorithm and of avoiding the use of multiplication. We note that the dynamic aspects are already pretty settled, for the complexity of dynamic searching is known to be $\Theta(\sqrt{\log n/\log\log n})$ [8, 9], and priority queues have once and for all been reduced to sorting [35].

Since integers of $O(\log n)$ bits can be sorted in linear time using radix sort, we get the following immediate consequence of Theorem 1:

Corollary 2 We can sort $n$ integers of size at most $U$ in $O(n\sqrt{\log\log U})$ expected time.

This is the first improvement over the $O(n\log\log U)$ bound obtained with van Emde Boas' data structure from 1975 [37]. Indeed, the $O(n\log\log n)$ bound of Andersson et al. [5] combines van Emde Boas' data structure with the packed merging of Albers and Hagerup [2] so as to match the $O(n\log\log U)$ bound for large values of $U$.

We note here that the general $O(\log\log U)$ bound of van Emde Boas [37] has been improved in the context of static search structures. More precisely, Beame and Fich [9] have shown that one can preprocess a set of $n$ integers in polynomial time and space so that, given any $x$, one can search the largest stored integer below $x$ in $O(\log\log U/\log\log\log U)$ time. However, due to the polynomial construction time, this improvement does not help with sorting. Beame and Fich [9] combine their result with Andersson's exponential search trees from [4], giving a dynamic search structure with an amortized update time of $O(\log\log U\cdot\log\log n/\log\log\log U)$. This dynamic search structure could be used for sorting in $O(n\log\log U\cdot\log\log n/\log\log\log U)$ time, but this is never better than the best of $O(n\log n)$ time and $O(n\log\log U)$ time. Thus, from the perspective of sorting, our $O(n\sqrt{\log\log U})$ bound constitutes the first improvement over the $O(n\log\log U)$ bound derived from van Emde Boas' data structure from 1975 [37].

We will, in fact, prove the following refinement of Corollary 2:

Theorem 3 There is a randomized algorithm sorting $n$ word integers, each of size at most $U < 2^W$, in $O(n\sqrt{\log\frac{\log U}{\log n}})$ expected time and linear space.

Theorem 3 improves a corresponding refinement by Kirkpatrick and Reisch [27] of van Emde Boas' bound, of $O(n\log\frac{\log U}{\log n})$ expected time.

The following simple lemma states that it suffices for us to prove Theorem 3:

Lemma 4 Theorem 3 implies Theorem 1 and Corollary 2.

Proof: Trivially, Theorem 3 implies the weaker Corollary 2. However, Andersson et al. [5] have shown that we can sort in linear expected time if $W \ge (\log n)^3$, and otherwise $\log\log U \le \log W \le 3\log\log n$, so the bound of Theorem 3 is $O(n\sqrt{\log\log U}) = O(n\sqrt{\log\log n})$, implying Theorem 1.

1.1 Other domains

In this paper, we generally assume that our integers to be sorted each fit in one word, where a word is the maximal unit that we can operate on with a single instruction. In this subsection, we briefly discuss implications of this case for many other domains.

First, consider the case of lexicographically ordered strings of words. Andersson and Nilsson [7] have presented an optimal randomized reduction from this case to that of integers fitting in single words. Applying their reduction to Theorem 1, we get

Corollary 5 We can sort $n$ variable-length strings distributed over $N$ words in $O(n\sqrt{\log\log n} + N)$ expected time. In fact, we can get down to $O(n\sqrt{\log\log n} + L)$ expected time, where $L = \sum_i \ell_i$ and $\ell_i$ is the length of the distinguishing prefix of string $i$, that is, the smallest prefix of string $i$ distinguishing it from all the other strings, or the length of string $i$ if it is a duplicate.

We note here that any algorithm sorting the strings will have to read the distinguishing prefixes, and since it takes an instruction to read each word, it follows that an additive $O(L)$ is necessary.

One may instead be interested in variable-length multiple-word integers, where integers with more words are considered bigger. However, by prefixing each integer with its length, we reduce this case to lexicographic string sorting.

In our presentation, we think of all our integers as unsigned non-negative integers. However, the standard representation of signed integers is such that if we flip the sign-bit, we can sort them as unsigned integers, and then flip the sign-bit back again. Floating point numbers are even easier, for the IEEE 754 floating-point standard [25] is designed so that the ordering of floating point numbers can be deduced by perceiving their representations as multiple-word integers. Also, if we are working with fractions where both numerator and denominator are single word integers, we get the right ordering if, for each fraction, we make the division in floating point numbers with double precision. Now we get the correct ordering of the original integer fractions by perceiving the corresponding floating point numbers as integers.
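The sign-bit and floating-point remarks can be made concrete with two small key transforms, sketched below on 32-bit types (our illustration; the helper names are ours, and we assume no NaNs among the floating point keys). Each maps a key to an unsigned integer whose unsigned order agrees with the original signed or floating point order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Two's-complement: flipping the sign bit makes unsigned order
   agree with signed order. */
static uint32_t int_key(int32_t x)
{
    return (uint32_t)x ^ 0x80000000u;
}

/* IEEE 754 single precision: reinterpret the bits, then flip the
   sign bit for non-negative floats and all bits for negative ones. */
static uint32_t float_key(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}
```

After sorting by these unsigned keys, applying the inverse transform (or carrying the original values alongside the keys) recovers the sorted input.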

1.2 Machine model

Recall that our machine model is a normal computer with an instruction set corresponding to what we program in a standard programming language such as C [26] or C++ [32]. We have a processor-determined word-size $W$, limiting how big integers we can operate on in constant time. We assume that each input integer fits in a single word. We note that for generic code, the type of a full word integer, e.g. long long int, should be a macro parameter in C or a template parameter in C++. We adopt the unit-cost time measure where each operation takes constant time.

Interestingly, the traditional theoretical RAM model of Cook and Reckhow [15] allows infinite words. A disturbing consequence of infinite words is that with normal operations such as shifts or multiplication, we can simulate an exponentially big parallel processor solving all problems in NP in polynomial time. Hence such operations have to be banned from the above unit-cost theory RAM, making it even more contrived from a practical view-point.

However, by adopting the real-world limitation of a limited word-size, we both resolve the above theoretical issue, and we get algorithms that can be implemented in the real world. Hagerup [19] has named this model the word RAM.

The word RAM has a fairly long tradition within integer sorting, being advocated and used in the 1984 paper [27] by Kirkpatrick and Reisch, and in the seminal 1993 paper of Fredman and Willard [17].

We note that our unit-cost multiplication may be considered somewhat questionable in that multiplication is not in AC0 [10], that is, there is no multiplication circuit of constant depth and of size polynomial in $W$. We leave it as an open problem to improve Thorup's randomized $O(n\log\log n)$ expected time and linear space sorting without multiplication [36].

We will now discuss some of the (delightfully) dirty tricks that we can use on the word RAM, and which are not allowed in the comparison based model or on a pointer machine.

A first nice feature of the word RAM over the comparison based model is that we can add and subtract integers. For example, we can use this to code multiple comparisons, where a single word operation is used to manipulate the information on several pixels, each represented by one byte of the word.

The word RAM model distinguishes itself both from the comparison based model and from the pointer machine in that we can use integers, and segments of integers, as addresses. This trick goes back to radix sort, where an integer is viewed as a vector of characters, and these characters are used as addresses. Another word RAM trick in this direction is that we can hash integers into smaller ranges. Here radix sort goes back at least to 1929 [14] and hashing goes back at least to 1956 [16], both being developed for efficient problem solving in the real world.

We note that Fredman and Willard [17] further use the RAM for advanced tabulation of complicated functions over small domains. Their tabulation is too complicated to be of practical relevance, but tabulation of functions is in itself commonly used to tune code. As a simple example, Bentley [11, pp. 83–84] suggests that an efficient method for computing the number of set bits in 32-bit integers is to have a preprocessing where we first tabulate the number of set bits in all the 256 different 8-bit integers. Now, given a 32-bit integer $x$, we view it as the concatenation of four 8-bit integers, and for each of these, we look up the number of set bits in our table. Finally, we just add up these four numbers to get the number of set bits in $x$.
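Bentley's tabulation trick reads directly as code. A minimal version (ours, not Bentley's verbatim) follows; the table is built once, and each query then costs four lookups and three additions.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t bits_in_byte[256]; /* set bits in each 8-bit value */

static void init_table(void)
{
    for (int v = 0; v < 256; v++)
        for (int b = 0; b < 8; b++)
            bits_in_byte[v] += (uint8_t)((v >> b) & 1);
}

/* Popcount of a 32-bit integer: four table lookups plus three adds. */
static int popcount32(uint32_t x)
{
    return bits_in_byte[x & 0xFF] + bits_in_byte[(x >> 8) & 0xFF]
         + bits_in_byte[(x >> 16) & 0xFF] + bits_in_byte[x >> 24];
}
```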

Fredman and Willard [17]. constants hidden in the O -notation to be of any immediate We note that our unit-cost multiplication may be con- practical use. This does not preclude that some of the ideas

sidered somewhat questionable in that multiplication is not may find use in practice. For example, Nilsson [29] has

O (n log log n) in AC0 [10], that is, there is no multiplication circuit of demonstrated that the algorithm of Anders- son et al. [5] can be implemented so as to be competitive constant-depth and of size polynomial in W .Weleave

it as an open problem to improve Thorup’s randomized with the best practical sorting algorithms.

(n log log n) O expected time and linear space sorting with- out multiplication [36]. 1.3 Deterministic splitting and sorting We will now discuss some of the (delightfully) dirty tricks that we can use on the word RAM, and which are The heart of our construction is a deterministic splitting not allowed in the comparison based model or on a pointer results of independent interest. machine.

A first nice feature of the word RAM over the compar- Definition 6 A splitting of an ordered set X is a partition-

X A

ison based model is that we can add and subtract integers. <

0 1

ingintosets k .Here, denotes

(a; b) 2 A  B For example, we can use this to code multiple comparisons that a
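The "multiple comparisons with a single addition or subtraction" feature of Section 1.2 can be sketched as follows (our simplified illustration, not a routine from the paper). Pack eight 7-bit values into the eight bytes of a 64-bit word, leaving the top bit of each byte clear as a guard bit; then one subtraction compares all eight pairs at once.

```c
#include <assert.h>
#include <stdint.h>

#define GUARD 0x8080808080808080ULL /* top bit of each byte */

/* a and b each hold eight 7-bit fields, one per byte, guard bits clear.
   In the result, byte i has its top bit set iff field a_i >= b_i:
   (a_i | 128) - b_i lies in [1, 255] and is >= 128 exactly when
   a_i >= b_i, so no borrow ever crosses a byte boundary. */
static uint64_t ge_bytes(uint64_t a, uint64_t b)
{
    return ((a | GUARD) - b) & GUARD;
}
```

Word-parallel comparisons of this guard-bit flavor are the kind of building block used by the packed merging and packed sorting results cited above.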


Theorem 7 We can split a set of $n$ word integers in linear time and space so that each diverse subset has at most $\sqrt{n}$ integers.

Applying Theorem 7 recursively, we immediately get that we can sort deterministically in $O(n\log\log n)$ time and linear space, but this has already been proved by Han in [22]. However, for deterministic string sorting, we get the following new result, which does not follow from [22] (the general reduction of Andersson and Nilsson [7] from word sorting to string sorting is randomized):

Corollary 8 We can sort $n$ variable-length strings distributed over $N$ words in $O(n\log\log n + N)$ time and linear space. In fact, we can get down to $O(n\log\log n + L)$ time, where $L$ is the sum of the lengths of the distinguishing prefixes.

Proof: To get the corollary, we simply apply Theorem 7 recursively, but only to the first unmatched word of each string. That is, our recursive input is a subset of the integers with some common matched prefix. In the root call, the subset is the complete set with nothing matched so far. Integers ending up in a duplicate set match in one more word, and the other integers end up in sets of size reduced to the square root. Since the splitting takes constant time per integer, we pay constant time per word matched and $O(\log\log n)$ time for reductions into smaller sets.

We also present variants of the splitting using only standard AC0 operations, that is, AC0 operations available via a standard programming language such as C or C++.

Theorem 9 For any positive $\varepsilon$, using standard AC0 operations only, we can split a set of $n$ $(W/(\log\log n)^{1+\varepsilon})$-bit integers in linear time so that diverse subsets have size at most $\sqrt{n}$.

Corollary 10 For any positive $\varepsilon$, using standard AC0 operations only, we can sort $n$ words in $O(n(\log\log n)^{1+\varepsilon})$ time and linear space.

Proof: We use the same proof as for Corollary 8, but viewing each word as a string of $(W/(\log\log n)^{1+\varepsilon})$-bit characters, to which we can apply Theorem 9.

Corollary 10 improves the previous $O(n(\log\log n)^2/\log\log\log n)$ bound of Han from [20], and it gets very close to the best multiplication based deterministic bound of $O(n\log\log n)$ [22].

1.4 Contents

The rest of the paper is divided into two sections. Section 2 presents our randomized sorting assuming deterministic splitting, and Section 3 presents our deterministic splitting.

2 Fast randomized sorting

For our fast randomized sorting, we need a slight variant of the signature sorting of Andersson et al. [5] (cf. Appendix A.1):

Lemma 11 With an expected linear-time additive overhead, signature sorting with parameter $r$ reduces the problem of sorting $n$ integers of $\ell$ bits to

(i) the problem of sorting $n$ reduced integers of $4r\log n$ bits each, and

(ii) the problem of sorting $n$ integer fields of $\ell/r$ bits each.

Here (i) has to be solved before (ii).

We are now going to present our randomized algorithm for sorting $n$ word integers in linear space and $O(n\sqrt{\log p})$ expected time, where $p = \log U/\log n$.

Repeated splitting. First we apply our splitting from Theorem 7 to recursively split the diverse sets of integers $\lceil\sqrt{\log p}\,\rceil$ times, thus getting a splitting with diverse subsets of size at most $n' = n^{1/2^{\lceil\sqrt{\log p}\,\rceil}}$. Each integer is involved in at most $\lceil\sqrt{\log p}\,\rceil$ linear time splittings, so the total time is $O(n\sqrt{\log p})$.

Repeated signature sort. To each of the diverse sets $S_i$ of size at most $n'$, we apply the signature sort from Lemma 11 with $r = 2^{\sqrt{\log p}}$. Then the reduced integers from (i) have $4r\log n' = O(\log n)$ bits.

We are going to sort these reduced integers from all the subsets $S_i$ together, but prefixing reduced integers from $S_i$ with $i$. The prefixed reduced integers still have $O(\log n)$ bits, so we can radix sort them in linear time. From the resulting sorted list we can trivially extract the sorted sublist of reduced integers for each $S_i$, thus completing task (i) from Lemma 11.

We have now spent linear expected time on reducing the problem to dealing with the fields from Lemma 11 (ii), and the field length is only a fraction $1/2^{\sqrt{\log p}}$ of the original length.

We repeat this signature sorting $\lceil\sqrt{\log p}\,\rceil$ times, at a total expected cost of $O(n\sqrt{\log p})$. We started with integers of length $\log U$, and each round reduces the integer length by a factor $2^{\sqrt{\log p}}$, so we end up with integers of length at most
$$\log U/\bigl(2^{\sqrt{\log p}}\bigr)^{\sqrt{\log p}} \le \log U/2^{\log p} = \log n.$$
Since the lengths are now at most $\log n$, we can trivially finish with a linear time radix sort.


Summing up. The total expected time spent in the above algorithm is $O(n\sqrt{\log p})$, and we have only used linear space, as desired. Thus, our only remaining task is to provide the splitting from Theorem 7, that is,

Lemma 12 Theorem 7 implies Theorem 3.

3 Deterministic splitting

To prove Theorem 7, first, in Section 3.1, we reformulate it in terms of splitting over a given set of splitters.

3.1 Splitting over splitters

We will actually prove our splitting result in terms of an equivalent formulation. A splitting of a set $X$ into sets $X_0 < \{y_1\} \le X_1 < \{y_2\} \le \cdots < \{y_k\} \le X_k$ is a splitting over the splitters $y_1 < y_2 < \cdots < y_k$.

Lemma 13 The following statements are equivalent:

(a) We can split a set of $n$ integers so that diverse subsets are of size at most $n^{1-\varepsilon_a}$ in linear time and space for some positive constant $\varepsilon_a$.

(b) We can split a set of $n$ integers so that diverse subsets are of size at most $n^{1-\varepsilon_b}$ in linear time and space for any positive constant $\varepsilon_b$.

(c) We can split a set of $n$ words over $n^{\varepsilon_c}$ splitters in linear time and space for any positive constant $\varepsilon_c$.

(d) We can split a set of $n$ words over $n^{\varepsilon_d}$ splitters in linear time and space for some positive constant $\varepsilon_d$.

Proof: (a)$\Rightarrow$(b) To get diverse subsets of size $n^{1-\varepsilon_b}$, we apply (a) recursively on diverse sets. If (a) is applied $i$ times to subsets containing $x$, the set containing $x$ ends up non-diverse, or of size at most $n^{(1-\varepsilon_a)^i}$. Thus $x$ gets involved at most $\log(1-\varepsilon_b)/\log(1-\varepsilon_a) = O(1)$ times.

(b)$\Rightarrow$(c) To prove (c) with any given value of $\varepsilon_c$, we apply (b) with $\varepsilon_b = 2\varepsilon_c$. Now, the diverse subsets have at most $n^{1-2\varepsilon_c}$ integers. Out of these subsets, at most $n^{\varepsilon_c}$ contain a splitter. We split each subset with splitters using traditional comparison based sorting. The total number of integers involved in this is at most $n^{\varepsilon_c}\cdot n^{1-2\varepsilon_c} = n^{1-\varepsilon_c}$, so the total time for sorting is $O(n^{1-\varepsilon_c}\log n) = O(n)$. Having sorted all diverse subsets with splitters over these splitters, we immediately derive the desired splitting of the original set.

(c)$\Rightarrow$(d) is trivial.

(d)$\Rightarrow$(a) If $n = O(1)$, (a) is trivial, so assume $n = \omega(1)$. We divide our input integers into batches of size $\sqrt{n}$. Using (d), we can split such a batch over $n^{\varepsilon_d/2}$ splitters in linear time.

We will develop an appropriate set $Y$ of splitters as we go along, starting with $Y = \emptyset$. Each time we come with a batch of integers, we split them according to the current splitters $y_1 < \cdots < y_k$, adding them to the splitting $X_0 < \{y_1\} \le X_1 < \{y_2\} \le \cdots < \{y_k\} \le X_k$ done so far. If one of the diverse subsets $X_i$ gets $4n^{1-\varepsilon_d/2}$ or more elements, we split it according to its median $z$, and make $z$ and $z+1$ two new splitters. Obviously, we end with at most $4n^{1-\varepsilon_d/2}$ elements in each diverse set, and since $n = \omega(1)$, $4n^{1-\varepsilon_d/2} \le n^{1-\varepsilon_a}$ for some positive constant $\varepsilon_a$.

We will charge each splitting over a median to $2n^{1-\varepsilon_d/2}$ elements. This implies that the median finding and subsequent splitting is done in linear time, and that the total number of splitters is at most $2n/(2n^{1-\varepsilon_d/2}) = n^{\varepsilon_d/2}$, as needed for applying (d) with a batch of $\sqrt{n}$ integers.

The charging is simple: every time a diverse subset is started, it has at most $2n^{1-\varepsilon_d/2}$ charged elements. We only split the set if it gets more than $4n^{1-\varepsilon_d/2}$ elements, which means that we have at least $2n^{1-\varepsilon_d/2}$ non-charged elements that we can charge, and we can easily distribute the charging so that each of the resulting diverse sets gets at most $2n^{1-\varepsilon_d/2}$ charged elements.

The rest of this paper is devoted to proving Lemma 13 (d) with $n^{\varepsilon_d} = \sqrt[3]{n}$, which by Lemma 13 (b) implies Theorem 7. That is, our remaining task is to split $n$ integers over $\sqrt[3]{n}$ splitters in linear time and space.
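For concreteness, the splitting-over-splitters interface can be realized naively with binary search, as in the toy sketch below (ours; it also simplifies by sending an element equal to a splitter into the subset after that splitter rather than into the splitter's singleton set). The point of the remainder of the section is precisely to avoid the $O(\log k)$ comparisons per integer that this naive version pays.

```c
#include <assert.h>
#include <stddef.h>

/* Naive splitting over sorted splitters y[0..k-1]: returns i such
   that x falls in subset X_i of
   X_0 < {y[0]} <= X_1 < ... < {y[k-1]} <= X_k,
   i.e., i = number of splitters <= x. Costs O(log k) comparisons. */
static size_t subset_index(const int *y, size_t k, int x)
{
    size_t lo = 0, hi = k;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (y[mid] <= x) lo = mid + 1; else hi = mid;
    }
    return lo;
}
```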

3.2 Deterministic signature sorting

We shall use a deterministic version of signature sort, essentially done by Han [21] (cf. Appendix A.2).

Lemma 14 With an $O(n + \ell s^2/r)$ time additive overhead, deterministic signature sorting with parameter $r$ reduces the problem of splitting $n$ $\ell$-bit integers over $s$ splitters to

(i) the problem of splitting $n$ reduced integers of $4r\log n$ bits over $s$ reduced splitters, and

(ii) the problem of splitting $n$ fields of $\ell/r$ bits over $s$ field splitters.

Here (i) has to be solved before (ii).

Han [21] actually paid an additive $O(\ell s^2)$. The fact that we only pay $O(\ell s^2/r)$ is useful if $\ell$ is arbitrarily large compared with $n$, as in Lemma 15 below.

Lemma 15 If $W = (\log n)^c$, $c \ge 5$, we can split $n$ word integers over at most $\sqrt[3]{n}$ splitters in linear time.


Proof: We apply the signature sorting of Lemma 14 with $r = (\log n)^{c-3}$. Then the additive overhead is
$$O\Bigl(n + \frac{(\log n)^c\, n^{2/3}}{(\log n)^{c-3}}\Bigr) = O(n).$$
Now, the reduced integers from (i) have length $4r\log n = 4(\log n)^{c-2} = O(W/(\log n)^2)$, and that implies that we can sort, and hence split, them in linear time using the packed sorting of Albers and Hagerup [2]. Similarly, the fields have length $(\log n)^c/(\log n)^{c-3} = (\log n)^3 = O(W/(\log n)^2)$, so they can also be sorted in linear time.

Above, we could let $c$ go down from 5 to 3 if, instead of using the packed sorting of Albers and Hagerup, we used the one by Han and Shen [24], which takes linear time for $O(W/\log n)$-bit integers. However, Han and Shen's result employs the sorting network of Ajtai et al. [1], and the improvement does not affect the overall results of this paper.

Lemma 16 If $W \ge (\log n)^5$, with a linear time additive overhead, we can reduce the problem of splitting $n$ word integers over $s \le \sqrt[3]{n}$ splitters into at most four problems of splitting $n$ $q$-bit integers over at most $s$ splitters, where $q = O(\sqrt[4]{W/\log n}\,\log n)$ and $W = \Omega(q(\log n)^3)$.

Proof: Let $p = W/\log n \ge (\log n)^4$. First we apply the deterministic signature sort of Lemma 14 with $r = \sqrt{p}$. Now both subproblems have integer length $O(\sqrt{p}\,\log n)$. We then apply Lemma 14 with $r = \sqrt[4]{p}$ to each subproblem, getting four subproblems, each with integer length $O(\sqrt[4]{p}\,\log n)$.

By the two preceding lemmas, we can assume that the integers are of length $q$ where $q = O(\log n)$, $W = \Omega(q(\log n)^4)$, and $W \ge (\log n)^5$.

3.3 String sorting

We are going to show:

Lemma 17 Consider $n$ integers, $k(\log k)^6 \le W$, packed with at most $k$ integers in each word. We can sort the integers according to the value of a given segment of at most $(\log n)/2$ consecutive bits so that the time spent on an integer with segment value $c$ is
$$O\Bigl(\frac{1 + \log n - \log n_c}{\log n} + 1/k\Bigr),$$
where $n_c$ is the number of integers with segment value $c$.

The above lemma may look somewhat strange, but as demonstrated below, it actually implies our main result.

Lemma 18 Lemma 17 implies Lemma 13 (d) with $n^{\varepsilon_d} = \sqrt[3]{n}$.

Proof: We need to split $n$ integers over $\sqrt[3]{n}$ splitters. From Lemmas 15 and 16, we know that it suffices to consider $q$-bit integers where $q = O(\log n)$ and $W = \Omega(q(\log n)^4)$. For $n = \omega(1)$, the latter implies $q(\log q)^6 \le W$, as required for Lemma 17.

We view each integer as consisting of 4 characters, each of $(\log n)/4$ bits. Also, we have plenty of room to pack $q$ integers in each word.

The algorithm works recursively, taking a subset of $n' \ge \sqrt{n}$ integers, all with a common prefix of $i$ characters. We then use Lemma 17 with $k = q$ to sort the integers according to character $i+1$. We note that this character has $(\log n)/4 \le (\log n')/2$ bits, as required of the segment in Lemma 17. Also, we note that this is, in fact, a splitting, since all the integers agree on the preceding characters. Starting with all $n$ integers, we recurse on diverse subsets until they all have size at most $\sqrt{n}$.

To see that this implies a linear time splitting, let $n_0, n_1, \ldots, n_t$ be the sizes of the sets that a given integer $x$ is involved in. That is, we start with $n_0 = n$. For round $i \ge 1$, we have $n_i \le n_{i-1}$ integers, and we match $(\log n)/4 \le (\log n_{i-1})/2$ bits of $x$, finding agreement with $n_i$ other integers. Hence, the cost for $x$ is
$$O\Bigl(\frac{1 + \log n_{i-1} - \log n_i}{\log n} + 1/q\Bigr).$$
Summing this for $i = 1, \ldots, t$, we get a total cost for $x$ of
$$O\Bigl(\frac{\log n_0 - \log n_t}{\log n} + \frac{t}{\log n} + \frac{t}{q}\Bigr).$$
Here $n_0 = n$, so the first term is constant. Moreover, by definition, there are 4 characters, and using this upper bound on $t$, we see that the last two terms are constant.

3.4 Sorting over a segment

The goal of this section is to prove Lemma 17. We will use the lemma below on packed bucketing, essentially proved by Han in [21] (cf. Appendix A.3).

Lemma 19 Consider $n$ integers, $k(\log k)^6 \le W$, packed with $k(\log k)$ integers in each word, and suppose that an $\ell$-bit label for each integer is packed in a parallel set of words. We can then sort the integers according to their labels in $O(\ell/\log n + 1/k)$ time per integer.

Lemma 20 Consider $n$ integers, $k(\log k)^6 \le W$, packed with at least $k(\log k)$ integers in each word. We can group all integers with respect to matches with $t \le \sqrt[3]{n}$ target integers within a given segment in $O(\log(t+1)/\log n + 1/k)$ time per integer. Integers not matching any target integer end in one group.


6

 W

time per integer. Integers not matching any target integer Lemma 21 Consider n integers packed with

(log k ) end in one group. k integers in each word. We are focusing on a spe- cific segment of the integers. Suppose that for different pos-

sible segment values c, we are given a suggested frequency

Proof: Our goal is to construct a parallel set of labels so p

3

f  1= n

c for integers with that segment value. We further

t  2

that we can apply Lemma 19. First we assume that . P

2 1

require  . We can then group the integers spending

f

O (t W )

In time, we construct a perfect hash function c

2logte

from the segments of the targets into d bits. Since

O ((1 log f )= log n +1=k )

c

= O (W ) O (n=k ) k , the time spent is . To apply the hash function to all our integers, we first

time per integer with segment value c. make copies of the words containing them into a parallel set of words, then we mask out the segments, shifting them to the least significant part of each integer. Finally, we ap- Proof: The result is achieved by a simple iterative al-

ply the hash function so that we now have a parallel set of gorithm. First we sort the frequencies in descending order.

1=k

t = dn t 2

We start with e and repeat squaring until it

2logte

d bit labels, each aligned with the least significant

1=3 t part of its original integer. We generally refer the reader to reaches or passes n .Foragivenvalueof ,wetakethe

[5, pp. 77–79] for details on making such word operations t remaining segment values of highest frequency, and apply on multiple integers in a word, including the hashing. We Lemma 20 to the remaining integers. This groups the inte-

gers matching the t segment values, leaving the remaining

(1=(k log k )) only spend constant time per word, hence O

time per integer. integers for the remaining rounds.

(log (t +

We now apply Lemma 19, getting the integers sorted ac- The time spent per integer in a round is O

= log n +1=k )=O (log t= log n) log t= log n

1) .Since dou-

(log t= log n +1=k )

cording to their labels in O time per in- 2

bles in each round, it is the last round that an integer x par-

(t teger. We have ) labels, and exactly one label for each

ticipates in that dominates the time spent on that x.How- a

target. Fix 0 to be a label which is not a label of a target.

c f

ever, if the integer has segment value with frequency c ,

k (log k ) y

For each target y , we pack copies of in a word



1=f

there can be no more than c earlier frequencies, so the

O (tk (log k )) = O (n=k )

y . This takes total time.

2 1=k

x 1=f dn e value of t when is picked is at most ,or if

The integers in the words are sorted according to their la- c

1=k

1=f  n x bels, and comparing this with the sorted list of target labels, c . Consequently, the total time spent on is

we can easily identify words of integers where all or some

2 1=k

O (log maxf1=f ; dn eg= log n +1=k )

of the parallel labels match some target label. c

= O ((1 log f )= log n +1=k ):

For each word all of whose labels match the label of a tar- c

 y get y , we compare the parallel word of integers with .For

each non-match, the corresponding label is replaced with a

We now re-apply Lemma 19, getting the integers sorted according to their revised labels. This time, however, the sorting gives the desired grouping: the segment of integers with label 0 are exactly those that do not match any target.

Finally, we have two special cases, t = 0 and t = 1. The case t = 0 is trivial in that all integers belong in the non-matching group, and hence nothing has to be done, agreeing with log(t+1) = 0 in the time bound. For t = 1, we copy the unique target so that all integers in a word can be matched in constant time. We use a 1-bit label, with 1 for match and 0 for non-match, and finally apply Lemma 19. Since log(t+1) = 1, this again gives us the desired time bound.

The lemma below essentially shows that if we had guessed the frequencies, we would be done.

In each round, we pick the t remaining segment values of highest frequency, and apply Lemma 20 to the remaining integers. This groups the integers matching the t segment values, leaving the remaining integers for the remaining rounds.

The time spent per integer in a round is O(log(t+1)/log n + 1/k) = O(log t/log n + 1/k). Since t doubles in each round, it is the last round that an integer x participates in that dominates the time spent on x. However, if the integer has segment value c with frequency f_c, there can be no more than 1/f_c earlier frequencies, so the value of t when x is picked is at most 2/f_c, or ⌈n^{1/k}⌉ if 1/f_c ≤ n^{1/k}. Consequently, the total time spent on x is

    O(log max{2/f_c, ⌈n^{1/k}⌉} / log n + 1/k) = O((1 - log f_c)/log n + 1/k).
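The final equality can be checked case by case; the short derivation below is ours, not the paper's, and assumes k ≤ log n (consistent with the k = O(log n) used elsewhere in the paper):

```latex
% Case t \le \lceil n^{1/k}\rceil:
\frac{\log \lceil n^{1/k}\rceil}{\log n}
  \le \frac{(\log n)/k + 1}{\log n}
  = O\!\left(\frac{1}{k}\right) \quad (k \le \log n),
\qquad
% Case t \le 2/f_c:
\frac{\log(2/f_c)}{\log n} = \frac{1 - \log f_c}{\log n}.
```

Both cases are thus covered by the right-hand side O((1 - log f_c)/log n + 1/k).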

Finally, we have arrived at the

Proof of Lemma 17  We want to prove:

    Consider n integers packed with at most k(log k) integers in each W-bit word. We can sort the integers according to the value of a given segment of at most (log n)/2 consecutive bits so that the time spent on an integer with segment value c is

        O((1 + log n - log n_c)/log n + 1/k),   (1)

    where n_c is the number of integers with segment value c.

Since there are only n^{1/2} = O(n/log n) possible segment values, we can initiate arrays using these values as entries. Thus, with each possible segment value c, we can store a counter n̂_c for the number of integers found with configuration c; initially n̂_c = 0. We also have a list of frequencies f̂_c = n̂_c/n for the segment values c that are common.

Proceedings of the 43 rd Annual IEEE Symposium on Foundations of Computer Science (FOCS’02) 0272-5428/02 $17.00 © 2002 IEEE

A segment value c is common in the sense that f̂_c ≥ 1/n^{1/4}, hence n̂_c ≥ n^{3/4}. Initially, this list is empty.

We divide the integers into batches of n^{3/4} integers. First we group the integers in the batch with respect to the common segment values, using their current frequencies in Lemma 21. We note here that f̂_c = n̂_c/n ≥ 1/n^{1/4} = 1/(n^{3/4})^{1/3}, so the conditions of Lemma 21 are satisfied. The cost for an integer with a common segment value c is

    O((1 - log f̂_c)/log(n^{3/4}) + 1/k) = O((1 + log(n/n̂_c))/log n + 1/k).   (2)
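The batch scheme makes the counters lag behind the true counts by at most one batch, which is the fact used to pass from (2) to (3) below. A toy check of this invariant (ours, not the paper's code; `batch_size` plays the role of n^{3/4}):

```python
# Toy check (ours) of the one-batch lag: when the ith integer of a given
# segment value is inserted, the counter available at the start of its batch
# is at least i minus the batch size, mirroring n̂_c >= i - n^{3/4}.

def check_lag(values, batch_size):
    count_at_batch_start = {}   # n̂_c as known when the current batch began
    true_rank = {}              # exact number of value-c integers seen so far
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        for v in set(batch):
            count_at_batch_start[v] = true_rank.get(v, 0)
        for v in batch:
            true_rank[v] = true_rank.get(v, 0) + 1   # this is i for value v
            assert count_at_batch_start[v] >= true_rank[v] - batch_size
    return True

print(check_lag([0, 1, 0, 0, 2, 0, 1, 0, 0, 0, 3, 0], batch_size=4))  # True
```

Since a batch adds at most `batch_size` integers of any one value, the counter undercounts the true rank i by at most `batch_size`, losing only a constant factor once i is large.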

All remaining integers in the batch are bucketed using standard bucketing in constant time per integer.

We will now argue that the time spent on inserting the ith integer in one of our buckets is

    O((1 + log n - log i)/log n + 1/k).   (3)

If i ≤ 2n^{3/4}, (3) is a constant, covering the cost of standard bucketing. Otherwise, since each batch adds at most n^{3/4} integers, the batch was bucketed with n̂_c ≥ i - n^{3/4} ≥ i/2. By (2), the cost is as in (3) but with i/2 instead of i, and this does not affect the asymptotic value. Thus, the cost of the ith integer is always bounded by (3).

Now, the cost of adding all n_c integers to the bucket for segment value c is

    Σ_{i=1}^{n_c} ((1 + log n - log i)/log n + 1/k)
      ≤ (n_c(1 + log n) - (n_c log n_c - n_c log e))/log n + n_c/k
      = O(n_c (1 + log n - log n_c)/log n + n_c/k),

using Σ_{i=1}^{n_c} log i = log(n_c!) ≥ n_c log n_c - n_c log e. Dividing by n_c gives the desired time per integer from (1).
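The summation bound rests on Stirling's inequality n_c! ≥ (n_c/e)^{n_c}. A quick numeric sanity check of the inequality as used above (ours, with logs base 2):

```python
# Numeric check (ours): sum_{i=1}^{m} (1 + log n - log i)
#   <= m (1 + log n - log m) + m log e,  by  log(m!) >= m log m - m log e.
from math import log2, e

def cost_sum(n, m):
    return sum(1 + log2(n) - log2(i) for i in range(1, m + 1))

def cost_bound(n, m):
    return m * (1 + log2(n) - log2(m)) + m * log2(e)

for n, m in [(1024, 1), (1024, 10), (1 << 20, 1000)]:
    assert cost_sum(n, m) <= cost_bound(n, m) + 1e-9
print("bound holds")
```

The m log e slack is swallowed by the O(...) in the displayed bound.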

4 Summing up

Proof of Theorem 1, Corollary 2, and Theorem 3  The results follow directly from the statements of Lemmas 4, 12, 13, 17, and 18.

Proof Sketch for Theorem 9 and Corollary 10  We want to show:

    For any positive ε, using standard AC^0 operations only, we can split a set of n (W/(log log n)^{1+ε})-bit integers in linear time so that diverse subsets have size at most n^{1/2}.

We can essentially reuse the splitting algorithm for Theorem 7. The only non-AC^0 operation used is multiplication, which is used for hashing. Brodnik et al. [12] have shown (see their remark on BlockMult above their Theorem 13) that if words are packed with ℓ-bit integers, we can multiply them coordinatewise in time O((log ℓ)^{1+ε}) using only standard AC^0 operations. We do such packed multiplication on fields when we use signature sort in Lemma 15 and in Lemma 16, with field lengths of at most (log n)^2 and (log n)^3, respectively. Also, the rest of the algorithm only considers integers of size O((log n)^2). Thus the packed simulated multiplication takes O((log log n)^{1+ε}) time. However, since our input integers are of length O(W/(log log n)^{1+ε}), the packed multiplications will only be over that many bits. Hence we can pack (log log n)^{1+ε} original packed multiplications together, thus performing (log log n)^{1+ε} original packed multiplications at a time, in O(1) time per original packed multiplication.

References

[1] M. Ajtai, J. Komlós, and E. Szemerédi. Sorting in c log n parallel steps. Combinatorica, 3(1):1–19, 1983.
[2] S. Albers and T. Hagerup. Improved parallel integer sorting without concurrent writing. Inf. Comput., 136:25–51, 1997.
[3] A. Andersson. Sublogarithmic searching without multiplications. In Proc. 36th FOCS, pages 655–663, 1995.
[4] A. Andersson. Faster deterministic sorting and searching in linear space. In Proc. 37th FOCS, pages 135–141, 1996.
[5] A. Andersson, T. Hagerup, S. Nilsson, and R. Raman. Sorting in linear time? J. Comp. Syst. Sc., 57:74–93, 1998. Announced at STOC'95.
[6] A. Andersson, P. B. Miltersen, and M. Thorup. Fusion trees can be implemented with AC^0 instructions only. Theor. Comput. Sc., 215(1-2):337–344, 1999.
[7] A. Andersson and S. Nilsson. A new efficient radix sort. In Proc. 35th FOCS, pages 714–721, 1994.
[8] A. Andersson and M. Thorup. Tight(er) worst-case bounds on dynamic searching and priority queues. In Proc. 32nd STOC, pages 335–342, 2000.
[9] P. Beame and F. Fich. Optimal bounds for the predecessor problem. In Proc. 31st STOC, pages 295–304, 1999.
[10] P. Beame and J. Håstad. Optimal bounds for decision problems on the CRCW PRAM. J. ACM, 36(3):643–670, 1989.
[11] J. Bentley. Programming Pearls. Addison-Wesley, Reading, Massachusetts, 1986.
[12] A. Brodnik, P. B. Miltersen, and I. Munro. Trans-dichotomous algorithms without multiplication - some upper and lower bounds. In Proc. 5th WADS, LNCS 1272, pages 426–439, 1997.
[13] B. Cherkassky, A. Goldberg, and C. Silverstein. Buckets, heaps, lists, and monotone priority queues. SIAM J. Comp., 28(4):1326–1346, 1999.
[14] L. J. Comrie. The Hollerith and Powers tabulating machines. Trans. Office Machinery Users' Assoc., Ltd, pages 25–37, 1929-30.

[15] S. Cook and R. Reckhow. Time-bounded random access machines. J. Comp. Syst. Sc., 10(2):217–255, 1973.
[16] A. I. Dumey. Indexing for rapid random access memory systems. Computers and Automation, 5(12):6–9, 1956.
[17] M. L. Fredman and D. E. Willard. Surpassing the information theoretic bound with fusion trees. J. Comput. Syst. Sci., 47:424–436, 1993. Announced at STOC'90.
[18] M. L. Fredman and D. E. Willard. Trans-dichotomous algorithms for minimum spanning trees and shortest paths. J. Comput. Syst. Sci., 48:533–551, 1994.
[19] T. Hagerup. Sorting and searching on the word RAM. In Proc. 15th STACS, LNCS 1373, pages 366–398, 1998.
[20] Y. Han. Fast integer sorting in linear space. In Proc. STACS'00, LNCS 1770, pages 242–253, 2000.
[21] Y. Han. Improved fast integer sorting in linear space. Inf. Comput., 170(8):81–94, 2001. Announced at STACS'00 and SODA'01.
[22] Y. Han. Deterministic sorting in O(n log log n) time and linear space. In Proc. 34th STOC, 2002.
[23] Y. Han and X. Shen. Conservative algorithms for parallel and sequential integer sorting. In Proc. 1st COCOON, LNCS 959, pages 324–333, 1995.
[24] Y. Han and X. Shen. Parallel integer sorting is more efficient than parallel comparison sorting on exclusive write PRAMs. In Proc. 10th SODA, pages 419–428, 1999.
[25] IEEE. Standard for binary floating-point arithmetic. ACM Sigplan Notices, 22:9–25, 1985.
[26] B. Kernighan and D. Ritchie. The C Programming Language. Prentice Hall, second edition, 1988.
[27] D. Kirkpatrick and S. Reisch. Upper bounds for sorting integers on random access machines. Theor. Comp. Sc., 28:263–276, 1984.
[28] D. E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, Reading, Massachusetts, second edition, 1998.
[29] S. Nilsson. Radix Sorting & Searching. Ph.D. Thesis, Lund University, Sweden, 1996.
[30] W. Paul and J. Simon. Decision trees and random access machines. In Proc. Symp. über Logik und Algoritmik, pages 331–340, 1980.
[31] R. Raman. Priority queues: small, monotone and trans-dichotomous. In Proc. 4th ESA, LNCS 1136, pages 121–137, 1996.
[32] B. Stroustrup. The C++ Programming Language, Special Edition. Addison-Wesley, Reading, MA, 2000.
[33] M. Thorup. Faster deterministic sorting and priority queues in linear space. In Proc. 9th SODA, pages 550–555, 1998.
[34] M. Thorup. On RAM priority queues. SIAM J. Comp., 30(1):86–109, 2000.
[35] M. Thorup. Equivalence between priority queues and sorting, 2002. Also submitted to FOCS'02.
[36] M. Thorup. Randomized sorting in O(n log log n) time and linear space using addition, shift, and bit-wise boolean operations. J. Algor., 42(2):205–230, 2002. Announced at SODA'97.
[37] P. van Emde Boas. Preserving order in a forest in less than logarithmic time. In Proc. 16th FOCS, pages 75–84, 1975.

A Variants of results from other papers

In this appendix we justify some simple variants of results stemming from other papers.

A.1 Signature sorting

To describe our slight variant of signature sorting for Lemma 11, we first review the original signature sorting of Andersson et al. [5]. Let S be a set of n integers of length ℓ. Signature sorting reduces the problem of sorting S to two sorting problems, each with n integers, but with shorter integers. In the reduction, for some parameter r, we interpret the integers in S as vectors of r equal sized fields. The reduction goes in several steps:

1. For each integer, each field is hashed into 4 log n bits. The hashed fields of an integer are packed together, giving us a reduced integer of length 4r log n bits. All the reduced integers are produced in linear total time.

2. The reduced integers are now sorted (our first new sorting problem).

3. In linear time, we identify n fields from the integers in S.

4. The n fields of ℓ/r bits are now sorted (our second new sorting problem).

5. Based on the sorting done above, we sort S in linear time.

The above high level reduction is carefully implemented in [5], to which the reader is referred for details.
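Steps 1-2 can be pictured with the following much-simplified sketch. This is ours, not the implementation from [5]: it views each l-bit integer as r fields of l/r bits, hashes every field down to q bits (q = 4 log n in the text) with a toy multiplicative hash, and packs the hashed fields into a reduced integer.

```python
# Simplified sketch (ours) of steps 1-2 of the signature-sorting reduction.
# Key property: equal fields hash equally, so integers sharing a prefix of
# fields share a prefix of hashed fields, which is what steps 3-5 exploit.
import random

def field_hash(field_bits, q, seed=1):
    """Toy multiplicative hash from field_bits bits down to q bits."""
    a = random.Random(seed).randrange(1, 1 << field_bits, 2)  # random odd multiplier
    return lambda x: (a * x % (1 << field_bits)) >> (field_bits - q)

def reduce_integers(S, l, r, q):
    fb = l // r                                  # field length l/r
    h = field_hash(fb, q)
    reduced = []
    for x in S:
        red = 0
        for i in reversed(range(r)):             # most significant field first
            red = (red << q) | h((x >> (i * fb)) & ((1 << fb) - 1))
        reduced.append(red)
    return reduced

S = [0xDEADBEEF, 0xDEAD0000, 0x12345678]
R = reduce_integers(S, l=32, r=2, q=8)
# The first two integers share their high 16-bit field, so the reduced
# integers share their high 8-bit hashed field:
assert R[0] >> 8 == R[1] >> 8
```

Step 2 then sorts `R`, an ordinary sort on much shorter keys; the real reduction does the hashing word-parallel so the whole of step 1 is linear time.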

There is a probability of at most 1/n^2 that something goes wrong in the hashing. In [5] they just check the final sorting in the end. However, here we will apply signature sorting to many small subproblems, a small fraction of which are likely to fail. Instead of aiming at no errors at all, we introduce the following convenient step between step 2 and step 3:

2½. In expected linear time, redo and resort the reduced integers if they are not OK.

To implement step 2½, we need to check that the reduced integers are OK. Referring the reader to [5] for details, this is easily done in connection with their linear time construction of a certain compressed unordered trie T_D needed for steps 3–4. If there is a failure, we return to step 1. However, when we iterate, we can just use bubble-sort to sort the reduced integers in O(n^2) time. The point is that the probability of iterating is 1/n^2, so the expected cost of all the iterations is bounded by Σ_{i=1}^∞ O(n^2)/n^{2i} = O(1). The most expensive part of step 2½ is therefore the first check, which is always executed in linear time.

Summing up, with an expected linear-time additive overhead, signature sorting with parameter r reduces the problem of sorting n integers of ℓ bits to

(i) the problem in step 2 of sorting n reduced integers of 4r log n bits, and

(ii) the problem in step 4 of sorting n fields of ℓ/r bits.

Here (i) has to be solved before (ii). This establishes our variant of signature sorting from Lemma 11.

A.2 Deterministic signature sorting

We want to show the statement of Lemma 14:

    With an O(n + (ℓ/r)s^2) time additive overhead, deterministic signature sorting with parameter r reduces the problem of splitting n ℓ-bit integers over s splitters to

    (i) the problem of splitting n reduced integers of 4r log n bits over s reduced splitters, and

    (ii) the problem of splitting n fields of ℓ/r bits over s field splitters.

    Here (i) has to be solved before (ii).

Lemma 14 is essentially shown by Han in the beginning of Section 8 in [20]. A small difference is in the additive O(n + (ℓ/r)s^2) term, where Han has O(ℓs^2). This term is the time it takes to compute a perfect hash function. We just realize that all we need is a hash function on fields such that for any pair of splitters, the hash function gives different values on their first distinguishing field. Since fields have length ℓ/r, the hash function is computed in O((ℓ/r)s^2) time using simple derandomization as described by Raman [31].

Moreover, Han's version could lead to much more than s splitters in (ii). We need an analog of a simple trick from the original signature sort [5, p. 79]; namely, at each branching point in the trie T_D, to isolate the integer fields smaller than the smallest splitter field in linear time. This completes the proof of Lemma 14.

A.3 Packed bucketing

We want to show the statement of Lemma 19:

    Consider n integers packed with k(log k) integers in each word, and assume that an ℓ-bit label for each integer is packed in a parallel set of words. We can then sort the integers according to their labels in O(ℓ/log n + 1/k) time per integer.

The lemma is essentially just a reformulation of Han's Lemma 5 in [21], and it is proved using the same proof. The O(ℓ/log n) term is the inherent cost of packed bucketing, using that the labels are small. The O(1/k) term is the cost of a matrix transposition of Thorup [36, Lemma 9] with k(log k) integers in each word. Also, Han has replaced log k by log log n, using k = O(log n). However, this change is not necessary for his proof. Thus Lemma 19 follows.
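The O(ℓ/log n) term comes from radix sorting the ℓ-bit labels with digits of (log n)/2 bits: that takes ℓ/((log n)/2) linear passes, and the n^{1/2} buckets per pass fit in a small array. A sequential sketch of this digit schedule (ours; the real lemma does each pass word-parallel on the packed words):

```python
# Sketch (ours) of the digit schedule behind the O(l / log n) label-sort cost:
# stable LSD radix sort on l-bit labels with digits of (log n)/2 bits.
from math import log2

def radix_by_labels(pairs, label_bits, n):
    """Stably sort (item, label) pairs by label."""
    digit = max(1, int(log2(n)) // 2)            # digit size (log n)/2
    buckets = 1 << digit                         # n^{1/2} buckets
    for shift in range(0, label_bits, digit):    # l / ((log n)/2) passes
        out = [[] for _ in range(buckets)]
        for item, label in pairs:
            out[(label >> shift) & (buckets - 1)].append((item, label))
        pairs = [p for b in out for p in b]      # each pass is stable and linear
    return pairs

pairs = [('a', 5), ('b', 1), ('c', 7), ('d', 1)]
print(radix_by_labels(pairs, label_bits=3, n=16))
# [('b', 1), ('d', 1), ('a', 5), ('c', 7)]
```

With n items and labels of ℓ bits this is O(ℓ/log n) amortized per item, matching the first term of Lemma 19; the word-parallel version adds the O(1/k) transposition term.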
