Integer Sorting in O ( N //Spl Root/Log Log N/ ) Expected Time and Linear Space
Total Page:16
File Type:pdf, Size:1020Kb
p (n log log n) Integer Sorting in O Expected Time and Linear Space Yijie Han Mikkel Thorup School of Interdisciplinary Computing and Engineering AT&T Labs— Research University of Missouri at Kansas City Shannon Laboratory 5100 Rockhill Road 180 Park Avenue Kansas City, MO 64110 Florham Park, NJ 07932 [email protected] [email protected] http://welcome.to/yijiehan Abstract of the input integers. The assumption that each integer fits in a machine word implies that integers can be operated on We present a randomized algorithm sorting n integers with single instructions. A similar assumption is made for p (n log log n) O (n log n) in O expected time and linear space. This comparison based sorting in that an time bound (n log log n) improves the previous O bound by Anderson et requires constant time comparisons. However, for integer al. from STOC’95. sorting, besides comparisons, we can use all the other in- As an immediate consequence, if the integers are structions available on integers in standard imperative pro- p O (n log log U ) bounded by U , we can sort them in gramming languages such as C [26]. It is important that the expected time. This is the first improvement over the word-size W is finite, for otherwise, we can sort in linear (n log log U ) O bound obtained with van Emde Boas’ data time by a clever coding of a huge vector-processor dealing structure from FOCS’75. with n input integers at the time [30, 27]. At the heart of our construction, is a technical determin- Concretely, Fredman and Willard [17] showed that we n (n log n= log log n) istic lemma of independent interest; namely, that we split can sort deterministically in O time and p p n (n log n) integers into subsets of size at most in linear time and linear space and randomized in O expected time space. This also implies improved bounds for deterministic and linear space. The randomized bound can also be string sorting and integer sorting without multiplication. achieved deterministically using space unbounded in terms "W 2 "> 0 of n (of the form for constant ), but here we focus on bounds with space bounded in terms of n. 1 Introduction In 1995, Andersson et al. [5] improved Fredman and p (n log n) Willard’s O expected time for integer sorting to (n log log n) Integer sorting has always been an important task in con- O expected time. Both of these bounds use lin- nection with the digital computer. A classic example is ear space. A similar result was found independently by Han the folklore algorithm radix sort, which according to Knuth and Shen [23]. [28] is referenced as far back as in 1929 by Comrie in a The above mentioned bounds are the best unrestricted document describing punched-card equipment [14]. bounds in the sense that no better bounds are known even if (log n) Whereas radix sort works in linear time for O -bit besides the randomization, we have unlimited space and are integers, it was not until 1990 that Fredman and Willard [17] free to define our own operations on words. n log n) beat the ( comparison-based sorting lower-bound The above results have provided great inspiration for for the case of arbitrary single word integers. The word-size many researchers, trying to improve them in various ways. W log n W is determined by the processor. We assume For example, there has been work on avoiding randomiza- so that we can address the different integers, but in princi- tion [4, 21, 22, 31, 33] and there has been work on avoiding n ple, W can be arbitrarily large compared with . An equiv- multiplication [3, 6, 36]. Also there has been lots of work alent formulation of our assumptions is that we only assume on more dynamic versions of the problem such as priority constant time operations on integers polynomial in the sum queues [13, 18, 31, 33, 34] and searching [3, 4, 8, 9]. (n log log n) Supported in part by University of Missouri Faculty Research Grant The O algorithm of Andersson et al. from K211942 1995 [5] is very simple, and the fact that it has sustained so Proceedings of the 43 rd Annual IEEE Symposium on Foundations of Computer Science (FOCS’02) 0272-5428/02 $17.00 © 2002 IEEE much interest has lead many researchers to think that this Theorem 3 improves a corresponding refinement of Kirk- (n log n) could be the complexity for integer sorting, like O patrick and Reich [27] of van Emde Boas’ bound of log U ) (n log is the complexity of comparison based sorting. O expected time. log n (n log log n) However, in this paper, we improve the O The following simple lemma states that it suffices for us p (n log log n) expected time to O expected time: to prove Theorem 3: Lemma 4 Theorem 3 implies Theorem 1 and Corollary 2. Theorem 1 There is a randomized algorithm sorting n in- p (n log log n) tegers, each stored in one word, in O ex- pected time and linear space. Proof: Trivially Theorem 3 implies the weaker Corol- lary 2. However, Andersson et al. [5] have shown that we We leave open the problems of getting a corresponding de- 3 (log n) can sort in linear expected time if W , and other- terministic algorithm and avoid the use of multiplication. U log W 3 log log n wise, log log . We note that the dynamic aspects are already pretty settled, for the complexity of dynamic searching is known to be p log n= log log n) ( [8, 9], and priority queues have once 1.1 Other domains and for all been reduced to sorting [35]. (log n) Since integers of O bits can be sorted in linear time using radix sort, we get the following immediate con- In this paper, we generally assume that our integers to be sequence of Theorem 1: sorted each fit in one word, where a word is the maximal unit that we can operate on with a single instruction. In this subsection, we briefly discuss implications of this case to U Corollary 2 We can sort n integers of size at most in p many other domains. (n log log U ) O expected time. First, consider the case of lexicographically ordered (n log log U ) This is the first improvement over the O bound strings of words. Andersson and Nilsson [7] have presented obtained with van Emde Boas’ data structure from 1975 an optimal randomized reduction from this case to that of (n log log n) [37]. Indeed, the O sorting algorithm of An- integers fitting in single words. Applying their reduction to dersson et al. [5] combines van Emde Boas’ data structure Theorem 1, we get with the packed merging of Albers and Hagerup [2] so as to Corollary 5 We can sort n variable-length strings dis- p (n log log U ) U match the O bound for large values of . O (n log log n + N ) tributed over N words in expected p (log log U ) We note here that the general O bound of (n log log n + L) time. In fact, we can get down to O ex- van Emde Boas [37] has been improved in the context P L = ` ` i pected time where i and is the length of the of static search structures. More precisely, Beame and i distinguishing prefix of string i, that is, the smallest prefix Fich [9] have shown that one can preprocess a set of of string i distinguishing it from all the other strings, or the n integers in polynomial time and space so that given length of string i if it is a duplicate. any x, one can search the largest stored integer below O (log log U= log log log U ) x in time. However, due to We note here that any algorithm sorting the strings will have the polynomial construction time, this improvement does to read the distinguishing prefixes, and since it takes an in- not help with sorting. Beame and Fich [9] combine (L) struction to read each word, it follows that an additive O their result with Andersson’s exponential search trees from is necessary. [4], giving a dynamic search structure with an amor- One may instead be interested in variable length (log log U log log n= log log log U ) tized update time of O . multiple-word integers where integers with more words are This dynamic search structure could be used for sort- considered bigger. However, by prefixing each integer with (n log log U log log n= log log log U ) ing in O time, but this its length, we reduce this case to lexicographic string sort- (n log n) is never better than the best of O time and ing. (n log log U ) O time. Thus, from the perspective of sorting, p In our presentation, we think of all our integers as un- (n log log U ) our O bound constitutes the first improve- signed non-negative integers. However, the standard repre- (n log log U ) ment over the O bound derived from van Emde sentation of signed integers is such that if we flip the sign- Boas’ data structure from 1975 [37]. bit, we can sort them as unsigned integers, and then flip We will, in fact, prove the following refinement of Corol- the sign-bit back again. Floating points numbers are even lary 2: easier, for the IEEE 754 floating-point standard [25] is de- signed so that the ordering of floating point numbers can Theorem 3 There is a randomized algorithm sorting n be deduced by perceiving their representations as multiple W < 2 word integers, each of size at most U in q word integers.