Theoretical Computer Science 164 (1996) 1-12

Heaps with bits

Svante Carlsson a, Jingsen Chen a,*, Christer Mattsson b

a Department of Computer Science, Luleå University, S-971 87 Luleå, Sweden
b Quality Laboratories AB, IDEON Research Park, S-223 70 Lund, Sweden

Received March 1995; revised July 1995
Communicated by M. Nivat

Abstract

In this paper, we show how to improve the complexity of heap operations and heapsort using extra bits. We first study the parallel complexity of implementing the priority queue operations on a heap. The trade-off between the number of extra bits used, the number of processors available, and the parallel time complexity is derived. While inserting a new element into a heap in parallel can be done as fast as parallel searching in a sorted list, we show how to delete the smallest element from a heap in constant time with a sublinear number of processors, and in sublogarithmic time with a sublogarithmic number of processors. The models of parallel computation used are the CREW PRAM and the CRCW PRAM. Our results improve those of previously known algorithms. Moreover, we study a variant, the fine-heap, of the traditional heap structure. A fast algorithm for constructing this new data structure is designed, using an interesting technique that is also used to develop an improved heapsort algorithm. Our variation of heapsort is faster than Wegener's heapsort and requires less extra space.

1. Introduction

One of the fundamental data types in Computer Science is the priority queue. It has been useful in many applications [11]. A priority queue is a set of elements on which two basic operations are defined: inserting a new element into the set and deleting the minimum element from the set. Several data structures have been proposed for implementing priority queues. Probably the most elegant one is the heap [21]. A (min-)heap is a binary tree with the heap property: (i) it has the heap shape; i.e., all leaves lie on at most two adjacent levels, all leaves on the last level occupy the leftmost positions, and all other levels are complete; (ii) it is min-ordered: the key value associated with each node is not smaller than that of its parent. The minimum element is then at the root, which is at the first level. We refer to the number of elements in a heap as its size. A max-heap is defined similarly.

* Corresponding author. E-mail: svante,[email protected].


The problems of heap construction and heap operations have received considerable attention in the literature [2, 5, 6, 8, 9, 11, 13]. In the parallel models of computation, optimal heap construction algorithms have also been developed [4, 15]. However, parallel heap operations have not been so deeply studied. Recently, Pinotti and Pucci [14] presented an O(log log n)-time ¹ parallel algorithm for deleting the smallest element from a heap of size n using n/log n EREW-PRAM processors; and Zhang and Korf [22] reduced the number of processors used for the deletion to (n/log n)^(1−1/k) for some constant k > 1.

¹ All logarithms in this paper are to base 2.

2. Parallel heap operations

Notice that a heap on n elements can be stored level by level from left to right in an array H with the property that the element at position i has its parent at position ⌊i/2⌋ and its children at positions 2i and 2i + 1. Thus, the addresses of all the nodes on a path from the root to some leaf of a heap can easily be computed by shift operations. The level, level(H[i]), of an element H[i] in the heap H is defined as ⌊log i⌋ + 1. For the insertion of a new element x into a heap H of size n, an optimal sequential algorithm of Θ(log log n) comparisons works as follows: first, x is placed at the first available position H[n + 1]; then the min-ordering is restored on the path from H[1] down to H[n + 1]. This is equivalent to the problem of searching for x in the path from H[1] to H[⌊(n + 1)/2⌋] (which forms a sorted list). For the complexity of parallel searching, see [12, 18]. Therefore, the following observation is immediate.

Observation 2.1. The parallel complexity of the insert operation in a heap of size n is the same as that of searching in a sorted list of length ⌈log(n + 1)⌉ on the same model of parallel computation.
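For concreteness, the following sequential Python sketch (ours; the function name and the 1-indexed layout are illustrative assumptions) carries out the Θ(log log n) insertion by binary search over the ancestors of the new position:

    # Sketch of heap insertion with O(log log n) key comparisons: the
    # ancestors of the new position form a sorted list, so x's place can
    # be found by binary search. H[0] is unused; H[1..n] is the min-heap.
    def heap_insert(H, x):
        H.append(x)
        n = len(H) - 1
        path = []                      # ancestors of position n, root first
        i = n // 2
        while i >= 1:
            path.append(i)
            i //= 2
        path.reverse()
        lo, hi = 0, len(path)          # binary search: ~log log n comparisons
        while lo < hi:
            mid = (lo + hi) // 2
            if H[path[mid]] > x:
                hi = mid
            else:
                lo = mid + 1
        pos = n                        # shift larger ancestors one step down
        for j in range(len(path) - 1, lo - 1, -1):
            H[pos] = H[path[j]]
            pos = path[j]
        H[pos] = x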

The delete operation in a heap H[1..n] consists of first removing the smallest element from the heap, replacing it with H[n], and then restoring the min-ordering property. The sequential deletion can be done optimally in logarithmic time. However, it appears that the delete operation is inherently sequential, since the operation involves the search for H[n] in some path from either H[2] or H[3] down to the leaf level (called the path of minimum children). Hence, the deletion may not admit an efficient parallel solution. Observing that the searching path for H[n] is not known beforehand, we have

Observation 2.2. In a heap, the parallel delete operation is at least as hard as the parallel insert operation.

In the rest of this section, nevertheless, we will try to parallelize the delete operation. More precisely, we shall demonstrate that it is possible to perform the deletion in constant time with a sublinear number of processors, and in sublogarithmic time using a sublogarithmic number of processors. Assume without loss of generality that the size of the heap is of the form 2^h − 1 for some integer h > 0. Otherwise, one can first perform the parallel deletion on the first ⌊log n⌋ levels of the heap and then in O(1) steps find out which element at the leaf level of the heap is the last node of the path of minimum children. We first show that the root deletion in a heap of size n can be solved in constant time with O(n log n) processors on the CRCW PRAM model.

Algorithm 2.3. Suppose the number of processors available is p = ((n + 1)/2) · (⌊log n⌋ − 1).

1. Associate with the heap H an array of n bits, denoted by B[1..n]. Initially, let B[i] = 0 for i = 1, 2, ..., n;
2. For each i, 1

The correctness of Algorithm 2.3 follows immediately from the fact that our parallel deletion is obtained by implementing the sequential deletion algorithm on the PRAM model. Moreover, the contents of B can be updated easily and fast. Clearly, only Step 4 of Algorithm 2.3 requires O(n log n) processors. After Step 4 is completed, at most two leaf elements have their associated bits set to 1, and these two leaves are brothers. Now, by using O(log n) processors:

• The path of minimum children can be computed in O(1) time (i.e., Step 5);

• The insertion of H[n] on the path of minimum children takes O(1) time (i.e., Step 6), employing a constant-time parallel searching algorithm [1].

Since Steps 1-4 of Algorithm 2.3 can also be done in constant time, we have

Lemma 2.4. There is a CRCW-PRAM algorithm for deleting the smallest element in a heap of size n, running in O(1) time with Θ(n log n) processors and Θ(n) extra bits.
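The bit-array idea can be simulated sequentially. The sketch below is our reading of Algorithm 2.3, not the authors' exact pseudocode: each loop iteration stands for O(1) work by one processor, the writes of 0 stand for the concurrent writes of Step 4, and, unlike Step 4 (which leaves at most two sibling leaves marked), this simplified marking rule marks exactly the leaf that ends the path of minimum children.

    # Sequential simulation of the bit-marking idea (our interpretation).
    # H[1..n] is a min-heap with n = 2**h - 1; H[0] is unused.
    def path_of_minimum_children(H):
        n = len(H) - 1
        B = [0] * (n + 1)                    # one extra bit per node
        smaller = [0] * (n + 1)
        # One processor per internal node: record its smaller child.
        for i in range(1, (n + 1) // 2):
            smaller[i] = 2 * i if H[2 * i] <= H[2 * i + 1] else 2 * i + 1
        # One processor per (leaf, ancestor) pair: a leaf stays marked only
        # if every node on its root-to-leaf path is its parent's smaller
        # child; with n log n processors this takes O(1) parallel time.
        for leaf in range((n + 1) // 2, n + 1):
            B[leaf] = 1
            j = leaf
            while j > 1:
                if smaller[j // 2] != j:
                    B[leaf] = 0              # concurrent write of 0
                j //= 2
        # The unique marked leaf determines the path of minimum children.
        leaf = next(j for j in range((n + 1) // 2, n + 1) if B[j])
        path = []
        while leaf >= 1:
            path.append(leaf)
            leaf //= 2
        return list(reversed(path))          # from the root down to a leaf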

Remark that Algorithm 2.3 only demands the ability of multiple write for its fourth step, so it can be modified to run on the CREW-PRAM model. Assume that O(n log n) extra bits are available and that a processor can write a digit into the bit of a word. Now, we associate with each leaf node H[j] ((n + 1)/2 ≤ j ≤ n) a word of ⌊log n⌋ bits, one bit for each node on the path from the root to that leaf; the processor assigned to such a node writes a 1 into the corresponding bit if that node is the smaller child of its parent.

After that, the path of minimum children can easily be computed in constant time. We thus have

Lemma 2.5. There is a CREW-PRAM algorithm for deleting the smallest element in a heap of size n, running in O(1) time with O(n log n) processors and O(n log n) extra bits.

Notice that the sequential complexity of the deletion is Θ(log n). Hence, the above schemes fail to achieve the optimal speedup. However, the number of processors required can be reduced significantly. First, some definitions are needed. We denote by ‖S‖ the cardinality of a set S of elements. An integer m > 0 is called a perfect number if there exists some integer k > 0 such that m = 2^k − 1. For any integer m > 0, we define the left match m′ of m as the largest perfect number such that m′ ≤ m. Notice that if m itself is a perfect number, then m′ = m.
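In code, the left match is simply the largest number of the form 2^k − 1 not exceeding m; a small helper (ours, for later reference):

    # Left match: largest "perfect number" 2**k - 1 with 2**k - 1 <= m.
    def left_match(m):
        assert m > 0
        k = (m + 1).bit_length() - 1     # 2**k <= m + 1 < 2**(k+1)
        return (1 << k) - 1

    # e.g. left_match(7) == 7, left_match(12) == 7, left_match(15) == 15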

Algorithm 2.6. Suppose that the number of processors available is p = n and that H is a heap of size n. Let k be the left match of ⌊n/log n⌋.
1. Perform a parallel deletion on the first log k levels of H (with H[n] being the element to be inserted), called the heap H_T, using p processors. Call the heap H_T after the deletion H′_T;
2. Compute the path of minimum children in the heap H′_T and denote the last element on this path by z. Let z′ be the smaller child of z in the heap H;
3. Run recursively a parallel deletion algorithm on the subheap, H_D, rooted at z′ in the heap H, using p processors.

The algorithm above deletes the smallest element in a heap correctly, in a way similar to the sequential deletion algorithm. Moreover, the running time of Algorithm 2.6 is O(1). Notice first that

    ‖H_T‖ · log ‖H_T‖ ≤ (n/log n) · log n = n = p.

Therefore, Steps 1 and 2 of the algorithm run in constant time by Lemma 2.4. Notice next that the deletion on H_D can be done in O(1) time since ‖H_D‖ = O(log n). (Remark that if Steps 1 and 2 of Algorithm 2.6 are performed according to Lemma 2.5, the number of extra bits needed is then O(n).) Therefore,

Lemma 2.7. There is a parallel algorithm for deleting the smallest element in a heap of size n, running in O(1) time with O(n) CRCW- (or CREW-) PRAM processors and O(n/log n) (or O(n) for the CREW-PRAM model) extra bits.
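The sequential deletion that Algorithm 2.6 chops into constant-time rounds looks as follows (a Python rendering of the standard root deletion; names are ours). The while-loop walks the path of minimum children; the parallel algorithm covers this path in chunks of about log k levels, each chunk in O(1) time by the technique of Algorithm 2.3:

    # Root deletion in a min-heap H[1..n] (H[0] unused); returns the minimum.
    def delete_min(H):
        minimum = H[1]
        x = H.pop()                # former H[n], to be reinserted on the path
        n = len(H) - 1
        i = 1
        while 2 * i <= n:          # walk the path of minimum children
            c = 2 * i
            if c + 1 <= n and H[c + 1] < H[c]:
                c += 1
            if H[c] >= x:          # x settles at position i
                break
            H[i] = H[c]
            i = c
        if n >= 1:
            H[i] = x
        return minimum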

To further reduce the number of processors consumed, we shall employ p = ⌈n^ε⌉ processors, where 0 < ε < 1 is a constant, and let k be the left match of ⌈n^ε⌉. By executing the three steps of Algorithm 2.6 with the new values of p and k, Steps 1 and 2 are completed in O(1) time as well. For Step 3, notice that the algorithm will now be repeated O(1/ε) times until the bottom of the heap H is reached. Hence,

Theorem 2.8. There is a parallel algorithm for deleting the smallest element in a heap of size n, running in O(1/ε) time with n^ε processors and O(n^ε/ε log n) extra bits, for any constant 0 < ε < 1, on the CRCW-PRAM model.

With this approach, we can design a parallel deletion algorithm with an even better time-processor product. In fact, the problem of deleting the smallest element in a heap can be solved in sublogarithmic time by using a sublogarithmic number of processors. Precisely, if we employ ⌊log n/log log n⌋ processors and let k (in Algorithm 2.6) be the left match of ⌊log n/log log n⌋, then Algorithm 2.6 runs in Θ(log n/log log n) steps. That is,

Theorem 2.9. There is a CRCW-PRAM algorithm for deleting the smallest element in a heap of size n that runs in time O(log n/log log n) on log n/log log n processors and with log n/log log n extra bits.

Similarly, we can implement the algorithms developed above on the CREW-PRAM model and achieve the same time-processor products. Namely, we have

Theorem 2.10. There is a CREW-PRAM algorithm for deleting the smallest element in a heap of size n that runs in time either

• O(1/ε) with n^ε processors and n^ε extra bits, for any constant 0 < ε < 1; or

• O(log n/log log n) on log n/log log n processors using Θ(log n) extra bits.

Remark that finding the path of minimum children (which may be implicit in some heap-deletion algorithms) seems to be almost sequential in nature. More precisely, the computation of the minimum-children path needs to process the heap level by level, and the result of one operation influences later operations. Hence, the delete operation does not lead to a good parallel implementation.

3. Fine-heap with application

In this section, we shall introduce a new variant of the conventional heap that allows quick access to the path of minimum children and admits a modified heapsort, improving both the time for sorting and the space consumption over the traditional heapsort algorithm and its variants. We first investigate the construction problem for this new structure. Then we show how to implement it so that the desired sorting complexity is achieved.

Fig. 1. A fine-heap on 7 elements: the Hasse diagram and the picture of the structure.

3.1. Fine-heap and its construction

A fine-heap is a heap with an additional ordering relation defined on siblings. See Fig. 1 for an example of a fine-heap. It costs no comparisons to find the path of minimum children, starting at any node in the structure down to the leaf level in a fine-heap. Such an efficient treatment of the path finding is accomplished by using ⌊n/2⌋ bits of extra space. The fine-heap has also been introduced implicitly in [10, 13, 19]. The sequential construction complexity of fine-heaps can be estimated using an information-theoretical approach.
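As a sketch (ours) of why the path of minimum children is free: store, for each internal node i, one bit saying which of its children is smaller; following the path then needs only bit inspections and no key comparisons.

    # H[1..n] is a min-heap; bit[i] = 1 iff the right child of internal
    # node i is smaller than its left child (the floor(n/2) extra bits).
    def min_children_path_by_bits(H, bit):
        n = len(H) - 1
        path, i = [], 1
        while 2 * i <= n:
            path.append(i)
            i = 2 * i + bit[i] if 2 * i + 1 <= n else 2 * i
        path.append(i)
        return path                # root down to a leaf; 0 key comparisons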

Theorem 3.1. The average (and thus also the worst-case) number of comparisons necessary to build a fine-heap on n elements is at least 1.864436...n (ignoring lower-order terms).

Proof. Let P_n be a fine-heap on n elements and P′_{n−1} the ordered structure obtained from P_n by deleting the root of P_n. Denote by ℓ(P_n) and ℓ(P′_{n−1}) the number of permutations of the input elements consistent with P_n and P′_{n−1}, respectively. Clearly, ℓ(P_n) = ℓ(P′_{n−1}). In establishing a lower limit on the construction complexity for fine-heaps, let us check the case when n = 2^h − 1 for h > 0. Notice that a fine-heap on n elements can be viewed as the first two smallest elements connecting to the structure P′_{(n−3)/2} and a fine-heap P_{(n−1)/2}. Hence,

    ℓ(P_n) = 1 · 1 · C(n−2, (n−1)/2) · ℓ(P′_{(n−3)/2}) · ℓ(P_{(n−1)/2})
           = C(n−2, (n−1)/2) · ℓ(P_{(n−1)/2}) · ℓ(P_{(n−1)/2})
           = (1/2) · C(n−1, (n−1)/2) · (ℓ(P_{(n−1)/2}))²
           = (n! / (2n · ((n−1)/2)! · ((n−1)/2)!)) · (ℓ(P_{(n−1)/2}))².

That is,

    n!/ℓ(P_n) = 2n · (((n−1)/2)! / ℓ(P_{(n−1)/2}))².

By the information-theoretic lower bound, we know that the minimum number of comparisons, on the average, needed to build a fine-heap P_n on n elements is at least

    log(n!/ℓ(P_n)) = log(2n) + 2 · log(((n−1)/2)!/ℓ(P_{(n−1)/2})) = Σ_{i=2}^{h} 2^{h−i} · log(2 · (2^i − 1)),

which gives 1.864436...n − o(n). □
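The constant can be checked numerically; under our reading of the bound above, the per-element cost is Σ_{i≥2} 2^{−i} log(2(2^i − 1)):

    from math import log2

    # Numerical check of the constant in Theorem 3.1 (our reconstruction):
    # sum over subheap heights i >= 2 of log(2 * (2**i - 1)) / 2**i.
    c = sum(log2(2 * (2**i - 1)) / 2**i for i in range(2, 200))
    print(c)                       # approx. 1.86443..., matching Theorem 3.1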

A natural way to build fine-heaps is to construct the structure in a bottom-up fashion (namely, by recursively merging the small parts of the structure), similar to Floyd's heap construction algorithm and its variants [2, 8]. When the algorithm is about to merge two full fine-heaps of height h − 1 with one singleton element (the cost is denoted by M(h)), the information about the path of minimum children can be deduced from the ordering relations in the structure at no extra cost. For inserting the singleton element into the path and re-establishing the ordering of the structure, h + 2 comparisons are sufficient to complete the tasks in the worst case. Thus, M(i) = i + 2 comparisons. Therefore, the worst-case number, F(h), of comparisons for constructing a fine-heap on n = 2^{h+1} − 1 elements is at most

    F(h) ≤ 2^h · Σ_{i=1}^{h} M(i)/2^i = 2^h · Σ_{i=1}^{h} (i + 2)/2^i = 2n − h − 2.   (1)

Hence, constructing a fine-heap of arbitrary size n will cost at most 2n + O(log² n) comparisons (similar to that for the traditional heap [9]).

Lemma 3.2. A fine-heap on n elements can be constructed in 2n + O(log² n) comparisons in the worst case.
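A quick numerical check (ours) that the closed form in Eq. (1) matches the level-by-level sum of merge costs:

    # Eq. (1) sanity check: 2**(h-i) merges of cost M(i) = i + 2 at height i
    # must total 2n - h - 2 for a fine-heap on n = 2**(h+1) - 1 elements.
    for h in range(1, 12):
        n = 2 ** (h + 1) - 1
        total = sum(2 ** (h - i) * (i + 2) for i in range(1, h + 1))
        assert total == 2 * n - h - 2, (h, total)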

The complexity of the preceding algorithm exceeds the lower bound (Theorem 3.1) by a constant factor. In order to decrease this factor, we shall design an efficient method for constructing small fine-heaps and use them as basic building blocks for fine-heaps of arbitrarily large size. Before proceeding with a presentation of our algorithm, we shall demonstrate that it is possible to construct our building blocks faster than the preceding algorithm does. Observe first that constructing a fine-heap on 7 elements costs 10 comparisons according to Eq. (1), which is only one comparison more than the information-theoretic lower bound. However, by carefully examining the symmetric property of the structure, we can actually create three fine-heaps each of size 7 in only 28 comparisons. The idea behind our fast way of building smaller fine-heaps is gained from the study of the mass production of partial orders. A fine-heap on 7 elements {20, 26, 35, 32, 53, 46, 50} and its Hasse diagram are shown in Fig. 1.

Lemma 3.3. Three fine-heaps each of size 7 can be constructed in at most 28 comparisons in the worst case.

Fig. 2.

Proof. We briefly describe the construction method with the help of Fig. 2. The algorithm is carried out in four steps.
1. Construct four binomial trees each of size 4. (Cost: 12 comparisons.)
2. Compare element a with b, and element c with d. (Cost: 2 comparisons.) Now the algorithm generates at least the structures P_1, P_2, and P_3 (see Fig. 2) plus five singleton elements.
3. Build two 7-element fine-heaps starting from P_1 and P_2 plus two singleton elements for each of the structures. The cost of this step is 10 comparisons, since each singleton element can be inserted into the structure in 2 comparisons (which is carried out by performing a binary search) and 1 + 1 more comparisons are needed to achieve the ordering of the fine-heaps.
4. Transform P_3 plus one singleton element into a 7-element fine-heap. To accomplish this step, first a comparison between x and y is done. Then the singleton element is inserted into the structure (with 2 comparisons) and the ordering property of the fine-heap can then be created with one additional comparison. (Cost: 4 comparisons.)
The correctness of the algorithm follows directly from its description, and the overall cost is 12 + 2 + 10 + 4 = 28 comparisons. □

The complexity of the above algorithm is almost tight, since the information-theoretic lower bound for constructing three 7-element fine-heaps is equal to ⌈9 + 3 × log 63⌉ = 27 comparisons. With a similar technique, two fine-heaps of size 7 can be built in at most 19 comparisons, which is also close to the information-theoretic lower bound of ⌈6 + 2 × log 63⌉ = 18 comparisons. Although it is not clear how to save more comparisons by producing more fine-heaps each of size 7 simultaneously, our strategy for building 7-element fine-heaps can be used to construct fine-heaps of arbitrary sizes in fewer than 2n comparisons. In fact, to construct a fine-heap on n = 2^{h+1} − 1 elements, we can apply Eq. (1) and Lemma 3.3, which yields

    F(h) ≤ 2^{h−2} · (28/3) + 2^h · Σ_{i=3}^{h} (i + 2)/2^i = (23/12)(n + 1) − h − 4.

Therefore,

Theorem 3.4. A fine-heap on n elements can be constructed in at most (23/12)n + O(log² n) comparisons in the worst case.

This is only 2.8 percent off from the information-theoretic lower bound. The above modified algorithm leads to a slightly better worst-case upper bound on the sorting complexity, as will be shown in the next subsection.

The insert and delete operations on a fine-heap can be performed in a way similar to that for the traditional heap. The parallel complexity of fine-heap operations can easily be deduced from the corresponding time bounds for heaps. Moreover, when deleting the smallest element from a fine-heap, one does not need to compute the path of minimum children. Hence,

Observation 3.5. On the same model of parallel computation,
• a fine-heap can be built in parallel as fast as the parallel heap construction using the same number of processors;
• the parallel complexity of the insert and delete operations in a fine-heap of size n is the same as that of searching in a sorted list of length ⌈log(n + 1)⌉.

3.2. Sorting complexity of fine-heaps

Wegener [19] presented a variant of Floyd's heapsort algorithm, which sorts n elements in at most n log n + 1.1n comparisons in the worst case, using n extra bits and O(n log n) bit comparisons. Moreover, it only takes n log n + n comparisons when n = 2^h − 1. Wegener's heapsort algorithm can be viewed as follows:
• Create a fine-heap on n elements in 2n comparisons in the worst case, using n extra bits.
• Remove the smallest element from the fine-heap, and repeat this step until there is no element left in the fine-heap.
While a fine-heap of size n can be built in (23/12)n comparisons by Theorem 3.4 (saving (1/12)n comparisons compared to Wegener's heapsort algorithm), we can also save the amount of extra space consumed, using one of the following methods:
1. Implement a fine-heap of size n as a heap with each internal node having an extra bit. This extra bit is used for indicating the smaller child of the node.
2. Implement a fine-heap of size n as a heap, and associate a word of ⌊log n⌋ − ⌊log i⌋ bits with each internal node H[i]. This word will keep the address of the last node on the path of minimum children in the subheap rooted at H[i].
The first implementation needs ⌊n/2⌋ extra bits. However, during the repeated root removals, O(n log n) bit comparisons are required in order to follow the path of minimum children. On the other hand, the second implementation takes

    Σ_{i=1}^{(n−1)/2} (⌊log n⌋ − ⌊log i⌋) = n − log(n + 1)   (for n = 2^h − 1)

extra bits. With this implementation, no address computation is needed during the sorting phase of the heapsort algorithm. Therefore,

Theorem 3.6. With our implementations of a fine-heap, Wegener's heapsort algorithm sorts n elements in at most n log n + 1.00274n (and n log n + 0.91667n for n = 2^h − 1) comparisons in the worst case, using either
• n/2 extra bits and O(n log n) 2-bit variable comparisons; or
• n extra bits and no bit comparisons.
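The first implementation can be sketched as follows (our Python rendering; the construction phase below is deliberately naive, and only the sorting phase reflects the comparison counts discussed above). Following the smaller-child bits costs bit inspections only; the key comparisons per deletion are the binary search on the path plus one comparison per refreshed bit.

    # Sketch of implementation 1: a heap plus one smaller-child bit per
    # internal node. build_fine_heap is naive (Floyd's sift-downs, then
    # bits by direct comparison); Section 3.1 uses fewer comparisons.
    def build_fine_heap(a):
        H = [None] + list(a)
        n = len(a)
        for i in range(n // 2, 0, -1):          # standard sift-down
            j, x = i, H[i]
            while 2 * j <= n:
                c = 2 * j
                if c + 1 <= n and H[c + 1] < H[c]:
                    c += 1
                if H[c] >= x:
                    break
                H[j] = H[c]
                j = c
            H[j] = x
        bit = [0] * (n // 2 + 1)
        for i in range(1, n // 2 + 1):
            if 2 * i + 1 <= n and H[2 * i + 1] < H[2 * i]:
                bit[i] = 1
        return H, bit

    def fine_heapsort(a):
        H, bit = build_fine_heap(a)
        n, out = len(a), []
        for _ in range(len(a)):
            out.append(H[1])                    # remove the root
            x = H[n]
            n -= 1
            path, i = [1], 1                    # follow the bits: no key
            while 2 * i <= n:                   # comparisons on this walk
                i = 2 * i + (bit[i] if 2 * i + 1 <= n else 0)
                path.append(i)
            lo, hi = 1, len(path)               # binary search for x's spot
            while lo < hi:
                mid = (lo + hi) // 2
                if H[path[mid]] >= x:
                    hi = mid
                else:
                    lo = mid + 1
            for j in range(1, lo):              # shift the path up one level
                H[path[j - 1]] = H[path[j]]
            H[path[lo - 1]] = x
            for j in range(lo - 1):             # refresh bits along the path
                p = path[j]
                bit[p] = 1 if 2 * p + 1 <= n and H[2 * p + 1] < H[2 * p] else 0
        return out

    # fine_heapsort([5, 4, 3, 2, 1]) == [1, 2, 3, 4, 5]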

Recently, Dutton [7] presented an interesting algorithm, called weak-heapsort, that makes n log n + 0.086n comparisons in the worst case. Weak-heapsort uses n additional bits and requires special instructions (boolean functions). The weak-heap introduced in [7] is neither a heap nor a balanced tree, while the fine-heap is a heap. Unlike for the weak-heap, the sequential and parallel implementations of the priority queue operations on the fine-heap can easily be deduced from those for the traditional heap.

4. Conclusions

The goals of this paper are twofold: we provide parallel solutions to the problems of inserting an element into a heap and of deleting the smallest element from the heap, and we introduce a new heap-like data structure that can be used for developing fast sorting algorithms in the fashion of heapsort. The complexities of parallel heap operations and heapsort are improved using extra bits. Interestingly, it is the technique of building many isomorphic copies of fine-heaps simultaneously that leads to a better understanding of their construction complexities. Such a technique needs to be investigated further.

Acknowledgements

The authors would like to thank the referee for helpful comments.

References

[1] S.G. Akl, The Design and Analysis of Parallel Algorithms (Prentice-Hall, Englewood Cliffs, NJ, 1989).
[2] S. Carlsson, Average-case results on heapsort, BIT 27 (1987) 2-17.
[3] S. Carlsson, A variant of heapsort with almost optimal number of comparisons, Inform. Process. Lett. 24 (1987) 247-250.
[4] S. Carlsson and J. Chen, Parallel constructions of heaps and min-max heaps, Parallel Process. Lett. 2 (1992) 311-320.
[5] E.-E. Doberkat, Inserting a new element into a heap, BIT 21 (1981) 255-269.
[6] E.-E. Doberkat, Deleting the root of a heap, Acta Inform. 17 (1982) 245-265.
[7] R.D. Dutton, Weak-heap sort, BIT 33 (1993) 372-381.
[8] R.W. Floyd, Algorithm 245 - Treesort 3, Comm. ACM 7 (1964) 701.
[9] G.H. Gonnet and J.I. Munro, Heaps on heaps, SIAM J. Comput. 15 (1986) 964-971.
[10] S. Haldar, Heapsort with n log(n + 1) + n - 2 log(n + 1) - 2 key comparisons using ⌊n/2⌋ additional bits, Tech. Report RUU-CS-93-14, Department of Computer Science, Utrecht University, Utrecht, The Netherlands, 1993.
[11] D.E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching (Addison-Wesley, Reading, MA, 1973).
[12] C.P. Kruskal, Searching, merging, and sorting in parallel computation, IEEE Trans. Comput. C-32 (1983) 942-946.
[13] C.J.H. McDiarmid and B.A. Reed, Building heaps fast, J. Algorithms 10 (1989) 352-365.

[14] M.C. Pinotti and G. Pucci, Parallel algorithms for priority queue operations, in: Proc. 3rd Scandinavian Workshop on Algorithm Theory, Lecture Notes in Computer Science, Vol. 621 (Springer, Berlin, 1992) 130-139.
[15] N.S.V. Rao and W. Zhang, Building heaps in parallel, Inform. Process. Lett. 37 (1991) 355-358.
[16] R. Schaffer and R. Sedgewick, The analysis of heapsort, J. Algorithms 15 (1993) 76-100.
[17] A. Schönhage, M. Paterson and N. Pippenger, Finding the median, J. Comput. System Sci. 13 (1976) 184-199.
[18] M. Snir, On parallel searching, SIAM J. Comput. 15 (1985) 688-708.
[19] I. Wegener, The worst case complexity of McDiarmid and Reed's variant of BOTTOM-UP HEAPSORT is less than n log n + 1.1n, Inform. and Comput. 97 (1992) 86-96.
[20] I. Wegener, BOTTOM-UP-HEAPSORT, a new variant of HEAPSORT beating, on an average, QUICKSORT (if n is not very small), Theoret. Comput. Sci. 118 (1993) 81-98.
[21] J.W.J. Williams, Algorithm 232: Heapsort, Comm. ACM 7 (1964) 347-348.
[22] W. Zhang and R.E. Korf, Parallel heap operations on an EREW PRAM, J. Parallel Distrib. Comput. 20 (1994) 248-255.