Fast Priority Queues for Cached Memory

Peter Sanders

Max Planck Institute for Computer Science, Stuhlsatzenhausweg, Saarbrücken, Germany
Email: sanders@mpi-sb.mpg.de, WWW: http://www.mpi-sb.mpg.de/~sanders
Partially supported by the IST Programme of the EU under the ALCOM-FT contract.

The cache hierarchy prevalent in today's high-performance processors has to be taken into account in order to design algorithms that perform well in practice. This paper advocates the adaptation of external memory algorithms to this purpose. This idea and the practical issues involved are exemplified by engineering a fast priority queue suited to external memory and cached memory that is based on k-way merging. It improves previous external memory algorithms by constant factors crucial for transferring it to cached memory. Running in the cache hierarchy of a workstation, the algorithm is at least two times faster than an optimized implementation of binary heaps and 4-ary heaps for large inputs.

Categories and Subject Descriptors: Computer Systems Organization (Single Data Stream Architectures; Performance of Systems); Data (Data Structures; Data Storage Representations; Files); Theory (Nonnumerical Algorithms and Problems; Modes of Computation)

General Terms: Cache, Implementation

Additional Key Words and Phrases: Priority queue, external memory, secondary storage, cache efficiency, multi-way merging

1. INTRODUCTION

The mainstream model of computation used by algorithm designers in the last half century [Neumann] assumes a sequential processor with unit memory access cost. However, the mainstream computers sitting on our desktops have increasingly deviated from this model in the last decade [Hennessy and Patterson; Intel Corporation; Keller; MIPS Technologies Inc.; Sun Microsystems]. In particular, we usually distinguish at least four levels of memory hierarchy: A file of multiported registers can be accessed in parallel in every clock cycle. The first-level cache can still be accessed every one or two clock cycles, but it has only a few parallel ports and achieves its high throughput only by pipelining.



Therefore, the instruction-level parallelism of superscalar processors works best if most instructions use registers only. Currently, most first-level caches are quite small in order to keep them on chip and close to the execution unit. The second-level cache is considerably larger, but it also has an order of magnitude higher latency. If it is off-chip, its size is mainly constrained by the high cost of fast static RAM. The main memory is built of high-density, low-cost dynamic RAM. Including all overheads for cache miss, memory latency, and translation from logical over virtual to physical memory addresses, a main memory access can be two orders of magnitude slower than a first-level cache hit. Most machines have separate caches for data and code, so that we can disregard instruction reads as long as the inner loops of programs remain reasonably short.

Although the technological details are likely to change in the future, physical principles imply that fast memories must be small and are likely to be more expensive than slower memories, so that we will have to live with memory hierarchies when talking about sequential algorithms for large inputs.

The general approach of this paper is to model one cache level and the main memory by the single-disk, single-processor variant of the external memory model [Vitter and Shriver]. This model assumes an internal memory of size M that can access the external memory by transferring blocks of size B. The word pairs "cache line" and "memory block", "cache" and "internal memory", "main memory" and "external memory", and "I/O" and "cache fault" are used as synonyms if the context does not indicate otherwise. The only formal limitation compared to external memory is that caches have a fixed replacement strategy. For the kind of algorithms considered here, this mainly has the effect of reducing the usable cache size by a factor B^{1/a}, where B is the memory block size and a is the associativity of the cache [Sanders]. (In an a-way associative cache, every memory block is mapped to a fixed cache set, and every cache set can hold at most a blocks.) Henceforth, the term cached memory is used in order to make clear that we have a different model.

Despite the far-reaching analogy between external memory and cached memory, a number of additional differences should be noted. Since the speed gap between caches and main memory is usually smaller than the gap between main memory and disks, care must be taken to also analyze the work performed internally. The ratio between main memory size and first-level cache size can be much larger than that between disk space and internal memory. Therefore, we should prefer algorithms that use the cache as economically as possible. Finally, the remaining levels of the memory hierarchy are discussed informally in order to keep the analysis focused on the most important aspects.

Section 2 presents the basic algorithm for the sequence heaps data structure for priority queues. (A priority queue is a data structure for representing a totally ordered set that supports insertion of elements and deletion of the minimal element.) The algorithm is then analyzed in Section 3 using the external memory model. For some m in Θ(M), k in Θ(M/B), and R = ⌈log_k(I/m)⌉, it can perform I insertions and up to I deleteMins using I·(2R/B + O(1/k + (log k)/m)) I/Os and I·(log I + log R + log m + O(1)) key comparisons. Similar bounds hold for cached memory with a-way associative caches if k is reduced by a factor B^{1/a} [Sanders]. Section 4 considers refinements that take the other levels of the memory hierarchy into account, ensure almost optimal memory efficiency, and let the amortized work performed for an operation depend only on the current queue size rather than the total number of operations. Section 5 discusses an implementation of the algorithm on several architectures and compares the results to other priority queue data structures previously found to be efficient in practice, namely binary heaps and 4-ary heaps. An appendix gives further details on the implementation. Its goal is to make the results easier to reproduce in other contexts and to argue that all considered codes are comparably efficiently implemented.

1.1 Related Work

External memory algorithms are a well-established branch of algorithmics, e.g., [Vitter; Vengroff]. The external memory heaps of Teuhola and Wegner [Wegner and Teuhola] and the fishspear data structure [Fischer and Paterson] need a factor B fewer I/Os than traditional priority queues such as binary heaps. Buffer search trees [Arge] were the first external memory priority queues to reduce the number of I/Os by another factor of log(M/B), thus meeting the lower bound of Ω((I/B)·log_{M/B}(I/M)) I/Os for I operations (amortized). But using a full-fledged search tree for implementing priority queues may be considered wasteful. The heap-like data structures by Brodal and Katajainen, Crauser et al., and Fadel et al. [Brodal and Katajainen; Brengel et al.; Fadel et al.] are more directly geared to priority queues and achieve the same asymptotic bounds, one even per operation and not in an amortized sense [Brodal and Katajainen]. A sequence heap is very similar. In particular, it can be considered a simplification and reengineering of the improved array-heap [Brengel et al.]. However, sequence heaps are more I/O-efficient by a factor of about three or more than [Arge; Brodal and Katajainen; Brengel et al.; Fadel et al.] and need about a factor of two less memory than [Arge; Brengel et al.; Fadel et al.].

2. THE ALGORITHM

Merging k sorted sequences into one sorted sequence (k-way merging) is an I/O-efficient subroutine used for sorting, both for external memory [Knuth] and cached memory [LaMarca and Ladner]. The basic idea of sequence heaps is to adapt k-way merging to the related, but more dynamical, problem of priority queues.

Let us start with the simple case that at most k·m insertions take place, where m is the size of a buffer that fits into fast memory. Then the data structure could consist of k sorted sequences of length up to m. We can use k-way merging for deleting a batch of the m smallest elements from the k sorted sequences. The next m deletions can then be served from a buffer in constant time.

A separate insertion buffer with capacity m allows an arbitrary mix of insertions and deletions by holding the recently inserted elements. Deletions have to check whether the smallest element has to come from this insertion buffer. When this buffer is full, it is sorted, and the resulting sequence becomes one of the sequences for the k-way merge.
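As a point of reference, the following declaration-only C++ sketch mirrors the components described so far. The names and the constants k = 64 and m = 256 are made up for illustration; they are not the parameters or the interface of the actual implementation.

  #include <functional>
  #include <queue>
  #include <vector>

  // Illustrative layout for the simple case described above.
  // Element is any type ordered by operator> (ordering by key).
  template <class Element>
  struct SimpleKMergePQ {
      static const int k = 64;    // number of sorted sequences (example value)
      static const int m = 256;   // buffer size that fits into fast memory (example value)

      std::vector<std::vector<Element>> sequences;  // up to k sorted sequences of length <= m
      std::vector<Element> deletionBuffer;          // next batch of the m smallest elements, sorted
      // recently inserted elements; sorted into a new sequence once it reaches size m
      std::priority_queue<Element, std::vector<Element>, std::greater<Element>> insertHeap;
  };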


Fig. 1. Overview of the data structure for sequence heaps for R = 3 merge groups: each group G_i holds up to k sorted sequences of size up to m·k^{i-1} that are combined by a k-way merger into a group buffer of size m; an R-way merger combines the group buffers into the deletion buffer of size m'; newly inserted elements go into the insert heap of size m.

Up to this point, sequence heaps and several earlier data structures [Brodal and Katajainen; Brengel et al.; Fadel et al.] are almost identical. Most differences are related to the question how to handle more than k·m elements. We cannot increase m beyond M, since the insertion heap would not fit into fast memory. We cannot arbitrarily increase k, since eventually k-way merging would start to incur cache faults. Sequence heaps make room by merging all the k sequences, producing a larger sequence of size up to k·m [Brodal and Katajainen; Brengel et al.].

Now the question arises how to handle the larger sequences. Sequence heaps adopt the approach used for improved array-heaps [Brengel et al.] to employ R merge groups G_1, ..., G_R, where G_i holds up to k sequences of size up to m·k^{i-1}. When group G_i overflows, all its sequences are merged, and the resulting sequence is put into group G_{i+1}.

Each group is equipped with a group buffer of size m to allow batched deletion from the sequences. The smallest elements of these buffers are deleted in batches of size m' ≤ m. They are stored in the deletion buffer. Figure 1 summarizes the data structure. We now have enough information to understand how deletion works.

DeleteMin: The smallest elements of the deletion buffer and insertion buffer are compared, and the smaller one is deleted and returned. If this empties the deletion buffer, it is refilled from the group buffers using an R-way merge. Before the refill, group buffers with less than m' elements are refilled from the sequences in their group (if the group is nonempty).

DeleteMin works correctly provided the data structure fulfills the heap property, i.e., elements in the group buffers are not smaller than elements in the deletion buffer, and, in turn, elements in a sorted sequence are not smaller than the elements in the respective group buffer. Maintaining this invariant is the main difficulty for implementing insertion.

Insert: New elements are inserted into the insert heap. When its size reaches m, its elements are sorted (e.g., using merge sort or heap sort). The result is then merged with the concatenation of the deletion buffer and group buffer 1. The smallest resulting elements replace the deletion buffer and group buffer 1. The remaining elements form a new sequence of length at most m. The new sequence is finally inserted into a free slot of group G_1. If there is no free slot, initially G_1 is emptied by merging all its sequences into a single sequence of size at most k·m, which is then put into G_2. The same strategy is used recursively to free higher-level groups when necessary. When group G_R overflows, R is incremented and a new group is created. When a sequence is moved from one group to the other, the heap property may be violated. Therefore, when G_1 through G_i have been emptied, the group buffers 1 through i are merged and put into G_1.

The latter measure is one of the few differences to the improved array-heap [Brengel et al.], where the invariant is maintained by merging the new sequence and the group buffer, which almost doubles the total number of I/Os.
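To make the overflow cascade concrete, here is a simplified, self-contained C++ sketch of how a freshly sorted run could be inserted into group 1, emptying full groups on the way up. It ignores group buffers and the deletion buffer, uses repeated two-way merging in place of the k-way loser-tree merge, and all names are illustrative rather than taken from the actual implementation.

  #include <algorithm>
  #include <cstddef>
  #include <iterator>
  #include <utility>
  #include <vector>

  using Run = std::vector<int>;

  // Merge all runs of one group into a single sorted run
  // (stands in for the k-way loser-tree merge used by the real code).
  void mergeAll(std::vector<Run>& runs, Run& out) {
      out.clear();
      for (const Run& r : runs) {
          Run tmp;
          std::merge(out.begin(), out.end(), r.begin(), r.end(), std::back_inserter(tmp));
          out.swap(tmp);
      }
      runs.clear();
  }

  // Insert a freshly sorted run into group G_1 (groups[0]); if a group is full,
  // it is emptied into the next one, as described above.
  void insertRun(std::vector<std::vector<Run>>& groups, Run newRun, std::size_t k) {
      if (groups.empty()) groups.emplace_back();
      if (groups[0].size() == k) {
          // find the first group with a free slot, creating a new group if necessary
          std::size_t i = 0;
          while (i < groups.size() && groups[i].size() == k) ++i;
          if (i == groups.size()) groups.emplace_back();
          // empty G_{i-1}, ..., G_1 from the top down; each emptied group leaves
          // exactly one merged run in the group above it
          for (std::size_t j = i; j-- > 0; ) {
              Run merged;
              mergeAll(groups[j], merged);
              groups[j + 1].push_back(std::move(merged));
          }
      }
      groups[0].push_back(std::move(newRun));
  }

The cascade mirrors the rule above: the first group with a free slot bounds how far merged runs have to be pushed, and after the cascade the new sequence finds a free slot in group 1.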

For cached memory, where the speed of internal computation matters, it is also crucial how to implement the operation of k-way merging. The loser tree variant of the selection tree data structure described by Knuth [Knuth] is particularly well suited. When there are k' nonempty sequences, it consists of a binary tree with k' leaves. Leaf i stores a pointer to the current element of sequence i. The current keys of each sequence perform a tournament. The winner is passed up the tree, and the key of the loser and the index of its leaf are stored in the inner node. The overall winner is stored in an additional node above the root. Using this data structure, the smallest element can be identified and replaced by the next element in its sequence using ⌈log k'⌉ comparisons. This is less than the heap of size k assumed in [Brengel et al.; Fadel et al.] would require. The address calculations and memory references are similar to those needed for binary heaps, with the noteworthy difference that the memory locations accessed in the loser tree are predictable, which is not the case when deleting from a binary heap. The instruction scheduler of the compiler can place these accesses well before the data is needed, thus avoiding pipeline stalls, in particular if combined with loop unrolling.

3. ANALYSIS

The analysis first quantifies the number of I/Os in terms of B, the parameters k, m, and m', and an arbitrary sequence of insert and deleteMin operations with I insertions and up to I deleteMins. The analysis continues with the number of key comparisons as a measure of internal work and then discusses how k, m, and m' should be chosen for external memory and cached memory, respectively. Adaptations for memory efficiency and many accesses to relatively small queues are postponed to Section 4.


We need the following observation on the minimum intervals between tree emptying operations in several places.

Lemma 1. Group G_i can overflow at most every m·k^i insertions.

Proof. The only complication is the slot in group G_1 used for invalid group buffers. Nevertheless, when groups G_1 through G_i contain k sequences each, this can only happen if

    Σ_{j=1..i} m·k^j  ≥  m·k^i

insertions have taken place.

In particular, since there is room for m insertions in the insertion buffer, there is a very simple upper bound for the number of groups needed:

Corollary 1. R = ⌈log_k(I/m)⌉ groups suffice.
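To get a feel for the magnitudes involved, take the hypothetical parameter values m = 256 and k = 64 (illustrative numbers, not necessarily the ones used in the implementation). Even a billion insertions then need only a handful of groups:

    R = ⌈log_k(I/m)⌉ = ⌈log_64(10^9 / 256)⌉ = ⌈3.65⌉ = 4.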

The analysis counts the number of I/Os based on the assumption that the following information is kept in internal memory: the insert heap; the deletion buffer; a merge buffer of size m; group buffers 1 and R; the loser tree data for groups G_R and G_{R-1}, assuming that O(kB) units of memory suffice to store the blocks of the k sequences that are currently accessed and the loser tree information itself; a corresponding amount of space for one more loser tree shared by the remaining groups; and data for merging the R group buffers. (If we accept O(1/B) more I/Os per operation, it would suffice to swap between the insertion buffer plus a constant number of buffer blocks and one loser tree with k sequence buffers in internal memory.)

Theorem 1. If R = ⌈log_k(I/m)⌉, the buffers and loser trees listed above fit into internal memory (in particular, m + m' + O(kB + R) ≤ M), and k·B ≤ m - m', then

    I · ( 2R/B + O(1/k + (log k)/m) )

I/Os suffice to perform any sequence of I inserts and up to I deleteMins on a sequence heap.

Proof. Let us first consider the I/Os performed for an element moving on the following canonical data path: It is first inserted into the insert buffer and then written to a sequence in group G_1 in a batched manner, i.e., 1/B I/Os are charged to the insertion of this element. Then it is involved in emptying groups until it arrives in group G_R. For each emptying operation, the element is involved in one batched read and one batched write, i.e., it is charged with 2(R-1)/B I/Os for tree emptying operations. Eventually, the element is read into group buffer R, yielding a charge of 1/B I/Os. All in all, we get a charge of 2R/B I/Os for each insertion.

What remains to be shown is that the remaining I/Os only contribute lower-order terms or replace I/Os done on the canonical path. When an element travels through group G_{R-1} only, then 2/B I/Os must be charged for writing it to group buffer R-1 and later reading it when refilling the deletion buffer. However, the 2/B I/Os saved because the element is not moved to group G_R can pay for this charge. When an element travels through group buffer i < R-1, the additionally saved I/Os compared to the canonical path can also pay for the cost of swapping in the loser tree data for group G_i. The latter costs O(k) I/Os, which can be divided among the at least m - m' ≥ k·B elements removed in one batch.

When group buffer i becomes invalid, so that it must be merged with the other group buffers and put back into group G_1, this causes a direct cost of O(m/B) I/Os, and a cost of O(i·m/B) I/Os must be charged because these elements are thrown back O(i) steps on their path to the deletion buffer. Although an element may move through all the R groups, we do not need to charge O(R·m/B) I/Os for small i, since this only means that the shortcut originally taken by this element compared to the canonical path is missed. The remaining overhead can be charged to the m·k^{i-1} insertions that have filled group G_{i-1}. Summing over all groups, each insertion gets an additional charge of

    Σ_{i=2..R} O(i·m/B) / (m·k^{i-1}) = O(1/k).

Similarly, invalidations of group buffer 1 give a charge of O(1/k) per insertion. Inserting a new sequence into the loser tree data structure takes O(log k) I/Os. When done for tree 1, this can be amortized over m insertions; for tree i > 1, it can be amortized over m·k^{i-1} insertions by Lemma 1. For an element moving on the canonical path, we get an overall charge of

    O( (log k)/m + Σ_{i=2..R} (log k)/(m·k^{i-1}) ) = O( (log k)/m )

per insertion. Overall, we get a charge of 2R/B + O(1/k + (log k)/m) per insertion.

Next comes an estimate of the number of key comparisons performed. This is a good measure for the internal work, since in efficient implementations of priority queues for the comparison model, this number is close to the number of unpredictable branch instructions (whereas loop control branches are usually well predictable by the hardware or the compiler), and the number of key comparisons is also proportional to the number of memory accesses. These two types of operations often have the largest impact on the execution time, since they are the most severe limit to instruction parallelism in a superscalar processor. In order to avoid notational overhead by rounding, assume that k and m are powers of two and that I is divisible by m·k^{R-1}. A more general bound would only be larger by a small additive term.

Theorem 2. With the assumptions from Theorem 1, at most

    I · ( log I + ⌈log R⌉ + log m + m'/m + O((log k)/k) )

key comparisons are needed. For average-case inputs, "log m" can be replaced by O(1).

Proof. Insertion into the insertion buffer takes log m comparisons at worst and O(1) comparisons on the average. Every deleteMin operation requires a comparison of the minimum of the insertion buffer and the deletion buffer. The remaining comparisons are charged to insertions in a way analogous to the proof of Theorem 1. Sorting an m-element insertion buffer (e.g., using merge sort) takes m·log m comparisons, and merging the result with the deletion buffer and group buffer 1 takes m + m' comparisons. Inserting the sequence into a loser tree takes O(log k) comparisons. Emptying groups takes (R-1)·log k + O(R/k) comparisons per element. Elements removed from the insertion buffer take up to log m comparisons, but those need not be counted, since no further comparisons are needed for them. Similarly, refills of group buffers other than R have already been accounted for by the conservative estimate on the group emptying cost. Group G_R only has degree I/(m·k^{R-1}), so log I - (R-1)·log k - log m comparisons per element suffice. Using similar arguments as in the proof of Theorem 1, it can be shown that inserting sequences into the loser trees leads to a charge of O((log k)/m) comparisons per insertion, and invalidating group buffers costs O((log k)/k) comparisons per insertion. Summing all the charges made yields the bound to be proven.

For external memory, one would choose m = Θ(M) and k = Θ(M/B). For cached memory with an a-way associative cache, k should be a factor B^{1/a} smaller in order to limit the number of cache faults to a constant factor times the number of I/Os performed by the external memory algorithm [Sanders]. This requirement, together with the small size of many first-level caches and TLBs (translation look-aside buffers, which store the physical position of the most recently used virtual memory pages), explains why we may have to live with a quite small k. This observation is the main reason not to pursue the simple variant of the array heap described in [Brengel et al.] that needs only a single merge group for all sequences. This merge group would have to be about a factor R larger, however.

4. REFINEMENTS

Memory Management: A sequence heap can be implemented in a memory-efficient way by representing the sequences in the groups as singly linked lists of memory pages. Whenever a page runs empty, it is pushed on a stack of free pages. When a new page needs to be allocated, it is popped from the stack. If necessary, the stack can be maintained externally, except for a single buffer block. Using pages of size p, the external sequences of a sequence heap with R groups and N elements occupy at most N + k·p·R memory cells. Together with the measures described below for keeping the number of groups small, this becomes N + k·p·log_k(N/m). A page size of m is particularly easy to implement, since this is also the size of the group buffers and the insertion buffer. As long as N is much larger than k·m, this already guarantees asymptotically optimal memory efficiency, i.e., a memory requirement of N·(1 + o(1)).
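A minimal sketch of such a free-page pool is given below; the names, the fixed page size, and the ownership convention (pages in use are owned by the sequences that reference them) are assumptions for illustration, not the paper's actual memory manager.

  #include <cstddef>
  #include <vector>

  // Illustrative free-page pool for sequences stored as singly linked lists of pages.
  template <class Element, std::size_t PageSize>
  struct PagePool {
      struct Page {
          Element data[PageSize];
          Page*   next = nullptr;   // links the pages that form one sequence
      };

      std::vector<Page*> freePages;   // stack of pages that ran empty

      Page* allocate() {
          if (freePages.empty()) return new Page();   // grow the pool on demand
          Page* p = freePages.back();
          freePages.pop_back();
          return p;
      }
      void release(Page* p) { freePages.push_back(p); }   // page ran empty: push it on the stack

      ~PagePool() { for (Page* p : freePages) delete p; } // pages still in use are not reclaimed here
  };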

Many Operations on Small Queues: Let N_i denote the queue size before the i-th operation is executed. In other algorithms [Brodal and Katajainen; Brengel et al.; Fadel et al.], the number of I/Os is bounded by O((1/B)·Σ_{i≤I} log_k(N_i/m)). For some classes of inputs, Σ_{i≤I} log_k(N_i/m) can be considerably less than I·log_k(I/m). However, for most applications that require large queues at all, the difference should not be large enough to warrant significant constant factor overheads or algorithmic complications. Therefore, this paper gave a detailed analysis of the basic algorithm first and outlines an adaptation yielding the refined asymptotic bound here. Similar to [Brengel et al.], when a new sequence is to be inserted into group G_i and there is no free slot, first look for two sequences in G_i whose sizes sum to less than m·k^{i-1} elements. If found, these sequences are merged, yielding a free slot. The merging cost can be charged to the deleteMins that caused the sequences to get so small. Now G_i is only emptied when it contains at least m·k^i elements, and the I/Os involved can be charged to elements that have been inserted when G_i had at least size m·k^{i-1}. Similarly, a shrinking queue can be tidied up: When there are R groups and the total size of the queue falls below m·k^{R-1}, empty group G_R and insert the resulting sequence into group G_{R-1}; if there is no free slot in group G_{R-1}, merge any two of its sequences first.

Registers and Instruction Cache: In all realistic cases we have only very few groups. Therefore, instruction cache and register file are likely to be large enough to efficiently support a fast R-way merge routine for refilling the deletion buffer which keeps the current keys of each stream in registers. Refer to Appendix A for more details.

Second-Level Cache: So far, the analysis assumes only a single cache level. Still, if we assume this level to be the first-level cache, the second-level cache may have some influence. First, note that the group buffers and the loser trees with k sequence buffer blocks each are likely to fit in second-level cache. The second-level cache may also be large enough to accommodate all of group G_1, reducing the cost of 1/B I/Os per insert. We get a more interesting use for the second-level cache if we assume its bandwidth to be sufficiently high to be no bottleneck and then look at inputs where deletions from the insertion buffer are rare (e.g., sorting). Then we can choose m = O(M_2), where M_2 is the size of the second-level cache. Insertions have high locality if the about log m cache lines currently accessed by them fit into first-level cache. Furthermore, no operations on deletion buffers and group buffers use random access.

High-Bandwidth Disks: If the sequence heap data structure is viewed as a classical external memory algorithm, we would simply use the main memory size for M. In this case, large binary heaps would be used as insertion buffers. But the measurements in Section 5 indicate that large binary heaps might be too slow to match the bandwidth of fast parallel disk subsystems. In this case, it is better to modify a sequence heap optimized for cache and main memory by using specialized external memory implementations for the larger groups. This may involve buffering of disk blocks, explicit asynchronous I/O calls, and perhaps prefetching code and randomization for supporting parallel disks [Barve et al.]. Also, the number of I/Os may be reduced by using a larger k inside these external groups. If this degrades the performance of the loser tree data structure too much, we can insert another heap level, i.e., split the high-degree group into several low-degree groups connected together over sufficiently large intermediate group buffers and another merge data structure.

Deletions of non-minimal elements with known key value can be performed by maintaining a separate sequence heap of deleted elements. When, on a deleteMin, the smallest element of the main queue and of the delete-queue coincide, both are discarded. Hereby, insertions and deleteMins cost only one comparison more than before if we charge a delete for the costs of one insertion and two deleteMins (note that the latter are much cheaper than an insertion). Memory overhead can be kept in bounds by completely sorting both queues whenever the size of the queue of deleted elements exceeds some fraction of the size of the main queue. During this sorting operation, deleted keys are discarded. The resulting sorted sequence can be put into group G_R; all other sequences and the deletion heap are empty then.
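A compact way to picture this deferred-deletion scheme is the following wrapper sketch; std::priority_queue merely stands in for a sequence heap, and the class and member names are made up for illustration.

  #include <functional>
  #include <queue>
  #include <vector>

  // Sketch of deleting non-minimal elements via a second ("deleted") queue.
  template <class Key>
  class LazyDeletePQ {
      using MinHeap = std::priority_queue<Key, std::vector<Key>, std::greater<Key>>;
      MinHeap main_, deleted_;
  public:
      void insert(const Key& x) { main_.push(x); }
      void erase(const Key& x)  { deleted_.push(x); }  // pays for one insert and, later, two deleteMins
      Key deleteMin() {
          // matching minima of the main queue and the delete-queue cancel each other
          while (!deleted_.empty() && main_.top() == deleted_.top()) {
              main_.pop();
              deleted_.pop();
          }
          Key result = main_.top();
          main_.pop();
          return result;
      }
  };

The while loop realizes the rule above: elements whose deletion was announced earlier are discarded as soon as they reach the front of the main queue.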

5. IMPLEMENTATION AND EXPERIMENTS

Sequence heaps were implemented as a portable C++ template class for arbitrary key-value pairs. Currently, sequences are implemented as a single array. The internal performance mainly depends on an efficient implementation of the k-way merge using loser trees, special routines for 2-way, 3-way, and 4-way merge, and binary heaps for the insertion buffer. The most important optimizations turned out to be (roughly in this order): making life easy for the compiler, use of sentinels (i.e., dummy elements at the ends of sequences and heaps which save special case tests), and loop unrolling. The appendix discusses these issues in detail.

5.1 Choosing Competitors

When an author of a new code wants to demonstrate its usefulness experimentally, great care must be taken to choose a competing code that uses one of the best known algorithms and is at least equally well tuned. Implicit binary heaps and aligned 4-ary heaps were chosen. In a recent study [LaMarca and Ladner], these two array-based algorithms outperform pointer-based data structures such as the skew heap by more than a factor of two. The situation is complicated by the fact that the above pointer-based algorithms performed best in an older study [Jones]. All four algorithms execute a similar number of instructions, but pointer-based algorithms need up to three times more memory references. Possibly, the growing gap between memory speed and processor speed has tipped the balance in favor of array-based algorithms.

Not least because the same code is needed for the insertion buffer, binary heaps were coded perhaps even more carefully than the remaining components: binary heaps are the only part of the code for which care was taken that the assembler code contains no unnecessary memory accesses or redundant computations and has a reasonable instruction schedule. The code also uses the bottom-up heuristic [Wegener] for deleteMin: elements are first lifted up on a min-path from the root to a leaf, the leftmost element is then put into the freed leaf and is finally bubbled up. Note that binary heaps with this heuristic perform only log N + O(1) comparisons for an insertion plus a deleteMin on the average. This is close to the lower bound, so in flat memory it should be hard to find a comparison-based algorithm that performs significantly better for average case inputs. For small queues, the implementation of binary heaps is about a factor of two faster than a more straightforward nonrecursive adaptation of the textbook formulation used by Cormen et al. [Cormen et al.]. Appendix A.1 discusses the implementation of binary heaps in detail.

Aligned 4-ary heaps were developed at the end, using the same basic approach as for binary heaps; in particular, the bottom-up heuristic is also used. The main difference is that the data gets aligned to cache lines and that more complex index computations are needed. Appendix A.2 gives more details.

All source codes are available electronically under http://www.mpi-sb.mpg.de/~sanders/programs/.

5.2 Basic Experiments

Although the programs were developed and tuned on SPARC processors, sequence heaps show similar behavior on all recent architectures that were available for measurements. The same code was run on a SPARC, MIPS, Alpha, and Intel processor. It even turned out that a single parameter setting for m, m', and k works well for all these machines. (By tuning k and m, some further performance improvements are possible; e.g., for the Ultra and the Pentium II a different k is better.) Figures 2, 3, 4, and 5, respectively, show the results. All measurements use random 32-bit integer keys and 32-bit values. For a maximal heap size of N, the operation sequence (insert deleteMin insert)^N (deleteMin insert)^N (deleteMin)^N is executed. To normalize the amortized execution time per insert-deleteMin pair, T(N), it is divided by log N. Since all algorithms have a flat-memory execution time of c·log N + O(1) for some constant c, we would expect that the curves have a hyperbolic form and converge to a constant for large N. The values shown are averages over repeated runs (more repetitions for small inputs, to avoid problems due to limited clock resolution). The programs were run on an unloaded machine. Timing uses the operating system function for returning the CPU time consumed by the user process. To make sure that everything is in internal memory and to warm up the caches, one initial execution was performed without timing it.
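As an illustration of the measurement procedure, here is a minimal, self-contained sketch of the driver loop. std::priority_queue merely stands in for the priority queues under test, rand() for the key generator, and wall-clock timing replaces the CPU-time query used in the actual experiments; none of this is the original measurement code.

  #include <chrono>
  #include <cmath>
  #include <cstdlib>
  #include <functional>
  #include <iostream>
  #include <queue>
  #include <vector>

  // (insert deleteMin insert)^N (deleteMin insert)^N (deleteMin)^N with random keys
  int main() {
      for (std::size_t N = 256; N <= (1u << 23); N *= 2) {
          std::priority_queue<unsigned, std::vector<unsigned>, std::greater<unsigned>> q;
          auto start = std::chrono::steady_clock::now();
          for (std::size_t i = 0; i < N; ++i) { q.push(rand()); q.pop(); q.push(rand()); }
          for (std::size_t i = 0; i < N; ++i) { q.pop(); q.push(rand()); }
          for (std::size_t i = 0; i < N; ++i) { q.pop(); }
          auto stop = std::chrono::steady_clock::now();
          double ns = std::chrono::duration<double, std::nano>(stop - start).count();
          // amortized time per (insert, deleteMin) pair, normalized by log N
          std::cout << N << ' ' << ns / (3.0 * N) / std::log2(double(N)) << " ns\n";
      }
      return 0;
  }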

Fig. 2. Performance on a Sun Ultra desktop workstation with an UltraSparc-IIi processor, using Sun Workshop C++ with option -fast. The plot shows (T(deleteMin) + T(insert))/log N in nanoseconds versus N (from 256 up to 2^23) for bottom-up binary heaps, bottom-up aligned 4-ary heaps, and sequence heaps.


Fig. 3. The same measurement as in Fig. 2 on a MIPS processor (native CC compiler, high optimization level).

Fig. 4. The same measurement as in Fig. 2 on a DEC Alpha processor (compiler: g++, high optimization level).


Fig. 5. The same measurement as in Fig. 2 on an Intel Pentium II processor (compiler: g++, high optimization level).

Sequence heaps show the behavior one would expect for flat memory: cache faults are so rare that they do not influence the execution time very much. In Section 5.4 we will see that the decrease in the time per comparison is not quite so strong for other inputs.

On all machines, binary heaps are equally fast or slightly faster than sequence heaps for small inputs. While the heap still fits into second-level cache, the performance remains rather stable. For even larger queues, the performance degenerates. This is easy to explain: Whenever the queue size doubles, there is another layer of the heap that does not fit into cache, contributing a constant number of cache faults per deleteMin. For N = 2^23, sequence heaps are at least two times faster than binary heaps.

This difference is large enough to be of considerable practical interest. Furthermore, the careful implementation of the algorithms makes it unlikely that such a performance difference can be reversed by tuning or use of a different compiler. (For example, in older studies, heaps and loser trees may have looked bad compared to pointer-based data structures if the compiler generates integer division operations for halving an index or integer multiplications for array indexing.) Furthermore, the satisfactory performance of binary heaps on small inputs shows that for large inputs most of the time is spent on memory access overhead, and coding details have little influence on this.

5.3 4-ary Heaps

The measurements in Figures 2 through 5 largely agree with the most important observation of LaMarca and Ladner [LaMarca and Ladner]: Since the number of cache faults is about halved compared to binary heaps, 4-ary heaps have a more robust behavior for large queues. Still, sequence heaps are another constant factor faster for very large heaps, since they reduce the number of cache faults even more. However, the relative performance of binary heaps and 4-ary heaps seems to be a more complicated issue than in [LaMarca and Ladner]. Although this is not the main concern of this paper, a possible explanation should be offered.

Although the bottom-up heuristic improves both binary heaps and 4-ary heaps, binary heaps profit much more. Now binary heaps need less instead of more comparisons than 4-ary heaps. Concerning other instruction counts, 4-ary heaps only save on memory write instructions, while they need more complicated index computations.

Apparently, on the Alpha, which has the highest clock speed of the machines considered, the saved write instructions shorten the critical path, while the index computations can be done in parallel to slow memory accesses. On the other machines, the balance tips into the other direction. In particular, the Intel architecture lacks the necessary number of registers, so that the compiler has to generate a large number of additional memory accesses (spill code). Even for very large queues, this handicap is never made up for.

The most confusing effect is the jump in the execution time of 4-ary heaps on the SPARC around N = 2^20. Nothing like this is observed on the other machines, and this effect is hard to explain by cache effects alone, since the input size is already well beyond the size of the second-level cache but still below the main memory size. Possibly, there is a problem with virtual address translation, which also haunted the binary heaps in an earlier version.

5.4 Long Operation Sequences

Our worst-case analysis predicts a certain performance degradation if the number of insertions I is much larger than the size of the heap N. However, in Fig. 6 it can be seen that the contrary can be true for random, independent keys.

For a family of instances with I much larger than N, where the heap grows and shrinks very slowly, sequence heaps are about two times faster than for I close to N. The reason is that new elements tend to be smaller than most old elements (the smallest of the old elements have long been removed before). Therefore, many elements never make it into group G_1, let alone the groups for larger sequences. Since most work is performed while emptying groups, this work is saved. A similar locality effect has been observed and analyzed for the fishspear data structure [Fischer and Paterson].

Binary heaps or 4-ary heaps do not have this property; they even seem to get slightly slower. For s = 0, this locality effect cannot work, so these instances should come close to the worst case. To make clear that sequence heaps are nevertheless still much better than binary or 4-ary heaps, Figure 6 additionally contains their timing for s = 0.

Another way to generate inputs without locality effects is to use monotone inputs, i.e., keys that are larger than all keys of previously deleted elements. This can be achieved by adding a random offset to the most recently deleted element. This model has also been tried, for reasons of comparability with earlier studies [Jones; LaMarca and Ladner]. The results are not shown here, however, since they are so similar to the case s = 0 shown above.


Fig. 6. Performance of sequence heaps using the same setup as in Fig. 2, but using different operation sequences (insert (deleteMin insert)^s)^N (deleteMin (insert deleteMin)^s)^N for s in {0, 1, 4, 16}. For s = 0 we essentially get heapsort, with some overhead for maintaining useless group and deletion buffers. For s = 0, the timings for binary heaps and 4-ary heaps are also shown.

6. DISCUSSION

Sequence heaps may currently be the fastest available data structure for large comparison-based priority queues, both in cached and external memory. This is particularly true if the queue elements are small and if we do not need deletion of arbitrary elements or decreasing keys. Our implementation approach, in particular k-way merging with loser trees, can also be useful to speed up sorting algorithms in cached memory.

In the other cases, sequence heaps still look promising, but we need experiments encompassing a wider range of algorithms and usage patterns to decide which algorithm is best. For example, for monotonic queues with integer keys, radix heaps look promising, either in a simplified, average-case efficient form known as calendar queues [Brown], or by adapting external memory radix heaps [Brengel et al.] to cached memory in order to reduce cache faults.

It has been outlined how the algorithm can be adapted to multiple levels of memory and parallel disks. On a shared-memory multiprocessor, it should also be possible to achieve some moderate speedup by parallelization, e.g., one processor for the insertion and deletion buffer and one for each group; when refilling group buffers, all processors collectively work on emptying groups.

ACKNOWLEDGMENTS

I would like to thank Gerth Brodal, Andreas Crauser, Jyrki Katajainen, and Ulrich Meyer for valuable suggestions. Ulrich Rüde from the University of Augsburg provided access to an Alpha processor.


APPENDIX

A. IMPLEMENTATION DETAILS

The experiments in Section 5 evaluate the different algorithms based on speedup margins between two and four. Although this is quite useful for practical applications, such a margin is only of scientific interest if it is reproducible and understandable. It should be possible to understand how this margin can be achieved, and it must be possible to write similar codes even in other programming languages and using different machines and compilers. A doubtful researcher must have a chance to point her finger at some piece of code or of the experimental procedure and demonstrate that this particular detail causes the speedup margin, thereby falsifying claims made in the original publication. Therefore, the source code used for the experiments in this paper is made publicly available for academic use (http://www.mpi-sb.mpg.de/~sanders/programs/). In addition, the following sections explain implementation details of the inner loops governing performance in the implemented codes. Perhaps they even succeed in demonstrating that all codes are hardly improvable with current technology. Although such a high level of sophistication is no end in itself, in an experimental study it is nevertheless crucial for the validity of the experiments. Comparing several algorithms with different degrees of optimization would certainly be questionable, but it is very difficult to argue that two codes are equally suboptimal. One consequence of this observation is that ill-documented codes written by inexperienced programmers are of little experimental value if they do not show very clear margins. For example, the optimizations for binary heaps described below yield speedups compared to straightforward implementations by more than a factor of two. If we had compared the slow implementation with a tuned implementation of another algorithm, a speedup of two would have been meaningless with respect to the quality of the algorithms.

A.1 Binary Heaps

The average-case complexity of insertion is constant, so that deletion dominates the time required by binary heaps and by the insertion buffer of sequence heaps. Therefore, only the implementation of deleteMin is considered here.

A.1.1 Textbook-Style Implementation. Before looking at the binary heap code used in the experiments, consider an implementation which might be expected from a student looking at a textbook. The following code is a nonrecursive adaptation of the code proposed in the well-known textbook of Cormen et al. [Cormen et al.]:

template <class Key, class Value>
inline void Heap<Key, Value>::deleteMinBasic()
{
  int i, l, r, smallest;

  data[1] = data[size];
  data[size].key = getSupremum();  // "infinity"
  size--;
  i = 1;
  for (;;) {
    l = 2*i;  r = 2*i + 1;
    if (l <= size && data[l].key < data[i].key) {
      smallest = l;
    } else {
      smallest = i;
    }
    if (r <= size && data[r].key < data[smallest].key) {
      smallest = r;
    }
    if (smallest == i) break;
    Element temp = data[i];
    data[i] = data[smallest];
    data[smallest] = temp;
    i = smallest;
  }
}

For a queue of size N, the loop is traversed log N + O(1) times on the average. Each iteration requires five conditional branches, two of which are key comparisons whose outcome is difficult to predict. Even at the highest level of compiler optimization, the SUN compiler generated considerably more memory accesses in the inner loop than necessary if key and data fields consist of one memory word each.

A.1.2 Optimized Implementation. The optimized code starts as follows:

template <class Key, class Value, int capacity>
inline void BinaryHeap<Key, Value, capacity>::deleteMin()
{
  int hole = 1;
  int succ = 2;
  int sz   = size;

First note that the member variable size is copied into a local variable. This is done with every member variable that is used more than twice. Even using the highest level of compiler optimization, both the GNU g++ and Sun's proprietary compiler were not able to appropriately put member variables into registers. This measure alone reduces the number of memory references in the inner loop considerably. After initialization, the element at the root of the heap is moved to a leaf such that the heap property is maintained. Note that there is only a single comparison for breaking the loop, since there are always about log sz iterations. In addition, this comparison is easy to predict, so that it does not stall the pipelines of modern processors [Hennessy and Patterson].

No loop unrolling is done, since saving loop control overhead is only a small incentive on superscalar processors. Furthermore, there is no apparent opportunity for better instruction scheduling here, since the memory locations involved are highly data dependent.

  while (succ < sz) {
    Key key1 = data[succ].key;
    Key key2 = data[succ + 1].key;
    if (key2 < key1) {
      succ++;
      data[hole].key   = key2;
      data[hole].value = data[succ].value;
    } else {
      data[hole].key   = key1;
      data[hole].value = data[succ].value;
    }
    hole = succ;
    succ = 2*succ;
  }

Note that there is only a single additional conditional branch in the body of the loop. In codes without the bottom-up heuristic, there are at least two key comparisons with associated branches. On the average, code without the bottom-up heuristic performs only a constant number of iterations less.

In order to validate that good code can be generated for the binary heap algorithm, the assembler code was checked until none of the inner loops of binary heaps showed obvious shortcomings of the compiler (for the SUN and its native compiler). Consider the following annotated version of the assembler output produced by the compiler for the first inner loop of deleteMin:

L3:                 ! label starting the inner loop
  ld    ...         ! Key key1 = data[succ].key
  ld    ...         ! Key key2 = data[succ+1].key
  cmp   ...         ! compare key1 and key2
  ble,a ...         ! branch (annulled delay slot) if key1 <= key2
  sll   ...         ! compute address offset for data[hole].key
                    ! then-part of the inner loop
  sll   ...         ! compute address offset for data[hole].key
  add   ...         ! succ++
  st    ...         ! data[hole].key = key2
  sll   ...         ! address offset for data[succ].value
  ld    ...         ! register = data[succ].value
  sll   ...         ! succ *= 2
  st    ...         ! data[hole].value = register
  ba    ...         ! branch around the else-part (after next instruction)
  cmp   ...         ! succ < size (delay slot)
                    ! else-part of the inner loop
  st    ...         ! data[hole].key = key1
  or    ...
  ld    ...         ! register = data[succ].value
  sll   ...         ! succ *= 2
  st    ...         ! data[hole].value = register
  cmp   ...         ! succ < size
                    ! end of if-else
  bl    L3          ! next iteration if succ < size
  sll   ...         ! compute address offset for data[succ].key (delay slot)


In particular, note that only five memory accesses are performed for each iteration of this inner loop.

One possible optimization in the above loop concerns the index arithmetics. Using assembler, or a somewhat awkward formulation in C++, it would be possible to save two of the three left-shift (sll) instructions executed per loop iteration. Currently, these instructions are needed for converting an index into an address offset. (It is assumed here that the size of a heap element is a power of two; in a production implementation, this should perhaps be ensured by padding each element to the next power of two.) However, most of the time these instructions are not on the critical path for instruction scheduling, and on a superscalar machine, saving the shift instructions might not affect execution time at all. In any case, compared to the cost of the unavoidable, hard to predict branch, their cost is negligible on most modern architectures.

What remains to be done is to overwrite the deleted element with the rightmost element of the heap and to move this element up in the tree to restore the heap property:

  Key bubble = data[sz].key;
  int pred = hole / 2;
  while (data[pred].key > bubble) {  // the sentinel below the root stops this loop
    data[hole] = data[pred];
    hole = pred;
    pred /= 2;
  }

  // finally move data to hole
  data[hole].key   = bubble;
  data[hole].value = data[sz].value;

Note that, on the average, this loop only performs a constant number of iterations. Also, it causes no cache faults, since all the required cache lines were already moved into the cache when descending the tree.

At last, the now empty element at the right end of the tree is turned into a sentinel element by setting its key to infinity. Likewise, there is a sentinel element to the left of the root of the heap with key minus infinity. These measures make the fast and simple formulation of the inner loops possible that does not need to test for special cases at the border of the data structure. Compared to the formulation in [Cormen et al.], sentinels save two easy to predict branches for each loop iteration.

  data[size].key = getSupremum();  // mark as deleted
  size = sz - 1;

A remark on the bottom-up heuristic is in order: It is a bit slower than the ordinary code in the worst case, and its advantage over a careful formulation of the ordinary algorithm that has only one hard to predict branch in the inner loop is not very large. (If the left and right successors are compared first, the comparison with the sifted element can only yield "smaller" once per deleteMin.) This means that the relative performance of binary heaps and sequence heaps is not much affected by the presence or absence of the bottom-up heuristic.


A.2 Aligned 4-ary Heaps

Our implementation follows the description of LaMarca and Ladner [LaMarca and Ladner] as far as possible, except that the bottom-up heuristic is also used for 4-ary heaps. Here is the only loop traversed more than a constant number of times for an operation on the average:

  while (succ < sz) {
    minKey = data[succ].key;
    delta  = 0;
    otherKey = data[succ + 1].key;
    if (otherKey < minKey) { minKey = otherKey; delta = 1; }
    otherKey = data[succ + 2].key;
    if (otherKey < minKey) { minKey = otherKey; delta = 2; }
    otherKey = data[succ + 3].key;
    if (otherKey < minKey) { minKey = otherKey; delta = 3; }

The array data is aligned in such a way that the above three memory accesses concern a minimal number of cache lines.

    succ     += delta;
    layerPos += delta;

    // move min successor up
    data[hole].key   = minKey;
    data[hole].value = data[succ].value;

    // step to next layer
    hole = succ;
    succ = succ - layerPos + layerSize;  // beginning of next layer
    layerPos  = 4 * layerPos;
    succ     += layerPos;                // now the correct value
    layerSize = 4 * layerSize;
  }

The loop is executed a factor of two less frequently, but there are a factor of three more hard to predict branches per loop iteration. In addition, the index arithmetics is more complicated. (It could be somewhat simplified at the cost of leaving holes in the data structure; this could cause additional cache faults, however.)
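For completeness, here is a minimal sketch of how the data array could be placed on a cache-line boundary on a POSIX system, so that groups of four siblings share as few cache lines as possible. The helper name and the 64-byte line size are assumptions for illustration only, not part of the original code.

  #include <cstdlib>   // posix_memalign, free
  #include <new>       // std::bad_alloc

  // Hypothetical helper: allocate n elements starting at a cache-line boundary.
  template <class Element>
  Element* allocateAligned(std::size_t n) {
      void* p = 0;
      if (posix_memalign(&p, 64, n * sizeof(Element)) != 0) throw std::bad_alloc();
      return static_cast<Element*>(p);
  }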

A.3 Two-Way Merging

The function merge is needed when the insertion buffer is emptied into group one and for refilling the deletion buffer when R = 2. The function takes two sorted input sequences that are terminated by a sentinel element with infinite key. As in the case of binary heaps, these sentinels save special case treatments in the inner loop.

// merge sz elements from the two sentinel-terminated input
// sequences *f0 and *f1 to "to";
// advance *f0 and *f1 accordingly
// require: at least sz nonsentinel elements available in f0, f1
// require: to ...
template <class Key, class Value>
void merge(KNElement<Key, Value>** f0,
           KNElement<Key, Value>** f1,
           KNElement<Key, Value>*  to, int sz)

The parameters are double-indirect pointers, since they are also return values. Inside the function, one level of indirection is removed by keeping the current positions in a register:

  KNElement<Key, Value>* from0 = *f0;
  KNElement<Key, Value>* from1 = *f1;
  KNElement<Key, Value>* done  = to + sz;
  Key key0 = from0->key;
  Key key1 = from1->key;

As before, many other values are kept in registers for optimization purposes. In a specialized implementation for PC processors, some of these optimizations might turn out counterproductive, since not enough registers are available.

The main loop is very simple: only one hard to predict plus one easy to predict branch per loop iteration is required.

  while (to < done) {
    if (key0 <= key1) {
      to->key   = key0;
      to->value = from0->value;
      from0++;
      key0 = from0->key;
    } else {
      to->key   = key1;
      to->value = from1->value;
      from1++;
      key1 = from1->key;
    }
    to++;
  }

Again, loop unrolling did not appear promising. Finally, the positions of the source sequences are updated:

  *f0 = from0;
  *f1 = from1;


A.4 Three-Way and Four-Way Merging

Since the number of merge groups R is typically bounded by four, it makes sense to supply specialized routines that can be used for refilling the deletion buffer when R = 3 and R = 4. The instruction caches of today's processors are large enough to implicitly store the relative order of the heads of the sequences in the program counter and to keep the keys of the elements in registers. Here is an example for the case of 3-way merging:

template <class Key, class Value>
void merge3(KNElement<Key, Value>** f0,
            KNElement<Key, Value>** f1,
            KNElement<Key, Value>** f2,
            KNElement<Key, Value>*  to, int sz)
{
  KNElement<Key, Value>* from0 = *f0;
  KNElement<Key, Value>* from1 = *f1;
  KNElement<Key, Value>* from2 = *f2;
  KNElement<Key, Value>* done  = to + sz;

To set up the merging process, the first keys of each sequence are copied into registers and their relative order is determined:

  Key key0 = from0->key;
  Key key1 = from1->key;
  Key key2 = from2->key;
  if (key0 <= key1) {
    if (key1 <= key2)      goto s012;
    else if (key0 <= key2) goto s021;
    else                   goto s201;
  } else {
    if (key0 <= key2)      goto s102;
    else if (key1 <= key2) goto s120;
    else                   goto s210;
  }

For each of the six possible relative orders key_a <= key_b <= key_c, an analogous piece of code must be provided. The most reasonable way to do that is a macro, since an inline function would not have direct access to all required local variables. The token-pasting operator of ANSI C is useful here, since it makes it possible to build variable names and goto labels from macro parameters:

#define MergeCase(a, b, c)                              \
  s ## a ## b ## c :                                    \
    if (to == done) goto finish;                        \
    to->key   = key ## a;                               \
    to->value = from ## a -> value;                     \
    to++;                                               \
    from ## a ++;                                       \
    key ## a = from ## a -> key;                        \
    if (key ## a <= key ## b) goto s ## a ## b ## c;    \
    if (key ## a <= key ## c) goto s ## b ## a ## c;    \
    goto s ## b ## c ## a;

  // the order of the cases is chosen
  // in such a way that four of the trailing gotos
  // can be eliminated by the optimizer
  MergeCase(0, 1, 2)
  MergeCase(1, 2, 0)
  MergeCase(2, 0, 1)
  MergeCase(0, 2, 1)
  MergeCase(2, 1, 0)
  MergeCase(1, 0, 2)

finish:
  *f0 = from0;
  *f1 = from1;
  *f2 = from2;
}

A.5 k-Way Merging with Loser Trees

Loser trees are implemented as a C++ class that stores the loser tree itself and pointers to the sequences attached to its leaves:

// multimerge for arbitrary K
template <class Key, class Value>
void KNLooserTree<Key, Value>::
multiMergeK(Element* to, int l)
{
  Entry*   currentPos;
  Key      currentKey;
  int      currentIndex;   // leaf pointed to by current entry
  int      kReg = k;
  Element* done = to + l;

The smallest element of all sequences is stored in the root of the tree. Therefore, finding the globally smallest element is easy:

  int winnerIndex = entry[0].index;
  Key winnerKey   = entry[0].key;
  Element* winnerPos;
  Key sup = dummy.key;   // supremum
  while (to < done) {
    winnerPos = current[winnerIndex];

    // write result
    to->key   = winnerKey;
    to->value = winnerPos->value;

    // advance winner segment
    winnerPos++;
    current[winnerIndex] = winnerPos;
    winnerKey = winnerPos->key;

    // remove winner segment if empty now
    if (winnerKey == sup) {
      deallocateSegment(winnerIndex);
    }

After the smallest element has been consumed, the invariant of the data structure must be reestablished. This operation is the inner loop dominating the execution time of sequence heaps, since it is needed whenever merge groups are emptied or group buffers are refilled:

    // go up the entry-tree
    for (int i = (winnerIndex + kReg) >> 1;  i > 0;  i >>= 1) {
      currentPos = entry + i;
      currentKey = currentPos->key;
      if (currentKey < winnerKey) {
        currentIndex      = currentPos->index;
        currentPos->key   = winnerKey;
        currentPos->index = winnerIndex;
        winnerKey         = currentKey;
        winnerIndex       = currentIndex;
      }
    }
    to++;
  }
  entry[0].index = winnerIndex;
  entry[0].key   = winnerKey;
}

If one inspects the assembler code for this inner loop and for the inner loop of deleteMin for bottom-up binary heaps, it turns out that both contain about the same number of instructions, and both are executed about the same number of times on the average. Loser trees have two advantages, however. First, they do not need more time even in the worst case, so for worst-case inputs we might find an even larger advantage for sequence heaps. Second, the memory locations accessed are completely defined by the index of the leaf under consideration. Therefore, loop unrolling turned out to be profitable here, at least on the UltraSparc processor.

REFERENCES

Arge, L. The buffer tree: A new technique for optimal I/O-algorithms. In Workshop on Algorithms and Data Structures (WADS), LNCS, Springer, 1995.

Barve, R. D., Grove, E. F., and Vitter, J. S. Simple randomized mergesort on parallel disks. Parallel Computing 23, 1997.

Brengel, K., Crauser, A., Meyer, U., and Ferragina, P. An experimental study of priority queues in external memory. In 3rd International Workshop on Algorithm Engineering (WAE), 1999. Full paper in ACM Journal of Experimental Algorithmics.

Brodal, G. S. and Katajainen, J. Worst-case efficient external-memory priority queues. In Scandinavian Workshop on Algorithm Theory (SWAT), LNCS, Springer, Berlin, 1998.

Brown, R. Calendar queues: A fast O(1) priority queue implementation for the simulation event set problem. Communications of the ACM 31, 1988.

Cormen, T. H., Leiserson, C. E., and Rivest, R. L. Introduction to Algorithms. McGraw-Hill, 1990.

Fadel, R., Jakobsen, K. V., Katajainen, J., and Teuhola, J. External heaps combined with effective buffering. In Australasian Theory Symposium, Australian Computer Science Communications, Springer.

Fischer, M. J. and Paterson, M. S. Fishspear: A priority queue algorithm. Journal of the ACM 41, 1994.

Hennessy, J. L. and Patterson, D. A. Computer Architecture: A Quantitative Approach. Morgan Kaufmann.

Intel Corporation. Intel Architecture Software Developer's Manual, Volume I: Basic Architecture. http://www.intel.com.

Jones, D. An empirical comparison of priority-queue and event-set implementations. Communications of the ACM 29, 1986.

Keller, J. The 21264: A superscalar Alpha processor with out-of-order execution. In Microprocessor Forum, October 1996.

Knuth, D. E. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley.

LaMarca, A. and Ladner, R. E. The influence of caches on the performance of heaps. ACM Journal of Experimental Algorithmics 1, 1996.

LaMarca, A. and Ladner, R. E. The influence of caches on the performance of sorting. In ACM-SIAM Symposium on Discrete Algorithms, 1997.

MIPS Technologies, Inc. R10000 Microprocessor User's Manual. http://www.mips.com.

Neumann, J. v. First draft of a report on the EDVAC. Technical report, University of Pennsylvania, 1945.

Sanders, P. Accessing multiple sequences through set associative caches. In ICALP, LNCS, Springer, 1999.

Sun Microsystems. UltraSPARC-IIi User's Manual. Sun Microsystems.

Vengroff, D. E. TPIE User Manual and Reference. Duke University. http://www.cs.duke.edu/~dev/tpie_home_page.html.

Vitter, J. S. External memory algorithms. In European Symposium on Algorithms (ESA), LNCS, Springer, 1998.

Vitter, J. S. and Shriver, E. A. M. Algorithms for parallel memory I: Two-level memories. Algorithmica 12, 1994.

Wegener, I. BOTTOM-UP-HEAPSORT, a new variant of HEAPSORT beating, on an average, QUICKSORT (if n is not very small). Theoretical Computer Science, September 1993.

Wegner, L. M. and Teuhola, J. I. The external heapsort. IEEE Transactions on Software Engineering, July 1989.