Fast Priority Queues for Cached Memory

Peter Sanders

Max Planck Institute for Computer Science, Stuhlsatzenhausweg, Saarbrücken, Germany
Email: sanders@mpi-sb.mpg.de, WWW: http://www.mpi-sb.mpg.de/~sanders
Partially supported by the IST Programme of the EU under the ALCOM-FT contract.

The cache hierarchy prevalent in today's high-performance processors has to be taken into account in order to design algorithms that perform well in practice. This paper advocates the adaptation of external memory algorithms to this purpose. This idea and the practical issues involved are exemplified by engineering a fast priority queue suited to external memory and cached memory that is based on k-way merging. It improves previous external memory algorithms by constant factors crucial for transferring it to cached memory. Running in the cache hierarchy of a workstation, the algorithm is at least two times faster than an optimized implementation of binary heaps and 4-ary heaps for large inputs.

Categories and Subject Descriptors: Computer Systems Organization (Single Data Stream Architectures; Performance of Systems); Data (Data Structures; Data Storage Representations; Files); Theory (Nonnumerical Algorithms and Problems; Modes of Computation)

General Terms: Cache, Implementation

Additional Key Words and Phrases: Priority queue, external memory, secondary storage, cache efficiency, multi-way merging

1. INTRODUCTION

The mainstream model of computation used by algorithm designers in the last half century [Neumann] assumes a sequential processor with unit memory access cost. However, the mainstream computers sitting on our desktops have increasingly deviated from this model in the last decade [Hennessy and Patterson; Intel Corporation; Keller; MIPS Technologies Inc.; Sun Microsystems]. In particular, we usually distinguish at least four levels of memory hierarchy: A file of multiported registers can be accessed in parallel in every clock cycle. The first-level cache can still be accessed every one or two clock cycles, but it has only a few parallel ports and achieves its high throughput only by pipelining.



Therefore, the instruction-level parallelism of superscalar processors works best if most instructions use registers only. Currently, most first-level caches are quite small in order to keep them on chip and close to the execution unit. The second-level cache is considerably larger, but it also has an order of magnitude higher latency. If it is off-chip, its size is mainly constrained by the high cost of fast static RAM. The main memory is built of high-density, low-cost dynamic RAM. Including all overheads for cache miss, memory latency, and translation from logical over virtual to physical memory addresses, a main memory access can be two orders of magnitude slower than a first-level cache hit. Most machines have separate caches for data and code, so that we can disregard instruction reads as long as the inner loops of programs remain reasonably short.

Although the technological details are likely to change in the future, physical principles imply that fast memories must be small and are likely to be more expensive than slower memories, so that we will have to live with memory hierarchies when talking about sequential algorithms for large inputs.

The general approach of this paper is to model one cache level and the main memory by the single-disk, single-processor variant of the external memory model [Vitter and Shriver]. This model assumes an internal memory of size M that can access the external memory by transferring blocks of size B. The word pairs "cache line" and "memory block", "cache" and "internal memory", "main memory" and "external memory", and "I/O" and "cache fault" are used as synonyms if the context does not indicate otherwise. The only formal limitation compared to external memory is that caches have a fixed replacement strategy. For the kind of algorithms considered here, this mainly has the effect of reducing the usable cache size by a factor B^{1/a}, where B is the memory block size and a is the associativity of the cache [Sanders]. (In an a-way associative cache, every memory block is mapped to a fixed cache set, and every cache set can hold at most a blocks.) Henceforth, the term cached memory is used in order to make clear that we have a different model.

Despite the far-reaching analogy between external memory and cached memory, a number of additional differences should be noted. Since the speed gap between caches and main memory is usually smaller than the gap between main memory and disks, care must be taken to also analyze the work performed internally. The ratio between main memory size and first-level cache size can be much larger than that between disk space and internal memory. Therefore, we should prefer algorithms that use the cache as economically as possible. Finally, the remaining levels of the memory hierarchy are discussed informally in order to keep the analysis focused on the most important aspects.

Section 2 presents the basic algorithm for the sequence heaps data structure for priority queues. (A priority queue is a data structure for representing a totally ordered set that supports insertion of elements and deletion of the minimal element.) The algorithm is then analyzed in Section 3 using the external memory model. For some m in Θ(M), k in Θ(M/B), and R = ⌈log_k(I/m)⌉, it can perform I insertions and up to I deleteMins using I·(2R/B + O(1/k + (log k)/m)) I/Os and I·(log I + log R + log m + O(1)) key comparisons. Similar bounds hold for cached memory with a-way associative caches if k is reduced by a factor B^{1/a} [Sanders]. Section 4 considers refinements that take the other levels of the memory hierarchy into account, ensure almost optimal memory efficiency, and let the amortized work performed for an operation depend only on the current queue size rather than the total number of operations. Section 5 discusses an implementation of the algorithm on several architectures and compares the results to other priority queue data structures previously found to be efficient in practice, namely binary heaps and 4-ary heaps. An appendix gives further details on the implementation. Its goal is to make the results easier to reproduce in other contexts and to argue that all considered codes are comparably efficiently implemented.

1.1 Related Work

External memory algorithms are a well-established branch of algorithmics, e.g., [Vitter; Vengroff]. The external memory heaps of Teuhola and Wegner [Wegner and Teuhola] and the fishspear data structure [Fischer and Paterson] need a factor B fewer I/Os than traditional priority queues such as binary heaps. Buffer search trees [Arge] were the first external memory priority queues to reduce the number of I/Os by another factor of log(M/B), thus meeting the lower bound of Ω((I/B)·log_{M/B}(I/M)) I/Os for I operations (amortized). But using a full-fledged search tree for implementing priority queues may be considered wasteful. The heap-like data structures by Brodal and Katajainen, Crauser et al., and Fadel et al. [Brodal and Katajainen; Brengel et al.; Fadel et al.] are more directly geared to priority queues and achieve the same asymptotic bounds, one even per operation and not in an amortized sense [Brodal and Katajainen]. A sequence heap is very similar. In particular, it can be considered a simplification and reengineering of the improved array-heap [Brengel et al.]. However, sequence heaps are more I/O-efficient by a factor of about three or more than [Arge; Brodal and Katajainen; Brengel et al.; Fadel et al.] and need about a factor of two less memory than [Arge; Brengel et al.; Fadel et al.].

2. THE ALGORITHM

Merging k sorted sequences into one sorted sequence (k-way merging) is an I/O-efficient subroutine used for sorting, both for external memory [Knuth] and cached memory [LaMarca and Ladner]. The basic idea of sequence heaps is to adapt k-way merging to the related, but more dynamical, problem of priority queues.

Let us start with the simple case that at most k·m insertions take place, where m is the size of a buffer that fits into fast memory. Then the data structure could consist of k sorted sequences of length up to m. We can use k-way merging for deleting a batch of the m smallest elements from the k sorted sequences. The next m deletions can then be served from a buffer in constant time.

A separate insertion buffer with capacity m allows an arbitrary mix of insertions and deletions by holding the recently inserted elements. Deletions have to check whether the smallest element has to come from this insertion buffer. When this buffer is full, it is sorted, and the resulting sequence becomes one of the sequences for the k-way merge.
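As a point of reference, the following declaration-only C++ sketch mirrors the components described so far. The names and the constants k = 64 and m = 256 are made up for illustration; they are not the parameters or the interface of the actual implementation.

  #include <functional>
  #include <queue>
  #include <vector>

  // Illustrative layout for the simple case described above.
  // Element is any type ordered by operator> (ordering by key).
  template <class Element>
  struct SimpleKMergePQ {
      static const int k = 64;    // number of sorted sequences (example value)
      static const int m = 256;   // buffer size that fits into fast memory (example value)

      std::vector<std::vector<Element>> sequences;  // up to k sorted sequences of length <= m
      std::vector<Element> deletionBuffer;          // next batch of the m smallest elements, sorted
      // recently inserted elements; sorted into a new sequence once it reaches size m
      std::priority_queue<Element, std::vector<Element>, std::greater<Element>> insertHeap;
  };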


Fig. 1. Overview of the data structure for sequence heaps for R = 3 merge groups: each group G_i holds up to k sorted sequences of size up to m·k^{i-1} that are combined by a k-way merger into a group buffer of size m; an R-way merger combines the group buffers into the deletion buffer of size m'; newly inserted elements go into the insert heap of size m.

Up to this point, sequence heaps and several earlier data structures [Brodal and Katajainen; Brengel et al.; Fadel et al.] are almost identical. Most differences are related to the question how to handle more than k·m elements. We cannot increase m beyond M, since the insertion heap would not fit into fast memory. We cannot arbitrarily increase k, since eventually k-way merging would start to incur cache faults. Sequence heaps make room by merging all the k sequences, producing a larger sequence of size up to k·m [Brodal and Katajainen; Brengel et al.].

Now the question arises how to handle the larger sequences. Sequence heaps adopt the approach used for improved array-heaps [Brengel et al.] to employ R merge groups G_1, ..., G_R, where G_i holds up to k sequences of size up to m·k^{i-1}. When group G_i overflows, all its sequences are merged, and the resulting sequence is put into group G_{i+1}.

Each group is equipped with a group buffer of size m to allow batched deletion from the sequences. The smallest elements of these buffers are deleted in batches of size m' ≤ m. They are stored in the deletion buffer. Figure 1 summarizes the data structure. We now have enough information to understand how deletion works.

DeleteMin: The smallest elements of the deletion buffer and insertion buffer are compared, and the smaller one is deleted and returned. If this empties the deletion buffer, it is refilled from the group buffers using an R-way merge. Before the refill, group buffers with less than m' elements are refilled from the sequences in their group (if the group is nonempty).

DeleteMin works correctly provided the data structure fulfills the heap property, i.e., elements in the group buffers are not smaller than elements in the deletion buffer, and, in turn, elements in a sorted sequence are not smaller than the elements in the respective group buffer. Maintaining this invariant is the main difficulty for implementing insertion.

Insert: New elements are inserted into the insert heap. When its size reaches m, its elements are sorted (e.g., using merge sort or heap sort). The result is then merged with the concatenation of the deletion buffer and group buffer 1. The smallest resulting elements replace the deletion buffer and group buffer 1. The remaining elements form a new sequence of length at most m. The new sequence is finally inserted into a free slot of group G_1. If there is no free slot, initially G_1 is emptied by merging all its sequences into a single sequence of size at most k·m, which is then put into G_2. The same strategy is used recursively to free higher-level groups when necessary. When group G_R overflows, R is incremented and a new group is created. When a sequence is moved from one group to the other, the heap property may be violated. Therefore, when G_1 through G_i have been emptied, the group buffers 1 through i are merged and put into G_1.

The latter measure is one of the few differences to the improved array-heap [Brengel et al.], where the invariant is maintained by merging the new sequence and the group buffer, which almost doubles the total number of I/Os.
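To make the overflow cascade concrete, here is a simplified, self-contained C++ sketch of how a freshly sorted run could be inserted into group 1, emptying full groups on the way up. It ignores group buffers and the deletion buffer, uses repeated two-way merging in place of the k-way loser-tree merge, and all names are illustrative rather than taken from the actual implementation.

  #include <algorithm>
  #include <cstddef>
  #include <iterator>
  #include <utility>
  #include <vector>

  using Run = std::vector<int>;

  // Merge all runs of one group into a single sorted run
  // (stands in for the k-way loser-tree merge used by the real code).
  void mergeAll(std::vector<Run>& runs, Run& out) {
      out.clear();
      for (const Run& r : runs) {
          Run tmp;
          std::merge(out.begin(), out.end(), r.begin(), r.end(), std::back_inserter(tmp));
          out.swap(tmp);
      }
      runs.clear();
  }

  // Insert a freshly sorted run into group G_1 (groups[0]); if a group is full,
  // it is emptied into the next one, as described above.
  void insertRun(std::vector<std::vector<Run>>& groups, Run newRun, std::size_t k) {
      if (groups.empty()) groups.emplace_back();
      if (groups[0].size() == k) {
          // find the first group with a free slot, creating a new group if necessary
          std::size_t i = 0;
          while (i < groups.size() && groups[i].size() == k) ++i;
          if (i == groups.size()) groups.emplace_back();
          // empty G_{i-1}, ..., G_1 from the top down; each emptied group leaves
          // exactly one merged run in the group above it
          for (std::size_t j = i; j-- > 0; ) {
              Run merged;
              mergeAll(groups[j], merged);
              groups[j + 1].push_back(std::move(merged));
          }
      }
      groups[0].push_back(std::move(newRun));
  }

The cascade mirrors the rule above: the first group with a free slot bounds how far merged runs have to be pushed, and after the cascade the new sequence finds a free slot in group 1.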

For cached memory, where the speed of internal computation matters, it is also crucial how to implement the operation of k-way merging. The loser tree variant of the selection tree data structure described by Knuth [Knuth] is particularly well suited. When there are k' nonempty sequences, it consists of a binary tree with k' leaves. Leaf i stores a pointer to the current element of sequence i. The current keys of each sequence perform a tournament. The winner is passed up the tree, and the key of the loser and the index of its leaf are stored in the inner node. The overall winner is stored in an additional node above the root. Using this data structure, the smallest element can be identified and replaced by the next element in its sequence using ⌈log k'⌉ comparisons. This is less than the heap of size k assumed in [Brengel et al.; Fadel et al.] would require. The address calculations and memory references are similar to those needed for binary heaps, with the noteworthy difference that the memory locations accessed in the loser tree are predictable, which is not the case when deleting from a binary heap. The instruction scheduler of the compiler can place these accesses well before the data is needed, thus avoiding pipeline stalls, in particular if combined with loop unrolling.

3. ANALYSIS

The analysis first quantifies the number of I/Os in terms of B, the parameters k, m, and m', and an arbitrary sequence of insert and deleteMin operations with I insertions and up to I deleteMins. The analysis continues with the number of key comparisons as a measure of internal work and then discusses how k, m, and m' should be chosen for external memory and cached memory, respectively. Adaptations for memory efficiency and many accesses to relatively small queues are postponed to Section 4.


We need the following observation on the minimum intervals between tree emptying operations in several places.

Lemma 1. Group G_i can overflow at most every m·k^i insertions.

Proof. The only complication is the slot in group G_1 used for invalid group buffers. Nevertheless, when groups G_1 through G_i contain k sequences each, this can only happen if

    Σ_{j=1..i} m·k^j  ≥  m·k^i

insertions have taken place.

In particular, since there is room for m insertions in the insertion buffer, there is a very simple upper bound for the number of groups needed:

Corollary 1. R = ⌈log_k(I/m)⌉ groups suffice.
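To get a feel for the magnitudes involved, take the hypothetical parameter values m = 256 and k = 64 (illustrative numbers, not necessarily the ones used in the implementation). Even a billion insertions then need only a handful of groups:

    R = ⌈log_k(I/m)⌉ = ⌈log_64(10^9 / 256)⌉ = ⌈3.65⌉ = 4.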

The analysis counts the number of I/Os based on the assumption that the following information is kept in internal memory: the insert heap; the deletion buffer; a merge buffer of size m; group buffers 1 and R; the loser tree data for groups G_R and G_{R-1}, assuming that O(kB) units of memory suffice to store the blocks of the k sequences that are currently accessed and the loser tree information itself; a corresponding amount of space for one more loser tree shared by the remaining groups; and data for merging the R group buffers. (If we accept O(1/B) more I/Os per operation, it would suffice to swap between the insertion buffer plus a constant number of buffer blocks and one loser tree with k sequence buffers in internal memory.)

Theorem 1. If R = ⌈log_k(I/m)⌉, the buffers and loser trees listed above fit into internal memory (in particular, m + m' + O(kB + R) ≤ M), and k·B ≤ m - m', then

    I · ( 2R/B + O(1/k + (log k)/m) )

I/Os suffice to perform any sequence of I inserts and up to I deleteMins on a sequence heap.

Proof. Let us first consider the I/Os performed for an element moving on the following canonical data path: It is first inserted into the insert buffer and then written to a sequence in group G_1 in a batched manner, i.e., 1/B I/Os are charged to the insertion of this element. Then it is involved in emptying groups until it arrives in group G_R. For each emptying operation, the element is involved in one batched read and one batched write, i.e., it is charged with 2(R-1)/B I/Os for tree emptying operations. Eventually, the element is read into group buffer R, yielding a charge of 1/B I/Os. All in all, we get a charge of 2R/B I/Os for each insertion.

What remains to be shown is that the remaining I/Os only contribute lower-order terms or replace I/Os done on the canonical path. When an element travels through group G_{R-1} only, then 2/B I/Os must be charged for writing it to group buffer R-1 and later reading it when refilling the deletion buffer. However, the 2/B I/Os saved because the element is not moved to group G_R can pay for this charge. When an element travels through group buffer i < R-1, the additionally saved I/Os compared to the canonical path can also pay for the cost of swapping in the loser tree data for group G_i. The latter costs O(k) I/Os, which can be divided among the at least m - m' ≥ k·B elements removed in one batch.

When group buffer i becomes invalid, so that it must be merged with the other group buffers and put back into group G_1, this causes a direct cost of O(m/B) I/Os, and a cost of O(i·m/B) I/Os must be charged because these elements are thrown back O(i) steps on their path to the deletion buffer. Although an element may move through all the R groups, we do not need to charge O(R·m/B) I/Os for small i, since this only means that the shortcut originally taken by this element compared to the canonical path is missed. The remaining overhead can be charged to the m·k^{i-1} insertions that have filled group G_{i-1}. Summing over all groups, each insertion gets an additional charge of

    Σ_{i=2..R} O(i·m/B) / (m·k^{i-1}) = O(1/k).

Similarly, invalidations of group buffer 1 give a charge of O(1/k) per insertion. Inserting a new sequence into the loser tree data structure takes O(log k) I/Os. When done for tree 1, this can be amortized over m insertions; for tree i > 1, it can be amortized over m·k^{i-1} insertions by Lemma 1. For an element moving on the canonical path, we get an overall charge of

    O( (log k)/m + Σ_{i=2..R} (log k)/(m·k^{i-1}) ) = O( (log k)/m )

per insertion. Overall, we get a charge of 2R/B + O(1/k + (log k)/m) per insertion.

Next comes an estimate of the number of key comparisons performed. This is a good measure for the internal work, since in efficient implementations of priority queues for the comparison model, this number is close to the number of unpredictable branch instructions (whereas loop control branches are usually well predictable by the hardware or the compiler), and the number of key comparisons is also proportional to the number of memory accesses. These two types of operations often have the largest impact on the execution time, since they are the most severe limit to instruction parallelism in a superscalar processor. In order to avoid notational overhead by rounding, assume that k and m are powers of two and that I is divisible by m·k^{R-1}. A more general bound would only be larger by a small additive term.

Theorem 2. With the assumptions from Theorem 1, at most

    I · ( log I + ⌈log R⌉ + log m + m'/m + O((log k)/k) )

key comparisons are needed. For average-case inputs, "log m" can be replaced by O(1).

Proof. Insertion into the insertion buffer takes log m comparisons at worst and O(1) comparisons on the average. Every deleteMin operation requires a comparison of the minimum of the insertion buffer and the deletion buffer. The remaining comparisons are charged to insertions in a way analogous to the proof of Theorem 1. Sorting an m-element insertion buffer (e.g., using merge sort) takes m·log m comparisons, and merging the result with the deletion buffer and group buffer 1 takes m + m' comparisons. Inserting the sequence into a loser tree takes O(log k) comparisons. Emptying groups takes (R-1)·log k + O(R/k) comparisons per element. Elements removed from the insertion buffer take up to log m comparisons, but those need not be counted, since no further comparisons are needed for them. Similarly, refills of group buffers other than R have already been accounted for by the conservative estimate on the group emptying cost. Group G_R only has degree I/(m·k^{R-1}), so log I - (R-1)·log k - log m comparisons per element suffice. Using similar arguments as in the proof of Theorem 1, it can be shown that inserting sequences into the loser trees leads to a charge of O((log k)/m) comparisons per insertion, and invalidating group buffers costs O((log k)/k) comparisons per insertion. Summing all the charges made yields the bound to be proven.

For external memory, one would choose m = Θ(M) and k = Θ(M/B). For cached memory with an a-way associative cache, k should be a factor B^{1/a} smaller in order to limit the number of cache faults to a constant factor times the number of I/Os performed by the external memory algorithm [Sanders]. This requirement, together with the small size of many first-level caches and TLBs (translation look-aside buffers, which store the physical position of the most recently used virtual memory pages), explains why we may have to live with a quite small k. This observation is the main reason not to pursue the simple variant of the array heap described in [Brengel et al.] that needs only a single merge group for all sequences. This merge group would have to be about a factor R larger, however.

4. REFINEMENTS

Memory Management: A sequence heap can be implemented in a memory-efficient way by representing the sequences in the groups as singly linked lists of memory pages. Whenever a page runs empty, it is pushed on a stack of free pages. When a new page needs to be allocated, it is popped from the stack. If necessary, the stack can be maintained externally, except for a single buffer block. Using pages of size p, the external sequences of a sequence heap with R groups and N elements occupy at most N + k·p·R memory cells. Together with the measures described below for keeping the number of groups small, this becomes N + k·p·log_k(N/m). A page size of m is particularly easy to implement, since this is also the size of the group buffers and the insertion buffer. As long as N is much larger than k·m, this already guarantees asymptotically optimal memory efficiency, i.e., a memory requirement of N·(1 + o(1)).
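A minimal sketch of such a free-page pool is given below; the names, the fixed page size, and the ownership convention (pages in use are owned by the sequences that reference them) are assumptions for illustration, not the paper's actual memory manager.

  #include <cstddef>
  #include <vector>

  // Illustrative free-page pool for sequences stored as singly linked lists of pages.
  template <class Element, std::size_t PageSize>
  struct PagePool {
      struct Page {
          Element data[PageSize];
          Page*   next = nullptr;   // links the pages that form one sequence
      };

      std::vector<Page*> freePages;   // stack of pages that ran empty

      Page* allocate() {
          if (freePages.empty()) return new Page();   // grow the pool on demand
          Page* p = freePages.back();
          freePages.pop_back();
          return p;
      }
      void release(Page* p) { freePages.push_back(p); }   // page ran empty: push it on the stack

      ~PagePool() { for (Page* p : freePages) delete p; } // pages still in use are not reclaimed here
  };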

Many Operations on Small Queues: Let N_i denote the queue size before the i-th operation is executed. In other algorithms [Brodal and Katajainen; Brengel et al.; Fadel et al.], the number of I/Os is bounded by O((1/B)·Σ_{i≤I} log_k(N_i/m)). For some classes of inputs, Σ_{i≤I} log_k(N_i/m) can be considerably less than I·log_k(I/m). However, for most applications that require large queues at all, the difference should not be large enough to warrant significant constant factor overheads or algorithmic complications. Therefore, this paper gave a detailed analysis of the basic algorithm first and outlines an adaptation yielding the refined asymptotic bound here. Similar to [Brengel et al.], when a new sequence is to be inserted into group G_i and there is no free slot, first look for two sequences in G_i whose sizes sum to less than m·k^{i-1} elements. If found, these sequences are merged, yielding a free slot. The merging cost can be charged to the deleteMins that caused the sequences to get so small. Now G_i is only emptied when it contains at least m·k^i elements, and the I/Os involved can be charged to elements that have been inserted when G_i had at least size m·k^{i-1}. Similarly, a shrinking queue can be tidied up: When there are R groups and the total size of the queue falls below m·k^{R-1}, empty group G_R and insert the resulting sequence into group G_{R-1}; if there is no free slot in group G_{R-1}, merge any two of its sequences first.

Registers and Instruction Cache: In all realistic cases we have only very few groups. Therefore, instruction cache and register file are likely to be large enough to efficiently support a fast R-way merge routine for refilling the deletion buffer which keeps the current keys of each stream in registers. Refer to Appendix A for more details.

Second-Level Cache: So far, the analysis assumes only a single cache level. Still, if we assume this level to be the first-level cache, the second-level cache may have some influence. First, note that the group buffers and the loser trees with k sequence buffer blocks each are likely to fit in second-level cache. The second-level cache may also be large enough to accommodate all of group G_1, reducing the cost of 1/B I/Os per insert. We get a more interesting use for the second-level cache if we assume its bandwidth to be sufficiently high to be no bottleneck and then look at inputs where deletions from the insertion buffer are rare (e.g., sorting). Then we can choose m = O(M_2), where M_2 is the size of the second-level cache. Insertions have high locality if the about log m cache lines currently accessed by them fit into first-level cache. Furthermore, no operations on deletion buffers and group buffers use random access.

High-Bandwidth Disks: If the sequence heap data structure is viewed as a classical external memory algorithm, we would simply use the main memory size for M. In this case, large binary heaps would be used as insertion buffers. But the measurements in Section 5 indicate that large binary heaps might be too slow to match the bandwidth of fast parallel disk subsystems. In this case, it is better to modify a sequence heap optimized for cache and main memory by using specialized external memory implementations for the larger groups. This may involve buffering of disk blocks, explicit asynchronous I/O calls, and perhaps prefetching code and randomization for supporting parallel disks [Barve et al.]. Also, the number of I/Os may be reduced by using a larger k inside these external groups. If this degrades the performance of the loser tree data structure too much, we can insert another heap level, i.e., split the high-degree group into several low-degree groups connected together over sufficiently large intermediate group buffers and another merge data structure.

Deletions of non-minimal elements with known key value can be performed by maintaining a separate sequence heap of deleted elements. When, on a deleteMin, the smallest element of the main queue and of the delete-queue coincide, both are discarded. Hereby, insertions and deleteMins cost only one comparison more than before if we charge a delete for the costs of one insertion and two deleteMins (note that the latter are much cheaper than an insertion). Memory overhead can be kept in bounds by completely sorting both queues whenever the size of the queue of deleted elements exceeds some fraction of the size of the main queue. During this sorting operation, deleted keys are discarded. The resulting sorted sequence can be put into group G_R; all other sequences and the deletion heap are empty then.
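A compact way to picture this deferred-deletion scheme is the following wrapper sketch; std::priority_queue merely stands in for a sequence heap, and the class and member names are made up for illustration.

  #include <functional>
  #include <queue>
  #include <vector>

  // Sketch of deleting non-minimal elements via a second ("deleted") queue.
  template <class Key>
  class LazyDeletePQ {
      using MinHeap = std::priority_queue<Key, std::vector<Key>, std::greater<Key>>;
      MinHeap main_, deleted_;
  public:
      void insert(const Key& x) { main_.push(x); }
      void erase(const Key& x)  { deleted_.push(x); }  // pays for one insert and, later, two deleteMins
      Key deleteMin() {
          // matching minima of the main queue and the delete-queue cancel each other
          while (!deleted_.empty() && main_.top() == deleted_.top()) {
              main_.pop();
              deleted_.pop();
          }
          Key result = main_.top();
          main_.pop();
          return result;
      }
  };

The while loop realizes the rule above: elements whose deletion was announced earlier are discarded as soon as they reach the front of the main queue.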

5. IMPLEMENTATION AND EXPERIMENTS

Sequence heaps were implemented as a portable C++ template class for arbitrary key-value pairs. Currently, sequences are implemented as a single array. The internal performance mainly depends on an efficient implementation of the k-way merge using loser trees, special routines for 2-way, 3-way, and 4-way merge, and binary heaps for the insertion buffer. The most important optimizations turned out to be (roughly in this order): making life easy for the compiler, use of sentinels (i.e., dummy elements at the ends of sequences and heaps which save special case tests), and loop unrolling. The appendix discusses these issues in detail.

5.1 Choosing Competitors

When an author of a new code wants to demonstrate its usefulness experimentally, great care must be taken to choose a competing code that uses one of the best known algorithms and is at least equally well tuned. Implicit binary heaps and aligned 4-ary heaps were chosen. In a recent study [LaMarca and Ladner], these two array-based algorithms outperform pointer-based data structures such as the skew heap by more than a factor of two. The situation is complicated by the fact that the above pointer-based algorithms performed best in an older study [Jones]. All four algorithms execute a similar number of instructions, but pointer-based algorithms need up to three times more memory references. Possibly, the growing gap between memory speed and processor speed has tipped the balance in favor of array-based algorithms.

Not least because the same code is needed for the insertion buffer, binary heaps were coded perhaps even more carefully than the remaining components: binary heaps are the only part of the code for which care was taken that the assembler code contains no unnecessary memory accesses or redundant computations and has a reasonable instruction schedule. The code also uses the bottom-up heuristic [Wegener] for deleteMin: elements are first lifted up on a min-path from the root to a leaf, the leftmost element is then put into the freed leaf and is finally bubbled up. Note that binary heaps with this heuristic perform only log N + O(1) comparisons for an insertion plus a deleteMin on the average. This is close to the lower bound, so in flat memory it should be hard to find a comparison-based algorithm that performs significantly better for average case inputs. For small queues, the implementation of binary heaps is about a factor of two faster than a more straightforward nonrecursive adaptation of the textbook formulation used by Cormen et al. [Cormen et al.]. Appendix A.1 discusses the implementation of binary heaps in detail.

Aligned 4-ary heaps were developed at the end, using the same basic approach as for binary heaps; in particular, the bottom-up heuristic is also used. The main difference is that the data gets aligned to cache lines and that more complex index computations are needed. Appendix A.2 gives more details.

All source codes are available electronically under http://www.mpi-sb.mpg.de/~sanders/programs/.

5.2 Basic Experiments

Although the programs were developed and tuned on SPARC processors, sequence heaps show similar behavior on all recent architectures that were available for measurements. The same code was run on a SPARC, MIPS, Alpha, and Intel processor. It even turned out that a single parameter setting for m, m', and k works well for all these machines. (By tuning k and m, some further performance improvements are possible; e.g., for the Ultra and the Pentium II a different k is better.) Figures 2, 3, 4, and 5, respectively, show the results. All measurements use random 32-bit integer keys and 32-bit values. For a maximal heap size of N, the operation sequence (insert deleteMin insert)^N (deleteMin insert)^N (deleteMin)^N is executed. To normalize the amortized execution time per insert-deleteMin pair, T(N), it is divided by log N. Since all algorithms have a flat-memory execution time of c·log N + O(1) for some constant c, we would expect that the curves have a hyperbolic form and converge to a constant for large N. The values shown are averages over repeated runs (more repetitions for small inputs, to avoid problems due to limited clock resolution). The programs were run on an unloaded machine. Timing uses the operating system function for returning the CPU time consumed by the user process. To make sure that everything is in internal memory and to warm up the caches, one initial execution was performed without timing it.
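As an illustration of the measurement procedure, here is a minimal, self-contained sketch of the driver loop. std::priority_queue merely stands in for the priority queues under test, rand() for the key generator, and wall-clock timing replaces the CPU-time query used in the actual experiments; none of this is the original measurement code.

  #include <chrono>
  #include <cmath>
  #include <cstdlib>
  #include <functional>
  #include <iostream>
  #include <queue>
  #include <vector>

  // (insert deleteMin insert)^N (deleteMin insert)^N (deleteMin)^N with random keys
  int main() {
      for (std::size_t N = 256; N <= (1u << 23); N *= 2) {
          std::priority_queue<unsigned, std::vector<unsigned>, std::greater<unsigned>> q;
          auto start = std::chrono::steady_clock::now();
          for (std::size_t i = 0; i < N; ++i) { q.push(rand()); q.pop(); q.push(rand()); }
          for (std::size_t i = 0; i < N; ++i) { q.pop(); q.push(rand()); }
          for (std::size_t i = 0; i < N; ++i) { q.pop(); }
          auto stop = std::chrono::steady_clock::now();
          double ns = std::chrono::duration<double, std::nano>(stop - start).count();
          // amortized time per (insert, deleteMin) pair, normalized by log N
          std::cout << N << ' ' << ns / (3.0 * N) / std::log2(double(N)) << " ns\n";
      }
      return 0;
  }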

Fig. 2. Performance on a Sun Ultra desktop workstation with an UltraSparc-IIi processor, using Sun Workshop C++ with option -fast. The plot shows (T(deleteMin) + T(insert))/log N in nanoseconds versus N (from 256 up to 2^23) for bottom-up binary heaps, bottom-up aligned 4-ary heaps, and sequence heaps.


Fig. 3. The same measurement as in Fig. 2 on a MIPS processor (native CC compiler, high optimization level).

Fig. 4. The same measurement as in Fig. 2 on a DEC Alpha processor (compiler: g++, high optimization level).


Fig. 5. The same measurement as in Fig. 2 on an Intel Pentium II processor (compiler: g++, high optimization level).

Sequence heaps show the behavior one would expect for flat memory: cache faults are so rare that they do not influence the execution time very much. In Section 5.4 we will see that the decrease in the time per comparison is not quite so strong for other inputs.

On all machines, binary heaps are equally fast or slightly faster than sequence heaps for small inputs. While the heap still fits into second-level cache, the performance remains rather stable. For even larger queues, the performance degenerates. This is easy to explain: Whenever the queue size doubles, there is another layer of the heap that does not fit into cache, contributing a constant number of cache faults per deleteMin. For N = 2^23, sequence heaps are at least two times faster than binary heaps.

This difference is large enough to be of considerable practical interest. Furthermore, the careful implementation of the algorithms makes it unlikely that such a performance difference can be reversed by tuning or use of a different compiler. (For example, in older studies, heaps and loser trees may have looked bad compared to pointer-based data structures if the compiler generates integer division operations for halving an index or integer multiplications for array indexing.) Furthermore, the satisfactory performance of binary heaps on small inputs shows that for large inputs most of the time is spent on memory access overhead, and coding details have little influence on this.

5.3 4-ary Heaps

The measurements in Figures 2 through 5 largely agree with the most important observation of LaMarca and Ladner [LaMarca and Ladner]: Since the number of cache faults is about halved compared to binary heaps, 4-ary heaps have a more robust behavior for large queues. Still, sequence heaps are another constant factor faster for very large heaps, since they reduce the number of cache faults even more. However, the relative performance of binary heaps and 4-ary heaps seems to be a more complicated issue than in [LaMarca and Ladner]. Although this is not the main concern of this paper, a possible explanation should be offered.

Although the bottom-up heuristic improves both binary heaps and 4-ary heaps, binary heaps profit much more. Now binary heaps need less instead of more comparisons than 4-ary heaps. Concerning other instruction counts, 4-ary heaps only save on memory write instructions, while they need more complicated index computations.

Apparently, on the Alpha, which has the highest clock speed of the machines considered, the saved write instructions shorten the critical path, while the index computations can be done in parallel to slow memory accesses. On the other machines, the balance tips into the other direction. In particular, the Intel architecture lacks the necessary number of registers, so that the compiler has to generate a large number of additional memory accesses (spill code). Even for very large queues, this handicap is never made up for.

The most confusing effect is the jump in the execution time of 4-ary heaps on the SPARC around N = 2^20. Nothing like this is observed on the other machines, and this effect is hard to explain by cache effects alone, since the input size is already well beyond the size of the second-level cache but still below the main memory size. Possibly, there is a problem with virtual address translation, which also haunted the binary heaps in an earlier version.

5.4 Long Operation Sequences

Our worst-case analysis predicts a certain performance degradation if the number of insertions I is much larger than the size of the heap N. However, in Fig. 6 it can be seen that the contrary can be true for random, independent keys.

For a family of instances with I much larger than N, where the heap grows and shrinks very slowly, sequence heaps are about two times faster than for I close to N. The reason is that new elements tend to be smaller than most old elements (the smallest of the old elements have long been removed before). Therefore, many elements never make it into group G_1, let alone the groups for larger sequences. Since most work is performed while emptying groups, this work is saved. A similar locality effect has been observed and analyzed for the fishspear data structure [Fischer and Paterson].

Binary heaps or 4-ary heaps do not have this property; they even seem to get slightly slower. For s = 0, this locality effect cannot work, so these instances should come close to the worst case. To make clear that sequence heaps are nevertheless still much better than binary or 4-ary heaps, Figure 6 additionally contains their timing for s = 0.

Another way to generate inputs without locality effects is to use monotone inputs, i.e., keys that are larger than all keys of previously deleted elements. This can be achieved by adding a random offset to the most recently deleted element. This model has also been tried, for reasons of comparability with earlier studies [Jones; LaMarca and Ladner]. The results are not shown here, however, since they are so similar to the case s = 0 shown above.


Fig. 6. Performance of sequence heaps using the same setup as in Fig. 2, but using different operation sequences (insert (deleteMin insert)^s)^N (deleteMin (insert deleteMin)^s)^N for s in {0, 1, 4, 16}. For s = 0 we essentially get heapsort, with some overhead for maintaining useless group and deletion buffers. For s = 0, the timings for binary heaps and 4-ary heaps are also shown.

6. DISCUSSION

Sequence heaps may currently be the fastest available data structure for large comparison-based priority queues, both in cached and external memory. This is particularly true if the queue elements are small and if we do not need deletion of arbitrary elements or decreasing keys. Our implementation approach, in particular k-way merging with loser trees, can also be useful to speed up sorting algorithms in cached memory.

In the other cases, sequence heaps still look promising, but we need experiments encompassing a wider range of algorithms and usage patterns to decide which algorithm is best. For example, for monotonic queues with integer keys, radix heaps look promising, either in a simplified, average-case efficient form known as calendar queues [Brown], or by adapting external memory radix heaps [Brengel et al.] to cached memory in order to reduce cache faults.

It has been outlined how the algorithm can be adapted to multiple levels of memory and parallel disks. On a shared-memory multiprocessor, it should also be possible to achieve some moderate speedup by parallelization, e.g., one processor for the insertion and deletion buffer and one for each group; when refilling group buffers, all processors collectively work on emptying groups.

ACKNOWLEDGMENTS

I would like to thank Gerth Brodal, Andreas Crauser, Jyrki Katajainen, and Ulrich Meyer for valuable suggestions. Ulrich Rüde from the University of Augsburg provided access to an Alpha processor.


APPENDIX

A. IMPLEMENTATION DETAILS

The experiments in Section 5 evaluate the different algorithms based on speedup margins between two and four. Although this is quite useful for practical applications, such a margin is only of scientific interest if it is reproducible and understandable. It should be possible to understand how this margin can be achieved, and it must be possible to write similar codes even in other programming languages and using different machines and compilers. A doubtful researcher must have a chance to point her finger at some piece of code or of the experimental procedure and demonstrate that this particular detail causes the speedup margin, thereby falsifying claims made in the original publication. Therefore, the source code used for the experiments in this paper is made publicly available for academic use (http://www.mpi-sb.mpg.de/~sanders/programs/). In addition, the following sections explain implementation details of the inner loops governing performance in the implemented codes. Perhaps they even succeed in demonstrating that all codes are hardly improvable with current technology. Although such a high level of sophistication is no end in itself, in an experimental study it is nevertheless crucial for the validity of the experiments. Comparing several algorithms with different degrees of optimization would certainly be questionable, but it is very difficult to argue that two codes are equally suboptimal. One consequence of this observation is that ill-documented codes written by inexperienced programmers are of little experimental value if they do not show very clear margins. For example, the optimizations for binary heaps described below yield speedups compared to straightforward implementations by more than a factor of two. If we had compared the slow implementation with a tuned implementation of another algorithm, a speedup of two would have been meaningless with respect to the quality of the algorithms.

A.1 Binary Heaps

The average-case complexity of insertion is constant, so that deletion dominates the time required by binary heaps and by the insertion buffer of sequence heaps. Therefore, only the implementation of deleteMin is considered here.

A.1.1 Textbook-Style Implementation. Before looking at the binary heap code used in the experiments, consider an implementation which might be expected from a student looking at a textbook. The following code is a nonrecursive adaptation of the code proposed in the well-known textbook of Cormen et al. [Cormen et al.]:

template <class Key, class Value>
inline void Heap<Key, Value>::deleteMinBasic()
{
  int i, l, r, smallest;

  data[1] = data[size];
  data[size].key = getSupremum();  // "infinity"
  size--;
  i = 1;
  for (;;) {
    l = 2*i;  r = 2*i + 1;
    if (l <= size && data[l].key < data[i].key) {
      smallest = l;
    } else {
      smallest = i;
    }
    if (r <= size && data[r].key < data[smallest].key) {
      smallest = r;
    }
    if (smallest == i) break;
    Element temp = data[i];
    data[i] = data[smallest];
    data[smallest] = temp;
    i = smallest;
  }
}

For a queue of size N, the loop is traversed log N + O(1) times on the average. Each iteration requires five conditional branches, two of which are key comparisons whose outcome is difficult to predict. Even at the highest level of compiler optimization, the SUN compiler generated considerably more memory accesses in the inner loop than necessary if key and data fields consist of one memory word each.

A.1.2 Optimized Implementation. The optimized code starts as follows:

template <class Key, class Value, int capacity>
inline void BinaryHeap<Key, Value, capacity>::deleteMin()
{
  int hole = 1;
  int succ = 2;
  int sz   = size;

First note that the member variable size is copied into a local variable. This is done with every member variable that is used more than twice. Even using the highest level of compiler optimization, both the GNU g++ and Sun's proprietary compiler were not able to appropriately put member variables into registers. This measure alone reduces the number of memory references in the inner loop considerably. After initialization, the element at the root of the heap is moved to a leaf such that the heap property is maintained. Note that there is only a single comparison for breaking the loop, since there are always about log sz iterations. In addition, this comparison is easy to predict, so that it does not stall the pipelines of modern processors [Hennessy and Patterson].

No loop unrolling is done, since saving loop control overhead is only a small incentive on superscalar processors. Furthermore, there is no apparent opportunity for better instruction scheduling here, since the memory locations involved are highly data dependent.

  while (succ < sz) {
    Key key1 = data[succ].key;
    Key key2 = data[succ + 1].key;
    if (key2 < key1) {
      succ++;
      data[hole].key   = key2;
      data[hole].value = data[succ].value;
    } else {
      data[hole].key   = key1;
      data[hole].value = data[succ].value;
    }
    hole = succ;
    succ = 2*succ;
  }

Note that there is only a single additional conditional branch in the body of the loop. In codes without the bottom-up heuristic, there are at least two key comparisons with associated branches. On the average, code without the bottom-up heuristic performs only a constant number of iterations less.

In order to validate that good code can be generated for the binary heap algorithm, the assembler code was checked until none of the inner loops of binary heaps showed obvious shortcomings of the compiler (for the SUN and its native compiler). Consider the following annotated version of the assembler output produced by the compiler for the first inner loop of deleteMin:

L3:                 ! label starting the inner loop
  ld    ...         ! Key key1 = data[succ].key
  ld    ...         ! Key key2 = data[succ+1].key
  cmp   ...         ! compare key1 and key2
  ble,a ...         ! branch (annulled delay slot) if key1 <= key2
  sll   ...         ! compute address offset for data[hole].key
                    ! then-part of the inner loop
  sll   ...         ! compute address offset for data[hole].key
  add   ...         ! succ++
  st    ...         ! data[hole].key = key2
  sll   ...         ! address offset for data[succ].value
  ld    ...         ! register = data[succ].value
  sll   ...         ! succ *= 2
  st    ...         ! data[hole].value = register
  ba    ...         ! branch around the else-part (after next instruction)
  cmp   ...         ! succ < size (delay slot)
                    ! else-part of the inner loop
  st    ...         ! data[hole].key = key1
  or    ...
  ld    ...         ! register = data[succ].value
  sll   ...         ! succ *= 2
  st    ...         ! data[hole].value = register
  cmp   ...         ! succ < size
                    ! end of if-else
  bl    L3          ! next iteration if succ < size
  sll   ...         ! compute address offset for data[succ].key (delay slot)


In particular, note that only five memory accesses are performed for each iteration of this inner loop.

One possible optimization in the above loop concerns the index arithmetics. Using assembler, or a somewhat awkward formulation in C++, it would be possible to save two of the three left-shift (sll) instructions executed per loop iteration. Currently, these instructions are needed for converting an index into an address offset. (It is assumed here that the size of a heap element is a power of two; in a production implementation, this should perhaps be ensured by padding each element to the next power of two.) However, most of the time these instructions are not on the critical path for instruction scheduling, and on a superscalar machine, saving the shift instructions might not affect execution time at all. In any case, compared to the cost of the unavoidable, hard to predict branch, their cost is negligible on most modern architectures.

What remains to be done is to overwrite the deleted element with the rightmost element of the heap and to move this element up in the tree to restore the heap property:

  Key bubble = data[sz].key;
  int pred = hole / 2;
  while (data[pred].key > bubble) {  // the sentinel below the root stops this loop
    data[hole] = data[pred];
    hole = pred;
    pred /= 2;
  }

  // finally move data to hole
  data[hole].key   = bubble;
  data[hole].value = data[sz].value;

Note that, on the average, this loop only performs a constant number of iterations. Also, it causes no cache faults, since all the required cache lines were already moved into the cache when descending the tree.

At last, the now empty element at the right end of the tree is turned into a sentinel element by setting its key to infinity. Likewise, there is a sentinel element to the left of the root of the heap with key minus infinity. These measures make the fast and simple formulation of the inner loops possible that does not need to test for special cases at the border of the data structure. Compared to the formulation in [Cormen et al.], sentinels save two easy to predict branches for each loop iteration.

  data[size].key = getSupremum();  // mark as deleted
  size = sz - 1;

A remark on the bottom-up heuristic is in order: It is a bit slower than the ordinary code in the worst case, and its advantage over a careful formulation of the ordinary algorithm that has only one hard to predict branch in the inner loop is not very large. (If the left and right successors are compared first, the comparison with the sifted element can only yield "smaller" once per deleteMin.) This means that the relative performance of binary heaps and sequence heaps is not much affected by the presence or absence of the bottom-up heuristic.


A.2 Aligned 4-ary Heaps

Our implementation follows the description of LaMarca and Ladner [LaMarca and Ladner] as far as possible, except that the bottom-up heuristic is also used for 4-ary heaps. Here is the only loop traversed more than a constant number of times for an operation on the average:

  while (succ < sz) {
    minKey = data[succ].key;
    delta  = 0;
    otherKey = data[succ + 1].key;
    if (otherKey < minKey) { minKey = otherKey; delta = 1; }
    otherKey = data[succ + 2].key;
    if (otherKey < minKey) { minKey = otherKey; delta = 2; }
    otherKey = data[succ + 3].key;
    if (otherKey < minKey) { minKey = otherKey; delta = 3; }

The array data is aligned in such a way that the above three memory accesses concern a minimal number of cache lines.

    succ     += delta;
    layerPos += delta;

    // move min successor up
    data[hole].key   = minKey;
    data[hole].value = data[succ].value;

    // step to next layer
    hole = succ;
    succ = succ - layerPos + layerSize;  // beginning of next layer
    layerPos  = 4 * layerPos;
    succ     += layerPos;                // now the correct value
    layerSize = 4 * layerSize;
  }

The loop is executed a factor of two less frequently, but there are a factor of three more hard to predict branches per loop iteration. In addition, the index arithmetics is more complicated. (It could be somewhat simplified at the cost of leaving holes in the data structure; this could cause additional cache faults, however.)
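For completeness, here is a minimal sketch of how the data array could be placed on a cache-line boundary on a POSIX system, so that groups of four siblings share as few cache lines as possible. The helper name and the 64-byte line size are assumptions for illustration only, not part of the original code.

  #include <cstdlib>   // posix_memalign, free
  #include <new>       // std::bad_alloc

  // Hypothetical helper: allocate n elements starting at a cache-line boundary.
  template <class Element>
  Element* allocateAligned(std::size_t n) {
      void* p = 0;
      if (posix_memalign(&p, 64, n * sizeof(Element)) != 0) throw std::bad_alloc();
      return static_cast<Element*>(p);
  }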

A.3 Two-Way Merging

The function merge is needed when the insertion buffer is emptied into group one and for refilling the deletion buffer when R = 2. The function takes two sorted input sequences that are terminated by a sentinel element with infinite key. As in the case of binary heaps, these sentinels save special case treatments in the inner loop.

// merge sz elements from the two sentinel-terminated input
// sequences *f0 and *f1 to "to";
// advance *f0 and *f1 accordingly
// require: at least sz nonsentinel elements available in f0, f1
// require: to ...
template <class Key, class Value>
void merge(KNElement<Key, Value>** f0,
           KNElement<Key, Value>** f1,
           KNElement<Key, Value>*  to, int sz)

The parameters are double-indirect pointers, since they are also return values. Inside the function, one level of indirection is removed by keeping the current positions in a register:

  KNElement<Key, Value>* from0 = *f0;
  KNElement<Key, Value>* from1 = *f1;
  KNElement<Key, Value>* done  = to + sz;
  Key key0 = from0->key;
  Key key1 = from1->key;

As before, many other values are kept in registers for optimization purposes. In a specialized implementation for PC processors, some of these optimizations might turn out counterproductive, since not enough registers are available.

The main loop is very simple: only one hard to predict plus one easy to predict branch per loop iteration is required.

  while (to < done) {
    if (key0 <= key1) {
      to->key   = key0;
      to->value = from0->value;
      from0++;
      key0 = from0->key;
    } else {
      to->key   = key1;
      to->value = from1->value;
      from1++;
      key1 = from1->key;
    }
    to++;
  }

Again, loop unrolling did not appear promising. Finally, the positions of the source sequences are updated:

  *f0 = from0;
  *f1 = from1;


A.4 Three-Way and Four-Way Merging

Since the number of merge groups R is typically bounded by four, it makes sense to supply specialized routines that can be used for refilling the deletion buffer when R = 3 and R = 4. The instruction caches of today's processors are large enough to implicitly store the relative order of the heads of the sequences in the program counter and to keep the keys of the elements in registers. Here is an example for the case of 3-way merging:

template <class Key, class Value>
void merge3(KNElement<Key, Value>** f0,
            KNElement<Key, Value>** f1,
            KNElement<Key, Value>** f2,
            KNElement<Key, Value>*  to, int sz)
{
  KNElement<Key, Value>* from0 = *f0;
  KNElement<Key, Value>* from1 = *f1;
  KNElement<Key, Value>* from2 = *f2;
  KNElement<Key, Value>* done  = to + sz;

To set up the merging process, the first keys of each sequence are copied into registers and their relative order is determined:

  Key key0 = from0->key;
  Key key1 = from1->key;
  Key key2 = from2->key;
  if (key0 <= key1) {
    if (key1 <= key2)      goto s012;
    else if (key0 <= key2) goto s021;
    else                   goto s201;
  } else {
    if (key0 <= key2)      goto s102;
    else if (key1 <= key2) goto s120;
    else                   goto s210;
  }

For each of the six possible relative orders key_a <= key_b <= key_c, an analogous piece of code must be provided. The most reasonable way to do that is a macro, since an inline function would not have direct access to all required local variables. The token-pasting operator of ANSI C is useful here, since it makes it possible to build variable names and goto labels from macro parameters:

#define MergeCase(a, b, c)                              \
  s ## a ## b ## c :                                    \
    if (to == done) goto finish;                        \
    to->key   = key ## a;                               \
    to->value = from ## a -> value;                     \
    to++;                                               \
    from ## a ++;                                       \
    key ## a = from ## a -> key;                        \
    if (key ## a <= key ## b) goto s ## a ## b ## c;    \
    if (key ## a <= key ## c) goto s ## b ## a ## c;    \
    goto s ## b ## c ## a;

  // the order of the cases is chosen
  // in such a way that four of the trailing gotos
  // can be eliminated by the optimizer
  MergeCase(0, 1, 2)
  MergeCase(1, 2, 0)
  MergeCase(2, 0, 1)
  MergeCase(0, 2, 1)
  MergeCase(2, 1, 0)
  MergeCase(1, 0, 2)

finish:
  *f0 = from0;
  *f1 = from1;
  *f2 = from2;
}

A.5 k-Way Merging with Loser Trees

Loser trees are implemented as a C++ class that stores the loser tree itself and pointers to the sequences attached to its leaves:

// multimerge for arbitrary K
template <class Key, class Value>
void KNLooserTree<Key, Value>::
multiMergeK(Element* to, int l)
{
  Entry*   currentPos;
  Key      currentKey;
  int      currentIndex;   // leaf pointed to by current entry
  int      kReg = k;
  Element* done = to + l;

The smallest element of all sequences is stored in the root of the tree. Therefore, finding the globally smallest element is easy:

  int winnerIndex = entry[0].index;
  Key winnerKey   = entry[0].key;
  Element* winnerPos;
  Key sup = dummy.key;   // supremum
  while (to < done) {
    winnerPos = current[winnerIndex];

    // write result
    to->key   = winnerKey;
    to->value = winnerPos->value;

    // advance winner segment
    winnerPos++;
    current[winnerIndex] = winnerPos;
    winnerKey = winnerPos->key;

    // remove winner segment if empty now
    if (winnerKey == sup) {
      deallocateSegment(winnerIndex);
    }

After the smallest element has been consumed, the invariant of the data structure must be reestablished. This operation is the inner loop dominating the execution time of sequence heaps, since it is needed whenever merge groups are emptied or group buffers are refilled:

    // go up the entry-tree
    for (int i = (winnerIndex + kReg) >> 1;  i > 0;  i >>= 1) {
      currentPos = entry + i;
      currentKey = currentPos->key;
      if (currentKey < winnerKey) {
        currentIndex      = currentPos->index;
        currentPos->key   = winnerKey;
        currentPos->index = winnerIndex;
        winnerKey         = currentKey;
        winnerIndex       = currentIndex;
      }
    }
    to++;
  }
  entry[0].index = winnerIndex;
  entry[0].key   = winnerKey;
}

If one inspects the assembler code for this inner loop and for the inner loop of deleteMin for bottom-up binary heaps, it turns out that both contain about the same number of instructions, and both are executed about the same number of times on the average. Loser trees have two advantages, however. First, they do not need more time even in the worst case, so for worst-case inputs we might find an even larger advantage for sequence heaps. Second, the memory locations accessed are completely defined by the index of the leaf under consideration. Therefore, loop unrolling turned out to be profitable here, at least on the UltraSparc processor.

REFERENCES

Arge, L. The buffer tree: A new technique for optimal I/O-algorithms. In Workshop on Algorithms and Data Structures (WADS), LNCS, Springer, 1995.

Barve, R. D., Grove, E. F., and Vitter, J. S. Simple randomized mergesort on parallel disks. Parallel Computing 23, 1997.

Brengel, K., Crauser, A., Meyer, U., and Ferragina, P. An experimental study of priority queues in external memory. In 3rd International Workshop on Algorithm Engineering (WAE), 1999. Full paper in ACM Journal of Experimental Algorithmics.

Brodal, G. S. and Katajainen, J. Worst-case efficient external-memory priority queues. In Scandinavian Workshop on Algorithm Theory (SWAT), LNCS, Springer, Berlin, 1998.

Brown, R. Calendar queues: A fast O(1) priority queue implementation for the simulation event set problem. Communications of the ACM 31, 1988.

Cormen, T. H., Leiserson, C. E., and Rivest, R. L. Introduction to Algorithms. McGraw-Hill, 1990.

Fadel, R., Jakobsen, K. V., Katajainen, J., and Teuhola, J. External heaps combined with effective buffering. In Australasian Theory Symposium, Australian Computer Science Communications, Springer.

Fischer, M. J. and Paterson, M. S. Fishspear: A priority queue algorithm. Journal of the ACM 41, 1994.

Hennessy, J. L. and Patterson, D. A. Computer Architecture: A Quantitative Approach. Morgan Kaufmann.

Intel Corporation. Intel Architecture Software Developer's Manual, Volume I: Basic Architecture. http://www.intel.com.

Jones, D. An empirical comparison of priority-queue and event-set implementations. Communications of the ACM 29, 1986.

Keller, J. The 21264: A superscalar Alpha processor with out-of-order execution. In Microprocessor Forum, October 1996.

Knuth, D. E. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley.

LaMarca, A. and Ladner, R. E. The influence of caches on the performance of heaps. ACM Journal of Experimental Algorithmics 1, 1996.

LaMarca, A. and Ladner, R. E. The influence of caches on the performance of sorting. In ACM-SIAM Symposium on Discrete Algorithms, 1997.

MIPS Technologies, Inc. R10000 Microprocessor User's Manual. http://www.mips.com.

Neumann, J. v. First draft of a report on the EDVAC. Technical report, University of Pennsylvania, 1945.

Sanders, P. Accessing multiple sequences through set associative caches. In ICALP, LNCS, Springer, 1999.

Sun Microsystems. UltraSPARC-IIi User's Manual. Sun Microsystems.

Vengroff, D. E. TPIE User Manual and Reference. Duke University. http://www.cs.duke.edu/~dev/tpie_home_page.html.

Vitter, J. S. External memory algorithms. In European Symposium on Algorithms (ESA), LNCS, Springer, 1998.

Vitter, J. S. and Shriver, E. A. M. Algorithms for parallel memory I: Two-level memories. Algorithmica 12, 1994.

Wegener, I. BOTTOM-UP-HEAPSORT, a new variant of HEAPSORT beating, on an average, QUICKSORT (if n is not very small). Theoretical Computer Science, September 1993.

Wegner, L. M. and Teuhola, J. I. The external heapsort. IEEE Transactions on Software Engineering, July 1989.