Cache-Efficient String Sorting Using Copying

RANJAN SINHA and JUSTIN ZOBEL RMIT University and DAVID RING Palo Alto, CA

Burstsort is a cache-oriented sorting technique that uses a dynamic trie to efficiently divide large sets of string keys into related subsets small enough to sort in cache. In our original burstsort, string keys sharing a common prefix were managed via a bucket of pointers represented as a list or array; this approach was found to be up to twice as fast as the previous best string sorts, mostly because of a sharp reduction in out-of-cache references. In this paper, we introduce C-burstsort, which copies the unexamined tail of each key to the bucket and discards the original key to improve data locality. On both Intel and PowerPC architectures, and on a wide range of string types, we show that sorting is typically twice as fast as our original burstsort and four to five times faster than multikey quicksort and previous radixsorts. A variant that copies both suffixes and record pointers to buckets, CP-burstsort, uses more memory, but provides stable sorting. In current computers, where performance is limited by memory access latencies, these new algorithms can dramatically reduce the time needed for internal sorting of large numbers of strings.

Categories and Subject Descriptors: F.2.2 [Analysis of Algorithms]: Sorting; E.5 [Files]: Sorting; E.1 [Data Structures]: Trees; B.3.2 [Memory Structures]: Cache Memories; D.1.0 [Programming Techniques]: General

General Terms: Algorithms, Design, Experimentation, Performance

Additional Key Words and Phrases: Sorting, string management, cache, algorithms, experimental algorithms

1. INTRODUCTION

Sorting is a core problem in computer science that has been extensively researched over the last five decades. It underlies a vast range of computational activities, and sorting speed remains a bottleneck in many applications involving large volumes of data. The simplest sorts operate on fixed-size numerical values. String sorts are complicated by the possibility that keys may be long

Authors’ addresses: Ranjan Sinha and Justin Zobel, School of Computer Science and Information Technology, RMIT University, GPO Box 2476V, Melbourne 3001, Australia; email: {rsinha,jz}@cs.rmit.edu.au; David Ring, Palo Alto; email: [email protected]. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected]. © 2006 ACM 1084-6654/2006/0001-ART1.2 $5.00 DOI 10.1145/1187436.1187439 http://doi.acm.org/10.1145/1187436.1187439

ACM Journal of Experimental Algorithmics, Vol. 11, Article No. 1.2, 2006, Pages 1–32.

(making them expensive to move or compare) or variable in length (making them harder to rearrange). Finally, sorts that operate on multifield records raise the issue of stability, that is, conservation of previous order when keys are equal in the field currently being sorted. The favored sorting algorithms are those that use minimal memory, remain fast at large collection sizes, and efficiently handle a wide range of data properties and arrangements without pathological behavior.

Several algorithms for fast sorting of strings in internal memory have been described in recent years, including improvements to the standard quicksort [Bentley and McIlroy 1993], multikey quicksort [Bentley and Sedgewick 1997], and variants of radixsort [Andersson and Nilsson 1998]. The fastest such algorithm is our burstsort [Sinha and Zobel 2004a], in which strings are processed depth- rather than breadth-first and a complete trie of common prefixes is used to distribute strings among buckets. In all of these methods, the strings themselves are left in place and only pointers are moved into sort order. This pointerized approach to string sorting has become a standard solution to the length and length variability of strings, as expressed by Andersson and Nilsson [1998]:

to implement a string sorting algorithm efficiently we should not move the strings themselves but only pointers to them. In this way each string movement is guaranteed to take constant time.

In this argument, moving the variable-length strings in the course of sorting is considered inefficient. Sorting uses an array of pointers to the strings, and, at the completion of sorting, only the pointers need to be in sort order. However, evolutionary developments in computer architecture mean that the efficiency of sorting strings via pointers needs to be reexamined. Use of pointers is efficient if they are smaller than strings and if the costs of accessing and copying strings and pointers are uniform, but the speed of memory access has fallen behind processor speeds and most computers now have significant memory latency. Each string access can potentially result in a cache miss; that is, if a collection is too large to fit in cache memory, the cost of out-of-cache references may dominate other factors, making it worthwhile to move even long strings in order to increase their spatial locality. It is the cost of accessing strings via pointers that gives burstsort (which uses no fewer instructions than other methods) a speed advantage.

There has been much recent work on cache-efficient algorithms [LaMarca and Ladner 1999; Jimenez-Gonzalez et al. 2003; Rahman and Raman 2000, 2001; Wickremesinghe et al. 2002; Xiao et al. 2000], but little that applies to sorting of strings in internal memory. An exception is burstsort [Sinha and Zobel 2004a], which uses a dynamically allocated trie structure to divide large sets of strings into related subsets small enough to sort in cache. Like radix sorts and multikey quicksort, burstsort avoids whole string comparisons and processes keys one byte at a time. However, burstsort works on successive bytes of a single key until it is assigned to a bucket, while the other sorts must inspect the current byte of every key in a sublist before moving to the next byte. This difference greatly reduces the number of cache misses, since burstsort can consume multiple bytes of a key in one memory access, while the older sorts

may require an access for each byte. Furthermore, burstsort ensures that final sorting will be fast by expanding buckets into new subtries before they exceed the number of keys that can be sorted within cache. In tests using a variety of large string collections, burstsort was twice as fast as the best previous string sorts, due largely to a lower rate of cache misses. On average, burstsort generated between two and three cache misses per key, one on key insertion and another during bucket sorting; the remaining fractional miss was associated with bursting, which requires accessing additional bytes of the keys, and varied with patterns in the data.

There are two potential approaches to reducing cache misses due to string accesses. The first is to reduce the number of accesses to strings, which is the rationale for burstsort. The second approach is to store strings with similar prefixes together and then to sort these suffixes. By gathering similar strings into small sets, it is feasible to load the whole of each set into cache and thus sort efficiently. That is, if strings with similar prefixes are kept together, then accesses to them will be cache efficient. Such an approach was found to work for integers in our adaptation of burstsort, where the variable-length suffix of an integer that had not been consumed by the trie is copied into the bucket [Sinha 2004]. In this paper, we explore the application of this approach to strings.

The novel strategy we propose is to copy the strings into the buckets. That is, instead of inserting pointers to the keys into the burst trie buckets, we insert the unexamined suffixes of the keys. This eliminates any need to reaccess the original string on bucket sorting and, by copying only the tails of the keys into a contiguous array, we make the most efficient use of limited cache space.
Finally, with all prefix bytes reflected in the trie structure, and all suffix bytes transferred to buckets, we can free the original keys to save memory. Balancing these advantages, we can expect higher instruction counts reflecting the complications of moving variable-length suffixes, and increased copying costs when the suffixes are longer than pointers.

There are several challenges, principally concerning memory usage and instruction counts. How efficiently can strings be copied to the buckets? Are the advantages of spatial locality offset by additional instructions? Is the memory usage excessive? How will the buckets be managed, given that the strings are of variable length? What is the node structure? Are there any auxiliary structures? Are there any characteristics of the collection, such as the average string length, that alter best-case parameters?

We show that all of these challenges can be met. In our new string sorting algorithms based on copying, memory usage is reasonable—the additional space for the strings is partly offset by elimination of pointers—and cache efficiency is high. Using a range of data sets of up to tens of millions of strings, we demonstrate that copy-based C-burstsort is around twice as fast as previous pointer-based burstsorts and four to five times faster than multikey quicksort or forward radixsort, while generating less than two cache misses per key.

While C-burstsort provides fast sorting of string keys, it is not suitable for processing records, since it is not stable and discards the original keys. We, therefore, describe a record-oriented variant, CP-burstsort, which transfers both string tails and record pointers to buckets. CP-burstsort uses substantially

more memory than C-burstsort or our previous pointer-based burstsorts, but provides stable sorting at more than twice the speed of a stable radix sort or 70–90% of the speed of C-burstsort.

2. SORTING STRINGS IN INTERNAL MEMORY

Many of the earliest sorting algorithms, such as heapsort and quicksort, rearrange data in-place with small constant space overhead. Sorting with these algorithms involves swapping of fixed-size items, such as integers. Variable-length strings cannot be readily sorted in this way, so the most common approach to sorting of strings is to store them in a contiguous array of characters and index them with an array of fixed-size pointers. Sorting can then proceed by rearranging the pointers and leaving the strings themselves unaltered. To our knowledge, all string-sorting algorithms described so far use such an array of pointers, although it is not strictly necessary in algorithms such as mergesort.

Historically, as illustrated in the quotation given earlier, such use of pointers was potentially advantageous, as copying of strings could be expensive. However, pointer-based methods are not necessarily cache efficient, as each access can be a cache miss. Here we briefly review recent developments in caching and sorting; a more detailed overview of sorting is given by Sinha and Zobel [2004a].
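The pointerized arrangement described above can be sketched as follows. This is an illustrative Python fragment of our own, not code from the paper: byte offsets into a character buffer stand in for pointers, and only those offsets are rearranged while the strings stay in place.

```python
# Conventional pointerized string sort: the strings stay in place in a
# character buffer; only an array of "pointers" (offsets) is rearranged.

def pointer_sort(buffer: bytes):
    """Index the NUL-terminated strings in `buffer` and sort the pointers."""
    # Build the pointer array: one offset per string.
    pointers, start = [], 0
    for i, b in enumerate(buffer):
        if b == 0:                      # NUL terminator ends a string
            pointers.append(start)
            start = i + 1
    # Rearrange only the pointers; the strings themselves never move.
    pointers.sort(key=lambda p: buffer[p:buffer.index(0, p)])
    return pointers

buf = b"wane\0bat\0by\0bark\0"
order = pointer_sort(buf)
# Recover the strings in sorted order via the pointers.
print([buf[p:buf.index(0, p)].decode() for p in order])
```

Note that every key comparison during the sort dereferences a pointer into the buffer; on a real machine each such dereference is a potential cache miss, which is exactly the cost the copy-based methods of this paper aim to avoid.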

2.1 Caching

During the last couple of decades, there have been gradual, but significant, changes in computer architecture. First, the speed of processors has been increasing by about 50% per year, closely following Moore’s law. This is because the density of transistors is increasing by about 35% per year and the size of the die has been increasing by about 10 to 20% per year [Hennessy and Patterson 2002]. Second, the density of DRAM is increasing by 40–60% per year and, as a result, the size of main memory has been expanding rapidly. However, memory access speeds are increasing only slowly, by about 7% per year [Hennessy and Patterson 2002]. This latency gap is widening rapidly and this trend appears likely to continue.

The latency gap has affected the performance of computer programs, especially those that deal with large amounts of data [LaMarca and Ladner 1999; Hennessy and Patterson 2002]. However, many programs tend to reuse the data or instructions that have been recently accessed. As programs do not access all code or data uniformly, keeping frequently accessed items closer to the processor is an advantage. This principle of locality has led to solutions such as the use of small memories, or caches, between the processor and main memory. That caches have become a popular solution can be gauged from the fact that almost all processors come with at least one cache and most have two. A common approach is to have a large off-chip L2 cache and a small on-chip L1 cache. Current processors have caches ranging from 64 or 256 KB on an Intel Celeron to 8 MB on a Sun SPARC.

Another important cache in modern processors is the translation lookaside buffer, or TLB, which is used to translate between virtual and physical addresses. Recently accessed entries in the page table are cached in the TLB to

save expensive accesses to main memory. The TLB performance of an algorithm can be enhanced by improving the page locality of both the data structure and the accesses to the data. To be efficient, an algorithm needs to maximize the impact of both types of cache [Rahman and Raman 2001].

2.2 Quicksorts

Quicksort was introduced by Hoare [1961, 1962]. The collection to be sorted is recursively divided into two partitions based on a pivot. Each partition is then recursively divided, until the partitions are small enough to be efficiently sorted by simpler methods, such as insertion sort. In later passes, strings having similar prefixes are in the same partition, requiring unnecessary comparison of prefixes that are, by construction, known to be identical. The first pass is cache efficient, as it involves sequential passes over the array of pointers as well as the collection of strings, but successive passes may not be cache efficient, as the string accesses are effectively random. The number of passes in quicksort is proportional to log N and thus the number of cache misses per string is expected to increase logarithmically.

Several optimizations, such as three-way partitioning and adaptive sampling, have been introduced in the Bentley and McIlroy [1993] variant of quicksort. This variant has been used in most libraries since the early 1990s.

The concept of multiple-word sorting was described in the early 1960s; a practical algorithm, multikey quicksort, was introduced by Bentley and Sedgewick [1997]. This algorithm can be regarded as a hybrid of radixsort and three-way quicksort. The keys are divided into three partitions (p<, p=, and p>) based upon the dth character of the strings. The next character (at position d + 1) is compared for strings in the p= partition. The strings in the other two partitions are further divided according to their dth character. The character-wise approach reduces comparison costs, as redundant comparison of prefixes is reduced, but the disadvantages inherent in quicksort are still present because of the poor spatial locality of the string accesses. The cache efficiency is similar to that of quicksort and the number of cache misses per key is expected to increase logarithmically.
We have used the implementation by Bentley and Sedgewick [1997], designated as multikey quicksort.
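The three-way partitioning can be illustrated with the following sketch. This is a deliberately simplified Python rendering of the idea, not the in-place Bentley and Sedgewick implementation used in the experiments; end-of-string is modeled as a sentinel value that sorts before every real byte.

```python
# A minimal sketch of multikey quicksort (Bentley & Sedgewick 1997):
# three-way partition on the d-th byte, recursing on byte d+1 only for
# the "equal" partition. Illustrative, not the authors' implementation.

def char_at(s: bytes, d: int) -> int:
    return s[d] if d < len(s) else -1   # -1 marks end of string

def multikey_quicksort(keys, d=0):
    if len(keys) <= 1:
        return keys
    pivot = char_at(keys[len(keys) // 2], d)
    lt = [k for k in keys if char_at(k, d) < pivot]
    eq = [k for k in keys if char_at(k, d) == pivot]
    gt = [k for k in keys if char_at(k, d) > pivot]
    # lt/gt are re-partitioned on the same byte d; eq advances to byte d+1
    # (unless the pivot byte is end-of-string, in which case eq is done).
    middle = eq if pivot < 0 else multikey_quicksort(eq, d + 1)
    return multikey_quicksort(lt, d) + middle + multikey_quicksort(gt, d)

print(multikey_quicksort([b"byte", b"bat", b"by", b"bark", b"way"]))
```

The sketch makes the locality problem visible: `char_at` touches every key in a partition once per distinguishing byte, so each key is revisited at each depth.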

2.3 Radixsorts

For fixed-length keys, radixsorts can proceed from the least-significant (LSD) or most-significant (MSD) end of the key. However, LSD radixsorts are impractical for variable-length strings.

The in-principle cost of MSD radixsorts approaches the theoretical minimum, reading only the distinguishing prefix and reading each character in the prefix only once. In the worst case, however, each string may need to be reaccessed for each character, so the number of cache misses is then proportional to the size of the distinguishing prefix. Moreover, it is relatively easy to produce bad cases, such as a collection that is larger than cache and contains identical strings. The number of passes is directly proportional to the length of the distinguishing prefix of each string.
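The MSD strategy can be sketched as follows; this is a simplified Python rendering of ours for illustration, not any of the radixsort implementations cited in this section.

```python
# A sketch of MSD radixsort on variable-length byte strings: distribute
# keys into buckets by the d-th byte; keys exhausted at depth d (all
# identical within a bucket) are output first.

def msd_radixsort(keys, d=0):
    if len(keys) <= 1:
        return keys
    buckets = {}            # byte value -> list of keys
    done = []               # keys with no byte at position d: already placed
    for k in keys:
        if d < len(k):
            buckets.setdefault(k[d], []).append(k)
        else:
            done.append(k)
    out = done
    for b in sorted(buckets):           # visit buckets in byte order
        out += msd_radixsort(buckets[b], d + 1)
    return out

print(msd_radixsort([b"byte", b"by", b"bat", b"wane", b"bark"]))
```

The worst case described above is visible here: a bucket of identical long strings is re-partitioned once per character, touching every key at every depth.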


The number of cache misses can be reduced by decreasing the number of passes, which can be achieved by increasing the alphabet size, as in adaptive radixsort [Nilsson 1996], where the size of the alphabet is a function of the number of elements that remain to be sorted in a bucket. In this approach, a linked-list node is created to address each string; these nodes are shifted between buckets as the sorting progresses. With both an array of pointers to strings and a set of linked-list nodes, there are considerable space overheads. The alphabet size varies from 8 to 16 bits. Several optimizations have been used that reduce the number of node elements that need to be inspected for the larger alphabets.

Another variant is the radixsort described by McIlroy et al. [1993], an in-place array-based method. We have used the original implementation of McIlroy et al. [1993], designated as MBM radixsort.

2.4 Burstsort

Burstsort is a cache-friendly algorithm that sorts large collections of string keys up to twice as fast as the previous best algorithms [Sinha and Zobel 2004a]. It is based on the burst trie [Heinz et al. 2002], a variant of trie in which sufficiently small subtrees are represented as buckets. As subtrees tend to be sparse, the use of buckets makes a burst trie much more space efficient than a conventional trie and leads to more efficient use of cache.

The principle of burstsort is that each string in the input is inserted into a burst trie, creating a sorted sequence of buckets that are internally unsorted. In the basic implementation, each bucket is a set of pointers to strings; within a bucket, each string has the same prefix, representing the path through the trie that was used to reach the bucket. The most efficient representation of buckets is as an array that is resized as necessary. Traversal of the burst trie, sorting each bucket in turn, yields sorted output. Some strings are entirely consumed in the trie; these are stored in special-purpose end-of-string buckets, in which all strings are known (by construction) to be identical.

During the insertion phase, then, each string is consumed byte by byte until it can be assigned to a bucket. Buckets that grow larger than the fixed threshold are burst into new subtries, ensuring that each bucket remains small enough to be sorted in fast cache memory. Because burstsort processes one string fully before proceeding to the next, it exhibits better locality than approaches such as multikey quicksort or forward radixsort. In those methods, the first byte of every string is processed before the second byte of any string, so each string must be reaccessed for every distinguishing character. When the volume of keys being sorted is much larger than cache, burstsort generates far fewer out-of-cache references than the other algorithms [Sinha and Zobel 2004a].
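The insertion and traversal phases of pointer-based burstsort can be sketched as follows. This is a deliberately simplified Python rendering of ours: the trie node is a dictionary rather than a 256-way array, bucket growth is collapsed to a single toy threshold, and the helper sort is a plain comparison sort rather than the tuned helper sorts of the paper.

```python
# Simplified pointer-based burstsort: a trie node is a dict mapping a
# byte to either a child node (dict) or a bucket (list of strings).
# A bucket exceeding THRESHOLD is burst into a new subtrie.

THRESHOLD = 4               # toy value; the real threshold is cache-sized

def insert(node, key, d=0):
    """Insert `key` into the burst trie at `node`, consuming byte d."""
    if d == len(key):
        node.setdefault("$", []).append(key)   # end-of-string bucket
        return
    child = node.setdefault(key[d], [])
    if isinstance(child, dict):                # internal node: descend
        insert(child, key, d + 1)
    else:                                      # bucket of string references
        child.append(key)
        if len(child) > THRESHOLD:             # burst: bucket -> subtrie
            subtrie = {}
            for k in child:
                insert(subtrie, k, d + 1)
            node[key[d]] = subtrie

def traverse(node, d=0):
    """Return keys in sorted order, sorting each bucket as it is reached."""
    out = list(node.get("$", []))              # identical exhausted keys first
    for c in sorted(k for k in node if k != "$"):
        child = node[c]
        if isinstance(child, dict):
            out += traverse(child, d + 1)
        else:
            # all keys here share a (d+1)-byte prefix; sort remaining bytes
            out += sorted(child, key=lambda s: s[d + 1:])
    return out

keys = [b"bat", b"barn", b"bark", b"by", b"by", b"by", b"by",
        b"byte", b"bytes", b"wane", b"way"]
trie = {}
for k in keys:
    insert(trie, k)
print(traverse(trie) == sorted(keys))
```

Note that the buckets hold only references; the key bytes themselves remain wherever the caller allocated them, which is precisely what the copy-based variants introduced below change.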
Burstsort gains speed by accessing keys serially and assigning them to buckets small enough to sort within cache. Our exploration of burstsort variants [Sinha and Zobel 2004a] showed that the speed can be increased by reducing the number of times that keys are accessed from their original memory locations, by reducing the number of buckets that are formed, by balancing trie size against typical bucket size, and by reducing the number of times that large buckets are grown or burst. The best results are achieved by forming the fewest

possible buckets and filling them as full as possible, and by growing or bursting buckets while they are still small.

Addressing these conflicting issues, we developed burstsort variants that avoid most of the bursts by using the strategy of prebuilding trie structures based on small random samples of keys [Sinha and Zobel 2004b]. Sampling burstsorts are 10–25% faster than the original and generate up to 37% fewer cache misses. However, the costs involved in growing and bursting buckets are a bottleneck.

The string burstsorts described to date leave the keys in their original locations and place only pointers in buckets, leading to two problems. First, average key length can vary significantly by bucket. A burst threshold that is low enough to ensure that buckets referencing long keys can be sorted within cache may result in wasteful bursting of buckets that hold shorter keys. Second, pointers are localized in buckets, but bucket sorting requires repeated access to the keys, which remain dispersed at their original locations in memory. With low burst thresholds, there will be room to bring all the keys referenced by a bucket into cache, but this is wasteful, as only the unexamined terminal bytes of each key are actually needed. Furthermore, each key is accessed twice: once to assign it to a bucket and again to bring it into cache memory when the bucket is sorted.

In adapting burstsort to integers [Sinha 2004], we discovered that it was efficient to store only the distinguishing tail of each integer in the buckets; the common prefix was represented by the path through the burst trie. Within a bucket, each tail was of the same length, but lengths varied from bucket to bucket. This suggested that copying strings rather than referencing them could be advantageous, as we now discuss.

3. COPY-BASED BURSTSORT

Burstsort would be more efficient if the spatial locality of the strings could be improved, so that strings are more likely to be cached. Improving locality requires moving strings so that strings sharing a prefix are near each other and, if feasible, would reduce cache misses during both bursting and bucket sorting. Our contribution in this paper is copy-based burstsort, in which colocation of strings is successfully used to reduce sorting costs. The novel approach that we propose is to copy the suffix of each string into its bucket, eliminating the need to refer to the strings at their arbitrary memory locations.

In copy-based burstsort, or C-burstsort, each key is accessed just once in its original memory location. When it has been assigned to the appropriate bucket, its unexamined tail bytes are copied to the bucket, and the original key, including the prefix, which is represented by the path through the trie, is deleted. Buckets grow or burst based on the actual number of tail-of-string bytes held in them. When all keys have been inserted, each group of suffixes sharing a common prefix has been localized in a bucket and packed into a minimal contiguous block of memory.

C-burstsort has three main advantages. First, the full capacity of the cache can be utilized, whereas, in pointer-based burstsort, the number of strings in the buckets is typically limited to the number of blocks in cache. This allows

more strings to be accommodated in each bucket, reducing the height of the trie. Second, bursting a bucket does not require fetching strings from memory, which could cause a cache miss with each access; only a scan traversal of the bucket is required to distribute the strings into the child buckets. Third, sorting the strings in the final traversal phase is more efficient; only a scan traversal of the bucket is required, and the strings do not have to be fetched from their original memory locations, a process that would potentially cause a cache miss for each string.

A disadvantage of C-burstsort is that it is not suitable for sorting sets of records, since it sorts unstably and discards the original keys. We, therefore, developed a record-based variant, CP-burstsort, which transfers both string tails and record pointers to buckets, and allows stable sorting of records. This variant requires more memory than C-burstsort or pointer-based burstsorts, but provides the fastest stable sorting.

The absolute and relative performance of these new algorithms will vary with the cache architecture and latencies of particular computers, but, in most cases, we expect them to provide the fastest options available for internal string-based sorting. The performance on a range of architectures is explored in our experiments. First, we describe the new copy-based variants of burstsort.

3.1 C-Burstsort

In C-burstsort, the strings are initially loaded into a segmented buffer, or array of buffers. Segmentation reduces peak memory allocation by allowing groups of the original keys to be freed as soon as their tails are inserted into buckets. Making an arbitrary choice, which has only minimal impact on efficiency, we use 50 buffers. Statistics are collected during loading, including the number of keys, number of bytes, length of the longest key, and highest and lowest character values encountered. (The length of the longest string is used during insertion of strings into a bucket to avoid overflow without double-checking the string. The character range is used during the traversal phase; only the buckets within that range are checked.)

The strings are inserted one by one into the trie structure and their uninspected suffixes are copied to the buckets, where the suffixes are contiguously stored. Each bucket is allowed to grow until it reaches the threshold limit for bursting. A buffer segment is freed once all strings in that segment have been inserted into the trie. For the empty-string buckets, only the counters need be incremented. During bursting, the bucket is scan-traversed and the strings are distributed and copied into new buckets. Once all the strings have been inserted, the trie nodes are traversed and pointers to the string suffixes in the buckets are created. The string suffixes in each bucket are then sorted using a suitable helper sort, such as multikey quicksort, and the strings can be output in sort order.

In C-burstsort, each bucket is represented as an array of bytes; in prior versions of burstsort, buckets were arrays of pointers. When a bucket is initially created, it is small, to avoid waste of space—an arbitrary bucket may never hold more than a few strings. Each time a bucket overflows, it is enlarged by

reallocation. A limit threshold is used to put an absolute cap on bucket size; when this threshold is reached, the bucket is burst. The choice of bucket expansion strategy and of threshold is discussed later. In detail, the phases of C-burstsort are as follows.

1. Initialization: Read all strings into the segmented buffer. Statistics are recorded as described above.

2. Insertion: The keys in the segmented buffer are processed sequentially and each segment is freed as soon as its keys have been inserted. As each key is read, its successive bytes are used to traverse the trie structure. Starting at the root, the first byte value indexes the next step in the path for strings beginning with that character. If this step is a trie node, the pointer is followed to the next node and the next character is examined, and so on, until the key is exhausted or an unbranched node is found.

If a key runs out, the count for that key is incremented. There are no buckets for exhausted keys in C-burstsort and, after insertion, such keys exist only as counts and the path of prefix characters leading to a given node. Otherwise, a bucket is reached and its count is incremented. If the incremented count is one, a new bucket of size, say, 2048 bytes, is allocated. The key’s remaining bytes, including the terminating null, are copied to the end of the bucket.

If the bucket is full, it needs to be grown or burst. If the bucket’s contents are projected to exceed the cache size C, or if the free burst count F (described later) is greater than 0, the bucket is burst. Otherwise it is grown. To burst a bucket, a new trie node is allocated. The key tails in the full bucket are inserted into the new subtrie and the full bucket is then freed. In the case that all the keys in a burst end up in a single new bucket, the burst routine automatically bursts that bucket, until a burst forms more than one new bucket or exhausts at least one key. The free burst counter F is decremented only if a burst results in a net increase in buckets. (That is, bursts that deepen the trie without branching are always free.) To grow a bucket, the contents of the full bucket are copied to a newly allocated bucket twice as large and the old bucket is freed.

3. Traversal: Starting at the root node, the slots between the lowest and highest character values observed during the loading phase are recursively scanned. When a bucket is found, it is tested to determine whether it has enough free space to allocate (in cache) a tail pointer for each tail. If not, a rare case, it is burst and the new subtrie is recursively traversed. The bucket is then scanned, so that each tail pointer indicates the start of a tail. The tail pointers are then sorted with the helper sort.

At this point, keys can be recovered in sort order directly from the trie. Prefixes are built by starting at the root with a null string and appending each character traversed. For strings that are entirely represented within the trie, counts are used to indicate how often each string should be output. For other strings, the tail is appended to the prefix and the whole string is output.
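The phases above can be sketched as follows. This is a toy Python rendering of our own of the core idea, copying NUL-terminated tails into contiguous byte buckets; the segmented buffer, doubling growth, free burst counter F, and cache-sized limit are all omitted, and the burst threshold is an illustrative constant.

```python
# Simplified C-burstsort: each bucket is a contiguous bytearray of
# NUL-terminated key tails; keys exhausted in the trie are kept only as
# counts. A bucket whose byte count exceeds LIMIT is burst.

LIMIT = 8                   # burst threshold in tail bytes (toy value)

def split_tails(bucket):
    """Scan a bucket, yielding each NUL-terminated tail in insertion order."""
    start = 0
    while start < len(bucket):
        end = bucket.index(0, start)
        yield bytes(bucket[start:end + 1])
        start = end + 1

def insert_tail(node, tail):
    """Insert a NUL-terminated tail below `node`, consuming its first byte."""
    c = tail[0]
    if c == 0:                               # key exhausted within the trie
        node["$"] = node.get("$", 0) + 1     # keep only a count
        return
    child = node.setdefault(c, bytearray())
    if isinstance(child, dict):              # internal trie node: descend
        insert_tail(child, tail[1:])
        return
    child += tail[1:]                        # copy the tail bytes, NUL included
    if len(child) > LIMIT:                   # burst the bucket into a subtrie
        subtrie = {}
        for t in split_tails(child):
            insert_tail(subtrie, t)
        node[c] = subtrie

def traverse_trie(node, prefix=b""):
    out = [prefix] * node.get("$", 0)        # exhausted keys: output by count
    for c in sorted(k for k in node if k != "$"):
        child, p = node[c], prefix + bytes([c])
        if isinstance(child, dict):
            out += traverse_trie(child, p)
        else:                                # sort tails; strip NUL on output
            out += [p + t[:-1] for t in sorted(split_tails(child))]
    return out

keys = [b"bat", b"barn", b"bark", b"by", b"by", b"by", b"by",
        b"byte", b"bytes", b"wane", b"way"]
trie = {}
for k in keys:
    insert_tail(trie, k + b"\0")             # append the terminating null
print(traverse_trie(trie) == sorted(keys))
```

Because a tail always ends with a NUL, which sorts before any real byte, sorting the tails directly yields the correct prefix-first order; once its tail is copied, the original key is never consulted again, which is the point of the method.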

A snapshot of C-burstsort after the insertion of all the strings in a collection is shown in Figure 1. Strings in the trie structure are “bat,” “barn,” “bark,” “by,”


Fig. 1. C-burstsort with five trie nodes, four buckets, and eleven strings. The threshold for bursting a bucket is five bytes.

“by,” “by,” “by,” “byte,” “bytes,” “wane,” and “way.” The string terminator symbol in all figures is denoted by “#.”

3.2 CP-Burstsort

CP-burstsort provides stable sorting of records or other data when it is not desirable to delete the original keys. For each key, both the unexamined suffix and a pointer to the original key or record are copied to the appropriate bucket. Output is a sorted sequence of pointers to the records.

In the first phase, the data is copied into a buffer and the statistics (length of the longest string and character range) of the collection are recorded. The strings are then inserted one by one into the trie structure. As each string is inserted, a pointer to the string’s location in the input buffer is placed in the bucket, immediately prior to a copy of the uninspected string suffix. For the empty-string bucket, only the record pointers need be kept in the bucket. During bursting, the bucket is scan-traversed and the string suffixes and record pointers are distributed and copied into new buckets. Once all the strings have been inserted into the trie, the trie nodes are traversed and pointers to the string suffixes in the buckets are created. The string suffixes in the bucket are then sorted using a stable version of multikey quicksort. The records can then be output in sort order, by a scan traversal of the array of pointers to suffixes in


the bucket.

Fig. 2. CP-burstsort with five trie nodes, five buckets, and eleven strings. The threshold for bursting a bucket is seven bytes.

In detail, the phases of CP-burstsort are as for C-burstsort, but with the following differences.

1. Initialization: Read all strings (and associated records) into a single buffer. Statistics are recorded as described above.
2. Insertion: Insertion is similar to that in C-burstsort. The only differences are that the original keys are not freed after insertion and, instead of holding only counts, the record pointers of exhausted strings are kept in buckets; these buckets grow by doubling, but do not burst.
3. Traversal: Traversal is similar to that in C-burstsort. However, the alternation of record pointers and string tails must be taken into account while building the tail pointer array.

Special helper sorts are needed to stably sort CP-burstsort buckets. When equal keys are found, stability is achieved by arranging their tail pointers in the order of the tails (which were inserted in original order, and moved but never reordered during bucket growth and bursting). Once a bucket has been sorted, the tails are no longer needed. Record pointers are copied into the positions of the sorted tail pointers and the bucket is freed.

A snapshot of CP-burstsort after the insertion of all the strings in a collection is shown in Figure 2. Strings in the trie structure are, as above, “bat,” “barn,” “bark,” “by,” “by,” “by,” “by,” “byte,” “bytes,” “wane,” and “way.” The record pointer


in Figure 2 is denoted by “→” and is assumed, in the figure, to occupy a single byte.

Fig. 3. PPL structure used in C-burstsort and CP-burstsort.

In comparing the speed of the variants of burstsort, some assumptions need to be made. For example, in some applications involving large sets of strings, the array of pointers to strings will already be present; in other applications, creating the array of pointers is an overhead. In our experiments, in order to err on the side of underestimating the improvement yielded by C- and CP-burstsort, we assume the former: the cost of creating the array of pointers is not counted.

4. IMPLEMENTATION CHOICES

The behavior of C-burstsort may be modified or adapted to a particular system by changing the growth strategy, cache size (C), initial bucket size (S0), free burst count (F), and choice of helper sort. We now explore implementation choices, after explaining the structures used for these copy-based burstsort variants.

4.1 Structures Used by C-Burstsort and CP-Burstsort

The PPL elements of the tries used in C- and CP-burstsort are as shown in Figure 3. The trie node is an array of these PPL elements, which have four fields: (1) an integer field to keep track of the number of strings in a bucket; (2) a pointer to the start of the bucket; (3) a pointer to the first unused byte of the bucket; and (4) a pointer to the limit at which the bucket will test as full. As a suffix is being copied into the bucket, if the position exceeds the limit, the bucket is either grown or burst.

Apart from the trie node structure, a simple array is used for the bucket, which stores the string suffixes and record pointers. The buckets hold string tails and (in CP-burstsort) pointers to the original records. In C-burstsort, string tails are contiguously added to the bucket, with the first byte of each tail following the terminating null of the preceding tail. In CP-burstsort, a record pointer for each key is inserted first, followed immediately by the corresponding tail. In both methods, tail pointers are created at bucket-sorting time, in a separate array. Although the tail pointer array is distinct from the bucket, both must ideally fit into cache during the bucket-sorting phase.
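Rendered in C, a node element with these four fields might look like the following (field names are ours; the paper describes the layout but gives no source code):

```c
#include <assert.h>

/* One PPL element per character slot of a trie node. On the 32-bit
 * machines used in the paper (4-byte int, 4-byte pointers) this gives
 * the 16-byte layout discussed below; a 64-bit build is larger. */
typedef struct {
    int   count;   /* number of strings in the bucket            */
    char *start;   /* start of the bucket                        */
    char *unused;  /* first unused byte of the bucket            */
    char *limit;   /* position at which the bucket tests as full */
} ppl_element;
```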


Fig. 4. Traversal phase. (a) C-burstsort; (b) CP-burstsort.

During the traversal phase, an array of pointers to suffixes in the bucket is created and used to sort the bucket. In C-burstsort, the strings can be retrieved by traversing this array of pointers and concatenating the prefixes to the corresponding suffixes. In CP-burstsort, the record pointers are mapped onto the sorted tail pointers (see below) and can then be used to retrieve records in order. A snapshot of the traversal phase is shown in Figure 4. The record pointer in Figure 4 is denoted by “→” and is assumed to occupy a single byte.

On both Intel and PowerPC architectures, a node element occupies 16 bytes, structured as a four-byte integer and three character pointers. Holding the PPL structure to 16 bytes allows four node elements to fit on a 64-byte cache line. Depending on alphabet size and character distribution, string data sets can generate several trie nodes, and the effect of cache misses related to the trie structure would then start to become significant.

An alternative is to use a more compact node element. In this structure, the trie node is an array of elements having two fields: a counter and a pointer to a bucket index structure (BIS). The counter keeps track of the number of strings in the bucket. The BIS consists of five fields: (1) an integer field that records the size of the bucket; (2) a pointer to the start of the bucket; (3) a pointer to the first unused byte of the bucket; (4) a pointer to the array of pointers to strings in the bucket, used during the traversal phase; and (5) a pointer to the limit at which the bucket will test as full. As a suffix is being copied into the bucket, if the position exceeds the limit, the bucket is either grown or burst.

Including a separate BIS structure is not as cache efficient as the more compact PPL structure, as an access to the bucket requires an access to the node element and another to the BIS structure.
However, a more compact trie node could lead to more page locality, and thus fewer TLB misses, for collections over small alphabets and in other cases where the trie structure may become large.


For some of our experiments on other architectures, this alternative structure has been used; refinement of the BIS structure for cases where the PPL structure is ineffective is a topic for future research.
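A C sketch of the compact alternative (again with our own field names) makes the extra indirection explicit:

```c
#include <assert.h>

/* Bucket index structure (BIS): the five fields listed above. */
typedef struct {
    int    size;    /* size of the bucket                              */
    char  *start;   /* start of the bucket                             */
    char  *unused;  /* first unused byte of the bucket                 */
    char **tails;   /* pointers to strings in the bucket, built during
                       the traversal phase                             */
    char  *limit;   /* position at which the bucket tests as full      */
} bis;

/* The compact trie-node element: a counter plus one pointer. Each
 * bucket access now costs an extra indirection through the BIS. */
typedef struct {
    int  count;     /* number of strings in the bucket */
    bis *index;
} compact_element;
```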

4.2 Free Bursts and Sampling

Bursting of large buckets is an expensive operation, and significant time can be saved by correctly predicting buckets that will burst and bursting them while they are still small. Predictive bursting based on random sampling has been described for pointer-based burstsorts [Sinha and Zobel 2004b]. A simpler approach that is nearly as effective is to allow a fixed number F of free bursts, with one free burst consumed each time a burst increases the net total of buckets. This amounts to a nonrandom sampling scheme in which the first buckets that fill at the first bucket size (in whatever order the data are loaded) are preemptively burst without further growth.

In our experiments, we report times for CF-burstsort and CPF-burstsort, which are C- and CP-burstsort with free bursts, for F = 100. We found that 60–100 free bursts can slightly improve sorting speed for data in random order and do not degrade performance on presorted or reversed data, while showing very little effect on memory requirements.

4.3 Fullness and Burst Testing

Each time a key is processed and its suffix inserted into an existing bucket, the bucket must be tested to determine whether it is full. Since this test is so frequent, it needs to be as simple as possible. Whenever a bucket is created or grown, its limit pointer is set to its base pointer plus its size, minus the length of the longest expected key. The fullness test simply checks whether the next free byte in the bucket is beyond this limit. If so, the bucket size is calculated. If the size equals the cache size or if the free-burst counter F is greater than zero, the bucket is burst. Otherwise, it is copied to a newly allocated bucket twice the old size and the old bucket is freed.

These tests guarantee that buckets do not overflow, but, in the case of a cache-sized bucket, a simple burst test does not guarantee that there is room in cache for both the string tails in the bucket and a sorting pointer to each of them. During the traversal phase, the bucket may be burst again if the size of the bucket plus the space needed for the tail pointers is greater than the cache size.
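The fullness test and the grow-or-burst decision can be sketched as below (a sketch under our own naming, assuming the 512 KB L2 cache of the Pentium IV; the paper describes the logic but not the code):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_SIZE (512 * 1024)   /* L2 size of the Pentium IV used here */

/* Sketch of a bucket record; names are ours, not the paper's. */
typedef struct {
    char  *start, *unused, *limit;
    size_t size;
} bucket;

int free_bursts = 100;  /* the free-burst budget F */

/* Run after each suffix insertion. Returns 1 if the caller should burst
 * the bucket; otherwise the bucket still has room or has been doubled. */
int check_full(bucket *b, size_t longest_key)
{
    if (b->unused < b->limit)
        return 0;                         /* the cheap, common-case test */
    if (b->size >= CACHE_SIZE || free_bursts > 0) {
        if (b->size < CACHE_SIZE)
            free_bursts--;                /* a free (preemptive) burst */
        return 1;
    }
    size_t used = (size_t)(b->unused - b->start);
    char *grown = malloc(b->size * 2);    /* static growth factor of two */
    memcpy(grown, b->start, used);
    free(b->start);
    b->size  *= 2;
    b->start  = grown;
    b->unused = grown + used;
    /* leave room for the longest key so insertions can never overflow */
    b->limit  = grown + b->size - longest_key;
    return 0;
}
```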

4.4 Bucket Growth Strategies

A simple way to grow buckets is to use a static growth factor, which is fixed for all buckets and all sizes. For example, if the growth factor is two, then whenever a bucket needs to be grown its size is doubled.

An alternative is to use an adaptive growth factor, with different factors for small and large buckets: a large growth factor for the smaller buckets and a smaller growth factor for the larger buckets.

A more principled approach is to use a predictive growth factor. This can be achieved by choosing the growth factor as the ratio of the number of keys in the collection to the number of keys observed so far. Thus, a bucket that reaches the limit

early on will have a larger growth factor than a bucket that is growing more slowly. This makes the growth factor responsive to the collection characteristics. Use of such schemes, however, creates the risk of bad performance or excessive memory usage in pathological cases, such as collections that are already sorted. For good performance, a scheme must ensure that buckets remain small enough to be sorted within cache, minimize growth and bursting of large buckets, and keep memory overhead within reasonable bounds. The results reported in this paper use a simple scheme with a static growth factor of two.

For the copy-based schemes, we have used a combination of malloc and memcpy for allocating and growing the buckets. An alternative would be to use realloc, an approach that has the advantage that, if an array can be grown in place, no copying is needed; however, the housekeeping associated with realloc can be costly. Another alternative would be to preallocate large segmented buffers and write a dedicated memory manager, avoiding costly invocations of free. We plan to explore these alternatives in future work.

Buckets are grown until they reach the threshold limit, which depends on the cache and TLB sizes. The limit was set to 512 KB for the 512 KB cache of the Pentium IV, and to 512 KB for the 1 MB cache of the Pentium III. Where TLB misses are expensive, the size of the bucket may need to take the size of the TLB into account.

4.5 Other Issues

We have used segmented buffers in C-burstsort to manage the input strings. To limit memory requirements, C-burstsort reads the original keys into a segmented buffer and frees each segment as soon as its keys have been added to the burst trie. A larger number of buffers is beneficial from the point of view of reducing memory usage, but freeing a large number of buffers can be expensive.

CP-burstsort preserves the original keys or records but, as soon as each bucket is sorted, it reclaims memory by copying record pointers onto the corresponding sorted tail pointers and then freeing the bucket. To increase the likelihood that freed objects can be reused, both C- and CP-burstsort allocate objects in power-of-2 sizes whenever possible.

C-burstsort demonstrates the advantages of filling buckets with string tails rather than pointers during the insertion phase. However, at bucket-sorting time all operations occur within cache and it is once again more efficient to manipulate pointers than the suffixes themselves. Both C- and CP-burstsort reserve space in cache for a pointer to each tail and use helper sorts that rearrange only the tail pointers. We have used multikey quicksort as the helper sort for all results in this paper.
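The tail-pointer array used by the helper sort can be built with a single scan of a C-burstsort bucket, since the tails are stored back to back with their terminators (a sketch in our own naming):

```c
#include <assert.h>
#include <string.h>

/* Collect one pointer per null-terminated tail stored contiguously in
 * [bucket, end). The resulting array is what the helper sort (multikey
 * quicksort in the paper) actually rearranges. */
size_t collect_tails(char *bucket, char *end, char **tails)
{
    size_t n = 0;
    for (char *p = bucket; p < end; p += strlen(p) + 1)
        tails[n++] = p;   /* each tail starts right after the previous NUL */
    return n;
}
```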

5. CACHE EFFICIENCY

A detailed cache analysis is beyond the scope of this experimental paper. However, the use of copying was partly motivated by our analysis of the costs of burstsort; we now sketch the expected cache costs of the new variants. As we show in our experiments, it is due to reduced cache costs that the new algorithms are so efficient.


For C-burstsort, a sketch of the costs is as follows.

— Insertion. The source array of N strings is scan traversed. The number of cache misses is K/B, where K is the total length of the strings and B is the cache line size. Accesses to the trie nodes, buckets, or bucket index can incur cache misses. This is the only phase that is not cache optimal, that is, not O(N/B). The number and size of trie nodes is relatively small and most are expected to remain in cache. The number of active locations increases with the height of the trie, which can result in the buckets and any PPL structures not being cache resident.
— Bursting. This involves a scan traversal of the parent bucket and sequential filling of the child buckets. This phase is cache optimal and is an improvement over the pointer-based burstsorts.
— Traversal. The trie is traversed, but only those node elements that are within the character range are accessed. Each bucket and its associated array of pointers to suffixes is traversed. The bucket traversal cost can be approximated by the size of the collection K, resulting in up to K/B misses. Note that this is a loose upper bound, as only the tails of strings are stored in buckets. The scan traversal of the pointers to suffixes can be approximated by N/B, where N is the number of keys. The traversal phase is optimal and is an improvement over pointer-based burstsort.
— Output. This involves outputting the strings in sort order. It is optimal and involves only a scan traversal of the buckets and the destination array. This is an improvement on pointer-based methods, which can incur a cache miss per key in addition to the traversal of the source array of pointers.

The cache performance of CP-burstsort is similar to that of C-burstsort. The main difference is in the traversal phase, where the cost is up to (K + N)/B misses, a slight increase. The scan of the pointers to suffixes can again be approximated by N/B, where N is the number of keys.

The copy-based burstsorts are much more TLB efficient than the pointer-based sorting methods, because of the improved spatial locality of the strings.

— Insertion. As the source array of strings is scan traversed once, if K is the total length of the strings and P is the page size, then the number of TLB misses is K/P. TLB misses can occur while accessing the trie nodes, buckets, and any PPL structure. However, this depends, to a large extent, on the memory management unit and on where the PPL structures are allocated. In the worst case, each access to the trie nodes and buckets may incur a TLB miss, but this is highly unlikely.
— Bursting. In pointer-based burstsort, each access to a string could result in a TLB miss; in the copy-based methods, there will be significantly fewer misses due to the scan-based approach.
— Traversal. Up to K/P TLB misses are expected for the buckets, with an extra N/P misses for CP-burstsort. In pointer-based burstsort, each string access could result in a TLB miss.


— Output. This involves outputting the strings in sort order, which is TLB optimal and involves only a scan traversal of the buckets and the destination array.

Table I. Statistics of the Data Collections Used in the Experiments

Data Set                      Set 1    Set 2    Set 3    Set 4     Set 5     Set 6
Duplicates
  Size (MB)                   1.013    3.136    7.954   27.951    93.087   304.279
  Distinct Words (×10^5)      0.599    1.549    3.281    9.315    25.456    70.246
  Word Occurrences (×10^5)    1        3.162   10       31.623   100       316.230
No duplicates
  Size (MB)                   1.1      3.212   10.796   35.640   117.068   381.967
  Distinct Words (×10^5)      1        3.162   10       31.623   100       316.230
  Word Occurrences (×10^5)    1        3.162   10       31.623   100       316.230
Genome
  Size (MB)                   0.953    3.016    9.537   30.158    95.367   301.580
  Distinct Words (×10^5)      0.751    1.593    2.363    2.600     2.620     2.620
  Word Occurrences (×10^5)    1        3.162   10       31.623   100       316.230
Random
  Size (MB)                   1.004    3.167   10.015   31.664   100.121   316.606
  Distinct Words (×10^5)      0.891    2.762    8.575   26.833    83.859   260.140
  Word Occurrences (×10^5)    1        3.162   10       31.623   100       316.230
URL
  Size (MB)                   3.03     9.607   30.386   96.156   304.118      —
  Distinct Words (×10^5)      0.361    0.923    2.355    5.769    12.898      —
  Word Occurrences (×10^5)    1        3.162   10       31.623   100          —

6. EXPERIMENTAL DESIGN

For our experiments, we have used four real-world collections with different characteristics. These are the same collections as described in earlier work [Sinha and Zobel 2004a, 2004b]^1 and we follow the earlier descriptions. These collections are composed of words, genomic strings, and web URLs. The strings are words delimited by nonalphabetic characters and are selected from the large web track in the TREC project [Harman 1995; Hawking et al. 1999]. The web URLs have been selected from the same collection. The genomic strings are from GenBank [Benson et al. 1993]. For word and genomic data, we created six subsets, of approximately 10^5, 3.1623 × 10^5, 10^6, 3.1623 × 10^6, 10^7, and 3.1623 × 10^7 strings each. We call these SET 1, SET 2, SET 3, SET 4, SET 5, and SET 6, respectively. For the URL data we created SETS 1–5. The statistics of the data sets are shown in Table I.

In detail, the data sets are as follows.

— Duplicates. Words in order of occurrence, including duplicates. The characteristics are similar to most collections of English documents, that is, some words occur more often than others.
— No duplicates. Unique strings based on word pairs in order of first occurrence in the TREC web data.

^1 These data sets are available at the URL http://www.cs.rmit.edu.au/∼rsinha/papers.html.

— Genome. Strings extracted from genomic data, a collection of nucleotide strings, each typically thousands of nucleotides long. The alphabet size is four characters. The data is parsed into shorter strings by extracting n-grams of length nine. There are many duplications, and the data does not show the skew distribution that is typical of text.
— URL. Complete URLs, in order of occurrence and with duplicates, from the TREC web data; the average length is high compared to the other sets of strings.

Alternative collections. In addition to the above collections, we have included four collections as used in previous work [Sinha and Zobel 2004a; Bentley and Sedgewick 1997], three of which are artificial and are designed to explore pathological cases.

—The length of the strings is one hundred, the alphabet has only one character, and the size of the collection is one million.
—The length of the strings ranges from one to a hundred, the alphabet size is small (nine), and the characters appear randomly. The size of the collection is ten million.
—The length of the strings ranges from one to a hundred; strings are ordered in increasing size in a cycle. The alphabet has only one character and the size of the collection is one million.
—A collection of library call numbers consisting of 100,187 strings, about the size of our SET 1 [Bentley and Sedgewick 1997].^2

The aim of our experiments was to compare the performance of our new algorithms with the best algorithms from our previous work. The performance was compared in terms of running time, instruction counts, L2 cache misses, and data TLB misses. The L2 cache and DTLB results include misses of all three kinds: compulsory, capacity, and conflict. The programs are all written in C and have been gathered from the best sources we could identify. We are confident that the implementations are of high quality.

For measuring the cache performance, both simulators and hardware performance counters were used. For measuring cache effects under different cache parameters, we have used valgrind, an open-source cache simulator [Seward 2001]. For measuring the data TLB misses on a Pentium IV, we used PAPI [Dongarra et al. 2001], which offers an interface for accessing the hardware performance counters for processor events. For all data sizes, the minimum time from ten runs has been used, so that occasional variations because of page faults, which do not reflect the cache performance, are not included.
However, in our runs, we did not observe any significant variations between the runs; the standard deviation was extremely low. The internal buffers of our machine are flushed prior to each run in order to have the same starting condition for each experiment. Time spent in parsing the strings and writing them to the source array, creating pointers to strings, and retrieving strings from disk by the driver program is not included. Thus, the

^2 Available from www.cs.princeton.edu/∼rs/strings.


Table II. Architectural Parameters of the Machines Used for Experiments

Workstation               Pentium      Power Mac G5     Pentium
Processor type            Pentium IV   PowerPC 970      Pentium III Xeon
Clock rate                2000 MHz     1600 MHz         700 MHz
L1 data cache (KB)        8            32               16
L1 line size (bytes)      64           128              32
L1 associativity          4-way        2-way            4-way
L1 miss latency (cycles)  7            8                6
L2 cache (KB)             512          512              1024
L2 block size (bytes)     64           128              32
L2 associativity          8-way        8-way            8-way
L2 miss latency (cycles)  285          324              109
Data TLB entries          64           256              64
TLB associativity         full         4-way            4-way
Page size (KB)            4            4                4
Memory size (MB)          2048         256              2048

Table III. Duplicates^a

Data Set             Set 1   Set 2   Set 3   Set 4    Set 5    Set 6
Multikey quicksort      40     210     680   3,230   14,260   56,230
MBM radixsort           40     190     640   3,270   15,470   63,460
Adaptive radixsort      50     230     720   3,030   12,110   46,940
Burstsort               30     130     380   1,540    6,540   29,310
C-burstsort             30      90     270   1,060    3,570   12,470
CP-burstsort            30     110     310   1,210    4,140   15,270
CF-burstsort            30     100     260     930    3,280   12,200
CPF-burstsort           40     120     310   1,080    3,900   14,860

^a Running time (ms) to sort with each method on a Pentium IV.

time measured for the pointer-based algorithms is to sort an array of pointers to strings; the array is returned as output. Thus, the timing for pointer-based methods does not include writing the strings in sort order.

We have used two relatively new processors, a Pentium IV and a PowerPC 970. Experiments on an older Pentium III processor have also been included. These machines have different cache architectures. Most of the experiments were on a 2000 MHz Pentium IV computer with 2 GB of internal memory and a 512 KB L2 cache with a block size of 64 bytes and 8-way associativity. The Pentium IV machine was running the Fedora Core 2 Linux operating system and using the GNU gcc compiler version 3.3.3. Further details on the cache architecture of the machines used for the experiments can be found in Table II. In all experiments, the highest compiler optimization level, O3, has been used. The clock function has been used to measure the times. The machine was under light load, that is, no other significant I/O or CPU tasks were running.

7. RESULTS AND DISCUSSION

Timings for each method, each data set, and each set size are shown in Tables III to VII. The overall picture of these results is clear: the copy-based burstsorts are much faster than the pointer-based versions. In the best case, C-burstsort


Table IV. No Duplicates^a

Data Set             Set 1   Set 2   Set 3   Set 4    Set 5    Set 6
Multikey quicksort      50     200     880   3,870   15,470   61,620
MBM radixsort           40     170     830   3,660   15,650   63,870
Adaptive radixsort      60     230     930   3,550   13,800   53,570
Burstsort               30     120     500   1,870    7,510   33,510
C-burstsort             30     100     380   1,380    4,620   15,650
CP-burstsort            30     110     440   1,560    5,310   19,600
CF-burstsort            30     110     380   1,270    4,450   15,260
CPF-burstsort           50     140     430   1,430    5,120   19,300

^a Running time (ms) to sort with each method on a Pentium IV.

Table V. Genome^a

Data Set             Set 1   Set 2   Set 3   Set 4    Set 5    Set 6
Multikey quicksort      50     270   1,100   4,430   17,820   68,770
MBM radixsort           50     300   1,340   5,470   23,390   91,000
Adaptive radixsort      90     350   1,370   4,690   17,980   67,360
Burstsort               40     160     590   2,200    8,720   40,600
C-burstsort             30     120     390   1,150    3,210   10,730
CP-burstsort            30     160     490   1,350    4,170   14,090
CF-burstsort            20      90     310     930    3,010   10,120
CPF-burstsort           30     100     350   1,060    3,800   13,180

^a Running time (ms) to sort with each method on a Pentium IV.

is four times faster than the pointer-based burstsort and six times faster than the best of the previous sorting methods on the largest set of genome data. The degree of improvement varies from collection to collection, but the trends are consistent and clear. First, C-burstsort is always faster than CP-burstsort, which, in turn, is faster than any previous method. Second, use of free bursts (in CF-burstsort and CPF-burstsort, with F = 100) consistently yields further gains. Third, the larger the collection, the greater the improvement.

These strong positive results make it clear that our new copy-based burstsorts far outperform other methods when used in typical circumstances. We now consider in detail the performance in special and pathological cases, as these illustrate the mechanisms underlying the effectiveness of copy-based sorting.

Table VIII illustrates performance on the pathological case of sorted input. The locality of strings in a sorted collection is maximal, so pointer-based sorting methods can make optimal use of cache. This is the only case in which burstsort, copy-based or otherwise, is not the best method. While the performance of the pointer-based methods improved dramatically on sorted data, by up to a factor of 10, the performance of the copy-based methods did not improve quite so dramatically. The reason appears to be the cost of copying without the relative advantages of cache efficiency, as in the case of the unsorted collections. However, we note that these results may be atypical for sorted data; on other sorted data sets, such as random strings, the copy-based methods were considerably better.


Table VI. URL^a

Data Set             Set 1   Set 2   Set 3   Set 4    Set 5
Multikey quicksort     130     700   2,500   8,860   40,650
MBM radixsort          140     820   3,480  13,220   61,720
Adaptive radixsort     160     730   2,450  10,100   41,760
Burstsort               70     410   1,680   5,660   28,620
C-burstsort             70     230   1,420   4,320   14,090
CP-burstsort            80     260   1,500   4,560   15,390
CF-burstsort            80     240     820   2,830   12,440
CPF-burstsort           90     260     780   3,070   13,730

^a Running time (ms) to sort with each method on a Pentium IV.

Table VII. Random^a

Data Set             Set 1   Set 2   Set 3   Set 4    Set 5    Set 6
Multikey quicksort      40     210     970   3,900   15,120   62,360
MBM radixsort           30     130     670   2,360    8,370   37,630
Adaptive radixsort      60     220     760   2,470    9,390   41,260
Burstsort               30     100     450   1,670    5,660   26,440
C-burstsort             20      70     300   1,180    4,420   16,440
CP-burstsort            20     100     370   1,320    5,100   19,770
CF-burstsort            40     110     480   1,620    5,290   16,910
CPF-burstsort           50     150     630   1,960    5,660   20,310

^a Running time (ms) to sort with each method on a Pentium IV.

Table VIII. Sorted (Duplicates)^a

Data Set             Set 2   Set 3   Set 4    Set 5    Set 6
Multikey quicksort      30     100     490   2,070    7,530
MBM radixsort           20      90     400   1,610    5,780
Adaptive radixsort      30      90     330   1,230    4,210
Burstsort               20      80     360   1,460    5,070
C-burstsort             30     120     540   2,420    8,170
CP-burstsort            40     160     670   2,840    9,490
CF-burstsort            10     100     530   2,420    8,160
CPF-burstsort           30     150     660   2,830    9,470

^a Running time (ms) to sort with each method on a Pentium IV. Times for Set 1 were too small to measure accurately.

In contrast, the burstsorts are clearly the methods of choice for the alternative collections, as shown in Table IX. However, compared to pointer-based burstsort, the copy-based burstsorts are up to a factor of two slower for collections A and C. These collections are particularly challenging for copy-based burstsorts, as they both have long strings. The primary reason for the poorer performance is TLB misses. The trie nodes are not compact and, after the insertion of a small number of strings, the large majority of the strings have to traverse about 100 trie nodes. As there are only 64 TLB entries, each access to a node may lead to a TLB miss. If the alphabet size is reduced to 128, leading to more compact trie nodes, the performance improves dramatically, by 50%. More generally, the


Table IX. Running Time (ms) to Sort the Alternative Data Sets with Each Method on a Pentium IV

Data Set                   A        B        C      D
Multikey quicksort      6,970   14,990    2,940    90
MBM radixsort           7,430   31,310    8,060    90
Adaptive radixsort      3,660   16,830    1,760   110
Burstsort               1,300    8,910      780    60
C-burstsort             3,380    5,290    1,600    50
CP-burstsort            3,360    6,190    1,440    60
C-burstsort (BIS)       1,470    5,020      810    50
CP-burstsort (BIS)      1,460    6,280      720    50

Table X. Running Time (ms) to Sort the Random Collection of One Million Strings of Fixed Lengths on a Pentium IV

String length     10     30     60    100
Burstsort        480    520    550    610
C-burstsort      300    640  1,310  2,080
CP-burstsort     370    690  1,350  2,110
CPL-burstsort    480    620    680    740

node size can be restricted to cover only the range of characters observed during the initial loading of strings into the segmented buffer. For this data, the node size would collapse to two symbols. An alternative is to use more compact nodes, as is achieved using the BIS structure, as shown in Table IX. The results are similar to the pointer-based burstsort on collections A, C, and D, and are up to twice as fast as pointer-based burstsort on collection B.

The major advantage that copy-based methods have over pointer-based burstsort and the other pointer-based methods is that more strings can be accommodated in cache. In copy-based methods, the strings are stored contiguously; the number of strings in cache depends on string length and the size of the cache, and not on the number of blocks in cache. It is thus expected that the cache performance of C-burstsort will approach that of the pointer-based burstsorts as the length of strings approaches the cache line size, as observed in these results.

To show how the performance of the copy-based methods is affected by string length, we experimented with four collections of fixed-length strings. The lengths of the strings in the four collections were 10, 30, 60, and 100 characters. The characters were chosen uniformly at random from the ASCII range. Times are shown in Table X. As the length of strings increases in these random collections, where the characters are uniformly distributed, the cost of copying the strings into the buckets increases. An increase in string length also reduces the number of strings that can be stored in a bucket.
When the average length of the strings exceeds the cache line size, the number of strings that can be stored in a bucket is less than for pointer-based burstsort. This, in turn, results in the creation of more trie nodes, increasing the number of cache misses during the insertion

phase. However, even then, the bursting and traversal phases are optimal compared to those of pointer-based burstsort. There is a trade-off between the extra cache misses incurred during the insertion phase and the cache misses saved during bursting and traversal. However, there is an underlying difficulty for burstsort: the information density (repetition of prefixes) of these collections is low.

An alternative hybrid scheme for such collections is to copy a record pointer and a short fixed-length tail segment to buckets. If the segment is enough to complete bursting and bucket sorting, the cost of copying the remaining irrelevant bytes is avoided. If not, the next segment is paged into the bucket (probably incurring a cache miss). As shown in Table X, this paging scheme, called CPL-burstsort, was found to be up to three times faster than CP-burstsort on 100-byte random strings. It is expected that the improvements will be even more dramatic with increasing string length.

The L2 cache misses incurred for the no-duplicates and URL collections are shown in Figure 5. (Similar results to those on no-duplicates are observed for the other collections.) The cache misses for C-burstsort are as low as one-half those of pointer-based burstsort and as low as one-tenth those of the previous best algorithms. The strings of the URL collection are long, with an average length of about 28 characters. Copying and maintaining such large strings in the bucket may reduce the advantages, but even then, for the largest set, C-burstsort incurs fewer cache misses than pointer-based burstsort.

Copying strings into buckets keeps similar strings together. This increases the page locality, resulting in fewer TLB misses during the bursting and traversal phases.
To further lower TLB misses during bucket sorting, the size of the bucket should be the smaller of the cache size and the TLB size. TLB misses can occur while accessing trie nodes and buckets, which can be distributed in memory across several pages. The number of TLB misses due to trie accesses is distribution dependent and is expected to increase with the size of the collection, as shown in Figures 6 and 7. For smaller data sizes, the TLB misses are almost nonexistent and rise slowly with increasing data size. The number of TLB misses incurred by C-burstsort is only one-third that of pointer-based burstsort or the best of the previous algorithms. For a TLB of 64 entries, the number of strings in a bucket should ideally be less than 64, but the threshold size in pointer-based burstsort is 8192 (the number of blocks in cache), which, in the worst case, results in a TLB miss for each access to a string.

The cost of copying string suffixes is expected to be high in terms of the number of instructions. This is shown in Figure 8, where the number of instructions per key for C-burstsort is higher than for any other method. The number of required instructions is up to twice that of adaptive radixsort. On older machines, such as the Pentium III Xeon, this cost in instructions may offset the savings in cache misses. This is shown later in Figure 11, where C-burstsort is slower than pointer-based burstsort for the URL collection. However, as processors continue to increase in speed relative to memory, the advantage of these new copy-based approaches should continue to increase.

In pointer-based burstsort, the number of pointers to strings that can be stored in a bucket is dependent on the number of blocks in cache. As the size


Fig. 5. Cache misses, 512 KB cache, 8-way associativity, 64-byte block size. (Top) no duplicates; (bottom) URLs.

of the cache line increases, the number of blocks in cache will decrease, thus reducing the threshold size. The effects of varying the cache line size using the same threshold (of 8192) are shown in Figure 9. For pointer-based burstsort, the L2 cache misses show an increasing trend as the cache line size increases. More cache misses are incurred during the bucket-sorting phase, as all the strings in a bucket may no longer be cache resident for the larger cache lines. For a line size of 256 bytes, the number of blocks is 2048; it would have been better to use a threshold size of less than 2048, but this would lead to a larger number of trie nodes, which may be inefficient. Thus, the performance of pointer-based burstsort is relatively more dependent on the number of blocks in cache.

In contrast, increasing the line size has the opposite effect for C-burstsort. Efficiency is less dependent on the number of blocks in cache; indeed, as the strings are contiguously stored, an increase in cache line size reduces cache misses in bucket processing. For C-burstsort, the cache misses decrease by over


Fig. 6. Data TLB misses on a Pentium IV for the duplicate collection.
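The bucket-sizing rule stated above, bounding buckets by the smaller of the cache size and the memory the TLB can map, is simple arithmetic. A sketch using the 64-entry TLB and 512 KB L2 cache from these experiments; the 4 KB page size is our assumption (typical for x86) and is not stated in the text:

```python
# Bucket sizing bound: a bucket larger than either the cache or the
# memory the TLB can map will thrash one of the two structures.
TLB_ENTRIES = 64
PAGE_BYTES = 4 * 1024         # assumed 4 KB pages
L2_CACHE_BYTES = 512 * 1024   # Pentium IV configuration used in the paper

tlb_reach = TLB_ENTRIES * PAGE_BYTES            # bytes mappable without a miss
bucket_bound = min(L2_CACHE_BYTES, tlb_reach)   # TLB reach is the binding limit here
print(bucket_bound // 1024, "KB")               # -> 256 KB
```

Under these assumptions the TLB reach (256 KB), not the cache (512 KB), is the tighter constraint, which is why bucket sorting can incur TLB misses even when a bucket fits in cache.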

Fig. 7. Data TLB misses on a Pentium IV for the URL collection.

a factor of three; only the URL collection shows a slight upward trend for a cache line size of 256, because of the large number of trie nodes and the decreasing number of blocks.

The number of trie nodes is shown in Table XI. The number of trie nodes is directly related to the size of the bucket and the number of strings it can accommodate. Since, for all collections, the average string length was less than the cache line size, fewer strings can be stored in a bucket under pointer-based burstsort than under C-burstsort. The number of trie nodes in pointer-based burstsort was up to seven times that in C-burstsort.

Table XII shows the peak memory used while sorting. The results for multikey quicksort can be taken as a practical best case for a pointer-based in-place method. On average, pointer-based burstsort uses about 40% more memory. Surprisingly, C-burstsort averages very close to multikey quicksort, using slightly more memory for the duplicates and no-duplicates collections,


Fig. 8. Instructions per key, 512 KB cache, 8-way associativity, 64-byte block size. (Top) duplicates; (bottom) URLs.

Fig. 9. Cache misses with varying cache line size on Pentium IV with 512 KB L2 cache.
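The dependence on block count discussed around Figure 9 can be made concrete: with a fixed 512 KB cache, doubling the line size halves the number of blocks, so the fixed threshold of 8192 pointers overshoots the block count for any line larger than 64 bytes. The cache sizes follow the paper's configuration; the loop is purely illustrative:

```python
# Block count for a fixed-size cache as the line size grows. With the
# paper's fixed threshold of 8192 pointers per bucket, any line size
# above 64 bytes leaves fewer cache blocks than bucket entries.
CACHE_BYTES = 512 * 1024   # 512 KB L2, as in the experiments
THRESHOLD = 8192           # pointer-burstsort bucket threshold

for line_bytes in (64, 128, 256):
    blocks = CACHE_BYTES // line_bytes
    verdict = "fits" if THRESHOLD <= blocks else "exceeds block count"
    print(f"{line_bytes:3d}-byte lines: {blocks:5d} blocks ({verdict})")
```

This is why pointer-based burstsort degrades as lines grow, while C-burstsort, whose buckets are scanned contiguously, instead benefits from longer lines.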


Table XI. Count of Trie Nodes for the Largest Set Size of Each Collection

              Duplicates  No Duplicates  Genome  URL    Random
Burstsort     2,504       2,245          2,792   3,313  95
C-burstsort   371         381            369     1,340  95
CP-burstsort  500         491            508     1,499  95

Table XII. Memory Usage (in MB) for the Largest Set Size

                    MKQsort  Burstsort  C-burstsort  CP-burstsort
Duplicates          427      646        436          768
No duplicates       504      734        566          990
Genome              424      610        342          689
URL                 344      412        324          620
Random              440      593        434          948
Sorted (duplicate)  427      644        435          765

Table XIII. Time (ms) to Copy Strings in Sort Order to the Destination Array for the Largest Set Size of Each Collection on the Pentium IV

               Duplicates  No Duplicates  Genome  URL    Random
Pointer-based  8,570       10,200         10,220  4,420  10,470
Copy-based     3,750       4,570          3,350   2,080  4,240

but somewhat less for the genome, URL, and random collections. On a wider variety of data collections in random, sorted, or reverse order, we have observed that C-burstsort and multikey quicksort both use an average of 1.3 bytes of allocated memory per byte sorted. Memory usage by C-burstsort can be further reduced by using smaller buffer segments during key insertion or (with some loss of speed) by using a smaller bucket-growth factor. CP-burstsort shows the highest memory usage, since its buckets hold both string tails and record pointers and the original keys are not deleted. However, it should be compared with other stable sorts, such as the adaptive radixsorts, which also tend to have higher memory requirements.

In the experiments reported above, the timings do not include the time taken to write the strings in sort order. The time taken to write the sorted strings to the destination array for the pointer-based and copy-based methods is shown in Table XIII. For the pointer-based methods, writing the strings in sort order involves a scan traversal of the sorted array of pointers and may incur a cache miss per string access, as the strings themselves are accessed arbitrarily. For the copy-based methods, the time measured includes a traversal of the entire trie, a scan traversal of each bucket, and writing the strings to the destination array. As the strings are contiguously stored, this phase is a scan traversal and can be regarded as optimal.
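The two output phases compared in Table XIII can be contrasted in a few lines. The function names and data are illustrative; a high-level sketch like this can only indicate the access pattern, not measure it:

```python
def write_pointer_based(strings, order):
    # Pointer-based output: the indices (pointers) are in sort order, but
    # each dereference jumps to an arbitrary memory location, so each
    # string access can cost a cache miss.
    return [strings[i] for i in order]

def write_copy_based(sorted_buckets):
    # Copy-based output: each bucket is already sorted and contiguous,
    # so writing the result is a pure sequential scan.
    out = []
    for bucket in sorted_buckets:
        out.extend(bucket)
    return out

strings = ["cat", "apple", "bat"]
order = sorted(range(len(strings)), key=strings.__getitem__)  # [1, 2, 0]
print(write_pointer_based(strings, order))
print(write_copy_based([["apple"], ["bat", "cat"]]))
```

Both calls produce the same sorted array; the measured factor-of-two-plus gap in Table XIII comes entirely from the scattered versus sequential reads.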

7.1 Other Architectures

While most of our experiments used the Pentium IV, we also tested our algorithms on the older Pentium III and on the significantly different architecture


Fig. 10. PowerPC 970, relative sorting time for each method. The vertical scale is time (ms) divided by n log n. (Top) duplicates; (bottom) genome.

of the PowerPC 970. The copy-based burstsort with the BIS structure was used for these experiments. On the PowerPC, Figure 10 shows that, for the largest collection, C-burstsort is up to three times faster than pointer-based burstsort. These are dramatic improvements for such a small collection size. Similar performance was seen for the other collections.

On older processors such as the Pentium III, copying costs may offset savings in cache misses. Figure 11 shows that C-burstsort is faster than pointer-based burstsort for most collections, the exception being the URL collection. While the improvements for most collections are much smaller than on the Pentium IV or PowerPC, they are still faster than pointer-based burstsort. In other words, the trends in hardware favor the burstsorts compared to other sorting methods, and favor our new copy-based burstsorts compared to the original burstsort.

The normalized times for C-burstsort are shown in Figure 12. For all these collections, the normalized time has not grown at all with the increase in collection size. For the genome collection, the normalized time has gone down because of the increase in duplicates.


Fig. 11. Pentium III, sorting times (ms) for the largest set size for each collection.

Fig. 12. Pentium III, time for C-burstsort for all collections. The vertical scale is time (ms) divided by n log n.

8. DISCUSSION

The pointer-based burstsorts described in our previous papers are fast because, with a single access to each key, they assign related strings to buckets small enough to be sorted in cache. However, their performance is limited by three problems. First, since the buckets hold only pointers, each key must again be accessed at its original arbitrary memory location at bucket-sorting time; these random references contribute significantly to the cache and TLB misses generated by the pointer burstsorts. Second, when the original keys are reaccessed, the terminal bytes needed for final sorting are flanked by already-processed prefix bytes and bytes from adjacent, but unrelated, keys; thus, the cache lines loaded contain a substantial proportion of useless bytes. Third, buckets that burst at a fixed pointer count are not efficient when average key length varies significantly from one bucket to another; if the threshold is set low enough that buckets of long keys fit into

cache for sorting, buckets of short keys will waste a substantial amount of cache space.

We expect the fastest sorting when buckets are designed to assemble exactly the information needed for final sorting, when they use cache efficiently without wastage, and when they are as few and as large as possible (since there is an overhead cost for each bucket). C-burstsort was designed to address these goals, as well as the specific limitations of pointer-based burstsorts. Additional goals were to decrease memory requirements, to increase the locality of data processing, and, for CP-burstsort, to achieve the speed of burstsort in a stable sort suitable for processing multiple-field records.

In large part, the goals have been met. Since C-burstsort loads buckets with unexamined key tails instead of pointers, the exact data needed for final sorting is already present, and there is no need to reaccess the original memory locations of the keys at bucket-sorting time. Since key suffixes sharing a common prefix are contiguously copied into buckets, they are adjacent to related suffixes except at bucket boundaries, and cache lines loaded for bucket sorting will contain very few irrelevant bytes. Finally, buckets are grown or burst based on the cumulative length of the key tails copied to them instead of a fixed count; if keys are short, buckets automatically hold more key tails and cache space is not wasted. Thus, the specific problems of pointer-based burstsorts are all addressed by inserting string tails instead of pointers.

In more general terms, C-burstsort comes close to assembling exactly the information needed for final sorting and using cache efficiently without wastage. The exception is the terminal bytes of suffixes that are sorted before all their bytes are read; these are not needed, but there is no time-saving way to identify such bytes in advance.
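The effect of bursting on cumulative tail length rather than on a fixed pointer count is easy to quantify: bucket capacity adapts to key length automatically. A sketch, where the byte budget is an illustrative value rather than the paper's tuned threshold:

```python
BUCKET_BUDGET = 64 * 1024  # illustrative byte budget per bucket

def tails_per_bucket(avg_tail_bytes):
    # Byte-budget bursting: shorter tails mean more keys fit per bucket,
    # so cache space is not wasted on buckets of short keys.
    return BUCKET_BUDGET // avg_tail_bytes

print(tails_per_bucket(8))   # short tails -> many keys per bucket
print(tails_per_bucket(64))  # long tails  -> fewer keys per bucket
# A fixed pointer count (e.g., 8192) would give both cases the same
# capacity, wasting most of the budget on the short-key buckets.
```

Under this budget, 8-byte tails pack 8192 keys per bucket while 64-byte tails pack 1024, which is exactly the adaptivity the fixed-count scheme lacks.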
The goal of making buckets as few and as large as possible is less clear-cut; certainly, C-burstsort enables a given amount of cache memory to hold and process more short keys, but it is still the case that most buckets will be neither full nor as large as cache. It seems likely that control of bucket count and size can be further improved. The basic change in C-burstsort, filling buckets with suffixes rather than pointers, also greatly increases the spatial locality of data. Finally, CP-burstsort provides stable sorting with much of the speed of C-burstsort.

An instance in which copy-based burstsorts are slower than pointer-based burstsorts is the case of long random strings, where copy-based burstsorts are weighed down by the cost of moving long terminal sequences that do not affect sort order. An alternative is to copy a record pointer and a short fixed-length tail segment to buckets. We found that this hybrid paging scheme, CPL-burstsort, was up to three times faster than CP-burstsort on 100-byte random strings, while using less memory.

Aside from long strings of random data, we do not know whether there are pathological forms or orders of string data that severely degrade the performance of copy-based burstsorts. We know that free bursts can improve performance, and that random sampling can do better than free bursts in some cases, but we do not know the best scheme for bucket growth and bursting that will produce the fewest buckets and the most efficient use of cache. The burst trie structure

results in many buckets that are not full and, thus, much smaller than cache; development of a modified dynamic structure that efficiently partitions large buckets while not splitting them into as many small ones is being investigated.

9. CONCLUSIONS

We have proposed variants of burstsort that are based on copying the strings into buckets. This improves the spatial locality of the strings compared to pointer-based burstsort. Efficiency was measured in experiments on different architectures, with both real and artificial collections at several data sizes. The results show substantial reductions in sorting time compared to pointer-based sorting algorithms. These copy-based versions are efficient on modern processors, where locality is an increasingly important issue. The time to sort and the number of L2 cache misses incurred are one-half those of pointer-based burstsort. Random accesses to strings are completely eliminated, as all string accesses in the copy-based versions involve a scan traversal of an array. Reducing unnecessary random accesses to strings improves the page locality of the copy-based variants, resulting in far fewer TLB misses. The scan traversal of the buckets implies that the copy-based algorithms should cause fewer misses in the lower levels of the cache hierarchy, while the use of free bursts to rapidly build a skeleton tree structure minimizes copying costs when buckets are burst at large sizes.

This work has shown that algorithms that improve the spatial locality of strings can yield substantial improvements in performance. While the absolute and relative sorting speeds observed will depend on the cache architecture of a particular system, we expect these algorithms to be the fastest available for sorting large collections of string keys when memory latency is an important factor.

There is still significant scope to improve copy-based burstsorts. More sophisticated control of bucket growth may yield better results than our simplistic doubling scheme. The traversal and output in C-burstsort can be integrated into a single pass; the strings in a bucket can be output after the bucket is sorted.
Memory requirements can be substantially reduced by reading keys directly into the trie structure and outputting directly from each bucket after it is sorted. This lowers C-burstsort's average memory usage from 1.3 to about 0.84 bytes allocated per byte sorted. Further investigation is needed to determine the most compact way to implement trie nodes for the largest collection sizes. Finally, use of special-purpose memory-management code and tuning of algorithm parameters to the TLB sizes of particular systems may further increase efficiency.

However, our new copy-based burstsorts are already by far the most efficient algorithms for sorting large string collections on cache architectures. They demonstrate that cache-awareness can provide large improvements over previous methods. In our experiments, the gain in sorting speed between quicksort (1961) and adaptive radixsort (1998) is less than the additional improvement achieved by the copying burstsorts in 2005. These are dramatic results.


ACKNOWLEDGMENTS

David Ring would like to acknowledge James Barbetti for insightful remarks regarding the performance of pointerized sorts on cache versus noncache architectures. This research was supported by the Australian Research Council.

REFERENCES

Andersson, A. and Nilsson, S. 1998. Implementing radixsort. ACM Jour. of Experimental Algorithmics 3, 7.
Benson, D., Lipman, D. J., and Ostell, J. 1993. GenBank. Nucleic Acids Research 21, 13, 2963–2965.
Bentley, J. and Sedgewick, R. 1997. Fast algorithms for sorting and searching strings. In Proc. Annual ACM-SIAM Symp. on Discrete Algorithms. Society for Industrial and Applied Mathematics, New Orleans, LA, 360–369.
Bentley, J. L. and McIlroy, M. D. 1993. Engineering a sort function. Software—Practice and Experience 23, 11, 1249–1265.
Dongarra, J., London, K., Moore, S., Mucci, S., and Terpstra, D. 2001. Using PAPI for hardware performance monitoring on Linux systems. In Proc. Conf. on Linux Clusters: The HPC Revolution. Urbana, IL.
Harman, D. 1995. Overview of the second text retrieval conference (TREC-2). Inf. Process. Manage. 31, 3, 271–289.
Hawking, D., Craswell, N., Thistlewaite, P., and Harman, D. 1999. Results and challenges in web search evaluation. Computer Networks 31, 11–16, 1321–1330.
Heinz, S., Zobel, J., and Williams, H. E. 2002. Burst tries: A fast, efficient data structure for string keys. ACM Transactions on Information Systems 20, 2, 192–223.
Hennessy, J. L. and Patterson, D. A. 2002. Computer Architecture: A Quantitative Approach, 3rd ed. Morgan Kaufmann Publishers, San Mateo, CA.
Hoare, C. A. R. 1961. Algorithm 64: Quicksort. Communications of the ACM 4, 7, 321.
Hoare, C. A. R. 1962. Quicksort. Computer Jour. 5, 1, 10–15.
Jimenez-Gonzalez, D., Navarro, J., and Larriba-Pey, J. L. 2003. CC-radix: A cache conscious sorting based on radix sort. In Proc. Euromicro Workshop on Parallel, Distributed and Network-based Processing, A. Clematis, Ed. IEEE Computer Society Press, Los Alamitos, CA, 101–108.
LaMarca, A. and Ladner, R. E. 1999. The influence of caches on the performance of sorting. Jour. of Algorithms 31, 1, 66–104.
McIlroy, P. M., Bostic, K., and McIlroy, M. D. 1993. Engineering radix sort. Computing Systems 6, 1, 5–27.
Nilsson, S. 1996. Radix sorting & searching. Ph.D. thesis, Department of Computer Science, Lund University, Lund, Sweden.
Rahman, N. and Raman, R. 2000. Analysing cache effects in distribution sorting. ACM Jour. of Experimental Algorithmics 5, 14.
Rahman, N. and Raman, R. 2001. Adapting radix sort to the memory hierarchy. ACM Jour. of Experimental Algorithmics 6, 7.
Seward, J. 2001. Valgrind—memory and cache profiler. http://developer.kde.org/~sewardj/docs-1.9.5/cg_techdocs.html.
Sinha, R. 2004. Using compact tries for cache-efficient sorting of integers. In Proc. International Workshop on Efficient and Experimental Algorithms, C. C. Ribeiro, Ed. Lecture Notes in Computer Science, vol. 3059. Springer-Verlag, New York, 513–528.
Sinha, R. and Zobel, J. 2004a. Cache-conscious sorting of large sets of strings with dynamic tries. ACM Jour. of Experimental Algorithmics 9, 1.5.
Sinha, R. and Zobel, J. 2004b. Using random sampling to build approximate tries for efficient string sorting. In Proc. International Workshop on Efficient and Experimental Algorithms, C. C. Ribeiro, Ed. Lecture Notes in Computer Science, vol. 3059. Springer-Verlag, New York, 529–544.
Wickremesinghe, R., Arge, L., Chase, J., and Vitter, J. S. 2002. Efficient sorting using registers and caches. ACM Jour. of Experimental Algorithmics 7, 9.
Xiao, L., Zhang, X., and Kubricht, S. A. 2000. Improving memory performance of sorting algorithms. ACM Jour. of Experimental Algorithmics 5, 3.