Cache-Oblivious String B-Trees
Total Page:16
File Type:pdf, Size:1020Kb
Cache-Oblivious String B-trees Michael A. Bender Martin Farach-Colton Bradley C. Kuszmaul Stony Brook University Rutgers MIT CSAIL [email protected] [email protected] [email protected] ABSTRACT Keys often share large prefixes, and are thus typically stored • B-trees are the data structure of choice for maintaining searchable using front compression [5, 15, 26, 32] within blocks. Front data on disk. However, B-trees perform suboptimally compression exhibits a tradeoff between the compression factor and the memory locality for decompression. B-trees when keys are long or of variable length, • make this tradeoff suboptimally. when keys are compressed, even when using front compres- • sion, the standard B-tree compression scheme, Range queries use disk hardware inefficiently because each • leaf block fetched may require a random disk seek. Random for range queries, and • block accesses perform two orders of magnitude more slowly with respect to memory effects such as disk prefetching. than sequential block accesses for disk. Seeks to nearby • This paper presents a cache-oblivious string B-tree (COSB-tree) tracks are nearly an order of magnitude faster than random data structure that is efficient in all these ways: seeks. The COSB-tree searches asymptotically optimally and in- B-trees do not take advantage of memory effects such as disk • serts and deletes nearly optimally. • prefetching, especially when variable amount of prefetching It maintains an index whose size is proportional to the front- is performed. • compressed size of the dictionary. Furthermore, unlike stan- Inefficiency in each of these respects can reduce performance sig- dard front-compressed strings, keys can be decompressed in nificantly. Although some of these issues are addressed individu- a memory-efficient manner. ally in the literature, as discussed below, there are no previously It performs range queries with no extra disk seeks; in con- known search structures that address all these issues effectively. • trast, B-trees incur disk seeks when skipping from leaf block This paper presents a cache-oblivious string B-tree (COSB-tree) to leaf block. data structure that is efficient in all four respects. The rest of this It utilizes all levels of a memory hierarchy efficiently and section states our results. • makes good use of disk locality by using cache-oblivious lay- out strategies. Variable-length keys. Traditional B-trees do not handle large Categories and Subject Descriptors: E.1 [Data Structures]: Ar- keys well. Typically, they pack small keys in blocks, but for large rays, Trees; E.5 [Files]: Sorting/searching; H.3.3 [Information keys the pack a pointer to the key, which is stored elsewhere. They Storage and Retrieval]: choose arbitrarily which key to promote to a parent when a block is split and often bias their choice toward promoting long keys. In General Terms: Algorithms, Experimentation, Performance, The- principle, it is better to store short keys near the top of the tree, but it ory. is also better to split the search space into nearly even pieces. There Keywords: cache oblivious string B-tree, locality preserving front are no known techniques for addressing both issues simultaneously compression, packed-memory array, range query, rebalance. in a dynamically changing B-tree. The string B-tree [19] handles keys of unbounded size effi- ciently. In the string B-tree, an insertion of a new key κ uses 1. INTRODUCTION κ κ O(1 + =B + logB N) block transfers, where is the length For over three decades, the B-tree [4,16] has been the data struc- of key kκ andk N is the number of keys in the tree.kThisk block trans- ture of choice for maintaining searchable data on disk. B-trees fer complexity is nearly optimal in the traditional Disk Access Ma- maintain an ordered set of keys and allow insertions, deletions, chine model [1], where there is a single fixed memory-transfer size, + searches, and range queries. Most implementations employ B - B, since it trivially takes at least κ =B transfers to read κ, and trees [16,26], in which the full keys are all stored in the leaves, but k k O(logB N) is the cost of insertion in a B-tree, even when all keys for convenience we refer to all the variations as “B-trees.” have unit length. See Figure 1 for a glossary of notation. Traditional B-trees perform suboptimally in several respects: The theoretical and practical performance of B-trees de- κ • Key D Set of keys (a dictionary) grades when keys are long or vary in length. Q Query result (a key set) D Number of keys in D B Block size j Dj Sum of key lengths in D This research was supported in part by the Singapore-MIT Al- N N = D kDk Size of front compressed D liance and NSF Grants ACI-0324974, CCR-0208670, and CCR- j j hh ii 9820879. Permission to make digital or hard copies of all or part of this work for personal or Figure 1: Glossary of symbols. classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute A range query for keys κ and κ0 returns all keys µ such that to lists, requires prior specific permission and/or a fee. κ µ κ lexicographically. In a string B-tree, a range query uses PODS'06 June 26–28, 2006, Chicago, Illinois USA 0 O(≤1+≤( κ + κ + Q )=B+log N) block transfers, where Q is Copyright 2006 ACM 1-595593-318-2/06/0003 ...$5.00. k k k 0k k k B the set of result keys and Q is the sum of their lengths.1 A search there are other levels between these. For example, the disk cache for a single key κ is a specialk k case of a range query and therefore sits between main memory and the rest of the disk. When the disk κ uses O(1 + =B + logB N) block transfers. system services a request for a disk block, the disk first checks The string-B-treek k paper [19] also describes a prefix-search opera- whether the requested block is in the disk cache, and if so, re- tion, but a prefix search is a special case of a range query. A slower turns the block. Otherwise, the disk seeks to the appropriate track, version of the string B-tree [19] also supports substring queries. reads the block, and prefetches the track. Thus, the effective block- Since our goal is to support heterogeneous key sizes in traditional transfer size may be much larger than a disk block. Indeed, sequen- database applications, we do not address substring queries here and tial disk-block accesses typically run over two orders of magnitude compare our results to the faster string B-tree that does not support faster than random disk-block accesses. However, disk tracks vary substring queries. in size and heavy loads can pollute the relatively small disk cache The string B-tree is deterministic and the performance bounds quickly, evicting blocks before they have a chance to be requested. given are worst-case (that is, not amortized), but it does not sup- Therefore the effective block-transfer size between these two levels port compression, disk prefetching, or disk-seek-efficient range of memory is highly variable. queries. The cache-oblivious string B-tree we present here also The cache-oblivious (CO) model [21] enables us to design algo- supports keys of unbounded length efficiently. It is amortized for rithms that achieve data locality simultaneously for all block sizes. updates and randomized, but is efficient with respect to compres- Such algorithms are on-line optimal with respect to changing block sion, prefetching, and disk seeks. sizes in the disk system as well as simultaneously optimal at all levels of the memory hierarchy. In the cache-oblivious model, the Compression. Many practical B-trees (e.g. [30]) employ front objective is to minimize the number of transfers between two lev- compression [5, 15, 26, 32] within blocks to reduce the amount of els of the hierarchy. However, unlike traditional external-memory memory required for the keys, but string B-trees [19] do not. Thus, models [1], the parameters B, the block size, and M, the main- the size of a string B-tree is Θ( D ). Front compression reduces memory size, are unknown to the coder or the algorithm. The main space by storing each key κ as ka pairk `;s , where ` is the length idea of the CO model is that if it can be proved that some algo- of the longest common prefix betweenhκ andi κ’s predecessor, and rithm performs a nearly optimal number of memory transfers in s is the suffix of κ starting at position ` + 1. This strategy pays off a two-level model with unknown parameters, then the algorithm when keys are stored lexicographically, which maximizes the aver- also performs a nearly optimal number of memory transfers on age longest common prefix between adjacent keys. Although front any unknown, multilevel memory hierarchy. Cache-oblivious data compression is not optimal with respect to the entropy bound of the structures are more portable than traditional external-memory data strings, it is used in many implementations of B-trees. structures, since they avoid any tuning parameters. CO algorithms To decode a key, one decodes the previous key. This procedure do not try to measure the machine’s cache and adjust their behavior might require scanning back to the beginning of the entire dictio- accordingly.