B-Tries for Disk-Based String Management

The VLDB Journal (2009) 18:157–179 DOI 10.1007/s00778-008-0094-1 REGULAR PAPER B-tries for disk-based string management Nikolas Askitis · Justin Zobel Received: 11 June 2006 / Revised: 14 December 2007 / Accepted: 9 January 2008 / Published online: 11 March 2008 © Springer-Verlag 2008 Abstract A wide range of applications require that large tries that are efficient in memory cannot be directly mapped quantities of data be maintained in sort order on disk. The to disk without incurring high costs [52]. B-tree, and its variants, are an efficient general-purpose disk- The best-known structure for this task is the B-tree, pro- based data structure that is almost universally used for this posed by Bayer and McCreight [9], and its variants. In its task. The B-trie has the potential to be a competitive alter- most practical form, the B-tree is a multi-way balanced tree native for the storage of data where strings are used as keys, comprised of two types of nodes: internal and leaf. Internal but has not previously been thoroughly described or tested. nodes are used as an index or a road-map to leaf nodes, which We propose new algorithms for the insertion, deletion, and contain the data. The B-tree is considered to be the most effi- equality search of variable-length strings in a disk-resident cient data structure for maintaining sorted data on disk [3,43, B-trie, as well as novel splitting strategies which are a criti- 83]. A key characteristic is the use of a balanced tree struc- ( ) cal element of a practical implementation. We experimentally ture, which guarantees worst-case O logB N performance compare the B-trie against variants of B-tree on several large for search of N keys with a branching factor (node fan-out) sets of strings with a range of characteristics. Our results of B, regardless of the distribution of data. This access bound demonstrate that, although the B-trie uses more memory, it is often significantly better than the performance of an is faster, more scalable, and requires less disk space. in-memory data structure using virtual memory [3]. In prac- tice, due to its high branching factor, typically the total vol- Keywords B-tree · Burst trie · Secondary storage · ume of internal nodes is very small and traversal requires Vocabulary accumulation · Word-level indexing · only a single disk access. External hash tables [21,44,58, Data structures 67] are also efficient data structures, but cannot guarantee a bounded worst-case cost, nor can they maintain strings in sort order, which would be required for efficient range 1 Introduction search. In addition to its widespread use in standard database sys- Efficient storage and retrieval of data is one of the fundamen- tems [23], the B-tree has many applications, including dat- tal problems in computer science. Many applications such as abases, information retrieval, and genomic databases [76]. databases and search engines, are built on infrastructures that B-trees have been used to efficiently manage and retrieve require efficient access to large volumes of data. However, the large vocabularies that are associated with text databases [12]. choice of data structure is limited, as the majority of trees and Some file systems, such as Linux Reiser and Windows NTFS, are also based on B-trees. N. Askitis (B) School of Computer Science and Information Technology, The B-trie—a disk-resident trie—has the potential to be a RMIT University, Melbourne, Australia competitive alternative for the sorted storage of data where e-mail: [email protected]; [email protected] strings are used as keys. While the concept of the B-trie has been outlined in previous literature [89], it has not previ- J. Zobel NICTA, University of Melbourne, Parkville, Australia ously been formally described, explored, or tested. In par- e-mail: [email protected] ticular, there are no discussions on node splitting strategies, 123 158 N. Askitis, J. Zobel which are a critical element of a practical implementation. In the discussions that follow, we address variants of The B-trie proposed by Szpankowski [89] is simply a static B-trees that are available for disk-based string management, trie indexing a set of buckets that store up to b keys. and continue our discussions on disk-resident suffix trees. In this paper, we propose new algorithms for the inser- We then propose new algorithms for the B-trie and conduct tion, deletion, and equality search of variable-length strings thorough experiments to compare our B-trie against effi- in a disk-based B-trie, for use in common string processing cient variants of B-trees. These variants include a prefix tasks such as vocabulary accumulation and dictionary man- B+-tree [10], where internal nodes only store the shortest agement. Our variant of B-trie is effectively a novel applica- distinct prefix of strings that are promoted from leaf nodes, tion of a burst trie [51] to disk, and is therefore composed of and the Berkeley B+-tree [77], which is a high-performance two types of nodes: trie and bucket. In a burst trie, when a open-source B+-tree implementation. Other variants we con- bucket is deemed as being full, it is burst into at most A new sider are the string B-tree [36], which can maintain unboun- buckets that can be randomly accessed during the bursting ded-length strings efficiently on disk, and a cache-oblivious phase; A represents the size of the alphabet used, which in string B-tree [17,20], which is theoretically designed to per- our case, is the 128 characters of the ASCII table. Bursting form well on all levels of the memory hierarchy (including is efficient in memory because buckets can be of variable- disk), without prior knowledge of the size and characteristics size and, as shown by Heinz et al. [51], their random access of each level [64]. during the bursting phase has little to no impact on perfor- The B-trie was, in most cases, superior to the B+-trees mance. On disk, however, a bucket must be kept to a size that with typical speed gains of 5–15% and up to 50% in the is a multiple of the disk-block size used by the underlying presence of skew in the data distribution. The B-trie creates architecture (for efficiency purposes). However this implies a larger index structure than the B+-trees, and thus requires that bursting a bucket on disk will create up to A new disk- more space to buffer its trie nodes in memory, but the amount block sized buckets, which is a waste of space. A further of space involved is small. In most cases, the overall disk issue is that these buckets can be accessed at random during space required by the B-trie was less than the equivalent the bursting phase, which can attract unacceptable costs due B+-tree, due to the elimination of shared prefixes in buckets to excessive disk accesses. that compensated for the space consumed by trie nodes. Over- A major contribution is therefore the development of an all, our results show that the B-trie is a superior data struc- appropriate approach to bucket splitting, which allows the ture to the B-tree, for the task of efficient disk-based string B-trie to reside efficiently on disk, by both minimizing the management. number of buckets created and the random access caused during the splitting process. In this approach, we classify buckets as either hybrid or pure, which differ in the man- 2 B-trees ner of how they are split. These novel elements of pure and hybrid buckets are how the B-trie can reside efficiently on The B-tree is a balanced multi-way disk-based tree designed disk, while giving a B-tree-like organization of data. Unlike to reduce the number of disk access required to manage a the B-tree, however, the B-trie is an unbalanced structure, large set of strings. The B-tree was proposed by Bayer and but as we demonstrate later, this has little or no impact on McCreight [9] to solve the problem of external data man- actual performance, which in our case, involves strings with agement. The B-tree employs a similar balancing scheme to an average length of up to 30 characters. that of AVL trees [40], which, however, cannot be efficiently Existing disk-resident trie structures, such as the exter- sustained on disk, as changes are not restricted to a single nal suffix tree [65,90], are full-text indexes and are therefore path of the tree (from the root to the candidate node). not suited for common string processing tasks, due to exces- The B-tree is one of the most efficient disk-based data sive space requirements and high update costs when moved structures for external data management [43,78,83,91], as it ontodisk[47,90]. The B-trie, in contrast, maintains a word- offers four properties that are desirable for disk-based appli- level index, and is thus not as powerful as a full-text index. cations. First, even with large volumes of data, the height Hence, like the B-tree, the B-trie is ideally suited for common of the B-tree remains low, due to the high branching fac- string processing tasks that typically involve basic string pro- tor which minimizes the number of nodes accessed. Second, cessing operations such as insertion, deletion, and equality with the exception of the root node, all nodes are guaranteed match on individual strings.

B-Tries for Disk-Based String Management

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support