
The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases

Viktor Leis, Alfons Kemper, Thomas Neumann
Fakultät für Informatik, Technische Universität München
Boltzmannstraße 3, D-85748 Garching
@in.tum.de

Abstract—Main memory capacities have grown up to a point where most databases fit into RAM. For main-memory systems, index performance is a critical bottleneck. Traditional in-memory data structures like balanced binary search trees are not efficient on modern hardware, because they do not optimally utilize on-CPU caches. Hash tables, also often used for main-memory indexes, are fast but only support point queries. To overcome these shortcomings, we present ART, an adaptive radix tree (trie) for efficient indexing in main memory. Its lookup performance surpasses highly tuned, read-only search trees, while supporting very efficient insertions and deletions as well. At the same time, ART is very space efficient and solves the problem of excessive worst-case space consumption, which plagues most radix trees, by adaptively choosing compact and efficient data structures for internal nodes. Even though ART’s performance is comparable to hash tables, it maintains the data in sorted order, which enables additional operations like range scan and prefix lookup.

Fig. 1. Adaptively sized nodes in our radix tree.

I. INTRODUCTION

After decades of rising main memory capacities, even large transactional databases fit into RAM. When most data is cached, traditional database systems are CPU bound because they spend considerable effort to avoid disk accesses. This has led to very intense research and commercial activities in main-memory database systems like H-Store/VoltDB [1], SAP HANA [2], and HyPer [3]. These systems are optimized for the new hardware landscape and are therefore much faster. Our system HyPer, for example, compiles transactions to machine code and gets rid of buffer management, locking, and latching overhead. For OLTP workloads, the resulting execution plans are often sequences of index operations. Therefore, index efficiency is the decisive performance factor.

More than 25 years ago, the T-tree [4] was proposed as an in-memory indexing structure. Unfortunately, the dramatic processor architecture changes have rendered T-trees, like all traditional binary search trees, inefficient on modern hardware. The reason is that the ever growing CPU cache sizes and the diverging main memory speed have made the underlying assumption of uniform memory access time obsolete. B+-tree variants like the cache sensitive B+-tree [5] have more cache-friendly memory access patterns, but require more expensive update operations. Furthermore, the efficiency of both binary and B+-trees suffers from another feature of modern CPUs: Because the result of comparisons cannot be predicted easily, the long pipelines of modern CPUs stall, which causes additional latencies after every second comparison (on average).

These problems of traditional search trees were tackled by recent research on data structures specifically designed to be efficient on modern hardware architectures. The k-ary search tree [6] and the Fast Architecture Sensitive Tree (FAST) [7] use data level parallelism to perform multiple comparisons simultaneously with Single Instruction Multiple Data (SIMD) instructions. Additionally, FAST uses a data layout which avoids cache misses by optimally utilizing cache lines and the Translation Lookaside Buffer (TLB). While these optimizations improve search performance, both data structures cannot support incremental updates. For an OLTP database system which necessitates continuous insertions, updates, and deletions, an obvious solution is a differential file (delta) mechanism, which, however, will result in additional costs.

Hash tables are another popular main-memory data structure. In contrast to search trees, which have O(log n) access time, hash tables have expected O(1) access time and are therefore much faster in main memory. Nevertheless, hash tables are less commonly used as database indexes. One reason is that hash tables scatter the keys randomly, and therefore only support point queries. Another problem is that most hash tables do not handle growth gracefully, but require expensive reorganization upon overflow with O(n) complexity. Therefore, current systems face the unfortunate trade-off between fast hash tables that only allow point queries and fully-featured, but relatively slow, search trees.

A third class of data structures, known as trie, radix tree, prefix tree, and digital search tree, is illustrated in Figure 1. These data structures directly use the digital representation of keys instead of hashing or comparing keys. The underlying idea is similar to a thumb-index found in many alphabetically ordered dictionary books: The first character of a word can directly be used to jump to all words starting with that character. In a computer, this process can be repeated with the next characters until a match is found. As a consequence of this process, all operations have O(k) complexity where k is the length of the key. In the era of extremely large data sets, when n is growing faster than k, having a time complexity independent of n is very attractive.

In this work, we present the adaptive radix tree (ART) which is a fast and space-efficient in-memory indexing structure specifically tuned for modern hardware. While most radix trees require to trade off tree height versus space efficiency by setting a globally valid fanout parameter, ART adapts the representation of every individual node, as exemplified in Figure 1. By adapting each inner node locally, it optimizes global space utilization and access efficiency at the same time. Nodes are represented using a small number of efficient and compact data structures, chosen dynamically depending on the number of child nodes. Two additional techniques, path compression and lazy expansion, allow ART to efficiently index long keys by collapsing nodes and thereby decreasing the tree height.

A useful property of radix trees is that the order of the keys is not random as in hash tables; rather, the keys are ordered bitwise lexicographically. We show how typical data types can be reordered efficiently to support all operations that require the data to be ordered (e.g., range scan, prefix lookup, top-k, minimum, and maximum).

This work makes the following contributions:
• We develop the adaptive radix tree (ART), a fast and space-efficient general-purpose indexing structure for main-memory database systems.
• We prove that the space consumption per key is bounded to 52 bytes, even for arbitrarily long keys. We show experimentally that the space consumption is much lower in practice, often as low as 8.1 bytes per key.
• We describe how common built-in data types can be stored in radix trees while retaining their order.
• We experimentally evaluate ART and other state of the art main-memory index structures, including the most efficient search tree proposals.
• By integrating ART into the main-memory database system HyPer and running the TPC-C benchmark, we prove its superior end-to-end performance in a “real-life” transaction processing application.

The rest of this paper is organized as follows. The next section discusses related work. Section III presents the adaptive radix tree and analyzes its space consumption. In Section IV we introduce the concept of binary-comparable keys and show how common built-in types can be transformed. Section V describes experimental results including a number of micro benchmarks and the TPC-C benchmark. Finally, Section VI concludes and discusses future work.

II. RELATED WORK

In disk-based database systems, the B+-tree [8] is ubiquitous [9]. It retrieves large blocks from disk to reduce the number of accesses. Red-black trees [10], [11] and T-trees [4] were proposed for main-memory database systems. Rao and Ross [5] showed that T-trees, like all binary search trees, suffer from poor cache behavior and are therefore often slower than B+-trees on modern hardware. As an alternative, they proposed a cache conscious B+-tree variant, the CSB+-tree [12]. Further cache optimizations for B+-trees were surveyed by Graefe and Larson [13].

Modern CPUs allow to perform multiple comparisons with a single SIMD instruction. Schlegel et al. [6] proposed k-ary search which reduces the number of comparisons from $\log_2 n$ to $\log_K n$ where K is the number of keys that fit into one SIMD vector. In comparison with binary trees, this technique also reduces the number of cache misses, because K comparisons are performed for each cache line loaded from main memory. Kim et al. extended this research by proposing FAST, a methodology for laying out binary search trees in an architecture sensitive way [7]. SIMD, cache line, and page blocking are used to optimally use the available cache and memory bandwidth. Additionally, they proposed to interleave the stages of multiple queries in order to increase the throughput of their search algorithm. FAST trees and the k-ary search trees are pointer-free data structures which store all keys in a single array and use offset calculations to traverse the tree. While this representation is efficient and saves space, it also implies that no online updates are possible. Kim et al. also presented a GPU implementation of FAST and compared its performance to modern CPUs. Their results show that, due to the higher memory bandwidth, GPUs can achieve higher throughput than CPUs. Nevertheless, the use of GPUs as dedicated indexing hardware is not yet practical because memory capacities of GPUs are limited, communications cost with main memory is high, and hundreds of parallel queries are needed to achieve high throughput. We therefore focus on index structures for CPUs.

The use of tries for indexing character strings has been studied extensively. The two earliest variants use lists [14] and arrays [15] as internal node representations. Morrison introduced path compression in order to store long strings efficiently [16]. Knuth [17] analyzes these early trie variants. The burst trie is a more recent proposal which uses trie nodes for the upper levels of the tree, but switches to a linked list once a subtree has only few elements. The HAT-trie [18] improves performance by replacing the linked list with a hash table. While most research focused on indexing character strings, our goal is to index other data types as well. Therefore, we prefer the term radix tree over trie because it underscores the similarity to the radix sort algorithm and emphasizes that arbitrary data can be indexed instead of only character strings.

The Generalized Prefix Tree was proposed by Boehm et al. [19] as a general-purpose indexing structure. It is a radix tree with a fanout of 16 and was a finalist in the SIGMOD Programming Contest 2009. The KISS-Tree [20] is a more efficient radix tree proposal with only three levels, but can only store 32 bit keys. It uses an open addressing scheme for the first 16 bits of the key and relies on the virtual memory system to save space. The second level, responsible for the

next 10 bits, uses an array representation and the final level compresses 6 bits using a bit vector. The idea of dynamically changing the internal node representation is used by the Judy array which was developed at Hewlett-Packard research labs [21], [22].

Graefe discusses binary-comparable (“normalized”) keys, e.g. [23], as a way of simplifying and speeding up key comparisons. We use this concept to obtain meaningful order for the keys stored in radix trees.

III. ADAPTIVE RADIX TREE

This section presents the adaptive radix tree (ART). We start with some general observations on the advantages of radix trees over comparison-based trees. Next, we motivate the use of adaptive nodes by showing that the space consumption of conventional radix trees can be excessive. We continue with describing ART and algorithms for search and insertion. Finally, we analyze the space consumption.

A. Preliminaries

Radix trees have a number of interesting properties that distinguish them from comparison-based search trees:
• The height (and complexity) of radix trees depends on the length of the keys but in general not on the number of elements in the tree.
• Radix trees require no rebalancing operations and all insertion orders result in the same tree.
• The keys are stored in lexicographic order.
• The path to a leaf node represents the key of that leaf. Therefore, keys are stored implicitly and can be reconstructed from paths.

Radix trees consist of two types of nodes: Inner nodes, which map partial keys to other nodes, and leaf nodes, which store the values corresponding to the keys. The most efficient representation of an inner node is as an array of $2^s$ pointers. During tree traversal, an s bit chunk of the key is used as the index into that array and thereby determines the next child node without any additional comparisons. The parameter s, which we call span, is critical for the performance of radix trees, because it determines the height of the tree for a given key length: A radix tree storing k bit keys has $\lceil k/s \rceil$ levels of inner nodes. With 32 bit keys, for example, a radix tree using s = 1 has 32 levels, while a span of 8 results in only 4 levels.

Because comparison-based search trees are the prevalent indexing structures in database systems, it is illustrative to compare the height of radix trees with the number of comparisons in perfectly balanced search trees. While each comparison rules out half of all values in the best case, a radix tree node can rule out more values if s > 1. Therefore, radix trees have smaller height than binary search trees for $n > 2^{k/s}$. This relationship is illustrated in Figure 2 and assumes that keys can be compared in O(1) time. For large keys, comparisons actually take O(k) time and therefore the complexity of search trees is O(k log n), as opposed to the radix tree complexity of O(k). These observations suggest that radix trees, in particular with a large span, can be more efficient than traditional search trees.

Fig. 2. Tree height of perfectly balanced binary search trees and radix trees. (Plot of tree height against tree size in # keys, log scale, for a perfect BST, a radix tree with s=4, and a radix tree with s=8.)

B. Adaptive Nodes

As we have seen, from a (pure lookup) performance standpoint, it is desirable to have a large span. When arrays of pointers are used to represent inner nodes, the disadvantage of a large span is also clear: Space usage can be excessive when most child pointers are null. This tradeoff is illustrated in Figure 3 which shows the height and space consumption for different values of the span parameter when storing 1M uniformly distributed 32 bit integers. As the span increases, the tree height decreases linearly, while the space consumption increases exponentially. Therefore, in practice, only some values of s offer a reasonable tradeoff between time and space. For example, the Generalized Prefix Tree (GPT) uses a span of 4 bits [19], and the radix tree used in the Linux kernel (LRT) uses 6 bits [24]. Figure 3 further shows that our adaptive radix tree (ART), at the same time, uses less space and has smaller height than radix trees that only use homogeneous array nodes.

Fig. 3. Tree height and space consumption for different values of the span parameter s when storing 1M uniformly distributed 32 bit integers. Pointers are 8 byte long and nodes are expanded lazily.

The key idea that achieves both space and time efficiency is to adaptively use different node sizes with the same, relatively large span, but with different fanout. Figure 4 illustrates this idea and shows that adaptive nodes do not affect the structure (i.e., height) of the tree, only the sizes of the nodes. By reducing space consumption, adaptive nodes allow to use a larger span and therefore increase performance too.

Fig. 4. Illustration of a radix tree using array nodes (left) and our adaptive radix tree ART (right).

In order to efficiently support incremental updates, it is too expensive to resize nodes after each update. Therefore, we use a small number of node types, each with a different fanout. Depending on the number of non-null children, the appropriate node type is used. When the capacity of a node is exhausted due to insertion, it is replaced by a larger node type. Correspondingly, when a node becomes underfull due to key removal, it is replaced by a smaller node type.

C. Structure of Inner Nodes

Conceptually, inner nodes map partial keys to child pointers. Internally, we use four data structures with different capacities. Given the next key byte, each data structure allows to efficiently find, add, and remove a child node. Additionally, the child pointers can be scanned in sorted order, which allows to implement range scans. We use a span of 8 bits, corresponding to partial keys of 1 byte and resulting in a relatively large fanout. This choice also has the advantage of simplifying the implementation, because bytes are directly addressable which avoids bit shifting and masking operations.

The four node types are illustrated in Figure 5 and are named according to their maximum capacity. Instead of using a list of key/value pairs, we split the list into one key part and one pointer part. This allows to keep the representation compact while permitting efficient search:

Node4: The smallest node type can store up to 4 child pointers and uses an array of length 4 for keys and another array of the same length for pointers. The keys and pointers are stored at corresponding positions and the keys are sorted.

Node16: This node type is used for storing between 5 and 16 child pointers. Like the Node4, the keys and pointers are stored in separate arrays at corresponding positions, but both arrays have space for 16 entries. A key can be found efficiently with binary search or, on modern hardware, with parallel comparisons using SIMD instructions.

Node48: As the number of entries in a node increases, searching the key array becomes expensive. Therefore, nodes with more than 16 pointers do not store the keys explicitly. Instead, a 256-element array is used, which can be indexed with key bytes directly. If a node has between 17 and 48 child pointers, this array stores indexes into a second array which contains up to 48 pointers. This indirection saves space in comparison to 256 pointers of 8 bytes, because the indexes only require 6 bits (we use 1 byte for simplicity).

Node256: The largest node type is simply an array of 256 pointers and is used for storing between 49 and 256 entries. With this representation, the next node can be found very efficiently using a single lookup of the key byte in that array. No additional indirection is necessary. If most entries are not null, this representation is also very space efficient because only pointers need to be stored.

Fig. 5. Data structures for inner nodes. In each case the partial keys 0, 2, 3, and 255 are mapped to the subtrees a, b, c, and d, respectively.

Additionally, at the front of each inner node, a header of constant size (e.g., 16 bytes) stores the node type, the number of children, and the compressed path (cf. Section III-E).

D. Structure of Leaf Nodes

Besides storing paths using the inner nodes as discussed in the previous section, radix trees must also store the values associated with the keys. We assume that only unique keys are stored, because non-unique indexes can be implemented by appending the tuple identifier to each key as a tie-breaker. The values can be stored in different ways:

search (node, key, depth)

 1  if node==NULL
 2    return NULL
 3  if isLeaf(node)
 4    if leafMatches(node, key, depth)
 5      return node
 6    return NULL
 7  if checkPrefix(node,key,depth)!=node.prefixLen
 8    return NULL
 9  depth=depth+node.prefixLen
10  next=findChild(node, key[depth])
11  return search(next, key, depth+1)

Fig. 7. Search algorithm.

Fig. 6. Illustration of lazy expansion and path compression. (Diagram labels: path compression merges a one-way node into its child node; lazy expansion removes the path to a single leaf.)
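The four inner node layouts of Section III-C and the search procedure of Fig. 7 can be sketched in a few lines of Python. This is only an illustrative model, not the authors' implementation: the class names mirror the paper, but `EMPTY`, `add_child`, `find_child`, `check_prefix`, and the choice to store the full key in every `Leaf` are assumptions made for this sketch. Node16 uses binary search in place of the SIMD comparison, and node growth, shrinking, and insertion are omitted. The hand-built tree holds the keys BAR, BAZ, and FOO, with the path to FOO lazily expanded and the partial key "A" compressed into a prefix.

```python
import bisect

EMPTY = 255  # assumed marker for an unused slot in Node48.child_index


class Leaf:
    """Single-value leaf; it keeps the full key so lazy expansion can be verified."""
    def __init__(self, key, value):
        self.key, self.value = key, value


class Node4:
    """Sorted key array plus a child array at corresponding positions."""
    capacity = 4

    def __init__(self, prefix=b""):
        self.prefix = prefix              # compressed path (pessimistic variant)
        self.keys, self.children = [], []

    def add_child(self, byte, child):
        assert len(self.keys) < self.capacity
        pos = bisect.bisect_left(self.keys, byte)
        self.keys.insert(pos, byte)
        self.children.insert(pos, child)

    def find_child(self, byte):
        for k, child in zip(self.keys, self.children):  # simple loop
            if k == byte:
                return child
        return None


class Node16(Node4):
    """Same layout as Node4; binary search stands in for the SIMD comparison."""
    capacity = 16

    def find_child(self, byte):
        pos = bisect.bisect_left(self.keys, byte)
        if pos < len(self.keys) and self.keys[pos] == byte:
            return self.children[pos]
        return None


class Node48:
    """256-entry index array pointing into an array of up to 48 children."""
    def __init__(self, prefix=b""):
        self.prefix = prefix
        self.child_index = [EMPTY] * 256
        self.children = []

    def add_child(self, byte, child):
        assert len(self.children) < 48
        self.child_index[byte] = len(self.children)
        self.children.append(child)

    def find_child(self, byte):
        slot = self.child_index[byte]
        return None if slot == EMPTY else self.children[slot]


class Node256:
    """Plain array of 256 child pointers, indexed directly by the key byte."""
    def __init__(self, prefix=b""):
        self.prefix = prefix
        self.children = [None] * 256

    def add_child(self, byte, child):
        self.children[byte] = child

    def find_child(self, byte):
        return self.children[byte]


def check_prefix(node, key, depth):
    """Count how many bytes of node.prefix match key starting at depth."""
    matched = 0
    for i, b in enumerate(node.prefix):
        if depth + i >= len(key) or key[depth + i] != b:
            break
        matched += 1
    return matched


def search(node, key, depth=0):
    if node is None:
        return None
    if isinstance(node, Leaf):                 # lazy expansion: verify full key
        return node if node.key == key else None
    if check_prefix(node, key, depth) != len(node.prefix):
        return None                            # compressed path mismatch
    depth += len(node.prefix)
    if depth >= len(key):                      # key exhausted inside the tree
        return None
    return search(node.find_child(key[depth]), key, depth + 1)


# Hand-built tree for the keys BAR, BAZ, and FOO.
inner = Node4(prefix=b"A")
inner.add_child(ord("R"), Leaf(b"BAR", 1))
inner.add_child(ord("Z"), Leaf(b"BAZ", 2))
root = Node4()
root.add_child(ord("B"), inner)
root.add_child(ord("F"), Leaf(b"FOO", 3))
```

Calling `search(root, b"BAZ")` walks the root, checks the one-byte prefix of the inner node, and returns the matching leaf, whereas `search(root, b"FOX")` reaches the lazily expanded leaf for FOO and rejects it in the full-key comparison.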

• Single-value leaves: The values are stored using an additional leaf node type which stores one value.
• Multi-value leaves: The values are stored in one of four different leaf node types, which mirror the structure of inner nodes, but contain values instead of pointers.
• Combined pointer/value slots: If values fit into pointers, no separate node types are necessary. Instead, each pointer storage location in an inner node can either store a pointer or a value. Values and pointers can be distinguished using one additional bit per pointer or with pointer tagging.

Using single-value leaves is the most general method, because it allows keys and values of varying length within one tree. However, because of the increased tree height, it causes one additional pointer traversal per lookup. Multi-value leaves avoid this overhead, but require all keys in a tree to have the same length. Combined pointer/value slots are efficient and allow to store keys of varying length. Therefore, this method should be used if applicable. It is particularly attractive for secondary database indexes which store tuple identifiers with the same size as pointers.

E. Collapsing Inner Nodes

Radix trees have the useful property that each inner node represents a key prefix. Therefore, the keys are implicitly stored in the tree structure, and can be reconstructed from the paths to the leaf nodes. This saves space, because the keys do not have to be stored explicitly. Nevertheless, even with this implicit prefix compression of keys and the use of adaptive nodes, there are cases, in particular with long keys, where the space consumption per key is large. We therefore use two additional, well-known techniques that allow to decrease the height by reducing the number of nodes. These techniques are very effective for long keys, increasing performance significantly for such indexes. Equally important is that they reduce space consumption, and ensure a small worst-case space bound.

With the first technique, lazy expansion, inner nodes are only created if they are required to distinguish at least two leaf nodes. Figure 6 shows an example where lazy expansion saves two inner nodes by truncating the path to the leaf “FOO”. This path is expanded if another leaf with the prefix “F” is inserted. Note that because paths to leaves may be truncated, this optimization requires that the key is stored at the leaf or can be retrieved from the database.

Path compression, the second technique, removes all inner nodes that have only a single child. In Figure 6, the inner node storing the partial key “A” was removed. Of course, this partial key cannot simply be ignored. There are two approaches to deal with it:
• Pessimistic: At each inner node, a variable length (possibly empty) partial key vector is stored. It contains the keys of all preceding one-way nodes which have been removed. During lookup this vector is compared to the search key before proceeding to the next child.
• Optimistic: Only the count of preceding one-way nodes (equal to the length of the vector in the pessimistic approach) is stored. Lookups just skip this number of bytes without comparing them. Instead, when a lookup arrives at a leaf its key must be compared to the search key to ensure that no “wrong turn” was taken.

Both approaches ensure that each inner node has at least two children. The optimistic approach is particularly beneficial for long strings but requires one additional check, while the pessimistic method uses more space, and has variable sized nodes leading to increased memory fragmentation. We therefore use a hybrid approach by storing a vector at each node like in the pessimistic approach, but with a constant size (8 bytes) for all nodes. Only when this size is exceeded, the lookup algorithm dynamically switches to the optimistic strategy. Without wasting too much space or fragmenting memory, this avoids the additional check in cases that we investigated.

F. Algorithms

We now present the algorithms for search and updates:

Search: Pseudo code for search is shown in Figure 7. The tree is traversed by using successive bytes of the key array until a leaf node or a null pointer is encountered. Line 4 handles lazy expansion by checking that the encountered leaf fully matches the key. Pessimistic path compression is handled in lines 7 and 8 by aborting the search if the compressed path does not match the key. The next child node is found by the findChild function, which is shown in Figure 8. Depending on the node type the appropriate search algorithm is executed: Because a Node4 has only 2-4 entries, we use

findChild (node, byte)                      insert (node, key, leaf, depth)

1 if node.type==Node4 // simple loop 1 if node==NULL // handle empty tree 2 for (i=0; i4096 NA. whether or not the data is stored in sorted order. The sorted order traversal of an index structure facilitates the implementa- tion of efficient ordered range scans and lookups for minimum, header storing the node type, the number of non null children, maximum, top-N, etc. By default, only comparison-based and the compressed path. We only consider inner nodes and trees store the data in sorted order, which resulted in their ignore leaf nodes because leaves incur no space overhead if prevalence as indexing structures for database systems. While combined pointer/value slots with tagged pointers are used. the use of order-preserving hashing has been proposed to allow Using these assumptions, the resulting space consumption for elements to be sorted, it is not common in real- each inner node type is shown in TableI. world systems. The reason is that for values from unknown Think of each leaf as providing x bytes and inner nodes as distributions, it is very hard to come up with functions that consuming space provided by their children. If each node of spread the input values uniformly over the hash table while a tree has a positive budget, then that tree uses less than x preserving the input order. bytes per key. The budget of an inner node is the sum of all Keys stored in radix trees are ordered bitwise lexicograph- budgets of its children minus the space consumption of that ically. For some data types, e.g., ASCII encoded character node. Formally, the budget b(n) of a node n with the child strings, this yields the expected order. For most data types nodes c(n) and the space consumption s(n) is defined as this is not the case. For example, negative two- signed integers are lexicographically greater than positive (x, isLeaf(n) integers. However, it is possible to obtain the desired order b(n) = P  i∈c(n) b(i) − s(n), else. by transforming the keys. 
We call the values resulting from such a transformation binary-comparable keys. If only binary- Using induction on the node type, we now show that comparable keys are used as keys of a radix tree, the data b(n) ≥ 52 n x = 52 for every ART node if : For leaves, the is stored in sorted order and all operations that rely on this statement holds trivially by definition of the budget function. order can be supported. Note that no modifications to the To show that the statement holds for inner nodes, we compute algorithms presented in the previous section are necessary. a lower bound for P b(i) using the induction hypothesis i∈c(n) Each key must simply be transformed to a binary-comparable and the minimum number of children for each node type. After key before storing it or looking it up. subtracting the corresponding space consumption, we obtain Binary-comparable keys have other use cases. Just as this a lower bound on the budget of each node type. In all four concept allows to replace comparison-based trees with radix 52 cases it is greater than or equal to , which concludes the trees, it allows to replace comparison-based sorting algorithms x = 52 induction. To summarize, we have shown that for it like quicksort or mergesort with the radix sort algorithm which is not possible to construct an adaptive radix tree node that can be asymptotically superior. has a negative budget. As a consequence, the worst-case space consumption is 52 bytes for any adaptive radix tree, even for A. Definition arbitrarily long keys. It can also be shown analogously that A transformation function t : D → {0, 1,..., 255}k 1 with six node types , the worst-case space consumption can be transforms values of a domain D to binary-comparable keys of reduced to 34 bytes per key. 
As we will show in Section V-D, in practice, the space consumption is much smaller than the worst case, even for relatively long keys. The best case of 8.1 bytes, however, does occur quite frequently because surrogate integer keys are often dense.

¹ The Node4 type is replaced by the new node types Node2 and Node5 and the Node48 type is replaced by the new Node32 and Node64 types.

length k if it satisfies the following equivalences ($x, y \in D$):
• $x < y \Leftrightarrow \mathrm{memcmp}_k(t(x), t(y)) < 0$
• $x > y \Leftrightarrow \mathrm{memcmp}_k(t(x), t(y)) > 0$
• $x = y \Leftrightarrow \mathrm{memcmp}_k(t(x), t(y)) = 0$

The operators <, >, = denote the usual relational operators on the input type while $\mathrm{memcmp}_k$ compares the two input vectors component wise. It returns 0 if all compared values are equal, a negative value if the first non-equal value of the first vector is less than the corresponding byte of the second vector, and otherwise a positive value.

For finite domains, it is always possible to transform the values of any strictly totally ordered domain to binary-comparable keys: Each value of a domain size n is mapped to a string of $\lceil \log_2 n \rceil$ bits storing the zero-extended rank minus one.

B. Transformations
We now discuss how common data types can be transformed to binary-comparable keys.

a) Unsigned Integers: The binary representation of unsigned integers already has the desired order. However, the endianness of the machine must be taken into account when storing the value into main memory. On little endian machines, the byte order must be swapped to ensure that the result is ordered from most to least significant byte.

b) Signed Integers: Signed two-complement integers must be reordered because negative integers are ordered descending and are greater than the positive values. A b-bit integer x is transformed very efficiently by flipping the sign bit (using $x \text{ XOR } 2^{b-1}$). The resulting value is then stored like an unsigned value.

c) IEEE 754 Floating Point Numbers: For floating point values, the transformation is more involved, although conceptually not difficult. Each value is first classified as positive or negative, and as normalized number, denormalized number, NaN, ∞, or 0. Because these 10 classes do not overlap, a new rank can easily be computed and then stored like an unsigned value. One key transformation requires 3 if statements, 1 integer multiplication, and 2 additions.

d) Character Strings: The Unicode Collation Algorithm (UCA) defines complex rules for comparing Unicode strings. There are open source libraries which implement this algorithm and which offer functions to transform Unicode strings to binary-comparable keys². In general, it is important that each string is terminated with a value which does not appear anywhere else in any string (e.g., the 0 byte). The reason is that keys must not be prefixes of other keys.

e) Null: To make a null value binary comparable, it must be assigned a value with some particular rank. For most data types, all possible values are already used. A simple solution is to increase the key length of all values by one to obtain space for the null value, e.g., 4 byte integers become 5 bytes long. A more efficient way to accommodate the null value is to increase the length only for some values of the domain. For example, assuming null should be less than all other 4 byte integers, null can be mapped to the byte sequence 0,0,0,0,0, the previously smallest value 0 is mapped to 0,0,0,0,1, and all other values retain their 4 byte representation.

f) Compound Keys: Keys consisting of multiple attributes are easily handled by transforming each attribute separately and concatenating the results.

² The C/C++ library "International Components for Unicode" (http://site.icu-project.org/), for example, provides the ucol_getSortKey function for this purpose.

V. EVALUATION
In this section, we experimentally evaluate ART and compare its performance to alternative in-memory data structures, including comparison-based trees, hashing, and radix trees. The evaluation has two parts: First, we perform a number of micro benchmarks, implemented as standalone programs, with all evaluated data structures. In the second part, we integrate some of the data structures into the main-memory database system HyPer. This allows us to execute a more realistic workload, the standard OLTP benchmark TPC-C.

We used a high-end desktop system with an Intel Core i7 3930K CPU which has 6 cores, 12 threads, 3.2 GHz clock rate, and 3.8 GHz turbo frequency. The system has 12 MB shared, last-level cache and 32 GB quad-channel DDR3-1600 RAM. We used Linux 3.2 in 64 bit mode as operating system and GCC 4.6 as compiler.

As contestants, we used
• a B+-tree optimized for main memory (Cache-Sensitive B+-tree [CSB]),
• two read-only search structures optimized for modern x86 CPUs (k-ary search tree [kary], Fast Architecture Sensitive Tree [FAST]),
• a radix tree (Generalized Prefix Tree [GPT]), and
• two textbook data structures (red-black tree [RB], chained hash table [HT] using MurmurHash64A for 64-bit platforms [25]).

For a fair comparison, we used source code provided by the authors if it was available. This was the case for the CSB+-Tree [26], k-ary search [27], and the Generalized Prefix Tree [28]. We used our own implementation for the remaining data structures.

We were able to validate that our implementation of FAST, which we made available online [29], matches the originally published numbers. To calibrate for the different hardware, we used the results for k-ary search which were published in the same paper. Our implementation of FAST uses 2 MB memory pages, and aligns all cache line blocks to 64 byte boundaries, as suggested by Yamamuro et al. [30]. However, because FAST and k-ary search return the rank of the key instead of the tuple identifier, the following results include one additional lookup in a separate array of tuple identifiers in order to evaluate a meaningful lookup in the database context.

We had to use 32 bit integers as keys for the micro benchmarks because some of the implementations only support 32 bit integer keys. For such very short keys, path compression usually increases space consumption instead of reducing it. Therefore, we removed this feature for the micro benchmarks. Path compression is enabled in the more realistic second part of the evaluation. In contrast to comparison-based trees and hash tables, the performance of radix trees varies with the distribution of the keys. We therefore show results for dense keys ranging from 1 to n (n denotes the size of the tree in # keys) and sparse keys where each bit is equally likely 0 or 1. We randomly permuted the dense keys.
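To make the transformations from the Transformations subsection concrete, the following Python sketch encodes unsigned integers, signed integers, byte strings, and compound keys so that plain byte-wise comparison reproduces the original order. It is only an illustration under stated assumptions: the function names are hypothetical, keys are fixed at 32 bits, Python's `bytes` comparison stands in for memcmp, and strings are treated as raw bytes (no Unicode collation).

```python
import struct

def encode_u32(x: int) -> bytes:
    """Unsigned 32-bit integer: store the most significant byte first."""
    return struct.pack(">I", x)

def encode_i32(x: int) -> bytes:
    """Signed 32-bit integer: flip the sign bit (x XOR 2^(b-1) with b = 32),
    then store the result like an unsigned value."""
    return struct.pack(">I", (x & 0xFFFFFFFF) ^ 0x80000000)

def encode_str(s: bytes) -> bytes:
    """Byte string: append a terminator that appears nowhere else (the 0 byte),
    so that no key is a prefix of another key."""
    if b"\x00" in s:
        raise ValueError("string must not contain the terminator byte")
    return s + b"\x00"

def encode_compound(*parts: bytes) -> bytes:
    """Compound key: concatenate the separately transformed attributes."""
    return b"".join(parts)
```

For example, `sorted([3, -7, 0], key=encode_i32)` orders the values numerically although only their byte strings are compared, which is the property a radix tree relies on for range scans.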

Fig. 10. Single-threaded lookup throughput in an index with 65K, 16M, and 256M keys. [Three bar charts, one per index size; y-axis: M lookups/second; bars for dense and sparse keys; x-axis: ART, GPT, RB, CSB, kary, FAST, HT. GPT and CSB crashed with 256M keys.]
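The dense and sparse key sets used in the micro benchmarks could be generated along the following lines. This is a minimal sketch, not the authors' benchmark code: the function names, the fixed seed, and the removal of duplicate sparse keys are assumptions made for illustration.

```python
import random

def dense_keys(n: int, seed: int = 0) -> list:
    """Keys 1..n in randomly permuted order."""
    keys = list(range(1, n + 1))
    random.Random(seed).shuffle(keys)
    return keys

def sparse_keys(n: int, bits: int = 32, seed: int = 0) -> list:
    """n distinct keys in which each bit is equally likely 0 or 1.
    Dropping duplicates is an assumption; the text does not say how
    repeated keys were handled."""
    rng = random.Random(seed)
    keys = set()
    while len(keys) < n:
        keys.add(rng.getrandbits(bits))
    return list(keys)
```

Dense keys share long common prefixes and fill the upper tree levels completely, whereas sparse keys spread over the whole key space, which is why the two distributions are reported separately.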

A. Search Performance
In our first experiment, we measured the performance of looking up random, existing keys³. Figure 10 shows that ART and the hash table have the best performance. ART is more than twice as fast as GPT, a radix tree with half the span and therefore twice the height. The red-black tree is the slowest comparison-based tree, followed by the CSB+-tree, k-ary search, and finally FAST. Even though FAST does not support updates and is optimized for modern architectures, it is slower than ART and the hash table. The relative performance of the data structures is very similar for the three index sizes. This indicates that the fact that small indexes (65K) are about 10 times faster than large indexes (256M) is mostly caused by caching effects, and not by the asymptotic properties of the data structures.

TABLE III
PERFORMANCE COUNTERS PER LOOKUP.

                  65K                        16M
                  ART (d./s.)  FAST  HT      ART (d./s.)  FAST  HT
Cycles            40/105       94    44      188/352      461   191
Instructions      85/127       75    26      88/99        110   26
Misp. Branches    0.0/0.85     0.0   0.26    0.0/0.84     0.0   0.25
L3 Hits           0.65/1.9     4.7   2.2     2.6/3.0      2.5   2.1
L3 Misses         0.0/0.0      0.0   0.0     1.2/2.6      2.4   2.4

To better understand these results, consider Table III, which shows performance counters per random lookup for the three fastest data structures (ART, FAST, and the hash table). With 16M keys, only parts of the index structures are cached, and lookup is memory bound. The number of cache misses is similar for FAST, the hash table, and ART with sparse keys. With dense keys, ART causes only half as many cache misses because its compact nodes can be cached effectively. In small trees, the lookup performance is mostly determined by the number of instructions and branch mispredictions. While ART has almost no mispredicted branches for dense keys, sparse keys lead to around 0.85 mispredicted branches per lookup, which occur during node type dispatch. Dense keys also require fewer instructions, because finding the next child in a Node256 requires no computation, while the other node types result in more effort. FAST effectively avoids mispredictions which occur with ordinary search trees, but requires a significant number of instructions (about 5 per comparison) to achieve this. The hash table has a small number of mispredictions which occur during collision handling.

So far, lookups were performed one query at a time, in a single thread. The goal of the next experiment was to find the maximum achievable throughput using multiple unsynchronized threads. Besides using multiple threads, it has been shown that throughput can be improved by interleaving multiple tree traversals using software pipelining [7]. This technique exploits modern superscalar CPUs better, but increases latency, and is only applicable if multiple queries are available (e.g., during an index-nested loop join). FAST benefits most from software pipelining (2.5x), because its relatively large latencies for comparisons and address calculations can be hidden effectively. Since ART performs fewer calculations in the first place, its speedup is smaller but still significant (1.6x-1.7x). A chained hash table can be considered a tree of only two levels (the hash table array, and the list of key/value pairs), so the speedup is relatively small (1.2x). Nevertheless, Figure 11 shows that even with 12 threads and 8 interleaved queries per thread, ART is only slightly slower than FAST for sparse keys, but significantly faster for dense keys.

Fig. 11. Multi-threaded lookup throughput in an index with 16M keys (12 threads, software pipelining with 8 queries per thread). [Bar chart; y-axis: M lookups/second; bars for dense and sparse keys; x-axis: ART, FAST, HT.]

³ Successful search in radix trees is slower than unsuccessful search.

Fig. 12. Impact of skew on search performance (16M keys). [Plot of M lookups/second against the Zipf parameter s (skew), 0.25 to 1.25, for ART, HT, and FAST.]

Fig. 13. Impact of cache size on search performance (16M keys). [Plot of M lookups/second against the effective cache size (log scale), 192KB to 12MB, for ART, HT, and FAST.]
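A skewed lookup workload like the one behind Fig. 12 can be approximated by drawing key ranks from a Zipf distribution with parameter s. The sketch below is an assumption about how such a generator might look, using only the Python standard library; the name `zipf_sampler` is hypothetical, and how ranks are mapped to concrete keys is left open.

```python
import itertools
import random

def zipf_sampler(n: int, s: float, seed: int = 0):
    """Return a function draw(count) that samples ranks 1..n with
    P(rank = r) proportional to 1 / r**s."""
    rng = random.Random(seed)
    ranks = range(1, n + 1)
    # Cumulative weights are computed once so each batch of draws is cheap.
    cum_weights = list(itertools.accumulate(1.0 / r ** s for r in ranks))

    def draw(count: int) -> list:
        return rng.choices(ranks, cum_weights=cum_weights, k=count)

    return draw
```

With larger s, a few low ranks dominate the sample, so the corresponding tree paths stay in cache; with s close to 0 the distribution approaches the uniform random lookups of the earlier experiments.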

B. Caching Effects
Let us now investigate caching effects. For modern CPUs, caches are extremely important, because DRAM latency amounts to hundreds of CPU cycles. Tree structures, in particular, benefit from caches very much because frequently accessed nodes and the top levels are usually cached. To quantify these caching effects, we compare two tree structures, ART (with dense keys) and FAST, to a hash table.

Random lookup, which we performed so far, is the worst case for caches because this access pattern has bad temporal locality. In practice, skewed access patterns are very common, e.g., recent orders are accessed more often than old orders. We simulated such a scenario by looking up Zipf distributed keys instead of random keys. Figure 12 shows the impact of increasing skew on the performance of the three data structures. All data structures perform much better in the presence of skew because the number of cache misses decreases. As the skew increases, the performance of ART and the hash table approaches their speed in small, cache resident trees. For FAST the speedup is smaller because it requires more comparisons and offset calculations which are not improved by caching.

We now turn our attention to the influence of the cache size. In the previous experiments, we only performed lookups in a single tree. As a consequence, the entire cache was utilized, because there were no competing memory accesses. In practice, caches usually contain multiple indexes and other data. To simulate competing accesses and therefore effectively smaller caches, we look up keys in multiple data structures in a round-robin fashion. Each data structure stores 16M random, dense keys and occupies more than 128MB. Figure 13 shows that the hash table is mostly unaffected, as it does not use caches effectively anyway, while the performance of the trees improves with increasing cache size, because more often-traversed paths are cached. With 1/64th of the cache (192KB), ART reaches only about one third of the performance of the entire cache (12MB).

C. Updates
Besides efficient search, an indexing structure must support efficient updates as well. Figure 14 shows the throughput when inserting 16M random keys into an empty structure. Although ART must dynamically replace its internal data structures as the tree grows, it is more efficient than the other data structures. The impact of adaptive nodes on the insertion performance (in comparison with only using Node256) is 20% for trees with 16M dense keys. Since the space savings from adaptive nodes can be large, this is usually a worthwhile trade off. In comparison with incremental insertion, bulk insertion increases performance by a factor of 2.5 for sparse keys and by 17% for dense keys. When sorted keys, e.g., surrogate primary keys, are inserted, the performance of ordered search trees increases because of caching effects. For ART, 50 million sorted, dense keys can be inserted per second. Only the hash table does not benefit from the sorted order because hashing randomizes the access pattern.

Fig. 14. Insertion of 16M keys into an empty index structure. [Bar chart; y-axis: M inserts/second; bars for dense and sparse keys; x-axis: ART, GPT, RB, CSB, and HT, with two bars labeled (bulk).]

FAST and the k-ary search tree are static data structures that can only be updated by rebuilding them, which is why they were not included in the previous experiment. One possibility for using read-only data structures in applications that require incremental updates is to use a delta mechanism: A second data structure, which supports online updates, stores differences and is periodically merged with the read-only structure. To evaluate the feasibility of this approach, we used a red-black tree to store the delta plus FAST as the main search structure, and compared it to ART (with dense keys) and a hash table.
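The delta mechanism described above can be sketched in a few lines of Python. This is an illustration only and makes several substitutions that are not from the paper: a sorted array with binary search stands in for the read-only structure (FAST), a dictionary stands in for the updatable delta (a red-black tree), and the class name `DeltaIndex` and its fixed `merge_threshold` are hypothetical.

```python
import bisect

_DELETED = object()  # tombstone: the key was deleted after the last merge

class DeltaIndex:
    def __init__(self, items=(), merge_threshold=1024):
        pairs = sorted(dict(items).items())
        self._keys = [k for k, _ in pairs]  # read-only main structure
        self._vals = [v for _, v in pairs]
        self._delta = {}                    # differences since the last merge
        self._merge_threshold = merge_threshold

    def lookup(self, key):
        # The delta holds the most recent state and is consulted first.
        if key in self._delta:
            value = self._delta[key]
            return None if value is _DELETED else value
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._vals[i]
        return None

    def insert(self, key, value):
        self._delta[key] = value
        self._maybe_merge()

    def delete(self, key):
        self._delta[key] = _DELETED
        self._maybe_merge()

    def _maybe_merge(self):
        if len(self._delta) >= self._merge_threshold:
            self.merge()

    def merge(self):
        # Rebuild the main structure from main + delta. sorted() is used for
        # brevity; merging two sorted runs would give the O(n) step.
        merged = dict(zip(self._keys, self._vals))
        for key, value in self._delta.items():
            if value is _DELETED:
                merged.pop(key, None)
            else:
                merged[key] = value
        pairs = sorted(merged.items())
        self._keys = [k for k, _ in pairs]
        self._vals = [v for _, v in pairs]
        self._delta = {}
```

Every lookup now touches two structures, and every merge rewrites the whole main structure, which is the cost that grows as the share of updates in the workload increases.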
We used the optimal merging frequency between FAST and the delta which we determined experimentally. Our workload consisted of random insertions, deletions, and lookups in a tree of about 16M elements. Figure 15 shows the results for varying fractions of lookups versus insertions and deletions. As the fraction of lookups decreases, the performance of FAST+delta degrades because of the additional periodic O(n) merging step.

Fig. 15. Mix of lookups, insertions, and deletions (16M keys). [Plot of M operations/second against the update percentage, 0% to 100%, for ART, HT, and FAST+Δ.]

D. End-to-End Evaluation
For the end-to-end application experiment, we used the main-memory database system HyPer. One of its unique characteristics is that it very efficiently supports both transactional (OLTP) and analytical (OLAP) workloads at the same time [3]. Transactions are implemented in a scripting language which is compiled to assembler-like LLVM instructions [31]. Furthermore, HyPer has no overhead for buffer management, locking, or latching. Therefore, its performance critically depends on the efficiency of the index structures used. For each index, HyPer allows choosing which data structure is used. Originally, a red-black tree and a hash table were available. We additionally integrated ART, including the path compression and lazy expansion optimizations. We further implemented the key transformation scheme discussed in Section IV for all built-in types so that range scan, prefix lookup, minimum, and maximum operations work as expected.

The following experiment uses TPC-C, a standard OLTP benchmark simulating a merchandising company which manages, sells, and distributes products. It consists of a diverse mix of select statements (including point queries, range scans, and prefix lookups), insert, and delete statements. While it is considered a write-heavy benchmark, 54% of all SQL statements are queries [32]. Our implementation omits client think-time and uses a single partition with 5 warehouses. We executed the benchmark until most of the available RAM was exhausted. As index configurations we used ART, a red-black tree, and a combination of a hash table and a red-black tree. It is not possible to use hash tables for all indexes because TPC-C requires prefix-based range scans for some indexes.

Figure 16 shows that the index structure choice is critical for HyPer's OLTP performance. ART is almost twice as fast as the hash table/red-black tree combination and almost four times as fast as the red-black tree alone. The hash table improved performance significantly over the red-black tree alone, but introduced unacceptable rehashing latencies which are clearly visible as spikes in the graph.

Fig. 16. TPC-C performance. [Plot of transactions/second against the number of TPC-C transactions, 10M to 50M, for ART, HT + RB, and RB.]

Let us turn our attention to the space consumption of the TPC-C benchmark which we measured at the end of the benchmark runs. In order to save space, HyPer's red-black tree and hash table implementations do not store keys, only tuple identifiers. Using the tuple identifiers, the keys are loaded from the database on demand. Nevertheless, and although ART may use more space per element in the worst case, ART used only half as much space as the hash table and the red-black tree. More detailed insights into space consumption can be obtained by considering the structural information for each major index and its corresponding space consumption per key, which is shown in Table IV. Index 3 uses most space per key, as it stores relatively long strings. Nevertheless, its space consumption stays well below the worst case of 52 bytes. Indexes 1, 2, 4, and 5 use only 8.1 bytes per key because they store relatively dense integers. Indexes 6 and 7 fall in between these extremes.

TABLE IV
MAJOR TPC-C INDEXES AND SPACE CONSUMPTION PER KEY USING ART.

#  Relation   Cardinality   Attribute Types                        Space
1  item       100,000       int                                    8.1
2  customer   150,000       int,int,int                            8.3
3  customer   150,000       int,int,varchar(16),varchar(16),TID    32.6
4  stock      500,000       int,int                                8.1
5  order      22,177,650    int,int,int                            8.1
6  order      22,177,650    int,int,int,int,TID                    24.9
7  orderline  221,712,415   int,int,int,int                        16.8

The results of our final experiment, shown in Figure 17, measure the impact of path compression and lazy expansion on the average tree height. By default, the height of a radix tree equals the length of the key (in bytes for ART). For example, the height of index 3 would be 40 without any optimizations. Path compression and lazy expansion reduce the average height to 8.1. Lazy expansion is particularly effective with long strings (e.g., index 3) and with non-unique indexes mostly containing unique values because the appended 8 byte tuple identifier can be truncated (e.g., index 6). Path compression helps with long strings (e.g., index 3) and with compound indexes of dense integers which share a common prefix (e.g., indexes 2, 4, 5, and 7). The impact of the two optimizations on space consumption is similar to the impact on height which is why we do not show it separately. To summarize, except for short integers, path compression and lazy expansion are critical for achieving high performance and small memory consumption with radix trees.

Fig. 17. Impact of lazy expansion and path compression on the height of the TPC-C indexes. [Bar chart of tree height for index # 1 to 7; series: default, +lazy expansion, +path compression.]

VI. CONCLUSIONS AND FUTURE WORK
We presented the adaptive radix tree (ART), a fast and space-efficient indexing structure for main-memory database systems. A high fanout, path compression, and lazy expansion reduce the tree height, and therefore lead to excellent performance. The worst-case space consumption, a common problem of radix trees, is limited by dynamically choosing compact internal data structures. We compared ART with other state-of-the-art main-memory data structures. Our results show that ART is much faster than a red-black tree, a Cache Sensitive B+-Tree, and GPT, another radix tree proposal. Even the architecture sensitive, read-only search tree FAST, which is specifically designed for modern CPUs, is slower than ART, even without taking updates into account. Of all the evaluated data structures only a hash table was competitive. But hash tables are unordered, and are therefore not suitable as general-purpose index structures. By integrating ART into the main-memory database system HyPer and executing the TPC-C benchmark, we demonstrated that it is a superior alternative to conventional index structures for transactional workloads.

In the future, we intend to work on synchronizing concurrent updates. In particular, we plan to develop a latch-free synchronization scheme using atomic primitives like compare-and-swap. Another idea is to design a space-efficient radix tree which has nodes of equal size. Instead of dynamically adapting the fanout based on the sparseness of the keys, the number of bits used from the key should change dynamically, while the fanout stays approximately constant. Such a tree could also be used for data stored on disk.

REFERENCES
[1] R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. Zdonik, E. P. C. Jones, S. Madden, M. Stonebraker, Y. Zhang, J. Hugg, and D. J. Abadi, "H-store: a high-performance, distributed main memory transaction processing system," PVLDB, vol. 1, 2008.
[2] F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and W. Lehner, "SAP HANA database: data management for modern business applications," SIGMOD Record, vol. 40, no. 4, 2012.
[3] A. Kemper and T. Neumann, "HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots," in ICDE, 2011.
[4] T. J. Lehman and M. J. Carey, "A study of index structures for main memory database management systems," in VLDB, 1986.
[5] J. Rao and K. A. Ross, "Cache conscious indexing for decision-support in main memory," in VLDB, 1999.
[6] B. Schlegel, R. Gemulla, and W. Lehner, "k-ary search on modern processors," in DaMoN workshop, 2009.
[7] C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. A. Brandt, and P. Dubey, "FAST: fast architecture sensitive tree search on modern cpus and gpus," in SIGMOD, 2010.
[8] R. Bayer and E. McCreight, "Organization and maintenance of large ordered indices," in SIGFIDET, 1970.
[9] D. Comer, "Ubiquitous B-tree," ACM Comp. Surv., vol. 11, no. 2, 1979.
[10] R. Bayer, "Symmetric binary B-trees: Data structure and maintenance algorithms," Acta Informatica, vol. 1, 1972.
[11] L. Guibas and R. Sedgewick, "A dichromatic framework for balanced trees," IEEE Annual Symposium on the Foundations of Computer Science, vol. 0, 1978.
[12] J. Rao and K. A. Ross, "Making B+ trees cache conscious in main memory," in SIGMOD, 2000.
[13] G. Graefe and P.-A. Larson, "B-tree indexes and CPU caches," in ICDE, 2001.
[14] R. De La Briandais, "File searching using variable length keys," in western joint computer conference, 1959.
[15] E. Fredkin, "Trie memory," Commun. ACM, vol. 3, September 1960.
[16] D. R. Morrison, "PATRICIA-practical algorithm to retrieve information coded in alphanumeric," J. ACM, vol. 15, no. 4, 1968.
[17] D. E. Knuth, The art of computer programming, volume 3: (2nd ed.) sorting and searching, 1998.
[18] N. Askitis and R. Sinha, "Engineering scalable, cache and space efficient tries for strings," The VLDB Journal, vol. 19, no. 5, 2010.
[19] M. Böhm, B. Schlegel, P. B. Volk, U. Fischer, D. Habich, and W. Lehner, "Efficient in-memory indexing with generalized prefix trees," in BTW, 2011.
[20] T. Kissinger, B. Schlegel, D. Habich, and W. Lehner, "KISS-Tree: Smart latch-free in-memory indexing on modern architectures," in DaMoN workshop, 2012.
[21] D. Baskins, "A 10-minute description of how judy arrays work and why they are so fast," http://judy.sourceforge.net/doc/10minutes.htm, 2004.
[22] A. Silverstein, "Judy IV shop manual," http://judy.sourceforge.net/doc/shop_interm.pdf, 2002.
[23] G. Graefe, "Implementing sorting in database systems," ACM Comput. Surv., vol. 38, no. 3, 2006.
[24] J. Corbet, "Trees I: Radix trees," http://lwn.net/Articles/175432/.
[25] A. Appleby, "MurmurHash64A," https://sites.google.com/site/murmurhash/.
[26] J. Rao, "CSB+ tree source," http://www.cs.columbia.edu/~kar/software/csb+/.
[27] B. Schlegel, "K-ary search source," http://wwwdb.inf.tu-dresden.de/team/staff/dipl-inf-benjamin-schlegel/.
[28] M. Boehm, "Generalized prefix tree source," http://wwwdb.inf.tu-dresden.de/research-projects/projects/dexter/core-indexing-structure-and-techniques/.
[29] V. Leis, "FAST source," http://www-db.in.tum.de/~leis/index/fast.cpp.
[30] T. Yamamuro, M. Onizuka, T. Hitaka, and M. Yamamuro, "VAST-Tree: A vector-advanced and compressed structure for massive data tree traversal," in EDBT, 2012.
[31] T. Neumann, "Efficiently compiling efficient query plans for modern hardware," PVLDB, vol. 4, 2011.
[32] J. Krueger, C. Kim, M. Grund, N. Satish, D. Schwalb, J. Chhugani, H. Plattner, P. Dubey, and A. Zeier, "Fast updates on read-optimized databases using multi-core CPUs," PVLDB, vol. 5, 2011.