Buffering Accesses to Memory-Resident Index Structures

Jingren Zhou Kenneth A. Ross∗

Columbia University {jrzhou, kar}@cs.columbia.edu

Abstract

Recent studies have shown that cache-conscious indexes outperform conventional main memory indexes. Cache-conscious indexes focus on better utilization of each cache line for improving search performance of a single lookup. None has exploited cache spatial and temporal locality between consecutive lookups. We show that conventional indexes, even “cache-conscious” ones, suffer from significant cache thrashing between accesses. Such thrashing can impact the performance of applications such as stream processing and query operations such as index-nested-loops join.

We propose techniques to buffer accesses to memory-resident tree-structured indexes to avoid cache thrashing. We study several alternative designs of the buffering technique, including whether to use fixed-size or variable-sized buffers, whether to buffer at each tree level or only at some of the levels, how to support bulk access while there are concurrent updates happening to the index, and how to preserve the order of the incoming lookups in the output results. Our methods improve cache performance for both cache-conscious and conventional index structures. Our experiments show that buffering techniques enable a probe throughput that is two to three times higher than traditional methods.

∗ This research was supported by NSF grants IIS-01-20939 and EIA-00-91533.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 29th VLDB Conference, Berlin, Germany, 2003

1 Introduction

Recent advances in the speed of commodity CPUs have far outpaced advances in memory latency. Main memory access is therefore increasingly a performance bottleneck for many computer applications, including database systems [4, 5]. As random access memory gets cheaper, it becomes affordable to build computers with large main memories. More and more query processing work can be done in main memory. Recent database research has demonstrated that memory access is becoming a significant — if not the major — cost component of database operations [4, 5].

We focus on memory-resident tree-structured indexes. There are many applications of such indexes, since they speed up access to data. The applications we address involve access patterns in which a large number of accesses to an index structure are made in a small amount of time. An example would be processing stream data from a large number of mobile sensors [1, 2]. As each sensed point arrives, a spatial index is consulted to determine objects in the vicinity of the point. A second example would be an index-nested-loops join in a database system with a memory-resident index on the inner table. The index is probed once for each record of the outer table.

Our high-level goal is to make a bulk lookup substantially faster than a sequence of single lookups. Our proposed solutions place minimal requirements on the index structure. We assume the index is a tree of some kind, and we require access to some statistics such as the average branching factor. We do not place conditions on the node size, or on architectural parameters such as the cache size. We do not assume that each lookup requires an exact match; for example, the technique would apply for “overlap” queries in an R-Tree, or “nearest value” in a one-dimensional ordered tree.

The basic idea is to create buffers corresponding to non-root nodes in the tree. As a lookup proceeds down the tree, a record describing the lookup (containing a probe identifier and the search key) is copied into a buffer of such records. These buffers are periodically emptied, and divided¹ among buffers for the child nodes. While there is extra copying of data into buffers, we expect to benefit from improved spatial and temporal locality, and thus to incur a smaller number of cache misses. Buffering is not used for single-lookup access, and so the index performance for such lookups is unchanged. The optimizer decides whether lookups need to be buffered.

¹ For some kinds of lookup, an access record may traverse multiple branches of the tree.

We experimentally study buffering with several kinds of index structures, including B+-Trees [9], R-Trees [10], and CSB+-Trees [16]. Of these, the CSB+-Tree was designed with main-memory performance of single lookups in mind. We repeat our experiments on three different architectures. The design parameters include whether to use fixed-size or variable-size buffers, whether to buffer at each tree level or at only some of the levels, how to support bulk access while there are concurrent updates happening to the index, and how to preserve the order of the incoming lookups in the results. We provide architecture-sensitive guidelines for choosing buffering parameters.

Our results show that conventional index structures have poor cache miss rates for bulk access. More surprisingly, even cache-sensitive algorithms such as CSB+-Trees also have moderately high cache miss rates for bulk access, despite their cache-conscious design. Our new search algorithms yield speedups by a factor of three over conventional B+-tree index lookups, and by about a factor of two over CSB+-tree and R-tree index lookups. Our algorithms can gracefully handle a moderate number of updates mixed with a large number of searches. Our algorithms can achieve more than a factor of two improvement in throughput, while retaining a response time guarantee of less than one second for each probe.

Our methods improve cache performance for both cache-conscious and conventional index structures. Interestingly, with buffering, conventional index structures can achieve cache performance that is comparable to cache-conscious index structures for batch lookup. Our algorithms can be implemented in current commercial database systems without significant changes to existing code.

The rest of this paper is organized as follows. We discuss related work in Section 1.1. We briefly discuss hierarchical memory systems in Section 2. We survey cache-conscious index structures and demonstrate why they are suboptimal for batch lookups in Section 3. In Section 4, we present different buffering techniques and detailed data structures. We derive guidelines to choose where to place buffers in Section 5. In Section 6, we present detailed experiments and validate our algorithms. We conclude in Section 7.

1.1 Related Work

The Y-tree [11] adds a special lookup structure (heap) in an index tree node to avoid random disk seeks; it allows high volume fast insertions, but is slow for lookups. A buffering idea similar to ours was proposed for I/O optimization of non-equijoins in [18]. While there is a high-level similarity between various levels of the memory hierarchy, there are important aspects of CPU architectures, such as TLB misses and cache-line prefetching, that make the details of main-memory techniques more involved. Also, the algorithm of [18] requires processing all accesses through each level in turn, in a top-down breadth-first fashion. Such traversal behavior is clearly inapplicable for processing continuous data streams, because one never has a “complete” probe input. Finally, our techniques allow index updates to be interleaved with probes, which is not considered by [18].

XJoin is an operator for producing the join of two remote input streams [17]. Our techniques are different from XJoin in that we are joining one remote stream with a local relation that has a local index. [5] considers the impact of cache misses and TLB misses on a radix-clustered hash join. Their algorithms do not involve buffering, and apply only to equijoins.

We focus on applications that are probe-intensive, with a relatively small number of updates. In situations with large numbers of updates per second, bulk update operations [8, 14] may be appropriate.

2 Memory Hierarchy

Modern computer architectures have a hierarchical memory system, where access by the CPU to main memory is accelerated by various levels of cache memories. Cache memories are designed based on the principles of spatial and temporal locality. A cache hit happens when the requested data is found in the cache. Otherwise, the CPU loads the data from a lower-level cache or from memory, and incurs a cache miss. There are typically two or three cache levels. The first level (L1) cache and the second level (L2) cache are often integrated on the CPU’s die. Caches are characterized by three major parameters: capacity, cacheline size, and associativity. Latency is the time span that passes after issuing a data access until the requested data is available in the CPU. In hierarchical memory systems, the latency increases with the distance from the CPU.

Logical virtual memory addresses used by application code have to be translated into physical page addresses. The memory management unit (MMU) has a translation lookaside buffer (TLB), a kind of cache that holds the translation for the most recently used pages. If a logical address is not found in the TLB, a TLB miss occurs. Depending on the implementation and the hardware architecture, TLB misses can have a noticeable impact on memory access performance.

Memory latency can be hidden by correctly prefetching data into the cache.
On some architectures, such as the Pentium 4, common access patterns like sequential access are recognized by the hardware, and hardware prefetch instructions are automatically executed ahead of the current data references, without any explicit instructions required from the software.

Table 1 lists the specifications of our experimental systems.² The TLB miss latency of the Pentiums is smaller than that of the UltraSparc. In our experiments, L1 miss latency is so low that the impact of L1 misses on the total performance is insignificant. For the rest of the paper, we will focus on the impact of TLB misses and L2 misses. We use Sun’s CC compiler on the UltraSparc and GNU’s gcc on the Pentiums.

² We use LMbench [3] to measure the L1, L2 and TLB miss latency on different systems.

                       Pentium 4      Pentium 3      UltraSparc
  OS                   Linux 2.4.17   Linux 2.4.17   Solaris 8
  CPU speed            1.8 GHz        1 GHz          296 MHz
  Main-memory size     1 GB           512 MB         1 GB
  Memory type          RDRAM          DDR            DRAM
  L1 cache size        16 KB          16 KB          16 KB
  L1 cacheline size    64 bytes       32 bytes       32 bytes
  L2 cache size        256 KB         256 KB         1 MB
  L2 cacheline size    128 bytes      32 bytes       64 bytes
  TLB entries          64             64             64
  L1 miss latency      18 cycles      7 cycles       10 cycles
  L2 miss latency      276 cycles     101 cycles     74 cycles
  TLB miss latency     46 cycles      5 cycles       51 cycles
  Hardware prefetch    Yes            No             No

Table 1: System Specifications

3 Index Structures

Recent studies have shown that cache-conscious indexes such as CSS-Trees [15] and CSB+-Trees [16] outperform conventional main memory indexes such as B+-Trees. The key idea is to eliminate all (or most) of the child pointers from an index node to increase the fanout of the tree, and to choose a node size of the order of the cache line size. Multidimensional cache-conscious indexes such as CR-Trees [13] use compression techniques to pack more entries in a node. These techniques effectively reduce the tree height and improve the cache behavior of the index. Other indexes such as pB+-Trees [6] and fpB+-Trees [7] use prefetching to create wider nodes and reduce the height of the B+-tree. Different index structures propose different optimal node sizes.

All these techniques focus on better utilization of each cache line and optimize the search performance of a single lookup. None has exploited cache spatial and temporal locality between consecutive lookups. Cache memories are typically too small to hold the whole index structure. If probes to the index are randomly distributed, consecutive lookups end with different leaf nodes, and almost every lookup will find that the corresponding leaf node is not in the cache. Loading the leaf node into the cache causes capacity cache misses that evict other leaf nodes from the cache. The upper level nodes are more frequently accessed, and thus less likely to be the victims. However, the evicted leaf nodes are required again by subsequent lookups, and have to be loaded again in the future. This cache thrashing can have a severe impact on the overall performance.

We illustrate index node cache thrashing by implementing an index nested-loop join using a B+-tree, a CSB+-tree, and an R-tree, as shown in Table 2. The result of all three joins is a set of pairs of outer RID and inner RID of matching tuples. Subsequent tuple reconstruction is equal for all algorithms, and we do not include that in our comparison. We measure the elapsed time to compute the join on a Pentium 4 system.³

³ The other two systems have similar behavior.

                              B+-tree                        CSB+-tree                  R-tree
  Key type                    32-bit floating point          32-bit floating point      32-bit floating point
  Node Size                   4 KB                           128 B (cache line)         4 KB
  Index (Leaf) Node Capacity  510 key-pointer (510 key-RID)  30 keys (14 key-RID)       204 MBR-pointer (204 MBR-RID)
  Indexed Items (Inner)       5 million keys                 5 million keys             1 million unit rectangles
  Query Entries (Outer)       5 million random searches      5 million random searches  1 million random 2x2 rectangles
  Query Type                  exact match                    exact match                overlap
  Bulkloaded                  75%                            75%                        using [12]

Table 2: Index Setup

We explicitly measure the number of L2 cache misses in each experiment. Modern machines have hardware performance counters to measure such statistics without any loss in performance.⁴ In the graphs we show the “cache miss penalty” as the total time taken if each cache miss takes exactly the measured cache miss latency. This is an approximation, because sometimes multiple cache misses can be processed concurrently by the memory subsystem. Note also that some computation may be happening during the cache miss wait time, and so the “cache miss penalty” bar on the graph may also hide some concurrent computation. Despite these caveats, the graph will give us a sense of the contribution of cache miss latency to the overall response time, and thus the opportunity for improvement by processing accesses in batches.

⁴ We use the Intel VTune Performance Tool on Pentium PCs and use Solaris native hardware performance counters on the UltraSparc machine.

[Figure 1: Cache Thrashing Impact. Total elapsed time (seconds) and cache miss penalty for the B+-Tree, CSB+-Tree, and R-Tree.]

Figure 1 shows the results for the three index structures. The CSB+-tree does improve cache performance over the conventional B+-tree. However, both suffer significantly from index node thrashing. For the CSB+-tree, more than 30% of the time can be attributed to index node cache misses. The B+-tree and R-tree suffer even more. There is thus an opportunity for improving the cache behavior.

4 Buffering Techniques

Given an index tree, we first organize nodes into virtual nodes. Virtual nodes are the units of buffering. A virtual node may contain a single tree node, or a node together with several levels of its descendants. Having larger virtual nodes corresponds to doing less buffering. For now, we assume that the mapping from tree nodes to virtual nodes is given, and defer the discussion of how to choose virtual nodes until Section 5. Unless we state otherwise, a “node” is a “virtual node” unless we explicitly call it a “tree node”.

Each node, excluding the root node, is associated with a corresponding buffer. This buffer contains batched queries (records describing the probes) that have been passed down from higher levels, and have yet to be divided according to the keys in the current node. Thus, the buffer contains the “input” for a node. Batch index lookups begin with the root node. The query stream is passed through the root node, and distributed among the buffers of its child nodes, according to the usual traversal protocol of the index structure. We continue the process until query entries reach the leaf nodes and produce the join result.

What is the point of buffering? All it seems to do is slow down probes as they progress through the tree! The point of buffering is that it increases temporal and spatial locality. When an index node is used for traversal, instead of being used for just one probe, it is used for many. As a result, the memory overhead for bringing index nodes into the cache is smaller, amortized over many accesses. Since this memory latency is a large component of the total cost, buffering can improve the throughput, i.e., the rate at which the probe stream is processed.

At any point in time, a node can be flushed, at which time its buffer is emptied and the records are distributed to the node’s children. We’ll describe choices for when a buffer should be flushed below. We may also flush the entire tree, which corresponds to flushing all nodes in turn. When the entire tree is flushed, parents must be flushed before their children. A preorder traversal is used to facilitate cache reuse between a parent and its children. We allow the probe stream to contain explicit flush instructions, which force a flush of the entire tree. A finite stream always ends with such a flush instruction.⁵

⁵ An infinite stream should be flushed regularly; otherwise some access could sit in a rarely used buffer forever.
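To make the scheme concrete, the following sketch (our illustration, not the authors' implementation) shows the kind of data structures involved: an access record carrying a probe identifier and a search key, a virtual node owning a buffer of such records, and a routine that distributes a node's buffered records among its children. Names such as AccessRecord and route_child are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// One buffered access: which probe it belongs to and what key it carries.
// (Hypothetical names; the text only requires a probe identifier and a search key.)
struct AccessRecord {
    uint32_t probe_id;   // identifier of the originating probe
    float    key;        // search key (Table 2 uses 32-bit floating point keys)
};

struct Node {
    std::vector<float>        separators;  // routing keys of this (virtual) node
    std::vector<Node*>        children;    // child (virtual) nodes; empty at the leaves
    std::vector<AccessRecord> buffer;      // the node's "input" buffer; unused for the root
};

// Route one record to the child that a normal single lookup would visit next.
static Node* route_child(Node& n, const AccessRecord& r) {
    size_t i = 0;
    while (i < n.separators.size() && r.key >= n.separators[i]) ++i;
    return n.children[i];
}

// Process a node: move every buffered record into the buffer of the appropriate
// child, instead of walking the whole root-to-leaf path once per probe.
void distribute(Node& n, std::vector<AccessRecord>& input) {
    for (const AccessRecord& r : input)
        route_child(n, r)->buffer.push_back(r);
    input.clear();
}
```

Single lookups bypass these buffers entirely and use the index in the conventional way.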
4.1 Fixed-Size Buffer

In this method, each node explicitly has a corresponding fixed-size buffer. The buffers are created before any probes happen and stay until all the probes are finished. Figure 2(a) shows a fixed-size buffered tree with a buffer size of 3.

We divide up accesses until some child node buffer becomes full. When a child node buffer is full, we flush its buffer by dividing its accesses recursively among its children, and then return to continue processing the parent. We may have to interrupt the child node’s flush temporarily if one of its children’s buffers becomes full, and so on. When accesses reach the leaves, output records are produced. A partially-full buffer may be flushed when the whole tree is being flushed.

In Figure 2(a), each node has a pointer pointing to its buffer. The root node A continues processing the query accesses until the fourth query access, when the buffer of node B is full. We flush the buffer by distributing its accesses between the children C and D, and return to process the root node A.

The choice of buffer size can be important. As the buffer size increases, search performance improves, due to avoiding cache misses. On the other hand, larger buffers require more memory, and all buffers must exist for the duration of the probe stream. Considering both performance⁶ and memory requirements, for the remainder of this paper we choose a buffer size equal to the (virtual) node size. Thus, the total size of the buffers is roughly equal to the size of the index.

⁶ Experiment omitted due to lack of space.

The fixed-size buffering technique eliminates a large number of cache misses, but not all of them. Intermediate index nodes may still be loaded into the cache multiple times. Nevertheless, we have some degree of control over the memory usage by adjusting the size of the buffers — most of the benefit of buffering is derived from the initial portion of the allocated memory.
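A minimal sketch of the fixed-size policy, building on the structures of the previous sketch (again our own illustration with hypothetical names): each buffer holds at most `capacity` records, and a child whose buffer fills up is flushed recursively before control returns to the parent.

```cpp
// Fixed-size buffering sketch. A full child buffer is emptied recursively
// before the parent resumes; leaves append their accesses to the output.
void flush_fixed(Node& n, size_t capacity, std::vector<AccessRecord>& out);

void push_fixed(Node& child, const AccessRecord& r, size_t capacity,
                std::vector<AccessRecord>& out) {
    child.buffer.push_back(r);
    if (child.buffer.size() >= capacity)     // child buffer full:
        flush_fixed(child, capacity, out);   // flush it, then return to the parent
}

void flush_fixed(Node& n, size_t capacity, std::vector<AccessRecord>& out) {
    std::vector<AccessRecord> pending;
    pending.swap(n.buffer);                  // take this node's buffered accesses
    for (const AccessRecord& r : pending) {
        if (n.children.empty())              // leaf level: produce an output record
            out.push_back(r);
        else
            push_fixed(*route_child(n, r), r, capacity, out);
    }
}
```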

[Figure 2: Buffered Tree. (a) Fixed-Size Buffered Tree, (b) Variable-Size Buffered Tree, (c) Order-Preserving Buffering. Each panel shows a probe stream with keys 11, 20, 2, 1, 9, 19, 3, 4 and RIDs r1–r8 being distributed from root node A through nodes B and E toward leaf nodes C, D, and F; panel (c) additionally assigns probe sequence numbers and collects matches in a result buffer.]

4.2 Variable-Size Buffer

In this method, buffers do not have a fixed capacity, but can grow arbitrarily large. A node is flushed only in response to an explicit tree-flush operation. We completely eliminate index node cache thrashing by loading each node exactly once.

Figure 2(b) shows a variable-size buffered tree. Buffers are dynamically created. There is no pointer necessary from a tree node to its buffer. The root node A continues processing all the query accesses in the outer relation. Buffers for nodes B and E are created as required. Then the buffer of node B gets flushed, buffers for nodes C and D are created, and query accesses are distributed between the buffers. When this step finishes (not shown in the figure), buffers for nodes C and D get flushed and generate the join results.

A buffer is made up of a cyclic list of fixed-size segments. To flush a node, a short array of size equal to the number of its children is created. Each entry contains one pointer which points to the current not-yet-full segment in the buffer segment list, and a slot number which indicates the next available slot in that buffer. This design makes sure that storing one query item involves only two cache lines. Since the array is frequently accessed, there is a high probability that it resides in the cache. Thus, storing one query item incurs roughly one cache miss.

Buffers get smaller by roughly the branching factor as one descends a level in the tree. Also, bias in the probe distribution may make some sibling buffers larger than others. Without any prior knowledge, we estimate that query entries are randomly distributed among different branches. Therefore, we choose a buffer segment size that is slightly larger than the total input size divided by the number of nodes at the given level. (We could derive a more accurate number by just dividing the parent’s total buffer size by the number of children, but that would prevent the segment re-use optimization described below.) In the event of a biased probe distribution, we may use more memory than necessary, but overall time performance is not sensitive to the segment size.

After processing a node, we go through each buffer segment list, load the corresponding node, and process the query items from the last buffer segment, because the last buffer segment has a higher probability of residing in the cache. Because the buffer list is cyclic, we are able to traverse all segments in the list.
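The segment-list layout described above can be sketched as follows (our illustration with hypothetical names; the compile-time segment size is a simplification, since the text chooses the segment size per level from the input size). Appending a record touches only the per-child array entry and the tail segment, i.e., roughly two cache lines.

```cpp
#include <cstddef>

// Variable-size buffering sketch: a buffer is a cyclic list of fixed-size
// segments; one array entry per child tracks the current not-yet-full segment
// and the next free slot. Assumes AccessRecord from the earlier sketch.
struct Segment {
    static const size_t kSlots = 256;   // assumed segment size for illustration
    AccessRecord slots[kSlots];
    Segment*     next = nullptr;        // links segments into a cyclic list
};

struct BufferTail {          // one entry per child of the node being flushed
    Segment* tail = nullptr; // current not-yet-full segment
    size_t   used = 0;       // next available slot in that segment
};

void append(BufferTail& b, const AccessRecord& r) {
    if (b.tail == nullptr || b.used == Segment::kSlots) {
        Segment* s = new Segment();      // in practice drawn from a memory pool
        if (b.tail) { s->next = b.tail->next; b.tail->next = s; }
        else        { s->next = s; }     // first segment points to itself (cyclic)
        b.tail = s;
        b.used = 0;
    }
    b.tail->slots[b.used++] = r;
}
```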
When implementing variable-sized buffering, we can take advantage of the fact that no pair of sibling nodes will have active child buffers at the same time. The memory allocated to one node’s buffer can be re-used later for flushing its siblings. In Figure 2(b), buffers for nodes C and D can be re-used for node F, for example. Not only does this save overhead for allocation and deallocation of memory, but it also reduces cache thrashing for buffers. For buffers smaller than the L2 cache, the buffer is likely to remain in the cache as we move from a node to its sibling.

For variable-size buffering, the memory requirement is approximately (B / (B − 1)) · I, where B is the branching factor of the index and I is the size of the input.

4.3 Order-Preserving Buffering

Simple buffering methods do not guarantee that the output order matches the probe order. For query execution plans that care about the result order, we propose order-preserving buffering, as shown in Figure 2(c). A result buffer of outer RIDs and inner RID-lists has size equal to the number of probes between flushes. The RID-lists are initialized to be empty. Before a probe traverses the root, we assign the probe a sequence number and copy the RID into the corresponding result buffer entry. In the buffered tree, pairs of a search key and a probe sequence number describe probes and are the units of buffering. When a match is found, the matching RIDs are appended to the result buffer entry corresponding to the probe sequence number. In the event that at most one match per probe is possible (e.g., an equality probe stream on a key attribute), we can optimize the implementation by using an RID rather than a RID-list for matches. As we shall see in the experiments, the overhead to maintain the probe order is small.
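A sketch of the order-preserving result buffer (our illustration; RID types and names are assumptions): probes receive sequence numbers before entering the tree, and matches found at the leaves are appended to the slot indexed by that sequence number, so results can be emitted in probe order when the tree is flushed.

```cpp
#include <cstdint>
#include <vector>

// Order-preserving buffering sketch. One result entry per probe between
// flushes; a single RID instead of a RID-list suffices when at most one
// match per probe is possible.
struct ResultEntry {
    uint32_t              outer_rid;   // RID of the probing (outer) tuple
    std::vector<uint32_t> inner_rids;  // matching inner RIDs, in arrival order
};

struct OrderPreservingRun {
    std::vector<ResultEntry> results;  // sized by the number of probes between flushes

    // Called once per probe before it traverses the root.
    uint32_t assign_sequence(uint32_t outer_rid) {
        results.push_back({outer_rid, {}});
        return static_cast<uint32_t>(results.size() - 1);  // probe sequence number
    }
    // Called at a leaf whenever a match is found for probe `seq`.
    void add_match(uint32_t seq, uint32_t inner_rid) {
        results[seq].inner_rids.push_back(inner_rid);
    }
};
```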
4.4 Updates

We now describe variant algorithms that allow concurrent updates to the index. The main insight is to allow updates to be buffered along with the probes. Probes and updates that happen after an update operation U are guaranteed to see an index in which U has been applied. We will make changes to the index update methods, but the changes will be small.

Update operations are buffered just as probes are buffered, whether we use the fixed-size buffer technique or the variable-size buffer technique. When an update is received at the root node, an immediate flush operation happens at the root. Whenever an update is passed down to a child buffer, that buffer is also flushed. We always distribute the probes ahead of the updates. When a leaf node is processed, all of the probes generate output results, followed by the update operation that may modify the node. If the modification does not propagate beyond the leaf then we are done, and we return to processing the input stream.

Depending on the kind of index structure used, there may be different kinds of node modification that can result from an update. For fixed-size buffering the changes we require to the update methods for the index are simple, namely:

• Before a node is modified, call the flush operation on that node.
• When a node is created, create a new buffer too.
• When a node is destroyed, deallocate its buffer.

Modifications that occur only along the root-to-leaf path that the update followed do not need to be explicitly flushed again, because flushes were invoked on the way down. Access requests may remain in the buffers for unmodified nodes that are not on the update’s path. Figure 3 shows an example traversal. If the update path is 1 → 2 → 3 → 4, then the buffers associated with nodes 2, 3, and 4 are flushed on the way down. If a modification results in a change to node 5, then that node’s buffer is also flushed before the modification.

[Figure 3: Buffering with Updates. A small example tree with update path 1 → 2 → 3 → 4 and a sibling node 5.]

For variable-size buffering, we dynamically use buffers for different nodes. At any given level of the tree, there is only one active set of buffers. As a result, in order to flush an update without requiring additional buffers to be allocated, we need to flush the entire subtree below the appropriate child of the root node. Also, when the root node is changed, we need to recreate a buffer segment list and make entries point to the right buffers. We experimentally measure the performance of each buffering technique with concurrent updates in Section 6.
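The three rules above can be wrapped around an existing index's update code roughly as follows. This is a sketch under the assumption of the fixed-size structures from the earlier sketches; index_insert stands in for the index's own, unmodified leaf insertion, and the rules for creating and destroying buffers are only noted in comments.

```cpp
// Assumed hook into the underlying index: its ordinary leaf insertion code.
void index_insert(Node& leaf, float key, uint32_t rid);

// Buffered insertion sketch: every buffer the update passes through is
// flushed before the update is applied, so later probes see the new state.
void buffered_insert(Node& root, float key, uint32_t rid,
                     size_t capacity, std::vector<AccessRecord>& out) {
    Node* n = &root;
    while (!n->children.empty()) {
        Node* child = route_child(*n, AccessRecord{0, key});  // probe_id unused here
        flush_fixed(*child, capacity, out);  // rule 1: flush before a node may be modified
        n = child;
    }
    index_insert(*n, key, rid);              // rules 2 and 3: if this creates or destroys
                                             // nodes, their buffers must be created or
                                             // deallocated alongside them
}
```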
4.5 Throughput versus Response Time

Two key parameters of applications are the response time and the throughput. Buffering achieves high throughput at the cost of relatively slow response time. For query plans in which response time is also important, we flush the buffered tree more frequently to improve the response time. Section 6.7 demonstrates the trade-off.

5 Virtual Nodes

We now address the question of where to place buffers. We group tree nodes into “virtual nodes,” as illustrated in Figure 4. Each virtual node has an associated buffer, with the exception of the root virtual node. By grouping multiple levels of a subtree into a virtual node, we reduce the amount of buffering that is done.

[Figure 4: Virtual Nodes. Several adjacent levels of tree nodes are grouped into each virtual node.]

Every buffer level incurs the cost of copying the data to a buffer and rereading it to process the next level. If this copying cost is significant, then there is a potential for “too much” buffering. On the other hand, if a virtual node is larger than the L2 cache, then dividing the accesses from one buffering level to the next may incur a large number of cache misses. Thus there is also a potential for “too little” buffering. (The unmodified index methods can be seen as examples of “too little” buffering.)

Ideally, one would like to come up with a cost-based way of estimating the expense of different buffering configurations. We illustrate such an approach based on the L2 cache size in Section 5.1. Nevertheless, architectural issues such as the TLB miss behavior, and whether the underlying architecture applies hardware prefetching, have a noticeable impact on the performance of any given configuration. We discuss these issues in Section 5.2.

5.1 Virtual Nodes Based on the Cache Size

Let N be a virtual node of height d, i.e., the full d-level subtree of a node in the index. Let L denote the cache line size, and B the average branching factor of the index. The branching factor of node N is B^d.

An active component of an index is one that is potentially referenced during a lookup operation. For example, if a B-tree node was 75% full, then three quarters of the node would be considered active, and one quarter of the node would be inactive. Since only active components cause cache misses, we can discount inactive components from our memory requirement calculations. If I is an index node, then we say that the effective size of I, written |I|_e, is the total size of the active components of I.

Let |N|_e denote the total effective size of N, in bytes. If e is the average effective size of (tree) nodes in the index, then |N|_e = e · (B^d − 1) / (B − 1). For most indexes, e is a small multiple m of B; for a B-tree, m corresponds to the size of a pointer plus the size of a key.

If N does not include any leaves of the index, then the amount of L2 cache needed to avoid all (capacity) cache misses is |N|_e + (B^d + 1)L. We need to store the entire node N and the tail cache lines of B^d buffers in the cache, as well as one cache line for the tail of the input buffer. (Of course, even with this much cache memory, we will encounter conflict misses and compulsory misses.) If N is a leaf node of the virtual-node tree, i.e., the lowest index nodes in N are all leaves, then the amount of L2 cache needed to avoid all (capacity) cache misses is |N|_e + 2L. Leaf nodes are different from internal nodes with respect to buffering, because their “output” is not buffered. We need just one cache line to store the input buffer tail, and one cache line to store the tail of the output result. (For index structures that can be unbalanced, an intermediate formula between these two extremes may apply.)

In either case, the memory needed is roughly a multiple of B^d. We choose the largest d such that the memory needed is less than the size of the L2 cache.⁷ Further, since a cache miss is generally more expensive than a copy operation on cache-resident data (and is likely to become even more so in future architectures), the more important criterion is minimizing cache misses.

⁷ If there is a moderate amount of L2 cache space left over, it is conceivable that one could benefit by adding part of the next level of the index to the virtual node. We implemented this idea, but the increase in the complexity of the code to traverse the virtual node offset any potential performance gains.

Example 5.1 A B+-tree with a 4 KB node size, a 75% occupancy factor, a 4 byte key, and a 4 byte pointer has B = 375. For a two-level internal virtual node, the total memory requirement is over 10 MB on a machine with 64 byte cache lines. Since it is unlikely that the L2 cache would be that large on a current commodity machine, one concludes that a single level for each virtual node would be optimal. On future machines with a sufficiently large L3 cache, the conclusion may be different. (When the root node may have significantly fewer keys than other nodes, one should also consider special cases for combining the root node with its children when the root branching factor is small.)

Example 5.2 For a CSB+-tree with a 64 byte node size, a 75% occupancy factor, a 4 byte key, and a 4 byte pointer, B = 11. For a four-level virtual node, the total memory requirement for internal nodes is over 1 MB on a machine with 64 byte cache lines, while for three levels it is approximately 90 KB. For leaf-level virtual nodes the requirements are approximately 90 KB and 9 KB for four-level and three-level virtual nodes, respectively. The appropriate choices depend on the L2 cache size.

There may still be many configurations that conform to the cache-based guidelines for choosing the size of virtual nodes. If an index has three levels, and virtual nodes containing two levels are recommended by the cache analysis above, then there are two different ways of partitioning the levels into virtual nodes. We argue that it is unlikely to make a significant difference, in terms of cache misses, which of the two alternatives is chosen. In both cases there is one level of buffering, and the overhead is essentially the same. Nevertheless, there may be a difference due to TLB misses, which we discuss in Section 5.2.

Our configurations are based on the average branching factor of the index. Our experience indicates that the results are not sensitive to local variation in the node branching factor.

5.2 TLB Misses and Prefetching

We now analyze the TLB behavior of variable-size buffering.⁸ The TLB will contain one entry per page of data. To calculate how many TLB entries we need, we need to use the full size for an index node rather than the effective size, since we cannot assume that all inactive parts of a node are both contiguous and larger than a page. The number of TLB entries required for the index is approximately ((B^d − 1) / (B − 1)) · ⌈|I| / P⌉, where P is the virtual memory page size. (This is approximate, because alignment issues may result in additional TLB entries.)

⁸ The TLB behavior of fixed-size buffering is similar, but slightly more involved, and is omitted due to space considerations.

For leaf nodes just 2 additional TLB entries are required. The real problem for the TLB is the buffers for internal nodes. Given sufficiently large buffers, each buffer requires its own TLB entry. For internal nodes, that means B^d TLB entries. In almost all practical cases, we expect to encounter TLB thrashing. The exceptions will be cases when d = 1 and B is no larger than the TLB size (typically 64 entries on modern machines). Even if the average branching factor is high, the root node may have a small key count, and thus may also be an exception in this sense.

In these exceptional cases, we may expect significantly fewer TLB misses when we choose to buffer at each level. If the cost of a TLB miss is larger than the cost of writing to and reading from a buffer, then we might expect to get better performance from buffering more often than the cache analysis would suggest. For example, in a CSB+-tree with a branching factor of 11 (Example 5.2), it might be preferable to have three initial buffered levels, rather than one buffered virtual node of depth 3.

Some machines, such as the Pentium 4, employ hardware prefetching. The machine automatically recognizes common access patterns, such as sequential access, and prefetches ahead of the current references. In our context, such prefetching reduces the cache-miss cost of buffering, since buffering tends to use predictable access patterns. This is good news in terms of measuring overall speedups. The reduction in cache miss cost amplifies the impact of TLB misses on the overall cost.
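The cache-based rule of Section 5.1 amounts to a small calculation; the sketch below (our illustration) picks the largest height d whose working set, |N|_e plus the required buffer-tail cache lines, fits in the L2 cache. With the numbers of Example 5.1 (B = 375, e ≈ 3 KB, L = 64 bytes, a 256 KB L2 cache) it returns d = 1, matching the conclusion in the text.

```cpp
#include <cstddef>

// Sketch of the Section 5.1 cache analysis. `e` is the average effective node
// size, B the average branching factor, L the cache line size, and
// `cache_bytes` the L2 capacity (known or calibrated).
size_t pick_depth(double e, double B, double L, double cache_bytes, bool has_leaves) {
    size_t best = 1;
    for (size_t d = 1; d <= 8; ++d) {
        double fanout = 1.0, nodes = 0.0;
        for (size_t i = 0; i < d; ++i) { nodes += fanout; fanout *= B; }  // (B^d - 1)/(B - 1)
        double effective = e * nodes;                            // |N|_e
        double need = has_leaves ? effective + 2 * L             // leaf virtual node
                                 : effective + (fanout + 1) * L; // internal: B^d buffer tails
        if (need <= cache_bytes) best = d; else break;           // memory need grows with d
    }
    return best;
}
```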
5.3 Overall Algorithm

We initially run a calibration experiment to determine whether, for the given index, on the given architecture, it is worth partitioning a multiple-level virtual node into smaller ones. For example, Figure 5 shows all possible ways to break up a 3-level node. Partitioning will be beneficial when different ways have similar L2 cache behavior and the TLB miss cost is not negligible. Since the depth of a virtual node is likely to be small, it is feasible to consider all possible ways of breaking up a virtual node, and to determine the split with best overall performance.

[Figure 5: Possible Placement of a 3-Level Node. (a) (3), (b) (1, 1, 1), (c) (1, 2), (d) (2, 1).]

We calibrate by simulating each possible buffer placement by probing with a relatively small uniformly distributed input stream, and measuring the performance. We only simulate the part of the index under consideration. Thus, if we are considering whether to split a three-level virtual node that includes the root of the tree, we traverse only the first three levels of the index during the simulation.

In the case of virtual nodes that do not include the root, we choose a sample of such virtual nodes from the index. Each of them is simulated with a number of probes that is large enough to touch almost all tree nodes within the virtual node. The overall performance is calculated as the average over all samples.

As shown in Section 6.2, calibration can typically be performed in less than a second. The results of such calibration experiments can be stored with other database statistics, to avoid redundant work.

An important benefit of calibration, when compared to an analytic derivation of virtual node size, is that it is sensitive to the implementation characteristics of the index. For example, when bulk-loading an index, it is typical to allocate all nodes of a given level contiguously. (If many nodes fit within a page, then many fewer TLB entries would be required to access the nodes.) On the other hand, a tree that has evolved over time by insertions and deletions is much less likely to have siblings close to each other in physical memory. Calibration can measure the actual TLB behavior of the index, which will reflect the degree to which sibling nodes are contiguous. It would be very difficult for an analytic method to quantify this effect, since it depends on low-level memory characteristics. A similar effect occurs when trying to quantify the impact of hardware prefetching on the overall cost.

As we shall see in the experiments, cache misses have a much larger effect than TLB misses. Therefore, our TLB calibration is based on the configuration suggested by the cache analysis. Our algorithm for choosing where to buffer proceeds as follows:

1. Perform the cache analysis of Section 5.1, and choose a hierarchy of virtual nodes consistent with having few capacity cache misses.
2. Use the calibration results to determine whether it is beneficial to split a multi-level virtual node. If so, partition the virtual node according to the calibration experiment’s recommendation.
3. Recalibrate if the cache analysis changes or significant tree structure changes happen.

We shall validate the choices made by this algorithm in Section 6.2.
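A sketch of the calibration step (our illustration; build_buffers_for and run_probes_over_subtree are assumed hooks into the index): each candidate split is timed against a small uniformly distributed probe stream run over only the levels under consideration, and the fastest split is kept.

```cpp
#include <chrono>
#include <vector>

// A candidate partitioning of a multi-level virtual node, e.g. {3}, {1,1,1}, {1,2}, {2,1}.
struct Split { std::vector<int> levels; };

// Assumed hooks: set up buffers for a candidate split, and run the sample
// probes over only the levels covered by that split.
void build_buffers_for(const Split& s);
void run_probes_over_subtree(const Split& s, const std::vector<AccessRecord>& probes);

Split calibrate(const std::vector<Split>& candidates,
                const std::vector<AccessRecord>& sample_probes) {
    Split best;
    double best_time = 1e300;
    for (const Split& s : candidates) {
        build_buffers_for(s);
        auto t0 = std::chrono::steady_clock::now();
        run_probes_over_subtree(s, sample_probes);
        auto t1 = std::chrono::steady_clock::now();
        double secs = std::chrono::duration<double>(t1 - t0).count();
        if (secs < best_time) { best_time = secs; best = s; }   // keep the fastest split
    }
    return best;   // can be stored with other database statistics and reused
}
```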

4 4 7 6 3 3 5 4 2 2 3 2 1 1

Elapsed Time (seconds) Elapsed Time (seconds) Elapsed Time (seconds) 1 0 0 0 (1, 1, 1) (1, 2) (2, 1) (1, 1, 1) (1, 2) (2, 1) (1, 1, 1) (1, 2) (2, 1)

Total Time Cache Miss Penalty Total Time Cache Miss Penalty Total Time Cache Miss Penalty (a) Pentium 4 (b) Pentium 3 (c) Sun UltraSparc II Figure 6: Search Performance of 4 KB Node B+-Tree bulkloaded with a 75% occupancy factor. A leaf node The left columns in Figure 7 show the results of contains 5 (key,RID) pairs and an index node contains calibration on three architectures. We measure the 11 keys. The root node contains 4 keys. elapsed time; all calibrations complete in less than a The cache analysis suggests a 3-level root virtual second. On the Pentium 3, the TLB penalty is small node and 4-level leaf virtual nodes represented as (see Table 1) and method (3) is the fastest because (3, 4). The branching factor of the 3-level root vir- all three levels can fit in the L2 cache. On the Ul- tual node is larger than the TLB size. It may thus be traSparc, the TLB penalty is high. On the Pentium beneficial to partition the root virtual node. We can 4, the TLB cost is also relatively high, and amplified partition the 3-level root virtual node in four ways, as by the use of hardware prefetching to reduce the net shown in Figure 5. We calibrate the four placements cache miss penalty. In either case, the TLB cost is by running 1 million randomly generated query ac- sufficiently large that method (3) is not the fastest. cesses only over the first three levels of the index tree. Interestingly, since the root node contains only 4 keys, (One could get accurate measurements with fewer than the total branch factor of the first two levels combined 1 million probes in this case, since there are only about is still smaller than the TLB size. As a result, the , , , , 550 leaves in this 3-level subtree, but even 1 million method (2 1) is faster than either (1 1 1) or (1 2). probes can be performed quickly.) Is there a need to partition the 4-level virtual node that includes the leaf level? According to our algo- rithm, we simulate various partitions of the 4-level Calibration of Search Performance nodes, and find that the 4-level node performs best. 3-Level Nodes of CSB+-Tree It turns out that all four levels can be processed with- 3.5 0.31 3.3 out TLB thrashing. This is due to the combination 0.29 3.1 0.27 2.9 of two memory-related effects. First, all siblings in + 0.25 2.7 aCSB -tree are allocated contiguously, meaning that 2.5 0.23 2.3 0.21 they will usually be on the same page. Second, be- 2.1 0.19 1.9 cause the tree is generated by bulk-loading, all leaves Elapsed Time (seconds)

Elapsed Time (seconds) 0.17 1.7 0.15 1.5 (not just groups of siblings) are contiguous. This anal- (3) (1, 1, 1) (1, 2) (2, 1) (3,4) (1,1,1,4) (1,2,4) (2,1,4) ysis demonstrates the benefits of calibration discussed Pentium 4 in Section 5.3; calibration can “detect” that the tree

4.5 has been bulkloaded. 0.45

4 The right columns in Figure 7 show the overall per- 0.4 formance for 5 million probes on the three architec- 3.5 0.35 tures. The results match the calibration estimates. 3 0.3 The structure (3, 4) is fastest on the Pentium 3 in 0.25 2.5

Elapsed Time (seconds) which the TLB miss latency is low. On both the Ul- Elapsed Time (seconds) 0.2 2 traSparc and Pentium 4, the (2, 1, 4) is the fastest. (3) (1, 1, 1) (1, 2) (2, 1) (3,4) (1,1,1,4) (1,2,4) (2,1,4) Of the three architectures, the Pentium 4 is the only Pentium 3 one that supports a TLB miss performance counter. 10 1 9.5 We measure both the cache and TLB miss penalty 0.95 9 0.9 on the Pentium 4 for all four placements, as shown 8.5 0.85 8 0.8 in Figure 8. The L2 cache cost is dominant for the 7.5 0.75 7 0.7 unbuffered index. All buffered placements have similar 0.65 6.5 0.6 6 L2 cache behavior, while differing in the impact of TLB Elapsed Time (seconds) Elapsed Time (seconds) 0.55 5.5 misses. These confirm that the performance difference 0.5 5 (3) (1, 1, 1) (1, 2) (2, 1) (3,4) (1,1,1,4) (1,2,4) (2,1,4) among the four placements is due to TLB misses. Sun UltraSparc II The main conclusions from these experiments are that when considering buffer placement, (a) the cache Figure 7: CSB+-Tree 14 30 40 186 35 12 25 30 10 20 25 8 108 15 20 6 15 10 4 10 5 2 5 Elapsed Time (seconds) Elapsed Time (seconds) Elapsed Time (seconds) 0 0 0 B+-Tree CSB+-Tree R-Tree B+-Tree CSB+-Tree R-Tree B+-Tree CSB+-Tree R-Tree

Conventional Lookup Buffered Lookup Conventional Lookup Buffered Lookup Conventional Lookup Buffered Lookup (a) Pentium 4 (b) Pentium 3 (c) Sun UltraSparc II Figure 9: Overall Performance

3 10 9 2.5 8 2 7 6 1.5 5 1 4 3 0.5

Elapsed Time (seconds) 2

0 Elapsed Time (Seconds) 1 Unbuffered (3,4) (1,1,1,4) (1,2,4) (2,1,4) 0 Fixed Variable Cache Miss Penalty TLB Miss Penalty B+ CSB+ Fixed+Order Variable +Order Figure 8: Cache and TLB Impact on Pentium 4 Total Time Cache Miss Penalty miss penalty is the main source of latency, (b) the Figure 10: Search Performance TLB cost, while smaller, is still measurable (especially Figure 10 shows the elapsed time and the cache miss with good cache behavior), (c) different architectures penalty for different algorithms. The total elapsed have different optimal virtual node structures, and (d) time for variable-size buffering includes the time to dy- calibration is a quick and effective tool for determining namically create buffers. We preallocate a large mem- buffer placement. ory pool and space is allocated from the pool. Both buffering techniques significantly decrease the impact 6.3 Overall Search Performance of cache misses. Fixed-size buffering does not elimi- nate all cache thrashing. With the lowest cache miss We repeat the experiments of Figure 1 on the three rate, variable-size buffering is fastest for pure search architectures and compare the original algorithm with performance, and is three times faster than traditional variable-size buffering. We follow our guidelines for lookup. Order-preserving buffering may incur addi- + the choice of virtual node size. For the B -tree and tional cache misses for writes to the result buffer. In the R-tree, each node is a virtual node. this experiment, the result buffer is equal in size to the Figure 9 shows the results. Buffered lookup is uni- number of probes, and we use the RID optimization formly better, with speedups by a factor of two to described in Section 4.3. The results demonstrate that + three. Interestingly, the performance of B -trees and the overhead for maintaining probe order is small; the + CSB -trees differ little once buffering techniques are additional cache miss latency can be overlapped with applied. In both cases, we have a three-level virtual other useful work. tree and we perform buffering twice. The buffering 16 overhead is the same; the size of nodes is relatively Conventional B+ 14 Fixed-Size Buffer unimportant for bulk access. Variable-Size Buffer In the rest of this section, we show results on the 12

Pentium 4; the other systems have similar behavior. 10

Due to space limitations, we show results only for 8 buffered B+-trees. 6

4 Fixed Elapsed Time (seconds) 6.4 Fixed-size versus Variable-size Variable 2

We compare the search performance of different buffer- 0 ing techniques for a sequence of 5 million probes. We 0.01 0.1 1 10 Update Percentage (%) implement both fixed-size and variable-size buffering algorithms and compare with B+-tree and CSB+-tree. Figure 11: Buffering with Updates For fixed-size buffering, we choose the buffer size to be We now examine the performance of both buffering the same as the size of an index node, 4 Kbytes. We techniques on a workload that contains both probes begin our experiments with a clear cache. and updates (insertions). Figure 11 compares a conventional B+-tree in- 0.04 Conventional B+ dex, with unbuffered probes, with both fixed-size and Conventional CSB+ 0.035 Fix-Size Buffer variable-size buffering. We randomly mix insertion Variable-Size Buffer and search operations, controlling the percentage of 0.03 updates. We measure the elapsed time for a total of 0.025 5 million operations. The two horizontal lines respec- 0.02 tively represent the search performance for fixed-size 0.015 0.01 and variable-size buffering when there are no updates. Elapsed Time (seconds) When the update percentage is small, both buffer- 0.005 ing techniques show small overhead. This is likely to 0 0.01 0.1 be the most common scenario in our target applica- Percentage of Probes to Indexed Records tions: there are likely to be orders of magnitude more Figure 13: Small Number of Probes probes than updates in a given time window. As in- sertions increase in frequency, buffers are flushed more dominates the overall cost. Figure 13 compares the frequently and the advantage of buffering diminishes. performance when the probe cardinality ranges from 0.01% to 0.8% of the (5 million) indexed keys. Both Fixed-size buffering shows better performance than + variable-size buffering, because the impact of flushing buffering methods get faster than the B -tree when the outer relation size is larger than 0.08%, and be- is smaller; variable-size buffering flushes a relatively + large subtree for each update. come faster than the CSB -tree when the number ex- ceeds 0.4%. The optimizer needs to consider the probe 6.5 Non-Random Probes cardinality to decide when buffering can be beneficial. We now examine how buffering techniques adjust to 6.7 Response Time different properties of probes. We consider 5 million probes: Basic buffering methods do not guarantee fast re- • Gaussian distributed probes such that 95% of the sponse time. We implement fixed-size buffering and probes touch 20% of the leaf nodes, and 5% of the a timer thread that, after some fixed time interval t, probes touch 80% of the leaf nodes forces the whole tree to be flushed. That way, response • Uniformly distributed probes sorted by the probe time is at most t;wecallt the response time guarantee. keys 5 10 Conventional B+ Lookup Conventional CSB+ Lookup 9 Buffered Lookup 4 8 Buffered Lookup (flush every 1s) Buffered Lookup (flush every 0.5s) 7 6 3 5 4 2 3 2 1 Elapsed Time (seconds) 1 0 B+ CSB+ Fixed-size Variable-size Num of Join Results Produced (million) 0 0 0.5 1 1.5 2 2.5 3 3.5 4 Uniform Gaussian Sorted Uniform Elapsed Time (seconds) Figure 12: Three Datasets Figure 14: Response Time When probes are skewed, both buffered and non- In a first experiment, we model an index-nested- buffered structures show better performance and the loops join that is a component of a pipelined query speedup due to buffering, while still significant, is processing plan. We probe a B+-tree with the param- smaller (Figure 12). 
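A sketch of the timer mechanism (our illustration; synchronization between the flushing thread and the probe-processing thread is omitted): a separate thread forces a whole-tree flush every t seconds, so no probe waits in a buffer longer than the guarantee.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

// Timer thread sketch for the response time guarantee. `flush_entire_tree`
// stands in for a preorder flush of all buffers in the buffered tree.
void timer_flush_loop(std::chrono::duration<double> t,
                      const std::atomic<bool>& stop,
                      const std::function<void()>& flush_entire_tree) {
    while (!stop.load()) {
        std::this_thread::sleep_for(t);
        flush_entire_tree();   // bounds the time any probe spends buffered by roughly t
    }
}
```

In the experiments below, t is varied between a few tenths of a second and two seconds while the probe stream keeps filling the buffers.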
In a first experiment, we model an index-nested-loops join that is a component of a pipelined query processing plan. We probe a B+-tree with the parameters of Table 2. Figure 14 shows response time for different flush intervals. The experiment shows that after a short initial period, buffered access produces more output results (for processing by subsequent operations) than unbuffered access.

[Figure 14: Response Time. Number of join results produced (millions) over elapsed time for conventional B+ and CSB+ lookup, buffered lookup, and buffered lookup flushed every 1 s and every 0.5 s.]

In a second experiment, we model a stream-processing application in which the data stream is used to probe an index. We perform 10^7 probes, and measure the throughput as 10^7/T, where T is the total time needed to complete the experiment.

Figure 15 shows the throughput as a function of the response time guarantee. The horizontal dashed lines respectively represent the throughputs for conventional B+-trees, CSB+-trees, and the same fixed-size buffered tree but flushing the tree only at the end. When the response time guarantee is small, there are more flush operations. The benefit of buffering is small and the throughput is comparable with conventional techniques. However, even when the response time guarantee is 0.1 seconds, the throughput of the buffered tree is noticeably better than those for the B+-tree and the CSB+-tree. The throughput improves as the response time guarantee increases. With a response time guarantee of one second, the system can achieve almost the same throughput as a fixed-size buffer tree that only flushes at the end.

[Figure 15: Throughput. Throughput (million lookups per second) of the fixed-size buffered tree versus the response time guarantee (seconds), compared with the conventional B+-tree and CSB+-tree.]

The importance of this experiment is that it demonstrates that buffered indexes can achieve a throughput that is simply not achievable with conventional indexes. In a scenario in which the application needed to handle 1 million lookups per second, and could tolerate small response time delays, buffered trees would work on the given tree and architecture, while conventional indexes would not.

7 Conclusion

Both conventional and cache-conscious index structures suffer from cache thrashing between query accesses. We propose techniques to buffer accesses to memory-resident tree-structured indexes to exploit cache spatial and temporal locality. The buffer structures are separate from the index structures. For single lookups, one may use the index in a conventional fashion. For bulk lookups, our algorithms improve the rate at which probes can be processed, typically by a factor of two to three. Our algorithms can also handle updates mixed with probes, while providing a response time guarantee for probes.

These results are important for applications such as stream processing, where probe throughput can be a critical parameter. They are also relevant for speeding up more conventional query processing operations, such as index-nested-loops joins, in main memory.

While we have described buffering in the context of a tree-based index, it also applies to structures such as hash indexes. If the hash table is too large to fit in the L2 cache, it may be beneficial to partition and buffer the probes into smaller groups using a prefix of the hashed key, so that the fragment required by each group can fit in the cache.

References

[1] Berkeley Telegraph Project. http://telegraph.cs.berkeley.edu.
[2] Cornell Cougar Project. http://cougar.cs.cornell.edu.
[3] LMbench - Tools for Performance Analysis. http://www.bitmover.com/lmbench/.
[4] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood. DBMSs on a modern processor: Where does time go? In Proceedings of VLDB Conference, 1999.
[5] P. Boncz, S. Manegold, and M. Kersten. Database architecture optimized for the new bottleneck: Memory access. In Proceedings of VLDB Conference, 1999.
[6] S. Chen, P. B. Gibbons, and T. C. Mowry. Improving index performance through prefetching. In Proceedings of ACM SIGMOD Conference, 2001.
[7] S. Chen, P. B. Gibbons, T. C. Mowry, and G. Valentin. Fractal prefetching B+-trees: Optimizing both cache and disk performance. In Proceedings of ACM SIGMOD Conference, 2002.
[8] R. Choubey, L. Chen, and E. A. Rundensteiner. GBI: A generalized R-tree bulk-insertion strategy. In Symposium on Large Spatial Databases, 1999.
[9] D. Comer. The ubiquitous B-tree. ACM Computing Surveys, 11(2):121-137, 1979.
[10] A. Guttman. R-trees: A dynamic index structure for spatial searching. In Proceedings of ACM SIGMOD Conference, 1984.
[11] C. Jermaine, A. Datta, and E. Omiecinski. A novel index supporting high volume data warehouse insertions. In Proceedings of VLDB Conference, 1999.
[12] I. Kamel and C. Faloutsos. On packing R-trees. In International Conference on Information and Knowledge Management, 1993.
[13] K. Kim, S. K. Cha, and K. Kwon. Optimizing multidimensional index trees for main memory access. In Proceedings of ACM SIGMOD Conference, 2001.
[14] K. Pollari-Malmi, E. Soisalon-Soininen, and T. Ylönen. Concurrency control in B-trees with batch updates. IEEE Transactions on Knowledge and Data Engineering, 1996.
[15] J. Rao and K. A. Ross. Cache conscious indexing for decision-support in main memory. In Proceedings of VLDB Conference, 1999.
[16] J. Rao and K. A. Ross. Making B+ trees cache conscious in main memory. In Proceedings of ACM SIGMOD Conference, 2000.
[17] T. Urhan and M. J. Franklin. XJoin: A reactively-scheduled pipelined join operator. IEEE Data Engineering Bulletin, 23(2), 2000.
[18] J. van den Bercken, B. Seeger, and P. Widmayer. The bulk index join: A generic approach to processing non-equijoins. In Proceedings of IEEE Conference on Data Engineering, 1999.