Integrating the UB-Tree Into a Database System Kernel
Frank Ramsak¹, Volker Markl¹, Robert Fenk¹, Martin Zirkel², Klaus Elhardt³, Rudolf Bayer¹,²

¹ Bayerisches Forschungszentrum für Wissensbasierte Systeme, Orleansstraße 34, D-81667 München, Germany
² Institut für Informatik, TU München, Orleansstraße 34, D-81667 München, Germany
³ TransAction Software GmbH, Gustav-Heinemann-Ring 109, D-81739 München, Germany

{frank.ramsak, robert.fenk, volker.markl}@forwiss.de, {zirkel, bayer}@in.tum.de, [email protected]

Abstract

Multidimensional access methods have shown high potential for significant performance improvements in various application domains. However, only a few approaches have made their way into commercial products. In commercial database management systems (DBMSs) the B-Tree is still the prevalent indexing technique. Integrating new indexing methods into existing database kernels is in general a very complex and costly task. Exceptions exist, as our experience of integrating the UB-Tree into TransBase, a commercial DBMS, shows. The UB-Tree is a very promising multidimensional index, which has shown its superiority over traditional access methods in different scenarios, especially in OLAP applications. In this paper we discuss the major issues of a UB-Tree integration. As we will show, the complexity and cost of this task are reduced significantly because the UB-Tree relies on the classical B-Tree. Even though commercial DBMSs provide interfaces for index extensions, we favor the kernel integration because of the tight coupling with the query optimizer, which allows for optimal usage of the UB-Tree in execution plans. Measurements on a real-world data warehouse show that the kernel integration leads to an additional performance improvement compared to our prototype implementation and competing index methods.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000

1 Introduction

Various research approaches in the past have shown that multidimensional access methods (MAMs) have a high impact on different database application domains like data warehousing, data mining, or geographical information systems. However, despite the vast research effort, MAMs have not made their way into commercial database management systems on a broad scale. This is mostly because the integration of these complex data structures into an existing database kernel is fairly complicated. Especially concurrency and recovery issues, which are as important as performance issues for commercial systems, are major obstacles. For most MAMs new solutions to these problems, e.g., locking for R-Trees [KB95, CM98], have to be developed, as the new concepts do not allow reusing standard techniques. This makes the kernel integration of an MAM a very costly task in the range of multiple man-years. As a consequence, many DBMS producers have not integrated the new technology into their systems, but offer it only as add-on features.

So, are MAMs just another nice research gimmick, but commercially not affordable? No, in this paper we will show that there are MAMs that provide good performance on the one hand and can be smoothly integrated into a DBMS kernel on the other. A category of MAMs is based on the combination of one-dimensional index structures and space-filling curves. One prominent example is the UB-Tree [Bay97], which combines the B-Tree and the Z-Curve. Together with its sophisticated query processing algorithms it has proven its performance advantages in numerous application domains. Because the UB-Tree is based on the standard B-Tree, which is the basic index structure in almost every commercial DBMS, the task of integrating this MAM into an existing kernel becomes less complex and less costly.

The kernel integration of the UB-Tree into TransBase [Tra98] (as part of an ESPRIT project funded by the European Commission) has been accomplished within one year. TransBase is a full-scale relational database system, which conforms to the SQL-92 standard. TransBase, which handles databases of up to 8 terabytes of data, is used especially in the field of CD-ROM retrieval systems, and has an installation base of well beyond 50,000 sites worldwide.

In this paper, we will present the major issues and problems that have to be tackled and solved for a successful UB-Tree kernel integration. The paper is organized as follows: Section 2 presents the basic concepts behind the UB-Tree and Section 3 deals with the issue of the UB-Tree address representation and the implementation of the standard UB-Tree operations. Section 4 addresses the specific query operation of the UB-Tree and Section 5 tackles the important topic of required optimizer extensions to efficiently support the UB-Tree. Section 6 covers additional enhancements for the UB-Tree and Section 7 presents the performance evaluation. Section 8 summarizes related work and Section 9 concludes the paper.

2 Basic Concept of the UB-Tree

The basic idea of the UB-Tree [Bay97] is to use a space-filling curve to map a multidimensional universe to one-dimensional space. Because it uses the Z-Curve (Figure 2-1a) to preserve multidimensional clustering as well as possible, it is a variant of the zkd-B-Tree [OM84].

[Figure 2-1 Z-Addresses and Z-Regions: (a) the Z-Curve, (b) the Z-Addresses of an 8×8 universe, (c) the Z-Region [4 : 20], (d) a partitioning into five Z-Regions, (e) ten points creating the partitioning of (d). Only panel (b) is recoverable as text:]

       0   1   2   3   4   5   6   7
   0   0   1   4   5  16  17  20  21
   1   2   3   6   7  18  19  22  23
   2   8   9  12  13  24  25  28  29
   3  10  11  14  15  26  27  30  31
   4  32  33  36  37  48  49  52  53
   5  34  35  38  39  50  51  54  55
   6  40  41  44  45  56  57  60  61
   7  42  43  46  47  58  59  62  63

A Z-Address α = Z(x) is the ordinal number of the key attributes of a tuple x on the Z-Curve, which can be efficiently computed by bit-interleaving (see Section 3.1). A standard B-Tree is used to index the tuples, taking the Z-Addresses of the tuples as keys.

The fundamental innovation of UB-Trees is the concept of Z-Regions to create a disjoint partitioning of the multidimensional space. This allows for very efficient processing of multidimensional range queries (see Section 4). A Z-Region [α : β] is the space covered by an interval on the Z-Curve and is defined by two Z-Addresses α and β. We call β the region address of [α : β]. Each Z-Region maps onto exactly one page on secondary storage, i.e., to one leaf page of the B-Tree.

For a two-dimensional universe of size 8×8, Figure 2-1b shows the corresponding Z-Addresses. Figure 2-1c shows the Z-Region [4 : 20] and Figure 2-1d shows a partitioning with five Z-Regions [0 : 3], [4 : 20], [21 : 35], [36 : 47] and [48 : 63]. Assuming a page capacity of 2 points, Figure 2-1e shows ten points, which create the partitioning of Figure 2-1d.

The details of the UB-Tree algorithms are described in the following sections.

3 UB-Tree Address Representation and Standard Operations

For the rest of the paper we will refer to the function computing the Z-Addresses as UBKEY and to the keys as Z-values. It is important to note that the UB-Tree algorithms can be implemented and integrated without fundamental changes to the query processing of the database kernel. They do not require special tuple handling or other significant modifications, as the following sections show.

3.1 Address Representation and Z-value Computation

An important question for the implementation of the UB-Tree inside the database kernel is how to represent the Z-values. All algorithms for the UB-Tree basically rely on Z-values in the format of variable-length bitstrings (trailing zeros are omitted to reduce storage requirements). The operations on Z-values manipulate single bits and copy parts of the bitstring. The UBKEY function can be efficiently implemented, as it requires only reading the specified index attributes bitwise and writing the bits at the corresponding positions in the resulting Z-value.

As input the UBKEY function requires a bitstring representation of the attribute values. The natural order $\le$ of the attribute values in the original domain $A$ has to correspond to the bit-lexicographical order $\le_{\mathrm{bitstr}}$ on bitstrings, i.e., $a_i \le a_j \Leftrightarrow \mathrm{bitstr}(a_i) \le_{\mathrm{bitstr}} \mathrm{bitstr}(a_j)$, where $\mathrm{bitstr} : A \to \{ b \mid b \in \{0,1\}^* \}$ generates the corresponding bitstring. For example, in the case of unsigned integers and strings bitstr := identity, while for signed integers the bitstr function has to take care of the sign bit.

To compute the Z-value of a tuple, we interleave the bits of the bitstring representation of the key attributes (see Figure 3-1).

[Figure 3-1 Calculation of Z-values by bit-interleaving the transformed attributes: the bits of Bitstring 1 to Bitstring d are combined step by step (Step 0, Step 1, Step 2, ...) into the Z-value.]

Note that complex normalization may lead to significant performance overhead, which can then no longer be neglected. Our standard normalization techniques require just a few microseconds of CPU time and therefore do not affect the address calculation performance.

3.2 Insertion, Deletion, Update

The basic algorithms of the UB-Tree are handled by the underlying B-Tree. To perform an insertion, deletion or update one just has to compute the Z-value corresponding to the tuple.
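
The bit-interleaving of Section 3.1 and the mapping of a Z-Address to its Z-Region can be illustrated with a minimal Python sketch. It is not the TransBase implementation: the names z_address and region_of are hypothetical, attributes are assumed to be fixed-width unsigned integers rather than variable-length bitstrings, and the bit order is chosen (as an assumption) so that calling z_address((column, row), 3) reproduces the grid of Figure 2-1b. The region lookup uses a sorted list of region addresses as a stand-in for the B-Tree search.

```python
from bisect import bisect_left


def z_address(coords, bits):
    """Interleave the bits of unsigned integer coordinates into one Z-address.

    Assumption: coords[0] supplies the least significant bit of every
    group of interleaved bits, which matches Figure 2-1b for
    coords = (column, row).
    """
    d = len(coords)
    z = 0
    for bit in range(bits):                  # bit position inside each attribute
        for dim, value in enumerate(coords):
            if (value >> bit) & 1:
                z |= 1 << (bit * d + dim)    # target position in the Z-address
    return z


# Region addresses (upper interval bounds) of the five Z-regions of Figure 2-1d.
REGION_ADDRESSES = [3, 20, 35, 47, 63]


def region_of(z, region_addresses=REGION_ADDRESSES):
    """Return the Z-region (alpha, beta) whose interval on the Z-curve contains z."""
    i = bisect_left(region_addresses, z)     # first region address >= z
    alpha = 0 if i == 0 else region_addresses[i - 1] + 1
    return alpha, region_addresses[i]
```

For the point in column 2, row 3 of the 8×8 universe, z_address((2, 3), 3) yields 14, and region_of(14) yields (4, 20), i.e., the Z-Region [4 : 20] of Figure 2-1c. In a real kernel the lookup would be the ordinary B-Tree descent on the Z-value, which is why insertion, deletion and update need no new index machinery.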