ddd d d
¨
INSTITUT FUR INFORMATIK
d d d d
- d
- d
d d d d d d d
- ¨
- ¨
d d
DER TECHNISCHEN UNIVERSITAT MUNCHEN
Advanced Concepts and Applications of the UB-Tree
Robert Josef Widhopf-Fenk
Institut fu¨r Informatik der Technischen Universit¨at Mu¨nchen
Advanced Concepts and Applications of the UB-Tree
Robert Josef Widhopf-Fenk
Vollsta¨ndiger Abdruck der von der Fakult¨at fu¨r Informatik der Technischen Universit¨at Mu¨nchen zur Erlangung des akademischen Grades eines
Doktors der Naturwissenschaften
genehmigten Dissertation.
- Vorsitzende:
- Univ.-Prof. G. J. Klinker, Ph.D.
Pru¨fer der Dissertation: 1. Univ.-Prof. R. Bayer, Ph.D., emeritiert
2. Univ.-Prof. A. Kemper, Ph.D.
Die Dissertation wurde am 05.08.2004 bei der Technischen Universita¨t Mu¨nchen eingereicht und durch die Fakult¨at fu¨r Informatik am 07.02.2005 angenommen.
Acknowledgements
First of all, I want to thank my supervisor Rudolf Bayer for his advice, guidance, support, and patience. Without his invitation to be a Ph.D. student, I would have never been working on database research.
I owe my former colleagues in the MISTRAL project many thanks for fruitful discussions, proofreading, inspirations, and motivating me: Volker Markl, Frank Ramsak, Martin Zirkel and our “external” member Roland Pieringer.
I like to thank my students, Oliver Nickel, Chris Hodges and Werner Unterhofer contributing to this work by supporting the implementation of algorithms and performing experiments.
Special thanks to our project partners at TransAction, Teijin, GfK, and the members of the EDITH project.
And last but not least, many thanks to Michael Bauer, Bernd Reiner, and Charlie Hahn for discussions, distraction and small talk. Specifically: to Charlie for the ice cream; to Alex S. for the Gauloise; to Evi Kollmann for all your help, even when you were loaded with other work you always were at hand. Also thanks to everyone I accidently forgot.
To my love Birgit for your patience, for our children Matilda and Valentin, and for going through this with me!
Abstract
The UB-Tree is an index structure for multidimensional point data. By name, it claims to be universal, but this imposes a huge burden, as there are few things which really prove to be universal. This thesis takes a closer look at aspects where the UB-Tree is not universal at a first glance.
The first aspect is the discussion of space filling curves (SFC), in particular comparing the Z-curve and the Hilbert-curve. The Z-curve is used to cluster data indexed by the UB-Tree and we highlight its advantages in comparison to other SFCs. While the Hilbertcurve provides better clustering, the Z-curve is superior w.r. to other metrics, i.e., it is significantly more efficient to calculate addresses and the mapping of queries to SFC- segments, and it is able to space efficiently index arbitrary universes. Thus the Z-curve is more universal here.
The second aspect are bulk operations on UB-Trees. Especially for data warehousing the bulk insertion and deletion are crucial operations. We present efficient algorithms for incremental insertion and deletion.
The third aspect is the comparison of the UB-Tree with bitmap indexes used for an example data warehousing application. We show how performance of bitmap indexes is increased by clustering the base table according to a SFC. Still the UB-Tree proves to be superior.
The fourth aspect is the efficient management of data with skewed data distributions.
The UB-Tree adapts its partitioning to the actual data distribution, but in comparison to the R-Tree, it suffers from being not able to prune search path leading to unpopulated space (= dead space). This is caused by partitioning the complete universe with separators. We present a novel index structure, the bounding UB-Tree (BUB-Tree), which is a variant of the UB-Tree inheriting its worst case guarantees for the basic operations while efficiently addressing queries on dead space. In comparison to R-Trees, its query performance is similar while offering superior maintenance performance and logarithmic worst case guarantees, thus being more universal than the R∗-Tree.
The last aspect addressed in this thesis is the management of spatial data. The UB-Tree is an index designed for point data, however also spatial objects can be indexed efficiently with it by mapping them to higher dimensional points. We discuss different mapping methods and their performance in comparison to the RI-Tree and R∗-Tree.
Our conclusion: The UB-Tree comes closer to being an universal index than any other competing index structure. It is flexible, dynamic, relatively easy to integrate into a DBMS kernel, and provides logarithmic worst case guarantees for the basic operations of insertion, deletion, and update. By extending its concepts to the BUB-Tree it is also able to efficiently support skewed queries on skewed data distributions.
Contents
- Preface
- 1
- 1 Introduction
- 3
344555667778
1.1 The Objective: Getting Information Fast . . . . . . . . . . . . . . . . . . . 1.2 The Problem: The Properties of Real World Data . . . . . . . . . . . . . .
1.2.1 Complex Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2.2 Skewed Data Distribution . . . . . . . . . . . . . . . . . . . . . . . 1.2.3 Huge Data Amounts . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2.4 Continuous Change . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.3 The Solution: Multidimensional Access Methods . . . . . . . . . . . . . . . 1.4 Objective and Outline of this Thesis . . . . . . . . . . . . . . . . . . . . . 1.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.5.1 RDBMSs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.5.2 Multidimensional Access Methods . . . . . . . . . . . . . . . . . . .
1.6 The MISTRAL Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
- 2 Terminology and Basic Concepts
- 11
2.1 General Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.1 Tuple, Relation, Universe . . . . . . . . . . . . . . . . . . . . . . . . 12 2.1.2 Query, Result Set, and Selectivity . . . . . . . . . . . . . . . . . . . 14
2.2 Storage Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3 Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.4 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.5 Access Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5.1 Clustering Index, Primary Index, Secondary Index . . . . . . . . . . 24 2.5.2 Address . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.5.3 Accessed Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
- 3 Space-filling Curves
- 29
3.1 Space Filling Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.1 Compound Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.1.2 Snake-Curve and Zig-Zag-Curve . . . . . . . . . . . . . . . . . . . . 33 3.1.3 Fractal Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
iii iv
CONTENTS
3.1.4 Curves created by order preserving Bit-Permutations . . . . . . . . 39
3.2 Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.1 Address calculations . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.2.2 Indexing a Space Filling Curve . . . . . . . . . . . . . . . . . . . . 44 3.2.3 SFC Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.1 [Jag90a] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.3.2 [MJF+01] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.3.3 [HW02] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.3.4 [MAK02] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.3.5 [Law00] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.3.6 Summary of Related Work . . . . . . . . . . . . . . . . . . . . . . . 57
3.4 Comparison of SFC Properties . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.4.1 Region Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.4.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
- 4 The UB-Tree
- 63
4.1 Basic Concepts of UB-Trees . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.1.1 Z-regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.1.2 Range Query Processing . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2 UB-Tree Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2.1 UBAPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
- 4.2.2 Transbase
- Hypercube . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2.3 RFDBMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3 Bulk Insertion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.3.2 General Problem Description . . . . . . . . . . . . . . . . . . . . . 72 4.3.3 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.3.4 DBMS-Kernel Integration . . . . . . . . . . . . . . . . . . . . . . . 78 4.3.5 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.3.6 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 80 4.3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.4 Bulk Deletion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 4.4.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 4.4.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
- 5 Bitmap-Indexes on Clustered Data
- 95
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.1.1 Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 5.1.2 Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
CONTENTS
v
5.1.3 Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 5.1.4 Maintenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.2 Bitmaps on clustered data . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 5.3 Evaluation with the GfK DWH . . . . . . . . . . . . . . . . . . . . . . . . 103
5.3.1 Maintenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 5.3.2 Query Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 5.5 Multi-User environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
- 6 The BUB-Tree
- 123
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 6.3 BUB-Tree regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 6.4 Point Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 6.5 Insertion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 6.6 Page Split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 6.7 Deletion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 6.8 Range Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 6.9 Bulk Insertion and Bulk Deletion . . . . . . . . . . . . . . . . . . . . . . . 136
6.9.1 Initial Bulk Loading . . . . . . . . . . . . . . . . . . . . . . . . . . 136 6.9.2 Incremental Bulk Loading . . . . . . . . . . . . . . . . . . . . . . . 136 6.9.3 Bulk Deletion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.10 Reorganization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.10.1 Tightening Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 6.10.2 Optimizing the Index Coverage . . . . . . . . . . . . . . . . . . . . 138 6.10.3 Increasing the Page Utilization . . . . . . . . . . . . . . . . . . . . 139
6.11 Implementation-Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 6.12 Comparison of BUB-Tree and R-Tree . . . . . . . . . . . . . . . . . . . . . 141
6.12.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 6.12.2 {R}-Tree coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 6.12.3 Bounding Technique . . . . . . . . . . . . . . . . . . . . . . . . . . 143 6.12.4 Supported Data Types . . . . . . . . . . . . . . . . . . . . . . . . . 144 6.12.5 Dimensionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 6.12.6 Multi-User Support . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 6.12.7 Page Split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 6.12.8 Index Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 6.12.9 Maintenance Performance . . . . . . . . . . . . . . . . . . . . . . . 145 6.12.10Query Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.13 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.13.1 Random insertion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 6.13.2 Bulk Loading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 6.13.3 Query Procession . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 6.13.4 Point Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168 6.13.5 Random Range Queries . . . . . . . . . . . . . . . . . . . . . . . . 171 6.13.6 Street Section Queries . . . . . . . . . . . . . . . . . . . . . . . . . 173
6.14 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
- 7 Management of Spatial Objects
- 177
7.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
7.1.1 RDBMS Implementations . . . . . . . . . . . . . . . . . . . . . . . 178 7.1.2 Interval Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 7.1.3 Sub-Cube Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 7.1.4 Previous UB-Tree based solutions . . . . . . . . . . . . . . . . . . . 181
7.2 Mapping Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
7.2.1 Approximation by a MBB . . . . . . . . . . . . . . . . . . . . . . . 181 7.2.2 Approximation by SFC-segments . . . . . . . . . . . . . . . . . . . 186 7.2.3 Indexing the Parameter Space . . . . . . . . . . . . . . . . . . . . . 188
7.3 Experiments: Interval Data . . . . . . . . . . . . . . . . . . . . . . . . . . 189
7.3.1 Artificial Interval Data Sets . . . . . . . . . . . . . . . . . . . . . . 189 7.3.2 Artificial Interval Query Sets . . . . . . . . . . . . . . . . . . . . . . 190 7.3.3 Interval Index Structures . . . . . . . . . . . . . . . . . . . . . . . . 191 7.3.4 Qualitative Comparison of Interval-Index Structures . . . . . . . . . 191 7.3.5 Interval Measurement Results . . . . . . . . . . . . . . . . . . . . . 192 7.3.6 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195 7.3.7 Real World Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198 7.3.8 GEO Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 7.3.9 Endpoint Transformation . . . . . . . . . . . . . . . . . . . . . . . . 201 7.3.10 Comparison of Point-Transformation Variants . . . . . . . . . . . . 204 7.3.11 Approximation by SFC-segments . . . . . . . . . . . . . . . . . . . 204
7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
- 8 Summary
- 207
8.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
- A Description of the Test System
- 209
A.1 System Sunbayer69 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 A.2 System Sunwibas0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 A.3 System Atmistral7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
- B Real World Data Sets
- 211
B.1 GfK Datawarehouse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211 B.2 GEO Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
C Measures of RFDBMS Bibliography
224 239
List of Figures
2.1 Example for a two dimensional Universe and Queries on it . . . . . . . . . 15 2.2 Physical Structure of a Hard-Disk . . . . . . . . . . . . . . . . . . . . . . . 18 2.3 Percentage of Positioning Cost with growing Page Size . . . . . . . . . . . 19 2.4 Random and sequential Access Times on different Hardware Configurations 20 2.5 Different Types of Clustering for Lexical Order and the Interval Query [c, f] 23
2.6 Clustering Page-Based Index vs. Secondary Index . . . . . . . . . . . . . . 25
2.7 Range Query and actually affected Area of the Universe . . . . . . . . . . . 27 3.1 Simple SFCs for 8x8 universe with range query . . . . . . . . . . . . . . . . 29 3.2 Example relations of ∆M to ∆S . . . . . . . . . . . . . . . . . . . . . . . . 31 3.3 Universe, quad tree partitioning, and address . . . . . . . . . . . . . . . . . 34 3.4 Z-curve for 8x8 universe . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.5 bit-interleaving for dimensions with differing domains . . . . . . . . . . . . 37 3.6 Hilbert-curve for 8x8 universe . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.7 Hilbert-curve indexing an universe with differing dimension cardinalities . . 38 3.8 Gray-Code-curve for 8x8 universe . . . . . . . . . . . . . . . . . . . . . . . 40 3.9 Brother-father-search with Z-curve . . . . . . . . . . . . . . . . . . . . . . 43 3.10 Cost of post filtering vs. NJI/NJO calculation . . . . . . . . . . . . . . . . 48 3.11 Space partitioning for Z- and Hilbert-curve . . . . . . . . . . . . . . . . . . 50 3.12 B-Tree traversal: Page accesses . . . . . . . . . . . . . . . . . . . . . . . . 51 3.13 Percentage of segments in SFCs . . . . . . . . . . . . . . . . . . . . . . . . 56 3.14 SFC-Regions consisting of disjoint areas in Ω . . . . . . . . . . . . . . . . . 60
4.1 A UB-Tree Partitioning the World into Z-regions . . . . . . . . . . . . . . 64 4.2 Z-regions of a UB-Tree for a given Data Distribution . . . . . . . . . . . . 66 4.3 A Bulk Loading Architecture for one dimensional clustering Indexes . . . . 73 4.4 Estimated Number of Page Reads and Writes . . . . . . . . . . . . . . . . 79 4.5 Changes of UB-Tree Space Partitioning during Incremental Loading . . . . 80 4.6 The Bulk Loading Architecture as implemented for UB-Trees . . . . . . . . 81 4.7 Data Distribution and Times for Copying resp. Pipe-Lining . . . . . . . . . 83 4.8 Address Calculation, Sorting and Loading Performance for GfK DWH . . . 83 4.9 Page Statistics for Bulk Loading of GfK DWH . . . . . . . . . . . . . . . . 86 4.10 Deleting and Freeing Subtrees . . . . . . . . . . . . . . . . . . . . . . . . . 91
vii viii
LIST OF FIGURES
5.1 Query Processing with Bitmap Indexes . . . . . . . . . . . . . . . . . . . . 98 5.2 RLE Bitmap Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 5.3 Cost for Address Calculation and Sorting . . . . . . . . . . . . . . . . . . . 105 5.4 Maintenance Cost for Bitmap Indexesc . . . . . . . . . . . . . . . . . . . . 107 5.5 Bitmaps mapped to Images . . . . . . . . . . . . . . . . . . . . . . . . . . 108 5.6 Overall Maintenance Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 5.7 PG SERIES: Result Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 5.8 PG SERIES: Elapsed Time . . . . . . . . . . . . . . . . . . . . . . . . . . 114 5.9 PG SERIES: Page Accesses w.r. to Query . . . . . . . . . . . . . . . . . . 116 5.10 Time per Logical Page Accesses . . . . . . . . . . . . . . . . . . . . . . . . 117 5.11 PG SERIES: Page Accesses w.r. to Result Set Size . . . . . . . . . . . . . 118 5.12 PG SERIES Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 5.13 2Y PG SERIES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120