Advanced Concepts and Applications of the UB-Tree
Total Page:16
File Type:pdf, Size:1020Kb
INSTITUT FUR¨ INFORMATIK d d d d d d d d DER TECHNISCHEN UNIVERSITAT¨ MUNCHEN¨ d d d d d d d d d d d d Advanced Concepts and Applications of the UB-Tree Robert Josef Widhopf-Fenk Institut f¨urInformatik der Technischen Universit¨atM¨unchen Advanced Concepts and Applications of the UB-Tree Robert Josef Widhopf-Fenk Vollst¨andiger Abdruck der von der Fakult¨at f¨urInformatik der Technischen Universit¨at M¨unchen zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften genehmigten Dissertation. Vorsitzende: Univ.-Prof. G. J. Klinker, Ph.D. Pr¨uferder Dissertation: 1. Univ.-Prof. R. Bayer, Ph.D., emeritiert 2. Univ.-Prof. A. Kemper, Ph.D. Die Dissertation wurde am 05.08.2004 bei der Technischen Universit¨atM¨unchen eingereicht und durch die Fakult¨at f¨ur Informatik am 07.02.2005 angenommen. Acknowledgements First of all, I want to thank my supervisor Rudolf Bayer for his advice, guidance, support, and patience. Without his invitation to be a Ph.D. student, I would have never been working on database research. I owe my former colleagues in the MISTRAL project many thanks for fruitful discus- sions, proofreading, inspirations, and motivating me: Volker Markl, Frank Ramsak, Martin Zirkel and our “external” member Roland Pieringer. I like to thank my students, Oliver Nickel, Chris Hodges and Werner Unterhofer con- tributing to this work by supporting the implementation of algorithms and performing experiments. Special thanks to our project partners at TransAction, Teijin, GfK, and the members of the EDITH project. And last but not least, many thanks to Michael Bauer, Bernd Reiner, and Charlie Hahn for discussions, distraction and small talk. Specifically: to Charlie for the ice cream; to Alex S. for the Gauloise; to Evi Kollmann for all your help, even when you were loaded with other work you always were at hand. Also thanks to everyone I accidently forgot. To my love Birgit for your patience, for our children Matilda and Valentin, and for going through this with me! Abstract The UB-Tree is an index structure for multidimensional point data. By name, it claims to be universal, but this imposes a huge burden, as there are few things which really prove to be universal. This thesis takes a closer look at aspects where the UB-Tree is not universal at a first glance. The first aspect is the discussion of space filling curves (SFC), in particular comparing the Z-curve and the Hilbert-curve. The Z-curve is used to cluster data indexed by the UB-Tree and we highlight its advantages in comparison to other SFCs. While the Hilbert- curve provides better clustering, the Z-curve is superior w.r. to other metrics, i.e., it is significantly more efficient to calculate addresses and the mapping of queries to SFC- segments, and it is able to space efficiently index arbitrary universes. Thus the Z-curve is more universal here. The second aspect are bulk operations on UB-Trees. Especially for data warehousing the bulk insertion and deletion are crucial operations. We present efficient algorithms for incremental insertion and deletion. The third aspect is the comparison of the UB-Tree with bitmap indexes used for an example data warehousing application. We show how performance of bitmap indexes is increased by clustering the base table according to a SFC. Still the UB-Tree proves to be superior. The fourth aspect is the efficient management of data with skewed data distributions. The UB-Tree adapts its partitioning to the actual data distribution, but in comparison to the R-Tree, it suffers from being not able to prune search path leading to unpopulated space (= dead space). This is caused by partitioning the complete universe with separa- tors. We present a novel index structure, the bounding UB-Tree (BUB-Tree), which is a variant of the UB-Tree inheriting its worst case guarantees for the basic operations while efficiently addressing queries on dead space. In comparison to R-Trees, its query perfor- mance is similar while offering superior maintenance performance and logarithmic worst case guarantees, thus being more universal than the R∗-Tree. The last aspect addressed in this thesis is the management of spatial data. The UB-Tree is an index designed for point data, however also spatial objects can be indexed efficiently with it by mapping them to higher dimensional points. We discuss different mapping methods and their performance in comparison to the RI-Tree and R∗-Tree. Our conclusion: The UB-Tree comes closer to being an universal index than any other competing index structure. It is flexible, dynamic, relatively easy to integrate into a DBMS kernel, and provides logarithmic worst case guarantees for the basic operations of insertion, deletion, and update. By extending its concepts to the BUB-Tree it is also able to efficiently support skewed queries on skewed data distributions. Contents Preface 1 1 Introduction 3 1.1 The Objective: Getting Information Fast . 3 1.2 The Problem: The Properties of Real World Data . 4 1.2.1 Complex Structure . 4 1.2.2 Skewed Data Distribution . 5 1.2.3 Huge Data Amounts . 5 1.2.4 Continuous Change . 5 1.3 The Solution: Multidimensional Access Methods . 6 1.4 Objective and Outline of this Thesis . 6 1.5 Related Work . 7 1.5.1 RDBMSs . 7 1.5.2 Multidimensional Access Methods . 7 1.6 The MISTRAL Project . 8 2 Terminology and Basic Concepts 11 2.1 General Notation . 11 2.1.1 Tuple, Relation, Universe . 12 2.1.2 Query, Result Set, and Selectivity . 14 2.2 Storage Devices . 15 2.3 Caching . 21 2.4 Clustering . 21 2.5 Access Methods . 24 2.5.1 Clustering Index, Primary Index, Secondary Index . 24 2.5.2 Address . 25 2.5.3 Accessed Data . 26 3 Space-filling Curves 29 3.1 Space Filling Curves . 31 3.1.1 Compound Curve . 32 3.1.2 Snake-Curve and Zig-Zag-Curve . 33 3.1.3 Fractal Curves . 33 iii iv CONTENTS 3.1.4 Curves created by order preserving Bit-Permutations . 39 3.2 Indexing . 40 3.2.1 Address calculations . 41 3.2.2 Indexing a Space Filling Curve . 44 3.2.3 SFC Index . 46 3.3 Related Work . 49 3.3.1 [Jag90a] . 52 3.3.2 [MJF+01] ................................ 53 3.3.3 [HW02] . 54 3.3.4 [MAK02] . 54 3.3.5 [Law00] . 55 3.3.6 Summary of Related Work . 57 3.4 Comparison of SFC Properties . 58 3.4.1 Region Partitioning . 59 3.4.2 Summary . 60 4 The UB-Tree 63 4.1 Basic Concepts of UB-Trees . 63 4.1.1 Z-regions . 64 4.1.2 Range Query Processing . 66 4.2 UB-Tree Implementations . 66 4.2.1 UBAPI . 67 4.2.2 Transbase ® Hypercube . 68 4.2.3 RFDBMS . 68 4.2.4 Summary . 69 4.3 Bulk Insertion . 71 4.3.1 Introduction . 71 4.3.2 General Problem Description . 72 4.3.3 Algorithms . 73 4.3.4 DBMS-Kernel Integration . 78 4.3.5 Performance Analysis . 78 4.3.6 Performance Evaluation . 80 4.3.7 Conclusion . 87 4.4 Bulk Deletion . 87 4.4.1 Related Work . 88 4.4.2 Algorithm . 88 4.4.3 Analysis . 92 4.5 Summary . 93 5 Bitmap-Indexes on Clustered Data 95 5.1 Introduction . 96 5.1.1 Encoding . 96 5.1.2 Query Processing . 98 CONTENTS v 5.1.3 Compression . 99 5.1.4 Maintenance . 101 5.2 Bitmaps on clustered data . 102 5.3 Evaluation with the GfK DWH . 103 5.3.1 Maintenance . 103 5.3.2 Query Performance . 110 5.4 Summary . 121 5.5 Multi-User environments . 122 6 The BUB-Tree 123 6.1 Introduction . 123 6.2 Related Work . 124 6.3 BUB-Tree regions . 125 6.4 Point Query . 128 6.5 Insertion . 129 6.6 Page Split . 130 6.7 Deletion . 133 6.8 Range Queries . 133 6.9 Bulk Insertion and Bulk Deletion . 136 6.9.1 Initial Bulk Loading . 136 6.9.2 Incremental Bulk Loading . 136 6.9.3 Bulk Deletion . 137 6.10 Reorganization . ..