Virtual Forced Splitting in Multidimensional Access Methods
Total Page:16
File Type:pdf, Size:1020Kb
VIRTUAL FORCED SPLITTING IN MULTIDIMENSIONAL ACCESS METHODS by RICHARD SWINBANK A thesis submitted to The University of Birmingham for the degree of DOCTOR OF PHILOSOPHY School of Computer Science The University of Birmingham April 2008 University of Birmingham Research Archive e-theses repository This unpublished thesis/dissertation is copyright of the author and/or third parties. The intellectual property rights of the author or third parties in respect of this work are as defined by The Copyright Designs and Patents Act 1988 or as modified by any successor legislation. Any use made of information contained in this thesis/dissertation must be in accordance with that legislation and must be properly acknowledged. Further distribution or reproduction in any format is prohibited without the permission of the copyright holder. Abstract External, tree-based, multidimensional access methods typically attempt to provide B+ tree like behaviour and performance in the organisation of large collections of multidimensional data. The B+ tree’s efficiency comes directly from the fact that it organises data occupying a single dimen- sion, which can be linearly ordered, and partitioned at arbitrary points in that order. Using a multiway tree to partition a multidimensional space becomes increasingly difficult with increasing dimensionality, often leading to the loss of desirable properties like high fanout and low internode overlap. The K-D-B tree [49] is an example of a structure in which one property, that of zero internode overlap, is provided at the expense of another, high fanout. Its approach to doing this, by forced splitting, is shared by a collection of other structures, and in 1995 Freeston suggested a novel approach to mitigate the effects of forced splits, by executing them virtually. This approach has not been taken up widely, but we believe it shows a great deal of promise. In the thesis, we examine the virtual forced splitting approach in depth. We identify a number of problems presented by the approach, and propose solutions to them, allowing us to characterise a general class of virtual forced splitting structures that we call VFS-trees. The efficacy of our approach is demonstrated by our implementation of a new VFS structure, and by what we believe to be the first implementation of a BV-tree, together with new algorithms for region and K Nearest Neighbour search. We further report experimental results on construction, exact-match search and K-NN search of BV-trees, and show how they compare, very favourably, with the corresponding operations on the currently most popular multidimensional file access method, the R*-tree [3]. iii iv Acknowledgements I would like to thank my supervisor, Alan Sexton, for his unstinting support throughout my PhD, and the many friends I have made during my time in Birmingham — you know who you are. v vi Contents 1 Introduction 1 1.1Motivation........................................ 1 1.2Summaryofcontributions................................ 1 1.3Thesisoutline....................................... 2 2 Preliminaries 3 2.1 Fundamental characteristics . 3 2.2Desirablecharacteristics................................. 4 2.2.1 IO-balance.................................... 4 2.2.2 Thesinglepathproperty(SPP)......................... 4 2.2.3 Guaranteedminimumfanoutratio....................... 4 2.2.4 Localitypreservation............................... 5 2.2.5 Summary..................................... 5 2.3Thecurseofdimensionality............................... 7 2.4Nodepredicates...................................... 8 2.4.1 Globalandlocalpredicates........................... 8 2.4.2 Notation...................................... 8 2.4.3 Predicates in external access methods . 9 2.4.4 Diagramconventions............................... 10 2.4.5 Examplesoflocalpredicateinterpretation................... 11 2.5 An abstract state machine for algorithm specification . 14 2.5.1 Configurations.................................. 14 2.5.2 Operations.................................... 15 2.5.3 Thestore..................................... 17 2.5.4 Ancillary definitions . 17 2.5.5 Example:TheB+tree.............................. 18 2.5.6 Motivation.................................... 22 2.6Summary......................................... 24 3 Analysisofpreviouswork 25 3.1Locality-neglectfulstructures.............................. 25 3.1.1 Space-filling curves . 26 3.1.2 ‘Pyramid’mappings............................... 26 3.1.3 iDistance..................................... 28 3.1.4 GiMP....................................... 28 vii viii CONTENTS 3.1.5 Mappings into more than one dimension . 29 3.2Non-SPPstructures................................... 30 3.2.1 GiST....................................... 30 3.2.2 R-tree....................................... 34 3.2.3 R*-tree...................................... 35 3.2.4 SS-tree...................................... 36 3.2.5 SR-tree...................................... 37 3.2.6 M-treefamily................................... 38 3.3Structureslackingminimumfanoutguarantees.................... 39 3.3.1 K-D-Btree.................................... 40 3.3.2 LSD-tree..................................... 42 3.3.3 Buddy-tree . 44 3.3.4 Hybridtree.................................... 46 3.3.5 BANGfile..................................... 47 3.3.6 hB-tree...................................... 50 3.4Structuresrequiringvariablenodesizes........................ 53 3.4.1 X-tree....................................... 53 3.4.2 BV-tree...................................... 54 3.5Summary......................................... 57 4 Virtual Forced Splitting 59 4.1TheB+tree....................................... 60 4.2Handlingpredicateinterpretation............................ 60 4.2.1 Integrating predicate interpretation into the store . 61 4.2.2 Predicatereinterpretation............................ 63 4.2.3 Globalpredicatedisjointness.......................... 63 4.3 Region description in more than one dimension . 64 4.3.1 Overlap...................................... 65 4.3.2 Poorsplitbalance................................ 65 4.3.3 Holeysplitting.................................. 65 4.4Forcedsplitting...................................... 66 4.4.1 FS-treeinsertion................................. 66 4.4.2 Lossofguaranteedminimumfanout...................... 68 4.5Virtualforcedsplitting.................................. 70 4.5.1 Pendingsetsandpredicatereinterpretation.................. 72 4.5.2 TheKDB-VFStree............................... 73 4.5.3 Entryandnodelevelnumbers.......................... 73 4.6IssueswiththeVFSapproach.............................. 73 4.6.1 Limitingdirectelevation............................. 75 4.6.2 Limitingindirectelevation:Demotion..................... 75 4.6.3 Treeheightandelevation............................ 76 4.6.4 Overallelevationlimits............................. 77 4.6.5 Handlingoccupancyeffectsofelevation.................... 78 4.6.6 Algorithmdesign................................. 80 4.7Summary......................................... 81 CONTENTS ix 5 VFS-tree operations 83 5.1 The reduce operationandtheRVFS-tree....................... 83 5.2Queryalgorithms..................................... 88 5.2.1 Queries of fixed extent: rQuery ......................... 88 5.2.2 K NearestNeighbourQueries.......................... 93 5.2.3 LazyRVFS-treegeneration........................... 96 5.3Demotion......................................... 97 5.3.1 Demotability . 99 5.3.2 Thedemotequeue................................ 101 5.3.3 Demotion termination and insertion cost . 103 5.3.4 Demotability in the BV-tree . 104 5.4Insertion.......................................... 105 5.5Summary......................................... 110 6 Implementation 113 6.1Implementationframework............................... 113 6.2BV-treeimplementations................................ 116 6.3BV-tree:Abstractmachineimplementation...................... 117 6.3.1 Predicateinterpretation............................. 117 6.3.2 ContainmentandIntersection.......................... 118 6.3.3 Splittingpolicy.................................. 121 6.3.4 Implementingtheabstractmachine....................... 122 6.4BV-tree:Recursiveimplementation........................... 126 6.4.1 Predicateinterpretation............................. 126 6.4.2 Splittingpolicy.................................. 128 6.5KDB-VFStreeimplementation............................. 130 6.5.1 Splittingpolicy.................................. 131 6.5.2 Occupancyguarantees.............................. 132 6.6Summary......................................... 133 7 Experimental work 135 7.1Introduction........................................ 135 7.1.1 Treesetup..................................... 135 7.1.2 Datasets...................................... 136 7.2Results........................................... 138 7.2.1 Indexconstruction................................ 138 7.2.2 Filesize...................................... 138 7.2.3 Exactmatchqueries............................... 141 7.2.4 Nearestneighbourqueries............................ 141 7.2.5 Windowqueries.................................. 145 7.3Summary......................................... 149 8 Conclusions and Further work 153 8.1BV-treeoptimisations.................................. 154 8.1.1 Bitstring region representation . 154 8.1.2 Singledemotes.................................. 155 x CONTENTS 8.2OtherVFS-treeoptimisations.............................