Virtual Forced Splitting in Multidimensional Access Methods

VIRTUAL FORCED SPLITTING IN MULTIDIMENSIONAL ACCESS METHODS by RICHARD SWINBANK A thesis submitted to The University of Birmingham for the degree of DOCTOR OF PHILOSOPHY School of Computer Science The University of Birmingham April 2008 University of Birmingham Research Archive e-theses repository This unpublished thesis/dissertation is copyright of the author and/or third parties. The intellectual property rights of the author or third parties in respect of this work are as defined by The Copyright Designs and Patents Act 1988 or as modified by any successor legislation. Any use made of information contained in this thesis/dissertation must be in accordance with that legislation and must be properly acknowledged. Further distribution or reproduction in any format is prohibited without the permission of the copyright holder. Abstract External, tree-based, multidimensional access methods typically attempt to provide B+ tree like behaviour and performance in the organisation of large collections of multidimensional data. The B+ tree’s efficiency comes directly from the fact that it organises data occupying a single dimension, which can be linearly ordered, and partitioned at arbitrary points in that order. Using a multiway tree to partition a multidimensional space becomes increasingly difficult with increasing dimensionality, often leading to the loss of desirable properties like high fanout and low internode overlap. The K-D-B tree [49] is an example of a structure in which one property, that of zero internode overlap, is provided at the expense of another, high fanout. Its approach to doing this, by forced splitting, is shared by a collection of other structures, and in 1995 Freeston suggested a novel approach to mitigate the effects of forced splits, by executing them virtually. This approach has not been taken up widely, but we believe it shows a great deal of promise. In the thesis, we examine the virtual forced splitting approach in depth. We identify a number of problems presented by the approach, and propose solutions to them, allowing us to characterise a general class of virtual forced splitting structures that we call VFS-trees. The efficacy of our approach is demonstrated by our implementation of a new VFS structure, and by what we believe to be the first implementation of a BV-tree, together with new algorithms for region and K Nearest Neighbour search. We further report experimental results on construction, exact-match search and K-NN search of BV-trees, and show how they compare, very favourably, with the corresponding operations on the currently most popular multidimensional file access method, the R*-tree [3]. iii iv Acknowledgements I would like to thank my supervisor, Alan Sexton, for his unstinting support throughout my PhD, and the many friends I have made during my time in Birmingham — you know who you are. v vi Contents 1 Introduction 1 1.1Motivation........................................ 1 1.2Summaryofcontributions................................ 1 1.3Thesisoutline....................................... 2 2 Preliminaries 3 2.1 Fundamental characteristics . 3 2.2Desirablecharacteristics................................. 4 2.2.1 IO-balance.................................... 4 2.2.2 Thesinglepathproperty(SPP)......................... 4 2.2.3 Guaranteedminimumfanoutratio....................... 4 2.2.4 Localitypreservation............................... 5 2.2.5 Summary..................................... 5 2.3Thecurseofdimensionality............................... 7 2.4Nodepredicates...................................... 8 2.4.1 Globalandlocalpredicates........................... 8 2.4.2 Notation...................................... 8 2.4.3 Predicates in external access methods . 9 2.4.4 Diagramconventions............................... 10 2.4.5 Examplesoflocalpredicateinterpretation................... 11 2.5 An abstract state machine for algorithm specification . 14 2.5.1 Configurations.................................. 14 2.5.2 Operations.................................... 15 2.5.3 Thestore..................................... 17 2.5.4 Ancillary definitions . 17 2.5.5 Example:TheB+tree.............................. 18 2.5.6 Motivation.................................... 22 2.6Summary......................................... 24 3 Analysisofpreviouswork 25 3.1Locality-neglectfulstructures.............................. 25 3.1.1 Space-filling curves . 26 3.1.2 ‘Pyramid’mappings............................... 26 3.1.3 iDistance..................................... 28 3.1.4 GiMP....................................... 28 vii viii CONTENTS 3.1.5 Mappings into more than one dimension . 29 3.2Non-SPPstructures................................... 30 3.2.1 GiST....................................... 30 3.2.2 R-tree....................................... 34 3.2.3 R*-tree...................................... 35 3.2.4 SS-tree...................................... 36 3.2.5 SR-tree...................................... 37 3.2.6 M-treefamily................................... 38 3.3Structureslackingminimumfanoutguarantees.................... 39 3.3.1 K-D-Btree.................................... 40 3.3.2 LSD-tree..................................... 42 3.3.3 Buddy-tree . 44 3.3.4 Hybridtree.................................... 46 3.3.5 BANGfile..................................... 47 3.3.6 hB-tree...................................... 50 3.4Structuresrequiringvariablenodesizes........................ 53 3.4.1 X-tree....................................... 53 3.4.2 BV-tree...................................... 54 3.5Summary......................................... 57 4 Virtual Forced Splitting 59 4.1TheB+tree....................................... 60 4.2Handlingpredicateinterpretation............................ 60 4.2.1 Integrating predicate interpretation into the store . 61 4.2.2 Predicatereinterpretation............................ 63 4.2.3 Globalpredicatedisjointness.......................... 63 4.3 Region description in more than one dimension . 64 4.3.1 Overlap...................................... 65 4.3.2 Poorsplitbalance................................ 65 4.3.3 Holeysplitting.................................. 65 4.4Forcedsplitting...................................... 66 4.4.1 FS-treeinsertion................................. 66 4.4.2 Lossofguaranteedminimumfanout...................... 68 4.5Virtualforcedsplitting.................................. 70 4.5.1 Pendingsetsandpredicatereinterpretation.................. 72 4.5.2 TheKDB-VFStree............................... 73 4.5.3 Entryandnodelevelnumbers.......................... 73 4.6IssueswiththeVFSapproach.............................. 73 4.6.1 Limitingdirectelevation............................. 75 4.6.2 Limitingindirectelevation:Demotion..................... 75 4.6.3 Treeheightandelevation............................ 76 4.6.4 Overallelevationlimits............................. 77 4.6.5 Handlingoccupancyeffectsofelevation.................... 78 4.6.6 Algorithmdesign................................. 80 4.7Summary......................................... 81 CONTENTS ix 5 VFS-tree operations 83 5.1 The reduce operationandtheRVFS-tree....................... 83 5.2Queryalgorithms..................................... 88 5.2.1 Queries of fixed extent: rQuery ......................... 88 5.2.2 K NearestNeighbourQueries.......................... 93 5.2.3 LazyRVFS-treegeneration........................... 96 5.3Demotion......................................... 97 5.3.1 Demotability . 99 5.3.2 Thedemotequeue................................ 101 5.3.3 Demotion termination and insertion cost . 103 5.3.4 Demotability in the BV-tree . 104 5.4Insertion.......................................... 105 5.5Summary......................................... 110 6 Implementation 113 6.1Implementationframework............................... 113 6.2BV-treeimplementations................................ 116 6.3BV-tree:Abstractmachineimplementation...................... 117 6.3.1 Predicateinterpretation............................. 117 6.3.2 ContainmentandIntersection.......................... 118 6.3.3 Splittingpolicy.................................. 121 6.3.4 Implementingtheabstractmachine....................... 122 6.4BV-tree:Recursiveimplementation........................... 126 6.4.1 Predicateinterpretation............................. 126 6.4.2 Splittingpolicy.................................. 128 6.5KDB-VFStreeimplementation............................. 130 6.5.1 Splittingpolicy.................................. 131 6.5.2 Occupancyguarantees.............................. 132 6.6Summary......................................... 133 7 Experimental work 135 7.1Introduction........................................ 135 7.1.1 Treesetup..................................... 135 7.1.2 Datasets...................................... 136 7.2Results........................................... 138 7.2.1 Indexconstruction................................ 138 7.2.2 Filesize...................................... 138 7.2.3 Exactmatchqueries............................... 141 7.2.4 Nearestneighbourqueries............................ 141 7.2.5 Windowqueries.................................. 145 7.3Summary......................................... 149 8 Conclusions and Further work 153 8.1BV-treeoptimisations.................................. 154 8.1.1 Bitstring region representation . 154 8.1.2 Singledemotes.................................. 155 x CONTENTS 8.2OtherVFS-treeoptimisations.............................

Virtual Forced Splitting in Multidimensional Access Methods

Pivot Selection Method for Optimizing Both Pruning and Balancing in Metric Space Indexes

Peer-To-Peer Similarity Search Based on M-Tree Indexing

Similarity Searching in Peer-To-Peer Environment

Prefix Hash Tree an Indexing Data Structure Over Distributed Hash

Efficient Nearest-Neighbor Computation for GPU-Based Motion

Distributed Computation of the Knn Graph for Large High-Dimensional Point Sets

Indexing the Fully Evolvement of Spatiotemporal Objects

K-NN Search in Non-Clustered Case Using K-P-Tree

Efficient Similarity Search in High-Dimensional Data Spaces

High-Dimensional Data Indexing with Applications

Very Large Scale Nearest Neighbor Search: Ideas, Strategies and Challenges

Making the Pyramid Technique Robust to Query Types and Workloads