Hilbert R-Tree: an Improved R-Tree Using Fkactals

Hilbert R-tree: An Improved R-tree Using Fkactals Ibrahim Kamel Christos Faloutsos * Department of Computer Science Department of Computer Science and University of Maryland Institute for Systems Research (ISR) College Park, MD 20742 University of Maryland kamelQcs.umd.edu College Park, MD 20742 [email protected] Hilbert R-tree. Our experiments show that the ‘2-to-3’ split policy provides a cornpro Abstract mise between the insertion complexity and the search cost, giving up to 28% savings over the We propose a new Rtree structure that out- R’ - tree [BKSSOO]on real data. performs all the older ones. The heart of the idea is to facilitate the deferred splitting ap preach in R-trees. This is done by propos- 1 Introduction ing an ordering on the R-tree nodes. This One of the requirements for the database management ordering has to be ‘good’, in the sense that systems (DBMSs) of the near future is the ability to it should group ‘similar’ data rectangles to handle spatial data [SSUSl]. Spatial data arise in gether, to minimize the area and perimeter many applications, including: Cartography [WhiOl]; of the resulting minimum bounding rectangles Computer-Aided Design (CAD) [OHM+841 [Gut84a]; (MBRs). computer vision and robotics [BB82]; traditional Following [KF93] we have chosenthe so-called databases, where a record with k attributes corre- ‘2D-c’ method, which sorts rectangles accord- sponds to a point in a k-d space; temporal databases, ing to the Hilbert value of the center of the where time can be considered as one more dimen- rectangles. Given the ordering, every node sion [KS91]; scientific databases with spatial-temporal has a well-defined set of sibling nodes; thus, data, such as the ones in the ‘Grand Challenge’ appli- we can use deferred splitting. By adjusting cations [Gra92], etc. the split policy, the Hilbert R-tree can achieve In the above applications, one of the most typical as high utilization as desired. To the contrary, queries is the mnge query: Given a rectangle, retrieve the R.-tree has no control over the space uti- all the elements that intersect it. A special csse of the lization, typically achieving up to 70%. We range query is the point query or stabbing query, where designed the manipulation algorithms in de- the query rectangle degeneratesto a point. tail, and we did a full implementation of the We focus on the R-tree [Gut84b] family of methods, *This resexch was partially fuuded by the hmtitutc for Sys- which contains someof the most efficient methods that term, FLsemch (ISR), by the National Science Foundation un- support range queries. The advantage of our method der Grants IF&9205273 and IFU-8958546 (PYI), with matching (and the rest of the R-tree-based methods) over the fuuds from EMPRESS Software Inc. und Think& h4achinen methods that use linear quad-trees and z-ordering is IIIC. that R-trees treat the data objects as a whole, while Pcrmieoion io copy without fee all or part of thir maicrial ir granid provided that tbe copier are not made or dirtribmted for quad-tree based methods typically divide objects into direci commercial advantage, the VLDB copyrigkt notice and quad-tree blocks, increasing the number of items to be the title of be publication and iio dde eppar, and notice ir stored. given that copying ir by pewnirrion of the Very Large Data Bare Endowment. To copy other&e, or to reprblieh, reqrirer a fee The most successful variant of R-trees seems to be and/or l puial pemirvior from ihe Endowment. the R*-tree [BKSSOO].One of its main contributions is Proceedings of the 20th VLDB Conference the idea of ‘forced-reinsert’ by deleting some rectangles Santiago, Chlle, 1994 from the overflowing node, and reinserting them. 500 The main idea in the present paper is:to impose an search performance. ordering on the data rectangles. The consequencesare Subsequent work on R-trees includes the work by important: using this ordering, each R-tree node has Greene [Gre89], Roussopoulos and Leiflrer [RL85], R+- a well defined set of siblings; thus, we can use the al- tree by Sellii et al. [SRF87], R-trees using Mini- gorithms for deferred splitting. By adjusting the split mum Bounding Ploygons [JaggOal,Kamel and Falout- policy (2-to-3 or 3-t&4 etc) we can drive the utilization sos [KF93] and the R*-tree [BKSSSO] of Beckmann a8 close to 100% as desirable. Notice that the R*-tree et al. , which seems to have better performance than does not have control over the utilization, typically Guttman R-tree “quadratic split”. The main idea in achieving an average of ~70%. the R*-tree is the concept of forced re-insert. When a The only requirement for the ordering is that it has node overflows, some of its children are carefully the to be ‘good’, that is, it should lead to small R-tree sen; they are deleted and re-inserted, usually resulting nodes. in a R-tree with better structure. The paper is organized as follows. Section 2 gives a brief description of the R-tree and its variants. Sec- tion 3 describes the Hilbert R-tree. Section 4 presents 3 Hilbert R-trees our experimental results that compare the Hilbert R- In this section we introduce the Hilbert R-tree and dis- tree with other R-tree variants. Section 5 gives the cuss algorithms for searching, insertion, deletion, and conclusions and directions for future research. overflow handling. The performance of the R-trees de- pends on how good is the algorithm that cluster the 2 Survey data rectangles to a node. We propose to use space filling curves (or fractals), and specifically, the Hilbert Several spatial accessmethods have been proposed. A curve to impose a linear ordering on the data rectan- recent survey can be found in [Sam89]. These meth- gles. ods fall in the following broad classes: methods that A space filling curve visits all the points in a Ic- transform rectanglea into points in a higher dimen- dimensional grid exactly once and never crosses it- sionality space [HN83, Fre87]; methods that use lin- self. The Z-order (or Morton key order, or bit- ear quadtrees [Gar82] [AS911 or, equivalently, the Z- interleaving, or Peano curve), the Hilbert curve, and ordering [Ore861 or other space filling curvea [FR89] the Gray-code curve [Fal88] are examples of space fill- [JaggOb]; and finally, methods based on treea (R- ing curves. In [FR89], it was shown experimentally tree [Gut84b], k-d-trees [Ben75], k-d-B-trees [RobBl], that the HiIbert curve achieves the beat clustering hB-trees [LS90], cell-trees [Gun891 e.t.c.) among the three above methods. One of the most promising approaches in the last Next we provide a brief introduction to the Hilbert class is the R-tree [Gut84b]: Compared to the trans- curve: The basic Hilbert curve on a 2x2 grid, denoted formation methods, R-trees work on the native space, by HI, is shown in Figure 1. To derive a curve of or- which has lower dimensionality; compared to the lin- der i, each vertex of the basic curve is replaced by the ear quadtrees, the R-trees do not need to divide the curve of order i - 1, which may be appropriately rc+ spatial objects into (several) pieces (quadtree blocks). tated and/or reflected. Figure 1 also shows the Hilbert The R-tree is an extension of the B-tree for multidi- curvea of order 2 and 3. When the order of the curve mensional objects. A geometric object is represented tends to infinity, the resulting curve is a fractal, with a by its minimum bounding rectangle (MBR): Non-leaf fractal dimension of 2 [Man77]. The Hilbert curve can nodes contain entries of the form (R&r) where ptr is be generalized for higher dimensionalitiea. Algorithms a pointer to a child node in the R-tree; R is the MBR to draw the two-dimensional curve of a given order, that covers all rectangles in the child node. Leaf nodes can be found in [Gri86], [JaggOb]. An algorithm for contain entries of the form (obj-id, R) where obj-id is a higher dimensionalities is in [Bia69]. pointer to the object description, and R is the MBR of The path of a space filling curve imposes a linear the object. The main innovation in the R-tree is that ordering on the grid points. Figure 1 shows one such father nodes are allowed to overlap. This way, the R- ordering for a 4 x 4 grid (see curve Ha). For example tree can guarantee at least 50% space utilization and the point (0,O) on the Hz curve has a Hilbert value remain balanced. of 0, while the point (1,l) has a Hilbert value of 2. Guttman proposed three splitting algorithms, the The Hilbert value of a rectangle needs to be defined. linear split, the quadraiic split and the exponedial Following the experiments in [KF93], a good choice is split. Their namea come from their complexity; among the following: the three, the quadratic split algorithm is the one that achieves the best trade-off between splitting time and Definition 1 : The Hilbed value of a rectangle is de- 501 “1 “2 “3 Figure 1: Hilbert Curves of order 1, 2 and 3 fined as the Hilbert value of its center. on the disk; the contents of the parent node ‘II’ are After this preliminary material, we are in a position shown in more detail. Every data rectangle in node now to describe the proposed methods. ‘I’ has Hilbert value 533; everything in node ‘II’ has Hilbert value greater than 33 and 5107 etc.

Load more