Fractal Dimension and Space Filling Curve
Ana Karina Gomes
Instituto Superior Técnico, Lisboa, Portugal
May 2015

Abstract

Space-filling curves are computer-generated fractals that can be used to index low-dimensional spaces. Previous studies have used them as an access method in these scenarios, but conservatively, applying them only to spaces of up to four dimensions. Additionally, the alternative access methods tend to perform worse than a linear search when the space exceeds ten dimensions. Therefore, in the context of my thesis, I study space-filling curves and their properties, as well as the challenges presented by multidimensional data. I propose their use, specifically the Hilbert curve, as an access method for indexing multidimensional points in up to thirty dimensions. I start by mapping the points to the Hilbert curve, generating their h-values. Then I develop three heuristics to search for approximate nearest neighbors of a given query point, with the aim of testing the performance of the curve. Two of the heuristics use the curve as a direct access method, and the other uses the curve as a secondary key retrieval combined with a B-tree variant. The heuristics result from an iterative process of planning, conceiving and testing: according to the test results, a heuristic is adjusted or a new one is created. Experimental results with the three heuristics show that the Hilbert curve can be used as an access method, and that it can operate at least in spaces of up to thirty-six dimensions.

Keywords: Fractals, Hilbert Space-Filling Curve, Low-Dimensional Indexing, Approximate Nearest Neighbor

1. Introduction

Data is becoming increasingly complex. It contains an extensive set of attributes and sometimes needs to be seen as part of a multidimensional space. Database Management Systems (DBMSs) have undergone several changes in their access methods in order to support this type of data. Over the years, several indexes have been proposed to handle multidimensional (or spatial) data, but the most popular is the R-tree. Despite all the effort dedicated to improving it, R-tree performance does not appear to keep up with the increase in dimensions. Multidimensional indexes commonly suffer from the curse of dimensionality, which means that their similarity search performance is worse than a linear search in spaces that exceed ten dimensions [18, 24, 28]. Even with other indexes that perform better, Mamoulis considers similarity search in these spaces a difficult and unsolved problem in its generic form [18].

The approximate nearest neighbor is a similarity search technique that trades the accuracy of query results for reduced response time and memory usage. It is an invaluable search technique in high-dimensional spaces. Commonly, we use feature vectors to simplify and represent data in order to lower the complexity. The features are represented by points in the high-dimensional space, and the closer the points are in space, the more similar the objects are. Images, for instance, are usually represented by feature vectors, which are then used to find approximately similar objects. In this case, an approximate result is an acceptable solution. Multimedia data is only one example among several other applications. The approximate nearest neighbor, as its name suggests, does not guarantee the return of the exact neighbor, and its performance is highly related to the data distribution [27].

Ideally, in order to reduce the search time, all the multidimensional data that are related should be stored linearly on disk or in memory. However, this cannot be done, because it is impossible to guarantee that every pair of objects close in space will also be close in the linear order. It is impossible for all pairs but not for some, and space-filling curves provide a total order of the points in space with a superior clustering property. Another problem with multidimensional data concerns its indexing structures. Although there are several low-dimensional indexing structures, they run into a problem as the number of dimensions increases. When this happens, the volume of the space increases so fast that the data in the space become sparse. The curse of dimensionality usually appears when the space exceeds ten dimensions, at which point the performance of these structures is comparable to a linear search or worse [8]. Berchtold et al. note that they also suffer from the well-known drawbacks of multidimensional index structures, such as high costs for insert and delete operations and poor support for concurrency control and recovery. A possible solution to this problem is to use dimensionality reduction techniques such as space-filling curves [3, 8]. They map a d-dimensional space to a one-dimensional space in order to reduce the complexity. After this, it is possible to use a B-tree variant to store the data and take advantage of all the properties of these structures, such as fast insert, update and delete operations, good concurrency control and recovery, easy implementation and reuse of the B-tree implementation. Additionally, the space-filling curve can be easily implemented on top of an existing DBMS [3].

For all the reasons mentioned above, the purpose of this paper is to study space-filling curves and their properties, as well as the challenges presented by multidimensional data. I explore the use of space-filling curves, specifically the Hilbert curve, as an access method for indexing multidimensional points. I also test three heuristics to search for an approximate nearest neighbor on the Hilbert curve. The main goal is to explore the potential of the Hilbert curve in low-dimensional spaces (up to thirty dimensions). These spaces may result from applying indexing structures oriented to high-dimensional space, or from applying dimensionality reduction techniques for subsequent indexing. In either situation, the resulting space, despite having fewer dimensions, still requires a large computational effort to perform operations on the data.

2. Multidimensional Indexing

There are several structures oriented to indexing multidimensional data. They vary according to their structure, whether or not they take the data distribution into account, whether they target points or regions, whether they are more suited to high- or low-dimensional spaces, and so on. For example, structures such as Kd-trees [1, 2], hB-trees [17], Grid files [20] and Point Quad trees, among others, are oriented to indexing points. Structures such as Region Quad trees [12], SKD [21] or R-trees can handle both points and regions [22]. The Subspace-tree [25] and the LSD-tree are more suited to high dimensions, while the R-tree, the Kd-tree and space-filling curves are more suited to low-dimensional spaces.

[...] proposals. Although there is no consensus on the best structure for multidimensional indexing, the R-tree is the one most commonly used to benchmark the performance of new proposals. Furthermore, in commercial DBMSs, R-trees are preferred due to their suitability for points and regions and their orientation to low-dimensional space [28, 22]. The list of alternative access methods that compete with fractals in low-dimensional space (two to thirty dimensions) is quite extensive. Therefore, I choose the R-tree and the Kd-tree as two possible alternatives for a closer look.

2.1. R-tree

Since multidimensional indexes are primarily conceived for secondary storage, the notion of a data page is a concern. These access methods perform a hierarchical clustering of the data space, and the clusters are usually covered by page regions. Each data page is assigned a page region, which is a subset of the data space. The page regions vary according to the index, and in the R-tree they are called minimum bounding rectangles (MBRs) [5]. At a higher-resolution level, each MBR represents a single object and is stored in the leaf nodes. These leaves also contain an identifier of the multidimensional point or region that represents some data. The R-tree descends from the B+-tree and therefore preserves the height-balanced property. Thus, the higher levels hold clusters of objects and, since objects are represented by rectangles, clusters of rectangles [28, 22].

The R-tree search performance relates to the size of the MBRs. If there are too many overlapping regions, a search will intersect several subtrees of a node. All of them must then be traversed, and this happens in a multidimensional space. Thus, the minimization of both coverage and overlap has a large impact on performance [28, 22]. Other proposals based on the R-tree have been presented over the years. The R*-tree tries to improve the R-tree by choosing a coverage of square shape rather than the traditional rectangle. This shape allows a reduction in coverage area by reducing the overlap, and hence improves the performance of the tree. Like the R*-tree, many other structures based on the R-tree have been proposed, such as the R+-tree, the X-tree and the Hilbert R-tree, among others. However, another problem haunts these hierarchical indexing structures: the curse of dimensionality. The main underlying cause relates to the volume of the shape chosen to cover the points. The shape has a constant radius or edge size, and its volume grows exponentially as the number of dimensions increases [4, 26].
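To make the notions of coverage and overlap of minimum bounding rectangles more concrete, the following is a minimal sketch, not taken from any R-tree implementation. The class `MBR` and the helper `subtrees_to_visit` are hypothetical names introduced only for illustration, and a real R-tree node would also carry child pointers and page information.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class MBR:
    """Hypothetical axis-aligned minimum bounding rectangle in d dimensions."""
    low: Tuple[float, ...]
    high: Tuple[float, ...]

    @classmethod
    def of_points(cls, points: Sequence[Sequence[float]]) -> "MBR":
        # The MBR is the per-dimension minimum and maximum of the points.
        dims = list(zip(*points))
        return cls(tuple(min(d) for d in dims), tuple(max(d) for d in dims))

    def volume(self) -> float:
        # Coverage: product of the edge lengths.
        v = 1.0
        for lo, hi in zip(self.low, self.high):
            v *= hi - lo
        return v

    def intersects(self, other: "MBR") -> bool:
        # Two boxes intersect when their intervals overlap in every dimension.
        return all(
            a_lo <= b_hi and b_lo <= a_hi
            for a_lo, a_hi, b_lo, b_hi in zip(self.low, self.high, other.low, other.high)
        )

    def overlap_volume(self, other: "MBR") -> float:
        # Overlap: volume of the intersection box, or 0.0 if there is none.
        v = 1.0
        for a_lo, a_hi, b_lo, b_hi in zip(self.low, self.high, other.low, other.high):
            side = min(a_hi, b_hi) - max(a_lo, b_lo)
            if side <= 0:
                return 0.0
            v *= side
        return v


def subtrees_to_visit(children: List[MBR], query: MBR) -> List[int]:
    """Indices of the child MBRs that a search for `query` has to descend into."""
    return [i for i, child in enumerate(children) if child.intersects(query)]
```

With two child rectangles that overlap, a query falling in the shared area returns both indices from `subtrees_to_visit`, which is the situation in which a search must traverse more than one subtree.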
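As a rough illustration of mapping points to h-values and searching along the curve, the sketch below is restricted to two dimensions and uses the widely known iterative cell-to-index conversion for the Hilbert curve. The names `hilbert_index`, `build_index` and `approx_nearest`, and the `window` parameter, are hypothetical. The windowed scan is a generic stand-in and is not one of the three heuristics developed in this work. A sorted Python list plays the role that a B-tree variant would play in a DBMS, and the index is assumed to be non-empty.

```python
import bisect
import math
from typing import List, Sequence, Tuple

Point = Tuple[int, int]


def hilbert_index(order: int, x: int, y: int) -> int:
    """Map cell (x, y) of a 2^order by 2^order grid to its Hilbert h-value."""
    n = 1 << order
    h = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        h += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has the base orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return h


def build_index(order: int, points: Sequence[Point]) -> List[Tuple[int, Point]]:
    """Sort the points by h-value; the result is a list of (h, point) pairs."""
    return sorted((hilbert_index(order, x, y), (x, y)) for x, y in points)


def approx_nearest(order: int, index: List[Tuple[int, Point]],
                   query: Point, window: int = 2) -> Point:
    """Return the closest point among the curve neighbours of the query.

    Only `window` entries before and `window` entries from the query's
    position on the curve are examined, so the answer is approximate.
    """
    hq = hilbert_index(order, *query)
    pos = bisect.bisect_left(index, (hq,))
    lo, hi = max(0, pos - window), min(len(index), pos + window)
    candidates = [p for _, p in index[lo:hi]]
    return min(candidates, key=lambda p: math.dist(p, query))
```

Because only a few entries around the query's h-value are inspected, the returned point may not be the exact nearest neighbor; a larger `window` trades response time for accuracy.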