International Journal on Integrated Information Management Vol.01 (2012) DOI: 10.15556/IJIIM.01.01.002

Efficient indexing methods in the data mining context

Nikolaos Kouiroukidisa, Georgios Evangelidisa

a Department of Applied Informatics, University of Macedonia, Thessaloniki, Greece

Abstract: The proliferation of applications manipulating enormous sizes of multidimensional vector data, that require indexing in order to support operations like kNN searching in an efficient manner, has revived the interest in multidimensional indexing. It is well established that as the dimensionality of data increases the performance of queries such as range queries and nearest neighbor (1NN or kNN) queries decreases leading to a problem described as “the curse of dimensionality”. In this paper we point out problems that arise when indexing multidimensional data in fixed dimensions, such as 8 or 16, because, usually, when dealing with data of higher dimensionality it is common to first apply dimensionality reduction techniques such as principal component analysis. Although there is a plethora of research papers proposing multidimensional indexing structures, most of them report empirical results with relatively small and low dimensionality datasets. We attempt a fair comparison of many state of the art indexing structures designed exclusively to index multi-dimensional points like the Hybrid , the iDistance and the P+tree. We include in our comparison the R*tree, a state of the art index designed both for multidimensional points and regions. It is an improvement of the well-known R-tree, and also has been revised and improved further recently. We compare the behavior of the indexes on kNN queries (for k=1 and k=10) with varying dataset sizes and dimensionality. Our goal is to determine the structure(s) that could be used efficiently in the area of data mining. We obtained the source code of the implementations of the related structures from the authors and the R-tree portal.

Keywords: Multidimensional Indexing, Data Mining

1. Introduction In the last 25 years many indexing techniques have been proposed for the efficient storage and retrieval of multidimensional data. For the one-dimensional case, the ubiquitous B+tree [7] has been incorporated in all commercial and open source database management systems. Many more sophisticated data structures have been proposed to handle the problem of manipulating in an efficient manner enormous sizes of multidimensional data. Most of them try to solve problems concerning range queries and k nearest neighbor (kNN) queries. Difficulties arise in higher dimensions where the problem of the so called “dimensionality curse” has the effect that the higher the dimension in question the more these index structures behave like or even worse than the sequential scan in solving problems like similarity search queries.

24 INTERNATIONAL JOURNAL ON INTEGRATED INFORMATION MANAGEMENT 25

These indexing methods usually take advantage of many factors like the manner that space is occupied by the data in question or some characteristics of the way that data space is decomposed that can lead to translating the multidimensional problem into a single-dimensional one that can be efficiently handled by a B+tree. First, we survey some of the most notable indexing structures that have been proposed in the literature, especially the R*tree [1] (successor of R- tree), the Hybrid tree [3], the P+tree [6] and the iDistance [7], and then we try to study and investigate through experimentation various factors that influence these indexes when used to solve kNN queries. These factors are the data dimensionality and the size of the indexed data that usually arise in real world datasets. The paper is organized as follows. In Section 2, we introduce the index structures we will examine, and in Section 3 we discuss the experiments that have been conducted using these methods. In Section 4, we present our experiments, and, finally, we conclude the paper in Section 5.

2. Review of Popular Multidimensional Indexes

2.1. Hybrid Tree This indexing structure is characterized by the space partitioning strategy that is employed when a node splits and it supports distance-based queries (both range and nearest neighbor). It is more similar to Space Partitioning (SP-) Based data structures than Data Partitioning (DP-) Based data structures. As the authors state, the overlap is allowed only when trying to achieve an overlap-free split would cause downward cascading splits and hence a possible violation of utilization constraints. The space partitioning within each index node is accomplished with a k-d tree. Each internal node of the regular kd-tree represents a split by storing the split dimension and the split position. The novelty is that there are two split positions in the kd-tree internal node of the Hybrid tree. The first split position represents the right (higher side) boundary of the left (lower side) partition while the second split position represents the left boundary of the right partition. When splitting a data node the Hybrid tree chooses as the split dimension the one that minimizes the increase in EDA (expected number of disk accesses per query), thereby optimizing the expected search performance for future queries and the split position is determined as close to the middle as possible along the specified split dimension. This tends to produce more cubic BRs (Bounding Rectangles) and hence ones with smaller surface areas. The smaller the surface area, the lower the probability that a range query overlaps with that BR, the lower the number of expected number of disk accesses. Unlike data node splitting where the choice is independent of the query size, the choice of the split dimension for index nodes depends on the probability distribution of the query size. For splitting an index node the Hybrid tree chooses the dimension that minimizes the increase in EDA averaged over all queries. Given the split dimension, the split positions are chosen such that the overlap is minimized without violating the utilization requirement. In addition, the Hybrid tree indexes empty space using the Encoded Live Space (ELS) Optimization using a few bits. The Hybrid tree is completely dynamic with insertions, deletions and updates that occur interspersed with search queries without requiring any reorganization.

26 EFFICIENT INDEXING METHODS IN THE DATA MINING CONTEXT

2.2. P+Tree This index structure first divides the data space into clusters that are essentially hyper-rectangles. Then it transforms each subspace into a hypercube and applies the pyramid technique [2] in each subspace. The number of subspaces is always an integral power of 2. The order of Division is defined as the times we divide the space and it is an important parameter of the P+tree. To facilitate the building process of the P+tree and query processing, there is an auxiliary structure called the space- tree, which is built during the space division process. The space-tree is similar to the k-d tree, but it stores the transformation information instead of data points in the leaf nodes. For the construction phase, a P+-tree is basically a B+-tree where the data records with their keys are stored in the leaf nodes. In the P+-tree, the authors apply the pyramid technique in each subspace. Under the pyramid technique, pyramid values of points in pyramid i cover the interval [i, i + 0.5]. There are 2d pyramids, from pyramid 0 to pyramid 2d - 1, so pyramid values of all points are within the interval [0, 2d). To discriminate points from different subspaces, SNo * 2d is added, where SNo is the number of subspaces. Window queries Q are processed in two logical phases. First, clusters that intersect Q are determined so that the rest of the clusters can be pruned, and second, the transformed query T(Q) is applied on the transformed subspaces. For kNN queries, a hypercube-shaped window query centered at x is initiated with an initial side length, which is typically small. Then the side length increases gradually until we are sure that the kNNs are found.

2.3. iDistance The design of iDistance was motivated by the following observations. First, the (dis)similarity between data points can be derived with reference to a chosen reference or representative point. Second, data points can be ordered based on their distances to a reference point. Third, distance is essentially a single dimensional value. This maps high-dimensional data in single dimensional space, thereby enabling reuse of existing single dimensional indexes such as the B+tree. Moreover, false drops can be efficiently filtered out without incurring expensive distance computation. The transformation of high-dimensional points to single dimensional points is done using a three-step algorithm. In the first step, the high-dimensional data space is split into a of partitions. In the second step, a reference point is found for each partition. Suppose that we have m partitions, P0 , P1 , . . . , Pm-1 and their corresponding reference points, O0 , O1 , . . . , Om-1 . Finally, in the third step, all data points are represented in a single dimensional space as follows. A data point p : (x0 , x1 , . . . , xd-1 ), 0 ≤ xj ≤ 1, 0 ≤ j < d, has an index key, y, based on the distance from the nearest reference point Oi as follows: y = i × c + dist(p, Oi ) where c is a constant used to stretch the data ranges. Essentially, c serves to partition the single dimension space into regions so that all points in partition Pi are mapped to the range [i × c, (i + 1) × c). For a kNN query centered at q, a range query with radius r is issued. The iDistance kNN search algorithm searches the index from the query point outwards, and for each partition that intersects with the query sphere, a range query is issued. If the algorithm finds k elements that are closer than r from q at the end of the search, the algorithm terminates. Otherwise, it extends the search radius by ∆r and the process is repeated till the stopping condition is satisfied. INTERNATIONAL JOURNAL ON INTEGRATED INFORMATION MANAGEMENT 27

2.4. R*-tree The R*-tree is a data partitioning structure that indexes MBRs (minimum bounding rectangles). The minimization of both coverage and overlap of the MBRs influences the performance of R* tree. When overlap occurs on data query or insertion,33 more than one branch of the tree needs to be expanded and traversed (due to storage redundancy).traversed When (due the to storage coverage redundancy). is minimized When the this coverage has is the minimized effect othisf improving pruning performance,has the effect soof improvingwhole pages pruning are performance,excluded from so whole search pages more are excludedoften. The R*-fromtree search attempts more often.to reduce both, with a combination of a revised node split algorithm and Thethe R* concept-tree attempts of forced to reduce reinsertion both, with when a combination nodes overflow.of a revised This node is based on the observationsplit algorithm that and R- thetree concept structures of forced are reinsertion highly sensitivewhen nodes to overflow. the order This in which their entriesis basedare inserted, on the observation so an insertion that R-tree-built structures (rather are highly than sensitivebulk-loaded) to the order structure is likely to bein sub which-optimal. their entries So theare ideletionnserted, so and an insertion reinsertion-built (ratherof some than entries bulk-loaded) allows them to "find" astructure place is in likely the to tree be sub that-optimal. may beSo the more deletion appropriate and reinsertion than of their some original location. Whenentries a allows node themoverflows, to "find" a aportion place in ofthe its tree entries that may are be removed more appropriate from the node and reinsertedthan their into original the tree. location. This When produces a node overflows, better- clustered a portion of groups its entries of are entries in nodes, withremoved the effect from that the node node coverage and reinserted is reduced. into the Furthermore, tree. This produces actual better node- splits are often clustered postponed, groups causing of entries average in nodes, node with occupancy the effect that to node become coverage higher. is Re- insertion canreduced. be seen Furthermore, as a method actual of node incremental splits are often tree postponed,optimization caus ingtriggered average on node overflow. node occupancy to become higher. Re-insertion can be seen as a method of incremental tree optimization triggered on node overflow. 3. Previous experiments 3. Previous experiments In this section we present the experiments that have been carried out during the evaluation In of this the section above we-mentioned present the experiments index structures that have been in the carried corresponding out during the original papers. Theevaluation experiments of the above vary- mentioned in terms index of the structures chosen in the dimensionality, corresponding original dataset size, page size, papers. and type The experiments of queries vary tested. in terms For of example, the chosen the dimensionality, Hybrid tree dataset was tested against thesize, SR - pagetree size,[10], andthe typehB -tree of queries [4] and tested. the Forsequential example, scan, the Hybrid the P+ tree- tree was against the pyramidtested technique, against the iMinMax SR-tree [ 10[5]], theand hB the-tree sequential [4] and the sequentialscan, iDistance scan, the againstP+- the iMinMax, thetree A against-tree [9], the and pyramid the technique,linear scan, iMinMax and, [finally,5] and the sequentialR*-tree against scan, other R-tree variationsiDistance and against the theGRID iMinMax, file [11]. the A-tree [9], and the linear scan, and, finally, The summarythe R*-tree of against the other experiments R-tree variations that and were the GRID carried file [ out11]. by the authors is illustrated in TableThe summary 1. We notice of the experimentsthat there thatis no were common carried outframework by the authors to compa is re all four indexesillustrated and draw in Table conclusions 1. We notice regarding that there istheir no common performance framework on to kNN compare queries. In the followingall section four indexes we describe and draw our conclusions approach regarding in comparing their performance the four onindexes. kNN queries. In the following section we describe our approach in comparing the four indexes.

Table 1: Comparison of index structures regarding previous experimental analysis Index indexes Size distribution Block Range Query KNN- compared Dimensionality size query 1 R* tree quadratic split 100000 2d uniform, 1024 rectangle - Rtree, linear clustered, bytes intersection split Rtree, parcel, real, Greene's Rtree gaussian, mixed uniform

28 34 EFFICIENT INDEXING METHODS IN THE DATA MINING CONTEXT

2 Hybrid SR tree, hB- 1.2 million 16d fourier datasets 4096 Selectivity - tree tree, Sequential 70K-16,32,64d collhist dataset bytes 0,07% Fourier Scan 0,2% COLHIST

3 P+ tree Pyramid, 581012 10d real data set 4096 - 2-10-NN iminmax 68040 32d real data set bytes 500000 16d32d clustered

4 IDistanc M tree,Omni 100000 to uniform & 4096 - 10-50NN e Seq.Scan, 500000 of clustered bytes Arya's bbd tree 8,16,30 dim

4. Experimental Evaluation 4. ExperimentalWe conducted Evaluation a series of experiments on the four indexing structures using for each one implementation code provided by the authors. We varied certain We conducted a series of experiments on the four indexing structures using for each one implementationparameters likecode the provided dimensionality, by the the authors. data set s ize,We andvaried the parameter certain parametersk in kNN like queries, as well as parameters that are unique in each of the participating the dimensionality, the data set size, and the parameter k in kNN queries, as well as indexing structures like the number of clusters in iDistance or the order of parameters that are unique in each of the participating indexing structures like the number of clustersDivision in in P+ iDistance-tree. or the order of Division in P+-tree. Regarding the R*-tree, we used the C++ implementation by Hans-Peter Regarding the R*-tree, we used the C++ implementation by Hans-Peter Kriegel's Kriegel's group obtained from the R-tree portal site. We had to make major group obtained from the R-tree portal site. We had to make major changes to the changes to the original code in order to compile it with the g++ 4.3.2 compiler original code in order to compile it with the g++ 4.3.2 compiler running under Linux. running under Linux. Our system was a modest one, having one Intel Pentium 4 Our system was a modest one, having one Intel Pentium 4 CPU running at 1.8 GHz CPU running at 1.8 GHz with 2GB RAM. with 2GB RAM. Our purpose was to run experiments with very large data sets with more Our purpose was to run experiments with very large data sets with more than than 100.000.000 points each, but the creation of indexes of some of the tested 100.000.000 points each, but the creation of indexes of some of the tested indexing indexing data structures, namely, the R*-tree and the Hybrid tree in 16-d data data structures, namely, the R*-tree and the Hybrid tree in 16-d data sets proved to sets proved to be a very lengthy operation. For example, it took one day to be a very lengthy operation. For example, it took one day to create an R*-tree with create an R*-tree with 16.000.000 points. So, we decided to test all index 16.000.000 points. So, we decided to test all index structures with smaller data sets structures with smaller data sets and achieve a unified view of their and achieve a unified view of their performance. The page size was set to all of the performance. The page size was set to all of the competing indexing structures competing indexing structures to 4096 bytes and the k value in the kNN search to 4096 bytes and the k value in the kNN search algorithm was set to 1 and 10 algorithm was set to 1 and 10 for both of the 8 and 16 dimension cases that we for both of the 8 and 16 dimension cases that we tested. The number of clusters tested. The number of clusters in the iDistance method was set to 64 in all the in the iDistance method was set to 64 in all the experiments, and the order of experiments, and the order of division parameter in the P+-tree implementation was division parameter in the P+-tree implementation was set to 4 in all the set to 4 in all the experiments. We implemented a simple data generator35 for the experiments. We implemented a simple data generator for the experiments. experiments. Finally, because the P+-tree implements only window queries, we had Finally, because the P+-tree implements only window queries, we had to to experimentaverage answers for the of 1 appropriate and 10 NN points side could length be ofcomputed the window accounting query for 8 s o that correct experiment for the appropriate side length of the window query so that correct averageand answers 16 dimensions. of 1 and 10 NN points could be computed accounting for 8 and 16 dimensions.Table 2 shows the time required to build the indexes for varying Tabledimensionality 2 shows the and time dataset required size. iDistance to build does the notindexes provide for us varying with such dimensionality and datasetinformation. size. WeiDistance observe thatdoes the not Hybrid provide tree isus about with an such order information of magnitude. We observe that thefaster Hybrid than the tree rest is of about the methods. an order All methodsof magnitude scale linearly faster to than both the the rest of the methods.dimensionality All methods and thescale dataset linearly size. to both the dimensionality and the dataset size.

Table 2: Index build time (real time in secs) DIM=8 DIM=16 100K 200K 300K 100K 200K 300K R* tree 77 159 237 148 299 448 Hybrid tree 4 10 13 13 28 45 P+ tree 47 107 167 78 153 268

Regarding the size of the created indexes, iDistance and P+ tree are essentially B+-trees and the code we obtained does not compute the relevant statistics. The R*-tree and the Hybrid tree report similar results, with the Hybrid tree showing increased storage requirements as the dataset size increases, as shown in Table 3.

Table 3: Index size in pages Dim=8 Dim=16 100K 200K 300K 100K 200K 300K R* tree 1516 3035 4552 2942 5883 8824 Hybrid tree 1111 2213 4155 2242 4457 8099

Table 4 reveals that the Hybrid tree, although it is a true multidimensional index, has the smallest average page I/O when dealing with 1NN queries, and it is insensitive to dimensionality. The P+tree has also very good performance due to the fast searching time of the B+-tree that it employs.

Table 4: Average page I/O for 1NN K=1 Dim=8 Dim=16 100K 200K 300K 100K 200K 300K R* tree 16 22 40 987 1226 1452 Hybrid tree 3 9 10 3 9 10 iDistance 176 202 231 908 1125 1262 P+ tree 3 10 22 3 36 63

35 35 average answers of 1 and 10 NN points could be computed accounting for 8 average answers of 1 and 10 NN points could be computed accounting for 8 and 16 dimensions. and 16 dimensions. Table 2 shows the time required to build the indexes for varying Table 2 shows the time required to build the indexes for varying dimensionality and dataset size. iDistance does not provide us with such dimensionality and dataset size. iDistance does not provide us with such information. We observe that the Hybrid tree is about an order of magnitude information. We observe that the Hybrid tree is about an order of magnitude faster than the rest of the methods. All methods scale linearly to both the faster than the rest of the methods. All methods scale linearly to both the dimensionality and the dataset size. dimensionality and the dataset size. Table 2: Index build time (real time in secs) Table 2: Index build timeDIM=8 (real time in secs) DIM=16 100K 200KDIM=8 300K 100K 200KDIM=16 300K R* tree 100K 77 200K159 300K237 100K148 200K 299 300K448 HybridR* tree tree 77 4 159 10 237 13 148 13 299 28 448 45 HybridP+ tree tree 4 47 10107 13167 13 78 28153 45268 INTERNATIONAL JOURNAL ON INTEGRATED INFORMATION MANAGEMENT 29 P+ tree 47 107 167 78 153 268

Regarding the size of the created indexes, iDistance and P+ tree are RegardingRegarding the size theof the size created of the createdindexes, indexes, iDistance iDistance and P+ and tree P+ treeare essentially are essentially B+-trees and the code we obtained does not compute the relevant B+-trees andessentially the code B+-trees we obtained and the code does we not obtained compute does notthe computerelevant the statistics. relevant The R*- statistics. The R*-tree and the Hybrid tree report similar results, with the Hybrid tree and statistics. the Hybrid The R* tree-tree and report the Hybrid similar tree results,report similar with results, the with Hybrid the Hybrid tree showing tree showing increased storage requirements as the dataset size increases, as increasedtree storage showing requirements increased storage as the requirements dataset size as thincreases,e dataset size as shown increases, in asTable 3. shown in Table 3. shown in Table 3.

Table 3: Index size in pages Table 3: Index size in pages Dim=8 Dim=16 Dim=8 Dim=16 100K 200K 300K 100K 200K 300K 100K 200K 300K 100K 200K 300K R* tree 1516 3035 4552 2942 5883 8824 R* tree 1516 3035 4552 2942 5883 8824 Hybrid tree 1111 2213 4155 2242 4457 8099 Hybrid tree 1111 2213 4155 2242 4457 8099

Table 4 reveals that the Hybrid tree, although it is a true multidimensional Table 4 reveals that the Hybrid tree, although it is a true multidimensional Table 4index, reveals has the that smallest the average Hybrid page tree, I/O although when dealing it with is a 1NN true queries multidimensional, and it index, has the smallest average page I/O when dealing with 1NN queries, and it index, has isthe insensitive smallest to dimensionality.average page The I/O P+t whenree has dealing also very with good 1NN performance queries, due and it is is insensitive to dimensionality. The P+tree has also very good performance due insensitive toto the dimensionality. fast searching time The of theP+tree B+-tree has that also it employs. very good performance due to the to the fast searching time of the B+-tree that it employs. fast searching time of the B+-tree that it employs. Table 4: Average page I/O for 1NN Table 4: Average page I/O for 1NN K=1 Dim=8 Dim=16 K=1 Dim=8 Dim=16 100K 200K 300K 100K 200K 300K 100K 200K 300K 100K 200K 300K R* tree 16 22 40 987 1226 1452 R* tree 16 22 40 987 1226 1452 Hybrid tree 3 9 10 3 9 10 Hybrid tree 3 9 10 3 9 10 iDistance 176 202 231 908 1125 1262 iDistance 176 202 231 908 1125 1262 P+ tree 3 10 22 3 36 63 P+ tree 3 10 22 3 36 63

36

On the otherOn hand, the other iDistance hand, iDistance appears appears to be more to be insensitive more insensitive to the to theincrease in dataset sizeincrease and in dataset dimensionality size and dimensionality that the that R* -thetree. R* -tree The. T R*he -R*tree-tree has has the the worst performanceworst in performance high dimensions, in high dimensi butons, performs but performs well well in in low low dimensionality. dimensionality. In the experimentsIn the we experiments used comparable we used comparable buffer sizes buffer for sizes all methods.for all methods. Table 5 showsTable 5 thatshows when that when dealing dealing with with 10NN10NN queries, queries, the theP+tree P+tree has the has best the best overall performance.overall performance For such. For queries,such queries the, Hybridthe Hybrid tree tree joins joins the the R* -R*tree-tree regarding the deteriorationregarding in the performance deterioration inin performancehigh dimensions. in high dimensions.On the other On thehand, other iDistance has the worsthand, performance iDistance has in the low worst dimensionality, performance in but low is dimensionality, comparable butto the is R*-tree and Hybridcomparable tree in higher to the R* dimensions.-tree and Hybrid tree in higher dimensions.

Table 5: Average page I/O for 10NN K=10 Dim=8 Dim=16 100K 200K 300K 100K 200K 300K R* tree 41 56 67 1695 1818 2046 Hybrid tree 20 32 54 1604 1815 2012 iDistance 362 402 476 1789 1920 2146 P+ tree 10 36 31 687 1007 1399

Finally, in Table 6 and Table 7 we show the query times for the R*-tree and Finallythe, in Hybrid Table tree 6 and in the Table case of7 we 1NN show and 10NNthe query queries. times We dofor not the report R*-tree the and the Hybrid treetimes in thefor the case rest of of 1NNthe methods and 10NN that use queries. B+-trees We because do not they report are almost the zero.times for the rest of theThe methods Hybrid that tree has use a veryB+-trees good becausebehavior, comparablethey are almost to the P+zero.-tree The and theHybrid tree has a veryiDistance, good behavior, although comparable it is a true multidimensional to the P+-tree index.and the The iDistance, R*-tree has although the it is a true multidimensionalworst performance. index. The R*-tree has the worst performance.

Table 6: 1NN queries execution times (real time in secs) K=1 Dim=8 Dim=16 100K 200K 300K 100K 200K 300K R* tree 1,5 3 4,49 1,9 3,86 5,69 Hybrid tree 0 0 0 0,08 0,11 0,08

Table 7: 10NN queries execution times (real time in secs) K=10 Dim=8 Dim=16 100K 200K 300K 100K 200K 300K R* tree 1,57 3,12 4,7 1,94 3,88 6,04 Hybrid tree 0 0 0 0,17 0,23 0,25

36

On the other hand, iDistance appears to be more insensitive to the increase in dataset size and dimensionality that the R*-tree. The R*-tree has the worst performance in high dimensions, but performs well in low dimensionality. In the experiments we used comparable buffer sizes for all methods. Table 5 shows that when dealing with 10NN queries, the P+tree has the best overall performance. For such queries, the Hybrid tree joins the R*-tree regarding the deterioration in performance in high dimensions. On the other hand, iDistance has the worst performance in low dimensionality, but is comparable to the R*-tree and Hybrid tree in higher dimensions.

Table 5: Average page I/O for 10NN K=10 Dim=8 Dim=16 100K 200K 300K 100K 200K 300K R* tree 41 56 67 1695 1818 2046 Hybrid tree 20 32 54 1604 1815 2012 iDistance 362 402 476 1789 1920 2146 P+ tree 10 36 31 687 1007 1399

Finally, in Table 6 and Table 7 we show the query times for the R*-tree and the Hybrid tree in the case of 1NN and 10NN queries. We do not report the times for the rest of the methods that use B+-trees because they are almost zero. The Hybrid tree has a very good behavior, comparable to the P+-tree and the iDistance, although it is a true multidimensional index. The R*-tree has the 30 worst performance.EFFICIENT INDEXING METHODS IN THE DATA MINING CONTEXT

Table 6: 1NN queries execution times (real time in secs) K=1 Dim=8 Dim=16 100K 200K 300K 100K 200K 300K R* tree 1,5 3 4,49 1,9 3,86 5,69 Hybrid tree 0 0 0 0,08 0,11 0,08

Table 7: 10NN queries execution times (real time in secs) K=10 Dim=8 Dim=16 100K 200K 300K 100K 200K 300K R* tree 1,57 3,12 4,7 1,94 3,88 6,04 Hybrid tree 0 0 0 0,17 0,23 0,25

Acknowledgements We thank the authors for providing us with the code of the implementations of the relevant indexing methods namely R* tree, Hybrid tree, P+ tree and iDistance.

References [1] Beckmann, N., Kriegel, H.-P., Schneider, R., and Seeger, B. (1990). The R*-tree: An efficient and robust access method for points and rectangles. In Proc. SIGMOD.. [2] Berchtold, S., Bohm, C., and Kriegel, H.-P. (1998). The pyramid- technique: Towards breaking the curse of dimensionality. In Proc. SIGMOD. [3] Chakrabarti, K. and Mehrotra, S. (1999). The hybrid tree: An index structure for high dimensional feature spaces. In Proc. International Conference on Data Engineering, pp. 322–331. [4] Lomet, D.B. and Salzberg, B. (1990). The hB-tree: a multiattribute indexing method with good guaranteed performance. ACM TODS, 15(4), pp. 625-658. [5] Ooi, B.C., Tan, K.L., Yu, C., and Bressan, S. (2000). Indexing the edge: a simple and yet efficient approach to high-dimensional indexing. In Proc. ACM PODS, pp. 166–174. [6] Zhang, R., Ooi, B.C., and Tan, K-L. (2004). Making the Pyramid Technique Robust to Query Types and Workloads. In Proc. ICDE, pp. 313-324. [7] Yu, C., Ooi, B.C., Tan, K.L., and Jagadish, H. (2001). Indexing the distance: an efficient method to knn processing. In Proc. International Conference on Very Large Data Bases, pp. 421–430. [8] Comer, D. (1979). The Ubiquitous B-Tree. ACM Computing Surveys, 11(2), pp. 121–137. [9] Sakurai, Y., Yoshikawa, M., Uemura, S., and Kojima, H. (2000). The A-tree: An Index Structure for High-Dimensional Spaces Using Relative Approximation. In Proc. International Conference on Very Large Data Bases, pp. 516-526. [10] Katayama, N. and Satoh, S. (1998). SR-tree: An index structure for nearest- neighbor searching of high-dimensional point data. Systems and Computers in Japan, 29(6), pp. 59-73. [11] Nievergelt, J., Hinterberger, and H., Sevcik, K.C. (1984). The Grid File: An Adaptable, Symmetric Multikey File Structure. ACM Trans. Database Syst., 9(1), pp. 38-71.