Making the Pyramid Technique Robust to Query Types and Workloads
Total Page:16
File Type:pdf, Size:1020Kb
Making the Pyramid Technique Robust to Query Types and Workloads Rui Zhang Beng Chin Ooi Kian-Lee Tan Department of Computer Science National University of Singapore, Singapore 117543 fzhangru1, ooibc, [email protected] Abstract ranges of elevation, aspect, slope, cover type, etc are typi- cal. KNN queries are common in content-based retrieval The effectiveness of many existing high-dimensional in- systems [10, 12]. dexing structures is limited to specific types of queries and While a large number of indexing techniques have been workloads. For example, while the Pyramid technique and proposed to improve performance, most of these techniques the iMinMax are efficient for window queries, the iDistance are not sufficiently robust for a wide range of queries. For is superior for kNN queries. In this paper, we present a example, the Pyramid technique [7] and iMinMax [15] are new structure, called the P+-tree, that supports both win- efficient for window queries, but perform less satisfactorily dow queries and kNN queries under different workloads ef- for kNN queries. On the other hand, metric-based schemes ficiently. In the P+-tree, a B+-tree is employed to index such as the iDistance [21] are usually superior for kNN the data points as follows. The data space is partitioned queries, but may not be usable for window queries. More- into subspaces based on clustering, and points in each sub- over, these schemes typically perform well for certain work- space are mapped onto a single dimensional space using the loads (data set size, dimensionality, data distribution, etc) Pyramid technique, and stored in the B+-tree. The crux of and become inferior to sequential scan in other cases. the scheme lies in the transformation of the data which has In this paper, we propose an index structure, called the two crucial properties. First, it maps each subspace into P+-tree, that supports both window and kNN queries un- a hypercube so that the Pyramid technique can be applied. der different workloads (different data set size and dimen- Second, it shifts the cluster center to the top of the pyra- sionality, different data distribution, queries with different mid, which is the case that the Pyramid technique works selectivities and different shapes) efficiently. Our scheme very efficiently. We present window and kNN query pro- is based on the Pyramid technique, which is primarily de- + cessing algorithms for the P -tree. Through an extensive signed for hypercube-shaped window queries (where the + performance study, we show that the P -tree has consider- query rectangle has equal sides in all dimensions). While able speedup over the Pyramid technique and the iMinMax the Pyramid technique is shown to be much better than the for window queries and outperforms the iDistance for kNN Hilbert R-tree [13] and the X-tree [16] for data sets of di- queries. mensionality larger than 8, it has three deficiencies: its per- formance is sensitive to the positions of the query hyper- cube; it is less effective for clustered data sets; and it is infe- 1. Introduction rior for partial window queries. In these cases, a sequential scan over the data set may be more effective. Multidimensional data are common in many applica- The basic structure of the P+-tree is essentially a B+- tions. These applications have a wide range of needs. Win- tree that indexes subspaces of points under a new trans- dow queries1 are common in CAD, medical image [3] and formation method. The data space is first partitioned into GIS systems. For example, the Forest CoverType data set subspaces based on clustering. Next, points in each sub- [1] has 54 attributes, out of which 10 are quantitative. To space are mapped onto a single dimensional space using the analyze the data, (partial) window queries that specify some Pyramid technique, and stored in the B+-tree. To discrimi- nate the points within each cluster, they are transformed into 1In the literature, the term “range query” has been used to mean win- non-overlapping regions in the single dimensional space. dow query (hyperrectangle shaped) and similarity range query (hyper- The crux of the scheme lies in the choice of the transfor- sphere shaped). To avoid ambiguity, we use the term “window query” instead of “range query” throughout this paper. The “range query” in [7] mation that has two crucial properties. First, it maps each and [15], in fact, means window query. subspace into a hypercube so that the Pyramid scheme can be applied. Second, it shifts the cluster center to the top of 3 A Closer Look at the Pyramid Technique the pyramid, which is the case that the Pyramid technique works very efficiently. The Pyramid technique divides the d-dimensional data The rest of the paper is organized as follows: Section space into 2d pyramids that share the center point of the 2 reviews some related work. In section 3, we examine space as their top, and the (d-1)-dimensional surfaces of the the Pyramid technique, identify its deficiencies, and discuss space are their bases (Figure 1). Each pyramid has a pyra- how to overcome them. Then we present the P+-tree in sec- mid number i according to some rule. The distance between tion 4 and query processing schemes in section 5. In section a point v and the center in dimension i (or i ¡ d if i ¸ d) 6, we present the results of a performance study of the P+- is defined as the height of the point, hv. Then, the pyramid tree. Section 7 concludes this paper. value of v is defined as the sum of its pyramid number i and its height hv. 2 Related Work pvv = (i + hv) The R-tree [11] and its variations such as the R*-tree [4] d 1 and the X-tree [16] were first proposed to manage multi- dimensional data. They work well with low-dimensional P3 data. But their performance deteriorates as the dimensional- P ity increases, and become unacceptable in high-dimensional 2 P space due to large amount of overlap. This phenomenon is 0 called the curse of dimensionality. hv Indexing methods based on transformation were pro- P1 v posed to make window query processing in medium- and high-dimensional space more efficient. Among them, the d 0 Pyramid technique [7] is notable. In the Pyramid technique, Figure 1. The Pyramid technique the high-dimensional data points are transformed into one- This pyramid value is the key indexed by a B+-tree. dimensional values and the transformed values are indexed A query rectangle corresponds to a height range in an in- by the classic B+-tree. Experiments show that the Pyra- tersected pyramid. Those data points of height within the mid technique outperforms the Hilbert R-tree [13] and the height range are accessed. X-tree [16]. However, the Pyramid technique is primarily designed for hypercube shaped queries over uniform data. The poorer performance to handle other cases limits its use- fulness in real applications. The iMinMax [15] is another indexing scheme intended for high-dimensional window query based on a different transformation method. θ in the iMinMax is a tuning pa- rameter which makes it adaptive to data distributions. As such, it performs better than the Pyramid technique when the data is skewed. Since θ is a global parameter, the iM- (a) (b) inMax works well when the number of natural clusters is + small. Query rectangle Region accessed On kNN queries, the SS-tree [20], SR-tree [14] and X- Figure 2. (a) Query rectangle near the center tree were designed to reduce the effects of high dimension- (b) Query rectangle near the corner ality by using different page regions or bigger node size. The M-tree [8] was proposed for the generic metric space. Figure 2(a) and (b) show the region accessed when the The IQ-tree [6] compresses the leaf nodes and the A-tree query rectangle is located near the center and corner of the [17] uses virtual bounding rectangles to approximate data data space, respectively (the dark region is the query rect- objects. However, according to the analysis in [19], sequen- angle; the lighter shaded region plus the dark region is the tial scan may be the best method especially for uniform data region accessed by the query rectangle). For uniform data, in very high-dimensional space. And the VA-file was pro- the area (or volume) of the region accessed is proportional posed to accelerate sequential scan by vector approxima- to the number of data accessed. When the query rectangle tion. More recently, the iDistance [21] adopts the transfor- is near the space center, most data accessed is in the answer mation strategy on kNN search and BOND [9] uses vertical set, so the index is efficient. However, when the query rect- fragmentation to reduce the I/O cost of frequently observed angle is near the space corner or edge, the data accessed is query patterns. many times those in the answer set, which makes the index quite inefficient. In other words, the effectiveness of the cluster and therefore causes a large region to be accessed, Pyramid technique is sensitive to the positions of the query the data points accessed are not prohibitively large because rectangle. The space center is a “good position”, while the most of the data points are gathered at the cluster center. In space corner and edge are “bad positions”. The difference addition, the region accessed by a query is significantly re- in the access cost of these two cases is even larger in high- duced by space division. Thus, the P+-tree can alleviate the dimensional space.