K Nearest Neighbor Queries and kNN-Joins in Large Relational Databases (Almost) for Free

Bin Yao, Feifei Li, Piyush Kumar Computer Science Department, Florida State University, Tallahassee, FL, U.S.A. {yao, lifeifei, piyush}@cs.fsu.edu

Abstract—Finding the k nearest neighbors (kNN) of a query point, or of a set of query points (kNN-Join), are fundamental problems in many application domains. Many previous efforts to solve these problems focused on spatial databases or stand-alone systems, where changes to the database engine may be required, which may limit their application on large data sets that are stored in a relational database management system. Furthermore, these methods may not automatically optimize kNN queries or kNN-Joins when additional query conditions are specified. In this work, we study both the kNN query and the kNN-Join in a relational database, possibly augmented with additional query conditions. We search for relational algorithms that require no changes to the database engine. The straightforward solution uses a user-defined function (UDF) that a query optimizer cannot optimize. We design algorithms that can be implemented by SQL operators without changes to the database engine, hence enabling the query optimizer to understand and generate the "best" query plan. Using only a small constant number of random shifts for databases in any fixed dimension, our approach guarantees to find the approximate kNN with only a logarithmic number of page accesses in expectation with a constant approximation ratio, and it can be extended to find the exact kNN efficiently in any fixed dimension. Our design paradigm easily supports the kNN-Join and updates. Extensive experiments on large, real and synthetic data sets confirm the efficiency and practicality of our approach.

I. INTRODUCTION

The k-Nearest Neighbor query (kNN) is a classical problem that has been extensively studied, due to its many important applications, such as spatial databases, pattern recognition, DNA sequencing and many others. A more general version is the kNN-Join problem [7], [8], [11], [31], [33]: given a data set P and a query set Q, for each point q ∈ Q we would like to retrieve its k nearest neighbors from the points in P.

Previous work has concentrated on the use of spatial databases or stand-alone systems. In these solution methodologies, changes to the database engine may be necessary; for example, new index structures or novel algorithms need to be incorporated into the engine. This requirement poses a limitation when the data set is stored in a database in which neither the spatial indices (such as the popular R-tree) nor the kNN-Join algorithms are available. Another limitation of the existing approaches that operate outside the relational database is the lack of support for query optimization when additional query conditions are specified. Consider the query: "Retrieve the k nearest restaurants of q with both Italian food and French wines." The result of this query could be empty (in the case where no restaurants offer both Italian food and French wines) and the kNN retrieval is then not necessary at all. It is well known that when the data is stored in a relational database, for various types of queries using only primitive SQL operators, the sophisticated query optimizer built inside the database engine will do an excellent job in finding a fairly good query execution plan [4].

These advantages of storing and processing data sets in a relational database motivate us to study the kNN query and the kNN-Join problem in a relational database environment. Our goal is to design algorithms that can be implemented by the primitive SQL operators and require no changes to the database engine. The benefits of satisfying such constraints are threefold. First, kNN based queries could be augmented with ad-hoc query conditions dynamically, and they are automatically optimized by the query optimizer, without updating the query algorithm each time for specific query conditions. Second, such an approach could be readily applied on existing commercial databases, without incurring any cost for upgrading or updating the database engine, e.g., to make it support spatial indices. We denote an algorithm that satisfies these two constraints as a relational algorithm. Finally, this approach makes it possible to support the kNN-Join efficiently. We would like to design algorithms that work well for data in multiple dimensions and easily support dynamic updates without performance degeneration. A similar relational principle has been observed for other problems as well, e.g., approximate string joins in relational databases [15].

The main challenge in designing relational algorithms is presented by the fact that a query optimizer cannot optimize any user-defined function (UDF) [15]. This rules out the possibility of using a UDF as a main query condition; otherwise, the query plan always degrades to the expensive linear scan or the nested-loop join approach, which is prohibitive for large databases. For example,

SELECT TOP k * FROM Address A, Restaurant R
WHERE R.Type='Italian' AND R.Wine='French'
ORDER BY Euclidean(A.X, A.Y, R.X, R.Y)

In this query, X and Y are attributes representing the coordinates of an Address record or a Restaurant record. "Euclidean" is a UDF that calculates the Euclidean distance between two points. Hence, this is a kNN-Join query. Even though this query does qualify as a relational algorithm, the query plan will be a nested-loop join when there are restaurants satisfying both the "Type" and "Wine" constraints, since the query optimizer cannot optimize the UDF "Euclidean". This implies that we may miss potential opportunities to fully optimize such queries. Generalizing this example to the kNN query problem, the UDF-based approach degrades to the expensive linear scan approach.

Our Contributions. In this work, we design relational algorithms that can be implemented using primitive SQL operators without relying on a UDF as a main query condition, so that the query optimizer can understand and optimize them. We would like to support both approximate and exact kNN and kNN-Join queries. More specifically,

• We formalize the problems of kNN queries and kNN-Joins in a relational database (Section II).
• We provide a constant factor approximate solution based on the Z-order values from a small, constant number of randomly shifted copies of the database (Section III). We provide the theoretical analysis to show that, by using only O(1) random shifts for data in any fixed dimension, our approach gives an expected constant factor approximation (in terms of the radius of the k nearest neighbor ball) with only log N page accesses for the kNN query, where N is the size of the data set P. Our approach can be realized with only primitive SQL operators.
• Using the approximate solution, we show how to obtain exact results for a kNN query efficiently (Section IV). Furthermore, we show that for certain types of data distributions, our exact solution also uses only O(log N) page accesses in any fixed dimension. The exact solution is also easy to implement using SQL.
• We extend our algorithms to kNN-Join queries, which can be supported in relational databases without changes to the engines (Section V).
• We show that our algorithms easily support float values, data in arbitrary dimension and dynamic updates, without any changes to the algorithms (Section VI).
• We present a comprehensive experimental study that confirms the significant performance improvement of our approach against the state of the art (Section VII).

In summary, we show how to find constant approximations for kNN queries in a logarithmic number of page accesses with a small constant number of random shifts in any fixed dimension; our approximate results lead to a highly efficient search for the exact answers with a simple post-processing of the results. Finally, our framework enables the efficient processing of kNN-Joins. We survey the related work in Section VIII.

II. PROBLEM FORMULATION

Suppose that the data set P in a d dimensional space is stored in a relational table RP. The coordinates of each point p ∈ P are stored in d attributes {Y1,...,Yd}. Each point can be associated with other values, e.g., types of restaurants. These additional attributes are denoted by {A1,...,Ag} for some value g. Hence, the schema of RP is {pid, Y1,...,Yd, A1,...,Ag}, where pid corresponds to the point id from P.

kNN queries. Given a query point q and its coordinates {X1,...,Xd}, let A = kNN(q, RP) be the set of k nearest neighbors of q from RP and |x, y| be the Euclidean distance between the point x and the point y (or the corresponding records for the relational representation of the points); then:

(A ⊆ RP) ∧ (|A| = k) ∧ (∀a ∈ A, ∀r ∈ RP − A, |a, q| ≤ |r, q|).

kNN-Join. In this case, the query is a set of points denoted by Q, and it is stored in a relational table RQ. The schema of RQ is {qid, X1,...,Xd, B1,...,Bh}. Each point q in Q is represented as a record s in RQ, and its coordinates are stored in attributes {X1,...,Xd} of s. Additional attributes of q are represented by attributes {B1,...,Bh} for some value h. Similarly, qid corresponds to the point id from Q. The goal of the kNN-Join query is to join each record s from RQ with its kNN from RP, based on the Euclidean distance defined by {X1,...,Xd} and {Y1,...,Yd}, i.e., for ∀s ∈ Q, we would like to produce the pairs (s, r) for ∀r ∈ kNN(s, RP).

Other query conditions. Additional, ad-hoc query conditions could be specified by any regular expression over {A1,...,Ag} in the case of kNN queries, or over both {A1,...,Ag} and {B1,...,Bh} in the case of kNN-Join queries. The relational requirement clearly implies that the query optimizer will be able to automatically optimize input queries based on these query conditions (see our discussion in Section VII).

Approximate k nearest neighbors. Suppose q's kth nearest neighbor from P is p* and r* = |q, p*|. Let p be the kth nearest neighbor of q returned by some kNN algorithm A and rp = |q, p|. Given ε > 0 (or c > 1), we say that (p, rp) ∈ R^d × R is a (1 + ε)-approximate (or c-approximate) solution to the kNN query kNN(q, P) if r* ≤ rp ≤ (1 + ε)r* (or r* ≤ rp ≤ cr* for some constant c). Algorithm A is called a (1 + ε)-approximation (or c-approximation) algorithm. Similarly for kNN-Joins, an algorithm that finds, for each query point q ∈ Q, a kth nearest neighbor point p ∈ P that is at least a (1 + ε)-approximation or c-approximation w.r.t. kNN(q, P) is a (1 + ε)-approximate or c-approximate kNN-Join algorithm. The result produced by such an algorithm is referred to as a (1 + ε)-approximate or c-approximate join result.

Additional notes. The default value for d in our running examples is two, but our proofs and algorithms are presented for any fixed dimension d. We focus on the case where the coordinates of points are always integers. Handling floating point coordinates is discussed in Section VI. Our approach easily supports updates, which is also discussed in Section VI. The numbers of records in RP and RQ are denoted by N = |RP| and M = |RQ| respectively. We assume that each page can store at most B records from either RP or RQ, and that the fan-out of a B+ tree is f. Without loss of generality, we assume that the points in P are all in unique locations. For the general case, our algorithms can be easily adapted by breaking ties arbitrarily.

III. APPROXIMATION BY RANDOM SHIFTS

The z-value of a point is calculated by interleaving the binary representations of its coordinate values from the most significant bit (msb) to the least significant bit (lsb). For example, given a point (2, 6) in a 2-d space, the binary representation of its coordinates is (010, 110). Hence, its z-value is 011100 = 28. The Z-order curve for a set of points P is obtained by connecting the points in P in the numerical order of their z-values, and this produces the recursively Z-shaped curve. A key observation for the computation of the z-value is that it only requires simple bit-shift operations, which are readily available or easily achievable in most commercial database engines. For a point p, zp denotes its z-value.

Our idea utilizes the z-values to map points in a multi-dimensional space into one dimension, and then translates the kNN search for a query point q into a one-dimensional range search on the z-values around q's z-value. In most cases, z-values preserve the spatial locality and we can find q's kNN in a close neighborhood (say γ positions up and down) of its z-value. However, this is not always the case. In order to get a theoretical guarantee, we produce α independent, randomly shifted copies of the input data set P and repeat the above procedure for each randomly shifted version of P.

Algorithm 1: zχ-kNN (point q, point sets {P^0,...,P^α})
1 Candidates C = ∅;
2 for i = 0,...,α do
3   Find z_p^i as the successor of z_{q+v_i} in P^i;
4   Let C^i be the γ points up and down next to z_p^i in P^i;
5   For each point p in C^i, let p = p − v_i;
6   C = C ∪ C^i;
7 Let Aχ = kNN(q, C) and output Aχ.

Specifically, we define the "random shift" operation as shifting all data points in P by a random vector v ∈ R^d. This operation is simply p + v for all p ∈ P, and it is denoted as P + v.

For the analysis, consider the special case α = 1 and γ = k. In this case, let P′ be a randomly shifted version of the point set P, q′ be the correspondingly shifted query point, and Aq′ be the k nearest neighbors of q′ in P′. Note that Aq′ = A. Let the points in P′ be {p1, p2,...,pN}, sorted by their z-values. The successor of zq′ w.r.t. the z-values in P′ is denoted as pτ for some τ ∈ [1, N]. Clearly, the candidate set C in this case is simply C = {pτ−k,...,pτ+k}. Without loss of generality, we assume that both τ − k and τ + k are within the range [1, N]; otherwise we can simply take additional points after or before the successor, as explained above in the algorithm description. We use B(c, r) to denote a ball with center c and radius r, and let rad(p, S) be the distance from the point p to the farthest point in a point set S.
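To make the bit interleaving and the random-shift operation concrete, the following fragment is a minimal illustrative sketch in Python, not the SQL implementation used in the paper. It assumes non-negative integer coordinates, a fixed number of bits per dimension, and a shift vector drawn uniformly from [0, 2^bits) in each dimension; the function and variable names are ours.

    import random

    def z_value(coords, bits):
        # Interleave the binary representations of the coordinates,
        # taking one bit per dimension from the msb down to the lsb.
        z = 0
        for b in range(bits - 1, -1, -1):
            for x in coords:
                z = (z << 1) | ((x >> b) & 1)
        return z

    def random_shift(points, bits, rng=random):
        # One "random shift": add the same random vector v to every point (P + v).
        # Re-computing z-values of shifted points needs one extra bit per dimension.
        d = len(points[0])
        v = [rng.randrange(1 << bits) for _ in range(d)]
        return v, [tuple(p[j] + v[j] for j in range(d)) for p in points]

    # Running example from the text: the point (2, 6) with 3 bits per dimension.
    assert z_value((2, 6), bits=3) == 0b011100   # = 28

In the SQL translation given later in this section, this interleaved value is precomputed once per record and stored in the zval attribute, over which a clustered B+-tree index is built.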
We independently at random generate α number the point p to the farthest point in a point set S. We first show −→ −→ −→ d of vectors {v1,..., vα} where ∀i ∈ [1, α], vi ∈ R . Let the following lemmas in order to claim the main theorem. i −→ 0 −→ −→ i P = P + v , P = P and v0 = 0 . For each P , its points are i Lemma 1 Let M be the smallest box, centered at q′ contain- sorted by their z-values. Note that the random shift operation ing ′ and with side length i (where is assumed w.l.o.g to is executed only once for a data set P and used for subsequent Aq 2 i be an integer > 0) which is randomly placed in a quadtree T queries. Next, for a query point q and a data set P , let z be p (associated with the Z-order). If the event E is defined as M the successor z-value of z among all z-values for points in q being contained in a quadtree box with side length i+j , P . The γ-neighborhood of q is defined as the γ points up MT 2 and MT is the smallest such quadtree box, then and down next to zp. For the special case, when zp does not have γ points before or after, we simply take enough points 1 d dj−1 Pr(Ej ) ≤ 1 − after or before zp to make the total number of points in the j j2−j  2  2 2 γ-neighborhood to be 2γ +1, including zp itself. Our kNN query algorithm essentially finds the γ-neighborhood of the Proof: The proof is in the paper’s full version [32]. query point i −→ in i for and select the final q = q+ vi P i ∈ [0, α] Lemma 2 The zχ-kNN algorithm gives an approximate k- top from the points in the unioned -neighborhoods, ′ ′ ′ ′ k (α+1) γ nearest neighbor ball B(q ,rad(q , C)) using q and P where with a maximum number of distinct points. ′ (α + 1)(2γ + 1) rad(q , C) is at most the side length of M . Here M is the We denote this algorithm as the zχ- NN algorithm and it is T T k smallest quadtree box containing M, as defined in Lemma 1. shown in Algorithm 1. It is important to note that in Line χ Proof: z -kNN scans at least p − ...p + , and picks 5, we obtain the original point from its shifted version if it τ k τ k the top k nearest neighbors to q among these candidates. is selected to be a candidate in one of the γ-neighborhoods. Let a be the number of points between pτ and the point This step simplifies the final retrieval of the kNN from the ′ ′ with the largest z-value in B(q ,rad(q , A ′ )), the exact k- candidate set C. It also implies that the candidate sets Ci’s q nearest neighbor ball of q′. Similarly, let b be the number may contain duplicate points, i.e., a point may be in the γ- of points between pτ and the point with the smallest z- neighborhood of the query point in more than one randomly ′ ′ value in B(q ,rad(q , Aq′ )). Clearly, a + b = k. Note that shifted versions. ′ ′ B(q ,rad(q , A ′ )) ⊂ M is contained inside M , hence The zχ-kNN is clearly very simple and can be implemented q T the number of points inside M ≥ k. Now, p ...p + efficiently using only (α+1) one-dimensional range searches, T τ τ k must contain a points inside M . Similarly, p − ...p must each requires only logarithmic IOs w.r.t the number of pages T τ k τ contain at least b points from M . Since we have collected at occupied by the data set (N/B) if γ is some constant. More T least k points from M , rad(q′, C) is upper bounded by the importantly, we can show that, with α = O(1) and γ = O(k), T side length of M . zχ-kNN gives a constant approximation kNN result in any T N fixed dimension d, with only O(logf B +k/B) page accesses. 
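Before the constant-factor analysis continues with Lemma 3 and Theorem 1 below, the candidate collection of Algorithm 1 can be made concrete. The sketch below is illustrative Python only: sorted in-memory lists stand in for the clustered B+-tree on zval, z_value is the helper from the earlier sketch, the names are ours, and the paper's actual implementation issues the corresponding one-dimensional range searches in SQL.

    import bisect, math

    def z_chi_knn(q, shifted_copies, k, gamma, bits):
        # shifted_copies: list of (v_i, points_i) pairs, one per copy P^i = P + v_i,
        # with v_0 the zero vector and points_0 = P.  'bits' must be large enough
        # to cover the shifted coordinates.  Returns the approximate kNN of q.
        d = len(q)
        candidates = {}
        for v, points in shifted_copies:
            order = sorted(points, key=lambda p: z_value(p, bits))  # sorted by z-value
            zs = [z_value(p, bits) for p in order]
            qz = z_value(tuple(q[j] + v[j] for j in range(d)), bits)
            pos = bisect.bisect_left(zs, qz)              # successor of q's z-value
            lo, hi = max(0, pos - gamma), min(len(order), pos + gamma + 1)
            for p in order[lo:hi]:                        # the gamma-neighborhood
                orig = tuple(p[j] - v[j] for j in range(d))   # undo the shift (Line 5)
                candidates[orig] = math.dist(orig, q)
        return sorted(candidates, key=candidates.get)[:k]     # top k of kNN(q, C)

The dictionary merges duplicate candidates that appear in more than one γ-neighborhood, mirroring the union C = C ∪ C^i in Algorithm 1.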
Lemma 3 MT is only constant factor larger than M in In fact, we can show these results with just α = 1 and expectation. Proof: The expected side length of MT is: rs, in the next step, we retrieve all records that locate within ∞ ∞ γ positions away from rs. This can be done as: − 2 i+j i+j i 1 d j−1 3j j E[2 ] ≤ 2 Pr(E ) ≤ 2 (1 − ) d 2 2 , SELECT FROM j 2j * Xj=1 Xj=1 ( SELECT TOP γ * FROM RP WHERE RP .zval > rs.zval ORDER BY RP .zval ASC where Lemma 1 gives Pr(Ej ). Using Taylor’s approximation: UNION SELECT TOP γ * FROM RP WHERE RP .zval d < r .zval ORDER BY R .zval DESC ) AS C 1 − − − − − s P 1 − ≤ 1 − d2 j 1+2 1 j + d22 2j 1  2j  Again, due to the clustered B+ tree index on the zval,   this query is essentially a sequential scan around r , with a and substituting it in the expectation calculation, we can show s query cost of O(log N + γ ). The first SELECT clause of that E[2i+j ] is O(2i). The detail is in the full version [32]. f B B this query is similar to the successor query as above. The These lemmas lead to the main theorem for zχ-kNN. second SELECT clause is also similar but the ranking is in Theorem 1 Using α = O(1), or just one randomly shifted descending order of the zval. However, even in the second copy of P , and γ = O(k), zχ-kNN guarantees an expected case, the query optimizer is robust enough to realize that with N the clustered index on the zval, no ranking is required and it constant factor approximate kNN result with O(logf B +k/B) number of page accesses. simply sequentially scans γ records backwards from rs. Proof: The IO cost follows directly from the fact that The final step of our algorithm is to simply retrieve the top the one dimensional range search used by zχ-kNN takes k records from these 2γ +1 candidates (including rs), based N on their Euclidean distance to the query point q. Hence, the O(logf B + k/B) number of page accesses with a B-tree index. Let q be the point for which we just computed the UDF ‘Euclidean’ is only applied in the last step with 2γ +1 approximate k-nearest neighbor ball B(q′,rad(q′, C)). From number of records. The complete algorithm can be expressed Lemma 2, we know that rad(q′, C) is at most the side length in one SQL statement as follows: 1 SELECT TOP k * FROM of MT . From Lemma 3 we know that the side length of MT ′ ′ 2 ( SELECT TOP γ + 1 * FROM RP , is at most constant factor larger than rad(q , Aq′ ) in P , in 3 ( SELECT TOP 1 zval FROM R ′ P expectation. Note that rad(q, Aq ) in P equals rad(q , Aq′ ) 4 WHERERP .zval ≥ q.zval ′ in P . Hence our algorithm, in expectation, computes an 5 ORDERBYRP .zval ASC ) AS T 6 WHERER .zval≥T.zval approximate k-nearest neighbor ball that is only a constant P 7 ORDERBY RP .zval ASC factor larger than the true k-nearest neighbor ball. 8 UNION Algorithm 1 clearly indicates that zχ-kNN only relies on 9 SELECT TOP γ * FROM RP 10 WHERE R .zval < T.zval one dimensional range search as its main building block (Line P 11 ORDER BY RP .zval DESC ) AS C 4). One can easily implement this algorithm with just SQL 12 ORDER BY Euclidean(q.X1,q.X2,C.Y1,C.Y2) (Q1) 0 α statements over tables that store the point sets {P ,...,P }. For all SQL environments, line 3 to 5 need to be copied We simply use the original data set P to demonstrate the into the FROM clause in line 9. We omit this from the SQL χ translation of the z -kNN into SQL. Specifically, we pre- statement to shorten the presentation. 
In the sequel, for similar process the table RP so that an additional attribute zval is situations in all queries, we choose to omit this. Essentially, introduced. A record r ∈RP uses {Y1,...,Yd} to compute line 2 to line 11 in Q1 select the γ-neighborhood of q in P , its z-value and store it as the zval. Next, a clustered B+-tree which will be the candidate set for finding the approximate index is built on the attribute zval over the table RP . This also kNN of q using the Euclidean UDF by line 1 and 12. implies that records in RP are sorted by the zval attribute. For In general, we can pre-process {P 1,...,P α} similarly to a given query point q, we first calculate its z-value (denoted 1 α get the randomly shifted tables {RP ,..., RP } of RP , the as zval as well) and then, we find the successor record of q above procedure could be then easily repeated and the final in RP , based on the zval attribute of Rp and q. In the sequel, answer is the top k selected based on applying the Euclidean we assume that such a successor record always exists. In the UDF over the union of the (α+1) γ-neighborhoods retrieved. special case, when q’s z-value is larger than all the z-values The whole process could be done in just one SQL by repeating in RP , we simply take its predecessor instead. This technical line 2 to line 11 on each randomly shifted table and unioning detail will be omitted. The successor can be found by: them together with the SQL operator UNION. SELECT TOP 1 * FROM RP WHERE RP .zval ≥ q.zval In the work by Liao et al. [23], using Hilbert-curves a Note that TOP is a ranking operator that becomes part of set of d + 1 deterministic shifts can be done to guarantee the standard SQL in most commercial database engines. For a constant factor approximate answer but in this case the example, Microsoft SQL Server 2005 has the TOP operator space requirement is high, O(d) copies of the entire point available. Recent versions of Oracle, MySQL, and DB2 have set is required, and the associated query cost is increased by the LIMIT operator which has the same functionality. Since a multiplicative factor of d. In contrast, our approach only table RP has a clustered B+ tree on the zval attribute, this requires O(1) shifts for any dimension and can be adapted query has only a logarithmic cost (to the size of RP ) in terms to yield exact answers. In practice, we just use the optimal N of the IOs, i.e., O(logf B ). Suppose the successor record is value of α = min(d, 4) that we found experimentally to γh give the best results for any fixed dimension. In addition, z- δh p7 values are much simpler to calculate than the Hilbert values, p5 δh p2 p5 especially in higher dimensions, making it suitable for the SQL p3 p2 p1 q p3 environment. Finally, we emphasize that randomly shifted p1 r∗ q tables {R1 ,..., Rα } are generated only once and could be P P rp used for multiple, different queries. p4 p4 δℓ δℓ γ γ p IV. EXACT kNN RETRIEVAL -val 9 p6 z p8 χ zℓ zp zp zq zp zp z z γ The z -kNN algorithm finds a good approximate solution 4 1 2 5 p3 h ℓ N (a) kNN box. (b) Aχ = A. in O(logf B + k/B) number of IOs. We denote the kNN χ χ result from the z -kNN algorithm as A . One could further Fig. 1. kNN box: definition and exact search. retrieve the exact kNN results based on Aχ. A straightforward solution is to perform a range query using the approximate fashion, this immediately implies that zp ∈ [zℓ,zh]. The case kth nearest neighbor ball of Aχ, B(Aχ). It is defined as the for higher dimensions is similar. 
ball centered at q with the radius rad(q, Aχ), i.e., B(Aχ) = Lemma 4 implies that the z-values of all exact kNN points χ χ ∗ B(q,rad(q, A )). Clearly, rad(q, A ) ≥ r . Hence, the exact will be bounded by the range [zℓ,zh], where zℓ and zh are the χ χ kNN points are enclosed by B(A ). This implies that we can z-values for the δℓ and δh points of M(A ), in other words: find the exact kNN for q by: Corollary 1 Let zℓ and zh be the z-values of δℓ and δh points SELECT TOP k * FROM RP χ χ of M(A ). For all p ∈ A, zp ∈ [zℓ,zh]. WHERE Euclidean(q.X1,q.X2,R .Y1,R .Y2)≤ rad(p, A ) P P χ ORDER BY Euclidean(q.X1,q.X2,RP .Y1,RP .Y2) (Q2) Proof: By B(A) ⊂ M(A ) and Lemma 4. This query does reduce the number of records participated in Consider the example in Figure 1(a), Corollary 1 guarantees that for and are located between and the final ranking procedure with the Euclidean UDF. However, zpi i ∈ [1, 5] zq zℓ zh in the one-dimensional z-value axis. The zχ-kNN essentially it still needs to scan the entire table RP to calculate the Euclidean UDF first for each record, thus becomes very searches γ number of points around both the left and the right expensive. Fortunately, we can again utilize the z-values to of zq in the z-value axis, for α number of randomly shifted find the exact kNN based on Aχ much more efficiently. copies of P , including P itself. However, as shown in Figure We define the kNN box for Aχ as the smallest box that 1(a), it may still miss some of the exact kNN points. Let γℓ fully encloses B(Aχ), denoted as M(Aχ). Generalizing this and γh denote the left and right γ-th points respectively for this χ χ search. In this case, let γ =2, zp3 is outside the search range, notation, let Ai be the kNN result of z -kNN when it is χ χ χ i specifically, zp3 > zγh . Hence, z -kNN could not find the applied only on table RP and M(Ai ) and B(Ai ) be the corresponding kNN box and kth nearest neighbor ball from exact kNN result. However, given Corollary 1, an immediate i i result is that one can guarantee to find the exact kNN result by table RP . The exact kNN results from all table RP ’s are the same and they are always equal to A. In the sequel, when the considering all points with their z-values between zℓ and zh χ of M(Aχ). In fact, if z ≤ z and z ≥ z in at least one of context is clear, we omit the subscript i from A . γℓ ℓ γh h i i χ An example of the kNN box is shown in Figure 1(a). In the table R ’s for i =0,...,α, we know for sure that z -kNN χ has successfully retrieved the exact kNN result; otherwise it this case, the z -kNN algorithm on this table RP returns the χ may have missed some exact kNN points. The case when z kNN result as A = {p1,p2,p4} for k =3 and p4 is the kth ℓ χ χ and z are both contained by z and z of the M(A ) in nearest neighbor. Hence, B(A )= B(q, |q,p4|), shown as the h γℓ γh solid circle in Figure 1(a). The kNN box M(Aχ) is defined by one of the randomly shifted tables is illustrated in Figure 1(b). In this case, in one of the (α + 1) randomly shifted tables, the the lower-left and right-upper corner points δℓ and δh, i.e., the z-order curve passes through z first before any points in ∆ points in Figure 1(a). As we have argued above, the exact γℓ χ χ M(A ) (i.e., z ≤ z ) and it comes to z after all points kNN result must be enclosed by B(A ), hence, also enclosed γℓ ℓ γh χ χ in M(A ) have been visited (i.e., z ≥ z ). As a result, the by M(A ). 
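Once the approximate answer Aχ is available, the kNN box is straightforward to compute: it is the smallest axis-aligned box centered at q that encloses B(q, rad(q, Aχ)), and by Lemma 4 and Corollary 1 the z-values of its two corner points bound the z-values of every exact nearest neighbor. The fragment below is an illustrative Python sketch under the same assumptions as the earlier ones (non-negative integer coordinates, the z_value helper available); clamping the lower corner at zero is our simplification.

    import math

    def knn_box_range(q, approx_knn, bits):
        # rad(q, A^chi): radius of the approximate k-nearest-neighbor ball.
        r = max(math.dist(q, p) for p in approx_knn)
        # Corner points delta_l and delta_h of the kNN box M(A^chi) centered at q.
        lo = tuple(max(0, math.floor(x - r)) for x in q)
        hi = tuple(math.ceil(x + r) for x in q)
        # By Corollary 1, every exact kNN point has its z-value in [z_lo, z_hi].
        return z_value(lo, bits), z_value(hi, bits)

These two z-values are exactly the zℓ and zh that the exact algorithm compares against the range already searched by zχ-kNN, and that bound the rescan performed by Algorithm 2.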
In this case, the exact kNN result is {p1,p2,p3} γh h χ and the dotted circle in Figure 1(a) is the exact kth nearest candidate points considered by z -kNN include every point χ neighbor ball B(A). in A and the kNN result from the algorithm z -kNN will be exact, i.e., Aχ = A. Lemma 4 For a rectangular box M and its lower-left and When this is not the case, i.e., either i i or i i zℓ zγh upper-right corner points , , , , where i χ δℓ δh ∀p ∈ M zp ∈ [zℓ,zh] or both in all tables RP ’s for i =0,...,α, A might not be zp stands for the z-value of a point p and zℓ, zh correspond equal to A. To address this issue, we first choose one of the to the z-values of δ and δ respectively (See Figure 1(a)). j χ ℓ h table RP such that its M(Aj ) contains the least number of Proof: Consider 2-d points, let p.X and p.Y be the points among kNN boxes from all tables; then the candidate coordinate values of p in x-axis and y-axis respectively. By points for the exact kNN search only need to include all points χ j p ∈ M, we have p.X ∈ [δℓ.X,δh.X] and p.Y ∈ [δℓ.Y,δh.Y ]. contained by this box, i.e., M(Aj ) from the table RP . χ Since the z-value of a point is obtained by shuffling bits of To find the exact kNN from the box M(Aj ), we utilizing j its coordinate values from the msb to the lsb in an alternating Lemma 4 and Corollary 1. In short, we calculate the zℓ and q0.zval R0 1 1 0 α P q .zval RP Algorithm 2: z-kNN (point q, point sets {P ,...,P }) ··· zval ··· zval χ i i 1 Let Ai be kNN(q, C ) where C is from Line 4 in the ··· χ γ z0 ··· z -kNN algorithm; ℓ ··· γ ··· i i 0 z1 2 Let z and z be the z-values of the lower-left and zs ℓ ℓ h 1 χ ··· zs upper-right corner points for the box M(A ); 0 i γ ··· L ··· 1 i i L 3 Let z and z be the lower bound and upper bound of ··· γ ··· γℓ γh ··· ··· χ i z1 the z-values z -kNN has searched to produce C ; ··· h i i i i z0 4 if , s.t. and then h ∃i ∈ [0, α] zγℓ ≤ zℓ zγh ≥ zh 5 Return Aχ by zχ-kNN as A; Fig. 2. Exact search of kNN. 6 else j If this count equals 0, then [zi,zi ] ⊆ [zi ,zi ]. One can 7 Find j ∈ [0, α] such that the number of points in R ℓ h γℓ γh P easily create one SQL statement to do this checking for all with z-values in ∈ [zj ,zj ] is minimized; ℓ h tables. If there is at least one table with a count equal to 0 8 Let Ce be those points and return A = kNN(q, Ce); 0 α χ among RP ,..., RP , then we can can safely return A as the exact kNN result. Otherwise, we continue to the next step. j i We first select the table (say RP ) with the smallest L value. j j j This is done by (Q4), in which we find the value j. Then we zh of this box and do a range query with [zℓ ,zh] on the zval j j select the final kNN result from RP among the records with attribute in table RP . Since there is a clustered index built j j on the zval attribute, this range query becomes a sequential z-values between [zℓ ,zh] via (Q5). scan in [zj ,zj ] which is very efficient. It essentially involves SELECT TOP 1 ID FROM ℓ h 0 0 ( SELECT 0 AS ID, COUNT(*) AS L FROM RP AS R logarithmic IOs to access the path from the root to the leaf 0 0 0 0 WHERE R .zval ≥ zℓ AND R .zval ≤ zh level in the B+ tree, plus some sequential IOs linear to the UNION ··· UNION j j j α α number of points between zℓ and zh in table RP . The next SELECT α AS ID, COUNT(*) AS L FROM RP AS R i WHERE Rα.zval ≥ zα AND Rα.zval ≤ zα lemma is immediate based on the above discussion. 
Let zs ℓ h i i )ASTORDERBYT.LASC (Q4) be the successor z-value to q ’s z-value in table RP where i −→ i i i j q = q + vi , zp be the z-value of a point p in table RP and L SELECT TOP k * FROM RP //The ID from Q4 is j i i i WHERE Rj .zval ≥ zj AND Rj .zval ≤ zj be the number of points in the range [zℓ,zh] from table RP . P ℓ P h j j j j ORDER BY Euclidean(q .X1,q .X2,RP .Y1,RP .Y2) (Q5) Lemma 5 For the algorithm zχ-kNN, if there exists at least One can combine (Q3), (Q4) and (Q5) to get the exact one , such that i i i i , then χ kNN result based on the approximate kNN result given by i ∈{0,...,α} [zℓ,zh] ⊆ [zγℓ ,zγh ] A = j j j (Q1) (the zχ-kNN algorithm). A; otherwise, we can find A ⊆ [zℓ ,zh] for some table RP where j ∈{0,...,α} and Lj = min{L0,...,Lα}. Finally, we would like to highlight that theoretically, for many practical distributions, the z-kNN algorithm still achieves N These discussion leads to a simple algorithm for finding the O(logf B + k/B) number of page accesses to report the exact kNN based on zχ-kNN and it is shown in Algorithm 2. exact kNN result in any fixed dimension. When Aχ = A, We denote it as the z-kNN algorithm. this result is immediate by Theorem 1. When Aχ 6= A, the Figure 2 illustrates the z-kNN algorithm. In this case, α =1, key observation is that the number of “false positives” in the 0 1 candidate set Ce (line 8 in Algorithm 2) is bounded by some In both table RP and RP , [zγℓ ,zγh ] does not fully contain χ [zℓ,zh]. Hence, algorithm z -kNN does not guarantee to return constant O(k) for many practical data distributions. Hence, the 1 1 number of candidate points that the z-kNN algorithm needs the exact kNN. Among the two tables, [zℓ ,zh] contains less 0 0 1 0 to check is still O(k), the same as the zχ-kNN algorithm. number of points than [zℓ ,zh] (in Figure 2, L 30) one should consider Data sets. The real data sets were obtained from [1]. Each data using techniques that are specially designed for those purposes, set represents the road-networks for a state in United States. for example the LSH-based method [5], [14], [25], [30]. We have tested California, New Jersey, Maryland, Florida and Another nice property of our approach is the easy and others. They all exhibit similar results. Since the California efficient support of updates, both insertions and deletions. For data set is the largest with more than 10 million points, we only the deletion of a record r, we simply delete r based on its show its results. By default, we randomly sample 1 million pid from all tables R0,..., Rα. For an insertion of a record points from the California data set. We also generate two types r that corresponds to a point p, we calculate the z-values for of synthetic data sets, namely, the uniform (UN) points, and the p0,...,pα, recall pi = p + −→v . Next, we simply insert r into i random-clustered (R-Cluster) points. Note that the California all tables R0,..., Rα but with different z-values. The database data set is in 2-dimensional space. For experiments in higher engine will take care of maintaining the clustered B+ tree dimensional space, we use the UN and R-Cluster data sets. indices on the zval attribute in all these tables. Finally, our queries are parallel-friendly as they execute Setup. Unless otherwise specified, we measured an algo- similar queries over multiple tables with the same schema. rithm’s performance by the wall clock time metric which is the total execution time of the algorithm, i.e, including both the IO VII. EXPERIMENT cost and the CPU cost. 
By default, 100 queries were generated We implemented all algorithms in a database server running for each experiment and we report the average for one query. Microsoft SQL Server 2005. The state of the art algorithm for For both kNN and kNN-Join queries, the query point or the the exact kNN queries in arbitrary dimension is the iDistance query points are generated uniformly at random in the space [20] algorithm. For the approximate kNN, we compare against of the data set P . The default size of P is N = 106. The the Medrank algorithm [12], since it is the state of the art default value for k is 10. We keep α = 2, randomly shifted for finding approximate kNN in relatively low dimensions copies, for the UN data set and α = min{4, d}, randomly and is possible to adapt it in the relational principle. We also shifted copies, for the California and R-Cluster data sets, and compared against the approach using deterministic shifts with set γ =2k. These values of α and γ do not change for different the Hilbert-curve [23], however, that method requires O(d) dimensions. For the Medrank algorithms, we set its α value shifts and computing Hilbert values in different dimensions. as 2 for experiments in two dimensions and 4 for d larger than Both become very expensive when d increases, especially in 2. This is to make fair comparison with our algorithms (with a SQL environment. Hence, we focused on the comparison the same space overhead). The default dimensionality is 2. against the Medrank algorithm. We would like to emphasize A. Results for the kNN Query they were not initially designed to be relational algorithms

UN California that are tailored for the SQL operators. Hence, the results Impact of α. The number z−kNN here do not necessarily reflect their behavior when being used of “randomly shifted” copies 0.12 zχ−kNN without the SQL constraint. We implemented iDistance [20] has a direct impact on the using SQL, assuming that the clustering step has been pre- running time for algorithms 0.08 processed outside the database and its incremental, recursive zχ-kNN and z-kNN, as well 0.04 range exploration is achieved by a stored procedure. We built as their space overhead. Its Running time (secs) the clustered B+ tree index on the one-dimensional distance effect on the running time is 0 0 2 3 4 5 6 value in this approach. We used the suggested 120 clusters shown in Figure 4. For the α χ Fig. 4. Impact of α on the running with the k-means clustering method, and the recommended ∆r z -kNN algorithm, we expect time. value from [20]. For the Medrank, the pre-processing step is its running time to increase linearly with α. Indeed, this to generate α random vectors, then create α one-dimensional is the case in Figure 4. The running time for the exact z- lists to store the projection values of the data sets onto the kNN algorithm has a more interesting trend. For the uniform α random vectors, lastly sort these α lists. We created one UN data set, its running time also increases linearly with clustered index on each list. In the query step, we leveraged α, simply because it has to search more tables, and for on the cursors in the SQL Server as the ‘up’ and ‘down’ the uniform distribution more random shifts do not change pointers and used them to retrieve one record from each list in the probability of Aχ = A. This probability is essentially every iteration. The query process terminates when there are i i i i and it stays the same in the k Pr(∃i, [zℓ,zh] ⊆ [zγℓ ,zγh ]) elements have been retrieved satisfying the dynamic threshold uniform data for different α values. The number of false

0 20 40 60 80 100 0 1 3 5 7 2 3 4 5 6 0 10 20 30 40 50 k N: X106 α γ (a) Vary k. (b) Vary N. (c) Vary α. (d) Vary γ. Fig. 3. The approximation quality of the zχ-kNN and Medrank algorithm: average and the %5-%95 confidence interval. positive points also does not change much for the uniform much. Hence, a γ value that is close to k (say 2k) is good data, when using multiple shifts. On the other hand, for a enough. Finally, another interesting observation is that, for the highly skewed data set, such as California, increasing the California and the R-Cluster data sets, four random shifts are α value could bring significant savings, even though z-kNN enough to ensure zχ-kNN to give a nice approximation quality. has to search more tables. This is due to two reasons. First, For the UN data set, two shifts already make it give an approx- in highly skewed data sets, more “randomly shifted” copies imation ratio that is almost 1. This confirms our theoretical increase the probability i i i i . Second, analysis, that in practice shift for zχ- NN is indeed good Pr(∃i, [zℓ,zh] ⊆ [zγℓ ,zγh ]) O(1) k more shifts also help reduce the number of false positive points enough. Among the three data sets, not surprisingly, the UN around the kNN box, in at least one of the shifts. However, data set consistently has the best approximation quality and larger α values also indicate searching more tables. We expect the skewed data sets give slightly worse results. Increasing α to see a turning point, where the overhead of searching more values does bring the approximation quality of Medrank closer tables, when introducing additional shifts, starts to dominate. to zχ-kNN (Figure 3(c)), however, zχ-kNN still achieves much This is clearly shown in Figure 4 for the z-kNN algorithm on better approximation quality. California, and α = 4 is the sweet spot. This result explains our choice for the default value of α. On the storage side, the Running time. Next we test the running time (Figure 5 cost is linear in α. However, since we only keep a small, in next page) of different algorithms for the kNN queries, constant number of shifts in any dimension, such a space using default values for all parameters, but varying k and overhead is small. The running time and the space overhead of N. Since both California and R-Cluster are skewed, we only the Medrank algorithm increase linearly to α. Since its running show results from California. Figure 5 immediately tells that χ time is roughly 2 orders of magnitude more expensive than zχ- both the z -kNN and the z-kNN significantly outperform other kNN and z-kNN, we omitted it in Figure 4. methods by one to three orders of magnitude, including the brute-force approach BF (the SQL query using the “Euclidean” Approximation quality. We next study the approximation UDF directly), on millions of points and varying k. In many quality of zχ-kNN, comparing against Medrank. For different cases, the SQL version of the Medrank method is even worse values of k (Figure 3(a)), N (Figure 3(b)), α (Figure 3(c)), than the BF approach, because retrieving records from multiple and γ (Figure 3(d)), Figure 3 confirms that zχ-kNN achieves tables by cursors in a sorted order, round-by-round fashion very good approximation quality and significantly outperforms is very expensive in SQL; and updating the candidate set in Medrank. For zχ-kNN, we plot the average together with the Medrank is also expensive by SQL. 
The zχ-kNN has the best 5% − 95% confidence interval for 100 queries, and only the running time followed by the z-kNN. Both of them outperform average for the Medrank as it has a much larger variance. In the iDistance method by at least one order of magnitude. In all cases, the average approximation ratio of zχ-kNN stays most cases, both the zχ-kNN and the z-kNN take as little as below 1.1, and the worst cases never exceed 1.4. This is 0.01 to 0.1 second, for a k nearest neighbor query on several usually two times better or more than Medrank. These results millions of records, for k ≤ 100. Interestingly enough, both also indicate that our algorithm not only achieves an excellent Figure 5(a) and 5(b) suggest, that the running time for both expected approximation ratio as our theorem has suggested, the zχ-kNN and the z-kNN increase very slowly, w.r.t the but also has a very small variance in practice, i.e., its worst increment on the k value. This is because the dominant cost case approximation ratio is still very good and much better for both algorithms, is to identify the set of candidate points, than the existing method. Figure 3(a) and Figure 3(b) indicate using the clustered B+ tree. The sequential scan (with the that the approximate ratio of zχ-kNN is roughly a constant over range depending on the k value) around the successor record k and N, i.e., it has a superb scalability. Figure 3(c) indicates is extremely fast, and it is almost indifferent to the k values, that the approximation quality of zχ-kNN improves when more unless there is a significant increase. Figure 5(c) and 5(d) “randomly shifted” tables are used. However, for all data sets, indicate that the running time of all algorithms increase with even with α = 2, its average approximation ratio is already larger N values. For the California data set, as the distribution around 1.1. Figure 3(d) reveals that increasing γ slightly does is very skewed, the chance that our algorithms have to scan improve the approximation quality of zχ-kNN, but not by too more false positives and miss exact kNN results is higher.

(Legend: BF, iDistance, z-kNN, zχ-kNN, Medrank.) (a) Vary k: UN. (b) Vary k: California. (c) Vary N: UN. (d) Vary N: California.
Fig. 5. kNN queries: running time of different algorithms, default N = 10^6, k = 10, d = 2, α = 2, γ = 2k.

χ χ BF iDistance z−kNN z −kNN Medrank zχ−kNN Medrank Hence, the z - NN and the z- NN have higher costs than their k k 2.5 2 10 UN performance in the UN data set. Nevertheless, Figure 5(c) and R−Cluster

χ 1 5(d) show that z -kNN and z-kNN have excellent scalability 10 2 /r*

0 p comparing to other methods, w.r.t the size of the database. For 10 r 1.5 example, for 7 million records, they still just take less than 0.1 −1

Running time (secs) 10 second. −2 1 10 0 2 4 6 8 10 0 2 4 6 8 10 Effect of dimensionality. We next investigate the impact of Dimensionality Dimensionality (a) Running time:R-Cluster. (b) Approximation Quality dimensionality to our algorithms, compared to the BF and the SQL-version of the iDistance and Medrank methods, using Fig. 6. kNN queries: impact of the dimensionality. the UN and R-Cluster data sets. The running time for both UN R−Cluster UN R−Cluster

BF BF 2500 χ data sets are similar, hence we only report the result from z −kNNJ zχ−kNNJ R-Cluster. Figure 6(a) indicates that the SQL version of the 2000 600

1500 Medrank method is quite expensive. The BF method’s running 400 time increases slowly with the dimensionality. This is expected 1000 Running time (secs)

Running time (secs) 200 since the IOs contribute the dominant cost. Increasing the 500

0 dimensionality does not significantly affect the total number 0 2 4 6 0 6 0 2 4 6 8 10 of pages in the database. The running time of zχ-kNN and z- N:X10 Dimensionality (a) Vary N. (b) Vary dimensionality. kNN do increase with the dimensionality, but at a much slower pace compared to the SQL-versions of the iDistance, Medrank Fig. 7. kNN-Join running time: zχ-kNNJ vs BF. and BF. When the dimension exceeds eight, the performance χ of the SQL version iDistance becomes worse than the BF results between z -kNN and z-kNN in the kNN queries, the exact version of this algorithm (i.e., z- NNJ) has very similar method. When dimensionality becomes 10, our algorithms k performance (in terms of the running time) to the zχ-kNNJ. provide more than two orders of magnitude performance gain, χ compared to the BF, the iDistance and the Medrank methods, For brevity, we focus on the z -kNNJ algorithm. Figure 7 shows the running time of the kNN-Join queries, using either for both the uniform and the skewed data sets. Finally, we χ χ the z -kNNJ or the BF when we vary either N or d on UN study the approximation quality of the z -kNN algorithm, χ compared against the Medrank when dimensionality increases and R-Cluster data sets. In all cases, z -kNNJ is significantly better than the BF method, which essentially reduces to the and the result is shown in Figure 6(b). For zχ-kNN, we show its average as well as the 5%-95% confidence interval. nested loop join. Both algorithms are not sensitive to varying values up to and we omitted this result for brevity. Clearly, zχ-kNN gives excellent approximation ratios across k k = 100 However, the BF method does not scale well with larger all dimensions and consistently outperforms the Medrank χ algorithm by a large margin. Furthermore, similar to results databases as shown by Figure 7(a). The z -kNNJ algorithm, on the contrary, has a much slower increase in its running time in two dimension from Figure 3, zχ-kNN has a very small variance in its approximation quality across all dimensions. when N becomes larger (up-to 7 million records). Finally, in terms of dimensionality (Figure 7(b)), both algorithms are For example, its average approximation ratio is 1.2 when more expensive in higher dimensions. However, once again, d = 10, almost two times better than Medrank; and its the zχ-kNNJ algorithm has a much slower pace of increment. worst case is below 1.4 which is still much better than the χ approximation quality of the Medrank. We also studied the approximation quality for z -kNNJ. Since it is developed based on the same idea as the the zχ-kNN B. Results for the kNN-Join Query algorithm, their approximation qualities are the same. Hence, the results were not shown. In this section, we study the kNN-Join queries, by com- paring the zχ-kNNJ algorithm to the BF SQL query that uses C. Updates and Distance Based θ-Join the UDF “Euclidean” as a major join condition (as shown We also performed experiments on the distance based θ- in Section I). By default, M = |Q| = 100. Similar to Join queries. That too, has very good performance in practice. As discussed in Section VI, our methods easily supports designed for data in extremely high dimensions (typically updates, and the query performance is not affected by dynamic d> 30 and up-to 100 or more). It is not optimized for data in updates. Many existing methods for kNN queries could suffer relatively low dimensions. Our focus in this work is to design from updates. 
For example, the performance of the iDistance relational algorithms that are tailored for the later (2 to 10 method depends on the quality of the initial clusters. When dimensions). Another approximate method that falls into this there are updates, dynamically maintaining good clusters is a class is the Medrank method [12]. In this method, elements are very difficult problem [20]. This is even harder to achieve in projected to a random line, and ranked based on the proximity a relational database with only SQL statements. Medrank is of the projections to the projection of the query. Multiple also very expensive to support ad-hoc updates [12]. random projections are performed and the aggregation rule picks the database element that has the best median rank. A D. Additional Query Conditions limitation of both the LSH-based methods and the Medrank Finally, another benefit of being relational is the easy method is that they can only give approximate solutions and support for additional, ad-hoc query conditions. We have could not help for finding the exact solutions. Note that for any verified this with both the zχ-kNN and z-kNN algorithms, by approximate kNN algorithm, it is always possible to, in a post- augmenting additional query predicates and varying the query processing step, retrieve the exact kNN result by a range query selectivity of those predicates. When databases have meta-data using the distance between q and the kth nearest neighbor from (some statistical information on different attributes) available the approximate solution. However, it no longer guarantees any for estimating the query selectivity, the query optimizer is able query costs provided by the approximate algorithm, as it was to perform further pruning based on those additional query not designed for the range query which, in high dimensional conditions. This is a natural result as both the zχ-kNN and z- space, often requires scanning the entire database. kNN algorithms are implemented by standard SQL operators. Another method is to utilize the space-filling curves and map the data into one dimensional space, represented by the VIII. RELATED WORKS work of Liao et al. [23]. Specifically, their algorithm uses The nearest neighbor query with Lp norms has been ex- O(d + 1) deterministic shifted copies of the data points and tensively studied. In the context of spatial databases, R-tree stores them according to their positions along a Hilbert curve. provides efficient algorithms using either the depth-first [29] or Then it is possible to guarantee that a neighbor within an the best-first [18] approach. These algorithms typically follow O(d1+1/t) factor of the exact nearest neighbor, can be returned the branch and bound principle based on the MBRs in an for any Lt norm with at most O((d+1)log N) page accesses. R-tree index [6], [16]. Some commercial database engines Our work is inspired by this approach [10], [23], however, with have already incorporated the R-tree index into their systems. significant differences. We adopted the Z-order for the reason However, there are still many already deployed relational that the Z-value of any point can be computed using only the databases in which such features are not available. Also, the bit shuffle operation. This can be easily done in a relational R-tree family has relatively poor performance for data beyond database. More importantly, we use random shifts instead six dimensions and the kNN-join with the R-tree is most likely of deterministic shifts. 
Consequently, only O(1) shifts are not available in the engines even if R-tree is available. needed to get an expected constant approximation result, using Furthermore, R-tree does not provide any theoretical guar- only O(log N) page accesses for any fixed dimensions. In antee on the query costs for nearest neighbor queries (even practice, our approximate algorithm gives orders of magnitude for approximate versions of these queries). Computational improvement in terms of its efficiency, and at the same time geometry community has spent considerable efforts in design- guarantees much better approximation quality comparing to ing approximate nearest neighbor algorithms [2], [9]. Arya adopting existing methods by SQL. In addition, our approach et al. [2] designed a modification of the standard kd-tree, supports efficient, exact kNN queries. It obtains the exact kNN called the Balanced Box Decomposition (BBD) tree that for certain types of distributions using only O(log N) page can answer (1 + ǫ)-approximate nearest neighbor queries in accesses for any dimension, again using just O(1) random O(1/ǫd log N). BBD-tree takes O(N log N) time to build. shifts. Our approach also works for kNN-Joins. BBD trees are certainly not available in any database engine. Since we are using Z-values to transform points from higher Practical approximate nearest neighbor search methods do dimensions to one dimensional space, a related work is the exist. One is the well-known locality sensitive hashing (LSH) UB-Tree [28]. Conceptually, the UB-Tree is to index Z-values [14], where data in high dimensions are hashed into random using B-Tree. However, they [28] only studied range query buckets. By carefully designing a family of hash functions algorithms and they did not use random shifts to generate the to be locality sensitive, objects that are closer to each other Z-values. To the best of our knowledge, the only prior work on in high dimensions will have higher probability to end up in kNN queries in relational databases was [3]. However, it has the same bucket. The drawback of the basic LSH method is a fundamentally different focus. Their goal is to explore ways that in practice, large number of hash tables may be needed to improve the query optimizer inside the database engine to to approximate nearest neighbors well [25] and successful handle kNN queries. In contrast, our objective is to design attempts have been made to improve the performance of the SQL-based algorithms outside the database engine. LSH-based methods [5], [30]. The LSH-based methods are The state of the art technique for retrieving the exact k nearest neighbors in high dimensions is the iDistance method REFERENCES [20]. Data is first partitioned into clusters by any popular [1] Open street map. http://www.openstreetmap.org. clustering method. A point p in a cluster ci is mapped to [2] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. a one-dimensional value, which is the distance between p and An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. Journal of ACM, 45(6):891–923, 1998. ci’s cluster center. Then, kNN search in the original high [3] A. W. Ayanso. Efficient processing of k-nearest neighbor queries over dimensional data set could be translated into a sequence of relational databases: A cost-based optimization. PhD thesis, 2004. incremental, recursive range queries, in these one dimensional [4] B. Babcock and S. Chaudhuri. 
Towards a robust query optimizer: a principled and practical approach. In SIGMOD, 2005. distance values [20], that gradually expands its search range [5] M. Bawa, T. Condie, and P. Ganesan. Lsh forest: self-tuning indexes until kNN is guaranteed to be retrieved. Assuming that the for similarity search. In WWW, 2005. ∗ clustering step has been performed outside the database and [6] N. Beckmann, H. P. Kriegel, R. Schneider, and B. Seeger. The R - tree: an efficient and robust access method for points and rectangles. In the incremental search step is achieved in a stored procedure, SIGMOD, 1990. one can implement a SQL-version of the iDistance method. [7] C. B¨ohm and F. Krebs. High performance data mining using the nearest However, as it was not designed to be a SQL-based algorithm, neighbor join. In ICDM, 2002. [8] C. B¨ohm and F. Krebs. The k-nearest neighbour join: Turbo charging our algorithm outperforms this version of the iDistance method the kdd process. Knowl. Inf. Syst., 6(6):728–749, 2004. as shown in Section VII. [9] T. M. Chan. Approximate nearest neighbor queries revisited. In SoCG, Finally, the NN-Join has also been studied [7], [8], [11], 1997. k [10] T. M. Chan. Closest-point problems simplified on the ram. In SODA, [31], [33]. The latest results are represented by the iJoin algo- 2002. rithm [33] and the Gorder algorithm [31]. The first approach [11] A. Corral, Y. Manolopoulos, Y. Theodoridis, and M. Vassilakopoulos. is based on the iDistance, and it extends the iDistance method Closest pair queries in spatial databases. In SIGMOD, 2000. [12] R. Fagin, R. Kumar, and D. Sivakumar. Efficient similarity search and to support the kNN-Join. The extension is non-trivial and it is classification via rank aggregation. In SIGMOD, 2003. not clear how to extend it into a relational algorithm using only [13] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for SQL statements. Gorder is a block nested loop join method middleware. In PODS, 2001. [14] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high that exploits sorting, join scheduling and distance computation dimensions via hashing. In VLDB, 1999. filtering and reduction. Hence, it is an algorithm to be imple- [15] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, mented inside the database engine, which is different from our and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, 2001. objectives. Other join queries are considered for spatial data [16] A. Guttman. R-trees: a dynamic index structure for spatial searching. sets as well, such as the distance join [17], [18], multiway In SIGMOD, 1984. spatial join [26] and others. Interested readers are referred to [17] G. R. Hjaltason and H. Samet. Incremental distance join algorithms for spatial databases. In SIGMOD, 1998. the article by Jacox et al. [19]. Finally, the distance-based [18] G. R. Hjaltason and H. Samet. Distance browsing in spatial databases. similarity join in GPU with the help of z-order curves was ACM Trans. Database Syst., 24(2), 1999. investigated in [24]. [19] E. H. Jacox and H. Samet. Spatial join techniques. ACM Trans. Database Syst., 32(1), 2007. [20] H. V. Jagadish, B. C. Ooi, K.-L. Tan, C. Yu, and R. Zhang. iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. IX. CONCLUSION ACM Trans. Database Syst., 30(2):364–397, 2005. [21] D. R. Karger and M. Ruhl. Finding nearest neighbors in growth- This work revisited the classical kNN-based queries. 
We restricted metrics. In STOC, 2002. designed efficient algorithms that can be implemented by [22] M. Kolahdouzan and C. Shahabi. Voronoi-based k nearest neighbor SQL operators in large relational databases. We presented a search for spatial network databases. In VLDB, 2004. [23] S. Liao, M. A. Lopez, and S. T. Leutenegger. High dimensional constant approximation for the kNN query, with logarithmic similarity search with space filling curves. In ICDE, 2001. page accesses in any fixed dimension and extended it to [24] M. D. Lieberman, J. Sankaranarayanan, and H. Samet. A fast similarity the exact solution, both using just O(1) random shifts. Our join algorithm using graphics processing units. In ICDE, 2008. [25] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe approach naturally supports kNN-Joins, as well as other inter- LSH: efficient indexing for high-dimensional similarity search. In VLDB, esting queries. No changes are required for our algorithms for 2007. different dimensions, and the update is almost trivial. Several [26] N. Mamoulis and D. Papadias. Multiway spatial joins. ACM Trans. Database Syst., 26(4):424–475, 2001. interesting directions are open for future research. One is to [27] D. Papadias, M. L. Yiu, N. Mamoulis, and Y. Tao. Nearest neighbor study other related, interesting queries in this framework, e.g., queries in network databases. In Encyclopedia of GIS. 2008. the reverse nearest neighbor queries. The other is to examine [28] F. Ramsak, V. Markl, R. Fenk, M. Zirkel, K. Elhardt, and R. Bayer. Integrating the ub-tree into a database system kernel. In VLDB, 2000. the relational algorithms to the data space other than the Lp [29] N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. norms, such as the important road networks [22], [27]. In SIGMOD, 1995. [30] Y. Tao, K. Yi, C. Sheng, and P. Kalnis. Quality and efficiency in high- dimensional nearest neighbor search. In SIGMOD, 2009. X. ACKNOWLEDGMENT [31] C. Xia, H. Lu, B. C. Ooi, and J. Hu. Gorder: an efficient method for KNN join processing. In VLDB, 2004. We would like to thank the anonymous reviewers for their [32] B. Yao, F. Li, and P. Kumar. K nearest neighbor queries and knn-joins in large relational databases (almost) for free. Technical report, 2009. valuable comments that improve this work. Bin Yao and Feifei www.cs.fsu.edu/∼lifeifei/papers/nnfull.pdf. Li are supported in part by Start-up Grant from the Computer [33] C. Yu, B. Cui, S. Wang, and J. Su. Efficient index-based KNN join Science Department, FSU. Piyush Kumar is supported in part processing for high-dimensional data. Inf. Softw. Technol., 49(4):332– 344, 2007. by NSF grant CCF-0643593.