Dynamic Enumeration of Similarity Joins
Pankaj K. Agarwal
Duke University, USA

Xiao Hu
Duke University, USA

Stavros Sintos
University of Chicago, USA

Jun Yang
Duke University, USA

Abstract

This paper considers enumerating answers to similarity-join queries under dynamic updates: Given two sets of points $A, B$ in $\mathbb{R}^d$, a metric $\phi(\cdot)$, and a distance threshold $r > 0$, it asks to report all pairs of points $(a, b) \in A \times B$ with $\phi(a, b) \le r$. Our goal is to design a data structure that, whenever asked, can enumerate all result pairs with worst-case delay guarantee, i.e., the time between enumerating two consecutive pairs is bounded. Furthermore, it can be efficiently updated when an input point is inserted or deleted.

We propose several efficient data structures for answering similarity joins in low dimension. For exact enumeration of similarity join, we obtain near-linear-size data structures for $\ell_1/\ell_\infty$ metrics with $\log^{O(1)} n$ update time and delay. We show that such a data structure is not feasible for the $\ell_2$ metric for $d \ge 4$. For approximate enumeration of similarity join, where the distance threshold is a soft constraint, we obtain a unified linear-size data structure for $\ell_p$ metric, with $\log^{O(1)} n$ delay and update time. In high dimensions, we present an efficient data structure toward a worst-case delay-guarantee framework using locality sensitive hashing (LSH).

2012 ACM Subject Classification Theory of computation → Theory and algorithms for application domains → Database theory → Data structures and algorithms for data management

Keywords and phrases dynamic enumeration, similarity joins, worst-case delay guarantee

Digital Object Identifier 10.4230/LIPIcs...
© Pankaj K. Agarwal, Xiao Hu, Stavros Sintos, and Jun Yang; licensed under Creative Commons License CC-BY 4.0. Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany.

1 Introduction

There has been extensive work in many areas including theoretical computer science, computational geometry, and database systems on designing efficient dynamic data structures to store a set $D$ of objects so that certain queries on $D$ can be answered quickly and objects can be inserted into or deleted from $D$ dynamically. A query $Q$ is specified by a set of constraints and the goal is to report the subset $Q(D) \subseteq D$ of objects that satisfy the constraints, the so-called reporting or enumeration queries. More generally, $Q$ may be specified on $k$-tuples of objects in $D$, and we return the subset of $D^k$ that satisfies $Q$. One may also ask to return certain statistics on $Q(D)$ instead of $Q(D)$ itself, but here we focus on enumeration queries. As an example, $D$ is a set of points in $\mathbb{R}^d$ and a query $Q$ specifies a simple geometric region $\Delta$ (e.g., box, ball, simplex) and asks to return $D \cap \Delta$, the so-called range-reporting problem. As another example, $D$ is again a set of points in $\mathbb{R}^d$, and $Q$ now specifies a value $r \ge 0$ and asks to return all pairs $(p, q) \in D \times D$ with $\|p - q\| \le r$. Traditionally, the performance of a data structure has been measured by its size, the time needed to update the data structure when an object is inserted or deleted, and the total time spent in reporting $Q(D)$. In some applications, especially in exploratory or interactive data analysis, it is desirable to report $Q(D)$ incrementally one by one so that users can start exploiting the first answers while waiting for the remaining ones. To offer guarantees on the regularity during the enumeration process, the delay between the enumeration of two consecutive objects has emerged as an important measurement [10]. Formally speaking, $\delta$-delay enumeration requires that the time between the start of the enumeration process and the first result, the time between any consecutive pair of results, and the time between the last result and the termination of the enumeration process should each be at most $\delta$.

In this paper, we are interested in dynamic data structures for (binary) similarity join queries, which have numerous applications in data cleaning, data integration, collaborative filtering, etc. Given two sets of points $A$ and $B$ in $\mathbb{R}^d$, a metric $\phi(\cdot)$, and a distance threshold $r > 0$, the similarity join asks to report all pairs $(a, b) \in A \times B$ with $\phi(a, b) \le r$. Similarity joins have been extensively studied in the database and data mining literature [18, 32, 39, 42, 44], but it is still unclear how to enumerate similarity join results efficiently when the underlying data is updated. Our goal is to design a dynamic data structure that can be efficiently updated when an input point is inserted or deleted; and whenever an enumeration query is issued, all join results can be enumerated from it with a worst-case delay guarantee.

1.1 Previous results

We briefly review the previous work on similarity join and related problems. See surveys [7, 9, 43] for more results.

Enumeration of Conjunctive Query. Conjunctive queries are built upon natural join ($\bowtie$), which is a special case of similarity join with $r = 0$, i.e., two tuples can be joined if and only if they have the same value on the join attributes.
Enumeration of conjunctive queries has been extensively studied in the static setting [10, 41, 15] for a long time. In 2017, two simultaneous papers [13, 30] started to study dynamic enumeration of conjunctive queries. Both obtained a dichotomy: a linear-size data structure that can be updated in $O(1)$ time while supporting $O(1)$-delay enumeration exists for a conjunctive query if and only if it is q-hierarchical (e.g., the degenerate natural join over two tables is q-hierarchical). However, for non-q-hierarchical queries with input size $n$, they showed a lower bound of $\Omega(n^{\frac{1}{2} - \varepsilon})$ on the update time for any small constant $\varepsilon > 0$, if aiming at $O(1)$ delay. This result is very negative since q-hierarchical queries are a very restricted class; for example, the matrix multiplication query $\pi_{X,Z}\, R_1(X, Y) \bowtie R_2(Y, Z)$, where $\pi_{X,Z}$ denotes the projection on attributes $X, Z$, and the triangle join $R_1(X, Y) \bowtie R_2(Y, Z) \bowtie R_3(Z, X)$ are already non-q-hierarchical. Later, Kara et al. [33] designed optimal data structures supporting $O(\sqrt{n})$-time maintenance for some selected non-q-hierarchical queries such as the triangle query. However, it is still unclear if a data structure with $O(\sqrt{n})$-time maintenance can be obtained for a large class of queries. Some additional trade-off results have been obtained in [34, 45].

Range search. A widely studied problem related to similarity join is range searching [2, 3, 12, 47]: Preprocess a set $A$ of points in $\mathbb{R}^d$ with a data structure so that for a query range $\gamma$ (e.g., rectangle, ball, simplex), all points of $A \cap \gamma$ can be reported quickly. A particular instance of range searching, the so-called fixed-radius-neighbor searching, in which the range is a ball of fixed radius centered at the query point, is particularly relevant for similarity joins. For a given metric $\phi$, let $B_\phi(x, r)$ be the ball of radius $r$ centered at $x$.
A similarity join between two sets $A, B$ can be answered by querying $A$ with ranges $B_\phi(b, r)$ for all $b \in B$. Notwithstanding a close relationship between range searching and similarity join, the data structures for the former cannot be used for the latter: It is too expensive to query $A$ with $B_\phi(b, r)$ for every $b \in B$ whenever an enumeration query is issued, especially since many such range queries may return an empty set, and it is not clear how to maintain the query results as the input set $A$ changes dynamically.

| Enumeration | Metric | Properties | Space | Update | Delay |
|---|---|---|---|---|---|
| Exact | $\ell_1/\ell_\infty$ | $r$ is fixed | $\tilde{O}(n)$ | $\tilde{O}(1)$ | $\tilde{O}(1)$ |
| Exact | $\ell_2$ | $r$ is fixed | $\tilde{O}(n)$ | $\tilde{O}(n^{1-\frac{1}{d+1}})$ | $\tilde{O}(n^{1-\frac{1}{d+1}})$ |
| $\varepsilon$-Approximate | $\ell_p$ | $r$ is fixed | $O(n)$ | $\tilde{O}(\varepsilon^{-d})$ | $\tilde{O}(\varepsilon^{-d})$ |
| $\varepsilon$-Approximate | $\ell_p$ | $r$ is variable, spread is $\mathrm{poly}(n)$ | $O(\varepsilon^{-d} n)$ | $\tilde{O}(\varepsilon^{-d})$ | $O(1)$ |
| $\varepsilon$-Approximate | $\ell_1$, $\ell_2$, Hamming | $r$ is fixed, high dimension | $\tilde{O}(dn + n^{1+\rho})$ | $\tilde{O}(dn^{2\rho})$ | $\tilde{O}(dn^{2\rho})$ |

Table 1 Summary of Results: $n$ is the input size; $r$ is the distance threshold; $d$ is the dimension of input points; $\rho \le \frac{1}{(1+\varepsilon)^2} + o(1)$ is the quality of the LSH family for the $\ell_2$ metric. For $\ell_1$ and Hamming, $\rho \le \frac{1}{1+\varepsilon}$. The $\tilde{O}$ notation hides a $\log^{O(1)} n$ factor; for the results where $d$ is constant the $O(1)$ exponent is at most linear in $d$, while for the high-dimensional case the exponent is at most 3.

Reporting neighbors. The problem of reporting neighbors is identical to our problem in the offline setting. In particular, given a set $P$ of $n$ points in $\mathbb{R}^d$ and a parameter $r$, the goal is to report all pairs of $P$ within distance $r$.
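
To make the offline problem concrete, the following is a minimal sketch of reporting all pairs within distance $r$ under the $\ell_\infty$ metric using a uniform grid of cell side $r$, a textbook bucketing technique. It is not one of the algorithms reviewed here nor the data structure proposed in this paper, and the function name `report_close_pairs` is hypothetical.

```python
import math
from collections import defaultdict
from itertools import product
from typing import Dict, Iterator, List, Sequence, Tuple

Point = Tuple[float, ...]


def report_close_pairs(points: Sequence[Point], r: float) -> Iterator[Tuple[int, int]]:
    """Yield each index pair (i, j), i < j, whose l_inf distance is at most r.

    Illustrative grid-bucketing sketch (assumed helper, not from the paper).
    """
    if r <= 0:
        raise ValueError("r must be positive")
    d = len(points[0]) if points else 0

    # Bucket every point by the grid cell (side length r) that contains it.
    grid: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
    for i, p in enumerate(points):
        grid[tuple(math.floor(c / r) for c in p)].append(i)

    # Two points within l_inf distance r lie in the same or adjacent cells,
    # so it suffices to compare each cell with its 3^d neighbors.
    offsets = list(product((-1, 0, 1), repeat=d))
    for cell, members in grid.items():
        for off in offsets:
            other = tuple(c + o for c, o in zip(cell, off))
            # .get avoids creating empty cells while iterating over the grid.
            for i in members:
                for j in grid.get(other, ()):
                    if i < j and max(abs(a - b) for a, b in zip(points[i], points[j])) <= r:
                        yield (i, j)
```

Each qualifying pair is yielded exactly once, from the cell of its smaller index. The sketch gives no delay bound: many candidate pairs in neighboring cells may fail the distance test between two consecutive outputs, and nothing is maintained under insertions or deletions.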