Dynamic Enumeration of Similarity Joins

Pankaj K. Agarwal
Duke University, USA

Xiao Hu
Duke University, USA

Stavros Sintos
University of Chicago, USA

Jun Yang
Duke University, USA

Abstract

This paper considers enumerating answers to similarity-join queries under dynamic updates: Given two sets of points A, B in R^d, a metric ϕ(·), and a distance threshold r > 0, it asks to report all pairs of points (a, b) ∈ A × B with ϕ(a, b) ≤ r. Our goal is to design a data structure that, whenever asked, can enumerate all result pairs with a worst-case delay guarantee, i.e., the time between enumerating two consecutive pairs is bounded. Furthermore, it can be efficiently updated when an input point is inserted or deleted.

We propose several efficient data structures for answering similarity joins in low dimensions. For exact enumeration of similarity join, we obtain near-linear-size data structures for the ℓ1/ℓ∞ metrics with log^{O(1)} n update time and delay. We show that such a data structure is not feasible for the ℓ2 metric for d ≥ 4. For approximate enumeration of similarity join, where the distance threshold is a soft constraint, we obtain a unified linear-size data structure for the ℓp metric, with log^{O(1)} n delay and update time. In high dimensions, we present an efficient data structure toward a worst-case delay-guarantee framework using locality sensitive hashing (LSH).

2012 ACM Subject Classification Theory of computation → Theory and algorithms for application domains → Database theory → Data structures and algorithms for data management

Keywords and phrases dynamic enumeration, similarity joins, worst-case delay guarantee

Digital Object Identifier 10.4230/LIPIcs...

1 Introduction

There has been extensive work in many areas, including theoretical computer science, computational geometry, and database systems, on designing efficient dynamic data structures to store a set D of objects so that certain queries on D can be answered quickly and objects can be inserted into or deleted from D dynamically. A query Q is specified by a set of constraints, and the goal is to report the subset Q(D) ⊆ D of objects that satisfy the constraints, the so-called reporting or enumeration queries. More generally, Q may be specified on k-tuples of objects in D, and we return the subset of D^k that satisfies Q. One may also ask to return certain statistics on Q(D) instead of Q(D) itself, but here we focus on enumeration queries. As an example, D is a set of points in R^d and a query Q specifies a simple geometric region ∆ (e.g., box, ball, simplex) and asks to return D ∩ ∆, the so-called range-reporting problem. As another example, D is again a set of points in R^d, and Q now specifies a value r ≥ 0 and asks to return all pairs (p, q) ∈ D × D with ∥p − q∥ ≤ r. Traditionally, the performance of a data structure has been measured by its size, the time needed to update the data

structure when an object is inserted or deleted, and the total time spent in reporting Q(D). In some applications, especially in exploratory or interactive data analysis, it is desirable to report Q(D) incrementally one by one so that users can start exploiting the first answers while waiting for the remaining ones. To offer guarantees on the regularity during the

© Pankaj K. Agarwal, Xiao Hu, Stavros Sintos, and Jun Yang; licensed under Creative Commons License CC-BY 4.0. Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany.

enumeration process, the delay between the enumeration of two consecutive objects has emerged as an important measure [10]. Formally speaking, δ-delay enumeration requires that the time from the start of the enumeration process to the first result, the time between any consecutive pair of results, and the time between the last result and the termination of the enumeration process are each at most δ.

In this paper, we are interested in dynamic data structures for (binary) similarity-join queries, which have numerous applications in data cleaning, data integration, collaborative filtering, etc. Given two sets of points A and B in R^d, a metric ϕ(·), and a distance threshold r > 0, the similarity join asks to report all pairs (a, b) ∈ A × B with ϕ(a, b) ≤ r. Similarity joins have been extensively studied in the database and data mining literature [18, 32, 39, 42, 44], but it is still unclear how to enumerate similarity-join results efficiently when the underlying data is updated. Our goal is to design a dynamic data structure that can be efficiently updated when an input point is inserted or deleted, and from which, whenever an enumeration query is issued, all join results can be enumerated with a worst-case delay guarantee.

1.1 Previous results

We briefly review the previous work on similarity join and related problems. See the surveys [7, 9, 43] for more results.

Enumeration of Conjunctive Queries. Conjunctive queries are built upon the natural join (⋊⋉), which is a special case of similarity join with r = 0, i.e., two tuples can be joined if and only if they have the same value on the join attributes. Enumeration of conjunctive queries has been extensively studied in the static setting [10, 41, 15] for a long time. In 2017, two simultaneous papers [13, 30] started to study dynamic enumeration of conjunctive queries. Both obtained a dichotomy: a linear-size data structure that can be updated in O(1) time while supporting O(1)-delay enumeration exists for a conjunctive query if and only if the query is q-hierarchical (e.g., the degenerate natural join over two tables is q-hierarchical). However, for non-q-hierarchical queries with input size n, they showed a lower bound of Ω(n^{1/2−ε}) on the update time for any small constant ε > 0, if aiming at O(1) delay. This result is very negative since q-hierarchical queries are a very restricted class; for example, the matrix multiplication query π_{X,Z} R1(X,Y) ⋊⋉ R2(Y,Z), where π_{X,Z} denotes the projection onto attributes X, Z, and the triangle join R1(X,Y) ⋊⋉ R2(Y,Z) ⋊⋉ R3(Z,X) are already non-q-hierarchical. Later, Kara et al. [33] designed optimal data structures supporting O(√n)-time maintenance for some selected non-q-hierarchical queries, such as the triangle query. However, it is still unclear whether a data structure with O(√n)-time maintenance can be obtained for a larger class of queries. Some additional trade-off results have been obtained in [34, 45].

Range search. A widely studied problem related to similarity join is range searching [2, 3, 12, 47]: Preprocess a set A of points in R^d with a data structure so that for a query range γ (e.g., rectangle, ball, simplex), all points of A ∩ γ can be reported quickly. A particular instance of range searching, the so-called fixed-radius-neighbor searching, in which the range is a ball of fixed radius centered at a query point, is particularly relevant for similarity joins. For a given metric ϕ, let Bϕ(x, r) be the ball of radius r centered at x. A similarity join between two sets A, B can be answered by querying A with the ranges Bϕ(b, r) for all b ∈ B. Notwithstanding the close relationship between range searching and similarity join, data structures for the former cannot be used directly for the latter: it is too expensive to query A with Bϕ(b, r) for every b ∈ B whenever an enumeration query is issued, especially since many such range queries may return an empty set, and it is not clear how to maintain the query results as the input set A changes dynamically.
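To make the cost of this per-point strategy concrete, here is a minimal brute-force sketch (the function names and the choice of the Euclidean metric are our own illustration, not the paper's): the join is answered by issuing one fixed-radius range query Bϕ(b, r) per point of B, paying for |B| queries, many possibly empty, at every enumeration request.

```python
import math

def ball_query(A, center, r):
    """Fixed-radius range query: all points of A in the ball B_phi(center, r)."""
    return [a for a in A if math.dist(a, center) <= r]

def join_via_range_queries(A, B, r):
    """Answer the similarity join by querying A once per point of B.
    Every enumeration query pays for |B| range queries, many possibly
    empty -- exactly the inefficiency described in the text."""
    return [(a, b) for b in B for a in ball_query(A, b, r)]

A = [(0.0, 0.0), (2.0, 0.0)]
B = [(0.5, 0.0), (5.0, 5.0)]
print(join_via_range_queries(A, B, 1.0))  # [((0.0, 0.0), (0.5, 0.0))]
```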

Reporting neighbors. The problem of reporting neighbors is identical to our problem in the offline setting. In particular, given a set P of n points in R^d and a parameter r, the goal is to report all pairs of points of P within distance r. The algorithm proposed in [35] can be modified to solve the problem of reporting neighbors under the ℓ∞ metric in O(n + k) time, where k is the output size. Aiger et al. [6] proposed randomized algorithms for reporting neighbors under the ℓ2 metric in O((n + k) log n) time, for constant d.

  Enumeration      Metric             Properties                       Space              Update              Delay
  Exact            ℓ1/ℓ∞              r is fixed                       Õ(n)               Õ(1)                Õ(1)
  Exact            ℓ2                 r is fixed                       Õ(n)               Õ(n^{1−1/(d+1)})    Õ(n^{1−1/(d+1)})
  ε-Approximate    ℓp                 r is fixed                       O(n)               Õ(ε^{−d})           Õ(ε^{−d})
  ε-Approximate    ℓp                 r is variable, spread poly(n)    O(ε^{−d} n)        Õ(ε^{−d})           O(1)
  ε-Approximate    ℓ1, ℓ2, Hamming    r is fixed, high dimension       Õ(dn + n^{1+ρ})    Õ(dn^{2ρ})          Õ(dn^{2ρ})

Table 1 Summary of results: n is the input size; r is the distance threshold; d is the dimension of the input points; ρ ≤ 1/(1+ε)^2 + o(1) is the quality of the LSH family for the ℓ2 metric (for ℓ1 and Hamming, ρ ≤ 1/(1+ε)). The Õ notation hides a log^{O(1)} n factor; for the results where d is constant the O(1) exponent is at most linear in d, while for the high-dimensional case the exponent is at most 3.

Scalable continuous query processing. There has been some work on scalable continuous query processing, especially in the context of data streams [20, 17, 48] and publish/subscribe [24], where the queries are standing queries and, whenever a new data item arrives, the goal is to report all queries that are affected by the new item [5, 4]. In the context of similarity join, one can view A as the data stream and Bϕ(b, r) as standing queries, and we update the results of the queries as new points of A arrive. There are, however, significant differences with similarity joins: arbitrary deletions are not handled; continuous queries do not need to return previously produced results; and basing enumeration queries on a solution for continuous queries would require accessing previous results, which can be prohibitive if stored explicitly.

1.2 Our results

We present several dynamic data structures for enumerating similarity joins under different metrics. It turns out that dynamic similarity join is hard for some metrics, e.g., the ℓ2 metric. Therefore, we also consider approximate similarity join, where the distance threshold r is a soft constraint. Formally, given parameters r, ε > 0, the ε-approximate similarity join relaxes the distance threshold as follows: (1) all pairs (a, b) ∈ A × B with ϕ(a, b) ≤ r must be returned; (2) some pairs (a, b) ∈ A × B with r < ϕ(a, b) ≤ (1 + ε)r may be returned; (3) no pair (a, b) ∈ A × B with ϕ(a, b) > (1 + ε)r is returned. We classify our results into four broad categories:

Exact similarity join. Here we assume that d is constant and the distance threshold is fixed. We exploit the geometry of the balls defined by the input points and the distance threshold r. Our first result (Section 2.1) is a data structure for similarity join under the ℓ1/ℓ∞ metrics, based on the range tree [11, 22]. We store the similarity-join pairs implicitly so that (i) they can be enumerated without probing every input point, (ii) the representation can be updated quickly whenever A or B is updated, and (iii) we ensure Õ(1) delay during enumeration. We extend these ideas to construct a data structure for similarity join under the ℓ2 metric (Section 2.2) using a data structure for ball range searching, with Õ(n^{1−1/(d+1)}) amortized update time while supporting Õ(n^{1−1/(d+1)})-delay enumeration. Meanwhile, lower bounds on ball range searching [1, 19] rule out the possibility of a linear-size data structure with Õ(1) delay.

Approximate similarity join in low dimensions. Due to the negative result for the ℓ2

metric, we shift our attention to ε-approximate similarity join. We now allow the distance threshold to be part of the query, but the value of ε, the error parameter, is fixed. We present a simple data structure based on quad trees and the notion of a well-separated pair decomposition. If we fix the distance threshold, then the data structure can be further simplified and somewhat improved by replacing the quad tree with a simple uniform grid.

Approximate similarity join in high dimensions. So far we have assumed d to be constant, and the big-O notation in some of the previous bounds hides a constant that is exponential in d. Our final result is an LSH-based [26] data structure for similarity joins in high dimensions. Two technical issues arise when enumerating join results from LSH: one is to ensure bounded delay, because we do not want to enumerate the false-positive results identified by the hash functions, and the other is to remove duplicate results, as one join result could be identified by multiple hash functions. For the ℓ2 metric (the results can also be extended to the ℓ1 and Hamming metrics), we propose a data structure of Õ(nd + n^{1+ρ}) size and Õ(dn^{2ρ}) amortized update time that supports (1 + 2ε)-approximate enumeration with Õ(dn^{2ρ}) delay with high probability, where ρ ≤ 1/(1+ε)^2 + o(1) is the quality of the LSH family. Our data structure can be extended to the case where the distance threshold r is variable. If we allow a worse approximation error, we can improve the results for the Hamming distance. Finally, we show a lower bound by relating similarity join to the approximate nearest neighbor query.

Table 1 summarizes our results. We also consider similarity joins beyond binary joins.

Triangle similarity join in low dimensions. Given three sets of points A, B, S in R^d, a metric ϕ(·), and a distance threshold r > 0, the triangle similarity join asks to report the set of all triples (a, b, s) ∈ A × B × S with ϕ(a, b) ≤ r, ϕ(a, s) ≤ r, and ϕ(b, s) ≤ r. The ε-approximate triangle similarity join can be defined similarly by taking the distance threshold r as a soft constraint. We extend all our data structures on similarity join for constant d to the triangle similarity join by paying a factor of log^{O(1)} n in the performance. Due to the space limit, we describe this extension in Appendix B.

High-level framework. All our data structures rely on the following common framework. We model the similarity join as a bipartite graph G′ = (A ∪ B, E), where an edge (a, b) ∈ E if and only if ϕ(a, b) ≤ r. A naive solution that maintains all edges of G′ explicitly leads to a data structure of Θ(n^2) size that can be updated in Θ(n) time while supporting O(1)-delay enumeration. To obtain a data structure with poly-logarithmic update time and enumeration delay, we find a compact representation of G′ with a set F = {(A1, B1), (A2, B2), . . . , (Au, Bu)} of edge-disjoint bi-cliques such that (i) Ai ⊆ A and Bi ⊆ B for every i, (ii) E = (A1 × B1) ∪ · · · ∪ (Au × Bu), and (iii) (Ai × Bi) ∩ (Aj × Bj) = ∅ for any i ≠ j. We represent F using a tripartite graph G = (A ∪ B ∪ C, E1 ∪ E2), where C = {c1, . . . , cu} has a node for each bi-clique in F, and for every i ≤ u, we have the edges (aj, ci) ∈ E1 for all aj ∈ Ai and (bk, ci) ∈ E2 for all bk ∈ Bi. We cannot afford to maintain E1 and E2 explicitly. Instead, we store some auxiliary information for each ci and use geometric data structures to recover the edges incident on a vertex ci ∈ C. We also use data structures to maintain the set C and the auxiliary information dynamically as A and B are being updated. We will not refer to this framework explicitly, but it provides the intuition behind all our data structures. Section 2 describes the data structures that support this framework for exact similarity join, and Section 3 presents simpler, faster data structures for approximate similarity join. Both Sections 2 and 3 assume d to be constant.

Section 4 describes the data structure for approximate similarity join when d is not constant.
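The enumeration side of this bi-clique framework can be sketched as follows (a minimal illustration with our own names, ignoring how F is maintained): as long as only bi-cliques with both sides non-empty are kept, each result is produced after O(1) work, and edge-disjointness guarantees no duplicates.

```python
def enumerate_from_bicliques(F):
    """Enumerate join results from an edge-disjoint bi-clique cover
    F = [(A1, B1), ..., (Au, Bu)] with E = union of Ai x Bi.
    If only 'active' bi-cliques (both sides non-empty) are stored, each
    result is emitted after O(1) work, giving constant delay."""
    for Ai, Bi in F:
        if not Ai or not Bi:      # inactive bi-clique: nothing to report
            continue
        for a in Ai:
            for b in Bi:
                yield (a, b)

F = [(['a1', 'a2'], ['b1']), ([], ['b2']), (['a3'], ['b2', 'b3'])]
print(list(enumerate_from_bicliques(F)))
# [('a1', 'b1'), ('a2', 'b1'), ('a3', 'b2'), ('a3', 'b3')]
```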

2 Exact Similarity Join

In this section, we describe data structures for exact similarity joins under the ℓ∞, ℓ1, and ℓ2 metrics, assuming d is constant. We first describe the data structure for the ℓ∞ metric. We then show that similarity join under the ℓ1 metric in R^d can be reduced to similarity join under the ℓ∞ metric in R^{d+1}. Finally, we describe the data structure for the ℓ2 metric. Throughout this section, the threshold r is fixed and is assumed to be 1 without loss of generality.

2.1 Similarity join under the ℓ∞ metric

Let A and B be two point sets in R^d with |A| + |B| = n. For a point p ∈ R^d, let B(p) = {x ∈ R^d | ∥p − x∥∞ ≤ 1} be the hypercube of side length 2 centered at p. Then we wish to enumerate all pairs (a, b) ∈ A × B such that a ∈ B(b).

Data structure. We build a d-dimensional dynamic range tree T on the points of A. For d = 1, the range tree on A is a balanced binary search tree T of O(log n) height. The points of A are stored at the leaves of T in increasing order, while each internal node v stores the smallest and the largest values, α⁻(v) and α⁺(v), respectively, contained in its subtree. The node v is associated with the interval Iv = [α⁻(v), α⁺(v)] and the subset Av = Iv ∩ A. For d > 1, T is constructed recursively: We build a 1D range tree Td on the xd-coordinates of the points of A. Next, for each node v ∈ Td, we recursively construct a (d − 1)-dimensional range tree Tv on Av*, the projection of Av onto the hyperplane xd = 0, and attach Tv to v as its secondary tree. The size of T in R^d is O(n log^{d−1} n), and it can be constructed in O(n log^d n) time; see [22] for details. For a node v of a level-i tree, let p(v) denote its parent in that tree. If v is the root, p(v) is undefined. For each node u of the d-th level of T, we associate a d-tuple π(u) = ⟨u1, u2, . . . , ud = u⟩, where ui is the node at the i-th level tree of T to which the level-(i + 1) tree containing ui+1 is attached. We associate the rectangle □u = Iu1 × · · · × Iud with the node u. For a rectangle ρ = δ1 × · · · × δd, a d-level node u is called a canonical node if, for every i ∈ [1, d], Iui ⊆ δi and Ip(ui) ⊄ δi. For any rectangle ρ, there are O(log^d n) canonical nodes in T, denoted by N(ρ), and they can be computed in O(log^d n) time [22]. T can be maintained dynamically, as points are inserted into or deleted from A, using the standard partial-reconstruction method, which periodically reconstructs various bottom subtrees. The amortized update time is O(log^d n); see [37] for details.
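For intuition, the 1D case of the canonical-node decomposition can be sketched as follows (a static, simplified illustration with our own names; the paper's tree is dynamic): a canonical node is a maximal node whose interval lies inside the query, so its parent's interval does not.

```python
class Node:
    def __init__(self, lo, hi, left=None, right=None):
        self.interval = (lo, hi)      # [alpha^-(v), alpha^+(v)] of the subtree
        self.left, self.right = left, right

def build(vals):
    """Static 1D range tree over sorted values; leaves hold the points."""
    if len(vals) == 1:
        return Node(vals[0], vals[0])
    mid = len(vals) // 2
    l, r = build(vals[:mid]), build(vals[mid:])
    return Node(l.interval[0], r.interval[1], l, r)

def canonical_nodes(v, lo, hi):
    """The O(log n) canonical nodes of query [lo, hi]: maximal nodes v with
    I_v inside the query (hence I_{p(v)} is not inside it)."""
    a, b = v.interval
    if b < lo or hi < a:              # disjoint from the query
        return []
    if lo <= a and b <= hi:           # I_v inside the query: canonical
        return [v]
    return canonical_nodes(v.left, lo, hi) + canonical_nodes(v.right, lo, hi)

root = build([1, 3, 5, 7, 9, 11, 13, 15])
print([n.interval for n in canonical_nodes(root, 4, 12)])  # [(5, 7), (9, 11)]
```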

We query T with B(b) for all b ∈ B and compute N(b) := N(B(b)), the set of its canonical nodes. For each level-d tree node u of T, let Bu = {b ∈ B | u ∈ N(b)}. We have Σu |Bu| = O(n log^d n). By construction, for all pairs (a, b) ∈ Au × Bu, ∥a − b∥∞ ≤ 1, so (Au, Bu) is a bi-clique of join results. We call u active if both Au, Bu ≠ ∅. A naive approach for reporting join results is to maintain Au, Bu for every d-level node u of T, as well as the set C of all active nodes. Whenever an enumeration query is issued, we traverse C and return Au × Bu for all u ∈ C (referring to the tripartite-graph framework mentioned in the Introduction, C is the set of all level-d nodes of T). The difficulty with this approach is that when A changes and T is updated, some d-level nodes change and we have to construct Bu for each new level-d node u ∈ T. It is too expensive to scan the entire set B at each update. Furthermore, although the average size of Bu is small, it can be very large for a particular u, and this node may appear and disappear several times. So we need a different approach. The following lemma is the key observation.

▶ Lemma 1. Let u be a level-d node, and let π(u) = ⟨u1, . . . , ud = u⟩. Then there is a d-dimensional rectangle R(u) = δ1 × · · · × δd, where the endpoints of δi, for i ∈ [1, d], are defined by the endpoints of Iui and Ip(ui), such that for any x ∈ R^d, u ∈ N(x) if and only if x ∈ R(u). Given the ui's and p(ui)'s, R(u) can be constructed in O(1) time.

Figure 1 Left: Two levels of the range tree. Right: Definition of R(u).

Proof. Notice that B(x) is the hypercube of side length 2 centered at x. Let Iui = [α⁻(ui), α⁺(ui)] for each ui, i ∈ [1, d]. Recall that u ∈ N(x) if and only if, for each i ∈ [1, d],

    Iui ⊆ [xi − 1, xi + 1] and Ip(ui) ⊄ [xi − 1, xi + 1].   (∗)

Fix a value of i. From the construction of a range tree, either α⁻(ui) = α⁻(p(ui)) or α⁺(ui) = α⁺(p(ui)). Without loss of generality, assume α⁻(ui) = α⁻(p(ui)); the other case is symmetric. Then (∗) can be written as: xi ≤ α⁻(ui) + 1 and α⁺(ui) − 1 ≤ xi < α⁺(p(ui)) − 1. Therefore xi has to satisfy three 1D linear constraints. The feasible region of these constraints is an interval δi, and xi ∈ δi (see also Figure 1). Hence, u is a canonical node of B(x) if and only if xi ∈ δi for all i ∈ [1, d]; in other words, x = (x1, . . . , xd) ∈ δ1 × · · · × δd := R(u). The endpoints of δi are endpoints of Iui or Ip(ui). In order to construct R(u), we only need the intervals Iui and Ip(ui) for each i ∈ [1, d], so R(u) can be constructed in O(d) = O(1) time. ◀
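The per-coordinate interval δi from this proof is easy to compute directly; the sketch below (our own helper names) checks condition (∗) and derives δi from the endpoints of Iui and Ip(ui), with the convention that the endpoint contributed by the parent constraint is exclusive.

```python
def meets_star(x, I_u, I_p):
    """Condition (*): I_u lies inside [x-1, x+1] while the parent's I_p does not."""
    inside = lambda I: x - 1 <= I[0] and I[1] <= x + 1
    return inside(I_u) and not inside(I_p)

def delta_interval(I_u, I_p):
    """Interval delta_i of Lemma 1 for the case alpha^-(u_i) = alpha^-(p(u_i))
    and its symmetric twin. Endpoints come only from I_u and I_p; the endpoint
    contributed by the parent constraint is exclusive (strict inequality)."""
    (a_lo, a_hi), (b_lo, b_hi) = I_u, I_p
    if a_lo == b_lo:   # shared left endpoint: a_hi - 1 <= x <= a_lo + 1, x < b_hi - 1
        return (a_hi - 1, min(a_lo + 1, b_hi - 1))
    else:              # shared right endpoint: x <= a_lo + 1, a_hi - 1 <= x, x > b_lo + 1
        return (max(a_hi - 1, b_lo + 1), a_lo + 1)

print(delta_interval((2, 4), (2, 9)))  # (3, 3): only x = 3 makes u canonical
```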

In view of Lemma 1, we proceed as follows. We build a dynamic range tree Z on B. Furthermore, we augment the range tree T on A as follows. For each level-d node u ∈ T, we compute and store R(u) and βu = |Bu|. By construction, |Au| ≥ 1 for all u. We also store at u a pointer to the leftmost leaf of the subtree of T rooted at u, and we thread all the leaves of a d-level tree so that, for a node u, Au can be reported in O(|Au|) time. Updating these pointers as T is updated is straightforward. Whenever a new node u of T is constructed, we query Z with R(u) to compute βu. Finally, we store C, the set of all active nodes of T, in a red-black tree so that a node can be inserted or deleted in O(log n) time. The total size of the data structure is O(n log^{d−1} n), and it can be constructed in O(n log^d n) time.

Update and Enumerate. Updating A is straightforward. We update T, query Z with R(u) for all newly created d-level nodes u in T to compute βu, and update C to delete all active nodes that are no longer in T and to insert new active nodes. Since the amortized time to update T as a point is inserted or deleted is O(log^d n), the amortized update time for a point of A is O(log^{2d} n): we spend O(log^d n) time to compute βu for each of the O(log^d n) newly created nodes. If a point b is inserted into (resp. deleted from) B, we update Z and query T with B(b). For all canonical nodes u ∈ N(b), we increment (resp. decrement) βu. If u becomes active (resp. inactive), we insert (resp. delete) u in C in O(log n) time. The amortized update time for b is O(log^{d+1} n).

Finally, to enumerate the pairs in the join results, we traverse C and, for each u ∈ C, we first query Z with R(u) to recover Bu. Recall that Bu is reported as a set of O(log^d n) canonical nodes of Z whose leaves contain the points of Bu. We simultaneously traverse the leaves of the subtree of T rooted at u to compute Au and report Au × Bu. The traversals can be performed with O(log^d n) maximum delay. Putting everything together, we obtain:

▶ Theorem 2. Let A, B be two sets of points in R^d, where d ≥ 1 is a constant, with |A| + |B| = n. A data structure of Õ(n) size can be built in Õ(n) time and updated in Õ(1) amortized time, while supporting Õ(1)-delay enumeration of similarity join under the ℓ∞ metric.

Similarity join under the ℓ1 metric. Given an arbitrary instance of similarity join under the ℓ1 metric in R^d, we show how to reduce it to an instance of similarity join under the ℓ∞ metric in R^d for d ≤ 2 and in R^{d+1} for d ≥ 3 (Appendix A.1). Plugging into Theorem 2, we obtain:

▶ Theorem 3. Let A, B be two sets of points in R^d, where d ≥ 1 is a constant, with |A| + |B| = n. A data structure of Õ(n) size can be built in Õ(n) time and updated in Õ(1) amortized time, while supporting Õ(1)-delay enumeration of similarity join under the ℓ1 metric.
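The reduction itself is given in Appendix A.1 (not shown here). For d = 2, a standard transformation of this kind is the 45-degree rotation (x, y) ↦ (x + y, x − y), under which ℓ1 distances become ℓ∞ distances, since ∥u − v∥1 = max(|∆x + ∆y|, |∆x − ∆y|). A small sketch, as our own illustration:

```python
def to_linf(p):
    """Rotate (and scale) by 45 degrees: (x, y) -> (x + y, x - y)."""
    x, y = p
    return (x + y, x - y)

def l1(u, v):
    return abs(u[0] - v[0]) + abs(u[1] - v[1])

def linf(u, v):
    return max(abs(u[0] - v[0]), abs(u[1] - v[1]))

# ||u - v||_1 = max(|dx + dy|, |dx - dy|) = ||T(u) - T(v)||_inf
for u, v in [((1.0, 2.0), (-0.5, 3.5)), ((0.0, 0.0), (3.0, -4.0))]:
    assert l1(u, v) == linf(to_linf(u), to_linf(v))
```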

2.2 Similarity join under the ℓ2 metric

In this section, we consider the similarity join between two point sets A and B in R^d under the ℓ2 metric. Using the standard lifting transformation technique (see Appendix A.2), we can reduce it to the halfspace-containment problem and use partition trees for halfspace range searching [16, 36] instead of range trees. The overall framework remains the same as under the ℓ∞ metric (see Appendix A.2). Omitting the details, we conclude the following:

▶ Corollary 4. Let A, B be two sets of points in R^d, where d ≥ 1 is a constant, with |A| + |B| = n. A data structure of Õ(n) size can be constructed in Õ(n^{2−1/(d+1)}) time and updated in Õ(n^{1−1/(d+1)}) amortized time, while supporting Õ(n^{1−1/(d+1)})-delay enumeration of similarity join under the ℓ2 metric.
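The lifting transformation mentioned above maps a point p to (p, ∥p∥²) on the paraboloid in R^{d+1}; expanding ∥a − b∥² ≤ r² shows that membership in a ball becomes membership in a halfspace. A minimal sketch of this identity (our own function names):

```python
def lift(p):
    """Lifting transformation: p in R^d -> (p, ||p||^2) in R^{d+1}."""
    return tuple(p) + (sum(x * x for x in p),)

def halfspace_for(b, r=1.0):
    """Halfspace in R^{d+1} such that ||a - b||^2 <= r^2 iff lift(a) is inside:
    expanding ||a - b||^2 gives -2<a, b> + ||a||^2 <= r^2 - ||b||^2."""
    normal = tuple(-2.0 * x for x in b) + (1.0,)
    offset = r * r - sum(x * x for x in b)
    return normal, offset

def in_halfspace(point, hs):
    normal, offset = hs
    return sum(w * x for w, x in zip(normal, point)) <= offset

a, b = (0.3, 0.4), (1.0, 0.0)
assert in_halfspace(lift(a), halfspace_for(b))               # dist(a, b) < 1
assert not in_halfspace(lift((2.0, 2.0)), halfspace_for(b))  # dist > 1
```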

Lower bound. We show a lower bound for similarity join under the ℓ2 metric in the pointer-machine model, based on the hardness of the unit-sphere reporting problem. Let P be a set of n points in R^d for d > 3. The unit-sphere reporting problem asks for a data structure on the points of P such that, given any unit sphere b, all points of P ∩ b are reported. If the space is Õ(n), it is not possible to obtain a data structure answering unit-sphere reporting queries in Õ(k + 1) time in the pointer-machine model, where k is the output size, for d ≥ 4 [1].

For any instance of the sphere-reporting problem, we construct an instance of similarity join over two sets, with A = ∅, B = P, and r = 1. Given a query unit sphere with center q, we insert the point q into A, issue an enumeration query, and then remove q from A. The results enumerated (if any) are exactly the results of the sphere-reporting problem. Hence, if there existed a data structure for enumerating similarity join under the ℓ2 metric using Õ(n) space, with Õ(1) update time and Õ(1) delay, we would break this barrier.

▶ Theorem 5. Let A, B be two sets of points in R^d for d > 3, with |A| + |B| = n. Using Õ(n) space, there is no data structure in the pointer-machine model that can be updated in Õ(1) time while supporting Õ(1)-delay enumeration of similarity join under the ℓ2 metric.

Figure 2 An example pair of an ε-WSPD.  Figure 3 An example of an active cell c in the grid.

3 Approximate Enumeration

In this section we propose a dynamic data structure for answering approximate similarity-join queries under any ℓp metric. For simplicity, we use the ℓ2 norm to illustrate the main idea and assume ϕ(a, b) = ∥a − b∥2. Recall that all pairs (a, b) ∈ A × B with ϕ(a, b) ≤ r must be reported, along with (potentially) some pairs (a′, b′) with r < ϕ(a′, b′) ≤ (1 + ε)r, but no pair (a, b) with ϕ(a, b) > (1 + ε)r is reported.

We start with the setting where the distance threshold r is not fixed but specified as part of a query, and then move to a simpler scenario where r is fixed.
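These three conditions can be checked mechanically against any candidate output; the brute-force validator below (our own illustration, using the ℓ2 metric) is useful for testing any of the approximate structures in this section.

```python
import math

def valid_eps_output(A, B, r, eps, reported):
    """Check the three conditions of the eps-approximate similarity join
    (l2 metric) against a list of reported pairs."""
    reported = set(reported)
    for a in A:
        for b in B:
            d = math.dist(a, b)
            if d <= r and (a, b) not in reported:
                return False          # (1) a true result is missing
            if d > (1 + eps) * r and (a, b) in reported:
                return False          # (3) a far pair was reported
    return True                       # (2) pairs in (r, (1+eps)r] are optional

A = [(0.0, 0.0)]
B = [(1.0, 0.0), (1.05, 0.0), (3.0, 0.0)]
must = ((0.0, 0.0), (1.0, 0.0))       # distance 1 <= r: required
may = ((0.0, 0.0), (1.05, 0.0))       # distance in (r, 1.1r]: optional
assert valid_eps_output(A, B, 1.0, 0.1, [must])
assert valid_eps_output(A, B, 1.0, 0.1, [must, may])
assert not valid_eps_output(A, B, 1.0, 0.1, [])
```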

3.1 Variable Similarity Threshold

We describe the data structure when r is part of the query. In this subsection we assume that the spread of A ∪ B is polynomially bounded, i.e., sp(A ∪ B) = max_{p,q∈A∪B} ϕ(p, q) / min_{p,q∈A∪B} ϕ(p, q) = n^{O(1)}. We use a quad tree and a well-separated pair decomposition (WSPD) for our data structure. We describe them briefly here and refer the reader to [27].

Quad tree and WSPD. A d-dimensional quad tree [27, 40] over a point set P is a tree data structure T in which each node u is associated with a hypercube □u in R^d and each internal node has exactly 2^d children. The root is associated with a hypercube containing P. For a node u, let Pu = P ∩ □u. A node u is a leaf if |Pu| ≤ 1. The tree recursively subdivides the space into 2^d congruent hypercubes until a box contains at most one point of P. If sp(P) = n^{O(1)}, the height of T is O(log n).

Given two point sets A, B ⊂ R^d, with |A| + |B| = n, and a parameter 0 < ε < 1/2, a family of pairs W = {(A1, B1), (A2, B2), · · · , (As, Bs)} is an ε-WSPD if the following conditions hold: (1) for any i ≤ s, Ai ⊆ A and Bi ⊆ B; (2) for each pair of points (a, b) ∈ A × B, there exists a unique pair (Aj, Bj) ∈ W such that a ∈ Aj and b ∈ Bj; (3) for any i ≤ s, max{diam(Ai), diam(Bi)} ≤ ε · ϕ(Ai, Bi), where diam(X) = max_{x,y∈X} ϕ(x, y) and ϕ(X, Y) = min_{x∈X, y∈Y} ϕ(x, y) (see Figure 2). As shown in [27, 29], if sp(A ∪ B) = n^{O(1)}, a quad tree T on A ∪ B can be used to construct, in O(n log n + ε^{−d} n) time, a WSPD W of size O(ε^{−d} n) such that each pair (Ai, Bi) ∈ W is associated with a pair of nodes (□i, ⊞i) of T, where Ai = A ∩ □i and Bi = B ∩ ⊞i. It is also known that for each pair (Ai, Bi) ∈ W, (i) □i ∩ ⊞i = ∅, and (ii) max{diam(□i), diam(⊞i)} ≤ ε · ϕ(□i, ⊞i) (see Figure 2). We will use W = {(□1, ⊞1), . . . , (□s, ⊞s)} to denote the WSPD, with Ai, Bi being implicitly defined by their nodes. Using the techniques in [14, 25], the quad tree T and the WSPD W can be maintained under insertions and deletions of points in Õ(ε^{−d}) time.
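The three WSPD conditions above are easy to verify by brute force on small inputs; the checker below (our own illustration, not part of the paper's construction) is handy for sanity-testing any candidate decomposition.

```python
import math

def diam(X):
    return max((math.dist(p, q) for p in X for q in X), default=0.0)

def set_dist(X, Y):
    return min(math.dist(p, q) for p in X for q in Y)

def is_eps_wspd(A, B, W, eps):
    """Brute-force verification of conditions (1)-(3) for a candidate eps-WSPD
    W = [(A1, B1), ..., (As, Bs)] of the point sets A and B."""
    for Ai, Bi in W:
        if not (set(Ai) <= set(A) and set(Bi) <= set(B)):
            return False                                     # (1)
        if max(diam(Ai), diam(Bi)) > eps * set_dist(Ai, Bi):
            return False                                     # (3)
    return all(
        sum(1 for Ai, Bi in W if a in Ai and b in Bi) == 1   # (2) unique cover
        for a in A for b in B
    )

A = [(0.0, 0.0), (0.1, 0.0)]
B = [(10.0, 0.0), (10.1, 0.0)]
assert is_eps_wspd(A, B, [(A, B)], 0.5)   # the clusters are far apart
assert not is_eps_wspd(A, B, [], 0.5)     # condition (2) fails
```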

Data structure. We construct a quad tree T on A ∪ B. For each node u ∈ T, we store a pointer Au (resp. Bu) to the leftmost leaf of the subtree Tu that contains a point from A (resp. B). Furthermore, we store a sorted list AT (resp. BT) of the leaves that contain points from A (resp. B). We use these pointers and lists to report the points in □u with O(1) delay. Using T, we construct a WSPD W = {(□1, ⊞1), . . . , (□s, ⊞s)} with s = O(ε^{−d} n). For each i, let ∆i = min_{p∈□i, q∈⊞i} ϕ(p, q). We store all pairs (□i, ⊞i) in a red-black tree Z using ∆i as the key. The data structure uses O(ε^{−d} n) space and O(ε^{−d} n log n) construction time.

Update. After inserting or deleting an input point, the quad tree T and the WSPD W can be updated in Õ(ε^{−d}) time, following the standard techniques in [14, 25]. As at most Õ(ε^{−d}) pairs change, we can update Z in Õ(ε^{−d}) time. Furthermore, we note that there are only O(1) changes in the structure of the quad tree T and the height of T is O(log n), so we can update all necessary pointers Au, Bu and the sorted lists AT, BT in O(log n) time.

Enumeration. We traverse Z in order until we reach a pair (□j, ⊞j) with ∆j > r. For each pair (□i, ⊞i) that we traverse, we enumerate (a, b) ∈ (A ∩ □i) × (B ∩ ⊞i) using the stored pointers and the sorted lists AT, BT. The delay guarantee is O(1).

Let (a, b) ∈ A × B be a pair with ϕ(a, b) ≤ r. By the definition of a WSPD, there exists a unique pair (Ai, Bi) ∈ W such that a ∈ Ai and b ∈ Bi. Notice that ϕ(□i, ⊞i) ≤ ϕ(a, b) ≤ r. Thus, all results of Ai × Bi will be reported, including (a, b). Next, let (Ai, Bi) be a pair found by the enumeration procedure in Z, with ϕ(□i, ⊞i) ≤ r. For any pair of points x ∈ □i, y ∈ ⊞i, we have ϕ(x, y) ≤ ϕ(□i, ⊞i) + diam(□i) + diam(⊞i) ≤ (1 + 2 · (ε/2)) · ϕ(□i, ⊞i) ≤ (1 + ε)r; thus no pair of points with distance strictly larger than (1 + ε)r will be reported.

▶ Theorem 6. Let A, B be two sets of points in R^d for constant d, with poly(n) spread and |A| + |B| = n. A data structure of O(ε^{−d} n) size can be built in Õ(ε^{−d} n) time and updated in Õ(ε^{−d}) time, while supporting ε-approximate enumeration of similarity join under any ℓp metric with O(1) delay, for any query similarity threshold r.

3.2 Fixed distance threshold

Without loss of generality, we assume that r = 1. We use a grid-based data structure for enumerating the similarity join with a fixed distance threshold r.

Data structure. Let G be an infinite uniform grid in R^d, where the side length of each grid cell is ε/(2√d) and hence its diameter is ε/2.¹ For a pair of cells c, c′ ∈ G, define ϕ(c, c′) = min_{p ∈ c, q ∈ c′} ϕ(p, q). Each grid cell c ∈ G is associated with (1) Ac = A ∩ c; (2) Bc = B ∩ c; and (3) mc = |{b ∈ B | ∃c′ ∈ G s.t. b ∈ c′, ϕ(c, c′) ≤ 1}|, the number of points of B that lie in a cell c′ within distance 1 from cell c. Let C ⊆ G be the set of all non-empty cells, C = {c ∈ G | Ac ∪ Bc ̸= ∅}. A grid cell c ∈ C is active if and only if Ac ̸= ∅ and mc > 0 (see Figure 3 for an example). Let C∗ ⊆ C be the set of active grid cells. Notice that a grid cell is stored only when at least one point from A or B lies inside it, so |C| ≤ n. Finally, we build a balanced search tree on C∗ so that whether a cell c is stored in C∗ can be answered in O(log n) time. Similarly, we build another balanced search tree to store the set C of non-empty cells.

Update. Assume point a ∈ A is inserted into cell c ∈ G. If c is already in C, simply add a to Ac. Otherwise, we add c to C with Ac = {a} and compute mc as follows: we visit each cell c′ ∈ C with ϕ(c, c′) ≤ 1 and add |Bc′| to mc. A point of A is deleted in a similar manner. Assume point b ∈ B is inserted into cell c ∈ G. If c ∉ C, we add it to C. In any case, we insert b into Bc and, for every cell c′ ∈ C with ϕ(c, c′) ≤ 1, we increase mc′ by 1 and add c′ to C∗ if c′ turns from inactive to active. A point from B is deleted in a similar manner. As there are O(ε^{-d}) cells within distance 1 from c, this procedure takes Õ(ε^{-d}) time.

¹ When extending the structure to an arbitrary ℓp norm, the side length of each grid cell is ε/(2d^{1/p}) and the diameter is ε/2.

Enumeration. For each active cell c ∈ C∗, we visit each cell c′ ∈ C within distance 1. If Bc′ ̸= ∅, we report all pairs of points in Ac × Bc′. Clearly, each pair of points is enumerated at most once. For an active cell c, there must exist a pair (a ∈ Ac, b ∈ Bc′) for some cell c′ ∈ C such that ϕ(a, b) ≤ ϕ(c, c′) + diam(c) + diam(c′) ≤ 1 + ε. So it takes at most O(ε^{-d} log n) time before finding at least one result for c, and thus the delay is O(ε^{-d} log n). Furthermore, consider any pair of points a, b with ϕ(a, b) ≤ 1, and assume a ∈ c and b ∈ c′. By definition, c must be an active grid cell. Thus, (a, b) will be enumerated by this procedure, which guarantees the correctness of ε-approximate enumeration.

▶ Theorem 7. Let A, B be two sets of points in R^d for some constant d, with |A| + |B| = n. A data structure of O(n) space can be constructed in O(nε^{-d} log n) time and updated in O(ε^{-d} log n) time, while supporting ε-approximate enumeration of similarity join under any ℓp metric with O(ε^{-d} log n) delay.

Note that if, for each active cell c ∈ C∗, we store the cells within distance 1 that contain at least one point from B, i.e., {c′ ∈ C | ϕ(c, c′) ≤ 1, Bc′ ̸= ∅}, then the delay can be further reduced to O(1), but the space becomes O(ε^{-d} n).
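The grid structure and its enumeration step can be sketched as follows. This is an illustrative simplification, not the paper's implementation: a hash map stands in for the balanced search trees over cells, active-cell bookkeeping is omitted, and the enumeration scans all non-empty B-cells instead of only the O(ε^{-d}) cells within distance 1 — so only the output guarantee, not the delay bound, is preserved. All names are hypothetical.

```python
import math
from collections import defaultdict

class GridSimJoin:
    """Sketch of the grid structure for eps-approximate similarity join
    with fixed threshold r = 1 (l_2 distance)."""
    def __init__(self, eps, d):
        # side length eps / (2 * sqrt(d)), so each cell has diameter eps / 2
        self.side = eps / (2 * math.sqrt(d))
        self.A = defaultdict(list)   # cell -> points of A in that cell
        self.B = defaultdict(list)   # cell -> points of B in that cell

    def cell(self, p):
        return tuple(int(math.floor(x / self.side)) for x in p)

    def insert(self, p, which):
        (self.A if which == 'A' else self.B)[self.cell(p)].append(p)

    def _cell_dist(self, c1, c2):
        # exact minimum distance between the two (axis-aligned) cells
        gap = [max(0, abs(u - v) - 1) * self.side for u, v in zip(c1, c2)]
        return math.sqrt(sum(g * g for g in gap))

    def enumerate(self):
        # report A x B pairs whose cells are within distance 1
        for ca, pts_a in self.A.items():
            for cb, pts_b in self.B.items():
                if self._cell_dist(ca, cb) <= 1:
                    for a in pts_a:
                        for b in pts_b:
                            yield (a, b)

g = GridSimJoin(eps=0.5, d=2)
g.insert((0.0, 0.0), 'A')
g.insert((0.5, 0.5), 'B')   # distance ~0.707 <= 1: must be reported
g.insert((3.0, 3.0), 'B')   # distance well beyond 1 + eps: never reported
pairs = list(g.enumerate())
```

Any reported pair is at most cell distance 1 plus two cell diameters apart, i.e., at most 1 + ε, matching the ε-approximate guarantee.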

4 In High Dimensions

So far, we have treated the dimension d as a constant. In this section we describe a data structure for approximate similarity join using the locality sensitive hashing (LSH) technique, so that the dependency on d is a small polynomial. For simplicity, we assume that r is fixed; however, our results can be extended to the case in which r is part of the enumeration query. Due to lack of space, we only describe the high-level ideas of our data structure. The full description can be found in Appendix C.

For ε > 0 and 1 ≥ p1 > p2 > 0, a family H of hash functions is (r, (1 + ε)r, p1, p2)-sensitive if, for a uniformly chosen hash function h ∈ H and any two points x, y: (i) Pr[h(x) = h(y)] ≥ p1 if ϕ(x, y) ≤ r; and (ii) Pr[h(x) = h(y)] ≤ p2 if ϕ(x, y) ≥ (1 + ε)r. The quality of H is measured by ρ = ln p1 / ln p2 < 1, which is upper bounded by a quantity that depends only on ε; ρ = 1/(1 + ε) for many common distance functions [26, 21, 28]. For ℓ2 the best known bound is ρ ≤ 1/(1 + ε)² + o(1) [8]. The essence of LSH is to hash "similar" points into the same buckets with high probability.
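As a concrete instance of such a family, bit sampling over the Hamming cube [31] picks one random coordinate, so Pr[h(x) = h(y)] = 1 − ϕ(x, y)/d, giving p1 = 1 − r/d and p2 = 1 − (1 + ε)r/d. A minimal empirical check of this collision probability (illustrative names, not the paper's code):

```python
import random

def make_bit_sampling_hash(d, rng):
    """One hash from the bit-sampling family for Hamming space:
    h(x) = x[i] for a random coordinate i, hence for any x, y
    Pr[h(x) = h(y)] = 1 - hamming(x, y) / d."""
    i = rng.randrange(d)
    return lambda x: x[i]

def estimate_collision_prob(x, y, trials=20000, seed=7):
    rng = random.Random(seed)
    d = len(x)
    hits = sum(1 for _ in range(trials)
               if (h := make_bit_sampling_hash(d, rng))(x) == h(y))
    return hits / trials

x = (0, 0, 0, 0, 0, 0, 0, 0)
y = (1, 1, 0, 0, 0, 0, 0, 0)   # Hamming distance 2, so p should be near 1 - 2/8
p = estimate_collision_prob(x, y)
```

Concatenating several such hashes and keeping multiple tables is what drives p2 down while keeping p1 usable, which is where the exponent ρ = ln p1 / ln p2 enters the bounds.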

A simple approach based on LSH is to (i) hash points into buckets; (ii) probe each bucket and check, for each pair of points (a, b) ∈ A × B inside the same bucket, whether ϕ(a, b) ≤ r; and (iii) report (a, b) if the inequality holds. However, two challenges arise for enumeration. First, without any knowledge of the false positive results inside each bucket, checking every pair of points could lead to a huge delay. Our key insight is that after checking a specific number of pairs of points in one bucket (this number will be determined later), we can safely skip the bucket, since any result pair missed in this bucket will be found in another one with high probability. Second, one pair of points may collide under multiple hash functions, so an additional step is necessary in the enumeration to remove duplicates. If we wish to keep the size of the data structure near-linear and we are not allowed to store the reported pairs (so that the size remains near-linear), detecting duplicates requires some care.

As a warm-up exercise to gain intuition, in Appendix C.1 we present a relatively easy special case in which the input points, as well as the points inserted, are chosen uniformly from the universal domain. In the following, we focus on the general case without any assumption on the input distribution. Our data structure and algorithm use a parameter M, whose value will be determined later. Since we do not define new hash functions, our results hold for any metric for which LSH works, in particular the Hamming, ℓ2, and ℓ1 metrics.

Data Structure. We adopt an LSH family H with quality parameter ρ and randomly choose τ = O(n^ρ) hash functions. Let C be the set of buckets, each corresponding to one possible value in the range of the hash functions. We maintain some extra statistics for the buckets in C. For a bucket □, let A□ = A ∩ □ and B□ = B ∩ □. We choose two arbitrary subsets Ā□, B̄□ of A□, B□, respectively, of M points each (the value of M will be fixed later). For each point a ∈ Ā□, we maintain a counter c_a = |{b ∈ B̄□ | ϕ(a, b) ≤ 2(1 + ε)r}|, i.e., the number of points in B̄□ within distance 2(1 + ε)r from a. We store Ā□ as a list sorted by these counters. If there exists a ∈ Ā□ with c_a > 0, we call □ active and store an arbitrary pair (a, b) ∈ Ā□ × B̄□ with ϕ(a, b) ≤ 2(1 + ε)r as its representative pair. Let C∗ ⊆ C denote the set of active buckets. To ensure our high-probability guarantee, we maintain O(log n) copies of this data structure.

Update. When a point is inserted, say a ∈ A, we visit every bucket □ into which a is hashed and insert a into A□. If |Ā□| < M, we always insert it into Ā□ as well. If □ ∉ C∗, we remove an arbitrary point from Ā□, insert a into Ā□, and compute the counter c_a by visiting all points in B̄□. If c_a > 0, we add □ to C∗ and store the corresponding pair as its representative. When a point is deleted, say a ∈ A, we visit every bucket □ into which a is hashed and delete a from A□. If a ∈ Ā□, we delete it there and insert an arbitrary point (if any) from A□ \ Ā□ into Ā□. If a participates in the representative pair of □, we find a new representative pair by considering an arbitrary point a′ ∈ Ā□ with c_{a′} > 0. If no such point exists, we remove □ from C∗. The update for a point b ∈ B is similar. After n/2 updates, we reconstruct the entire data structure from scratch.

Enumeration. The high-level idea is to enumerate the representative pair of each active bucket and then recompute a new representative pair for it. Assume a representative pair (a, b) is found in bucket □ ∈ C∗. Next, we enumerate all pairs that involve the point a. Let C∗(a) ⊆ C∗ be the set of active buckets containing a. We visit every bucket □ ∈ C∗(a) and check the distances between a and the points in B□ that are not marked with X(□, a) (we describe when a point is marked in the next paragraph). Each time a pair (a, b) with ϕ(a, b) ≤ 2(1 + ε)r is found, we report it and invoke the de-duplication step on (a, b) (details below). When all points in B□, or more than M points from B□, have been checked without finding a pair with distance at most 2(1 + ε)r, we stop checking, remove² a from A□, remove □ from C∗(a), and skip this bucket. We also update Ā□ accordingly, so that the next representative pair of □ (if any) does not involve a in the current enumeration phase. Once all buckets in C∗(a) have been visited, we pick an arbitrary active bucket in C∗ with representative pair (a′, b′), a′ ̸= a, and start the enumeration for a′.

Finally, we avoid reporting a pair more than once as follows. Once a pair (a, b) is enumerated, we go over each bucket □ into which both a and b are hashed, and mark b with X(□, a) to avoid repeated enumeration. Moreover, if (a, b) is the representative pair of □, we check whether there exists b′ ∈ B□ such that ϕ(a, b′) ≤ 2(1 + ε)r. If such a pair exists, we store it as the new representative pair (with respect to a) for □. Otherwise, we remove a from A□, remove □ from C∗(a), and update Ā□ accordingly.
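The paper's de-duplication marks points with X(□, a). A common alternative that achieves the same "report each pair once without storing the output" effect is to charge every pair to the first hash table in which its two points collide; rechecking the earlier tables is cheap and needs no global seen-set. The sketch below uses that substitute technique with illustrative names; it is not the paper's algorithm.

```python
import random

def lsh_join_once(A, B, hashes, close):
    """Report each close colliding pair exactly once: a pair found in
    table t is emitted only if t is the FIRST table where a, b collide."""
    tables = []
    for h in hashes:
        buckets = {}
        for b in B:
            buckets.setdefault(h(b), []).append(b)
        tables.append(buckets)
    for a in A:
        for t, (h, buckets) in enumerate(zip(hashes, tables)):
            for b in buckets.get(h(a), []):
                if close(a, b) and all(hashes[s](a) != hashes[s](b)
                                       for s in range(t)):
                    yield (a, b)

def to_bits(v, d):
    return tuple((v >> i) & 1 for i in range(d))

rng = random.Random(0)
d = 16
pts = [to_bits(v, d) for v in rng.sample(range(1 << d), 16)]  # distinct points
A, B = pts[:8], pts[8:]
idx = [rng.sample(range(d), 4) for _ in range(6)]             # 6 tables
hashes = [lambda x, I=I: tuple(x[i] for i in I) for I in idx] # 4 sampled bits each
close = lambda a, b: sum(u != v for u, v in zip(a, b)) <= 6   # Hamming <= 6
out = list(lsh_join_once(A, B, hashes, close))
```

The rechecking cost per reported pair is O(τ) hash evaluations, which is why the paper's marking scheme, with its finer accounting, is what the stated delay bounds rely on.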

Correctness analysis. The de-duplication procedure guarantees that each pair of points is enumerated at most once. It remains to show that (1 + 2ε)-approximate enumeration is supported. To prove it, we first point out some important properties of our data structures. Given a set P of points and a distance threshold r, let B̄(q, P, r) = {p ∈ P | ϕ(p, q) > r}. For any pair of points (a, b) ∈ A × B and a hashing bucket □, we refer to □ as the proxy bucket for (a, b) if (i) a ∈ A□ and b ∈ B□; and (ii) |B̄(a, A□ ∪ B□, (1 + ε)r)| ≤ M. The following property of proxy buckets is crucial for our analysis:

² In the enumeration, "remove" means "conceptually mark" instead of changing the data structure itself.

▶ Lemma 8. For any bucket □, if there exist M points from A□ and M points from B□ such that none of the M² pairs between them has distance within 2(1 + ε)r, then □ is not a proxy bucket for any pair (a, b) ∈ A□ × B□ with ϕ(a, b) ≤ r.

Proof. Let A′, B′ be two sets of M points from A□, B□, respectively, such that all pairs of points in A′ × B′ have distance larger than 2(1 + ε)r. Observe that □ is not a proxy bucket for any pair (a ∈ A′, b ∈ B′). It remains to show the claim for a pair (a ∈ A□ \ A′, b ∈ B□) with ϕ(a, b) ≤ r. Assume b ∈ B□ \ B′ (the case b ∈ B′ is similar). If A′ ⊆ B̄(a, A□, (1 + ε)r) or B′ ⊆ B̄(a, B□, (1 + ε)r), then □ is not a proxy bucket for (a, b). Otherwise, there must exist a′ ∈ A′ and b′ ∈ B′ such that ϕ(a, a′) ≤ (1 + ε)r and ϕ(a, b′) ≤ (1 + ε)r, so ϕ(a′, b′) ≤ ϕ(a, a′) + ϕ(a, b′) ≤ 2(1 + ε)r. Thus, (a′, b′) ∈ A′ × B′ is a pair within distance 2(1 + ε)r, a contradiction. ◀

We show that (1 + 2ε)-approximate enumeration is supported with probability 1 − 1/n. It is easy to check that no pair of points farther than 2(1 + ε)r is ever enumerated. Hence, it suffices to show that all pairs within distance r are enumerated with high probability. From [26, 27, 31], for M = O(n^ρ), any pair (a, b) with ϕ(a, b) ≤ r has a proxy bucket with probability 1 − 1/n. Let □ be a proxy bucket for the pair (a, b). By Lemma 8, there exist no M points from A□ (for example Ā□) and M points from B□ (for example B̄□) such that all M² pairs have distance larger than 2(1 + ε)r, so □ is active. Moreover, there exist no M points from B□ all of which have distance at least 2(1 + ε)r from a, so □ is an active bucket for a. Hence, our enumeration algorithm reports (a, b).

Complexity analysis. Recall that τ = O(n^ρ) and M = O(n^ρ). The data structure uses O(dn + nτ log n) space, since we use only linear space with respect to the points in each bucket. The update time is Õ(dM · τ): there are Õ(τ) buckets to be investigated, and it takes Õ(dM) time to update the representative pair. After n/2 updates we rebuild the data structure, so the update time is amortized. The delay is Õ(dM · τ): considering the enumeration for a point a, it takes Õ(dM · τ) time to check all buckets and Õ(dM · τ) time for de-duplication. We conclude with the following result:

▶ Theorem 9. Let A and B be two sets of points in R^d, where |A| + |B| = n, and let ε, r be positive parameters. For ρ = 1/(1 + ε)² + o(1), a data structure of Õ(dn + n^{1+ρ}) size can be constructed in Õ(dn^{1+2ρ}) time and updated in Õ(dn^{2ρ}) amortized time, while supporting (1 + 2ε)-approximate enumeration for similarity join under the ℓ2 metric with Õ(dn^{2ρ}) delay.

The same result holds for the Hamming and ℓ1 metrics with ρ = 1/(1 + ε). Using [31], for the Hamming metric and ε > 1 we can get M = O(1). Skipping the details, we have:

▶ Theorem 10. Let A and B be two sets of points in H^d, where |A| + |B| = n, and let ε, r be positive parameters. For ρ = 1/(1 + ε), a data structure of Õ(dn + n^{1+ρ}) size can be built in Õ(dn^{1+ρ}) time and updated in Õ(dn^ρ) amortized time, while supporting (3 + 2ε)-approximate enumeration for similarity join under the Hamming metric with Õ(dn^ρ) delay.

In Appendix C, we give the full description of the algorithms and proofs. We also show that our results can be extended to the case where r is part of the enumeration query. Finally, we show a lower bound relating similarity join to the approximate nearest neighbor query.

References

1 P. Afshani. Improved pointer machine and I/O lower bounds for simplex range reporting and related problems. In SoCG, pages 339–346, 2012.
2 P. K. Agarwal. Simplex range searching and its variants: A review. In A Journey Through Discrete Mathematics, pages 1–30. 2017.
3 P. K. Agarwal and J. Erickson. Geometric range searching and its relatives. Contemporary Mathematics, 223:1–56, 1999.
4 P. K. Agarwal, J. Xie, J. Yang, and H. Yu. Monitoring continuous band-join queries over dynamic data. In ISAAC, pages 349–359, 2005.
5 P. K. Agarwal, J. Xie, J. Yang, and H. Yu. Scalable continuous query processing by tracking hotspots. In VLDB, pages 31–42, 2006.
6 D. Aiger, H. Kaplan, and M. Sharir. Reporting neighbors in high-dimensional Euclidean space. SICOMP, 43(4):1363–1395, 2014.
7 A. Al-Badarneh, A. Al-Abdi, M. Sana'a, and H. Najadat. Survey of similarity join algorithms based on MapReduce. 2016.
8 A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS, pages 459–468, 2006.
9 N. Augsten and M. Böhlen. Similarity joins in relational database systems. Synthesis Lectures on Data Management, 5(5):1–124, 2013.
10 G. Bagan, A. Durand, and E. Grandjean. On acyclic conjunctive queries and constant delay enumeration. In CSL, pages 208–222, 2007.
11 J. Bentley. Decomposable searching problems. Technical report, CMU, 1978.
12 J. Bentley and J. Friedman. Data structures for range searching. CSUR, 11(4):397–409, 1979.
13 C. Berkholz, J. Keppeler, and N. Schweikardt. Answering conjunctive queries under updates. In PODS, pages 303–318, 2017.
14 P. Callahan. Dealing with Higher Dimensions: The Well-Separated Pair Decomposition and Its Applications. PhD thesis, 1995.
15 N. Carmeli and M. Kröll. Enumeration complexity of conjunctive queries with functional dependencies. TOCS, pages 1–33, 2019.
16 T. Chan. Optimal partition trees. DCG, 47(4):661–690, 2012.
17 S. Chandrasekaran and M. Franklin. Streaming queries over streaming data. In VLDB, pages 203–214, 2002.
18 S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, pages 5–5, 2006.
19 B. Chazelle and B. Rosenberg. Simplex range reporting on a pointer machine. Computational Geometry, 5(5):237–247, 1996.
20 J. Chen, D. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for internet databases. In SIGMOD, pages 379–390, 2000.
21 M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SoCG, pages 253–262, 2004.
22 M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational Geometry: Algorithms and Applications. Springer, 1997.
23 J. Erickson. Static-to-dynamic transformations. http://jeffe.cs.illinois.edu/teaching/datastructures/notes/01-statictodynamic.pdf.
24 F. Fabret, H. Jacobsen, F. Llirbat, J. Pereira, K. Ross, and D. Shasha. Filtering algorithms and implementation for very fast publish/subscribe systems. In SIGMOD, pages 115–126, 2001.
25 J. Fischer and S. Har-Peled. Dynamic well-separated pair decomposition made easy. In CCCG, volume 5, pages 235–238, 2005.
26 A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, volume 99, pages 518–529, 1999.
27 S. Har-Peled. Geometric Approximation Algorithms. Number 173. AMS, 2011.
28 S. Har-Peled, P. Indyk, and R. Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. ToC, 8(1):321–350, 2012.
29 S. Har-Peled and M. Mendel. Fast construction of nets in low-dimensional metrics and their applications. SICOMP, 35(5):1148–1184, 2006.
30 M. Idris, M. Ugarte, and S. Vansummeren. The dynamic Yannakakis algorithm: Compact and efficient query processing under updates. In SIGMOD, pages 1259–1274, 2017.
31 P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In STOC, pages 604–613, 1998.
32 E. H. Jacox and H. Samet. Metric space similarity joins. TODS, 33(2):1–38, 2008.
33 A. Kara, H. Ngo, M. Nikolic, D. Olteanu, and H. Zhang. Counting triangles under updates in worst-case optimal time. In ICDT, 2019.
34 A. Kara, M. Nikolic, D. Olteanu, and H. Zhang. Trade-offs in static and dynamic evaluation of hierarchical queries. In PODS, pages 375–392, 2020.
35 H. Lenhof and M. Smid. Sequential and parallel algorithms for the k closest pairs problem. IJCGA, 5(3):273–288, 1995.
36 J. Matoušek. Efficient partition trees. DCG, 8(3):315–334, 1992.
37 M. Overmars. The Design of Dynamic Data Structures, volume 156. SSBM, 1987.
38 M. Overmars and J. van Leeuwen. Worst-case optimal insertion and deletion methods for decomposable searching problems. Information Processing Letters, 12(4):168–173, 1981.
39 R. Paredes and N. Reyes. Solving similarity joins and range queries in metric spaces with the list of twin clusters. SIDMA, 7(1):18–35, 2009.
40 H. Samet. Spatial data structures: Quadtrees, octrees, and other hierarchical methods, 1989.
41 L. Segoufin. Enumerating with constant delay the answers to a query. In ICDT, pages 10–20, 2013.
42 Y. Silva, W. Aref, and M. Ali. The similarity join database operator. In ICDE, pages 892–903, 2010.
43 Y. N. Silva, J. Reed, K. Brown, A. Wadsworth, and C. Rong. An experimental survey of MapReduce-based similarity joins. In SISAP, pages 181–195, 2016.
44 J. Wang, G. Li, and J. Feng. Can we beat the prefix filtering? An adaptive framework for similarity join and search. In SIGMOD, pages 85–96, 2012.
45 Q. Wang and K. Yi. Maintaining acyclic foreign-key joins under updates. In SIGMOD, pages 1225–1239, 2020.
46 D. Willard. Polygon retrieval. SICOMP, 11(1):149–165, 1982.
47 D. Willard. Applications of range query theory to relational data base join and selection operations. JCSS, 52(1):157–169, 1996.
48 K. Wu, S. Chen, and P. Yu. Interval query indexing for efficient stream processing. In CIKM, pages 88–97, 2004.

A Appendix

A.1 Reduction from similarity join under the ℓ1 metric to the ℓ∞ metric

For d ≤ 2 it is straightforward to reduce similarity join under the ℓ1 metric to the ℓ∞ metric. For d = 1, the ℓ1 metric is identical to the ℓ∞ metric. For d = 2, notice that the ℓ1 ball is a diamond, while the ℓ∞ ball is a square. Hence, given an instance of similarity join under the ℓ1 metric, we can rotate A ∪ B by 45 degrees to create an equivalent instance of the similarity join problem under the ℓ∞ metric.

Next, we focus on d ≥ 3. The data structure we proposed in Section 2.1 for the ℓ∞ norm can be straightforwardly extended to the case where, for each b ∈ B, B(b) is an arbitrary hyper-rectangle with center b. In that case we aim to report all (a, b) ∈ A × B such that a ∈ B(b). Lemma 1 can be extended so that R(u) is a 2d-dimensional rectangle. Overall, Theorem 2 remains the same assuming the B(b) are hyper-rectangles (and not necessarily hypercubes). Given an instance of similarity join under the ℓ1 metric in R^d, we next show how to reduce it to an instance of similarity join under the ℓ∞ metric in R^{d+1} (with arbitrary hyper-rectangles).

Figure 4 An illustration of the ℓ1 ball in R^3. It is decomposed into 2^3 = 8 types of simplices.

In the ℓ1 metric, b can be mapped to an ℓ1 ball of radius r, i.e., the ball containing all points x ∈ R^d such that ∑_{j=1}^d |bj − xj| ≤ r (see Figure 4), where bj, xj are the j-th coordinates of the points b, x, respectively. Hence, we need to report a pair (a, b) ∈ A × B if and only if ∑_{j=1}^d |bj − aj| ≤ r. Let E be the set of all vectors in R^d with coordinates either 1 or −1; |E| = 2^d. For each vector ei ∈ E, we construct an instance of our problem. We map each point a = (a1, ..., ad) ∈ A to the point āi = (a1, ..., ad, ∑_{j=1}^d eij aj) ∈ R^{d+1}, and let Āi = {āi | a ∈ A}. In addition, for each point b = (b1, ..., bd) ∈ B, we construct the axis-aligned rectangle b̄i in R^{d+1} defined as the Cartesian product of the intervals [bj, ∞) if eij = 1 and (−∞, bj] if eij = −1, for each j = 1, ..., d, together with the interval (−∞, r + ∑_{j=1}^d eij bj] in the last dimension. Let B̄i = {b̄i | b ∈ B}. For each ei ∈ E, we construct the data structure for the ℓ∞ metric by taking Āi as the set of points and B̄i as the set of rectangles (the range tree data structure we use works even for unbounded rectangles). Equivalently, if b is the ℓ1 ball of radius r, it can be decomposed into 2^d simplices (Figure 4); each vector ei ∈ E corresponds to one type of simplex of these balls. Each type of simplex has a fixed orientation and can be processed independently by orthogonal range searching in R^{d+1}.

Let (a, b) ∈ A × B be an arbitrary pair of points such that ∥a − b∥1 ≤ r. We show that there is a unique vector ei ∈ E such that āi ∈ b̄i. Let ei be the vector such that ∥a − b∥1 = ∑_{j=1}^d eij(aj − bj). The first d coordinates of āi lie inside the first d intervals defining b̄i: (1) if eij = 1, then aj ≥ bj, i.e., aj ∈ [bj, +∞); (2) if eij = −1, then aj ≤ bj, i.e., aj ∈ (−∞, bj]. Observe that ∑_{j=1}^d eij(aj − bj) ≤ r can be equivalently rewritten as ∑_{j=1}^d eij aj ≤ r + ∑_{j=1}^d eij bj, i.e., ∑_{j=1}^d eij aj ∈ (−∞, r + ∑_{j=1}^d eij bj]. It is easy to see that for any other vector ei′ ̸= ei, at least one of the constraints above does not hold. With the same argument, we can show that if ∥a − b∥1 > r, then there is no vector ei ∈ E such that āi ∈ b̄i. Overall, we build O(2^d) = O(1) data structures for the reduced instance.
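The reduction can be checked mechanically: ∥a − b∥1 ≤ r holds exactly when the lifted point āi falls in the rectangle b̄i for exactly one sign vector ei. A self-contained sketch with illustrative names:

```python
import itertools
import random

def lift_point(a, e):
    # a -> (a_1, ..., a_d, sum_j e_j * a_j)
    return tuple(a) + (sum(ej * aj for ej, aj in zip(e, a)),)

def in_rect(a_bar, b, e, r):
    # is the lifted point inside the (unbounded) rectangle for b and sign vector e?
    d = len(b)
    for j in range(d):
        if e[j] == 1 and a_bar[j] < b[j]:
            return False
        if e[j] == -1 and a_bar[j] > b[j]:
            return False
    # last dimension: sum_j e_j a_j <= r + sum_j e_j b_j
    return a_bar[d] <= r + sum(ej * bj for ej, bj in zip(e, b))

d, r = 3, 1.0
signs = list(itertools.product([1, -1], repeat=d))   # the set E, |E| = 2^d
rng = random.Random(1)
for _ in range(200):
    a = [rng.uniform(-2, 2) for _ in range(d)]
    b = [rng.uniform(-2, 2) for _ in range(d)]
    hits = sum(in_rect(lift_point(a, e), b, e, r) for e in signs)
    l1 = sum(abs(x - y) for x, y in zip(a, b))
    # a pair joins under l1 iff its lifted point lies in exactly one rectangle
    assert hits == (1 if l1 <= r else 0)
```

The uniqueness (exactly one hit, never two) is what lets the 2^d instances be enumerated independently without de-duplication.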

A.2 Dynamic enumeration of the halfspace-containment problem

Reduction to halfspace containment. We use the lifting transformation [22] to convert an instance of the similarity join problem under the ℓ2 metric to the halfspace-containment problem in R^{d+1}. The join condition ∥a − b∥2 ≤ r can be rewritten as

a1² + b1² + ··· + ad² + bd² − 2a1b1 − ··· − 2adbd − r² ≤ 0.

We map the point a to the point a′ = (a1, ..., ad, a1² + ··· + ad²) in R^{d+1} and the point b to the halfspace b′ in R^{d+1} defined by

−2b1z1 − ··· − 2bdzd + zd+1 + b1² + ··· + bd² − r² ≤ 0.

Note that a and b join if and only if a′ ∈ b′. Thus, in the following, we study the halfspace-containment problem. Set A′ = {a′ | a ∈ A} and B′ = {b′ | b ∈ B}.
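A quick sanity check of the lifting: a joins b exactly when the lifted point a′ satisfies the halfspace inequality of b′. We write the condition with "≤ 0", matching the convention that b′ lies below its bounding hyperplane (our reading of the display above); names are illustrative.

```python
def lift(a):
    # a -> a' = (a_1, ..., a_d, a_1^2 + ... + a_d^2)
    return tuple(a) + (sum(x * x for x in a),)

def in_halfspace(z, b, r):
    # halfspace b': -2*sum_j b_j z_j + z_{d+1} + ||b||^2 - r^2 <= 0
    d = len(b)
    return (-2 * sum(bj * zj for bj, zj in zip(b, z[:d]))
            + z[d] + sum(x * x for x in b) - r * r) <= 0

a, r = (1.0, 2.0), 1.0
assert in_halfspace(lift(a), (1.5, 2.0), r)       # ||a - b|| = 0.5 <= r
assert not in_halfspace(lift(a), (4.0, 2.0), r)   # ||a - b|| = 3.0 > r
```

Substituting z_{d+1} = ∑ a_j² turns the inequality into ∥a − b∥² ≤ r² term by term, which is exactly the join condition.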

Partition tree. A partition tree on a set P of points in R^d [16, 36, 46] is a tree data structure formed by recursively partitioning the set into subsets. Each point is stored in exactly one leaf, and each leaf usually contains a constant number of points. Each node u of the tree is associated with a simplex ∆u and the subset Pu = P ∩ ∆u; the subtree rooted at u is a partition tree of Pu. We assume that the simplices associated with the children of a node u are pairwise disjoint and lie inside ∆u, as in [16]. In general, the degree of a node is allowed to be non-constant. Given a query simplex ∆, a partition tree finds a set of O(n^{1−1/d}) canonical nodes whose cells contain the points of P ∩ ∆. Roughly speaking, a node u is a canonical node for ∆ if ∆u ⊂ ∆ and ∆p(u) ̸⊆ ∆. A simplex counting (resp. reporting) query can be answered in O(n^{1−1/d}) (resp. O(n^{1−1/d} + k)) time using a partition tree. Chan [16] proposed a randomized algorithm for constructing a linear-size partition tree with constant degree that runs in O(n log n) time and has O(n^{1−1/d}) query time with high probability.

Data structure. For simplicity, with a slight abuse of notation, let A be a set of points in R^d and B a set of halfspaces in R^d, each lying below the hyperplane bounding it; our goal is to build a dynamic data structure for the halfspace-containment join on A, B. The overall structure of the data structure is the same as for rectangle containment described in Section 2.1, so we only highlight the differences.

Instead of constructing a range tree, we construct a dynamic partition tree TA for A so that the points of A lying in a halfspace can be represented as the union of O(n^{1−1/d}) canonical subsets. For the hyperplane bounding a halfspace b ∈ B, let b̄ denote its dual point in R^d (see [22] for the definition of the duality transform). Note that a point a lies in b if and only if the dual point b̄ lies in the halfspace below the hyperplane dual to a. Set B̄ = {b̄ | b ∈ B}. We construct a multi-level dynamic partition tree TB on B̄ so that, for a pair of simplices ∆1 and ∆2, it returns the number of halfspaces of B that satisfy the following two conditions: (i) ∆1 ⊆ b and (ii) ∆2 ∩ ∂b ̸= ∅, where ∂b is the hyperplane bounding the halfspace b. This data structure uses O(n) space, can be constructed in Õ(n) time, and answers a query in Õ(n^{1−1/d}) time.

For each node u ∈ TA, we issue a counting query to TB and get the number of halfspaces in B that have u as a canonical node. Hence, TA can be built in Õ(n^{2−1/d}) time. For a node u, µA(u) can be computed in O(1) time by storing Au at each node u ∈ TA. Recall that µB(u) is the number of halfspaces b of B for which u is a canonical node, i.e., ∆u ⊆ b and ∆p(u) ∩ ∂b ̸= ∅, where p(u) is the parent of u. Using TB, µB(u) can be computed in Õ(n^{1−1/d}) time.

Update and enumeration. The update procedure is the same as in Section 2.1; however, the query time on TA or TB is now Õ(n^{1−1/d}), so the amortized update time is Õ(n^{1−1/d}). The enumeration query is also the same as in Section 2.1, but a reporting query in TB takes Õ(n^{1−1/d} + k) time (with delay at most Õ(n^{1−1/d})), so the overall delay is Õ(n^{1−1/d}).

▶ Theorem 11. Let A be a set of points and B a set of halfspaces in R^d with |A| + |B| = n. A data structure of Õ(n) size can be built in Õ(n^{2−1/d}) time and updated in Õ(n^{1−1/d}) amortized time, while supporting Õ(n^{1−1/d})-delay enumeration of the halfspace-containment query.

Using Theorem 11 and the lifting transformation described at the beginning of this section, we conclude with Corollary 4.

B Triangle Similarity Join

In this section we propose data structures for triangle join queries. Our results can be extended to m-clique join queries for constant m; for simplicity, we describe the results for the triangle join, m = 3. Let A, B, S ⊂ R^d be three sets of points such that |A| + |B| + |S| = n.

B.1 Exact Enumeration

We can easily extend the results from Section 2 to handle triangle join queries. Assume the ℓ∞ metric. The high-level idea is to construct two levels of our data structures for pairwise similarity join queries. In particular, in the first level we consider two sets, A′ = A ∪ S and B. As in Section 2, we construct a dynamic range tree TA′ on A′ and a dynamic range tree TB on B. For each level-d node u of TA′ we store and maintain the counter βu in the same way as in Section 2. However, instead of keeping Au, we keep a pointer pu to a pairwise similarity join data structure between Au and Su, where Au is the subset of A in the subtree of TA′ rooted at u, and Su is the subset of S in the subtree of TA′ rooted at u. Hence, we construct a dynamic range tree T_A^u and a dynamic range tree T_S^u. For each level-d node v in T_A^u we store and maintain βv, the number of points of Su that have v as a canonical node of T_A^u, and Av, the points of Au stored in the subtree of T_A^u rooted at v. Let Cu be the set of active nodes in T_A^u. A level-d node u of TA′ is active, i.e., in C, if βu > 0 and there exists at least one pair of Au, Su within distance 1; in other words, βu > 0 and Cu ̸= ∅.

When we insert a point a ∈ A, we need O(log^{2d} n) time to insert it in TA′ and all trees T_A^u. Also, for each new level-d node v of T_A^u, we run a range query on T_S^u to check whether v is in Cu. The other cases are handled similarly. Hence, each update takes O(log^{3d} n) amortized time. The delay guarantee remains O(log^d n), since we only need to report points from different level-d range trees. Overall, we get the following result.

▶ Theorem 12. Let A, B, S be three sets of points in R^d, where d ≥ 1 is a constant, with |A| + |B| + |S| = n. A data structure of Õ(n) size can be built in Õ(n) time and updated in Õ(1) amortized time, while supporting Õ(1)-delay enumeration of triangle similarity join under the ℓ∞ metric.

706 Similarly the results can be extended to ℓ1 and ℓ2 metrics.

d 707 ▶ Theorem 13. Let A, B, S be two sets of points in R , where d ≥ 1 is a constant, with 708 |A| + |B| + |S| = n. A data structure of Oe(n) size can be built in Oe(n) time and updated :18 Dynamic Enumeration of Similarity Joins

709 in Oe(1) amortized time, while supporting Oe(1)-delay enumeration of triangle similarity join 710 under ℓ1 metric.

▶ Theorem 14. Let A, B, S be three sets of points in R^d, where d ≥ 1 is a constant, with |A| + |B| + |S| = n. A data structure of Oe(n) size can be constructed in Oe(n^{2−1/(d+1)}) time and updated in Oe(n^{1−1/(d+1)}) amortized time, while supporting Oe(n^{1−1/(d+1)})-delay enumeration of triangle similarity join under the ℓ2 metric.

B.2 Approximate Enumeration

We first consider the case when the distance threshold r is fixed and then lift this assumption.

B.2.1 Fixed distance threshold

The data structure we construct works for any ℓp norm. For simplicity, we describe it for ℓ2 first and extend it to any ℓp metric at the end. In this subsection, we use ϕ(a, b) = ||a − b||_2. As in Section 3.2, let G be an infinite uniform grid in R^d where the side length of each grid cell is ε/(2√d), so its diameter is ε/2 (using the ℓ2 distance). For each grid cell c ∈ G we store A_c = A ∩ c, B_c = B ∩ c, and S_c = S ∩ c. Furthermore, we store a counter m_c = |{(b, s) ∈ B × S | ∃c1, c2 ∈ G s.t. b ∈ c1, s ∈ c2, ϕ(c1, c2) ≤ 1, ϕ(c1, c) ≤ 1, ϕ(c2, c) ≤ 1}|, i.e., the number of pairs (b, s) whose cells, along with cell c, are pairwise within distance 1. Let C be the set of non-empty cells, i.e., C = {c ∈ G | A_c ∪ B_c ∪ S_c ≠ ∅}. A grid cell c ∈ C is active if and only if A_c ≠ ∅ and m_c > 0. Let C∗ ⊆ C be the set of active grid cells. We construct a balanced search tree to answer efficiently whether a cell is already in C∗. Similarly, we create a balanced search tree for the cells in C. Our data structure has O(n) space.
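To make the cell bookkeeping concrete, here is a minimal Python sketch (our own illustration, not the paper's implementation): points are snapped to integer cell coordinates for side length ε/(2√d), and each non-empty cell stores the sets A_c, B_c, S_c, with empty cells dropped to keep O(n) space.

```python
import math
from collections import defaultdict

def cell_of(p, eps):
    """Integer coordinates of the grid cell containing point p."""
    d = len(p)
    side = eps / (2 * math.sqrt(d))  # cell diameter is then eps/2 under l2
    return tuple(math.floor(x / side) for x in p)

class GridIndex:
    """Stores A_c, B_c, S_c per non-empty cell; empty cells are removed."""
    def __init__(self, eps):
        self.eps = eps
        self.cells = defaultdict(lambda: {'A': set(), 'B': set(), 'S': set()})

    def insert(self, p, which):
        self.cells[cell_of(p, self.eps)][which].add(p)

    def delete(self, p, which):
        c = cell_of(p, self.eps)
        self.cells[c][which].discard(p)
        if not (self.cells[c]['A'] or self.cells[c]['B'] or self.cells[c]['S']):
            del self.cells[c]  # keep only non-empty cells: O(n) space overall
```

The counters m_c and the active set C∗ would be maintained on top of this; they are omitted here.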

We first describe the updates. Assume that we insert a point a ∈ A. If a lies in a cell c ∈ C then we insert a into A_c. If a is inserted into a cell that did not exist, we create c, add it to C, and set A_c = {a}. Then we need to compute the value m_c. The algorithm visits all existing cells of C around c within distance 1. Let c1 be such a cell with B_{c1} ≠ ∅ or S_{c1} ≠ ∅. We need to count all points in B and S that lie in cells within distance 1 from both c and c1. Notice that these cells lie inside a rectangle R: if R1 is a square of radius 1 around c and R2 is a square of radius 1 around c1, then R = R1 ∩ R2 is a rectangle. We visit all grid cells inside R and find the number of points from B, S in R. Let m_B = |B ∩ R| and m_S = |S ∩ R|. We update m_c to m_c + |B_{c1}| · m_S + |S_{c1}| · m_B. In the end it is easy to verify that m_c has the correct value. Next, assume that we remove a point a ∈ A. Let c be the cell of point a. We remove a from A_c, and if A_c = ∅ and c ∈ C∗ then we remove c from C∗. If A_c = B_c = S_c = ∅ we remove c from C. Since there are O(ε^{−d}) grid cells in a square of radius 1, we need O(ε^{−2d} log n) time to insert a and O(ε^{−d} log n) time to remove a.

Next, we describe updating a point b ∈ B (updating a point s ∈ S is similar). Assume that we add b ∈ B in a cell c (if c did not exist we create it) and insert it into B_c. The goal is to update all counters m_{c1} for cells c1 within distance 1 from c. We start by visiting all cells c1 ∈ C within distance 1 from c and update the value of m_{c1}. In particular, we need to count the number of points in S that lie in cells within distance 1 from both c and c1. This is similar to what we had for the insertion of a, so we can count it by visiting all grid cells within distance 1 from c and c1. Let m_S be the result. Then, we update m_{c1} ← m_{c1} + m_S. Finally, assume that we remove a point b ∈ B from a cell c. We remove b from B_c and, again, we visit all cells c1 within distance 1 and update their counters by m_{c1} ← m_{c1} − m_S (m_S is found as explained in the previous case). If c1 ∈ C∗ and m_{c1} = 0, we remove c1 from C∗. In the end, if A_c = B_c = S_c = ∅ we remove c from C. Again, it is easy to observe that the counters m_c have the correct values for all c ∈ C, and hence C∗ is the correct set of active cells. We need O(ε^{−2d} log n) time to insert or remove a point in B.
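As a toy illustration of the counter maintenance for inserting a point b ∈ B (our own brute-force sketch, not the paper's data structure: we scan all stored cells, whereas the paper scans only the O(ε^{−d}) grid cells inside the rectangle R):

```python
import math

def cell_dist(c1, c2, side):
    """Minimum l2 distance between two grid cells given by integer coords."""
    gaps = [max(abs(i - j) - 1, 0) * side for i, j in zip(c1, c2)]
    return math.hypot(*gaps)

def on_insert_b(cells, m, c, side):
    """Bump m_{c1} for every stored cell c1 within distance 1 of b's cell c
    by the number of S-points lying in cells within distance 1 of both
    c and c1. cells: {cell: {'S': set(...), ...}}, m: {cell: int}."""
    for c1 in cells:
        if cell_dist(c, c1, side) <= 1:
            mS = sum(len(cells[c2].get('S', ()))
                     for c2 in cells
                     if cell_dist(c2, c, side) <= 1
                     and cell_dist(c2, c1, side) <= 1)
            m[c1] = m.get(c1, 0) + mS
```

Deletion of b would subtract the same quantity, and a cell is removed from the active set once its counter reaches 0.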

Next, we describe the enumeration procedure. For each c ∈ C∗ we consider every a ∈ A_c. We visit each cell c1 ∈ C around c within distance 1. Then we visit each cell c2 ∈ C within distance 1 from both c1 and c. We report (if any) the triads in {a} × B_{c1} × S_{c2} and {a} × S_{c1} × B_{c2}. We now show the correctness of our method. Let (a ∈ A, b ∈ B, s ∈ S) be a triad within distance 1, with a ∈ c1, b ∈ c2, s ∈ c3 for c1, c2, c3 ∈ C. Notice that ϕ(c1, c2), ϕ(c1, c3), ϕ(c2, c3) ≤ 1. From the update procedure we have that m_{c1} > 0, hence c1 ∈ C∗. The algorithm will visit c1, and it will also consider c2 since ϕ(c1, c2) ≤ 1. Then it will also consider c3 since ϕ(c1, c3) ≤ 1 and ϕ(c2, c3) ≤ 1. Hence our enumeration procedure will return the triad (a, b, s). Furthermore, it is straightforward to see that i) our enumeration algorithm will never report a triad (a, b, s) in which some pairwise distance is greater than 1 + ε, and ii) whenever c ∈ C∗ there will always be a triad (a ∈ A_c, b ∈ B, s ∈ S) to report. Finally, since our enumeration algorithm reports points that lie in cells within pairwise distance 1, it might return a triad (a, b, s) such that ϕ(a, b) ≤ ϕ(c1, c2) + diam(c1) + diam(c2) ≤ 1 + ε, ϕ(a, s) ≤ 1 + ε, and ϕ(b, s) ≤ 1 + ε. The delay is O(ε^{−2d} log n).

The same result can be extended to any ℓp norm by considering grid cells of side length ε/(2d^{1/p}).
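The enumeration order above can be sketched as follows (a self-contained brute-force toy, names ours; the actual structure iterates only over active cells via the balanced search trees):

```python
import math

def cell_dist(c1, c2, side):
    """Minimum l2 distance between two grid cells given by integer coords."""
    gaps = [max(abs(i - j) - 1, 0) * side for i, j in zip(c1, c2)]
    return math.hypot(*gaps)

def enumerate_triads(cells, side):
    """Yield candidate triads (a, b, s): a from a cell c, b from a cell c1
    within distance 1 of c, s from a cell c2 within distance 1 of both.
    By the diameter argument, emitted triads have pairwise distance
    at most 1 + eps."""
    for c, kc in cells.items():
        for a in kc['A']:
            for c1, k1 in cells.items():
                if cell_dist(c, c1, side) > 1:
                    continue
                for c2, k2 in cells.items():
                    if cell_dist(c, c2, side) > 1 or cell_dist(c1, c2, side) > 1:
                        continue
                    for b in k1['B']:
                        for s in k2['S']:
                            yield (a, b, s)
```

The symmetric report of {a} × S_{c1} × B_{c2} is covered because c1 and c2 both range over all neighboring cells.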

▶ Theorem 15. Let A, B, S be three sets of points in R^d, with |A| + |B| + |S| = n. A data structure of O(n) space can be constructed in O(nε^{−2d} log n) time and updated in O(ε^{−2d} log n) time, while supporting ε-approximate enumeration of triangle similarity join queries under any ℓp metric with O(ε^{−2d} log n) delay.

For ℓ1, ℓ∞, we can slightly improve the result using a data structure that finds m_B, m_S more efficiently. Skipping the details, we can obtain a data structure of O(n log^{d−1} n) space that can be built in O(n log^{d−1} n + n · min{ε^{−d} log^{d−1} n, ε^{−2d}} log n) time and updated in O(min{ε^{−d} log^{d−1} n, ε^{−2d}} log n) time, while supporting ε-approximate enumeration of triangle similarity join under the ℓ1/ℓ∞ metrics with O(min{ε^{−d} log^{d−1} n, ε^{−2d}} log n) delay.

B.2.2 Variable distance threshold

We describe two data structures for this case. One is based on a grid, using O(ε^{−1} n log n) space, and the other on a WSPD, using O(ε^{−2d} n) space.

Grid-based data structure. Assume that the spread sp(A ∪ B ∪ S) = n^{O(1)} and that all points lie in a box with diagonal length R. The high-level idea is to build multiple grids as described in Appendix B.2.1. Recall that for each cell c ∈ C we need to store A_c, B_c, S_c and the counter m_c. However, the definition of m_c depends on the threshold r, which is not known upfront in this case. Hence we consider multiple thresholds r_i. In particular, for each i ∈ [0, log_{1+ε/4} sp(A ∪ B ∪ S)] we construct a grid for r_i = R/sp(A ∪ B ∪ S) · (1 + ε/4)^i as in Appendix B.2.1.³ Hence for each i we maintain the counter m_c^i defined as m_c^i = |{(b, s) ∈ B × S | ∃c1, c2 ∈ G s.t. b ∈ c1, s ∈ c2, ϕ(c1, c2) ≤ r_i, ϕ(c1, c) ≤ r_i, ϕ(c2, c) ≤ r_i}|, and the set of active cells C_i∗. Notice that there are O(ε^{−1} log n) different values of i. For a point insertion or deletion, the algorithm updates all necessary counters m_c^i and active cells C_i∗ for all i. For an enumeration query, let r be the query threshold. Notice that R/sp(A ∪ B ∪ S) ≤ r ≤ R, otherwise the result is trivial. Running a binary search on the values of i, we find the smallest i such that r ≤ r_i. Then, using only the active cells C_i∗ and the counters m_c^i, we enumerate all triangles within distance r_i. The delay guarantee is the same as in Appendix B.2.1, O(ε^{−2d} log n). We conclude with the next theorem.

³ For each i we scale everything so that r_i = 1, as we did in Appendix B.2.1.
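The geometric ladder of thresholds and the binary search over it can be sketched in a few lines (our own illustration; the grids attached to each rung are omitted):

```python
import bisect

def build_ladder(r_min, r_max, eps):
    """Thresholds r_i = r_min * (1 + eps/4)**i covering [r_min, r_max];
    there are O((1/eps) * log(r_max / r_min)) of them."""
    ladder, r = [], r_min
    while r < r_max * (1 + eps / 4):
        ladder.append(r)
        r *= 1 + eps / 4
    return ladder

def pick_threshold(ladder, r):
    """Smallest r_i >= r (assumes r <= ladder[-1]); the query is then
    answered with the grid built for that r_i."""
    return ladder[bisect.bisect_left(ladder, r)]
```

Since consecutive rungs differ by a factor 1 + ε/4, the chosen r_i over-approximates r by at most that factor, which is absorbed into the ε-approximation guarantee.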

▶ Theorem 16. Let A, B, S be three sets of points in R^d for constant d, with O(poly(n)) spread and |A| + |B| + |S| = n, where A ∪ B ∪ S lies in a hyper-rectangle with diagonal length R. A data structure of O(ε^{−1} n log n) space can be constructed in O(nε^{−2d−1} log^2 n) time and updated in O(ε^{−2d−1} log^2 n) time, while supporting ε-approximate enumeration of triangle similarity join queries under any ℓp metric with O(ε^{−2d} log n) delay, for any query distance threshold r.

WSPD-based data structure. We describe the main idea here. Assume that sp(A ∪ B ∪ S) = n^{O(1)}. Let W_{A,B} be the WSPD construction of A, B as in Section 3.1. Similarly, we consider W_{A,S} and W_{B,S}. For each pair (A_i, B_i) ∈ W_{A,B}, let ϕ(□_i, ⊞_i) = r_i, let c_i be the center of □_i, and let c_i′ be the center of ⊞_i. Let L_i be the lune⁴ (intersection) of the spheres with radius r_i and centers c_i, c_i′. We run a query with L_i on a quadtree T_S on the points of S, obtaining O(ε^{−d}) quadtree boxes. Then we construct the triplets W′_{A,B} = {(A_1, B_1, S_1), ..., (A_ξ, B_ξ, S_ξ)}, where ξ = O(ε^{−2d} n). Similarly, we construct W′_{A,S}, W′_{B,S}. Let W′ = W′_{A,B} ∪ W′_{A,S} ∪ W′_{B,S}. We can show that each triplet (a, b, s) ∈ A × B × S can be found in at least one triplet (A_i, B_i, S_i) in W′. In particular, let (a, b, s) ∈ A × B × S be a triplet such that (without loss of generality) ϕ(a, b) ≥ ϕ(a, s) ≥ ϕ(b, s). From the definition of the WSPD W_{A,B}, there exists a unique pair (A_i, B_i) such that a ∈ A_i and b ∈ B_i. Notice that ϕ(a, s), ϕ(b, s) ≤ ϕ(a, b), so s must lie in the lune L_i; hence there exists a triplet (A_i, B_i, S_i) ∈ W′_{A,B} ⊆ W′ such that a ∈ A_i, b ∈ B_i, s ∈ S_i. In addition, due to the bounded spread, each node participates in at most O(ε^{−2d} log n) triplets in W′ and each point belongs to at most O(ε^{−2d} log^2 n) triplets in W′. Hence, each update takes Oe(ε^{−2d}) time. Using a tree Z as in Section 3.1 and following a deduplication method as in Section 4, we can execute ε-approximate enumeration of all triplets (a, b, s) within distance r with Oe(ε^{−2d}) delay.

▶ Theorem 17. Let A, B, S be three sets of points in R^d for constant d, with O(poly(n)) spread and |A| + |B| + |S| = n. A data structure of O(ε^{−2d} n) space can be built in Oe(ε^{−2d} n) time and updated in Oe(ε^{−2d}) time, while supporting ε-approximate enumeration of triangle similarity join under any ℓp metric with Oe(ε^{−2d}) delay, for any query distance threshold r.

C Similarity Join in High Dimensions

So far, we have treated the dimension d as a constant. In this section we describe a data structure for approximate similarity join based on the locality sensitive hashing (LSH) technique, so that the dependency on the dimension is a small polynomial in d; in particular, we remove the exponential dependence on d hidden in the poly-logarithmic factors. For simplicity, we describe our data structure assuming that r is fixed, and in the end we extend it to the case where r is also part of the similarity join query.

For ε > 0 and 1 ≥ p1 > p2 > 0, recall that a family H of hash functions is (r, (1 + ε)r, p1, p2)-sensitive if, for a uniformly chosen hash function h ∈ H and any two points x, y, we have (1) Pr[h(x) = h(y)] ≥ p1 if ϕ(x, y) ≤ r; and (2) Pr[h(x) = h(y)] ≤ p2 if ϕ(x, y) ≥ (1 + ε)r. The quality of a hash function family is measured by ρ = ln p1 / ln p2 < 1, which is upper bounded by a quantity that depends only on ε; ρ = 1/(1 + ε) for many common distance functions [26, 8, 21, 28]. For ℓ2 the best known bound is ρ ≤ 1/(1 + ε)^2 + o(1) [8].

The essence of LSH is to hash "similar" points into the same buckets with high probability. A simple approach to use LSH for similarity join is to (i) hash points into buckets; (ii) probe each bucket and check, for each pair of points (a, b) ∈ A × B inside the same bucket, whether ϕ(a, b) ≤ r; and (iii) report (a, b) if the inequality holds.
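Steps (i)-(iii) can be sketched as follows for the Hamming metric, using bit-sampling LSH (each g concatenates k random coordinates); this is our own minimal illustration, not the paper's data structure, and it deliberately exhibits the weaknesses discussed next:

```python
import random
from collections import defaultdict

def lsh_join(A, B, r, k=4, tau=8, seed=0):
    """Naive LSH similarity join under the Hamming metric. A pair may be
    examined once per hash function; the explicit distance check filters
    false positives inside each bucket."""
    rng = random.Random(seed)
    d = len(A[0])
    out = set()
    for _ in range(tau):
        coords = [rng.randrange(d) for _ in range(k)]  # one g = k sampled bits
        buckets = defaultdict(lambda: ([], []))
        for a in A:
            buckets[tuple(a[i] for i in coords)][0].append(a)
        for b in B:
            buckets[tuple(b[i] for i in coords)][1].append(b)
        for As, Bs in buckets.values():
            for a in As:
                for b in Bs:
                    if sum(x != y for x, y in zip(a, b)) <= r:
                        out.add((a, b))
    return out
```

Collecting results in a set hides the duplicate reports, but doing so needs Ω(output) memory and gives no delay guarantee, which is exactly what the rest of this section works around.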

However, two challenges arise with this approach. First, without any knowledge of the false positives inside each bucket, checking every pair of points could lead to a huge delay. Our key insight is that after checking a specific number of pairs of points in one bucket (this number will be determined later), we can safely skip the bucket, since any result pair missed in this bucket will be found in another one with high probability. Second, one pair of points may collide under multiple hash functions, so an additional step is necessary in the enumeration to remove duplicates. If we wish to keep the size of the data structure near-linear and are not allowed to store the reported pairs, detecting duplicates requires some care.

As a warm-up exercise to gain intuition, we first present a relatively easy special case in which the input points, as well as the inserted points, are chosen uniformly from the universal domain. We then focus on the general case without any assumption on the input distribution. Our data structure and algorithm use a parameter M, whose value will be determined later. Since we do not define new hash functions, all results presented in this section hold for the Hamming, ℓ1, and ℓ2 metrics.

C.1 With Uniform Assumption

Under this strong assumption, the LSH technique can be used with a slight modification. We adopt an LSH family H with quality parameter ρ and randomly choose τ = O(n^ρ) hash functions g1, g2, ..., gτ. To ensure our high-probability guarantee (as shown later), we maintain O(log n) copies of this data structure.

Data structure. Let C be the set of all buckets over all τ hash functions. For each bucket □, let A_□, B_□ be the sets of points from A, B falling into bucket □, respectively. A nice property of A_□ and B_□ is stated in the following lemma, which follows directly from the standard balls-into-bins result.

▶ Lemma 18. If the input points are chosen uniformly at random from the domain universe, then with probability at least 1 − 1/n, every bucket receives O(log n/ log log n) points.

As the number of points colliding in each bucket is bounded by O(log n), it is affordable to check all pairs of points inside one bucket in O(log^2 n) time, thus resolving the first challenge. Moreover, we introduce a variable □_out for each bucket □ ∈ C indicating the number of pairs of tuples within distance r colliding inside □. Obviously, a bucket □ is active if □_out > 0, and inactive otherwise. All active buckets are maintained in C∗ ⊆ C, in increasing order of the index of the hash function they come from.

Update. Assume one point a ∈ A is inserted. We visit each hash bucket □ into which a is hashed. We insert a into A_□, count the number of points b ∈ B_□ with ϕ(a, b) ≤ r, and add this quantity to □_out. The case of deletion is handled similarly.

Enumeration. Assume (a, b) is about to be reported from the bucket □ of hash function g_i. We check whether a, b have collided in any earlier bucket: if there exists no index j < i such that g_j(a) = g_j(b), we report the pair. Then, we need to notify every bucket that also witnesses (a, b) but comes after □. More specifically, for every j > i, if g_j(a) = g_j(b) in a bucket □′, we decrease □′_out by 1, and remove □′ from C∗ if □′_out becomes 0. The pseudocode is given below.

Algorithm 1 UniEnumLSH

1   All buckets in C∗ are sorted by the index of their hash functions;
2   foreach □ ∈ C∗, coming from hash function g_i, do
3       foreach (a, b) ∈ A_□ × B_□ do
4           if ϕ(a, b) ≤ r then
5               flag ← true;
6               foreach j ∈ {1, 2, ..., i − 1} do
7                   if g_j(a) = g_j(b) then
8                       flag ← false;
9               if flag = true then
10                  Emit (a, b);
11                  foreach j ∈ {i + 1, i + 2, ..., τ} do
12                      if g_j(a) = g_j(b) in a bucket □′ then
13                          □′_out ← □′_out − 1;
14                          if □′_out = 0 then
15                              C∗ ← C∗ − {□′};
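The de-duplication rule in lines 5-10 reduces to a single predicate, sketched here in Python (an illustration of the rule, not of the full algorithm): a pair found in the bucket of g_i is reported only if i is the smallest index at which the pair collides, so each result is emitted exactly once even when it collides under many hash functions.

```python
def emit_once(a, b, i, gs):
    """Report (a, b) from the bucket of hash function gs[i] (0-based) only
    if no earlier hash function already witnessed the collision."""
    return all(gs[j](a) != gs[j](b) for j in range(i))
```

This check costs one hash evaluation per earlier function, which is where the O(kτ) term in the delay analysis below comes from.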

▶ Theorem 19. Let A, B be two sets of points in R^d, with |A| + |B| = n, and let ε, r be positive parameters. Under the uniform assumption, a data structure of Oe(nd) size can be constructed in Oe(nd) time and updated in Oe(d) time, while supporting, with probability 1 − 2/n, exact enumeration of similarity join with Oe(d) delay.

Proof of Theorem 19. We first prove the correctness of the algorithm. It can be easily checked that no pair of points at distance larger than r is emitted. Consider any pair of points (a, b) within distance r, and let i be the smallest index such that g_i(a) = g_i(b), in a bucket □. In the algorithm, (a, b) will be reported by □ and by no later bucket. Thus, each join result is enumerated at most once, without duplication.

In the case of Hamming distance, we have k = log_2 n and p1^k = (1 − r/d)^{log n} ∈ [1/e, 1], since d/r > log n by padding some zeros at the end of all points⁵; thus τ = 3 · (1/p1^k) · ln n = Oe(1).

We next analyze the complexity of our data structure. It can be built in O(nkτ) time with O(nkτ) space, since there are n points in A ∪ B, at most O(nτ) non-empty buckets in C, and each point of A ∪ B is incident to exactly τ buckets in C. With the same argument, it takes O(nkτ) time to construct the tripartite graph representation. Moreover, it takes O(Σ_□ |A_□| · |B_□|) time to compute the quantity □_out for all buckets, which can be further bounded by

Σ_□ |A_□| · |B_□| < n · max_□ (|A_□| + |B_□|) = O(n log n),

implied by Lemma 18.

Consider any bucket □ from hash function g_j. If the algorithm visits it during the enumeration, at least one pair of points within distance r will be emitted that has not been emitted by any bucket from a hash function g_i with i < j. Checking all pairs of points inside any bucket takes at most O((d + kτ) · max_□ |A_□| · |B_□|) time, where it takes O(d) time to compute the distance between a pair of points and O(kτ) time to check whether this pair has been emitted before, or to notify the buckets that also witness this pair later. Thus, the delay between any two consecutive result pairs is bounded by O((d + kτ) · max_□ |A_□| · |B_□|), which is Oe(d) under the uniform assumption.

Moreover, each pair of points within distance r is reported by any one hash function with probability at least p1^k. The probability that the pair does not collide under any of the hash functions is at most (1 − p1^k)^{3·(1/p1^k)·ln n} ≤ 1/n^3. As there are at most n^2 such pairs, the probability that any one of them is not reported by our data structure is at most 1/n. By a union bound, the probability that the uniform assumption fails or some join result is not reported is at most 1/n + 1/n = 2/n. Thus, the result holds with probability at least 1 − 2/n. ◀

⁵ A similar assumption was made in the original paper [26] on nearest neighbor search in the Hamming distance.
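The missing-pair bound in the proof can be checked numerically (a sanity check with illustrative values, not part of the proof): with τ = 3 · (1/p1^k) · ln n hash functions, a close pair is missed by all of them with probability (1 − p1^k)^τ, which is at most n^{−3} since (1 − x)^{3 ln n / x} ≤ e^{−3 ln n}.

```python
import math

n = 1024
p1k = 1 / math.e                      # p1**k lies in [1/e, 1] as argued above
tau = 3 * (1 / p1k) * math.log(n)    # number of hash functions
miss = (1 - p1k) ** tau              # prob. a close pair collides nowhere
assert miss <= n ** -3               # the bound used in the union-bound step
```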

C.2 Without Uniform Assumption

In general, without the uniform assumption, we need to exploit more properties of the LSH family to obtain an efficient data structure. Our key insight is that after checking some pairs of points in one bucket (the specific number of pairs will be determined later), we can safely skip the bucket, since with high probability any join result missed in this bucket will be found in another one. In this way, we avoid spending too much time in one bucket before finding any join result. Given a set P of points and a distance threshold r, let B(q, P, r) = {p ∈ P | ϕ(p, q) > r}. The next lemma follows from [31, 27].

▶ Lemma 20. For a set P of n points in Hamming space H^d and a distance threshold r, if k = O(log n) and τ = O(n^ρ), then for any point p ∈ P the following conditions hold with constant probability γ: for any q ∈ P such that ϕ(p, q) ≤ r, there exists a bucket □ such that p, q collide in □ and |□ ∩ B(p, P, (1 + ε)r)| ≤ M for M = O(n^ρ).

C.2.1 Data structure

We adopt an LSH family H with quality parameter ρ and randomly choose τ = O(n^ρ) hash functions g1, g2, ..., gτ. To ensure our high-probability guarantee (as shown later), we maintain m = O(log n) copies I1, I2, ..., Im of this data structure.

For each bucket □, we store and maintain a set of M arbitrary points Ā_□ ⊆ A_□ and B̄_□ ⊆ B_□. For each point a ∈ Ā_□ we maintain a counter a_c = |{b ∈ B̄_□ | ϕ(a, b) ≤ 2(1 + ε)r}|. A bucket □ is active if there exists a pair (a, b) ∈ Ā_□ × B̄_□ such that ϕ(a, b) ≤ 2(1 + ε)r; equivalently, a bucket □ is active if there exists a ∈ Ā_□ with a_c > 0. All active buckets are maintained in a list C∗. For each bucket □ ∈ C∗ we store a representative pair (a_□, b_□) ∈ Ā_□ × B̄_□ such that ϕ(a_□, b_□) ≤ 2(1 + ε)r.

For any pair of points (a, b) ∈ A × B and a hash bucket □, we refer to □ as a proxy bucket for (a, b) if (i) a ∈ A_□, b ∈ B_□; and (ii) |B(a, A_□ ∪ B_□, (1 + ε)r)| ≤ M. Lemma 21 implies that each join result (a, b) with ϕ(a, b) ≤ r has at least one proxy bucket.

▶ Lemma 21. With probability at least 1 − 1/n, for any pair of points (a, b) ∈ A × B with ϕ(a, b) ≤ r, there exists a data structure I_j that contains a bucket □ such that: (i) a, b collide in □; and (ii) |□ ∩ B(a, A, (1 + ε)r)| ≤ M and |□ ∩ B(a, B, (1 + ε)r)| ≤ M.

Proof of Lemma 21. Consider any pair of points (a ∈ A, b ∈ B) within distance r and an arbitrary copy of the data structure constructed as described above. From Lemma 20, with probability at least γ there exists a bucket in the data structure that contains both a, b and in which the number of collisions of a (with the rest of the points in A ∪ B) is bounded by M.

Let F_j be the event that there is a bucket in I_j that witnesses the collision of a, b such that the number of collisions of a is bounded by M. Since F_i, F_j are independent for i ≠ j, we have Pr[F̄_1 ∩ ... ∩ F̄_m] = Pr[F̄_1] · ... · Pr[F̄_m] ≤ (1 − γ)^m ≤ 1/n^3 for m = 3 log n / log(1/(1 − γ)) = O(log n). Let Z be the number of pairs within distance at most r; we have Z ≤ n^2. Let G_i be the event that, for the i-th pair of points a′, b′ within distance at most r, there is at least one copy of the data structure containing a bucket that contains both a′, b′ and in which the number of collisions of a′ is bounded by M. Then Pr[G_1 ∩ ... ∩ G_Z] = 1 − Pr[Ḡ_1 ∪ ... ∪ Ḡ_Z] ≥ 1 − Pr[Ḡ_1] − ... − Pr[Ḡ_Z] ≥ 1 − n^2/n^3 ≥ 1 − 1/n. Hence, with high probability, for any pair a ∈ A, b ∈ B within distance at most r there is at least one bucket in the data structure that contains both a, b and in which the number of collisions of a is bounded by M. ◀

▶ Lemma 22. For any bucket □, if there exist M points from each of A_□ and B_□ such that none of the M^2 pairs is within distance 2(1 + ε)r, then □ is not a proxy bucket for any pair (a, b) ∈ A_□ × B_□ with ϕ(a, b) ≤ r.

Proof. Let A′, B′ be two sets of M points from A_□, B_□, respectively, and assume that all pairs of points in A′ × B′ have distance larger than 2(1 + ε)r. Observe that □ is not a proxy bucket for any pair (a ∈ A′, b ∈ B′). It remains to consider (a ∈ A_□ \ A′, b ∈ B_□) with ϕ(a, b) ≤ r; assume b ∈ B_□ \ B′ (the case b ∈ B′ is similar). If A′ ⊆ B(a, A, (1 + ε)r) or B′ ⊆ B(a, B, (1 + ε)r), then □ is not a proxy bucket for (a, b). Otherwise, there must exist points a′ ∈ A′ and b′ ∈ B′ such that ϕ(a, a′) ≤ (1 + ε)r and ϕ(a, b′) ≤ (1 + ε)r, so ϕ(a′, b′) ≤ ϕ(a, a′) + ϕ(a, b′) ≤ 2(1 + ε)r. Thus (a′, b′) ∈ A′ × B′ is a pair within distance 2(1 + ε)r, a contradiction. ◀

Later, we will see that our enumeration phase reports each join result only in one of its proxy buckets. This guarantees the completeness of the query results, but de-duplication is still necessary if a pair of points has more than one proxy bucket.

C.2.2 Update

We handle insertions and deletions separately. We assume that we insert or delete a point a ∈ A; an update of B is handled similarly. The pseudocode is given below.

Algorithm 2 Insert(a ∈ A)

1   foreach hash function g in the data structure do
2       □ ← the bucket with hash value g(a);
3       Insert a into A_□;
4       if |Ā_□| < M then
5           Insert a into Ā_□;
6           Compute a_c by computing ϕ(a, b) for each b ∈ B̄_□;
7           if a_c > 0 AND □ ∉ C∗ then
8               C∗ ← C∗ ∪ {□};
9               (a_□, b_□) ← (a, b) for a point b ∈ B̄_□ with ϕ(a, b) ≤ 2(1 + ε)r;

Insertion of a. We compute g(a) for each chosen hash function g. Assume □ is the bucket with hash value g(a). We first insert a into A_□. If |Ā_□| < M, we add a to Ā_□ and compute

Algorithm 3 Delete(a ∈ A)

1   foreach hash function g in the data structure do
2       □ ← the bucket with hash value g(a);
3       Delete a from A_□;
4       if a ∈ Ā_□ then
5           Remove a from Ā_□;
6           Insert an arbitrary point a′ ∈ A_□ \ Ā_□ into Ā_□;
7           if □ ∉ C∗ AND a′_c > 0 then
8               C∗ ← C∗ ∪ {□};
9               (a_□, b_□) ← (a′, b) where b ∈ B̄_□ and ϕ(a′, b) ≤ 2(1 + ε)r;
10          else if □ ∈ C∗ AND a_□ = a then
11              if ∃a′′ ∈ Ā_□ with a′′_c > 0 then
12                  (a_□, b_□) ← (a′′, b) where b ∈ B̄_□ and ϕ(a′′, b) ≤ 2(1 + ε)r;
13              else
14                  C∗ ← C∗ \ {□};

the counter a_c by visiting all points in B̄_□. If □ was inactive and a_c > 0, we add □ to C∗, find a point b ∈ B̄_□ with ϕ(a, b) ≤ 2(1 + ε)r, and set the representative pair (a_□, b_□) = (a, b).

Deletion of a. Similarly, we compute g(a) for each chosen hash function g. Assume □ is the bucket with hash value g(a). We first remove a from A_□. If a ∈ Ā_□, we also remove it from Ā_□ and replace it with an arbitrary point a′ ∈ A_□ \ Ā_□, computing its counter a′_c. If a was a point of the representative pair of □, we update the pair by finding any point a′′ ∈ Ā_□ with a′′_c > 0. If there is no such point, we remove □ from C∗.

After n/2 updates, we simply reconstruct the entire data structure from scratch.
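This periodic-rebuild trick is standard; a generic sketch (our own, with hypothetical names) shows how the amortization is triggered:

```python
class Rebuildable:
    """After roughly n/2 updates, rebuild the whole structure from scratch
    so its parameters (e.g. tau, M) stay tuned to the current input size;
    the rebuild cost is amortized over the updates that triggered it."""
    def __init__(self, items, build):
        self.build = build
        self.items = list(items)
        self.pending = 0
        self.state = build(self.items)

    def update(self, op):
        op(self.items)                 # apply the insertion/deletion
        self.pending += 1
        if self.pending >= max(len(self.items) // 2, 1):
            self.state = self.build(self.items)   # full rebuild from scratch
            self.pending = 0
```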

C.2.3 Enumeration

The high-level idea is to enumerate the representative pair of points of each bucket in C∗. Assume a representative pair (a, b) is found in a bucket □ ∈ C∗. Next, the algorithm enumerates all results involving the point a.

Initially, all buckets containing a are maintained in C∗(a) ⊆ C∗. Algorithm 4 visits every bucket □ ∈ C∗(a) and checks the distances between a and the points in B_□ that are not marked in X(□, a) (we explain when a point is marked in the next paragraph). Each time a pair (a, b) within distance 2(1 + ε)r is found, it reports this pair and calls the procedure Deduplicate on (a, b) (details are given later). If there are more than M points far away from a (i.e., at distance > 2(1 + ε)r), we stop enumerating results with the point a in this bucket and remove the bucket from C∗(a).⁶ We also update Ā_□: if a ∈ Ā_□, we replace a with another point of A_□. Once the enumeration on a is finished, i.e., when C∗(a) becomes empty, it is easy to check that a has been removed from all buckets.

Next, we give more details on the de-duplication step, presented as Algorithm 5. Once a pair of points (a, b) within distance 2(1 + ε)r is reported, Algorithm 5 goes over all buckets witnessing the collision of a, b and marks b in X(□, a) to avoid repeated enumeration (line 2). Moreover, for any bucket □ with a ∈ A_□ and b ∈ B_□, if (a, b) is also its representative pair, Algorithm 5 performs further updates for □. It first decides whether □ is still an active bucket for a by checking the distances between a and M points of B_□ unmarked for a. If such a pair within distance 2(1 + ε)r is found, it is set as the new representative pair for □. Otherwise, it is safe to skip all results with the point a in this bucket; in this case, a new representative pair for □ is computed using Ā_□, B̄_□. Moreover, if no representative pair can be found, it is safe to skip all results of bucket □.

⁶ In the enumeration phase, "remove" always means conceptually marked, without changing the data structure itself.

Algorithm 4 EnumerateLSH

1   while C∗ ≠ ∅ do
2       (a, b) ← the representative pair of any bucket in C∗;
3       C∗(a) ← {□ ∈ C∗ : a ∈ A_□};
4       while C∗(a) ≠ ∅ do
5           Pick one bucket □ ∈ C∗(a); i ← 0;
6           foreach b ∈ B_□ − X(□, a) do
7               if ϕ(a, b) ≤ 2(1 + ε)r then
8                   Emit (a, b);
9                   Deduplicate(a, b);
10              else
11                  i ← i + 1;
12                  if i > M then break;
13          A_□ ← A_□ − {a};
14          C∗(a) ← C∗(a) − {□};
15          Replace a in Ā_□ (if a ∈ Ā_□) and update its representative pair;

Algorithm 5 Deduplicate(a, b)

1   foreach □ ∈ C∗ with a ∈ A_□ and b ∈ B_□ do
2       X(□, a) ← X(□, a) ∪ {b};
3       if (a_□, b_□) = (a, b) then
4           B′ ← M arbitrary points in B_□ − X(□, a);
5           if there is b′ ∈ B′ with ϕ(a, b′) ≤ 2(1 + ε)r then
6               (a_□, b_□) ← (a, b′);
7           else
8               C∗(a) ← C∗(a) − {□};
9               A_□ ← A_□ − {a};
10              if a ∈ Ā_□ then
11                  Replace it with a new item a′ ∈ A_□ \ Ā_□;
12                  Compute a′_c;
13              if ∃a′′ ∈ Ā_□ with a′′_c > 0 then
14                  (a_□, b_□) ← (a′′, b′′) where b′′ ∈ B̄_□ and ϕ(a′′, b′′) ≤ 2(1 + ε)r;
15              else
16                  C∗ ← C∗ \ {□};

For any bucket □, we maintain the points in A_□, B_□, X(□, a) in balanced binary search trees, so that the points in any set can be listed, or moved to a different set, with O(log n) delay. Moreover, to avoid conflicts among the markers made by different enumeration queries, we generate the markers randomly and delete old values by lazy updates [23, 38, 37] after finding new pairs to report.

▶ Lemma 23. The data structure supports (1 + 2ε)-approximate enumeration.

Proof of Lemma 23. It can be easily checked that no pair of points at distance greater than 2(1 + ε)r is enumerated, and each result is reported at most once by Algorithm 5. Next, we show that, with high probability, all pairs of points within distance r are reported. Consider any pair of points (a, b) within distance r. By Lemma 21, there must exist a proxy bucket □ for (a, b). By Lemma 22, there exists no subset of M points of A_□ as Ā_□ and no subset of M points of B_□ as B̄_□ such that all pairs of points in Ā_□ × B̄_□ have distance larger than 2(1 + ε)r, so □ is active. Moreover, there exists no subset of M points of B_□ as B̄_□ such that all pairs (a, b′ ∈ B̄_□) have distance larger than 2(1 + ε)r, so □ is an active bucket for a. Hence, when Algorithm 4 visits □, the pair (a, b) must be reported by □ or have been reported previously. ◀

We next analyze the complexity of the data structure. The size of the data structure is Oe(dn + nkτ) = Oe(dn + n^{1+ρ}). The insertion time is Oe(dτM) = Oe(dn^ρ M); using this, we can bound the construction time of the data structure by Oe(dnτM) = Oe(dn^{1+ρ} M). The deletion time is Oe(dτM) = Oe(dn^ρ M). The delay is bounded by Oe(dτM) = Oe(dn^ρ M), since after reporting a pair (a, b) we may visit Oe(τ) buckets and spend O(M) time in each of them updating the representative pair. Putting everything together with M = O(n^ρ) from Lemma 20, we conclude the next theorem.

▶ Theorem 24. Let A and B be two sets of points in R^d, where |A| + |B| = n, and let ε, r be positive parameters. For ρ = 1/(1 + ε)^2 + o(1), a data structure of Oe(dn + n^{1+ρ}) size can be constructed in Oe(dn^{1+2ρ}) time and updated in Oe(dn^{2ρ}) amortized time, while supporting (1 + 2ε)-approximate enumeration for similarity join under the ℓ2 metric with Oe(dn^{2ρ}) delay.

Notice that the complexities of Theorem 24 depend on the parameter M from Lemma 20. Hence, a better bound on M would improve the guarantees of our data structure. In the original paper [31] (Section 4.2), for the Hamming metric the authors choose ρ = log(1/p1)/log(p1/p2), showing that for any p, q ∈ P with ϕ(p, q) ≤ r there exists a bucket in which p and q collide with constant probability γ, and the number of points of P ∩ B(p, (1 + ε)r) colliding with p in that bucket is at most M = O(1). For ε > 1 they show that ρ < 1/ε. Equivalently, we can replace ε by ε − 1 and take M = O(1). Using this result we obtain the next theorem.

▶ Theorem 25. Let A, B be two sets of points in H^d, where |A| + |B| = n, and let ε, r be positive parameters. For ρ = 1/(1+ε), a data structure of Õ(dn + n^{1+ρ}) size can be constructed in Õ(dn^{1+ρ}) time and updated in Õ(dn^ρ) amortized time, while supporting (3 + 2ε)-approximate enumeration for similarity join under the Hamming metric with Õ(dn^ρ) delay.
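As a quick numeric sanity check on the exponent, one can evaluate the standard bit-sampling LSH exponent on the Hamming cube, which approaches ρ = 1/(1 + ε) when r ≪ d, matching Theorem 25. This is an illustrative computation under the textbook collision probabilities, not the refined analysis of [31]:

```python
import math

def rho_bit_sampling(r, d, eps):
    # Collision probabilities of bit-sampling LSH on the Hamming cube:
    # a near pair (distance <= r) agrees on a uniformly sampled bit with
    # probability p1 = 1 - r/d; a far pair (distance > (1+eps)r) with
    # probability at most p2 = 1 - (1+eps)*r/d. The standard exponent
    # rho = log(1/p1) / log(1/p2) tends to 1/(1+eps) as r/d -> 0.
    p1 = 1.0 - r / d
    p2 = 1.0 - (1.0 + eps) * r / d
    return math.log(1.0 / p1) / math.log(1.0 / p2)
```

For example, with ε = 1 and r/d = 10^{-6}, the exponent is essentially 1/(1 + 1) = 0.5.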

In the next remarks we show that our results extend to the setting where r is part of the query (variable) rather than fixed in advance. Furthermore, we show that our result is near-optimal.

Remark 1. Similar to the LSH scheme [31] used for the ANN query, we can extend our current data structure to the case where r is also part of the query. For simplicity, we focus on the Hamming metric. For H^d, it holds that 1 ≤ r ≤ d. Hence, we build Z = O(log_{1+ε} d) = O(ε^{-1} log d) data structures as described above, each of them corresponding to a similarity threshold r_i = (1 + ε)^i for i = 1, ..., Z. Given a query with threshold r, we first run a binary search to find r_j such that r ≤ r_j ≤ (1 + ε)r. Then, we use the j-th data structure to answer the similarity join query. Overall, the data structure has Õ(dn + ε^{-1}n^{1+ρ} log d) size, can be constructed in Õ(ε^{-1}dn^{1+ρ} log d) time, and updated in Õ(ε^{-1}dn^ρ log d) amortized time. After finding the value r_j in O(log(ε^{-1} log d)) time, the delay guarantee remains Õ(dn^ρ). We can also extend this result to the ℓ2 or ℓ∞ metrics using known results [26, 27, 31].
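The threshold-grid construction and binary search of Remark 1 can be sketched as follows; the function names and the handling of the grid endpoints are assumptions of this sketch, not part of the paper's construction.

```python
import bisect
import math

def build_thresholds(d, eps):
    # Geometric grid r_i = (1+eps)^i for i = 1..Z, covering [1, d];
    # Remark 1 builds one inner data structure per threshold,
    # Z = O(log_{1+eps} d) of them in total.
    Z = math.ceil(math.log(d) / math.log(1.0 + eps))
    return [(1.0 + eps) ** i for i in range(1, Z + 1)]

def pick_structure(thresholds, r):
    # Binary-search for the smallest r_j with r <= r_j; the geometric
    # spacing guarantees r_j <= (1+eps)*r, so structure j answers the
    # query at the cost of a (1+eps) slack in the threshold.
    return bisect.bisect_left(thresholds, r)
```

For example, with d = 1024 and ε = 0.5, a query threshold r = 10 is routed to the grid value 1.5^6 ≈ 11.39, which lies in [r, 1.5r].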

Remark 2. It is known that an algorithm for similarity join can be used to answer the ANN query. Let P be a set of points in R^d, where d is a large number, and let ε, r be parameters. The ANN query asks that (1) if there exists a point within distance r from q, any one such point should be returned with high probability; and (2) if there is no point within distance (1 + ε)r from q, it returns "no" with high probability. For any instance of the ANN query, we can construct an instance of similarity join by setting A = P and B = ∅. Whenever a query point q is issued for the ANN problem, we insert q into B, invoke the enumeration query until the first result is returned (if there is any), and then remove q from B. Our data structure of Õ(dn + n^{1+ρ}) size can answer a (1 + 2ε)-approximate ANN query in Õ(dn^{2ρ}) time in ℓ2, which is worse only by a factor of n^ρ than the best data structure for answering the ε-approximate ANN query.
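The reduction above can be sketched in a few lines. The `insert_b`/`delete_b`/`enumerate` interface and the brute-force stand-in class are hypothetical illustrations of this sketch, not the paper's data structure.

```python
class NaiveJoin:
    # Brute-force stand-in for the similarity-join structure, exposing
    # a hypothetical interface: insert/delete on side B and a generator
    # enumerating result pairs within distance r.
    def __init__(self, A, r):
        self.A, self.B, self.r = list(A), [], r

    def insert_b(self, q):
        self.B.append(q)

    def delete_b(self, q):
        self.B.remove(q)

    def enumerate(self):
        for a in self.A:
            for b in self.B:
                if sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5 <= self.r:
                    yield a, b

def ann_query(join_ds, q):
    # Remark 2's reduction: temporarily insert q into B, take the
    # first enumerated pair (if any), then remove q again.
    join_ds.insert_b(q)
    try:
        for a, _ in join_ds.enumerate():
            return a       # some point of P close to q
        return None        # "no": nothing within the (soft) threshold
    finally:
        join_ds.delete_b(q)
```

Because the query point is removed in a `finally` block, the structure is left unchanged whether or not a neighbor is found.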