Approximate Range Searching∗

Sunil Arya†
Department of Computer Science
The Hong Kong University of Science and Technology
Clear Water Bay, Kowloon, Hong Kong

David M. Mount‡
Department of Computer Science and Institute for Advanced Computer Studies
University of Maryland
College Park, MD 20742

Abstract

The range searching problem is a fundamental problem in computational geometry, with numerous important applications. Most research has focused on solving this problem exactly, but lower bounds show that if linear space is assumed, the problem cannot be solved in polylogarithmic time, except for the case of orthogonal ranges. In this paper we show that if one is willing to allow approximate ranges, then it is possible to do much better. In particular, given a bounded range Q of diameter w and ε > 0, an approximate range query treats the range as a fuzzy object, meaning that points lying within distance εw of the boundary of Q either may or may not be counted. We show that in any fixed dimension d, a set of n points in R^d can be preprocessed in O(n log n) time and O(n) space, such that approximate queries can be answered in O(log n + (1/ε)^d) time. The only assumption we make about ranges is that the intersection of a range and a d-dimensional cube can be determined in constant time (depending on dimension). For convex ranges, we tighten this to O(log n + (1/ε)^{d−1}) time. We also present a lower bound for approximate range searching based on partition trees of Ω(log n + (1/ε)^{d−1}), which implies optimality for convex ranges (assuming fixed dimensions). Finally, we give empirical evidence showing that allowing small relative errors can significantly improve query execution times.

∗ Appeared in Computational Geometry: Theory and Applications, 17 (2000), 135–163. A preliminary version appeared in the Proc. of the 11th Annual ACM Symp. on Computational Geometry, 1995, 172–181.
† Email: [email protected]. The work of this author was partially supported by HK RGC grant HKUST 736/96E. Part of this research was conducted while the author was visiting the Max-Planck-Institut für Informatik, Saarbrücken, Germany.
‡ Email: [email protected]. This author was supported by the National Science Foundation under grant CCR–9712379.

1 Introduction.

The range searching problem is among the fundamental problems in computational geometry. A set P of n data points is given in d-dimensional real space, R^d, and a space of possible ranges is considered (e.g., d-dimensional rectangles, spheres, halfspaces, or simplices). The goal is to preprocess the points so that, given any query range Q, the points in P ∩ Q can be counted or reported efficiently. More generally, one may assume that the points have been assigned weights, and the problem is to compute the accumulated weight of the points in P ∩ Q, weight(P ∩ Q), under some commutative semigroup. There is a rich literature on this problem. In this paper we consider the weighted counting version of the problem.

We are interested in applications in which the number of data points is sufficiently large that one is limited to using only linear or roughly linear space in solving the problem. For orthogonal ranges, it is well known that range trees can be applied to solve the problem in O(log^{d−1} n) time with O(n log^{d−1} n) space (see, e.g., [15]). Chazelle and Welzl [9] showed that triangular range queries can be solved in the plane in O(√n log n) time using O(n) space. Matoušek [13] has shown how to achieve O(n^{1−1/d}) query time for simplex range searching with nearly linear space. This is close to Chazelle's lower bound of Ω(n^{1−1/d}/log n) [8] for linear space. For halfspace range queries, Brönnimann et al. [5] gave a lower bound of Ω(n^{1−2/(d+1)}) (ignoring logarithmic factors) assuming linear space. This lower bound applies to the more general case of spherical range queries as well.

Unfortunately, the lower bound arguments defeat any reasonable hope of achieving polylogarithmic performance for arbitrary (nonorthogonal) ranges. This suggests that it may be worthwhile to consider variations of the problem that admit these better running times. In this paper we consider an approximate version of range searching. Rather than approximating the count, we consider the range to be a fuzzy range, and assume that data points that are "close" to the boundary of the range (relative to the range's diameter) may or may not be included in the count.

To make this idea precise, we assume that ranges are bounded sets of bounded complexity. (Thus our results are not applicable to halfspace range searching.) Given a range Q of diameter w, and given ε > 0, define the inner range Q− to be the locus of points whose Euclidean distance from a point exterior to Q is at least εw, and define the outer range Q+ to be the locus of points whose distance from a point interior to Q is at most εw. (Equivalently, Q+ and Q− can be defined in terms of the Minkowski sum and difference, respectively, of a ball of radius εw with Q.) Define a legal answer to an ε-approximate range query to be weight(P′) for any subset P′ such that P ∩ Q− ⊆ P′ ⊆ P ∩ Q+. (See Fig. 1.)

Figure 1: Approximate range searching queries. Points inside Q− must be included, points between Q− and Q+ may be included, and points outside Q+ must not be included.
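To make the definition concrete, the following minimal sketch (ours, not from the paper) computes the interval of legal answers by brute force for a Euclidean ball range; any count in the returned interval is a legal answer to the ε-approximate query. It is useful only as a test oracle, since it inspects every point.

    # Brute-force oracle for epsilon-approximate range counting over a
    # Euclidean ball range (illustrative only; inspects every point).
    import math

    def legal_answer_interval(points, weights, center, radius, eps):
        # The ball has diameter w = 2*radius, so Q- and Q+ are the erosion
        # and dilation of Q by eps*w = 2*eps*radius.
        slack = 2 * eps * radius
        lo = hi = 0.0
        for p, wt in zip(points, weights):
            dist = math.dist(p, center)
            if dist <= radius - slack:        # inside Q-: must be counted
                lo += wt
                hi += wt
            elif dist <= radius + slack:      # between Q- and Q+: may be counted
                hi += wt
        return lo, hi                         # legal answers lie in [lo, hi]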

This definition allows for two-sided errors: the count may fail to include points that are barely inside the range, and may include points that are barely outside it. It is trivial to modify the algorithm so that it produces one-sided errors (thus forbidding either sins of omission or sins of commission, but obviously not both). Our results can be generalized to other definitions of Q− and Q+, provided that the minimum boundary separation distance with Q is at least εw.

We assume that the following range-testing primitives are provided:

• Point membership in Q: whether a point p lies within Q,

• Box intersection with Q−: whether an axis-aligned rectangle has a nonempty intersection with Q−, and

• Box containment in Q+: whether an axis-aligned rectangle is contained within Q+.

Our running times are given assuming each of these can be computed in constant time. (In general, the running time is the product of the maximum time for any primitive and our stated time bounds.)
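For a Euclidean ball range, for instance, all three primitives can be computed in O(d) time from the minimum and maximum distances between an axis-aligned rectangle and the ball's center. The sketch below is ours, not code from the paper, and assumes ε < 1/2 so that the inner range is nonempty.

    import math

    # Primitives for a ball range Q of radius r centered at c; a rectangle
    # is given by its corner coordinate lists (lo, hi). The slack 2*eps*r
    # is the erosion/dilation distance eps*w for a ball of diameter w = 2r.

    def point_in_range(p, c, r):
        return math.dist(p, c) <= r

    def _min_max_dist2(lo, hi, c):
        # Squared min and max distances from c to the rectangle [lo, hi].
        dmin = dmax = 0.0
        for lo_k, hi_k, x in zip(lo, hi, c):
            near = min(max(x, lo_k), hi_k)              # nearest coordinate
            far = lo_k if x - lo_k > hi_k - x else hi_k # farthest coordinate
            dmin += (x - near) ** 2
            dmax += (x - far) ** 2
        return dmin, dmax

    def box_intersects_inner(lo, hi, c, r, eps):
        dmin, _ = _min_max_dist2(lo, hi, c)
        return dmin <= (r - 2 * eps * r) ** 2   # box reaches into Q-

    def box_inside_outer(lo, hi, c, r, eps):
        _, dmax = _min_max_dist2(lo, hi, c)
        return dmax <= (r + 2 * eps * r) ** 2   # box lies entirely in Q+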

Our algorithm can easily be generalized to report the set of points lying within the range, in which case the running time increases by the time needed to output the points.

Approximate range searching is probably interesting only for fat ranges. Overmars [14] defines an object Q to be k-fat if, for any point p in Q and any ball B with p as center that does not fully contain Q in its interior, the portion of B covered by Q is at least 1/k. For ranges that are not k-fat, the diameter of the range may be arbitrarily large compared to the thickness of the range at any point. However, there are many applications of range searching that involve fat ranges.

There are a number of reasons that this formulation of the problem is worth considering. There are many applications where the data are imprecise, and the ranges themselves are imprecise. For example, the user of a geographic information system who wants to know how many single-family dwellings lie within a 60-mile radius of Manhattan may be quite happy with an answer that is accurate only to within a few miles. Also, range queries are often used as an initial filtering step applied to very large data sets, after which some more complex test is applied to the points within the range. In these applications, a user may be quite happy to accept a coarse filter that runs faster. The user is free to adjust the value of ε to whatever precision is desired (without the need to repeat the preprocessing), with the understanding that a tradeoff in running time is involved.

In this paper we show that by allowing approximate ranges it is possible to achieve significant improvements in running times, from both a theoretical and a practical perspective. We show that for fixed dimension d, after O(n log n) preprocessing and with O(n) space, ε-approximate range queries can be answered in time O(log n + (1/ε)^d). Under the assumption that ranges are convex, we strengthen this to O(log n + (1/ε)^{d−1}). (These expressions are asymptotic in n and 1/ε and assume d is fixed. See Theorems 1 and 2 for the exact dependencies on dimension.) Some of the features of our method are:

• The data structure and preprocessing time are independent of the space of possible ranges and of ε. (We assume only that the above range-testing primitives are provided at query time.)

• Space and preprocessing time are free of exponential factors in dimension: space is O(dn) and preprocessing time is O(dn log n).

• The algorithms are relatively simple and have been implemented. The data structure is a variant of the well-known quadtree data structure.

• Our experimental results show that even for uniformly distributed points in dimension 2, there is a significant improvement in the running time if a small approximation error is allowed. Furthermore, the average error (defined in Section 5) committed by the algorithm is typically much smaller than the allowed error ε.

We also present a lower bound of Ω(log n + (1/ε)^{d−1}) on the complexity of answering ε-approximate range queries assuming a partition-tree approach, for cubical ranges in fixed dimension. Thus our approach is optimal under these assumptions for convex ranges (up to constant factors depending on d and ε).

2 The Balanced Box-Decomposition Tree.

In this section we review the data structure from which queries will be answered. The structure is a balanced box-decomposition tree (or BBD-tree) for the point set. The BBD-tree [2] is a balanced variant of a number of well-known data structures based on the hierarchical subdivision of space into rectilinear regions. Examples of this class of structures include point quadtrees [16] and their variants [4], k-d trees [3], and the (unbalanced) box-decomposition tree (also called a fair-split tree) [6, 10, 18]. A related structure is the BAR tree [11], which partitions space into regions of bounded aspect ratio that need not be rectilinear. For completeness we review the BBD-tree, emphasizing the elements that are important for this application.

We begin with a few definitions. By a rectangle in R^d we mean the d-fold product of closed intervals on the coordinate axes. The size of a rectangle is the length of its longest side. We define a box to be a rectangle such that the ratio of its longest to shortest side, called its aspect ratio, is bounded by some constant, which for concreteness we take to be 2.

Each node of the BBD-tree is associated with a region of space called a cell. In particular, define a cell to be either a box or the set-theoretic difference of two boxes, one enclosed within the other. Thus each cell is defined by an outer box and an optional inner box. Each cell is associated with the set of data points lying within it. Cells are considered to be closed, and hence points lying on the boundary between cells may be assigned to either cell. The size of a cell is the size of its outer box.

Inner boxes satisfy a property called stickiness [2]. Intuitively, stickiness means that an inner box cannot be too close (relative to its own size) to any face of the outer box, unless it touches this face. More formally, an inner box is sticky if, when it is surrounded in a grid-like fashion with 3^d − 1 identical boxes, the interior of each of these boxes lies either entirely inside or entirely outside the outer box. This property is needed for technical reasons in proving the packing constraint (Lemma 2 below).
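The stickiness condition can be rendered directly in code. The following sketch (ours) surrounds the inner box with its 3^d − 1 grid translates and checks that the interior of each lies entirely inside or entirely outside the outer box; boxes are given by corner coordinate lists.

    from itertools import product

    def is_sticky(ilo, ihi, olo, ohi):
        # Inner box [ilo, ihi], outer box [olo, ohi]. Generate the 3^d - 1
        # identical translates of the inner box and test each one.
        d = len(ilo)
        side = [ihi[k] - ilo[k] for k in range(d)]
        for offs in product((-1, 0, 1), repeat=d):
            if all(o == 0 for o in offs):
                continue                        # skip the inner box itself
            tlo = [ilo[k] + offs[k] * side[k] for k in range(d)]
            thi = [ihi[k] + offs[k] * side[k] for k in range(d)]
            inside = all(olo[k] <= tlo[k] and thi[k] <= ohi[k]
                         for k in range(d))
            # Interiors are disjoint iff the boxes are separated (possibly
            # touching) along some coordinate axis.
            outside = any(thi[k] <= olo[k] or tlo[k] >= ohi[k]
                          for k in range(d))
            if not (inside or outside):
                return False
        return True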

The main properties of the BBD-tree are summarized in the following lemma.

Lemma 1 Given a set of n data points P in R^d and a bounding hypercube C for the points, in O(dn log n) time it is possible to construct a binary tree representing a hierarchical decomposition of C into cells of complexity O(d) such that

(i) The tree has O(n) nodes and depth O(log n).

(ii) The cells have bounded aspect ratio, and with every 2d levels of descent in the tree, the sizes of the associated cells decrease by at least a factor of 1/2.

(iii) Inner boxes are sticky relative to their outer boxes.
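For concreteness in the sketches that follow, a node of such a tree can be represented by a record along these lines (our naming, not the paper's):

    from dataclasses import dataclass
    from typing import List, Optional, Tuple

    Box = Tuple[List[float], List[float]]     # (low corner, high corner)

    @dataclass
    class BBDNode:
        outer: Box                            # outer box of the cell
        inner: Optional[Box] = None           # optional inner box
        left: Optional["BBDNode"] = None      # children of a split/shrink
        right: Optional["BBDNode"] = None
        points: Optional[List[List[float]]] = None   # bucket, leaves only
        weight: float = 0.0                   # weight of points in the cell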

The rest of this section presents the highlights of the BBD-tree and its construction; the details of the proofs are provided in [2]. (We present only the simpler midpoint-split form of the tree here.) The tree is constructed through the recursive application of two partitioning operations, splits and shrinks, which represent two different ways of subdividing a cell into two smaller child cells. A split partitions a cell by an axis-orthogonal hyperplane. A shrink partitions a cell by a box that lies within the original cell, producing two children, one lying inside this box and one lying outside. If a cell contains an inner box, then a split will never intersect the interior of this box. If a shrink is performed on a cell that has an inner box, then this inner box will lie entirely inside the partitioning box. The resulting cells clearly have complexity O(d).

The BBD-tree is constructed through a combination of split and shrink operations. Recall that C is a hypercube that contains all the points of P. The root of the BBD-tree is a node whose associated cell is C and whose associated set is the entire set P. The recursive construction algorithm is given a cell and a subset of data points associated with this cell. Each stage of the algorithm determines how to subdivide the current cell, either through splitting or shrinking, and then partitions the points among the child nodes. This is repeated until the cell contains at most one point (or, more practically, a small constant number of points, called the bucket size).

Given a node with more than one data point, we first consider whether we should apply splitting or shrinking.

A simple strategy (which we will assume in proving our results) is that splits and shrinks are applied alternately. This implies that both the geometric size and the number of points associated with each node decrease exponentially as we descend a constant number of levels in the tree. A more practical approach, which we have used in our implementation, is to perform splits exclusively as long as the cardinalities of the associated data sets decrease by a constant factor after a constant number of splits; if this condition is violated, a shrink is performed instead. Our experience has shown that shrinking is invoked only occasionally, but it is important for guaranteeing efficiency on highly clustered point distributions.

Once it has been determined whether to perform a split or a shrink, the splitting plane or partitioning box is computed, by a method to be described later. We store this information in the current node, create and link the two child nodes into the tree, and then partition the associated data points between these children. Data points lying on the splitting boundary may be assigned to either child. This can be done in O(dn) time by a procedure due to Vaidya [18] (see also [2]).

Splitting is performed by bisecting the box with a hyperplane orthogonal to its longest side; ties may be broken arbitrarily. The resulting cells are called midpoint boxes. (See Fig. 2(a).) This is just a binary variant of the well-known quadtree/octree splitting rule, which splits a hypercube cell into 2^d identical hypercubes of half the original size [16]; the binary version is significantly more practical in higher-dimensional spaces. It is easy to see that the resulting boxes have aspect ratio either 1 or 2. Stickiness is also easy to verify, since if an inner box has width w along some coordinate axis and does not intersect a face of the enclosing outer box, then it is at distance at least w from this face [2].
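A minimal sketch of the midpoint splitting rule just described (ours; boxes as corner coordinate lists, ties broken by the first longest side):

    def midpoint_split(lo, hi):
        # Bisect the box by a hyperplane through the midpoint of its
        # longest side, yielding the two midpoint boxes.
        k = max(range(len(lo)), key=lambda i: hi[i] - lo[i])
        mid = (lo[k] + hi[k]) / 2.0
        left_hi, right_lo = list(hi), list(lo)
        left_hi[k] = mid
        right_lo[k] = mid
        return (list(lo), left_hi), (right_lo, list(hi))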

Figure 2: Midpoint cells and centroid shrinking: (a) midpoint boxes; (b) centroid shrinking.

Shrinking is performed as part of a global operation called a centroid shrink, which generally produces up to three new nodes in the tree (two shrinking nodes and one splitting node). Let n_c denote the number of data points associated with the current cell. The goal of a centroid shrink is to decompose the current cell into a constant number of subcells, each containing at most 2n_c/3 data points.

We begin with a simplified explanation of how centroid shrinking is performed, assuming the cell has no inner box. Without altering the current tree, repeatedly apply midpoint splits, always recursing on the child having the greater number of points, until the number of points in the cell is no more than 2n_c/3. The outer box of this cell is the partitioning box for the shrink operation. (The intermediate splits are discarded.) Observe that prior to the last split we had a box with at least 2n_c/3 data points, and hence the partitioning box contains at least n_c/3 points. Thus, there are at most 2n_c/3 points either inside or outside the partitioning box. (See Fig. 2(b).)

This procedure is not efficient as described: if the points are clustered in a very small region, then the number of midpoint splits needed until the procedure terminates cannot, in general, be bounded by a function of n_c. To remedy this problem, before each split we compute the smallest midpoint box that contains the data points. (See Clarkson [10] for a solution to this problem based on floor, logarithm, and bitwise exclusive-or operations.) The very next split is then guaranteed to produce a nontrivial partition of the points.

Now suppose that the original cell has an inner box, denoted b_I. We replace the single-stage shrink described above with a 3-stage decomposition, which shrinks, then splits, then shrinks. When we compute the minimum enclosing midpoint box for the data points, we make sure that it includes b_I as well. We then apply the above iterated shrinking/splitting combination until (if ever) we first encounter a split that separates b_I from the box containing the majority of the remaining points. Let b denote the box that was just split. (See Fig. 3(b).) We create a shrinking node whose partitioning box is b: we first shrink to b, then apply this split, thus separating the majority of the points from b_I. With the inner box thus eliminated, we simply proceed with the above procedure. If no split separates b_I from the majority, then b_I will be nested within the partitioning box, which is fine.
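The simplified centroid shrink (no inner box) can be sketched as follows. This is our sketch: point_in_box is an assumed membership helper, midpoint_split is as above, and smallest_enclosing_midpoint_box stands for Clarkson's operation [10]; the intermediate splits are discarded, as described above.

    def centroid_shrink_box(box, points):
        # Returns the partitioning box for a centroid shrink: midpoint-split
        # repeatedly, recursing into the child with the majority of the
        # points, until at most 2/3 of the original points remain.
        n_c = len(points)
        while len(points) > 2 * n_c / 3:
            # Shrink to the smallest midpoint box containing the points so
            # that the next split is guaranteed to be nontrivial [10].
            box = smallest_enclosing_midpoint_box(points)
            c1, c2 = midpoint_split(*box)
            pts1 = [p for p in points if point_in_box(p, c1)]
            pts2 = [p for p in points if not point_in_box(p, c1)]
            box, points = (c1, pts1) if len(pts1) >= len(pts2) else (c2, pts2)
        return box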

Figure 3: Midpoint construction: centroid shrinking with an inner box (stages (a)–(c)).

We refer the reader to [2] for the details of the O(dn log n) running time of the construction. The fact that we perform centroid shrinks at every other level of the tree implies that the total number of nodes in the tree is O(n) and that the tree has O(log n) depth. It is easy to verify that the construction algorithm never produces two consecutive shrinks (since the 3-stage process interleaves a split between its shrinks). After any d splits, the size of each node decreases by a factor of 1/2; thus, after descending 2d levels in the tree, cell sizes decrease by at least 1/2. Finally, by a simple postorder traversal, it is possible to label each node v in the tree with the weight of its associated points, denoted weight(v). This will be used in the query processing algorithm.

Before ending this section, we present some lemmas that form the basis of our later analysis. The first is a key property of the BBD-tree, and follows from the fact that its cells are fat and its inner boxes are sticky. It was proved in [2] for Minkowski balls, but we need a slightly different formulation for our purposes.

Lemma 2 (Packing Constraint) Consider any set C of cells of the BBD-tree with pairwise disjoint interiors, each of size at least s, that intersect a range of diameter w. The size of such a set is at most (1 + ⌈2w/s⌉)^d.

Proof: From the 2:1 aspect ratio bound, the smallest side length of a box of size s is at least s/2. We first show that the number of disjoint boxes of side length at least s/2 that can overlap the range is at most (1 + ⌈2w/s⌉)^d. To see this, first scale space by a factor of 2/s, so that the range now has diameter w′ = 2w/s and each box has size at least 1. Consider a subdivision of R^d into an infinite integer grid of unit hypercubes. Observe that any range of diameter w′ can be enclosed within a closed hypercube H whose vertices coincide with the integer grid and whose side length is at most 1 + ⌈w′⌉. (To see this, observe that the projection of the range onto any coordinate axis is an interval of length at most w′, which can be contained within an integer range of width 1 + ⌈w′⌉.) There are (1 + ⌈w′⌉)^d = (1 + ⌈2w/s⌉)^d cubes of the grid enclosed within H. This is the desired upper bound, since no set of axis-aligned disjoint boxes of side length at least 1 can be packed more densely.

The above argument cannot be applied directly when shrinking cells are involved, because the outer box of a shrinking cell is not disjoint from its inner box. To complete the proof, we modify the set C, replacing each shrinking cell in the set with a box of size at least s, in such a way that the resulting boxes are disjoint; the above argument then applies.

Figure 4: Packing constraint.

For each shrinking cell in C, if its inner box b_I is not in C, then we may replace this cell with its outer box b_O. If b_I is in C, then consider the 3^d − 1 identical boxes surrounding b_I in a grid-like manner. (See Fig. 4.) All of these boxes are of size at least s. By the stickiness property, the interior of each of these boxes lies either entirely inside or entirely outside b_O. Since H intersects both boxes, it must (by convexity) intersect at least one of the surrounding 3^d − 1 boxes that lies within b_O. Replace b_O with this box in C. The replacement is of size at least s, it intersects H, and it is disjoint from b_I. After applying this to all shrinking cells, the resulting set of boxes is disjoint and satisfies the requirements of the lemma. Applying the earlier argument then proves the result. □
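As a quick sanity check of the packing constraint, the following hypothetical test (ours, not from the paper) compares the bound against a brute-force count of grid cells of side s/2, the smallest side a box of size s may have, meeting a disk of diameter w in the plane:

    import math
    import random

    def packing_bound(w, s, d):
        # Lemma 2: at most (1 + ceil(2w/s))^d cells of size >= s with
        # pairwise disjoint interiors can meet a range of diameter w.
        return (1 + math.ceil(2 * w / s)) ** d

    def grid_cells_met_by_disk(cx, cy, w, s):
        # Count cells of an axis-aligned grid with side s/2 whose closest
        # point to (cx, cy) lies within the disk of diameter w.
        g, r = s / 2.0, w / 2.0
        count = 0
        for i in range(int((cx - r) // g) - 1, int((cx + r) // g) + 2):
            for j in range(int((cy - r) // g) - 1, int((cy + r) // g) + 2):
                nx = min(max(cx, i * g), (i + 1) * g)
                ny = min(max(cy, j * g), (j + 1) * g)
                if math.hypot(cx - nx, cy - ny) <= r:
                    count += 1
        return count

    random.seed(1)
    for _ in range(1000):
        w, s = random.uniform(0.1, 2.0), random.uniform(0.1, 2.0)
        cx, cy = random.uniform(0.0, 3.0), random.uniform(0.0, 3.0)
        assert grid_cells_met_by_disk(cx, cy, w, s) <= packing_bound(w, s, 2)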

In most applications of range searching, the range is a convex set. Next we show that if we add this restriction, the exponential dependence on d decreases slightly. The relevant quantity for our later analysis will be the number of cells that intersect the boundary of the range.

Lemma 3 Consider any set C of cells of the BBD-tree with pairwise disjoint interiors, each of size at least s, that intersect the boundary of a convex range of diameter w. The size of such a set is at most 2d²(1 + ⌈2w/s⌉)^{d−1}.

Proof: Let Q denote the range, and let ∂Q denote its boundary. Following the same reasoning as in the proof of Lemma 2, it suffices to bound the number of cells of an infinite square d-dimensional grid of side length s/2 that intersect ∂Q (and then apply the same replacement argument to generalize this to shrinking cells). As before, we scale by 2/s so that the diameter of Q is w′ = 2w/s, and enclose Q in an axis-aligned hypercube H with integer coordinates, whose width is W = 1 + ⌈w′⌉. It suffices to show that ∂Q intersects at most 2d²W^{d−1} grid cells within H. Our proof is based on covering ∂Q with at most 2dW^{d−1} covering sets, such that each covering set intersects at most d grid cubes.

H has 2d faces, each of dimension d − 1. Let u_1, u_2, ..., u_{2d} denote the outward-pointing unit normal vectors of these faces. Each of these vectors has exactly one nonzero coordinate, which is either 1 or −1. The grid naturally subdivides each face of H into exactly W^{d−1} subfaces, each a (d−1)-dimensional unit hypercube, for a total of 2dW^{d−1} subfaces.

Because Q is convex, each point p ∈ ∂Q can be associated with a supporting hyperplane passing through p. (If there are many, take any one.) Let v_p denote the outward-pointing unit normal vector of this hyperplane. Given any other point q ∈ ∂Q, the angle between the vectors v_p and q − p (the vector directed from p to q) is at least π/2. Let ∠uv denote the angle (in the range 0 to π) between vectors u and v. We will make use of the fact that the angle between two unit vectors (i.e., the geodesic distance on the unit (d−1)-sphere) defines a metric on unit vectors; hence the triangle inequality holds: ∠wu + ∠uv ≥ ∠wv.

The covering sets on ∂Q arise from the orthogonal projection of each subface onto Q in the direction parallel to the face's normal. To prevent each patch from intersecting many grid cubes, we first partition ∂Q into 2d regions R_1, R_2, ..., R_{2d}, where each region is associated with a face of H. A point p is assigned to the region R_i for which the angle ∠v_p u_i is minimum. Ties may be broken arbitrarily. These regions are illustrated in Fig. 5(a). (Note that these regions need not be connected in general.)

First we claim that if p is assigned to region R_i, then the angle ∠v_p u_i is at most arccos(1/√d). That is, the angular distance from any unit vector to its nearest vector u_i cannot exceed this angle. To see this, first observe that by symmetry we may consider only vectors with positive coordinates. The closest coordinate vector (in angle) to any unit vector v corresponds to the largest coordinate of v. To maximize the angle to the nearest coordinate vector, we should minimize this maximum coordinate. The unit vector achieving the maximum has all coordinates equal to 1/√d, and it forms the angle arccos(1/√d) with its closest coordinate vector.

Consider the ith face F_i of H and any subface f lying on this face.

Figure 5: Proof of the packing lemma for convex ranges.

Consider the set of points of R_i whose orthogonal projection onto F_i lies within f. The covering set associated with f is defined to be this set of points. Thus ∂Q is covered by these 2dW^{d−1} sets. All that remains to be shown is that each covering set overlaps at most d cubes of the grid.

Consider the covering set for a subface f of the face F_i of H. The set lies within an infinite cylinder whose orthogonal cross section is f. (This is the shaded region in Fig. 5(b).) Let q and p be the closest and furthest points, respectively, of the covering set from f (or, more formally, the limit points achieving these distances). Let x denote the difference between the distances of q and p from f. We claim that it suffices to show that x ≤ d − 1, since then R_i cannot intersect more than 1 + ⌈x⌉ = d grid cubes within the cylinder.

To show that x ≤ d − 1, let z denote the vector from p to q, normalized to unit length. From the facts that the orthogonal projection of pq onto F_i lies within the subface f of diameter √(d−1), and that the orthogonal projection of pq onto u_i has length x, we have

    ∠u_i z ≤ arctan(√(d−1)/x).

Because p and q are both boundary points, it follows from the observation above that the angle ∠v_p z ≥ π/2. Since p is in region R_i, we have ∠v_p u_i ≤ arccos(1/√d). Using this and the triangle inequality for angles, we have

    arccos(1/√d) ≥ ∠v_p u_i ≥ ∠v_p z − ∠u_i z ≥ π/2 − arctan(√(d−1)/x).

Using the fact that arccos(1/√d) = arctan(√(d−1)), and some straightforward manipulations, we have x ≤ d − 1, as desired. □

3 Range Searching Algorithm.

In this section we present the algorithm for answering range queries using the BBD-tree. Let Q denote the range and w its diameter. Recall from the introduction that the inner range Q− and the outer range Q+ are the erosion and dilation, respectively, of Q by distance εw. Although we assume that ε > 0 for the purposes of the analysis, the algorithm runs correctly even if ε = 0.

We generalize the standard range search algorithm for partition trees. The main idea is simply to descend the tree and classify nodes as lying completely inside the outer range or completely outside the inner range. If a node cannot be classified, because it overlaps both ranges, then we recursively explore its children. The initial call to the recursive procedure is at the root of the BBD-tree. The procedure is shown in Fig. 6. Observe that the algorithm can easily be modified to report the set of points (rather than their weight). The correctness of this algorithm follows immediately from the following lemma, together with the observation that no point is counted twice, since the subtrees counted are disjoint.

Lemma 4 The query processing algorithm returns a count which includes all the points lying inside the inner range and excludes all the points lying outside the outer range.

Proof: To establish the claim, let T denote the set of nodes visited for which the algorithm does not make a recursive call. It is easy to see that the cells corresponding to the nodes of T form a disjoint cover of the hypercube C containing all the data points. Let p be a point lying inside the inner range Q− and let v be the node of T whose cell contains it. If v is a leaf node, then p is included in the count, since it lies within Q. Otherwise, if v is a non-leaf node then, since v does not result in a recursive call, cell(v) must lie completely inside Q+, and p is therefore included in the count returned in the algorithm's first step. In an analogous way, we can show that the points outside the outer range are not included in the count, which proves the claim. □

The main result of this section is the following theorem, which establishes the running time of the range counting algorithm. Here we assume that the range-testing primitives can be executed in constant time.

Algorithm 1 [Query Processing]

Arguments: A range Q, given with an inner range Q_I and an outer range Q_O, and a node v in the BBD-tree.

Returns: The total weight of points in the approximate range.

    function Query(Q, v):
        if cell(v) ⊆ Q_O then return weight(v);
        if cell(v) ∩ Q_I = ∅ then return 0;
        if v is a leaf then
            w = 0;
            for each p ∈ points(v) do
                if p ∈ Q then w += weight(p);
            return w;
        else
            return Query(Q, left(v)) + Query(Q, right(v));

Figure 6: Query Processing Algorithm.
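A direct Python rendering of Algorithm 1 might look as follows (our sketch, using the BBDNode record from Section 2, unit point weights, and the primitive predicates as callbacks). Applying the box primitives to the outer box alone when an inner box is present is conservative but preserves correctness, at some cost in efficiency.

    def query(v, in_range, meets_inner, inside_outer):
        # v: BBDNode. in_range(p) tests p ∈ Q; meets_inner(box) tests
        # box ∩ Q- ≠ ∅; inside_outer(box) tests box ⊆ Q+.
        if inside_outer(v.outer):            # cell(v) ⊆ Q+: count it all
            return v.weight
        if not meets_inner(v.outer):         # cell(v) ∩ Q- = ∅: contributes 0
            return 0
        if v.points is not None:             # leaf: test points directly
            return sum(1 for p in v.points if in_range(p))
        return (query(v.left, in_range, meets_inner, inside_outer)
                + query(v.right, in_range, meets_inner, inside_outer))

For a ball range, the primitives sketched in the introduction can be passed in directly, for example via closures over the ball's center, radius, and ε.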

Theorem 1 Given a BBD-tree for a set of n data points in R^d, a query range Q of diameter w, and ε > 0, ε-approximate range counting queries can be answered in O(2^d log n + (3√d/ε)^d) time. For fixed d this is O(log n + (1/ε)^d).

Proof: We start with some definitions. A node v is said to be visited if the algorithm is called with node v as its argument, and expanded if the algorithm visits the children of v. It will simplify matters to assume that everything has been scaled so that the enclosing hypercube C for the point set is a unit hypercube, implying that the cells of the BBD-tree have sizes that are powers of 1/2.

We distinguish between two kinds of expanded nodes depending on size: an expanded node v for which size(v) ≥ 2w is large, and otherwise it is small. We will show that the number of large expanded nodes is bounded by O(2^d log n) and the number of small expanded nodes is bounded by O((3√d/ε)^d). Since each node can be expanded and its children visited in constant time, it follows that the total running time is the sum of these two quantities.

We first show the bound on the number of large expanded nodes. In the descent through the BBD-tree, the sizes of nodes decrease monotonically. Consider the set of all expanded nodes of size greater than 2w. These nodes induce a subtree of the BBD-tree; let L denote the leaves of this subtree. The cells associated with the elements of L have pairwise disjoint interiors, and they intersect the range (for otherwise they would not be expanded). It follows from Lemma 2 (applied to the cells associated with L) that there are at most (1 + ⌈2w/2w⌉)^d = 2^d such cells. By Lemma 1 the depth of the tree is O(log n), and hence the total number of large expanded nodes is O(2^d log n), as desired.

Next we bound the number of small expanded nodes. First we claim that no node of size less than 2εw/√d can be expanded. For a node to be expanded, its cell must intersect both the inner range Q− and the complement of the outer range Q+. Since the boundaries of Q− and Q+ are each separated from the boundary of Q by a distance of εw, they are separated from each other by a distance of 2εw. Since the diameter of a cell of size s is at most s√d, a cell of size less than 2εw/√d cannot intersect both range boundaries, and hence cannot be expanded.

Thus, it suffices to count the number of expanded nodes of sizes from 2w down to 2εw/√d. To do this we group the nodes according to their size.

For i ≥ 0, define size group i to be the set of nodes whose cell size is 1/2^i. Since small nodes are of size less than 2w, the first size group of interest is a + 1, where 1/2^{a+1} < 2w ≤ 1/2^a, and hence a = ⌊−lg 2w⌋. Since nodes smaller than 2εw/√d are not expanded, the last size group of interest is b, where 1/2^{b+1} < 2εw/√d ≤ 1/2^b, and hence b = ⌊−lg(2εw/√d)⌋.

Because a node and its child may have the same size, we cannot apply the packing lemma directly to each size group. Define the base group for the ith size group to be the subset of nodes in the size group that are leaves or whose children are both in the next smaller size group. The cells corresponding to the nodes in a base group have pairwise disjoint interiors, since none of their descendants can be in the same base group. Applying Lemma 2, it follows that the number of nodes in the ith base group is at most

    (1 + ⌈2w/(1/2^i)⌉)^d = (1 + ⌈w·2^{i+1}⌉)^d.

From claim (ii) of Lemma 1 we know that at most 2d levels of ancestors above the base group can be in the same size group, and thus the number of nodes in any size group is at most 2d times the above quantity. Thus, the total number of expanded nodes over all of the base groups is

    E_d(w, ε) ≤ Σ_{i=a+1}^{b} (1 + ⌈w·2^{i+1}⌉)^d.

Observe that for i ≥ a + 1 we have w·2^{i+1} ≥ w·2^{a+2} ≥ 1. For any x ≥ 1, note that 1 + ⌈x⌉ ≤ 3x, and hence we have

    E_d(w, ε) ≤ Σ_{i=a+1}^{b} (3w·2^{i+1})^d ≤ (6w)^d Σ_{i=0}^{b} (2^d)^i.

Solving this geometric series yields

    E_d(w, ε) ≤ (6w)^d · ((2^d)^{b+1} − 1)/(2^d − 1)
             ≤ (6w)^d · 2^d(√d/(2εw))^d/(2^d − 1)
             = (2^d/(2^d − 1)) · (3√d/ε)^d
             ≤ 2(3√d/ε)^d.

This completes the proof. □

As mentioned earlier, range reporting queries can be answered in the same time, plus O(m), where m is the number of points reported.

Observe that the above proof is based on counting expanded nodes, which intersect both the inner and the outer range, and hence intersect the boundary of Q. For convex ranges, we can use the same proof, but apply Lemma 3 to reduce the exponential dependence on dimension slightly.

Theorem 2 Given a BBD-tree for a set of n data points in R^d, a convex query range Q of diameter w, and ε > 0, ε-approximate range counting queries can be answered in O(2^d log n + d²(3√d/ε)^{d−1}) time. For fixed d this is O(log n + (1/ε)^{d−1}).

4 Lower Bounds for Approximate Range Searching

The method we use in this paper to solve the approximate range counting problem falls under the partition tree paradigm, which is also commonly used for solving the exact version of the problem. In the context of exact range counting, Chazelle and Welzl [9] developed an interesting lower bound argument that applies to any algorithm using partition trees. In this section we develop a similar argument for the approximate problem, which will establish the optimality of our algorithm within this paradigm.

We start by reviewing the notion of a partition tree [17, 9]. We are given a set P of n data points. A partition tree is a rooted tree of bounded degree in which each node v of the tree is associated with a set of points P(v), according to the following rules.

(a) The leaves of the tree have a one-to-one correspondence with the data points.

(b) The subset of points associated with an internal node v is formed by taking the union of all the points in the leaves of the subtree rooted at v.

For simplicity we will assume that the degree is at least two; it is easy to see that the argument we develop here also holds without this assumption. With each node v we also store its weight, weight(v), defined as the cardinality of the set P(v). Given a range Q, we can search the partition tree to count the number of points inside Q by a simple recursive procedure. Chazelle and Welzl [9] have shown that in the worst case the number of nodes visited is Ω(n^{1−1/d}) for any partition tree. Using similar techniques, we show a lower bound on the number of nodes visited for the approximate version of the problem. We use the natural generalization of the recursive range search procedure that was presented in Fig. 6. Define the visiting number of a partition tree to be the maximum number of nodes visited by this algorithm over all query ranges. We show that a lower bound of Ω(log n + (1/ε)^{d−1}) holds in the worst case on the visiting number of any partition tree.

First we adapt some of the definitions of Chazelle and Welzl [9] to the approximate problem. We say that a set P(v) is stabbed if neither P(v) ⊆ Q+ nor P(v) ∩ Q− = ∅ holds; in other words, P(v) contains both a point inside Q− and a point outside Q+. We define the stabbing number of a spanning path as the maximum number of edges on the path (each edge being the set of its two endpoints) that can be stabbed by some query range. Along the lines of Lemma 3.1 in [9], we can easily establish the following.

Lemma 5 If T is any partition tree for P, then there exists a spanning path whose stabbing number does not exceed the visiting number of T.

We now exhibit a data set P_1 in d dimensions and a set of query ranges X such that any spanning path has a stabbing number of at least Ω(1/ε^{d−1}) with respect to some query range in X. We assume that the dimension d is fixed. Consider an axis-aligned infinite grid with grid spacing 4ε, and assume that the origin is a vertex of the grid. The set P_1 consists of a data point at each vertex of the grid that lies within the unit hypercube [0, 1]^d. Clearly, the number of data points is Ω(1/ε^d). Let ε′ = 4ε. The query ranges in the set X are balls in the L∞ metric of unit radius. The centers of these balls are located along the dth principal axis at distances −1 + ε′/2, −1 + 3ε′/2, ..., −ε′/2 from the origin. This gives a total of O(1/ε) query ranges. (See Fig. 7.)

Now consider any spanning path on the point set P_1. From the construction it is easy to verify that every edge on this spanning path is stabbed by some query range in X. Since the number of edges on the spanning path is Ω(1/ε^d) and |X| is O(1/ε), it follows that the average number of edges stabbed by a query range in X is Ω(1/ε^{d−1}). Therefore there must exist a query range which stabs Ω(1/ε^{d−1}) edges, and so the stabbing number of any spanning path exceeds this quantity. By Lemma 5, this is also a lower bound on the visiting number of any partition tree for P_1.

Figure 7: Lower bound for approximate range searching. Data points are shown in black; ranges have been offset slightly for clarity. (The grid spacing is 4ε.)

Let P_2 be any set of n distinct data points. We next show an Ω(log n) lower bound on the visiting number of any partition tree T for P_2. Clearly, T must have a leaf at depth Ω(log n). Let p be a data point in any such leaf. Let Q be an L∞ ball centered at p. We choose the radius of Q to be sufficiently small to ensure that its (1 + ε) expansion contains no other data point. It is easy to see that the range Q stabs the point sets corresponding to every proper ancestor of p. All such nodes are visited by the algorithm, and hence this is also a lower bound on the visiting number of T.

Combining this with the results of the last paragraph, we obtain the following worst-case lower bound on the visiting number of any partition tree. In fact, the lower bound holds even under the restriction to L∞ balls.
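The construction is easy to reproduce; the following sketch (ours) generates the point set P_1 and the L∞ ball centers of X for a given d and ε, assuming 1/(4ε) is an integer as in the construction.

    import itertools

    def lower_bound_instance(d, eps):
        # P_1: grid vertices with spacing 4*eps inside the unit hypercube.
        m = int(1 / (4 * eps)) + 1
        p1 = [tuple(4 * eps * i for i in idx)
              for idx in itertools.product(range(m), repeat=d)]
        # X: centers of unit-radius L-infinity balls on the dth axis at
        # distances -1 + eps'/2, -1 + 3*eps'/2, ..., -eps'/2 (eps' = 4*eps).
        epsp = 4 * eps
        centers = [tuple([0.0] * (d - 1) + [-1 + (2 * j + 1) * epsp / 2])
                   for j in range(int(1 / epsp))]
        return p1, centers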

Theorem 3 For the set of query ranges consisting of balls in the L∞ metric, the visiting number of any partition tree in the worst case is Ω(log n + 1/ε^{d−1}).

The theorem implies the optimality of our algorithm in the partition tree paradigm for convex ranges, and near optimality for the more general class of query ranges discussed in the introduction.

5 Experimental Results

To show the savings possible if one is willing to settle for approximations instead of requiring exact counts, we implemented our algorithm and tested it on a number of data sets of various sizes and distributions, and with various sizes and types of ranges. To enhance performance, we implemented a variation of the data structure described in Section 2. Rather than using the strict midpoint rule, we use a somewhat more flexible decomposition method, called the fair-split rule [2]. Intuitively, this splitting rule attempts to partition the point set of each box as evenly as possible, subject to maintaining boxes of bounded aspect ratio. Our decomposition process also attempts to avoid centroid shrinking whenever it is not warranted, because there are optimizations that can be performed at splitting nodes that are not possible at shrinking nodes [2]. Our experience with the related approximate nearest neighbor searching problem has shown that, while these features are needed for highly clustered data sets, they are rarely needed for the sorts of naturally arising distributions we consider here. Throughout, we used a bucket size (the maximum number of points associated with any leaf cell) of eight.

We ran our program for approximate range counting with ε ranging from 0 (exact searches) to 0.5. Our experiments were conducted for data points drawn from a number of distributions. We present the following, since they were most representative.

Uniform: Each coordinate was chosen uniformly from the interval [0, 1].

Clustered Gaussian: Ten points were chosen uniformly from the unit hypercube, and the data points were drawn from Gaussian distributions with standard deviation 0.05 centered at each of them.

Correlated Laplacian: An autoregressive source generating successive outputs by the recurrence X_n = ρX_{n−1} + W_n, where W_n is chosen so that the marginal density of X_n is Laplacian with unit variance. The correlation coefficient ρ was taken to be 0.9. See [12] for more information.
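For reference, here is a sketch (ours) of generators for the three distributions, with one independent stream per coordinate for the autoregressive source. Driving the AR(1) recurrence with Laplace innovations of variance 1 − ρ² matches the unit variance and the correlation but only approximates the exact Laplacian marginal; the exact innovation law is derived in [12].

    import math
    import random

    def uniform_points(n, d):
        return [[random.random() for _ in range(d)] for _ in range(n)]

    def clustered_gaussian_points(n, d, clusters=10, sigma=0.05):
        centers = uniform_points(clusters, d)
        return [[random.gauss(ck, sigma) for ck in random.choice(centers)]
                for _ in range(n)]

    def correlated_laplacian_points(n, d, rho=0.9):
        # X_n = rho * X_{n-1} + W_n with Laplace innovations; a Laplace(0, b)
        # variate is the difference of two exponentials with mean b, and
        # variance 2*b^2 = 1 - rho^2 keeps Var(X_n) = 1.
        b = math.sqrt((1 - rho ** 2) / 2)
        pts, x = [], [0.0] * d
        for _ in range(n):
            x = [rho * xi + random.expovariate(1 / b) - random.expovariate(1 / b)
                 for xi in x]
            pts.append(list(x))
        return pts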

For each distribution we generated data sets ranging in size from 2^6 = 64 to 2^16 = 65,536 points. Experiments were run in dimensions 2 and 3, and the query ranges were either L2 balls (circles) or L∞ balls (squares). We show only the results for dimension 2 with circular ranges. We tested radii ranging from 1/256 to 1/2. For each experiment we fixed ε and the radius of the query balls, and measured a number of statistics averaged over 1,000 queries. The center of each query ball was chosen from the same distribution as the data points.

In Figures 8, 9, and 10 we show the average number of nodes visited as a function of ε for each of the distributions. The number of data points is 65,536. Since the algorithm does a constant amount of work for each node visited, the number of nodes visited accurately reflects its running time. (We also measured floating-point operations, and found that in dimension 2 there were on average from 10 to 20 floating-point operations for each node visited.)

Figure 8: Number of nodes visited versus ε for the uniform distribution.

Figure 9: Number of nodes visited versus ε for the clustered Gaussian distribution.

Figure 10: Number of nodes visited versus ε for the correlated Laplacian distribution.

As predicted by our analysis, as ε increases from 0 to even a small value such as 0.1, there are significant improvements in running time (factors as high as 10 to 1, and often around 4 to 1) for the larger ranges. In light of the results of this paper the reason is clear: the complexity of approximate range searching grows only logarithmically with n, whereas the best known algorithms for the exact problem have running times that grow as n^{1/2} in dimension 2 (even under the assumption of uniformly distributed data). As ε grows, the running times tend to converge, irrespective of radius. Improvements for smaller ranges were not as significant, because the running times on small ranges are uniformly small. Results for square ranges were similar, as were the results in 3-space, although there the improvements were not quite as dramatic.

We also measured the actual average error committed by the algorithm, which is defined as follows. Consider a range of radius r and a point at distance r′ from its center. If r′ ≤ r but the point was classified as lying outside the range, the associated misclassification error is (r − r′)/r; if r′ > r but the point was classified as lying inside the range, the error is (r′ − r)/r. (For example, a point at distance r′ = 1.05r that is counted as inside contributes a relative error of 0.05.) By definition, there can be no classification error greater than ε, but the algorithm may do better than this. To see how much better it does on average, we measured this relative error for every misclassified point, and averaged it over all the points which were eligible for misclassification (that is, points lying in the difference of the outer and inner ranges). This quantity is the average error of the query; if no points were eligible for misclassification, it is zero.

In Figures 11, 12, and 13 we show the average error as a function of ε for 65,536 data points. The key observation is that the average error appears to vary almost linearly with ε (depending on distribution, dimension, and other factors). In dimension 2, average errors were frequently less than 0.06, and over all distributions the average error was never greater than 0.1. These bounds were observed across all distributions tested, for both circular and square ranges, and in both dimensions 2 and 3. This explains, in part, why we ran experiments with such large values of ε: even with ε as large as 0.5 (allowing a maximum 50% error), we often observed much smaller average errors, in the range of 1.5% to 3%.

Figure 11: Average error versus ε for the uniform distribution.

Figure 12: Average error versus ε for the clustered Gaussian distribution.

Figure 13: Average error versus ε for the correlated Laplacian distribution.

6 Conclusions

We have presented a data structure for answering approximate range queries, where the error allowed by the algorithm is a function of the diameter of the range. We have shown that in any fixed dimension d, a set of n points in R^d can be preprocessed in O(n log n) time and O(n) space, such that approximate queries can be answered in O(log n + (1/ε)^d) time; for convex ranges the running time is O(log n + (1/ε)^{d−1}). We also presented a lower bound of Ω(log n + (1/ε)^{d−1}) for approximate range searching in the partition tree model, which implies that our algorithm is asymptotically optimal for convex ranges (assuming fixed dimensions). The algorithm is quite practical, and has been implemented.

There are a number of interesting open problems to consider. The first involves the relatively high exponential dependence on dimension: can these exponential factors be reduced, say, along the lines of recent research in the area of approximate nearest neighbor searching? Another intriguing question arises from the fact that the algorithm's observed average error tends to be much lower than the allowed error factor ε. Is there a good explanation for this phenomenon?

References

[1] S. Arya and D. M. Mount. Algorithms for fast vector quantization. In Proc. of DCC '93: Data Compression Conference, eds. J. A. Storer and M. Cohn, IEEE Press, pages 381–390, 1993.

[2] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching. Journal of the ACM, 45:891–923, 1998.

[3] J. L. Bentley. K-d trees for semidynamic point sets. In Proc. 6th Ann. ACM Sympos. Comput. Geom., pages 187–197, 1990.

[4] M. Bern. Approximate closest-point queries in high dimensions. Inform. Process. Lett., 45:95–99, 1993.

[5] H. Brönnimann, B. Chazelle, and J. Pach. How hard is halfspace range searching? Discrete Comput. Geom., 10:143–155, 1993.

[6] P. B. Callahan and S. R. Kosaraju. A decomposition of multi-dimensional point-sets with applications to k-nearest-neighbors and n-body potential fields. In Proc. 24th Annu. ACM Sympos. Theory Comput., pages 546–556, 1992.

[7] P. B. Callahan and S. R. Kosaraju. Algorithms for dynamic closest pair and n-body potential fields. In Proc. 6th ACM-SIAM Sympos. Discrete Algorithms, 1995.

[8] B. Chazelle. Lower bounds on the complexity of polytope range searching. J. Amer. Math. Soc., 2:637–666, 1989.

[9] B. Chazelle and E. Welzl. Quasi-optimal range searching in spaces of finite VC-dimension. Discrete Comput. Geom., 4:467–489, 1989.

[10] K. L. Clarkson. Fast algorithms for the all nearest neighbors problem. In Proc. 24th Ann. IEEE Sympos. Found. Comput. Sci., pages 226–232, 1983.

[11] C. A. Duncan, M. T. Goodrich, and S. G. Kobourov. Balanced aspect ratio trees: Combining the advantages of k-d trees and octrees. In Proc. 10th ACM-SIAM Sympos. Discrete Algorithms, 1999.

[12] N. Farvardin and J. W. Modestino. Rate-distortion performance of DPCM schemes for autoregressive sources. IEEE Transactions on Information Theory, 31:402–418, 1985.

[13] J. Matoušek. Range searching with efficient hierarchical cuttings. Discrete Comput. Geom., 10(2):157–182, 1993.

[14] M. H. Overmars. Point location in fat subdivisions. Inform. Process. Lett., 44:261–265, 1992.

[15] F. P. Preparata and M. I. Shamos. Computational Geometry: an Introduction. Springer-Verlag, New York, NY, 1985.

[16] H. Samet. The Design and Analysis of Spatial Data Structures. Addison Wesley, Reading, MA, 1990.

[17] D. E. Willard. Polygon retrieval. SIAM J. Comput., 11:149–165, 1982.

[18] P. M. Vaidya. An O(n log n) algorithm for the all-nearest-neighbors problem. Discrete Comput. Geom., 4:101–115, 1989.
