<<

Data Structures for Range Searching

JON LOUIS BENTLEY Departments of Computer Sctence and Mathematics, Carnegte-Mellon Unwerslty, Pittsburgh, Pennsylvanta 15213

JEROME H. FRIEDMAN Computatmn Research Group, Stanford Lmear Accelerator Center, Stanford, Cahfornia 94305

Much research has recently been devoted to "multikey" searching problems. In this paper the partmular multlkey problem of range searching Is investigated and a number of data structures that have been proposed as solutions to this problem are surveyed. The purposes of this paper are to bring together a collection of widely scattered results, to acquaint the reader with the structures currently avadable for solving the particular problem of range searching, and to display a set of general methods for attacking multikey searching problems.

Keywords and Phrases: analysis of algorithms, orthogonal range queries, range searching, cells, multidimensional binary search trees, projection CR Categorws. 3.63, 3.74, 5.25

INTRODUCTION tems, statistics, and design automation) and, in addition, serves as a representative The study of data structures for facilitating of the entire class of multikey searching rapid searching is a fascinating subject of problems. both practical and theoretical interest. We need some definitions to describe this Knuth [KNUT73] provides a definitive trea- particular searching problem precisely. In tise on the subject of searching when the database terminology a file is a collection search is based on only one "key," but he of records, each containing several attri- points out that not much was known at the butes or keys. A query asks for all records time his book was published about data satisfying certain characteristics. An or- structures for sets that have many "keys." thogonal range query asks for all records This subject area, which is often called with key values each within specified "multikey searching," "multidimensional ranges (that is, each key is between speci- searching," or "multiple attribute re- fied upper and lower bounds). The process trieval," has been the focus of a great deal of retrieving the appropriate records is of research in the past few years. In this called range searching. This problem can paper we study a small part of this area by also be cast in geometric terms by regarding surveying the work that has been done on the record attributes as coordinates and the one particular multikey searching problem. k values for each record as representing a This problem is important in itself (having in a k-dimensional coordinate space. applications in such areas as database sys- The file of records then becomes a point set in k-space. The intersection of the query ranges is a k-dimensional hyperrectangle in Thin research was supported m part by the Office of Naval Research under Contract N00014-76-C-0370 the space {that is, a "box"), and a range and m part by the Department of Energy query calls for finding all points lying inside

Permission to copy without fee all or part of this materml is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notme and the title of the pubhcatlon and its date appear, and notme is gwen that copying is by permlsmon of the Association for Computing Machinery. To copy otherwme, or to repubhsh, reqmres a fee and/or specific permission © 1979 ACM 0010-4892/79/1200-0397 $00 75

Computing Surveys, Voi. 11, No. 4, December 1979 398 • J.L. Bentley and J. H. Friedman CONTENTS that gathers together and presents in a common terminology a number of results that have recently appeared on the problem of range searching. This problem is of par- INTRODUCTION ticular interest for two reasons: First, it is 1 THE DATA STRUCTURES 1 1 Sequentml Scan an important problem in many practical 1 2 Projection applications (and a difficult theoretical 1 3 Cells problem!); second, the methods that we 14 k-d Trees investigate are broadly applicable to many 1 5 Range Trees other multikey searching problems. The 1 6 k-ranges 1 7 Other Structures second type of reader for whom this paper 1 8 Comparison of Methods is intended is a computer scientist who is 2 ADDITIONAL WORK somewhat familiar with data structures for 3 CONCLUSIONS single-key searching, and who would like a REFERENCES tutorial on the problem of range searching. For this reader, the methods that we dis- T cuss are described on an intuitive level, and references are given to more precise de- scriptions elsewhere in the literature. In Section 1 of this paper we examine six this hyperrectangle. We will often cast data structures for the range searching range searching in this geometric frame- problem in some detail, and then briefly work as an aid to intuition. compare those structures at the end of the Range searching arises in many applica- section. Additional work (that both has tions. In a geographic database of U.S. been done and needs to be done) is de- cities one might seek a list of all those with scribed in Section 2, and conclusions are latitude between 37 ° and 41 ° and longitude then offered in Section 3. between 102 ° and 109 ° (defining the state of Colorado). To compile an honor list of older students, a university administrator 1. THE DATA STRUCTURES may wish to know those students whose In this section we investigate a number of age is between 21 and 24 years and whose search methods for range searching. Each grade point average is between 3.5 and 4.0. search method is specified by a data struc- In data analysis it is often useful to do ture for storing the data and algorithms for separate analyses on sets of data lying in building (which we call preprocessing) and different regions (hyperrectangles) of the searching the structure. We will analyze a observation space and then compare (or search structure (say A) by giving three contrast) the respective results. (At the cost functions of N (the number of points) Stanford Linear Accelerator Center, for ex- and k (the number of dimensions): ample, over 10 hours per week of IBM 370/ 168 time is devoted to this application.) In • PA(N, k), the cost of preprocessing N statistics, range searching can be employed points in k-space into a data structure; to determine the empirical probability con- • SA(N, k), the storage required by the tent of a hyperrectangle, to determine em- data structure; pirical cumulative distributions, and to per- • QA(N, k), the search time or query cost. form density estimation (see LOFT65). Lauther [LAuT78] describes how range These costs can be analyzed in terms of searching can be used to solve a design their average or their worst case; we usually automation problem in very large-scale in- speak of the worst-case cost, explicitly men- tegrated circuitry (VLSI). tioning the average whenever we employ it. This paper has been written with two In many applications one may desire var- distinct audiences in mind. For the expert ious utility operations on data structures, in searching {with background either in such as insertion and deletion. In this sec- database systems or theoretical computer tion we ignore this issue, considering only science), this paper is intended as a survey static (unchanging) files; we then return to

Computing Surveys, Vol 11, No. 4, December 1979 Data Structures for Range Searching . 399

! I i y I ! I I !I r I ! a • o i n t g I tI t I t ! e I I i. i I I I I I I I I I !/,d, ' I I I ! D ! I I I / I I I I I I I I I I I I :>" 2' ' I O I I #; /,',, ,,' f I I I o I I U • I O I I/r Q I I I I t I I I I ! I I I I I I l I j/q /~d" ' ! I l I I I I I I I a I I I ¢" I f !

FmuR~ 1. IUustration of projectmn the question of dynamic structures in Sec- 1.2 Projection tion 2. The projection technique involves keeping, 1.1 Sequential Scan for each attribute, a sequence of the records in the file sorted by that attribute. One can The simplest approach to range searching view this geometrically as a projection of is to store the N points in a sequential list. the points on each coordinate. The k lists As each query arrives, all elements of the representing the projections can be ob- list are scanned and every record that sat- tained by using a standard sorting algo- isfies the query is reported. If the queries rithm k times. After preprocessing, a range do not have to be handled immediately, query can be answered by the following then they can be "batched" so that many search procedure: Choose one of the attri- queries can be processed with one sequen- butes, say the ith. Look up the two positions tial pass through the file. Since all k keys of in the ith sequence (using a binary search) the N records must be stored and each of the extreme values defining the range on k-key record is examined as the structure is the ~th attribute of the query. All records built or searched, it is easy to see that the satisfying the query will be in the list be- sequential scan structure SS has the prop- tween these two positions just found. This erties (smaller} list is then searched by brute Pss(N, k) = O(Nk), force. The projection technique is referred Sss(N, k) = O(Nk), to as inverted lists by Knuth [KNUT73]. This technique was applied by Friedman, Qss(N, k) = O(Nk). Baskett, and Shustek [FRIE75] in their so- Sequential scanning has the advantage of lution of the "nearest neighbor" problem being trivial to implement on any storage and by Lee, Chin, and Chang [LEEC76] to medium. It is competitive with the more a number of database problems. sophisticated methods described in this pa- The projection technique is illustrated in per when the file is small and the number Figure 1. The points represent a set of of attributes is large, or when a large frac- sixteen records of two keys each, repre- tion of the records in the file satisfy the sented by the x- and y-coordinates. The query (or queries, if they are batched). dashed lines are the projection of the rec-

Computing Surveys, Vol. 11, No. 4, December 1979 400 • J. L. Bentley and J. H. Friedman ords onto the x-coordinate (that is, the ropolitan areas are often printed in the form records sorted into x-order). The vertical of books. The first page of the book shows slab is the x-range of the query, the hori- the entire area, and the remaining pages zontal slab is the y-range, and the rectangle are detailed maps of (say) one-mile-square that is their intersection contains those regions. To find (for example) all schools in points which satisfy the query. To answer a specified rectangle, one would look at the this query, we need only investigate the six first page to find which squares overlap the points that are inside the vertical slab rectangle and then check only on those marked by the 45 ° lines. pages of the book to find the schools. This One can apply the projection technique approach can be mechanized immediately. with only one sorted list (projection). If the A square of the map corresponds to a cell distribution of values of the various attri- in k-space, and the points of the file within butes is more or less uniform over similar the cell are stored together in an implemen- ranges and the query ranges of each attri- tation. The first page of the map book bute are similar, then one list is sufficient. corresponds to a directory that allows one If this is not the case, however, then keep- to take a hyperrectangle and look up the ing several lists can often lead to substantial set of cells. reductions in the query time. The multiple The cell technique is illustrated in Figure projections are exploited by performing two 2. The sixteen points in that figure repre- binary searches in each to find the lower sent sixteen records containing two keys and upper bounds of the respective range, each. The points in each cell are stored and then searching that projection with the together in an implementation. The query smallest number of records in the range. is given by the rectangle in the upper part The cost analysis of projection is of the figure, and to answer it, only those straightforward. To preprocess a file of N points in the four dashed cells need be records of k keys each, we must perform k investigated. The squares in that figure are sorts of N elements. To store such a file, we the "directory" corresponding to the first must store k lists of N elements each. These page of the map book. facts immediately yield The directory can be implemented in two ways. If the points are (say) uniformly dis- Pp(N, k) = O(kN log N), tributed on [0, 10] 2 and we have chosen Sp(N, k) = O(kN). 1 × 1 cells, then we can use a two-dimen- Friedman, Baskett, and Shustek [FreE75] sional array as the directory, named DI- show that for searches that have almost RECT (0.. 9, 0.. 9). In DIRECT (i,j) we cubical query regions and find a small num- would keep a pointer to a list of all points ber of records (and are therefore similar to in the cell [t, t + 1] × [j,j + 1]. If we nearest neighbor searches), the query time wanted to find all points in [5.2, 6.3] × [1.2, of projection is given by 3.4], then we would only have to examine cells (5, 1), (5, 2), (5, 3), (6, 1), (6, 2), and (6, Qp(N, k) = O(N H/k) (average case) 3)--we call this "translating" from a range when the point set is drawn from a smooth query to a set of cell id's. The multidimen- underlying distribution. The projection sional array works very well when the technique is most effective when the quer- points are known a priori to be uniformly ies almost always contain one range that distributed over some given rectangle in the excludes most of the file. key space. When this is not known to be the case, one would probably use a search 1.3 Cells method, such as hashing, for the directory. There are two ways they can search [for the murder In this method we name each cell as before; weapon] from the body outward m a spiral, or dwlde so cell (t, j) is a pointer to the points in the room up Into squares--that's the grid method. [i, i + 1] × [j,j + 1]. Instead of storing all From the CBS series Kojak, cells, however, we store only those cells "Death Is Not a Passing Grade" that actually contain records of the file. To Cartographers as well as detectives use the process a query, we translate the rectangle grid (or cell) method. Street maps of met- into a set of cell id's (as we did above), look

Computing Surveys, Vol 11, No. 4, December 1979 Data Structures for Range Searching 401

f f / / / / / / / / / / / / • / / / / f / / tO / / / I / f, / # / D/ / / / / / / /: # / •/ / g / / / g / / # ql # .¢ / / /0 # / / / / / / / / / / .J II /

}

FIGURE 2 Illustration of cells up those id's, and check all the points in the other hand, there will be very many cell occupied cells for inclusion in the rectangle. accesses and very few inclusion tests. The storage required for the cell technique Clearly, either extreme is to be avoided. is the storage for the directory plus loca- The best cell size and shape depend on tions for the linked list representing points the size and shape of the query hyperrec- in cells; the size of the directory is usually tangle. Bentley, Stanat, and Williams much smaller than N. [BENT77] show that if the query hyperrec- Knuth [KNUT73] has discussed this tangles have constant size and shape so scheme for the two-dimensional case. Lev- that only their location (in the coordinate inthal [LEVI66] used a cell technique in space) is unspecified, then for a single grid three-dimensional Euclidean space for de- a nearly optimum size and shape for the termining all atoms within 5 angstroms of cells are the same as those of the query every atom in a protein molecule--he re- hyperrectangle. For this case the number ferred to this as "cubing." The idea of using of cells accessed is 2 k, and the expected hashing for the cell directory was first de- search time is proportional to 2 k times the scribed by Yuval [YUVA75], and was later number of points in the range. In this con- used by Rabin [RAm76] to solve the "clos- text the performance of cells is given by est pair" problem. Bentley, Stanat, and P~(N, k) -- O(Nk), Williams [BENT77] discuss a number of different implementations for the directory S~(N, k) ffi O(Nk), (two of which we have seen). Qc(N, k) = O(2 k F) (average), The basic parameters of the cell tech- where F is the number of records found. In nique are the size and shape of each cell. In most applications, however, the queries will analyzing a search there are two costs to vary in size and shape as well as in location, count: cell accesses (the number of direc- so there is little information available for tory look-ups) and inclusion tests (testing making a good choice of cell size and shape. whether a point satisfies the range query). If the cell size is extremely large, there will 1.4 k-d Trees be few cell accesses and many inclusion In this section we examine a data structure tests. If the cell size is very small, on the called the "k-dimensional binary search

Computing Surveys, Vol 11, No. 4, December 1979 402 , J. L. Bentley and J. H. Friedman tree," which is usually abbreviated as "k-d The partitioning value is chosen to be the tree." This structure is a natural generaliza- median value of this attribute. This algo- tion of the standard one-dimensional binary rithm is then applied recursively to the two search tree, so we will briefly review a spe- subcollections represented by the two sons cial type of that structure (a complete de- of the node just partitioned. The partition- scription of binary trees can be found in ing is stopped, creating a terminal node (or KNUT73). To build a file of single-key rec- bucket), when the cardinality of the sub- ords into a binary search tree, we choose collection is less than a prespecified maxi- the median of the set as the discriminator mum, which is a parameter of the proce- value and build all records with key values dure. (Friedman, Bentley, and Finkel less than or equal to the discriminator into [FRIE77] found empirically that values the left subtree of the root (recursively) and ranging from 8 to 16 work well in a Fortran all elements with greater key values into implementation.) The result of this proce- the right subtree. This process continues dure is that the coordinate space is divided recursively until there are only a few (say into a number of buckets, each containing six or less} nodes in the set, at which point approximately the same number of points we store them as a linked list. Note that no (by the stopping criterion) and each ap- records are stored in the internal nodes of proximately "cubical" in shape (by choos- such a binary search tree; they are con- ing as discriminator the dimension of max- tained only in the leaf nodes or "buckets" imum spread, which slowly chops long and at the bottom of the tree. We can answer a skinny rectangles into cubes). range search in this structure by a recursive Range searching with k-d trees is algorithm that compares the range to the straightforward. Starting at the root, the discriminator of the node it is currently k-d tree is recursively searched in the fol- visiting. If the range is entirely to one side lowing manner. When visiting a node that or the other of the discriminator, only the discriminates by the flh key (which we call appropriate son is searched; otherwise both a j-discriminator), one compares the jth sons are searched recursively. range of the query with the discriminator The single-key binary search tree per- value. If the query range is totally above forms three functions at once: It stores the (or below) that value, then one need only records of the file (in the external nodes, or search the right subtree (respectively, left) "buckets"), it divides the data space into of that node; the other son can be pruned segments (by choosing the discriminators), from the search because any node it con- and it gives a directory among the segments tains does not satisfy the query in that (the tree structure). We now investigate a particular key. If the query range overlaps multidimensional generalization of the bi- the node's key (that is, the key is between nary search tree that performs these same the low and high bounds of the range), then three functions: storing the records, divid- both sons need be searched. This can be ing space into hyperrectangles, and provid- accomplished by searching both sons recur- ing a directory among the hyperrectangles. sively (the search being implemented by a It accomplishes this by using the same idea stack). as the one-dimensional algorithm with one The application of k-d trees to (two-di- critical exception: In the one-dimensional mensional) range searching is illustrated in tree we only have one key to use as the Figure 3. The k-d tree is depicted in two discriminator; in a multidimensional tree ways: Figure 3a shows the structure in 2- we have to choose at each internal node space, and Figure 3b shows the abstract one of k keys to use as a discriminator. tree. The root of the tree is internal node The algorithm for constructing a k-d tree A; it is an x-discriminator. The vertical is to choose for the discriminator that co- in the right part of the figure labeled A is ordinate j for which the spread of attribute the discriminating line. That is, every point values (as measured by any convenient sta- to the left of that vertical line is in the left tistic, such as variance or distance from subtree of A {with B as root), and every minimum to maximum) is maximum for point to the right is in the subtree with root the subcollection represented by the node. C. This partitioning continues recursively,

Computing Surveys, Vol 11, No 4, December 1979 Data Structures for Range Searching • 403

" A U F G

E o •

.,,,.

13 •

/x h I / x i

D •

• • : / " ~ 'l• D ,0 • / / / • i f •

(a)

A

B C

(b) FmURE 3. Illustratmn of k-d trees a) Planar representatmn, b) tree representation and the resulting cells (buckets) in this tree [BENT75a], the discriminators are chosen each contain two points. The query rectan- cyclically (that is, the root is discriminated gle is illustrated in Figure 3a, and the search by the first key, its sons by the second, and for all points within the rectangle is illus- so on). The idea of "adaptive partitioning" trated in both figures. The search starts at was proposed by Friedman, Bentley, and the root, and since the query, rectangle is Finkel [FRIE77] and makes the k-d tree a entirely to the right of the vertical line structure very "sensitive" to the particular defined by A, the left subtree of A {with B file that it represents. The application of as root) can be pruned from the search. k-d trees to a host of problems can be found This is illustrated in Figure 3b by the per- in BENT79b, GOTL78, and SILv78b. pendicular line through the son link from A Analysis of k-d trees for range searching to B. The search continues, searching both has been considered by several researchers. sons of C, both sons of F, and only the left The work required to construct a k-d tree son of G. A total of three buckets are and its storage requirements (see BENT79b) searched; these buckets are dashed in the are planar representation and are marked by an S in the tree representation. Pk(N, k) = O(N log N), In the k-d tree as introduced by Bentley Sk(N, k) ffi O(Nk).

Computing Surveys, Vol. 11, No. 4, December 1979 404 • J. L. Bentley and J. H. Friedman

The search cost depends on the nature of sional structure and then generalize that the query. Lee and Wong [LEEW80] have structure to higher dimensions. shown that in the worst case, The simplest structure for one-dimen- sional range searching is a sorted array. Qk(N, k) < O(N H/k + F) The preprocessing sorts the N elements in where F is the number of points found in ascending order by key. To answer a range the region. If the query range is almost query, we do two binary searches to find cubical and the number of records that the positions of the low and high end of the satisfies the query is small (so that the range in the array. After these two positions range query is similar to a nearest neighbor have been found, we can list all the points search), then Friedman, Bentley, and Fin- in that part of the array as the answer to kel's [FRIE77] analysis shows that the range query. (Note that this is precisely the projection method applied to the one- Qk(N, k) = O(log N + F) dimensional problem.) For this structure we use linear storage and O(N log N) pre- (average case for small answer). processing time. The two binary searches For the case where a large fraction of the each cost O(log N), and the cost of listing file satisfies the query, Bentley and Stanat the points found in the region will, of course, be proportional to the number of [BENT75b] and Silva-Filho [Smv78a] show such points. Letting F be the number of that points found in the region, we have Qk(N, k) = O(F) Pr(N, 1) = O(N log N), (average case for large answer). Sr(N, 1) = O(N), The k-d tree structure is most effective in Q~(N, 1) = O(log N + F). situations where little is known about the nature of the queries or a wide variety of We will now build a two-dimensional queries are expected. It is also useful if , using as a tool the one-dimen- other types of queries (in addition to range sional sorted arrays (SA's) we described queries) are anticipated; many other quer- above. The range tree is similar to the ies supported by k-d trees are discussed by "binary search trees" described by Knuth Bentley [BENT79b]. [KNUT73, Sect. 6.2], so we will use his ter- minology in our discussions. The range tree is a rooted binary tree in which every node 1.5 Range Trees has a left son, a right son, a discriminating A number of very similar structures for value (all nodes in the left subtree have a range searching (of primarily theoretical discriminating value less than the node's), rather than practical interest) have recently and (unlike a regular binary search tree) been described by Lueker [LuEK78], Lee every node contains an SA. The root of the and Wong [LEEW80], and Willard range tree contains an SA (sorted by [WILL78a]. In this section we investigate y-coordinate) and has as a discriminating the range tree, a structure introduced by value the median x-value for all points. The Bentley [BENT79a] that is also similar to left subtree of the root has an SA containing the former structures. It achieves the best the N/2 points with x-value less than me- worst-case search time of all the structures dian sorted by y-coordinate. Similarly, the we have seen so far in this paper, but has right son of the root represents the N/2 relatively high preprocessing and storage points with x-value greater than the median costs. For most applications the high stor- and has an SA of those points sorted by age will be prohibitive, but the range tree y-coordinate. This partitioning continues so is very interesting from a theoretical view- that i levels away from the root we have 2' point. Since the range tree is defined recur- subtrees, each representing N/2 L points sively in dimension (that is, the k-dimen- contiguous in the x-dimension and each sional structure is defined in terms of the containing an SA of the points sorted by (k - 1)-dimensional structure), we begin y-coordinate. This partitioning continues our discussion by looking at a one-dimen- for a total of (approximately) log N levels;

Computing Surveys, Vol 11, No 4, December 1979 Data Structures for Range Searching • 405 we handle small point sets (say, less than a The range tree structure is very interest- dozen points) by brute force. ing from a theoretical viewpoint. The The search algorithm for a range tree is #symptotic search time is very fast, but the most easily described recursively. Each amount of storage used is usually prohibi- node in the tree represents a range in the tive in practice. Although the application of x-dimension from the least x-value con- this structure to practical problems will tained in the subtree to the greatest. When probably be limited to cases when k ffi 2 or visiting a node, we compare the x-range of 3, it does provide an important theoretical the query to the range of the node, and if benchmark. It also gives us an interesting the node's range is entirely within the technique (recursion in dimension) that query's, then we search that structure's SA might yield fruit in practice. {Indeed, there for all points in the query's y-range and are some very interesting relationships be- return. If the query's range does not wholly tween range trees and the k-d trees of Sec- contain the node's, then we compare the tion 1.4.) query's x-range to the node's discriminator value. If the range is entirely below the 1.6 k-ranges discriminator, we recursively visit the left The k-range is an efficient worst-case struc- subtree; if it is above, we visit the right; and ture for range searching introduced by if the range overlaps the discriminator, then Bentley and Maurer [BENT80b]. They de- we visit both subtrees. veloped two types of k-ranges, overlapping The analysis of the planar tree is rather and nonoverlapping. Both of these struc- complicated. Since there are log N levels tures involve storing sets of lists of points in the tree and N points are stored on sorted by different coordinates; additional each level, the total storage required is dimensions are added recursively, much O(N log N). The preprocessing can be per- like the range trees of the last section. Be- formed in O(N log N) time if clever tech- cause k-ranges are rather complicated to niques are employed. Analysis shows that describe and are of primarily theoretical at most two SA searches are done on each interest, we will not describe them here but level of the tree {each of cost approximately only mention their performance. The over- log N), so the total cost for a search is lapping k-ranges can be made to have per- O(log 2 N) plus the time for listing the formance points in the region. Letting F stand, as before, for the total number of points found Po(N, k) ffi O(N~+~), in the region we have So(N, k) = O(N'+~), Pr(N, 2) = O(N log N), Qo(N, k) = O(log N + F) Sr(N, 2) = O(N log N), for any e > 0. It is pleasing to note that the Qr(N, 2) = O(log '~ N + F). constants "hidden" in the O's of the above If we step back for a moment, we can see equations are just k/E. Overlapping k- how we built the structure: We constructed ranges have very efficient retrieval time but a two-dimensional structure by building a somewhat high preprocessing and storage tree of one-dimensional structures. We can costs; their dual structures, nonoverlapping perform essentially the same operation to k-ranges, have very efficient preprocessing yield a three-dimensional structure: We and storage costs but increased query construct a tree containing two-dimen- times. Their performance is sional structures in the nodes. This process Pn(N, k) = O(N log N), can be continued to yield a structure for Sn(N, k) = O(N), k-dimensions, which will be a tree contain- ing (k - D-dimensional structures. This Q°(N, k) = O(N), will yield a structure with performances for any fixed ¢ > 0. The details of these Pr(N, k) = O(Nlog k-I N), structures can be found in BENT80b. Al- though these structures were developed pri- S~(N, k) = O(N logk-l N), marily as a theoretical device, they might Qr(N, k) = O(log k N + F). prove efficient in some implementations

Computing Surveys, Vol. 11, No 4, December 19t9 406 • J. L. Bentley and J. H. Friedman

(Their primary drawback is that their space Bentley [BENT80a] have investigated a requirements are high, and space is usually number of searching problems defined on a critical resource.) sets of points in k-dimensional space. Rivest [RIVE76] provides a number of interesting 1.7 Other Structures data structures for answering "partial- match" queries, which are essentially range In the previous sections we have investi- queries in a file in which the keys assume gated six structures for the range searching discrete values. For discussions of efficient problem that (in the authors' opinion) dom- search methods in the context of database inate other structures proposed for this systems, the reader is referred to such pa- problem. In this section we briefly investi- pers as LIou77, SHNE77, YANG77, and gate some of these other structures. YANG78. Knuth [KNUT73] points out that the no- tion of cells can be applied recursively. That 1.8 Comparison of Methods is, when one of the cubes has more than some certain number of points, the cube is In Sections 1.1 through 1.6 we have dis- further divided into subcubes of yet smaller cussed six structures for range searching. size. This scheme implies a multidimen- The performances of these six structures sional tree with multiway branching. In (seven including the two variants of k- terms of both the partitioning imposed on ranges) are summarized in Table 1, which the space and the ease of implementation, shows the preprocessing, storage, and query this idea seems to be dominated by a data costs of each structure. All the functions in structure called the quad tree. that table reflect worst-case costs, except The quad tree was first described by Fin- those query costs that are footnoted. For kel and Bentley [FINK74]. It is a generaliza- those functions the probabilistic assump- tion of the standard binary search tree, in tions are described in the notes. which every node has 2 h sons. Bentley and Four of these six structures (sequential Stanat [BENT75b] analyzed the perform- scan, projection, cells, and k-d trees) have ance of quad trees for "square" range been presented as providing practical solu- searches in uniform planar point sets, and tions to the range searching problem. For Linn [LINN73] discussed the fact that quad each structure there are situations in which trees (which he called "search-sort k trees" ) it is clearly superior and other situations have advantages over binary trees when where it performs badly. In this section we used in a synchronized multiprocessor sys- will mention some of these situations and tem. This application aside, however, the compare the performance of the four quad tree seems to be dominated by the methods. k-d trees of Section 1.4. If the file is small and the number of A great deal of work has been done re- attributes large, if the f'fle is to be searched cently on multikey searching problems that only a few times, or if the queries can be are similar in flavor to the range searching batched so that nearly all the records in the problem. Dobkin and Lipton [DOBK76] and file satisfy at least one, then sequential scan

TABLE 1. Performance of Data Structures for Range Searchmg Structure P(N, k) S(N, k) Q(N, k) Sequential scan O{N} O(N) O(N) Projection O(N log N) O(N) O(N 1-1/* + F) a~) Cells O(N) O(N) O(F) a(z) k-d trees O(N log N) O(N) O(N H/k + F) O(log N + F) a()) Nonoverlappmg k-ranges O(N log N) O(N) O(N ~ + F) Range trees O(N logk-lN) O(N log *-1 N) O(log * N + F) Overlapping k-ranges O(N ~+~) O(N ~+~) O(log N + F) a Query times that indmate average case analysis Probabihstm assumptions are (1) Smooth data sets--very small query region. (2) Any data set--cell size equals query size. (3) Smooth data set.

Computing Surveys, Vol. 11, No 4, December 1979 Data Structures for Range Searching • 407 is the method of choice. In other cases one tain dynamically, and so is the projection of the more sophisticated methods is likely structure using methods for maintaining to be more efficient. Projection does best one-dimensional sorted lists described by when the query range on one of the attri- Knuth [KNUT73]. The cell technique can butes is usually sufficient to eliminate support insertions and deletions by merely nearly all the File records. For this case the keeping a linked list of the points in each low overhead of searching this structure cell and inserting or deleting the new or old allows it to dominate the others. In situa- record in the appropriate list. Dynamic k-d tions where several or many of the attri- trees are a more subtle problem and have butes serve to restrict the range query, the been discussed by Bentley [BENT79b] and projection technique performs relatively Willard [WILL78b]. poorly. Considerable research remains to be Both the cell and k-d tree structures are done in the development of heuristics for appropriate in situations where the query aiding the search methods we have seen. restricts several of the attributes. If the For example, if the range queries in a seven- approximate size and shape of the queries dimensional problem almost always involve are roughly constant and known in ad- only two of the attributes, then the design vance, then cells defined by a fixed grid of the structure should involve only those with size and shape similar to those of the two attributes. Heuristics for detecting expected queries is most advantageous. For these and other similar situations would be queries with sizes and shapes that differ very helpful. Techniques described by Ben- considerably from the design, however, per- tley and Burkhard [BENT76] might prove formance can be quite poor. useful in such an investigation. The k-d tree structure is characterized by Our discussion of all of the data struc- its robustness to wildly varying queries. tures has been for the case in which they The cell design adapts to the distribution are implemented in primary memory. of the attribute values of the file records in Many applications (particularly databases) the k-dimensional coordinate space. The inherently involve secondary storage media cells all contain very nearly the same num- such as disks and tapes. All the structures ber of records; there are no empty cells. In of Section i can be efficiently implemented dense regions there are many cells and a on such mediaJ correspondingly fine division of the coordi- Several researchers have recently consid- nate space; in sparse regions there is a ered an interesting generalization of the coarser division with fewer cells. For most range searching problem, which calls for applications of range searching that are not adding a range restriction to an existing characterized in the preceding paragraphs, data structure. That is, we already have k-d trees are likely to be the method of some structure for performing a particular choice. type of query, and we want to have the capability of saying "perform that query on 2. ADDITIONAL WORK all records in which this key lies in that range." Bentley [BENT79a], Lueker Our discussion of the data structures in [LUEK79], and Willard [WILL78a] have de- Section 1 is on a very abstract conceptual veloped a number of transformations on level, and we have ignored many problems data structures that allow one to add the that arise in actual applications of range range restriction capability. (These trans- searching. In this section we briefly exam- formations actually led to the discovery of ine some of those problems and the solu- both the range tree and the k-range data tions that have been proposed to handle structures of Section 1.) Although the stor- them. age requirements of the resulting structures All files that we have discussed so far seem to be too high to make them of im- have been static; that is, they represent unchanging files. Many applications, how- ever, require dynamic structures, in which 1 For details of these Implementations, the reader is insertions and deletions can be made. The referred to BENT78 whmh m an earlier version of thin sequential scan structure is easy to main- paper

Computing Surveys, Vol. 11, No. 4, December 1979 408 * J.L. Bentley and J. H. Friedman mediate practical interest, this approach is sion in Proc. Computer Science and Sta- tistics: llth Ann. Symp. on the Interface, a novel attack on the problem of construct- March 1978, pp. 297-307. ing data structures for range searching. BENT79a BENTLEY, J.L. "Decomposable search- ing problems," Inf. Process. Lett. 8, 5 An interesting theoretical problem that (June 1979), 133-136. could prove to be of practical value is prov- BENT79b BENTLEY, J. L. "Multidimensional bi- ing lower bounds on the complexity of the nary search trees in database applica- tions," IEEE Trans Softw. Eng SE-5, 4 range searching problem. Saxe [SAXE79] (July 1979), 333-340. has investigated this problem using the BENT80a BENTLEY, J. L "Multidimensional di- standard "decision tree" model of concrete vide-and-conquer," to appear m Comm. ACM. complexity theory and has shown that BENTS0b BENTLEY, J. L, ANn MAURER, H. A. k-ranges have optimal worst-case query "Efficient worst-case data structures for times. These k-ranges have very high stor- range searching," to appear m Acta Inf. DOBK76 DOBKIN, D, AND LIPTON, R. J "Multl- age requirements, however; so it would be dimensional searching problems," SIAM very desirable to have lower bounds that J. Comput. 5, 2 (1976), 181-186. FINK74 FINKEL, R. A, AND BENTLEY, J. L. make stronger statements of the form, "if "Quad trees--a data structure for re- you only use this much storage and prepro- trieval on composite keys," Acta Inf 4, 1 cessing, then this is the fastest search time (1974), 1-9. FRED79 FREDMAN, M. "A near optimal data you can have." Fredman [FRED79] has re- structure for a type of range query prob- cently made progress in this direction. An- lem," in Proc. llth ACM Symp. Theory of other interesting open problem is to show Computing, May 1979, pp. 62-66. FRIE75 FRIEDMAN, J H., BASKETT, F., AND SHUS- lower bounds on the average complexity, TEK, L. J "An algorithm for finding rather than just the worst-case complexity. nearest neighbors," IEEE Trans Corn- put. C-24, 10 (Oct. 1975), 1000-1006. FR~E77 FRIEDMAN, J H., BENTLEY, J L., AND 3. CONCLUSIONS FINKEL, R.A. "An algorithm for finding best matches m logarithmic expected In this paper we have investigated a num- time," ACM Trans. Math. Softw. 3, 3 ber of data structure for the range searching (Sept. 1977), 209-226. GOTL78 GOTLIEB, C. C., AND GOTLIEB, L. R problem. In 1973 Knuth [KNuT73, p. 554] Data types and structures, Prenhce-Hall, was able to write that "no really nice data Englewood Cliffs, N.J, pp. 357-363 KNOT73 KNUTH, D.E. The art of computer pro- structures seem to exist" for the problem of gramm~ng, vol. 3- sorting and searching, range searching. In this paper we have tried Addison-Wesley, Reading, Mass., 1973. to show that this situation has changed in LAUT78 LAUTHER, U. "4-d~mensmnal binary search trees as a means to speed up asso- the interim, and that these changes can ciative searches in design rule verification have a substantial impact on both the the- of integrated circuits," J Des. Autom ory and practice of multikey searching. Fault-Tolerant Comput. 2, 3 (July 1978), 241-247. LEEC76 LEE, R. C. T., CHIN, Y. H, AND CHANG, REFERENCES S.C. "Application of principal compo- nent analysm to mulh-key searching," BENT75a BENTLEY, J. L. "Multidimensmnal bi- IEEE Trans. Softw. Eng. SE-2, 3 (Sept nary search trees used for assocmtive 1976), 185-193. searching," Comm ACM 18, 9 (Sept. LEEW78 LEE, D. T, AND WONG, C K. "Worst- 1975), 509-517 case analysis for region and partial region BENT75b BENTLEY, J. L., AND STANAT, D F. searches In multidimensional binary "Analysis of range searches m quad search trees and quad trees," Acta Inf 9, trees," Inf Process Lett 3, 6 (July 1975}, 1 (1978), 23-29. 170-173 LEEW80 LEE, D. T., AND TONG, C.K. "Qulntary BENT76 BENTLEY, J L, AND BURKHARD, W A. trees' a f'de structure for multidimensional "Heunstms for partial match retrieval database systems," to appear m ACM data base design," Inf Process. Lett 4, 5 Trans. Database Syst {Feb 1976), 132-135.' LEVI66 LEVINTHAL, C. "Molecular model-build- BENT77 BENTLEY, J L., STANAT, D. F., AND WIL- ing by computer," Sc~ Am 214 (June LIAMS, E. H JR "The complexity of 1966), ~2-52. fixed-radius near neighbor searching," Inf LINN73 LINN, J. General methods for parallel Process. Lett. 6, 6 (Dec. 1977), 209-212. searchtng, Tech Rep. 61, Digital Systems BENT78 BENTLEY, J L., AND FRIEDMAN, J H. A Lab, Stanford U., Stanford, Cahf, May survey of algortthms and data structures 1973. for range searching, Carnegm-Mellon LIOU77 LIOU, J H., AND YAO, S B "Multi-di- Computer Science Rep CMU-CS-78-136 mensional clustering for data base orga- and Stanford hnear Accelerator Center nization," Inf. Syst. 2 (1977}, 187-198. Rep SLAC-PUB-2189, prehmlnary ver- LOFT65 LOFTSGAARDEN, D. O, AND QUESEN-

Computing Surveys, Vo| 11, No 4, December t979 Data Structures for Range Searching • 409

BERRY, C.P. "A nonparametric density indexes for efficient multiple attribute re- function," Ann. Math. Stat. 36 (1965), trieval," Inf. Syst. 2 (1977), 149-154. 1049-1051 SILV78a SILVA-FILHO, Y.V. Average case analo LUEK78 LUEKER, G. "A data structure for or- ysls of regton search m balanced k-d thogonal range queries," m Proe 19th trees, Rep., U. of Kent, Canterbury, Eng- Syrup Foundattons of Computer Sctence, land, Nov. 1978. IEEE, Oct. 1978, pp. 28-34 SILv78b S[LVA-F~LHO, Y V. Mult~dimenstonal LUEK79 LUEKER, G. "A transformation for add- search trees as radices of files, Rep., U mg range restriction capabdlty to dynamtc of Kent, Canterbury, England, Dec 1978. data structures for decomposable search- WILL78a WILLARD, D. E. Predicate-oriented mg problems," Tech. Rep. 129, U of Cal- database search algorithms, Rep. TR-20- iforma at Irvine, 1979. 78, Harvard U. Aiken Lab., 1978. RABi76 RABIN, M O "Probabdlstm algo- WILL 78b WILLARD, D. E. "Balanced forests of rithms," m Agortthms and complexity. k-d* trees as a dynamic data structure," new dwectlons and recent results, J. F reformative abstract, Harvard U., Boston, Traub (Ed.), Academm Press, New York, Mass, 1978. 1976, pp. 21-39. YANG77 YANG, C. "Avoiding redundant record RIVE76 RIVEST, R L. "Parhal match retrieval accesses in unsorted multilist f'de organi- algorithms," SIAM J Comput. 5, 1 zations," Inf Syst. 2 (1977), 155-158. (March 1976), 19-50 YANt~78 YANG, C. "A class of hybrid list file or- SAXE79 SAXE, J. B "On the number of range gamzations," Inf. Syst. 3 (1978), 49-58. querms m k-space," to appear m Dtscrete YUVA75 YUVAL, G. "Finding near neighbors m Appl Math k-dimensional space," Inf. Process. Left. SHNE77 SHNEIDERMAN, B. "Reduced combined 3, 4 (March 1975), 113-114

RECEIVED JANUARY 1979; FINAL REVISION ACCEPTED AUGUST 1979.

Computing Surveys, Vol. 11, No 4, December 1979