Query Processing in Spatial Network Databases
Dimitris Papadias, Department of Computer Science, Hong Kong University of Science and Technology, Clearwater Bay, Hong Kong
Jun Zhang, Department of Computer Science, Hong Kong University of Science and Technology, Clearwater Bay, Hong Kong
Nikos Mamoulis, Department of Computer Science and Information Systems, University of Hong Kong, Pokfulam Road, Hong Kong
Yufei Tao, Department of Computer Science, City University of Hong Kong, Tat Chee Avenue, Hong Kong
Abstract

Despite the importance of spatial networks in real-life applications, most of the spatial database literature focuses on Euclidean spaces. In this paper we propose an architecture that integrates network and Euclidean information, capturing pragmatic constraints. Based on this architecture, we develop a Euclidean restriction and a network expansion framework that take advantage of location and connectivity to efficiently prune the search space. These frameworks are successfully applied to the most popular spatial queries, namely nearest neighbors, range search, closest pairs and e-distance joins, in the context of spatial network databases.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 29th VLDB Conference, Berlin, Germany, 2003

1. Introduction

Spatial databases have been well studied in the last 20 years, resulting in the development of numerous conceptual models, multi-dimensional indexes and query processing techniques [RSV02]. Surprisingly, most existing work considers Cartesian (typically, Euclidean) spaces, where the distance between two objects is determined solely by their relative position in space. However, in practice, objects can usually move only on a pre-defined set of trajectories as specified by the underlying network (road, railway, river etc.). Thus, the important measure is the network distance, i.e., the length of the shortest trajectory connecting two objects, rather than their Euclidean distance. Previous work on spatial network databases (SNDB) is scarce and too restrictive for emerging applications such as mobile computing and location-based commerce. This necessitates the development of novel and comprehensive query processing methods for SNDB.

Every conventional spatial query type (e.g., nearest neighbors, range search, spatial joins and closest pairs) has a counterpart in SNDB. Consider, for instance, the road network of Figure 1.1, where the rectangles correspond to hotels. If a user at location q poses the range query "find the hotels within a 15km range", the result will contain a, b and c (the numbers in the figure correspond to network distance). Similarly, a nearest neighbor query will return hotel b. Note that the results of the corresponding conventional queries are different (e.g., the Euclidean nearest neighbor is d, which is actually the farthest hotel in the network). Furthermore, queries may combine both location and network aspects, such as "find the nearest hotel to the south" (e.g., hotel a).

Figure 1.1: Road network query example

A crucial pre-requisite for solving these queries in SNDB is a realistic architecture, which captures spatial entities (e.g., hotels) and the underlying network, preserving both Cartesian co-ordinates and connectivity. In addition, this architecture must take into account real-life constraints, such as unidirectional roads, "off-network" (but still reachable) entities, etc. Furthermore, although the network is almost static, the entities may change with high frequencies (e.g., a new/existing hotel opens/closes). It is also possible that entire entity sets are added as more information or services become available (e.g., a new "restaurant" dataset is incorporated in the system).

In this paper we propose a flexible architecture for SNDB, by separating the network from the entity datasets. In particular, we employ a disk-based network representation that preserves connectivity and location, while spatial entities are indexed by respective spatial access methods for supporting Euclidean queries and dynamic updates. Using this architecture, we develop two frameworks, Euclidean restriction and network expansion, for processing the most common spatial queries, namely nearest neighbors, range search, closest pairs and distance joins. The resulting algorithms expand conventional processing techniques by integrating connectivity and location information for efficient pruning of the search space. To the best of our knowledge, this is the first work dealing with the efficient processing of spatial queries in SNDB.

The rest of the paper is organized as follows: Section 2 overviews related work. Section 3 describes our architecture and its application in real-life scenarios. Sections 4 and 5 present algorithms for nearest neighbor and range search queries, respectively, while Sections 6 and 7 discuss closest pairs and spatial joins. Section 8 evaluates the proposed techniques with comprehensive experiments, and Section 9 concludes the paper with a discussion.

2. Related Work

In this section we overview previous work related to location (Section 2.1) and connectivity (Section 2.2) representation and processing in databases.

2.1 Spatial Query Processing in the Euclidean Space

R-trees [G84, SRF87, BKS+90] are the most popular indexes for Euclidean query processing due to their simplicity and efficiency. The R-tree can be viewed as a multi-dimensional extension of the B-tree. Figure 2.1 shows an exemplary R-tree for a set of points {a,b,…,j} assuming a capacity of three entries per node. Points that are close in space (e.g., a,b) are clustered in the same leaf node (E3) represented as a minimum bounding rectangle (MBR). Nodes are then recursively grouped together following the same principle until the top level, which consists of a single root.

Figure 2.1: An R-tree example

R-trees (like most spatial access methods) were motivated by the need to efficiently process range queries, where the range usually corresponds to a rectangular window or a circular area around a query point. The R-tree answers the query q (shaded area) in Figure 2.1 as follows. The root is first retrieved and the entries (e.g., E1, E2) that intersect the range are recursively searched because they may contain qualifying points. Non-intersecting entries (e.g., E3) are skipped. Note that for non-point data (e.g., lines, polygons), the R-tree provides just a filter step to prune non-qualifying objects. The output of this phase has to pass through a refinement step that examines the actual object representation to determine the actual result. The concept of filter and refinement steps applies to all spatial queries on non-point objects.

A nearest neighbor (NN) query retrieves the (k≥1) data point(s) closest to a query point q. The R-tree NN algorithm proposed in [HS99] keeps a heap with the entries of the nodes visited so far. Initially, the heap contains the entries of the root sorted according to their minimum distance (mindist) from q. The entry with the minimum mindist in the heap (E1 in Figure 2.1) is expanded, i.e., it is removed from the heap and its children (E3, E4, E5) are added together with their mindist. The next entry visited is E2 (its mindist is currently the minimum in the heap), followed by E6, where the actual result (h) is found and the algorithm terminates, because the mindist of all entries in the heap is greater than the distance of h. The algorithm can be easily extended for the retrieval of k nearest neighbors (kNN). Furthermore, it is optimal (it visits only the nodes necessary for obtaining the nearest neighbors) and incremental, i.e., it reports neighbors in ascending order of their distance to the query point, and can be applied when the number k of nearest neighbors to be retrieved is not known in advance.

An intersection join retrieves all intersecting object pairs (s,t) from two datasets S and T. If both S and T are indexed by R-trees, the R-tree join algorithm [BKS93] traverses synchronously the two trees, following entry pairs that overlap; non-intersecting pairs cannot lead to solutions at the lower levels. Several spatial join algorithms have been proposed for the case where only one of the inputs is indexed by an R-tree or no input is indexed [RSV02]. For point datasets, where intersection joins are meaningless, the corresponding problem is the e-distance join, which finds all pairs of objects (s,t) s ∈ S, t ∈ T within (Euclidean) distance e from each other. R-tree join can be applied in this case as well, the only difference being that a pair of intermediate entries is followed if their distance is below (or equal to) e. The intersection join can be considered as a special case of the e-distance join, where e=0.

Finally, a closest-pairs query outputs the (k≥1) pairs of objects (s,t) s ∈ S, t ∈ T with the smallest (Euclidean) distance. The algorithms for processing such queries [CMTV00] combine spatial joins with nearest neighbor search. In particular, assuming that both datasets are indexed by R-trees, the trees are traversed synchronously, following the entry pairs with the minimum distance. Pruning is based on the mindist metric, but this time defined between entry MBRs. As all these algorithms apply only location-based metrics to prune the search space, they are inapplicable for SNDB.

2.2 Disk-based Graph Representations and Algorithms

A graph is usually represented either as a 2D matrix (where each entry corresponds to an edge between a pair of nodes), or an adjacency list. Adjacency lists are preferable for applications, such as road networks, where the graphs are sparse. The main issue for adapting this representation to secondary memory is how to cluster lists of adjacent nodes in the same disk page, in order to take advantage of the access locality and minimize the I/O.

The connectivity-clustered access method (CCAM) [SL97] generates a single-dimensional ordering of the nodes (using Z-ordering) and stores the lists of neighbor nodes together. Figure 2.2 shows a graph example and its CCAM structure, assuming that three adjacency lists fit in one page. The lists of the neighboring (in the Z-order) graph nodes n1, n3 and n5 are stored in page p1, and the lists of the remaining ones in page p2. Each entry in the list (e.g., l6) of a node (n6) contains an adjacent node (n2) and the corresponding network distance (3). In order to efficiently retrieve the adjacency list li of a node ni, the list pages are indexed by a B+-tree on the node id. An alternative technique for clustering graph nodes according to their proximity in space is proposed in [HJR96].

Figure 2.2: A graph example and its CCAM structure ((a) A graph example, (b) The CCAM structure)

CCAM and similar architectures only preserve connectivity (but not location) information; thus, they are applicable only to shortest path and other graph traversal algorithms (but not conventional spatial query processing). The most popular shortest path algorithm, proposed by Dijkstra [D59], starts from the source and expands the route towards the destination, using a priority queue to store visited nodes (sorted according to their distance from the source node). Several variants of this algorithm [CLR90] differ on how they manage the priority queue. The A* algorithm [KHI+86] applies heuristics to prune the search space and direct the graph expansion. Materialization techniques accelerate shortest path processing (at the expense of space requirements) by using pre-computed results stored in materialized views [ADJ90, IRW93, JP96, JHR98]. The performance of secondary-memory adaptations of shortest path algorithms has been analyzed in [J92, SKC93].

The only existing approach that integrates spatial and connectivity information [HJR97] uses thematic spatial constraints to restrict the permitted paths (e.g., "find the shortest path that passes only through rural areas") and is inapplicable to general query processing. Finally, [SKS02] deal with nearest neighbor queries in road networks by transforming the problem to high dimensional space. Their solution is approximate and specific to this problem. On the other hand, the architecture presented in the next section supports all the counterparts of conventional queries in the SNDB context.

3. Architecture

We assume a digitization process that generates a modeling graph from an input spatial network. Considering the road network in the introduction, the graph nodes generated by this process are: (i) the network junctions (e.g., the black points in Figure 3.1a), (ii) the starting/ending point of a road segment (white), and (iii) depending on the application, additional points (gray) such as the ones where the curvature or speed limit changes. The graph edges preserve the connectivity in the original network. Figure 3.1b shows the (modeling) graph for the network of Figure 3.1a; nodes at the boundary of the data space and the network distance of most edges are omitted for clarity.

In the sequel we use the term edge/segment to denote a

Figure 3.1: Graph modeling of the road network ((a) A road network, (b) The modeling graph)
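The priority-queue expansion that Section 2.2 attributes to Dijkstra's algorithm, run over an adjacency-list form of the modeling graph, could be sketched as follows. This is a minimal in-memory sketch under stated assumptions, not the disk-based implementation of the paper: the names `GRAPH` and `network_distance` and the offset dictionaries are hypothetical, and the edge lengths are inferred from the queue states and materialized distances quoted in the Figure 3.4 example of this section. The sketch seeds the queue with both endpoints of the segment that covers the source point, which is one way to handle points that fall on edges rather than on nodes.

```python
import heapq

# Undirected modeling graph as adjacency lists: node -> [(neighbor, edge length)].
# Edge lengths are an assumption, inferred from the Figure 3.4 walk-through.
GRAPH = {
    "n1": [("n2", 17), ("n3", 8), ("n4", 25)],
    "n2": [("n1", 17), ("n4", 13)],
    "n3": [("n1", 8), ("n5", 10)],
    "n4": [("n1", 25), ("n2", 13), ("n6", 4)],
    "n5": [("n3", 10), ("n6", 5)],
    "n6": [("n4", 4), ("n5", 5)],
}


def network_distance(graph, source_offsets, target_offsets):
    """Dijkstra expansion between two points that may lie on edges.

    source_offsets / target_offsets map each endpoint of the covering
    segment to the point's distance from that endpoint along the segment.
    The case where both points lie on the same segment is not handled.
    """
    dist = dict(source_offsets)
    queue = [(d, node) for node, d in source_offsets.items()]
    heapq.heapify(queue)
    settled = set()
    best = float("inf")
    while queue:
        d, node = heapq.heappop(queue)
        if node in settled:
            continue
        if d >= best:
            # No remaining node can yield a shorter path to the target.
            break
        settled.add(node)
        if node in target_offsets:
            best = min(best, d + target_offsets[node])
        for neighbor, length in graph[node]:
            candidate = d + length
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return best


# p1 lies on n1n2 (5 from n1, 12 from n2); p2 lies on n5n6 (3 from n5, 2 from n6).
d = network_distance(GRAPH, {"n1": 5, "n2": 12}, {"n5": 3, "n6": 2})
print(d)
```

With these inferred lengths the call reproduces the distance reported for the Figure 3.4 example, and passing a single node with offset 0 on each side gives plain node-to-node distances.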
Figure 3.2: Example of pragmatic constraint ((a) A road junction, (b) The modeling graph)

In order to simplify the presentation, we describe our architecture for the basic functionality, where nodes have identical types and edges only store network distance. We separate the spatial entities (e.g., hotels) from the underlying network, by indexing each entity dataset using an R-tree. This division has many advantages: (i) all conventional (Euclidean) queries, which do not require the network, can be efficiently processed by the R-trees, (ii) as shown later, queries combining network and Euclidean aspects are supported, (iii) dynamic updates in each dataset are handled independently, (iv) new/existing datasets can be added to/removed from the system easily, and (v) specific optimizations can be applied to each individual (network or entity) dataset.

The network storage scheme consists of three components. The adjacency component captures the network connectivity. The adjacency lists of the nodes close in space (according to their Hilbert values [1]) are placed in the same disk page. In Figure 3.3 (based on Figure 3.1b), the list l1 of n1 consists of 3 entries, one for each of its connected nodes (n2, n3, n4) (ignoring nodes outside the boundary). The first entry (for edge n1n2) has the form

[1] We apply Hilbert ordering, instead of the Z-ordering used by CCAM, because it achieves better locality.

Figure 3.3: Example of the proposed architecture

The poly-line component stores the detailed poly-line representation of each segment in the network. A poly-line entry ninj also includes a pair of pointers to the disk pages containing the adjacency lists of its endpoints ni, nj.

The last component is a network R-tree that indexes the poly-lines' MBRs and supports queries exploring the spatial properties of the network. Each leaf entry contains a pointer to the disk page storing the corresponding poly-line. The architecture supports the following primitive operations for SNDB:

(i) check_entity(seg, p) is a Boolean function that returns true if point (entity) p lies on the network segment seg (we say that seg covers p). In accordance with the conventional spatial databases methodology, the MBR of seg is used for filtering and its poly-line representation for refinement. Due to approximation or digitization errors it is possible that, although point p actually lies on seg, its stored co-ordinates may deviate from the segment. This situation can be handled by defining an (application-dependent) threshold dT, such that if p is within distance dT from seg, it is assumed to lie on it.

(ii) find_segment(p): outputs the segment that covers point p by performing a point location query on the network R-tree. If multiple segments cover p, the first one found is returned. This function is applied whenever a query is issued, to locate the segment on which the query point lies. If the query point does not lie on any segment, we can "snap" it to the closest one assuming incomplete information (e.g., an un-recorded alley), or we can consider it unreachable depending on the application specifications. Although uncertainty handling in SNDB is an interesting topic, for the sake of simplicity, in the following discussion we assume that each entity and query point falls on at least one network segment.

(iii) find_entities(seg): returns entities covered by segment seg. Specifically it first finds all the candidate entities that lie in the MBR of seg, and then eliminates the false hits using the poly-line of seg.

(iv) compute_ND(p1,p2): returns the network distance dN(p1,p2) of two arbitrary points p1, p2 in the network, by applying a (secondary-memory) algorithm to compute the shortest path from p1 to p2. Specifically, we chose to adapt Dijkstra's algorithm because it is simple, efficient and exhibits access locality, reducing the number of page faults during the retrieval of adjacency lists. However, Dijkstra's algorithm assumes that the source (i.e., query) and the destination (i.e., data point) fall on network nodes, while in our scenario points may fall on edges (or be assigned to edges if approximation is used).

Consider, for example, the computation of dN(p1,p2) in the modeling graph of Figure 3.4, where n1,...,n6 denote the nodes. The algorithm first invokes find_segment to return the segments n1n2 covering p1, and n5n6 covering p2. Then, it calculates the distance from p1 to n1 and n2 (5 and 12) using the poly-line n1n2, and initiates a priority queue Q=<(n1,5),(n2,12)>. The first entry n1 is de-queued and its adjacent nodes (n3, n4) are inserted into Q, together with their accumulated distance from p1, i.e., Q=<(n2,12), (n3,13), (n4,30)>. After the expansion of n2, the queue becomes Q=<(n3,13),(n4,25)> (the distance to n4 decreases), and after the expansion of n3, Q=<(n5,23), (n4,25)>. Now the next node to expand is n5. Since p2 can be reached from n5 with cost 3, we insert it, and the queue becomes Q=<(n4,25), (p2,26)>. Similarly, after the expansion of n4, Q=<(p2,26), (n6,29)> and the algorithm terminates with dN(p1,p2)=26. If p1 or p2 fall on multiple segments, then by the definition of the graph they correspond to graph nodes, in which case the algorithm is still applicable. The same is true for networks containing unidirectional segments.

Figure 3.4: Illustration of fundamental operations

Network distance computations can be facilitated by materialization of pre-computed results (e.g., [ADJ90, JHR98]). In Figure 3.4, for example, dN(p1,p2) can be obtained by fetching from the materialized view dN(n1,n5)=18, dN(n1,n6)=23, dN(n2,n5)=22, dN(n2,n6)=17, with four disk accesses using a hash function. Although materialization can be incorporated as an additional module in our architecture, we chose not to include it in the basic functionality due to the huge space requirements for large spatial networks.

Next, we discuss all common spatial queries using this architecture. For each query type we propose two algorithms based on the Euclidean restriction and network expansion frameworks, respectively. Euclidean restriction takes advantage of the Euclidean lower-bound property to prune the search space. On the other hand, the network expansion framework performs query processing directly on the network. Since our aim is to illustrate the general methodology, we intentionally keep the algorithms simple and only present essential optimization techniques wherever necessary. Furthermore, we only describe the basic query forms; as discussed in Section 9, variations such as "find the nearest hotels to the south" can be easily processed by the proposed techniques.

4. Nearest Neighbors in SNDB

Given a source point q and an entity dataset S, a k nearest neighbor (kNN) query retrieves the k (≥1) objects of S closest to q according to the network distance (e.g., find the hotel within the shortest driving distance). Sections 4.1 and 4.2 present two algorithms for nearest neighbor queries in SNDB, based on the Euclidean restriction and network expansion frameworks, respectively.

4.1 Incremental Euclidean Restriction

The Incremental Euclidean Restriction (IER) algorithm applies the multi-step kNN methodology [FRM94, SK98], traditionally used for high-dimensional similarity retrieval. Specifically, assuming that only one NN is required, IER first retrieves the Euclidean nearest neighbor pE1 of q, using an incremental kNN algorithm (e.g., [HS99], see Section 2.1) on the entity R-tree of S. Then, the network distance dN(q,pE1) of pE1 is computed (by compute_ND(q,pE1)). Due to the Euclidean lower-bound property, objects closer (to q) than pE1 in the network should be within Euclidean distance dEmax=dN(q,pE1) from q, i.e., they should lie in the shaded area of Figure 4.1a. In Figure 4.1b, the second Euclidean NN pE2 is then retrieved (within the dEmax range). Since

Figure 4.1: Finding the NN pE2 ((a) 1st Euclidean NN, (b) 2nd Euclidean NN)

The extension to k nearest neighbors is straightforward. The k Euclidean NNs are first obtained using the entity R-tree, sorted in ascending order of their network distance to q, and dEmax is set to the distance of the k-th point. Similar to the single NN case, the subsequent Euclidean neighbors are retrieved incrementally, while maintaining the k (network) NNs and dEmax (except that dEmax equals the network distance of the k-th neighbor), until the next Euclidean NN has larger Euclidean distance than dEmax. Figure 4.2 illustrates the pseudo-code of IER.

Algorithm IER (q, k) /* q is the query point */
1. {p1,...,pk}=Euclidean_NN(q,k);
2. for each entity pi
3.    dN(q,pi)=compute_ND(q,pi)
4. sort {p1,...,pk} in ascending order of dN(q,pi)
5. dEmax=dN(q,pk)
6. repeat
7.    (p,dE(q,p))=next_Euclidean_NN(q);
8.    if (dN(q,p)

n1n2 (using the primitive operation find_entities). Since no point is covered by n1n2, the node (n1) closest to the query is expanded (while the second endpoint n2 of n1n2 is placed in a queue Q). No data point is found in n1n7 and n7 is inserted to Q=<(n2,5), (n7,12)>. The expansion of n2 reaches n4 and n3, after which Q=<(n4,7), (n3,9), (n7,12)> and point p5 is discovered on n2n4 (while no point is found on n2n3). The distance dN(q,p5)=6 provides a bound dNmax to restrict the search space. The algorithm terminates now since the next entry n4 in Q has larger distance (i.e., 7) than dNmax. Figure 4.4 shows the pseudocode of INE.

Algorithm INE (q, k)
1. ninj=find_segment(q)
2. Scover=find_entities(ninj); // Scover is the set of entities covered by ninj
3. {p1,...,pk}=the k (network) nearest entities in Scover sorted in ascending order of their network distance (pm, pm+1,...,pk may be ∅ if Scover contains < k points)
4. dNmax=dN(q,pk) // if pk=∅, dNmax=∞

3. ninj=find_segment(q)
4. Q=<(ni,dN(q,ni)), (nj,dN(q,nj))>
5. de-queue the node n in Q with the smallest dN(q,n)
6. while (dN(q,n)≤e and S'≠∅)
7.    for each non-visited adjacent node nx of n
8.       for each point s of S'
9.          if check_entity(nxn,s)
10.            result=result∪{s}; S'=S'−{s}
11.      en-queue (nx,dN(q,nx))
12.   de-queue the next node n in Q
13. end while
End RER

Figure 5.1: Range Euclidean Restriction

5.2 Range Network Expansion

The Range Network Expansion (RNE) algorithm first computes the set QS of qualifying segments within network range e from q and then retrieves the data entities falling on these segments. The methodology is similar to INE, but now numerous queries, one for each qualifying segment, are performed simultaneously (i.e., an intersection join as discussed in Section 2.1). To illustrate RNE, assume that QS contains the segments shown in Figure 5.2a. Starting from the root of the object R-tree, RNE visits nodes that intersect the MBR of at least one segment in QS. Figure 5.2b illustrates the visited nodes and the qualifying objects in gray.

Figure 5.2: Example of RNE ((a) Network and objects, (b) The object R-tree)

In order to avoid joining the entire QS (which may be large) with every entry, we perform the following optimization. QS is divided into (possibly overlapping) sets QSi, one for each entry Ei in the current R-tree node. A segment is assigned to all entries that intersect its MBR. When the children of Ei are visited, they are only

segment, and therefore, may contain results). The pseudo code of RNE is presented in Figure 5.3. The initial parameters of the algorithm are (root of R-tree S, QS, ∅). To reduce the number of intersection tests, at lines 2 and 7 we apply a plane sweep algorithm [APR+98].

Algorithm RNE(node_id, QS, result)
1. if (node_id is an intermediate node)
2.    compute QSi for each entry Ei in node_id // join
3.    for each entry Ei in node_id
4.       if (QSi ≠ ∅)
5.          RNE(Ei.node_id, QSi, result)
6. else // node is a leaf node
7.    resultnode_id=plane-sweep(node_id.entries, QSi)
8.    sort resultnode_id to remove duplicates
9.    result=result∪resultnode_id
End RNE

Figure 5.3: Range Network Expansion

An alternative is to use the methodology suggested by [PRS99]. In particular, the MBR of all segments in QS is applied as a range query to the object R-tree. When a leaf node is reached, its contents are joined with QS, using plane-sweep. This technique performs a simple intersection test at each visited R-tree node; however, if the network range is large and irregular it may visit numerous tree nodes that do not overlap any qualifying segment (e.g., E in Figure 5.2).

Finally, if QS does not fit in memory, the join is performed in a block nested loops fashion, i.e., RNE is repeatedly applied for subsets of QS that fit in memory and the partial results are materialized. Another approach is to compute all qualifying segments, materialize them and join them with the object R-tree using one of the spatial join algorithms that are applicable in the presence of a single tree [RSV02].

6. Closest-Pairs in SNDB

Given two datasets S, T and a value k, a closest-pairs query retrieves the k (≥1) pairs (s,t) s ∈ S, t ∈ T that are closest in the network (e.g., find the hotel, restaurant pair within the smallest driving distance). This section describes retrieval of closest pairs in SNDB.

6.1 Closest-Pairs Euclidean Restriction

Like IER, the Closest-Pairs Euclidean Restriction (CPER) algorithm applies the multi-step kNN methodology. Assume for instance that only the closest pair is required. CPER performs an incremental closest-pairs query [CMTV00] on the R-trees of S, T and retrieves the Euclidean closest pair (s,t). The network distance dN(s,t) provides an upper bound dEmax for all candidate pairs in the Euclidean space. Subsequent candidate pairs are retrieved incrementally, continuously updating the result

8. dN(s',t')=compute_ND(s',t')
9. if (dN(s',t')

Algorithm CPNE (S,T,k)
1. {t1,...,tk}=INE(s1,k) // retrieve kNN t1,...,tk of first entity s1 in S
2. result={(s1,t1), ..., (s1,tk)}
3. dNmax=max{dN(s1,ti)} // current k-th CP distance
4. for each other point si∈S (si≠s1)
5.    ninj=find_segment(si)
6.    Tcover=find_entities(ninj); // in T
7.    for every entity t in Tcover
8.       if dN(si,t)

distance join retrieves the pairs (s,t) s ∈ S, t ∈ T such that

Figure 8.7: Cost vs. |S| (k=100, |T|=0.1|N|) ((a) I/O accesses, (b) CPU (sec))

Figure 8.8 shows the relative costs of the algorithms when |S|=|T|=0.1|N| for different values of k. CPER is much faster than CPNE for k≤1000 for the reasons explained above. For k=10000 the cost of both algorithms explodes for different reasons (note that the diagrams are in logarithmic scale). CPER now incurs numerous false hits, since there is a huge number of object-pairs with similar Euclidean distances, but divergent network distances. These pairs require many expensive distance computations of long paths, which incur extensive buffer thrashing. CPNE performs extensive expansion, which exceeds the available memory and causes many swaps in the buffer. Summarizing, CPER is much faster than CPNE in our settings, because it can utilize the Euclidean bounds to prune large areas of the search space early.

Figure 8.8: Cost vs. k (|S|=|T|=0.1|N|) ((a) I/O accesses, (b) CPU (sec))

8.4 e-Distance joins

We proceed to compare JER (join Euclidean restriction) and JNE (join network expansion), using |T|=0.1|N| and setting the join distance e to 0.001. Figure 8.9a (8.9b) plots the number of disk accesses (CPU time) as a function of |S| ranging from 0.01|N| to |N|. JER has better I/O performance, but the difference diminishes as |S| increases. This is because, for large datasets, the number of object pairs that qualify for the Euclidean distance join increases considerably, making the subsequent node-expansion (for false hit elimination) expensive. In this case, JER consumes more CPU time, due to the expensive sorting overhead (for selecting the "seed" for node expansion).

In Figure 8.10a (8.10b), we set |S| to 0.1|N| and measure the number of disk accesses (CPU cost) for different values of e. JER is significantly faster in terms of I/O, especially for small join distance in which case very few object pairs satisfy the Euclidean join. Interestingly, the

the Euclidean join). Therefore, JER is preferred for selective joins, while JNE should be applied otherwise.

Figure 8.9: Cost vs. |S| (e=0.001, |T|=0.1|N|) ((a) I/O accesses, (b) CPU (sec))

Figure 8.10: Cost vs. e (|S|=|T|=0.1|N|) ((a) I/O accesses, (b) CPU (sec))

9. Conclusion

This paper presents the first comprehensive approach for query processing in spatial network databases, proposing an architecture that preserves connectivity and location, and several novel algorithms, based on the Euclidean restriction and network expansion frameworks, covering the most common processing tasks.

The Euclidean Restriction framework provides an intuitive way to deal with spatial constraints. If, for instance, we want to "find the two nearest hotels to the south", we only need to retrieve the Euclidean neighbors in the area of interest using a constrained NN algorithm [FSA+01]. On the other hand, although network expansion is still applicable, it has limited pruning power on queries with selective spatial conditions. Considering again the example query, the network should also be expanded to the north of the query point, because subsequent nodes may lead to a nearest neighbor to the south.

The Euclidean Restriction framework assumes the lower bounding property, which may not always hold in practice. If, for instance, the edge cost is defined as the expected travel time, the Euclidean distance cannot confine the search space (unless we make additional assumptions, such as maximum speed). On the contrary, network expansion permits a wide variety of costs associated with the edges. It assumes, however, that the cost increases monotonically with the path (i.e., a path cannot be cheaper than one of its sub-paths), because otherwise there is no bound in the expansion process. Dijkstra's algorithm is also based on the same assumption, which is realistic for all SNDB applications.

The experimental evaluation suggests that the network expansion framework has superior performance for range search and nearest neighbors, while Euclidean restriction is better for closest pairs and joins. Since, however, our goal was to propose a complete set of algorithms for numerous queries, we did not focus explicitly on optimization of each method. Therefore, further improvements are possible for the proposed algorithms. It will also be interesting to evaluate their relative performance in the presence of materialized network distances.

This paper opens a door to several interesting and practical directions for future work. For instance, a continuous NN query [TPS02] would retrieve the two nearest gas stations (in terms of network distance) during the route from city A to city B. Our framework can also be used in the context of moving object databases to answer: "which is the closest taxi to my present location", "towards which direction should I walk to catch the next (moving) bus", etc. Since most real life objects move on pre-defined spatial networks, the SNDB versions of these queries are much more important than their Euclidean counterparts.

Acknowledgements

This work was supported by grants HKUST 6197/02E, HKUST 6081/01E and HKU 7380/02E from Hong Kong RGC. We would like to thank Qiongmao Shen and Manli Zhu for helping with the implementation.

References

[ADJ90] Agrawal, R., Dar, S., Jagadish, H. Direct Transitive Closure Algorithms: Design and Performance Evaluation. TODS, 15(3), 427-458, 1990.
[APR+98] Arge, L., Procopiuc, O., Ramaswamy, S., Suel, T., Vitter, J. S. Scalable Sweeping-Based Spatial Join. VLDB, 1998.
[BKS+90] Beckmann, N., Kriegel, H.P., Schneider, R., Seeger, B. The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. SIGMOD, 1990.
[BKS93] Brinkhoff, T., Kriegel, H., Seeger, B. Efficient Processing of Spatial Joins Using R-trees. SIGMOD, 1993.
[CLR90] Cormen, T. H., Leiserson, C. E., Rivest, R. L. Introduction to Algorithms. MIT Press, 1990.
[CMTV00] Corral, A., Manolopoulos, Y., Theodoridis, Y., Vassilakopoulos, M. Closest Pair Queries in Spatial Databases. SIGMOD, 2000.
[FRM94] Faloutsos, C., Ranganathan, M., Manolopoulos, Y. Fast Subsequence Matching in Time-Series Databases. SIGMOD, 1994.
[FSA+01] Ferhatosmanoglu, H., Stanoi, I., Agrawal, D., El Abbadi, A. Constrained Nearest Neighbor Queries. SSTD, 2001.
[G84] Guttman, A. R-trees: A Dynamic Index Structure for Spatial Searching. SIGMOD, 1984.
[HJR96] Huang, Y., Jing, N., Rundensteiner, E. Effective Graph Clustering for Path Queries in Digital Map Databases. CIKM, 1996.
[HJR97] Huang, Y., Jing, N., Rundensteiner, E. Integrated Query Processing Strategies for Spatial Path Queries. ICDE, 1997.
[HS99] Hjaltason, G., Samet, H. Distance Browsing in Spatial Databases. TODS, 24(2), 265-318, 1999.
[IRW93] Ioannidis, Y., Ramakrishnan, R., Winger, L. Transitive Closure Algorithms Based on Graph Traversal. TODS, 18(3), 512-576, 1993.
[J92] Jiang, B. I/O-Efficiency of Shortest Path Algorithms: An Analysis. ICDE, 1992.
[JHR98] Jing, N., Huang, Y. W., Rundensteiner, E. Hierarchical Encoded Path Views for Path Query Processing: an Optimal Model and its Performance Evaluation. TKDE, 10(3), 409-432, 1998.
[JP96] Jung, S., Pramanik, S. HiTi Graph Model of Topographical Roadmaps in Navigation Systems. ICDE, 1996.
[KHI+86] Kung, R., Hanson, E., Ioannidis, Y., Sellis, T., Shapiro, L., Stonebraker, M. Heuristic Search in Data Base Systems. Expert Database Systems, 1986.
[PRS99] Papadopoulos, A., Rigaux, P., Scholl, M. A Performance Evaluation of Spatial Join Processing Strategies. SSD, 1999.
[RSV02] Rigaux, P., Scholl, M., Voisard, A. Spatial Databases with Application to GIS. Morgan Kaufmann, 2002.
[SK98] Seidl, T., Kriegel, H. Optimal Multi-Step k-Nearest Neighbor Search. SIGMOD, 1998.
[SKC93] Shekhar, S., Kohli, A., Coyle, M. Path Computation Algorithms for Advanced Traveler Information System (ATIS). ICDE, 1993.
[SKS02] Shahabi, C., Kolahdouzan, M., Sharifzadeh, M. A Road Network Embedding Technique for K-Nearest Neighbor Search in Moving Object Databases. ACM GIS, 2002.
[SL97] Shekhar, S., Liu, D. CCAM: A Connectivity-Clustered Access Method for Networks and Network Computations. TKDE, 9(1), 102-119, 1997.
[SRF87] Sellis, T., Roussopoulos, N., Faloutsos, C. The
R+-tree: a Dynamic Index for Multi- [D59] Dijkstra, E. W. A Note on Two Problems in Dimensional Objects, VLDB, 1987. Connection with Graphs. Numeriche [TPS02] Tao, T., Papadias, D., Shen, Q. Continuous Mathematik, Vol. 1, 269-271, 1959. Nearest Search. VLDB, 2002. [WWW] www.maproom.psu.edu/dcw/