The Pennsylvania State University
The Graduate School
College of Engineering

EFFICIENT INFORMATION ACCESS FOR LOCATION-BASED SERVICES

IN MOBILE ENVIRONMENTS

A Dissertation in Computer Science and Engineering by Chi Keung Lee

© 2009 Chi Keung Lee

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

August 2009

The dissertation of Chi Keung Lee was reviewed and approved∗ by the following:

Wang-Chien Lee Associate Professor of Computer Science and Engineering Dissertation Advisor Chair of Committee

Donna Peuquet Professor of Geography

Piotr Berman Associate Professor of Computer Science and Engineering

Daniel Kifer Assistant Professor of Computer Science and Engineering

Raj Acharya Professor of Computer Science and Engineering Head of Computer Science and Engineering

∗Signatures are on file in the Graduate School.

Abstract

The demand for pervasive access of location-related information (e.g., local traffic, restaurant locations, navigation maps, weather conditions, pollution indices, etc.) fosters a tremendous application base of Location-Based Services (LBSs). Without loss of generality, we model location-related information as spatial objects and the accesses of such information as Location-Dependent Spatial Queries (LDSQs) in LBS systems. In this thesis, we address the efficiency of LDSQ processing through several innovative techniques. First, we develop a new client-side data caching model, called Complementary Space Caching (CSC), to effectively shorten data access latency over wireless channels. This is motivated by the observation that mobile clients, with only a subset of objects cached, do not have sufficient knowledge to assert whether or not accesses to certain objects at the remote server are necessary for given LDSQs. To address this problem, CSC requires each client to cache a global data view that is composed of (i) cached spatial objects and (ii) complementary regions that cover the locations of all the non-cached objects. With a global data view cached, CSC enables clients to assert the completeness of LDSQ results locally. Second, we investigate two new types of complex LDSQs, namely, Nearest Surrounder (NS) queries and skyline queries, both of which have a wide application base. An NS query returns spatial objects along with the disjoint angular ranges within which they are the nearest to a given query point. A skyline query retrieves non-dominated spatial objects; an object o is said to be dominated if there is another object o′ that is strictly better than o in at least one attribute and not worse than o in all the other attributes. We conduct in-depth analyses and propose novel techniques to efficiently answer these new queries. Third, we propose an LDSQ processing framework, named ROAD, to support efficient access to spatial objects on a road network. ROAD adopts a search space pruning technique that has not been explored before in this context. In ROAD, a large road network is organized as a hierarchy of interconnected regional sub-networks called Rnets, i.e., search subspaces. Further, with two novel concepts, namely, (i) shortcuts, which allow jumps across Rnets to accelerate search traversal, and (ii) object abstracts, which provide search guidance during traversal, searches supported by ROAD can bypass Rnets that contain no object of interest. ROAD is also flexible enough to support various LDSQs and object types. We conduct extensive empirical studies to evaluate the performance of our proposed approaches. The experimental results demonstrate the efficiency of our approaches and their superiority over state-of-the-art approaches in the corresponding domains.

Table of Contents

List of Tables viii

List of Figures ix

Acknowledgments xii

Chapter 1 Introduction 1
1.1 Location-Dependent Spatial Queries ...... 1
1.2 The System Models of Location-Based Services ...... 3
1.3 LBS Challenges and Opportunities ...... 4
1.4 Our Proposals ...... 6
1.5 The Organization of the Thesis ...... 8

Chapter 2 Literature Review 10
2.1 Mobile and Wireless Computing Environments ...... 10
2.2 Wireless Data Broadcast ...... 12
2.2.1 Broadcast Scheduling ...... 13
2.2.2 Air Indexing ...... 13
2.3 Mobile Client Caching ...... 17
2.4 Application-Specific Data Access Methods ...... 19
2.4.1 Time-Critical Data Access ...... 19
2.4.2 Semantic-Based Data Access ...... 19
2.4.3 Reliable Data Access ...... 20
2.5 Spatial Databases ...... 20
2.5.1 Spatial Indexes ...... 21
2.5.2 Spatial Queries ...... 23

Chapter 3 Complementary Space Caching 24
3.1 Overview of Complementary Space Caching ...... 24
3.2 LDSQ Processing with CSC ...... 27
3.2.1 Cache Initialization ...... 27
3.2.2 Query Processing ...... 28
3.2.3 Cache Probing ...... 29
3.2.4 Remainder Query Processing ...... 29
3.3 CR Coalescence ...... 31
3.3.1 Generic CR Coalescence Algorithm ...... 32
3.3.2 Client Request CR Coalescence ...... 33
3.3.3 Server Reply CR Coalescence ...... 34
3.4 Cache Management ...... 35
3.4.1 Cache Organization ...... 35
3.4.2 Cache CR Coalescence ...... 36
3.4.3 Cache Replacement ...... 36
3.5 Other Client-Side Data Caching Models ...... 37
3.6 Performance Evaluation ...... 38
3.6.1 Performance of Caching Schemes ...... 40
3.6.2 Performance of Different Remainder Query Expressions ...... 41
3.6.3 Performance of Server Reply CR Coalescence ...... 42
3.6.4 Performance of Cache Management ...... 43
3.7 Chapter Summary ...... 45

Chapter 4 Nearest Surrounder Queries 46
4.1 Definitions of Nearest Surrounder Queries ...... 46
4.2 Naïve Approaches to NS Searches ...... 49
4.3 Angle-Based Bounding Properties ...... 50
4.3.1 Edge Angular Bound and Edge Angular ...... 51
4.3.2 Edge Comparison ...... 53
4.3.3 The Brute-Force Algorithm ...... 54
4.4 NS Search Algorithms ...... 57
4.4.1 The Angle-Based Properties of Polygons and MBRs ...... 58
4.4.2 The Sweep Searching Algorithm ...... 59
4.4.3 The Ripple Searching Algorithm ...... 66
4.4.4 NS Search Based on Multiple ANS Searches ...... 69
4.5 Performance Evaluation ...... 71
4.5.1 Performance of NS Searching Algorithms ...... 72
4.5.2 Effectiveness of Optimization Techniques ...... 78
4.5.3 Improvement Gained by ANS Queries ...... 79
4.6 Chapter Summary ...... 80

Chapter 5 Skyline Queries 81
5.1 Analysis of Skyline Search Problems ...... 82
5.2 Existing Skyline Search Approaches ...... 86
5.2.1 Skyline Query Processing ...... 86
5.2.2 Skyline Query Variants ...... 88
5.3 Skyline Search Based on Z-Order Curve ...... 92
5.3.1 Dominance Relationships and Z-Order Curve ...... 92
5.3.2 ZBtree Index Structure ...... 95
5.3.3 ZBtree Index Manipulation ...... 96
5.4 Skyline Search on Z-order Curve ...... 98
5.4.1 RZ-Region Based Dominance Test ...... 98
5.4.2 The ZSearch Algorithm ...... 101
5.5 Skyline Query Variants ...... 103
5.5.1 Extended ZBtrees ...... 103
5.5.2 The ZBand Algorithm ...... 105
5.5.3 The ZRank Algorithm ...... 108
5.5.4 The k-ZSearch Algorithm ...... 114
5.5.5 The ZSubspace Algorithm ...... 117
5.6 Performance Evaluation of Skyline Query Processing ...... 120
5.6.1 Experiment Settings ...... 121
5.6.2 Experiments on Skyline Queries ...... 123
5.6.3 Experiments on Skyband Queries ...... 126
5.6.4 Experiments on Top-Ranked Skyline Queries ...... 127
5.6.5 Experiments on k-Dominant Skyline Queries ...... 129
5.6.6 Experiments on Subspace Skyline Queries ...... 130
5.7 Chapter Summary ...... 132

Chapter 6 LDSQs on Spatial Road Network 134
6.1 State-of-the-Art Approaches ...... 134
6.1.1 Network Expansion Based Approaches ...... 135
6.1.2 Euclidean Distance Based Approaches ...... 135
6.1.3 Solution Based Approaches ...... 135
6.2 The ROAD Framework ...... 136
6.2.1 Basic Idea ...... 137
6.2.2 Rnets, Shortcuts and Object Abstracts ...... 137
6.2.3 Rnet Hierarchy ...... 139
6.2.3.1 Network Partitioning Methods ...... 140
6.2.3.2 Creation of Shortcuts and Object Abstracts ...... 141
6.2.4 Route Overlay and Association Directory ...... 143
6.3 Processing LDSQs on ROAD ...... 145
6.4 ROAD Framework Maintenance ...... 149
6.4.1 Object Update ...... 149
6.4.2 Network Update ...... 149
6.4.2.1 Change of Edge Distance ...... 149
6.4.2.2 Change of Network Structure ...... 150
6.5 Performance Evaluation ...... 152
6.5.1 Indexing Overhead ...... 153
6.5.2 Maintenance Overhead ...... 154
6.5.3 Query Performance ...... 155
6.5.4 Evaluation of Rnet Hierarchy ...... 158
6.6 Chapter Summary ...... 159

Chapter 7 Conclusions 160
7.1 Summary of the Thesis ...... 160
7.2 Future Research ...... 162
7.2.1 Enhancements and Extensions of Proposed Techniques ...... 162
7.2.2 Other Issues Related to LDSQ Processing ...... 164
7.2.3 Other Emerging Platforms for LBSs ...... 165

Bibliography 167

List of Tables

3.1 Experiment parameters for performance evaluation on CSC ...... 39

4.1 Experimental parameters of performance evaluations on NS queries . . . . . 72

5.1 A survey of representative skyline query algorithms ...... 86
5.2 Experiment settings ...... 122
5.3 Skyline query; real datasets (time (sec)) ...... 126
5.4 Skyband query; real datasets (time (msec)) ...... 128
5.5 Top-ranked skyline query; real datasets (time (msec)) ...... 129
5.6 k-dominant skyline query; real datasets (time (msec)) ...... 131
5.7 Subspace skyline query; real datasets (time (sec)) ...... 132

6.1 Evaluation parameters ...... 152

List of Figures

1.1 The campus map of the Penn State University at University Park ...... 2
1.2 LBS system models ...... 4
1.3 A client-server LBS system with a client-side cache, a spatial database and a spatial network database ...... 6

2.1 A general architecture ...... 11
2.2 Full index tree ...... 15

3.1 Query processing and cache replacement with CSC ...... 25
3.2 R-tree and CSC ...... 28
3.3 Cache probing ...... 29
3.4 Remainder query expressions ...... 30
3.5 The pseudo-code of generic CR coalescence algorithm ...... 33
3.6 Performance of caching schemes on various query inter-arrival times ...... 40
3.7 Performance of caching schemes on various cache sizes ...... 42
3.8 Performance of caching schemes on various object sizes ...... 43
3.9 Performance of remainder query expressions ...... 44
3.10 Performance of heuristical server reply CR coalescence ...... 44
3.11 Performance of cache allocation strategies ...... 45

4.1 Nearest surrounder query and angle-constrained NS query ...... 48
4.2 Constrained NN query and ray shooting query ...... 50
4.3 The properties of edges and polygons ...... 51
4.4 The pseudo-code of the function EdgeCompare ...... 55
4.5 The pseudo-code of the algorithm BruteForce ...... 56
4.6 The pseudo-code of the function NSIncorp ...... 56
4.7 The pseudo-code of the function ObjectCompare ...... 57
4.8 Illustrations of two heuristics applied for the Sweep algorithm ...... 60
4.9 The pseudo-code of the function Lookahead ...... 62
4.10 The pseudo-code of the Sweep Algorithm ...... 63
4.11 Examples of NS searches ...... 64
4.12 The trace of the Sweep algorithm ...... 64
4.13 The pseudo-code of the function m-NSIncorp ...... 66
4.14 The trace of Ripple algorithm ...... 67
4.15 The pseudo-code of the function tierNSIncorp ...... 68
4.16 The pseudo-code of the Ripple algorithm (generalized for m-NS and ANS queries) ...... 70
4.17 Real object sets ...... 72
4.18 Performance in terms of CPU time on the number of objects (n) ...... 73
4.19 Performance in terms of I/O cost on the number of objects (n) ...... 74
4.20 Performance in terms of runtime memory cost on the number of objects (n) ...... 75
4.21 Performance in terms of CPU time on the sizes of objects (l) ...... 76
4.22 Performance in terms of CPU time on real object sets ...... 77
4.23 Performance in terms of index node access on real object sets ...... 77
4.24 Performance in terms of runtime memory consumption on real object sets ...... 77
4.25 Performance about CPU time improvement on various optimizations ...... 78
4.26 Performance about node access improvement on various optimizations ...... 78
4.27 Performance about runtime memory consumption improvement on various optimizations ...... 79
4.28 Performance about CPU time improvement by using ANS queries ...... 79
4.29 Performance about disk access improvement by using ANS queries ...... 79
4.30 Performance about runtime memory consumption improvement by using ANS queries ...... 80

5.1 An example skyline query (our running example) ...... 81
5.2 BBS and main memory R-tree ...... 88
5.3 A skyband query and a top-ranked skyline query ...... 89
5.4 An example for k-dominant and subspace skyline queries ...... 91
5.5 A Z-order curve example ...... 93
5.6 An RZ-region and a ZBtree ...... 95
5.7 Examples of dominance tests based on RZ-regions ...... 99
5.8 The pseudo-code of the Dominate function ...... 100
5.9 The pseudo-code of the ZSearch algorithm ...... 102
5.10 The trace of the ZSearch algorithm ...... 103
5.11 Example extended ZBtrees ...... 104
5.12 The pseudo-code of the BandDominate function ...... 105
5.13 The pseudo-code of the ZBand algorithm ...... 107
5.14 The trace of the ZBand algorithm ...... 107
5.15 An example for top-ranked skyline queries ...... 109
5.16 The pseudo-code of the DominateAndCount function ...... 111
5.17 The pseudo-code of the Rank function ...... 112
5.18 The pseudo-code of the ZRank algorithm ...... 113
5.19 The trace of the ZRank algorithm ...... 114
5.20 2-dominance tests in 3D space ...... 116
5.21 The pseudo-code of the k-Dominate function ...... 117
5.22 The pseudo-code of the k-ZSearch algorithm ...... 118
5.23 The pseudo-code of the ZSubspace algorithm ...... 120
5.24 The trace of the filter phase of the ZSubspace algorithm ...... 121
5.25 Skyline query; elapsed time versus dimensionalities (d) ...... 124
5.26 Skyline query; runtime memory vs dimensionalities (d) ...... 124
5.27 Skyline query; elapsed time versus cardinalities (n) ...... 125
5.28 Skyline query; runtime memory vs cardinalities (n) ...... 125
5.29 Skyband query (various dimensionality (d)) ...... 127
5.30 Skyband query (various cardinality (n)) ...... 127
5.31 Top-ranked skyline query (various dimensionality (d)) ...... 128
5.32 Top-ranked skyline query (various cardinalities (n)) ...... 129
5.33 k-dominant skyline query (various k) ...... 130
5.34 k-dominant skyline query (various cardinalities (n)) ...... 130
5.35 Subspace skyline query (various dimensionalities (d)) ...... 131
5.36 Subspace skyline query (various cardinalities (n)) ...... 132
6.1 Distance index approach ...... 136
6.2 A closed path and an Rnet ...... 138
6.3 An example Rnet hierarchy ...... 140
6.4 Route Overlay ...... 144
6.5 Association Directory ...... 144
6.6 An example 1NN query ...... 145
6.7 The pseudo-code of the kNNSearch algorithm based on ROAD ...... 146
6.8 The pseudo-code of the function ChoosePath ...... 147
6.9 Illustrations on different approaches for a 3NN query ...... 148
6.10 Network changes ...... 151
6.11 Performance of index construction on various cardinalities ...... 154
6.12 Performance about index construction on various network ...... 154
6.13 Performance about object update ...... 155
6.14 Performance about network update ...... 156
6.15 Performance of kNN query evaluation ...... 157
6.16 Performance of range query evaluation ...... 158
6.17 Performance about Rnet hierarchy level (l) ...... 159

Acknowledgments

I would like to express my sincere gratitude to Dr. Wang-Chien Lee, Associate Professor in the Department of Computer Science and Engineering, who has been my advisor since the beginning of my PhD study. He provided invaluable advice and guidance in my research, and gave help and encouragement whenever I encountered difficulties during the course of my studies.

I would also like to take this chance to thank Dr. Donna Peuquet, Dr. Piotr Berman, and Dr. Daniel Kifer for their time and effort in serving on my PhD thesis examination committee and for their important comments that helped improve the quality of this thesis. In addition, I wish to express thanks to Dr. Hong Va Leong, Dr. Antonio Si and Dr. Baihua Zheng for giving me a great deal of inspiration in developing some of my research ideas. Further, I appreciate having had the honor to work with Dr. Dik Lee, Dr. Jianliang Xu, Dr. Zhisheng Li, Mr. Huajing Li, Mr. Julian Winter, Miss Yuan Tian, Mr. Ye Mao, Mr. Xingjie Liu, Mr. Bingrong Lin, and Mr. Congxing Cai over those years, sharing research ideas and conducting good quality research together.

My special appreciation goes to my family, who always encouraged me and believed in me. Last but not least, my special thanks go to my wife, Cherris K. Y. Lam, for her love and continuous support. Without her encouragement, many tasks of mine, including this thesis, would not have been accomplished.

Chapter 1

Introduction

With the ever-growing popularity of smart mobile devices and the rapid advent of wireless technology, the vision of pervasive information access has come closer to reality. Information, especially location-related information, is valuable to users only when it is available at the right time and the right place. Recently, the demand for pervasive access of location-related information (e.g., local traffic, restaurant locations, navigation maps, weather conditions, pollution index, etc.) has fostered a tremendous application base of Location Based Services (LBSs) [118, 76, 22]. Without loss of generality, we represent location-related information as spatial objects and the accesses of such information as Location-Dependent Spatial Queries (LDSQs). In this thesis, we propose a set of innovative techniques to address the efficiency issue of LDSQ processing.

1.1 Location-Dependent Spatial Queries

Typically, LDSQs search for objects based on their proximity to certain reference locations (i.e., query points). Common LDSQs include location-dependent range queries and location-dependent nearest neighbor (NN) queries. The former retrieves objects located within a certain proximity range of a specified query point, and the latter searches for the nearest object (in a given object set) to a query point. To facilitate our discussion, let us consider the following application scenario. Suppose that an LBS server maintains information about campus attractions and facilities such as restaurants, car parks, and bus stops on the University Park campus of Penn State University (or PSU campus in short). A portion of the campus map is shown in Figure 1.1.

Figure 1.1. The campus map of Penn State University at University Park (labeled landmarks include Beaver Stadium, the Berkey Creamery, the Bryce Jordan Center, and several bus stops)

We assume that the information about the attractions and facilities is stored in a database system [111] as different sorts of spatial objects, whose database schemas (together with self-explanatory attribute names) are listed below:

• Attractions: (AttractionName, Description, UserRatings, Location)
• Restaurants: (RestaurantName, Cuisine, UserRatings, Location)
• CarParks: (CarParkName, NumberOfAvailSlots, Location)
• BusStops: (Description, RouteTimeTables, Location)

We assume that each spatial object is composed of both spatial information, e.g., the (latitude, longitude) coordinate at which the object is located, and non-spatial information. Here, the attribute Location in all the schemas represents the spatial information, while the other attributes are non-spatial information. Users can access the information of interest by submitting LDSQs to the LBS server. For example, a prospective student, Alice, may prepare for her PSU campus visit by issuing the following LDSQs:

Q1. Find car parks (i.e., in CarParks) within a mile of Beaver Stadium,

Q2. Search for the nearest bus stop (i.e., in BusStops) to any car park resulting from Q1, and

Q3. Determine the nearest bus stop (i.e., in BusStops) to the Berkey Creamery (i.e., in Restaurants).

Here, Q1 is a location-dependent range query, with car parks being the objects of interest, Beaver Stadium being the query point, and one mile being the proximity range. Q2 and Q3, on the other hand, are examples of location-dependent NN queries, with bus stops being the target objects and a certain car park and the Berkey Creamery being the query points, respectively. In addition to fixed query points (e.g., "Beaver Stadium" in Q1), LDSQs can be issued with respect to the locations of users who are moving. Let us continue with our example. Alice could continuously request information on nearby attractions along her way to Beaver Stadium. The LBS server is expected to return different attractions as Alice changes her location. For example, when Alice is at the Shields Building, the Bryce Jordan Center should be returned; when Alice moves to the Pattee Library, the Lion Shrine should be returned.
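To make the semantics of Q1 and Q3 concrete, the sketch below evaluates a location-dependent range query and an NN query over a small in-memory object set by linear scan. The object tuples, the coordinates, and the degree-based one-mile threshold are illustrative assumptions; an actual LBS server would evaluate such queries against a spatial index rather than scanning all objects.

```python
import math

# Hypothetical in-memory object sets following the schemas above.
car_parks = [
    ("East Deck", 412, (40.8065, -77.8520)),               # (CarParkName, NumberOfAvailSlots, Location)
    ("Stadium West", 250, (40.8122, -77.8601)),
]
bus_stops = [
    ("Curtin Rd at Creamery", None, (40.8035, -77.8623)),  # (Description, RouteTimeTables, Location)
    ("Porter Rd at Stadium", None, (40.8110, -77.8550)),
]

def distance(p, q):
    """Toy straight-line distance between two (lat, lon) pairs, in degrees."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def range_query(objects, query_point, radius):
    """Q1-style query: objects whose location lies within `radius` of `query_point`."""
    return [o for o in objects if distance(o[-1], query_point) <= radius]

def nn_query(objects, query_point):
    """Q3-style query: the single object nearest to `query_point`."""
    return min(objects, key=lambda o: distance(o[-1], query_point))

beaver_stadium = (40.8122, -77.8560)                       # assumed coordinate of the query point
print(range_query(car_parks, beaver_stadium, 0.0145))      # roughly one mile expressed in degrees
print(nn_query(bus_stops, beaver_stadium))
```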

1.2 The System Models of Location-Based Services

There are three commonly adopted LBS system models. The first is the client-server model, as depicted in Figure 1.2(a). In this model, the LBS server hosts data, which mobile clients (e.g., PDAs, mobile phones, laptop computers, Internet kiosks) can access by issuing LDSQs via either point-to-point access or broadcast. For point-to-point access, the clients issue LDSQs to the server, and the server processes the queries and returns the results to the clients. LBS providers such as Google Map¹, Microsoft LiveSearch Maps² and Yahoo! Map³, for example, offer this type of point-to-point access. On the other hand, wireless communication directly supports data broadcast. With data broadcast on wireless channels, clients tune in to an appropriate wireless channel and collect the required data to answer their LDSQs locally. MSN Direct⁴, for example, provides such an LBS data broadcast service. Compared with point-to-point access, data broadcast services can serve a larger number of clients with interest in similar information simultaneously, and hence offer better scalability.

¹http://maps.google.com
²http://maps.live.com
³http://maps.yahoo.com
⁴http://www.msndirect.com

Figure 1.2. LBS system models: (a) client-server model, (b) embedded model, (c) hybrid model

The second model is the embedded LBS, as depicted in Figure 1.2(b). In this model, LBS systems are embedded in portable devices, such as GPS navigators (Garmin⁵, TomTom⁶, etc.). As portable devices normally have specialized capabilities but no communication functionality, all the required information has to be well-defined and pre-loaded. As such, only simple built-in LDSQs (e.g., finding the nearest lodgings from here) can be supported. The third system model is the hybrid model, as depicted in Figure 1.2(c). In this model, some LDSQ processing functionality and some static data (e.g., digital maps and landmarks like "Sears Tower" in Chicago, IL) are pre-installed in the embedded devices, enabling them to process certain LDSQs locally. Meanwhile, the devices have wireless communication interfaces through which they can download up-to-date location-related information, such as road traffic and points of interest, from a remote LBS server for local LDSQ processing, or submit the LDSQs to the LBS server. In this thesis, we focus on the client-server model, i.e., the model described in Figure 1.2(a).

⁵http://www.garmin.com
⁶http://www.tomtom.com

1.3 LBS Challenges and Opportunities

An ideal LBS system should provide fast access to a high diversity of spatial objects for a large number of mobile users via wireless channels. Compared with conventional wired networks, efficient information access over wireless channels is considerably more challenging, mainly due to the following factors:

• Scarce wireless bandwidth. Wireless bandwidth is limited and shared among a large population of clients. Very often, the wireless channel is vulnerable to signal interference and has limited spatial coverage. Frequent data retransmission whenever connectivity is unstable results in long access latency.

• Limited mobile client energy. Mobile clients are powered by short-lived batteries. Thus, they cannot stay "on" all the time. To prolong their operational time, some components (especially wireless network interfaces, the most energy-consuming parts) should be turned off, or switched to an energy-saving mode, whenever they are not needed. Thus, clients can be expected to be disconnected from the server from time to time.

• Client mobility. The use of batteries and wireless communication enables mobile clients to change their locations while staying connected to an LBS server. This mobility provides users with convenient information access but at the same time imposes additional challenges on LDSQ processing, as location-related information collected at some positions becomes invalid (spatially irrelevant) at other locations. In order to guarantee the accuracy of the requested information, LDSQs have to be re-evaluated whenever users move to new locations. Nevertheless, such repeated query evaluations result in excessive consumption of both client energy and wireless bandwidth.

In brief, the processing of LDSQs should ideally be time-efficient, wireless-bandwidth-efficient, and power-saving, and should cater to mobile users' need for unrestricted mobility. Client-side data caching is a technique that can satisfy all of these needs. In this thesis, we consider client-side data caching as an essential component of LBS systems, and perform a thorough study of the caching model design and implementation.

LBSs are application-oriented, which brings opportunities for new kinds of information searches that can be more complex than conventional LDSQs. In this thesis, we inject new blood into conventional LDSQs from three different angles. First, we consider spatial searches based not only on the proximity between spatial objects and the query point but also on their orientations, i.e., the angles (or directions) of the objects with respect to a query point. Take Alice's campus visit as an example: the angles to nearby attractions with respect to Alice's current location are useful guidelines that can effectively identify those attractions.

Second, we consider LDSQs that take both spatial and non-spatial information into account. Example queries include "find the nearest Italian restaurant to Beaver Stadium" and "find all attractions within 100 meters of my current location that have overall user ratings higher than three". The former queries Location (i.e., spatial information) and Cuisine (i.e., non-spatial information) of objects in Restaurants, and the latter queries Location (i.e., spatial information) and UserRatings (i.e., non-spatial information) of objects in Attractions.

Third, we consider LDSQs on road networks, where the proximity between objects is measured by network distance instead of Euclidean distance. As network distance depends on the underlying road network, its calculation is much more complex and time-consuming.
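To illustrate why network distance is costlier to compute than Euclidean distance, the sketch below contrasts the two on a tiny hand-made road graph. The graph, its edge lengths, and the node coordinates are invented for illustration, and Dijkstra's algorithm merely stands in for whatever shortest-path machinery a real system would use.

```python
import heapq
import math

# A toy road network: node -> list of (neighbor, edge length).
road_network = {
    "A": [("B", 4.0), ("C", 1.5)],
    "B": [("A", 4.0), ("D", 2.0)],
    "C": [("A", 1.5), ("D", 5.0)],
    "D": [("B", 2.0), ("C", 5.0)],
}
coords = {"A": (0, 0), "B": (4, 0), "C": (0, 1.5), "D": (4, 2)}  # planar positions

def network_distance(source, target):
    """Shortest-path (network) distance computed with Dijkstra's algorithm."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, math.inf):
            continue
        for v, w in road_network[u]:
            if d + w < dist.get(v, math.inf):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return math.inf

def euclidean_distance(source, target):
    (x1, y1), (x2, y2) = coords[source], coords[target]
    return math.hypot(x2 - x1, y2 - y1)

# Euclidean distance lower-bounds network distance but may differ substantially.
print(euclidean_distance("A", "D"), network_distance("A", "D"))  # ~4.47 vs 6.0
```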

1.4 Our Proposals

A client-server LBS system model, which is the focus of this thesis, is depicted in Figure 1.3. This system consists of three data storage components, namely, a client-side cache, a spatial database and a spatial network database. At the LBS server, the spatial database maintains all the spatial object data, while the spatial network database stores the structure of the underlying road network. On the client side, the cache keeps replicas of some of the server data. Notice that although we chose the client-server LBS system model to facilitate our discussion, many of our proposed techniques can be adopted in the other system models described above.

Figure 1.3. A client-server LBS system with a client-side cache, a spatial database and a spatial network database (the figure labels the CSC client-side cache, the NS and skyline query engines, and the ROAD framework)

We have conducted a thorough study of these three components and address their technical challenges. For the client-side cache, we introduce a novel caching model called Complementary Space Caching (CSC). While spatial database technology is quite advanced, we focus on extending a spatial database with query processing techniques that support two new types of complex LDSQs, namely, Nearest Surrounder (NS) queries and skyline queries. Both NS queries and skyline queries have recently been receiving a lot of attention from the research community due to their broad application bases. The spatial network database, however, presents a relatively new research problem. Here, we introduce a novel LDSQ processing framework called ROAD, which addresses the limitations of existing approaches that do not scale well and are not flexible enough for practical use. In summary, four important problems are studied in this thesis: client-side caching, nearest surrounder query processing, skyline query processing, and the ROAD framework. Our proposed solutions for each of them enhance LDSQ evaluation efficiency. As described below, we introduce a number of original ideas in our research.

Client-side data caching is a commonly used technique for shortening data access time. It can also save client energy and alleviate network bandwidth contention. A desirable caching technique should allow clients to assert whether LDSQs can be completely answered by locally cached data, which is especially useful when clients are disconnected from the server. However, existing data caching techniques cannot achieve this. As a result, clients are forced to contact the server frequently, even when all the answer objects are actually available locally. Here, we introduce a novel caching model called Complementary Space Caching (CSC). In detail, CSC maintains a global view of the object distribution in space within a cache of limited size. As LDSQs exhibit strong spatial data access locality, with CSC, objects that are close to the current location of a client are maintained in full in the cache, while objects that are far away and less likely to be accessed soon are abstracted and represented as Complementary Regions (CRs) in the cache. Whenever an LDSQ is issued, the client can conduct a local evaluation based on the cached objects and CRs. If the search range does not touch any CR in the cache, it is guaranteed that the result derived from the cache is complete and there is no need to contact the remote LBS server for additional objects. This model significantly outperforms others as it effectively avoids submitting LDSQs whose result objects are all available locally. A sketch of this completeness test is given below.
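A minimal sketch of the completeness test that CSC enables, assuming for simplicity that both the query range and the complementary regions are axis-aligned rectangles; the actual CSC model and its CR representation are developed in Chapter 3.

```python
# Rectangles are (x_min, y_min, x_max, y_max); all names and shapes here are illustrative.

def intersects(r1, r2):
    """True if two axis-aligned rectangles overlap."""
    return not (r1[2] < r2[0] or r2[2] < r1[0] or r1[3] < r2[1] or r2[3] < r1[1])

def contains(rect, point):
    x, y = point
    return rect[0] <= x <= rect[2] and rect[1] <= y <= rect[3]

def answer_range_query_locally(query_rect, cached_objects, complementary_regions):
    """Return (result, complete): objects found in the cache, plus a flag saying
    whether the client can assert locally that nothing more needs to be fetched."""
    result = [o for o in cached_objects if contains(query_rect, o["location"])]
    # If the search range touches no CR, every qualifying object is already cached.
    complete = not any(intersects(query_rect, cr) for cr in complementary_regions)
    return result, complete

cache = [{"name": "obj1", "location": (2, 3)}, {"name": "obj2", "location": (8, 8)}]
crs = [(10, 0, 15, 5)]                # region known to hold only non-cached objects
print(answer_range_query_locally((1, 1, 9, 9), cache, crs))   # complete: True
print(answer_range_query_locally((7, 1, 12, 6), cache, crs))  # complete: False -> remainder query
```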
Furthermore, we examine query processing techniques for two new types of queries. The first is presented as a query engine that supports NS queries. An NS query retrieves the nearest objects over disjoint ranges of angles with respect to a given query point. With spatial objects indexed by an R-tree, a well-known efficient index for spatial data, we study two important issues in NS query processing, namely, object comparison and object access strategies. Corresponding to two possible object access strategies, based on object angles or on their distances to the query point, we develop two alternative search algorithms for NS queries, the Sweep algorithm and the Ripple algorithm, as the core search logic of the query engine. Centered at a query point, the Sweep algorithm explores the search space angularly, so objects are accessed in angular order, whereas the Ripple algorithm performs the search with a distance-based strategy. The second query processing technique is developed as a query engine for skyline queries.
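The toy sketch below illustrates only the output semantics of an NS query, not the Sweep or Ripple algorithms: each object is abstracted, purely by assumption, as a fixed angular interval and a single distance with respect to the query point, the circle is discretized into one-degree buckets, and consecutive buckets won by the same object are merged into angular ranges.

```python
# Each hypothetical object: (name, angular interval [start_deg, end_deg), distance to query point).
objects = [
    ("o1", (10, 80), 5.0),
    ("o2", (50, 140), 3.0),
    ("o3", (300, 350), 4.0),
]

def covers(interval, angle):
    start, end = interval
    return start <= angle < end

def naive_nearest_surrounder(objects, step=1):
    """For every discretized angle, pick the nearest covering object, then merge
    consecutive angles with the same winner into disjoint angular ranges."""
    winners = []
    for angle in range(0, 360, step):
        candidates = [(dist, name) for name, interval, dist in objects if covers(interval, angle)]
        winners.append(min(candidates)[1] if candidates else None)
    ranges, start = [], 0
    n = 360 // step
    for a in range(1, n + 1):
        if a == n or winners[a] != winners[start]:
            if winners[start] is not None:
                ranges.append((winners[start], (start * step, a * step)))
            start = a
    return ranges

print(naive_nearest_surrounder(objects))
# [('o1', (10, 50)), ('o2', (50, 140)), ('o3', (300, 350))]
```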

Given a set of objects with a number of attributes, including their proximities, a skyline query retrieves the non-dominated objects. Here, dominance refers to a relationship between two objects: an object o is said to dominate another object o′ if o is strictly better than o′ in at least one attribute and o is as good as o′ in all the remaining attributes. Exploiting the properties of skyline queries, i.e., transitivity and incomparability among data objects, we propose and study an innovative idea of arranging the access of data objects based on a Z-order space-filling curve. By doing so, the result objects of a skyline query can be quickly identified, while comparisons between blocks of data objects can be performed efficiently. Further, this skyline query engine can handle several other skyline query variants and performs considerably better than existing approaches specialized for skyline queries or their variants. We examine the properties of these two types of queries and use those properties as the basis for the algorithms in our query engines.

Object proximities play a key role in LDSQs. In some situations, these proximities are constrained by an underlying road network. Thus, spatial network databases have been introduced as a supplement to spatial databases. In this thesis, we introduce a new processing framework called ROAD as a query facility in spatial network databases for LDSQ evaluation on road networks. Different from all related existing works, we treat the road network on which objects are mapped as a search space. Portions of the search space (i.e., search subspaces) can be safely pruned if they do not contain any object of interest. Based on this idea, we formulate a road network as a hierarchy of regional subnets (Rnets). Each Rnet represents a search subspace, and the objects mapped to an Rnet together form an object abstract. Further, shortcuts (i.e., precomputed shortest paths) are maintained among border nodes (i.e., network nodes shared by Rnets) to connect neighboring Rnets. Each LDSQ on ROAD is evaluated based on the idea of network expansion. Whenever the expansion reaches a border node of an Rnet that contains required objects, it traverses the physical edges within that Rnet; otherwise, it bypasses the Rnet through shortcuts. The search ends when the required portion of the network is completely traversed, and all objects collected during the traversal form the result of the query.
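Returning to the skyline engine, the sketch below shows the two ingredients it builds on, under the convention (assumed here) that smaller attribute values are better: a Z-order (Morton) value obtained by bit interleaving, and a pairwise dominance test driving a naive quadratic skyline computation. The ZBtree organization and RZ-region pruning that make this efficient are the subject of Chapter 5.

```python
def z_value(point, bits_per_dim=8):
    """Morton value of an integer point: interleave the bits of its coordinates."""
    z = 0
    for bit in range(bits_per_dim):
        for dim, coord in enumerate(point):
            z |= ((coord >> bit) & 1) << (bit * len(point) + dim)
    return z

def dominates(o1, o2):
    """o1 dominates o2 if o1 is at least as good in every attribute and strictly
    better in at least one (smaller is better by assumption)."""
    return all(a <= b for a, b in zip(o1, o2)) and any(a < b for a, b in zip(o1, o2))

def naive_skyline(points):
    """Quadratic skyline for illustration only; ZSearch in Chapter 5 avoids this cost."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

pts = [(1, 9), (3, 3), (4, 2), (6, 7), (9, 1)]
print(sorted(pts, key=z_value))   # points visited in Z-order
print(naive_skyline(pts))         # [(1, 9), (3, 3), (4, 2), (9, 1)]
```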

1.5 The Organization of the Thesis

In this thesis, we present the proposed techniques in detail. The remainder of the thesis is organized as follows. Chapter 2 surveys existing research on mobile computing and spatial databases. Chapter 3 presents Complementary Space Caching, including the caching model, the query processing algorithm and the cache manipulation strategies. Chapter 4 introduces Nearest Surrounder queries and discusses our proposed search algorithms based on R-trees. Chapter 5 presents skyline queries and discusses the idea of using Z-order space-filling curves to process them. Chapter 6 presents the ROAD framework to support LDSQs for objects on road networks. Finally, Chapter 7 concludes the thesis and states our future research directions.

Chapter 2

Literature Review

In this chapter, we outline the mobile and wireless computing environments in which LBSs are deployed. Then, we review the state-of-the-art technologies in wireless data broadcast, mobile data caching, some application-specific data access methods and spatial databases that are related to our research.

2.1 Mobile and Wireless Computing Environments

Today, there are several wireless technologies (e.g., Bluetooth, IEEE 802.11, UMTS, satellite, etc.) that can be integrated to construct a seamless wireless data access platform. Although their goals and applications are very different, data access via these wireless technologies can be logically represented by a basic model consisting of an access point (i.e., a cellular base station or a satellite) and a number of wireless channels. A general architecture is shown in Figure 2.1. A mobile client can be served by a base station (which supports bi-directional communication) or covered by a satellite (which only supports uni-directional data transmission). All access points and servers are connected to the Internet. Under this architecture, a base station serves as a gateway between mobile clients and remote servers. On the other hand, while a wireless cell can be viewed as a simple client/server environment, a base station, in a sense, can be considered a local server disseminating information to clients inside the cell. Generally speaking, the servers in this architecture are information providers while mobile clients are customers. In many research studies, updates are assumed to be performed at the servers. The Internet backbone in the model is typically a fast link with a data rate on the order of tens of Mbps up to Gbps.

Figure 2.1. A general architecture (fixed servers and mobile clients connected through base stations, satellite links and the Internet backbone)

Deployed by wireless carriers, modern cellular networks are based on GPRS [2] and 3G [1] technology. For a single point-to-point connection, GPRS can support bandwidth of up to 172.2 kbps. The bandwidth of 3G networks can reach 384 kbps for slow-moving clients and up to 2 Mbps for stationary clients. Provided by private operators, wireless LANs (WLANs) are installed in cafes, hotels, shopping malls, airports, and college and university campuses [82] as Wi-Fi [6] services. 802.11 [3] and its extensions are the standard link-layer protocols used in WLANs. The bandwidth ranges from 1 Mbps (for 802.11) up to 20 Mbps (for 802.11g), operating at 2.4 GHz, while 802.11a, which operates at 5 GHz, can provide up to 54 Mbps of bandwidth. In general, the signal transmitted between a base station (also known as an access point) and a client can reach as far as 300 ft; if the distance is less than 30 ft, clients can operate in a low-power mode. Taking a base station as an intermediate host or a proxy [127] that forwards server/client messages, client-server communication can operate in the point-to-point access mode. A client initiates a request to a server and retrieves responses from the server through the base station. Basically, this is no different from conventional client/server interactions in wired networks. However, the wireless link is always a performance bottleneck, so a number of data access techniques have been developed to improve overall system performance. Over a wireless medium, transmitted (i.e., broadcast) information can be listened to by multiple clients. Owing to this nature, a data item of interest to a large portion of the client population can be disseminated on certain dedicated public channels (called broadcast channels).

The clients can tune in and retrieve the desired items on those channels. This technique, called wireless data broadcast, significantly saves the bandwidth otherwise required for delivering the same data item multiple times to individual clients via point-to-point wireless channels. To alleviate the contention for wireless channels, frequently accessed data items can be maintained in the client cache. Mobile client caching techniques not only shorten access latency but also improve data availability during periods of disconnection. Because of the limited cache size, efficient cache management techniques (including cache replacement, cache coherence and cache prefetching) are required; they are collectively studied under the theme of mobile client caching. Also, various data access techniques have been derived and optimized for specific application needs. Further, recent advances in spatial databases enrich LBSs with different types of searches.

2.2 Wireless Data Broadcast

Wireless data broadcast (or simply broadcast) is particularly efficient for data dissemination in wireless and mobile environments due to its high scalability, efficient client energy conservation, and channel bandwidth utilization. By listening to a broadcast channel, a large population of clients can be served simultaneously. This not only conserves the shared wireless bandwidth but also saves client energy (since submitting requests consumes more energy than receiving messages) and alleviates server workload. For wireless broadcast systems, access efficiency and energy conservation are two essential performance criteria. Access efficiency concerns how fast a client request is answered, while energy conservation concerns how much battery power a mobile client consumes to access the requested data items. Energy conservation is particularly important due to the constrained battery power available to a mobile client. In the literature, two performance metrics, namely access time and tuning time, are typically used to measure the access efficiency and the energy conservation of a wireless data broadcast system, respectively. Access time is the time elapsed from the moment a request is issued by a client to the moment the requested data is returned. Tuning time is the duration of time a client stays in active mode to collect the requested data items. Since the energy consumption rate is more or less constant when the wireless interface is turned on, the tuning time is usually used to indicate the energy consumed.
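The two metrics can be made concrete with a small calculation, assuming a flat broadcast cycle of equal-length items (one per time unit) and a client that, thanks to an index not modelled here, knows exactly when its item will appear.

```python
def access_and_tuning_time(cycle, wanted, arrival_time):
    """Return (access_time, tuning_time) for a client that starts listening at
    `arrival_time` (an offset into the repeating cycle) and wants item `wanted`."""
    n = len(cycle)
    slot = cycle.index(wanted)
    wait = (slot - arrival_time) % n   # slots spent sleeping until the item starts
    return wait + 1, 1                 # one extra slot spent actively downloading it

cycle = ["a", "b", "c", "d", "e", "f"]
print(access_and_tuning_time(cycle, "d", arrival_time=1))  # (3, 1)
print(access_and_tuning_time(cycle, "a", arrival_time=4))  # (3, 1): wraps to the next cycle
```

Without an air index, the client would have to stay active for the entire wait, so its tuning time would equal its access time; air indexing (Section 2.2.2) is what allows the two to be decoupled.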

2.2.1 Broadcast Scheduling

Broadcasting unwanted data items not only consumes scarce wireless bandwidth but also increases the expected client access time. Thus, broadcast scheduling, i.e., arranging which data items appear in a broadcast channel and in what order and frequency so as to improve client access time, is a critical research issue. In general, broadcast schedules can be categorized as push-based broadcast, pull-based (on-demand) broadcast and hybrid broadcast. In push-based broadcast, a server schedules public information (such as weather and traffic information) over a broadcast channel. This approach assumes unidirectional communication (i.e., server to mobile clients only). The server delivers data items using a broadcast program based on pre-compiled access profiles or access statistics known in advance. Existing research in this area focuses on broadcast scheduling for different access patterns [142, 128, 54, 126, 138]. In on-demand broadcast [11, 13, 62, 126], clients send requests for data to the server via uplink channels. The server then responds by disseminating the requested data to clients via a shared broadcast channel. Different from conventional point-to-point communication, a broadcast data item may satisfy multiple client requests at the same time. The works in [42, 142, 13] study broadcast scheduling for fixed-size data items, whereas [11] focuses on variable-size data items. In addition to access time, the studies in [36, 68] take client energy saving into account. Push and on-demand broadcast can also be combined into a hybrid broadcast approach, as examined in [42, 143, 10, 57, 71]. Adaptive hybrid broadcast, which dynamically allocates bandwidth between on-demand and periodic accesses, is studied in [94, 125].

2.2.2 Air Indexing

A client has to stay active in order to monitor a broadcast channel and retrieve a requested data item. Keeping a mobile client in active mode results in overconsumption of the client's precious energy. Mobile client devices are equipped with power management facilities, which allow them to efficiently switch between active and sleep modes according to a time schedule and to synchronize with a broadcast channel. Thus, air indexing techniques have been proposed to let a client switch into sleep mode (to conserve energy) while irrelevant data items are broadcast on air. The basic idea is to incorporate auxiliary information about the content and arrival times of data items into the broadcast program. The inclusion of an air index, however, lengthens the broadcast cycle and degrades access efficiency. Hence, a trade-off between access efficiency and tuning time is established.

Energy conservation is a key issue for battery-powered mobile devices, and air indexing techniques are employed to indicate the arrival times of data items. Instead of fully scanning a broadcast channel for requested data items, broadcast access with an air index involves the following steps: i) initial probe, tuning into the broadcast channel to determine when an index is broadcast; ii) index lookup, accessing the index to determine when a required data item is on air; and iii) download, retrieving the required data item at its indicated broadcast time. A client can switch into doze mode and resume active mode only when the index or data items of interest are expected to arrive, thus substantially reducing battery power consumption. In the following, three basic classes of indexing techniques, namely hashing, index trees and signatures, are described.

Both hashing-based schemes and the flexible indexing method were proposed in [67]. In hash-based schemes, each frame carries a hash function and a shift function together with the data. The hash function hashes a key attribute to the address (i.e., delivery time) of the frame holding the desired data. In the case of a collision, the shift function is used to compute the address of the overflow area, which consists of a sequential set of frames starting at a position behind the frame address generated by the hash function. Flexible indexing first sorts the data items in ascending (or descending) order and then divides them into p segments numbered 1 through p. The first frame in each of the data segments contains a control index, which is a binary index mapping a given key value to the frame containing that key. In this way, the tuning time can be reduced. The parameter p makes the indexing method flexible, since either a very good tuning time or a very good access time can be obtained by an appropriate choice of p.

An index tree [68] stores the key attribute values and the arrival times of the data frames in each node disseminated in a broadcast channel. The index tree technique is very efficient for a clustered broadcast cycle and provides a more accurate and complete global view of the data frames. An example of an index tree for a broadcast cycle consisting of 81 data items is shown in Figure 2.2. The lowest level consists of rectangular boxes, each of which represents a data frame, i.e., a collection of three data items. Each index node has three pointers. To reduce tuning time while maintaining a good access time for clients, a part of the index tree can be replicated and interleaved with the data frames.

Figure 2.2. Full index tree (an index tree over a broadcast cycle of 81 data items, with a replicated upper part and a non-replicated lower part indexing the data frames)

Instead of replicating the entire index tree m times (meaning the index tree is broadcast every 1/m of the broadcast cycle), each broadcast consists only of the replicated part of the index (the upper levels) and the non-replicated part (the lower levels) that indexes the data frames immediately following it. As such, each node in the non-replicated part appears only once in a broadcast cycle. Since the lower levels of an index tree take up much more bandwidth than the upper part, the index overhead can be greatly reduced if the lower levels are not replicated. This scheme is termed the distributed index [68]. In this way, tuning time can be significantly improved without causing much deterioration in access time. To support distributed indexing, every frame maintains an offset to the beginning of the next index to be broadcast. Each node of a distributed index tree contains a tuple, with the first field containing the last primary key value of the data frame previously broadcast and the second field containing the offset to the beginning of the next broadcast cycle; this guides clients that have missed the required data in the current cycle to tune in to the next broadcast cycle. There is a control index at the beginning of every replicated index to refer clients to the proper branch of the index tree. This additional navigation information, together with the sparse index tree, provides the same function as the complete index tree.

The signature technique has been widely used for information retrieval. The signature method is not affected much by the clustering factor and is thus particularly good for multi-attribute retrieval. The signature of a data frame is generated by first hashing the values in the data frame into bit strings and then superimposing them (using the bitwise OR operation) into a bit vector [90]. Signatures are delivered in the broadcast channel by interleaving them with the data frames. A query signature is generated in the same fashion from the queried values given by the user. To answer a query, a mobile client simply retrieves signatures from the broadcast channel and matches them against the query signature using a bitwise AND operation. If no match is found, the corresponding data frame can be safely ignored; the client turns into doze mode and wakes up again for the next signature. Otherwise, the data frame is further checked against the query.
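A sketch of superimposed signature generation and query matching as just described; the signature width, the per-value hash, and the choice of setting two bits per value are arbitrary decisions made for illustration.

```python
import zlib

SIG_BITS = 16   # assumed signature width

def value_signature(value, bits=SIG_BITS):
    """Hash one attribute value into a bit string (here: set two bits)."""
    h = zlib.crc32(value.encode())
    return (1 << (h % bits)) | (1 << ((h >> 8) % bits))

def frame_signature(values, bits=SIG_BITS):
    """Superimpose (bitwise OR) the value signatures of a data frame."""
    sig = 0
    for v in values:
        sig |= value_signature(v, bits)
    return sig

def may_match(frame_sig, query_values, bits=SIG_BITS):
    """Bitwise-AND test: False means the frame certainly does not contain the
    queried values; True means it might (false positives are possible)."""
    q = frame_signature(query_values, bits)
    return frame_sig & q == q

frames = [["sushi", "italian"], ["coffee", "bakery"]]
sigs = [frame_signature(f) for f in frames]
print([may_match(s, ["coffee"]) for s in sigs])  # frames flagged True must be downloaded and checked
```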

The primary issue with the different signature methods is the size and the number of levels of the signatures to be used. In [90], three signature schemes, namely the simple signature, the integrated signature and the multi-level signature, were proposed and analyzed. For simple signatures, a signature frame is broadcast before its corresponding data frame; therefore, the number of signatures equals the number of data frames in a broadcast cycle. An integrated signature is constructed for a group of consecutive data frames (called a frame group). The multi-level signature is a combination of the simple signature and the integrated signature methods, in which the upper-level signatures are integrated signatures and the lowest-level signatures are simple signatures.

Derived from these basic schemes are several enhanced indexing techniques, namely the hybrid index, the unbalanced index tree, and multi-attribute air indexing.

Hybrid Index Approach. Both the signature and index tree techniques have their pros and cons. The hybrid index proposed in [64] builds index information on top of the signatures. A sparse index tree provides a global view of the data frames and their corresponding signatures. The index tree is called sparse because only the upper t levels of the index tree (the replicated part in distributed indexing) are constructed. A key-search pointer node in the t-th level points to a frame group, which is a group of consecutive frames following their corresponding signatures. Since the size of the upper t levels of an index tree is usually small, the overhead of such additional indexes is very small.

To achieve better performance for skewed queries, the unbalanced index tree technique was investigated in [28, 123]. Unbalanced indexing minimizes the average index search cost by reducing the number of index traversals for hot data at the expense of spending more on cold data. For fixed index fanouts, a Huffman-based algorithm can be used to construct an optimal unbalanced index tree. Given N data items and a fanout f of the index tree, the Huffman-based algorithm first creates a forest of N subtrees, each of which is a single node labeled with the corresponding access frequency. Then, the f subtrees labeled with the smallest frequencies are attached to a new node, and the resulting subtree inherits the sum of the labeled frequencies of its f child subtrees. This grouping procedure is repeated until only a single tree remains. Provided the data access patterns, an optimal unbalanced index tree with a fixed fanout is thus easy to construct; however, its performance may not be optimal overall. Thus, [28] discusses the more sophisticated case of variable fanouts, in which the problem of optimally constructing an index tree is NP-hard. For this case, [28] proposes a greedy algorithm called Variant Fanout (VF), which builds the index tree in a top-down manner: it starts by attaching all data items to the root node, and then groups the nodes with small access probabilities and moves them one level lower so as to minimize the average index traversal cost.

For multi-attribute indexing, proposed in [63], a broadcast cycle is clustered based on the most frequently accessed attribute (the first attribute, also called the clustered attribute). Although the other attributes are non-clustered in the cycle, a second attribute can be chosen to cluster the data items within each data cluster of the first attribute.
Likewise, a third attribute can be chosen to cluster the data items within each data cluster of the second attribute. The second and third attributes are collectively called non-clustered attributes. For each non-clustered attribute, a broadcast cycle can be partitioned into a number of segments called meta-segments [68], each of which holds a sequence of frames with non-decreasing (or non-increasing) values of that attribute. Thus, within each individual meta-segment the data frames are clustered, and the indexing techniques discussed above remain applicable to a meta-segment. The number of meta-segments in the broadcast cycle for an attribute is called the scattering factor of the attribute; the scattering factor increases as the importance of the attribute decreases. The index tree, signature, and hybrid methods are all applicable to indexing multi-attribute data frames [94]. For multi-attribute indexing, an index tree is built for each indexed attribute, while the signature is generated by superimposing multiple hashed attribute values.
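The Huffman-based construction of an unbalanced index tree with a fixed fanout, described earlier in this subsection, can be sketched as follows; the node representation, tie-breaking, and padding with zero-frequency dummies are implementation choices assumed here.

```python
import heapq
import itertools

def huffman_index_tree(items, fanout):
    """Build an unbalanced index tree: `items` is a list of (key, access_frequency).
    Repeatedly merge the `fanout` least-frequent subtrees, so hot items end up
    near the root (short index traversals) and cold items deeper down."""
    counter = itertools.count()              # tie-breaker so the heap never compares subtrees
    heap = [(freq, next(counter), key) for key, freq in items]
    heapq.heapify(heap)
    # Pad with zero-frequency dummies so every merge uses exactly `fanout` children.
    while (len(heap) - 1) % (fanout - 1) != 0:
        heapq.heappush(heap, (0, next(counter), None))
    while len(heap) > 1:
        children = [heapq.heappop(heap) for _ in range(fanout)]
        merged = (sum(c[0] for c in children), next(counter), [c[2] for c in children])
        heapq.heappush(heap, merged)
    return heap[0][2]                        # nested lists: hot keys appear near the top

hot_and_cold = [("a", 50), ("b", 30), ("c", 10), ("d", 5), ("e", 3), ("f", 2)]
print(huffman_index_tree(hot_and_cold, fanout=3))
```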

2.3 Mobile Client Caching

Caching frequently used data items in a mobile client's local storage can improve access latency and data availability in case of disconnection. In general, cache management techniques include cache replacement, cache coherence and cache pre-fetching. Due to weak connectivity, the cache management techniques adopted in wireless and mobile computing environments differ from conventional caching techniques.

Cache replacement decides which data items to keep in order to maximize space efficiency. The cache replacement issue for wireless data broadcast was first studied in the Broadcast Disk project [128, 78, 146, 11, 147].

Cache coherence concerns the consistency between the source values and the cached values of data items. As discussed in [8], a number of cache consistency models are possible for wireless and mobile computing, namely, opportunistic, quasi-caching and latest value. With the opportunistic model, a client can read any version of a data item available in the cache without worrying about consistency.

With the quasi-caching model, consistency is defined subject to application constraints that specify a tolerated slack compared with the latest value of the source data items; a query may accept old versions of data items as long as they belong to the same consistent snapshot [108, 109]. With the latest value model, clients must always access the most recent value of a data item, so whenever a data item is updated, the corresponding cached value needs to be updated or invalidated. In [16, 72], different invalidation strategies based on periodic broadcast are studied, namely, Broadcast Timestamp (BT), Amnesic Terminal (AT) and Signature (SIG). In the BT algorithm, the identifiers of changed data items and the timestamps of the latest changes that happened in the last w seconds are broadcast; a client then discards those cached items whose timestamps are older than the timestamps of the corresponding items mentioned in the broadcast. In the AT algorithm, only the identifiers of data items that have changed since the previous broadcast are reported. A client drops a cached item if it is reported in the broadcast, and discards its entire cache if it has missed a certain number of consecutive broadcasts. AT saves more bandwidth than BT since no timestamps are included, but it requires the client to listen to every broadcast. In the SIG algorithm, the identifiers of multiple changed items are merged and encoded as a single signature reported in a broadcast. A client discards a cached item when its signature matches the signature in the broadcast. The signature is more compact than the original item identifiers, thus saving bandwidth; however, it is possible to have "false negatives", where the client discards items that are still valid.

Due to narrow bandwidth, data items may not always be available on air, so cache pre-fetching loads data items in advance when requests for those items are predictable. In Tag-team caching [7], a client caches data item i when it is broadcast and replaces it with another data item j when j is broadcast; similarly, j is dropped again when i is re-broadcast. The PT algorithm, introduced in [9] and adopted for Broadcast Disks, measures the pt value of data items, i.e., the product of the access probability (p) and the time (t) that will elapse before the next instance of the item is broadcast, and keeps the items with high pt values in the cache. [136] presents a cache update policy to minimize average access latency. In [51], the PP1 and PP2 policies are devised: PP1 achieves the minimum power consumption, while PP2 achieves the minimum access latency. Both PP1 and PP2 are static policies. For dynamic environments where request rates are not fixed, adaptive power-aware pre-fetching schemes [60, 153] have been proposed. An analytical model for various pre-fetching policies is derived in [51].
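A sketch of the BT invalidation check from the client's side; the report format (a mapping from item identifiers to their latest update timestamps within the last w seconds) and the rule of dropping the whole cache after a long disconnection are our reading of the scheme, kept deliberately simple.

```python
def apply_bt_invalidation_report(cache, report, report_time, w, last_report_seen):
    """cache: item_id -> (value, cached_at_timestamp).
    report: item_id -> timestamp of its latest update within the last w seconds.
    If the client slept through more than w seconds of reports it cannot trust
    the cache at all; otherwise it drops only items older than the reported update."""
    if report_time - last_report_seen > w:
        cache.clear()                      # missed too many reports: invalidate everything
        return cache
    for item_id, updated_at in report.items():
        cached = cache.get(item_id)
        if cached is not None and cached[1] < updated_at:
            del cache[item_id]             # stale copy: discard
    return cache

cache = {"traffic": ("slow", 100), "weather": ("sunny", 95)}
report = {"traffic": 120}                  # traffic changed at t=120
print(apply_bt_invalidation_report(cache, report, report_time=130, w=60, last_report_seen=110))
# {'weather': ('sunny', 95)}
```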

2.4 Application-Specific Data Access Methods

The wireless broadcast and client caching techniques reviewed in the previous two sections are designed for general data access. In practice, however, such techniques may differ considerably depending on the nature of the applications and their specific requirements. Here, data access techniques driven by application requirements in terms of time, data semantics, and reliability are described. Time-critical applications demand that data be made available before a specified “deadline”; otherwise the delayed information is of no value to the users. Semantic-based and reliable data access are discussed in the subsequent subsections.

2.4.1 Time-Critical Data Access

Timeliness of data access is of central importance in many time-critical applications, where belated information has no value to users. Take stock trading as an example: belated price information can lead to wrong decisions. To ensure that information is accessible by a specified time, online broadcast scheduling takes the deadlines associated with data requests into account. Usually such schedules are applied to on-demand broadcast. The major performance metric for time-critical data dissemination is the request drop rate, i.e., the fraction of requests whose deadlines are missed with respect to the total number of requests. A good broadcast schedule should provide the lowest request drop rate. Developed heuristics include Earliest Deadline First (EDF) [149, 84], Most Requests First (MRF), and Slack time Inverse Number of pending requests (SIN-α) [148]. With EDF, each request specifies the deadline of the requested data item, and the item with the earliest deadline is delivered first. With MRF, the data items with the largest number of pending requests are broadcast first so as to satisfy many requests at once. SIN-α combines the two heuristics and gives priority to items with a small ratio of slack time to the number of pending requests. In [148], a theoretical bound on the request drop rate is derived.
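To make the three heuristics concrete, the following sketch ranks pending on-demand broadcast requests in Python. It is only an illustration: the PendingRequest structure, its field names, and the exact SIN-α exponent are our assumptions, not the schedulers' actual implementations.

import time

class PendingRequest:
    # A hypothetical pending request for one broadcast data item (illustrative fields).
    def __init__(self, item_id, deadline, num_pending):
        self.item_id = item_id          # requested data item
        self.deadline = deadline        # absolute deadline in seconds
        self.num_pending = num_pending  # number of clients waiting for this item

def edf_key(req, now):
    # Earliest Deadline First: the smallest deadline wins.
    return req.deadline

def mrf_key(req, now):
    # Most Requests First: the largest number of pending requests wins.
    return -req.num_pending

def sin_key(req, now, alpha=1.0):
    # SIN-alpha: favor a small ratio of slack time to (number of pending requests)^alpha.
    slack = max(req.deadline - now, 1e-9)
    return slack / (req.num_pending ** alpha)

def next_item_to_broadcast(requests, key_fn):
    # Pick the pending request to serve next under the chosen heuristic.
    now = time.time()
    return min(requests, key=lambda r: key_fn(r, now))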

2.4.2 Semantic-Based Data Access

Due to weak connectivity, clients should be made more aware of the cache content. If a client can determine that all data items needed to answer its queries are available in its cache, many requests can be served locally without contacting the server. To provide clients with such an ability, semantic caching is proposed in [35, 85, 114]. The cached items are grouped and collectively described as semantic regions using query expressions. When a new query is determined, via query containment [26], to be contained by a semantic region, the client can be certain that no additional data items are required from the server. In case a new query is only partially covered, a remainder query is formulated to fetch the missing data from the server. With the same rationale, a semantic broadcast scheme is derived. Due to limited bandwidth, not all data items are broadcast on air; with semantic information present in the index, clients can query the required data items and determine whether the missing parts of their queries need to be fetched. In [87], clustering data items based on semantics at different granularities is discussed.
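A minimal sketch of this probe/remainder split follows, assuming semantic regions and queries are axis-aligned rectangles; the Rect type and split_query helper are illustrative assumptions, and a real system would subtract the covered portion of the query rather than resubmitting the whole query as the remainder.

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Rect:
    # Axis-aligned rectangle standing in for a semantic-region description.
    x1: float
    y1: float
    x2: float
    y2: float

    def contains(self, other: "Rect") -> bool:
        return (self.x1 <= other.x1 and self.y1 <= other.y1 and
                self.x2 >= other.x2 and self.y2 >= other.y2)

    def intersects(self, other: "Rect") -> bool:
        return not (self.x2 < other.x1 or other.x2 < self.x1 or
                    self.y2 < other.y1 or other.y2 < self.y1)

def split_query(query: Rect, regions: List[Rect]) -> Tuple[List[Rect], Optional[Rect]]:
    # Returns (cached regions usable by the probe query, remainder query or None).
    # If the query is fully contained in a cached semantic region, no server
    # contact is needed; otherwise the whole query is kept as the remainder here.
    if any(r.contains(query) for r in regions):
        return [query], None
    probe = [r for r in regions if r.intersects(query)]
    return probe, query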

2.4.3 Reliable Data Access

Wireless communication is error-prone; data might be corrupted or lost due to factors such as signal interference. When an error occurs, mobile clients have to wait for the next copy of the data if no special precaution is taken, which increases access latency. To deal with unreliable wireless communication, one idea is to introduce controlled redundancy into the broadcast schedule. Such redundancy allows mobile clients to obtain their data items from the current broadcast cycle even in the presence of errors, eliminating the need to wait for the next broadcast whenever an error occurs. In [97, 131], additional index information is distributed in the broadcast so that clients can resume searching right after recovery without waiting for the next broadcast cycle. For reliable data delivery, a set of k data blocks can be encoded with Tornado codes, proposed in [24], into a set of n encoded blocks (where n is a small multiple of k); a client can collect any k out of the n encoded blocks to decode and recover the original data blocks. A similar idea can be found in [18], where the Adaptive Information Dispersal Algorithm (AIDA) is adopted to distribute the content of m data items in n (n > m) frames. By collecting a certain number of frames, the original data items can be reconstructed using a reconstruction transmission matrix. The same algorithm also supports security, in that only legitimate clients know the reconstruction transformation matrix needed to recover the original data items.

2.5 Spatial Databases

With the popularity of spatial applications such as Geographical Information Systems (GIS) and, of course, LBSs, spatial databases serving as the core of those applications have been receiving a lot of attention. Although specialized for spatial applications, spatial databases have an architecture similar to that of conventional databases such as relational DBMSs. Currently, many commercial database systems, such as Oracle [5] and MySQL [4], are equipped with spatial extensions to support spatial data management. Since spatial indexes and query operators are essential to spatial operations like nearest neighbor search, those spatial extensions include spatial indexes and some spatial query operators. In the following, we review research studies on spatial indexes and spatial queries.

2.5.1 Spatial Indexes

As many existing database systems are equipped with single-dimensional indexes such as B+-trees and hash tables, indexing the locations of spatial objects as (x,y) coordinates can be handled by (i) indexing the object coordinates with multiple single-dimensional indexes, or (ii) mapping the multi-dimensional coordinates of objects into single values and indexing the mapped values with a single-dimensional index. Typically, these approaches target range queries. With the former approach, a range query needs to access multiple indexes, and the objects commonly retrieved from all the accessed indexes are the result objects. With the latter approach, space filling curves such as the Hilbert [23], Peano [116] and Z-order (or Morton-order) [101] curves are typically used to map objects from a multi-dimensional space to positions in a one-dimensional space; the UB-tree [112] is a representative example of this approach. Correspondingly, a range query on a multi-dimensional region is transformed into a collection of space filling curve segments, and objects are retrieved from the one-dimensional index by issuing range queries on those segments. However, this approach cannot be easily extended to support more complicated spatial queries. Due to these demands, many efficient multi-dimensional indexes have been developed. Here we review some representative ones, namely the grid file [55], quad-tree [46], kd-tree [102] and R-tree [53]. Roughly classified, the grid file and quad-tree are space-division-based approaches, whereas the kd-tree and R-tree are object-division-based approaches. For simplicity, we consider only 2D space in this discussion. A grid file splits a space into an array of cells of the same size. Objects are indexed by a cell when they are covered by the area of the cell; a range query then looks up the cells that it covers for qualified objects. Based on a similar idea, a quad-tree forms a tree in which each node is associated with an area and all of its child nodes are associated with subareas divided from that of the parent node. As the name suggests, it divides an area into four subareas. A range search is performed by traversing the quad-tree: the search drills down into the index wherever its range overlaps the nodes’ areas, until leaf nodes are reached, and the objects indexed at the reached leaf nodes are the result objects. A kd-tree is structured as a binary tree; each node has two child nodes that cover areas with exactly the same number of objects, and the space partitioning is performed along the x (or y) dimension alternately across the tree levels. R-trees, on the other hand, construct an index by recursively clustering objects. Conceptually, at the bottom level, objects are abstracted as minimum bounding boxes (MBBs); closely located MBBs form leaf nodes. Then, close leaf MBBs are clustered into coarser MBBs that form intermediate nodes, and this clustering repeats until a single MBB, the root of the R-tree, is formed. On kd-trees and R-trees, a range search traverses the tree along paths whose nodes overlap the search range, and the result objects are retrieved from the accessed leaf nodes. Among these multi-dimensional indexes, the R-tree is more efficient for spatial searches [75], and we discuss it in more detail in the following. The performance of R-trees is sensitive to how the objects are packed into MBBs. For building an R-tree from a sequence of object insertions, the basic R-tree [53] finds a suitable leaf node to accommodate each inserted object.
If a node overflows (i.e., the number of entries exceeds the node capacity), it is simply split into two nodes. To reduce false positives when accessing nodes, overlaps between MBBs should be avoided; based on this idea, the R+-tree [119] divides object extents and indexes the individual object partitions. Further, the object insertion order has a significant impact on R-tree performance, so re-inserting objects into the index can yield a better R-tree organization; based on this observation, the R∗-tree [14] focuses on re-inserting objects from overflowed nodes. Besides, instead of inserting a massive number of objects one by one to form an index, R-tree bulkloading is a more efficient means that, in general, creates the index in a bottom-up fashion. STR bulkloading [91] sorts objects on the x-dimension, then groups a certain number of objects, according to their y positions, to form MBBs, repeating the grouping until all objects are examined; it then recurses on the formed MBBs until a single MBB, the root of the index, is formed. TGR bulkloading [48] shares some ideas with the formation of a kd-tree: it repeatedly divides a set of objects along the x- or y-dimension until the number of objects in the finest set can be accommodated by a node, and then the area covered by a superset of objects forms a parent node. As discussed in [48], TGR can provide a better-performing R-tree than STR.
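As a rough illustration of the STR packing step, the following Python sketch packs 2-D points into leaf-level MBBs; it assumes point objects, and the function name and slab arithmetic are ours, not taken from [91].

import math
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)

def str_pack_leaves(points: List[Point], node_capacity: int) -> List[Box]:
    # One level of STR packing: sort by x, cut into vertical slabs, then sort each
    # slab by y and chop it into runs of `node_capacity` points; each run becomes
    # one leaf MBB. Repeating the procedure on the MBB centers builds upper levels.
    if not points:
        return []
    n = len(points)
    num_leaves = math.ceil(n / node_capacity)
    num_slabs = math.ceil(math.sqrt(num_leaves))
    slab_size = num_slabs * node_capacity
    by_x = sorted(points, key=lambda p: p[0])
    mbbs = []
    for i in range(0, n, slab_size):
        slab = sorted(by_x[i:i + slab_size], key=lambda p: p[1])
        for j in range(0, len(slab), node_capacity):
            grp = slab[j:j + node_capacity]
            mbbs.append((min(p[0] for p in grp), min(p[1] for p in grp),
                         max(p[0] for p in grp), max(p[1] for p in grp)))
    return mbbs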

2.5.2 Spatial Queries

Other than range queries and nearest neighbor (NN) queries, there are many other interesting spatial queries. We quickly review some representative ones, such as spatial join queries [122, 33, 34], closest pair queries [32, 151], density queries [70], group nearest neighbor queries (GNN) [104], transitive nearest neighbor queries (TNN) [157], reverse nearest neighbor queries (RNN) [80, 150, 124, 134], and ranked reverse nearest neighbor queries (RRNN) [88]. In particular, GNN, TNN, RNN, and RRNN are all variants of nearest neighbor queries. Spatial join queries find pairs of objects (usually of different types) whose distances are bounded by a distance threshold. Closest pair queries retrieve the pairs of objects whose distances are the shortest. Density queries determine areas that cover more objects than a density threshold. These queries consider objects in the entire area and are not commonly used as location-dependent spatial queries. The other types of queries, in contrast, are used in LBSs where user locations are specified as query points. GNN queries find objects with the smallest aggregate distance, which can be, for example, the sum or the maximum of the distances, to a group of query points; such queries are useful for making recommendations to a party of users. TNN queries find a sequence of objects such that the total distance from a query point to the last object through the sequence is minimized; these queries not only return objects but also suggest a route for visiting them. RNN queries swap the roles of query points and objects: they search for objects whose nearest neighbor is a given query point. Applications of RNN queries include finding the parties that would be interested in a query point based on object nearness. RNN queries extend directly to reverse k nearest neighbor (RkNN) queries that search for objects whose k NNs cover the query point. However, RNN and RkNN queries cannot identify which result objects are more important than others, nor can they limit the result set size. Thus, a notion of κ is introduced: if a query point is a κ-NN of an object, the closeness of the object is considered to be κ. Based on this notion, RRNN queries return the t objects whose κ values are the smallest.
Chapter 3

Complementary Space Caching

To shorten data access latency and to alleviate the contention for wireless bandwidth and client energy, client-side data caching techniques are particularly important and useful to LBSs. Conventional caching techniques maintain a portion of a database in units of tuples or pages. Due to the lack of a global view of the object distribution over the geographical space, those caching techniques cannot enable clients to assert whether the cached data alone can sufficiently satisfy LDSQs, forcing them to submit requests to the server even when the answers are completely available in the cache. Inspired by the importance of a global view of the spatial object distribution in supporting LDSQ processing, we introduce a novel data caching scheme for mobile clients called Complementary Space Caching (CSC) in this chapter. In the following, we first provide an overview of CSC and then detail LDSQ processing with CSC and CSC cache management. Further, we review related existing work on client-side data caching applicable to LDSQs. Finally, we evaluate our approaches in comparison with those existing works.

3.1 Overview of Complementary Space Caching

Formally, we consider a spatial database at the LBS server composed of a set of objects, 𝒪. The area covered by all those objects constitutes a bounded geographical space, 𝒮, called the service area of the LBS. A subset of objects O (⊆ 𝒪) residing in any subspace S (⊆ 𝒮) can be determined by a function, m, i.e., m(S) → O.¹ Conversely, given a subset of objects O (⊆ 𝒪), the corresponding subspace S (⊆ 𝒮) can also be determined. Then, the content of a CSC cache C is (O, R), where O is the subset of objects cached by the client and R represents the subspaces captured by complementary regions (CRs). The CRs are represented as bounding boxes that can be obtained or derived from spatial indexes such as the R-tree, as discussed later. Initially, a client cache contains no object; in other words, it is (∅, {𝒮}). After the first client LDSQ is processed by the server, the queried objects, O, along with the bounding boxes of the unexplored objects, R, are returned to and cached by the client. The cache then becomes (O, R).
¹This function is logically supported by LBSs.

(a) Query processing (as a zooming-in operation) (b) Cache replacement (as a zooming-out operation)

Figure 3.1. Query processing and cache replacement with CSC

With both objects and CRs maintained in the cache, a global view of the locations of all objects can be preserved. In order to preserve this global view of the dataset, both query processing and cache management in CSC behave differently from those of other caching models (which will be discussed later). Figure 3.1 gives an overview of query processing and cache replacement. Suppose that a dataset contains nine objects, all of which are indexed by an R-tree. After the first LDSQ, Q, is evaluated, the content of the client cache becomes ({e, f, g}, {rN1, rh, ri, rd}) (where rx denotes the CR created from the MBB of an R-tree node entry ex). Later, the client moves and issues another query that covers rN1 (see Figure 3.1(a)). The client queries the server for the details inside rN1, and an object c together with two CRs, ra and rb, that are parts of rN1 is received. Both are of a finer granularity than rN1 and they represent other areas of current client interest. In that sense, query processing resembles a zooming-in action that brings more details about a queried area into the cache.

Memory should be reclaimed to accommodate newly arriving objects and finer CRs if the cache storage space runs out. Suppose that an object g is the victim chosen to be removed from the cache (see Figure 3.1(b)). First, rg, a CR covering g’s position, is introduced into the cache so that the global view is preserved; then g is physically deleted.

Further, when more free space is demanded, rg, rh and ri, three closely located CRs, are coalesced into a single CR of a coarser granularity. This cache replacement is analogous to a zooming-out action that removes the details of an area in which the client currently has, or is likely to have, little interest. In order for query processing and cache replacement to maintain a global view of the dataset, CSC enforces the following integrity requirements:

Requirement 3.1. Full dataset coverage. At any time, the union of the cached objects, O, and the non-cached objects captured by the CRs, R, must equal the whole dataset 𝒪, formally O ∪ (∪_{r∈R} m(r)) = 𝒪.

This integrity requirement assures that every non-cached object (∉ O) is captured by one of the CRs, r ∈ R. Based on R, a client can determine whether there are potentially missing objects for a query.

Requirement 3.2. No CR-object overlap. A CR should not cover any cached object, formally ∀r ∈ R, m(r) ∩ O = ∅.

Requirement 3.3. No full CR-CR containment. No CR is contained in another CR, formally, ∀ri ∈ R, rj ∈ R, i ≠ j: m(ri) ⊄ m(rj).

The second requirement guarantees no false misses. The third requirement eliminates redundant CRs: a CR is redundant if it is completely covered by another CR, and it is thus safe to remove. There is obviously an overhead for delivering those CRs, in addition to the queried objects, to the clients, but it is justifiable for several reasons: (1) collecting the unexplored objects as CRs from the spatial index (e.g., R-tree) that is already used to perform query processing at the server incurs almost no cost; (2) individual CRs are compact in transmission and storage size compared with all the object locations they cover, and thus do not consume much bandwidth or cache memory; (3) the number of CRs is reasonably small since most irrelevant data objects are not needed and are abstracted as large bounding boxes; (4) it is only a one-time cost, which is amortized over subsequent queries; and (5) keeping bounding boxes in the cache can effectively avoid sending unnecessary queries to the server, as it is designed to do.

3.2 LDSQ Processing with CSC

This section presents strategies for LDSQ processing with CSC. For simplicity, in this discussion we assume that an R-tree indexing all objects is adopted at the LBS server and that the locations of individual objects are geographical points within the service area. In what follows, we first present the cache initialization, in particular the mechanism for collecting complementary regions from an R-tree. Then, we discuss LDSQ processing with the cached objects and CRs maintained in the client cache.

3.2.1 Cache Initialization

As previously mentioned, the content of a CSC cache is initialized as (∅, {𝒮}), where 𝒮 is the service area assumed to be known by all mobile clients. Whenever a client issues an LDSQ within 𝒮, it submits the query to the server for processing in order to retrieve the answer objects and CRs (which represent the non-result objects). To process an LDSQ on an R-tree, a search algorithm starts traversal at the root and recursively visits index nodes and objects. By simply examining the minimum bounding boxes (MBBs) of index nodes or objects (e.g., by checking the intersection of MBBs with a query’s search range), whether the objects covered by an MBB are candidates for the query can be quickly identified. If an MBB does not touch the query range, the corresponding subtree (i.e., all the enclosed objects) can be safely discarded from further investigation. The query traversal ends when all qualified objects are retrieved and all irrelevant subtrees are pruned. Figure 3.2(a) depicts an R-tree (with a fanout of 3) that has three leaf nodes, labeled N1, N2 and N3, and nine objects labeled ‘a’ through ‘i’. The corresponding object placements are shown in Figure 3.2(b). Suppose an LDSQ Q is evaluated. Using a depth-first search, the algorithm first traverses the root and skips its child entry eN1, whose MBB does not overlap with the query. Next, N2 is explored: entry ed is not explored since it is outside Q, and objects e and f are collected. Similarly for N3, object g is retrieved while eh and ei are not explored.

Finally, objects e, f and g are collected as the query result while the unexplored entries are eN1, ed, eh, and ei. Notice that the unexplored entries exactly represent the complement set of the queried objects; in other words, the queried objects and the unexplored entries’ MBBs together cover the entire space. Preliminarily, those MBBs can be trivially collected as CRs. As will be discussed later, the collected MBBs can be further coalesced to alleviate the consumption of wireless bandwidth.

(a) An example R-tree (b) Objects and complementary regions

Figure 3.2. R-tree and CSC
After the query evaluation, the result objects and the MBBs whose associated entries were not explored by the search are sent back to the client. Correspondingly, the client now maintains e, f and g plus the CRs rN1, rd, rh, and ri in the cache.
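The following sketch illustrates this traversal in Python. The Node and Entry structures and the intersects helper are illustrative assumptions rather than the server's actual index interface; the point is only that unexplored MBBs fall out of the recursion as CRs while qualified objects are collected as results.

from dataclasses import dataclass
from typing import List, Tuple, Union

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)

@dataclass
class Entry:
    mbb: Box
    child: Union["Node", object]   # a child node, or the object itself at a leaf
    is_leaf_entry: bool

@dataclass
class Node:
    entries: List[Entry]

def intersects(a: Box, b: Box) -> bool:
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def range_search_with_crs(node: Node, query: Box, results: list, crs: list) -> None:
    # Depth-first range search that returns answers plus CRs: entries whose MBBs
    # do not intersect the query are not explored, and their MBBs are collected as
    # complementary regions so that, together, results and CRs cover the dataset.
    for e in node.entries:
        if not intersects(e.mbb, query):
            crs.append(e.mbb)              # unexplored entry becomes a CR
        elif e.is_leaf_entry:
            results.append(e.child)        # qualified object
        else:
            range_search_with_crs(e.child, query, results, crs)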

3.2.2 Query Processing

Next, with a CSC cache, various types of LDSQs can be answered by examining the cached objects and exploring the involved CRs for missing objects at the server. Generally speaking, query processing with CSC is a three-step procedure:

1. Cache probing: Qualified cached objects are collected as part of the query result, and CRs that may cover objects contributing to the query result are identified. Cache probing varies with the type of query and will be discussed shortly in Section 3.2.3.

2. Remainder query processing: If no CR is identified for the query, meaning that the query is fully covered by the cache, query processing terminates here and the collected cached objects are returned as the query result. Otherwise, the missing objects in the identified CRs need to be requested from (and possibly checked by) the server. This will be discussed in Section 3.2.4.

3. Cache maintenance: After the query is answered, the newly returned data objects and CRs are admitted to the cache. This invokes cache maintenance operations such as cache replacement and CR coalescence, which will be discussed in Section 3.4.

3.2.3 Cache Probing

Here, we describe cache probing for LDSQs. The outputs of cache probing are the cached objects in the answer set and the CRs to be explored at the server. Although different types of LDSQs are incompatible in nature, they can be processed in a similar fashion using CSC, as follows. Cache probing for a range query is straightforward: the client scans the cached objects and CRs and returns those overlapped by the query range. Figure 3.3(a) shows an example of a range query, QR; in this example, the cached object g and the CR rh are identified. Without an explicit search range, cache probing for an NN query expands a search range from the query point outwards until one object is touched; then, all CRs within the search range are identified for further exploration. The extension to kNN queries is straightforward: the search range is expanded until the first k objects are covered. In Figure 3.3(b), the client finds objects f and e, the two objects closest to the query point q2NN of the query Q2NN. The CR rN1, which overlaps the vicinity circle through e, is identified for further exploration.

(a) Range query (QR) (b) NN query (QNN) and kNN query (Q2NN)
Figure 3.3. Cache probing
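A minimal sketch of cache probing for range and kNN queries over point objects follows; the helper names and the use of the k-th cached distance as a conservative vicinity radius are our assumptions for illustration.

import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)
Point = Tuple[float, float]

def box_intersects(a: Box, b: Box) -> bool:
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def probe_range(cache_objects: List[Point], cache_crs: List[Box], query: Box):
    # Range query: split the cache into (locally answered objects, CRs to explore).
    hits = [o for o in cache_objects
            if query[0] <= o[0] <= query[2] and query[1] <= o[1] <= query[3]]
    to_explore = [r for r in cache_crs if box_intersects(r, query)]
    return hits, to_explore

def probe_knn(cache_objects: List[Point], cache_crs: List[Box], q: Point, k: int):
    # kNN query: the k-th cached distance bounds the vicinity circle; every CR
    # intersecting that circle may still hide closer objects and must be explored.
    dists = sorted(math.dist(q, o) for o in cache_objects)
    radius = dists[k - 1] if len(dists) >= k else float("inf")
    candidates = [o for o in cache_objects if math.dist(q, o) <= radius]
    def mindist(r: Box) -> float:
        dx = max(r[0] - q[0], 0.0, q[0] - r[2])
        dy = max(r[1] - q[1], 0.0, q[1] - r[3])
        return math.hypot(dx, dy)
    to_explore = [r for r in cache_crs if mindist(r) <= radius]
    return candidates, to_explore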

3.2.4 Remainder Query Processing

A remainder query is submitted to the server to retrieve missing objects if some CRs are identified for a query. In addition, refined CRs (i.e., MBBs of entries inside the submitted CRs that are not explored by the query) may be returned. One of the primary issues in processing remainder queries is “how to express a remainder query”, which has a major impact on the processing cost (in terms of response time and wireless bandwidth consumption) and the quality of the cached information. We examine two possible approaches: 1) CRs only, and 2) Query+CRs. The first approach is to submit the identified CRs only; the CRs are thus treated as constrained regions at the server. As illustrated in Figure 3.4(a), a query, Q, overlaps with three objects, e, f, g and three CRs, rN1, rh, ri. The remainder query in this approach is expressed as (rN1, rh, ri). Because rN1 covers a large area outside the range of query Q, some extra objects may be returned to the client and, even worse, may never be needed at all. On the other hand, this approach consumes minimal uplink bandwidth.


(a) Query Q covering e, f, g, rh, ri and rN1 (b) Partitioning rN1 (c) Request CR coalescence of rh and ri
Figure 3.4. Remainder query expressions

The second approach is to submit the original query along with the identified CRs. Thus, as shown in Figure 3.4(a), a remainder query in this approach is expressed as Q plus (rN1, rh, ri). The CRs are used as filters for processing Q at the server: when the R-tree index is traversed, only the branches overlapping both the CRs and the query range are further explored. Then, the MBBs that are unexplored by Q and intersect the CRs are also returned (as refined CRs within the original CRs), along with the qualified data objects, to the client. Even though this approach has a slightly higher uplink bandwidth overhead than the first approach, the downlink cost is reduced because a precise set of missing objects is downloaded. Another important issue in remainder query processing with the second approach (this issue is irrelevant to the first approach) is the number of refined CRs generated. Delivering a large number of fine-granularity CRs back to the clients may incur excessive downlink overhead (and additional energy consumption at the clients). To address this issue, a client can partition CRs if they are only partially covered by a query. Using precise CRs in remainder queries avoids redundant CRs being returned to the client. An example is depicted in Figure 3.4(a): a CR rN1 is partially covered by a query Q and thus can be partitioned into three parts, namely, P1, P2 and P3 (as shown in Figure 3.4(b)). The portion P3, a region enclosing the overlap between rN1 and Q, is taken to formulate the remainder query. Thus, rN1 is removed while P1 and P2 are retained as CRs in the cache. This partitioning may result in some savings of download overhead because the number of refined CRs in P3 is expected to be smaller than that in rN1.

However, this results in lower-precision CRs, such as P1 and P2, whose bounding boxes do not tightly enclose the objects inside them. It is also possible that the partitioned CRs contain no missing objects at all, and exploring them then causes false misses. On the other hand, a large number of CRs could be covered by an LDSQ, and submitting all those individual CRs to the server incurs a high upload cost. Thus, CRs can be merged into a few coarse CRs; this merging of CRs is called CR coalescence. As shown in Figure 3.4(c), rh and ri are coalesced into a coarse CR r. Then, rN1 and r can be submitted instead, i.e., the remainder query is Q + (r, rN1). However, coalescing CRs in remainder queries is not simply a matter of replacing all fine-granularity CRs with a coarse CR, since the newly formed CR may overlap with some answer objects already found in the cache. For example, further coalescence of r and rN1 may form a larger CR, r′, that overlaps with the cached objects e, f, and g. Using r′ as a request CR would redundantly fetch these already cached objects. To tackle this problem, we may supplement the IDs of the overlapped cached objects in the remainder query as a result filter that removes the already cached objects from the downlink. The new remainder query then becomes Q + (r′) + ({e, f, g}) and is sent to the server. This raises the question of whether the expression Q + (r′) + ({e, f, g}) saves more uplink bandwidth than Q + (r, rN1) or other possible expressions. This is an optimization issue in CR coalescence, which will be discussed in Section 3.3. Finally, for NN or kNN query processing, instead of exploring all potential CRs covered by the conservative vicinity circle as described above, we can take an incremental approach that explores the identified CRs one by one until the answer set is obtained.
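A small sketch of the partitioning step follows, assuming axis-aligned rectangles that actually overlap; the function name and the particular decomposition into up to four leftover strips are illustrative choices, not the dissertation's exact procedure (Figure 3.4(b) shows a case with two leftover parts).

from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)

def partition_cr(cr: Box, query: Box) -> Tuple[Box, List[Box]]:
    # Split a partially covered CR into the part sent with the remainder query
    # (the overlap with the query) and the leftover parts kept in the cache.
    # Assumes cr and query overlap.
    ox1, oy1 = max(cr[0], query[0]), max(cr[1], query[1])
    ox2, oy2 = min(cr[2], query[2]), min(cr[3], query[3])
    overlap = (ox1, oy1, ox2, oy2)                                  # submitted part
    leftovers = []
    if cr[0] < ox1: leftovers.append((cr[0], cr[1], ox1, cr[3]))    # left strip
    if ox2 < cr[2]: leftovers.append((ox2, cr[1], cr[2], cr[3]))    # right strip
    if cr[1] < oy1: leftovers.append((ox1, cr[1], ox2, oy1))        # bottom strip
    if oy2 < cr[3]: leftovers.append((ox1, oy2, ox2, cr[3]))        # top strip
    return overlap, leftovers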

3.3 CR Coalescence

CRs are essential information transmitted between the client and the server, and an efficient CR coalescence algorithm is needed to condense overly fine CRs in order to save bandwidth. In this section, we first devise a generic CR coalescence algorithm, from which two specializations are derived for coalescing CRs in (i) request messages from clients to the server, and (ii) reply messages from the server to clients.

3.3.1 Generic CR Coalescence Algorithm

Given a set of CRs, R, a coalescence algorithm divides R′ (⊆ R) into n subsets, R′_1, ···, R′_n (R′_i ⊆ R′, 1 ≤ i ≤ n), and then each R′_i is replaced by a newly formed CR called a coalescing CR, denoted by r_{R′_i}, which is a bounding box that tightly encloses all the original CRs in R′_i. Hence, the coalescence operation on R can be described as (R − R′) ∪ (∪_i {r_{R′_i}}). The key issue here is to determine the optimal subsets of CRs to be coalesced. In order to tackle this selection problem, we first formulate a cost model. Every CR, r, bears a cost, c(r), which measures the performance loss due to the imprecision with which non-cached objects are represented. The definition of the cost function used to measure the costs incurred by different coalescence approaches may vary with the operation scenario (as will be detailed later in this section). In general, the larger the region, the higher the cost, and the greater the potential performance loss. Therefore, after coalescing the subsets R′_i (i = 1, 2, ···, n), the cost increase is:

total cost increase = Σ_{1 ≤ i ≤ n} ( c(r_{R′_i}) − Σ_{r ∈ R′_i} c(r) ),   (3.1)

but the number of CRs is reduced by:

total CR saving = Σ_{1 ≤ i ≤ n} |R′_i| − n.   (3.2)

Given an expected CR saving, the optimal selection algorithm should minimize the cost increase. This selection is a set-cover problem, which is known to be NP-complete. Instead, we propose a heuristic algorithm to choose the CRs to coalesce. The algorithm repeats its search until a termination condition is met; the termination condition depends on the coalescence need and can be specified either by limiting the number of CRs coalesced, so that the remaining number of CRs is controlled, or by setting a threshold on the cost-increase metric, which guarantees the CR fineness. The pseudo-code of the algorithm is outlined in Figure 3.5. First, every possible pair of CRs (∈ R) for coalescence is enqueued into a priority queue Q (lines 1-3). Then, iteratively, the best pair retrieved from the queue is coalesced (lines 4-9), where “best” means the least cost increase after coalescing the pair of CRs. A priority queue is used to keep track of the candidate CR pairs and order them. After the coalescence, the original CRs ri and rj are replaced with the coalescing CR, r_{ri,rj} (line 6); the CR saving and cost increase are 1 and c(r_{ri,rj}) − c(ri) − c(rj), respectively. Based on r_{ri,rj}, a new candidate pair is inserted into the queue (line 9). Then, the algorithm repeats until the specified termination condition is satisfied. In the following, we derive specific coalescence techniques for coalescing CRs in client requests (request CRs) and CRs in server replies (reply CRs).

Algorithm GenericCRCoalescence(R)
Input/output: a set of CRs R; Local: priority queue Q;
Begin
1   foreach r ∈ R do
2     find r's best counterpart CR, r′, from R − {r};
3     push (r, r′, anticipated cost increase) into Q;
4   while termination condition is not satisfied do
5     pop (ri, rj, anticipated cost increase) from Q;
6     r ← coalesce(ri, rj);
7     replace ri and rj with r in R;
8     find r's best partner CR, r′, from R − {r};
9     push (r, r′, anticipated cost increase) into Q;
10  output R;
End.

Figure 3.5. The pseudo-code of the generic CR coalescence algorithm
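For readers who prefer running code, the following is a simplified Python counterpart of Figure 3.5; it rescans all candidate pairs at every step instead of maintaining the priority queue, and the cost function, the target_count termination parameter, and the bounding_box helper are assumptions made for this sketch.

from itertools import combinations
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)

def bounding_box(a: Box, b: Box) -> Box:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def coalesce_crs(crs: List[Box],
                 cost: Callable[[Box], float],
                 target_count: int) -> List[Box]:
    # Greedy CR coalescence: repeatedly merge the pair whose coalescing CR causes
    # the least cost increase, c(r_merged) - c(r_i) - c(r_j), until only
    # `target_count` CRs remain.
    crs = list(crs)
    while len(crs) > target_count and len(crs) >= 2:
        i, j = min(combinations(range(len(crs)), 2),
                   key=lambda ij: cost(bounding_box(crs[ij[0]], crs[ij[1]]))
                                  - cost(crs[ij[0]]) - cost(crs[ij[1]]))
        merged = bounding_box(crs[i], crs[j])
        crs = [r for k, r in enumerate(crs) if k not in (i, j)] + [merged]
    return crs

# Example usage: area-based cost, shrink the CR set down to two CRs.
# area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
# coarse = coalesce_crs(crs, area, target_count=2)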

3.3.2 Client Request CR Coalescence

In Section 3.2.4, we briefly discussed the issue of request CR coalescence and raised the question of which CRs should be coalesced. Here, we address this problem with the generic coalescence algorithm described above. Let r be a CR. We set the cost of r, c(r), to the number of cached objects covered by r. As the size of the remainder query is our main concern in coalescing request CRs, we aim at maximizing the overhead reduction specified below:

overhead reduction = total CR saving × CR size − total cost increase × object ID size.   (3.3)

This expression considers the volume saved by CR coalescence (the CR saving) and the overhead of including additional object IDs (the cost increase). Reconsider the situation in Figure 3.4(c): rh and ri can be coalesced to form a coalescing CR, r, with 1 CR saved and no object included, i.e., c(r) = 0. Further coalescing r and rN1 into r′ saves 1 more CR but covers three objects, i.e., c(r′) = 3. As a CR and an object ID take 16 bytes and 4 bytes, respectively, the total overhead reduction for taking r′ and {e, f, g} is 32 − 12 = 20, and that for taking r and rN1 is 16. As a result, r′ together with {e, f, g} is used to represent the remainder query.

3.3.3 Server Reply CR Coalescence

Very often, portions of the CRs submitted with remainder queries might not be fully explored when answering queries in the Query + CRs approach. Thus, refined CRs are returned to the client along with the answer objects. In order to save downlink wireless bandwidth, the server reply CRs can be coalesced. The optimization should consider reducing the number of CRs while retaining the quality of the CRs such that a low false miss rate can be achieved. We associate CR quality with quantitative metrics by defining the cost function c(r) for a CR r based on different heuristics:

• Area. Generally, the larger the area of a CR, the more likely it is to yield a higher false miss rate, since it may include more empty regions in which no objects exist. Therefore, c(r) is set to the area of r, area(r). In this case, we expect a smaller average size of the coalescing CRs.

• Distance. Due to the strong spatial access locality exhibited by location-dependent applications, the closer a CR is to the query location, the more likely it is to be accessed in the near future. It is thus important to keep a fine granularity for nearby CRs. Hence, we model c(r) as the inverse of its distance to the user, i.e., 1/dist(r). In this case, we expect farther CRs to be coalesced.

• Area By Distance. Area and distance are two orthogonal factors, and they can be combined in setting the cost c(r), i.e., c(r) = area(r)/dist(r).

Server reply CR coalescence can save download cost, but it also hurts CR quality. To balance the transmission overhead saving and the quality of the coalesced CRs, we limit the CR saving. In our implementation, we set a threshold expressed as a percentage of the total number of CRs before coalescence; when the number of remaining CRs falls below this threshold, the coalescence terminates. Note that server reply CR coalescence has an additional constraint: if a coalescing CR would contain some of the returned objects, the corresponding coalescence is prohibited, because the resulting CR is highly likely to cause a false miss if the client issues the same query later.
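The three heuristics above can be written down directly. In the sketch below, dist is taken to be the minimum distance from the client location to the CR, which is our assumption since the text only speaks of the CR's distance to the user, and the small epsilon guards against division by zero.

import math
from typing import Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)
Point = Tuple[float, float]

def area(r: Box) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])

def dist(r: Box, q: Point) -> float:
    # Minimum distance from the client/query location q to CR r (an assumption).
    dx = max(r[0] - q[0], 0.0, q[0] - r[2])
    dy = max(r[1] - q[1], 0.0, q[1] - r[3])
    return math.hypot(dx, dy)

def cost_area(r: Box, q: Point) -> float:
    return area(r)

def cost_distance(r: Box, q: Point) -> float:
    return 1.0 / max(dist(r, q), 1e-9)          # closer CRs are costlier to coarsen

def cost_area_by_distance(r: Box, q: Point) -> float:
    return area(r) / max(dist(r, q), 1e-9)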

3.4 Cache Management

Conventional cache management handles objects in the cache memory and cache replacement is performed on an object basis. As CSC keeps both objects and CRs to preserve a global view of a dataset, its cache management becomes totally different. In this section, we discuss the cache organization implemented for CSC and two proposed cache allocation strategies. We then describe the cache CR coalescence and the corresponding cache replacement algorithm.

3.4.1 Cache Organization

The cache memory is structured as a table. Each table entry is of equal size and large enough to accommodate either one object or some CRs. A table entry assigned to maintain CRs (called a CR entry) keeps at most n CRs and one coalescing CR, which is a bounding box enclosing all the CRs within this entry; here, n is the capacity of a CR entry. Each stored CR carries a timestamp of its latest access time. The coalescing CR facilitates fast CR lookup, CR coalescence, and cache replacement. Every insertion into and deletion from a CR entry causes the associated coalescing CR to be updated accordingly. The admission of an object is straightforward, i.e., finding a vacant entry to accommodate the object. A CR entry is chosen to store an admitted CR if the expansion of its coalescing CR after the insertion is the least among all the candidate CR entries. If a CR entry overflows after an insertion, its CRs (except the coalescing CR) are migrated to other CR entries with free space; if insufficient space is available, the collection of CRs in the entry is split into two groups and each group is placed into a separate CR entry. Deletion removes a CR from an entry. To maintain high space occupancy, an occupancy threshold is set (in our experiments we used n/2, where n is the capacity of a CR entry). An underflowed entry (i.e., one whose occupancy is below the threshold) is removed and all its CRs (except the coalescing CR) are re-inserted into other CR entries; thereafter, the entry is freed to admit new data. We propose two possible space allocation strategies, namely static allocation and dynamic allocation, which represent different policies for managing both CRs and data objects. With static allocation, the cache memory is split into two portions, each dedicated to caching objects or CRs, and the management of objects and CRs is independent of each other. Dynamic allocation does not distinguish objects from CRs and treats them in the same way in order to exploit higher flexibility in space utilization.
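A compact sketch of CR admission with least-expansion placement follows; the CREntry class, its method names, and the omission of overflow and split handling are simplifications introduced for illustration.

from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)

def _area(b: Box) -> float:
    return (b[2] - b[0]) * (b[3] - b[1])

class CREntry:
    # A hypothetical cache-table entry holding up to `capacity` CRs plus one coalescing CR.
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.crs: List[Box] = []
        self.coalescing_cr: Optional[Box] = None

    def expansion_if_added(self, r: Box) -> float:
        # How much the coalescing CR's area would grow if r were inserted here.
        if self.coalescing_cr is None:
            return 0.0
        c = self.coalescing_cr
        grown = (min(c[0], r[0]), min(c[1], r[1]), max(c[2], r[2]), max(c[3], r[3]))
        return _area(grown) - _area(c)

    def add(self, r: Box) -> None:
        self.crs.append(r)
        c = self.coalescing_cr
        self.coalescing_cr = r if c is None else (
            min(c[0], r[0]), min(c[1], r[1]), max(c[2], r[2]), max(c[3], r[3]))

def admit_cr(entries: List[CREntry], r: Box) -> None:
    # Insert CR r into the entry whose coalescing CR expands the least
    # (overflow migration and splitting are omitted in this sketch).
    candidates = [e for e in entries if len(e.crs) < e.capacity]
    target = min(candidates, key=lambda e: e.expansion_if_added(r))
    target.add(r)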

3.4.2 Cache CR Coalescence

Cache CR coalescence is needed to release space for newly admitted objects or CRs. The idea is to select a set of fine CRs and replace them with a bounding CR of a coarser granularity. The efficiency of CR coalescence is crucial to performance when cache replacement occurs frequently. Therefore, instead of using the generic algorithm described in Section 3.3, we strategically group the CRs in the same CR entry into their corresponding coalescing CR when CRs are admitted to the cache (as described in Section 3.4.1). We use the minimal expanded area of the coalescing CR as the criterion for determining which CR entry a new CR is inserted into. Since the coalescing CR in a CR entry readily represents the result of coalescing all the CRs in the entry, we can quickly perform CR coalescence to release a CR entry by looking up the coalescing CRs only.

3.4.3 Cache Replacement

Cache replacement in CSC is responsible not only for fitting objects and CRs in the cache but also for balancing the granularity of the different portions of the global view (in terms of objects and CRs) maintained in the cache. Recall that a query is analogous to zooming in on an area of interest, while removing a victim object or CR(s) is analogous to zooming out of an area that is less likely to be visited in the near future. Thus, cache replacement is treated as a process of re-focusing from an area of waning interest to a new area that will potentially receive more attention (and thus needs more detail). An object removal is performed by converting the object into a CR; a CR removal implies coalescence with another CR. However, to make cache replacement efficient, usually all the CRs in a victim CR entry are removed by inserting its coalescing CR into another CR entry and then freeing the entry for new cached data. In the following, we discuss the replacement algorithms corresponding to static allocation and dynamic allocation. Static allocation. Replacement starts in the object portion. If the object portion is full, victim objects are removed by transforming them into CRs, which are inserted into the CR portion. If the CR portion is full, victim CR entries are chosen for removal and their coalescing CRs are re-inserted into the CR portion. The victim selection (i.e., the cache replacement policy) can be based on the LRU or FAR [113] heuristics. For the FAR heuristic, the distance is measured between the current client position and the anticipated CR (resulting either from object deletion or from the coalescence of CRs).

Dynamic allocation. Both object replacement and CR coalescence can make room for new objects and CRs, and the cache replacement policy for both operations is crucial to the overall cache performance. In order to prioritize object replacement and CR coalescence, which are totally different in nature, we use a replacement score based on the expected reloading cost of objects or CRs. Let size_o and size_r denote the object size and the CR size, respectively, and let ρ denote the access probability of an entity (either an object or a CR); in this work, we consider access probabilities based on LRU or FAR. The communication cost of reloading an object from the server is ρ × size_o, and that of reloading CRs is ρ × m × size_r, where m is the number of CRs involved in the CR coalescence. Taking the reloading cost as the replacement score, we describe our cache replacement operation as follows. We maintain a priority queue of the existing table entries; the queue always returns the entry with the least reloading cost. When a table entry is retrieved from the queue, it is freed to accommodate the new object, and the newly formed CR (resulting from object conversion or CR coalescence) is inserted back into an appropriate CR entry. Similarly, CRs downloaded from the server are inserted into CR entries. It may be the case that a CR entry overflows and additional entry space is required; then, additional entry space is reclaimed in the same way as for a new incoming object. It is also possible that the newly formed CR entry has a reloading cost less than those of the entries in the queue; in this case, CR coalescence is immediately performed and the coalescing CR is re-inserted into the cache.
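As a small illustration of the replacement-score idea (a sketch with made-up byte sizes; the score functions and heap layout are our own shorthand, not the dissertation's implementation):

import heapq
from typing import List, Tuple

SIZE_OBJECT = 256      # bytes; illustrative values only
SIZE_CR = 16

def object_score(access_prob: float) -> float:
    # Expected reloading cost of evicting a cached object: rho * size_o.
    return access_prob * SIZE_OBJECT

def cr_entry_score(access_prob: float, num_crs: int) -> float:
    # Expected reloading cost of coalescing a CR entry: rho * m * size_r.
    return access_prob * num_crs * SIZE_CR

def build_victim_queue(objects: List[Tuple[str, float]],
                       cr_entries: List[Tuple[str, float, int]]):
    # Min-heap of (reloading cost, entry id); the cheapest entry is evicted first.
    heap = []
    for oid, rho in objects:
        heapq.heappush(heap, (object_score(rho), oid))
    for eid, rho, m in cr_entries:
        heapq.heappush(heap, (cr_entry_score(rho, m), eid))
    return heap

# Example: pop the next victim.
# cost, victim = heapq.heappop(build_victim_queue([("o1", 0.3)], [("e1", 0.5, 4)]))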

3.5 Other Client-Side Data Caching Models

To enable clients to reuse cached query results, semantic caching stores query results along with their corresponding query expressions (which serve as the semantic descriptions of the cached results) [35, 85, 114]. Thus, a query and its result form a semantic region. By consulting the query expressions of the cached semantic regions, a new query can be rewritten into a probe query, which can be answered locally from the cache, and a remainder query, which is answerable only by the server. If a query is fully covered by cached semantic regions, no contact with the server is needed. However, the representation of semantic regions is highly query-dependent: if a query is of a different type from the queries captured by the semantic regions, whether the cached data objects can be reused cannot be trivially determined by comparing the query expressions. Besides, because a client’s knowledge about data objects is constrained by the cached semantic regions, the client is unable to determine whether there are objects outside of the cached semantic regions. Therefore, if a query is partially covered by semantic regions, remainder queries (i.e., the uncovered portions of the query) must be formed and submitted to the server to retrieve possibly missing objects. Similar to semantic caching, chunk-based caching [39] partitions the semantic space into a set of chunks, independent of the query and object distribution, by fixed space partitioning; the idea is quite similar to a grid index. When a range query is mapped into a set of chunks, the chunks not already in the cache are fetched from the server, regardless of whether they contain objects or not. Again, due to the lack of an entire view of the data space, chunk-based caching is limited in supporting various kinds of queries. Proactive caching [61] supports a number of different types of LDSQs, as our CSC does. It is important to highlight that proactive caching is conceptually and functionally different from CSC. Proactive caching adheres tightly to the R-tree, but CSC does not. Proactive caching maintains the traversed index paths (a portion of the index) and the set of objects below the cached index paths in the cache. The cached partial index enables a client to execute the same query processing algorithms as the server does; if a query needs any missing index nodes or objects, the query and all intermediate execution states are shipped to the server for the remaining execution. The cached index paths in proactive caching are the only means to access the underlying objects, so the cached index nodes are implicitly granted a higher caching priority than objects. When space runs out, however, objects are chosen for removal before the index nodes on the paths from the root to them. This hurts the cache hit rate because cached index nodes alone (without the objects beneath them) cannot make queries locally answerable. Besides, excessive bandwidth is spent transmitting index structures, which in fact can be reconstructed from objects and CRs, as shown in CSC. Moreover, without the need to conform to the R-tree at the server, CRs can be flexibly coalesced and partitioned to optimize cache performance. In the next section, we evaluate the performance of our CSC and these related approaches.

3.6 Performance Evaluation

We conducted a performance evaluation of our CSC based on a simulation. The simulator, developed in C++, includes one client and one server communicating with each other via a point-to-point wireless channel with a bandwidth of 384 Kbps, i.e., the typical bandwidth of a 3G network.

The server maintains a synthetic dataset with 100,000 objects uniformly distributed in a unit square [1, 1] and indexed with an R*-tree [14] that has a node page size of 1 Kbyte and a maximum fanout of 50. The object size ranges over 64, 128, 256, and 512 bytes. The client has a 128-Kbyte cache memory. In the experiments, client movement patterns are generated based on the Manhattan Grid model, in which clients move in four directions, i.e., up, down, left, and right, with equal likelihood. We set the mean speed to 1 × 10−3/sec and the standard deviation to 0.2 × 10−3/sec. The maximum think time (i.e., the duration for which the client remains stationary at a moving path change) is set to 60 seconds. Meanwhile, we generate a query workload with query inter-arrival times following an exponential distribution whose mean varies from 10 to 30 seconds in steps of 10. We assume that the client issues queries along her journey. We examine two types of range queries, namely 1) circular range queries (with a radius of 5 × 10−3 to 10 × 10−3) with respect to a query point and 2) rectangular range queries (with a window size of (10 × 10−3)2 to (20 × 10−3)2) centered at a query point, as well as kNN queries (with k ∈ [10, 20]). Each type of query has the same weight in our experiments. The simulation runs for 10,000 seconds. We summarize all parameters and their values in Table 3.1.

Parameters            Values
Number of objects     100,000 objects
Object storage size   64, 128, 256, 512 bytes
Area                  unit area [1, 1]
Client movement       Manhattan grid model (speed: mean = 1 × 10−3/sec, standard deviation = 0.2 × 10−3/sec)
Queries               circular range query (radius = 5 × 10−3 to 10 × 10−3);
                      rectangular range query (window size = (10 × 10−3)2 to (20 × 10−3)2);
                      kNN query (k ∈ [10, 20])
Simulation time       10,000 seconds
Table 3.1. Experiment parameters for performance evaluation on CSC

The performance metrics used in this evaluation include response time, bandwidth consumption, cache hit ratio and answerability. Answerability measures how many queries can be completely answered by the client cache without the server’s help. In addition to our proposed CSC, we implemented chunk-based caching [39], semantic caching [35, 159], and proactive caching [61], as discussed in the previous section, for comparison. Notice that chunk-based caching supports range queries only and does not support kNN queries. Semantic caching supports both range and kNN queries by caching two types of semantic regions.

(a) Response time (b) Total bandwidth consumption (c) Cache hit ratio

Figure 3.6. Performance of caching schemes on various query inter-arrival times

However, each type of semantic region can only support queries of the same type. Proactive caching, which maintains a portion of the R-tree index, can support all the query types we considered. In the experiments, when a client issues a query that its caching scheme does not support, it asks the server to process the query and no results are cached.

3.6.1 Performance of Caching Schemes

We first examine the performance of the different approaches, namely, CSC, semantic caching, chunk-based caching, and proactive caching, labeled CSC, Semantic, Chunk-based and Proactive, respectively, in terms of response time, bandwidth (both uplink and downlink) and cache hit ratio. For CSC, remainder queries are expressed in the form of Query + CRs, with partitioning and both request and reply CR coalescence. We vary the query inter-arrival time from 10 to 30. The results of experiments using the Manhattan Grid model, an object size of 256 bytes and a cache size of 256 Kbytes are reported in Figure 3.6.

From the plots, we can see that CSC outperforms all the others in all metrics, owing to its effectiveness in supporting different queries, its effective cache memory utilization, and its low data transmission overhead. This evaluation validates the idea of using CSC for LBSs. We can also see that Semantic is the weakest among all, because it maintains two types of semantic regions that may result in an overlap of cached objects, in turn degrading cache utilization (as indicated by its low cache hit ratio). On the other hand, Chunk-based performs better than Semantic, since chunks contain extra objects that can be used to answer later range queries, which outweighs some loss in processing kNN queries. Proactive performs worse than Chunk-based because the partial server index maintained in the cache reduces the cache memory available for data objects. In addition, we study the impact of cache size on the caching schemes. Here, the object size is fixed at 256 bytes and the mean query inter-arrival time is 20 seconds. In Figure 3.7, the response times of Semantic and Chunk-based are shown to be more or less invariant, since they use the cache space to maintain the query results they support and, in effect, may not fully utilize the cache storage. Compared with CSC, whose response time is only slightly higher when the cache size is reduced to 64 Kbytes, Proactive yields a long response time with a smaller cache, because most of the cache space is spent not on objects but on index nodes that do not help improve cache hits. Next, Figure 3.8 shows the results of experiments using a fixed cache size of 128 Kbytes and a mean query inter-arrival time of 20 seconds. All approaches show that the larger the object size, the longer the expected response time: as the object size increases, the cache hit ratio is reduced because fewer objects are cached and, more importantly, the downlink bandwidth consumption, a major component of response time, also increases with larger objects. For the same reasons discussed for the above two settings, CSC is shown to be superior to the others.

3.6.2 Performance of Different Remainder Query Expressions

Then, we evaluate the performance of three different forms of remainder queries, i.e., the original query plus CRs (Q + CRs), Q + CRs with partitioning (Q + CRs(p)), and Q + CRs with both partitioning and client request CR coalescence (Q + CRs(p + c)). Notice that these three forms of remainder queries yield the same cache hits, so results on this aspect are not shown. We also evaluated the remainder query with CRs only, but its performance is much worse and the plot is omitted for presentation clarity.

(a) Response time (b) Total bandwidth consumption (c) Cache hit ratio

Figure 3.7. Performance of caching schemes on various cache sizes

The difference in their performance is due to the compression and improved precision of CRs. Using partitioning (i.e., Q + CRs(p) and Q + CRs(p + c)), the client can avoid downloading extra CRs, so the download bandwidth is reduced (see Figure 3.9(c)) and the response time is also shortened (Figure 3.9(a)). However, the answerability is much lower than with the basic Q + CRs, because the CRs partitioned by the client are less precise: both Q + CRs(p) and Q + CRs(p + c) can answer only 10% of the queries without the server’s help, while the basic Q + CRs can do so 33% of the time. Finally, the CR coalescence in Q + CRs(p + c) is shown to be very effective in reducing the uplink bandwidth (see Figure 3.9(b)); it saves almost 50% of the uplink bandwidth compared with Q + CRs(p). This saving is very important because mobile clients consume more energy in sending packets than in receiving packets.

3.6.3 Performance of Server Reply CR Coalescence

Next, we study the three heuristics, Area, Distance, and AreaByDistance, used in coalescing CRs in server replies (see Figure 3.10). In this experiment, we assume that remainder queries are sent in the form of Q + CRs. We vary the percentage of CRs coalesced (where 0% means no coalescence) to observe its impact on response time and total bandwidth.

(a) Response time (b) Total bandwidth consumption (c) Cache hit ratio

Figure 3.8. Performance of caching schemes on various object sizes

In general, Area is not a good heuristic in terms of response time and bandwidth consumption, because the CRs close to the query answer are often of smaller area; forming coarse CRs from those close, small CRs degrades the cache performance since those close CRs are likely to be accessed. On the other hand, AreaByDistance provides very good performance (even better than Distance): balancing the client location against the CR size turns out to produce an appropriate granularity for the returned CRs.

3.6.4 Performance of Cache Management

Finally, we study the two cache space allocation strategies, i.e., static allocation (labeled Static) and dynamic allocation (labeled Dynamic). For Static, we statically allocate x percent of the cache storage to CRs. As shown in Figure 3.11(a), dynamic allocation generally performs the best in terms of response time. For static allocation, we can see that increasing the cache space for CRs from 10% to 20% improves the response time.

Figure 3.9. Performance of remainder query expressions (Q+CRs, Q+CRs(p), Q+CRs(p+c)) over various query inter-arrival times: (a) response time; (b) uplink bandwidth consumption; (c) downlink bandwidth consumption; (d) cache hit ratio.

Figure 3.10. Performance of heuristical server reply CR coalescence: (a) response time; (b) downlink bandwidth consumption.

This can be explained by the corresponding cache hit ratios (shown in Figure 3.11(b)): the more cache space is allocated to CRs, the less is available for objects, so the cache hit ratio keeps dropping as the CR portion of the cache expands. On the other hand, although Static 10% and Static 20%, by allocating less space to CRs, attain higher cache hit ratios than Dynamic, they hold overly coarse CRs and therefore suffer a high false miss rate, which in turn increases the response time.

Figure 3.11. Performance of cache allocation strategies (Static 10%/15%/20%/25% and Dynamic) over various query inter-arrival times: (a) response time; (b) cache hit ratio.

3.7 Chapter Summary

To the best of our knowledge, this is the first study that explores the idea of providing a global view of a dataset at the client side to support various types of LDSQs. Based on this idea, we have developed Complementary Space Caching and presented it in this chapter. In addition to the theoretical study, we also realized CSC with a system prototype [86].

Chapter 4

Nearest Surrounder Queries

Our research presented in this chapter addresses the efficient evaluation of a practical and interesting type of LDSQ called Nearest Surrounder (NS) queries. Since NS searches need to consider both the distances and the angles of objects, i.e., two orthogonal search criteria, simultaneously, the design of efficient algorithms for NS queries is not trivial. Based on observations from an in-depth analysis, we identify the angle-based bounding properties of polygons (i.e., the representation of spatial objects) and of the MBRs of an R-tree (i.e., an abstraction of objects and groups of objects in a space). With these properties as the basis, we develop two novel algorithms, namely Sweep and Ripple, according to two important facets of NS queries, surroundings and nearness: the Sweep algorithm takes an angular view to explore the search space for result objects, while the Ripple algorithm explores the search space with a distance-based strategy. We evaluate the proposed algorithms for NS queries and their variants through empirical studies with both synthetic and real object sets. In the following, we first define NS queries and introduce their variants. Then, we explore the angle-based bounding properties of polygon-shaped objects and of the MBRs of R-trees. Thereafter, we present the Sweep algorithm and the Ripple algorithm and report their performance.

4.1 Definitions of Nearest Surrounder Queries

Among many spatial queries, Nearest Neighbor (NN) queries [56, 115] and their variants, such as Reverse NN [80], Constrained NN [45] and Group NN [104], are usually taken for granted in assisting location-related decision making. For instance, a tourist information system that provides an NN search may assist tourists in finding the nearest attractive points of interest.

However, in many situations, an NN query may not precisely match what users look for. Consider the same tourist information system: queries that return attractive points according to both distances and directions could give tourists a better picture of the surroundings of their locations. Take a battlefield as another example: to ensure a clear firing path, soldiers need to be aware of threats from enemies potentially located all around them. These examples motivate the need for a new type of spatial query that takes the angles of objects with respect to a query point into consideration: the "Nearest Surrounder" query. Given a set of objects and a query point in a two-dimensional space, a nearest surrounder (NS) query (as formally defined in Definition 4.1) retrieves the nearest neighboring objects at consecutive ranges of angles with respect to the query point. Figure 4.1(a) shows a query point, q, and 10 objects, labeled o1, o2, ..., o10. The result set of an NS query issued at q, NS(q, O), is { o1:[αf, αa], o3:[αa, αb], o6:[αb, αc], o7:[αc, αd], o8:[αd, αe], o9:[αe, αf] }, where αa, αb, ..., αf are distinct angles. The result objects, i.e., o1, o3, o6, o7, o8 and o9, are the nearest surrounder (NS) objects to q, as there are no other objects located between them

and q in their associated ranges of angles. The other objects, i.e., o2, o4, o5 and o10, which are hidden by the result objects, are not retrieved.

Definition 4.1. Nearest Surrounder (NS) Query. Given a set of objects O and a query point q, an NS query NS(q, O) returns a set of result tuples; formally,

    NS(q, O) = { ⟨o : [β, γ]⟩ | o ∈ O ∧ [β, γ] ⊆ [0, 2π) ∧ ∀o′ ∈ O−{o}, ∀α ∈ [β, γ], dist(q, o, α) < dist(q, o′, α) },

in which a tuple ⟨o : [β, γ]⟩ indicates that o is the nearest neighbor to q for the range of angles between β and γ, and dist(q, o, α) represents the distance of an object o to q with respect to an angle α. The retrieved objects are called "nearest surrounder objects" (or NS objects).

Besides, we extend the notion of NS query to the multi-tier NS (m-NS) query, as stated in Definition 4.2. An m-NS query provides a result set in which, for a particular range of angles, m objects are retrieved in the order of their distances to the query point. Thus, the NS query described above is equivalent to a 1-NS query (i.e., m = 1). m-NS queries are useful in many scenarios; for instance, in a battlefield, m-NS queries help analyze the distribution of enemies as different layers along certain directions. In Figure 4.1(a), for m = 2, the first-tier and second-tier NS

objects corresponding to q are {o1, o3, o6, o7, o8, o9} and {o2, o4, o5, o10}¹, respectively.

Figure 4.1. Nearest surrounder query and angle-constrained NS query: (a) Nearest Surrounder (NS) query; (b) Angle-constrained Nearest Surrounder (ANS) query.

Definition 4.2. Multi-Tier NS (m-NS) Query. Given a set of objects O, a query point q and a requested number of tiers m (≥ 1), an m-NS query mNS(q, O) returns a set of tuples, each of which contains m nearest neighbor objects; formally,

    mNS(q, O) = { ⟨O′ : [β, γ]⟩ | O′ ⊆ O ∧ [β, γ] ⊆ [0, 2π) ∧ |O′| = m ∧ ∀o ∈ O′, ∀o′ ∈ O−O′, ∀α ∈ [β, γ], dist(q, o, α) < dist(q, o′, α) }.

For some applications, only the NS objects within a certain range of angles are needed. A tourist, for example, walking toward a certain direction may be more interested in attractive points located in front of them. Accordingly, we introduce the angle-constrained NS (ANS) query, which retrieves NS objects for a specified range of angles A (⊆ [0, 2π)). In Figure 4.1(b), an ANS query issued at the query point for a range of angles [αs, αe] returns {o1, o3, o6}. Since the notion of ANS query can be combined with that of m-NS query to provide the multi-tier angle-constrained NS query, Definition 4.3 states this generalized NS query variant. When m and A are set to 1 and [0, 2π), respectively, an m-ANS query is equivalent to an NS query. This generalized NS query is more useful than a conventional NN search, as it provides angular information for each returned object and allows the search scope to be narrowed to a certain range of angles.

¹Associated ranges of angles are omitted for easy illustration.

Definition 4.3. Multi-Tier Angle-constrained NS (m-ANS) Query. Given a set of objects O, a query point q, a requested number of tiers m and a range of angles A (⊆ [0, 2π)), a multi-tier angle-constrained NS query m-ANS(q, O, A) retrieves a set of result tuples as

    m-ANS(q, O, A) = { ⟨O′ : [β, γ]⟩ | O′ ⊆ O ∧ [β, γ] ⊆ A ∧ |O′| = m ∧ ∀o ∈ O′, ∀o′ ∈ O−O′, ∀α ∈ [β, γ], dist(q, o, α) < dist(q, o′, α) }.
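To make the relationship between NS and ANS queries concrete, the sketch below derives a single-tier ANS result by clipping a full NS result to the constraint A. The function name and the (object, (start, end)) tuple layout are our own illustration, and A is assumed not to wrap around the positive x-axis; Section 4.4 instead evaluates ANS queries directly on the index, which avoids computing the full NS result first.

    def clip_ns_to_range(ns_result, A):
        """Derive ANS(q, O, A) (for m = 1) from NS(q, O) by intersecting each tuple's
        angular range with A = (a_lo, a_hi) and dropping tuples that fall outside A."""
        a_lo, a_hi = A
        clipped = []
        for obj, (beta, gamma) in ns_result:
            lo, hi = max(beta, a_lo), min(gamma, a_hi)
            if lo < hi:                      # non-empty overlap with the constraint
                clipped.append((obj, (lo, hi)))
        return clipped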

4.2 Naïve Approaches to NS Searches

An NS query can be approximately evaluated by issuing multiple Constrained NN (C-NN) queries [45] on disjoint regions. As illustrated in Figure 4.2(a), an NS query is evaluated as individual C-NN queries over Regions 1, 2, 3, 4, 5, etc. In this example, the first two of these queries report f to be the nearest while the latter three report b. This, however, discloses two deficiencies in using C-NN queries to roughly evaluate an NS query: (1) adjacent C-NN queries are likely to redundantly access a similar set of index nodes and objects, e.g., the C-NN queries on Regions 1, 2 and 3 access both R3 and R4 (where R4 does not contribute to the NN); (2) the result set approximated by C-NN queries may not be accurate, e.g., a and c should be part of the NS query result but they cannot be identified. Even worse, there is no ideal solution that addresses these two deficiencies simultaneously. Besides, this approximate result does not provide the exact angular ranges of the NS objects. More importantly, C-NN search cannot directly support the m-tier NS query, since the k NN objects returned in each region are not necessarily at the same angle from the query point. On the other hand, the idea of spherical projection [117] for image rendering shares some similarity with NS queries. The edges of objects are projected onto a spherical projection line that faces a query point; the objects on the projection line are the NSs. In Figure 4.2(b), b is not projected onto the projection line since it is hidden by c. To perform a spherical projection on a set of spatial objects, ray shooting queries [12] are often used. A ray shooting query is a line segment originating from a query point towards the projection line in search of the first hit object.
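The sampling idea can be made concrete with a small sketch: shoot one ray per sampled angle and keep the first-hit object, so that runs of consecutive angles hitting the same object approximate the NS angular ranges. This mirrors the Sample baseline used in Section 4.5; the object representation (line segments keyed by id), the helper names and the omitted wrap-around merging are assumptions of this sketch, not part of the thesis prototype.

    import math

    def ray_segment_dist(qx, qy, angle, seg):
        """Distance from q along direction `angle` to segment ((x1, y1), (x2, y2)), or None if missed."""
        (x1, y1), (x2, y2) = seg
        dx, dy = math.cos(angle), math.sin(angle)
        ex, ey = x2 - x1, y2 - y1
        denom = dx * ey - dy * ex
        if abs(denom) < 1e-12:                              # ray parallel to the segment
            return None
        t = ((x1 - qx) * ey - (y1 - qy) * ex) / denom       # distance along the ray
        u = ((x1 - qx) * dy - (y1 - qy) * dx) / denom       # position on the segment
        return t if t >= 0 and 0 <= u <= 1 else None

    def sampled_ns(q, objects, step_deg=0.1):
        """Approximate NS(q, O): one ray every `step_deg` degrees; consecutive angles that
        hit the same object form one approximate angular range (wrap-around not merged)."""
        qx, qy = q
        result, current, start = [], None, 0.0
        for k in range(round(360 / step_deg)):
            a = math.radians(k * step_deg)
            hits = [(d, oid) for oid, seg in objects.items()
                    if (d := ray_segment_dist(qx, qy, a, seg)) is not None]
            nearest = min(hits)[1] if hits else None
            if nearest != current:
                if current is not None:
                    result.append((current, (start, a)))
                current, start = nearest, a
        if current is not None:
            result.append((current, (start, 2 * math.pi)))
        return result

The finer the sampling step, the closer the approximation, which is exactly the accuracy/cost trade-off evaluated for the Sample baseline in Section 4.5.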

Figure 4.2. Constrained NN query and ray shooting query: (a) Constrained NN query; (b) Ray shooting query.

However, NS search and image rendering are conceptually and functionally different. Our work on NS queries mainly focuses on a very large set of objects and on a fast mechanism for fetching disk-resident index pages that possibly contain NS objects into main memory, whereas rendering algorithms focus on generating a display from a relatively small number (on the order of hundreds to thousands [41]) of objects resident in main memory. The approaches proposed in this thesis need only one index lookup, while the projection approach incurs a huge number of ray shooting queries on the same dataset to improve the result accuracy.

4.3 Angle-Based Bounding Properties

Since the extents of objects are represented by polygons, it is adequate to identify NS objects by examining their boundaries in a search space centered at a query point. In Figure 4.3(a), the entire boundary of the object o2 is no closer to q than another object o1, and thus o2 should not be an NS object. Besides, the comparison of two objects can be completely waived if they are located in different angular ranges with respect to q². In the figure, the object o3 is incomparable to o1, as their angular ranges do not overlap. Moreover, objects are compared only for the parts located within a common angular range. In the figure, the angular ranges of objects o4 and o1 partially overlap; we then only need to determine which of them is closer to q in the overlapping angular range.

²We use the terms "a range of angles" and "angular range" interchangeably.

Figure 4.3. The properties of edges and polygons: (a) angular bounds; (b) edges and polygons.

In the following, we define and discuss angular bounds and angular distances for edges, which help determine, for given angular ranges, whether the objects they belong to are closer to a query point. Accordingly, we devise a pairwise edge comparison mechanism (a basic component of our search algorithms) and present a brute-force NS search algorithm.

4.3.1 Edge Angular Bound and Edge Angular Distance

Let 𝒫o be a polygon that represents the extent of an object o. 𝒫o is composed of a set of n points Po = {p1, p2, ..., pn} and a set of n non-crossing edges Eo = {e1, e2, ..., en}. Each edge is a line segment connecting two points such that ei = pi pi+1 (1 ≤ i ≤ n−1) and en = pn p1. Since edges cover their connecting points, we simply consider the edges that form the boundary of o, and we observe and formalize some edge properties, namely, the edge angular bound, the edge angular distance and the minimum edge angular distance, in the following. To illustrate them, Figure 4.3(b) shows the edge angular bound [θq,e, θ̄q,e], the edge angular distance dist(q, e, α) and the minimum edge angular distance mindist(q, e, [θq,e, θ̄q,e]) for an edge e of an object o. With respect to a query point q, every edge occupies an angular range in the search space. We call this the edge angular bound and define it in Definition 4.4. In Figure 4.3(b), the edge e has an angular bound [θq,e, θ̄q,e]. If an edge is located across the positive x-axis of the search space, we partition it into two portions: one portion is on or above the x-axis and the other portion is right below the x-axis. Correspondingly, the angular bound of this edge is

considered as two consecutive subranges. This partitioning guarantees that all angular ranges are bounded by [0, 2π). Also, it does not affect the correctness of an NS query result and the search algorithms developed based on this property.

Definition 4.4. Edge Angular Bound. With respect to a query point q, the angular bound for an edge e (a line segment pipj between points pi and pj) is denoted by [θq,e, θ̄q,e], where θq,e and θ̄q,e are the starting and ending angles of the range, respectively. Specifically, θq,e = min(∠qpi, ∠qpj) and θ̄q,e = max(∠qpi, ∠qpj), where ∠qp is the polar angle between qp and the positive x-axis.

To quickly determine whether one edge is closer to q than another for a certain angular range, we define the edge angular distance for an edge e, denoted by dist(q, e, α), which gives the Euclidean distance from q to e at a specified angle α. Definition 4.5 defines the edge angular distance. By varying α over e, we can estimate a distance range for e from q over a certain angular range. To calculate dist(q, e, α), we adopt a line le that extends e and its normal, i.e., a line through q perpendicular to le. Based on this normal, we formulate the angular distance from q to le at angle α as dist(q, le, α) = η(q, le) / cos(α − φ(q, le)), in which η(q, le) stands for the distance from q to le along the normal and φ(q, le) represents the angle of the normal from q to le. Figure 4.3(b) illustrates le, η(q, le) and φ(q, le) for an edge e.

Definition 4.5. Edge Angular Distance. The edge angular distance dist(q, e, α) measures the distance from a query point q to an edge e at an angle α. In case α is out of the angular range of e (i.e., α ∉ [θq,e, θ̄q,e]), the edge angular distance is considered to be ∞. Hence, dist(q, e, α) is expressed as follows:

    dist(q, e, α) = dist(q, le, α)   if α ∈ [θq,e, θ̄q,e];
    dist(q, e, α) = ∞                otherwise,

where dist(q, le, α) = η(q, le) / cos(α − φ(q, le)), in which le is the line extending e, and η(q, le) and φ(q, le) are the length and angle of the normal from q to le.

Next, we define the minimum edge angular distance for e with respect to q over a given angular range [ϑ′, ϑ″], as stated in Definition 4.6. As will be discussed later, this minimum edge angular distance is used to derive the distance of an object from q over an angular range.

Definition 4.6. Minimum Edge Angular Distance. Extending Definitions 4.4 and 4.5, the minimum angular distance for an edge e over a given angular range [ϑ′, ϑ″] is denoted by mindist(q, e, [ϑ′, ϑ″]), i.e., mindist(q, e, [ϑ′, ϑ″]) = min_{α∈[ϑ′,ϑ″]} dist(q, e, α). Instead of examining all possible α, a fast minimum edge angular distance estimation can be performed as

    mindist(q, e, [ϑ′, ϑ″]) = η(q, le)                               if φ(q, le) ∈ [ϑ′, ϑ″];
    mindist(q, e, [ϑ′, ϑ″]) = min(dist(q, e, β), dist(q, e, γ))      otherwise,

where [β, γ] = [ϑ′, ϑ″] ∩ [θq,e, θ̄q,e].
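The quantities in Definitions 4.4 to 4.6 are straightforward to compute for a single edge. The following is a minimal sketch, assuming angles never wrap across the positive x-axis (the partitioning step described in Section 4.3.1 is omitted) and that q does not lie on the line through the edge; the helper names are ours.

    import math

    def polar_angle(qx, qy, px, py):
        """Polar angle of point p as seen from q, normalized to [0, 2*pi)."""
        return math.atan2(py - qy, px - qx) % (2 * math.pi)

    def edge_angular_bound(q, e):
        """[theta_{q,e}, theta-bar_{q,e}] of Definition 4.4 (no x-axis crossing assumed)."""
        (qx, qy), ((x1, y1), (x2, y2)) = q, e
        a1, a2 = polar_angle(qx, qy, x1, y1), polar_angle(qx, qy, x2, y2)
        return min(a1, a2), max(a1, a2)

    def normal_of_line(q, e):
        """(eta, phi): length and angle of the normal from q to the line l_e extending edge e."""
        (qx, qy), ((x1, y1), (x2, y2)) = q, e
        ex, ey = x2 - x1, y2 - y1
        length = math.hypot(ex, ey)
        eta = abs((x1 - qx) * ey - (y1 - qy) * ex) / length   # point-to-line distance
        nx, ny = ey / length, -ex / length                    # a unit normal of l_e
        if (x1 - qx) * nx + (y1 - qy) * ny < 0:               # orient it from q towards l_e
            nx, ny = -nx, -ny
        return eta, math.atan2(ny, nx) % (2 * math.pi)

    def edge_angular_dist(q, e, alpha):
        """dist(q, e, alpha) of Definition 4.5."""
        lo, hi = edge_angular_bound(q, e)
        if not (lo <= alpha <= hi):
            return math.inf
        eta, phi = normal_of_line(q, e)
        return eta / math.cos(alpha - phi)

    def min_edge_angular_dist(q, e, rng):
        """Fast estimate of mindist(q, e, [v1, v2]) per Definition 4.6."""
        v1, v2 = rng
        eta, phi = normal_of_line(q, e)
        if v1 <= phi <= v2:                  # normal direction falls inside the queried range
            return eta
        lo, hi = edge_angular_bound(q, e)
        b, g = max(v1, lo), min(v2, hi)      # [beta, gamma] = [v1, v2] intersected with the edge bound
        if b > g:
            return math.inf
        return min(edge_angular_dist(q, e, b), edge_angular_dist(q, e, g))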

4.3.2 Edge Comparison

Next, we develop an edge comparison mechanism as a binary function. It examines two edges,

ei and ej, belonging to objects oi and oj, respectively, at a time, and determines which is nearer over an examined angular range, or whether each is the nearer one in a different portion of the angular range. Any object with the closest edge is an NS object. In general,

when two edges ei and ej are compared, there are only four possible outcomes, categorized by their relationships: (i) ei and ej overlap each other, (ii) one of them (say ei) is always nearer

than the other, (iii) as opposed to (ii), ej is always nearer, and (iv) ei and ej cross each other. If the angular ranges of two edges do not overlap, they are incomparable and need not be compared. The four edge relationships are identified in the following cases:

Case 1. ei and ej overlap. In other words, they are equidistant to q for the entire angular range. This case can be identified when both their normal distances and their normal angles are equal, i.e., η(q, lei) = η(q, lej) ∧ φ(q, lei) = φ(q, lej). To determine the nearer edge, we consider other criteria, such as the Object IDs of the objects they belong to, as a tie breaker that decides which edge has priority to be the NS edge for the angular range.

Case 2. ei is nearer to q than ej for the entire angular range. This case can be determined by comparing their edge angular distances (see Definition 4.5). When dist(q, ei, α) < dist(q, ej, α) for every angle α ∈ [ϑ′, ϑ″], ej is certainly not closer to q than ei. Since the two edges do not cross each other, it suffices to compare their distances at the bounding angles ϑ′ and ϑ″. Thus, whenever dist(q, ei, ϑ′) < dist(q, ej, ϑ′) and dist(q, ei, ϑ″) < dist(q, ej, ϑ″), ei is certain to

be closer to q than ej.

Case 3. ej is nearer to q than ei for the entire angular range. This is the opposite of the second case. It can be identified when both dist(q, ej, ϑ′) < dist(q, ei, ϑ′) and dist(q, ej, ϑ″) < dist(q, ei, ϑ″) are satisfied.

Case 4. ei is nearer than ej to q for a certain portion of the angular range while in the rest of

the angular range, ej is nearer than ei. Different from the three previous cases, this case requires splitting the angular range into two smaller angular ranges and determining which edge is closer to q within each divided angular range.

The angular range should be split at an angle αs at which ei and ej are equidistant to q, i.e., dist(q, ei, αs) = dist(q, ej, αs), which can further be expressed as η(q, lei) / cos(αs − φ(q, lei)) = η(q, lej) / cos(αs − φ(q, lej)). Hence, αs can be determined as

    αs = arctan( (η(q, lei) cos(φ(q, lej)) − η(q, lej) cos(φ(q, lei))) / (η(q, lej) sin(φ(q, lei)) − η(q, lei) sin(φ(q, lej))) )        (4.1)

Alternatively, since ei and ej cross each other, we can determine their intersection point (say s) and then compute its angle (i.e., αs) with respect to q. The final split angular ranges are then [ϑ′, αs] and [αs, ϑ″]. To determine which edge is the nearer one to q in a split angular range, we perform a simple distance test on the two edges: the one with the shorter distance at ϑ′ (or ϑ″) is the nearer one for the angular range [ϑ′, αs] (or [αs, ϑ″], respectively). Based on the above case identification and nearer edge determination, we devise Function

EdgeCompare, as depicted in Figure 4.4. It accepts two edges ei and ej and an angular range [ϑ′, ϑ″] as parameters. It first examines whether Case 1 applies, i.e., ei and ej overlap, and if so returns the one with the smaller ObjectID (lines 1-2). Next, it checks whether Case 2 or Case 3 occurs by comparing their angular distances at ϑ′ and ϑ″ (lines 6-7). Finally, if none of the above cases is satisfied, it must be Case 4: a split angle αs is determined based on Equation (4.1), and the nearer edge for each divided angular range is identified with a distance test (lines 9-12).
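For Case 4, the split angle of Equation (4.1) can be computed directly from the two normals. The sketch below reuses normal_of_line() from the sketch given after Definition 4.6; both helper names are our own and not part of the thesis code.

    import math

    def split_angle(q, ei, ej):
        """Angle alpha_s at which edges ei and ej are equidistant from q (Equation 4.1).
        arctan leaves a +/- pi ambiguity, so in practice the solution lying inside the
        examined angular range [v1, v2] is the one to keep."""
        eta_i, phi_i = normal_of_line(q, ei)
        eta_j, phi_j = normal_of_line(q, ej)
        num = eta_i * math.cos(phi_j) - eta_j * math.cos(phi_i)
        den = eta_j * math.sin(phi_i) - eta_i * math.sin(phi_j)
        return math.atan2(num, den) % (2 * math.pi)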

4.3.3 The Brute-Force Algorithm

Objects are very often the units for storage and access (e.g., in the ESRI shapefile format [43] and as GT-polygons in TIGER/Line [137]). Thus, the design of the algorithms should be object-based. On the other hand, with respect to a query point, the edges that are already hidden by other edges of the same polygon can be skipped from detailed examination; only the non-hidden edges are needed for an NS search. We refer to those non-hidden edges as facing edges. We express the set of facing edges Fo(q) (⊆ Eo) of an object o with respect to q, when q stays out of o, as

Function EdgeCompare(q, ei, ej, [ϑ′, ϑ″])
Input. a query point (q), two edges (ei and ej), an angular range ([ϑ′, ϑ″])
Output. a set of ⟨e : [ϑ′, ϑ″]⟩ tuples (R)
Begin
1.   if η(q, lei) = η(q, lej) ∧ φ(q, lei) = φ(q, lej) then
2.       R ← { ⟨ex : [ϑ′, ϑ″]⟩, x = i or j };
3.   else
4.       let d′i be dist(q, ei, ϑ′); let d″i be dist(q, ei, ϑ″);
5.       let d′j be dist(q, ej, ϑ′); let d″j be dist(q, ej, ϑ″);
6.       if d′i < d′j ∧ d″i < d″j then R ← { ⟨ei : [ϑ′, ϑ″]⟩ };
7.       elseif d′j < d′i ∧ d″j < d″i then R ← { ⟨ej : [ϑ′, ϑ″]⟩ };
8.       else
9.           let αs be the split angle;
10.          αs ← arctan( (η(q, lei) cos(φ(q, lej)) − η(q, lej) cos(φ(q, lei))) / (η(q, lej) sin(φ(q, lei)) − η(q, lei) sin(φ(q, lej))) );
11.          if (d′i < d′j) then R ← { ⟨ei : [ϑ′, αs]⟩, ⟨ej : [αs, ϑ″]⟩ };
12.          else R ← { ⟨ej : [ϑ′, αs]⟩, ⟨ei : [αs, ϑ″]⟩ };
13.  output R;
End.

Figure 4.4. The pseudo-code of the function EdgeCompare

    Fo(q) = { e | e ∈ Eo ∧ ∃α ∈ [θq,e, θ̄q,e], ∀e′ ∈ Eo−{e}, dist(q, e, α) < dist(q, e′, α) },

based on the fact that a facing edge is, at some angle, strictly closer to q than all other edges of the same object. In case q stays within an object, no facing edges are defined. Next, we outline a brute-force algorithm, a non-index approach that sequentially examines a set of objects; its pseudo-code is depicted in Figure 4.5. At the very beginning, the NS query result N is initialized to a single tuple ⟨⊥ : [0, 2π)⟩, where ⊥ represents a dummy object that captures "holes" in the current query result and whose angular distance is set to ∞. The hole represented by a dummy object in N can be filled by any real object. The algorithm refines the result set based on every examined object and returns the result when all objects are examined. Function NSIncorp (as outlined in Figure 4.6) is invoked to update the NS query result (line 3); it is the core operation of the NS search. This function in turn calls Function ObjectCompare (as depicted in Figure 4.7) to compare two objects and deduce the updated nearest object(s). An

Algorithm BruteForce(q, O)
Input. a query point (q), a set of objects (O)
Output. a set of ⟨o : [ϑ′, ϑ″]⟩ tuples (N)
Begin
1.   N ← { ⟨⊥ : [0, 2π)⟩ };
2.   foreach o ∈ O do
3.       N ← NSIncorp(q, N, o, [θq,o, θ̄q,o]);
4.   output N;
End.

Figure 4.5. The pseudo-code of the algorithm BruteForce

angular range [ϑn, ϑ̄n] is passed such that any result returned by Function ObjectCompare is guaranteed to be bounded by the angular range of the original NS tuple. The comparison results are then incorporated into N, replacing the original entry. The search ends by returning N when all objects are examined.

Function NSIncorp(q, N, o, [ϑ′, ϑ″])
Input. a query point (q), an NS query result set (N), an object (o), an angular range ([ϑ′, ϑ″])
Output. an updated NS query result set
Begin
1.   foreach ⟨n, [ϑn, ϑ̄n]⟩ ∈ N | [ϑn, ϑ̄n] ∩ [ϑ′, ϑ″] ≠ ∅ do
2.       N ← N − { ⟨n, [ϑn, ϑ̄n]⟩ };
3.       (R, H) ← ObjectCompare(q, n, [ϑn, ϑ̄n], o, [ϑn, ϑ̄n] ∩ [ϑ′, ϑ″]);
4.       foreach ⟨r : [ϑr, ϑ̄r]⟩ ∈ R do
5.           N ← N ∪ { ⟨r, [ϑr, ϑ̄r]⟩ };
6.   output N;
End.

Figure 4.6. The pseudo-code of the function NSIncorp

Similar to Function EdgeCompare, Function ObjectCompare takes two objects and examines their edges in a common angular range Θq (i.e., [θq,n, θ̄q,n] ∩ [θq,o, θ̄q,o]) in a pairwise fashion. If the angular range of one object does not overlap with that of the other, that object is taken as the nearest one for the non-overlapped portion (lines 4-5). Then, Function EdgeCompare (in Figure 4.4) is invoked to examine pairs of edges whose edge angular bounds overlap (lines 6-14). Notice that we also keep the hidden objects in another set H, which is useful for m-NS query processing, as will be discussed later.

Function ObjectCompare(q, n, [ϑn, ϑ̄n], o, [ϑo, ϑ̄o])
Input. a query point (q), two objects (n and o),
       an angular range for n ([ϑn, ϑ̄n]), an angular range for o ([ϑo, ϑ̄o])
Output. a set of ⟨r : [β, γ]⟩ tuples (R) for objects in the front,
        a set of ⟨h : [β, γ]⟩ tuples (H) for objects behind
Begin
1.   R ← ∅; H ← ∅;
2.   let Θn be [ϑn, ϑ̄n], Θo be [ϑo, ϑ̄o];
3.   let Θq be Θn ∩ Θo;
4.   if (Θn − Θo) ≠ ∅ then R ← R ∪ { ⟨n : (Θn − Θo)⟩ };
5.   if (Θo − Θn) ≠ ∅ then R ← R ∪ { ⟨o : (Θo − Θn)⟩ };
6.   foreach en ∈ Fn(q) do
7.       foreach eo ∈ Fo(q) do
8.           let Θ be Θn ∩ Θo ∩ [θq,en, θ̄q,en] ∩ [θq,eo, θ̄q,eo];
9.           if Θ ≠ ∅ then
10.              E ← EdgeCompare(q, en, eo, Θ);
11.              foreach ⟨x : [β, γ]⟩ ∈ E do
12.                  R ← R ∪ { ⟨x : [β, γ]⟩ };
13.                  if (x = o) then H ← H ∪ { ⟨n : [β, γ]⟩ };
14.                  else H ← H ∪ { ⟨o : [β, γ]⟩ };
15.  output R, H;
End.

Figure 4.7. The pseudo-code of the function ObjectCompare

Scanning an entire object set to incrementally refine an NS query result set, as this brute-force algorithm does, invokes a large number of object comparisons, which in turn trigger many edge comparisons. Thus, high computational and I/O costs are incurred. To effectively improve the search performance, the search space should be selectively explored and effectively pruned. In the next section, we discuss efficient index-based NS search algorithms.

4.4 NS Search Algorithms

In this section, we explore the angle-based and distance-based properties of MBRs to facilitate search space exploration and pruning, which are useful for NS query evaluation. Based on these properties, we develop the Sweep and Ripple algorithms.

4.4.1 The Angle-Based Properties of Polygons and MBRs

To facilitate an NS search over objects whose extents are modeled as polygons, some polygon properties need to be explored. We define the polygon angular bound and the minimum polygon angular distance based on the facing edges, as stated in Definition 4.7 and Definition 4.8, respectively.

Definition 4.7. Polygon Angular Bound. The angular bound for an object o with respect to a query point q, denoted by [θq,o, θ̄q,o], is the union of the angular bounds of all its facing edges, i.e., [θq,o, θ̄q,o] = ∪_{e∈Fo(q)} [θq,e, θ̄q,e]. If q is within o, the polygon angular bound of o is set to [0, 2π).

Definition 4.8. Minimum Polygon Angular Distance. For a given angular range [ϑ′, ϑ″], we use mindist(q, o, [ϑ′, ϑ″]) to denote the minimum angular distance from an object o to a query point q, i.e., the minimum distance over all facing edges covered by the angular range. Hence,

    mindist(q, o, [ϑ′, ϑ″]) = min_{e ∈ Fo(q) ∧ Θq,e ≠ ∅ ∧ α ∈ Θq,e} dist(q, e, α)    if [θq,o, θ̄q,o] ∩ [ϑ′, ϑ″] ≠ ∅;
    mindist(q, o, [ϑ′, ϑ″]) = ∞                                                      otherwise,

in which Θq,e is [θq,e, θ̄q,e] ∩ [ϑ′, ϑ″]. If q is within or on the boundary of o, mindist(q, o, [ϑ′, ϑ″]) is set to 0.

As every MBR can be considered an enclosing polygon with four edges, the properties of polygons, such as the polygon angular bound and the minimum polygon angular distance, are directly applicable to MBRs. Meanwhile, the MBR of a node tightly encloses those of its child nodes and objects. Thus, we obtain the angular bound property of MBRs (as stated in Lemma 4.1), which guarantees that no child node or enclosed object is covered by an angular range if its parent node is not covered by the same angular range. Besides, we obtain the minimum angular distance property of MBRs (as explained in Lemma 4.2), which ensures that the contained MBRs or objects cannot provide a shorter angular distance to a query point than their container MBR. Based on these properties, the search space for NS objects can be effectively explored and/or pruned when container MBRs are already hidden by some NS objects.

Lemma 4.1. MBR Angular Bound. With respect to a query point q, the angular bound of an MBR Rp must be wide enough to cover those of its enclosed objects or the MBRs of its child nodes Rc, i.e., [θq,Rc, θ̄q,Rc] ⊆ [θq,Rp, θ̄q,Rp].

Proof. There are only three possible positions of the query point. In the first case, q stays within both Rp and Rc; according to Definition 4.7, both angular bounds are [0, 2π), i.e., [θq,Rc, θ̄q,Rc] = [θq,Rp, θ̄q,Rp] = [0, 2π). In the second case, q is inside Rp but outside Rc; certainly [θq,Rc, θ̄q,Rc] ⊂ [θq,Rp, θ̄q,Rp] = [0, 2π) holds. In the third case, q stays outside Rp (which also implies that q is outside Rc); then [θq,Rc, θ̄q,Rc] ⊆ [θq,Rp, θ̄q,Rp] holds, since the facing edges of Rp hide Rc entirely from q.

Lemma 4.2. Minimum MBR Angular Distance Bound. Given an angle α and a query point q, the minimum angular distance of an MBR Rp must not be greater than that of any of its enclosed objects or child MBRs Rc, i.e., mindist(q, Rp, α) ≤ mindist(q, Rc, α).

Proof. In general, there are only two possible cases. First, when the query point q is inside Rp, its minimum angular distance is 0 (i.e., the minimum possible distance), regardless of whether q is inside or outside Rc. Second, if q is outside Rp, then q is also outside Rc. As the facing edges of Rc are behind those of Rp, Rp cannot provide a longer distance to q than Rc. The above discussion assumes that α is within the angular bounds of both Rp and Rc. When α is out of the angular bound of Rc but within that of Rp, the distance to Rp is shorter than that to Rc (which is ∞). Finally, when α is out of the angular bounds of both Rp and Rc, the angular distances of both are ∞. In all cases, mindist(q, Rp, α) ≤ mindist(q, Rc, α) is ensured.
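Since an MBR is simply a four-edge polygon, the MBR-level quantities used by the Sweep and Ripple algorithms below can be obtained from the edge-level helpers sketched after Definition 4.6. A minimal sketch, assuming q lies outside the MBR and no wrap-around across the positive x-axis; the (xmin, ymin, xmax, ymax) layout is our own convention.

    def mbr_edges(mbr):
        """The four boundary edges of an axis-aligned MBR given as (xmin, ymin, xmax, ymax)."""
        xmin, ymin, xmax, ymax = mbr
        c = [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax)]
        return [(c[i], c[(i + 1) % 4]) for i in range(4)]

    def mbr_angular_bound(q, mbr):
        """Union of the edge angular bounds (Definition 4.7 applied to an MBR)."""
        bounds = [edge_angular_bound(q, e) for e in mbr_edges(mbr)]
        return min(lo for lo, _ in bounds), max(hi for _, hi in bounds)

    def mbr_min_angular_dist(q, mbr, rng):
        """Minimum MBR angular distance over an angular range (Definition 4.8 for an MBR).
        Back edges are never the nearest at any angle, so iterating over all four edges
        is equivalent to restricting to the facing edges."""
        return min(min_edge_angular_dist(q, e, rng) for e in mbr_edges(mbr))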

4.4.2 The Sweep Searching Algorithm

In this subsection, we present the Sweep algorithm. We first describe its basic operation, and then discuss its optimizations and its extensions to support m-NS and ANS queries.

Basic Sweep Operation. The Sweep algorithm traverses an R-tree based on the best-first strategy. It maintains a priority queue that orders the access of index nodes and objects, collectively called "entries", according to their starting angles with respect to the positive x-axis of the search space centered at a given query point. In case multiple entries share the same starting angle, we give a higher priority to the one with the smaller distance to the query point. Initially, the root index node serves as the first entry in the queue. Iteratively, the head queue entry, which has the smallest starting angle among all pending entries, is fetched from the queue and examined. If the entry is an index node, all its children are inserted into the queue for later examination. Otherwise, the entry (i.e., an object) is examined and may be incorporated into the existing NS result by Function NSIncorp (see Figure 4.6). The Sweep algorithm terminates when the queue is emptied.

The Sweep Algorithm Optimizations. Straightforwardly scanning the entries arranged in the priority queue would eventually explore all index nodes and all objects. Clearly, not all dequeued entries contribute to an NS query result. In the following, we present two important heuristics that effectively avoid exploring and examining unnecessary index nodes and objects, so that the I/O cost and the number of ObjectCompare invocations, which contribute the major computation cost of the algorithm, can be significantly reduced.
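The queue ordering just described (starting angle first, distance as a tie-breaker) can be realized with a plain binary heap. A minimal sketch reusing the MBR helpers above; the tuple layout and function name are our own.

    import heapq

    def push_sweep_entry(queue, q, payload, mbr):
        """Enqueue an R-tree node or object for the basic Sweep traversal: entries are
        ordered by the starting angle of their angular bound, ties broken by their
        minimum angular distance to q."""
        start, end = mbr_angular_bound(q, mbr)
        mindist = mbr_min_angular_dist(q, mbr, (start, end))
        heapq.heappush(queue, (start, mindist, id(payload), end, payload))

    # Popping then yields the pending entry with the smallest starting angle:
    #   start, mindist, _, end, payload = heapq.heappop(queue)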

Heuristic 4.1. Conservative Upper Distance Bound. Consider an entry ℰ with angular bound [θq,ℰ, θ̄q,ℰ]. The maximum of the angular distances among all found NS objects (including '⊥') whose angular bounds intersect [θq,ℰ, θ̄q,ℰ] in the current NS query result set is called the conservative upper distance bound. If the minimum angular distance of ℰ, mindist(q, ℰ, [θq,ℰ, θ̄q,ℰ]), is greater than the conservative upper distance bound, ℰ can be safely skipped from a detailed examination, since it is certainly neither an NS object nor containing any NS object.
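A minimal sketch of the Heuristic 4.1 test, assuming the current result is kept as (object_or_None, (start, end), max_dist) triples, where max_dist caches the tuple's maximum angular distance over its range and object_or_None is None for the dummy '⊥'; this layout is an assumption of the sketch, not the thesis data structure.

    import math

    def can_prune(entry_bound, entry_mindist, result):
        """Heuristic 4.1: an entry may be skipped if its minimum angular distance exceeds
        the conservative upper distance bound formed by the result tuples it overlaps.
        A dummy hole contributes an infinite bound, so nothing overlapping a hole is pruned."""
        e_start, e_end = entry_bound
        upper = 0.0
        for obj, (r_start, r_end), max_dist in result:
            if r_end < e_start or r_start > e_end:        # no angular overlap
                continue
            upper = max(upper, math.inf if obj is None else max_dist)
        return entry_mindist > upper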

Figure 4.8. Illustrations of two heuristics applied for the Sweep algorithm: (a) conservative upper bound (Heuristic 4.1); (b) smallest distance first (Heuristic 4.2).

To illustrate Heuristic 4.1, Figure 4.8(a) shows an entry ℰ (which may represent an object or an index node) and two objects o and o′ that are tentatively in the NS query result set. As mindist(q, ℰ, [θq,ℰ, θ̄q,ℰ]) exceeds the conservative upper distance bound, i.e., max_{α∈[θq,ℰ, θ̄q,ℰ]} dist(q, o, α) (where o, not o′, blocks ℰ), it is not necessary to explore ℰ. Owing to the access order defined only on the starting angles, the entry with the smallest starting angle is always fetched first; yet it may not be, or contain, any NS object. Take

the case shown in Figure 4.8(b) as an example, which includes one tentative NS object o and two entries ℰ1 and ℰ2. Following the starting angle order, ℰ1 is expected to be examined right after o. However, as we can anticipate, ℰ1 does not contribute any NS object within the common angular range [θq,o, θ̄q,o] ∩ [θq,ℰ1, θ̄q,ℰ1]; in fact, it is completely hidden by o and by ℰ2, which will be accessed after it. This suggests Heuristic 4.2, which slightly revises the access order: an entry that is more likely to be or to contain NS objects should be retrieved before others, even if it does not have the smallest starting angle among those in the queue. As suggested by Heuristic 4.2, ℰ2, whose minimum angular distance with respect to the angular range explored by the search is smaller than that of ℰ1, is picked rather than ℰ1.

Heuristic 4.2. Smallest Distance First. Given an angular range [ϑ′, ϑ″] covering all currently found NS objects, the access priority is granted to the entry ℰ in the priority queue that provides the smallest minimum angular distance in the common angular range, i.e., [ϑ′, ϑ″] ∩ [θq,ℰ, θ̄q,ℰ], among all queued entries. In this case, ℰ is the most likely among them to be or to contain an NS object within [ϑ′, ϑ″].

Let us continue with our example and consider the scenario in which ℰ2 is picked instead of ℰ1 while ℰ1 is retained in the queue. Notice that extra runtime memory is consumed to house this kind of retained entries in the priority queue. It also raises a question: in what situations, and how, can ℰ1 be safely removed from the queue? Recall Heuristic 4.1: an entry can be safely discarded when its minimum angular distance is greater than the conservative upper bound. Thus, in our approach, a queue entry whose angular bound is completely covered by the explored angular range of the current NS query result is retrieved first. Figure 4.8(b) shows that after ℰ2 is retrieved instead of ℰ1, ℰ1 need not be explored, since it is already farther from q than o and ℰ2, i.e., the existing found NS objects. To manage this priority queue retrieval operation, we develop Function Lookahead (as outlined in Figure 4.9). It does not re-order the entry positions in the queue; rather, whenever it is invoked, it examines the front part of the queue and picks the best entry according to the revised priority. This lookahead function iteratively fetches head entries and picks the best candidate among those fetched. It takes three parameters, namely, a query point (q), a priority queue (Q), and an angular range [ϑ′, ϑ″] that covers all currently found NS objects. Initially, we assume the first queue entry is the candidate. Then, it loops to fetch subsequent queue entries. Whenever the queue becomes empty or the remaining queue entries do not have their angular

Function Lookahead(q, Q, [ϑ′, ϑ″])
Input. a query point (q), a priority queue (Q), an angular range ([ϑ′, ϑ″])
Local. a buffer for examined entries (B), a minimum distance (min)
Output. a queue entry (cand)
Begin
1.   min ← ∞; cand ← Q.top();
2.   while (Q is not empty) do
3.       ℰ ← Q.dequeue();
4.       B.enqueue(ℰ);
5.       if ([θq,ℰ, θ̄q,ℰ] ∩ [ϑ′, ϑ″] = ∅) then
6.           goto 12;  // terminate the search
7.       if ([θq,ℰ, θ̄q,ℰ] ⊆ [ϑ′, ϑ″]) then
8.           cand ← ℰ; goto 12;  // an entry that is fully covered
9.       if (mindist(q, ℰ, [θq,ℰ, θ̄q,ℰ] ∩ [ϑ′, ϑ″]) < min) then
10.          min ← mindist(q, ℰ, [θq,ℰ, θ̄q,ℰ] ∩ [ϑ′, ϑ″]);
11.          cand ← ℰ;  // an entry with the smallest mindist
12.  foreach e ∈ B do
13.      if (e ≠ cand) then Q.enqueue(e);
14.  output cand;
End.

Figure 4.9. The pseudo-code of the function Lookahead

bound intersecting the angular range, the loop ends (line 2 and lines 5-6). The function gives the highest priority to an entry that is fully covered by [ϑ′, ϑ″], for the sake of shortening the queue and saving memory (lines 7-8). Otherwise, the entry with the smallest mindist within the angular range [ϑ′, ϑ″] is collected (lines 9-11). Since some examined entries will not be the final candidate, a buffer B is provided to keep them (line 4), and those buffered entries are reinserted into the queue at the end of the function (lines 12-13). Finally, the collected candidate is returned (line 14). Since index nodes and objects are put in the queue and examined only once, our Sweep algorithm needs at most one index lookup. On the other hand, it provides progressive result delivery as another important property: it delivers a completed portion of an NS query result as soon as that portion is ready and will no longer change in the rest of the execution. With the index traversal order based on starting angles, Heuristic 4.3 guarantees the correctness of partial Sweep result delivery during query execution.

Heuristic 4.3. Safe Partial Result Delivery. If the starting angle of the head entry in the priority queue is δ, all the remaining queued entries have starting angles not less than δ, and thus they do not affect the existing NS query result tuples whose ending angles are less than δ.

Summary of the Sweep Algorithm and Example. With all the discussed optimizations (except progressive result delivery, omitted for clarity of the description), we outline the Sweep algorithm in Figure 4.10. A priority queue Q initially contains the root of the R-tree; an angular range [β, γ], which represents the angular range of the examined objects, is initialized to [0, 0]; and the NS result set N is set to {⟨⊥ : [0, 2π)⟩} (lines 1-3). The queue is then iteratively examined until it becomes empty (lines 4-12). Function Lookahead picks an entry ℰ from Q (line 6). If the mindist of ℰ is greater than the conservative upper distance bound (i.e., the maximum among the angular distances of all overlapping NS objects), it is safe to ignore it (line 7). Further, if ℰ is an index node, its child entries are explored from the index and put into Q for later investigation (line 9). Otherwise, ℰ is an object; it is compared with the existing NS objects by Function NSIncorp (see Figure 4.6) to update the NS query result (line 11). After that, [β, γ] is updated to cover all objects examined so far (line 12). Finally, the NS query result N is output.

Algorithm Sweep(q, root)
Input. a query point (q), an R-tree root index node (root)
Output. an NS query result, i.e., a set of ⟨o : [ϑ′, ϑ″]⟩ tuples
Begin
1.   let Q be a priority queue, initialized with root;
2.   let [β, γ] be the angular range of examined objects, set to [0, 0];
3.   let N be the NS result, initialized as { ⟨⊥ : [0, 2π)⟩ };
4.   while (Q is not empty)
5.       let ℰ be an entry (either a node or an object);
6.       ℰ ← Lookahead(q, Q, [β, γ]);
7.       if mindist(q, ℰ, [θq,ℰ, θ̄q,ℰ]) > conservative upper bound then goto 4;  /* skip examining it */
8.       if (ℰ is a node) then
9.           explore ℰ and put all its children into Q;
10.      else
11.          N ← NSIncorp(q, N, ℰ, [θq,ℰ, θ̄q,ℰ]);
12.      [β, γ] ← [β, γ] ∪ [θq,ℰ, θ̄q,ℰ];
13.  output N;
End.

Figure 4.10. The pseudo-code of the Sweep Algorithm

We use an example to show how the Sweep algorithm operates, based on the scenario depicted in Figure 4.11(a). Here, 10 objects (labeled 'a' through 'j') are included in four index nodes

Figure 4.11. Examples of NS searches: (a) Sweep; (b) Ripple.

Examined | Q                          | N (Remarks)
-        | [R1, R2, R3, R4]           | { ⟨⊥:[0, 2π)⟩ }
R1       | [a1, c, R2, R3, R4, a2, b] | ditto (a is divided into a1 and a2)
a1       | [c, R2, R3, R4, a2, b]     | { ⟨a1:[0, θ̄q,a1]⟩, ⟨⊥:[θ̄q,a1, 2π)⟩ }
c        | [R2, R3, R4, a2, b]        | ditto
R3       | [R2, f, R4, g, h, a2, b]   | ditto (by Heuristic 4.2, R3 is chosen, not R2)
f        | [R2, R4, g, h, a2, b]      | { ⟨a1:[0, θ̄q,a1]⟩, ⟨f:[θ̄q,a1, θ̄q,f]⟩, ⟨⊥:[θ̄q,f, 2π)⟩ } (by Heuristic 4.2, f is chosen)
R2       | [R4, g, h, a2, b]          | ditto (R2 is safely discarded)
R4       | [i, g, h, j, a2, b]        | ditto
i        | [g, h, j, a2, b]           | { ⟨a1:[0, θ̄q,a1]⟩, ⟨f:[θ̄q,a1, θq,i]⟩, ⟨i:[θq,i, θ̄q,i]⟩, ⟨⊥:[θ̄q,i, 2π)⟩ }
g, h     | [j, a2, b]                 | ditto (g and h are safely discarded)
j        | [a2, b]                    | { ⟨a1:[0, θ̄q,a1]⟩, ⟨f:[θ̄q,a1, θq,i]⟩, ⟨i:[θq,i, θ̄q,i]⟩, ⟨j:[θ̄q,i, θ̄q,j]⟩, ⟨⊥:[θ̄q,j, 2π)⟩ }
a2       | [b]                        | { ⟨a1:[0, θ̄q,a1]⟩, ⟨f:[θ̄q,a1, θq,i]⟩, ⟨i:[θq,i, θ̄q,i]⟩, ⟨j:[θ̄q,i, θ̄q,j]⟩, ⟨⊥:[θ̄q,j, θq,a2]⟩, ⟨a2:[θq,a2, 2π)⟩ }
b        | []                         | ditto (b is safely discarded)

Figure 4.12. The trace of the Sweep algorithm

(MBRs). The radial lines indicate the starting angles of some MBRs. Our trace, shown in Figure 4.12, starts with a priority queue containing these index nodes, i.e., [R1, R2, R3, R4], ordered by their starting angles. Since R1 intersects the x-axis, we consider its starting angle to be 0. R1, the head entry of the priority queue, is explored first, and its children a, b, and c are inserted into the queue. Crossing the x-axis of the search space, a is logically divided into a1 (the portion on or above the x-axis) and a2 (the other portion, below the x-axis). This division is for intermediate processing only; after the completion of the algorithm, a1 and a2 can be merged to recover a if they are retained adjacently in the final NS query result. Since a1 has a smaller minimum angular distance than c at the same starting angle, a1 is placed before c in the queue. Next, a1 is retrieved and the result set is changed to {⟨a1, [0, θ̄q,a1]⟩, ⟨⊥, [θ̄q,a1, 2π)⟩}, and [β, γ] is updated to [0, θ̄q,a1]. Next, c is examined, but as it is hidden, it is discarded.

R2 becomes the head entry of the queue. Due to Heuristic 4.2, R3, which overlaps the angular range of a1 and provides a shorter minimum angular distance than R2, is picked instead. R3's children f, g and h are then enqueued. Similarly, f is taken rather than R2, and the NS result is updated to {⟨a1, [0, θ̄q,a1]⟩, ⟨f, [θ̄q,a1, θ̄q,f]⟩, ⟨⊥, [θ̄q,f, 2π)⟩}. Now [β, γ] is updated to [0, θ̄q,f]. R2 is retrieved from the queue since its angular range is entirely covered by the current NS objects. As R2's minimum angular distance is greater than the conservative upper bound formed by a1 and f, R2, which includes d and e, is skipped from a detailed examination. R4 is explored next and its children i and j are inserted into the queue. The object i is then examined and the result set is revised into {⟨a1, [0, θ̄q,a1]⟩, ⟨f, [θ̄q,a1, θq,i]⟩, ⟨i, [θq,i, θ̄q,i]⟩, ⟨⊥, [θ̄q,i, 2π)⟩}. Further, g and h are examined, but they are not closer to q than i. j and a2 are examined in sequence and added to the result. Next, b is examined, but it is hidden by a2. The queue becomes empty and the algorithm terminates.

Extended Sweep Algorithm for m-NS and ANS Queries. The Sweep algorithm can be extended to evaluate a multi-tier NS (m-NS) query. First, we generalize an NS query result tuple to contain m objects. For example, to handle a 2-NS query, we associate two NS objects, instead of a single object, with the angular range in each result tuple. Thus, the result for a 2-NS query is initialized as {⟨(⊥, ⊥), [0, 2π)⟩}. In Figure 4.11(a), after both a1 and c are examined, the query result first becomes {⟨(a1, c), [0, θ̄q,c]⟩, ⟨(a1, ⊥), [θ̄q,c, θ̄q,a1]⟩, ⟨(⊥, ⊥), [θ̄q,a1, 2π)⟩}. Correspondingly, we devise Function m-NSIncorp, which substitutes Function NSIncorp to update an m-NS query result set; it is depicted in Figure 4.13. It is a recursive function: each time, it compares the object at slot (i.e., tier) t (initially 0) of an NS query result tuple with an input object using Function ObjectCompare, and then m-NSIncorp is invoked again to update the objects at the next tier (i.e., t + 1) with the hidden objects (in H). Besides the extended NS slots in the result tuples and the invocation of Function m-NSIncorp, the general logic of the Sweep algorithm for m-NS queries is much the same as that already discussed for NS queries; we omit its pseudo-code to save space. Further, the progressive result delivery and single index lookup properties still apply to the Sweep algorithm for m-NS queries. The Sweep algorithm for ANS queries includes an additional parameter, namely, a query

Function m-NSIncorp(q, N, t, m, o, [ϑ′, ϑ″])
Input. a query point (q), an NS query result set (N), the target tier (t), the max. no. of tiers (m),
       an object (o), an angular range ([ϑ′, ϑ″])
Output. an updated NS query result set
Begin
1.   foreach ⟨(n1, ..., nm), [ϑn, ϑ̄n]⟩ ∈ N | [ϑn, ϑ̄n] ∩ [ϑ′, ϑ″] ≠ ∅ do
2.       N ← N − { ⟨(n1, ..., nt, ..., nm), [ϑn, ϑ̄n]⟩ };
3.       (R, H) ← ObjectCompare(q, nt, o, [ϑn, ϑ̄n]);
4.       foreach ⟨r : [ϑr, ϑ̄r]⟩ ∈ R do
5.           N ← N ∪ { ⟨(n1, ..., r, ..., nm), [ϑr, ϑ̄r]⟩ };
6.       foreach ⟨h : [ϑh, ϑ̄h]⟩ ∈ H do
7.           if (t ≤ m ∧ h ≠ ⊥) then
8.               m-NSIncorp(q, N, t + 1, m, h, [ϑh, ϑ̄h]);
9.   output N;
End.

Figure 4.13. The pseudo-code of the function m-NSIncorp

angular range [θ′q, θ″q]. Then, only the NS objects covered by [θ′q, θ″q] are retrieved. Meanwhile, all objects and index nodes beyond [θ′q, θ″q] need not be enqueued. The search can also start at θ′q and end when the queue becomes empty.

4.4.3 The Ripple Searching Algorithm

Next, we present the Ripple algorithm, which accesses the R-tree index for NS objects in distance order. In what follows, we first briefly discuss its basic operation and then its extensions to support m-NS and ANS queries.

Basic Ripple Operation. The Ripple algorithm adopts the best-first search strategy and maintains a priority queue that keeps track of unexamined entries in non-descending distance order with respect to a given query point. Thus, the Ripple algorithm explores the search space from the query point outwards in all directions. If a dequeued entry is an index node, all its children are enqueued for later examination; otherwise, the entry is an object and it is incorporated into the NS query result by Function NSIncorp (in Figure 4.6). Similar to the Sweep algorithm, Ripple examines each dequeued index node and object only once and thus retains the one-index-lookup property. Furthermore, it can terminate early, before the queue is completely scanned, according to Heuristic 4.4.

Heuristic 4.4. Early Termination. Because of the distance ordering of all queued entries, the examination of the priority queue can be stopped as soon as the following two conditions are satisfied: (1) the NS query result set has no dummy NS object (i.e., ⊥); and (2) all queued entries have longer distances to the query point than the conservative upper distance bound of the entire NS query result set (as defined in Heuristic 4.1).

In Heuristic 4.4, condition (1) guarantees that the NS result set is completely filled, and condition (2) asserts that the rest of the priority queue does not contain any object that appears nearer than any currently found NS object. This heuristic provides a safe termination condition for the Ripple algorithm, so that the algorithm can respond faster.
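The Heuristic 4.4 check can be phrased over the same result layout used in the Heuristic 4.1 sketch, with the Ripple queue kept as a min-heap keyed by distance; again, the field names are our own illustration.

    import math

    def ripple_can_terminate(queue, result):
        """Heuristic 4.4: terminate once (1) the result contains no dummy hole and
        (2) the nearest pending entry is farther than the conservative upper bound
        over the entire result."""
        if any(obj is None for obj, _rng, _max_dist in result):   # condition (1)
            return False
        if not queue:
            return True
        upper = max(max_dist for _obj, _rng, max_dist in result)
        return queue[0][0] > upper                                # condition (2): heap head distance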

Examined | Q                        | N (Remarks)
-        | [R4, R1, R3, R2]         | { ⟨⊥:[0, 2π)⟩ }
R4       | [i, R1, R3, j, R2]       | ditto
i        | [R1, R3, j, R2]          | { ⟨⊥:[0, θq,i]⟩, ⟨i:[θq,i, θ̄q,i]⟩, ⟨⊥:[θ̄q,i, 2π)⟩ }
R1       | [a, R3, b, j, R2, c]     | ditto
a        | [R3, b, j, R2, c]        | { ⟨a1:[0, θ̄q,a1]⟩, ⟨⊥:[θ̄q,a1, θq,i]⟩, ⟨i:[θq,i, θ̄q,i]⟩, ⟨⊥:[θ̄q,i, θq,a2]⟩, ⟨a2:[θq,a2, 2π)⟩ } (a is cut into a1 and a2)
R3       | [b, f, j, R2, c, g, h]   | ditto
b        | [f, j, R2, c, g, h]      | ditto
f        | [j, R2, c, g, h]         | { ⟨a1:[0, θ̄q,a1]⟩, ⟨f:[θ̄q,a1, θq,i]⟩, ⟨i:[θq,i, θ̄q,i]⟩, ⟨⊥:[θ̄q,i, θq,a2]⟩, ⟨a2:[θq,a2, 2π)⟩ }
j        | [R2, c, g, h]            | { ⟨a1:[0, θ̄q,a1]⟩, ⟨f:[θ̄q,a1, θq,i]⟩, ⟨i:[θq,i, θ̄q,i]⟩, ⟨j:[θ̄q,i, θ̄q,j]⟩, ⟨⊥:[θ̄q,j, θq,a2]⟩, ⟨a2:[θq,a2, 2π)⟩ }
All remaining queue entries (e.g., R2) are scanned but found hidden, so they are not put into the result.

Figure 4.14. The trace of the Ripple algorithm

To highlight the difference between the Ripple algorithm and the Sweep algorithm, let us consider the same running example, depicted in Figure 4.11(b). In the figure, the circular lines indicate the distances of some MBRs and objects from the query point q. The trace of the Ripple algorithm is outlined in Figure 4.14. The NS query result set N is initialized to {⟨⊥ : [0, 2π)⟩} and the priority queue is initialized to [R4, R1, R3, R2] in distance order. First, R4 is explored and its children i and j are inserted into the priority queue, which becomes [i, R1, R3, j, R2]. Then, i is examined and found to be an NS object; the result set is updated to {⟨⊥ : [0, θq,i)⟩, ⟨i : [θq,i, θ̄q,i]⟩, ⟨⊥ : [θ̄q,i, 2π)⟩}. Next, R1 is explored and its children a, b and c are enqueued. Now a is picked and incorporated into the result set, which becomes {⟨a1 : [0, θ̄q,a1]⟩, ⟨⊥ : [θ̄q,a1, θq,i]⟩, ⟨i : [θq,i, θ̄q,i]⟩, ⟨⊥ : [θ̄q,i, θq,a2]⟩, ⟨a2 : [θq,a2, 2π)⟩}; here, a is divided along the x-axis into a1 and a2. Next, R3 is explored and its children f, g and h are enqueued. Then, b is fetched but found to be hidden by a2. Further, f is examined and it fills the hole between a1 and i. j is examined and the result is further updated to {⟨a1 : [0, θ̄q,a1]⟩, ⟨f : [θ̄q,a1, θq,i]⟩, ⟨i : [θq,i, θ̄q,i]⟩, ⟨j : [θ̄q,i, θ̄q,j]⟩, ⟨⊥ : [θ̄q,j, θq,a2]⟩, ⟨a2 : [θq,a2, 2π)⟩}. The remaining queued entries are examined, but none of them is found to be an NS object. Next, a1 and a2 are merged to recover a. Finally, the search terminates.

Extended Ripple Algorithm for m-NS and ANS Queries. The extension of the Ripple algorithm for m-NS queries differs from that of the Sweep algorithm. Instead of defining m slots in each NS query result tuple, we maintain an array of m NS query results, N[m]. Each array element N[i] represents one tier i of the NS query result. This arrangement of the query result resembles onion rings: the lower tier results (inner rings) are filled prior to the higher tier results (outer rings). When an object is examined, we study how it contributes to the lower tier(s) first. We devise Function tierNSIncorp (as outlined in Figure 4.15) to handle this m-NS query result update. It is very similar to Function NSIncorp (see Figure 4.6); however, it performs ObjectCompare between an input object and an object in a result tuple to update the specified tier t of the NS query result, and it then invokes itself to incorporate the hidden objects, if any, into the (t + 1)-th tier of the NS query result.

Function tierNSIncorp(q, N, t, m, o, [ϑ′, ϑ″])
Input. a query point (q), an NS query result set (N), a target tier (t), the max. no. of tiers (m),
       an object (o), an angular range ([ϑ′, ϑ″])
Output. an updated NS query result set
Begin
1.   foreach ⟨n, [ϑn, ϑ̄n]⟩ ∈ N[t] | [ϑn, ϑ̄n] ∩ [ϑ′, ϑ″] ≠ ∅ do
2.       N[t] ← N[t] − { ⟨n, [ϑn, ϑ̄n]⟩ };
3.       (R, H) ← ObjectCompare(q, n, [ϑn, ϑ̄n], o, [ϑn, ϑ̄n] ∩ [ϑ′, ϑ″]);
4.       foreach ⟨r : [ϑr, ϑ̄r]⟩ ∈ R do
5.           N[t] ← N[t] ∪ { ⟨r, [ϑr, ϑ̄r]⟩ };
6.       foreach ⟨h : [ϑh, ϑ̄h]⟩ ∈ H do
7.           if (t ≤ m ∧ h ≠ ⊥) then
8.               tierNSIncorp(q, N, t + 1, m, h, [ϑh, ϑ̄h]);
9.   output N;
End.

Figure 4.15. The pseudo-code of the function tierNSIncorp

While the Early Termination heuristic (as stated in Heuristic 4.4) for a single-tier NS query, discussed in the previous subsection, is still applicable, it is extended to Heuristic 4.5 as follows.

Heuristic 4.5. Tier Completion. As long as (1) the result of a tier contains no dummy NS object, and (2) the head entry of the queue to be examined is farther than the conservative upper bound of the NS query result of that tier, the remaining entries will not contribute to the NS result of the tier.

Based on this heuristic, three enhancements of the Ripple algorithm can be achieved, namely, skipping the examination of completed tiers, progressive result delivery, and early m-NS query termination. Here, progressive result delivery refers to delivering the NS query result in a tier-wise fashion, as soon as a tier is completed. To evaluate an ANS query, the Ripple algorithm is extended to filter out all index nodes and objects that are outside the query angular range. To generalize the Ripple algorithm for multi-tier ANS queries, we outline its pseudo-code in Figure 4.16; when m and the angular range are set to 1 and [0, 2π), respectively, the algorithm performs an NS search. The Ripple algorithm takes four parameters, namely, the root of an R-tree (root), a query point (q), the number of tiers (m) and an angular range ([θ′q, θ″q]). It operates on the content of a priority queue Q, which is initialized with root, and it maintains an integer CurrTier, initialized to 1, and an array of m NS query results, N (lines 1-3). According to Heuristic 4.5, the algorithm checks whether the result of the current tier is completed (line 7); if so, the result of the current tier is delivered (line 8) and CurrTier is incremented by one (line 9). The algorithm then terminates (line 10) if all m tiers are completed. Besides, when an entry ℰ lies outside [θ′q, θ″q], its detailed examination is skipped (line 11). If the entry is an index node, all its children are enqueued for detailed examination (line 12); otherwise, the entry is an object and it is incorporated into the existing m-NS query result by invoking Function tierNSIncorp (line 14). Finally, the incomplete NS query result sets are output (lines 15-16).

4.4.4 NS Search Based on Multiple ANS Searches

Thus far, we have discussed algorithms that process an NS query covering [0, 2π) with respect to a query point. Logically, an NS query can be evaluated as several ANS queries, provided that the angular ranges examined by all the ANS queries together constitute [0, 2π); the final NS result can then be obtained by unioning those ANS query results. Subject to application needs, the number of divisions and the sizes of the angular ranges can be customized. Besides, we can see several performance advantages of processing multiple ANS queries over a single

Algorithm Ripple(q, root, m, [θ′q, θ″q])
Input. a query point (q), an R-tree root (root),
       the requested no. of tiers (m), a query angular range ([θ′q, θ″q])
Output. an m-tier NS query result (N)
Begin
1.   let Q be the priority queue, initialized with root;
2.   let CurrTier be an integer, initialized to 1;
3.   let N[1 ··· m] be a result array; each N[i] set to { ⟨⊥ : [θ′q, θ″q]⟩ };
4.   let ℰ be an entry of Q;
5.   while (Q is not empty) do
6.       ℰ ← dequeue(Q);
7.       if (N[CurrTier] is completed, as implied by ℰ) then
8.           output N[CurrTier];
9.           CurrTier ← CurrTier + 1;
10.          if (CurrTier > m) then goto 15;  // early termination
11.      if ([θq,ℰ, θ̄q,ℰ] ∩ [θ′q, θ″q] = ∅) then goto 5;  // skip examination
12.      if (ℰ is a node) then explore ℰ and put all its children into Q;
13.      else  // it should be an object
14.          tierNSIncorp(q, N, CurrTier, m, ℰ, [θq,ℰ, θ̄q,ℰ] ∩ [θ′q, θ″q]);
15.  for i = CurrTier to m do
16.      output N[i];
End.

Figure 4.16. The pseudo-code of the Ripple algorithm (generalized for m-NS and ANS queries)

NS query. First, it may reduce the memory contention of maintaining a long priority queue (i.e., the major source of runtime memory consumption), as the ANS queries are evaluated separately and index nodes and objects out of a specified angular range are not kept in the queue. Certainly, a shorter priority queue results in smaller manipulation costs. Second, if the evaluation order of those ANS queries is aligned with the angular sequence, some index nodes and objects accessed by an ANS query are likely to be accessed again by subsequent ANS queries for adjacent sectors. Although this may incur repeated accesses over the index, with index nodes and/or objects cached, the resulting performance impact is not significant.
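To make the decomposition concrete, the following is a minimal C++ sketch (not the implementation used in our experiments). It assumes an ansQuery callable that performs the angle-constrained search, e.g., the generalized Ripple algorithm of Figure 4.16; the NSEntry type and the function names are illustrative only.

```cpp
#include <functional>
#include <vector>

// Hypothetical result entry: an NS object together with the angular range
// [from, to) within which it is the nearest surrounder of the query point.
struct NSEntry { int objectId; double from, to; };

// Evaluate a full NS query as k ANS queries over equally sized sectors and
// union the partial results; ansQuery stands for any angle-constrained NS
// search (e.g., a Ripple-based ANS search).
std::vector<NSEntry> nsViaAns(
    int k,
    const std::function<std::vector<NSEntry>(double, double)>& ansQuery) {
    const double twoPi = 6.283185307179586;
    const double sector = twoPi / k;
    std::vector<NSEntry> result;
    for (int i = 0; i < k; ++i) {
        double from = i * sector;
        double to = (i + 1 == k) ? twoPi : (i + 1) * sector;   // cover [0, 2*pi) exactly
        std::vector<NSEntry> part = ansQuery(from, to);
        result.insert(result.end(), part.begin(), part.end()); // union of ANS results
    }
    return result;
}
```

Issuing the sectors in increasing angular order matches the second advantage above: adjacent sectors tend to touch the same index nodes, so a small node cache absorbs most of the repeated accesses.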

4.5 Performance Evaluation

This section evaluates the performance of the search algorithms, namely, the Sweep and Ripple algorithms (labeled as Sweep and Ripple, respectively) for NS queries, based on three commonly used performance metrics: CPU time, access cost and runtime memory consumption. CPU time is measured to indicate the computation cost, access cost counts the number of index nodes/disk pages accessed, and runtime memory consumption measures the maximum priority queue length. We also include a sampling approach that uses ray shooting queries (denoted by Sample), as described in Section 4.2, as a baseline in this evaluation. For Sample, we issue ray shooting queries to locate NS objects at sampled angles, with the sampling interval d varied among 1°, 0.1° and 0.01°. We employ real and synthetic object sets, all normalized to a fixed square data area of [1000, 1000]. Two real object sets are used. They represent polygon landscapes in New York state (labeled as NY) and Rhode Island state (labeled as RI) in the USA, obtained from Tiger/Line [137]. The object set cardinalities (i.e., numbers of objects, n) of NY and RI are 1334k and 103k, respectively, and the average sizes of individual objects (MBR length and width) in NY and RI are [0.26, 0.35] and [1.45, 1.07], respectively. These object sets are plotted in Figure 4.17; as shown, those objects are resized due to data space normalization. On the other hand, the synthetic object sets have 10k, 50k, 100k, 500k and 1000k uniformly distributed line segments, while the lengths of the line segments (l) range over 0.01, 0.05, 0.1, 0.5 and 1. In the experiments, query points are randomly chosen in the space. The number of tiers, m, for each query ranges from 1 to 5. Also, we examine the performance gained by dividing an NS query into several finer ANS queries; we set the number of ANS queries from 1 (i.e., the original NS query) up to 8 (each of which covers 45° of the search space). Table 4.1 lists the experimental parameters and their values. We implemented all of the algorithms in GNU C++. All experiments are conducted on Linux computers with Intel Xeon 3.2GHz CPUs and 4GB RAM. We build an R-tree index for the objects based on the TGS bulkloading technique [48], while the polygons of all objects are stored in a separate file and indexed by object IDs. The index node/disk page size is fixed at 4K bytes and the maximum number of branches per node is 340. For each setting, we conducted 30 queries, and each reported result is the average over the 30 collected readings.

Figure 4.17. Real object sets: (a) NY, (b) RI

Evaluation candidates: Sweep, Ripple and Sample (sample rate: 0.01°, 0.1° and 1°)
Unit space size: [1000, 1000]
Synthetic object sets (uniformly distributed):
    object set size (n) = 10k, 50k, 100k, 500k and 1000k (default: 100k)
    object size (l) = 0.01, 0.05, 0.1, 0.5 and 1 (default: 0.1)
Real object sets: NY (n=1334k, s=[0.26, 0.35]), RI (n=103k, s=[1.45, 1.07])
Number of tiers (m): 1, 2, 3, 4 and 5
Number of ANS queries: 1 (360°), 2 (180°), 4 (90°), 6 (60°), 8 (45°)
Table 4.1. Experimental parameters of performance evaluations on NS queries

In the following three subsections, we discuss the performance evaluation of the NS search algorithms, the effectiveness of the heuristic optimization techniques, and the improvement gained by using multiple ANS queries.

4.5.1 Performance of NS Searching Algorithms

The first set of experiments evaluates the performance of the Sweep, Ripple and Sample algorithms in terms of CPU time, access cost and runtime memory consumption for both synthetic and real object sets. For the synthetic object sets, we evaluate the impact of the object set size (n) and the object size (l) independently below.

Firstly, we examine the impact of the object set size (n), varied from 10k up to 1000k, on the NS search algorithm performance while the object size (l) is fixed at 0.1. The results are depicted in Figure 4.18, Figure 4.19 and Figure 4.20. In Figure 4.18, the numbers of NS objects returned are provided right below the x-axis. As we can observe, in general, the number of NS objects increases from n = 10k to n = 100k and then drops afterwards. This is because when n is small, many NS objects do not overlap angularly, whereas when n is further increased, many objects located close to the query point hide many other objects.

Figure 4.18. Performance in terms of CPU time on the number of objects (n): (a) n=10k, (b) n=50k, (c) n=100k, (d) n=500k, (e) n=1000k

Among all the algorithms, Sample takes much longer than Sweep and Ripple. Sweep runs faster than Ripple for small n (such as 10k and 50k) but slower than Ripple when n is increased. This is because, for a large n, Sweep's Lookahead function has to examine many queue entries; moreover, as explained later, more objects are examined by Sweep, which results in high CPU time. While Sweep and Ripple return the exact result, Sample (as indicated for m=5) identifies only some of the NS objects; in other words, Sample is both inaccurate and inefficient.

Figure 4.19. Performance in terms of I/O cost on the number of objects (n): (a) n=10k, (b) n=50k, (c) n=100k, (d) n=500k, (e) n=1000k

Figure 4.20. Performance in terms of runtime memory cost on the number of objects (n): (a) n=10k, (b) n=50k, (c) n=100k, (d) n=500k, (e) n=1000k

In Figure 4.19, the numbers of objects examined by Sweep and Ripple are depicted next to the markers. We can see that for n being 10k, 50k and 100k, they perform equally well, with the same node accesses and similar numbers of examined objects. However, when n is increased to 500k and 1000k, Sweep accesses more pages than Ripple, which means that Sweep incurs many false hits. This is because, for large synthetic object sets, many objects (line segments) cross each other and small MBRs are formed. Recall that Sweep gives the highest priority to those entries whose angular ranges are fully covered by an existing result. It is then still possible that some entries which overlap with existing NS objects, but are hidden by NS objects found later, are retrieved. As a result, higher access costs are incurred. Notice that Sweep's Lookahead is actually effective, especially when small object sets are experimented with. Here, Sample incurs very high access costs. Finally, Figure 4.20 shows the runtime memory consumption in terms of the maximum number of queue entries. Due to progressively exploring the search space in an angular fashion, Sweep incurs much smaller memory consumption than Ripple.

Figure 4.21. Performance in terms of CPU time on the sizes of objects (l): (a) l=0.01, (b) l=0.05, (c) l=0.50, (d) l=1.00

Secondly, we evaluate the impact of the size of the objects (i.e., line segments), l, varied from 0.01 up to 1.00. The results in terms of CPU time are shown in Figure 4.21 (the results for l=0.1 are already shown in Figure 4.18). Again, the result sets returned by Sample (as indicated for m=5) are imprecise, compared with the exact results returned by Sweep and Ripple. When l is increased to 0.5 and 1, many line segments cross each other; thus, Sweep performs slower than Ripple, as previously explained. Again, Sample incurs high CPU time. Due to limited space, the results for access cost and runtime memory consumption are omitted.

Finally, we evaluate the performance of the search algorithms with the real object sets. The results are shown in Figure 4.22, Figure 4.23 and Figure 4.24. Due to the large number of objects in NY and the large object sizes in RI, Ripple outperforms Sweep in terms of CPU time and access cost. Besides, both perform significantly better than Sample (with a sample angle of 0.01°), although Sample incurs small memory consumption.

Figure 4.22. Performance in terms of CPU time on real object sets: (a) RI, (b) NY

Figure 4.23. Performance in terms of index node access on real object sets: (a) RI, (b) NY

Figure 4.24. Performance in terms of runtime memory consumption on real object sets: (a) RI, (b) NY

4.5.2 Effectiveness of Optimization Techniques

In discussing the algorithms, we proposed various heuristic optimization techniques, including lookahead for Sweep to selectively explore the search space and early termination for Ripple to finish the search as soon as no further objects can contribute to the result. In this set of experiments, we examine their effectiveness. To save space, we only report the results based on the real object sets and various m. We label Sweep without lookahead as Sweep-NoOpt and Ripple without early termination as Ripple-NoOpt.

Figure 4.25. Performance about CPU time improvement on various optimizations: (a) RI, (b) NY

Figure 4.26. Performance about node access improvement on various optimizations: (a) RI, (b) NY

From Figure 4.25, we can see that lookahead and early termination improve CPU time: lookahead reduces the examination of unnecessary index nodes and objects, while early termination finishes the search without examining all index nodes and objects. Next, from Figure 4.26, we can observe that lookahead significantly reduces the index node accesses while early termination has no impact on node accesses. Although lookahead saves many node accesses, it also incurs a high computation cost in deciding the best candidate to examine, resulting in less improvement in terms of CPU time. Further, lookahead saves memory owing to its reduced number of index nodes accessed, as shown in Figure 4.27.

Figure 4.27. Performance about runtime memory consumption improvement on various optimizations: (a) RI, (b) NY

4.5.3 Improvement Gained by ANS Queries

Next, we examine the improvement for an NS query gained by using multiple ANS queries. We vary the number of ANS queries from 1 (i.e., the original NS query) up to 8. Meanwhile, we also examine the impact of a cache of 10 disk pages and of 50 disk pages on disk access savings. All ANS queries are issued over equally divided angular ranges.

Figure 4.28. Performance about CPU time improvement by using ANS queries: (a) RI, (b) NY

Figure 4.29. Performance about disk access improvement by using ANS queries: (a) RI, (b) NY

The results on the real object sets in terms of CPU time and memory consumption are shown in Figure 4.28 and Figure 4.30. From the figures, we can see that, in general, using ANS queries can reduce both the CPU time and the memory consumption. The results in terms of disk access are shown in Figure 4.29, where Sweep-0, Sweep-10 and Sweep-50 (Ripple-0, Ripple-10 and Ripple-50) represent Sweep (Ripple) running with a cache that can buffer 0, 10 and 50 index nodes, respectively. Although disk accesses increase slightly when more ANS queries are issued, this extra overhead can be eliminated when a cache is used.

Figure 4.30. Performance about runtime memory consumption improvement by using ANS queries: (a) RI, (b) NY

4.6 Chapter Summary

In this chapter, we have presented our research on a new type of spatial query: Nearest Surrounder (NS) queries and their variants, namely, multi-tier NS (m-NS) queries and angle-constrained NS (ANS) queries, on polygon-shaped spatial objects. We developed two search algorithms, namely, the Sweep algorithm and the Ripple algorithm, to support NS queries and their variants. Through an extensive experimental evaluation, both the Sweep and Ripple algorithms are shown to significantly outperform the Sample algorithm (i.e., a naive NS processing solution based on ray shooting queries on sampled angles) in terms of both I/O cost and CPU time, especially when object sets with high object density are queried. As indicated in our experiments, executing multiple ANS queries rather than a single NS query for the same angular range runs faster and takes less memory.

Chapter 5

Skyline Queries

Given two data points p and p' in a multidimensional space, p is said to dominate p' if p is strictly better than p' on at least one dimension and p is not worse than p' on the other dimensions. To illustrate the idea of dominance relationships, Figure 5.1 gives a hotel finding example (i.e., our running example in this chapter) in which a conference participant looks for a hotel based on two criteria, price per night and distance to the conference venue. This search is common to many LBSs. Figure 5.1(a) lists nine hotel records and their values (price in units of $100 and distance in units of 0.1 mile), while Figure 5.1(b) depicts the geometrical representation of the hotels in a 2D space with each dimension representing one criterion.

Figure 5.1. An example skyline query (our running example)

Hotels p4, p7, p8 and p9 are all dominated. For instance, p7 is dominated by p5 as it is more expensive than p5, although both of them are equally good in terms of nearness to the conference venue. Queries specialized to search for those non-dominated data points are called skyline queries, and their corresponding results are called skyline results. In the example, hotels p1, p2, p3, p5 and p6 form the skyline result set. Individual data points in a skyline result set are skyline points. Based on alternative dominance relationships, several skyline query variants, such as the skyband query [105], the top-ranked skyline query, the k-dominant skyline query [25] and the subspace skyline query [38, 135], have appeared in the literature. A skyband query returns data points that are not dominated by more than b other data points (where b specifies the width of the skyband result the user is interested in). A top-ranked skyline query finds the t skyline points that each dominate the most data points, where t is a query parameter. A k-dominant skyline query retrieves a representative subset of skyline points from a high-dimensional data set by considering dominance relationships on arbitrary k dimensions rather than all d dimensions (where k ≤ d). Last but not least, a subspace skyline query performs a skyline search on a specified subset of dimensions. Processing skyline queries and their variants is expensive. Their costs mainly consist of the I/O cost of accessing data from secondary storage (e.g., disks) and the CPU cost spent on dominance tests. In particular, the CPU cost can be very high due to exhaustive data point comparisons. In this chapter, we develop a suite of algorithms to evaluate skyline queries and the above-described skyline query variants. It is noteworthy that all algorithms and indexes developed for skyline queries in this research are based on a coherent concept of organizing the access of objects along the Z-order space filling curve. More importantly, all our algorithms outperform the state-of-the-art approaches, which are specialized for skyline queries or certain skyline query variants only. In what follows, we first analyze the skyline query problems, explain the connection between dominance relationships among data points and Z-order space filling curves, present our algorithms and evaluate them against existing approaches.
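As a concrete illustration of the dominance test (a brute-force sketch for exposition only, not one of the algorithms developed in this chapter), the following C++ fragment computes a skyline by pairwise comparisons. Smaller values are better in both dimensions, matching the hotel example; the two sample records reuse the price/distance values of p5 and p7 as read from Figure 5.1(a).

```cpp
#include <iostream>
#include <vector>

struct Point { const char* name; std::vector<double> v; };  // smaller is better

// p dominates q: p is no worse on every dimension and strictly better on one.
bool dominates(const Point& p, const Point& q) {
    bool strictlyBetter = false;
    for (size_t i = 0; i < p.v.size(); ++i) {
        if (p.v[i] > q.v[i]) return false;
        if (p.v[i] < q.v[i]) strictlyBetter = true;
    }
    return strictlyBetter;
}

// Brute-force skyline: keep every point not dominated by any other point.
std::vector<Point> skyline(const std::vector<Point>& pts) {
    std::vector<Point> result;
    for (const Point& p : pts) {
        bool dominated = false;
        for (const Point& q : pts)
            if (dominates(q, p)) { dominated = true; break; }
        if (!dominated) result.push_back(p);
    }
    return result;
}

int main() {
    // p5 = (5, 2) and p7 = (6, 2) from the running example: p5 dominates p7,
    // so only p5 survives.
    std::vector<Point> hotels = { {"p5", {5, 2}}, {"p7", {6, 2}} };
    for (const Point& p : skyline(hotels)) std::cout << p.name << "\n";
}
```

The quadratic cost of this sketch is precisely the "exhaustive data point comparisons" mentioned above that the algorithms in this chapter are designed to avoid.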

5.1 Analysis of Skyline Search Problems

Given a d-dimensional space S = {s1, s2, · · · , sd} that covers a set of data points P = {p1, p2, · · · , pn} (such that every pi is a data point in S), the dominance relationship between data points and the skyline queries are defined in Definition 5.1 and Definition 5.2, respectively. We use pi.sj to denote the j-th dimensional value of pi, and assume the existence of a total order relationship, either '<' or '>', on each dimension. Without loss of generality, we consider that p.sj is better than p'.sj if p.sj < p'.sj, throughout this chapter.

Definition 5.1. Dominance. Given p, p' ∈ P, p is said to dominate p' (denoted by p ≺ p') iff ∀si ∈ S, p.si ≤ p'.si ∧ ∃sj ∈ S, p.sj < p'.sj; otherwise, p does not dominate p' (denoted by p ⊀ p').

Definition 5.2. Skyline query. A skyline query retrieves those data points in P that are not dominated by any other point. We denote the skyline result by U, and formally, U = {p ∈ P | ∀p' ∈ P − {p}, p' ⊀ p}.

We observe two skyline query properties, namely, transitivity and incomparability, which provide important insights to facilitate the development of our index and algorithms, as stated below.

Property 5.1. Transitivity. Given p, p', p'' ∈ P, if p dominates p' and p' dominates p'', then p dominates p'', i.e., p ≺ p' ∧ p' ≺ p'' ⇒ p ≺ p''.

Property 5.2. Incomparability. Given p, p' ∈ P, p and p' are incomparable if they do not dominate each other, i.e., p ⊀ p' ∧ p' ⊀ p.

Based on the transitivity property, if dominating points are always processed before their dominated data points, any data point, which passes the dominance tests against all the skyline points obtained ahead of it, is guaranteed to be a skyline point. This property inspires the sorting-based approaches [30, 50]. On the other hand, if two data points are known to be incomparable in advance, dominance tests between them can be avoided. This incomparability inspires divide-and-conquer approaches [21]. We shall discuss them in the next section. Generally speaking, a skyline query classifies data points into a set of skyline points (U) and non-skyline points (P-U) and returns U. Some non-skyline points may be dominated by only a few data points and they are incomparable to all the remaining skyline points. Revisit

our hotel example in Figure 5.1. Hotel p7 is only dominated by hotel p5. In other words, p7 is

not dominated by any of the other skyline points, namely, p1, p2, p3 and p6. In this case, if the user is interested in p5 but it is not available, p7 could be the next candidate for her to consider. To retrieve data points that are not dominated by more than a certain number of data points, skyband queries are defined in Definition 5.3.

Definition 5.3. Skyband query. Let Dp represent the set of data points that dominate p, i.e., Dp = {p' ∈ P − {p} | p' ≺ p}. Given an integer b (where b ≥ 1), a skyband query retrieves all the points p that are dominated by b or fewer other points. We denote the skyband result of the query by Ub, and Ub = {p ∈ P | |Dp| ≤ b}.
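A brute-force sketch of Definition 5.3 follows, again for illustration only (the ZBand algorithm in Section 5.5.2 avoids these pairwise tests); it reuses the same dominance test as in the earlier sketch and keeps every point whose dominator count does not exceed b.

```cpp
#include <vector>

struct Point { std::vector<double> v; };     // smaller values are better

// Same pairwise dominance test as before.
bool dominates(const Point& p, const Point& q) {
    bool strict = false;
    for (size_t i = 0; i < p.v.size(); ++i) {
        if (p.v[i] > q.v[i]) return false;
        if (p.v[i] < q.v[i]) strict = true;
    }
    return strict;
}

// Skyband U_b: every point dominated by at most b other points.
std::vector<Point> skyband(const std::vector<Point>& pts, int b) {
    std::vector<Point> result;
    for (size_t i = 0; i < pts.size(); ++i) {
        int dominators = 0;
        // Stop counting as soon as the point is known to be dominated
        // by more than b others.
        for (size_t j = 0; j < pts.size() && dominators <= b; ++j)
            if (j != i && dominates(pts[j], pts[i])) ++dominators;
        if (dominators <= b) result.push_back(pts[i]);
    }
    return result;
}
```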

Besides, the conventional skyline query ignores the fact that skyline points might have different numbers of dominated data points. In many cases, skyline points that dominate fewer data points are those which have a few better attributes but many less competitive attributes. In our hotel example, p6 has no dominated point at all because of its extremely high price, though it is the closest to the conference. Referring to a point's number of dominated points as its dominating power, we consider those skyline points with the greatest dominating power (i.e., with the most dominated data points) to be the most representative. Then, given a desired number of returned top-ranked skyline points, t, top-ranked skyline queries are defined in Definition 5.4 to retrieve the t skyline points that dominate the most data points.

Definition 5.4. Top-ranked skyline query. Let Ep represent the set of data points that are dominated by p, i.e., Ep = {p' ∈ P | p ≺ p'}. A point p is said to have greater dominating power than p' if |Ep| > |Ep'|. Given an integer t, a top-ranked skyline query retrieves t skyline points from U that have the largest dominating power. We denote the top-ranked skyline result of the query by Ut such that Ut ⊆ U, |Ut| ≤ t and ∀p ∈ Ut, ∀p' ∈ (U − Ut), |Ep| > |Ep'|1.

Another variant is the k-dominant skyline queries that filter out skyline points with only a few good attributes. Instead of all the d dimensions, a k-dominance relationship considers arbitrary k out of all the d dimensions, as defined in Definition 5.5. With a small k, the size of a skyline result can be reduced [25]. The definition of k-dominant skyline queries is provided in Definition 5.6.

Definition 5.5. k-dominance. Given p, p' ∈ P, p k-dominates p' (denoted by p ≺k p') iff ∃S' ⊆ S, |S'| = k, ∀si ∈ S', p.si ≤ p'.si ∧ ∃sj ∈ S', p.sj < p'.sj; otherwise, p does not k-dominate p' (denoted by p ⊀k p').

Definition 5.6. k-dominant skyline query. A k-dominant skyline query retrieves the subset of data points in P that are not k-dominated by any other point. We denote the k-dominant skyline result of the query by Uk, and Uk = {p ∈ P | ∀p' ∈ P − {p}, p' ⊀k p}.

1If |U| < t, we collect Ut as the entire U.

According to the k-dominance relationship, skyline points that are only good in a few dimensions are very likely to be k-dominated by others, and hence the k-dominant skyline result is smaller than a conventional skyline result. It is important to note that data points may get dominated by other data points sorted behind them according to any monotonic access order. This is because a data point p k-dominating another point p' can also get k-dominated by p' at the same time, but in different k dimensions. This is the so-called cyclic dominance [25]. In this case, p and p' are not k-dominant skyline points. Thus, candidate reexaminations are needed, which incurs a larger processing overhead. Further, determining whether p k-dominates p' requires comparing at least k dimensional values. Meanwhile, examining whether p does not k-dominate p' needs to examine no fewer than (d − k + 1) dimensional values2. Therefore, each dominance test based on the k-dominance relationship incurs a larger processing overhead than the conventional dominance test. Different from k-dominant skyline queries, subspace skyline queries retrieve data points that are not dominated by others based on a specified subset of the d dimensions (called a subspace). Accordingly, Definition 5.7 defines a subspace dominance relationship between points and Definition 5.8 formalizes subspace skyline queries.
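The k-dominance test of Definition 5.5 can be decided by simple counting, as the following sketch shows (smaller values are better): if p is no worse than p' on at least k dimensions and strictly better on at least one of them, a qualifying k-dimension subset always exists. Note that the test is not antisymmetric; two points may k-dominate each other via different subsets, which is exactly the cyclic dominance discussed above.

```cpp
#include <vector>

// k-dominance test (Definition 5.5): p k-dominates q iff there exist k
// dimensions on which p is no worse everywhere and strictly better at least
// once. Counting suffices: pick the strictly-better dimension plus any k-1
// other no-worse dimensions.
bool kDominates(const std::vector<double>& p, const std::vector<double>& q, int k) {
    int noWorse = 0, strictlyBetter = 0;        // smaller values are better
    for (size_t i = 0; i < p.size(); ++i) {
        if (p[i] <= q[i]) ++noWorse;
        if (p[i] <  q[i]) ++strictlyBetter;
    }
    return noWorse >= k && strictlyBetter >= 1;
}
```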

Definition 5.7. Subspace dominance. Given p, p' ∈ P and a subspace S' (⊂ S), p S'-dominates p' (denoted by p ≺S' p') iff ∀si ∈ S', p.si ≤ p'.si ∧ ∃sj ∈ S', p.sj < p'.sj; otherwise, p does not S'-dominate p' (denoted by p ⊀S' p').

Definition 5.8. Subspace skyline query. A subspace skyline query retrieves the subset of data points in P that are not S'-dominated by any other point on a subspace S'. We denote the subspace skyline result of the query by US', and US' = {p ∈ P | ∀p' ∈ P − {p}, p' ⊀S' p}.

The evaluation of a subspace skyline query can intuitively be treated as that of a conventional skyline query, except that only certain dimensions are considered. However, it is not trivial to have a single data organization that supports subspace skyline queries on all possible subsets of dimensions simultaneously. Recall our example. A subspace skyline query on the price

dimension can be facilitated if data points are arranged in an ascending price order, i.e., p2, p3,

p4, etc. Likewise, another subspace skyline query for the distance dimension prefers objects sorted according to their distances. Clearly, these orders are different and not favorable to conventional skyline queries that consider both of the dimensions.

2Among the k dimensions, p has at least one dimensional value better than p' and the rest of p's values are not worse than those of p'. Meanwhile, once p is found to have d − k + 1 dimensional values worse than those of p', p cannot k-dominate p'.

5.2 Existing Skyline Search Approaches

Skyline queries have been studied extensively for various environments, e.g., the Web [15], distributed databases [145], data streams [95, 133], MANETs [66], spatial databases [120, 158], peer-to-peer systems [93, 140], probabilistic data [107], [27, 47], etc. In this research, our focus is mainly on developing algorithms for skyline queries in centralized databases. Before the discussion of our approaches, we review representative algorithms for skyline queries and skyline variants, such as skyband queries, top-ranked skyline queries, k-dominant skyline queries and subspace skyline queries.

5.2.1 Skyline Query Processing

Existing algorithms for skyline query processing include Block-Nested Loop (BNL) [21], Bitmap [130], Divide-and-Conquer (D&C) [21], Sort-Filter-Skyline (SFS) [30], LESS [50], SaLSa [17], Index [130], Nearest Neighbor (NN) [81], and Branch and Bound Search (BBS) [105], etc. They can be roughly classified based on the adoption of two popular techniques, namely, divide-and-conquer (D&C) and/or sorting. Table 5.1 gives a short summary. BNL and Bitmap are brute-force algorithms. In the following, we review representative divide-and-conquer, sorting and hybrid algorithms.

            BNL, Bitmap    D&C    SFS, SaLSa, LESS    Index, NN, BBS
D&C                         X                              X
Sorting                                   X                X
Table 5.1. A survey of representative skyline query algorithms

D&C Algorithms. D&C divides a dataset into several partitions small enough to be loaded into the main memory for processing [21]. A local skyline for each partition is computed. The global skyline is obtained by merging all local skylines and removing dominated data points.

Sorting Based Algorithms. SFS [30] is devised based on the observation that, by having a dataset presorted according to a certain monotone scoring function such as the entropy, sum or minimum of attribute values, it is guaranteed that data points cannot be dominated by others sorted behind them, so the partial query result can be delivered immediately. SFS sequentially scans the sorted dataset and keeps a set of skyline candidates. Those not dominated by skyline candidates are part of the skyline. SaLSa [17] presorts a dataset based on the records' minimum attribute values, and it improves SFS by terminating a search whenever an examinee data point is reached whose minimum attribute value is greater than the MiniMax (i.e., the minimum among the maximums) of the examined data points' attribute values. This termination condition guarantees that all later examined data points cannot be skyline points; this early termination avoids scanning the entire dataset. However, SFS and SaLSa have no or limited search space pruning capability and inevitably examine and compare all the individual data points in a pairwise fashion. Also, dominance tests in SFS and SaLSa are based on an exhaustive scan over the existing skyline candidates. Unless the number of skyline points is very small, they tend to be CPU-bound. LESS [50] combines external sort and skyline search in a multi-pass fashion. It reduces the sorting cost and provides an attractive asymptotic best-case and average-case performance based on the assumption that the number of skyline points is small.

Hybrid Algorithms. Index [130], NN [81] and BBS [105] are hybrid approaches that use both divide-and-conquer and sorting techniques in skyline query processing. Here, we review BBS, i.e., currently the most efficient online skyline search algorithm. BBS is based on iterative nearest neighbor search [56]. In Figure 5.2(a), which illustrates BBS with the nine hotel example points, p1, the nearest neighbor (nn) to the origin in the whole space, is identified as the first skyline point, since the empty vicinity circle that centers at the origin and touches p1 ensures that there are no data points dominating p1. Then, data points (i.e., p4, p8 and p9) falling into the dominance region of p1, i.e., the region bounded by p1 and the maximal point of the entire space, are not skyline points and can be safely discarded from detailed examination. Based on the same idea, the second nn to the origin, p3, not dominated by p1, is another skyline point.

Next, p5, the third nn, is retrieved and its dominated point p7 is removed. Finally p2 and p6 are retrieved and the search terminates. BBS adopts one R-tree to index the source dataset because R-tree can facilitate iterative nn search. It also uses a heap to keeps track of unexamined index nodes and data points in non- decreasing distance order with respect to the origin so that repeated accesses of R-tree nodes can be alleviated, and a main memory R-tree to index the dominance regions of all skyline points. Via an additional main memory R-tree, BBS performs dominance tests on every exam- inee (i.e., data point or index node) by issuing an enclosure query. If an examinee is entirely enclosed by any skyline candidate’s dominance region, it is dominated. In Figure 5.2(b), p8, an examinee data point, is compared with Ba and Bb, the minimum bounding boxes of two leaf nodes in the main memory R-tree. As it is in Ba, not Bb, p8 is possibly dominated by some data points enclosed by Ba but certainly not Bb. Next, p8 is compared with the dominance regions 88

Figure 5.2. BBS and main memory R-tree: (a) BBS, (b) main memory R-tree

of all the data points inside Ba and is found to be dominated by p1 (and p3). However, as data dimensionality increases, the performance of the R-trees and main-memory R-trees that BBS depends on deteriorates (due to the curse of dimensionality [19]). Even worse, some data points (or index nodes) that are far away from the origin are loaded into the heap because their enclosing nodes (or parent nodes, respectively) appear the closest to the origin. As a result, BBS sometimes loads some entries earlier than they are needed, thus imposing high contention on the runtime memory. Besides, dominance tests based on main-memory R-trees are actually not efficient because an examinee data point or index node has to reach the bottom level to determine by which skyline candidates it is dominated. In other words, dominance tests are performed based on point-to-block comparison, which is clearly less efficient than the block-to-block comparison provided by our proposed algorithms. Our ZSearch algorithm (to be presented in Section 5.4) is also a hybrid approach. It processes a skyline query by accessing data points arranged on a Z-order curve. With effective block-based dominance tests, ZSearch can efficiently assert whether a block of data points is dominated by a single data point or by a block of skyline points. This significantly improves both the overall processing time and the runtime memory consumption.
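To make the sorting-based idea above concrete, the following is a minimal sorted-scan sketch in the spirit of SFS (not the actual SFS implementation of [30]): points are presorted by a monotone score (here the attribute sum), so a dominating point always precedes the points it dominates, and every point that survives the tests against the candidates collected so far is a confirmed skyline point and can be emitted progressively.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

struct Point { std::vector<double> v; };     // smaller values are better

bool dominates(const Point& p, const Point& q) {
    bool strict = false;
    for (size_t i = 0; i < p.v.size(); ++i) {
        if (p.v[i] > q.v[i]) return false;
        if (p.v[i] < q.v[i]) strict = true;
    }
    return strict;
}

// Sorted-scan skyline: presort by a monotone score (the attribute sum), then
// scan once. A dominating point always has a strictly smaller sum, so the
// candidate list only ever contains confirmed skyline points (transitivity).
std::vector<Point> sortedScanSkyline(std::vector<Point> pts) {
    auto score = [](const Point& p) {
        return std::accumulate(p.v.begin(), p.v.end(), 0.0);
    };
    std::sort(pts.begin(), pts.end(),
              [&](const Point& a, const Point& b) { return score(a) < score(b); });
    std::vector<Point> candidates;           // already-confirmed skyline points
    for (const Point& p : pts) {
        bool dominated = false;
        for (const Point& c : candidates)
            if (dominates(c, p)) { dominated = true; break; }
        if (!dominated) candidates.push_back(p);
    }
    return candidates;
}
```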

5.2.2 Skyline Query Variants

Based on revised search criteria and dominance conditions, several skyline query variants have recently been proposed to retrieve "good" data points from different perspectives. Here, we discuss four major variants, namely, skyband queries, top-ranked skyline queries, k-dominant skyline queries and subspace skyline queries.

Skyband Query Processing. A skyband query retrieves data points dominated by no more than b other data points. In Figure 5.3(a), which illustrates a skyband query example, p8 is dominated by p1 and p3, and hence its count of dominating points is 2, as shown in the associated brace. Based on these counts, a skyband query with b set to 2 returns {p1, p2, p3, p5, p6, p7}. The other data points, p4, p8 and p9, are excluded as they are dominated by 3, 2 and 6 other data points, respectively.

Figure 5.3. A skyband query and a top-ranked skyline query: (a) a skyband query (b=2), (b) a top-ranked skyline query (t=2)

It is quite straightforward to extend existing skyline algorithms such as SFS [30] and BBS [105] to keep track of data points dominated by no more than b data points as skyband candidates. However, increasing b to a larger value will definitely result in more skyband candidates and thus incur a higher computational cost in dominance tests. Our ZBand algorithm improves the skyband search efficiency based on the advantage of block-based dominance tests. Specifically, we associate each block of skyband candidates with a count indicating the number of enclosed skyband candidates. If an examinee (which could be a data point or a block of data points) is found to be dominated by some blocks of skyband candidates and the sum of their counts is greater than or equal to b, the examinee is certainly dominated by b or more data points and it can be safely discarded. However, the idea of associating counts is not applicable to main-memory R-trees and thus cannot be used to improve BBS performance. The ZBand algorithm is examined in detail in Section 5.5.2.

Top-Ranked Skyline Query Processing. To quantify the importance of a data point p based
Very often, a block of skyline candidates dominates some common examinees. By keeping track of the numbers of data points dominated by blocks rather than by individual skyline points, some comparison overhead can be alleviated. The details of our ZRank algorithm are provided in Section 5.5.3.

k-Dominant Skyline Query Processing. By considering arbitrary k among the d dimensions, k-dominant skyline queries [25] reduce the size of the skyline result set. However, data points can k-dominate each other simultaneously. Figure 5.4 lists four three-dimensional data points, namely, p1, p2, p3 and p4, which are all skyline points based on the conventional dominance

        s1   s2   s3
p1       9   11    2
p2       2   11   11
p3       8    8    8
p4       1   25    1

Figure 5.4. An example for k-dominant and subspace skyline queries

condition. If we consider the 2-dominance relationship, p1 and p2 2-dominate each other. This cyclic dominance relationship violates the transitivity property, making all the existing skyline search algorithms inapplicable. As reported in [25], the Two-Scan Algorithm (TSA) is currently the most efficient algorithm for k-dominant skyline queries. TSA scans the dataset twice: the first scan collects candidate skyline points, which might include false hits, and the second scan eliminates those false hits.

As an example, in Figure 5.4, p1 is first picked and it 2-dominates p2 and p3 but not p4. After the first scan, p1 and p4 are retained. In the second scan, p1 is checked against p2 and gets dominated, so it is removed from the candidate set. Data point p4, not 2-dominated after the second scan, forms the 2-dominant skyline. With the relaxed k-dominance relationship, more data points can be dominated, so the search space should be significantly shrunk. However, TSA and other existing approaches, without being conscious of this fact, access all individual data points.

In the example shown in Figure 5.4, data points p1, p2 and p3 can be grouped as a block and represented by their lower and upper bounds for each dimension (i.e., (⟨2, 9⟩, ⟨8, 11⟩, ⟨2, 11⟩)) in the search space. It is easy to see that p4 2-dominates the entire block, which eliminates the pairwise dominance tests between p4 and all the enclosed data points. Our k-ZSearch algorithm (which will be presented in Section 5.5.4) tackles the k-dominant skyline query problem by exploiting the clustering property of the Z-order curve. Whenever a cluster of data points is found to be k-dominated, the examination of all the enclosed points is avoided, thus speeding up the search.

Subspace Skyline Query Processing. To address the subspace skyline search problem, SUBSKY [135] has recently been proposed. It presorts a dataset based on the records' minimum attribute values and sequentially scans the sorted dataset until the MiniMax of all examined records' attribute values is smaller than the minimum attribute values of all the remaining unexamined records, or the dataset is completely scanned. To improve the search efficiency, it groups data points into regions. An entire region is skipped from detailed examination if it is found to be dominated by the MiniMax of the existing skyline points.

However, SUBSKY can be very inefficient, especially when the number of skyline points is large. First, since early examined data points may get dominated by data points that are accessed later, SUBSKY needs to preserve a buffer of skyline candidates and maintain it when

a new skyline candidate is collected. Suppose that a subspace skyline query concerning s2 is

issued on data points illustrated in Figure 5.4. First, p4 (with its minimum attribute value of 1)

is first accessed but later it is replaced with p1 and p2 whose minimum attribute values are both

2. Then p1 and p2 are removed as p3, the skyline point, is accessed. Thus, this sorting is not appealing. Also, SUBSKY adopts point-to-point dominance tests and the effectiveness of its pruning degrades when many data points are non-dominated. Instead, our ZSubspace algorithm as will be presented in Section 5.5.5 utilizes the block-based dominance test to enhance the

search efficiency. For example, for the same set of data points, p1, p2 and p4 are bounded by a block (⟨1, 9⟩, ⟨11, 25⟩, ⟨1, 11⟩). Suppose that p3 is accessed first. By projecting values on s2, these points are bounded by ⟨11, 25⟩. From this, we can quickly assert that the block is

dominated by p3 without exploring it. Further, accessing data points along Z-addresses usually reaches data points close to the origin of the data space prior to others. Those data points typically can dominate many other data points projected on any search subspace.

5.3 Skyline Search Based on Z-Order Curve

In this section, we first state the properties of Z-order curves that are found to match perfectly well with the dominance relationship among data points in a data space. This observation inspires our algorithms for skyline queries and their variants. We then present the ZBtree, an index structure developed based on the idea of Z-order curves, which facilitates skyline searches and result updates as well as index manipulations.

5.3.1 Dominance Relationships and Z-Order Curve

The efficiency of skyline query processing and skyline result updating is highly dependent on two factors, i.e., the access order of the data points and the organization of skyline candidates that facilitates dominance tests. An appropriate access order can identify, early in the search, skyline points that dominate many other data points, which eliminates unneeded dominance tests and candidate reexaminations [30]. It is desirable to adopt block-based dominance tests, if possible, instead of pairwise data point comparisons. To enable block-based dominance tests, data points should be clustered. As will be shown below, Z-order curves can satisfy all these needs.

Figure 5.5. A Z-order curve example: (a) nine points in a 2D data space, (b) a Z-order curve

Figure 5.5(a) depicts our nine example 2D data points, i.e., p1, · · · , p9. Suppose we partition the entire space into four equal-sized quadrants, i.e., Regions I, II, III and IV, along the directions parallel to the two axes. Data points in Region I are definitely not dominated by those in the other three regions. On the contrary, all data points in Region IV are dominated by any data point in Region I. Meanwhile, Regions II and III are diagonally opposite to each other, and their data points are incomparable, so dominance tests between them are not needed. However, data points in Regions II or III may get dominated by some data points in Region I, while they may dominate some data points in Region IV. Finally, those in Region IV cannot dominate any data point in the other three regions. These observations not only provide excellent heuristics for dominance tests at the regional (i.e., block) level, but also lead to a natural access order of the regions (and their data points) for processing skyline queries, i.e., Region I should be accessed first, followed by Region II, then Region III (or Region III, then Region II) and finally Region IV3. The same principles are also applicable to subregions divided from each region. The finest subregion can be just a single coordinate in the space. This access sequence exactly follows a (rotated) 'Z' order, which is the well-known Z-order space filling curve. Figure 5.5(b) shows the Z-order curve, which starts at the origin and passes through all coordinates (and data points) in the space. A Z-order curve maps multi-dimensional data points into a one-dimensional space, with

3The visiting order of Regions II and III does not affect the correctness of the skyline result.

each point represented by a unique position, called a Z-address. A Z-address is a bit string calculated by interleaving the bits of all the coordinate values of a data point. For a d-dimensional space with [0, 2^v − 1] as the coordinate value range, the Z-address of a data point contains dv bits, which can be considered as v d-bit groups. The i-th bit of a Z-address is contributed by the (i/d)-th bit of the (i%d)-th coordinate4. In our example, the Z-address of p2 (1,6) (i.e., (001,110) in binary) is 010110. Here, 01, 01, 10 are the three 2-bit groups. Similarly, the Z-

address of p4 (3,7) (i.e., (011,111) in binary) is 011111, and 01, 11, 11 are the three 2-bit groups. This Z-address calculation is reversible, so the original coordinates in the multi-dimensional space can be recovered by distributing the bit values to the corresponding dimensions5. Besides, Z-addresses are hierarchical in nature. Given a Z-address with v d-bit groups, the first bit group partitions the search space into 2^d equal-sized sub-spaces, the second bit group partitions each sub-space into 2^d equal-sized smaller sub-spaces, and so on.

For instance, p2 and p4 share the same first bit group (i.e., 01), and hence both of them fall inside the left part along the x-axis and the upper part along the y-axis (i.e., Region II in Figure 5.5(a)). With data points arranged according to their Z-addresses, Z-order curves bear two important properties, monotonic ordering and clustering, as stated below, which perfectly match the transitivity and incomparability properties of skyline queries, respectively, as already discussed in Section 5.1.
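A small sketch of the bit-interleaving computation described above, assuming v-bit coordinates and taking dimension 0 first within each bit group; it reproduces the Z-addresses of p2 and p4 given in the text.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Z-address of a d-dimensional point with v-bit coordinates: interleave the
// coordinate bits from the most significant bit down, taking dimension 0
// first within each bit group (so the first d bits form the first bit group).
uint64_t zAddress(const std::vector<uint32_t>& coord, int v) {
    uint64_t z = 0;
    for (int bit = v - 1; bit >= 0; --bit)            // most significant bit first
        for (size_t dim = 0; dim < coord.size(); ++dim)
            z = (z << 1) | ((coord[dim] >> bit) & 1u);
    return z;
}

int main() {
    // p2 = (1, 6) = (001, 110) in 3-bit binary -> Z-address 010110 = 22;
    // p4 = (3, 7) = (011, 111)                 -> Z-address 011111 = 31.
    std::printf("%llu %llu\n",
                (unsigned long long)zAddress({1, 6}, 3),
                (unsigned long long)zAddress({3, 7}, 3));
}
```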

Property 5.3. Monotonic Ordering. Data points ordered by non-descending Z-addresses are monotonic such that data points are always placed before their dominated points.

Property 5.4. Clustering. Data points with identical prefixes (i.e., leading bit groups) in their Z-addresses are naturally clustered as regions.

As shown in Figure 5.5(b), p1, which dominates p8 and p9, is accessed before both of them on the

Z-order curve. Similarly, p2 and p3 are accessed before p4, while p5 is accessed before p7. This access order guarantees that no candidate reexamination is needed. Due to the hierarchical property of Z-addresses, data points located in the same regions have the same prefixes (i.e.,

some beginning d-bit groups) in their Z-addresses. As in our example, p2, p3 and p4 (with the first bit group (i.e., 01) in common) are located inside Region II. Grouping data points in regions can facilitate block-based dominance tests. Notice that other space-filling curves such as Hilbert and Peano curves are inappropriate for skyline query processing as they lack the monotonic ordering property. These curves do not

4'/' and '%' are the integer division and modulus operators, respectively.
5We may use Z-address and coordinate interchangeably when the context is clear.

always start at the origin of a space (or a subspace), which implies that dominating points may be placed after their dominated points, so candidate reexamination would be required.

5.3.2 ZBtree Index Structure

The design goals of an index to support skyline query processing and updates are (i) to facilitate data processing along the Z-address sequence and (ii) to preserve data points in regions to enable efficient search space pruning. While Z-addresses are one-dimensional, a straightforward approach is to combine the Z-order curve and the B+-tree. Thus, we propose the ZBtree, a variant of the B+-tree, to organize data points in accordance with monotonic Z-addresses. The ZBtree indexes disjoint Z-order curve segments (i.e., sequences of data points on the curve), each corresponding to a region, to preserve the clustering property.

Figure 5.6. An RZ-region and a ZBtree: (a) an RZ-region enclosing p8 and p9, (b) a ZBtree whose root entries [p1, p4] and [p5, p9] cover the leaf-level intervals [p1, p1], [p2, p4], [p5, p7] and [p8, p9]

In detail, leaf nodes in a ZBtree maintain the IDs and the Z-addresses of data points and non-leaf nodes maintain the pointers to their children and the child Z-address intervals (denoted by [α, β]), each representing a curve segment covering data points with Z-addresses bounded in between α and β. The space spanned by a Z-order curve segment is called a Z-region.

Figure 5.6(a) illustrates a Z-region spanned by a curve starting at point p8 and ending at point

p9. Since a Z-region can be of any size and in any shape, we bound a Z-region with an RZ- region, as defined in Definition 5.9.

Definition 5.9. RZ-region. An RZ-region is the smallest square region covering a Z-region bounded by [α, β]. The RZ-region is specified by two Z-addresses, minpt and maxpt, so [α, β] ⊆ [minpt, maxpt]. Both minpt and maxpt share an identical prefix of d-bit groups.

RZ-regions can be straightforwardly derived from Z-regions (i.e., [α, β]). First, the common prefix of α and β is determined. With the common prefix, minpt is formed by appending all 0's to the common prefix and maxpt is formed by setting all the remaining bits to 1's. Figure 5.6(a) shows an RZ-region derived from a Z-region that is covered by a Z-order curve segment between points p8 and p9. While their formation is straightforward, RZ-regions provide the following bounding property, which can considerably facilitate dominance tests between RZ-regions.

Property 5.5. RZ-region bounds. Within an RZ-region, all data points (except those at minpt) should be dominated by any data point at minpt; and data points located at maxpt should be dominated by other data points in the region.
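The derivation just described amounts to keeping the common prefix of α and β and padding with 0's (for minpt) and 1's (for maxpt). The sketch below illustrates this; rounding the prefix length down to whole d-bit groups is our assumption matching the "identical prefix d-bit group" requirement of Definition 5.9, and the routine is illustrative rather than the exact ZBtree code.

```cpp
#include <cstdint>
#include <utility>

// Derive the RZ-region [minpt, maxpt] bounding a Z-region [alpha, beta] of
// 'bits'-bit Z-addresses in a d-dimensional space (assumes bits < 64).
std::pair<uint64_t, uint64_t> rzRegion(uint64_t alpha, uint64_t beta,
                                       int bits, int d) {
    int prefix = bits;                                   // longest common prefix length
    while (prefix > 0 && (alpha >> (bits - prefix)) != (beta >> (bits - prefix)))
        --prefix;
    prefix -= prefix % d;                                // align to whole d-bit groups
    uint64_t suffixMask = (prefix == bits)
        ? 0 : ((uint64_t{1} << (bits - prefix)) - 1);
    uint64_t minpt = alpha & ~suffixMask;                // common prefix, rest 0's
    uint64_t maxpt = minpt | suffixMask;                 // common prefix, rest 1's
    return {minpt, maxpt};
}
```

One immediate use of minpt, following Property 5.5, is pruning: if some skyline candidate dominates the point decoded from minpt, every data point inside the RZ-region is dominated as well, so the corresponding node can be skipped without exploring it.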

Other works such as [103, 112], designed to support multidimensional range searches, also exploit the idea of indexing data points on a Z-order curve with a B+-tree. In order to reduce false hits for range searches, their optimization goal is to index Z-order curve segments, each spanning the smallest area, which is referred to as a Z-region in our discussion. For instance, data points p7, p8 and p9 (see Figure 5.5(a)) are very likely to be grouped into one node. Without a need to deal with dominance tests, those existing works may even group incomparable data points in a node. In fact, data points that form a small RZ-region should be grouped in a node so as to facilitate dominance tests. Therefore, the existing approaches cannot be used to support skyline searches efficiently. As we can see, if p8 and p9 form a node, this node can be skipped from detailed examination once p1 is identified. Thus, different from existing works, our approach is aware of dominance relationships while grouping data points in a ZBtree.

5.3.3 ZBtree Index Manipulation

Certainly, the search efficiency improvement depends on how data points are organized to form RZ-regions. There are two main index construction objectives. The first objective is to optimize the storage. We can achieve this by packing as many data points (or nodes) as possible into leaf nodes (or non-leaf nodes, respectively) of a ZBtree such that the size of the ZBtree is minimized. In case the entire index needs to be traversed, the access overhead is then reduced. The second objective is to optimize the retrieval performance. We can strategically place data points that form small RZ-regions in the same nodes, such that search space pruning and dominance tests can be effectively facilitated. As shown in Figure 5.5(b), a node that contains

data points p8 and p9 can be discarded once we identify the data point p1, and a node that contains data points p2, p3 and p4 does not need to be examined against another node that contains data points p5, p6 and p7. Consider that each node can contain one to three entries. Then p1 can be put into the first leaf node; next, p2, p3 and p4 are inserted into the second leaf node; similarly, p5, p6 and p7 occupy the third leaf node; finally, p8 and p9 are put into the last leaf node. This index structure is depicted in Figure 5.6(b). While it requires some extra storage, this region-aware grouping improves the search performance, as some unnecessary node traversals and comparisons between incomparable nodes are avoided. Based on the same principle, we group leaf nodes into appropriate non-leaf nodes and recursively propagate the process upwards until the root of the index is formed. In the following, we assume that each node can contain at most N entries and that the minimum node utilization threshold is M (where M ≤ N/2), and we describe ZBtree manipulations such as insertion, deletion and bulkloading.

ZBtree Insertion. Insertion places a data point with Z-address zins into a leaf node of a ZBtree. The search for a target leaf node to accommodate the new data point is a depth-first search with zins as the search key. Along the traversal, the branch whose interval [α, β] covers zins is explored. In case no branch with an interval covering the data point is identified, the branch with the smallest resulting RZ-region is chosen.6 After an insertion, a leaf node filled with more than N entries overflows and needs to be split into two new nodes. These two nodes are formed to cover two disjoint Z-order curve segments such that the total size of their corresponding RZ-regions is the smallest among all possible splits. After an insertion, the interval of the affected leaf node is updated to accommodate the newly inserted data point. This interval update is propagated up from the leaf node to its parent and ancestor nodes.

ZBtree Deletion. Deletion removes a data point with Z-address zdel from a ZBtree. First, the leaf node that contains the data point is found by a depth-first search using zdel as the search key. The data point is then removed from the node. In case the leaf node, after deletion, contains fewer than M entries, it is removed and all its enclosed data points are re-inserted into the ZBtree based on the above-described insertion procedure. Similarly, the intervals of the nodes along the path from the affected leaf node to the root are updated to reflect the deletion.

ZBtree Bulkloading. Bulkloading builds a ZBtree in a bottom-up fashion. It is mainly used to prepare a source dataset for skyline queries. It has two steps. The first step sorts all data

6 A resulting RZ-region here refers to the extended RZ-region after inserting the new data point.

points based on their Z-addresses. The second step scans these data points (or nodes) in ascending Z-address order with a sliding window of N slots to form leaf nodes (or non-leaf nodes, respectively). Since data points in a high-dimensional space are very often sparsely distributed, it is very unlikely that small RZ-regions can be formed by including many points in a node; such small RZ-regions can only be formed at the cost of space utilization. Also, as the number of skyline points is expected to be large, many RZ-regions may not be dominated and thus the entire dataset needs to be examined. While various bulkloading strategies were examined and no significant difference was observed in terms of the overall processing time [89], the compact bulkloading strategy is adopted in this work to maximize node utilization and minimize the access cost. It sorts all data points in ascending order of their Z-addresses and forms leaf nodes from every N consecutive data points. Similarly, it puts every N leaf nodes together to form non-leaf nodes, and so on, until the root of the ZBtree is formed.
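A minimal sketch of the compact bulkloading strategy is given below (Python; the dictionary node layout and the z_address argument are illustrative assumptions, not the thesis's data structures). It sorts the points by Z-address, packs every N consecutive entries into a node, and repeats the packing level by level until a single root remains.

def compact_bulkload(points, N, z_address):
    """Compact bulkloading: maximize node utilization by packing every N consecutive
    entries (in ascending Z-address order) into a node, level by level."""
    sorted_pts = sorted(points, key=z_address)
    # Leaf level: every N consecutive data points form a leaf node.
    level = [{"leaf": True,
              "interval": (z_address(grp[0]), z_address(grp[-1])),
              "children": grp}
             for grp in (sorted_pts[i:i + N] for i in range(0, len(sorted_pts), N))]
    # Upper levels: every N consecutive nodes form a non-leaf node until one root remains.
    while len(level) > 1:
        level = [{"leaf": False,
                  "interval": (grp[0]["interval"][0], grp[-1]["interval"][1]),
                  "children": grp}
                 for grp in (level[i:i + N] for i in range(0, len(level), N))]
    return level[0]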

5.4 Skyline Search on Z-order Curve

In this section, we discuss ZSearch, an efficient skyline search algorithm. Its efficiency is attributed to RZ-region based dominance tests and search space pruning.

5.4.1 RZ-Region Based Dominance Test

Dominance tests are a key determinant of the computational overhead in skyline query processing. To avoid comparing data points in a time-consuming pairwise fashion, we introduce block-based dominance tests, which are based on RZ-regions derived while traversing a ZBtree. In the Z-SKY framework, we maintain a source dataset SRC and a set of skyline candidates SL, both indexed by ZBtrees. These two indexes present clusters of data points as RZ-regions. The dominance relationships between any two RZ-regions are defined in Lemma 5.1. When it causes no confusion, we use RZ-region R to denote all the data points within it, and use minpt(R) and maxpt(R) to represent the minpt and maxpt of R, respectively.

Lemma 5.1. Given two RZ-regions R and R′, the following four cases hold:

1. If maxpt(R) ≺ minpt(R′), then R ≺ R′, i.e., any data point in R for sure dominates all the data points in R′.

2. If maxpt(R) ⊀ minpt(R′) ∧ minpt(R) ≺ minpt(R′), then some data points in R may dominate all the data points in R′.

3. If minpt(R) ⊀ minpt(R′) ∧ minpt(R) ≺ maxpt(R′), then it is possible that some data points in R dominate some data points in R′.

4. If minpt(R) ⊀ maxpt(R′), then R ⊀ R′, i.e., no data point in R can dominate any data point in R′.

Proof. We prove the lemma case by case.
Case 1. maxpt(R) ≺ minpt(R′). By the transitivity property of dominance and the RZ-region bounds (Property 5.5), ∀p ∈ R − {maxpt(R)}, ∀p′ ∈ R′ − {minpt(R′)}: p ≺ maxpt(R) ∧ minpt(R′) ≺ p′; combined with the stated condition maxpt(R) ≺ minpt(R′), this gives p ≺ p′. Hence, R ≺ R′. Rs and R1 in Figure 5.7(a) form an example.
Case 2. maxpt(R) ⊀ minpt(R′) ∧ minpt(R) ≺ minpt(R′). Some data points in R′ (e.g., minpt(R′)) are dominated by minpt(R). Thus, the case holds. Rs and R2 in Figure 5.7(a) form an example.
Case 3. minpt(R) ⊀ minpt(R′) ∧ minpt(R) ≺ maxpt(R′). As some data points in R′ (e.g., maxpt(R′)) are dominated by minpt(R), the case holds. Rs and R3 in Figure 5.7(a) form an example.
Case 4. minpt(R) ⊀ maxpt(R′). This case can be proved by contradiction. Assume p ≺ p′ (where p ∈ R, p′ ∈ R′). By transitivity, minpt(R) ≺ p ≺ p′ ≺ maxpt(R′) ⇒ minpt(R) ≺ maxpt(R′). This contradicts the condition of the case. Rs and R4 in Figure 5.7(a) form an example.

Figure 5.7. Examples of dominance tests based on RZ-regions: (a) two RZ-regions; (b) RZ-regions vs a data point

This dominance relationship is generic enough to be applied to an RZ-region that contains a single data point, i.e., minpt(R) = maxpt(R) = p, and/or minpt(R′) = maxpt(R′) = p′.

Figure 5.7(b) shows the comparisons of R1, R2, R3 against p. Further, the dominance relation- ship between two data points (as stated in Definition 5.1 in Section 5.1) becomes a special case of this lemma.
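The four cases of Lemma 5.1 can be checked with nothing more than point-to-point dominance tests on the region corners. The sketch below classifies the relationship between two RZ-regions given as (minpt, maxpt) pairs; the tuple-based point representation and the smaller-is-better convention are assumptions consistent with the running example, not code from the thesis.

def dominates(p, q):
    """p dominates q: p is no worse in every dimension and strictly better in at least one
    (smaller attribute values are assumed to be preferable)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def rz_relationship(R, R2):
    """Classify two RZ-regions R and R' (each a (minpt, maxpt) pair) per Lemma 5.1."""
    if dominates(R[1], R2[0]):        # maxpt(R) dominates minpt(R')
        return "case 1: every point of R dominates every point of R'"
    if dominates(R[0], R2[0]):        # only minpt(R) dominates minpt(R')
        return "case 2: some points of R may dominate all points of R'"
    if dominates(R[0], R2[1]):        # minpt(R) dominates only maxpt(R')
        return "case 3: some points of R may dominate some points of R'"
    return "case 4: no point of R dominates any point of R'"

# Example with hypothetical 2D points:
print(rz_relationship(((1, 1), (2, 2)), ((5, 5), (7, 7))))   # case 1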

Function Dominate(SL, minpt, maxpt)
Input. a ZBtree indexing skyline points, SL; the endpoints of an RZ-region, minpt and maxpt;
Local. Queue, q;
Output. TRUE if input is dominated, else FALSE;
Begin
1.  q.enqueue(SL's root);
2.  while q is not empty do
3.    var n: Node;
4.    q.dequeue(n);
5.    if n is a non-leaf node then
6.      forall child node c of n do
7.        if c's RZ-region's maxpt ≺ minpt then
8.          output TRUE;            /* Case 1 of Lemma 5.1 */
9.        else if (c's RZ-region's minpt ≺ minpt) then
10.         q.enqueue(c);           /* Case 2 of Lemma 5.1 */
11.   else /* leaf node */
12.     forall child point c of n do
13.       if c ≺ minpt then
14.         output TRUE;
15. output FALSE;
End.

Figure 5.8. The pseudo-code of the Dominate function

Based on Lemma 5.1, we can perform effective dominance tests on the RZ-region of any node from SRC against those from SL. Figure 5.8 outlines the Dominate function for the dominance test. It checks whether a given RZ-region R from SRC (represented by its minpt and maxpt)7 contains any potential skyline points by conducting dominance tests against all the identified skyline points in SL. The function traverses SL in a breadth-first manner such that the RZ-regions of upper-level nodes from SL are compared against R first, and drills down if R needs further examination against finer RZ-regions in SL.

7 In case of a data point p, R contains only p and minpt(R) = maxpt(R) = p.

In the function, a queue is initialized with the root of SL. An entry n popped from the queue will be further explored in two situations:

• n is a non-leaf node, and all its child entries c are examined. If any point inside the RZ-region of c dominates R (i.e., Case 1 of Lemma 5.1), the function indicates that R is dominated and the dominance test is completed (Line 7-8). If some points inside the RZ-region of c may dominate R (i.e., Case 2 of Lemma 5.1), the entry c is enqueued for further examination (Line 9-10). Tests for Case 3 and Case 4 of Lemma 5.1 are omitted since they are implied by the failure of the above two conditions: in these cases the input RZ-region cannot be completely dominated by c, or the two RZ-regions are incomparable.

• n is a leaf node. We check R against individual skyline points inside n (Line 13-14).

This traversal continues until the queue is empty (i.e., all comparable nodes in SL are visited). Finally, R is reported to be not dominated. Notice that the function may stop before the leaf level is reached. As a result, only a few nodes in SL need to be accessed, and thus an efficient block-based dominance test is provided.
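A minimal Python rendering of this test follows; the dictionary node layout ('leaf', 'children', and 'minpt'/'maxpt' per index entry) is an assumed simplification of the ZBtree, and only the examinee's minpt is needed, exactly as in the pseudo-code of Figure 5.8.

from collections import deque

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def dominate(sl_root, minpt):
    """Return True if any skyline point indexed under sl_root dominates the examinee
    RZ-region, represented by its minpt (cf. Figure 5.8)."""
    if sl_root is None:                          # no skyline candidates collected yet
        return False
    q = deque([sl_root])
    while q:
        n = q.popleft()
        if not n["leaf"]:
            for c in n["children"]:
                if dominates(c["maxpt"], minpt): # Case 1: every point under c dominates R
                    return True
                if dominates(c["minpt"], minpt): # Case 2: c may contain dominating points
                    q.append(c)
        else:
            for p in n["children"]:              # leaf: compare individual skyline points
                if dominates(p, minpt):
                    return True
    return False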

5.4.2 The ZSearch Algorithm

The ZSearch algorithm traverses SRC to examine RZ-regions and data points in a depth-first order, which exactly follows the order of Z-addresses, with a stack keeping track of unexplored paths. The stack memory consumption is bounded by a factor of the tree height of SRC. Figure 5.9 depicts the pseudo-code of the ZSearch algorithm. It fetches the index nodes and/or data points from SRC (Line 2-13). At each round, the RZ-region of a node is examined against SL with the Dominate function (see Figure 5.8). If the corresponding RZ-region is not dominated, the node is further explored (Line 7-13). Data points not dominated by any skyline candidate in SL are collected as new skyline candidates and appended to SL (Line 13). The mechanism of appending skyline candidates to SL is based on the ZBtree insertion (discussed in Section 5.3.3). Since newly admitted skyline candidates are always added to the most recently created leaf node (their Z-addresses are greater than those of the previously collected candidates), we maintain a pointer to that leaf node to save traversals in SL. The algorithm terminates when all entries from SRC have been examined (i.e., signaled by an empty stack). The organization of skyline candidates in SL enables the formation of RZ-regions to facilitate dominance tests.

Algorithm ZSearch(SRC)
Input. a ZBtree for source data set, SRC;
Local. a stack, s;
Output. a ZBtree for skyline candidates, SL;
Begin
1.  s.push(SRC's root);
2.  while s is not empty do
3.    var n: Node;
4.    s.pop(n);
5.    if Dominate(SL, n's minpt, n's maxpt) then
6.      goto line 3.
7.    if n is a non-leaf node then
8.      forall child node c of n do
9.        s.push(c);
10.   else /* leaf node */
11.     forall child point c of n do
12.       if not Dominate(SL, c, c) then
13.         SL.insert(c);
14. output SL;
End.

Figure 5.9. The pseudo-code of the ZSearch algorithm

SL may be too large to fit in main memory. In that case, it can be stored on disk and the available main memory used as a cache. Our approach remains very appropriate in this setting since (1) the cached upper (and usually small) portion of SL can be sufficient to perform dominance tests, as the leaf levels may not always be reached, and (2) clustered data points in SRC exhibit good access locality over the related branches of SL. Finally, we use Example 5.1 to illustrate how the ZSearch algorithm runs.

Example 5.1. We use the example data points in Figure 5.5(a), indexed by the corresponding ZBtree in Figure 5.6(b), to illustrate the ZSearch algorithm. The trace is depicted in Figure 5.10. Initially, the root entries, i.e., [p1, p4] and [p5, p9], are pushed onto the stack and SL is empty. For presentational simplicity, only leaf nodes, enclosed by {}, are shown. First, we obtain p1, the first skyline point. Next, p2, the second accessed data point, not dominated by p1, is inserted into SL. Notice that BBS and SFS access p3 after p1 but before p2; thus, the delivery order of a skyline result by the ZSearch algorithm is not necessarily the same as that of BBS and SFS. Assume that the node capacity of SL is 2. Insertion of p3 triggers a node split.

Stack                                Skyline points SL
[p1, p4], [p5, p9]                   ∅
[p1, p1], [p2, p4], [p5, p9]         ∅
[p2, p4], [p5, p9]                   {p1}
[p5, p9]                             {p1}, {p2, p3}
[p5, p7], [p8, p9]                   {p1}, {p2, p3}
[p8, p9]                             {p1}, {p2, p3}, {p5, p6}

Figure 5.10. The trace of the ZSearch algorithm

p2 and p3 are put together since {p1} and {p2, p3} form smaller RZ-regions. p4, dominated by p2 (or p3), is discarded. Later, [p5, p9] is explored. As the RZ-region of [p5, p7] is incomparable to {p2, p3}, the comparison with p2 or p3 is saved. Next, the explored p5 and p6 are inserted into SL but p7 is dropped. Finally, [p8, p9], dominated by p1 in SL, is skipped and the search completes.
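For reference, a compact Python sketch of the ZSearch traversal is given below. It reuses the same dominance test idea but, for brevity, keeps the skyline candidates in a flat list rather than a ZBtree, which gives up the block-based dominance test on SL; every node, leaf or non-leaf, is assumed to carry the minpt of its RZ-region. This is an illustration under those assumptions, not the thesis implementation.

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def zsearch(src_root):
    """Depth-first ZSearch sketch (cf. Figure 5.9): nodes are visited in ascending
    Z-address order and dominated RZ-regions are pruned without being expanded."""
    stack, skyline = [src_root], []
    while stack:
        n = stack.pop()
        # Prune the whole branch if the node's RZ-region is dominated (via its minpt).
        if any(dominates(s, n["minpt"]) for s in skyline):
            continue
        if not n["leaf"]:
            # Push children in reverse so they are popped in ascending Z-address order.
            stack.extend(reversed(n["children"]))
        else:
            for p in n["children"]:
                if not any(dominates(s, p) for s in skyline):
                    skyline.append(p)
    return skyline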

5.5 Skyline Query Variants

In this section, we discuss the evaluation of different skyline query variants. Although these skyline query variants are defined differently, the strategies that efficiently answer them can be very similar. For skyband and top-ranked skyline queries, counting of dominating and dominated data points is needed in addition to dominance tests. Thus, we extend the Dominate function and put some additional information in ZBtrees to facilitate the counting. On the other hand, for k-dominant skyline and subspace skyline queries, it is no longer guaranteed that a data point cannot be dominated by points accessed after it in Z-address order. Thus, we adopt a filter-and-reexamine approach in which candidates, which may contain false hits, are first retrieved and then reexamined to discard those false hits. Based on the idea of Z-order curves and the above techniques, we derive and discuss the algorithms ZBand, ZRank, k-ZSearch and ZSubspace to evaluate skyband queries, top-ranked skyline queries, k-dominant skyline queries and subspace skyline queries, respectively, in the following.

5.5.1 Extended ZBtrees

For some skyline query variants, like the skyband and top-ranked skyline queries discussed in this section, auxiliary information such as the number of dominating points or the number of dominated points, provided in the nodes of a ZBtree, can speed up the search. With a plain ZBtree that does not carry such auxiliary information, the search has to traverse the

index down to some leaf nodes to determine the exact number of data points. In contrast, if auxiliary information such as the count of indexed data points is provided in high-level nodes, the counting of individual data points can be completed at those nodes before the leaf level of the index is accessed. In the following, we call ZBtrees augmented with auxiliary information extended ZBtrees.

Figure 5.11. Example extended ZBtrees: (a) incorporated with cnts; (b) incorporated with doms

Subject to query needs, we can associate different types of auxiliary information with the nodes of extended ZBtrees. The first type of auxiliary information we consider is the number of indexed data points of each node, labeled as cnt. Figure 5.11(a) depicts an example extended ZBtree that associates a cnt with each node for the dataset depicted in Figure 5.5. These cnts are useful to both skyband queries and top-ranked skyline queries. For example, let n be the node entry [p8, p9]. A cnt (= 2) is associated with n, indicating that two data points are indexed by the pointed node. As a result, if n is dominated by a data point p, it is certain, without a detailed examination, that p dominates the two data points indexed by n. The second type of auxiliary information is the number of data points that are dominated by each associated node, labeled as dom, which is useful for top-ranked skyline queries. Because of the clustering property, data points dominated by a node n in SL are for sure dominated by n's descendants. Therefore, the dom associated with n's child entries does not count those points already dominated by n, and the dominating power, i.e., the number of dominated data points, of a candidate skyline point can be derived by summing up the doms of its enclosing node, its parent node, and all its ancestors. An example extended ZBtree associated with doms is depicted in Figure 5.11(b). The dom associated with the node entry [p1, p1] is three. The dominating power of the candidate point p1 is three, which is the summation of the dom associated with p1 (i.e., 0) and the dom associated with the node [p1, p1] (i.e., 3).
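One straightforward way to obtain the cnt annotations is a single bottom-up pass over an already built ZBtree, as sketched below (Python, same assumed dictionary node layout as in the earlier sketches; this is an illustration, not the thesis implementation). The dom counters on a dom-extended tree would simply be initialized to zero when candidates are inserted and filled in during query processing.

def annotate_cnt(node):
    """Attach cnt (the number of data points indexed beneath the node) to every node,
    bottom-up, and return it so that parents can aggregate their children's counts."""
    if node["leaf"]:
        node["cnt"] = len(node["children"])
    else:
        node["cnt"] = sum(annotate_cnt(c) for c in node["children"])
    return node["cnt"]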

5.5.2 The ZBand Algorithm

A skyband query retrieves those data points that are not dominated by b or more other data points in a dataset. Due to the transitivity property, any given data point appears after all its dominating data points on a Z-order curve. We can therefore naturally extend the dominance test for skyband queries to check whether an examinee is dominated by b or more collected skyband candidates.

Function BandDominate(cnt-SL, minpt, maxpt, b)
Input. cnt-SL: an extended ZBtree indexing skyband candidates; minpt and maxpt: the endpoints of an RZ-region; b: the threshold skyband width;
Local. q: Queue; cnt: a counter of dominating points;
Output. TRUE if input is dominated, else FALSE;
Begin
1.  q.enqueue(cnt-SL's root); cnt ← 0;
2.  while q is not empty do
3.    var n: Node;
4.    q.dequeue(n);
5.    if n is a non-leaf node then
6.      forall child node c of n do
7.        if c's RZ-region's maxpt ≺ minpt then
8.          cnt ← cnt + c's cnt;    /* Case 1 of Lemma 5.1 */
9.        else if c's RZ-region's minpt ≺ minpt then
10.         q.enqueue(c);           /* Case 2 of Lemma 5.1 */
11.   else /* leaf node */
12.     forall child point p of n do
13.       if p ≺ minpt then
14.         cnt ← cnt + 1;
15.   if cnt ≥ b then output TRUE;
16. output FALSE;
End.

Figure 5.12. The pseudo-code of the BandDominate function

However, the number of skyband candidates increases as b grows. Thus, dominance tests between each examinee data point and all the candidate points may incur a long processing time. To improve the search performance, we derive the BandDominate function as outlined

in Figure 5.12. While its logic is very similar to the Dominate function (in Figure 5.8), the BandDominate function uses cnt-SL to facilitate the counting of dominating points and reports the examinee as dominated if the number of dominating points is found to be greater than or equal to b. As outlined in Figure 5.12, the BandDominate function examines an RZ-region R from SRC against the skyband candidates indexed by cnt-SL. A queue initialized with the root node of cnt-SL is used to guide the breadth-first traversal of cnt-SL. In addition, a counter cnt with initial value 0 is used to record the number of data points found so far that dominate R (line 1). Thereafter, the search proceeds by iteratively fetching the head entry n from the queue (line 2-15). If n is a non-leaf node, we check each of its child entries c. If the RZ-region of c dominates minpt (and hence the entire R), we increment cnt by c's cnt (line 7-8). Or, if c may contain some data points dominating R, it is enqueued for further exploration (line 9-10). Otherwise, n is a leaf node and all the enclosed data points are compared against R; each detected dominating data point increases cnt by one (line 12-14). At the end of each iteration, we check the early termination condition, i.e., whether cnt reaches b, which indicates that R is for sure dominated by b or more data points (line 15), and if so the function terminates. With cnts present in the intermediate nodes, cnt-SL need not be traversed to the bottom and hence the performance is improved. Figure 5.13 sketches the ZBand algorithm. Like the ZSearch algorithm, it traverses SRC in depth-first order, but it uses cnt-SL to maintain the skyband candidates and the BandDominate function to perform dominance tests, in place of SL and the Dominate function, respectively. The logic is otherwise very similar to the ZSearch algorithm. Finally, we use Example 5.2 to illustrate the operation of the ZBand algorithm.

Example 5.2. This example illustrates how the ZBand algorithm evaluates a skyband query with b = 2 on our example dataset. Figure 5.14 outlines the trace. We present a node in cnt-SL as {p}(cnt), which means the node holds a data point p and is associated with a count cnt of enclosed data points. Initially, the stack is initialized with the root node of SRC, i.e., the branches [p1, p4] and [p5, p9]. Then, the first leaf node [p1, p1] is explored and the first data point, p1, is pushed into cnt-SL. Next, the second leaf node [p2, p4] is explored. The first two data points p2 and p3 are examined; they are not dominated and are enrolled into cnt-SL. The third one, p4, is discarded, as it is dominated by both the node {p1}(1) and p3 in the node {p2, p3}(2). Next, the leaf node [p5, p7] is explored. Since none of the enclosed data points is dominated by two or more other data points, they are all appended to cnt-SL. Last, the leaf node [p8, p9] is examined

Algorithm ZBand(SRC, b)
Input. SRC: a ZBtree for source data set; b: the skyband width;
Local. s: Stack;
Output. cnt-SL: an extended ZBtree for skyband candidates;
Begin
1.  s.push(SRC's root);
2.  while s is not empty do
3.    var n: Node;
4.    s.pop(n);
5.    if BandDominate(cnt-SL, n's minpt, n's maxpt, b) then
6.      goto line 3.
7.    if n is a non-leaf node then
8.      forall child node c of n do
9.        s.push(c);
10.   else /* leaf node */
11.     forall child point c of n do
12.       if not BandDominate(cnt-SL, c, c, b) then
13.         cnt-SL.insert(c);
14. output cnt-SL;
End.

Figure 5.13. The pseudo-code of the ZBand algorithm

Stack                                Skyband candidates cnt-SL
[p1, p4], [p5, p9]                   ∅
[p1, p1], [p2, p4], [p5, p9]         ∅
[p2, p4], [p5, p9]                   {p1}(1)
[p5, p9]                             {p1}(1), {p2, p3}(2)
[p5, p7], [p8, p9]                   {p1}(1), {p2, p3}(2)
[p8, p9]                             {p1}(1), {p2, p3}(2), {p5, p6, p7}(3)

Figure 5.14. The trace of the ZBand algorithm

against cnt-SL and discarded since it is dominated by p1 and p3. The stack becomes empty. The search terminates and the skyband query result is {p1, p2, p3, p5, p6, p7}.
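To make the role of the cnt annotations in this example concrete, a minimal Python sketch of the counting-based dominance test used by ZBand follows (assumed dictionary node layout with 'cnt' on index entries; this is an illustration, not the thesis implementation).

from collections import deque

def dominates(p, q):
    return all(x <= y for x, y in zip(p, q)) and any(x < y for x, y in zip(p, q))

def band_dominate(cnt_sl_root, minpt, b):
    """Return True once at least b data points indexed under cnt_sl_root are known to
    dominate the examinee's minpt (cf. Figure 5.12). A branch whose maxpt dominates the
    examinee contributes its cnt without being descended."""
    if cnt_sl_root is None:
        return False
    q, cnt = deque([cnt_sl_root]), 0
    while q:
        n = q.popleft()
        if not n["leaf"]:
            for c in n["children"]:
                if dominates(c["maxpt"], minpt):      # Case 1: the whole block dominates
                    cnt += c["cnt"]
                elif dominates(c["minpt"], minpt):    # Case 2: may contain dominators
                    q.append(c)
        else:
            for p in n["children"]:
                if dominates(p, minpt):
                    cnt += 1
        if cnt >= b:                                  # early termination
            return True
    return False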

Other approaches such as SFS and BBS (i.e., two of the representative approaches reviewed in Section 5.2) can be intuitively extended to answer skyband queries. However, they are not efficient. This is because SFS does not cluster data points and is forced to compare data points individually, while BBS, which keeps dominance regions in a main-memory R-tree, cannot efficiently determine the number of data points that dominate a given index node and/or data point. Because of the nature of the R-tree, a data point covered by the MBR associated with an internal node might not be covered by some of its enclosed skyline candidates' dominance regions (see Section 5.2). Hence, maintaining counts of underlying dominance regions in the intermediate nodes of the main-memory R-tree cannot provide the same effect as cnt-SL, and therefore all traversals have to reach the leaf levels in order to determine the number of dominating data points.

5.5.3 The ZRank Algorithm

In this subsection, we present the ZRank algorithm that efficiently answers top-ranked skyline queries. Its efficiency is attributed to (i) an efficient dominance test mechanism integrated with the counting of dominated data points, (ii) cnt-SRC, which indexes the source dataset and provides cnts of data points associated with index nodes, and (iii) dom-SL, which indexes the skyline candidates plus doms, associated with individual index nodes and candidates, indicating the number of data points they dominate. Both cnt-SRC and dom-SL are extended ZBtrees. As briefly described, the ZRank algorithm operates in two phases, i.e., a counting phase and a ranking phase. The counting phase searches for skyline candidates and counts the number of data points each of them dominates. The ranking phase orders the skyline candidates according to their dominating power and returns the t candidates with the largest dominating power as the result (where t is specified by the user). We detail these two phases below. The Counting Phase. For conventional skyline queries, dominance tests simply determine whether a data point is dominated. Once a node is dominated, the whole branch rooted at the node in SRC is skipped from further examination. However, for top-ranked skyline queries, skipping the detailed examination of some dominated nodes would result in incorrect counts. Figure 5.15(a) illustrates a case in which skipping an RZ-region leads to incorrect counting. Recall that the skyline points are {p1, p2, p3, p5, p6} in our running example and an

RZ-region R that contains p8 and p9 is completely dominated by p1. However, p9 (not p8) in R is dominated by p2 and p5, and so it should contribute to the dominating power of both p2 and p5. If R is not further explored once it is found to be dominated by p1, the counts of p2 and p5 would miss p9, i.e., p2 and p5 would receive incorrect counts. As such, nodes cannot simply be discarded, as they may enclose data points dominated by other skyline candidates. We derive the DominateAndCount function that incorporates the counting of dominated points for skyline points into the dominance tests. The basic idea of the DominateAndCount function is that when comparing an examinee RZ-region that bounds one or multiple data points against

Figure 5.15. An example for top-ranked skyline queries: (a) dominance and counting; (b) counting based on BBS

existing skyline points in dom-SL, the counting is performed simultaneously. To facilitate the counting, each node n in dom-SL is associated with a counter dom that records the number of data points dominated by n. Whenever n is found to dominate some x points, its dom (initialized to 0) is incremented by x. We assume an extended ZBtree in which the number of data points cnt beneath every node n is available at n. Thus, in the function, an examinee RZ-region R from cnt-SRC is compared with an RZ-region R′ from dom-SL. There are three possible cases. First, R is dominated by R′ and hence R's cnt contributes to R′'s dom. Second, R may be dominated by some, but not all, data points of R′, and a further examination of the examinee is triggered. Third, R is not dominated by R′ and the examination against R′ is finished. Since a node n from cnt-SRC cannot be discarded even if it is dominated by some skyline candidates, the child entries of n may be further investigated, which triggers the traversal of the same nodes in dom-SL again. This raises a duplicate counting problem. To tackle this, we include the parent RZ-region of R in the DominateAndCount function to provide an additional check. Thus, if the parent RZ-region is already dominated by R′ (which implies that R's cnt has already been included in R′'s dom), the further investigation of the examinee can be skipped. The pseudo-code of the DominateAndCount function is listed in Figure 5.16. It takes four parameters: (i) dom-SL, (ii) an examinee RZ-region expressed as (minpt, maxpt) from cnt-SRC, (iii) the number of data points inside the examinee (cnt), and (iv) the examinee's parent RZ-region. Since the parent RZ-region is only used to check whether it is completely dominated by existing skyline candidates, we only need its minpt (i.e., pminpt in

the algorithm). In the function, we maintain a queue q to perform breadth-first traversal in dom-SL and a flag needToExplore to indicate the need to explore the examinee. For every dequeued entry, we check its child c against pminpt to avoid redundant counting (line 7 and line 15). Then we increment dom by cnt if the examinee is completely dominated (line 8-10 and line 16-18). Otherwise, we further explore the child node c (line 11-12) or indicate the need to explore the examinee if the leaf level is reached and a part of examinee node is found to be dominated (line 19-20). For brevity, we omit the two cases in which the examinee is not dominated (as implied by previous two conditions) as stated in Lemma 5.1. Meanwhile, we maintain a flag dominated to record if an examinee is dominated or not. Finally needToExplore and dominated are returned. The counting phase utilizes the monotonic ordering property of Z-order curves by which the accuracy of dom is guaranteed as all the dominating data points are accessed before an examinee. Therefore, examinees only contribute their counts to those skyline candidates re- trieved earlier than them, but definitely not those accessed later. However, other approaches such as BBS cannot achieve this. Accessed in accordance with the shortest distances to the

origin, the MBB B is examined before p2 as shown in Figure 5.15(b). In this case, BBS does

not anticipate some data points inside B (e.g., p9) to be dominated by some skyline candidates

(e.g., p2) retrieved later. To remedy this problem, BBS has to explore all the index nodes, or to separate the retrieval of skyline candidates from the counting. Obviously, both approaches are inefficient: the former degenerates BBS into a scan-based approach and forces it to examine all individual data points, while the latter scans the index twice. Besides, SFS suffers from exhaustive point-to-point comparisons.

The Ranking Phase. To rank skyline candidates based on the number of data points they dominate, the Rank function consolidates the skyline candidates' doms by aggregating the doms of nodes down to the data points at the bottom of dom-SL. Figure 5.17 depicts the pseudo-code of the Rank function. It traverses dom-SL in a depth-first manner (line 2-11) and sorts all the skyline candidates (line 12). Finally, the first t skyline points, i.e., those with the largest doms in the sorted list, are returned.

Putting All Together. Our ZRank algorithm (as outlined in Figure 5.18) is devised based on both the DominateAndCount and Rank functions. It iteratively examines nodes and data points in non-decreasing order of Z-addresses, using a stack to keep track of unexamined entries (line 2-14). For non-leaf nodes, it examines each child node c with the DominateAndCount function; if it needs to be explored, c is pushed onto the stack (line 11-14). For leaf nodes, all child points are

Function DominateAndCount(dom-SL, minpt, maxpt, cnt, pminpt)
Input. a ZBtree that indexes skyline points, dom-SL; the bounds of an RZ-region, minpt and maxpt; the count of data points inside the RZ-region, cnt; the lower endpoint of the parent RZ-region, pminpt;
Local. q: Queue;
Output. needToExplore: a flag (default: FALSE); dominated: a flag (default: FALSE);
Begin
1.  q.enqueue(dom-SL's root);
2.  while q is not empty do
3.    var n: Node;
4.    q.dequeue(n);
5.    if n is a non-leaf node then
6.      forall child node c of n do
7.        if c's RZ-region's maxpt ⊀ pminpt then
8.          if c's RZ-region's maxpt ≺ minpt then
9.            c's dom ← c's dom + cnt;   /* Case 1 of Lemma 5.1 */
10.           dominated ← TRUE;
11.         else if c's RZ-region's minpt ≺ minpt then
12.           q.enqueue(c);              /* Case 2 of Lemma 5.1 */
13.   else /* leaf node */
14.     forall child point c of n do
15.       if c ⊀ pminpt then             /* avoid redundant count */
16.         if c ≺ minpt then
17.           c's dom ← c's dom + cnt;
18.           dominated ← TRUE;
19.         else if c ≺ maxpt then
20.           needToExplore ← TRUE;
21. output needToExplore, dominated;
End.

Figure 5.16. The pseudo-code of the DominateAndCount function

examined against those in dom-SL with the same function. Those data points that are not dominated are inserted into dom-SL. After all entries are examined, the skyline candidates in dom-SL are ranked by the Rank function and the top t ones are returned. Example 5.3 illustrates how the ZRank algorithm is invoked to evaluate a top-ranked skyline query.

Example 5.3. In this example, a top-ranked skyline query with t set to 2 is evaluated on our example dataset. The trace is depicted in Figure 5.19. Here, we denote every cnt-SRC node by

Function Rank(dom-SL, t)
Input. dom-SL: a ZBtree that indexes skyline points; t: the requested number of top-ranked skyline points.
Local. s: stack; l: result list;
Output. t top-ranked skyline points
Begin
1.  s.push(dom-SL's root);
2.  while s is not empty do
3.    var n: Node;
4.    s.pop(n);   /* depth-first traversal */
5.    if n is a non-leaf node then
6.      forall child node c of n do
7.        c's dom ← c's dom + n's dom; s.push(c);
8.    else
9.      forall child point c of n do
10.       c's dom ← c's dom + n's dom;
11.       insert c into l;
12. sort l in non-ascending order of dom;
13. output first t points from l;
End.

Figure 5.17. The pseudo-code of the Rank function

[p, q][cnt], where p and q are the ending data points and cnt is the count of enclosed data points, and denote a dom-SL node by {p1(dom1), p2(dom2), ...}(domn), where pi is a data point enclosed by the node n, domi records the number of data points dominated by pi, and domn records the number of data points dominated by n (i.e., by all data points enclosed in the corresponding RZ-region). The total number of data points dominated by pi is therefore domi + domn.

First, a stack that keeps unexplored cnt-SRC nodes is initialized with the root of cnt-SRC. Next, the search explores the first branch [p1, p4][4], and then drills down to the first leaf node [p1, p1][1], where the data point p1 is collected as the first skyline point. Thereafter, the second leaf node [p2, p4][3] is examined and explored. The enclosed data points p2 and p3 are collected as the second and third skyline points in dom-SL. Then, p4 is evaluated; it is dominated by an entire leaf node, {p1(0)}, and by the two data points p2 and p3, so all the corresponding counts are incremented by one. Notice that p4, although dominated by both p2 and p3, is not dominated by the node {p2, p3} as a whole, and hence its cnt contributes to the doms of p2 and p3, but not to that of the node {p2, p3}. Further, [p5, p9][5] is explored. Since it is incomparable to {p2(1), p3(1)}(0) and not dominated by {p1(0)}, [p5, p7][3] is explored. Then, p5 and p6 are collected as the fourth and fifth skyline points,

Algorithm ZRank(cnt-SRC, t)
Input. cnt-SRC: an extended ZBtree that indexes the source data set; t: the requested number of top-ranked skyline points.
Local. s: stack; dom-SL: an extended ZBtree for skyline candidates;
Output. Ut: t top-ranked skyline points
Begin
1.  s.push(cnt-SRC's root);
2.  while s is not empty do
3.    var n: Node;
4.    s.pop(n);   /* depth-first traversal */
5.    if n is a leaf node then
6.      forall child point c of n do
7.        needToExplore, dominated ← DominateAndCount(dom-SL, c, c, 1, n's minpt);
8.        if not dominated then
9.          dom-SL.insert(c);
10.   else /* non-leaf node */
11.     forall child node c of n do
12.       needToExplore, dominated ← DominateAndCount(dom-SL, c's minpt, c's maxpt, c's cnt, n's minpt);
13.       if needToExplore then
14.         s.push(c);
15. Ut ← Rank(dom-SL, t);
16. output Ut;
End.

Figure 5.18. The pseudo-code of the ZRank algorithm

maintained as {p5(0), p6(0)}(0) in dom-SL. As p7 is only dominated by p5, its count contributes to p5's dom. Next, the leaf node [p8, p9][2] is found to be completely dominated by {p1(0)} and p3, so the doms of both {p1(0)} and p3 are increased to 3. At the same time, it may contain data points dominated by p2, p5 and p6. Thereafter, [p8, p9][2] is further explored. When p8 is examined, as the associated parent minpt, i.e., p8, is already dominated by all the dominating points, no count is updated. As for p9, it is dominated by p2 and p5 and hence it increases both p2's and p5's doms by one. After the entire cnt-SRC is traversed, the counting phase completes and the ranking phase starts. The counts of the intermediate nodes in dom-SL are propagated to the data points, so the counts for the skyline points p1, p2, p3, p5 and p6 become 3, 2, 3, 2 and 0, respectively. Data points p1

Stack                                            Skyline points dom-SL
[p1, p4][4], [p5, p9][5]                         ∅
[p1, p1][1], [p2, p4][3], [p5, p9][5]            ∅
[p2, p4][3], [p5, p9][5]                         {p1(0)}(1)
[p5, p9][5]                                      {p1(0)}(1), {p2(1), p3(1)}(0)
[p5, p7][3], [p8, p9][2]                         {p1(0)}(1), {p2(1), p3(1)}(0)
[p8, p9][2]                                      {p1(0)}(3), {p2(1), p3(3)}(0), {p5(1), p6(0)}(0)
p8[1], p9[1]                                     {p1(0)}(3), {p2(1), p3(3)}(0), {p5(1), p6(0)}(0)
p9[1]                                            {p1(0)}(3), {p2(2), p3(3)}(0), {p5(2), p6(0)}(0)

Figure 5.19. The trace of the ZRank algorithm

and p3 are picked as the query result to complete the search.
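Before moving on, a short Python sketch of the ranking phase used above is given (assumed node layout: every dom-SL node carries a 'dom' counter and leaf children are (point, dom) pairs; this illustrates the aggregation described earlier and is not the thesis code). Node-level doms are pushed down along each root-to-leaf path and added to the per-point doms before sorting.

def rank(dom_sl_root, t):
    """Aggregate doms down the dom-SL and return the t candidates with the largest
    dominating power (cf. Figure 5.17)."""
    ranked, stack = [], [(dom_sl_root, 0)]
    while stack:
        node, inherited = stack.pop()                 # depth-first traversal
        total = inherited + node["dom"]
        if node["leaf"]:
            for point, point_dom in node["children"]:
                ranked.append((point, point_dom + total))
        else:
            for child in node["children"]:
                stack.append((child, total))
    ranked.sort(key=lambda pd: pd[1], reverse=True)   # largest dominating power first
    return ranked[:t]

# With the running example's final dom-SL, this would rank p1 and p3 (power 3) first.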

5.5.4 The k-ZSearch Algorithm

Next, we present the k-ZSearch algorithm to evaluate k-dominant skyline queries. To address the cyclic dominance problem, our k-ZSearch algorithm adopts a filter-and-reexamine approach. In the filtering phase, possible k-dominant skyline candidates, which may contain false hits, are collected. The reexamination phase eliminates those false hits. We detail these two phases in the following. The Filtering Phase. In the filtering phase, the k-ZSearch algorithm traverses SRC for k-dominant skyline candidates while filtering out those data points or nodes that are k-dominated by collected skyline candidates. Here, candidates are maintained in k-SL, indexed by a ZBtree. Extending the dominance relationship between RZ-regions to the k-dominance relationship, Theorem 5.1 formalizes the transitive k-dominance relationship, and Lemma 5.2 suggests an efficient candidate filtering strategy for the k-ZSearch algorithm.

Theorem 5.1. Transitive k-dominance relationship. Given three points p, p′ and p″, the following two transitive k-dominance relationships are true:

1. If p ≺k p′ and p′ ≺ p″, then p ≺k p″.

2. If p ≺ p′ and p′ ≺k p″, then p ≺k p″.

Proof. For (1), for certain k out of the d dimensions si (where si ∈ S′, S′ ⊆ S and |S′| = k), p.si ≤ p′.si ≤ p″.si must hold, implying p.si ≤ p″.si. Moreover, there is at least one dimension sj among those k dimensions such that p.sj < p′.sj ≤ p″.sj; in this case, p.sj < p″.sj. Hence, p ≺k p″.
Likewise for (2), for certain k out of the d dimensions si (where si ∈ S′, S′ ⊆ S and |S′| = k), p.si ≤ p′.si ≤ p″.si must hold, so p.si ≤ p″.si. Also, there exists at least one dimension sj among those k dimensions such that p.sj ≤ p′.sj < p″.sj; then p.sj < p″.sj. Thus, p ≺k p″.
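At the level of individual points, the k-dominance test reduces to counting dimensions, as in the following Python sketch (smaller values assumed preferable; illustrative only, not thesis code). Note that with k < d the relation can be cyclic between two points, which is why the reexamination phase is needed; Theorem 5.1 only guarantees transitivity when one of the two relations involved is full dominance.

def k_dominates(p, q, k):
    """p k-dominates q if, on some k of the d dimensions, p is no worse than q on all k
    and strictly better on at least one. Since every strictly-better dimension is also a
    no-worse dimension, it suffices to count the no-worse dimensions and check that at
    least one strict improvement exists."""
    no_worse = sum(a <= b for a, b in zip(p, q))
    return no_worse >= k and any(a < b for a, b in zip(p, q))

# Cyclic 2-dominance in 3D (hypothetical points): each 2-dominates the other.
p, q = (1, 5, 3), (5, 1, 3)
print(k_dominates(p, q, 2), k_dominates(q, p, 2))   # True True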

Lemma 5.2. Given two RZ-regions R and R′, the following four cases hold:

1. If maxpt(R) ≺k minpt(R′), then R ≺k R′, i.e., any data point in R k-dominates all the data points in R′.

2. If maxpt(R) ⊀k minpt(R′) ∧ minpt(R) ≺k minpt(R′), then some data points in R may k-dominate all the data points in R′.

3. If minpt(R) ⊀k minpt(R′) ∧ minpt(R) ≺k maxpt(R′), then some data points in R may k-dominate some data points in R′.

4. If minpt(R) ⊀k maxpt(R′), then R ⊀k R′, i.e., no data point in R can k-dominate any data point in R′.

Proof. We prove the lemma case by case.
Case 1. maxpt(R) ≺k minpt(R′). According to the first part of Theorem 5.1, since maxpt(R) ≺k minpt(R′) and minpt(R′) dominates all the data points in R′, maxpt(R) k-dominates all the data points in R′. By the second part of Theorem 5.1, all the data points in R (which, except maxpt(R), dominate maxpt(R)) also k-dominate them. As a result, R ≺k R′. As shown in Figure 5.20(a), R1 and R2 are 2-dominated by Rs in a 3D space.
Case 2. maxpt(R) ⊀k minpt(R′) ∧ minpt(R) ≺k minpt(R′). Some data points in R, such as minpt(R), k-dominate minpt(R′). Based on the first part of Theorem 5.1, they also k-dominate all the other points in R′. In Figure 5.20(a), Rs and R3 show an example.
Case 3. minpt(R) ⊀k minpt(R′) ∧ minpt(R) ≺k maxpt(R′). Some data points in R′, including maxpt(R′), might be k-dominated by some points in R. Rs and R4 in Figure 5.20(a) form an example.
Case 4. minpt(R) ⊀k maxpt(R′). The case can be proved by contradiction. Suppose p ≺k p′ (p ∈ R, p′ ∈ R′). By Theorem 5.1, minpt(R) ≺ p ≺k p′ ≺ maxpt(R′) ⇒ minpt(R) ≺k maxpt(R′). This contradicts the case condition. Rs and R5 in Figure 5.20(a) form an example for this case.

Lemma 5.2 is also applicable to data points (see Figure 5.20(b)). Based on this, we derive the k-Dominate function to determine whether an RZ-region R, presented as minpt and maxpt, is k-dominated by any existing candidate k-dominant skyline point preserved in k-SL. The logic of this function is outlined in Figure 5.21. Similar to the Dominate function, the k-Dominate

Figure 5.20. 2-dominance tests in 3D space: (a) RZ-regions; (b) a point and an RZ-region

function terminates when R is assured to be k-dominated by some RZ-region enclosing candidate skyline points, or when all relevant nodes of k-SL have been visited. With the k-Dominate function, the filtering phase of the k-ZSearch algorithm (line 1-18 in Figure 5.22) collects k-dominant skyline candidates in k-SL. Those unqualified data points or index node entries (i.e., those k-dominated by some existing result candidate) are reserved in a non-candidate set (T), which will be used in the reexamination phase for false hit removal. The memory consumption for T is expected to be low, because most branches of SRC can be pruned at high levels owing to the k-dominance relationship. The filtering phase terminates when SRC is completely traversed. The Reexamination Phase. The reexamination phase of the k-ZSearch algorithm removes the false hits from the candidates collected in the filtering phase. The main idea is that if a candidate in k-SL is found to be k-dominated by any other point in k-SL or in T, it is moved from k-SL to T. All candidate points are reexamined. The logic of this phase is depicted in line 19-26 of Figure 5.22. As the number of candidates maintained in k-SL is much smaller than the number of points stored in T, and those candidates usually have strong dominating power, our reexamination checks every candidate p against the others in k-SL first. If there is a candidate p′ k-dominating p, p is moved to T. Otherwise, we proceed to check p against the entries in T. If no nodes and/or data points in T k-dominate p, p remains in k-SL. If some index nodes in T need further exploring, they are replaced with all their child entries (either data points or child nodes) in T. Those final

Function k-Dominate(k-SL, minpt, maxpt)
Input. k-SL: a ZBtree (k-dominant skyline candidates); minpt and maxpt: the endpoints of an RZ-region;
Local. q: Queue;
Output. TRUE if the input is k-dominated, else FALSE;
Begin
1.  q.enqueue(k-SL's root);
2.  while q is not empty do
3.    var n: Node;
4.    q.dequeue(n);
5.    if n is a non-leaf node then
6.      forall child node c of n do
7.        if c's RZ-region's maxpt ≺k minpt then
8.          output TRUE;            /* Case 1 of Lemma 5.2 */
9.        else if c's RZ-region's minpt ≺k minpt then
10.         q.enqueue(c);           /* Case 2 of Lemma 5.2 */
11.   else /* n is a leaf node */
12.     forall child point c of n do
13.       if c ≺k minpt then
14.         output TRUE;
15. output FALSE;
End.

Figure 5.21. The pseudo-code of the k-Dominate function

remainders in k-SL are the k-dominant skyline points.

5.5.5 The ZSubspace Algorithm

Finally, we present the ZSubspace algorithm to process subspace skyline queries. Accessing data points in ascending order of Z-addresses cannot exploit the transitivity property for subspace skyline queries; thus, the algorithm also adopts a filter-and-reexamine approach. In the filtering phase, SRC is accessed based on a depth-first traversal that retrieves data points and nodes in ascending order of Z-addresses. Following this order, it collects candidates if they are not S′-dominated (where S′ is a set of dimensions specified at query time). Theorem 5.2 states the transitive subspace dominance relationship and Lemma 5.3 suggests how to prune the search space when the RZ-regions of certain nodes are subspace-dominated by

Algorithm k-ZSearch(SRC)
Input. SRC: a ZBtree for source data set;
Local. k-SL: a ZBtree (k-dominant skyline candidates); s: Stack; T: Set;
Begin
1.  s.push(source's root);
2.  while s is not empty do
3.    var n: Node;
4.    s.pop(n);
5.    if Dominate(k-SL, n's minpt, n's maxpt) then
6.      goto line 3.                /* remove those dominated data points */
7.    if k-Dominate(k-SL, n's minpt, n's maxpt) then
8.      T.insert(n);                /* n for reexamination */
9.      goto line 3.
10.   if n is a non-leaf node then
11.     forall child node c of n do
12.       s.push(c);
13.   else /* leaf node */
14.     forall child point c of n do
15.       if k-Dominate(k-SL, c, c) then
16.         T.insert(c);            /* c for reexamination */
17.       else
18.         k-SL.insert(c);         /* tentative result set */
19. forall point p ∈ k-SL do
20.   if ∃ p′ ∈ k-SL, p′ ≠ p ∧ p′ ≺k p then
21.     remove p from k-SL;
22.     T.insert(p);
23.   else if ∃ p′ ∈ T, p′ ≺k p then
24.     remove p from k-SL;
25.     T.insert(p);
26. output k-SL;
End.

Figure 5.22. The pseudo-code of the k-ZSearch algorithm

current skyline candidates. Thanks to the clustering property of Z-order curves, the ZSubspace algorithm can effectively prune the search space in the filtering phase. In Figure 5.23, line 1-13 outlines the filtering phase. While the subspace dominance relationship is used in place of the conventional dominance relationship, the logic of the Subspace-Dominate function is very similar to the Dominate function, and we do not include it to save space.

Theorem 5.2. Transitive subspace dominance relationship. Given a subset of dimensions S′, the following two transitive subspace dominance relationships are true:

1. If p ≺S′ p′ and p′ ≺ p″, then p ≺S′ p″.

2. If p ≺ p′ and p′ ≺S′ p″, then p ≺S′ p″.

Proof. The proof is very similar to that of Theorem 5.1. We omit the discussion to save space.

Lemma 5.3. Given a subset of dimensions S′ and two RZ-regions R and R′, the following four cases hold:

1. If maxpt(R) ≺S′ minpt(R′), then R ≺S′ R′, i.e., any data point in R S′-dominates all the data points in R′.

2. If maxpt(R) ⊀S′ minpt(R′) ∧ minpt(R) ≺S′ minpt(R′), then some data points in R may S′-dominate all the data points in R′.

3. If minpt(R) ⊀S′ minpt(R′) ∧ minpt(R) ≺S′ maxpt(R′), then some data points in R may S′-dominate some data points in R′.

4. If minpt(R) ⊀S′ maxpt(R′), then R ⊀S′ R′, i.e., no data point in R can S′-dominate any data point in R′.

Proof. This can be proved in the same way as Lemma 5.2. We omit the proof to save space.

In the reexamination phase (as shown in line 14-16), candidates are reexamined to eliminate false hits. Each candidate is compared against all other candidates in S′-SL; those that are S′-dominated are removed from S′-SL. Finally, the remainders in S′-SL are the subspace skyline points and the algorithm terminates. To illustrate the operation of the ZSubspace algorithm, Example 5.4 is provided below.

Example 5.4. Consider a subspace skyline query on the distance dimension based on our hotel example (i.e., S′ = {distance}). The result is {p6}. The trace of the filtering phase is shown in Figure 5.24. Based on the depth-first traversal of the ZBtree, p1 is the first data point accessed and it is included in S′-SL. Then the node [p2, p4] is S′-dominated by p1 and ignored. Next, [p5, p7] is examined against p1 and expanded. Thereafter, p5 is visited and kept in S′-SL. Further, p6 is not dominated and collected. Later, p7 and [p8, p9] are dominated and pruned.

Algorithm ZSubspace(SRC, S′)
Input. SRC: a ZBtree for source data set; S′: a subset of dimensions;
Local. S′-SL: a ZBtree (subspace skyline points); s: Stack;
Begin
/*** filtering phase ***/
1.  s.push(source's root);
2.  while s is not empty do
3.    var n: Node;
4.    s.pop(n);
5.    if Subspace-Dominate(S′-SL, n's minpt, n's maxpt) then
6.      goto line 3.                /* remove those dominated data points */
7.    if n is a non-leaf node then
8.      forall child node c of n do
9.        s.push(c);
10.   else /* leaf node */
11.     forall child point c of n do
12.       if not Subspace-Dominate(S′-SL, c, c) then
13.         S′-SL.insert(c);        /* tentative result set */
/*** reexamination phase ***/
14. forall point p ∈ S′-SL do
15.   if Subspace-Dominate(S′-SL, p, p) then
16.     S′-SL.delete(p);
17. output S′-SL;
End.

Figure 5.23. The pseudo-code of the ZSubspace algorithm

Then the filtering phase ends. Notice that p1 is not a result point, but it helps filter out the nodes [p2, p4] and [p5, p7].

Next, in the reexamination phase, p1 and p5 are both dominated by p6 and are removed from S′-SL. Now p6, retained in S′-SL, is the subspace skyline point.
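The subspace dominance test itself is just the ordinary dominance test restricted to the chosen dimensions, e.g. (Python sketch with hypothetical points, smaller values preferable; illustrative only):

def s_dominates(p, q, dims):
    """p S'-dominates q: restricted to the dimension indices in dims, p is no worse
    everywhere and strictly better at least once."""
    ps, qs = [p[i] for i in dims], [q[i] for i in dims]
    return all(a <= b for a, b in zip(ps, qs)) and any(a < b for a, b in zip(ps, qs))

# Hypothetical 2D hotel points (price, distance); subspace S' = {distance} (index 1):
print(s_dominates((6, 1), (2, 4), dims=[1]))   # True: the closer hotel S'-dominates the farther one

A point that is worse on price can still S′-dominate on distance alone, which is why candidates admitted during full-space-style filtering (such as p1 above) may turn out to be false hits and must be reexamined.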

5.6 Performance Evaluation of Skyline Query Processing

This section evaluates the performance of the proposed suite of algorithms based on the Z-order curve, namely ZSearch, ZBand, ZRank, k-ZSearch and ZSubspace, and compares them with state-of-the-art approaches specialized for the corresponding domains.

Stack                                S′-SL
[p1, p4], [p5, p9]                   ∅
[p1], [p2, p4], [p5, p9]             ∅
[p5, p7], [p8, p9]                   {p1}
[p8, p9]                             {p1}, {p5, p6}

Figure 5.24. The trace of the filtering phase of the ZSubspace algorithm

5.6.1 Experiment Settings

Our evaluations are based on both synthetic and real datasets. The synthetic datasets are generated following the correlated, independent and anti-correlated distributions discussed in [21], with various data dimensionalities (d) and cardinalities (n). Due to limited space, we present the results for d ranging over 4, 8, 12 and 16 and n from 10k to 10,000k (ten million) in order to demonstrate the scalability of the proposed algorithms. In addition, three real datasets, i.e., NBA, HOU and FUEL, that follow anti-correlated, independent and correlated distributions, respectively8, are employed. NBA contains 17k 13-dimensional data points, each of which corresponds to the statistics of an NBA player's performance in 13 aspects (such as points scored, rebounds, assists, etc.). HOU consists of 127k 6-dimensional data points, each representing the percentage of an American family's annual expense on 6 types of expenditures such as electricity, gas, and so on. FUEL is a 24k 6-dimensional dataset, in which each point stands for the performance of a vehicle (such as mileage per gallon of gasoline in city and highway, etc.). In the experiments, all datasets are normalized to [0, 1024)^d.

Besides our proposed algorithms, we include the state-of-the-art algorithms SFS [30], SaLSa [17], BBS [105], SFSBand, SFSRank, BBSBand, BBSRank, TSA [25] and SUBSKY [135] for comparison. SFSBand and BBSBand are devised based on SFS and BBS, respectively, to answer skyband queries. SFSRank and BBSRank are developed based on SFS and BBS, respectively, to answer top-ranked skyline queries. Similar to most related works in the literature, we use elapsed time as the main performance metric. It represents the duration from the time an algorithm starts to the time the result is completely returned. To facilitate our understanding of how the elapsed time is spent, we also report the time spent on performing dominance tests, data access and maintaining skyline candidates (or candidate maintenance), i.e., the three major operations in the algorithms.

8 These are collected from www.nba.com, www.ipums.org and www.fueleconomy.gov, respectively.

Parameters                       Settings (* means default)
Synthetic datasets               Distribution: correlated, independent, anti-correlated;
                                 Dimensionality (d): 4, 8*, 12, 16;
                                 Cardinality (n): 10k, 100k, 1000k*, 10,000k
Real datasets                    NBA (13D, 17k, anti-correlated), HOU (6D, 127k, independent), FUEL (6D, 24k, correlated)
Data space                       [0, 1024)^d
Skyline algorithms               SFS, BBS, ZSearch
Skyband algorithms               SFSBand, BBSBand, ZBand (skyband width (b): 2, 4, and 8)
Top-ranked skyline algorithms    SFSRank, BBSRank, ZRank (result set size (t): 100)
k-dominant skyline algorithms    TSA and k-ZSearch (k: 11, 12, 13*, 14, 15 (where d = 16))
Subspace skyline algorithms      SUBSKY and ZSubspace (d′ = d/4, d/2 and d/1)

Table 5.2. Experiment settings

Here, data access time includes data retrieval and the traversal of the required indices. In addition, we include the runtime memory consumption as well as the number of skyline points for reference. We implemented all of the evaluated approaches in GNU C++ and conducted the experiments on Linux servers (running kernel version 2.6.9 with an Intel Xeon 3.2GHz CPU and 4GB RAM). The disk page size is fixed at 4KB. In the experiments, sufficient memory (including both main memory and the virtual memory provided by the OS) was available to accommodate all skyline candidates and the required data structures. Since Z-addresses can be used to derive the original attribute values, we only keep Z-addresses in the ZBtrees and extended ZBtrees used in the experiments and derive the original dimensional values as needed. Notice that the size of a Z-address equals the total size of the corresponding original dimensional values; here, each original value is stored in a two-byte short integer. The R-trees adopted by BBS, BBSBand and BBSRank are built with TGS bulkloading [48]. For SFS, TSA, SFSBand and SFSRank, records are presorted according to the sum of all attribute values, and for SaLSa, records are sorted according to their minimum attribute values. All indices and sorted records are prepared prior to the experiments. SRCs in the form of ZBtrees and extended ZBtrees are built with compact bulkloading (see Section 5.3.3). The results reported are the averaged performance of 30 sample runs. Table 5.2 summarizes all the parameters and their settings used in the experiments. In the following, we first evaluate the performance of skyline search, and then the performance of skyline result update, followed by the evaluation of the algorithms for skyline variants.
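For completeness, the sketch below shows one standard way to interleave coordinate bits into a Z-address and to recover them, matching the statement that a Z-address carries exactly the bits of the original dimensional values. This is a Python illustration assuming 10 bits per dimension (enough for the normalized domain [0, 1024)); the thesis's actual encoding routine may differ in details.

def z_encode(point, bits=10):
    """Interleave the bits of all coordinates, most significant bit first, into one Z-address."""
    z = 0
    for b in range(bits - 1, -1, -1):
        for coord in point:
            z = (z << 1) | ((coord >> b) & 1)
    return z

def z_decode(z, d, bits=10):
    """Inverse of z_encode: extract each coordinate's bits from the interleaved Z-address."""
    point = [0] * d
    for b in range(bits - 1, -1, -1):
        for i in range(d):
            point[i] = (point[i] << 1) | ((z >> (b * d + (d - 1 - i))) & 1)
    return tuple(point)

assert z_decode(z_encode((3, 5)), d=2) == (3, 5)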

5.6.2 Experiments on Skyline Queries

The first set of experiments evaluates ZSearch, compares it against SFS, SaLSa and BBS on synthetic datasets with various data distributions, dimensionalities and cardinalities, and then examines its practicality on the real datasets. Effect of data dimensionality. Figure 5.25 plots the elapsed time (in log scale) against the data dimensionality (d) from 4 up to 16 in steps of 4 for the synthetic correlated, independent and anti-correlated datasets, while the data cardinality (n) is fixed at 1,000k. The numbers of skyline points (m) are marked right below the x-axis. We notice that all the algorithms incur longer elapsed times as the data dimensionality increases. We also find that dominance tests consume more time on anti-correlated datasets than on the others. SaLSa improves on SFS by terminating the search earlier. However, SFS and SaLSa perform inefficient point-to-point dominance tests, resulting in longer elapsed times. BBS performs the best for correlated and independent datasets with low dimensionalities (e.g., d = 4), since many branches in the R-tree are pruned. However, its performance deteriorates significantly at higher dimensionalities because of the degraded performance of R-trees and the overheads incurred by dominance tests based on main-memory R-trees and by the heaps that manage the pending entries. This is also reflected in the maximum number of pending entries, as shown in Figure 5.26. Finally, ZSearch, owing to its effective space pruning capability and block-based dominance tests, performs better than the others as the data dimensionality increases.

9The data structures to keep track of collected skyline points are not counted. 124

Elapsed time (correlated, n=1000k) 1000 Dominance test 123.43 24.89 100 Candidate maintenance 24.72 11.46 Data access 10 1 0.1 0.01 time (seconds) 0.001 SFS SFS SFS SFS BBS BBS BBS BBS SaLSa SaLSa SaLSa SaLSa ZSearch ZSearch ZSearch ZSearch

4 (m=16) 8 (m=571) 12 (m=9798) 16 (m=32118) dimensionality (d) (a) correlated

Elapsed time (independent, n=1000k) 31738.1 Elapsed time (anti-correlated, n=1000k) 48577.1 100000 6680.2 7025.9 4723.9 1000000 Dominance test Dominance test 17860.4 19646.4 11318.6 10000 Candidate maintenance 100000 Candidate maintenance Data access 1000 Data access 10000 100 1000 10 1 100 0.1 10 time (seconds) 0.01 time (seconds) 1 SFS SFS SFS SFS SFS SFS SFS SFS BBS BBS BBS BBS BBS BBS BBS BBS SaLSa SaLSa SaLSa SaLSa SaLSa SaLSa SaLSa SaLSa ZSearch ZSearch ZSearch ZSearch ZSearch ZSearch ZSearch ZSearch

4 (m=51) 8 (m=19076) 12 (m=205670) 16 (m=590920) 4 (m=18434) 8 (m=884591) 12 (m=999926) 16 (m=1000000) dimensionality (d) dimensionality (d) (b) independent (c) anti-correlated

Figure 5.25. Skyline query; elapsed time versus dimensionalities (d)

Figure 5.26. Skyline query; runtime memory (number of pending entries) vs dimensionalities (d)

BBS takes up to four orders of magnitude more memory than ZSearch. Even worse, each heap entry in BBS, which maintains a distance to the origin of the space, is slightly larger than a ZSearch stack entry (which is simply a pointer to a node). Effect of data cardinality. Figure 5.27 depicts the elapsed time against the data cardinality (n: 10k up to 10,000k) while d is fixed at 8. The numbers of skyline points (m) are listed right below the x-axis. In general, m increases with the size of the experimented datasets. One exception is that m for correlated datasets drops when the datasets grow from 1,000k to 10,000k. This is because more data points in a larger dataset appear close to the origin and they can dominate many other data points.

[Figure: elapsed time (seconds, log scale) of SFS, SaLSa, BBS and ZSearch versus cardinality n ∈ {10k, 100k, 1,000k, 10,000k} for (a) correlated, (b) independent and (c) anti-correlated datasets (d = 8); the number of skyline points m is listed below each cardinality.]

Figure 5.27. Skyline query; elapsed time versus cardinalities (n)

The elapsed times of all the algorithms increase dramatically as n grows. For correlated datasets, SaLSa performs the best since it can terminate the search early once the (few) skyline points are identified. For the independent datasets, the performance of SFS, SaLSa and ZSearch is quite close. Conversely, ZSearch produces the shortest elapsed time for anti-correlated datasets. As n increases, the costs incurred by dominance tests become predominant and thus SFS, SaLSa and BBS perform worse than ZSearch.

[Figure: maximum number of pending entries (log scale) of SFS, SaLSa, BBS and ZSearch versus cardinality n for correlated, independent and anti-correlated datasets (d = 8).]

Figure 5.28. Skyline query; runtime memory vs cardinalities (n)

Next, Figure 5.28 plots the runtime memory consumption under various data cardinalities. SFS

and SaLSa incur no extra data structures and thus produce zero runtime memory consumption. ZSearch consumes reasonably little runtime memory, fewer than a hundred entries in all the cases. BBS keeps a large number of data points and index nodes, especially for high data cardinalities. From this, we can conclude that ZSearch is more memory-efficient than BBS.

Experiments on real datasets. Here, we evaluate the algorithms on the real datasets. The experiment results are listed in Table 5.3. Among the evaluated algorithms, ZSearch clearly outperforms SFS and BBS in elapsed time (in seconds) and performs very close to SaLSa.

Dataset   m       SFS     SaLSa   BBS     ZSearch
NBA       10816   2.933   1.702   3.364   1.723
HOU       5774    1.334   0.736   2.169   0.944
FUEL      1       0.031   0.001   0.001   0.001
Table 5.3. Skyline query; real datasets (time (sec))

The experiment results on synthetic datasets consistently show that ZSearch is superior to SFS and BBS, especially when large skyline results are derived. ZSearch performs as well as SaLSa on the relatively small real datasets. In conclusion, we consider ZSearch a very good skyline search approach. Due to limited space, we exclude the results for synthetic correlated datasets hereafter.

5.6.3 Experiments on Skyband Queries

Next, we evaluate ZBand on datasets with different data distributions, dimensionalities and cardinalities in terms of elapsed time, and compare it with SFSBand and BBSBand (a brute-force sketch of the b-skyband semantics is given at the end of this subsection). Figure 5.29 plots the elapsed times (in log scale) taken by ZBand, SFSBand and BBSBand on independent and anti-correlated datasets with the data dimensionality varied from 4 up to 16 and the data cardinality fixed at 1,000k. Here, the skyband widths (b) are set to 2, 4, and 8. When a larger b is used, more skyband points result. In general, ZBand performs the best for all the experimented b's, especially on high-dimensional datasets. For independent datasets with lower dimensionality (say 4), the number of skyline points (m) is very small. In this case, BBSBand performs better than all the others. However, due to expensive dominance tests that search for b dominating skyband candidates, BBSBand suffers considerably on the rest of the experimented datasets. Figure 5.30 depicts the experiment results for data cardinalities ranging from 10k up to 10,000k while the data dimensionality is fixed at 8. ZBand in general outperforms both SFS-

[Figure: elapsed time (seconds, log scale) of SFSBand, BBSBand and ZBand versus dimensionality d ∈ {4, 8, 12, 16} with b ∈ {2, 4, 8} for (a) independent and (b) anti-correlated datasets (n = 1,000k); the number of skyband points m is listed below each setting.]

Figure 5.29. Skyband query (various dimensionality (d))

[Figure: elapsed time (seconds, log scale) of SFSBand, BBSBand and ZBand versus cardinality n ∈ {10k, 100k, 1,000k, 10,000k} with b ∈ {2, 4, 8} for (a) independent and (b) anti-correlated datasets (d = 8); the number of skyband points m is listed below each setting.]

Figure 5.30. Skyband query (various cardinality (n))

Band and BBSBand. Further, we evaluate skyband queries on the real datasets. The experiment results are shown in Table 5.4. For FUEL, which is a correlated dataset and has a small number of skyband points, BBSBand incurs the shortest elapsed time. For NBA and HOU, which are anti-correlated and independent datasets, respectively, ZBand performs the best. From these experiment results, we can see that ZBand is generally the best skyband query algorithm.
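To make the semantics of the evaluated queries concrete, the following is a minimal, brute-force Python sketch of b-skyband computation. It assumes the common definition of a b-skyband as the set of points dominated by fewer than b other points (so the 1-skyband is the skyline) and that smaller attribute values are preferred; it only illustrates the result set that SFSBand, BBSBand and ZBand compute efficiently, not any of those algorithms.

def dominates(p, q):
    # p dominates q if p is no worse in every dimension and strictly better in at least one
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyband(points, b):
    # brute-force b-skyband: keep the points dominated by fewer than b other points
    result = []
    for i, q in enumerate(points):
        dominated_by = sum(1 for j, p in enumerate(points) if j != i and dominates(p, q))
        if dominated_by < b:
            result.append(q)
    return result

# tiny usage example on hypothetical 2-d data
data = [(1, 9), (2, 2), (3, 3), (4, 1), (5, 5)]
print(skyband(data, 1))   # the skyline: [(1, 9), (2, 2), (4, 1)]
print(skyband(data, 2))   # the 2-skyband additionally admits (3, 3)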

5.6.4 Experiments on Top-Ranked Skyline Queries

Next, we examine the performance of our proposed ZRank in comparison with SFSRank and BBSRank for top-ranked skyline queries. Here, the number of returned top-ranked skyline points (t) has a default value of 100. We have conducted experiments to evaluate the perfor-

Dataset   b   m       SFSBand   BBSBand   ZBand
NBA       2   12064   2904      5517      1986
          4   12968   3265      6703      2215
          8   13747   3570      7514      2397
HOU       2   7248    1498      4034      1689
          4   8897    2077      5559      2236
          8   10881   2913      7980      3009
FUEL      2   4       57        1         4
          4   5       58        1         3
          8   17      62        1         3
Table 5.4. Skyband query; real datasets (time (msec))

mance of different algorithms under various t and found that t does not have any significant impact on the performance. Consequently, the results under different t are skipped. In our implementation of BBSRank, we explore all index nodes to count the number of dominated data points for all individual skyline candidates.
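For reference, the ranking criterion itself can be written down in a few lines. The following brute-force Python sketch (assuming minimization and a pairwise dominance test) ranks the skyline points by the number of points each of them dominates and returns the top t; it mirrors the per-candidate counting performed by BBSRank, without any of the index-based optimizations of the compared algorithms.

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def top_ranked_skyline(points, t):
    # skyline points ordered by how many data points each one dominates
    skyline = [p for i, p in enumerate(points)
               if not any(j != i and dominates(q, p) for j, q in enumerate(points))]
    ranked = sorted(skyline, key=lambda p: sum(dominates(p, q) for q in points), reverse=True)
    return ranked[:t]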

[Figure: elapsed time (seconds, log scale) of SFSRank, BBSRank and ZRank versus dimensionality d ∈ {4, 8, 12, 16} for (a) independent and (b) anti-correlated datasets (n = 1,000k).]

Figure 5.31. Top-ranked skyline query (various dimensionality (d))

Figure 5.31 shows the experiment results for datasets with the dimensionality varied from 4 up to 16. For all evaluated datasets, ZRank clearly outperforms SFSRank and BBSRank. In particular, BBSRank suffers from the expensive computational cost of counting dominated data points for all individual skyline candidates. Likewise, SFSRank also incurs exhaustive dominance tests. Very differently, ZRank uses both cnt-SL and dom-SL to perform block-level dominance tests and counting, making it outperform the others. Figure 5.32 shows the experiment results for datasets with the cardinality varied from 10k up to 10,000k and the dimensionality fixed at 8. Again, for all evaluated datasets, ZRank outperforms SFSRank and BBSRank. This can be explained by the reasons discussed above. Further, we

[Figure: elapsed time (seconds, log scale) of SFSRank, BBSRank and ZRank versus cardinality n ∈ {10k, 100k, 1,000k, 10,000k} for (a) independent and (b) anti-correlated datasets (d = 8).]

Figure 5.32. Top-ranked skyline query (various cardinalities (n))

evaluate top-ranked skyline queries on the real datasets. From the experiment results as shown in Table 5.5, we can see that ZRank performs much better than SFSRank and BBSRank.

Dataset   t     SFSRank   BBSRank   ZRank
NBA       100   4206      11561     1095
HOU       100   29121     74006     9332
FUEL      100   56        75        43
Table 5.5. Top-ranked skyline query; real datasets (time (msec))

5.6.5 Experiments on k-Dominant Skyline Queries

This experiment set evaluates k-ZSearch for k-dominant skyline queries on high-dimensional datasets. Here, the state-of-the-art approach to be compared with is TSA. The experiment results for various k and a fixed cardinality of 1,000k are shown in Figure 5.33. Owing to block-based dominance tests and a single traversal of the ZBtree, k-ZSearch consistently outperforms TSA, as shown in the figure. We can also observe that the number of k-dominant skyline points is reduced when a small k is evaluated, as many data points get k-dominated. The number of skyline candidates is expected to be reduced as well, which results in a shorter processing time. At the other end, the processing time increases as k grows. Because k-dominance tests compare more attribute values than the conventional dominance test and are therefore more expensive (a brute-force sketch of this test is given below), the processing times of k-dominant skyline queries are not necessarily shorter than those of skyline queries. Thus, when k approaches d, the number of k-dominant skyline points is close to the number of conventional skyline points, but the processing time is longer. Figure 5.34 shows the elapsed times of the evaluated algorithms with cardinalities varied

[Figure: elapsed time (seconds, log scale) of TSA and k-ZSearch versus k ∈ {11, ..., 15} for (a) independent and (b) anti-correlated datasets (d = 16, n = 1,000k); the number of k-dominant skyline points m is listed below each k.]

Figure 5.33. k-dominant skyline query (various k)

from 10k to 10,000k and k fixed at 13. Again, k-ZSearch outperforms TSA, and their performance difference becomes more significant as the cardinality of the dataset increases. This indicates that k-ZSearch has better scalability than TSA.
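For illustration, the k-dominance test discussed above can be made explicit as a brute-force Python sketch (assuming minimization): p k-dominates q if p is no worse than q on at least k dimensions and strictly better on at least one of them, and a point is a k-dominant skyline point if no other point k-dominates it. This only spells out the semantics; it does not reflect the ZBtree-based evaluation of k-ZSearch or TSA.

def k_dominates(p, q, k):
    # p k-dominates q iff p is no worse on at least k dimensions and strictly better on at
    # least one dimension (every strict dimension is also a "no worse" one, so a qualifying
    # k-subset of dimensions is guaranteed to exist)
    no_worse = sum(a <= b for a, b in zip(p, q))
    strictly_better = sum(a < b for a, b in zip(p, q))
    return no_worse >= k and strictly_better >= 1

def k_dominant_skyline(points, k):
    return [q for j, q in enumerate(points)
            if not any(i != j and k_dominates(p, q, k) for i, p in enumerate(points))]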

[Figure: elapsed time (seconds, log scale) of TSA and k-ZSearch versus cardinality n ∈ {10k, 100k, 1,000k, 10,000k} for (a) independent and (b) anti-correlated datasets (d = 16, k = 13).]

Figure 5.34. k-dominant skyline query (various cardinalities (n))

Finally, we evaluate these algorithms on the real datasets. We set k to (d − 1), (d − 2), and (d − 3) where d represents the dataset dimensionality. Table 5.6 shows the performance. Consistent with our expectation, k-ZSearch performs better than TSA, showing the superiority of k-ZSearch for k-dominant skyline queries.

5.6.6 Experiments on Subspace Skyline Queries

The final set of experiments evaluates the performance of ZSubspace and compares it against SUBSKY. We first examine the impact of dimensionality. We generated synthetic datasets with d set to 4, 8, 12 and 16 and n set to 1,000k. With respect to different d, we randomly select

Dataset   k    m      TSA    k-ZSearch
NBA       12   3794   7931   2696
          11   682    1980   731
          10   79     322    171
HOU       5    22     815    226
          4    0      487    220
FUEL      5    1      63     1
          4    1      62     1
Table 5.6. k-dominant skyline query; real datasets (time (msec))

[Figure: elapsed time (seconds, log scale) of SUBSKY and ZSubspace versus dimensionality d ∈ {4, 8, 12, 16} with query subspaces of d/4, d/2 and d dimensions for (a) independent and (b) anti-correlated datasets (n = 1,000k); the number of result skyline points is listed below each setting.]

Figure 5.35. Subspace skyline query (various dimensionalities (d))

d/4, d/2 and d dimensions as query subspaces. The experimental results in terms of elapsed times are shown in Figure 5.35. The numbers of result skyline points are listed right below the x-axis. For independent datasets, SUBSKY performs better than ZSubspace when the subspace dimensionalities are small. However, when the subspace dimensionality increases, ZSubspace becomes more efficient than SUBSKY due to its effective block-based dominance tests. For this reason, ZSubspace outperforms SUBSKY in most of the cases on anti-correlated datasets. Next, we evaluate the performance of both ZSubspace and SUBSKY on datasets with cardinalities n varied from 10k to 10,000k and the dimensionality set to 8. The results in terms of elapsed times and the numbers of skyline points are plotted in Figure 5.36. For independent datasets, SUBSKY in general performs better than ZSubspace except when the number of queried dimensions approaches d. Conversely, for anti-correlated datasets, ZSubspace slightly outperforms SUBSKY. This can be explained by the reasons discussed above. Finally, we evaluate both ZSubspace and SUBSKY on the real datasets. The results are

[Figure: elapsed time (seconds, log scale) of SUBSKY and ZSubspace versus cardinality n ∈ {10k, 100k, 1,000k, 10,000k} for (a) independent and (b) anti-correlated datasets (d = 8); the number of result skyline points is listed below each setting.]

Figure 5.36. Subspace skyline query (various cardinalities (n))

shown in Table 5.7. Due to space limitations, we only show some sample numbers of queried dimensions (i.e., |S|). As we can observe from the results, the performance of the two approaches is very close, unlike on the synthetic datasets. This is because the real datasets are relatively small. Based on the experiment results, ZSubspace performs better than SUBSKY when the number of skyline points is large.

Dataset   |S|   m       SUBSKY   ZSubspace
NBA       5     397     0.047    0.159
          9     5058    1.217    0.969
          13    10816   4.009    2.225
HOU       2     8       0.049    0.065
          4     294     0.223    0.255
          6     5774    2.121    1.908
FUEL      2     1       0.005    0.003
          4     1       0.005    0.004
          6     1       0.005    0.004
Table 5.7. Subspace skyline query; real datasets (time (sec))

5.7 Chapter Summary

In this chapter, we analyze skyline queries and point out the transitivity and incomparability properties that facilitate efficient searches for skyline results. Correspondingly, we exploit the monotonic ordering and clustering features of the Z-order curve, which match these two properties perfectly well. Because of the ordering feature, data points on the Z-order curve have their dominating points, if any exist, accessed ahead of them, and thus candidate re-examination is entirely eliminated. Meanwhile, due to the clustering feature, data points with similar values are naturally mapped onto Z-order curve segments, which are in turn grouped into blocks. This facilitates true block-based dominance tests that no other existing approach supports, and these block-based dominance tests enable efficient search space pruning. Evaluation results obtained from comprehensive experiments demonstrate the superiority of our approaches, devised based on the Z-order curve features, over the existing works.
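To make the ordering feature concrete, the following is a minimal Python sketch (assuming non-negative integer coordinates and a fixed number of bits per dimension) of computing a point's Z-address by bit interleaving. Under minimization, a point that dominates another never has a larger Z-address, so scanning points in Z-order encounters potential dominators before the points they dominate; this sketch illustrates the mapping only, not the ZBtree or the ZSearch algorithms.

def z_address(point, bits=8):
    # interleave the coordinate bits, taking one bit per dimension from the most
    # significant level down to the least significant
    z = 0
    for level in reversed(range(bits)):
        for x in point:
            z = (z << 1) | ((x >> level) & 1)
    return z

# usage: (1, 2) dominates (3, 2), and it indeed precedes it on the Z-order curve
pts = [(3, 2), (1, 2), (2, 7), (0, 0)]
print(sorted(pts, key=z_address))   # [(0, 0), (1, 2), (3, 2), (2, 7)]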

Chapter 6

LDSQs on Spatial Road Network

In this chapter, we present ROAD, a general framework to evaluate Location-Dependent Spatial Queries (LDSQs) that search for spatial objects on road networks. By exploiting a search space pruning technique and providing a dynamic object mapping mechanism, ROAD is very efficient and flexible for various types of objects and queries, namely, range search and nearest neighbor search, over large-scale networks. ROAD is named after its two components, namely, Route Overlay and Association Directory, designed to address the network traversal and object access aspects of the framework. In ROAD, a large road network is organized as a hierarchy of interconnected regional sub-networks (called Rnets) augmented with 1) shortcuts for accelerating network traversals; and 2) object abstracts for guiding network traversals for queries. In this chapter, we present (i) the Rnet hierarchy and several properties useful for constructing it, (ii) the design and implementation of the ROAD framework, and (iii) efficient object search algorithms for various queries. We conducted extensive experiments with real road networks to evaluate ROAD. The experiment results show the superiority of ROAD over the state-of-the-art approaches.

6.1 State-of-the-Art Approaches

First, we review existing works that evaluate LDSQs on road networks. In general, they can be categorized into three types of approaches, namely, network expansion based approaches, Euclidean distance bound approaches and solution based approaches.

6.1.1 Network Expansion Based Approaches

Network expansion gradually expands the search space in a network by forming a spanning tree rooted at a given query point. It is applicable to object search and shortest path search, as in Dijkstra's algorithm [40]. It iteratively examines the next closest unexplored node, which guarantees a minimal expansion at each step, until all the nodes and edges that satisfy the search criteria are visited [69, 106]. Objects of interest located on the visited nodes and edges are the result objects, and the paths from the root to those objects are the shortest paths. Although network expansion is useful for many LDSQs, it is inefficient due to an almost blind scan over the entire search space and a slow node-by-node expansion in all directions. For a large search space, this deficiency seriously deteriorates the search performance.

6.1.2 Euclidean Distance Based Approaches

Euclidean distance is always a lower bound on the physical path distance. Euclidean distance bound approaches [106, 155] employ this property as a heuristic to identify candidate objects whose Euclidean distances are not greater than a certain threshold distance. Then, false candidates, whose network distances (determined by shortest path algorithms, e.g., the A* algorithm [37], or by materialized distances, e.g., HEPV [73] and HiTi [74]) are greater than the threshold, are eliminated. However, the heuristic is not applicable to other network distance metrics, such as travel time or cost. It is also not very effective when paths between objects and query points are not close to straight lines. As studied in [106], these approaches perform worse than network expansion for the same LDSQs.

6.1.3 Solution Based Approaches

By pre-computing and maintaining query results for potential access in the future, solution based approaches such as VN3 [79], UNICONS [29], SPIE [59] and Distance Index [58] optimize the search performance for a given type of query. VN3 [79] employs the concept of the Voronoi diagram for nearest neighbor (NN) queries on road networks. For each object, a geometric polygon is formed based on network distances from other neighboring objects and indexed in a spatial index. All points within a polygon have the enclosed object as their NN. With VN3, NN search is transformed into a point enclosure problem. UNICONS [29] pre-computes kNN objects for some selected nodes. SPIE [59] organizes a network as a set of spanning trees and pre-computes NN results on nodes in the spanning trees. NN queries can be answered by accessing pre-computed results maintained at some of the closest nodes. Distance Index [58] pre-computes, for all nodes, the object distances and pointers to the next nodes towards individual objects, and encodes them as distance signatures. Directed by the signatures, both range and NN queries are supported. Instead of precise distances, distance ranges are adopted such that narrow (wide) distance ranges are used to indicate nearby (faraway) objects. To determine the precise distances of objects, a search chases the next pointers of nodes to reach nodes closer to the objects, as the signatures there provide more precise distance ranges of the objects. Based on the more precise distances, objects may be collected as a query result. Figure 6.1 illustrates the distance signatures for objects o1 and o2 stored at nq and nq'. However, we can see that both distance signatures are identical, which implies redundant storage. In other words, it incurs excessive storage and pre-computation costs. Our evaluation results also reflect that it is far from practical.

[Figure: distance signatures maintained at nq and nq'; both store [d0-d1): {}, [d1-d2): {}, [d2-d3): {o1, o2} for objects o1 and o2.]

Figure 6.1. Distance index approach

The common pitfalls of solution based approaches are the extremely high overheads incurred in pre-computation, result storage, and maintenance. More importantly, they adapt very poorly to other types of queries and to object and network updates.

6.2 The ROAD Framework

We first present the concept, design and implementation of our ROAD framework. In the following, we introduce Rnets, shortcuts and object abstracts, i.e., the key designs that support search space pruning in ROAD, and then discuss Rnet hierarchy formation. Moreover, we present Route Overlay and Association Directory, the two core components of the ROAD implementation.

6.2.1 Basic Idea

Formally, a road network can be modeled as a weighted graph N consisting of a set of nodes N and a set of edges E, i.e., N = (N, E). A node n ∈ N represents a road intersection or an end point; and an edge (n, n') ∈ E represents a road segment connecting nodes n and n'. |n, n'| denotes the edge distance, which can represent the travel distance, trip time or toll of the corresponding road segment, and its value is positive. We simply use distance in the rest of this chapter. A path P(u, v) stands for a set of edges connecting nodes u and v, and its distance is |P(u, v)| = Σ_{(n,n')∈P(u,v)} |n, n'|. Among all possible paths connecting node u and node v, the one with the shortest distance is referred to as the shortest path, denoted by SP(u, v). The network distance ||u, v|| between u and v is the distance of their shortest path SP(u, v), i.e., ||u, v|| = |SP(u, v)|. For simplicity, we assume that objects reside on edges (i.e., road segments) in a network. Objects at nodes (i.e., road intersections) can be treated as if they were located at the end of the edges. We denote the set of objects on edge (n, n') by O(n, n') and the distances from an object o ∈ O(n, n') to the nodes n and n' by δ(o, n) and δ(o, n'), respectively. Also, we assume LDSQs to be initiated at nodes for simplicity. Each LDSQ is specified with a distance condition D and an attribute predicate A. Given a set of objects in a network, an object o is collected as an answer of an LDSQ if (1) its distance from a query node nq, denoted by ||nq, o|| = min(||nq, n|| + δ(o, n), ||nq, n'|| + δ(o, n')), satisfies D (e.g., ||nq, o|| ≤ 100), and (2) its attribute, denoted by o.a, satisfies A (e.g., restaurant o.type = 'seafood'). As shown, we single out the network distance condition from the other attributes due to its importance and the focus of this work.
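As a concrete illustration of this model, the following Python sketch (with a hypothetical adjacency-list graph and hypothetical object records; Dijkstra's algorithm stands in for any shortest-path method) computes the network distance ||nq, o|| of an object lying on an edge and filters objects against a distance condition D and an attribute predicate A.

import heapq

def dijkstra(graph, source):
    # graph: {node: [(neighbor, edge_distance), ...]}; returns ||source, n|| for reachable n
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, n = heapq.heappop(heap)
        if d > dist.get(n, float('inf')):
            continue
        for m, w in graph.get(n, []):
            nd = d + w
            if nd < dist.get(m, float('inf')):
                dist[m] = nd
                heapq.heappush(heap, (nd, m))
    return dist

def object_distance(dist, obj):
    # obj lies on edge (n, n2) at offsets delta_n and delta_n2 from its two end nodes
    inf = float('inf')
    return min(dist.get(obj['n'], inf) + obj['delta_n'],
               dist.get(obj['n2'], inf) + obj['delta_n2'])

def answer_range_ldsq(graph, nq, objects, max_dist, wanted_type):
    # e.g., objects within network distance 100 whose type is 'seafood'
    dist = dijkstra(graph, nq)
    return [o for o in objects
            if object_distance(dist, o) <= max_dist and o.get('type') == wanted_type]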

6.2.2 Rnets, Shortcuts and Object Abstracts

To find objects in terms of their network distances and attributes, a search algorithm may implicitly form a search tree originating at the query node. Following the topology of the network, the portion of the network covered by a search tree conceptually represents a search space. Scanning an entire search space incurs significant traversal overheads. Skipping search subspaces that contain no objects of interest from detailed examination presents an optimization opportunity. This search space pruning technique is expected to be very effective in road networks because spatial objects are often clustered and concentrated in some areas, e.g., hotels and resorts are likely to be in business and scenic areas, respectively. Thus, many subspaces contain no objects of interest and can be pruned. Though well received in various database searches, to the best of our knowledge, the idea of search space pruning has not been exploited in the context of object search on road networks.

[Figure: (a) a closed path between n1 and n5 reached from query node nq, bypassed via shortcut S(n1, n5); (b) an Rnet with border nodes n1, n5 and n6 and shortcuts S(n1, n5), S(n1, n6) and S(n5, n6).]

Figure 6.2. A closed path and an Rnet

Figure 6.2(a) explains how search space pruning can be realized in a road network. Suppose a search tree grows from a node, nq, to reach a node n1. Assume that the path covering edges (n1, n2), (n2, n3), (n3, n4), and (n4, n5) represents a closed path, i.e., a path that has no nodes connecting to other parts of the network besides the two ending nodes (i.e., n1 and n5). If no object of interest is present on the closed path, a detailed traversal of the path can be skipped, and the traversal only has to continue at n5 in order to explore objects thereafter. Considering this closed path as a subspace, we need (1) a hint about whether, or what, objects are on the path; and (2) an artifact at n1 connecting it to n5, the other end of the path. Accordingly, we introduce object abstracts and shortcuts. As such, when a closed path is reached and no object of interest is indicated in an object abstract, a search can bypass the entire path via a shortcut directly to the other end. A shortcut between two ending nodes is the shortest path between them. In road networks, closed paths are usually short; thus the performance gained by bypassing closed paths is rather limited. We, therefore, introduce the notion of Rnets, which stands for regional sub-networks, in a road network. Each Rnet encloses a subset of edges and is bounded by a set of border nodes. Each border node is an entrance and exit of an Rnet. The formal definition of an Rnet is stated in Definition 6.1. In an Rnet R, nodes with edges not belonging to R are border nodes. Meanwhile, a border node can be shared by more than two Rnets at the same time. Based on Rnets, the concepts of shortcuts and object abstracts are developed and formally stated in Definitions 6.2 and 6.3, respectively. It is noteworthy that the edges that contribute to SP(b, b') might not necessarily be included in ER.

Definition 6.1. Rnet. In a network N = (N, E), an Rnet R = (NR, ER, BR) represents a search subspace, where NR, ER and BR stand for the nodes, edges and border nodes in R, and

(1) ER ⊆ E,
(2) NR = {n | (n, n') ∈ ER ∨ (n', n) ∈ ER}, and
(3) BR = NR ∩ {n | (n, n') ∈ E' ∨ (n', n) ∈ E'}, where E' = E − ER.

Definition 6.2. Object Abstract. The object abstract of an Rnet R, O(R), represents all the objects residing on edges in ER, i.e., O(R) = ∪_{e∈ER} O(e).

Definition 6.3. Shortcut. The shortcut S(b, b') between border nodes b and b' (∈ BR) of an Rnet R bears the shortest path SP(b, b') and its distance ||b, b'||.

Figure 6.2(b) depicts an Rnet, R, where n1, n5 and n6 are the border nodes. When a search reaches n1, the entire Rnet can be bypassed with shortcut S(n1, n5) to n5 or S(n1, n6) to n6 if the corresponding object abstract O(R) indicates no object of interest.
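The following is a minimal Python sketch (with hypothetical field names) of how an Rnet, its shortcuts and its object abstract could be represented to support the bypass decision just described; it illustrates the concepts of Definitions 6.1-6.3 only, not the actual ROAD storage structures presented later.

from dataclasses import dataclass, field

@dataclass
class Shortcut:
    src: str          # border node b
    dst: str          # border node b'
    distance: float   # ||b, b'||, the shortest-path distance between them

@dataclass
class Rnet:
    edges: set = field(default_factory=set)            # edges (n, n') enclosed by the Rnet
    border_nodes: set = field(default_factory=set)
    shortcuts: dict = field(default_factory=dict)       # border node b -> [Shortcut, ...]
    object_abstract: set = field(default_factory=set)   # IDs of objects on the Rnet's edges

def bypass_targets(rnet, b, wanted_ids):
    # if the object abstract contains none of the wanted objects, the whole Rnet can be
    # skipped: return the other border nodes reachable from b via its shortcuts
    if rnet.object_abstract & wanted_ids:
        return None                      # the Rnet must be traversed in detail
    return [(s.dst, s.distance) for s in rnet.shortcuts.get(b, [])]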

6.2.3 Rnet Hierarchy

In ROAD, we structure a road network as a hierarchy of Rnets where large Rnets at the upper levels enclose smaller Rnets at lower levels. At each level, a network can be viewed as a layer of interconnected Rnets. This structure benefits various search scenarios. For objects located far away from a query node, a search range can be quickly expanded with long shortcuts in large Rnets. For objects that are close to query nodes, shortcuts in moderate-sized Rnets or even original edges can be used to reach the answer objects. To derive an Rnet hierarchy, we first treat the entire road network as a single Rnet that has no

border node and partition it into p1 partitioned Rnets. Definition 6.4 states the formal definition of Rnet partitioning. We refer to the original Rnet as the level-0 Rnet. The partitioned Rnets are the children of the Rnet they are partitioned from. At each subsequent level i, we partition each Rnet into pi child Rnets. As a result, at a level x (∈ [0, l]), the entire network is fully covered by Π_{i=1}^{x} pi interconnected Rnets. For an Rnet hierarchy of l levels, there are Σ_{h=0}^{l} Π_{i=1}^{h} pi Rnets in total.

Definition 6.4. Rnet partitioning. Partitioning an Rnet R = (N, E, B), where N, E and B are its sets of nodes, edges and border nodes and B ⊆ N, forms p child Rnets R1, R2, ..., Rp, where p > 1 and Ri = (Ni, Ei, Bi). Here, N = ∪_{1≤i≤p} Ni, E = ∪_{1≤i≤p} Ei, and B ⊆ ∪_{1≤i≤p} Bi. Also, the following three conditions must hold.

1. Edges of all child Rnets are disjoint, i.e., ∀i ∀j, i ≠ j ⇒ Ei ∩ Ej = ∅.

2. Nodes in an Rnet are connected by edges in the same Rnet, i.e., ∀i ∀(n, n') ∈ Ei, n ∈ Ni ∧ n' ∈ Ni.

3. Border nodes in an Rnet are common to its parent Rnet and some of its sibling Rnets, i.e., Bi = Ni ∩ (B ∪ ∪_{j∈([1,p]−{i})} Nj).

As illustrated in Figure 6.3, a network N is first partitioned into three Rnets, namely, R1, R2 and R3, each of which is then partitioned into two smaller Rnets, Ria and Rib, i ∈ [1, 3].

Consequently, R1, R2 and R3 form the first-level Rnets, and R1a, R1b, R2a, R2b, R3a, and R3b form the second-level Rnets. In the figure, n3 is common to both R2 and R3 and hence it is a border node corresponding to the level-1 Rnets. Meanwhile, it is shared by both R2b and R3a and is a border node of level-2 Rnets.

[Figure: a network N partitioned into level-1 Rnets R1, R2 and R3 (with level-1 shortcuts) and further into level-2 Rnets R1a, R1b, R2a, R2b, R3a and R3b (with level-2 shortcuts); query nodes nq and nq' and objects o1 and o2, among others, are marked.]

Figure 6.3. An example Rnet hierarchy

6.2.3.1 Network Partitioning Methods

An ideal network partitioning should generate equal-sized Rnets and minimize the number of border nodes, which in turn minimizes the number of shortcuts formed and maintained. However, this ideal network partitioning is known to be NP-complete [98]. In this study, we adopt a geometric approach [65] and the Kernighan-Lin (KL) algorithm [77]. The geometric approach first coarsely partitions a network into two by dividing the set of edges spatially such that the two resulting subnets have equal numbers of edges. The KL algorithm is then used to fine-tune the two resulting Rnets by exchanging edges between them until further exchanges do not

reduce the number of border nodes. We set pi to be a power of 2 (i.e., pi = 2^x, for x being a

positive integer) and recursively apply this binary partitioning until pi Rnets are formed (a simplified sketch is given below). This network partitioning approach is also used in [73]. Alternatively, partitioning can be based on network semantics. For instance, a country-wide road network can be partitioned into levels of states, counties, cities, and townships. Further, the network partitioning could be based on the distribution of objects. Since ROAD is a general-purpose framework that supports searches on various objects mapped onto the same spatial network at query time, our current network partitioning is performed independently of objects. We will study object-based network partitioning in future work.
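The sketch below (in Python) illustrates only the coarse geometric half of this procedure, under simplifying assumptions: edges are split at the median of their midpoint coordinates along alternating axes, recursion produces 2^depth parts of roughly equal size, and the KL refinement step is omitted; border nodes are then the nodes incident to edges of more than one part, in the spirit of Definition 6.1.

def binary_geometric_partition(edges, midpoint, depth):
    # edges: list of (n, n2); midpoint(edge) -> (x, y) of the edge's midpoint (assumed given)
    # returns 2**depth edge sets of roughly equal size
    if depth == 0:
        return [set(edges)]
    axis = depth % 2                                    # alternate the split axis
    ordered = sorted(edges, key=lambda e: midpoint(e)[axis])
    half = len(ordered) // 2
    return (binary_geometric_partition(ordered[:half], midpoint, depth - 1) +
            binary_geometric_partition(ordered[half:], midpoint, depth - 1))

def border_nodes(parts):
    # a node is a border node if it touches edges belonging to different parts
    owner, borders = {}, set()
    for pid, part in enumerate(parts):
        for n, n2 in part:
            for node in (n, n2):
                if node in owner and owner[node] != pid:
                    borders.add(node)
                owner.setdefault(node, pid)
    return borders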

6.2.3.2 Creation of Shortcuts and Object Abstracts

After an Rnet hierarchy is formed, object abstracts and shortcuts are constructed in a bottom-up fashion. As edges in child Rnets are fully covered by their parent Rnet (see Definition 6.4), the object abstract of an Rnet can therefore be constructed directly from those of its child Rnets, as stated by Lemma 6.1. On the other hand, the shortcuts of a border node can be determined by adopting Dijkstra's algorithm [40] to explore paths to all other border nodes in the same Rnet (a simple sketch of this computation is given after Lemma 6.4). To speed up shortcut computations, shortcuts in Rnets at level i can be calculated based on those in Rnets at level i+1 (as stated in Lemma 6.2). Further, the representation of shortcuts can be based on the shortcuts of child-level Rnets. Referring to our example Rnet hierarchy shown in Figure 6.3, the shortcut from n1 to n3, S(n1, n3), can be represented as (S(n1, nd), S(nd, n3)). To determine a detailed shortest path for this shortcut, S(n1, nd) and S(nd, n3) can be explored

at nodes n1 and nd, respectively.

Lemma 6.1. The object abstract of a parent Rnet R fully covers those of all its child Rnets R1, ..., Rp, i.e., O(R) = ∪_{1≤i≤p} O(Ri). Also, according to Definition 6.2, the object abstract of a smallest Rnet R (= (N, E, B)) is ∪_{e∈E} O(e).

Proof. The proof is straightforward and is omitted.

Lemma 6.2. Given an Rnet hierarchy, a shortcut S(b, b') between two border nodes of a level i Rnet R can be derived based on the shortcuts of level i+1 Rnets.

Proof. Suppose nodes b and b' are inside the level i+1 Rnets Rb and Rb', respectively. If Rb = Rb', S(b, b') must be the same as the shortcut linking b to b' in Rnet Rb. Otherwise, Rb ≠ Rb'. If Rb and Rb' are not adjacent, there must be at least one level i+1 Rnet, say R'', that bridges Rb to Rb'. Consequently, the shortcut S(b, b') starts in Rb, passes through R'', and reaches Rb'. As border nodes are the only entrances to/exits from Rnets, S(b, b') must go through border nodes of Rb, R'', and Rb'; consequently, the shortcuts that link those border nodes must be taken. On the other hand, if Rb and Rb' are adjacent, they share at least one border node, and through such a border node the shortcuts in both Rnets are connected. In all those cases, S(b, b') of a level i Rnet can be constructed from shortcuts in level i+1 Rnets, and the proof completes.

Besides, explored shortcuts in Rnets can be used to determine other shortcuts of Rnets at the same level, as indicated in Lemma 6.3. Lemma 6.2 and Lemma 6.3 help to efficiently compute and update shortcuts (as will be discussed later). To alleviate the storage cost of shortcuts, a shortcut S(b, b') that is composed of other shortcuts in the same Rnet does not need to be maintained, as stated in Lemma 6.4; when a search reaches b, it can transitively reach b' through the other shortcuts in the same Rnet. Similarly, when a shortcut S(n, n') completely covers the reverse path of S(n', n), the details of either one can be omitted to save storage space while the distances of both shortcuts are retained.

Lemma 6.3. Given Rnets R and R' at the same level of an Rnet hierarchy, if a shortcut S of R covers an edge (n, n') of R', there must be a shortcut S' of R' that covers (n, n'), and S must include S'.

Proof. Without loss of generality, we assume that a shortcut S(a, b) reaches Rnet R' at node n1 and leaves it at node n2, and that the path P between n1 and n2 inside R' passes by edge (n, n'). We prove this lemma by contradiction. Assume that i) no shortcut of R' passes by edge (n, n'), and ii) S passes by edge (n, n') but not via any shortcut of Rnet R'. As S reaches Rnet R', and the border nodes are the only entrances to/exits from an Rnet, nodes n1 and n2 must be border nodes. As S is a shortcut, its distance ||a, b|| = ||a, n1|| + |P| + ||n2, b|| must be minimized. Consequently, |P| = ||n1, n2||, which means P must be the shortest path between n1 and n2, i.e., the shortcut. This violates both assumptions i) and ii), and the proof is completed.

Lemma 6.4. Within an Rnet R, a shortcut S(b, b') between border nodes b and b' can be safely discarded if there exists another border node b'' such that S(b, b') exactly covers both S(b, b'') and S(b'', b') in R.

Proof. We omit the proof to save space.
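As an illustration of Definition 6.3, a straightforward (if unoptimized) way to materialize one Rnet's shortcuts is to run Dijkstra's algorithm from each of its border nodes and record the shortest-path distance to every other border node of the same Rnet. The Python sketch below does exactly this, reusing a dijkstra routine like the one sketched earlier in this chapter; it ignores the bottom-up computation of Lemma 6.2 and the storage savings of Lemma 6.4.

def compute_shortcuts(graph, border_nodes_of_rnet):
    # graph: {node: [(neighbor, distance), ...]} for the whole network
    # returns {(b, b2): ||b, b2||} for every ordered pair of distinct border nodes of one Rnet
    shortcuts = {}
    for b in border_nodes_of_rnet:
        dist = dijkstra(graph, b)            # as sketched earlier
        for b2 in border_nodes_of_rnet:
            if b2 != b and b2 in dist:
                shortcuts[(b, b2)] = dist[b2]
    return shortcuts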

6.2.4 Route Overlay and Association Directory

To facilitate network traversals that explore a network in a node-by-node fashion, we adopt a node-oriented storage scheme that associates nodes with edges and their corresponding distances to their neighboring nodes. As the network is formulated as a hierarchy of Rnets, one straightforward storage scheme is to store all Rnets, in which border nodes and shortcuts serve as nodes and edges, as separate networks in addition to the original network, as suggested in [73, 74]. This implementation, however, has to maintain separate structures and thus may complicate the search traversals, since search mechanisms need to switch between different networks. Based on Definition 6.4, which implies that the border nodes of parent Rnets are always border nodes of some of their child Rnets, our novel index structure, namely Route Overlay, naturally flattens the hierarchical network into a plain network and effectively avoids all the shortcomings of the separate-network implementation. In Route Overlay, nodes are indexed by a B+-tree with unique node IDs as search keys1. Each leaf entry of the B+-tree points to a node, together with a shortcut tree, i.e., a specialized tree index structure that organizes shortcuts and edges to facilitate search traversals. The structure of a shortcut tree is generally similar to an N-ary tree [110] except that i) non-leaf nodes, which represent Rnets, are associated with shortcuts, and ii) the number of branches of a node (i.e., the number of child Rnets) is not fixed. If a given node n is a border node, every non-leaf entry in n's shortcut tree maintains the shortcuts from n to the other border nodes of one Rnet for which n serves as a border node. Also, in a shortcut tree, parent Rnets are stored immediately above their child Rnets. The leaf entries store all the edges to the neighboring nodes of n. The shortcut tree of a non-border node has only one leaf node containing the edges to its neighboring nodes. Figure 6.4 shows a Route Overlay for our network presented in Figure 6.3. Take nq (a non-border node)

as an example. Its shortcut tree has only one leaf node that contains edges to nq's neighboring nodes, e.g., na and nq'. For na (a border node of Rnets R1a and R1b), its shortcut tree has two

levels. The first level points to Rnets R1a and R1b, together with shortcuts to other border nodes, e.g., n1 and n2. The second level keeps the edges to neighboring nodes, i.e., nb, nc, nq and nq'.

Next, our proposed efficient object lookup mechanism in ROAD, called Association Directory, also adopts a B+-tree, with unique node IDs or Rnet IDs as the search key. Associated with node n (n') is an object o in O(n, n') together with its distance δ(o, n) (δ(o, n')). Similarly, the object abstract of an Rnet R is associated with R.

1Besides the B+-tree, alternatives such as a hash index can be used.

[Figure: a Route Overlay; a B+-tree on node IDs points to each node and its shortcut tree, e.g., the single-leaf shortcut tree of nq (a non-border node) and the two-level shortcut tree of na (a border node of R1a and R1b).]

Figure 6.4. Route Overlay

As an Rnet may contain a number of objects, techniques such as aggregated attribute values [156], Bloom filters [20] and signatures [44] can be used to represent an object abstract with less storage overhead. Besides, nodes and Rnets that have no objects are not kept in the B+-tree, to further reduce the storage cost. If a node (or an Rnet) cannot be found in the B+-tree, no object is implied for that node (or Rnet).

[Figure: an Association Directory as a B+-tree on node IDs and Rnet IDs; nodes nf and ng point to object o1 with its edge offsets, while Rnets R3b and R3 are associated with {o1, o2}.]

Figure 6.5. Association Directory

Figure 6.5 depicts an Association Directory for objects o1 and o2 in our example. In the index, object o1 on edge (nf, ng) is pointed to by the nodes nf and ng. Moreover, objects o1 and o2 in Rnet R3b and its parent Rnet R3 are associated as {o1, o2} with those Rnets in the Association Directory. Depending on application needs, other objects, say oa, ob and oc, can be placed into the same Association Directory or into a separate one. This provides the flexibility of mapping various objects onto the same road network. Moreover, depending on application needs, multiple Association Directories that carry different types of objects can be accessed simultaneously.
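The following is a minimal Python sketch of an Association Directory as a plain mapping from node IDs and Rnet IDs to object entries; an absent key means the node or Rnet carries no object, mirroring the convention described above. The names are hypothetical, and a real implementation would, as described, use a disk-based B+-tree and possibly compressed object abstracts.

class AssociationDirectory:
    # in-memory stand-in for the disk-based B+-tree on node IDs and Rnet IDs
    def __init__(self):
        self.by_node = {}   # node ID -> [(object ID, delta(o, node)), ...]
        self.by_rnet = {}   # Rnet ID -> set of object IDs (the object abstract)

    def search_object(self, key):
        # key is a node ID or an Rnet ID; an absent key implies "no object"
        if key in self.by_node:
            return self.by_node[key]
        return self.by_rnet.get(key, set())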

6.3 Processing LDSQs on ROAD

While ROAD is designed to support different types of LDSQs, in this research we focus on k-nearest neighbor (kNN) queries and range queries. Our algorithms, based on the idea of network expansion upon ROAD, can perform searches efficiently since they navigate Rnets in detail only if those Rnets contain objects of interest; otherwise they bypass those Rnets. We first discuss the evaluation of kNN queries. To begin, we illustrate the basic idea with a simple network that consists of a chain of nodes, shown in Figure 6.6. The network is partitioned into three Rnets, each of which is further divided into two smaller Rnets. On this network, an NN query is issued at n2, and two objects o1 and o2 are located on edges (n11, n12) and (n12, n13), respectively. Also, in this network, nodes n3, n5, n7 and n9 are border nodes.

The search first expands from n2 to n1 and n3 inside R1a. The expansion is shown as a sequence of annotated arrows (arranged vertically) in the figure. Instead of following the physical edges to the right side of the network for objects, a shortcut S(n3, n5) at n3, i.e., the border node of

Rnets R1a and R1b, can be taken to bypass R1b (as no object is indicated by the corresponding object abstract) to reach n5.

[Figure: a chain network n1-n13 partitioned into Rnets R1-R3 and R1a-R3b with border nodes n3, n5, n7 and n9; starting from the query node n2, the search follows local edges inside R1a, then shortcuts S(n3, n5), S(n5, n9) and S(n9, n11) to bypass R1b, R2 and R3a, and reaches o1 on edge (n11, n12).]

Figure 6.6. An example 1NN query

Next, a longer shortcut S(n5, n9) at n5 is taken to skip R2 from detailed traversal. Further, the search at n9 reaches n11 via S(n9, n11). Now, as R3b contains objects, the traversal follows the original edges, and the object o1 is found after exploring n11. From the figure, we can see that the search takes only three jumps from n3 to n11, which significantly saves the traversal cost compared with traversing the original edges between the query node and the objects.

With the logic of network expansion as the basis, our Algorithm kNNSearch (outlined in Figure 6.7) incorporates shortcuts in Route Overlay and object abstracts in Association Directory to speed up the search. In general, it iteratively expands the search in a network from nq

by visiting the closest unexplored node. This gradual expansion guarantees that the first k objects satisfying the search condition are the kNN objects of the query point. We maintain a priority queue P to sort pending entries in non-descending order of distance from nq. Each entry (e, d) in P records a node or an object (e) and its distance (d) from nq.

The algorithm takes a Route Overlay (RO), an Association Directory (AD), a query node (nq), and a desired number of NNs (k) as inputs, and initially has all nodes and objects marked "unvisited". To start, P is initialized with (nq, 0) (line 1). Then, the algorithm repeatedly examines the head entry (e, d) of P until k answer objects are retrieved or the network is completely traversed (lines 2-12). The label "visited" of e is used to avoid revisiting e (line 4). Otherwise, if e refers to a node, two tasks are performed. SearchObject is first called to look up AD for objects o associated with the node, and they are put as (o, d + δ(o, e)) into P for later examination (lines 6-8). Next, Function ChoosePath, to be discussed next, is invoked to decide the subsequent nodes from e to continue the network expansion (line 9). When e is an object, it is collected into a result set Res (lines 10-11). Thereafter, e is marked "visited" (line 12). Finally, the answer objects are output and the search completes (line 13).

Algorithm kNNSearch(RO, AD, nq, k)
Input. Route Overlay (RO), Association Directory (AD), query node (nq) and the number of NNs (k)
Local. Priority queue (P)
Output. Result set (Res)
Begin
1.  enqueue(P, (nq, 0)); Res = ∅;
2.  while (P is not empty AND |Res| < k) do
3.    (e, d) ← dequeue(P);
4.    if (e is marked "visited") then goto 2;
5.    if (e is a node) then
6.      O ← SearchObject(AD, e);              // look up AD
7.      foreach (o, δ(o, e)) ∈ O do
8.        enqueue(P, (o, d + δ(o, e)));
9.      ChoosePath(RO, AD, P, e, d);          // see Figure 6.8
10.   else                                     // e is an object
11.     Res ← Res ∪ {e};                       // e is one of the result objects
12.   mark e visited;
13. output Res;
End.

Figure 6.7. The pseudo-code of the kNNSearch algorithm based on ROAD

With shortcut trees organizing shortcuts and edges in accordance with the Rnet hierarchy, Function ChoosePath (depicted in Figure 6.8) can quickly identify appropriate shortcuts and edges to expand the search from a node n. In brief, it examines the shortcut tree of n, loaded from Route Overlay, in a depth-first traversal manner (lines 2-12). At every non-leaf level, an Rnet R is checked against the Association Directory. If no object of interest is found, R, together with all its child Rnets, is bypassed, and the border nodes reachable by the shortcuts are enqueued into P. Otherwise (i.e., R contains objects of interest), the lookup goes down to the next lower level to examine its child Rnets in a similar fashion (lines 9-10). Once the search reaches the leaf level of the shortcut tree, all neighboring nodes connected by edges are collected (lines 11-12). If n is a non-border node, its shortcut tree contains only edges and all the corresponding neighboring nodes are put into P. (A runnable sketch of the combined search procedure is given after Figure 6.8.)

Algorithm ChoosePath(RO, AD, P, n, d)
Input. Route Overlay (RO), Association Directory (AD), a priority queue (P), a node (n), distance (d)
Local. Stack (S)
Begin
1.  T ← LoadShortcutTree(RO, n);
2.  push(S, T.root);
3.  while (S is not empty) do                  // search in the shortcut tree
4.    s ← pop(S);
5.    if (s is not a leaf) then
6.      foreach Rnet R of s do
7.        if (SearchObject(AD, R) has no object) then
8.          enqueue(P, (b, d + ||n, b||)) for all S(n, b) of R;
9.        else
10.         push all of s's children to S;
11.   else                                      // leaf node
12.     enqueue(P, (n', d + |n, n'|)) for all edges (n, n') in s;
End.

Figure 6.8. The pseudo-code of the function ChoosePath
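For concreteness, the following is a compact, runnable Python rendering of the combined kNNSearch/ChoosePath logic under simplifying assumptions: each node's shortcut tree is represented as nested tuples, either ('edge', neighbor, distance) or ('rnet', rnet_id, shortcuts, children); objects and object abstracts are looked up through the simplified Association Directory sketched above; node and object IDs are assumed to be comparable strings; and detailed paths and the disk layout are ignored. It is a sketch of the algorithms in Figures 6.7 and 6.8, not the actual Route Overlay implementation.

import heapq

def knn_search(route_overlay, ad, nq, k, wanted):
    # route_overlay: node -> list of shortcut-tree entries, each either
    #   ('edge', neighbor, distance) or
    #   ('rnet', rnet_id, [(border_node, distance), ...], [child entries])
    # ad: simplified Association Directory (by_node / by_rnet dictionaries)
    # wanted(obj_id) -> bool encodes the attribute predicate A
    heap, visited, result = [(0.0, 'node', nq)], set(), []
    while heap and len(result) < k:
        d, kind, item = heapq.heappop(heap)
        if (kind, item) in visited:
            continue
        visited.add((kind, item))
        if kind == 'object':
            result.append((item, d))            # the next nearest qualifying object
            continue
        # enqueue qualifying objects that lie on edges incident to this node
        for obj_id, delta in ad.by_node.get(item, []):
            if wanted(obj_id):
                heapq.heappush(heap, (d + delta, 'object', obj_id))
        # ChoosePath: walk this node's shortcut tree depth-first
        stack = list(route_overlay.get(item, []))
        while stack:
            entry = stack.pop()
            if entry[0] == 'edge':
                _, nbr, w = entry
                heapq.heappush(heap, (d + w, 'node', nbr))
            else:
                _, rnet_id, shortcuts, children = entry
                if any(wanted(o) for o in ad.by_rnet.get(rnet_id, ())):
                    stack.extend(children)       # descend: the Rnet holds objects of interest
                else:
                    for border, w in shortcuts:  # bypass the entire Rnet via shortcuts
                        heapq.heappush(heap, (d + w, 'node', border))
    return result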

To visualize Algorithm kNNSearch based on ROAD in comparison with other existing approaches, Figure 6.9 shows the evaluation of a 3NN query on 5 objects based upon the California road network [92] (see Section 6.5 for details). As indicated, our algorithm takes the shortest search time and the lowest I/O cost as it quickly expands the search range and reaches the objects via shortcuts. It outperforms other existing works that include network expansion based

[Figure: traversed network regions for a 3NN query under (a) ROAD (475 msec, 230 pages of I/O), (b) network expansion (1,203 msec, 297 pages), (c) Euclidean distance bound (8,422 msec, 1,729 pages, with false candidate objects), and (d) distance index (625 msec, 285 pages).]

Figure 6.9. Illustrations on different approaches for a 3NN query

approaches, the Euclidean distance bound approach and the distance index. Network expansion based approaches span a very large network area. Euclidean distance bound approaches include false candidate objects. Distance Index incurs a high I/O cost and a long search time due to loading a large number of distance signatures, although it has pre-computed paths towards the answer objects. Our algorithm consistently outperforms all those representative approaches, as evaluated in the extensive experiments to be presented in Section 6.5.

Algorithm RangeSearch, which evaluates range queries, is similar to Algorithm kNNSearch with a slight difference: the search ends once the portion of the network within the distance bound is completely traversed, rather than when a specified number of objects has been found. All visited objects are the answer objects. To save space, we omit the discussion of this algorithm.

6.4 ROAD Framework Maintenance

This section presents the maintenance of the ROAD framework in the presence of updates, which include object changes and network changes. Owing to its clear separation between objects and the network, ROAD can handle these two aspects of updates efficiently, as discussed below.

6.4.1 Object Update

Object changes are handled in the Association Directory, independently of Route Overlay. To insert an object located on a certain edge (n, n') enclosed by an Rnet R, we associate the object with the nodes n and n' and update the object abstracts of the corresponding Rnet R and its ancestor Rnets in the Association Directory. For object deletion, we simply remove the association of the object from the corresponding edge and from the object abstracts of the corresponding Rnets in the Association Directory. For changes of object attributes, we update the object abstracts associated with the corresponding nodes and Rnets.
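Under the simplified in-memory Association Directory sketched in Section 6.2.4 (plain dictionaries instead of a B+-tree, and object abstracts stored as explicit ID sets rather than Bloom filters or signatures), object insertion and deletion reduce to the following Python sketch; ancestor_rnets is assumed to list the enclosing Rnet and all of its ancestors.

def insert_object(ad, obj_id, n, n2, delta_n, delta_n2, ancestor_rnets):
    # associate the object with both end nodes of its edge (n, n2) ...
    ad.by_node.setdefault(n, []).append((obj_id, delta_n))
    ad.by_node.setdefault(n2, []).append((obj_id, delta_n2))
    # ... and add it to the object abstracts of the enclosing Rnet and its ancestors
    for rid in ancestor_rnets:
        ad.by_rnet.setdefault(rid, set()).add(obj_id)

def delete_object(ad, obj_id, n, n2, ancestor_rnets):
    # remove the object's associations from the edge's end nodes and the object abstracts
    for node in (n, n2):
        ad.by_node[node] = [(o, dlt) for (o, dlt) in ad.by_node.get(node, []) if o != obj_id]
    for rid in ancestor_rnets:
        ad.by_rnet.get(rid, set()).discard(obj_id)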

6.4.2 Network Update

Road conditions and the road network structure change over time. Instead of immediately rebuilding a Route Overlay upon changes, which is expensive, we propose several techniques to incrementally update Route Overlay for edge distance changes and network structure changes.

6.4.2.1 Change of Edge Distance

In ROAD, when the distance of an edge, which represents the travel distance, trip time or cost of a road segment, changes (increases or decreases), some shortcuts that represent shortest paths might become invalid and have to be updated. These updates only affect Route Overlay but not the objects. To avoid unnecessary shortcut re-computations, ROAD adopts a filtering-and-refreshing approach that consists of two steps. In the "filtering" step, shortcuts that may be affected by the change are identified. The identified shortcuts are then updated in the "refreshing" step. According to Lemma 6.2 defined in Section 6.2, the update of shortcuts related to level i Rnets in an Rnet hierarchy is not necessary unless shortcuts related to level i+1 Rnets are updated. Thus, in the following, we only explain how to re-compute shortcuts at the bottom level; the same idea can be applied to the upper levels. Also, based on Lemma 6.3, an edge that is not covered by shortcuts in its own Rnet is definitely not covered by shortcuts in other Rnets

at the same level. Therefore, we examine the shortcuts in the Rnet that encloses the changed edge first. If no shortcut update is incurred, the update can be safely terminated. Suppose an edge (n, n') changes its distance |n, n'| from d to d'; the detailed update procedure is discussed below.

Edge distance increased (i.e., d < d'). When the distance of an edge (n, n') in an Rnet R is increased from d to d', only those shortcuts that cover (n, n') might become invalid and need to be refreshed. In the filtering step, we identify shortcuts that pass through (n, n'). Observing that a shortcut S(b, b') covering (n, n') should have ||b, b'|| equal to ||b, n|| + |n, n'| + ||n', b'|| (where we consider |n, n'| before the update, i.e., d), we search for affected shortcuts by finding the shortest paths from each end node (n and n') to the border nodes in R and then identifying shortcuts whose distances are equal to that of the path passing through (n, n'). In the second phase, all the identified shortcuts are re-evaluated. If no shortcuts are refreshed, the update terminates. Otherwise, the update is propagated up to the parent level, with border nodes and shortcuts at the current level treated as nodes and edges, respectively.

Edge distance decreased (i.e., d > d'). When the distance of an edge (n, n') in an Rnet R is decreased from d to d', it may contribute to paths shorter than some existing shortcuts. In this case, those shortcuts need to be identified and refreshed. In the filtering step, we test whether the distance of a path from a border node b via (n, n') to another border node b' (with |n, n'| = d', the new edge distance) is shorter than the distance of the shortcut S(b, b'). Here we expand from n and n' to reach border nodes and to determine the distances, as shown in Figure 6.10(a). Once ||b, n|| + |n, n'| + ||n', b'|| < ||b, b'||, the shortcut S(b, b') between border nodes b and b' is identified as affected. In the refreshing phase, those identified shortcuts are replaced by the new paths passing through edge (n, n'). Again, the update process is propagated to the parent level if any shortcuts are updated.
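The filtering test just described can be written down directly. The Python sketch below (with hypothetical names, assuming an undirected network, and reusing a dijkstra routine as sketched earlier) flags the shortcuts of the enclosing Rnet that may be affected by changing |n, n'| from d_old to d_new; it is a conservative filter, since every flagged shortcut is re-evaluated in the refreshing step anyway.

def affected_shortcuts(graph, shortcuts, n, n2, d_old, d_new):
    # shortcuts: {(b, b2): current ||b, b2||} for the Rnet enclosing edge (n, n2)
    INF = float('inf')
    dist_n, dist_n2 = dijkstra(graph, n), dijkstra(graph, n2)   # as sketched earlier
    affected = []
    for (b, b2), cur in shortcuts.items():
        via_old = min(dist_n.get(b, INF) + d_old + dist_n2.get(b2, INF),
                      dist_n2.get(b, INF) + d_old + dist_n.get(b2, INF))
        via_new = min(dist_n.get(b, INF) + d_new + dist_n2.get(b2, INF),
                      dist_n2.get(b, INF) + d_new + dist_n.get(b2, INF))
        if d_new > d_old and cur == via_old:      # increase: the shortcut ran over (n, n2)
            affected.append((b, b2))
        elif d_new < d_old and via_new < cur:     # decrease: the new path beats the shortcut
            affected.append((b, b2))
    return affected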

6.4.2.2 Change of Network Structure

When new roads are constructed or existing roads are closed, the corresponding network topology is changed. We model these changes as additions or deletions of nodes and edges. As changes of nodes result in changes of edges, we treat changes of nodes as special cases of changes of edges, and only consider the addition and deletion of edges below. Again, we update the network at the bottom level first and propagate the updates to the parent levels if necessary.

Addition of a new edge. A newly added edge (n, n') directly connects two nodes n and n'; assume that n and n' belong to Rnets R and R', respectively. There are two possible cases:

[Figure: (a) edge distance decrease — comparing ||b, n|| + |n, n'| + ||n', b'|| (with the new distance d') against ||b, b'|| for border nodes b and b' of Rnet R; (b) edge addition and deletion — adding (na, nb) within R1, adding (nc, nd) across R1 and R2 (promoting nd to a border node), and deleting (ne, nf) and (nf, ng).]

Figure 6.10. Network changes

(1) R = R' and (2) R ≠ R'. We handle them in the following.

• Case 1: R = R' (i.e., both nodes are located inside the same Rnet). Adding an edge connecting two nodes (e.g., (na, nb) in Figure 6.10(b)) can be treated as changing the distance of an edge from infinity to the edge distance. The previously discussed edge distance update mechanism can be applied here. Accordingly, the Route Overlay is updated as well to store the new edge and the new nodes (if any).

• Case 2: R ≠ R' (i.e., the nodes are located in different Rnets). Since an edge can only be included in one Rnet (say R), the node n', which does not belong to R, has to be promoted to a border node between R and R'. In Figure 6.10(b), the introduction of (nc, nd) to R1 gets nd promoted to a border node. Also, the new edge (n, n') might affect some shortcuts; the update approach for the change of edge distance can be applied here. As a new border node is introduced, new shortcuts linking the new border node to the other border nodes in the same Rnet have to be created.

Deletion of an existing edge. Deleting an edge (n, n') breaks the link between two nodes n and n'. Consider deleting (ne, nf) in R2 in Figure 6.10(b). Its deletion can be managed by handling the change of its edge distance to infinity and updating the affected shortcuts. In addition, it is possible that one of the end nodes of a deleted edge is a border node. If all the edges of n are within one Rnet after the deletion of edge (n, n'), n is no longer a border node. As shown in Figure 6.10(b), after deleting (nf, ng), ng becomes a non-border node. Then, the shortcut trees of n and of other border nodes in the related Rnets in Route Overlay have to be updated.

Parameter              Value (* = default)
Network                CA* (21,048 nodes, 21,693 edges), NA (175,813 nodes, 179,179 edges), SF (174,956 nodes, 223,001 edges)
No. of objects (|O|)   10, 50, 100*, 500, 1000
Partition factor (p)   4*
No. of levels (l)      2, 3, 4*, 5, 6 for CA; 6, 7, 8*, 9, 10 for NA and SF
Query                  kNN query and range query
No. of NNs (k)         1, 5*, 10
Search range (r)       0.05, 0.1*, 0.2 of network diameter
Table 6.1. Evaluation parameters

6.5 Performance Evaluation

This section evaluates our proposed ROAD framework in terms of indexing overhead, maintenance overhead, and query performance. We applied ROAD (labeled as ROAD hereafter) to three real road networks, namely, CA, NA and SF, obtained from [92]. CA and NA consist of highways in California, USA and in North America, respectively. SF is composed of streets and roads in San Francisco. In this evaluation, we simulate one type of object on which all the queries are evaluated, while ROAD can handle diverse objects. Objects, with their number varying from 10 to 1,000, are evenly distributed over the road networks2. The fewer objects there are in the network, the larger the search subspaces that contain no objects and can be pruned. Thus, an efficient approach should be able to return the search result quickly when only a small number of objects is present. Table 6.1 summarizes the evaluation parameters, their values and the defaults used in the experiments.

In addition to ROAD, we implement network expansion [106], the Euclidean-based approach [106, 155] and Distance Index [58] (labeled as NetExp, Euclidean, and DistIdx, respectively) in GNU C++ for comparison. We adopt CCAM [121] to organize network nodes in storage for all the approaches. For NetExp, objects are stored with network nodes. For Euclidean, objects are indexed by an R-tree and the A* algorithm [37] is used to determine objects' network distances from query nodes. For DistIdx, distance signatures are stored with network nodes. We adopt exact object distances in the distance signatures to provide the optimal search performance.

We measure the performance of all the approaches according to four commonly used performance metrics, namely,

2ROAD can benefit more from uneven object distribution that more empty subspaces can be pruned. 153 formance metrics, namely,

• Index construction time. This measures the elapsed time to construct an index.

• Index size. This measures storage consumed to store an index.

• Index update time. This measures the time spent on updating an index in the presence of object and network changes.

• (Query) Processing time. This measures the duration from the time a query is initiated to the time a complete result is obtained.

All indices are stored on disk. In our disk storage, the page size is fixed at 4KB. We also employ a memory cache of 50 pages with an LRU replacement scheme to buffer loaded pages. In every run, a query is initialized with an empty cache. All experiments were conducted on Linux 2.6.9 servers with Intel Xeon 3.2GHz CPUs. In what follows, we evaluate the indexing overhead, the index maintenance overhead, and the query performance, followed by the Rnet hierarchy settings.
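For concreteness, the page buffer used in the experiments can be pictured as in the following sketch: a fixed number of page slots managed with an LRU replacement scheme. This is only an illustration of the buffering policy, not the actual experimental harness; all names are illustrative.

#include <cstdio>
#include <list>
#include <unordered_map>
#include <vector>

class LruPageBuffer {
public:
    explicit LruPageBuffer(std::size_t capacity) : capacity_(capacity) {}

    // Returns true on a buffer hit; on a miss the page is (notionally) read from
    // disk and inserted, evicting the least recently used page if necessary.
    bool access(int pageId) {
        auto it = index_.find(pageId);
        if (it != index_.end()) {                       // hit: move page to the front
            lru_.splice(lru_.begin(), lru_, it->second);
            return true;
        }
        if (lru_.size() == capacity_) {                 // miss with full buffer: evict LRU page
            index_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(pageId);
        index_[pageId] = lru_.begin();
        return false;
    }

private:
    std::size_t capacity_;
    std::list<int> lru_;                                // most recently used page at the front
    std::unordered_map<int, std::list<int>::iterator> index_;
};

int main() {
    LruPageBuffer buffer(50);                           // 50 pages of 4KB each
    std::vector<int> trace = {1, 2, 3, 1, 4, 2};        // a toy page-access trace
    int hits = 0;
    for (int page : trace) hits += buffer.access(page) ? 1 : 0;
    std::printf("hits: %d / %zu\n", hits, trace.size());
    return 0;
}

Counting hits over a page-access trace in this way is how buffer effectiveness is typically gauged; the experiments below report end-to-end processing time instead.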

6.5.1 Indexing Overhead

The first set of experiments evaluates the index construction time and index size of all the approaches for various numbers of objects and networks. We consider NetExp, which has no index on objects, as the baseline in this evaluation. Because of the different network sizes, we fix p to 4 for all the networks, while l is set to 8 for NA and SF and to 4 for CA. We shall study the impact of the Rnet hierarchy level (l) settings later.

Figure 6.11 shows the index construction time (in seconds) and index size (in megabytes) for varying numbers of objects on CA (in log scale). As shown in the figure, NetExp, Euclidean, and ROAD incur almost constant index construction time (a few minutes) and index size (a few MBs), while DistIdx increases drastically in both construction time and index size. For 1,000 objects, DistIdx takes more than 240MB of index storage and nearly half an hour to build an index, and the costs keep growing as the number of objects increases. This finding reveals that DistIdx is not practical for realistic applications.

Figure 6.12 shows the index construction time and index size for the different networks with the number of objects fixed at 100. As shown in the figure, NetExp and Euclidean incur very short index construction times and take less storage. Both DistIdx and ROAD vary with the networks; however, they differ a lot in terms of efficiency. DistIdx takes more than 4 hours

Figure 6.11. Performance of index construction on various cardinalities: (a) index construction time (seconds) and (b) index size (MB) on CA, varying the number of objects (|O|) from 10 to 1000 (log scale).

Figure 6.12. Performance of index construction on various networks: (a) index construction time and (b) index size for CA, NA, and SF with |O| = 100.

to build and more than 210MB to store an index for NA and SF. ROAD incurs a considerably shorter construction time (about 1 hour) and less storage space (<100MB). For SF, ROAD requires only about 25% of the indexing time and about 33% of the index size of DistIdx. Recall that the cost of DistIdx increases as more objects are included. In contrast, the index construction cost of ROAD is attributed only to the formation of the Route Overlay, which is completely independent of the number of objects. For a large system that needs to support a huge number of different object types, the index construction time and index storage incurred by ROAD can be amortized, whereas the object-dependent costs incurred by DistIdx cannot. While the indexing costs of ROAD are expected to be higher than those of NetExp and Euclidean, as shown next, ROAD is actually very efficient for updates and query processing.

6.5.2 Maintenance Overhead

Here, we evaluate the index update time for object changes and network changes. We first evaluate the update time for object changes. In this experiment, we delete one randomly picked object from a network and then add it back at a random location. We repeat the deletion/insertion 100 times. The average performance of insertions and deletions is presented in Figure 6.13. DistIdx incurs several orders of magnitude higher update costs than the others. For NA and SF, it takes about 2 minutes to finish one object deletion or addition. This is because it has to traverse the entire network to update all distance signatures. In contrast, NetExp, Euclidean, and ROAD can handle an update within 0.1 second for all the networks.

Figure 6.13. Performance of object updates: (a) object deletion time and (b) object insertion time (seconds) on CA, NA, and SF with |O| = 100.

Similarly, we perform a network change by randomly removing one edge (setting its edge distance to infinity) and adding it back (recovering its original distance). The average performance of 100 trials is presented in Figure 6.14. The edge change has almost no observable impact on NetExp and Euclidean. However, for DistIdx, the distance signatures of many nodes have to be reexamined and updated, which involves many disk-write operations. In contrast, ROAD only needs to update the affected shortcuts between some border nodes. Thus, it has considerably lower update costs than DistIdx and takes less than 2 seconds for NA and SF. NetExp and Euclidean are very update-efficient; however, they are not query-efficient, as evaluated next.

6.5.3 Query Performance

Further, we measure the query performance of all the approaches over different numbers of objects, networks, and query types. We evaluated 100 queries issued at random positions and report the average performance in terms of processing time.

Evaluation based on kNN query. Our first experiment evaluates the query performance of kNN queries against different factors, namely, the query parameter k, the number of objects in the network, and different networks. We first evaluate the query parameter k, varying it from 1 to

Figure 6.14. Performance of network updates: (a) edge deletion time and (b) edge insertion time (seconds) on CA, NA, and SF with |O| = 100.

5 and then to 10, while the network and the number of objects are fixed at CA and 100, respectively. The results for all approaches are plotted in Figure 6.15(a). From the figure, we can see that Euclidean takes the longest processing time for all evaluated k's, as it suffers from false hits and incurs redundant shortest path searches over the same portion of the network. DistIdx performs slightly better than NetExp: its distance signatures guide the search towards result objects, which saves network traversal overhead, but these bulky distance signatures incur higher I/O that offsets much of the performance gain. For all evaluated k's, ROAD consistently performs the best, which is attributed to the effectiveness of search space pruning.

Figure 6.15(b) plots the results obtained by varying the object cardinality from 10 up to 1,000 while the network and k are fixed at CA and 5. In general, increasing the object cardinality reduces the average distance of objects from a query point, and thus the search range is reduced. As a result, performance (i.e., processing time) should improve. However, as shown in the figure, the performance of Euclidean and DistIdx improves initially and then degrades. For Euclidean, this is because of an increase in false hits; for DistIdx, it is caused by the increased size of the distance signatures. On the other hand, the performance of both NetExp and ROAD improves continuously, as we expected. It is noteworthy that the difference between them narrows, since ROAD is also expansion-based; when objects are close to the query points, they are likely to be found within the same Rnets where the query points are located.

Finally, we evaluate the kNN query for all the approaches on the various networks, with k and |O| fixed at 5 and 100, respectively. The result is depicted in Figure 6.15(c). As observed from the results with CA, Euclidean performs the worst. It is noteworthy that although NA and SF have similar numbers of network nodes, they have different numbers of edges and different node densities, and the approximation of network distance by Euclidean distance is more appropriate for SF than for NA. DistIdx performs slightly better than NetExp. Last, ROAD performs the best.

Figure 6.15. Performance of kNN query evaluation: (a) various k (CA, |O| = 100), (b) various object cardinalities (CA, k = 5), and (c) various networks (|O| = 100, k = 5); processing time in msec.

Evaluation based on range query. Our second evaluation is on range queries over the different approaches. We first evaluate the query parameter r, which represents the search range. Figure 6.16(a) shows the results obtained by varying r from 0.05 up to 0.2 of the network diameter while using CA and fixing |O| at 100. In general, the processing times of all approaches increase as r increases. Among all the approaches, ROAD consistently performs better than the others. DistIdx takes a shorter processing time when a small r is used; however, when r is increased, the performance of DistIdx drops due to the high cost of loading a large number of distance signatures. Euclidean performs the worst.

Next, we evaluate the impact of the object cardinality (|O|), which ranges from 10 up to 1,000. The result is plotted in Figure 6.16(b). Due to the fixed range (r = 0.1), the network traversal cost is roughly fixed, so NetExp behaves as expected. However, for different reasons, all the other approaches have their processing times increased: the increase for Euclidean is caused by false hits, that for DistIdx is attributed to the increased size of the distance signatures, and that for ROAD reflects the diminishing opportunity for search space pruning. ROAD nevertheless benefits from pruning, but as the object cardinality increases, its performance gets closer to that of NetExp, since ROAD then performs almost the same as network expansion. Finally, the results based

Figure 6.16. Performance of range query evaluation: (a) various ranges r (CA, |O| = 100), (b) various object cardinalities (CA, r = 0.1), and (c) various networks (|O| = 100, r = 0.1); processing time in msec.

on the various networks are plotted in Figure 6.16(c). In general, the observations are much the same as those obtained from the experiments on kNN queries and can be explained similarly.

6.5.4 Evaluation of the Rnet Hierarchy

Last, we evaluate the impact of the number of levels (l) on the index overhead and the query processing time. In this evaluation, we vary l from 2 to 6 for CA and from 6 to 10 for both NA and SF. To be specific, we measure the index construction time and the processing time of the kNN query (k = 5) while the number of objects for all the networks is fixed at 100 and p is fixed at 4. The results are shown in Figure 6.17. As the number of Rnet hierarchy levels increases, the index construction time increases while the query processing time drops rapidly. Although no absolute optimal points are found, we can see a significant drop in query processing time at l = 4 for CA and l = 8 for NA and SF, which we therefore used as the default Rnet hierarchy levels in our experiments.

Figure 6.17. Performance versus the Rnet hierarchy level (l): index construction time and kNN query processing time for (a) CA, (b) NA, and (c) SF (p = 4).

6.6 Chapter Summary

The rapid growth of LBSs fosters a need for efficient search algorithms for LDSQs. In the meantime, the ongoing trend of LBSs demands a system framework that can flexibly accommodate diverse objects, provide efficient processing of various LDSQs, and support different distance metrics. To meet those needs, we have proposed the ROAD framework for efficient LDSQ processing. The design of ROAD achieves a clean separation between objects and the network for better system flexibility and extensibility. It exploits search space pruning, an effective and powerful technique for object search. Upon this framework, efficient search algorithms for common LDSQs, namely range and kNN queries, are devised, and incremental framework maintenance techniques are developed. Via comprehensive experiments on real road networks, our ROAD framework is shown to outperform the state-of-the-art techniques.

Chapter 7

Conclusions

The demand for pervasive access to location-related information has fostered a tremendous application base of LBSs. With location-related information and the accesses of such information modeled as spatial objects and LDSQs, respectively, we have proposed in this thesis a set of innovative techniques to improve LDSQ processing efficiency and have demonstrated their effectiveness. In this conclusion, we summarize the thesis and discuss our future research plans.

7.1 Summary of the Thesis

In brief, we have proposed and studied different techniques for efficient LDSQ processing. These include Complementary Space Caching (CSC), R-tree based algorithms for Nearest Surrounder (NS) queries, a processing framework for skyline queries based on Z-order space-filling curves, and the ROAD framework for supporting LDSQ processing on road networks. Each of these techniques, built on innovative ideas, has produced important results, as summarized in the following.

• Complementary Space Caching (CSC). The main difference between CSC and other caching models is its capability to equip the client with knowledge of the object distribution in a geographical space. Generally speaking, all objects are kept in the cache at different levels of granularity according to their likelihood of being accessed. To be specific, the cached objects and the complementary regions (CRs) that cover the locations of all the non-cached objects together present a global data view stored in the cache. To maintain the integrity of the global data view at all times, the query processing and cache management mechanisms have to take care of adjusting the representation of objects, which makes them completely different from the corresponding operations of conventional caching schemes. Important design and implementation issues of CSC have been addressed in detail in our study. (A minimal sketch of the local completeness check enabled by CRs is given after this list.)

• Nearest Surrounder (NS) Queries. NS queries search for objects based on their orientations and distances with respect to given query points, providing more informative results than conventional nearest neighbor queries, which only examine object distances. We extended NS queries to multi-tier NS queries, angle-constrained NS queries, and multi-tier angle-constrained NS queries. To handle NS queries and their extensions, we explored the angular properties of both polygons (which represent the object shapes) and R-tree MBBs. It is worth noting that these properties had not been explored in the literature before, even though object orientation is an important aspect to be considered in spatial queries. In addition, we examined object access strategies based on object orientations or distances, which in turn led to the development of our Sweep and Ripple algorithms. The former scans the search space angularly, whereas the latter is based on distance information. Both algorithms can progressively render a query result and can efficiently process an NS query within one index lookup.

• Skyline Queries. Skyline queries retrieve all non-dominated objects in a given dataset. We have identified an insightful connection between dominance relationships and Z-order space-filling curves (Z-order curves for short). By organizing objects based on their locations along a Z-order curve via a ZBtree index, our ZSearch algorithm enables block-based dominance tests (i.e., it determines whether a block of objects is dominated without examining the individual objects inside) and can efficiently determine skyline objects. Based on the same idea of accessing data in Z-order, we developed ZBand, ZRank, k-ZSearch, and ZSubspace to evaluate several skyline query variants, namely, skyband queries, top-ranked skyline queries, k-dominant skyline queries, and subspace skyline queries, respectively. All our proposed algorithms are based on the same concept, and they all outperform the existing state-of-the-art approaches specialized for skyline queries or their variants. As indexes based on Z-order curves are already implemented in some existing database systems, the proposed algorithms can be seamlessly integrated into existing systems to support skyline queries. (A small sketch of the Z-order address computation and the block-level dominance test is given after this list.)

• ROAD framework. The ROAD framework was designed to support efficient evaluation of LDSQs on road networks. It employs an effective search space pruning technique to shrink the search space, which had not been explored by existing studies on network-distance-based LDSQ processing. In order to enable efficient space pruning, we formulated a large road network as a hierarchy of regional subnets (called Rnets). Two separate indexes were introduced to index the objects of interest and the underlying road network, respectively, with the aims of (i) supporting LDSQs on different types of objects and (ii) facilitating flexible object and network updates. More specifically, we introduced the Route Overlay (one index) to maintain the basic network connectivity and the shortcuts across Rnets, and the Association Directory (the other index) to map objects to edges and Rnets. The ROAD framework is named after these two major index structures.
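To make the CSC summary in the first item above concrete, the following is a minimal sketch, not the actual CSC implementation: cached objects and complementary regions are reduced to points and axis-aligned rectangles, a range LDSQ is answered from the cache, and any CR overlapping the query range signals that non-cached objects may also qualify, so the server must still be contacted. All names are illustrative.

#include <cstdio>
#include <vector>

struct Rect { double xlo, ylo, xhi, yhi; };
struct Point { double x, y; };

bool contains(const Rect& r, const Point& p) {
    return p.x >= r.xlo && p.x <= r.xhi && p.y >= r.ylo && p.y <= r.yhi;
}
bool overlaps(const Rect& a, const Rect& b) {
    return a.xlo <= b.xhi && b.xlo <= a.xhi && a.ylo <= b.yhi && b.ylo <= a.yhi;
}

// Returns the locally answerable part of a range query and, via 'complete',
// whether the client can assert that no non-cached object is missing.
std::vector<Point> rangeQuery(const Rect& range,
                              const std::vector<Point>& cachedObjects,
                              const std::vector<Rect>& complementaryRegions,
                              bool& complete) {
    std::vector<Point> result;
    for (const Point& o : cachedObjects)
        if (contains(range, o)) result.push_back(o);
    complete = true;
    for (const Rect& cr : complementaryRegions)
        if (overlaps(range, cr)) { complete = false; break; }   // must contact the server
    return result;
}

int main() {
    std::vector<Point> cached = {{1, 1}, {2, 3}};
    std::vector<Rect> crs = {{5, 5, 9, 9}};           // non-cached objects live only here
    bool complete = false;
    auto res = rangeQuery({0, 0, 4, 4}, cached, crs, complete);
    std::printf("local answers: %zu, complete: %s\n", res.size(), complete ? "yes" : "no");
    return 0;
}

Similarly, the following sketch illustrates, under simplifying assumptions, the two ingredients behind the Z-order based skyline processing in the third item above: a Z-order address obtained by interleaving coordinate bits, and a dominance test applied to a block's minimum corner so that a whole block of objects can be discarded at once. The actual ZBtree and ZSearch machinery is considerably more involved; names are illustrative and smaller attribute values are assumed to be preferable.

#include <cstdint>
#include <cstdio>
#include <vector>

// Interleave the bits of x and y (16-bit coordinates) into a 32-bit Z-address.
uint32_t zAddress(uint16_t x, uint16_t y) {
    uint32_t z = 0;
    for (int b = 0; b < 16; ++b) {
        z |= static_cast<uint32_t>((x >> b) & 1u) << (2 * b);
        z |= static_cast<uint32_t>((y >> b) & 1u) << (2 * b + 1);
    }
    return z;
}

// p dominates q if p is no worse in every dimension and strictly better in at
// least one (smaller values preferred).
bool dominates(const std::vector<uint16_t>& p, const std::vector<uint16_t>& q) {
    bool strictlyBetter = false;
    for (std::size_t i = 0; i < p.size(); ++i) {
        if (p[i] > q[i]) return false;
        if (p[i] < q[i]) strictlyBetter = true;
    }
    return strictlyBetter;
}

int main() {
    std::printf("Z(3,5) = %u\n", static_cast<unsigned>(zAddress(3, 5)));

    // Block-level pruning: if some skyline point dominates the block's minimum
    // corner, every object inside the block is dominated as well.
    std::vector<uint16_t> skylinePoint = {2, 2};
    std::vector<uint16_t> blockMinCorner = {4, 6};    // best possible object in the block
    if (dominates(skylinePoint, blockMinCorner))
        std::printf("block pruned without examining its objects\n");
    return 0;
}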

To validate the performance of all the proposed techniques, we have conducted an extensive set of experiments. All of them demonstrate clear advantages over the state-of-the-art approaches in the corresponding domains.

7.2 Future Research

In the future, we plan to build upon the results of this study to explore possible enhancements and extensions of the proposed techniques, investigate other issues related to LDSQ processing, and examine new emerging platforms that can support LBSs.

7.2.1 Enhancements and Extensions of Proposed Techniques

Although we have conducted a thorough study for each of our proposed techniques, we find that, as with any challenging work, there is still room for improvement. In the following, we discuss the possibilities for enhancing and extending individual techniques.

• CSC. By providing knowledge of the whereabouts of all the non-cached objects in CRs, CSC enables mobile clients to determine whether additional objects need to be downloaded from the server for given LDSQs. In situations where client movement and access patterns are available, the CRs that indicate the existence of non-cached objects can be used to predict the likelihood of data access. In other words, CSC offers an opportunity for clients to pre-fetch data that is likely to be accessed, so that the data is ready for use in advance. Consequently, the response time of LDSQs can be significantly improved. To enable data pre-fetching, prediction models for client movements and access patterns need to be combined with CSC. Although in our study we proposed CSC for client-side caching, it can naturally be applied elsewhere as well. For instance, CSC can be implemented on the server as a main-memory cache for LDSQs. Also, in mobile environments, CSC can be applied at access points that serve as gateways between the server and clients. With frequently accessed data cached, access points can answer some LDSQs on behalf of the server, thereby reducing the relay of LDSQs to the server and alleviating the server processing load. In the future, we are going to explore data pre-fetching and the use of CSC at the server and at access points.

• NS Queries. Our current techniques developed for NS queries assume that all the objects are located within a 2D geographical space. To answer NS queries in a 3D space, we are considering two possible approaches. The first approach is to treat a 3D space as multiple layers of 2D spaces, analogous to a building composed of multiple floors. For each layer, an NS query is issued and evaluated with the proposed NS processing algorithms; the final query result is the union of the NS query results from all layers. This is an intuitive sampling approach that returns approximate results. In contrast, the second approach considers a real 3D space. Each 3D coordinate (x, y, z) is transformed to (α, β, d) (i.e., two angles α and β plus a distance d with respect to a query point) in a search space centered at the given query point. Here, α is an angle on the XY plane and β is an angle on the XZ plane. Following the idea of object comparison, we can assert that a given object o is an answer object if there is no other object located within the same angular range that is closer to the query point than o. Object comparisons then need to examine surfaces in 3D space instead of edges, and the access strategies need to consider multiple angles and distances of objects simultaneously; these extensions are non-trivial. In future work, we plan to develop efficient algorithms for NS queries in a 3D space. (A small sketch of the coordinate transform is given after this list.)

• Skyline Queries. Thus far, we have addressed skyline queries and several skyline variants, such as skyband queries, top-ranked skyline queries, k-dominant skyline queries, and subspace skyline queries, based on the assumption that the object attributes are static. Relaxing this assumption would certainly present a challenge. As maintaining an index in the presence of frequent updates is expensive, we have to think of new strategies to organize the dynamic attribute values involved in skyline query processing. To address this issue, we are considering separating the highly dynamic object attributes (e.g., hotel prices) from the relatively static ones (e.g., distance to the beach). We can index the relatively static attributes along Z-order curves, and may or may not index the dynamic attributes, depending on the access and update performance. When a skyline query is issued, the indexes for both types of attributes are accessed, and a skyline result is generated by combining the results retrieved from the individual indexes on the fly. The design of the indexes for dynamic attributes and of new skyline computation algorithms based on different indexes needs to be explored further. We plan to take on these challenges in our future work.

• ROAD framework. The existing design of the ROAD framework is targeted at typical LDSQs, such as range queries and nearest neighbor queries, on a road network. However, users in some LBS applications may issue complex LDSQs such as reverse nearest neighbor queries, transitive nearest neighbor queries, etc. Although these queries have been well studied in conventional spatial databases, they are still relatively new to road networks. We plan to develop new algorithms and/or extend the design of the ROAD framework to facilitate such advanced searches.
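As an illustration of the second approach described in the NS Queries item above, the following small sketch computes the (α, β, d) representation under one plausible reading of that description: α is the direction of the object projected onto the XY plane, β the direction projected onto the XZ plane, and d the 3D distance, all relative to the query point. How surfaces are then compared per angular range remains the open part of the problem and is not addressed here.

#include <cmath>
#include <cstdio>

struct Point3 { double x, y, z; };

// Transform an object location o into (alpha, beta, d) relative to query point q.
void toAngularSpace(const Point3& q, const Point3& o,
                    double& alpha, double& beta, double& d) {
    const double dx = o.x - q.x, dy = o.y - q.y, dz = o.z - q.z;
    alpha = std::atan2(dy, dx);                 // angle on the XY plane
    beta  = std::atan2(dz, dx);                 // angle on the XZ plane
    d     = std::sqrt(dx * dx + dy * dy + dz * dz);
}

int main() {
    double alpha = 0, beta = 0, d = 0;
    toAngularSpace({0, 0, 0}, {1, 1, 1}, alpha, beta, d);
    std::printf("alpha = %.3f rad, beta = %.3f rad, d = %.3f\n", alpha, beta, d);
    return 0;
}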

In this thesis, we have addressed issues of spatial databases with the NS query and skyline query engines, and of spatial network databases with the ROAD framework. As can be anticipated, the semantics of many spatial queries apply to the contexts of both spatial databases and spatial network databases. In the future, we intend to explore the issues involved in combining both spatial databases and spatial network databases to provide a unified database framework for supporting efficient LDSQ evaluation.

7.2.2 Other Issues Related to LDSQ Processing

Aside from access efficiency, other concerns related to LBSs and LDSQ processing, such as privacy protection [52, 49, 100, 31], energy efficiency, and trade-offs between LDSQ result accuracy and LDSQ processing efficiency, attract our attention as well.

The privacy issue arises because user LDSQs may lead to conjectures about the user's identity and interests, especially in light of today's advanced data mining techniques. Privacy issues concern both location privacy [52, 49, 100] and query privacy [31]. To protect location privacy, out of concern for user re-identification and personal security, users may prefer not to disclose their precise locations. For query privacy, users might choose not to submit their exact search criteria to an LBS server if they want to hide certain sensitive or confidential information (e.g., a query searching for the nearest drug rehabilitation center). When a vague query is issued, the LBS server has to explore all the possible query results covered by this vague query and thus returns more results than needed in order to satisfy the user's real need. As the wireless channel may be overloaded by the large amount of data transferred from the LBS server to the users in such cases, it is important to design LDSQ processing techniques that take both the efficiency issue and the privacy issue into consideration.

Battery power is a limited resource of mobile clients. To prolong the operational life of mobile clients, tasks that consume a lot of energy should be suspended whenever possible. In particular, wireless communication, especially uploading data from the client to the server, should be avoided. In other words, LDSQ processing over data broadcast might be beneficial in certain scenarios. Existing studies on LDSQ processing over data broadcast, however, are limited to simple query types and only consider Euclidean distance. In the future, we plan to examine the feasibility of data broadcast that can support various types of LDSQs on road networks.

The fact that the accuracy of LDSQ results can be sacrificed for better access efficiency raises an important research topic regarding how to strike a good balance between LDSQ result accuracy and search performance. For example, with some data cached at clients, LDSQs answered from cached data may return only part of the results and/or outdated results if the source data have been updated at the server. This research problem is challenging. First, it is difficult to predict the LDSQ processing cost and to estimate the LDSQ result accuracy. Second, if users specify an accuracy requirement that is higher than the estimated LDSQ result accuracy, it is not easy to determine how to process the LDSQ so as to satisfy the accuracy requirement while keeping the search efficient. As no existing study has explored this problem, it remains another challenge for future research.

7.2.3 Other Emerging Platforms for LBSs

Ad hoc wireless communication technology advances every day, and mobile clients can communicate with one another to exchange information. Out of this emerges a new paradigm of mobile peer-to-peer networks [141, 139, 83]. With LBSs deployed over a mobile peer-to-peer network, LDSQs can be processed among peer clients in addition to being submitted to the server for processing. In this new paradigm, data management and LDSQ processing strategies pose several unique challenges. As data is distributed among a group of clients, the evaluation of LDSQs may involve multiple clients. Further, the network topology is dynamic due to client movement and unstable network connectivity, and hence search strategies need to adapt to the changing network configuration.

Moreover, with the availability of sensor networks [152, 99, 144, 129], location-related information can be captured, stored, and processed within sensor networks. We anticipate that in the near future, sensor networks (or sensor webs as their extensions) will become very common and widely deployed. On sensor networks, queries can directly access sensor nodes for location-related information. Instead of coming from a single data source, the information for LDSQs is available from sensor nodes and needs to be integrated on the fly for LDSQ processing. Consequently, the evaluation of LDSQs is distributed over sensor networks. Spatial information integration and distributed LDSQ processing in mobile peer-to-peer networks and sensor networks are another aspect we shall pursue in the future.

Finally, we hope the findings obtained from this research will encourage other scholars and researchers to address the remaining challenges in location-based services.

Bibliography

[1] 3rd Generation Partnership Project 2. http://www.3gpp2.org.

[2] General Packet Radio Services (GPRS). http://www.topology.org/comms/gprs.html.

[3] IEEE 802.11, The Working Group Setting the Standards for Wireless LANs. http://grouper.ieee.org/groups/802/11/index.html.

[4] MySQL Spatial. http://dev.mysql.com/doc/refman/5.0/en/spatial-extensions.html.

[5] Oracle Spatial. http://www.oracle.com/technology/documentation/spatial.html.

[6] Wi-Fi Alliance. http://www.weca.net/OpenSection/pdf/Wi- Fi Protected Access Overview.pdf.

[7] S. Acharya. Broadcast Disks: Dissemination-based Management for Asymmetric Communication Environments. PhD thesis, Brown University, 1998.

[8] S. Acharya, M. J. Franklin, and S. B. Zdonik. Disseminating Updates on Broadcast Disks. In Proceedings of the 22th International Conference on Very Large Data Bases (VLDB), Mumbai (Bombay), India, Sep 3-6, pages 354Ð365, 1996.

[9] S. Acharya, M. J. Franklin, and S. B. Zdonik. Prefetching from Broadcast Disks. In Proceedings of the 12th International Conference on Data Engineering (ICDE), New Orleans, LA, USA, Feb 26 - Mar 1, pages 276Ð285, 1996.

[10] S. Acharya, M. J. Franklin, and S. B. Zdonik. Balancing Push and Pull for Data Broad- cast. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Tucson, AZ, USA, USA, May 13-15, pages 183Ð194, 1997.

[11] S. Acharya and S. Muthukrishnan. Scheduling On-Demand Broadcasts: New Metrics and Algorithms. In Proceedings of the 4th Annual ACM/IEEE International Conference on Mobile Computing and Networking (Mobicom), Dallas, Texas, USA, Oct 25-30, pages 43–54, 1998.

[12] P. K. Agarwal and J. Matousek. Ray Shooting and Parametric Search. In Proceedings of the ACM Symposium on Theory of Computing, Victoria, British Columbia, Canada, pages 517Ð526, 1992.

[13] D. Aksoy and M. J. Franklin. R × W: A Scheduling Approach for Large-Scale On- Demand Data Broadcast. IEEE/ACM Transaction on Networking, 7(6):846Ð860, 1999.

[14] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Atlantic City, NJ, May 23-25, pages 322–331, 1990.

[15] W.-T. Balke, U. Güntzer, and J. X. Zheng. Efficient Distributed Skylining for Web Information Systems. In Proceedings of the 7th International Conference on Extending Database Technology (EDBT'04), pages 256–273, 2004.

[16] D. Barbará and T. Imielinski. Sleepers and Workaholics: Caching Strategies in Mobile Environments. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Minneapolis, MN, USA, May 24-27, pages 1–12, 1994.

[17] I. Bartolini, P. Ciaccia, and M. Patella. SaLSa: Computing the Skyline Without Scan- ning the Whole Sky. In Proceedings of the 2006 ACM International Conference on In- formation and Knowledge Management (CIKM), Arlington, VA, USA, Nov 6-11, pages 405Ð414, 2006.

[18] S. K. Baruah and A. Bestavros. Pinwheel Scheduling for Fault-Tolerant Broadcast Disks in Real-time Database Systems. In Proceedings of the 13th International Conference on Data Engineering (ICDE), Birmingham, U.K. Apr 7-11, pages 543Ð551, 1997.

[19] K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When Is ”Nearest Neighbor” Meaningful? In Proceedings of the 7th International Conference on Database Theory (ICDT’99), pages 217Ð235, 1999.

[20] B. H. Bloom. Space/Time Trade-offs in Hash Coding with Allowable Errors. Com- munucations of ACM, 13(7):422Ð426, 1970.

[21] S. Börzsönyi, D. Kossmann, and K. Stocker. The Skyline Operator. In Proceedings of the 17th International Conference on Data Engineering (ICDE'01), pages 421–430, 2001.

[22] A. Brimicombe and C. Li. Location-Based Services and Geo-Information Engineering. John Wiley & Sons Inc, 2009.

[23] A. R. Butz. Alternative Algorithm for Hilbert's Space-Filling Curve. IEEE Transactions on Computers, 20(4):424–426, 1971.

[24] J. W. Byers, M. Luby, M. Mitzenmacher, and A. Rege. A Digital Fountain Approach to Reliable Distribution of Bulk Data. In Proceedings of the ACM SIGCOMM International Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, Vancouver, B.C., Canada, Aug 31 - Sep 4, pages 56–67, 1998.

[25] C. Y. Chan, H. V. Jagadish, K.-L. Tan, A. K. H. Tung, and Z. Zhang. Finding K-Dominant Skylines in High Dimensional Space. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'06), pages 503–514, 2006.

[26] C. Chekuri and A. Rajaraman. Conjunctive Query Containment Revisited. In Proceedings of the 6th International Conference on Database Theory (ICDT), Delphi, Greece, Jan 8-10, pages 56–70, 1997.

[27] L. Chen and X. Lian. Dynamic Skyline Queries in Metric Spaces. In Proceedings of the 11th International Conference on Extending Database Technology (EDBT'08), pages 333–343, 2008.

[28] M.-S. Chen, P. S. Yu, and K.-L. Wu. Indexed Sequential Data Broadcasting in Wireless Mobile Computing. In Proceedings of the 17th International Conference on Distributed Computer Systems (ICDCS), Baltimore, MD, USA, May 28-30, pages 124–131, 1997.

[29] H.-J. Cho and C.-W. Chung. An Efficient and Scalable Approach to CNN Queries in a Road Network. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB), Trondheim, Norway, Aug 30 - Sep 2, pages 865–876, 2005.

[30] J. Chomicki, P. Godfrey, J. Gryz, and D. Liang. Skyline with Presorting. In Proceedings of the 19th International Conference on Data Engineering (ICDE'03), pages 717–816, 2003.

[31] C.-Y. Chow and M. F. Mokbel. Enabling Private Continuous Queries for Revealed User Locations. In Proceedings of the International Symposium on Advances in Spatial and Temporal Databases (SSTD), Boston, MA, USA, Jul 16-18, pages 258–275, 2007.

[32] A. Corral, Y. Manolopoulos, Y. Theodoridis, and M. Vassilakopoulos. Closest Pair Queries in Spatial Databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 16-18, pages 189–200, 2000.

[33] A. Corral, Y. Manolopoulos, Y. Theodoridis, and M. Vassilakopoulos. Multi-Way Distance Join Queries in Spatial Databases. GeoInformatica, 8(4):373–402, 2004.

[34] A. Corral, Y. Manolopoulos, Y. Theodoridis, and M. Vassilakopoulos. Cost models for distance joins queries using R-trees. Data and Knowledge Engineering, 57(1):1–36, 2006.

[35] S. Dar, M. J. Franklin, B. T. Jónsson, D. Srivastava, and M. Tan. Semantic data caching and replacement. In Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB), Bombay, India, Sep 3-6, pages 330–341, 1996.

[36] A. Datta, D. E. VanderMeer, A. Celik, and V. Kumar. Broadcast Protocols to Support Efficient Retrieval from Databases by Mobile Users. ACM Transactions on Database Systems (TODS), 24(1):1Ð79, 1999.

[37] R. Dechter and J. Pearl. Generalized Best-First Search Strategies and the Optimality of A*. Journal of ACM, 32(3):505Ð536, 1985.

[38] E. Dellis, A. Vlachou, I. Vladimirskiy, B. Seeger, and Y. Theodoridis. Constrained sub- space skyline computation. In Proceedings of the 2006 ACM International Conference on Information and Knowledge Management (CIKM), Arlington, VA, USA, Nov 6-11, pages 415Ð424, 2006.

[39] P. M. Deshpande, K. Ramasamy, A. Shukla, and J. F. Naughton. Caching Multidi- mensional Queries Using Chunks. In Proceedings of the ACM SIGMOD International Conference on Management of Data, San Diego, CA, Jun 9-12, pages 259Ð270, 1998.

[40] E. W. Dijkstra. A Note on two Problems in Connexion with Graphs. In Numerische Mathematik, pages 269Ð271, 1959.

[41] L. Downs, T. Moller, and C. H. Sequin. Occlusion Horizons for Driving Through Urban Scenery. In Proceedings of the 2001 Symposium on Interactive 3D Graphics, pages 121Ð124, 2001.

[42] H. D. Dykeman, M. H. Ammar, and J. W. Wong. Scheduling Algorithms for Video- tex Systems under Broadcast Delivery. In Proceeding of International Conference on Communications, Toronto, Canada, 1988.

[43] Environmental Systems Research Institute, Inc. ESRI Shapefile Tech- nical Description - An ESRI White Paper - Jul. 1998. (website) http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf.

[44] C. Faloutsos and S. Christodoulakis. Signature Files: An Access Method for Documents and Its Analytical Performance Evaluation. ACM Transactions on Information Systems (TOIS), 2(4):267Ð288, 1984.

[45] H. Ferhatosmanoglu, I. Stanoi, D. Agrawal, and A. E. Abbadi. Constrained Nearest Neighbor Queries. In Proceedings of the 7th International Symposium on Advances in Spatial and Temporal Databases, Redondo Beach, CA, Jul 12-15, pages 257Ð278, 2001.

[46] R. A. Finkel and J. L. Bentley. Quad Trees: A Data Structure for Retrieval on Composite Keys. Acta Informatica, 4:1Ð9, 1974.

[47] D. Fuhry, R. Jin, and D. Zhang. Efficient Skyline Computation in Metric Space. In Proceedings of the 12th International Conference on Extending Database Technology, Saint Petersburg, Russia, Mar 24-26, pages 1042–1051, 2009.

[48] Y. J. García, M. A. Lopez, and S. T. Leutenegger. A Greedy Algorithm for Bulk Loading R-Trees. In Proceedings of the 6th International Symposium on Advances in Geographic Information Systems (GIS), Washington, DC, USA, Nov 6-7, pages 163–164, 1998.

[49] B. Gedik and L. Liu. Location Privacy in Mobile Systems: A Personalized Anonymiza- tion Model. In International Conference on Distributed Computing Systems (ICDCS), Columbus, OH, USA, Jun 6-10, pages 620Ð629, 2005.

[50] P. Godfrey, R. Shipley, and J. Gryz. Maximal Vector Computation in Large Data Sets. In Proceedings of 31th International Conference on Very Large Data Bases (VLDB’05), pages 229Ð240, 2005.

[51] V. Grassi. Prefetching Policies for Energy Saving and Latency Reduction in a Wireless Broadcast Data Delivery System. In Proceedings of the 3rd ACM International Work- shop on Modeling, Analysis and Simulation of Wireless and Mobile Systems (MSWIM), Boston, MA, USA, Aug 20, pages 77Ð84, 2000.

[52] M. Gruteser and D. Grunwald. Anonymous usage of location-based services through spatial and temporal cloaking. In Proceedings of the First International Conference on Mobile Systems, Applications, and Services (MobiSys), San Francisco, CA, USA, 2003.

[53] A. Guttman. R-Trees: A Dynamic Index Structure for Spatial Searching. In Proceedings of SIGMOD’84 Annual Meeting, Boston, MA, USA, Jun 18-21, pages 47Ð57, 1984.

[54] S. Hameed and N. H. Vaidya. Efficient Algorithms for Scheduling Data Broadcast. ACM/Baltzer Journal of Wireless Networks (WINET), 5(3):183Ð193, 1999.

[55] K. Hinrichs. Implementation of the Grid File: Design Concepts and Experience. BIT Computer Science and Numerical Mathematics, 25(4):569Ð592, 1985.

[56] G. R. Hjaltason and H. Samet. Distance Browsing in Spatial Databases. ACM Transac- tions on Database Systems (TODS), 24(2):265Ð318, 1999.

[57] C.-L. Hu and M.-S. Chen. Dynamic Data Broadcasting with Traffic Awareness. In Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS), Vienna, Austria, Jul 2-5, pages 112Ð119, 2002.

[58] H. Hu, D. L. Lee, and V. C. S. Lee. Distance Indexing on Road Networks. In Proceed- ings of the 32nd International Conference on Very Large Data Bases (VLDB), Seoul, Korea, Sep 12-15, pages 894Ð905, 2006.

[59] H. Hu, D. L. Lee, and J. Xu. Fast Nearest Neighbor Search on Road Networks. In Proceedings of the 10th International Conference on Extending Database Technology (EDBT), Munich, Germany, Mar 26-31, pages 186–203, 2006.

[60] H. Hu, J. Xu, and D. L. Lee. Adaptive Power-Aware Prefetching Schemes for Mobile Broadcast Environments. In Proceedings of International Conference on Mobile Data Management (MDM), Melbourne, Australia, Jan 21-24, pages 374Ð380, 2003.

[61] H. Hu, J. Xu, W. S. Wong, B. Zheng, D. L. Lee, and W.-C. Lee. Proactive Caching for Spatial Queries in Mobile Environments. In Proceedings of the 21st International Conference on Data Engineering (ICDE), Tokyo, Japan, Apr 5-8, pages 403Ð414, 2005.

[62] Q. Hu, D. L. Lee, and W.-C. Lee. Performance Evaluation of a Wireless Hierarchical Data Dissemination System. In Proceedings of the 5th Annual ACM/IEEE Interna- tional Conference on Mobile Computing and Networking (Mobicom), Seattle, Washing- ton, USA, Aug 15-19, pages 163Ð173, 1999.

[63] Q. Hu, W.-C. Lee, and D. L. Lee. Power Conservative Multi-Attribute Queries on Data Broadcast. In Proceedings of the 16th International Conference on Data Engineering (ICDE), San Diego, CA, USA, Feb 28 - Mar 3, pages 157Ð166, 2000.

[64] Q. Hu, W.-C. Lee, and D. L. Lee. A Hybrid Index Technique for Power Efficient Data Broadcast. Distributed and Parallel Databases, 9(2):151Ð177, 2001.

[65] Y.-W. Huang, N. Jing, and E. A. Rundensteiner. Effective Graph Clustering for Path Queries in Digital Map Databases. In Proceedings of the 5th International Conference on Information and Knowledge Management (CIKM), Rockville, MD, USA, Nov 12-16, pages 215Ð222, 1996.

[66] Z. Huang, C. S. Jensen, H. Lu, and B. C. Ooi. Skyline Queries Against Mobile Lightweight Devices in MANETs. In Proceedings of the 22th International Conference on Data Engineering (ICDE’06), page 66, 2006.

[67] T. Imielinski, S. Viswanathan, and B. R. Badrinath. Power Efficient Filtering of Data an Air. In Proceedings of the 4th International Conference on Extending Database Technology (EDBT), Cambridge, UK, Mar 28-31, pages 245Ð258, 1994.

[68] T. Imielinski, S. Viswanathan, and B. R. Badrinath. Data on Air: Organization and Access. IEEE Transactions on Knowledge and Data Engineering (TKDE), 9(3):353Ð 372, 1997.

[69] C. S. Jensen, J. Kolarvr,« T. B. Pedersen, and I. Timko. Nearest Neighbor Queries in Road Networks. In Proceedings of the 11th ACM International Symposium on Advances in Geographic Information Systems (ACM GIS), New Orleans, Louisiana, USA, Nov 7-8, pages 1Ð8, 2003.

[70] C. S. Jensen, D. Lin, B. C. Ooi, and R. Zhang. Effective Density Queries on Continuously Moving Objects. In Proceedings of the International Conference on Data Engineering (ICDE), Atlanta, GA, USA, April 3-8, page 71, 2006.

[71] S. Jiang and N. H. Vaidya. Scheduling Data Broadcast to “Impatient” Users. In Proceed- ings of the ACM International Workshop on Data Engineering for Wireless and Mobile Access, Seattle, WA, USA, Aug 20, pages 52Ð59, 1999.

[72] J. Jing, A. K. Elmagarmid, A. Helal, and R. Alonso. Bit-Sequences: An Adaptive Cache Invalidation Method in Mobile Client/Server Environments. ACM/Baltzer Journal of Mobile Networks and Nomadic Applications (MONET), 2(2):115Ð127, 1997.

[73] N. Jing, Y.-W. Huang, and E. A. Rundensteiner. Hierarchical Encoded Path Views for Path Query Processing: An Optimal Model and Its Performance Evaluation. IEEE Transactions on Knowledge and Data Engineering (TKDE), 10(3):409Ð432, 1998.

[74] S. Jung and S. Pramanik. An Efficient Path Computation Model for Hierarchically Structured Topographical Road Maps. IEEE Transactions on Knowledge and Data En- gineering (TKDE), 14(5):1029Ð1046, 2002.

[75] K. V. R. Kanth, S. Ravada, and D. Abugov. Quadtree and R-tree Indexes in Oracle Spa- tial: a comparison using GIS data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Madison, WI, USA, Jun 3-6, pages 546Ð557, 2002.

[76] A. Küpper. Location-Based Services: Fundamentals and Operation. John Wiley & Sons Inc, 2005.

[77] B. Kernighan and S. Lin. An Efficient Heuristic Procedure for Partitioning Graphs. Bell Systems Technical Journal, 49(2):291Ð308, 1970.

[78] S. Khanna and V. Liberatore. On Broadcast Disk Paging. SIAM Journal of Computing, 29(5):1683Ð1702, 2000.

[79] M. R. Kolahdouzan and C. Shahabi. Voronoi-Based K Nearest Neighbor Search for Spatial Network Databases. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), Toronto, Canada, Aug 31 - Sep 3, pages 840Ð851, 2004.

[80] F. Korn and S. Muthukrishnan. Influence Sets Based On Reverse Nearest Neighbor Queries. In Proceeding of the 2000 ACM SIGMOD International Conference on Man- agement of Data, Dallas, TX, USA, May 16-18, pages 201Ð212, 2000.

[81] D. Kossmann, F. Ramsak, and S. Rost. Shooting Stars in the Sky: An Online Algorithm for Skyline Queries. In Proceedings of 28th International Conference on Very Large Data Bases (VLDB’02), pages 275Ð286, 2002.

[82] D. Kotz and K. Essien. Analysis of a Campus-Wide Wireless Network. In Proceedings of the 8th Annual International Conference on Mobile Computing and Networking (Mobicom), Atlanta, GA, USA, Sep 23-28, pages 107–118, 2002.

[83] W.-S. Ku and R. Zimmermann. Nearest Neighbor Queries with Peer-To-Peer Data Shar- ing in Mobile Environments. Pervasive and Mobile Computing, 4(5):775Ð788, 2008.

[84] K.-Y. Lam, E. Chan, and J. C.-H. Yuen. Data Broadcast for Time-Constrained Read- Only Transactions in Mobile Computing Systems. In Proceedings of International Workshop on Advance Issues of E-Commerce and Web-based Information Systems, Santa Clara, CA, April 8-9, 1999.

[85] K. C. Lee, H. V. Leong, and A. Si. Semantic query caching in a mobile environment. Mobile Computing and Communication Review (MC2R), 3(2):28Ð36, 1999.

[86] K. C. K. Lee, W.-C. Lee, J. Winter, B. Zheng, and J. Xu. CS Cache Engine: Data Access Accelerator for Location-Based Service in Mobile Environments. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, USA, Jun 27-29, pages 787Ð789, 2006.

[87] K. C. K. Lee, H. V. Leong, and A. Si. Semantic Data Broadcast for a Mobile Environ- ment Based on Dynamic and Adaptive Chunking. IEEE Transactions on Computers, 51(10):1253Ð1268, 2002.

[88] K. C. K. Lee, B. Zheng, and W.-C. Lee. Ranked Reverse Nearest Neighbor Search. IEEE Transactions on Knowledge and Data Engineering, 20(7):894Ð910, 2008.

[89] K. C. K. Lee, B. Zheng, H. Li, and W.-C. Lee. Approaching the Skyline in Z Order. In Proceedings of 33th International Conference on Very Large Data Bases (VLDB’07), pages 279Ð290, 2007.

[90] W.-C. Lee and D. L. Lee. Using Signature Techniques for Information Filtering in Wireless and Mobile Environments. Distributed and Parallel Databases, 4(3):205Ð227, 1996.

[91] S. T. Leutenegger, J. M. Edgington, and M. A. Lopez. STR: A Simple and Efficient Algorithm for R-Tree Packing. In Proceedings of the International Conference on Data Engineering, Birmingham U.K., April 7-11, pages 497Ð506, 1997.

[92] F. Li. Real Datasets for Spatial Databases: Road Networks and Points of Interest. http://www.cs.fsu.edu/∼lifeifei/SpatialDataset.htm.

[93] H. Li, Q. Tan, and W.-C. Lee. Efficient Progressive Processing of Skyline Queries in Peer-To-Peer Systems. In Proceedings of the First International Conference on Scalable Information Systems, (Infoscale’06), Hong Kong, China, page 26, 2006.

[94] C.-W. Lin and D. L. Lee. Adaptive Data Delivery in Wireless Communication Environments. In Proceedings of the 20th International Conference on Distributed Computing Systems (ICDCS), Taipei, Taiwan, Apr 10-13, pages 444–452, 2000.

[95] X. Lin, Y. Yuan, W. Wang, and H. Lu. Stabbing the Sky: Efficient Skyline Computation over Sliding Windows. In Proceedings of the 21th International Conference on Data Engineering (ICDE’05), pages 502Ð513, 2005.

[96] X. Lin, Y. Yuan, Q. Zhang, and Y. Zhang. Selecting Stars: The k Most Representa- tive Skyline Operator. In Proceedings of the 23th International Conference on Data Engineering (ICDE’07), Istanbul, Turkey, Apr 15-20, pages 86Ð95, 2007.

[97] S.-C. Lo and A. L. P. Chen. An Adaptive Access Method for Broadcast Data under an Error-Prone Mobile Environment. IEEE Transactions on Knowledge and Data Engi- neering (TKDE), 12(4):609Ð620, 2000.

[98] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-completeness. W. H. Freeman, 1979.

[99] S. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong. The Design of an Acqui- sitional Query Processor For Sensor Networks. In Proceedings of the ACM SIGMOD International Conference on Management of Data, San Diego, CA, USA, Jun 9-12, pages 491Ð502, 2003.

[100] M. F. Mokbel, C.-Y. Chow, and W. G. Aref. The New Casper: Query Processing for Location Services without Compromising Privacy. In Proceedings of the International Conference on Very Large Data Bases, Seoul, Korea, Sep 12-15, pages 763Ð774, 2006.

[101] G. M. Morton. A computer oriented geodetic data base and a new technique in file sequencing. Technical report, IBM Ltd., Ottawa, Canada, 1966.

[102] B. C. Ooi. Spatial kd-tree: A data structure for geographic database. In Proceedings of Datenbanksysteme in Büro, Technik und Wissenschaft, GI-Fachtagung, Darmstadt, April 1-3, pages 247–258, 1987.

[103] J. A. Orenstein and T. H. Merrett. A Class of Data Structures for Associative Search- ing. In Proceedings of the 3rd ACM SIGACT-SIGMOD Symposium on Principles of Database Systems (PODS’84), pages 181 Ð 190, 1984.

[104] D. Papadias, Q. Shen, Y. Tao, and K. Mouratidis. Group Nearest Neighbor Queries. In Proceedings of the 20th International Conference on Data Engineering, Boston, MA, USA, Mar 30 - Apr 2, pages 301Ð312, 2004.

[105] D. Papadias, Y. Tao, G. Fu, and B. Seeger. Progressive Skyline Computation in Database Systems. ACM Transactions on Database Systems (TODS), 30(1):41Ð82, 2005.

[106] D. Papadias, J. Zhang, N. Mamoulis, and Y. Tao. Query Processing in Spatial Network Databases. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB), Berlin, Germany, Sep 9-12, pages 802–813, 2003.

[107] J. Pei, B. Jiang, X. Lin, and Y. Yuan. Probabilistic Skylines on Uncertain Data. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB'07), pages 15–26, 2007.

[108] E. Pitoura and P. K. Chrysanthis. Exploiting Versions for Handling Updates in Broadcast Disks. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB), Edinburgh, Scotland, UK, Sep 7-10, pages 114–125, 1999.

[109] E. Pitoura and P. K. Chrysanthis. Multiversion Data Broadcast. IEEE Transactions on Computers, 51(10):1224–1230, 2002.

[110] B. R. Preiss. Data Structures and Algorithms with Object-Oriented Design Patterns in Java. 1998.

[111] R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill, 1998.

[112] F. Ramsak, V. Markl, R. Fenk, M. Zirkel, K. Elhardt, and R. Bayer. Integrating the UB-Tree into a Database System Kernel. In Proceedings of the 26th International Conference on Very Large Data Bases (VLDB'00), pages 263–272, 2000.

[113] Q. Ren and M. H. Dunham. Using Semantic Caching to Manage Location Dependent Data in Mobile Computing. In Proceedings of the International Conference on Mobile Computing and Networking, Boston, MA, August 6-11, pages 210–221, 2000.

[114] Q. Ren, M. H. Dunham, and V. Kumar. Semantic Caching and Query Processing. IEEE Transactions on Knowledge and Data Engineering (TKDE), 15(1):192–210, 2003.

[115] N. Roussopoulos, S. Kelly, and F. Vincent. Nearest Neighbor Queries. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, CA, May 22-25, pages 71–79, 1995.

[116] H. Sagan. Space-Filling Curves. Springer-Verlag, 1994.

[117] D. Salomon. Computer Graphics and Geometric Modeling. Springer-Verlag New York, Inc., 1999. ISBN: 0-387-98682-0.

[118] J. Schiller and A. Voisard. Location-Based Services. Morgan Kaufmann, 2004.

[119] T. K. Sellis, N. Roussopoulos, and C. Faloutsos. The R+-Tree: A Dynamic Index for Multi-Dimensional Objects. In Proceedings of the 13th International Conference on Very Large Data Bases, Brighton, England, Sept 1-4, pages 507–518, 1987.

[120] M. Sharifzadeh and C. Shahabi. The Spatial Skyline Queries. In Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB'06), pages 751–762, 2006.

[121] S. Shekhar and D.-R. Liu. CCAM: A Connectivity-Clustered Access Method for Networks and Network Computations. IEEE Transactions on Knowledge and Data Engineering (TKDE), 9(1):102–119, 1997.

[122] H. Shin, B. Moon, and S. Lee. Adaptive and Incremental Processing for Distance Join Queries. IEEE Transactions on Knowledge and Data Engineering, 15(6):1561Ð1578, 2003.

[123] N. Shivakumar and S. Venkatasubramanian. Energy-Efficient Indexing for Information Dissemination in Wireless Systems. ACM/Baltzer Journal of Mobile Networks and No- madic Applications (MONET), 1(4):433Ð446, 1996.

[124] I. Stanoi, D. Agrawal, and A. E. Abbadi. Reverse Nearest Neighbor Queries for Dy- namic Databases. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 44Ð53, 2000.

[125] K. Stathatos, N. Roussopoulos, and J. S. Baras. Adaptive Data Broadcast in Hybrid Networks. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB), Athens, Greece, Aug 25-29, pages 326Ð335, 1997.

[126] C.-J. Su, L. Tassiulas, and V. J. Tsotras. Broadcast Scheduling for Information Distribu- tion. ACM/Baltzer Journal of Wireless Networks (WINET), 5(2):137Ð147, 1999.

[127] M. Su and C.-H. Chi. Architecture and Performance of Application Networking in Pervasive Content Delivery. In Proceedings of the 21st International Conference on Data Engineering (ICDE), Tokyo, Japan, Apr 5-8, pages 656Ð667, 2005.

[128] S. Acharya, R. Alonso, M. J. Franklin, and S. B. Zdonik. Broadcast Disks: Data Management for Asymmetric Communication Environments. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, California, May 22-25, pages 199–210, 1995.

[129] M. L. Talbert and G. Seetharaman. When Sensor Webs Start Being Taken Seriously. In Proceedings of the IEEE International Conference on Sensor Networks, Ubiquitous, and Trustworthy Computing (SUTC), Taichung, Taiwan, June 5-7, pages 200Ð207, 2006.

[130] K.-L. Tan, P.-K. Eng, and B. C. Ooi. Efficient Progressive Skyline Computation. In Proceedings of 27th International Conference on Very Large Data Bases (VLDB’01), pages 301Ð310, 2001.

[131] K.-L. Tan and B. C. Ooi. On Selective Tuning in Unreliable Wireless Channels. Data and Knowledge Engineering, 28(2):209Ð231, 1998.

[132] Y. Tao, L. Ding, X. Lin, and J. Pei. Distance-Based Representative Skyline. In Proceed- ings of the 25th International Conference on Data Engineering, ICDE 2009, Shanghai, China, Mar 29-Apr 2, pages 892Ð903, 2009.

[133] Y. Tao and D. Papadias. Maintaining Sliding Window Skylines on Data Streams. IEEE Transactions on Knowledge and Data Engineering (TKDE), 18(2):377–391, 2006.

[134] Y. Tao, D. Papadias, X. Lian, and X. Xiao. Multidimensional Reverse k NN Search. VLDB Journal, 16(3):293Ð316, 2007.

[135] Y. Tao, X. Xiao, and J. Pei. Efficient Skyline and Top-k Retrieval in Subspaces. IEEE Transactions on Knowledge and Data Engineering (TKDE).

[136] L. Tassiulas and C.-J. Su. Optimal Memory Management Strategies for a Mobile User in a Broadcast Data Delivery System. IEEE Journal on Selected Areas in Communications, 15(7):1226–1238, 1997.

[137] U.S. Census Bureau. 2002 TIGER/Line Files (Website). http://www.census.gov/geo/www/tiger/tiger2002/tgr2002.html.

[138] N. H. Vaidya and S. Hameed. Scheduling Data Broadcast in Asymmetric Communication Environment. ACM/Baltzer Journal of Wireless Networks (WINET), 5(3):171–182, 1999.

[139] J. van der Merwe, D. S. Dawoud, and S. McDonald. A Survey on Peer-To-Peer Key Management for Mobile Ad Hoc Networks. ACM Computing Surveys, 39(1), 2007.

[140] S. Wang, B. C. Ooi, A. K. H. Tung, and L. Xu. Efficient Skyline Query Processing on Peer-to-Peer Networks. In Proceedings of the 23rd International Conference on Data Engineering (ICDE’07), Istanbul, Turkey, Apr 15-20, pages 1126–1135, 2007.

[141] O. Wolfson, B. Xu, H. Yin, and H. Cao. Search-and-Discover in Mobile P2P Network Databases. In Proceedings of the IEEE International Conference on Distributed Computing Systems (ICDCS), Lisboa, Portugal, July 4-7, page 65, 2006.

[142] J. W. Wong. Broadcast Delivery. Proceedings of the IEEE, 76(12):1566–1577, 1988.

[143] J. W. Wong and H. D. Dykeman. Architecture and Performance of Large Scale Information Delivery Networks. In Proceedings of the International Teletraffic Congress, Torino, Italy, 1988.

[144] A. Woo. A New Embedded Web Services Approach to Wireless Sensor Networks. In Proceedings of the International Conference on Embedded Networked Sensor Systems (SenSys), Boulder, Colorado, USA, Oct 31 - Nov 3, page 347, 2006.

[145] P. Wu, C. Zhang, Y. Feng, B. Y. Zhao, D. Agrawal, and A. E. Abbadi. Parallelizing Skyline Queries for Scalable Distribution. In Proceedings of the 9th International Conference on Extending Database Technology (EDBT’06), pages 112–130, 2006.

[146] J. Xu, Q. Hu, D. L. Lee, and W.-C. Lee. SAIU: An Efficient Cache Replacement Policy for Wireless On-demand Broadcasts. In Proceedings of the 2000 ACM International Conference on Information and Knowledge Management (CIKM), McLean, VA, USA, Nov 6-11, pages 46–53, 2000.

[147] J. Xu, Q. Hu, W.-C. Lee, and D. L. Lee. Performance Evaluation of an Optimal Cache Replacement Policy for Wireless Data Dissemination. IEEE Transactions on Knowledge and Data Engineering (TKDE), 16(1):125–139, 2004.

[148] J. Xu, X. Tang, and W.-C. Lee. Time-Critical On-Demand Data Broadcast: Algorithms, Analysis, and Performance Evaluation. IEEE Transactions on Parallel and Distributed Systems, to appear.

[149] P. Xuan, S. Sen, O. González, J. Fernandez, and K. Ramamritham. Broadcast on Demand: Efficient and Timely Dissemination of Data in Mobile Environments. In Proceedings of the IEEE Real Time Technology and Applications Symposium (RTAS), Montreal, Canada, Jun 9-11, pages 38–48, 1997.

[150] C. Yang and K.-I. Lin. An Index Structure for Efficient Reverse Nearest Neighbor Queries. In Proceedings of the International Conference on Data Engineering, Heidelberg, Germany, April 2-6, pages 485–492, 2001.

[151] C. Yang and K.-I. Lin. An Index Structure for Improving Nearest Closest Pairs and Related Join Queries in Spatial Databases. In Proceedings of the International Database Engineering & Applications Symposium (IDEAS), Edmonton, Canada, July 17-19, pages 140–149, 2002.

[152] Y. Yao and J. Gehrke. The Cougar Approach to In-Network Query Processing in Sensor Networks. ACM SIGMOD Record, 31(3):9–18, 2002.

[153] L. Yin and G. Cao. Adaptive Power-Aware Prefetch in Wireless Networks. IEEE Transactions on Wireless Communications, 3(5):1648–1658, 2004.

[154] M. L. Yiu and N. Mamoulis. Efficient Processing of Top-k Dominating Queries on Multi-Dimensional Data. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB’07), pages 483–494, 2007.

[155] M. L. Yiu, N. Mamoulis, and D. Papadias. Aggregate Nearest Neighbor Queries in Road Networks. IEEE Transactions on Knowledge and Data Engineering (TKDE), 17(6):820–833, 2005.

[156] D. Zhang and V. J. Tsotras. Optimizing Spatial Min/Max Aggregations. VLDB Journal, 14(2):170–181, 2005.

[157] B. Zheng, K. C. K. Lee, and W.-C. Lee. Transitive Nearest Neighbor Search in Mobile Environments. In Proceedings of the IEEE International Conference on Sensor Networks, Ubiquitous, and Trustworthy Computing (SUTC), Taichung, Taiwan, June 5-7, pages 14–21, 2006.

[158] B. Zheng, K. C. K. Lee, and W.-C. Lee. Location-Dependent Skyline Query. In Proceedings of the 9th International Conference on Mobile Data Management (MDM’08), pages 148–155, 2008.

[159] B. Zheng, W.-C. Lee, and D. L. Lee. On Semantic Caching and Query Scheduling for Mobile Nearest-Neighbor Search. Wireless Networks, 10(6):653–664, 2004.

Vita

Chi Keung Lee

Mr. Chi Keung Lee (Ken C. K. Lee) received his bachelor's and MPhil degrees, both in computing, from the Hong Kong Polytechnic University. Before starting his PhD study at the Pennsylvania State University in Fall 2004, he worked as a project engineer at the Center for Innovation and Technology of the Chinese University of Hong Kong and as a research associate in the Department of Computing at the Hong Kong Polytechnic University.

During the five years of his PhD program, Ken C. K. Lee completed all required coursework within the first two academic years and passed the PhD candidacy examination in his first semester at the Pennsylvania State University. He also passed both the English proficiency test and the comprehensive examination by the end of the second year. He coauthored more than twenty research papers, some of which were published in prestigious conference proceedings and journals. He gave talks, presentations, and demonstrations on his latest research results to the research community and the general public. He also served as an external reviewer for conferences (such as ICDCS, ICDE, and MDM) and journals (such as TKDE), and assisted in the organization of conferences including CIKM, INFOSCALE, and MDM. In the later years of his PhD program, he mentored several junior members of his research group, including three students in the MS/MEng programs, an exchange PhD student, and two new PhD students. He received the research assistant award from the Department of Computer Science and Engineering at the Pennsylvania State University in 2007. He also received industrial awards for his innovative software applications, among them the First Prize and the Most Useful Widget award in the Yahoo! Widget Contest in 2007 and the Second Runner-up in the Cisco IP Phone Application Competition (HK) in 2003. He is currently a student member of the ACM and the IEEE.