Submodular Optimization in Location-Based

Social Networks

by

Xuefeng Chen

B.E. University of Electronic Science and Technology of China, 2012

M.E. University of Electronic Science and Technology of China, 2015

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
IN THE SCHOOL OF
Computer Science and Engineering

Tuesday 9th March, 2021

All rights reserved.

This work may not be reproduced in whole or in part,

by photocopy or other means, without the permission of the author.

© Xuefeng Chen 2021

Thesis/Dissertation Sheet

Surname/Family Name: Chen
Given Name/s: Xuefeng
Abbreviation for degree as given in the University calendar: PhD
Faculty: Engineering
School: School of Computer Science and Engineering
Thesis Title: Submodular Optimization in Location-Based Social Networks

Abstract 350 words maximum: (PLEASE TYPE)

In recent years, with the increasing popularity and growth of social networks and mobile devices, research on location-based social networks (LBSN) has attracted a lot of attention. As an important research topic in LBSN, submodular optimization in location-based social networks (SO-LBSN) has been considered by many studies, because it is useful in applications such as service location selection, marketing, and tourist trip planning. However, because LBSN contain rich information (e.g., locations, user relationships) and LBSN applications may specify various constraints (e.g., spatial constraints, routing constraints), existing methods often fail to solve SO-LBSN problems efficiently. In this thesis, we study three typical SO-LBSN problems, and utilize the location information, the properties of the submodular functions, and the constraints to develop efficient algorithms.

Firstly, we investigate the problem of maximizing the bichromatic reverse k-nearest neighbor in LBSNs. We use a general submodular function to compute the influence value. We design a hybrid quadtree-grid index to manage the server points and client points and to reduce the kNN computation cost substantially, and propose an optimization technique called “maximal-arc” to improve the efficiency of the influence computation. We also develop both exact and approximation algorithms with guaranteed error bounds.

Secondly, we focus on the problem of optimal region search with submodular maximization. We prove that the problem is NP-hard and propose an approximation algorithm AppORS. We also design another algorithm with the same approximation ratio called IAppORS to further improve the effectiveness of AppORS, and present two heuristic methods to implement a key function of IAppORS.

Finally, we study the constrained path search with submodular maximization query. We show that answering the query is NP-hard. We first propose a concept called “submodular α-dominance” by utilizing the properties of the submodular function, and develop an approximation algorithm based on this concept. By relaxing the submodular α-dominance conditions, we design another approximation algorithm with better efficiency and the same error bound. We also utilize bidirectional path search to further improve the efficiency, and propose a heuristic polynomial-time algorithm that is efficient yet effective in practice.

Declaration relating to disposition of project thesis/dissertation

I hereby grant to the University of New South Wales or its agents a non-exclusive licence to archive and to make available (including to members of the public) my thesis or dissertation in whole or in part in the University libraries in all forms of media, now or hereafter known. I acknowledge that I retain all intellectual property rights which subsist in my thesis or dissertation, such as copyright and patent rights, subject to applicable law. I also retain the right to use all or part of my thesis or dissertation in future works (such as articles or books).

Signature ……………………………………………  Date: 03/03/2021

The University recognises that there may be exceptional circumstances requiring restrictions on copying or conditions on use. Requests for restriction for a period of up to 2 years can be made when submitting the final copies of your thesis to the UNSW Library. Requests for a longer period of restriction may be considered in exceptional circumstances and require the approval of the Dean of Graduate Research.


ORIGINALITY STATEMENT

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.’

Signed ……………………………………………......

Date ……………………………………………......

INCLUSION OF PUBLICATIONS STATEMENT

UNSW is supportive of candidates publishing their research results during their candidature as detailed in the UNSW Thesis Examination Procedure.

Publications can be used in their thesis in lieu of a Chapter if:
• The candidate contributed greater than 50% of the content in the publication and is the “primary author”, i.e., the candidate was responsible primarily for the planning, execution and preparation of the work for publication
• The candidate has approval to include the publication in their thesis in lieu of a Chapter from their supervisor and Postgraduate Coordinator
• The publication is not subject to any obligations or contractual agreements with a third party that would constrain its inclusion in the thesis

Please indicate whether this thesis contains published material or not:

This thesis contains no publications, either published or submitted for publication ☐

Some of the work described in this thesis has been published and it has been documented in the relevant Chapters with acknowledgement ☐

This thesis has publications (either published or submitted for publication) ☒ incorporated into it in lieu of a chapter and the details are presented below

CANDIDATE’S DECLARATION

I declare that:
• I have complied with the UNSW Thesis Examination Procedure
• where I have used a publication in lieu of a Chapter, the listed publication(s) below meet(s) the requirements to be included in the thesis.

Candidate’s Name: Xuefeng Chen    Date (dd/mm/yy): 03/03/2021


POSTGRADUATE COORDINATOR’S DECLARATION

I declare that:
• the information below is accurate
• where listed publication(s) have been used in lieu of Chapter(s), their use complies with the UNSW Thesis Examination Procedure
• the minimum requirements for the format of the thesis have been met.

PGC’s Name: Salil Kanhere    Date (dd/mm/yy): 03/03/2021

PRIMARY SUPERVISOR’S DECLARATION

I declare that:
• the information above is accurate
• this has been discussed with the PGC and it is agreed that this publication can be included in this thesis in lieu of a Chapter
• all of the co-authors of the publication have reviewed the above information and have agreed to its veracity by signing a ‘Co-Author Authorisation’ form.

Primary Supervisor’s Name: Xin Cao    Date (dd/mm/yy): 03/03/2021


COPYRIGHT STATEMENT

‘I hereby grant the University of New South Wales or its agents a non-exclusive licence to archive and to make available (including to members of the public) my thesis or dissertation in whole or part in the University libraries in all forms of media, now or hereafter known. I acknowledge that I retain all intellectual property rights which subsist in my thesis or dissertation, such as copyright and patent rights, subject to applicable law. I also retain the right to use all or part of my thesis or dissertation in future works (such as articles or books).’

‘For any substantial portions of copyright material used in this thesis, written permission for use has been obtained, or the copyright material is removed from the final public version of the thesis.’

Signed ……………………………………………......

Date ……………………03/03/2021……………………

AUTHENTICITY STATEMENT ‘I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis.’

Signed ……………………………………………......

Date ……………………03/03/2021……………………

Abstract

In recent years, with the increasing popularity and growth of social networks and mobile devices, research on location-based social networks (LBSN) has attracted a lot of attention. As an important research topic in LBSN, submodular optimization in location-based social networks (SO-LBSN) has been considered by many studies, because it is useful in applications such as service location selection, marketing, and tourist trip planning. However, because LBSN contain rich information (e.g., locations, user relationships, texts) and LBSN applications may specify various constraints (e.g., spatial constraints, routing constraints, connectivity constraints), existing methods often fail to solve SO-LBSN problems efficiently. In this thesis, we study three typical SO-LBSN problems, and utilize the location information, the properties of the submodular functions, and the constraints to develop efficient algorithms.

Firstly, we investigate the problem of maximizing bichromatic reverse k-nearest neighbor (MaxBRkNN) in LBSNs. The problem aims to find a region for setting up a new service site (a server point) such that it can influence the most users (client points).

We design a hybrid quadtree-grid index (QGI) to manage the server points and client points and to reduce the kNN computation cost substantially. In LBSNs, a general submodular function is used to compute the influence value of a new server point, and the calculation of the influence value can be complicated when it incorporates the social relationships of users. We propose an optimization technique called “maximal-arc” to improve the efficiency of the influence computation. We also develop both exact and approximation algorithms with guaranteed error bounds for solving MaxBRkNN in LBSNs. Our experiments on four datasets validate the high efficiency of our algorithms.

Secondly, we focus on the problem of optimal region search with submodular maximization (ORS-SM). This problem considers a region as a connected subgraph; it computes an objective value over the locations in the region using a submodular function, and a budget value by summing up the costs of the edges in the region. The target of ORS-SM is to find the region with the largest objective score under a budget constraint. We prove that the problem is NP-hard and propose an approximation algorithm, AppORS, with a guaranteed error bound. We also design another algorithm with the same approximation ratio, called IAppORS, to further improve the effectiveness of AppORS, and present two heuristic methods to implement a key function of IAppORS. The experimental study conducted on both real and synthetic datasets demonstrates the efficiency and accuracy of our algorithms.

Finally, we study the constrained path search with submodular maximization (CPS-SM) query. We aim to find the path with the best submodular function score under a given constraint (e.g., a length limit), where the submodular function score is computed over the set of nodes in the path. We show that answering the CPS-SM query is NP-hard. We first propose a concept called “submodular α-dominance” by utilizing the properties of the submodular function, and we develop an approximation algorithm based on this concept. By relaxing the submodular α-dominance conditions, we design another approximation algorithm with better efficiency and the same error bound. We also utilize bidirectional path search to further improve the efficiency, and propose a heuristic polynomial-time algorithm that is efficient yet effective in practice. Our experiments on two applications using five datasets show that our algorithms achieve high accuracy and are faster than a state-of-the-art method by orders of magnitude.

Publications

• Xuefeng Chen, Xin Cao, Zhiqiang Xu, Ying Zhang, Shuo Shang, Wenjie Zhang. Accelerate MaxBRkNN Search by kNN Estimation. ICDE 2019. (Chapter 3)

• Xuefeng Chen, Xin Cao, Zhiqiang Xu, Ying Zhang, Shuo Shang, Wenjie Zhang. Exact and Approximate Approaches to Maximizing Bichromatic Reverse k-Nearest Neighbor in Geo-social Networks. The VLDB Journal (under review). (Chapter 3)

• Xuefeng Chen, Xin Cao, Yifeng Zeng, Yixiang Fang, Bin Yao. Optimal Region Search with Submodular Maximization. IJCAI 2020. (Chapter 4)

• Xuefeng Chen, Xin Cao, Yifeng Zeng, Yixiang Fang, Sibo Wang, Liang Feng, Xuemin Lin. Constrained Path Search with Submodular Function Maximization. VLDB 2021 (under review). (Chapter 5)

Dedication

To my parents,
my relatives,
and my friends,
for their love and support.

Acknowledgements

I would like to express my deepest gratitude to my supervisor Dr. Xin Cao for his patient guidance and valuable support. Dr. Xin Cao gave me a great deal of help, and was always willing to share everything he knows with me. He taught me how to find interesting ideas, how to design models and develop algorithms, how to write technical papers clearly, and how to become an independent researcher. I remember that Dr. Xin Cao helped me revise proofs and rewrite sections of my technical papers, which always reminds me to be a rigorous researcher. Without his illuminating instructions and consistent encouragement, this thesis could not have reached its present form.

I am also thankful to my joint supervisor Prof. Xuemin Lin, and to Prof. Yifeng Zeng, Prof. Ying Zhang, Prof. Wenjie Zhang, and Dr. Yixiang Fang for their helpful discussions and insightful suggestions on the work in this thesis.

Besides, special thanks to all the people I have met in this wonderful research group: Prof. Lu Qin, Prof. Zengfeng Huang, Prof. Weiwei Liu, Prof. Long Yuan, Prof. Jianye Yang, Prof. Fan Zhang, Prof. Xiaoyang Wang, Prof. Shiyu Yang, Dr. Lijun Chang, Dr. Longbin Lai, Dr. Xing Feng, Dr. Fei Bi, Dr. Shenlu Wang, Dr. Yang Yang, Dr. Haida Zhang, Dr. Xubo Wang, Dr. Wei Li, Dr. Dong Wen, Dr. Dian Ouyang, Dr. Chen Zhang, Dr. Kai Wang, Dr. You Peng, Dr. Boge Liu, Ms. Xiaoshuang Chen, Ms. Conggai Li, Ms. Yuting Zhang, Ms. Wanqi Liu, Ms. Maryam Ghafouri, Mr. Yu Hao, Mr. Wentao Li, Mr. Yuren Mao, Mr. Zhengyi Yang, Mr. Mingjie Li, Mr. Qingyuan Linghu, Mr. Yixing Yang, Mr. Ruisi Yu, Mr. Hanchen Wang, Mr. Jiahui Yang, Mr. Chenji Huang, Mr. Kongzhang Hao, Mr. Yizhang He. The time we spent together will be remembered forever.

Meanwhile, I appreciate the support of the University International Postgraduate Award provided by UNSW and the HDR Completion Scholarship provided by the Commonwealth through an “Australian Government Research Training Program Scholarship”.

Last but not least, I would like to thank my parents, relatives, and friends for their love and support during my PhD study.

Contents

Abstract xii

Publications xiv

Dedication xvi

Acknowledgements xviii

List of Figures xxiv

List of Tables xxvi

List of Algorithms xxvii

1 Introduction 1

1.1 Background...... 1

1.2 Motivations...... 3

1.2.1 Exact and Approximate Approaches for MaxBRkNN...... 3

1.2.2 Optimal Region Search with Submodular Maximization....5

1.2.3 Constrained Path Search with Submodular Maximization....7

1.3 Contributions...... 10

1.4 Organization...... 11

2 Literature Review 12

2.1 Submodular Optimization and Its Related Work in Location-Based Social Networks...... 12

2.1.1 Submodular Optimization...... 12

2.1.2 Submodular Optimization in Location-Based Social Networks. 14

2.2 The MaxBRkNN Problem...... 15

2.2.1 RkNN...... 15

2.2.2 MaxBRkNN...... 15

2.2.3 Location Selection Based on RkNN...... 17

2.2.4 Influence Spread Computation...... 17

2.3 Optimal Region Search...... 18

2.3.1 Location Search...... 18

2.3.2 Region Search...... 19

2.4 Constrained Path Search...... 19

2.4.1 Constrained Shortest Path Query...... 20

2.4.2 Other Related Work on Constrained Path Search...... 21

3 Exact and Approximate Approaches for MaxBRkNN 23

3.1 Introduction...... 23

3.2 Problem Definition...... 27

3.3 General Framework for Solving MaxBRkNN...... 30

3.4 Exact Approach...... 32

3.4.1 Hybrid Quadtree-Grid Index QGI...... 32

3.4.2 kNLC Radius Estimation using QGI...... 37

3.4.3 Efficient Computation of Overlapping Circles...... 39

3.4.4 Maximal-Arc Based Optimization...... 41

3.4.5 Algorithm GkE ...... 45

3.5 Approximate Approach...... 47

3.5.1 Algorithm PIC ...... 47

3.5.2 Algorithm PRIC ...... 54

3.6 Experimental Study...... 55

3.6.1 Experimental Settings...... 55

3.6.2 Experimental Results...... 57

3.6.3 Performance Comparison on MaxBRkNN with Complex Influence Value...... 63

3.7 Conclusion...... 68

4 Optimal Region Search with Submodular Maximization 69

4.1 Introduction...... 69

4.2 Problem Formulation...... 72

4.3 Approximation Algorithm AppORS...... 74

4.4 Improved AppORS...... 79

4.4.1 An Improved Version of AppORS...... 79

4.4.2 Two Heuristic Methods for IAppORS...... 80

4.4.3 A Better findOptTree(V^K_opt, G, ∆, v_opt)...... 82

4.4.4 Time Complexity of the Proposed Algorithms...... 83

4.5 Experimental Study...... 83

4.5.1 Case Study 1: Most Influential Region...... 84

4.5.2 Case Study 2: Most Diversified Region...... 86

4.6 Conclusion...... 89

5 Constrained Path Search with Submodular Maximization 90

5.1 Introduction...... 90

5.2 Problem Formulation...... 94

5.3 Approximation Algorithm SDD...... 98

5.4 Approximation Algorithm OSDD...... 101

5.5 Bidirectional Search Method...... 107

5.6 A Polynomial Heuristic Algorithm CBFS ...... 113

5.7 Experimental Study...... 114

5.7.1 Experimental Settings...... 114

5.7.2 Application 1: Users’ Preferences Coverage...... 115

5.7.3 Application 2: Most Diversified Path Query...... 119

5.8 Conclusion...... 122

6 Conclusions and Future Work 124

6.1 Conclusions...... 124

6.2 Future Work...... 126

Bibliography 128

List of Figures

3.1 An example of MaxBRkNN...... 24

3.2 An example of MBRkNN in location-based social networks...... 29

3.3 An example of one level of QGI...... 33

3.4 An example of cell rings...... 33

3.5 The hierarchical structure...... 35

3.6 An example of the hybrid quadtree-grid index...... 36

3.7 An example of binary search method...... 38

3.8 An example of checking whether two kNLCs intersect with each other.. 41

3.9 Entering and leaving points...... 42

3.10 An example for Maximal Arc...... 43

3.11 An example of one pruning operation...... 49

3.12 An example for the proof of bound...... 54

3.13 Effect of parameter nvc on run time...... 57

3.14 Comparison of one-level grid and two-level grid on run time...... 58

3.15 Effect of β on run time of PIC...... 59

3.16 Effect of β on the precision of PIC...... 60

3.17 Comparison of runtime on four datasets...... 61

3.18 Comparison of influence on four datasets...... 62

3.19 Scalability...... 63

3.20 Comparison of PIC and PRIC on the efficiency...... 65

3.21 Comparison of PIC and PRIC on the result quality...... 65

3.22 Comparison of runtime on four datasets on MaxBRkNNSI...... 66

3.23 Comparison of influence on four datasets on MaxBRkNNSI...... 67

3.24 Scalability on MaxBRkNNSI...... 68

4.1 A toy example of road network with POIs...... 70

4.2 Effect of parameter γ on IAppORSHeu2 in MIRS...... 85

4.3 Comparison of algorithms in MIRS...... 85

4.4 Effect of parameter γ on IAppORSHeu2 in MDRS...... 87

4.5 Performances of the algorithms in MDRS on CA...... 87

4.6 Scalability of our algorithms in MDRS on CA...... 88

4.7 Scalability of our algorithms in MDRS on SY...... 88

5.1 A toy road network with locations labeled by keywords...... 91

5.2 Effect of parameter α with ∆ = 50 kilometer...... 116

5.3 Effect of parameter β with ∆ = 50 kilometers for CBFS in application 1. 117

5.4 Effect of parameter γ with α = 1.1 and ∆ = 50 kilometer on FL..... 117

5.5 Comparison with baselines varying ∆ ...... 118

5.6 Comparison of proposed algorithms varying ∆...... 119

5.7 Effect of parameter α with ∆ = 50 kilometer...... 120

5.8 Effect of parameter γ with α = 1.2, ∆ = 50 kilometer...... 121

5.9 Effect of parameter β with ∆ = 50 kilometer for CBFS...... 122

5.10 Comparison of methods varying ∆...... 123

List of Tables

3.1 Frequently Used Notation in MaxBRkNN...... 27

4.1 Frequently Used Notation in ORS-SM...... 72

5.1 Frequently Used Notation in CPS-SM...... 95

List of Algorithms

1 General Framework for MaxBRkNN...... 31

2 Estimating Radii of kNLCs in One Level...... 34

3 Estimating kNLCs in Two Layers...... 40

4 Algorithm PIC with One Pruning Operation...... 48

5 Algorithm for finding Opru ...... 51

6 AppORS Algorithm...... 77

7 New update(v_opt, V^K_opt)...... 79

8 Function f indFeaNodeS et1(vi, G, ∆)...... 81

9 SDD...... 99

10 OSDD...... 103

11 runPhase2(G, vs, vt, ∆, Q2, L(vi) for all nodes vi ∈ V)...... 105

12 Bi-SDD...... 109

13 computeBackwardLabel(G^T, vt, vs, ∆, ∆t)...... 110

14 joinLabels(→L^k_i, ←L(vi), ∆rem, OSmax)...... 111

Chapter 1

Introduction

Submodular optimization is a fundamental optimization problem that is widely useful in many applications such as viral marketing, information gathering, recommendation systems, and region search. In this chapter, we briefly introduce the background of submodular optimization in location-based social networks, the motivations of the three studied topics, and the contributions of the thesis, and we present the organization of the thesis.

1.1 Background

Submodularity is a property of set functions that exhibit diminishing returns. Formally, a set function f : 2^V → ℝ is submodular if, for any two sets A ⊆ B ⊆ V and any element v ∈ V \ B, we have f(A ∪ {v}) − f(A) ≥ f(B ∪ {v}) − f(B). Intuitively, the marginal contribution of an element v diminishes as the set to which it is added grows.
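As a concrete illustration of this definition, the following sketch checks the diminishing-returns inequality for a simple coverage function f(S) = |union of the sets indexed by S|; the ground set and the coverage sets are toy assumptions, not data from this thesis.

```python
# Toy sketch: verifying diminishing returns for a coverage function.
def coverage(selected, sets_by_item):
    """f(S): number of distinct elements covered by the items in S."""
    covered = set()
    for item in selected:
        covered |= sets_by_item[item]
    return len(covered)

# Illustrative ground set V = {'a', 'b', 'c'} with assumed coverage sets.
sets_by_item = {
    'a': {1, 2, 3},
    'b': {3, 4},
    'c': {4, 5, 6},
}

A = {'a'}
B = {'a', 'b'}          # A is a subset of B
v = 'c'                 # v lies in V \ B

gain_A = coverage(A | {v}, sets_by_item) - coverage(A, sets_by_item)
gain_B = coverage(B | {v}, sets_by_item) - coverage(B, sets_by_item)
assert gain_A >= gain_B  # adding v to the smaller set gains at least as much
```

Here the gain of adding 'c' to A is 3, while its gain relative to the larger set B is only 2, matching the inequality above.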

As submodularity is common in a variety of research areas such as combinatorial optimization and machine learning, submodular optimization is an important problem that has received much attention [Fuj05, KG14]. Submodular optimization is also useful in a wide range of real-world applications, including viral marketing [KKT03], information gathering [LKG+07], image segmentation [JB11], and deep neural network training [JSB19]. Most existing studies of submodular optimization focus on the problem of selecting a subset of items from a whole set in order to maximize a given submodular function over the set, such as influence maximization [KKT03] and information gathering [LKG+07]. For these subset selection problems, the greedy algorithm can always achieve an approximate solution with theoretical guarantees [NWF78, DK11].
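The classical greedy algorithm referred to above can be sketched as follows for a monotone submodular function under a cardinality constraint, where it achieves the well-known (1 − 1/e)-approximation [NWF78]; the coverage objective and item sets below are illustrative stand-ins, not an objective from this thesis.

```python
# Minimal sketch of greedy subset selection for a monotone submodular f.
def greedy_max(ground_set, f, k):
    """Pick k items, each round adding the item with the largest marginal gain."""
    selected = set()
    for _ in range(k):
        best_item, best_gain = None, 0
        for item in sorted(ground_set - selected):  # sorted for determinism
            gain = f(selected | {item}) - f(selected)
            if gain > best_gain:
                best_item, best_gain = item, gain
        if best_item is None:   # no remaining item improves the objective
            break
        selected.add(best_item)
    return selected

# Assumed toy instance: items cover sets of elements; f counts coverage.
sets_by_item = {'a': {1, 2}, 'b': {2, 3}, 'c': {4}, 'd': {1, 4}}
f = lambda S: len(set().union(*(sets_by_item[i] for i in S))) if S else 0
picked = greedy_max(set(sets_by_item), f, k=2)
# Greedy picks {'a', 'b'} with value 3, while the optimum {'b', 'd'} has
# value 4 -- within the (1 - 1/e) factor guaranteed by the theory.
```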

In recent years, the wide availability of wireless communication and GPS-equipped mobile devices (e.g., mobile phones and tablets) enables people to access the internet almost anywhere at any time. This fosters the emergence of location-based social networks, where social networks are combined with locations to allow users to interact relative to their current locations. There exist many popular location-based social network applications, such as Facebook (www.facebook.com) and Foursquare (www.foursquare.com), in which huge volumes of data are available involving multiple types (e.g., spatial, temporal, textual) and complex relationships (e.g., user social links). This facilitates many applications in location-based social networks and related research work.

In many applications in location-based social networks, a simple accumulative function (e.g., SUM) cannot measure the objective value of the search results (e.g., the likelihood of information [LKG+07], the influence or the diversity of a region [FCB+16], the entropy of locations [ZV16]). To address this issue, we need to utilize a general submodular function to compute the objective value. This suggests that there exist a large number of submodular optimization problems in location-based social networks, many of which have been studied in recent research works, for instance, influence maximization in location-based social networks [LCF+14, ZCL+15, WYX+19], the best region search problem in location-based social networks [FCB+16], and submodular maximization under routing constraints [ZCC+15, ZV16].

The data in location-based social networks is large in scale and has complex types, and there could be various constraints (e.g., spatial, routing, and connectivity constraints) in real-world applications. As a result, the classical greedy algorithm for submodular optimization cannot find an approximate solution in a reasonable time [LCF+14, ZCC+15], or even a solution with an error bound [FCB+16]. This means that the existing work is not sufficient to solve the problems in these real-world applications, considering the data volume and the various constraints.

In this thesis, we focus on three typical problems of submodular optimization in location-based social networks, and utilize the location information, the properties of the submodular functions, and the constraints to design exact and approximation algorithms to solve these problems efficiently.

1.2 Motivations

In this section, we present the motivations of three studied topics that are useful in many applications such as service location selection, marketing, and tourist trip planning.

1.2.1 Exact and Approximate Approaches for MaxBRkNN

Maximizing bichromatic reverse k-nearest neighbor (MaxBRkNN) is a useful problem in smart cities, as it has many practical applications such as service location planning, advertising, and profile-based marketing. For example, a company may want to find a suitable region in which to open a new branch so as to attract as many potential customers as possible; this can be modeled as the MaxBRkNN problem, which aims to find an optimal region to place a new server point s such that the size of the bichromatic reverse k-nearest neighbor (BRkNN) query result of s is maximized.

MaxBRkNN is defined based on the BRkNN query, which can be used in many applications such as decision support, location planning, and resource allocation [KMS+07]. In a BRkNN query, there are two types of objects, P (the set of server points) and O (the set of client points), in a given Euclidean space, and the target of a query q is to find all client points o ∈ O whose k-nearest neighbors (kNN) in P ∪ {q} contain the location of q.
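The BRkNN semantics above can be made concrete with a brute-force evaluation; this is only an illustrative sketch on assumed toy coordinates, not the indexing approach developed in this thesis.

```python
# Brute-force BRkNN: client o is in the answer if the query location q
# ranks among o's k nearest points once q joins the server set P.
import math

def brknn(q, servers, clients, k):
    result = []
    for o in clients:
        candidates = servers + [q]
        candidates.sort(key=lambda p: math.dist(o, p))  # nearest first
        if q in candidates[:k]:
            result.append(o)
    return result

P = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]   # existing server points (assumed)
O = [(1.0, 1.0), (3.5, 0.5), (5.0, 5.0)]   # client points (assumed)
print(brknn((5.0, 4.5), P, O, k=1))        # only (5.0, 5.0) is closer to q
```

This O(|O| · |P| log |P|) scan is exactly what the index structures discussed later are designed to avoid on large datasets.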

In particular, to study MaxBRkNN in location-based social networks, we use a monotone submodular function to compute the influence value of a region. This makes the MaxBRkNN problem more general, but brings some challenges.

Challenges. There are several algorithms for solving the MaxBRkNN problem, e.g., the MaxOverlap algorithm [WÖY+09, WÖF+11], the MaxFirst algorithm [ZWL+11], the MaxSegment algorithm [LWW+13], and the OptRegion algorithm [LCGL13, HFYD16]. The main steps of these existing algorithms are similar: first, find the kNN results for each client point; then, draw a circle (called the kth nearest location circle (kNLC) of a client point) centered at each client point enclosing all its kNN results; and finally, perform a search based on these circles. The challenges of efficiently solving MaxBRkNN are twofold.

1. As k is given by users dynamically, kNN must be computed in an online manner. Since the kNN computation step consumes huge processing time, existing algorithms are time-consuming if there is a large number of client points and server points, or if there are updates of server points. It is a challenge to efficiently compute kNNs when the dataset is large.

2. In some applications, the submodular function for computing the region influence value could be complicated (e.g., the social influence spread [KKT03], whose computation is #P-hard [CWW10]). The challenge is how to reduce the number of region influence value computations while keeping high solution quality.
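The first step of the kNLC-based framework shared by these algorithms can be sketched as follows, on assumed toy data: each client's kNLC is the circle centred at the client whose radius is the distance to its k-th nearest server, and a new server placed inside that circle enters the client's kNN set.

```python
# Sketch of kNLC radius computation (toy data; real algorithms use indexes).
import math

def knlc_radii(clients, servers, k):
    """kNLC radius of each client: distance to its k-th nearest server."""
    radii = {}
    for o in clients:
        dists = sorted(math.dist(o, p) for p in servers)
        radii[o] = dists[k - 1]
    return radii

P = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]   # server points (assumed)
O = [(1.0, 0.0), (1.0, 1.0)]               # client points (assumed)
radii = knlc_radii(O, P, k=2)
# A new server within radii[o] of client o would enter o's 2NN set, so
# the optimal region lies in a maximal overlap of these circles.
```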

Our Approaches. To address the first challenge, we propose a hybrid quadtree-grid index (QGI) that adopts a two-level structure with different granularities estimated by using k. This index can help us estimate the radius of the circle enclosing the kNN result for a group of client points together, and speed up the computation of estimated circle intersections. Based on QGI, we design an algorithm GkE that uses the estimated circles to quickly prune many client points that cannot be in the influence set of the optimal region, and then computes exact kNNs on only a few clients to obtain the final result.

To solve the second issue, we develop a maximal-arc based optimization technique to reduce the influence computation cost. Meanwhile, to achieve even better efficiency, we develop approximation algorithms PIC and PRIC with user-controlled error bounds to solve MaxBRkNN approximately. The approximation algorithms are useful in cases where runtime matters more than exactness.

1.2.2 Optimal Region Search with Submodular Maximization

Region search, an important problem in location-based services, is widely used in applications such as user exploration in a city [FCB+16, CCJY14], similar region queries [FCJG19], and region detection [FGC+19].

Most existing studies consider a region as a rectangle with a fixed length and width [NB95, CCT12, THCC13, FCB+16]. However, this is not general in practice, since regions might be of arbitrary shapes. To fill this gap, Cao et al. [CCJY14] define a region as a connected subgraph and propose the Length-Constrained Maximum-Sum Region (LCMSR) query. It considers two attributes for region search: an objective value computed by summing up the weights of the nodes in a region, and a budget value computed by summing up the costs of the edges in a region. The target of the LCMSR query is to find the region that has the largest objective value under a given budget constraint.

However, the objective value in some applications (e.g., the influence or the diversity of a region [FCB+16], the likelihood of information [LKG+07]) cannot be computed by a simple accumulative function (e.g., SUM). In this thesis, we use a submodular function to compute the objective score of a region defined as a connected subgraph, which makes the problem more general and challenging. We denote this problem as optimal region search with submodular maximization (ORS-SM).
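The two quantities that ORS-SM balances can be illustrated on a toy region; the keyword sets, edge costs, and coverage objective below are illustrative assumptions, not the thesis's benchmark data.

```python
# Toy ORS-SM scores: a region is a connected subgraph; its budget sums
# edge costs, and its objective is a submodular function over its nodes
# (here, keyword coverage as an assumed example).
keywords = {1: {'cafe'}, 2: {'cafe', 'park'}, 3: {'museum'}}
edge_cost = {(1, 2): 2.0, (2, 3): 3.0}

def objective(nodes):
    """Submodular coverage score: distinct keywords in the region."""
    return len(set().union(*(keywords[v] for v in nodes)))

def budget(edges):
    """Total cost of the region's edges."""
    return sum(edge_cost[e] for e in edges)

region_nodes, region_edges = {1, 2, 3}, {(1, 2), (2, 3)}
assert objective(region_nodes) == 3   # covers cafe, park, museum
assert budget(region_edges) == 5.0    # feasible whenever the limit ∆ >= 5
```

Note that adding node 2 after node 1 contributes only the new keyword 'park', which is exactly the diminishing-returns behaviour a SUM objective cannot capture.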

Challenges. The ORS-SM problem is a type of submodular optimization [KG14], which is extremely hard [ZCC+15, ZV16, SZKK17, Cra19], and solving the ORS-SM problem is proved to be NP-hard. To the best of our knowledge, only the Generalized Cost-Benefit algorithm (GCB) [ZV16], designed for submodular optimization with routing constraints, can be adapted to address the ORS-SM problem. However, GCB requires a root node, which does not exist in our problem. Hence, we have to find a tree with the largest objective value by scanning all subtrees rooted at each node; we denote this method as GCBAll. The main challenges of efficiently solving the ORS-SM problem are as follows.

1. As solving the ORS-SM problem is NP-hard and GCBAll is time-consuming, it is a challenge to develop more efficient approximation algorithms for the ORS-SM problem.

2. In general, a faster approximation algorithm produces solutions of lower quality. Thus, it is important to propose optimizations that improve the solution quality of the developed approximation algorithms.

Our Approaches. To solve the ORS-SM problem efficiently, we propose a 1/O(√(∆/b_min))-approximation algorithm AppORS, where b_min is the minimal cost on edges and ∆ is the budget limit. We observe that the returned region is always a tree. The basic idea of this algorithm is to first find a set of nodes with a large objective score such that there exists a tree connecting them whose total cost is smaller than the budget limit ∆, and then obtain the best region by finding a tree connecting these nodes.

To further improve the effectiveness of AppORS, we develop another algorithm IAppORS with the same approximation ratio. This algorithm first finds a node set using the method in AppORS to keep the approximation ratio, and then obtains a better node set with a larger objective score to improve the quality of the solution. Meanwhile, we present two heuristic methods to implement a key function for obtaining a better node set in IAppORS.

1.2.3 Constrained Path Search with Submodular Maximization

The path search problem is an essential task in many applications such as navigation and tourist trip planning. The problem of path search with constraints [Jok66, Has92, WXYL16] is useful in real applications; it aims to find the optimal path under a given constraint (such as a length limit).

Typically, the problem of path search with constraints considers two scores for each path: an objective score (to be optimized) and a budget score (for constraint checking).

The objective score of a path can be defined in various ways according to the application. In this thesis, we use a general submodular function to compute the objective score over a set of nodes, and we study the problem that aims to find the path with the maximal objective score (computed over the nodes in the path) from an origin s to a destination t satisfying a given budget limit ∆. This problem was first studied by Chekuri and Pal [CP05] as the submodular orienteering problem. As the problem is an s-t path query, we rename it the constrained path search with submodular maximization (CPS-SM) query.

The CPS-SM query can be used in many applications such as optimal route search for keyword coverage [ZCC+15] and most diversified path query.

Challenges. Most constrained path search problems use a simple aggregate SUM function as the objective function [WXYL16]. In this setting, given two partial paths P1 and P2 from the origin s to the same node vj, if both scores of P1 are worse than those of P2, P1 can be pruned safely. This is because the two scores of path P1 ∪ vj are also worse than those of path P2 ∪ vj, where P1 ∪ vj and P2 ∪ vj are the two paths obtained by extending P1 and P2 with the same node vj, respectively. Unfortunately, this does not hold in the CPS-SM query, where the objective score is computed by a submodular function on the set of nodes in a path. Due to the submodular properties, even if P1 is worse than P2 at vj, P1 ∪ vj may not be worse than P2 ∪ vj. This implies that existing algorithms for constrained path search problems are inapplicable to the CPS-SM query.
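The failure of sum-style dominance pruning can be exhibited with a tiny coverage example; the node-coverage sets and costs below are hypothetical, chosen only to make the counterexample concrete:

```python
# Illustration of why sum-style dominance pruning is unsafe under a
# submodular objective. f is a coverage function: the objective of a path
# is the number of distinct items covered by its nodes.
COVER = {"u1": {"x"}, "u2": {"y", "z"}, "vj": {"y", "z"}}

def f(path):
    return len(set().union(*(COVER[v] for v in path)))

P1 = ["u1"]              # reaches vj's predecessor with budget cost 6
P2 = ["u2"]              # reaches the same node with budget cost 5
cost = {"P1": 6, "P2": 5}

# P1 looks dominated: worse objective AND worse budget cost ...
assert f(P1) < f(P2) and cost["P1"] > cost["P2"]

# ... yet after extending both paths with the same node vj, the
# "dominated" path has the HIGHER objective, because vj's coverage is
# redundant for P2 but entirely new for P1.
assert f(P1 + ["vj"]) > f(P2 + ["vj"])
```

So pruning P1 at vj's predecessor can discard the path that leads to the optimum, which is precisely the obstacle the submodular α-dominance concept developed in this thesis is designed to work around.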

Meanwhile, like most submodular optimization problems [ZCC+15, ZV16, SZKK17, Cra19], solving the CPS-SM query is proved to be NP-hard. There exist the recursive greedy (RG) algorithm [CP05] and the GreedyDP algorithm [SNY18] for answering the CPS-SM query. RG is a quasi-polynomial time algorithm, which finds solutions by recursively guessing the middle node and the budget cost to reach the middle node, so that it requires a large number of iterations in practice. For each query, GreedyDP constructs a zero-suppressed binary decision diagram (ZDD) to represent all feasible solutions, and then obtains an approximate solution from the ZDD. As the size of the ZDD generally increases exponentially with the number of nodes in the graph, GreedyDP costs too much time and space to answer queries on large graphs. Experimental results show that RG costs more than 10^4 seconds on a small graph [SKG+07] with 22 nodes, and GreedyDP takes 4.0 × 10^3 seconds on a dataset with only 7 × 13 grids (each grid can be viewed as a node in the network) [SNY18]. Both algorithms are so time-consuming that they are not scalable and cannot be used to solve the problem on real-world datasets, which typically have thousands of nodes. The study [ZCC+15] considers a special case of the CPS-SM query and proposes a weighted A* algorithm to obtain approximate solutions. But the weighted A* algorithm is extremely time-consuming, as it stores a large number of partial paths during the search. The work [ZV16] proposes the Generalized Cost-Benefit (GCB) algorithm for the problem of submodular optimization with routing constraints. The algorithm can be adapted to answer the CPS-SM query, but the adapted algorithm can only return approximate solutions under a smaller budget limit. This means that GCB cannot guarantee any error bounds for the optimal solution under the original budget limit, and the returned result also has poor quality in some applications.

Thus, there are two major challenges in solving the CPS-SM query.

1. How to design an efficient approximation algorithm based on the submodular properties for answering the CPS-SM query.

2. How to further improve the efficiency of the designed approximation algorithms while keeping their effectiveness at the same time.

Our Approaches. To address the above challenges, we first propose a concept called "submodular α-dominance" by using the properties of the submodular function. This concept allows us to compare the quality of two partial paths from the origin to the same node, such that we can prune some partial paths during the path search with guaranteed error bounds. Based on this, we design the approximation algorithm SDD (submodular α-dominance based Dijkstra), which has an approximation ratio of (1/α)^(⌊∆/b_min⌋−1), where b_min is the minimal budget value on edges, α is a parameter, and ∆ is the budget limit. Next, we further improve the efficiency of the algorithm by relaxing the submodular α-dominance conditions and develop another approximation algorithm OSDD. This algorithm not only has the same approximation ratio, but is also much faster, as it can discard more partial paths during the search.
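The overall shape of such a search can be sketched as a Dijkstra-like label search with dominance pruning. The dominance test below (discard a new label if a stored label at the same node has no larger cost and α times its score is no smaller) is a deliberately simplified stand-in for the thesis's submodular α-dominance condition, whose exact definition appears in Chapter 5; the graph format and coverage objective are illustrative:

```python
import heapq

def cps_sm_search(adj, cover, s, t, budget, alpha=1.2):
    """Sketch of an alpha-dominance-pruned label search for an s-t query.
    adj: {u: [(v, edge_cost), ...]}; cover: {v: set of items}.
    Returns (best_score, best_path) within the budget."""
    def score(path):
        # submodular objective: distinct items covered by the path's nodes
        return len(set().union(*(cover[v] for v in path)))

    labels = {v: [] for v in adj}        # stored (cost, score) pairs per node
    start = (s,)
    heap = [(0, -score(start), start)]   # (cost, -score, path)
    best = (0, None)
    while heap:
        cost, neg_sc, path = heapq.heappop(heap)
        u = path[-1]
        if u == t:                       # simple paths ending at t
            if -neg_sc > best[0]:
                best = (-neg_sc, path)
            continue
        for v, w in adj[u]:
            if v in path or cost + w > budget:
                continue
            new_path = path + (v,)
            c, sc = cost + w, score(new_path)
            # keep the new label only if no stored label alpha-dominates it
            if any(c2 <= c and alpha * s2 >= sc for c2, s2 in labels[v]):
                continue
            labels[v].append((c, sc))
            heapq.heappush(heap, (c, -sc, new_path))
    return best
```

Larger α discards more labels and speeds the search up at the price of solution quality, which mirrors the efficiency/effectiveness trade-off between SDD and OSDD described above.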

To further speed up SDD and OSDD, we utilize bi-directional path search to reduce the number of generated labels. After applying the bi-directional search to both SDD and OSDD, we obtain two further algorithms, Bi-SDD and Bi-OSDD. We show that the two algorithms have an approximation ratio of (1/α)^(⌊∆/b_min⌋), and are more efficient than SDD and OSDD, respectively. Meanwhile, we devise an efficient heuristic algorithm CBFS with polynomial runtime. Although it does not have any guaranteed error bounds, it returns results with good quality in practice.

1.3 Contributions

We briefly summarize the main contributions of this thesis in the following three parts.

Exact and Approximate Approaches for MaxBRkNN. We propose a hybrid quadtree-grid index QGI to estimate kNNs for each client point quickly, and an exact method GkE based on QGI to avoid expensive exact kNN computation by pruning some client points. We also propose a maximal-arc based optimization technique to reduce the influence computation cost, and two approximation algorithms to allow users to balance the run time and the accuracy. In addition, we conduct experiments on both real and synthetic location-based social networks to show the effectiveness and efficiency of our algorithms.

Optimal Region Search with Submodular Maximization. We propose the ORS-SM problem and prove that the problem is NP-hard. To solve the problem, we develop two approximation algorithms, AppORS and IAppORS, and present two heuristic methods to implement a key function in IAppORS. The experimental study on two applications using three real-world datasets demonstrates the efficiency and accuracy of the proposed algorithms.

Constrained Path Search with Submodular Maximization. We introduce the CPS-SM query and prove its NP-hardness. We propose a concept called "submodular α-dominance" by using the properties of the submodular function, and then design two approximation algorithms based on this concept to answer the CPS-SM query. Meanwhile, we utilize the bi-directional search method to speed up the proposed approximation algorithms, and develop an efficient yet effective heuristic algorithm with polynomial runtime. We finally conduct experiments on both real and synthetic datasets to illustrate the efficiency and accuracy of the proposed algorithms.

1.4 Organization

This thesis is organized as follows:

• Chapter 2 presents a survey of the related work.

• Chapter 3 studies the MaxBRkNN problem in LBSNs and proposes some efficient methods for the problem.

• Chapter 4 proposes the ORS-SM problem and some efficient algorithms to solve the problem.

• Chapter 5 presents the CPS-SM query and our approaches for answering the query efficiently.

• Chapter 6 concludes this thesis and presents several possible directions for future work.

Chapter 2

Literature Review

In this chapter, we give an overview of the related work for the submodular optimization problems studied in this thesis. We first introduce the studies on submodular optimization and its related problems in location-based social networks in Section 2.1. Next, we discuss the related work for the MaxBRkNN problem, optimal region search, and constrained path search in Sections 2.2, 2.3, and 2.4, respectively.

2.1 Submodular Optimization and Its Related Work in Location-Based Social Networks

In this section, we briefly review the studies on submodular optimization and present recent works on submodular optimization in location-based social networks.

2.1.1 Submodular Optimization

Submodular optimization has been studied substantially due to its breadth of applicability, with applications including viral marketing [KKT03], information gathering [LKG+07], recommendation systems [CZC+15], region search [FCB+16], deep neural network training [JSB19], etc. A classical problem of submodular optimization is to maximize a non-negative monotone submodular function under a cardinality constraint. To solve this problem, Nemhauser et al. [NWF78] proposed a simple greedy algorithm with a constant-factor approximation ratio of 1 − 1/e, and Das et al. [DK11] further improved the approximation ratio by introducing a submodularity ratio. Meanwhile, various generalizations of submodular optimization have been the focus of many recent studies. Next, we briefly review different types of submodular optimization problems from three aspects.
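The greedy algorithm of Nemhauser et al. is short enough to state directly. The sketch below instantiates it with a coverage function; the item names and coverage sets are illustrative:

```python
def greedy_submodular(items, cover, k):
    """The simple greedy of Nemhauser et al.: repeatedly pick the item with
    the largest marginal gain. For a monotone submodular objective (here, a
    coverage function) under a cardinality constraint k, the result is
    within a factor 1 - 1/e of the optimum."""
    chosen, covered = [], set()
    remaining = list(items)
    for _ in range(k):
        # marginal gain of v = number of newly covered elements
        best = max(remaining, key=lambda v: len(cover[v] - covered))
        if not cover[best] - covered:       # zero marginal gain: stop early
            break
        chosen.append(best)
        covered |= cover[best]
        remaining.remove(best)
    return chosen, covered
```

The same marginal-gain template underlies most of the cost- and routing-constrained variants discussed below; what changes is the feasibility check applied before an item (or node) may be picked.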

In terms of optimization objectives, there are two types of submodular optimization problems. One is submodular function maximization, such as the classical problem mentioned above, while the other is submodular function minimization. Submodular function minimization aims to compute the minimum value as well as a minimizer of a submodular function [Iwa08]. It arises in applications such as power assignment and transportation problems [IJB13]. The first polynomial algorithm for submodular function minimization was proposed by Grötschel et al. [GLS81], and since then a long line of work has been dedicated to accelerating the solution of submodular minimization. Recently, a randomized algorithm [ALS20] and a parallel algorithm [BS20] have been developed to solve submodular minimization efficiently. As our problems are all problems of submodular function maximization, we focus on submodular function maximization in the following discussion.

Based on the properties of submodular functions, submodular function maximization falls into several categories. The classical problem above considers monotone submodular functions, while some work studies non-monotone submodular function maximization [QYT18, Sak20] and proposes algorithms with error bounds. Meanwhile, the effect of the order of selecting items on submodularity has been considered in submodular function maximization. Tschiatschek et al. [TSK17] proposed a new class of sequence submodular functions on a directed graph. As this function is expressive, it has been used to study submodularity on hypergraphs [MFKK18] and adaptive sequence submodularity [MKF+19]. Meanwhile, Qian et al. [QFT18] proposed an algorithm POSEQSEL based on Pareto optimization to solve the problem of sequence submodular maximization.

In addition, the problems of submodular function maximization can be divided into categories according to their problem settings. There exist many different submodular function maximization problems under different constraints, such as the cardinality constraint [KKT03], cost constraint [QSYT17], matroid constraint [BFG19, CQ19], routing constraint [ZCC+15, ZV16], connectivity constraint [KLT14, GN20], and so on. Moreover, the problems vary between static and dynamic problem settings. For example, Roostapour et al. [RNNF19] investigate subset selection under dynamic cost constraints, where the budget constraint changes over time. Mitrovic et al. [MKF+19] model and study adaptive sequence submodularity by taking into account past observations as well as the uncertainty of future outcomes.

2.1.2 Submodular Optimization in Location-Based Social Networks

In recent years, many studies have considered problems of submodular optimization in location-based social networks, since location-based social networks play an important role in real-world applications. As a typical problem of submodular optimization, influence maximization [KKT03] has been studied in location-based social networks [LCF+14, ZCL+15, WYX+19]. The best region search problem aims to find a fixed-size rectangular region such that the score of a given submodular function is maximized in location-based social networks [FCB+16]. The studies [ZCC+15, ZV16] of submodular optimization under routing constraints consider path search with submodular function maximization in location-based social networks. We will discuss more studies on submodular function maximization in location-based social networks relevant to the three problems studied in this thesis in the following sections.

2.2 The MaxBRkNN Problem

We first introduce the related work on RkNN, MaxBRkNN, and location selection based on RkNN, and then show some studies on influence spread computation which is related to the computation of the submodular function in our MaxBRkNN problem.

2.2.1 RkNN

Reverse nearest neighbor search (RNN) was first proposed by Korn et al. [KM00]. It can be extended to reverse k-nearest neighbor search, which is classified into monochromatic (RkNN) and bichromatic (BRkNN) variants. This problem has since been studied extensively in different settings, such as in road networks [TTS10, LLS+18], in ad-hoc networks [NMW+13], on moving objects [KMS+07, CSC19], and over trajectories [WBC+17]. Social influence has been taken into consideration for kNN or RkNN search in recent studies. For example, Yuan et al. [YLC+16] propose kNN search on road-social networks by incorporating social influence, Zhao et al. [ZGC+17] study the reverse top-k geo-social keyword (RkGSKQ) query, which computes the similarity between two objects considering social relevance, textual similarity, and spatial distance, and Jin et al. [JGCZ20] present a group processing framework to solve multiple RkGSKQs.

2.2.2 MaxBRkNN

The MaxBRNN problem was first introduced by Cabello et al. [CDL+05, SMS+10]; it aims to maximize the size of the bichromatic RNN (BRNN) result for a new server. They also present a solution for the two-dimensional Euclidean space. Wong et al. [WÖY+09] extend the MaxBRNN problem to the MaxBRkNN problem, which is based on BRkNN. They propose the MaxOverlap algorithm to solve MaxBRkNN. The MaxOverlap algorithm transforms the region search problem into a point search problem. It iteratively finds the intersection point that is covered by the largest number of k-Nearest Location Circles (kNLC), where a client's kNLC is the circle enclosing its kNN result. In their subsequent work, MaxOverlap is extended to solve MaxBRkNN in the Lp-norm and in three-dimensional space [WÖF+11]. The MaxOverlap algorithm is prohibitively expensive on large datasets because of the huge cost incurred by checking all intersection points.

To solve the MaxBRkNN problem more efficiently, Zhou et al. [ZWL+11] propose the MaxFirst algorithm, which partitions the space into quadrants recursively and prunes the unpromising quadrants according to the upper and lower bounds of the size of a quadrant's BRkNN. They also consider user preferences on different sites for MaxBRkNN. Liu et al. [LWW+13] present the MaxSegment algorithm, which converts the MaxBRkNN problem into an optimal circular arc search problem and then utilizes a plane-sweep-like method to find the optimal arcs. Lin et al. [LCGL13, HFYD16] develop the OptRegion algorithm, which uses a sweep-line method to scan the minimum bounding rectangles of kNLCs for finding the overlapping kNLCs. The OptRegion algorithm prunes kNLCs that cannot be in the result based on an estimation of the BRkNN result size of overlapping kNLCs. In addition, Luo et al. [LCB+18] propose an approach based on the OptRegion algorithm to solve the MaxBRkNN problem on streaming geo-data.

A common issue of all existing studies is that the kNNs of all clients must be computed in advance, which is a very time-consuming process. As such, these methods cannot scale well with the dataset size. Our proposed exact algorithm computes kNNs for only a few clients, and thus the efficiency is improved significantly. In addition, we are also the first to develop an approximation algorithm with a guaranteed error bound for this problem.

2.2.3 Location Selection Based on RkNN

Based on RkNN, Xia et al. [XZKD05] propose the problem of finding the top-t most influential sites among a given set of service sites. The problem defines the influence of a site as the total weight of its RNN results. Huang et al. [HWQ+11] and Chen et al. [CHW+15] study the top-k most influential locations selection problem, which finds k locations from a set of candidate locations such that the size of the k locations' BRNN set is maximized. Choudhury et al. [CCSC16] investigate the problem of finding an optimal location and a set of keywords from given candidate locations and keywords such that the size of the bichromatic reverse spatial and textual k nearest neighbors is maximized. Note that in these works the location is selected from a given set (a limited search space), and thus these methods are not applicable to MaxBRkNN.

2.2.4 Influence Spread Computation

The problem of computing the social influence of a set of users is a basic problem within the influence maximization problem [KKT03]. It is proved to be #P-hard [CWW10], and thus its time complexity is extremely high. This problem is considered on two basic diffusion models, the Independent Cascade (IC) model and the Linear Threshold (LT) model; in existing studies, the IC model is the more popular one [CWW10, LCZ+13, OAYK14]. In influence spread computation, a widely used baseline method is Monte Carlo (MC) simulation. To reduce the time cost of MC simulation, some heuristic algorithms (e.g., the maximum influence arborescence algorithm [CWW10] and the independent-paths based methods [LCZ+13]) have been proposed to estimate the social influence of a group of users. We adopt a sampling-based algorithm [OAYK14], because it can quickly calculate the social influence of a group of users with bounded error.
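The MC baseline under the IC model is straightforward to sketch; the graph format and probabilities below are illustrative, and the sampling-based estimator actually adopted in this thesis [OAYK14] is more sophisticated than this:

```python
import random

def mc_influence_spread(adj, seeds, rounds=2000, rng=None):
    """Monte Carlo estimate of influence spread under the Independent
    Cascade model: each newly activated node u gets exactly one chance to
    activate each inactive out-neighbor v, succeeding with probability p.
    adj: {u: [(v, p), ...]}. Returns the average number of activated nodes."""
    rng = rng or random.Random(0)
    total = 0
    for _ in range(rounds):
        active = set(seeds)
        frontier = list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v, p in adj.get(u, []):
                    if v not in active and rng.random() < p:
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        total += len(active)
    return total / rounds
```

Each round simulates one random cascade, so the estimate converges at the usual 1/√rounds Monte Carlo rate, which is exactly why MC simulation is accurate but expensive as a subroutine.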

2.3 Optimal Region Search

In this section, we review the related work on location search and region search.

2.3.1 Location Search

Location search is relevant to optimal region search. First, some studies search for geo-textual objects based on computing results with a single-object granularity in Euclidean space [CCJ10, CJW09, YZZ+19]. They aim to find lists of single objects such that each object is close to the query location and relevant to the query keywords. The min-dist location selection query [ZDXT06, QZK+12] also considers results with a single-object granularity; it finds a location for establishing a new facility such that the average distance between each client point and its nearest facility is minimized. The query has also been studied in the context of road networks [XYL11, CLW+14, CLW+15] with road network distance.

Meanwhile, finding a set-of-objects result has been studied substantially. Zhang et al. [ZZZ+14] present the problem of diversified spatial keyword search on road networks, considering both the relevance and the spatial diversity of the results. Mehta et al. [MSSV16] study a spatial-temporal-keyword query that combines keyword search with the problem of maximizing the spatio-temporal coverage and the diversity of the returned items. The collective spatial keyword query [CCJO11, LWWF13] retrieves a set of spatial objects such that the keywords of the result objects cover the query keywords and the objects are close to the query location. This query has also recently been investigated for moving spatial objects [XGS+20].

Our optimal region search problem aims to find a region with the maximal objective score computed by a monotone submodular function, which is different from the criteria used in the above location search problems.

2.3.2 Region Search

Region search has been well studied and has a wide range of applications such as user exploration in a city [FCB+16, CCJY14], similar region search [FCJG19], region detection [FGC+19], etc.

Most existing region search problems consider regions as fixed-size rectangles or fixed-radius circles. The Max-Enclosing Rectangle (MER) problem [NB95] aims to find a rectangle with a given size a × b such that the rectangle encloses the maximum number of spatial objects. This problem has been systematically studied as the maximizing range sum problem [CCT12, THCC13]. The similar region search [FCJG19], bursty region detection [FGC+19], and best region search [FCB+16] problems also consider regions as rectangles.

However, as shown in the study of the LCMSR query proposed by Cao et al. [CCJY14], in practice regions are usually arbitrarily shaped. LCMSR defines a region as a connected subgraph with relevant objects. It uses a SUM function to accumulate the weights of objects in the region, and it aims to find the optimal region that has the largest total weight and does not exceed a size limit. We use a submodular function to compute the objective value of a region and aim to maximize this value under a budget constraint, which makes the problem more challenging.

Submodular Optimization under Connectivity Constraint (SOCC) [KLT14] is similar to our problem, but it considers costs on nodes to compute the budget values. In contrast, the budget values are associated with edges in our problem, and thus the algorithms for SOCC cannot be used to solve ORS-SM.

2.4 Constrained Path Search

Path search is an important problem that has received a lot of attention due to its wide applications. One of the most classic problems is the shortest path search [D+59, AMOT90]. However, in many applications, users need to consider multiple criteria during path search. To address this, a large number of studies consider path search under constraints. Next, we first review the studies on the constrained shortest path query and other related work on constrained path search, and then discuss the differences between our work and those works.

2.4.1 Constrained Shortest Path Query

Joksch [Jok66] first studies the constrained shortest path (CSP) query and develops a dynamic programming algorithm to obtain exact solutions for it. The CSP query searches for the best path based on one criterion under a constraint on another criterion. Follow-up work designs algorithms to answer the query from two aspects: finding exact solutions (denoted as exact CSP), and achieving α-approximate solutions, where α is a parameter (denoted as α-CSP).

For exact CSP, Handler and Zang [HZ80] propose two approaches. One formulates the CSP problem as an integer linear programming (ILP) problem and then uses a standard ILP solver to address it; this ILP-based approach scales poorly in practice [MZ00], as it is very time-consuming on large road networks. The other approach reduces the CSP problem to a k-shortest path problem and then searches the next shortest path iteratively until finding a path that satisfies the budget constraint. To solve exact CSP more efficiently, Hansen [Han80] proposes the Sky-Dijk algorithm based on Dijkstra's algorithm. Compared to Dijkstra's algorithm, Sky-Dijk incrementally maintains a set of paths at each vertex, rather than a single shortest path. Sky-Dijk is a state-of-the-art index-free method. Meanwhile, Storandt [Sto12] develops the CSP-CH algorithm, which accelerates Sky-Dijk using contraction hierarchies; CSP-CH is a state-of-the-art indexed method.
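The core Sky-Dijk idea, maintaining a Pareto set of (length, budget) labels per vertex instead of a single distance, can be sketched as follows. This is an illustration of the label-setting principle, not Hansen's original formulation; the graph format is an assumption of this sketch:

```python
import heapq

def sky_dijk(adj, s, t, budget):
    """Exact constrained shortest path: minimize length s.t. cost <= budget.
    adj: {u: [(v, length, cost), ...]}. Each vertex keeps a Pareto set of
    (length, cost) labels; a label is discarded only if some stored label
    is no worse in BOTH criteria. Returns (length, cost) or None."""
    labels = {v: [] for v in adj}
    heap = [(0, 0, s)]                   # (length, cost, node)
    while heap:
        d, c, u = heapq.heappop(heap)
        if u == t:
            return d, c                  # min-length pop at t is optimal
        if any(d2 <= d and c2 <= c for d2, c2 in labels[u]):
            continue                     # Pareto-dominated label, skip
        labels[u].append((d, c))
        for v, w, b in adj[u]:
            if c + b <= budget:          # never extend past the budget
                heapq.heappush(heap, (d + w, c + b, v))
    return None
```

Because labels are popped in order of length, the first label settled at t is the shortest feasible path; the Pareto sets are what grow (possibly exponentially) and what contraction hierarchies in CSP-CH are designed to tame.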

For α-CSP, Hansen [Han80] first proposes a polynomial-time algorithm with high complexity, and Lorenz and Raz [LR01] present some optimizations to improve the complexity of the algorithm. But the algorithm is orders of magnitude slower than a k-shortest-path based algorithm for exact CSP in experimental studies [KORVM06].

To solve α-CSP quickly, Tsaggouris et al. [TZ09] develop the CP-Dijk algorithm, which has a conservative pruning technique based on the dominance relationship between paths.

Moreover, Wang et al. [WXYL16] propose COLA, a state-of-the-art method for solving α-CSP on large road networks. COLA first constructs an index by partitioning the road network and obtaining a relatively small set of landmark vertices, and then applies an on-the-fly algorithm, α-Dijk, for path computation within a partition, which effectively prunes paths based on landmarks.

2.4.2 Other Related Work on Constrained Path Search

Besides the CSP query, many other studies consider path search under constraints. [dA16] proposes new formulations for finding the s-t shortest path visiting a given set of nodes. [LCH+05] studies the trip planning query, which aims to find the shortest path from a source location to a target location passing by all user-specified types of locations. The studies [SKS08, LKSS10] consider the problem of searching for the shortest path from a specified starting point passing by a sequence of user-specified types of locations. The keyword-aware optimal route query [CCCX12, FLL+20] takes into account the user's preferences, which are expressed by keywords, in route search. It aims to return a route covering all specified keywords with the best objective value satisfying a budget limit.

However, all the aforementioned studies use an accumulative function (e.g., SUM) to formulate the objective function for optimizing the path search, and such a function cannot measure the objective value in many applications, such as the coverage of users' preferences [ZCC+15], the entropy of selected locations [ZV16], and the likelihood of information [LKG+07]. Our problem uses a submodular function to compute the objective score of a path, which makes the problem more challenging.

Our problem was first studied by Chekuri and Pal [CP05] as the submodular orienteering problem (SOP), for which they propose the recursive greedy (RG) algorithm. RG obtains solutions by recursively guessing the middle node and the budget cost to reach the middle node; hence it is a quasi-polynomial time algorithm with high time complexity. There also exists a GreedyDP algorithm [SNY18] for solving the problem, which constructs a zero-suppressed binary decision diagram (ZDD) to represent all feasible solutions and then uses the ZDD to find an approximate solution. Since the size of the ZDD grows exponentially with the number of nodes in the graph, GreedyDP also has high time complexity. Due to their high time complexity, RG and GreedyDP cannot be used to solve the problem on real-world datasets with thousands of nodes. We will give a detailed explanation in Section 5.1.

The studies [ZCC+15, ZV16] of submodular optimization under routing constraints are relevant to our problem. The ORS-KC problem [ZCC+15] defines a special submodular function for path search. The problem of submodular optimization with routing constraints [ZV16] aims to find a route with the best submodular score under a length constraint, but it does not specify the source and target nodes. Their proposed algorithms can be adapted to answer our CPS-SM query, and we denote them by WA* and GCB.

We will compare our algorithms with them in experiments.

Chapter 3

Exact and Approximate Approaches for MaxBRkNN

3.1 Introduction

Given two types of objects P (a set of server points) and O (a set of client points) in Euclidean space, a bichromatic reverse k nearest neighbor (BRkNN) query q finds all client points o ∈ O whose k nearest neighbors (kNN) in P ∪ {q} contain the location of q. BRkNN can be used in many applications such as decision support, location planning, resource allocation, and profile-based marketing [KMS+07].
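The definition translates directly into a brute-force check, useful as a reference against faster methods; the quadratic-cost sketch below uses illustrative point data:

```python
from math import dist

def brknn(servers, clients, q, k):
    """Brute-force bichromatic reverse kNN: return the clients o whose k
    nearest neighbors among servers + [q] include the query location q.
    Points are (x, y) tuples; O(|O| * |P| log |P|) reference sketch for
    small inputs only."""
    result = []
    for o in clients:
        ranked = sorted(servers + [q], key=lambda p: dist(o, p))
        if q in ranked[:k]:
            result.append(o)
    return result
```

A client belongs to the result exactly when q displaces one of its current k nearest servers, which is the property the kNLC-based methods below exploit geometrically.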

The problem of maximizing BRkNN (MaxBRkNN) [WÖY+09, WÖF+11, ZWL+11, LWW+13, LCGL13, HFYD16] is defined based on the BRkNN query; it finds an optimal region to place a new server point p such that the size of the BRkNN result of p is maximized. The cardinality of the BRkNN result is called the "influence" of p, and it can also be computed as the total weight of objects in the BRkNN result [ZWL+11].

The MaxBRkNN problem can be interpreted as finding a region to place a new server point such that it has the maximum influence. It is useful in location planning, advertising, and marketing. For example, a company may look for a suitable region in which to open a new branch so as to attract as many potential customers as possible. We explain this with the following example.

Figure 3.1: An example of MaxBRkNN. [Figure omitted: (a) six client points o1–o6 (dots) and six server points p1–p6 (triangles); (b) the corresponding 2-NN circles and the candidate regions R1 and R2.]

Example 3.1. Figure 3.1(a) shows six customers (represented by dots) and six business shops (represented by triangles) in a space. Now a company wants to set up a new store at a location that attracts as many customers as possible, assuming that customers prefer to visit shops that are convenient in terms of distance.

Given k = 2, we can draw a circle centered at each customer with radius equal to the distance between the customer and its 2nd nearest shop, as shown in Figure 3.1(b).

MaxBRkNN returns region R2 as the optimal region, because the BRkNN result set of any location in R2 contains the most customers, including o1, o2, and o5.

There exist several algorithms for solving the MaxBRkNN problem, e.g., the MaxOverlap algorithm [WÖY+09, WÖF+11], the MaxFirst algorithm [ZWL+11], the MaxSegment algorithm [LWW+13], and the OptRegion algorithm [LCGL13, HFYD16]. However, all existing algorithms first compute the kNN result for each client point, then draw a circle (called the kth nearest location circle (kNLC) of the client point) centered at each client point enclosing all of its kNN result, and perform searches based on these circles. As k is given by users dynamically, the kNNs must be computed in an online manner. In addition, the server points could be updated as well, in which case the existing algorithms have to recompute the kNN results from scratch. Existing algorithms cannot perform well if there are a large number of client points and server points, or if there are updates of server points, because the kNN computation step consumes a huge amount of processing time.
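The kNLC step shared by these methods is easy to state in code. The brute-force sketch below uses illustrative point data, and tie-breaking for a query exactly on a circle boundary is left open, as it depends on how ties among equidistant points are resolved:

```python
from math import dist

def knlc_radii(servers, clients, k):
    """For each client, the radius of its kth nearest location circle
    (kNLC): the distance to its kth nearest server. A client is in the
    BRkNN result of a new server point q exactly when q falls inside its
    kNLC (behavior on the boundary depends on the tie-breaking rule)."""
    return {o: sorted(dist(o, p) for p in servers)[k - 1]
            for o in clients}

def influence(q, radii):
    """Number of clients whose kNLC strictly contains q, i.e., |BRkNN(q)|."""
    return sum(dist(q, o) < r for o, r in radii.items())
```

MaxBRkNN then reduces to finding the region where the most kNLCs overlap; the expensive part, and the bottleneck criticized above, is computing every radius up front for a dynamically chosen k.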

To reduce the kNN computation cost, we design a hybrid quadtree-grid index (QGI).

A grid index is easy to implement yet very effective, and is widely used, especially in map services. However, setting the granularity of a grid index properly is not easy, and a grid index cannot handle non-uniformly distributed data well. Our proposed index adopts a hierarchical structure with different granularities, where the lower level is a grid index whose granularity is estimated using k, and the upper level is a quadtree managing the grid cells to deal with data sparsity. This index helps us: 1) estimate the radius of the circle enclosing the kNN results (the kNLC) for a group of client points together; 2) speed up the computation of circle intersections; 3) accelerate the estimation of the influence upper bounds for the estimated kNLCs.

In addition, we investigate the MaxBRkNN problem in location-based social networks where client points (users) are connected according to their social relationships.

Thus, a general submodular function is used to compute the influence value of a new server point, and the calculation of the influence value could be complicated by incorporating the social relationships of users. For example, the users in the new server point's BRkNN result could further propagate their influence to other users. Hence, we can define the influence of a new server point as the social influence spread [KKT03] of the users in its BRkNN result, whose computation is #P-hard [CWW10]. To reduce the influence computation cost, we develop an optimization technique called "maximal-arc".

Based on QGI and the maximal-arc optimization, we develop an algorithm GkE (Grid-based kNN Estimation enhanced region search), which utilizes the estimated circles to quickly prune the client points that cannot be in the BRkNN result of the optimal region, and then computes exact kNNs on only a few clients to obtain the final result.

We also observe that all existing MaxBRkNN studies aim to provide exact answers.

In cases where the runtime is more important than exactness, approximate solutions are preferable. To achieve even better efficiency, we develop efficient approximation algorithms (e.g., PIC and PRIC) with user-controlled error bounds to solve MaxBRkNN approximately.

We use the two most recent state-of-the-art MaxBRkNN algorithms, MaxSegment [LWW+13] and OptRegion [LCGL13, HFYD16], as baselines. The experimental study conducted on both real and synthetic data sets shows that our algorithm GkE is 3-5 times faster than the best existing approach, and that our approximation algorithms are usually one order of magnitude faster than existing methods while preserving high result quality.

Contributions. In summary, the contributions of this chapter are as follows: i) We propose a method GkE to solve the MaxBRkNN problem in location-based social networks efficiently, which involves a hybrid quadtree-grid index QGI, an efficient method for pruning the client points to avoid expensive exact kNN computation, and a maximal-arc based optimization technique to reduce the influence computation cost; ii) We are the first to develop approximation algorithms that allow users to balance the running time and the accuracy for MaxBRkNN; iii) We show the effectiveness and efficiency of our algorithms by conducting experiments on both real and synthetic location-based social networks.

Organization. The chapter is organized as follows. Section 3.2 defines the MaxBRkNN problem formally, and Section 3.3 introduces the general framework for solving MaxBRkNN. Section 3.4 proposes the algorithm GkE, and Section 3.5 presents approximation algorithms and analyzes their approximation ratios. Finally, Section 3.6 discusses the experimental studies, and Section 3.7 concludes this chapter.

Table 3.1: Frequently Used Notation in MaxBRkNN

Notation   Meaning
O, o       a set of client points; a client point
P, p       a set of server points; a server point
k          the parameter of the MaxBRkNN problem
R          a region
ψR         the influence set of region R
σ(R)       the region influence value of region R
f(·)       a submodular function for computing the region influence value
M          the number of cells in the lower-layer grid
kNLC       the kth nearest location circle of a client point
EkNLC      a kth nearest location circle with an estimated radius

3.2 Problem Definition

In this section, we give the formal definition of the MaxBRkNN problem. The frequently used notations are presented in Table 3.1.

The MaxBRkNN problem is defined over a set of server points P (locations) and a set of client points O (users) in Euclidean space. For a server point p ∈ P, a bichromatic reverse k-nearest neighbor (BRkNN) query aims to find all clients in O whose k nearest neighbors in P contain p. We denote the set of client points returned by a BRkNN query on p by OBR(p, P). Before formally defining the problem, we introduce the following definitions.

Definition 3.1. Consistent Region [WOY+09]. A region R is consistent if and only if, for any two new server points p and p′ in R, OBR(p, P ∪ {p}) = OBR(p′, P ∪ {p′}).

It means that all server points in a consistent region R have the same BRkNN results.

Definition 3.2. Region Influence Set. Given a consistent region R, the influence set of

R, denoted by ψR, is defined as the set of clients in OBR(p, P ∪ {p}), where p is any point inside R.

For example, as shown in Figure 3.1(b), the influence set of region R1 includes clients o1 and o4.

Definition 3.3. Monotone Submodular Function [CR04]. A set function f : 2^O → R, which maps a subset of objects of O to a real number, is a monotone submodular function if for every Oi ⊆ Oj ⊆ O and o ∈ O \ Oj, f(·) satisfies: (1) f(Oi) ≤ f(Oj); and (2) f(Oi ∪ {o}) − f(Oi) ≥ f(Oj ∪ {o}) − f(Oj).
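To make Definition 3.3 concrete, the following Python sketch (illustrative only; the sets and the brute-force checker are our own example, not part of the thesis) verifies both conditions for a small coverage function, a classic monotone submodular function:

```python
from itertools import combinations

# A simple coverage function: f(S) = number of distinct items covered by the
# clients in S. Coverage is a classic monotone submodular function.
UNIVERSE_SETS = {
    "o1": {1, 2},
    "o2": {2, 3},
    "o3": {3, 4, 5},
}

def f(clients):
    """Coverage value of a set of clients."""
    covered = set()
    for c in clients:
        covered |= UNIVERSE_SETS[c]
    return len(covered)

def is_monotone_submodular(ground):
    """Brute-force check of Definition 3.3 over all Oi ⊆ Oj ⊆ O."""
    ground = list(ground)
    subsets = [set(c) for r in range(len(ground) + 1)
               for c in combinations(ground, r)]
    for oi in subsets:
        for oj in subsets:
            if not oi <= oj:
                continue
            if f(oi) > f(oj):                      # (1) monotonicity
                return False
            for o in set(ground) - oj:             # (2) diminishing returns
                if f(oi | {o}) - f(oi) < f(oj | {o}) - f(oj):
                    return False
    return True

print(is_monotone_submodular(UNIVERSE_SETS))  # True for coverage
```

The same checker rejects non-submodular functions (e.g., f(S) = |covered(S)|²), which is a quick way to sanity-check a candidate influence function.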

In existing studies, the influence value of a region R is computed either as the number of clients in R's influence set [WOY+09] or as the sum of the weights of the clients in R's influence set [ZWL+11]. In this work, we consider the MaxBRkNN problem in location-based social networks (such as Foursquare or Twitter) where the clients are connected according to their social relationships. More generally, we compute the influence value by a general monotone submodular function f(·) over a set of clients. This is because, in some applications, the influence value computation could be complicated, as shown in the following example.

Example 3.2. Figure 3.2 shows some customers (represented by dots) and business shops (represented by triangles), together with the social network of the customers, where the edges represent user relationships and each edge weight is the probability that one customer influences another. The circles represent the 2-NN ranges of the customers. Given k = 2, R2 appears to be the best region, since its BRkNN result ψR2 contains the most customers, namely o1, o2, and o5. However, if we show an advertisement in R2, this piece of information would be trapped among these three customers according to the social network. By considering the social network, a better region is R1: the region influence set of R1 contains o1 and o4, and these two customers could potentially spread the information to the entire network.

Figure 3.2: An example of MaxBRkNN in location-based social networks.

In this example, a region's influence value is better captured by the social influence of the clients in the region's influence set, e.g., the number of customers that could finally be influenced in the network. Hence, f(·) computes the social influence spread (e.g., under the independent cascade (IC) model [KKT03]), which is well known to be #P-hard [CWW10]. Formally, the influence value is defined as below.

Definition 3.4. Region Influence Value. Given a consistent region R, the influence value of R, denoted by σ(R), is computed by a monotone submodular function f (ψR).

Note that in the traditional MaxBRkNN problem, the region influence value is computed as the cardinality of ψR, which is a special monotone submodular function.

Definition 3.5. Maximal Consistent Region. A consistent region R is maximal if and only if there does not exist another consistent region R′ such that R ⊂ R′ and σ(R) = σ(R′).

For example, both R1 and R2 are maximal consistent regions, as shown in Figure 3.1(b). Now we are ready to formally define the studied problem.

Definition 3.6. Maximizing Bichromatic Reverse k-Nearest Neighbor (MaxBRkNN). Given a set of server points P and a set of client points O, MaxBRkNN aims to find a maximal consistent region R such that the influence value of R, i.e., σ(R), is maximized.

3.3 General Framework for Solving MaxBRkNN

Most existing methods [WOY+09, WOF+11, ZWL+11, LWW+13, LCGL13, HFYD16] for MaxBRkNN share similar basic steps. We follow their common idea to derive a general framework for solving the problem. We first introduce the following definition.

Definition 3.7. kth Nearest Location Circle (kNLC). Given a client point c, the kth nearest location circle of c is defined as the circle centered at c and with the distance between c and the kth nearest neighbor of c in P as the radius.

For example, in Figure 3.1, the circles are the kNLCs (k = 2) of the users. As proved by Wong et al. [WOY+09], the region R returned by the MaxBRkNN problem can be represented by an intersection of multiple kNLCs. Based on this, the general framework is presented in Algorithm 1.

The algorithm starts by constructing a kNLC for each client point o ∈ O. Next, it estimates an influence upper bound σup for each kNLC and uses a max-priority queue Q to store the kNLCs and their upper bounds σup in decreasing order of σup (lines 1-5). Then, the algorithm dequeues candidate kNLCs from Q one by one until either Q is empty or all kNLCs in Q have influence upper bounds no larger than σmax (lines 7-14). For the current circle c, we first find all circles intersecting with c and obtain the

Algorithm 1: General Framework for MaxBRkNN
Input: D = (O, P), σ(·), k
Output: a maximal consistent region R with the maximum influence σmax

1  Initialize a max-priority queue Q ← ∅;
2  for each o ∈ O do
3      c ← the kNLC of o;
4      Estimate an influence upper bound σup for c;
5      Q.enqueue(⟨σup, c⟩);
6  R ← ∅, σmax ← 0;
7  while Q is not empty do
8      ⟨σup, c⟩ ← Q.dequeue();
9      if σup ≤ σmax then
10         break;
11     Get the circles intersecting with c;
12     Find all intersection points ip on c;
13     Compute the largest influence and the corresponding region in c using ip;
14     Update R and σmax;
15 return R and σmax;

intersection points. Next, the best region in c can be constructed from the intersection points with the largest influence, as proved by Wong et al. [WOY+09].
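The pruning loop of Algorithm 1 can be sketched in Python with a max-priority queue; `upper_bound` and `best_region_in` are placeholders for the circle-specific computations described above (this is an illustrative skeleton under our own interface, not the thesis implementation):

```python
import heapq

def max_brknn_framework(circles, upper_bound, best_region_in):
    """Skeleton of Algorithm 1: best-first search with upper-bound pruning.

    circles:        iterable of kNLCs (opaque objects)
    upper_bound:    callable c -> float, an admissible influence upper bound
    best_region_in: callable c -> (region, influence), the exact search in c
    """
    # Lines 1-5: enqueue every circle keyed by its upper bound.
    # heapq is a min-heap, so negate the bound for max-priority behaviour.
    heap = [(-upper_bound(c), i, c) for i, c in enumerate(circles)]
    heapq.heapify(heap)

    best_region, sigma_max = None, 0.0
    # Lines 7-14: process circles in decreasing order of upper bound.
    while heap:
        neg_up, _, c = heapq.heappop(heap)
        if -neg_up <= sigma_max:      # line 9: no remaining circle can win
            break
        region, sigma = best_region_in(c)
        if sigma > sigma_max:
            best_region, sigma_max = region, sigma
    return best_region, sigma_max
```

Because the queue is ordered by upper bound, the early break on line 9 is safe: every circle still in the queue has an upper bound no larger than the current best influence.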

We use two state-of-the-art algorithms for MaxBRkNN as the baseline algorithms, i.e., OptRegion [LCGL13, HFYD16] and MaxSegment [LWW+13]. Both algorithms follow the framework depicted in Algorithm 1. OptRegion utilizes a sweep-line method to find the intersecting kNLCs and estimate the upper bound for each kNLC. MaxSegment uses an R-tree to find, for each kNLC, the list of kNLCs that overlap it, constructing an overlap table, and then estimates the upper bounds for the kNLCs. MaxSegment also obtains all segments of a kNLC from the intersection points, and scans these segments to find the one with the largest influence value.

The major problem of the baseline algorithms is that they need to compute the kNLCs for all client points (lines 2-5). Because k is a user-defined value, the kNNs must be computed online, and thus the cost of this step is prohibitively expensive on large datasets. However, we observe that much of this computation is unnecessary.

3.4 Exact Approach

To address the issues in the baseline approaches, we propose the algorithm GkE for the MaxBRkNN problem. We first introduce a hybrid quadtree-grid index in Section 3.4.1 to reduce the kNN computation cost. Next, to speed up the estimation of influence upper bounds for kNLCs, we use the index to quickly obtain, for each kNLC, the list of kNLCs that overlap it, for constructing the overlap table. We further develop a maximal-arc based optimization to reduce the influence computation cost in Section 3.4.4. We finally summarize the steps of GkE and analyze the complexity.

3.4.1 Hybrid Quadtree-Grid Index QGI

A grid index is widely used in spatial databases for query processing. It is simple and easy to implement, yet very effective. The intuition is that, by utilizing the index, we are able to quickly estimate an upper bound on the radiuses of the kNLCs for a group of client points together. We denote a circle with such an estimated radius by EkNLC. In this section, we introduce the details of constructing QGI and finding EkNLCs using the index. We present how to process MaxBRkNN using the EkNLCs in Section 3.4.5. With the help of QGI, we avoid computing kNLCs for all client points.

The basic idea of the index is shown in Figure 3.3, where the space is divided into small cells of equal size, and the client and server points are assigned to cells according to their geo-locations. For the server points, we count their number in each cell and store this information in a server point index SI. For each client point, we record the cell in which it is located (e.g., Cell(i1, j1) for o1) and store this information in a client point index CI. The following definition lays the basis for computing kNLCs using the grid index.

Figure 3.3: An example of one level of QGI.

Definition 3.8. Cell Ring. Given a cell ci,j, where 0 ≤ i ≤ a and 0 ≤ j ≤ b (a is the length and b is the width of the grid index), we call the cell itself the 0-ring of ci,j. Iteratively, the n-ring of ci,j is the set of cells cx,y such that max(|x − i|, |y − j|) = n, 0 ≤ x ≤ a, and 0 ≤ y ≤ b.
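Under Definition 3.8, the n-ring can be enumerated directly; the following sketch (our own illustration, not thesis code) lists the in-bounds cells of the n-ring of ci,j:

```python
def ring_cells(i, j, n, a, b):
    """Cells of the n-ring of cell (i, j) in a grid with 0 <= x <= a and
    0 <= y <= b: all in-bounds cells (x, y) with max(|x-i|, |y-j|) = n."""
    if n == 0:
        return [(i, j)]
    cells = []
    for x in range(i - n, i + n + 1):
        for y in range(j - n, j + n + 1):
            # Keep only the outer "square ring" at Chebyshev distance n,
            # clipped to the grid bounds.
            if max(abs(x - i), abs(y - j)) == n and 0 <= x <= a and 0 <= y <= b:
                cells.append((x, y))
    return cells
```

An interior n-ring contains 8n cells (8 for the 1-ring, 16 for the 2-ring, and so on); rings near the grid boundary are clipped.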

For example, as shown in Figure 3.4, if the number of server points in a cell is larger than 0, that number is displayed in the cell. For the cell containing a client point (the dot), its 0-ring is the cell itself, its 1-ring is the set of cells around it, and its 2-ring is the set of cells around its 1-ring.

Figure 3.4: An example of cell rings.

With the definition of the cell ring, for each client point we can count its k nearest server points ring by ring, iteratively, to compute an upper bound on the radius of its kNLC. The process of computing the upper bounds on the radiuses of the kNLCs and constructing the EkNLCs in the above one-level grid index is shown in Algorithm 2.

This algorithm begins by constructing a server point index SI and a client point index CI (line 1). It uses the diagonal length Ld of a cell to obtain the upper bounds on the radiuses of the kNLCs (line 2). A list Ce is used to store the EkNLCs; nr and ns are the number of visited rings and the number of server points in the visited rings, respectively. For each cell co ∈ CI containing client points, the algorithm first sets nr and ns to 0 (lines 4-5). After that, it scans the cells from co's lower to higher rings until ns ≥ k (lines 6-9). Next, it constructs

Algorithm 2: Estimating Radiuses of kNLCs in One Level
Input: D = (O, P), k, M
Output: EkNLC set Ce

1  Divide D into M cells and construct a server point index SI and a client point index CI;
2  Get the diagonal length Ld of a cell;
3  Initialize a list Ce ← ∅;
4  for each cell co ∈ CI containing client points do
5      nr ← 0, ns ← 0;
6      while ns < k do
7          nr ← nr + 1;
8          for each c ∈ nr-ring of co in SI do
9              ns ← ns + the number of server points in c;
10     Re ← nr ∗ Ld;
11     Construct an EkNLC for all client points in co with radius Re and insert it into Ce;
12 return Ce;

EkNLCs with estimated radius Re ← nr ∗ Ld for all client points in co, and inserts EkNLCs into Ce (lines 10-11). Finally, it returns Ce (line 12).

Example 3.3. In Figure 3.4, in order to estimate the radius of the 5th NLC for a client point, the algorithm first counts the nearest server points in the 0-ring cell, where the number of server points is 0. It continues checking the 1-ring cells and obtains 2 nearest server points. The 5th nearest server point is not found until we scan the 4-ring, and the algorithm finally returns 5Ld as the estimated radius.
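The one-level estimation can be sketched as follows (an illustrative, brute-force Python version; the dictionary-based indexes stand in for SI and CI, the radius follows Algorithm 2's Re = nr ∗ Ld, and, following Example 3.3, the 0-ring is counted before expanding outward):

```python
import math
from collections import Counter

def estimate_radii(clients, servers, k, m_side, cell_w, cell_h):
    """Sketch of Algorithm 2: one-level grid estimation of kNLC radii.

    clients, servers: lists of (x, y) points; m_side: cells per axis;
    cell_w, cell_h: cell dimensions. Returns {client_cell: estimated_radius}.
    """
    cell = lambda p: (min(int(p[0] / cell_w), m_side - 1),
                      min(int(p[1] / cell_h), m_side - 1))
    server_count = Counter(cell(p) for p in servers)     # server point index SI
    client_cells = {cell(o) for o in clients}            # client point index CI
    l_d = math.hypot(cell_w, cell_h)                     # cell diagonal Ld

    radii = {}
    for (ci, cj) in client_cells:
        n_r, n_s = 0, server_count.get((ci, cj), 0)      # count the 0-ring first
        while n_s < k:                                   # expand ring by ring
            n_r += 1
            for x in range(ci - n_r, ci + n_r + 1):
                for y in range(cj - n_r, cj + n_r + 1):
                    if max(abs(x - ci), abs(y - cj)) == n_r:
                        n_s += server_count.get((x, y), 0)
        radii[(ci, cj)] = n_r * l_d                      # Re = nr * Ld
    return radii
```

Every client point in the same cell shares one estimated circle, which is exactly what lets GkE skip per-client kNN computation.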

A common and challenging problem in a grid index is how to set the cell size. We adopt an adaptive approach that computes the cell size dynamically from the value of k, which is expected to achieve good performance. We denote the expected average number of visited rings for estimating the radius of a kNLC by nvc (set as a parameter of the grid index), and thus the average number of visited cells is (2nvc − 1)^2. Assume the client points and server points are evenly distributed and the number of cells is M. One cell then contains |P|/M server points on average, and the expected number of cells visited to find the k nearest server points for a client point is kM/|P|. Letting kM/|P| = (2nvc − 1)^2, we obtain M = (2nvc − 1)^2 |P|/k. We use nvc to control the grid index granularity, which is more intuitive than setting the cell size directly.
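The granularity derivation above can be written down directly (illustrative; `grid_cell_count` is our own helper name):

```python
def grid_cell_count(num_servers, k, n_vc):
    """Number of grid cells M from the derivation above:
    k * M / |P| = (2*n_vc - 1)^2  =>  M = (2*n_vc - 1)^2 * |P| / k."""
    return max(1, round((2 * n_vc - 1) ** 2 * num_servers / k))
```

For example, with |P| = 1000 server points, k = 10, and nvc = 2 expected rings, the grid would use M = 900 cells.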

Figure 3.5: The hierarchical structure (the upper-layer grid and the lower-layer grid).

Using only one level in QGI cannot handle non-uniformly distributed data well. This is because, when the data is non-uniformly distributed, the server points are sparse in some areas, and it then takes considerable time to scan a large number of cells to estimate the radius of a kNLC. This inspires us to use a hierarchical structure (as shown in Figure 3.5) to address the problem. In the upper layer, we utilize a quadtree to partition the space and store the data, and in the lower layer, we use a uniform grid index. Compared to the lower layer with its equal-sized cells, the quadtree layer can adjust the sizes of cells according to the density of points. In areas where the server points are sparse (see Figure 3.5), the cells in the upper layer are larger than those in the lower layer. Thus, we can first obtain rough upper bounds on the radiuses of the kNLCs in the upper layer, and then tighten these bounds by zooming into the lower layer. We call this index the hybrid quadtree-grid index (QGI).

Next, we present the hybrid quadtree-grid index, the approach for obtaining rough upper bounds on the radiuses of kNLCs in the upper layer, and the method for tightening these bounds by zooming into the lower layer.

To build this index, we use a quadtree to recursively partition the space into four equal-sized quadrants until each leaf node contains one server point (as shown in Figure 3.6). In the quadtree, each quadrant is represented by a node vi, in which we maintain the server points in the quadrant and the number of server points in the quadrant and its neighboring quadrants (denoted by ni^a).

Figure 3.6: An example of the hybrid quadtree-grid index.

3.4.2 kNLC Radius Estimation using QGI

The objective of designing such an index is to estimate the radiuses of kNLCs quickly, avoiding the kNN computation on all client points. We explain how to do this in this subsection. We first introduce the following definition.

Definition 3.9. Surrounding Quadrants. Given a quadrant vi, we call it together with its neighboring quadrants the surrounding quadrants of vi, denoted by vi.SQ.

For example, in Figure 3.6, node v8 corresponds to the quadrant containing p5, and its SQ is the red-dashed-line rectangle; thus the number of server points in v8.SQ is n8^a = 3. With the hybrid quadtree-grid index, we can obtain initial upper and lower bounds on the radiuses of the kNLCs, and then improve these bounds using a binary search method.

To get an initial upper bound on the radius of the kNLC centered at a client point oi, we first use the coordinate of oi to find, in the hybrid quadtree-grid index, the node with the smallest quadrant whose SQ contains no fewer than k server points. After that, we take the SQ of the node and map it to the lower-layer grid. Next, we use the largest distance between the cell containing oi and the bounds of the SQ as an initial upper bound on the radius of the kNLC. For example, as shown in Figure 3.7, when k = 3, for the client point o1, the red-dashed-line rectangle in Figure 3.7(a) is the smallest SQ containing no fewer than k server points. After mapping this SQ to the lower-layer grid, we obtain the red-dashed-line rectangle in Figure 3.7(b), which gives an initial upper bound on the radius of the kNLC centered at o1.

To get an initial lower bound on the radius of the kNLC centered at a client point oi, we begin by finding, based on the coordinate of oi, the node with the largest quadrant that contains fewer than k server points in the hybrid quadtree-grid index. Finally, we map the node to the lower-layer grid, and then use the smallest distance between the cell containing oi and the bounds of the node's quadrant as an initial lower bound on the radius of the kNLC. For example, as shown in Figure 3.7, when k = 3, for the client point o1, the blue-dashed-line rectangle in Figure 3.7(a) is the largest quadrant containing fewer than k server points; mapping this quadrant to the lower-layer grid yields an initial lower bound on the radius of the kNLC.

Figure 3.7: An example of the binary search method.

Let dlu denote the distance between the upper bound and the lower bound. When the server points are sparse, dlu is usually large. In order to improve the upper bound and the lower bound, we use a binary search method to reduce dlu. We denote the threshold for the minimal dlu by dlu_min. The process is as follows. Based on the upper bound and the lower bound, we compute the middle bound. After that, we use the coordinate of oi and the middle bound to perform a range query in the hybrid quadtree-grid index, obtaining the number of server points within the middle bound (denoted by nsm). If nsm < k, the middle bound becomes the new lower bound, and we update the lower bound and the number of server points in the lower bound by nsl = nsm. Otherwise, we use the middle bound as the new upper bound. Next, we compute the new middle bound and repeat the binary search until dlu ≤ dlu_min. For example, in Figure 3.7, when k = 3 and dlu_min equals the length of 2 cells in the lower-layer grid, after one binary search step the new lower bound is the green-dashed-line rectangle in Figure 3.7(b).
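The binary search on the bounds can be sketched as follows, where `count_in_range` stands in for the quadtree range query (the function and its interface are our own simplification):

```python
def refine_bounds(d_l, d_u, n_sl, k, d_min, count_in_range):
    """Binary search on the kNLC radius bounds (Section 3.4.2 sketch).

    d_l, d_u:       initial lower / upper bounds on the kNLC radius
    n_sl:           number of server points within the lower bound
    count_in_range: stand-in for the quadtree range query; returns the
                    number of server points within a given distance of oi
    Returns (d_l, d_u, n_sl) with d_u - d_l <= d_min.
    """
    while d_u - d_l > d_min:
        d_m = (d_u + d_l) / 2.0
        n_sm = count_in_range(d_m)
        if n_sm < k:            # middle bound still misses the k-th server
            d_l, n_sl = d_m, n_sm
        else:                   # k servers already inside the middle bound
            d_u = d_m
    return d_l, d_u, n_sl
```

The loop maintains the invariant that fewer than k server points lie within d_l while at least k lie within d_u, so the true kth-nearest distance always stays bracketed.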

At the last step, we zoom into the lower-layer grid to further improve the rough upper bound by scanning cells to find k − nsl server points, starting from the new lower bound obtained in the upper layer. For example, in Figure 3.7(b), starting from the green-dashed-line rectangle, we only need to scan one ring to find the 3 nearest server points for o1 and obtain the upper bound on the radius of the kNLC centered at o1.

We present the process of estimating the radiuses of kNLCs in the two-level QGI in Algorithm 3. The method starts by constructing the two-level index structure, consisting of a server point index SIl and a client point index CIl for the lower-layer grid, and builds a quadtree QT for the upper layer (lines 1-3). For each client point oi, we compute the initial upper bound du, the lower bound dl, and the number of server points within the lower bound nsl in QT, and then improve du and dl using the binary search method (lines 4-14). Next, starting from the new lower bound dl, we zoom into the lower layer and scan cells to find k − nsl server points (lines 15-21). Finally, the algorithm constructs an EkNLC for oi with Re = nr ∗ Ld and returns the EkNLC set Ce.

We denote the side length of a cell in the lower-layer grid by Llow. To be consistent with the previous settings, we use the average distance covered when scanning cells to find the k nearest server points as the default value, i.e., dlu_min = Llow ∗ nvc.

Algorithm 3: Estimating kNLCs in Two Layers
Input: D = (O, P), k, ρ, dlu_min
Output: EkNLC set Ce

1  Construct QGI, consisting of SIl and CIl;
2  Get the diagonal length Ld of a cell in the lower layer;
3  Build a quadtree QT based on P and initialize a list Ce ← ∅;
4  for each oi ∈ O do
5      Compute the initial du, dl and nsl in QT;
6      dlu ← du − dl;
7      while dlu > dlu_min do
8          dm ← (du + dl)/2;
9          Do a range query in QT and get nsm;
10         if nsm < k then
11             dl ← dm, nsl ← nsm;
12         else
13             du ← dm;
14         Update dlu;
15     Initialize co for oi in CIl;
16     Compute nr in SIl using dl; ns ← nsl;
17     while ns < k do
18         nr ← nr + 1;
19         for each c ∈ nr-ring of co in SIl do
20             ns ← ns + the number of server points in c;
21     Re ← nr ∗ Ld;
22     Construct an EkNLC for oi with radius Re and insert it into Ce;
23 return Ce;

3.4.3 Efficient Computation of Overlapping Circles

After constructing the kNLCs, we compute intersections of kNLCs to estimate the upper bounds of the influence values for the kNLCs. Rather than computing intersections for all kNLC pairs, we perform range queries based on the radiuses of the kNLCs to compute the intersections quickly.

We begin by building an index IO over all client points (e.g., an R-tree or a quadtree). For a client point oi, we denote its kNLC by kNLCi and the radius of kNLCi by ri. To find all intersections for the kNLC of oi, we perform a range search on IO with radius 2ri to find the client points in the range. For each client within the range search space, we check whether its corresponding kNLC (or EkNLC) overlaps with the current circle kNLCi. If so, we insert it into kNLCi's overlapping list, and also insert kNLCi into that circle's overlapping list if that circle has a smaller radius.

Figure 3.8: An example of checking whether two kNLCs intersect with each other.

Example 3.4. As shown in Figure 3.8, there are two overlapping circles c1 and c2 with r1 < r2. When we use 2r1 to search for the circles that overlap c1, we are only guaranteed to find the overlapping circles whose radiuses are not larger than r1, so c2 is missed in this step. But when we use 2r2 to search for the circles that overlap c2, we find that c1 intersects c2. Since r1 < r2, we insert c2 into c1's overlapping list as well.

Finally, we can construct the overlap table, and then use the overlap table to compute the upper bounds of the influence values for kNLCs.
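The overlap-list construction can be sketched with a brute-force scan standing in for the index-based range search (illustrative only; the circle ids and the tie-breaking rule are our own choices):

```python
import math

def build_overlap_table(circles):
    """Build overlap lists as in Section 3.4.3 (brute-force stand-in for
    the range search on an index over the client points).

    circles: dict id -> (cx, cy, r). Two circles overlap iff the distance
    between their centers is at most the sum of their radii.
    """
    overlaps = {cid: [] for cid in circles}
    for cid, (x, y, r) in circles.items():
        for oid, (ox, oy, orad) in circles.items():
            if oid == cid:
                continue
            # Only look for circles no larger than the current one; the
            # 2r range search is guaranteed to find those (Example 3.4),
            # and the symmetric pair is recorded in both lists at once.
            if (orad, oid) > (r, cid):
                continue
            d = math.hypot(ox - x, oy - y)
            if d <= 2 * r and d <= r + orad:
                overlaps[cid].append(oid)
                overlaps[oid].append(cid)
    return overlaps
```

Processing each pair from the larger circle's side mirrors the asymmetric insertion in Example 3.4: the smaller circle's own 2r search could miss the larger one, so the larger circle records the pair for both.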

3.4.4 Maximal-Arc Based Optimization

We use a general monotone submodular function to compute the influence value of a region. In some applications, the computation of this function is time-consuming. For example, when we define the region influence value as the social influence spread [KKT03] of the users in the region's influence set on a social network, computing the social influence of a given group of users is a well-known #P-hard problem [CWW10]. To reduce the influence computation cost in such scenarios, inspired by the MaxSegment algorithm [LWW+13], we develop the maximal-arc optimization technique, which performs the computation only on the "maximal arcs".

Instead of checking the intersection points one by one, we analyze the arcs of the kNLCs that connect the intersection points. Given the kNLC of a point oi, if we set oi as the pole of a polar coordinate system, the intersection points on the kNLC can be represented by their polar angles, and they divide the kNLC into several arcs. We can find all arcs by scanning the intersection points on oi's kNLC in a counter-clockwise direction from angle 0° to 360°.

Two kNLCs that intersect produce two intersection points. Assume that the kNLC of a point oj intersects the kNLC of oi. For an intersection point x, if, in the counter-clockwise scanning process on oi.kNLC, the arc starting at x on oi.kNLC lies within oj.kNLC, we call x the entering point of oj; otherwise, if the arc lies outside oj.kNLC, we call x the leaving point of oj (as shown in Figure 3.9). Given an arc e bounded by two intersection points oh and ot, we call the point oh with the smaller polar angle the "head" of e, and the point ot with the larger polar angle the "tail" of e. With these concepts, we can define the maximal arc as below:

Figure 3.9: Entering and leaving points.

Definition 3.10. Maximal Arc. Given a point oi, its kNLC is divided into several arcs by the intersection points, and an arc is defined as a maximal arc if and only if:

1. the head of this arc is an entering point of a kNLC;

2. the tail of this arc is a leaving point of a kNLC.

We denote by A(ei) the set of client points whose kNLCs enclose ei, and we call these points ei's intersecting point set. The defining property of a maximal arc e is that A(e) is not a subset of any other arc's intersecting point set.

For example, in Figure 3.10(a), there are three kNLCs, belonging to o1, o2, and o3. o1's kNLC is divided into four arcs by four intersection points. We can project the arcs and the intersection points onto a one-dimensional line segment, as shown in Figure 3.10(b). e3 is a maximal arc, since the head of e3 is an entering point and the tail of e3 is a leaving point. In o1's kNLC, for any arc ei, we have A(ei) ⊆ A(e3). Next, we analyze the properties of maximal arcs.

Figure 3.10: An example of a maximal arc.

Lemma 3.1. Given two adjacent arcs ea and eb on a kNLC, where point x is the head of ea and the tail of eb: if x is an entering point of the kNLC of a point o, then A(ea) = A(eb) ∪ {o}; otherwise, if x is a leaving point of the kNLC of a point o, then A(eb) = A(ea) ∪ {o}.

Proof. When x is an entering point of the kNLC of a point o, ea is contained in o's kNLC while eb is not. Because ea and eb are adjacent arcs, their intersecting point sets differ only in o, and thus A(ea) = A(eb) ∪ {o}. The other case can be proved similarly. 

Based on Lemma 3.1, we introduce the following lemma which lays the basis of the maximal-arc optimization technique.

Lemma 3.2. For any arc ei, there exists a maximal arc ej such that A(ei) ⊆ A(ej).

Proof. Given an arc ei, there are two cases:

Case 1: The tail of ei is an entering point of a client point o. Let ep denote ei’s subsequent arc (counterclockwise), by Lemma 3.1, we know A(ep) = A(ei)∪{o}. We check the tail of ep, and if it is a leaving point of a client point, we know ep is a maximal arc. Otherwise, we keep checking ep’s precedent arc, until we find one arc denoted by emax whose tail is a leaving point. Since the head of emax is an entering point, it is a maximal arc, and it holds that A(e j) ⊃ A(ei) according to Lemma 3.1.

Case 2: The tail of ei is a leaving point. If the head of ei is an entering point, ei is a maximal arc. Otherwise, we keep checking its precedent arcs until we reach one arc whose head is an entering point. This arc is maximal and its intersecting point set contains A(ei) according to Lemma 3.1.

Lemma 3.2 guarantees that we can find the arc whose intersecting point set has the maximal influence σ(A(ei)) by checking the maximal arcs only. Since the influence is computed by a monotone submodular function, any subset of a maximal arc's intersecting point set has an influence value no larger than that of the maximal arc. Thus, we need not compute the influence values of the other arcs.
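To make this concrete, the following sketch evaluates a toy coverage-style influence function over a set of arcs and confirms that restricting attention to maximal arcs loses nothing. The reachable-user sets, arc names, and function names are illustrative, not taken from the thesis implementation.

```python
# Sketch: with a monotone submodular influence function (here: set
# coverage), any arc whose intersecting point set is a subset of a
# maximal arc's set can never have a larger influence, so only the
# maximal arcs need to be evaluated. All data below is illustrative.

def coverage_influence(point_set, reachable):
    """sigma(A): number of distinct users reachable from clients in A."""
    covered = set()
    for o in point_set:
        covered |= reachable.get(o, set())
    return len(covered)

reachable = {"o1": {1, 2}, "o2": {2, 3}, "o3": {3, 4, 5}}
arcs = {  # arc -> intersecting point set A(e), as in Figure 3.10(b)
    "e1": {"o1"},
    "e2": {"o1", "o3"},
    "e3": {"o1", "o2", "o3"},  # maximal: a superset of all the others
    "e4": {"o1", "o2"},
}
# an arc is maximal if its set is not a proper subset of any other set
maximal = [e for e, a in arcs.items()
           if not any(a < b for b in arcs.values())]
best = max(maximal, key=lambda e: coverage_influence(arcs[e], reachable))
assert best == "e3"  # evaluating only maximal arcs finds the optimum
```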

In order to find the maximal arcs and their corresponding intersecting point sets, we perform the following scanning process. Given o's kNLC, we initialize a queue Q = {o}. Next, we scan the kNLC from angle 0° counter-clockwise. When meeting an entering point of a point o′, we push o′ into Q. When meeting a leaving point of o′, if the previous intersection point is an entering point, we obtain a new maximal arc e_i^max with A(e_i^max) = Q and then remove o′ from Q; if the previous intersection point is a leaving point, we only remove o′ from Q. The process terminates when we reach angle 360°, at which point all maximal arcs have been obtained.

For example, to find the maximal arcs in Figure 3.10(b), we first initialize Q = {o1} and then scan the circle from 0°. When meeting the point at angle 90°, which is an entering point of o3, we add o3 into Q. Similarly, at the point at angle 135°, which is an entering point of o2, we insert o2 into Q. After that, we meet the point at angle 180°, which is a leaving point of o3; as the previous intersection point (at 135°) is an entering point, we create a maximal arc e3 with A(e3) = Q = {o1, o2, o3}, and remove o3 from Q. Next, at the points with angles 225° and 360°, we remove o2 and o1 respectively, and terminate the scanning process.
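The scanning process above can be sketched as a single angular sweep; the event list below encodes the intersection points of Figure 3.10(b). The event representation and names are assumptions for illustration, not the thesis code.

```python
# Sketch of the maximal-arc sweep on one kNLC. Events are the circle's
# intersection points, sorted by angle.

def maximal_arcs(center, events):
    """events: sorted list of (angle, kind, point), kind in {'enter','leave'}.
    Returns the intersecting point set A(e^max) of each maximal arc."""
    q = {center}
    arcs = []
    prev_kind = None
    for angle, kind, o in events:
        if kind == "enter":
            q.add(o)
        else:  # leaving point: a maximal arc ends exactly when the
            # previous intersection point was an entering point
            if prev_kind == "enter":
                arcs.append(frozenset(q))
            q.discard(o)
        prev_kind = kind
    return arcs

# The intersection points of o1's kNLC in Figure 3.10(b)
events = [(90, "enter", "o3"), (135, "enter", "o2"),
          (180, "leave", "o3"), (225, "leave", "o2"),
          (360, "leave", "o1")]
assert maximal_arcs("o1", events) == [frozenset({"o1", "o2", "o3"})]
```

The single recorded set {o1, o2, o3} matches the maximal arc e3 found in the worked example.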

3.4.5 Algorithm GkE

We proceed to present the steps of our algorithm GkE, which enhances the naive method introduced in Section 3.3 with the hybrid quadtree-grid index in Section 3.4.1, the efficient upper bounds of influence estimation for kNLCs in Section 3.4.3, and the maximal-arc based optimization in Section 3.4.4.

In the first step (lines 2-5 in Algorithm 1), instead of constructing a kNLC for each o, GkE constructs QGI and uses it to estimate the upper bounds of the radii of the kNLCs and obtain EkNLCs. This step can reduce accurate kNN searches remarkably when the number of promising clients (those whose upper bound of influence value σup is larger than σmax) is small. After that, we compute σup for each EkNLC. This step is fast when the radii of the kNLCs are small. Next, to find the accurate best region, the algorithm takes the EkNLCs from LEkNLCs in descending order of their σup, and does kNN searches using QGI to construct accurate kNLCs for each EkNLC and its overlapping EkNLCs. In the last step, it uses the maximal-arc based optimization to find the maximal consistent region R in the accurate kNLCs; this step is efficient even though the computation of the region influence value is complex. The process terminates when σup ≤ σmax or all EkNLCs have been checked.

We analyze the complexities of GkE as below. This algorithm has three main steps: (1) constructing an EkNLC for each o ∈ O; (2) estimating an upper bound of influence σup for each EkNLC; (3) finding the point set with the maximum influence by searching kNLCs in decreasing order of their σup.

In step 1, it scans all cells in the lower-layer grid to estimate the radius of a kNLC in the worst case; thus it costs O(M|O|) time (where M is the number of cells in the lower-layer grid) to construct EkNLCs for all client points. The average number of server points in a cell is |P|/M, so the average number of cells visited for estimating the radius of a kNLC is kM/|P|. In the average case, the running time of this step is O(kM|O|/|P|).

In step 2, it first builds an index IO for all client points. For each EkNLC, it does a range search in IO to find all EkNLCs that overlap with it. Unfortunately, in the worst case, we may still need to check all pairs of points in O, and thus the worst-case complexity of this step is O(|O|²). However, in the average case, the range search is much faster since the search range is only twice the radius of an EkNLC.

In step 3, for each candidate EkNLC (some are pruned in step 2), we first take O(k log |P|) time to do a BRkNN query for computing the accurate radius. After that, we need to compute the influence values of at most O(|O|²) maximal arcs. Let t(σ(·)) be the cost of computing the influence of a region. This step costs at most O(k|O| log |P| + |O|² t(σ(·))) time.

Therefore, the complexity of GkE for MaxBRkNN is O(M|O| + |O|² + k|O| log |P| + |O|² t(σ(·))).

3.5 Approximate Approach

From the complexity analysis of GkE, we can observe that its time cost is high when the number of client points |O| is large. This motivates us to develop approximation algorithms that prune some client points to further reduce the time cost. The challenge is how to select the client points for pruning while guaranteeing the quality of the results. In this section, we introduce two such algorithms, called PIC and PRIC, for MaxBRkNN.

3.5.1 Algorithm PIC

We first design an algorithm based on the observation that the client points in independent kNLCs (those without any intersection) cannot be contained in the same region influence set. Thus, the influence loss in the result after pruning independent kNLCs is no larger than the maximum influence of a pruned client point, σmax(o|o ∈ Opru), where Opru denotes the set of pruned client points. For example, in Figure 3.1, as the kNLCs of o3, o5 and o6 are independent of each other, no two of them can appear in a region's influence set together.

As discussed, computing the kNLCs for all client points has a huge cost, and thus

Algorithm 4: Algorithm PIC with One Pruning Operation
Input: D = (O, P), σ(·), k
Output: A region R with the maximum influence σmax
1  for each o ∈ O do
2      Compute σ(o);
3  Initialize a list C ← ∅;
4  for each o ∈ O do
5      Construct an EkNLC for o and insert it into C;
6  Find Cind ⊆ C \ {EkNLCmax} in which EkNLCs are independent of each other;
7  Get Opru from Cind;
8  Onew ← O \ Opru;
9  Generate a new instance φ of MaxBRkNN based on D = (Onew, P), σ(·), k;
10 Find the optimal region R for φ by GkE;
11 return R and its influence;

we compute independent EkNLCs instead of kNLCs by utilizing the index QGI introduced in Section 3.4. The pruned point set generated by using EkNLCs is smaller than that obtained by using kNLCs. However, due to the significant cost saving on the kNN computation, it is still much more efficient. In the following paragraphs, we describe our approximation algorithms using EkNLCs; the theoretical analysis also holds when utilizing kNLCs.

We denote the optimal region of MaxBRkNN as Ropt, and its region influence set as OBR(Ropt). The details of PIC with one pruning operation are shown in Algorithm 4. It first computes σ(o) for each o ∈ O (lines 1-2). After that, it constructs EkNLCs for the client points using the index QGI and inserts them into a list C (lines 3-5). Next, the algorithm finds a set of EkNLCs that are independent of each other and stores it in a list Cind, drawn from C \ {EkNLCmax}, where EkNLCmax is the EkNLC of the client point omax with the maximum influence σmax(o|o ∈ O) (line 6). From Cind, it obtains a client point set Opru consisting of all the client points that are the centers of these EkNLCs (line 7). Based on the new client

Figure 3.11: An example of one pruning operation. (a) The six EkNLCs of Figure 3.1; (b) the three EkNLCs remaining after pruning.

point set Onew ← O \ Opru, the algorithm generates a new instance φ of the MaxBRkNN problem (lines 8-9). Finally, it finds the optimal region R in φ via the GkE algorithm (line 10).

For example, as shown in Figure 3.11, in one pruning operation for the EkNLCs in Figure 3.1, we can prune o3, o5 and o6, as their EkNLCs are independent and contained in C \ {EkNLCmax}. After that, we are left with only 3 EkNLCs in Figure 3.11(b), yielding an easier MaxBRkNN problem.

Lemma 3.3. In D = (Opru, P), for any new server point p ∈ D, σ(OBR(p)) ≤ σmax(o|o ∈ Opru).

Proof. Since the EkNLCs of the points in Opru are independent, their kNLCs are independent as well, because the kNLCs are smaller. As a result, a new server point can be included in at most one kNLC of the points in Opru, and OBR(p) contains at most one point in Opru, whose influence value is not larger than σmax(o|o ∈ Opru).

Based on this lemma, we can prove the error bound of this algorithm.

Theorem 3.1. PIC with one pruning operation is a 1/2-approximation algorithm.

Proof. We denote the best region found by PIC with one pruning operation as Rpic, and the region influence sets of Rpic and Ropt as OBR(Rpic) and OBR(Ropt), respectively. We can divide OBR(Ropt) into two parts, O_BR1(Ropt) and O_BR2(Ropt), where O_BR1(Ropt) ⊆ Opru and O_BR2(Ropt) ⊆ Onew. As omax ∈ Onew, σ(omax) ≤ σ(OBR(Rpic)). From Lemma 3.3, we know that σ(O_BR1(Ropt)) ≤ σmax(o|o ∈ Opru) ≤ σ(omax) ≤ σ(OBR(Rpic)). Since O_BR2(Ropt) is the influence set in φ, σ(O_BR2(Ropt)) ≤ σ(OBR(Rpic)). Therefore, considering that σ(·) is computed by a submodular function, we can get:

σ(OBR(Ropt)) = σ(O_BR1(Ropt) ∪ O_BR2(Ropt)) ≤ σ(O_BR1(Ropt)) + σ(O_BR2(Ropt)) ≤ 2σ(OBR(Rpic)).
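The final inequality relies on the subadditivity of a nonnegative submodular function, σ(A ∪ B) ≤ σ(A) + σ(B). A quick numeric check with a coverage function (illustrative data, not the thesis implementation):

```python
# Subadditivity check for a coverage-style influence function:
# sigma(A ∪ B) <= sigma(A) + sigma(B) for nonnegative submodular sigma.

def sigma(point_set, reach):
    """Number of distinct users reachable from the clients in point_set."""
    covered = set()
    for o in point_set:
        covered |= reach[o]
    return len(covered)

reach = {"o1": {1, 2}, "o2": {2, 3}, "o3": {4}}
A, B = {"o1", "o2"}, {"o2", "o3"}
assert sigma(A | B, reach) <= sigma(A, reach) + sigma(B, reach)  # 4 <= 3 + 3
```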

Next, we discuss how to find a suitable client point set Opru. Obviously, the larger |Opru| is, the smaller |Onew| is, and the easier the new instance φ is to solve. Unfortunately, this problem is NP-hard.

Theorem 3.2. It is NP-hard to find Opru ⊆ O such that its corresponding EkNLCs are independent of each other, omax ∉ Opru, and |Opru| is maximized.

Proof. We denote this problem as the FMIC problem and prove its NP-hardness by a reduction from the well-known Maximal Independent Set (MIS) problem. Given an instance ψ of the MIS problem with a graph G = (V, E), we construct an FMIC instance ω as follows: for each vertex v ∈ V, we use a client point o to represent it and draw an EkNLC for the client point, such that the EkNLCs of oi and oj intersect iff (vi, vj) ∈ E. In addition, we add a client point with the maximum influence as omax into the client point set. Given this mapping, which can be built in polynomial time, if a vertex set V′ is the optimal result of ψ in MIS, the client point set corresponding to V′ is the maximal independent set of ω, and vice versa.

Due to the NP-hardness of finding Opru with the maximum size, we use a greedy method to quickly find Opru, as shown in Algorithm 5. It begins by initializing Opru ← ∅ and building an index Ipru for Opru (e.g., an R-tree or quadtree) (lines 1-2). Next, the

Algorithm 5: Algorithm for Finding Opru
Input: C \ {EkNLCmax}
Output: Opru
1  Initialize Opru ← ∅;
2  Build an index Ipru for Opru;
3  Sort C \ {EkNLCmax} in ascending order of their radii;
4  for each EkNLCu ∈ C \ {EkNLCmax} do
5      Flagsel ← true;
6      Get a list Cu by doing a range query on Ipru with the scale rq ← ru + rmax;
7      if Cu != ∅ then
8          for each EkNLCv ∈ Cu do
9              if EkNLCu intersects EkNLCv then
10                 Flagsel ← false;
11                 Break;
12     if Flagsel == true then
13         Insert EkNLCu into Opru and Ipru;
14 return Opru;

algorithm sorts the EkNLCs in ascending order of their radii (line 3). For each EkNLCu ∈ C \ {EkNLCmax}, the algorithm first initializes the selection flag Flagsel to true. After that, it finds the EkNLCs which belong to Opru and may intersect with EkNLCu by a range query on Ipru, and stores them in Cu (lines 4-6). Note that the scale of the range query is the sum of the radius of the current EkNLC and the maximal radius over all EkNLCs. If Cu is not empty, the algorithm checks one by one whether any EkNLC ∈ Cu intersects EkNLCu; if so, Flagsel is set to false (lines 7-11). After checking all EkNLCs in Cu, if Flagsel is still true, the algorithm inserts EkNLCu into Opru and updates Ipru (lines 12-13). Finally, it returns Opru (line 14).
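A minimal Python sketch of this greedy selection, with the spatial index Ipru replaced by a linear scan over the chosen circles for brevity. The circle data, names, and intersection test are illustrative assumptions, not the thesis code:

```python
# Greedy selection of pairwise-independent circles, smallest radius
# first, mirroring the structure of Algorithm 5 without an index.
import math

def find_independent(circles):
    """circles: list of (x, y, r). Returns a maximal-by-greedy set of
    pairwise non-intersecting circles, processed by ascending radius."""
    chosen = []
    for (x, y, r) in sorted(circles, key=lambda c: c[2]):
        # assumption: two circles intersect iff their center distance
        # is strictly less than the sum of their radii
        if all(math.hypot(x - cx, y - cy) >= r + cr
               for (cx, cy, cr) in chosen):
            chosen.append((x, y, r))
    return chosen

circles = [(0, 0, 1), (5, 0, 1), (0.5, 0, 1), (10, 10, 2)]
ind = find_independent(circles)
assert len(ind) == 3  # (0.5, 0, 1) overlaps (0, 0, 1) and is skipped
```

A production version would answer the "which chosen circles are nearby?" question with a range query on an R-tree or quadtree, as Algorithm 5 does, instead of scanning all chosen circles.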

Theorem 3.1 guarantees that we can get a 1/2-approximate solution after performing one pruning operation. An interesting question is whether we can further improve the efficiency of the algorithm while still guaranteeing a bounded error. Can we perform the pruning process again on the remaining client points (i.e., O \ Opru)? We continue to improve the method using this idea.

We denote the set of pruned client points after β pruning operations as O^β_pru, and their corresponding EkNLC set as C^β_ind. We first prove the following lemma.

Lemma 3.4. In D = (O^β_pru, P), for any new server point p ∈ D, σ(OBR(p)) ≤ β·σmax(o|o ∈ O^β_pru).

Proof. We prove this by induction on β. Lemma 3.3 shows the lemma holds for β = 1. Suppose it holds for β = t ≥ 1. When β = t + 1, O^{t+1}_pru = O^t_pru ∪ O^{t+1}_{1s-pru}, where O^{t+1}_{1s-pru} is the set of client points pruned in step t + 1. For any region influence set OBR ⊆ O^{t+1}_pru, we divide it into two parts O_BR1 ⊆ O^t_pru and O_BR2 ⊆ O^{t+1}_{1s-pru}. Thus, σ(O_BR1) ≤ t·σmax(o|o ∈ O^t_pru) ≤ t·σmax(o|o ∈ O^{t+1}_pru), and σ(O_BR2) ≤ σmax(o|o ∈ O^{t+1}_{1s-pru}) ≤ σmax(o|o ∈ O^{t+1}_pru). Therefore, σ(OBR | OBR ⊆ O^{t+1}_pru) ≤ (t + 1)·σmax(o|o ∈ O^{t+1}_pru), and the lemma holds for β = t + 1.

We next prove that PIC is still an approximation algorithm with guaranteed performance after performing β pruning operations.

Theorem 3.3. PIC with β pruning operations is a 1/(1+β)-approximation algorithm.

Proof. We denote the optimal region returned by PIC with β pruning operations as R^β_pic, and the set of remaining client points after β pruning operations as O^β_new. The region influence set of R^β_pic is OBR(R^β_pic). We divide OBR(Ropt) into two parts, O^β_BR1(Ropt) and O^β_BR2(Ropt), where O^β_BR1(Ropt) ⊆ O^β_pru and O^β_BR2(Ropt) ⊆ O^β_new. As omax ∈ O^β_new, σ(omax) ≤ σ(OBR(R^β_pic)). Based on Lemma 3.4, we get σ(O^β_BR1(Ropt)) ≤ β·σmax(o|o ∈ O^β_pru) ≤ β·σ(omax) ≤ β·σ(OBR(R^β_pic)). Since O^β_BR2(Ropt) is the region influence set in D = (O^β_new, P), we know that σ(O^β_BR2(Ropt)) ≤ σ(OBR(R^β_pic)). Therefore, we can get:

σ(OBR(Ropt)) = σ(O^β_BR1(Ropt) ∪ O^β_BR2(Ropt)) ≤ σ(O^β_BR1(Ropt)) + σ(O^β_BR2(Ropt)) ≤ (1 + β)σ(OBR(R^β_pic)).

Theorem 3.4. PIC with β pruning operations is a (1 − β/σ(OBR(Ropt)))-approximation algorithm for MaxBRkNN with σ(R) = |OBR(R)|.

Proof. As σ(R) = |OBR(R)|, in each independent circle pruning operation, Ropt loses at most one point and its influence is decreased by at most 1. After β operations, the influence of Ropt is decreased by at most β. Following the proof of Theorem 3.3, we have σ(OBR(R^β_pic)) ≥ σ(OBR(Ropt)) − β. Hence, σ(OBR(R^β_pic))/σ(OBR(Ropt)) ≥ 1 − β/σ(OBR(Ropt)).

Theorem 3.5. The 1/(1+β) approximation ratio is tight for PIC with β pruning operations on MaxBRkNN.

Proof. We prove this with a worst-case example. According to Theorem 3.4, on MaxBRkNN with σ(R) = |OBR(R)|, when σ(OBR(Ropt)) = 1 + β, the approximation ratio of PIC with β pruning operations is 1 − β/σ(OBR(Ropt)). As shown in the example of Figure 3.12, there are 3 EkNLCs. Since the EkNLCs overlap with each other, we can only prune one EkNLC in each pruning operation. In this case, σ(OBR(Ropt)) = 3, and when β = 2, the bound is 1/3.

Complexity. In PIC, computing the influence σ(o) for each o can be done in preprocessing (as in the GkE algorithm), and finding Opru costs O(|O| log |O| + |Opru|²) time in the worst case. Moreover, the running time of solving the new MaxBRkNN instance using the GkE algorithm is O(M|O^β_new| + |O^β_new|² + k|O^β_new| log |P| + |O^β_new|² t(σ(·))). Thus, the PIC algorithm has complexity O(|O| log |O| + |Opru|² + M|O^β_new| + |O^β_new|² + k|O^β_new| log |P| + |O^β_new|² t(σ(·))).

Figure 3.12: An example for the proof of the bound (three mutually overlapping EkNLCs of o1, o2 and o3).

3.5.2 Algorithm PRIC

According to Lemma 3.4, for any new server point p ∈ D in D = (O^β_pru, P), σ(OBR(p)) ≤ β·σmax(o|o ∈ O^β_pru). This means that the influence loss due to the pruning operations on independent EkNLCs depends on the maximum influence value of a client point in the pruned client point set. Taking this into account, we extend the PIC algorithm by limiting the value of σmax(o|o ∈ O^β_pru), obtaining a new approximation algorithm called Pruning Restricted Independent EkNLCs (PRIC).

The algorithm framework of PRIC is the same as that of PIC. The difference is that, in line 6 of Algorithm 4, for each circle in Cind, PRIC guarantees that the influence value of its corresponding client point is less than a given threshold. We set the threshold as γ·σmax(o|o ∈ O), where γ ∈ (0, 1]. We denote the pruned client point set as O′_pru. Following the proof of Theorem 3.3, we can prove that PRIC is an approximation algorithm with a controlled error bound after β pruning operations.
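The extra filtering step that PRIC adds to the pruning can be sketched as follows; the function and data are hypothetical stand-ins for the thesis implementation:

```python
# PRIC's only change to the PIC pruning step (line 6 of Algorithm 4):
# a client's circle is eligible for pruning only if its influence is
# below gamma * sigma_max. Data below is illustrative.

def pric_eligible(clients, sigma, gamma):
    """clients: iterable of client ids; sigma: id -> influence value.
    Returns the clients whose circles PRIC may prune."""
    sigma_max = max(sigma.values())
    return [o for o in clients if sigma[o] < gamma * sigma_max]

sigma = {"o1": 10.0, "o2": 1.5, "o3": 0.5, "o4": 9.0}
# with gamma = 0.2 the threshold is 2.0, so only o2 and o3 qualify
assert pric_eligible(sigma, sigma, 0.2) == ["o2", "o3"]
```

The independence check of Algorithm 5 would then run only over this filtered candidate set.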

Theorem 3.6. PRIC with β pruning operations is a 1/(1+βγ)-approximation algorithm.

Proof. We denote the optimal region returned by PRIC with β pruning operations as R^β_pric, and the set of remaining client points after β pruning operations as O′^β_new. The region influence set of R^β_pric is OBR(R^β_pric). We divide OBR(Ropt) into two parts, O′^β_BR1(Ropt) and O′^β_BR2(Ropt), where O′^β_BR1(Ropt) ⊆ O′^β_pru and O′^β_BR2(Ropt) ⊆ O′^β_new. Since omax ∈ O′^β_new, σ(omax) ≤ σ(OBR(R^β_pric)). According to Lemma 3.4, we get σ(O′^β_BR1(Ropt)) ≤ β·σmax(o|o ∈ O′^β_pru) ≤ βγ·σ(omax) ≤ βγ·σ(OBR(R^β_pric)). As O′^β_BR2(Ropt) is a region influence set in D = (O′^β_new, P), σ(O′^β_BR2(Ropt)) ≤ σ(OBR(R^β_pric)). Therefore, we can get:

σ(OBR(Ropt)) = σ(O′^β_BR1(Ropt) ∪ O′^β_BR2(Ropt)) ≤ σ(O′^β_BR1(Ropt)) + σ(O′^β_BR2(Ropt)) ≤ (1 + βγ)σ(OBR(R^β_pric)).

Complexity. Similar to the PIC algorithm, PRIC takes O(|O| log |O| + |O′_pru|²) time to find O′_pru, and costs O(M|O′^β_new| + |O′^β_new|² + k|O′^β_new| log |P| + |O′^β_new|² t(σ(·))) time to solve the new MaxBRkNN instance using the GkE algorithm. Hence, the complexity of the PRIC algorithm is O(|O| log |O| + |O′_pru|² + M|O′^β_new| + |O′^β_new|² + k|O′^β_new| log |P| + |O′^β_new|² t(σ(·))).

3.6 Experimental Study

In this section, we first describe the experimental settings in Section 3.6.1. Next, on the MaxBRkNN problem, we follow previous work [WÖF+11, LWW+13, LCGL13] to set σ(R) = |OBR(R)|, and compare our algorithms with two existing state-of-the-art algorithms in terms of quality of results, efficiency and scalability in Section 3.6.2. Moreover, we study the performance of our algorithms on the MaxBRkNN problem with a complex influence value in Section 3.6.3.

3.6.1 Experimental Settings

Algorithms. We study the performance of the following algorithms: two state-of-the-art algorithms, MaxSegment [LWW+13] and OptRegion [LCGL13], for MaxBRkNN; the exact algorithm GkE in Section 3.4; and the approximation algorithms PIC and PRIC in Section 3.5. As PIC and PRIC are the same when σ(R) = |OBR(R)|, we only show the results of PRIC in Section 3.6.3.

All algorithms are implemented in C++ and run on a Windows PC with an Intel(R) Core(TM) i5-6500 @ 3.20GHz CPU and 16 GB memory.

Datasets. We use four datasets in our experiments. The first two datasets are crawled from real-life location-based social networks. One is in the area of the USA, crawled from Foursquare¹ (denoted by FS). It contains 213,380 client points with their social relationships and check-ins, and we crawl 450,000 server points using the Foursquare APIs². The other is obtained from the Yelp Dataset Challenge³ (denoted by YELP). We obtain 265,432 client points with their social relationships and check-ins in the area of North America, and we also crawl 500,000 server points using the Foursquare APIs.

The other two datasets are synthetic. Following the method of [YCLW15], we obtain two California datasets, CA1 and CA2, from the U.S. census database by generating a set of points on each line. The data in CA1 is uniformly distributed, while the data in CA2 is Gaussian distributed. Both datasets contain 300,000 client points and 600,000 server points.

Each client point and server point has a pair of latitude and longitude. To compute distances in Euclidean space, we utilize the NDSF Coordinate Conversion Utility⁴ to convert the data into a Euclidean coordinate system.
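A simple stand-in for such a conversion is an equirectangular projection about a local origin (the thesis uses the NDSF utility; the constant, function name, and coordinates here are illustrative):

```python
# Project (lat, lon) to approximate planar meters relative to an
# origin; adequate over region-scale extents such as a single state.
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, meters

def to_xy(lat, lon, lat0, lon0):
    """Equirectangular projection of (lat, lon) about origin (lat0, lon0)."""
    x = math.radians(lon - lon0) * EARTH_RADIUS_M * math.cos(math.radians(lat0))
    y = math.radians(lat - lat0) * EARTH_RADIUS_M
    return x, y

x, y = to_xy(37.0, -122.0, 36.0, -122.0)
assert abs(y - 111194.9) < 100  # ~111 km per degree of latitude
```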

Parameters. All proposed algorithms use nvc to set the number of cells M in the lower-layer grid of QGI dynamically according to k. PIC and PRIC use β to set the number of pruning operations, and PRIC also uses γ to control the influence threshold for pruning the independent EkNLCs. We study their effects on the performance of the related algorithms at the beginning of Section 3.6.2 and Section 3.6.3. From the parameter tuning process, we pick the values that achieve the best performance for all parameters. We use k = 6 by default for all experiments.

¹https://github.com/jalbertbowden/foursquare-user-dataset
²https://developer.foursquare.com/docs
³https://www.yelp.com/dataset/download
⁴http://www.whoi.edu/marine/ndsf/utility/NDSFutility.js

Varying parameter nvc for index QGI. nvc is used in QGI by all proposed algorithms. Here we only show the results of tuning this parameter in GkE on the four datasets, as the results of tuning it in PIC and PRIC are similar.

3.6.2 Experimental Results

Varying Parameters

We vary the value of nvc in this set of experiments.

As shown in Figure 3.13, with the increase of nvc, the run time of GkE first drops,

Figure 3.13: Effect of parameter nvc on run time. (a) FS; (b) CA1; (c) YELP; (d) CA2.

and then increases slightly. This is because when nvc is small, the size of each cell is large. This makes the estimated radii of the EkNLCs large, and thus the number of EkNLCs pruned in the last step of GkE is small. On the other hand, when nvc is large, the time cost of estimating the radii of the EkNLCs is also large. We observe that the run time of GkE does not vary much (the change is less than 5 seconds), so nvc has little effect on the efficiency of GkE. We set nvc = 160 for FS and YELP, and nvc = 60 for CA1 and CA2, by default in subsequent experiments.

One-level grid and two-level grid comparison on GkE. GkE uses a two-level grid by default. To investigate the effect of the two-level grid versus a one-level grid, we denote GkE with a one-level grid as OGkE, and compare the efficiency of GkE and OGkE by comparing their run time on constructing EkNLCs. As the data in CA1 and YELP is uniformly distributed, GkE and OGkE have similar performance on them. Since the data in FS and CA2 is non-uniformly distributed, we show the results on these two datasets in Figure 3.14. It can be seen that GkE deals with non-uniformly distributed data more quickly than OGkE.

Varying the approximation algorithm's parameter β. The value of β balances the efficiency and accuracy of PIC. We vary β from 0 to 15 and show the efficiency and the precision in Figure 3.15 and Figure 3.16, respectively. With the increase of β, the

Figure 3.14: Comparison of one-level grid and two-level grid on run time. (a) FS; (b) CA2.

Figure 3.15: Effect of β on run time of PIC. (a) FS; (b) CA1; (c) YELP; (d) CA2.

efficiency improves (shorter run time) while the precision drops. This is because, with a larger β, the approximation algorithm PIC can prune more EkNLCs. When β ≥ 9, more than 60% of the EkNLCs are pruned on all datasets, and thus very few independent EkNLCs can be found in the subsequent pruning operations. Therefore, we set β = 9 for these four datasets, as it drastically reduces computation time while keeping more than 75% solution quality.

Algorithms Performance Comparison

Efficiency Comparison. We compare our algorithms (GkE and PIC) with the two existing algorithms, i.e., OptRegion and MaxSegment, by varying k on the four datasets. Figure 3.17 reports the run time of the four algorithms when we vary k from 3 to 15. It can be observed that GkE is 3-5 times faster than OptRegion and MaxSegment.

Figure 3.16: Effect of β on the precision of PIC. (a) FS; (b) CA1; (c) YELP; (d) CA2.

This indicates that the index QGI can reduce a large number of kNN computations for constructing kNLCs in practice. PIC outperforms the two existing methods by about one order of magnitude, and its run time grows more slowly than that of the other algorithms as k increases; this shows that the pruning technique of PIC speeds up GkE effectively. The run time of all algorithms increases as k grows. This is because, with a larger k, more EkNLCs intersect with each other, and thus more time is spent on searching for intersecting EkNLCs.

We also observe that OptRegion and MaxSegment have similar performance, and OptRegion is slightly better than MaxSegment on the large datasets used in our experiments. After analyzing the results, we found that this is because the kNN computation dominates more than 90% of the runtime in the two algorithms. This verifies our idea of reducing the kNN computation cost. It could not be observed in previous works because the size

Figure 3.17: Comparison of runtime on four datasets. (a) FS; (b) CA1; (c) YELP; (d) CA2.

of the datasets used is only about 10% to 20% of ours, and thus the kNN computation does not dominate the query processing.

Effectiveness Comparison. Figure 3.18 depicts the influence values (i.e., the number of client points in the influence set) of the results returned by GkE and PIC. Since the results of OptRegion and MaxSegment are the same as those of GkE, we omit them here. PIC always keeps more than 68% of the result quality compared to GkE. With the increase of k, the influence of the regions returned by all algorithms rises as well. The reason is that when k is larger, more EkNLCs intersect with each other, and thus larger influence values are included in the results.

Figure 3.18: Comparison of influence on four datasets. (a) FS; (b) CA1; (c) YELP; (d) CA2.

Scalability Study

In order to study the scalability of the four algorithms, we vary the number of client points |O| from 200,000 to 1 million in the two CA datasets, and we generate server points (|P| = 2|O|) following previous work [LWW+13].

Figure 3.19(a) and Figure 3.19(b) show the run time and influence values of the four algorithms on the uniformly distributed dataset CA1, while Figure 3.19(c) and Figure 3.19(d) show the results on the Gaussian distributed dataset CA2. Both the run time and the influence values increase as the number of client points becomes larger. It can be seen that the run time of GkE and PIC grows smoothly as the number of client points increases, while OptRegion and MaxSegment do not scale well because of the high cost of kNN computation and verification on a large number of server points and client points. We also observe that PIC is able to obtain high-quality results on large datasets.

Figure 3.19: Scalability. (a) Run time on CA1; (b) influence on CA1; (c) run time on CA2; (d) influence on CA2.

3.6.3 Performance Comparison on MaxBRkNN with Complex Influence Value

As introduced, the computation of a region's influence value can be complicated, as shown in the example of Figure 3.2. In this section, we study the performance of the algorithms when MaxBRkNN has a complex influence value computation function.

Specifically, we consider MaxBRkNN on location-based social networks, where a region's influence value is computed by the social influence spread [KKT03] of the clients in its influence set. The social influence spread is the number of users that can be influenced by the clients in the detected region. Its computation is quite time-consuming, as computing the social influence of a given group of users is a #P-hard problem [CWW10]. We denote the MaxBRkNN problem with social influence as MaxBRkNNSI.

Besides the maximal-arc optimization technique, we also utilize the sampling-based method [OAYK14] to compute the social influence of a set of users, as it is able to compute the social influence quickly with an error bound determined by the number of samples. We apply this method in all algorithms for the influence computation.
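A sampling-based estimator of this kind can be sketched under the independent cascade model; the graph, activation probability, and sample count below are illustrative assumptions, not the configuration used in the experiments:

```python
# Monte Carlo estimate of the social influence spread of a seed set
# under the independent cascade (IC) model: average the number of
# activated nodes over many simulated cascades.
import random

def estimate_spread(graph, seeds, p=0.1, samples=1000, rng=None):
    """graph: node -> list of neighbors. Returns the average number of
    activated nodes over `samples` simulated cascades."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    total = 0
    for _ in range(samples):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            u = frontier.pop()
            for v in graph.get(u, []):  # each edge is tried once
                if v not in active and rng.random() < p:
                    active.add(v)
                    frontier.append(v)
        total += len(active)
    return total / samples

graph = {"a": ["b", "c"], "b": ["c"], "c": []}
est = estimate_spread(graph, {"a"}, p=0.5, samples=2000)
assert 1.0 <= est <= 3.0  # the seed always counts; at most 3 nodes exist
```

The more samples are drawn, the tighter the error bound on the estimate, which is the efficiency/accuracy knob the sampling-based method exposes.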

We generate two synthetic location-based social networks by extracting a social network with 300,000 users from the public social network dataset Pokec⁵, and mapping the users onto the two datasets CA1 and CA2. We also use this method to generate the two datasets used in the scalability experiments, with up to 1 million client points.

Varying the approximation algorithms' parameters β and γ. The threshold parameter γ affects the number of client points that can be pruned by PRIC. We compute the ratio of client points which can be pruned by PRIC (i.e., whose social influence is less than γσ(omax)) by varying γ from 0.1 to 0.5 on the four datasets. The results show that, when γ = 0.2, more than 99% of the client points can be pruned by PRIC on all datasets, and thus we use γ = 0.2 as the default setting for PRIC. This is because real-world social networks usually follow a power-law distribution.

After setting γ = 0.2, the value of β balances the efficiency and accuracy of PIC and

PRIC. In order to compare PIC and PRIC at the same approximation ratios, we vary the approximation ratio and fix γ to compare their performance, rather than varying β directly. Figure 3.20 and Figure 3.21 show the efficiency and the precision respectively.

As the approximation ratio decreases, efficiency improves (shorter run time) while precision falls. This is because, with a smaller approximation ratio, β becomes larger and the approximation algorithms can prune more EkNLCs.

Moreover, at the same approximation ratios, PRIC always achieves better efficiency than PIC while keeping competitive precision. The reason is that the influence threshold γ makes the results of PRIC closer to the true results in practice.

5 http://snap.stanford.edu/data/soc-pokec.html

Figure 3.20: Comparison of PIC and PRIC on the efficiency. [Plots of run time (s) vs. approximation ratio (1 down to 1/5) on (a) FS, (b) CA1, (c) YELP, (d) CA2.]

Figure 3.21: Comparison of PIC and PRIC on the result quality. [Plots of precision vs. approximation ratio (1 down to 1/5) on (a) FS, (b) CA1, (c) YELP, (d) CA2.]

Thus, we set the approximation ratio to 1/3 by default, as it reduces much of the computation time while keeping more than 90% solution quality. We only use PRIC in the comparison with other algorithms in subsequent experiments.

Figure 3.22: Comparison of runtime on four datasets on MaxBRkNNSI. [Plots of run time (s) of PRIC, GkE, OptRegion, and MaxSegment vs. k (3 to 15) on (a) FS, (b) CA1, (c) YELP, (d) CA2.]

Efficiency and Effectiveness Comparisons. We vary k to compare the efficiency and the effectiveness of our algorithms (GkE and PRIC) with two existing algorithms (OptRegion and MaxSegment) on the four datasets.

Figure 3.22 and Figure 3.23 show the performance of these algorithms on the four datasets. Our proposed methods GkE and PRIC are significantly faster than OptRegion and MaxSegment, as the maximal-arc based optimization speeds them up further by avoiding many influence computations. In particular, PRIC, which obtains high-quality results (above 85% solution quality), is usually one order of magnitude faster than OptRegion and MaxSegment. This demonstrates that the index QGI reduces a large number of kNN computations, and that the pruning method of PRIC substantially improves the efficiency of GkE on MaxBRkNNSI while keeping high-quality solutions.

Figure 3.23: Comparison of influence on four datasets on MaxBRkNNSI. [Plots of the influence of PRIC and GkE vs. k (3 to 15) on (a) FS, (b) CA1, (c) YELP, (d) CA2.]

Scalability Study. Moreover, we study the scalability of the four algorithms on MaxBRkNNSI by varying the number of client points |O| from 200,000 to 1 million on the two CA datasets, following the method in Section 3.6.2.

The run time of the four algorithms on the CA datasets is shown in Figure 3.24(a) and Figure 3.24(c). As the number of client points increases, the run time of GkE and PRIC grows slightly, while the run time of OptRegion and MaxSegment rises dramatically. This demonstrates that GkE and PRIC scale better than OptRegion and MaxSegment on MaxBRkNNSI. In addition, the social influence results are shown in Figure 3.24(b) and Figure 3.24(d), which illustrate that PRIC can speed up GkE markedly while obtaining high-quality results on large datasets.

Figure 3.24: Scalability on MaxBRkNNSI. [Plots of run time (s) and influence vs. the number of client points (×10^4, 20 to 100): (a) run time on CA1, (b) influence on CA1, (c) run time on CA2, (d) influence on CA2.]

3.7 Conclusion

In this chapter, we consider the MaxBRkNN problem in location-based social networks and use a general submodular function to compute the value of influence. To solve MaxBRkNN on large datasets, we propose both exact and approximation algorithms. In the exact algorithm GkE, we first propose a hybrid quadtree-grid index (QGI) and then present an efficient method for estimating the upper bound of kNLCs' influence values, which enables us to compute exact kNNs for only a few clients. We also propose a maximal-arc optimization technique to reduce the influence computation cost. Moreover, we devise two approximation algorithms with error bounds to improve the efficiency of GkE. The experimental results demonstrate that, compared to two state-of-the-art algorithms (i.e., MaxSegment and OptRegion) for MaxBRkNN, GkE is 3-5 times faster, and our approximation algorithms are one order of magnitude faster with high accuracy.

Chapter 4

Optimal Region Search with Submodular Maximization

4.1 Introduction

With the proliferation of mobile devices and Location-Based Services (LBS) (e.g., Google Maps, Foursquare), a huge amount of geo-tagged data is being generated every day. Geo-tagged data carries rich information including timestamps, geo-locations, and comments, which facilitates a large number of location-based applications. As an important problem in LBS, region search has wide applications such as user exploration in a city [FCB+16, CCJY14], similar region queries [FCJG19], region detection [FGC+19], etc.

Most existing studies consider a region as a rectangle with a fixed length and width [NB95, CCT12, THCC13, FCB+16]. This is not general in practice, since regions may have arbitrary shapes. To address this limitation, Cao et al. [CCJY14] define a region as a connected subgraph and propose the Length-Constrained Maximum-Sum Region (LCMSR) query. It considers two attributes for region search: an objective

Figure 4.1: A toy example of a road network with POIs. [Six locations v1-v6, each labeled with POI keywords (e.g., coffee, mall, park, bar), connected by eight edges with travel time costs.]

value of the region, computed by summing up the weights of the nodes in the region, and a budget value, computed by summing up the costs of the edges in the region. The target of the LCMSR query is to find the region that has the largest objective value under a given budget constraint.

A simple accumulative function (e.g., SUM) cannot measure the objective value in some applications (e.g., the influence or the diversity of a region [FCB+16], the likelihood of information [LKG+07], the entropy of locations [ZV16]). In this chapter, we use a submodular function to compute the objective score of a region defined as a connected subgraph, which makes the problem more challenging. We denote this problem as optimal region search with submodular maximization (ORS-SM).

We use the example network shown in Figure 4.1 to illustrate the problem; it consists of 6 locations connected by 8 edges. Each location is associated with some keywords (e.g., mall, coffee) which represent its POI categories, and each edge has a travel time cost. Consider a user traveling in a city. She would like to explore the most diversified region to experience the most types of POIs, and she does not want to spend too much time on the road. We can use ORS-SM to find solutions for the user by computing the objective score of a region as the number of different types of POIs contained in the region and setting a travel time limit ∆. Given ∆ = 5, LCMSR returns the region

R1 = {e1 = (v2, v4), e2 = (v4, v5), e3 = (v4, v6)}, as R1 covers the largest total number of keywords (counted with duplicates), but R1 is not the most diversified region since it only contains two types of locations. By contrast, ORS-SM returns R2 = {e1 = (v1, v3), e2 = (v3, v5), e3 = (v4, v5)}, which covers the most distinct keywords.

As the objective function we use is submodular, the ORS-SM problem is a type of submodular optimization [KG14], which is extremely hard [ZCC+15, ZV16, SZKK17, Cra19]. We prove that solving the ORS-SM problem is NP-hard. To the best of our knowledge, only the Generalized Cost-Benefit algorithm (GCB) [ZV16], designed for submodular optimization with routing constraints, can be adapted to solve our ORS-SM problem. The algorithm requires a root node, which does not exist in our problem. Hence, we need to find a tree with the largest objective value by treating each node as a root, and the best tree among them is returned as the result for ORS-SM. We denote this method as GCBAll. GCBAll has high complexity, so it has poor efficiency and is not scalable.

To better solve ORS-SM, we propose a 1/O(√(∆/b_min))-approximation algorithm AppORS, where b_min is the minimal cost on edges. We observe that the returned region is always a tree. The basic idea of this algorithm is to first find a set of nodes with a large objective score, guaranteeing that there exists a tree connecting them whose total cost is not larger than the budget limit ∆, and then to use these nodes to find the best region. Next, we propose another algorithm with the same approximation ratio, called IAppORS, to further improve the effectiveness of AppORS. IAppORS first finds a node set using the method in AppORS to keep the approximation ratio, and then obtains a better node set with a larger objective score to improve the quality of the solution. Thus, IAppORS can obtain regions of better quality. We also present two heuristic methods to implement a key function for obtaining the better node set in IAppORS.

Contributions. In summary, the contributions of this chapter are threefold. Firstly, we

Table 4.1: Frequently Used Notation in ORS-SM

Notation    Meaning
G           A directed graph
V, v        A node set; a node
E           An edge set
∆           A budget constraint
R           A region
BS(R)       The budget score of region R
OS(R)       The objective score of region R
f(·)        A submodular function for computing objective scores
T           A tree
MST         A minimum-cost spanning tree

propose the ORS-SM problem through submodular optimization and prove its NP-hardness. Secondly, we develop two approximation algorithms, AppORS and IAppORS, for this problem, and present two heuristic methods to implement a key function in

IAppORS. Finally, we conduct experiments on two applications using three real-world datasets to demonstrate the efficiency and accuracy of the proposed algorithms.

Organization. The rest of this chapter is organized as follows. Section 4.2 presents the definition of the ORS-SM problem and proves its NP-hardness. Section 4.3 introduces the approximation algorithm AppORS, and Section 4.4 describes the improved approximation algorithm IAppORS together with two heuristic methods. Section 4.5 reports the experimental results, and Section 4.6 concludes this chapter.

4.2 Problem Formulation

In this section, we formally define the problem of optimal region search with submodular maximization (ORS-SM) and prove its hardness. The frequently used notations are presented in Table 4.1.

The problem is defined over a spatial network G = (V, E), where V is a set of nodes and E ⊆ V × V is a set of edges. Each node v ∈ V represents a location, and each edge ⟨vi, vj⟩ ∈ E represents an undirected path between two locations vi and vj in V, associated with a cost b(vi, vj) (representing the distance, the travel time, etc.).

Definition 4.1. Region. Given a spatial network G = (V, E), a region R is a connected subgraph of G. We denote the nodes and edges in R as R.V and R.E respectively.

In order to measure the quality of a region, we consider two values, namely the budget value and the objective value. The budget value is defined as the sum of the costs on all the edges in the region, computed as:

BS(R) = Σ_{(vi, vj) ∈ R.E} b(vi, vj)    (4.1)

The objective value is computed by a submodular function f(·) over the set of locations in the region (e.g., the number of distinct keywords, the social influence, etc.):

OS(R) = f(∪_{vi ∈ R.V} vi)    (4.2)

OS(R) being submodular means that OS(R1 ∪ {vi}) − OS(R1) ≥ OS(R2 ∪ {vi}) − OS(R2) for all R1 ⊆ R2 and vi ∉ R2, where R1 ∪ {vi} (resp. R2 ∪ {vi}) composes a new region. For example, consider the problem of finding the most diversified region in the network G shown in Figure 4.1. In this example, the objective value is computed as OS(R) = |∪_{vi ∈ R.V} K(vi)|, where K(vi) is the set of distinct keywords of vi. For R2 in the example, we get OS(R2) = |{bar, coffee, park, mall}| = 4 and BS(R2) = 5. Formally, we define the ORS-SM problem as follows:

Definition 4.2. Optimal Region Search with Submodular Maximization (ORS-SM).

Given G = (V, E) and a budget constraint ∆, we aim to find the region R such that

R = argmax_R OS(R)   subject to   BS(R) ≤ ∆    (4.3)

Theorem 4.1. Solving the ORS-SM problem is NP-hard.

Proof. We prove the NP-hardness of solving the ORS-SM problem by a reduction from the unit-cost version of the budgeted maximum coverage (UBMC) problem [KMN99]. Given a collection of sets S = {S1, S2, ..., Sm}, each with a unit cost C, a domain of elements X = {x1, x2, ..., xn} with associated weights {w1, w2, ..., wn}, and a budget L, UBMC is to find a collection of sets S′ ⊆ S whose total cost is no larger than L such that the elements covered by S′ have the largest total weight.

Given an instance ϕ of the UBMC problem, we construct an ORS-SM instance ω as follows: we build a graph containing m nodes, in which each node corresponds to a set in S, and each pair of nodes is connected by an edge with a unit cost C. The objective values of node sets are computed by their associated element weights. The budget value ∆ is set to L. Given this mapping, if S′ is the optimal result of ϕ, any tree spanning the nodes corresponding to the sets in S′ is the optimal region of ω, and vice versa. As the UBMC problem has been proved to be NP-hard, the ORS-SM problem is NP-hard as well. □

4.3 Approximation Algorithm AppORS

In this section, we propose a novel approximation algorithm called AppORS for solving the ORS-SM problem, and we also prove its error bound. For any subgraph region Ri, we can always find its minimum spanning tree Ti such that BS(Ti) ≤ BS(Ri) and OS(Ti) = OS(Ri). Hence, the optimal region of the ORS-SM problem must be a tree, and we directly consider tree search in the subsequent discussion. Before presenting the algorithm, we first introduce the following lemmas, which lay the foundation of our algorithm.

The first lemma is from Claim 3 of the Maximum Connected Submodular set function with Budget constraint (MCSB) problem [KLT14]. For a graph G = (V, E), we can get its line graph G^L = (V^L, E^L), and we denote any tree in G^L as T^L.

Lemma 4.1. For any tree T^L = (V_{T^L}, E_{T^L}) with budget value BS(T^L), there always exist n = max(5⌊BS(T^L)/m⌋, 1) subtrees T^L_i = (V_{T^L_i}, E_{T^L_i}) such that BS(T^L_i) ≤ m + BS(r^L_i) and ∪_{i=1}^n V_{T^L_i} = V_{T^L}, where r^L_i is the root of T^L_i, for all 1 ≤ i ≤ n.

We utilize Lemma 4.1 to prove the following lemma.

Lemma 4.2. For any tree T = (V_T, E_T) with budget value BS(T), there always exist n ≤ max(10⌊BS(T)/m⌋, 2) subtrees T_i = (V_{T_i}, E_{T_i}), 1 ≤ i ≤ n, such that BS(T_i) ≤ m if |E_{T_i}| > 1, and ∪_{i=1}^n V_{T_i} = V_T.

Proof. We conduct the proof through the following steps:

1) Given a graph G = (V, E), we can get its line graph G^L = (V^L, E^L). For a tree T = (V_T, E_T) with budget value BS(T) in G, we can get its corresponding subgraph G^L_T in G^L. If G^L_T contains cycles, we break them by removing some edges. After that, we get a tree T^L = (V_{T^L}, E_{T^L}) spanning all nodes of G^L_T. Now each edge in T corresponds to a node in T^L.

2) According to Lemma 4.1, for T^L with budget value BS(T^L), there always exist n = max(5⌊BS(T^L)/m⌋, 1) subtrees T^L_i = (V_{T^L_i}, E_{T^L_i}) such that BS(T^L_i) ≤ m + BS(r^L_i) and ∪_{i=1}^n V_{T^L_i} = V_{T^L}, where r^L_i is the root of T^L_i for all 1 ≤ i ≤ n. For each subtree T^L_i with |V_{T^L_i}| > 1, we can always select a node with degree 1 as r^L_i, and we can then divide T^L_i into two subtrees: T^{1,L}_i consisting of r^L_i, and T^{2,L}_i consisting of the remaining nodes of V_{T^L_i}. Hence, we know that BS(T^{2,L}_i) ≤ m. After that, we can obtain n ≤ max(10⌊BS(T^L)/m⌋, 2) subtrees T′^L_i = (V_{T′^L_i}, E_{T′^L_i}) such that BS(T′^L_i) ≤ m if |V_{T′^L_i}| > 1, and ∪_{i=1}^n V_{T′^L_i} = V_{T^L}.

3) For each subtree T′^L_i, we find its line graph G_i. As T′^L_i is obtained from the line graph of G, G_i must be a subgraph of the original tree T from step 1). Based on the properties of line graphs (i.e., the line graph of a connected graph is connected), G_i must be connected. Thus, each G_i must be a tree as well, which we denote as T_i, and we have ∪_{i=1}^n V_{T_i} = V_T. We thus complete the proof. □

Now we are ready to present our approximation algorithm AppORS. The basic idea of AppORS is to find a tree T* such that OS(T*) is larger than the objective score of any subtree Ti whose budget score is no larger than m, and BS(T*) ≤ ∆. Based on Lemma 4.2, we can then prove that the region R* obtained from T* is an approximate feasible solution of the problem. In AppORS, Nemhauser's greedy algorithm [NWF78] is used to solve the Maximum Rooted Submodular set function (MRS) problem [KLT14], which aims to find a node set containing a given root node and K − 1 other nodes such that the score computed by a submodular function over the K nodes is maximized. Nemhauser's greedy algorithm iteratively selects the node with the maximal marginal objective value to get K − 1 nodes, and picks the root node in the last step.

As shown in Algorithm 6, AppORS starts with an initial node set V^K_opt, a root node v_opt, and two parameters K and DisLim (lines 1-2), where b_min = min_{(vi,vj) ∈ E} b(vi, vj). After that, for each node vi ∈ V, AppORS first finds the candidate node set V_vi, in which each node has a distance to vi no larger than DisLim (lines 3-4). Then AppORS constructs an MRS problem instance MRS(OS(·), V_vi, K, vi), and uses Nemhauser's greedy algorithm to solve the instance, obtaining a top-K node set V^K_vi (line 5). If OS(V^K_opt) < OS(V^K_vi), AppORS updates V^K_opt and v_opt (lines 6-7). Next, AppORS uses the function findOptTree(V^K_opt, G, ∆, v_opt) to find a tree T* spanning all nodes in V^K_opt and satisfying BS(T*) ≤ ∆ (line 8). After that, AppORS finds T′, a single feasible edge with the maximum objective score, by scanning all edges in E (line 9). Finally, it returns the tree with the larger objective value as the region R* (lines 10-11). One simple method of implementing findOptTree(V^K_opt, G, ∆, v_opt) is to find the shortest paths from v_opt to all nodes in V^K_opt, and then obtain a tree T* by removing duplicate edges from these shortest paths. We will present a better method in Section 4.4.3.

Algorithm 6: AppORS Algorithm
Input: G = (V, E), ∆
Output: A region R*
1  Initialize a node set V^K_opt ← ∅, v_opt ← ∅;
2  K ← ⌊√(∆/b_min)⌋ + 1, DisLim ← √(∆ · b_min);
3  for each node vi ∈ V do
4      V_vi ← {vj | vi ≠ vj, dis(vi, vj) ≤ DisLim};
5      Use Nemhauser's greedy algorithm to solve MRS(OS(·), V_vi, K, vi), and get the top-K node set V^K_vi;
6      if OS(V^K_opt) < OS(V^K_vi) then
7          V^K_opt ← V^K_vi, v_opt ← vi;
8  T* ← findOptTree(V^K_opt, G, ∆, v_opt);
9  T′ ← a single edge in E with the maximum OS value and BS(T′) ≤ ∆;
10 R* ← T* if OS(T*) ≥ OS(T′) else T′;
11 return the region R*;

We illustrate the process of Algorithm 6 using an example of finding the node set for node v4 when ∆ = 9 in Figure 4.1. We first get K = 4 and DisLim = 3, and then obtain a candidate node set V_v4 = {v2, v3, v5, v6}. Utilizing Nemhauser's greedy algorithm, we can select v1, v3, v6, v4 sequentially, and achieve the top-K node set V^K_v4 = {v1, v3, v4, v6}. We analyze the error bound of AppORS below.

Lemma 4.3. AppORS returns a feasible solution of the ORS-SM problem.

Proof. In AppORS, BS(T′) ≤ ∆. V^K_opt contains K nodes, and the largest distance between v_opt and any vj ∈ V^K_opt is DisLim. Thus, for the tree T* obtained by combining the shortest paths from v_opt to all nodes in V^K_opt, BS(T*) ≤ (K − 1) · DisLim ≤ ∆. □

Lemma 4.4. OS(R_opt) ≤ O(√(∆/b_min)) OS(R*), where R_opt denotes the optimal region of the ORS-SM problem.

Proof. We denote the optimal solution of MRS(OS(·), V_vi, K, vi) as V^MRS_opt, a tree rooted at vi whose budget value is at most DisLim as T^{vi,DL}, and the T^{vi,DL} with the maximal objective value as T^{vi,DL}_opt. T^{vi,DL}_opt has at most ⌊DisLim/b_min⌋ nodes besides vi, and these nodes are in {vj | dis(vi, vj) ≤ DisLim}. Moreover, ⌊DisLim/b_min⌋ ≤ ⌊√(∆/b_min)⌋ ≤ K − 1.

For any node vj ∈ T^{vi,DL}_opt, we have vj ∈ V_vi. Thus, OS(V^MRS_opt) ≥ OS(T^{vi,DL}_opt). Meanwhile, in AppORS, OS(T*) ≥ OS(V^K_opt). Based on Lemma 2 in [KLT14], Nemhauser's greedy algorithm obtains an (e−1)/e-approximate solution for the MRS problem, so OS(V^K_opt) ≥ ((e−1)/e) OS(V^MRS_opt). Hence, we get OS(T*) ≥ ((e−1)/e) OS(T^{vi,DL}_opt).

As Ropt must be a tree, we use Topt to represent it. According to Lemma 4.2, for V E ≤ b ∆ c Topt = ( Topt , Topt ) with the budget value Topt, there always exist n max(10 DisLim , 2) V E ≤ |E | ∪n V V subtrees Ti = ( Ti , Ti ), where BS (Ti) DisLim if Ti > 1, such that i=1 Ti = Topt . Hence, we have:

n OS (Ropt) = OS (Topt) = OS (VTopt ) = OS (∪i=1VTi ) ∆ ≤ max(10b c, 2) ∗ max(OS (T vi,DL), OS (T 0)) DisLim opt p e ≤ O( ∆/b ) ∗ max( OS (T ∗), OS (T 0)) min e − 1 p ∗ = O( ∆/bmin)OS (R ) where the first inequality is derived from the properties of the submodular function

OS (·). 

Based on Lemmas 4.3 and 4.4, we can obtain Theorem 4.2:

Theorem 4.2. AppORS is a 1/O(√(∆/b_min))-approximation algorithm for the ORS-SM problem.

Proof. This is immediate, since AppORS returns a feasible solution whose objective score is no smaller than 1/O(√(∆/b_min)) times the optimal value. □

4.4 Improved AppORS

As AppORS considers the worst case when computing K nodes under the budget constraint, and only searches within V_vi = {vj | dis(vi, vj) ≤ DisLim}, it cannot achieve high-quality solutions in practice (as shown in the experimental study). To improve the solution quality of AppORS, we present an Improved version of AppORS (IAppORS). We also develop two heuristic methods to implement a key function of IAppORS.

4.4.1 An Improved Version of AppORS

We abbreviate lines 3-7 of Algorithm 6 as: for each node vi ∈ V, do update(v_opt, V^K_opt). In the improved version of AppORS, the details of the new update(v_opt, V^K_opt) function are shown in Algorithm 7. Different from the old function, the new one improves the solution quality by using a method findFeaNodeSet(vi, G, ∆) to find another feasible node set V^{2,K}_vi (lines 4-6). Note that V^{2,K}_vi is feasible iff it has a spanning tree T^{2,K}_vi over all its nodes with BS(T^{2,K}_vi) ≤ ∆.

Algorithm 7: New update(v_opt, V^K_opt)
Input: G = (V, E), ∆, K, DisLim, vi, v_opt, V^K_opt
Output: Updated v_opt, V^K_opt
1  V^1_vi ← {vj | vi ≠ vj, dis(vi, vj) ≤ DisLim};
2  Use Nemhauser's greedy algorithm to solve MRS(OS(·), V^1_vi, K, vi), and get the top-K node set V^{1,K}_vi;
3  V^K_vi ← V^{1,K}_vi;
4  V^{2,K}_vi ← findFeaNodeSet(vi, G, ∆);
5  if OS(V^K_vi) < OS(V^{2,K}_vi) then
6      V^K_vi ← V^{2,K}_vi;
7  if OS(V^K_opt) < OS(V^K_vi) then
8      V^K_opt ← V^K_vi, v_opt ← vi;
9  return v_opt, V^K_opt;

Theorem 4.3. IAppORS returns a 1/O(√(∆/b_min))-approximate solution of the ORS-SM problem.

Proof. As both V^{1,K}_vi and V^{2,K}_vi are feasible node sets, IAppORS finds a feasible solution for the ORS-SM problem. Meanwhile, the solution quality of IAppORS is not worse than that of AppORS. This leads to the conclusion. □

4.4.2 Two Heuristic Methods for IAppORS

To implement findFeaNodeSet(vi, G, ∆) in IAppORS, we present two heuristic methods. First, we introduce the function compute(BŜ(Vi)), which estimates the minimal budget value of a tree spanning all nodes in the node set Vi. Computing this value exactly is a Steiner tree problem, which is NP-hard [Vaz13]. Rather than computing the exact BS(Vi), we use an approximation algorithm to obtain an estimated value BŜ(Vi): we run Kruskal's algorithm to compute a minimum-cost spanning tree (MST) over the nodes in Vi, as the MST yields a 2-approximate solution for the Steiner tree problem [Vaz13], and Kruskal's algorithm is efficient. We denote this method as compute(BŜ(Vi)).
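A sketch of compute(BŜ(Vi)): run Kruskal's algorithm on the metric closure of Vi, i.e., the complete graph over Vi weighted by pairwise shortest distances; the resulting MST weight is within a factor of 2 of the optimal Steiner tree cost. The distance oracle `dis` below is a placeholder for precomputed shortest distances.

```python
def estimate_budget(nodes, dis):
    """Kruskal's algorithm over the metric closure of `nodes`.
    Returns the MST weight, a 2-approximation of the Steiner tree cost
    (`dis(u, v)` is assumed to return the shortest distance between u, v)."""
    nodes = list(nodes)
    parent = {v: v for v in nodes}

    def find(v):  # union-find with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = sorted(
        (dis(u, v), u, v)
        for i, u in enumerate(nodes)
        for v in nodes[i + 1:]
    )
    total = 0.0
    for w, u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:  # adding this edge merges two components
            parent[ru] = rv
            total += w
    return total
```

Since BŜ(Vi) only overestimates the true minimal budget BS(Vi), accepting a node whenever BŜ ≤ ∆ never produces an infeasible node set.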

Next, we present the first implementation, findFeaNodeSet1(vi, G, ∆). As shown in Algorithm 8, it begins by initializing the node sets V^{2,K}_vi and V^2_vi (line 1). In each while loop, the method selects a node v_max with the maximum marginal objective value for the current V^{2,K}_vi from V^2_vi. After that, the method removes v_max from V^2_vi and computes BŜ(V^{2,K}_vi ∪ {v_max}) utilizing Kruskal's algorithm. If the estimated budget value is not larger than ∆, v_max is inserted into V^{2,K}_vi. The process terminates when V^2_vi = ∅ (lines 2-7). In the last step, the method returns the node set V^{2,K}_vi (line 8).

Algorithm 8: Function findFeaNodeSet1(vi, G, ∆)
Input: G = (V, E), ∆, vi
Output: V^{2,K}_vi
1  V^{2,K}_vi ← {vi}, V^2_vi ← {vj | vi ≠ vj, dis(vi, vj) ≤ ∆};
2  while V^2_vi ≠ ∅ do
3      v_max ← argmax_{vj ∈ V^2_vi} OS(V^{2,K}_vi ∪ {vj});
4      V^2_vi ← V^2_vi \ {v_max};
5      BŜ(V^{2,K}_vi ∪ {v_max}) ← compute(BŜ(V^{2,K}_vi ∪ {v_max}));
6      if BŜ(V^{2,K}_vi ∪ {v_max}) ≤ ∆ then
7          V^{2,K}_vi ← V^{2,K}_vi ∪ {v_max};
8  return V^{2,K}_vi;

We use an example to show the process of findFeaNodeSet1(vi, G, ∆) in Figure 4.1. When ∆ = 5, for node v4, we first obtain a candidate node set V^2_v4 = {v1, v2, v3, v5, v6}. Next, we can get V^{2,K}_v4 = {v1, v3, v4, v5} by selecting v4, v1, v3, v5 sequentially, as the estimated budget value is always not larger than ∆ throughout the process.

2,K ˆ 2,K 2,K Note that, as BS (Vvi ∪vmax) ≤ BS (Vvi ∪vmax), and thus we have BS (Vvi ∪vmax) ≤

2,K ∆. So that Vvi is a feasible solution of the ORS-SM problem. In addition, we can speed up the process of f indFeaNodeS et1(vi, G, ∆) using an optimization technique called

+ Lazy Evaluations [LKG 07]. We denote IAppORS with f indFeaNodeS et1(vi, G, ∆) as IAppORSHeu1.

Next, we propose another heuristic method to implement findFeaNodeSet(vi, G, ∆), which is faster than the first one. Intuitively, for two nodes v1 and v2, when dis(v1, v2) is smaller, the difference between V^{2,K}_v1 and V^{2,K}_v2 is smaller. Thus, we can use V^{2,K}_v1 to approximate V^{2,K}_v2. Based on this observation, we present the second method, findFeaNodeSet2(vi, G, ∆). Most steps are the same as in Algorithm 8, but line 4 is replaced by: V^nei_vi ← {vj | dis(vi, vj) ≤ γ∆}, V^2_vi ← V^2_vi \ (V^nei_vi ∪ {v_max}), where γ ∈ (0, 1] is a parameter and V^nei_vi is the neighbor node set of vi. In this way, we reuse the second feasible node set of vi for vi's neighbor nodes, and reduce the related computations. We denote IAppORS with findFeaNodeSet2(vi, G, ∆) as IAppORSHeu2; V^{2,K}_vi in IAppORSHeu2 is also a feasible node set.

We also illustrate the process of findFeaNodeSet2(vi, G, ∆) with an example for ∆ = 5 in Figure 4.1. When γ = 0.2, γ∆ = 1. After obtaining the feasible node set V^{2,K}_v4 = {v1, v3, v4, v5} for v4, we can skip the computation of a feasible node set for v5, as dis(v4, v5) ≤ γ∆. This shortcut does not reduce the solution quality in this example.

4.4.3 A Better findOptTree(V^K_opt, G, ∆, v_opt)

In this section, we present a better method to implement the function findOptTree(V^K_opt, G, ∆, v_opt) in line 8 of Algorithm 6.

We first compute the shortest distances between each pair of nodes using existing methods. Then, we use Kruskal's algorithm to find an MST (denoted as MST1) spanning all nodes in V^K_opt; MST1 contains edges standing for shortest paths whose concrete node sequences are still unknown. Next, we recover the real shortest paths by inserting the node set V_mid = {vi | vi ∈ V_{v_opt}, BŜ(V^K_opt ∪ {vi}) = BŜ(V^K_opt)} into V^K_opt, where V_{v_opt} ← {vi | vi ∉ V^K_opt, dis(v_opt, vi) ≤ ∆}, and finding a new MST (denoted as MST2) spanning all nodes of the expanded set.

MST2 is a feasible solution for the ORS-SM problem, but its budget value may be far less than ∆. We improve MST2 by extending it greedily. In each iteration, we select a node that is connected to MST2 and has the maximum marginal objective value, and insert the node into MST2. The process is repeated until no more nodes can be added to MST2. Finally, we get a better region MST2 from V^K_opt. We use this method to implement findOptTree(V^K_opt, G, ∆, v_opt) in all proposed algorithms by default.

We use an example of the most diversified region search in the network in Figure 4.1 to illustrate this method. When ∆ = 5, we first get a feasible node set V^K_opt = {v4, v5}, and then obtain MST2 = MST1 = {e1 = (v4, v5)}. As the budget value of MST2 is less than ∆, we extend it greedily, and get a new MST2 = {e1 = (v1, v3), e2 = (v3, v5), e3 = (v4, v5)} by selecting v3, v1 sequentially.

4.4.4 Time Complexity of the Proposed Algorithms

In this section, we analyze the complexity of the proposed algorithms AppORS, IAppORSHeu1, and IAppORSHeu2. These algorithms have two main steps: Step 1 computes a feasible node set V^K_opt, and Step 2 searches for a feasible spanning tree T* over V^K_opt and the single-edge tree T′. We denote the time cost of computing dis(vi, vj) as t_dis. AppORS needs |V| loops in Step 1; in each loop, it takes O(|V| t_dis) time to get the candidate node set V_vi, and O(K |V_vi|_max) time to get the top-K node set and update V^K_opt and v_opt, where |V_vi|_max is the maximum value of |V_vi|. The method in Section 4.4.3 is the default method used in Step 2. It first takes O(|V^K_opt|^2 log|V^K_opt|) time to find MST1. After that, it costs O(|V| t_dis) time to get the node set V_{v_opt}, and then O(|V_{v_opt}| |V^K_opt|^2 (t_dis + log|V^K_opt|)) time to get MST2. Next, it requires O(|V_{v_opt}| |V^K_opt|) time to extend MST2. Finally, it takes O(|E|) time to find T′. Thus, the time complexity of AppORS is O(|V|^2 t_dis + |V_{v_opt}| |V^K_opt|^2 (t_dis + log|V^K_opt|)).

IAppORSHeu1 also needs |V| loops in Step 1. In each loop, it first takes O(|V| t_dis) time to get V^{1,K}_vi, and then O(|V| t_dis + |V^2_vi|_max (|V^{2,K}_vi|_max)^2 (t_dis + log|V^{2,K}_vi|_max)) time to get V^{2,K}_vi. Step 2 of IAppORSHeu1 is the same as that of AppORS, and the worst case of IAppORSHeu2 is IAppORSHeu1. Therefore, the time complexity of IAppORSHeu1 and IAppORSHeu2 is O(|V|^2 t_dis + |V| |V^2_vi|_max (|V^{2,K}_vi|_max)^2 (t_dis + log|V^{2,K}_vi|_max)).

4.5 Experimental Study

We compare the proposed algorithms to the state-of-the-art algorithm GCBAll in two applications on three real-world datasets. As discussed in Section 4.1, GCBAll is the adapted version of GCB [ZV16] for solving our ORS-SM problem. In GCBAll, to compute an approximation of the minimal budget value of a tree spanning all the given nodes, we adopt the 2-approximation algorithm for the Steiner tree problem [Vaz13] by utilizing the minimum spanning tree. We implement all the algorithms in Java on Windows 10, and run them on a server with an Intel(R) Xeon(R) W-2155 3.3GHz CPU and 256 GB memory.

4.5.1 Case Study 1: Most Influential Region

Problem Definition and Datasets

We first investigate the Most Influential Region Search (MIRS) which aims to find an optimal region under a budget constraint such that the number of affected users is max- imal in a location-based social network. This problem is useful in many scenarios, e.g., a company wants to find a region for marketing its products, the government intends to

find a region for building some public facilities, etc. We model the location-based social network as an undirected graph G = (V, E) and compute the probability that a user u_i visits a region R as P_ui^R = 1 − ∏_{v_j ∈ R.V} (1 − P_ui^{v_j}), where P_ui^{v_j} is the probability that u_i visits the location v_j, and it is computed as P_ui^{v_j} = (# of check-ins in v_j of u_i) / (# of check-ins of u_i). Thus, the objective value of R is the number of users expected to be affected by R, and it is computed as a submodular function: OS(R) = Σ_{u_i ∈ U} P_ui^R, where U represents all users. In this problem, we use two real-world datasets SG and AS crawled from FourSquare

(also used in the work [ZCC+15]), in which SG has 189,306 check-ins made by 2,321 users at 5,412 locations in Singapore, and AS contains 201,525 check-ins made by 4,630 users at 6,176 locations in Austin. Following the work [ZCC+15], we create an edge between two locations that were visited consecutively in one day by the same user, and use the Euclidean distance as the budget value for each edge. We also removed the check-ins made by users who checked in at a location fewer than 3 times.
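The visiting probability and influence objective above can be sketched directly from their definitions; in the snippet below the check-in counts are made-up illustration data, not values from SG or AS:

```python
# Sketch of the MIRS objective: P^{v_j}_{u_i} from check-in counts,
# P^R_{u_i} = 1 - prod(1 - P^{v_j}_{u_i}), and OS(R) = sum over users.
# The check-in counts below are made-up illustration data.
checkins = {                      # user -> {location: #check-ins}
    "u1": {"v1": 2, "v2": 2},
    "u2": {"v2": 1, "v3": 3},
}

def p_visit(user, loc):
    """P^{v_j}_{u_i}: fraction of the user's check-ins made at loc."""
    total = sum(checkins[user].values())
    return checkins[user].get(loc, 0) / total

def p_region(user, region):
    """P^R_{u_i}: probability the user visits at least one location in R."""
    prob_miss = 1.0
    for loc in region:
        prob_miss *= 1.0 - p_visit(user, loc)
    return 1.0 - prob_miss

def influence(region):
    """OS(R): expected number of affected users (submodular)."""
    return sum(p_region(u, region) for u in checkins)

# u1: 1-(1-0.5)(1-0.5)=0.75; u2: 1-(1-0.25)=0.25, so OS = 1.0
assert abs(influence(["v1", "v2"]) - 1.0) < 1e-9
```

Adding a location to a region can only increase each user's visiting probability, which is why OS(R) is monotone submodular.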

Effect of Parameter γ

The value of parameter γ controls the pruning power of IAppORSHeu2: the larger γ is, the fewer nodes need to search for another feasible node set, and the lower the run time of IAppORSHeu2. We run IAppORSHeu2 on the SG and AS datasets by varying γ from 0.05 to 0.25. Figure 4.2 validates this analysis, and also shows that the larger γ is, the worse the solution quality. We set γ = 0.15 for IAppORSHeu2 in the subsequent experiments for this application.

Figure 4.2: Effect of parameter γ on IAppORSHeu2 in MIRS.

Figure 4.3: Comparison of algorithms in MIRS.

Performance of Our Methods

We compare the three proposed algorithms with GCBAll in terms of efficiency (the run time) and effectiveness (the objective score) by varying ∆ from 20km to 60km. Figure 4.3 shows that GCBAll is rather time-consuming when ∆ is large, as it calls compute(B̂S(V_i)) too many times and cannot use the Lazy Evaluations optimization [LKG+07] to find v_max quickly. All the proposed algorithms are faster than GCBAll by more than one order of magnitude. Meanwhile, the objective scores of IAppORSHeu1 and IAppORSHeu2 are relatively high (more than 90% of GCBAll's), which is also consistent with our theoretical analysis of the solution quality. The results demonstrate that the two presented heuristic methods can improve the solution quality of AppORS significantly.

4.5.2 Case Study 2: Most Diversified Region

Problem Definition and Datasets

The second application Most Diversified Region Search (MDRS) is to find the most diversified region under a budget constraint. We consider MDRS on road networks which contain a set of locations associated with a set of keywords (e.g., mall and bar). We measure the diversity of a region using the number of different keywords, and thus the

objective function is OS(R) = |∪_{v_i ∈ R.V} K(v_i)|, where K(v_i) is the set of keywords on v_i. We use the road network in California (CA) from a public website1. We then utilized

the Foursquare APIs to fill in the missing keywords for nodes (categories of locations)2.

CA contains 21,048 nodes and 22,830 edges [LCH+05]. In addition, to compare the

performances of the three algorithms by varying the dataset size, we generate a synthetic

dataset (denoted as SY) based on the structure of the CA dataset. The number of nodes in SY ranges from 10,000 to 30,000.
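The diversity objective is a plain coverage function, which is what makes it submodular; the sketch below (with made-up keyword sets) computes OS(R) and checks the diminishing-returns property:

```python
# Sketch of the MDRS objective OS(R) = |union of K(v) over v in R.V| and
# its diminishing-returns (submodular) property; keyword sets are made up.
K = {"v1": {"mall", "bar"}, "v2": {"mall"},
     "v3": {"park", "museum"}, "v4": {"bar", "park"}}

def diversity(region):
    """Number of distinct keywords covered by the region's nodes."""
    return len(set().union(*(K[v] for v in region))) if region else 0

def gain(region, v):
    """Marginal gain of adding node v to the region."""
    return diversity(list(region) + [v]) - diversity(region)

small, large = ["v1"], ["v1", "v2", "v3"]
# Submodularity: adding v4 helps the smaller region at least as much.
assert gain(small, "v4") >= gain(large, "v4")
```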

1http://www.cs.utah.edu/ lifeifei/SpatialDataset.htm 2https://developer.foursquare.com/docs/api Chapter 4. Optimal Region Search with Submodular Maximization 87


Figure 4.4: Effect of parameter γ on IAppORSHeu2 in MDRS.


Figure 4.5: Performances of the algorithms in MDRS on CA.

Effect of Parameter γ

We vary γ from 0.05 to 0.25 to run IAppORSHeu2 on the CA dataset and on SY with 25,000 nodes. The results are reported in Figure 4.4, and we set γ = 0.15 for IAppORSHeu2 in the following experiments for this application.

Performance of Our Methods

In MDRS, we also compare our algorithms with GCBAll in terms of efficiency and effectiveness by varying ∆ from 20km to 60km. As shown in Figure 4.5, GCBAll is time-consuming when ∆ is large, while our algorithms are faster than GCBAll by more than one order of magnitude. In addition, IAppORSHeu1 and IAppORSHeu2 can achieve high-quality solutions.

Figure 4.6: Scalability of our algorithms in MDRS on CA.

Figure 4.7: Scalability of our algorithms in MDRS on SY.

Scalability of Our Methods

To study the scalability of our methods, we run the three proposed algorithms on larger ∆ values using CA. As shown in Figure 4.6, the objective scores of IAppORSHeu1 and IAppORSHeu2 are competitive, and higher than that of AppORS. Meanwhile, IAppORSHeu2 is faster than IAppORSHeu1 by about one order of magnitude, and can solve the problem within 4s when ∆ = 100km.

Next, we run the three algorithms on SY by varying the number of nodes from 10,000 to 30,000, and set ∆ = 60km. Figure 4.7 shows that the run time of IAppORSHeu2 grows slowly with the increasing number of nodes, and the solution quality of IAppORSHeu2 is close to that of IAppORSHeu1, and better than that of AppORS.

4.6 Conclusion

In this chapter, we propose the ORS-SM problem, which aims to find the optimal region that maximizes the objective score of a region computed by a submodular function, subject to a given budget score constraint. To efficiently solve the ORS-SM problem, we propose two approximation algorithms and further improve them with some optimization techniques. The results of empirical studies on two applications using three real-world datasets demonstrate the efficiency and the solution quality of our proposed algorithms.

Chapter 5

Constrained Path Search with Submodular Maximization

5.1 Introduction

The path search problem is an essential task in many applications such as navigation and tourist trip planning. Most path search systems (such as Google Maps) compute paths based on a single criterion, which is usually either the total path length or the total travel time. However, in practice, the user often needs to consider multiple criteria during the path search. Studies on path search with constraints [Jok66, Has92, WXYL16] aim to find the optimal path under a given constraint (such as a length limit).

Generally, in path search with constraints, two scores can be computed for each path: an objective score (to be optimized) and a budget score (for constraint checking). The objective score of a path can be defined in various ways according to the application. In this chapter, we assume a general submodular function computed over a set of nodes, and we aim to find the path that maximizes the submodular function score (computed over the nodes in the path) from an origin s to a destination t while satisfying a given budget limit ∆.


(Figure: v_s and v_t are labeled {restaurant}; v1 {bar, mall}; v2 {coffee, museum}; v3 {mall}; v4 {mall, restaurant}; v5 {mall}; v6 {park}. Edge costs: (v_s, v1) = 1, (v1, v3) = 2, (v3, v4) = 3, (v4, v_t) = 4, (v_s, v2) = 3, (v2, v3) = 2, (v3, v6) = 3, (v6, v_t) = 2, (v3, v5) = 2, (v5, v_t) = 2.)

Figure 5.1: A toy road network with locations labeled by keywords.

This problem was first studied by Chekuri and Pal [CP05] as the submodular orienteering problem. As the problem is an s-t path query, we rename it the constrained path search with submodular maximization (CPS-SM) query. This query can be used in many applications, e.g., the Most Diversified Path Query:

Example 5.1 (Most Diversified Path Query). Assume a tourist is spending holidays in a city. The tourist would like to plan a trip such that she can visit the most types of points-of-interest (e.g., shopping malls, coffee shops, parks, and museums). This will enable the tourist to experience many different attractions and services on a single trip. Meanwhile, she is not willing to travel too far away from her hotel. We can set the tourist's hotel as the origin and the destination, and the score of a path is the number of distinct types of POIs passed by (which is submodular).

Figure 5.1 shows a toy road network where each location is associated with some keywords (such as "mall" and "bar") which represent the categories of the POIs on it, and each edge is associated with a cost (such as travel time). Assume a user wants to find the most diversified path under a budget limit ∆ = 10. If using a sum function to count keywords, P1 = ⟨v_s, v1, v3, v4, v_t⟩ is found because it has 7 keywords (not distinct!), but it only passes 2 types of POIs. In contrast, the CPS-SM query aims to return P2 = ⟨v_s, v2, v3, v6, v_t⟩, which passes by the most types of locations.

In most constrained path search problems, the objective function is a simple aggregation (sum) function. For example, the objective score is computed as the shortest path length [WXYL16]. Hence, given two partial paths P1 and P2 from the origin s to the same node v_j, if both scores of P1 are worse than those of P2, P1 can be pruned safely, because by further extending the two paths with the same node, the two scores of path P1 ∪ v_j are also worse than those of path P2 ∪ v_j. Unfortunately, this does not hold for our CPS-SM query, since the objective score is computed by a submodular function over the set of nodes in a path. Given that P1 is worse than P2 on v_j, we cannot guarantee that P1 ∪ v_j is worse than P2 ∪ v_j due to the submodular properties. This renders existing methods on constrained path search inapplicable to our problem. The submodular objective function poses more challenges in developing algorithms to process the CPS-SM query, which can be viewed as a submodular optimization problem [KG14].
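The failure of score-based dominance can be seen on a tiny coverage example; the keyword sets below are made up for illustration:

```python
# Illustrative counterexample (hypothetical keyword sets): with a coverage
# objective, a partial path that looks worse at a node can overtake after
# extension, so classic dominance-based pruning is unsafe here.
def coverage(keyword_sets):
    """Submodular objective: number of distinct keywords covered."""
    out = set()
    for s in keyword_sets:
        out |= s
    return len(out)

p1 = [{"a"}]              # keywords collected by partial path P1 at v_j
p2 = [{"b"}, {"c"}]       # keywords collected by partial path P2 at v_j
ext = {"b", "c", "d"}     # keywords of a node both paths may visit next

assert coverage(p1) < coverage(p2)                   # P1 looks dominated ...
assert coverage(p1 + [ext]) > coverage(p2 + [ext])   # ... yet wins after extension
```

The extension overlaps heavily with P2's coverage but not with P1's, so P1's marginal gain is larger, exactly the submodular effect described above.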

Most submodular optimization problems are hard to solve [ZCC+15, ZV16, SZKK17, Cra19], and we prove that solving the CPS-SM query is NP-hard. There exist the recursive greedy (RG) algorithm [CP05] and the GreedyDP algorithm [SNY18] to answer the CPS-SM query. RG is a quasi-polynomial time algorithm; it finds solutions by recursively guessing the middle node and the budget cost to reach the middle node, so that it requires a large number of iterations in practice. For each query, GreedyDP constructs a zero-suppressed binary decision diagram (ZDD) to represent all feasible solutions, and then obtains an approximate solution from the ZDD. As the size of the ZDD generally increases exponentially with the number of nodes in the graph, GreedyDP costs too much time and space to answer the query on large graphs. Some experimental results show that RG costs more than 10^4 seconds on a small graph [SKG+07] with 22 nodes, and GreedyDP takes 4.0 × 10^3 seconds on a dataset with only 7 × 13 grids (each grid can be viewed as a node in the network) [SNY18]. Both algorithms are so time-consuming that they are not scalable and cannot be used to solve the problem on real-world datasets, which typically have thousands of nodes.

The study [ZV16] proposes the Generalized Cost-Benefit (GCB) algorithm for submodular optimization with routing constraints. The algorithm can be adapted to answer the CPS-SM query, but it can only return paths under a smaller budget limit with a guaranteed error bound. This means that GCB cannot guarantee any error bounds for the optimal solution under the original budget limit, and the returned result also has poor quality in some applications. The work [ZCC+15] studies a special case of the CPS-SM query, and the proposed weighted A* algorithm can get approximate solutions, but it is extremely time-consuming, as it stores a large number of partial paths during the search.

To answer the CPS-SM query effectively and efficiently, we first propose a concept called “submodular α-dominance” by using the properties of the submodular function.

This concept enables us to compare the quality of two partial paths from the origin to the same node. We can prune some partial paths during the path search with guaranteed error bounds according to this concept. Based on this, we develop the first approximation algorithm SDD (Submodular α-Dominance based Dijkstra), which yields an approximation ratio of (1/α)^(⌊∆/b_min⌋−1), where b_min is the minimal budget value on the edges and α is a parameter. Next, we further improve the efficiency of the algorithm by relaxing the submodular α-dominance conditions, and propose another approximation algorithm OSDD. OSDD has the same approximation ratio, but it can discard more partial paths during the search and thus is much faster.

To further improve the efficiency of SDD and OSDD, we use bi-directional path search to reduce the number of generated labels. We apply the bi-directional search to both SDD and OSDD, obtaining two more algorithms Bi-SDD and Bi-OSDD. We show that these two algorithms have an approximation ratio of (1/α)^⌊∆/b_min⌋, and are more efficient than SDD and OSDD, respectively. We finally devise an efficient heuristic algorithm CBFS with polynomial runtime for solving the CPS-SM query. Although it does not have any guaranteed error bounds, it can return results with good quality in practice. The experimental study conducted on several real-world datasets shows that our proposed algorithms can achieve high accuracy and are faster than one state-of-the-art method by orders of magnitude.

Contributions. In summary, we make the following contributions: i) We introduce the CPS-SM query and prove that the problem is NP-hard; ii) We propose a concept called "submodular α-dominance" by using the properties of the submodular function, and then design two approximation algorithms based on the concept to answer the CPS-SM query; iii) We utilize the bi-directional search method to speed up the proposed approximation algorithms; iv) We devise an efficient yet effective heuristic algorithm that has polynomial runtime; v) We conduct experiments on both real and synthetic datasets to demonstrate the efficiency and accuracy of the proposed algorithms.

Organization. The chapter is organized as follows. Section 5.2 gives the formal definition of the problem and shows its complexity. Sections 5.3 and 5.4 describe the approximation algorithms SDD and OSDD, respectively. Section 5.5 proposes using bi-directional search to develop the algorithms Bi-SDD and Bi-OSDD. Section 5.6 explains the heuristic algorithm CBFS. Finally, the experimental study is presented in Section 5.7, and the conclusion is drawn in Section 5.8.

5.2 Problem Formulation

In this section, we present the formal definition of the constrained path search with submodular maximization (CPS-SM) query, and we show its hardness. The frequently used notations are presented in Table 5.1.

Table 5.1: Frequently Used Notation in CPS-SM

Notation | Meaning
v_s | Query source node
v_t | Query target node
∆ | Query budget limit
P | A path from v_s to v_t
P_i^k | A partial path from v_s to a node v_i
BS(P) | The budget score of path P
OS(P) | The objective score of path P
f(·) | A submodular function that computes objective scores
L_i^k | A label that represents P_i^k
→L_i^k | A forward node label representing a path from v_s to v_i
←L_i^k | A backward node label representing a path from v_i to v_t

Our problem is defined on a directed graph G = (V, E) which consists of a set of nodes V and a set of edges E ⊆ V × V. G can be used to model many kinds of graph data. For example, if G is a road network, each node v ∈ V represents a location, and each edge ⟨v_i, v_j⟩ ∈ E represents a directed path from v_i to v_j; the edges can be associated with weights such as travel time or distance. Although we only consider directed graphs in this chapter, it is straightforward to extend our solutions to undirected graphs.

Definition 5.1. Path. Given a directed graph G = (V, E), P = ⟨v_1, v_2, ..., v_m⟩ is called a path from v_1 to v_m, where each pair of consecutive nodes in P has an edge in E.

We consider two scores for finding the optimal path, namely the budget score and the objective score. Each edge ⟨v_i, v_j⟩ ∈ E is attached with a budget value (e.g., the travel time or distance), which is denoted by b(v_i, v_j). For any set of nodes V_x ⊆ V, we compute its objective score by a submodular function f(·) over V_x, i.e., f(V_x) (e.g., the degree of POI type coverage, social influence, etc.). Given a path, its two scores are defined formally as below.

Definition 5.2. Path Budget Score. Given a path P = ⟨v_1, v_2, ..., v_m⟩, its budget score is computed as the sum of the budget values on all the edges along the path, i.e.,

BS(P) = Σ_{i=1}^{m−1} b(v_i, v_{i+1})    (5.1)

Definition 5.3. Path Objective Score. Given a path P = ⟨v_1, v_2, ..., v_m⟩, its objective score is computed by f(·) over the nodes in P, i.e.,

OS(P) = f(∪_{v_i ∈ P} v_i)    (5.2)

For example, in Figure 5.1, assume that the objective value is the number of distinct POI types in a path; then the objective score is computed as OS(P) = |∪_{v_i ∈ P} K(v_i)|, where K(v_i) is the set of keywords on v_i. Given P1 = ⟨v_s, v1, v3, v4, v_t⟩, we can get OS(P1) = |{restaurant, bar, mall}| = 3 and BS(P1) = 10. To be specific, OS(P) being submodular means that OS(P1 ∪ v_i) − OS(P1) ≥ OS(P2 ∪ v_i) − OS(P2), given P1 ⊆ P2 and v_i ∉ P1, where P1 (resp. P2) ∪ v_i composes a new path. It is obvious that computing the number of distinct POI types in a path is submodular.
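The two path scores are easy to compute directly; the sketch below encodes the toy network of Figure 5.1, where the per-edge costs are our reconstruction from the figure (an assumption), while the keyword sets follow the node labels:

```python
# BS (Equation 5.1) and OS (Equation 5.2) on the toy network of Figure 5.1.
# Edge costs are as reconstructed from the figure (an assumption).
keywords = {
    "vs": {"restaurant"}, "v1": {"bar", "mall"}, "v2": {"coffee", "museum"},
    "v3": {"mall"}, "v4": {"mall", "restaurant"}, "v5": {"mall"},
    "v6": {"park"}, "vt": {"restaurant"},
}
cost = {("vs", "v1"): 1, ("v1", "v3"): 2, ("v3", "v4"): 3, ("v4", "vt"): 4,
        ("vs", "v2"): 3, ("v2", "v3"): 2, ("v3", "v6"): 3, ("v6", "vt"): 2,
        ("v3", "v5"): 2, ("v5", "vt"): 2}

def bs(path):
    # Equation (5.1): sum of edge budget values along the path
    return sum(cost[(u, v)] for u, v in zip(path, path[1:]))

def os_(path):
    # Equation (5.2) with f = distinct-keyword coverage
    return len(set().union(*(keywords[v] for v in path)))

p1 = ["vs", "v1", "v3", "v4", "vt"]
p2 = ["vs", "v2", "v3", "v6", "vt"]
assert bs(p1) == 10 and os_(p1) == 3   # matches the text
assert bs(p2) == 10 and os_(p2) == 5   # P2 covers the most distinct types
```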

Definition 5.4. Constrained Path Search with Submodular Maximization (CPS-SM) Query. Given a directed graph G = (V, E), and a search query Q = ⟨v_s, v_t, ∆⟩, where v_s is the source node, v_t is the target node, and ∆ specifies a budget limit, we aim to find a path P starting from v_s and ending at v_t such that

P = argmax_P OS(P)    (5.3)
subject to BS(P) ≤ ∆

Example 5.2. Figure 5.1 shows a toy road network with locations and their POI categories represented by keywords. The most diversified path query is a special problem of the CPS-SM query. OS(P) is computed as the number of distinct keywords in the path, while BS(P) is the sum of the budget values on all the edges along the path. Given a query Q = ⟨v_s, v_t, ∆ = 10⟩, the optimal solution is P2 = ⟨v_s, v2, v3, v6, v_t⟩, as it contains the most distinct keywords and satisfies the budget constraint.

The optimal route search for keyword coverage (ORS-KC) problem [ZCC+15] is also a special problem of the CPS-SM query, and it is proved to be NP-hard. Next, we prove the NP-hardness of answering the CPS-SM query.

Theorem 5.1. The problem of answering the CPS-SM query is NP-hard.

Proof. We develop the proof by a reduction from the unit-cost version of the budgeted maximum coverage (UBMC) problem [KMN99]. Given a collection of sets S = {S_1, S_2, ..., S_m}, each with a unit cost C, a domain of elements X = {x_1, x_2, ..., x_n} with associated weights {w_1, w_2, ..., w_n}, and a budget L, the UBMC problem aims to find a collection of sets S' ⊆ S whose total cost is smaller than L such that the elements covered by S' have the largest total weight.

Given an instance ψ of the UBMC problem, we can construct an instance η of the CPS-SM query as follows: we use m + 2 nodes to build a graph, in which two additional nodes are the source node v_s and the target node v_t, and the other nodes correspond to the sets in S. Next, we connect each pair of nodes using two edges (in both directions) with a unit cost C. Besides v_s and v_t, each other node v_i corresponds to a set S_i, and its objective score is computed over the weights of the elements it contains. The budget limit ∆ is set to (L + C). With this mapping, if S' is the optimal result of ψ, any route passing by the nodes corresponding to the sets in S' can be extended to the optimal route of η by inserting v_s and v_t into it, and vice versa. Since the UBMC problem has been proved to be NP-hard, solving the CPS-SM query is NP-hard as well. □

5.3 Approximation Algorithm SDD

A simple idea to solve the CPS-SM query is to perform an exhaustive search from the source node v_s to the target node v_t. We call a path from v_s to a node v_i (v_i ≠ v_t) a "partial path". First, for each neighbor of v_s we can obtain a single-edge partial path. Next, from all the partial paths, we select the one with the smallest budget score and expand it to obtain more partial paths. If a newly generated path exceeds the budget limit, we discard it. We repeat this process until no new feasible partial paths can be obtained, and we return the best path on v_t as the result. It is obvious that such a method is computationally prohibitive, because all feasible partial paths on each node have to be kept during the search. As discussed in Section 5.1, the A*-style algorithm [ZCC+15] can solve the CPS-SM query, but it is also time-consuming since it needs to maintain too many partial paths on each node as well.

In order to reduce the number of partial paths enumerated during the search, we develop an approximation algorithm that can prune a lot of partial paths and return results with a guaranteed error bound. Before presenting the algorithm, we first introduce the following important definitions.

Definition 5.5. Node Label. Each node label L_i^k represents a path P_i^k from the source node v_s to a node v_i, which is in the format of ⟨BS, OS, P_i^k⟩, where BS and OS represent the budget score and the objective score of P_i^k, respectively.

The node label is used to store the information of the partial paths, and we store the labels on each node. To reduce the number of labels, we define "submodular α-dominance" and then prune the labels on nodes according to this concept.

Definition 5.6. Submodular α-Dominance. Let L_i^k and L_i^l be two labels representing two different paths from v_s to node v_i. We say L_i^k α-dominates L_i^l in submodularity iff L_i^k.BS ≤ L_i^l.BS and OS(P_i^k) / (OS(P_i^l) − OS(P_i^k ∩ P_i^l)) ≥ 1/(α−1), where α > 1.

With this definition, we are ready to present our algorithm SDD (Submodular α-Dominance based Dijkstra). The basic process of SDD is similar to that of CP-Dijk [TZ09] for constrained shortest path search. We prune all the labels submodular α-dominated by other labels, and extend the non-dominant partial labels in ascending order of their budget scores.

The pseudocode is presented in Algorithm 9.

Algorithm 9: SDD
Input: G = (V, E), Q = ⟨v_s, v_t, ∆⟩
Output: A path P
1  Initialize a min-priority queue Q ← ∅;
2  Initialize a label list L(v_i) for each node v_i ∈ V;
3  Create a label ⟨0, OS(v_s), (v_s)⟩ and insert it into Q;
4  while Q is not empty do
5      L_i^k ← Q.dequeue();
6      if v_i is v_t then insert L_i^k into L(v_i);
7      if DomCheck(L_i^k, L(v_i)) == true then continue;
8      Insert L_i^k into L(v_i), and obtain P_i^k from L_i^k;
9      for each edge (v_i, v_j) do
10         Create a path P_j^l ← P_i^k ∪ v_j;
11         if BS(P_j^l) > ∆ then continue;
12         Create a new label L_j^l ← ⟨BS(P_j^l), OS(P_j^l), (P_j^l)⟩;
13         Q.enqueue(L_j^l);
14 If L(v_t) is ∅, return "No feasible path exists";
15 Else return the path P with the largest OS value in L(v_t);

The algorithm starts by initializing a min-priority queue Q, which stores partial labels in ascending order of their budget scores, and a label list for each node, and inserting the label containing v_s into Q (lines 1-3). Next, it dequeues the candidate labels from Q one by one until Q becomes empty (lines 4-13). In each while-loop, the algorithm inserts the label whose path

connects v_s and v_t into L(v_i) (line 6). For any other label L_i^k, the algorithm uses a function DomCheck(L_i^k, L(v_i)) to check whether L_i^k is dominated by any label in L(v_i) according to Definition 5.6, and then only extends the non-dominant labels (lines 7-13). For a non-dominant label L_i^k, the algorithm obtains P_i^k and creates a new path P_j^l for each outgoing neighbor v_j of v_i (lines 8-10). If BS(P_j^l) ≤ ∆, the algorithm creates a new label and enqueues it into Q (lines 11-13). Finally, we return the path P with the largest OS value in L(v_t) (lines 14-15). Next, we explain why Definition 5.6 enables SDD to have a guaranteed error bound. We have the following lemma:

Lemma 5.1. Let P_it be replaced generically: let P_ij be any path from v_i to v_j, and suppose L_i^k submodular α-dominates L_i^l. Then: OS(P_i^k ∪ P_ij) / OS(P_i^l ∪ P_ij) ≥ 1/α.

Proof. As L_i^k submodular α-dominates L_i^l, we have OS(P_i^k) / (OS(P_i^l) − OS(P_i^k ∩ P_i^l)) ≥ 1/(α−1), so we can get (α−1) OS(P_i^k) ≥ OS(P_i^l) − OS(P_i^k ∩ P_i^l). According to the properties of the submodular function, we have OS((P_i^k ∩ P_i^l) ∪ P_ij) − OS(P_i^k ∩ P_i^l) ≥ OS(P_i^l ∪ P_ij) − OS(P_i^l). Therefore,

OS(P_i^k ∪ P_ij) / OS(P_i^l ∪ P_ij)
≥ OS(P_i^k ∪ P_ij) / (OS((P_i^k ∩ P_i^l) ∪ P_ij) − OS(P_i^k ∩ P_i^l) + OS(P_i^l))
≥ OS(P_i^k ∪ P_ij) / (OS((P_i^k ∩ P_i^l) ∪ P_ij) + (α−1) OS(P_i^k))
≥ 1/α.

The final step holds because both OS((P_i^k ∩ P_i^l) ∪ P_ij) and OS(P_i^k) are not larger than OS(P_i^k ∪ P_ij). □

Based on this lemma, we have the following theorem to show the approximation ratio of SDD.

Theorem 5.2. Algorithm SDD offers a (1/α)^(⌊∆/b_min⌋−1)-approximate answer to the optimal solution, where b_min is the minimum budget value on the edges.

Proof. Let the optimal path be P^opt = ⟨v_s, v_{o1}, ..., v_{o_{c−1}}, v_{o_c}, v_t⟩, and the best path returned by SDD be P^sd.

Assume that P^opt is submodular α-dominated by another partial path P_{o1}^k on node v_{o1} during the search. According to Lemma 5.1, we know that OS(P_{o1}^k ∪ P_{o1 t}) / OS(P^opt) ≥ 1/α, where P_{o1 t} = ⟨v_{o1}, ..., v_{o_{c−1}}, v_{o_c}, v_t⟩. Next, the path P_{o1}^k ∪ P_{o1 t} may be submodular α-dominated by a partial path P_{o2}^k on node v_{o2}. Hence, we have OS(P_{o2}^k ∪ P_{o2 t}) / OS(P_{o1}^k ∪ P_{o1 t}) ≥ 1/α. In the worst case, each partial path that dominates a path on a node v_x is dominated by another path on the subsequent node v_{x+1} in P^opt, which means that for any 0 < x ≤ c, OS(P_{o_{x+1}}^k ∪ P_{o_{x+1} t}) / OS(P_{o_x}^k ∪ P_{o_x t}) ≥ 1/α.

Recall that SDD returns the path with the largest OS score on the destination v_t, i.e., P^sd. Thus, we know that OS(P^sd) ≥ (1/α) OS(P_{o_c}^k ∪ P_{o_c t}) ≥ (1/α)^2 OS(P_{o_{c−1}}^k ∪ P_{o_{c−1} t}) ≥ ... ≥ (1/α)^{c−1} OS(P_{o1}^k ∪ P_{o1 t}) ≥ (1/α)^c OS(P^opt). There are at most c = ⌊∆/b_min⌋ − 1 nodes in P^opt. Finally, we can conclude that OS(P^sd) / OS(P^opt) ≥ (1/α)^(⌊∆/b_min⌋−1). □

Complexity Analysis. We denote the largest number of labels on a node as |L_max|. In the worst case, Algorithm SDD generates |L_max||V| labels, and thus it requires at most |L_max||V| loops. To enqueue and dequeue all of the generated labels with the min-priority queue, Algorithm SDD takes O(|L_max||V| log(|L_max||V|)) time in the worst case. In each loop, it takes at most O(|L_max|) time to do the submodular α-dominance checks. Therefore, in total the worst-case time complexity of SDD is O(|L_max||V| (log(|L_max||V|) + |L_max|)). |L_max| depends on α and ∆: a larger α reduces |L_max| since more labels can be dominated during the search, and a smaller ∆ reduces |L_max| as well since fewer partial paths need to be enumerated.

5.4 Approximation Algorithm OSDD

Using the submodular α-dominance, SDD is able to prune partial paths while guaranteeing an error bound. Can we further improve its efficiency while preserving its accuracy?

From Definition 5.6, it can be observed that partial paths with small objective scores can only dominate very few other partial paths. If we can improve the pruning power of partial paths with small objective scores, we can obtain faster algorithms. Let us revisit Lemma 5.1, and set v_j as the destination v_t. In the proof, if we have L_i^k.BS ≤ L_i^l.BS and OS(P_i^k ∪ P_it) / (OS(P_i^l) − OS(P_i^k ∩ P_i^l)) ≥ 1/(α−1) (where P_it is a path from v_i to the target node v_t), we can still guarantee that OS(P_i^k ∪ P_it) / OS(P_i^l ∪ P_it) ≥ 1/α. This gives us an idea of relaxing the condition of submodular α-dominance to obtain larger pruning power, as OS(P_i^k ∪ P_it) is usually much larger than OS(P_i^k). However, to guarantee the error bound, we need to show that the above condition holds for all possible P_it, which is intractable to check. To fill this gap, we propose a lazy extension optimization to speed up the search. We show that it is not necessary to check the relaxed condition for all possible paths from the current node to the target node. We first quickly obtain some feasible paths from source to target, and then extend some labels in a later stage to avoid unnecessary computation.

We first introduce the following definitions which are used in this optimization.

Definition 5.7. Extended Node Label. For a partial path P_i^k on node v_i, we denote its extended node label by L_i^k, which is in the format ⟨BS, OS, OS_fea, ÔS_fea, P_i^k⟩, where L_i^k.OS_fea and L_i^k.ÔS_fea represent the OS score of a feasible result path from P_i^k to v_t (i.e., OS(P_i^k ∪ P_it)) and its estimated value (i.e., ÔS(P_i^k ∪ P_it)), respectively.
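As a concrete illustration, the extended node label can be represented as a small record. This is our own Python sketch (the field names and sample values are ours, not from the thesis); the sample label mirrors L_3^1 from Example 5.3 below.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ExtendedLabel:
    """Extended node label of Definition 5.7 (illustrative sketch)."""
    bs: float                    # budget score BS(P_i^k)
    os: float                    # objective score OS(P_i^k)
    os_fea: Optional[float]      # OS of a known feasible completion to v_t (phase 2)
    os_fea_hat: Optional[float]  # estimated OS of a feasible completion (phase 1)
    path: Tuple[str, ...]        # the partial path P_i^k itself

# Sample label resembling L_3^1 in Example 5.3.
label = ExtendedLabel(bs=5.0, os=4.0, os_fea=None, os_fea_hat=8.0,
                      path=("vs", "v2", "v3"))
```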

Next, we define two new types of submodular α-dominance as below, given that P_i^k ∪ P_it is a feasible path.

Definition 5.8. Forward-looking Submodular α-Dominance (FS-α-Dominance). L_i^k FS-α-dominates L_i^l iff L_i^k.BS ≤ L_i^l.BS and L_i^k.OS_fea / (OS(P_i^l) − OS(P_i^k ∩ P_i^l)) ≥ 1/(α−1).

Definition 5.9. Potential Submodular α-Dominance (PS-α-Dominance). L_i^k PS-α-dominates L_i^l iff L_i^k.BS ≤ L_i^l.BS and L_i^k.ÔS_fea / (OS(P_i^l) − OS(P_i^k ∩ P_i^l)) ≥ 1/(α−1).

In the extended node label L_i^k, the fields L_i^k.OS, L_i^k.OS_fea and L_i^k.ÔS_fea are used to do the α-dominance check, the FS-α-dominance check, and the PS-α-dominance check, respectively.

We proceed to present the algorithm with the lazy extension optimization, denoted by OSDD (Optimized Submodular α-Dominance based Dijkstra). The algorithm has two phases: the first phase aims to quickly find some feasible solutions (i.e., paths from

Algorithm 10: OSDD

Input: G = (V, E), Q = ⟨vs, vt, ∆⟩
Output: A path P

1 Initialize two min-priority queues Q1 ← ∅ and Q2 ← ∅;
2 Initialize a label list L(vi) for each node vi ∈ V;
3 Create a new extended label ⟨0, OS(vs), OS(vs), OS(vs), (vs)⟩ and insert it into Q1;
4 while Q1 is not empty do
5     L_i^k ← Q1.dequeue();
6     if vi is vt then insert L_i^k into L(vi);
7     if DomCheck(L_i^k, L(vi)) == true then continue;
8     if PSDomCheck(L_i^k, L(vi)) == true then
9         Q2.enqueue(L_i^k);
10    else
11        Insert L_i^k into L(vi), and obtain P_i^k from L_i^k;
12        for each edge (vi, vj) do
13            Create a path P_j^l ← P_i^k ∪ vj;
14            if BS(P_j^l) > ∆ then continue;
15            ÔS_fea = ÔS(P_j^l ∪ P_jt) = OS(P_j^l) · ∆ / BS(P_j^l);
16            Create a new extended label L_j^l ← ⟨BS(P_j^l), OS(P_j^l), null, ÔS_fea, (P_j^l)⟩;
17            Q1.enqueue(L_j^l);
18 Update OS_fea for existing labels according to the labels in L(vt);
19 runPhase2(G, vs, vt, ∆, Q2, L(vi) for all nodes vi ∈ V);
20 if L(vt) is ∅ then return "No feasible route exists";
21 else return P with the largest OS value in L(vt);

vs to vt satisfying the budget constraint) by extending some partial labels based on Definition 5.9, and the second phase searches for the final solutions based on Definition 5.8. The details are shown in Algorithm 10.

OSDD uses two min-priority queues Q1 and Q2 to store partial labels in ascending order of their budget scores in the first and second phases, respectively (line 1). In the first phase, the algorithm first creates label lists for all nodes, generates a new extended label on vs, and inserts it into Q1 (lines 2-3). Next, it dequeues and processes the candidate labels from Q1 one by one until Q1 becomes empty (lines 4-17). Different from SDD,

the algorithm also uses PSDomCheck() to check whether a non-α-dominated label L_i^k is PS-α-dominated by any label in L(vi) before extending it (line 8). If so, the algorithm enqueues it into Q2 for further processing in the second phase. Otherwise, the

k algorithm inserts the label into Li and extends it by generating a new extended label for each outgoing neighbor v j of vi (lines 8-17). We estimate OSˆ f ea for each new label in a simple way. OS f ea is not used in this phase and thus is set to null. (lines 15-16).

After finishing the first phase, the algorithm has some feasible solutions in L(vt), which are used to update the OS_fea values of the stored labels (line 18). With labels carrying accurate OS_fea values, the second phase can use the FS-α-dominance to prune the remaining labels in Q2. As shown in Algorithm 11, this process is similar to that of

SDD, except that it uses FSDomCheck(L_i^k, L(vi)) to check the FS-α-dominance relationship between labels, rather than the α-dominance (line 4). As the new extended labels created in

the second phase do not have an accurate ÔS_fea value, the algorithm sets OS_fea = OS(P_j^l) and ÔS_fea = null for new labels (line 9). OSDD can prune more partial paths during the search, because the feasible paths obtained quickly in the first phase enable us to apply the FS-α-dominance check, which has larger pruning power.
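The estimate on line 15 of Algorithm 10 simply extrapolates the partial path's objective linearly to the full budget. A one-line sketch (our own phrasing of that formula):

```python
def estimate_os_fea(os_partial: float, bs_partial: float, delta: float) -> float:
    """Optimistic estimate ÔS_fea = OS(P_j^l) * ∆ / BS(P_j^l) (Algorithm 10, line 15)."""
    assert 0 < bs_partial <= delta
    return os_partial * delta / bs_partial

# A partial path with OS = 4 and BS = 5 under ∆ = 10 is estimated to reach 8,
# matching the ÔS_fea of label L_3^1 in Example 5.3.
```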

Example 5.3. Consider a CPS-SM query for the most diversified path with a budget limit ∆ = 10 on G in Figure 5.1. We run Algorithm OSDD with α = 1.2 to solve the query. After some steps, OSDD has a label L_3^1 = ⟨BS = 5, OS = 4, OS_fea = 0, ÔS_fea = 8, P_3^1 = ⟨vs, v2, v3⟩⟩ stored on node v3. Next, it dequeues L_3^2 = ⟨BS = 5, OS = 3, OS_fea = 0, ÔS_fea = 6, P_3^2 = ⟨vs, v1, v3⟩⟩ from the min-priority queue Q1. In the submodular α-dominance check, as L_3^1.OS / (OS(P_3^2) − OS(P_3^1 ∩ P_3^2)) = 4 < 5 = 1/(α−1), L_3^2 is not submodular

Algorithm 11: runPhase2(G, vs, vt, ∆, Q2, L(vi) for all nodes vi ∈ V)

1 while Q2 is not empty do
2     L_i^k ← Q2.dequeue();
3     if vi is vt then insert L_i^k into L(vi);
4     if FSDomCheck(L_i^k, L(vi)) == true then continue;
5     Insert L_i^k into L(vi), and obtain P_i^k from L_i^k;
6     for each edge (vi, vj) do
7         Create a path P_j^l ← P_i^k ∪ vj;
8         if BS(P_j^l) > ∆ then continue;
9         Create a new extended label L_j^l ← ⟨BS(P_j^l), OS(P_j^l), OS(P_j^l), null, (P_j^l)⟩;
10        Q2.enqueue(L_j^l);
11 return L(vt);

α-dominated by L_3^1. However, in the PS-α-dominance check, as L_3^1.ÔS_fea / (OS(P_3^2) − OS(P_3^1 ∩ P_3^2)) = 8 > 5 = 1/(α−1), L_3^2 is PS-α-dominated by L_3^1. Thus, the algorithm holds L_3^2 by inserting it into the other min-priority queue Q2. After phase 1, OSDD obtains a feasible path P_t^1 = ⟨vs, v2, v3, v6, vt⟩. In phase 2, it dequeues L_3^2 from Q2, and we can use P_t^1 and L_3^1 to do the FS-α-dominance check for L_3^2. Since L_3^1.OS_fea / (OS(P_3^2) − OS(P_3^1 ∩ P_3^2)) = 5 ≥ 5 = 1/(α−1), L_3^2 is FS-α-dominated by L_3^1. Therefore, OSDD can prune L_3^2 (no need to extend it), and obtains a solution with an error bound at the end of phase 2.

We next show that OSDD has the same approximation ratio as SDD.

Lemma 5.2. Let P_ij be a known path from v_i to v_j, and let P_n be any unknown path from v_i to v_j. If L_i^k FS-α-dominates L_i^l, then either OS(P_i^k ∪ P_n) / OS(P_i^l ∪ P_n) ≥ 1/α or OS(P_i^k ∪ P_ij) / OS(P_i^l ∪ P_n) ≥ 1/α.

Proof. As L_i^k FS-α-dominates L_i^l, we have OS(P_i^k ∪ P_ij) / (OS(P_i^l) − OS(P_i^k ∩ P_i^l)) ≥ 1/(α−1), and hence (α−1)·OS(P_i^k ∪ P_ij) ≥ OS(P_i^l) − OS(P_i^k ∩ P_i^l). According to the properties of submodular functions, we have OS((P_i^k ∩ P_i^l) ∪ P_n) − OS(P_i^k ∩ P_i^l) ≥ OS(P_i^l ∪ P_n) − OS(P_i^l).

If OS(P_i^k ∪ P_ij) ≥ OS(P_i^k ∪ P_n), we have:

OS(P_i^k ∪ P_ij) / OS(P_i^l ∪ P_n)
  ≥ OS(P_i^k ∪ P_ij) / (OS((P_i^k ∩ P_i^l) ∪ P_n) − OS(P_i^k ∩ P_i^l) + OS(P_i^l))
  ≥ OS(P_i^k ∪ P_ij) / (OS((P_i^k ∩ P_i^l) ∪ P_n) + (α−1)·OS(P_i^k ∪ P_ij))
  ≥ OS(P_i^k ∪ P_ij) / (OS(P_i^k ∪ P_n) + (α−1)·OS(P_i^k ∪ P_ij)) ≥ 1/α.

If OS(P_i^k ∪ P_ij) ≤ OS(P_i^k ∪ P_n), we have:

OS(P_i^k ∪ P_n) / OS(P_i^l ∪ P_n)
  ≥ OS(P_i^k ∪ P_n) / (OS((P_i^k ∩ P_i^l) ∪ P_n) − OS(P_i^k ∩ P_i^l) + OS(P_i^l))
  ≥ OS(P_i^k ∪ P_n) / (OS((P_i^k ∩ P_i^l) ∪ P_n) + (α−1)·OS(P_i^k ∪ P_ij))
  ≥ OS(P_i^k ∪ P_n) / (OS(P_i^k ∪ P_n) + (α−1)·OS(P_i^k ∪ P_n)) ≥ 1/α.

We thus complete the proof. □

Lemma 5.2 shows that, if L_i^k FS-α-dominates L_i^l, then whether the two partial paths are extended to v_t in the same way, or L_i^k goes through a known path to v_t while L_i^l goes through an unknown path to v_t, discarding L_i^l can always guarantee an error bound.
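Since Lemma 5.2 relies only on monotonicity and submodularity, it can be checked empirically on a toy coverage objective. The sketch below (node names and keyword sets are randomly generated for illustration) counts violations over random instances; by the lemma, none should occur.

```python
import random

# Toy monotone submodular objective: OS(P) = |union of keyword sets along P|.
random.seed(0)
KW = {v: frozenset(random.sample(range(12), 3)) for v in range(8)}

def os_score(*node_groups):
    cover = set()
    for group in node_groups:
        for v in group:
            cover |= KW[v]
    return len(cover)

alpha = 1.2
violations = 0
for _ in range(200):
    pk = random.sample(range(8), 3)    # partial path P_i^k
    pl = random.sample(range(8), 3)    # partial path P_i^l
    pij = random.sample(range(8), 2)   # known completion P_ij
    pn = random.sample(range(8), 2)    # unknown completion P_n
    overlap = set(pk) & set(pl)
    denom = os_score(pl) - os_score(overlap)
    # FS-alpha-dominance ratio test with OS_fea = OS(P_i^k ∪ P_ij)
    fs_dom = denom <= 0 or os_score(pk, pij) / denom >= 1 / (alpha - 1)
    if fs_dom:
        lhs1 = os_score(pk, pn) / os_score(pl, pn)
        lhs2 = os_score(pk, pij) / os_score(pl, pn)
        if max(lhs1, lhs2) < 1 / alpha - 1e-9:
            violations += 1
```

Every FS-α-dominated instance satisfies at least one of the two ratio bounds, as the lemma claims.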

Theorem 5.3. Algorithm OSDD offers an approximation ratio of (1/α)^(⌊∆/b_min⌋−1), where b_min is the minimum edge budget value.

Proof. The proof is similar to that of Theorem 5.2, because Lemma 5.2 also guarantees that each dominance sacrifices at most a factor of 1/α in accuracy. □

Speeding Up Dominance Check. In Algorithm 10, the submodular α-dominance check terminates when all existing labels have been visited or when an existing label that dominates the new label is found (line 7 of Algorithm 10). To speed up this process, we maintain the existing labels in descending order of their objective values: when a label's objective value is larger, it is more likely to submodular α-dominate other labels. Note that newly generated labels always have larger budget values, and thus we only need to check the objective values.

Dominance Pre-check for New Labels. In addition, we observe that inserting a new label into a min-priority queue Q has a large cost (lines 9 and 17 of Algorithm 10, and line 10 of Algorithm 11). To reduce the insertion operations for new labels with low OS values, for a new label L_j^l we compare it with the label with the maximum objective value in L(v_j) to check whether it can be dominated, before enqueuing it into Q.

For example, assume that α = 1.2 and there exist 3 labels on node v1: L_1^1 with L_1^1.BS = 1 and L_1^1.OS = 1, L_1^2 with L_1^2.BS = 3 and L_1^2.OS = 2, and L_1^3 with L_1^3.BS = 5 and L_1^3.OS = 5. When we generate a new label L_1^4 with L_1^4.BS = 7 and L_1^4.OS = 1, rather than inserting it into Q1, we first use L_1^3 to check whether it submodular α-dominates L_1^4. If so, we can prune L_1^4 directly to save time.
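Both engineering optimizations can be sketched together: keep each node's label list sorted by descending OS so a dominating label is found early, and pre-check a new label against the current maximum-OS label before paying for a priority-queue insertion. This is our own illustrative structure; the `dominates` predicate stands in for the submodular α-dominance test and is supplied by the caller.

```python
import bisect

class LabelList:
    """Per-node label list kept in descending OS order (illustrative sketch)."""

    def __init__(self):
        self._labels = []              # entries: (-OS, (BS, OS, path))

    def insert(self, label):
        # label = (bs, os, path); negated OS keeps the list sorted descending.
        bisect.insort(self._labels, (-label[1], label))

    def max_os_label(self):
        return self._labels[0][1] if self._labels else None

    def precheck_prune(self, new_label, dominates):
        # Pre-check: consult only the maximum-OS label before enqueuing.
        top = self.max_os_label()
        return top is not None and dominates(top, new_label)
```

With the three labels of the example above, only the OS = 5 label is consulted when the new OS = 1 label arrives.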

Complexity Analysis. Let |L_max^1| and |L_max^2| denote the largest numbers of labels on a node in the first and second phases, respectively. The second phase is similar to SDD, and its complexity is O(|L_max^2||V|(log(|L_max^2||V|) + |L_max^2|)). In the first phase, we also need to enqueue and dequeue all generated labels, and to do the submodular α-dominance (or PS-α-dominance) check for those labels in the worst case. The complexity of the first phase is therefore O(|L_max^1||V|(log(|L_max^1||V|) + |L_max^1|)). Thus, the complexity of OSDD is O(|L_max^1||V|(log(|L_max^1||V|) + |L_max^1|) + |L_max^2||V|(log(|L_max^2||V|) + |L_max^2|)). Both |L_max^1| and |L_max^2| are smaller than |Lmax| in SDD, and more importantly the FS-α-dominance helps prune more partial paths, and thus OSDD is more efficient.

5.5 Bidirectional Search Method

From the complexity analysis of Algorithms SDD and OSDD, we can observe that the time complexity mainly depends on the number of generated labels (partial paths) in both algorithms. The more labels generated during the process, the more loops are required and the more time is spent on dequeuing and enqueuing labels. The number of partial paths is affected by ∆: with a larger ∆, more possible partial paths need to be enumerated and maintained. Inspired by the bi-directional search of the Dijkstra algorithm, we employ this idea in our algorithms to reduce the number of generated labels.

The basic idea of this method is to first divide the budget limit ∆ into two parts, ∆s and ∆t. Next, we perform two types of search: a forward search from vs under budget limit ∆s, and a backward search from vt under budget limit ∆t. Hence, in this algorithm, we have two types of labels based on Definition 5.5.

Definition 5.10. Forward Node Label. Each forward node label →L_i^k represents a path →P_i^k from the source node vs to a node vi, which is in the format ⟨BS, OS, →P_i^k⟩, where BS and OS represent the budget score and the objective score of →P_i^k, respectively.

Definition 5.11. Backward Node Label. Each backward node label ←L_i^l represents a path ←P_i^l from a node vi to the target node vt, which is in the format ⟨BS, OS, ←P_i^l⟩, where BS and OS represent the budget score and the objective score of ←P_i^l, respectively.

We still apply the submodular α-dominance to prune labels in both the forward and backward searches. Finally, we can find a path from vs to vt with a high objective score by merging the forward and backward node labels. The improvement comes from the fact that in both searches the budget limit is reduced, and thus the number of labels of both types that needs to be maintained is reduced significantly. Another advantage of this method is that it allows us to estimate upper bounds of the objective scores for some paths, so that we can prune the paths whose OS upper bounds are smaller than the current largest OS value.
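The budget split and the label merge at the heart of the bi-directional search can be sketched as follows (our own illustrative code; here the OS of a joined pair is the additive bound used for pruning, while the true submodular OS of the joined path must be recomputed afterwards, as in Algorithm 14):

```python
def split_budget(delta: float, gamma: float):
    """Split ∆ into (∆s, ∆t) with ∆t = γ∆, as in line 1 of Bi-SDD."""
    delta_t = gamma * delta
    return delta - delta_t, delta_t

def best_join(forward, backward, delta):
    """forward/backward: lists of (BS, OS, path) labels meeting on one node.
    Returns the pair with the largest additive OS bound within budget ∆."""
    best = None
    for fbs, fos, fpath in forward:
        for bbs, bos, bpath in backward:
            if fbs + bbs <= delta:
                # bpath starts at the meeting node, so drop its first node.
                cand = (fbs + bbs, fos + bos, fpath + bpath[1:])
                if best is None or cand[1] > best[1]:
                    best = cand
    return best
```

With the labels of Example 5.4 (forward ⟨5, 3, ⟨vs, v1, v3⟩⟩ against the three backward labels on v3, ∆ = 10), the join picks the backward label with OS = 3, giving the additive bound 6 before the true OS of the joined path is recomputed.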

We first apply the bi-directional search to the SDD algorithm, and we denote this algorithm by Bi-SDD (Bi-directional Submodular α-Dominance based Dijkstra), as shown in Algorithm 12. The method first calls the function computeBackwardLable() to compute the non-dominated backward node labels for each node (line 2). Next, we compute the non-dominated forward node labels utilizing the backward labels and find the final result (lines 3-25). The process is similar to SDD. The difference is that, for a

Algorithm 12: Bi-SDD

Input: G = (V, E), Q = ⟨vs, vt, ∆⟩, γ
Output: A path P

1 ∆t = γ∆, ∆s = ∆ − ∆t, G^T ← the transpose of G;
2 Get backward labels on each node by using computeBackwardLable(G^T, vt, vs, ∆, ∆t);
3 Initialize a min-priority queue Q ← {⟨0, OS(vs), (vs)⟩};
4 Initialize a forward label list →L(vi) for each node vi ∈ V;
5 Initialize OS_max to store the current best path OS score;
6 while Q is not empty do
7     →L_i^k ← Q.dequeue();
8     if vi is vt then insert →L_i^k into →L(vi) and update OS_max;
9     if DomCheck(→L_i^k, →L(vi)) == true then continue;
10    Insert →L_i^k into →L(vi);
11    if →L_i^k.BS ≥ ∆s then
12        ∆rem = ∆ − →L_i^k.BS;
13        From ←L(vi), find ←L_i^k with the maximum OS value satisfying ←L_i^k.BS ≤ ∆rem;
14        OS_ub(→P_i^k) = →L_i^k.OS + ←L_i^k.OS;
15        if OS_ub(→P_i^k) ≤ OS_max then continue;
16        →L_t^l ← joinLabels(→L_i^k, ←L(vi), ∆rem, OS_max);
17        if →L_t^l != null then insert →L_t^l into →L(vt);
18    else
19        Obtain →P_i^k from →L_i^k;
20        for each edge (vi, vj) do
21            Create a path →P_j^l ← →P_i^k ∪ vj;
22            if BS(→P_j^l) > ∆ then continue;
23            Q.enqueue(⟨BS(→P_j^l), OS(→P_j^l), (→P_j^l)⟩);
24 if →L(vt) is ∅ then return "No feasible path exists";
25 else return P with the largest OS value in →L(vt);

non-dominated label →L_i^k whose budget value is not less than ∆s, the method extends →L_i^k directly by utilizing the function joinLabels(), which obtains a feasible path with the maximum OS value by joining →L_i^k with the backward node labels in ←L(vi) (lines 12-17). In this process, before extending →L_i^k, we first compute an upper bound of the OS value for →P_i^k as OS_ub(→P_i^k) = →L_i^k.OS + ←L_i^k.OS, where ←L_i^k is the label with the maximum OS value in ←L(vi) satisfying ←L_i^k.BS ≤ ∆rem. If OS_ub(→P_i^k) ≤ OS_max, we can discard it safely. If the budget score of →L_i^k is smaller than ∆s, we further extend the label to obtain new labels (lines 19-23).

How to compute the backward node labels is shown in Algorithm 13. The process is similar to SDD. We use the transpose of the original graph, and use vt as the source node to find non-dominated partial paths satisfying the budget limit ∆t for each node.

Next, as shown in Algorithm 14, we present the details of the function joinLabels(). It first initializes →L_t^l as null, and then checks the labels ←L_i^m ∈ ←L(vi) one by one in descending order of their OS values, where ←L_i^m denotes a backward node label on vi. For a label ←L_i^m, if the total budget of →P_i^k ∪ ←P_i^m is not larger than ∆, the function creates a new path →P_t^l by joining →P_i^k and ←P_i^m (lines 3-4). If the upper bound of →P_t^l's OS value is not larger

Algorithm 13: computeBackwardLable(G^T, vt, vs, ∆, ∆t)

1 Initialize a min-priority queue Q ← ∅;
2 Initialize a backward label list ←L(vi) for each node vi ∈ V;
3 Create a label ⟨0, OS(vt), (vt)⟩ and insert it into Q;
4 while Q is not empty do
5     ←L_i^k ← Q.dequeue();
6     if DomCheck(←L_i^k, ←L(vi)) == true then continue;
7     Insert ←L_i^k into ←L(vi), and obtain ←P_i^k from ←L_i^k;
8     for each edge (vi, vj) do
9         Create a path ←P_j^l ← ←P_i^k ∪ vj;
10        if BS(←P_j^l) > ∆t or BS(←P_j^l) > ∆ then continue;
11        Q.enqueue(⟨BS(←P_j^l), OS(←P_j^l), (←P_j^l)⟩);
12 return ←L(vi) for all nodes in G^T;

Algorithm 14: joinLabels(→L_i^k, ←L(vi), ∆rem, OS_max)

1 →L_t^l ← null;
2 foreach ←L_i^m ∈ ←L(vi) do
3     if ←L_i^m.BS + →L_i^k.BS ≤ ∆ then
4         →P_t^l ← →P_i^k ∪ ←P_i^m;
5         OS_ub(→P_t^l) = →L_i^k.OS + ←L_i^m.OS;
6         if OS_ub(→P_t^l) ≤ OS_max then break;
7         Compute OS(→P_t^l);
8         if OS(→P_t^l) > OS_max then
9             OS_max = OS(→P_t^l);
10            →L_t^l ← ⟨→L_i^k.BS + ←L_i^m.BS, OS(→P_t^l), (→P_t^l)⟩;
11 return →L_t^l;

than OS_max, the process can be terminated (lines 5-6). Otherwise, the function computes OS(→P_t^l) and BS(→P_t^l), and if a better path is found, it updates OS_max and creates a new label to update →L_t^l using →P_t^l (lines 7-10). Finally, it returns →L_t^l (line 11).
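The early termination of joinLabels is the key detail: because backward labels are scanned in descending OS order, the additive bound →L_i^k.OS + ←L_i^m.OS lets the scan stop as soon as no remaining label can beat OS_max. A hedged Python sketch of Algorithm 14 (the `true_os` callback recomputing the submodular objective of a joined path is assumed given, not implemented here):

```python
def join_labels(fwd, backward, delta, os_max, true_os):
    """fwd: one forward label (BS, OS, path); backward: labels (BS, OS, path)
    on the same node; true_os(path) recomputes the submodular objective."""
    fbs, fos, fpath = fwd
    best = None
    for bbs, bos, bpath in sorted(backward, key=lambda l: -l[1]):
        if fbs + bbs > delta:
            continue
        if fos + bos <= os_max:      # additive upper bound: nothing better follows
            break
        joined = fpath + bpath[1:]   # bpath starts at the meeting node
        os_joined = true_os(joined)
        if os_joined > os_max:
            os_max = os_joined
            best = (fbs + bbs, os_joined, joined)
    return best, os_max
```

Replaying Example 5.4: the first (highest-OS) backward label yields the true OS 5 and updates OS_max; the next label's bound 3 + 2 = 5 ≤ OS_max triggers the break, exactly as described in the text.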

Example 5.4. Consider a CPS-SM query for the most diversified path query again. Assume that a query with budget limit ∆ = 10 is issued on G as shown in Figure 5.1. When we run Bi-SDD with α = 1.2 and γ = 0.5 to answer the query, we first obtain a backward label list ←L(v3) on node v3 using computeBackwardLable(G^T, vt, vs, ∆, ∆t). ←L(v3) contains 3 labels: ←L_3^1 = ⟨BS = 5, OS = 3, ←P_3^1 = ⟨v3, v6, vt⟩⟩, ←L_3^2 = ⟨BS = 4, OS = 2, ←P_3^2 = ⟨v3, v5, vt⟩⟩, and ←L_3^3 = ⟨BS = 5, OS = 2, ←P_3^3 = ⟨v3, v4, vt⟩⟩. For the forward label →L_3^1 = ⟨BS = 5, OS = 3, →P_3^1 = ⟨vs, v1, v3⟩⟩, as →L_3^1.BS ≥ ∆s = 5, Bi-SDD extends →L_3^1 by joining it with the labels in ←L(v3). First, →L_t^1 = ⟨BS = 10, OS = 5, P_t^1 = ⟨vs, v1, v3, v6, vt⟩⟩ is found by joining →L_3^1 with ←L_3^1, and OS_max is updated to 5. After that, the algorithm checks ←L_3^2. Since OS_ub(←P_3^2) = 5 ≤ OS_max, i.e., the upper bound of the OS value is not larger than OS_max, the algorithm stops and returns →L_t^1 as the result of extending →L_3^1 to vt. Similarly, Bi-SDD can extend →L_3^2 = ⟨BS = 5, OS = 4, →P_3^2 = ⟨vs, v2, v3⟩⟩ to vt using ←L(v3) directly. In this example, to extend →L_3^1 and →L_3^2 to vt, Bi-SDD generates 6 labels in the initialization (line 2 of Algorithm 12), whereas SDD needs to generate 12 labels, all of which are longer than the labels generated by Bi-SDD. This example illustrates that Bi-SDD is more efficient than SDD in practice.

We analyze the error bound of Algorithm Bi-SDD in the following theorem.

Theorem 5.4. The approximation ratio of Algorithm Bi-SDD is (1/α)^⌊∆/b_min⌋ for the CPS-SM query.

Proof. We denote the optimal path by P^opt = ⟨vs, ..., vi, ..., vt⟩. We use P_s^opt to represent the path ⟨vs, ..., vi⟩, and P_t^opt to represent the path ⟨vi, ..., vt⟩. We prove this theorem based on Lemma 5.1, which shows that after each α-dominance, we lose at most a factor of 1/α in accuracy. Hence, similar to the proof of Theorem 5.2, if P_s^opt is dominated by other paths on vi, even in the worst case we must have a forward partial path →P_i^k such that, for any path P_it from vi to vt, OS(→P_i^k ∪ P_it) ≥ (1/α)^⌊∆s/b_min⌋ · OS(P_s^opt ∪ P_it).

Next, let us consider the backward node labels on vi. If P_t^opt is dominated on vi, even in the worst case, we must have a backward partial path ←P_i^m such that OS(←P_i^m ∪ P_si) ≥ (1/α)^⌊∆t/b_min⌋ · OS(P_t^opt ∪ P_si), where P_si is any path from vs to vi.

As P_si could be →P_i^k, we get OS(→P_i^k ∪ ←P_i^m) ≥ (1/α)^⌊∆t/b_min⌋ · OS(P_t^opt ∪ →P_i^k). Next, as P_it could be P_t^opt, we can further get OS(→P_i^k ∪ ←P_i^m) ≥ (1/α)^⌊∆t/b_min⌋ · (1/α)^⌊∆s/b_min⌋ · OS(P_s^opt ∪ P_t^opt) ≥ (1/α)^⌊∆/b_min⌋ · OS(P^opt).

←P_i^m ∪ →P_i^k is just one possible combination in our algorithm Bi-SDD, and thus the final result of Bi-SDD must be no worse than (1/α)^⌊∆/b_min⌋ · OS(P^opt), which completes the proof. □

Complexity Analysis. We use |L_max^f| and |L_max^b| to denote the largest numbers of labels on a node in the forward and backward node label lists, respectively. computeBackwardLable is similar to SDD, and its complexity is O(|L_max^b||V|(log(|L_max^b||V|) + |L_max^b|)). For the forward search, in each loop we also need to run joinLabels, which takes O(|L_max^b|) time in the worst case. Thus, the complexity of the forward search is O(|L_max^f||V|(log(|L_max^f||V|) + |L_max^f| + |L_max^b|)). |L_max^f| and |L_max^b| depend on ∆s and ∆t respectively, and thus are much smaller than |Lmax| in SDD.

Algorithm Bi-OSDD. We can also apply the bi-directional search to the OSDD algorithm, and we call this method Bi-OSDD (Bi-directional Optimized Submodular α-Dominance based Dijkstra). Most steps are the same as those of Bi-SDD. The only difference is in the first phase. As the first phase of OSDD aims to find some feasible solutions quickly, the bi-directional search method cannot use OS_max to prune paths in the function joinLabels(). Thus, it always returns the label with the maximum OS value obtained by joining →L_i^k with the labels in ←L(vi). It is straightforward to prove that Bi-OSDD can also achieve an approximation ratio of (1/α)^⌊∆/b_min⌋ for answering the CPS-SM query based on the proof of Theorem 5.4. This algorithm benefits from both the FS-α-dominance and the bi-directional search, and thus it has better performance than SDD, OSDD, and Bi-SDD, as verified in our experiments in Section 5.7.

5.6 A Polynomial Heuristic Algorithm CBFS

Although Algorithm 9 and Algorithm 10 can answer the CPS-SM query with a guaranteed error bound, they are not polynomial-time algorithms. Here we propose a polynomial-time heuristic algorithm based on submodular α-dominance.

This algorithm utilizes a cost-effectiveness-first search to find a path with a large objective value quickly, and we denote this algorithm by CBFS. In CBFS, the Cost-Effective value (CE) is computed as CE(P_j^l) = OS(P_j^l)/BS(P_j^l). Most steps of CBFS are similar to those of Algorithm 9, but it initializes a min-priority queue Q which stores partial labels in ascending order of their CE values in line 1 of Algorithm 9. In addition, rather than storing all non-dominated labels on each node, CBFS stores at most β labels on each node; it prunes the label L_i^k when the number of labels in L(vi) is equal to β (between lines 6 and 7 of Algorithm 9). Thus, CBFS cannot maintain the error bound as SDD does.

Next, we analyze the complexity of CBFS. Since there exist at most β labels on each node, CBFS generates at most β|V| labels. Dequeuing and enqueuing β|V| labels requires O(β|V| log(β|V|)) time. CBFS needs β|V| loops in the worst case, and thus it takes O(β²|V|) time to do the submodular α-dominance checks. Therefore, the worst-case time complexity of CBFS is O(β|V|(log(β|V|) + β)).
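To make the β-bounded search concrete, here is a minimal sketch on a tiny hand-made graph. This is our own simplification, not the thesis implementation: the objective is additive rather than submodular, the α-dominance pruning is omitted, and the frontier is ordered by CE = OS/BS.

```python
import heapq

def cbfs(edges, node_os, src, dst, delta, beta):
    """edges: {u: [(v, cost)]}; node_os: per-node gain (additive stand-in for OS).
    Keeps at most beta labels per node; returns (BS, OS, path) of the best path."""
    labels = {u: 0 for u in edges}          # count of labels kept per node
    best = None
    # Heap ordered by -CE so the most cost-effective partial path pops first.
    heap = [(-float("inf"), 0.0, node_os[src], (src,))]
    while heap:
        _, bs, osv, path = heapq.heappop(heap)
        u = path[-1]
        if u == dst and (best is None or osv > best[1]):
            best = (bs, osv, path)
        if labels[u] >= beta:               # beta-bounded label list: prune
            continue
        labels[u] += 1
        for v, c in edges.get(u, []):
            nbs = bs + c
            if nbs > delta or v in path:    # budget check; simple cycle guard
                continue
            nos = osv + node_os[v]
            heapq.heappush(heap, (-(nos / nbs), nbs, nos, path + (v,)))
    return best
```

On a 4-node example with ∆ = 10 and β = 2, the search keeps the costlier but higher-gain route s → b → t over s → a → t.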

5.7 Experimental Study

5.7.1 Experiment settings

We compare the proposed algorithms SDD, Bi-SDD, OSDD, Bi-OSDD and CBFS with two state-of-the-art algorithms, Weighted A* (WA*) [ZCC+15] and the Generalized Cost-Benefit algorithm (GCB) [ZV16], using five real and synthetic datasets on two applications. We set ε = 0.2 for WA* according to the experimental results in [ZCC+15], and use a log n-approximation algorithm [FGM82] for the asymmetric TSP problem to compute the cost function for GCB, as computing the cost function for GCB (i.e., finding the shortest path visiting a given set of nodes) is NP-hard [dA16]. We adopt the H2H index [OQC+18] to compute the shortest distance between two nodes efficiently, as all of the proposed algorithms need it for some optimizations (e.g., the Pruning Optimization [ZCC+15] can be used in WA* and our algorithms, as it can safely prune a path satisfying BS(P_j^l) + dist(v_j, v_t) > ∆, where dist(v_j, v_t) is the shortest distance between v_j and v_t). We implement all the algorithms in JAVA on Windows 10, and run them on a desktop with an Intel i7-9800X 3.8GHz CPU and 64 GB memory.

5.7.2 Application 1: Users’ Preferences Coverage

The study [ZCC+15] proposes to find an optimal route from a source node to a target node, such that the coverage of the users' preferences (represented by keywords) is optimized and the budget score satisfies a given constraint. The users' preferences coverage function is submodular; it is computed as KC(P) = Σ_{κq∈K} λ_{κq} · cov_{κq}(P), where cov_{κq}(vi) is the degree to which the location vi covers the keyword κq. Thus, this problem is a special type of the CPS-SM query (cov_{κq}(vi) is computed from users' check-in records; please refer to [ZCC+15] for details).
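A hedged sketch of this objective: we take cov_{κq}(P) as the best coverage degree of the keyword along the path, a common "max" form that makes KC monotone submodular (the exact aggregation in [ZCC+15] may differ; the locations, degrees and weights below are invented for illustration).

```python
def kc(path, cov, lam):
    """KC(P) = sum over keywords kq of lam[kq] * max coverage of kq along P.
    cov: {node: {keyword: degree in [0, 1]}}; lam: {keyword: weight}."""
    return sum(w * max((cov[v].get(kq, 0.0) for v in path), default=0.0)
               for kq, w in lam.items())

# Illustrative check-in-derived coverage degrees (made up).
cov = {"v1": {"cafe": 0.8}, "v2": {"cafe": 0.3, "park": 0.6}}
lam = {"cafe": 1.0, "park": 0.5}
```

Adding v2 to the path (v1,) gains only the park term (0.3), which is less than v2's standalone value (0.6): the diminishing-returns behavior that makes KC submodular.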

Following [ZCC+15], in this set of experiments we use two real-world datasets SG and AS, where the SG dataset contains 189,306 check-ins made by 2,321 users at 5,412 locations in Singapore, and the AS dataset has 201,525 check-ins made by 4,630 users at 6,176 locations in Austin. We add an edge between two locations that were visited consecutively within 1 day by the same user, and use the Euclidean distance as the budget value for each edge.

To study the scalability of our proposed algorithms, we use a synthetic dataset in which the graph is the road network of Florida (FL) from DIMACS¹. The dataset contains 1,070,376 nodes and 2,712,798 arcs. The degree to which each location covers a keyword is generated randomly. For each experiment, we generate 100 queries randomly and report the average results.

Effect of Parameter α.

In all proposed algorithms, the parameter α balances the runtime and the solution quality.

To study the effect of α with ∆ = 50 kilometers on processing the CPS-SM query, we run

¹http://users.diag.uniroma1.it/challenge9/download.shtml


Figure 5.2: Effect of parameter α with ∆ = 50 kilometer.

SDD and OSDD on the SG and AS datasets and get the result as shown in Figure 5.2.

As the problem is NP-hard and we do not have an exact algorithm, we obtain an approximate reference result by running OSDD with α = 1.001, and we compute the "relative" accuracy of the results under different parameter values by comparing against it. We can see that as α increases, the efficiency improves (shorter runtime) while the accuracy drops. We set α = 1.1 for both algorithms, since the query processing time is short and about 95% accuracy can be achieved. We use α = 1.1 as the default setting for the other proposed algorithms in this application.

Effect of Parameter β.

The value of β controls the maximal number of labels stored on each node in CBFS.

When β is larger, more labels are generated and maintained on nodes, and thus the solution


Figure 5.3: Effect of parameter β with ∆ = 50 kilometers for CBFS in application 1.


Figure 5.4: Effect of parameter γ with α = 1.1 and ∆ = 50 kilometers on FL ((a) efficiency, (b) solution quality).

quality is better, while the runtime is longer. To tune β, we run CBFS to process the query with ∆ = 50 kilometers on both datasets. Figure 5.3 validates this analysis, and we set β = 10 for CBFS in the experiments of this application.

Effect of Parameter γ.

The value of parameter γ adjusts the search ranges of the two search directions in Bi-SDD and Bi-OSDD. We vary γ from 0.2 to 0.6 and run the two algorithms on the FL dataset.

Figure 5.4 shows that the runtime of both algorithms first drops and then increases slightly. The reason is that when γ is small, the forward search still has to generate and maintain a lot of labels; on the other hand, when γ is large, the initialization for obtaining the backward node labels has a very high cost. γ = 0.3 is used for Bi-SDD and Bi-OSDD in the subsequent experiments in this application.

Figure 5.5: Comparison with baselines varying ∆ on SG and AS.

Performance comparison.

We compare 3 of the proposed algorithms, SDD, OSDD and CBFS, with WA* and GCB on both efficiency and accuracy. As WA* is extremely slow, we conduct the experiments by varying ∆ from 20 km to 25 km using SG and AS. Figure 5.5 shows that WA* is very time-consuming when ∆ is large, as it stores too many partial paths. The solution quality of GCB is the worst, as GCB relaxes the budget limit and always returns a locally optimal value. Our proposed algorithms can get paths of high quality and have shorter runtimes. It can also be observed that among our three algorithms, SDD and OSDD have better accuracy than CBFS, but are slower than CBFS.

It is hard to show the performance differences of our algorithms on small datasets. We conduct another experiment and run the five proposed algorithms with larger ∆ on the FL dataset.

Figure 5.6: Comparison of proposed algorithms varying ∆ ((a) efficiency, (b) solution quality).

As shown in Figure 5.6, OSDD is at least 2 times faster than SDD, and the two have similar accuracy. For SDD and OSDD, the bi-directional search method improves efficiency. Surprisingly, the solution quality is also improved slightly, although the theoretical error bound is worse. The reason might be that the bi-directional search reduces the number of generated labels and the length of the stored partial paths in both algorithms. Note that the pruning power of submodular α-dominance depends on the OS value of the label. Thus, if the partial paths are shorter, their OS values are smaller and the pruning power is weaker, which leads to better solution quality. Moreover, OSDD, Bi-OSDD and CBFS scale well when varying ∆.

5.7.3 Application 2: Most Diversified Path Query

The second application is to find the most diversified path (MDP) given a source node, a target node, and a budget constraint. MDP considers a set of locations in road networks, each associated with a set of keywords, e.g., "mall" and "bar". The diversity of a path is measured by the number of distinct keywords covered by the path, and thus the objective function is OS(P) = |∪_{vi∈P} K(vi)|, where K(vi) is the set of keywords on vi.
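This objective is a plain coverage count and takes one line to implustrate in code; the keyword sets below are invented for illustration.

```python
def diversity(path, keywords):
    """OS(P) = |union of K(v_i) for v_i in P|: distinct keywords covered."""
    return len(set().union(*(keywords[v] for v in path)))

keywords = {"v1": {"mall"}, "v2": {"mall", "bar"}, "v3": {"bar"}}
```

Adding v3 after v2 contributes nothing (its keyword "bar" is already covered), while v3 alone is worth 1: the diminishing returns that make OS monotone submodular.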

Figure 5.7: Effect of parameter α with ∆ = 50 kilometers (run time and relative accuracy of SDD and OSDD; (a), (b) NY; (c), (d) CN).

We use two real road networks, New York City (NY) and California & Nevada (CN), from DIMACS, where the NY dataset contains 264,346 nodes and 733,846 arcs and the CN dataset contains 1,890,815 nodes and 4,657,742 arcs. We adopt the Foursquare APIs (https://developer.foursquare.com/docs/places-api/) to obtain the categories of locations (keywords) for the NY dataset, and randomly generate keywords for the CN dataset. We also generate 100 queries randomly and report the average results for each experiment.

Effect of Parameter α.

We vary α from 1.1 to 1.3 and run SDD and OSDD with ∆ = 50 kilometers on both datasets to answer the MDP query. Since this application is more complicated and the datasets are larger than in the previous application, we obtain an approximation result by running OSDD with α = 1.05 and use it to compute the relative accuracy of the results. The results are shown in Figure 5.7. When α = 1.2, both algorithms can obtain solutions with 90% relative accuracy in a short time, and thus we use α = 1.2 as the default setting for all our algorithms in this application.
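The relative accuracy reported in these experiments is simply the ratio of an algorithm's objective score to the baseline score (here, the score of the OSDD run with α = 1.05); a trivial sketch with hypothetical OS values:

```python
def relative_accuracy(os_alg: float, os_baseline: float) -> float:
    # Ratio of an algorithm's objective score to the near-exact
    # baseline score; 1.0 means the algorithm matched the baseline.
    return os_alg / os_baseline

# Hypothetical objective scores for one query:
print(relative_accuracy(os_alg=45.0, os_baseline=50.0))  # 0.9
```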

Effect of Parameter γ.

We vary γ from 0.2 to 0.6 and run Bi-SDD and Bi-OSDD with ∆ = 50 kilometers on NY and CN to process the MDP query. Figure 5.8 shows the results, and the trend is similar to that in the first application. We set γ = 0.4 for Bi-SDD and Bi-OSDD in the subsequent experiments.

Figure 5.8: Effect of parameter γ with α = 1.2, ∆ = 50 kilometers (run time and relative accuracy of Bi-SDD and Bi-OSDD; (a), (b) NY; (c), (d) CN).

Figure 5.9: Effect of parameter β with ∆ = 50 kilometers for CBFS ((a) run time; (b) precision, on NY and CN).

Effect of Parameter β.

We tune β by running CBFS to answer the query with ∆ = 50 kilometers on both datasets. The results are shown in Figure 5.9, and based on them we set β = 10 for CBFS in the experiments of this application.

Performance Comparison.

To compare the performance of our algorithms and GCB in this application, we run them on NY and CN by varying ∆. Figure 5.10 demonstrates that the solution quality of GCB is poor, while our algorithms achieve much higher accuracy. The bi-directional search speeds up SDD and OSDD significantly in this application, as it eliminates many labels in both algorithms. Meanwhile, CBFS returns high-quality results in practice (90% of the Bi-OSDD result) in a short time (less than 500ms, even faster than GCB). Thus, on larger datasets and larger ∆, CBFS is a good choice.

Figure 5.10: Comparison of methods (SDD, OSDD, Bi-SDD, Bi-OSDD, CBFS, and GCB) varying ∆ (run time and objective score; (a), (b) NY; (c), (d) CN).

5.8 Conclusion

In this chapter, we study the CPS-SM query, which is to find the optimal path from a source node to a target node such that its submodular function score is maximized under a given budget constraint. Solving the query is proved to be NP-hard. To answer the query efficiently, we propose an approximation algorithm with a guaranteed error bound based on a concept called "submodular α-dominance", which utilizes the properties of the submodular function. By relaxing the submodular α-dominance conditions, we develop another algorithm with the same approximation ratio that is more efficient. We also present a bi-directional search method to speed up the proposed algorithms, and devise a heuristic algorithm with polynomial runtime. The experimental study on two applications and five datasets demonstrates the efficiency and the accuracy of the proposed algorithms.

Chapter 6

Conclusions and Future Work

In this chapter, we conclude the entire thesis in Section 6.1, and present several possible directions for future work in Section 6.2.

6.1 Conclusions

Due to the prominence of location-based services, problems of submodular optimization in location-based social networks have become important in real applications. However, existing methods cannot find high-quality solutions for these problems efficiently, due to the special constraints (e.g., spatial, routing, and connectivity constraints) in location-based social networks. In this thesis, we study three important problems in submodular optimization in location-based social networks, and propose exact algorithms and efficient approximation algorithms for them. The details are as follows.

Exact and Approximate Approaches for MaxBRkNN. MaxBRkNN is a useful problem in smart cities. It aims to find an optimal region to place a new server point s such that the size of the bichromatic reverse k nearest neighbor (BRkNN) query result of s is maximized. Existing approaches are time-consuming on large datasets, as they have to spend much time computing accurate kNNs for all client points and calculating the region influence values. To solve the problem on large datasets and in location-based social networks, we first propose a two-level hybrid quadtree-grid index QGI that can be used to estimate kNNs quickly, and then present an exact and efficient algorithm GkE that reduces the computation of exact kNNs on many client points by estimating the upper bound of kNLCs' influence. We also present a maximal-arc optimization to reduce the influence computation cost, and develop approximation algorithms PIC and PRIC to further improve the efficiency of GkE. The experimental results show that GkE is 3–5 times faster than two state-of-the-art methods, and that our approximation algorithms not only outperform these state-of-the-art methods by one order of magnitude, but also have high accuracy.
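For intuition, the influence value that MaxBRkNN maximizes can be computed by brute force as below (toy coordinates and a hypothetical function name); the QGI index and GkE exist precisely to avoid these per-client exact kNN computations on large datasets:

```python
import math

def brknn_influence(candidate, servers, clients, k):
    """Brute-force size of BRkNN(candidate): the clients that would
    adopt the new server, i.e. those strictly closer to it than to
    their current k-th nearest existing server."""
    count = 0
    for c in clients:
        # distance to the k-th nearest existing server (exact kNN)
        kth = sorted(math.dist(c, s) for s in servers)[k - 1]
        if math.dist(c, candidate) < kth:
            count += 1
    return count

servers = [(0.0, 0.0), (4.0, 0.0)]
clients = [(1.2, 0.0), (2.8, 0.0), (9.0, 0.0)]
# The first two clients are closer to (2, 0) than to their nearest server:
print(brknn_influence((2.0, 0.0), servers, clients, k=1))  # 2
```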

Optimal Region Search with Submodular Maximization. Region search is an important problem in location-based services. We study the ORS-SM problem, which aims to find the optimal region such that the objective score of the region, computed by a submodular function, is maximized under a given budget score constraint. Compared to other region search problems, the ORS-SM problem is more general and more complex, because it defines a region as a connected subgraph and uses a submodular function to compute the objective score of a region. We develop two approximation algorithms to solve the ORS-SM problem efficiently, prove their approximation ratios, and propose several optimizations to improve the developed algorithms. We also demonstrate the efficiency and the solution quality of our proposed algorithms by conducting experiments on two applications using three real-world datasets.
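The flavor of budgeted submodular maximization over a connected subgraph can be conveyed by a simple cost-benefit greedy sketch (hypothetical helper names and toy data; this is not one of the thesis's ORS-SM algorithms and carries no approximation guarantee):

```python
def greedy_region(graph, costs, gain, seed, budget):
    """Grow a connected region from `seed`: repeatedly add the frontier
    node with the best marginal gain per unit cost that fits the budget.
    A simplified sketch only; the ORS-SM algorithms are more involved
    and come with proven approximation ratios."""
    region, spent = {seed}, costs[seed]
    while True:
        frontier = {v for u in region for v in graph[u]
                    if v not in region and spent + costs[v] <= budget}
        if not frontier:
            return region
        best = max(frontier, key=lambda v: gain(region, v) / costs[v])
        if gain(region, best) <= 0:
            return region
        region.add(best)
        spent += costs[best]

# Hypothetical toy instance with keyword coverage as the objective:
graph = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
K = {1: {"a"}, 2: {"a", "b"}, 3: {"c"}, 4: {"d"}}
costs = {1: 1, 2: 1, 3: 2, 4: 1}

def cover(R):
    return len(set().union(*(K[v] for v in R)))

def gain(R, v):
    return cover(R | {v}) - cover(R)

print(sorted(greedy_region(graph, costs, gain, seed=1, budget=4)))  # [1, 2, 3]
```

Node 4 is reachable but unaffordable once the budget is spent, so the region stays connected within the budget; a submodular `gain` is what makes this cost-benefit ratio meaningful.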

Constrained Path Search with Submodular Maximization. Path search is an essential task in many applications of location-based services, such as navigation and tourist trip planning. We study the CPS-SM query, which is to find the optimal path from a source node to a target node such that its submodular function score is maximized under a given budget constraint. Solving the CPS-SM query is proved to be NP-hard. To this end, we first propose a concept called "submodular α-dominance" that utilizes the properties of the submodular function. Based on this concept, we design two approximation algorithms with error bounds to answer the CPS-SM query efficiently. Meanwhile, we present a bi-directional search method to speed up the designed algorithms, and devise a heuristic algorithm with polynomial runtime. The experimental study on two applications and five datasets shows the efficiency and the effectiveness of our proposed algorithms.

6.2 Future Work

For future work, we present several promising directions below.

Constructing Efficient Indexes. For the ORS-SM problem and the CPS-SM query, when the objective function is a simple aggregate sum function, there exist efficient indexes to speed up the solutions [CCJY14, WXYL16]. However, these indexes cannot guarantee any error bounds for the solutions when the objective function is submodular, due to the submodular properties. In the future, we will exploit the properties of submodular functions and the location information to construct efficient indexes that can guarantee error bounds for these two problems.

Developing Parallel Algorithms with Approximation Ratios. Recently, several parallel algorithms with error bounds have been proposed to solve submodular maximization problems efficiently [EN19, FMZ19, LLV20, BBS20]. This indicates that developing parallel algorithms is a promising way to solve the studied problems on large-scale data. Next, we will design parallel algorithms with approximation ratios for the studied problems.

Investigating Dynamic Problem Settings. In this thesis, our solutions only consider static settings for the three studied problems, and they cannot work well when the problem settings are dynamic (e.g., dynamic cost constraints and time-dependent graphs). Dynamic cost constraints [RNNF19] mean that the budget constraint changes over time, which often happens in real applications. Time-dependent graphs [YLW+19], where each edge is associated with a time-dependent weight function modeling time-dependent speed, typically formalize real-world road networks. In the future, we will study the three problems under dynamic cost constraints and on time-dependent graphs.

Bibliography

[ALS20] Brian Axelrod, Yang P. Liu, and Aaron Sidford. Near-optimal approximate discrete and continuous submodular function minimization. In Proceedings of the Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 837–853. SIAM, 2020.

[AMOT90] Ravindra K Ahuja, Kurt Mehlhorn, James Orlin, and Robert E Tarjan.

Faster algorithms for the shortest path problem. Journal of the ACM,

37(2):213–223, 1990.

[BBS20] Adam Breuer, Eric Balkanski, and Yaron Singer. The FAST algorithm for submodular maximization. In Proceedings of the 37th International Conference on Machine Learning (ICML), 2020.

[BFG19] Niv Buchbinder, Moran Feldman, and Mohit Garg. Deterministic (1/2+ε)-approximation for submodular maximization over a matroid. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 241–254, 2019.

[BS20] Eric Balkanski and Yaron Singer. A lower bound for parallel submodular minimization. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 130–139, 2020.


[CCCX12] Xin Cao, Lisi Chen, Gao Cong, and Xiaokui Xiao. Keyword-aware optimal route search. Proceedings of the VLDB Endowment, 5(11):1136–1147, 2012.

[CCJ10] Xin Cao, Gao Cong, and Christian S. Jensen. Retrieving top-k prestige-based relevant spatial web objects. Proceedings of the VLDB Endowment, 3(1-2):373–384, 2010.

[CCJO11] Xin Cao, Gao Cong, Christian S. Jensen, and Beng Chin Ooi. Collective spatial keyword querying. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD), pages 373–384, 2011.

[CCJY14] Xin Cao, Gao Cong, Christian S. Jensen, and Man Lung Yiu. Retrieving regions of interest for user exploration. Proceedings of the VLDB Endowment, 7(9):733–744, 2014.

[CCSC16] Farhana M Choudhury, J Shane Culpepper, Timos Sellis, and Xin Cao.

Maximizing bichromatic reverse spatial and textual k nearest neighbor

queries. Proceedings of the VLDB Endowment, 9(6):456–467, 2016.

[CCT12] Dong-Wan Choi, Chin-Wan Chung, and Yufei Tao. A scalable algorithm

for maximizing range sum in spatial databases. Proceedings of the VLDB

Endowment, 5(11):1088–1099, 2012.

[CDL+05] Sergio Cabello, José Miguel Díaz-Báñez, Stefan Langerman, Carlos Seara, and Inmaculada Ventura. Reverse facility location problems. In Proceedings of the 17th Canadian Conference on Computational Geometry (CCCG), pages 68–71, 2005.

[CHW+15] Jian Chen, Jin Huang, Zeyi Wen, Zhen He, Kerry Taylor, and Rui Zhang.

Analysis and evaluation of the top-k most influential location selection

query. Knowledge and Information Systems, 43(1):181–217, 2015.

[CJW09] Gao Cong, Christian S Jensen, and Dingming Wu. Efficient retrieval of

the top-k most relevant spatial web objects. Proceedings of the VLDB

Endowment, 2(1):337–348, 2009.

[CLW+14] Zitong Chen, Yubao Liu, Raymond Chi-Wing Wong, Jiamin Xiong, Ganglin Mai, and Cheng Long. Efficient algorithms for optimal location queries in road networks. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD), pages 123–134, 2014.

[CLW+15] Zitong Chen, Yubao Liu, Raymond Chi-Wing Wong, Jiamin Xiong, Ganglin Mai, and Cheng Long. Optimal location queries in road networks. ACM Transactions on Database Systems, 40(3):1–41, 2015.

[CP05] Chandra Chekuri and Martin Pal. A recursive greedy algorithm for walks

in directed graphs. In 46th Annual IEEE Symposium on Foundations of

Computer Science (FOCS), pages 245–253. IEEE, 2005.

[CQ19] Chandra Chekuri and Kent Quanrud. Parallelizing greedy for submodular

set function maximization in matroids and beyond. In Proceedings of the

51st Annual ACM SIGACT Symposium on Theory of Computing (STOC),

pages 78–89, 2019.

[CR04] Michelangelo Conforti and Romeo Rizzi. Combinatorial optimization-polyhedra and efficiency: A book review. 4OR: A Quarterly Journal of Operations Research, 2(2):153–159, 2004.

[Cra19] Victoria G Crawford. An efficient evolutionary algorithm for minimum

cost submodular cover. In Proceedings of the 28th International Joint

Conference on Artificial Intelligence (IJCAI), pages 1227–1233, 2019.

[CSC19] Ye-In Chang, Jun-Hong Shen, and Che-Min Chu. An adjustable-tree method for processing reverse nearest neighbor moving queries. In International Conference on Frontier Computing, pages 362–371. Springer, 2019.

[CWW10] Wei Chen, Chi Wang, and Yajun Wang. Scalable influence maximization for prevalent viral marketing in large-scale social networks. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 1029–1038, 2010.

[CZC+15] Xuefeng Chen, Yifeng Zeng, Gao Cong, Shengchao Qin, Yanping Xiang, and Yuanshun Dai. On information coverage for location category based point-of-interest recommendation. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI), pages 323–329, 2015.

[D+59] Edsger W Dijkstra et al. A note on two problems in connexion with

graphs. Numerische mathematik, 1(1):269–271, 1959.

[dA16] Rafael Castro de Andrade. New formulations for the elementary shortest-path problem visiting a given set of nodes. European Journal of Operational Research, 254(3):755–768, 2016.

[DK11] Abhimanyu Das and David Kempe. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 1057–1064, 2011.

[EN19] Alina Ene and Huy L. Nguyen. Submodular maximization with nearly-optimal approximation and adaptivity in nearly-linear time. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 274–282. SIAM, 2019.

[FCB+16] Kaiyu Feng, Gao Cong, Sourav S. Bhowmick, Wen-Chih Peng, and Chunyan Miao. Towards best region search for data exploration. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD), pages 1055–1070, 2016.

[FCJG19] Kaiyu Feng, Gao Cong, Christian S Jensen, and Tao Guo. Finding

attribute-aware similar regions for data analysis. Proceedings of the VLDB

Endowment, 12(11):1414–1426, 2019.

[FGC+19] Kaiyu Feng, Tao Guo, Gao Cong, Sourav S Bhowmick, and Shuai Ma.

Surge: Continuous detection of bursty regions over a stream of spatial

objects. IEEE Transactions on Knowledge and Data Engineering, 2019.

[FGM82] Alan M Frieze, Giulia Galbiati, and Francesco Maffioli. On the worst-case

performance of some algorithms for the asymmetric traveling salesman

problem. Networks, 12(1):23–39, 1982.

[FLL+20] Zijin Feng, Tiantian Liu, Huan Li, Hua Lu, Lidan Shou, and Jianliang Xu. Indoor top-k keyword-aware routing query. In 2020 IEEE 36th International Conference on Data Engineering (ICDE), pages 1213–1224. IEEE, 2020.

[FMZ19] Matthew Fahrbach, Vahab Mirrokni, and Morteza Zadimoghaddam. Submodular maximization with nearly optimal approximation, adaptivity and query complexity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 255–273. SIAM, 2019.

[Fuj05] Satoru Fujishige. Submodular functions and optimization. Elsevier, 2005.

[GLS81] Martin Grötschel, László Lovász, and Alexander Schrijver. The ellipsoid method and its consequences in combinatorial optimization. Combinatorica, 1(2):169–197, 1981.

[GN20] Rohan Ghuge and Viswanath Nagarajan. Quasi-polynomial algorithms for submodular tree orienteering and other directed network design problems. In Proceedings of the Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1039–1048. SIAM, 2020.

[Han80] Pierre Hansen. Bicriterion path problems. In Multiple criteria decision

making theory and application, pages 109–127. Springer, 1980.

[Has92] Refael Hassin. Approximation schemes for the restricted shortest path

problem. Mathematics of Operations research, 17(1):36–42, 1992.

[HFYD16] Huaizhong Lin, Fangshu Chen, Yunjun Gao, and Dongming Lu. Finding optimal region for bichromatic reverse nearest neighbor in two- and three-dimensional spaces. GeoInformatica, 20(3):351–384, 2016.

[HWQ+11] Jin Huang, Zeyi Wen, Jianzhong Qi, Rui Zhang, Jian Chen, and Zhen He. Top-k most influential locations selection. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM), pages 2377–2380, 2011.

[HZ80] Gabriel Y. Handler and Israel Zang. A dual algorithm for the constrained shortest path problem. Networks, 10(4):293–309, 1980.

[IJB13] Rishabh K. Iyer, Stefanie Jegelka, and Jeff A. Bilmes. Curvature and optimal algorithms for learning and minimizing submodular functions. In Proceedings of the Twenty-Seventh Conference on Neural Information Processing Systems (NeurIPS), pages 2742–2750, 2013.

[Iwa08] Satoru Iwata. Submodular function minimization. Mathematical Programming, 112(1):45, 2008.

[JB11] Stefanie Jegelka and Jeff Bilmes. Submodularity beyond submodular energies: coupling edges in graph cuts. In Proceedings of the 24th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1897–1904. IEEE, 2011.

[JGCZ20] Pengfei Jin, Yunjun Gao, Lu Chen, and Jingwen Zhao. Efficient group processing for multiple reverse top-k geo-social keyword queries. In International Conference on Database Systems for Advanced Applications (DASFAA), pages 279–287. Springer, 2020.

[Jok66] H. C. Joksch. The shortest route problem with constraints. Journal of

Mathematical analysis and applications, 14(2):191–197, 1966.

[JSB19] K. J. Joseph, Krishnakant Singh, and Vineeth N. Balasubramanian. Submodular batch selection for training deep neural networks. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI), pages 2677–2683, 2019.

[KG14] Andreas Krause and Daniel Golovin. Submodular function maximization, 2014.

[KKT03] David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 137–146, 2003.

[KLT14] Tung-Wei Kuo, Kate Ching-Ju Lin, and Ming-Jer Tsai. Maximizing submodular set function with connectivity constraint: Theory and application to networks. IEEE/ACM Transactions on Networking, 23(2):533–546, 2014.

[KM00] Flip Korn and Suresh Muthukrishnan. Influence sets based on reverse

nearest neighbor queries. ACM Sigmod Record, 29(2):201–212, 2000.

[KMN99] Samir Khuller, Anna Moss, and Joseph Seffi Naor. The budgeted maximum coverage problem. Information Processing Letters, 70(1):39–45, 1999.

[KMS+07] James M. Kang, Mohamed F. Mokbel, Shashi Shekhar, Tian Xia, and Donghui Zhang. Continuous evaluation of monochromatic and bichromatic reverse nearest neighbors. In 2007 IEEE 23rd International Conference on Data Engineering (ICDE), pages 806–815. IEEE, 2007.

[KORVM06] Fernando Kuipers, Ariel Orda, Danny Raz, and Piet Van Mieghem. A comparison of exact and ε-approximation algorithms for constrained routing. In International Conference on Research in Networking, pages 197–208. Springer, 2006.

[LCB+18] Hui Luo, Farhana M. Choudhury, Zhifeng Bao, J. Shane Culpepper, and Bang Zhang. MaxBRkNN queries for streaming geo-data. In International Conference on Database Systems for Advanced Applications (DASFAA), pages 647–664. Springer, 2018.

[LCF+14] Guoliang Li, Shuo Chen, Jianhua Feng, Kian-lee Tan, and Wen-syan Li.

Efficient location-aware influence maximization. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data

(SIGMOD), pages 87–98, 2014.

[LCGL13] Huaizhong Lin, Fangshu Chen, Yunjun Gao, and Dongming Lu. OptRegion: Finding optimal region for bichromatic reverse nearest neighbors. In International Conference on Database Systems for Advanced Applications (DASFAA), pages 146–160. Springer, 2013.

[LCH+05] Feifei Li, Dihan Cheng, Marios Hadjieleftheriou, George Kollios, and Shang-Hua Teng. On trip planning queries in spatial databases. In International Symposium on Spatial and Temporal Databases (SSTD), pages 273–290. Springer, 2005.

[LCZ+13] Bo Liu, Gao Cong, Yifeng Zeng, Dong Xu, and Yeow Meng Chee. Influence spreading path and its application to the time constrained social influence maximization problem and beyond. IEEE Transactions on Knowledge and Data Engineering, 26(8):1904–1917, 2013.

[LKG+07] Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, and Natalie Glance. Cost-effective outbreak detection in networks. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 420–429, 2007.

[LKSS10] Roy Levin, Yaron Kanza, Eliyahu Safra, and Yehoshua Sagiv. Interactive

route search in the presence of order constraints. Proceedings of the VLDB

Endowment, 3(1):117–128, 2010.

[LLS+18] Jiajia Li, Yuxian Li, Panpan Shen, Xiufeng Xia, Chuanyu Zong, and Chenxi Xia. Reverse k nearest neighbor queries in time-dependent road networks. In 2018 IEEE 20th International Conference on High Performance Computing and Communications (HPCC), pages 1064–1069. IEEE, 2018.

[LLV20] Wenzheng Li, Paul Liu, and Jan Vondrák. A polynomial lower bound on adaptive complexity of submodular maximization. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 140–152, 2020.

[LR01] Dean H Lorenz and Danny Raz. A simple efficient approximation scheme

for the restricted shortest path problem. Operations Research Letters,

28(5):213–219, 2001.

[LWW+13] Yubao Liu, Raymond Chi-Wing Wong, Ke Wang, Zhijie Li, Cheng Chen, and Zhitong Chen. A new approach for maximizing bichromatic reverse nearest neighbor search. Knowledge and Information Systems, 36(1):23–58, 2013.

[LWWF13] Cheng Long, Raymond Chi-Wing Wong, Ke Wang, and Ada Wai-Chee

Fu. Collective spatial keyword queries: a distance owner-driven approach.

In Proceedings of the 2013 ACM SIGMOD International Conference on

Management of Data (SIGMOD), pages 689–700, 2013.

[MFKK18] Marko Mitrovic, Moran Feldman, Andreas Krause, and Amin Karbasi. Submodularity on hypergraphs: From sets to sequences. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1177–1184, 2018.

[MKF+19] Marko Mitrovic, Ehsan Kazemi, Moran Feldman, Andreas Krause, and Amin Karbasi. Adaptive sequence submodularity. In Proceedings of the Thirty-Third Conference on Neural Information Processing Systems (NeurIPS), pages 5352–5363, 2019.

[MSSV16] Paras Mehta, Dimitrios Skoutas, Dimitris Sacharidis, and Agnès Voisard. Coverage and diversity aware top-k query for spatio-temporal posts. In Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL), pages 1–10, 2016.

[MZ00] Kurt Mehlhorn and Mark Ziegelmann. Resource constrained shortest

paths. In European Symposium on Algorithms (ESA), pages 326–337.

Springer, 2000.

[NB95] Subhas C Nandy and Bhargab B Bhattacharya. A unified algorithm for

finding maximum and minimum object enclosing rectangles and cuboids.

Computers & Mathematics with Applications, 29(8):45–61, 1995.

[NMW+13] Thao P. Nghiem, Kiki Maulana, Agustinus Borgy Waluyo, David Green, and David Taniar. Bichromatic reverse nearest neighbors in mobile peer-to-peer networks. In 2013 IEEE International Conference on Pervasive Computing and Communications (PerCom), pages 160–165. IEEE, 2013.

[NWF78] George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An

analysis of approximations for maximizing submodular set functions.

Mathematical programming, 14(1):265–294, 1978.

[OAYK14] Naoto Ohsaka, Takuya Akiba, Yuichi Yoshida, and Ken-ichi Kawarabayashi. Fast and accurate influence maximization on large networks with pruned Monte-Carlo simulations. In 28th AAAI Conference on Artificial Intelligence (AAAI), pages 138–144, 2014.

[OQC+18] Dian Ouyang, Lu Qin, Lijun Chang, Xuemin Lin, Ying Zhang, and Qing Zhu. When hierarchy meets 2-hop-labeling: Efficient shortest distance queries on road networks. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD), pages 709–724, 2018.

[QFT18] Chao Qian, Chao Feng, and Ke Tang. Sequence selection by Pareto optimization. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI), pages 1485–1491, 2018.

[QSYT17] Chao Qian, Jing-Cheng Shi, Yang Yu, and Ke Tang. On subset selection with general cost constraints. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI), pages 2613–2619, 2017.

[QYT18] Chao Qian, Yang Yu, and Ke Tang. Approximation guarantees of stochastic greedy algorithms for subset selection. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI), pages 1478–1484, 2018.

[QZK+12] Jianzhong Qi, Rui Zhang, Lars Kulik, Dan Lin, and Yuan Xue. The min-dist location selection query. In 2012 IEEE 28th International Conference on Data Engineering (ICDE), pages 366–377. IEEE, 2012.

[RNNF19] Vahid Roostapour, Aneta Neumann, Frank Neumann, and Tobias Friedrich. Pareto optimization for subset selection with dynamic cost constraints. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI), pages 2354–2361, 2019.

[Sak20] Shinsaku Sakaue. Guarantees of stochastic greedy algorithms for non-monotone submodular maximization. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), pages 11–21, 2020.

[SKG+07] Amarjeet Singh, Andreas Krause, Carlos Guestrin, William J. Kaiser, and Maxim A. Batalin. Efficient planning of informative paths for multiple robots. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI), pages 2204–2211, 2007.

[SKS08] Mehdi Sharifzadeh, Mohammad R. Kolahdouzan, and Cyrus Shahabi.

The optimal sequenced route query. VLDB Journal, 17(4):765–787, 2008.

[SMS+10] Sergio Cabello, José Miguel Díaz-Báñez, Stefan Langerman, Carlos Seara, and Inmaculada Ventura. Facility location problems in the plane based on reverse nearest neighbor queries. European Journal of Operational Research, 202(1):99–106, 2010.

[SNY18] Shinsaku Sakaue, Masaaki Nishino, and Norihito Yasuda. Submodular function maximization over graphs via zero-suppressed binary decision diagrams. In 32nd AAAI Conference on Artificial Intelligence (AAAI), pages 1422–1430, 2018.

[Sto12] Sabine Storandt. Route planning for bicycles: exact constrained shortest paths made practical via contraction hierarchy. In Proceedings of the Twenty-Second International Conference on Automated Planning and Scheduling (ICAPS), pages 234–242, 2012.

[SZKK17] Serban Stan, Morteza Zadimoghaddam, Andreas Krause, and Amin Karbasi. Probabilistic submodular maximization in sub-linear time. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3241–3250, 2017.

[THCC13] Yufei Tao, Xiaocheng Hu, Dong-Wan Choi, and Chin-Wan Chung. Approximate MaxRS in spatial databases. Proceedings of the VLDB Endowment, 6(13):1546–1557, 2013.

[TSK17] Sebastian Tschiatschek, Adish Singla, and Andreas Krause. Selecting sequences of items via submodular maximization. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI), pages 2667–2673, 2017.

[TTS10] Quoc Thai Tran, David Taniar, and Maytham Safar. Bichromatic reverse nearest-neighbor search in mobile systems. IEEE Systems Journal, 4(2):230–242, 2010.

[TZ09] George Tsaggouris and Christos Zaroliagis. Multiobjective optimization: Improved FPTAS for shortest paths and non-linear objectives with applications. Theory of Computing Systems, 45(1):162–186, 2009.

[Vaz13] Vijay V. Vazirani. Approximation Algorithms. Springer Science & Business Media, 2013.

[WBC+17] Sheng Wang, Zhifeng Bao, J. Shane Culpepper, Timos Sellis, and Gao Cong. Reverse k nearest neighbor search over trajectories. IEEE Transactions on Knowledge and Data Engineering, 30(4):757–771, 2017.

[WÖF+11] Raymond Chi-Wing Wong, M. Tamer Özsu, Ada Wai-Chee Fu, Philip S. Yu, Lian Liu, and Yubao Liu. Maximizing bichromatic reverse nearest neighbor for Lp-norm in two- and three-dimensional spaces. The VLDB Journal, 20(6):893–919, 2011.

[WÖY+09] Raymond Chi-Wing Wong, M. Tamer Özsu, Philip S. Yu, Ada Wai-Chee Fu, and Lian Liu. Efficient method for maximizing bichromatic reverse nearest neighbor. Proceedings of the VLDB Endowment, 2(1):1126–1137, 2009.

[WXYL16] Sibo Wang, Xiaokui Xiao, Yin Yang, and Wenqing Lin. Effective indexing

for approximate constrained shortest path queries on large road networks.

Proceedings of the VLDB Endowment, 10(2):61–72, 2016.

[WYX+19] Liang Wang, Zhiwen Yu, Fei Xiong, Dingqi Yang, Shirui Pan, and Zheng Yan. Influence spread in geo-social networks: a multiobjective optimization perspective. IEEE Transactions on Cybernetics, 2019.

[XGS+20] Hongfei Xu, Yu Gu, Yu Sun, Jianzhong Qi, Ge Yu, and Rui Zhang. Effi-

cient processing of moving collective spatial keyword queries. The VLDB

Journal, 29(4):841–865, 2020.

[XYL11] Xiaokui Xiao, Bin Yao, and Feifei Li. Optimal location queries in road

network databases. In 2011 IEEE 27th International Conference on Data

Engineering (ICDE), pages 804–815. IEEE, 2011.

[XZKD05] Tian Xia, Donghui Zhang, Evangelos Kanoulas, and Yang Du. On com-

puting top-t most influential spatial sites. In Proceedings of the 31st in-

ternational conference on Very large data bases (VLDB), pages 946–957.

Citeseer, 2005.

[YCLW15] Shiyu Yang, Muhammad Aamir Cheema, Xuemin Lin, and Wei Wang.

Reverse k nearest neighbors query processing: experiments and analysis.

Proceedings of the VLDB Endowment, 8(5):605–616, 2015. BIBLIOGRAPHY 143

[YLC+16] Ye Yuan, Xiang Lian, Lei Chen, Yongjiao Sun, and Guoren Wang. Rsknn:

knn search on road networks by incorporating social influence. IEEE

Transactions on Knowledge and Data Engineering, 28(6):1575–1588,

2016.

[YLW+19] Ye Yuan, Xiang Lian, Guoren Wang, Yuliang Ma, and Yishu Wang. Con-

strained shortest path query in a large time-dependent graph. Proceedings

of the VLDB Endowment, 12(10):1058–1070, 2019.

[YZZ+19] Junye Yang, Yong Zhang, Xiaofang Zhou, Jin Wang, Huiqi Hu, and

Chunxiao Xing. A hierarchical framework for top-k location-aware error-

tolerant keyword search. In 2019 IEEE 35th International Conference on

Data Engineering (ICDE), pages 986–997. IEEE, 2019.

[ZCC+15] Yifeng Zeng, Xuefeng Chen, Xin Cao, Shengchao Qin, Marc Cavazza,

and Yanping Xiang. Optimal route search with the coverage of users’

preferences. In Proceedings of the Twenty-Fourth International Joint Con-

ference on Artificial Intelligence (IJCAI), pages 2118–2124, 2015.

[ZCL+15] Tao Zhou, Jiuxin Cao, Bo Liu, Shuai Xu, Ziqing Zhu, and Junzhou Luo.

Location-based influence maximization in social networks. In Proceed- ings of the 24th ACM International on Conference on Information and

Knowledge Management (CIKM), pages 1211–1220, 2015.

[ZDXT06] Donghui Zhang, Yang Du, Tian Xia, and Yufei Tao. Progressive compu-

tation of the min-dist optimal-location query. In Proceedings of the 32nd

international conference on Very large data bases (VLDB), pages 643–

654, 2006.

[ZGC+17] Jingwen Zhao, Yunjun Gao, Gang Chen, Christian S Jensen, Rui Chen, and Deng Cai. Reverse top-k geo-social keyword queries in road net-

works. In 2017 IEEE 33rd International Conference on Data Engineering

(ICDE), pages 387–398. IEEE, 2017.

[ZV16] Haifeng Zhang and Yevgeniy Vorobeychik. Submodular optimization

with routing constraints. In Proceedings of 30th AAAI Conference on

Artificial Intelligence (AAAI), pages 819–825, 2016.

[ZWL+11] Zenan Zhou, Wei Wu, Xiaohui Li, Mong Li Lee, and Wynne Hsu. Max-

first for maxbrknn. In 2011 IEEE 27th International Conference on Data

Engineering (ICDE), pages 828–839. IEEE, 2011.

[ZZZ+14] Chengyuan Zhang, Ying Zhang, Wenjie Zhang, Xuemin Lin, Muham-

mad Aamir Cheema, and Xiaoyang Wang. Diversified spatial keyword

search on road networks. In Proceedings of the 17th International Con-

ference on Extending Database Technology (EDBT), 2014.

144