1 CrowdExpress: A Probabilistic Framework for On-Time Crowdsourced Package Deliveries Chao Chen, Sen Yang, Weichen Liu, Yasha Wang, Bin Guo and Daqing Zhang

Abstract—Speed and cost of logistics are two major concerns regarding status of the vehicles, , and packages (e.g., to on-line shoppers, but they generally conflict with each other real-time positions, origins and destinations of passengers and in nature. To alleviate the contradiction, we propose to exploit packages, available capacity of a vehicle) can be existing taxis that are transporting passengers on the street to easily recorded and accessible in real time [10], [56]. In this relay packages collaboratively, which can simultaneously lower context, crowdshipping (also termed as crowd logistics, crowd- the cost and accelerate the speed. Specifically, we propose a prob- sourcing logistics) which receives the increasing attention from abilistic framework containing two phases called CrowdExpress for the on-time package express deliveries. In the first phase, we both academic and industrial communities in the last few years, mine the historical taxi GPS trajectory data offline to build the has been recognized as a promising and cost-effective way to package transport network. In the second phase, we develop an alleviate the contradiction by sending passengers and packages online adaptive taxi scheduling algorithm to find the path with the simultaneously in a shared space and transport network [14], maximum arriving-on-time probability “on-the-fly” upon real- [17], [21], [30], [41]. In line with the previous research, with time requests, and direct the package routing accordingly. Finally, a particular focus on taxis, we propose having packages take we evaluate the system using the real-world taxi data generated by hitchhiking rides collaboratively with existing taxis that are over 19,000 taxis in a month in the city of New York, US. Results transporting passengers on the street, i.e., the existing mobility show that around 9,500 packages can be delivered successfully of taxi drivers [4], [32]. We illustrate the basic idea using the on time per day with the success rate over 94%, moreover, the following example. average computation time is within 25 milliseconds. There is a package to be delivered from A to B. Here Keywords—package delivery; hitchhiking rides; route planning; we simply assume that both A and B have facilities such taxi scheduling; trajectory data mining as smart parcel boxes that can store packages temporally. Opportunistically, a (e.g. passenger1) at A makes a real-time taxi ordering request1, intending to go to B. Once I.INTRODUCTION a taxi (e.g. taxi1) responds to the request, we can assign the He pervasive use of tablet and mobile devices leads to package delivery task to its driver. More specifically, we can T increased popularity of round-the-clock online shopping, ask the taxi driver to first collect the package and then pick which urges the sustainable development of logistics indus- up the passenger. After dropping off passenger1 at B, the taxi try [35]. For instance, online ordered products generated over driver leaves the package at an appointed location. Finally, one billion package deliveries in 2013, and this number is the package will be delivered to or collected by its receiver at predicted to grow by 28.8% in 2018 [35]. To on-line shoppers, B. speed and cost are two major concerns, between which they As shown in the above example, this approach only requires pay relatively more attention to cost [47]. Unfortunately, speed small additional efforts and time from the taxi drivers involved, and cost usually conflict with each other in nature. In another without degrading the service quality and any interrupt to word, speeding up the shipping process often implies more passengers. We can formulate the proposed taxi-based package personnel and vehicles on the road, incurring more extra cost delivery as a route planning problem with the objective of as a result [4]. For example, users have to pay a high amount delivering the packages to their destination on time (by the of money to enjoy the express service, such as 5 dollars per deadline given by users). To make the idea of organizing package if users want the same-day delivery service in US [49]. the passenger flow and package flow seamlessly via the taxi arXiv:1809.02897v1 [cs.OH] 8 Sep 2018 Therefore, logistics services which can lower the cost but still transport network feasible, we need to address the following ensure the arrivals of packages on time are preferable [42]. two major research challenges: With the sustainable development and proliferated daily • Package flow and passenger flow are incompatible in use of positioning and mobile Internet technologies, rich data time and space. More specifically, 1) compared to package flow, passenger flow presents salient peak- Chao Chen, Sen Yang and Weichen Liu are with the Key Laboratory of Dependable Service Computing in Cyber Physical Society (Chongqing hour patterns; 2) due to the financial considerations, University), Ministry of Education and also with the College of Com- most passengers choose to take taxis only when the puter Science, Chongqing University, Chongqing 400044, China. E-mail: destination is close, e.g., within 4 km [13]. While for [email protected]. Yasha Wang and Daqing Zhang are with the Institute of Software, School 1It is popular that passengers order taxis in real time with mobile apps of Electronics Engineering and Computer Science, Peking University, Beijing such as Uber (https://www.uber.com/). Usually, to make a request, a passenger 100871, China. E-mail: {dqzhang,wangys}@sei.pku.edu.cn. has to provide information including his/her origin, intended destination. The Bin Guo is with the School of Computer Science, Northwestern Polytech- request will be broadcast locally, and the taxi driver who accepts request would nical University, Xi’an 710072, China. E-mail: [email protected]. come to pick up the passenger [29]. 2

packages, the destination is generally far away from the from prior research. In Section III, we introduce some basic origin (e.g., longer than 5 km) [4]. Therefore, we argue concepts, the assumptions we have made, the formal problem that routing algorithms based on the framework of query- formulation, and overview the system. We present the technical matching merely may not work [17], [30], since a single details about our two-phase approach in Section IV and hitchhiking ride may not be able to deliver a package to Section V respectively. We evaluate the performance of the its destination; instead, a collaborative relay of taxis is proposed framework in Section VI. Finally, we conclude the needed. paper and chart the future directions in Section VII. • Requests for demands of packages and passengers come with high uncertainties in stream. Although regular II.RELATED WORK spatial and temporal patterns about passenger flow have Here, we will review the related work, which can be cate- been unveiled from taxi GPS trajectory data in the coarse gorized into two groups. The first group consists of the work granularity, it is challenging to predict the passenger on crowdsourced logistics, whereas the second group focuses demands accurately in a quite fine granularity (e.g., the on taxi trajectory mining and its supporting applications. passenger demands in the next 5 minutes for a given road segment) [37], [53], [59]. Meanwhile, same situation also happens to package demands [47]. In summary, A. Crowdsourced Logistics uncertainties exist in both requests, which hinder the Crowdsourcing has been used for many different applica- estimation and comparison of time cost of package tions, from problem solving [19] to various sensing tasks [26], routing paths. [45], [58]. There exist two concrete papers that particularly With the above-mentioned research objective and challenges, targeted at the package delivery problem, leveraging the spatial the main contributions of the paper are: and time overlaps between crowdsourcing workers. Specifi- cally, Sadilek et al. [44] recruited a group of twitter users, 1) We propose a novel passenger and package mixed asking one person to pass the assigned package to another transport mode which leverages the unintentional co- twitter user that happened to be nearby (within a certain operations among a crowd of occupied taxis to de- distance). However, this work had two main limitations: 1) It is liver city-wide packages on time, in order to lower hard to trace and coordinate the users since people rarely share the transport cost and enhance the transport efficiency their location information continuously via geo-twitter [43]. simultaneously. Therefore, the choice of suitable deliverers is limited, probably 2) We formulate the package routing problem as the resulting in longer and uncontrollable package delivery delay. arriving-on-time problem [38] to tackle with uncertain- 2) It may be not practical to ask a participant to make a ties of passenger and package requests. Moreover, we dedicated trip to pass the package to another suitable user, as it propose a probabilistic framework named CrowdEx- may interrupt his/her on-going activities (e.g. having conver- press, which contains two phases to solve it. In the sation/dinner with friends) that are hard to be inferred from first phase, we build the package transport network by the user’s geo-twitters data. Similarly, the work in [36] which mining the historical taxi GPS trajectory data offline. In employed mobile users based on the overlaps of space and time the second phase, for each real-time generated package inferred from cell towers had similar limitations: 1) The cell delivery request, we propose an online adaptive taxi towers may be sparse in certain areas and thus it is difficult for scheduling algorithm based on the probabilistic model mobile users to relay packages in that areas. 2) Only with data called maxProb to iteratively determine the next stop of when people make calls, as a result, the number of mobile users the package “on-the-fly". The algorithm monitors real- can be recruited is limited. Compared to the proposed solutions time taxi ordering requests, recursively computes the in [36], [44], there are also some papers which intended maximum arriving-on-time probability if assigning the to leverage the abundant existing passenger-delivery trips to delivery task to the currently available taxi, and com- hitchhike packages appeared in these two years [4], [3], [17], pares it to the one if waiting for future taxi rides, based [22], [27], [30], [31]. Although the passenger flow and package on the real-time package location and the remaining flow are combined to be transported mixed, the authors fail to time budget. consider their distinct patterns in time and space. Specifically, 3) We conduct extensive evaluations using road network they formulate the problem as the share-a-ride problem and data and taxi GPS trajectory data generated over 19,000 insert the package requests into the passenger-delivery trips, taxis in a month in the city of New York (NYC) to which may not able to deliver packages successfully in real verify the efficiency and effectiveness of CrowdExpress. cases as we argued previously. To make the matters worse, in Results demonstrate that CrowdExpress responds within their solutions, during the passenger-sending course, taxis have 25 milliseconds. What is more, it can throughput around to make several dedicated stops and detour, which degrades 9,500 packages daily with the success rate over 94%, the service quality to passengers. At the current stage, the i.e., over 94% of packages can be delivered successfully research on crowdsourced logistics mainly focused on issues on time, which is consistently better than the baseline regarding how to efficiently discover the ‘optimal’ package approaches. delivery paths and almost completely ignored the multi-criteria The rest of the paper is organized as follows. In Section II, design of the package relay network [4], [14], [15], [27]. How we review the related work and show how this paper differs to design the package relay network (e.g., the optimal number 3 and location of package interchange stations) is vital and 200 million dimensions of features to predict the original challenging, which should be a separated research issue [2], taxi demands for each POI. However, most of the existing [51]. In this paper, we exploit existing taxi services to deliver work that leverages taxi trajectory data focuses on transporting packages [15] and simply cluster frequent pickup and drop- passengers. Little attention has been paid to shipping goods. off points to get package interchange stations without any To the best of our knowledge, we are among the first to target optimization technicals. Packages can be temporary stored this new application. at interchange stations in-between rides, and thus no time overlap and pair-wise contact between participants is needed. III.PRELIMINARY,PROBLEM STATEMENT AND SYSTEM Our method requires less effort from the participants and can OVERVIEW transport packages over a longer distance. In addition, we In this section, we provide definitions of some basic con- try to minimize the impact on the quality and experience of cepts, elicit assumptions we have made, and give a formal passenger service, which is similar to our previous work [14]. problem statement. Finally, we give an overview of the pro- Generally speaking, CrowdExpress is different from [14] in the posed CrowdExpress system. objective, proposed solution as well as the evaluation environ- ment. To be more specific, CrowdExpress aims to send each package to the destination on time (by the deadline required A. Preliminary by the user), rather than as quick as possible. We argue that 1) Basic Concepts: We define the basic concepts used in the objective is more reasonable and matches the real-life this work as follows: application scenarios more. In CrowdExpress, we formulate Definition 1 (Taxi Trajectory). A taxi trajectory is a the problem as the one of finding arriving-on-time paths, and tr sequence of GPS points corresponding to a single passenger- propose a probabilistic framework to address it. We evaluate delivery trip. Here, the taxi trajectory is represented by a the performance in NYC, using the real-world taxi trajectory pair of Origin-Destination (OD), where the origin is the road data. Due to the openness, we choose to use the taxi trajectory segment that the trip starts and the destination is the one that data from NYC in our experiments, expecting to lower the data the trip ends. The time is exactly the time difference barrier and attract more researchers with different backgrounds between the ending and starting times. into the promising interdisciplinary filed. Definition 2 (Real-time Taxi Ordering Request). A taxi order- ing request (tor) is defined as a triplet ho , d , r i, where o B. Taxi Trajectory Data Mining t t t t and dt refer to the passenger’s origin and intended destination, Information mined from taxi trajectory data can benefit for respectively. rt refers to the time that the passenger submits taxi drivers, passengers and city planners. For taxi drivers who the request. are mostly interested in making more money while minimizing Definition 3 (Package Transport Network). A package trans- the fuel cost, many papers have tried to recommend areas network is a graph G(S, E), consisting of a node set S with more potential passengers, e.g. [23], [55]. In addition, and an edge set E, where each element s in S is an interchange Zhang et al. [57] investigated the differences between efficient station which is responsible for package collections and stor- and inefficient taxi service in terms of passenger-searching, age. Edge set E is a subset of the cross product S × S. Each passenger-delivery and service-region preference. Work on element e(i, j) in E is a non-stop directional transport route recommending the best corner to catch taxis, real-time ordering from node s to node s , implying that there is an abundant free taxis, and the taxi fee estimation aims to improve the i j passenger flow for hitchhiking packages. It should be noted experiences of passengers, e.g. [55]. An interesting work that the edge in the package transport network has different detected anomalous taxi rides and warned the passengers “on- meaning from the edge defined over the road network. There the-fly” that they were taken on a unnecessary detour [11]. can be multiple driving paths over the road network connecting For city planners, taxi trajectory data provides a rich data two interchange stations. source to identify flaws in city planning [33], [62], probe traffic conditions [8], estimate the travel demands, infer the Definition 4 (Package Delivery Request). A package delivery land-use efficiency [34], [40], suggest routes [33], etc. request (pr) is defined as a triplet hop, dp, tpi, where op and Castro et al. [7] provided a good survey on understanding dp refer to the origin and destination of the package delivery city dynamics using taxi trajectory data. Recent studies also respectively; tp refers to the time when the user submits incorporate taxi trajectory data with other data sources such the request (i.e. the birth time). Note that here op ∈ S, as POI data, Foursquare check-in data, and Flickr image data, dp ∈ S, indicating that packages should originate and end to enable smarter applications, such as functions at interchange stations. inferring, personalized and fuel-efficient travel route planning, real-time travel purpose inference and so on [9], [12], [18], Definition 5 (Time Slot Slicing). We divide a whole day [39], [54], [63]. Another stream is to revisit previous well- into different time slots (periods) according to the day type, known research issues using the advanced machine learning since the traffic conditions are changing in different time slots, skills. For instance, Xu et al. [52] forecasted taxi demands in resulting in large variance in travel time. A work day is divided real time using recurrent neural networks. Tong et al. [50] into three time slots and a rest day is divided into two time applied a simple linear regression model with more than slots, as detailed in Fig. 1. 4

Night-time hours Rush hours Day-time hours s1 0.6 c 1 (1-5 min) 0.4 0:00 6:59 9:00 16:59 19:00 23:59 c 2 (6-10 min) Work Days 0.7

0.9 0:00 7:59 19:00 23:59 0.3 Rest Days

0.5 0.5 s2 0.1

Fig. 1. Time slot slicing. so 0.8 0.7 sd

0.3 0.2

Definition 6 (Travel Time Probability Function). Each edge 0.8 s3 in the package transport network (G) is associated with an 0.8 0.2 independent random travel time (cost) tij whose probability 0.2 s density function is denoted by pij(t). pij(t) varies at different 4 time slots.

Fig. 2. An illustrative example of package delivery paths from so to sd, as For instance, the probability that a package spends a time well as the distribution of discrete travel times on each edge. in the interval [0, t0] from node si to node sj directly can be computed by the definition, as shown in Eq. 1. It is obviously that the integral computation becomes more Z t0 complicated if the given path is longer (i.e., contains more Pij(t ≤ t0) = pij(t)dt (1) 0 interchange stations) as more travel time combinations are generated. The travel time probability function in each time slot can be Similarly, to ease the computation of integral in Eq. 3, we obtained separately, according to Eq. 1. What is more, as first let the travel time be considered in the discrete manner, can be observed, the travel time probability is a monotonic then the integral can be degraded to the sum computation. increasing function of the time t. Taking the path (path1 : so → s1 → sd) shown in Fig. 2 as Definition 7 (Travel Time Discretization). To simplify the an example, its arriving-on-time probability within 15 minutes calculation of travel time probability along an edge, we can be computed as: consider the travel time in a discrete manner. More precisely, ( ≤ 15| ) = 0 3 × (0 6 + 0 4) + 0 7 × 0 6 = 0 72 we use a piecewise constant function with equal step width τ Pij t path1 ...... 2 | {z } | {z } to discretize different travel times . Case I Case II In the discrete case, the integral shown in Eq. 1 can be For the given path, two cases can lead to the successful arrival replaced using the formula shown as follows. of packages by the deadline. Case I: If the first recruited taxi driver spent no more than 5 minutes in the first segment (i.e., #traveltimes < ατ so → s1), due to the sufficient time margin left, then the Pij(t ≤ t0) = #traveltimes second recruited taxi driver can arrive at sd by the deadline at a hundred percent. Case II: If the first recruited taxi driver took t0 = (α − 1)τ + δ (2) more than 5 minutes in the first segment, then the package can be arrived on time only if the second recruited taxi driver spent where #traveltimes refers to the number of all the possible no more than 5 minutes to accomplish the second segment (i.e., travel times from node i to node j which are recorded in the s1 → sd). taxi trajectory in history, while #traveltimes < ατ refers to the number of travel times less than ατ after time discretization Theorem 1. For a unique path, its arriving-on-time probability + becomes higher (or at least unchanged) if given a longer as defined; α ∈ N , and 0 ≤ δ < τ. For the edge from so to deadline, mathematically, we have: s1 shown in Fig. 2, Po1(t ≤ 5) = 0.3; Po1(5 < t ≤ 10) = 0.7.

Pij(t ≤ t1|P ath) ≤ Pij(t ≤ t2|P ath), if t1 < t2 (4) Definition 8 (Arriving-on-Time Probability). The arriving on time probability of a package-delivering path within a given Proof: The proof can be found in Appendix A. time duration (i.e., deadline, t0) is defined as the ratio of the Definition 9 (Maximum Probability of Arriving-on-Time). For number of travel times less than t0 to the number of all possible an OD pair, the maximum probability of arriving-on-time (with travel times (suppose that the package is shipped from to si time cost no greater than t ) is defined as the maximal one via ), as follows. 0 sj sk among all probabilities on all possible N paths from the origin (s ) to the destination (s ), denoted by u (t ≤ t ). In another Zt0 Zt1 i j ij 0 word, uij(t ≤ t0) serves as the upper bound of Pij(t ≤ t0). Pij(t ≤ t0|P ath) = dt2 pik(t1)pkj(t2 − t1)dt1 (3) If a package at node s is firstly sent to s next, the t −t 0 i k 0 1 probability that the package spends a time in the interval 2 Here, we set τ = 5 min throughout the whole paper. [ω, ω + dω] on edge si, sk is pik(ω)dω, thus the time margin 5

at node sk is t0 − w. On the basis of Bellman’s principle Assumption 3. The packages are traceable. of optimality [6], [38], no matter which node s that the j In the delivery process, a package is either stored at the inter- package is elected to send next, the package must follow change station or carried by a scheduled taxi. Each interchange the optimal routing strategy in shipping from node s to the k station is authorized and has a unique Id; each taxi is registered destination s within the remaining time t − ω. Therefore, j 0 in taxi management department and also has a unique Id. This the maximum probability of arriving-on-time can be formally assumption tries to address the package security issues. Any defined recursively as follows: package damage or loss will impair this novel package delivery t0 service. Z uij(t ≤ t0) = max pik(ω)ukj(t ≤ t0 − ω)dω (5) k=6 j 0 B. Problem Statement The collaborative crowdsourced package deliveries leverag- Intuitively, according to Definition 5, to compute the max- ing the relays of passenger-occupied taxis can be viewed as imum probability of arriving-on-time for an OD pair, one the problem of finding arriving-on-time paths, and thus can be needs to find all possible paths from the origin to the des- formulated as follows: tination, which is well-known as an NP-hard problem [24]. Given: To make the concept clear, we use the example shown in 1) A historical set of taxi trajectory records {T r}, such as Fig. 2 again. It is easy to find all paths from so to sd, that is, path : s → s → s , path : s → s → s , from the past month in the designated city, 1 o 1 d 2 o 2 d 2) A set of real-time taxi ordering requests TOR from path3 : so → s3 → sd and path4 : so → s4 → s3 → sd, respectively. For each path, similar to the computation in mobile phone apps, and a set of real-time package Eq. 3, it is not difficult to obtain the respected arriving-on-time delivery requests PR. Note that these two requests come in stream, probability within a given deadline (e.g., t0 = 15 minutes). As a result, u (t ≤ 15) = 0.95 for the example when delivering 3) A given deadline specified by the user for each package ij delivery request. the package via path2. It is obvious that uii(t) = 1 given any deadline, since no travel time is needed if the package stays Objectives: Build a package transport network (i.e., the still. We exclude the round trip in the study. identification of interchange stations and estimation of edge 2) Assumptions: In this work, we make the following as- values) based on the historical taxi trajectory data. Moreover, sumptions. find a package delivery path for each package request (pr), which can make the package arrive the destination by the Assumption 1. Taxis cannot be recruited to take part in the deadline. However, such package delivery path may not be package delivery tasks in the course of sending passengers. unique or existed. To migrate the issue, we thus transform We further assume the recruited taxis have enough room for the problem to the arriving-on-time problem, i.e., finding the the package storage. optimal one that is expected to have the maximum probability 3 Compared to [4], [30], [31], the taxis are recruited to collect of arriving-on-time . packages before picking up passengers, and offload packages The following two constraints should be satisfied. after dropping off passengers to minimize potential impact on 1) Only taxis that accept the real-time taxi ordering re- passengers’ experiences. This assumption tries to guarantee the quests after the package delivery request is posted can quality of taxi services for passengers. In fact, it is infeasible be scheduled, i.e. {tor.rt} > pr.tp. that taxi drivers delicately stopped to get packages on-the- 2) A recruited taxi can be available to participate again way when delivering passengers, unlike to the ride-sharing only after completing the current task (i.e., dropping among passengers. In the case of passenger ride-sharing, off the package at the predefined interchange station). passengers can be easily picked up on-the-way because they can proactively wait and get on cars at the appointed locations. C. System Overview On the contrary, in order to get packages, taxi drivers must We develop a two-phase system called CrowdExpress, i.e., have to take a sequence of complex actions, including parking, offline package transport network building and online taxi getting off the car, loading the package, getting on the car, and scheduling and package routing to find the optimal route with so on. the maximum probability of arriving-on-time for each package Assumption 2. The taxi drivers are willing to accept the delivery request within a given deadline, by collaboratively assigned package delivery tasks. recruiting taxi drivers that have been reserved to passengers (occupied by passengers), as shown in Fig. 3. We believe that this assumption can be realistic given proper Phase I is an offline process, with the historical taxi tra- incentive mechanisms. In the design of incentive mechanisms, jectory data as input, aiming to identify the package inter- a prime principle is to ensure that the reward matches the change stations, estimate the edge values, as well as find amount of efforts put in by the drivers. For example, some the reference paths for any given OD package pairs. Based places are harder to get to or park the taxi, then the incen- tive should be higher. However, designing a proper incentive 3If the optimal one is still unable to send the package on-time, then it is mechanism is beyond the scope of this paper. safe to claim that the package delivery is an unsuccessful one. 6

Historical Taxi Trajectory Data Real-time Taxi Ordering and Package Delivery Requests

Phase I: Transport Network Building Phase II: Taxi Scheduling and Package Routing

Edge Estimation Exploitability Probability Computation Checking and Comparison Station From Trajectory Data to Identification Passenger Flow No

Reference From Passenger Flow to Stopping Package Information Path Finding Time Distributions Criteria ? Updating

Yes

Package Delivery Paths

Fig. 3. Overview of the CrowdExpress system.

40.88 on the constructed package transport network, for a real-time incoming individual package delivery request, Phase II mainly Landmarks 40.86 20 takes four online steps to tackle with, namely, Exploitability High Checking, Probability Computation and Comparison, Package Package Station

Information Updating and Stopping Criteria Checking, with 40.84 the streaming taxi ordering requests as input. The system finally outputs the corresponding package delivery paths. The technical details will be presented in the next two sections. 40.82

40.8 IV. PHASE I:OFFLINE PACKAGE TRANSPORT NETWORK Central Park 10 BUILDING Median 40.78 The task of offline package transport network building is to Broadway identify interchange stations (i.e., node locations) as well as 40.76 the estimation of travel time distributions (i.e., edge values). Here, we mainly take a three-step procedure to achieve the objectives, detailed as follows. 40.74 911 Memorial Fundation

40.72 Low A. Package Interchange Station Identification 0

One basic principle for the identification of interchange 40.7 stations is that they should be located where passengers are −74 −73.98 −73.96 −73.94 −73.92 −73.9 −73.88 Fig. 4. Results of the package interchange station in Manhattan, NYC. The frequently picked-up or dropped-off, to take as full advantage color of the station corresponds to its popularity, which is measured by the of passenger-sending rides for the package hitchhiking as total times of the pick-ups and drop-offs. possible, and probably to minimize the extra efforts imposed to the taxi drivers as well. Fortunately, with the taxi trajectory data left, such information (i.e., pick-up and drop-off points) B. Edge Estimation can be easily extracted. Then, we cluster them using DBSCAN 1) From Trajectory Data to Passenger Flow: It is straight- algorithm as it is capable of merging closer data points with forward to infer the passenger flow between any two inter- arbitrary distributions [20]. Finally, locations near the point change stations during a given time slot from the trajectory centroids of each cluster and the road sides are identified as the data [14]. Specifically, we first group the trajectories according locations of the interchange stations to serve the purposes of to their starting time (t ). Second, to compute the passenger package collection, storage and receiving. The locations of the s flow from si to sj, we count the number of the trajectories package interchange stations in the Manhattan area of NYC is satisfying Eqns. 6∼7. It should be noted that there could be shown in Fig. 4. No package interchange stations are identified no passenger flow between some interchange station pairs. in the Upper Manhattan since there were fewer passengers take taxi to reach there. Ddist(T ri.o, loc(si)) ≤ δ (6) 7

0.5 Ddist(T ri.d, loc(sj)) ≤ δ (7) where T ri.o and T ri.d are the original and destination points 0.4 of T ri, respectively; loc(·) gets the latitude and longitude location of the given interchange station; Ddist(a·b) calculates 0.3 the driving distance from point a to point b; δ is a user- specified parameter. The physical meaning of δ is that any passenger-delivery ride which starts and ends near a pair of Rate Trips 0.2 Percentages interchange stations (i.e., with driving distance less than δ) can be hitchhiked for the package delivery between this pair. 0.1 Hence, for a given OD pair, a bigger δ would result in a bigger number of passenger flow. It is worth noting that, 0 for a specific trajectory, there could be multiple interchange 5mins 15mins 25mins 35mins >40mins Discrete Driving Time station pairs that satisfy Eqns 6∼7, in other words, can provide Travel Time package hitchhiking ride between all these pairs. Therefore, Fig. 5. Histogram of all driving times after discretization. a bigger δ also leads to a bigger number of interchange station pairs, suggesting that the corresponding trajectory can be more capable of providing hitchhiking rides. However, for the shortest path when assuming value on each edge is the passengers, a bigger δ may mean a longer waiting time for minimum travel time (SP ath_min), and the shortest one when the reserved taxis, since the taxi driver might have to travel assuming value on each edge is the maximum travel time farther to collect the package before picking up passengers. To (SP ath_max), respectively. More specifically, given an OD control for the additional waiting time, we set δ to 500 meters. pair, when choosing all minimum travel times on edges, we 2) From Passenger Flow to Time Distributions: To estimate can find the shortest path, i.e. SP ath_min, using Dijkstra’s an edge value, we need to estimate two parts, i.e. the waiting algorithm [46]. SP ath_min refers to the case that the package time and the driving time [14]. The driving time is simply the can be delivered in the most efficient manner if sent via that travel time of each taxi trajectory. path. Similarly, when choosing all maximum travel times on The waiting time on the edge is defined as the time required edges, we can also discover the shortest path, i.e. SP ath_max. to wait for a suitable hitchhiking ride that can transport a SP ath_max, as a comparison, refers to the case that the package from si to sj directly. To address this problem, we package can be delivered in the lowest efficient manner if sent employ the Non-Homogeneous Poisson Process (NHPP) to via that path. Moreover, the corresponding total travel times model the behavior of passenger taking taxis [61]. According of those two paths are also recorded, which can be used to to the statistical frequency of passenger taking taxis from si to guide the online taxi scheduling and package routing upon sj in history (i.e. passenger flow), we can estimate the waiting the real-time taxi ordering requests in the second phase, with time of packages at different time slots at the interchange details would be further addressed in the next section. It should stations. Specifically, the waiting time on the edge from si be noted that, although it is a time-consuming procedure to to sj is: find two shortest paths for each OD pair with a computation ∆T t = waiting time = (8) complexity of O(k2), it can be operated offline4. w N¯ ¯ where N is the average number of passengers taking taxis from V. PHASE II:ONLINE TAXI SCHEDULINGAND PACKAGE si to sj during the given time slots; ∆T is the length of that ROUTING time slot. Note that the waiting time obtained by Eq. 8 is in the statistical sense, and it could be much smaller in the real Given an OD pair of the package, the task of online taxi case due to the timely availability of right passenger-sending scheduling is the decision-making. To be more specific, the trips. phase should make a decision whether utilizing the current The waiting time component is substituted in advance when available hitchhiking ride for the package delivery or waiting computing the arriving-on-time probability of a given path at a for the future rides, according to the upcoming taxi ordering given time duration, thus the corresponding time distributions requests generated in real-time and the remaining time margin. is obtained by discretizing the driving time component only While the task of online package routing is much simpler, i.e., according to Definition 7. Fig. 5 shows the time distributions just assigning the delivery task to the scheduled taxi. As a after discretization. As one can observe, driving times greater result, the package will be delivered to the next stop, which is than 40 minutes occupy quite a small percentage, i.e., less than same to the destination of the taxi. The two steps impact each 1.38% in the taxi trajectory data in NYC. Note that the edge other mutually. On one hand, both the potential hitchhiking value would be +∞ if there was no passenger flow (or less rides for the package delivery and the time margin is highly than a certain amount) on the respected edge in history. related to the current stop where the package locates; on the other hand, which the next stop that the package will head to is C. Reference Path Finding determined by the scheduled taxi. In the following, we mainly On the basis of the constructed package transport network, 4Suppose there are k nodes in the network, the maximum number of all we find two reference paths for each given OD pair, i.e., OD pairs is k(k − 1). 8 focus on the step of online taxi scheduling, which includes the given the remaining time, which are two boundaries and used following operations. to guide the process of branch trimming. For brevity, we use 5 minPo→k d and maxPo→k d to represent the arriving-on- A. Exploitability Checking time probabilities of so → SP ath_min(sk, sd) and so → SP ath_max(s , s ), respectively. Before triggering this phase, we first need to conduct the k d exploitability checking, i.e., to determine: 1) whether the origin of an upcoming taxi ordering request is close to the location of Algorithm 1 F indNextStop(sk, sd,TN) the package; and 2) whether it ends at one of the interchange 1: ngb = Neigh(sk,TN); //get the neighbouring stations of stations. If both conditions are met, then we further need to sk based on the transport network topology. compare the maximum arriving-on-time probability of sending 2: ns = ∅; the package via sk (hitchhiking the current) to the maximum 3: refP = minPo→k d; arriving-on-time probability of sending the package via other 4: for i == 1 to |ngb| do potential next stations (the other neighbours of so in the 5: ski = ngb(i); transport network) (waiting for the future). If the former value 6: if refP ≤ is greater, then the package will be sent out immediately by P (t ≤ t0 − ∆t − t_min(ski, sd)|P atho→k→ki) then hitchhiking the current taxi ride; otherwise, the system will 7: ns = ns ∪ ngb(i); wait for the new future taxi ordering requests that may lead 8: end if to a higher arriving-on-time probability and the decision will 9: end for made again, given the new time margin. Thus, the core of the online taxi scheduling is the maximum arriving-on-time Depth-First-Searching: From to , we mainly apply probabilities computation and comparison. sk sd the Depth-First-Search (DFS) method to recursively get each possible path [48], and compute the maximum probability of B. Probability Computation and Comparison arriving-on-time. One exception is that the user specifies an

1) Probability Computation if Hitchhiking the Current: extremely long deadline, mathematically, maxPo→k d = 1, According to the definition, the maximum arriving-on-time implying that the package can be delivered on time for sure via probability for case if hitchhiking the current (suppose that the reference path. Therefore, no DFS is needed and a simple the recruited taxi will go to sk) can be computed as follows: taxi scheduling can be enough under such circumstance. The overall procedure of DFS starting from s can be summarized t0−∆t k Z as follows. Pod(t ≤ t0−∆t|P athokd) = Pok(ω)ukd(t ≤ t0−∆t)dω The core function is to find the next package stop of sk, with 0 the pseudocode shown in Algorithm 1. The very beginning task (9) is to get the neighouring stations of sk, given the topology of where P athokd refers to the path from so to sd while stopping the built package transport network (Line 1). DFS starts to find at s in the next; ∆t is the time difference between the k the next stop from one of the neighbouring nodes of sk (e.g., occurring time of the taxi ordering request and the birth time ski) in the loop (Lines 4∼9). Whether ski can be the next of the package delivery request, i.e. ∆t = tor.t − pr.t; t0 is package stop is determined by the inequation shown in Line the given time duration of the deadline; ukd is the maximum 6. In the in-equation, the reference probability (refP ) is first arriving-on-time probability from s to s , as defined in k d set to minPo→k d (Line 3). t_min(ski, sd) is the time cost Definition 5. of the reference path SP ath_min(ski, sd) estimated by the By definition, it is easy to compute the arriving-on-time historical taxi trajectory data; the right part of the in-equation is probability of a determined path under a given deadline time, the maximum arriving-on-time probability which corresponds R t0−∆t such as 0 Pok(t)dt. By contrast, it is rather challenging to to the ideal case that the package can be shipped from ski get the value of the latter part in Eq. 9. As discussed, one naive to sd in the most-efficient way (i.e., the time cost on each and straightforward way is first to enumerate all the possible edge in the package transport network is the minimal). If the paths from sk to sd, then compute the value of arriving-on- in-equation satisfies, it indicates that there exists a potential time probability for each of them, finally pick up the maximum path which can lead to a higher maximum arriving-on-time one as the final value. It is easy to understand that the trivial probability than the reference path, thus DFS will continue to method cannot work in real cases as the problem of finding all search with a new start from ski recursively, with the same possible path for a given OD pair is NP-hard. Actually, it is procedure to the DFS starting from sk; otherwise, DFS will also no need and some branches in the transport network can be terminated and the related branches will be trimmed at the be trimmed recursively. We propose a novel algorithm named same time. Thus, a recursion may be stopped either at some maxProb to compute the probability, which mainly consists intermediate node or generates a successful path reaching the of two operations, i.e, initialization and deep-first searching. given destination. If a valid path is resulted (P ath_valid), Initialization: From sk to sd, it will be easy to find two shortest paths, SP ath_min(sk, sd), SP ath_max(sk, sd). We 5 a → b refers to the package is sent from sa to sb directly, while a b further obtain the two corresponding reference paths from so refers to the package is sent from sa to sb via some intermediate stops that to sd via sk, and compute their arriving-on-time probabilities are determined by the reference paths between sa and sb. 9

TABLE I. STATISTICS OF THE TAXI TRAJECTORY AND ROAD NETWORK refP will be updated using its corresponding probability if DATA SETS. and only if it is greater than the previous value. The whole DFS ends when all neighhours of sk are checked Datasets Properties Statistics by repeatedly calling the above recursive DFS. Finally, the Taxi trajectory Number of taxis >19,000 maximum arriving-on-time probability if assigning the package Number of occupied rides ≈ 13 M Road network Number of road intersections 11,999 delivery task to the current available taxi shall be the final value Number of road segments 15,202 of refP . 2) Probability Computation if Waiting for the Future: If the criteria is satisfied: 1) the package has arrived at its destination; future taxi rides heading to any one of the other neighouring 2) the time is running out (the package cannot be delivered by nodes of so except for sk (marked as {ngb(so)} − sk) could the deadline), in that case, the system would report failure. For lead to a higher arriving-on-time probability, compared to the those failure package deliveries, empty taxis can be recruited case if hitchhiking the current, a better decision should be the dedicatedly to send them to the destinations. However, the waiting. The maximum arriving-on-time probability if waiting topic is out of the scope of the paper. for the future can be computed as follows: 0 VI.EXPERIMENTAL EVALUATIONS Pod(t ≤ t0 − ∆t |P athojd) = In this section, we empirically evaluate the performance t −∆t0 0Z of the proposed maxProb algorithm. We first introduce the 0 (10) max Poj(ω)ujd(t ≤ t0 − ∆t )dω experimental setup, baseline algorithms used for comparison, sj ∈{ngb(so)}−sk 0 evaluation metrics and results on algorithm efficiency and effectiveness. We discuss some open research issues to be where ∆t0 = ∆t + t (s , s ) and t (s , s ) refers to the edge w o j w o j further addressed in the end. value component of waiting time from so to sj. As can be seen, the major difference between Eq. 9 and Eq. 10 is the time margin. More specifically, less time margin is left for the A. Experimental Setup package deliveries as an additional time cost would be induced 1) Experimental Data: We use the real-world datasets for while waiting for the future taxi rides. Here, we simply use the evaluation, i.e. the road network data which is extracted the average waiting time to approximate the additional time from OpenStreetMap6, and one month of taxi trajectory data cost. generated by over 19,000 taxis in the city of New York (NYC), Similarly, all maximum arriving-on-time probabilities of US. Readers can refer to [1] for the details on how to extract waiting for the future exploitable taxi rides from so can the road network from the crowd-sourced open platform (i.e., be computed, and the maximal one among them will be OpenStreetMap) correctly. We determine package interchange chosen to represent the maximum arriving-on-time probability stations according to the algorithm discussed earlier. if assigning the package delivery task to the future taxis. For the taxi trajectory data, we split it data into training and 3) Probability Comparison: As discussed, once receiving testing sets, according to the date of the month. Specifically, a real-time taxi ordering request, on-line taxi scheduling and the training set contains taxi trajectories on 1st∼20th, January, package routing will be activated, and the package may be 2013, which are used to build package transport network. It shipped to some intermediate stop by hitchhiking the current should be noted that for the taxi trajectory data in NYC, ride or stands still at the current stop by comparing those no detailed travel routes between the pick-up and drop-off two maximum arriving-on-time probabilities. Note that the points are provided due to the privacy considerations. The remaining time margin shrinks as time goes by, the two testing trajectories were generated from 21st to 31st, January, probabilities computed in Eqns. 9 and 10 are dynamically 2013, which are used as the real-time taxi ordering requests changed, thus the better decision (hitchhiking the current or (TOR) for testing the performance of the proposed maxProb waiting for the future) can be also adjusted adaptively “on- algorithm. Table I shows some basic statistics of the taxi the-fly”. trajectory and road network data. 2) Package Delivery Request: Since the data sets do not C. Package Information Updating contain information about package delivery requests, we apply After the package is sent to the next station whether by simple mechanism to simulate it. The novel package delivery hitchhiking the current or future rides, the information about system targets the city-wide person-to-person service. Hence, the package delivery request should be updated. To be more to simulate a package delivery request (PDR), we randomly specific, the origin of the package should be set as the updated generate its birth time, origin and destination. Regarding the station that the package locates; the birth time should be set origin and destination, any package should be originated and as the time when the package arrives at the current stop. The ended at the interchange stations. newly updated package delivery request will be used as the 3) Evaluation Environment: All the evaluations in the paper input of Phase II. are programmed using Java language under the Eclipse J2SE 1.5 integrated development environment, and run on an Intel D. Stopping Criteria Check Core i5-4950 PC with 8-GB RAM and Windows 7 operation system. For a package delivery, the previous three operations will be iteratively conducted until one of the following two stopping 6www.openstreetmap.org 10

4) Evaluation Metrics: We adopt the following three metrics that is commonly seen in other domains [60], [64]. to evaluate the proposed maxProb. (3) Direct - This method waits for the taxi heading to the Success Rate. The success rate is the ratio of the number of destination near the interchange stations that the package packages which can be delivered successfully within a given will be delivered directly, without any intermediate stops. deadline (i.e. time duration) to the number of total packages Specifically, the package will be assigned to the taxi that will (i.e. the number of package delivery requests simulated). pick up and drop off a passenger near the interchange stations that the package locates and heads to, respectively. Thus, no |T C(P ath(o d )) ≤ deadT | SR(t ≤ deadT ) = p p (11) relays are needed. |PQ| Remark. Each relay in DesCloser is effective as it ensures that the package would move towards its destination step by where P ath(op dp) represents the optimal path generated by the proposed maxProb algorithm for a given package step; while some relays in FCFS can be ineffective as the delivery request; deadT is the given deadline. The delivery package moves further away from its destination. For Direct, performance is better if the success rate is higher within a it may be inefficient for package deliveries where there is given shorter deadline. little passenger flow in-between. However, all three baseline Regarding the deadline setting, we do not set an absolute algorithms do not take the arriving-on-time probability of value for all package deliveries since package originated package deliveries into account, thus probably resulting in a (ended) at different locations would need absolutely varied high failure rate. time. Thus, for an individual package delivery, we set a relative deadline separately instead, according to the following C. Experimental Objectives equation: deadT = t + extraT (12) We plan the experiments to address the following questions. avg Question 1: How much computational resource is required in which tavg is the average value of the time cost by the two to generate the response for a package delivery request? reference paths, which is obviously different for packages with Question 2: How does maxProb perform under different different OD pairs. extraT is the extra time value imposed by given deadlines? the user; a smaller extraT indicates that the user needs the Question 3: How does maxProb perform w.r.t the birth time package more urgently and wants it to be arrived more timely. of package deliveries? Number of Relays. The number of relays (Numrelays) dur- Question 4: How does maxProb perform w.r.t the locations ing a package delivery is defined as the number of participating (both origins and destinations) of package deliveries? taxis Nump_taxis (Formula 13). Question 5: How many packages can be delivered daily on average (i.e., throughput) with the proposed system? = (13) Numrelays Nump_taxis The first question concerns the efficiency of maxProb, and On one hand, fewer relays generally mean a lower chance Questions 2∼4 are related to its effectiveness. To answer the of package loss or damage, and perhaps less overhead cost. first question, we compute the response time of the algorithms. On the other hand, fewer number of participating taxis may Since passenger flows are both time- and space-dependent, imply requiring less reward cost to taxi drivers. Thus, the to assess the effectiveness of the different algorithms, we performance is better if the number of relays is smaller. calculate their success rates and the number of relays with Package Throughput. The average number of package de- respect to packages to be dispatched to different parts of the liveries that the system can complete successfully per day. The city at different time of the day. We test the throughput of the system achieves better performance if the package throughput proposed system and the success rate under different number is bigger. of package requests generated per hour, and also examine the system throughput given different number (density) of interchange stations in the designate city (Question 5). B. Baseline Algorithms To show the superior performance of our proposed al- gorithm, we compare it with the following three baseline D. Experimental Results algorithms. 1) Results of Response Time: We first analyze the main (1) FCFS - This method adopts the First Come First Service operations involved in the four algorithms respectively. For strategy. Specifically, the package will be assigned to the first a given package delivery request (pr), when a new real-time taxi that will pick up a passenger near the interchange station taxi ordering request (tor) comes in, all four algorithms need to that the package locates, regardless of its destination. In fact, determine whether tor starts near the origin of the package and this algorithm always favors the strategy of hitchhiking the stops at some interchange station, i.e., exploitability checking. current, which is also known as an extension of the simple FCFS will recruit the first taxi that satisfies the criteria, but for and well-known flooding strategy [5], [28]. DesCloser, it needs to further determine whether the heading (2) DesCloser - This method assigns the package to the first destination of the taxi is closer to the destination of the pack- taxi that will head to somewhere closer to the destination of age, compared to its current location. For Direct, it also needs the package, compared to the current station of the package. to determine whether the taxi would head to the destination of This algorithm implements a distance-based geo-cast scheme the package. Thus one more comparison operation is required 11

30 30 1 0.99 25 25 0.98 0.97 20 20 0.96 0.95 15 15 0.94 Success Rate Success Time Time

Success Rate Success 0.93 10 10 0.92

Response TimeResponse (ms) 0.91 5 5 0.9 20 mins 40 mins 60mins 80mins 100mins Different Deadlines 0 0 Extra Time maxProb Direct DesCloser FCFS STOA DesCloserSTOA DirectDirect DesCloserFCFS FCFS Fig. 8. Evaluation results of success rate under different deadlines. Fig. 6. Evaluation resultsDifferent of response MethodDifferent time for Method four different algorithms. Time Cost w.r.t Extra Time 60000 Time Cost w.r.t Extra Time in the probability computation for maxProb in terms of the 600 50000 500 average response time saving, with the result shown in Fig. 7. 100 without branch-cutting 100100 withwithout branch-cutting branch-cutting 40000 400 To better illustration, we also highlight the result w.r.t different 100 with branch-cutting 30000 300 deadlines within the range of [20, 40] minutes in the left- Time Cost Time 200 Without Trimming top part of the figure. A significant time saving is obtained 20000 100 WithTrimming

Time Cost Time with the introduction of branch trimming. To be more specific, 0 10000 20 mins 40 mins 60 mins the gap of the average response time with/without branch Response TimeResponse (ms) Extra Time 0 trimming increases exponentially wider as the given deadline becomes bigger, what is more, all the average response times 20 mins 40 mins 60 mins 80 mins 100 mins with branch trimming remain stable and small under all given Different Deadlines Extra Time deadlines. The package delivery requests are the same for the efficiency studies, with a number of 100. Fig. 7. Comparison results of response time of maxProb with/without branch trimming w.r.t different deadlines. 2) Results of Success Rate: We present results of the perfor- mance of maxProb in terms of the success rate under different deadlines in Fig. 8. More specifically, extraT is set in a range for both DesCloser and Direct algorithms. For maxProb, from 20 to 100 minutes, with an equal interval of 20 minutes. the procedure is even more complicated, mainly requiring As one can see, the success rate under all deadlines is high, additional probability computation and comparison operations, with a value above 90%. The success rate firstly becomes as discussed previously. Each algorithm needs to repeat its own slightly higher then decreases gradually as users allow a longer operation procedure at each intermediate station (except for deadline. The highest success rate appears when the extraT Direct) and thus the total response time is the accumulated is set 60 minutes, with the value of around 95%. The observed computational time over all iterations. phenomena seems somehow counterfactual at the first glance We show the comparison results of average response time of as the success rate should be higher when setting a longer the four different algorithms in Fig. 6. The average response deadline in intuition. The root cause is that: when giving a time of FCFS is the biggest while that of Direct is the bigger deadline, the arriving-on-time probability if hitching smallest among all algorithms. The average response times the current becomes greater, as guaranteed by Theorem 1. An of DesCloser and maxProb are in-between and DesCloser extreme case is that the probability would always equal one costs slightly more time than maxProb. More precisely, the and dominates the other possibilities, as a consequence, the average response time of Direct is within 7 milliseconds; the maxProb algorithm tends to select the hitchhiking the current average response time of maxProb is around 22 milliseconds; strategy while routing packages. Under such circumstance, the average response time of FCFS is no more than 30 maxProb degrades to the FCFS algorithm to some extent, milliseconds. All results indicate that all four algorithms are causing the negligible decrease of the success rate. To get rid of quite efficient, and can plan and adjust the shipping routes such degradation, the key issue is to lower the value computed in real time. We also observe an interesting phenomenon: by Eqn. 9. Thus, one potential solution can be the reduction although FCFS only involves two simple comparisons for of remaining time margin during package routing when ap- each candidate taxi, it is the most time-consuming method plying maxProb algorithm. For instance, if hitchhiking the accumulatively. In comparison, maxProb which contains the current strategy is always preferred during package routing, most sophisticated operations needs a shorter response time the remaining time margin can be manually reduced to 90% than FCFS and DesCloser. We argue that this is because: of the true one that imposed by the user. It should be noted FCFS and DesCloser require more rounds of computation that the success rate of FCFS is much lower, compared to (i.e., more relays) than the maxProb algorithm. maxProb, which will be verified in the following experiments. We further evaluate the effectiveness of the branch trimming The number of package delivery requests is 10,000, with the 12

1 1 0.9 0.9 0.8 0.8 Day-time day 0.7 0.7 nightNight-time 0.6 0.6 0.5 0.5 0.4 0.4 Success Rate Success Success Rate Success Success Rate Success 0.3 0.3 0.2 0.2 0.1 0.1 0 0 maxProb DesCloser FCFS 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 STOA Direct DesCloser FCFS Time of the Day Method (a) Fig. 10. Evaluation results of success rate for four different algorithms at day time and night time respectively. 10000 9000 8000 periods, although hitchhiking opportunities are accumulated, it 7000 is still insufficient, resulting in a poor success rate. In this study, 6000 the number of package delivery requests is 1,000 for each hour time; the extraT is fixed to 60 minutes. 5000 We report the result of success rate for four different 4000 algorithms w.r.t the time of the day. For simplicity, we do not 3000 provide the success rate of the four algorithms under every 2000

# of Passenger-delivery Trips # of Passenger-delivery hour of a day, and just split a day into two time slots, i.e, 1000 day time from 8:00 to 18:00 and the rest is the night time, 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 respectively. In the time dimension, as shown in Fig. 10, for Time of the Day all four algorithms, the success rate is higher at day time (b) than night time, except for FCFS which only achieves the similarly low success rate at both time slots. Direct algorithm Fig. 9. Evaluation results of success rate under different birth times of the returns quite close performance at day time and night time. package deliveries. As predicted, the success rate of FCFS algorithm is the lowest, i.e., below 10%. Compared to the other three baseline algorithms, maxProb achieves the best success rate at both day birth time uniformly distributed at the day-time (i.e., from 8:00 time and night time. In this experiment, the number of package to 18:00). delivery requests is 10,000, with the birth time uniformly We are also interested at the performance of the success rate distributed at the day time and nigh time, respectively; the at different time of the day. As can be observed in Fig. 9(a), extraT is fixed to 60 minutes. the success rate is quite stable and high from 8:00 to 18:00, In real situation, there are more passengers who prefer to with the value above 95%, demonstrating the usefulness of the take taxis at some interchange stations, providing more hitch- proposed system in practice during the day time. The success hiking opportunities for package deliveries, probably leading rate is extremely low during late-night and early-morning. For to a better success rate. We thus manually categorize the instance, the success rate is even lower than 10% from 2:00 to interchange stations into two classes (i.e. popular, unpopular) 5:00. To get an in-depth understanding on the root cause, we in advance, by taking its total number of pick-ups and drop-offs further show the number of passenger-delivery trips at different in history into account. The number of interchange stations la- time of the day in Fig. 9(b). As expected, the success rate is beled as popular is 19; and the number of interchange stations highly correlated with the number of passenger-delivery trips, labeled as unpopular is 15. We further identify three categories i.e., a greater number of passenger-delivery trips implies a of package delivery requests, according to the labels of the higher success rate since more hitchhiking opportunities for original and destination station, as shown in Table II. Then, package deliveries are provided. An interesting observation is we test the performance on success rate for each category. that the success rate at 17:00 is still high though the number The number of package delivery requests for each category is of passenger-delivery trips is not large at that time, compared same, with a value of 10,000. to nearby hours. This is because: on one hand, the number of hitchhiking opportunities for package deliveries should be TABLE II. THREE IDENTIFIED CATEGORIES OF PACKAGE DELIVERY REQUESTS. accumulated during the given deadline (usually bigger than an hour); on the other hand, the number of passenger-delivery Description trips in the next two hours increases and remains high. On the Category I packages start and end at popular stations (p2p) contrary, the number of passenger-delivery trips during late- Category II packages start or end at popular stations (p2u or u2p) night and early morning is continuously small. During that time Category III packages start and end at unpopular stations (u2u) 13

1 5 0.9 0.8 p2p 4 p2p 0.7 u2uu2u 0.6 p2u or u2pp2u or u2p 3 0.5 0.4 2 Success Rate Success Success Rate Success Transhippment

0.3 of Number Relays 0.2 1 0.1 0 0 maxProbSTOA Direct DesCloserDesCloser FCFSFCFS 20mins 40mins 60mins 80mins 100mins Different Deadlines Region Extra Time Fig. 11. Evaluation results of success rate for four different algorithms of different package categories. Fig. 12. Evaluation results of the number of relays under different deadlines.

5 Fig. 11 shows the comparison of success rate of the four algorithms under a given deadline (the extraT is fixed to 4 60 minutes in this study) for the three categories. maxProb achieves the best performance for three categories, while the 3 performance of FCFS is the worst. One exception is the suc- cess rate of Direct for the Category III packages. In such case, 2

without any relay at the intermediate stations, Direct obtains of Number Relays an extremely low success rate (i.e., under 5%), which is due to 1 the fact that there is insufficient passenger flow between two unpopular stations. For the same algorithm, the performance 0 is also different for different categories of package delivery 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 requests. The performance for the Category I packages is Time of the Day the best; the performance is the worst for the Category III Fig. 13. Evaluation results of the number of relays under different time of packages. Particularly, maxProb ensures that around 90% of the day. the Category I packages (75% for Category II and III) can be arrived-on-time successfully. While for FCFS, only less than 10% of all Category packages can be delivered by the deadline. than the night time. Moreover, a little surprisedly, the number Similar to the the previous results, DesCloser performs better of relays resulted from DesCloser and FCFS are quite close than FCFS and worse than maxProb for all three categories. to that obtained by maxProb, implying that the number of 3) Results of Number of Relays: We compare the number of relays is somehow independent of the adopted algorithms for a 7 relays of maxProb under different given deadlines, with the successful package delivery. We will investigate deeper about results shown in Fig. 12. One can observe that the number of the potential causes qualitatively and quantitatively such as relays presents an ascending tendency with the increase of the the geographical and temporal distributions of the successful given deadline. One possible reason is that more ineffective package deliveries in the future work. relays (i.e., the package moves back and forth towards the We further report the results of the number of relays for four destination commonly) are resulted since maxProb is inclined algorithms w.r.t the package categories, as shown in Fig. 15. to hitchhike the coming taxi immediately, as discussed. We also show the results on how the number of relays obtained by maxProb varies under different time of day 6 in Fig. 13. Overall, the number of relays is more or less 5 Day-time unchanged (i.e., with a value of around 3) during the whole day nightNight-time day, except for the early morning when the number of relays 4 is relatively smaller, probably caused by the extremely little passenger flow during that time. 3 We report the results of the number of relays for four

Transhippment 2 algorithms w.r.t the day time and nigh time respectively, as of Number Relays shown in Fig. 14. As can be predicted, the number of relays 1 for Direct shall equal one. For the other three algorithms, slightly more relays are generally required at the day time 0 maxProbSTOA DirectDirect DesCloser FCFS 7 For the experiments on the number of relays, it should be noted that the Method number of relays is counted and averaged only for the packages that are Fig. 14. Evaluation results of the number of relays for four algorithms at delivered on time. day time and night time respectively. 14

6 1 25000 p2p 5 p2p u2u 0.8 20000 4 p2u or u2pp2u or u2p

3 0.6 15000

Transhippment 2 Number of Number Relays 0.4 10000 Throughput Success Rate Success

1 Rate Success Delivered Packages Delivered 0.2 5000 0 Successsuccess Rate rate maxProbSTOA DirectDirect DesCloserDesCloser FCFS deliveredThroughput packages Region 0 0 Fig. 15. Evaluation results of the number of relays for four algorithms for 0 10000 20000 30000 40000 50000 different package categories. Number of Package Delivery Requests Packages

Similarly, the number of relays for Direct is one, regardless of Fig. 16. Evaluation results of the number of relays for four algorithms for the package categories. Compared to the other two algorithms different package categories. (i.e., DesCloser and FCFS), maxProb requires slightly fewer relays for all three package categories. Similar to the results of 1 10000 the success rate w.r.t the package categories, for all algorithms except for Direct, the performance in terms of the number of 0.8 8000 relays is the best for the Category I packages; the performance for the Category III packages is the worst. 4) Results of Package Throughput: It is necessary to eval- 0.6 6000 uate the package throughput of the system, since it is a primary consideration when applied to real-life scenarios. 0.4 4000 Utilization

Fig. 16 shows the results on the number of packages that the Throughput Utilization Ratio proposed system can transport successfully w.r.t the number Packages Delivered 0.2 2000 of total generated package delivery requests per day, together Utilizationutilization Ratio rate with the success rate. More specifically, package throughput deliveredThroughput packages increases gradually with the number of generated package 0 0 delivery requests before approaching a stable value. On the 10 20 30 40 50 Number of Interchange Stations contrary, the success rate declines almost linearly with the Number of Interchange Station number of generated package delivery requests. As can be seen, the maximum package throughput is around 20,000 per day, however, the corresponding success rate is quite low (i.e., Fig. 17. Evaluation results of the number of relays for four algorithms for different package categories. around 40%) and might be not applicable in real life. To make the proposed system practical in real situations, the package throughput should be around 9,500 per day while maintaining locations but with the same number) may affect the throughput a relatively promising success rate. result, but the overall trend shall be the same. We further examine the package throughput under different density (number) of interchange stations, as shown in Fig. 17. As expected, the more interchange stations our system has, E. Open Research Issues the higher throughput it can achieve (the number of package delivery requests for this study is 10,000). We argue that In this section, we discuss some open issues which are not the root cause is the improvement on utilization ratio of resolved in this work but will be addressed in the future. taxi trips for package deliveries, which is defined as the Package Delivery Request. Modeling package delivery ratio between the number of taxi trips involved in package request is a challenging task, requiring additional data sources, deliveries and the total number of taxi trips. We thus plot the such as demographics, socio-economics, land use, and on-line utilization ratio under different density of interchange stations purchasing activities. How to accurately model the city-wide in Fig. 17 as well. As evidenced, an increasing utilization package flow distributions can be a separate research problem ratio (35.5%, 58.3% and 64.9% respectively) is achieved as itself, and there has not yet been any reliable model as far as we the interchange stations in the city becomes denser (18, 34 know [25]. In the scope of this paper, we focus on discovering and 46 respectively). Moreover, there is still a considerable the near-optimal package delivery paths with high efficiency, room to increase the package throughput further if placing given any package delivery request. Hence, we simply generate more interchange stations to better utilize the taxi trips. We the package birth time, origins and destinations randomly. In are aware that the choice of interchange stations (different the future, we plan to incorporate other more real-world data 15 sources to model the city-wide package delivery requests [47], the interchange stations (number and location). Finally, we and integrate them into our framework. plan to implement and test our system with real users in actual Transport Network Optimization. At the current research settings, collecting feedback on how to further improve the of crowd logistics, most studies focus on developing advanced service. package routing algorithms. In contrast, less attention has been paid to the package transport network optimization (i.e., the ACKNOWLEDGEMENTS number/locations of interchange stations). In the future, we plan to address the two issues simultaneously. The work was supported by the National Key Research and Multi-objective Optimization. The on-time performance is Development Project of China (No. 2017YFB1002000), the one of the important optimization objectives of crowdsourced National Science Foundation of China (No. 61602067 and No. logistics. From the side of logistics service providers, other 61872050), Chongqing Basic and Frontier Research Program objectives are also equally important, such as the money paid (No. cstc2018jcyjAX0551) and the Fundamental Research to the taxi drivers. The more taxi drivers are recruited, the Funds for the Central Universities (No. 2018cdqyjsj0024). more rewards are necessary. Thus, to minimize the number of Chao Chen is the corresponding author. package relays is worth to be considered. From the side of taxi drivers, the minimization of detour driving distance is of great REFERENCES importance. In the future, we plan to formulate the problem of [1] L. Alarabi, A. Eldawy, R. Alghamdi, and M. F. Mokbel. TAREEG: crowd delivery via hitchhiking taxi rides as the multi-objective a mapreduce-based web service for extracting spatial data from open- optimization one. streetmap. In Proc. of the ACM SIGMOD, pages 897–900, 2014. Practical Issues. There are still many practical issues to [2] T. H. Ali, S. Radhakrishnan, S. Pulat, and N. C. Gaddipati. Relay net- be addressed before truly realizing the system. The capacity work design in freight transportation systems. Transportation Research issue of the interchange station is one example. On one hand, Part E: Logistics and Transportation Review, 38(6):405–422, 2002. interchange stations frequently visited by passengers are more [3] C. Archetti, M. Savelsbergh, and M. G. Speranza. The vehicle routing problem with occasional drivers. European Journal of Operational likely to be recruited for relaying packages, and thus require Research, 254(2):472–480, 2016. a larger capacity in general. On the other hand, the spatial [4] A. M. Arslan, N. Agatz, L. Kroon, and R. Zuidwijk. Crowdsourced and temporal traffic patterns of package deliveries also have a delivery—a dynamic pickup and delivery problem with ad hoc drivers. critical impact on the capacity issue [16]. Other practical issues Transportation Science, to appear, 2018. such as the incentive mechanism for taxi drivers, the package [5] A. Balasubramanian, B. Levine, and A. Venkataramani. Dtn routing pricing methods, the cooperation cost of the interchange sta- as a resource allocation problem. In Proc. of ACM SIGCOMM, pages tion, the size of the package, the standard for container of the 373–384, 2007. package also need further investigation. [6] R. E. Bellman and S. E. Dreyfus. Applied dynamic programming. Princeton university press, 2015. [7] P. S. Castro, D. Zhang, C. Chen, S. Li, and G. Pan. From taxi GPS VII.CONCLUSIONAND FUTURE WORK traces to social and community dynamics: A survey. ACM Computing Surveys, pages 17:1–17:34, 2013. In this paper, we present a novel framework called Crowd- [8] P. S. Castro, D. Zhang, and S. Li. Urban traffic modelling and prediction Express for package delivery path planning. The framework using large scale taxi gps traces. In Proc. of Pervasive, pages 57–72, proposes to exploit hitchhiking rides provided by occupied 2012. taxis to transport packages in time without degrading the [9] C. Chen, S. Jiao, S. Zhang, W. Liu, L. Feng, and Y. Wang. Tripimputor: quality of passenger services. More specifically, we first built Real-time imputing taxi trip purpose leveraging multi-sourced urban the package transport network by mining the taxi GPS tra- data. IEEE Transactions on Intelligent Transportation Systems, 99:1– jectory data offline. Then we proposed a two-phase approach 13, 2018. [10] C. Chen, J. Ma, Y. Susilo, Y. Liu, and M. Wang. The promises of big for package delivery path planning with a novel and compre- data and small data for travel behavior (aka human mobility) analysis. hensive process. Using real-world datasets which include road Transportation Research Part C: Emerging Technologies, 68:285–299, network data, and a large-scale taxi trajectory data generated 2016. by over 19,000 taxis in a month in NYC, US, we compared our [11] C. Chen, D. Zhang, P. Castro, N. Li, L. Sun, S. Li, and Z. Wang. proposed method with three baseline algorithms, and showed iBOAT: Isolation-based online anomalous trajectory detection. IEEE that our method is more efficient and effective. Transactions on Intelligent Transportation Systems, 14(2):806–818, 2013. In the future, we plan to broaden and deepen this work [12] C. Chen, D. Zhang, B. Guo, X. Ma, G. Pan, and Z. Wu. TripPlanner: in several directions. First, we plan to correlate the package Personalized trip planning leveraging heterogeneous crowdsourced digi- deadline to the package pricing. Currently, we set a general and tal footprints. IEEE Transactions on Intelligent Transportation Systems, relative deadline for all package deliveries and handle them 16(3):1259–1273, 2015. equally. In the near future, we plan to set different priorities [13] C. Chen, D. Zhang, N. Li, and Z.-H. Zhou. B-Planner: Planning for packages according to users’ expected arriving time. For bidirectional night bus routes using large-scale taxi gps traces. IEEE instance, users are charged higher if they want their packages Transactions on Intelligent Transportation Systems, 15(4):1451–1465, to be arrived earlier, and new package routing algorithms 2014. [14] C. Chen, D. Zhang, X. Ma, B. Guo, L. Wang, Y. Wang, and E. Sha. should be developed accordingly. Second, we intend to take CrowdDeliver: Planning city-wide package delivery paths leveraging actions to improve the system throughput and coverage. Such the crowd of taxis. IEEE Transactions on Intelligent Transportation as grouping packages with close destinations and optimizing Systems, 18(6):1478–1496, 2017. 16

[15] W. Chen, M. Mes, and M. Schutten. Multi-hop driver-parcel matching [36] J. McInerney, A. Rogers, and N. R. Jennings. Crowdsourcing physical problem with time windows. Flexible services and manufacturing package delivery using the existing routine mobility of a local popula- journal, pages 1–37, 2017. tion. In the Orange D4D Challenge, 2014. [16] T. Crainic, L. Gobbato, G. Perboli, and W. Rei. Logistics capacity [37] L. Moreira-Matias, J. Gama, M. Ferreira, J. Mendes-Moreira, and planning: A stochastic bin packing formulation and a progressive L. Damas. Predicting taxi–passenger demand using streaming data. hedging metaheuristic. Technical report, CIRRELT-2014-66, 2014. IEEE Transactions on Intelligent Transportation Systems, 14(3):1393– [17] A. Devari. Crowdsourced last mile delivery using social networks. 1402, 2013. Master’s thesis, The State University of New York at Buffalo, 2016. [38] Y. Nie and Y. Fan. Arriving-on-time problem: discrete algorithm that [18] Y. Ding, C. Chen, S. Zhang, B. Guo, Z. Yu, and Y. Wang. GreenPlanner: ensures convergence. Transportation Research Record: Journal of the Planning personalized fuel-efficient driving routes using multi-sourced Transportation Research Board, (1964):193–200, 2006. urban data. In IEEE International Conference on Pervasive Computing [39] B. Pan, Y. Zheng, D. Wilkie, and C. Shahabi. Crowd sensing of traffic and Communications (PerCom), pages 207–216, 2017. anomalies based on human mobility and social media. In Proc. of ACM [19] A. Doan, R. Ramakrishnan, and A. Y. Halevy. Crowdsourcing systems SIGSPATIAL, pages 344–353, 2013. on the world-wide web. Commun. ACM, 54(4):86–96, Apr. 2011. [40] G. Pan, G. Qi, Z. Wu, D. Zhang, and S. Li. Land-use classification [20] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al. A density-based using taxi GPS traces. IEEE Transactions on Intelligent Transportation algorithm for discovering clusters in large spatial databases with noise. Systems, 14(1):113–123, 2013. In Proc. of ACM KDD, volume 96, pages 226–231, 1996. [41] H. B. Rai, S. Verlinde, J. Merckx, and C. Macharis. Crowd logistics: [21] D. J. Fagnant and K. M. Kockelman. The travel and environmental an opportunity for more sustainable urban freight transport? European implications of shared autonomous vehicles, using agent-based model Transport Research Review, 9(3):39, 2017. scenarios. Transportation Research Part C: Emerging Technologies, [42] A. J. Rohm and V. Swaminathan. A typology of online shoppers based 40:1–13, 2014. on shopping motivations. Journal of Business Research, 57(7):748 – [22] E. Fatnassi, J. Chaouachi, and W. Klibi. Planning and operating a 757, 2004. shared goods and passengers on-demand system for sus- [43] A. Sadilek, H. Kautz, and J. P. Bigham. Finding your friends and tainable city-logistics. Transportation Research Part B: Methodological, following them to where you are. In Proc. of WSDM, pages 723–732, 81:440–460, 2015. 2012. [23] Y. Ge, H. Xiong, A. Tuzhilin, K. Xiao, M. Gruteser, and M. Pazzani. An [44] A. Sadilek, J. Krumm, and E. Horvitz. Crowdphysics: Planned and energy-efficient mobile recommender system. In Proc. of ACM KDD, opportunistic crowdsourcing for physical tasks. In Proc. ICWSM, 2013. pages 899–908, 2010. [45] I. Semanjski and S. Gautama. Crowdsourcing mobility insights– [24] B. L. Golden, L. Levy, and R. Vohra. The orienteering problem. Naval reflection of attitude based segments on high resolution mobility be- research logistics, 34(3):307–318, 1987. haviour data. Transportation Research Part C: Emerging Technologies, [25] J. González-Feliu, M. G. Cedillo-Campo, and J. L. García-Alcaraz. An 71:434–446, 2016. emission model as an alternative to od matrix in urban goods transport [46] S. Skiena. Dijkstra’s algorithm. Implementing Discrete Mathematics: modelling. Dyna, 81(187):249–256, 2014. Combinatorics and Graph Theory with Mathematica, Reading, MA: [26] B. Guo, Z. Wang, Z. Yu, Y. Wang, N. Y. Yen, R. Huang, and Addison-Wesley, pages 225–227, 1990. X. Zhou. Mobile crowd sensing and computing: The review of an [47] X. Tan, Y. Shu, X. Lu, P. Cheng, and J. Chen. Characterizing and emerging human-powered sensing paradigm. ACM Computing Surveys, modeling package dynamics in express shipping service network. In 48(1):7:1–7:31, 2015. Proc of IEEE Congress on Big Data, pages 144–151, 2014. [27] N. Kafle, B. Zou, and J. Lin. Design and modeling of a crowdsource- [48] R. Tarjan. Depth-first search and linear graph algorithms. SIAM journal enabled system for urban parcel relay and delivery. Transportation on computing, 1(2):146–160, 1972. Research Part B: Methodological, 99:62–82, 2017. [49] The Economist. Same-day dreamer. http://www.economist.com/news/ [28] Y.-B. Ko and N. H. Vaidya. Flooding-based geocasting protocols for business/, 2014. mobile ad hoc networks. ACM/Springer MONET, 7(6):471–480, 2002. [50] Y. Tong, Y. Chen, Z. Zhou, L. Chen, J. Wang, Q. Yang, J. Ye, and W. Lv. [29] B. Leng, H. Du, J. Wang, L. Li, and Z. Xiong. Analysis of taxi drivers’ The simpler the better: a unified approach to predicting original taxi behaviors within a battle between two taxi apps. IEEE Transactions on demands based on large-scale online platforms. In Proceedings of the Intelligent Transportation Systems, 17(1):296–300, 2015. 23rd ACM SIGKDD International Conference on Knowledge Discovery [30] B. Li, D. Krushinsky, H. A. Reijers, and T. Van Woensel. The share- and Data Mining, pages 1653–1662, 2017. a-ride problem: People and parcels sharing taxis. European Journal of [51] H. Üster and P. Kewcharoenwong. Strategic design and analysis of Operational Research, 238(1):31–40, 2014. a relay network in truckload transportation. Transportation Science, [31] B. Li, D. Krushinsky, T. Van Woensel, and H. A. Reijers. The share- 45(4):505–523, 2011. a-ride problem with stochastic travel times and stochastic delivery [52] J. Xu, R. Rahmatizadeh, L. Bölöni, and D. Turgut. Real-time prediction locations. Transportation Research Part C: Emerging Technologies, of taxi demand using recurrent neural networks. IEEE Transactions on 67:95–108, 2016. Intelligent Transportation Systems, 19(8):2572–2581, 2018. [32] Y. Liu, B. Guo, C. Chen, H. Du, Z. Yu, D. Zhang, and H. Ma. [53] H. Yao, F. Wu, J. Ke, X. Tang, Y. Jia, S. Lu, P. Gong, J. Ye, and Z. Li. Foodnet: Toward an optimized food delivery network based on spatial Deep multi-view spatial-temporal network for taxi demand prediction. crowdsourcing. IEEE Transactions on Mobile Computing, to appear, In 2018 AAAI Conference on Artificial Intelligence (AAAI’18), 2018. 2018. [54] Z. Yu, H. Xu, Z. Yang, and B. Guo. Personalized travel package with [33] Y. Liu, C. Liu, N. Yuan, L. Duan, Y. Fu, H. Xiong, S. Xu, and J. Wu. multi-point-of-interest recommendation based on crowdsourced user Exploiting heterogeneous human mobility patterns for intelligent bus footprints. IEEE Transactions on Human-Machine Systems, 46(1):151– routing. In Proc. of IEEE ICDM, pages 360–369, Dec 2014. 158, 2016. [34] Y. Liu, F. Wang, Y. Xiao, and S. Gao. Urban land uses and traffic [55] N. J. Yuan, Y. Zheng, L. Zhang, and X. Xie. T-finder: A recommender ‘source-sink areas’: Evidence from gps-enabled taxi data in shanghai. system for finding passengers and vacant taxis. IEEE Transactions on Landscape and Urban Planning, 106(1):73–87, 2012. Knowledge and Data Engineering, 25(10):2390–2403, 2013. [35] R. Lowe and M. Rigby. The last mile–exploring the online purchasing [56] D. Zhang, B. Guo, and Z. Yu. The emergence of social and community and delivery journey. Technical report, Barclays, 2014. intelligence. Computer, (7):21–28, 2011. 17

[57] D. Zhang, L. Sun, B. Li, C. Chen, G. Pan, S. Li, and Z. Wu. Under- Weichen Liu (S’07-M’11) is currently a professor at standing taxi service strategies from taxi GPS traces. IEEE Transactions College of Computer Science, Chongqing University, on Intelligent Transportation Systems, 16(1):123–135, 2014. China. Before that, he worked as a postdoctoral [58] D. Zhang, L. Wang, H. Xiong, and B. Guo. 4w1h in mobile crowd research fellow with teaching duties in Nanyang sensing. IEEE Communications Magazine, 52(8):42–48, 2014. Technological University, Singapore. He received the PhD degree from the Hong Kong University [59] J. Zhang, Y. Zheng, D. Qi, R. Li, X. Yi, and T. Li. Predicting citywide of Science and Technology, Hong Kong, and the Artificial crowd flows using deep spatio-temporal residual networks. MEng and BEng degrees from Harbin Institute of Intelligence , 259:147–166, 2018. Technology, China. [60] L. Zhang, B. Yu, and J. Pan. Geomob: A mobility-aware geocast scheme Dr. Liu served as the chairs, technical program in metropolitans via and . In Proc. of INFOCOM, pages committee members, editors and reviewers for over 1279–1787, April 2014. 20 premier international conferences and journals. He authored and co- [61] X. Zheng, X. Liang, and K. Xu. Where to wait for a taxi? In Proc. of authored more than 70 research papers in peer-reviewed journals, conferences ACM UrbComp, pages 149–156, 2012. and books, and received the best paper candidate awards from ASP-DAC 2016, [62] Y. Zheng, Y. Liu, J. Yuan, and X. Xie. Urban computing with taxicabs. CASES 2015, CODES+ISSS 2009, the best poster award from AMD-TFE In Proc. of UbiComp, pages 89–98, 2011. 2010, and the most popular poster award from ASP-DAC 2017. His research interests include embedded and real-time systems and network-on-chip. [63] C. Zhong, X. Huang, S. M. Arisona, G. Schmitt, and M. Batty. Inferring building functions from a probabilistic model using public transportation data. Computers, Environment and Urban Systems, 48:124 – 137, 2014. [64] M. Zorzi and R. Rao. Geographic random forwarding (GeRaF) for ad hoc and sensor networks: multihop performance. IEEE Transactions on Mobile Computing, 2(4):337–348, 2003.

Yasha Wang received his Ph.D. degree in North- Chao Chen is an associate professor at Col- eastern University, Shenyang, China, in 2003. He is a lege of Computer Science, Chongqing University, professor and associate director of National Research Chongqing, China. He obtained his Ph.D. degree & Engineering Center of Software Engineering in from Pierre and Marie Curie University and Institut Peking University, China. His research interest in- Mines-TELECOM/TELECOM SudParis, France in cludes urban data analytics, ubiquitous computing, 2014. He received the B.Sc. and M.Sc. degrees in software reuse, and online software development en- control science and control engineering from North- vironment. He has published more than 50 papers in western Polytechnical University, Xi’an, China, in prestigious conferences and journals, such as ICWS, 2007 and 2010, respectively. Dr. Chen was the UbiComp, ICSP and etc. As a technical leader and recipient of the Best Paper Runner-Up Award at manager, he has accomplished several key national MobiQuitous 2011. projects on software engineering and smart cities. Cooperating with major In 2009, he worked as a Research Assistant with Hong Kong Polytechnic smart-city solution providing companies, his research work has been adopted University, Kowloon, Hong Kong. His research interests include pervasive in more than 20 cities in China. computing, social network analysis, urban logistics, data mining from large- scale taxi data, and big data analytics for smart cities.

Sen Yang is currently a master student at Col- lege of Computer Science, Chongqing University, Chongqing, China. His research interests include scenic travel route planning, taxi GPS trajectory data mining, and last-mile delivery. Bin Guo is a professor from Northwestern Polytech- nical University, China. He received his Ph.D. degree in computer science from Keio University, Japan in 2009 and then was a post-doc researcher at Institut Mines-TELECOM/TELECOM SudParis in France. His research interests include ubiquitous computing, mobile crowd sensing, and HCI. 18

Daqing Zhang is a professor at Institute of Soft- ware, School of Electronics Engineering and Com- puter Science, Peking University, China. He obtained his Ph.D from University of Rome “La Sapienza” and the University of L’Aquila, Italy in 1996. His research interests include large-scale data mining, urban computing, context-aware computing, and am- bient assistive living. He has published more than 180 referred journal and conference papers, all his research has been motivated by practical applications in digital cities, mobile social networks and elderly care. Dr. Zhang is the Associate Editor for four journals including ACM Trans- actions on Intelligent Systems and Technology. He has been a frequent Invited Speaker in various international events on ubiquitous computing. He is the winner of the Ten Years CoMoRea Impact Paper Award at IEEE PerCom 2013, the Best Paper Award at IEEE UIC 2015/2012 and the Best Paper Runner Up Award at Mobiquitous 2011. He is a member of the China Thousand-Talent Program.

APPENDIX A PROOFOF THEOREM 1 Proof: The theorem can be proven by induction, detailed as follows: Base Case: When n = 1 (in which n is length of the given path which is quantified by the number of interchange stations contained by the path minus one), it is obviously that we have n=1 n=1 Pij(t ≤ t1|P ath ) ≤ Pij(t ≤ t2|P ath ), if t1 < t2, according to Definition 6. Induction Step: Let k ∈ N be given and suppose Eq. 4 is true for n = k, that is, we have: n=k n=k P (t2|P ath ) ≥ P (t1|P ath ), if t1 < t2 Then,

Z t1 n=k+1 n=k P (t1|P ath ) = p(t)P (t1 − t|P ath )dt 0 where p(t) is the probability density function on the edge from the origin to the first stop.

Z t2 n=k+1 n=k P (t2|P ath ) = p(t)P (t2 − t|P ath )dt 0 Z t1 n=k = p(t)P (t2 − t|P ath )dt+ 0 Z t2 n=k p(t)P (t2 − t|P ath )dt t1 Z t1 n=k ≥ p(t)P (t1 − t|P ath )dt+ 0 Z t2 n=k p(t)P (t2 − t|P ath )dt t1 Z t2 n=k+1 n=k = P (t1|P ath ) + p(t)P (t2 − t|P ath )dt t1 n=k+1 ≥ P (t1|P ath ) Conclusion: By the principle of induction, Eq. 4 is true for all n ∈ N.