This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TVT.2015.2409815, IEEE Transactions on Vehicular Technology JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014 1

Spatio-Temporal Segmentation of Trips Using Smart Card Data

Fan Zhang1, Juanjuan Zhao1, Chen Tian2, Chengzhong Xu1,3, Xue Liu4, and Lei Rao5
1Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
2School of Electronic Information and Communications, Huazhong University of Science and Technology, China
3Department of Electrical and Computer Engineering, Wayne State University, Detroit, MI 48202, USA
4Department of Computer Science, McGill University, Canada
5General Motors Research Lab, USA

Contactless smart card systems have gained universal prevalence in modern metros. In addition to their original goal of ticketing, the large amount of transaction data collected by smart card systems can be utilized for many operational and management purposes. This paper investigates an important problem: how to extract spatio-temporal segmentation information of trips inside a metro system. More specifically, for a given trip, we want to answer several key questions: How long does it take a passenger to walk from the station gantry to the station platform? How long does he/she wait for the next train? How long does he/she spend on the train? How long does it take to transfer from one line to another? This segmentation information is important for many application scenarios such as travel time prediction, travel planning, and transportation scheduling. However, in reality we can assume only that each trip's tap-in and tap-out times are directly obtainable; all other temporal endpoints of the segments are unknown. This makes the research very challenging. To the best of our knowledge, we are the first to give a practical solution to this important problem. By analyzing the pattern of tap-in/tap-out events, our intuition is to pinpoint some special passengers whose transaction data can be very helpful for segmentation. A novel methodology is proposed to extract spatio-temporal segmentation information: first for non-transfer trips by deriving the boarding time between the gantry and the platform, then for with-transfer trips by deriving the transfer time. Evaluation studies are based on large-scale real-system data from the Shenzhen metro system, which is one of the largest metro systems in China and serves millions of passengers daily. On-site investigations validate that our algorithm is accurate and that the average estimation error is only around 15%.

Index Terms—Trip segmentation, intelligent transportation systems, metro systems, smart card, smart city.

Copyright (c) 2015 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].
Corresponding author: Chen Tian (email: [email protected]).

I. INTRODUCTION

Nowadays, metro systems [1] have become one of the most preferred public transit services [2]. Compared with other services, metro has the benefits of high efficiency, large volume, and fast speed. Recently, contactless smart card systems have gained universal prevalence in modern metros. Compared with traditional magnetic cards or paper tickets, the smart card is a more secure and convenient method for authentication and fare collection.

The transaction data collected by the smart card system can be utilized for many operational and management purposes. For metro systems, usually both the tap-in and tap-out records for each trip are saved in the database. Each record contains at least the time, the station and the card's ID of the transaction. With this data, we can measure user travel behaviors and their possible variances for the purpose of customer management [3], [4], [5]. We can also analyze the access data to improve transit planning or scheduling [6], [7], [8].

This paper investigates an important emerging problem: given the smart card tap-in and tap-out data, how to extract spatio-temporal segmentation information of metro trips? Figure 1(a) shows an example, the non-transfer trip of a metro user, say Alice, on Line 1 (the dotted line): after tap-in at the metro gantry, Alice spends $L_1$ seconds walking to the tap-in platform; before the departure of the next available train, Alice waits $L_2$ seconds; she travels in the train for $L_3$ seconds (including the dwell time at the intermediate stations) and arrives at the tap-out platform; walking from the platform to the tap-out gantry takes $L_4$ seconds; she waits at the tap-out gantry for $L_5$ seconds, due to passenger congestion, before she finally leaves the system. Figure 1(b) is Alice's one-transfer trip from Line 1 to Line 2 (the wide line), and there are more segments: $L_6$ seconds for transferring to a platform; $L_7$ seconds pass before the next Line 2 train departs; Alice also spends $L_8$ seconds on the second train. Spatio-temporal segmentation of trips with multiple transfers can be modelled similarly.

In this model, $L_1$, $L_4$ and $L_6$ are of particular importance. They are dominated by the building structure of each metro station, hence independent of moving passengers or trains. For the convenience of presentation, we use the term boarding time to refer to both $L_1$ and $L_4$, and use the term transfer time for $L_6$.

More specifically, for a given trip, we want to divide it into several travel segments, with both location and time information. This spatio-temporal segmentation information is important to many application scenarios. With the $L_1$, $L_4$ and $L_6$ values of every station successfully deduced, we have developed several exciting motivating cases (details in Section II).

For most modern metro systems, only each trip's tap-in and tap-out times can be directly obtained, and all other temporal endpoints of the segments are unknown. By subtracting the tap-in time from the tap-out time, the whole trip duration is

0018-9545 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Fig. 2: Spatio-temporal passenger density of Line 4 trains from 7:00 AM to 10:00 AM. (Axes: time versus station, from FTKA to QH; color scale: passenger number.)

composed of: $L_1 + L_2 + L_3 + L_4 + L_5$ for a non-transfer trip, or $L_1 + L_2 + L_3 + L_6 + L_7 + L_8 + L_4 + L_5$ for a one-transfer trip, or even more segments for a multi-transfer trip. Given only the start and end times, the problem is how to deduce the boundaries between consecutive segments. To solve this problem, we provide an efficient yet effective spatio-temporal segmentation solution.

Fig. 1: Spatio-temporal segmentation of Alice's (a) non-transfer trip, and (b) one-transfer trip.

The intuition of our solution is as follows: there are some special passengers (termed Border-Walkers in this paper) in the system, and we utilize their transaction data for segmentation. To the best of our knowledge, we are the first to give a practical solution to this important problem. The contributions of this paper include:
• We define the role and illustrate the special functions of Border-Walkers; a set of novel algorithms is proposed to identify Border-Walkers by analyzing the pattern of tap-in/tap-out events (Section IV).
• We further propose a novel methodology to extract spatio-temporal segmentation information, first for non-transfer trips by deriving the boarding time between gantry and platform, then for with-transfer trips by deriving the transfer time (Section V).
• We study our approach using large-scale data collected from the Shenzhen metro system, which is one of the largest metro systems in China; it serves millions of passengers daily. We also present the detailed system design which realizes our algorithm (Section VI).
• We perform a large-scale on-site investigation (i.e., we manually take measurements and acquire the trip segmentation information). The measured results validate that our algorithm is accurate and that the average estimation error is only around 15% (Section VII).

Section III presents the overview of our solution. Sections IV to VII discuss each challenge and its solution in detail. Section VIII discusses the related work. Section IX concludes the paper.

II. MOTIVATING APPLICATIONS

In this section, we present several exciting use cases as the motivating examples for this paper. They are all ongoing projects of ours, with the objective of transforming our algorithms into real applications that benefit public transportation.

A. Realtime Train Density Estimation

Many passengers care more about comfort than travel duration [9], [10]. In a metro system, if realtime crowdedness information about the following (several) trains can be conveyed to passengers waiting on the platform (e.g., via the on-site LED display), some might change their travel plans by getting on an earlier or later but more comfortable train with fewer passengers aboard.

Currently this service is unavailable. With only the tap-in and tap-out information, the authority has no method, yet, to figure out the approximate population of each running train. Figure 2 shows the spatio-temporal passenger density of Line 4 on a typical day: during the morning peak hours (i.e., 7:00 AM to 10:00 AM), and between stations BSC and SNG; the deeper the red color, the more people on a train. The densities are unevenly distributed: we can clearly see that red and blue densities interleave among consecutive trains. After investigation, we find that inter-zone trains are added for the morning peak, between SZB and FTKA stations; they interleave with the normal scheduled trains and are much less


populated since they skip the first several stations. However, passengers on a platform are uncertain about the realtime crowdedness of the next train and of the one after it. Without information, the train after the next one might be even more crowded, in which case waiting is just a waste of time; as a result, most people simply take the next available train.

By travel segmentation, we can already accurately estimate the historical spatio-temporal density of every operating train from the records (as illustrated by Figure 2). More importantly, combined with passenger Origin-Destination (OD pair) prediction [11], [12], we can predict the destination of each passenger when he/she gets into the system; with that information and the historical segmentation results, we can then perform realtime passenger density estimation for each running train. We leave the details of this ongoing project to future work.

B. Personalized Travel Planning

The exact spatio-temporal status of a passenger also depends on other factors: the physical condition of the passenger (e.g., disabled or not), peak or low traffic hours, etc. With the help of the algorithms in this paper, we can perform Personalized Travel Planning, instead of the general travel planning service currently provided (e.g., BingMap, Google Map).

Our algorithms can extract information whose value is generally stable: for example, the average $L_1$ (walk-in), $L_4$ (walk-out) and $L_6$ (transfer) values of every station for normal people in non-peak hours. We can then analyze the variability of every segment in the time domain. In Figure 3, based on our algorithms, we demonstrate the extra waiting time on the tap-in platform and before the tap-out gantry for a typical day at two stations. The average expected waiting time (blue line) on the tap-in platform is calculated as half of the interval between consecutive trains. Some parts of the average line are lower (usually at peak hours), since extra trains are added and hence the intervals are reduced.

Fig. 3: Waiting time on the tap-in platform and before the tap-out gantry. (Panels: LongSheng and FuMin stations; horizontal axis: time of day, 6:30 to 18:30; vertical axes: waiting time for a train on the platform in minutes, and waiting time in front of the tap-out gantry in seconds.)

At LongSheng station, the derived waiting time on the tap-in platform (red line) is always higher than expected, since this station is crowded and many passengers have to wait for several trains before getting on. Only in the evening peak does the waiting line match the average line, due to the effect of the added trains. By comparison, the derived waiting time at FuMin station is almost the same as the expected time: this station has extra trains for both the morning and evening peaks. Tap-out waiting at LongSheng station (green line) seldom exists for almost the whole day because few passengers get off at this station, while at FuMin station there is extra waiting time before the tap-out gantry for almost the whole day, especially during the morning and evening peaks.

We also analyzed the differences among individual passengers by segmenting each individual's trips in non-peak hours. For each person, we count the trips in which he/she misses a train that a passenger at normal speed would have caught. As shown in Figure 4, 80% of passengers seldom miss any train. However, nearly 1% of passengers are clearly slower than average: some even have a missing probability as high as 50%.

Fig. 4: The individual factors. (Horizontal axis: percentage of missing trains; vertical axis: percentage of passengers.)

With all these factors extracted, we are developing a project that could provide Personalized Travel Planning based on both the time of day and the individual's historical record.

C. Metro Acquaintance

A social phenomenon is called "The Familiar Stranger". For example, Alice and Bob have different Origin-Destination


pairs; they live in different districts and work in different districts. However, they happen to transfer at the same station at approximately the same time every workday; they are familiar with each other, and they have feelings about each other, while neither of them dares to make the first contact [13].

"Metro Acquaintance" is an ongoing project with the objective of establishing such a special social network. Directly from historical records, by travel segmentation, we calculate the meet probability, together with when and where, for each pair of passengers who have met each other, either on the same platform or in the same train. By entering their card IDs at the web entry, people can establish a metro social network; the connection recommendation of the social system is mostly based on the meet probability deduced by the travel segmentation algorithms in this paper.

III. OVERVIEW

A. Modeling

We model the system from a train's point of view. As shown in Figure 5, let us suppose $n$ is a station of a metro line; we also suppose there is a specific train $m$ that travels in one direction and stops at $n$; let us denote the times of its arrival and departure as $A(m,n)$ and $D(m,n)$ respectively. Note that here we abuse the term train $m$ to represent, instead of a physical train, a logical entity that combines three attributes: line, direction and train sequence number.

We assume that the walking speed is constant for all passengers: in this case, $L_1$ and $L_4$ are station-dependent, and we denote them as $L_1(n)$ and $L_4(n)$ respectively. By comparison, $L_2$ and $L_5$ are passenger-dependent; we denote them as $L_2(\text{passenger})$ and $L_5(\text{passenger})$ respectively.

We illustrate the trajectories of four fictional passengers in Figure 5:
• Charlie enters station $n$, walks $L_1(n)$ seconds, boards train $m$ and waits for $L_2(\text{Charlie})$ seconds before train $m$ departs;
• Alice enters station $n$ later, also walks $L_1(n)$ seconds but boards train $m$ exactly at the departure time $D(m,n)$;
• Bob alights from train $m$ exactly at time $A(m,n)$ when train $m$ stops at station $n$, walks $L_4(n)$ seconds and goes through the station gantry immediately, before all other travellers;
• David alights from the train, also walks $L_4(n)$ seconds, but waits for $L_5(\text{David})$ seconds in the tap-out queue before he leaves the station.

Fig. 5: System model from a train's view

B. Intuition and challenges of solution

Charlie and David represent normal passengers: they spend extra time in the system waiting, both at the platform and before the tap-out gantry. By comparison, Alice does not waste extra time waiting at the platform, i.e., $L_2(\text{Alice}) = 0$; Bob does not waste extra time waiting before the tap-out gantry, i.e., $L_5(\text{Bob}) = 0$. Alice and Bob are the special passengers we term Border-Walkers in this paper; their transaction data can be very helpful for segmentation.

Suppose that for each $(m,n)$ pair there are two transaction records in the smart card data: Alice's tap-in time $T_{in}(m,n)$ and Bob's tap-out time $T_{out}(m,n)$. As illustrated in Figure 5, we can derive the following Equation (1), which relates the Border-Walkers' data to the train timing $A(m,n)$ and $D(m,n)$, for each $(m,n)$ pair:

$$\text{(i)}\quad T_{out}(m,n) = A(m,n) + L_4(n), \qquad \text{(ii)}\quad T_{in}(m,n) = D(m,n) - L_1(n). \tag{1}$$

We already have $T_{in}(m,n)$ and $T_{out}(m,n)$; if $A(m,n)$ and $D(m,n)$ can be fixed, then we can derive $L_1(n)$ and $L_4(n)$ for station $n$. With all the values of $A(m,n)$, $D(m,n)$, $L_1(n)$ and $L_4(n)$, it seems straightforward, at least for non-transfer trips, to perform spatio-temporal segmentation.

However, there are challenges to be overcome. First, how do we find these Border-Walkers (i.e., Bob and Alice) among all the transaction records? Second, even with their data, the segment calculation cannot directly rely on the published train arrival/departure schedule. In many cases only the schedules of the first/last trains of each line are available to the public. Also, there is no guarantee that each train is punctual to the second: e.g., an unexpected contact between a train gate and a passenger's body can delay the scheduled departure by 10 seconds. In other words, $A(m,n)$ and $D(m,n)$ cannot be directly fixed from public sources; instead, they must be derived too.

The solution architecture is shown in Figure 6. The deduction first focuses on the non-transfer tap-in/tap-out data: from all the records, we find the Border-Walkers for each $(m,n)$ combination (step 1). With this information, we then derive the boarding time for each station (step 2). Next we incorporate the tap-in/tap-out data of one-transfer trips: the transfer time for each station pair can be derived (step 3). Eventually, we can segment the trips in our dataset (step 4).

IV. PINPOINT BORDER-WALKERS

A. Finding Bob

Bob can be found by mining the tap-out events. The key insight here is that tap-out events usually form a periodic idle-busy-idle pattern. As shown in Figure 7(a), when a train $m$ arrives, many passengers may leave the train together; they walk the same distance to the gantry and tend to congest right in front of it; the resulting tap-out events are

very frequent in the time domain. By comparison, the tap-out events are sparsely distributed between two such busy times, as illustrated in Figure 7(b).

Fig. 6: Solution scheme. (Steps: 1, identify Border-Walkers from the tap data of no-transfer trips; 2, boarding time for each station, leading to segmentation for no-transfer trips; 3, transfer time for each station pair from the tap data of one-transfer trips; 4, segmentation for with-transfer trips.)

Fig. 7: Tap-out events (a) busy time (b) idle time (c) temporal pattern

Theoretically it is straightforward to find Bob. All tap-out events in the station are mapped to the time axis as shown in Figure 7(c); the classic DBSCAN algorithm is adopted to cluster the points [14]; the dense clusters, which are almost evenly spaced in the time domain (due to the evenly spaced train schedule), can be identified. Each dense cluster corresponds to the arrival of a train, and we consider the passenger with the earliest tap-out time in each cluster as the special Bob we are looking for.

B. Finding Alice

Alice can be found after Bob. After identifying Bob, all passengers inside his dense cluster are associated with a specific train. Suppose there is a train $m$ that travels in one direction and stops at station $n$. For all the subsequent stations $n+1, n+2, \cdots, N$, the tap-out passengers belonging to train $m$ can be identified. Among them, the subset of passengers who tapped in at station $n$ are grouped together. We consider the one with the latest tap-in record as the special Alice we are looking for.

C. Handling exceptions

In practice there are more challenges. We present in Figure 8 the tap-out event data of a metropolitan metro system in China. The graph shows a partial set of stations of one line, with four time periods in the same day; each period lasts 15 minutes. Figure 8(a), from 08:15 to 08:30, shows a part of the morning peak; Figure 8(b), from 16:15 to 16:30, is the normal state of the system; Figure 8(c), from 18:30 to 18:45, is the evening peak; Figure 8(d), from 22:15 to 22:30, shows the low hours of a day. Overall, the periodic idle-busy-idle pattern is quite clear in most cases.

More implications can be found. First, the interval between consecutive trains is time-dependent: compared with Figure 8(b) and (d), Figure 8(a) and (c) have more clusters in 15 minutes, which implies shorter train intervals (this is reasonable for morning/evening peaks). Second, there are some types of exceptions which hinder the clustering step:
• (Type 1) There are too many events in some cases. Train arrivals can be extremely frequent in the morning peak, such as at station XiangMiHu shown in Figure 8(a). It is clear that the long clustered line of events contains passengers from several consecutive trains.
• (Type 2) There are too few events in some cases. Station DaXin demonstrates regular clusters in Figure 8(a), while it is almost impossible to identify any reasonable cluster in Figure 8(b), (c) and (d).
• (Type 3) There are passengers moving in abnormal patterns. A passenger in haste, say Ethan, can run from the gantry to the platform; as a result, he might get on train $m$ instead of train $m+1$, which he would have boarded at a normal speed.

Let us denote Bob's tap-out time for train $m$ at station $n$ as $T_{out}(m,n)$. A whole day's $T_{out}(m,n)$ values thus form an $M \times N$ matrix (Figure 9). Type 1 and Type 2 exceptions cause missing entries in the matrix.

Fig. 9: Tap-out value matrix
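The Bob-finding step of Section IV-A can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: instead of a full DBSCAN, it uses a simplified one-dimensional, gap-based density clustering, which behaves similarly on sorted timestamps when clusters are well separated. The function name `find_bobs`, the parameters `eps` and `min_pts`, and the sample timestamps are all hypothetical.

```python
from typing import List, Tuple


def find_bobs(tap_out_times: List[float], eps: float = 5.0,
              min_pts: int = 4) -> List[Tuple[float, int]]:
    """Return (earliest tap-out time, cluster size) for each dense cluster.

    Simplified 1-D stand-in for DBSCAN (an assumption, not the paper's code):
    sorted timestamps are chained into one cluster while consecutive gaps
    are at most `eps` seconds; chains with fewer than `min_pts` events are
    treated as noise (the sparse "idle" taps between train arrivals).
    """
    bobs: List[Tuple[float, int]] = []
    cluster: List[float] = []
    for t in sorted(tap_out_times):
        if cluster and t - cluster[-1] > eps:
            # Gap too large: close the current chain and start a new one.
            if len(cluster) >= min_pts:
                bobs.append((cluster[0], len(cluster)))
            cluster = []
        cluster.append(t)
    if len(cluster) >= min_pts:
        bobs.append((cluster[0], len(cluster)))
    return bobs


# Hypothetical tap-out times (seconds) at one station: two train arrivals
# with two isolated taps in between.
events = [100, 101, 103, 104, 106, 150, 200, 300, 302, 303, 305, 307, 308]
bobs = find_bobs(events)
```

Each returned pair gives a candidate $T_{out}(m,n)$ for one train arrival together with the cluster size. A Type 1 exception would show up here as one oversized chain spanning several trains, and a Type 2 exception as no chain reaching `min_pts`, which is why the text treats both as missing matrix entries.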


Fig. 8: Tap-out events pattern in (a) Morning Peak (b) Normal (c) Evening Peak (d) Low hours. (Each panel plots the tap-out events of the stations from BaoAnZhongXin to HuaQiangLu over a 15-minute window: 08:15 to 08:30, 16:15 to 16:30, 18:30 to 18:45, and 22:15 to 22:30.)
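The Alice-finding rule of Section IV-B amounts to a group-by followed by a maximum. The sketch below assumes that every trip has already been assigned to a train through its tap-out cluster; the `Trip` record layout, the function name `find_alices`, and the sample values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, NamedTuple, Tuple


class Trip(NamedTuple):
    card_id: str
    train: int        # train m, assumed already identified from the tap-out cluster
    in_station: int   # tap-in station n
    tap_in: float     # tap-in time in seconds


def find_alices(trips: List[Trip]) -> Dict[Tuple[int, int], Trip]:
    """For every (train, tap-in station) pair keep the trip with the latest tap-in."""
    alices: Dict[Tuple[int, int], Trip] = {}
    for trip in trips:
        key = (trip.train, trip.in_station)
        if key not in alices or trip.tap_in > alices[key].tap_in:
            alices[key] = trip
    return alices


# Hypothetical sample: three passengers board train 7 at station 3,
# one passenger boards train 8 at the same station.
trips = [
    Trip("c1", 7, 3, 1000.0),
    Trip("c2", 7, 3, 1085.0),
    Trip("c3", 7, 3, 1042.0),
    Trip("c4", 8, 3, 1300.0),
]
alices = find_alices(trips)
```

The tap-in time of each selected trip is a candidate $T_{in}(m,n)$. Note that this sketch does no anomaly filtering, so a running passenger such as Ethan (Type 3) would be picked as Alice unless removed beforehand.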

The matrix has hidden structures: for example, $T_{out}(m,n) - T_{out}(m,n-1)$ and $T_{out}(m-1,n) - T_{out}(m-1,n-1)$ are correlated since they are both dominated by the travel time from station $n-1$ to $n$; also, $T_{out}(m+1,n) - T_{out}(m,n)$ and $T_{out}(m+1,n+1) - T_{out}(m,n+1)$ are correlated since they are both dominated by the interval between trains $m$ and $m+1$. Leveraging the presence of these types of structure and the redundancy in the collected data, compressive sensing [15] can be used to interpolate the matrix to handle Type 1 and Type 2 exceptions.

Type 3 exceptions are handled by anomaly detection [16]. Sticking to the Ethan example, we can analyze the relationship between the tap-in time and the train to which each passenger belongs; Ethan can then be identified as an anomaly instead of being treated as Alice.

V. DERIVE BOARDING & TRANSFER TIME

Using these special passengers' data, we propose a novel methodology to extract spatio-temporal segmentation information, both for non-transfer trips by deriving the boarding time between gantry and platform, and for with-transfer trips by deriving the transfer time.

A. Segment non-transfer trips

The challenge is to derive the boarding times $L_1(n)$ and $L_4(n)$ for each station $n$. From Equations (1)(i) and (1)(ii) we get:

$$T_{out}(m,n) - T_{in}(m,n) = L_1(n) + L_4(n) - (D(m,n) - A(m,n)). \tag{2}$$

Two characteristics can be exploited to solve Equation (2). First, for almost all metro lines in one direction, the passengers' route to the platform and the route from the platform in the same station are the same. It is safe to assume that $L_1(n)$ equals $L_4(n)$, hence we approximate both $L_1(n)$ and $L_4(n)$ by $L_p(n)$.

Second, to reduce operational complexity, the dwell time (i.e., $D(m,n) - A(m,n)$) is generally fixed for each station; for the same reason, the value options are also limited. Compared with $L_p(n)$, dwell time values are relatively small. Our interactions with transit agencies reveal that the typical dwell time is around tens of seconds for small stations, and is larger for large stations. It is safe to approximate the dwell time by a fixed value $L_w(n)$. In our system (Section VI), we use on-site investigations to estimate $L_w(n)$.

Now we have

$$L_p(n) = \left(T_{out}(m,n) - T_{in}(m,n) + L_w(n)\right)/2. \tag{3}$$

If $T_{out}(m,n)$, $T_{in}(m,n)$ and $L_w(n)$ can be fixed, then for each

line and each station, the segment length $L_p(n)$ between gantry and platform can be derived. The timing of each train can be calculated by:

$$A(m,n) = T_{out}(m,n) - L_p(n), \qquad D(m,n) = T_{in}(m,n) + L_p(n). \tag{4}$$

Now we have the estimated time values for each $A(m,n)$/$D(m,n)$ pair.

For each non-transfer trip, we can now perform spatio-temporal segmentation. Assume there is a trip from station $n_1$ to $n_2$; the passenger taps in at $t(n_1)$ and taps out at $t(n_2)$. We get the boarding times $L_1$ and $L_4$ as $L_p(n_1)$ and $L_p(n_2)$. Based on the cluster to which the tap-out event belongs (Figure 7(c)), the passenger's train $m$ can be identified. We can then get how long the passenger waits on the tap-in platform, $L_2$, as $D(m,n_1) - t(n_1) - L_1$. The travel time $L_3$ in train $m$ is $A(m,n_2) - D(m,n_1)$; the waiting time $L_5$ can be derived as $t(n_2) - A(m,n_2) - L_4$.

B. Segment with-transfer trips

The challenge is to derive the transfer time $L_6$ for each possible transfer option. To reduce interference, we only use one-transfer trips for the derivation. Let us suppose the tap-in/transfer/tap-out stations are $n_1$, $n_2$ and $n_3$ respectively, and that there are two trains, $m_1$ on the first line $k_1$ and $m_2$ on the second line $k_2$. A passenger gets on train $m_1$, arrives at station $n_2$ at time $A(m_1,n_2)$, and leaves $n_2$ at time $D(m_2,n_2)$ with train $m_2$. Note that in the context of transfer trips, the term station denotes a logical entity that combines three attributes: line, direction and the physical station.

For each transfer trip, we can easily figure out the tap-out train $m_2$ by the same approach mentioned above. However, how do we figure out the passenger's tap-in train $m_1$? Let us first model the system from a station's point of view. In Figure 10, time advances from left to right. Let us also suppose $n$ is a station; a specific train $m$ arrives and departs at $A(m,n)$ and $D(m,n)$ respectively; the next train $m+1$ arrives and departs at $A(m+1,n)$ and $D(m+1,n)$ respectively. In our approach, passengers with a tap-in time between $T_{in}(m,n)$ and $T_{in}(m+1,n)$ are assumed to board train $m+1$, i.e., $m_1$ can be fixed.

Fig. 10: System model from a station's view

The method is to group all passengers who transferred from line $k_1$ to line $k_2$ at station $n_2$: for each such trip, we can get its $A(m_1,n_2)$ and $D(m_2,n_2)$ values. All values of $D(m_2,n_2) - A(m_1,n_2)$ are calculated. It is assumed that the trip with the smallest $D(m_2,n_2) - A(m_1,n_2)$ value has no waiting time (i.e., $L_7 \approx 0$). We take this as the transfer time $L_6(n_1,n_2)$ for each $n_1, n_2$ pair. Based on this, we can perform segmentation for each transfer trip, similar to that of the non-transfer trips.

C. Post-processing

There are many trains running one after another on each line, and a station is visited by 100 to 200 trains daily. Generally speaking, with each train's transaction data, we can obtain a value of the boarding time $L_p(n)$ for each station. As a result, for each $L_p(n)$, we actually obtain a group of values. The problem is how to reduce this set of values to a single $L_p(n)$ result. The same situation arises in the transfer time calculation.

For a single $L_p(n)$, most of the obtained values are close to each other. We did find abnormal values, which can be as large as several hours. The classic DBSCAN algorithm is again adopted to cluster the points [14]. After excluding the exceptions, the minimum value in the cluster is chosen as the optimal candidate for the boarding time or transfer time.

VI. PROTOTYPE AND DATASET

A. Processing System

Our spatio-temporal segmentation algorithm processes a large amount of data and, correspondingly, requires intensive sorting and grouping operations. As shown in Figure 11, the implemented system has three layers: the data layer, the model layer, and the application layer.

Fig. 11: System architecture. (Application layer: analysis on the characteristics of passengers, on boarding time, on spatio-temporal segments, and on transfer time. Model layer: clustering algorithm for tap-out values; analytical models for the tap-out value matrix, for completing the tap-out value matrix, for boarding time, and for transfer time; spatio-temporal segments model. Data layer: Zookeeper, Hbase, Pig, MapReduce/Hadoop, over smart card data, metro basic information, and field investigation.)

• The data layer is used mainly for storage and for MapReduce [17]/Hadoop [18] job processing. There are three kinds of data: smart card transactions, metro basic information, and intermediate and final results. The storage part uses HDFS [19] and Hbase [20], which is a distributed, scalable, big data store. The data layer accepts MapReduce jobs from the model layer, and effectively

0018-9545 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TVT.2015.2409815, IEEE Transactions on Vehicular Technology JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014 8

TABLE I: Transaction Record Format Field Value CardID Anonymous unique card id TrmnlID For metro, it represents the stop id TrnsctTime Transaction date and time TrnsctType Transaction type

the end of 2013; there are 8 more lines coming in the next 7 years. The dataset contains two months’ metro transaction records from Sep 1 to Oct 30, 2013. Each record represents a single card swipe event, either tap-in or tap-out. The four import fields are listed in Table I. The data are about 500M bytes per day, or, 15G bytes per month. We perform a dwell time on-site investigation for all 5 lines, and get the resulting Lw(n) for our algorithm. The result is that for small stations, the dwell time is around 20 seconds; while Fig. 12: Processing flowchart for large stations, it is around 35 seconds.

accomplishes them by data processing. Some software VII. EMPIRICAL EVALUATION AND VALIDATION RESULTS tools, such as PIG [21] and HIVE [22] are also used. In this section we present the empirical results by applying The results are sent back to the model layer or stored to our approach to Shenzhen metro system analysis. First we HDFS. analyze the obtained boarding/transfer time. A large-scale on- • The model layer is used to accept the query requests from site investigation is performed which validates our design. the application layer; a query is translated to a series Finally, we segment the trips to show the effectiveness of our of MapReduce jobs. It uses the smart card and other approach in analyzing passenger travel patterns. data from the data layer to build the analytical model: clustering model for tap-out events, analytical model for tap-out value matrix, analytical model for boarding time, A. Boarding time etc. For each station, the boarding time of two opposite direc- • The application layer performs model analysis, such as tions are derived separately. The obtained boarding time of 5 analysis on the characteristics of passengers, boarding lines are shown in Figure 14(a) to (e). There are two bars in time, spatio-temporal segments, transfer time, etc. each station: each represents one direction. First let us observe the boarding time of Line 1. The B. Flowchart boarding time pattern of Line 1 is typical, compared with other The processing flowchart is shown in Figure 12. The details 4 lines (Figure 14(b)-(e)). of data pre-processing and classification steps are given below. A first observation is that the time of both directions in Data pre-processing: Every trip contains one tap-in event the same station are mostly comparable. The reason is that and one tap-out event; this step joins them together out of most stations have their trains, of the same line, stop at the the transaction data by matching card ID and time. 
Also, the opposite sides of the same platform. There are also exceptions, redundancy should be removed and the inconsistence should i.e., LaoJie and GuoMao. We perform on-site investigation be solved in smart card data. in the two stations. The reason is that their platforms of Classification: This step is to divide the trips from step 1 into opposite directions are located in different layers of the same three categories: non-transfer trips, one-transfer trips, multi- building, due to architectural constraints. The identification of transfer trips. As mentioned above, only non-transfer and one- these exceptional cases indirectly verify the accuracy of our transfer trips are used for boarding/transfer time extraction. approach. Second, for most stations, the boarding time is less than 60 seconds. Only very few stations have large boarding time, as C. Dataset that of QianHaiWan and JiChangDong stations. Usually these Our dataset is the metro transaction data from Shenzhen, are large stations with transfer functions: the building design China. There have been over 10 million public transit smart is larger hence resulting in prolonged walking time. cards issued, and these smart cards can be used for both the Shown in in Figure 14(f) is the total cumulative distribution bus and metro systems. Metro alone has on average 2.5 million function. As we can see, for 70 percent stations, the boarding transactions per day, which is estimated one third of the total time is less than 60 seconds; for only 5 percent, the boarding public transit load. Figure 13 shows the existing 5 lines by time is larger than 100 seconds.
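The boarding times discussed in this subsection enter the segmentation of Section V-A directly, as $L_1 = L_p(n_1)$ and $L_4 = L_p(n_2)$. As a minimal sketch (not the authors' implementation), the arithmetic for one non-transfer trip could be written in Python as follows. The names `Segments` and `segment_non_transfer`, the use of plain seconds as the time unit, and the sample numbers are assumptions made for illustration only.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class Segments:
    """Durations of the five segments of a non-transfer trip, in seconds."""

    l1: float  # boarding time at the tap-in station, L1 = Lp(n1)
    l2: float  # waiting time on the tap-in platform, L2
    l3: float  # travel time in train m, L3
    l4: float  # boarding time at the tap-out station, L4 = Lp(n2)
    l5: float  # waiting time on the tap-out side, L5

    def total(self) -> float:
        """Sum of all five segments."""
        return sum(astuple(self))

    def shares(self) -> tuple:
        """Each segment as a fraction of the whole trip."""
        total = self.total()
        return tuple(value / total for value in astuple(self))


def segment_non_transfer(t_in, t_out, dep_n1, arr_n2, lp_n1, lp_n2) -> Segments:
    """Apply the Section V-A formulas to one trip.

    t_in, t_out : tap-in time t(n1) and tap-out time t(n2)
    dep_n1      : estimated departure D(m, n1) of the identified train m
    arr_n2      : estimated arrival A(m, n2) of the same train
    lp_n1, lp_n2: boarding times Lp(n1) and Lp(n2)
    """
    l1 = lp_n1
    l4 = lp_n2
    l2 = dep_n1 - t_in - l1
    l3 = arr_n2 - dep_n1
    l5 = t_out - arr_n2 - l4
    return Segments(l1, l2, l3, l4, l5)


# Hypothetical trip: every value is in seconds after an arbitrary origin
# and is invented for this example, not taken from the Shenzhen dataset.
trip = segment_non_transfer(
    t_in=0, t_out=1400, dep_n1=150, arr_n2=1350, lp_n1=40, lp_n2=35
)
```

By construction the five segments add up to $t(n_2) - t(n_1)$, which gives a quick sanity check on any implementation of these formulas; `shares()` expresses the same trip in the per-segment fractions used later for the segmentation results.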


Fig. 13: Metro graph of Shenzhen

Fig. 14: Boarding time. (a) Line 1 - Green, (b) Line 2 - Orange, (c) Line 3 - Blue, (d) Line 4 - Red, (e) Line 5 - Purple, (f) Distribution (cumulative distribution function F(x) against x)

B. Transfer time

Figure 15 shows the transfer time of every possible choice. Again there are two bars for each $n_1,n_2$ pair: each represents one direction. There are several wide bars: in these stations, there is only a one-way transfer from one line to another; usually such a station is the end point of a line.

As we can observe, the transfer time is mostly less than 90 seconds. However, the transfer time related to ShenZhenBei station is extremely high compared with the others. The reason is that the metro station itself is fully embedded in the Chinese High Speed Rail station. The huge building makes the transfer between lines a really long journey.

Fig. 15: Transfer time at each station. We have to abbreviate the information on the x-axis for better presentation. For example, 1D5UBAZX (second item from left to right on the x-axis) denotes: from Line 1 Downward direction, transfers to Line 5 Upward direction, in BaoAnZhongXin station

C. Validation

In order to verify the correctness and reliability of the methodology given in this paper, we have performed a large scale on-site investigation. Figure 16 compares the results of the investigation and of our approach: the x-axis is the value of the boarding/transfer time; the y-axis is the total cumulative distribution function value. As shown in the figure, our approach can effectively and accurately derive the boarding/transfer time.

The numerical results are also given in Table II. Regarding the boarding time, our average estimation error is between 10% and 20% for each line; on average the error is only around 15%, which validates that our algorithm is relatively accurate. Regarding the transfer time, the results are even better: the average estimation error is less than 13%.

TABLE II: Estimation Errors
Category          Mean Error
Line 1 Boarding   13.95%
Line 2 Boarding   19.97%
Line 3 Boarding   10.1%
Line 4 Boarding   17.4%
Line 5 Boarding   18.5%
All Boarding      15.78%
All Transfer      12.74%

D. Segmentation

Based on the obtained boarding and transfer time, we can extract the spatio-temporal segments of trips. The trip segmentation results are shown in Figure 17. For each trip, the obtained $L_1$/$L_2$/$L_3$/$L_4$/$L_5$ values are transformed to the corresponding percentages of the total trip time. For each line, the percentage values are averaged over all trips in our database.

For non-transfer trips, the onboard time $L_3$ dominates: it accounts for around 70 to 80 percent of the total trip time. Notice that $L_3$ includes both the train travel time and the dwell time at intermediate stations. The boarding time is similar for all lines: $L_1$/$L_4$ normally takes around 3 to 5 percent of the total trip. The waiting time at the tap-in platform $L_2$ differs among lines: less than 12 percent is spent by Line 1 travelers, while for Line 4 this value is nearly 19 percent. The reason is that Line 1 is the most crowded line, and the interval between trains is smaller. As a comparison, the waiting time at the tap-out platform (i.e., $L_5$) is negligible: only 1 to 2 percent for every line.

The one-transfer trips' segmentation is shown in Figure 18. The trips with transfers are typically longer. The boarding time ($L_1$ and $L_4$) and the waiting time ($L_2$, $L_7$ and $L_5$) are similar in value compared with non-transfer trips; as a result, their percentages of the whole trip decrease. The average onboard time $L_3 + L_6$ is over 72 percent. The average transfer time is 84 seconds, around 4 percent of the trip.

VIII. RELATED WORK

Most travel behavior analyses focus on the aggregated traffic patterns. The two peak access patterns are common during weekdays: people go to work in the morning and go home in the evening. As a comparison, transactions are more evenly distributed during weekends [3], [4]. There are some existing analyses which also used data from Shenzhen [23], [24], [5]. They focus on the aggregated temporal usage of the whole metro system. [24] studies the spatial characteristics of the station usage model. Our previous work [5] studies the aggregated temporal and spatial travel patterns of passengers by mining smart card data. Different from all previous works, this paper focuses on independent trips: we segment each transaction trip into travel segments.

One of our previous works studies the passenger density for public buses [8]. However, unlike the metro, the bus boarding and departing time can be easily identified by the card tap events. Sun et al. estimate the spatial-temporal density inside a metro system. However, their system model is oversimplified: they assume a uniform boarding time and dwell time for every station. Further, their algorithm requires the physical distance among stations as input and does not deal with the transfer time between lines [25]. As a comparison, in our paper, the boarding time is derived for every station separately, which is more practical. We require an estimation (instead of an explicit value) of the dwell time of each station as input, which is much easier to get via field investigation. We also derive the transfer time between lines.

There are other works dedicated to the analysis of transportation systems based on mining large scale data, such as freight trucks [26], taxis [27], public buses [28] and even electrical vehicles [29].


Fig. 16: On-site validation: (a) boarding time CDF, (b) transfer time CDF (curves: on-site investigation and our approach)

Fig. 18: Spatio-temporal segmentation of one-transfer trips

IX. CONCLUSION

In this paper, we investigate an important problem: how to extract spatio-temporal segmentation information of metro trips by only utilizing the tap-in and tap-out information. To the best of our knowledge, we are the first to provide a practical solution to this important problem. We first propose a set of novel algorithms to identify Border-Walkers by analyzing the tap-in/tap-out events pattern; based on that, we further propose a novel methodology to extract spatio-temporal segmentation information, first for non-transfer trips by deriving the boarding time between gantry and platform, then for with-transfer trips by deriving the transfer time. We study our approach in the case of the Shenzhen metro system and perform a large scale on-site investigation. The on-site measured results validate that our algorithm is accurate and the average estimation error is only around 15%.

ACKNOWLEDGMENT

The authors would like to thank anonymous reviewers for their valuable comments. This work is partially supported by China National Basic Research Program (973 Program, No. 2015CB352400), by National Natural Science Foundation of China (No. 61202107, No. U1401258, No. 61202303), by Natural Science Foundation support under grant CCF-1016966, by National High Technology Research and Development Program of China (863 Program No. 2014AA01A702), by Natural Science Foundation of Hubei Province (No. 2014CFB1007), by National Key Technology Research and Development Program of China (No. 2012BAH46F03), and by Fundamental Research Funds for the Central Universities.

Fig. 17: Spatio-temporal segmentation of no-transfer trips

REFERENCES

[1] C. Mayet, A. Bouscayrol, L. Horrein, P. Delarue, J. Verhille, E. Chattot, and B. Lemaire-Semail, “Comparison of different models and simulation approaches for the energetic study of a subway,” IEEE Transactions on Vehicular Technology, vol. 63, no. 2, pp. 556–565, 2014.
[2] Wikipedia, the free encyclopedia, “Urban rail transit in china,” http://en.wikipedia.org/wiki/Urban_rail_transit_in_China.


[3] B. Agard, C. Morency, and M. Trépanier, “Mining public transport user behaviour from smart card data,” in 12th IFAC Symposium on Information Control Problems in Manufacturing-INCOM, 2006, pp. 17–19.
[4] C. Morency, M. Trépanier, and B. Agard, “Analysing the variability of transit users behaviour with smart card data,” in Intelligent Transportation Systems Conference, 2006. ITSC’06. IEEE. IEEE, 2006, pp. 44–49.
[5] J. Zhao, C. Tian, F. Zhang, C. Xu, and S. Feng, “Understanding temporal and spatial travel patterns of individual passengers by mining smart card data,” in Intelligent Transportation Systems (ITSC), 2014 IEEE 17th International Conference on. IEEE, 2014, pp. 2991–2997.
[6] B. Agard, C. Morency, and M. Trépanier, “Mining smart card data from an urban transit network.” 2009.
[7] M.-P. Pelletier, M. Trépanier, and C. Morency, “Smart card data use in public transit: A literature review,” Transportation Research Part C: Emerging Technologies, vol. 19, no. 4, pp. 557–568, 2011.
[8] J. Zhang, X. Yu, C. Tian, F. Zhang, L. Tu, and C. Xu, “Analyzing passenger density for public bus: Inference of crowdedness and evaluation of scheduling choices,” in Intelligent Transportation Systems (ITSC), 2014 IEEE 17th International Conference on. IEEE, 2014, pp. 2015–2022.
[9] I. Ceapa, C. Smith, and L. Capra, “Avoiding the crowds: understanding tube station congestion patterns from trip data,” in Proceedings of the ACM SIGKDD International Workshop on Urban Computing. ACM, 2012, pp. 134–141.
[10] S. Liu, Y. Liu, L. Ni, M. Li, and J. Fan, “Detecting crowdedness spot in city transportation,” Vehicular Technology, IEEE Transactions on, vol. 62, no. 4, pp. 1527–1539, 2013.
[11] Y. Li and M. J. Cassidy, “A generalized and efficient algorithm for estimating transit route ods from passenger counts,” Transportation Research Part B: Methodological, vol. 41, no. 1, pp. 114–125, 2007.
[12] B. Li, “Markov models for bayesian analysis about transit route origin–destination matrices,” Transportation Research Part B: Methodological, vol. 43, no. 3, pp. 301–310, 2009.
[13] N. Belloni, L. E. Holmquist, and J. Tholander, “See you on the subway: exploring mobile social software,” in CHI’09 Extended Abstracts on Human Factors in Computing Systems. ACM, 2009, pp. 4543–4548.
[14] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise.” in KDD, vol. 96, 1996, pp. 226–231.
[15] D. L. Donoho, “Compressed sensing,” Information Theory, IEEE Transactions on, vol. 52, no. 4, pp. 1289–1306, 2006.
[16] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Computing Surveys (CSUR), vol. 41, no. 3, p. 15, 2009.
[17] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[18] T. White, Hadoop: The Definitive Guide. O’Reilly Media, 2009.
[19] D. Borthakur, “Hdfs architecture guide,” HADOOP APACHE PROJECT http://hadoop.apache.org/common/docs/current/hdfs_design.pdf, 2008.
[20] L. George, HBase: the definitive guide. O’Reilly Media, Inc., 2011.
[21] A. F. Gates, O. Natkovich, S. Chopra, P. Kamath, S. M. Narayanamurthy, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava, “Building a high-level dataflow system on top of map-reduce: the pig experience,” Proceedings of the VLDB Endowment, vol. 2, no. 2, pp. 1414–1425, 2009.
[22] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy, “Hive: a warehousing solution over a map-reduce framework,” Proceedings of the VLDB Endowment, vol. 2, no. 2, pp. 1626–1629, 2009.
[23] L. Liu, A. Hou, A. Biderman, C. Ratti, and J. Chen, “Understanding individual and collective mobility patterns from smart card records: A case study in shenzhen,” in Intelligent Transportation Systems, 2009. ITSC’09. 12th International IEEE Conference on. IEEE, 2009, pp. 1–6.
[24] Y. Gong, Y. Liu, Y. Lin, J. Yang, Z. Duan, and G. Li, “Exploring spatiotemporal characteristics of intra-urban trips using metro smartcard records,” in Geoinformatics (GEOINFORMATICS), 2012 20th International Conference on. IEEE, 2012, pp. 1–7.
[25] L. Sun, D.-H. Lee, A. Erath, and X. Huang, “Using smart card data to extract passenger’s spatio-temporal density and train’s trajectory of system,” in Proceedings of the ACM SIGKDD International Workshop on Urban Computing. ACM, 2012, pp. 142–148.
[26] J. Huang, L. Wang, C. Tian, F. Zhang, and C. Xu, “Mining freight truck’s trip patterns from gps data,” in Intelligent Transportation Systems (ITSC), 2014 IEEE 17th International Conference on. IEEE, 2014, pp. 1988–1994.
[27] Y. Li, C. Tian, F. Zhang, and C. Xu, “Traffic condition matrix estimation via weighted spatio-temporal compressive sensing for unevenly-distributed and unreliable gps data,” in Intelligent Transportation Systems (ITSC), 2014 IEEE 17th International Conference on. IEEE, 2014, pp. 1304–1311.
[28] L. Yin, J. Hu, L. Huang, F. Zhang, and P. Ren, “Detecting illegal pickups of intercity buses from their gps traces,” in Intelligent Transportation Systems (ITSC), 2014 IEEE 17th International Conference on. IEEE, 2014, pp. 2162–2167.
[29] Z. Tian, Y. Wang, C. Tian, F. Zhang, L. Tu, and C. Xu, “Understanding operational and charging patterns of electric vehicle taxis using gps records,” in Intelligent Transportation Systems (ITSC), 2014 IEEE 17th International Conference on. IEEE, 2014, pp. 2472–2479.

Fan Zhang is an associate professor in the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences. He received his Ph.D. in Communication and Information System from Huazhong University of Science and Technology in 2007. He was a postdoctoral fellow at University of New Mexico and University of Nebraska-Lincoln from 2009 to 2011. His research topics include big data processing, data privacy and network security, and wireless networks.

Juanjuan Zhao received her MS from the Department of Computer Science from Wuhan University of Technology, China in 2009. She was a research assistant from 2009 to 2012 in the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences and currently is a Ph.D. student there. Her research interests include cloud computing, big data processing, streaming-data processing, data fusion technique, big-data-driven systems, spatio-temporal data mining.

Chen Tian is currently an associate professor in the School of Electronics Information and Communications, Huazhong University of Science and Technology, China. He received the BS, MS and PhD degrees from the Department of Electronics and Information Engineering, Huazhong University of Science and Technology, China, in 2000, 2003, and 2008 respectively. From 2012 to 2013, he was a postdoctoral researcher with the Department of Computer Science, Yale University. His research interests include network function virtualization, data center networks, distributed systems, Internet streaming and big data processing for smart city.


Chengzhong Xu received his Ph.D. degree from the University of in 1993. He is currently a professor of the Department of Electrical and Computer Engineering of Wayne State University, USA. He also holds an adjunct appointment with the Shenzhen Institute of Advanced Technology of Chinese Academy of Science as the Director of the Institute of Advanced Computing and Data Engineering. His research interest is in parallel and distributed systems and cloud computing. He has published more than 200 papers in journals and conferences. He was the Best Paper Nominee of 2013 IEEE High Performance Computer Architecture (HPCA), and the Best Paper Nominee of 2013 ACM High Performance Distributed Computing (HPDC). He serves on a number of journal editorial boards, including IEEE Transactions on Computers, IEEE Transactions on Parallel and Distributed Systems, IEEE Transactions on Cloud Computing, Journal of Parallel and Distributed Computing and China Science Information Sciences. He was a recipient of the Faculty Research Award, Career Development Chair Award, and the President's Award for Excellence in Teaching of WSU. He was also a recipient of the “Outstanding Oversea Scholar” award of NSFC. For more information, visit http://www.ece.eng.wayne.edu/ czxu.

Xue Liu is a William Dawson Scholar and an Associate Professor in the School of Computer Science at McGill University. He received his Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign. He has also worked as the Samuel R. Thompson Chaired Associate Professor in the University of Nebraska-Lincoln and HP Labs in Palo Alto, California. His research interests are in computer and communication networks, real-time and embedded systems, distributed systems, cyber-physical systems, and green computing. He has published over 150 research papers in major peer-reviewed international journals and conference proceedings in these areas. His research received the Year 2008 Best Paper Award from the IEEE Transactions on Industrial Informatics, and the First Place Best Paper Award from the ACM Conference on Wireless Network Security 2011 (WiSec 2011). Dr. Liu’s research has been reported by news media including the New York Times, Computer World, The Register, Huffington Post, CBC, NewScientist, MIT Technology Review’s Blog, etc. He serves on the editorial boards of IEEE Transactions of Parallel and Distributed Systems (TPDS), IEEE Transactions on Vehicular Technology (TVT), and IEEE Communications Surveys and Tutorials (COMST).

Lei Rao is a senior researcher at General Motors, United States. In 2011, she was a research associate at School of Computer Science, McGill University. She received her Ph.D in the Department of Electronics and Information Engineering from Huazhong University of Science and Technology in 2010. Her research interests include Statistical Learning and Big Data, Connected Vehicle, Green Computing, Smart Grid and Mathematical Modeling and Optimizations. Her papers have appeared in Proceedings of the IEEE, IEEE Transactions on Smart Grid, IEEE Transactions on Vehicular Technology, IEEE INFOCOM, ACM/IEEE ICCPS, and IEEE ICC, etc.

0018-9545 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.