Mining Large-scale Mobility Patterns Using Mobile Phone Network Data

Amnir Hadachi, Mozhgan Pourmoradnasseri, Kaveh Khoshkhah

To cite this version:

Amnir Hadachi, Mozhgan Pourmoradnasseri, Kaveh Khoshkhah. Mining Large-scale Mobility Patterns Using Mobile Phone Network Data. 2020. hal-02974853

HAL Id: hal-02974853
https://hal.archives-ouvertes.fr/hal-02974853
Preprint submitted on 22 Oct 2020

Amnir Hadachi*, Mozhgan Pourmoradnasseri and Kaveh Khoshkhah
ITS Lab, University of , Ülikooli 17, 51014 Tartu,

ARTICLE INFO

Keywords: Commuting patterns; CDR data; OD-Matrix; Large-scale mobility; Hidden Markov model; Mobile Cellular Network data

ABSTRACT

In this study, with Estonia as an example, we establish to what extent and how the massive amount of mobile data from cellular networks, referred to as Call Detail Records (CDR), can be used to extract large-scale commuting patterns at different geographical levels. We constructed a model based on the Hidden Markov Model for reconstructing and transforming the trajectories extracted from the CDR data. This step allowed us to extract origin-destination matrices among different geographical levels, which helped in depicting the commuting patterns. In addition, we introduced different techniques for analyzing commuting at the urban level. Our results show that there is great potential in the mobile data of cellular networks once it is transformed into meaningful mobility patterns that can readily be used for understanding urban dynamics, large-scale daily commuting and mobility. The rapid development and growth of ubiquitous mobile sensing have generated valuable data that can be used with our approach to provide answers and solutions to the growing problems of transportation, urbanization and sustainability.

Preprint submitted to Elsevier

* Corresponding author: A. Hadachi. ORCID: 0000-0001-9257-3858 (A. Hadachi)

1. Introduction

Spatiotemporal population movement has a significant impact on the environment, people's lifestyle and the economy. Therefore, understanding human mobility patterns over both short and long periods is at the heart of sustainable urban and transportation planning and of resolving environmental problems.

To this end, many different data sources are used for extracting information on human mobility. Census data has long been the main source and is usually collected periodically by governments. The regularity and the types of census data vary widely between countries and regions.

Another source of data commonly used for deriving information on trips and population flows between certain locations is travel surveys (Zhai et al., 2019). Travel surveys are usually carried out by local governments and contain more detailed information than census data, such as trip purpose and mode of travel. However, the data is collected on a smaller scale and may be biased due to self-reporting errors. In addition to these traditional ways of collecting data for extracting mobility patterns, some research papers use less common sources of data, for example bank notes (Brockmann et al., 2006), transit smart cards (Ma et al., 2017) and publicly shared online data from social networks such as Twitter and Flickr (Jurdak et al., 2015; Yang et al., 2019; Barchiesi et al., 2015).

Recently, we have witnessed an increase in the availability of massive data sources such as GPS and mobile data, and also of the means for handling them. This provides a new context for scientists in several fields, such as computer science, physics and geography, to exploit the data for modelling and analyzing different factors of human mobility.

GPS data provides the most accurate spatial trajectories of individuals, but it is not usually available on a large scale. A common source of GPS data is vehicles equipped with a GPS transmitter (Bazzani et al., 2010; Liu et al., 2012). Although high positioning accuracy makes this type of data a rich source for mapping human mobility patterns, the low diversity and limited amount of the data remain a challenge. Besides, collecting the data requires GPS equipment and maintenance, which impose extra costs.

Mobile data is game-changing for monitoring both macro and micro levels of human mobility behaviour (Järv et al., 2014). Unlike traditional methods, it allows a large number of individuals to be tracked frequently and over a long time interval. Cellular networks record spatiotemporal trajectories of a considerable portion of the population for billing purposes. This already available data has opened the possibility of larger-scale studies on human mobility that could be more frequent and executed at lower cost. As another advantage, cellular data is less affected by the inconsistency of traditional methods, which is caused by intrinsic differences in collection methods between regions (Tolouei et al., 2017).

Nevertheless, cellular network data is sparse in time and has low spatial resolution. These characteristics arise because cellular event records represent the location of the user only within a tower coverage area or Location Area (LA), and because the user's location is lost when the mobile phone is not in use or no keep-alive signal is received from the network. However, by accounting for the known limitations of mobile phone data and through a careful choice of analysis methods such as map matching (Hadachi & Lind, 2019), it is possible to control these barriers to some extent. There are also other obstacles, such as privacy concerns or the provision of large-scale datasets of a new type of mobile sensor data, as in the "Data for Development" (D4D) challenge (Blondel et al., 2012) by Orange or the "Big Data Challenge" (Barlacchi et al., 2015) by Telecom Italia. These can easily initiate several new problems related to human mobility based on mobile data.

In general, cellular network data can be very useful in a broadening range of areas related to human studies, such as estimating population density (Ricciato et al., 2017), community detection (Lind et al., 2017), segregation (Järv et al., 2015), carbon footprint detection (Becker et al., 2013) and many more. Understanding the different aspects of human mobility is among the promising use cases of mobile data. A wide range of factors are studied in this area, such as predicting human mobility based on users' history (Hadachi et al., 2014), travel time estimation (Kujala et al., 2016), relative traffic volumes in metropolitan areas (Becker et al., 2013) and traffic monitoring (Janecek et al., 2015).

From this perspective, our paper focuses on unveiling the potential of the mobile data of cellular networks for extracting movement and commuting patterns of the population on a large scale. The adopted approach has two major levels. The first level is based on the Hidden Markov Model (HMM) and the OD-matrix for reconstructing and building complete mobility patterns and flows. The second level estimates the departure and arrival times of the commuting journeys in the OD-matrix and classifies the movement status (Stay or Move). Accordingly, the paper is organized to emphasise the description of the data used in this research and the challenges that arise from its nature. The second section covers related work and similar projects. The third section details our proposed approach. Finally, the last section gives an overview of the results, followed by a thorough discussion.

2. Related Works

Despite all the differences in the lifestyles of people in different urban areas, humans' daily displacement follows similar and simple rules from a high-level perspective. There are extensive studies in this direction that shed light on these similarities by using cellular network data.

Statistical properties of human mobility are explored by González et al. (2008) in a study of the individual trajectories of 100,000 mobile users during six months. They discovered, unlike the previous studies on superdiffusive human travel behaviour (Brockmann et al., 2006), that individuals' mobility patterns show a high degree of spatial and temporal regularity. Individuals tend to return to a few highly frequented locations with a high probability, and a single spatial probability distribution can be used for calculating the likelihood of finding a user in any location.

In another work (Song et al., 2010), the limits of predictability in human mobility are discussed. The authors study different entropy measures of the mobility patterns of anonymized mobile phone users, and their analysis shows a surprising average of 93% predictability in human mobility.

Other studies have uncovered that not only does every individual usually follow a repetitive pattern, but that these patterns are also very common between different people when a large population is observed, even in different countries. Schneider et al. (2013) study the behaviour of mobile users in different cities of different countries and surprisingly discover that 17 unique simple networks alone can describe the daily commuting patterns of up to 90% of the population. The authors explain that each individual has a preferred list of locations for daily visits and that the characteristics of his or her mobility remain stable over several months. Jiang et al. (2017) extract the daily mobility patterns of inhabitants of the city-state of Singapore using CDR data. By labelling the most frequently visited locations for each individual, meaningful stay locations are identified. It is observed that a few extracted network motifs are sufficient to describe the mobility patterns of the whole population. Next, for each motif, the areas with the highest density are detected and the results are analyzed and compared with official survey data. The authors conclude that Big Data, if properly treated, can provide further insights beyond traditional methods, with a robust outcome and in wider study areas.

In addition to the existence of common regular patterns in human mobility, common behaviours in mobility are also investigated. Oliveira et al. (2016) study the individual mobility patterns of people in eight major world cities, including Beijing, Tokyo, New York, San Francisco, London, Moscow and Mexico City, by using different types of datasets, including mobile network data. Their result reflects a clear repetitiveness in human mobility patterns. Additionally, they deduce that people tend to use the shortest path when moving around and that their displacement is confined.

In another study, with the goal of supporting or refuting the constant travel time budget hypothesis, Kung et al. (2014) study home-work commuting patterns with different mobile phone datasets at the country level (Ivory Coast, Portugal and Saudi Arabia) and the city level (Boston), as well as a car-only GPS tracking dataset (Milan). The authors propose to minimize confounding factors such as data collection and analysis by using a common methodology. They conclude, by focusing on a single region, that the commuting time is independent of the commuting distance. Moreover, some studies have also demonstrated that daily commuting and suburbanites have a direct impact on transportation demand, such as (Järv et al., 2012), where the authors show that this is due to amplification during the evening rush hours.

Many other studies use mobile data with a focus on a specific geographical region. Labeling the points of interest (POIs) for individuals in a city and extracting the Origin-Destination (OD) matrix provides valuable information for understanding regional commuting patterns. Different methods are used in the literature for extracting individuals' POIs, such as home, work and other locations, and then deriving daily trips in urban environments.

In one of the early studies, Frias-Martinez et al. (2012) show that it is possible to construct

mobile-phone-generated matrices that capture the same patterns as traditional commuting matrices. They use a CDR dataset of 3.5 million phones covering 2 months in 2009 in Madrid. For capturing the commuting time of the users, the authors use Genetic Algorithms and Simulated Annealing, and conclude that their approach constructs commuting matrices that are as good as the one provided by the National Statistical Institute.

Using passive mobile databases, Ahas et al. (2010) develop a model for determining meaningful locations for mobile users in the cities of Estonia. The detected home locations are supported by the official population register of the country. They conclude that it is possible to monitor the population's geography and mobility by using the available mobile database.

In another early study, Isaacman et al. (2011) develop a method based on logistic regression analysis of volunteers for clustering the Base Transceiver Stations (BTS) and detecting the important places for each user, in particular home and workplace. The authors argue that their method is capable of estimating the home and work locations with high accuracy. Moreover, they apply their results to estimate carbon emissions from daily traces.

Using mobile positioning data, Šveda & Barlík (2018) study daily commuting in the Bratislava metropolitan area in the absence of reliable census data. First, the home and work locations of the users in the database are identified based on the BTS with the highest ratio of activities during the night and during working hours on weekdays. Then the daily commuting flow between the central part of Bratislava and other districts is extracted. In the next step, regional centers of commuting are identified. These coincide with the features of spatial organization. The authors believe their commuting model, based on mobile location data, presents meaningful results that are cost effective and replicable at any time.

Intending to reduce the positioning error, Graells-Garrido et al. (2016) exploit the Antenna Virtual Placement method, based on the physical configuration (height, downtilt and azimuth) of the BTS, for estimating the locations of the mobile users. Next, anchor points (home and work) are identified for a fraction of subscribers based on defined time windows and the regularity of capturing a user in a specific location. Then, users are grouped according to their assigned home and work zones, and city zones are designed. In the next step, the zones are clustered into meaningful places based on the distributions of the floating population, and the flow of population is obtained between all the clusters. According to the authors, a very high correlation is obtained between important places and a survey-based OD-matrix, and the study results are quantitatively and qualitatively comparable to previous work in terms of land use analysis.

Zagatti et al. (2018) study the development of an OD-matrix in Haiti by labeling meaningful locations with an unsupervised learning method in a data-poor environment. The authors suggest their results provide policymakers and investors with helpful and cost-efficient data for understanding urban dynamics in the absence of well-designed travel surveys.

Demissie et al. (2019), using CDR data, study the mobility patterns at the country level in Senegal. They consider multiple origin and destination locations within a zone for estimating the inter-zone trip distances, instead of a centroid-to-centroid trip distance. After identifying the places of stay for every user, three entropy measures are calculated for each user's mobility pattern to measure the regularity of the trips of each user. The results show a high ratio of non-home-based trips in the area of study.

For an extensive review of different approaches to reproducing human mobility and its fundamental modeling principles and metrics with various data sources, we refer to (Barbosa et al., 2018).

3. The Nature of Mobile Data of Cellular Networks

There are many different types of mobile data discussed in the literature for extracting human mobility patterns. In most cases, the type, richness and quality of the data depend heavily on the telecommunication provider and its equipment. The most commonly used types are Call Detail Records (CDR) and the Visitor Location Registry (VLR), or a combination of both.

• CDR data is a set of registered records that reflect the mobile phones' interaction with the mobile network, other telecommunication devices or equipment, and that document the details of transactions. CDR data might contain the mobile identity, the cell-ID associated with the coverage area of a Base Transceiver Station (BTS), the start time and duration of the transaction, and the event type of the transaction. The recorded events are triggered, for example, when a user receives or makes a call or sends or receives an SMS. A record is generated for each such event, and no information is collected when the mobile phone is not in use.

• VLR is a database located in the mobile communication network and associated with a Mobile Switching Center (MSC). This type of data contains the exact location of all the mobile subscribers present in the service area handled by the MSC. Usually, this information is used for routing connections to the right base station. This type of data is deleted after the subscriber leaves the service area. However, it can be collected as stream data, and it carries information similar to CDR data but with more details about handover and phone activity. Furthermore, the VLR data stream is very suitable for real-time applications.

The mobile data collected from the mobile operator usually contains only the cell IDs of the coverage areas serviced by an antenna and timestamps. Therefore, the data represent

just temporal information about the occurrence of events. In order to use it for mobility analysis, it needs to be combined with cell-plan data that contains information about the shapes of the coverage areas and signal strength. This combination allows us to transform the temporal and geographical data into spatiotemporal data (Hadachi & Lind, 2019). This spatiotemporal data is then used to construct users' trajectories. These trajectories are characterized by their sparseness in space and time, as illustrated in Figure 1.

Figure 1: Illustration of a trajectory extracted from CDR data. Polygons are the location areas of the coverage and the arrows reflect the chronological order of the events.

The dataset used for this research has more than 600 million anonymized CDR and VLR events that were extracted from around 300000 users across the whole of Estonia during May 2018 (Table 1). In addition to the event records, the data is supplemented with cell-plan data, which contains information about the polygon shapes of the coverage areas corresponding to the BTS antennas. Furthermore, since our interest is the spatiotemporal aspect of the data, the event type (CDR or VLR) is not useful for our approach. Merging the two types of events does, however, bring slightly more density, due to the periodicity of the keep-alive event in VLR streams. Extracting spatial aspects remains challenging because of different obstacles that may lead to a considerable amount of noise when positioning the user, such as radio fluctuation caused by the topology of the environment, changes in weather conditions, or false displacements caused by load balancing in the mobile network.

Anonymized user ID   cell-ID   Timestamp    Event
G3Z03R8269           W8575     1525757266   par-c
G2H99K9882           F3268     1531421050   detach-c
G8J84W8462           I7520     1526482568   loc-up-c

Table 1: An example of CDR and VLR dataset attributes

4. Methodology

Our main objective is to understand intercity commuting patterns at the country level and also within the city. Hence, we extract from mobile data the OD matrix of the trips between the 77 municipalities of Estonia and then the OD matrix of the trips between the 15 counties of the country. The second objective of this study is to understand the daily commuting patterns between different city districts of the capital of Estonia, Tallinn. Therefore, in the first part of the study, the hidden states of the HMM correspond to municipalities and counties of the country, and in the second part they correspond to the districts of the city of Tallinn. After obtaining the parameters of the HMM, in each part, the Viterbi algorithm is used to assign the maximum likelihood sequence of hidden states (locations: municipalities or districts of a city) to a sequence of cell-IDs.

Figure 2 gives a high-level view of the architecture of the algorithm discussed above.

Figure 2: Illustration of the proposed algorithm steps for conducting this analysis

5. Hidden Markov Model

With the goal of converting a series of cell-IDs into a series of geographical regions (cities, counties or city districts), we construct an HMM. The location assignments for a sequence of cells are chosen in such a way that the obtained locations are geographical neighbors and can, in reality, define a meaningful and continuous movement of an individual mobile user. We should emphasize that, due to the sparseness of the data in time and space, the extraction of smooth movements is not a straightforward task. In fact, the HMM is the discrete version of the Kalman filter (Hadachi et al., 2018) for assigning a geo-location to the coverage area of a BTS. The Kalman filter assigns longitude and latitude coordinates to a visited cell-ID, in a continuous mode, whereas the HMM translates the network cells to a finite set of locations.

To this end, our model follows a method slightly similar to (Ghazvininejad et al., 2011) in training an HMM for detecting human movement episodes that reflect the signature of their daily activities.

Let $S = \{s_1, \cdots, s_n\}$ and $Y = \{y_1, \cdots, y_l\}$ be the set of states and the set of observations of the hidden Markov model, respectively. In our case, the states, $s_i$, correspond

to the geographical regions and the observations, $y_i$, correspond to cell-IDs.

We define the hidden Markov model $\lambda = \mathrm{HMM}(A, B, \Pi)$ as follows:

(i) $A$ is the state transition probability matrix with

$$a_{ij} = \Pr(s_j \mid s_i); \tag{1}$$

(ii) $\Pi$ is the initial state probability vector with

$$\pi_i = \Pr(s_i); \tag{2}$$

(iii) $B$ is the observation probability matrix with

$$b_i(y_t) = \Pr(y_t \mid s_i). \tag{3}$$

In order to obtain the HMM, $\lambda$, the parameters $A$, $B$ and $\Pi$ have to be calculated. The initial state probability vector $\Pi$ is assumed to be uniform. Therefore, it only remains to calculate $A$ and $B$.

Let $X_t$ be the frequency of observing the cell-ID $y_t$ and $X_{tr}$ be the frequency of observing the ordered pair of cell-IDs $(y_t, y_r)$, over all users and in the whole dataset.

$B$ is calculated by the straightforward use of Bayes' Theorem,

$$b_i(y_t) = \Pr(y_t \mid s_i) = \frac{\Pr(s_i \mid y_t)\,\Pr(y_t)}{\Pr(s_i)}. \tag{4}$$

As mentioned before, $\pi_i = \Pr(s_i)$ is assumed to be uniform, and $\Pr(y_t)$ is calculated as follows.

$$\Pr(y_t) = \frac{X_t}{\sum_i X_i} \tag{5}$$

Let $Z_{t,i} = \Pr(s_i \mid y_t)$. When a cell-ID $y_t$ is observed, $Z_{t,i}$ can be calculated according to the fraction of the coverage area of $y_t$ that lies in the region (city/zone) $s_i$, given in the cell-plan. Therefore, $B$ is obtained by Equation (4).

We calculate the transition probability matrix $A$ based on

$$a_{ij} = \frac{\text{Expected number of transitions from } s_i \text{ to } s_j}{\text{Expected number of transitions from } s_i}. \tag{6}$$

Expectations are obtained by using the statistical data of our dataset as follows

$$a'_{ij} = \frac{\sum_{t_1, t_2} Z_{t_1,i}\, Z_{t_2,j} \times X_{t_1 t_2}}{\sum_{t} Z_{t,i} \times X_t}. \tag{7}$$

At the end, to obtain the transition probability matrix $A$, the calculated matrix $A'$ in (7) needs to be normalized as follows.

$$a_{ij} = \frac{a'_{ij}}{\sum_k a'_{ik}} \tag{8}$$

At this point the parameters of the HMM are known and the training is completed. In the next step, to convert the set of observations $\{y'_1, \cdots, y'_l\}$ to the set of locations $\{s'_1, \cdots, s'_l\}$ with the highest probability, the Viterbi algorithm (Rabiner, 1989) is applied.

The method for the construction of the HMM is similar in all parts of our study. The only difference is the statistics used in calculating the parameters.

6. Data Preparation

In the first step, the trajectory of every anonymized user $u$ is extracted from the dataset for the whole period of the study. A trajectory is a sequence of network events such as

$$T^u = \{(C_1, T_1), \cdots, (C_n, T_n)\}, \tag{9}$$

where $(C_i, T_i)$ represents any network event, such as VLR and CDR, that occurred in the cell-ID $C_i$ at timestamp $T_i$. In other words, the user $u$ is observed by the BTS with cell-ID $C_i$ at the timestamp $T_i$. The event type is not of interest to us, since we only analyze the spatiotemporal aspects.

One of the obstacles that arise while using mobile data is the false displacements induced by the tower-to-tower load balancing performed by the mobile service provider, known as "Handover". This situation appears in the dataset in two ways. It can be observed that the user has not changed his location, but suddenly, in the data stream, he is detected by a different BTS with another cell-ID, which might be hundreds of meters away. Or, there are several network events with a time difference of a couple of seconds, back and forth, between two cell-IDs. These false displacements are called the hop and the ping-pong Handover effect, respectively.

To reduce the noise imposed by hops and ping-pongs, we apply a two-step filtering.

Let $\delta(C_i, C_j)$ represent the shortest distance between the polygons associated with the coverage areas of the cells $C_i$ and $C_j$, according to the cell-plan. More precisely, if $\mathcal{P}_{C_i}$ is the polygon of the coverage area of the cell-ID $C_i$,

$$\delta(C_i, C_j) = \min\{d(x, y) \mid x \in \mathcal{P}_{C_i} \text{ and } y \in \mathcal{P}_{C_j}\} \tag{10}$$

where $d(x, y)$ is the Euclidean distance between the points $x$ and $y$.

In the first step, the hop filter is applied on $T^u$, the trajectory of user $u$. That is, if for some $(C_i, T_i)$ and $(C_{i+1}, T_{i+1}) \in T^u$,

$$\Delta T(i) < \frac{\delta(C_i, C_{i+1})}{v_{max}} \tag{11}$$

then $(C_{i+1}, T_{i+1})$ is removed from $T^u$. The maximum speed $v_{max}$ is an upper bound for the speed limit in the studied area. In our case, $v_{max}$ is assumed to be 150 km/h, considering the fact that there is no domestic flight possibility.

In the second step, the ping-pong filter is applied. A window of a constant size, $k$, is considered on $T^u$. If we observe repeated cell-IDs with a time difference of less than $t = 1$ min, then all the intermediate cells are flagged as false.
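The first filtering step, the hop filter of Equation (11), can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the elapsed time in Equation (11) is the gap between two consecutive events, that each event is compared with the last event that was kept, and that the polygon distance of Equation (10) is available through a hypothetical lookup function `cell_distance_km`. The cell names and distances are toy values.

```python
from typing import Callable, List, Tuple

Event = Tuple[str, float]  # (cell-ID, Unix timestamp in seconds)

V_MAX_KMH = 150.0  # upper bound on speed stated in the text


def hop_filter(
    trajectory: List[Event],
    cell_distance_km: Callable[[str, str], float],
    v_max_kmh: float = V_MAX_KMH,
) -> List[Event]:
    """Drop events that would require travelling faster than v_max.

    Assumption: each event is compared with the last *kept* event, and
    cell_distance_km (hypothetical helper) returns the shortest distance
    between the two coverage polygons in kilometres.
    """
    if not trajectory:
        return []
    kept = [trajectory[0]]
    for cell, ts in trajectory[1:]:
        prev_cell, prev_ts = kept[-1]
        elapsed_hours = (ts - prev_ts) / 3600.0
        min_travel_hours = cell_distance_km(prev_cell, cell) / v_max_kmh
        if elapsed_hours < min_travel_hours:
            continue  # physically implausible jump: treat as a hop
        kept.append((cell, ts))
    return kept


# Toy cell-plan distances (illustrative values only).
_TOY_DIST = {
    frozenset(("A", "B")): 0.0,   # adjacent coverage areas
    frozenset(("A", "C")): 50.0,  # 50 km apart
    frozenset(("B", "C")): 50.0,
}


def toy_distance(a: str, b: str) -> float:
    return 0.0 if a == b else _TOY_DIST[frozenset((a, b))]


raw = [("A", 0), ("C", 60), ("B", 120), ("C", 3600)]
filtered = hop_filter(raw, toy_distance)
```

In this toy trajectory the event in cell C after 60 seconds is dropped, because covering 50 km at 150 km/h takes 20 minutes, while the later visit to C is kept.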

We shift the window one step at a time and repeat the same strategy. In the end, all false events are removed from $T^u$. In our case study, $k = 5$.

7. Intercity Commuting

According to the Estonian Land Board Geoportal [1], there are 15 counties and 77 (rural and urban) municipalities in Estonia. A map of Estonia's counties and municipalities is presented in Figure 3.

[1] The data is obtained from https://geoportaal.maaamet.ee, on 22.07.2019.

Figure 3: Map of the administrative partitions of Estonia. Counties and the municipalities of each county are illustrated. County borders are obtained from the webpage of the Estonian Land Board for Spatial Data.

In this section, the OD-matrix of the trips in Estonia is obtained at two resolution levels. First, the trip OD-matrix of all users in our dataset is obtained at the municipality level, and in the second part, the trips are analyzed based on the counties.

At first, we partition the country into 77 municipalities according to the data of the Estonian Land Board Geoportal. For ease of reference in our theoretical representation, we use the word city instead of municipality, but keep in mind that in reality each city may contain more than one municipality.

Let $\mathcal{C} = \{C_1, \cdots, C_k\}$ be the list of all cell-IDs of the cell-plan, provided by the mobile operator, and $City(C_i)$ be the set of all cities incident to the cell $C_i$. The coverage area of each cell is incident to one or more cities.

Consider the trajectory of the user $u$,

$$T^u = \{(C_1, T_1), \cdots, (C_n, T_n)\}. \tag{12}$$

Our goal is to extract the trips of all users. A trip occurs when an individual travels from an origin city to a destination city by passing through some intermediate cities. It is assumed that the traveler does not stay in the intermediate cities very long; otherwise the trip should be broken into two trips. To extract the trips, first, we go through each individual trajectory $T^u$ and break it down into chains of cells. Each cell chain will be translated later to a sequence of cities as a meaningful trip.

Let $X$ be a set of cities and $N[X]$ be the union of $X$ with all the neighbors of its cities, that is, the cities whose borders intersect theirs. We break $T^u$ into a set of chains with the following breaking rule:

$$N[X_{i-1}] \cap City(C_i) = \emptyset. \tag{13}$$

Here $N[X_i]$ is calculated based on the following recursive definition.

$$X_i = \begin{cases} City(C_i), & \text{if } i = 1 \\ N[X_{i-1}] \cap City(C_i), & \text{if } i \neq 1 \end{cases} \tag{14}$$

If the condition (13) holds, then the $(i-1)$th and the $i$th network events of the user $u$ have occurred in 2 different and non-neighboring cities, and they belong to separate chains of cells, as well as to different trips. Applying the rules in Equations (13) and (14) guarantees that the obtained chains of cells are not geographically scattered and can be converted into trips that satisfy our definition.

At this point, we have extracted several cell-level trajectory chains for each individual user. The next step is converting a chain of cells into a chain of cities. This is done using the HMM and applying the Viterbi algorithm, as discussed in Section 5. The already constructed HMM and the chain of cell-IDs, as observations, are the inputs for the Viterbi algorithm, and the corresponding chain of hidden states of the HMM (cities in our case) is the output of the algorithm, calculated based on the maximum likelihood method.

Now, for each user, we have several chains of cities, where each chain is an output of the Viterbi algorithm and one city may be repeated several times in a chain. A sample chain of cities for a user $u$ is similar to Figure 4, with each letter corresponding to a city. For convenience, each run of identical cities is considered as one block, as presented in Figure 4.

Figure 4: A sample city trajectory of a user.

Every chain reflects only the spatial aspect of the movement of the user. In order to detect a trip, temporal aspects must be taken into consideration. In the next step, the aim is to detect the movement status of the user in each visited city.

Let $(h, i, j)$ be 3 consecutive different cities observed in the user's city chain. Let $T_f(i)$ be the timestamp of the first time and $T_l(i)$ be the timestamp of the last time that the city $i$ is observed in a block. Let

$$\Delta T_{ij} = T_f(j) - T_l(i) - \frac{\text{distance of the 2 cities}}{\text{average speed}}. \tag{15}$$
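As an illustration of the decoding step described in this section, the sketch below turns a chain of cell-IDs into a chain of cities with the Viterbi algorithm and then collapses runs of identical cities into blocks. It is a toy example, not the authors' code. The two cities, three cells, coverage fractions, cell frequencies and transition matrix are invented for illustration, the initial distribution is uniform as stated in Section 5, and the emission scores follow Equations (4) and (5).

```python
import math
from itertools import groupby


def emission_probs(Z, X, states):
    """b_i(y) = Pr(s_i | y) * Pr(y) / Pr(s_i), with a uniform Pr(s_i)."""
    total = sum(X.values())
    prior = 1.0 / len(states)
    return {
        s: {y: Z[y].get(s, 0.0) * (X[y] / total) / prior for y in X}
        for s in states
    }


def _log(p):
    return math.log(p) if p > 0 else float("-inf")


def viterbi(obs, states, A, B):
    """Most likely state sequence for obs, computed in log-space."""
    log_pi = -math.log(len(states))  # uniform initial distribution
    score = {s: log_pi + _log(B[s][obs[0]]) for s in states}
    back = []
    for y in obs[1:]:
        new_score, pointers = {}, {}
        for j in states:
            best = max(states, key=lambda i: score[i] + _log(A[i][j]))
            new_score[j] = score[best] + _log(A[best][j]) + _log(B[j][y])
            pointers[j] = best
        score = new_score
        back.append(pointers)
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def collapse_blocks(path):
    """Merge consecutive repeats of the same city into one block."""
    return [city for city, _ in groupby(path)]


# Toy inputs (all values are illustrative assumptions).
states = ["P", "Q"]
Z = {"c1": {"P": 1.0}, "c2": {"P": 0.5, "Q": 0.5}, "c3": {"Q": 1.0}}
X = {"c1": 10, "c2": 10, "c3": 10}
A = {"P": {"P": 0.7, "Q": 0.3}, "Q": {"P": 0.1, "Q": 0.9}}

B = emission_probs(Z, X, states)
path = viterbi(["c1", "c2", "c2", "c3"], states, A, B)
blocks = collapse_blocks(path)
```

The cell c2 straddles both cities, so its assignment is decided by the transition probabilities rather than by the coverage fractions alone, which is the role the HMM plays in the text.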

$\Delta T_{ij}$ is the time during which the mobile network has no information about the location of the user and which is not spent on travel. We divide this time difference equally between both cities $i$ and $j$ and calculate the time the user spent in city $i$ as follows:

$$\tau_i = T_l(i) - T_f(i) + \tfrac{1}{2}\left(\Delta T_{ij} + \Delta T_{hi}\right). \qquad (16)$$

To extract the trips, we label each city in the trajectory with move or stay. A user is assumed to be in the stay mode in city $i$ if $\tau_i > 3$ hours and in the move mode otherwise. We consider a "trip" to exist if there is a trajectory connecting two consecutive stay locations. In other words, a trip is between two cities in the trajectory chain of a user where both cities are labeled stay; the pass-by cities are not taken into account. In the end, the number of trips between every two cities is aggregated over all users and the trip OD-matrices between municipalities are obtained.

8. Experimentation, results and discussion

Our experiment was conducted using a real dataset (described in detail in section 3) and focused on investigating and unveiling three major points:

• Large-scale commuting all over the country
• Commuting in the most populated counties with the highest mobility
• Commuting in the major cities with the highest percentage of trips

8.1. Large-scale commuting all over the country

Our first interest in this research is to unveil the large-scale commuting patterns all over Estonia using our proposed method. As a result, we extracted a total of 1723866 trips during May 2018, occurring between 77 different urban and rural municipalities of Estonia. Among these journeys, 1405552 trips originated and ended in the same county and 318314 trips took place between the cities of different counties. The extracted OD-matrix from our dataset is illustrated in figure 5. It is easily observable that the majority of the trips of each county are between cities of the same county. Another concentration of trips is visible in the column of Harju county, the county where the Estonian capital Tallinn is located. Besides the inside-county trips and trips to Harju county, higher numbers of trips are mostly headed to neighboring counties. As expected, the majority of the trips are between the cities of Harju county, with 955654 trips. This is not surprising, since more than 40% of the country's population lives in Harju county according to the statistics database of Estonia (Statistika, 2018) (Table 2) and 56% of enterprises are located in Harju county (Aasmäe & Sutova, 2018). The second-ranked county is Tartu county, with its educational activities and services industry.

Figure 5: The normalized OD-matrix of the trips between the counties of Estonia, derived by our algorithm. Each entry shows the fraction of the trips originating from and destined for the selected pair of counties, in comparison to all the trips that have taken place in Estonia.

Table 2: Population of Estonia in 2018 by county according to the statistics database of Estonia.

County               Population in 2018
Harju county         352939
Hiiu county          9210
Ida-Viru county      24333
Jõgeva county        26400
Järva county         29056
Lääne county         18165
Lääne-Viru county    51699
Põlva county         24110
Pärnu county         76702
[…] county           31363
[…]                  32535
Tartu county         126395
[…]                  23143
[…] county           44671
Võru county          34206
Unknown              750

Furthermore, in order to have a clear understanding of the mobility patterns among all the counties, we focused on depicting three types of trips or journeys:

• Trips from one specific county to others
• Trips within the county
• Trips from other counties to one specific county

The results are reflected in figure 6, where we show the fraction of the trips that originated from each county and terminated in the same county, that originated from each county and terminated in other counties, or that came from all the other counties to the selected county. Two interesting observations can be made here. Firstly, the major counties with high internal mobility are Harju county (the county with the largest population, which hosts the capital city Tallinn), Ida-Viru county, Tartu county and Pärnu county. Secondly, there is a symmetry in the number of trips in the counties with respect to inbound and outbound trips.
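The stay/move labelling of equations (15) and (16), the aggregation of trips into an OD-matrix, and the three county-level trip types listed above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: times are in hours, `travel_time` stands in for distance divided by average speed, the first and last blocks of a chain get a zero gap on their open side (the text does not specify this case), and the city letters, county labels and function names are hypothetical.

```python
from collections import Counter

STAY_THRESHOLD_H = 3.0  # stay if the estimated time in a city exceeds 3 hours


def dwell_times(blocks, travel_time):
    """blocks: [(city, T_f, T_l), ...] in hours; returns the time spent per block."""
    # Equation (15): unobserved, non-travel time between consecutive blocks.
    gaps = [
        tf_next - tl_prev - travel_time(prev, nxt)
        for (prev, _, tl_prev), (nxt, tf_next, _) in zip(blocks, blocks[1:])
    ]
    spent = []
    for k, (_, tf, tl) in enumerate(blocks):
        gap_before = gaps[k - 1] if k > 0 else 0.0     # assumption at chain start
        gap_after = gaps[k] if k < len(gaps) else 0.0  # assumption at chain end
        spent.append(tl - tf + 0.5 * (gap_after + gap_before))  # equation (16)
    return spent


def extract_trips(blocks, travel_time):
    """A trip links two consecutive stay cities; pass-by (move) cities are skipped."""
    spent = dwell_times(blocks, travel_time)
    stays = [city for (city, _, _), t in zip(blocks, spent) if t > STAY_THRESHOLD_H]
    return list(zip(stays, stays[1:]))


def county_summary(od, county_of):
    """Split OD counts into within-county, outbound and inbound totals per county."""
    within, outbound, inbound = Counter(), Counter(), Counter()
    for (origin, dest), n in od.items():
        co, cd = county_of[origin], county_of[dest]
        if co == cd:
            within[co] += n
        else:
            outbound[co] += n
            inbound[cd] += n
    return within, outbound, inbound


# Hypothetical chains of (city, T_f, T_l) blocks for two users, in hours.
chains = [
    [("A", 0.0, 8.0), ("B", 8.5, 9.0), ("C", 11.0, 18.0)],
    [("A", 0.0, 7.0), ("B", 8.0, 16.0), ("A", 17.0, 24.0)],
]
county_of = {"A": "X", "B": "X", "C": "Y"}
half_hour = lambda a, b: 0.5  # stand-in for distance / average speed

od = Counter()
for blocks in chains:
    od.update(extract_trips(blocks, half_hour))
within, outbound, inbound = county_summary(od, county_of)
```

For the first user, city B is only passed through (about 1.25 hours), so the single trip runs from A to C; the second user stays in all three blocks and contributes two within-county trips.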

Figure 6: The diagram of trips between counties of Estonia. County borders are obtained from the webpage of the Estonian Land Board for Spatial Data.

Furthermore, to investigate this inbound and outbound mobility, we extracted the top trip destinations among Estonian cities, obtained from our trip extraction methods (Figure 7). The distribution obtained shows that the capital city of each county reflects the importance of mobility within the country and also of inbound and outbound mobility. Based on the analysis of the large-scale commuting patterns and the distribution of trip destinations, it is established that most of the mobility happens in Harju county, concentrated around Tallinn city, and in Tartu county, centred around Tartu city. Therefore, in the next section, we focus on depicting the mobility within Harju and Tartu counties, with Tallinn and Tartu as the respective gravitational pull areas.

Figure 7: The distribution of top trip destinations among the Estonian cities, obtained from our trip extraction methods.

8.2. Commuting in the most populated counties with the highest mobility

The population density and mobility are concentrated in two major counties: Harju and Tartu. Therefore, our next step is to zoom in geographically to the level of municipalities to better understand the mobility patterns.

8.2.1. Commuting in Harju county

To obtain more details on the commuting patterns of the population, we extracted the OD matrix between municipalities within Harju county based on more than 350,000 trips from different municipalities and cities of Harju county to Tallinn over the duration of one month. Figure 8 shows the distribution of the trips between all the municipalities of Harju county as well as the major cities within it.

Figure 8: The OD-matrix of the trips between municipalities of Harju county. Every row shows the normalized vector of the percentages of the trips from the selected municipality to all the municipalities of Harju county.

The distribution clearly reflects that the commuting patterns are concentrated around Tallinn. This means that Tallinn is a major destination with respect to all extracted journeys. However, some important aspects of the algorithm design must be kept in mind in order to read the OD matrix correctly. For example, when both the origin and the destination are in the same city or municipality, it means that the movement originated and terminated in the same city, but that the person also visited at least one different city in between, without staying there for more than 3 hours. In addition, we can notice from the OD matrix that there is more activity or commuting between neighbouring municipalities than between non-neighbouring ones.

Furthermore, for more clarification from a geographical perspective, figure 9 shows the distribution of these trips. The map confirms that the major municipalities contributing to the commuting towards Tallinn as a destination are […], Jõelähtme, and Rae. The contributions of the municipalities without arrows to the flow of trips to Tallinn are less than 2% for each municipality.

Figure 9: Trips to Tallinn from the other municipalities of Harju county. County borders are obtained from the webpage of the Estonian Land Board for Spatial Data.

To get a better view and understanding of the scale of commuting, it is crucial to reflect upon the percentage of trips during workdays and weekends per municipality and major city within the county. Figure 10 shows the contributions of different cities in Harju county to the trips originated and terminated in the same county, during both workdays and weekends. It can be observed that a large fraction of trips to industrial cities such as Tallinn city or Rae municipality happens during workdays and, in contrast, a large fraction of trips to Lääne-Harju or […] municipalities happens more frequently on weekends.

Figure 10: The percentage of trips terminated in different municipalities and major cities of Harju county during workdays and weekends with respect to the total trips registered within the county.

Moreover, the data provided by Statistics Estonia on the economic activities within each municipality or major city in the county gives an idea about the nature and the scale of the activities. For example, figure 11 reflects the diversity of the activities in Tallinn. Its role as the capital of Estonia makes all the activities concentrate around it, which is also reflected by our findings with respect to the mobility and commuting rates during weekdays and weekends. In addition, the distribution of commuting during weekdays and weekends shown in figure 10 is associated with the economic activities registered in Harju county.

Figure 11: The percentage of economic activities among each of the municipalities and major cities within Harju county during 2018, based on Statistics Estonia.

8.2.2. Commuting in Tartu county

The second most populated county of Estonia is Tartu county, whose capital city Tartu is also the second largest city in Estonia. Figure 12 presents the OD matrix of the trips between the cities of Tartu county. The total number of trips extracted within Tartu county is more than 90000. The same pattern appears again in the case of Tartu county: most of the mobility is concentrated around Tartu city. Moreover, it is visible that the municipalities neighbouring Tartu city also contribute to the general commuting events happening in the county.

Figure 12: The OD-matrix of the trips between municipalities of Tartu county. Every row shows the normalized vector of the percentages of trips from the selected municipality to all the municipalities of Tartu county.

Furthermore, since the majority of the commuting converges to Tartu city, we explore, in a similar fashion to the previous section, the geographical aspects of the commuting

through a map illustration, as shown in figure 13. The figure shows the scale of the distribution of trips or commuting with destination Tartu city within the county. It is clear that the two main municipalities with a high commuting frequency to Tartu are Tartu county and Kambja, followed by […], Elva, Nõo, Kastre and Peipsiääre.

Figure 13: Trips to Tartu from other municipalities of Tartu county. County borders are obtained from the webpage of the Estonian Land Board for Spatial Data.

Moreover, because Tartu city is the most populated town within Tartu county, it is clear that a high share of the commuting would occur in Tartu city. This is confirmed in figure 14, which compares the trips terminated in different cities and municipalities within Tartu county during working days and weekends. Tartu city has the most mobility activity during workdays, with a considerable share of commuting also during weekends. The same pattern also appears in Kambja municipality. However, Elva municipality and Peipsiääre municipality are more popular during the weekends.

Figure 14: The percentage of trips terminated in different municipalities and major cities of Tartu county during workdays and weekends with respect to the total trips registered within the county.

Correspondingly, the economic activities in Tartu county are concentrated around Tartu city, as illustrated in figure 15. We also noticed that the activities are to some extent significant concerning the municipalities of Elva, and Tartu. These regularities are also noticeable in the commuting patterns in figure 14.

Figure 15: The percentage of economic activities among each of the municipalities and major cities within Tartu county during 2018, based on Statistics Estonia.

8.3. Commuting in Tallinn

In the second part of our work, we focus on the capital, Tallinn. Our goal is to understand the commuting patterns between different districts of the city. To this end, we start by detecting the homes and workplaces of the users in our dataset and extract the distribution of home places and workplaces in the Tallinn area. Then we compute the OD matrix of the morning flow and the afternoon flow between different administrative districts of the city, with a method similar to the trip extraction technique explained in section 5, using an HMM. Then, we compare the results obtained with the different approaches.

8.3.1. Home and work detection

The city is divided into eight administrative districts, according to the official data. In the first step, home and work cell-IDs are detected for each mobile user. Assuming that people usually stay at home during the night, for every individual mobile user, the most frequent cell-ID in the time window between 22:00 and 3:00 is considered the home cell-ID. Similarly, the most frequent cell-ID in the time window between 11:00 and 16:00 during the working days of the week is assigned as the work cell-ID.

To achieve a clearer overview of the distribution of homes and workplaces in the city, we created corresponding heat-maps of these locations, presented in figure 16. Instead of assigning the centroids of the cell coverage area polygons to the position of the user, in our approach, the locations of

the users are estimated based on the fraction of the intersection of the polygons with different regions, according to the cell plan. In the end, the expected number of residents of each region is calculated. The approach can be formalized as follows. Let $X_A$ be a random variable representing the number of users with home (work) in the geographical region $A$ and $X_C$ be the number of users with assigned cell-ID $C$ for home (work), obtained from the highest frequency method. Let $P_C$ be the polygon of the coverage area associated with the cell-ID $C$. Then:

$$\Pr(X_A) = \sum_{C} \frac{\mathrm{area}(A \cap P_C)}{\mathrm{area}(P_C)} \Pr(X_C). \qquad (17)$$

$\Pr(X_C)$ can be estimated based on the ratio of the number of users with triggered home (work) in $C$ to the total number of users. In a similar way, the expected value of the number of users with home (work) in the area $A$ is

$$E[X_A] = \sum_{C} \frac{\mathrm{area}(A \cap P_C)}{\mathrm{area}(P_C)} X_C. \qquad (18)$$
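A minimal sketch of the two steps above, the most-frequent-cell rule for home and work and the area-weighted expectation of equation (18), is given below. It assumes that events are given as (cell-ID, hour of day) pairs already restricted to the relevant days, and that the overlap fractions area(A ∩ polygon of C) / area(polygon of C) have been precomputed with a GIS tool; the cell names, counts, fractions and function names are hypothetical.

```python
from collections import Counter


def in_window(hour, start, end):
    """True if hour lies in [start, end); windows may wrap past midnight."""
    if start < end:
        return start <= hour < end
    return hour >= start or hour < end


def most_frequent_cell(events, start, end):
    """events: iterable of (cell_id, hour_of_day); returns the modal cell-ID or None."""
    counts = Counter(cell for cell, hour in events if in_window(hour, start, end))
    return counts.most_common(1)[0][0] if counts else None


def expected_users(overlap_fraction, users_per_cell):
    """Equation (18): sum over cells of the overlap fraction times the cell's user count."""
    return sum(frac * users_per_cell.get(cell, 0)
               for cell, frac in overlap_fraction.items())


# Hypothetical network events of one user: (cell-ID, hour of day).
events = [("c7", 23), ("c7", 1), ("c2", 12), ("c2", 14), ("c7", 2), ("c9", 22)]
home = most_frequent_cell(events, 22, 3)   # night window 22:00 to 3:00
work = most_frequent_cell(events, 11, 16)  # day window 11:00 to 16:00

# Hypothetical rectangle A: a quarter of c7's polygon and half of c2's lie inside it.
overlap_with_A = {"c7": 0.25, "c2": 0.5}
home_users_per_cell = {"c7": 40, "c2": 10, "c9": 5}
expected_in_A = expected_users(overlap_with_A, home_users_per_cell)
```

Spreading each cell's users over the regions it overlaps, rather than placing them all at the cell centroid, is what lets the heat-map resolve rectangles that are smaller than a cell's coverage area.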

For identifying the places of concentration, the area of the city is first discretized into rectangles with dimensions of approximately 200m × 250m. Then, for each rectangle $A$, the expected number of users with home (work) in the region $A$ is estimated according to equation 18. Figure 16 illustrates the results of the explained home and work detection method in the city of Tallinn.

Figure 16: The heat-map of detected home and work locations in eight administrative districts of Tallinn: (a) Home, (b) Work. Each rectangle in the map is of size 200m × 250m and the color represents the detected number of people with home or workplace in that rectangle. Municipality borders are obtained from the webpage of the Estonian Land Board for Spatial Data.

For validation, we compared the detected home places of the users in our dataset to the data of the population register, obtained from the homepage of the city government of Tallinn (https://www.tallinn.ee/est/Tallinna-elanike-arv). The results in figure 17 show that our method was capable of estimating the population with a standard error of estimation $\sigma_{est} = 2.05$ percent and $R^2 = 0.89$.

Figure 17: The comparison of the fraction of inhabitants in the 8 administrative districts of Tallinn, derived from our algorithm and the official data of home registration.

No official statistics are available regarding people's workplaces in Estonia. Therefore, we are not able to validate the results of the workplace estimation.

8.3.2. Home-work distribution matrix of Tallinn

Based on the extracted distribution of home places and workplaces in section 8.3.1, in this part we estimate the home-work distribution matrix of the districts of Tallinn. Our objective is to find the distribution of workplaces of the inhabitants of each district. In the absence of official data on workplaces in Estonia, we try to check the consistency of our findings from different perspectives.

To focus more on the working population, we first filter out the users with identical detected cell-IDs for their home and work. This group of people may consist of home-workers, retired people or people with various work schedules. Then, we compute the expected number of people with home place in district $A$ and workplace in district $B$ for every pair of districts $(A, B)$.

The calculation method is similar to what was explained in section 8.3.1. Let $X_{A,B}$ be the random variable of the number of users with home place in district $A$ and with workplace in district $B$. Let $X_{C,D}$ be the number of users with detected cell-ID $C$ for their home and $D$ for their workplace, and let $P_C$ and $P_D$ denote the coverage-area polygons of cells $C$ and $D$. Equation 19 calculates the expected number of mobile users living in district $A$ and working in district $B$.

$$E[X_{A,B}] = \sum_{C,D} \frac{\mathrm{area}(A \cap P_C)}{\mathrm{area}(P_C)} \times \frac{\mathrm{area}(B \cap P_D)}{\mathrm{area}(P_D)} X_{C,D}. \qquad (19)$$

The results are presented in figure 18. Each row of the matrix is the normalized distribution vector of the estimated workplaces of the selected administrative district. It can be observed from the matrix that the main concentration of movements is on the diagonal, which reflects people living and working in the same neighborhood. The second highest level of concentration of workplaces is in the Kesklinn (city center) column. This is not surprising, since this region is the central and business heart of the city. Another observation from the results is the low concentration of workplaces in […] and Nõmme districts. This can also be well explained by the fact that these regions are mainly residential areas, located on the outskirts of the city. We estimate that 37% of the population in Tallinn live and work in the same district.

Figure 18: Home-work distribution matrix of the Tallinn population. Each row shows the distribution of workplaces of inhabitants of the selected administrative district.

8.3.3. Morning and evening movement flows in Tallinn

In this part, we try to understand how the population flows between different districts of Tallinn during the morning and evening peak hours. We consider the flow as the aggregated number of trips between pairs of districts in certain time windows.

First, the trips that have taken place during morning and evening peak hours are extracted. Then, OD flows between the districts of Tallinn in the morning and in the evening time windows are estimated. A trip starts from an origin district and passes by other districts, under the condition that every two consecutive districts in the sequence are geographical neighbors and that the user does not stay in one district for more than one hour.

For extracting the trips, we follow the same method for training an HMM that is explained in sections 5 and 7. The hidden states of the HMM, in this case, are the eight administrative districts of Tallinn. To adapt our method to the urban environment, when detecting the move and stay episodes, we choose a threshold time of one hour instead of the three hours used in the intercity commuting case.

For computing the morning flow, we extract all the trips that start in the time window between 7:00 and 12:00 during the working days of the week. The accumulated number of trips originated and terminated in each pair of different districts is considered as the flow between two regions. The OD matrix of the morning flow is presented in Figure 19.

Figure 19: The OD matrix of the morning flow between districts of Tallinn. Each row shows the percentage of trips from the selected district to all the other districts of the city.

With our approach, if a trip starts and ends in the same

district, it must have visited some other districts in between. Therefore, it cannot be considered as an inter-district trip. Hence, in order to avoid overcounting, we have ignored the diagonal of the matrix in Figure 19. A concentration of the morning flow terminating in the city center, Kesklinn, is observable in the results. Also, according to our OD-matrix, the least popular morning destinations are Nõmme and Pirita.

Figure 20: The OD matrix of the evening flows between districts of Tallinn. Each column shows the percentage of trips to the selected district from all the other districts of the city.

Similarly, the evening flow of the city is estimated from the trips that begin in the time window between 16:00 and 21:00 during working days. The result of the flow estimation is presented in figure 20. In order to focus on the destinations, the trip numbers are normalized by column. Therefore, each column shows the normalized vector of the trips terminated in the selected district from other districts. In this case, with the same approach as in the morning flow extraction, the diagonal of the matrix is discarded. We can observe the concentration of trips originating from Kesklinn (city center) to other districts. Another interesting remark is the symmetry of the derived flows in the morning and in the evening. It can be observed that the distribution of every row in figure 19 is very close to the distribution of the corresponding column in Figure 20. On the macro level, this supports the hypothesis that people leave their stay places in the morning and go back there in the evening.

Several studies have investigated the regularity and predictability of human mobility (Gonzalez et al., 2008; Song et al., 2010). According to Schneider et al. (2013), only a small number of patterns is sufficient to describe the daily mobility of up to 90% of the population. Moreover, the mobility habits of people are highly similar, regardless of the city and the nature of the datasets (Oliveira et al., 2016). Under the assumption that most trips originate from home places in the mornings and are destined for home places in the evenings during working days, we check the consistency of our results obtained from the home detection method in section 8.3.1, the morning flow and the afternoon flow, and compare them with the official data of the home registry.

Figure 21: Comparison of the distribution of inhabitants in eight administrative districts of Tallinn derived from the official data of home registration, the estimations obtained from the home detection algorithm, the morning flow and the afternoon flow.

The total number of trips originating in the morning from a region of a city can predict the number of inhabitants of that region. Similarly, this number can be estimated by the number of trips terminating in the selected region in the afternoon. The comparison between the estimated population distributions obtained from the morning flow, the afternoon flow, the home estimation method explained in section 8.3.1 and the official data of home registration is shown in figure 21. It is observed that the estimation based on the evening flow is more correlated with the result of the home detection algorithm and also with the official data. On the contrary, the estimation based on the morning flow deviates drastically, particularly in the central districts of Kesklinn and […]. One reason might be that central areas form the economic and business heart of the city and many people leave their home places for these neighborhoods earlier than our algorithm's threshold, 7:00 in the morning.

8.3.4. Home to work trips in Tallinn

In the end, assuming that the majority of the trips during working days are home-work trips (Jiang et al., 2017; Schneider et al., 2013), we present the correlation between the home-work distribution matrix derived in section 8.3.2 and the morning flow and the afternoon flow between different districts of Tallinn. Figures 22 and 23 reflect these correlations. A stronger correlation is apparent in the case of the afternoon flow.

9. Conclusion and Future Directions

Understanding human mobility dynamics and patterns on a large scale has a direct positive impact on our existing transportation platforms, transportation planning strategies and sustainability. In this paper, we presented a methodology that can unveil and enable an understanding

of commuting patterns based on massive anonymized mobile phones' CDR data of cellular networks. The purpose of this work is to present a potential replacement approach for traditional travel survey data, which can be misleading even if rich in information. This issue can generate erroneous mobility patterns due to corrupted or incomplete records in the surveys. Therefore, the rapid development of ICT and the usage of smartphones in our daily activities make them a good resource for sensing mobility on a large scale in urban and rural areas.

Figure 22: Correlation between the estimated home-work distribution matrix and the morning flow of the population between districts in Tallinn.

Figure 23: Correlation between the estimated home-work distribution matrix and the evening flow of the population between districts in Tallinn.

In the last decades, we witnessed a big change in our economic structure, interdisciplinary workforce, flexible working time, and different working locations, which have contributed to the increase in the complexity of our mobility behavior. Therefore, traditional approaches such as off-peak-hours strategies or congestion mitigation might be less effective since they do not consider the increase of non-work trips or unusual movement behavior. Furthermore, the work presented here can be easily extended to understand and answer some of the important questions about socio-economic aspects of mobility concerning economic activities, migration, segregation, community formation, social interactions, etc. With the algorithm presented here, based on CDR data, we can gain an understanding of the demographics and their characteristics, which will facilitate the tasks of urban planners in designing new roads, new districts and new residential areas or in managing the city during special events by understanding crowd formation, movement, and displacement. Stronger assessments can be obtained when more validation data are available.

Acknowledgment

The authors gratefully acknowledge the contribution of Tele2 Eesti for their help in providing the data through the project "Population Movement Analytics, Monitoring and Prediction Algorithms". This project and research are supported by Archimedes Foundation and Mooncascade OÜ under the Framework of Support for Applied Research in Smart Specialization Growth Areas.

References

Aasmäe, K., & Sutova, S. (2018). Number of economic units up last year. URL: https://www.stat.ee/article-2018-05-23_number-of-economic-units-up-last-year.
Ahas, R., Silm, S., Järv, O., Saluveer, E., & Tiru, M. (2010). Using mobile positioning data to model locations meaningful to users of mobile phones. Journal of Urban Technology, 17, 3–27.
Barbosa, H., Barthelemy, M., Ghoshal, G., James, C. R., Lenormand, M., Louail, T., Menezes, R., Ramasco, J. J., Simini, F., & Tomasini, M. (2018). Human mobility: Models and applications. Physics Reports, 734, 1–74.
Barchiesi, D., Preis, T., Bishop, S., & Moat, H. S. (2015). Modelling human mobility patterns using photographic data shared online. Royal Society Open Science, 2, 150046.
Barlacchi, G., De Nadai, M., Larcher, R., Casella, A., Chitic, C., Torrisi, G., Antonelli, F., Vespignani, A., Pentland, A., & Lepri, B. (2015). A multi-source dataset of urban life in the city of Milan and the province of Trentino. Scientific Data, 2, 150055.
Bazzani, A., Giorgini, B., Rambaldi, S., Gallotti, R., & Giovannini, L. (2010). Statistical laws in urban mobility from microscopic GPS data in the area of Florence. Journal of Statistical Mechanics: Theory and Experiment, 2010, P05001.
Becker, R., Cáceres, R., Hanson, K., Isaacman, S., Loh, J. M., Martonosi, M., Rowland, J., Urbanek, S., Varshavsky, A., & Volinsky, C. (2013). Human mobility characterization from cellular network data. Communications of the ACM, 56, 74–82.
Blondel, V. D., Esch, M., Chan, C., Clérot, F., Deville, P., Huens, E., Morlot, F., Smoreda, Z., & Ziemlicki, C. (2012). Data for development: The D4D challenge on mobile phone data. arXiv preprint arXiv:1210.0137.
Brockmann, D., Hufnagel, L., & Geisel, T. (2006). The scaling laws of human travel. Nature, 439, 462.
Demissie, M. G., Phithakkitnukoon, S., Kattan, L., & Farhan, A. (2019). Understanding human mobility patterns in a developing country using mobile phone data. Data Science Journal, 18.
Frias-Martinez, V., Soguero, C., & Frias-Martinez, E. (2012). Estimation of urban commuting patterns using cellphone network data. In Proceedings of the ACM SIGKDD international workshop on urban computing (pp. 9–16). ACM.
Ghazvininejad, M., Rabiee, H. R., Pourdamghani, N., & Khanipour, P. (2011). HMM based semi-supervised learning for activity recognition. In Proceedings of the 2011 international workshop on Situation activity & goal awareness (pp. 95–100). ACM.
Gonzalez, M. C., Hidalgo, C. A., & Barabasi, A.-L. (2008). Understanding individual human mobility patterns. Nature, 453, 779.
Graells-Garrido, E., Peredo, O., & García, J. (2016). Sensing urban patterns with antenna mappings: the case of Santiago, Chile. Sensors, 16, 1098.

Hadachi, A., Batrashev, O., Lind, A., Singer, G., & Vainikko, E. (2014). Cell phone subscribers mobility prediction using enhanced Markov chain algorithm. In 2014 IEEE Intelligent Vehicles Symposium Proceedings (pp. 1049–1054). IEEE.
Hadachi, A., & Lind, A. (2019). Exploring a new model for mobile positioning based on CDR data of the cellular networks. arXiv preprint arXiv:1902.09399.
Hadachi, A., Lind, A., Lomps, J., & Piksarv, P. (2018). From mobility analysis to mobility hubs discovery: A concept based on using CDR data of the mobile networks. In 2018 10th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT) (pp. 1–6). IEEE.
Isaacman, S., Becker, R., Cáceres, R., Kobourov, S., Martonosi, M., Rowland, J., & Varshavsky, A. (2011). Identifying important places in people's lives from cellular network data. In International Conference on Pervasive Computing (pp. 133–151). Springer.
Janecek, A., Valerio, D., Hummel, K. A., Ricciato, F., & Hlavacs, H. (2015). The cellular network as a sensor: From mobile phone data to real-time road traffic monitoring. IEEE Transactions on Intelligent Transportation Systems, 16, 2551–2572.
Järv, O., Ahas, R., Saluveer, E., Derudder, B., & Witlox, F. (2012). Mobile phones in a traffic flow: a geographical perspective to evening rush hour traffic analysis using call detail records. PloS one, 7.
Järv, O., Ahas, R., & Witlox, F. (2014). Understanding monthly variability in human activity spaces: A twelve-month study using mobile phone call detail records. Transportation Research Part C: Emerging Technologies, 38, 122–135.
Järv, O., Müürisepp, K., Ahas, R., Derudder, B., & Witlox, F. (2015). Ethnic differences in activity spaces as a characteristic of segregation: A study based on mobile phone usage in Tallinn, Estonia. Urban Studies, 52, 2680–2698.
Jiang, S., Ferreira, J., & Gonzalez, M. C. (2017). Activity-based human mobility patterns inferred from mobile phone data: A case study of Singapore. IEEE Transactions on Big Data, 3, 208–219.
Jurdak, R., Zhao, K., Liu, J., AbouJaoude, M., Cameron, M., & Newth, D. (2015). Understanding human mobility from Twitter. PloS one, 10, e0131469.
Kujala, R., Aledavood, T., & Saramäki, J. (2016). Estimation and monitoring of city-to-city travel times using call detail records. EPJ Data Science, 5, 6.
Kung, K. S., Greco, K., Sobolevsky, S., & Ratti, C. (2014). Exploring universal patterns in human home-work commuting from mobile phone data. PloS one, 9, e96180.
Lind, A., Hadachi, A., Piksarv, P., & Batrashev, O. (2017). Spatio-temporal mobility analysis for community detection in the mobile networks using CDR data. In 2017 9th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT) (pp. 250–255). IEEE.
Liu, Y., Wang, F., Xiao, Y., & Gao, S. (2012). Urban land uses and traffic 'source-sink areas': Evidence from GPS-enabled taxi data in Shanghai. Landscape and Urban Planning, 106, 73–87.
Ma, X., Liu, C., Wen, H., Wang, Y., & Wu, Y.-J. (2017). Understanding commuting patterns using transit smart card data. Journal of Transport Geography, 58, 135–145.
Oliveira, E. M. R., Viana, A. C., Sarraute, C., Brea, J., & Alvarez-Hamelin, I. (2016). On the regularity of human mobility. Pervasive and Mobile Computing, 33, 73–90.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257–286.
Ricciato, F., Widhalm, P., Pantisano, F., & Craglia, M. (2017). Beyond the "single-operator, CDR-only" paradigm: An interoperable framework for mobile phone network data analyses and population density estimation. Pervasive and Mobile Computing, 35, 65–82.
Schneider, C. M., Belik, V., Couronné, T., Smoreda, Z., & González, M. C. (2013). Unravelling daily human mobility motifs. Journal of The Royal Society Interface, 10, 20130246.
Song, C., Qu, Z., Blumm, N., & Barabási, A.-L. (2010). Limits of predictability in human mobility. Science, 327, 1018–1021.
Statistika, E. (2018). Population of Estonia - database. URL: http://andmebaas.stat.ee/?lang=en&SubSessionId=e94bb697-2ade-4cfa-ad3d-8c2d00730251&themetreeid=-200#.
Šveda, M., & Barlík, P. (2018). Daily commuting in the Bratislava metropolitan area: case study with mobile positioning data. Papers in Applied Geography, 4, 409–423.
Tolouei, R., Psarras, S., & Prince, R. (2017). Origin-destination trip matrix development: Conventional methods versus mobile phone data. Transportation Research Procedia, 26, 39–52.
Yang, C., Xiao, M., Ding, X., Tian, W., Zhai, Y., Chen, J., Liu, L., & Ye, X. (2019). Exploring human mobility patterns using geo-tagged social media data at the group level. Journal of Spatial Science, 64, 221–238.
Zagatti, G. A., Gonzalez, M., Avner, P., Lozano-Gracia, N., Brooks, C. J., Albert, M., Gray, J., Antos, S. E., Burci, P., zu Erbach-Schoenberg, E. et al. (2018). A trip to work: Estimation of origin and destination of commuting patterns in the main metropolitan regions of Haiti using CDR. Development Engineering, 3, 133–165.
Zhai, W., Bai, X., Peng, Z.-r., & Gu, C. (2019). From edit distance to augmented space-time-weighted edit distance: Detecting and clustering patterns of human activities in Puget Sound region. Journal of Transport Geography, 78, 41–55.
