Analysis and Characterization of the Public Transport Mobility of Senior Citizens

Escuela Técnica Superior de Ingenieros Informáticos Universidad Politécnica de Madrid

Master's Thesis (Trabajo Fin de Máster)
Máster Universitario en Inteligencia Artificial

Supervisors: Víctor Rodríguez Doncel, Óscar Corcho García
Author: Alexander Lacki

2019

Acknowledgements

I would like to express my great appreciation to Dr. Víctor Rodríguez Doncel for his constructive suggestions and support in this project as well as his general advice and mentoring.

I would also like to extend my thanks to Javier Chamorro and the IT department of CRTM for their cooperation in this project, the funding of this work, as well as enabling me to visit their offices.

Finally, I wish to thank my parents for their support and encouragement throughout my study.

Abstract

This thesis describes an analysis of the mobility of senior citizens in the Madrid metropolitan area, carried out as part of a cooperation between the Consorcio Regional de Transportes de Madrid and the Ontology Engineering Group. Check-in registrations from real user smart cards are evaluated in order to characterize the behavior of senior citizens in the frequency, temporal, and spatial domains. In addition, users are clustered in the temporal domain in order to identify dominant groups within the population.

The mobility of senior citizens is additionally investigated for behavior that would be detrimental to an active and healthy lifestyle. This is accomplished with survival analysis models, which identify the impact that travel to different neighborhoods has on customer retention rates. The obtained models are furthermore used to investigate the possibility of an intervention system. Such a system would continually analyze the travel behavior of individuals and indicate behavioral changes that could lead to undesirable developments.

Resumen

This thesis describes different aspects of the mobility of elderly people in the Madrid metropolitan area, especially those related to their personal condition: a decline in an elderly person's mobility may indicate a situation of risk. In this study, the transport card records of real CRTM users have been evaluated and used to characterize the behavior of elderly people in the frequency, temporal, and spatial domains. The probability distribution functions of the user samples have been computed and averaged, and clustering has been performed in the temporal domain to identify dominant groups within the population.

The mobility of elderly people has also been investigated in order to detect behaviors that would be detrimental to an active and healthy lifestyle. This has been achieved using survival analysis models (Cox proportional hazards model), which identify the impact that travel to different neighborhoods has on customer retention rates. The obtained models are additionally used to investigate the possibility of an intervention system, which would continually analyze the travel behavior of individuals and raise alerts about behavioral changes that could point to the existence of a situation of risk.

Contents

Acknowledgements
Abstract
Resumen

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Madrid’s Public Transport System
    1.3.1 Metro
    1.3.2 Urban Bus
    1.3.3 Road Transport Concessions
    1.3.4 Cercanias
    1.3.5 Light Rail
    1.3.6 Fare Types
  1.4 Methodology and Reproducibility
    1.4.1 Methodology
    1.4.2 Reproducibility
  1.5 Thesis Organization

2 Literature Review
  2.1 Big Data and Public Transport
  2.2 Population Characteristics and Clustering
    2.2.1 Population Characteristics
    2.2.2 Clustering - General Overview
    2.2.3 Clustering of Public Transport Users
  2.3 Survival Analysis
    2.3.1 Survival Analysis - General Overview
    2.3.2 Survival Analysis of Public Transport Users

3 Data Description and Preprocessing
  3.1 Data Description
  3.2 Data Preprocessing
  3.3 Representation of Individual Mobility
    3.3.1 Frequency Domain
    3.3.2 Temporal Domain
    3.3.3 Spatial Domain

4 Characterization of Senior Citizens’ Mobility
  4.1 Population Characteristics
    4.1.1 Methodology
    4.1.2 Hypotheses and Assumptions
    4.1.3 Results and Discussion
    4.1.4 Evaluation
  4.2 Clustering and Dominant User Groups
    4.2.1 Methodology
    4.2.2 Results and Discussion
    4.2.3 Evaluation

5 Mobility and Personal Condition
  5.1 Objective Reinterpretation
  5.2 Methodology
    5.2.1 Hypotheses, Assumptions, and Limitations
  5.3 Results and Discussion
  5.4 Evaluation

6 Behavior Assessment and Correction
  6.1 Methodology
    6.1.1 Hypotheses and Assumptions
  6.2 Results and Discussion
  6.3 Evaluation

7 Conclusions
  7.1 Contributions
  7.2 Further Work

References

List of Figures

2.1 Travel frequency histogram obtained by Mahrsi et al. [25]
2.2 Temporal travel frequency probability obtained by Yu et al. [32]
2.3 Visualization of dendrogram cut [29]
2.4 Temporal Mobility Representation of one Cluster Group obtained by Agard et al. [1]
2.5 Activity Sequences of two Users as obtained by Goulet-Langlois et al. [14]
2.6 Cox Model Parameters for different Covariates as obtained by Nishiuchi et al. [27]

3.1 Visualization of Check-In locations within Neighborhoods
3.2 Representation of the Mobility of a Randomly Chosen User in the Frequency Domain
3.3 Representation of the Mobility of a Randomly Chosen User in the Temporal Domain
3.4 Representation of the Mobility of a Randomly Chosen User in the Spatial Domain

4.1 Mean Travel Frequencies of 40,000 Randomly sampled Users
4.2 Mean Travel Frequencies of Senior Citizens and Standard Users
4.3 Mean Travel Probability by Hour of the Day, with a Gaussian Mixture Model Approximation (green)
4.4 Mean Travel Probability by Hour of the Day
4.5 Mean Travel Probability by Hour of the Day excl. Outskirts
4.6 Logarithm of the Spatial Probability Distributions of Senior Citizens and Standard Users
4.7 Number of inferred Homes per Neighborhood, excluding Outskirts
4.8 Logarithm of Aggregated Probability divided by inferred Number of Homes
4.9 Clustering Dendrogram
4.10 Visualization of Cluster Centroids

6.1 Log Partial Hazard Distribution (blue) and Mean Log Partial Hazard Prior to Death Event (red)
6.2 Example of Weekly Spatial Frequency Plot (top) and the corresponding Weekly Log Partial Hazard (bottom)

1 Number of inferred Homes per Neighborhood, compared to Number of Senior Citizens as per Census

Chapter 1

Introduction

This work analyzes the mobility of senior citizens in the Madrid metropolitan area. Because this demographic is growing, understanding its behavior is expected to help in targeting its needs efficiently. Beyond understanding the demographic of senior citizens, this work investigates the impact of personal condition on mobility, as well as the possibility of detecting mobility patterns that may be detrimental to an active and healthy lifestyle.

1.1 Motivation

European society is expected to face drastic demographic changes in the upcoming decades. According to Eurostat [13], the share of senior citizens in the population is expected to increase from 19.4% to 29.1% between the years 2017 and 2080. Similarly, the share of citizens aged 80 and above is expected to more than double.

With the upcoming transformation of its customer base in mind, the Consorcio Regional de Transportes de Madrid (CRTM) has entrusted the Ontology Engineering Group (OEG) with an analysis of this growing population and its demands on Madrid’s public transport services. The aim of this cooperation is to analyze the mobility of senior citizens and to investigate their needs. With the knowledge obtained from this project, CRTM aims to plan the infrastructure of the public transport network more effectively, schedule investments, and implement services and programs that specifically target senior citizens. To put these aims into practice, it is essential to understand this growing target group as thoroughly as possible.

In addition to optimizing services for the needs of senior citizens, CRTM hopes to expand the understanding of how the mobility of senior citizens relates to their personal and health conditions. Following previous participation in the European project CITY4AGE¹, which investigated senior citizens’ health and public transport mobility, the aspect of an active and healthy lifestyle has been carried over into this project. This work therefore extends to the investigation of how mobility can be an indicator of a healthy and active lifestyle and how mobility data of individual senior citizens can be utilized for a health-related monitoring system.

¹ CITY4AGE Project: http://www.city4ageproject.eu/

1.2 Objectives

The objectives of this work were specified in the contract between CRTM and the OEG, and are as follows.

1. Characterization of the mobility of senior citizens.

2. Analysis of the relationship between the mobility of senior citizens and their personal conditions.

3. Investigation of indicators that could generate alarms for the correction of behavioral changes that could be contrary to a healthy, active lifestyle among senior citizens.

1.3 Madrid’s Public Transport System

Madrid’s public transport system serviced more than 1.44 billion trips in 2016, equivalent to 223.6 trips per inhabitant per year. The network consists of the metro, the light rail, the Cercanias commuter rail service, and the urban and suburban bus networks. The interconnected network covers more than 11,000 kilometers and recorded almost 600 million vehicle kilometers in 2016, of which 337 million were recorded on rail [8].

1.3.1 Metro

The 270-kilometer metro network is served by a total of 13 lines with 236 stations. Of these 236 stations, 39 are multi-line stations that allow transfers between two or more metro lines. In 2016 a total of 174 million carriage-kilometers were registered. On working days, the morning peak hours are served by more than 300 trains, with an average interval of 4.2 minutes between trains and an average speed of 27.7 kilometers per hour [8].

1.3.2 Urban Bus

Madrid’s urban bus network, operated by CRTM’s subsidiary “Empresa Municipal de Transportes de Madrid”, is made up of 205 lines, of which 26 are operated at night. In 2016 a total of 88.5 million vehicle kilometers and 6.7 million vehicle hours were recorded. Urban buses mainly operate at intervals of 8 to 12 minutes, whereas 30% of lines operate at intervals below 8 minutes during morning peak hours [8].

1.3.3 Road Transport Concessions

The bus network of the road transport concessions is operated by 31 different companies and consists of a total of 440 lines, of which 297 are suburban lines, 112 are urban lines, and 31 are suburban night lines. The network had a total production of 173 million vehicle kilometers in 2016 [8].

1.3.4 Cercanias

Cercanias is the railway service that connects Madrid to its metropolitan area and to population centers in Madrid’s outskirts. The network consists of 9 lines and 94 stations with a total length of 391 kilometers. Of the 94 stations, 49 are multi-line stations which allow for transfers to other Cercanias lines or other modes of transport [8].

1.3.5 Light Rail

Madrid’s light rail network consists of four lines, which are operated by three subsidiary companies. In 2016 the light rail fleet completed a total of 12.6 million carriage-kilometers. During working-day peak hours a total of 35 trains are in service, which provide an average interval of 6.6 minutes between trains and travel at an average of 21.5 kilometers per hour [8].

1.3.6 Fare Types

Three major types of tickets are available in Madrid’s public transport system, ranging from single-use tickets targeted at infrequent users to personal multimodal tickets with monthly or yearly subscriptions.

Single-use tickets are available for each operator, are aimed at infrequent users, and are validated at the moment of purchase. For users of the metro or light rail services, these tickets are valid for combinations of the transport modes.

Ten-trip tickets are bought in advance and are aimed at intermediate users. Different types of ten-trip tickets exist, depending on the operators and zones in which the user wishes to travel. This type of ticket uses either magnetic technology in the Edmondson format or contactless technology.

Personal multimodal tickets are aimed at frequent users and offer unlimited use in the purchased zones. They are bought in advance and incorporate contactless technology. The tickets are valid for 30 days, and their price depends on the user’s age, which is grouped into one of three categories: "young", "standard", and "senior".

1.4 Methodology and Reproducibility

1.4.1 Methodology

The data used in this project was obtained from CRTM and consists of check-in data from real public transport passengers. The methodology used to analyze the data is described in the corresponding chapters, together with the accompanying hypotheses and their evaluations.

1.4.2 Reproducibility

• The code used in this work is publicly available in the GitHub repository².

• The data used in this project is regrettably not publicly available due to data and privacy protection regulations.

1.5 Thesis Organization

This thesis is organized into seven chapters. Chapter 2 presents a review of state-of-the-art methods in big data and public transport analysis, analysis of the mobility behavior of populations, user segmentation by means of clustering, and applications of survival analysis to public transport data. Chapter 3 describes the approach taken to preprocess and enrich the data in order to improve usability and enable further processing. Chapter 4 describes the work on characterizing senior citizens’ mobility and the identification of dominant user groups using clustering algorithms. Chapter 5 examines the relationship between the mobility of senior citizens and their personal conditions using survival analysis methods, and chapter 6 investigates the possibility of monitoring senior citizens’ travel behavior within the scope of an intervention system. Chapter 7 summarizes the results obtained in the previous chapters and suggests opportunities for further research.

² GitHub repository: https://github.com/oeg-upm/crtm-pae or Appendix A

Chapter 2

Literature Review

2.1 Big Data and Public Transport

The introduction of computer systems and automated data collection into public transport systems in recent years has led to the generation of large amounts of information. This information has, in turn, enabled the application of big data methods to the analysis of public transport systems and attracted the attention of an increasing number of scientists. Recent studies have incorporated a variety of data sources, such as smart card data from Automatic Fare Collection (AFC) systems, Wi-Fi and phone network data, and GPS data from public transport vehicles.

AFC systems were introduced in order to replace traditional paper-based tickets, and allow customers to reuse the same ticket for longer periods of time [4]. Transactions within AFC systems are registered and records include time stamp and location of the transaction. Depending on the type of system that is used, registrations may either be limited to boarding, or boarding and alighting [31].

2.2 Population Characteristics and Clustering

2.2.1 Population Characteristics

The characteristics of public transport user populations have been the topic of many previous studies. Mahrsi et al. [25] investigated the number of days on which individual users made use of public transport services over a span of 31 days. The distribution of active days is shown to be bimodal, with the minor mode indicating a 5-day work week. The obtained result is shown in figure 2.1.

Figure 2.1: Travel frequency histogram obtained by Mahrsi et al. [25]

Similarly, Yu et al. [32] analyzed the behavior of public transport users in the temporal as well as the spatio-temporal domain. Using data from the bus network of Guangzhou, China, the authors determine the number of users traveling at a given hour of the day. Figure 2.2 shows the results that the authors obtain for one time period during the summer vacation and one after it.

Figure 2.2: Temporal travel frequency probability obtained by Yu et al. [32]

2.2.2 Clustering - General Overview

Predominant groups in high-dimensional problems are usually characterized by gathering individuals into groups. These groups must be constructed such that individuals of the same group exhibit similar properties, whereas individuals of different groups show different properties. This process is referred to as clustering.

In order to cluster data, relevant features are identified and engineered; these features are then used to distinguish and separate the individuals. A clustering algorithm is then selected. The choice depends on the character of the problem but is often made experimentally [12]. Finally, the obtained results are evaluated for their validity and interpreted.

Clustering algorithms are generally classified into the following three categories [2]:

Partitioning algorithms decompose a data set into a predefined number of clusters. The individual clusters are represented by cluster centers or centroids. An example of a partitioning algorithm is K-Means clustering [24]. K-Means clustering randomly selects data points in the hyperspace as centroids, and assigns each data point to the centroid with the lowest Euclidean distance. Cluster centroids are then iteratively reassigned based on the mean of all points belonging to a given centroid, and the process is terminated once an optimization criterion is met. The K-Means algorithm converges quickly, but is sensitive to noise and outliers, and does not take into account different cluster shapes, sizes, and densities. Furthermore, it depends on a user-specified number of clusters and is susceptible to local minima, which may prevent convergence to the globally optimal solution [29].
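The K-Means procedure described above can be illustrated with a minimal sketch in plain Python. It is not taken from the cited works: the function names, the `seed` parameter, and the stopping rule (stop once the assignments no longer change) are assumptions made purely for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100, seed=0):
    """Illustrative K-Means: returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Randomly select k data points as the initial centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Assign every point to the centroid with the lowest distance.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Assumed stopping criterion: assignments are stable.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, assignments
```

Because the initial centroids are drawn at random, different seeds may end in different local minima on data that is less clearly separated, which is the sensitivity noted above.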

Hierarchical algorithms perform a nested clustering of data points and produce a dendrogram, a visualization of the clustering sequence with a single inclusive cluster at the top and singleton clusters of individual data points at the bottom. Hierarchical clustering algorithms are either agglomerative (bottom-up) or divisive (top-down). The merging or splitting of clusters is performed based on a distance function, and occurs iteratively [29]. The visualization of the clustering sequence in the form of a dendrogram allows the user to visually identify appropriate distance values at which to perform a cut and define the final clusters. Figure 2.3 portrays a visualization of a dendrogram with two different cut options.

Figure 2.3: Visualization of dendrogram cut [29]

While hierarchical approaches provide valuable information with the use of dendrograms, they are considered to exhibit a high time-complexity, and are therefore computationally expensive [3].
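As an illustration of the agglomerative (bottom-up) variant and of cutting the merge sequence at a chosen distance, the following sketch merges the two closest clusters until the smallest inter-cluster distance exceeds the cut. The single-linkage distance and the names `agglomerative_clusters` and `cut_distance` are assumptions; the text above does not prescribe a particular linkage.

```python
def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def agglomerative_clusters(points, cut_distance):
    """Single-linkage agglomerative clustering, stopped at cut_distance.

    Returns a list of clusters, each a list of indices into `points`.
    """
    # Start with one singleton cluster per data point.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest single-linkage distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    euclidean(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        distance, i, j = best
        # Merging beyond the cut would cross the chosen dendrogram level.
        if distance > cut_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

The nested loops make the cost of this naive version grow roughly cubically with the number of points, which reflects the high time complexity mentioned above.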

Density-based algorithms rely on local density calculations to group neighboring objects into clusters. Instead of relying on proximity between single data points, clusters are assumed to be regions of high density that are separated by regions of low density. Density-based clustering algorithms are resistant to noise and can, furthermore, classify it as such. Clusters of various shapes can be identified using density-based algorithms [29]. Because of the possibility of different shapes, however, the obtained clusters cannot be reliably described. Furthermore, density-based algorithms are not applicable to data sets with a high dimensionality [3].

2.2.3 Clustering of Public Transport Users

A variety of existing studies have clustered public transport users, taking different domains into account for the analysis. The domains in which users have been clustered range from temporal [1, 25] and spatial [25] to spatiotemporal [14, 17]. Clustering of users in the time domain has previously been performed by Agard et al. [1]. In their study, the authors reduce the activity of each user to 20 binary variables based on five weekdays with four periods per day: morning, midday, evening, and night. The obtained 20-element vectors are clustered first with the K-Means algorithm, in order to reduce computation time, and then with the Hierarchical Agglomerative Clustering algorithm. The authors obtain four dominant groups. Figure 2.4 shows one of the groups obtained by the authors.

Figure 2.4: Temporal Mobility Representation of one Cluster Group obtained by Agard et al. [1]
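The reduction of a user's activity to 20 binary variables (five weekdays with four periods each) could be sketched as follows. The hour boundaries of the four periods are not given in the text, so the values in `period_index` are hypothetical, as are the function names.

```python
from datetime import datetime


def period_index(hour):
    """Map an hour of the day to one of four periods.

    The boundaries are hypothetical; the source does not state them.
    """
    if 5 <= hour < 11:
        return 0  # morning
    if 11 <= hour < 16:
        return 1  # midday
    if 16 <= hour < 21:
        return 2  # evening
    return 3  # night


def encode_user(check_ins):
    """Return a 20-element binary vector (5 weekdays x 4 periods)."""
    vector = [0] * 20
    for timestamp in check_ins:
        weekday = timestamp.weekday()  # Monday is 0, Sunday is 6
        if weekday >= 5:
            continue  # weekend check-ins are not part of the encoding
        vector[weekday * 4 + period_index(timestamp.hour)] = 1
    return vector
```

For example, `encode_user([datetime(2018, 6, 13, 14, 59)])` sets only the element for Wednesday midday. Vectors of this form can then be passed to a clustering algorithm.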

Based on the ticket types adult, student, and elderly, the authors determine the share of each cluster within each customer group, thereby determining how strongly each temporal behavior is represented in each customer group. Further studies employ other algorithms, such as Gaussian Mixture Models and simple K-Means clustering.

A characterization of public transport users in the spatiotemporal domain has previously been performed by Goulet-Langlois et al. [14]. In their work, the authors analyze smart card data from London’s public transport system with regard to its geographical as well as temporal characteristics. The authors determine the proportion of time individuals spend using the public transport system and the time individuals remain in areas during given hours of the day. The stations visited by each individual are grouped into areas based on a threshold of geographical distance. These areas are sorted by the total amount of time spent in each. Having information about boarding and alighting, the authors are able to construct users’ activity sequences. An example of the activity sequences for two users can be seen in figure 2.5.

Figure 2.5: Activity Sequences of two Users as obtained by Goulet-Langlois et al. [14]

The activity sequences obtained from both the temporal and the spatial analysis are used to cluster individuals into groups using principal component analysis. The authors obtain 11 clusters, which are then analyzed in terms of their defining characteristics. Dominant characteristics include clear working days during the week, the proportion of travel on weekends, time spent in primary vs. secondary areas, travel frequency in certain weeks of the analysis period, and time of first departure. Even though the authors use geographical data, reducing the data to areas that are characterized solely by the user’s time of stay greatly reduces the inherent information. Because each area exhibits different properties in terms of attractions and activities, excluding the geography itself overlooks similarities in user interests and lifestyle, which are considered of great value in the context of this work.

2.3 Survival Analysis

2.3.1 Survival Analysis - General Overview

Survival analysis is the statistical evaluation of data measuring the time to a certain event. It is commonly used in the medical field to assess the impact of treatments on a patient’s prognosis, in engineering and materials science to determine the failure of mechanical components, and in customer retention studies. Several terms are commonly encountered in the domain of survival analysis:

Birth refers to the time at which an individual becomes relevant to a study. This may be the physical birth of an individual, but may also be the diagnosis of a disease, the start of a mechanical test of a specimen, or the time at which an individual registered for a service.

Death refers to the time of the event that is being investigated, such as the physical death of an individual, the failure of a mechanical component, or the abandonment of a service by a customer.

Censorship is the phenomenon which occurs when the investigated event cannot be observed due to the occurrence of another event that prohibits the observation. Censorship may occur accidentally or in a controlled fashion, such as through the termination of a study.

Truncation occurs when the observational period is limited and the investigated event may have occurred to individuals before the start of the observations.

Covariate is a possibly predictive variable that describes an individual that is being investigated.

Survival Function defines the probability that the death of an individual occurs later than a given time t. The survival function is defined as

\[ S(t) = \Pr(T > t) \tag{2.1} \]

Hazard Function indicates the rate of death during a time interval [t, t + dt], conditional on survival until time t:

\[ \lambda(t) = \lim_{dt \to 0} \frac{\Pr(t \le T \le t + dt)}{dt \times S(t)} = -\frac{S'(t)}{S(t)} \tag{2.2} \]
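The step from the limit to the ratio of the survival function and its derivative in (2.2) can be made explicit. The following short derivation assumes that T is a continuous random variable with a differentiable survival function; the constant-hazard case is added only as an illustrative special case.

```latex
\begin{align*}
\Pr(t \le T \le t + dt) &= \Pr(T > t) - \Pr(T > t + dt) = S(t) - S(t + dt), \\
\lambda(t) &= \lim_{dt \to 0} \frac{S(t) - S(t + dt)}{dt \times S(t)}
            = -\frac{S'(t)}{S(t)}
            = -\frac{d}{dt} \ln S(t), \\
S(t) = e^{-\lambda_c t} \;&\Rightarrow\; \lambda(t) = \lambda_c
  \quad \text{(a constant hazard corresponds to exponential survival).}
\end{align*}
```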

Three main types of survival analysis models that consider censorship exist: non-parametric, semiparametric, and parametric models. The appropriate model type must be chosen taking into account the objective of the study and, possibly, the nature of the investigated covariates [26].

Non-parametric models do not contain parameters for the representation of survival. The most popular model is the Kaplan-Meier Estimator [20], which computes the survival function (2.1) and represents it in the form of a curve.
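A minimal sketch of the Kaplan-Meier estimator is given below under simplifying assumptions: `durations` holds the observed time of each individual, and `observed` marks whether the death event was seen (`True`) or the individual was censored (`False`). These names are illustrative and do not come from the cited work.

```python
def kaplan_meier(durations, observed):
    """Return [(time, estimated survival probability)] per death time.

    At every distinct death time t, the running estimate is multiplied
    by (1 - deaths at t / individuals still at risk at t).
    """
    # Only times at which a death was actually observed change the curve.
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Individuals with duration >= t are still at risk at time t;
        # censored individuals count until their censoring time.
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve
```

Censored individuals never trigger a drop in the curve themselves, but they do reduce the number at risk for later death times, which is how the estimator takes censorship into account.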

Semiparametric models contain parametric as well as non-parametric components. A popular example of such a model in the domain of survival analysis is the Cox Proportional Hazards model [7]. The Cox Proportional Hazards model uses a linear regression model in order to derive the influence of covariates on the hazard function. The model outputs hazard ratios as well as other statistics associated with each covariate, which indicate whether the covariate in question has an increasing or decreasing effect on the hazard. The regression model is defined as follows:

\[ \lambda(t) = \lambda_0(t) \times \exp(b_1 x_1 + b_2 x_2 + \dots + b_n x_n) \tag{2.3} \]

where
λ represents the hazard function,
t is the survival time,
{b_1, b_2, ..., b_n} represent the linear regression parameters,
{x_1, x_2, ..., x_n} are the covariates, and
λ_0 is the baseline hazard.
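To make the role of the regression parameters in (2.3) concrete, the following sketch computes the log partial hazard (the linear term inside the exponential) and the hazard ratio between two individuals. The function names are illustrative assumptions, and any coefficient values passed in would be hypothetical rather than fitted results.

```python
import math


def log_partial_hazard(coefficients, covariates):
    """Linear predictor b_1*x_1 + ... + b_n*x_n of the Cox model."""
    return sum(b * x for b, x in zip(coefficients, covariates))


def hazard_ratio(coefficients, covariates_a, covariates_b):
    """Ratio of the hazards of individuals a and b.

    The baseline hazard appears in both hazards and cancels, so the
    ratio depends only on the covariates and the coefficients.
    """
    return math.exp(
        log_partial_hazard(coefficients, covariates_a)
        - log_partial_hazard(coefficients, covariates_b)
    )
```

With hypothetical coefficients `[0.5, -0.2]`, two individuals that differ by one unit in the first covariate only have a hazard ratio of exp(0.5): a positive coefficient increases the hazard, a negative one decreases it.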

The basic Cox Proportional Hazards model assumes that covariates are constant over time. In order to accommodate time-varying covariates, the model may be extended and defined as follows.

\[ \lambda(t \mid x) = \lambda_0(t) \times \exp(b_1 x_1(t) + b_2 x_2(t) + \dots + b_n x_n(t)) \tag{2.4} \]

It must be noted that while this definition allows for time-varying covariates, predictions of the survival function become non-trivial. In order to predict the future survival of an individual, the future covariates would have to be known. The prediction is therefore limited to the survival and hazard functions for the time during which an individual is observed.

Parametric models exhibit a full parameterization of their components. An example of a parametric model in the domain of survival analysis is the Accelerated Failure Time Regression Model [30]. The model not only determines the impact of covariates on the hazard, but also models the shape of the hazard function.

2.3.2 Survival Analysis of Public Transport Users

In a previous work, Nishiuchi et al. [27] investigated which types of users are most likely to reduce their public transport use. Using a Cox proportional hazards model [7], the authors evaluate smart card data from the city of Kochi, Japan. Passenger mobility is reduced to statistical descriptors, which are used as covariates. The covariates are investigated in terms of their impact on a decrease in travel frequency. The results indicate that users who travel on weekdays rather than weekends exhibit a lower hazard of use reduction. Similarly, passengers who use a variety of routes, and whose travel exhibits a large number of origin-destination pairs, are at a lower risk of reducing their travel frequency. Figure 2.6 shows the model parameters for different covariates.

Figure 2.6: Cox Model Parameters for different Covariates as obtained by Nishiuchi et al. [27]

In a similar fashion, Trepanier et al. [28] use smart card data from the region of Outaouais, Canada. The authors derive a survival analysis model in order to investigate the impact of demographic indicators of districts on the loyalty of public transport users. The authors define the birth of a user as the date on which the smart card was issued. User death is defined as the last month in which a card was used. User home locations are inferred from the first boarding location of the day and mapped to the different regions. Results indicate that users who live in high-density urban areas tend to use their smart cards over longer time periods.

Chapter 3

Data Description and Preprocessing

3.1 Data Description

Mobility data in the public transport system in Madrid consists of check-in registrations, which contain the identification number of the smart card, the date and time of the check-in, and the station at which the check-in occurred. This data is accompanied by information about the transportation type, the type of smart card, and other attributes. The table below shows a reduced extract of the obtained data.

Table 3.1: Reduced Data Extract

User ID   Date                  Operator   Line ID   Line                           Stop ID   Stop
04147B9   13.06.2018 14:59:00   Metro      6         Laguna-Lucero                  8         Nuevos Ministerios
047B24E   21.06.2018 17:22:52   Metro      8         N. Ministerios-Aeropuerto T4   1         Nuevos Ministerios
04147B9   13.06.2018 14:59:00   Metro      6         Laguna-Lucero                  8         Nuevos Ministerios

The data used for this project consists of check-ins made by 309,280 users over the years 2016, 2017, and 2018. The users are holders of the smart card abono tercera edad, a subscription for public transport users aged 65 or above.


3.2 Data Preprocessing

Data preprocessing is the process of restructuring raw data into a format that can be processed by further algorithms. Raw data is often incomplete or inconsistent, and lacks structure. The following section describes the steps taken in order to preprocess and enrich the raw travel records.

The original mobility records are split into individual files based on the user ID that each record represents. The records are augmented with station coordinates and the geographical shapes of Madrid’s neighborhoods, obtained from CRTM’s Open Data Portal [9] and Madrid’s Open Data Portal [23] respectively. This approach maps each of the over 13,000 individual geographical access points to one of 131 neighborhoods or to Madrid’s outskirts.

Several challenges are identified with regard to the mapping of travel records to geographical coordinates. The neighborhood corresponding to a travel record can be determined by either the station ID or the station name; in some cases, however, neither of the two attributes matches. Furthermore, some stations are not present at all in the station mapping obtained from [9]. In order to appropriately match stations with geographical coordinates, a linguistic approach using the Levenshtein word distance metric is employed. The identifiers shown in table 3.2 are used for mapping a station to its coordinates.

Table 3.2: Mapping Identifiers for Different Transport Types

| Transport Type | Primary Identifier | Secondary Identifier | Tertiary Identifier |
|---|---|---|---|
| Metro | Station Name | Levenshtein Distance | - |
| Interurbano | Station Name | Levenshtein Distance | - |
| EMT | Station ID | Station Name | Levenshtein Distance |
| Cercanias | Station ID | Station Name | Levenshtein Distance |

Different approaches are taken depending on the mode of transport in which the check-in occurred. The approaches are outlined below.

Metro and Interurbano - An exact match between the station name in the travel record and the mapping record is sought. If no exact match is found, the Levenshtein distance between the station name in the travel record and each station name in the geographical mapping is computed, and the station with the lowest Levenshtein distance is chosen as the match.

EMT and Cercanias - An exact match between the station ID in the travel record and the mapping record is sought. If no exact match is found, an exact match of the station name is sought. If this also fails, the Levenshtein distance between the station name in the travel record and each station name in the geographical mapping is computed, and the station with the lowest Levenshtein distance is chosen as the match.
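The fallback chain described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in this work: the record layout (the keys `stop_id` and `stop_name`), the function names, and the example station IDs are assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def match_station(record, stations, use_id):
    """Return the mapping entry that best matches a travel record.

    record   -- dict with the hypothetical keys "stop_id" and "stop_name"
    stations -- list of dicts with the same keys (the geographical mapping)
    use_id   -- True for EMT and Cercanias, False for Metro and Interurbano
    """
    if use_id:  # primary identifier: exact station ID
        for station in stations:
            if station["stop_id"] == record["stop_id"]:
                return station
    for station in stations:  # exact station name
        if station["stop_name"] == record["stop_name"]:
            return station
    # last resort: station name with the lowest Levenshtein distance
    return min(stations,
               key=lambda s: levenshtein(s["stop_name"], record["stop_name"]))


# Illustrative mapping entries; the IDs are made up for the example.
stations = [{"stop_id": "8", "stop_name": "Nuevos Ministerios"},
            {"stop_id": "1", "stop_name": "Sol"}]
misspelled = {"stop_id": "999", "stop_name": "Nuevos Ministeros"}
best = match_station(misspelled, stations, use_id=True)
```

Because the nearest name is always accepted, such a scheme never reports a missing station, which is why a manual check of the resulting mappings remains necessary.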

The described method is manually checked for errors. False mappings are identified for stations with long descriptors such as Intercambiador Atocha and Intercambiador de Moncloa. For stations in which errors are identified, the original mapping files from [9] are modified and the missing station is added manually. The correctness of the Levenshtein-based approach is then evaluated on 5,000 random travel records, in which no incorrect mappings are found. The resulting mapping of check-in locations within Madrid’s neighborhoods is shown in figure 3.1.

Figure 3.1: Visualization of Check-In locations within Neighborhoods

3.3 Representation of Individual Mobility

Mobility data in public transport systems contains information that can be interpreted in a variety of domains. Because the preprocessed check-in data is of limited direct use, it needs to be represented in a way that enables efficient processing while preserving the required information. The most evident domains for the representation of mobility data are the frequency, temporal, and spatial domains.

3.3.1 Frequency Domain

In order to represent user mobility in the frequency domain, frequency is defined as the number of days on which a user has travelled using public transport in a given week; it therefore ranges from 0 to 7. A simpler approach would be to count the number of check-ins that a user has made over a given time window, but this approach was shown to be vulnerable to outlier days [33]. Figure 3.2 shows the weekly frequency diagram for a randomly chosen user.
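As a minimal sketch of this definition, the following Python function counts the distinct travel days per week for one user. The choice of Monday as the start of the week and the function name are assumptions made for the example; they are not taken from this work.

```python
from datetime import datetime, timedelta


def weekly_active_days(checkins):
    """Number of distinct travel days in each week, including weeks without travel.

    checkins -- iterable of datetime objects, one per check-in of a single user
    Returns a list with one value between 0 and 7 per week, starting with the
    week of the first check-in (weeks are assumed to start on Monday).
    """
    days = sorted({ts.date() for ts in checkins})
    if not days:
        return []
    first_monday = days[0] - timedelta(days=days[0].weekday())
    n_weeks = (days[-1] - first_monday).days // 7 + 1
    counts = [0] * n_weeks
    for day in days:
        counts[(day - first_monday).days // 7] += 1
    return counts


example = [datetime(2018, 6, 11, 9, 30), datetime(2018, 6, 11, 18, 5),
           datetime(2018, 6, 13, 14, 59), datetime(2018, 6, 25, 11, 0)]
frequencies = weekly_active_days(example)
```

Several check-ins on the same day count once, so a single day with unusually many check-ins cannot dominate the representation.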

Figure 3.2: Representation of the Mobility of a Randomly Chosen User in the Frequency Domain

3.3.2 Temporal Domain

In the temporal domain, an individual user’s mobility is reduced to the hour of the day at which check-ins occur, which allows for the analysis of the user’s dominant travel hours. In order to enable the analysis of this behavior over time, the temporal frequency can also be evaluated using week-long windows. Figure 3.3 shows the resulting temporal frequency diagram, as well as the weekly temporal frequency diagram, for a randomly chosen user.
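The reduction to hourly bins can be illustrated as follows. The sketch assumes that check-ins are available as Python `datetime` objects; the function name is hypothetical.

```python
from datetime import datetime


def hourly_distribution(checkins):
    """Share of a user's check-ins falling into each of the 24 hours of the day."""
    counts = [0] * 24
    for ts in checkins:
        counts[ts.hour] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * 24
    return [count / total for count in counts]


example = [datetime(2018, 6, 13, 14, 59), datetime(2018, 6, 14, 14, 10),
           datetime(2018, 6, 21, 17, 22)]
distribution = hourly_distribution(example)
```

Dividing by the total number of check-ins turns the histogram into a probability distribution, which makes users with different travel volumes comparable.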

Figure 3.3: Representation of the Mobility of a Randomly Chosen User in the Temporal Domain

3.3.3 Spatial Domain

The analysis in the spatial domain follows a similar approach. Stations and stops are mapped to their corresponding neighborhood, and check-ins are reduced to the neighborhood in which they occur. Again, the spatial representation is additionally computed in weekly time windows. Figure 3.4 portrays the spatial representation of the mobility of a randomly chosen user.
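A weekly spatial representation could be built along the following lines. The sketch assumes that each check-in has already been mapped to a neighborhood and groups check-ins by ISO calendar week; both the input layout and the use of ISO weeks are assumptions made for the example.

```python
from collections import Counter, defaultdict
from datetime import datetime


def weekly_spatial_distribution(checkins):
    """Per-week share of check-ins in each neighborhood.

    checkins -- iterable of (datetime, neighborhood name) pairs for one user
    Returns {(ISO year, ISO week): {neighborhood: probability}}.
    """
    weekly_counts = defaultdict(Counter)
    for ts, neighborhood in checkins:
        iso = ts.isocalendar()
        weekly_counts[(iso[0], iso[1])][neighborhood] += 1
    result = {}
    for week, counts in weekly_counts.items():
        total = sum(counts.values())
        result[week] = {name: n / total for name, n in counts.items()}
    return result


example = [(datetime(2018, 6, 13, 14, 59), "Sol"),
           (datetime(2018, 6, 14, 10, 0), "Goya"),
           (datetime(2018, 6, 21, 17, 22), "Sol")]
spatial = weekly_spatial_distribution(example)
```

Weeks without any check-in do not appear in the result; a caller that needs a dense weekly matrix would have to fill them in explicitly.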

Figure 3.4: Representation of the Mobility of a Randomly Chosen User in the Spatial Domain

Chapter 4

Characterization of Senior Citizens’ Mobility

4.1 Population Characteristics

In order to understand the underlying travel patterns of senior citizens, the entire population is investigated in the frequency, temporal, and spatial domains. To achieve this, user mobility is represented in each of these domains and averaged over population samples. As a reference, the mobility of senior citizens is compared to that of users below the age of 65.

4.1.1 Methodology

In order to assess the travel frequency, the average number of days per week on which users use the public transport system is determined. The frequency diagram, previously shown in figure 3.2, is computed for each user, and its mean value is determined, discarding the first and last recorded weeks. Users who have made fewer than 20 trips are discarded, as are users for whom trips are registered in fewer than 5 calendar weeks. This ensures that the considered data is statistically meaningful. The procedure is performed for a random sample of 40,000 users, allowing for a histogram representation of the travel frequency probability.

The mean temporal and the mean spatial travel frequency are computed in a similar fashion: for a random sample of 40,000 users the travel probability distributions are computed, and the mean travel probability distribution is determined.

The home neighborhood of each user is inferred from the mobility data, based on the assumption that the first trip of a day starts at the user’s home. To this end, all of a user’s check-ins that are the first of their day are accumulated, and the spatial frequency diagram is computed over them. The neighborhood with the highest frequency is expected to be the “home” neighborhood of the user. In order to make this inference robust, the highest-frequency neighborhood must occur at least twice as often as the neighborhood with the second-highest frequency. If this condition is not satisfied, the inference for the given user is rejected.
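The home inference rule, including the rejection criterion, can be sketched as follows. The input layout (pairs of timestamp and neighborhood) and the function name are assumptions made for the example; the factor of two is the threshold stated above.

```python
from collections import Counter
from datetime import datetime


def infer_home(checkins, dominance=2.0):
    """Infer a user's home neighborhood from the first check-in of each day.

    checkins  -- iterable of (datetime, neighborhood name) pairs for one user
    dominance -- the top neighborhood must occur at least this many times as
                 often as the runner-up, otherwise the inference is rejected
    Returns the neighborhood name, or None if the inference is rejected.
    """
    first_of_day = {}
    for ts, neighborhood in sorted(checkins):
        # sorted by timestamp, so the first entry seen per date is the earliest
        first_of_day.setdefault(ts.date(), neighborhood)
    ranking = Counter(first_of_day.values()).most_common(2)
    if not ranking:
        return None
    if len(ranking) == 1:
        return ranking[0][0]
    (top, top_count), (_, second_count) = ranking
    return top if top_count >= dominance * second_count else None


accepted = infer_home([(datetime(2018, 6, d, 9, 0), "Aluche") for d in (11, 12, 13, 14)]
                      + [(datetime(2018, 6, 15, 9, 0), "Sol"),
                         (datetime(2018, 6, 11, 18, 0), "Sol")])
rejected = infer_home([(datetime(2018, 6, 11, 9, 0), "Aluche"),
                       (datetime(2018, 6, 12, 9, 0), "Sol")])
```

In the first example the evening check-in in Sol is ignored because it is not the first of its day, and Aluche leads four to one; in the second the two candidates are tied, so the inference is rejected.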

In order to assess whether the obtained mean probability distributions are representative of the population, the data from the randomly sampled users is split into two parts. The two-sample Kolmogorov-Smirnov test [19] is applied to the two parts to indicate whether they were drawn from the same distribution. If the amount of data is insufficient for the inferences made, the two parts are expected to differ, and the test will show a p-value of less than 0.05, the conventional significance level.
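For illustration, the two-sample Kolmogorov-Smirnov statistic and an approximate p-value can be computed with the standard library alone. The sketch below uses the asymptotic Kolmogorov distribution with a common finite-sample correction; this is an assumption of the example, and the tool actually used in this work (for instance a library routine such as `scipy.stats.ks_2samp`) may compute the p-value differently.

```python
import math
from bisect import bisect_right


def ks_two_sample(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D and an asymptotic p-value."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    # The empirical CDFs only change at observed values, so it suffices to
    # compare them there.
    d = max(abs(bisect_right(a, x) / n - bisect_right(b, x) / m) for x in a + b)
    effective_n = math.sqrt(n * m / (n + m))
    lam = (effective_n + 0.12 + 0.11 / effective_n) * d
    if lam < 0.1:  # the series below converges slowly here; the limit is 1
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


same_d, same_p = ks_two_sample([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
diff_d, diff_p = ks_two_sample(range(50), range(100, 150))
```

A large p-value means only that the test found no evidence of a difference between the two halves; it does not by itself prove that they follow the same distribution.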

4.1.2 Hypotheses and Assumptions

Hypothesis: The utilization of data from 40,000 individuals enables a characterization of the population’s travel characteristics in the investigated domains, and will show a p-value larger than 0.05 when evaluated with the Kolmogorov-Smirnov statistic.

Assumption: The public transport network is assumed to be static. That is, it does not evolve, and lines are not temporarily closed for renovation or for other reasons.

Limitation: Home inference may yield incorrect results for users living in relatively small neighborhoods. These neighborhoods may not contain enough stations for their residents to begin their trips in them, which may lead to adjacent neighborhoods being registered as the user’s home location.

4.1.3 Results and Discussion

Figure 4.1 shows the obtained travel frequency histogram, estimated from a random sample of 40,000 users. It can be observed that the histogram resembles a Poisson distribution with a mean travel frequency of 2.05 days per week. Fitting a Poisson distribution was attempted, but rejected due to a poor fit. The distribution is therefore approximated with the sixth-degree polynomial given by equation 4.1.

$$P_F(x) = 0.0045x^6 + 0.049x^5 - 0.28x^4 + 0.88x^3 - 1.48x^2 + 0.90x + 0.93 \tag{4.1}$$

The Kolmogorov-Smirnov statistic computed using two samples of 20,000 users each showed a p-value larger than 0.99, indicating that the sample size is large enough to represent the entire population.

Figure 4.1: Mean Travel Frequencies of 40,000 Randomly sampled Users

A comparison of the travel frequency of senior citizens with that of younger users is shown in figure 4.2. As can be observed, senior citizens have fewer active days per week than standard users. As previously shown in [25], the travel frequency distribution of standard users exhibits a peak that indicates a 5-day work week.

Figure 4.2: Mean Travel Frequencies of Senior Citizens and Standard Users

In the temporal domain, the probability distribution is bimodal, with a major mode in the late morning and a minor mode in the early evening. Senior citizens’ temporal travel behavior thus exhibits two distinct peak hours: the first at 11:00 and the second at 18:00. A decrease in activity can be observed around 15:00.

Figure 4.3: Mean Travel Probability by Hour of the Day, with a Gaussian Mixture Model Approximation (green)

The two distinct modes in the morning and evening are similar to those in the results obtained by Yu et al. (figure 2.2). The morning mode occurs, however, at a later time. The two peaks are not as dominant as those obtained by Yu et al., possibly due to the absence of prevalent work hours. As hypothesized, the Kolmogorov-Smirnov statistic computed over two samples shows a p-value of 0.99, which supports the validity of the obtained distribution. A Gaussian mixture model with two components is fitted to the histogram; equation 4.2 shows the resulting approximation of the distribution.

$$P_T(x) = 0.13 \times \exp\left(-\frac{(x - 11.2)^2}{2 \cdot 1.85^2}\right) + 0.076 \times \exp\left(-\frac{(x - 18.0)^2}{2 \cdot 1.96^2}\right) \tag{4.2}$$
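Equation 4.2 can be evaluated directly to check that it reproduces the two reported peaks and the dip in the afternoon. The following sketch uses the amplitudes, means, and standard deviations of equation 4.2; the function name is hypothetical.

```python
import math

# (amplitude, mean hour, standard deviation in hours), taken from equation 4.2
COMPONENTS = [(0.13, 11.2, 1.85), (0.076, 18.0, 1.96)]


def temporal_travel_probability(hour):
    """Value of the two-component Gaussian approximation at a given hour."""
    return sum(amplitude * math.exp(-(hour - mean) ** 2 / (2 * sigma ** 2))
               for amplitude, mean, sigma in COMPONENTS)


morning_peak = temporal_travel_probability(11.2)
afternoon_dip = temporal_travel_probability(15.0)
evening_peak = temporal_travel_probability(18.0)
```

The morning peak evaluates to roughly 0.13 and the evening peak to roughly 0.076, since each component contributes almost nothing at the other component's mean, while the value at 15:00 lies below both.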

A comparison with the standard user population is portrayed in figure 4.4. It can be observed that senior citizens begin to travel later in the day than standard users. Between 9:00 and 13:00, senior citizens exhibit a higher travel probability than the general public. During late evenings, as well as early mornings, travel activity is dominated by the younger population.

Figure 4.4: Mean Travel Probability by Hour of the Day

The analysis of senior citizens’ travel probability in the spatial domain provides the visualization in figure 4.5. The largest share of travel occurs in the outskirts of Madrid, which have not been included in the visualization because their large share would obscure the visual differences between Madrid’s neighborhoods.

Figure 4.5: Mean Travel Probability by Neighborhood, excl. Outskirts

Similarly to the previously computed distributions, the Kolmogorov-Smirnov statistic has a p-value of 0.96, supporting the hypothesis and indicating that the sample is representative of the population. The spatial probability distribution of senior citizens is compared to that of the standard user population. Figure 4.6 portrays the logarithm of the ratio between the spatial distributions of senior citizens and standard users. Neighborhoods in which the travel probability of senior citizens exceeds that of the younger population are shown in red.

There are several neighborhoods to which senior citizens travel more commonly than younger users. The neighborhoods with the highest spatial probability ratio are given in table 4.1. It can be observed that for the 10 neighborhoods with the highest ratio, the travel probability of senior citizens is more than three times that of the younger population. Hence, the neighborhoods presented in the table can be considered of high importance for senior citizens. The full list of spatial probability distributions for senior citizens, as well as the standard population, can be found in appendix A.

Figure 4.6: Logarithm of the Spatial Probability Distributions of Senior Citizens and Standard Users

Table 4.1: Neighborhoods with the Highest Spatial Probability Distribution (SPD) Ratios

| Neighborhood | Senior SPD | Standard SPD | SPD Ratio | Log SPD Ratio |
|---|---|---|---|---|
| Arapiles | 3.63 | 0.74 | 4.91 | 1.59 |
| Marroquina | 3.77 | 0.8 | 4.71 | 1.55 |
| Hellín | 1.63 | 0.35 | 4.66 | 1.54 |
| Estrella | 8.01 | 1.78 | 4.5 | 1.5 |
| Fuente del Berro | 7.71 | 1.81 | 4.26 | 1.45 |
| Sol | 5.84 | 1.5 | 3.89 | 1.36 |
| Gaztambide | 3.39 | 0.87 | 3.9 | 1.36 |
| Goya | 22.31 | 6.35 | 3.51 | 1.26 |
| Argüelles | 7.43 | 2.19 | 3.39 | 1.22 |
| Fontarrón | 3.95 | 1.3 | 3.04 | 1.11 |
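The ratio and log ratio columns of table 4.1 follow from the two SPD columns, with the logarithm being the natural logarithm. A minimal sketch, using two rows of the table as input (the function name and the dictionary layout are assumptions of the example):

```python
import math


def spd_log_ratios(senior_spd, standard_spd):
    """Ratio and natural-log ratio of senior to standard spatial probability.

    Neighborhoods missing from either input, or with a zero value, are skipped
    because the log ratio is undefined for them.
    """
    result = {}
    for name, senior in senior_spd.items():
        standard = standard_spd.get(name, 0.0)
        if senior > 0 and standard > 0:
            ratio = senior / standard
            result[name] = (ratio, math.log(ratio))
    return result


senior = {"Arapiles": 3.63, "Goya": 22.31}
standard = {"Arapiles": 0.74, "Goya": 6.35}
ratios = spd_log_ratios(senior, standard)
```

A log ratio above zero marks a neighborhood in which senior citizens are over-represented, and a log ratio below zero one in which they are under-represented, which makes the measure symmetric around parity.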

Figure 4.7 shows the inferred home neighborhoods based on 100,000 randomly selected users. Out of the 100,000 users, the inference was rejected for 27,393, resulting in 72,607 data points. Again, the outskirts of Madrid have not been included in the visualization, because their exceedingly high number of users distorts the image. 2,613 homes were mapped to the outskirts, whereas the average number of homes per neighborhood is only 114. The obtained number of users residing in each neighborhood is additionally presented in tabular form in appendix B. It can be observed that senior citizens tend to live outside of the city center. The neighborhood Aluche in particular has a high number of inferred senior citizen homes.

Figure 4.7: Number of inferred Homes per Neighborhood, excluding Outskirts

A comparison of inferred homes to the census shows that the applied method is appropriate and the results are reliable. The comparison is presented graphically in appendix B. An evaluation using the Kolmogorov-Smirnov statistic shows a p-value in excess of 0.99, supporting the validity of the result. Figure 4.8 illustrates the travel probability in comparison to the number of homes. It can be observed that travel in the neighborhoods of the city center is mainly caused by visitors. The neighborhood Aluche, which has a large number of inferred senior citizen homes, appears to be traveled mostly by its own inhabitants: it is not a destination that senior citizens visit, but rather one that they live in. Neighborhoods which are considered to have a high number of visitors are mostly in the city center.

Figure 4.8: Logarithm of Aggregated Probability divided by inferred Number of Homes

A summary of the spatial probability distribution, number of inferred homes per neighborhood, as well as the log spatial probability distribution per inferred home, is additionally presented in tabular form in appendix B.

4.1.4 Evaluation

Evaluations of the statistical validity of the obtained distributions using the Kolmogorov-Smirnov statistic show p-values in excess of 0.95, supporting the initial hypothesis that the samples are large enough and the results are robust.

The obtained probability distributions for the frequency and temporal domains are compared to users of the entire age spectrum in other studies. Even though this comparison shows sensible contrasts, a comparison between senior and non-senior public transport users from the municipality of Madrid would be wise. This would control for the environment and leave age as the sole independent variable, making the results more reliable.

Spatial travel patterns were identified and compared to inferred home locations. The largest share of travel, as well as of inferred home locations, was found to be located in the outskirts, which were excluded from the visualization due to their exceedingly large magnitude. This indicates that a further decomposition of the outskirts into neighborhoods would be advantageous. This could be accomplished with further shape-files or, if shape-files are not available, potentially with a clustering algorithm.

The inference of home locations shows a strong correlation with the census, so the applied method is considered applicable. In order to further improve home inference, more exact data needs to be evaluated. Access to the addresses of public transport users would be highly advantageous, but could not be obtained due to data and privacy protection regulations.

The analysis of travel behavior in the spatiotemporal domain was attempted but did not yield useful results. It appears that the mobility of senior citizens is too sporadic to isolate common behaviors in the spatiotemporal domain.

The obtained results portray senior citizens’ behavior in the frequency, temporal, and spatial domains. It would, however, be beneficial to explore the results in the context of the general population. The comparison between senior citizens and standard users may bring to light insights that would otherwise remain undiscovered.

4.2 Clustering and Dominant User Groups

4.2.1 Methodology

Temporal mobility is represented as a vector with 24 hourly bins, as introduced in section 3.3.2. For each individual, this vector is normalized to a probability distribution so that all investigated individuals receive equal weight in the clustering, irrespective of how often they travel. The obtained vectors are clustered using hierarchical agglomerative clustering with the Ward method and the Euclidean distance function. The Ward method is chosen because it encourages the formation of compact, convex clusters, which allows the clustering result to be interpreted through the cluster centroids. An appropriate distance level for the dendrogram cut is determined visually in order to define the final clusters. The centroid of each cluster is computed and presented visually for evaluation.
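To make the procedure concrete, the following is a naive, standard-library sketch of agglomerative clustering with Ward's criterion. It is not the implementation used in this work: it stops at a requested number of clusters instead of cutting a dendrogram at a distance level, and its cubic run time makes it unsuitable for large samples, where a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"` would be the more practical choice. All function names are hypothetical.

```python
def normalize(vector):
    """Scale a 24-bin histogram so that its entries sum to one."""
    total = sum(vector)
    return [value / total for value in vector] if total else list(vector)


def ward_clusters(vectors, n_clusters):
    """Merge clusters until n_clusters remain, using Ward's criterion.

    At each step the two clusters whose merge causes the smallest increase in
    the within-cluster sum of squares are joined. Returns a list of dicts with
    the member indices and the centroid of each cluster.
    """
    clusters = [{"members": [i], "centroid": list(v)} for i, v in enumerate(vectors)]

    def merge_cost(a, b):
        size_a, size_b = len(a["members"]), len(b["members"])
        squared = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
        return size_a * size_b / (size_a + size_b) * squared

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: merge_cost(clusters[p[0]], clusters[p[1]]))
        a, b = clusters[i], clusters[j]
        size_a, size_b = len(a["members"]), len(b["members"])
        merged = {"members": a["members"] + b["members"],
                  "centroid": [(size_a * x + size_b * y) / (size_a + size_b)
                               for x, y in zip(a["centroid"], b["centroid"])]}
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


def histogram(counts_by_hour):
    """Build a 24-bin vector from a {hour: count} dict (illustrative helper)."""
    return [counts_by_hour.get(hour, 0) for hour in range(24)]


users = [normalize(histogram({10: 3, 11: 5, 18: 1})),
         normalize(histogram({10: 4, 11: 4, 18: 2})),
         normalize(histogram({11: 1, 18: 5, 19: 3})),
         normalize(histogram({11: 2, 18: 4, 19: 4}))]
groups = ward_clusters(users, n_clusters=2)
```

In this toy example the two morning-dominated users end up in one cluster and the two evening-dominated users in the other; the centroids are the mean hourly distributions of each group.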

4.2.2 Results and Discussion

Hierarchical agglomerative clustering of the temporal probability distribution yields the dendrogram presented in figure 4.9. It can be observed that a cut of the dendrogram at a distance value of 7.5 yields four clusters of similar stability.

Figure 4.9: Clustering Dendrogram

The four obtained cluster centers are visualized in figure 4.10.

Figure 4.10: Visualization of Cluster Centroids

The visualization of the cluster centroids reveals that all clusters exhibit a local minimum around 14:00-15:00, in which the probability of travel is relatively low. Two peaks can be identified for each cluster, occurring in the morning and in the evening. Furthermore, it can be observed that the morning peak exceeds the evening peak in clusters A, B, and D. This indicates that the users within these clusters, which make up 82% of the population, travel mostly in the mornings. The remaining 18% of the population is represented by cluster C and travels more frequently in the evenings.

4.2.3 Evaluation

The obtained clusters visualize temporal characteristics of the predominant user groups in the population of senior citizens. Clustering with more advanced algorithms in order to identify outliers and noise was attempted using HDBSCAN with a variety of distance functions, including dynamic time warping, but did not yield acceptable results.

Clustering in the spatial and spatiotemporal domains, which was expected to reveal popular neighborhood combinations, could not be achieved. Attempts were made using algorithms such as K-Means, hierarchical agglomerative clustering, DBSCAN, and HDBSCAN, but showed poor results. It is assumed that the travel behavior of senior citizens is more sporadic than that of younger users, making it difficult to isolate groups of individuals with similar behavior.

Chapter 5

Mobility and Personal Condition

5.1 Objective Reinterpretation

An analysis of the relationship between personal condition and mobility requires information regarding users’ health. Deducing information about personal condition from mobility is not possible without the corresponding data. The underlying assumption that is made in order to enable such an inference is that the use of public transport is an indicator of an active lifestyle. Davis et al. [11] determined that, for senior citizens, a public transport trip is associated with 412 steps, which is equivalent to approximately 8 minutes of walking. An abandonment of public transport services could therefore result in a reduction of physical and social activity, and is assumed to be a health hazard. An individual who is at risk of abandoning public transport is therefore at risk of experiencing deteriorating health. With this assumption, the underlying problem can be interpreted as a customer retention problem, and hence becomes suitable for survival analysis techniques.

5.2 Methodology

In order to determine which factors influence user retention, a time-varying Cox Proportional Hazards model is developed based on the users’ spatial and temporal mobility representations introduced in section 3.3. The model uses the weekly spatial probability distributions as covariates. Birth is defined as the first check-in made by a user; users whose first check-in falls within the first three months of the observation period are discarded in order to reduce the probability of left truncation. Death is defined as the last check-in made by a user; deaths occurring in the last three months of the observation period are treated as censored. The obtained model is investigated for the influence of the covariates on the hazard function and, hence, for the behavior that is beneficial from a customer retention standpoint.
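The birth, death, and censoring rules can be sketched as follows. The sketch approximates the three-month margin with 90 days and measures the duration in weeks; both choices, as well as the function and field names, are assumptions made for the example.

```python
from datetime import date, timedelta


def survival_record(checkin_dates, period_start, period_end, margin_days=90):
    """Derive birth, duration, and event flag for one user.

    Returns None if the user is discarded because the first check-in falls
    into the first margin_days of the observation period. The event flag is
    False (censored) if the last check-in falls into the last margin_days.
    """
    first, last = min(checkin_dates), max(checkin_dates)
    if first < period_start + timedelta(days=margin_days):
        return None  # user was probably active before the observation began
    event_observed = last < period_end - timedelta(days=margin_days)
    return {"birth": first,
            "duration_weeks": (last - first).days // 7 + 1,
            "event_observed": event_observed}


start, end = date(2016, 1, 1), date(2018, 12, 31)
discarded = survival_record([date(2016, 2, 1), date(2017, 5, 1)], start, end)
abandoned = survival_record([date(2016, 6, 1), date(2017, 6, 1)], start, end)
censored = survival_record([date(2016, 6, 1), date(2018, 12, 1)], start, end)
```

Records of this form, one per user, are the usual input to a survival model: the duration states how long the user was observed, and the event flag states whether the abandonment was actually observed or the user was still active when the observation ended.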


5.2.1 Hypotheses, Assumptions, and Limitations

Hypothesis 1: Individual spatial and temporal mobility representations of senior citizens allow for the estimation of public transport abandonment probability.

Assumption 1: The public transport network is assumed to be static. That is, it does not evolve, and lines are not temporarily closed for renovation or for other reasons.

Assumption 2: The mobility of individual senior citizens is assumed to be an indicator of their lifestyle, which, in turn, is assumed to determine public transport abandonment risk.

Limitation 1: The proposed methodology does not account for travel by modes other than public transport.

Limitation 2: Only individuals whose birth is within the 3-year observation period are evaluated.

5.3 Results and Discussion

Table 5.1 shows the 10 covariate parameters with the lowest p-values obtained for the Cox Proportional Hazards model. The full list in alphabetical order can be found in appendix E.

Table 5.1: Extract of obtained Covariate Parameters

| Covariate | Coefficient | exp(Coef) | SE(Coef) | z-value | p-value |
|---|---|---|---|---|---|
| Apostol Santiago | 1.67 | 5.31 | 0.45 | 3.73 | <0.005 |
| Cuatro Vientos | -1.41 | 0.24 | 0.63 | -2.25 | 0.02 |
| Moscardó | -0.92 | 0.4 | 0.45 | -2.03 | 0.04 |
| Orcasur | 0.9 | 2.47 | 0.44 | 2.05 | 0.04 |
| Casa de Campo | 0.7 | 2.02 | 0.35 | 1.99 | 0.05 |
| Hour 10 | -1.2 | 0.3 | 0.47 | -2.58 | 0.01 |
| Hour 12 | -1.69 | 0.18 | 0.46 | -3.65 | <0.005 |
| Hour 13 | -1.55 | 0.21 | 0.47 | -3.29 | <0.005 |
| Hour 17 | -1.44 | 0.24 | 0.49 | -2.95 | <0.005 |
| Hour 19 | -1.39 | 0.25 | 0.51 | -2.71 | 0.01 |

It can be observed that the most relevant spatial covariates have positive as well as negative coefficients. The neighborhood Apostol Santiago is the most significant covariate and has a positive coefficient, indicating that citizens who travel to this neighborhood frequently have an increased hazard. High travel activity in the neighborhood Cuatro Vientos is associated with a decreased hazard. Within the temporal domain, the most relevant covariates are travel activity at 12:00, 13:00, and 17:00. Activity during these hours is associated with a decreased hazard.

The obtained model shows a concordance of 0.58, where 0.5 would be achieved by random prediction and 1.0 represents a perfect prediction. Under K-fold cross-validation with 20 folds, this value decreases to 0.55, which is taken as the more robust estimate of the true concordance.
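The concordance measures how often the model ranks two users correctly: the user who abandons earlier should receive the higher predicted risk. A simplified sketch of this measure (Harrell's concordance index) is given below; it ignores ties in the durations, and the function name is an assumption of the example rather than the routine used in this work.

```python
def concordance_index(durations, events, risk_scores):
    """Share of comparable pairs that are ordered correctly by the risk score.

    A pair (i, j) is comparable if user i has an observed event and a shorter
    duration than user j. It is concordant if user i also has the higher risk
    score; equal risk scores count as half.
    """
    concordant, comparable = 0.0, 0
    for i in range(len(durations)):
        if not events[i]:
            continue  # a censored user cannot be the earlier event of a pair
        for j in range(len(durations)):
            if durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


durations = [2, 4, 6, 8]
events = [True, True, True, True]
perfect = concordance_index(durations, events, [4.0, 3.0, 2.0, 1.0])
inverted = concordance_index(durations, events, [1.0, 2.0, 3.0, 4.0])
uninformative = concordance_index(durations, events, [1.0, 1.0, 1.0, 1.0])
```

A model that assigns the same risk to everyone scores 0.5, which is the baseline against which the reported values of 0.58 and 0.55 should be read.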

5.4 Evaluation

Only marginal predictive power is achieved using the employed model. With a concordance of 0.55, the developed model needs to be improved further in order to become usable. More advanced models may be used for the assessment of the hazard of public transport abandonment. Due to the complexity of the underlying problem, a naive reduction to the visited neighborhoods may oversimplify the problem space. In order to capture more information, user mobility may have to be analyzed in the spatiotemporal domain. Furthermore, the application of more advanced models such as artificial neural networks should be explored. This has, for example, previously been done by Mokarram et al. [26], who showed that artificial neural networks outperform semi-parametric and parametric models in complex problems. It should, however, be noted that the use of such black-box approaches would result in a loss of explanatory power, and pinpointing the changes that led to a change in hazard would become a complicated challenge. Since the goal of this investigation is an understanding of the relationship between mobility and personal condition, the used model is considered an appropriate choice, and the addition of further covariates is considered the appropriate measure.

Chapter 6

Behavior Assessment and Correction

The continuous assessment of the behavior of individual users requires the evaluation of the user’s behavior over time. The evaluation of time-dependent behavior requires the extension of the Cox proportional hazards model to accommodate time-varying covariates. The continuous evaluation of the mobility behavior is considered to be independent of the model’s baseline hazard, which may not be constant over time. The evaluation therefore uses solely the linear regression component of the time-varying Cox proportional hazards model, which is fitted in this chapter.

6.1 Methodology

The time-varying Cox Proportional Hazards model, for which the hazard is defined as indicated in equation 6.1, is fitted to the weekly spatial frequency matrices defined in section 3.3.3.

$$\lambda(t|x) = \lambda_0(t) \times \exp\left(b_1 x_1(t) + b_2 x_2(t) + \dots + b_n x_n(t)\right) \tag{6.1}$$

In order to enable the continuous monitoring of senior citizens’ mobility, the fitted model is used to evaluate only the weekly log partial hazard, which is defined in equation 6.2.

$$H_{LP} = \ln\frac{\lambda(t|x)}{\lambda_0(t)} = b_1 x_1(t) + b_2 x_2(t) + \dots + b_n x_n(t) \tag{6.2}$$

The log partial hazard indicates how an individual’s behavior affects the hazard function, and is used to assess the instantaneous user behavior in relation to the previously observed behavior. It is assumed that death is a result of an observable negative behavior, which can be deduced from a prior increase in the log partial hazard. The log partial hazard is computed for each individual, together with its mean and standard deviation. An individual who experiences a death event is hypothesized to exhibit an above-average log partial hazard.


This hypothesis is defined by equation 6.3.

$$H_{LP}(t-1 \mid e_t) > \frac{1}{t}\sum_{\tau=0}^{t-1} H_{LP}(\tau \mid x(\tau)) \tag{6.3}$$

In order to evaluate the stated hypothesis, the weekly spatial frequencies of 150,000 users are utilized and a K-fold cross validation is performed using 5 folds. The mean standard deviation of the observed hazard functions is compared to the mean z-score of log partial hazards at death in order to evaluate the hypothesis.
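The weekly log partial hazard and the z-score of the last week before the death event can be computed as sketched below. The two coefficients are taken from table 5.1 for illustration; the weekly covariate values are made up, and the use of the population standard deviation over the user's own history is an assumption of the example.

```python
import statistics


def log_partial_hazards(coefficients, weekly_covariates):
    """Linear predictor b1*x1(t) + ... + bn*xn(t) for each week.

    coefficients      -- {covariate name: fitted coefficient}
    weekly_covariates -- list of {covariate name: value}, one dict per week;
                         covariates without a coefficient contribute nothing
    """
    return [sum(coefficients.get(name, 0.0) * value for name, value in week.items())
            for week in weekly_covariates]


def final_week_z_score(hazards):
    """Deviation of the last week's value from the user's mean, in standard deviations."""
    mean = statistics.mean(hazards)
    deviation = statistics.pstdev(hazards)
    if deviation == 0:
        return 0.0
    return (hazards[-1] - mean) / deviation


coefficients = {"Apostol Santiago": 1.67, "Cuatro Vientos": -1.41}
weeks = [{"Cuatro Vientos": 1.0},
         {"Cuatro Vientos": 0.5, "Apostol Santiago": 0.5},
         {"Apostol Santiago": 1.0}]
hazards = log_partial_hazards(coefficients, weeks)
z_score = final_week_z_score(hazards)
```

In this example the user shifts from a neighborhood with a negative coefficient to one with a positive coefficient, so the log partial hazard rises from week to week and the final week lies above the user's own mean.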

6.1.1 Hypotheses and Assumptions

Hypothesis: Individuals who exhibit a death event at time t will have a prior log partial hazard at time t-1 which is greater than the mean of their log partial hazard.

Assumption: The log partial hazard of an individual is assumed to follow a Gaussian distribution.

6.2 Results and Discussion

The results show a mean log partial hazard of 1.12 with a standard deviation of 0.17. The log partial hazard observed during the time step preceding the death event deviates from the mean by 0.11 and therefore exceeds the mean log partial hazard by 0.67 standard deviations. This relationship is visualized in figure 6.1.

Figure 6.1: Log Partial Hazard Distribution (blue) and Mean Log Partial Hazard Prior to Death Event (red)

Based on the initially made assumption that the log partial hazard follows a Gaussian distribution, the results support the hypothesis. Individuals who undergo public transport abandonment do, in fact, experience an increased log partial hazard prior to the abandonment. Figure 6.2 portrays an example plot of the weekly spatial frequency diagram in conjunction with the corresponding log partial hazard plot.

Figure 6.2: Example of Weekly Spatial Frequency Plot (top) and the corresponding Weekly Log Partial Hazard (bottom)

It can be observed that the individual experiences a varying log partial hazard in the observation period. The highest log partial hazard is registered during the week prior to the abandonment event, as stated in the hypothesis.

6.3 Evaluation

The use of a time-varying Cox Proportional Hazards model for the assessment of abandonment hazard supports the stated hypothesis. The log partial hazard appears to be an appropriate indicator for the monitoring of senior citizens. As previously mentioned in section 5.4, the use of the time-varying Cox Proportional Hazards model trades predictive power off against the possibility of pinpointing which behavioral changes have caused the change in hazard. Due to the possible application of this method in a medical setting, an explanation of the changes in the log partial hazard is considered an essential component and must be investigated further. More advanced covariates may be engineered in order to improve the model. Such covariates may take into account not only the visited neighborhoods, but also the mode of transportation, the time of the visit to the neighborhood in question, or statistical indicators such as those used by Nishiuchi et al. [27].

Chapter 7

Conclusions

7.1 Contributions

This work makes the following contributions to the CRTM, as well as the research community.

• The travel frequency of senior citizens, defined as the number of active days in a week, was shown to follow a Poisson-like distribution with a mean of 2.05 days per week; since a Poisson fit was rejected, the distribution was approximated with a polynomial. This is in contrast to observations of the entire population, which was shown to have a mean travel frequency of 4.6 days per week, with a probability distribution exhibiting a minor mode indicating a 5-day work week.

• Senior citizens have been shown to have a temporal travel probability of bimodal nature. Travel most commonly occurs in the late mornings around 11:00 and in the evenings around 18:00, with a clear decrease in activity at 15:00. Compared to younger customers, senior citizens travel less during early mornings as well as late evenings.

• Spatial probability distributions of senior citizens and younger users were compared, indicating the neighborhoods in which senior citizens travel more than their younger counterparts.

• The clustering of temporal probability distributions has revealed that 81% of senior citizens travel predominantly in the mornings. Only 19% of senior citizens travel more in the evenings.

• The Cox Proportional Hazards model was used in order to investigate the influence of travel habits on the abandonment of public transport by senior citizens. Travel habits, in terms of travelled neighborhoods and time of travel, that influence abandonment risk were explicitly identified using the model.

• The possibility of an intervention system based on a time-varying Cox Proportional Hazards model was investigated. The obtained results show that individuals who exhibit an observable abandonment of public transport services display an increase in log partial hazard prior to the abandonment event.


7.2 Further Work

Clustering of senior citizens’ travel behavior was performed in the temporal dimension. A clustering in the spatial and spatiotemporal dimensions could not be achieved. Results indicate that senior citizens’ travel habits are highly sporadic compared to the overall population. Whether this is, in fact, the case should be investigated.

In order to obtain more insight into the defining characteristics of the mobility of Madrid’s senior citizen population, the geographical analysis must be carried out in more detail. A division of Madrid’s outskirts into several zones would limit information loss and provide more insight. The reduction of geography to single neighborhoods could be extended to account for neighborhood pairs, triplets, or larger groups. Furthermore, an inference of alighting locations should be attempted in order to identify travel routes. Such an approach could shift the focus from boarding and alighting locations to the travel routes and modes of transportation.

Even though the time-varying Cox Proportional Hazards model has shown promising results, the underlying regression model may be an oversimplification that does not capture important details. The substitution of semi-parametric and parametric models in survival analysis with artificial neural networks has previously shown promising results for such complicated problems and may be attempted in order to increase predictive performance in the scope of this topic. It should, however, be determined how such black-box models could be used without losing the explanatory ability of simpler systems such as the one used in this work.

Within this work, the time-varying Cox Proportional Hazards model was applied to the weekly spatial probability distributions of individuals, based on the assumption that senior citizens’ lifestyle can be deduced from their mobility and is an indicator of abandonment risk. Even though the results confirmed the hypothesis, further research is required in order to identify informative covariates that may further increase the predictive power of the model. Improvements could be achieved by including spatiotemporal information, assuming that neighborhood lifestyle changes with the time of day. More advanced features such as statistical indicators may be engineered to further improve model fit.

While this work has assumed senior citizens’ lifestyle to be the major factor in public transport behavior and abandonment risk, the future focus may shift from the neighborhoods in which a user travels to the modes of transport that the user makes use of. Applying survival analysis methods to such data may reveal that the hazards depend not only on visited neighborhoods but also on inefficiencies and impediments in the public transport network itself. Detecting these impediments may be crucial not only to the targeting of senior citizens but to customer groups in general. The pinpointing of exact routes on which users exhibit an increased risk of abandonment may enable precision surveying, which could identify present inefficiencies at a low cost.

Bibliography

[1] Agard, Bruno, et al. Mining Public Transport User Behaviour From Smart Card Data. IFAC Proceedings Volumes, vol. 39, no. 3, 2006, pp. 399–404, doi:10.3182/20060517-3-fr-2903.00211.

[2] Arabie P., Hubert L., and De Soete G. Clustering and Classification. World Scientific, 1996.

[3] Bhagat, Kshirsagar, Khodke, Dongre, Sadique. (2016). Penalty Parameter Selection for Hierarchical Data Stream Clustering. Procedia Computer Science. 79. 24-31. 10.1016/j.procs.2016.03.005.

[4] Blythe, P. (2004). Improving public transport ticketing through smart cards. Proceedings of the Institution of Civil Engineers, Municipal Engineer, volume 157, pages 47-54.

[5] Briand, Anne-Sarah, et al. A Mixture Model Clustering Approach for Temporal Passenger Pattern Characterization in Public Transport. 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2015, doi:10.1109/dsaa.2015.7344847.

[6] Campello R.J.G.B., Moulavi D., Sander J. (2013) Density-Based Clustering Based on Hierarchical Density Estimates. In: Pei J., Tseng V.S., Cao L., Motoda H., Xu G. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2013. Lecture Notes in Computer Science, vol 7819. Springer, Berlin, Heidelberg.

[7] Cox, D.R. and Oakes, D., (1984) Analysis of Survival Data. Chapman and Hall, London.

[8] Consorcio Regional de Transportes de Madrid. 2018. Annual Report 2016. Available at https://www.crtm.es/media/579711/annual_report.pdf. Accessed May 8th 2019.

[9] Consorcio Regional de Transportes de Madrid. “Consorcio Regional De Transportes De Madrid - Datos Abiertos.” Consorcio De Transportes De Madrid - Ir a La Página De Inicio, www.crtm.es/atencion-al-cliente/area-de-descargas/datos-abiertos.aspx.


[10] Davidson-Pilon, Kalderstam, Zivich, Kuhn, Fiore-Gartland, Moneda, Rendeiro. (2019, April 26). CamDavidsonPilon/lifelines: v0.21.1 (Version v0.21.1). Zenodo. http://doi.org/10.5281/zenodo.2652543

[11] Davis M.G., Fox K.R., Hillsdon M., Coulson J.C., Sharp D.J., Stathi A., Thompson J.L. Getting out and about in older adults: The nature of daily trips and their association with objectively assessed physical activity. Int. J. Behav. Nutr. Phys. Act. 2011;8. doi: 10.1186/1479-5868-8-116.

[12] Estivill-Castro, Vladimir (20 June 2002). Why so many clustering algorithms – A Position Paper. ACM SIGKDD Explorations Newsletter. 4 (1): 65–75. doi:10.1145/568574.568575.

[13] Eurostat. Population structure and ageing - Statistics Explained. May 2018. Available at https://ec.europa.eu/eurostat/statistics-explained/index.php/Population_structure_and_ageing. Accessed May 1st 2019.

[14] Goulet-Langlois, G., Koutsopoulos, H. N., Zhao, J. (2016, 03). Inferring patterns in the multi-week activity sequences of public transport users. Transportation Research Part C: Emerging Technologies, 64, 1-16. doi:10.1016/j.trc.2015.12.012

[15] M. Halkidi and M. Vazirgiannis, Clustering validity assessment: Finding the optimal partitioning of a data set. ICDM, Washington, DC, USA, 2001, pp. 187–194

[16] Harrell, F. E. Jr.; Lee, K. L.; Califf, R. M.; Pryor, D. B.; Rosati, R. A. (1984). Regression modelling strategies for improved prognostic prediction. Stat Med. 3 (2): 143–52. doi:10.1002/sim.4780030207

[17] He, Li, et al. A Classification of Public Transit Users with Smart Card Data Based on Time Series Distance Metrics and a Hierarchical Clustering Method. Transportmetrica A: Transport Science, 2018, pp. 1–20, doi:10.1080/23249935.2018.1479722.

[18] He, Trépanier, Agard. Space-time classification of public transit smart card users’ activity locations from smart card data. (2018). TransitData 2018 Symposium / Conference on Advanced Systems in Public Transport (CASPT).

[19] Hodges, J.L. Jr., The Significance Probability of the Smirnov Two-Sample Test. Arkiv för Matematik, 3, No. 43 (1958), 469-86.

[20] Kaplan, E. L., and Paul Meier. Nonparametric Estimation from Incomplete Observations. Journal of the American Statistical Association, vol. 53, no. 282, 1958, p. 457, doi:10.2307/2281868.

[21] Liu, Yanchi, et al. Understanding of Internal Clustering Validation Measures. 2010 IEEE International Conference on Data Mining, 2010, doi:10.1109/icdm.2010.35.

[22] Lowry, Richard. “Concepts and Applications of Inferential Statistics”. Chapter 8. Available at https://web.archive.org/web/20171022032306/http://vassarstats.net:80/textbook/ch8pt1. Accessed May 3rd 2019.

[23] “Portal De Datos Abiertos Del Ayuntamiento De Madrid.” BiciMAD. Alta De Usuarios y Usos Por Día Del Servicio Público De Bicicleta Eléctrica - Portal De Datos Abiertos Del Ayuntamiento De Madrid, datos.madrid.es/portal/site/egob/. Accessed November 21st 2018.

[24] MacQueen J. (1967) Some methods for classification and analysis of multivariate observations. Proc Fifth Berkeley Symp Math Stat Probab 1:281–297

[25] Mahrsi, Mohamed K. El, et al. Clustering Smart Card Data for Urban Mobility Analysis. IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 3, 2017, pp. 712–728, doi:10.1109/tits.2016.2600515.

[26] Reza Mokarram, Mahdi Emadi, Arezou Habibi Rad, Mahdi Jabbari Nooghabi (2017): A Comparison of Parametric and Semi-parametric Survival Models with Artificial Neural Networks. Communications in Statistics - Simulation and Computation, DOI:10.1080/03610918.2017.1291961

[27] Nishiuchi, Hiroaki, and Makoto Chikaraishi. Identifying Passengers Who Are at Risk of Reducing Public Transport Use: A Survival Time Analysis Using Smart Card Data. Transportation Research Procedia, vol. 34, 2018, pp. 291–298, doi:10.1016/j.trpro.2018.11.021.

[28] Trépanier, M., Habib, K., Morency, C. Are transit users loyal? Revelations from a hazard model based on smart card data. Canadian Journal of Civil Engineering - 06 / 2012.

[29] Sisodia, Singh, Sisodia, Saxena. 2012. Clustering Techniques: A Brief Survey of Different Clustering Algorithms. International Journal of Latest Trends in Engineering and Technology (IJLTET) - Vol. 1 - 09 / 2012.

[30] Wei LJ. The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis. Statist Med. 1992;11:1871–1879.

[31] Wilson, N. H. M., Zhao, J., and Rahbee, A. (2009). The potential impact of automated data collection systems on urban public transport planning. In Schedule-Based Modeling of Transportation Networks, volume 46 of Operations Research/Computer Science Interfaces Series, pages 1-25. Springer US.

[32] Yu, Chang, and Zhao-Cheng He. “Analysing the Spatial-Temporal Characteristics of Bus Travel Demand Using the Heat Map.” Journal of Transport Geography, vol. 58, 2017, pp. 247–255, doi:10.1016/j.jtrangeo.2016.11.009.

[33] Zhao, Z., Koutsopoulos, H. N., Zhao, J. (2018, 06). Detecting pattern changes in individual travel behavior: A Bayesian approach. Transportation Research Part B: Methodological, 112, 73-88. doi:10.1016/j.trb.2018.03.017

Appendix A - Github Repository

Pre-requisites installation

The following libraries are required in order to run the examples.

* Python 3.7
* Scikit-Learn
* Numpy, Scipy, and Matplotlib
* Unidecode
* Natural Language Toolkit
* PyShp
* UTM
* Shapely
* Lifelines

Run the commands below in order to install the Python packages.

    pip install -U scikit-learn unidecode nltk pyshp utm Shapely
    python -m pip install --user numpy scipy matplotlib lifelines
    pip install unidecode

How to use

Data Preprocessing

Raw data files are first preprocessed in order to split the files by user and to replace faulty delimiters. Running the following command loads the raw data files from the directory 'raw' and saves the data, split by user into individual folders, to the 'users' folder.

    python preprocess.py


Note: Delimiter errors are raised with an indication of the filename and row number. If such an error occurs, edit the 'preprocess.py' file to accommodate the appropriate string replacement.

Geospatial grouping of stops and stations into the corresponding neighborhoods

This operation groups the stops and stations (which are contained in the "location_files" folder) into neighborhoods. The rightmost column of each location file will be edited to hold the number of the neighborhood in which the station is located. Additionally, the file "barrios_encoding.csv" will be created in the "location_files" folder. The file will contain the names of the neighborhoods and the corresponding indices. A plot of the stops and stations and the outlines of Madrid’s neighborhoods will be shown.

    python shp.py
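The grouping step presumably assigns each stop to the neighborhood whose outline contains it. The repository lists PyShp and Shapely among its requirements; the sketch below instead uses a plain ray-casting test as a stand-in, so that it runs without dependencies. The function names, the square outlines, and the coordinates are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: count boundary crossings of a ray going right from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_neighborhood(stop: Point,
                        outlines: Dict[int, List[Point]]) -> Optional[int]:
    """Return the index of the first outline containing the stop, else None."""
    for index, outline in outlines.items():
        if point_in_polygon(stop, outline):
            return index
    return None


# Hypothetical outlines: two adjacent unit squares with indices 0 and 1.
outlines = {
    0: [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    1: [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
}
print(assign_neighborhood((1.5, 0.5), outlines))
```

A stop that falls outside every outline yields None here; how the repository treats such stops is not covered by this sketch.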

Representations of user mobility

User mobility is represented in the following domains.

• Travel Frequency - i.e. the number of days in a week on which a user has traveled (scalar)
• Spatial Frequency - i.e. the number of times a user has been in a region (cluster of stations) during the week (vector with 170 components)
• Temporal Frequency - i.e. the number of travels per time band in a week (vector of 24 elements)
• Spatio Temporal - i.e. travels made in regions at given times of the day. Regions are expressed according to the corresponding station cluster, whereas the time of the day is represented by the hourly bin (0-23). The weekly space-time plot shows vertical heatmaps for each week. These heatmaps are flattened vectors which correspond to each cluster number and its 24 hourly bins.

In order to obtain different diagrams representing the example user’s mobility, run the following command:

    python user_representation.py USERID

Note: USERID is an optional argument that selects the user whose mobility is displayed. Possible values for USERID are the folder names in the folder "users". Additionally, a short list is given below.

040A3A0AE92E80 040444CA3C3180 040446A23C3180 0404436A3D3180 0404448AAA2980 0404466A3D3180 04044392833380
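The four representations listed in this section can be sketched for a single user and a single week as follows. The record layout (timestamp plus region index), the function name, and the flattening order of the spatio-temporal vector are assumptions made for illustration; they are not taken from the repository.

```python
from datetime import datetime
from typing import List, Tuple

N_REGIONS = 170  # number of regions, as stated in the list above
N_HOURS = 24     # hourly bins 0-23

CheckIn = Tuple[datetime, int]  # (timestamp, region index), hypothetical layout


def weekly_representations(check_ins: List[CheckIn]):
    """Build the four representations from one user's check-ins of one week."""
    # Travel frequency: number of distinct days with at least one check-in.
    travel_frequency = len({ts.date() for ts, _ in check_ins})

    spatial = [0] * N_REGIONS
    temporal = [0] * N_HOURS
    # Flattened space-time vector: 24 hourly bins per region (assumed order).
    spatio_temporal = [0] * (N_REGIONS * N_HOURS)
    for ts, region in check_ins:
        spatial[region] += 1
        temporal[ts.hour] += 1
        spatio_temporal[region * N_HOURS + ts.hour] += 1
    return travel_frequency, spatial, temporal, spatio_temporal


# Made-up check-ins: two trips on one day, one trip three days later.
week = [
    (datetime(2018, 3, 5, 11, 2), 12),
    (datetime(2018, 3, 5, 18, 40), 47),
    (datetime(2018, 3, 8, 11, 15), 12),
]
frequency, spatial, temporal, spatio_temporal = weekly_representations(week)
print(frequency, spatial[12], temporal[11], spatio_temporal[12 * N_HOURS + 11])
```

Dividing each vector by its sum would turn the counts into the probability distributions used elsewhere in this work.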

Global Population Patterns

This section is the implementation of the global population analysis. Execution of the commands below computes and displays the travel frequency histogram, the mean time-frequency histogram, as well as the space-frequency, which is displayed as a color-coded map.

    python global_trends.py frequency_histogram
    python global_trends.py time_frequency
    python global_trends.py space_frequency

Furthermore, the following commands execute the home neighborhood inference, as well as the neighborhood classification.

    python neighborhood_inference infer_homes
    python neighborhood_inference travels_vs_neighborhoods

Clustering of users based on temporal frequency

Users are clustered by the temporal representation of their travel time. In order to perform the user clustering, run the command below.

    python user_clustering.py

The output will be a plot with the mean temporal frequency probability distribution of the population, as well as the temporal probability distributions of the identified clusters.

Mobility and Personal Condition

This section is concerned with the Cox proportional hazards model for the assessment of spatial and temporal covariates in terms of their influence on the public transport abandonment hazard. In order to run a validation process with a 5-fold cross validation, run the following command.

    python survival_analysis_personal_cond.py k_fold

The output will be the mean concordance index calculated over the test sets. Running the following command will build a Cox proportional hazards model over the entire dataset.

    python survival_analysis_personal_cond.py build_model

The output will be a model summary indicating the coefficients and statistical values for the regression model. This data should be inspected in order to evaluate which behavior increases or decreases the abandonment hazard. Furthermore, a Cox proportional hazards model will be saved to the file 'coxph.model'.
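The concordance index reported by the k_fold run measures how often the model ranks two users in the order in which they actually abandoned public transport. The following is a minimal sketch of Harrell's concordance index, written without dependencies. It is a simplified variant that ignores tied durations, it is not the routine used by the repository, and the example data are made up.

```python
from typing import Sequence


def concordance_index(durations: Sequence[float],
                      risk_scores: Sequence[float],
                      events: Sequence[bool]) -> float:
    """Share of comparable pairs in which the higher risk score belongs to
    the user who abandoned earlier (ties in risk count as one half)."""
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when user i was observed to abandon
            # strictly before user j's abandonment or censoring time.
            if events[i] and durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Made-up example: weeks until abandonment, abandonment flag, and risk score.
weeks = [12, 30, 45, 52]
abandoned = [True, True, False, True]
risk = [1.8, -0.4, 0.2, -0.9]
print(concordance_index(weeks, risk, abandoned))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking. Note that some libraries expect scores where higher values mean longer survival, which reverses the sign convention used here.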

Behavior Assessment and Correction

This section fits a time-varying Cox proportional hazards model to the weekly spatial frequency matrices of senior citizens. The following command will run a K-fold cross validation to indicate the log partial hazards at death (i.e. at the abandonment event) compared to the mean log partial hazards.

    python survival_analysis_prediction.py test

Additionally, a time-varying Cox proportional hazards model will be saved to the file 'coxtvph.model'. Running the command below will evaluate the behavior of a user based on the user ID.

    python survival_analysis_prediction.py USERID

The spatial frequency plot will be displayed with the corresponding log partial hazard. High log partial hazard values indicate negative behavior.
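The log partial hazard of a Cox model is the linear combination of the covariates and the fitted coefficients. The sketch below computes it for two hypothetical weeks of one user. The coefficient values are borrowed from Table 3 purely for illustration (the time-varying model has its own fitted coefficients), the weekly visit shares and covariate means are made up, and centering on the means is an assumption that only shifts all values by a constant.

```python
from typing import Dict


def log_partial_hazard(covariates: Dict[str, float],
                       coefficients: Dict[str, float],
                       means: Dict[str, float]) -> float:
    """Linear predictor of a Cox model, centered on the covariate means."""
    return sum(
        coefficients[name] * (covariates.get(name, 0.0) - means.get(name, 0.0))
        for name in coefficients
    )


# Illustrative coefficients (values as listed in Table 3).
COEFFICIENTS = {"Apostol Santiago": 1.67, "Cuatro Vientos": -1.41, "Sol": 1.4}
# Hypothetical population means of the weekly visit shares.
MEANS = {name: 0.1 for name in COEFFICIENTS}

# Hypothetical weekly spatial probability distributions of one user.
week_early = {"Cuatro Vientos": 0.6, "Sol": 0.1}
week_late = {"Apostol Santiago": 0.5, "Sol": 0.2}

for label, week in (("early", week_early), ("late", week_late)):
    print(label, round(log_partial_hazard(week, COEFFICIENTS, MEANS), 3))
```

In this made-up example the later week has the higher log partial hazard, which is the kind of increase an intervention system would flag.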

Appendix B

Table 1: Spatial Probability Distribution for Senior and Standard Users, SPD Ratio, as well as Log SPD Ratio per Neighborhood

Neighborhood | Senior SPD | Standard SPD | SPD Ratio | Log SPD Ratio
Abrantes | 4.6 | 3.2 | 1.44 | 0.36
Acacias | 10.11 | 10.69 | 0.95 | -0.06
Adelfas | 7.92 | 3.42 | 2.32 | 0.84
Aeropuerto | 0.78 | 1.01 | 0.77 | -0.26
Aguilas | 12.78 | 7.04 | 1.82 | 0.6
Alameda de Osuna | 2.74 | 3.36 | 0.82 | -0.2
Almagro | 10.79 | 11.73 | 0.92 | -0.08
Almenara | 5.86 | 2.87 | 2.04 | 0.71
Almendrales | 4.52 | 4.17 | 1.08 | 0.08
Aluche | 18.91 | 9.78 | 1.93 | 0.66
Amposta | 3.17 | 1.85 | 1.71 | 0.54
Apostol Santiago | 2.19 | 1.21 | 1.81 | 0.59
Arapiles | 3.63 | 0.74 | 4.91 | 1.59
Aravaca | 2.38 | 4.5 | 0.53 | -0.64
Arcos | 5.04 | 5.47 | 0.92 | -0.08
Argüelles | 7.43 | 2.19 | 3.39 | 1.22
Atalaya | 0.52 | 0.3 | 1.73 | 0.55
Atocha | 0.98 | 0.73 | 1.34 | 0.29
Bellas Vistas | 6.7 | 2.49 | 2.69 | 0.99
Berruguete | 5.05 | 7.96 | 0.63 | -0.46
Buenavista | 8.69 | 7.53 | 1.15 | 0.14
Butarque | 0.97 | 1.78 | 0.54 | -0.61
Campamento | 5.54 | 8.05 | 0.69 | -0.37
Canillas | 6.05 | 5.93 | 1.02 | 0.02
Canillejas | 6.18 | 5.27 | 1.17 | 0.16
Cármenes | 5.19 | 7.25 | 0.72 | -0.33
Casa de Campo | 9.98 | 7.56 | 1.32 | 0.28
Casco Histórico de Barajas | 0.51 | 0.28 | 1.82 | 0.6
Casco Histórico de Vallecas | 5.7 | 5.44 | 1.05 | 0.05
Casco histórico de Vicálvaro | 6.72 | 5.06 | 1.33 | 0.28
Castellana | 6.47 | 2.57 | 2.52 | 0.92
Castilla | 3.51 | 3.58 | 0.98 | -0.02
Castillejos | 6.5 | 9.74 | 0.67 | -0.4
Chopera | 6.2 | 2.9 | 2.14 | 0.76
Ciudad Jardín | 9.42 | 7.56 | 1.25 | 0.22
Ciudad Universitaria | 11.63 | 19.52 | 0.6 | -0.52
Colina | 3.03 | 1.07 | 2.83 | 1.04
Comillas | 2.57 | 1 | 2.57 | 0.94
Concepción | 8.88 | 5.82 | 1.53 | 0.42
Corralejos | 0.74 | 5.47 | 0.14 | -2
Cortes | 14.24 | 7.99 | 1.78 | 0.58
Costillares | 3.74 | 1.44 | 2.6 | 0.95
Cuatro Caminos | 15.78 | 7.56 | 2.09 | 0.74
Cuatro Vientos | 3.22 | 8.32 | 0.39 | -0.95
Delicias | 6.67 | 10.35 | 0.64 | -0.44
El Cañaveral | 0 | 0 | 0 | 0
El Goloso | 0.98 | 7.3 | 0.13 | -2.01
El Pardo | 0 | 0 | 0 | 0
El Plantío | 0.07 | 0.35 | 0.2 | -1.61
El Viso | 10.12 | 17.3 | 0.58 | -0.54
Embajadores | 23.91 | 59.98 | 0.4 | -0.92
Ensanche de Vallecas | 1.49 | 5.51 | 0.27 | -1.31
Entrevías | 5.16 | 4.06 | 1.27 | 0.24
Estrella | 8.01 | 1.78 | 4.5 | 1.5
Fontarrón | 3.95 | 1.3 | 3.04 | 1.11
Fuente del Berro | 7.71 | 1.81 | 4.26 | 1.45
Fuentelareina | 0.48 | 0.57 | 0.84 | -0.17
Gaztambide | 3.39 | 0.87 | 3.9 | 1.36
Goya | 22.31 | 6.35 | 3.51 | 1.26
Guindalera | 14.28 | 4.94 | 2.89 | 1.06
Hellín | 1.63 | 0.35 | 4.66 | 1.54
Hispanoamérica | 8.66 | 9.24 | 0.94 | -0.06
Horcajo | 0.4 | 0.51 | 0.78 | -0.24
Ibiza | 12.49 | 6.02 | 2.07 | 0.73
Imperial | 5.72 | 2.09 | 2.74 | 1.01
Jerónimos | 7.32 | 12.72 | 0.58 | -0.55
Justicia | 7.7 | 17.09 | 0.45 | -0.8
La Paz | 14.89 | 14.6 | 1.02 | 0.02
Legazpi | 2.41 | 3.43 | 0.7 | -0.35
Lista | 9.48 | 3.6 | 2.63 | 0.97
Los Angeles | 7.61 | 4.33 | 1.76 | 0.56
Los Rosales | 7.21 | 10.97 | 0.66 | -0.42
Lucero | 8.84 | 5.79 | 1.53 | 0.42
Marroquina | 3.77 | 0.8 | 4.71 | 1.55
Media Legua | 8.68 | 5.17 | 1.68 | 0.52
Mirasierra | 3.71 | 5.34 | 0.69 | -0.36
Moscardó | 6.65 | 8.56 | 0.78 | -0.25
Niño Jesús | 9.13 | 11.56 | 0.79 | -0.24
Nueva España | 6.05 | 4.79 | 1.26 | 0.23
Numancia | 6.26 | 3.09 | 2.03 | 0.71
Opañel | 8.62 | 13.72 | 0.63 | -0.46
Orcasitas | 3.45 | 3.43 | 1.01 | 0.01
Orcasur | 7.46 | 5.09 | 1.47 | 0.38
OUTSKIRTS | 128.96 | 197.63 | 0.65 | -0.43
Pacífico | 7.54 | 6.59 | 1.14 | 0.13
Palacio | 18.91 | 10.08 | 1.88 | 0.63
Palomas | 0.94 | 1.07 | 0.88 | -0.13
Palomeras Bajas | 6.48 | 8.07 | 0.8 | -0.22
Palomeras Sureste | 9.32 | 8.89 | 1.05 | 0.05
Palos de Moguer | 8.71 | 7.38 | 1.18 | 0.17
Pavones | 5.01 | 3.79 | 1.32 | 0.28
Peñagrande | 9.12 | 3.96 | 2.3 | 0.83
Pilar | 18.09 | 7.05 | 2.57 | 0.94
Pinar del Rey | 15.25 | 9.74 | 1.57 | 0.45
Piovera | 1.3 | 1.02 | 1.27 | 0.24
Portazgo | 4.81 | 4.52 | 1.06 | 0.06
Pradolongo | 2.33 | 1.32 | 1.77 | 0.57
Prosperidad | 5.83 | 8.14 | 0.72 | -0.33
Pueblo Nuevo | 12.79 | 8.75 | 1.46 | 0.38
Puerta Bonita | 4.64 | 2.17 | 2.14 | 0.76
Puerta del Angel | 14.94 | 10.25 | 1.46 | 0.38
Quintana | 9.22 | 9.42 | 0.98 | -0.02
Recoletos | 15.42 | 19.21 | 0.8 | -0.22
Rejas | 1.69 | 1.12 | 1.51 | 0.41
Rios Rosas | 6.3 | 6.16 | 1.02 | 0.02
Rosas | 3.47 | 5.48 | 0.63 | -0.46
Salvador | 2.11 | 2.44 | 0.86 | -0.15
San Cristobal | 0.89 | 0.48 | 1.85 | 0.62
San Diego | 8.65 | 13.87 | 0.62 | -0.47
San Fermín | 2.77 | 3.46 | 0.8 | -0.22
San Isidro | 7.69 | 3.24 | 2.37 | 0.86
San Juan Bautista | 4.53 | 5.51 | 0.82 | -0.2
San Pascual | 3.06 | 4.43 | 0.69 | -0.37
Santa Eugenia | 4.95 | 8.21 | 0.6 | -0.51
Simancas | 8.14 | 7.96 | 1.02 | 0.02
Sol | 5.84 | 1.5 | 3.89 | 1.36
Timón | 1.45 | 4.74 | 0.31 | -1.18
Trafalgar | 9.51 | 11.42 | 0.83 | -0.18
Universidad | 13.71 | 11.69 | 1.17 | 0.16
Valdeacederas | 3.79 | 1.77 | 2.14 | 0.76
Valdebernardo | 1.78 | 2.61 | 0.68 | -0.38
Valdefuentes | 1.75 | 2.2 | 0.8 | -0.23
Valdemarín | 0.32 | 0.85 | 0.38 | -0.98
Valderrivas | 3.02 | 6.91 | 0.44 | -0.83
Valdezarza | 6.82 | 4.48 | 1.52 | 0.42
Vallehermoso | 10.05 | 6.55 | 1.53 | 0.43
Valverde | 12.21 | 16.76 | 0.73 | -0.32
Ventas | 10.82 | 6.15 | 1.76 | 0.56
Villaverde Alto | 9.79 | 16.75 | 0.58 | -0.54
Vinateros | 5.96 | 3.04 | 1.96 | 0.67
Vista Alegre | 13.63 | 12.34 | 1.1 | 0.1
Zofío | 1.51 | 0.68 | 2.22 | 0.8
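The last two columns of Table 1 are consistent with dividing the senior SPD by the standard SPD and taking the natural logarithm of the result. The sketch below reproduces a few rows under that reading. The function name is hypothetical, and mapping zero entries to a ratio and log ratio of 0 is an assumption based on the El Cañaveral and El Pardo rows.

```python
import math
from typing import Tuple


def spd_ratio(senior_spd: float, standard_spd: float) -> Tuple[float, float]:
    """Return (SPD ratio, natural log of the ratio) for one neighborhood."""
    # Assumption: rows with a zero entry are reported as 0 and 0.
    if senior_spd == 0 or standard_spd == 0:
        return 0.0, 0.0
    ratio = senior_spd / standard_spd
    return ratio, math.log(ratio)


# (senior SPD, standard SPD) values as listed in Table 1.
rows = {
    "Abrantes": (4.6, 3.2),
    "Adelfas": (7.92, 3.42),
    "Aravaca": (2.38, 4.5),
}
for name, (senior, standard) in rows.items():
    ratio, log_ratio = spd_ratio(senior, standard)
    print(f"{name}: ratio {ratio:.2f}, log ratio {log_ratio:.2f}")
```

A positive log ratio marks neighborhoods in which senior citizens travel relatively more than standard users, and a negative one marks the opposite.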

Appendix C

Figure 1: Number of inferred Homes per Neighborhood, compared to Number of Senior Citizens as per Census

Appendix D

Table 2: Spatial Probability, Share of Inferred Homes, and Log Spatial Probability per Inferred Home, for each Neighborhood

Neighborhood | Spatial Prob. (× 10^3) | Inferred Home Share (× 10^3) | Log SPD per Home SPD
Abrantes | 19.83 | 6.89 | 1.36
Acacias | 49 | 6.75 | 2.11
Adelfas | 41.71 | 4.61 | 2.31
Aeropuerto | 3.49 | 0.67 | 1.83
Aguilas | 56.02 | 23.43 | 1.22
Alameda de Osuna | 11.02 | 4.77 | 1.2
Almagro | 36.81 | 4.79 | 2.16
Almenara | 24.19 | 6.39 | 1.57
Almendrales | 22.7 | 3.36 | 2.05
Aluche | 78.71 | 31.48 | 1.25
Amposta | 14.79 | 3.07 | 1.76
Apostol Santiago | 10.08 | 4.68 | 1.15
Arapiles | 12.41 | 3.44 | 1.53
Aravaca | 5.66 | 3.59 | 0.95
Arcos | 26.48 | 4.64 | 1.9
Argüelles | 29.99 | 3.15 | 2.35
Atalaya | 3.24 | 0.34 | 2.35
Atocha | 4.95 | 0.34 | 2.74
Bellas Vistas | 37.12 | 4.5 | 2.22
Berruguete | 27.1 | 2.96 | 2.32
Buenavista | 38.89 | 9.09 | 1.66
Butarque | 5.97 | 1.47 | 1.62
Campamento | 15.38 | 7.2 | 1.14
Canillas | 30.72 | 9.6 | 1.44
Canillejas | 22.54 | 8.18 | 1.32
Cármenes | 23.53 | 4.45 | 1.84
Casa de Campo | 44.57 | 11.51 | 1.58
Casco Histórico de Barajas | 2.48 | 0.65 | 1.57
Casco Histórico de Vallecas | 27.97 | 6.86 | 1.62
Casco histórico de Vicálvaro | 32.78 | 11.82 | 1.33
Castellana | 25.36 | 3.7 | 2.06
Castilla | 14.4 | 4.64 | 1.41
Castillejos | 23.55 | 5.21 | 1.71
Chopera | 34.06 | 6.72 | 1.8
Ciudad Jardín | 39.26 | 7.62 | 1.82
Ciudad Universitaria | 40.74 | 6.2 | 2.02
Colina | 16.96 | 1.85 | 2.32
Comillas | 14.21 | 5.45 | 1.28
Concepción | 46.01 | 13.88 | 1.46
Corralejos | 2.81 | 0.25 | 2.5
Cortes | 64.79 | 1.65 | 3.7
Costillares | 16.52 | 6.24 | 1.29
Cuatro Caminos | 71.14 | 10.27 | 2.07
Cuatro Vientos | 9.82 | 2.56 | 1.58
Delicias | 32.01 | 7.4 | 1.67
El Cañaveral | 0 | 0.01 | 0
El Goloso | 2.99 | 0.67 | 1.7
El Pardo | 0.01 | 0 | 0
El Plantío | 0.15 | 0.08 | 1.06
El Viso | 34.72 | 3.82 | 2.31
Embajadores | 90.89 | 11.28 | 2.2
Ensanche de Vallecas | 8.04 | 1.75 | 1.72
Entrevías | 25.68 | 10.63 | 1.23
Estrella | 33.62 | 10.01 | 1.47
Fontarrón | 24.83 | 4.5 | 1.87
Fuente del Berro | 43.21 | 4.79 | 2.3
Fuentelareina | 1.87 | 0.88 | 1.14
Gaztambide | 11.17 | 3.07 | 1.53
Goya | 105.01 | 7.49 | 2.71
Guindalera | 70.5 | 12.73 | 1.88
Hellín | 7.13 | 3.42 | 1.13
Hispanoamérica | 33.68 | 8.47 | 1.6
Horcajo | 2.48 | 0.61 | 1.62
Ibiza | 51.7 | 7.75 | 2.04
Imperial | 27.46 | 4.66 | 1.93
Jerónimos | 30.38 | 1.5 | 3.06
Justicia | 29.39 | 2.26 | 2.64
La Paz | 54.64 | 10.01 | 1.87
Legazpi | 10.34 | 1.5 | 2.07
Lista | 40.06 | 4.96 | 2.21
Los Angeles | 33.37 | 12.13 | 1.32
Los Rosales | 24.46 | 7.85 | 1.41
Lucero | 46.38 | 13.39 | 1.5
Marroquina | 20.15 | 4.46 | 1.71
Media Legua | 49.34 | 10.78 | 1.72
Mirasierra | 16.05 | 4.12 | 1.59
Moscardó | 27.39 | 8.08 | 1.48
Niño Jesús | 34.45 | 7.74 | 1.7
Nueva España | 25.84 | 5.07 | 1.81
Numancia | 34.94 | 8.47 | 1.63
Opañel | 36.92 | 8.53 | 1.67
Orcasitas | 16.25 | 5.27 | 1.41
Orcasur | 22.25 | 2.04 | 2.48
OUTSKIRTS | 223.25 | 174.45 | 0.82
Pacífico | 38.09 | 8.7 | 1.68
Palacio | 86.62 | 7.23 | 2.56
Palomas | 3.6 | 0.65 | 1.88
Palomeras Bajas | 35.32 | 7.48 | 1.74
Palomeras Sureste | 43.92 | 11.1 | 1.6
Palos de Moguer | 47.33 | 6.07 | 2.17
Pavones | 27.05 | 3.61 | 2.14
Peñagrande | 45.15 | 13.48 | 1.47
Pilar | 94.21 | 17.35 | 1.86
Pinar del Rey | 74.47 | 21.91 | 1.48
Piovera | 6.04 | 0.72 | 2.24
Portazgo | 26.02 | 7.41 | 1.51
Pradolongo | 10.26 | 3.61 | 1.35
Prosperidad | 22.59 | 6.61 | 1.49
Pueblo Nuevo | 63.64 | 13.13 | 1.77
Puerta Bonita | 21.78 | 4.24 | 1.81
Puerta del Angel | 64.43 | 14.79 | 1.68
Quintana | 42.83 | 5.19 | 2.22
Recoletos | 59.35 | 4.42 | 2.67
Rejas | 6.75 | 3.18 | 1.14
Rios Rosas | 26.4 | 5.87 | 1.7
Rosas | 16.98 | 4.32 | 1.6
Salvador | 11.21 | 2.69 | 1.64
San Cristobal | 3.59 | 2.02 | 1.02
San Diego | 42.19 | 8.69 | 1.77
San Fermín | 13.18 | 4.28 | 1.41
San Isidro | 33.74 | 6.51 | 1.82
San Juan Bautista | 20.36 | 4.26 | 1.75
San Pascual | 13.79 | 4.06 | 1.48
Santa Eugenia | 23.04 | 5.77 | 1.61
Simancas | 39.51 | 8.07 | 1.77
Sol | 23.98 | 0.56 | 3.78
Timón | 3.5 | 1.78 | 1.09
Trafalgar | 36.2 | 6.6 | 1.87
Universidad | 48.74 | 7.86 | 1.97
Valdeacederas | 24.44 | 4.63 | 1.84
Valdebernardo | 7.02 | 2 | 1.51
Valdefuentes | 6.9 | 2.2 | 1.42
Valdemarín | 1.1 | 0.25 | 1.69
Valderrivas | 13.12 | 1.98 | 2.03
Valdezarza | 44.88 | 12.96 | 1.5
Vallehermoso | 36.22 | 7.27 | 1.79
Valverde | 46.13 | 12.75 | 1.53
Ventas | 62.52 | 18.61 | 1.47
Villaverde Alto | 38.83 | 11.56 | 1.47
Vinateros | 29.04 | 9.03 | 1.44
Vista Alegre | 50.17 | 14.02 | 1.52
Zofío | 6.43 | 0.88 | 2.12

Appendix E

Table 3: Cox Proportional Hazard Model Parameters

Covariate | Coefficient | exp(Coef) | SE(Coef) | z-value | p-value
Abrantes | -0.35 | 0.7 | 0.48 | -0.74 | 0.46
Acacias | 0.66 | 1.94 | 0.36 | 1.81 | 0.07
Adelfas | 0.3 | 1.34 | 0.41 | 0.72 | 0.47
Aguilas | -0.23 | 0.79 | 0.33 | -0.7 | 0.48
Alameda de Osuna | 0.19 | 1.21 | 0.46 | 0.41 | 0.68
Almagro | 0.56 | 1.74 | 0.38 | 1.45 | 0.15
Almenara | -0.47 | 0.62 | 0.46 | -1.02 | 0.31
Almendrales | 0.56 | 1.75 | 0.47 | 1.19 | 0.24
Aluche | -0.17 | 0.84 | 0.31 | -0.56 | 0.58
Amposta | 0.15 | 1.17 | 0.67 | 0.23 | 0.82
Apostol Santiago | 1.67 | 5.31 | 0.45 | 3.73 | <0.005
Arapiles | -0.55 | 0.58 | 0.63 | -0.86 | 0.39
Aravaca | -0.46 | 0.63 | 0.48 | -0.97 | 0.33
Arcos | 0.31 | 1.36 | 0.47 | 0.66 | 0.51
Argüelles | 0.17 | 1.19 | 0.54 | 0.32 | 0.75
Bellas Vistas | 0.72 | 2.05 | 0.42 | 1.71 | 0.09
Berruguete | -0.04 | 0.96 | 0.52 | -0.08 | 0.94
Buenavista | 0.25 | 1.28 | 0.39 | 0.64 | 0.52
Campamento | -0.38 | 0.69 | 0.41 | -0.91 | 0.36
Canillas | -0.14 | 0.87 | 0.4 | -0.35 | 0.72
Canillejas | -0.12 | 0.88 | 0.39 | -0.31 | 0.75
Cármenes | 0.05 | 1.06 | 0.54 | 0.1 | 0.92
Casa de Campo | 0.7 | 2.02 | 0.35 | 1.99 | 0.05
Casco Histórico de Vallecas | 0.28 | 1.32 | 0.37 | 0.75 | 0.45
Casco Histórico de Vicálvaro | 0.2 | 1.22 | 0.34 | 0.58 | 0.57
Castellana | 0.24 | 1.27 | 0.5 | 0.49 | 0.62
Castilla | -0.31 | 0.73 | 0.5 | -0.62 | 0.54
Castillejos | -0.46 | 0.63 | 0.49 | -0.94 | 0.35
Chopera | 0.54 | 1.72 | 0.46 | 1.19 | 0.23
Ciudad Jardín | -0.37 | 0.69 | 0.41 | -0.89 | 0.37
Ciudad Universitaria | -0.35 | 0.7 | 0.39 | -0.9 | 0.37
Colina | 0.51 | 1.67 | 0.73 | 0.7 | 0.48
Comillas | 0.67 | 1.95 | 0.6 | 1.12 | 0.26
Concepción | 0.03 | 1.03 | 0.37 | 0.09 | 0.93
Cortes | -0.05 | 0.95 | 0.47 | -0.1 | 0.92
Costillares | -0.3 | 0.74 | 0.48 | -0.64 | 0.52
Cuatro Caminos | 0.39 | 1.47 | 0.34 | 1.13 | 0.26
Cuatro Vientos | -1.41 | 0.24 | 0.63 | -2.25 | 0.02
Delicias | -0.62 | 0.54 | 0.44 | -1.41 | 0.16
El Viso | 0.31 | 1.36 | 0.44 | 0.7 | 0.48
Embajadores | 0.07 | 1.07 | 0.32 | 0.21 | 0.83
Ensanche de Vallecas | 0.72 | 2.05 | 0.59 | 1.22 | 0.22
Entrevías | -0.06 | 0.94 | 0.37 | -0.17 | 0.87
Estrella | -0.1 | 0.9 | 0.43 | -0.23 | 0.82
Fontarrón | -0.08 | 0.92 | 0.53 | -0.16 | 0.87
Fuente del Berro | 0.21 | 1.24 | 0.49 | 0.44 | 0.66
Gaztambide | 0.47 | 1.6 | 0.67 | 0.7 | 0.49
Goya | -0.52 | 0.6 | 0.38 | -1.38 | 0.17
Guindalera | -0.04 | 0.96 | 0.36 | -0.11 | 0.91
Hellín | -1.14 | 0.32 | 0.84 | -1.36 | 0.17
Hispanoamérica | 0.66 | 1.93 | 0.4 | 1.65 | 0.1
Ibiza | 0.5 | 1.64 | 0.38 | 1.3 | 0.19
Imperial | -0.81 | 0.45 | 0.56 | -1.45 | 0.15
Jerónimos | 0.47 | 1.6 | 0.55 | 0.86 | 0.39
Justicia | -0.14 | 0.87 | 0.55 | -0.26 | 0.8
La Paz | -0.47 | 0.63 | 0.36 | -1.3 | 0.19
Legazpi | 0.98 | 2.65 | 0.58 | 1.68 | 0.09
Lista | 0.17 | 1.18 | 0.49 | 0.34 | 0.74
Los Angeles | -0.05 | 0.95 | 0.36 | -0.13 | 0.89
Los Rosales | -0.38 | 0.69 | 0.38 | -0.98 | 0.33
Lucero | -0.16 | 0.85 | 0.41 | -0.39 | 0.69
Marroquina | 0.58 | 1.79 | 0.53 | 1.09 | 0.27
Media Legua | -0.48 | 0.62 | 0.39 | -1.21 | 0.23
Mirasierra | -0.88 | 0.42 | 0.53 | -1.65 | 0.1
Moscardó | -0.92 | 0.4 | 0.45 | -2.03 | 0.04
Niño Jesús | -0.29 | 0.75 | 0.41 | -0.72 | 0.47
Nueva España | -0.18 | 0.84 | 0.45 | -0.4 | 0.69
Numancia | 0.17 | 1.19 | 0.39 | 0.44 | 0.66
Opañel | -0.11 | 0.89 | 0.45 | -0.25 | 0.81
Orcasitas | 0.5 | 1.64 | 0.45 | 1.11 | 0.27
Orcasur | 0.9 | 2.47 | 0.44 | 2.05 | 0.04
OUTSKIRTS | -0.41 | 0.66 | 0.26 | -1.6 | 0.11
Pacífico | 0.26 | 1.3 | 0.42 | 0.62 | 0.53
Palacio | 0 | 1 | 0.37 | -0.01 | 1
Palomas | -0.13 | 0.88 | 0.9 | -0.15 | 0.88
Palomeras Bajas | -0.27 | 0.76 | 0.44 | -0.63 | 0.53
Palomeras Sureste | 0.47 | 1.59 | 0.35 | 1.32 | 0.19
Palos de Moguer | -0.02 | 0.98 | 0.41 | -0.06 | 0.96
Pavones | 0.68 | 1.97 | 0.47 | 1.43 | 0.15
Peñagrande | -0.22 | 0.81 | 0.37 | -0.58 | 0.56
Pilar | -0.25 | 0.78 | 0.32 | -0.79 | 0.43
Pinar del Rey | -0.07 | 0.93 | 0.3 | -0.24 | 0.81
Portazgo | 0.23 | 1.25 | 0.46 | 0.49 | 0.63
Pradolongo | -0.83 | 0.44 | 0.74 | -1.12 | 0.26
Prosperidad | 0.4 | 1.49 | 0.46 | 0.87 | 0.39
Pueblo Nuevo | -0.11 | 0.9 | 0.37 | -0.29 | 0.77
Puerta Bonita | 0.57 | 1.76 | 0.56 | 1.01 | 0.31
Puerta del Angel | -0.07 | 0.94 | 0.33 | -0.2 | 0.84
Quintana | 0.46 | 1.59 | 0.42 | 1.11 | 0.27
Recoletos | -0.56 | 0.57 | 0.39 | -1.44 | 0.15
Rejas | -0.5 | 0.6 | 0.7 | -0.72 | 0.47
Rios Rosas | -0.54 | 0.58 | 0.49 | -1.1 | 0.27
Rosas | -0.38 | 0.68 | 0.57 | -0.67 | 0.5
Salvador | 0.6 | 1.82 | 0.57 | 1.06 | 0.29
San Diego | 0.29 | 1.34 | 0.41 | 0.7 | 0.48
San Fermín | 0.4 | 1.49 | 0.49 | 0.81 | 0.42
San Isidro | 0.01 | 1.01 | 0.46 | 0.01 | 0.99
San Juan Bautista | -0.56 | 0.57 | 0.58 | -0.96 | 0.34
San Pascual | 0.96 | 2.62 | 0.51 | 1.89 | 0.06
Santa Eugenia | 0.11 | 1.11 | 0.43 | 0.25 | 0.8
Simancas | 0.04 | 1.04 | 0.43 | 0.1 | 0.92
Sol | 1.4 | 4.05 | 0.72 | 1.95 | 0.05
Trafalgar | 0.31 | 1.37 | 0.41 | 0.76 | 0.45
Universidad | -0.57 | 0.57 | 0.4 | -1.42 | 0.16
Valdeacederas | 0.4 | 1.49 | 0.45 | 0.89 | 0.38
Valdebernardo | -0.17 | 0.84 | 0.55 | -0.31 | 0.75
Valdefuentes | 0.43 | 1.54 | 0.56 | 0.77 | 0.44
Valderrivas | -0.88 | 0.42 | 0.69 | -1.27 | 0.2
Valdezarza | -0.05 | 0.95 | 0.4 | -0.13 | 0.9
Vallehermoso | 0.27 | 1.31 | 0.38 | 0.71 | 0.48
Valverde | 0.16 | 1.17 | 0.34 | 0.47 | 0.64
Ventas | -0.44 | 0.64 | 0.35 | -1.26 | 0.21
Villaverde Alto | -0.03 | 0.97 | 0.31 | -0.1 | 0.92
Vinateros | 0.06 | 1.06 | 0.49 | 0.13 | 0.9
Vista Alegre | -0.27 | 0.77 | 0.34 | -0.79 | 0.43
Hour 7 | -0.19 | 0.83 | 0.63 | -0.3 | 0.76
Hour 8 | -1.22 | 0.29 | 0.51 | -2.4 | 0.02
Hour 9 | -1.05 | 0.35 | 0.48 | -2.19 | 0.03
Hour 10 | -1.2 | 0.3 | 0.47 | -2.58 | 0.01
Hour 11 | -0.98 | 0.37 | 0.46 | -2.12 | 0.03
Hour 12 | -1.69 | 0.18 | 0.46 | -3.65 | <0.005
Hour 13 | -1.55 | 0.21 | 0.47 | -3.29 | <0.005
Hour 14 | 0.01 | 1.01 | 0.5 | 0.02 | 0.98
Hour 15 | -0.64 | 0.53 | 0.53 | -1.21 | 0.23
Hour 16 | -0.79 | 0.45 | 0.5 | -1.57 | 0.12
Hour 17 | -1.44 | 0.24 | 0.49 | -2.95 | <0.005
Hour 18 | -1.21 | 0.3 | 0.5 | -2.42 | 0.02
Hour 19 | -1.39 | 0.25 | 0.51 | -2.71 | 0.01
Hour 20 | -1.2 | 0.3 | 0.53 | -2.26 | 0.02
Hour 21 | -0.26 | 0.77 | 0.56 | -0.47 | 0.64
Hour 22 | 1.22 | 3.38 | 0.71 | 1.73 | 0.08
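The columns of Table 3 are related in the usual way for a Cox model summary: exp(Coef) is the hazard ratio, the z-value is the coefficient divided by its standard error, and the p-value follows from a two-sided normal test. The sketch below recomputes them from the rounded table entries, so small deviations from the listed z-values and p-values are to be expected. The function name is hypothetical.

```python
import math
from typing import Tuple


def wald_statistics(coef: float, se: float) -> Tuple[float, float, float]:
    """Return (hazard ratio, z-value, two-sided p-value) for one covariate."""
    hazard_ratio = math.exp(coef)
    z = coef / se
    # Two-sided tail probability of a standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return hazard_ratio, z, p


# (coefficient, standard error) values as listed in Table 3.
covariates = {
    "Apostol Santiago": (1.67, 0.45),
    "Abrantes": (-0.35, 0.48),
    "Hour 12": (-1.69, 0.46),
}
for name, (coef, se) in covariates.items():
    hazard_ratio, z, p = wald_statistics(coef, se)
    print(f"{name}: exp(coef) {hazard_ratio:.2f}, z {z:.2f}, p {p:.3f}")
```

A hazard ratio above 1 indicates a covariate associated with increased abandonment hazard, and a ratio below 1 indicates a decreased hazard.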