Analytics Practicum Report - 2015 Analysing Travel Patterns of Mass Rapid Public Transportation Through Time-Series Data Mining Koh Ying Ying Trecia; Luqman Haqim Bin Ab Rahman; Singapore Management University

ABSTRACT The adoption of Ezlink smart card technology allows transportation analysts to discover new insights into the consumption patterns and lifestyles of commuters in the transportation network. As smart cards contain rich data and all transactions are recorded in temporal sequence, they offer an opportunity to analyse complex and voluminous time-series data using time-series data mining techniques. This is particularly interesting as there is a need to transform these rich data into actionable information and knowledge that users can understand. Therefore, this paper explores the problems of the transportation network, validates them against the current implementation of the free-rides policy, and discusses the use of time-series data mining techniques to show whether the policy matches the findings.

INTRODUCTION Singapore is a small country, yet it has a complex and comprehensive public transportation network. The network consists of trains, which can be further split into Mass Rapid Transit (hereinafter known as MRT) and Light Rail Transit (hereinafter known as LRT), as well as buses and taxis. Public transport in Singapore employs a hub-and-spoke strategy: buses serve as the means of transportation within a town, and MRT trains are used for long-distance travel. The Land Transport Authority (hereinafter known as LTA) governs the public transport network in Singapore.

The demand for MRT ridership has increased significantly since 1997, as the MRT serves as a cheaper or faster alternative to car or taxi for long-distance travel. However, from 2011 to the time of this paper, confidence in the MRT system has dropped, as it has been plagued by service breakdowns. Some of these breakdowns are as short as 45 minutes and some last as long as a full day. Most Singaporeans attribute train breakdowns to the sudden influx of foreigners into the country. This influx has been purported to increase ridership.

Responding to calls from the public to improve the MRT infrastructure has been a priority for the MRT operators. It is important that the operators understand the traffic patterns of MRT ridership so that they can improve reliability and re-instil confidence in the MRT.

Should the MRT operators cater to the morning peak by increasing the frequency of trains in the morning, or should they increase the train frequency in the evenings when commuters end the day? Should policies be applied across all stations or should each station have different policies?

With the Government’s plans for a population of up to 6.9 million in Singapore by 2030, we hope to use analytics to understand the travel patterns of MRT commuters so as to improve MRT services.

This paper attempts to explore the travel patterns of MRT ridership in Singapore for the first week of November 2011. It continues the work done in Roy LEE’s Master’s thesis and seeks to explore the areas that LEE does not cover.

REVIEW OF SIMILAR WORK The literature review of the report is broken down into four parts: smart cards, the role smart cards play in public transport, conventional data mining and time-series data mining. In the first part, we introduce smart cards. The second part introduces how smart cards play a role in the public transport system. The third part introduces conventional data mining techniques and their limitations when applied to time-series data. Lastly, we introduce time-series data mining techniques.

SMART CARDS The smart card has been around since 1968, when it was developed by two German inventors (Shelfer and Procaccino, 2002). The Japanese went on to further improve smart card technology. In the late 1970s, Motorola successfully produced its first secure smart cards, which were implemented and used by the French banking system. In the 1990s, smart cards became much more significant with the introduction of the Internet. [9] Since then, smart cards have replaced the traditional and insecure magnetic cards and tickets, thus improving business processes. As smart cards are designed to store and process data (Lu, 2007), they are suitable for adoption in different domains. Nowadays, they are used in areas ranging from banking, healthcare and telephony services to transportation. Smart cards are a powerful tool for analysis as they can store rich contextual data about users, such as demographics, photos, fingerprints, banking data, transportation fares and others.

PUBLIC TRANSPORT Public transport service is key to a country’s development, and it increases the competitiveness and market share of smart card technologies. [1] Public transport services allow commuters to travel conveniently at any period of the day. To enhance transportation services, it is important for transit operators to study commuters’ travel patterns so as to make better decisions for the economy, for people’s livelihoods and for the country.

In larger cities, smart cards are used to manage the transit network, providing greater flexibility by allowing commuters to use the card at various times of the day and in different parts of the transit network. The smart card can validate and collect fares, which simplifies the fare collection process and reduces the overheads of cash-based payments. It also allows analysts to easily retrieve transit data from the database server; this data contains several columns of transactional attributes and is usually very large in volume. This gives analysts the opportunity to analyse and discover real-time travel activity and patterns of commuters using time-series data mining techniques. However, to adopt smart card technology, a country must have enough funding and be open to new technology. Singapore has adopted the technology.

On the other hand, in smaller cities, transport information is gathered through on-board travel surveys, regional travel surveys or household visits. This process is tedious, time-consuming and error-prone, as the data have been aggregated. As a result, transit analysts use conventional statistics from synthetic models to draw insights on commuters’ travel patterns. [1] Conventional statistics are limited in their capability to analyse data in temporal form.

Singapore Public Transport The Land Transport Authority (LTA) started working on the development of a new contactless smart card based ticketing system in 1994 and eventually decided to replace the existing magnetic ticket based system with an Enhanced Integrated Fare System (EIFS). [7] As a result, the Ezlink smart card was introduced in Singapore in April 2002 to simplify the fare collection process in public transportation. [2] Ezlink Pte Ltd, a subsidiary of LTA, sells Ezlink cards. In 2007, one of the telecommunication companies, Starhub, decided to integrate Ezlink functions into smart phones. Ezlink monopolized public transportation payment in Singapore until CEPAS Ezlink was introduced in 2009. CEPAS Ezlink, which replaced the original Ezlink smart card, provides both debit and credit capabilities and can easily adapt to a wide variety of payment schemes from different card or system operators. Ezlink furthered its plans by working with NETS (a point-of-sale cashless debit payment service) to create a new multi-purpose contactless card that integrates with CashCard (a stored value card). This allows commuters to use a single payment mode for public transport, car parks, Electronic Road Pricing (ERP) and retail purchases. In late 2009, in collaboration with NETS, Ezlink introduced the NETS FlashPay card, which can store a higher value of up to SGD 500 and can be topped up at ATMs, ticketing offices at MRT stations, bus interchanges, convenience stores and self-service iNETS kiosks with NETS FlashPay access. The FlashPay card could be used at more than 30,000 retail points island-wide. [10] Later, in 2010, Ezlink integrated with VISA to provide an auto-top-up service, which automatically tops up a commuter’s card to a predetermined amount, enabling commuters to use the card as both a credit card and a contactless smart card for ticketing. [11]

With the Ezlink smart card’s capabilities, analysts have a new window to study and discover patterns of commuters’ travel on both bus and rapid transit routes. They can easily extract trip information from Ezlink smart cards, such as occupancy, velocity, arrival and departure times at each stop and commuters’ demographics, for analytical purposes. [2] Since the implementation of the automated fare collection system, there have been large potential benefits for public transport planning and operation. [8] As this information is usually spatio-temporal in nature, data mining techniques need to be applied to explore commuters’ travel patterns.

CONVENTIONAL DATA MINING Data mining is also known as Knowledge Discovery in Databases (KDD). KDD is the process of transforming large amounts of raw, unprocessed data stored in data warehouses or data marts into meaningful insights that can be easily interpreted and understood by business users. There are three types of data mining techniques: association rules mining, classification and statistical techniques.

The mining techniques mentioned above are usually suitable for transactional databases that monitor daily transactions. However, businesses and analysts want to discover greater insights from the time-stamped data stored in transactional databases. Conventional techniques are not appropriate for this analysis, and time-series data mining techniques have to be applied to cater for time-stamped data.

TIME-SERIES DATA MINING Time-series data is a collection of data obtained from sequential measurements over time. The data collected by businesses is often accumulated from daily transactions and grows into large volumes over time. Time-series data mining poses a challenge to data analysts, which stems from the desire to visualize the shape of data over time. Since conventional mining techniques are inappropriate for analysing time-stamped data, data analysts were motivated to look into other techniques for the analysis of time-series data.

This gave rise to the development of time-series data mining techniques, which unveil numerous facets of complexity. Among the most prominent issues are the high dimensionality of the data and the difficulty of defining a similarity measure for time series based on human perception. Time-series data mining techniques therefore have to take into consideration constraints such as data representation, similarity measurement and indexing of time-series data for faster querying.

In data representation, the data analyst has to consider how to transform high-dimensional data from time-stamped transactions into a table that a time-series application can recognise as time-series data for analysis and pattern detection. [12] This transformation is tedious, as the data analyst needs to ensure that the data is transformed into a set of contiguous time instances. The time series may be univariate or multivariate, in which case it spans multiple dimensions within the same time range. [11]

In similarity measurement, one issue faced by data analysts is the irregularity of the recorded time-stamps. This results in different time series having common trends that occur at different times. One technique used is dynamic time warping (DTW), proposed by Berndt and Clifford (1994) [11], which is able to match various sections of time-series data by allowing warping of the time axis, producing an optimal alignment along the shortest warping path in the distance matrix.
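The warping idea can be illustrated with a minimal sketch of the classic dynamic-programming form of DTW. This is an assumption-laden illustration rather than the implementation used by any particular tool: the function name `dtw_distance`, the absolute difference as the local distance, and the two toy series are all hypothetical.

```python
def dtw_distance(a, b):
    """Return the DTW distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # point-to-point distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # only a advances (b's point is reused)
                cost[i][j - 1],      # only b advances (a's point is reused)
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# Two toy series with the same peak shape, shifted in time (hypothetical data).
peak_early = [0, 1, 5, 1, 0, 0, 0]
peak_late = [0, 0, 0, 1, 5, 1, 0]

point_by_point = sum(abs(x - y) for x, y in zip(peak_early, peak_late))
warped = dtw_distance(peak_early, peak_late)
```

Here the point-by-point comparison treats the two series as quite different, whereas the warped comparison recognises that they share the same shape and differ only in timing.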

Time-series mining techniques are not new to businesses, as studies have shown that organizations have been implementing them. However, they are most commonly seen in the retail business domain and in public transportation in other countries. In one case study, Nakkeeran et al [5] demonstrated time-series mining techniques used in the retail domain for clustering and profiling of store-level revenue over time. In another case study, Hirokaki et al [3] demonstrated time-series mining techniques applied to public transport in Japan to analyse trip patterns of passengers using smart cards for bus and tram rides.

Lastly, this paper hopes to use time-series data mining techniques to discover patterns in both the entry and exit timings of commuters’ travel behaviour within the public transportation network of Singapore.

TIME-SERIES DATA PREPARATION After the Exploratory Data Analysis (EDA) phase, there is a need to transform the data into time-series format to perform further analysis. The time-series data transformation process shown below gives an overview of how the EDA data has been transformed into time-series data. We use JMP Pro 11 and SAS Enterprise Guide interchangeably, as we are comfortable with the different features of each tool for different tasks.

Figure 1: Time-series data transformation process

EZLINK DATA The Ezlink data consists of a week’s worth (1st November 2011 - 6th November 2011) of Ezlink transactions on Singapore public transport. The dataset comes from a MySQL database provided by the Living Analytics Research Centre. The database contains four tables:
• Bus_service_mapping
• Location_gis_mapping
• Location_mapping
• Lta_ride

For our project we focus only on location_mapping and lta_ride. To perform analysis on the data, we first extract it from the database using the MySQL database export. The raw data shows the entry and exit time-stamp transactions of both buses and trains by card_number, and it contains approximately 33 million rows. (Refer to Appendix 1 - 2)

FILTERING RELEVANT DATA As we are only interested in the train dataset, we first filter on transport type to extract only “RTS” records. We then select only the data relevant to our analysis, such as entry_time, exit_time, number of commuters based on card_number_e, origin_location_id, destination_location_id and commuters category. Here are the descriptions of the variables:

Variable                   Description
card_number_e              Commuter’s Ezlink card
Commuters category         One of four categories in which commuters are grouped: 1. Adult 2. Child 3. Senior Citizen 4. Student
entry_time                 Time at which the commuter entered the station
exit_time                  Time at which the commuter exited the station
origin_location_id         Origin station id
destination_location_id    Destination station id

Figure 2: Description of each of the variables

After filtering, the dataset was reduced from 33 million rows to approximately 10 million rows of data, which would be used in our analysis.
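The filtering step can be sketched in a few lines of Python. This is only an illustration of the logic, not the SAS Enterprise Guide steps actually used; the column names `transport_mode` and `commuter_category` and the sample rows are assumptions, since the exact field names in lta_ride are not given here.

```python
# Columns kept for the analysis. "transport_mode" and "commuter_category" are
# hypothetical column names; the actual lta_ride field names may differ.
KEEP = (
    "card_number_e",
    "commuter_category",
    "entry_time",
    "exit_time",
    "origin_location_id",
    "destination_location_id",
)


def filter_train_rides(rows):
    """Keep only RTS (train) rides and only the columns used later."""
    return [
        {col: row[col] for col in KEEP}
        for row in rows
        if row["transport_mode"] == "RTS"
    ]


# Made-up sample rows, for illustration only.
sample_rows = [
    {"card_number_e": "A1", "commuter_category": "Adult",
     "transport_mode": "RTS",
     "entry_time": "2011-11-01 08:02:11", "exit_time": "2011-11-01 08:31:40",
     "origin_location_id": 1, "destination_location_id": 2},
    {"card_number_e": "B2", "commuter_category": "Student",
     "transport_mode": "Bus",
     "entry_time": "2011-11-01 07:15:00", "exit_time": "2011-11-01 07:40:00",
     "origin_location_id": 3, "destination_location_id": 4},
]

train_rides = filter_train_rides(sample_rows)
```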

CREATING NEW COLUMNS First, the entry_time and exit_time values in the raw data need to be aggregated into 15-minute interval time-stamps for the daily records. We used SAS Enterprise Guide to create recoded columns (new_entry_DM_format / new_exit_DM_format) that store the time in intervals of 15 minutes, starting from 5.00 AM and ending at 1.45 AM the next morning. We then created another new column (time_series_entry_time_format / time_series_exit_time_format) by applying an advanced case function to the recoded column to generate running numbers from 1 to 83 for each day. The case function allows us to define the starting value of each running number. Since the time intervals need to be in running order, we recode the 2nd day to start from 84 instead of 1, and apply the same approach until the 6th day.

Creating these two new columns for the entry and exit time-stamps allows each running number to be referenced easily, as it identifies both the time and the date. For example, 1-83 fall on 01/11/2011 from 5.00 AM to 1.45 AM, whereas 84-166 fall on 02/11/2011 from 5.00 AM to 1.45 AM, and so on. The sample data shows how the time-stamp format and time-series numbering look for entry, and the same steps are applied to the exit time-stamp. (Refer to Appendix 3)
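The numbering scheme above can be expressed as a short calculation. The following is a sketch under stated assumptions rather than the case function used in SAS Enterprise Guide: it assumes that 1.45 AM marks the end of the last interval (so that each day has exactly 83 intervals), and the function names `running_interval` and `interval_start` are hypothetical.

```python
from datetime import datetime, timedelta

FIRST_DAY_START = datetime(2011, 11, 1, 5, 0)  # 5.00 AM on the first day
SLOT = timedelta(minutes=15)
SLOTS_PER_DAY = 83  # assumed: 5.00 AM up to (not including) 1.45 AM


def running_interval(ts):
    """Map a time-stamp to its running interval number, or None if out of range."""
    elapsed = ts - FIRST_DAY_START
    if elapsed < timedelta(0):
        return None
    day, within_day = divmod(elapsed, timedelta(days=1))
    slot = within_day // SLOT
    if slot >= SLOTS_PER_DAY:  # 1.45 AM to 5.00 AM lies outside the intervals
        return None
    return day * SLOTS_PER_DAY + slot + 1


def interval_start(number):
    """Inverse mapping: start time of a running interval number (1-based)."""
    day, slot = divmod(number - 1, SLOTS_PER_DAY)
    return FIRST_DAY_START + timedelta(days=day) + slot * SLOT
```

For instance, `running_interval(datetime(2011, 11, 2, 5, 0))` gives 84, matching the recoding of the 2nd day described above, and `interval_start(84)` maps that number back to 5.00 AM on 02/11/2011.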

DATA MAPPING Using origin_location_id and destination_location_id, we perform data mapping by extracting the location names from the Location_mapping table into two new columns, namely origin_location_name and location_name. This gives users the names of the origin and destination stations instead of just the station ids. After data mapping, each location id in the data is linked to its location name. (Refer to Appendix 4)
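Conceptually, the mapping is a lookup from location id to location name. A minimal sketch follows, assuming the Location_mapping table has been loaded into a dictionary; the ids and the helper name `add_location_names` are made up for illustration.

```python
def add_location_names(rides, location_names):
    """Attach origin_location_name and location_name to each ride record."""
    named = []
    for ride in rides:
        enriched = dict(ride)  # copy, so the input rows are left untouched
        enriched["origin_location_name"] = location_names.get(
            ride["origin_location_id"])
        enriched["location_name"] = location_names.get(
            ride["destination_location_id"])
        named.append(enriched)
    return named


# Hypothetical ids; the real ids come from the Location_mapping table.
location_names = {1: "Raffles Place", 2: "Clementi"}
rides = [{"card_number_e": "A1",
          "origin_location_id": 1,
          "destination_location_id": 2}]

named_rides = add_location_names(rides, location_names)
```

Using `dict.get` means an id that is missing from the mapping yields `None` instead of raising an error, which makes unmapped stations easy to spot afterwards.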

COMBINE DATA After creating the new columns for the time-series interval and performing data mapping on the location names, we combine everything into one single file. This file includes attributes such as the time-series interval, origin_location_name, location_name, card_number_e and n_rows. (Refer to Appendix 5) We then segmented the data by commuter type, mainly Adult, Senior Citizen and Student.

There are a total of four types of commuters; however, the Child commuter type was not included in our analysis. This is because the Child category makes up less than 0.1% of the entire population and may not produce meaningful time-series graphs for analysis. After segmentation, the three datasets (Adult, Senior Citizen and Student) are saved in sas7bdat format.
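The combination and segmentation steps amount to counting rides per time interval and station within each commuter category. The sketch below illustrates this under assumptions: the field names `commuter_category` and `entry_interval` are hypothetical, and the sample rows are made up.

```python
from collections import Counter, defaultdict

EXCLUDED = {"Child"}  # left out of the analysis, as described above


def count_entries_by_segment(rides):
    """Return {category: Counter{(interval, station): n_rows}}."""
    segments = defaultdict(Counter)
    for ride in rides:
        category = ride["commuter_category"]
        if category in EXCLUDED:
            continue
        key = (ride["entry_interval"], ride["origin_location_name"])
        segments[category][key] += 1
    return dict(segments)


# Made-up sample rows, for illustration only.
rides = [
    {"commuter_category": "Adult", "entry_interval": 13,
     "origin_location_name": "Bedok"},
    {"commuter_category": "Adult", "entry_interval": 13,
     "origin_location_name": "Bedok"},
    {"commuter_category": "Student", "entry_interval": 9,
     "origin_location_name": "Clementi"},
    {"commuter_category": "Child", "entry_interval": 9,
     "origin_location_name": "Clementi"},
]

segments = count_entries_by_segment(rides)
```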

LOAD INTO SAS SERVER The data in sas7bdat format is exported to the SAS Server using SAS Enterprise Guide. This allows easy retrieval of the dataset when using SAS Enterprise Miner. Although the File Import node would allow us to import external data files for analysis, exporting the data directly to the server is faster, since we prepared the data in SAS Enterprise Guide.

TIME-SERIES METHODOLOGY SAS Enterprise Miner is analytical software that streamlines and simplifies the data mining process, allowing users to perform descriptive, predictive and time-series analysis on huge volumes of data. The software has interactive visualization functions, and its user interface allows easy interaction through drag-and-drop functionality. For this project, we use it as a tool to gather insights into commuters’ travel patterns from the Ezlink dataset on public transport in Singapore. The diagram below shows the SAS Enterprise Miner workflow, which consists of four different nodes that enable us to arrive at our final findings.


Figure 3: SAS Enterprise Miner Workflow

DATASET To perform time-series data mining analysis, we retrieve the dataset from the SAS Server and define the properties of the data. This ensures that SAS Enterprise Miner recognises the correct format of each column. (Refer to Appendix 6)

First, we define the time-series column (Refer to Appendix 6, entry_time_format_cumulative), which contains the aggregated time-series interval, and set its role to “Time ID”. This allows the application to recognise it as time-series data and plot the x-axis with the relevant time-stamp interval.

Secondly, we define the number of commuters (N_card_number_e), which contains the count of commuters, and set its role to “Target”. The “Target” variable tells us how many commuters entered or exited a station, based on the card number. This allows the application to plot the frequency on the y-axis of the generated time-series charts.

Lastly, we define the location of each station (origin_location_name) and set its role to “CrossID”. This allows us to perform data aggregation by station for analysis.

TS DATA PREPARATION (TSDP) The TSDP node transforms the dataset into a time-stamped format that the program can read. The results generated by the TSDP node allow the user to analyse the time-series dataset based on its summary statistics. (Refer to Appendix 7)
• Multiple Time Series Comparison Plot: Represents a time-series graph based on multiple MRT stations.
• TSID Map Table: Figure 3 shows that the original dataset, with one time-series variable and one CrossID variable, has been transposed and converted into 128 time series with 128 unique variables. In the table, each time series is named TS_n, where n is the value of the TSID created for the time series.
• TSID Map Summary Table: Figure 3 shows the level, count, frequency and percentage information for the CrossID and TSID variables. There are a total of 128 origin_location_name values for the input and target variables, which results in 256 unique TSIDs.
• Time Series Metadata Table: Figure 3 shows that the Adult_Entry input data uses entry_time_format_cumulative as the TimeID variable to analyse data from interval 1 to 491, which covers 01 November 2011 to 06 November 2011.
• Time Series Plot: Represents a time-series graph based on a particular MRT station.
• Time Series Summary: Shows the distribution of commuters for each station in the form of bar graphs, which include max, mean, min and sum.
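The transposition performed by the TSDP node can be pictured as turning long-format records (Time ID, CrossID, Target) into one series per station. The following sketch is an illustration only and is not how the SAS node is implemented; the function names, the alphabetical TSID assignment and the sample counts are assumptions.

```python
def to_station_series(records, n_intervals):
    """Pivot a list of (interval, station, count) records into per-station series.

    Returns (tsid_map, series): tsid_map maps a station name to "TS_n",
    and series maps "TS_n" to a list of counts indexed by interval - 1.
    """
    stations = sorted({station for _, station, _ in records})
    tsid_map = {name: f"TS_{i}" for i, name in enumerate(stations, start=1)}
    series = {tsid: [0] * n_intervals for tsid in tsid_map.values()}
    for interval, station, count in records:
        series[tsid_map[station]][interval - 1] += count
    return tsid_map, series


def summarise(values):
    """Summary statistics of the kind listed under Time Series Summary."""
    return {"max": max(values), "mean": sum(values) / len(values),
            "min": min(values), "sum": sum(values)}


# Made-up counts, for illustration only.
records = [(1, "Bedok", 10), (2, "Bedok", 30), (1, "Clementi", 5)]
tsid_map, series = to_station_series(records, n_intervals=3)
bedok_summary = summarise(series[tsid_map["Bedok"]])
```

Intervals with no records are filled with zero so that every station series covers the same contiguous range of intervals.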

VARIABLE CLUSTERING (VC) After understanding the variables, we need to group stations with similar time-stamp patterns into the same clusters, so we performed clustering on the dataset. (Refer to Appendix 8)
• Cluster Plot: Figure 4 shows the 8 clusters created from the interval input variables in the Adult_Entry data.
• Dendrogram: Shows a tree hierarchy that displays how the clusters are formed.
• Selected Variable: Shows cluster statistics sorted by cluster. The R-Square with Next Cluster Component column indicates the score with the nearest cluster. If the clusters are well separated from one another, this results in a low R-Square score.
• Variable Selection Table: Shows the cluster statistics sorted by cluster, together with the variables that are linked to each cluster. The R-Square with Next Cluster Component column indicates the score with the nearest cluster. If the clusters are well separated from one another, this results in a low score.
• Variable Frequency Table: Shows the level, frequency count and percentage information for each cluster.
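To give a feel for how an R-Square score with the own cluster and with the next cluster behaves, the sketch below uses the squared Pearson correlation between a station series and a cluster's mean series. This is a simplification and an assumption: the mean series is only a stand-in for the cluster component computed by the VC node, and all series shown are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def mean_series(members):
    """Element-wise mean of the member series (stand-in for a cluster component)."""
    return [sum(values) / len(values) for values in zip(*members)]


def r_square_with(series, cluster_members):
    """Squared correlation between a series and a cluster's mean series."""
    return pearson(series, mean_series(cluster_members)) ** 2


# Made-up station series: two clusters with opposite peak timing.
evening_cluster = [[0, 1, 0, 8, 2], [0, 2, 0, 9, 1]]
morning_cluster = [[9, 2, 0, 1, 0], [8, 1, 0, 2, 0]]
station = [0, 1, 0, 7, 2]  # behaves like the evening cluster

r2_own = r_square_with(station, evening_cluster)
r2_next = r_square_with(station, morning_cluster)
```

A high score against the own cluster combined with a low score against the next cluster is the pattern expected when the clusters are well separated.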

We recognize that clustering is one of the vital steps in unsupervised learning problems. There are two different types of clustering: conventional and time-series clustering.

Conventional Clustering Conventional clustering is well known and applicable in most applications. It uses methods such as hierarchical clustering or the simple k-means algorithm, which are effective when the dataset is small and consistent, producing clusters of good structural quality. This is because the algorithm is able to compare multiple variables based on intra-cluster or inter-cluster distance.

However, conventional clustering is ineffective when applied to time-series data because of the high dimensionality of the data and the unique time-stamp structure. If hierarchical clustering is applied, it loses its effectiveness, as it cannot process large time-series datasets. This leads to poor-quality results and a potential loss of data, as the clustering algorithm is not scalable.

On the other hand, although the simple k-means algorithm may work for time-series datasets, it does not provide the optimal number of clusters. This is a potential problem, as the user needs to define the number of clusters manually, either from a dendrogram, which is difficult if the dendrogram is cluttered, or by trial and error based on one’s own assumptions; this may lead to inaccurate analysis.
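The dependence of k-means on a user-supplied number of clusters is visible in even a minimal implementation. The sketch below is a plain textbook k-means written for illustration; the naive initialisation (first k points), the fixed iteration count and the toy station profiles are all assumptions.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain k-means; note that k has to be chosen by the caller."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy station profiles (made-up): two morning-heavy, two evening-heavy.
profiles = [[9, 1, 0], [8, 2, 0], [0, 1, 9], [0, 2, 8]]
labels, centroids = kmeans(profiles, k=2)
```

Calling `kmeans(profiles, k=3)` would run just as readily and return a different grouping, which is why the number of clusters has to be justified separately by the analyst.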

Time-series clustering using VC node Unlike conventional clustering, the VC algorithm removes collinearity, decreases variable redundancy and reveals the underlying structure of the input variables with minimal loss of information. It uses a combination of clustering techniques based on a distance matrix and latent variables based on probability distributions, which is applicable to large datasets.


Time-series clustering compares variables based on the time-stamps of the data, and the series being compared may vary in length. The DTW algorithm is often used to overcome this problem of varying lengths. It compares two time-series sequences based on Euclidean Distance (ED) and aligns them by creating a warping matrix in which it searches for the optimal path. The elastic shifting in the time domain matches sequences that are similar but out of phase.

One of the most prominent features of the VC node in SAS Enterprise Miner is that it automatically generates the optimal number of clusters without the need to define parameter settings, although a limitation is that the VC node is only available in SAS Enterprise Miner.

TS SIMILARITY (TSS) Lastly, the TSS node allows the user to analyse the similarity of the time series grouped by clusters over time. As there are different variations of time-series algorithms, the DTW algorithm is applied to match similar time series of different lengths. (Refer to Appendix 9)
• Cluster Constellation Plot: Shows a simple view of the identified clusters.
• Cluster Dendrogram: Shows a tree hierarchy that displays how the clusters are formed.
• Distance Map: Shows the clustered time-series data on both axes, providing a visual display of the similarity between each cluster and every other cluster. Blue indicates similarity and red indicates dissimilarity.

Using the above workflow, we are able to obtain meaningful insights into commuters’ travel patterns, with train stations grouped into clusters of similar time series. Further analysis of TSDP, VC and TSS is discussed in the later section, Advance Finding.

ADVANCE FINDING Before proceeding with the graphs, Figure 4 gives a general guide on how to interpret them.

Figure 4: General guide on interpretation of graph

Figure 4 explains the time-series graphs. Each graph contains a total of 6 mini ‘mountains’, and each mountain represents one day in the dataset received.
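Reading the daily ‘mountains’ programmatically amounts to cutting a station series into one block per day and locating the busiest interval in each block. The sketch below assumes 83 intervals per day, as in the data preparation step; the function name `daily_peaks` and the synthetic series are hypothetical.

```python
SLOTS_PER_DAY = 83  # 15-minute intervals per day, as in the data preparation


def daily_peaks(series):
    """Return [(day, running_interval, count)] for the busiest interval per day."""
    peaks = []
    for start in range(0, len(series), SLOTS_PER_DAY):
        day_values = series[start:start + SLOTS_PER_DAY]
        offset = max(range(len(day_values)), key=day_values.__getitem__)
        day_number = start // SLOTS_PER_DAY + 1
        peaks.append((day_number, start + offset + 1, day_values[offset]))
    return peaks


# Synthetic two-day series with one spike per day (made-up numbers).
one_day = [1] * SLOTS_PER_DAY
one_day[52] = 40  # 53rd interval of the day, i.e. 13 hours after 5.00 AM
peaks = daily_peaks(one_day + one_day)
```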

ADULT

Entry Analysis
There are a total of 8 clusters for the analysis of Adult entry time-stamp. Refer to Appendix 10 for the remaining analysis on clusters 1, 3, 4, 7 and 8.

Cluster 2
This cluster has a strong and dense evening peak with a less dense mid-day peak. The stations in this cluster are mainly in commercial areas. The evening peak suggests that office workers are heading back home. The mid-day peak suggests that people who work near these stations are heading to other stations nearby (maybe one or two stops away) for lunch.
MRT: Buona Vista-CCL, Dover, Haw Par Villa, Joo Koon, Kent Ridge, Labrador Park, Nicoll Highway, Novena, Outram Park, Outram Park NEL, Pasir Panjang, Raffles Place, Tai Seng, Tanjong Pagar, one-north

Cluster 5
This cluster has two peaks, one in the morning and a stronger one in the evening. The stations in this cluster are located close to, or are the only station near, major industrial estates. The strong evening peak indicates that workers from the major industrial estates are heading back home.
MRT: Botanic Gardens, Buona Vista, Caldecott, Clementi, Lavender, MacPherson, Marymount, Newton, Paya Lebar, Queenstown, Redhill, Yio Chu Kang
LRT: BPLRT, Pending

Cluster 6
This cluster has two peaks, a strong early morning peak and a weaker afternoon peak. The stations in this cluster are in residential areas outside the main heartland areas and are close to industrial areas located in the North Western part of Singapore. The dominant early morning peak suggests that Adults are heading to work in the morning. The very early peak suggests that these Adults work very far away from home. The fact that these stations are not in the main heartlands explains the low density. The afternoon peak reflects the surge of workers heading back home after working in the industrial estates near these stations.
MRT: Bukit Batok, Choa Chu Kang, Kranji, Marsiling, Pioneer, Woodlands, Yew Tee
LRT: Bangkit, Fajar, Jelapang, Keat Hong, Petir, Phoenix, Segar, Senja, Teck Whye

Exit Analysis
There are a total of 11 clusters for the analysis of Adult exit time-stamp. Refer to Appendix 11 for the remaining analysis for clusters 4 to 11.

Cluster 1
Cluster 1 is mainly made up of stations that are located in the heart of residential areas. There is one main peak, in the evening, when people are heading back home after work. This is further supported by the fact that on Friday the peak is slightly lower than on other days, as people tend not to head back home directly; instead, they tend to engage in leisure activities such as shopping or movies.
MRT: Admiralty, Aljunied, Ang Mo Kio, Bedok, Bishan, Bishan CCL, Bukit Gombak, Chinese Garden, Choa Chu Kang, Dakota, Hougang, Kallang, Kembangan, Khatib, Kovan, Lakeside, Marsiling, Pasir Ris, Punggol, Sembawang, Serangoon, Serangoon CCL, Simei, Tampines, Telok Blangah, Yew Tee, Yishun
LRT: Layar, Petir, Sengkang, Senja

Cluster 2
Cluster 2 has a large morning peak. These stations are surrounded by commercial and/or industrial estates with no or minimal residential estates. Traffic at these stations consists mainly of people who are heading to work or school.
MRT: Bras Basah, Buona Vista-CCL, Caldecott, Clarke Quay, Expo, Harbour Front, Haw Par Villa, Kent Ridge, Labrador Park, MacPherson, Marina Bay, Newton, Nicoll Highway, Outram Park NEL, Promenade, Raffles Place, Tai Seng, Tanjong Pagar, one-north

Cluster 3
Cluster 3 describes the ridership patterns of those who are heading towards residential areas that are close to industrial estates. The first peak describes passengers who are heading to work. The second peak describes people who are heading back home. Both peaks are similar, which further supports the view that these estates are a mix of residential, commercial and industrial estates.
MRT: Boon Keng, Braddell, Bukit Batok, Eunos, Farrer Road, Mountbatten, Paya Lebar CCL, Pioneer, Potong Pasir, Tanah Merah, Tiong Bahru, Toa Payoh, Woodlands
LRT: Teck Whye

SENIOR CITIZEN

Entry Analysis
There are a total of 10 clusters for the analysis of Senior Citizen entry time-stamp. Refer to Appendix 12 for the remaining analysis on clusters 1, 3, 6, 7, 8, 9 and 10.

Cluster 2
Cluster 2 has ‘two-pronged’ peaks, one during mid-day and another shortly after. The stations in this cluster are near government buildings such as the CPF Building and the Immigration and Checkpoints Authority Building. These stations are also close to markets. This suggests that the elderly are heading back home after a day of marketing or dealing with government matters.
MRT: Aljunied, Bukit Panjang, Dakota, Haw Par Villa, Holland Village, Jurong East, Lavender, Little India, MacPherson, Novena, Outram Park, Paya Lebar, Paya Lebar CCL
LRT: Choa Chu Kang BPLRT, Pending, Phoenix

Cluster 4
Cluster 4 shows a peak just after mid-day that slowly tapers off. The stations in cluster 4 are in retail areas. This suggests that the elderly are heading back home after a day of shopping. It is interesting to note that the peak occurs before the expected 6pm peak identified by Roy Lee. This indicates that the elderly are avoiding the 6pm peak of the Adult population.
MRT: Bras Basah, Bugis, Changi Airport, Chinatown, City Hall, Clarke Quay, Dhoby Ghaut, Dhoby Ghaut NEL, Esplanade, Expo, Farrer Park, Harbour Front, HarbourFront-CCL, Marina Bay, Orchard, Promenade, Somerset
LRT: Farmway

Cluster 5
Cluster 5 has a dominant evening peak. Commercial or industrial buildings surround the stations in this cluster. This suggests that the elderly who work near these stations are heading back home after work. The consistency of the peak patterns, even on Friday, indicates that the elderly do not leave work early; they go home only when it is time to go home, and they lead a fixed routine.
MRT: Joo Koon, Kent Ridge, Labrador Park, Newton, Nicoll Highway, Outram Park NEL, Pasir Panjang, Raffles Place, Tai Seng, Tanjong Pagar, one-north

Exit Analysis Graphs Description

There are a total of 9 clusters for the analysis of Senior Citizen exit time-stamp. Refer to Appendix 13 for the remaining analysis on clusters 1, 3, 4, 5, 7 and 8.

Cluster 2
Cluster 2 shows a strong mid-day peak. The stations in this cluster are within walking distance of places of worship such as Kwan Im Thong Hood Cho Temple and Sri Veeramakaliamman Temple. This suggests that the elderly are going to places of worship. There is a distinct and very strong evening peak on Friday, and a peak on Sunday. These are traced to Kranji MRT station and are due to the regular horse races held on Friday nights and Sunday mornings.
MRT: Bras Basah, Bugis, Chinatown, Dhoby Ghaut NEL, Esplanade, Expo, HarbourFront-CCL, Haw Par Villa, Kranji, Little India
LRT: Choa Chu Kang BPLRT, Phoenix

Cluster 6
Cluster 6 shows a strong mid-day peak. This can be explained as the stations in this cluster are mainly for leisure and retail.
MRT: Botanic Gardens, City Hall, Kent Ridge, Lavender, Nicoll Highway, Novena, Orchard, Outram Park, Outram Park NEL, Paya Lebar, Somerset
LRT: Bangkit

Cluster 9
Cluster 9 contains a strong evening peak. This can be explained as Senior Citizens heading back home after a day of work. Most of the housing estates around the stations in cluster 9 are about 15 years old. There is a very dominant peak, shown in brown, which belongs to Clementi station. The Clementi housing estate is one of the oldest housing estates, dating back to 1893, and most of its housing projects are older than those in the rest of Singapore. We can conclude that there are a lot of elderly living in Clementi, which is a very old town in Singapore.
MRT: Chinese Garden, Clementi, Eunos, Kallang, Lakeside, Marymount, Tanah Merah, Yio Chu Kang
LRT: Teck Whye

STUDENT

Entry Analysis Graphs Description

There are a total of 11 clusters for the analysis of Student entry time-stamp. Refer to Appendix 14 for the remaining analysis on clusters 1, 2, 4, 7, 8, 9, 10 and 11.

Cluster 3
Cluster 3 has a dominant morning peak and an evening peak that is also very close to the morning peak. The stations in cluster 3 are in housing estates that are close to industrial areas. This suggests that Students are heading to school or part-time work in the morning, while the later peak suggests that Students who work at these locations are heading back home.
MRT: Buangkok, Bukit Gombak, Eunos, Hougang, Kembangan, Khatib, Kranji, Lakeside, Pasir Ris, Telok Blangah, Tiong Bahru, Woodlands, Woodleigh, Yishun
LRT: Compassvale, Coral Edge, Cove, Fernvale, Keat Hong, Senja

Cluster 5
Cluster 5 shows a dominant afternoon peak. The stations in this cluster are located in commercial, leisure and retail areas. This suggests that Students are heading back home after leisure activities or work.

MRT: Chinatown, Farmway, Farrer Park, Holland Village, Kent Ridge, Little India, Novena, Paya Lebar, Stadium, Tai Seng, Tanjong Pagar, one-north

Cluster 6
Cluster 6 displays a strong morning peak, with a second peak around 9-10am and another dominant peak at mid-day. The morning peak suggests Students who are going to school. The second peak suggests Students who are heading out for supplementary classes, and the last would be Students who are heading out for leisure. The stations in the cluster are predominantly in housing estates that are close to schools, such as Buona Vista (NUS), but the Students living there do not attend the nearby school. This probably explains why there is a peak at 10am, as the Students need more time to travel to school because it is not near their homes.
MRT: Bishan, Bishan CCL, Botanic Gardens, Buona Vista, Buona Vista-CCL, Clementi, Dover, Farrer Road, Joo Koon, Lorong Chuan, Marymount, Potong Pasir, Tanah Merah, Yio Chu Kang
LRT: Phoenix, Ranggung, Teck Whye

Exit Analysis Graphs Description

There are a total of 12 clusters for the analysis of Student exit time-stamp. Refer to Appendix 15 for the remaining analysis on clusters 2, 4, 5, 6, 8, 9, 10, 11 and 12.

Cluster 1
The stations in this cluster are in residential areas outside the main heartland areas. There is a morning peak and a mid-day peak. The first peak suggests Students travelling to these stations to get to school. The mid-day peak suggests that Students are heading back home after spending the morning out.

MRT: Admiralty, Buangkok, Chinese Garden, Eunos, Holland Village, Hougang, Kallang, Kembangan, Marsiling, Pasir Panjang, Pioneer, Sembawang, Woodleigh, Yew Tee LRT: Bakau, Fajar, Fernvale, Kangkar, Petir, Renjong, Rumbia, Segar, Tongkang

Cluster 3
Cluster 3 has a mid-day peak. Most of the stations in the cluster are in retail and recreational areas. For example, Dhoby Ghaut and Somerset are home to two major youth-oriented retail malls with cineplexes. The very strong Saturday/weekend peak further supports this. The peak reflects the behaviour of Students who tend to wake up late to enjoy the holidays.

MRT: Bras Basah, Bugis, Changi Airport, Chinatown, City Hall, Clarke Quay, Dhoby Ghaut NEL, Esplanade, Expo, Farrer Park, Harbour Front, HarbourFront-CCL, Haw Par Villa, Kranji, Little India, Marina Bay, Orchard, Paya Lebar, Promenade, Somerset

Cluster 7
Cluster 7 displays a dominant mid-day peak. The stations in cluster 7 are mainly in the middle of major neighbourhoods. This suggests that Students are going back home after spending the morning in school. It could also suggest that Students are going back to school in the morning during the school holiday.

MRT: Ang Mo Kio, Bedok, Boon Keng, Boon Lay, Bukit Gombak, Jurong East, Kovan, Newton, Outram Park NEL, Pasir Ris, Queenstown, Serangoon, Serangoon CCL, Simei, Tampines, Tiong Bahru, Toa Payoh, Woodlands, Yishun LRT: Bukit Panjang, Choa Chu Kang BPLRT, Keat Hong, Sengkang, Senja, South View

DISCUSSION To ease the overcrowding that leaves commuters unable to board the trains, LTA introduced a new initiative to encourage the morning crowd to board the trains earlier. Introduced in 2013 for 16 stations¹ in the city area (these stations are mainly in the commercial and retail areas of Singapore), the initiative lets commuters who start their journey outside these 16 stations and reach any of them before 7.45am travel for free. Those who reach between 7.45am and 8am get SGD 0.50 off their fare. This started as a one-year trial and, in 2014, was extended for another year² to include two more stations. To further encourage commuters to take up this initiative, LTA partnered with Food and Beverage companies to offer discount coupons to those who qualify for the free rides³. LTA also encouraged organisations around these stations to allow employees to start work early.
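The fare rule described in this paragraph can be written down as a small function. This is only a sketch of our reading of the scheme, not LTA's fare engine: the station set is a hypothetical subset (the full list of eligible stations is not reproduced here), the function name and the sample fare are illustrative, and we assume the discount window runs from 7.45am up to, but not including, 8am.

```python
from datetime import time

# Hypothetical subset of the eligible city stations (assumption for illustration).
CITY_STATIONS = {"Raffles Place", "City Hall", "Orchard"}

FREE_BEFORE = time(7, 45)      # exits before 7.45am are free
DISCOUNT_BEFORE = time(8, 0)   # exits from 7.45am until 8am get a discount
DISCOUNT_SGD = 0.50


def morning_fare(origin, destination, exit_time, full_fare):
    """Return the fare payable under the assumed reading of the scheme."""
    # The trip must end at a city station and must not start at one.
    eligible = destination in CITY_STATIONS and origin not in CITY_STATIONS
    if not eligible:
        return full_fare
    if exit_time < FREE_BEFORE:
        return 0.0
    if exit_time < DISCOUNT_BEFORE:
        # Never let the discounted fare go below zero.
        return max(full_fare - DISCOUNT_SGD, 0.0)
    return full_fare


if __name__ == "__main__":
    for t in (time(7, 30), time(7, 50), time(8, 5)):
        print(t, morning_fare("Woodlands", "Raffles Place", t, 1.80))
```

Note that a trip ending at a station outside the set, such as Tai Seng, always pays the full fare under this rule, which is the gap the discussion goes on to examine.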

However, is this policy useful in easing commuter congestion? To investigate this, we analyse each commuter category's travel patterns to determine if they are benefiting from the policy. First, however, it is important to note these points:

• The free ride is only for morning travel
• The free ride is only for travel towards city stations
• Most adopters of this policy are government entities
• It is probably the only known policy used to ease commuter congestion

ADULT As discovered by Lee (2013), there is a very strong evening peak for stations in the city area. The stations identified by Lee (2013) include the 16 stations. When we cross-referenced this with our own analysis, looking at the clusters that contain Raffles Place station for both entry and exit for Adults, we saw an (almost) equal demand for exits in the morning and entries in the evening. Secondly, Lee (2013) identified that the evening peak is at 6pm.

The policy in place is timely and accurate for the morning peak for stations in town. Encouraging people to travel earlier should reduce the morning congestion and possibly reduce breakdowns in the morning (if a breakdown occurs on any line, all the trains, regardless of their distance from the breakdown, will be affected). However, as seen from the entry clusters for the Adults, the problem is not with disembarking from the trains, but with boarding the trains to head home after work in the evening. We have experienced this first-hand many times when attempting to board trains home during the evening rush hour. The platform was filled to the brim and it was impossible to board the train. Most of the time, 4-5 trains would pass before we were able to board. Even when we got to board the train, commuters were squeezed like sardines. Currently, to help reduce the evening congestion, LTA has purchased more trains to serve the increasing needs during peak and non-peak hours⁴ while increasing the train frequency. LTA could introduce a policy similar to the one for the morning peak, but catering to the evening peak. However, it is hard to convince organisations that employees should leave work by a certain time so that they can enjoy free rides home, even though this would help ease the evening congestion.

As seen from Cluster 2 (Exit), there is a very strong exit peak in the morning. However, not all the stations are in the city area; some are surrounded by industrial/commercial areas, e.g. Tai Seng. Tai Seng is in the same cluster as stations in the city such as Raffles Place. Should the policy of free train rides in the morning also extend to those who work around these stations? It might be a good idea, as workplaces in these commercial areas also start at specific timings. If half of the commuters had an incentive to start work earlier, we believe that it should reduce the congestion. The same applies to Cluster 3 (Exit), where there is a strong morning and evening peak. Reducing the morning traffic of people heading towards the stations in Cluster 3 (Exit) might help reduce the possibility of two groups of people overcrowding the platform at one time: one group exiting the station and one group boarding the trains. Importantly, the policy introduced is not in line with the demand; in this case, there is high demand for exits at stations outside the city, yet there is no free ride initiative for stations outside the city.

Perhaps, to cater to the high demand during peak hours, train operators might deploy carriages that do not have any seats to allow more commuters to board the trains. On top of increasing the frequency of trains, LTA could also add more carriages to the trains during peak hours. As train platforms cater to only 6 carriages (for the North South Line and East West Line), the extra carriage could be 'reserved' for people who are travelling longer distances (i.e. from the city to the northern part of Singapore, or people who are travelling twenty stations away from the current one).

Another alternative is for LTA to have express bus services that run from the city (or from stations that have very high evening entry peaks) to the heartlands in the evening. For example, SMRT operates an express unidirectional premium bus service, 590, during the morning peak hour that brings commuters from Choa Chu Kang directly to Shenton Way with an operating frequency of 10 minutes during 7.40am-7.50am⁵. While in Choa Chu Kang, service 590 calls at only a few stops before entering the expressway to head to the city area. However, there is no similar service in the evening. This is quite an interesting phenomenon. Surely, commuters who head to the city area from the heartlands would want to head back home after work. There should be a service that brings the heartlanders back home after a day in the city. Furthermore, a return service modelled on bus 590 would bring commuters directly to specific housing estates in the heartlands. Such a service would be beneficial for commuters and would reduce the congestion of the train services from the city area. Based on Cluster 1 (Exit), there is also a strong evening exit peak for stations in the heartlands, so such a service would be useful in reducing the evening peak demand.

Lastly, for Cluster 6 (Entry), most of the stations in that cluster are located in the north-western part of Singapore. It is striking that the stations in that part of Singapore have their own cluster. Furthermore, to add to the confusion, most of the stations on the Bukit Panjang LRT Line are in the cluster. It is worth exploring the reason for the similarity of travel patterns among residents served by that cluster.

Figure 5: This is the original cluster

Figure 6: Map Location for LRT Services in Cluster 6

We first overlay the cluster stations on a map in Figure 6. We can see that the stations Senja, Jelapang, Segar, Fajar and Bangkit are located on a major circular road. It is also observed that there are three stations not in the cluster: Pending, Bukit Panjang and South View LRT stations.

Figure 7: We further identified two distinct groups in the cluster

In Figure 7, we see that there are two distinct groups in this cluster. One has a very high volume peak, while the other has a fairly low peak with a sudden surge only in the morning. To further understand this phenomenon, we split the LRT and MRT stations in Cluster 6 and re-run the time-series data mining analysis.
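As a rough illustration of the split, the sketch below separates tap-in records into the LRT and MRT groups using the Cluster 6 station lists given in the captions of Figures 8 and 9, then builds an hourly entry profile for each group. It is not the SAS Enterprise Miner workflow used in this project; the record layout (station name, hour of day), the function names and the sample records are assumptions made purely for illustration.

```python
from collections import Counter

# Cluster 6 station lists as given in the captions of Figures 8 and 9.
LRT_STATIONS = {"Bangkit", "Fajar", "Jelapang", "Keat Hong", "Petir",
                "Phoenix", "Segar", "Senja", "Teck Whye"}
MRT_STATIONS = {"Bukit Batok", "Choa Chu Kang", "Kranji", "Marsiling",
                "Pioneer", "Woodlands", "Yew Tee"}


def hourly_profiles(records):
    """Count entries per hour for each group.

    records: iterable of (station, hour) pairs, an assumed layout.
    Stations outside Cluster 6 are ignored.
    """
    profiles = {"LRT": Counter(), "MRT": Counter()}
    for station, hour in records:
        if station in LRT_STATIONS:
            profiles["LRT"][hour] += 1
        elif station in MRT_STATIONS:
            profiles["MRT"][hour] += 1
    return profiles


def peak_hour(profile):
    """Return the hour with the highest entry count in a profile."""
    return max(profile, key=profile.get)


if __name__ == "__main__":
    # Made-up records, for illustration only.
    sample = [("Fajar", 8), ("Fajar", 8), ("Senja", 7),
              ("Woodlands", 6), ("Woodlands", 6), ("Kranji", 18),
              ("Orchard", 9)]
    profiles = hourly_profiles(sample)
    print("LRT peak hour:", peak_hour(profiles["LRT"]))
    print("MRT peak hour:", peak_hour(profiles["MRT"]))
```

Each group's profile can then be fed separately into whichever time-series clustering tool is used.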

Figure 8: Cluster 6 LRT Stations- Bangkit, Fajar, Jelapang, Keat Hong, Petir, Phoenix, Segar, Senja, Teck Whye. The peak is at 8am.

Figure 9 - Cluster 6 MRT Stations: Bukit Batok, Choa Chu Kang, Kranji, Marsiling, Pioneer, Woodlands, Yew Tee

From Figure 8, we can see that there is less variation between the LRT services and MRT services in Cluster 6. Could it be due to inadequate bus services in that area? To investigate this, we picked the closest bus stops to these four LRT stations: Fajar, Senja, Jelapang and Bangkit. Only four were picked due to the lack of time. To understand why Pending station does not appear in that cluster, we also analysed Pending station.

Station     Bus Stop    Bus Services Selected
Fajar       44771       971E, 922
Senja       44799       920, 555
Jelapang    44721       BPS1
Bangkit     44329       920, 922, 971E
Pending     44221       171, 184, 700, 700A, 960, 963, 963E, 963R, 966

Out of the 5 LRT stations, only Pending has a wider variety of buses. The buses that serve the stop closest to Pending station are long-distance buses: 960 travels to Marina Bay and 700 travels to Shenton Way. The other stations only have buses that travel to Bukit Panjang station: 922⁶ and 920⁷ travel around Bukit Panjang. We can see that the residents who live near these stations do not have many options when travelling by public transport. The buses that serve their stops do not bring them to meaningful places such as major MRT stations (the nearest being Choa Chu Kang MRT station, which is then connected to the stations in the city). This is further supported by the fact that Pending, which has the most connected bus stop, is not in the same cluster as the other Bukit Panjang LRT stations. To make things worse, the express or premium services such as 555⁸ and 971E⁹ operate only one service in the morning before 8am. It is possible that the residents of Bukit Panjang travel to Pending station to board the long-distance buses. If this is the case, it might be testament to a lack of bus services for residents who are heading towards other areas, forcing them to spend more time taking the LRT to stops that have better bus connections or to major MRT stations.
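The comparison of bus connectivity can be reproduced directly from the table of bus stops. The sketch below simply counts the services listed for each station and checks whether Pending shares any service with the other four; the variable names are our own, and the data are the services selected in the table, not a complete list of routes.

```python
# Bus services at the stop closest to each LRT station (from the table).
BUS_SERVICES = {
    "Fajar": ["971E", "922"],
    "Senja": ["920", "555"],
    "Jelapang": ["BPS1"],
    "Bangkit": ["920", "922", "971E"],
    "Pending": ["171", "184", "700", "700A", "960", "963", "963E", "963R", "966"],
}

# Number of services per station.
service_counts = {station: len(services)
                  for station, services in BUS_SERVICES.items()}

# Station with the most services.
best_connected = max(service_counts, key=service_counts.get)

# Distinct services across the four stations other than Pending.
other_services = set()
for station, services in BUS_SERVICES.items():
    if station != "Pending":
        other_services.update(services)

# Services available at Pending but at none of the other four stops.
pending_only = set(BUS_SERVICES["Pending"]) - other_services

if __name__ == "__main__":
    print(service_counts)
    print("Best connected:", best_connected)
    print("Distinct services at the other four stops:", len(other_services))
    print("Services only at Pending:", sorted(pending_only))
```

On these figures, Pending has nine services and shares none of them with the other four stops, which between them have five distinct services.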

SENIOR CITIZEN

Figure 5: Comparing the Travel Patterns of the Adult and Senior Citizen Commuter Groups for Clusters in Retail Areas

In Cluster 4 (Entry) there is a strong evening peak. The stations in Cluster 4 are in retail areas. The peak is rather similar to Cluster 3 (Entry) for Adults. We decided to re-plot the graphs for Adult Cluster 3 against Senior Citizen Cluster 4 to determine if they share a similar pattern. Based on Figure 5, we can see that there is a very similar pattern for most of the stations in the retail areas. It is as though the two commuter groups overlap each other. This suggests that the elderly who board at these stations have similar travel patterns to the Adults. The demands for entry in these two clusters are very close, which suggests that there are as many Senior Citizens boarding as Adults. Now we not only have Adults struggling to board the trains, we also have Senior Citizens competing to board the trains home. The suggestions to address this problem are similar to those for Adults: buses that head directly into the estates in the heartlands, and extending the free train rides to include evening rides. We can go a step further by reserving a carriage for Senior Citizens during the peak period. This would reduce the occurrences where commuters do not give up their seats to Senior Citizens. If possible, it would be good for that special carriage to have more seats for the elderly.
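Alongside the visual comparison of the two curves, one simple complementary check is how far the two clusters agree on which stations they contain. The sketch below compares the station lists of Adult Cluster 3 (Entry) and Senior Citizen Cluster 4 (Entry) as listed in this report, using the Jaccard index (shared stations divided by all distinct stations). This is our own illustrative measure, not part of the time-series similarity analysis, and it says nothing about passenger volumes.

```python
# Station lists as given in this report (MRT and LRT combined).
ADULT_CLUSTER_3_ENTRY = {
    "Bras Basah", "Bugis", "Chinatown", "City Hall", "Clarke Quay",
    "Dhoby Ghaut", "Dhoby Ghaut NEL", "Esplanade", "Expo", "Harbour Front",
    "HarbourFront-CCL", "Little India", "Marina Bay", "Orchard",
    "Promenade", "Somerset", "Farmway",
}
SENIOR_CLUSTER_4_ENTRY = {
    "Bras Basah", "Bugis", "Changi Airport", "Chinatown", "City Hall",
    "Clarke Quay", "Dhoby Ghaut", "Dhoby Ghaut NEL", "Esplanade", "Expo",
    "Farrer Park", "Harbour Front", "HarbourFront-CCL", "Marina Bay",
    "Orchard", "Promenade", "Somerset", "Farmway",
}


def jaccard(a, b):
    """Share of distinct stations that appear in both sets."""
    return len(a & b) / len(a | b)


shared = ADULT_CLUSTER_3_ENTRY & SENIOR_CLUSTER_4_ENTRY
adult_only = sorted(ADULT_CLUSTER_3_ENTRY - SENIOR_CLUSTER_4_ENTRY)
senior_only = sorted(SENIOR_CLUSTER_4_ENTRY - ADULT_CLUSTER_3_ENTRY)

if __name__ == "__main__":
    print("Shared stations:", len(shared))
    print("Jaccard index:", round(jaccard(ADULT_CLUSTER_3_ENTRY,
                                          SENIOR_CLUSTER_4_ENTRY), 2))
    print("Adult only:", adult_only)
    print("Senior Citizen only:", senior_only)
```

The two lists share 16 of 19 distinct stations, which is consistent with the overlap seen in the re-plotted graphs.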

STUDENT Cluster 3 (Entry) suggests that during the school holidays, some Students take up part-time jobs. This would increase the traffic for the stations in the commercial areas.

In short: for the first week of November 2011, there is an overlapping demand from the three commuter groups for these stations:

• Stations in commercial areas
• Stations that are close to both residential and industrial/commercial areas

However, it seems that the policies in place are not in line with the demand for train services. LTA could extend the free train ride initiative to stations that have similar demand to those in the city, and possibly introduce a similar free ride initiative in the evenings. In addition, to reduce the congestion, LTA could get commuters to take a special bus service, similar to 590, that does not serve every stop along its route in the city and then heads directly into different estates in the heartlands. More radical suggestions include increasing the number of carriages per train, or having dedicated carriages for people who board in the city area and are heading to stations far away.

It is important that LTA take specific steps to reduce the congestion. As seen from the analysis above, there is an overlap in travel patterns for Adults and Senior Citizens, especially for trips that originate from retail areas.

FUTURE WORK From this Analytics Practicum project, we were able to reveal commuters' travel patterns through the use of time-series data mining techniques. This gave us great insights and meaningful discoveries about train travel on Singapore's public transport. However, there are some parts that we did not cover in our project, and these could be potential areas for future work.

Some of the areas would include:

The dataset is currently 4 years old. From the time of the dataset to the time of this paper, there have been numerous breakdowns, some more drastic than others; the population of Singapore has increased; more buses have been introduced on the roads; new train lines have been introduced¹⁰; and new stations have opened on existing lines¹¹. Surely the travel demands and patterns of commuters have changed. Having recent data would allow us, first and foremost, to determine if the free morning ride initiative is effective, as this initiative was started after this dataset was collected. Secondly, it would allow us to get a better understanding of travel patterns, especially now that breakdowns have increased, and of recent trends in public commuter travel.

Determine the travel patterns of users who travel short distances. For example, for users who board at Woodlands station to travel 2-3 stops away, we could analyse why they do not take the bus instead, considering that Woodlands is both an MRT station and a bus interchange. Could it be because the buses that serve the stations around Woodlands do not serve the needs of the passengers well enough? What other factors can we look at? This would allow us to take a deeper look into the interaction between commuters' travel patterns.

As November 2011 falls during the school holidays, the ridership of the Student commuter group is low and does not provide a meaningful analysis of the travel patterns of Students. Even so, where it overlaps the travel patterns of the other commuter groups, it affects the decision making of LTA.

Based on the results obtained, we could extend our analysis to forecasting or predicting commuters' travel patterns for the newer stations that are being built for the upcoming year(s), and compare and analyse the difference between current and predicted values.

By using prescriptive analysis, we would be able to come up with different models on how to improve the SMRT services, such as addressing traffic congestion problems or mitigating future risk.

As the dataset attributes are comprehensive, it would be useful to understand the end-to-end travel patterns of commuters. For example, how long does the average Adult from Woodlands take to travel to work? Is there a correlation between the distance travelled and the station boarded? Which day of the week is popular for train rides? Does the location of the stop boarded affect the choice of commuter? This would allow us to further explore the correlation using statistical methods such as multiple linear regression.

Lastly, with comprehensive attributes, we could use the data to detect fraud on Singapore's public transport.
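One of the questions above, which day of the week is most popular for train rides, needs nothing more than a count of rides per weekday. The sketch below shows the idea under stated assumptions: each ride is reduced to its travel date, the function name is our own, and the sample dates are made up (they fall in the first week of November 2011 only to match the period of the dataset).

```python
from collections import Counter
from datetime import date

# Fixed English day names, indexed by date.weekday() (Monday is 0).
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def rides_per_weekday(ride_dates):
    """Count rides by day of week for an iterable of date objects."""
    counts = Counter()
    for d in ride_dates:
        counts[DAY_NAMES[d.weekday()]] += 1
    return counts


if __name__ == "__main__":
    # Made-up ride dates, for illustration only.
    sample = [date(2011, 11, 1), date(2011, 11, 1),
              date(2011, 11, 4), date(2011, 11, 4), date(2011, 11, 4),
              date(2011, 11, 6)]
    counts = rides_per_weekday(sample)
    busiest_day, rides = counts.most_common(1)[0]
    print(counts)
    print("Busiest day:", busiest_day, "with", rides, "rides")
```

The same counts, broken down further by commuter group or boarding station, could serve as inputs to the regression analysis suggested above.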

LEARNING EXPERIENCE This project gave us the opportunity to experience working with big data. As the original dataset contains approximately 33 million rows of data, it gave us a chance to explore voluminous data using analytics and to understand the LTA dataset well enough to come up with an analysis of the travel patterns of commuters.

Firstly, the data was given to us quite late, at the end of week six. Time management was therefore very important to ensure that we were able to complete the project on time and still produce quality work for submission.

Secondly, as the dataset contains transaction data with varying time-stamps, we were able to discover and learn new algorithms and techniques used in time-series data mining. During data preparation, we got a chance to explore new software, JMP Pro 11, to clean the data and perform descriptive analysis. When exploring time-series data mining, we were introduced to SAS Enterprise Miner to perform the analysis. This allowed us to explore new software and techniques; although we could not fully understand the concepts behind the remaining tools that were not used in our project, this phase still provided a discovery and learning experience.

Since we were exploring new techniques and software that were not taught during curriculum time, readings were provided to enhance our understanding. However, when dealing with the data, we often used trial and error to arrive at our findings. We also used Google to look through tutorials and user guides to validate that our findings were appropriate and accurate.

Lastly, we learnt to interpret the graphs of our analysis and to come up with a better discussion for the project. We were told by our supervisor not just to show whatever is in the graph, but to draw insights from the time-series graphs by finding interesting or unusual patterns and to come up with a story to describe our analysis. Furthermore, he told us that we could use our findings to validate the effectiveness of the current free ride policy.

CONCLUSION In conclusion, we hope that the work performed can bring about much needed change to the public transport system in Singapore. Through time-series data mining, we hope to have convinced LTA of ways to improve the system through identifying the bottlenecks, such as the need to address very strong morning peaks in stations that are not in the list of stations eligible for free rides. It is not a matter of fairness, but rather of giving all passengers a more pleasant journey. It is not sufficient to factor in only stations in the City, as not everyone works in the City. As more commercial estates are being developed in Singapore, it is timely to review the free ride policy to include more of these stations.

LTA should consider improvements to the train services in Singapore, such as adding more carriages. For example, Tai Seng is in the same cluster as Raffles Place. This suggests that the travel patterns for Tai Seng and Raffles Place are similar, if not the same. However, Tai Seng, a station on the Circle Line, uses trains that have only three carriages. While the Circle Line was designed to be of medium capacity, as opposed to the other major lines in Singapore such as the East-West Line or North-South Line, it serves heavy passenger traffic stations such as Tai Seng, as seen in Adult Cluster 2. LTA could have planned for the future by providing ample carriages instead of just meeting the anticipated capacity. In addition, LTA can explore the possibility of providing carriages for specific purposes, such as carriages for the elderly and carriages reserved for passengers going long distances.

It is important to segregate the different commuter types to perform an in-depth analysis that will help point out the peculiarities of each commuter type's travel patterns. While the Adult commuter type constitutes the largest share of passenger ridership, dwarfing the other commuter groups, segregation helps to make sense of the travel patterns of the different commuter groups and how they travel, allowing LTA to provide more benefits to commuters.

ACKNOWLEDGEMENT We are grateful to have been able to complete the Analytics Practicum project within the short time frame given by our supervisor, Dr. Kam Tin Seong. We would also like to thank Dr. Kam for giving us the opportunity to work on this project and for providing us with guidance through the course of the practicum.

APPENDIX

APPENDIX 1: RAW DATASET FOR LTA_RIDE

APPENDIX 2: RAW DATASET FOR LOCATION_MAPPING

APPENDIX 3: RECODED TIME-STAMP FORMAT

APPENDIX 4: MAPPING OF LOCATION ID TO EXTRACT LOCATION NAME

APPENDIX 5: MAPPING OF LOCATION ID TO EXTRACT LOCATION NAME

APPENDIX 6: DATASET VARIABLE PROPERTIES

APPENDIX 7: TSDP NODE RESULT

APPENDIX 8: VC NODE RESULT

APPENDIX 9: TSS NODE RESULT

APPENDIX 10: ADULT ENTRY ANALYSIS
Graphs Description

Cluster 1
Cluster 1 shows a high and dense morning peak with a lower peak in the evening. The stations in this cluster are in the heartlands. The first peak suggests that the heartlanders are heading to work. The second peak suggests workers who work around these stations heading back home. As the heartlands do not have as much commercial activity compared to cluster two, it is to be expected that the second peak is lower than the first.
MRT: Admiralty, Buangkok, Bukit Gombak, Chinese Garden, Hougang, Khatib, Lakeside, Pasir Ris, Punggol, Sembawang, Simei, Tampines, Yishun
LRT: Bakau, Compassvale, Coral Edge, Cove, Damai, Fernvale, Kadaloor, Kangkar, Layar, Meridian, Oasis, Ranggung, Renjong, Riviera, Rumbia, Sengkang, South View, Tongkang

Cluster 3
Cluster 3 has two strong evening peaks, with the second one weaker and later into the night. The stations in this cluster are mainly in the commercial and retail areas. The first peak suggests that the workers who work in these areas during office hours are heading back home. The second peak suggests that people who work in the retail shops are heading back home. This second peak is sometimes stronger than the first peak. This can be explained by the commercial areas being closed by then; the peak therefore comes from the retail staff going home, and also from people shopping in this area heading back home.
MRT: Bras Basah, Bugis, Chinatown, City Hall, Clarke Quay, Dhoby Ghaut, Dhoby Ghaut NEL, Esplanade, Expo, Harbour Front, HarbourFront-CCL, Little India, Marina Bay, Orchard, Promenade, Somerset
LRT: Farmway

Cluster 4
Cluster 4 has a strong morning peak, with a lesser evening peak. The stations in this cluster are mainly in the heartlands. This suggests that the Adults living in the heartlands are heading to work. The second peak suggests that the people who work in the heartlands are heading back home.

MRT: Aljunied, Ang Mo Kio, Bartley, Bedok, Bishan, Bishan CCL, Boon Keng, Boon Lay, Braddell, Bukit Panjang, Commonwealth, Dakota, Eunos, Farrer Road, Kallang, Kembangan, Kovan, Lorong Chuan, Mountbatten, Paya Lebar CCL, Potong Pasir, Serangoon, Serangoon CCL, Tanah Merah, Telok Blangah, Tiong Bahru, Toa Payoh, Woodleigh

LRT: Thanggam

Cluster 7
Cluster 7 has two peaks, one in the morning and one in the evening. The stations in cluster 7 are mainly located around leisure areas. This suggests that such leisure places receive Adults in the morning and in the evening. The morning peak could suggest that the Adults are heading to work after enjoying the facilities available around the stations. The evening peak suggests that they are heading back home.
MRT: Changi Airport, Farrer Park, Holland Village, Stadium

Cluster 8
Cluster 8 is an outlier, used as testing data, and is not part of any operating MRT or LRT station. This station is not operational as its surroundings are only a cemetery and a mosque.

Bukit Brown

APPENDIX 11: ADULT EXIT ANALYSIS
Graphs Description

Cluster 4
Cluster 4 consists of stations around retail areas. These retail areas contain skyscrapers that are divided into two main parts: the lower floors serve as retail areas while the upper floors serve as office space. The morning peak describes people who are heading to work, while the second peak describes the end-of-day crowd who head to these areas for retail or leisure.
MRT: Bugis, Chinatown, City Hall, Dhoby Ghaut, Esplanade, Little India, Orchard, Somerset

Cluster 5
Cluster 5 has two peaks of about the same pattern, where the morning peak and evening peak are almost similar in height. This is attributed to the location of the stations in the heartlands that are close to industrial estates. The morning peak suggests the industrial estate workers who are heading to work, while the evening peak suggests the heartlanders heading back home. The volume of passengers is lower as these stations are not in the main heartland estates, but rather the sub-heartland stations.
MRT: Bartley, Clementi, Commonwealth, Lavender, Lorong Chuan, Marymount, Paya Lebar, Queenstown, Redhill
LRT: Bukit Panjang, Pending, Phoenix

Cluster 6
Cluster 6 has a distinct early morning peak. This is because the stations in cluster 6 mainly serve non-town-based commercial and industrial estates or office centres, such as Yio Chu Kang, which serves Apple Computers and ST Electronics. Cluster 6 also has a low but distinct evening peak. This peak is attributed to people who live near these stations. Housing estates near these stations are mainly private property such as Castle Green and Goldhill Centre. These stations are also small towns in or around bigger towns, i.e. Yio Chu Kang is a small subset of Ang Mo Kio (a major housing estate), and Joo Koon is a small town in Jurong West. In between the two peaks, these stations see lower, consistent passenger traffic.
MRT: Buona Vista, Dover, Joo Koon, Jurong East, Novena, Outram Park, Pasir Panjang, Yio Chu Kang

Cluster 7
Cluster 7 has two main peaks, one in the morning around 9am and another at mid-day. The first peak can be attributed to the offices around these stations, such as Bank of America. The second peak can be attributed to the people who work near these stations and alight at these stops for lunch at Seah Imm Food Centre and the posh eateries around Holland Village.
MRT: Farrer Park, HarbourFront-CCL, Holland Village
LRT: Riviera

Cluster 8
Cluster 8 has a single peak around 6pm and a second peak around midnight. These are LRT stations around housing estates, where workers coming home after work account for the first peak. The second peak around midnight is attributed to the higher/fixed frequency that LRTs have compared to feeder services later in the night. There is also a small peak in the morning. This is because stations in this cluster are around markets and housing estate "retail centres" where Adults head for breakfast or marketing.
MRT: Buangkok
LRT: Bakau, Compassvale, Coral Edge, Cove, Damai, Fajar, Fernvale, Jelapang, Kadaloor, Kangkar, Keat Hong, Meridian, Oasis, Ranggung, Renjong, Rumbia, Segar, South View, Tongkang

Cluster 9
Cluster 9 is distinct for its evening peaks. This is due to the proximity of leisure and entertainment areas to the stations. This is further supported by the higher than usual evening peak on Friday and the mid-day peak on Saturday, when commuters are not working and head towards these stations after waking up late to patronise the leisure and entertainment areas.

MRT: Dhoby Ghaut NEL, Stadium
LRT: Bangkit

Cluster 10
Cluster 10 comprises stations that serve industrial estates. Commuters who disembark at these stations proceed to take feeder services to industrial estates such as Yew Tee Industrial Estate, Sungei Kadut Industrial Estate, Tengah Industrial Estate, and the factories along Corporation Road.

The second peak describes commuters who are heading back home. While these stations mainly serve industrial estates, they are also close to housing estates (with the exception of Kranji; Kranji is, however, the stop where Malaysian workers alight to board buses CW3, 170 and 160 towards Malaysia).

MRT: Boon Lay, Kranji, Woodleigh
LRT: Choa Chu Kang BPLRT

Cluster 11
Changi Airport: The first peak for Changi Airport describes people who are on the way to catch a morning flight. Commuters take the train because of the additional SGD 3 surcharge that applies on top of the morning peak surcharge, presumably for taxi rides.

Botanic Gardens: The first peak is due to people who work in the vicinity of Adam Road and Bukit Timah, such as staff of the shophouses and retail centres along Adam Road (for example the Japanese Association) and along Bukit Timah Road (for example Coronation Plaza, Serene Centre, the French Embassy, Hwa Chong Institution and National Junior College). The second peak describes office workers who are heading to Adam Road Food Centre for lunch. The last peak is due to people who are going back home after a day's work. On Saturday the peaks are similar, due to the Food Court and also people visiting the Botanic Gardens.

MRT: Botanic Gardens, Changi Airport
LRT: Farmway

APPENDIX 12: SENIOR CITIZEN ENTRY ANALYSIS

Graphs Description

Cluster 1
Cluster 1 shows a very dominant morning peak. The stations in this cluster are mainly stations in the heartland. This suggests that the elderly are heading to work.

MRT: Admiralty, Ang Mo Kio, Bedok, Bukit Batok, Bukit Gombak, Chinese Garden, Choa Chu Kang, Khatib, Lakeside, Marsiling, Pioneer, Sembawang, Tampines, Woodlands, Yew Tee, Yio Chu Kang, Yishun
LRT: Keat Hong, Layar, Petir, Renjong, Segar, Teck Whye

Cluster 3
Cluster 3 has heavy traffic throughout the day. The stations in cluster 3 are located within older housing estates. This suggests that no distinct pattern describes the travel of the elderly at these stations. However, it shows that a large number of elderly people live around these stations.

MRT: Bartley, Bishan CCL, Boon Keng, Boon Lay, Botanic Gardens, Buona Vista, Buona Vista-CCL, Clementi, Commonwealth, Dover, Eunos, Farrer Road, Kallang, Kembangan, Kovan, Lorong Chuan, Marymount, Mountbatten, Potong Pasir, Queenstown, Redhill, Serangoon, Serangoon CCL, Simei, Tanah Merah, Telok Blangah, Tiong Bahru, Toa Payoh, Woodleigh
LRT: Ranggung, Sengkang

Cluster 6
Cluster 6 has a strong peak in the late morning, which then tapers off. The stations in this cluster are in fairly new housing estates. This suggests that Senior Citizens who live around these stations are not working. The time at which they board the train suggests that they are heading to leisure activities such as meeting friends.

MRT: Bishan, Braddell, Buangkok, Hougang, Pasir Ris, Punggol
LRT: Bakau, Bangkit, Compassvale, Coral Edge, Cove, Damai, Fajar, Fernvale, Jelapang, Kadaloor, Kangkar, Meridian, Oasis, Rumbia, Senja, Tongkang

Cluster 9
Cluster 9 has a strong dominant peak in the evening and, on some days, at the end of the day. As the dominant station is Kranji MRT, this suggests elderly commuters who are heading back home after a day visiting Malaysia. The stronger peak during the weekend suggests that the elderly are heading back home after visiting the horse racing track.

MRT: Kranji, Stadium

APPENDIX 13: SENIOR CITIZEN EXIT ANALYSIS

Graphs Description

Cluster 1
Cluster 1 shows a dominant evening peak. Stations in cluster 1 are mainly stations in housing estates. The peak suggests that Senior Citizens are heading back home after a day of working or dealing with religious affairs.

MRT: Admiralty, Ang Mo Kio, Bedok, Bishan, Bishan CCL, Braddell, Buangkok, Bukit Batok, Bukit Gombak, Choa Chu Kang, Dakota, Hougang, Kembangan, Khatib, Lorong Chuan, Pasir Ris, Potong Pasir, Punggol, Sembawang, Serangoon, Serangoon CCL, Simei, Tampines, Telok Blangah, Woodlands, Yew Tee, Yishun
LRT: Bakau, Compassvale, Cove, Fernvale, Kadaloor, Keat Hong, Meridian, Renjong, Riviera, Rumbia, Segar, Sengkang, Tongkang

Cluster 3
Cluster 3 has a strong morning peak and a weaker, but still noticeable, peak in the evening. The stations in this cluster have a sizeable number of offices and commercial buildings nearby. The morning peak could indicate Senior Citizens who are heading to work, such as the aunties and uncles working in the food and beverage industry at outlets such as Mr Bean, a bean curd company. The second peak suggests Senior Citizens who are going to work as cleaning crew to clean up the offices at the end of the day.

MRT: Buona Vista-CCL, Caldecott, Clarke Quay, Dhoby Ghaut, Labrador Park, MacPherson, Newton, Promenade, Raffles Place, Redhill, , Tanjong Pagar
LRT: Farmway

Cluster 4
Cluster 4 displays a strong evening trend. The most obvious example in the chart is Tiong Bahru. Tiong Bahru is one of the oldest housing estates, which explains why more Senior Citizens live there. Furthermore, the other stations in this cluster are also in older estates and just outside business districts such as Tanjong Pagar and Woodlands. This might mean that Senior Citizens who are still working in those districts are heading back home after work.

MRT: Aljunied, Boon Keng, Commonwealth, Dover, Marsiling, Mountbatten, Paya Lebar CCL, Queenstown, Stadium, Tiong Bahru
LRT: Kangkar, Pending

Cluster 5
Cluster 5 shows strong morning and evening peaks. The graph shows that the morning peak starts to decrease on Saturday and decreases further on Sunday.

MRT: Bartley, Boon Lay, Buona Vista, Changi Airport, Harbour Front, Joo Koon, Jurong East, Marina Bay, Pasir Panjang, one-north
LRT: Bukit Panjang

Cluster 7
Cluster 7 displays a strong afternoon-to-evening trend. The most obvious example in the chart is Toa Payoh, which has a larger number of commuters than the other stations. This trend suggests that elderly commuters who are still working are heading home, resulting in a strong evening peak. Furthermore, it can be deduced that the stations in cluster 7 are mainly in older estates with a sizeable number of rental one-room flats.

MRT: Farrer Park, Farrer Road, Holland Village, Kovan, Toa Payoh, Woodleigh
LRT: Coral Edge, Damai, Fajar, Jelapang, Oasis, Petir, Ranggung, Senja, South View

APPENDIX 14: STUDENT ENTRY ANALYSIS

Graphs Description

Cluster 1
Cluster 1 shows a strong morning peak and a second peak in the evening. The stations in the cluster are mainly within housing estates. The first peak suggests Students who are heading to school. The second peak suggests Students who study in schools near these stations heading back home.

MRT: Ang Mo Kio, Bedok, Boon Keng, Boon Lay, Bukit Batok, Choa Chu Kang, Commonwealth, Jurong East, Kovan, Lavender, Mountbatten, Newton, Outram Park, Outram Park NEL, Punggol, Queenstown, Redhill, Serangoon, Serangoon CCL, Simei, Tampines, Toa Payoh
LRT: Choa Chu Kang BPLRT, Sengkang, Thanggam

Cluster 2
Cluster 2 has rising ridership towards the later part of the day, with very little ridership in the morning. The stations in cluster 2 are located in retail areas. As most of the shops around these stations do not open until after 10am, traffic only begins to build up after 10am.

MRT: Bugis, Changi Airport, City Hall, Clarke Quay, Dhoby Ghaut, Dhoby Ghaut NEL, Esplanade, Expo, Harbour Front, HarbourFront-CCL, Marina Bay, Orchard, Promenade, Raffles Place, Somerset

Cluster 4
Cluster 4 has a dominant but sudden mid-day peak and a short but sudden peak in the evening, with relatively low but stable ridership in the other hours. The stations in this cluster are located on the outskirts of a town and have a school within 300 metres of the station. The short yet high-volume peak suggests that Students are boarding the train at one particular time. In this case, as the dataset covers the school holidays, the peak suggests the time at which Students finish their supplementary classes or co-curricular activities and proceed to take the train home.

MRT: Bartley, Braddell, Dakota
LRT: Bangkit, Jelapang, Layar

Cluster 7
Cluster 7 has a dominant morning peak, a smaller mid-day peak and a smaller evening peak. The stations in cluster 7 are mainly in residential areas. They are not major towns but small subsets of a major town, i.e. Yew Tee is a subset of Choa Chu Kang, while Marsiling and Admiralty are subsets of Woodlands. The major town stations have a major bus terminal, e.g. Woodlands Regional Bus Interchange. This suggests that Students from these sub-towns might take the train to the major town stations to take buses, or head to the bigger retail malls in the major towns, for leisure or school.

MRT: Admiralty, Kallang, Marsiling, Pioneer, Sembawang, Yew Tee
LRT: Bakau, Damai, Fajar, Kadaloor, Kangkar, Oasis, Petir, Renjong, Riviera, Rumbia, Segar, South View, Tongkang

Cluster 9
The stations in this cluster have a weak morning peak, a very strong mid-day peak, and a weak evening peak. The stations in this cluster are in residential areas that are close to industrial parks.

MRT: Aljunied, MacPherson, Paya Lebar CCL
LRT: Pending, Bukit Panjang

Cluster 10
Cluster 10 has three main peaks: one in the morning, a large peak in the afternoon and a small peak in the evening. The morning peak suggests Students going to school. The mid-day peak suggests Students who are heading to school for extra activities. The evening peak suggests that Students are heading out for recreational activities. This is normal, as the data covers the school holiday period, when Students have more freedom.

Both stations are in residential areas, with Caldecott located in the midst of private properties. The low peak suggests that Students either do not take the train to school (as there are good schools such as Raffles Institution and Hwa Chong Institution within reach by various bus services) or that residents have other means of transportation.

MRT: Caldecott
LRT: Meridian


Cluster 11
There is a strong morning peak for Chinese Garden. Chinese Garden is surrounded by many condominiums. Although the peak is low in absolute terms, around 30-40 in the morning, it suggests that there are Students still going to school. It also suggests that these Students are going to school some distance away, as Chinese Garden is a residential area; therefore there must be bus services that serve nearby schools.

MRT: Chinese Garden

APPENDIX 15: STUDENT EXIT ANALYSIS

Graphs Description

Cluster 10
Cluster 10 has a single morning peak. Students who alight at stops in this cluster have to take a bus to their school, as the nearest school requires a short bus ride.

MRT: Bishan, Botanic Gardens, Paya Lebar CCL
LRT: Cove, Meridian

REFERENCES
1. B. Agard, C. Morency and M. Trépanier (2009). Mining Smart Card Data from an Urban Transit Network, pp. 1-5.
2. D. Lee, L. Sun and A. Erath (2012). Study of Bus Service Reliability in Singapore Using Fare Card Data.
3. H. Nishiuchi, J. King and T. Todoroki (2012). Spatial-Temporal Daily Frequent Trip Pattern of Public Transport Passenger Using Smart Card Data.
4. (n.d.). Introduction to Data Mining. Jones & Bartlett Learning, LLC, pp. 7-10.
5. K. Nakkeeran, S. Garla and G. Chakraborty (2012). Application of Time-series Clustering using SAS® Enterprise Miner™ for a Retail Chain, Proc. of SAS® Global Forum.
6. H.K. Lu (2007). Network smart card review and analysis. Computer Networks 51, pp. 2234-2248.
7. L.S.K. Sim, E.A.C. Seow and S. Parkasam (2003). Implementation of an Enhanced Integrated Fare System for Singapore. RTS Conference, pp. 1-5.
8. L. Sun, D-H. Lee, A. Erath and X.F. Huang. Using Smart Card Data to Extract Passenger’s Spatio-temporal Density and Train’s Trajectory of MRT System.
9. M.P. Pelletier, M. Trepanier and C. Morency (2011). Smart Card Data Use in Public Transit: A Literature Review, Transportation Research Part C, 19, pp. 557-568.
10. (2011). Payment, clearing and settlement system in Singapore, EMEAP-Red Book, pp. 13-15.
11. P. Esling and C. Agon (2012). Time-Series Data Mining. Institut de Recherche et Coordination, ACM Computing Surveys (45), pp. 1-4.
12. S. Schubert and T.Y. Lee (2011). Time Series Data Mining with SAS Enterprise Miner, Proc. of SAS® Global Forum.
13. M. Shelfer and J.D. Procaccino (2002). Smart card evolution. Communications of the ACM 45 (7), pp. 83-88.

1 The 16 city stations are Bayfront, Bras Basah, Bugis, Chinatown, City Hall, Clarke Quay, Dhoby Ghaut, Esplanade, Lavender, Marina Bay, Orchard, Outram Park, Promenade, Raffles Place, Somerset and Tanjong Pagar. Taken from: https://www.lta.gov.sg/content/dam/ltaweb/corp/PublicationsResearch/files/AnnualReports/1213/LTA%20Annual%20Report%202012-2013.pdf
2 http://transport.asiaone.com/news/general/story/free-early-morning-train-rides-city-area-extended-2015
3 http://www.lta.gov.sg/apps/news/page.aspx?c=2&id=e96f7b3a-67dd-4588-9fd9-d115472cf9b0
4 http://www.lta.gov.sg/apps/news/page.aspx?c=2&id=80d3169f-dc5c-4425-9a3e-e578b1a33042
5 https://publictransportsg.wordpress.com/2013/04/06/premium-590/
6 http://www.transitlink.com.sg/eservice/eguide/service_route.php?service=922
7 http://www.transitlink.com.sg/eservice/eguide/service_route.php?service=920
8 https://publictransportsg.wordpress.com/2013/04/06/premium-555/
9 https://publictransportsg.wordpress.com/2013/05/23/service-971e/
