Analysis of Communication and Transport System, Linköping University, 2017.

Analysis of Communication and Transport Systems 2017 Ticket data analysis

Lars Drageryd, Ivan Postigo, and Tobias Åresten

Abstract

As public transport agencies move towards more intelligent ways of handling tickets and payments, new opportunities to analyze travel patterns open up in parallel. This project highlights one method for analyzing passenger travel data in a smartcard-based ticket system. Data from Blekingetrafiken was used with two versions of an algorithm to estimate travel-flow OD-matrices for October 2016. The advanced algorithm was able to estimate the destination for 64% of the data set, which is similar to the results obtained by Trépanier and Chapleau (2006). The project showed examples of the numerous outputs enabled by the algorithm, such as peak-hour analysis, clustered OD-matrices and transit analysis.

Keywords: Public transports; OD-estimation; Transit analysis; Smartcard data;

1. Introduction

Smartcard Automated Fare Collection Systems (SCAFS) are frequently used in public transport systems all around the globe. The system can be considered convenient and efficient from the perspective of both the user and the operator. One operator that utilizes the system is Blekingetrafiken, responsible for public transport in the southern Swedish county of Blekinge. Travelers register a smartcard, charged with money or a time validation, in a transaction machine upon boarding the public transport vehicle. The primary purpose of the system from the operator's point of view is to have a flexible, efficient and fast way of collecting the fare for the trip (Kurauchi & Schmöcker, 2017). However, SCAFS also bring a secondary benefit by providing the agency with large amounts of individual travel data. This data can be used to analyze, for instance, the utilization of vehicles and the time and frequency of transits. With knowledge of the behavior of public transport users, the service can be planned more efficiently.

A problem while analyzing travel behavior based on SCAFS is that users, mainly due to convenience, only tap their card upon boarding and not when alighting the vehicle. This is also the case with Blekingetrafiken’s system. This results in data that does not provide information on the whole trip, but merely the start of it. Hence, it is easy to locate the origins of the trips but more challenging to estimate their destination.

A trip estimation model developed in 2006 provides one solution to the problem. The model makes several behavioral assumptions: travelers are assumed to always return to their origin by the end of the day, and travelers who transfer between two lines are assumed to want to walk as little as possible. For transferring travelers, the algorithm therefore takes as the alighting station of the previous trip the station nearest to the next boarding (Trépanier & Chapleau, 2006).

This project features data from Blekingetrafiken on trips registered by the smart card system in the county during October 2016. The main scope of the project is to construct an OD-matrix using data from tap-ins and the algorithm developed by Trépanier & Chapleau (2006). With a deeper knowledge of travel patterns, traffic planners and decision makers can be assisted to plan the public transport service more efficiently.

Proceedings of the course TNK103 Analysis of Communication and Transport System, Linköping University, 2017.

1.1. Aim

The primary aim of the paper is to estimate the destination of trips using tap-in data provided by Blekingetrafiken. With the help of two algorithms based on the one described by Trépanier & Chapleau (2006), the project will generate OD-matrices for trips with Blekingetrafiken during October 2016. One of the algorithms is simpler and does not take timetables into consideration, while the other does. The concept of the OD-matrix is further explained in subchapter 1.2. The OD-matrices serve the purpose of enabling a deeper analysis of travel patterns. To visualize the OD-matrix efficiently, the origins and destinations, which represent stations, will be clustered. This shrinks the OD-matrix substantially, making it easier to identify travel patterns. Besides generating the OD-matrices, the project aims to answer the following questions:

• What will the difference be when implementing the timetable in the algorithm?
• What are the biggest challenges to overcome while transferring raw data to results?

1.2. Limitations

Due to privacy concerns of the passengers, the number identifying each smartcard is randomized every day. This means that travel patterns of a unique user cannot be analyzed over any period longer than a single day. Some travelers might behave repetitively on a weekly basis, but this behavior cannot be identified because of the daily randomization. The algorithm presented by Trépanier & Chapleau also features methods to analyze this kind of data, but since the id-numbers of the cards are randomized each day, no such analysis can be conducted in this project.

The algorithm takes into consideration the next tap-in made with the same card when estimating the alighting station of the previous trip. Cards registered in the system only once during a day will therefore be neglected in the analysis.

A fundamental concept in traffic planning is that of OD-matrices. OD stands for origins and destinations, and the matrix can, depending on what is of interest, represent the movement of vehicles, the travel time or the number of passengers. In this project all origins are also destinations, and hence all destinations are also origins. Origins and destinations are represented in real life by stations, and the numbers in the matrix represent the registered trips (passengers) between each pair during the course of one day. As the number of stations is large (above 1000), stations will later be clustered together to reduce the size of the OD-matrix.
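As a minimal sketch of the concept (in Python, not the MATLAB code of the project), an OD-matrix can be built by counting trips per origin-destination pair. The zone names and the trip list below are illustrative assumptions, not project data.

```python
from collections import Counter

def build_od_matrix(trips, zones):
    """Count trips per (origin, destination) pair.

    Returns a nested list where row = origin and column = destination,
    both ordered as in `zones`.
    """
    counts = Counter(trips)
    return [[counts[(o, d)] for d in zones] for o in zones]

# Hypothetical trips between three zones.
zones = ["A", "B", "C"]
trips = [("A", "B"), ("A", "B"), ("B", "A"), ("C", "A")]
od = build_od_matrix(trips, zones)
```

With more than 1000 stations most cells of such a matrix are zero, which is why the stations are clustered before the matrix is analyzed.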

The project features tap-in data from Blekinge for the month of October 2016. The initial analysis uses the entire data set, whereas the final analysis uses a smaller fraction, featuring data from the 3rd of October only. The reason behind this delimitation is that the purpose of the project is not to analyze the traffic situation in Blekinge but rather to show how tap-in data can be used to support the decisions of traffic planners. Data for the whole month can later be analyzed with the method presented, but because of the size of the data the algorithm then takes a long time to run.

1.3. Outline

The remainder of the report has the following structure: Chapter 2 presents the methodology for the project. Chapter 3 presents relevant literature. Chapter 4 features a description of the data given. Chapter 5 highlights the initial analysis of tap-in data. Chapter 6 explains the algorithm used. Chapters 7 and 9 feature the results and the conclusion of the project.


2. Methodology

A large obstacle to OD-estimation in public transport systems is the lack of destination data, as most systems, mainly for convenience, do not require users to tap their card upon leaving the vehicle. The alighting station therefore needs to be estimated in order to generate an OD-matrix. In this project the destinations are estimated using tap-in data, stations data and the 2016 timetable from Blekingetrafiken. Tap-in data includes a timestamp of when a passenger boarded, on which bus and at which station. Stations data includes the stations and their corresponding coordinates. The timetable contains all arrival and departure times for all vehicles at all stations during all periods in time and is hence substantial in size. The stations connected to Blekingetrafiken were extracted from the timetable data set using an SQL query, so that only the relevant timetables were imported into the algorithm. An example of the three types of data is seen in the tables below (see chapter 4 for a deeper description of the data).

Table 1 - A selection of the most important columns stated as an example of tap-in data. The columns represent the time stamp, the route, the stop number, the sequence of the stop in the route and the daily randomized card id.

TRS_DT            RUT  STP_LST_NUM  STP_LST_SEQ_NUM  CRD_NUM
2016-10-05 21:33  150  150013       33               23b22c6ccd530509c86dd7133477d4ce

Table 2 - The stations data with the station number expressed using both the local and the national method, the station name and its coordinate

STP_LST_NUM  Agency Stop_id  STPNAMPRN          GPS_LATITUDE  GPS_LONGITUDE
100101       1001            Kungsplan centrum  56,165381     15,586921

Table 3 – Timetable data displaying time of arrival, departure at specific stop, sequence of stop and unique trip id.

Trip_id  Arrival_time  Departure_time  Stop_id    Stop_sequence  Agency_stop_id
45501    15:32:00      15:35:00        740000096  2              8109

Using this data, destinations are estimated with an algorithm based on the one presented by Trépanier & Chapleau (see chapter 6 for the algorithm description). As the algorithm relies on broad assumptions, the estimations might be far from perfect when studied in detail. However, the main purpose of the algorithm is rather to get an indication of travel behavior on a more general level.

Initial analysis and filtering of data were done with Microsoft Excel and SQL queries in Microsoft Access. In the Excel analysis (see chapter 5) the pivot-table function was used to analyze tap-ins during different periods in time, without taking into consideration the algorithm or any potential alighting station. In the initial analysis the GIS software ArcMap was also used to display the frequency of tap-ins at the different stations.

As mentioned, the algorithm was to a high degree based on the one developed by Trépanier and Chapleau, although it is not identical to it. Two different algorithms were developed: a simpler one that does not consider the timetables and another that takes the timetables into account. The algorithms were developed using MATLAB.


3. Literature Review

Both traffic models and public transport agencies commonly base decisions and planning on data from travel surveys. These surveys are sometimes not of satisfactory quality. A study using data from the Oyster card in London found that data from smart cards could be used to enhance the validity of travel surveys, thus highlighting the potential secondary benefit of SCAFS beyond collecting the fare efficiently (Riegel, et al., 2013).

The problem of estimating destinations from tap-in data alone has been explored thoroughly in recent research. The solution that forms the base for this project was first introduced in Montréal, Canada, by Trépanier and Chapleau in 2006. The authors match tap-in data with the timetable to determine potential alighting stations for each trip. Potential alighting stations are those that follow the boarding station on the route of the trip. The alighting station is estimated as the station with the smallest Euclidean distance to the next tap-in made with the same card. This is the first of three assumptions about user behavior in the algorithm, namely that users will strive to walk as little as possible. If the distance between the alighting station and the next departure is larger than a tolerance value, the trip is classified as one whose destination could not be estimated. The tolerance value used in the study by Trépanier and Chapleau is 2 km. The other two assumptions made by the authors are that users will always strive to go home (to the first boarding) by the end of the day and that travelers usually have the same destinations throughout the week, which enables analysis of individual travel patterns over more than one day. Figure 1 shows an example of the algorithm by Trépanier and Chapleau (2006). Estimated alightings 1 and 2 are the stations nearest to the second and the third tap-in. The third alighting is at the station with the shortest distance to the first tap-in. d1, d2 and d3 are the distances between each estimated alighting station and the paired tap-in station. These distances are checked against the tolerance value: if a distance is larger than the tolerance value, the destination cannot be estimated.

Figure 1 - Algorithm implemented by Trépanier and Chapleau (2006) (own interpretation).
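The alighting rule described above can be sketched as follows. This is a hedged illustration in Python, not the implementation of Trépanier and Chapleau: the function names and the sample stops are hypothetical, and the haversine distance is swapped in for the Euclidean distance because the coordinates in this project are given as latitude and longitude. Only the 2 km tolerance is taken from the text.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def estimate_alighting(downstream_stops, next_tap_in, tolerance_km=2.0):
    """Pick the downstream stop closest to the location of the next tap-in.

    downstream_stops: {stop_id: (lat, lon)} for stops after the boarding stop.
    Returns (stop_id, distance_km), with stop_id None when even the closest
    stop lies farther away than the tolerance.
    """
    stop_id, coord = min(downstream_stops.items(),
                         key=lambda item: haversine_km(item[1], next_tap_in))
    d = haversine_km(coord, next_tap_in)
    return (stop_id, d) if d <= tolerance_km else (None, d)

# Hypothetical downstream stops of one route.
downstream = {
    "s2": (56.170, 15.590),
    "s3": (56.180, 15.600),
    "s4": (56.200, 15.650),
}
stop, d = estimate_alighting(downstream, (56.181, 15.601))
```

For the last trip of the day the same function could be called with the location of the first tap-in as `next_tap_in`, which corresponds to the return-home assumption.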


Using their algorithm, Trépanier and Chapleau (2006) were able to estimate 66% of all possible destinations, the number being 85% for peak-hour traffic. Even though this might not seem high, they point out that the results for regular routes are unlikely to be affected, as most of the trips that could not be estimated were trips outside what can be considered "normal".

A hint of how good the algorithm really is was given in another study, conducted by Li, et al. (2011) in the Chinese city of Jinan with its 7 million residents, which concluded that a similar trajectory search algorithm managed to estimate 75% of the true OD-pairs (85% at peak-hour traffic).

When estimating alighting stations and destinations, a transit between different routes may occur within the same trip. According to H.K. Lo (2003), the transit location is not considered a trip destination if certain rules are met. H.K. Lo et al. (2003) describe the main rule as that the transfer has to capture realistic transfer behavior, i.e. the purpose of the trip on the route decides whether the alighting point is considered a destination or a transit point. There must be a connection between the alighting and the next departure, i.e. both the distance between the nodes and the transit waiting time must be small enough. The allowed waiting time depends, according to H.K. Lo (2003), on the frequency of the transport system and on whether the passenger, in real life, gets a seat on the next departure. If both the transit time and the transit distance are small enough, the alighting and the next departure are considered a transit. If either the time or the distance is too large, the alighting is considered to be a destination of the trip.
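The transfer rule can be expressed as a simple test on walking distance and waiting time. The sketch below is only an illustration: the source gives no numerical thresholds, so `max_distance_km` and `max_wait` are hypothetical default values that would have to be calibrated to the frequency of the transport system.

```python
from datetime import datetime, timedelta

def is_transfer(walk_distance_km, alighting_time, next_boarding_time,
                max_distance_km=0.5, max_wait=timedelta(minutes=20)):
    """Return True when an alighting followed by a boarding counts as a transit.

    Both conditions must hold: the distance between the two nodes is small
    enough and the waiting time is small enough. Otherwise the alighting
    is treated as a destination. The thresholds are assumed values.
    """
    wait = next_boarding_time - alighting_time
    return (walk_distance_km <= max_distance_km
            and timedelta(0) <= wait <= max_wait)

alighted = datetime(2016, 10, 3, 8, 0)
short_wait = is_transfer(0.2, alighted, datetime(2016, 10, 3, 8, 10))
long_wait = is_transfer(0.2, alighted, datetime(2016, 10, 3, 9, 30))
long_walk = is_transfer(1.5, alighted, datetime(2016, 10, 3, 8, 10))
```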

4. Data description

The following chapter describes the data featured in the project. Data are separated into tap-in data, the location of stops and timetable data. The tap-in data originates from Blekingetrafiken while the other types of data come from the Swedish national traffic cooperation Samtrafiken. Samtrafiken gathers data and information from 59 different public transport agencies and organizations in Sweden (Samtrafiken, 2017).

As users of Blekingetrafiken's service board a vehicle, they register their pre-charged smartcard in a transaction machine. A data set of these registrations during October 2016 was given as input to the project. The data file, containing 776827*82 cells of information on registered trips, also contained information irrelevant to the model. The irrelevant data filtered out of the data set concerned, for instance, unknown stations, which could relate to a traveler charging the card rather than going on an actual trip. The data extracted from the tap-in data for the algorithm were the transaction machine id, time of registration, route, stop number, stop sequence and card id. As previously highlighted in the limitations of the project, the card id of each traveler was randomized each day, meaning that analyses of more than one day of traveling for the same user were not possible.

To estimate the position where each registration was made, another data set was given, containing the location of all stations served by Blekingetrafiken. The stations are expressed as coordinates in a WGS84 (decimal) reference system. The stations featured in the data are displayed in Figure 2. Stations registered in the tap-in data but not featured in the location-of-stops data were neglected in the analysis. There might be several reasons why some station numbers registered in the tap-in data were not featured in the location-of-stops data, one being that the transaction was made in a machine on a square or in a store rather than on a bus or a train.


Figure 2 - Stations served by Blekingetrafiken or its associated partners (Pågatågen, Öresundstågen, Krösatågen).

A challenge to overcome was that the data was not homogeneous: stations have one number in the local tap-in data and another in the national timetable and stops data. Locally, stations have been given a six-digit number, while they have a corresponding four-digit number at the national level. To achieve homogeneity, stations from the local system that share the same first four digits have been merged. The coordinate assigned to such a merged station in the tap-in data is the centroid of the local stations it replaces. An example of the merging can be seen in Table 4 below. Through this method the tap-in data can be compared and matched with the timetable data. A total of 1094 stations were located using the national four-digit number.

Table 4 - Merging of the station at Kungsplan Karlskrona

Name of station       Local number  Local coordinates    National merged number  Estimated coordinate
Kungsplan Karlskrona  100101        56,165381 15,586921  1001                    56,16538 15,58677
Kungsplan Karlskrona  100107        56,165723 15,587112  1001                    56,16538 15,58677
Kungsplan Karlskrona  100109        56,165021 15,586263  1001                    56,16538 15,58677
Kungsplan Karlskrona  100151        56,165381 15,586921  1001                    56,16538 15,58677
Kungsplan Karlskrona  100157        56,165723 15,587112  1001                    56,16538 15,58677
Kungsplan Karlskrona  100159        56,165021 15,586263  1001                    56,16538 15,58677
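The merging can be sketched in a few lines of Python. The sketch assumes that the centroid is the arithmetic mean of the local coordinates, which reproduces the estimated coordinate of Table 4 to within rounding; the function name is hypothetical and the decimal commas of the source data are written as decimal points.

```python
from collections import defaultdict

def merge_stations(local_stations):
    """Merge local six-digit stations that share the same four-digit prefix.

    local_stations: {local_number: (lat, lon)}
    Returns {national_number: (mean_lat, mean_lon)}, where the mean is the
    assumed definition of the centroid.
    """
    groups = defaultdict(list)
    for local_number, coord in local_stations.items():
        groups[str(local_number)[:4]].append(coord)
    return {
        national: (sum(lat for lat, _ in pts) / len(pts),
                   sum(lon for _, lon in pts) / len(pts))
        for national, pts in groups.items()
    }

# The six local stations at Kungsplan, Karlskrona, from Table 4.
kungsplan = {
    100101: (56.165381, 15.586921),
    100107: (56.165723, 15.587112),
    100109: (56.165021, 15.586263),
    100151: (56.165381, 15.586921),
    100157: (56.165723, 15.587112),
    100159: (56.165021, 15.586263),
}
merged = merge_stations(kungsplan)
lat, lon = merged["1001"]
```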

The timetable data contains the following information; trip-id, arrival time, departure time, sequence number and stop-id (nationally and locally). The trip-id corresponds to a unique trip number for a public transport route during a unique period of time.


5. Initial data analysis

The initial analysis does not involve the algorithm that estimates the destination of trips but aims to analyze, in a "simpler" way, tap-ins per weekday, hour and station. Another important part of the initial analysis is the clustering of stations. Stations were clustered into the five municipalities of the county and, within the municipalities, into strategically chosen smaller sections.

Figure 3 features tap-ins during October 2016 separated per weekday and hour. As could be expected, the number of registered trips fluctuates heavily during the course of the day and differs between working days and the weekend.

[Chart: Tap-ins per weekday and hour. Horizontal axis: hour of day for Monday to Sunday. Vertical axis: total tap-ins, 0 to 25000.]

Figure 3- The number of tap-ins by the hour and by the weekday.
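The project produced this aggregation with Excel pivot tables. As a hedged alternative sketch, the same count per weekday and hour can be obtained in Python as shown below. The timestamp format follows the TRS_DT column of Table 1; the first sample value is taken from Table 1 and the other two are hypothetical.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def tap_ins_per_weekday_hour(timestamps):
    """Count tap-ins per (weekday, hour) from 'YYYY-MM-DD HH:MM' strings."""
    counts = Counter()
    for ts in timestamps:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M")
        counts[(WEEKDAYS[t.weekday()], t.hour)] += 1
    return counts

sample = ["2016-10-05 21:33", "2016-10-05 21:40", "2016-10-03 07:15"]
counts = tap_ins_per_weekday_hour(sample)
```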

Figure 3 confirms that travel with Blekingetrafiken follows the traditional peak-hour pattern from Monday to Friday and that the weekend follows another type of pattern. With that confirmation, deeper analysis using the algorithm can be conducted in which peak-hour traffic is studied in particular. Figure 4 displays the number of tap-ins per station during the third of October 2016. The size of each circle represents the number of tap-ins at the station: the larger the circle, the larger the number of tap-ins. Note that some smaller circles may be covered by larger ones. As the positions of the larger circles show, most of the trips originate in the urban areas of the towns of Karlskrona, Ronneby, Karlshamn, Sölvesborg, and Olofström. Only a few trips originate in the countryside compared with those originating in the urban areas.

Figure 4 - Number of tap-ins per station during the third of October 2016.


As mentioned in the data description, a total of 1094 stations were located in the project. An aim of the project is to obtain OD-matrices representing flows of passengers. Without clustering this corresponds to a matrix of size 1094 by 1094 in which the vast majority of cells are equal to zero. Such a matrix would say little about travel behavior, as it is both uninteresting and challenging to grasp. The stations were therefore clustered into larger areas, similar to hubs. The clustering was done on two different levels: municipalities and local areas within municipalities. On each level, every station was assigned to the nearest of the potential hubs. The first level of clustering featured the five municipalities Karlskrona, Ronneby, Karlshamn, Sölvesborg, and Olofström, with the centroids placed in the central towns of these municipalities. This clustering gave an OD-matrix of size 5 by 5, which provides information on travel patterns between and within the different municipalities. Figure 5 displays the aggregated stations, where green is Sölvesborg, magenta is Olofström, red is Karlshamn, black is Ronneby and blue is Karlskrona.

Figure 5 - The bus stops in the virtual municipalities. The results are rather satisfying; however, judging by the routes, some of the north-eastern stations assigned to Ronneby (black) would likely fit better in Karlskrona (blue).

As most of the trips are made within the municipalities, the 5*5 matrix is too coarse to analyze this type of movement. In the second level of clustering, the stations within each municipality were therefore aggregated into smaller areas in order to detect travel behavior within the municipalities. The hubs on this level were selected based on where stations were located, which meant that the number of hubs differed between municipalities. As an example, Karlskrona was divided into 19 smaller areas or hubs. The method was the same as before: stations were assigned to the nearest of the potential hubs. With these two levels of clustering, movement both between and within the municipalities could be analyzed further.
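Assigning each station to its nearest hub can be sketched as follows. The hub coordinates are rough illustrative approximations of the five central towns, not the centroids used in the project, and the haversine distance is an assumed choice of metric. The station "1001" uses the merged Kungsplan coordinate from Table 4, while "hypothetical" is an invented station.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def assign_to_nearest_hub(stations, hubs):
    """Map every station id to the name of the hub with the smallest distance."""
    return {
        sid: min(hubs, key=lambda name: haversine_km(coord, hubs[name]))
        for sid, coord in stations.items()
    }

# Approximate, illustrative hub positions (assumptions, not project values).
hubs = {
    "Karlskrona": (56.16, 15.59),
    "Ronneby": (56.21, 15.28),
    "Karlshamn": (56.17, 14.86),
    "Sölvesborg": (56.05, 14.58),
    "Olofström": (56.28, 14.53),
}
stations = {"1001": (56.16538, 15.58677), "hypothetical": (56.18, 14.90)}
clusters = assign_to_nearest_hub(stations, hubs)
```

The second clustering level would reuse the same function with the hubs of a single municipality as input. A purely distance-based assignment ignores the route network, which is one possible explanation for the stations in Figure 5 that appear to belong to the neighboring municipality.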


6. Algorithm description

Two algorithms were implemented. Both aim to construct an origin-destination matrix from the tap-in data together with the other data described in section 4. The difference between them lies in the complexity with which they derive the OD-matrix. In the first algorithm, broad assumptions were made regarding the speed at which users travel from one place to another. The assumed speed depends on the distance traveled, being lower for short distances and higher for long distances. It is used to determine whether a tap-in was made at the beginning of a new trip (origin) or after a transfer stop, and to calculate the approximate waiting time in case the tap-in was made after a transfer. In the second algorithm the timetables are also included, which makes the analysis of the data much more complex. This algorithm tries to identify the most likely route on which the tap-in was made and then looks through the route to identify the alighting station. Finally, it verifies whether a transfer was made or the user reached the destination. In this second approach there is no need for assumptions about speeds, since speeds and waiting times can be calculated more precisely from the timetables. The outputs of both algorithms are OD-matrices together with data about transfers, waiting times and average travel speeds, as well as the count of, and reason for, destinations that could not be estimated.

There are four major reasons why some destinations could not be estimated. The first is that the user registered only one tap-in during the day, in which case the boarding location is known but there is no further information from which to estimate the destination. The second is that the estimated destination matches the boarding location. Such an estimate is considered incorrect, since it is not reasonable that a trip was made to the place where it originated; the algorithm calls this an origin-destination match. The third is similar to an origin-destination match but occurs at the last tap-in of the day. Both algorithms assume that the first tap-in of the day is made at 'home' and that the last trip therefore aims to get back to the location of the first tap-in. Sometimes the first and last tap-in locations are the same, so the estimated destination of the last tap-in matches its origin. The fourth reason applies only to the second algorithm, which uses the timetables. The route information gives the sequence of stations that the trip passes; if the following tap-in is registered at a station that the route had already visited, that station cannot be the destination of this trip and the real destination cannot be estimated.

Description

The program starts by reading the different data files: the location of stops, the tap-in data and the timetables (for the second algorithm).

The first task of the program is to identify all the tap-ins done by the same id number which represents a single user (card) from the tap-in data set. It will do this for all different id numbers.

After the tap-in information for a single user is gathered, it is sorted by date and time, which gives the order in which the tap-ins were made throughout the day. If a user has only one tap-in during the day ($n = 1$), the estimation is not possible and further analysis is dropped. The sorted data contains the station number at which each tap-in was made ($ST_j$), so its location ($L_j$) can be found by matching it with the other data set (location of stops). Together with the sequence number ($SQ_j$) of the stop, the next task is to estimate the destination of each trip or identify any transfers. This is where the two algorithms differ. The single-user tap-in information passed on to them has $n \geq 2$ tap-ins.
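The grouping and sorting step can be sketched as below. The record layout (card id, timestamp, station) is an assumption that mirrors the CRD_NUM, TRS_DT and STP_LST_NUM columns of Table 1, and the card ids and stations in the example are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def group_tap_ins(records):
    """Group tap-ins per card, sort them by time and drop single tap-ins.

    records: iterable of (card_id, 'YYYY-MM-DD HH:MM', station).
    Returns {card_id: [(datetime, station), ...]} for cards with n >= 2.
    """
    by_card = defaultdict(list)
    for card_id, ts, station in records:
        by_card[card_id].append(
            (datetime.strptime(ts, "%Y-%m-%d %H:%M"), station))
    return {card: sorted(taps)
            for card, taps in by_card.items() if len(taps) >= 2}

records = [
    ("card_a", "2016-10-03 16:30", 1005),
    ("card_a", "2016-10-03 07:10", 1001),
    ("card_b", "2016-10-03 09:00", 1001),  # only one tap-in: dropped
]
grouped = group_tap_ins(records)
```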

The first algorithm makes use of the following speed assumptions. For travels of less than 5 km ($paramdist(dist < 5\,\mathrm{km})$), a speed of 10 km/h is assumed ($paramspeed(dist < 5\,\mathrm{km})$); for travels between 5 and 20 km ($paramdist(5\,\mathrm{km} < dist < 20\,\mathrm{km})$), a speed of 20 km/h ($paramspeed(5\,\mathrm{km} < dist < 20\,\mathrm{km})$); for travels between 20 and 45 km ($paramdist(20\,\mathrm{km} < dist < 45\,\mathrm{km})$), a speed of 45 km/h ($paramspeed(20\,\mathrm{km} < dist < 45\,\mathrm{km})$); and for longer distances ($paramdist(45\,\mathrm{km} < dist)$), a speed of 55 km/h ($paramspeed(45\,\mathrm{km} < dist)$). The algorithm delivers average speeds for each distance range, so the accuracy of these assumptions can be verified and corrected if needed. The algorithm assumes that the alighting location of the first trip is the location where the second tap-in ($L_{j+1}$) was registered, that the alighting location of the second trip is the location where the third tap-in was registered, and so on. Finally, the alighting location of the last trip is assumed to be the location of the very first tap-in registered during the day ($L_1$). Whether the user left the vehicle to make a transfer or had reached the destination is decided by comparing the time between tap-ins ($T_{j+1} - T_j$) with a time threshold ($T_{max}$). For example, if the distance between tap-ins is in the first range (less than 5 km) and the time between tap-ins ($T_{j+1} - T_j$) is less than what it would take to travel 5 km at the assumed speed of 10 km/h, i.e. less than 30 min ($T_{max}$), the alighting is considered a transfer. The time between tap-ins minus the time it would take to travel the distance between the tap-ins at 10 km/h is the waiting time, if it is positive. A negative result means that the travel speed was higher than 10 km/h. Both the waiting time (if positive) and the travel speed are stored, together with the fact that the distance was less than 5 km. On the other hand, if the time between tap-ins is over 30 min, the alighting is considered the end of a trip and the station is stored as a destination ($D_i$). The destination is then set as a new origin ($O_{i+1}$), and the algorithm moves on to the following tap-in ($j = j + 1$) and again looks for its destination or transfers. At each step, the time and day of the week of the trip are stored for subsequent analysis.

By the end of the run, this first algorithm delivers the origins ($O_i$), destinations ($D_i$) and transfers of each user, as well as the average speed for each distance range. Together with the average waiting times, this information can be used to calibrate the speed assumptions and run the algorithm once again. The information delivered about transfers consists of the station numbers and the number of transfers at each station, so the stations where most transfers occurred can be identified. Lastly, the time ($T_j$) and day of the week of each trip are known, so trips can later be sorted by day of the week or time of day, depending on the type of analysis wanted. Note that the last tap-in is treated differently, since the assumption is that from that last point every user returns to the location of the first tap-in of the day.
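The speed assumptions and the resulting threshold can be illustrated with a short Python sketch. The speeds and range limits come from the text. The use of the upper limit of each range as $paramdist$ follows the 5 km example, while the treatment of the open-ended range above 45 km (the actual distance is used) is an assumption, as are the function names.

```python
# (upper limit of the distance range in km, assumed speed in km/h)
SPEED_TABLE = [(5.0, 10.0), (20.0, 20.0), (45.0, 45.0), (float("inf"), 55.0)]

def speed_params(dist_km):
    """Return (paramdist, paramspeed) for the range that contains dist_km."""
    for upper, speed in SPEED_TABLE:
        if dist_km < upper:
            # Assumption: the open-ended range has no upper limit, so the
            # actual distance is used as paramdist there.
            return (dist_km if upper == float("inf") else upper), speed

def classify(dist_km, gap_hours):
    """Classify an alighting as a transfer or a destination.

    gap_hours is the time between two consecutive tap-ins. For a transfer
    the positive part of the waiting time is returned as well.
    """
    paramdist, paramspeed = speed_params(dist_km)
    t_max = paramdist / paramspeed
    if gap_hours > t_max:
        return "destination", None
    waiting = gap_hours - dist_km / paramspeed
    return "transfer", max(waiting, 0.0)

kind, waiting = classify(3.0, 0.4)  # 3 km apart, 24 minutes between tap-ins
```

In this example the threshold is 0.5 h, so the alighting counts as a transfer, and the waiting time is 0.4 h minus the 0.3 h needed to cover 3 km at 10 km/h.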

The second algorithm makes use of timetables. For each tap-in of a user, the algorithm takes the sequence number ($SQ_j$) and the station ($ST_j$) at which the tap-in was registered and searches the timetables for the most likely route ($R_j$) the user took. To find it, the routes that stop at the station are filtered ($R_{ST_j}$), and among them the sequence number ($SQ_j$) is searched, giving the routes that stop at the station ($ST_j$) with the same sequence number ($R_{ST_j}^{SQ_j}$). If no route matches the sequence number, the algorithm finds the most likely sequence number by comparing the time the tap-in was registered ($T_j$) with the departure times of the routes that stop there ($R_{ST_j}^{T}$), and assigns the one with the minimum difference. It then looks into the route and all the stations the route goes to or has already gone to ($R_j^{ST}$), together with the next tap-in ($j + 1$), to estimate at which station of the route the alighting took place and the time of arrival there ($AT^*$). If the route ($R_j$) does not stop at the next station ($ST_{j+1}$), the algorithm looks for the route station nearest (minimum distance) to the next station and uses the arrival time at this nearest station for later comparison ($AT^* = R_j^{T}(\min\{dist(L_{R_j^{ST}}, L_{j+1})\})$). If the arrival time ($AT^*$) at the next station is later than the departure time from the current station ($DT_j$), the next station is a possible destination or transfer station, so a flag is set for analysis in the next loop. If, on the contrary, the arrival time is earlier, the route has already visited the next station; the destination then cannot be estimated, the current origin is dropped and a new origin is set at the next station. To distinguish between a transfer and a destination, the next loop identifies the route taken at the next tap-in and the departure time of the next vehicle of that route ($DT_{nextveh}$): if the vehicle taken was the first to pass the alighting station, it is a transfer; if it was a later one, the station is stored as a destination ($D_i$). The information delivered by this algorithm is similar, except that fewer destinations ($D_i$) are estimated. This number is used to compare the two algorithms. No speed assumptions are needed, but the speed and waiting time are stored per travel distance, also for comparison with the first algorithm.
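The alighting step of the timetable-based algorithm can be sketched as follows. The sketch covers only the search for $AT^*$ and the comparison with $DT_j$, not the route identification or the transfer test of the following loop. The data layout, the function names, the equirectangular distance approximation and the representation of times as minutes after midnight are all assumptions made for the illustration.

```python
import math

def planar_km(a, b):
    """Approximate distance in km between two (lat, lon) points in degrees."""
    mean_lat = math.radians((a[0] + b[0]) / 2)
    dx = math.radians(b[1] - a[1]) * math.cos(mean_lat)
    dy = math.radians(b[0] - a[0])
    return 6371.0 * math.hypot(dx, dy)

def candidate_alighting(route_stops, departure_time, next_station, next_location):
    """Find the alighting candidate on the identified route.

    route_stops: list of (station, arrival_minutes, (lat, lon)) for every
    stop of the trip, including stops already passed.
    Returns (station, AT*) when the vehicle arrives there after the boarding
    departure time, or (None, AT*) when the stop was already visited and the
    destination therefore cannot be estimated.
    """
    matches = [s for s in route_stops if s[0] == next_station]
    # If the route does not serve the next station, fall back to the nearest stop.
    stop = matches[0] if matches else min(
        route_stops, key=lambda s: planar_km(s[2], next_location))
    station, arrival_time, _ = stop
    if arrival_time > departure_time:
        return station, arrival_time
    return None, arrival_time

# Hypothetical trip with three stops; the user boards at "B" at minute 491.
route = [
    ("A", 480, (56.16, 15.58)),
    ("B", 490, (56.17, 15.60)),
    ("C", 500, (56.18, 15.62)),
]
ahead = candidate_alighting(route, 491, "C", (56.18, 15.62))
behind = candidate_alighting(route, 491, "A", (56.16, 15.58))
nearest = candidate_alighting(route, 491, "X", (56.181, 15.621))
```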

After the OD-matrices of either algorithm are obtained, the program clusters the information by municipality, and further by points of interest within each municipality. This clustering is done to enable analysis of smaller geographical areas and is not part of the destination-estimation algorithm.
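The clustering step can be illustrated with a minimal sketch. It assumes that estimated trips are available as (origin station, destination station) pairs and that each station has been assigned to an area. The function name `cluster_od`, the station number 9001 and its assignment to Ronneby are hypothetical and only serve the illustration.

```python
from collections import Counter


def cluster_od(trips, station_to_area):
    """Aggregate station-level (origin, destination) pairs into area-level counts."""
    od = Counter()
    for origin, destination in trips:
        # Trips touching a station without a known area are skipped.
        if origin in station_to_area and destination in station_to_area:
            od[(station_to_area[origin], station_to_area[destination])] += 1
    return od


# Illustrative data: 9001 is a made-up station number placed in Ronneby.
station_to_area = {1041: "Karlskrona", 1001: "Karlskrona", 9001: "Ronneby"}
trips = [(9001, 1041), (9001, 1001), (1041, 1001), (1001, 9001)]
od = cluster_od(trips, station_to_area)
print(od[("Ronneby", "Karlskrona")])
```

The same function could be reused for the finer level by passing a mapping from stations to subparts of a municipality instead.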

Other analyses could be implemented in the program if specified, but its main aim is to deliver OD-matrices by estimating the destination of each trip using only the smart card information, that is, the tap-ins made when boarding.

Next, we summarize both algorithms. For simplicity, we show only how they estimate destinations and when a destination cannot be estimated. Other information not concerning destination estimation, such as waiting times, transfer stations and speeds, is not shown, but could be added within these general formulations for later, more specific analysis.

Algorithm without considering timetables:

a) Set $j = 1$, $i = 1$, $O_i = ST_j$.

b) If $j < n$, set $dist = dist(L_j, L_{j+1})$; otherwise set $dist = dist(L_j, L_1)$.

c) Set $T_{max} = \frac{param_{dist}(dist)}{param_{speed}(dist)}$.

d) If $j < n$:

   i. If $T_{j+1} - T_j \le T_{max}$, then it is a transfer.

   ii. If $T_{j+1} - T_j > T_{max}$, then it is a destination, set $D_i = ST_{j+1}$.

      1. If $D_i \ne O_i$, set $i = i + 1$, $O_i = ST_{j+1}$.

      2. If $D_i = O_i$, then the destination can't be estimated, set $D_i = \emptyset$, $O_i = ST_{j+1}$.

e) If $j = n$:

   i. If $D_{i-1} = ST_j$, set $D_i = ST_1$; otherwise, set $D_i = ST_j$, $i = i + 1$, $O_i = ST_j$, $D_i = ST_1$.

      1. If $D_i = O_i$, then the destination can't be estimated, set $D_i = \emptyset$, $O_i = \emptyset$.

f) If $j < n$, set $j = j + 1$ and return to b); otherwise stop.
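The chaining logic of the steps above can be sketched in code. This is a simplified illustration, not the program used in the project: the tuple layout of a tap-in, the planar coordinates in km and the function `t_max` (a stand-in for $T_{max}$ with made-up values) are assumptions. The final step is also simplified, in that the trip still open at the last tap-in is closed at the first station of the day.

```python
from math import hypot


def estimate_trips(taps, t_max):
    """Chain one card's tap-ins for one day into (origin, destination) pairs.

    taps:  list of (time_min, station, (x_km, y_km)), sorted by time.
    t_max: callable giving, for a distance in km, the largest gap in minutes
           between two tap-ins that is still treated as a transfer.
    A destination of None means it could not be estimated.
    """
    if len(taps) < 2:
        return []  # a single tap-in cannot be chained to anything
    first_station = taps[0][1]
    origin = first_station
    trips = []
    for (t, _, loc), (t_next, st_next, loc_next) in zip(taps, taps[1:]):
        d = hypot(loc_next[0] - loc[0], loc_next[1] - loc[1])
        if t_next - t <= t_max(d):
            continue  # short gap: the next tap-in is a transfer
        # Long gap: the next tap-in station is taken as the destination.
        trips.append((origin, st_next if st_next != origin else None))
        origin = st_next
    # Last tap-in: assume a return to the first station of the day.
    trips.append((origin, first_station if first_station != origin else None))
    return trips


def t_max(dist_km):
    # Hypothetical stand-in for T_max: 30 min allowance plus travel at 0.5 km/min.
    return 30.0 + dist_km / 0.5


taps = [(420, "A", (0.0, 0.0)), (430, "B", (3.0, 0.0)), (960, "C", (10.0, 0.0))]
print(estimate_trips(taps, t_max))
```

In this example the tap-in at B follows the one at A within the allowed gap and is treated as a transfer, so the day is read as one trip from A to C and a return trip from C to A.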

Algorithm considering timetables:

a) Set $j = 1$, $i = 1$, $O_i = ST_j$, $flag = 0$.

b) If $SQ_j \in R_{ST_j}^{SQ}$, set $R_{SQ_j} = R_{ST_j}^{SQ_j}$; otherwise set $SQ_j = R_{ST_j}^{SQ}(\min\{|T_j - R_{ST_j}^{T}|\})$, $R_{SQ_j} = R_{ST_j}^{SQ_j}$.

c) Set $R_j = R_{SQ_j}^{num}(\min\{|T_j - R_{SQ_j}^{T}|\})$, $DT_j = R_j^{T}(\min\{|T_j - R_j^{T}|\})$.

d) If $flag = 1$:

   i. Set $DT_{nextveh} = R_j^{T}(\min\{|R_j^{T} - AT^*|\})$.

   ii. If $DT_j > DT_{nextveh}$, then it is a destination, set $D_i = ST_j$.

      1. If $D_i \ne O_i$, set $i = i + 1$, $O_i = ST_j$.

      2. If $D_i = O_i$, then the destination can't be estimated, set $D_i = \emptyset$, $O_i = ST_j$.

   iii. If $DT_j = DT_{nextveh}$, then it is a transfer.

   iv. Set $flag = 0$.

e) If $j < n$, set $AT_{j+1} = R_j^{T}(ST_{j+1})$; otherwise set $AT_{j+1} = R_j^{T}(ST_1)$.

f) If $AT_{j+1} \ne \emptyset$, set $AT^* = AT_{j+1}$; otherwise, if $j < n$, set $AT^* = R_j^{T}(\min\{dist(L_{R_j^{ST}}, L_{j+1})\})$; otherwise set $AT^* = R_j^{T}(\min\{dist(L_{R_j^{ST}}, L_1)\})$.

g) If $AT^* > DT_j$, set $flag = 1$; otherwise the destination can't be estimated: if $j < n$, set $O_i = ST_{j+1}$, otherwise set $O_i = \emptyset$.

h) If $j < n$, set $j = j + 1$ and return to b); otherwise, if $flag = 1$:

   i. Set $D_i = ST_1$.

   ii. If $D_i = O_i$, then the destination can't be estimated, set $D_i = \emptyset$, $O_i = \emptyset$.
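The comparison in step d) can be sketched as follows. This is a minimal illustration under assumptions: times are minutes after midnight, `route_departures` holds the departures of the boarded route at the station of the flagged tap-in, and the function names are hypothetical. The case where the matched departure is earlier than the next vehicle is not covered by the steps above and is returned as `None`.

```python
def nearest_time(times, target):
    """Return the timetable time closest to target (minutes after midnight)."""
    return min(times, key=lambda t: abs(t - target))


def classify_boarding(tap_time, route_departures, arrival_time):
    """Classify the station of a flagged tap-in as 'transfer' or 'destination'.

    route_departures: departures of the boarded route at this station.
    arrival_time:     estimated arrival at this station (AT* in the text).
    """
    dt_j = nearest_time(route_departures, tap_time)
    dt_nextveh = nearest_time(route_departures, arrival_time)
    if dt_j > dt_nextveh:
        return "destination"  # a later vehicle than the first possible one
    if dt_j == dt_nextveh:
        return "transfer"  # boarded the first possible vehicle
    return None  # case not covered by the steps above


departures = [600, 615, 630, 645]  # illustrative timetable
print(classify_boarding(616, departures, 612))
print(classify_boarding(644, departures, 612))
```

With an arrival at 612, a tap-in at 616 matches the 615 departure and counts as a transfer, while a tap-in at 644 matches the 645 departure and is stored as a destination.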

7. Algorithm results

As mentioned previously, the algorithms enable a range of alternative outputs. Unless stated otherwise, this results analysis uses the more advanced second algorithm, which takes timetables into consideration. Which analysis to perform depends largely on what the stakeholder wants to examine. The following outputs should be seen as examples of the wide range of outputs possible. The results feature an analysis of transit stations, an OD-analysis of movement between the different municipalities, and an analysis of movement within the municipality of Karlshamn.

The first output analyzes the transit times of travelers, with the aim of answering questions such as: Where are travelers transiting between stations? For how long are travelers waiting? A transit is defined here as in the algorithm description above: a user has waited less than 30 minutes between leaving the first vehicle and boarding the second. Figure 6 below displays the seven most common stations of transit in Karlskrona during the third of October 2016. The most frequent place of transit is station 1041 – Amiralen Lyckeby.

Figure 6 - The seven most common stations of transit in Karlskrona during the third of October 2016

At Amiralen Lyckeby (1041), travelers wait on average just short of 10 minutes before boarding the next bus, and the total number of transits at this station during the 3rd of October was 268. Table 5 below lists the seven most common transit stations in Karlskrona during that day.

Table 5 - Station number, station name, average waiting time and number of transits corresponding to figure 6

Station  Name                           Average waiting time (min)  Transits (3 October)
1041     Amiralen Lyckeby               9.47                        268
1001     Kungsplan Karlskrona centrum   13.25                       157
1601     Karlskrona centrum Parkgatan   12.59                       139
1036     Bergåsa station                9.50                        130
1934     Lyckeby slottsbacken           12.29                       106
1070     Marieberg Karlskrona           8.93                        99
1600     Karlskrona centralstation      10.29                       63
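A table of this kind can be produced from the detected transfers with a small aggregation. The sketch below assumes that each transfer is available as a (station, waiting time in minutes) pair. The function name and the sample values are illustrative and are not taken from the data set.

```python
from collections import defaultdict


def summarize_transits(transfers):
    """Return (station, average wait, number of transits), most used station first."""
    totals = defaultdict(lambda: [0, 0.0])  # station -> [count, summed wait]
    for station, wait_min in transfers:
        totals[station][0] += 1
        totals[station][1] += wait_min
    rows = [(st, total / count, count) for st, (count, total) in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Illustrative transfers only.
transfers = [(1041, 8.0), (1001, 13.0), (1041, 11.0)]
rows = summarize_transits(transfers)
print(rows)
```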

In the following analysis, both algorithms are used in order to compare how much they differ. How much does the implementation of timetable data really improve the model? The chord diagrams below were made with an online tool (Circos, 2017). They illustrate the movement of passengers between different areas in Blekinge during the third of October 2016. The first diagram represents the OD for the entire day to and from the different municipalities. The vast majority of trips are made within the municipalities (90.5% using the simpler algorithm, 92.2% using the advanced algorithm). Since the purpose of the clustering is to see travel patterns between municipalities, these internal trips are neglected in the first analysis. As an example of how to interpret the diagram, the red color represents Karlshamn: the large red outlines correspond to trips going from Karlshamn, while trips going to Karlshamn correspond to the other colors leading into the red area. Figure 7 represents the advanced algorithm while figure 8 represents the simplified one.

Figure 7 - OD of trips to and from the different municipalities during the third of October.
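The share of internal trips is the weight of the diagonal of the municipality-level OD-matrix. A minimal sketch, assuming the matrix is stored as a mapping from (origin, destination) to a trip count; the counts below are made up for illustration and are not the project's results.

```python
def internal_share(od):
    """Share of trips whose origin and destination area are the same."""
    total = sum(od.values())
    internal = sum(count for (o, d), count in od.items() if o == d)
    return internal / total


# Illustrative counts only.
od = {
    ("Karlshamn", "Karlshamn"): 900,
    ("Karlshamn", "Ronneby"): 60,
    ("Ronneby", "Karlshamn"): 40,
}
share = internal_share(od)
print(f"{share:.1%}")
```

Dropping the entries with equal origin and destination from such a mapping leaves the external trips that the chord diagrams display.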

The initial impression is that both diagrams (figures 7 and 8) are very similar, meaning that from a broad perspective the simplified algorithm still produces rather satisfying results. In general, the simplified algorithm produces more trips than the advanced one (42% more for the studied diagrams). Interestingly, the internal trips within the municipalities (not shown in these figures) did not differ as much (only 7%). The reason for the differences is that the advanced algorithm not only takes timetables into account but also discards more trips, for several different reasons. The reasons for discarding trips in the two algorithms are shown in tables 6 and 7 below. Both algorithms discard blips; in the simpler algorithm, the primary reason is a “single blip”.

Figure 8 - OD using simplified algorithm

A single blip represents the scenario where a user registered the card only once during the day. As previously explained, the algorithm cannot be used in this case. The other reason for discarding trips in the simpler algorithm is that the origin is equal to the destination. These two reasons also occur in the advanced algorithm. In addition, the advanced algorithm considers the case where the alighting station of the trip is not on the route, meaning that the estimated alighting station was in fact a station before the boarding station.

Table 6 - Reasons for discarding a blip (simple algorithm)

Algorithm: Simple
Reason                     Share of blips
Single blip                16%
Origin = Destination       3%
Discarded (total)          19%

Table 7 - Reasons for discarding a blip (advanced algorithm)

Algorithm: Advanced
Reason                     Share of blips
Single blip                17%
Destination not on route   18%
Origin = Destination       1%
Discarded (total)          36%
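Shares of this kind follow from tallying one outcome label per blip. The sketch below assumes such labels are available; the label strings and the list of 100 blips are illustrative, with counts chosen only so that they reproduce the percentages reported for the advanced algorithm.

```python
from collections import Counter


def outcome_shares(outcomes):
    """Return the share of blips per outcome label."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {label: count / total for label, count in counts.items()}


# Illustrative list of 100 blips matching the advanced algorithm's percentages.
outcomes = (
    ["estimated"] * 64
    + ["single blip"] * 17
    + ["destination not on route"] * 18
    + ["origin = destination"] * 1
)
shares = outcome_shares(outcomes)
discarded = 1.0 - shares["estimated"]
print(f"{discarded:.0%}")
```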

Figure 9 below displays how many trips were discarded from each origin by the advanced algorithm. One can note that the absolute change is rather large in Karlskrona, but that the change amounts to only roughly 10% of all trips. In Ronneby, however, both the absolute and the relative change are substantial. One can for instance note that the flow from Ronneby to Karlskrona increased by 111 passengers (from 192 to 303).
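As a worked check of the Ronneby to Karlskrona example, the absolute and relative increase can be computed as below. Measuring the relative increase against the smaller count is an assumption made here; the text does not state which count the percentages in the figure are based on.

```python
def increase(before, after):
    """Return the absolute increase and the increase relative to `before`."""
    absolute = after - before
    return absolute, absolute / before


absolute, relative = increase(192, 303)
print(absolute, f"{relative:.1%}")
```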

[Figure 9: bar chart titled "Trips included with simple algorithm (all trips)", showing the absolute increase (number of trips) and the relative increase (%) for the origins Karlshamn, Karlskrona, Olofström, Ronneby and Sölvesborg.]

Figure 9 - Trips discarded by the advanced algorithm from the different origins

Figure 10 below illustrates the differences between the two algorithms when the internal trips within each municipality are neglected. Once again one can note the uneven performance of the simpler algorithm. What is interesting here is that the simpler algorithm included more external trips in the large municipalities of Karlshamn and Karlskrona. The results in figure 9 seemed to show a relatively good performance of the simpler algorithm there, but looking strictly at external trips that is not the case.

[Figure 10: bar chart titled "Trips included with simple algorithm (external)", showing the absolute increase (number of trips) and the relative increase (%) for the origins Karlshamn, Karlskrona, Olofström, Ronneby and Sölvesborg.]

Figure 10 - External trips discarded with the advanced algorithm

Looked at in isolation, an OD-matrix of passenger movement between municipalities over a whole day might not be of great interest to traffic planners. A more interesting alternative is to analyze when during the day travelers move where. Which municipalities attract trips in the morning, and where do people travel in the afternoon? The following two chord diagrams show the morning and afternoon peak hour traffic flows between the municipalities. Based on inspection of figure 3 in chapter 5, peak hour traffic was defined as the hours between 6 and 9 in the morning and between 16 and 19 in the afternoon. One can note the difference between the two diagrams: Ronneby, for instance, is a municipality that many travelers leave in the morning, commonly to go to Karlskrona, which in turn attracts a large number of these trips.
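Selecting the peak hour trips amounts to filtering on the boarding time before the OD-matrix is built. A minimal sketch, assuming the boarding time is given as a decimal hour; the default windows follow the definition in the text, and treating them as half-open intervals is an assumption.

```python
def peak_period(boarding_hour, morning=(6, 9), afternoon=(16, 19)):
    """Return 'morning', 'afternoon' or None for a boarding time in hours."""
    if morning[0] <= boarding_hour < morning[1]:
        return "morning"
    if afternoon[0] <= boarding_hour < afternoon[1]:
        return "afternoon"
    return None  # off-peak


# Illustrative trips: (boarding hour, origin, destination).
trips = [(7.5, "Ronneby", "Karlskrona"), (12.0, "Karlskrona", "Ronneby"),
         (16.25, "Karlskrona", "Ronneby")]
morning_trips = [t for t in trips if peak_period(t[0]) == "morning"]
print(morning_trips)
```

Because the windows are parameters, other definitions of the peak hours can be tested without changing the function.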

Figure 11 - Illustrating morning peak hour traffic (7-10) between the different municipalities.

In the same way, one can note that more trips are generated from Karlskrona to Karlshamn and Ronneby in the afternoon. As most trips occur within the municipalities rather than between them, it is of interest not to focus only on the external trips. As mentioned in chapter 5, all 1094 stations were assigned both to a municipality and to a subpart of a municipality. The next step is therefore to analyze the flow of traffic within a specific municipality divided into these subparts (described in chapter 5). Figure 13 represents the flow of passengers in Karlshamn during the third of October. Karlshamn was divided into 13 subparts. The other municipalities are also represented in the diagram, so that one can analyze from where within Karlshamn travelers leave and where they go, including trips out of the municipality.

Figure 12 - Illustrating afternoon peak hour (16-19) between the different municipalities

Though rather hard to grasp due to the large number of nodes, figure 13 illustrates the flow of passengers in Karlshamn. One can for example note the larger flow from the southeastern inner-city area of Vägga to the urban area of Mörrum, situated west of Karlshamn. The diagram in figure 13 was created using the advanced algorithm.

Figure 13 - OD within Karlshamn

8. Analysis of results

The first analysis concerns the results of the transit analysis. Lo (2003) defines a transit by a set of rules, with the time window as one of the conditions. However, the exact time threshold is not defined but depends on the frequency of vehicles. In the algorithm, a transit was defined rather arbitrarily as a passenger waiting less than 30 minutes between leaving the first vehicle and boarding the second. This threshold was chosen because a passenger would in general not have time for any purpose other than transiting during such a brief time frame. It turned out that this definition of a transit was in fact a bit too general. Some shorter trips got the same origin as destination because the duration of the stay was too short. As an example: Peter boards bus number 1 at 11.05 to buy some groceries. He leaves the bus at 11.09 at the shop, does his shopping and returns with the next bus at 11.35. This is a fully reasonable travel story. However, with the 30-minute threshold, short errands of this type were treated in the model as a transit rather than as a trip.

One way to overcome this problem would be to replace the fixed threshold for all trips with a threshold that is a fraction of the total trip length. It makes sense that a traveler would accept a longer transit time if the total trip is longer. There are numerous potential ways of defining a transit. The algorithm by Trépanier & Chapleau (2006) used a threshold distance of 2 km, reflecting the assumption that transit users would not walk farther than that to the next station.
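The difference between a fixed and a proportional threshold can be shown with Peter's errand. The sketch below is an illustration under assumptions: the duration of the preceding ride is used as a proxy for the trip length, and the fraction of 0.5 is a made-up value, not one proposed or tested in the project.

```python
def is_transfer(ride_min, gap_min, fixed_threshold=30.0, fraction=None):
    """Decide whether the gap before the next boarding counts as a transfer.

    With fraction=None the fixed threshold is used; otherwise the threshold
    is fraction * ride_min, so longer rides tolerate longer gaps.
    """
    threshold = fixed_threshold if fraction is None else fraction * ride_min
    return gap_min < threshold


# Peter: rides 11.05 to 11.09 (4 min), boards again at 11.35 (26 min later).
print(is_transfer(ride_min=4, gap_min=26))                # fixed 30 min rule
print(is_transfer(ride_min=4, gap_min=26, fraction=0.5))  # proportional rule
```

Under the fixed rule the errand is read as a transfer, while the proportional rule treats the shop as a destination. A 20-minute gap after a 60-minute ride would still count as a transfer under both rules.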

The station most frequently used for transits was station number 1041 – Amiralen Lyckeby. A brief analysis of the station and conversations with local travelers indicated that the station is a hub connecting several of Karlskrona's major city bus lines.

The second part of the results concerns both the produced OD-matrices and the comparison of the two algorithms: the “simpler” one and the “advanced” one, which takes the timetable into account. At first sight the simpler algorithm seems to perform decently, as figures 7 and 8 to a large extent resemble each other. However, a more detailed study shows that there are in fact rather large differences between the two, as the advanced algorithm discards more blips: 36% of the blips were discarded by the advanced algorithm, compared with 19% by the simpler algorithm. Although the initial impression might be that a higher number of estimated trips is always better, a higher number may in fact come with more doubtful estimations. The share of trips discarded by the advanced algorithm might seem large, but one should keep in mind that Trépanier & Chapleau (2006) managed to estimate 66% of their trips and still pointed out that the effect on the results for regular routes during peak hours would not be too strong.

Regarding the peak hour analysis in figures 11 and 12, it seems reasonable that Karlskrona, being both the largest town and the capital of the county, attracts many commuters from other municipalities. Besides the labor opportunities, the city also has a college (Blekinge Tekniska Högskola, BTH) that likely attracts daily commuters.

The results obtained for trips within Karlshamn (figure 13) were created using the advanced algorithm; however, they did not take peak hour traffic levels into account. As highlighted at the beginning of chapter 7, the results are to be seen only as examples of the many outputs possible with the algorithm. The purpose of figure 13 was rather to show how the more detailed level of clustering can reveal interesting features to the traffic planners of the studied town.

To compare how the two algorithms differ at a more detailed level, the analysis in figure 13 was also carried out using the simpler algorithm. As with the 5*5 matrix (figures 7 and 8), it can be concluded that the advanced algorithm discards more trips than the simpler one. Figure 14 shows the positive difference between the two, that is, the trips that were discarded by the advanced algorithm.

Figure 14 – Differences in OD within Karlshamn between simplified algorithm and advanced

The differences between the two algorithms are highlighted once again in order to analyze the travel patterns at a detailed level within the municipalities. One notes that the two differ slightly more for one particular type of trip: trips going to and from the municipality. For instance, trips from Erik Dahlbergsvägen or Karlshamn station to Ronneby, and trips from Ronneby to Karlshamn station, are in absolute numbers more frequently neglected by the advanced algorithm.

9. Conclusions and future work

The potential outputs of the algorithm are numerous. Depending on what is of interest, the algorithm can provide an efficient way of detecting general travel patterns in public transport. Broader analyses were shown in figures 7 and 8, and more detailed ones in figure 13. Figures 11 and 12 showed how the algorithm can illustrate peak hour traffic.

• What will the difference be when implementing the time plan in the algorithm?

On a broad scale, the simple algorithm performs similarly to the advanced one; the major flows are covered and represented, even without timetables. Studying the differences in more detail, one notes in figures 9 and 10 that the differences are significant for some of the trips. As expected, the advanced algorithm discards more trips than the simpler one, which in a sense increases the certainty of the model. In total, 42% more trips were estimated using the simpler algorithm.

• What are the biggest challenges to overcome while transferring raw data to results?

Two major obstacles were noted while working on the project: the data only features tap-ins, and the data required large amounts of filtering before the algorithm could be applied. Though the generated OD-matrices show intuitively good results, it is hard to validate that the model and algorithm produce truly reliable results, as the data only features the boarding station. Furthermore, the assumptions made regarding travel behavior might not be reasonable. As an example, the definition of a transit might need further improvement to better capture the true behavior of travelers. Since there is no direct way of determining a traveler's true alighting station from the given data set, the accuracy of the model cannot currently be calculated.

One way to validate the model and its accuracy would be to let passengers blip the card when alighting, thereby obtaining the time stamp and location of the tap-out. This would add a step for passengers when alighting the vehicle. The new step and the associated time might not be accepted by all passengers, and cheating with tap-outs might occur. The added complexity could potentially cause fewer people to use the public transport system. An alternative would be to let only a test group of some thousand travelers tap when alighting, rather than forcing the whole population of travelers to do so. The data from the test group would make it possible to measure the accuracy of the model. In the study by Li, et al. (2011), a similar algorithm was able to detect 85% of the true OD-demand, which at least gives a hint of how well the model may perform. Even though the model is hard to validate, one can still conclude that its initial output seems reasonable. The most frequently used transit station was a station visited by many buses, which seems intuitively reasonable. As expected, the model likely provides reasonable results on a broad, general scale, while performing less accurately when looking at specific trips in detail.
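With tap-out data from a test group, the accuracy could be measured by comparing estimated and observed alighting stations trip by trip. The sketch below assumes both are available as mappings from a trip identifier to a station; the identifiers, the station labels and the choice to score only trips with an estimated destination are assumptions for illustration.

```python
def destination_accuracy(estimated, observed):
    """Share of estimated destinations that match the observed tap-out station.

    Trips without an estimate (missing or None) are left out of the score.
    Returns None when no trip can be compared.
    """
    compared = [trip for trip in observed if estimated.get(trip) is not None]
    if not compared:
        return None
    hits = sum(1 for trip in compared if estimated[trip] == observed[trip])
    return hits / len(compared)


# Illustrative test-group data.
estimated = {1: "A", 2: "B", 3: None, 4: "C"}
observed = {1: "A", 2: "C", 3: "D", 4: "C"}
accuracy = destination_accuracy(estimated, observed)
print(f"{accuracy:.0%}")
```

Reporting the share of trips that could be compared alongside this score would show how much of the test group the estimate actually covers.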

The filtering and matching of data was a demanding process, since the given data was not designed to fit the specific model or algorithm. Some of the tap-in records did not in fact represent boardings at stations but rather situations where a passenger topped up the card with money. Other data-handling challenges included matching stations between different systems. The station numbers in the data from Blekingetrafiken's system did not match the station numbers in the national data from Samtrafiken, and a transformation was therefore required.
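The two cleaning steps described here can be sketched as a single pass over the records. The record layout, the `type` values and the identifier format of the national system are hypothetical; the actual file formats of Blekingetrafiken and Samtrafiken are not described in this report.

```python
def clean_records(records, id_map):
    """Keep boardings only and translate local station numbers.

    Returns the translated station identifiers and the local numbers that
    have no counterpart in id_map, so that they can be reviewed manually.
    """
    translated, unmatched = [], []
    for record in records:
        if record["type"] != "boarding":
            continue  # e.g. a top-up of the card, not a trip
        station = record["station"]
        if station in id_map:
            translated.append(id_map[station])
        else:
            unmatched.append(station)
    return translated, unmatched


# Illustrative records and mapping; "N-0001" is a made-up national identifier.
records = [
    {"type": "boarding", "station": 1041},
    {"type": "top_up", "station": 1041},
    {"type": "boarding", "station": 5555},
]
translated, unmatched = clean_records(records, {1041: "N-0001"})
print(translated, unmatched)
```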

References

Circos, 2017. Vizualize circos. [Online] Available at: http://mkweb.bcgsc.ca/tableviewer/visualize/ [Accessed 5 1 2018].

Kurauchi, F. & Schmöcker, J.-D., 2017. Public transport planning with Smart Card Data. Kyoto: Taylor & Francis.

Li, D. et al., 2011. Estimating a Transit Passenger Trip Origin-Destination Matrix Using Automatic Fare Collection System. Jinan, s.n.

Lo, H., 2001. Modeling transfer and non-linear fare structure in multi-modal network, Hong Kong: Hong Kong University of Science and Technology.

Riegel, L. K., Attanucci, J. P. & Murga, M., 2013. Utilizing Automatically Collected Smart Card Data to Enhance Travel Demand Surveys, s.l.: s.n.

Samtrafiken, 2017. Om Samtrafiken. [Online] Available at: https://samtrafiken.se/samtrafiken/om-samtrafiken/ [Accessed 3 1 2018].

Trépanier, M. & Chapleau, R., 2006. Destination Estimation from public transport smartcard data, Montréal: Centre de recherche sur les transports (CRT).
