PRISM: University of Calgary's Digital Repository

Graduate Studies The Vault: Electronic Theses and Dissertations

2015-06-16 Historical Vehicle Traffic Analysis and Commute Time Prediction Using Web Mining

Kaur, Charanjeet

Kaur, C. (2015). Historical Vehicle Traffic Analysis and Commute Time Prediction Using Web Mining (Unpublished master's thesis). University of Calgary, Calgary, AB. doi:10.11575/PRISM/26373 http://hdl.handle.net/11023/2302 master thesis

University of Calgary graduate students retain copyright ownership and moral rights for their thesis. You may use this material in any way that is permitted by the Copyright Act or through licensing that has been assigned to the document. For uses that are not allowable under copyright legislation or licensing, you are required to seek permission. Downloaded from PRISM: https://prism.ucalgary.ca

UNIVERSITY OF CALGARY

Historical Vehicle Traffic Analysis and Commute Time Prediction Using Web Mining

by

Charanjeet Kaur

A THESIS

SUBMITTED TO THE FACULTY OF GRADUATE STUDIES

IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE

DEGREE OF MASTER OF SCIENCE

GRADUATE PROGRAM IN ELECTRICAL AND COMPUTER ENGINEERING

CALGARY, ALBERTA

MAY, 2015

© Charanjeet Kaur 2015

Abstract

Analyzing historical vehicle traffic data has many applications including urban planning and intelligent in-vehicle route prediction. A common practice to acquire this data is through roadside sensors. This approach is expensive because of infrastructure and planning costs and cannot be easily applied to new routes. A Web mining approach is proposed to address these limitations. The proposed system gathers information about vehicle commute times, accidents, and weather reports from heterogeneous Web sources. This information is combined to support vehicle traffic analytics. Clustering analysis is performed on historical data that investigates the traffic patterns of highways and arterial roads with factors having the most impact on commute time. A commute time prediction model is built on historical vehicle traffic data analytics.

The commute time prediction model is trained on the traffic problems observed in the past and forecasts commute time while incorporating the impact of external factors such as weather and accidents.

Preface

Conference Proceeding:

Kaur, C., Krishnamurthy, D., Far, B.H., Using Web Mining to Support Low Cost Historical Vehicle Traffic Analytics, 26th International Conference on Software Engineering and Knowledge Engineering, SEKE 2014.

Acknowledgements

I would like to take the opportunity to thank all the people who made this work possible. My deepest gratitude and appreciation go to my supervisor, Dr. Diwakar Krishnamurthy, for his remarkable guidance. I am sincerely grateful to my co-supervisor, Dr. Behrouz H. Far, for teaching and inspiring me over the past two years. I would like to thank both professors for their valuable time and efforts to make this research possible and also for providing financial support throughout my research.

I would also like to thank my friend, Sukhpreet Dhaliwal, for helping me refine the thesis and for providing suggestions. Last but not least, I am grateful to my parents for supporting me spiritually throughout my life. I thank my husband for standing by me through the good and bad times.

Dedication

I dedicate this thesis to my parents for making me who I am, and to my husband, Amandeep Sekhon, for supporting me all the way.

Table of Contents

Abstract
Preface
Acknowledgements
Dedication
Table of Contents
List of Tables
List of Figures and Illustrations
List of Symbols, Abbreviations and Nomenclature

CHAPTER ONE: INTRODUCTION
  1.1 Background and Motivation
  1.2 Research Objectives
  1.3 Contributions
  1.4 Thesis Organization

CHAPTER TWO: LITERATURE REVIEW
  2.1 Background
  2.2 Related Work
    2.1.1 Sensor Infrastructures for Traffic Monitoring
    2.1.2 Traffic Characterization
    2.1.3 Travel Time Prediction Models
  2.3 Summary

CHAPTER THREE: WEB DATA COLLECTION METHODOLOGY
  3.1 Data Collection Process
    3.1.1 Google Maps
    3.1.2 Twitter Search API
    3.1.3 Historical Weather Reports
  3.2 Data Overlaying
  3.3 Summary

CHAPTER FOUR: HISTORICAL VEHICLE TRAFFIC ANALYSIS OF
  4.1 Deerfoot Trail Commute Time Analysis
    4.1.1 Q1. What are the peak and off-peak traffic hours?
    4.1.2 Q2. What are the traffic conditions at peak hours?
    4.1.3 Q3. How are peak hours related to number of accidents?
    4.1.4 Q4. What are the bottleneck road segments during peak hours?
  4.2 Summary

CHAPTER FIVE: UNIQUE TRAFFIC PATTERNS WITH K-MEANS CLUSTERING
  5.1 Clustering Analysis
  5.2 Commute time patterns
  5.3 Summary

CHAPTER SIX: COMMUTE TIME PREDICTION MODEL
  6.1 Prediction Model
  6.2 Prediction Performance
  6.3 Summary

CHAPTER SEVEN: CONCLUSIONS AND FUTURE WORK
  7.1 Conclusions
  7.2 Future work

REFERENCES

List of Tables

Table 4.1: Deerfoot Trail - north to south commute time (minutes) statistical results at morning hours
Table 4.2: Deerfoot Trail - north to south commute time (minutes) statistical results at evening hours
Table 4.3: Deerfoot Trail - south to north commute time (minutes) statistical results at morning hours
Table 4.4: Deerfoot Trail - south to north commute time (minutes) statistical results at evening hours
Table 6.1: Accuracy rate of predicted commute time
Table 6.2: Performance of prediction model for each cluster
Table 6.3: Relative percentage difference for cluster C5 for various time differences between predicted and actual commute times

List of Figures and Illustrations

Figure 3.1: Google Maps Data Collection Process
Figure 4.1: Probability distribution of commute times at morning peak hours
Figure 4.2: Probability distribution of commute times at morning peak hours
Figure 4.3: Probability distribution of commute times at afternoon peak hours
Figure 4.4: Probability distribution of commute times at afternoon peak hours
Figure 4.5: Probability distribution of commute times at evening peak hours
Figure 4.6: Probability distribution of commute times at evening peak hours
Figure 4.7: Probability distribution of accidents at Deerfoot Trail during morning peak hours (north to south)
Figure 4.8: Probability distribution of accidents at Deerfoot Trail during morning peak hours (south to north)
Figure 4.9: Probability distribution of accidents at Deerfoot Trail during evening peak hours (north to south)
Figure 4.10: Probability distribution of accidents at Deerfoot Trail during evening peak hours (south to north)
Figure 4.11: Percentage of days having maximum commute time (per KM)
Figure 4.12: Percentage of days having maximum commute time (per KM)
Figure 4.13: Average commute time at each segment
Figure 4.14: Average commute time at each segment
Figure 4.15: Segment 2 - Morning rush hours
Figure 4.16: Segment 2 - Morning rush hours
Figure 4.17: Segment 2 - Evening rush hours
Figure 4.18: Segment 3 - Morning rush hours
Figure 4.19: Segment 3 - Evening rush hours
Figure 4.20: Segment 3 - Evening rush hours
Figure 4.21: Segment 4 - Evening rush hours
Figure 4.22: Segment 5 - Morning rush hours
Figure 5.1: Appropriate number of clusters for North to South direction of Deerfoot Trail
Figure 5.2: Appropriate number of clusters for South to North direction of Deerfoot Trail
Figure 5.3: Distribution of days into categories based on weather and accidents for Deerfoot Trail (north to south)
Figure 5.4: Days with no accidents (North to South)
Figure 5.5: Day with no accidents (South to North)
Figure 5.6: Traffic Pattern for Weekends on Good Weather Days with accidents
Figure 5.7: Traffic Pattern for Weekends on Good Weather Days with accidents
Figure 5.8: Regular Traffic Pattern for Working Days
Figure 5.9: Regular Traffic Pattern for Working Days
Figure 5.10: Traffic Pattern for Working days with more number of accidents
Figure 5.11: Traffic Pattern for Working days with more number of accidents
Figure 5.12: Regular Traffic Pattern for Working days on Bad Weather
Figure 5.13: Regular Traffic Pattern for Working days on Bad Weather
Figure 5.14: Random Traffic Pattern on Snowy Days with more number of accidents (North to South)
Figure 5.15: Random Traffic Pattern on Snowy Days with more number of accidents (South to North)
Figure 6.1: Prediction process
Figure 6.2: Naïve Bayes Classifier
Figure 6.3: Day predicted in cluster C1
Figure 6.4: Day predicted in cluster C2
Figure 6.5: Day predicted in cluster C3
Figure 6.6: Day predicted in cluster C4
Figure 6.7: Day predicted in cluster C5

List of Symbols, Abbreviations and Nomenclature

Symbol Definition

API Application Programming Interface

CCTV Closed Circuit Television

CSV Comma Separated Values

FCMP Fuzzy Clustering Multiple Prototype

GPS Global Positioning System

HTTP Hyper Text Transfer Protocol

IMEI International Mobile Equipment Identity

ICT In Current Traffic

ITS Intelligent Transportation System

IVC Inter-Vehicle Communication

JSON JavaScript Object Notation

PDF Probability Distribution Function

QoE Quality of Experience

WEKA Waikato Environment for Knowledge Analysis


Chapter One: INTRODUCTION

1.1 Background and Motivation

With the rapid increase in traffic volumes on arterial roads and highways, several approaches have been adopted to improve existing traffic conditions. Many enhancements have been made to accommodate the expected increase in traffic volumes, to maintain continuous vehicular flow, and to reduce traffic congestion.

Traffic congestion is a problem that countries around the world are coping with. A few studies reviewed the state of congestion in Canadian cities and found that in many of the country’s large urban areas it has reached serious levels that are imposing significant costs on drivers, the economy, the environment, and the quality of life of Canadians [1, 2]. Transport Canada [1] calculated the economic cost in major Canadian cities from longer travel times and the additional cost of less reliable travel times requiring people to include contingency time in their travel.

Transport Canada calculated the total economic cost of congestion by multiplying the amount of time that commuters and other drivers lost due to congestion by the assumed value those travelers placed on their time. These costs amount to $5.2 billion per year in Canada’s five largest cities: Toronto, Montreal, Vancouver, Calgary, and Ottawa [1].

An efficient and reliable transportation network is required to relieve urban congestion and reduce traffic accidents. One of the approaches is to expand the road capacity. However, evidence shows that building more roads to address congestion in Canada’s largest cities is not only impractical from a cost perspective and restrictions of land area, but also ineffective: the new road space is used up as fast as it is built and congestion remains unaffected [2].


A more viable approach is to use the existing network resources more efficiently to provide a better level of road service. Intelligent transportation systems (ITS) offer a way to address congestion problems. ITSs collect, process, and broadcast information to users across transportation networks in order to improve efficiency and safety [3]. A variety of intelligent transportation systems have already been developed and applied in transportation networks to improve traffic flow, reduce traffic congestion, and predict commute times [4, 5, 6].

The main emphasis of this research is to support historical traffic analyses in order to understand the traffic congestion problems of the past. The results of these analyses can be used by traffic controllers to avoid similar congestion problems in the future. The second focus of this research is to build a commute time prediction model that takes the historical analyses into consideration.

Here, commute time prediction is the process of estimating the anticipated commute time at a future time by using historical data. Commute time prediction information can be delivered to road users either for pre-trip planning or during the trip [7]. Pre-trip commute time information enables the user to decide on the best route to take, while commute time provided during the trip gives the user the option to take an alternative route with a shorter commute time, or at least relieves the anxiety that results from being unaware of the situation [7].

Traffic data needed to conduct these studies can be collected in various ways. A common practice to acquire this data is through roadside sensors. This approach is expensive because of infrastructure and planning costs. Furthermore, it cannot be easily applied to new routes. The most significant motivation of this research is to build a traffic analysis and prediction system which does not require an expensive sensors-based data collection platform.


In this research, Web mining is proposed as an alternative approach that collects traffic data from existing Web applications. This approach leverages the capabilities of existing Web sources and does not require additional infrastructure to capture traffic data. Very few studies have focused on data collection from Web sources. Most of the existing work relies on data collected from roadside sensors and detectors. To the best of my knowledge, there exists one study that focused on low-cost data collection, in which traffic data is collected from Bing maps to acquire the traffic flow [8]. The drawback of [8] is that it only includes data from one Web source, i.e., Bing maps, and does not integrate this information with other types of data such as traffic accidents on those roads or weather conditions in the city. This limitation is addressed in my research by incorporating data from multiple Web sources.

In contrast to the traditional approach of relying on specialized roadside instrumentation, my proposed approach is more flexible: it can be adapted with little effort and cost to analyze any road for which such Web data is available, and it can incorporate new data sources and additional factors that affect commute time. Integrating data from multiple Web sources not only identifies the congested areas of the city but also helps explain the problems that lead to the congestion. This data is useful for performing a detailed historical analysis that helps identify traffic congestion issues.

This data will also prove beneficial to build a commute time prediction model as this historical data will reveal the congestion problems which could be incorporated into the prediction model.

Heterogeneous data collection from multiple Web sources will be beneficial for both (i) analyzing historical traffic patterns, and (ii) building commute time prediction models.


1.2 Research Objectives

This research has three main objectives:

a) Supporting historical vehicle traffic analysis.
b) Identifying commute time classification patterns.
c) Developing a prediction model based on the historical vehicle traffic analyses.

These objectives can help find answers to problems related to road traffic by identifying factors that impact the flow of traffic.

The key metric of interest in this research is commute time. Commonly available Web APIs such as the Google Maps API [9] can be used to capture commute time information. However, there are several challenges that need to be addressed before such data can be used. For example, the Google Maps API provides estimates of commute time for a specified route at the instant at which the API query is made. However, it does not support querying of commute time data for arbitrarily specified days and time instants in the past. This makes it difficult to carry out historical traffic analysis. Furthermore, no single Web source provides information about factors that can have an impact on commute time. For example, information about accidents occurring on a route can be queried by mining Tweets [10] generated by commuters and information about weather conditions can be obtained from several Web sites [11][12].

However, such information needs to be correlated in an automated manner with the commute time information to support traffic analyses. To the best of my knowledge, no other system offers historical commute time data and allows analysts to combine such data with metrics that can help reason about observed commute time patterns. Implementing such a system is a key aim of this research.

The second objective of my research is to demonstrate how data captured by such a system can be used to support historical traffic analysis. Specifically, this study shows how the data collected by the system can be exploited to answer the following research questions (RQs) about a given route:

RQ1. What are the peak and off-peak traffic hours?

RQ2. What are the traffic conditions at peak hours?

RQ3. How are peak hours related to number of accidents?

RQ4. What are the bottleneck road segments during peak hours?

The third objective of my research is to exploit historical trends obtained from such analysis to develop commute time prediction models. Specifically, the research is interested in predicting the commute time for an arbitrarily specified route, given the time of travel and the weather conditions on that day.

1.3 Contributions

Several new contributions have been made in this research. Firstly, to the best of my knowledge, no study has exploited Web mining for historical traffic analysis that incorporates the external factors which impact traffic on the roads. In this study, all the data on the factors that impact vehicle traffic is collected from multiple free Web sources. For example, the factors which impact the traffic flow may include the following:


• Number of cars, speed, occupancy
• Time of the day, weekday, weekend
• Weather, temperature
• Accidents, detours, lane closures
• Glare (direction of sun)

The information about these factors can be collected from multiple Web sources such as commute time and route information from Google Maps [9]; accidents and events information from Twitter [10]; and temperature, rain or snow information from the weather websites [11].

A Web mining driven traffic analytics system is developed in this research. The system captures and stores information about the factors that impact traffic at 19 time instants of the day. A detailed analysis is performed on this data to measure the impact of each factor on commute time, e.g., how many minutes does an accident add to the commute time? In order to perform this detailed historical traffic analysis, data is collected from three Web sources, namely Google Maps, Twitter, and the climate.weather.gc.ca website. The data from these Web sources is combined on the basis of the common factor of day and time. For example, commute time collected from Google Maps, accident data collected from Twitter, and weather data from the climate.weather.gc.ca website are merged on day and time, i.e., what was the commute time when an accident occurred at that time, and what were the weather conditions on that day? In this way, the impact of external factors such as accidents and weather conditions is measured.
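The merging step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system built in this research: the function name `overlay`, the field names, and all sample rows are hypothetical, and weather is assumed to be reported once per day so it is joined on the date alone.

```python
from collections import defaultdict


def overlay(commute_rows, accident_rows, weather_rows):
    """Join three sources on the common day/time key.

    commute_rows:  iterable of (date, time, commute_minutes)
    accident_rows: iterable of (date, time), one entry per reported accident
    weather_rows:  iterable of (date, snowfall_cm), one entry per day (assumed)
    """
    # Count accidents reported for each (date, time) instant.
    accidents = defaultdict(int)
    for date, time in accident_rows:
        accidents[(date, time)] += 1

    # Daily weather lookup keyed by date only.
    weather = {date: snow_cm for date, snow_cm in weather_rows}

    merged = []
    for date, time, minutes in commute_rows:
        merged.append({
            "date": date,
            "time": time,
            "commute_min": minutes,
            "accidents": accidents.get((date, time), 0),
            "snow_cm": weather.get(date),  # None if no report for that day
        })
    return merged


# Hypothetical sample rows, for illustration only.
commute = [("2014-01-06", "08:00", 41), ("2014-01-06", "17:00", 58)]
tweets = [("2014-01-06", "17:00"), ("2014-01-06", "17:00")]
weather = [("2014-01-06", 3.0)]
rows = overlay(commute, tweets, weather)
```

Each merged record then answers the question posed above for one time instant: the commute time, the number of accidents at that instant, and the weather on that day.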


In this research, I applied a clustering technique to group the days on which external factors had similar effects on commute time. Clustering identified 6 unique traffic patterns from the 153 days for which the data was collected. Each traffic pattern represents a group of days with a different combination of external factors, such as the amount of snowfall and the number of accidents, and a correspondingly different impact on commute time. In the next major part of the system, a commute time prediction model is designed which uses the information contained in these clusters. Information on the impact of external factors in the past is used to predict commute times for future days.

Specifically, the analysis system collects and stores information for 18 heavily used arterial roads and highways in the city of Calgary, Alberta, Canada. This information is also collected for the sub-segments of these 18 roads and highways in order to study the traffic problems related to specific intersections. The system also extracts reports of accidents on these roads by mining Twitter posts, i.e., Tweets, related to these accidents and overlays this information on the commute time data. Finally, detailed information about weather conditions such as temperature, snowfall, and snow accumulation on the ground is mined from the Web and associated with the other pieces of data. The utility of this system is illustrated through a study that investigates the traffic patterns of the busiest highway in Calgary, i.e., Deerfoot Trail, along with the factors having the most impact on commute time on this highway.

As a result, new insights are provided on commute time patterns and the factors that influence them. These insights are found for the whole road as well as for the sub-segments of the road. A few of the results found in this study are:


• The evening peak period is significantly longer than the morning peak period: the evening peak lasts three hours, whereas the morning peak lasts one and a half hours.

• Far more accidents occur during peak hours than during off-peak hours. The highest number of accidents recorded in the historical data at a single peak-hour time instant was 9, whereas off-peak hours never had more than one accident at the same time.

• The commute time for the studied segment of Deerfoot Trail is 35 minutes in off-peak hours under normal traffic conditions, i.e., no accidents and no inclement weather. Commute time can rise to one hour with a moderate number of accidents on a day with no snowfall, and to more than two hours for the same distance under inclement weather conditions with a high number of accidents.

• On a blizzard day in Calgary [13] with 25 cm of snow and 9 accidents at 3 PM, commute time rose to 2 hours and 6 minutes. The number of accidents at a single time instant was 5 times the number observed on a clear-weather day.

• Traffic accidents posted on Twitter are associated with a particular intersection of the highway by parsing the text of the tweets. Parsing the tweets collected over a period of time helped to identify hotspots on the highway. The bottleneck segment in the southbound direction of Deerfoot Trail is the segment near Downtown, that is, 16 Avenue to 17 Avenue, in the time period between 4:30 PM and 5:30 PM. Surprisingly, the Downtown segment is not the bottleneck in the other direction. In the northbound direction, the bottleneck is the segment from 32 Avenue to 64 Avenue in the time period between 5:00 PM and 5:30 PM.


The above information could be useful for those who manage the highway, researchers who want to calibrate and validate traffic simulators for the highway, and those who are interested in predicting commute times based on historical trends.

To develop the predictive model, the 202 days of historical data are divided into two sets of 153 days and 49 days. Clustering is performed on the first set of 153 days in order to recognize trends in the commute time on Deerfoot Trail, and this set is used as the training dataset. The remaining 49 days of data are not clustered and are reserved as the prediction dataset, also known as the testing dataset.

Clustering of the 153 days identified 6 unique traffic patterns of commute times for Deerfoot Trail. These distinct clusters are identified on the basis of time of the day, day of the week, weather conditions, and traffic accidents, where each pattern shows dissimilar results for the combination of these factors. Each traffic pattern differs from the others in terms of the commute time needed to travel the same road at 19 time instants of the day. Each traffic pattern is assigned a centroid, which is the mean commute time at each of the 19 time instants over the days falling into that particular cluster. For example, days with no accidents and clear weather conditions form one unique pattern and differ from days with a large number of accidents and heavy snowfall, which fall into another traffic pattern.

The other contribution is the development of a commute time prediction model which incorporates the impact of traffic accidents and weather conditions on commute time. Many commute time prediction models have already been proposed and applied in traffic systems [5, 6], and they perform well in case studies. However, incorporating weather information into the prediction model has only recently been studied, and very few researchers have included weather information as part of their prediction models. Most of these studies either use regression models that include weather information as an explanatory variable or are simulation-based approaches [14, 15].

I developed a model that can predict the commute times for a route given the following inputs: a) day of the week; b) time of the day; and c) weather condition, i.e., bad (snowing or snow on the ground) or good (zero snowfall). The model uses the six clusters identified previously. It first assigns the day for which a prediction is requested to one of these 6 clusters using a Naïve Bayes machine learning classification technique [16]. It then uses the time of the day and the historical accident information for that instant to estimate commute time.

Specifically, the best-case commute time to travel the entire stretch of Deerfoot Trail in Calgary is about 35 minutes. Based on historical analysis, each accident inflates commute time by about 7 minutes. Hence, one can estimate the commute time for any given time of the day from the historical probability distribution function of accidents at that instant. A probability distribution function for a random variable provides, for each possible value of that random variable, the probability of observing that value.
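As a minimal sketch of this estimate, the Python snippet below combines the 35-minute best-case time and the roughly 7 minutes per accident quoted above with an accident distribution at one time instant. Taking the expected number of accidents is an assumption made here for illustration, the function name `expected_commute` is hypothetical, and the probabilities in the example are invented.

```python
BASE_MINUTES = 35.0         # best-case commute time quoted in the text
MINUTES_PER_ACCIDENT = 7.0  # approximate delay per accident quoted in the text


def expected_commute(accident_pdf):
    """Estimate commute time at one time instant.

    accident_pdf maps a number of accidents to the historical probability
    of observing that many accidents at the instant of interest.
    """
    total = sum(accident_pdf.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Expected accident count under the distribution (an assumed summary).
    expected_accidents = sum(n * p for n, p in accident_pdf.items())
    return BASE_MINUTES + MINUTES_PER_ACCIDENT * expected_accidents


# Invented distribution: 50% no accident, 30% one accident, 20% two accidents.
estimate = expected_commute({0: 0.5, 1: 0.3, 2: 0.2})
```

With this invented distribution the expected number of accidents is 0.7, so the estimate is 35 + 7 × 0.7 = 39.9 minutes.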

Results show that for 76.8% of days in the prediction dataset, the difference between actual and predicted commute times is less than 5 minutes. The model was particularly effective in predicting commute times for workdays under both good and bad weather conditions.

Furthermore, the model’s predictions are significantly more accurate than a technique that only considers the average commute times at the desired time instant in the assigned cluster.
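The accuracy measure reported above can be illustrated with a short Python sketch. It assumes that each actual and predicted commute time forms one paired observation; how the thesis aggregates the 19 time instants of a day is not specified here, and the function name `fraction_within` and the sample values are hypothetical.

```python
def fraction_within(actual, predicted, tolerance_min=5.0):
    """Fraction of paired observations whose absolute error is below tolerance."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) < tolerance_min)
    return hits / len(actual)


# Invented commute times in minutes, for illustration only.
actual_min = [36, 52, 75, 40]
predicted_min = [38, 60, 73, 44]
accuracy = fraction_within(actual_min, predicted_min)
```

In this invented example the absolute errors are 2, 8, 2 and 4 minutes, so three of the four predictions fall within the 5-minute tolerance.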


1.4 Thesis Organization

Chapter 1 introduces the research and the importance of traffic analysis and commute time prediction. Chapter 2 provides a comprehensive review of the literature on vehicle traffic analysis and its application in existing systems, the impact of traffic accidents and weather on the traffic stream, and related prediction models. Chapter 3 introduces the statement of the problem and the methodology for collecting data from various Web sources, which is tested on a case study in this research. Chapter 4 discusses the results of the historical vehicle traffic analysis conducted on a case study of a major highway in the city of Calgary. Chapter 5 describes the clustering method used to identify the unique commute time patterns from the traffic data collected over a period of six months. Chapter 6 proposes a commute time prediction model based on the historical traffic data analysis; this model also incorporates the impact of traffic accidents and weather conditions on commute time. Finally, Chapter 7 presents the conclusions and directions for future research.


Chapter Two: LITERATURE REVIEW

This chapter presents a detailed description of scholarly work done by other researchers in the field related to this dissertation, along with background on the techniques used for analyzing historical traffic and for predicting commute time. The study conducted in this research concentrates on two parts, namely historical vehicle traffic analysis and a commute time prediction model. Vehicle traffic analysis and commute time prediction are two major components of ITS. Therefore, most studies focus on either historical traffic analysis or traffic prediction models; a few studies address both topics. This chapter discusses the applications developed so far related to these two topics.

Section 2.1 is the background section which explains the techniques used for historical vehicle traffic analysis and commute time prediction in this study. Section 2.2 describes the related work done in the field of historical traffic analysis and commute time prediction models. Section 2.3 summarizes and compares the current study with the previous studies.

2.1 Background

This research uses data mining techniques to infer insights from the data collected from heterogeneous Web sources. The two data mining techniques used in this study are clustering and classification. A data mining tool, WEKA, is used to apply these techniques to the data [17].

WEKA (Waikato Environment for Knowledge Analysis) is a collection of state-of-the-art machine learning algorithms for data mining tasks [17]. The algorithms can be directly applied to the dataset [17]. WEKA contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization [18]. The WEKA tool is used in this research for performing clustering and classification.

Clustering is the method of partitioning a set of data points into groups called clusters, where the data points in the same cluster are as similar as possible and dissimilar from the data points in other clusters [19]. There are many clustering techniques available to cluster the data into groups.

In this research, the k-means clustering technique [20] is used to group the traffic data into clusters, each having a unique commute time pattern. K-means clustering partitions n observations (data-points) into k clusters, where each data-point is assigned to the cluster with the nearest centroid [20]. The cluster centroid is the point whose coordinates equal the mean of the values of all the data-points in the cluster [20]. In its simplest form, the k-means method proceeds as follows [21].

Step 1. Specify the number of clusters and, arbitrarily, the members of each cluster.

Step 2. Calculate each cluster's centroid, and the distances between each observation and centroid. If an observation is nearer the centroid of a cluster other than the one to which it currently belongs, re-assign it to the nearer cluster.

Step 3. Repeat Step 2 until all observations are nearest the centroid of the cluster to which they belong.

Step 4. To study sensitivity to number of clusters, repeat Steps 1 to 3 with a different number of clusters and evaluate the results.
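Steps 1 to 3 above can be sketched in Python as follows. This is a minimal illustration rather than the WEKA implementation used in this research; the function names, the random seed, the iteration cap, and the two-dimensional sample points are assumptions made for the example (in this study each day would instead be a vector of commute times at 19 time instants).

```python
import random


def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: arbitrary initial membership; every cluster starts non-empty.
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    for _ in range(max_iter):
        # Step 2a: centroid = coordinate-wise mean of the cluster's members.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                members = [rng.choice(points)]  # re-seed an empty cluster
            centroids.append([sum(col) / len(members) for col in zip(*members)])
        # Step 2b: re-assign each observation to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points
        ]
        # Step 3: stop once no observation changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Invented observations forming two well-separated groups.
sample = [(35, 36), (36, 35), (34, 35), (60, 62), (61, 60), (62, 61)]
labels, centroids = kmeans(sample, k=2)
```

Step 4 corresponds to calling `kmeans` again with a different `k` and comparing the resulting partitions.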


The number of clusters, k, can be verified by calculating the intra-cluster variance and the inter-cluster variance. The approach presented by Menasce et al. [22] for determining the appropriate number of clusters for a given dataset is used here. This approach uses βvar (equation 2.3), the ratio of the intra-cluster variance (equation 2.1) to the inter-cluster variance (equation 2.2), to decide on the pertinent k-value, i.e., the number of clusters.

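$$\sigma^2_{\mathrm{intra}} = \frac{1}{k}\sum_{i=1}^{k}\left(\bar{d}_i - \bar{d}\right)^2 \qquad (2.1)$$

$$\sigma^2_{\mathrm{inter}} = \frac{1}{k(k-1)}\sum_{i=1}^{k}\sum_{j \neq i}\left(D_{i,j} - \bar{D}\right)^2 \qquad (2.2)$$

$$\beta_{var} = \frac{\sigma^2_{\mathrm{intra}}}{\sigma^2_{\mathrm{inter}}} \qquad (2.3)$$

where k = number of clusters,

$\bar{d}_i$ = average intracluster distance for cluster i, defined as the average distance of all points of cluster i to its centroid,

$\bar{d}$ = the sample mean, computed as $\frac{1}{k}\sum_{i=1}^{k}\bar{d}_i$,

$D_{i,j}$ = intercluster distance between cluster i and cluster j for i ≠ j,

$\bar{D}$ = the sample mean, calculated as $\frac{1}{k(k-1)}\sum_{i=1}^{k}\sum_{j \neq i} D_{i,j}$.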


For good quality clustering, the intra-cluster distance, i.e., the distance of feature vectors within a cluster from their centroid, must be low, while the inter-cluster distance, i.e., the distance from the centroid of one cluster to the centroid of another cluster, must be high. Therefore, a lower βvar value indicates better clustering. Clustering exercises with progressively higher k-values are carried out until the βvar value shows no appreciable decrease or starts to increase. The k-value that yields the lowest βvar is then chosen.
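The selection of k by the variance ratio can be sketched as follows. This is a hedged illustration rather than the exact computation used in this research: the function names are hypothetical, distances are assumed to be Euclidean, and both variances are assumed to be population variances (divided by the number of values). At least three clusters are needed, since with two clusters there is only one inter-cluster distance and its variance is zero.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def variance(values):
    """Population variance (assumed normalization: divide by the count)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def beta_var(points, labels, centroids):
    """Ratio of intra-cluster variance to inter-cluster variance."""
    k = len(centroids)

    # Average distance of each cluster's points to its own centroid.
    intra = []
    for c in range(k):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            raise ValueError(f"cluster {c} is empty")
        intra.append(sum(euclidean(p, centroids[c]) for p in members) / len(members))

    # Distances between every pair of distinct centroids.
    inter = [
        euclidean(centroids[i], centroids[j])
        for i, j in combinations(range(k), 2)
    ]
    inter_variance = variance(inter) if inter else 0.0
    if inter_variance == 0:
        raise ValueError("inter-cluster variance is zero; use k >= 3")

    return variance(intra) / inter_variance


def choose_k(beta_by_k):
    """Pick the k-value with the lowest beta_var among those tried."""
    return min(beta_by_k, key=beta_by_k.get)
```

In use, `beta_var` would be evaluated on the clustering obtained for each candidate k, the values collected in a dictionary keyed by k, and `choose_k` applied to that dictionary.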

The other data mining technique used in this research is classification, which is used to perform the commute time prediction. Classification is a data mining method that assigns data points in a collection to target categories or classes [23]. The goal of classification is to accurately predict the target class for each point in the data. A classification task begins with a dataset in which the class assignments are known; this dataset is known as the training dataset [23].

The other dataset is the testing dataset, in which the data points have not been assigned a class. Classification assigns each of these data points one of the classes defined in the training dataset. The Naïve Bayes classifier is used in this research because it performed better than the other classifiers it was compared with. The Naïve Bayes classifier [24] is a simple probabilistic classifier based on applying Bayes’ theorem [25], with parameters estimated by the method of maximum likelihood. Depending on the precise nature of the probability model, the Naïve Bayes classifier can be trained very efficiently in a supervised learning process [24]. It calculates the probability of each feature individually contributing to the classification of a new data point [26], assuming that the value of a particular feature is independent of the value of any other feature. Commute time at a given monitored time instant is approximated as being independent of the commute times at other time


instants. Results from Chapter 6 show that good prediction can be obtained under this approximation, which supports the assumption. The classifier works in two steps [24]:

1. Training step: Using the training dataset, the method estimates the parameters of a probability distribution, assuming the predictors are conditionally independent given the class.

2. Prediction step: For any unseen test sample, the method computes the posterior probability of that sample belonging to each class. The method then classifies the sample according to the largest posterior probability.
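The two steps can be illustrated with a small categorical Naïve Bayes sketch. It is not the WEKA classifier used in this research; the function names, the feature layout (day, period, weather), the class labels, and the use of add-one (Laplace) smoothing are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train(rows, labels, alpha=1.0):
    """Training step: count classes and per-class feature values."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    feature_values = [set() for _ in rows[0]]
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
            feature_values[j].add(value)
    return {
        "class_counts": class_counts,
        "value_counts": value_counts,
        "feature_values": feature_values,
        "alpha": alpha,  # smoothing constant (assumption, not from the thesis)
        "n": len(labels),
    }


def log_scores(model, row):
    """Log of prior times per-feature likelihoods (proportional to posterior)."""
    alpha = model["alpha"]
    scores = {}
    for label, count in model["class_counts"].items():
        score = math.log(count / model["n"])
        for j, value in enumerate(row):
            seen = model["value_counts"][(j, label)][value]
            distinct = len(model["feature_values"][j])
            # Features are treated as independent given the class.
            score += math.log((seen + alpha) / (count + alpha * distinct))
        scores[label] = score
    return scores


def predict(model, row):
    """Prediction step: return the class with the largest posterior score."""
    scores = log_scores(model, row)
    return max(scores, key=scores.get)
```

A model trained on rows such as `("Mon", "peak", "snow")` labelled with a commute time category can then be queried with `predict(model, ("Tue", "peak", "clear"))` for an unseen combination of factors.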

2.2 Related Work

This section reviews studies conducted in the field of historical vehicle traffic analysis and commute (or travel) time prediction. The studies differ in their methods of data collection, and many of them used dedicated sensor infrastructures to collect road traffic data from highways or arterial roads. The section first covers studies that installed sensors to gather traffic data, and then studies that analyzed traffic data to identify the characteristics that affect road traffic. Finally, it covers commute time prediction models. Most of these studies assess the current state of the road traffic and estimate the commute time in the near future, while some first used historical data to identify past traffic problems and then predicted commute times. A few studies of each type are described in this section.


2.2.1 Sensor Infrastructures for Traffic Monitoring

Soriguera et al. [27] designed a method to derive recurrent traffic demand patterns at different hours of the day from historical data. The only data collected for this study is the traffic volume on a Spanish highway, i.e., the number of vehicles moving on the road at a specific time instant. The derived traffic demand patterns show the mean number of vehicles on the highway at each hour. The method is based on clustering analysis and identified 32 and 24 demand patterns for the northbound and southbound directions of the highway, respectively. The historical data was collected for 500 hours from roadside sensors. My research also uses clustering to define the traffic trends of roads. An advantage of this study over my research is that it uses the traffic volume of the roads to measure congestion. However, it does not provide the commute time needed to travel the roads, and it does not correlate the traffic data with external factors such as accidents and weather. The historical data collected for this study covers only 20 days. In comparison, my research collected historical commute times over a period of six months and correlated them with external factors that impact the commute time on the roads.

Pu et al. [28] proposed an interactive visual analytics system, T-Watcher, for monitoring and analyzing complex traffic situations in big cities using taxi trajectory data. Historical trajectory data from more than 200 taxis in the city was collected for this research, and a visual analysis system was built to visualize the large-scale data. The visualization system enables the user to investigate the trajectories at three levels: region, road and vehicle. The system provides statistical information on the spatial and temporal aspects of the trajectories and transforms the numerical information into visual cues such as shape, color


and size. With this visual information, changes in the spatial situation on the road network over time and temporal changes in traffic over a road segment can be analyzed easily. This system collects only the speed of taxis in the city to analyze traffic congestion, and it does not correlate the traffic problems with any external factors. In contrast, my methodology can collect data for any road in the city, irrespective of the area served by taxis, and correlates accident data with the commute time at peak and off-peak hours.

Naija et al. [29] performed clustering on historical traffic data for a single road in a French city to identify the factors influencing road traffic. The authors presented an evaluation measure, called the homogeneity degree of the clusters, to identify the factors that impact road traffic. This measure is based on class labels such as Monday to Friday with no holidays, Monday to Friday with holidays, and Saturdays and Sundays. Historical data was collected for one year through 300 roadside sensors installed on the highway. Each sensor recorded 480 traffic flow values per day, one value every 3 minutes. The study analyzed the historical data to identify which days of the week dominate each cluster. It identified only one factor, the day of the week, that impacts the traffic flow on the highway; it did not analyze hot-spots on the highway or other factors that impact the traffic flow. My research also used clustering to identify the factors influencing traffic and found factors such as day of the week, peak hours, off-peak hours, number of accidents and weather conditions in the city.

Stutz et al. [30] focused on traffic data analysis using a fuzzy clustering method known as FCMP (Fuzzy Clustering Multiple Prototype). Data was collected in the form of images taken by the


road-side cameras installed on a German highway, at 15-minute intervals over 321 days. The authors first performed clustering on the traffic data, which produced four clusters: one for Mondays through Thursdays and one each for Fridays, Saturdays and Sundays. Each cluster has a unique traffic pattern. The second application of this research is long-term prediction of the traffic volume at 15-minute intervals for a day. The prediction model uses partially supervised clustering, merging the knowledge base of clusters identified in the first step of traffic analysis to find the traffic pattern for a future day. This system needs significant customization when used on a different road or in a different city [30]. The process used in this study is broadly similar to the one used in my research. An advantage of this study over my research is that the authors collected traffic volume data for almost a year at 15-minute intervals around the clock; this large amount of data helped to understand the traffic volume in different seasons of the year. However, my study has a few advantages over this study, which are as follows:

• This study used a dedicated infrastructure for one road which may be difficult to

replicate for other roads. In contrast, my system uses a different approach to collect

traffic data from existing Web sources.

• Only traffic images are taken as the data for this system and the traffic is identified as

normal, dense and congested. My study captures the specific commute time for each

road from Google Maps.


• There is no correlation of the traffic data with external factors to observe the impact of

these factors on travel time in this study. My research addresses this limitation by

incorporating accidents and weather data with the commute time information.

• Although the prediction model identifies a unique traffic pattern for a future day based on the historical traffic analysis, the pattern is not customized, because the traffic pattern of historical days is assigned directly to the future day. The commute time prediction model developed in my research not only identifies the unique traffic pattern on the basis of historical data, but also customizes it at each hour of the future day by incorporating the impact of factors such as day of the week and time of the day.

Quek et al. [31] proposed a traffic analysis method and developed a system known as POPFNN-TVR (Pseudo Outer Product Fuzzy Neural Network using the Truth-Value-Restriction) based on a fuzzy neural network. The experiment was performed on a Singapore highway on which special road-side cameras were installed. Data was collected for this highway for 6 days at 5-minute intervals. Two types of data were gathered from the video cameras: vehicle count and vehicle speed. Historical analysis was performed on these 6 days of data using classification. The analysis identified nine speed categories and 13 vehicle categories and extracted a set of fuzzy rules, for example, “if the height of the vehicle is short, then the weight of the vehicle is light”. Using the information mined from the traffic analysis, a short-term prediction model was designed to predict the traffic density from 5 minutes to 1 hour ahead. This model forecasts the traffic density by determining what types of vehicles will travel on the highway and at what speed. The approach used in this study requires costly infrastructure, namely dedicated road-side video cameras for data


collection. It collected data for only 6 days, and the data is limited to vehicle count and vehicle speed on the highway. This study does not relate the traffic data to external factors such as traffic accidents to measure their impact on the speed of the vehicles.

2.2.2 Traffic Characterization

Bing Maps has been used for collecting low-cost traffic data [8]. In [8], traffic data is collected from Bing Maps by capturing images of the map at different time intervals. These images represent the flow intensity of the road traffic using colors: red for highly congested roads, yellow for less congested roads and green for free-flowing traffic. The data was collected for a period of two weeks, and the highly congested areas, shown in red, were identified for Chicago’s roads. In contrast to my work, this study includes data from only one Web source, i.e., Bing Maps, and does not integrate this information with other types of data such as traffic accidents on those roads and weather conditions in the city. My research adds this feature by incorporating data from multiple Web sources: traffic data from Google Maps, accident data from Twitter and weather information from a reliable Web source. A model is developed in my research that unites all of these features.

Kumaresan [32] conducted a study on the historical data of traffic accidents for a highway in Las

Vegas. The author performed modeling of short term and long term impacts of freeway traffic

incidents using historical data. The model identified the short-term impacts that occur

immediately during and after an accident. The short-term impacts are quantified by excess travel

time measures, fuel consumption, and vehicle emissions produced due to the incident. Long-term

impacts of incidents are studied which affect the travel time reliability and also affect user’s


perceived travel time. Twelve months of historical data were collected for this study from the city’s database. This study therefore has a large collection of historical data with which to study the impact of incidents on travel time. However, it has not quantified the factor by which travel time increases due to an incident, and it does not identify the travel time trends of the highway under study.

TomTom [33] uses GPS navigation devices to provide real-time traffic information to the GPS users. TomTom traffic flow delivers a real time, detailed view of traffic speeds on the entire road network, designed for easy integration into traffic management systems and calculating current routing travel times. TomTom coverage extends over 200 countries and territories, encompassing more than 4 billion people and 40 million kilometers globally. TomTom collects historical traffic data from multiple data sources which are millions of connected GPS devices used by TomTom users, millions of government road sensors and thousands of journalists collecting incident information. Road sensors and real traffic incident data are fused with the anonymous GPS measurements of TomTom device users to create a picture of current traffic conditions. Since

2006, TomTom has collected anonymous GPS measurements from its users. TomTom’s historical data consists of over 9 trillion consumer-driven data points collected from GPS devices used by millions of people worldwide. This ever expanding historical traffic database is used to analyze travel times and bottlenecks across the complete road network. The data is collected every second from all the active GPS devices and the traffic status is updated on the map every

30 seconds. The historical traffic information collected gives valuable insight into the traffic situation on the road network throughout the day such as average speed, average travel time, median travel time, standard deviation, average speed at morning peak hours, average travel time


at morning peak hours and average travel time at off-peak hours. Various analyses are performed on the huge historical data for many roads to answer the following questions:

1. Where are the traffic jams and accidents at rush hour?

2. How fast is traffic moving on the overall road network?

TomTom has many advantages over my research, as it collects a huge amount of data from multiple sources such as GPS devices, roadside sensors and journalists reporting incident information. Historical analysis is performed on trillions of data points collected for millions of roads worldwide. Travel time prediction uses both the historical data and real-time data to provide accurate travel times to the users. It also incorporates traffic accident data to analyze the bottleneck areas of the roads. My research nevertheless has some advantages over TomTom's historical analysis and prediction model. My model uses a Web mining approach to collect heterogeneous traffic information from multiple Web sources. My study also incorporates one more external factor that may impact road traffic, namely weather, whereas TomTom does not integrate the impact of weather.

2.2.3 Travel Time Prediction Models

Roopa et al. [34] built a crowdsourcing-based traffic information system and proposed an

architecture that addresses the challenges of real time traffic management of non-lane based and

chaotic traffic of developing countries. This study collected the traffic data from smartphones by

tracking the GPS locations of the mobile phones being used by the drivers on road. The data is

collected from the mobile devices not from the car GPS devices because in developing countries

such as India, mobile phones are more common than vehicles fitted with GPS systems. The


mobile devices used to collect the traffic information have to be connected to the traffic server of this application so that real-time traffic updates can be provided to the mobile users constantly. The traffic server maintains the current location of each commuter, and any change in the traffic status along the commuter’s path is updated immediately. The server stores the GPS location of the mobile device along with its IMEI number and the date and time at which the GPS information was recorded. The traffic server collects GPS information from mobile devices periodically rather than continuously, because continuous collection could drain the devices' batteries. Traffic density along a road is calculated from the GPS locations of multiple users travelling on that road, and this information is passed back to the mobile users to make them aware of how congested the road ahead is. This study collects only the GPS locations of mobile devices to inform users about current traffic conditions ahead on the road. It uses only real-time traffic data, does not infer any information from historical data, and does not correlate external factors that may impact the traffic at the current time. My research addresses these unexplored areas.

Jain et al. [35] proposed an automated image processing mechanism for detecting congestion levels in road traffic by processing CCTV camera image feeds. Based on live CCTV camera feeds from multiple traffic signals in Kenya and Brazil, the system showed evidence of long-lasting congestion across multiple locations. The system coordinates traffic signal behavior within a small area and provides useful information to prevent congestion collapse and enhance road capacity. It collects live images from roadside cameras and analyzes the real-time data to prevent congestion on the roads. This system does not provide analysis on historical


data and does not provide a commute time based prediction model. This system does not correlate the traffic congestion problems with the external factors.

Wischhof et al. [36] developed a Self-Organising Traffic Information System (SOTIS). This system gathers traffic information from the vehicles themselves; a special In-Car Navigation (ICN) system is installed in every vehicle participating in the study. Each vehicle monitors the locally observed traffic situation by recurrently receiving data packets with detailed information from other vehicles, which is known as Inter-Vehicle Communication (IVC). Traffic situation analysis is performed in each individual vehicle, and the result is transferred via a wireless data link to all surrounding vehicles within transmission range. A simulation model using IVC is built in this study. This approach requires costly infrastructure, since each participating vehicle needs to be equipped with a special navigation device, and the traffic information is broadcast only to those vehicles.

Chen et al. [37] developed a vehicle travel time prediction algorithm based on historical data and shared location. The algorithm provides its users with route guidance and travel time information. Based on the user’s current position and destination, a route is calculated using the historical travel times of the route and the recent travel times of two adjoining intersections. The algorithm also adjusts the predicted travel time according to accidents, route conditions and the deviation between the user’s position change and the prediction. Forty days of historical data were collected for this study by tracking the GPS locations of the users' mobile devices. The first 20 days of data are used as the training set and the remaining 20 days as the testing set. The relative mean error of this algorithm is 10%.


This study has two main advantages: first, it predicts travel time using both historical and real-time travel times of the route, and second, it incorporates traffic accidents into the travel time prediction. However, it uses only 20 days of historical data, which is insufficient to reveal all the possible historical traffic trends of the route. The data was collected from the mobile devices of users associated with the study's application server, so the users’ permission is required to collect the GPS locations of their mobile devices. The study also did not consider weather as another external factor that may impact the traffic flow.

Most of the studies [38, 39, 40, 41, 42 and 43] use real-time traffic data to predict commute times. The real-time data is collected either from roadside sensors or by tracking, at fixed time intervals, the GPS locations of mobile users driving on the roads of interest. All these studies evaluate the impact of external factors such as time of the day, day of the week and accidents in real time and forecast the commute time at that hour. However, very few studies incorporated information inferred from historical data to scrutinize the impact of external factors in detail.

2.3 Summary

This chapter described the scholarly work done by other researchers in the field of road traffic. The literature review presented papers with objectives similar to those of the current study. Historical analysis of road traffic and travel time prediction models have been explored well by many researchers. Every study has its own method of collecting historical or real-time road traffic data. The methods used in these studies are roadside sensors, image cameras, video cameras, mobile devices, GPS systems, in-car navigation systems, city databases and Inter-Vehicle


communication system. All these methods need a costly infrastructure to capture traffic data which may not be possible to install on all the roads of a city.

In the current research, an alternative approach of collecting historical traffic data from existing Web sources is developed. Web mining has not been explored well as a method of road traffic data collection. To the best of my knowledge, only one study has used this methodology, collecting freely available traffic data from a Web source in the form of images. My approach differs from that study in that historical data is collected as commute times rather than as images. The approach could be easily replicated for any road anywhere in the world by identifying the latitude/longitude coordinates of the road on Google Maps. Thus, it does not require any dedicated infrastructure for each road to study its traffic characteristics. The approach is also applied to sub-segments of a road to identify traffic problems specific to a sub-section, such as the hot-spot with the highest number of traffic accidents and the congested sub-section of the road at peak and off-peak hours of the day.

Various studies have explored new methods of developing commute time prediction models. I have compared and differentiated these approaches with my model. The methodology used in the commute time prediction model of the current study differs somewhat from that of the other studies. Almost all the studies have incorporated traffic accident information to predict commute times, but none has incorporated the impact of weather into commute time prediction. A few studies have examined the impact of rainfall in countries other than Canada. However, no model has been designed to study the impact of snowfall, or of snow on the ground from previous days’ snowfall, on


road traffic. The current study examines the impact of both of these snow factors on road traffic in detail and incorporates them into the prediction of commute times on snowy days.


Chapter Three: WEB DATA COLLECTION METHODOLOGY

This chapter presents the data collection mechanism. The heterogeneous traffic data is gathered

from multiple Web sources. These multiple Web sources are identified on the basis of factors

which influence the commute time. All the Web sources used in this study and the factors which

affect the commute time are discussed in detail in this chapter.

Section 3.1 describes the various methodologies used to collect this diverse data from numerous

Web sources. Section 3.2 defines the overlaying of data from one Web source to the others.

Section 3.3 gives the overview of the data collected so far.

3.1 Data Collection Process

Traffic on major urban roadways is influenced by many factors. Typically, the factors which impact the road traffic include the following:

• Number of cars, speed, flow, occupancy

• Time of the day, weekday, weekend, day of the month, season, year

• Weather, temperature, humidity

• Road type

• Events scheduled, e.g., hockey games

• Events unscheduled, e.g., fire

• Accidents, detours, lane closures

• Police archives of incidents

• Parking: location, occupancy

• Zones: schools, elderly houses, event locations, historically accident prone areas

• Glare (direction of sun)

• Tweets: incidents, locations

• Points of interest (POIs)

• Local sources: Transit data, road sign locations

To study the impact of such factors on traffic delays, heterogeneous data is collected from multiple Web sources and combined together. While there are many factors that can influence commute time, I focus on the following factors for this research:

• Day of the week

• Time of the day

• Number of accidents (associated with particular time and segment of the road)

• Weather

The above mentioned factors impact the commute time. Information pertaining to these factors is collected from the following Web sources:

• Google Maps

• Twitter

• Historical Weather Reports from weather.climate.gc.ca

Each Web source is explained in detail in the following subsections. Section 3.1.1 explains the collection of route information from Google Maps API. Section 3.1.2 describes the collection of

ICT (in current traffic) commute time from Google Maps Website. Section 3.1.3 presents the collection of accidents from Twitter. Section 3.1.4 illustrates the collection of historical weather reports from weather.climate.gc.ca website.


3.1.1 Google Maps

The commute time and route information is collected from Google Maps. It is collected in two ways, because not all of the information is available through the Google Maps API. Figure 3.1 shows the flow of data collection from Google Maps.

The Google Maps API is an application programming interface that supports a programmatic approach to retrieving the information displayed by the Google Maps website. The left-side flow in Figure 3.1 describes the querying of the Google Maps API to gather static route information, i.e., route information without traffic.

The script takes as input the source and destination GPS coordinates of a target route. To allow finer grained analysis, it also accepts as inputs GPS coordinates of sub-sections of the route. The inputs are carefully chosen to limit the number of queries issued through the Google Maps API and the number of queries submitted to the Google Maps website. This is because the Google

Maps API currently imposes a limit of 2,500 queries per day. Given a source and a destination address, Google Maps provides a basic estimate of commute time based on route distance and posted speed limits of segments constituting the route.
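As a rough illustration of an estimate based on route distance and posted speed limits, the traffic-free travel time of a route can be approximated by summing the time needed to traverse each segment at its speed limit. The sketch below only illustrates that idea; the function name and the segment values are hypothetical, and it makes no claim about how Google Maps computes its own estimate.

```python
def free_flow_minutes(segments):
    """Traffic-free travel time in minutes.

    segments: iterable of (length_km, speed_limit_kmh) pairs, one per
    segment of the route (hypothetical input format).
    """
    hours = sum(length_km / speed_kmh for length_km, speed_kmh in segments)
    return hours * 60


# Hypothetical route: 5 km at 100 km/h followed by 2 km at 50 km/h.
example_route = [(5.0, 100.0), (2.0, 50.0)]
estimate = free_flow_minutes(example_route)
```

For this hypothetical route the two segments take 3 minutes and 2.4 minutes, so the estimate is about 5.4 minutes.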


[Figure 3.1: Google Maps Data Collection Process. Left flow: route information is requested through the Google Maps API, which returns JSON (JavaScript Object Notation) files containing static data about the target route. Right flow: a custom script issues HTTP requests to capture the current traffic commute time and saves text files containing Web page information; a text parser extracts the ICT (in current traffic) commute time. Output of both flows: route information with and without traffic.]

The script that queries the Google Maps API is invoked once for every route of interest and for each sub-section of the routes, because the information remains the same irrespective of the time and day. Over an extended period of observation, the route information returned by the API did not change, and it did not include the current traffic commute time for the route. The decision was therefore made to invoke the script only once in the whole data collection process. The route information returned by the query contains the source address, destination address, latitude and longitude of the source and destination, total distance, travel mode, directions for the route, and duration.

The API returns JavaScript Object Notation (JSON) files containing information for the route. Each file contains several options for traversing from the source to the destination, along with the commute time estimates for the options. The JSON format makes it easier to query route information programmatically during the traffic analysis phase, because the information is stored as named fields that can be retrieved by name, with no need for text parsing.
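To illustrate how named fields can be read without text parsing, the sketch below loads a small JSON document shaped like a Google Directions API response (`routes`, `legs`, `distance`, `duration`) and summarizes each route option. The sample values, addresses and the `summarize_routes` helper are made up for the example; the actual files collected in this research may contain additional fields.

```python
import json

# Illustrative sample only: the values and addresses are invented.
SAMPLE_RESPONSE = """
{
  "status": "OK",
  "routes": [
    {
      "summary": "Example Route A",
      "legs": [
        {
          "start_address": "Example origin",
          "end_address": "Example destination",
          "start_location": {"lat": 51.10, "lng": -114.05},
          "end_location": {"lat": 51.00, "lng": -114.05},
          "distance": {"text": "12.0 km", "value": 12000},
          "duration": {"text": "15 mins", "value": 900}
        }
      ]
    }
  ]
}
"""


def summarize_routes(response_text):
    """Return one summary dict per route option in the response."""
    data = json.loads(response_text)
    summaries = []
    for route in data.get("routes", []):
        leg = route["legs"][0]  # a route without waypoints has a single leg
        summaries.append(
            {
                "summary": route.get("summary", ""),
                "from": leg["start_address"],
                "to": leg["end_address"],
                "distance_km": leg["distance"]["value"] / 1000,  # metres to km
                "duration_min": leg["duration"]["value"] / 60,  # seconds to min
            }
        )
    return summaries


route_options = summarize_routes(SAMPLE_RESPONSE)
```

Each field is retrieved by its key, so no pattern matching on the raw text is required.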

However, as mentioned previously, these files do not contain the “in current traffic” (ICT) commute time estimates for the routes, which represent a commute time estimate that takes into account current traffic conditions on the route. The source of ICT commute time information is the Google Maps website as this is not available on Google Maps API. This estimate is obtained by continuously tracking in real-time GPS locations of participating mobile phone users travelling on the target route [44].

A number of challenges had to be addressed to exploit the commute time data provided by Google Maps. The free version of the API does not provide the ICT commute time estimates. To collect the ICT commute time, a custom script was written that queries the Web service through a browser, saves the resulting Web page, and parses the saved page to extract the ICT commute time. Furthermore, the Google Maps website does not support queries that request historical ICT estimates for a given route. Such a feature is crucial for understanding how factors such as time of the day, day of the week, and month of the year impact commute times, so the historical record had to be built up by querying the website repeatedly over the collection period.

The right-hand flow of the data collection process in Figure 3.1 shows the method used to gather information from the Google Maps website. A custom C# script was developed to continuously collect route information and ICT commute time estimates for specified routes. The script takes as input the source and destination GPS coordinates of a target route. To allow finer-grained analysis, it also accepts the GPS coordinates of sub-sections of the route, as well as the time instants at which the ICT commute time estimates are to be collected.

The number of queries that can be sent to the Google Maps website is limited. Moreover, the number of queries was kept low so that the scripts would not adversely affect the Quality of Experience (QoE) of human users of the website. As a result, ICT commute time estimates were collected at only 19 time instants of the day. Queries were issued at 30-minute intervals during rush traffic periods and at 2-hour intervals during other periods. These 19 time instants were chosen after closely observing the ICT commute time over a period of time.
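The schedule can be written down explicitly. The sketch below lists the 19 time instants as they appear in the column headers of Tables 4.1 to 4.4 and computes the gaps between consecutive instants; the names `COLLECTION_INSTANTS` and `gaps_in_minutes` are illustrative assumptions. In those tables the last three instants (7:00 PM, 8:00 PM, 9:00 PM) are one hour apart.

```python
from datetime import time

# The 19 collection instants, taken from the column headers of Tables 4.1 to 4.4.
COLLECTION_INSTANTS = [
    time(6, 0), time(6, 30), time(7, 0), time(7, 30),
    time(8, 0), time(8, 30), time(9, 0),
    time(11, 0), time(13, 0),
    time(15, 0), time(15, 30), time(16, 0), time(16, 30),
    time(17, 0), time(17, 30), time(18, 0),
    time(19, 0), time(20, 0), time(21, 0),
]


def gaps_in_minutes(instants):
    """Return the gap, in minutes, between each pair of consecutive instants."""
    minutes = [t.hour * 60 + t.minute for t in instants]
    return [later - earlier for earlier, later in zip(minutes, minutes[1:])]


gaps = gaps_in_minutes(COLLECTION_INSTANTS)
```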

This custom script uses the HTTP protocol to capture the ICT commute time estimates for the route at the 19 time instants of the day. It queries the Google Maps website at each of the time instants specified as input. Each query returns a set of Web pages containing the Google Maps response, which are saved as text files. The script uses a custom parser to extract the ICT commute time estimates pertaining to the route from these text files. The extracted values are then stored along with the other route information contained in the previously obtained JSON files to facilitate subsequent traffic analyses, as explained in detail in Section 3.1.4.
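The parsing step can be sketched as follows. The thesis script was written in C# and the markup of the saved pages is not given, so this Python sketch rests on a loudly stated assumption: that the estimate appears in the page text in a form such as "In current traffic: 38 mins" or "In current traffic: 1 hour 4 mins". The pattern and the function name `extract_ict_minutes` are hypothetical.

```python
import re

# Assumed phrasing of the estimate; the real page markup is not described in the text.
ICT_PATTERN = re.compile(
    r"in current traffic:\s*(?:(\d+)\s*h(?:ours?|rs?)?\s*)?(\d+)\s*min",
    re.IGNORECASE,
)


def extract_ict_minutes(page_text):
    """Return the first ICT estimate found in the page, in minutes, or None."""
    match = ICT_PATTERN.search(page_text)
    if match is None:
        return None
    hours = int(match.group(1)) if match.group(1) else 0
    return hours * 60 + int(match.group(2))
```

For example, `extract_ict_minutes("<div>In current traffic: 38 mins</div>")` yields 38, and a page without the phrase yields `None`, which lets the caller record a missing observation rather than a wrong value.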

The ICT commute time information is also collected for sub-segments of each route, that is, for sub-sections of the road. The number of sub-segments monitored on each road is also limited by the query limitation of Google Maps. For example, Deerfoot Trail is divided into 6 nearly equidistant sub-segments, each roughly encompassing 3 consecutive interchanges. Six sub-segments were chosen because the route has 21 interchanges in total and because this division keeps the sub-segments nearly equidistant.

The start and end coordinates of the sub-segments were obtained manually. Automating this task is deferred to future work.

3.1.2 Twitter Search API

The commute time information is augmented with traffic accident information mined from the Twitter social network, as the number of traffic accidents can also have a significant impact on commute time. Figure 3.2 shows the process of collecting tweets about accidents from Twitter. Tweets from selected traffic-reporting accounts were gathered programmatically using the Twitter Search API, an application programming interface that queries for tweets matching the search criteria and returns a collection of the most recent tweets posted by a Twitter user [45]. These tweets are retrieved in a JSON file. The API restricts the number of queries per day and limits the number of tweets returned per query to 100. Consequently, queries were issued once every hour, and each query retrieved the 100 most recent tweets. For several representative days, the tweet stream from the API calls was compared with the corresponding Twitter feeds displayed in a browser to confirm that no information was lost due to the API’s restrictions.

Text parsing using regular expressions is performed on the collected tweets to look for accident reports on the target routes. Every tweet has a unique ID associated with it, and this ID is used to discard duplicates so that each accident tweet is processed only once.

Accidents are also assigned to the sub-segments of all target routes being monitored.
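A minimal sketch of this filtering step is given below. The tweet wording, the field names `id` and `text`, the keyword list, and the function name `accidents_on_route` are all assumptions made for illustration; the regular expressions actually used in the thesis are not given.

```python
import re

# Hypothetical tweets; wording and field names are assumptions.
TWEETS = [
    {"id": 101, "text": "Collision on SB Deerfoot Trail at 17 Avenue, right lane blocked"},
    {"id": 101, "text": "Collision on SB Deerfoot Trail at 17 Avenue, right lane blocked"},
    {"id": 102, "text": "Stalled vehicle on Macleod Trail at Anderson Road"},
    {"id": 103, "text": "Accident NB Deerfoot Trail near 32 Avenue"},
]

# Assumed keyword list for recognising an accident report.
ACCIDENT_WORDS = re.compile(r"\b(accident|collision|crash|mvc)\b", re.IGNORECASE)


def accidents_on_route(tweets, route_name):
    """Return unique tweets that report an accident on the named route."""
    route = re.compile(re.escape(route_name), re.IGNORECASE)
    seen_ids = set()
    matches = []
    for tweet in tweets:
        if tweet["id"] in seen_ids:
            continue  # duplicate of a tweet already processed
        seen_ids.add(tweet["id"])
        if ACCIDENT_WORDS.search(tweet["text"]) and route.search(tweet["text"]):
            matches.append(tweet)
    return matches


deerfoot_accidents = accidents_on_route(TWEETS, "Deerfoot Trail")
```

Deduplicating on the tweet ID before matching matters because hourly queries that each return the 100 most recent tweets will overlap whenever fewer than 100 new tweets were posted in the hour.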

The accident tweets are collected from reliable Twitter sources that report incidents in a standard format, which makes it easier to extract information from each tweet. After an initial analysis of all traffic-related tweets in Calgary, two reliable sources were selected, namely Canadian Traffic Network Calgary (@CTNCalgary) and 660 News Traffic (@660NewsTraffic). The key reasons for choosing these services were the comprehensiveness and consistent formatting of their updates, which permitted easy parsing of accident locations.

To check that the Twitter script captures all accidents reported by the reliable sources, a manual validation was performed for several days, in which the accidents collected from Twitter were matched manually against the accidents reported by the City of Calgary. Almost all accidents occurring in the city were found to be reported promptly on the respective Twitter accounts. The relevant tweets are extracted from the JSON files, each of which contains 100 tweets together with the associated information for every tweet, such as Tweet ID, Date, Time, Tweet Description, Time Zone, and Geographical location. The information needed for further processing is extracted and stored in a CSV file for a particular route or its sub-segment, as explained in detail in Section 3.2.

[Figure 3.2 flowchart: Search query for accidents -> Twitter Search API -> JSON files containing accident tweets -> Text parser for selecting accidents on a particular road -> Accidents data in CSV file]

Figure 3.2: Twitter Data Collection Process

3.1.3 Historical Weather Reports

Weather, and snowfall in particular, has a significant impact on traffic. To observe the impact of weather on commute time, historical weather data is combined with the commute time and accident information. Historical weather data is collected from the Environment Canada website.

A simple assumption is made here: significant snowfall on a particular day, and significant snow and ice on the ground due to previous precipitation, are better predictors of traffic problems than temperature. This is shown in Chapter 4, where the impacts of temperature and snow are examined. For example, it is not uncommon to witness smooth traffic flow even when the temperature is -25 °C, provided there is no precipitation and the roads are clear. The snow metrics correlate well with the number of accidents and with commute times. This data is captured from climate.weather.gc.ca and stored in the CSV file during the data overlaying process explained in detail in Section 3.2.

3.2 Data Overlaying

Figure 3.3 shows the process of overlaying the data collected from the three main Web sources. The date and route name are given as input to the data extraction processes. The Google Maps data extraction process takes the date and route name as input and collects information such as the source address, destination address, basic commute time, and, most importantly, the ICT commute times at 19 time instants of the day from the JSON and text files. The basic commute time is calculated from the speed and distance of the route without taking current traffic conditions into consideration. The ICT commute time collected from the Google Maps website gives the travel time under current traffic conditions at a particular time of the day, but it does not show the reasons for a delay at that time. To identify the traffic problems at a specific time, the ICT commute time is correlated with the number of accidents at that time and the weather conditions of the day.

[Figure 3.3 flowchart: Date and Route Name are input to three extraction processes (Google Maps data extraction, Twitter accidents data extraction, weather data extraction); the extracted fields (source address, destination address, basic commute time, ICT commute times at time instants 1 to 19, total number of accidents, date, total snow, snow on ground) are written to the RouteName.CSV file]

Figure 3.3: Data Overlaying

The Twitter accidents data extraction process collects, from the JSON files, the number of accidents that occurred on a particular day on the specified route. This information is stored in the CSV file along with the information extracted from Google Maps. Weather information about total snowfall and snow on the ground is also extracted for the day and stored in the same CSV file. Results from this overlay analysis are presented in the next three chapters.
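The overlay can be pictured as building one CSV row per route and day from the three sources. The following Python sketch is an illustration only: the column names, the centimetre units, the sample values, and the helpers `overlay_row` and `write_route_csv` are assumptions, and only two of the 19 ICT instants are shown to keep the example short.

```python
import csv
import io


def overlay_row(date, route, google, accident_count, weather):
    """Combine the three sources into one flat record for a route and day."""
    row = {
        "date": date,
        "route": route,
        "source": google["source"],
        "destination": google["destination"],
        "basic_commute_min": google["basic_commute_min"],
        "accidents": accident_count,
        "total_snow_cm": weather["total_snow_cm"],          # unit is an assumption
        "snow_on_ground_cm": weather["snow_on_ground_cm"],  # unit is an assumption
    }
    # One column per collection instant, e.g. ict_07:00.
    for instant, minutes in google["ict_min"].items():
        row["ict_" + instant] = minutes
    return row


def write_route_csv(rows, handle):
    """Write the overlaid records, with a header, to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# Invented sample values for one day.
google = {"source": "A", "destination": "B", "basic_commute_min": 35,
          "ict_min": {"07:00": 38, "07:30": 41}}
weather = {"total_snow_cm": 2.0, "snow_on_ground_cm": 10.0}
row = overlay_row("2014-01-15", "Deerfoot Trail", google, 3, weather)

buffer = io.StringIO()  # stands in for the RouteName.CSV file
write_route_csv([row], buffer)
csv_lines = buffer.getvalue().splitlines()
```

Keeping all factors for a day in a single row is what allows the later chapters to relate a commute time directly to the accident count and snow conditions of the same day.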

3.3 Summary

This chapter presented the data collection process. Heterogeneous data is collected from multiple Web sources, and each Web source’s data collection process was explained in a separate section of the chapter. One type of data is not enough to understand the traffic problems of a road or the city. Each road has different traffic patterns, depending on the factors affecting traffic on that road. Factors that impact the commute time or traffic patterns on a road include the location of the road, the speed limit, traffic at rush hours, the number of accidents on the road, the weather conditions of the day, the time of the day, and the day of the week. To understand the traffic problems, these factors need to be considered together to see the traffic behavior of the road. This chapter explained which factors can be studied together and how to overlay them.

Chapter Four: HISTORICAL VEHICLE TRAFFIC ANALYSIS OF DEERFOOT TRAIL

The data collection system design is motivated by the objective of understanding some factors that influence commute time patterns in major urban roadways. In order to understand the traffic problems on the major roads/highways of the city of Calgary, data was collected for 18 heavily congested highways and arterial roads in Calgary. The following are the 18 roads:

1. Deerfoot Trail

2. Stoney Trail

3. 16 Avenue

4.

5.

6.

7.

8. Country Hills Blvd

9. Metis Trail

10.

11. John Laurie Blvd

12.

13. 17 Avenue

14.

15. 52 Street

16. McKnight Blvd

17. 14 Street

18.


Data was collected for each of these 18 roads from one end of the road to the other, and also for the sub-segments of each road. Each road has interchanges that connect it to other roads of the city. The roads were divided into a few segments, with at least 3 interchanges in each segment. The reasons for grouping 3 interchanges are the same as those discussed in Chapter 3 for Deerfoot Trail, together with the aim of keeping each sub-segment at least 5 km long. The heterogeneous data is also recorded for all the sub-segments of the 18 roads.

The data collected over more than 6 months (202 days, from 21 September 2013 to 10 April 2014) for the 18 roads and their segments occupies 50 GB of disk space and comprises 337,087 files. Data for all the roads is not discussed in this study. A detailed analysis is performed only for the most congested highway of the city, Deerfoot Trail.

Deerfoot Trail is a multi-lane highway that spans about 50 km within Calgary and features 21 interchanges. It has 3 to 4 lanes in each direction and a speed limit of 100 km/h. The roadway is the province of Alberta’s busiest highway, with traffic volumes ranging between 27,000 and 158,000 vehicles per day [46]. Although we focus our analysis on one specific roadway in Calgary, our data collection and analysis methodology can be replicated in a straightforward way for other similar roads. The data collected for Deerfoot Trail consumes 10 GB of disk space and comprises 64,786 files.

Deerfoot Trail is divided into six nearly equidistant segments. In the north to south direction, these segments are named as follows:

Segment 1: Country Hills to Beddington Trail (8 km)

Segment 2: 64 Avenue to 32 Avenue (5.9 km)

Segment 3: 16 Avenue to 17 Avenue (6.7 km)

Segment 4: Barlow Trail to Anderson Road (8.8 km)

Segment 5: 24 Street to McKenzie Blvd (6.9 km)

Segment 6: Stoney Trail to Macleod Trail (13.7 km)

The sub-segment division is the same in the south to north direction; only the direction of travel within each sub-segment is reversed. The names in the south to north direction are therefore as follows:

Segment 1: Beddington Trail to Country Hills (8 km)

Segment 2: 32 Avenue to 64 Avenue (5.9 km)

Segment 3: 17 Avenue to 16 Avenue (6.7 km)

Segment 4: Anderson Road to Barlow Trail (8.8 km)

Segment 5: McKenzie Blvd to 24 Street (6.9 km)

Segment 6: Macleod Trail to Stoney Trail (13.7 km)

This chapter provides observations from the statistical results for one highway, Deerfoot Trail, in the city of Calgary. The heterogeneous data discussed in Chapter 3 was collected for Deerfoot Trail and its 6 sub-segments in both directions, north to south and south to north, over a period of 6 months.

The commute time analysis is performed on the traffic data of Deerfoot Trail in both directions, and the results for the two directions are compared side by side in the following sections. Section 4.1 discusses the statistical results of the commute time analysis performed on Deerfoot Trail and lists a set of questions that are answered through a detailed analysis of the traffic data. Section 4.2 presents the probability distribution of commute times at peak hours of the day.

4.1 Deerfoot Trail Commute Time Analysis

Commute time analysis is performed on Deerfoot Trail in both directions, north to south and south to north. This analysis is carried out to answer the following set of questions.

Q1. What are the peak and off-peak traffic hours?

Q2. What are the traffic conditions at peak hours?

Q3. How are peak hours related to number of accidents?

Q4. What are the bottleneck road segments during peak hours?

These questions are answered one by one in this section through a detailed analysis of the traffic data.

4.1.1 Q1. What are the peak and off-peak traffic hours?

To answer the first question, Q1, I present statistics such as the minimum commute time, maximum commute time, mean, median, absolute deviation, standard deviation, and 95th percentile for both directions of the highway. Table 4.1 and Table 4.2 give an overview of the data for the north to south direction, reporting these statistics at the 19 time instants of the day at which data was recorded.
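The statistics listed above can be computed for one time instant as in the following Python sketch. Three choices here are assumptions, since the text does not define them: "absolute deviation" is taken as the mean absolute deviation about the mean, the standard deviation is the population form, and the 95th percentile uses the nearest-rank method. The sample values and the function name `summarize` are invented for illustration.

```python
import math
import statistics


def summarize(commute_minutes):
    """Summary statistics for the commute times observed at one time instant."""
    values = sorted(commute_minutes)
    mean = statistics.mean(values)
    # Assumption: mean absolute deviation about the mean.
    abs_dev = sum(abs(v - mean) for v in values) / len(values)
    # Assumption: nearest-rank percentile (smallest value covering 95% of the data).
    rank = math.ceil(95 * len(values) / 100)
    return {
        "maximum": values[-1],
        "minimum": values[0],
        "mean": mean,
        "median": statistics.median(values),
        "absolute_deviation": abs_dev,
        "standard_deviation": statistics.pstdev(values),  # population form (assumption)
        "percentile_95": values[rank - 1],
    }


# Invented commute times (minutes) for ten days at one time instant.
sample = [33, 34, 35, 35, 36, 36, 37, 38, 40, 46]
stats = summarize(sample)
```

A single slow day (46 minutes here) pulls the maximum, standard deviation, and 95th percentile up much more than it moves the median, which is the pattern visible in the peak-hour columns of the tables.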


Table 4.1: Deerfoot Trail - north to south commute time (minutes) statistical results at morning hours

Statistic           6:00 AM  6:30 AM  7:00 AM  7:30 AM  8:00 AM  8:30 AM  9:00 AM  11:00 AM  1:00 PM
Maximum             39       41       58       77       87       62       63       59        51
Minimum             33       33       33       33       33       33       33       33        34
Mean                35       36       38       39       40       38       37       36        36
Median              35       36       37       37       38       36       36       35        35
Absolute Deviation  0.9      1.2      3        4        4.7      3.5      2.6      1.2       1.3
Standard Deviation  1.1      1.6      4.2      6.1      7        5.3      4.5      2.5       2.3
95th Percentile     37       39       46       50       55       52       45       39        40

Table 4.2: Deerfoot Trail - north to south commute time (minutes) statistical results at evening hours

Statistic           3:00 PM  3:30 PM  4:00 PM  4:30 PM  5:00 PM  5:30 PM  6:00 PM  7:00 PM  8:00 PM  9:00 PM
Maximum             124      75       79       89       97       105      76       58       87       68
Minimum             33       33       33       33       33       33       33       33       33       33
Mean                37       38       40       41       42       40       38       36       36       35
Median              36       37       38.5     39.5     40       39       36       35       35       35
Absolute Deviation  3.3      3.4      4.1      5.3      5.7      4.7      2.9      1.3      1.4      1
Standard Deviation  8.9      6        5.9      7.7      8.6      8.4      5        2.5      4.4      2.8
95th Percentile     43       46       48       52       55       49       44       40       37       37

Table 4.1 and Table 4.2 show the results for morning and evening hours, respectively, in terms of commute time in minutes. Among the 19 time instants, the standard deviation shows that a few hours have much less variability in commute time than the others. The hours with higher standard deviation are considered the peak hours of the day, as most of the traffic flows at those times. As a result, 7:00 AM to 9:00 AM is considered the morning peak period and 3:00 PM to 6:00 PM the evening peak period; the remaining hours are considered off-peak. These peak hours coincide with the typical opening and closing hours of businesses in Calgary.

The standard deviation at the early-morning off-peak hours (6:00 AM and 6:30 AM) is at most 1.6, and the mean is close to the basic commute time. In contrast, the standard deviation at morning peak hours is between 4.2 and 7.0 and the mean commute time is 37 to 40 minutes, which is 2 to 5 minutes more than the basic commute time. In the evening peak hours, the standard deviation is between 5.0 and 8.9 and the mean commute time is 37 to 42 minutes, which is 2 to 7 minutes more than the basic commute time. This means that the peak hours show the most variability in commute times, whereas the mean commute time at off-peak hours remains close to the basic commute time. For questions Q2 to Q4, the off-peak hours are therefore excluded from further analysis, as they remain the same irrespective of the day of the week (working day, weekend, or holiday). The peak hours, 7:00 AM to 9:00 AM and 3:00 PM to 6:00 PM, are considered for further detailed analysis.
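The peak and off-peak split can be reproduced from the tabulated values with a simple rule. The sketch below copies the mean and standard deviation for the north to south direction from Tables 4.1 and 4.2 and flags an instant as peak when both exceed a threshold. The thresholds (mean of at least 37 minutes, standard deviation of at least 4.2) and the function name `peak_instants` are assumptions chosen to match the stated ranges, not a rule given in the thesis.

```python
# (mean, standard deviation) in minutes, copied from Tables 4.1 and 4.2.
NORTH_TO_SOUTH = {
    "6:00 AM": (35, 1.1), "6:30 AM": (36, 1.6), "7:00 AM": (38, 4.2),
    "7:30 AM": (39, 6.1), "8:00 AM": (40, 7.0), "8:30 AM": (38, 5.3),
    "9:00 AM": (37, 4.5), "11:00 AM": (36, 2.5), "1:00 PM": (36, 2.3),
    "3:00 PM": (37, 8.9), "3:30 PM": (38, 6.0), "4:00 PM": (40, 5.9),
    "4:30 PM": (41, 7.7), "5:00 PM": (42, 8.6), "5:30 PM": (40, 8.4),
    "6:00 PM": (38, 5.0), "7:00 PM": (36, 2.5), "8:00 PM": (36, 4.4),
    "9:00 PM": (35, 2.8),
}


def peak_instants(stats, min_mean=37, min_std=4.2):
    """Return instants whose mean and standard deviation both reach the thresholds."""
    return [
        instant
        for instant, (mean, std) in stats.items()
        if mean >= min_mean and std >= min_std
    ]


peaks = peak_instants(NORTH_TO_SOUTH)
# Using the standard deviation alone (no mean requirement) for comparison.
std_only = peak_instants(NORTH_TO_SOUTH, min_mean=0)
```

With both criteria the rule returns exactly 7:00 AM to 9:00 AM and 3:00 PM to 6:00 PM. With the standard deviation alone, 8:00 PM (standard deviation 4.4, mean 36) is also flagged, which suggests its spread comes from a few unusually slow days rather than from regular congestion.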

Table 4.3 and Table 4.4 show the corresponding results for the south to north direction at morning and evening hours, respectively: the maximum, minimum, mean, median, absolute deviation, standard deviation, and 95th percentile of the commute time, in minutes, at each time of the day.

Table 4.3: Deerfoot Trail – south to north commute time (minutes) statistical results at morning hours

Statistic           6:00 AM  6:30 AM  7:00 AM  7:30 AM  8:00 AM  8:30 AM  9:00 AM  11:00 AM  1:00 PM
Maximum             39       43       55       61       59       55       51       78        102
Minimum             33       32       33       33       33       32       33       33        33
Mean                35       36       38       41       40       37       36       35        35
Median              35       36       37       40       39       36       35       35        34
Absolute Deviation  0.9      1.7      3.7      4.9      5.0      3.2      2.0      1.2       1.3
Standard Deviation  1.2      2.1      4.6      6.0      6.1      4.2      2.9      3.5       4.9
95th Percentile     37       40       47       51       53       45       41       38        37

Table 4.4: Deerfoot Trail – south to north commute time (minutes) statistical results at evening hours

Statistic           3:00 PM  3:30 PM  4:00 PM  4:30 PM  5:00 PM  5:30 PM  6:00 PM  7:00 PM  8:00 PM  9:00 PM
Maximum             50       60       79       73       72       67       75       54       40       42
Minimum             33       33       33       33       33       33       33       33       33       33
Mean                36       38       41       43       43       40       38       35       34       34
Median              36       38       41       43       44       40       36       34       34       34
Absolute Deviation  2.0      3.3      5.2      6.1      6.2      4.9      3.5      1.4      0.7      0.7
Standard Deviation  2.9      4.6      6.6      7.3      7.4      6.2      5.2      2.7      1.0      0.9
95th Percentile     42       47       52       55       54       52       47       39       36       36

Among the 19 time instants, the mean, median, and standard deviation of the commute time show that this direction of the highway has the same morning and evening peak and off-peak periods as the other direction of Deerfoot Trail. The same peak hours are therefore studied in detail for this direction, to see how its traffic trends differ from those of the other direction. To analyze the traffic patterns in more detail, probability distributions are plotted for all 19 time instants of the day for which data is recorded.

4.1.2 Q2. What are the traffic conditions at peak hours?

To answer this question, the probability distribution of commute times is calculated at the 19 time instants of the day to see how commute times varied over the data collection period. The 202 days of data show that conditions deteriorate at the midday off-peak hours only when the weather is extremely bad that day. On one Saturday, the commute time increased drastically in the south to north direction because of very heavy snowfall in the afternoon combined with multiple accidents at the same time. Otherwise, the commute time at these hours stays the same as in normal conditions, with very little traffic on the road. A further reason to exclude off-peak hours from the discussion is that very few accidents were recorded at off-peak hours in the 202 days of data.

For example, of the 813 accidents recorded in 202 days, only 6 occurred at 6:30 AM and only 10 at 8:00 PM. By contrast, the maximum number of accidents recorded in a single day was 15, of which 9 happened at 5:30 PM and the other 6 at other peak hours; no accidents were recorded at off-peak hours that day. In other words, off-peak hours are predictable: the probability of an accident is very low and the commute time is very close to the average commute time. For this reason, off-peak hours are not considered in the further analysis of commute times.

This section presents the probability distributions of commute times at peak hours to examine the flow of traffic. Probability distribution functions (PDFs) of commute times are calculated from the historical data for all peak hours. These distributions show the probabilities of commute times at different times of the day. Each peak hour has a different spread of values, which in turn reflects the different traffic conditions at each peak hour.
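An empirical distribution of this kind can be obtained by counting how often each commute time occurs at a given time instant and dividing by the number of observed days. The Python sketch below is a minimal illustration with invented data; the 2-minute bin width (matching the tick spacing on the figure axes) and the function names are assumptions, and the thesis does not state how its PDFs were estimated.

```python
from collections import Counter


def empirical_distribution(commute_minutes, bin_width=2):
    """Map each bin's lower edge to the fraction of observations in that bin."""
    bins = Counter((m // bin_width) * bin_width for m in commute_minutes)
    total = len(commute_minutes)
    return {edge: count / total for edge, count in sorted(bins.items())}


def tail_probability(commute_minutes, threshold):
    """Fraction of observations at or above the threshold (the long tail)."""
    slow_days = sum(1 for m in commute_minutes if m >= threshold)
    return slow_days / len(commute_minutes)


# Invented commute times (minutes) at one peak instant over ten days.
sample = [34, 35, 35, 36, 37, 38, 40, 41, 45, 52]
distribution = empirical_distribution(sample)
tail = tail_probability(sample, 50)
```

The tail probability is the quantity referred to when a peak hour is said to have a small probability of reaching a commute time well above the basic commute time.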


Figure 4.1 shows the probability distribution of commute times at morning peak hours in the north to south direction. The PDFs of the morning peak hours have long tails, indicating some probability that the commute time is much longer than the basic commute time. These are the hours when most of the traffic from the north is heading to the centre of the city. According to the traffic data recorded over more than six months, 8 AM is the peak hour with a 0.01 probability of the commute time reaching 50 minutes, which is 15 minutes more than the basic commute time. As a result, at 8 AM it takes longer to reach the destination via Deerfoot Trail than at other times.

[Figure 4.1 plot: probability (0.00 to 0.45) against commute time in minutes (30 to 60), with one curve each for 7:00 AM, 7:30 AM, 8:00 AM, 8:30 AM and 9:00 AM]

Figure 4.1: Probability distribution of commute times at morning peak hours (north to south)

[Figure 4.2 plot: probability (0.00 to 0.60) against commute time in minutes (30 to 60), with one curve each for 7:00 AM, 7:30 AM, 8:00 AM, 8:30 AM and 9:00 AM]

Figure 4.2: Probability distribution of commute times at morning peak hours (south to north)

Figure 4.2 shows the probability distribution of commute times at morning peak hours, 7:00 AM to 9:00 AM, in the south to north direction. The commute time for Deerfoot Trail in this direction can reach 55 minutes at 7:30 AM and 8:00 AM, which is 20 minutes more than the basic commute time and 5 minutes more than in the other direction.

The peak hours of south-to-north traffic are slightly different from those of north-to-south traffic. In the south-to-north direction, the trip may take 5 minutes more at 7:30 AM and 8:00 AM. This means that traffic conditions are more severe in this direction in the morning; this difference is studied in more detail later, in combination with other factors such as accidents. In contrast, as shown in Figure 4.1 and Figure 4.2, there is less variation in commute time at 9:00 AM in the south-to-north direction than in the north-to-south direction. This shows that traffic starts decreasing after 8:30 AM in the south-to-north direction, but only after 9:00 AM in the north-to-south direction.

Figure 4.3 shows the probability distribution of commute times at afternoon peak hours in the north to south direction. These are the times when most people are travelling home from their offices: a significant number of offices are located in the centre of the city, and this traffic moves towards the south of the city. Rush hour starts at 3 PM, and the commute time keeps increasing until 4:30 PM, as shown by the PDFs for these times in Figure 4.3.

[Figure 4.3 plot: probability (0.00 to 0.45) against commute time in minutes (30 to 60), with one curve each for 3:00 PM, 3:30 PM, 4:00 PM and 4:30 PM]

Figure 4.3: Probability distribution of commute times at afternoon peak hours (north to south)

Figure 4.4 shows the probability distribution of commute times at afternoon peak hours in the south to north direction. At these peak hours, the commute time on this highway can be twice the basic commute time. In the south to north direction, traffic starts building from 3:30 PM.

The commute time PDFs are similar for both directions at 3:30 PM. At 3 PM, however, the commute time in the north to south direction has a much larger standard deviation (8.9) than in the south to north direction (2.9), where the variation is too small to be considered an impact on the commute time. Comparing the afternoon peak-hour PDFs of the two directions shows that traffic conditions are more severe in the south to north direction after 3:30 PM. The trip may take 5 minutes more at 4:00 PM and 4:30 PM in the south to north direction of Deerfoot Trail.

[Figure 4.4 plot: probability (0.00 to 0.60) against commute time in minutes (30 to 60), with one curve each for 3:00 PM, 3:30 PM, 4:00 PM and 4:30 PM]

Figure 4.4: Probability distribution of commute times at afternoon peak hours (south to north)

Figure 4.5 shows the PDFs of the evening peak hours (north to south). The commute time keeps increasing until 5:30 PM and starts decreasing from 6 PM. At 5 PM it takes the longest to travel Deerfoot Trail, as shown by the PDF for this time in Figure 4.5.

[Figure 4.5 plot: probability (0.00 to 0.45) against commute time in minutes (30 to 60), with one curve each for 5:00 PM, 5:30 PM and 6:00 PM]

Figure 4.5: Probability distribution of commute times at evening peak hours (north to south)

Figure 4.6 shows the probability distribution of commute times at evening peak hours in the south to north direction. Traffic conditions at these hours are severe, as the commute time can be twice the basic commute time. Several factors contribute to the higher commute time at these hours, including the number of accidents occurring around these hours and the rush of people going home to the south.

Traffic conditions in the evening peak hours are similar at 5:00 PM and 5:30 PM in both directions of Deerfoot Trail. However, the difference between the probability distributions at 6:00 PM for the two directions shows that traffic starts decreasing a little earlier in the north to south direction than in the south to north direction.

[Figure 4.6 plot: probability (0.00 to 0.60) against commute time in minutes (30 to 60), with one curve each for 5:00 PM, 5:30 PM and 6:00 PM]

Figure 4.6: Probability distribution of commute times at evening peak hours (south to north)

The PDFs of the peak hours for the north to south and south to north directions are similar at some peak hours and different at others. The differences indicate that traffic conditions differ between the two directions at those peak hours. There could be many reasons for these differences, such as the number of accidents and the available route alternatives. This is studied in detail in the following sections.

4.1.3 Q3. How are peak hours related to number of accidents?

Road accidents increase the commute time. The 202 days of data show that, on average, an accident adds 7 minutes to the commute time on Deerfoot Trail. This section studies the impact of accidents on commute time at different peak hours by plotting the probability distributions of accidents at all the peak hours.

Figure 4.7 shows the probability distribution of accidents at morning peak hours in the north to south direction. These probabilities are based on the accidents that happened in 202 days. Among the morning peak hours, 7:30 AM has a 0.04 probability of observing zero or one accidents, whereas the other morning peak hours have a very low probability (0.01) of observing an accident. This means that very few accidents occur in the morning, which is one of the reasons the commute time is lower in the morning than in the evening; in the data, the evening peak is always higher than the morning peak.

[Figure 4.7 plot: probability (0.00 to 0.45) against number of accidents (0 to 6), with one series each for 7:00 AM, 7:30 AM, 8:00 AM and 8:30 AM]

Figure 4.7: Probability distribution of accidents at Deerfoot Trail during morning peak hours (north to south)

Figure 4.8 shows that there is some probability of observing one or two accidents between 7:30 AM and 8:30 AM in the south to north direction of Deerfoot Trail, whereas there is less probability of observing an accident in the north to south direction at 8:00 AM and 8:30 AM. This is likely why the morning trip takes 5 minutes more in the south to north direction than in the north to south direction, as shown by the probability distributions of commute time at morning peak hours in the previous section.

[Figure 4.8 plot: probability (0.00 to 0.45) against number of accidents (0 to 6), with one series each for 7:00 AM, 7:30 AM, 8:00 AM and 8:30 AM]

Figure 4.8: Probability distribution of accidents at Deerfoot Trail during morning peak hours (south to north)

Figure 4.9 shows the probability distribution of road accidents at evening peak hours in the north to south direction of Deerfoot Trail. The evening peak hours for this direction have a 0.3 higher probability of observing more than 4 accidents than the morning peak hours. All the evening peak hours have some probability of at least 2 accidents at a time. The distributions at 5:00 PM and 5:30 PM have longer tails than those of the other peak hours, with a possibility of more than 4 accidents occurring at a time.

Figure 4.10 shows the probability distribution of accidents during evening peak hours in the south to north direction of Deerfoot Trail. These hours have a probability of at least 0.25, and at most 0.4, of observing an accident. The PDFs at 5:00 PM and 5:30 PM have longer tails than those of the other peak hours: these two peak hours have some probability of 4 accidents at a time, whereas the other peak hours have a maximum of 2 or 3 accidents at a time.

[Figure 4.9 plot: probability (0.00 to 0.45) against number of accidents (0 to 6), with one series each for 3:00 PM, 3:30 PM, 4:00 PM, 4:30 PM, 5:00 PM, 5:30 PM and 6:00 PM]

Figure 4.9: Probability distribution of accidents at Deerfoot Trail during evening peak hours (north to south)

[Figure 4.10 plot: probability (0.00 to 0.45) against number of accidents (0 to 6), with one series each for 3:00 PM, 3:30 PM, 4:00 PM, 4:30 PM, 5:00 PM, 5:30 PM and 6:00 PM]

Figure 4.10: Probability distribution of accidents at Deerfoot Trail during evening peak hours (south to north)

In the evening peak hours, 5:00 PM and 5:30 PM consistently have the highest commute times in the 202 days of traffic data, in both directions of the highway. A higher number of accidents is one of the major reasons for the high commute time at these hours: with at least 4 accidents, the commute time increases by approximately 20 minutes on a good-weather day and by 30 minutes on a bad-weather day, that is, when it is snowing. The other evening peak hours also have a higher probability of more than 2 accidents than the morning peak hours. The number of accidents is therefore one of the major factors affecting the commute time.

4.1.4 Q4. What are the bottleneck road segments during peak hours?

I now analyze the individual sub-segments of Deerfoot Trail to identify the bottleneck sub-segments during the peak hours. This highway has 21 interchanges in total. Commute times are also recorded for the different segments of the highway to analyze the traffic problems in more depth. In this section, I answer the following sub-questions in order to answer the main question, Q4.

Q4.1 How much does each segment contribute to the commute time of Deerfoot Trail?

Q4.2 Which segment is the most accident-prone?

Q4.3 How does the traffic change on different segments at different times of the day?

Q4.4 How does the traffic move on the segments near Downtown?

[Stacked bar chart omitted: "Segment-by-segment analysis"; x-axis time of the day (7:00 AM to 8:30 AM and 3:00 PM to 6:00 PM, at half-hour intervals), y-axis percentage of days (0 to 100); one series per segment, Segment 1 to Segment 6]

Figure 4.11: Percentage of days having maximum commute time (per KM) (north to south)

Figure 4.11 shows the segment-by-segment analysis of Deerfoot Trail in the north to south direction at all peak hours of the day. In this analysis, the commute time of each segment is calculated per kilometer to see which segment is the slowest at each peak hour. The figure shows the percentage of days on which a particular segment was the slowest and so contributed the most to the total commute time. Segment 3 contributed the most to the commute time on the largest number of days at almost all peak hours of the day. Segment 3 is the sub-segment that spans 16 Avenue to 17 Avenue and includes the interchange with another major road, Memorial Drive. This segment is near the downtown of the city, so most of the traffic entering downtown in the morning and leaving downtown in the evening peak hours uses this sub-segment of Deerfoot Trail.
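The per-kilometer comparison described above can be sketched as follows. This is a minimal illustration, assuming that each day provides one commute time per segment and that segment lengths are known; the function name `slowest_segment_share`, the segment lengths and the sample times are hypothetical and do not come from the collected data.

```python
def slowest_segment_share(days, lengths_km):
    """Percentage of days on which each segment had the highest minutes per km.

    days: list of per-day lists holding one commute time (minutes) per segment.
    lengths_km: segment lengths in kilometers, in the same order.
    Ties are credited to the first (lowest-numbered) segment.
    """
    wins = [0] * len(lengths_km)
    for times in days:
        per_km = [t / km for t, km in zip(times, lengths_km)]
        wins[per_km.index(max(per_km))] += 1
    return [100.0 * w / len(days) for w in wins]


# Hypothetical data: three segments observed at one peak hour on four days.
lengths_km = [10.0, 8.0, 5.0]
days = [
    [8, 9, 7],
    [12, 8, 5],
    [9, 8, 8],
    [8, 12, 5],
]
shares = slowest_segment_share(days, lengths_km)
print(shares)
```

Normalizing by length prevents a long segment from appearing slow merely because it takes longer to traverse.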

Segment 2, 64 Avenue to 32 Avenue, contributed the most to the commute time in the morning peak hours, on up to 43% of days. This section of the city contains many light industrial businesses, which suggests that people who work for such businesses contribute significantly to traffic on this segment. Interestingly, this segment did not see congestion in the evening: it contributed the most to the commute time on only a small percentage of days in the evening, between 8% and 11%. The causes of this behaviour could not be ascertained from the limited data that I collected.

Segment 4, Barlow Trail to Anderson Road, has the maximum commute time at evening peak hours such as 4:30 PM, 5:00 PM and 5:30 PM in the north to south direction, being the slowest sub-segment on 40% of the days. This segment is close to downtown and also has several heavy industries, which might explain the congestion.

Segment 1, Country Hills to Beddington Trail, had the maximum commute time per KM on some days. This segment spans a densely populated residential area. Residents of these communities commuting to work downtown and in the southern parts of the city likely contribute traffic to this segment. The segment is also near the airport, which might also explain the congestion.

Segment 6, Stoney Trail to Macleod Trail, never had the maximum commute time in either direction during the six-month period. This segment of Deerfoot Trail never had any accidents, and its commute time always remained close to the average commute time because the segment lies far to the south, almost outside the perimeter of the city.

[Stacked bar chart omitted: "Segment-by-segment analysis"; x-axis time of the day (7:00 AM to 8:30 AM and 3:00 PM to 6:00 PM, at half-hour intervals), y-axis percentage of days (0 to 100); one series per segment, Segment 1 to Segment 6]

Figure 4.12: Percentage of days having maximum commute time (per KM) (south to north)

Figure 4.12 shows the percentage of days on which each of the six sub-segments of Deerfoot Trail had the maximum commute time in the south to north direction. The commute time is again calculated per kilometer. The results are very different from the other direction: traffic on every sub-segment moves in a completely different manner on this side of the highway.

Segment 2, spanning 32 Avenue to 64 Avenue in the south to north direction, has the highest commute time on the largest number of days at all peak hours of the day. This sub-segment had the maximum commute time among all segments on more than 80% of the days. This suggests that many people residing in the south of the city work in the light industrial businesses operating in this segment.

The other segments have the maximum commute time on only a small percentage of days. As shown in Figure 4.12, segment 1 also has a few days of maximum commute time throughout the peak hours of the day. This segment is also near the airport and NE merges into this sub-segment. It is taken by people leaving the city for nearby towns and cities such as Airdrie, Crossfield, and Red Deer.

Segment 5, McKenzie Trail to 24 Street, also has the maximum commute time on a small percentage of days, and only in the morning peak hours. This segment is taken by people living in the south to reach Downtown for work. One similarity between the segment-by-segment analyses for the two directions is that segment 6, Macleod Trail to Stoney Trail, never had the maximum commute time and had almost free-flowing traffic throughout the day.

Figure 4.11 and Figure 4.12 answer the question of which sub-segments of Deerfoot Trail contributed the most to the commute time and which sub-segments dominate at morning and evening peak hours in both directions.

[Stacked bar chart omitted: "Segment-by-segment analysis"; x-axis time of the day (7:00 AM to 8:30 AM and 3:00 PM to 6:00 PM, at half-hour intervals), y-axis commute time in minutes (0 to 50); one series per segment, Segment 1 to Segment 6]

Figure 4.13: Average commute time at each segment (north to south)

[Stacked bar chart omitted: "Segment-by-segment analysis"; x-axis time of the day (7:00 AM to 8:30 AM and 3:00 PM to 6:00 PM, at half-hour intervals), y-axis labelled "Percentage" in the original plot (0 to 50); one series per segment, Segment 1 to Segment 6]

Figure 4.14: Average commute time at each segment (south to north)

After looking at the percentage of days having maximum commute time, I next present the segment-by-segment analysis of the average commute time on all six segments, shown in Figure 4.13 and Figure 4.14. Averages are calculated over 202 days including weekdays, weekends and holidays. There is a noticeable difference between the morning and evening averages for the same segment. These results correlate well with the results in Figure 4.11. For example, at segment 4 the evening average commute time is approximately 4 minutes more than the morning average, which is consistent with Figure 4.11, where this segment has the maximum commute time on a significant percentage of days in the evening peak hours.

For segment 2, Figure 4.13 shows that the average commute time in the morning peak hours is higher than in the evening peak hours. The average commute time for segment 3 remains the same throughout the day, which matches Figure 4.11, where segment 3 had the maximum commute time at both morning and evening peak hours.

Segment 2, 32 Avenue to 64 Avenue, in the south to north direction shown in Figure 4.14, has a high average in the evening peak hours, about double the average in the morning peak hours. On average, the morning commute time for this segment is 6 minutes but the evening commute time is 13 minutes. This is because more accidents happen on this segment in the evening.

Segment 3, 17 Avenue to 16 Avenue, also has a higher average commute time in the evening than in the morning peak hours. This segment did not have the highest commute time on a large percentage of days but, as the averages in Figure 4.14 show, it takes more time to travel this segment in the evening than in the morning because more accidents happen on it in the evening.

The average commute time at each segment does not reveal many observable differences, but the probability distribution of commute times at morning and evening rush hours shows how each segment behaves at different times of the day. Figure 4.15 to Figure 4.22 show the probability distributions for the segments identified as problematic in the previous analyses.

[Plot omitted: x-axis commute time in minutes (4 to 15), y-axis probability (0.00 to 0.45); one series each for 7:00 AM, 7:30 AM, 8:00 AM and 8:30 AM]

Figure 4.15: Segment 2 - Morning rush hours (north to south)

[Plot omitted: x-axis commute time in minutes (4 to 24), y-axis probability (0.00 to 0.45); one series each for 7:00 AM, 7:30 AM, 8:00 AM and 8:30 AM]

Figure 4.16: Segment 2 - Morning rush hours (south to north)

[Plot omitted: x-axis commute time in minutes (4 to 24), y-axis probability (0.00 to 0.45); one series for each half hour from 3:00 PM to 6:00 PM]

Figure 4.17: Segment 2 - Evening rush hours (south to north)

Segment 2 in the north to south direction had a higher average commute time in the morning than in the evening, as shown in Figure 4.13, so the probability distribution of this segment at morning peak hours is shown in Figure 4.15. The minimum time to travel this segment is 4 minutes, but at peak hours it can reach three times this basic commute time.

Segment 2, 32 Avenue to 64 Avenue, in the south to north direction had the maximum commute time on a large percentage of days throughout the peak hours of the day and had a higher average commute time in the evening peak hours than in the morning peak hours. Figure 4.16 shows the probability distribution of commute times for segment 2 in the morning peak hours. The commute time at these hours remains very close to its average. It could take at most 12 minutes to travel this segment, whereas the minimum time to cover this distance is 4 minutes under normal traffic conditions.

Figure 4.17 shows the probability distribution of commute times in the evening peak hours for segment 2, 32 Avenue to 64 Avenue, in the south to north direction. The standard deviation at these hours is much larger than at the morning peak hours. The commute time could reach 21 minutes between 4:30 PM and 5:30 PM, which is about 1.6 times the average commute time of 13 minutes shown in Figure 4.14 and about five times the basic commute time without traffic, which is 4 minutes. The reasons for such large commute times in the evenings are explained above.

According to Figure 4.11 and Figure 4.13, segment 3, 16 Avenue to 17 Avenue, in the north to south direction had the maximum commute time throughout the day. Figure 4.18 and Figure 4.19 show the probability distributions for the morning and evening peak hours, respectively. The minimum time to travel this distance is 5 minutes, but in the morning rush hours it could take almost three times the basic commute time, and in the evening almost four times, especially at 3:00 PM, when people start moving back home towards the south from their offices in the city centre.

Segment 3, 17 Avenue to 16 Avenue, in the south to north direction also has a higher average commute time in the evening than in the morning peak hours, so only the evening peak hours’ probabilities are discussed here. Figure 4.20 shows the probability distribution of commute time at segment 3. This segment is near Downtown and is taken in the evening by people leaving their Downtown offices and heading north to their homes.

The evening hours for this segment have a larger standard deviation than the morning hours. The minimum time to travel this segment is 5 minutes under normal traffic conditions, but it could take four times as long in the evening hours from 4:30 PM to 5:30 PM.

[Plot omitted: x-axis commute time in minutes (5 to 23), y-axis probability (0.00 to 0.45); one series each for 7:00 AM, 7:30 AM, 8:00 AM and 8:30 AM]

Figure 4.18: Segment 3 - Morning rush hours (north to south)

[Plot omitted: x-axis commute time in minutes (5 to 23), y-axis probability (0.00 to 0.45); one series for each half hour from 3:00 PM to 6:00 PM]

Figure 4.19: Segment 3 - Evening rush hours (north to south)

[Plot omitted: x-axis commute time in minutes (5 to 21), y-axis probability (0.00 to 0.45); one series for each half hour from 3:00 PM to 6:00 PM]

Figure 4.20: Segment 3 - Evening rush hours (south to north)

Figure 4.21 shows the PDF of segment 4 in the north to south direction at evening rush hours. This segment had the maximum commute time on a higher percentage of days in the evening than in the morning, and its average commute time is also higher in the evening. The segment is taken in the evening by people who work in the center of the city and are returning to their homes in the south. The minimum time to travel this distance is 7 minutes, but it could rise to approximately 26 minutes at 5:00 PM, when people are moving out of downtown towards the south.

[Plot omitted: x-axis commute time in minutes (7 to 27), y-axis probability (0.00 to 0.45); one series for each half hour from 3:00 PM to 6:00 PM]

Figure 4.21: Segment 4 - Evening rush hours (north to south)

[Plot omitted: x-axis commute time in minutes (6 to 24), y-axis probability (0.00 to 0.45); one series each for 7:00 AM, 7:30 AM, 8:00 AM and 8:30 AM]

Figure 4.22: Segment 5 - Morning rush hours (south to north)

Segment 5, McKenzie Blvd to 24 Street, in the south to north direction has a higher average commute time in the morning than in the evening. This southern segment of Deerfoot Trail is taken by people living in the south who travel north to their offices in the city. Figure 4.22 shows the probability distribution of commute time at segment 5. The minimum commute time to cover this distance is 6 minutes, but it could take 20 minutes to travel this segment between 7:30 AM and 8:00 AM. These results show that each segment behaves differently from the others and that some segments show different traffic conditions in the morning and evening.

4.2 Summary

This chapter presented the results for Deerfoot Trail in both directions, north to south and south to north. The statistical results of the two directions were compared to show how traffic conditions differ between them. The results covered the probability distributions at all peak hours, the number of accidents in the two directions, the segment-by-segment analysis, and the average commute time at peak hours for all segments.

One main conclusion from the statistical results is that traffic conditions on the two sides of the highway are very different from each other. Traffic flows differently on all the segments in the two directions. The segment-by-segment analysis for both directions of Deerfoot Trail shows that not all sections of the road have bad traffic conditions at all times of the day; rather, each section behaves differently at different times of the day. In particular, the segment near Downtown, 16 Avenue to 17 Avenue, has severe traffic jams from 4:30 PM to 5:30 PM in the southbound direction, and its morning peak hour traffic is not as bad as its evening peak hour traffic. In the northbound direction, the results are completely different: the 32 Avenue to 64 Avenue segment is the bottleneck, and the largest number of accidents of the day is observed on this segment, which ultimately increases the commute time.

Chapter Five: UNIQUE TRAFFIC PATTERNS WITH K-MEANS CLUSTERING

To broaden the scope of the statistical analysis of Deerfoot Trail presented in Chapter 4, clustering is performed on the commute times. Data is gathered for this highway in both directions to observe the different traffic trends based on the number of road accidents in a day, the weather conditions in the city on that day, and the time of day. The main objective of clustering is to recognize distinct traffic patterns in six months of traffic data. The factors affecting the commute time are correlated with it, and the impact of each factor on the commute times is observed in detail. Clustering results for both directions of the highway are presented in this chapter, along with the similarities and dissimilarities between the important findings for the two directions.

Section 5.1 describes the methodology behind the clustering analysis of the collected traffic data, used to find the appropriate number of clusters of unique commute time patterns. Section 5.2 presents the clustering results for Deerfoot Trail in both directions and compares the unique commute time patterns of the two directions. Section 5.3 summarizes the clustering results for Deerfoot Trail and its sub-segments.

5.1 Clustering Analysis

The methodology used to analyze the behavior of Deerfoot Trail is described here in detail. 153 days of data spanning the period September 2013 to February 2014 are taken for this study. Since this period covers both fall and winter, the data captures diverse scenarios with respect to weather conditions. The days are classified as “Good weather days”, i.e., days with neither falling snow nor snow on the ground, and “Bad weather days”, i.e., days with one or both of falling snow and snow on the ground. Out of 153 days, 54 were good weather days and 99 were bad weather days. For all these 153 days, k-means clustering [21] is applied to ascertain an appropriate number of unique commute time patterns.
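The weather classification rule stated above can be written as a short Python sketch. The function name `classify_day` and the sample observations are hypothetical; only the rule itself (bad weather if there is falling snow, snow on the ground, or both) comes from the text.

```python
def classify_day(falling_snow, snow_on_ground):
    """Label a day as good or bad weather using the rule stated above."""
    if falling_snow or snow_on_ground:
        return "Bad weather day"
    return "Good weather day"


# Hypothetical observations: (falling_snow, snow_on_ground) for four days.
observations = [(False, False), (True, False), (False, True), (True, True)]
labels = [classify_day(f, g) for f, g in observations]
good_days = labels.count("Good weather day")
bad_days = labels.count("Bad weather day")
print(good_days, bad_days)
```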

For k-means clustering, each day in the dataset was represented by a 19-element feature vector. The elements of the feature vector are the ICT commute times recorded by the system at various times of the day for the entire 50 KM stretch of Deerfoot Trail. The WEKA toolset [17] is used to run k-means clustering with its default settings: the distance between two elements is the Euclidean distance, and the maximum number of iterations for recalculating the distance of each element from the cluster centroids is 100. The impact of other settings will be considered as part of future work. The centroid of a cluster reported by WEKA is a 19-element vector whose elements are the average commute times at the various times of the day over all days included in that cluster.
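To make the procedure concrete, the following is a minimal plain-Python sketch of k-means with Euclidean distance and an iteration cap of 100. It is a stand-in written for illustration and is not the WEKA implementation used in this study: the function names are hypothetical, the seeding (first k days as initial centroids) is a simplifying assumption, and the sample vectors have 3 elements instead of 19 to keep the example short.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(points, k, max_iter=100):
    """Return (labels, centroids). Assumed seeding: the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [None] * len(points)
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable, so stop early
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return labels, centroids


# Hypothetical commute times (minutes) at three instants on five days.
days = [
    [34, 35, 34],
    [35, 36, 35],
    [50, 62, 58],
    [52, 60, 57],
    [34, 34, 35],
]
labels, centroids = kmeans(days, k=2)
print(labels)
print(centroids)
```

Each centroid is the element-wise average of the days assigned to it, which corresponds to the interpretation of the centroid given above.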

First, the appropriate number of clusters is found in order to identify unique commute time patterns. The approach presented by Menasce et al. to determine the appropriate number of clusters for a given dataset [22], described in Chapter 2, is used here. This approach uses βvar (equation 2.3), the ratio of the intra-cluster variance (equation 2.1) to the inter-cluster variance (equation 2.2), to decide on the pertinent k-value, i.e., the number of clusters. For good quality clustering, the intra-cluster distance, i.e., the distance of the feature vectors within a cluster from their centroid, must be low, while the inter-cluster distance, i.e., the distance from the centroid of one cluster to the centroid of another cluster, must be high. Therefore, a lower βvar value indicates better clustering. Clustering exercises with progressively higher k-values are carried out until the βvar value shows no appreciable decrease or starts to increase. The k-value that gives the lowest βvar is then chosen.
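The selection rule can be sketched in Python as follows. Equations 2.1 to 2.3 are not reproduced in this chapter, so the forms used here are assumptions: the intra-cluster term is taken as the mean squared distance of the vectors from their own centroid, and the inter-cluster term as the mean squared distance between pairs of centroids. The function names and all numeric values are hypothetical.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def beta_var(points, labels, centroids):
    """Assumed form of the ratio: intra-cluster over inter-cluster variance."""
    pairs = list(combinations(centroids, 2))
    if not pairs:
        raise ValueError("need at least two clusters")
    intra = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    intra /= len(points)
    inter = sum(sq_dist(a, b) for a, b in pairs) / len(pairs)
    return intra / inter


def best_k(beta_by_k):
    """Return the k-value with the lowest ratio."""
    return min(beta_by_k, key=beta_by_k.get)


# Hypothetical two-cluster example.
points = [[0, 0], [0, 2], [10, 0], [10, 2]]
labels = [0, 0, 1, 1]
centroids = [[0, 1], [10, 1]]
ratio = beta_var(points, labels, centroids)

# Hypothetical ratios for k = 3 to 8, for illustration only.
chosen = best_k({3: 8.1, 4: 5.0, 5: 3.2, 6: 1.4, 7: 1.9, 8: 2.3})
print(ratio, chosen)
```

Tight clusters reduce the numerator and well-separated centroids increase the denominator, so a lower ratio indicates better clustering under these assumed definitions.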

[Plot omitted: "Ratio of Intra and Inter cluster Variance"; x-axis number of clusters (3 to 8), y-axis βvar (0 to 9)]

Figure 5.1: Appropriate number of clusters for North to South direction of Deerfoot Trail

[Plot omitted: "Ratio of Intra and Inter cluster Variance"; x-axis number of clusters (3 to 8), y-axis βvar (0 to 13)]

Figure 5.2: Appropriate number of clusters for South to North direction of Deerfoot Trail

Figure 5.1 and Figure 5.2 show the calculation of the appropriate number of clusters for the North to South and South to North directions of Deerfoot Trail, respectively. The experiment was carried out for 3 to 8 clusters. Based on the βvar values shown in the two figures, 6 was identified as the appropriate number of clusters for both directions.

By taking 6 as the appropriate number of clusters for both the directions, the six unique commute time patterns are identified for Deerfoot Trail. The following section describes these distinctive commute time patterns in detail.

5.2 Commute time patterns

By applying the technique described in section 5.1, 6 unique clusters are identified: 1 cluster for the days with no accidents irrespective of the weather conditions (C1), 3 clusters for good weather days with accidents (C2, C3, and C4), and 2 clusters for bad weather days with accidents (C5 and C6). Figure 5.3 shows the representation of the clusters. On closer analysis, good weather days without accidents and bad weather days without accidents contained only weekends and holidays and exhibited very similar commute time patterns. Therefore, they are represented by a single cluster (C1).

Figure 5.3: Distribution of days into categories based on weather and accidents for Deerfoot Trail (north to south)

From Figure 5.3, about 65% of the days in the observation period had bad weather. This is due to the very short autumn and very long winter season in Calgary. The likelihood of observing at least one accident on Deerfoot Trail is about 85% for both good weather days and bad weather days. Surprisingly, there was not a single working day without accidents in the whole observation period, so at least one accident was observed on 100% of working days. However, closer inspection revealed that the average number of accidents per day is 1.8 times higher on bad weather days than on good weather days. I next discuss in detail the properties of each of the six clusters.
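The two statistics quoted above, the share of days with at least one accident and the ratio of average daily accidents between weather categories, can be computed as in the following sketch. The function names and the daily counts are hypothetical and chosen only to keep the arithmetic easy to follow; they are not the observed data.

```python
def share_with_accident(daily_counts):
    """Fraction of days on which at least one accident was recorded."""
    return sum(1 for c in daily_counts if c >= 1) / len(daily_counts)


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Hypothetical daily accident counts for each weather category.
good_weather = [0, 2, 4, 0, 4]
bad_weather = [4, 0, 6, 5, 5]

share_good = share_with_accident(good_weather)
share_bad = share_with_accident(bad_weather)
accident_ratio = mean(bad_weather) / mean(good_weather)
print(share_good, share_bad, accident_ratio)
```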

Consider first the cluster C1, encompassing days with no accidents. A characteristic of this cluster is that it contains only weekends and holidays. Figure 5.4 shows the commute time trends for this cluster. The x-axis represents the 19 time instants at which the ICT commute times are collected. The y-axis shows the commute time in minutes and, for the sake of clarity, starts from a non-zero origin: the observation is for a 50 KM highway, and it never took less than 33 minutes to travel this stretch. The figure shows the centroid of the cluster, the day that diverges most from the centroid, i.e., upper, and the day that diverges least from the centroid, i.e., lower. Although this cluster includes days with bad weather, weather does not influence commute times because of the light traffic volumes during weekends and holidays.
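One way to pick the "upper" and "lower" days of a cluster is sketched below. It assumes that divergence is measured as the Euclidean distance between a day's feature vector and the centroid, which is an assumption consistent with the k-means settings but not stated explicitly for the figures. The function name and the sample vectors, shortened to 3 elements, are hypothetical.

```python
import math


def divergence_extremes(days, centroid):
    """Return (index of most divergent day, index of least divergent day)."""
    distances = [math.dist(day, centroid) for day in days]
    upper = distances.index(max(distances))
    lower = distances.index(min(distances))
    return upper, lower


# Hypothetical cluster of three days with its centroid (element-wise mean).
cluster_days = [[34, 35, 36], [33, 34, 35], [38, 42, 40]]
centroid = [sum(col) / len(cluster_days) for col in zip(*cluster_days)]
upper, lower = divergence_extremes(cluster_days, centroid)
print(centroid, upper, lower)
```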

[Plot omitted: x-axis time of the day (19 instants from 6:00 AM to 9:00 PM), y-axis commute time in minutes (30 to 44); series Upper, C1 and Lower]

Figure 5.4: Days with no accidents (North to South)

[Plot omitted: x-axis time of the day (19 instants from 6:00 AM to 9:00 PM), y-axis commute time in minutes (30 to 44); series Upper, C1 and Lower]

Figure 5.5: Day with no accidents (South to North)

There is very little variability in commute times among different days, as well as across different times of the day, in this cluster for both directions. The maximum commute time of the days in this cluster varies from 33 minutes to 42 minutes, the latter observed on a day with bad weather. As all the days in this cluster are weekends and holidays, the mean commute times are very close to the normal commute time under no severe traffic conditions.

I next focus on clusters that contain good weather days with accidents. As shown in Figure 5.3, clustering identified three distinct clusters namely, weekends and holidays (C2), workdays with moderate number of accidents (C3), and workdays with large number of accidents (C4). Each of them is explained below in detail for both the directions of Deerfoot Trail.

Figure 5.6 and Figure 5.7 show the C2 clusters for weekends/holidays with accidents in the direction of North to South and South to North of Deerfoot Trail respectively. These clusters contain the holidays and weekends when at least one accident took place. The accidents impacted commute times significantly. Unlike cluster C1, the mean commute time of cluster C2 is not close to the normal commute time without accidents.

Holidays only had an average of 1.8 accidents per day. Comparing the centroids of clusters C1 and C2 in Figure 5.4 and Figure 5.6, the accidents increase the commute time by up to 35% when compared to holidays without accidents. From Figure 5.6 and Figure 5.7, one can observe that commute times are longer in the evening pointing to more accidents in the evening on weekends and holidays.
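The centroid comparison described above amounts to an element-wise percentage increase between two centroid vectors. The following sketch shows the calculation; the function name and the centroid values, shortened to 3 elements, are hypothetical and are not read from Figure 5.4 or Figure 5.6.

```python
def max_percent_increase(baseline, other):
    """Largest element-wise percentage increase of other over baseline."""
    return max(100.0 * (o - b) / b for b, o in zip(baseline, other))


# Hypothetical centroids (minutes) at three time instants.
centroid_no_accidents = [34, 35, 36]
centroid_with_accidents = [36, 42, 45]
increase = max_percent_increase(centroid_no_accidents, centroid_with_accidents)
print(increase)
```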

[Plot omitted: x-axis time of the day (19 instants from 6:00 AM to 9:00 PM), y-axis commute time in minutes (30 to 60); series Upper, C2 and Lower]

Figure 5.6: Traffic Pattern for Weekends on Good Weather Days with accidents (North to South)

[Plot omitted: x-axis time of the day (19 instants from 6:00 AM to 9:00 PM), y-axis commute time in minutes (30 to 60); series Upper, C2 and Lower]

Figure 5.7: Traffic Pattern for Weekends on Good Weather Days with accidents (South to North)

Figure 5.8 and Figure 5.9 show commute time traffic patterns for cluster C3 for both the directions, respectively. Cluster C3 comprises the working days with moderate number of accidents i.e., 4 to 7, in good weather conditions. From Figure 5.3, this cluster represents the most likely pattern for good weather days and contains more than 50% of all good weather days.

The number of accidents per day ranges from 3 to 6 with an average of 4.5. Cluster C3 shows that the morning rush hours are from about 7:30 AM to 8:30 AM while the evening rush hours are from 3:30 PM to 5:30 PM, which matches the results stated in chapter 4. As with holidays, closer analysis showed that more accidents happen during evenings, which might explain why the evening rush hour period is longer.

Figure 5.10 and Figure 5.11 show the traffic pattern for cluster C4, i.e., good weather days with a large number of accidents, i.e., 8 to 16, in the north to south and south to north directions, respectively. This cluster represents the irregular pattern of commute times on Deerfoot Trail caused by the large number of accidents that happened on these days. These days account for only 15% of the good weather days with accidents.

The number of accidents per day in this cluster range from 7 to 13 with an average of 8.2.

Comparing the centroids of clusters C3 (regular commute time pattern) and C4 (irregular commute time pattern) in Figure 5.8 and Figure 5.10, the magnitude of rush hour commute times and the durations of the rush hours increase significantly as the number of accidents increase.

From Figure 5.10, the maximum commute time observed for cluster C4 was 1 hour and 46 minutes. Evening rush hours are worse than morning rush hours: the commute time rises to 3 times the normal commute time, with an average of 4 accidents at one time.

[Plot omitted: x-axis time of the day (19 instants from 6:00 AM to 9:00 PM), y-axis commute time in minutes (30 to 70); series Upper, C3 and Lower]

Figure 5.8: Regular Traffic Pattern for Working Days (North to South)

[Plot omitted: x-axis time of the day (19 instants from 6:00 AM to 9:00 PM), y-axis commute time in minutes (30 to 70); series Upper, C3 and Lower]

Figure 5.9: Regular Traffic Pattern for Working Days (South to North)

[Plot omitted: x-axis time of the day (19 instants from 6:00 AM to 9:00 PM), y-axis commute time in minutes (30 to 110); series Upper, C4 and Lower]

Figure 5.10: Traffic Pattern for Working days with more number of accidents (North to South)

[Plot omitted: x-axis time of the day (19 instants from 6:00 AM to 9:00 PM), y-axis commute time in minutes (30 to 110); series Upper, C4 and Lower]

Figure 5.11: Traffic Pattern for Working days with more number of accidents (South to North)

The evening rush hour commute times for cluster C4 are worse in the north to south direction than in the south to north direction. The average commute time of this cluster increases by 20 minutes because there is a high probability of more accidents in the north to south direction when people are leaving downtown in the evening for their homes in the south of the city. In contrast, the morning rush hour average commute times for the two directions are very close to each other, as the probability of accidents in both directions is very low. These results conform to the statistical results on commute times and accidents presented in chapter 4.

Closer exploration of sub-segment commute times for both clusters, C3 and C4, shows that the evening rush hour commute times in the stretch leaving the city center were much higher than the morning rush hour commute times in the stretch leading into the city center. This suggests that the evening traffic volume out of the city center is higher than the morning traffic volume into the city center. This evening traffic in this sub-segment in the southbound direction represents residents living south of the city center getting back to their homes. These results confirm the longstanding intuition among Calgary residents that there is a lack of good alternatives to Deerfoot Trail for residents in the south.

I next discuss clusters that include bad weather days, i.e., the days with one or both of falling snow and snow on the ground. These days formed two clusters, C5 and C6. C5 displays a regular commute time pattern with clear morning and evening peak hours and off-peak hours. However, C6 does not have such a regular pattern since it includes days with very severe weather conditions. For example, it includes a day on which Calgary experienced one of its worst blizzards.

Figure 5.12 and Figure 5.13 show the patterns for C5 for both directions, respectively. About 66% of bad weather days were classified into this cluster, as indicated in Figure 5.3.

[Figure 5.12 plot: commute time (in minutes, 30 to 85) against time of the day (6:00 AM to 9:00 PM), showing the C5 centroid with its upper and lower curves.]

Figure 5.12: Regular Traffic Pattern for Working days on Bad Weather (North to South)

[Figure 5.13 plot: commute time (in minutes, 30 to 85) against time of the day (6:00 AM to 9:00 PM), showing the C5 centroid with its upper and lower curves.]

Figure 5.13: Regular Traffic Pattern for Working days on Bad Weather (South to North)


[Figure 5.14 plot: commute time (in minutes, 30 to 130) against time of the day (6:00 AM to 9:00 PM), showing the C6 centroid with its upper and lower curves.]

Figure 5.14: Random Traffic Pattern on Snowy Days with more number of accidents (North to South)

[Figure 5.15 plot: commute time (in minutes, 30 to 130) against time of the day (6:00 AM to 9:00 PM), showing the C6 centroid with its upper and lower curves.]

Figure 5.15: Random Traffic Pattern on Snowy Days with more number of accidents (South to North)


Cluster C5 contains the workdays with bad weather where the peaks are seen at the usual rush hours of the day. Under moderate weather conditions, this cluster follows the same regular commute pattern as a normal working day with good weather, with a little variation at rush hours. The number of accidents per day ranges from 1 to 7 for the days in this cluster, with an average of 3.9 for the two directions of the highway.

The cluster C5 centroids for the two directions of Deerfoot Trail are quite similar since the average number of accidents is similar in both directions. Comparing the centroid in Figure 5.12 with that in Figure 5.8, which represents the good weather day cluster with a comparable average number of accidents, the maximum evening rush hour commute time increases by 15 minutes due to bad weather and the morning commute time increases by 8 minutes.

Finally, Figure 5.14 and Figure 5.15 show cluster C6, which contains bad weather days with many accidents and unusual patterns, where commute time peaks happen outside of the usual peak hours. About 20% of the bad weather days fall into this cluster. The number of accidents per day varies from 8 to 16, with an average of 9.6.

In Figure 5.14, the upper curve of cluster C6 represents December 2, 2013, when Calgary was hit by an extreme blizzard. Commute times were as high as 2 hours and 4 minutes, with 16 accidents reported on the north to south segment of Deerfoot Trail alone. In the other direction of the highway, south to north, the commute time went up to 1 hour 42 minutes at 1 PM with 5 accidents at the same time, as shown in Figure 5.15. The cluster C6 centroids have an irregular pattern, unlike the centroids of clusters C3 and C5, where the peaks are seen at the typical rush hours. The unusual peaks at non-rush hours are shaped by a few days with sudden weather changes, which led to more accidents on the road.

From Figure 5.3, clusters C3 and C5 (the clusters that follow regular commute time patterns, with peaks only at the rush hours of the city) contain about 60% of the days in our dataset. A remarkable feature of these clusters is that their days have well-defined patterns and are very close to their respective centroids. This suggests that clustering-based commute time prediction models can be very effective, as explained in detail in Chapter 6.

Analysis of the sub-segment information provided several interesting observations on the accidents. More than 34% of the total accidents reported in the Tweets were concentrated on a 4 km stretch covering the 16th Avenue N, Memorial Drive, and 17th Avenue S interchanges. This sub-segment recorded a higher number of accidents than any other sub-segment. Based on the speed limit of the highway, it should take only about 3 minutes to travel this sub-segment. However, the ICT commute time data indicates that commute times in this sub-segment went as high as 66 minutes at 3 PM on December 23, 2013, when it was snowing and there was 18 cm of snow on the ground.

5.3 Summary

This chapter presented the clustering results for the highway, Deerfoot Trail. These results demonstrated the dependency of commute time on factors like weather, accidents, time of the day, and day of the week. The data of 153 days is characterized into six different commute time patterns. These six unique patterns showed how the commute time is affected by moderate and severe weather conditions, as well as the impact of the number of accidents on the commute time at a particular time of the day.

One main conclusion from the clustering is that more than half of the days follow a regular commute time pattern that depends on the weather conditions of the day. Moreover, the number of accidents has a higher impact on commute times than the other factors. On average, one accident increases the commute time by 7 minutes. For almost all rush hours the most probable outcome is one accident, but a few hours in the evening have a probability of 4 accidents at the same time, which could increase the average commute time by 28 minutes (4 accidents × 7 minutes) in the evening rush hours, especially between 5 and 5:30 PM.


Chapter Six: COMMUTE TIME PREDICTION MODEL

Commute time prediction is an important component of Intelligent Transportation Systems (ITS). It enables users to be well informed of the likely future conditions on roadways, so that pre-trip plans can be made accordingly in order to reduce commute time. This kind of information can also be used by road traffic system controllers to formulate traffic control strategies that ease traffic congestion. With commute time predictions, roads may be used more efficiently, with better overall network performance. The integrated prediction model proposed here puts emphasis on commute time prediction under various traffic and weather scenarios. Prediction is done on the basis of data collected for 202 days. These days are divided into two parts: the data of 153 days is used for training the model and the data of the remaining 49 days is used for testing the prediction model.

Section 6.1 describes the methodology used for predicting the commute time for travelling Deerfoot Trail on a particular day and time. Section 6.2 presents the performance measurements of the prediction model. Section 6.3 summarizes the results and concludes this chapter.

6.1 Prediction Model

Researchers have developed a range of techniques to predict commute times to a destination at future times. ITS technologies provide various ways of predicting travel times, and a variety of intelligent transportation systems have already been developed and applied in transportation networks. These techniques require different parameters as the basis for predicting commute times, depending upon what kind of information is available beforehand. In this research, weather data and day of the week data are available ahead of time to predict future commute times. The collected data is already clustered into six unique clusters using the weather data, day of the week, and commute time information. So, the weather data and day of the week can be used to categorize a future day into one of the six unique clusters.

As commute time is to be predicted at each time instant of the day, the prediction process is divided into the following two parts:

a) Assign the cluster based on weather conditions

b) Predict the commute time at 19 time instants of a day based on the cluster assigned in step (a) and the probability of accidents at each time instant in that particular cluster.

[Figure 6.1 flowchart: Training Dataset (153 days) and Prediction Dataset (49 days) → Step a: Assign Cluster (Naïve Bayes Classifier) → Prediction Dataset with assigned cluster, together with the Probability of accidents at each time instant for each day of week → Step b: Predict commute time at each time instant → Prediction dataset with commute time estimation]

Figure 6.1: Prediction process


Figure 6.1 shows the complete prediction process. The steps are explained in detail below. Based on the traffic data available for 153 days, commute time is predicted for vehicles travelling on Deerfoot Trail. As described in Chapter 5, the 153 days are clustered into 6 clusters, each with a unique traffic pattern. These 6 clusters cover different weather conditions and varying numbers of road accidents at different times of the day and days of the week. The data of the 153 days with assigned clusters is taken as the training dataset and the data of the 49 days with unassigned clusters is taken as the prediction dataset.

In step (a) of Figure 6.1, a cluster is assigned to the day for which the prediction is to be done. One cluster is assigned out of the six unique clusters identified for the dataset in Chapter 5. A classification method is used to make this assignment. The goal of classification is predictive, and it needs a reference set of classes from which to assign one class to each instance in a new dataset. The classification process is followed here to assign one class from the reference set to each day in the prediction dataset. The six unique clusters identified by clustering are taken as the reference set of classes.

Classification is based on the weather conditions of the future day and the weather conditions of the 6 different clusters. The classification model is first trained with a training dataset that has pre-assigned classes, and the prediction dataset is then given to the model so that it assigns one of the classes to each new instance. The six clusters are considered as the pre-assigned classes for the training dataset. The naïve Bayes classifier in the WEKA tool [18] is used to classify the prediction dataset.


The naïve Bayes classifier [24] is a simple probabilistic classifier based on applying Bayes' theorem with strong independence assumptions. Like all classifiers, it is first trained with a training dataset, which is data with assigned classes. These assigned classes are the reference set of classes. The prediction dataset is data with unassigned classes. The training dataset and the prediction dataset are the two inputs to the classifier. The objective of the naïve Bayes classifier is to compute the likelihood of each class for every instance; the class with the maximum likelihood is assigned to the instance. Once the classifier has assigned classes to the unassigned data in the prediction dataset, the data with assigned classes is the output of the classifier.

Here the training data of 153 days, which were clustered earlier, is the input to the naïve Bayes classifier. As shown in Chapter 5, the clusters are further categorized into good and bad weather days. The prediction dataset consists of 49 days with unknown classes and variable weather conditions, i.e., good and bad weather conditions. The commute time is not known for these days; the only information available is the weather forecast and the day of the week.

The block diagram in Figure 6.2 represents the naïve Bayes classification process [47] used to classify the prediction dataset. The feature vector of the training dataset is date, day of the week, total snow, snow on ground, and cluster. The prediction dataset has the same feature vector as the training dataset except for the cluster, because the cluster is unknown for new data. As described earlier, the clusters (C1 to C6) are taken as the reference set of classes. Based on the training dataset of 153 days, the naïve Bayes classifier assigns one of the classes (C1 to C6) to each of the 49 days in the prediction dataset.


Figure 6.2: Naïve Bayes Classifier
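The cluster assignment step can be illustrated with a minimal Python sketch of a naïve Bayes classifier. This is not the WEKA implementation used in this research; it is a simplified stand-in that assumes a smoothed frequency estimate for the day of the week and a Gaussian likelihood for the two snow features. The function names, the variance floor, and the toy training rows are all hypothetical and are not taken from the thesis data.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy rows: (day of week, total snow in cm, snow on ground in cm, cluster).
TRAINING_DAYS = [
    ("Sat", 0.0, 0.0, "C1"), ("Sun", 0.0, 0.0, "C1"),
    ("Mon", 0.0, 0.0, "C3"), ("Wed", 0.0, 1.0, "C3"), ("Tue", 0.0, 0.0, "C3"),
    ("Tue", 4.0, 10.0, "C5"), ("Thu", 6.0, 12.0, "C5"), ("Fri", 5.0, 8.0, "C5"),
]


def train_naive_bayes(rows):
    """Estimate the prior, weekday counts, and per-feature Gaussians for each cluster."""
    by_class = defaultdict(list)
    for day, snow, ground, cluster in rows:
        by_class[cluster].append((day, snow, ground))
    model = {}
    for cluster, members in by_class.items():
        n = len(members)
        gauss = []
        for col in (1, 2):  # total snow, snow on ground
            values = [m[col] for m in members]
            mean = sum(values) / n
            var = sum((v - mean) ** 2 for v in values) / n
            # Assumed variance floor so that constant features do not give zero variance.
            gauss.append((mean, max(var, 0.25)))
        model[cluster] = {
            "prior": n / len(rows),
            "day_counts": Counter(m[0] for m in members),
            "n": n,
            "gauss": gauss,
        }
    return model


def log_gaussian(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def classify(model, day, snow, ground):
    """Return the cluster with the highest posterior score for one day."""
    best, best_score = None, -math.inf
    for cluster, m in model.items():
        score = math.log(m["prior"])
        # Laplace smoothing over the 7 days of the week.
        score += math.log((m["day_counts"][day] + 1) / (m["n"] + 7))
        score += log_gaussian(snow, *m["gauss"][0])
        score += log_gaussian(ground, *m["gauss"][1])
        if score > best_score:
            best, best_score = cluster, score
    return best
```

Calling `classify(train_naive_bayes(TRAINING_DAYS), "Thu", 5.0, 10.0)` assigns the snowy Thursday to the bad weather cluster in this toy data. The scores are summed in log space so that multiplying many small probabilities does not underflow.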

Now, each day has been assigned a cluster, and each cluster has a centroid of 19 values that represents the average commute time of a single day at 19 time instants. The commute times in the centroid are the average commute times of all the days grouped into that cluster. The centroid is not used directly as the predicted commute times for the day because doing so gives low prediction accuracy and a large time difference from the actual commute times of the day.

Actual commute times of the day are the commute times recorded from Google Maps, following the process explained in Chapter 3. To improve accuracy and reduce the time difference between predicted and actual commute times, the probability of accidents is calculated at the 19 time instants separately for each cluster and for each group of weekdays in that cluster, as calculated in Chapter 4. The probability of accidents is calculated per group of weekdays within a cluster because the same weekdays share certain accident behavior. For example, Fridays have more accidents than Wednesdays.

Each time instant has its own probability of accidents depending on which cluster it belongs to and which day of the week it is. For example, 5 PM has different probabilities of having different numbers of accidents, say 1 or 5, across the 6 clusters and across the days of the week within a cluster. Mondays in cluster C3 have different probabilities than Mondays in cluster C5 and Tuesdays in cluster C3.
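One way to obtain such per-cluster, per-weekday probabilities is a relative frequency estimate over the training days. The Python sketch below assumes this estimator; the procedure actually used is the one described in Chapter 4, and the function name and record layout here are hypothetical.

```python
from collections import Counter, defaultdict


def accident_probabilities(observations):
    """Estimate P(number of accidents) for each (cluster, weekday, time instant).

    observations: iterable of (cluster, weekday, time_instant, accident_count),
    one record per training day and time instant (assumed layout).
    Returns {(cluster, weekday, time_instant): {accident_count: probability}}.
    """
    counts = defaultdict(Counter)
    for cluster, weekday, instant, n_accidents in observations:
        counts[(cluster, weekday, instant)][n_accidents] += 1
    return {
        key: {k: c / sum(counter.values()) for k, c in counter.items()}
        for key, counter in counts.items()
    }
```

With four Mondays in cluster C3 that recorded 0, 1, 1, and 2 accidents at 5:00 PM, this estimate gives probabilities of 0.25, 0.5, and 0.25 for 0, 1, and 2 accidents at that time instant.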

Based on the probabilities calculated per time instant for a single cluster, commute times are predicted for the 19 time instants of the day. It has been observed that, on average, one accident adds approximately 7 minutes to the total commute time, as calculated in Chapter 4. The commute time for a single time instant is predicted by taking the base commute time for travelling the 50 km stretch of Deerfoot Trail at the speed limit of 100 km/h, taken as 35 minutes, and adding the probability-weighted sum of the delays for each possible number of accidents that could occur at that time of the day. The following pseudocode represents this prediction process.

    Set baseCommuteTime to 35
    Set ARRAY predictedTime[19] to ZEROS
    //get the day of the week
    dayOfWeek = getDayOfWeek(dayToPredict)
    //get the cluster assigned to this day through naïve Bayes classifier
    clusterAssigned = getCluster(dayToPredict)
    //get the average minutes added per accident to the commute time
    averageTimeofAccidents = getAverageCommuteTimeForAccidents(clusterAssigned)
    //19 time instants at which commute time is to be predicted
    FOR each timeInstant in dayToPredict
        //assign base commute time to time instant
        predictedTime[timeInstant] = baseCommuteTime
        FOR i = 1 to maxNumberOfAccidents
            //get the probability of i accidents at this time instant for the day of the week
            probability = getProbabilityOfAccident(clusterAssigned, dayOfWeek, timeInstant, i)
            //add the weighted contribution of i accidents at this time instant
            predictedTime[timeInstant] += probability * averageTimeofAccidents * i
        ENDFOR
    ENDFOR
    RETURN predictedTime
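The pseudocode above can be written as a short runnable Python sketch. It assumes that the accident probabilities for the assigned cluster and day of the week are already available as a dictionary per time instant; the function name, the parameter names, and the example distribution are hypothetical, while the 35 minute base time and the 7 minutes per accident come from the text.

```python
BASE_COMMUTE_MINUTES = 35  # base commute time used in this chapter


def predict_commute_times(time_instants, accident_probs, minutes_per_accident=7.0):
    """Predict the commute time at each time instant of one day.

    accident_probs: {time_instant: {accident_count: probability}} for the
    cluster and day of the week assigned to the day (assumed layout).
    """
    predicted = {}
    for instant in time_instants:
        distribution = accident_probs.get(instant, {})
        # Expected number of accidents = sum over i of i * P(i accidents).
        expected_accidents = sum(i * p for i, p in distribution.items())
        predicted[instant] = BASE_COMMUTE_MINUTES + minutes_per_accident * expected_accidents
    return predicted
```

For example, if 5:00 PM has a probability of 0.25 for one accident and 0.25 for four accidents, the expected number of accidents is 1.25 and the predicted commute time is 35 + 7 × 1.25 = 43.75 minutes. A time instant without any recorded accident probability keeps the base commute time.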

Once the data of the 49 days is predicted using this process, each day has its predicted commute time at 19 time instants of the day. Each day is classified into one of the six clusters, C1 to C6. As discussed in Chapter 5, these clusters have different weather conditions and different ranges of accidents. Only cluster C6 has extreme weather conditions, such as blizzard or heavy snowfall days, and because of the weather the days grouped into this cluster had more accidents than those in the other five clusters. The 49 days used for prediction are classified into all clusters except C6, as no days with extreme weather were found. The weather conditions in Calgary, Canada are not as bad in February, March, and April as they are in December and January.

6.2 Prediction Performance

Commute times are predicted for 49 days and the accuracy rate is calculated. Accuracy is measured by comparing the predicted commute time with the actual commute time. The tolerance window for the predicted commute time is taken to be 5 minutes. If the predicted commute time is within 5 minutes of the actual commute time recorded from Google Maps, then the predicted commute time is considered correct. The reason for choosing a window of 5 minutes is the assumption that, for travelling a 50 km stretch under a speed limit of 100 km/h, a 5 minute difference in the predicted commute time is not a major difference. With this definition, the accuracy rate is 76.8%. If an exact match between predicted and actual commute times is used as the criterion, then the accuracy rate is 18.7%. Table 6.1 shows the accuracy rate at different time difference tolerances between predicted and actual commute time.

Table 6.1: Accuracy rate of predicted commute time

| Difference between predicted and actual commute time | Accuracy rate |
|---|---|
| 0 minutes | 18.7% |
| >0 and <=5 minutes | 58.1% |
| >5 and <=10 minutes | 18.2% |
| >10 and <=15 minutes | 3.7% |
| >15 and <=20 minutes | 0.9% |
| >20 minutes | 0.5% |

Another metric to measure the performance of the prediction model is the relative percentage difference. The relative percentage difference of the predicted commute time from the actual commute time is calculated from the magnitude of the difference between the predicted and actual commute times over all days. The following formula represents the relative percentage difference, which came out to be 8.2% for the whole data.

$$\text{Relative percentage difference} = \frac{\sum \left| \text{predicted commute time} - \text{actual commute time} \right|}{\sum \text{actual commute time}} \times 100$$
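One reading of this metric is sketched below in Python. It assumes that the sum of absolute differences is divided by the sum of the actual commute times, taken over all time instants of all days; the function name and the sample values are hypothetical.

```python
def relative_percentage_difference(predicted, actual):
    """Sum of absolute differences as a percentage of the summed actual commute times."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    total_actual = sum(actual)
    if total_actual == 0:
        raise ValueError("actual commute times must not sum to zero")
    total_difference = sum(abs(p - a) for p, a in zip(predicted, actual))
    return 100.0 * total_difference / total_actual
```

For predicted commute times of 36 and 44 minutes against actual commute times of 40 and 40 minutes, the absolute differences sum to 8 minutes out of 80 actual minutes, which gives 10%.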

Table 6.2 shows the percentage of days classified into each cluster and the accuracy rate of each cluster for a time difference tolerance of 5 minutes between predicted and actual commute times. It also shows the relative percentage difference for each cluster. As the overall accuracy rate is 76.8%, some days are misclassified into the wrong cluster.

According to Table 6.2, all the clusters show a low relative percentage difference except C2. Cluster C2 contains weekends and holidays, when very few accidents happen in a day. These accidents may also occur at non-peak hours, as recorded in the training dataset. On closer inspection, C2 showed a high relative percentage difference because the accidents in this cluster are predicted at the wrong times of the day. This cluster may benefit from more training and prediction data to better assess the effectiveness of the model for such traffic conditions. The following graphs show the commute times for one day in each cluster, indicating how good the results for that cluster are. These days are chosen randomly.


Table 6.2: Performance of prediction model for each cluster

| Cluster | Percentage of days classified | Accuracy rate | Relative percentage difference |
|---|---|---|---|
| C1 | 14.3% | 98.5% | 0.3% |
| C2 | 14.3% | 42% | 27.4% |
| C3 | 18.4% | 57.3% | 3.3% |
| C4 | 6.1% | 49.1% | 7% |
| C5 | 46.9% | 77% | 7.9% |
| C6 | 0% | NA | NA |

Figure 6.3 shows the predicted commute time for a single day. This day is classified into cluster C1 because of its similar weather conditions and day of the week. Cluster C1 contains all the Saturdays and Sundays with no accidents. The predicted day is also a Sunday with normal weather conditions, and no accident happened on this day. Figure 6.3 shows the actual commute time, the predicted commute time, and the cluster centroid for this Sunday. The X-axis shows the 19 time instants of the day for which commute time is recorded and predicted in this experiment. The Y-axis shows the commute time in minutes for Deerfoot Trail. This axis starts from 30 minutes because, according to the data recorded from Google Maps, it never takes less than 30 minutes to travel the 50 km stretch of Deerfoot Trail, so there is no need to start the axis from 0 minutes. Here, the actual commute time and the predicted commute time are very close to each other. The probability of observing an accident on this day is almost 0, so the commute time remains the same as the base commute time.

[Figure 6.3 plot: commute time (in minutes, 30 to 37) against time of the day (6:00 AM to 9:00 PM), showing the actual commute time, the predicted commute time, and the C1 centroid.]

Figure 6.3: Day predicted in cluster C1

Seven of the 49 days are predicted into cluster C1. For all the days in this cluster, the predicted commute times are very close to the actual commute times recorded for these days. The clustering results showed that all the days in cluster C1 have commute times very close to the base commute time and a very low probability of an accident. These seven days have the same properties as cluster C1: the accuracy rate of this cluster is 98.5% and its relative percentage difference is 0.3%.

Figure 6.4 shows a day classified into cluster C2. Cluster C2 contains Saturdays, Sundays, and holidays. Days in this cluster have some probability of observing at least one accident in a day, and the highest probability of observing an accident is in the evening. The weather conditions of the days in this cluster are normal and fall into the good weather category. The day shown in Figure 6.4 is February 22, 2014, a Saturday with good weather conditions, i.e., no snowfall. The predicted commute time is close to the actual commute time in the morning, but it does not follow the same trend in the evening.

[Figure 6.4 plot: commute time (in minutes, 30 to 50) against time of the day (6:00 AM to 9:00 PM), showing the actual commute time, the predicted commute time, and the C2 centroid.]

Figure 6.4: Day predicted in cluster C2

14.3% of the days, i.e., 7 out of 49 days, are classified into this cluster. The relative percentage difference of this cluster is 27.4%, which is high compared with the overall relative percentage difference of 8.2% and shows that the predicted commute times in this cluster differ considerably from the actual commute times. The accuracy rate with the commute time window of 5 minutes is 42%, which is very low compared with the overall accuracy rate and shows that more than half of the predictions in this cluster miss the 5 minute window. As discussed previously, prediction for such days may benefit from more training and prediction data. It should be noted that commuters are typically interested in predictions for weekdays, when traffic congestion can be significant.


March 12, 2014 was a Wednesday with good weather conditions and no snowfall. This day is classified into cluster C3. The commute times at the 19 time instants are predicted for this day using the probabilities of all possible numbers of accidents for Wednesdays in cluster C3. The predicted commute times are compared with the actual commute times of this day and with the centroid of the assigned cluster, i.e., C3. Figure 6.5 shows the test results. At almost all times of the day, the predicted commute times are closer to the actual commute times than the cluster centroid is.

[Figure 6.5 plot: commute time (in minutes, 30 to 46) against time of the day (6:00 AM to 9:00 PM), showing the actual commute time, the predicted commute time, and the C3 centroid.]

Figure 6.5: Day predicted in cluster C3

18.4% of the days are classified into this cluster. The relative percentage difference for this cluster is 3.3%, which indicates that the predicted commute times in this cluster are not very different from the actual commute times. Peak hours have a relative percentage difference of 2.8% and non-peak hours have a relative percentage difference of 0.5%. This shows that the difference between predicted and actual commute times is larger for peak hours than for non-peak hours. The accuracy rate for this cluster is 57.3%.


Friday, March 28, 2014 is predicted into cluster C4. This day falls into the category of workdays with a large number of accidents. Figure 6.6 shows the data predicted for this day. The predicted commute times are close to the actual commute times, with minor differences. The commute time is predicted using the probabilities of accidents on all the Fridays in this cluster.

[Figure 6.6 plot: commute time (in minutes, 30 to 90) against time of the day (6:00 AM to 9:00 PM), showing the actual commute time, the predicted commute time, and the C4 centroid.]

Figure 6.6: Day predicted in cluster C4

6.1% of the days are classified into this cluster, and the cluster has a relative percentage difference of 7%. This low relative percentage difference suggests that the days are correctly classified into this cluster, although the accuracy rate of the cluster is only 49.1%.

Figure 6.7 shows a day predicted into cluster C5. This cluster consists of workdays with a moderate number of accidents and bad weather conditions. March 26, 2014 is classified into this cluster using the snowfall data of this day. The predicted commute times are closer to the actual commute times than the cluster centroid is.


[Figure 6.7 plot: commute time (in minutes, 30 to 65) against time of the day (6:00 AM to 9:00 PM), showing the actual commute time, the predicted commute time, and the C5 centroid.]

Figure 6.7: Day predicted in cluster C5

About 47% of the days are classified into cluster C5 because most of the days in the prediction dataset had snowfall. Despite the bad weather, the accuracy rate of this cluster is 77% and the relative percentage difference is only 7.9%, which means the predicted commute times are not very different from the actual commute times. The relative percentage difference for all the peak hours in cluster C5 is 11.4%, which also indicates that the time difference between predicted and actual commute times is not very large.

Since C5 is the cluster to which the largest number of days in the prediction data was assigned, I take a closer look at its results. The relative percentage difference is calculated for different time differences between predicted and actual commute times, i.e., 5, 10, 15, and 20 minutes. Table 6.3 shows the results for cluster C5.


Table 6.3: Relative percentage difference for cluster C5 for various time differences between predicted and actual commute times

| Time difference of predicted and actual commute time | Relative percentage difference |
|---|---|
| <= 5 minutes | 3.5% |
| >5 and <=10 minutes | 4.7% |
| >10 and <=15 minutes | 1.4% |
| >15 and <=20 minutes | 0.9% |
| >20 minutes | 1.1% |

The data in Table 6.3 shows that, beyond the >5 and <=10 minutes band, the relative percentage difference decreases as the time difference between the predicted and actual commute times increases, except for the >20 minutes row. This shows that predictions are quite close to the actual commute times for this cluster.

No days are classified into cluster C6, as no extreme weather conditions occurred in the period under prediction. The data collected for this research covers 202 days, of which 153 days are used as the training set and are already clustered into six unique clusters to show the different traffic trends of Deerfoot Trail. The remaining 49 days are used for testing the prediction model designed in this research. The prediction model is tested on only a single set of data, and cross validation among different sets is not performed. More data would have helped to predict the commute times more accurately. Because of time constraints, only six months of data were collected for this research. Although these days cover diverse scenarios in terms of weather and number of accidents, more than one year of data would have been more advantageous.


6.3 Summary

This chapter presented the commute time prediction model proposed in this research and showed its performance. The commute times at 19 time instants are predicted for 49 days. The process is carried out in two steps, using the clustering results based on weather and traffic conditions presented in Chapter 5. First, the model identifies the cluster for the day for which the prediction is to be done. Then, commute times are predicted at each time instant individually by considering the probabilities of all possible numbers of accidents at that time in the cluster assigned to that day. The performance of the commute time prediction model is measured by the accuracy rate and the relative percentage difference of each cluster. Five of the six unique clusters are assigned to the 49 days. Only one cluster showed low accuracy and a high relative percentage difference; the other four clusters showed good results. Overall, the proposed commute time prediction model performed well by incorporating the impact of both weather and traffic accidents.


Chapter Seven: CONCLUSIONS AND FUTURE WORK

7.1 Conclusions

In this study, a Web mining approach is proposed for supporting historical analyses of vehicle commute times on roadways, and a commute time prediction model is developed on the basis of results inferred from the historical vehicle traffic analyses. The historical traffic analyses are performed in order to identify past road traffic congestion problems. The results of these analyses can be used by traffic controllers to avoid congestion problems in the future.

Unlike the approaches used by other researchers for traffic data collection, this system does not require dedicated roadside sensors and associated infrastructure to gather road traffic data. Furthermore, sensor-based approaches cannot be applied easily to new routes. Consequently, the most important motivation of this research is to develop a Web mining data collection alternative for gathering traffic data in order to perform analyses on historical data.

The proposed system collects traffic data from heterogeneous Web sources such as Google Maps, the Twitter social network, and weather report websites. It accumulates commute time estimates for any given route and its sub-segments from the Google Maps website over a long period of time. Moreover, it overlays traffic accident information from the Twitter social network and weather information onto the collected commute times. The system is able to support analyses that characterize commute time patterns and their dependency on factors such as weather, accidents, time of the day, and day of the week.
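The overlay of accident and weather information onto commute times can be sketched in Python as a join on time. This is a minimal illustration rather than the system's actual implementation: it assumes that accident reports are counted per calendar hour and that weather is recorded once per day, and the function name, field names, and sample values are hypothetical.

```python
from collections import Counter
from datetime import date, datetime


def overlay_sources(commute_samples, accident_reports, daily_weather):
    """Join commute time samples with accident counts and daily weather by time.

    commute_samples: list of (datetime, commute_minutes)
    accident_reports: list of datetime values, one per reported accident
    daily_weather: {date: (snowfall_cm, snow_on_ground_cm)}
    """
    # Assumption: an accident is attributed to every sample taken in the same hour.
    accidents_per_hour = Counter((t.date(), t.hour) for t in accident_reports)
    combined = []
    for sampled_at, minutes in commute_samples:
        snowfall, snow_on_ground = daily_weather.get(sampled_at.date(), (None, None))
        combined.append({
            "time": sampled_at,
            "commute_minutes": minutes,
            "accidents": accidents_per_hour[(sampled_at.date(), sampled_at.hour)],
            "snowfall_cm": snowfall,
            "snow_on_ground_cm": snow_on_ground,
        })
    return combined
```

Each combined record then carries the commute time together with the accident count and weather for the same point in time, which is the form needed for the kind of analysis described above. A finer time window than one hour could be used if the accident timestamps are reliable enough.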

This information is collected for 18 major highways and arterial roads in the city of Calgary. As a proof of concept, a detailed historical analysis is carried out for a major highway in the city of Calgary, Deerfoot Trail, in both directions, i.e., north to south and south to north.


Historical traffic data for this highway is gathered for 202 days from three different Web sources. Commute time and route information is gathered from the Google Maps website and the Google Maps API. Information about traffic accidents on this highway is collected through the Twitter Search API. Weather information, specifically snowfall and snow on ground, is gathered from the historical weather website climate.weather.gc.ca. The proposed system then combines the information collected from these heterogeneous Web sources, joining the records on their common time attribute. To summarize, the key results of the historical vehicle traffic analysis of Deerfoot Trail are as follows:

• Evening rush hours are longer than the morning rush hours.

• Most accidents on Deerfoot Trail happen between 5:00 PM and 5:30 PM in the northbound direction and between 4:30 PM and 5:30 PM in the southbound direction.

• Far fewer accidents are observed in off-peak hours than in peak hours. Over the 202-day period, the number of accidents recorded at off-peak hours never exceeds one, while the highest number recorded at peak hours is 9.

• On days with heavy snowfall, commute time is four times the normal value and the number of accidents increases fivefold.

• In the northbound direction of Deerfoot Trail, the bottleneck is the sub-segment from 16 Avenue to 17 Avenue, near Downtown. In the opposite direction, the bottleneck is the sub-segment from 32 Avenue to 64 Avenue, near the city's airport, which is quite far from Downtown.


To build a predictive model, the data is divided into two datasets: a training dataset of 153 days and a testing dataset of 49 days. First, clustering is performed on the 153-day training dataset to identify unique traffic patterns. Clustering produced the following outcomes:

• Clustering identified 6 unique traffic patterns in the 153 days.

• Each pattern differs from the others in terms of weather conditions and number of accidents.

• Clustering divided the days into two categories on the basis of weather conditions: a set of 54 good weather days, when there was zero snowfall in the city, and a set of 99 bad weather days, when there was snowfall on the day and/or snow on the ground from previous days' snowfall.

• Furthermore, clustering divided these two sets into 6 unique traffic patterns on the basis of the number of accidents on those days. All weekends and holidays, for both good and bad weather days, fell into one cluster having no accidents.

• The good weather dataset of 54 days showed three clusters: one of weekends and holidays with very few accidents; one of working days with a moderate number of accidents, showing a regular traffic pattern in the morning and evening rush hours; and one of working days with a large number of accidents, showing an irregular traffic pattern.

• The 99 bad weather days had two clusters: one of working days with a moderate number of accidents on a snowy day, and one with an irregular traffic pattern for days with extremely inclement weather conditions.
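To illustrate how days can be grouped by weather and accident counts, the following is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. It is a generic textbook version, not the WEKA configuration used in this study; the feature layout (snowfall, snow on ground, number of accidents), the seeded random initialization, and the sample days are assumptions for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Cluster points into k groups; returns (centroids, assignment)."""
    rng = random.Random(seed)
    # Assumed initialization: k distinct days drawn at random.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(iterations):
        # Assignment step: each day goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no day changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


# Hypothetical days: (snowfall_cm, snow_on_ground_cm, accidents).
days = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.5),
        (10.0, 20.0, 5.0), (12.0, 22.0, 6.0), (11.0, 21.0, 7.0)]
centroids, assignment = kmeans(days, k=2)
```

In practice the features would need to be scaled to comparable ranges before clustering, since Euclidean distance is otherwise dominated by the feature with the largest numeric spread.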


After building these clusters, a commute time prediction model is built to forecast the commute times for forthcoming days in the testing dataset. Specifically, the inputs for the prediction technique are the weather forecast for the day and the time instant at which commute time prediction is desired. A data mining procedure called classification is first used to map a day in the training dataset to one of the six clusters. This is done by matching the weather forecast for the day with the cluster having the closest weather pattern. Next, information contained in the cluster is used to predict the number of accidents for the time instant for which commute time prediction is desired. Finally, the estimated number of accidents is used to appropriately inflate the “no accident” commute time to obtain a predicted commute time. The accuracy of the classifier is measured by comparing the predicted commute times with the actual commute times recorded from Google Maps for that particular day. Results of the prediction model are as follows:

• 76.8% of the predicted commute times are within 5 minutes of the actual commute times.

• The overall relative percentage difference between predicted and actual commute times is 8.2%.

• The commute time prediction model performed well for the toughest forecasting case, bad weather days. The model showed an accuracy of 77% for this class, as the impact of weather and accidents is computed correctly from the historical traffic analysis statistics.

• The model performed badly only for weekends/holidays having at least one accident, where the time of the accident is predicted incorrectly. This class showed the lowest accuracy, 42%, and the maximum relative percentage difference, 27.4%, among all classes. To achieve high accuracy for this class, the model needs to be trained with more historical data on weekends/holidays with accidents.
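The three prediction steps summarized in this section (map the day to a cluster, estimate accidents for the time instant, inflate the “no accident” commute time) can be sketched as follows. This is an illustrative sketch only: nearest-profile matching stands in for the classifier, the additive per-accident delay is an assumed form of inflation, and all names and numbers are hypothetical.

```python
def nearest_cluster(forecast, clusters):
    """Pick the cluster whose weather profile is closest to the forecast."""
    return min(
        clusters,
        key=lambda c: (forecast["snowfall_cm"] - c["snowfall_cm"]) ** 2
        + (forecast["snow_on_ground_cm"] - c["snow_on_ground_cm"]) ** 2,
    )


def predict_commute_minutes(forecast, time_slot, clusters,
                            no_accident_minutes, delay_per_accident):
    """Predict commute time for one time slot of a forecast day.

    no_accident_minutes: dict mapping time slot -> baseline commute minutes
    delay_per_accident:  assumed extra minutes added per expected accident
    """
    cluster = nearest_cluster(forecast, clusters)
    # Expected accidents for this slot, taken from the cluster's history.
    expected_accidents = cluster["accidents_by_slot"].get(time_slot, 0.0)
    # Assumed linear inflation of the "no accident" commute time.
    return no_accident_minutes[time_slot] + delay_per_accident * expected_accidents


# Hypothetical cluster summaries and baselines, for illustration only.
clusters = [
    {"name": "good weather", "snowfall_cm": 0.0, "snow_on_ground_cm": 0.0,
     "accidents_by_slot": {"17:00": 1.0}},
    {"name": "snowy", "snowfall_cm": 5.0, "snow_on_ground_cm": 10.0,
     "accidents_by_slot": {"17:00": 3.0}},
]
baseline = {"17:00": 20.0, "03:00": 15.0}
snowy_forecast = {"snowfall_cm": 4.0, "snow_on_ground_cm": 8.0}
prediction = predict_commute_minutes(snowy_forecast, "17:00", clusters, baseline, 4.0)
```

With these hypothetical inputs the forecast day matches the snowy cluster, so the 20-minute baseline is inflated by three expected accidents at 4 minutes each.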

7.2 Future work

The scope of the historical vehicle traffic analysis could be expanded to cover multiple roads in order to study how traffic on a particular road affects traffic on the roads connected to it. The proposed methodology relies on the ICT commute time estimates from the Google Maps website. While the results seem to conform to real traffic trends in Calgary, a more rigorous evaluation is required to validate the accuracy of this metric. The system also focuses only on accidents and weather as factors that influence commute time. As outlined in Chapter 2, other factors such as sun position, special events, and lane closures can also affect commute times. Some of these factors, e.g., lane closures, can be identified by enhancing the text parsing capabilities of the Twitter scripts, while others require additional Web sources, e.g., hockey game schedules from nhl.com. However, there may be factors, such as traffic volumes, that are not currently reported by any Web service.
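As an illustration of the kind of text parsing enhancement mentioned above, the following sketch labels a tweet as reporting an accident, a lane closure, or both. The keyword patterns and the sample tweet are hypothetical; real traffic feeds would need patterns tuned to their own wording.

```python
import re

# Assumed keyword patterns; these are illustrative, not taken from the study.
ACCIDENT_PATTERN = re.compile(r"\b(?:accident|collision|crash)\b", re.IGNORECASE)
LANE_CLOSURE_PATTERN = re.compile(
    r"\blanes?\s+(?:is\s+|are\s+)?(?:closed|blocked)\b|\blane\s+closures?\b",
    re.IGNORECASE,
)


def classify_tweet(text):
    """Return the set of traffic event labels detected in a tweet."""
    labels = set()
    if ACCIDENT_PATTERN.search(text):
        labels.add("accident")
    if LANE_CLOSURE_PATTERN.search(text):
        labels.add("lane_closure")
    return labels


# Hypothetical tweet text, for illustration only.
sample_labels = classify_tweet(
    "ALERT: Two-vehicle collision on NB Deerfoot Tr, right lane blocked"
)
```

Keyword matching of this kind is simple to extend but will miss paraphrases, so its recall would have to be checked against a manually labelled sample of tweets.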

As mentioned previously, the identification of sub-segments is currently done manually. Automating this step would make it easier for the system to handle more routes. The system's monitoring capabilities are also limited by the query restrictions of the Google Maps API and the Twitter Search API. The historical analysis ignores fine-grained characterization of weather conditions based on factors such as temperature, visibility, and precipitation amounts. This feature can be added easily since the climate.weather.gc.ca website provides many of these metrics. The methodology also did not evaluate the effectiveness of other clustering techniques or of alternative configurations of k-means, e.g., sensitivity to various distance measures. The historical vehicle traffic analysis could also benefit from a larger dataset that captures how traffic trends on the routes vary with traffic conditions across the seasons of the year.

Route planning could be improved by predicting the shortest commute time between a source and a destination, computing the best route from historical vehicle traffic analyses of more than one road. Collecting historical traffic data for all possible roads in the city could involve much more than petabytes of data, and computing over such a volume would be difficult without storing it in an efficient database. Therefore, to support efficient analysis, the traffic data should be organized in a database designed for big data, such as MongoDB, Cassandra, HBase or Terrastore. WEKA cannot process such a large amount of data at once, so analysis tools such as Hadoop, GridGain and Storm are needed to perform computation on big data. Some traffic conditions cannot be predicted accurately ahead of time, such as the number of accidents occurring at the same time and the severity of those accidents. Accurate traffic accident information can only be obtained in real time. Thus, a commute time prediction model could be built by combining historical traffic statistics with real-time information about external factors such as traffic accidents and their severity. Finally, a more rigorous evaluation is needed to cross-validate the commute time prediction model by testing it under multiple training and testing datasets.
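The cross-validation suggested above can be sketched as a k-fold split over days, scored with the two measures reported in Section 7.1. The fold construction and the exact formula for the relative percentage difference (mean absolute difference relative to the actual commute time) are assumptions for illustration; the study may define them differently.

```python
def k_fold_splits(days, k):
    """Yield (train, test) lists so that every day is tested exactly once."""
    # Assumed interleaved folds: day i goes to fold i mod k.
    folds = [days[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, test


def within_tolerance_rate(predicted, actual, tolerance_minutes=5.0):
    """Fraction of predictions within the tolerance of the actual value."""
    hits = sum(abs(p - a) <= tolerance_minutes for p, a in zip(predicted, actual))
    return hits / len(actual)


def mean_relative_difference_pct(predicted, actual):
    """Mean of |predicted - actual| / actual, expressed as a percentage."""
    total = sum(abs(p - a) / a for p, a in zip(predicted, actual))
    return 100.0 * total / len(actual)


# Five folds over 202 day indices, matching the length of the collected data.
splits = list(k_fold_splits(list(range(202)), k=5))
```

Each of the five splits would be used to rebuild the clusters on the training days and score the predictions on the held-out days, after which the per-fold scores can be averaged.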


References

[1]. Benjamin Dachis, “Congestive Traffic Failure: The Case for High-Occupancy and

Express Toll Lanes in Canadian Cities”, The Urban Issues Series, C.D. Howe Institute,

August 2011.

[2]. Benjamin Dachis, “Cars, Congestion and Costs: A New Approach to Evaluating

Government Infrastructure Investment” C.D. Howe Institute, Commentary No. 385, July

2013.

[3]. Book: “The High Cost of Congestion in Canadian Cities”, Urban Transportation Task

Force, Council of Ministers Responsible for Transportation and Highway Safety, April

2012.

[4]. INRIX Transportation System, www.inrix.com

[5]. WAZE Geographical Navigation application, www.waze.com

[6]. Google Maps, www.maps.google.com

[7]. Wenxin Qiao, “Real Time Short Term Travel Time Prediction”, Doctoral Dissertation

2012, University of Maryland.

[8]. A. I. J. Tostes, F. D. L. P. Figueiredo, R. Assunção, J. Salles, and A. A. F. Loureiro, “From data to knowledge: City-wide traffic flows analysis and prediction using Bing Maps,” in SIGKDD International Workshop on Urban Computing, ser. UrbComp’13. Chicago, USA: ACM, 2013.

[9]. Google Maps API, https://developers.google.com/maps

[10]. Twitter Search API, https://dev.twitter.com/docs/using-search

[11]. Environment Canada, Historical Climate Data, http://climate.weather.gc.ca


[12]. Historical Weather Reports, http://www.theweathernetwork.com/weather/historical-weather/list

[13]. Calgary Blizzard day, http://www.cbc.ca/news/canada/calgary/calgary-weather-blizzard-conditions-snow-will-persist-1.2447554

[14]. Hong Dai, Zhaosheng Yang, Shengwei Guo, “Real Time Traffic Volume Estimation with

Fuzzy Linear Regression”, Proceedings of the 6th World Congress on Intelligent Control

and Automation, June 21-23, 2006, Dalian, China

[15]. Ta Yin Hu, Chee Chung Tong, Tsai Yun Liao and Wei Ming Ho, “Simulation-

Assignment-Based Travel Time Prediction Model for Traffic Corridors,” IEEE

Transactions on Intelligent Transportation Systems, Vol. 13, No. 3, 2012

[16]. I. Rish, “An Empirical Study of the Naïve Bayes Classifier”, T.J. Watson Research Center, Hawthorne, New York.

[17]. WEKA : Data Mining Tool, http://www.cs.waikato.ac.nz/ml/weka

[18]. WEKA User Manual, http://www.cs.waikato.ac.nz/ml/weka/documentation.html

[19]. Book: “Data Mining: Practical Machine Learning Tools and Techniques”, Ian H. Witten, Eibe Frank, Mark A. Hall, Third Edition, Morgan Kaufmann Publishers

[20]. Book: “The Data Mining and Knowledge Discovery Handbook”, Oded Maimon, Lior

Rokach, Library of Congress Cataloging-in-Publication Data, ©2005 Springer

[21]. Tapas Kanungo, Nathan S. Netanyahu, and Angela Y. Wu, “An Efficient k-Means

Clustering Algorithm: Analysis and Implementation”, IEEE Transactions on Pattern

Analysis and Machine Intelligence, July 2002.


[22]. D.A. Menasce, V. Almeida, R. Fonseca, and M.A. Mendes, “A Methodology for

Workload Characterization of E-commerce Sites,” Proc. 1999 ACM Conference on

Electronic Commerce, Denver, CO, November, 1999.

[23]. Book: “Classification – the Ubiquitous Challenge, Studies in Classification, Data

Analysis and Knowledge Organization”, C. Weihs, W. Gaul, University of Dortmund,

©2004 Springer

[24]. Book: “Combining Pattern Classifiers, Methods and Algorithms”, Ludmila I. Kuncheva,

Second Edition, Wiley

[25]. Bradley Efron, “Bayes’ theorem in the 21st century”, Science, 340(6137):1177–1178,

2013

[26]. Naïve Bayes Classification, www.mathworks.com/help/stats/naive-bayes-classification.html

[27]. Francesc Soriguera, “Deriving Traffic Flow Patterns from Historical Data”, Journal of

Transportation Engineering, ©ASCE, 2012, pp. 1430-1441

[28]. Jiansu Pu, Siyuan Liu, Ye Ding, Huamin Qu, Lionel Ni, “T-Watcher: A New Visual

Analytic System for Effective Traffic Surveillance”, IEEE 14th International Conference

on Mobile Data Management, 2013

[29]. Yosr Naija, Kaouther Blibech Sinaoui, “A Novel Measure for Validating Clustering

Results Applied to Road Traffic”, 3rd International Workshop on Knowledge Discovery

from Sensor Data (SensorKDD-2009) (pp. 105–113). Paris, France

[30]. Christiane Stutz and Thomas A. Runkler, “Classification and Prediction of Road Traffic

Using Application-Specific Fuzzy Clustering”, IEEE Transactions on Fuzzy Systems,

vol. 10, June 2002.


[31]. Chai Quek, Michel Pasquier, and Bernard Boon Seng Lim, “POPTRAFFIC: A Novel

Fuzzy Neural Approach to Road Traffic Analysis and Prediction”, IEEE Transactions on

Intelligent Transportation Systems, vol. 7, no. 2, June 2006

[32]. Vidhya Kumaresan, “Modeling of Short Term and Long Term Impacts of Freeway

Traffic Incidents Using Historical Data”, Doctoral Dissertation, University of Nevada,

Las Vegas, 2008

[33]. TOMTOM Study, Real Time and historical Traffic, TomTom delivers a unique proposition, www.tomtom.com/en_gb/licensing/products/traffic/historical-traffic/custom-travel-times/?WT.ac_id=ttlic_footer_ctt

[34]. Roopa T., Anantharaman Narayana Iyer, Shanta Rangaswamy, “CroTIS –

Crowdsourcing based Traffic Information System”, IEEE International Congress on Big

Data, 2013

[35]. Vipin Jain, Ashlesh Sharma, Lakshminarayan Subramanian, “Road Traffic Congestion in

the Developing World”, ACM DEV 2011

[36]. Lars Wischhof, Andre Ebner, Herman Rohling, Matthias Lott, Rudiger Halfmann, “SOTIS – A Self-Organizing Traffic Information System”, IEEE 2003

[37]. Peng Chen, Zhao Lu, Junzhong Gu, “Vehicle Travel Time Prediction Algorithm Based

on Historical Data and Shared Location”, Fifth International Joint Conference on INC,

IMS and IDC, 2009

[38]. Huang Yanguo and Xu Lunhui, “The Urban Road Traffic State Identification Method

Based on FCM Clustering”, 2011 International Conference on Transportation,

Mechanical, and Electrical Engineering (TMEE), Changchun, China.


[39]. Gui-yan Jiang, Jiang-feng Wang, Xiao-dong Zhang, and Long-hui Gang, “The Study on

the application of Fuzzy Clustering Analysis in the dynamic identification of road traffic

state”, 2003 IEEE 6th International Conference on ITS, Shanghai, China.

[40]. He-Sheng Zhang, Yi Zhang, Zhi-Heng Li, and Dong-Cheng Hu, “Spatial–Temporal

Traffic Data Analysis Based on Global Data Management Using MAS”, IEEE

Transactions on Intelligent Transportation Systems, vol. 5, no. 4, December 2004

[41]. Rutger Claes, Tom Holvoet, “Ad Hoc Link Traversal Time Prediction”, 14th International

IEEE Conference on Intelligent Transportation Systems, Washington, DC, USA, October

5-7, 2011

[42]. Wanli Min, Laura Winter, “Road Traffic Prediction with Spatio-Temporal Correlations”,

Technical Report RC24275 (W0706-018), IBM Research Watson, Yorktown Heights,

New York, June 2007

[43]. Nadhir Messai, Philippe Thomas, Dimitri Lefebvre, Abdellah El Moudni, “A Neural

Network Approach for Freeway Traffic Flow Prediction”, IEEE International Conference

on Control Applications, September 18-20, 2002, Glasgow, Scotland, U.K.

[44]. Rodric C. Fan, Xinnong Yang, and James D. Fay, “Using location data to determine

traffic information”, US Patent 6594576 B2, Jul 15, 2003.

[45]. Meeyoung Cha, Hamed Haddadi, Fabricio Benevenuto, Krishna P. Gummadi,

“Measuring User Influence in Twitter: The Million Follower Fallacy”, Fourth

International Conference on Weblogs and Social Media, 2010

[46]. Deerfoot Trail, http://www.transportation.alberta.ca/glengp.htm

[47]. Naïve Bayes Classification Process, http://www.nltk.org/book/ch06.html
